performance - Speeding up count of pairwise observations in R
I have a dataset in which a subset of the measurements for each entry is randomly missing:

    dat <- matrix(runif(100), nrow=10)
    rownames(dat) <- letters[1:10]
    colnames(dat) <- paste("time", 1:10)
    dat[sample(100, 25)] <- NA

I am interested in calculating correlations between each pair of rows in this dataset (i.e., a-a, a-b, a-c, a-d, ...). However, I would like to exclude correlations where there are fewer than 5 pairwise non-NA observations by setting their value to NA in the resulting correlation matrix.
Currently I am doing this as follows:

    cor <- cor(t(dat), use = 'pairwise.complete.obs')
    names <- rownames(dat)
    filter <- sapply(names, function(x1) sapply(names, function(x2)
        sum(!is.na(dat[x1,]) & !is.na(dat[x2,])) < 5))
    cor[filter] <- NA

However, this operation is very slow, as the actual dataset contains >1,000 entries.
Is there a way to filter cells based on the number of non-NA pairwise observations in a vectorized manner, instead of within nested loops?
You can count the number of non-NA pairwise observations using a matrix approach.
Let's use your data generation code. I made the data larger and added more NAs.

    nr = 1000
    nc = 900
    dat = matrix(runif(nr*nc), nrow=nr)
    rownames(dat) = paste(1:nr)
    colnames(dat) = paste("time", 1:nc)
    dat[sample(nr*nc, nr*nc*0.9)] = NA

With this data, your filter code takes about 85 seconds:

    tic = proc.time()
    names = rownames(dat)
    filter = sapply(names, function(x1) sapply(names, function(x2)
        sum(!is.na(dat[x1,]) & !is.na(dat[x2,])) < 5))
    toc = proc.time()
    show(toc-tic)
    # 85.50 seconds

My version creates a matrix with the value 1 wherever the original data is non-NA and 0 elsewhere. Multiplying that matrix by its own transpose then gives the number of pairwise non-NA observations for every pair of rows. It ran in a fraction of a second.

    tic = proc.time()
    namat = matrix(0, nrow = nr, ncol = nc)
    namat[ !is.na(dat) ] = 1
    filter2 = (tcrossprod(namat) < 5)
    toc = proc.time()
    show(toc-tic)
    # 0.09 seconds

A simple check shows that the results are the same:

    all(filter == filter2)
    # TRUE
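
To tie this back to the question, here is a minimal sketch that combines the pairwise count with the correlation step. The helper name `cor_min_obs` and its `min_obs` argument are my own (hypothetical, not from the posts above), and the seed is arbitrary. It relies on the fact that entry [i, j] of `tcrossprod(present)` sums `present[i, k] * present[j, k]` over the columns k, which is the number of columns where rows i and j are both non-NA.

```r
# Hypothetical helper (name and arguments are assumptions, not from the posts):
# row-by-row correlation matrix in which cells backed by fewer than `min_obs`
# pairwise non-NA observations are set to NA.
cor_min_obs <- function(x, min_obs = 5) {
  # 1 where a value is present, 0 where it is NA (dimensions are preserved)
  present <- (!is.na(x)) * 1
  # Entry [i, j] counts the columns where rows i and j are both non-NA
  n_pairs <- tcrossprod(present)
  # cor() correlates columns, so transpose to correlate the rows
  res <- cor(t(x), use = "pairwise.complete.obs")
  res[n_pairs < min_obs] <- NA
  res
}

# Small example using the data generation code from the question
set.seed(1)  # arbitrary seed, only so the example is reproducible
dat <- matrix(runif(100), nrow = 10)
rownames(dat) <- letters[1:10]
colnames(dat) <- paste("time", 1:10)
dat[sample(100, 25)] <- NA

res <- cor_min_obs(dat)
```

The result should match the nested `sapply` version on the same data, without the row-by-row loop.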