R parLapply not parallel


I'm developing an R package that uses parallel computing to solve tasks, by means of the "parallel" package.

I'm getting awkward behavior when using clusters defined inside functions of the package: the parLapply function assigns a job to a worker and waits for it to finish before assigning a job to the next worker. Or at least that is what appears to be happening, based on the log file "cluster.log" and the list of running processes in a Unix shell.

Below is a mockup version of the original function declared inside the package:

.parSolver <- function( varMatrix, var1 ) {

    no_cores <- detectCores()

    # rows in varMatrix
    rows <- 1:nrow(varMatrix[,])

    # split rows in n parts
    n <- no_cores
    parts <- split(rows, cut(rows, n))

    # initiate cluster
    cl <- makePSOCKcluster(no_cores, methods = FALSE, outfile = "/home/cluster.log")
    clusterEvalQ(cl, library(raster))
    clusterExport(cl, "varMatrix", envir = environment())
    clusterExport(cl, "var1", envir = environment())

    rParts <- parLapply(cl = cl, X = 1:n, fun = function(x){
        part <- rasterize(varMatrix[parts[[x]],], raster(var1), .....)
        print(x)
        return(part)
    })

    do.call(merge, rParts)
}

Notes:

  • I'm using makePSOCKcluster because I want the code to run on Windows and Unix systems alike, although this particular problem is manifesting on a Unix system.
  • The functions rasterize and raster are defined in library(raster), which is loaded on the cluster.

The weird part to me is that if I execute the exact same code of the function parSolver in the global environment, everything works smoothly: all workers take one job at the same time and the task completes in no time. But if I do something like:

library(myPackage)

varMatrix <- (...)
var1 <- (...)
result <- parSolver(varMatrix, var1)

then the described problem appears.

It appears to be a load balancing problem, but that would not explain why it works OK in one situation and not in the other.

Am I missing something here? Thanks in advance.

I don't think parLapply is running sequentially. More likely, it's running inefficiently, making it appear to run sequentially.

I have a few suggestions to improve it:

  • Don't define the worker function inside parSolver
  • Don't export all of varMatrix to each worker
  • Create the cluster outside of parSolver

The first point is important, because as the example now stands, all of the variables defined in parSolver are serialized along with the anonymous worker function and sent to the workers by parLapply. By defining the worker function outside of any other function, the serialization won't capture any unwanted variables.
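To see why this matters, here is a minimal sketch (the names makeClosure and bigMatrix are made up for illustration) showing that a closure created inside a function drags its enclosing environment along when serialized, while a function defined at the top level does not, since the global environment is serialized by reference:

# A closure created inside a function captures the enclosing
# environment, so serializing it also serializes every object
# defined there -- even ones the closure never uses.
makeClosure <- function() {
    bigMatrix <- matrix(0, 1000, 1000)  # ~8 MB, never used by f
    f <- function(x) x + 1
    f
}

f1 <- makeClosure()
f2 <- function(x) x + 1  # defined at the top level

length(serialize(f1, NULL))  # roughly 8 MB: bigMatrix came along
length(serialize(f2, NULL))  # a few hundred bytes

This is exactly what happens to the anonymous function passed to parLapply inside .parSolver: its environment is the evaluation environment of .parSolver, which contains varMatrix.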

The second point avoids unnecessary socket I/O and uses less memory, making the code more scalable.
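As a rough illustration of the difference (the matrix and worker count here are assumptions, not measurements from the original code), you can compare the bytes that would be shipped when exporting the full matrix to every worker versus sending each worker only its own row-chunk:

library(parallel)

m <- matrix(rnorm(1e6), 1000, 1000)  # ~8 MB example matrix
n <- 4                               # pretend we have 4 workers

# clusterExport sends the full matrix to every worker
fullBytes <- length(serialize(m, NULL)) * n

# clusterApply sends each worker only its own row-chunk
parts <- splitIndices(nrow(m), n)
chunkBytes <- sum(sapply(parts, function(i)
    length(serialize(m[i, , drop = FALSE], NULL))))

fullBytes / chunkBytes  # roughly n times more data with clusterExport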

Here's a fake, self-contained example similar to yours that demonstrates these suggestions:

# Define the worker function outside of any other function to avoid
# serialization problems (such as unexpected variable capture)
workerfn <- function(mat, var1) {
    library(raster)
    mat * var1
}

parSolver <- function(cl, varMatrix, var1) {
    parts <- splitIndices(nrow(varMatrix), length(cl))
    varMatrixParts <- lapply(parts, function(i) varMatrix[i,,drop=FALSE])
    rParts <- clusterApply(cl, varMatrixParts, workerfn, var1)
    do.call(rbind, rParts)
}

library(parallel)
cl <- makePSOCKcluster(3)
r <- parSolver(cl, matrix(1:20, 10, 2), 2)
print(r)

Note that this takes advantage of the clusterApply function to iterate over a list of row-chunks of varMatrix, so that the entire matrix doesn't need to be sent to everyone. It also avoids calls to clusterEvalQ and clusterExport, simplifying the code as well as making it a bit more efficient.
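The third suggestion, creating the cluster outside of parSolver, lets you pay the cluster startup cost once and reuse the same workers across calls. A minimal sketch of that usage pattern, continuing the example above (the matrices are placeholders):

library(parallel)

# Create the cluster once, reuse it for every call, then shut it down
cl <- makePSOCKcluster(detectCores())

r1 <- parSolver(cl, matrix(1:20, 10, 2), 2)
r2 <- parSolver(cl, matrix(1:40, 20, 2), 3)

stopCluster(cl)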

