R parLapply not parallel


I'm developing an R package that uses parallel computing to solve tasks, by means of the "parallel" package.

I'm getting awkward behavior when using clusters defined inside functions of the package: the parLapply function assigns a job to a worker and waits for it to finish before assigning a job to the next worker. Or at least that is what appears to be happening, based on the log file "cluster.log" and the list of running processes in a Unix shell.

Below is a mockup version of the original function declared inside the package:

.parSolver <- function( varMatrix, var1 ) {

    no_cores <- detectCores()

    # rows in varMatrix
    rows <- 1:nrow(varMatrix[,])

    # split rows in n parts
    n <- no_cores
    parts <- split(rows, cut(rows, n))

    # initiate cluster
    cl <- makePSOCKcluster(no_cores, methods = FALSE, outfile = "/home/cluster.log")
    clusterEvalQ(cl, library(raster))
    clusterExport(cl, "varMatrix", envir = environment())
    clusterExport(cl, "var1", envir = environment())

    rParts <- parLapply(cl = cl, X = 1:n, fun = function(x){
        part <- rasterize(varMatrix[parts[[x]],], raster(var1), .....)
        print(x)
        return(part)
    })

    do.call(merge, rParts)
}

Notes:

  • I'm using makePSOCKcluster because I want the code to run on Windows and Unix systems alike, although this particular problem is manifesting on a Unix system.
  • The functions rasterize and raster are defined in library(raster), which is loaded on the cluster.

The weird part to me is that if I execute the exact same code of the function parSolver in the global environment, everything works smoothly: all workers take one job at the same time and the task completes in no time. But if I do something like:

library(myPackage)

varMatrix <- (...)
var1 <- (...)
result <- parSolver(varMatrix, var1)

then the described problem appears.

It appears to be a load balancing problem, but that would not explain why it works OK in one situation and not in the other.

Am I missing something here? Thanks in advance.

I don't think parLapply is running sequentially. More likely, it's running inefficiently, making it appear to run sequentially.

I have a few suggestions to improve it:

  • Don't define the worker function inside parSolver
  • Don't export all of varMatrix to each worker
  • Create the cluster outside of parSolver

The first point is important, because as the example now stands, all of the variables defined in parSolver are serialized along with the anonymous worker function and sent to the workers by parLapply. By defining the worker function outside of any other function, the serialization won't capture any unwanted variables.
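To see why this matters, here is a minimal sketch (the names makeClosure and bigMatrix are made up for illustration) showing that a closure created inside a function drags its enclosing environment along when serialized, while a function defined at the top level does not, since the global environment is serialized by reference:

# A closure created inside a function captures the enclosing
# environment, so serializing it also serializes every object
# defined there -- even ones the closure never uses.
makeClosure <- function() {
    bigMatrix <- matrix(0, 1000, 1000)  # ~8 MB, never used by f
    f <- function(x) x + 1
    f
}

f1 <- makeClosure()
f2 <- function(x) x + 1  # defined at the top level

length(serialize(f1, NULL))  # roughly 8 MB: bigMatrix came along
length(serialize(f2, NULL))  # a few hundred bytes

This is exactly what happens to the anonymous function passed to parLapply inside .parSolver: its environment is the evaluation environment of .parSolver, which contains varMatrix.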

The second point avoids unnecessary socket I/O and uses less memory, making the code more scalable.
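As a rough illustration of the difference (the matrix and worker count here are assumptions, not measurements from the original code), you can compare the bytes that would be shipped when exporting the full matrix to every worker versus sending each worker only its own row-chunk:

library(parallel)

m <- matrix(rnorm(1e6), 1000, 1000)  # ~8 MB example matrix
n <- 4                               # pretend we have 4 workers

# clusterExport sends the full matrix to every worker
fullBytes <- length(serialize(m, NULL)) * n

# clusterApply sends each worker only its own row-chunk
parts <- splitIndices(nrow(m), n)
chunkBytes <- sum(sapply(parts, function(i)
    length(serialize(m[i, , drop = FALSE], NULL))))

fullBytes / chunkBytes  # roughly n times more data with clusterExport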

Here's a fake, self-contained example similar to yours that demonstrates these suggestions:

# Define the worker function outside of any other function to avoid
# serialization problems (such as unexpected variable capture)
workerfn <- function(mat, var1) {
    library(raster)
    mat * var1
}

parSolver <- function(cl, varMatrix, var1) {
    parts <- splitIndices(nrow(varMatrix), length(cl))
    varMatrixParts <- lapply(parts, function(i) varMatrix[i,,drop=FALSE])
    rParts <- clusterApply(cl, varMatrixParts, workerfn, var1)
    do.call(rbind, rParts)
}

library(parallel)
cl <- makePSOCKcluster(3)
r <- parSolver(cl, matrix(1:20, 10, 2), 2)
print(r)

Note that this takes advantage of the clusterApply function to iterate over a list of row-chunks of varMatrix, so that the entire matrix doesn't need to be sent to everyone. It also avoids calls to clusterEvalQ and clusterExport, simplifying the code as well as making it a bit more efficient.
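The third suggestion, creating the cluster outside of parSolver, lets you pay the cluster startup cost once and reuse the same workers across calls. A minimal sketch of that usage pattern, continuing the example above (the matrices are placeholders):

library(parallel)

# Create the cluster once, reuse it for every call, then shut it down
cl <- makePSOCKcluster(detectCores())

r1 <- parSolver(cl, matrix(1:20, 10, 2), 2)
r2 <- parSolver(cl, matrix(1:40, 20, 2), 3)

stopCluster(cl)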

