Spark job performance


I am trying to configure a Spark job for the best performance possible. I have a Mesos cluster with 4 slave nodes, each with 4 free CPUs and 40 GB of free memory (time elapsed: 8m13s).

Basically, the job calculates the frequency of the values under each header of a CSV file. There are 4 files, the biggest one being 340 MB. The best performance so far has been with 4 executors, 4 cores each, and 40 GB of memory.
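For reference, a minimal sketch of how such a configuration might be expressed in code; the master URL, app name and exact values below are placeholders and assumptions, not taken from the post:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("HeaderFrequencies")                // placeholder app name
      .setMaster("mesos://zk://host:2181/mesos")      // placeholder Mesos master URL
      .set("spark.executor.memory", "40g")            // 40 GB per executor
      .set("spark.executor.cores", "4")               // 4 cores per executor
      .set("spark.cores.max", "16")                   // 4 executors x 4 cores in total
    val sc = new SparkContext(conf)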

The main action is a collect(), one per header:

    import java.io.{BufferedWriter, File, FileWriter}

    // one collect() per header: write that header's frequency table to its own CSV
    headers.keys.foreach(h => {
      val outputFile: String = outputFolder + "/" + h + ".csv"
      val file = new File(outputFile)
      val bw = new BufferedWriter(new FileWriter(file))
      bw.write(h + ",absolute frequency")
      val frequencyTable = fileData.map(x => (x(h), 1L)).reduceByKey(_ + _).collect()
      frequencyTable.foreach(a => bw.write("\n" + a._1 + "," + a._2))
      bw.close()
    })

where fileData is cached (this is done per file).
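The post does not show how fileData is built; a minimal sketch of one way it could look, assuming each row is parsed into a Map from header name to cell value so that x(h) in the snippet above looks up the value for header h (sc, inputPath and all names here are assumptions):

    val lines = sc.textFile(inputPath)
    val headerLine = lines.first()
    val headerNames = headerLine.split(",")
    val fileData = lines
      .filter(_ != headerLine)                      // drop the header row
      .map(_.split(",", -1))                        // keep empty trailing cells
      .map(cells => headerNames.zip(cells).toMap)   // Map(header -> value)
      .cache()                                      // reused once per header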

Is this time acceptable, or can I improve it?

