Spark job performance
I am trying to configure a Spark job for the best performance possible. I have a Mesos cluster with 4 slave nodes, each with 4 CPUs and 40 GB of memory free. The job currently takes 8m:13s.
Basically, the job calculates the frequency of the values under each header of a CSV file. There are 4 files, the biggest of which is 340 MB. The best performance so far has been with 4 executors, 4 cores each, and 40 GB of memory.
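For context, that configuration could be expressed roughly like this when building the SparkContext (a sketch only; the application name and master URL are assumptions, since the actual setup is not shown here):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("HeaderFrequency")             // placeholder name
  .setMaster("mesos://<mesos-master>:5050")  // Mesos cluster
  .set("spark.executor.memory", "40g")       // 40 GB per executor
  .set("spark.executor.cores", "4")          // 4 cores per executor
  .set("spark.cores.max", "16")              // 4 executors x 4 cores in total
val sc = new SparkContext(conf)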
The main action is a collect(), one per header:
import java.io.{BufferedWriter, File, FileWriter}

headers.keys.foreach(h => {
  val outputFile: String = outputFolder + "/" + h + ".csv"
  val file = new File(outputFile)
  val bw = new BufferedWriter(new FileWriter(file))
  bw.write(h + ",absolute frequency")
  // count occurrences of each distinct value in column h, then bring the result to the driver
  val frequencyTable = fileData.map(x => (x(h), 1L)).reduceByKey(_ + _).collect()
  frequencyTable.foreach(a => bw.write("\n" + a._1 + "," + a._2))
  bw.close()
})
where fileData is cached (this is done per file).
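For reference, fileData could be built and cached along these lines (a sketch under the assumption that each row is parsed into a header-name -> value map, so that x(h) in the snippet above works; the real loading code is not shown):

val lines = sc.textFile(inputFile)
val headerLine = lines.first()
// header name -> column index; headers.keys drives the loop above
val headers = headerLine.split(",").zipWithIndex.toMap
val fileData = lines
  .filter(_ != headerLine)                                        // drop the header row
  .map(_.split(",", -1))
  .map(cols => headers.map { case (name, i) => (name, cols(i)) }) // name -> value
  .cache()                                                        // reused once per header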
Is this time acceptable? Or can I improve it?