csv - Python: read a file by column without loading it into memory?
I have a CSV file with 400 columns and ~100,000 lines. I'm trying to run a MapReduce job on an HDInsight Hadoop cluster; the job computes Pearson's correlation matrix.
The map operation generates, for every possible pair of columns, the pair of values from each row, keyed by the pair of column indices.
Example: given the input

1,2,3
4,5,6
the mapper outputs:

key    pair
0,1    1,2
0,2    1,3
1,2    2,3
0,1    4,5
0,2    4,6
1,2    5,6
As you can see, the size of the mapper output depends far more on the number of columns than on the number of rows, and so does the cost of the sort phase. I think that is why the MapReduce job fails.
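The per-row pair-generating mapper described above can be sketched as a generator, one line in memory at a time (function and parameter names are my own, assuming a Hadoop Streaming mapper reading stdin):

```python
import sys
from itertools import combinations

def map_pairs(lines, sep=","):
    """Yield ((i, j), (x_i, x_j)) for every pair of column indices,
    processing one input line at a time (constant memory)."""
    for line in lines:
        values = line.rstrip("\n").split(sep)
        for i, j in combinations(range(len(values)), 2):
            yield (i, j), (values[i], values[j])

if __name__ == "__main__":
    # Hadoop Streaming convention: tab-separated key / value on stdout.
    for (i, j), (a, b) in map_pairs(sys.stdin):
        print("%d,%d\t%s,%s" % (i, j, a, b))
```

For the two example rows this emits exactly the six records in the table above; the output size grows as O(rows × columns²), which is where the sort-phase pressure comes from.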
In my previous mapper scripts I used to output complete lists instead:

key    values
0,1    1,2,4,5
0,2    1,3,4,6
1,2    2,3,5,6
But that needs a complete read of the file in order to zip each pair of columns, and in that case I run out of memory if the file is sufficiently large.
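For comparison, the complete-list variant looks roughly like this (a sketch with names of my own choosing); the transpose step is what forces the whole file into memory:

```python
from itertools import combinations

def map_column_lists(lines, sep=","):
    """Emit one record per column pair, with all that pair's values.
    Requires transposing the entire input, so the whole file is in memory."""
    rows = [line.rstrip("\n").split(sep) for line in lines]
    columns = list(zip(*rows))  # the entire file is now held in memory
    for i, j in combinations(range(len(columns)), 2):
        # flatten the row-wise (x_i, x_j) pairs, e.g. 1,2,4,5 for key (0, 1)
        yield (i, j), [v for pair in zip(columns[i], columns[j]) for v in pair]
```

This produces far fewer, larger records (easier on the sort), but its memory footprint is the full data set.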
I thought that reading columns instead of lines, while still using `yield`, would optimize both the mapper's memory usage and the sort.
Is there a way to read a file column by column (given a separator) without loading it into memory?
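One way to get column-at-a-time access without holding the file in memory is to stream each column with its own generator and re-open the file per column; a minimal sketch, assuming a plain CSV with no quoted separators (function names are hypothetical):

```python
def read_column(path, col, sep=","):
    """Lazily yield the values of one column; only one line is
    in memory at any time."""
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n").split(sep)[col]

def zip_columns(path, i, j, sep=","):
    """Stream (x_i, x_j) pairs by opening the file once per column,
    so memory stays constant regardless of file size."""
    yield from zip(read_column(path, i, sep), read_column(path, j, sep))
```

The trade-off is I/O: zipping every pair of 400 columns this way re-reads the file O(columns²) times, so it trades memory for repeated sequential scans rather than eliminating work.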