csv - Python: read a file by column without loading it into memory?


I have a CSV file with 400 columns and 100,000+ lines. I am trying to run a MapReduce job on an HDInsight Hadoop cluster. The logic of the MapReduce job computes Pearson's correlation matrix.

The map operation generates every possible pair of column values, keyed by the corresponding pair of column indices.

Example: given this input:

1,2,3
4,5,6

the mapper output is:

keys   pairs
0,1    1,2
0,2    1,3
1,2    2,3
0,1    4,5
0,2    4,6
1,2    5,6
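A minimal sketch of such a mapper (the function name and the use of the `csv` module are my assumptions; the question does not show the actual script). It streams one row at a time, so memory use does not grow with the number of lines:

```python
import csv
from itertools import combinations

def map_pairs(path):
    """Stream the file row by row; for each row, emit
    ((col_i, col_j), (value_i, value_j)) for every column pair.
    Only one row is held in memory at a time."""
    with open(path, newline="") as f:
        for row in csv.reader(f):
            for (i, a), (j, b) in combinations(enumerate(row), 2):
                yield (i, j), (a, b)
```

For 400 columns this emits 400*399/2 = 79,800 key/value pairs per input line, which is what inflates the sort phase.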

As you can conclude, the size of the mapper output depends more on the number of columns than on the number of lines, and that output size drives the complexity of the sort phase. I think this is why the MapReduce job fails.

In previous mapper scripts I output complete lists instead:

keys   pairs
0,1    1,2,4,5
0,2    1,3,4,6
1,2    2,3,5,6

But this needs a complete read of the file in order to zip each pair of columns. In that case I run out of memory if the file is sufficiently large.
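A sketch of that earlier approach, assuming the transpose is done with `zip(*rows)` (the names are illustrative, not from the original script). The mapper output is compact, but every row must be materialized before the transpose, which is where memory runs out:

```python
import csv
from itertools import combinations

def map_column_lists(path):
    """Read the whole file, transpose it into columns, then emit one
    complete list of value pairs per column-index pair.  The transpose
    requires the entire file to fit in memory."""
    with open(path, newline="") as f:
        columns = list(zip(*csv.reader(f)))   # materializes every row first
    for (i, ci), (j, cj) in combinations(enumerate(columns), 2):
        yield (i, j), list(zip(ci, cj))
```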

I thought that reading columns instead of lines, and keeping "yield", would optimize both the mapper's memory usage and the sort.

Is there a way to read a file column by column (given a separator) without loading it into memory?
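For context: a CSV file is laid out row by row on disk, so there is no way to jump straight to a column; any column read still has to scan the rows. A hedged sketch of a constant-memory column generator (one full pass over the file per column; `iter_column` is an illustrative name, not an existing API):

```python
import csv

def iter_column(path, index):
    """Yield the values of one column by streaming the rows.
    Memory use is independent of the number of lines, at the cost
    of one complete pass over the file per column read this way."""
    with open(path, newline="") as f:
        for row in csv.reader(f):
            yield row[index]
```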

