csv - Python: read a file by column without loading it into memory?
I have a CSV file with 400 columns and ~100,000 lines. I'm trying to run a MapReduce job on an HDInsight Hadoop cluster; the job computes Pearson's correlation matrix.
The map operation generates, for every possible pair of columns, the pair of values from each row, keyed by the pair of column indices.
Example: given the input

1,2,3
4,5,6
the mapper outputs:

key    pair
0,1    1,2
0,2    1,3
1,2    2,3
0,1    4,5
0,2    4,6
1,2    5,6
As you can see, the size of the mapper output depends far more on the number of columns than on the number of rows, and so does the cost of the sort phase. I think that is why the MapReduce job fails.
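The per-row pair-generating mapper described above can be sketched as a generator, one line in memory at a time (function and parameter names are my own, assuming a Hadoop Streaming mapper reading stdin):

```python
import sys
from itertools import combinations

def map_pairs(lines, sep=","):
    """Yield ((i, j), (x_i, x_j)) for every pair of column indices,
    processing one input line at a time (constant memory)."""
    for line in lines:
        values = line.rstrip("\n").split(sep)
        for i, j in combinations(range(len(values)), 2):
            yield (i, j), (values[i], values[j])

if __name__ == "__main__":
    # Hadoop Streaming convention: tab-separated key / value on stdout.
    for (i, j), (a, b) in map_pairs(sys.stdin):
        print("%d,%d\t%s,%s" % (i, j, a, b))
```

For the two example rows this emits exactly the six records in the table above; the output size grows as O(rows × columns²), which is where the sort-phase pressure comes from.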
In my previous mapper scripts I used to output complete lists instead:

key    values
0,1    1,2,4,5
0,2    1,3,4,6
1,2    2,3,5,6
But that needs a complete read of the file in order to zip each pair of columns, and in that case I run out of memory if the file is sufficiently large.
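For comparison, the complete-list variant looks roughly like this (a sketch with names of my own choosing); the transpose step is what forces the whole file into memory:

```python
from itertools import combinations

def map_column_lists(lines, sep=","):
    """Emit one record per column pair, with all that pair's values.
    Requires transposing the entire input, so the whole file is in memory."""
    rows = [line.rstrip("\n").split(sep) for line in lines]
    columns = list(zip(*rows))  # the entire file is now held in memory
    for i, j in combinations(range(len(columns)), 2):
        # flatten the row-wise (x_i, x_j) pairs, e.g. 1,2,4,5 for key (0, 1)
        yield (i, j), [v for pair in zip(columns[i], columns[j]) for v in pair]
```

This produces far fewer, larger records (easier on the sort), but its memory footprint is the full data set.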
I thought that reading columns instead of lines, while still using `yield`, would optimize both the mapper's memory usage and the sort.
Is there a way to read a file column by column (given a separator) without loading it into memory?
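One way to get column-at-a-time access without holding the file in memory is to stream each column with its own generator and re-open the file per column; a minimal sketch, assuming a plain CSV with no quoted separators (function names are hypothetical):

```python
def read_column(path, col, sep=","):
    """Lazily yield the values of one column; only one line is
    in memory at any time."""
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n").split(sep)[col]

def zip_columns(path, i, j, sep=","):
    """Stream (x_i, x_j) pairs by opening the file once per column,
    so memory stays constant regardless of file size."""
    yield from zip(read_column(path, i, sep), read_column(path, j, sep))
```

The trade-off is I/O: zipping every pair of 400 columns this way re-reads the file O(columns²) times, so it trades memory for repeated sequential scans rather than eliminating work.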