python - Read in large table file but keep only small subset of rows using pandas


I have a large table file (around 2 GB) that holds a distance matrix indexed by its first column. Its rows look like

a 0 1.2 1.3 ...
b 1.2 0 3.5 ...
c 1.5 0 4.5 ...

However, I only need to keep a small subset of the rows. If I'm given a list of the indices I need to keep, what is the best and fastest way to read the file into a pandas DataFrame? Right now, I am using

distance_matrix = pd.read_table("hla_distmat.txt", header=None, index_col=0)[columns_to_keep]

to read in the file, but I am running into memory issues with the read_table command. Is there a faster and more memory-efficient way to do this? Thanks.

You need the usecols parameter to filter columns and the skiprows parameter to filter rows. For skiprows you have to specify which rows are to be removed, as a list, a range, or an np.array:

# usecols takes a flat list and must include column 0, because index_col=0
distance_matrix = pd.read_table("hla_distmat.txt",
                                header=None,
                                index_col=0,
                                usecols=columns_to_keep,
                                skiprows=range(10, 100))

Sample (with real data, omit the sep parameter, because sep='\t' is the default in read_table):

import pandas as pd
import numpy as np
from io import StringIO

temp = """0;119.02;0.0
1;121.20;0.0
3;112.49;0.0
4;113.94;0.0
5;114.67;0.0
6;111.77;0.0
7;117.57;0.0
6648;0.00;420.0
6649;0.00;420.0
6650;0.00;420.0"""
# after testing, replace StringIO(temp) with 'filename.csv'

columns_to_keep = [0, 1]

# skip lines 5..99 (0-based), so only the first five lines are read
df = pd.read_table(StringIO(temp),
                   sep=";",
                   header=None,
                   index_col=0,
                   usecols=columns_to_keep,
                   skiprows=range(5, 100))
print(df)
#         1
# 0
# 0  119.02
# 1  121.20
# 3  112.49
# 4  113.94
# 5  114.67

A more general solution uses numpy.setdiff1d:

# if index_col=0, the first column (0) is needed in columns_to_keep
columns_to_keep = [0, 1]
# keep the second, third, and fifth rows (0-based positions)
rows_to_keep = [1, 2, 4]
# estimated row count, or use the solution at http://stackoverflow.com/q/19001402/2901002
max_rows = 100

# skip every row position that is not in rows_to_keep
df = pd.read_table(StringIO(temp),
                   sep=";",
                   header=None,
                   index_col=0,
                   usecols=columns_to_keep,
                   skiprows=np.setdiff1d(np.arange(max_rows), np.array(rows_to_keep)))
print(df)
#         1
# 0
# 1  121.20
# 3  112.49
# 5  114.67
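The examples above select rows by position, while the question starts from a list of index labels. One possible way to bridge the two is sketched below. It is not part of the original answer: the name labels_to_keep and the tiny tab-separated sample are made up for illustration. The sketch reads only the first column to find the positions of the wanted labels, then passes a callable to skiprows (a documented option of the pandas readers), which also avoids having to estimate max_rows. It assumes the file has no header, blank, or comment lines, so that a label's position in the first column equals its line number.

```python
from io import StringIO

import pandas as pd

# Small stand-in for the real file; with real data, pass the file path instead.
sample = "a\t0\t1.2\t1.3\nb\t1.2\t0\t3.5\nc\t1.5\t0\t4.5\n"

# Hypothetical list of index labels to keep (an assumption for this sketch).
labels_to_keep = ["a", "c"]

# Pass 1: read only the first column, which is far smaller than the full matrix.
labels = pd.read_table(StringIO(sample), header=None, usecols=[0])[0]

# 0-based line numbers of the rows whose label is wanted.
rows_to_keep = set(labels.index[labels.isin(labels_to_keep)])

# Pass 2: the callable returns True for every line number that should be skipped.
subset = pd.read_table(
    StringIO(sample),
    header=None,
    index_col=0,
    skiprows=lambda i: i not in rows_to_keep,
)
print(subset)
```

The file is scanned twice, but the full set of columns is only parsed for the rows that are kept, which is where most of the memory would otherwise go.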
