python - Read in large table file but keep only small subset of rows using pandas
I have a large table file (around 2 GB) that holds a distance matrix indexed by its first column. The rows look like
a 0 1.2 1.3 ...
b 1.2 0 3.5 ...
c 1.5 0 4.5 ...

However, I only need to keep a small subset of the rows. Given a list of the indices I need to keep, what is the best and fastest way to read the file into a pandas DataFrame? Right now, I am using
distance_matrix = pd.read_table("hla_distmat.txt", header=None, index_col=0)[columns_to_keep]

to read in the file, but I am running into memory issues with the read_table command. Is there a faster and more memory-efficient way to do this? Thanks.
You need the usecols parameter if you need to filter columns, and skiprows to filter rows. For skiprows you have to specify the rows to be removed, as a list, a range or an np.array:
distance_matrix = pd.read_table("hla_distmat.txt", header=None, index_col=0,
                                usecols=columns_to_keep, skiprows=range(10, 100))

Sample (with real data, omit the sep parameter, because sep='\t' is the default in read_table):
import pandas as pd
import numpy as np
# io.StringIO replaces pandas.compat.StringIO, which newer pandas versions no longer provide
from io import StringIO

temp = u"""0;119.02;0.0
1;121.20;0.0
3;112.49;0.0
4;113.94;0.0
5;114.67;0.0
6;111.77;0.0
7;117.57;0.0
6648;0.00;420.0
6649;0.00;420.0
6650;0.00;420.0"""

# after testing, replace 'StringIO(temp)' with 'filename.csv'
columns_to_keep = [0, 1]
# skip lines 5 to 99, so only the first five lines are read
df = pd.read_table(StringIO(temp), sep=";", header=None, index_col=0,
                   usecols=columns_to_keep, skiprows=range(5, 100))
print (df)

        1
0
0  119.02
1  121.20
3  112.49
4  113.94
5  114.67

A more general solution uses numpy.setdiff1d:
# if index_col=0, the first column (0) must be included in usecols
columns_to_keep = [0, 1]
# keep the second, third and fifth rows
rows_to_keep = [1, 2, 4]
# estimated row count, or use the solution at http://stackoverflow.com/q/19001402/2901002
# (it must not be smaller than the real row count, otherwise later rows are kept too)
max_rows = 100

# skip every line number in 0..max_rows-1 that is not in rows_to_keep
df = pd.read_table(StringIO(temp), sep=";", header=None, index_col=0,
                   usecols=columns_to_keep,
                   skiprows=np.setdiff1d(np.arange(max_rows), np.array(rows_to_keep)))
print (df)

        1
0
1  121.20
3  112.49
5  114.67
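
If estimating max_rows is inconvenient, or the rows to keep are known by index label (such as a, b, c in the question) rather than by line position, the following sketch may help. It is not part of the answer above. It relies on two read_table options: skiprows also accepts a callable, and chunksize returns the file piece by piece. The helper names read_rows_by_position and read_rows_by_label, and the chunk size, are hypothetical and chosen only for illustration.

```python
from io import StringIO

import pandas as pd

# Same sample data as in the answer above.
temp = u"""0;119.02;0.0
1;121.20;0.0
3;112.49;0.0
4;113.94;0.0
5;114.67;0.0
6;111.77;0.0
7;117.57;0.0
6648;0.00;420.0
6649;0.00;420.0
6650;0.00;420.0"""

columns_to_keep = [0, 1]


def read_rows_by_position(source, rows_to_keep, **kwargs):
    """Keep only the given 0-based line positions (hypothetical helper)."""
    keep = set(rows_to_keep)  # set membership is cheap to test per line
    # A callable skiprows is evaluated for each line number; True means "skip".
    return pd.read_table(source, skiprows=lambda i: i not in keep, **kwargs)


def read_rows_by_label(source, labels_to_keep, chunksize=1000, **kwargs):
    """Keep only rows whose index label is listed (hypothetical helper)."""
    reader = pd.read_table(source, chunksize=chunksize, **kwargs)
    # Each chunk is filtered as it is read, so only the matching rows accumulate.
    parts = [chunk[chunk.index.isin(labels_to_keep)] for chunk in reader]
    return pd.concat(parts)


opts = dict(sep=";", header=None, index_col=0, usecols=columns_to_keep)

by_position = read_rows_by_position(StringIO(temp), [1, 2, 4], **opts)
by_label = read_rows_by_label(StringIO(temp), [3, 6649], chunksize=4, **opts)
print(by_position)
print(by_label)
```

The callable form needs no row-count estimate, because pandas asks about each line number as it reaches it. The chunked form still parses every line, but the full 2 GB table is never held in memory at once. A larger chunksize is usually faster at the cost of more memory per chunk.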