join - Merge very large csv using pandas or awk -
i have 2 csv below (tried read them through pandas) df1 file 17gb (i read through pandas read_csv) , df2 700mb, want merge using trig_seq
. python gets killed. there way through awk/join
>>> df1.head() streamid seqnum timestamp_p1 trig_seq 1 1 14914503 10000000001 1 2 1491450 10000000002 1 3 1491450 10000000003 1 4 1491450 10000000004 1 5 149145 10000000005 >>> df2= pd.read_csv("/tmp/my.csv") >>> df2.head() model_id order ctime trig_seq e62 1000000 1493311414272 30021182183 e62 1000001 149199641344 30021210134 e22 1000002 1491081210880 30021227875 e62 1000003 14951949824 30021239627 e62 1000004 14927136256 30021241522 >>> r1 = pd.merge(df1,df2) killed
try -
$ cat f1 1 1 14914503 10000000001 1 2 1491450 10000000002 1 3 1491450 10000000003 1 4 1491450 10000000004 1 5 149145 10000000005 $ cat f2 e62 1000000 1493311414272 30021182183 e62 1000001 149199641344 30021210134 e22 1000002 1491081210880 30021227875 e62 1000003 14951949824 30021239627 e62 1000004 14927136256 30021241522 $ awk 'nr==fnr{a[nr]=$0;next} $2 in {print $0,a[$1]}' f2 f1|column -t 1 1 14914503 10000000001 e62 1000000 1493311414272 30021182183 1 2 1491450 10000000002 e62 1000000 1493311414272 30021182183 1 3 1491450 10000000003 e62 1000000 1493311414272 30021182183 1 4 1491450 10000000004 e62 1000000 1493311414272 30021182183 1 5 149145 10000000005 e62 1000000 1493311414272 30021182183
Comments
Post a Comment