hadoop - Merge a large number of Spark DataFrames into one

March 15, 2014

I'm querying a cached Hive temp table with different queries, each satisfying different conditions, more than 1500 times inside a loop. I need to merge all the results using unionAll inside the loop, but I get a StackOverflowError because Spark cannot keep up with the RDD lineage.

Pseudo code:

    df = [from hive table]
    tablea = [from hive table]
    tablea.registerTempTable("tablea")
    hiveContext.sql('CACHE TABLE tablea')

    for i in range(0, 2000):
        if (list[0]['column1'] == 'xyz'):
            df1 = query tablea
            df = df.unionAll(df1)
        elif ():
            df1 = query tablea
            df = df.unionAll(df1)
        elif ():
            df1 = query tablea
            df = df.unionAll(df1)
        elif ():
            df1 = query tablea
            df = df.unionAll(df1)
        else:
            df1 = query tablea
            df = df.unionAll(df1)
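As a rough illustration of why a loop like this overflows: every `df = df.unionAll(df1)` wraps the previous result in one more level, so 2000 iterations give a plan nested about 2000 deep. Collecting the per-iteration results in a list and merging neighbours pairwise keeps the nesting near log2(N) instead. The sketch below is plain Python with no Spark dependency; `chain_merge`, `balanced_merge`, `merge_plans` and the `(depth, rows)` tuples are hypothetical stand-ins, and in Spark the `merge` argument would be something like `lambda a, b: a.unionAll(b)`. Whether the shallower plan is enough to avoid the error on a given Spark version is an assumption, not something verified here.

```python
from functools import reduce


def chain_merge(items, merge):
    """Left fold, like the loop: the result nests one level deeper per item."""
    return reduce(merge, items)


def balanced_merge(items, merge):
    """Merge neighbours pairwise, level by level, preserving order."""
    if not items:
        raise ValueError("nothing to merge")
    level = list(items)
    while len(level) > 1:
        # Pair up (0,1), (2,3), ...; an odd item at the end is carried over.
        merged = [merge(level[i], level[i + 1])
                  for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            merged.append(level[-1])
        level = merged
    return level[0]


def merge_plans(a, b):
    """Hypothetical stand-in for a union node: (nesting depth, row count)."""
    return (max(a[0], b[0]) + 1, a[1] + b[1])


plans = [(0, 1)] * 2000
print(chain_merge(plans, merge_plans))     # (1999, 2000)
print(balanced_merge(plans, merge_plans))  # (11, 2000)
```

Both calls see the same 2000 rows; only the nesting depth of the result differs.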

This throws a StackOverflowError because the RDD lineage becomes too long. I tried checkpointing as follows:

    for i in range(0, 2000):
        if (list[0]['column1'] == 'xyz'):
            df1 = query tablea
            df = df.unionAll(df1)
        elif ():
            df1 = query tablea
            df = df.unionAll(df1)
        else:
            df1 = query tablea
            df = df.unionAll(df1)
        # note the call parentheses; a bare df.rdd.checkpoint does nothing
        df.rdd.checkpoint()
        df = sqlContext.createDataFrame(df.rdd, df.schema)
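A side note on the checkpointing attempt, as a hedged sketch rather than a confirmed fix: `RDD.checkpoint()` only marks the RDD, a checkpoint directory has to be set first with `SparkContext.setCheckpointDir`, and nothing is written until an action runs on that RDD. The helper below checkpoints every `every` unions instead of every iteration. The name `union_with_checkpoints` and the `every` parameter are made up for illustration, and `sql_context` is assumed to be an existing SQLContext or HiveContext.

```python
def union_with_checkpoints(sql_context, frames, every=100):
    """Union `frames` left to right, cutting the lineage every `every` unions.

    Assumes sc.setCheckpointDir(...) was already called on the SparkContext
    and that all frames share one schema.
    """
    merged = frames[0]
    for i, frame in enumerate(frames[1:], start=1):
        merged = merged.unionAll(frame)
        if i % every == 0:
            rdd = merged.rdd
            rdd.checkpoint()  # only marks the RDD; nothing is saved yet
            rdd.count()       # an action is needed to materialize the checkpoint
            # Rebuild the DataFrame on top of the checkpointed RDD.
            merged = sql_context.createDataFrame(rdd, merged.schema)
    return merged
```

The `count()` costs an extra pass over the data at each checkpoint, so `every` trades lineage depth against recomputation and I/O.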

The checkpointing attempt gave me the same error. So I tried saveAsTable, which I wanted to avoid because of the lag in job submission between each HQL query and the Hive I/O inside the loop. That approach worked well.

    for i in range(0, 2000):
        if (list[0]['column1'] == 'xyz'):
            df = query tablea
            df.write.saveAsTable('output', mode='append')
        elif ():
            df = query tablea
            df.write.saveAsTable('output', mode='append')

I need help avoiding saving the DataFrame to Hive inside the loop. I want to merge the DataFrames in a way that is in-memory and efficient. One of the other options I tried was to insert the query result directly into a temp table, which fails with the error: cannot insert into an RDD-based table.

Maybe a temp table for the result would work:

    df1 = "query tablea".registerTempTable("result")
    sqlContext.sql("insert into result query tablea")
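For the in-memory merge asked about above, one possible direction (a minimal sketch, not a tested solution) is to collect each iteration's result in a Python list and union everything once after the loop at the RDD level. `SparkContext.union` takes a list of RDDs and builds a single flat union rather than a chain of nested ones. The helper name `union_flat` is hypothetical, and `sc` and `sql_context` are assumed to be the existing SparkContext and SQLContext or HiveContext.

```python
def union_flat(sc, sql_context, frames):
    """Merge DataFrames with one n-ary RDD union instead of chained unionAll.

    Assumes every frame has the same schema (same columns, same order).
    """
    if not frames:
        raise ValueError("no DataFrames to merge")
    schema = frames[0].schema
    # One union over all underlying RDDs keeps the lineage flat.
    merged_rdd = sc.union([frame.rdd for frame in frames])
    return sql_context.createDataFrame(merged_rdd, schema)
```

Inside the loop this would replace `df = df.unionAll(df1)` with something like `results.append(df1)`, followed by a single `union_flat(sc, sqlContext, results)` once the loop finishes.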
