hadoop - Merge a large number of Spark DataFrames into one
I'm querying a cached Hive temp table using different queries, each satisfying a different condition, more than 1500 times inside a loop, and I need to merge the results using unionAll inside the loop. This throws a StackOverflowError because Spark cannot keep up with the growing RDD lineage.
Pseudo code:

    df = <from hive table>
    tableA = <from hive table>
    tableA.registerTempTable("tableA")
    hiveContext.sql('CACHE TABLE tableA')

    for i in range(0, 2000):
        if (list[0]['column1'] == 'xyz'):
            df1 = <query tableA>
            df = df.unionAll(df1)
        elif (<condition 2>):
            df1 = <query tableA>
            df = df.unionAll(df1)
        elif (<condition 3>):
            df1 = <query tableA>
            df = df.unionAll(df1)
        elif (<condition 4>):
            df1 = <query tableA>
            df = df.unionAll(df1)
        else:
            df1 = <query tableA>
            df = df.unionAll(df1)
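Incidentally, the lineage growth is easy to reproduce on a toy example. A minimal sketch, assuming only a stock sqlContext on Spark 1.x (where unionAll is the DataFrame union method):

    base = sqlContext.range(10)
    df = base
    for i in range(100):
        df = df.unionAll(base)        # each union nests the plan one level deeper
    print(df.rdd.toDebugString())     # lineage is already ~100 levels deep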
At my scale the real loop throws a StackOverflowError as the RDD lineage grows too deep. I tried checkpointing as follows:
    for i in range(0, 2000):
        if (list[0]['column1'] == 'xyz'):
            df1 = <query tableA>
            df = df.unionAll(df1)
        elif (<condition 2>):
            df1 = <query tableA>
            df = df.unionAll(df1)
        else:
            df1 = <query tableA>
            df = df.unionAll(df1)
        df.rdd.checkpoint()
        df = sqlContext.createDataFrame(df.rdd, df.schema)
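As an aside, checkpointing has two prerequisites that are easy to miss: a checkpoint directory must be set on the SparkContext, and checkpoint() is lazy, so the lineage is only truncated after an action materializes the RDD. A sketch with those pieces in place (the directory path and the every-100-iterations cadence are my own guesses):

    sc.setCheckpointDir('/tmp/spark-checkpoints')   # any writable HDFS/local path

    for i in range(0, 2000):
        df1 = <query tableA>
        df = df.unionAll(df1)
        if i % 100 == 0:
            rdd = df.rdd              # keep a handle to this exact RDD
            rdd.checkpoint()
            rdd.count()               # action forces the checkpoint to run
            df = sqlContext.createDataFrame(rdd, df.schema)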
Even with checkpointing I got the same error. I then tried saveAsTable, which I wanted to avoid because of the lag in job submission between each HQL query and the Hive I/O inside the loop, but that approach worked well:
    for i in range(0, 2000):
        if (list[0]['column1'] == 'xyz'):
            df = <query tableA>
            df.write.saveAsTable('output', mode='append')
        elif (<condition 2>):
            df = <query tableA>
            df.write.saveAsTable('output', mode='append')
I need help avoiding saving the DataFrame to Hive inside the loop; I want to merge the DataFrames in a way that is in-memory and efficient. One of the other options I tried was inserting the query result directly into a temp table, but that gives the error: cannot insert into an RDD-based table.
Maybe registering the result as a temp table and inserting into it would work:
df1="query tablea".registertemptable("result") sqlcontext.sql("insert result query tablea")