hive - Spark partitions - using DISTRIBUTE BY option


We have a Spark environment that should process 50 million rows. These rows contain a key column, and the number of unique keys is close to 2000. We want to process the 2000 keys in parallel, using Spark SQL as follows:

hivecontext.sql("select * bigtbl distribute key") 

Subsequently we have a mapPartitions step that works nicely on the partitions in parallel. The trouble is, Spark creates 200 partitions by default. Using the following command we are able to increase the number of partitions:

hivecontext.sql("set spark.sql.shuffle.partitions=500"); 

However, during a real production run we will not know the number of unique keys in advance. We want this to be managed automatically. Is there a way to do this?

Thanks,

Bala

I suggest you use the "repartition" function, register the repartitioned result as a new temp table, and then cache it for faster processing.

val distinctValues = hiveContext.sql("select key from bigtbl").distinct().count() // find the count of distinct key values

hiveContext.sql("select * from bigtbl distribute by key")
    .repartition(distinctValues.toInt)   // repartition to the number of distinct key values
    .registerTempTable("newbigtbl")      // register the repartitioned data as a temp table

hiveContext.cacheTable("newbigtbl")      // cache the repartitioned table to improve query performance
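To tie this back to the mapPartitions step mentioned in the question, here is a minimal sketch of consuming the repartitioned temp table; the per-partition row count is only a stand-in for whatever the real per-key processing logic is:

// Read back the repartitioned, cached temp table registered above.
val repartitioned = hiveContext.sql("select * from newbigtbl")

// The function passed to mapPartitions runs once per partition, in parallel across executors.
// Counting rows per partition stands in for the real processing of each key group.
val perPartitionCounts = repartitioned.rdd.mapPartitions { rows =>
    Iterator(rows.size)
}

perPartitionCounts.collect().foreach(println)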

For further queries, use "newbigtbl".
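As an alternative that is not part of the original answer, the shuffle partition count itself can be derived from the data and applied before the DISTRIBUTE BY query, which addresses the "auto managed" requirement directly. A rough sketch, assuming the key column is the one named "key" in the question (note that hash collisions can still place more than one key in a partition):

// Derive the partition count from the data instead of hard-coding it.
val keyCount = hiveContext.sql("select count(distinct key) as cnt from bigtbl")
    .collect()(0)
    .getLong(0)

// Apply it before the shuffle so DISTRIBUTE BY produces roughly one partition per key.
hiveContext.sql(s"set spark.sql.shuffle.partitions=$keyCount")

val distributed = hiveContext.sql("select * from bigtbl distribute by key")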

