hive - Spark partitions - using DISTRIBUTE BY option


We have a Spark environment that should process 50 million rows. These rows contain a key column, and the number of unique keys is close to 2000. We want to process the 2000 keys in parallel, using Spark SQL as follows:

hivecontext.sql("select * bigtbl distribute key") 

Subsequently we have a mapPartitions step that works nicely on the partitions in parallel. The trouble is, Spark creates 200 partitions by default. Using the following command we are able to increase the number of partitions:

hivecontext.sql("set spark.sql.shuffle.partitions=500"); 

However, during a real production run we will not know the number of unique keys in advance. We want this to be managed automatically. Is there a way to do this?

Thanks,

Bala

I suggest you use the "repartition" function, register the repartitioned result as a new temp table, and then cache it for faster processing.

val distinctValues = hiveContext.sql("select key from bigtbl").distinct().count() // find the count of distinct key values

hiveContext.sql("select * from bigtbl distribute by key")
    .repartition(distinctValues.toInt)   // repartition to the number of distinct key values
    .registerTempTable("newbigtbl")      // register the repartitioned data as a temp table

hiveContext.cacheTable("newbigtbl")      // cache the repartitioned table to improve query performance
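To tie this back to the mapPartitions step mentioned in the question, here is a minimal sketch of consuming the repartitioned temp table; the per-partition row count is only a stand-in for whatever the real per-key processing logic is:

// Read back the repartitioned, cached temp table registered above.
val repartitioned = hiveContext.sql("select * from newbigtbl")

// The function passed to mapPartitions runs once per partition, in parallel across executors.
// Counting rows per partition stands in for the real processing of each key group.
val perPartitionCounts = repartitioned.rdd.mapPartitions { rows =>
    Iterator(rows.size)
}

perPartitionCounts.collect().foreach(println)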

For further queries, use "newbigtbl".
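As an alternative that is not part of the original answer, the shuffle partition count itself can be derived from the data and applied before the DISTRIBUTE BY query, which addresses the "auto managed" requirement directly. A rough sketch, assuming the key column is the one named "key" in the question (note that hash collisions can still place more than one key in a partition):

// Derive the partition count from the data instead of hard-coding it.
val keyCount = hiveContext.sql("select count(distinct key) as cnt from bigtbl")
    .collect()(0)
    .getLong(0)

// Apply it before the shuffle so DISTRIBUTE BY produces roughly one partition per key.
hiveContext.sql(s"set spark.sql.shuffle.partitions=$keyCount")

val distributed = hiveContext.sql("select * from bigtbl distribute by key")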

