hive - Spark partitions - using DISTRIBUTE BY option


We have a Spark environment that should process 50 million rows. These rows contain a key column, and the number of unique keys is close to 2000. We want to process all 2000 keys in parallel, using Spark SQL as follows:

hivecontext.sql("select * bigtbl distribute key") 

Subsequently we have a mapPartitions call that works nicely on the partitions in parallel. The trouble is, this creates only 200 partitions by default. Using a command like the following, we are able to increase the number of partitions:

hivecontext.sql("set spark.sql.shuffle.partitions=500"); 

However, during a real production run we will not know the number of unique keys in advance. We would like this to be managed automatically. Is there a way to do this, please?

Thanks,

Bala

I suggest using the "repartition" function, registering the repartitioned data as a new temp table, and then caching it for faster processing.

val distinctValues = hiveContext.sql("select key from bigtbl").distinct().count() // find the count of distinct key values

hiveContext.sql("select * from bigtbl distribute by key")
    .repartition(distinctValues.toInt)   // repartition to the number of distinct key values
    .registerTempTable("newbigtbl")      // register the repartitioned data as a temp table

hiveContext.cacheTable("newbigtbl")      // cache the repartitioned table to improve query performance

For further queries, use "newbigtbl".
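
For example, a follow-up query against the cached table might look like this (the group-by count is just an illustrative placeholder, not part of the original answer):

val keyCounts = hiveContext.sql("select key, count(*) as cnt from newbigtbl group by key")
keyCounts.show()   // served from the cached data instead of re-scanning bigtbl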

