Hive - Spark partitions - using the DISTRIBUTE BY option
We have a Spark environment that should process 50 million rows. These rows contain a key column, and the number of unique keys is close to 2000. We need to process the 2000 keys in parallel. We are using Spark SQL as follows:
hivecontext.sql("select * bigtbl distribute key")
Subsequently we have a mapPartitions call that works nicely on the partitions in parallel. The trouble is, it creates 200 partitions by default. Using the following command we are able to increase the number of partitions:
hivecontext.sql("set spark.sql.shuffle.partitions=500");
However, during the real production run we will not know the number of unique keys in advance. We want this to be managed automatically. Is there a way to do this, please?
Thanks,
Bala
I suggest using the "repartition" function, registering the repartitioned result as a new temp table, and then caching it for faster processing.
    // Find the number of distinct key values
    val distinctValues = hiveContext.sql("SELECT key FROM bigtbl").distinct().count()

    // Repartition to the number of distinct values and register as a temp table
    hiveContext.sql("SELECT * FROM bigtbl DISTRIBUTE BY key")
      .repartition(distinctValues.toInt)
      .registerTempTable("newBigTbl")

    // Cache the repartitioned table to improve query performance
    hiveContext.cacheTable("newBigTbl")
For further queries, use "newBigTbl".
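As a quick illustration of querying the cached table (the aggregation below is only an example, not part of the original answer):

    // Example follow-up query against the cached temp table
    val perKeyTotals = hiveContext.sql("SELECT key, COUNT(*) AS cnt FROM newBigTbl GROUP BY key")
    perKeyTotals.show()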