pyspark - Join two data frames as input for Machine Learning with Spark


I have two data frames in Apache Spark, each with a column named "joinvalue". joinvalue is numeric and has the same semantics and meaning in both data frames.

I need a combination of both data frames as input (training and test set) for a machine learning algorithm. Is it correct that I first need to combine both data frames into a single data frame before using it in an ML pipeline?

Example:

    > df1.show()
    +---------+---------+
    |        a|joinvalue|
    +---------+---------+
    |a value 0|        0|
    |a value 1|        5|
    |a value 2|       10|
    |a value 3|       15|
    |a value 4|       20|
    |a value 5|       25|
    |a value 6|       30|
    +---------+---------+

and

    > df2.show()
    +---------+---------+
    |        b|joinvalue|
    +---------+---------+
    |b value 0|        0|
    |b value 1|        7|
    |b value 2|       14|
    |b value 3|       21|
    |b value 4|       28|
    +---------+---------+

An outer join followed by orderBy yields the following results:

    > df1.join(df2, 'joinvalue', 'outer').orderBy('joinvalue').show()
    +---------+---------+---------+
    |joinvalue|        a|        b|
    +---------+---------+---------+
    |        0|a value 0|b value 0|
    |        5|a value 1|     null|
    |        7|     null|b value 1|
    |       10|a value 2|     null|
    |       14|     null|b value 2|
    |       15|a value 3|     null|
    |       20|a value 4|     null|
    |       21|     null|b value 3|
    |       25|a value 5|     null|
    |       28|     null|b value 4|
    |       30|a value 6|     null|
    +---------+---------+---------+

What I want is this, without nulls:

    +---------+---------+---------+
    |joinvalue|        a|        b|
    +---------+---------+---------+
    |        0|a value 0|b value 0|
    |        5|a value 1|b value 0|
    |        7|a value 1|b value 1|
    |       10|a value 2|b value 1|
    |       14|a value 2|b value 2|
    |       15|a value 3|b value 2|
    |       20|a value 4|b value 2|
    |       21|a value 4|b value 3|
    |       25|a value 5|b value 3|
    |       28|a value 5|b value 4|
    |       30|a value 6|b value 4|
    +---------+---------+---------+
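
To make the desired result concrete: it is the outer-join result with every null replaced by the most recent non-null value in the same column, in joinvalue order (a "forward fill"). The following is a minimal plain-Python sketch of that rule only, not Spark code; the names `joined` and `forward_fill` are made up for illustration, and the rows are copied from the outer-join output above.

```python
# Rows of the outer join, already sorted by joinvalue; None stands for null.
joined = [
    (0, "a value 0", "b value 0"),
    (5, "a value 1", None),
    (7, None, "b value 1"),
    (10, "a value 2", None),
    (14, None, "b value 2"),
    (15, "a value 3", None),
    (20, "a value 4", None),
    (21, None, "b value 3"),
    (25, "a value 5", None),
    (28, None, "b value 4"),
    (30, "a value 6", None),
]


def forward_fill(rows):
    """Replace each None with the latest non-None value seen in that column.

    Assumes rows are (joinvalue, a, b) tuples sorted by joinvalue.
    """
    filled = []
    last_a = last_b = None
    for joinvalue, a, b in rows:
        # Remember the newest non-null value of each column.
        last_a = a if a is not None else last_a
        last_b = b if b is not None else last_b
        filled.append((joinvalue, last_a, last_b))
    return filled


result = forward_fill(joined)
for row in result:
    print(row)
```

In Spark itself, a fill like this is commonly expressed with `pyspark.sql.functions.last(col, ignorenulls=True)` over a `Window` ordered by joinvalue. That variant is not shown here, and a window without a partition column moves all rows to a single partition, so it may not scale to large data.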

What is the best way to use joinvalue, a, and b, coming from multiple data frames, as features and labels in a machine learning algorithm?

