pyspark - Join two data frames as input for Machine Learning with Spark
I have two data frames in Apache Spark, each with a column named "joinvalue". joinvalue is numeric and has the same semantics and meaning in both data frames.
I need a combination of both data frames as input (training and test set) for a machine learning algorithm. Is it correct that I first need to combine both data frames into a single data frame before using them in an ML pipeline?
Example:
> df1.show()
+---------+---------+
|        a|joinvalue|
+---------+---------+
|a value 0|        0|
|a value 1|        5|
|a value 2|       10|
|a value 3|       15|
|a value 4|       20|
|a value 5|       25|
|a value 6|       30|
+---------+---------+
and
> df2.show()
+---------+---------+
|        b|joinvalue|
+---------+---------+
|b value 0|        0|
|b value 1|        7|
|b value 2|       14|
|b value 3|       21|
|b value 4|       28|
+---------+---------+
An outer join followed by orderBy yields the following result:
> df1.join(df2, 'joinvalue', 'outer').orderBy('joinvalue').show()
+---------+---------+---------+
|joinvalue|        a|        b|
+---------+---------+---------+
|        0|a value 0|b value 0|
|        5|a value 1|     null|
|        7|     null|b value 1|
|       10|a value 2|     null|
|       14|     null|b value 2|
|       15|a value 3|     null|
|       20|a value 4|     null|
|       21|     null|b value 3|
|       25|a value 5|     null|
|       28|     null|b value 4|
|       30|a value 6|     null|
+---------+---------+---------+
What I want is this, without the nulls:
+---------+---------+---------+
|joinvalue|        a|        b|
+---------+---------+---------+
|        0|a value 0|b value 0|
|        5|a value 1|b value 0|
|        7|a value 1|b value 1|
|       10|a value 2|b value 1|
|       14|a value 2|b value 2|
|       15|a value 3|b value 2|
|       20|a value 4|b value 2|
|       21|a value 4|b value 3|
|       25|a value 5|b value 3|
|       28|a value 5|b value 4|
|       30|a value 6|b value 4|
+---------+---------+---------+
What is the best way to use joinvalue, a, and b, coming from multiple data frames, as features and labels in a machine learning algorithm?