python - How to convert a PySpark RDD to a DataFrame with unknown columns?
I am creating an RDD by loading data from a text file in PySpark. I want to convert this RDD to a DataFrame, but I do not know how many and what columns are present in the RDD. I am trying to use createDataFrame(), and the syntax shown is sqlDataFrame = sqlContext.createDataFrame(rdd, schema). I tried to see how to create the schema, but most of the examples show a hardcoded schema. Since I do not know the columns, how can I convert the RDD to a DataFrame? Here is my code so far:
    from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)
    example_rdd = sc.textFile("\..\file1.csv").map(lambda line: line.split(","))
    # Convert the RDD to a DataFrame
    # df = sqlContext.createDataFrame()  # DataFrame conversion goes here

Note 1: The reason I do not know the columns is that I am trying to write a general script that can create a DataFrame from an RDD read from any file with any number of columns.
Note 2: I know there is another function called toDF() that can convert an RDD to a DataFrame, but with that I have the same issue of how to pass the unknown columns.
Note 3: The file format is not always CSV. I have shown a CSV as an example, but it can be a file of any format.
Spark 2.0.0 onwards supports reading a CSV file directly into a DataFrame. To read a CSV, use the DataFrameReader.csv method:
    df = spark.read.csv("\..\file1.csv", header=True)

If you do not have access to the spark object, you can use:
    from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)
    df = sqlContext.read.csv("\..\file1.csv", header=True)

If the file has a different separator, you can specify that too.
    # e.g. if the separator is ::
    df = spark.read.csv("\..\file1.csv", header=True, sep="::")
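
The CSV reader only helps when the input really is delimited text. For the more general case raised in the question, an RDD of already-split rows with an unknown number of columns, one possible approach is to look at the first row and build the column names from it. The following is a minimal sketch under stated assumptions, not an official PySpark recipe: the helper names default_column_names and rdd_to_dataframe are hypothetical, every row is assumed to have the same number of fields, and the placeholder names _c0, _c1, ... are chosen only to mirror what the CSV reader produces when there is no header.

```python
def default_column_names(width):
    """Return placeholder names _c0, _c1, ... for `width` columns."""
    return ["_c{}".format(i) for i in range(width)]


def rdd_to_dataframe(spark, rdd, header=False):
    """Convert an RDD of equal-length lists to a DataFrame.

    `spark` is a SparkSession (or SQLContext) and `rdd` holds one list
    per row. Both helper names are hypothetical, not part of PySpark.
    """
    first = rdd.first()  # peek at one row to learn the column count
    if header:
        # Assumption: the first row holds the column names.
        # Rows identical to the header are dropped from the data.
        names = [str(name) for name in first]
        rdd = rdd.filter(lambda row: row != first)
    else:
        names = default_column_names(len(first))
    # createDataFrame accepts a list of column names as the schema;
    # the column types are then inferred from the data.
    return spark.createDataFrame(rdd, names)
```

With the example_rdd from the question, this could be called as df = rdd_to_dataframe(spark, example_rdd, header=True). Rows of differing lengths would make the conversion fail, so ragged input would need to be padded or filtered first.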