Hive - Parquet creation: conversion from pandas DataFrame to pyarrow Table not working for object dtype
I want to create a Parquet file from a CSV file. For test purposes, I have the piece of code below, which reads the file and converts it first to a pandas DataFrame and then to a pyarrow Table. The table is stored on AWS S3, and I want to run a Hive query on it.
Input file contents:
year|word
2017|word 1
2018|word 2
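Code:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# inputfile / outputfile: paths to the source CSV and the target Parquet file
dataframe = pd.read_csv(inputfile, sep='|')
print(dataframe)
print(dataframe.dtypes)
print(dataframe.columns)

# Cast the text column to str explicitly
dataframe['word'] = dataframe['word'].astype('str')
print(dataframe.dtypes)

table = pa.Table.from_pandas(dataframe)  # ,schema=pa.string())
pq.write_table(table, outputfile)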
After writing the pyarrow table, I queried the Parquet file to make sure the data is stored in S3. The results are weird:
+--------+--------------+
| year   | word         |
+--------+--------------+
| 2017   | [b@60716d4f  |
| 2018   | [b@36bf8f00  |
+--------+--------------+
Somehow the int values show up fine, but the object/str values are not converted correctly.
I'd appreciate any help with this.
Thanks.
This round-trips fine for me. Please specify your platform and the versions of Python, pandas, and pyarrow.
I'm on Python 3.6 / macOS (this also worked on 2.7).
In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: pd.__version__
Out[3]: '0.19.2'

In [4]: pa.__version__
Out[4]: '0.2.0'

In [5]: data = """year|word
   ...: 2017|word 1
   ...: 2018|word 2
   ...: """

In [6]: df = pd.read_csv(StringIO(data), sep='|')

In [7]: df
Out[7]:
   year    word
0  2017  word 1
1  2018  word 2

In [8]: df.dtypes
Out[8]:
year     int64
word    object
dtype: object

In [9]: table = pa.Table.from_pandas(df)

In [10]: import pyarrow.parquet as pq

In [12]: pq.write_table(table, 'foo.pk')

In [13]: pq.read_table('foo.pk').to_pandas()
Out[13]:
   year    word
0  2017  word 1
1  2018  word 2

In [14]: pq.read_table('foo.pk').to_pandas().dtypes
Out[14]:
year     int64
word    object
dtype: object