Generating Parquet Files From NDJSON Format for the Amazon S3 Integration Using Python
Overview
Use the following Python script to convert an NDJSON file containing events into a Parquet file for use with the Amazon S3 integration with Split.
Prerequisites
The script requires the following environment:
- Python 3.7
- pandas 1.2.2
- PyArrow 3.0.0
- ndjson 0.3.1
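If any of these packages are missing, they can typically be installed with pip; the version pins below simply mirror the list above:

pip install pandas==1.2.2 pyarrow==3.0.0 ndjson==0.3.1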
Prepare your event file
- Ensure your NDJSON file follows the correct Split event structure (a quick validation sketch follows this list). For example, a single event looks like this, pretty-printed for readability; in the NDJSON file itself, each event must occupy exactly one line:
  {
    "environmentId": "029bd160-7e36-11e8-9a1c-0acd31e5aef0",
    "trafficTypeId": "e6910420-5c85-11e9-bbc9-12a5cc2af8fe",
    "eventTypeId": "agus_event",
    "key": "gDxEVkUsd3",
    "timestamp": 1625088419634,
    "value": 86.5588664346,
    "properties": {
      "age": "51",
      "country": "argentina",
      "tier": "basic"
    }
  }
- Update the file paths in the input_file and output_file variables.
- The script assumes that all values in the properties object are strings.
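Before converting, it can help to confirm that every record carries the expected top-level fields. The sketch below is a hypothetical pre-check, not part of the conversion script; it assumes the same sample_ndjson.json file name used later:

import ndjson

# Top-level fields every Split event record is expected to carry.
REQUIRED_FIELDS = {"environmentId", "trafficTypeId", "eventTypeId",
                   "key", "timestamp", "value", "properties"}

with open("sample_ndjson.json") as f:
    events = ndjson.load(f)

# Report any record that is missing one of the required fields.
for i, event in enumerate(events):
    missing = REQUIRED_FIELDS.difference(event)
    if missing:
        print(f"event {i} is missing fields: {sorted(missing)}")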
Run the Python script
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import ndjson

##################################
input_file = "sample_ndjson.json"
output_file = "event21.parquet"
##################################

def dict2keyvalue(properties):
    """Convert a properties dict into a list of (key, value) tuples,
    the layout PyArrow expects when building a map<string, string> column."""
    keyvalues = []
    for key in properties.keys():
        keyvalues.append((key, str(properties[key])))
    return keyvalues

# Parquet map type for the properties column: string keys, string values.
properties_type = pa.map_(
    pa.string(),
    pa.string(),
)

# Schema matching the Split event structure.
schema = pa.schema([
    pa.field("environmentId", pa.string()),
    pa.field("trafficTypeId", pa.string()),
    pa.field("eventTypeId", pa.string()),
    pa.field("key", pa.string()),
    pa.field("timestamp", pa.int64()),
    pa.field("value", pa.float64()),
    pa.field("properties", properties_type),
])

# Load the newline-delimited events into a DataFrame.
with open(input_file) as f:
    js = ndjson.load(f)
data = pd.DataFrame(js)

# Rewrite each properties dict as a list of key/value tuples so it
# converts cleanly to the map type declared in the schema.
data["properties"] = data["properties"].apply(dict2keyvalue)

# Convert to an Arrow table and write it out as Parquet.
table = pa.Table.from_pandas(data, schema=schema)
pq.write_table(table, output_file)
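After the script runs, it can help to read the Parquet file back and confirm the schema and row count. This is a minimal check, not part of the original script, and assumes it runs in the same session so pq and output_file are still defined:

# Read the Parquet file back and inspect it.
table = pq.read_table(output_file)
print(table.schema)
print(f"rows written: {table.num_rows}")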