
Generating Parquet Files From NDJSON Format for the Amazon S3 Integration Using Python

Overview

Use the following Python script to convert an NDJSON file containing events into a Parquet file for use with Split's Amazon S3 integration.

Prerequisites

The script requires the following environment:

  • Python 3.7
  • Pandas 1.2.2
  • Pyarrow 3.0.0
  • ndjson 0.3.1
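
If these packages are not already installed, you can install the pinned versions with pip, for example:

    pip install pandas==1.2.2 pyarrow==3.0.0 ndjson==0.3.1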

Prepare your event file

  1. Ensure your NDJSON file follows the correct Split event structure. The event below is pretty-printed for readability; see the note after these steps for how events must be laid out in the file itself.

    For example:

    {
      "environmentId": "029bd160-7e36-11e8-9a1c-0acd31e5aef0",
      "trafficTypeId": "e6910420-5c85-11e9-bbc9-12a5cc2af8fe",
      "eventTypeId": "agus_event",
      "key": "gDxEVkUsd3",
      "timestamp": 1625088419634,
      "value": 86.5588664346,
      "properties": {
        "age": "51",
        "country": "argentina",
        "tier": "basic"
      }
    }
  2. Update the input_file and output_file variables in the script to match your file paths.

  3. Note that the script converts every value in the properties object to a string, matching the map of string keys to string values that the schema defines for properties.
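
Because the file is NDJSON (newline-delimited JSON), each event must occupy exactly one line. The pretty-printed event from step 1 would therefore appear in the file as:

    {"environmentId": "029bd160-7e36-11e8-9a1c-0acd31e5aef0", "trafficTypeId": "e6910420-5c85-11e9-bbc9-12a5cc2af8fe", "eventTypeId": "agus_event", "key": "gDxEVkUsd3", "timestamp": 1625088419634, "value": 86.5588664346, "properties": {"age": "51", "country": "argentina", "tier": "basic"}}

Any additional events follow, one per line.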

Run the Python script

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import ndjson

##################################
input_file = "sample_ndjson.json"
output_file = "event21.parquet"
##################################

def dict2keyvalue(properties):
    # Convert a properties dict into the list of (key, value) tuples
    # that pyarrow expects for a map<string, string> column. Every
    # value is coerced to a string.
    keyvalues = []
    for key in properties.keys():
        keyvalues.append((key, str(properties[key])))
    return keyvalues

# Split event properties are stored as a map of string keys to string values.
properties_type = pa.map_(
    pa.string(),
    pa.string(),
)

# Arrow schema matching the Split event structure.
schema = pa.schema([
    pa.field("environmentId", pa.string()),
    pa.field("trafficTypeId", pa.string()),
    pa.field("eventTypeId", pa.string()),
    pa.field("key", pa.string()),
    pa.field("timestamp", pa.int64()),
    pa.field("value", pa.float64()),
    pa.field("properties", properties_type),
])

# Load the NDJSON events into a pandas DataFrame.
with open(input_file) as f:
    js = ndjson.load(f)
    data = pd.DataFrame(js)

# Replace each properties dict with its (key, value) pairs.
data["properties"] = data["properties"].apply(dict2keyvalue)

# Convert to an Arrow table with the target schema and write the Parquet file.
data = pa.Table.from_pandas(data, schema)
pq.write_table(data, output_file)
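
To spot-check the result before uploading it to S3, you can read the Parquet file back and inspect it. A minimal sketch, reusing the output_file name from the script above:

import pyarrow.parquet as pq

table = pq.read_table("event21.parquet")
print(table.schema)    # should show the seven fields defined in the schema
print(table.num_rows)  # one row per event in the NDJSON input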