Get Parquet Schema Using Python
Posted August 17, 2022 by Rohith ‐ 1 min read
Parquet is widely used in data transformations. Every Parquet file has a schema associated with it. Because Parquet is a binary format, its contents cannot be inspected with a text editor. In this article, we use the pyarrow Python package to extract the schema from a Parquet file.
Read about Parquet Data Types
TL;DR
The example below can be used as a snippet to extract the Parquet schema:
import pyarrow.parquet
uri = '/Users/myhome/Documents/mydatafile.snappy.parquet'
schema = pyarrow.parquet.read_schema(uri, memory_map=True)
print(schema)
Output:
country_code: string
-- field metadata --
PARQUET:field_id: '1'
ts: timestamp[ns]
-- field metadata --
PARQUET:field_id: '2'
result: string
-- field metadata --
PARQUET:field_id: '3'
region: string
-- field metadata --
PARQUET:field_id: '4'
source: string
-- field metadata --
PARQUET:field_id: '5'
Install pyarrow
You can install the pyarrow package using the pip command below:
pip3 install pyarrow
Get Schema Using Python
pyarrow has methods to read a Parquet file and extract its schema. Here, we use pyarrow.parquet.read_schema() to extract the schema.
Example: the following snippet returns the schema of a Parquet file at a local path. The function does not read the whole file, only the schema stored in its metadata.
import pyarrow.parquet
parquet_file_location = '/Users/myhome/workspace/mydatafile.snappy.parquet'
schema = pyarrow.parquet.read_schema(parquet_file_location, memory_map=True)
Convert Schema To Pandas DataFrame
The returned schema can be converted to a usable Pandas DataFrame.
import pandas as pd
import pyarrow.parquet
parquet_file_loc = '/Users/myhome/workspace/mydatafile.snappy.parquet'
schema = pyarrow.parquet.read_schema(parquet_file_loc, memory_map=True)
schema = pd.DataFrame(({"column": name, "pd_dtype": str(pd_dtype)} for name, pd_dtype in zip(schema.names, schema.types)))
# If the Parquet file has no columns, make sure the DataFrame still has both headers.
schema = schema.reindex(columns=["column", "pd_dtype"], fill_value=pd.NA)
print(schema)
Output:
column pd_dtype
0 country_code string
1 ts timestamp[ns]
2 result string