Get Parquet Schema Using Python

Posted August 17, 2022 by Rohith ‐ 1 min read

Parquet is widely used in data transformation pipelines. Every Parquet file carries a schema that describes its columns and their types. Because Parquet is a binary format, the file cannot be inspected in a text editor. In this article, we use the pyarrow Python package to extract the schema from a Parquet file.

Read about Parquet Data Types

TL;DR

The example below can be used as a snippet to extract the Parquet schema:

import pyarrow.parquet
uri = '/Users/myhome/Documents/mydatafile.snappy.parquet'
schema = pyarrow.parquet.read_schema(uri, memory_map=True)
print(schema)

Output:

country_code: string
  -- field metadata --
  PARQUET:field_id: '1'
ts: timestamp[ns]
  -- field metadata --
  PARQUET:field_id: '2'
result: string
  -- field metadata --
  PARQUET:field_id: '3'
region: string
  -- field metadata --
  PARQUET:field_id: '4'
source: string
  -- field metadata --
  PARQUET:field_id: '5'

Install pyarrow

You can install the pyarrow package with the pip command below:

pip3 install pyarrow

Get Schema Using Python

pyarrow provides functions to read Parquet files and extract their schemas. Here, we use pyarrow.parquet.read_schema() to extract the schema.

Example: the following code returns the schema of a Parquet file at a local path. The function does not read the whole file, just the schema.

import pyarrow.parquet

parquet_file_location = '/Users/myhome/workspace/mydatafile.snappy.parquet'
schema = pyarrow.parquet.read_schema(parquet_file_location, memory_map=True)

Convert Schema To Pandas DataFrame

The returned schema can be converted to a Pandas DataFrame, which makes it easier to inspect, filter, or export.

import pandas as pd
import pyarrow.parquet

parquet_file_loc = '/Users/myhome/workspace/mydatafile.snappy.parquet'
schema = pyarrow.parquet.read_schema(parquet_file_loc, memory_map=True)
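# Build one row per column. The "pd_dtype" values are Arrow type names rendered as strings.
schema = pd.DataFrame({"column": name, "pd_dtype": str(arrow_type)} for name, arrow_type in zip(schema.names, schema.types))
# If the schema has no columns, the frame above is empty; reindex so both columns still exist.
schema = schema.reindex(columns=["column", "pd_dtype"], fill_value=pd.NA)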
print(schema)

Output:

                    column           pd_dtype
0            country_code             string
1                      ts      timestamp[ns]
2                  result             string