Data Lake Layer Recommendations
Posted December 1, 2022 by Rohith ‐ 3 min read
A data lake allows us to store a wide variety of data at low cost. Over time, however, it can become difficult to maintain as the data grows, leading to data duplication and inefficient resource usage. In this article, we will look at the data lake layers and recommendations for effectively maintaining data in data lakes, with examples.
Data Lake Maintainability
A data lake is a centralized repository designed to store, process, and secure large amounts of structured, semi-structured, and unstructured data. It can store data in its native format and process any variety of it, regardless of size.
Because a data lake places few restrictions on what data is stored, where and how, it is important to make deliberate decisions about maintainability, data cleanliness and governance.
Depending on the use case and what you want to achieve with the data, a data lake is generally organized and managed in different layers.
Data Lake Layer Recommendations
Staging Data Layer
This is also called the Bronze data layer.
The staging layer is where raw or not-yet-processed data is stored. It can be viewed as the layer that sits before the fully processed data is generated.
Examples of the staging layer:
- Click stream events in JSON format with no specific schema. The data needs to be cleaned and attributed so that it can be identified and categorized for further analysis.
- Data events such as orders and customers from an application, in Avro format, where each file holds a single record. Even though the data may have a schema, it is not yet in the format required for analysis and may need to be joined with other data first.
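As a rough illustration of the cleaning step for the click stream example, the sketch below maps schemaless JSON lines onto one fixed shape and drops lines that cannot be used. The field names (`type`, `event`, `ts`, `timestamp`, `page`, `url`) and the sample events are hypothetical and are not taken from any real application.

```python
import json

# Hypothetical raw click stream lines as they might land in the staging layer.
RAW_EVENTS = [
    '{"type": "click", "ts": "2022-12-01T10:15:00Z", "page": "/home"}',
    '{"event": "View", "timestamp": "2022-12-01T10:20:00Z", "url": "/pricing"}',
    'not valid json',
]


def normalize(line):
    """Map one raw line to a fixed schema, or return None if it is unusable."""
    try:
        raw = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(raw, dict):
        return None
    # Different producers are assumed to use different key names for the same field.
    event_type = raw.get("type") or raw.get("event")
    ts = raw.get("ts") or raw.get("timestamp")
    page = raw.get("page") or raw.get("url")
    if not (event_type and ts):
        return None
    return {"type": str(event_type).lower(), "ts": ts, "page": page}


# Keep only the events that could be normalized.
cleaned = [e for e in map(normalize, RAW_EVENTS) if e is not None]
```

In a real pipeline the rejected lines would usually be counted or written somewhere for inspection rather than silently dropped.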
Curated Data Layer
This is also called the Silver data layer.
The curated data layer holds data that has been cleansed and processed to the point where it is usable by different applications. It takes relatively less effort to consume than data in the staging layer. Data in this layer is not tied to a single application; the same source can be used by different applications with minimal further processing.
Examples of the curated data layer:
- Aggregated click stream data unified by type and schema, and partitioned by date and hour. Each file can contain thousands or millions of records.
- Data events such as orders and customers, originally in Avro, aggregated into a different format such as Parquet (a columnar data format) for further processing by a batch process or for querying with engines such as Amazon Athena.
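Partitioning by date and hour can be sketched as deriving a partition key from each event timestamp and grouping events under it. The `date=.../hour=...` naming below is an assumed Hive-style convention, and the helper names are hypothetical; the article does not prescribe a specific layout.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_key(ts):
    """Return a date/hour partition path for an ISO 8601 timestamp such as 2022-12-01T10:15:00Z."""
    # fromisoformat() on older Python versions does not accept a trailing "Z".
    parsed = datetime.fromisoformat(ts.replace("Z", "+00:00")).astimezone(timezone.utc)
    return f"date={parsed:%Y-%m-%d}/hour={parsed:%H}"


def group_by_partition(events):
    """Group events (dicts with a "ts" field) by their partition key."""
    groups = defaultdict(list)
    for event in events:
        groups[partition_key(event["ts"])].append(event)
    return dict(groups)


# Hypothetical cleaned events.
events = [
    {"type": "click", "ts": "2022-12-01T10:15:00Z"},
    {"type": "view", "ts": "2022-12-01T10:20:00Z"},
    {"type": "click", "ts": "2022-12-01T11:05:00Z"},
]
partitions = group_by_partition(events)
```

Each group would then be written out as one or more files under its partition path, so that a query for a given hour only has to read that hour's files.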
Application Data Layer
This is also called the Gold data layer.
The application data layer holds the final form of the data, processed for a specific application. The data can be in whatever format the application expects, and its structure is designed with only that application's specifications in mind. Using it for any other purpose is a bad idea, as it complicates reasoning about the data further down the line and can affect maintainability and performance.
Examples of the application data layer:
- Click stream data joined with marketing data for attribution, to analyze the effectiveness of digital marketing campaigns. This can be used directly by BI tools such as Tableau.
- Rolled-up and aggregated data produced by an Apache Flink application after joining customers and orders. This can be loaded into a data warehouse such as Amazon Redshift.
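The first example, joining click stream data with marketing data, can be sketched in a few lines. The record shapes, campaign names and the `clicks_per_campaign` helper are hypothetical; a real job would run on a processing engine rather than on in-memory lists.

```python
from collections import Counter

# Hypothetical curated inputs: clicks carry an optional campaign id.
CLICKS = [
    {"user": "u1", "campaign_id": "c1"},
    {"user": "u2", "campaign_id": "c1"},
    {"user": "u3", "campaign_id": "c2"},
    {"user": "u4", "campaign_id": None},
]
CAMPAIGNS = {
    "c1": {"name": "winter_sale", "channel": "email"},
    "c2": {"name": "new_users", "channel": "search"},
}


def clicks_per_campaign(clicks, campaigns):
    """Join clicks to campaigns and count clicks per campaign."""
    # Clicks without a known campaign are left out of the attribution report.
    counts = Counter(
        c["campaign_id"] for c in clicks if c["campaign_id"] in campaigns
    )
    return [
        {
            "campaign": campaigns[cid]["name"],
            "channel": campaigns[cid]["channel"],
            "clicks": n,
        }
        for cid, n in sorted(counts.items())
    ]


report = clicks_per_campaign(CLICKS, CAMPAIGNS)
```

The output is shaped for one consumer only, a campaign report, which is why it belongs in the application layer rather than the curated layer.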
Temporary Data Layer
This data layer stores intermediate application data.
For example, an application may swap in-memory data out to this layer while processing. Data in this layer is expected to be transient.
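One way to keep the four layers apart in storage is to give each its own top-level prefix. The sketch below assumes an S3-style object store; the bucket name, prefix names and the `dataset_prefix` helper are all hypothetical.

```python
from enum import Enum


class Layer(Enum):
    """The four layers described above, mapped to assumed prefix names."""
    STAGING = "staging"          # Bronze
    CURATED = "curated"          # Silver
    APPLICATION = "application"  # Gold
    TEMPORARY = "tmp"            # transient intermediate data


def dataset_prefix(bucket, layer, dataset):
    """Build the storage prefix for a dataset in a given layer."""
    return f"s3://{bucket}/{layer.value}/{dataset}/"


curated_clicks = dataset_prefix("example-lake", Layer.CURATED, "clickstream")
```

Separate prefixes make it easier to apply different retention rules per layer, for instance expiring objects under the temporary prefix after a short period.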