Transactional Data Lakes Comparison
Posted December 1, 2022 by Rohith ‐ 7 min read
Interest in data lakes has been increasing as more features are added to data lake frameworks and as cloud services become more widely available. In this article we will compare transactional data lakes: Apache Hudi, Delta Lake, Apache Iceberg, and AWS Lake Formation.
Transactional capabilities on data lakes are increasing the interest of businesses in using them for data analysis, and they are having a huge impact on how data stacks are re-architected.
Consider a use case of loading a blockchain dataset from a p2p node for analysis. There are many important features we can leverage by using transactional data lakes. The considerations we have to make for large datasets are:
- Atomicity, Consistency, Isolation, and Durability (ACID) compliance.
- Efficient updates to files without rewriting whole data files.
- Efficient bulk load for batch data loads.
- Ability to search and merge new files with existing ones.
- Time travel ability, a.k.a. data snapshot capability.
- Partition evolution.
- Data deduplication capabilities.
- Compaction to merge changelogs with updates/deletes.
My favorite is Apache Hudi, since it provides most of these features out of the box.
I am not a big fan of using AWS Lake Formation because of vendor lock-in, and since it is based on Athena, there can be limitations for long-running tasks.
Let’s dive deep into each of these features and how each data lake differs from the others.
ACID Transactions
Delta Lake
Delta Lake has a transaction model based on the transaction log, or DeltaLog. It logs file operations in JSON files and then commits them to the table using atomic operations. The isolation level of Delta Lake is write serializable, which means it allows a reader and a writer to access the table in parallel. If two writers try to write data to the table in parallel, each of them assumes there are no other changes to the table, and conflicts are detected only at commit time.
Atomicity is guaranteed by HDFS rename, S3 file writes, or Azure rename-without-overwrite. With the Delta Lake transaction feature, a user can run time travel queries by timestamp or version number.
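For example, a minimal PySpark sketch of a Delta time travel read; the table path and version number here are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the table as it was at version 12; "timestampAsOf" with a timestamp string works the same way.
old_df = (spark.read.format("delta")
    .option("versionAsOf", 12)
    .load("s3://my-bucket/delta/blockchain_tx"))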
Apache Iceberg
Apache Iceberg's transaction model is snapshot based. A snapshot is a complete list of the files in the table. The table state is maintained in metadata files. Every change to the table state creates a new metadata file and replaces the old metadata file with an atomic swap. Like Delta Lake, it applies optimistic concurrency control, and a user can run time travel queries by snapshot id or timestamp.
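A similar hedged sketch for Iceberg, reading an older snapshot through Spark; this assumes a path-based Iceberg table and an existing SparkSession `spark`, and the snapshot id, timestamp, and path are only illustrative:

# Read a specific snapshot of the table.
snap_df = (spark.read.format("iceberg")
    .option("snapshot-id", 10963874102873)
    .load("s3://my-bucket/iceberg/blockchain_tx"))

# Or read the table as of a point in time (milliseconds since epoch).
asof_df = (spark.read.format("iceberg")
    .option("as-of-timestamp", 1669852800000)
    .load("s3://my-bucket/iceberg/blockchain_tx"))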
Apache Hudi
Hudi’s transaction model is based on a timeline. A timeline contains all actions performed on the table at different instants in time. The timeline provides instantaneous views of the table and supports retrieving data in the order of arrival. Hudi also applies optimistic concurrency control between readers and writers, so a user can do time travel queries based on the Hudi commit time.
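And the Hudi equivalent, reading the table as of a past commit instant on the timeline; the instant time and path are hypothetical and `spark` is an existing SparkSession:

# Query the table state as of the given commit instant (yyyyMMddHHmmss).
asof_df = (spark.read.format("hudi")
    .option("as.of.instant", "20221201120000")
    .load("s3://my-bucket/hudi/blockchain_tx"))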
AWS Lake Formation
Lake Formation introduces a new table type, governed tables, that supports ACID transactions on S3 objects. It uses the AWS Glue Data Catalog to identify changes in the data, similar to Apache Iceberg.
Merge On Read
A Merge On Read table is a superset of Copy On Write, in the sense that it still supports read optimized queries of the table by exposing only the base/columnar files in the latest file slices. Additionally, it stores incoming upserts for each file group onto a row-based delta log, to support snapshot queries by applying the delta log onto the latest version of each file id on-the-fly during query time. Thus, this table type attempts to balance read and write amplification intelligently, to provide near real-time data. The most significant change here is to the compactor, which now carefully chooses which delta log files need to be compacted onto their columnar base file to keep query performance in check (larger delta log files would incur longer merge times on the query side).
There are two ways of querying the same underlying table: Read Optimized queries and Snapshot queries, depending on whether we choose query performance or freshness of data.
All of these data lakes support merge-on-read in some form.
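For Hudi specifically, the two query types map to a reader option; a rough PySpark sketch (the table path is hypothetical and `spark` is an existing SparkSession):

# Snapshot query: merges base files with delta logs on the fly for the freshest data.
snapshot_df = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "snapshot")
    .load("s3://my-bucket/hudi/blockchain_tx"))

# Read optimized query: scans only the compacted columnar base files for faster reads.
ro_df = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "read_optimized")
    .load("s3://my-bucket/hudi/blockchain_tx"))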
Bulk Load
Both upsert and insert operations keep input records in memory to speed up storage heuristics computations (among other things), and thus can be cumbersome for the initial loading/bootstrapping of a Hudi table. Bulk insert provides the same semantics as insert while implementing a sort-based data writing algorithm, which can scale very well to several hundred TBs of initial load. However, it only does a best-effort job at sizing files, versus guaranteeing file sizes like inserts/upserts do.
Only Apache Hudi supports this kind of bulk load option.
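A minimal sketch of a bulk-insert initial load with the Hudi Spark datasource; the table name, key fields, and path are hypothetical, and `df` is a DataFrame holding the initial dataset:

(df.write.format("hudi")
    .option("hoodie.table.name", "blockchain_tx")
    .option("hoodie.datasource.write.recordkey.field", "tx_hash")
    .option("hoodie.datasource.write.precombine.field", "block_time")
    # Sort-based write path intended for large initial loads.
    .option("hoodie.datasource.write.operation", "bulk_insert")
    .mode("overwrite")
    .save("s3://my-bucket/hudi/blockchain_tx"))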
Merge writes with record-level indices
Apache Hudi
Hudi provides efficient upserts, by mapping a given hoodie key (record key + partition path) consistently to a file id, via an indexing mechanism. This mapping between record key and file group/file id, never changes once the first version of a record has been written to a file. In short, the mapped file group contains all versions of a group of records.
For Copy-On-Write tables, this enables fast upsert/delete operations, by avoiding the need to join against the entire dataset to determine which files to rewrite. For Merge-On-Read tables, this design allows Hudi to bound the amount of records any given base file needs to be merged against.
Apache Hudi offers 4 types of indexing:
- Bloom Index (default): Employs bloom filters built out of the record keys, optionally also pruning candidate files using record key ranges.
- Simple Index: Performs a lean join of the incoming update/delete records against keys extracted from the table on storage.
- HBase Index: Manages the index mapping in an external Apache HBase table.
- Bring your own implementation: You can extend this public API to implement custom indexing.
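To illustrate the upsert path described above, a hedged PySpark sketch that writes updates against the default bloom index; all names are hypothetical and `df` holds the incoming change records:

(df.write.format("hudi")
    .option("hoodie.table.name", "blockchain_tx")
    .option("hoodie.datasource.write.recordkey.field", "tx_hash")
    .option("hoodie.datasource.write.precombine.field", "block_time")
    .option("hoodie.datasource.write.operation", "upsert")
    # BLOOM is the default; SIMPLE, HBASE or a custom implementation can be configured instead.
    .option("hoodie.index.type", "BLOOM")
    .mode("append")
    .save("s3://my-bucket/hudi/blockchain_tx"))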
Delta Lake
Delta Lake does this using Bloom filter indexes.
How do Bloom filter indexes work? A Bloom filter index is a space-efficient data structure that enables data skipping on chosen columns, particularly for fields containing arbitrary text. The Bloom filter operates by either stating that data is definitively not in the file, or that it is probably in the file, with a defined false positive probability (FPP).
File level Bloom filters: Each data file can have a single Bloom filter index file associated with it. Before reading a file, Delta Lake checks the index file, and the data file is read only if the index indicates that the file might match a data filter. It always reads the data file if an index does not exist or if a Bloom filter is not defined for a queried column.
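Defining such an index is a single statement on Databricks; a sketch using the Databricks-specific SQL, with a hypothetical table and column:

# Builds a Bloom filter index for the tx_hash column on newly written data files.
spark.sql("""
  CREATE BLOOMFILTER INDEX
  ON TABLE blockchain_tx
  FOR COLUMNS (tx_hash OPTIONS (fpp = 0.1, numItems = 50000000))
""")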
Apache Iceberg
Apache Iceberg does this using metadata indexing: its manifest files track column-level statistics (such as value ranges and null counts) for every data file, which lets the engine prune files when locating the rows to merge.
AWS Lake Formation
AWS Lake Formation does this using AWS Glue.
Concurrency
Concurrency enables us to run different writers and table services against the table at the same time.
All of these data lakes provide optimistic concurrency control.
What is OCC?
Optimistic concurrency control (OCC), also known as optimistic locking, is a concurrency control method applied to transactional systems such as relational database management systems and software transactional memory. OCC assumes that multiple transactions can frequently complete without interfering with each other. While running, transactions use data resources without acquiring locks on those resources. Before committing, each transaction verifies that no other transaction has modified the data it has read. If the check reveals conflicting modifications, the committing transaction rolls back and can be restarted.
The phases of optimistic concurrency control are Begin, Modify, Validate, and Commit/Rollback.
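To make those phases concrete, here is a small self-contained Python sketch, not tied to any particular lake format, of a writer validating against a version number before committing:

class VersionedTable:
    """Toy table whose state is published under a monotonically increasing version."""
    def __init__(self):
        self.version = 0
        self.rows = {}

def commit(table, read_version, staged_rows):
    # Validate: if another writer advanced the version since we began, this is a conflict.
    if table.version != read_version:
        raise RuntimeError("write conflict detected, roll back and retry")
    # Commit: publish the staged changes and advance the version.
    table.rows.update(staged_rows)
    table.version += 1

table = VersionedTable()
read_version = table.version              # Begin: remember the version we read.
staged_rows = {"tx1": {"amount": 42}}     # Modify: stage changes without taking locks.
commit(table, read_version, staged_rows)  # Validate + Commit (or Rollback on conflict).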
Partition Evolution
Partition evolution helps us change the partition structure of the table as we go.
Apache Hudi
Hudi takes a different approach, with coarse-grained partitions and fine-grained clustering, which can be evolved asynchronously without rewriting data.
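A hedged sketch of what that looks like in practice, using Hudi's inline clustering options on a write; the property values are illustrative and all table, key, and path names are hypothetical:

(df.write.format("hudi")
    .option("hoodie.table.name", "blockchain_tx")
    .option("hoodie.datasource.write.recordkey.field", "tx_hash")
    .option("hoodie.datasource.write.precombine.field", "block_time")
    .option("hoodie.datasource.write.operation", "upsert")
    # Trigger clustering inline every 4 commits, sorting data by a common query column.
    .option("hoodie.clustering.inline", "true")
    .option("hoodie.clustering.inline.max.commits", "4")
    .option("hoodie.clustering.plan.strategy.sort.columns", "block_time")
    .mode("append")
    .save("s3://my-bucket/hudi/blockchain_tx"))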
Delta Lake
Delta Lake does not support it.
Apache Iceberg
In Apache Iceberg, old data stays in the old partitions and new data is written under the new partitions, so performance can be uneven across them.
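For Iceberg, changing the partition spec is a metadata-only operation; a sketch using the Iceberg Spark SQL extensions, with hypothetical catalog, table, and column names:

# Switch future writes from the old identity partition to a daily partition on block_time.
spark.sql("ALTER TABLE prod.db.blockchain_tx ADD PARTITION FIELD days(block_time)")
spark.sql("ALTER TABLE prod.db.blockchain_tx DROP PARTITION FIELD tx_date")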
AWS Lake Formation
Does not support it.
Data deduplication
The data deduplication feature in a data lake helps insert data without introducing duplicates.
Apache Hudi
Apache Hudi supports:
- Record key uniqueness
- Precombine Utility Customizations
- Merge
- Drop dupes from inserts
Delta Lake
You can avoid inserting duplicate records in Delta Lake using the MERGE operation.
Example:
MERGE INTO logs
USING newDedupedLogs
ON logs.uniqueId = newDedupedLogs.uniqueId
WHEN NOT MATCHED
THEN INSERT *
Apache Iceberg
Iceberg supports MERGE INTO by rewriting data files that contain rows that need to be updated in an overwrite commit.
Example:
MERGE INTO prod.db.target t -- a target table
USING (SELECT ...) s -- the source updates
ON t.id = s.id -- condition to find updates for target rows
WHEN ... -- updates
AWS Lake Formation
AWS Lake Formation uses AWS Glue internally to identify duplicates.
File Sizing
Apache Hudi
Supports automated file size tuning
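A sketch of the main knobs behind Hudi's automatic file sizing; the property names come from Hudi's configuration reference, the values are only illustrative, and the table and path names are hypothetical:

(df.write.format("hudi")
    .option("hoodie.table.name", "blockchain_tx")
    .option("hoodie.datasource.write.recordkey.field", "tx_hash")
    .option("hoodie.datasource.write.precombine.field", "block_time")
    .option("hoodie.datasource.write.operation", "upsert")
    # Target size for base parquet files, roughly 120 MB.
    .option("hoodie.parquet.max.file.size", "125829120")
    # Files below roughly 100 MB are treated as small and are padded by subsequent writes.
    .option("hoodie.parquet.small.file.limit", "104857600")
    .mode("append")
    .save("s3://my-bucket/hudi/blockchain_tx"))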
Delta Lake
File sizing is maintained manually
Apache Iceberg
File sizing is maintained manually
AWS Lake Formation
AWS Lake Formation also requires manual maintenance.