What is Delta Lake?
Delta Lake is an open source storage layer that brings reliability to data lake storage. It provides ACID transactions and scalable metadata handling, and it unifies streaming and batch data processing. Delta Lake uses your existing data lake storage (for example, Azure Data Lake Storage or Amazon S3) and is fully compatible with the Apache Spark APIs, sitting on top of Apache Spark as its compute layer. Together, the format and the compute layer simplify building big data pipelines and increase their overall efficiency.
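To see what "fully compatible with the Apache Spark APIs" means in practice, here is a minimal PySpark sketch. The table path /tmp/delta/events is just an example, and the package coordinates depend on your Spark and Delta versions:

```python
from pyspark.sql import SparkSession

# Assumes the delta-spark package is on the classpath, e.g. started with:
#   pyspark --packages io.delta:delta-spark_2.12:3.1.0
spark = (
    SparkSession.builder
    .appName("delta-quickstart")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Write a DataFrame as a Delta table -- the usual DataFrame API,
# only the format string changes from "parquet" to "delta".
df = spark.range(0, 100)
df.write.format("delta").save("/tmp/delta/events")

# Read it back like any other Spark data source.
spark.read.format("delta").load("/tmp/delta/events").show(5)
```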
Delta Lake stores your data as versioned Parquet files in the storage layer. Alongside those data files, it keeps a transaction log that records every commit made to the table, which is how it provides ACID transactions.
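A minimal sketch of inspecting that log, assuming the spark session and /tmp/delta/events table from the example above; each row of the history corresponds to one commit in the table's _delta_log directory:

```python
from delta.tables import DeltaTable

# Load the Delta table and surface its transaction log as a DataFrame.
dt = DeltaTable.forPath(spark, "/tmp/delta/events")
dt.history().select("version", "timestamp", "operation").show()
```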
Delta Lake offers:
1. ACID transactions on Spark: Serializable isolation levels ensure that readers never see inconsistent data.
2. Scalable metadata handling: Leverages Spark's distributed processing power to handle all the metadata for petabyte-scale tables with billions of files with ease.
3. Schema enforcement: Automatically handles schema variations to prevent bad records from being inserted during ingestion (a sketch follows after this list).
4. Time travel: Data versioning enables rollbacks, full historical audit trails, and reproducible machine learning experiments (see the time travel sketch below).
5. Updates and deletes: Supports merge, update, and delete operations to enable complex use cases like change data capture, slowly changing dimension (SCD) operations, streaming updates, and so on (see the merge sketch below).
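To illustrate schema enforcement, here is a minimal sketch, continuing with the example /tmp/delta/events table from earlier (which has a single BIGINT column named id). A write whose schema does not match the table is rejected instead of silently corrupting the data:

```python
# A matching schema appends cleanly.
good = spark.range(100, 110)
good.write.format("delta").mode("append").save("/tmp/delta/events")

# A mismatched schema (different column name) is rejected at write time
# with a schema-mismatch error instead of polluting the table.
bad = spark.range(5).withColumnRenamed("id", "user_id")
try:
    bad.write.format("delta").mode("append").save("/tmp/delta/events")
except Exception as e:
    print(f"Write rejected by schema enforcement: {e}")
```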
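Time travel is exposed through read options. A minimal sketch against the same example table; the timestamp shown is only illustrative and must fall within the table's commit history:

```python
# Read the table as of an earlier commit, by version number
# recorded in the transaction log.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")
print(v0.count())

# Timestamps work as well (example value -- use one within your table's history).
old = (
    spark.read.format("delta")
    .option("timestampAsOf", "2024-01-01 00:00:00")
    .load("/tmp/delta/events")
)
```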
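Updates, deletes, and merges are available through the DeltaTable API. Here is a sketch of a merge-based upsert; the table path and id column are carried over from the earlier examples:

```python
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/tmp/delta/events")
updates = spark.range(95, 105)

# Upsert: rows whose id already exists are updated, new ids are inserted.
(
    target.alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# Point deletes and updates use the same API.
target.delete("id < 10")
target.update(condition="id = 42", set={"id": "4200"})
```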
I will demonstrate a few of these use cases in my next post.