Monday, April 13, 2020

What is Delta Lake?

Delta Lake is an open-source storage layer that brings reliability to data lake storage. It provides ACID transactions and scalable metadata handling, and it unifies streaming and batch data processing. Delta Lake utilizes existing data lake storage (Azure, AWS) and is fully compatible with the Apache Spark APIs. Delta Lake sits on top of Apache Spark; together, the format and the compute layer simplify building big data pipelines and increase their overall efficiency.
Delta Lake stores your data as versioned Parquet files in the storage layer. Alongside those versions, it maintains a transaction log that records every commit made to the table or its files, which is what enables ACID transactions.
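
As a minimal sketch of what that looks like in practice (assuming Spark was launched with the Delta package; the /tmp/delta/demo path is just an illustration):

# Assumes Spark was started with the Delta package on the classpath, e.g.
#   pyspark --packages io.delta:delta-core_2.11:0.6.1
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-demo").getOrCreate()

# Writing in "delta" format produces Parquet data files plus a _delta_log/
# directory of commit files -- the transaction log described above.
df = spark.range(0, 5)
df.write.format("delta").mode("overwrite").save("/tmp/delta/demo")

# The table reads back like any other Spark data source.
spark.read.format("delta").load("/tmp/delta/demo").show()
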
Delta Lake offers:

1.    ACID transactions on Spark: Serializable isolation levels ensure that readers never see inconsistent data.
2.    Scalable metadata handling: Leverages Spark's distributed processing power to handle all the metadata for petabyte-scale tables with billions of files with ease.
3.    Schema enforcement: Automatically handles schema variations to prevent the insertion of bad records during ingestion (first sketch after this list).
4.    Time travel: Data versioning enables rollbacks, full historical audit trails, and reproducible machine learning experiments (second sketch after this list).
5.    Updates and deletes: Supports merge, update, and delete operations to enable complex use cases like change data capture (CDC), slowly changing dimension (SCD) operations, streaming upserts, and so on (third sketch after this list).
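
To make schema enforcement concrete, here is a minimal sketch using the same Delta-enabled session as above; the /tmp/delta/events path and column names are hypothetical:

# Create a small table with columns (id, name).
base = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])
base.write.format("delta").mode("overwrite").save("/tmp/delta/events")

extra = spark.createDataFrame([(3, "c", "2020-04-13")],
                              ["id", "name", "event_date"])

# Appending this as-is raises an AnalysisException: the extra column
# violates the table's schema, so the bad record never lands.
#   extra.write.format("delta").mode("append").save("/tmp/delta/events")

# Opting in to schema evolution lets the table absorb the new column instead.
extra.write.format("delta").mode("append") \
    .option("mergeSchema", "true").save("/tmp/delta/events")
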
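Time travel is just a read option; a sketch against the same table (version 0 is the initial write above):

# Read the table as of an earlier commit; "timestampAsOf" works the same
# way with a timestamp string instead of a version number.
v0 = spark.read.format("delta") \
    .option("versionAsOf", 0).load("/tmp/delta/events")
v0.show()
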
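And a sketch of the update, delete, and merge operations via the delta.tables Python API; the conditions and values here are illustrative:

from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/tmp/delta/events")

# Point updates and deletes directly against the Delta table.
target.update(condition="id = 1", set={"name": "'renamed'"})
target.delete(condition="id = 2")

# Upsert (merge) a source DataFrame -- the building block for CDC and SCD flows.
updates = spark.createDataFrame(
    [(1, "newest", "2020-04-14"), (4, "d", "2020-04-14")],
    ["id", "name", "event_date"])
(target.alias("t")
    .merge(updates.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
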

I will demonstrate a few use cases in my next post.
