Monday, April 13, 2020

Create Delta Table from JSON File

Requirement:
                   
                     I receive transaction files from a vendor frequently. The vendor provides the table header file and the data file separately. Our intention is to create the table based on the header file and then load the data using the data file.

Code Work Flow:

                     Here are the steps involved:

Step 1: Read the header JSON file and generate the DDL
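A minimal PySpark sketch of this step, assuming the header file holds one record per column with column_name and data_type fields (your vendor's header layout may differ), and that spark is the SparkSession a Databricks notebook already provides; the path is a placeholder:

# Read the vendor header file (assumed layout: one record per column
# with "column_name" and "data_type" fields) and collect it to the driver.
header_df = spark.read.option("multiLine", "true").json("/mnt/vendor/txn_header.json")
columns = [(r["column_name"], r["data_type"]) for r in header_df.collect()]

# Build the "col1 type1, col2 type2, ..." fragment used later in the CREATE TABLE DDL
ddl_columns = ", ".join("{} {}".format(name, dtype) for name, dtype in columns)
print(ddl_columns)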


Step 2: Read the data file
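Continuing the sketch, and assuming the data file is comma-delimited text without a header row (the schema travels in the separate header file) and carries the same columns, in the same order, as the header file; the path is again a placeholder:

# Read the vendor data file; every column comes in as a string at this point.
data_df = (spark.read
           .option("header", "false")
           .option("delimiter", ",")
           .csv("/mnt/vendor/txn_data.csv"))

# Replace the default _c0, _c1, ... names with the column names from Step 1
data_df = data_df.toDF(*[name for name, _ in columns])
data_df.show(5)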



Step 3: Create temporary view based on the data file created in step 2
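Registering the DataFrame from Step 2 as a temporary view so it can be queried with SQL; the view name is just an example:

# Expose the staged data to Spark SQL
data_df.createOrReplaceTempView("vendor_txn_stage")
spark.sql("SELECT * FROM vendor_txn_stage LIMIT 5").show()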


Step 4: Create table DDL
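A sketch of building the CREATE TABLE statement from the column list generated in Step 1; the database.table name and the LOCATION path are placeholders, and the database is assumed to exist already:

# Build the Delta table DDL from the header-driven column list
table_name = "vendor_db.transactions"
create_ddl = """CREATE TABLE IF NOT EXISTS {table} ({cols})
USING DELTA
LOCATION '/mnt/delta/vendor/transactions'""".format(table=table_name, cols=ddl_columns)
print(create_ddl)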



Step 5: Create Table and Load the data
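Executing the DDL and loading the table from the temporary view; Spark is expected to cast the string columns read from the data file to the declared types where it can:

# Create the Delta table, then load it from the staged view
spark.sql(create_ddl)
spark.sql("INSERT INTO {table} SELECT * FROM vendor_txn_stage".format(table=table_name))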



Step 6: Result
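A quick way to verify the load, using the same placeholder table name:

# Confirm the row count and peek at the loaded data
spark.sql("SELECT COUNT(*) AS row_count FROM vendor_db.transactions").show()
spark.sql("SELECT * FROM vendor_db.transactions").show(10)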



What is Delta Lake?

Delta Lake is an open source storage layer that brings reliability to data lake storage. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake utilizes existing data lake storage (Azure, AWS) and is fully compatible with Apache Spark APIs. Delta Lake sits on top of Apache Spark. The format and the compute layer help to simplify building big data pipelines and increase the overall efficiency of your pipelines.
Delta Lake uses versioned Parquet files to store your data in the storage layer. Apart from the versions, it also stores a transaction log to keep track of all the commits made to the table or files, which is what provides the ACID transactions.
Delta Lake offers:

1.    ACID transactions on Spark: Serializable isolation levels ensure that readers never see inconsistent data.
2.    Scalable metadata handling: Leverages Spark’s distributed processing power to easily handle all the metadata for petabyte-scale tables with billions of files.
3.    Schema enforcement: Automatically handles schema variations to prevent insertion of bad records during ingestion.
4.    Time travel: Data versioning enables rollbacks, full historical audit trails, and reproducible machine learning experiments.
5.    Updates and deletes: Supports merge, update and delete operations to enable complex use cases like change data capture, slowly changing dimension (SCD) operations, streaming updates, and so on (a short sketch follows this list).
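As a quick illustration of the update and time travel features, a hedged PySpark sketch against a hypothetical Delta table at /mnt/delta/events (the path, column name and values are only examples, and it assumes a Databricks/Delta runtime where the delta Python module is available):

from delta.tables import DeltaTable

# Update rows in place on an existing Delta table
events = DeltaTable.forPath(spark, "/mnt/delta/events")
events.update(condition="event_type = 'clck'", set={"event_type": "'click'"})

# Time travel: read the table as it looked at an earlier version
old_df = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/delta/events")
old_df.show(5)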

             I will demonstrate a few use cases in my next post.

Sunday, April 12, 2020

HDI Ranger Policy Automation

Requirement:

                    I have many HDI clusters (Spark, LLAP) and I want to create a Ranger policy automatically whenever a new database is created.

Solution:

                   To achieve this requirement I have used the REST API. A RESTful API is an application program interface (API) that uses HTTP requests to GET, PUT, POST and DELETE data.
                  The Ranger REST API reference for creating a policy is service/public/v2/api/policy.

API Name: Create Policy
Request Type: POST
Request URL: service/public/v2/api/policy
Request Params:
{
"policyName":"<<PolicyName>>",
"resourceName":"/*/*",
"description":"",
"repositoryName":"HiveRepositoryName",
"repositoryType":"hive",
"permMapList":[{"userList":[],"groupList":["groupname"],"permList":["select","Read"]}],
"tables":"*",
"columns":"*",
"databases":"<<PolicyName>>",
"tableType":"Inclusion",
"columnType":"Inclusion",
"isEnabled":true,
"isRecursive":false,
"isAuditEnabled":true,
"version":"1",
"replacePerm":false
}

Sample Code:

                     I have created a shell script to call the curl command. Here is the sample code:

API Name: Create Policy
Request Type: POST
Request URL: Labhdi-int.azurehdinsight.net/ranger/service/public/api/policy/
CURL Command:
curl -iv -u username:password \
 -H "Content-Type: application/json" \
 -d '{
"policyName":"<<PolicyName>>",
"resourceName":"/*/*",
"description":"",
"repositoryName":"HiveRepositoryName",
"repositoryType":"hive",
"permMapList":[{"userList":[],"groupList":["g_az_devadls_data_raw_1crussia_readonly"],"permList":["select","Read"]}],
"tables":"*",
"columns":"*",
"databases":"<<PolicyName>>",
"tableType":"Inclusion",
"columnType":"Inclusion",
"isEnabled":true,
"isRecursive":false,
"isAuditEnabled":true,
"version":"1",
"replacePerm":false
}' \
 -X POST https://labhdi-int.azurehdinsight.net/ranger/service/public/api/policy/
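For the automation piece, a hedged Python sketch of the same call using the requests library; the URL, credentials, group name and the list of newly created databases are placeholders you would feed from your own detection logic (for example, by comparing SHOW DATABASES output against the existing Ranger policies):

import requests

RANGER_URL = "https://labhdi-int.azurehdinsight.net/ranger/service/public/api/policy/"
AUTH = ("username", "password")   # replace with Ranger admin credentials

def create_hive_policy(database):
    """POST a read-only Hive policy for a newly created database."""
    payload = {
        "policyName": database,
        "resourceName": "/*/*",
        "description": "",
        "repositoryName": "HiveRepositoryName",
        "repositoryType": "hive",
        "permMapList": [{"userList": [], "groupList": ["groupname"],
                         "permList": ["select", "Read"]}],
        "tables": "*",
        "columns": "*",
        "databases": database,
        "tableType": "Inclusion",
        "columnType": "Inclusion",
        "isEnabled": True,
        "isRecursive": False,
        "isAuditEnabled": True,
        "version": "1",
        "replacePerm": False,
    }
    response = requests.post(RANGER_URL, json=payload, auth=AUTH)
    response.raise_for_status()
    return response.json()

# Example: create a policy for each newly detected database
for db in ["new_database_1"]:
    print(create_hive_policy(db))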



