Monday, April 13, 2020

Create Delta Table from JSON File

Requirement:
                   
                     I receive transaction files from a vendor frequently. The vendor provides the table header file and the data file separately. Our intention is to create the table based on the header file and then load the data using the data file.

Code Work Flow:

                     Here are the steps involved:

Step 1: Read the header JSON file and generate the DDL
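A minimal PySpark sketch of this step, assuming the header file holds one record per column with column_name and data_type fields (your vendor's header layout may differ), and that spark is the SparkSession a Databricks notebook already provides; the path is a placeholder:

# Read the vendor header file (assumed layout: one record per column
# with "column_name" and "data_type" fields) and collect it to the driver.
header_df = spark.read.option("multiLine", "true").json("/mnt/vendor/txn_header.json")
columns = [(r["column_name"], r["data_type"]) for r in header_df.collect()]

# Build the "col1 type1, col2 type2, ..." fragment used later in the CREATE TABLE DDL
ddl_columns = ", ".join("{} {}".format(name, dtype) for name, dtype in columns)
print(ddl_columns)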


Step 2: Read the data file
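Continuing the sketch, and assuming the data file is comma-delimited text without a header row (the schema travels in the separate header file) and carries the same columns, in the same order, as the header file; the path is again a placeholder:

# Read the vendor data file; every column comes in as a string at this point.
data_df = (spark.read
           .option("header", "false")
           .option("delimiter", ",")
           .csv("/mnt/vendor/txn_data.csv"))

# Replace the default _c0, _c1, ... names with the column names from Step 1
data_df = data_df.toDF(*[name for name, _ in columns])
data_df.show(5)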



Step 3: Create temporary view based on the data file created in step 2
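Registering the DataFrame from Step 2 as a temporary view so it can be queried with SQL; the view name is just an example:

# Expose the staged data to Spark SQL
data_df.createOrReplaceTempView("vendor_txn_stage")
spark.sql("SELECT * FROM vendor_txn_stage LIMIT 5").show()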


Step 4: Create table DDL
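A sketch of building the CREATE TABLE statement from the column list generated in Step 1; the database.table name and the LOCATION path are placeholders, and the database is assumed to exist already:

# Build the Delta table DDL from the header-driven column list
table_name = "vendor_db.transactions"
create_ddl = """CREATE TABLE IF NOT EXISTS {table} ({cols})
USING DELTA
LOCATION '/mnt/delta/vendor/transactions'""".format(table=table_name, cols=ddl_columns)
print(create_ddl)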



Step 5: Create Table and Load the data
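Executing the DDL and loading the table from the temporary view; Spark is expected to cast the string columns read from the data file to the declared types where it can:

# Create the Delta table, then load it from the staged view
spark.sql(create_ddl)
spark.sql("INSERT INTO {table} SELECT * FROM vendor_txn_stage".format(table=table_name))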



Step 6: Result
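A quick way to verify the load, using the same placeholder table name:

# Confirm the row count and peek at the loaded data
spark.sql("SELECT COUNT(*) AS row_count FROM vendor_db.transactions").show()
spark.sql("SELECT * FROM vendor_db.transactions").show(10)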



What is Delta Lake?

Delta Lake is an open source storage layer that brings reliability to data lake storage. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake utilizes existing data lake storage (Azure, AWS) and is fully compatible with Apache Spark APIs. Delta Lake sits on top of Apache Spark. The format and the compute layer help to simplify building big data pipelines and increase the overall efficiency of your pipelines.
Delta Lake uses versioned Parquet files to store your data in the storage layer. Apart from the versions, it also stores a transaction log to keep track of all the commits made to the table or files, which is what provides the ACID transactions.
Delta Lake offers:

1.    ACID transactions on Spark: Serializable isolation levels ensure that readers never see inconsistent data.
2.    Scalable metadata handling: Leverages Spark’s distributed processing power to easily handle all the metadata for petabyte-scale tables with billions of files.
3.    Schema enforcement: Automatically handles schema variations to prevent insertion of bad records during ingestion.
4.    Time travel: Data versioning enables rollbacks, full historical audit trails, and reproducible machine learning experiments.
5.    Updates and deletes: Supports merge, update and delete operations to enable complex use cases like change data capture, slowly changing dimension (SCD) operations, streaming updates, and so on (a short sketch follows this list).
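As a quick illustration of the update and time travel features, a hedged PySpark sketch against a hypothetical Delta table at /mnt/delta/events (the path, column name and values are only examples, and it assumes a Databricks/Delta runtime where the delta Python module is available):

from delta.tables import DeltaTable

# Update rows in place on an existing Delta table
events = DeltaTable.forPath(spark, "/mnt/delta/events")
events.update(condition="event_type = 'clck'", set={"event_type": "'click'"})

# Time travel: read the table as it looked at an earlier version
old_df = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/delta/events")
old_df.show(5)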

             I will demonstrate a few use cases in my next post.

Sunday, April 12, 2020

HDI Ranger Policy Automation

Requirement:

                    I have many HDI clusters (Spark, LLAP) and I want to create a Ranger policy automatically whenever a new database is created.

Solution:

                   To achieve this requirement I have used the REST API. A RESTful API is an application program interface (API) that uses HTTP requests to GET, PUT, POST and DELETE data.
                  The Ranger REST API reference for creating a policy is service/public/v2/api/policy.

API Name: Create Policy
Request Type: POST
Request URL: service/public/v2/api/policy
Request Params:
{
"policyName":"<<PolicyName>>",
"resourceName":"/*/*",
"description":"",
"repositoryName":"HiveRepositoryName",
"repositoryType":"hive",
"permMapList":[{"userList":[],"groupList":["groupname"],"permList":["select","Read"]}],
"tables":"*",
"columns":"*",
"databases":"<<PolicyName>>",
"tableType":"Inclusion",
"columnType":"Inclusion",
"isEnabled":true,
"isRecursive":false,
"isAuditEnabled":true,
"version":"1",
"replacePerm":false
}

Sample Code:

                     I have created a shell script to call the curl command. Here is the sample code:

API Name: Create Policy
Request Type: POST
Request URL: Labhdi-int.azurehdinsight.net/ranger/service/public/api/policy/
CURL Command:
curl -iv -u username:password \
 -H "Content-Type: application/json" \
 -d '{
"policyName":"<<PolicyName>>",
"resourceName":"/*/*",
"description":"",
"repositoryName":"HiveRepositoryName",
"repositoryType":"hive",
"permMapList":[{"userList":[],"groupList":["g_az_devadls_data_raw_1crussia_readonly"],"permList":["select","Read"]}],
"tables":"*",
"columns":"*",
"databases":"<<PolicyName>>",
"tableType":"Inclusion",
"columnType":"Inclusion",
"isEnabled":true,
"isRecursive":false,
"isAuditEnabled":true,
"version":"1",
"replacePerm":false
}' \
 -X POST https://labhdi-int.azurehdinsight.net/ranger/service/public/api/policy/
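For the automation piece, a hedged Python sketch of the same call using the requests library; the URL, credentials, group name and the list of newly created databases are placeholders you would feed from your own detection logic (for example, by comparing SHOW DATABASES output against the existing Ranger policies):

import requests

RANGER_URL = "https://labhdi-int.azurehdinsight.net/ranger/service/public/api/policy/"
AUTH = ("username", "password")   # replace with Ranger admin credentials

def create_hive_policy(database):
    """POST a read-only Hive policy for a newly created database."""
    payload = {
        "policyName": database,
        "resourceName": "/*/*",
        "description": "",
        "repositoryName": "HiveRepositoryName",
        "repositoryType": "hive",
        "permMapList": [{"userList": [], "groupList": ["groupname"],
                         "permList": ["select", "Read"]}],
        "tables": "*",
        "columns": "*",
        "databases": database,
        "tableType": "Inclusion",
        "columnType": "Inclusion",
        "isEnabled": True,
        "isRecursive": False,
        "isAuditEnabled": True,
        "version": "1",
        "replacePerm": False,
    }
    response = requests.post(RANGER_URL, json=payload, auth=AUTH)
    response.raise_for_status()
    return response.json()

# Example: create a policy for each newly detected database
for db in ["new_database_1"]:
    print(create_hive_policy(db))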



