
ETL on Azure: Databricks vs Data Lake Analytics

Feb 11, 2019


Overview

Data Extraction, Transformation and Loading (ETL) is fundamental to the success of enterprise data solutions. The process must be reliable and efficient, with the ability to scale with the enterprise. Microsoft offers numerous ETL tools, but in Azure, Databricks and Data Lake Analytics (ADLA) stand out as the popular choices for enterprises looking for scalable ETL in the cloud.

This blog walks through the differences between ADLA and Databricks, where each fits, and how to decide which one suits your type of data and business.
 

Working through a simple Scenario

Here we consider a typical ETL scenario built on two of the most popular data sources that organizations use: Azure SQL DB and Data Lake. Unprocessed data in the Azure SQL DB needs to be transformed and written to the Azure Data Lake Store repository.


A look at Sample Data and its ETL requirements:

Data Source: Azure SQL Database


The following steps must be performed on the data so that it lands in the Data Lake Store in the right format:

  1. Round the "CostPrice" and "SellingPrice" columns to two digits after the decimal point
  2. Convert the "InventoryDate" and "PurchasedDate" columns from "DateTime" to "Date" format
  3. Introduce a new column providing the "BenchTime" of each transaction (the date difference)
  4. Define a column that calculates the profit earned as a percentage
  5. Left-pad the "ProductID" column with zeroes to make it three digits (for example: 005, 014)
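To make the five requirements concrete, here is a minimal plain-Python sketch of the same logic applied to one hypothetical inventory record. Column names mirror the sample data; the values and date formats are assumptions for illustration only, not taken from the demo dataset:

```python
from datetime import datetime

def transform(record):
    """Apply the five ETL steps described above to a single record."""
    purchased = datetime.strptime(record["PurchasedDate"], "%Y-%m-%d %H:%M:%S")
    inventory = datetime.strptime(record["InventoryDate"], "%Y-%m-%d %H:%M:%S")
    cost = float(record["CostPrice"])
    selling = float(record["SellingPrice"])
    return {
        # 1. Two digits after the decimal point
        "CostPrice": f"{cost:.2f}",
        "SellingPrice": f"{selling:.2f}",
        # 2. DateTime -> Date
        "InventoryDate": inventory.date().isoformat(),
        "PurchasedDate": purchased.date().isoformat(),
        # 3. BenchTime: days between purchase and inventory
        "BenchTime": (inventory.date() - purchased.date()).days,
        # 4. Profit earned as a percentage
        "ProfitPercent": round((selling - cost) / cost * 100, 2),
        # 5. Left-pad ProductID to three digits
        "ProductID": str(record["ProductID"]).zfill(3),
    }

row = {"ProductID": 5, "CostPrice": 10, "SellingPrice": 12.5,
       "PurchasedDate": "2019-01-01 08:30:00",
       "InventoryDate": "2019-01-11 17:00:00"}
print(transform(row))
# e.g. ProductID becomes "005", BenchTime becomes 10, ProfitPercent becomes 25.0
```

In the demos below, the same transformations are expressed in U-SQL (ADLA) and Scala (Databricks) respectively.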

Let’s take a detailed look at how the above operations can be performed in both Data Lake Analytics and Azure Databricks.
 

Azure Data Lake Analytics (ADLA)

Data Lake Analytics is a distributed computing resource that uses the U-SQL language to carry out complex transformations and load data into Azure and non-Azure databases and file systems. It combines the power of distributed processing with the ease of a SQL-like language, which makes it a natural choice for ad-hoc data processing.
 

Demo with Azure Data Lake Analytics:

Transformation:


U-SQL job:


Transformed Data on Azure Data Lake Store:


Configuration: 5 Analytics Units (AUs)

Language Used: U-SQL

Cost:


Overall Time: 1 Minute 07 seconds

What we liked:

  • Distributed processing keeps ETL throughput high
  • Seamless transformations
  • Low read/write latency
  • Costs based on jobs, not on the size of the data

Limitations:

  • Job compilation errors are time-consuming to resolve
  • Very limited library modules

Preferred Use Cases:

  • Large volumes of data where conversion and loading are the only actions required
  • Processing data from relational databases into Azure
  • Repetitive loads with no intermediary actions required

Azure Databricks

Azure Databricks is a notebook-based resource for setting up high-performance clusters that compute using Spark’s in-memory architecture. Users can choose from a wide variety of programming languages and use their favorite libraries to perform transformations, data type conversions and modeling. Databricks also offers extensive API connectivity, enabling connections to a wide range of data sources including SQL, NoSQL and file systems.
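The demo below connects a Databricks notebook to Azure SQL DB over JDBC. As a hedged sketch of what that connection involves (the demo itself is in Scala), here is how the JDBC URL and connection properties could be assembled; the server, database and user names are placeholders, not values from the demo:

```python
def sql_db_jdbc_config(server, database, user, password):
    """Build the JDBC URL and properties for an Azure SQL DB connection."""
    url = (f"jdbc:sqlserver://{server}.database.windows.net:1433;"
           f"database={database};encrypt=true;"
           f"hostNameInCertificate=*.database.windows.net;loginTimeout=30;")
    props = {"user": user,
             "password": password,
             "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"}
    return url, props

url, props = sql_db_jdbc_config("myserver", "inventorydb", "etl_user", "***")
print(url)
# Inside a notebook you would then read the table with Spark, e.g.:
# df = spark.read.jdbc(url=url, table="dbo.Inventory", properties=props)
```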
 

Demo with Azure Databricks:

Connecting to Azure SQL DB:


Transformed Data on Azure Databricks:


Configuration: Standard_F4s (1 driver node and 2 worker nodes with a total of 8 GB memory)

Language Used: Scala

Cost:


Overall Time: 5 Minutes 34 seconds

What we liked:

  • The Spark framework driving big data analytics
  • User-friendly cell-based data processing
  • Choice of language for developers
  • Extensive libraries available for the chosen scripting language
  • Autoscaling

Limitation:

  • Cluster start-up time reduces overall time efficiency

Preferred Use Cases:

  • Processes where intermediary analysis of data is required
  • ETL which requires more visibility during modeling and transformation of data

Stacking up Azure Data Lake Analytics against Databricks:

| Feature | Azure Data Lake Analytics | Azure Databricks |
| --- | --- | --- |
| Cost control | Pay-as-you-go | Manual/auto-terminate clusters |
| Development tool | IDE + SDK based (U-SQL supported) | Notebook based |
| Payment | Per job | Cluster properties, time duration and workload |
| Scaling | Auto-scaling based on data (dynamic) | Auto-scaling for jobs running on the cluster (Runtime 3.4 and above) |
| Data storage | Internal database available | DBFS (Databricks File System); direct access to storage |
| Manage usage | Portal (preferred); Azure SDKs: Python, Java, Node.js, .NET | Spark framework: Scala, Java, R and Python; Spark SQL |
| Monitoring jobs | Azure Portal, Visual Studio | Within Databricks |
| Managing the resource | Azure Portal, Azure CLI, Azure PowerShell, Visual Studio, Visual Studio Code | Azure Portal, within Databricks, Databricks CLI, Visual Studio Code |
| Connectivity to Data Lake Store | Directly, using the Data Lake Store URL path | 1. Register a web app/API (service principal); 2. Associate the service principal with the ADLS storage path; 3. Use the Application ID, Key and Tenant ID (Directory ID) to connect |
| Connectivity to the resource | Azure Portal, Azure CLI, Azure PowerShell, Visual Studio, Visual Studio Code | Azure Portal, Excel, SQL, ODBC |
| Commonly used data sources | Azure Data Lake Store, Azure Blob Storage, Azure SQL DB, Azure SQL DW | Azure SQL DB (JDBC), Azure SQL DW, Azure Data Lake Store, Azure Blob Storage, Azure Cosmos DB (Spark connector), Event Hubs (Spark connector), Hive tables, Parquet/Avro/CSV/JSON |
| Functionalities | Scheduling jobs; inclusion in Data Factory pipelines (U-SQL scripts) | Scheduling jobs; inclusion in Data Factory pipelines (Databricks notebook) |
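The three-step service-principal flow for connecting Databricks to the Data Lake Store can be sketched as a small configuration helper. This assumes ADLS Gen1 (the generation current when this was written) and its `dfs.adls.oauth2.*` Spark configuration keys; the IDs are placeholders you would replace with your own service principal’s values:

```python
def adls_oauth_conf(app_id, app_key, tenant_id):
    """Map the Application ID, Key and Tenant (Directory) ID onto the
    Spark configuration keys used for ADLS Gen1 OAuth access."""
    return {
        "dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
        "dfs.adls.oauth2.client.id": app_id,
        "dfs.adls.oauth2.credential": app_key,
        "dfs.adls.oauth2.refresh.url":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }

conf = adls_oauth_conf("<application-id>", "<key>", "<tenant-id>")
# In a Databricks notebook you would then apply and use the configuration:
#   for k, v in conf.items():
#       spark.conf.set(k, v)
#   df = spark.read.csv("adl://<store-name>.azuredatalakestore.net/<path>")
```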

Conclusion

From our simple example, we found that Data Lake Analytics is more efficient at straight transform-and-load operations, thanks to its distributed runtime processing. Databricks, on the other hand, offers rich visibility through its step-by-step, cell-based process, which can lead to more accurate transformations. Efficiency, accuracy and scalability should all be weighed when choosing the right tool for your ETL process.

Let us know your thoughts and feedback about working with these two very different resources that deliver ETL at a comparable level.

References: pricing obtained from the Azure pricing website.

Learn more about Visual BI’s Microsoft Azure offerings here.

