Extract, Transform, Load (ETL) is fundamental to the success of enterprise data solutions. The process must be reliable and efficient, with the ability to scale with the enterprise. Microsoft offers numerous tools for ETL; in Azure, however, Databricks and Data Lake Analytics (ADLA) stand out as popular choices for enterprises looking for scalable ETL in the cloud.
This blog explains the differences between ADLA and Databricks, where each can be used, and how to decide which one best suits your data and business.
Working through a simple Scenario
Here we consider a typical ETL scenario. We have taken two of the most popular data sources that organizations use: Azure SQL DB and Data Lake. We have unprocessed data in an Azure SQL DB that needs to be transformed and written to the Azure Data Lake Store repository.
A look at Sample Data and its ETL requirements:
Data Source: Azure SQL Database
The following steps need to be performed on the data so that it is in the right format when loaded into the Data Lake Store:
- Format the “CostPrice” and “SellingPrice” columns to two digits after the decimal point
- Convert the “InventoryDate” and “PurchasedDate” columns from “DateTime” to “Date” format
- Introduce a new “BenchTime” column giving the date difference for each transaction
- Add a column that calculates the profit earned as a percentage
- Left-pad the “ProductID” column with zeroes to make it three digits (Example: 005, 014)
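As a rough illustration, the five rules above can be expressed as plain Scala functions. This is a hypothetical sketch only (the demos below implement these in U-SQL and Spark respectively), and the row and column handling is omitted:

```scala
import java.time.{LocalDate, LocalDateTime}
import java.time.temporal.ChronoUnit

object EtlRules {
  // Rule 1: format prices to two digits after the decimal point
  def formatPrice(v: Double): BigDecimal =
    BigDecimal(v).setScale(2, BigDecimal.RoundingMode.HALF_UP)

  // Rule 2: truncate a DateTime to a Date (drops the time component)
  def toDateOnly(ts: LocalDateTime): LocalDate = ts.toLocalDate

  // Rule 3: "BenchTime" as the date difference between purchase and inventory
  def benchTimeDays(purchased: LocalDate, inventory: LocalDate): Long =
    ChronoUnit.DAYS.between(purchased, inventory)

  // Rule 4: profit earned as a percentage of cost price
  def profitPercent(cost: Double, selling: Double): BigDecimal =
    BigDecimal((selling - cost) / cost * 100).setScale(2, BigDecimal.RoundingMode.HALF_UP)

  // Rule 5: left-pad ProductID with zeroes to three digits
  def padProductId(id: Int): String = f"$id%03d"
}
```

In the Spark demo these would map onto DataFrame column expressions (e.g. `format_number`, `to_date`, `datediff`, `lpad`), but the standalone functions make the intent of each rule explicit.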
Let’s take a detailed look at how the above operations can be performed in both Data Lake Analytics and Azure Databricks.
Azure Data Lake Analytics (ADLA)
Data Lake Analytics is a distributed computing resource that uses the U-SQL language to carry out complex transformations and load the data into Azure and non-Azure databases and file systems. It combines the power of distributed processing with the ease of a SQL-like language, which makes it a good choice for ad-hoc data processing.
Demo with Azure Data Lake Analytics:
Transformed Data on Azure Data Lake Store:
Configuration: 5 Analytics Units (AUs)
Language Used: U-SQL
Overall Time: 1 Minute 07 seconds
What we liked:
- Distributed processing keeps ETL performance high
- Seamless transformations
- Low read/write latency
- Costs based on Jobs, not on the size of data
What we didn’t like:
- Job compilation errors are time-consuming to debug
- Very limited library modules
Preferred Use Cases:
- Large amounts of data where conversion and loading are the only actions required
- Processing data from relational databases into Azure
- Repetitive loads where no intermediary action is required
Azure Databricks
Azure Databricks is a notebook-based resource that allows setting up high-performance clusters which compute using an in-memory architecture. Users can choose from a wide variety of programming languages and use their favorite libraries to perform transformations, data type conversions and modeling. Additionally, Databricks comes with extensive API connectivity options, enabling connections to data sources that include SQL and NoSQL databases, file systems and more.
Demo with Azure Databricks:
Connecting to Azure SQL DB:
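The connection from a Databricks notebook is typically made over JDBC. A minimal sketch is shown below; the server, database, credential values and the `dbo.Inventory` table name are placeholders, not details from the original demo:

```scala
import java.util.Properties

// Placeholder connection details -- replace with your own
val server   = "yourserver.database.windows.net"
val database = "yourdatabase"

// JDBC URL for Azure SQL Database (default port 1433)
val jdbcUrl = s"jdbc:sqlserver://$server:1433;database=$database;encrypt=true"

val connectionProperties = new Properties()
connectionProperties.put("user", "yourusername")
connectionProperties.put("password", "yourpassword")
connectionProperties.put("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")

// Inside a Databricks notebook, the table can then be read into a DataFrame:
// val inventoryDf = spark.read.jdbc(jdbcUrl, "dbo.Inventory", connectionProperties)
```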
Transformed Data on Azure Databricks:
Configuration: Standard_F4s (1 Main Node and 2 Worker Nodes with a total of 8 GB Memory)
Language Used: Scala
Overall Time: 5 Minutes 34 seconds
What we liked:
- Spark Framework driving Big Data Analytics
- User-friendly “Cell-based data processing”
- Language choice for developers
- A vast range of libraries available, based on the chosen scripting language
- Cluster time efficiency
Preferred Use Cases:
- Processes where intermediary analysis of data is required
- ETL which requires more visibility during modeling and transformation of data
Stacking up Azure Data Lake Analytics against Databricks:
| Feature | Azure Data Lake Analytics | Azure Databricks |
|---|---|---|
| Cost Control | Pay-as-you-go | Manual/auto-terminate clusters |
| Development Tool | IDE + SDK based (U-SQL supported) | Notebook based |
| Payment | Per job | Cluster properties, time duration and workload |
| Scaling | Auto-scaling based on data (dynamic) | Auto-scaling for jobs running on a cluster (Runtime 3.4 & above) |
| Data Storage | Internal database available | DBFS (Databricks File System); direct access (storage) |
| Manage Usage | Portal (preferred); Azure SDK: Python, Java, Node.js, .NET | Spark framework: Scala, Java, R and Python; Spark SQL |
| Monitoring Jobs | Azure Portal; Visual Studio | Within Databricks |
| Managing Resources | Azure Portal; Azure CLI; Azure PowerShell; Visual Studio; Visual Studio Code | Azure Portal; within Databricks; Databricks CLI; Visual Studio Code |
| Connectivity to Data Lake Store | Directly, using the Data Lake Store URL path | 1. Register a web app/API (service principal) 2. Associate the service principal with the ADLS storage path 3. Use the Application ID, Key and Tenant ID (Directory ID) to connect to the Data Lake Store |
| Connectivity to Resource | Azure Portal; Azure CLI; Azure PowerShell; Visual Studio; Visual Studio Code | Azure Portal; Excel; SQL; ODBC |
| Commonly Used Data Sources | Azure Data Lake Store; Azure Blob Storage; Azure SQL DB; Azure SQL DW | Azure SQL DB (JDBC); Azure SQL DW; Azure Data Lake Store; Azure Blob Storage; Azure Cosmos DB (Spark connector); Event Hubs (Spark connector); Hive tables; Parquet/Avro/CSV/JSON |
| Functionalities | Scheduling jobs; inclusion in Data Factory pipelines (U-SQL scripts) | Scheduling jobs; inclusion in Data Factory pipelines (Databricks notebook) |
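The three Databricks-to-ADLS connection steps mentioned above can be sketched as a Spark configuration. The application ID, key and tenant ID are hypothetical placeholders, and the keys shown are the legacy OAuth settings for ADLS Gen1 (verify against the current Databricks documentation for your runtime):

```scala
// Hypothetical service-principal credentials from the Azure AD app registration
val applicationId = "<application-id>"
val credential    = "<application-key>"
val tenantId      = "<tenant-id>"

// OAuth settings for ADLS Gen1 access, keyed by Hadoop configuration name
val adlsConf = Map(
  "dfs.adls.oauth2.access.token.provider.type" -> "ClientCredential",
  "dfs.adls.oauth2.client.id"                  -> applicationId,
  "dfs.adls.oauth2.credential"                 -> credential,
  "dfs.adls.oauth2.refresh.url"                -> s"https://login.microsoftonline.com/$tenantId/oauth2/token"
)

// In a notebook: adlsConf.foreach { case (k, v) => spark.conf.set(k, v) }
// then read/write using the store's adl:// URL, e.g.
//   spark.read.csv("adl://youraccount.azuredatalakestore.net/transformed/")
```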
From our simple example, we found that Data Lake Analytics is more efficient at pure transform-and-load operations, thanks to its distributed runtime processing. Databricks, on the other hand, offers richer visibility through its step-by-step, cell-based process, which helps produce more accurate transformations. Efficiency, accuracy and scalability should all be weighed when choosing the right tool for your ETL process.
Let us know your thoughts and feedback about working with these two very different resources that both perform ETL at scale.
References: Pricing obtained from the Azure Pricing website.
Learn more about Visual BI’s Microsoft Azure offerings here.