“Data lakehouse” is a term that has been trending recently, much as “data lake” was ten years ago. When the data lake was introduced, it solved a key problem: providing a central repository of data that is both read- and write-optimized. This allowed organizations to ingest data into the data lake first and decide what to do with it later, which in turn paved the way for the rise of ELT in data engineering.
The data lakehouse aims to solve the problem of data governance within the data lake. Until now, Azure Data Lake has been used to store data in all kinds of formats, while Azure SQL DB/DW provides analytics on cleaned, transformed data. Azure Databricks is used with Azure Data Lake for analytics on unstructured data, but this means going back and forth between different environments. Azure Synapse aims to eliminate the need for shuttling between these tools.
Azure Synapse provides a unified experience of analytics using a common language well known to data analysts/scientists – SQL.
Azure Synapse Studio provides two tools – Dedicated SQL Pool and Serverless SQL Pool. The former is used to work with an Azure SQL Data Warehouse, while the latter provides analytics directly on data in the data lake. Thus, depending on your needs, you can build either a physical or a logical data warehouse in Azure Synapse.
Serverless SQL Pool
A Serverless SQL pool is an auto-scaling compute environment that uses T-SQL to query the data lake directly, without the need for replication or staging. The Serverless SQL pool acts as the compute engine while the data lake serves as storage. It uses a pay-per-use model and is always available without any additional reservation cost (with a Dedicated SQL pool, pausing is the only way to cut costs when it is not in use). This reduces the overhead for developers/administrators of scaling the system to match usage.
Serverless SQL Pool uses distributed processing: a control node optimizes the query, and execution runs in parallel across the compute nodes. T-SQL is used for reading, transforming, and exporting data.
The version provided by Azure Synapse is an enhanced form of T-SQL that supports operations on semi-structured and unstructured data. It provides functions like JSON_VALUE to work with complex types such as JSON objects and arrays, supports wildcards for navigating folder hierarchies in the data lake, and can read multiple files/folders recursively.
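As an illustration, a serverless query over JSON files in the lake might look like the following sketch (the storage account, container, folder, and JSON property names are all hypothetical). Each line of the files is read as a single text column via OPENROWSET, and JSON_VALUE then extracts individual properties; the wildcard picks up every matching file under the folder:

```sql
-- Minimal sketch, assuming hypothetical JSON-lines files in ADLS Gen2.
-- FORMAT = 'CSV' with 0x0b delimiters is the common trick to read each
-- line of a JSON file as one NVARCHAR value.
SELECT
    JSON_VALUE(doc, '$.customer.id')  AS customer_id,
    JSON_VALUE(doc, '$.order.total')  AS order_total
FROM OPENROWSET(
        BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/orders/*.json',
        FORMAT = 'CSV',
        FIELDTERMINATOR = '0x0b',
        FIELDQUOTE = '0x0b'
    ) WITH (doc NVARCHAR(MAX)) AS rows;
```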
Comparison with other compute environments across Synapse
| Features | Serverless SQL Pool | Apache Spark | Dedicated SQL Pool |
|---|---|---|---|
| Purpose | Logical data warehouse, big data discovery, ad-hoc analysis | Big data discovery, advanced analytics (ML and AI) | Data warehouse |
| Cost | Billed per TB of data read | Billed per minute (prorated) while the cluster instance is on | Billed per hour |
| Languages supported | T-SQL | Spark SQL, R, Python, and Scala | T-SQL |
| Warm-up time | 2 seconds | 3–5 minutes | No warm-up time when unpaused |
| Auto-scaling | Auto-scalable by default | Autoscaling can be enabled on the cluster (between the minimum and maximum nodes set) | Data Warehousing Units must be changed to scale compute out or back |
| Data lake support | Can query the data lake directly | Can query the data lake directly | Can be queried using PolyBase, but the entire file is loaded into tempdb and then read |
| Spark tables support | Spark external tables in Azure Storage can be queried, even when the Spark pool is shut down | Can be created and accessed | Not available |
Use cases for Synapse Serverless SQL Pool
- Ad-hoc analysis & data discovery:
a. Used for analytical data exploration as an alternative to Apache Spark for those who are already familiar with T-SQL, without any cluster setup.
b. For a new data source ingested, ad-hoc queries can be run for data profiling and analysis to ascertain the transformations needed.
c. Exposing data to the business for initial analysis before the development of ELT pipelines.
d. For ad-hoc analysis of historical data stored in the data lake and comparison with current data in the data warehouse.
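The data-profiling step above can be sketched with a simple serverless query (the storage path and column name are hypothetical; `filepath(1)` returns the folder value matched by the first wildcard in the path, which is handy for partitioned folders):

```sql
-- Minimal sketch of ad-hoc profiling over Parquet files in the lake.
SELECT
    r.filepath(1)               AS year_folder,        -- first wildcard match
    COUNT(*)                    AS row_count,
    COUNT(DISTINCT customer_id) AS distinct_customers
FROM OPENROWSET(
        BULK 'https://mystorageaccount.dfs.core.windows.net/sales/year=*/*.parquet',
        FORMAT = 'PARQUET'
    ) AS r
GROUP BY r.filepath(1);
```

A quick run like this shows row counts and cardinality per partition before any ELT pipeline is built.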
- Data Lake Analytics:
a. To consume data in the lake through BI tools for reporting, applying the required transformations on the fly when they are needed only sparingly.
b. Use the data lake as a full-blown logical data warehouse and run DW-like queries with comparable throughput and latency. Building the data warehouse directly over the data lake prevents duplicating data into relational databases like Azure SQL DB. Views can be created over the data lake files in T-SQL using a Serverless SQL pool.
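Creating such a view might look like the following sketch, assuming a hypothetical serverless SQL database and storage path. BI tools can then query the view like an ordinary table, with the files in the lake as the actual storage:

```sql
-- Minimal sketch: exposing data lake files as a view in a serverless SQL
-- database (database name, path, and view name are hypothetical).
CREATE VIEW dbo.vw_sales
AS
SELECT *
FROM OPENROWSET(
        BULK 'https://mystorageaccount.dfs.core.windows.net/sales/*.parquet',
        FORMAT = 'PARQUET'
    ) AS r;
```

Reporting tools can then simply run `SELECT ... FROM dbo.vw_sales`, keeping the logical data warehouse layer in T-SQL while the data stays in the lake.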
Azure Synapse Serverless SQL Pool allows data scientists and data analysts to interact with the data lake more comfortably and provides quicker insights before a full-scale ETL/ELT model is built.
For an overview of Azure Synapse, check out this blog here.
Learn more about Visual BI’s Microsoft Azure offerings here.