When working with new loads of data, it's important to keep the destination tables as up to date as possible through incremental loads. However, this can be tricky, especially when your source data comes in different file formats. In this blog, we will walk through one way to load data from Azure Data Lake Store into Azure SQL Data Warehouse using PolyBase.
What is PolyBase?
PolyBase is a tool built into SQL Server 2016 and Azure SQL Data Warehouse that allows you to query data in external files stored in Azure Blob Storage or Azure Data Lake Store. Once we define an external data source and file format in SQL Server Management Studio (SSMS), we can expose a file's data as a structured external table. With the external table in place, we can compare it against the destination table and update the destination accordingly.
PolyBase is also used under the hood by Azure Data Factory's Copy activity when loading into Azure SQL Data Warehouse. However, it can be used directly through SQL Server Management Studio, which is what we do in this tutorial.
Why Use PolyBase Directly if It Is Already in Azure Data Factory?
Azure Data Factory performs a simple insert into the destination table, which works well for transactional data but won't suffice when existing records need to be updated. Working with PolyBase directly, we can query the source files using SQL. Once the data is in external tables, we can use them for staging and even transform the data further before inserting it into the destination table. The advantage of using PolyBase directly is the ability to use familiar SQL for much of the transformation and copying work within the data warehouse.
How to Get Started?
Assuming we have each resource set up (Azure Data Lake Store, Azure SQL Data Warehouse, and a registered app in Azure Active Directory), we first must create a master key and a database scoped credential for the Active Directory application. Microsoft has a great document on setting up your external data source: https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-data-source-transact-sql?view=sql-server-2017. You should end up with something like this in the SSMS Object Explorer:
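As a rough sketch, the one-time setup runs in the data warehouse database along these lines. The credential, data source, and file format names, the placeholder password, and the `<client_id>`/`<tenant_id>`/`<adls-account>` values are all illustrative and must be replaced with your own:

```sql
-- One-time setup in the Azure SQL Data Warehouse database.
-- All names and <placeholders> below are illustrative, not real values.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<StrongPassword123!>';

-- Service principal (registered app) credentials for reaching the Data Lake Store
CREATE DATABASE SCOPED CREDENTIAL ADLSCredential
WITH
    IDENTITY = '<client_id>@https://login.microsoftonline.com/<tenant_id>/oauth2/token',
    SECRET   = '<client_secret>';

-- Points PolyBase at the Data Lake Store account
CREATE EXTERNAL DATA SOURCE AzureDataLakeStore
WITH (
    TYPE = HADOOP,
    LOCATION = 'adl://<adls-account>.azuredatalakestore.net',
    CREDENTIAL = ADLSCredential
);

-- Describes how the flat files are laid out (here, comma-delimited text)
CREATE EXTERNAL FILE FORMAT TextFileFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"')
);
```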
Once the data source and file format are built correctly, we can build our external tables with a CREATE EXTERNAL TABLE statement:
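A minimal example, assuming a hypothetical sales feed; the schema, folder path, and the `AzureDataLakeStore`/`TextFileFormat` object names are placeholders for whatever you created in the setup step:

```sql
-- External table over the files in the Data Lake Store folder.
-- Columns and names are illustrative placeholders.
CREATE EXTERNAL TABLE ext.Sales
(
    SaleId      INT           NOT NULL,
    CustomerId  INT           NOT NULL,
    SaleAmount  DECIMAL(18,2) NOT NULL,
    SaleDate    DATE          NOT NULL
)
WITH (
    LOCATION    = '/sales/incremental/',   -- folder in the Data Lake Store
    DATA_SOURCE = AzureDataLakeStore,      -- external data source from setup
    FILE_FORMAT = TextFileFormat,          -- file format from setup
    REJECT_TYPE  = VALUE,
    REJECT_VALUE = 0                       -- fail on any malformed row
);
```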
After executing the SQL code, you should see the table under the External Tables folder in the SSMS Object Explorer. We can query this table like any other database table.
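For example, continuing with the hypothetical `ext.Sales` table from above, an ordinary SELECT reads straight from the files in the lake:

```sql
-- Queries against the external table read the underlying files directly
SELECT TOP 10 SaleId, SaleAmount, SaleDate
FROM ext.Sales
WHERE SaleDate >= '2018-01-01';
```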
Updating Your Table
Although Azure SQL Data Warehouse comes with many features that speed up querying large amounts of data, it lacks some features available in SQL Server in order to maintain that speed. One of the largest shortcomings of Azure Data Warehouse is the lack of a MERGE statement. A MERGE statement lets us compare two tables, one as a source and one as a destination, and update the destination as needed. However, we can achieve the same result with a delete and insert. With this approach we must make sure we have a one-to-one mapping between the tables, meaning there are no duplicates within the primary key.
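The delete-and-insert pattern can be sketched as follows, assuming the illustrative `ext.Sales` staging table and a `dbo.Sales` destination keyed on a duplicate-free `SaleId`:

```sql
-- Emulate MERGE with DELETE + INSERT. Assumes SaleId uniquely
-- identifies a row in both tables (no duplicates in the key).
BEGIN TRANSACTION;

-- Remove destination rows that arrived again in this incremental load
DELETE FROM dbo.Sales
WHERE SaleId IN (SELECT SaleId FROM ext.Sales);

-- Insert the current version of every incoming row
INSERT INTO dbo.Sales (SaleId, CustomerId, SaleAmount, SaleDate)
SELECT SaleId, CustomerId, SaleAmount, SaleDate
FROM ext.Sales;

COMMIT;
```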
To update the table on demand, we build a stored procedure in the SQL Data Warehouse. To automate the process, we can execute the stored procedure from a Data Factory pipeline by adding a Stored Procedure activity and creating a scheduled trigger. The stored procedure can be integrated with the rest of the process to make sure your copy activities do not duplicate records in the destination table on each incremental load.
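Wrapping the delete-and-insert in a procedure gives Data Factory a single object to call on a schedule; the procedure and table names here are placeholders:

```sql
-- Stored procedure that applies one incremental load.
-- Object names are illustrative; use your own tables and key column.
CREATE PROCEDURE dbo.usp_UpsertSales
AS
BEGIN
    -- Delete rows that will be replaced by this load
    DELETE FROM dbo.Sales
    WHERE SaleId IN (SELECT SaleId FROM ext.Sales);

    -- Insert the incoming rows from the external (staging) table
    INSERT INTO dbo.Sales (SaleId, CustomerId, SaleAmount, SaleDate)
    SELECT SaleId, CustomerId, SaleAmount, SaleDate
    FROM ext.Sales;
END;
```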
Beneficial for: Customers looking to transfer data to Azure Data Warehouse from Blob Storage or ADLS while maintaining data integrity.