Azure Data Factory (ADF) is an ELT tool for orchestrating data from different sources to the target. By Default, Azure Data Factory supports the extraction of data from different sources and different targets like SQL Server, Azure Data warehouse, etc. In this blog, we are going to explore file partition using Azure Data Factory.
In many cases, we have a scenario of extracting a huge volume of data from the source. In such cases, we may need to partition the data into different sets to achieve performance and leverage the loading of data into targets limiting performance hits.
By default, we do not have the option to partition the data into different sets of files in ADF. We can achieve this by writing custom logic or Azure dataflow.
We can achieve the partition of different files using these two methods.
- File Partition using Custom Logic
- File Partition using Azure Dataflow
File Partition using Custom Logic
File partition using Azure Data Factory pipeline parameters, variables, and lookup activities will enable the way to extract the data into different sets by triggering the dynamic SQL query in the source.
Below is the SQL query and methods to extract data into the different partitions. Use lookup activities to trigger the below SQL query and save the count & dividend_count from the table into Azure Data Factory variable.
RAW SQL Query
- Count – Used to take the total count of the table.
- Dividend_count – Used to take the number of rows to take in each partition. Here, 10 defined as dividend_count means if we have 40000 records in the source table, then dividend_count will have 4000 records. Each partition will have 4000 records and is saved into a separate file in blob storage or data lake, which is our end goal.
Parametrized SQL Query in ADF
The below query is parameterized in ADF. We can pass the partition value dynamically to set the number of partitions.
- set_partitioning – is the parameter that we need to create and pass this parameter into ADF. In this parameter, we will define how many numbers of partitions are needed. In our case, we will define 10 in the parameter.
Concatenating SQL Query with ADF Parameter
Create the below variables to store the count, dividend count, etc. We already discussed the purpose of count and dividend_count variable. Let us see the use of the other four variables that we need to create in Azure Data Factory like below.
- Initialize – we need to set ‘1’ initially to fetch the data from the first record in the SQL query. After this, we need to dynamically change the variable value by adding dividend_count variable value to it. In our case, the query in the copy activity will look like this.
Generating ROW number in ADF
Actual query at the time of the first iteration
- Dividend_placeholder – we need to create this variable to swap the dividend_count value into this variable. So that we can add the next set of records in the dividend_count variable. First iteration will have dividend_count variable value as 400 and the next iteration will contain 800(dividend_placeholder(400) + dividend_count(400). Initialize variable will be reset to 401 by adding dividend_placeholder(400) to initialize variable default value(1). Iteration will continue till it meets the count variable count.
- Lookup check – Lookup check variable used to restrict the lookup activity will happen once and set the count & dividend_count, or else it will set the dividend_count variable again to the old value instead of further iteration.
Create a pipeline like below and do all the activities:
- Loopcheckvariable Activity – it will be defined as zero first and reset of this variable to 1 will happen within ‘until activity’ after the first iteration to avoid the further lookup available in ‘if condition Activity’.
- If condition Activity– within this activity, look up activity is available to hit the SQL query to fetch count and dividend_count, which we discussed above. This activity will happen only once to fetch count and dividend_count. After that, it will not trigger because the lookup check variable will be set to 1.
- Until Activity – within this activity, we need to copy data from the source table into Azure Blob Storage as a multiple file partition and resetting of all the variables happens to continue further iteration & file partition.
ADF activity flow