Azure Data Factory is a comprehensive cloud-based data integration service that helps orchestrate and automate data movement. With the help of Azure Data Lake Analytics and Azure Databricks, we can transform data according to business needs. Using Data Factory activities, we can invoke U-SQL and Databricks code.
Now, with the Data flow activity, Azure Data Factory can perform transformations within itself. The Data flow activity lets us transform data through a visual UI, with no coding.
This makes Azure Data Factory more user-friendly for performing ETL and ELT. There are many transformations available in the Data flow. In this blog, let us see them in detail.
The transformations here are not classified as active or passive; instead, they fall into 3 types, which are as follows:
- Multiple inputs and outputs – these transformations either require multiple inputs or generate multiple outputs.
- Schema modifier – modifies the structure of the input that passes through this kind of transformation.
- Row modifier – modifies the result set; that is, the number of rows passing through these transformations is affected (typically reduced).
Other than these types you have the ‘Source’ and ‘Sink’ (destination). Let us focus on multiple inputs and outputs in this blog.
The list of transformations that fall under this category is provided below:
- New Branch
- Join
- Conditional Split
- Exists
- Union
- Lookup
The New Branch transformation gets you a new stream of your source data in the data flow. This is like the Multicast transformation in Microsoft SQL Server Integration Services (SSIS). Whenever you want multiple streams of your dataset in a data flow, you can use the New Branch transformation.
When you use a new branch, the two streams do not need to share the same set of configurations.
Note: Data flow does not allow you to use New Branch unless there are already one source and one subsequent transformation in place. Only then will New Branch appear in the list of available transformations.
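The branching idea can be sketched in plain Python (this is a conceptual illustration, not ADF syntax; the column names are hypothetical): the source stream is duplicated so each branch can be transformed independently, just like Multicast in SSIS.

```python
# Conceptual sketch: a "new branch" duplicates one source stream so two
# independent pipelines can consume it without affecting each other.
import copy

source = [
    {"id": 1, "amount": 100},
    {"id": 2, "amount": 250},
]

# Each branch gets its own deep copy of the rows.
branch_a = copy.deepcopy(source)
branch_b = copy.deepcopy(source)

# Branch A keeps only large amounts; Branch B adds a computed column.
branch_a = [row for row in branch_a if row["amount"] > 150]
for row in branch_b:
    row["amount_with_tax"] = row["amount"] * 1.1
```

Note that filtering branch A and deriving a column in branch B leave the original source rows untouched, which is the whole point of branching.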
The Join transformation helps you combine or join two different streams; the output contains all the columns from both streams. This transformation supports 5 different types of joins, which are as follows:
- Inner
- Full Outer
- Left Outer
- Right Outer
- Cross Join
Under the Join settings tab, you choose your left and right streams of data, the join type, and the columns on which the join must be made.
Note: Your source datasets must have a predefined schema or an imported schema from the source file. Otherwise, the data flow will not be able to recognize the source columns from the two streams and will not allow you to specify a join condition.
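The join semantics can be sketched in plain Python (a conceptual illustration, not ADF syntax; the stream and column names are made up). Here is a left outer join: every left row appears in the output, with right-stream columns filled in where the key matches and None where it does not.

```python
# Conceptual sketch of a left outer join on one key column.
left = [
    {"order_id": 1, "customer_id": "C1"},
    {"order_id": 2, "customer_id": "C2"},
]
right = [
    {"customer_id": "C1", "name": "Alice"},
]

def left_outer_join(left_rows, right_rows, key):
    # Index the right stream by the join key for fast matching.
    index = {row[key]: row for row in right_rows}
    # All right-stream columns except the key get carried into the output.
    right_cols = {col for row in right_rows for col in row} - {key}
    out = []
    for row in left_rows:
        match = index.get(row[key], {})
        merged = dict(row)
        for col in right_cols:
            # Unmatched left rows keep their columns; right columns are None.
            merged[col] = match.get(col)
        out.append(merged)
    return out

result = left_outer_join(left, right, "customer_id")
```

An inner join would simply drop the left rows with no match; a full outer join would additionally emit unmatched right rows.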
The Conditional Split transformation helps us route the flow into multiple streams by splitting the data based on matching conditions. By default, it splits the data into two streams with a single expression: rows satisfying the expression go to one stream, and the rest go to the other.
You can also add multiple expressions and configure the data to flow or split into multiple streams.
This is exactly like a conditional split in Microsoft SQL Server Integration Services (SSIS).
The Exists transformation filters rows based on whether matching data exists (or does not exist) in another stream. You need to configure the left and right source streams, then select the columns on which the exists or doesn't-exist condition has to be checked.
You can also check the custom expression box and write your own expression in the expression editor, in case you want to apply multiple conditions.
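In relational terms, Exists behaves like a semi join and "doesn't exist" like an anti join. A plain-Python sketch (conceptual, not ADF syntax; the column names are made up):

```python
# Conceptual sketch: keep left rows whose key appears in the right stream
# (Exists / semi join), or whose key does not (Doesn't exist / anti join).
left = [
    {"id": 1, "sku": "A"},
    {"id": 2, "sku": "B"},
    {"id": 3, "sku": "A"},
]
right = [{"sku": "A"}]

# Collect the right-stream keys once for O(1) membership checks.
right_keys = {row["sku"] for row in right}

exists_rows = [row for row in left if row["sku"] in right_keys]
not_exists_rows = [row for row in left if row["sku"] not in right_keys]
```

Unlike a join, no right-stream columns are added to the output; the right stream is used only to decide which left rows survive.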
The Union transformation helps you combine/union multiple streams into one. All the input columns from all the streams will be included in the output stream.
You can union the schemas based on the name of the column or position of the column. You do not need a joining key here to union the data into one.
Union by name combines the data based on the names of the columns. If name is the second column in your first stream and the fourth column in your second stream, union by name still matches the columns correctly by name. The columns do not need to be in the same order.
Union by position combines data based on the order, or position, of the columns in the different streams. If name is the second column in the first stream and ID is the second column in the second stream, union by position will still stack name and ID values row by row. Only the position of the columns matters here.
The Lookup transformation makes a reference to the stream against which you want to do the lookup. You can choose the columns on which the lookup reference has to be established. Configure your primary stream and lookup stream, and make sure both streams have their schemas imported into the datasets so that you can choose the columns for the lookup conditions.
With the help of a derived column and a conditional split after the lookup, you can separate the stream of records that match the lookup condition from the recordset that does not.
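The lookup-then-split pattern can be sketched in plain Python (conceptual only, not ADF syntax; the stream and column names are hypothetical): the primary stream is enriched with a column from the lookup stream, and rows are then split on whether that column was populated.

```python
# Conceptual sketch: lookup enrichment followed by a matched/unmatched split.
primary = [
    {"order_id": 1, "product": "P1"},
    {"order_id": 2, "product": "P9"},  # no matching lookup row
]
lookup = [{"product": "P1", "price": 9.99}]

# Index the lookup stream by the lookup key.
prices = {row["product"]: row["price"] for row in lookup}

# Enrich the primary stream; unmatched rows get None (like a derived column).
enriched = [{**row, "price": prices.get(row["product"])} for row in primary]

# Split matched vs. unmatched rows (like a conditional split on the lookup).
matched = [row for row in enriched if row["price"] is not None]
unmatched = [row for row in enriched if row["price"] is None]
```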
We will keep you posted about Schema modifier transformations in our next blog. Learn more about Visual BI’s Microsoft Azure offerings here.