Azure Data Factory (ADF) is a fully managed cloud service for data orchestration and data integration. It supports a variety of processing services such as Azure Data Lake Analytics and Hadoop. ADF is now introducing a Data Flow activity, which lets you develop transformations through a GUI. We had an opportunity to get a peek at Data Flows in private preview, and in this blog we share our thoughts on them.
Why Data Flow?
Data Flow is intended to provide a completely GUI-based solution with no coding required, so the ETL/ELT solutions you develop with Data Flows will be more readable and easier to maintain. Data Flows are executed on Azure Databricks using Spark; ADF handles all of the code translation, Spark optimization, and execution of the transformations in a Data Flow. Because it is a Spark implementation, you can run your transformation jobs at blazing speed.
What’s in Data Flow?
Data Flow provides several transformations that you can apply to your data, and new transformations are being added with each release. Here are a few of them:
| Transformation | Description |
| --- | --- |
| Source | Configures the sources you want to bring into the data flow. A data flow can have one or more sources, and it should be able to connect to any type of source supported by ADF. Note: the private preview only supports Blob storage. |
| Select | Selects a list of columns from the input stream to pass on to other transformations, and lets you assign alias names to columns. |
| Derived Column | Creates new columns or modifies existing ones. It supports a wide range of data manipulation functions. |
| Filter | Restricts the rows based on a filter expression. |
| Conditional Split | Splits the input stream into any number of output streams based on condition expressions. Rows not matching any condition are routed to a default output. |
| Sort | Sorts the incoming rows based on sort rules; the output rows keep that order in subsequent transformations. It has other options such as a "case sensitive" toggle and computed columns. |
| Aggregate | Defines aggregation functions over group-by columns. You can also build custom expressions in this transform. |
| Exists | Filters rows in one input stream based on another stream, with the option of applying either an "Exists" or a "Not Exists" condition. |
| Lookup | Looks up values from another stream based on a lookup condition. It works like an inner join. |
| Joiner | Joins two streams based on a condition, with options for inner, left, right, outer, and cross joins. |
| Union | Merges two streams into a single stream. |
| New Branch | Replicates the current stream, so you can continue with a new stream or a copy of the current one. |
| Output | The output sink writes the data to any kind of storage supported by ADF. Note: the private preview only supports Blob storage and SQL DW. |
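To make a few of these transforms concrete, here is a minimal plain-Python sketch of what Filter, Derived Column, and Aggregate conceptually do to a stream of rows. This is purely illustrative (the sample rows, column names, and the 10% tax figure are made up); in ADF you express these steps in the GUI and they run as Spark, not as Python.

```python
# Sample input stream: each row is a dict of column -> value (illustrative data).
rows = [
    {"region": "East", "amount": 100},
    {"region": "West", "amount": 250},
    {"region": "East", "amount": 50},
]

# Filter: keep only the rows matching an expression (here, amount > 60).
filtered = [r for r in rows if r["amount"] > 60]

# Derived Column: add a new column computed from existing ones
# (a hypothetical 10% tax, just for illustration).
derived = [{**r, "amount_with_tax": round(r["amount"] * 1.1, 2)} for r in filtered]

# Aggregate: group by "region" and sum "amount".
totals = {}
for r in derived:
    totals[r["region"]] = totals.get(r["region"], 0) + r["amount"]

print(totals)  # one total per region for the rows that survived the filter
```

A chain like this (source, filter, derived column, aggregate, sink) is exactly the kind of flow you would wire up visually in the Data Flow canvas.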
This video will give a tour of the Data Flow activity in Azure Data Factory:
How to use Data Flows?
Data Flows can be scheduled and executed by adding them as an activity in ADF pipelines, with triggers used to execute the pipeline. There are two options for running a Data Flow: you can use an existing Azure Databricks cluster, or create a new cluster for every execution. By creating a new cluster per execution, you avoid the cost of keeping a cluster running; however, the cluster takes at least 5-10 minutes to warm up. You can choose either option based on your organizational needs.
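The tradeoff between the two options can be sketched with some back-of-the-envelope arithmetic. All the numbers below are assumptions for illustration only (not real Azure pricing); only the 5-10 minute warm-up range comes from the text above.

```python
# Hypothetical inputs (assumptions, not real rates or workloads):
runs_per_day = 4            # how often the pipeline triggers the Data Flow
warmup_minutes = 7          # cold-start overhead; the text cites 5-10 minutes
job_minutes = 10            # assumed duration of the transformation job itself
hourly_cluster_cost = 2.0   # illustrative $/hour for the cluster

# Option A: create a new cluster per execution.
# You pay for warm-up + job time on each run, but nothing in between.
per_run_hours = (warmup_minutes + job_minutes) / 60
cost_new_cluster = runs_per_day * per_run_hours * hourly_cluster_cost

# Option B: keep an existing cluster running all day.
# No warm-up latency, but you pay for all 24 hours.
cost_always_on = 24 * hourly_cluster_cost

print(f"new cluster per run: ${cost_new_cluster:.2f}/day")
print(f"always-on cluster:   ${cost_always_on:.2f}/day")
```

With infrequent runs, spinning up a fresh cluster is far cheaper; as run frequency (or latency sensitivity) grows, an always-on cluster starts to pay off.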
What’s our take?
Data Flow has been a missing piece in the Azure Data Factory service. With its arrival, we are sure developing ETL/ELT on the Azure platform is going to be far more user-friendly. The activity is still in private preview and there are plenty of new features on the way; we will keep you posted.
Learn more about Visual BI’s Microsoft Azure offerings here.