Azure Data Factory is a tool to orchestrate data movement and transformation from source to target. By Default, Azure Data Factory supports extraction of data from several file formats like CSV, tsv, etc. Sometimes we have a requirement to extract data out of Excel which will be loaded into a Data Lake or Data Warehouse for reporting.
Let’s see how we can handle an Excel file in Azure Data Factory through below two methods
- Logic Apps
Logic Apps are an enterprise integration tool in a cloud system, used to integrate apps, systems and services in one place to build a scalable solution. Logic Apps contain triggers and activities to meet the business requirements. Here, we are going to see extracting data from excel and converting it to CSV file.
To achieve this, first covert the Excel file data into a table and upload your Excel file to SharePoint or OneDrive. Create a Logic App in the Azure portal and add the below activities to the Logic App.
Add an HTTP activity to trigger the Logic App and pass parameters like file name, the path from the Azure Data Factory.
Please use this link for more details on this process.
Add an action ‘List rows present in a table’ to extract data from Excel. Choose business for extracting data from SharePoint or go for OneDrive.
Once the activity is added to the Logic App, pass the proper values to the parameters like location, document library, table value of the SharePoint file.
By default, the activity extracts a certain number of rows from Excel. Enabling the pagination will allow us to extract rows up to 100,000 rows. Go to setting -> enable the pagination.
Extracted the data from the Excel file through the below set of activities. Then, covert the Excel file data into CSV by using the ‘Convert CSV table’ activity.
Add or Drag ‘Create CSV table’ activity and pass the value retrieved from the above activity ‘List rows present in a table’.
Passing the output of the ‘Create CSV table’ activity into upload file of the Data Lake folder activity like below will create the CSV file.
This was one of the methods to extract data from Excel using Logic Apps. Now let’s see the second option using Assembly.
Extracting Excel data through Assembly
Azure Data Lake Analytics provides an easy way to write custom code for extending functionality through registering an assembly. Here, we going to use third-party .net code to extract the data from Excel using U-SQL. To Learn more about Assembly, please go through this link.
The code we are using is a third-party code available in GitHub for public use. We did test the code in our environment and found a few minor bugs to correct in U-SQL and some changes in the .Net code.
If you need help, please reach out to us @ firstname.lastname@example.org
Clone the file into your local repo by hitting this GitHub link. Once cloned, your local repo will look like the below;
Go to Samples\ExcelExtractor and click the solution file ‘ADLA ExcelExtractor.sln’ which will open the code in Visual Studio.
Register the Assembly into your Data Lake Analytics by right-clicking the ‘oh22is.Analytics.Formats’. It will prompt you for choosing your Data Lake Analytics account and database.
Once you properly configured the account details, submit to register the assembly in Data Lake Analytics. Verify by going to Data Lake Analytics and see the Assemblies folder under the master database.
You can see the assembly successfully registered in the Data Lake Analytics. Now you just load your Excel file into Data Lake or blob storage and write U-SQL to fetch the records from Excel and load it as CSV in the Data Lake.
Sample U-SQL script for extracting data from Excel.
To learn more about other alternatives for triggering alerts in case of pipeline run failure refer here.