Consider a scenario where you have your data across different files in different folders, and the list of folders and files grow by the day. We can use PowerBI to connect to this dynamic list of files (which have the same structure). We can also direct Power BI using Power Query M to only look at the latest version of the same record across different files.
In the scenario taken for this blog, we have the data in Azure Data Lake Store. We have a pipeline which writes a flat file daily containing the renewed Exception Count for Products. In the subsequent days, if the Exception Count was revised for a particular product, the new Exception Count is available in that day’s file. Therefore, when reporting, Power BI should not take the cumulative Exception Count, but rather the latest record for each product (across all the files).
The folders in the Azure Data Lake Store are structured as below. We have folders created dynamically for each month, and inside each month’s folder, we have the daily files.
Folder Structure in ADLS
Implementation using Power Query M in Power BI
1) Connect to the data in the Azure Data Lake Store using the folder URL of the root folder.
This is the initial view of the data when connected to the Exception Reporting Folder.
2) In the Query Editor: Advanced Editor, edit the Load Script using Power Query M. We want Power BI to only take the folders which are prefixed with ‘Exception’ (disregarding the Misc folder), and inside these selected folders, only to take the files which are prefixed with the words ‘ProductExCount’ (disregarding the Order Details file).
3) After selecting the first column, click on Combine Files. Data from all the files are combined into a single table structure.
The highlighted records give the latest Exception Counts for the products, and only those should be used for reporting.
4) While data is being added on a near-daily basis in this scenario, for each Product we only want to see the latest Exception Count (and not the cumulative exception count). Apply an ascending sort on date, and group the data as below, retaining only the latest record for each Product.
The benefits of this approach are that we can connect to a growing number of folders, and a growing number of files dynamically. The number of records loaded would also be less, reducing the size of the dataset. Performance of the Power BI report would be exponentially improved.