Very often in Extract, Load, Transform (ELT) design we have to copy data from a data lake to a SQL database for analytical reporting, which means maintaining two copies of the same data in two places just for reporting. In this article we explore three methods that facilitate reporting from Azure Data Lake using Power BI:
- Import from Azure Data Lake
- Through Data Warehouse External Tables
- Through Hive External Tables
Import from Azure Data Lake
The most straightforward method is to import data from Azure Data Lake (ADL) into Power BI. Power BI already has a built-in data connector for ADL. This method is suitable only for datasets smaller than 10 GB.
Pros
- Provides a good response time since the data is imported into Power BI data storage.
- With an import option, data editing (e.g. removing columns, empty rows) and modelling can be done in Power BI.
Cons
- Importing the data can take a long time depending on the size of the data and the network speed, because the entire file must be read.
- The imported dataset cannot be bigger than 10 GB for Pro users (1 GB for normal users).
- To keep the imported data in Power BI consistent with the source data in Azure Data Lake, a scheduled data refresh needs to be configured if the source data is expected to change.
- Imported data cannot be combined with Direct Query sources in a single report.
Through Azure Data Warehouse External Tables
In this method we use the concept of external tables in Azure SQL Data Warehouse. External tables in the data warehouse work as a semantic layer over the files in the data lake, so the data does not need to be physically loaded into the warehouse. Power BI's native Azure SQL Data Warehouse connector can access these tables in both Direct Query and Import mode. We would recommend this method (with Direct Query) for reporting on cold data that still needs optimal performance at report time.
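For illustration, here is a minimal sketch of what such an external-table setup can look like in Azure SQL Data Warehouse (PolyBase over an ADLS Gen1 account). All object names, columns, paths and credentials below are placeholder assumptions, not taken from any specific deployment:

```sql
-- Minimal PolyBase setup sketch: all names, paths and secrets below are placeholders.
-- A database master key must already exist (CREATE MASTER KEY;).

-- Credential the data warehouse uses to reach the data lake (service principal assumed)
CREATE DATABASE SCOPED CREDENTIAL AdlCredential
WITH IDENTITY = '<client-id>@<oauth-2.0-token-endpoint>',
     SECRET   = '<client-secret>';

-- Pointer to the Azure Data Lake Store account
CREATE EXTERNAL DATA SOURCE SalesLake
WITH (
    TYPE = HADOOP,
    LOCATION = 'adl://<datalake-account>.azuredatalakestore.net',
    CREDENTIAL = AdlCredential
);

-- Encoding of the underlying files (CSV with a header row assumed)
CREATE EXTERNAL FILE FORMAT SalesCsvFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2)
);

-- The external table: a schema over the data lake files, no data is copied into the warehouse
CREATE EXTERNAL TABLE dbo.SalesExternal (
    SaleId   INT,
    SaleDate DATE,
    Amount   DECIMAL(18, 2)
)
WITH (
    LOCATION = '/curated/sales/',
    DATA_SOURCE = SalesLake,
    FILE_FORMAT = SalesCsvFormat
);
```

Power BI's Azure SQL Data Warehouse connector then treats dbo.SalesExternal like any other table, in either Import or Direct Query mode.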
Pros
- Can leverage Azure SQL Data Warehouse infrastructure if already available in the organization.
- Azure DW can be scaled up when high computing power is required.
- Direct query allows analyzing large datasets because there is no dataset size limitation.
- Data is always up-to-date so it does not require data refreshing.
Cons
- Creating external tables using partitioned data files can be challenging.
- With an import option, table structures are imported but relationships between tables are not; they must be defined manually in Power BI.
- With a direct query mode, not all data types are implicitly recognized by Power BI.
- With a direct query mode, there is a one million row limit in viewing the result set.
- With a direct query mode, DAX formulas in Power BI can return errors when they are converted to SQL statements.
- Calculated tables (Union, Intersect, Join of the tables) are not supported in direct query option.
Through Hive (Interactive Query) External Tables
In this method we use Hive external tables to act as a semantic layer over the ADL files. In Azure HDInsight, an Interactive Query cluster is needed for this method. Interactive Query is an in-memory Hive system that can be leveraged for high-volume reporting. Power BI has a beta connector for Interactive Query. We would suggest this method for systems with very high data volumes that need active reporting.
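For comparison, here is a minimal HiveQL sketch of an equivalent external table (assuming an HDInsight Interactive Query cluster attached to an ADLS Gen1 account; the table name, columns and path are placeholders):

```sql
-- Hive external table sketch: no data is moved, the table is just a schema over the data lake files.
CREATE EXTERNAL TABLE IF NOT EXISTS sales_ext (
    sale_id   INT,
    sale_date DATE,
    amount    DECIMAL(18, 2)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'adl://<datalake-account>.azuredatalakestore.net/curated/sales/'
TBLPROPERTIES ('skip.header.line.count' = '1');
```

The beta Interactive Query connector in Power BI can then query sales_ext directly.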
Pros
- Data lake files can be directly consumed in Hive via External tables. So additional data loading is not required.
- Interactive Query performs well with high concurrency and hence is preferred when multiple users have to work on the same dataset concurrently.
- It is more suitable for large-scale data, providing high performance on datasets in the range of TBs and PBs.
- HDInsight Interactive Query is said to be faster than Spark owing to its intelligent caching mechanism.
Cons
- A separate Hive server must be configured, and it should always be up and running.
- For smaller data, the performance is not as good as the case where the data is imported into Power BI.
- Relationships between tables do not get imported; they must be defined manually.
- Some data types in Hive may not be compatible in Power BI. For example, the date formats may vary.
- With a direct query mode, there is a one million row limit in viewing the result set.
- Calculated tables are not supported in direct query option.
- With a direct query mode, DAX formulas in Power BI can return errors when they are converted to SQL statements.
Summarizing all the options, the table below compares them across a few key parameters.
Comparison Table
| Parameters | ADLS – Power BI | ADLS – External Table (Azure DW) – Power BI | ADLS – External Table (Hive) – Power BI |
| --- | --- | --- | --- |
| Performance | Fastest response time, since the data resides in Power BI | Optimal performance for larger datasets | Suitable for very large datasets (TBs and PBs) |
| Size | 1 GB for normal users, 10 GB for Pro users | Limited by the capacity of the DW and the Data Lake | Limited by the capacity of Hive and the Data Lake |
| Data type compatibility | Maintains data sanity; hierarchies on date-type columns are built automatically | Maintains data sanity; can be achieved using formulas | Maintains data sanity; can be achieved using formulas |
| Data refreshes | Data must be refreshed | Refreshes automatically | Refreshes automatically |
| DAX support | No limitations | Limited | Limited |
| Power BI connectors | Native | Native | Beta version |
Every method has its own pros and cons. It's fascinating that we have so many options to report with Power BI on top of Azure Data Lake. We would recommend trying these methods based on the use case of your systems.