What is In-DB Processing?
When handling huge volumes of data, the major challenge involved in moving data out of the source into the analytical environment. This process of moving data in and out of the database is time-consuming and thus it invariably affects the overall performance. To overcome this challenge Alteryx has introduced a concept of In-Database processing which simplifies data blending across heavy datasets without moving data out of the database. When a database is connected through in-database processing, a direct connection to the database is established and it pushes the processing steps into the database and retrieves only the required data into Alteryx environment. By restricting the movement of vast volumes of data, performance is significantly improved.
Alteryx has introduced a dedicated set of tools (which has suffix as in-DB) to assist and support In-Database processing. This dedicated set of tools are added under ‘In -Database’ palette. This blog explores the In-DB capabilities of Alteryx 2019.2.7.63499.
What are the databases that support In-DB processing?
In-Database processing is supported by the following databases
Supported Databases | |||
1 | Amazon RedShift | 10 | Microsoft Azure SQL Database and Azure DWH |
2 | Apache Spark ODBC | 11 | Microsoft SQL Server 2008, 2012, 2014, 2016 |
3 | Cloudera Impala | 12 | MySQL |
4 | DataBricks | 13 | Oracle |
5 | EXASOL | 14 | Pivotal Greenplum |
6 | Hive | 15 | PostgreSQL |
7 | HP Vertica | 16 | SAP HANA |
8 | IBM Netezza | 17 | Snowflake |
9 | Microsoft Analytics Platform Systems | 18 | Teradata |
How connectivity to a database is established?
Four In-DB tools are available for establishing connectivity to the database
- Connect In-DB
- Write Data In-DB
- Data Stream In
- Data Stream Out
Connect In-DB enables the user to establish a connection and access the tables available in the database. It enables live connectivity to the database that allows the user to view schemas and tables listed in the database from which our table of interest can be chosen and accessed.
A table that has massive volumes of data can be accessed in seconds with connect In-DB tool whereas with generic input tool it takes about hours. This vast difference is mainly because the generic input tool fetches all the data into the Alteryx environment which is a time-consuming process and hence affects the overall performance.
Write Data In-DB assists with writing the data back to the Database. This ensures that the schema of existing tables in the database remains unaltered. There are 5 ways by which output mode can be configured
- Create a New Table
- Append Existing
- Delete Data and Append
- Overwrite Table (Drop)
- Create a Temporary Table
In-DB doesn’t change the schema of the table hence all the output modes are restricted in such a way that the schema remains unaltered.
Data Stream In along with the input tool can connect to external datasets and push the same into the database. If the user has a small lookup table that needs to be included in the workflow for the analysis, data stream in tools comes in handy. Instead of moving the larger dataset into the Alteryx environment for processing, the Datastream provides an option to move the smaller lookup file into the database thus reducing the runtime effectively.
Data Stream Out pulls the data from the database into the Alteryx environment. With this option, selective data fields become available in the Alteryx environment so that it can be integrated with the other tools for further analysis
How does the filter, cleanse and transform in in-DB differs from generic tools?
Filter In-DB tools help to query and filter records that match the condition specified. Like the generic filter tool, Filter In-DB also has two types of filters namely basic and advanced. If the business is interested in specific analytics, the In-DB filter can be used to fetch the selective set of records involved in the analysis into the Alteryx environment and perform further analytics by connecting them with Data Stream In tool. This process can effectively improve performance.
Select In-DB can select/deselect the fields to be analyzed, reorder, and rename the fields. Resizing of the fields is not possible as it changes the schema design. The main difference between a generic select tool and an In-DB select is that a generic select tool fetches all the records to the Alteryx environment over which select operation is handled whereas in-DB select fetches the records from the DB based on the fields chosen in the select and brings them to the environment for processing.
Formula In-DB is used to write expressions based on the existing fields. Formula in-DB tool ensures that the schema of the tables in the database remains unaltered. Hence there is no option to add an additional column using which expressions can be added.
Summarize In-DB can perform basic operations such as group, sum, count, count distinct, etc., Like the generic tool, summarize in-DB provides an option to perform aggregation operations based on the data type of a particular column.
What are the tools available for combining data?
Join In-DB combines data streams based on mutual fields. Inner Join, Left Outer Join, Right Outer Join, and Full Outer Join are the four available join options for In-DB. This tool is extremely effective in the case of data blending of tables across multiple platforms.
Union In-DB combines data from different data streams into a single stream and is used when combining data from more than one stream.
Summary
When the Dataset is massive, the performance of the system can be significantly improved by using in-DB tools. This set of tools are easy to use and using these the data can be streamed in or out of the database as and when required which invalidates any restrictions on them. As a best practice approach, it is always recommended to use in-DB tools when handling larger datasets as it significantly improves the performance.
Note: The values obtained for some of the tools for indicating the performance are subjective to change based on and system configuration and doesn’t remain fixed. These values can be identified by enabling performance profiling under runtime of workflow configuration.
References
- https://pages.alteryx.com/rs/716-WAC-917/images/FasterDataBlending_Warehouse_j307_Final.pdf
- https://3danim8.wordpress.com/2015/05/13/the-brilliance-of-alteryx-in-database-processing/
- https://pages.alteryx.com/rs/716-WAC-917/images/In-Database%20Blending%20for%20Big%20Data%20Preparation.pdf
Read more about similar Self Service BI topic here and learn more about Visual BI Solutions Microsoft Power BI offerings here.