What is In-DB Processing?

When handling huge volumes of data, the major challenge involved in moving data out of the source into the analytical environment. This process of moving data in and out of the database is time-consuming and thus it invariably affects the overall performance. To overcome this challenge Alteryx has introduced a concept of In-Database processing which simplifies data blending across heavy datasets without moving data out of the database. When a database is connected through in-database processing, a direct connection to the database is established and it pushes the processing steps into the database and retrieves only the required data into Alteryx environment. By restricting the movement of vast volumes of data, performance is significantly improved.

Alteryx has introduced a dedicated set of tools (which has suffix as in-DB) to assist and support In-Database processing. This dedicated set of tools are added under ‘In -Database’ palette. This blog explores the In-DB capabilities of Alteryx 2019.2.7.63499.

What are the databases that support In-DB processing?

In-Database processing is supported by the following databases

Supported Databases
1Amazon RedShift10Microsoft Azure SQL Database and Azure DWH
2Apache Spark ODBC11Microsoft SQL Server 2008, 2012, 2014, 2016
3Cloudera Impala12MySQL
4DataBricks13Oracle
5EXASOL14Pivotal Greenplum
6Hive15PostgreSQL
7HP Vertica16SAP HANA
8IBM Netezza17Snowflake
9Microsoft Analytics Platform Systems18Teradata

 

How connectivity to a database is established?

Four In-DB tools are available for establishing connectivity to the database

  • Connect In-DB
  • Write Data In-DB
  • Data Stream In
  • Data Stream Out

 

performance-enhancement-alteryx Connect In-DB enables the user to establish a connection and access the tables available in the database. It enables live connectivity to the database that allows the user to view schemas and tables listed in the database from which our table of interest can be chosen and accessed.

A table that has massive volumes of data can be accessed in seconds with connect In-DB tool whereas with generic input tool it takes about hours. This vast difference is mainly because the generic input tool fetches all the data into the Alteryx environment which is a time-consuming process and hence affects the overall performance.

performance-enhancement-alteryx Write Data In-DB assists with writing the data back to the Database. This ensures that the schema of existing tables in the database remains unaltered. There are 5 ways by which output mode can be configured

  • Create a New Table
  • Append Existing
  • Delete Data and Append
  • Overwrite Table (Drop)
  • Create a Temporary Table

In-DB doesn’t change the schema of the table hence all the output modes are restricted in such a way that the schema remains unaltered.

performance-enhancement-alteryx Data Stream In along with input tool can connect to external datasets and push the same into the database. If the user has a small lookup table that needs to be included in the workflow for the analysis, data stream in tools comes in handy. Instead of moving the larger dataset into the Alteryx environment for processing, Data stream in provides an option to move the smaller lookup file into the database thus reducing the runtime effectively.

performance-enhancement-alteryxData Stream Out pulls the data from the database into Alteryx environment. With this option, selective data fields become available in the Alteryx environment so that it can be integrated with the other tools for further analysis

How does the filter, cleanse and transform in in-DB differs from generic tools?

performance-enhancement-alteryxFilter In-DB tools help to query and filter records that match the condition specified. Like the generic filter tool, Filter In-DB also has two types of filters namely basic and advanced. If the business is interested specific analytics, the In-DB filter can be used to fetch the selective set of records involved in the analysis into the Alteryx environment and perform further analytics by connecting them with Data Stream In tool. This process can effectively improve performance.

performance-enhancement-alteryxSelect In-DB can select/deselect the fields to be analysed, reorder and rename the fields. Resizing of the fields is not possible as it changes the schema design. The main difference between a generic select tool and In-DB select is that generic select tool fetches all the records to Alteryx environment over which select operation is handled whereas in-DB select fetches the records from the DB based on the fields chosen in the select and brings them to the environment for processing.

performance-enhancement-alteryxFormula In-DB is used to write expressions based on the existing fields. Formula in-DB tool ensures that schema of the tables in the database remains unaltered. Hence there is no option to add an additional column using which expressions can be added.

performance-enhancement-alteryxSummarize In-DB can perform basic operations such as group, sum, count, count distinct, etc., Like the generic tool, summarize in-DB provides an option to perform aggregation operations based on the datatype of a particular column.

What are the tools available for combining data?

performance-enhancement-alteryxJoin In-DB combines data streams based on mutual fields. Inner Join, Left Outer Join, Right Outer Join and Full Outer Join are the four available join options for In-DB. This tool is extremely effective in case of data blending of tables across multiple platforms.

performance-enhancement-alteryxUnion In-DB combines data from different data stream into a single stream and is used when combining data from more than one stream.

 

Summary

When the Dataset is massive, the performance of the system can be significantly improved by using in-DB tools. This set of tools are easy to use and using these the data can be streamed in or out of the database as and when required which invalidates any restrictions on them. As a best practice approach, it is always recommended to use in-DB tools when handling larger datasets as it significantly improves the performance.

Note: The values obtained for some of the tools for indicating the performance are subjective to change based on and system configuration and doesn’t remain fixed. These values can be identified by enabling performance profiling under runtime of workflow configuration.

References

  1. https://pages.alteryx.com/rs/716-WAC-917/images/FasterDataBlending_Warehouse_j307_Final.pdf
  2. https://3danim8.wordpress.com/2015/05/13/the-brilliance-of-alteryx-in-database-processing/
  3. https://pages.alteryx.com/rs/716-WAC-917/images/In-Database%20Blending%20for%20Big%20Data%20Preparation.pdf

 

Read more about similar Self Service BI topic here and learn more about Visual BI Solutions Microsoft Power BI offerings here. 

 

Subscribe to our Newsletter

5920 Windhaven Pkwy
Plano TX 75093.

+1 888-227-2794

+1 972-232-2233

+1 888-227-7192

solutions@visualbi.com