Data has always been at the heart of the modern enterprise. For decades, most integration efforts have focused on bringing together data from various enterprise applications, such as Enterprise Resource Planning (ERP), Customer Relationship Management (CRM), and Point of Sale (POS) systems, for analysis.
Data integration is the process of combining data from different sources to provide a unified view with valuable information and meaningful insights. Movement of data into a data warehouse is typically achieved using ETL or ELT. Extract, Transform and Load is the general procedure for moving data from a source database to a target database. The series of functions involved in data integration is as follows.
Extract – desired data is extracted from homogeneous or heterogeneous data sets
Transform – data is transformed/modified into the desired format for storage
Load – data is moved into the target database, data marts, or data warehouse
In ETL (Extract, Transform and Load), data is first loaded from the different source systems into a staging layer, transformation logic is applied on the staging server using tools such as SAP BODS, Alteryx, or Informatica, and the result is then transferred to the data warehouse to provide unified data for reporting. The staged data must be cleaned up regularly to remove unnecessary data from the system. In ELT (Extract, Load and Transform), by contrast, data from the different source systems is loaded into a single system and the transformation logic is applied inside that same system. Because ELT integrates the data in one place, it helps ensure the accuracy and consistency of the data over its entire lifecycle.
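The ELT flow described above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: sqlite3 stands in for the warehouse, the source rows are hard-coded, and all table and column names are made up for the example.

```python
import sqlite3

# Extract: pull raw rows from a source system (hard-coded here for the sketch).
source_rows = [
    ("1001", "2019-06-01", "150"),
    ("1002", "2019-06-02", "89"),
]

warehouse = sqlite3.connect(":memory:")

# Load: land the raw, untyped data in the warehouse exactly as extracted.
warehouse.execute(
    "CREATE TABLE raw_orders (order_id TEXT, order_date TEXT, amount TEXT)"
)
warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", source_rows)

# Transform: apply the logic inside the warehouse itself (the "T" in ELT),
# using the database's own compute instead of an external transformation server.
warehouse.execute(
    """
    CREATE TABLE orders AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           order_date,
           CAST(amount AS REAL)      AS amount
    FROM raw_orders
    """
)

total = warehouse.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # 239.0
```

The key point of the sketch is the ordering: the raw data is loaded first and the transformation runs as SQL inside the target system, which is exactly how the ELT pattern avoids a round trip through an external server.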
What is Snowflake and how does data integration work?
In the early days of warehousing, database costs were much higher; and even though storage costs fell over the years, the design structure changed little, keeping warehouses dependent on external tools. Cloud data warehouses changed this by separating the storage and compute layers, offering scalable storage at lower cost, so data from different source systems can be stored in a single place with more accuracy than in traditional warehouses. Maintenance and operations costs are also reduced drastically by dynamic pricing strategies.
Snowflake is a SaaS (Software as a Service) offering that provides an analytic data warehouse, hosted on cloud platforms such as Amazon Web Services (AWS) or Microsoft Azure. Storage resources in Snowflake can be scaled independently of compute resources, so data loading and unloading can be done without worrying about running queries and workloads.
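As a sketch of this separation, compute in Snowflake is provisioned as virtual warehouses that can be created, resized, and suspended with SQL, independently of the stored data. The warehouse name below is illustrative:

```sql
-- Create an independent compute cluster; the stored data lives in
-- Snowflake's storage layer and is unaffected by this.
CREATE WAREHOUSE load_wh WITH WAREHOUSE_SIZE = 'SMALL' AUTO_SUSPEND = 300;

-- Scale compute up for a heavy load, without touching storage.
ALTER WAREHOUSE load_wh SET WAREHOUSE_SIZE = 'LARGE';

-- Suspend the warehouse when idle; compute charges stop, storage remains.
ALTER WAREHOUSE load_wh SUSPEND;
```

This is what makes loading and unloading safe to run alongside queries: each activity can be given its own warehouse sized for its workload.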
Though data integration in Snowflake (preparation, migration, and movement of data; image as shown below) can be handled using both ETL and ELT processes, ELT is the preferred approach within the Snowflake architecture.
Architecture in Snowflake
Data ingestion and transformation in Snowflake can be done using external third-party tools like Alooma, Stitch, etc. To learn more about loading data via ingestion tools, Snowflake provides partner accounts that offer free trials. As of June 2019, the partner and non-partner accounts supported by Snowflake are as below.
The complete list of data integration tools compatible with Snowflake (updated by Snowflake over time) can be found at https://docs.snowflake.net/manuals/user-guide/ecosystem-etl.html .
To load structured/transformed data into Snowflake (ETL), third-party transformation tools like Alteryx, Informatica, etc. are used in the transformation process. If the transformation needs to be applied after data ingestion (ELT), the preferred approach is DBT.
DBT (data build tool) is an open-source command-line tool that enables business data analysts and scientists to transform data effectively after it has been loaded into the data warehouse. It works with several warehouses (Redshift, BigQuery and Snowflake). Most external transformation tools need the data to be pushed to their server, transformed there, and moved back into the database, whereas DBT compiles SQL and materializes the results using the compute resources of the warehouse itself.
Image Source: https://www.getdbt.com/product/
The manifestation of data inside the warehouse with DBT is expressed in the form of tables and views. The core function of DBT is to compile code and execute it against the data warehouse in dependency order using a DAG (Directed Acyclic Graph), with the outputs stored as tables and views. With the integration of Jinja, an open-source templating language from the Python ecosystem, reusing code becomes much easier (e.g. if statements and for loops). Inside a model, the ref() function is used to reference other models, which is how DBT builds the DAG; from a SQL perspective, these templates behave much like functions.
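As an illustration, a DBT model is just a SQL SELECT statement in a .sql file: ref() points at another model so DBT can place it in the DAG, and a Jinja for loop generates repetitive SQL. The model, column, and status names here are made up for the example:

```sql
-- models/orders_by_status.sql (hypothetical model name)
-- ref() tells DBT that this model depends on stg_orders,
-- so stg_orders is built first in the DAG.
select
    order_date,
    {% for status in ['placed', 'shipped', 'returned'] %}
    sum(case when status = '{{ status }}' then amount end) as {{ status }}_amount
    {%- if not loop.last %},{% endif %}
    {% endfor %}
from {{ ref('stg_orders') }}
group by order_date
```

When DBT compiles this model, the Jinja loop expands into three plain SQL columns and ref() resolves to the fully qualified name of the stg_orders table or view; the compiled query is then executed using the warehouse's own compute.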
The functionality of DBT is not limited to the transformation of data; it also provides:
- Version control
- Quality Assurance
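For example, the quality-assurance side can be sketched with DBT's built-in schema tests, declared in YAML alongside the models and run with the `dbt test` command. The model and column names are illustrative:

```yaml
# models/schema.yml -- dbt test runs these checks inside the warehouse
version: 2
models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
```

Because the tests compile to SQL queries, they also run on the warehouse's compute, consistent with the ELT approach.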
Transformation can also occur externally using ETL tools before the data is loaded into Snowflake; however, with storage and compute priced separately, ELT provides more accurate and consistent data over time. Since Snowflake can handle massive storage, data can be loaded into Snowflake first and then transformed using DBT, reducing the load time and the time data spends in transit, making the system more effective.