Does reordering the data engineering process create a real difference? Is reordering the process the only difference between ETL (Extract, Transform and Load) and ELT (Extract, Load and Transform)? Is ETL fading away? And ELT is the new way to the cloud? If these are the questions you seek answers to, you are in the right place!
Before we dive into how the ELT process is becoming the new norm, we should understand how all the things surrounding data integration and transformation (i.e., Data Engineering) have changed. When ETL processing became the mainstream data pipeline, about three decades ago, the data warehouse was the only common target system. We had only structured data, very few operational systems, a minimal volume of data, and the cost of computation and storage were sky-high. In the last couple of decades, things have changed drastically. We have seen an evolution in processing, storage, network, mobile devices, etc., that includes the shift from schema-on-write to schema-on-read, where typically data flows were streamlined to fit into existing structured data marts, and changing the same would require a lot of computing processes. With the cloud coming in with elasticity, it becomes possible to store the data in its native form and perform transformations on the read operations where we have logical structures that define only the target structure of the data but do not hold any data and get filled at run time.
Finally, data has also evolved; we process not only structured data but also unstructured and semi-structured data. With the introduction of cloud data lakes and warehouses (aka data lakehouse) that can act as both data lake and data warehouse, the need for improvements in ETL processing has significantly increased.
Evolution of ELT
In a typical ETL process:
- Data is pulled from the multiple data sources to staging and then into the warehouse.
- Transformations are completed before the data is loaded into the warehouse.
- Most of the time, the data extraction from source systems happens during an off-peak time of the day and in batches.
- Only the relevant fields or tables that are needed for analytics are fetched to staging and processed.
- Changes in the source structure mandate changes in the coding as ETL expects the columns and data to be in the same sequence and order.
- Primarily, a single tool was used to perform “E,” “T,” and “L” operations in the ETL; it was like one size fits all.
In a typical ELT process:
- Data is pushed from multiple data sources to the data lake using data integration tools
- Data is then transformed into a data warehouse/mart using data transformation tools.
- Data load doesn’t happen in batches but is pushed from the source systems, so real-time analytics is possible.
- Transformation happens within the cloud lakehouse that separates storage and compute resources, allowing processing of data exponentially faster than reading and loading the data using an ETL process.
- Data lake (aka RAW DB) within the cloud lakehouse solutions act as the sources for the warehouse. Almost all data from the source systems are pushed to the data lake.
- Any changes or addition to the fields in the warehouse is also easy as data exists in the data lake
- Many lakehouses support both semi-structured and unstructured data. So indeed, data lakehouse is a single source of truth.
Factors that led to switching of ELT
- ELT is a compute-intensive process, but it happens in a highly robust, powerful, and scalable cloud lakehouse.
- Most cloud lakehouses like Azure Synapse, Snowflake, and Google BigQuery are columnar databases, so index and record search operations are much faster.
- Almost all the cloud lakehouses do massively parallel processing, so the queried transformations are carried out in parallel and not successively, with multiple nodes running multiple transformations simultaneously.
Does that mean as we advance, ELT will be the go-to option? Mostly yes, unless If the data will always be similar to the pre-defined formats and coming from only a small number of source systems and the transformations are minimal, existing ETL and on-premises infrastructure will still be a justified cost-effective strategy. Also, when working with data masking and sensitive data, ETL is preferred over the ELT process. ELT approaches introduce new processes like the DevOps, additional layers like staging, enhance, mart, reporting, and star schema while processing in the lakehouse and tight integration of Data Science (AI & ML), and processing of semi-structured and unstructured data.
As you can see, the difference between ETL and ELT is not just the reordering of processing alone but also how we implement it. But the fundamental rationale why ETL was built still exists today, so we need to ensure ELT is used with the understanding of what bottleneck ETL process brings and also leverage the power of cloud lakehouse while maintaining a fine balance of costs and resource optimization.
Reach out to us here today if you are interested in evaluating which ETL or ELT tool is right for you.