Data observability is still very much in its nascent stages in data engineering. Enterprises have started to see the benefit as they gather seemingly limitless streams of data from a growing number of sources and build an ecosystem of data storage, data pipelines, and analytics packages.
As the layers of complexity increase, so does the challenge of sustaining data quality and minimizing data downtime. Observability is a long-established term in software engineering: it describes how well the internal state of a system can be inferred from its outputs. Data quality has also been around for a long time in legacy systems, with vendors such as Informatica, SAP, Talend, and Microsoft shipping their own data quality tools, but data observability has never been widely considered part of the data engineering process.
dbt (data build tool) is an open-source data transformation tool that runs on top of modern cloud data warehouses and is well suited to implementing data engineering pipelines. We adore dbt for the value it offers, including the ability to extend it with packages.
Monte Carlo has an informative article that explains the five pillars of data observability.
While there are many commercial tools and open source frameworks, such as Great Expectations, that bring data quality checks into the data engineering process, I wanted to explore how we can implement data observability with just core dbt.
- Freshness: Freshness describes how recently the data was updated and how often it should be updated. In dbt, source freshness is checked with the dbt source freshness command. The timestamp column to check is declared as loaded_at_field on the source definition in schema.yml. Warning and error thresholds are given as a count and a time period (for example, 12 hours), and an optional filter can limit the rows scanned for performance.
- Distribution: Distribution indicates whether the data values fall within the expected range. In dbt, this can be achieved with the dbt_profiler package, which captures profile information about a table in the docs, where it can be refreshed on every run.
- Volume: Volume refers to the completeness of the data and the health of the data source. A bespoke dbt test (a singular test, formerly called a data test) can run as part of the pipeline to validate row counts, or the audit_helper package can be used to compare column values across tables.
- Schema: Schema checks track changes in the organization of the data in a table. dbt is well known for its schema testing. It has four built-in generic tests (unique, not_null, accepted_values, and relationships) that are declared in schema.yml. Custom generic tests and the dbt_utils package make it easy to implement other common schema tests.
- Lineage: Data lineage is the process of understanding, recording, and visualizing data as it flows from data sources to consumption. Lineage is part of dbt docs, which uses sources.yml and schema.yml to produce a consumable HTML-based report and a data dependency graph.
- Great Expectations: Great Expectations offers a robust way to evaluate assertions about data through “expectations”. For dbt, the dbt_expectations package offers similar capabilities, bringing expectation-style data tests into the dbt pipeline.
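The freshness and schema checks described above can be sketched in a single properties file. This is a minimal illustration rather than a definitive setup: the source, table, model, and column names (raw_shop, orders, stg_orders, stg_customers, _loaded_at) are hypothetical, the thresholds are arbitrary, and the filter expression is warehouse-specific SQL that may need adjusting for your platform.

```yaml
# models/schema.yml (hypothetical path and names throughout)
version: 2

sources:
  - name: raw_shop                 # hypothetical source name
    schema: shop                   # hypothetical schema
    loaded_at_field: _loaded_at    # timestamp column used for freshness
    freshness:
      warn_after: {count: 12, period: hour}    # warn if data is older than 12 hours
      error_after: {count: 24, period: hour}   # fail if data is older than 24 hours
      # Optional: limit the rows scanned; the SQL dialect here is an assumption
      filter: "datediff('day', _loaded_at, current_timestamp) < 2"
    tables:
      - name: orders

models:
  - name: stg_orders               # hypothetical model
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'returned']   # illustrative values
      - name: customer_id
        tests:
          - relationships:
              to: ref('stg_customers')   # hypothetical parent model
              field: customer_id
```

With a file like this in place, dbt source freshness evaluates the thresholds and dbt test runs the four built-in generic tests. Recent dbt versions also accept data_tests: in place of the tests: key.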
Other notable dbt packages include dbt-ml-preprocessing, which helps clean and standardize data sets in the cloud data warehouse without external libraries. The re_data package supports data preparation through cleaning, filtering, normalization, and validation, along with data monitoring through metrics and alerts. Finally, dbt_meta_testing contains macros that assert test and documentation coverage based on the project configuration.
|Pillar|dbt features and packages|
|---|---|
|Freshness|dbt source freshness, dbt_expectations|
|Distribution|dbt_profiler, dbt-ml-preprocessing, dbt_expectations|
|Volume|Bespoke test, audit_helper|
|Schema|Generic tests, dbt_utils|
|Lineage|dbt docs, dbt_meta_testing|
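Where the table lists dbt_expectations against freshness and distribution, the package-based checks might look like the sketch below. It assumes the dbt_expectations package is already installed through packages.yml; the model and column names (fct_orders, created_at, order_total) and all thresholds are hypothetical.

```yaml
# models/schema.yml (hypothetical names and thresholds)
version: 2

models:
  - name: fct_orders
    columns:
      - name: created_at
        tests:
          # Freshness: expect at least one row from the last day
          - dbt_expectations.expect_row_values_to_have_recent_data:
              datepart: day
              interval: 1
      - name: order_total
        tests:
          # Distribution: expect values inside an assumed plausible range
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 10000
```

These run alongside the built-in generic tests during dbt test, so package-based and core checks share one pipeline step.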
Data observability is still relatively new in modern data engineering and has not yet been widely adopted, but it is crucial. It will be interesting to see how these pieces come together and perform in the real world.