A Complete Guide to Effectively Scale your Data Pipelines and Data Products with Contract Testing and dbt

All you need to know to start implementing contract tests with dbt

Pablo Porto
Towards Data Science


Photo by Jonas Gerg on Unsplash

Let me tell you a story about data management systems and scale that will probably resonate with you if you are a data or analytics engineer trying to do your best work in 2023.

Not too long ago, almost all data architectures and data team structures followed a centralized approach. As a data or analytics engineer, you knew where to find all the transformation logic and models because they were all in the same codebase. You probably worked closely with the colleague who built the data pipeline you were consuming. There was only one data team, two at most.

This approach was effective for small organizations and startups with limited data sources and use cases. It also worked for large enterprises not fully focused on extracting value from data at scale. However, as organizations prioritized being data-driven, there was an increased need for more machine learning, analytics, and business intelligence data use cases.

Centralized data architecture developed and maintained by one team

The proliferation of use cases and data sources increased the complexity of managing data and the number of people needed to create and maintain data systems. To meet these needs, the latest version of your company's data strategy may have moved towards decentralization. This includes forming decentralized data teams and adopting decentralized data architectures like Data Mesh.

Decentralization allows organizations to scale data management but brings new challenges in ensuring the coordination of different components, such as data products and pipelines developed and managed by various teams.

In this type of architecture and organizational structure, it often becomes unclear who is accountable for each component, resulting in issues and blame shifting. The number of integration points between teams also increases, and maintaining working interfaces between the different components becomes harder.

Decentralized data architecture with multiple teams and multiple components

If you can relate to this situation, you are not alone. Your organization may be undergoing the decentralization of data. To navigate this transition, we can learn from successful implementations of decentralization and distributed architectures like microservices in the operational world. How did they do it? How did they manage to provide reliable systems at that scale? Well, they leveraged modern testing techniques.

“For years, software engineering has successfully embraced the concept of small units of work performed by ‘two-pizza teams’. Each team owns its own component of a larger system. Teams integrate with one another through well-defined, versioned interfaces. Sadly, data has yet to catch up with software. Monolithic data architecture is still the norm — even though there are clear drawbacks.” — dbt labs

In this article, I will introduce one of those techniques: contract testing. I will show how you can use dbt to create simple contract tests for your upstream sources and your dbt models’ public interfaces. This type of test will keep you sane as your dbt apps become more complex and decentralized.

But… what is a contract test?

When a distributed system grows into multiple components developed by several teams, the first approach teams often try in order to verify that the system behaves as expected is to implement end-to-end tests that exercise the system as a whole.

End to end test scope focused on verifying the system as a whole

End-to-end tests often become very hard to work with due to their complexity, slow feedback, and the effort needed to maintain and orchestrate them.

That was the case in the operational world when implementing microservices at scale. When testing the system as a whole was no longer an option, engineering teams started adopting different approaches, like contract testing.

“An integration contract test is a test at the boundary of an external service verifying that it meets the contract expected by a consuming service.” — Toby Clemson

Teams can still keep a small set of end-to-end tests, but they move down the testing pyramid by verifying the system with faster and more reliable tests like contract, component, and unit tests.

The trade-offs of different test types are often visualized with a testing pyramid. I mentioned this concept in my previous article about implementing unit testing for dbt models.

Typical test pyramid for operational systems

If we apply the same concept to data management systems, contract tests for dbt apps can be implemented to verify the behavior of two types of interfaces:

  • The upstream sources.
  • The public interfaces, like marts and output ports.
Contract tests scope

Benefits of contract testing for data systems

As we have seen, data architectures are becoming more complex and decentralized, as once happened with operational services. As this type of system continues to scale, the ability to run maintainable and effective end-to-end test suites diminishes.

Contract testing becomes a powerful ally in managing this complexity by providing several advantages:

  • Reducing the number of end-to-end tests needed to verify the system’s behavior, leading to faster feedback and lower maintenance costs.
  • Managing the complexity of having separate data teams working in the same codebase by setting clear expectations for each team’s public interfaces.
  • Surfacing integration issues between components in lower environments before they reach production.
  • Better defining and documenting the interfaces between the different data pipelines or data products.

Contract tests vs data quality tests

You may be thinking: but… the contract test concept sounds like the quality tests we are already running in our data pipelines.

That is a fair observation, as there is a blurred line between the scopes of contract tests and data quality tests. I like to think of contract tests as a subset of quality tests within a modern data testing strategy.

Contract tests can be considered a subset of quality tests

The difference is that a contract test looks at the schema and constraints, whereas a data quality test looks at the actual data and its characteristics. Let’s look at some examples.

Contract tests scope:

  • Checking column types.
  • Checking expected constraints at the schema level, like primary keys, foreign keys, and not-null columns.
  • Checking accepted values for a given column.
  • Checking valid ranges for a given column.

Quality tests scope:

  • Assessing completeness, e.g. percentage of not nulls in a column.
  • Assessing uniqueness, e.g. number of rows that are not unique.
  • Assessing consistency, e.g. all user identifiers in the source are included in the output.

Implementing our first contract test

Okay, enough theory. Let’s get into action with a simple example. We have a dbt app called health-insights that takes weight and height data from upstream data sources and calculates the body mass index metric.

Our colleagues from the amazing backend team are in charge of producing the weight and height data we need to build our health-insights app. They work in a different team, one that is a bit busy and stressed, and sometimes they fail to notify us of schema changes. To catch these changes in the upstream interfaces, we decided to create our first source contract test.

The system architecture of our example

First, we need to add two new dbt packages, dbt-expectations and dbt-utils, that will allow us to make assertions on the schema of our sources and the accepted values.

# packages.yml

packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1

  - package: calogica/dbt_expectations
    version: 0.8.5
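
Once the packages are declared, installing them is a single command:

dbt deps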

Testing the data sources

Let’s start by defining a contract test for our first source. We pull data from raw_height, a table that contains height information from the users of the gym app.

We agree with our data producers that we will receive the height measurement, the units for the measurements, and the user ID. We agree on the data types and that only ‘cm’ and ‘inches’ are supported as units. With all this, we can define our first contract in the dbt source YAML file.
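
A contract for the raw_height source could look something like the following sketch. The file path, column names (such as measurement_unit), and data types are illustrative assumptions; the assertions mirror the building blocks described below.

# models/staging/sources.yml (illustrative sketch)

version: 2

sources:
  - name: gym_app
    tables:
      - name: raw_height
        columns:
          - name: height
            tests:
              - not_null:
                  config:
                    tags: ['contract-test-source']
              - dbt_expectations.expect_column_values_to_be_of_type:
                  column_type: integer
                  config:
                    tags: ['contract-test-source']
              - dbt_utils.accepted_range:
                  min_value: 0
                  config:
                    tags: ['contract-test-source']
          - name: measurement_unit
            tests:
              - accepted_values:
                  values: ['cm', 'inches']
                  config:
                    tags: ['contract-test-source']
          - name: user_id
            tests:
              - not_null:
                  config:
                    tags: ['contract-test-source']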

The building blocks

Looking at the previous test, we can see several of the dbt-expectations, dbt-utils, and built-in dbt assertions in use:

  • dbt_expectations.expect_column_values_to_be_of_type: This assertion allows us to define the expected column data type.
  • accepted_values: This assertion allows us to define a list of the accepted values for a specific column.
  • dbt_utils.accepted_range: This assertion allows us to define a numerical range for a given column. In the example, we expect the column’s value not to be less than 0.
  • not_null: Finally, built-in assertions like not_null allow us to define column constraints.

Using these building blocks, we added several tests to define the contract expectations described above. Notice also how we have tagged the tests as “contract-test-source”. This tag allows us to run all contract tests in isolation, both locally and, as we will see later, in the CI/CD pipeline:

dbt test --select tag:contract-test-source

Implementing contract tests for marts and output ports

We have seen how quickly we can create contract tests for the sources of our dbt app, but what about the public interfaces of our data pipeline or data product?

As data producers, we want to make sure we are producing data according to the expectations of our data consumers so we can satisfy the contract we have with them and make our data pipeline or data product trustworthy and reliable.

A simple way to ensure that we are meeting our obligations to our data consumers is to add contract testing for our public interfaces.

dbt recently released a new feature for SQL models, model contracts, that allows you to define the contract for a dbt model. While building your model, dbt verifies that the model’s transformation produces a dataset matching its contract, or the model fails to build.

Let’s see it in action. Our mart, body_mass_indexes, produces a BMI metric from the weight and height measurement data we get from our sources. The contract with our consumers establishes the following:

  • Data types for each column.
  • User IDs cannot be null.
  • User IDs are always greater than 0.

Let’s define the contract of the body_mass_indexes model using dbt model contracts:
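
A sketch of such a model specification could look like this. The column names and data types are assumptions based on the example; the syntax follows dbt’s model contract feature.

# models/marts/body_mass_indexes.yml (illustrative sketch)

version: 2

models:
  - name: body_mass_indexes
    config:
      contract:
        enforced: true
    columns:
      - name: user_id
        data_type: int
        constraints:
          - type: not_null
          - type: check
            expression: "user_id > 0"
      - name: measured_at
        data_type: date
      - name: weight
        data_type: numeric
      - name: height
        data_type: numeric
      - name: body_mass_index
        data_type: numeric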

The building blocks

Looking at the previous model specification file, we can see several metadata properties that allow us to define the contract:

  • contract.enforced: This configuration tells dbt that we want to enforce the contract every time the model is run.
  • data_type: This assertion allows us to define the column type we are expecting to produce once the model runs.
  • constraints: Finally, the constraints block gives us the chance to define useful constraints such as not-null columns, primary keys, and custom expressions. In the example above, we defined a constraint to tell dbt that the user_id must always be greater than 0. You can see all the available constraints here.

Source contract tests vs dbt model contracts

A difference between the contract tests we defined for our sources and the ones defined for our marts or output ports is when the contracts are verified and enforced.

dbt enforces model contracts while the model is being built by ‘dbt run’, whereas contracts based on dbt tests are only enforced when the dbt tests run.

If one of the model contracts is not satisfied, you will see an error when you execute ‘dbt run’ with specific details on the failure. You can see an example in the following dbt run console output.

1 of 4 START sql table model dbt_testing_example.stg_gym_app__height ........... [RUN]
2 of 4 START sql table model dbt_testing_example.stg_gym_app__weight ........... [RUN]
2 of 4 OK created sql table model dbt_testing_example.stg_gym_app__weight ...... [SELECT 4 in 0.88s]
1 of 4 OK created sql table model dbt_testing_example.stg_gym_app__height ...... [SELECT 4 in 0.92s]
3 of 4 START sql table model dbt_testing_example.int_weight_measurements_with_latest_height [RUN]
3 of 4 OK created sql table model dbt_testing_example.int_weight_measurements_with_latest_height [SELECT 4 in 0.96s]
4 of 4 START sql table model dbt_testing_example.body_mass_indexes ............. [RUN]
4 of 4 ERROR creating sql table model dbt_testing_example.body_mass_indexes .... [ERROR in 0.77s]

Finished running 4 table models in 0 hours 0 minutes and 6.28 seconds (6.28s).

Completed with 1 error and 0 warnings:

Database Error in model body_mass_indexes (models/marts/body_mass_indexes.sql)
new row for relation "body_mass_indexes__dbt_tmp" violates check constraint
"body_mass_indexes__dbt_tmp_user_id_check1"
DETAIL: Failing row contains (1, 2009-07-01, 82.5, null, null).
compiled Code at target/run/dbt_testing_example/models/marts/body_mass_indexes.sql

Running the contract tests in the pipeline

By now we have a suite of powerful contract tests, but how and when do we run them?

We can run contract tests in two types of pipelines.

  • CI/CD pipelines
  • Data pipelines

For example, you can execute the source contract tests on a schedule in a CI/CD pipeline targeting the data sources available in lower environments like test or staging. You can set the pipeline to fail every time the contract is not met.
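
As an illustration, a scheduled job in GitHub Actions could look something like the sketch below. The workflow name, schedule, adapter, and target are all assumptions and will depend on your setup.

# .github/workflows/source-contract-tests.yml (illustrative sketch)
name: source-contract-tests

on:
  schedule:
    - cron: "0 6 * * *"   # run every morning against the staging sources

jobs:
  contract-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - name: Install dbt and packages
        run: |
          pip install dbt-postgres   # assumption: Postgres adapter
          dbt deps
      - name: Run source contract tests against staging
        run: dbt test --select tag:contract-test-source --target staging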

These failures provide valuable information about contract-breaking changes introduced by other teams before those changes reach production.

You can also run your output port/mart contract tests each time you deploy a new change via the CI/CD pipeline. Since dbt model contracts are checked every time the model is built, enforcing the contract means that if a new change breaks it, your team gets notified before your data consumers are impacted.

Finally, you can also run your source and output port/mart contract tests in your data pipelines in production. Running contract tests in production can help your team understand if a data pipeline failed because one of the upstream dependencies broke the contract or because the data you are producing is not meeting the contract with your downstream consumers.
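
A minimal sketch of how such a production run could be ordered (orchestration details omitted; the tag and target names follow the examples above):

dbt test --select tag:contract-test-source    # did an upstream producer break the contract?
dbt run                                       # model contracts are enforced while the models are built
dbt test --exclude tag:contract-test-source   # remaining data quality tests on the data we produced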

Additional tips to get started

  • Start small, testing the integration points that are most brittle and prone to failure.
  • Apply the tolerant reader pattern when implementing contract tests. Only assert on the data you need.
  • Tweak the behavior of the contract tests based on your needs; you can configure the severity attribute to make them fail loudly or just throw a warning (see the sketch after this list).
  • Integrate these types of tests with modern data observability tools like Monte Carlo so they are part of your incident management process.
  • Leverage dbt contract tests even if your data systems are not developed with dbt. You can still define source contract tests in dbt and execute them against tables or files created with other frameworks or plain SQL.
  • Consider more advanced contract testing techniques, like consumer-driven contracts, which could make it easier to implement contract testing in specific contexts.
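
For example, the severity of a contract test can be downgraded so that a violation raises a warning instead of failing the run. A minimal sketch, reusing the hypothetical height column from the source contract above:

# models/staging/sources.yml (fragment of the sketch above)
      - name: height
        tests:
          - dbt_expectations.expect_column_values_to_be_of_type:
              column_type: integer
              config:
                severity: warn
                tags: ['contract-test-source']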

Conclusion

We have seen how testing strategies for data systems can also benefit from testing techniques like contract testing as these systems become more decentralized and grow in complexity.

We also saw how to start implementing contract tests by leveraging dbt built-in features and additional dbt packages. We applied this type of test to two integration points: upstream data sources and data marts/output ports.

I hope this article gives you and your team the tools and tips to start implementing contract tests as your data systems scale to fulfill new data use cases. If you are curious, you can check the source code of the example dbt application in this GitHub repo.

Are you ready to give it a go and start your contract testing journey? I would love to hear your thoughts and experience in the comments.

In my upcoming article, I will discuss the final component in my series on testing data products and data pipelines: data quality checks and how they can be implemented with dbt. I would greatly appreciate your feedback and thoughts on this topic. To make sure you don’t miss out, follow me or subscribe to receive an email.

Thanks to my Thoughtworks colleagues Arne, Manisha and David for taking the time to review early versions of this article. Thanks to the maintainers of the dbt-expectations package for their great work.

All images unless otherwise noted are by the author.
