How to Improve the Code Quality of Your dbt Models with Unit Tests and TDD

All you need to know to start unit testing your dbt SQL models

Pablo Porto
Towards Data Science


If you are a data or analytics engineer, you are probably comfortable writing SQL models and testing for data quality with dbt tests. You are proud of the modularisation and the neat SQL models you have created. Everything is fine for a while. But at some point, the transformation logic in your models starts growing and becoming more complex. You start looking at it and wondering: wouldn't it be great if I could isolate my models and create unit tests to verify and document this logic? That would definitely increase the code quality of your dbt codebase, right?

Your first thought is to look at how you could do this type of test with the built-in dbt data tests functionality. After a few searches on Google, you realize that one option is to create a custom-built unit testing framework leveraging the dbt seed functionality. Or you could introduce Python into your codebase to create unit tests with Pytest. That looks quite tedious, so you start wondering: wouldn't it be great to be able to mock the model inputs and make assertions on the transformed data using SQL?

If this resonates with you, you are not alone. This is the journey the data engineering team I am part of went through until we found the dbt-unit-testing library.

Mocking dbt sources and refs

With dbt-unit-testing you can test dbt models independently by mocking their dependencies. As per their documentation, it gives you:

  • Ability to mock dependencies
  • Ability to run each test independently
  • A fast feedback loop
  • Good test feedback (well, there are some gotchas as we will see)

This dbt library allowed us to mock our model dependencies and start implementing unit tests straight away. As I will share with you at the end, it even allowed us to start practicing test-driven development to develop our models.

Now that we introduced the library, let’s look at the bare minimum you need to know to start implementing real unit tests for your dbt SQL models.

Why should you unit test your models?

Some data teams focus on increasing the quality of the data processed by their data applications, but they often forget about the code quality of the software they are building.

Unit tests can help with this by providing a way to isolate the SQL models and verify complex business logic in isolation.

“A unit test exercises the smallest piece of testable software in the application to determine whether it behaves as expected” — Toby Clemson

In our case, the smallest piece of testable software is a dbt model. Being able to test a model in isolation ensures new changes don’t break the existing business logic and helps us document the expected behavior of the model.

A typical dbt application follows a layered architecture style with at least three layers: staging, intermediate, and mart. Each layer contains one or more models. These are the models we can test in isolation.

Creating our first unit test

Let’s look at a simple example. We have a dbt app called health-insights that takes weight and height data from upstream sources and calculates the metric body mass index.

The layered architecture of a typical dbt data app

The following model enriches a weight measurement with the latest height measurement recorded before it.

Example of a dbt intermediate model
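The original model is shown above as an image. As a rough sketch, a model like this could be written as follows; the column names (user_id, measured_at, weight_kg, height_cm) are assumptions for illustration, not the article's actual schema:

```sql
-- models/intermediate/int_weight_measurements_with_latest_height.sql (sketch)
with ranked as (
    select
        w.user_id,
        w.measured_at,
        w.weight_kg,
        h.height_cm,
        -- rank height measurements taken on or before the weight
        -- measurement, most recent first
        row_number() over (
            partition by w.user_id, w.measured_at
            order by h.measured_at desc
        ) as height_rank
    from {{ ref('stg_gym_app__weight') }} w
    left join {{ ref('stg_gym_app__height') }} h
        on h.user_id = w.user_id
        and h.measured_at <= w.measured_at
)

select user_id, measured_at, weight_kg, height_cm
from ranked
where height_rank = 1
```

The window function keeps only the most recent height recorded on or before each weight measurement, which is the "latest prior height" behavior the unit test needs to verify.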

Let’s now create a dbt unit test to prove the transformation logic is correct.

Example of a dbt unit test for an intermediate model
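The test itself appears above as an image. The sketch below illustrates the general shape of a dbt-unit-testing test; the model name comes from the article, while the mocked columns and values are invented for illustration:

```sql
-- tests/unit/int_weight_measurements_with_latest_height_test.sql (sketch)
{{ config(tags=['unit-test', 'unit-tests']) }}

{% call dbt_unit_testing.test('int_weight_measurements_with_latest_height',
                              'enriches weight with the latest prior height') %}

  {% call dbt_unit_testing.mock_ref('stg_gym_app__weight') %}
    select 1 as user_id, cast('2023-01-10' as date) as measured_at, 80 as weight_kg
  {% endcall %}

  {% call dbt_unit_testing.mock_ref('stg_gym_app__height') %}
    -- two heights: only the one measured before the weight should be picked
    select 1 as user_id, cast('2023-01-01' as date) as measured_at, 180 as height_cm
    union all
    select 1 as user_id, cast('2023-01-20' as date) as measured_at, 181 as height_cm
  {% endcall %}

  {% call dbt_unit_testing.expect() %}
    select 1 as user_id, cast('2023-01-10' as date) as measured_at,
           80 as weight_kg, 180 as height_cm
  {% endcall %}

{% endcall %}
```

Each mock_ref block replaces one upstream model with inline rows, and the expect block states the rows the model under test should produce from them.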

The building blocks

Looking at the previous test, we can see several of the dbt-unit-testing macros in use:

  • dbt_unit_testing.test: This macro allows us to define the model under test and the name of the test. In our example, we reference int_weight_measurements_with_latest_height.
  • dbt_unit_testing.mock_ref: This macro allows us to mock references to other dbt models. In our example, we are mocking the weight (stg_gym_app__weight) and the height (stg_gym_app__height) staging data.
  • dbt_unit_testing.expect: This macro allows us to assert on the result of the transformation. In the example, we assert that the weight measurement gets enriched with the latest height.

Running the tests

Let's now run our model’s unit test. We can call the usual dbt test command:

dbt test

Oops, that command runs the whole test suite, including other dbt data quality checks. But we want to run only our unit tests. No problem, we can leverage dbt's tags functionality to isolate our unit tests. In the example, we tagged our test with two tags:

{{ config(tags=['unit-test', 'unit-tests']) }}

The first one is a boilerplate tag required by the dbt-unit-testing library. The second one is the one we will use to execute our unit test.

dbt test --select tag:unit-tests

Exploring other types of tests

So far we have seen how we can write a unit test to verify the logic of a single model. After creating a few of these, our team started discussing the possibility of implementing new types of tests, as we usually do for operational software such as microservices.

“A component test limits the scope of the exercised software to a portion of the system under test, manipulating the system through internal code interfaces and using test doubles to isolate the code under test from other components.” — Toby Clemson

In a microservices context, a component is a service that exposes certain functionalities. If we apply the same concept to the data context, component tests for dbt apps can be implemented as tests that validate whether the dbt app provides the functionality that it promises to by mocking the data sources.

The usual test pyramid for an operational app

When implementing a component test, the scope of the test widens: we test our dbt app as a whole, mocking only its sources.

The component test scope

This type of test ensures that the different models integrate correctly and the expected data transformation outcome is created. Let’s look at an example:
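A sketch of such a component test follows; the model name body_mass_indexes and the tables raw_weight and raw_height come from the article, while the source name gym_app and the columns and values are assumptions for illustration:

```sql
-- tests/component/body_mass_indexes_test.sql (sketch; source name 'gym_app'
-- and column names are assumptions)
{{ config(tags=['unit-tests', 'component-test']) }}

{% call dbt_unit_testing.test('body_mass_indexes',
                              'calculates BMI from raw weight and height') %}

  {% call dbt_unit_testing.mock_source('gym_app', 'raw_weight') %}
    select 1 as user_id, cast('2023-01-10' as date) as measured_at, 80 as weight_kg
  {% endcall %}

  {% call dbt_unit_testing.mock_source('gym_app', 'raw_height') %}
    select 1 as user_id, cast('2023-01-01' as date) as measured_at, 180 as height_cm
  {% endcall %}

  {% call dbt_unit_testing.expect() %}
    -- BMI = 80 kg / (1.80 m)^2, rounded to two decimals
    select 1 as user_id, 24.69 as body_mass_index
  {% endcall %}

{% endcall %}
```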

In the component test above, we are testing our output model body_mass_indexes. The model uses the enriched weight measurements to calculate the body mass index of the user. We mock the sources directly (raw_weight and raw_height) with the dbt_unit_testing.mock_source macro. Finally, we assert on the final transformation of the output model, verifying that the body mass index (BMI) is calculated correctly.

We can also run this type of test in isolation using the tag name we specified in the test configuration.

dbt test --select tag:component-test

Test-driven development (TDD) in SQL, why not?

Now that we have the ability to test our models in isolation, what if we start by writing the tests before we write any transformation logic?

Test-driven development or TDD is a software engineering practice that helps improve the design of the code by forcing the developers to write a test first and then write the minimum amount of code to make that test pass.

Our data team had experience in applying TDD in operational systems so we decided to give it a try.

Starting by defining the outcomes of a given transformation in a test felt quite natural. What BMI would I expect given this weight and this height as input? Let's write a test for that. After practicing TDD for a while, the team still continues to use this technique when adding new business logic to the transformations.

Gotchas

I have just pictured a perfect scenario where you can add the unit testing dbt package and start creating tests straight away. The truth is that the library is still under development, and we found some gotchas that you should probably be aware of:

  • The dbt-unit-testing macros break the principle that testing code should not pollute production code. There is an easy hack to mitigate this: you can create a macro to patch the original ref() and source() and call the testing macros. You can see an example here.
  • We found that sometimes changes in a test seem not to be picked up. There is an option to disable caching, but we haven't tried it.
  • When mocking sources, you still need to define the source in a dbt .yml file; otherwise, the project doesn't compile.
  • Sometimes the test error messages are quite cryptic. In this situation, we found ourselves looking at the compiled SQL code in the build folder.
  • Also be aware of the other limitations listed in the library docs.
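For the first gotcha, the patch suggested in the library's documentation is a pair of project-level macros that shadow dbt's built-in ref() and source() and delegate to the testing versions, roughly:

```sql
-- macros/ref.sql
{% macro ref(model_name) %}
  {{ return(dbt_unit_testing.ref(model_name)) }}
{% endmacro %}

-- macros/source.sql
{% macro source(source, table_name) %}
  {{ return(dbt_unit_testing.source(source, table_name)) }}
{% endmacro %}
```

Outside of test runs these delegate back to the normal dbt behavior, so production compilation is unaffected; check the library's README for the exact, up-to-date version of this pattern.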

Conclusions

We have seen how we can add unit and component tests to our dbt projects to increase the code quality and therefore the maintainability of our transformation logic. We also saw how we can tag the different types of tests so that we can run them in isolation, both locally and in the CI/CD pipeline. Finally, we looked at how we can practice TDD by writing the tests before the transformation logic.

I hope this article helps you and your team start adopting unit tests and creating more maintainable and scalable dbt apps as your codebase scales to fulfill new data use cases.

If you are curious, you can check out a fully functional example in this GitHub repo. I have also prepared some katas in case you want to practice TDD and unit testing with a simple example.

Are you ready to give it a try?

This article is part of a series of articles I am writing on testing data pipelines and data products.

In my upcoming article, I will discuss data quality checks and how they can be implemented with dbt. I would greatly appreciate your feedback and thoughts on this topic. To make sure you don't miss the article, follow me or subscribe to receive an email.

Thanks to my Thoughtworks colleagues Manisha and David for taking the time to review early versions of this article. Thanks to the maintainers of the dbt-unit-testing package for their great work.

All images unless otherwise noted are by the author.
