3 Simple Ways To Build dbt Python Models

By: Manjiri Gaikwad | Published: March 31, 2023


dbt’s extensive features have made it the central hub for transformations. It is an open-source tool that data analysts and engineers use to transform data in a data warehouse efficiently. You can not only transform data but also test, deploy, and document your transformations.

To transform data, you need to build dbt models, which are simple SQL SELECT statements. Since SQL is popular among professionals, dbt became the go-to choice for users. dbt also supports Python models to allow complex data science and machine learning workflows. 

Let’s understand ways to build dbt Python Models.

Why Run Python in dbt

The dbt tool was initially developed to perform transformations in SQL. However, data engineers couldn’t solve all their use cases with SQL alone. Although SQL is simple to understand and query with, it has limited capabilities for machine learning workflows.

Python contains packages and libraries that allow data engineers to perform transformations efficiently. Along with transformations, Python can also incorporate data science techniques. Therefore, Python extends the capabilities of SQL data models and has become a preferred choice among data engineers.

What is a dbt Python Model?

Unlike dbt SQL models, dbt Python models are defined in .py files instead of .sql files. A Python model is a function named model() that returns a dataframe. A dataframe is a 2D table of rows and columns, like a spreadsheet. With Python models, you can reference dbt sources, apply transformations, and return the transformed data.

In dbt, SQL models return the output of a SELECT statement, whereas Python models return a final dataframe, and the type of dataframe depends on the data platform. Python dbt models are supported across three data platforms: Snowflake, Databricks, and BigQuery. Currently, dbt supports Python via three adapters: dbt-snowflake, dbt-bigquery, and dbt-databricks.

Defining Python Models

Whenever data engineers or analysts use dbt Python models, they need cloud services to run them, because dbt Python models cannot run on your local machine; they run on the data platform instead. Data platform services are optimized for big data processing workloads and can run distributed data processing tasks across multiple nodes in a cluster.

As a result, before creating dbt Python models, you need an adapter for a data platform that supports a fully featured Python runtime.

In the models/ folder, every Python model defines a function called model() that takes two parameters:

  • dbt: It is a class compiled by dbt Core, which allows you to run Python code in the context of your project.
  • session: It refers to the data platform’s connection.

In outline, every dbt Python model follows the same basic structure.

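Here is a minimal sketch of that structure; the upstream model name is illustrative, and the exact dataframe type depends on your platform:

def model(dbt, session):
    # dbt: a class compiled by dbt Core, used to reference models/sources and set configs
    # session: the connection to your data platform (e.g., a Snowpark or Spark session)
    dbt.config(materialized="table")

    # reference an upstream model and get a dataframe back (name is illustrative)
    upstream_df = dbt.ref("my_upstream_model")

    # ...apply transformations here...
    final_df = upstream_df

    # a Python model must return a single dataframe
    return final_df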

In dbt Python models, you can reference other models and sources using the dbt.ref() and dbt.source() functions. Python models only support two materialization options: table and incremental.

When you run a dbt Python model using the command dbt run --select python_model, dbt will prepare and pass in both arguments, dbt and session.

When defining a dbt Python model function, you can also define other functions in the same file and use them within the model, as shown in the example below.

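A sketch of that pattern, assuming a pandas-style dataframe with a value column (the model and column names are illustrative):

def add_one(x):
    # a helper function defined in the same .py file as the model
    return x + 1

def model(dbt, session):
    dbt.config(materialized="table")

    # illustrative upstream model returning a pandas-style dataframe
    df = dbt.ref("my_upstream_model")
    df["value_plus_one"] = df["value"].apply(add_one)
    return df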

Here, add_one() is a custom function that is referenced in the Python model.

Note: Currently, functions defined in one dbt Python model cannot be imported and reused in other models.

Referencing Models

You can reference other models in a Python model using the dbt.ref() method, just as you use ref() in SQL models. If you want to read data directly from a source table, use dbt.source() instead.

Both methods return dataframes pointing to the corresponding upstream source, model, snapshot, or seed. The dbt.ref() and dbt.source() methods are used as shown in the example below.

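A sketch of both calls; the source and model names are illustrative:

def model(dbt, session):
    dbt.config(materialized="table")

    # read directly from a source defined in your sources .yml file
    raw_orders = dbt.source("raw_data", "orders")

    # reference another dbt model upstream of this one
    stg_customers = dbt.ref("stg_customers")

    # ...join or transform the two dataframes as needed...
    return stg_customers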

You can also use the ref() function to reference a dbt Python model in downstream SQL models.

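For instance, a downstream .sql model might simply select from the Python model; the model name here is illustrative:

-- models/downstream_model.sql
select *
from {{ ref('my_python_model') }}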

Configuring dbt Python Models

There are three ways to configure dbt Python models:

  1. You can configure many models at once with dbt_project.yml.
  2. In the models/ directory, you can configure dbt Python models with .yml files.
  3. You can configure a model .py file by calling the dbt.config() method inside it.

You can set the configurations for your Python models by calling the dbt.config() method. The dbt.config() method is similar to the {{ config() }} macro in .sql model files. The dbt.config() method accepts only literal values like strings, booleans, and numbers. It cannot support any other data structure because dbt statically analyzes the arguments to config() while parsing the dbt model, without executing the Python code.
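As a sketch, in-model configuration might look like this; the values are illustrative literals:

def model(dbt, session):
    # only literal values (strings, booleans, numbers) are allowed here, because dbt
    # statically parses these arguments without executing the Python code
    dbt.config(
        materialized="incremental",
        tags="python_models",
    )
    df = dbt.ref("my_upstream_model")
    return df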

Limitations of dbt Python Models

Although dbt Python models have capabilities that SQL models lack, they also have some disadvantages compared to SQL models:

  • dbt Python models are supported on only three data platforms: Snowflake, Databricks, and BigQuery. 
  • dbt users often find it easier to write SQL that is performant at scale; translating well-written SQL models into Python can make a model slower. 
  • There are several syntax differences between dbt SQL and Python-based models. Python offers many more ways to express the same logic: if there are five ways to write something in SQL, there can be 500 ways to write it in Python, with varying performance and standards. So many options can be overwhelming and confusing for dbt users.

Data Platforms Using dbt Python Models

Since dbt supports Python through three data platforms (Snowflake, BigQuery, and Databricks), let’s understand each of them in detail.

How to Build dbt Python Models in Snowflake

Follow the steps mentioned below to build dbt Python models in Snowflake:

  • Prerequisites
    • A Snowflake account.
    • A Snowflake database named DEMO_DB.
    • A Snowflake user with appropriate permissions.
    • Anaconda installed on your machine.
    • dbt installed on your machine.
    • Your favorite IDE.
  • After setting the prerequisites, you need to run the dbt init command to initialize the project. 
  • Follow the prompts to create the dbt project, entering the values appropriate to your environment. For the database, use DEMO_DB from the prerequisites, and choose the schema you want dbt to build into.
  • You will end up with the example models that dbt init generates.
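Assuming the defaults, the relevant part of the starter project looks roughly like this (the project name is whatever you entered during dbt init):

your_project/
├── dbt_project.yml
└── models/
    └── example/
        ├── my_first_dbt_model.sql
        ├── my_second_dbt_model.sql
        └── schema.yml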
  • Open a terminal, move to the dbt project directory, and execute dbt run to verify that everything is configured correctly.
  • After creating the dbt project, you will now create a simple Python model. In the models/example folder, create a new file named ‘my_first_python_model.py’ and copy the code below into it.
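As a minimal sketch, the model can simply build on the starter example model that dbt init generates (my_first_dbt_model); any further transformation logic is up to you:

def model(dbt, session):
    # dbt creates the Snowpark session for you; no explicit Snowpark import is needed
    dbt.config(materialized="table")

    # reference the starter SQL model generated by dbt init; returns a Snowpark dataframe
    df = dbt.ref("my_first_dbt_model")
    return df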
  • Save the file and execute the dbt run command. If everything runs correctly, you have successfully built your first dbt Python model in Snowflake. You do not have to import the Snowpark Python library explicitly; dbt does that for you.

How to Build dbt Python Models in Databricks

Follow the steps mentioned below to build a dbt Python model in Databricks.

  • Create a virtual environment to isolate the project’s packages and avoid conflicts. Run the commands below to create an empty directory named dbt_project and move into it:
mkdir ~/dbt_project
cd ~/dbt_project
  • Create a file named Pipfile with the content shown below. The Pipfile is used by pipenv to manage the project and library dependencies:
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
dbt-databricks = "*"

[requires]
python_version = "3.9.12"
  • Now, install the packages defined in the Pipfile, including the dbt-databricks adapter package. dbt-databricks automatically installs dbt Core and the other required library dependencies:
pipenv install
  • Activate the virtual environment using the following command:
pipenv shell

  • After the initial setup and configuration of the virtual environment, your next step is to create a dbt project. In the activated virtual environment, run the dbt init command followed by the name of your dbt project:
dbt init my_first_dbt_project
  • When prompted, enter 1 to select Databricks, then provide the HTTP path, personal access token, and default schema name, and enter 4 when asked for the number of threads.
  • To verify the Databricks connection settings, run the following commands:
dbt debug --config-dir
open /Users/<your-username>/.dbt
  • Now, verify the project setup and connection by running the following command:
dbt debug

  • Create a dbt Python model. In Databricks, you define a Python model in a .py file like the example shown below.
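A sketch of such a model, saved for example as models/my_first_python_model.py; the upstream model name is illustrative:

import pyspark.sql.functions as F

def model(dbt, session):
    dbt.config(materialized="table")

    # on Databricks, dbt.ref() returns an Apache Spark dataframe
    df = dbt.ref("my_first_dbt_model")

    # illustrative transformation using PySpark functions
    return df.withColumn("loaded_at", F.current_timestamp())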

In Databricks, dataframes are Apache Spark dataframes by default, so you need to use PySpark to work with them.

Import PySpark as follows:

import pyspark.sql.functions as F

In Databricks, you specify the cluster on which the Python code will run inside the config function. You can also make your runs faster by setting the ‘create_notebook’ parameter to ‘False.’

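A sketch of that configuration; the cluster id is a placeholder, and the parameter names follow the dbt-databricks adapter but should be verified against dbt’s documentation:

def model(dbt, session):
    dbt.config(
        materialized="table",
        submission_method="all_purpose_cluster",  # run on an existing all-purpose cluster
        cluster_id="1234-567890-abcde123",        # placeholder: your Databricks cluster id
        create_notebook=False,                    # skipping notebook creation speeds up runs
    )
    df = dbt.ref("my_first_dbt_model")
    return df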

You can learn more about the above parameters in dbt’s documentation. Then run the following command:

$ dbt build


You have completed building your first Python models in dbt with Databricks.

How to Build dbt Python Models in BigQuery

Follow the steps mentioned below to build dbt Python models with BigQuery:

  • Let’s create a Service Account that dbt can use. Go to IAM & Admin > Service Accounts to create the Service Account.
  • Name the Service Account ‘dbt-python.’
  • Now, define roles for the Service Account. For simplicity, we will grant resource admin permissions.
  • Create the Service Account and go to Action > Manage Keys to create the key file.
  • Click on Add Key > Create a New Key, select JSON, and click Create. The private key is created, and the .json key file is downloaded.
  • You will now need to create a bucket. Search for ‘Cloud Storage’ in the console and click ‘Create Bucket.’ Name your bucket and select the region. Remember the name of your bucket and the region, as they will be required in the following steps.
  • Go to the Marketplace and search for ‘Dataproc API.’ You need to enable the Cloud Dataproc API.
  • When creating the cluster, you can select optional components such as Anaconda; you need to change the Image Type and Version to do so.
  • Go to ‘Dataproc’ in your console and click ‘Create Cluster.’ Name your cluster, and select the same region as your bucket.
  • If you are using your account’s default quotas, you must change the Series and Machine type of the Manager node; select n1-standard-2 to prevent exceeding the quotas, and do the same for the Worker nodes.
  • Now, define the initialization actions and the cluster metadata. Paste the following content into the initialization action:
gs://goog-dataproc-initialization-actions-${REGION}/connectors/connectors.sh
  • Paste the following into the metadata:
bigquery-connector-version=1.2.0
spark-bigquery-connector-version=0.21.0
  • For production environments, Google recommends putting the above information in the bucket and having the cluster read it from there. In this tutorial, you paste the information into the cluster properties for simplicity.
  • After completing the GCP configuration, let’s start the dbt setup. Install the BigQuery adapter using the below command:
$ pip3 install dbt-bigquery
  • You need to create a new profile inside the ~/.dbt/profiles.yml file, as shown in the example below.
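A sketch of such a profile; all values are placeholders, and the field names follow the dbt-bigquery adapter’s service-account setup for Dataproc, so verify them against dbt’s documentation:

my_bigquery_project:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      keyfile: /path/to/dbt-python-keyfile.json  # the JSON key file downloaded earlier
      project: your-gcp-project-id               # placeholder GCP project id
      dataset: dbt_python                        # placeholder dataset/schema
      threads: 4
      # settings used by dbt Python models (Dataproc and Cloud Storage)
      gcs_bucket: your-bucket-name               # the bucket created earlier
      dataproc_region: us-central1               # same region as your bucket and cluster
      dataproc_cluster_name: your-cluster-name   # the Dataproc cluster created earlier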
  • You will need the project_id and other details from your JSON key file, the name of the region, and the bucket name to create dbt Python models.
  • The dbt Python models created in BigQuery are similar to those in Databricks, as the Dataproc cluster also uses PySpark, and the parameters of the dbt.config() function are similar. You can read more about the config options for Python models in dbt’s documentation.
  • Run the following command:
$ dbt build


You have now built your first Python model in dbt using BigQuery.

Conclusion

If you are unfamiliar with Python programming, it is recommended to use SQL to create your dbt data models. However, if you have expertise in Python, you can use it to create dbt models. This article discussed different ways of building dbt Python models in data warehouses like Snowflake, Databricks, and BigQuery.

In the future, dbt may support Python models on other platforms. Besides these three data warehouses, there is a powerful third-party tool called dbt-fal that runs Python code locally instead of relying on cloud resources. The dbt-fal tool works with any dbt adapter, like BigQuery, Snowflake, or Databricks.

To simplify the implementation of data pipelines, you can opt for cloud-based automated ETL tools like Hevo Data, which offers more than 150 plug-and-play integrations.

Visit our Website to Explore Hevo

Saving countless hours of manual data cleaning and standardizing, Hevo Data’s pre-load data transformations get it done in minutes via a simple drag-and-drop interface or your custom Python scripts. No need to go to your data warehouse for post-load transformations. You can run complex SQL transformations from the comfort of Hevo’s interface and get your data in its final, analysis-ready form.

Want to take Hevo for a spin? Sign Up for a 14-day free trial and simplify your data integration process. Check out the pricing details to understand which plan fulfills all your business needs.

Manjiri Gaikwad
Freelance Technical Content Writer, Hevo Data

Manjiri loves data science and produces insightful content on AI, ML, and data science. She applies her flair for writing to simplify the complexities of data integration and analysis, helping data professionals and businesses in the data industry solve their problems.
