Simplify Airflow DAG Creation and Maintenance with Hamilton in 8 minutes

How Hamilton can help you write more maintainable Airflow DAGs

Stefan Krawczyk
Towards Data Science


An abstract representation of how Airflow & Hamilton relate. Airflow helps bring it all together, while Hamilton helps make the innards manageable. Image from Pixabay.

This post is written in collaboration with Thierry Jean and originally appeared here.

This post walks you through the benefits of having two open source projects, Hamilton and Airflow, and their directed acyclic graphs (DAGs), work in tandem. At a high level, Airflow is responsible for orchestration (think macro) while Hamilton helps author clean and maintainable data transformations (think micro).

For those who are unfamiliar with Hamilton, we point you to an interactive overview on tryhamilton.dev, or to our other posts, e.g. this one. Otherwise, we will talk about Hamilton at a high level and point to reference documentation for more details. For reference, I’m one of the co-creators of Hamilton.

For those still trying to grasp how the two can run together: Hamilton is just a library with a small dependency footprint, so you can add it to your Airflow setup in no time!

Just to recap, Airflow is the industry standard for orchestrating data pipelines. It powers all sorts of data initiatives including ETL, ML pipelines and BI. Since Airflow’s inception in 2014, users have faced certain rough edges when authoring and maintaining data pipelines:

  1. Maintainably managing the evolution of workflows: what starts simple invariably gets complex.
  2. Writing modular, reusable, and testable code that runs within an Airflow task.
  3. Tracking lineage of code and data artifacts that an Airflow DAG produces.

This is where we believe Hamilton can help! Hamilton is a Python micro-framework for writing data transformations. In short, one writes Python functions in a “declarative” style, which Hamilton parses and connects into a graph based on their names, arguments and type annotations. Specific outputs can be requested and Hamilton will execute the required function path to produce them. Because it doesn’t provide macro orchestrating capabilities, it pairs nicely with Airflow by helping data professionals write cleaner, more reusable code for Airflow DAGs.

The Hamilton Paradigm in a picture. This example shows how one would map procedural pandas code to Hamilton functions that define a DAG. Note: Hamilton can be used for any Python object types, not just Pandas. Image by author.
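To make the paradigm concrete, here is a minimal, self-contained sketch; the column and function names are illustrative rather than taken from the picture above. Each function name defines an output, and its parameter names declare the upstream dependencies Hamilton wires together.

```python
# A minimal sketch of the Hamilton paradigm. Column names (spend, signups)
# and function names are illustrative assumptions.
import sys

import pandas as pd

from hamilton import driver


def avg_3wk_spend(spend: pd.Series) -> pd.Series:
    """Rolling 3-week average of marketing spend."""
    return spend.rolling(3).mean()


def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
    """Marketing spend per signup."""
    return spend / signups


if __name__ == "__main__":
    # In practice these functions live in their own module(s) that you pass to
    # the Driver; here we pass this very module to keep the sketch runnable.
    dr = driver.Driver({}, sys.modules[__name__])
    df = dr.execute(
        ["avg_3wk_spend", "spend_per_signup"],
        inputs={
            "spend": pd.Series([10, 10, 20, 40, 40, 50]),
            "signups": pd.Series([1, 10, 50, 100, 200, 400]),
        },
    )
    print(df)
```

Running this prints a DataFrame with the two requested outputs; the same functions could just as easily be requested from inside an Airflow task.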

Write maintainable Airflow DAGs

A common use of Airflow is to help with machine learning/data science. Running such workloads in production often requires complex workflows. A necessary design decision with Airflow is determining how to break up the workflow into Airflow tasks. Create too many and you increase scheduling and execution overhead (e.g. moving lots of data); create too few and you have monolithic tasks that can take a while to run, though they are probably more efficient to run. The trade-off here is Airflow DAG complexity versus code complexity within each of the tasks. Complex task code makes debugging and reasoning about the workflow harder, especially if you did not author the initial Airflow DAG. More often than not, the initial task structure of the Airflow DAG becomes fixed, because refactoring the task code becomes prohibitive!

While simpler DAGs such as A->B->C are desirable, there is an inherent tension between the structure’s simplicity and the amount of code per task. The more code per task, the more difficult it is to identify points of failure, in exchange for potential computational efficiencies; and when failures do occur, the cost of retries grows with the “size” of the task.

Airflow DAG structure choices: how many tasks? how much code per task? Image by author.

Instead, what if you could simultaneously wrangle the complexity within an Airflow task, no matter the size of code within it, and gain the flexibility to easily change the Airflow DAG shape with minimal effort? This is where Hamilton comes in.

With Hamilton you can replace the code within each Airflow task with a Hamilton DAG, where Hamilton handles the “micro” orchestration of the code within the task. Note: Hamilton actually enables you to logically model everything that you’d want an Airflow DAG to do. More on that below.

To use Hamilton, you load a Python module that contains your Hamilton functions, instantiate a Hamilton Driver, and execute a Hamilton DAG within an Airflow task, all in a few lines of code. By using Hamilton, you can write your data transformations at an arbitrary granularity, allowing you to inspect in greater detail what each Airflow task is doing.

Specifically, the mechanics of the code are:

  1. Import your Hamilton function modules.
  2. Pass them to the Hamilton Driver to build the DAG.
  3. Then, call Driver.execute() with the outputs you want computed from the DAG you’ve defined.

Let’s look at some code that creates a single-node Airflow DAG but uses Hamilton to train and evaluate an ML model:
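The original snippet isn’t reproduced here, so below is a hedged sketch of what such a single-task Airflow DAG could look like. The Hamilton module name (train_and_evaluate) and the requested outputs (trained_model, validation_accuracy) are hypothetical, and the schedule= keyword assumes Airflow 2.4+.

```python
# A hedged sketch (not the original gist): a single-task Airflow DAG whose
# internals are delegated to a Hamilton DAG. Module and output names are hypothetical.
from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2023, 1, 1), schedule=None, catchup=False)  # schedule= needs Airflow 2.4+
def absenteeism_single_task():
    @task
    def train_and_evaluate_model() -> float:
        # Import inside the task so only the worker process needs these packages.
        from hamilton import base, driver

        import train_and_evaluate  # hypothetical Python module with your Hamilton functions

        dr = driver.Driver(
            {"label": "absenteeism_time_in_hours"},  # config passed to the Hamilton DAG
            train_and_evaluate,
            adapter=base.SimplePythonGraphAdapter(base.DictResult()),  # return outputs as a dict
        )
        results = dr.execute(["trained_model", "validation_accuracy"])
        return results["validation_accuracy"]

    train_and_evaluate_model()


absenteeism_single_task()
```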

Now, we didn’t show the Hamilton code here, but the benefits of this approach are:

  1. Unit & integration testing. Hamilton, through its naming and type annotations requirements, pushes developers to write modular Python code. This results in Python modules well-suited for unit testing. Once your Python code is unit tested, you can develop integration tests to ensure it behaves properly in your Airflow tasks. In contrast, testing code contained in an Airflow task is less trivial, especially in CI/CD settings, since it requires having access to an Airflow environment.
  2. Reuse data transformations. This approach keeps the data transformations code in Python modules, separated from the Airflow DAG file. This means this code is also runnable outside of Airflow! If you come from the analytics world, it should feel similar to developing and testing SQL queries in an external .sql file, then loading it into your Airflow Postgres operators.
  3. Reorganize your Airflow DAG easily. The lift required to change your Airflow DAG is now much lower. If you logically model everything in Hamilton, e.g. an end-to-end machine learning pipeline, it’s just a matter of determining how much of the Hamilton DAG needs to be computed in each Airflow task. For example, you could go from one monolithic Airflow task to a few, or to many: all that needs to change is what you request from Hamilton, since your Hamilton DAG needn’t change at all! See the sketch below.
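Here is a hedged sketch of point (3): the same hypothetical Hamilton module (pipeline) backs both a monolithic task and a two-task split; only the outputs each Airflow task requests, and the intermediate result handed between tasks, change.

```python
# A hedged sketch: the Hamilton module stays the same, only what each Airflow
# task requests changes. Module and output names (pipeline, feature_df,
# trained_model, evaluation_report) are hypothetical.
from hamilton import base, driver

import pipeline  # hypothetical Hamilton module modeling the end-to-end ML pipeline

adapter = base.SimplePythonGraphAdapter(base.DictResult())
dr = driver.Driver({}, pipeline, adapter=adapter)

# Option A: one monolithic Airflow task computes everything in one go.
everything = dr.execute(["feature_df", "trained_model", "evaluation_report"])

# Option B: two Airflow tasks split the same Hamilton DAG. In Airflow, task 1's
# output would be handed to task 2 via XCom or intermediate storage.
features = dr.execute(["feature_df"])  # Airflow task 1
model_and_report = dr.execute(         # Airflow task 2
    ["trained_model", "evaluation_report"],
    overrides={"feature_df": features["feature_df"]},  # reuse task 1's output
)
```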

Iterative development with Hamilton & Airflow

In most data science projects, it would be impossible to write the DAG of the final system from day 1 as requirements will change. For example, the data science team might want to try different feature sets for their model. Until the list is set and finalized, it is probably undesirable to have the feature set in your source code and under version control; configuration files would be preferable.

Airflow supports default and runtime DAG configurations and will log these settings to make every DAG run reproducible. However, adding configurable behaviors requires adding conditional statements and complexity to your Airflow task code. This code might become obsolete during the project or only be useful in particular scenarios, ultimately decreasing your DAG’s readability.

In contrast, Hamilton can use Airflow’s runtime configuration to execute different data transformations from the function graph on the fly. This layered approach can greatly increase the expressivity of Airflow DAGs while maintaining structural simplicity. Alternatively, Airflow can dynamically generate new DAGs from configurations, but this could decrease observability and some of these features remain experimental.

Airflow UI for DAG run configuration. Image by author.
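As a hedged sketch of this layered approach, the Airflow task below reads dag_run.conf and forwards it to Hamilton; the config keys, defaults, and module name (features) are all hypothetical.

```python
# A hedged sketch: forward Airflow's runtime configuration to Hamilton so the
# same task can compute a different feature set per DAG run. Config keys,
# defaults, and the module name (features) are hypothetical assumptions.
from airflow.decorators import task
from airflow.operators.python import get_current_context


@task  # to be used inside a @dag-decorated function
def create_features() -> dict:
    import pandas as pd

    from hamilton import base, driver

    import features  # hypothetical Hamilton module with candidate feature transformations

    # Parameters supplied when triggering the DAG run (e.g. via the Airflow UI).
    run_conf = get_current_context()["dag_run"].conf or {}
    feature_set = run_conf.get("feature_set", ["age", "bmi", "has_children"])
    raw_df = pd.read_csv(run_conf.get("raw_data_path", "/opt/airflow/data/raw.csv"))

    dr = driver.Driver({}, features, adapter=base.SimplePythonGraphAdapter(base.DictResult()))
    # In practice you would likely persist large results to storage rather than XCom.
    return dr.execute(feature_set, inputs={"raw_df": raw_df})
```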

If you work in a hand-off model, this approach promotes a separation of concerns between the data engineers responsible for the Airflow production system and the data scientists in charge of developing business solutions by writing Hamilton code. Having this separation can also improve data consistency and reduce code duplication. For example, a single Airflow DAG can be reused with different Hamilton modules to create different models. Similarly, the same Hamilton data transformations can be reused across different Airflow DAGs to power dashboards, APIs, applications, etc.

Below are two pictures. The first illustrates the high-level Airflow DAG containing two nodes. The second displays the low-level Hamilton DAG of the Python module evaluate_model imported in the Airflow task train_and_evaluate_model.

1. Airflow UI: Absenteeism Airflow DAG
2. Hamilton driver visualization: function graph for evaluate_model.py

Handling data artifacts

Data science projects produce a large number of data artifacts: datasets, performance evaluations, figures, trained models, etc. The artifacts needed will change over the course of the project life cycle (data exploration, model optimization, production debugging, etc.). This is a problem for Airflow, since removing a task from a DAG deletes its metadata history and breaks the artifact lineage. In certain scenarios, producing unnecessary or redundant data artifacts can incur significant computation and storage costs.

Hamilton can provide the needed flexibility for data artifact generation through its data saver API. Functions decorated with @save_to.* gain the ability to store their output; one need only request this functionality via Driver.execute(). In the code below, requesting validation_predictions_table will return the table, whereas requesting the output_name_ value of save_validation_predictions will return the table and also save it to .csv.
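Here is a hedged sketch of such a decorated function; the column names and the validation_predictions_path input are assumptions.

```python
# A hedged sketch of the data saver pattern described above; column names and
# the "validation_predictions_path" input are assumptions.
import pandas as pd

from hamilton.function_modifiers import save_to, source


@save_to.csv(
    path=source("validation_predictions_path"),  # where to write the .csv, supplied at execution time
    output_name_="save_validation_predictions",  # extra node to request when you want the file written
)
def validation_predictions_table(
    y_validation: pd.Series, validation_predictions: pd.Series
) -> pd.DataFrame:
    """True vs. predicted labels for the validation set."""
    return pd.DataFrame({"y_true": y_validation, "y_pred": validation_predictions})

# Requesting "validation_predictions_table" from Driver.execute() returns the table
# only; requesting "save_validation_predictions" (with "validation_predictions_path"
# provided as an input) returns the table and writes the .csv as well.
```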

This added flexibility allows users to easily toggle the artifacts generated and it can be done directly through the Airflow runtime configuration, without editing the Airflow DAG or Hamilton modules.

Furthermore, the fine-grained Hamilton function graph allows for precise data lineage & provenance. The utility functions what_is_downstream_of() and what_is_upstream_of() help visualize and programmatically explore data dependencies, as sketched below. We point interested readers here for more detail.
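A hedged, self-contained sketch of the programmatic side, using two toy functions:

```python
# A hedged sketch of programmatic lineage exploration with the Driver's utility
# methods, using two toy functions so the snippet is self-contained.
import sys

import pandas as pd

from hamilton import driver


def bmi(weight_kg: pd.Series, height_m: pd.Series) -> pd.Series:
    """Body mass index."""
    return weight_kg / height_m ** 2


def is_overweight(bmi: pd.Series) -> pd.Series:
    """Whether BMI exceeds 25."""
    return bmi > 25


if __name__ == "__main__":
    dr = driver.Driver({}, sys.modules[__name__])
    # Everything that depends on the raw input "weight_kg".
    print([n.name for n in dr.what_is_downstream_of("weight_kg")])
    # Everything "is_overweight" depends on.
    print([n.name for n in dr.what_is_upstream_of("is_overweight")])
```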

To finish & an example to get started

Hopefully by now we’ve impressed on you that combining Hamilton with Airflow will help you with Airflow’s DAG creation & maintainability challenges. Since this is a short post, to wrap things up, let’s move on to the code we have in the Hamilton repository for you.

To help you get up and running, we have an example on how to use Hamilton with Airflow. It should cover all the basics that you need to get started. The README includes how to set up Airflow with Docker, so that you don’t need to worry about installing dependencies just to play with the example.

As for the code in the example, it contains two Airflow DAGs: one showcasing a basic Hamilton “how-to” to create “features” for training a model, and the other a more complete machine learning project example that runs a full end-to-end pipeline of creating features and then fitting and evaluating a model. For both examples, you’ll find the Hamilton code under the plugins folder.

What you should expect to see in the Airflow example. Image by author.

If you have questions or need help — please join our Slack. Otherwise, to learn more about Hamilton’s features and functionality, we refer you to Hamilton’s documentation.

References & Further Reading

Thanks for taking a look at this post. If you want to dive deeper, or want to learn more about Hamilton, we have the following links for you to browse!
