Building Properties with AWS Step Functions

Mohamed Elsaka
Booking.com Engineering
5 min read · Oct 18, 2022

Developers love new technologies and are always eager to try things hands-on. As a community, we thrive on problems that pave the way to new learning opportunities.

Maintaining a legacy system is tedious, and developing a new feature on top of it is another challenge altogether. This post is about one such legacy process, PropertyBuilder, which is responsible for creating new properties on our platform.

New partner registration process

We constantly onboard new partners to our platform, from hotels to homes. In order to make it easier for partners, we’ve built a self sign-up tool that guides them through the registration process. The tool is divided into multiple steps through which we collect the main information about the partner and their properties (e.g. address, number of rooms, photos, etc.).

The monolith

Once registration is finished, a process called PropertyBuilder is triggered. It takes all the data collected from the partner and lists a new property on our platform. As with any other software, it started small. It then grew to thousands of lines of code, with many function calls creating the various pieces of data that together represent a property on our website.

PropertyBuilder has more than 30 steps, some of which depend on each other. Only a clean run, without a single error, results in a new property being created successfully. A failure in any one of those 30+ steps halts the whole process, and a new attempt must start from the beginning.
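To make the cost of this all-or-nothing behaviour concrete, here is a minimal Python sketch. It is not the actual PropertyBuilder code; the function name, the step callables and the retry limit are all assumptions made for illustration. It only shows that a failure late in the chain throws away the work of every earlier step.

```python
from typing import Callable, Dict, List

# A step mutates the property being built; the real steps are not shown in this post.
Step = Callable[[Dict], None]


def run_all_or_nothing(steps: List[Step], registration: Dict, max_attempts: int = 3) -> Dict:
    """Run every step in order; any failure discards progress and restarts from step one."""
    for attempt in range(1, max_attempts + 1):
        property_data = dict(registration)  # fresh copy: nothing is carried over between attempts
        try:
            for step in steps:
                step(property_data)
            return property_data  # only a clean run produces a property
        except Exception as error:
            print(f"attempt {attempt} failed: {error}")
    raise RuntimeError("property could not be built")
```

With 30+ steps, a transient error in the last step means the first 29 run again on the next attempt, which is where most of the wasted time comes from.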

A simple process runs the new property registration workflow.

The system was simple: a queue, a consumer, and a database scheduler for retries.

Because every failure meant starting over, building a single property took a long time. This didn’t provide a great customer experience for new partners creating their first property on our platform.

Running this process also increased the operational load on the team managing it. They were pulled into many firefighting incidents (outages in production), as the process could easily fail due to bugs or transient database errors (temporary issues that resolve themselves).

Breaking it down

First, we identified three large steps in PropertyBuilder and extracted them from the main code path into independent components. We then reorganised the process to run these three steps asynchronously, which required three additional queues and three more consumers.
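The fan-out described here can be sketched with in-process queues and threads. This is an illustration only: the real system used separate message queues and consumer processes, and the step names passed in as handlers are hypothetical, since the post does not name the three extracted steps.

```python
import queue
import threading


def start_consumer(name, handler, results):
    """Create a dedicated queue plus a consumer thread for one extracted step."""
    step_queue = queue.Queue()

    def consume():
        while True:
            message = step_queue.get()
            if message is None:  # sentinel: no more work for this consumer
                break
            results[name] = handler(message)

    thread = threading.Thread(target=consume, daemon=True)
    thread.start()
    return step_queue, thread


def build_property(registration, handlers):
    """Fan the registration out to one queue per step and wait for all consumers."""
    results = {}
    consumers = [start_consumer(name, handler, results) for name, handler in handlers.items()]
    for step_queue, _ in consumers:
        step_queue.put(registration)
        step_queue.put(None)
    for _, thread in consumers:
        thread.join()
    return results
```

Even in this toy form, each extra step brings its own queue and consumer to create, monitor and shut down, which is the management overhead mentioned in the text.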

Improved PropertyBuilder process divided into one main process and three parallel steps.

This approach slightly reduced the complexity of the main PropertyBuilder part, but it came at the cost of managing three extra queues. At this point, we realised the importance of having a workflow engine.

Workflow management system solution

After doing some research, we settled on Conductor, an open-source workflow engine developed by Netflix. We primarily chose this because it looked very simple and came with a fancy-looking interface for managing workflows.

We deployed Conductor on our internal Kubernetes cluster. Unlike Netflix’s Conductor setup, which relies on Dynomite, we used MySQL.

Conductor has a nice feature that lets a workflow step call an HTTP endpoint directly. For us, this eliminated the need to implement Workers, which made the system far less complex. We reorganised the three independent asynchronous actions and put them behind an HTTP service.
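As a rough illustration of this setup, the sketch below builds a Conductor workflow definition in which every step is an HTTP task and three steps run in a fork. The field names follow Conductor's documented HTTP, FORK_JOIN and JOIN task shapes as we understand them, so check them against your Conductor version. The service URL, workflow name and step names are hypothetical and are not taken from our actual definition.

```python
import json

BASE_URL = "http://property-builder.internal"  # hypothetical internal HTTP service
PARALLEL_STEPS = ["step_a", "step_b", "step_c"]  # hypothetical names for the three actions


def http_task(name):
    """A workflow step that calls an HTTP endpoint, so no custom Worker is needed."""
    return {
        "name": name,
        "taskReferenceName": f"{name}_ref",
        "type": "HTTP",
        "inputParameters": {
            "http_request": {
                "uri": f"{BASE_URL}/{name}",
                "method": "POST",
                "body": {"registrationId": "${workflow.input.registrationId}"},
            }
        },
    }


WORKFLOW = {
    "name": "property_builder",
    "version": 1,
    "schemaVersion": 2,
    "tasks": [
        http_task("create_property"),
        {
            "name": "fork_steps",
            "taskReferenceName": "fork_steps_ref",
            "type": "FORK_JOIN",
            "forkTasks": [[http_task(name)] for name in PARALLEL_STEPS],
        },
        {
            "name": "join_steps",
            "taskReferenceName": "join_steps_ref",
            "type": "JOIN",
            "joinOn": [f"{name}_ref" for name in PARALLEL_STEPS],
        },
    ],
}

definition_json = json.dumps(WORKFLOW, indent=2)
```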

Conductor orchestrating PropertyBuilder workflow.

With Conductor in place, we gained some features that we didn’t have before, namely:

  1. Configurable workflow with instant changes.
  2. Clear visibility into the progress of individual runs of the process.
  3. A user interface that allowed us to act quickly when things break.

But this came at a cost of its own.

Our Conductor deployment was not ideal. It was different from what Netflix was using: we had to use MySQL instead of Dynomite, and Redis instead of DynoQueues. Along the way, we came across some unexpected issues with Conductor (e.g. stuck workflows), and the team had to spend time making Conductor operational within our infrastructure (integrating with Prometheus metrics, upgrading the Elasticsearch cluster, etc.).

At that point, the system was in good shape, but the operational load on the team was still higher than we wanted. We then evaluated some managed workflow services and narrowed the list down to two: AWS Step Functions and Amazon Simple Workflow Service (SWF).

AWS Step Functions

We chose AWS Step Functions (SFN), as it has the following features:

  • Simple API
  • Workflows that are defined in Amazon States Language, which is easy to learn
  • A user interface on AWS Console

This was enough for our use case, which is not complex (~10 steps with only three steps running in parallel).
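For readers who have not seen Amazon States Language, the sketch below builds a small state machine of this shape: a sequential step, a Parallel state with three branches, and a final step. It is an illustration, not our production definition. The state names, the activity ARNs and the retry settings are assumptions, and the real workflow has around ten steps rather than five.

```python
import json

# Hypothetical activity ARN prefix; the real resource names are not given in this post.
ARN_PREFIX = "arn:aws:states:eu-west-1:123456789012:activity:"


def task(resource, next_state=None):
    """A Task state that retries with backoff, which helps with transient errors."""
    state = {
        "Type": "Task",
        "Resource": ARN_PREFIX + resource,
        "Retry": [
            {
                "ErrorEquals": ["States.ALL"],
                "IntervalSeconds": 5,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,
            }
        ],
    }
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


def branch(name):
    """A single-state branch for the Parallel state."""
    return {"StartAt": name, "States": {name: task(name)}}


STATE_MACHINE = {
    "Comment": "Illustrative PropertyBuilder workflow",
    "StartAt": "CreateProperty",
    "States": {
        "CreateProperty": task("CreateProperty", "RunParallelSteps"),
        "RunParallelSteps": {
            "Type": "Parallel",
            "Branches": [branch("StepA"), branch("StepB"), branch("StepC")],
            "Next": "PublishProperty",
        },
        "PublishProperty": task("PublishProperty"),
    },
}

definition = json.dumps(STATE_MACHINE, indent=2)
```

Because retries are declared per state, a transient failure in one step is retried in place instead of restarting the whole run.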

Learning SFN was quite easy. We only had to make some small adjustments to the previous system to make things work. We opted for the JavaScript SFN client because it was simple and we didn’t have any business logic in the Workers.
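A Worker without business logic can be very small. The sketch below is written in Python rather than JavaScript to stay consistent with the other examples here, and it assumes the Workers poll Step Functions activities. It takes any client object exposing get_activity_task, send_task_success and send_task_failure (the boto3 Step Functions client has methods with these names). The handler, for example a function that forwards the input to an HTTP service, and the worker name are hypothetical.

```python
import json


def poll_once(sfn, activity_arn, handler, worker_name="property-builder-worker"):
    """Poll for one activity task, delegate it to the handler, and report the outcome.

    Returns True if a task was processed, False if the poll came back empty.
    """
    response = sfn.get_activity_task(activityArn=activity_arn, workerName=worker_name)
    token = response.get("taskToken")
    if not token:  # the long poll timed out without a task
        return False
    try:
        result = handler(json.loads(response["input"]))
    except Exception as error:
        # Report the failure so the state machine's retry policy can take over.
        sfn.send_task_failure(taskToken=token, error=type(error).__name__, cause=str(error))
        return True
    sfn.send_task_success(taskToken=token, output=json.dumps(result))
    return True
```

In a real Worker this function would run in a loop, one loop per activity.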

We deployed the new Workers to production and tried it out with a portion of the traffic. It worked. With small tweaks to the state machine definition, we managed to get a solution up and running for all traffic in less than a month.

PropertyBuilder workflow managed by AWS SFN

With SFN as a managed solution, the increased operational load was eliminated. The system has been running for more than two years now with hardly any issues, which we consider to be a massive improvement compared to where we started.

Step Functions limitations

As mentioned before, our use case was simple and fit the SFN state machine model well. There are still some features that we would have loved to see in SFN:

  • While the SFN dashboard is good, it lacks bulk operations
  • Workflow Execution IDs are unique, which is great for ensuring idempotency, but they cannot be reused even after the end of the execution
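One way to live with the second limitation is to derive the execution name from the registration and an attempt counter, so each retry gets a fresh but predictable name. The sketch below is an assumption about how this could be done, not a description of our production code. The naming scheme and helper names are hypothetical; start_execution with stateMachineArn, name and input is the call exposed by the boto3 Step Functions client, and 80 characters is the documented limit for execution names.

```python
import json
import re

MAX_NAME_LENGTH = 80  # Step Functions limit for execution names


def execution_name(registration_id, attempt):
    """Build a deterministic execution name that stays unique per attempt."""
    prefix = "property-"
    suffix = f"-attempt-{attempt}"
    # Replace characters that are not safe in execution names.
    safe_id = re.sub(r"[^A-Za-z0-9_-]", "-", str(registration_id))
    room = MAX_NAME_LENGTH - len(prefix) - len(suffix)
    return prefix + safe_id[:room] + suffix


def start_build(sfn, state_machine_arn, registration_id, attempt, payload):
    """Start one workflow run; reusing the same name for a finished run is rejected by SFN."""
    return sfn.start_execution(
        stateMachineArn=state_machine_arn,
        name=execution_name(registration_id, attempt),
        input=json.dumps(payload),
    )
```

The same registration and attempt always map to the same name, which keeps the idempotency benefit, while a deliberate rerun only needs a higher attempt number.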

Conclusion

Workflow management systems are powerful tools that can significantly reduce system complexity. AWS Step Functions is a great service for running workflows, and it can reduce the operational load on teams running complex business processes. At Booking, we’re continuously improving our systems and are open to trying out different solutions, whether they are open source or managed services, as long as they are the right tool for the job.

We’d love to know what technologies you use for managing workflows. Let us know in the comments below.
