Replicating a multi-million Euro automated warehouse in Python

Published in Picnic Engineering · 12 min read · Mar 14, 2023

In the previous part of this blog series we talked about the challenges we faced developing our very first Warehouse Control System (WCS). This system orchestrates flows of groceries over 30 miles of conveyor in our automated fulfillment center (FC) in the Dutch city of Utrecht. Here, thousands of orders are assembled every day, to be delivered hours later by our fleet of electric vehicles. The real fulfillment center did not exist for much of our development process. Now that it does, it is in operation around the clock. To test the WCS, we created a testing framework using Behave that tests the service against a simulation of the real FC. This way, we verify that WCS is able to complete a full day of operations before deploying to production.

This leaves one question: how do we take our automated warehouse, and put it in a container? In this installment we will dive into the simulation that powers these tests. First, we will explain the basics of automated warehousing and simulation. Then, we will take a look at the simulation that we built. Finally, we will share the challenges we ran into and the key lessons we learned from overcoming them.

Automated warehousing 101

Our automated warehouse is, broadly speaking, a system where pallets of groceries go in, customer orders come out, and something smart and automated happens in between. The warehouse consists of four layers. Most of the automation consists of conveyor belts, lifts and shuttles that transport stock from place to place in the warehouse. This network of hardware is the first layer and forms the skeleton of our fulfillment center. In the second layer, the hardware is controlled by programmable logic controllers (PLCs). Each PLC governs a small, localized section of the hardware. It gives instructions such as “turn on conveyor” or “divert the next crate to the right”. It also receives inputs like “barcode 1234 was scanned”. This control layer of the site keeps some limited state, and makes decisions when no instruction is provided to it. It cannot, however, orchestrate the transport of crates over large distances. For this purpose, a transport layer exists on top of the control layer. This third layer is in charge of managing movements of crates that may span many PLCs. In essence, it knows the “how” of operating the FC. The fourth and final layer is the WCS. This service is in charge of the “what”, and orchestrates the FC by giving instructions to the transport layer.

Simulating the real world in Python

To simulate our warehouse, we need to make a decision on how to emulate the physical reality in code. We have chosen a discrete event simulation as the core of our project. A discrete event simulation essentially implies that each event in the simulation is caused by another event. This means that you can view the simulation timeline as a series of scheduled events, which you can evaluate one by one. Events have a callback function, which is called when the event is the next event on the timeline to be simulated. If the event you are executing causes something else to happen in the future, it is scheduled by adding a new event to the timeline. The timeline is represented as a heap data structure, ordered by the scheduled timestamp. A major advantage of modeling the world as a series of events is that the simulation runs as fast as you can process events: an hour of real time can be simulated in seconds!

We use Python to implement this simulation, for three reasons:

There are actively maintained open source projects that can help set up such a simulation.
Python allows for fast iteration. In return, we accept slower execution than in most compiled languages. This is not an issue in our simulation, as it turns out that our largest simulation speed bottleneck is the communication overhead between the simulation and WCS.
Available talent also played a role: in the team there was experience in writing simulations in Python.

Let’s look at an example: how can we model a conveyor in the simulation? A conveyor has a length of n slots that can each hold a crate. A crate takes MOVE_SPEED seconds to move from one slot to the next. After n times MOVE_SPEED seconds the crate will reach the end of the conveyor, and roll over to something else. We use the simpy module for our simulation core, but for the sake of keeping this example self-contained, we added a simplified simulation environment to illustrate the approach taken.

import heapq
from functools import partial
from typing import Callable, Optional


# We model our simulation as an event heap.
# Each event consists of a timestamp and a callback function.
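class Simulation:
    def __init__(self):
        # Heap entries are (timestamp, sequence number, callback). The
        # sequence number breaks ties between events with the same timestamp,
        # so that the callbacks themselves are never compared.
        self.heap: list[tuple[int, int, Callable[[], None]]] = []
        self.time = 0
        self._sequence = 0

    def now(self) -> int:
        # The current time is the timestamp of the event being executed.
        return self.time

    def schedule(self, callback: Callable[[], None], dt: int):
        heapq.heappush(self.heap, (self.now() + dt, self._sequence, callback))
        self._sequence += 1

    def run(self):
        # Iterate over the heap until we run out of events.
        while self.heap:
            self.time, _, call = heapq.heappop(self.heap)
            call()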


SIMULATION = Simulation()


class Conveyor:
    MOVE_SPEED = 1  # time in seconds to move a crate

    def __init__(
        self, name: str, length: int, next_conveyor: Optional["Conveyor"]
    ):
        self.name = name
        # A conveyor has a fixed length, it can hold this many crates.
        # We refer to each slot on the conveyor as a window.
        self.length = length
        # The state is kept as a map from barcode to window index.
        self.state: dict[str, int] = {}
        # Optionally, there is a next conveyor to hand crates over to.
        self.next_conveyor = next_conveyor

    def _move_next(self, barcode: str):
        # Move the crate with `barcode` to the next window.
        index = self.state[barcode]

        # Check if the next window is past the end of the conveyor.
        next_index = index + 1
        if next_index == self.length:
            # If it is the end, scan the crate, take it off this conveyor
            # and hand it over.
            del self.state[barcode]
            print(f"Ping! We just scanned {barcode} on conveyor {self.name}")
            if self.next_conveyor:
                self.next_conveyor.put(barcode)
            else:
                print("Oh no, the crate fell on the floor!")
        else:
            # If it is not the end, schedule a next event.
            self.state[barcode] = next_index
            SIMULATION.schedule(
                partial(self._move_next, barcode), self.MOVE_SPEED
            )

    def put(self, barcode: str):
        # A crate is put on the conveyor at index 0.
        if 0 not in self.state.values():
            self.state[barcode] = 0
            SIMULATION.schedule(
                partial(self._move_next, barcode), self.MOVE_SPEED
            )
            return
        raise RuntimeError("No space at the start of the conveyor!")

conveyor_2 = Conveyor("conveyor_2", 10, None)
conveyor_1 = Conveyor("conveyor_1", 10, conveyor_2)
conveyor_1.put("1234")
SIMULATION.run()

What are we modeling?

You now know what an automated warehouse looks like in real life, and how to simulate the real world in Python. It’s time to put both together to emulate our fulfillment center!

The foundation of our simulation is the hardware simulation. We configure conveyors, lifts, scanners and all sorts of other hardware to match the real world. With a lightweight REST API on top of this, the Behave test driver can manipulate the state of the hardware. For instance, we may want to start a test by creating a crate with a specific barcode on a conveyor. Once we start time, this crate will start moving on the conveyor.

On its journey, the crate will encounter scanners and actuators. Scanners will trigger a call to the control layer, reporting that this crate (identified by a barcode) arrived at the location. Actuators are hardware devices that accept instructions from the control layer. A simple example of this is a divert. This piece of hardware is linked to two conveyor belts, and the instruction it receives determines in which direction it will send the crate.

The control layer can at times make this decision by itself. For instance, some diverts serve a load balancing function, and divert crates in a round robin fashion between their outfeeds. Other times, when a crate is trying to get somewhere, the control layer will request direction from the transport layer. The transport layer is configured as a directed graph with edge weights. When presented with a crate at location A that needs to go to location B, it determines the path by running a shortest path algorithm. It feeds the next movement of this shortest path back to the control layer, which in turn gives an instruction to the actuator in the hardware layer. If the crate is not in transport, the control layer will pick a default direction, usually meaning the crate will stay on a conveyor loop.
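The routing decision described above can be sketched in a few lines. This is a minimal illustration and not our actual transport layer: the location names, the edge weights and the `next_hop` and `decide` helpers are all made up for this example, and Dijkstra's algorithm stands in for whichever shortest path algorithm is used.

```python
import heapq
from typing import Optional

# Hypothetical layout: edge weights are travel times in seconds.
GRAPH: dict[str, dict[str, int]] = {
    "infeed": {"divert_1": 5},
    "divert_1": {"loop": 20, "shortcut": 4},
    "loop": {"station_1": 10},
    "shortcut": {"station_1": 3},
    "station_1": {},
}


def next_hop(
    graph: dict[str, dict[str, int]], source: str, target: str
) -> Optional[str]:
    """Return the first location after `source` on the cheapest path."""
    # Dijkstra's algorithm. Each heap entry is (cost, node, first hop);
    # an empty string means "no hop taken yet".
    queue: list[tuple[int, str, str]] = [(0, source, "")]
    visited: set[str] = set()
    while queue:
        cost, node, first = heapq.heappop(queue)
        if node == target:
            return first or None
        if node in visited:
            continue
        visited.add(node)
        for neighbour, weight in graph[node].items():
            if neighbour not in visited:
                heapq.heappush(
                    queue, (cost + weight, neighbour, first or neighbour)
                )
    return None  # The target cannot be reached from the source.


def decide(location: str, destination: Optional[str], default: str) -> str:
    # A crate that is not in transport has no destination and simply
    # follows the default direction of the divert.
    if destination is None:
        return default
    return next_hop(GRAPH, location, destination) or default


print(decide("divert_1", "station_1", default="loop"))
print(decide("divert_1", None, default="loop"))
```

Only the next movement is returned, not the full path. This mirrors the idea that the decision is taken again at every divert the crate encounters.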

The transport layer does not come up with transport instructions by itself, and this is where the WCS comes in. The WCS is an in-house developed application, and the service under test. It communicates with the transport layer through an asynchronous, vendor-specific messaging protocol over RabbitMQ. WCS is notified when a crate is scanned that is not in transport, and provides transport instructions. We kept WCS unaware that it is talking to a simulation, not the real warehouse. To achieve this, we equipped the transport layer with a remote proxy pattern that converts the internal signals and instructions to and from the vendor protocol. The Behave test driver may also listen in on this message exchange, to do assertions in its tests.

Of course, a warehouse is not only metal and silicon: there are plenty of humans walking around even in an automated warehouse! To include these in our simulation, we added operators to the hardware layer. These operators can interact with the hardware, and support various triggers and actions. Triggers that may prompt an operator to do something can be “a crate appeared on the conveyor that I am watching”, or simply “every minute”. Actions can vary wildly, and could be a hardware interaction such as removing a crate from a conveyor, or calling a REST endpoint on the WCS to simulate a pick confirmation. The pick confirmation is the moment where a real person in the warehouse would confirm on a frontend that they have moved groceries into a customer’s order.
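One way to picture the trigger and action pairing is the sketch below. The `Operator` class, the event dictionaries and their field names are assumptions made for this illustration only. In the real simulation an action could also be a REST call to the WCS, which is left out here to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical event shape: a plain dictionary describing what happened.
Event = dict


@dataclass
class Operator:
    name: str
    # Trigger: decides whether an event should prompt the operator to act.
    trigger: Callable[[Event], bool]
    # Action: what the operator does in response.
    action: Callable[[Event], None]

    def notify(self, event: Event) -> None:
        if self.trigger(event):
            self.action(event)


removed_crates: list[str] = []

# An operator watching station_1 who takes every arriving crate off the belt.
picker = Operator(
    name="picker_1",
    trigger=lambda e: e["type"] == "crate_arrived"
    and e["conveyor"] == "station_1",
    action=lambda e: removed_crates.append(e["barcode"]),
)

picker.notify(
    {"type": "crate_arrived", "conveyor": "station_1", "barcode": "1234"}
)
picker.notify({"type": "crate_arrived", "conveyor": "loop", "barcode": "5678"})
print(removed_crates)
```

A time-based trigger such as “every minute” would fit the same shape if the simulation scheduled a recurring event and passed it to `notify`.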

We packed the simulation in a Docker image. This allows us to run tests on developers’ laptops in a Docker Compose setup as they are implementing features. Containerization also makes it possible for us to incorporate the tests that use the simulation in our CI/CD pipeline!

The curse of knowledge

To prepare our WCS for reality, we want the simulation to be as close to the real world as possible. We found an unexpected challenge: the real world is a messy place! In the first simulation we created, the transport layer had complete knowledge of the physical world. In real life, there are a number of steps between a crate on a conveyor belt and the transport system.

First, the transport system only sees what a sensor sees! In the real world, sometimes a barcode is unreadable to a scanner. In this case the transport system should have no clue which crate it is handling! In our simulation we do keep events deterministic by configuring a scanner to fail or succeed, but all logic on top of the scanner signal is unaware of which crate was scanned if the scan was not successful.
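A deterministic scanner of this kind could look roughly as follows. The `Scanner` and `ScanSignal` names, and the choice to configure failures as a set of barcodes, are assumptions for the sake of the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ScanSignal:
    location: str
    # None means "no read": something passed the scanner, but the layers
    # above have no way of knowing which crate it was.
    barcode: Optional[str]


@dataclass
class Scanner:
    location: str
    # Barcodes this scanner is configured to fail on. The failures are fixed
    # up front rather than random, so that a test run is reproducible.
    unreadable: set[str] = field(default_factory=set)

    def scan(self, barcode: str) -> ScanSignal:
        if barcode in self.unreadable:
            return ScanSignal(self.location, None)
        return ScanSignal(self.location, barcode)


scanner = Scanner("divert_1", unreadable={"5678"})
print(scanner.scan("1234"))
print(scanner.scan("5678"))
```

The hardware layer still knows the true barcode. Only the signal that leaves the scanner has lost it.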

Sometimes there are also no scanners to rely on, meaning there is a knowledge gap for the transport system. In the warehouse, we can store crates in a large, automated storage rack. The transport system can be instructed to move a crate from this rack, but it does so blindly. In our testing, we do want to cover tests where this fails, to see if our WCS can recover from this situation. Therefore we must allow the transport layer to make wrong decisions based on incomplete data.

As seen in part II, we achieve this by introducing a strong separation between the layers. In our current iteration of the simulation the transport layer keeps its own state. It only knows what has been reported to it by the control layer. This information may be wrong or incomplete, but this is exactly how we like it for our tests.
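The separation can be illustrated with a toy example in which the hardware truth and the transport layer's belief live in different places. The `TransportState` class, its `report` method and the location names are hypothetical.

```python
from typing import Optional


class TransportState:
    """The transport layer's own view, built only from reported scans."""

    def __init__(self):
        self.last_seen: dict[str, str] = {}

    def report(self, location: str, barcode: Optional[str]) -> None:
        # A failed scan carries no barcode, so nothing can be updated.
        if barcode is not None:
            self.last_seen[barcode] = location


# Ground truth, known only to the hardware layer of the simulation.
hardware_truth = {"1234": "rack_exit"}

transport = TransportState()
transport.report("infeed", "1234")  # Successful scan at the infeed.
transport.report("rack_exit", None)  # Unreadable scan at the rack exit.

# The transport layer now holds a stale belief about crate 1234.
print(transport.last_seen["1234"], "vs", hardware_truth["1234"])
```

The mismatch is the interesting part: it is this kind of stale or missing information that the WCS has to recover from in the tests.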

Taming time

When adopting a discrete event simulation at the heart of our simulation, we introduced the assumption that events can only be caused by other events. We quickly realized that we had a problem: this is not strictly true!

WCS is an application that exists in the real world, outside of the simulation. It has no concept of simulated time, and instead will answer “as soon as possible”. This usually means it will respond to messages from the transport layer within a few hundred milliseconds. In the real warehouse this is more than fast enough: typically we have around 500 milliseconds between a crate being scanned and the transport layer requiring an input to divert this crate. Our simulation is on a different time, the virtual timeline. In the span of 500 milliseconds it is possible that entire minutes or even hours pass, depending on how crowded the event heap is! The result: flaky tests that pass depending on how the CPU is allocated between the simulation and the WCS.

Of course, we want to go as fast as possible. Waiting for 500 milliseconds for a response every time the simulation emits an event is not feasible, as the test suite would take too long to run. We came up with a small protocol using correlation IDs as our solution: the simulation would include an identifier in the message headers of each event it sends. After sending the event, the event heap is locked until this identifier is echoed back by the WCS. The WCS, in turn, was equipped with a testing profile that acknowledges messages if this header is present. This profile is not enabled in production.
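The acknowledgement mechanism can be sketched in-process with threads. This is only an illustration of the idea: the `AckGate` class, the `correlation_id` header name and the timeout are assumptions, and a thread stands in for the WCS, which in reality sits on the other side of a message broker.

```python
import threading
import uuid
from typing import Callable


class AckGate:
    """Blocks the caller until a sent message has been acknowledged."""

    def __init__(self):
        self._pending: dict[str, threading.Event] = {}
        self._lock = threading.Lock()

    def send(
        self,
        publish: Callable[[dict, dict], None],
        payload: dict,
        timeout: float = 5.0,  # Assumed safety net, in real seconds.
    ) -> None:
        correlation_id = str(uuid.uuid4())
        acked = threading.Event()
        with self._lock:
            self._pending[correlation_id] = acked
        publish({"correlation_id": correlation_id}, payload)
        # The event loop calling send() is effectively locked here.
        if not acked.wait(timeout):
            raise TimeoutError(f"No acknowledgement for {correlation_id}")

    def on_ack(self, correlation_id: str) -> None:
        with self._lock:
            acked = self._pending.pop(correlation_id, None)
        if acked is not None:
            acked.set()


gate = AckGate()
handled: list[str] = []


def fake_wcs(headers: dict, payload: dict) -> None:
    # Stand-in for the WCS test profile: process the message on another
    # thread, then echo the correlation ID back.
    def handle() -> None:
        handled.append(payload["barcode"])
        gate.on_ack(headers["correlation_id"])

    threading.Thread(target=handle).start()


gate.send(fake_wcs, {"barcode": "1234"})
# By the time send() returns, the message has been fully processed.
print(handled)
```

Because `send` only returns after the echo, the simulation waits exactly as long as the WCS needs and no longer, instead of sleeping for a fixed interval after every message.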

At this point, we have restored our assumption! Events on the simulation heap may cause messages to be sent to the WCS, and with this RPC protocol we ensure that all events created as a result are created before the next event is popped from the heap.

We have a second challenge to our assumption: the test driver can also be used to introduce new events, not directly caused by a previous event. For example, if we create a new crate on a conveyor we are really creating an event chain by scheduling crate placement for this conveyor. To stabilize our tests and overcome this problem we ensure that all REST calls to manipulate the simulation can only be placed when the simulation is paused.

From the test writer’s perspective, the virtual timeline is also not your friend. Say our test writer wants to express that “Given crate 123 is placed on conveyor A, and we start the simulation, when it passes by conveyor B we pause the simulation”. After executing the “start the simulation” step, our simulation is freely iterating over scheduled events at a speed unknown to the test writer. During the small lag between the REST calls that place the crate on a conveyor and start the simulation, and the REST call that schedules the simulation pause, an unknown amount of simulation time has passed. It could be that crate 123 still has to reach conveyor B, in which case the test will succeed. However, it may just as well be that hours have passed in the simulation, and that the crate has long left the warehouse! We address this problem by front-loading all inputs by the test writer to the simulation. In addition, we base assertions on an output stream of events from the simulation, rather than polling the state of the simulation. Now, our step becomes “When crate 123 passes conveyor B the simulation will pause, given crate 123 is placed on conveyor A, when we start the simulation, then eventually crate 123 reaches conveyor B”. Occasionally this is a bit of a puzzle for test writers, but the result is a stable test.
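The sketch below shows front-loading in miniature. The `MiniSimulation` class and its `add_input`, `pause_when` and `start` methods are invented for this example. The events are scheduled up front to stand in for the chain of events that a real crate placement would trigger.

```python
import heapq
from typing import Callable

Event = dict  # Hypothetical event shape.


class MiniSimulation:
    def __init__(self):
        self._heap: list[tuple[int, int, Event]] = []
        self._sequence = 0  # Tie-breaker, so events are never compared.
        self.time = 0
        self.running = False
        self.pause_conditions: list[Callable[[Event], bool]] = []
        self.output: list[tuple[int, Event]] = []  # Stream for assertions.

    def _require_paused(self) -> None:
        if self.running:
            raise RuntimeError("Inputs are only accepted while paused")

    def add_input(self, at: int, event: Event) -> None:
        self._require_paused()
        heapq.heappush(self._heap, (at, self._sequence, event))
        self._sequence += 1

    def pause_when(self, condition: Callable[[Event], bool]) -> None:
        self._require_paused()
        self.pause_conditions.append(condition)

    def start(self) -> None:
        self.running = True
        while self.running and self._heap:
            self.time, _, event = heapq.heappop(self._heap)
            self.output.append((self.time, event))
            if any(check(event) for check in self.pause_conditions):
                self.running = False
        self.running = False


sim = MiniSimulation()
# The pause condition is registered before time starts moving.
sim.pause_when(
    lambda e: e.get("barcode") == "123" and e.get("conveyor") == "B"
)
sim.add_input(0, {"type": "placed", "barcode": "123", "conveyor": "A"})
sim.add_input(10, {"type": "passed", "barcode": "123", "conveyor": "B"})
sim.add_input(3600, {"type": "left_warehouse", "barcode": "123"})
sim.start()

# Assertions read the output stream instead of polling the state.
print(sim.time, sim.output[-1])
```

Since the pause condition is already known when the simulation starts, the run stops at simulated second 10 regardless of how fast the host machine processes events.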

The results are in…

At this point, we have our simulation ready to write some tests! In our previous installments of this series, we have highlighted how we used Behave for defining human-readable tests. Consider the scenario below, taken from our test suite for the warehouse in Utrecht. In this test we assert that we can introduce new groceries into our storage through a process called decanting.

Given decanter1 will decant stock at station 1
And the empties buffer contains an empty crate with barcode 1234
And decanter1 will validate and decant 1 stock of 12 apples into crate 1234
Then WCS registered 12 apples on crate 1234
And crate 1234 eventually enters the storage rack

This test consists of a total of 23 messages going back and forth between the simulation and our WCS! This simple test by itself already would be a time sink to execute manually for every release of WCS. A more interesting scenario involving picking of groceries for customers easily involves ten times the message volume, creating an even stronger case for automation.

Conclusion

And with that, we have a simulation product! Inside our box we are simulating hardware, emulating behavior of software, and mimicking the humans that together make a fulfillment center. We have learned to stay close to the real world, as this drives the emergence of behavior of the transport system that proved important in our tests. We coupled a real-time system to a virtual timeline in our discrete event simulation, and learned how to write stable tests. The result is a containerized service we can run our WCS against to execute tests that span hundreds of asynchronous communication points between the two services.

Now, all that is left is to write some more tests…