DEVOPS

DORA Metrics At Work

How we doubled our team’s delivery performance within a year as measured by DORA metrics.

Egor Savochkin

Published in

Booking.com Engineering

10 min read · Jan 15, 2024

A man stands in front of a projector broadcasting an image of a graph going up. The man says: “And this is the only performance indicator that’s moving up. Unfortunately, it’s my blood pressure.” — *source*

Imagine your team secured a budget for doubling the number of software engineers. That’s great! You can finally fix all the bugs, implement new ideas, and clean up all the technical debt that’s been accumulating for years. Right? Wait, wait… Not so fast.

First, it will take time to hire new software engineers and onboard them. They’ll need to learn the product domain, dive deep into the technology stack, and learn your company’s specific tooling. They should familiarize themselves with the processes and connect with peers. Even if onboarding goes well, you are unlikely to double your team’s performance within a year [see: Brooks’s Law].

Let’s swap juniors with seniors then! Or let’s start bargaining with the HR department, which could bring you the wrong people. Well, this will likely not work either. Great people do matter, but as W. Edwards Deming once said, “a bad system will beat a good person every time” (see also the famous Red Bead Experiment or [Walt88 Ch4]).

Think about it. You somehow reached the state where you are struggling to keep up with all those tasks and bugs, right? What if you manage to add extra resources and in a year all you gain is technical debt accumulating at an even faster rate?

The picture shows a meeting room with a large table at which people are sitting. In front of the table hangs a large screen broadcasting a man speaking. The man says: “I need all the software engineers in the world. There’s no time to explain!” Somebody from the left side of the table says: “Well, if it’s urgent, I guess we have to agree, right?” Somebody from the right side of the table says: “I was going to ask what it’s for, but he clearly said there’s not enough time.” (*source*) *Me at my dream next year’s budgeting meeting.*

Okay. Let’s improve the system. But what exactly do we need to work on? Let’s follow a proven approach to continuous improvement. First, select outcome-oriented metrics that matter for your company [Seid19, Hubb14]. Then, focus on improving them by tackling the most constraining factors one by one [Gold04].

In this article, we share how our team managed to improve its software delivery performance twofold in a year without adding extra resources. We used DORA metrics because they predict organizational performance and well-being improvement [Fors18].

DORA metrics
DevOps Research and Assessment (DORA) is a running research program. It seeks to understand the capabilities that drive software delivery and operational performance. DORA suggested using four key metrics, which predict organizational performance:
Deployment frequency (DF): How often does your organization deploy code to production?
Lead time for changes (LTFC): How long does it take to go from code committed to code running in production?
Change failure rate (CFR): What percentage of changes to production result in degraded service and need remediation?
Time to restore (TTR): How long does it generally take to restore service when a service incident or a defect that impacts users occurs?
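The two throughput metrics used later in this article, DF and LTFC, can be derived from a simple log of changes. Below is a minimal sketch, assuming each change is recorded as a pair of commit time and production deployment time. The record layout, the sample timestamps, the function names, and the choice of the median as the aggregate are all illustrative assumptions, not the team's actual tooling.

```python
from datetime import datetime
from statistics import median

# Hypothetical records: (commit time, production deploy time) per change.
changes = [
    (datetime(2023, 3, 1, 9, 0), datetime(2023, 3, 1, 23, 0)),
    (datetime(2023, 3, 2, 10, 0), datetime(2023, 3, 3, 2, 0)),
    (datetime(2023, 3, 6, 8, 0), datetime(2023, 3, 6, 20, 0)),
]


def deployment_frequency(changes, start, end):
    """Count distinct production deployments in the window [start, end)."""
    return len({deployed for _, deployed in changes if start <= deployed < end})


def lead_time_hours(changes):
    """Median hours from code committed to code running in production."""
    return median(
        (deployed - committed).total_seconds() / 3600
        for committed, deployed in changes
    )


march_df = deployment_frequency(changes, datetime(2023, 3, 1), datetime(2023, 4, 1))
march_ltfc = lead_time_hours(changes)
print(f"DF: {march_df} deployments, LTFC: {march_ltfc:.1f} hours")
```

The median is used here because a few very slow changes would otherwise dominate the figure; a mean or a percentile would be an equally reasonable choice depending on what the team wants to highlight.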

Context

The team in our Fintech business unit was formed in mid-2022 and took ownership of several processes under the finance domain.

*Picture 1. State of the services at the beginning of 2023*

All the functionality was implemented as part of the monolithic application over five years ago (see Picture 1). Since then, the majority of the backend logic has been extracted to a microservice.

The team started tracking the DF and LTFC metrics and set a baseline at the beginning of the year. In the following months, the team performed a series of improvements. These changes resulted in a twofold improvement in the metrics by the end of the year.

*Picture 2. Team delivery metrics improved twofold, as measured by the DORA metrics.*

It would be a Pyrrhic victory to improve the speed of delivery but ruin the quality. Unfortunately, we found it difficult to use DORA’s suggested stability metrics, CFR and TTR (see [Dav21, Void22, Mor23]). Instead, the team used the reliability metrics and the number of open defects. The reliability metrics tracked major incidents that affect many users. The number of open defects accounted for problems with individual customers that the reliability metrics did not capture.

Backend service

In March, the DF for the backend service was 15 deployments per month, and the LTFC was about 14 hours. The latter meant that a software engineer usually had to wait almost a full working day or more to have a change deployed to production. This was a sign of a poor developer experience and a longer time to market.

The main problem was that the code was neither easy to understand nor easy to change. If a bug was reported, it took weeks to diagnose it, find a way to fix it, and get the fix deployed to production. The unit test coverage was low, and the team did not have confidence in it. Everybody was reluctant to make any improvements, fearing that this would break the code in unexpected ways.

The test automation, documentation, and internal quality looked like the most constraining factors. The team decided to start measuring and improving them.

The team adopted the Boy Scout rule to improve the code quality through refactoring and test automation without stopping the feature work. While implementing changes or fixing defects, you also strive to improve the surrounding code. This does not need to be a huge improvement. It may be as simple as adding unit tests to the classes you touched or making small refactorings to fight code smells.

We found that the Boy Scout rule makes refactoring much more efficient. First, it takes less time to improve the code you’ve just worked on. Second, you are likely to improve the code that is changed more often.

The Boy Scout rule
“Always leave the campground cleaner than you found it.” If you find a mess on the ground, you clean it up regardless of who might have made it. You intentionally improve the environment for the next group of campers. Robert C. Martin. [Hen10]

Unfortunately, to master refactoring, it is not enough to read great books from gurus, e.g. [Fow18, Mar08, Ker04, Cla21]. Refactoring is a skill that takes time to learn and frequent practice to keep sharp. It is very difficult to practice refactoring on real tasks, as we are often under time pressure and the real code is more complex. So, the team started practicing refactoring katas to gain more hands-on experience in tackling code smells. We also found them useful for testing our ideas.

“How do you get to be an All-Star sportsperson? Obviously, fitness and talent help. But the great athletes spend hours and hours every day, practicing” [CodeKata].

In June, it became obvious that the code reviews were the biggest bottleneck. Very often the merge requests (MRs) were quite large, and it was very difficult and time-consuming to walk through them during the code review. This was a painful and unpopular task. What improved the situation was to adopt working in small batches (that is, favor small MRs) and agree to make code review a priority. As a result, the team saw a steep decrease in code review time in July.

In July, the increased test coverage allowed the team to deploy to production with little or no manual regression testing. At this point, the deployment procedure took an average of 40 minutes. It required many manual steps, including two canary deployments and verifications. Based on our observations and statistics all those manual steps were redundant. For example, there were zero cases in the past when something went wrong during canary deployments. If so, why spend time on them?

Despite the fact that nobody had done this in the past (at least in our department), the team decided to automate the deployment. The idea was to have an MR deployed straight to production after being merged into the trunk without any manual verifications. The obvious concern was that this would ruin the quality. On the other hand, the team had a decent safety net as well: good test automation, peer code review, small changes, etc.

The team decided to give it a try. Should anything go wrong, we could detect it with the help of our quality-related metrics and return to the good old “safe” procedure. Fortunately, the change did not affect the quality but reduced the deployment time from 40 minutes to 4 minutes. The DF and LTFC metrics also reflected the improvement, moving to 43 deployments per month and 1.3 hours in August.

The picture shows the king with a sword standing in front of the tent with his guards. An army is visible in the background, preparing to attack. The army is armed with swords and spears. Next to the king stands a man with a machine gun. The king waves him off and says: “No! I don’t have time to see any crazy salesmen… we’ve got a battle to fight!” (*source*) *If we are doing the same thing again and again we will get the same results (quote).*

All in all, the DF improved from 15 deployments per month in March to 37 in October, and the LTFC improved from 13.8 hours to 4.2 hours over the same period.
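As a quick check of the arithmetic, the improvement factors implied by the March and October figures can be computed directly. The helper name `improvement_factor` is ours and is introduced only for illustration.

```python
def improvement_factor(before, after, lower_is_better=False):
    """Return how many times better `after` is compared with `before`."""
    return before / after if lower_is_better else after / before


# Backend service figures quoted in the text (March vs. October).
df_factor = improvement_factor(15, 37)  # deployments per month
ltfc_factor = improvement_factor(13.8, 4.2, lower_is_better=True)  # hours

print(f"DF x{df_factor:.1f}, LTFC x{ltfc_factor:.1f}")  # prints "DF x2.5, LTFC x3.3"
```

Both factors are above two, which is consistent with the twofold improvement claimed for the backend service.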

UI pages

Tight coupling and deployment tooling for the monolith were problems that we could not get around. The DF of 6–8 deployments per month and the LTFC of 2–3 days led to poor development performance and a poor developer experience. Engineers were incentivized to batch commits and avoid deployments at all costs. The recommended approach to deal with this in our company was to migrate the pages to the Micro-Frontend (MFE) technology.

The team initiated the migration project for several of the most important pages.

Micro Frontends, MFE
Microfrontends is a front-end web development pattern in which a single application may be built from disparate builds. It is analogous to a microservices approach but for client-side single-page applications written in JavaScript. It is a solution to decomposition and routing for multiple front-end applications (Wikipedia).

The new MFE pages were rolled out to all our users in September. The team found that performance improved. Yet, the LTFC of one day was far from what we expected. Gathering the statistics felt like a cold shower. After all this effort, we still needed to wait for such a long time to get our changes to production! What?

The picture shows a meeting room with a round table and eight people sitting at it. One man says: “To address this mistake we must use root-cause analysis. I’ll begin by saying it’s not my fault.” — *source* *Me, when the team found that MFE migration did not bring us the performance we hoped for.*

We needed to look through all the MRs for the past month. For each MR, we recorded when it was submitted, commented on, approved, merged, and deployed.

It turned out that the MR review process for the MFE applications was more complicated than usual. For example, it required approvals from the expert MFE community outside of the team. As a consequence, the mean time to approve an MR was 17.1 hours.
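This kind of analysis can be sketched as a per-stage breakdown of MR timestamps. The example below is a minimal illustration, assuming each MR is a dictionary with `submitted`, `approved`, `merged`, and `deployed` timestamps. The field names, the stage boundaries, and the sample data are hypothetical and do not come from the team's records.

```python
from datetime import datetime
from statistics import mean

# Hypothetical stage boundaries: each stage runs from one event to the next.
STAGES = [("submitted", "approved"), ("approved", "merged"), ("merged", "deployed")]

# Hypothetical sample data for two merge requests.
merge_requests = [
    {
        "submitted": datetime(2023, 9, 4, 9, 0),
        "approved": datetime(2023, 9, 5, 3, 0),
        "merged": datetime(2023, 9, 5, 4, 0),
        "deployed": datetime(2023, 9, 5, 9, 0),
    },
    {
        "submitted": datetime(2023, 9, 6, 10, 0),
        "approved": datetime(2023, 9, 7, 2, 0),
        "merged": datetime(2023, 9, 7, 3, 0),
        "deployed": datetime(2023, 9, 7, 6, 0),
    },
]


def stage_means_hours(merge_requests):
    """Mean hours spent in each stage, averaged over all merge requests."""
    return {
        f"{start}->{end}": mean(
            [(mr[end] - mr[start]).total_seconds() / 3600 for mr in merge_requests]
        )
        for start, end in STAGES
    }


means = stage_means_hours(merge_requests)
bottleneck = max(means, key=means.get)
print(means)
print("Slowest stage:", bottleneck)
```

Splitting the lead time by stage makes the most constraining factor visible at a glance; in this made-up data set, waiting for approval dominates the other two stages.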

We contacted the MFE expert community and discussed how we could fine-tune the process. Several optimizations were implemented, which reduced the mean time to get approvals to 8 minutes (yes, that is right; this is because we had a lot of small MRs). As the approval time got shorter, we also agreed to try not to batch deployments. This led to a reduction of the deployment time to 1 hour.

As a result, in October, the LTFC was reduced to 14 hours, which was a big step forward and was perceived as a significant performance improvement.

Outcomes and observations

All the changes above resulted in a twofold improvement in the team’s software delivery performance, as shown in Picture 2. Some of them required considerable development effort. Many involved changing the way the team worked and did not require much work. Some of the changes required a mindset change and the development of new skills. None of them required adding extra resources.

The most important outcome of all the efforts was that they unlocked a new way of working, with a focus on internal quality and experimentation. If a software engineer sees a code smell, it is usually easier to fix it than to manage an extra piece of technical debt. The team can now validate the most important decisions without batching them with other lower-priority ones.

Management, coordination, and communication have also become easier. First, there is less code waiting to be deployed. For example, a software engineer can sit with our UX designer and make small changes on the fly. They get things done in minutes without any paperwork or management involvement. Second, fewer resources are needed.

Finally, it’s much more satisfying to work when you can see the results of your work in a matter of hours rather than days!

One may say that software delivery is an important yet tiny part of the value stream and that, as such, we may not have made much of a difference for the company. Fair enough. Yet the broken window theory applies here: improving software delivery encourages others to improve as well. Why would PMs think about quick experiments if they have to wait a month to deliver anything meaningful to the customer?

Key takeaways

It is difficult to improve performance by adding resources. Chances are high that what brought you to the current state relates to internal factors of how the organization works. Here lies the biggest potential for improvement.
An efficient way to drive improvements is the following. First, select outcome-oriented metrics that matter for your company. Then, focus on improving them by tackling the most constraining factors one by one.
The outcome of all the changes was that they unlocked a new way of working — with a focus on internal quality and experimentation.
It was very important for the team to focus on long-term sustainable performance rather than short-term gains. Yet, by adopting some of the practices it was possible to make improvements without blocking the feature work.
What worked for the team is generic enough to also be applicable to many other teams. However, the most constraining factors in each case may be different.

Special thanks to the team and to those who contributed to the article. Your help is greatly appreciated!
