Reflecting on Eventbrite’s Journey From Centralized Ops to DevOps

Once a scrappy startup, Eventbrite has quickly grown into the global market leader for live-event ticketing. Our technical stack changed during those first few years, but as with most things that reach production, pieces and patterns lingered.

Over the years, we leaned heavily into a Django, Python, and MySQL stack, and our monolith grew. As an early adopter of the AWS cloud, we changed how our monolith was deployed and scaled. This entailed building internal tooling and processes to solve the specific problems we were facing, and doubling down on that tooling while the cloud matured around us.

Keeping up with traffic bursts from high-demand events

Part of the fun and challenge of being a primary ticketing company is handling burst traffic from high-demand on-sales — events that generate traffic spikes when tickets are released for purchase at a specific time. Creators (how we refer to the folks who host events) will often gate traffic externally, posting a direct link to an Eventbrite listing on a social network or their own websites. Traffic builds on those sites while customers wait for the link to go live. The result is hundreds of thousands of customers hitting our site at once.

Ten-plus years ago, this was incredibly difficult to solve, and it’s still a fun challenge from a scaling-speed and cost perspective. Ultimately, reliability challenges in our monolithic architecture led us to invest in specialized engineering teams that manually scaled the site up during these traffic bursts and handled the day-to-day maintenance and upkeep of the infrastructure we were running.

A monolithic architecture isn’t a bad thing — it just means there are tradeoffs 

On one hand, our monolithic setup allowed us to move fast. Having some of Django’s core contributors helped us solve complex industry problems, such as high-volume on-sales in which a small number of tickets goes on sale to a large number of customers. On the other hand, as our team and our platform’s feature set grew, things became unwieldy, and we centralized production and deployment maintenance in response to site incidents and bug triage.

This led us to try breaking up the monolith. The result? Things got worse, because we didn’t address the data layer and ended up with mini Django monoliths that we incorrectly called services.

The decision to move from an Ops model to a DevOps model, and the hurdles along the way

Enter our three-year technical vision. To address our slowing developer velocity and improve our reliability, performance, and scale, we made an engineering-wide declaration to move away from an Ops model — in which a centralized team held all the keys to our infrastructure and our deployments — to a DevOps model in which each team has full ownership.

Our first hurdle was process. For teams to take real ownership, they’d have to be on call 24×7 for the services and code they owned. A small number of teams with production access were already on call, but the vast majority of our teams were not. This was an important moment in our ownership journey, and our engineering teams had many questions about the implications of what was not only a cultural change but a process change as well.

There are many technical hurdles to providing team-level ownership, and it’s tempting to get drawn into a “boil-the-ocean” moment and throw away all the historic learnings and business logic we’ve developed over our history. Our primary building block for team autonomy was a multi-account AWS strategy. Using Terraform, we built an account vending system that lets teams put clear walls between their workloads, frontends, and services. With these walls in place, each team has better control of and visibility into the code it owns.
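As a rough illustration of what Terraform-based account vending can look like — the team names, email addresses, and role name below are hypothetical placeholders, not our actual configuration:

```hcl
# Sketch: vend one AWS sub-account per team from a simple map,
# using the AWS provider's aws_organizations_account resource.
variable "teams" {
  type = map(object({ email = string }))
  default = {
    ticketing = { email = "aws+ticketing@example.com" }
    payments  = { email = "aws+payments@example.com" }
  }
}

resource "aws_organizations_account" "team" {
  for_each = var.teams

  name      = each.key
  email     = each.value.email
  role_name = "OrganizationAccountAccessRole" # role assumed for initial provisioning

  # Keep the account open if it is ever removed from Terraform state.
  close_on_deletion = false
}
```

Each vended account becomes a hard blast-radius and permissions boundary: a team’s workloads, IAM roles, and spend live inside its own account rather than in one shared one.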

Technical debt, generally, is a complicated ball of yarn to unwind

We had many centralized EC2-based data clusters: MySQL, Redis, Memcache, Elasticsearch, Kafka, and more. Migrating these to managed versions — and transferring ownership from our legacy centralized model directly to teams — required a high degree of cross-team coordination and focused team capacity.

As an example, migrating our primary MySQL cluster to Aurora required 60 engineers — representing all of our development teams — during the off-hours writer cutover. The effort to decentralize our data is leading us to develop full-featured infrastructure-as-code building blocks that teams can pull off the shelf to leverage the full capabilities of best-in-class managed data services.
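An “off-the-shelf building block” in this sense is a shared Terraform module a team consumes instead of hand-rolling its own cluster. The module path, names, and inputs below are illustrative assumptions, not our actual module:

```hcl
# Sketch: a team instantiates a centrally maintained Aurora MySQL module,
# getting sane defaults (backups, parameter groups, monitoring) for free.
module "orders_db" {
  source = "./modules/aurora-mysql" # hypothetical shared module

  cluster_name          = "orders"
  engine_version        = "8.0.mysql_aurora.3.05.2"
  instance_count        = 2
  instance_class        = "db.r6g.large"
  backup_retention_days = 14
}
```

The team owns the module call and the data inside it; the platform team owns the module’s internals, so best practices ship once and every consumer inherits them.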

The systems powering our frontend and our backend services are on a similar path to our data-ownership journey. We have examples of innovation around serverless compute patterns and new architectural approaches to address scale and reliability. We’re making big bets on some of our largest and most impactful services — two of which still live as libraries in our core monolith. The learnings accrued through these efforts will power the second and third years of our three-year tech vision.

The impact thus far, with more unlocks to come

By now, you’re probably realizing that at least some of our teams were shocked at the amount of change that came with their increased ownership responsibilities. We were confident this short-term pain was worth it — after all, our teams had been demanding this change through direct feedback in our developer and culture surveys.

The prize for us on this journey is customer value delivered through increased team velocity. While our monolithic architecture — on both the code and data sides of the house — got us to where we are today, teams were not happy with their ability to bring change and improvements to the things they owned. That was frustrating for everyone involved, and the gold at the end of the rainbow is teams making fundamental changes with modern tools and processes.

In the first year of our three-year technical vision, big changes in ownership have been unlocked. For example, we have migrated to Aurora, where teams own their data. We’ve also given teams direct ownership of their CI pipelines, improved our overall test coverage, provided team autonomy for feature-flag releases, and started re-architecting our two largest tier-1 services. It’s exciting to see new sets of challenges arise along the way — knowing these hurdles also unveil opportunities.
