Achieving Insights and Savings with Cost Data

The path to cloud efficiency begins with a cost data foundation

by Anna Matlin and Tamar Eterman
The Airbnb Tech Blog

Introduction

Business profitability and sustainability are powerful reasons to invest in infrastructure efficiency, but it is easy to feel lost about how to actually reduce costs. A foundation of robust and actionable data is essential for a successful efficiency program. At Airbnb, building this foundation made it possible to prioritize savings opportunities and ushered in a wave of cost reductions, summarized in a previous post.

More importantly, cost data has become a lever for long-term control. The team can react quickly before a spike wreaks havoc on the monthly bill and plan ahead when a big new project could become expensive. At the company scale, visibility into cost and usage has sparked a cultural shift. When savings can be measured, they can be recognized, and cost efficiency projects become exciting opportunities. As of early 2021, the most viewed dashboard at Airbnb is a dashboard of AWS costs.

We hope that sharing our approach will enable more companies to achieve AWS cost savings. Though Airbnb’s cost data foundation is built with one cloud provider in mind, our learnings from building a pipeline, defining metrics, and designing visualizations apply regardless of the cloud provider.

Architecting the Foundation

In the early days of Airbnb’s cost efficiency efforts, the team relied on the Cost Explorer dashboard in the AWS console. Cost Explorer represented a significant improvement over the monthly invoice because it was possible to see data before the end of the month, but it did not provide detailed insights because it was not connected to Airbnb’s data tools. Most teams at Airbnb rely on the data warehouse (i.e., Apache Airflow, Apache Hive, Apache Spark) and extensive analytics infrastructure (i.e., Minerva, Apache Druid, DataPortal, Apache Superset, SLA monitoring) to make data-informed decisions. To take full advantage of the available resources, our team built a pipeline on top of the AWS Cost & Usage Report (CUR), a rich source of raw data.

The pipeline transforms and enriches the CUR data with Airbnb-specific business logic and naming conventions, as illustrated below. We call this pipeline the "Airbnb CUR Pipeline," and the resulting tables are collectively called the "Airbnb CUR Foundation."

Figure 1. Airbnb CUR Pipeline

The loading and transformation of raw CUR files into the Airbnb CUR Foundation is performed in an Airflow pipeline, which runs daily. We describe this pipeline in more detail below.

Airbnb CUR Pipeline Steps

  1. Ingest report data from Amazon S3: The CUR is configured to land in an S3 bucket. From this bucket, the report files are read into Airbnb’s data warehouse. The raw data includes all report versions and all fields.
  2. Discount costs to reflect enterprise discount program (EDP): Every company’s situation is different when it comes to discounts. Each row of cost data is updated to reflect the EDP contract.
  3. Amortize costs to align with usage: Amortization makes it possible to spread out a 1-year or 3-year commitment (e.g., Savings Plans, Reserved Instances) over the full contract period so that the costs reflect resource usage. This calculation requires some custom logic; a simplified sketch follows this list.
  4. Enrich data to enable downstream analytics: Fields are renamed to reflect Airbnb conventions, with prefixes that specify whether a column is an ID, a metric, or a dimension. Discounted costs are blended using custom calculations so that all projects are charged the same average unit cost — similar to AWS Blended Rates.
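
To make the amortization step (step 3) more concrete, below is a minimal sketch in Python. It assumes an all-upfront commitment whose fee should be spread evenly across every hour of its term; the function name and inputs are illustrative, not Airbnb’s actual logic.

    from datetime import datetime

    def amortized_hourly_cost(upfront_fee: float,
                              term_start: datetime,
                              term_end: datetime) -> float:
        """Spread an upfront commitment (e.g., a Savings Plan or Reserved
        Instance fee) evenly over every hour of its term, so that daily
        cost reports reflect usage rather than the date the fee was charged."""
        term_hours = (term_end - term_start).total_seconds() / 3600
        return upfront_fee / term_hours

    # Example: a hypothetical $87,600 one-year, all-upfront commitment
    # amortizes to $10.00 for every hour of the term.
    hourly = amortized_hourly_cost(
        upfront_fee=87_600.0,
        term_start=datetime(2021, 1, 1),
        term_end=datetime(2022, 1, 1),
    )
    print(f"${hourly:.2f}/hour")  # -> $10.00/hour

In practice, partial-upfront and no-upfront commitments also carry recurring fees, so the full calculation combines the amortized upfront portion with the recurring portion for each hour of the term.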

Tips for Designing a Successful Pipeline

Here are some practices from our experience that helped ensure the Airbnb CUR Foundation was robust and accurate:

Design with downstream use cases in mind. Before building anything, establish the requirements for your pipeline. How will the pipeline service-level agreement (SLA) align with the lag of the raw data? What are the top-line metrics from a financial and engineering perspective, and how will these metrics be interpreted? What are the dimensions, or grouping variables, that will be used to cut and categorize these metrics? We reduced the number of dimensions from ~200 in the raw CUR to the ~30 most useful ones for the Airbnb CUR Foundation. This simplicity makes the downstream tables more usable.

Build for retroactive adjustments. Usage and cost data change retroactively over the course of a monthly billing cycle. This constraint informed architectural decisions. We designed a data model with two types of tables: one that is overwritten with retroactive adjustments and one with immutable historical snapshots. The first kind of table underlies the cost program dashboards, while the second kind of table ensures reproducible calculations for anomaly detection and attribution.
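
One way to realize this two-table pattern is to overwrite a "current" table on every pipeline run while also appending an immutable, run-dated snapshot. The PySpark sketch below uses hypothetical database and table names; it illustrates the pattern rather than Airbnb's actual job.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("cur_foundation_example").getOrCreate()

    # Enriched CUR rows produced earlier in the run (hypothetical table name).
    enriched = spark.table("cost_efficiency.cur_enriched_staging")

    # Table 1: overwritten on every run, so it always carries the latest
    # retroactive adjustments. Dashboards read from this table.
    enriched.write.mode("overwrite").saveAsTable("cost_efficiency.cur_costs_current")

    # Table 2: immutable snapshots partitioned by run date. Re-running anomaly
    # detection or attribution against an old snapshot reproduces the numbers
    # exactly as they appeared on that day.
    (enriched
        .withColumn("snapshot_ds", F.current_date())
        .write.mode("append")
        .partitionBy("snapshot_ds")
        .saveAsTable("cost_efficiency.cur_costs_snapshot"))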

Study the options for obtaining raw data. There is a menu of options for creating a new Cost & Usage Report in the AWS Console. We configured several reports before identifying which settings worked best for our downstream requirements. Airbnb’s CUR includes refreshes, versions, hourly data, and resource IDs. The file format mattered for ingesting data into the warehouse via Spark; companies using Amazon Redshift or Amazon Athena can ingest the data without additional processing.
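
For reference, the sketch below shows what a report definition with these kinds of settings can look like when created programmatically with boto3, the AWS SDK for Python. The report name, bucket, prefix, and region are placeholders, and the settings shown are illustrative rather than Airbnb's actual configuration.

    import boto3

    # The Cost & Usage Report API is served from us-east-1.
    cur = boto3.client("cur", region_name="us-east-1")

    cur.put_report_definition(
        ReportDefinition={
            "ReportName": "example-cost-and-usage-report",  # placeholder name
            "TimeUnit": "HOURLY",                       # hourly granularity
            "Format": "Parquet",                        # readable directly by Spark
            "Compression": "Parquet",
            "AdditionalSchemaElements": ["RESOURCES"],  # include resource IDs
            "S3Bucket": "example-billing-bucket",       # placeholder bucket
            "S3Prefix": "cur/",
            "S3Region": "us-west-2",
            "RefreshClosedReports": True,               # allow retroactive refreshes
            "ReportVersioning": "CREATE_NEW_REPORT",    # keep every report version
        }
    )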

Alternative Cost Monitoring Options

We recognize that not every company will want to build and maintain a cost data pipeline. There are also many third-party vendors that perform analytics using the CUR. Airbnb’s decision to build versus buy was motivated by the availability of internal resourcing, the need to incorporate custom logic (e.g., discounting), and the opportunity to integrate with the internal data tooling.

Telling Stories with Cost Data

Thanks to a close partnership between data science, finance, technical program management, and engineering, the Cost Efficiency Team developed a set of key metrics and dimensions that are immediately actionable when they are surfaced in charts and dashboards. Aligning on important definitions enabled weekly monitoring, capacity purchasing, budgeting, opportunity sizing, and savings measurement. In the section below, we will describe our approach to structuring cost data for maximum insight and impact.

Define Meaningful Metrics for Stakeholders

The best metrics for cost efficiency work are simple and well-understood by partner teams.

Top-Line Metrics: The primary metric of the Airbnb CUR data is Cost, in dollars, which incorporates amortization, discounting, and blending as described above. Cost per booking captures the impact of AWS costs on business margins.

AWS Product-Specific Usage Metrics: Unlike cost metrics, usage metrics differ from one product to another. For example, we have defined a vCPU-Hours metric, which measures compute usage at the fleet level while accounting for instance size. Usage metrics often reveal growth trends that are not apparent in cost data because of pricing terms. This is especially true for S3 storage, which we measure in GB/Month. Pricing for cold storage classes such as Amazon S3 Glacier and S3 Glacier Deep Archive is much cheaper than for Standard storage, so looking at cost data alone could lead us to overlook usage growth in these cold storage classes.
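
As an illustration of how a usage metric like vCPU-Hours can be derived, the PySpark sketch below weights EC2 instance-hours by each instance type's vCPU count, so larger instance types count proportionally more. The line_item_* and product_vcpu columns follow the standard CUR schema; the table name and the project_name_group dimension are hypothetical stand-ins for the Airbnb CUR Foundation.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("vcpu_hours_example").getOrCreate()
    cur = spark.table("cost_efficiency.cur_costs_current")  # hypothetical table

    # Daily fleet compute usage in vCPU-Hours (the metric plotted in Figure 2):
    # EC2 instance-hours weighted by the vCPU count of each instance type.
    vcpu_hours = (
        cur.filter(F.col("line_item_product_code") == "AmazonEC2")
           .filter(F.col("line_item_usage_type").contains("BoxUsage"))
           .groupBy(F.to_date("line_item_usage_start_date").alias("ds"),
                    "project_name_group")
           .agg(F.sum(F.col("line_item_usage_amount") *
                      F.col("product_vcpu").cast("double")).alias("m_vcpu_hours"))
    )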

Program Success Metrics: Our Percent Savings Plan Coverage Utilized metric highlights excess or insufficient compute usage compared to the pre-committed Savings Plan amount. This coverage metric is also relevant for other AWS products with reserved instances, such as Relational Database Service (Amazon RDS).
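
One plausible formulation of this coverage metric, with hypothetical inputs, is sketched below: Savings Plan-eligible usage is compared against the pre-committed hourly amount.

    def pct_savings_plan_coverage_utilized(eligible_usage_per_hour: float,
                                           committed_per_hour: float) -> float:
        """Compare Savings Plan-eligible compute usage with the hourly
        commitment. Values above 100% suggest usage is spilling over to
        on-demand rates; values below 100% suggest part of the commitment
        is going unused. This is an illustrative formulation."""
        return 100.0 * eligible_usage_per_hour / committed_per_hour

    # Example: $90/hour of eligible usage against a $100/hour commitment.
    print(pct_savings_plan_coverage_utilized(90.0, 100.0))  # -> 90.0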

Include Relevant Dimensions

Below are some examples of dimensions that surface meaningful insights from the Airbnb CUR Foundation and metrics.

  • Product Code — The product code associated with an AWS product; for example, AmazonEC2 is the product code for Amazon Elastic Compute Cloud.
  • Product Family — This groups together related usage types within a Product Code. For example, product families for RDS include Database Instance, System Operation, and Database Storage. This dimension is especially valuable for understanding patterns within a specific AWS product.
  • Line Item Description — A rich description of usage that gives insight into the pricing model for the line item. For an example, see the case study of AWS CloudTrail costs below.
  • Pricing Term — Denotes whether usage is Savings Plan, Reserved, On-Demand, Spot, or Unused. We use this dimension for capacity management.
  • Project Name — This is a user-defined tag which is surfaced in the CUR data. For example, the Viaduct project has its own tag.
  • Project Name Group — This field groups together Project Name at a higher level. For example, EMR clusters have individual projects, but they are all grouped into a single EMR project name group. See the chart below for an example of this dimension in action.

Other dimensions which we have found to be valuable include Instance Type Family, Instance Type, Usage Type, Storage Class, and Operation. Some highlight general trends, while others are useful for deeper data exploration. For more information about these dimensions, please visit the AWS CUR Documentation.
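
To illustrate how these dimensions slice the Cost metric, below is a PySpark sketch of a weekly cost rollup by Product Code and Product Family. The dim_ and m_ prefixes follow the naming convention described earlier; the specific column and table names are illustrative, not the actual Airbnb CUR Foundation schema.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("cost_by_dimension_example").getOrCreate()
    cur = spark.table("cost_efficiency.cur_costs_current")  # hypothetical table

    # Weekly cost sliced by two of the dimensions above.
    weekly_cost = (
        cur.groupBy(F.date_trunc("week", F.col("line_item_usage_start_date")).alias("week"),
                    "dim_product_code",
                    "dim_product_family")
           .agg(F.sum("m_cost").alias("m_cost"))
           .orderBy("week", F.desc("m_cost"))
    )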

Share Straightforward Visualizations

Below are three notable cost data visualizations, with simulated data.

Figure 2. Daily fleet usage in vCPU-Hours by project name group (trailing 30 days)
Figure 3. S3 Usage in GB/Month by Storage Class
Figure 4. Categories with week-over-week (WoW) cost increases/decreases flagged in monitoring

Mini-Case Study

The Line Item Description field is useful for cost data detective work. In the chart below, grouping the Cost metric by the Line Item Description dimension revealed that a spike in CloudTrail costs was due to data events rather than log data. This finding directed us to look at S3 request patterns and started a conversation with the team owning this data. Ultimately, this investigation reduced daily CloudTrail costs significantly.
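
A query along these lines surfaces that breakdown. The Spark SQL sketch below is illustrative; the table and column names are hypothetical stand-ins rather than the actual Airbnb CUR Foundation schema.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cloudtrail_spike_example").getOrCreate()

    # Break daily CloudTrail cost down by Line Item Description. Data events
    # and management/log events carry different descriptions, which is what
    # separated the two cost drivers in Figure 5.
    cloudtrail_by_description = spark.sql("""
        SELECT ds,
               dim_line_item_description,
               SUM(m_cost) AS m_cost
        FROM cost_efficiency.cur_costs_current   -- hypothetical table
        WHERE dim_product_code = 'AWSCloudTrail'
        GROUP BY ds, dim_line_item_description
        ORDER BY ds
    """)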

Figure 5. CloudTrail cost spike investigation, grouped by line item description

Tips for Socializing Cost Data

Though a data foundation opens up a world of opportunities, the data alone is not enough. Below is a selection of tips for getting value out of the data once it exists.

  • Get to know the domain: Why are Savings Plans a more flexible option than Reserved Instances? What does it mean to autoscale a workload? To identify opportunities, it is helpful to understand the cloud pricing model and the technical landscape. This can be achieved by reading AWS documentation, watching talks from AWS re:Invent, working closely with the engineering team(s) that use the cloud provider offerings, and engaging with other cost efficiency practitioners.
  • Review costs regularly: Our team found that the only way to consistently monitor cost data was to get together weekly to review a dashboard of key metrics and triage the spikes caught via anomaly detection.
  • Convert to dollars: When sharing findings, always convey the impact in financial terms. Unlike other domains, cost is highly tangible, and leaders react quickly when they know that inaction will result in extra spend. For example, “if storage continues to grow at X%, it will cost $Y in 2021.” A quick projection sketch follows this list.
  • Partner with stakeholders: Before building a dashboard or starting an analysis, get input from key team members and leaders. Not only will you share something more useful, but your stakeholders will appreciate that you looped them into the conversation early.
  • Be curious! Ask questions. Dig into the data when something looks strange and share your learnings with others. Often, these investigations can lead to conversations with engineering teams and initiate tracks of efficiency work.
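
To make the “convert to dollars” tip concrete, a back-of-the-envelope projection like the sketch below usually lands better than a growth percentage on its own. All numbers here are hypothetical.

    def projected_annual_cost(current_monthly_cost: float,
                              monthly_growth_rate: float,
                              months: int = 12) -> float:
        """Project total spend over the next `months`, assuming cost compounds
        month over month. All inputs here are hypothetical illustrations."""
        return sum(current_monthly_cost * (1 + monthly_growth_rate) ** m
                   for m in range(1, months + 1))

    # "If storage spend is $100K/month and keeps growing 5% month over month,
    # it will cost roughly $1.7M over the next year."
    print(f"${projected_annual_cost(100_000, 0.05):,.0f}")  # -> $1,671,298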

Conclusion

Developing a trustworthy and interpretable cost data foundation set Airbnb up for long-term success in cloud cost management. But data alone is not enough to achieve cost savings. Leadership commitment to savings goals, effective program management, contract management, and technical excellence across Airbnb made the success of the program possible. Engineers pop into office hours to ask about their costs on the company-wide dashboard, and teams proudly share the results of their efficiency projects.

We hope the foundational details and learnings shared in this post will demystify this domain and inspire practitioners at other companies to pursue a data-informed path toward cost efficiency.

Are you passionate about cloud efficiency, or inspired by unique data challenges? We’re always looking for talented individuals to join the team!

Acknowledgements ❤️

The Airbnb CUR Foundation was made possible with the support of many people. We are grateful to Stephen Zielinski, Krishna Bhupatiraju, Tingting Ma, Jinyang Li, Jian Chen, Jon Tai, Yi Chen, Yuhe Xu, Melanie Cebula, and Mingzhu Liu for their technical contributions and architectural advice. Thanks to David Morrison for his thoughtful and constructive feedback reviewing this post. We were fortunate to have support from many managers who have championed this work: Ari Siegel, Jen Rice, Guang Yang, Swaroop Jagadish, Reid Andersen, Brian Wallace, Jason Sobel, and Bharat Rangan.

We would like to express our gratitude to the AWS account team, who have worked with us at every step on our cost efficiency journey: Dan Facchetti, Amulya Sharma, Nathan Perry, Jeff Maxin. Thank you as well to the many cost efficiency practitioners at other companies who were generous to share their experiences.

Amazon Web Services, EC2, Amazon RDS, Amazon Redshift, Amazon Athena, Amazon Glacier, Amazon Elastic Compute Cloud, AWS CloudTrail and Amazon S3 are trademarks of Amazon.com, Inc. or its affiliates in the United States and/or other countries.

Apache Airflow, Apache Hive, Apache Spark, Apache Druid, Apache Superset, and Apache are either registered trademarks or trademarks of The Apache Software Foundation in the United States and/or other countries.

Kubernetes is the registered trademark of The Linux Foundation in the United States and/or other countries.

All trademarks are the properties of their respective owners. Any use of these are for identification purposes only and do not imply sponsorship or endorsement.
