Analyzing Time Series for Pinterest Observability

Pinterest Engineering
Pinterest Engineering Blog
Jul 18, 2023


Brian Overstreet | Software Engineer, Observability; Humsheen Geo | Software Engineer, Observability

Time series are a critical part of Observability at Pinterest, powering 60,000 alerts and 5,000 dashboards. A time series is an identifier with a sequence of values, each associated with a timestamp. Given the widespread use and critical nature of time series, it’s important to give engineers the ability to express what operations to perform on them in a readable, understandable, and efficient manner. In this post, we will cover the background of time series at Pinterest, the goals of designing an expressive time series language, and some examples of how we are using this language today.

Background

To understand how we got to our current solution, it’s important to look where we started. These are the time series solutions Pinterest has used, ending with the current solution:

With the exception of Graphite, none of these solutions supported operations such as combining, reducing, or filtering time series metrics. For this reason, we needed a common way to take the time series returned from any time series database and perform operations on them independently of how they were stored.

Interacting with time series is broken down into two parts:

  1. Querying from the Goku time series database using the OpenTSDB query syntax to maintain compatibility with our previous time series database and its ease of use.
  2. Performing operations on the returned time series. For performing operations, we developed a time series script (TScript). This blog post describes TScript and how we use it at Pinterest.

Before we developed TScript, we used a language similar in format to Graphite functions where the functions are nested. For example, calculating a five minute average success rate given a success and failure metric in a function based approach looks like this:

averageSeries(scale(divideSeries(success, sumSeries(failed, success)), 100), 5)

As you can see above, the intent of the success rate calculation is already becoming hard to understand. Extending the function-based approach to support week over week comparisons or other analysis quickly became difficult to write. This led us to develop TScript. Here’s the equivalent TScript to the success rate calculation from the example above:

cmd:
  sr = success / (success + failed) * 100
  sr.rolling(Mean, 5*Minute)
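For readers more familiar with pandas, which TScript is built on, the calculation above can be approximated as follows. This is only a sketch: the sample values are made up, and mapping `rolling(Mean, 5*Minute)` to a time-based pandas rolling mean is an assumption rather than a description of TScript internals.

```python
import pandas as pd

# Hypothetical one-minute samples standing in for the two metrics.
index = pd.date_range("2023-02-25 02:50", periods=6, freq="min")
success = pd.Series([99.0, 98.0, 97.0, 96.0, 95.0, 94.0], index=index)
failed = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=index)

# sr = success / (success + failed) * 100
sr = success / (success + failed) * 100

# sr.rolling(Mean, 5*Minute): mean over a trailing five-minute window.
sr_5m = sr.rolling("5min").mean()
print(sr_5m.round(2))
```

Each statement maps to one line of the TScript, which is the readability property the nested-function form loses.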

TScript Design Goals

We had the following goals when designing TScript as a domain-specific language:

  • Database independent: the manipulation of time series data should not depend on which database returned the time series
  • Easy to read: users should be able to quickly understand what the time series operation is doing
  • Easy to write: users should be able to focus on the problem and not syntax
  • Easy to add context aware help: easy for a system to understand what the user is doing to provide help
  • Easy to extend: new functionality, such as additional time series operations, should be simple to add
  • Built-in alerting

Features

With these goals, we developed TScript. Here are the features of TScript that help meet the design goals:

Variables as Input

Metrics from the database are referred to with variables in TScript. For example, a user will define a Goku query like ts_1: sum:m1{app=web}, and now ts_1 is available as the result of that Goku query.

Multiline

TScript allows for multi-line input, which helps separate different operations for readability.

Object Oriented

Each variable has operations that can be exposed by adding a “.” to the end of the variable name. For example, to sum all the series in a variable, a user can type “ts_1.sumSeries()”. These can be chained together to form expressions like “ts_1.sumSeries().rolling(Mean, 2*Hour)”.

Assignments

Operations on variables can be assigned to new variables.

Filtering

Filtering can be performed by using square brackets.

Alerting

TScript input is broken down into three sections:

  • Cmd: the final line of the cmd section is what is shown to the user
  • Crit: an expression for defining what constitutes a critical alert
  • Warn: an expression for defining what constitutes a warn level alert

The crit and warn sections are effectively the same but allow for different notification options. For example, for a disk space alert, a user might want a slack message at 75% utilization but a page at 95%. By combining crit and warn as part of the entire TScript definition, graphs and alerts are effectively the same thing. This becomes even more powerful with templating, which will be discussed in a later blog post.

Foundation

TScript is built with:

  1. pandas: to structure and handle the time series data
  2. pyparsing: to parse the TScript language

Pandas is a data analysis library that has a built-in data structure, the DataFrame, designed for handling time series data. While some database backends return data with a fixed interval between data points, others return data exactly as they are sent. This causes issues when performing analysis on the data. By converting to a DataFrame first, we can more easily manipulate the data. Pandas also provides time-based indexing, handling of missing values, time zone handling, rolling windows, and many other built-in time series functions.
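As a rough illustration of why a DataFrame helps here, the sketch below takes points that arrive at irregular times and buckets them onto a fixed one-minute interval. The sample points and the choice of averaging within each bucket are assumptions made for the example, not a description of how TScript aligns data.

```python
import pandas as pd

# Hypothetical points that arrive at irregular times.
raw = pd.Series(
    [1.0, 3.0, 5.0],
    index=pd.to_datetime(
        ["2023-02-25 02:51:05", "2023-02-25 02:51:40", "2023-02-25 02:53:10"]
    ),
    name="cpu",
)

# Bucket into fixed one-minute intervals; empty buckets become NaN.
aligned = raw.resample("1min").mean().to_frame()
print(aligned)
```

Once the data sits on a regular time index, missing intervals are explicit NaN values that later operations can fill or skip.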

Pyparsing is used to represent the TScript grammar because it is easy to use, quite readable with its self-explanatory class names, and has an active development community.
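To give a feel for what a pyparsing grammar looks like, here is a minimal sketch that parses a chained expression such as the one shown earlier. It is a toy grammar written for this example only; the rule names and structure are hypothetical and do not reflect the actual TScript grammar.

```python
import pyparsing as pp

# Toy grammar (hypothetical): variable followed by chained method calls.
identifier = pp.Word(pp.alphas + "_", pp.alphanums + "_")
number = pp.Word(pp.nums)

# A duration such as 2*Hour, kept as a [count, unit] pair.
duration = pp.Group(number + pp.Suppress("*") + identifier)
argument = duration | identifier | number
arguments = pp.Optional(argument + pp.ZeroOrMore(pp.Suppress(",") + argument))

# A call such as .rolling(Mean, 2*Hour), kept as [name, [args...]].
call = pp.Group(
    pp.Suppress(".")
    + identifier
    + pp.Suppress("(")
    + pp.Group(arguments)
    + pp.Suppress(")")
)
expression = identifier + pp.Group(pp.ZeroOrMore(call))

parsed = expression.parseString(
    "ts_1.sumSeries().rolling(Mean, 2*Hour)", parseAll=True
)
print(parsed.asList())
```

The class names (`Word`, `Group`, `Suppress`, `ZeroOrMore`) read close to the grammar they describe, which is the readability benefit mentioned above.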

Usage

We will go over the basic syntax and capabilities of TScript through a series of examples.

Plot a Metric

Here is an example for a query using the OpenTSDB/Goku query syntax: avg:rate:tc.proc.stat.cpu.total{host_type=viz-statsboard-api-alerts-prod}. This query will return the CPU average for the statsboard-api-alerts-prod cluster.

Resulting DataFrame:

# df contains avg:rate:tc.proc.stat.cpu.total{host_type=viz-statsboard-api-alerts-prod}
(Pdb) p df.tail()
nm                         cpu
2023-02-25 02:51:00  39.946039
2023-02-25 02:52:00  38.755894
2023-02-25 02:53:00  38.847821
2023-02-25 02:54:00  39.313967
2023-02-25 02:55:00  39.775589

Resulting graph:

TScript:

d

The TScript used in this case is d for DataFrames, which is a reserved keyword that returns all the metrics specified with no modifications. The last line of the TScript command is what is returned to the graphing component.

TScript exposes each time series returned from the database as individual variables and supports further variable assignments in its multi-line input. In the example above, the query is assigned to the cpu variable. TScript variables are treated like objects where methods get applied to them. For example, to take a non-negative value of the above query, the TScript will be:

d.nonNegative()

The benefit to this approach is that users can easily discover methods to use through context aware tooltips:

Binary Operations

TScript offers a broad range of binary operations to support complex computations. These operations include both mathematical operators such as addition, subtraction, multiplication, and division, as well as logical operators like AND, OR, and LESS THAN.

By implementing the appropriate math operators, users can easily compute success rates.

- success: sum:ostrich.metrics.goku.put_data_point_number.count{host_type=infra-goku-*-prod}
- failed: sum:ostrich.metrics.put_data_distributed_failed_data_point_num.p99{host_type=infra-goku-*-prod}
cmd: sr = success / (success + failed) * 100

When applying mathematical operators on two DataFrames, only the matching series (same tag combinations) from each DataFrame will be evaluated against each other. To match up DataFrames with different column structures, TScript provides a few functions to alter the columns:

# Match columns: broadcasts a DataFrame with fewer columns into a DataFrame with more columns.
# In this example, "success" contains multiple series within one column while "total" is just
# one series with no columns. To calculate the individual success rate for every value in
# "success", "total" must be broadcasted to "success" to match the columns.
success / total.broadcast(success)

# Remove columns: aggregates over a column to remove it
success.reduce(Sum, "column_name")

# Rename columns
rename
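The matching and broadcasting behaviour can be sketched in plain pandas. This is an analogy under assumptions, not TScript's implementation: the host names and values are invented, and series are modelled as simple columns rather than full tag combinations.

```python
import pandas as pd

index = pd.date_range("2023-02-25 02:51", periods=3, freq="min")
success = pd.DataFrame(
    {"host_a": [90.0, 80.0, 70.0], "host_b": [10.0, 20.0, 30.0]}, index=index
)
failed = pd.DataFrame(
    {"host_a": [10.0, 20.0, 30.0], "host_c": [1.0, 1.0, 1.0]}, index=index
)

# Only series present in both frames are evaluated against each other.
matching = success.columns.intersection(failed.columns)
rate = success[matching] / (success[matching] + failed[matching]) * 100

# "Remove columns": aggregate across the columns into a single series.
total = success.sum(axis=1)

# "Match columns": broadcast the single series across every column.
share = success.div(total, axis=0) * 100
print(rate)
print(share)
```

Here `rate` keeps only `host_a`, the one series both inputs share, while `share` divides every column of `success` by the same `total` series.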

Filtering and Ranking

To filter on values, TScript uses a bracket syntax:

d[d > 20]

This will only return data points with a value over 20.

This syntax can also be used to rank series; for example, returning the top 10 of the summation:

d[d.rank(d.total(Sum).rank() < 10)]
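Since everything is a DataFrame, both ideas have close pandas counterparts. The sketch below uses invented data and keeps the top 2 series instead of the top 10 to stay small; it illustrates the concept and is not the code behind TScript's bracket syntax.

```python
import pandas as pd

index = pd.date_range("2023-02-25 02:51", periods=3, freq="min")
d = pd.DataFrame(
    {"a": [10.0, 25.0, 30.0], "b": [5.0, 5.0, 5.0], "c": [40.0, 15.0, 50.0]},
    index=index,
)

# Like d[d > 20]: points that fail the condition become NaN.
over_20 = d[d > 20]

# Rank series by their total and keep the two largest.
totals = d.sum()
top_2 = d.loc[:, totals.rank(ascending=False) <= 2]
print(over_20)
print(top_2)
```

Value filtering masks individual points, while ranking selects whole series based on an aggregate.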

Anomaly Detection

It’s easy to add arbitrary functions to TScript since everything is a DataFrame. For example, TScript integrates with Prophet for anomaly detection.

cmd: d.fillEmpty(0).prophet(1*Hour)

This results in the following graph, where the colors denote training interval and confidence band.

Highest Percent Diff Over Time

By combining assignments with the object oriented syntax, users can perform more advanced analysis such as returning metrics that have maximum change over a period of time:

cmd:
  # Exclude the pintrace domain from the size variable
  size = size[!size.matchTag("domain", "^pintrace")]
  # Get the previous two weeks of data for size
  size = size.weekOverWeek()
  # Get the maximum percent diff
  max_volume_filter = size[when:today].total(Sum).rank() < 30
  top = size[max_volume_filter.broadcast(size)]
  today = top[when:today]
  one_week = top[when:one_week]
  two_weeks = top[when:two_weeks]
  past = when{one_week, two_weeks}.max('when').rolling(Max, 24h)
  pct_diff = today.pctDiff(past)

The size variable here is multiple series returned from the database with this query:

- size: max:stats.gauges.logsearch.monitoring.index.primary_size{index=*,domain=*}
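The core of the comparison, today's value against the larger of the values one and two weeks earlier, can be sketched in pandas as follows. The daily sample data is invented, a single series stands in for the many returned by the query, and the ranking and 24-hour rolling maximum steps are left out for brevity.

```python
import pandas as pd

# Hypothetical daily index sizes covering a little over two weeks.
index = pd.date_range("2023-02-01", periods=15, freq="D")
size = pd.Series([100.0] * 7 + [110.0] * 7 + [132.0], index=index)

# Values from one and two weeks earlier, aligned to the current timestamps.
one_week = size.shift(7)
two_weeks = size.shift(14)

# Take the larger of the two past values, then the percent difference.
past = pd.concat([one_week, two_weeks], axis=1).max(axis=1)
pct_diff = (size - past) / past * 100
print(pct_diff.tail(3).round(1))
```

On the final day the series is 20% above the larger of its two previous weekly values, which is the kind of jump this analysis is meant to surface.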

Alerting

TScript’s power comes from combining alerts with graphs, which are part of groups of graphs called dashboards. This visually exposes the user to all alerts. There is no separate section for alerts, as they are integrated into the main view.

For example, let’s say we want to alert if CPU usage is over 80% for the Statsboard cluster. We would add TScript in the crit section like:

crit: d > 80

Now, this will definitely alert us if the CPU goes over 80%, but this is very noisy. In this case we can add a debounce to indicate how long it needs to be in a given state:

crit: (d > 80).for(5*Minute)

The for function is like a latch and won’t get set until that amount of time has passed in a given state. This prevents rapid toggling of alert states because a new state must hold for the given amount of time before the alert switches, both when turning on and when turning off.
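One way to picture this latch is the sketch below. It is a hypothetical implementation written for this explanation, not TScript's code, and it assumes evenly spaced points so that the hold time can be expressed as a number of consecutive samples.

```python
import pandas as pd


def debounce(condition: pd.Series, hold: int) -> pd.Series:
    """Switch state only after `hold` consecutive points disagree with it."""
    state = False
    streak = 0
    out = []
    for raw in condition.astype(bool).tolist():
        if raw != state:
            streak += 1
            if streak >= hold:
                state = raw
                streak = 0
        else:
            streak = 0
        out.append(state)
    return pd.Series(out, index=condition.index)


index = pd.date_range("2023-02-25 02:50", periods=11, freq="min")
cpu = pd.Series([70, 85, 70, 85, 85, 85, 70, 85, 70, 70, 70], index=index)

# Roughly (d > 80).for(3*Minute) on one-minute data.
alerting = debounce(cpu > 80, hold=3)
print(alerting.tolist())
```

The single spike at the second point never triggers the alert, and the brief dip after it fires does not clear it; only a sustained change flips the state.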

Users can even extend this to only be alerted at certain times:

crit: (d > 80).for(5*Minute).during(BusinessHours)

TScript provides maximum flexibility in alerting because users can use a different TScript expression for alerts than for the graph. For example, let’s say the user wanted to smooth the time series. They can do that for alerting but still show the raw data in the graph.

crit: d.smooth(10*Minute) > 80

Alerting Levels

Since not every alert is critical, TScript provides two levels of alerting, crit and warn, each with its own notifications and alerting rules. The expressions defined as part of the cmd are available for use in the crit and warn sections to avoid duplicate calculations.

For example,

cmd: smoothed = d.smooth(10*Minute)
crit: smoothed > 80
warn: smoothed > 78
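In pandas terms, sharing one computed series between two alert levels might look like the sketch below. It assumes `smooth` behaves like a trailing rolling mean, which the post does not specify, and it uses a three-minute window with invented values to keep the example short.

```python
import pandas as pd

index = pd.date_range("2023-02-25 02:51", periods=5, freq="min")
d = pd.Series([76.0, 78.0, 80.0, 82.0, 84.0], index=index)

# cmd: computed once and reused by both alert levels.
smoothed = d.rolling("3min").mean()

# crit and warn are boolean series over the same smoothed data.
crit = smoothed > 80
warn = smoothed > 78
print(pd.DataFrame({"smoothed": smoothed, "warn": warn, "crit": crit}))
```

The warn level fires one point earlier than crit here, which gives the lower-severity notification a head start.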

Challenges

DataFrame Creation

One of the challenges we faced was converting data returned from the database into a format that TScript could operate on — a DataFrame. The time series data received from the database is a list of points where each point contains two values: a timestamp and a value. This can be part of a larger list of multiple different metrics.

Initially, we tried creating a DataFrame for each series and then merging them to create the final metric DataFrame. However, this approach proved to be inefficient, especially when dealing with unaligned data that contained empty values at different timestamps. The resulting DataFrames had different rows, making merging them a time-consuming process.

To address this issue, we decided to preallocate memory for the entire metric DataFrame using NumPy NaN values. This approach eliminated the need for creating individual DataFrames and merging them, resulting in significant performance improvements.
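The preallocation idea can be sketched roughly as follows. The function name, input shape, and sample values are assumptions made for illustration; the actual TScript conversion code is not shown in this post.

```python
import numpy as np
import pandas as pd


def build_frame(series_by_name):
    """Build one DataFrame from {name: [(epoch_seconds, value), ...]}."""
    # Collect every timestamp once and map each to a row position.
    timestamps = sorted(
        {ts for points in series_by_name.values() for ts, _ in points}
    )
    row_of = {ts: row for row, ts in enumerate(timestamps)}

    # Preallocate the full matrix with NaN, then fill in known points.
    values = np.full((len(timestamps), len(series_by_name)), np.nan)
    for col, points in enumerate(series_by_name.values()):
        for ts, value in points:
            values[row_of[ts], col] = value

    return pd.DataFrame(
        values,
        index=pd.to_datetime(timestamps, unit="s"),
        columns=list(series_by_name),
    )


frame = build_frame({
    "host_a": [(1677293460, 39.9), (1677293520, 38.7)],
    "host_b": [(1677293520, 41.2), (1677293580, 40.0)],
})
print(frame)
```

Timestamps that a series never reported simply stay NaN, so unaligned inputs need no merge step at all.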

Conclusion

TScript has been successful in allowing users to easily transform data after it’s written and now is used in over 30,000 expressions. The decoupled nature of TScript from storage has allowed us to swap different storage engines with no changes required by Pinterest engineers.

Future Work

TScript has proven to be a robust and powerful mechanism for users to interact with time series data. Some optimizations to TScript in the future include pushing more computation down to the Goku layer to minimize the amount of data that needs to be transferred. For example, by ranking at leaf nodes, a smaller subset of data would need to be processed in TScript.

Acknowledgements

A huge thanks to Zack Drach for creating the initial concept and implementation of the TScript language.

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. To explore life at Pinterest and apply to open roles, visit our Careers page.
