Warden: Real Time Anomaly Detection at Pinterest

Pinterest Engineering
Pinterest Engineering Blog
May 17, 2023


Isabel Tallam | Sw Eng, Real Time Analytics; Charles Wu | Sw Eng, Real Time Analytics; Kapil Bajaj | Eng Manager, Real Time Analytics


Detecting anomalous events has become increasingly important in recent years at Pinterest. Anomalous events, broadly defined, are rare occurrences that deviate from normal or expected behavior. Because these types of events can be found almost anywhere, opportunities and applications for anomaly detection are vast. At Pinterest, we have explored leveraging anomaly detection, specifically our Warden Anomaly Detection Platform, for several use cases (which we’ll get into in this post). With the positive results we are seeing, we plan to continue expanding our anomaly detection work and use cases.

In this blog post, we will walk through:

  1. The Warden Anomaly Detection Platform. We’ll detail the general architecture and design philosophy of the platform.
  2. Use Case #1: ML Model Drift. Recently, we have been adding functionality to our Warden anomaly detection platform to review ML model scores. This enables us to analyze any drift in the models.
  3. Use Case #2: Spam Detection. Detection and removal of spam and users who create spam is a priority in keeping our systems safe and providing a great experience for our users.

What is Warden?

Warden is the anomaly detection platform created at Pinterest. The key design principle for Warden is modularity — building the platform in a modular way so that we can easily make changes.

Why? Early on in our research, it quickly became clear that there were many approaches to detecting anomalies, depending on the type of data and how anomalies are defined for that data. Different approaches and algorithms would be needed to accommodate those differences. With this in mind, we created three different modules, modules that we are still using today (a minimal sketch of how they fit together follows the list):

  • Query input data: retrieves the data to be analyzed from the data source
  • Apply anomaly algorithm: analyzes the data and identifies any outliers
  • Notification: returns results or alerts for consuming systems to trigger next steps
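
As an illustration of this modularity, below is a minimal sketch of how the three modules could fit together as pluggable pieces. The names (QueryInput, AnomalyAlgorithm, Notifier, run_detection) are hypothetical and not Warden's actual code:

```python
from typing import List, Protocol


class QueryInput(Protocol):
    def fetch(self) -> List[float]:
        """Retrieve the data to be analyzed from a data source."""


class AnomalyAlgorithm(Protocol):
    def detect(self, data: List[float]) -> List[int]:
        """Return the indices of data points considered anomalous."""


class Notifier(Protocol):
    def notify(self, anomalies: List[int]) -> None:
        """Return results or alerts to consuming systems to trigger next steps."""


def run_detection(source: QueryInput, algorithm: AnomalyAlgorithm, notifier: Notifier) -> None:
    # Each stage is pluggable, so a new data source or algorithm can be
    # swapped in without touching the rest of the pipeline.
    data = source.fetch()
    anomalies = algorithm.detect(data)
    if anomalies:
        notifier.notify(anomalies)
```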

This modular approach has enabled us to easily adjust for new data types and plug in new algorithms when needed. In the sections below we will review two of our main use cases: ML Model Drift and Spam Detection.

Detecting Real Time ML Model Drift

The first use case is our ML Monitoring project. This section will provide details on why we initiated this project, which technologies and algorithms we used, and how we solved some of the roadblocks we experienced while implementing the changes.

Why Monitor Model Drift?

Pinterest, like many companies, uses machine learning in several areas and has seen much success with it. However, over time a model’s accuracy can decrease as outside factors change. The problem we were facing was how to detect these changes, which we refer to as drifts.

What is model drift, actually? Let’s assume Pinterest users (Pinners) are looking for clothing ideas. If the current season is winter, then coats and scarves may be trending, and the ML models would be recommending pins matching winter clothing. However, once the season starts getting warmer, Pinners will be more interested in lighter clothing for spring and summer. At this point, a model that is still recommending winter clothing is no longer accurate, as the user data is shifting. This is called model drift, and the ML team should take action, for example by updating features, to correct the model output.

Many of our teams using ML have tried their own approaches to implement changes or update models. However, we want to make sure that the teams can focus their efforts and resources on their actual goals and not spend too much time on figuring out how to identify drifts.

We decided to look into the problem from a holistic perspective, and invest in finding a single solution that we can provide with Warden.

Figure 1: Comparing raw model scores (top) and downsampled model scores (bottom) shows a slight drift of the model scores over time

As the first step to catching drift in model scores, we needed to identify how we wanted to look at the data. We identified three different approaches to analyzing the data:

  • Comparing current data with historical data — for example one week ago, one month ago, etc.
  • Comparing data between two different environments — for example, staging and production
  • Comparing current production data with predefined data that represents how the model is expected to perform

In our first version of the platform, we decided to take the first approach, comparing current data with historical data. We made this decision because this approach provides insight into how the model changes over time, signaling when re-training may be required.

Selecting the Right Algorithm

To identify a drift in model scores, we needed to make sure we selected the right algorithm, one that would allow us to easily identify any drift in the model. After researching different algorithms, we narrowed it down to Population Stability Index (PSI) and Kullback-Leibler Divergence/Jensen-Shannon Divergence (KLD/JSD). In our first version, we decided to implement PSI, as this algorithm has also proven successful in other use cases. In the future, we are planning to plug in other algorithms to expand our options.

The PSI algorithm splits up the input data and divides it into 10 buckets. A simple example is dividing a list of users by their ages. We assign each person to an age bucket, with a bucket for each 10-year age range: 0–10 years, 11–20 years, 21–30 years, etc. For each bucket, we calculate the percentage of the data that falls in that range. Then we compare each bucket of current data with the corresponding bucket of historical data. This results in a single score for each bucket comparison, and the sum of these scores is the overall PSI score. This can be used to determine how the age of the population has changed over time.

Figure 2: Input data split into 10 buckets, with the percentage of the distribution calculated for each bucket
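
To make the PSI computation concrete, here is a minimal sketch of the bucketing and scoring steps using the age example above. The bucket edges, sample values, and the epsilon guard against empty buckets are illustrative choices, not Warden's actual parameters:

```python
import math
from typing import List, Sequence


def bucket_percentages(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Fraction of values falling into each bucket defined by consecutive edges."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1] or (i == len(edges) - 2 and v == edges[-1]):
                counts[i] += 1
                break
    total = len(values) or 1
    return [c / total for c in counts]


def psi(expected_pct: Sequence[float], actual_pct: Sequence[float], eps: float = 1e-6) -> float:
    """Population Stability Index: sum over buckets of (actual - expected) * ln(actual / expected)."""
    score = 0.0
    for e, a in zip(expected_pct, actual_pct):
        e = max(e, eps)  # guard against empty buckets (log of zero)
        a = max(a, eps)
        score += (a - e) * math.log(a / e)
    return score


# Age example from the text: ten 10-year buckets covering 0-100 years.
edges = list(range(0, 101, 10))
historical_ages = [23, 35, 8, 47, 52, 31, 29, 64, 18, 41]
current_ages = [25, 36, 12, 45, 55, 33, 27, 61, 22, 78]
print(psi(bucket_percentages(historical_ages, edges), bucket_percentages(current_ages, edges)))
```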

In our current implementation, we calculate the PSI score by comparing historical model scores with current model scores. To do this, we first determine the bucket size depending on the input data. Then, we calculate the bucket percentages for each time frame, which are used to return the PSI score. The higher the PSI score, the more drift the model is experiencing during the selected period.

The calculation is repeated every few minutes with the input window sliding to provide a continuous PSI score showing clearly how the model scores are changing over time.
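
A rough sketch of that sliding-window loop follows, assuming a hypothetical fetch_scores(start, end) accessor for stored model scores and reusing the bucketing/PSI helpers from the previous sketch through a compute_psi(historical, current) callable. The window and step sizes are example values matching the configuration described later in this post:

```python
from datetime import timedelta

WINDOW = timedelta(hours=3)   # size of the historical and current windows
STEP = timedelta(minutes=5)   # how often the PSI score is recomputed


def psi_over_time(fetch_scores, compute_psi, start, end):
    """Yield (timestamp, psi_score) pairs as the window slides across [start, end]."""
    t = start + 2 * WINDOW  # need one full historical and one full current window first
    while t <= end:
        historical = fetch_scores(t - 2 * WINDOW, t - WINDOW)
        current = fetch_scores(t - WINDOW, t)
        yield t, compute_psi(historical, current)
        t += STEP
```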

Top image is “Input Data”, “Historical window” and “Current window” in the middle, and “PSI scores over time”.
Figure 3: Image showing the input data (top), windows for historical data and current data (middle) which are used for PSI score calculation (bottom).

Tuning the Algorithm

During the validation phase, we noticed that the size of the time window has a great impact on the usefulness of the PSI score. Choosing a window that is too small can result in very volatile PSI scores, potentially creating alerts for even small deviations. Choosing a period that is too large can potentially mask issues in model drift. In our case, we are seeing good results with a 3-hour window, and PSI calculation every 3–5 minutes. This configuration will be highly dependent on the volatility of the data and SLA requirements on drift detection.

Another thing we noticed in the calculated PSI scores was that some of the scores were higher than expected. This was especially true for model scores that do not deviate much from the expected range, where we would expect a PSI score of 0 or close to 0.

After a deeper investigation of the input data, we found that the calculated bucket size in these instances was extremely small. Because our logic calculates bucket sizes on the fly, this happened for model scores with a very narrow data range that also showed a few spikes in the data.

Figure 4: Model score which shows very little deviation from expected values of 0.05 to 0.10.

Logically, the PSI calculation is correct. However, in this particular use case, tiny variations of less than 0.1 are not concerning. To make the PSI scores more relevant, we implemented a configurable minimum size for buckets — a minimum of 0.1 for most cases. Results with this configuration are now more meaningful for the ML teams reviewing the data.

This configuration, however, will be highly dependent on each model and what percentage of change is considered a deviation from the norm. In some cases a deviation of 0.001 may be very substantial and will require much smaller bucket sizes.
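
A minimal sketch of how such a configurable minimum bucket size could be enforced when bucket edges are computed on the fly; the function and default values are illustrative, not Warden's implementation:

```python
from typing import List


def bucket_edges(lo: float, hi: float, n_buckets: int = 10, min_width: float = 0.1) -> List[float]:
    """Build equal-width bucket edges, enforcing a configurable minimum bucket width.

    The minimum width has to be tuned per model: for some models a deviation
    of 0.001 is already significant and requires much smaller buckets.
    """
    width = max((hi - lo) / n_buckets, min_width)
    return [lo + i * width for i in range(n_buckets + 1)]


# A model whose scores only range from 0.05 to 0.10 would otherwise get buckets
# of width 0.005; with min_width=0.1, the tiny variations fall into a single
# bucket and the PSI score stays near 0, as expected.
print(bucket_edges(0.05, 0.10))
```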

Figure 5: Left side — high PSI scores of 0.05 to 0.25 are seen with a small bucket size. Once minimum bucket size configuration was updated, the scores were much smaller with values of 0 to 0.03 as expected — right side.

Now that we have implemented the historical comparison and PSI score calculation on model scores, we are able to detect any changes in model scores early on in the process and in near-real time. This allows our engineers to be alerted quickly if any model drift occurs and take action before the changes result in a production issue.

Given this early success, we are now planning to increase our use of PSI scores. We will be implementing the evaluation of feature drift as well as looking into the remaining comparison options mentioned above.

Detecting Spam

Detecting spam is the second use case for Warden. In the following section, we will look into why we need spam detection and how we chose the Yahoo Extensible Generic Anomaly Detection System (EGADS) library for this project.

Why is Spam Detection So Important?

Before discussing spam detection, let’s focus on what we define as spam and why we want to investigate it. Pinterest is a global platform with a mission to give everyone the inspiration to create a life that they love. That means building a positive place that connects our global audience, over 450 million users, to personalized, actionable content — a place where they can find inspiration, plan and shop the world’s best ideas into reality.

One of our highest priorities, and a core value of Putting Pinners First, is to ensure a great experience for our users, whether they are finding their next weeknight meal inspiration, shopping for a loved one’s birthday, or just taking a wellness break. When they look for inspiration and instead find spam, this can be a big issue. Some malicious users create pins and link them to pages that are not related to the pin image. For a user who clicks on a delicious recipe image, landing on a very different page can be frustrating, so we want to make sure this does not happen.

Figure 6: A pin showing a chocolate cake on the left. After clicking on the pin the user sees a page not related to cake.

Removing spammy pins is one part of the solution, but how do we prevent this from happening again? We don’t just want to remove the symptom, which is the bad content, we want to remove the source of the issue and make sure we identify malicious users to stop them from continuing to create spam.

How Can We Identify Spam?

Detecting malicious users and spam is crucial for any business today, but it can be very difficult. Identifying newly created spam users can be especially tedious and time consuming. Behavior of spam users is not always clearly distinguishable. Spammer behavior and attempts also evolve over time to evade detection.

Before our Warden anomaly detection platform was available, identifying spam required our Trust and Safety team to manually run queries, review and evaluate the data, and then trigger interventions for any suspicious occurrences.

So how do we know when spam is being created? In most cases, malicious users don’t just create a single spam pin. To make money, they want to create a large number of spam pins at a time and widen their net. This helps us identify these users. Looking at pin creation, for example, we know to expect something like a sine wave in the number of pins created per day or week: users create pins during the day, and fewer pins are created at night. We also know that there may be some variations depending on the day of the week.

Figure 7: Sample curve for created pins over 7 days showing a near sine wave with some daily variations.

The overall graph reflecting the count of created pins shows a similar pattern that repeats on a daily and weekly basis. Identifying any spam or increased creation of pins would be very difficult as spam is still a small percentage compared to the full set of data.

To get a more fine-grained picture, we drilled down into further details and filtered by specific parameters. These parameters included filters like the internet service provider (ISP) used, country of origin, event types (creation of pins, etc.), and many other options. This allowed us to look at smaller and smaller datasets where spikes are clearer and more easily identifiable.

With the knowledge gained on how normal user data without spam should look, we moved forward and took a closer look at our anomaly detection options:

  1. Data is expected to follow a similar pattern over time
  2. We can filter the data to get better insights
  3. We want to know about any spikes in the data as potential spam

Implementation of the Spam Detection System

We started looking at several frameworks that are readily available and already support a lot of the functionality we were looking for. Comparing several of the options, we decided to go ahead with the Yahoo! EGADS framework [https://github.com/yahoo/egads].

This framework analyzes the data in two steps. The Tuning Process reads historical data and determines the data expected in the future. Detection is the second step, in which the actual data is compared to the expectation and any outliers exceeding a defined threshold are marked as anomalies.
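
As a conceptual illustration of that two-step flow (this is not the EGADS API itself), the sketch below "tunes" a simple expected value per hour of day from historical data and then "detects" points that deviate from it by more than a threshold. EGADS' actual time series and anomaly detection models are far more sophisticated:

```python
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def tune(history: Sequence[Tuple[int, float]], period: int = 24) -> Dict[int, float]:
    """Tuning step: learn an expected value for each slot in the period (e.g. hour of day)."""
    by_slot: Dict[int, List[float]] = {}
    for ts, value in history:
        by_slot.setdefault(ts % period, []).append(value)
    return {slot: mean(values) for slot, values in by_slot.items()}


def detect(actual: Sequence[Tuple[int, float]], expected: Dict[int, float],
           period: int = 24, threshold: float = 1000.0) -> List[Tuple[int, float]]:
    """Detection step: flag points deviating from the expectation by more than the threshold."""
    return [(ts, value) for ts, value in actual
            if abs(value - expected.get(ts % period, value)) > threshold]
```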

So, how are we using this library within our Warden anomaly detection platform? To detect anomalies, we need to pass through several phases.

In the first phase we provide all required configurations needed for the tasks. This includes details about the source of the input data, which anomaly detection algorithms to use, parameters to be used during the detection step, and finally how to handle the results.
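
A hypothetical example of what such a detection job configuration could look like; the field names and values are illustrative, not Warden's actual schema (the EGADS model names shown are taken from its documentation, but the choices here are only examples):

```python
# Illustrative configuration for one detection job: where the data comes from,
# which algorithm and parameters to use, and how to handle the results.
detection_job = {
    "source": {
        "type": "druid",                      # or "presto" for later use cases
        "datasource": "pin_creation_events",
        "granularity": "PT5M",
    },
    "algorithm": {
        "name": "egads",
        "time_series_model": "OlympicModel",      # depends on the type of input data
        "anomaly_detection_model": "KSigmaModel",  # depends on the outliers of interest
        "threshold": 3.0,
    },
    "notification": {
        "channels": ["email", "slack"],
        "recipients": ["trust-and-safety"],
    },
}
```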

Having the configuration in place, Warden begins by connecting to the data source and querying input data. With the modular approach, we are able to plug in different sources and add additional connectors whenever needed. Our first version of Warden concentrated on reading data from our Apache Druid cluster. As the data is real time data and already grouped by timestamps, this lends itself to anomaly detection very easily. For later projects, we have also added a Presto connector to support new use cases.

Once the data is queried from the data source, it is transformed into the required format for the Tuning/Detection phase. Feeding the data into the EGADS Time Series Modeling Module (TM) triggers the Tuning step which is followed by the Detection step using one or more Anomaly Detection Models (ADM) to identify any outliers.

Choosing the Time Series Module depends on the type of input data. Similarly, deciding which Anomaly Detection Model to use depends on the type of outliers we want to detect. If you are looking for more details on this and EGADS, please refer to the GitHub page.

After retrieving the results and identifying any suspicious outliers, we can continue to look further into the data. The initial step looks at broader filtering, like identifying any spikes per ISP, country of origin, etc. In further steps, we take the insights gained from the first step and filter on additional features. At this point, we can ignore any data sets that don’t show any concerns and concentrate on suspicious data to identify malicious users or confirm all actions are valid.
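
A simplified sketch of that drill-down, assuming a hypothetical list of aggregated event records and a detect_fn that flags suspicious slices; the broad pass filters by ISP and the finer pass adds country of origin:

```python
def drill_down(events, detect_fn):
    """Return (isp, country, slice) tuples where both the broad and fine slices look anomalous.

    `events` is assumed to be a list of dicts with "isp" and "country" keys plus
    whatever fields detect_fn needs; both names are illustrative.
    """
    suspicious = []
    for isp in {e["isp"] for e in events}:
        broad_slice = [e for e in events if e["isp"] == isp]
        if detect_fn(broad_slice):                      # spike at the ISP level?
            for country in {e["country"] for e in broad_slice}:
                fine_slice = [e for e in broad_slice if e["country"] == country]
                if detect_fn(fine_slice):               # narrow it down further
                    suspicious.append((isp, country, fine_slice))
    return suspicious
```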

Figure 8: Analyzing pin creation data by base filters allows identifying outliers, and drilling deeper brings anomalies to light

Once we have gathered enough details on the data, we continue with our last phase, which is the notification phase. At this stage, we notify any subscribers of potential anomalies. Details are provided via email, Slack, and other avenues to inform our Trust and Safety team to take action to deactivate users, block users, etc.

With the use of the Warden anomaly detection platform, we have been able to improve Pinterest’s spam detection efforts, significantly impacting the number of malicious users identified and how quickly we are able to detect them. This has been a great improvement compared to manual investigations.

Our Trust & Safety teams have appreciated the use of Warden and are planning to increase their use cases.

“One of the most important things we need for identifying spammers is to correctly segment features and time periods before we do any clustering or measurement. Warden enabled us to get alerted early and find the most important segment to run our algorithms on.” — Trust & Safety Team

Future

Being able to detect anomalies with Warden has enabled us to support our Trust and Safety team and allows us to detect drift in our ML models very quickly. This has improved the user experience and supported our engineering teams. The teams are continuing to evaluate spam and spam patterns, allowing us to evolve the detection and broaden the underlying data.

In the future, we are planning to increase the use of anomaly detection to get alerted early about any changes in the Pinterest system before actual issues happen. Another use case we are planning to include in our platform is root cause analysis. This will be applied to current and historical data, enabling our teams to reduce the time spent pinpointing the causes of issues and concentrate on quickly addressing them.

Acknowledgements

Many thanks to our partner teams and their engineers (Cathy Yang | Trust & Safety; Howard Nguyen | MLS; Li Tang | MLS) who have been working with us on accomplishing these projects and for all their support!

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. To explore life at Pinterest, visit our Careers page.
