Representation online matters: practical end-to-end diversification in search and recommender systems

Pinterest Engineering
Pinterest Engineering Blog
May 25, 2023


Bhawna Juneja | Senior Machine Learning Engineer; Pedro Silva | Senior Machine Learning Engineer; Shloka Desai | Machine Learning Engineer II; Ashudeep Singh | Machine Learning Engineer II; Nadia Fawaz | (former) Inclusive AI Tech Lead

Different women in white button down shirts and blue jeans

Introduction

Pinterest is a platform designed to bring everyone the inspiration to create a life they love. This is not only our company’s core mission but something that has become increasingly important in today’s interconnected world. As technology becomes increasingly integrated into the daily lives of billions of people globally, it is crucial for online platforms to reflect the diverse communities they serve. Improving representation online can facilitate content discovery for a more diverse user base by reflecting their inclusion on the platform. This, in turn, demonstrates the platform’s ability to meet their needs and preferences. In addition to improved user experience and satisfaction, this can have a positive business impact through increased engagement, retention, and trust in the platform.

In this post, we show how we improved diversification on Pinterest for three different surfaces: Search, Related Products, and New User Homefeed. Specifically, we have developed and deployed scalable diversification mechanisms that utilize a visual skin tone signal to support representation of a wide range of skin tones in recommendations, as shown in Figure 1 for fashion recommendations in the Related Products surface.

Difference in recommendations of women in white button downs and blue pants
Fig. 1. Side-by-side Related Products recommendations for the query Pin “Shirt Tail Button Down” shown on the left. Top right: previous experience without diversification. Lower right: current diversified experience.

The end-to-end diversification process consists of several components. First, requests that will trigger diversification need to be detected across different categories and locales. Second, the diversification mechanism must ensure that diverse content is retrieved from the large content corpus. Finally, the diversity-aware ranking stage needs to balance the diversity-utility trade-off when ranking content and to accommodate diversification across several dimensions, such as the skin tone visible in the image as well as the user’s various interests. Multi-stage diversification allows the mechanism to operate throughout the pipeline, from retrieval to ranking, to ensure that diverse content passes through all the stages of a recommender system, from billions of items to a small set that is surfaced in the application.

Multi-stage diversification in Search and recommender systems

Background

Advanced search and recommender systems, which operate at a scale of hundreds of millions of active users and billions of items, tend to be very complex and have multiple components. These systems often comprise two major stages, retrieval and ranking, sometimes followed by additional business logic: items are retrieved and ranked, then the list is surfaced to the user.

Item Corpus arrow 10⁶-10¹⁰ to Candidate Retrieval arrow 10²-10³ to Ranking arrow 10–100 to User interface/surface
Fig. 2. Large-scale recommender systems can broadly be categorized into two stages going from an item corpus to recommendations: retrieval and ranking.
  • Retrieval: The retrieval stage consists of one or more candidate generators that narrow down the set of candidates from a large corpus of items (in the range of 10⁶ to 10¹⁰) to a much narrower set (in the range of 10² to 10⁴) based on some predicted scores, such as the relevance of the items to the query and the user.
  • Ranking: In the ranking stage, the goal is to find an ordering of the candidates that maximizes a combination of objectives, which may include utility metrics, diversity objectives, and additional business goals. This is usually achieved via one or more machine learning (ML) models that generate one or more scores for each item. These scores are then combined (e.g. using a weighted sum) to generate a ranked list.

Diversity in recommendations

Diversity Dimension: Diversification aims to ensure that the ranked list of items surfaced by the system is diverse with respect to a relevant diversity dimension, which could include explicit dimensions such as demographics (e.g., age, gender), geographic or cultural attributes (e.g., country, language), domain-specific dimensions (e.g., skin tone ranges in beauty, cuisine type in food), business-specific dimensions (e.g., merchant sizes), and also other implicit dimensions that may not be expressed directly but can be modeled using latent representations (e.g., embedding, clustering). While in this work we present an example of skin tone diversification, the proposed techniques are not limited to this single dimension and can support diversification more broadly, including intersectionality of multiple diversity dimensions. We denote the set of groups under a diversity dimension as D, and each individual group is denoted by 𝑑.

Diversity Metric: We define the top-𝑘 diversity of a ranking system as the fraction of queries for which all groups under the diversity dimension are represented in the top 𝑘 ranked results for which the diversity dimension is defined. For instance, in the case of skin tone ranges, an item whose image does not include any skin tone would not contribute to visual skin tone diversity. Thus, it is not counted toward the top 𝑘 and is skipped in the diversity metric computation.
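To make the metric definition concrete, here is a minimal sketch in Python. The function names, the use of `None` for items without a defined group, and the toy data are assumptions made for illustration; they are not taken from the production system.

```python
from typing import List, Optional, Sequence, Set


def all_groups_in_top_k(
    groups: Sequence[Optional[str]], all_groups: Set[str], k: int
) -> bool:
    """True if every group appears among the first k results that have a group.

    `groups` holds the group of each ranked result, best first. A value of
    None marks an item for which the dimension is undefined (for example, no
    visible skin tone); such items are skipped and do not use a top-k slot.
    """
    seen = set()
    counted = 0
    for group in groups:
        if counted >= k:
            break
        if group is None:
            continue
        seen.add(group)
        counted += 1
    return seen >= all_groups


def diversity_at_k(
    ranked_lists: List[Sequence[Optional[str]]], all_groups: Set[str], k: int
) -> float:
    """Fraction of queries whose top-k (defined) results cover all groups."""
    if not ranked_lists:
        return 0.0
    hits = sum(all_groups_in_top_k(r, all_groups, k) for r in ranked_lists)
    return hits / len(ranked_lists)


# Toy example with three hypothetical groups and two queries.
GROUPS = {"a", "b", "c"}
query_1 = ["a", None, "b", "c", "a"]  # top 3 defined results: a, b, c
query_2 = ["a", "a", None, "b", "c"]  # top 3 defined results: a, a, b
print(diversity_at_k([query_1, query_2], GROUPS, k=3))
```

In this toy example only the first query covers all three groups within its first three defined results, so the metric is one half.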

Multi-stage diversification: Both retrieval and ranking stages directly impact the diversity of the final content surfaced in the application. The diversity metric at the output of the retrieval stage upper-bounds the diversity at the output of ranking. Hence, the retrieval layer needs to generate a sufficiently diverse set of candidates to ensure that the ranking stage has enough candidates in each group to generate a diverse final ranking. However, diversity at the retrieval stage is not a sufficient condition to guarantee that a utility-focused ranker will surface a diverse ordering at the top of the ranking, where users are more likely to focus their attention and to interact with items. Hence, both the retrieval stage and the ranker need to be diversity-aware.

Triggering logic: A real-world system may receive requests that span a wide range of categories, such as fashion, beauty, home decor, food, travel, etc. The diversity dimension of interest depends on the application. For example, skin tone range diversification is applicable to fashion and beauty, but not to home decor. Thus, upon receiving a request, the system needs to determine whether to trigger diversification according to the dimension of interest. The triggering logic needs to account for the diversity dimension, the application, the production surface, and the local context, such as country and language, and can be based on heuristics or ML models, such as models that predict the category of a query. Based on these factors, along with user research and data analysis on skin tone related Search query modifiers that highlight a need for diversity in similar requests, we decided to trigger skin tone diversification only for beauty and fashion categories in Search, Related Products, and New User Homefeed.

Diversification at ranking

We start with a focus on the ranking stage to achieve diversification of results since it is the last stage of a recommender system. Instead of using boosters or discounting scores, which tend to add significant tech debt in the long term, we leverage a diversity-aware ranking stage that takes as input a list of items with utility scores and their diversity dimensions and produces a ranking according to a combination of both objectives. The first approach we used is a class of simple greedy rerankers, e.g. Round Robin (RR). Given an ordered list of items 𝑦₁, …, 𝑦ₙ, we construct |D| ordered sub-lists, one per skin tone range, each containing the items that have a utility score above a threshold. Then, we re-build a ranked list by greedily selecting the top item of each sub-list in turn. All the candidates that do not belong to a sub-list, for instance because they do not have a skin tone range or have utility scores below the threshold, can be left at the same position as in the original list or assigned to a random sub-list.
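The Round Robin procedure described above can be sketched as follows. This is an illustrative sketch rather than the production reranker: the `Item` type, the choice to cycle through groups in order of their best-scoring item, and the choice to leave ineligible items at their original positions (one of the two options mentioned above) are assumptions.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List, Optional


@dataclass(frozen=True)
class Item:
    item_id: str
    utility: float
    group: Optional[str]  # None if the item has no skin tone range


def round_robin_rerank(items: List[Item], threshold: float) -> List[Item]:
    """Rerank `items` (sorted by decreasing utility) with Round Robin.

    Items with a group and a utility above `threshold` are split into one
    sub-list per group and interleaved. All other items stay where they were.
    """
    sublists: Dict[str, Deque[Item]] = {}  # insertion order: best item first
    eligible_positions: List[int] = []
    for position, item in enumerate(items):
        if item.group is not None and item.utility > threshold:
            sublists.setdefault(item.group, deque()).append(item)
            eligible_positions.append(position)

    # Take the top item of each non-empty sub-list in turn.
    interleaved: List[Item] = []
    queues = list(sublists.values())
    while queues:
        for queue in queues:
            interleaved.append(queue.popleft())
        queues = [queue for queue in queues if queue]

    # Write the interleaved items back into the eligible slots only.
    result = list(items)
    for position, item in zip(eligible_positions, interleaved):
        result[position] = item
    return result


ranked = [
    Item("A1", 0.9, "a"), Item("A2", 0.8, "a"), Item("X", 0.7, None),
    Item("A3", 0.6, "a"), Item("B1", 0.5, "b"), Item("C1", 0.4, "c"),
    Item("B2", 0.1, "b"),
]
print([item.item_id for item in round_robin_rerank(ranked, threshold=0.3)])
```

In this toy input, the item without a group and the item below the threshold keep their positions, while the remaining slots alternate between groups for as long as each group still has items.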

Fig. 3. Illustrative examples of Round Robin and DPP applied to a utility-ranked list. Each block is an item in the ranked list and the color denotes the skin tone range of the item image. (a) Ranked list obtained after applying Round Robin is re-ordered such that the distribution of skin tones is more uniform in the top positions. (b) Ranked list obtained after DPP (for a specific value of 𝜃) allows for optimizing a list-wise objective to trade off utility and diversity of the initial ranked list.

RR is a simple, intuitive, and efficient approach to diversification; however, it does not always balance diversity and utility. In addition, it does not easily generalize to multiple diversity dimensions or multiple utility score thresholds. To avoid these limitations, we propose a multi-objective optimization framework based on the Determinantal Point Process (DPP). A DPP is a machine-learnable probabilistic model used in physics for repulsion modeling and, more recently, in recommender systems. DPPs are particularly useful in ML for tasks such as subset selection, where the goal is to select a subset of points from a larger set that are diverse or representative in some sense. The basic idea behind a DPP is to model the probability of selecting a set of items 𝑌 from a set of size 𝑁 as proportional to the determinant of a kernel matrix 𝐿ᵧ, where 𝐿 is a kernel that encodes the utility of the items and the similarity between pairs of items, and 𝐿ᵧ is the kernel matrix restricted to the subset 𝑌. The determinant of 𝐿ᵧ can be thought of as a measure of how spread out the points in 𝑌 are in the feature space defined by the kernel 𝐿. The diagonal entry 𝐿ᵢᵢ represents the utility of the 𝑖ᵗʰ item, in our case the score with which the items were originally ranked. The off-diagonal entry 𝐿ᵢⱼ, in contrast, represents the similarity between items 𝑖 and 𝑗, which in our case depends on the diversity dimension (e.g. the skin tone range in the item image). The kernel is chosen such that 𝐿 is a positive semi-definite (PSD) kernel matrix and has a Cholesky decomposition, and hence 𝐿 can be written as:

𝐿 = 𝑈ΦΦᵀ𝑈ᵀ = 𝑈𝑆𝑈ᵀ

where 𝑈 = diag(𝑒^(𝜃𝑢₁), …, 𝑒^(𝜃𝑢_𝑁)) is a diagonal matrix that encodes the utility 𝑢ᵢ of each item, 𝜃 is a parameter that governs the trade-off between utility and diversity, and Φ = [Φ₁, Φ₂, Φ₃, …, Φ_𝑁], where Φᵢ is the feature vector for the 𝑖ᵗʰ item.

For our use case, ΦΦᵀ is the symmetric similarity matrix, which we henceforth denote by 𝑆. Finally, given a value of 𝜃 and kernel matrix 𝐿, the goal is to find a subset Y that maximizes the determinant of 𝐿ᵧ:

𝑌 = argmax_{𝑌 ⊆ {𝑦₁, …, 𝑦_𝑁}} det(𝐿ᵧ)

The use of the determinant means that, based on the choice of kernel matrix, 𝑌 would include items with high utility scores while avoiding ones that are similar to others in the subset. Finding such a subset 𝑌 of a given size 𝑘 is an NP-hard problem. However, because log det(𝐿ᵧ) is a submodular function of 𝑌, the solution can be efficiently approximated using a greedy algorithm.
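The kernel construction and greedy selection can be illustrated with a small NumPy sketch. It is a naive version that recomputes a determinant for every candidate at every step, and it is not the production implementation. The similarity matrix used here (1 on the diagonal, a constant `same_group_sim` for items in the same group, 0 otherwise) is an assumption chosen so that 𝑆 stays positive definite; the function names are hypothetical.

```python
from typing import List, Sequence

import numpy as np


def build_kernel(
    utilities: Sequence[float],
    groups: Sequence[str],
    theta: float,
    same_group_sim: float = 0.9,
) -> np.ndarray:
    """Build L = U S U with U = diag(exp(theta * u_i)).

    Assumed similarity: S = (1 - s) * I + s * [group_i == group_j],
    which is positive definite for 0 <= s < 1.
    """
    u = np.asarray(utilities, dtype=float)
    g = np.asarray(groups)
    same_group = (g[:, None] == g[None, :]).astype(float)
    S = (1.0 - same_group_sim) * np.eye(len(u)) + same_group_sim * same_group
    U = np.diag(np.exp(theta * u))
    return U @ S @ U


def greedy_dpp(L: np.ndarray, k: int) -> List[int]:
    """Greedily add the item that maximizes log det of the selected submatrix."""
    selected: List[int] = []
    remaining = list(range(L.shape[0]))
    for _ in range(min(k, len(remaining))):
        best_item, best_logdet = None, -np.inf
        for i in remaining:
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best_item, best_logdet = i, logdet
        if best_item is None:  # every remaining choice gives a singular kernel
            break
        selected.append(best_item)
        remaining.remove(best_item)
    return selected


utilities = [0.9, 0.8, 0.7, 0.6]
groups = ["a", "a", "b", "c"]
print(greedy_dpp(build_kernel(utilities, groups, theta=1.0), k=3))
print(greedy_dpp(build_kernel(utilities, groups, theta=10.0), k=3))
```

With the small value of 𝜃 the selection skips the second item of group "a" in favor of items from the other groups, while with the large value utility dominates and the selection follows the original order. This is the trade-off that 𝜃 controls.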

Figure 3(a) shows an example where RR is used to diversify a ranked list of items with respect to four groups {𝑑₁, 𝑑₂, 𝑑₃, 𝑑₄}. Figure 3(b) shows a hypothetical example of how DPP would rerank as compared to RR given an appropriate value of parameter 𝜃.

In comparison to RR, DPP takes into account both the utility scores and similarity and is able to balance their trade-off. For multiple diversity dimensions, DPP can be operationalized with a joint similarity matrix 𝑆ᵧ to account for the intersectionality between different dimensions. This can be further extended to a function where, for each item, all diversity dimensions (skin tone, item categories, etc.) are provided and the return is a combined value that represents the joint dimensions. A simpler option is to add a diversity term in the weighted sum shown in equation 4 for each dimension. In the case of a large number of diversity dimensions, dimensionality reduction techniques can be used.

Diversification at retrieval

Diversifying during the ranking stage can be challenging due to the limited availability of candidates from all groups in the retrieved set. The techniques proposed above such as RR and DPP are limited to the set of candidates retrieved by different sources in the first stage. Therefore, it may not always be possible to diversify the ranking stage for specific queries. To overcome this limitation, we have developed three techniques to increase the diversity of candidates at the retrieval layer. These techniques improve the ability of rerankers to diversify at a later stage and are suitable for different setups.

Overfetch-and-Rerank at retrieval: To increase candidate set diversity, the Overfetch method fetches a larger set of candidates, which can be defined to contain a minimum number of candidates from each skin tone range. For example, if a candidate set of size K is desired, the neighborhood size can be expanded to K′ (K′ > K) to meet the diversity criterion. To limit latency, a hyperparameter Kₘₐₓ is chosen so that the overfetched set never exceeds Kₘₐₓ candidates. Overfetching stops when the minimum threshold for each skin tone range is met or Kₘₐₓ is reached. The Rerank method then selects a subset of size K from the overfetched set by performing a Round Robin selection of one candidate at a time from each skin tone range until K items are selected.
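A minimal sketch of Overfetch-and-Rerank is shown below. The `fetch` callable stands in for a real retrieval call and is hypothetical, as are the doubling schedule used to grow the fetch size and the choice to drop candidates without a known group. The sketch assumes K does not exceed Kₘₐₓ.

```python
from collections import Counter, deque
from typing import Callable, Deque, Dict, List, Sequence, Tuple

Candidate = Tuple[str, str]  # (item_id, group), most relevant first


def overfetch(
    fetch: Callable[[int], List[Candidate]],
    k: int,
    k_max: int,
    groups: Sequence[str],
    min_per_group: int,
) -> List[Candidate]:
    """Grow the fetch size until every group has enough candidates or k_max is hit."""
    size = k
    while True:
        candidates = fetch(size)
        counts = Counter(group for _, group in candidates)
        enough = all(counts[g] >= min_per_group for g in groups)
        if enough or size >= k_max:
            return candidates
        size = min(2 * size, k_max)  # assumed growth schedule


def round_robin_select(
    candidates: List[Candidate], groups: Sequence[str], k: int
) -> List[Candidate]:
    """Pick one candidate per group in turn until k candidates are selected."""
    queues: Dict[str, Deque[Candidate]] = {g: deque() for g in groups}
    for candidate in candidates:
        if candidate[1] in queues:  # candidates with unknown group are dropped here
            queues[candidate[1]].append(candidate)
    selected: List[Candidate] = []
    while len(selected) < k and any(queues.values()):
        for g in groups:
            if queues[g] and len(selected) < k:
                selected.append(queues[g].popleft())
    return selected


# Toy corpus, already ordered by relevance to some query.
CORPUS = [("p1", "a"), ("p2", "a"), ("p3", "a"), ("p4", "b"),
          ("p5", "a"), ("p6", "c"), ("p7", "b"), ("p8", "c")]


def toy_fetch(n: int) -> List[Candidate]:
    return CORPUS[:n]


pool = overfetch(toy_fetch, k=3, k_max=8, groups=["a", "b", "c"], min_per_group=1)
print([item_id for item_id, _ in round_robin_select(pool, ["a", "b", "c"], k=3)])
```

In this toy run the first fetch of three candidates contains a single group, so the fetch size grows until the other two groups appear, and the final selection then takes one candidate from each group.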

Segments to Leaf to Root — Aggregate + Bucketize
Fig. 4. A diagram of distributed ANN retrieval aggregating candidates from segments to leaves to the root based on the distance metric while assigning top Pins with each skin tone to their corresponding buckets.

Bucketized ANN retrieval: Approximate nearest neighbor (ANN) search is a widely used retrieval method in embedding-based search indexes. In such systems, users, items, and queries are embedded into the same space, and the system retrieves the items closest to the query or user embedding based on a chosen distance metric. Since computing pairwise distances for all query-item pairs is not feasible, approximation algorithms like k-Dimensional Tree, Locality-sensitive Hashing (LSH), and Hierarchical Navigable Small Worlds (HNSW) are used to perform nearest neighbor search efficiently. In large-scale recommender systems, these methods are implemented as a distributed system. The general architecture of an ANN search system contains a root node that sends a request to a few (𝐿) leaf nodes, each of which further requests several (𝑀) segments to perform a nearest neighbor search in different subregions of the embedding space. To find 𝐾 nearest neighbors for a given query embedding, each segment returns 𝐾 potential candidates to the corresponding leaf, which then aggregates these 𝑀 × 𝐾 candidates and returns only the top 𝐾 to the root. The root selects the top 𝐾 candidates out of the 𝐾 × 𝐿 × 𝑀 candidates whose distances are computed during the process. In the bucketization approach, the aggregation step is modified so that, in addition to selecting the top-𝐾 candidates, it collects the top 𝐾_𝑑ᵢ candidates for each skin tone range 𝑑ᵢ into a dedicated bucket. This helps preserve top candidates belonging to each skin tone range without expanding the entire aggregation graph.
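The modified aggregation step can be sketched as a single function that is applied at every level (segment to leaf, leaf to root). This is a simplified, single-process illustration; the `Neighbor` and `Aggregate` types, the function name, and the toy distances are assumptions, not the actual distributed implementation.

```python
import heapq
from typing import Dict, Iterable, List, NamedTuple, Optional


class Neighbor(NamedTuple):
    distance: float
    item_id: str
    group: Optional[str]  # skin tone range, or None if undefined


class Aggregate(NamedTuple):
    top_k: List[Neighbor]
    buckets: Dict[str, List[Neighbor]]


def aggregate(
    children: Iterable[Aggregate], k: int, k_per_group: Dict[str, int]
) -> Aggregate:
    """Merge child results, keeping the global top-k plus a top-k_d bucket per group."""
    pool: Dict[str, Neighbor] = {}  # de-duplicate by item id
    for child in children:
        for neighbor in child.top_k:
            pool[neighbor.item_id] = neighbor
        for bucket in child.buckets.values():
            for neighbor in bucket:
                pool[neighbor.item_id] = neighbor
    merged = list(pool.values())

    def by_distance(neighbor: Neighbor) -> float:
        return neighbor.distance

    top_k = heapq.nsmallest(k, merged, key=by_distance)
    buckets = {
        group: heapq.nsmallest(
            k_d, [n for n in merged if n.group == group], key=by_distance
        )
        for group, k_d in k_per_group.items()
    }
    return Aggregate(top_k, buckets)


def segment_result(
    neighbors: List[Neighbor], k: int, k_per_group: Dict[str, int]
) -> Aggregate:
    """Wrap raw segment neighbors so that the same aggregation applies."""
    return aggregate([Aggregate(neighbors, {})], k, k_per_group)


K, K_PER_GROUP = 2, {"a": 1, "b": 1, "c": 1}
seg_1 = segment_result(
    [Neighbor(0.10, "p1", "a"), Neighbor(0.20, "p2", "a"), Neighbor(0.90, "p3", "b")],
    K, K_PER_GROUP,
)
seg_2 = segment_result(
    [Neighbor(0.15, "p4", "a"), Neighbor(0.30, "p5", "a"), Neighbor(0.80, "p6", "c")],
    K, K_PER_GROUP,
)
leaf = aggregate([seg_1, seg_2], K, K_PER_GROUP)
print([n.item_id for n in leaf.top_k])
print({g: [n.item_id for n in b] for g, b in leaf.buckets.items()})
```

In this toy example the plain top-𝐾 at the leaf contains only items from group "a", but the buckets still carry the closest item of groups "b" and "c" upward, so a later diversity-aware stage has candidates from every group to work with.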

Strong-OR: term 1 (> 3) → 1, 6, 9; term 2 → 1, 2, 3, 4, 5, 7, 8, 9, 10
Fig. 5. An example of how the Strong-OR operator ensures diversity during retrieval for two query terms, one with a minimum threshold condition.

Strong-OR retrieval: In the Search process, the retrieval stage involves converting text queries to structured queries using logical operators like AND, OR, and XOR to narrow or broaden the set of results. To increase the diversity of results, a specialized logical operator called Strong-OR is used. Strong-OR prioritizes a set of candidates that satisfy multiple criteria simultaneously, allowing us to specify what percentage of candidates should match each criterion. Strong-OR scans a limited number of items and retrieves candidates that meet the specified criteria. If there are insufficient items to fulfill the criteria, it matches as many as possible. Strong-OR acts as a regular OR at first, but promotes a criterion to be a necessary condition during scanning to retrieve more relevant results. Candidates that satisfy the criteria and would not have been retrieved otherwise can be added to dedicated buckets to ensure they are not dropped in the later stages of retrieval.
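The behavior described above can be approximated with a toy model. The sketch below is a guess at the semantics for illustration only and is not the actual Strong-OR operator: the function name, its parameters, the rule used to promote a criterion (a criterion becomes necessary once the free result slots no longer exceed its unmet minimum), and the backfill step are all assumptions. The document ids are loosely modeled on the two-term example in Fig. 5.

```python
from typing import Dict, Iterable, List, Set, Tuple


def strong_or(
    docs: Iterable[Tuple[int, Set[str]]],
    terms: Set[str],
    min_matches: Dict[str, int],
    limit: int,
    scan_limit: int,
) -> List[int]:
    """Toy Strong-OR: a regular OR that reserves result slots for minimum quotas.

    docs: (doc_id, terms present in the doc) in scan order.
    min_matches: minimum number of results that should match each listed term.
    """
    deficit = dict(min_matches)  # unmet minimum per term
    results: List[int] = []
    skipped: List[int] = []
    for scanned, (doc_id, doc_terms) in enumerate(docs):
        if scanned >= scan_limit or len(results) >= limit:
            break
        matched = doc_terms & terms
        if not matched:
            continue  # fails even the plain OR
        needed = {t for t in matched if deficit.get(t, 0) > 0}
        slots_left = limit - len(results)
        outstanding = sum(deficit.values())
        if needed or slots_left > outstanding:
            results.append(doc_id)
            for t in needed:
                deficit[t] -= 1
        else:
            # Remaining slots are reserved for docs matching an unmet criterion.
            skipped.append(doc_id)
    # If the quotas could not be met within the scan, fill up with skipped docs.
    results.extend(skipped[: limit - len(results)])
    return results


TERM_1_DOCS = {1, 6, 9}
TERM_2_DOCS = {1, 2, 3, 4, 5, 7, 8, 9, 10}
DOCS = [
    (d, {t for t, ids in (("term1", TERM_1_DOCS), ("term2", TERM_2_DOCS)) if d in ids})
    for d in range(1, 11)
]
TERMS = {"term1", "term2"}

print(strong_or(DOCS, TERMS, {}, limit=5, scan_limit=10))            # plain OR
print(strong_or(DOCS, TERMS, {"term1": 3}, limit=5, scan_limit=10))  # quota on term1
print(strong_or(DOCS, TERMS, {"term1": 3}, limit=5, scan_limit=5))   # short scan
```

Without a quota the function behaves like a plain OR and returns the first five matching documents. With a minimum of three matches for `term1`, two slots are held back until documents 6 and 9 are reached. With a short scan the quota cannot be met, so the remaining slots are backfilled with documents that were skipped earlier.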

Productionization considerations for a large-scale recommender system

We deployed diversification approaches on three different surfaces on Pinterest based on user feedback to diversify specific experiences, namely Search, New User Homefeed, and Related Products. These surfaces were consciously chosen keeping in mind user research and data analysis of user needs. In this section we present several practical considerations for deploying diversification approaches in a real-world production system. First, deploying diversification algorithms at retrieval requires indexing the diversity dimension of Pins (e.g. the Pin skin tone range) in both embedding-based and token-based indices. Details about our approach can be found in the paper. Second, diversification affects latency and scaling. RR had a minimal impact on latency due to its linear time complexity, but it was hard to scale to multiple dimensions. For DPP, we reduced the impact on latency through various techniques (for example tuning the batch size, window size, and depth size), all of which can be optimized and evaluated through offline replay, shadow testing, or A/B experiments for each surface. Additional techniques to reduce the impact on latency for DPP can be found in the paper. Third, to evaluate the diversification of results using skin tone, we collected qualitative feedback from a diverse set of internal participants for every iteration, in addition to relevance evaluations through professional data labeling. To account for the local context in international markets, we collaborated closely with the internationalization team for a qualitative assessment of diversification and its results.

Results in production

To improve skin tone representation, we launched skin tone diversification in Search, Related Products, and New User Homefeed. For Search, diversification was introduced for queries in the beauty and fashion categories. For Related Products, it was added for fashion and wedding requests, and in New User Homefeed it was added as part of the new user experience. There are several nuances that must be taken into consideration when measuring the success and implications of these approaches in search and recommender systems. First, appropriate metrics and guardrails must be set in place before performing diversification. Second, while some of the learnings are transferable between surfaces, each surface presents unique challenges and may differ drastically from past use cases. We often observed positive gains in diversity metrics coupled with neutral or positive impact in guardrail business metrics for all the techniques described above. All metrics reported here are the result of several A/B experiments we ran in production for at least three weeks, and Table 1 gives a brief overview of their impact.

In the rest of this section, we give a brief overview of the impact of these techniques on user engagement metrics and the diversity metric (DIV@k(R)) (we provide more details on the choice of k in the paper). We report the impact to these metrics as the percentage difference relative to control.

Surface | Skin tone diversification technique | Percentage improvement in skin tone diversity
Search | RR with score threshold | 250%*
Search | DPP | Parity*
Search | Strong-OR in retrieval + DPP | 14%*
Related Products | RR | 270%**
Related Products | DPP | Small decrease*
Related Products | Bucketized ANN Retrieval | 1%** (8% increase at the retrieval stage)
New User Homefeed | Two-dimensional RR using priority queues with Pin category and skin tone | 109%**
New User Homefeed | Single skin tone based RR | 650%**
New User Homefeed | Overfetch-and-Rerank during retrieval | 63%**
New User Homefeed | DPP for reranking | 462%**
Table 1: In this table we summarize the results of several A/B experiments in production that improve skin tone diversification in Search, Related Products and New User Homefeed either post-ranking or at retrieval. * indicates positive impact to engagement metrics, and ** indicates neutral impact to engagement metrics. More details can be found in the paper.
Pins of painted pink nails going from less diverse to more diverse
Fig. 6. For the query “pink nails matte” on Search, (a) shows search results without any diversity, (b) shows diversified search results using RR with a score threshold, and (c) shows the diversified ranking for the same query using DPP.

Conclusion

We tackled the challenge of diversification to improve representation in Search and recommender systems using scalable diversification approaches at ranking and retrieval. We deployed multi-stage diversification on several Pinterest surfaces and, through extensive empirical evidence, showed that it is possible to create an inclusive product experience that positively impacts business metrics such as engagement. Our techniques are scalable for multiple simultaneous diversity dimensions and can support intersectionality. While these approaches were successful, we aim to keep improving upon them. Future work includes but is not limited to:

  • Developing more advanced and scalable triggering mechanisms for diversification
  • Automating the adjustment of the multi-objective optimization weights that balance different objectives
  • Testing some recent developments in debiasing word embeddings and fair representation learning for retrieval diversification
  • Analyzing how diversified search results and recommendations can help mitigate serving bias in systems that generate their own training data

Skin tone diversification aims at improving representation by surfacing all skin tone ranges in the top results when possible. While the visible skin tone ranges in Pin images are leveraged to surface all skin tone ranges in the top results at serving time, they are not used as inputs to train ML ranking models. It is important to note that skin tone ranges are Pin features, not user features. We respect the user’s privacy and do not attempt to predict the user’s personal information, such as their ethnicity.

Acknowledgments

This endeavor would not have been possible without several rounds of discussion and iterations with our colleagues Vinod Bakthavachalam, Somnath Banerjee, Kevin Bannerman-Hutchful, Josh Beal, Larkin Brown, Hayder Casey, Yaron Greif, Will Hamlin, Edmarc Hedrick, Felicia Heng, Dmitry Kislyuk, Anna Kiyantseva, Tim Koh, Helene Labriet-Gross, Van Lam, Weiran Li, Daniel Liu, Dan Lurie, Jason Madeano, Rohan Mahadev, Nidhi Mastey, Candice Morgan, AJ Oxendine, Monica Pangilinan, Susanna Park, Rajat Raina, Chuck Rosenberg, Marta Scotto, Altay Sendil, Julia Starostenko, Kurchi Subhra Hazra, Eric Sung, Annie Ta, Abhishek Tayal, Yuting Wang, Dylan Wang, Jiajing Xu, David Xue, Saadia Kaffo Yaya, Duo Zhang, Liang Zhang, and Ruimin Zhu. We would like to thank them for their support and contributions along the way.

For more details on the approaches presented in this article please refer to our paper published at FAccT 2023.

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. To explore life at Pinterest, visit our Careers page.
