
Building and maintaining the skills taxonomy that powers LinkedIn's Skills Graph

Co-authors: Sofus Macskássy, Carol Jin, Shiyong Lin, Xiaomin Wei, and Michael O’Neill

When we think of skills, we think of the unique knowledge, expertise, and abilities that each of us has. At LinkedIn, we see skills as more – we see them as a way to level the playing field in the labor market because they represent what a member is capable of – not where they went to school, where they grew up or where they worked. That’s why we believe a skills-first approach to hiring talent will help companies gain access to wider talent pools to find the skilled workers their businesses need, especially in sectors that are aggressively looking for talent.

One of the most exciting parts of our work is that we get to play a part in helping progress a skills-first labor market through our team’s ongoing engineering work in building our Skills Graph. Our members and customers experience the insights from our Skills Graph in the tools they use every day on LinkedIn, including Recruiter and LinkedIn Learning.

Determining how skills are used, represented, and most efficiently matched across the workforce requires a foundation that understands the relationships across skill groups. This is where our skills taxonomy comes in – essentially, it is the vocabulary of skills on which our Skills Graph is built. In this blog, we’ll discuss the ways in which we’re continuously investing in our skills taxonomy to build a strong, reliable foundation for our Skills Graph to help ensure we can match our members’ skills to opportunity and knowledge.

What is the skills taxonomy?

The skills taxonomy is where we organize and categorize skills based on their hierarchical relationships to each other. To find these connections, the taxonomy is built from the foundational details of each skill – including the skill concept, skill ID, skill aliases, skill type, and more.

Let’s look at “machine learning,” for example. Our taxonomy includes machine learning (the skill concept), the skill ID (a number assigned to each skill), aliases (e.g., abbreviations like “ML” or language translations), skill type (e.g., soft or hard skill), descriptions of the skill (“the study of computer algorithms…”), and more. We then analyze these attributes to establish simple relationships with other skills (e.g., “reinforcement learning” is a child skill of “machine learning”), which we’ll discuss more below.
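To make this more concrete, here is a minimal sketch of what a single taxonomy entry could look like as a data structure. The field names, ID, and values are illustrative assumptions, not our actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class SkillEntry:
    """Illustrative shape of one skills taxonomy entry (hypothetical schema)."""
    skill_id: int                 # numeric ID assigned to each skill
    name: str                     # the skill concept, e.g. "Machine Learning"
    skill_type: str               # e.g. "hard" or "soft"
    description: str = ""         # short definition of the skill
    aliases: dict[str, list[str]] = field(default_factory=dict)  # locale -> alias strings

machine_learning = SkillEntry(
    skill_id=1234,                # placeholder ID, not a real taxonomy ID
    name="Machine Learning",
    skill_type="hard",
    description="The study of computer algorithms that improve automatically through experience.",
    aliases={"en_US": ["ML"], "de_DE": ["Maschinelles Lernen"]},
)
```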

Our taxonomy is always evolving because new skills are constantly emerging across industries. Since February 2021, the total size of our skills taxonomy has grown nearly 35% and today consists of nearly 39k skills, with 374k aliases across 26 locales and more than 200k edges (connections) between skills. 

The Skills Graph uses the taxonomy to facilitate skills-first experiences across various models and products at LinkedIn. This includes several standardizers/models (e.g., member-skill, job-title, member-company, Jobs You May Be Interested In) that help power Skills Match, Collaborative Articles, recommendations, Profile typeahead, Recruiter search, marketing campaigns, Talent Insights, and Economic Graph Research.

Now, let’s discuss the skills taxonomy’s key pillars.

Semantic Framework Design

Each skill is represented by a “node” in the skills taxonomy, and nodes are linked together to form a hierarchical skill network through “edges” that we call knowledge lineages. Knowledge lineages reflect how two skills relate to each other. Skills can relate for various reasons – for example, both skills may be part of the same career specialization, or one skill may be a tool used to apply the other. These are directed edges whose semantics go from the “parent” node to the “child” node. These semantic edges are very powerful, and we use them to make different types of inferences. For example, if a person knows something about a child skill, chances are that the person also has some knowledge of the parent skill, and perhaps of other “siblings.”

Figure 1. Connected skills with hierarchical relationships example

Skills can form polyhierarchical relationships, meaning one skill can be mapped to multiple parent and child skills in the same domain or across different domains – not due to ambiguity, but to mutual inclusion. An example of a mutually inclusive double mapping that we would add is "Offshore Construction" being mapped to both "Construction" and "Oil and Gas."

Meanwhile, an ambiguous skill like "Networking" could be ostensibly mapped to both "Computer Networking" and "Professional Networking." However, this type of ambiguous relation would violate our structured skills quality guardrails and would not be included.

Figure 2. Polyhierarchy represented as child-parent relationships

Figure 3. Polyhierarchy represented as parent-child relationships

In the child-parent example above (Figure 2), we see the skill “Supply Chain Automation” linked to “Supply Chain Engineering” as its parent, which in turn is linked to “Engineering” and “Manufacturing.” Therefore, if a member lists “Supply Chain Automation” as a skill on their profile, we can infer that this member knows something about the parent skill “Supply Chain Engineering,” as well as all the subsequent skills in the lineage paths up to the top nodes. Moreover, the parent-child relationship example (Figure 3) shows that “Supply Chain Management” has numerous children.
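As a rough sketch of this inference, suppose we store the child-to-parent edges from Figure 2 in a simple adjacency map; walking it upward yields every ancestor skill we can credit to the member. This is an illustrative toy, not the production traversal.

```python
# Hypothetical child -> parents adjacency, mirroring the Figure 2 example.
PARENTS = {
    "Supply Chain Automation": ["Supply Chain Engineering"],
    "Supply Chain Engineering": ["Engineering", "Manufacturing"],
}

def infer_ancestors(skill: str) -> set[str]:
    """Collect every skill reachable by following parent edges upward."""
    ancestors, stack = set(), [skill]
    while stack:
        for parent in PARENTS.get(stack.pop(), []):
            if parent not in ancestors:
                ancestors.add(parent)
                stack.append(parent)
    return ancestors

# A member listing "Supply Chain Automation" is also credited with:
print(infer_ancestors("Supply Chain Automation"))
# {'Supply Chain Engineering', 'Engineering', 'Manufacturing'}
```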

Human curation

The connected skills taxonomy is curated by a combination of human taxonomists and machine learning. This dual approach helps grow the taxonomy at scale while ensuring the skills data meets our quality standards.

The first step to constructing the taxonomy is identifying a set of skill candidates that need to be reviewed. With over 39k skills in the taxonomy, we want to target the subset of skills that are most highly utilized across job postings, recruiter searches, job searches, ad campaigns, and more. With the growth of LinkedIn members and jobs, we leverage the entity discovery pipeline to expedite the discovery of new skill candidates from raw free-text skill terms in sources like member profiles and job search queries. This can introduce different types of noise and varying data quality, so we sanitize the skill candidates to reduce noise and redundancy, and remove obsolete or malformed terms.
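A minimal sketch of that sanitization step might look like the following; the normalization rules and the obsolete-term list are illustrative assumptions rather than the actual pipeline.

```python
import re

OBSOLETE_TERMS = {"ms frontpage"}        # hypothetical blocklist of retired skills

def sanitize_candidates(raw_terms: list[str]) -> list[str]:
    """Normalize raw free-text skill terms and drop noisy or redundant ones."""
    seen, cleaned = set(), []
    for term in raw_terms:
        term = re.sub(r"\s+", " ", term).strip()        # collapse whitespace
        if not term or len(term) > 80:                  # drop empty / malformed strings
            continue
        key = term.lower()
        if key in OBSOLETE_TERMS or key in seen:        # drop obsolete terms and duplicates
            continue
        seen.add(key)
        cleaned.append(term)
    return cleaned

print(sanitize_candidates(["Machine  Learning", "machine learning", "", "MS FrontPage"]))
# ['Machine Learning']
```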

Some highly utilized skills, such as “Cadence” or “Boundary,” are obscure or ambiguous terms. To uncover the intended meaning and predominant usage by members and companies on LinkedIn, taxonomists investigate LinkedIn skills usage metadata. The result is that “Cadence” is disambiguated to “Cadence Software” and “Boundary” is disambiguated to “Boundary Line.”

With a clean and clear list of highly utilized skills, taxonomists create knowledge lineages by manually assigning knowledge parents and relationships. Leveraging the machine learning model output of parent-child recommendations, taxonomists can make informed decisions, and the manually curated data also serves as training data to improve the performance of these machine learning models.
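Conceptually, the loop looks something like the sketch below: model-recommended parent candidates are queued for taxonomist review, and each accepted or rejected decision is kept as a labeled example for the next training round. The names and structures here are hypothetical.

```python
# Hypothetical human-in-the-loop flow: model suggestions -> taxonomist review -> training labels.
model_suggestions = [
    ("Reinforcement Learning", "Machine Learning"),   # (child candidate, parent candidate)
    ("Reinforcement Learning", "Project Management"),
]

def review(child: str, parent: str) -> bool:
    """Stand-in for a taxonomist's accept/reject decision."""
    return parent == "Machine Learning"

accepted_edges, training_labels = [], []
for child, parent in model_suggestions:
    accepted = review(child, parent)
    if accepted:
        accepted_edges.append((child, parent))        # becomes a new edge in the taxonomy
    # Both accepted and rejected decisions become labeled data for the next model iteration.
    training_labels.append({"child": child, "parent": parent, "label": int(accepted)})

print(accepted_edges)    # [('Reinforcement Learning', 'Machine Learning')]
print(training_labels)
```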

Figure 4: Human-In-The-Loop approach to build Skills Graph

Scalability with machine learning

Instead of relying only on our taxonomists to manually curate nearly 39k skills, we apply machine learning techniques to help scale the taxonomy construction. This includes a tool we developed, KGBert, which was inspired by KG-Bert – a supervised model that applies deep semantic understanding of skills to predict relationship lineages. As highlighted below in Figure 5, the KGBert model takes two skills, converts them into contextual sentences, encodes them through a fine-tuned BERT layer, and then feeds the flattened output from BERT into a classification module that predicts the relationship between the two skills (i.e., “A is parent of B,” “A is child of B,” or “no relation”).

As a result, KGBert significantly outperforms our previous models by over 20% in terms of F1 score, a balanced and reliable metric for model performance that takes both precision and recall into account.

This demonstrates KGBert’s strength in accurately identifying skill-pair relationships and the value of high-quality contextual sentences as model input.
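For reference, F1 is the harmonic mean of precision and recall; a quick sketch of the computation (with made-up counts) looks like this:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2 * precision * recall / (precision + recall)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts only, not KGBert's actual evaluation numbers.
print(round(f1_score(tp=80, fp=10, fn=20), 3))  # 0.842
```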

Figure 5: KGBert model training pipeline

Let’s take a closer look at each step of KGBert:

Input Layer

The two skill nodes l and r are represented by their names and/or descriptions, which are then concatenated into one “context sentence”: the two parts are separated by the special [SEP] token, with the [CLS] token as the prefix and a final [SEP] token as the suffix. The table below demonstrates the input layer generation.

Node l input | Node r input | Context sentence
Name l | Name r | [CLS] Name l [SEP] Name r [SEP]
Name l | Description r | [CLS] Name l [SEP] Description r [SEP]
Description l | Name r | [CLS] Description l [SEP] Name r [SEP]
Description l | Description r | [CLS] Description l [SEP] Description r [SEP]

Using “TensorFlow” and “Machine Learning” as two example skills, the model looks at the following forms to predict their relationship:

  • [CLS] TensorFlow [SEP] Machine Learning [SEP]

  • [CLS] TensorFlow [SEP] Machine learning (ML) is the study of computer algorithms that improve automatically through experience and by the use of data [SEP]

  • [CLS] TensorFlow is an open-source software library for machine learning and deep learning [SEP] Machine Learning [SEP]

  • [CLS] TensorFlow is an open-source software library for machine learning and deep learning [SEP] Machine learning (ML) is the study of computer algorithms that improve automatically through experience and by the use of data [SEP]
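In practice, a BERT tokenizer produces exactly this [CLS] … [SEP] … [SEP] layout when given a pair of texts. Below is a minimal sketch using the open-source Hugging Face transformers library; the “bert-base-uncased” checkpoint and preprocessing are illustrative stand-ins rather than our production setup.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # stand-in checkpoint

name_l = "TensorFlow"
desc_r = ("Machine learning (ML) is the study of computer algorithms that improve "
          "automatically through experience and by the use of data")

# Passing a text pair yields: [CLS] <left text> [SEP] <right text> [SEP]
encoding = tokenizer(name_l, desc_r, return_tensors="pt", truncation=True)
print(tokenizer.decode(encoding["input_ids"][0]))
# [CLS] tensorflow [SEP] machine learning ( ml ) is the study ... [SEP]
```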

Output Layer

The output is a softmax layer that determines the likelihood of each possible relationship, or no relationship, between the two input skills. As seen in the input layer section, two skills can produce multiple input combinations for the model, and these may yield different prediction outcomes. We apply a “voting” mechanism across these outcomes to select the most likely relationship as the final prediction for the two skill nodes.
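A minimal sketch of the classification and voting step, again using open-source transformers as a stand-in for our fine-tuned model, might look like this:

```python
from collections import Counter
import torch
from transformers import BertForSequenceClassification, BertTokenizer

LABELS = ["A is parent of B", "A is child of B", "no relation"]
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Stand-in model; in practice this head would be fine-tuned on curated skill pairs.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

def predict_relation(left_variants: list[str], right_variants: list[str]) -> str:
    """Classify every name/description combination, then majority-vote the outcomes."""
    votes = []
    for left in left_variants:
        for right in right_variants:
            inputs = tokenizer(left, right, return_tensors="pt", truncation=True)
            with torch.no_grad():
                logits = model(**inputs).logits
            votes.append(int(logits.softmax(dim=-1).argmax()))
    return LABELS[Counter(votes).most_common(1)[0][0]]

print(predict_relation(
    ["TensorFlow", "TensorFlow is an open-source software library for machine learning"],
    ["Machine Learning", "Machine learning (ML) is the study of computer algorithms"],
))
```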

Training Data

When training the model, we leverage human-curated data as positive labels. Negative labels include the following (a small sampling sketch follows the list):

  • Skill pairs from different industries, e.g., TensorFlow vs. Financial Management

  • Niece/Nephew pairs, e.g., TensorFlow vs. Cognitive Computing

  • Sibling pairs that share the same parent skill, e.g., TensorFlow vs. PyTorch

  • Loosely related pairs that are 3 or more generations apart, e.g., Engineering vs. PyTorch
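A rough sketch of how one of these negative sampling strategies – sibling pairs that share a parent – could be implemented over a small seed graph is shown below; the helper names and the tiny graph are hypothetical.

```python
import itertools

# Tiny hypothetical seed graph: child -> parents.
PARENTS = {
    "TensorFlow": ["Machine Learning"],
    "PyTorch": ["Machine Learning"],
    "Machine Learning": ["Artificial Intelligence"],
    "Financial Management": ["Finance"],
}

def negative_sibling_pairs(parents: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Skills sharing a parent (e.g. TensorFlow vs. PyTorch) become 'no relation' examples."""
    by_parent = {}
    for child, parent_list in parents.items():
        for parent in parent_list:
            by_parent.setdefault(parent, []).append(child)
    pairs = []
    for siblings in by_parent.values():
        pairs.extend(itertools.combinations(sorted(siblings), 2))
    return pairs

print(negative_sibling_pairs(PARENTS))  # [('PyTorch', 'TensorFlow')]
```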

Figure 6: Sample Seed Skills Graph

KGBert helps build a more accurate and complex taxonomy in less time. It also improves our ability to add new and emerging skills to the taxonomy so that we’re measuring the latest skills trends in the industry. All of this ultimately helps the Skills Graph power better search, targeting, and ranking and recommendation models, such as Recruiter Search, Jobs You May Be Interested In, Learning Recommendation, and more.

Serving the graph data

All of the structural information stored in the skills taxonomy is transformed into the Skills Graph through a big data pipeline. From there, the connected Skills Graph data is available to all LinkedIn applications through a Rest.li service and a dataset on the Hadoop Distributed File System, to power online and offline use cases respectively. Users can traverse the graph in both top-down and bottom-up fashions. For example, given a skill, we can fetch all of its parent or child skills. These parent or child skills are also ranked by their distance to the source skill, which is calculated from skill entity embeddings trained on LinkedIn’s data.
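Conceptually, a consumer of the graph data can look up a skill’s parents or children and rank them by embedding distance. The sketch below is an illustrative approximation using plain dictionaries and cosine similarity, not the actual Rest.li API or our embedding model.

```python
import math

# Hypothetical graph snapshot: skill -> (parents, children), plus toy 2-d embeddings.
EDGES = {
    "Machine Learning": (["Artificial Intelligence"], ["Reinforcement Learning", "Deep Learning"]),
}
EMBEDDINGS = {
    "Machine Learning": [0.9, 0.1],
    "Reinforcement Learning": [0.8, 0.3],
    "Deep Learning": [0.85, 0.2],
    "Artificial Intelligence": [0.95, 0.05],
}

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def children_ranked(skill: str) -> list[str]:
    """Fetch child skills and rank them by embedding similarity to the source skill."""
    _, children = EDGES.get(skill, ([], []))
    return sorted(children, key=lambda c: cosine(EMBEDDINGS[skill], EMBEDDINGS[c]), reverse=True)

print(children_ranked("Machine Learning"))  # ['Deep Learning', 'Reinforcement Learning']
```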

By incorporating parent, child, and sibling relationships into the skill AI model, we can infer additional skills for members or jobs. This drives Sponsored Content revenue lift by targeting more user groups, and drives better job matching by boosting the relevance and frequency of job alerts to members.

Looking forward

LinkedIn is committed to skills-first hiring, and the Skills Graph lies at the center of this journey. In this post, we shared how we built the skills taxonomy that powers our Skills Graph through a combination of our semantic framework, human curation, machine learning like KGBert, and other tools. While we’ve been working on our taxonomy and Skills Graph for years, we still feel like this is only the beginning of our journey. As we continue to invest in the Skills Graph, there will be a lot more to discuss, including our ongoing work on building Skills Graph-aware embeddings, powered by a GNN model, which map the hierarchical skills taxonomy onto a lower-dimensional space that is easy for downstream AI models to use.

As the Skills Graph grows and evolves, we’ll continue to move closer to our goal of helping LinkedIn members and customers take a skills-first approach to navigate the ever-changing landscape of the modern workforce.