Stemming in NLP- A Beginner's Guide to NLP Mastery

Here is everything you need to know about the famous technique, Stemming, in NLP.

By Manika

The Internet is filled with information about almost everything, and most of it is available as textual data. Many researchers are keen on finding interesting ways to mine and leverage this data. One way to do that is to use Natural Language Processing (NLP) methods along with machine learning algorithms. This article discusses one of the most popular of those methods: stemming.



Here is a small exercise we’d like you to perform before exploring the meaning of stemming in NLP with an example. On Google, type ‘how to save a machine learning model’ in the search box and observe the results. Do the same for the query ‘save a machine learning model’ and notice that the search results are almost identical.

Google Search results for ‘how to save a machine learning model’

The near-identical results suggest that Google’s search algorithm understands how little the words ‘how to’ add to the meaning of the query. How does it do that? Google’s engineers have not disclosed the details, but it is reasonable to assume that Google uses NLP methods in the background to analyze the keywords entered in the search box. There are many other exciting applications of NLP, like sentiment analysis and chatbots. To understand how such systems work, you need a solid grasp of various NLP concepts. Read this article till the end to learn about one such concept: stemming.

What is Stemming in NLP?

NLP applications are built by converting textual data into numerical vectors. One way of doing this is to assume that all words are independent of each other and to create a vector space whose dimension equals the number of words in your dictionary. With a large dictionary, the resulting high-dimensional vectors can consume a lot of memory. To tackle this problem, we can create a space where words with similar meanings are grouped and represented by the same vector. Morphological analysis lets us map different forms of a word to the same base form, and thus to the same vector, which reduces memory consumption. One of the most common ways to do that is stemming. Let us explore what exactly stemming is!
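To make the memory argument concrete, here is a minimal sketch that compares the size of a vocabulary before and after stemming. The `toy_stem` function is a hypothetical, deliberately naive suffix stripper written only for this illustration; it is not one of the real stemming algorithms discussed later.

```python
def toy_stem(word: str) -> str:
    """Hypothetical, deliberately naive stemmer: strip one common suffix."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so that short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


documents = [
    "the model learns and the models learned",
    "learning models helps the learner",
]

tokens = [token for doc in documents for token in doc.split()]

# Each distinct entry would need its own dimension in a one-word-per-dimension space.
raw_vocabulary = sorted(set(tokens))
stemmed_vocabulary = sorted({toy_stem(token) for token in tokens})

print(len(raw_vocabulary), raw_vocabulary)
print(len(stemmed_vocabulary), stemmed_vocabulary)
```

In this toy corpus the vocabulary shrinks from nine entries to six, because ‘learns’, ‘learned’, and ‘learning’ share one stem and ‘model’ and ‘models’ share another.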


Define Stemming in NLP

Stemming is the process of removing affixes from a word so that only its stem remains. For example, the words ‘run’, ‘running’, and ‘runs’ all reduce to the stem ‘run’. One crucial point about stems is that they need not be meaningful words. For example, the Porter stemmer reduces ‘traditional’ to ‘tradit’, which has no meaning on its own.


Now that you understand what stemming means, it is time to understand its significance in NLP.

Why use Stemming in NLP?

The benefits of using a stemming algorithm in an NLP project can be summarised as follows:

  1. It reduces the number of words that serve as an input to the Machine Learning/Deep Learning model.

  2. It minimizes the confusion around words that have similar meanings.

  3. It lowers the complexity of the input space.

  4. When creating applications that search a specific text in a document, using stemming for indexing assists in retrieving relevant documents.

  5. It helps reduce the out-of-vocabulary (OOV) problem. For example, if the vocabulary does not contain the word ‘oranges’, one can use the stem ‘orange’ as a proxy.

  6. It can improve the accuracy of the ML/DL model, as the model does not have to deal with inflected word forms separately.
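The out-of-vocabulary fallback from point 5 can be sketched as follows. Both the tiny `vocabulary` and the `strip_plural` helper are hypothetical stand-ins invented for this example; a real project would use a proper stemmer and a vocabulary built from its own corpus.

```python
def strip_plural(word: str) -> str:
    """Hypothetical helper: drop a trailing 's' as a crude stand-in for a stemmer."""
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


# Toy vocabulary mapping known words to integer ids (illustrative only).
vocabulary = {"<unk>": 0, "orange": 1, "apple": 2, "juice": 3}


def lookup(word: str) -> int:
    """Return the id of the word, falling back to its stem, then to <unk>."""
    if word in vocabulary:
        return vocabulary[word]
    return vocabulary.get(strip_plural(word), vocabulary["<unk>"])


print(lookup("oranges"))  # not in the vocabulary, so the id of 'orange' is used
print(lookup("bananas"))  # neither form is known, so this maps to <unk>
```

The fallback only helps when the stem itself is in the vocabulary, which is why stemming reduces the OOV problem rather than removing it entirely.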

You will appreciate these advantages once you dive deeper into stemming and its various types. So, without any further ado, let’s get started.


Types of Stemming in NLP

Let us discuss the three popular types of stemming: Porter, Snowball, and Lancaster.


Porter Stemmer

This stemming algorithm is named after its creator, Martin Porter. It is one of the simplest and most commonly used stemming algorithms. It is based on a fixed sequence of suffix-stripping rules, operates on plain strings, and supports only the English language. It is comparatively gentle, so its stems tend to contain fewer errors than those of more aggressive stemmers. The first step of the algorithm (step 1a) handles plurals by applying the first of the following rules that matches:

  1. SSES to SS
    If the word ends with the suffix ‘sses’, the Porter algorithm replaces it with ‘ss’. For example, ‘caresses’ is transformed into ‘caress’.

  2. IES to I
    If the word ends with the suffix ‘ies’, the Porter algorithm replaces it with ‘i’. For example, ‘butterflies’ is transformed into ‘butterfli’.

  3. SS to SS
    If the word ends with the suffix ‘ss’, the Porter algorithm leaves it unchanged. For example, ‘suppress’ remains as it is.

  4. S to _
    If the word ends with the suffix ‘s’, the Porter algorithm removes that suffix. For example, ‘chairs’ is transformed into ‘chair’.
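The four rules above can be sketched in a few lines of plain Python. The function name `porter_step_1a` is chosen for this illustration, and the sketch covers only this first step; the full Porter algorithm applies several further steps afterwards.

```python
def porter_step_1a(word: str) -> str:
    """Apply the first matching plural rule of Porter's step 1a."""
    if word.endswith("sses"):
        return word[:-2]  # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]  # IES -> I
    if word.endswith("ss"):
        return word       # SS -> SS (unchanged)
    if word.endswith("s"):
        return word[:-1]  # S -> (removed)
    return word


for word in ["caresses", "butterflies", "suppress", "chairs"]:
    print(word, "->", porter_step_1a(word))
```

The order of the checks matters: ‘sses’ and ‘ss’ must be tested before the bare ‘s’ rule, otherwise ‘suppress’ would lose its final letter.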

Snowball Stemmer

The English version of this stemming algorithm is also known as the Porter2 algorithm because it is an improved version of the Porter algorithm. The Snowball framework also provides stemmers for multiple other languages. It is generally more accurate than the Porter algorithm and works with both Unicode and plain string data. One of its rules replaces the suffixes ‘ied’ and ‘ies’ with ‘i’ if they are preceded by more than one letter, and with ‘ie’ otherwise. It is also often reported to be slightly faster than the Porter stemmer.
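As a rough illustration of the ‘ied/ies’ rule described above, the following sketch implements just that one rule in plain Python. The function name `snowball_ied_ies` is made up for this example, and the real Snowball stemmer applies many more rules around it.

```python
def snowball_ied_ies(word: str) -> str:
    """Sketch of the Porter2 rule for the suffixes 'ied' and 'ies' only."""
    for suffix in ("ied", "ies"):
        if word.endswith(suffix):
            preceding = word[: -len(suffix)]
            # More than one letter before the suffix: use 'i', otherwise 'ie'.
            return preceding + ("i" if len(preceding) > 1 else "ie")
    return word


for word in ["cries", "ties", "died", "tried"]:
    print(word, "->", snowball_ied_ies(word))
```

This is why short words such as ‘ties’ keep a readable stem (‘tie’), while longer words such as ‘cries’ become ‘cri’.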

Lancaster Stemmer

This stemming algorithm is one of the fastest available. Unlike the Snowball and Porter stemmers, it produces stems that are often not intuitive. It is aggressive: many short words become unrecognizable after stemming, and the number of distinct words drops significantly. So, you should avoid this algorithm if you need to preserve distinctions between different words. For example, the Lancaster rule converts the suffix ‘ies’ into ‘y’. Because of this aggressiveness, it is often considered less reliable than the Snowball stemmer.


Stemming Algorithms in NLP Libraries

This section will discuss various libraries available in the Python programming language and the functions they contain for implementing stemming.


  1. NLTK
    NLTK stands for Natural Language Toolkit, one of the most popular libraries for implementing NLP methods. For stemming, it supports the three types of stemmers that we discussed already: Porter, Snowball, and Lancaster.

  2. spaCy
    As stemming is inherently inaccurate, the spaCy library does not contain a predefined function for stemming; instead, it uses lemmatization to reduce words to their base form.

  3. Gensim
    Gensim is another Python library that is primarily used for converting textual data into vectors. It contains an implementation of the most popular stemmer, the Porter stemmer.

  4. TextBlob
    It is a Python library that is built upon the NLTK library. It allows its users to implement Porter stemming through its stem() function.


Trade-Offs Between Stemming Algorithms in NLP

The choice of a stemming algorithm in NLP depends on the specific NLP use case and also other factors like speed, stem accuracy, efficiency and simplicity.

Stem Accuracy 

The Porter stemmer works by applying a fixed set of rules to strip suffixes, so it does not always produce the correct stem. The Snowball stemmer is generally slightly more accurate because it refines Porter’s original rules, while the Lancaster stemmer is the most aggressive of the three and is the most likely to over-stem.

Speed

If you are working on a large-scale NLP task where processing speed matters, the Porter stemmer could be a good choice because it applies a simple set of rules to remove suffixes, whereas the Snowball stemmer applies a more refined rule set and the Lancaster stemmer iteratively trims a word toward its shortest form.

If stem accuracy matters more than speed, opt for the Snowball stemmer; otherwise, the Porter stemmer is a good choice.


Stemming in NLP Examples

This section shows how to put what you have learned about stemming into practice with the help of various libraries in the Python programming language.

You can easily practice implementing word stemming in Python using a Google Colab notebook. In NLTK, the Porter stemming process is available through the PorterStemmer class. Notice how it transforms the words ‘dies’ and ‘died’ to the same stem, ‘die’.

Implementing Porter Stemming Algorithm with NLTK
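The original code for this step is not reproduced here, so the following is a minimal sketch of how it might look, assuming NLTK is installed. The word list is chosen for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative word list; any list of lowercase tokens works the same way.
words = ["run", "running", "runs", "dies", "died", "generously"]

porter_stems = {word: stemmer.stem(word) for word in words}
for word, stem in porter_stems.items():
    print(f"{word} -> {stem}")
```

With NLTK’s default settings, ‘dies’ and ‘died’ both come out as ‘die’, and ‘generously’ is cut down to ‘gener’.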

Implementing the Snowball stemming algorithm in Google Colab is just as easy, although the syntax is slightly different: NLTK’s SnowballStemmer takes the name of the language as an argument. For most words, the stem is the same as with the Porter stemmer; a word such as ‘generously’, however, gets different stems from the two.

Implementing Snowball Stemmer Algorithm with NLTK
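A comparable sketch for the Snowball stemmer, again assuming NLTK is installed and using an illustrative word list:

```python
from nltk.stem import SnowballStemmer

# The language is passed by name; SnowballStemmer.languages lists the supported ones.
stemmer = SnowballStemmer("english")

words = ["run", "running", "runs", "dies", "died", "generously"]

snowball_stems = {word: stemmer.stem(word) for word in words}
for word, stem in snowball_stems.items():
    print(f"{word} -> {stem}")
```

Here ‘generously’ becomes ‘generous’ rather than the ‘gener’ produced by the Porter stemmer, which illustrates the difference mentioned above.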

In a similar fashion, you can implement the Lancaster stemming algorithm in Python with the help of the NLTK library. In this case, notice how the stem of the word ‘generous’ differs from the ones produced by the Snowball and Porter stemmers.

Implementing Lancaster Stemmer Algorithm with NLTK
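The Lancaster stemmer follows the same pattern in NLTK. This is a minimal sketch with an illustrative word list; run it to compare the stems with those of the two stemmers above.

```python
from nltk.stem import LancasterStemmer

stemmer = LancasterStemmer()

# Illustrative word list; the Lancaster stems are often shorter than Porter's.
words = ["maximum", "saying", "generous", "generously", "running"]

lancaster_stems = {word: stemmer.stem(word) for word in words}
for word, stem in lancaster_stems.items():
    print(f"{word} -> {stem}")
```

For instance, ‘maximum’ is reduced to ‘maxim’, which shows how readily this stemmer removes endings that the other two leave alone.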


As mentioned in the libraries section, you can also implement the Porter stemmer algorithm with the help of the Gensim library in Python. The syntax for doing so is similar to that of NLTK.

Implementing Porter Stemmer Algorithm with Gensim

This article briefly discussed three popular stemming methods for reducing inflected forms of words. One reason stemming is not always the first choice among NLP practitioners is that it often leads to over-stemming and under-stemming. Over-stemming occurs when two words with different meanings are incorrectly reduced to the same stem. Under-stemming, on the other hand, occurs when two words that derive from the same root are reduced to different stems. Lemmatization addresses these issues: it reduces the inflected forms of a word to its lemma, which, unlike a stem, is always a meaningful word.
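Both failure modes are easy to observe with NLTK’s Porter stemmer. The word groups below are common textbook illustrations chosen for this sketch, not examples taken from a specific dataset.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Over-stemming: words with different meanings collapse to a single stem.
over_stemmed = {
    word: stemmer.stem(word) for word in ["universe", "university", "universal"]
}

# Under-stemming: related forms of one word end up with different stems.
under_stemmed = {word: stemmer.stem(word) for word in ["data", "datum"]}

print(over_stemmed)
print(under_stemmed)
```

All three words in the first group are reduced to ‘univers’, while ‘data’ and ‘datum’ keep separate stems even though they are forms of the same word.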

You can understand both stemming and lemmatization better if you work on practical projects. And in case you are clueless about where to find sample NLP Projects, here are three simple project ideas with source code for you:

Chatbots have become the norm in customer care services. You can build a chatbot for your website by following this project’s solution. This project will teach you how to implement various NLP methods like stopword removal, POS tagging, tokenization, and stemming in Python. You will also learn about the Bag of Words model and understand its significance in building a chatbot. You will learn how to use two machine learning algorithms, Decision Trees and Naive Bayes, to perform the text classification required for building the chatbot from scratch.

Source Code: Natural language processing Chatbot application using NLTK 

If you are interested in exploring the applications of the Gensim library in NLP, then this project is a must for you. It will help you learn about two models, Word2Vec and FastText, and their implementation in the Python programming language. You will learn about various text preprocessing methods and the skip-gram model. Furthermore, the project will also help you understand PCA plots and the cosine similarity function.

Source Code: Word2Vec and FastText Word Embedding with Gensim in Python

Customer response to a product or service is crucial for the growth of many businesses. Modern technology now allows its users to analyze customer response with the help of machine learning and NLP techniques. This project will guide you through implementing various text preprocessing methods. You will explore how to detect gibberish using the Markov Chain concept. Additionally, you will learn how to deduce content richness with the help of the TF-IDF vectorization method. Besides that, you will also learn about using the random forest classifier.

Source Code: Ecommerce product reviews - Pairwise ranking and sentiment analysis 

If you are curious about more such projects in data science and big data, check out ProjectPro’s library of end-to-end project solutions with source code in the two domains.


FAQs on Stemming in NLP

1) What is the difference between Lemmatization and Stemming?

  1. Stemming does not need a dictionary of words, whereas lemmatization requires one.

  2. In stemming, the resulting root need not be a meaningful word, whereas in lemmatization it always is.

  3. Stemming is a quicker process than lemmatization.

2) Why do we use Lemmatization in NLP?

Lemmatization in NLP is used to overcome the shortcomings of stemming. It transforms tokens into their root forms, which are meaningful words.

 


About the Author

Manika

Manika Nagpal is a versatile professional with a strong background in both Physics and Data Science. As a Senior Analyst at ProjectPro, she leverages her expertise in data science and writing to create engaging and insightful blogs that help businesses and individuals stay up-to-date with the
