Stemming, Lemmatization, Stopwords, and N-Grams in NLP

Jaimin Mungalpara
Jul 26, 2022


Image taken from https://www.analyticssteps.com/

The process of converting raw text into words a computer can understand is called text preprocessing. So far, we have covered tokenization as a part of text preprocessing. In this article, you will learn about stemming, lemmatization, and stopwords. Stemming and lemmatization have been developed since the 1960s for text/word normalization, while stopwords are commonly used words (such as “a”, “an”, and “the”) that we can ignore during text preprocessing. You will learn the background and practical implementation of these techniques.

Stemming

Stemming is the process of reducing an inflected word towards its base or root form. Every word has a root-base form, which appears in different versions according to tense. For example, “sing” is a root-base word, and “singing” and “sings” are different forms of it. There are mainly two errors that occur while performing stemming: over-stemming and under-stemming.

Over-stemming: when too much of a word is cut away, so that words with different meanings are reduced to the same stem.

Under-stemming: when too little is cut away, so that words which should share a stem end up with different stems.
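As a rough illustration, here is a minimal sketch using NLTK's PorterStemmer; the example words are my own, not from the original article:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Over-stemming: unrelated words collapse to the same stem.
# Porter reduces all three of these to "univers".
print([stemmer.stem(w) for w in ["universe", "university", "universal"]])

# Under-stemming: related words end up with different stems.
# These forms of "alumnus" are not reduced to a common stem.
print([stemmer.stem(w) for w in ["alumnus", "alumni", "alumnae"]])
```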

Python NLTK provides two English stemmers, PorterStemmer and LancasterStemmer. It also provides SnowballStemmer, ISRIStemmer, and RSLPStemmer for non-English languages. NLTK includes the Snowball framework for creating non-English stemmers, so one can program one's own language stemmer using Snowball.

Stemming can be used in:

  • Information retrieval systems such as search engines.
  • Determining domain vocabularies in domain analysis.

Python implementation of stemming can be found here.
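The original post links to an external notebook for the code; as a stand-in, here is a minimal sketch comparing NLTK's built-in stemmers (the sample words and the choice of German for the non-English example are mine):

```python
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

words = ["singing", "sings", "sang", "generously", "maximum"]

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball_en = SnowballStemmer("english")  # Snowball's improved English stemmer
snowball_de = SnowballStemmer("german")   # Snowball also covers non-English languages

for w in words:
    print(w, "->", porter.stem(w), lancaster.stem(w), snowball_en.stem(w))

# A non-English example: stemming German words with the German Snowball stemmer.
print([snowball_de.stem(w) for w in ["Kinder", "laufen", "Häuser"]])
```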

Lemmatization

Lemmatization is the method of reducing any word to its base root form using its context. It groups together the different inflected forms of a word so they can be analyzed as a single item. Lemmatization is similar to stemming, but it takes the context of the word into account. In practice, lemmatization is preferred over stemming because lemmatization performs a morphological analysis of the words.

For example, “sing”, “singing”, and “sang” all have the base root form “sing” under lemmatization, while stemming leaves “sang” as “sang”.

Lemmatization can be used in:

  • Comprehensive retrieval systems like search engines.
  • Compact indexing

Python implementation of Lemmatization can be found here.
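Again, the original links out for the code; here is a minimal sketch with NLTK's WordNetLemmatizer, assuming the WordNet data has been downloaded:

```python
import nltk
from nltk.stem import WordNetLemmatizer, PorterStemmer

# The lemmatizer needs the WordNet corpus (some NLTK versions also want omw-1.4).
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

for word in ["sing", "singing", "sings", "sang"]:
    # pos="v" tells the lemmatizer to treat the word as a verb; this is the
    # context that plain stemming lacks, so "sang" lemmatizes to "sing"
    # while its stem stays "sang".
    print(word, "-> lemma:", lemmatizer.lemmatize(word, pos="v"),
          "| stem:", stemmer.stem(word))
```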

Stopwords

Stop words are basically a set of commonly used words in any language. We remove these commonly used words so that we can focus on the words that carry meaning. For example, take the query “how to learn NLP”: if we feed this entire phrase to a search engine, it will find the words “how” and “to” on lots of web pages. If we remove words like “how” and “to”, the search can focus on “learn” and “NLP”. This way we can make searching more precise, and it will be computationally faster as well.
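A minimal sketch of this idea with NLTK's predefined English stopword list (it assumes the stopwords corpus has been downloaded):

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)  # NLTK's predefined stopword lists

stop_words = set(stopwords.words("english"))

query = "how to learn NLP"
tokens = query.lower().split()  # simple whitespace tokenization for the sketch

filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # the stopwords "how" and "to" are dropped, leaving ["learn", "nlp"]
```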

Generally, we treat stopwords as a single fixed set of words, but the right set differs from application to application. For example, in an application like sentiment analysis, removing positive words like “good” and “nice” or the negative word “not” can change the meaning completely. In such cases, we can construct a domain-specific stopword list instead of using a predefined one. To create a domain-specific stopword list, we can use the most frequent words, the least frequent words, or the words with low IDF in that corpus.
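One simple way to build such a list is to take the most frequent words in the corpus itself; below is a minimal sketch where the toy corpus and the cut-off of two words are arbitrary choices for illustration:

```python
from collections import Counter

# A tiny toy corpus standing in for a real domain corpus.
corpus = [
    "the movie was good and the acting was good",
    "the plot was not good but the music was nice",
    "the movie was long",
]

# Count word frequencies across the whole corpus.
counts = Counter(word for doc in corpus for word in doc.lower().split())

# Treat the k most frequent words as candidate domain-specific stopwords;
# low-IDF words (those appearing in nearly every document) are another option.
k = 2
domain_stopwords = {word for word, _ in counts.most_common(k)}
print(domain_stopwords)  # {"the", "was"} for this toy corpus
```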

Python implementation of stopwords can be found here.

N-Grams

In NLP, an n-gram is a contiguous sequence of n items generated from a given sample of text, where the items can be characters or words and n can be any number such as 1, 2, or 3.

For example, from the sentence “Natural language processing is a subfield of linguistics” we can generate unigrams (the individual words), bigrams such as “Natural language”, “language processing”, and “processing is”, and trigrams such as “Natural language processing” and “language processing is”. The sketch at the end of this section generates all three.

Usage of N-Grams:

  • To create features from a text corpus for ML algorithms.
  • To build capabilities like autocorrect, sentence autocompletion, text summarization, speech recognition, etc.

Python implementation of N-Gram can be found here.
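As with the other sections, the original links out for the code; here is a minimal sketch using nltk.util.ngrams on the example sentence above:

```python
from nltk.util import ngrams

sentence = "Natural language processing is a subfield of linguistics"
tokens = sentence.split()  # word-level items; character n-grams would iterate over the string instead

unigrams = list(ngrams(tokens, 1))
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))

print(unigrams)  # ('Natural',), ('language',), ...
print(bigrams)   # ('Natural', 'language'), ('language', 'processing'), ...
print(trigrams)  # ('Natural', 'language', 'processing'), ...
```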
