What is Lemmatization?

Lemmatization is the process of converting a word to its base form. Lemmatization considers the context and converts the word to its meaningful base form, whereas stemming just removes the last few characters, often leading to incorrect meanings and spelling errors.

Lemmatization goes beyond simple word reduction. It draws on a language’s full vocabulary and applies morphological analysis to words, aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

For example, lemmatization would correctly identify the base form of ‘boring’ as ‘bore’, whereas a simple suffix-stripping stemmer would cut off the ‘ing’ part and convert it to ‘bor’.

‘Boring’ -> Lemmatization -> ‘bore’

‘Boring’ -> Stemming -> ‘bor’

So, to extract the appropriate lemma, you should first identify the part-of-speech (POS) tag of the word in the specific context where it is used.
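The ‘boring’ contrast and the role of the POS tag can be sketched in plain Python. Everything here is a hypothetical illustration rather than any library's implementation: `naive_stem` only strips a fixed suffix, while `toy_lemmatize` looks up the word together with its POS tag in a tiny hand-made table.

```python
# Toy illustration only: real lemmatizers use a full vocabulary and
# morphological rules, not a three-entry table.
SUFFIXES = ("ing", "ed", "s")

# Hypothetical mini-dictionary keyed by (word, POS tag).
LEMMA_TABLE = {
    ("boring", "VERB"): "bore",    # "He is boring a hole."
    ("boring", "ADJ"): "boring",   # "What a boring film."
    ("feet", "NOUN"): "foot",
}


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, with no regard for meaning."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str, pos: str) -> str:
    """Return the dictionary form for (word, pos), or the word unchanged."""
    word = word.lower()
    return LEMMA_TABLE.get((word, pos), word)


print(naive_stem("Boring"))             # bor
print(toy_lemmatize("Boring", "VERB"))  # bore
print(toy_lemmatize("Boring", "ADJ"))   # boring
```

The same surface word maps to different lemmas depending on its tag, which is why the POS-aware variants later in the article tend to give better results.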

In this article I will examine the six lemmatization approaches listed below. To make a comparison possible, each one is applied to the same example sentence, and the results are compared at the end.

  1. WordNet Lemmatizer with NLTK
  2. WordNet Lemmatizer (with POS tag)
  3. spaCy
  4. TextBlob
  5. TextBlob (with POS tag)
  6. Pattern

1. WordNet Lemmatizer with NLTK

  • It is one of the earliest and most commonly used lemmatization techniques.
  • NLTK offers an interface to it, but you have to install NLTK and download the WordNet data before you can use it.

2. WordNet Lemmatizer (with POS tag)

In the approach above, the WordNet results were not up to the mark: words like ‘coming’ and ‘striped’ remained the same after lemmatization. This is because the lemmatizer treats every word as a noun unless told otherwise, so these words were looked up as nouns rather than verbs. To overcome this, we pass POS (part-of-speech) tags to the lemmatizer.

3. spaCy Lemmatization

spaCy is relatively new in the space and is billed as an industrial-strength NLP engine. It is an open-source Python library that parses and “understands” large volumes of text. Separate models are available for specific languages (English, French, German, etc.).

4. TextBlob Lemmatizer

TextBlob is a powerful, fast and convenient NLP package as well. It is quite straightforward to parse and lemmatize both single words and whole sentences with it.

5. TextBlob Lemmatizer with appropriate POS tag(*)

As with the WordNet approach, lemmatizing without appropriate POS tags has the same limitations here. So, we use one of the more powerful features of the TextBlob module, part-of-speech tagging, to overcome this problem.

6. Pattern

Pattern is a Python package commonly used for web mining, natural language processing, machine learning, and network analysis.

Comparison

  • The comparison table shows how the six methods above handle inflected words.
  • The marks in the table do not hold for every word with every method.
  • It has been prepared for general information purposes only.
  • For example, spaCy could not convert ‘geese’ to ‘goose’ but did convert ‘feet’ to ‘foot’, even though both are irregular plurals.

