What is Lemmatization?

Lemmatization is the process of reducing a word to its base or dictionary form, called its lemma, e.g. 'mice' → 'mouse' and 'went' → 'go'. Throughout this post, the same sample sentence is run through several lemmatizers (a setup sketch for the required packages follows the list below):

sentence = 'went mice gone started best worst well feet universal universe boxes books geese striped coming'
  1. WordNet
  2. WordNet (with POS tag)
  3. spaCy
  4. TextBlob
  5. TextBlob (with POS tag)
  6. Pattern
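All of the snippets below operate on this sample sentence and assume the packages and corpora are already installed. A minimal setup sketch, assuming a pip-based environment (the spaCy model name en_core_web_sm is the one loaded later in this post):

# pip install nltk spacy textblob pattern
# python -m spacy download en_core_web_sm      # English model for spaCy
# python -m textblob.download_corpora          # corpora used by TextBlob

import nltk
nltk.download('punkt')                          # tokenizer models for word_tokenize
nltk.download('wordnet')                        # WordNet data for the lemmatizer
nltk.download('averaged_perceptron_tagger')     # tagger behind nltk.pos_tag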

1. Wordnet Lemmatizer with NLTK

  • It is one of the earliest and most commonly used lemmatization techniques.
  • NLTK offers an interface to it, but you have to download the WordNet data first in order to use it. Follow the instructions below to install it.
import nltk
nltk.download('wordnet')   # WordNet data for the lemmatizer
nltk.download('punkt')     # tokenizer models needed by word_tokenize
from nltk.stem import WordNetLemmatizer


lemmatizer = WordNetLemmatizer()

# Tokenize the sample sentence into words
word_list = nltk.word_tokenize(sentence)
word_list


# Lemmatize each token and join the results back into a string
lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
lemmatized_output
# sentence = 'went mice gone started best worst well feet universal universe boxes books geese striped coming'
# Out[58]: 'went mouse gone started best worst well foot universal universe box book goose striped coming'
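Note that lemmatize() treats every word as a noun unless told otherwise, which is why 'went' and 'coming' came through unchanged. Passing the pos argument explicitly already fixes individual words (a quick check with the lemmatizer created above):

# Default POS is noun, so verb forms are left as they are
lemmatizer.lemmatize('went')             # 'went'

# With pos='v' the verb lemma is returned
lemmatizer.lemmatize('went', pos='v')    # 'go'
lemmatizer.lemmatize('coming', pos='v')  # 'come'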

2. Wordnet Lemmatizer with POS (Part of Speech) tag

In the approach above, we observed that the Wordnet results were not up to the mark. Words like 'coming' and 'striped' remained the same after lemmatization. This is because these words are treated as nouns by default rather than as verbs. To overcome this, we pass POS (Part of Speech) tags to the lemmatizer.

import nltk
nltk.download('averaged_perceptron_tagger')  # tagger used by nltk.pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer


def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)


# 1. Init Lemmatizer
lemmatizer = WordNetLemmatizer()

# 2. Lemmatize each token with the POS tag mapped by get_wordnet_pos()
print([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(sentence)])
# sentence = 'went mice gone started best worst well feet universal universe boxes books geese striped coming'
# Out[58]: ['go', 'mouse', 'go', 'start', 'best', 'bad', 'well', 'foot', 'universal', 'universe', 'box', 'book', 'geese', 'strip', 'come']
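get_wordnet_pos() tags each word in isolation, so every token costs a separate pos_tag call and loses sentence context. A variation that tags the whole tokenized sentence in one pass might look like this (a sketch reusing the imports, lemmatizer and sentence from above):

# Tag the full token list once, then map each Penn Treebank tag to a WordNet POS
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
tag_dict = {"J": wordnet.ADJ, "N": wordnet.NOUN,
            "V": wordnet.VERB, "R": wordnet.ADV}
print(' '.join(lemmatizer.lemmatize(word, tag_dict.get(tag[0], wordnet.NOUN))
               for word, tag in tagged))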

3. spaCy Lemmatization

spaCy is relatively new in this space and is billed as an industrial-strength NLP engine. It is an open-source Python library that parses and “understands” large volumes of text. Separate models are available that cater to specific languages (English, French, German, etc.).

import spacy

# Run 'python -m spacy download en_core_web_sm' in the terminal first to install the English model
nlp = spacy.load("en_core_web_sm")
doc = nlp(sentence)

# Extract the lemma for each token and join
" ".join([token.lemma_ for token in doc])

# sentence = 'went mice gone started best worst well feet universal universe boxes books geese striped coming'
# Out[75]: 'go mouse gone start well worst well foot universal universe box book geese stripe come'
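spaCy assigns POS tags itself while parsing, which is why no extra tagging step was needed above. Inspecting the tokens of the doc object created above makes this visible:

# Each token carries its own POS tag and lemma after the nlp() call
for token in doc:
    print(f"{token.text:10} {token.pos_:6} {token.lemma_}")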

4. TextBlob Lemmatizer

TextBlob is a powerful, fast and convenient NLP package as well. It's quite straightforward to parse and lemmatize words and sentences with it.

from textblob import TextBlob, Word

sent = TextBlob(sentence)
# Lemmatize each word (treated as a noun by default) and join the results
" ".join([w.lemmatize() for w in sent.words])
# sentence = 'went mice gone started best worst well feet universal universe boxes books geese striped coming'
# Out[82]: 'went mouse gone started best worst well foot universal universe box book goose striped coming'
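As with WordNet, Word.lemmatize() defaults to the noun reading, which is why 'went' and 'striped' were left alone here. It accepts a WordNet-style POS letter for individual words (the next section applies the same idea to the whole sentence):

from textblob import Word

Word('went').lemmatize()      # 'went'  (treated as a noun)
Word('went').lemmatize('v')   # 'go'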

5. TextBlob Lemmatizer with appropriate POS tag

As with the Wordnet approach without POS tags, we observe the same limitation here. So we use one of the more powerful aspects of the TextBlob module, 'Part of Speech' tagging, to overcome this problem.

from textblob import TextBlob, Word


def lemmatize_with_postag(sentence):
    sent = TextBlob(sentence)
    tag_dict = {"J": 'a',
                "N": 'n',
                "V": 'v',
                "R": 'r'}
    words_and_tags = [(w, tag_dict.get(pos[0], 'n')) for w, pos in sent.tags]
    lemmatized_list = [wd.lemmatize(tag) for wd, tag in words_and_tags]
    return " ".join(lemmatized_list)

# Lemmatize
lemmatize_with_postag(sentence)
# sentence = 'went mice gone started best worst well feet universal universe boxes books geese striped coming'
# Out[83]: 'go mouse go start best worst well foot universal universe box book geese strip come'

6. Pattern

Pattern is a Python package commonly used for web mining, natural language processing, machine learning, and network analysis.

# pip install pattern

import pattern
from pattern.en import lemma, lexeme

" ".join([lemma(wd) for wd in sentence.split()])
# sentence = 'went mice gone started best worst well feet universal universe boxes books geese striped coming'
# Out[84]: 'go mice go start best worst well feet universal universe box book geese stripe come'
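The import above also pulls in lexeme, which goes the other way and lists the inflected forms of a base word. On some Python 3 / Pattern 3.6 combinations the very first lemma() or lexeme() call raises a generator-related error, so a defensive warm-up call is a commonly used workaround; whether you need it depends on your versions (a hedged sketch):

from pattern.en import lemma, lexeme

# Warm-up: the first call can fail on some Python 3 / Pattern 3.6 installs
try:
    lemma('checking')
except Exception:
    pass

print(lemma('checking'))   # 'check'
print(lexeme('run'))       # inflections of the lemma, e.g. ['run', 'runs', 'running', 'ran']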

Comparison

  • The marks in the comparison table are not valid for every word under each method.
  • It has been prepared for general information purposes only.
  • For example, the spaCy method could not convert 'geese' to 'goose', but it did convert 'feet' to 'foot' (both are irregular plurals). A script to reproduce the comparison follows this list.
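To reproduce the comparison yourself, a short script that runs the sample sentence through the main approaches side by side could look like this (a sketch assuming the packages and models above are installed; Pattern is left out for brevity):

import nltk
import spacy
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from textblob import TextBlob

sentence = ('went mice gone started best worst well feet universal '
            'universe boxes books geese striped coming')

tag_dict = {"J": wordnet.ADJ, "N": wordnet.NOUN,
            "V": wordnet.VERB, "R": wordnet.ADV}
lemmatizer = WordNetLemmatizer()


def wordnet_plain(text):
    # No POS information: every word is treated as a noun
    return ' '.join(lemmatizer.lemmatize(w) for w in nltk.word_tokenize(text))


def wordnet_pos(text):
    # Tag the whole sentence once, then lemmatize with the mapped POS
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return ' '.join(lemmatizer.lemmatize(w, tag_dict.get(t[0], wordnet.NOUN))
                    for w, t in tagged)


def spacy_lemmas(text):
    nlp = spacy.load('en_core_web_sm')
    return ' '.join(tok.lemma_ for tok in nlp(text))


def textblob_pos(text):
    blob = TextBlob(text)
    return ' '.join(w.lemmatize(tag_dict.get(t[0], wordnet.NOUN))
                    for w, t in blob.tags)


for name, fn in [('WordNet', wordnet_plain),
                 ('WordNet + POS', wordnet_pos),
                 ('spaCy', spacy_lemmas),
                 ('TextBlob + POS', textblob_pos)]:
    print(f'{name:15}: {fn(sentence)}')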
