# Python Word Sense Disambiguation

**(C) 2017-2024 by [Damir Cavar](http://damir.cavar.me/)**

**Version:** 1.3, January 2024

**License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))

**Prerequisites:**

In [None]:
!pip install -U nltk

This tutorial relates to a discussion of word sense disambiguation and of various machine learning strategies covered in the textbook [Machine Learning: The Art and Science of Algorithms that Make Sense of Data](https://www.cs.bris.ac.uk/~flach/mlbook/) by [Peter Flach](https://www.cs.bris.ac.uk/~flach/).

This tutorial was developed as part of my course material for the courses Machine Learning and Advanced Natural Language Processing at [Indiana University](https://www.indiana.edu/).

## Word Sense Disambiguation

For a simple Bayesian implementation of a Word Sense Disambiguation algorithm we will use the WordNet NLTK module. We import it in the following way:

In [1]:
from nltk.corpus import wordnet

For a word that we want to disambiguate, we need to get all its synsets:

In [2]:
mySynsets = wordnet.synsets('bank')
print(mySynsets)

[Synset('bank.n.01'), Synset('depository_financial_institution.n.01'), Synset('bank.n.03'), Synset('bank.n.04'), Synset('bank.n.05'), Synset('bank.n.06'), Synset('bank.n.07'), Synset('savings_bank.n.02'), Synset('bank.n.09'), Synset('bank.n.10'), Synset('bank.v.01'), Synset('bank.v.02'), Synset('bank.v.03'), Synset('bank.v.04'), Synset('bank.v.05'), Synset('deposit.v.02'), Synset('bank.v.07'), Synset('trust.v.01')]


For each synset we need to get its definition and the examples to use them as bags of words for a comparison:

In [3]:
for s in mySynsets:
 print(s.name())
 text = " ".join( [s.definition()] + s.examples() )
 print(text, "\n", "-" * 20)

bank.n.01
sloping land (especially the slope beside a body of water) they pulled the canoe up on the bank he sat on the bank of the river and watched the currents 
 --------------------
depository_financial_institution.n.01
a financial institution that accepts deposits and channels the money into lending activities he cashed a check at the bank that bank holds the mortgage on my home 
 --------------------
bank.n.03
a long ridge or pile a huge bank of earth 
 --------------------
bank.n.04
an arrangement of similar objects in a row or in tiers he operated a bank of switches 
 --------------------
bank.n.05
a supply or stock held in reserve for future use (especially in emergencies) 
 --------------------
bank.n.06
the funds held by a gambling house or the dealer in some gambling games he tried to break the bank at Monte Carlo 
 --------------------
bank.n.07
a slope in the turn of a road or track; the outside is higher than the inside in order to reduce the effects of centrifugal force
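
The bag-of-words comparison mentioned above can be sketched without NLTK as a simple overlap count. This is only an illustration under stated assumptions: the two glosses are copied from the output above, the context sentence is hypothetical, and the helper names `bag_of_words` and `overlap_scores` are made up for this sketch. It is not the scoring method of the tutorial itself.

```python
# Glosses (definition plus examples) copied from the output above.
glosses = {
    "bank.n.01": (
        "sloping land (especially the slope beside a body of water) "
        "they pulled the canoe up on the bank "
        "he sat on the bank of the river and watched the currents"
    ),
    "depository_financial_institution.n.01": (
        "a financial institution that accepts deposits and channels the money "
        "into lending activities he cashed a check at the bank "
        "that bank holds the mortgage on my home"
    ),
}


def bag_of_words(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip("()[].,;:!?\"'") for w in text.lower().split())
    return {w for w in words if w}


def overlap_scores(context, glosses):
    """Count the word types shared by the context and each gloss."""
    context_bag = bag_of_words(context)
    return {name: len(context_bag & bag_of_words(text))
            for name, text in glosses.items()}


# Hypothetical context sentence, not taken from the tutorial.
context = "she sat by the river and watched the water"
scores = overlap_scores(context, glosses)
best_sense = max(scores, key=scores.get)

print(scores)
print(best_sense)
```

Function words such as "the" and "and" contribute to both counts here, which is one reason to filter stop words before comparing.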

We will need to flatten a list of lists, that is, to join its sublists into a single list. To achieve this, we can use the following code:

In [4]:
import itertools
lOfl = [["this"], ["is","a"], ["test"]]
print(list(itertools.chain.from_iterable(lOfl)))

['this', 'is', 'a', 'test']


Next, we tokenize and part-of-speech tag the text, that is, the definitions and the examples. We can use NLTK's *word_tokenize* and *pos_tag* functions:

In [5]:
from nltk import word_tokenize, pos_tag

Now we can tokenize and PoS-tag the texts and remove the stop words:

In [6]:
from nltk.corpus import stopwords
stopw = stopwords.words("english")

for s in mySynsets:
 print(s.name())
 text = pos_tag(word_tokenize(s.definition()))
 text += list(itertools.chain.from_iterable([ pos_tag(word_tokenize(x)) for x in s.examples() ]))
 text2 = [ x for x in text if x[0] not in stopw ]
 print(text2, "\n", "-" * 20)



bank.n.01
[('sloping', 'VBG'), ('land', 'NN'), ('(', '('), ('especially', 'RB'), ('slope', 'NN'), ('beside', 'IN'), ('body', 'NN'), ('water', 'NN'), (')', ')'), ('pulled', 'VBD'), ('canoe', 'NN'), ('bank', 'NN'), ('sat', 'VBD'), ('bank', 'NN'), ('river', 'NN'), ('watched', 'VBD'), ('currents', 'NNS')] 
 --------------------
depository_financial_institution.n.01
[('financial', 'JJ'), ('institution', 'NN'), ('accepts', 'VBZ'), ('deposits', 'NNS'), ('channels', 'NNS'), ('money', 'NN'), ('lending', 'NN'), ('activities', 'NNS'), ('cashed', 'VBD'), ('check', 'NN'), ('bank', 'NN'), ('bank', 'NN'), ('holds', 'VBZ'), ('mortgage', 'NN'), ('home', 'NN')] 
 --------------------
bank.n.03
[('long', 'JJ'), ('ridge', 'NN'), ('pile', 'NN'), ('huge', 'JJ'), ('bank', 'NN'), ('earth', 'NN')] 
 --------------------
bank.n.04
[('arrangement', 'NN'), ('similar', 'JJ'), ('objects', 'NNS'), ('row', 'NN'), ('tiers', 'NNS'), ('operated', 'VBD'), ('bank', 'NN'), ('switches', 'NNS')] 
 --------------------
bank.n
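
The filter above compares each token to the stop word list exactly as written, so a capitalized token such as "He" would pass through, and punctuation tokens are kept. A minimal sketch of a slightly stricter filter follows. The stop word set here is a tiny hypothetical stand-in for NLTK's English list, the tags are illustrative, and `content_tokens` is a name invented for this example.

```python
# Hypothetical, tiny stop word set; the tutorial uses NLTK's English list.
stop_words = {"a", "the", "of", "on", "he", "and", "up", "they"}

# (token, tag) pairs in the shape returned by pos_tag; tags are illustrative.
tagged = [("He", "PRP"), ("sat", "VBD"), ("on", "IN"), ("the", "DT"),
          ("bank", "NN"), ("of", "IN"), ("the", "DT"), ("river", "NN"),
          (".", ".")]


def content_tokens(tagged_tokens, stop_words):
    """Keep pairs whose token is alphanumeric and not a stop word.

    Note: str.isalnum() also drops tokens such as "'s" or hyphenated
    words, which may or may not be desirable.
    """
    return [(tok, tag) for tok, tag in tagged_tokens
            if tok.isalnum() and tok.lower() not in stop_words]


filtered = content_tokens(tagged, stop_words)
print(filtered)
```

Using a set instead of a list for the stop words also makes each membership test constant time on average.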

In [7]:
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()

wordnet_lemmatizer.lemmatize('dogs')

'dog'

Given a text that contains the word we want to disambiguate, the first step is to find the position of that word in the token list.

In [8]:
example = "John saw the dogs barking at the cats."
keyword = "dog"
tokens = word_tokenize(example)
lemmas = [ wordnet_lemmatizer.lemmatize(x) for x in tokens ]
pos = -1

try:
 pos = lemmas.index(keyword)
except ValueError:
 pass

print("Position:", pos)
print(lemmas)

Position: 3
['John', 'saw', 'the', 'dog', 'barking', 'at', 'the', 'cat', '.']
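`list.index` returns only the first match, so a keyword that occurs several times in a text yields a single position. A small sketch of a helper that collects every position, ignoring case, is shown below. The name `find_positions` is hypothetical, and the lemma list is copied from the output above.

```python
# Lemma list copied from the output above.
lemmas = ['John', 'saw', 'the', 'dog', 'barking', 'at', 'the', 'cat', '.']


def find_positions(lemmas, keyword):
    """Return the indices of all lemmas that match the keyword, ignoring case."""
    target = keyword.lower()
    return [i for i, lemma in enumerate(lemmas) if lemma.lower() == target]


print(find_positions(lemmas, "dog"))
print(find_positions(lemmas, "the"))
print(find_positions(lemmas, "bank"))
```

An empty list plays the role of the `-1` sentinel used above when the keyword is absent.
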


In [9]:
posTokens = pos_tag(tokens)

print("Lemma:", lemmas[pos])
print(" PoS:", posTokens[pos])
print(" Tag:", posTokens[pos][1])
print(" MTag:", posTokens[pos][1][0])

Lemma: dog
 PoS: ('dogs', 'NNS')
 Tag: NNS
 MTag: N


In [10]:
category = posTokens[pos][1][0]

print(category)

N


In [11]:
wType = None
if category == 'N':
 wType = wordnet.NOUN
elif category == 'V':
 wType = wordnet.VERB
elif category == 'J':
 wType = wordnet.ADJ
elif category == 'R':
 wType = wordnet.ADV

print("Type:", wType)

Type: n
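
The same mapping from the first letter of a Penn Treebank tag to a WordNet part of speech can be written as a dictionary lookup. In this sketch the string values are the ones behind `wordnet.NOUN`, `wordnet.VERB`, `wordnet.ADJ`, and `wordnet.ADV`, written out literally so that the example runs without NLTK. The names `TAG_TO_WORDNET` and `wordnet_pos` are made up for illustration.

```python
# First letter of a Penn Treebank tag -> WordNet part-of-speech code.
# The values equal wordnet.NOUN, wordnet.VERB, wordnet.ADJ, and wordnet.ADV.
TAG_TO_WORDNET = {"N": "n", "V": "v", "J": "a", "R": "r"}


def wordnet_pos(penn_tag):
    """Return the WordNet PoS code for a Penn tag, or None if unmapped."""
    # Slicing with [:1] avoids an IndexError on an empty tag string.
    return TAG_TO_WORDNET.get(penn_tag[:1])


for tag in ["NNS", "VBD", "JJ", "RB", "DT"]:
    print(tag, "->", wordnet_pos(tag))
```

As in the `if`/`elif` chain above, every tag starting with `R` is treated as an adverb, and tags outside the four classes map to `None`.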


In [12]:
wordnet.synsets(keyword, pos=wType)

[Synset('dog.n.01'),
 Synset('frump.n.01'),
 Synset('dog.n.03'),
 Synset('cad.n.01'),
 Synset('frank.n.02'),
 Synset('pawl.n.01'),
 Synset('andiron.n.01')]
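
With the candidate senses in hand, one common way to make the comparison Bayesian is a naive Bayes score: a prior over senses multiplied by the smoothed probability of each context word under the sense's gloss. The sketch below is an assumption about how such a score could look, not the tutorial's own implementation. The token lists are a hand-picked selection of content words from the WordNet glosses of the first two senses, the context is hypothetical, and `log_joint_scores` is an invented name.

```python
import math
from collections import Counter

# Content words from the WordNet definitions and examples of two senses
# (hand-picked, lowercased selection for illustration only).
sense_tokens = {
    "dog.n.01": ["member", "genus", "canis", "descended", "common", "wolf",
                 "domesticated", "man", "prehistoric", "times", "breeds",
                 "dog", "barked", "night"],
    "frump.n.01": ["dull", "unattractive", "unpleasant", "girl", "woman",
                   "reputation", "frump", "real", "dog"],
}


def log_joint_scores(context, sense_tokens, alpha=1.0):
    """Return log P(sense) + sum of log P(word | sense) for each sense.

    Assumes a uniform prior over senses and add-alpha (Laplace) smoothing
    over the joint vocabulary of all glosses and the context.
    """
    vocab = {w for toks in sense_tokens.values() for w in toks} | set(context)
    log_prior = math.log(1.0 / len(sense_tokens))
    scores = {}
    for sense, toks in sense_tokens.items():
        counts = Counter(toks)
        denom = len(toks) + alpha * len(vocab)
        scores[sense] = log_prior + sum(
            math.log((counts[w] + alpha) / denom) for w in context)
    return scores


# Hypothetical context words, already lowercased and stop word filtered.
context = ["wolf", "barked", "night"]
scores = log_joint_scores(context, sense_tokens)
best = max(scores, key=scores.get)

print(scores)
print(best)
```

The scores are unnormalized log posteriors, so only their ranking is meaningful. With smoothing, a short gloss assigns unseen words a higher probability than a long one, which can bias the result when the context shares no words with any gloss.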

In [13]:
for s in wordnet.synsets(keyword, pos=wType):
 print(s.name())
 text = pos_tag(word_tokenize(s.definition()))
 text += list(itertools.chain.from_iterable([ pos_tag(word_tokenize(x)) for x in s.examples() ]))
 print(text, "\n", "-" * 20)

dog.n.01
[('a', 'DT'), ('member', 'NN'), ('of', 'IN'), ('the', 'DT'), ('genus', 'NN'), ('Canis', 'NNP'), ('(', '('), ('probably', 'RB'), ('descended', 'VBN'), ('from', 'IN'), ('the', 'DT'), ('common', 'JJ'), ('wolf', 'NN'), (')', ')'), ('that', 'WDT'), ('has', 'VBZ'), ('been', 'VBN'), ('domesticated', 'VBN'), ('by', 'IN'), ('man', 'NN'), ('since', 'IN'), ('prehistoric', 'JJ'), ('times', 'NNS'), (';', ':'), ('occurs', 'VBZ'), ('in', 'IN'), ('many', 'JJ'), ('breeds', 'NNS'), ('the', 'DT'), ('dog', 'NN'), ('barked', 'VBD'), ('all', 'DT'), ('night', 'NN')] 
 --------------------
frump.n.01
[('a', 'DT'), ('dull', 'JJ'), ('unattractive', 'JJ'), ('unpleasant', 'JJ'), ('girl', 'NN'), ('or', 'CC'), ('woman', 'NN'), ('she', 'PRP'), ('got', 'VBD'), ('a', 'DT'), ('reputation', 'NN'), ('as', 'IN'), ('a', 'DT'), ('frump', 'NN'), ('she', 'PRP'), ("'s", 'VBZ'), ('a', 'DT'), ('real', 'JJ'), ('dog', 'NN')] 
 --------------------
dog.n.03
[('informal', 'JJ'), ('term', 'NN'), ('for', 'IN'), ('a', 'DT'), (