{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Python Word Sense Disambiguation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**(C) 2017-2024 by [Damir Cavar](http://damir.cavar.me/)**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Version:** 1.3, January 2024" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CA BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Prerequisites:**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -U nltk" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a tutorial related to the discussion of a WordSense disambiguation and various machine learning strategies discussed in the textbook [Machine Learning: The Art and Science of Algorithms that Make Sense of Data](https://www.cs.bris.ac.uk/~flach/mlbook/) by [Peter Flach](https://www.cs.bris.ac.uk/~flach/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial was developed as part of my course material for the courses Machine Learning and Advanced Natural Language Processing in the at [Indiana University](https://www.indiana.edu/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Word Sense Disambiguation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a simple Bayesian implementation of a Word Sense Disambiguation algorithm we will use the WordNet NLTK module. We import it in the following way:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from nltk.corpus import wordnet" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a word that we want to disambiguate, we need to get all its synsets:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[Synset('bank.n.01'), Synset('depository_financial_institution.n.01'), Synset('bank.n.03'), Synset('bank.n.04'), Synset('bank.n.05'), Synset('bank.n.06'), Synset('bank.n.07'), Synset('savings_bank.n.02'), Synset('bank.n.09'), Synset('bank.n.10'), Synset('bank.v.01'), Synset('bank.v.02'), Synset('bank.v.03'), Synset('bank.v.04'), Synset('bank.v.05'), Synset('deposit.v.02'), Synset('bank.v.07'), Synset('trust.v.01')]\n" ] } ], "source": [ "mySynsets = wordnet.synsets('bank')\n", "print(mySynsets)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For each synset we need to get its definition and the examples to use them as bags of words for a comparison:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bank.n.01\n", "sloping land (especially the slope beside a body of water) they pulled the canoe up on the bank he sat on the bank of the river and watched the currents \n", " --------------------\n", "depository_financial_institution.n.01\n", "a financial institution that accepts deposits and channels the money into lending activities he cashed a check at the bank that bank holds the mortgage on my home \n", " --------------------\n", "bank.n.03\n", "a long ridge or pile a huge bank of earth \n", " --------------------\n", "bank.n.04\n", "an arrangement of similar objects in a row or in tiers he operated a bank of switches \n", " 
--------------------\n", "bank.n.05\n", "a supply or stock held in reserve for future use (especially in emergencies) \n", " --------------------\n", "bank.n.06\n", "the funds held by a gambling house or the dealer in some gambling games he tried to break the bank at Monte Carlo \n", " --------------------\n", "bank.n.07\n", "a slope in the turn of a road or track; the outside is higher than the inside in order to reduce the effects of centrifugal force \n", " --------------------\n", "savings_bank.n.02\n", "a container (usually with a slot in the top) for keeping money at home the coin bank was empty \n", " --------------------\n", "bank.n.09\n", "a building in which the business of banking transacted the bank is on the corner of Nassau and Witherspoon \n", " --------------------\n", "bank.n.10\n", "a flight maneuver; aircraft tips laterally about its longitudinal axis (especially in turning) the plane went into a steep bank \n", " --------------------\n", "bank.v.01\n", "tip laterally the pilot had to bank the aircraft \n", " --------------------\n", "bank.v.02\n", "enclose with a bank bank roads \n", " --------------------\n", "bank.v.03\n", "do business with a bank or keep an account at a bank Where do you bank in this town? \n", " --------------------\n", "bank.v.04\n", "act as the banker in a game or in gambling \n", " --------------------\n", "bank.v.05\n", "be in the banking business \n", " --------------------\n", "deposit.v.02\n", "put into a bank account She deposits her paycheck every month \n", " --------------------\n", "bank.v.07\n", "cover with ashes so to control the rate of burning bank a fire \n", " --------------------\n", "trust.v.01\n", "have confidence or faith in We can trust in God Rely on your friends bank on your good education I swear by my grandmother's recipes \n", " --------------------\n" ] } ], "source": [ "for s in mySynsets:\n", " print(s.name())\n", " text = \" \".join( [s.definition()] + s.examples() )\n", " print(text, \"\\n\", \"-\" * 20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will need to join a list of lists into one list, that is, we need to flatten a list of lists. To achive this, we can use the following code:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['this', 'is', 'a', 'test']\n" ] } ], "source": [ "import itertools\n", "lOfl = [[\"this\"], [\"is\",\"a\"], [\"test\"]]\n", "print(list(itertools.chain.from_iterable(lOfl)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What we should do is to tokenize and part-of-speech tag the text, that is the descriptions and the examples. 
We can use NLTK's *word_tokenize* and *pos_tag* modules:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from nltk import word_tokenize, pos_tag" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can tokenize and PoS-tag the texts:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bank.n.01\n", "[('sloping', 'VBG'), ('land', 'NN'), ('(', '('), ('especially', 'RB'), ('slope', 'NN'), ('beside', 'IN'), ('body', 'NN'), ('water', 'NN'), (')', ')'), ('pulled', 'VBD'), ('canoe', 'NN'), ('bank', 'NN'), ('sat', 'VBD'), ('bank', 'NN'), ('river', 'NN'), ('watched', 'VBD'), ('currents', 'NNS')] \n", " --------------------\n", "depository_financial_institution.n.01\n", "[('financial', 'JJ'), ('institution', 'NN'), ('accepts', 'VBZ'), ('deposits', 'NNS'), ('channels', 'NNS'), ('money', 'NN'), ('lending', 'NN'), ('activities', 'NNS'), ('cashed', 'VBD'), ('check', 'NN'), ('bank', 'NN'), ('bank', 'NN'), ('holds', 'VBZ'), ('mortgage', 'NN'), ('home', 'NN')] \n", " --------------------\n", "bank.n.03\n", "[('long', 'JJ'), ('ridge', 'NN'), ('pile', 'NN'), ('huge', 'JJ'), ('bank', 'NN'), ('earth', 'NN')] \n", " --------------------\n", "bank.n.04\n", "[('arrangement', 'NN'), ('similar', 'JJ'), ('objects', 'NNS'), ('row', 'NN'), ('tiers', 'NNS'), ('operated', 'VBD'), ('bank', 'NN'), ('switches', 'NNS')] \n", " --------------------\n", "bank.n.05\n", "[('supply', 'NN'), ('stock', 'NN'), ('held', 'VBN'), ('reserve', 'NN'), ('future', 'JJ'), ('use', 'NN'), ('(', '('), ('especially', 'RB'), ('emergencies', 'NNS'), (')', ')')] \n", " --------------------\n", "bank.n.06\n", "[('funds', 'NNS'), ('held', 'VBN'), ('gambling', 'NN'), ('house', 'NN'), ('dealer', 'NN'), ('gambling', 'NN'), ('games', 'NNS'), ('tried', 'VBD'), ('break', 'VB'), ('bank', 'NN'), ('Monte', 'NNP'), ('Carlo', 'NNP')] \n", " --------------------\n", "bank.n.07\n", "[('slope', 'NN'), ('turn', 'NN'), ('road', 'NN'), ('track', 'NN'), (';', ':'), ('outside', 'NN'), ('higher', 'JJR'), ('inside', 'NN'), ('order', 'NN'), ('reduce', 'VB'), ('effects', 'NNS'), ('centrifugal', 'JJ'), ('force', 'NN')] \n", " --------------------\n", "savings_bank.n.02\n", "[('container', 'NN'), ('(', '('), ('usually', 'RB'), ('slot', 'NN'), ('top', 'NN'), (')', ')'), ('keeping', 'VBG'), ('money', 'NN'), ('home', 'NN'), ('coin', 'NN'), ('bank', 'NN'), ('empty', 'JJ')] \n", " --------------------\n", "bank.n.09\n", "[('building', 'NN'), ('business', 'NN'), ('banking', 'NN'), ('transacted', 'VBN'), ('bank', 'NN'), ('corner', 'NN'), ('Nassau', 'NNP'), ('Witherspoon', 'NNP')] \n", " --------------------\n", "bank.n.10\n", "[('flight', 'NN'), ('maneuver', 'NN'), (';', ':'), ('aircraft', 'CC'), ('tips', 'NNS'), ('laterally', 'RB'), ('longitudinal', 'JJ'), ('axis', 'NN'), ('(', '('), ('especially', 'RB'), ('turning', 'VBG'), (')', ')'), ('plane', 'NN'), ('went', 'VBD'), ('steep', 'JJ'), ('bank', 'NN')] \n", " --------------------\n", "bank.v.01\n", "[('tip', 'NN'), ('laterally', 'RB'), ('pilot', 'NN'), ('bank', 'NN'), ('aircraft', 'NN')] \n", " --------------------\n", "bank.v.02\n", "[('enclose', 'RB'), ('bank', 'NN'), ('bank', 'NN'), ('roads', 'NNS')] \n", " --------------------\n", "bank.v.03\n", "[('business', 'NN'), ('bank', 'NN'), ('keep', 'VB'), ('account', 'NN'), ('bank', 'NN'), ('Where', 'WRB'), ('bank', 'NN'), ('town', 'NN'), ('?', '.')] \n", " --------------------\n", "bank.v.04\n", "[('act', 'NN'), ('banker', 'NN'), 
('game', 'NN'), ('gambling', 'VBG')] \n", " --------------------\n", "bank.v.05\n", "[('banking', 'NN'), ('business', 'NN')] \n", " --------------------\n", "deposit.v.02\n", "[('put', 'VBN'), ('bank', 'NN'), ('account', 'NN'), ('She', 'PRP'), ('deposits', 'VBZ'), ('paycheck', 'NN'), ('every', 'DT'), ('month', 'NN')] \n", " --------------------\n", "bank.v.07\n", "[('cover', 'NN'), ('ashes', 'NNS'), ('control', 'VB'), ('rate', 'NN'), ('burning', 'NN'), ('bank', 'NN'), ('fire', 'NN')] \n", " --------------------\n", "trust.v.01\n", "[('confidence', 'NN'), ('faith', 'NN'), ('We', 'PRP'), ('trust', 'VB'), ('God', 'NNP'), ('Rely', 'RB'), ('friends', 'NNS'), ('bank', 'NN'), ('good', 'JJ'), ('education', 'NN'), ('I', 'PRP'), ('swear', 'VBP'), ('grandmother', 'NN'), (\"'s\", 'POS'), ('recipes', 'NNS')] \n", " --------------------\n" ] } ], "source": [ "from nltk.corpus import stopwords\n", "stopw = stopwords.words(\"english\")\n", "\n", "for s in mySynsets:\n", " print(s.name())\n", " text = pos_tag(word_tokenize(s.definition()))\n", " text += list(itertools.chain.from_iterable([ pos_tag(word_tokenize(x)) for x in s.examples() ]))\n", " text2 = [ x for x in text if x[0] not in stopw ]\n", " print(text2, \"\\n\", \"-\" * 20)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To match the keyword against the tokens of a sentence, we map inflected word forms to their base form using the WordNet lemmatizer:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'dog'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from nltk.stem import WordNetLemmatizer\n", "\n", "wordnet_lemmatizer = WordNetLemmatizer()\n", "\n", "wordnet_lemmatizer.lemmatize('dogs')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given a text that contains the word we want to disambiguate, the first step is to find its position in the token list."
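] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cells below work through the individual steps: locating the keyword in the tokenized sentence, mapping its PoS tag to a WordNet category, and collecting the candidate synsets. As a preview, here is a minimal sketch of the core comparison we are building towards: a simplified Lesk-style overlap that scores each synset of the keyword by how many content words its definition and examples share with the sentence. The helper names *bag* and *disambiguate* and the raw overlap score are our own illustrative choices, not an NLTK API:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from nltk.corpus import wordnet, stopwords\n", "from nltk import word_tokenize\n", "\n", "stop_words = set(stopwords.words(\"english\"))\n", "\n", "def bag(text):\n", "    # lower-cased content words of a text, without stop words and punctuation\n", "    return set(w.lower() for w in word_tokenize(text)\n", "               if w.isalpha() and w.lower() not in stop_words)\n", "\n", "def disambiguate(sentence, keyword, pos=None):\n", "    # score each synset of the keyword by the overlap of its definition and\n", "    # examples with the sentence context (a simplified Lesk approach)\n", "    context = bag(sentence)\n", "    candidates = wordnet.synsets(keyword, pos=pos)\n", "    if not candidates:\n", "        return None\n", "    return max(candidates,\n", "               key=lambda s: len(context & bag(\" \".join([s.definition()] + s.examples()))))\n", "\n", "# the river context should favor the 'sloping land' sense of 'bank'\n", "print(disambiguate(\"She sat on the bank of the river and watched the water.\", \"bank\", wordnet.NOUN))"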
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Position: 3\n", "['John', 'saw', 'the', 'dog', 'barking', 'at', 'the', 'cat', '.']\n" ] } ], "source": [ "example = \"John saw the dogs barking at the cats.\"\n", "keyword = \"dog\"\n", "tokens = word_tokenize(example)\n", "lemmas = [ wordnet_lemmatizer.lemmatize(x) for x in tokens ]\n", "pos = -1\n", "\n", "try:\n", " pos = lemmas.index(keyword)\n", "except ValueError:\n", " pass\n", "\n", "print(\"Position:\", pos)\n", "print(lemmas)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Lemma: dog\n", " PoS: ('dogs', 'NNS')\n", " Tag: NNS\n", " MTag: N\n" ] } ], "source": [ "posTokens = pos_tag(tokens)\n", "\n", "print(\"Lemma:\", lemmas[pos])\n", "print(\" PoS:\", posTokens[pos])\n", "print(\" Tag:\", posTokens[pos][1])\n", "print(\" MTag:\", posTokens[pos][1][0])" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "N\n" ] } ], "source": [ "category = posTokens[pos][1][0]\n", "\n", "print(category)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Type: n\n" ] } ], "source": [ "wType = None\n", "if category == 'N':\n", " wType = wordnet.NOUN\n", "elif category == 'V':\n", " wType = wordnet.VERB\n", "elif category == 'J':\n", " wType = wordnet.ADJ\n", "elif category == 'R':\n", " wType = wordnet.ADV\n", "\n", "print(\"Type:\", wType)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Synset('dog.n.01'),\n", " Synset('frump.n.01'),\n", " Synset('dog.n.03'),\n", " Synset('cad.n.01'),\n", " Synset('frank.n.02'),\n", " Synset('pawl.n.01'),\n", " Synset('andiron.n.01')]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wordnet.synsets(keyword, pos=wType)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "dog.n.01\n", "[('a', 'DT'), ('member', 'NN'), ('of', 'IN'), ('the', 'DT'), ('genus', 'NN'), ('Canis', 'NNP'), ('(', '('), ('probably', 'RB'), ('descended', 'VBN'), ('from', 'IN'), ('the', 'DT'), ('common', 'JJ'), ('wolf', 'NN'), (')', ')'), ('that', 'WDT'), ('has', 'VBZ'), ('been', 'VBN'), ('domesticated', 'VBN'), ('by', 'IN'), ('man', 'NN'), ('since', 'IN'), ('prehistoric', 'JJ'), ('times', 'NNS'), (';', ':'), ('occurs', 'VBZ'), ('in', 'IN'), ('many', 'JJ'), ('breeds', 'NNS'), ('the', 'DT'), ('dog', 'NN'), ('barked', 'VBD'), ('all', 'DT'), ('night', 'NN')] \n", " --------------------\n", "frump.n.01\n", "[('a', 'DT'), ('dull', 'JJ'), ('unattractive', 'JJ'), ('unpleasant', 'JJ'), ('girl', 'NN'), ('or', 'CC'), ('woman', 'NN'), ('she', 'PRP'), ('got', 'VBD'), ('a', 'DT'), ('reputation', 'NN'), ('as', 'IN'), ('a', 'DT'), ('frump', 'NN'), ('she', 'PRP'), (\"'s\", 'VBZ'), ('a', 'DT'), ('real', 'JJ'), ('dog', 'NN')] \n", " --------------------\n", "dog.n.03\n", "[('informal', 'JJ'), ('term', 'NN'), ('for', 'IN'), ('a', 'DT'), ('man', 'NN'), ('you', 'PRP'), ('lucky', 'VBP'), ('dog', 'VB')] \n", " --------------------\n", "cad.n.01\n", "[('someone', 'NN'), ('who', 'WP'), ('is', 'VBZ'), ('morally', 'RB'), ('reprehensible', 'JJ'), ('you', 'PRP'), ('dirty', 'VBP'), ('dog', 'VB')] \n", " --------------------\n", "frank.n.02\n", "[('a', 
'DT'), ('smooth-textured', 'JJ'), ('sausage', 'NN'), ('of', 'IN'), ('minced', 'JJ'), ('beef', 'NN'), ('or', 'CC'), ('pork', 'NN'), ('usually', 'RB'), ('smoked', 'VBD'), (';', ':'), ('often', 'RB'), ('served', 'VBD'), ('on', 'IN'), ('a', 'DT'), ('bread', 'NN'), ('roll', 'NN')] \n", " --------------------\n", "pawl.n.01\n", "[('a', 'DT'), ('hinged', 'JJ'), ('catch', 'NN'), ('that', 'IN'), ('fits', 'VBZ'), ('into', 'IN'), ('a', 'DT'), ('notch', 'NN'), ('of', 'IN'), ('a', 'DT'), ('ratchet', 'NN'), ('to', 'TO'), ('move', 'VB'), ('a', 'DT'), ('wheel', 'NN'), ('forward', 'RB'), ('or', 'CC'), ('prevent', 'VB'), ('it', 'PRP'), ('from', 'IN'), ('moving', 'VBG'), ('backward', 'NN')] \n", " --------------------\n", "andiron.n.01\n", "[('metal', 'NN'), ('supports', 'NNS'), ('for', 'IN'), ('logs', 'NNS'), ('in', 'IN'), ('a', 'DT'), ('fireplace', 'NN'), ('the', 'DT'), ('andirons', 'NNS'), ('were', 'VBD'), ('too', 'RB'), ('hot', 'JJ'), ('to', 'TO'), ('touch', 'VB')] \n", " --------------------\n" ] } ], "source": [ "for s in wordnet.synsets(keyword, pos=wType):\n", " print(s.name())\n", " text = pos_tag(word_tokenize(s.definition()))\n", " text += list(itertools.chain.from_iterable([ pos_tag(word_tokenize(x)) for x in s.examples() ]))\n", " print(text, \"\\n\", \"-\" * 20)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.1" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "latex_metadata": { "affiliation": "Indiana University, Department of Linguistics, Bloomington, IN, USA", "author": "Damir Cavar", "title": "Python Word Sense Disambiguation" } }, "nbformat": 4, "nbformat_minor": 4 }