{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Python Word Sense Disambiguation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**(C) 2017-2024 by [Damir Cavar](http://damir.cavar.me/)**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Version:** 1.3, January 2024" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CA BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Prerequisites:**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -U nltk" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a tutorial related to the discussion of a WordSense disambiguation and various machine learning strategies discussed in the textbook [Machine Learning: The Art and Science of Algorithms that Make Sense of Data](https://www.cs.bris.ac.uk/~flach/mlbook/) by [Peter Flach](https://www.cs.bris.ac.uk/~flach/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial was developed as part of my course material for the courses Machine Learning and Advanced Natural Language Processing in the at [Indiana University](https://www.indiana.edu/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Word Sense Disambiguation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a simple Bayesian implementation of a Word Sense Disambiguation algorithm we will use the WordNet NLTK module. We import it in the following way:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from nltk.corpus import wordnet" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a word that we want to disambiguate, we need to get all its synsets:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[Synset('bank.n.01'), Synset('depository_financial_institution.n.01'), Synset('bank.n.03'), Synset('bank.n.04'), Synset('bank.n.05'), Synset('bank.n.06'), Synset('bank.n.07'), Synset('savings_bank.n.02'), Synset('bank.n.09'), Synset('bank.n.10'), Synset('bank.v.01'), Synset('bank.v.02'), Synset('bank.v.03'), Synset('bank.v.04'), Synset('bank.v.05'), Synset('deposit.v.02'), Synset('bank.v.07'), Synset('trust.v.01')]\n" ] } ], "source": [ "mySynsets = wordnet.synsets('bank')\n", "print(mySynsets)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For each synset we need to get its definition and the examples to use them as bags of words for a comparison:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bank.n.01\n", "sloping land (especially the slope beside a body of water) they pulled the canoe up on the bank he sat on the bank of the river and watched the currents \n", " --------------------\n", "depository_financial_institution.n.01\n", "a financial institution that accepts deposits and channels the money into lending activities he cashed a check at the bank that bank holds the mortgage on my home \n", " --------------------\n", "bank.n.03\n", "a long ridge or pile a huge bank of earth \n", " --------------------\n", "bank.n.04\n", "an arrangement of similar objects in a row or in tiers he operated a bank of switches \n", " 
--------------------\n", "bank.n.05\n", "a supply or stock held in reserve for future use (especially in emergencies) \n", " --------------------\n", "bank.n.06\n", "the funds held by a gambling house or the dealer in some gambling games he tried to break the bank at Monte Carlo \n", " --------------------\n", "bank.n.07\n", "a slope in the turn of a road or track; the outside is higher than the inside in order to reduce the effects of centrifugal force \n", " --------------------\n", "savings_bank.n.02\n", "a container (usually with a slot in the top) for keeping money at home the coin bank was empty \n", " --------------------\n", "bank.n.09\n", "a building in which the business of banking transacted the bank is on the corner of Nassau and Witherspoon \n", " --------------------\n", "bank.n.10\n", "a flight maneuver; aircraft tips laterally about its longitudinal axis (especially in turning) the plane went into a steep bank \n", " --------------------\n", "bank.v.01\n", "tip laterally the pilot had to bank the aircraft \n", " --------------------\n", "bank.v.02\n", "enclose with a bank bank roads \n", " --------------------\n", "bank.v.03\n", "do business with a bank or keep an account at a bank Where do you bank in this town? \n", " --------------------\n", "bank.v.04\n", "act as the banker in a game or in gambling \n", " --------------------\n", "bank.v.05\n", "be in the banking business \n", " --------------------\n", "deposit.v.02\n", "put into a bank account She deposits her paycheck every month \n", " --------------------\n", "bank.v.07\n", "cover with ashes so to control the rate of burning bank a fire \n", " --------------------\n", "trust.v.01\n", "have confidence or faith in We can trust in God Rely on your friends bank on your good education I swear by my grandmother's recipes \n", " --------------------\n" ] } ], "source": [ "for s in mySynsets:\n", " print(s.name())\n", " text = \" \".join( [s.definition()] + s.examples() )\n", " print(text, \"\\n\", \"-\" * 20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will need to join a list of lists into one list, that is, we need to flatten a list of lists. To achive this, we can use the following code:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['this', 'is', 'a', 'test']\n" ] } ], "source": [ "import itertools\n", "lOfl = [[\"this\"], [\"is\",\"a\"], [\"test\"]]\n", "print(list(itertools.chain.from_iterable(lOfl)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What we should do is to tokenize and part-of-speech tag the text, that is the descriptions and the examples. 
We can use NLTK's *word_tokenize* and *pos_tag* modules:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from nltk import word_tokenize, pos_tag" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can tokenize and PoS-tag the texts:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bank.n.01\n", "[('sloping', 'VBG'), ('land', 'NN'), ('(', '('), ('especially', 'RB'), ('slope', 'NN'), ('beside', 'IN'), ('body', 'NN'), ('water', 'NN'), (')', ')'), ('pulled', 'VBD'), ('canoe', 'NN'), ('bank', 'NN'), ('sat', 'VBD'), ('bank', 'NN'), ('river', 'NN'), ('watched', 'VBD'), ('currents', 'NNS')] \n", " --------------------\n", "depository_financial_institution.n.01\n", "[('financial', 'JJ'), ('institution', 'NN'), ('accepts', 'VBZ'), ('deposits', 'NNS'), ('channels', 'NNS'), ('money', 'NN'), ('lending', 'NN'), ('activities', 'NNS'), ('cashed', 'VBD'), ('check', 'NN'), ('bank', 'NN'), ('bank', 'NN'), ('holds', 'VBZ'), ('mortgage', 'NN'), ('home', 'NN')] \n", " --------------------\n", "bank.n.03\n", "[('long', 'JJ'), ('ridge', 'NN'), ('pile', 'NN'), ('huge', 'JJ'), ('bank', 'NN'), ('earth', 'NN')] \n", " --------------------\n", "bank.n.04\n", "[('arrangement', 'NN'), ('similar', 'JJ'), ('objects', 'NNS'), ('row', 'NN'), ('tiers', 'NNS'), ('operated', 'VBD'), ('bank', 'NN'), ('switches', 'NNS')] \n", " --------------------\n", "bank.n.05\n", "[('supply', 'NN'), ('stock', 'NN'), ('held', 'VBN'), ('reserve', 'NN'), ('future', 'JJ'), ('use', 'NN'), ('(', '('), ('especially', 'RB'), ('emergencies', 'NNS'), (')', ')')] \n", " --------------------\n", "bank.n.06\n", "[('funds', 'NNS'), ('held', 'VBN'), ('gambling', 'NN'), ('house', 'NN'), ('dealer', 'NN'), ('gambling', 'NN'), ('games', 'NNS'), ('tried', 'VBD'), ('break', 'VB'), ('bank', 'NN'), ('Monte', 'NNP'), ('Carlo', 'NNP')] \n", " --------------------\n", "bank.n.07\n", "[('slope', 'NN'), ('turn', 'NN'), ('road', 'NN'), ('track', 'NN'), (';', ':'), ('outside', 'NN'), ('higher', 'JJR'), ('inside', 'NN'), ('order', 'NN'), ('reduce', 'VB'), ('effects', 'NNS'), ('centrifugal', 'JJ'), ('force', 'NN')] \n", " --------------------\n", "savings_bank.n.02\n", "[('container', 'NN'), ('(', '('), ('usually', 'RB'), ('slot', 'NN'), ('top', 'NN'), (')', ')'), ('keeping', 'VBG'), ('money', 'NN'), ('home', 'NN'), ('coin', 'NN'), ('bank', 'NN'), ('empty', 'JJ')] \n", " --------------------\n", "bank.n.09\n", "[('building', 'NN'), ('business', 'NN'), ('banking', 'NN'), ('transacted', 'VBN'), ('bank', 'NN'), ('corner', 'NN'), ('Nassau', 'NNP'), ('Witherspoon', 'NNP')] \n", " --------------------\n", "bank.n.10\n", "[('flight', 'NN'), ('maneuver', 'NN'), (';', ':'), ('aircraft', 'CC'), ('tips', 'NNS'), ('laterally', 'RB'), ('longitudinal', 'JJ'), ('axis', 'NN'), ('(', '('), ('especially', 'RB'), ('turning', 'VBG'), (')', ')'), ('plane', 'NN'), ('went', 'VBD'), ('steep', 'JJ'), ('bank', 'NN')] \n", " --------------------\n", "bank.v.01\n", "[('tip', 'NN'), ('laterally', 'RB'), ('pilot', 'NN'), ('bank', 'NN'), ('aircraft', 'NN')] \n", " --------------------\n", "bank.v.02\n", "[('enclose', 'RB'), ('bank', 'NN'), ('bank', 'NN'), ('roads', 'NNS')] \n", " --------------------\n", "bank.v.03\n", "[('business', 'NN'), ('bank', 'NN'), ('keep', 'VB'), ('account', 'NN'), ('bank', 'NN'), ('Where', 'WRB'), ('bank', 'NN'), ('town', 'NN'), ('?', '.')] \n", " --------------------\n", "bank.v.04\n", "[('act', 'NN'), ('banker', 'NN'), 
('game', 'NN'), ('gambling', 'VBG')] \n", " --------------------\n", "bank.v.05\n", "[('banking', 'NN'), ('business', 'NN')] \n", " --------------------\n", "deposit.v.02\n", "[('put', 'VBN'), ('bank', 'NN'), ('account', 'NN'), ('She', 'PRP'), ('deposits', 'VBZ'), ('paycheck', 'NN'), ('every', 'DT'), ('month', 'NN')] \n", " --------------------\n", "bank.v.07\n", "[('cover', 'NN'), ('ashes', 'NNS'), ('control', 'VB'), ('rate', 'NN'), ('burning', 'NN'), ('bank', 'NN'), ('fire', 'NN')] \n", " --------------------\n", "trust.v.01\n", "[('confidence', 'NN'), ('faith', 'NN'), ('We', 'PRP'), ('trust', 'VB'), ('God', 'NNP'), ('Rely', 'RB'), ('friends', 'NNS'), ('bank', 'NN'), ('good', 'JJ'), ('education', 'NN'), ('I', 'PRP'), ('swear', 'VBP'), ('grandmother', 'NN'), (\"'s\", 'POS'), ('recipes', 'NNS')] \n", " --------------------\n" ] } ], "source": [ "from nltk.corpus import stopwords\n", "stopw = stopwords.words(\"english\")\n", "\n", "for s in mySynsets:\n", " print(s.name())\n", " text = pos_tag(word_tokenize(s.definition()))\n", " text += list(itertools.chain.from_iterable([ pos_tag(word_tokenize(x)) for x in s.examples() ]))\n", " text2 = [ x for x in text if x[0] not in stopw ]\n", " print(text2, \"\\n\", \"-\" * 20)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To match the keyword against the tokens of a sentence, we map inflected word forms to their base form using the WordNet lemmatizer:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'dog'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from nltk.stem import WordNetLemmatizer\n", "\n", "wordnet_lemmatizer = WordNetLemmatizer()\n", "\n", "wordnet_lemmatizer.lemmatize('dogs')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given a text that contains the word we want to disambiguate, the first step is to find its position in the token list."
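] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cells below work through the individual steps: locating the keyword in the tokenized sentence, mapping its PoS tag to a WordNet category, and collecting the candidate synsets. As a preview, here is a minimal sketch of the core comparison we are building towards: a simplified Lesk-style overlap that scores each synset of the keyword by how many content words its definition and examples share with the sentence. The helper names *bag* and *disambiguate* and the raw overlap score are our own illustrative choices, not an NLTK API:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from nltk.corpus import wordnet, stopwords\n", "from nltk import word_tokenize\n", "\n", "stop_words = set(stopwords.words(\"english\"))\n", "\n", "def bag(text):\n", "    # lower-cased content words of a text, without stop words and punctuation\n", "    return set(w.lower() for w in word_tokenize(text)\n", "               if w.isalpha() and w.lower() not in stop_words)\n", "\n", "def disambiguate(sentence, keyword, pos=None):\n", "    # score each synset of the keyword by the overlap of its definition and\n", "    # examples with the sentence context (a simplified Lesk approach)\n", "    context = bag(sentence)\n", "    candidates = wordnet.synsets(keyword, pos=pos)\n", "    if not candidates:\n", "        return None\n", "    return max(candidates,\n", "               key=lambda s: len(context & bag(\" \".join([s.definition()] + s.examples()))))\n", "\n", "# the river context should favor the 'sloping land' sense of 'bank'\n", "print(disambiguate(\"She sat on the bank of the river and watched the water.\", \"bank\", wordnet.NOUN))"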
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Position: 3\n", "['John', 'saw', 'the', 'dog', 'barking', 'at', 'the', 'cat', '.']\n" ] } ], "source": [ "example = \"John saw the dogs barking at the cats.\"\n", "keyword = \"dog\"\n", "tokens = word_tokenize(example)\n", "lemmas = [ wordnet_lemmatizer.lemmatize(x) for x in tokens ]\n", "pos = -1\n", "\n", "try:\n", " pos = lemmas.index(keyword)\n", "except ValueError:\n", " pass\n", "\n", "print(\"Position:\", pos)\n", "print(lemmas)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Lemma: dog\n", " PoS: ('dogs', 'NNS')\n", " Tag: NNS\n", " MTag: N\n" ] } ], "source": [ "posTokens = pos_tag(tokens)\n", "\n", "print(\"Lemma:\", lemmas[pos])\n", "print(\" PoS:\", posTokens[pos])\n", "print(\" Tag:\", posTokens[pos][1])\n", "print(\" MTag:\", posTokens[pos][1][0])" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "N\n" ] } ], "source": [ "category = posTokens[pos][1][0]\n", "\n", "print(category)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Type: n\n" ] } ], "source": [ "wType = None\n", "if category == 'N':\n", " wType = wordnet.NOUN\n", "elif category == 'V':\n", " wType = wordnet.VERB\n", "elif category == 'J':\n", " wType = wordnet.ADJ\n", "elif category == 'R':\n", " wType = wordnet.ADV\n", "\n", "print(\"Type:\", wType)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Synset('dog.n.01'),\n", " Synset('frump.n.01'),\n", " Synset('dog.n.03'),\n", " Synset('cad.n.01'),\n", " Synset('frank.n.02'),\n", " Synset('pawl.n.01'),\n", " Synset('andiron.n.01')]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wordnet.synsets(keyword, pos=wType)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "dog.n.01\n", "[('a', 'DT'), ('member', 'NN'), ('of', 'IN'), ('the', 'DT'), ('genus', 'NN'), ('Canis', 'NNP'), ('(', '('), ('probably', 'RB'), ('descended', 'VBN'), ('from', 'IN'), ('the', 'DT'), ('common', 'JJ'), ('wolf', 'NN'), (')', ')'), ('that', 'WDT'), ('has', 'VBZ'), ('been', 'VBN'), ('domesticated', 'VBN'), ('by', 'IN'), ('man', 'NN'), ('since', 'IN'), ('prehistoric', 'JJ'), ('times', 'NNS'), (';', ':'), ('occurs', 'VBZ'), ('in', 'IN'), ('many', 'JJ'), ('breeds', 'NNS'), ('the', 'DT'), ('dog', 'NN'), ('barked', 'VBD'), ('all', 'DT'), ('night', 'NN')] \n", " --------------------\n", "frump.n.01\n", "[('a', 'DT'), ('dull', 'JJ'), ('unattractive', 'JJ'), ('unpleasant', 'JJ'), ('girl', 'NN'), ('or', 'CC'), ('woman', 'NN'), ('she', 'PRP'), ('got', 'VBD'), ('a', 'DT'), ('reputation', 'NN'), ('as', 'IN'), ('a', 'DT'), ('frump', 'NN'), ('she', 'PRP'), (\"'s\", 'VBZ'), ('a', 'DT'), ('real', 'JJ'), ('dog', 'NN')] \n", " --------------------\n", "dog.n.03\n", "[('informal', 'JJ'), ('term', 'NN'), ('for', 'IN'), ('a', 'DT'), ('man', 'NN'), ('you', 'PRP'), ('lucky', 'VBP'), ('dog', 'VB')] \n", " --------------------\n", "cad.n.01\n", "[('someone', 'NN'), ('who', 'WP'), ('is', 'VBZ'), ('morally', 'RB'), ('reprehensible', 'JJ'), ('you', 'PRP'), ('dirty', 'VBP'), ('dog', 'VB')] \n", " --------------------\n", "frank.n.02\n", "[('a', 
'DT'), ('smooth-textured', 'JJ'), ('sausage', 'NN'), ('of', 'IN'), ('minced', 'JJ'), ('beef', 'NN'), ('or', 'CC'), ('pork', 'NN'), ('usually', 'RB'), ('smoked', 'VBD'), (';', ':'), ('often', 'RB'), ('served', 'VBD'), ('on', 'IN'), ('a', 'DT'), ('bread', 'NN'), ('roll', 'NN')] \n", " --------------------\n", "pawl.n.01\n", "[('a', 'DT'), ('hinged', 'JJ'), ('catch', 'NN'), ('that', 'IN'), ('fits', 'VBZ'), ('into', 'IN'), ('a', 'DT'), ('notch', 'NN'), ('of', 'IN'), ('a', 'DT'), ('ratchet', 'NN'), ('to', 'TO'), ('move', 'VB'), ('a', 'DT'), ('wheel', 'NN'), ('forward', 'RB'), ('or', 'CC'), ('prevent', 'VB'), ('it', 'PRP'), ('from', 'IN'), ('moving', 'VBG'), ('backward', 'NN')] \n", " --------------------\n", "andiron.n.01\n", "[('metal', 'NN'), ('supports', 'NNS'), ('for', 'IN'), ('logs', 'NNS'), ('in', 'IN'), ('a', 'DT'), ('fireplace', 'NN'), ('the', 'DT'), ('andirons', 'NNS'), ('were', 'VBD'), ('too', 'RB'), ('hot', 'JJ'), ('to', 'TO'), ('touch', 'VB')] \n", " --------------------\n" ] } ], "source": [ "for s in wordnet.synsets(keyword, pos=wType):\n", " print(s.name())\n", " text = pos_tag(word_tokenize(s.definition()))\n", " text += list(itertools.chain.from_iterable([ pos_tag(word_tokenize(x)) for x in s.examples() ]))\n", " print(text, \"\\n\", \"-\" * 20)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.1" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "latex_metadata": { "affiliation": "Indiana University, Department of Linguistics, Bloomington, IN, USA", "author": "Damir Cavar", "title": "Python Word Sense Disambiguation" } }, "nbformat": 4, "nbformat_minor": 4 }