{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Python WordNet using NLTK"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**(C) 2017-2024 by [Damir Cavar](http://damir.cavar.me/)**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Version:** 1.4, September 2024"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CA BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Prerequisites:**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -U nltk"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is a tutorial related to the discussion of a WordSense disambiguation and various machine learning strategies discussed in the textbook [Machine Learning: The Art and Science of Algorithms that Make Sense of Data](https://www.cs.bris.ac.uk/~flach/mlbook/) by [Peter Flach](https://www.cs.bris.ac.uk/~flach/)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This tutorial was developed as part of my course material for the course Machine Learning for Computational Linguistics in the [Computational Linguistics Program](http://cl.indiana.edu/) of the [Department of Linguistics](http://www.indiana.edu/~lingdept/) at [Indiana University](https://www.indiana.edu/)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using WordNet"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Importing *wordnet* from the NLTK module:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "from nltk.corpus import wordnet"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Asking for a synset in WordNet:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Synset('cat.n.01'),\n",
       " Synset('guy.n.01'),\n",
       " Synset('cat.n.03'),\n",
       " Synset('kat.n.01'),\n",
       " Synset('cat-o'-nine-tails.n.01'),\n",
       " Synset('caterpillar.n.02'),\n",
       " Synset('big_cat.n.01'),\n",
       " Synset('computerized_tomography.n.01'),\n",
       " Synset('cat.v.01'),\n",
       " Synset('vomit.v.01')]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "wordnet.synsets('cat')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A synset is identified with a 3-part name of the form: word.pos.nn. Except of the last synset, all other synsets of *dog* above are nouns with the *part-of-speech* tag *n*. We can pick a synset with a specific PoS:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Synset('chase.v.01')]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "wordnet.synsets('dog', pos=wordnet.VERB)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Besides VERB the other parts of speech are NOUN, ADJ and ADV."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can select a specific synset from the list using the full 3-part name notation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Synset('dog.n.01')"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "wordnet.synset('dog.n.01')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Fort this particular synset we can fetch the definition:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "a spiteful woman gossip\n"
     ]
    }
   ],
   "source": [
    "print(wordnet.synset('cat.n.03').definition())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Synsets might also have examples. We can count the number of examples for this concrete synset this way:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(wordnet.synset('dog.n.01').examples())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can print out the example using:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "the dog barked all night\n"
     ]
    }
   ],
   "source": [
    "print(wordnet.synset('dog.n.01').examples()[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also output the lemmata for a specific synset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Lemma('dog.n.01.dog'),\n",
       " Lemma('dog.n.01.domestic_dog'),\n",
       " Lemma('dog.n.01.Canis_familiaris')]"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "wordnet.synset('dog.n.01').lemmas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using list comprehension we can convert this list to just the lemma list:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['dog', 'domestic_dog', 'Canis_familiaris']"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[str(lemma.name()) for lemma in wordnet.synset('dog.n.01').lemmas()]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also reference a concrete lemma:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Lemma('dog.n.01.dog')"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "wordnet.lemma('dog.n.01.dog')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Multilingual Functions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The current version of WordNet in NLTK is multilingual. To see which languages are supported, use this command:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['eng']"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sorted(wordnet.langs())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can ask for the Japanese names of synsets:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Canis_lupus_familiaris', 'domaći_pas', 'pas']"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "wordnet.synset('dog.n.01').lemma_names('hrv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can fetch the English lemmata from different languages for a specific synset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Lemma('dog.n.01.cane'),\n",
       " Lemma('cramp.n.02.cane'),\n",
       " Lemma('hammer.n.01.cane'),\n",
       " Lemma('bad_person.n.01.cane'),\n",
       " Lemma('incompetent.n.01.cane')]"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "wordnet.lemmas('cane', lang='ita')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Synonyms, hypernyms, holonyms"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "dog = wordnet.synset('dog.n.01')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Synset('canine.n.02'), Synset('domestic_animal.n.01')]"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dog.hypernyms()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Synset('basenji.n.01'),\n",
       " Synset('corgi.n.01'),\n",
       " Synset('cur.n.01'),\n",
       " Synset('dalmatian.n.02'),\n",
       " Synset('great_pyrenees.n.01'),\n",
       " Synset('griffon.n.02'),\n",
       " Synset('hunting_dog.n.01'),\n",
       " Synset('lapdog.n.01'),\n",
       " Synset('leonberg.n.01'),\n",
       " Synset('mexican_hairless.n.01'),\n",
       " Synset('newfoundland.n.01'),\n",
       " Synset('pooch.n.01'),\n",
       " Synset('poodle.n.01'),\n",
       " Synset('pug.n.01'),\n",
       " Synset('puppy.n.01'),\n",
       " Synset('spitz.n.01'),\n",
       " Synset('toy_dog.n.01'),\n",
       " Synset('working_dog.n.01')]"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dog.hyponyms()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Synset('canis.n.01'), Synset('pack.n.06')]"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dog.member_holonyms()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Synset('entity.n.01')]"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dog.root_hypernyms()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Synset('carnivore.n.01')]"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "wordnet.synset('dog.n.01').lowest_common_hypernyms(wordnet.synset('cat.n.01'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "good = wordnet.synset('good.a.01')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Lemma('bad.a.01.bad')]"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "good.lemmas()[0].antonyms()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "ipub": {
   "titlepage": {
    "author": "Damir Cavar",
    "email": "damir@cavar.me",
    "institution": [
     "Indiana University",
     "NLP-Lab"
    ],
    "title": "Python WordNet using NLTK"
   }
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autoclose": false,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  },
  "latex_metadata": {
   "affiliation": "Indiana University, Department of Linguistics, Bloomington, IN, USA",
   "author": "Damir Cavar",
   "title": "Python WordNet using NLTK"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": false,
   "sideBar": false,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": false,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}