{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Python WordNet using NLTK" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**(C) 2017-2024 by [Damir Cavar](http://damir.cavar.me/)**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Version:** 1.3, January 2024" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Prerequisites:**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -U nltk" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial relates to the discussion of word sense disambiguation and various machine learning strategies in the textbook [Machine Learning: The Art and Science of Algorithms that Make Sense of Data](https://www.cs.bris.ac.uk/~flach/mlbook/) by [Peter Flach](https://www.cs.bris.ac.uk/~flach/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial was developed as part of my course material for the course Machine Learning for Computational Linguistics in the [Computational Linguistics Program](http://cl.indiana.edu/) of the [Department of Linguistics](http://www.indiana.edu/~lingdept/) at [Indiana University](https://www.indiana.edu/)."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using WordNet" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Importing *wordnet* from the NLTK module:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "scrolled": true }, "outputs": [], "source": [ "from nltk.corpus import wordnet" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Asking for the synsets of a word in WordNet:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[Synset('cat.n.01'),\n", " Synset('guy.n.01'),\n", " Synset('cat.n.03'),\n", " Synset('kat.n.01'),\n", " Synset('cat-o'-nine-tails.n.01'),\n", " Synset('caterpillar.n.02'),\n", " Synset('big_cat.n.01'),\n", " Synset('computerized_tomography.n.01'),\n", " Synset('cat.v.01'),\n", " Synset('vomit.v.01')]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wordnet.synsets('cat')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A synset is identified by a 3-part name of the form word.pos.nn. Except for the last two, which are verbs (tag *v*), all synsets of *cat* above are nouns with the *part-of-speech* tag *n*. We can restrict the lookup to a specific PoS, here the verb synsets of *dog*:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[Synset('chase.v.01')]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wordnet.synsets('dog', pos=wordnet.VERB)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Besides VERB, the other parts of speech are NOUN, ADJ and ADV."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can select a specific synset from the list using the full 3-part name notation:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "Synset('dog.n.01')" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wordnet.synset('dog.n.01')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this particular synset we can fetch the definition:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds\n" ] } ], "source": [ "print(wordnet.synset('dog.n.01').definition())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Synsets might also have examples. 
We can count the number of examples for this concrete synset this way:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(wordnet.synset('dog.n.01').examples())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can print out the example using:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "the dog barked all night\n" ] } ], "source": [ "print(wordnet.synset('dog.n.01').examples()[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also output the lemmata for a specific synset:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[Lemma('dog.n.01.dog'),\n", " Lemma('dog.n.01.domestic_dog'),\n", " Lemma('dog.n.01.Canis_familiaris')]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wordnet.synset('dog.n.01').lemmas()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using a list comprehension we can reduce this list to just the lemma names:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "['dog', 'domestic_dog', 'Canis_familiaris']" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[str(lemma.name()) for lemma in wordnet.synset('dog.n.01').lemmas()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also reference a concrete lemma:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "Lemma('dog.n.01.dog')" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"wordnet.lemma('dog.n.01.dog')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multilingual Functions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The current version of WordNet in NLTK is multilingual. To see which languages are supported, use this command:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "['als',\n", " 'arb',\n", " 'bul',\n", " 'cat',\n", " 'cmn',\n", " 'dan',\n", " 'ell',\n", " 'eng',\n", " 'eus',\n", " 'fas',\n", " 'fin',\n", " 'fra',\n", " 'glg',\n", " 'heb',\n", " 'hrv',\n", " 'ind',\n", " 'ita',\n", " 'jpn',\n", " 'nld',\n", " 'nno',\n", " 'nob',\n", " 'pol',\n", " 'por',\n", " 'qcn',\n", " 'slv',\n", " 'spa',\n", " 'swe',\n", " 'tha',\n", " 'zsm']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sorted(wordnet.langs())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can ask for the lemma names of a synset in another language, here Croatian (`hrv`):" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\damir\\AppData\\Roaming\\Python\\Python311\\site-packages\\nltk\\corpus\\reader\\wordnet.py:1564: UserWarning: No WordNet synset found for pos=a at offset=1498548.\n", " warnings.warn(f\"No WordNet synset found for pos={pos} at offset={offset}.\")\n", "C:\\Users\\damir\\AppData\\Roaming\\Python\\Python311\\site-packages\\nltk\\corpus\\reader\\wordnet.py:1564: UserWarning: No WordNet synset found for pos=a at offset=1505508.\n", " warnings.warn(f\"No WordNet synset found for pos={pos} at offset={offset}.\")\n", "C:\\Users\\damir\\AppData\\Roaming\\Python\\Python311\\site-packages\\nltk\\corpus\\reader\\wordnet.py:1564: UserWarning: No WordNet synset found for pos=a at offset=2002046.\n", " warnings.warn(f\"No WordNet synset found for pos={pos} at offset={offset}.\")\n", 
"C:\\Users\\damir\\AppData\\Roaming\\Python\\Python311\\site-packages\\nltk\\corpus\\reader\\wordnet.py:1564: UserWarning: No WordNet synset found for pos=a at offset=2917945.\n", " warnings.warn(f\"No WordNet synset found for pos={pos} at offset={offset}.\")\n" ] }, { "data": { "text/plain": [ "['Canis_lupus_familiaris', 'domaći_pas', 'pas']" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wordnet.synset('dog.n.01').lemma_names('hrv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also look up a word from another language, here Italian *cane*, and fetch its lemmata in all the synsets it belongs to:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[Lemma('dog.n.01.cane'),\n", " Lemma('cramp.n.02.cane'),\n", " Lemma('hammer.n.01.cane'),\n", " Lemma('bad_person.n.01.cane'),\n", " Lemma('incompetent.n.01.cane')]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wordnet.lemmas('cane', lang='ita')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Synonyms, hypernyms, holonyms" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "scrolled": true }, "outputs": [], "source": [ "dog = wordnet.synset('dog.n.01')" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[Synset('canine.n.02'), Synset('domestic_animal.n.01')]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dog.hypernyms()" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[Synset('basenji.n.01'),\n", " Synset('corgi.n.01'),\n", " Synset('cur.n.01'),\n", " Synset('dalmatian.n.02'),\n", " Synset('great_pyrenees.n.01'),\n", " Synset('griffon.n.02'),\n", " Synset('hunting_dog.n.01'),\n", " Synset('lapdog.n.01'),\n", " Synset('leonberg.n.01'),\n", " 
Synset('mexican_hairless.n.01'),\n", " Synset('newfoundland.n.01'),\n", " Synset('pooch.n.01'),\n", " Synset('poodle.n.01'),\n", " Synset('pug.n.01'),\n", " Synset('puppy.n.01'),\n", " Synset('spitz.n.01'),\n", " Synset('toy_dog.n.01'),\n", " Synset('working_dog.n.01')]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dog.hyponyms()" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[Synset('canis.n.01'), Synset('pack.n.06')]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dog.member_holonyms()" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[Synset('entity.n.01')]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dog.root_hypernyms()" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[Synset('carnivore.n.01')]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wordnet.synset('dog.n.01').lowest_common_hypernyms(wordnet.synset('cat.n.01'))" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "scrolled": true }, "outputs": [], "source": [ "good = wordnet.synset('good.a.01')" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[Lemma('bad.a.01.bad')]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "good.lemmas()[0].antonyms()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "ipub": { "titlepage": { "author": "Damir Cavar", "email": "damir@cavar.me", "institution": [ "Indiana University", "NLP-Lab" ], "title": "Python 
WordNet using NLTK" } }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.1" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "latex_metadata": { "affiliation": "Indiana University, Department of Linguistics, Bloomington, IN, USA", "author": "Damir Cavar", "title": "Python WordNet using NLTK" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": false, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": false, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }