{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating a Jupyter Notebook file" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import nltk\n", "import matplotlib\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "*** Introductory Examples for the NLTK Book ***\n", "Loading text1, ..., text9 and sent1, ..., sent9\n", "Type the name of the text or sentence to view it.\n", "Type: 'texts()' or 'sents()' to list the materials.\n", "text1: Moby Dick by Herman Melville 1851\n", "text2: Sense and Sensibility by Jane Austen 1811\n", "text3: The Book of Genesis\n", "text4: Inaugural Address Corpus\n", "text5: Chat Corpus\n", "text6: Monty Python and the Holy Grail\n", "text7: Wall Street Journal\n", "text8: Personals Corpus\n", "text9: The Man Who Was Thursday by G . K . Chesterton 1908\n" ] } ], "source": [ "from nltk.book import *\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Searching for Words" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Displaying 25 of 1226 matches:\n", "s , and to teach them by what name a whale - fish is to be called in our tongue\n", "t which is not true .\" -- HACKLUYT \" WHALE . ... Sw . and Dan . HVAL . This ani\n", "ulted .\" -- WEBSTER ' S DICTIONARY \" WHALE . ... It is more immediately from th\n", "ISH . WAL , DUTCH . HWAL , SWEDISH . WHALE , ICELANDIC . WHALE , ENGLISH . BALE\n", "HWAL , SWEDISH . WHALE , ICELANDIC . WHALE , ENGLISH . BALEINE , FRENCH . BALLE\n", "least , take the higgledy - piggledy whale statements , however authentic , in \n", " dreadful gulf of this monster ' s ( whale ' s ) mouth , are immediately lost a\n", " patient Job .\" -- RABELAIS . 
\" This whale ' s liver was two cartloads .\" -- ST\n", " Touching that monstrous bulk of the whale or ork we have received nothing cert\n", " of oil will be extracted out of one whale .\" -- IBID . \" HISTORY OF LIFE AND D\n", "ise .\" -- KING HENRY . \" Very like a whale .\" -- HAMLET . \" Which to secure , n\n", "restless paine , Like as the wounded whale to shore flies thro ' the maine .\" -\n", ". OF SPERMA CETI AND THE SPERMA CETI WHALE . VIDE HIS V . E . \" Like Spencer ' \n", "t had been a sprat in the mouth of a whale .\" -- PILGRIM ' S PROGRESS . \" That \n", "EN ' S ANNUS MIRABILIS . \" While the whale is floating at the stern of the ship\n", "e ship called The Jonas - in - the - Whale . ... Some say the whale can ' t ope\n", " in - the - Whale . ... Some say the whale can ' t open his mouth , but that is\n", " masts to see whether they can see a whale , for the first discoverer has a duc\n", " for his pains . ... I was told of a whale taken near Shetland , that had above\n", "oneers told me that he caught once a whale in Spitzbergen that was white all ov\n", "2 , one eighty feet in length of the whale - bone kind came in , which ( as I w\n", "n master and kill this Sperma - ceti whale , for I could never hear of any of t\n", " . 1729 . \"... and the breath of the whale is frequendy attended with such an i\n", "ed with hoops and armed with ribs of whale .\" -- RAPE OF THE LOCK . \" If we com\n", "contemptible in the comparison . 
The whale is doubtless the largest animal in c\n" ] } ], "source": [ "text1.concordance(\"whale\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sea man it ship by him hand them whale view ships land me life death\n", "water way head nature fear\n" ] } ], "source": [ "text1.similar(\"love\")" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "affection sister heart mother time see town life it dear elinor\n", "marianne me word family her him do regard head\n" ] } ], "source": [ "text2.similar(\"love\")" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "join part hi hey and wb well ty lmao yeah hiya ok oh hello you what\n", "yes haha no all\n" ] } ], "source": [ "text5.similar(\"lol\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Positioning Words" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "text1.dispersion_plot([\"whale\", \"monster\"])" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "text2.dispersion_plot([\"love\", \"marriage\"])" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "text2.dispersion_plot([\"husband\", \"wife\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Types vs. tokens" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "906" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text1.count(\"whale\")" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "282" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text1.count(\"Whale\")" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "38" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text1.count(\"WHALE\")" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "text1_tokens = []\n", "for t in text1:\n", "    if t.isalpha():\n", "        t = t.lower()\n", "        text1_tokens.append(t)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "text1_tokens = [t.lower() for t in text1 if t.isalpha()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Length and unique words" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1226" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text1_tokens.count(\"whale\")" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text1_tokens.count(\"Whale\")" ] }, { "cell_type": "code", "execution_count": 17, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "text1_tokens.count(\"WHALE\")" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "218361" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(text1_tokens)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "260819" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(text1)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "16948" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Long alternative\n", "\n", "x = set(text1_tokens)\n", "len(x)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "16948" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Short and sweet alternative\n", "len(set(text1_tokens))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lexical density" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.07761459234936642" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(set(text1_tokens)) / len(text1_tokens)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "text1_slice = text1_tokens[0:10000]" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.2816" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(set(text1_slice)) / len(text1_slice)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.1786" ] }, "execution_count": 25, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "text2_tokens = []\n", "for t in text2:\n", "    if t.isalpha():\n", "        t = t.lower()\n", "        text2_tokens.append(t)\n", "\n", "text2_slice = text2_tokens[0:10000]\n", "\n", "len(set(text2_slice)) / len(text2_slice)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Cleaning: Removing Stopwords" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "from nltk.corpus import stopwords" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "stops = stopwords.words('english')" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', \"you're\", \"you've\", \"you'll\", \"you'd\", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', \"she's\", 'her', 'hers', 'herself', 'it', \"it's\", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', \"that'll\", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', \"don't\", 'should', \"should've\", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', \"aren't\", 'couldn', \"couldn't\", 'didn', \"didn't\", 'doesn', \"doesn't\", 
'hadn', \"hadn't\", 'hasn', \"hasn't\", 'haven', \"haven't\", 'isn', \"isn't\", 'ma', 'mightn', \"mightn't\", 'mustn', \"mustn't\", 'needn', \"needn't\", 'shan', \"shan't\", 'shouldn', \"shouldn't\", 'wasn', \"wasn't\", 'weren', \"weren't\", 'won', \"won't\", 'wouldn', \"wouldn't\"]\n" ] } ], "source": [ "print(stops)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "text1_stops = []\n", "for t in text1_tokens:\n", "    if t not in stops:\n", "        text1_stops.append(t)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "text1_stops = [t for t in text1_tokens if t not in stops]" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['moby', 'dick', 'herman', 'melville', 'etymology', 'supplied', 'late', 'consumptive', 'usher', 'grammar', 'school', 'pale', 'usher', 'threadbare', 'coat', 'heart', 'body', 'brain', 'see', 'ever', 'dusting', 'old', 'lexicons', 'grammars', 'queer', 'handkerchief', 'mockingly', 'embellished', 'gay', 'flags']\n" ] } ], "source": [ "print(text1_stops[:30])" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "110459" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(text1_stops)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "16802" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(set(text1_stops))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Cleaning: Lemmatizing Words" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "from nltk.stem import WordNetLemmatizer" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "wordnet_lemmatizer = WordNetLemmatizer()" ] }, 
{ "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'child'" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wordnet_lemmatizer.lemmatize(\"children\")" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'better'" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wordnet_lemmatizer.lemmatize(\"better\")" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'good'" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wordnet_lemmatizer.lemmatize(\"better\", pos='a')" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "text1_clean = []\n", "for t in text1_stops:\n", "    t_lem = wordnet_lemmatizer.lemmatize(t)\n", "    text1_clean.append(t_lem)" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "text1_clean = [wordnet_lemmatizer.lemmatize(t) for t in text1_stops]" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "110459\n", "14750\n" ] } ], "source": [ "print(len(text1_clean))\n", "print(len(set(text1_clean)))" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['aback',\n", " 'abaft',\n", " 'abandon',\n", " 'abandoned',\n", " 'abandonedly',\n", " 'abandonment',\n", " 'abased',\n", " 'abasement',\n", " 'abashed',\n", " 'abate',\n", " 'abated',\n", " 'abatement',\n", " 'abating',\n", " 'abbreviate',\n", " 'abbreviation',\n", " 'abeam',\n", " 'abed',\n", " 'abednego',\n", " 'abel',\n", " 'abhorred',\n", " 'abhorrence',\n", " 'abhorrent',\n", " 'abhorring',\n", " 'abide',\n", " 'abided',\n", " 'abiding',\n", " 'ability',\n", " 'abjectly',\n", " 
'abjectus',\n", " 'able']" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sorted(set(text1_clean))[:30]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Cleaning: Stemming Words" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "from nltk.stem import PorterStemmer\n", "porter_stemmer = PorterStemmer()" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "berri\n", "berri\n", "berry\n", "berry\n" ] } ], "source": [ "print(porter_stemmer.stem('berry'))\n", "print(porter_stemmer.stem('berries'))\n", "print(wordnet_lemmatizer.lemmatize('berry'))\n", "print(wordnet_lemmatizer.lemmatize('berries'))" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "abandon\n", "abandon\n", "abandonli\n", "abandon\n" ] } ], "source": [ "print(porter_stemmer.stem('abandon'))\n", "print(porter_stemmer.stem('abandoned'))\n", "print(porter_stemmer.stem('abandonly'))\n", "print(porter_stemmer.stem('abandonment'))" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "t1_porter = []\n", "for t in text1_clean:\n", "    t_stemmed = porter_stemmer.stem(t)\n", "    t1_porter.append(t_stemmed)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "t1_porter = [porter_stemmer.stem(t) for t in text1_clean]" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10501\n", "['aback', 'abaft', 'abandon', 'abandonedli', 'abas', 'abash', 'abat', 'abbrevi', 'abe', 'abeam', 'abednego', 'abel', 'abhor', 'abhorr', 'abid', 'abil', 'abjectli', 'abjectu', 'abl', 'ablut', 'aboard', 'abod', 'abomin', 'aborigin', 'abort', 'abound', 'aboundingli', 'abraham', 'abreast', 
'abridg']\n" ] } ], "source": [ "print(len(set(t1_porter)))\n", "print(sorted(set(t1_porter))[:30])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Cleaning: Results" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "my_dist = FreqDist(text1_clean)" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "nltk.probability.FreqDist" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(my_dist)" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "data": { "text/plain": [ "" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_dist.plot(20)" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('whale', 1494),\n", " ('one', 940),\n", " ('like', 650),\n", " ('ship', 605),\n", " ('upon', 566),\n", " ('sea', 542),\n", " ('man', 527),\n", " ('ahab', 512),\n", " ('boat', 483),\n", " ('ye', 472),\n", " ('old', 450),\n", " ('time', 446),\n", " ('would', 432),\n", " ('head', 431),\n", " ('though', 384),\n", " ('captain', 353),\n", " ('yet', 345),\n", " ('hand', 344),\n", " ('long', 333),\n", " ('thing', 320)]" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_dist.most_common(20)" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "b_words = ['god', 'apostle', 'angel']" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "my_list = []\n", "for word in b_words:\n", "    if word in text1_clean:\n", "        my_list.append(word)" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['god', 'angel']\n" ] } ], "source": [ "print(my_list)" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "my_list2 = [word for word in b_words if word in text1_clean]" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_list == my_list2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Make Your Own Corpus" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [], "source": [ "from 
urllib.request import urlopen" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "my_url = \"http://www.gutenberg.org/files/996/996-0.txt\"" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [], "source": [ "file = urlopen(my_url)\n", "raw = file.read()" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "don = raw.decode()" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "str" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(don)" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [], "source": [ "don_tokens = nltk.word_tokenize(don)" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "498721" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(don_tokens)" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['\\ufeff',\n", " 'The',\n", " 'Project',\n", " 'Gutenberg',\n", " 'EBook',\n", " 'of',\n", " 'The',\n", " 'History',\n", " 'of',\n", " 'Don']" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "don_tokens[:10]" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "dq_text = don_tokens[320:]" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['I', 'CHAPTER', 'I', 'WHICH', 'TREATS', 'OF', 'THE', 'CHARACTER', 'AND', 'PURSUITS', 'OF', 'THE', 'FAMOUS', 'GENTLEMAN', 'DON', 'QUIXOTE', 'OF', 'LA', 'MANCHA', 'CHAPTER', 'II', 'WHICH', 'TREATS', 'OF', 'THE', 'FIRST', 'SALLY', 'THE', 'INGENIOUS', 'DON']\n" ] } ], "source": [ "print(dq_text[:30])" ] }, { "cell_type": "code", "execution_count": 68, 
"metadata": {}, "outputs": [], "source": [ "dq_nltk_text = nltk.Text(dq_text)" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "nltk.text.Text" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(dq_nltk_text)" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['chapter', 'treats', 'character', 'pursuits', 'famous', 'gentleman', 'quixote', 'la', 'mancha', 'chapter', 'ii', 'treats', 'first', 'sally', 'ingenious', 'quixote', 'made', 'home', 'chapter', 'iii', 'wherein', 'related', 'droll', 'way', 'quixote', 'dubbed', 'knight', 'chapter', 'iv', 'happened', 'knight', 'left', 'inn', 'chapter', 'v', 'narrative', 'knight', 'mishap', 'continued', 'chapter', 'vi', 'diverting', 'important', 'scrutiny', 'curate', 'barber', 'made', 'library', 'ingenious', 'gentleman']\n" ] } ], "source": [ "dq_clean = []\n", "for word in dq_text:\n", "    if word.isalpha():\n", "        if word.lower() not in stops:\n", "            dq_clean.append(word.lower())\n", "print(dq_clean[:50])" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [], "source": [ "from nltk.stem import WordNetLemmatizer\n", "wordnet_lemmatizer = WordNetLemmatizer()\n", "\n", "dq_lemmatized = []\n", "for t in dq_clean:\n", "    dq_lemmatized.append(wordnet_lemmatizer.lemmatize(t))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part-of-Speech Tagging" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [], "source": [ "dq_tagged = nltk.pos_tag(dq_text)" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('I', 'PRP'), ('CHAPTER', 'VBP'), ('I', 'PRP'), ('WHICH', 'NNP'), ('TREATS', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CHARACTER', 'NNP'), ('AND', 'NNP'), ('PURSUITS', 'NNP')]\n" ] } ], "source": [ 
"print(dq_tagged[:10])" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [], "source": [ "tag_dict = {}\n", "# For every (word, tag) pair, increment the running count for that tag\n", "for (word, tag) in dq_tagged:\n", "    if tag in tag_dict:\n", "        tag_dict[tag] += 1\n", "    else:\n", "        tag_dict[tag] = 1" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'PRP': 36100,\n", " 'VBP': 9658,\n", " 'NNP': 31836,\n", " 'IN': 57945,\n", " 'VBD': 23503,\n", " ',': 36910,\n", " 'CC': 22993,\n", " 'VB': 21198,\n", " 'MD': 7256,\n", " 'DT': 40778,\n", " ':': 6442,\n", " 'CD': 3108,\n", " 'VBZ': 8316,\n", " 'RP': 1916,\n", " 'JJ': 24445,\n", " 'NN': 62303,\n", " 'WP': 4157,\n", " 'NNS': 15271,\n", " 'RB': 20227,\n", " 'VBN': 10087,\n", " 'WDT': 3546,\n", " '.': 7119,\n", " 'EX': 1073,\n", " 'TO': 13801,\n", " 'PRP$': 12231,\n", " 'VBG': 7727,\n", " 'RBS': 253,\n", " 'JJS': 954,\n", " 'PDT': 1118,\n", " 'RBR': 655,\n", " 'JJR': 1294,\n", " 'FW': 381,\n", " '(': 574,\n", " ')': 574,\n", " 'WP$': 137,\n", " 'WRB': 2147,\n", " 'POS': 14,\n", " 'NNPS': 155,\n", " 'UH': 85,\n", " \"''\": 111,\n", " '$': 3}" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tag_dict" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('NN', 62303), ('IN', 57945), ('DT', 40778), (',', 36910), ('PRP', 36100), ('NNP', 31836), ('JJ', 24445), ('VBD', 23503), ('CC', 22993), ('VB', 21198), ('RB', 20227), ('NNS', 15271), ('TO', 13801), ('PRP$', 12231), ('VBN', 10087), ('VBP', 9658), ('VBZ', 8316), ('VBG', 7727), ('MD', 7256), ('.', 7119), (':', 6442), ('WP', 4157), ('WDT', 3546), ('CD', 3108), ('WRB', 2147), ('RP', 1916), ('JJR', 1294), ('PDT', 1118), ('EX', 1073), ('JJS', 954), ('RBR', 655), ('(', 574), (')', 574), ('FW', 381), ('RBS', 253), ('NNPS', 155), ('WP$', 137), (\"''\", 111), ('UH', 85), ('POS', 14), ('$', 3)]\n" ] } ], 
"source": [ "tag_dict_sorted = sorted(tag_dict.items(),\n", " reverse=True,\n", " key=lambda kv: kv[1])\n", "print(tag_dict_sorted)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }