{ "cells": [ { "cell_type": "markdown", "id": "d162b058", "metadata": {}, "source": [ "--- \n", " \n", "\n", "

Department of Data Science

\n", "

Course: Tools and Techniques for Data Science

\n", "\n", "---\n", "

Instructor: Muhammad Arif Butt, Ph.D.

\n" ] }, { "cell_type": "markdown", "id": "7db92d41", "metadata": {}, "source": [ "

Lecture 7.2 (Basic Text Pre-Processing)

" ] }, { "cell_type": "markdown", "id": "88976106", "metadata": {}, "source": [ "\"Open" ] }, { "cell_type": "markdown", "id": "19b0c47d", "metadata": {}, "source": [ " " ] }, { "cell_type": "code", "execution_count": null, "id": "e23a6901", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "b2c17026", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "9efdaacd", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "1b8ced6b", "metadata": {}, "source": [ "# Learning agenda of this notebook\n", "1. **Text Cleaning**\n", " - Removing digits and words containing digits\n", " - Removing newline characters and extra spaces\n", " - Removing HTML tags\n", " - Removing URLs\n", " - Removing punctuations\n", " \n", "\n", "2. **Basic Text Preprocessing**\n", " - Case folding\n", " - Expand contractions\n", " - Chat word treatment\n", " - Handle emojis\n", " - Spelling correction\n", " - Tokenization\n", " - Creating N-grams\n", " - Stop words Removal\n", " \n", " \n", "3. **Advanced Preprocessing**\n", " - Stemming\n", " - Lemmatization\n", " - POS tagging\n", " - NER\n", " - Parsing\n", " - Coreference Resolution\n", " \n", "\n", "4. **Text Pre-Processing on Tweets Dataset**" ] }, { "cell_type": "code", "execution_count": null, "id": "e3b5f1e7", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "c070ed94", "metadata": {}, "source": [ "# Download and Install Required Libraries" ] }, { "cell_type": "code", "execution_count": 1, "id": "89c00384", "metadata": {}, "outputs": [], "source": [ "import sys\n", "!{sys.executable} -m pip install -q --upgrade pip\n", "!{sys.executable} -m pip install -q --upgrade numpy pandas sklearn\n", "!{sys.executable} -m pip install -q --upgrade nltk spacy gensim wordcloud textblob contractions clean-text unicode" ] }, { "cell_type": "code", "execution_count": null, "id": "dc38c31a", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "79ba6784", "metadata": {}, "source": [ "# 1. Text Cleaning" ] }, { "cell_type": "markdown", "id": "29fcb284", "metadata": {}, "source": [ "## a. Removing Digits and Words Containing Digits \n", "- Sometimes it happens that words and digits combine are written in the text which creates a problem for machines to understand. Hence, we need to remove the words and digits which are combined like game57 or game5ts7.\n", "- For such and many other tasks we normally use Regular Expressions.\n", "- Watch my two videos on regular expressions:\n", " - https://www.youtube.com/watch?v=DhQ-kc6FPVk\n", " - https://www.youtube.com/watch?v=3J62z5aGTQc\n", "\n", "- The **`re.sub(pattern, replacement_string, str)`** method return the string obtained by replacing the occurrences of `pattern` in `str` with the `replacement_string`. If the pattern isn’t found, the string is returned unchanged." 
] }, { "cell_type": "code", "execution_count": 2, "id": "c3827cbe", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'This is a string containing words having digits'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "mystr = \"This is abc32 a abc32xyz string containing 32abc words 32 having digits\"\n", "re.sub('\\w*\\d\\w*', '', mystr)" ] }, { "cell_type": "code", "execution_count": null, "id": "b93101cc", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "4d279225", "metadata": {}, "source": [ "## b. Removing New Line Characters and Extra Spaces\n", "- Most of the time text data contain extra spaces or while removing digits more than one space is left between the text.\n", "- We can use Python's string and re module to perform this pre-processing task." ] }, { "cell_type": "code", "execution_count": 3, "id": "d9f5e972", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "' This is a string with lots of extra spaces in beteween words .'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "mystr = \" This is a string with lots of extra spaces in beteween words .\"\n", "re.sub(' +', ' ', mystr)" ] }, { "cell_type": "code", "execution_count": 4, "id": "785a2d73", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original String:\n", " This is\n", "a string\n", "with lots of new\n", "line characters.\n", "Preprocessed String: This is a string with lots of new line characters.\n" ] } ], "source": [ "mystr = \"This is\\na string\\nwith lots of new\\nline characters.\"\n", "print(\"Original String:\\n\", mystr)\n", "print(\"Preprocessed String:\", re.sub('\\n', ' ', mystr))" ] }, { "cell_type": "code", "execution_count": null, "id": "4a52015e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "3dee5c0a", "metadata": {}, "source": [ "## c. Removing HTML Tags\n", "- Once you get data via scraping websites, your data might contain HTML tags, which are not required as such in the data. So we need to remove them." ] }, { "cell_type": "code", "execution_count": 5, "id": "682dcc37", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original String: An empty head.
\n", "Preprocessed String:  An empty head. This is so simple and fun.\n" ] } ], "source": [ "import re\n", "mystr = \"<head>An empty head.</head> <body>This is so simple and fun.</body>
\"\n", "print(\"Original String: \", mystr)\n", "print(\"Preprocessed String: \", re.sub('<.*?>', '', mystr))" ] }, { "cell_type": "code", "execution_count": null, "id": "3ca00acb", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "167b6180", "metadata": {}, "source": [ "## d. Removing URLs\n", "- At times the text data you have some URLS, which might not be helpful in suppose sentiment analysis. So better to remove those URLS from your dataset\n", "- Once again, we can use Python's re module to remove the URLs." ] }, { "cell_type": "code", "execution_count": 6, "id": "74f73fc0", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Good youTube lectures by Arif are available at '" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "mystr = \"Good youTube lectures by Arif are available at http://www.youtube.com/c/LearnWithArif/playlists\"\n", "re.sub('https?://\\S+|www.\\.\\S+', '', mystr)" ] }, { "cell_type": "code", "execution_count": null, "id": "bb9c503b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "a3e0e219", "metadata": {}, "source": [ "## e. Removing Punctuations\n", "- Punctuations are symbols that are used to divide written words into sentences and clauses\n", "- Once you tokenize your text, these punctuation symbols may become part of a token, and may become a token by itself, which is not required in most of the cases\n", "- We can use Python's `string.punctuation` constant and `replace()` method to replace any punctuation in text with an empty string" ] }, { "cell_type": "code", "execution_count": 7, "id": "5b06a5f6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import string\n", "string.punctuation" ] }, { "cell_type": "markdown", "id": "0067f95f", "metadata": {}, "source": [ ">- Check for other constants like `string.whitespace`, `string.printable`, `string.ascii_letters`, `string.digits` as well." ] }, { "cell_type": "code", "execution_count": 8, "id": "6f123fd9", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'A {text} ^having$ \"lot\" of #s and [puncutations]!.;%..'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mystr = 'A {text} ^having$ \"lot\" of #s and [puncutations]!.;%..'\n", "mystr" ] }, { "cell_type": "code", "execution_count": 9, "id": "76aa3beb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'A text having lot of s and puncutations'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "newstr = ''.join([ch for ch in mystr if ch not in string.punctuation])\n", "newstr" ] }, { "cell_type": "code", "execution_count": null, "id": "df149851", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "76f28d73", "metadata": {}, "source": [ "# 2. Basic Text Preprocessing" ] }, { "cell_type": "markdown", "id": "86b42da8", "metadata": {}, "source": [ "## a. 
{ "cell_type": "markdown", "id": "76f28d73", "metadata": {}, "source": [ "# 2. Basic Text Preprocessing" ] }, { "cell_type": "markdown", "id": "86b42da8", "metadata": {}, "source": [ "## a. Case Folding \n", "- The text we need to process may come in lower, upper, sentence, or camel case\n", "- If all the text is in the same case, it is easier for a machine to interpret, because otherwise the lower case and upper case variants of a word are treated as different tokens\n", "- In applications like Information Retrieval, we reduce all letters to lower case\n", "- In applications like sentiment analysis, machine translation and information extraction, keeping the case might be helpful. For example US vs us." ] }, { "cell_type": "code", "execution_count": 10, "id": "cbb6c75e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'this is great series of lectures by arif at the department of ds'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mystr = \"This IS GREAT series of Lectures by Arif at the Department of DS\"\n", "mystr.lower()" ] }, { "cell_type": "code", "execution_count": null, "id": "12b34d7d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "97fd645b", "metadata": {}, "source": [ "## b. Expand Contractions\n", "- Contractions are words or combinations of words that are shortened by dropping letters and replacing them with an apostrophe.\n", "- Examples:\n", " - you're ---> you are\n", " - ain't ---> am not / are not / is not / has not / have not\n", " - you'll ---> you shall / you will\n", " - wouldn't 've ---> would not have\n", "- In order to expand contractions, you can install and use the `contractions` module, or you can create your own dictionary to expand contractions (see the sketch at the end of this subsection)" ] }, { "cell_type": "code", "execution_count": 11, "id": "6c90f198", "metadata": {}, "outputs": [], "source": [ "import sys\n", "!{sys.executable} -m pip install -q contractions" ] }, { "cell_type": "code", "execution_count": 12, "id": "7b5ff1ef", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "you are\n", "are not\n", "you will\n", "would not have\n" ] } ], "source": [ "import contractions\n", "print(contractions.fix(\"you're\")) # you are\n", "print(contractions.fix(\"ain't\")) # am not / are not / is not / has not / have not\n", "print(contractions.fix(\"you'll\")) #you shall / you will\n", "print(contractions.fix(\"wouldn't've\")) #\"wouldn't've\": \"would not have\"," ] }, { "cell_type": "code", "execution_count": null, "id": "9fb8cc71", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 13, "id": "e6b0d807", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"I'll be there within 5 min. Shouldn't you be there too? I'd love to see u there my dear. \nIt's awesome to meet new friends. We've been waiting for this day for so long.\"" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mystr = '''I'll be there within 5 min. Shouldn't you be there too? I'd love to see u there my dear. \n", "It's awesome to meet new friends. We've been waiting for this day for so long.'''\n", "mystr" ] }, { "cell_type": "code", "execution_count": 14, "id": "128dec87", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "I will be there within 5 min. Should not you be there too? I would love to see you there my dear. \n", "It is awesome to meet new friends. We have been waiting for this day for so long.\n" ] } ], "source": [ "# use loop\n", "mylist = [] \n", "for word in mystr.split(sep=' '):\n", " mylist.append(contractions.fix(word))\n", "\n", "newstring = ' '.join(mylist)\n", "print(newstring)" ] }, { "cell_type": "code", "execution_count": 15, "id": "61d4bf5e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'I will be there within 5 min. Should not you be there too? I would love to see you there my dear. It is awesome to meet new friends. We have been waiting for this day for so long.'" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# use list comprehension and join the words of list on space\n", "expanded_string = ' '.join([contractions.fix(word) for word in mystr.split()])\n", "expanded_string" ] }, { "cell_type": "code", "execution_count": null, "id": "d70a0fc7", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "c52da3de", "metadata": {}, "outputs": [], "source": [] },
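{ "cell_type": "markdown", "id": "ad0c0005", "metadata": {}, "source": [ "As mentioned above, you can also build your own dictionary instead of using the `contractions` module. Here is a minimal sketch of that approach (the tiny dictionary and the helper function are illustrative only, not a library API):" ] }, { "cell_type": "code", "execution_count": null, "id": "ad0c0006", "metadata": {}, "outputs": [], "source": [ "# a tiny illustrative contraction map; extend as needed\n", "my_contractions = {\"you're\": \"you are\", \"can't\": \"cannot\", \"it's\": \"it is\"}\n", "\n", "def expand_with_dict(text):\n", "    # look each word up in the map; fall back to the word itself\n", "    return ' '.join(my_contractions.get(word.lower(), word) for word in text.split())\n", "\n", "expand_with_dict(\"you're sure it's fine, can't you see?\")" ] },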
{ "cell_type": "markdown", "id": "b864e428", "metadata": {}, "source": [ "## c. Chat Word Treatment\n", "- Some commonly used abbreviated chat words that are used on social media these days are:\n", " - GN for good night\n", " - fyi for for your information\n", " - asap for as soon as possible\n", " - yolo for you only live once\n", " - rofl for rolling on floor laughing\n", " - nvm for never mind\n", " - ofc for of course\n", "\n", "- To pre-process any text containing such abbreviations we can search for an online dictionary, or can create a dictionary of our own" ] }, { "cell_type": "code", "execution_count": 16, "id": "9d8f4c40", "metadata": {}, "outputs": [], "source": [ "dict_chatwords = { \n", " 'ack': 'acknowledge',\n", " 'omg': 'oh my God',\n", " 'aisi': 'as i see it',\n", " 'bi5': 'back in 5 minutes',\n", " 'lmk': 'let me know',\n", " 'gn' : 'good night',\n", " 'fyi': 'for your information',\n", " 'asap': 'as soon as possible',\n", " 'yolo': 'you only live once',\n", " 'rofl': 'rolling on floor laughing',\n", " 'nvm': 'never mind',\n", " 'ofc': 'of course',\n", " 'blv' : 'boulevard',\n", " 'cir' : 'circle',\n", " 'hwy' : 'highway',\n", " 'ln' : 'lane',\n", " 'pt' : 'point',\n", " 'rd' : 'road',\n", " 'sq' : 'square',\n", " 'st' : 'street'\n", " }" ] }, { "cell_type": "code", "execution_count": 17, "id": "560fb80d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'omg this is aisi I ack your work and will be bi5'" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mystr = \"omg this is aisi I ack your work and will be bi5\"\n", "mystr" ] }, { "cell_type": "code", "execution_count": 18, "id": "74246a90", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "oh my God this is as i see it I acknowledge your work and will be back in 5 minutes\n" ] } ], "source": [ "# dict.items() method returns all the key-value pairs of a dict as a two object tuple\n", "# dict.keys() method returns all the keys of a dict object\n", "# dict.values() method returns all the values of a dict object\n", "mylist = [] \n", "for word in mystr.split(sep=' '):\n", " if word in dict_chatwords.keys():\n", " mylist.append(dict_chatwords[word])\n", " else:\n", " mylist.append(word)\n", "newstring = ' '.join(mylist)\n", "print(newstring)" ] }, { "cell_type": "code", "execution_count": null, "id": "fee8b5a4", "metadata": {}, "outputs": [], "source": [] },
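{ "cell_type": "markdown", "id": "ad0c0007", "metadata": {}, "source": [ "The same replacement can be written more compactly with `dict.get(key, default)`, which returns the expansion when the word is in the dictionary and falls back to the word itself otherwise (a small sketch reusing `dict_chatwords` and `mystr` from above):" ] }, { "cell_type": "code", "execution_count": null, "id": "ad0c0008", "metadata": {}, "outputs": [], "source": [ "# dict.get(word, word) returns the expansion if present, else the word itself\n", "' '.join(dict_chatwords.get(word, word) for word in mystr.split())" ] },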
"40839642", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "30f62553", "metadata": {}, "source": [ "## d. Handle Emojis\n", "- We come across lots and lots of emojis while scraping comments/posts from social media websites like Facebook, Instagram, Whatsapp, Twitter, LinkedIn, which needs to be removed from text.\n", "- Machine Learrning algorithm cannot understand emojis, so we have two options:\n", " - Simply remove the emojis from the text data, and this can be done using `clean-text` library\n", " - Replace the emoji with its meaning happy, sad, angry,....\n" ] }, { "cell_type": "markdown", "id": "620bf235", "metadata": {}, "source": [ "### (i) Remove Emojis" ] }, { "cell_type": "code", "execution_count": 19, "id": "b4294457", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'These emojis needs to be removed, there is a huge list...πŸ˜ƒπŸ˜¬πŸ˜‚πŸ˜…πŸ˜‡πŸ˜‰πŸ˜ŠπŸ˜œπŸ˜ŽπŸ€—πŸ™„πŸ€”πŸ˜‘πŸ˜€πŸ˜­πŸ€ πŸ€‘πŸ€«πŸ’©πŸ˜ˆπŸ‘»πŸ™ŒπŸ‘βœŒοΈπŸ‘ŒπŸ™'" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mystr = \"These emojis needs to be removed, there is a huge list...πŸ˜ƒπŸ˜¬πŸ˜‚πŸ˜…πŸ˜‡πŸ˜‰πŸ˜ŠπŸ˜œπŸ˜ŽπŸ€—πŸ™„πŸ€”πŸ˜‘πŸ˜€πŸ˜­πŸ€ πŸ€‘πŸ€«πŸ’©πŸ˜ˆπŸ‘»πŸ™ŒπŸ‘βœŒοΈπŸ‘ŒπŸ™\"\n", "mystr" ] }, { "cell_type": "code", "execution_count": 20, "id": "8be07245", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "These emojis needs to be removed, there is a huge list...\n" ] } ], "source": [ "import re\n", " \n", "emoji_pattern = re.compile(\"[\"\n", " u\"\\U0001F600-\\U0001F64F\" # code range for emoticons\n", " u\"\\U0001F300-\\U0001F5FF\" # code range for symbols & pictographs\n", " u\"\\U0001F680-\\U0001F6FF\" # code range for transport & map symbols\n", " u\"\\U0001F1E0-\\U0001F1FF\" # code range for flags (iOS)\n", " u\"\\U00002700-\\U000027BF\" # code range for Dingbats\n", " u\"\\U00002500-\\U00002BEF\" # code range for chinese char\n", " u\"\\U00002702-\\U000027B0\"\n", " u\"\\U00002702-\\U000027B0\"\n", " u\"\\U000024C2-\\U0001F251\"\n", " u\"\\U0001f926-\\U0001f937\"\n", " u\"\\U00010000-\\U0010ffff\"\n", " u\"\\u2640-\\u2642\" \n", " u\"\\u2600-\\u2B55\"\n", " u\"\\u200d\"\n", " u\"\\u23cf\"\n", " u\"\\u23e9\"\n", " u\"\\u231a\"\n", " u\"\\ufe0f\" \n", " u\"\\u3030\"\n", " \"]+\", flags=re.UNICODE)\n", "\n", "print(emoji_pattern.sub(r'', mystr)) # no emoji\n" ] }, { "cell_type": "code", "execution_count": null, "id": "dbbf5c6c", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "42e7e881", "metadata": {}, "source": [ "### (ii) Replace Emojis with their Meanings" ] }, { "cell_type": "code", "execution_count": 21, "id": "059582dd", "metadata": {}, "outputs": [], "source": [ "import sys\n", "!{sys.executable} -m pip install -q emoji" ] }, { "cell_type": "code", "execution_count": 22, "id": "0a02e204", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'This is :thumbs_up:'" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import emoji\n", "mystr = \"This is πŸ‘\"\n", "emoji.demojize(mystr)" ] }, { "cell_type": "code", "execution_count": 23, "id": "7c6992f2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'I am :thinking_face:'" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mystr = \"I am πŸ€”\"\n", "emoji.demojize(mystr)" ] }, { "cell_type": "code", "execution_count": 24, "id": "1f4551af", "metadata": {}, "outputs": [ { "data": { "text/plain": 
[ "'This is positive'" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mystr = \"This is πŸ‘\"\n", "emoji.replace_emoji(mystr, replace='positive')" ] }, { "cell_type": "code", "execution_count": null, "id": "bdb98ff1", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "1037dc1e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "181c15ca", "metadata": {}, "source": [ "## e. Spelling Correction\n", "- Most of the times the text data you have contain spelling errors, which if not corrected the same word may be represented in two or may be more different ways.\n", "- Almost all word editors, today underline incorrectly typed words and provide you possible correct options\n", "- So spelling correction is a two step task:\n", " - Detection of spelling errors\n", " - Correction of spelling errors\n", " - Autocorrect as you type space\n", " - Suggest a single correct word\n", " - Suggest a list of words (from which you can choose one)\n", "- Types of spelling errors:\n", " - **Non-word Errors:** are non-dictionary words or words that do not exist in the language dictionary. For example instead of typing `reading` the user typed `reeding`. These are easy to detect as they do not exist in the language dictionary and can be corrected using algorithms like shortest weighted edit distance and highest noisy channel probability.\n", " - **Real-word Errors:** are dictionary words and are hard to detect. These can be of two types:\n", " - Typographical errors: For example instead of typing `great` the user typed `greet`\n", " - Cognitive errors (homophones: For example instead of typing `two` the user typed `too`\n", "\n", "\n", "
\"I am reeding thiss gret boook on deta sciance suject, which is a greet curse\"
" ] }, { "cell_type": "code", "execution_count": null, "id": "db931961", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "8a9985af", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 25, "id": "458947c8", "metadata": {}, "outputs": [], "source": [ "import sys\n", "!{sys.executable} -m pip install -q textblob" ] }, { "cell_type": "code", "execution_count": 26, "id": "1362f338", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'0.17.1'" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import textblob\n", "textblob.__version__" ] }, { "cell_type": "code", "execution_count": 27, "id": "89720d4f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "textblob.blob.TextBlob" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from textblob import TextBlob\n", "mystr = \"I am reeding thiss gret boook on deta sciance suject, which is a greet curse\"\n", "blob = TextBlob(mystr)\n", "type(blob)" ] }, { "cell_type": "code", "execution_count": 28, "id": "af745e1d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['__add__', '__class__', '__contains__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_cmpkey', '_compare', '_create_sentence_objects', '_strkey', 'analyzer', 'classifier', 'classify', 'correct', 'detect_language', 'ends_with', 'endswith', 'find', 'format', 'index', 'join', 'json', 'lower', 'ngrams', 'noun_phrases', 'np_counts', 'np_extractor', 'parse', 'parser', 'polarity', 'pos_tagger', 'pos_tags', 'raw', 'raw_sentences', 'replace', 'rfind', 'rindex', 'sentences', 'sentiment', 'sentiment_assessments', 'serialized', 'split', 'starts_with', 'startswith', 'string', 'strip', 'stripped', 'subjectivity', 'tags', 'title', 'to_json', 'tokenize', 'tokenizer', 'tokens', 'translate', 'translator', 'upper', 'word_counts', 'words']\n" ] } ], "source": [ "print(dir(blob))" ] }, { "cell_type": "code", "execution_count": 29, "id": "f4813310", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'I am reading this great book on data science subject, which is a greet curse'" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "blob.correct().string" ] }, { "cell_type": "markdown", "id": "01d58342", "metadata": {}, "source": [ ">- The non-word errors like `reeding`, `this`, `gret`, `boook`, `deta`, `sciance` and `suject` have been corrected by `blob.correct()` method\n", ">- However, the real word errors like `greet` and `curse` are not corrected" ] }, { "cell_type": "code", "execution_count": null, "id": "c6ed0157", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "6d0b85cf", "metadata": {}, "source": [ "**Let us try to understand how `textblob.correct()` method do this?**" ] }, { "cell_type": "code", "execution_count": 30, "id": "28586b2a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "WordList(['I', 'am', 'reeding', 'thiss', 'gret', 'boook', 'on', 'deta', 'sciance', 'suject', 'which', 'is', 'a', 'greet', 'curse'])" ] }, "execution_count": 30, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "# The word attribute of textblob object returns list of words in the text\n", "blob.words" ] }, { "cell_type": "code", "execution_count": 31, "id": "4f7dd8a1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('reading', 0.7651006711409396),\n", " ('feeding', 0.10067114093959731),\n", " ('heeding', 0.053691275167785234),\n", " ('rending', 0.026845637583892617),\n", " ('breeding', 0.026845637583892617),\n", " ('receding', 0.013422818791946308),\n", " ('reeling', 0.006711409395973154),\n", " ('needing', 0.006711409395973154)]" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Word.spellcheck() method returns a list of (word, confidence) tuples with spelling suggestions\n", "# 'reeding'\n", "blob.words[2].spellcheck()" ] }, { "cell_type": "code", "execution_count": 32, "id": "cea1786c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('book', 0.946969696969697), ('brook', 0.05303030303030303)]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Word.spellcheck() method returns a list of (word, confidence) tuples with spelling suggestions\n", "# 'boook'\n", "blob.words[5].spellcheck()" ] }, { "cell_type": "code", "execution_count": 33, "id": "782b98fe", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('greet', 1.0)]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Word.spellcheck() method returns a list of (word, confidence) tuples with spelling suggestions\n", "# 'greet'\n", "blob.words[13].spellcheck()" ] }, { "cell_type": "code", "execution_count": null, "id": "3cf78a6c", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "b86328b4", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "666d6448", "metadata": {}, "source": [ "## f. Tokenize Text\n", "\n", "\n", "\n", "- **What is Tokenization:** Tokenization is a process of splitting text into meaningful segments called tokens. It can be character level, subword level, word level (unigram), two word level (bigram), three word level (trigram), and sentence level.\n", "- **Why to do Tokenization:** For classification of a product review as positive or negative, we may need to count the number of positive words and compare them with the count of negative words in the text of that review. For this we first need to tokenize the text of the product review. Tokens are the basic uilding locks of a document oject. Everything that helps us understand the meaning of the text is derived from tokens and their relationship to one another.\n", "- **How to do Tokenization:** In a sentence you may come across following four items:\n", " - **Prefix**:\tCharacter(s) at the beginning ▸ `( β€œ $ Rs Dr`\n", " - **Suffix**:\tCharacter(s) at the end ▸ `km ) , . ! ”`\n", " - **Infix**:\tCharacter(s) in between ▸ `- -- / ...`\n", " - **Exception**: Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied. From `L.A.!` the exclamation mark (!) 
{ "cell_type": "markdown", "id": "666d6448", "metadata": {}, "source": [ "## f. Tokenize Text\n", "\n", "\n", "\n", "- **What is Tokenization:** Tokenization is a process of splitting text into meaningful segments called tokens. It can be character level, subword level, word level (unigram), two word level (bigram), three word level (trigram), and sentence level.\n", "- **Why to do Tokenization:** For classification of a product review as positive or negative, we may need to count the number of positive words and compare them with the count of negative words in the text of that review. For this we first need to tokenize the text of the product review. Tokens are the basic building blocks of a document object. Everything that helps us understand the meaning of the text is derived from tokens and their relationship to one another.\n", "- **How to do Tokenization:** In a sentence you may come across the following four items:\n", " - **Prefix**:\tCharacter(s) at the beginning ▸ `( “ $ Rs Dr`\n", " - **Suffix**:\tCharacter(s) at the end ▸ `km ) , . ! ”`\n", " - **Infix**:\tCharacter(s) in between ▸ `- -- / ...`\n", " - **Exception**: Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied. From `L.A.!` the exclamation mark (!) is separated, while `L.A.` is not split" ] }, { "cell_type": "code", "execution_count": null, "id": "14543dbd", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "e41f20c8", "metadata": {}, "source": [ "### (i) Tokenization with `string.split()` Method\n", "- The easiest way to tokenize is to use the `mystr.split()` method, which returns a list of strings.\n", "- The `mystr.split()` method splits a string into a list of strings at every occurrence of a space character by default, and discards empty strings from the result.\n", "- You may pass a parameter such as `sep=','` to the split method to make it split at that specific character instead.\n", "- Its limitation is that it does not consider punctuation symbols as separate tokens" ] }, { "cell_type": "code", "execution_count": 34, "id": "6edf5df5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Learning', 'is', 'fun', 'with', 'Arif']\n" ] } ], "source": [ "mystr=\"Learning is fun with Arif\" \n", "print(mystr.split())" ] }, { "cell_type": "code", "execution_count": 35, "id": "262a4014", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['This', 'example', 'is', 'great!']\n" ] } ], "source": [ "mystr=\"This example is great!\" \n", "print(mystr.split())" ] }, { "cell_type": "markdown", "id": "d9e7855c", "metadata": {}, "source": [ "> Observe the output: the exclamation symbol has become part of the token great (which is wrong)" ] }, { "cell_type": "code", "execution_count": null, "id": "accd7219", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "46c0bb69", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "4f65b7fe", "metadata": {}, "source": [ "### (ii) Tokenization with `re.split()` Method\n", "- The `re.split()` method splits the source string by the occurrences of the pattern, returning a list containing the resulting substrings." ] }, { "cell_type": "code", "execution_count": 36, "id": "78f6ce39", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['This', 'example', 'is', 'great', '']" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "mystr=\"This example is great!\" \n", "pattern = re.compile(r'\W+')\n", "pattern.split(mystr)" ] }, { "cell_type": "markdown", "id": "5a5b8f71", "metadata": {}, "source": [ ">- The exclamation symbol is not part of the token great, but what if I need that symbol as a separate token? (See the sketch below.)\n", ">- Moreover, you need to write a different regular expression for each different scenario" ] }, { "cell_type": "code", "execution_count": null, "id": "b1cf7d66", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "1cf29a05", "metadata": {}, "outputs": [], "source": [] },
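{ "cell_type": "markdown", "id": "ad0c000d", "metadata": {}, "source": [ "One way to answer that question with plain regular expressions (a minimal sketch): `re.findall()` with an alternation that matches either a run of word characters or a single non-space, non-word character keeps each punctuation mark as its own token." ] }, { "cell_type": "code", "execution_count": null, "id": "ad0c000e", "metadata": {}, "outputs": [], "source": [ "import re\n", "# \w+ grabs word runs; [^\w\s] grabs each punctuation mark separately\n", "re.findall(r'\w+|[^\w\s]', 'This example is great!')" ] },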
\n", "- NLTK provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.\n", "- NLTK fully supports the English language, but others like Spanish or French are not supported as extensively.\n", "- It is a string processing libbrary, i.e., you give a string as input and get a string as output\n", "- There are. different tokenizer available in nltk:\n", " - `nltk.tokenize.sent_tokenize(str)` for sentence tokenization\n", " - `nltk.tokenize.word_tokenize(str)` for word tokenization\n", " - `nltk.tokenize.treebank.TreebankWordTokenizer(str)`" ] }, { "cell_type": "code", "execution_count": 37, "id": "fb11801b", "metadata": {}, "outputs": [], "source": [ "import sys\n", "!{sys.executable} -m pip install -q nltk" ] }, { "cell_type": "code", "execution_count": 38, "id": "ea7d7775", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'3.7'" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import nltk\n", "nltk.__version__" ] }, { "cell_type": "code", "execution_count": 39, "id": "bd2a097d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['This', 'example', 'is', 'great', '!']\n" ] } ], "source": [ "from nltk.tokenize import word_tokenize, sent_tokenize\n", "mystr=\"This example is great!\" \n", "print(word_tokenize(mystr))" ] }, { "cell_type": "markdown", "id": "d8481ee7", "metadata": {}, "source": [ "> Observe the output, this time the exclamation symbol is kept as a separate tokens." ] }, { "cell_type": "code", "execution_count": 40, "id": "16a08c24", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['You', 'should', 'do', 'your', 'Ph.D', 'in', 'A.I', '!']\n" ] } ], "source": [ "mystr=\"You should do your Ph.D in A.I!\" \n", "print(word_tokenize(mystr))" ] }, { "cell_type": "code", "execution_count": 41, "id": "4c34b5bc", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['You', 'should', \"'ve\", 'sent', 'me', 'an', 'email', 'at', 'arif', '@', 'pucit.edu.pk', 'or', 'vist', 'http', ':', '//www/arifbutt.me']\n" ] } ], "source": [ "mystr=\"You should've sent me an email at arif@pucit.edu.pk or vist http://www/arifbutt.me\"\n", "print(word_tokenize(mystr))" ] }, { "cell_type": "code", "execution_count": 42, "id": "d01562a0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Here', \"'s\", 'an', 'example', 'worth', '$', '100', '.', 'I', 'am', '384400km', 'away', 'from', 'earth', \"'s\", 'moon', '!']\n" ] } ], "source": [ "mystr=\"Here's an example worth $100. I am 384400km away from earth's moon!\" \n", "print(word_tokenize(mystr))" ] }, { "cell_type": "code", "execution_count": null, "id": "c8673997", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "7ddbac7d", "metadata": {}, "source": [ "### (iv) Tokenization with spaCy\n", "- **spaCy** (https://spacy.io/) is an open-source Natural Language Processing library designed to handle NLP tasks with the most efficient and state of the art algorithm, released in 2015. \n", "- Spacy support many languages (over 65) where you can perform tokenizing, however, for this other than importing spacy, you have to load the appropriate library using spacy.load() method. 
{ "cell_type": "markdown", "id": "7ddbac7d", "metadata": {}, "source": [ "### (iv) Tokenization with spaCy\n", "- **spaCy** (https://spacy.io/) is an open-source Natural Language Processing library designed to handle NLP tasks with efficient, state-of-the-art algorithms, released in 2015. \n", "- spaCy supports many languages (over 65) in which you can perform tokenization; however, besides importing spacy, you have to load the appropriate pipeline using the spacy.load() method. Before that, make sure you have downloaded the model on your system.\n", "- spaCy will isolate punctuation that does *not* form an integral part of a word. Quotation marks, commas, and punctuation at the end of a sentence will be assigned their own token. However, punctuation that exists as part of an email address, website or numerical value will be kept as part of the token.\n", "\n", "- **Download spacy model for English language**\n", " - Spacy comes with pretrained models and pipelines for different languages.\n", " - You can download any of the following models for the English language, but it is better to download the small one, as the larger ones require considerable disk space and may take time to download:\n", " - en_core_web_sm\n", " - en_core_web_md\n", " - en_core_web_lg\n", " - en_core_web_trf\n", " - The model name consists of four parts:\n", " - Language (en): The language abbreviation can be `en` for English, `fr` for French, `zh` for Chinese\n", " - Type (core/dep): It can be core for a general-purpose pipeline with tagging, parsing, lemmatization and NER recognition. It can be dep for only tagging, parsing and lemmatization\n", " - Genre (web/news): It means the type of text the pipeline is trained on, e.g., web or news. \n", " - Size: Package size indicator. `sm` for small, `md` for medium, `lg` for large and `trf` for transformer\n", " - Package version (a.b.c): Here a is the major version of spaCy, b is the minor version of spaCy, while c is the model version, which depends on the data on which the model was trained, its parameters, number of iterations and different vectors.\n", " \n", "> For details read spaCy101: https://spacy.io/usage/spacy-101" ] }, { "cell_type": "code", "execution_count": 43, "id": "ed9f8834", "metadata": {}, "outputs": [], "source": [ "import sys\n", "!{sys.executable} -m pip install -q spacy" ] }, { "cell_type": "code", "execution_count": 44, "id": "5317e5b7", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/arif/opt/anaconda3/envs/python10/lib/python3.10/site-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. 
See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", " from .autonotebook import tqdm as notebook_tqdm\n" ] }, { "data": { "text/plain": [ "'3.4.1'" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import spacy\n", "spacy.__version__" ] }, { "cell_type": "markdown", "id": "ed673218", "metadata": {}, "source": [ "**Download spacy model for English language**" ] }, { "cell_type": "code", "execution_count": 45, "id": "4cfeffff", "metadata": {}, "outputs": [], "source": [ "import ssl\n", "ssl._create_default_https_context = ssl._create_unverified_context" ] }, { "cell_type": "code", "execution_count": 46, "id": "1e336a36", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting en-core-web-sm==3.4.0\n", " Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl (12.8 MB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m12.8/12.8 MB\u001b[0m \u001b[31m6.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0mm\n", "\u001b[?25hRequirement already satisfied: spacy<3.5.0,>=3.4.0 in /Users/arif/opt/anaconda3/envs/python10/lib/python3.10/site-packages (from en-core-web-sm==3.4.0) (3.4.1)\n", "Requirement already satisfied: thinc<8.2.0,>=8.1.0 in /Users/arif/opt/anaconda3/envs/python10/lib/python3.10/site-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (8.1.1)\n", "Requirement already satisfied: typer<0.5.0,>=0.3.0 in /Users/arif/opt/anaconda3/envs/python10/lib/python3.10/site-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (0.4.2)\n", "Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /Users/arif/opt/anaconda3/envs/python10/lib/python3.10/site-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (3.0.7)\n", "Requirement already satisfied: pathy>=0.3.5 in /Users/arif/opt/anaconda3/envs/python10/lib/python3.10/site-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (0.6.2)\n", "Requirement already satisfied: setuptools in /Users/arif/opt/anaconda3/envs/python10/lib/python3.10/site-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (58.0.4)\n", "Requirement already satisfied: wasabi<1.1.0,>=0.9.1 in /Users/arif/opt/anaconda3/envs/python10/lib/python3.10/site-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (0.10.1)\n", "Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /Users/arif/opt/anaconda3/envs/python10/lib/python3.10/site-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (1.0.8)\n", "Requirement already satisfied: pydantic!=1.8,!=1.8.1,<1.10.0,>=1.7.4 in /Users/arif/opt/anaconda3/envs/python10/lib/python3.10/site-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (1.9.2)\n", "Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.9 in /Users/arif/opt/anaconda3/envs/python10/lib/python3.10/site-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (3.0.10)\n", "Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /Users/arif/opt/anaconda3/envs/python10/lib/python3.10/site-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (4.64.1)\n", "Requirement already satisfied: jinja2 in /Users/arif/opt/anaconda3/envs/python10/lib/python3.10/site-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (3.0.2)\n", "Requirement already satisfied: langcodes<4.0.0,>=3.2.0 in /Users/arif/opt/anaconda3/envs/python10/lib/python3.10/site-packages (from 
spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (3.3.0)\n", "Requirement already satisfied: requests<3.0.0,>=2.13.0 in /Users/arif/opt/anaconda3/envs/python10/lib/python3.10/site-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (2.28.1)\n", "Requirement already satisfied: spacy-loggers<2.0.0,>=1.0.0 in /Users/arif/opt/anaconda3/envs/python10/lib/python3.10/site-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (1.0.3)\n", "Requirement already satisfied: packaging>=20.0 in /Users/arif/opt/anaconda3/envs/python10/lib/python3.10/site-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (21.3)\n", "Requirement already satisfied: srsly<3.0.0,>=2.4.3 in /Users/arif/opt/anaconda3/envs/python10/lib/python3.10/site-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (2.4.4)\n", "Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /Users/arif/opt/anaconda3/envs/python10/lib/python3.10/site-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (2.0.6)\n", "Requirement already satisfied: catalogue<2.1.0,>=2.0.6 in /Users/arif/opt/anaconda3/envs/python10/lib/python3.10/site-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (2.0.8)\n", "Requirement already satisfied: numpy>=1.15.0 in /Users/arif/opt/anaconda3/envs/python10/lib/python3.10/site-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (1.22.4)\n", "Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /Users/arif/opt/anaconda3/envs/python10/lib/python3.10/site-packages (from packaging>=20.0->spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (3.0.4)\n", "Requirement already satisfied: smart-open<6.0.0,>=5.2.1 in /Users/arif/opt/anaconda3/envs/python10/lib/python3.10/site-packages (from pathy>=0.3.5->spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (5.2.1)\n", "Requirement already satisfied: typing-extensions>=3.7.4.3 in /Users/arif/opt/anaconda3/envs/python10/lib/python3.10/site-packages (from pydantic!=1.8,!=1.8.1,<1.10.0,>=1.7.4->spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (3.10.0.2)\n", "Requirement already satisfied: certifi>=2017.4.17 in /Users/arif/opt/anaconda3/envs/python10/lib/python3.10/site-packages (from requests<3.0.0,>=2.13.0->spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (2020.6.20)\n", "Requirement already satisfied: idna<4,>=2.5 in /Users/arif/opt/anaconda3/envs/python10/lib/python3.10/site-packages (from requests<3.0.0,>=2.13.0->spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (3.3)\n", "Requirement already satisfied: urllib3<1.27,>=1.21.1 in /Users/arif/opt/anaconda3/envs/python10/lib/python3.10/site-packages (from requests<3.0.0,>=2.13.0->spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (1.26.12)\n", "Requirement already satisfied: charset-normalizer<3,>=2 in /Users/arif/opt/anaconda3/envs/python10/lib/python3.10/site-packages (from requests<3.0.0,>=2.13.0->spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (2.1.1)\n", "Requirement already satisfied: blis<0.10.0,>=0.7.8 in /Users/arif/opt/anaconda3/envs/python10/lib/python3.10/site-packages (from thinc<8.2.0,>=8.1.0->spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (0.9.1)\n", "Requirement already satisfied: confection<1.0.0,>=0.0.1 in /Users/arif/opt/anaconda3/envs/python10/lib/python3.10/site-packages (from thinc<8.2.0,>=8.1.0->spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (0.0.1)\n", "Requirement already satisfied: click<9.0.0,>=7.1.1 in /Users/arif/opt/anaconda3/envs/python10/lib/python3.10/site-packages (from typer<0.5.0,>=0.3.0->spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (8.1.3)\n", "Requirement already satisfied: MarkupSafe>=2.0 in 
/Users/arif/opt/anaconda3/envs/python10/lib/python3.10/site-packages (from jinja2->spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (2.0.1)\n", "\u001b[38;5;2m✔ Download and installation successful\u001b[0m\n", "You can now load the package via spacy.load('en_core_web_sm')\n" ] } ], "source": [ "import sys\n", "!{sys.executable} -m spacy download en_core_web_sm" ] }, { "cell_type": "code", "execution_count": null, "id": "6d24b3f4", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "f8a39956", "metadata": {}, "source": [ "**Example 1:**" ] }, { "cell_type": "code", "execution_count": 2, "id": "e1eef602", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "' , A , 7 , km , Uber , cab , ride , from , Gulberg , to , Joher , Town , will , cost , you , $ , 20 , " ] } ], "source": [ "# import spacy and load the language library\n", "import spacy\n", "nlp = spacy.load('en_core_web_lg')\n", "\n", "mystr=\"'A 7km Uber cab ride from Gulberg to Joher Town will cost you $20\" \n", "doc = nlp(mystr)\n", "\n", "for token in doc:\n", " print(token, end=' , ')" ] }, { "cell_type": "markdown", "id": "b6b29abc", "metadata": {}, "source": [ "> Note that spacy has successfully tokenized the distance unit (km), which nltk failed to separate." ] }, { "cell_type": "code", "execution_count": null, "id": "448839ee", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "8c3a09da", "metadata": {}, "source": [ "**Example 2:**" ] }, { "cell_type": "code", "execution_count": 48, "id": "0baf9f37", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "You , should , 've , sent , me , an , email , at , arif@pucit.edu.pk , or , visit , http://www , / , arifbutt.me , " ] } ], "source": [ "# import spacy and load the language library\n", "import spacy\n", "nlp = spacy.load('en_core_web_sm')\n", "\n", "mystr=\"You should've sent me an email at arif@pucit.edu.pk or visit http://www/arifbutt.me\"\n", "doc = nlp(mystr)\n", "\n", "for token in doc:\n", " print(token, end=' , ')" ] }, { "cell_type": "markdown", "id": "2c46130c", "metadata": {}, "source": [ ">- Note that spacy has kept the email as a single token, while nltk separated it.\n", ">- However, spacy also failed to properly tokenize the URL :(" ] }, { "cell_type": "markdown", "id": "18cb5cc0", "metadata": {}, "source": [ "**Additional Token Attributes:** Once the string is passed to the `nlp()` method of spacy, the tokens of the resulting `doc` object have many other associated attributes other than just tokens:\n", "\n", "|Tag|Description\n", "|:------|:------:\n", "|`.text`|The original word text\n", "|`.lemma_`|The base form of the word\n", "|`.pos_`|The simple part-of-speech tag\n", "|`.tag_`|The detailed part-of-speech tag\n", "|`.shape_`|The word shape – capitalization, punctuation, digits\n", "|`.is_alpha`, `is_ascii`, `is_digit`|Token text consists of alphanumeric characters, ASCII characters, digits\n", "|`.is_lower`, `is_upper`, `is_title`|Token text is in lowercase, uppercase, titlecase\n", "|`.is_punct`, `is_space`, `is_stop`|Token is punctuation, whitespace, stopword" ] }, { "cell_type": "code", "execution_count": null, "id": "4e5bf4d8", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "d0c0d692", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "b1d2ca27", "metadata": {}, "outputs": [], "source": [] },
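{ "cell_type": "markdown", "id": "ad0c0011", "metadata": {}, "source": [ "A quick sketch showing a few of the attributes from the table on the tokens of a small document (any short sentence will do):" ] }, { "cell_type": "code", "execution_count": null, "id": "ad0c0012", "metadata": {}, "outputs": [], "source": [ "import spacy\n", "nlp = spacy.load('en_core_web_sm')\n", "doc = nlp(\"Learning is fun with Arif!\")\n", "for token in doc:\n", "    # text, lemma, coarse POS tag, and two of the boolean flags\n", "    print(f\"{token.text:10} {token.lemma_:10} {token.pos_:6} punct={token.is_punct} stop={token.is_stop}\")" ] },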
"metadata": {}, "source": [ "## g. Creating N-grams\n", "- **What are n-grams?** \n", " - A sequence of n words, can be bigram, trigram,....\n", "- **Why to use n-grams?** \n", " - Capture contextual information (`good food` carries more meaning than just `good` and `food` when observed independently)\n", " - Applications of N-grams:\n", " - Sentence Completion\n", " - Auto Spell Check and correction\n", " - Auto Grammer Check and correction\n", " - Is there a perfect value of n?\n", " - Different types of n-grams are suitable for different types of applications. You should try different n-grams on your data in order to confidently conclude which one works the best among all for your text analysis. " ] }, { "cell_type": "code", "execution_count": null, "id": "ec36da1d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "1b05d832", "metadata": {}, "source": [ "- **How to create n-grams?** " ] }, { "cell_type": "code", "execution_count": 49, "id": "f32e3256", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "('Allama', 'Iqbal')\n", "('Iqbal', 'was')\n", "('was', 'a')\n", "('a', 'visionary')\n", "('visionary', 'philosopher')\n", "('philosopher', 'and')\n", "('and', 'politician')\n", "('politician', '.')\n", "('.', 'Thank')\n", "('Thank', 'you')\n" ] } ], "source": [ "import nltk\n", "mystr = \"Allama Iqbal was a visionary philosopher and politician. Thank you\"\n", "tokens = nltk.tokenize.word_tokenize(mystr)\n", "bgs = nltk.bigrams(tokens)\n", "print(bgs)\n", "for grams in bgs:\n", " print(grams)" ] }, { "cell_type": "markdown", "id": "9f6b31af", "metadata": {}, "source": [ ">- The formula to calculate the count of n-grams in a document is: **`X - N + 1`**, where `X` is the number of words in a given document and `N` is the number of words in n-gram\n", "\\begin{equation}\n", " \\text{Count of N-grams} \\hspace{0.5cm} = \\hspace{0.5cm} 11 - 2 + 1 \\hspace{0.5cm} = \\hspace{0.5cm} 10\n", "\\end{equation}\n" ] }, { "cell_type": "code", "execution_count": 50, "id": "df993560", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('Allama', 'Iqbal', 'was')\n", "('Iqbal', 'was', 'a')\n", "('was', 'a', 'visionary')\n", "('a', 'visionary', 'philosopher')\n", "('visionary', 'philosopher', 'and')\n", "('philosopher', 'and', 'politician')\n", "('and', 'politician', '.')\n", "('politician', '.', 'Thank')\n", "('.', 'Thank', 'you')\n" ] } ], "source": [ "tgs = nltk.trigrams(tokens)\n", "for grams in tgs:\n", " print(grams)" ] }, { "cell_type": "markdown", "id": "43de30a5", "metadata": {}, "source": [ "\\begin{equation}\n", " \\text{Count of N-grams} \\hspace{0.5cm} = \\hspace{0.5cm} 11 - 3 + 1 \\hspace{0.5cm} = \\hspace{0.5cm} 9\n", "\\end{equation}\n" ] }, { "cell_type": "code", "execution_count": 51, "id": "30e32325", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('Allama', 'Iqbal', 'was', 'a')\n", "('Iqbal', 'was', 'a', 'visionary')\n", "('was', 'a', 'visionary', 'philosopher')\n", "('a', 'visionary', 'philosopher', 'and')\n", "('visionary', 'philosopher', 'and', 'politician')\n", "('philosopher', 'and', 'politician', '.')\n", "('and', 'politician', '.', 'Thank')\n", "('politician', '.', 'Thank', 'you')\n" ] } ], "source": [ "ngrams = nltk.ngrams(tokens, 4)\n", "for grams in ngrams:\n", " print(grams)" ] }, { "cell_type": "code", "execution_count": null, "id": "cef11306", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 
null, "id": "f1591bca", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "94571d33", "metadata": {}, "source": [ "## h. Stopwords Removal\n", "- Stopwords are extremely common words of a language having very little meanings, and it is usually safe to remove them and not consider them as important for later processing of our data.\n", "- Every language has its own set of stopwords. For example, some stopwords of English language are: the, a, an, was, were, at, will, on, in, from, to, me, you, yours,....\n", "- Whether you should remove stop words from your text or not mainly depends on the problem you are solving.\n", "- Remove stop words from your text if you are working on:\n", " - Text Classification (Spam Filtering, Language Classification, Genre Classification)\n", " - Caption Generation\n", " - Auto-Tag Generation\n", "- Avoid removing stop words from your text if you are working on:\n", " - Machine Translation\n", " - Language Modeling\n", " - Text Summarization\n", " - Question-Answering problems" ] }, { "cell_type": "code", "execution_count": null, "id": "ed7ab8fc", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "75c1ad24", "metadata": {}, "source": [ "### (i) Using NLTK\n", "- The NLTK library has a defined set of stopwords for different languages like English. Here, we will focus on β€˜english’ stopwords. One can also consider additional stopwords if required\n", "- Note that there is no single universal list of stopwords. The list of the stop words can change depending on your problem statement\n", "- Once you install nltk, it just install the base library and do not install all the packages related to different languages, different tokenization schemes, etc. To install all the nltk packages and corpora use `nltk.download()`\n", "- An installation window will pop up. Select all and click β€˜Download’ to download and install the additional bundles. This will download all the dictionaries and other language and grammar data frames necessary for full NLTK functionality." 
] }, { "cell_type": "code", "execution_count": 52, "id": "53f5e0bd", "metadata": {}, "outputs": [], "source": [ "import ssl\n", "ssl._create_default_https_context = ssl._create_unverified_context" ] }, { "cell_type": "code", "execution_count": 53, "id": "4dbe631c", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package stopwords to /Users/arif/nltk_data...\n", "[nltk_data] Package stopwords is already up-to-date!\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import nltk\n", "nltk.download(\"stopwords\")\n", "# nltk.download()" ] }, { "cell_type": "markdown", "id": "d80496a3", "metadata": {}, "source": [ "> After completion of downloading, you can load the package of `stopwords` from the `nltk.corpus` and use it to load the stop words" ] }, { "cell_type": "code", "execution_count": null, "id": "a41327c8", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 54, "id": "3294b743", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'has', 'here', 'doesn', \"hasn't\", 'mustn', 'further', \"shan't\", 'for', \"needn't\", 'not', 'than', 'am', 'isn', 'our', 'been', 'with', 'through', 'now', 'ourselves', 'themselves', 'these', 'from', 'its', \"that'll\", 'how', 'until', 'who', 'both', 'couldn', 'then', \"you've\", 'ma', 'wasn', 'of', 'same', \"doesn't\", 'don', \"it's\", 'in', 've', 'very', 'himself', 'again', 'on', 'them', 'there', 'because', \"you're\", 'wouldn', 'some', 'too', 'hadn', 'the', 'just', 'are', \"hadn't\", 'to', 'had', 'when', 'needn', 'other', 'hers', 'be', \"shouldn't\", 'mightn', \"won't\", 'whom', 'own', 'should', 'after', 'yours', 'being', 'as', 'nor', 'down', 'more', 'before', \"mustn't\", 'it', \"wouldn't\", 'will', 'were', \"don't\", \"weren't\", 'myself', 'we', 'yourself', 'doing', 're', 'few', 'aren', \"haven't\", 'weren', 'he', 'by', 'at', 'didn', \"mightn't\", 'him', 'was', \"didn't\", \"you'll\", 'why', 'against', 'any', 'you', \"she's\", 'her', 'does', \"isn't\", 'can', 'those', 'herself', 'll', 'so', 'she', 'an', 'ain', \"couldn't\", 'yourselves', 'shouldn', 'd', 'off', 'no', \"wasn't\", \"you'd\", 'ours', 'once', 't', 'where', 'over', 'shan', 'under', 'all', 'about', 'do', 'itself', 'only', 'most', 'o', 'have', 'did', 'if', 'while', 'during', 'y', 'what', 'that', 'out', 'below', 'm', 'my', 'me', 'they', 'or', 'up', 'haven', 'your', 'such', 'hasn', 'into', 'won', 'but', 'and', 'a', 'this', \"aren't\", 's', 'their', 'theirs', 'having', 'which', 'i', 'is', 'above', 'between', 'his', \"should've\", 'each'}\n" ] } ], "source": [ "from nltk.corpus import stopwords\n", "stop_words = set(stopwords.words('english'))\n", "print(stop_words)" ] }, { "cell_type": "code", "execution_count": null, "id": "c23f76a8", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "32c2734e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 55, "id": "39ef9810", "metadata": {}, "outputs": [], "source": [ "def remove_stopwords(text):\n", " new_text = list()\n", " for word in text.split():\n", " if word not in stopwords.words('english'):\n", " new_text.append(word)\n", " return \" \".join(new_text)" ] }, { "cell_type": "markdown", "id": "88fa5ee1", "metadata": {}, "source": [ "**Removing Stopwords from Text of an Email**" ] }, { "cell_type": "code", "execution_count": 56, "id": "813eb042", 
"metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Your Google account compromised. Your account closed. Immediately click link update account'" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import nltk\n", "from nltk.corpus import stopwords\n", "\n", "mystr=\"Your Google account has been compromised. \\\n", " Your account will be closed. Immediately click this link to update your account\"\n", "remove_stopwords(mystr)" ] }, { "cell_type": "markdown", "id": "2c9b91e9", "metadata": {}, "source": [ "**Removing Stopwords for a Sentiment Analysis Application**" ] }, { "cell_type": "code", "execution_count": 57, "id": "2293ec5b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'This movie good'" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mystr=\"This movie is not good\"\n", "remove_stopwords(mystr)" ] }, { "cell_type": "markdown", "id": "37d0b2be", "metadata": {}, "source": [ ">- For sentiment analysis purposes, the overall meaning of the resulting sentence is positive, which is not at all the reality. So either do not remove sentiment analysis while doing sentiment analysis or handle the negation before removing stopwords " ] }, { "cell_type": "code", "execution_count": null, "id": "61521760", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "e374466e", "metadata": {}, "source": [ "### (ii) Using spaCy\n", "- **spaCy** (https://spacy.io/) is an open-source Natural Language Processing library designed to handle NLP tasks with the most efficient and state of the art algorithm, released in 2015. \n", "- Spacy support many languages (over 65) where you can perform tokenizing, however, for this other than importing spacy, you have to load the appropriate library using spacy.load() method. But before that make sure you have downloaded the model in your system.\n", "- **Download spacy model for English language:** Spacy comes with pretrained models and pipelines for different languages. 
We have already downloaded the pre-trained spacy model for English language\n", "> For details read spaCy101: https://spacy.io/usage/spacy-101" ] }, { "cell_type": "code", "execution_count": 58, "id": "62844349", "metadata": {}, "outputs": [], "source": [ "import spacy\n", "nlp = spacy.load('en_core_web_sm')" ] }, { "cell_type": "code", "execution_count": 59, "id": "e4b7db56", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "326\n", "{'nβ€˜t', 'whereas', 'yet', \"'d\", 'than', 'anyone', 'am', 'still', 'with', 'afterwards', 'anywhere', 'these', 'hence', 'hereupon', 'namely', 'else', 'get', 'using', \"n't\", 'fifty', 'five', 'same', 'please', \"'re\", '’ll', 'herein', 'since', 'empty', 'there', 'move', 'the', 'forty', 'hers', 'although', 'yours', 'third', 'though', 'sometimes', 'were', 'six', 'could', 'yourself', 'ever', 'him', 'against', 'seem', 'herself', 'so', 'every', 'β€˜re', 'somehow', 'where', 'also', 'amount', 'do', 'most', 'have', 'us', 'whenever', 'otherwise', 'never', 'former', 'next', 'out', 'become', 'formerly', 'or', 'make', 'into', 'but', 'beforehand', 'perhaps', 'each', 'has', 'bottom', 'ca', 'latterly', 'eight', \"'ve\", 'further', 'through', 'many', 'from', 'wherever', 'until', 'both', 'whereafter', 'must', 'then', 'however', 'of', 'mine', 'onto', 'anyway', 'on', 'back', 'cannot', 'ten', 'some', 'too', 'regarding', 'name', 'just', 'β€˜ll', 'are', 'when', 'other', 'three', 'be', 'would', 'towards', 'noone', 'whence', 'as', 'being', 'behind', 'down', 'more', 'anyhow', 'before', 'β€˜m', 'mostly', 'various', 'everywhere', \"'s\", 'beyond', 'we', 'take', 're', 'few', 'becoming', 'full', 'he', 'by', 'at', 'without', 'unless', 'none', 'any', 'does', 'her', 'done', 'nothing', 'whereupon', 'she', 'almost', 'used', '’s', 'side', 'off', 'no', 'whose', 'besides', 'seems', 'under', 'several', 'always', 'sometime', 'thereafter', \"'m\", 'such', 'and', 'a', 'is', 'really', 'someone', 'fifteen', 'for', 'not', \"'ll\", 'been', 'ourselves', 'themselves', 'eleven', 'how', 'might', 'who', 'thence', 'twenty', 'seemed', 'whole', 'least', 'in', 'β€˜ve', 'well', 'together', 'them', 'twelve', 'may', 'because', 'nowhere', 'thru', 'became', 'even', 'among', 'elsewhere', 'whom', 'own', 'after', 'enough', 'alone', 'it', 'was', 'whoever', 'quite', 'becomes', 'due', 'moreover', 'others', 'an', 'per', 'except', 'call', 'once', 'about', 'around', 'go', 'n’t', 'anything', 'hereafter', '’d', 'often', 'serious', 'up', 'show', 'amongst', 'between', 'his', 'here', 'either', 'our', 'nevertheless', 'now', 'one', 'its', 'see', 'first', 'thereby', 'upon', 'via', 'much', 'put', 'thus', 'very', 'therein', 'himself', 'four', 'again', 'say', 'neither', 'along', '’re', 'beside', 'something', 'less', 'part', 'to', 'had', 'last', 'two', 'should', 'everyone', 'nor', 'within', 'will', 'hereby', 'made', 'myself', 'β€˜d', 'doing', 'sixty', 'everything', 'throughout', 'another', 'hundred', 'why', 'toward', 'you', 'can', 'those', 'whither', '’m', 'across', 'somewhere', 'thereupon', 'yourselves', 'whether', 'meanwhile', 'β€˜s', 'give', 'already', 'seeming', 'ours', 'rather', 'over', 'front', 'keep', 'nobody', 'all', 'itself', 'only', 'did', 'if', 'while', 'during', 'what', 'that', 'below', 'top', 'my', 'me', 'they', 'wherein', 'latter', 'your', 'nine', 'this', 'indeed', 'whereby', 'their', '’ve', 'therefore', 'which', 'i', 'above', 'whatever'}\n" ] } ], "source": [ "# returns a set of around 326 English stopwords built into spaCy\n", "print(len(nlp.Defaults.stop_words))\n", 
"print(nlp.Defaults.stop_words)" ] }, { "cell_type": "code", "execution_count": 60, "id": "893c079e", "metadata": {}, "outputs": [], "source": [ "def remove_stopwords_spacy(text):\n", " new_text = list()\n", " for word in text.split():\n", " if word not in nlp.Defaults.stop_words:\n", " new_text.append(word)\n", " return \" \".join(new_text)" ] }, { "cell_type": "code", "execution_count": 61, "id": "e1cd4064", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'This sample text need remove stopwords'" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mystr=\"This is a sample text and we need to remove stopwords from it\"\n", "remove_stopwords_spacy(mystr)" ] }, { "cell_type": "code", "execution_count": null, "id": "d6ccb9f4", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "4cbe336f", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "9842b38f", "metadata": {}, "source": [ "**Add a stop word to the existing list of spaCy:**" ] }, { "cell_type": "code", "execution_count": 62, "id": "d16c7b15", "metadata": {}, "outputs": [], "source": [ "# Add the word to the set of stop words. Use lowercase!\n", "nlp.Defaults.stop_words.add('aka')\n", "\n", "# Set the stop_word tag on the lexeme\n", "nlp.vocab['aka'].is_stop = True" ] }, { "cell_type": "code", "execution_count": 63, "id": "778fdd3d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nlp.vocab['aka'].is_stop" ] }, { "cell_type": "code", "execution_count": 64, "id": "d91b6366", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "327" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(nlp.Defaults.stop_words)" ] }, { "cell_type": "code", "execution_count": null, "id": "6c28ebe3", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "1b256d07", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "db5e5831", "metadata": {}, "source": [ "**To remove a stop word:** Alternatively, you may decide that `'always'` should not be considered a stop word." 
] }, { "cell_type": "code", "execution_count": 65, "id": "a8f3c2cd", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nlp.vocab['aka'].is_stop" ] }, { "cell_type": "code", "execution_count": 66, "id": "c158d30e", "metadata": {}, "outputs": [], "source": [ "# Remove the word from the set of stop words\n", "nlp.Defaults.stop_words.remove('aka')\n", "\n", "# Remove the stop_word tag from the lexeme\n", "nlp.vocab['aka'].is_stop = False" ] }, { "cell_type": "code", "execution_count": 67, "id": "d008124c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nlp.vocab['aka'].is_stop" ] }, { "cell_type": "code", "execution_count": 68, "id": "07132ee5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "326" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(nlp.Defaults.stop_words)" ] }, { "cell_type": "code", "execution_count": null, "id": "00f237ac", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "b220d2f2", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "c9706ce8", "metadata": {}, "source": [ "# 3. Text Pre-Processing on IMDB Dataset\n", "- Dataset: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews" ] }, { "cell_type": "markdown", "id": "ea430cd4", "metadata": {}, "source": [ "## a. EDA on IMD Dataset" ] }, { "cell_type": "code", "execution_count": 69, "id": "3a25f6fd", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
reviewsentiment
0One of the other reviewers has mentioned that ...positive
1A wonderful little production. <br /><br />The...positive
2I thought this was a wonderful way to spend ti...positive
3Basically there's a family where a little boy ...negative
4Petter Mattei's \"Love in the Time of Money\" is...positive
\n", "
" ], "text/plain": [ " review sentiment\n", "0 One of the other reviewers has mentioned that ... positive\n", "1 A wonderful little production.

The... positive\n", "2 I thought this was a wonderful way to spend ti... positive\n", "3 Basically there's a family where a little boy ... negative\n", "4 Petter Mattei's \"Love in the Time of Money\" is... positive" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "df = pd.read_csv(\"./datasets/imdb-dataset.csv\")\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "211e2f32", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "c5a808df", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 70, "id": "6872a0a4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 50000 entries, 0 to 49999\n", "Data columns (total 2 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 review 50000 non-null object\n", " 1 sentiment 50000 non-null object\n", "dtypes: object(2)\n", "memory usage: 781.4+ KB\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "code", "execution_count": null, "id": "465318b5", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "cafe7f3c", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 71, "id": "03759f1e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "positive 25000\n", "negative 25000\n", "Name: sentiment, dtype: int64" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check the count of positive and negative reviews to ensure that the dataset is balanced\n", "df['sentiment'].value_counts()" ] }, { "cell_type": "code", "execution_count": null, "id": "d769aecf", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 72, "id": "fe2914a7", "metadata": {}, "outputs": [ { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAWAAAAFgCAYAAACFYaNMAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAAWz0lEQVR4nO3df/BddX3n8edLghR/QAUiiwkUVunWgDVuMinK7g6WjmSdaUELNkyRYJmJZcEp/bE70N2ptk5aWH8w1S20WCzBWiFFXdERKgtiW5cfRpc1BESzwkgkC0Ep4rbQBt/7x/l8y034Er4k3/v95Pv9Ph8zd+6573s+53xOcnlx8rnnfG6qCknSzHtB7w5I0nxlAEtSJwawJHViAEtSJwawJHWyoHcHZtrKlSvrhhtu6N0NSfNLJivOuzPgRx55pHcXJAmYhwEsSXsLA1iSOjGAJakTA1iSOjGAJakTA1iSOjGAJakTA1iSOjGAJakTA1iSOhlbACc5PMkXk9yTZFOSX2v19yT5bpI72+PNI20uTLI5yb1JThqpL0uysb33oSRp9f2SXNPqtyc5clzHI0nTbZxnwNuB36yqVwPHAecmWdLeu6SqlrbH5wHae6uAY4CVwKVJ9mnrXwasAY5uj5WtfjbwaFW9CrgEuHiMxyNJ02psAVxVW6vqa235ceAeYNEumpwMXF1VT1bVfcBmYEWSw4ADqurWGn7A7irglJE269rytcCJE2fHkrS3m5HpKNvQwOuA24HjgfOSnAlsYDhLfpQhnG8babal1f6pLe9cpz0/AFBV25M8BhwM7DDlWZI1DGfQHHHEEbt1DMv+41W71U57n6++78wZ3d93fu81M7o/jccRv7Nx2rc59i/hkrwE+CRwflX9gGE44ZXAUmAr8IGJVSdpXruo76rNjoWqy6tqeVUtX7hw4fM7AEkak7EGcJJ9GcL341X1KYCqeqiqnqqqHwEfAVa01bcAh480Xww82OqLJ6nv0CbJAuBA4PvjORpJml7jvAoiwBXAPVX1wZH6YSOrvQW4qy1fB6xqVzYcxfBl2x1VtRV4PMlxbZtnAp8ZabO6LZ8K3NzGiSVprzfOMeDjgbcDG5Pc2Wq/DZyeZCnDUMH9wDsBqmpTkvXA3QxXUJxbVU+1ducAVwL7A9e3BwwB/7EkmxnOfFeN8XgkaVqNLYCr6m+ZfIz287tosxZYO0l9A3DsJPUngNP2oJuS1I13wklSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJ2ML4CSHJ/liknuSbErya61+UJIbk3yrPb9spM2FSTYnuTfJSSP1ZUk2tvc+lCStvl+Sa1r99iRHjut4JGm6jfMMeDvwm1X1auA44NwkS4ALgJuq6mjgpvaa9t4q4BhgJXBpkn3ati4D1gBHt8fKVj8beLSqXgVcAlw8xuORpGk1tgCuqq1V9bW2/DhwD7AIOBlY11ZbB5zSlk8Grq6qJ6vqPmAzsCLJYcABVXVrVRVw1U5tJrZ1LXDixNmxJO3tZmQMuA0NvA64HTi0qrbCENLAy9tqi4AHRpptabVFbXnn+g5tqmo78Bhw8FgOQpKm2dgDOMlLgE8C51fVD3a16iS12kV9V2127sOaJBuSbNi2bdtzdVmSZsRYAzjJvgzh+/Gq+lQrP9SGFWjPD7f6FuDwkeaLgQdbffEk9R3aJFkAHAh8f+d+VNXlVbW8qpYvXLhwOg5NkvbYOK+CCHAFcE9VfXDkreuA1W15NfCZkfqqdmXDUQxftt3RhikeT3Jc2+aZO7WZ2NapwM1tnFiS9noLxrjt44G3AxuT3Nlqvw1cBKxPcjbwHeA0gKralGQ9cDfDFRTnVtVTrd05wJXA/sD17QFDwH8syWaGM99VYzweSZpWYwvgqvpbJh+jBTjxWdqsBdZOUt8AHDtJ/QlagEvSbOOdcJLUiQEsSZ0YwJLUiQEsSZ0YwJLUiQEsSZ0YwJLUiQEsSZ0YwJLUiQEsSZ0YwJLUiQEsSZ0YwJLUiQEsSZ0YwJLUiQEsSZ0YwJLUiQEsSZ0YwJLUiQEsSZ0YwJLUiQEsSZ0YwJLUiQEsSZ0YwJLUiQEsSZ0YwJLUiQEsSZ0YwJLUiQEsSZ0YwJLUiQEsSZ0YwJLUiQEsSZ0YwJLUiQEsSZ0YwJLUiQEsSZ0YwJLUiQEsSZ0YwJLUiQEsSZ0YwJLUiQEsSZ0YwJLUiQEsSZ0YwJLUiQEsSZ0YwJLUiQEsSZ2MLYCTfDTJw0nuGqm9J8l3k9zZHm8eee/CJJuT3JvkpJH6siQb23sfSpJW3y/JNa1+e5Ijx3UskjQO4zwDvhJYOUn9kqpa2h6fB0iyBFgFHNPaXJpkn7b+ZcAa4Oj2mNjm2cCjVfUq4BLg4nEdiCSNw9gCuKr+Gvj+FFc/Gbi6qp6sqvuAzcCKJIcBB1TVrVVVwFXAKSNt1rXla4ETJ86OJWk26DEGfF6Sr7chipe12iLggZF1trTaora8c32HNlW1HXgMOHiyHSZZk2RDkg3btm2bviORpD0w0wF8GfBKYCmwFfhAq0925lq7qO+qzTOLVZdX1fKqWr5w4cLn1WFJGpcZDeCqeqiqnqqqHwEfAVa0t7YAh4+suhh4sNUXT1LfoU2SBcCBTH3IQ5K6m9EAbmO6E94CTFwhcR2wql3ZcBTDl213VNVW4PEkx7Xx3TOBz4y0Wd2WTwVubuPEkjQrLBjXhpN8AjgBOCTJFuDdwAlJljIMFdwPvBOgqjYlWQ/cDWwHzq2qp9qmzmG4omJ/4Pr2ALgC+FiSzQxnvqvGdSySNA5jC+CqOn2S8hW7WH8tsHaS+gbg2EnqTwCn7UkfJakn74STpE4MYEnqxACWpE4MYEnqxACWpE6mFMBJbppKTZI0dbu8DC3JjwEvYriW92U8ffvvAcArxtw3SZrTnus64HcC5zOE7Vd5OoB/APzR+LolSXPfLgO4qv4Q+MMk76qqD89QnyRpXpjSnXBV9eEkbwCOHG1TVVeNqV+SNOdNKYCTfIxhGsk7gYk5GiYmSJck7YapzgWxHFjibGOSNH2meh3wXcC/GGdHJGm+meoZ8CHA3UnuAJ6cKFbVL4ylV5I0D0w1gN8zzk5I0nw01asgvjTujkjSfDPVqyAe5+kfvHwhsC/w/6rqgHF1TJLmuqmeAb909HWSU3j6BzUlSbtht2ZDq6r/Dvzs9HZFkuaXqQ5BvHXk5QsYrgv2mmBJ2gNTvQri50eWtzP8ovHJ094bSZpHpjoG/I5xd0SS5pupTsi+OMmnkzyc5KEkn0yyeNydk6S5bKpfwv0ZcB3DvM
CLgM+2miRpN001gBdW1Z9V1fb2uBJYOMZ+SdKcN9UAfiTJGUn2aY8zgO+Ns2OSNNdNNYB/BXgb8H+BrcCpgF/MSdIemOplaO8FVlfVowBJDgLezxDMkqTdMNUz4J+eCF+Aqvo+8LrxdEmS5oepBvAL2s/SA/98BjzVs2dJ0iSmGqIfAP5nkmsZbkF+G7B2bL2SpHlgqnfCXZVkA8MEPAHeWlV3j7VnkjTHTXkYoQWuoStJ02S3pqOUJO05A1iSOjGAJakTA1iSOjGAJakTA1iSOjGAJakTA1iSOjGAJakTA1iSOjGAJakTA1iSOjGAJakTA1iSOjGAJakTA1iSOhlbACf5aJKHk9w1UjsoyY1JvtWeR39n7sIkm5Pcm+SkkfqyJBvbex9KklbfL8k1rX57kiPHdSySNA7jPAO+Eli5U+0C4KaqOhq4qb0myRJgFXBMa3Npkn1am8uANcDR7TGxzbOBR6vqVcAlwMVjOxJJGoOxBXBV/TXw/Z3KJwPr2vI64JSR+tVV9WRV3QdsBlYkOQw4oKpuraoCrtqpzcS2rgVOnDg7lqTZYKbHgA+tqq0A7fnlrb4IeGBkvS2ttqgt71zfoU1VbQceAw4eW88laZrtLV/CTXbmWruo76rNMzeerEmyIcmGbdu27WYXJWl6zXQAP9SGFWjPD7f6FuDwkfUWAw+2+uJJ6ju0SbIAOJBnDnkAUFWXV9Xyqlq+cOHCaToUSdozMx3A1wGr2/Jq4DMj9VXtyoajGL5su6MNUzye5Lg2vnvmTm0mtnUqcHMbJ5akWWHBuDac5BPACcAhSbYA7wYuAtYnORv4DnAaQFVtSrIeuBvYDpxbVU+1TZ3DcEXF/sD17QFwBfCxJJsZznxXjetYJGkcxhbAVXX6s7x14rOsvxZYO0l9A3DsJPUnaAEuSbPR3vIlnCTNOwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHViAEtSJwawJHXSJYCT3J9kY5I7k2xotYOS3JjkW+35ZSPrX5hkc5J7k5w0Ul/WtrM5yYeSpMfxSNLu6HkG/MaqWlpVy9vrC4Cbqupo4Kb2miRLgFXAMcBK4NIk+7Q2lwFrgKPbY+UM9l+S9sjeNARxMrCuLa8DThmpX11VT1bVfcBmYEWSw4ADqurWqirgqpE2krTX6xXABXwhyVeTrGm1Q6tqK0B7fnmrLwIeGGm7pdUWteWd68+QZE2SDUk2bNu2bRoPQ5J234JO+z2+qh5M8nLgxiTf2MW6k43r1i7qzyxWXQ5cDrB8+fJJ15GkmdblDLiqHmzPDwOfBlYAD7VhBdrzw231LcDhI80XAw+2+uJJ6pI0K8x4ACd5cZKXTiwDbwLuAq4DVrfVVgOfacvXAauS7JfkKIYv2+5owxSPJzmuXf1w5kgbSdrr9RiCOBT4dLtibAHwF1V1Q5KvAOuTnA18BzgNoKo2JVkP3A1sB86tqqfats4BrgT2B65vD0maFWY8gKvq28BrJ6l/DzjxWdqsBdZOUt8AHDvdfZSkmbA3XYYmSfOKASxJnRjAktSJASxJnRjAktSJASxJnRjAktSJASxJnRjAktSJASxJnRjAktSJASxJnRjAktSJASxJnRjAktSJASxJnRjAktSJASxJnRjAktSJASxJnRjAktSJASxJnRjAktSJASxJnRjAktSJASxJnRjAktSJASxJnRjAktSJASxJnRjAktSJASxJnRjAktSJASxJnRjAktSJASxJnRjAktSJASxJnRjAktSJASxJnRjAktSJASxJnRjAktSJASxJnRjAktSJASxJnRjAktSJASxJnRjAktSJASxJncz6AE6yMsm9STYnuaB3fyRpqmZ1ACfZB/gj4N8DS4DTkyzp2ytJmppZHcDACmBzVX27qv4RuBo4uXOfJGlKFvTuwB5aBDww8noL8DM7r5RkDbCmvfxhkntnoG+z0SHAI707MW55/+reXZiL5v5n593Zk9Y3VNXKnYuzPYAn+xOpZxSqLgcuH393ZrckG6pqee9+aPbxs7N7ZvsQxBbg8JHXi4EHO/VFkp6X2R7AXwGOTnJUkhcCq4DrOvdJkqZkVg9BVNX2JOcBfwXsA3y0qjZ17tZs5jCNdpefnd2QqmcMmUqSZsBsH4KQpFnLAJakTgxgAZDkV5Oc2ZbPSvKKkff+1DsMNVVJfjzJfxh5/Yok1/bs097KMWA9Q5JbgN+qqg29+6LZJ8mRwOeq6tjefdnbeQY8ByQ5Msk3kqxL8vUk1yZ5UZITk/yvJBuTfDTJfm39i5Lc3dZ9f6u9J8lvJTkVWA58PMmdSfZPckuS5UnOSfJfR/Z7VpIPt+UzktzR2vxJm6dDe6H2ebknyUeSbEryhfb3/MokNyT5apK/SfJTbf1XJrktyVeS/F6SH7b6S5LclORr7TM2MQ3ARcAr22fhfW1/d7U2tyc5ZqQvtyRZluTF7TP6lfaZnR9TClSVj1n+AI5kuAPw+Pb6o8B/YbhN+ydb7SrgfOAg4F6e/tfPj7fn9zCc9QLcAiwf2f4tDKG8kGHujYn69cC/AV4NfBbYt9UvBc7s/efiY5efl+3A0vZ6PXAGcBNwdKv9DHBzW/4ccHpb/lXgh215AXBAWz4E2Mxwd+qRwF077e+utvzrwO+25cOAb7bl3wfOmPhMAt8EXtz7z2rcD8+A544HqurLbfnPgROB+6rqm622Dvh3wA+AJ4A/TfJW4O+nuoOq2gZ8O8lxSQ4G/hXw5bavZcBXktzZXv/LPT8kjdF9VXVnW/4qQ0i+AfjL9nf4JwwBCfB64C/b8l+MbCPA7yf5OvA/GOZmOfQ59rseOK0tv21ku28CLmj7vgX4MeCI53dIs8+svhFDO5jSYH4NN6+sYAjJVcB5wM8+j/1cw/AfzjeAT1dVJQmwrqoufJ59Vj9Pjiw/xRCcf1dVS5/HNn6Z4V9Fy6rqn5LczxCcz6qqvpvke0l+Gvgl4J3trQC/WFXzaqIsz4DnjiOSvL4tn85wRnJkkle12tuBLyV5CXBgVX2eYUhi6STbehx46bPs51PAKW0f17TaTcCpSV4OkOSgJD+xR0ejmfYD4L4kpwFk8Nr23m3AL7blVSNtDgQebuH7RmDi73xXnx8Ypo39Twyfw42t9lfAu9r/zEnyuj09oNnAAJ477gFWt38OHgRcAryD4Z+UG4EfAX/M8B/G59p6X2IYk9vZlcAfT3wJN/pGVT0K3A38RFXd0Wp3M4w5f6Ft90ae/uerZo9fBs5O8r+BTTw9t/b5wG8kuYPh7/WxVv84sDzJhtb2GwBV9T3gy0nuSvK+SfZzL
UOQrx+pvRfYF/h6+8LuvdN5YHsrL0ObA7zsR+OU5EXAP7ThplUMX8jNj6sUxswxYEnPZRnw39rwwN8Bv9K3O3OHZ8CS1IljwJLUiQEsSZ0YwJLUiQEsNUmWJnnzyOtfSHLBmPd5QpI3jHMf2nsZwNLTlgL/HMBVdV1VXTTmfZ7AcAuw5iGvgtCckOTFDBf2L2b4fcD3MkwO80HgJcAjwFlVtbVNt3k78EaGiV/Obq83A/sD3wX+oC0vr6rzklwJ/APwUwx3fL0DWM0wT8LtVXVW68ebgN8F9gP+D/COqvphu013HfDzDDccnMYwJ8dtDLcCbwPeVVV/M4Y/Hu2lPAPWXLESeLCqXttuSLkB+DBwalUtY5ghbu3I+guqagXDXV7vrqp/BH4HuKaqllbVNTzTyxjmzfh1htnfLgGOAV7Thi8OYbgj8Oeq6l8DG4DfGGn/SKtfxjDz3P0Mdyde0vZp+M4z3oihuWIj8P4kFzNMn/gocCxwY5teYB9g68j6n2rPEzOBTcVn291gG4GHJuYxSLKpbWMxsIThNlyAFwK3Pss+3/o8jk1zlAGsOaGqvplkGcMY7h8wzEexqape/yxNJmYDe4qp/3cw0eZH7Dib2I/aNp4Cbqyq06dxn5rDHILQnJDhN+z+vqr+HHg/w4TiCydmiEuy7+gvMTyL55rF67ncBhw/MQNdhl8l+ckx71OzmAGsueI1wB1tQu//zDCeeypwcZvd606e+2qDLwJL2ixwv/R8O9AmrD8L+ESbFe42hi/tduWzwFvaPv/t892nZjevgpCkTjwDlqRODGBJ6sQAlqRODGBJ6sQAlqRODGBJ6sQAlqRO/j8Lziatnl6SOgAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import seaborn as sns\n", "sns.catplot(x ='sentiment', kind='count', data = df);" ] }, { "cell_type": "code", "execution_count": null, "id": "943778d4", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "36ccc133", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "ab4db158", "metadata": {}, "source": [ "**Reduce the records from 50K to 1K for quick processing**" ] }, { "cell_type": "code", "execution_count": 73, "id": "9fdd06e3", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1000, 2)" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# save 1000 rows in a new dataframe\n", "temp_df = df.iloc[0:1000,:]\n", "temp_df.shape" ] }, { "cell_type": "code", "execution_count": 74, "id": "ff52d83e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "positive 501\n", "negative 499\n", "Name: sentiment, dtype: int64" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# check out the count of positive and negative reviews\n", "temp_df['sentiment'].value_counts()" ] }, { "cell_type": "code", "execution_count": 75, "id": "c51f4022", "metadata": {}, "outputs": [], "source": [ "# save the dataframe to a new csv file\n", "temp_df.to_csv('datasets/imdb-dataset-1000.csv', index=False)" ] }, { "cell_type": "code", "execution_count": null, "id": "15e0fc91", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "b30712d8", "metadata": {}, "source": [ "## b. Case folding, removing digits, punctuations and substituting contractions" ] }, { "cell_type": "markdown", "id": "38b77f60", "metadata": {}, "source": [ "**Read the Dataset:**" ] }, { "cell_type": "code", "execution_count": 1, "id": "3491317d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
reviewsentiment
0One of the other reviewers has mentioned that ...positive
1A wonderful little production. <br /><br />The...positive
2I thought this was a wonderful way to spend ti...positive
3Basically there's a family where a little boy ...negative
4Petter Mattei's \"Love in the Time of Money\" is...positive
.........
995Nothing is sacred. Just ask Ernie Fosselius. T...positive
996I hated it. I hate self-aware pretentious inan...negative
997I usually try to be professional and construct...negative
998If you like me is going to see this in a film ...negative
999This is like a zoology textbook, given that it...negative
\n", "

1000 rows Γ— 2 columns

\n", "
" ], "text/plain": [ " review sentiment\n", "0 One of the other reviewers has mentioned that ... positive\n", "1 A wonderful little production.

The... positive\n", "2 I thought this was a wonderful way to spend ti... positive\n", "3 Basically there's a family where a little boy ... negative\n", "4 Petter Mattei's \"Love in the Time of Money\" is... positive\n", ".. ... ...\n", "995 Nothing is sacred. Just ask Ernie Fosselius. T... positive\n", "996 I hated it. I hate self-aware pretentious inan... negative\n", "997 I usually try to be professional and construct... negative\n", "998 If you like me is going to see this in a film ... negative\n", "999 This is like a zoology textbook, given that it... negative\n", "\n", "[1000 rows x 2 columns]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "df = pd.read_csv(\"./datasets/imdb-dataset-1000.csv\")\n", "df" ] }, { "cell_type": "code", "execution_count": 2, "id": "1ed2a896", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.

The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.

It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.

I would say the main appeal of the show is due to the fact that it goes where other shows wouldn't dare. Forget pretty pictures painted for mainstream audiences, forget charm, forget romance...OZ doesn't mess around. The first episode I ever saw struck me as so nasty it was surreal, I couldn't say I was ready for it, but as I watched more, I developed a taste for Oz, and got accustomed to the high levels of graphic violence. Not just violence, but injustice (crooked guards who'll be sold out for a nickel, inmates who'll kill on order and get away with it, well mannered, middle class inmates being turned into prison bitches due to their lack of street skills or prison experience) Watching Oz, you may become comfortable with what is uncomfortable viewing....thats if you can get in touch with your darker side.\"" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.review[0]" ] }, { "cell_type": "code", "execution_count": null, "id": "2967df79", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "452efb1b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 3, "id": "d78d918b", "metadata": {}, "outputs": [], "source": [ "import re\n", "import string\n", "import contractions\n", "from textblob import TextBlob\n", "\n", "def text_cleaning(mystr):\n", " mystr = mystr.lower() # case folding\n", " mystr = re.sub('\\w*\\d\\w*', '', mystr) # removing digits\n", " mystr = re.sub('\\n', ' ', mystr) # replace new line characters with space\n", " mystr = re.sub('[β€˜β€™β€œβ€β€¦]', '', mystr) # removing double quotes and single quotes\n", " mystr = re.sub('<.*?>', '', mystr) # removing html tags \n", " mystr = re.sub('https?://\\S+|www.\\.\\S+', '', mystr) # removing URLs\n", " mystr = ''.join([c for c in mystr if c not in string.punctuation]) # remove punctuations\n", " mystr = ' '.join([contractions.fix(word) for word in mystr.split()]) # expand contractions\n", " return mystr" ] }, { "cell_type": "code", "execution_count": null, "id": "fa0137eb", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 4, "id": "e3d7680d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
reviewsentimentr_cleaned
0One of the other reviewers has mentioned that ...positiveone of the other reviewers has mentioned that ...
1A wonderful little production. <br /><br />The...positivea wonderful little production the filming tech...
2I thought this was a wonderful way to spend ti...positivei thought this was a wonderful way to spend ti...
3Basically there's a family where a little boy ...negativebasically there is a family where a little boy...
4Petter Mattei's \"Love in the Time of Money\" is...positivepetter matteis love in the time of money is a ...
\n", "
" ], "text/plain": [ " review sentiment \\\n", "0 One of the other reviewers has mentioned that ... positive \n", "1 A wonderful little production.

The... positive \n", "2 I thought this was a wonderful way to spend ti... positive \n", "3 Basically there's a family where a little boy ... negative \n", "4 Petter Mattei's \"Love in the Time of Money\" is... positive \n", "\n", " r_cleaned \n", "0 one of the other reviewers has mentioned that ... \n", "1 a wonderful little production the filming tech... \n", "2 i thought this was a wonderful way to spend ti... \n", "3 basically there is a family where a little boy... \n", "4 petter matteis love in the time of money is a ... " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['r_cleaned'] = df['review'].apply(lambda x : text_cleaning(x))\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "18f54b37", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "8082fa37", "metadata": {}, "source": [ "## b. Tokenization" ] }, { "cell_type": "code", "execution_count": 5, "id": "9d05a58b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
reviewsentimentr_cleanedr_tokenized
0One of the other reviewers has mentioned that ...positiveone of the other reviewers has mentioned that ...[one, of, the, other, reviewers, has, mentione...
1A wonderful little production. <br /><br />The...positivea wonderful little production the filming tech...[a, wonderful, little, production, the, filmin...
2I thought this was a wonderful way to spend ti...positivei thought this was a wonderful way to spend ti...[i, thought, this, was, a, wonderful, way, to,...
3Basically there's a family where a little boy ...negativebasically there is a family where a little boy...[basically, there, is, a, family, where, a, li...
4Petter Mattei's \"Love in the Time of Money\" is...positivepetter matteis love in the time of money is a ...[petter, matteis, love, in, the, time, of, mon...
\n", "
" ], "text/plain": [ " review sentiment \\\n", "0 One of the other reviewers has mentioned that ... positive \n", "1 A wonderful little production.

The... positive \n", "2 I thought this was a wonderful way to spend ti... positive \n", "3 Basically there's a family where a little boy ... negative \n", "4 Petter Mattei's \"Love in the Time of Money\" is... positive \n", "\n", " r_cleaned \\\n", "0 one of the other reviewers has mentioned that ... \n", "1 a wonderful little production the filming tech... \n", "2 i thought this was a wonderful way to spend ti... \n", "3 basically there is a family where a little boy... \n", "4 petter matteis love in the time of money is a ... \n", "\n", " r_tokenized \n", "0 [one, of, the, other, reviewers, has, mentione... \n", "1 [a, wonderful, little, production, the, filmin... \n", "2 [i, thought, this, was, a, wonderful, way, to,... \n", "3 [basically, there, is, a, family, where, a, li... \n", "4 [petter, matteis, love, in, the, time, of, mon... " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from nltk.tokenize import word_tokenize\n", "df['r_tokenized'] = df['r_cleaned'].apply(lambda x: word_tokenize(x))\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "16419956", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "cf756416", "metadata": {}, "source": [ "## c. Remove Stop Words" ] }, { "cell_type": "code", "execution_count": 6, "id": "852940f0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
reviewsentimentr_cleanedr_tokenizedr_no_sw
0One of the other reviewers has mentioned that ...positiveone of the other reviewers has mentioned that ...[one, of, the, other, reviewers, has, mentione...[one, reviewers, mentioned, watching, oz, epis...
1A wonderful little production. <br /><br />The...positivea wonderful little production the filming tech...[a, wonderful, little, production, the, filmin...[wonderful, little, production, filming, techn...
2I thought this was a wonderful way to spend ti...positivei thought this was a wonderful way to spend ti...[i, thought, this, was, a, wonderful, way, to,...[thought, wonderful, way, spend, time, hot, su...
3Basically there's a family where a little boy ...negativebasically there is a family where a little boy...[basically, there, is, a, family, where, a, li...[basically, family, little, boy, jake, thinks,...
4Petter Mattei's \"Love in the Time of Money\" is...positivepetter matteis love in the time of money is a ...[petter, matteis, love, in, the, time, of, mon...[petter, matteis, love, time, money, visually,...
\n", "
" ], "text/plain": [ " review sentiment \\\n", "0 One of the other reviewers has mentioned that ... positive \n", "1 A wonderful little production.

The... positive \n", "2 I thought this was a wonderful way to spend ti... positive \n", "3 Basically there's a family where a little boy ... negative \n", "4 Petter Mattei's \"Love in the Time of Money\" is... positive \n", "\n", " r_cleaned \\\n", "0 one of the other reviewers has mentioned that ... \n", "1 a wonderful little production the filming tech... \n", "2 i thought this was a wonderful way to spend ti... \n", "3 basically there is a family where a little boy... \n", "4 petter matteis love in the time of money is a ... \n", "\n", " r_tokenized \\\n", "0 [one, of, the, other, reviewers, has, mentione... \n", "1 [a, wonderful, little, production, the, filmin... \n", "2 [i, thought, this, was, a, wonderful, way, to,... \n", "3 [basically, there, is, a, family, where, a, li... \n", "4 [petter, matteis, love, in, the, time, of, mon... \n", "\n", " r_no_sw \n", "0 [one, reviewers, mentioned, watching, oz, epis... \n", "1 [wonderful, little, production, filming, techn... \n", "2 [thought, wonderful, way, spend, time, hot, su... \n", "3 [basically, family, little, boy, jake, thinks,... \n", "4 [petter, matteis, love, time, money, visually,... " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import nltk\n", "stop_words = nltk.corpus.stopwords.words('english')\n", "\n", "def remove_stopwords(tokenized_text):\n", " new_words = [word for word in tokenized_text if word not in stop_words]\n", " return new_words\n", "\n", "df['r_no_sw'] = df['r_tokenized'].apply(lambda token: remove_stopwords(token))\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "a0707c22", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "3ffd3adb", "metadata": {}, "source": [ "## d. Save the Pre-Processed Dataframe in a New CSV File" ] }, { "cell_type": "code", "execution_count": 7, "id": "a49e3ba0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sentimentprocessed_reviews
0positiveone reviewers mentioned watching oz episode ho...
1positivewonderful little production filming technique ...
2positivethought wonderful way spend time hot summer we...
3negativebasically family little boy jake thinks zombie...
4positivepetter matteis love time money visually stunni...
\n", "
" ], "text/plain": [ " sentiment processed_reviews\n", "0 positive one reviewers mentioned watching oz episode ho...\n", "1 positive wonderful little production filming technique ...\n", "2 positive thought wonderful way spend time hot summer we...\n", "3 negative basically family little boy jake thinks zombie...\n", "4 positive petter matteis love time money visually stunni..." ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# join the tokens of pre-processed text\n", "df['processed_reviews'] = df['r_no_sw'].apply(lambda x: ' '.join(x))\n", "\n", "new_df = pd.concat([df['sentiment'], df['processed_reviews']], axis=1)\n", "\n", "# save the resulting datafrrame to a new csv file\n", "new_df.to_csv('datasets/processed_imdb_reviews.csv', index=False)\n", "new_df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "5ddf5787", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "python10", "language": "python", "name": "python10" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.0" } }, "nbformat": 4, "nbformat_minor": 5 }