{ "cells": [ { "cell_type": "markdown", "id": "d162b058", "metadata": {}, "source": [ "--- \n", " \n", "\n", "
This is so simple and fun.
\n", "Preprocessed String: An empty head. This is so simple and fun. \n" ] } ], "source": [ "import re\n", "mystr = \" An empty head.This is so simple and fun.
\"\n", "print(\"Original String: \", mystr)\n", "print(\"Preprocessed String: \", re.sub('<.*?>', '', mystr))" ] }, { "cell_type": "code", "execution_count": null, "id": "3ca00acb", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "167b6180", "metadata": {}, "source": [ "## d. Removing URLs\n", "- Text data often contains URLs, which are rarely useful for tasks such as sentiment analysis, so it is usually better to remove them from the dataset\n", "- Once again, we can use Python's `re` module to remove the URLs." ] }, { "cell_type": "code", "execution_count": 6, "id": "74f73fc0", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Good youTube lectures by Arif are available at '" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "mystr = \"Good youTube lectures by Arif are available at http://www.youtube.com/c/LearnWithArif/playlists\"\n", "# match http(s) URLs, or bare hosts that start with www.\n", "re.sub(r'https?://\\S+|www\\.\\S+', '', mystr)" ] }, { "cell_type": "code", "execution_count": null, "id": "bb9c503b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "a3e0e219", "metadata": {}, "source": [ "## e. 
Removing Punctuation\n", "- Punctuation marks are symbols used to divide written text into sentences and clauses\n", "- When you tokenize your text, a punctuation mark may become part of a token or a token by itself, which is unwanted in most cases\n", "- We can use Python's `string.punctuation` constant with a list comprehension and `str.join()` to drop every punctuation character from the text" ] }, { "cell_type": "code", "execution_count": 7, "id": "5b06a5f6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import string\n", "string.punctuation" ] }, { "cell_type": "markdown", "id": "0067f95f", "metadata": {}, "source": [ ">- Check for other constants like `string.whitespace`, `string.printable`, `string.ascii_letters`, `string.digits` as well." ] }, { "cell_type": "code", "execution_count": 8, "id": "6f123fd9", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'A {text} ^having$ \"lot\" of #s and [punctuations]!.;%..'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mystr = 'A {text} ^having$ \"lot\" of #s and [punctuations]!.;%..'\n", "mystr" ] }, { "cell_type": "code", "execution_count": 9, "id": "76aa3beb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'A text having lot of s and punctuations'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# keep only the characters that are not punctuation\n", "newstr = ''.join([ch for ch in mystr if ch not in string.punctuation])\n", "newstr" ] }, { "cell_type": "code", "execution_count": null, "id": "df149851", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "76f28d73", "metadata": {}, "source": [ "# 2. Basic Text Preprocessing" ] }, { "cell_type": "markdown", "id": "86b42da8", "metadata": {}, "source": [ "## a. 
Case Folding\n", "- The text we need to process may come in lower, upper, sentence, or camel case\n", "- A machine treats lower-case and upper-case letters as different characters, so `Great` and `great` count as two distinct words unless the text is folded to a single case\n", "- In applications like information retrieval, we reduce all letters to lower case\n", "- In applications like sentiment analysis, machine translation, and information extraction, keeping the case might be helpful, for example to distinguish `US` from `us`." ] }, { "cell_type": "code", "execution_count": 10, "id": "cbb6c75e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'this is great series of lectures by arif at the department of ds'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mystr = \"This IS GREAT series of Lectures by Arif at the Department of DS\"\n", "mystr.lower()" ] }, { "cell_type": "code", "execution_count": null, "id": "12b34d7d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "97fd645b", "metadata": {}, "source": [ "## b. 
Expand Contractions\n", "- Contractions are words or combinations of words that are shortened by dropping letters and replacing them with an apostrophe.\n", "- Examples:\n", "    - you're ---> you are\n", "    - ain't ---> am not / are not / is not / has not / have not\n", "    - you'll ---> you shall / you will\n", "    - wouldn't've ---> would not have\n", "- To expand contractions, you can install and use the `contractions` module or create your own dictionary" ] }, { "cell_type": "code", "execution_count": 11, "id": "6c90f198", "metadata": {}, "outputs": [], "source": [ "import sys\n", "!{sys.executable} -m pip install -q contractions" ] }, { "cell_type": "code", "execution_count": 12, "id": "7b5ff1ef", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "you are\n", "are not\n", "you will\n", "would not have\n" ] } ], "source": [ "import contractions\n", "print(contractions.fix(\"you're\")) # you are\n", "print(contractions.fix(\"ain't\")) # am not / are not / is not / has not / have not\n", "print(contractions.fix(\"you'll\")) # you shall / you will\n", "print(contractions.fix(\"wouldn't've\")) # would not have" ] }, { "cell_type": "code", "execution_count": null, "id": "9fb8cc71", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 13, "id": "e6b0d807", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"I'll be there within 5 min. Shouldn't you be there too? I'd love to see u there my dear. \\nIt's awesome to meet new friends. We've been waiting for this day for so long.\"" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mystr = '''I'll be there within 5 min. Shouldn't you be there too? I'd love to see u there my dear. \n", "It's awesome to meet new friends. 
We've been waiting for this day for so long.'''\n", "mystr" ] }, { "cell_type": "code", "execution_count": 14, "id": "128dec87", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "I will be there within 5 min. Should not you be there too? I would love to see you there my dear. \n", "It is awesome to meet new friends. We have been waiting for this day for so long.\n" ] } ], "source": [ "# use loop\n", "mylist = [] \n", "for word in mystr.split(sep=' '):\n", " mylist.append(contractions.fix(word))\n", "\n", "newstring = ' '.join(mylist)\n", "print(newstring)" ] }, { "cell_type": "code", "execution_count": 15, "id": "61d4bf5e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'I will be there within 5 min. Should not you be there too? I would love to see you there my dear. It is awesome to meet new friends. We have been waiting for this day for so long.'" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# use list comprehension and join the words of list on space\n", "expanded_string = ' '.join([contractions.fix(word) for word in mystr.split()])\n", "expanded_string" ] }, { "cell_type": "code", "execution_count": null, "id": "d70a0fc7", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "c52da3de", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "b864e428", "metadata": {}, "source": [ "## c. 
Chat Word Treatment\n", "- Some abbreviated chat words commonly used on social media are:\n", "    - GN for good night\n", "    - fyi for for your information\n", "    - asap for as soon as possible\n", "    - yolo for you only live once\n", "    - rofl for rolling on floor laughing\n", "    - nvm for never mind\n", "    - ofc for of course\n", "\n", "- To pre-process text containing such abbreviations, we can look for an existing online dictionary or create a dictionary of our own" ] }, { "cell_type": "code", "execution_count": 16, "id": "9d8f4c40", "metadata": {}, "outputs": [], "source": [ "dict_chatwords = { \n", " 'ack': 'acknowledge',\n", " 'omg': 'oh my God',\n", " 'aisi': 'as i see it',\n", " 'bi5': 'back in 5 minutes',\n", " 'lmk': 'let me know',\n", " 'gn' : 'good night',\n", " 'fyi': 'for your information',\n", " 'asap': 'as soon as possible',\n", " 'yolo': 'you only live once',\n", " 'rofl': 'rolling on floor laughing',\n", " 'nvm': 'never mind',\n", " 'ofc': 'of course',\n", " 'blv' : 'boulevard',\n", " 'cir' : 'circle',\n", " 'hwy' : 'highway',\n", " 'ln' : 'lane',\n", " 'pt' : 'point',\n", " 'rd' : 'road',\n", " 'sq' : 'square',\n", " 'st' : 'street'\n", " }" ] }, { "cell_type": "code", "execution_count": 17, "id": "560fb80d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'omg this is aisi I ack your work and will be bi5'" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mystr = \"omg this is aisi I ack your work and will be bi5\"\n", "mystr" ] }, { "cell_type": "code", "execution_count": 18, "id": "74246a90", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "oh my God this is as i see it I acknowledge your work and will be back in 5 minutes\n" ] } ], "source": [ "# dict.items() method returns all the key-value pairs of a dict as (key, value) tuples\n", "# dict.keys() method returns all the keys of a dict object\n", "# dict.values() method returns all the 
values of a dict object\n", "mylist = []\n", "for word in mystr.split(sep=' '):\n", "    if word in dict_chatwords.keys():\n", "        mylist.append(dict_chatwords[word])  # known abbreviation: expand it\n", "    else:\n", "        mylist.append(word)  # anything else is kept unchanged\n", "newstring = ' '.join(mylist)\n", "print(newstring)" ] }, { "cell_type": "code", "execution_count": null, "id": "fee8b5a4", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "40839642", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "30f62553", "metadata": {}, "source": [ "## d. Handle Emojis\n", "- We come across lots of emojis while scraping comments/posts from social media platforms like Facebook, Instagram, WhatsApp, Twitter, and LinkedIn, and they need to be handled before further processing.\n", "- Machine learning algorithms cannot interpret emojis directly, so we have two options:\n", "    - Simply remove the emojis from the text data, which can be done with a regular expression or the `clean-text` library\n", "    - Replace each emoji with its meaning (happy, sad, angry, ...)\n" ] }, { "cell_type": "markdown", "id": "620bf235", "metadata": {}, "source": [ "### (i) Remove Emojis" ] }, { "cell_type": "code", "execution_count": 19, "id": "b4294457", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'These emojis need to be removed, there is a huge list...ππ¬ππ ππππππ€ππ€π‘π€ππ€ π€‘π€«π©ππ»ππβοΈππ'" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mystr = \"These emojis need to be removed, there is a huge list...ππ¬ππ ππππππ€ππ€π‘π€ππ€ π€‘π€«π©ππ»ππβοΈππ\"\n", "mystr" ] }, { "cell_type": "code", "execution_count": 20, "id": "8be07245", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "These emojis need to be removed, there is a huge list...\n" ] } ], "source": [ "import re\n", " \n", "emoji_pattern = re.compile(\"[\"\n", " u\"\\U0001F600-\\U0001F64F\" # code range for emoticons\n", " u\"\\U0001F300-\\U0001F5FF\" # code range for symbols & 
pictographs\n", " u\"\\U0001F680-\\U0001F6FF\" # code range for transport & map symbols\n", " u\"\\U0001F1E0-\\U0001F1FF\" # code range for regional indicator symbols (flags)\n", " u\"\\U00002700-\\U000027BF\" # code range for Dingbats\n", " u\"\\U00002500-\\U00002BEF\" # box drawing through miscellaneous symbols and arrows\n", " u\"\\U00002702-\\U000027B0\"\n", " u\"\\U000024C2-\\U0001F251\" # very broad range: also matches CJK, kana, and Hangul text\n", " u\"\\U0001f926-\\U0001f937\"\n", " u\"\\U00010000-\\U0010ffff\" # every code point outside the Basic Multilingual Plane\n", " u\"\\u2640-\\u2642\" \n", " u\"\\u2600-\\u2B55\"\n", " u\"\\u200d\"\n", " u\"\\u23cf\"\n", " u\"\\u23e9\"\n", " u\"\\u231a\"\n", " u\"\\ufe0f\" \n", " u\"\\u3030\"\n", " \"]+\", flags=re.UNICODE)\n", "\n", "print(emoji_pattern.sub(r'', mystr)) # no emoji\n" ] }, { "cell_type": "code", "execution_count": null, "id": "dbbf5c6c", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "42e7e881", "metadata": {}, "source": [ "### (ii) Replace Emojis with their Meanings" ] }, { "cell_type": "code", "execution_count": 21, "id": "059582dd", "metadata": {}, "outputs": [], "source": [ "import sys\n", "!{sys.executable} -m pip install -q emoji" ] }, { "cell_type": "code", "execution_count": 22, "id": "0a02e204", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'This is :thumbs_up:'" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import emoji\n", "mystr = \"This is 👍\"\n", "emoji.demojize(mystr)" ] }, { "cell_type": "code", "execution_count": 23, "id": "7c6992f2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'I am :thinking_face:'" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mystr = \"I am 🤔\"\n", "emoji.demojize(mystr)" ] }, { "cell_type": "code", "execution_count": 24, "id": "1f4551af", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'This is positive'" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mystr = \"This is π\"\n", 
"emoji.replace_emoji(mystr, replace='positive')" ] }, { "cell_type": "code", "execution_count": null, "id": "bdb98ff1", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "1037dc1e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "181c15ca", "metadata": {}, "source": [ "## e. Spelling Correction\n", "- Text data often contains spelling errors; if they are not corrected, the same word may be represented in two or more different ways.\n", "- Almost all word processors today underline incorrectly typed words and suggest possible corrections\n", "- So spelling correction is a two-step task:\n", "    - Detection of spelling errors\n", "    - Correction of spelling errors\n", "        - Autocorrect as soon as you type a space\n", "        - Suggest a single correct word\n", "        - Suggest a list of words (from which you can choose one)\n", "- Types of spelling errors:\n", "    - **Non-word Errors:** words that do not exist in the language dictionary. For example, instead of typing `reading` the user typed `reeding`. These are easy to detect because a dictionary lookup fails, and they can be corrected using algorithms like shortest weighted edit distance and highest noisy channel probability.\n", "    - **Real-word Errors:** valid dictionary words used in the wrong place, which makes them hard to detect. These can be of two types:\n", "        - Typographical errors: for example, instead of typing `great` the user typed `greet`\n", "        - Cognitive errors (homophones): for example, instead of typing `two` the user typed `too`\n", "\n", "\n", "\n"

|   | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |

|   | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |
| ... | ... | ... |
| 995 | Nothing is sacred. Just ask Ernie Fosselius. T... | positive |
| 996 | I hated it. I hate self-aware pretentious inan... | negative |
| 997 | I usually try to be professional and construct... | negative |
| 998 | If you like me is going to see this in a film ... | negative |
| 999 | This is like a zoology textbook, given that it... | negative |

1000 rows × 2 columns

|   | review | sentiment | r_cleaned |
|---|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive | one of the other reviewers has mentioned that ... |
| 1 | A wonderful little production. <br /><br />The... | positive | a wonderful little production the filming tech... |
| 2 | I thought this was a wonderful way to spend ti... | positive | i thought this was a wonderful way to spend ti... |
| 3 | Basically there's a family where a little boy ... | negative | basically there is a family where a little boy... |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive | petter matteis love in the time of money is a ... |

|   | review | sentiment | r_cleaned | r_tokenized |
|---|---|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive | one of the other reviewers has mentioned that ... | [one, of, the, other, reviewers, has, mentione... |
| 1 | A wonderful little production. <br /><br />The... | positive | a wonderful little production the filming tech... | [a, wonderful, little, production, the, filmin... |
| 2 | I thought this was a wonderful way to spend ti... | positive | i thought this was a wonderful way to spend ti... | [i, thought, this, was, a, wonderful, way, to,... |
| 3 | Basically there's a family where a little boy ... | negative | basically there is a family where a little boy... | [basically, there, is, a, family, where, a, li... |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive | petter matteis love in the time of money is a ... | [petter, matteis, love, in, the, time, of, mon... |

|   | review | sentiment | r_cleaned | r_tokenized | r_no_sw |
|---|---|---|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive | one of the other reviewers has mentioned that ... | [one, of, the, other, reviewers, has, mentione... | [one, reviewers, mentioned, watching, oz, epis... |
| 1 | A wonderful little production. <br /><br />The... | positive | a wonderful little production the filming tech... | [a, wonderful, little, production, the, filmin... | [wonderful, little, production, filming, techn... |
| 2 | I thought this was a wonderful way to spend ti... | positive | i thought this was a wonderful way to spend ti... | [i, thought, this, was, a, wonderful, way, to,... | [thought, wonderful, way, spend, time, hot, su... |
| 3 | Basically there's a family where a little boy ... | negative | basically there is a family where a little boy... | [basically, there, is, a, family, where, a, li... | [basically, family, little, boy, jake, thinks,... |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive | petter matteis love in the time of money is a ... | [petter, matteis, love, in, the, time, of, mon... | [petter, matteis, love, time, money, visually,... |

|   | sentiment | processed_reviews |
|---|---|---|
| 0 | positive | one reviewers mentioned watching oz episode ho... |
| 1 | positive | wonderful little production filming technique ... |
| 2 | positive | thought wonderful way spend time hot summer we... |
| 3 | negative | basically family little boy jake thinks zombie... |
| 4 | positive | petter matteis love time money visually stunni... |
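
The tables above trace each review through cleaning, tokenization, and stop-word removal, but the code that produced them does not appear in this section. The following is only a minimal, standard-library sketch of such a pipeline under stated assumptions: the column names are taken from the tables, while `CONTRACTIONS`, `STOPWORDS`, `clean`, and `preprocess` are hypothetical stand-ins (the notebook itself may well rely on `contractions`, NLTK, and pandas instead).

```python
import re
import string

# Hypothetical, deliberately tiny lookup tables used only for illustration.
CONTRACTIONS = {"there's": "there is", "it's": "it is", "don't": "do not"}
STOPWORDS = {"a", "an", "the", "of", "in", "is", "to", "that", "this",
             "was", "i", "has", "other", "there", "where"}


def clean(text):
    """Strip HTML tags, lower-case, expand contractions, drop punctuation."""
    text = re.sub(r"<.*?>", " ", text)  # replace each tag with a space
    text = text.lower()
    text = " ".join(CONTRACTIONS.get(word, word) for word in text.split())
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())  # collapse any repeated whitespace


def preprocess(review):
    """Return the intermediate results under the column names used above."""
    cleaned = clean(review)
    tokens = cleaned.split()  # whitespace tokenization
    no_sw = [tok for tok in tokens if tok not in STOPWORDS]
    return {
        "r_cleaned": cleaned,
        "r_tokenized": tokens,
        "r_no_sw": no_sw,
        "processed_reviews": " ".join(no_sw),
    }


result = preprocess("A wonderful little production. <br /><br />The filming technique")
for column, value in result.items():
    print(column, "->", value)
```

With pandas, a function like this could be applied to the `review` column row by row (for example via `Series.apply`), and a full stop-word list such as NLTK's would replace the toy `STOPWORDS` set.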