{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{admonition} Information\n",
    "__Section__: Put everything together  \n",
    "__Goal__: Apply all the seen methods together to see the transformation of the text.    \n",
    "__Time needed__: 10 min  \n",
    "__Prerequisites__: Chapter 3\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Put everything together"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have detailed the transformations to do with the text, let's see how the tweets are transformed when we apply all the methods after each other."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Tweets examples"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# Put everything in a single list\n",
    "tweet_1 = 'Hospitalizations from COVID-19 have increased nearly 90% and California officials say they could triple by Christmas. https://t.co/hrBnP04HnB'\n",
    "tweet_2 = 'Something for the afternoon slump / journey home / after school / cooking dinner ... a special 30 minute mix of cool Christmas tunes intercut with Christmas film samples and scratching @BBCSounds https://t.co/rHovIA3u5e'\n",
    "tweet_3 = 'This happened in Adelaide the other day. #koala #Adelaide https://t.co/vAQFkd5r7q'\n",
    "list_tweets = [tweet_1, tweet_2, tweet_3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Change the value of the tweets to see how any text is changed by our transformations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "tags": [
     "hide-input",
     "hide-output"
    ]
   },
   "outputs": [],
   "source": [
    "# Function for tweet preprocessing with what we saw in the chapter\n",
    "\n",
    "def preprocess_tweet(tweet):\n",
    "    '''\n",
    "    Takes a tweet as an input and output the list of tokens.\n",
    "    '''\n",
    "    \n",
    "    import emoji\n",
    "    import re\n",
    "    from nltk import word_tokenize\n",
    "    from nltk.corpus import stopwords\n",
    "    from nltk.stem import PorterStemmer\n",
    "    \n",
    "    # Initialization\n",
    "    new_tweet = tweet\n",
    "    \n",
    "    ## Changes on string\n",
    "    \n",
    "    # Remove urls\n",
    "    new_tweet = re.sub(r'https?://[^ ]+', '', new_tweet)\n",
    "    \n",
    "    # Remove usernames\n",
    "    new_tweet = re.sub(r'@[^ ]+', '', new_tweet)\n",
    "    \n",
    "    # Remove hashtags\n",
    "    new_tweet = re.sub(r'#', '', new_tweet)\n",
    "    \n",
    "    # Character normalization\n",
    "    new_tweet = re.sub(r'([A-Za-z])\\1{2,}', r'\\1', new_tweet)\n",
    "    \n",
    "    # Emoji transformation\n",
    "    new_tweet = emoji.demojize(new_tweet)\n",
    "    \n",
    "    # Punctuation and special characters\n",
    "    new_tweet = re.sub(r' 0 ', 'zero', new_tweet)\n",
    "    new_tweet = re.sub(r'[^A-Za-z ]', '', new_tweet)\n",
    "    \n",
    "    # Lower casing\n",
    "    new_tweet = new_tweet.lower()\n",
    "    \n",
    "    \n",
    "    ## Changes on tokens\n",
    "    \n",
    "    # Tokenization\n",
    "    tokens = word_tokenize(new_tweet)\n",
    "    \n",
    "    porter = PorterStemmer()\n",
    "    \n",
    "    for token in tokens:\n",
    "        # Stopwords removal\n",
    "        if token in stopwords.words('english'):\n",
    "            tokens.remove(token)\n",
    "        # Stemming\n",
    "            token = porter.stem(token)\n",
    "    \n",
    "    return tokens"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "tags": [
     "hide-input",
     "hide-output"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Hospitalizations from COVID-19 have increased nearly 90% and California officials say they could triple by Christmas. https://t.co/hrBnP04HnB\n",
      "['hospitalizations', 'covid', 'increased', 'nearly', 'california', 'officials', 'say', 'could', 'triple', 'christmas']\n",
      "Something for the afternoon slump / journey home / after school / cooking dinner ... a special 30 minute mix of cool Christmas tunes intercut with Christmas film samples and scratching @BBCSounds https://t.co/rHovIA3u5e\n",
      "['something', 'the', 'afternoon', 'slump', 'journey', 'home', 'school', 'cooking', 'dinner', 'special', 'minute', 'mix', 'cool', 'christmas', 'tunes', 'intercut', 'christmas', 'film', 'samples', 'scratching']\n",
      "This happened in Adelaide the other day. #koala #Adelaide https://t.co/vAQFkd5r7q\n",
      "['happened', 'adelaide', 'other', 'day', 'koala', 'adelaide']\n"
     ]
    }
   ],
   "source": [
    "# Use function on our list of tweets\n",
    "\n",
    "list_tweets2 = []\n",
    "for tweet in list_tweets:\n",
    "    print(tweet)\n",
    "    tokens = preprocess_tweet(tweet)\n",
    "    print(tokens)\n",
    "    list_tweets2.append([tokens])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "0a8efb4e5c314d808412a2914f9d2c77",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "interactive(children=(Textarea(value='Hospitalizations from COVID-19 have increased nearly 90% and California …"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "<function __main__.preprocess_tweet(tweet)>"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Beginner version: cell to hide\n",
    "\n",
    "import ipywidgets as widgets\n",
    "from ipywidgets import interact\n",
    "\n",
    "def preprocess_tweet(tweet):\n",
    "    '''\n",
    "    Takes a tweet as an input and output the list of tokens.\n",
    "    '''\n",
    "    \n",
    "    import emoji\n",
    "    import re\n",
    "    from nltk import word_tokenize\n",
    "    from nltk.corpus import stopwords\n",
    "    from nltk.stem import PorterStemmer\n",
    "    \n",
    "    new_tweet = tweet\n",
    "    new_tweet = re.sub(r'https?://[^ ]+', '', new_tweet)\n",
    "    new_tweet = re.sub(r'@[^ ]+', '', new_tweet)\n",
    "    new_tweet = re.sub(r'#', '', new_tweet)\n",
    "    new_tweet = re.sub(r'([A-Za-z])\\1{2,}', r'\\1', new_tweet)\n",
    "    new_tweet = emoji.demojize(new_tweet)\n",
    "    new_tweet = re.sub(r' 0 ', 'zero', new_tweet)\n",
    "    new_tweet = re.sub(r'[^A-Za-z ]', '', new_tweet)\n",
    "    new_tweet = new_tweet.lower()\n",
    "    \n",
    "    tokens = word_tokenize(new_tweet)\n",
    "    porter = PorterStemmer()\n",
    "    for token in tokens:\n",
    "        if token in stopwords.words('english'):\n",
    "            tokens.remove(token)\n",
    "            token = porter.stem(token)\n",
    "            \n",
    "    print(tokens)\n",
    "\n",
    "interact(preprocess_tweet, tweet = widgets.Textarea(\n",
    "    value = tweet_1,\n",
    "    description = 'Tweet:',\n",
    "    disabled = False\n",
    "))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This chapter showed some simple text transformations for a machine learning experiment based on text analysis. This was only a simple case (bag-of-words), where we treat each word as an independent entity.  \n",
    "Other, more complicated, methods exist to also take into consideration the role of the word in the sentence (part-of-speech for example) and do more language-based analysis. To go further on this topic theoretically, you can have a look at this [good article](https://machinelearningmastery.com/natural-language-processing/) or [this one](https://becominghuman.ai/a-simple-introduction-to-natural-language-processing-ea66a1747b32). For some Python oriented resources, have a look [here](https://towardsdatascience.com/gentle-start-to-natural-language-processing-using-python-6e46c07addf3) or [there](https://medium.com/towards-artificial-intelligence/natural-language-processing-nlp-with-python-tutorial-for-beginners-1f54e610a1a0)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "from IPython.display import IFrame\n",
    "IFrame(\"https://blog.hoou.de/wp-admin/admin-ajax.php?action=h5p_embed&id=65\", \"959\", \"332\")"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Edit Metadata",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}