{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "```{admonition} Information\n", "__Section__: Put everything together \n", "__Goal__: Apply all the seen methods together to see the transformation of the text. \n", "__Time needed__: 10 min \n", "__Prerequisites__: Chapter 3\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Put everything together" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have detailed the transformations to do with the text, let's see how the tweets are transformed when we apply all the methods after each other." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tweets examples" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# Put everything in a single list\n", "tweet_1 = 'Hospitalizations from COVID-19 have increased nearly 90% and California officials say they could triple by Christmas. https://t.co/hrBnP04HnB'\n", "tweet_2 = 'Something for the afternoon slump / journey home / after school / cooking dinner ... a special 30 minute mix of cool Christmas tunes intercut with Christmas film samples and scratching @BBCSounds https://t.co/rHovIA3u5e'\n", "tweet_3 = 'This happened in Adelaide the other day. #koala #Adelaide https://t.co/vAQFkd5r7q'\n", "list_tweets = [tweet_1, tweet_2, tweet_3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Change the value of the tweets to see how any text is changed by our transformations." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "tags": [ "hide-input", "hide-output" ] }, "outputs": [], "source": [ "# Function for tweet preprocessing with what we saw in the chapter\n", "\n", "def preprocess_tweet(tweet):\n", " '''\n", " Takes a tweet as an input and output the list of tokens.\n", " '''\n", " \n", " import emoji\n", " import re\n", " from nltk import word_tokenize\n", " from nltk.corpus import stopwords\n", " from nltk.stem import PorterStemmer\n", " \n", " # Initialization\n", " new_tweet = tweet\n", " \n", " ## Changes on string\n", " \n", " # Remove urls\n", " new_tweet = re.sub(r'https?://[^ ]+', '', new_tweet)\n", " \n", " # Remove usernames\n", " new_tweet = re.sub(r'@[^ ]+', '', new_tweet)\n", " \n", " # Remove hashtags\n", " new_tweet = re.sub(r'#', '', new_tweet)\n", " \n", " # Character normalization\n", " new_tweet = re.sub(r'([A-Za-z])\\1{2,}', r'\\1', new_tweet)\n", " \n", " # Emoji transformation\n", " new_tweet = emoji.demojize(new_tweet)\n", " \n", " # Punctuation and special characters\n", " new_tweet = re.sub(r' 0 ', 'zero', new_tweet)\n", " new_tweet = re.sub(r'[^A-Za-z ]', '', new_tweet)\n", " \n", " # Lower casing\n", " new_tweet = new_tweet.lower()\n", " \n", " \n", " ## Changes on tokens\n", " \n", " # Tokenization\n", " tokens = word_tokenize(new_tweet)\n", " \n", " porter = PorterStemmer()\n", " \n", " for token in tokens:\n", " # Stopwords removal\n", " if token in stopwords.words('english'):\n", " tokens.remove(token)\n", " # Stemming\n", " token = porter.stem(token)\n", " \n", " return tokens" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "tags": [ "hide-input", "hide-output" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hospitalizations from COVID-19 have increased nearly 90% and California officials say they could triple by Christmas. https://t.co/hrBnP04HnB\n", "['hospitalizations', 'covid', 'increased', 'nearly', 'california', 'officials', 'say', 'could', 'triple', 'christmas']\n", "Something for the afternoon slump / journey home / after school / cooking dinner ... a special 30 minute mix of cool Christmas tunes intercut with Christmas film samples and scratching @BBCSounds https://t.co/rHovIA3u5e\n", "['something', 'the', 'afternoon', 'slump', 'journey', 'home', 'school', 'cooking', 'dinner', 'special', 'minute', 'mix', 'cool', 'christmas', 'tunes', 'intercut', 'christmas', 'film', 'samples', 'scratching']\n", "This happened in Adelaide the other day. #koala #Adelaide https://t.co/vAQFkd5r7q\n", "['happened', 'adelaide', 'other', 'day', 'koala', 'adelaide']\n" ] } ], "source": [ "# Use function on our list of tweets\n", "\n", "list_tweets2 = []\n", "for tweet in list_tweets:\n", " print(tweet)\n", " tokens = preprocess_tweet(tweet)\n", " print(tokens)\n", " list_tweets2.append([tokens])" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "0a8efb4e5c314d808412a2914f9d2c77", "version_major": 2, "version_minor": 0 }, "text/plain": [ "interactive(children=(Textarea(value='Hospitalizations from COVID-19 have increased nearly 90% and California …" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Beginner version: cell to hide\n", "\n", "import ipywidgets as widgets\n", "from ipywidgets import interact\n", "\n", "def preprocess_tweet(tweet):\n", " '''\n", " Takes a tweet as an input and output the list of tokens.\n", " '''\n", " \n", " import emoji\n", " import re\n", " from nltk import word_tokenize\n", " from nltk.corpus import stopwords\n", " from nltk.stem import PorterStemmer\n", " \n", " new_tweet = tweet\n", " new_tweet = re.sub(r'https?://[^ ]+', '', new_tweet)\n", " new_tweet = re.sub(r'@[^ ]+', '', new_tweet)\n", " new_tweet = re.sub(r'#', '', new_tweet)\n", " new_tweet = re.sub(r'([A-Za-z])\\1{2,}', r'\\1', new_tweet)\n", " new_tweet = emoji.demojize(new_tweet)\n", " new_tweet = re.sub(r' 0 ', 'zero', new_tweet)\n", " new_tweet = re.sub(r'[^A-Za-z ]', '', new_tweet)\n", " new_tweet = new_tweet.lower()\n", " \n", " tokens = word_tokenize(new_tweet)\n", " porter = PorterStemmer()\n", " for token in tokens:\n", " if token in stopwords.words('english'):\n", " tokens.remove(token)\n", " token = porter.stem(token)\n", " \n", " print(tokens)\n", "\n", "interact(preprocess_tweet, tweet = widgets.Textarea(\n", " value = tweet_1,\n", " description = 'Tweet:',\n", " disabled = False\n", "))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This chapter showed some simple text transformations for a machine learning experiment based on text analysis. This was only a simple case (bag-of-words), where we treat each word as an independent entity. \n", "Other, more complicated, methods exist to also take into consideration the role of the word in the sentence (part-of-speech for example) and do more language-based analysis. To go further on this topic theoretically, you can have a look at this [good article](https://machinelearningmastery.com/natural-language-processing/) or [this one](https://becominghuman.ai/a-simple-introduction-to-natural-language-processing-ea66a1747b32). For some Python oriented resources, have a look [here](https://towardsdatascience.com/gentle-start-to-natural-language-processing-using-python-6e46c07addf3) or [there](https://medium.com/towards-artificial-intelligence/natural-language-processing-nlp-with-python-tutorial-for-beginners-1f54e610a1a0)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "hide-input" ] }, "outputs": [], "source": [ "from IPython.display import IFrame\n", "IFrame(\"https://blog.hoou.de/wp-admin/admin-ajax.php?action=h5p_embed&id=65\", \"959\", \"332\")" ] } ], "metadata": { "celltoolbar": "Edit Metadata", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }