{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Natural Language Processing Guide\n", "#### Cornell Data Science\n", "##### Author: Christopher Elliott" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Installation Guide\n", "Before starting the guide: the latest release of Anaconda should install all of the modules used in this guide **except** spaCy. You **MUST** install spaCy to use many parts of this guide. If spaCy is not installed, some modules will **NOT** work. To install spaCy, run \"pip install -U spacy\" in your terminal or command prompt. Additionally, run \"python -m spacy download en_core_web_sm\" to download the model used in this guide. For more information, please go to https://spacy.io/usage/" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Import modules\n", "import nltk\n", "from nltk import word_tokenize as wtk\n", "import urllib.request\n", "from collections import Counter\n", "import pandas as pd\n", "import numpy as np\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "from sklearn.naive_bayes import MultinomialNB\n", "from sklearn import metrics\n", "import spacy\n", "from PIL import Image" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Text Scraping and Word Tokenization from Online Sources" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Scraping UTF-8 text from the Project Gutenberg website (The Adventures of Sherlock Holmes: Doyle)\n", "website = \"http://www.gutenberg.org/cache/epub/1661/pg1661.txt\"\n", "webresponse = urllib.request.urlopen(website)\n", "raw_txt = webresponse.read().decode('utf8')\n", "\n", "# Manually find the beginning of the text\n", "beginning = raw_txt.find(\"*** START OF THIS PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK HOLMES ***\")\n", "\n", "# Manually find the end of the text\n", "end = raw_txt.find(\"End of the Project Gutenberg EBook \")\n", "raw_txt = raw_txt[beginning:end]\n", "\n", "# Now that the trimmed raw text is a string, it is possible to tokenize the words.\n", "# Luckily, the NLTK module can tokenize large strings automatically.\n", "# Tokenization is the process in which a string of sentences is broken up into words and punctuation.\n", "# This is important as it allows for further statistical analysis.\n", "tokenized = wtk(raw_txt)\n", "\n", "# Parse the list of words into an NLTK 'text' object\n", "text = nltk.Text(tokenized)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have created an NLTK text object, we can use methods in the NLTK module to analyze specific words in the text. Doing this is an important step in text preprocessing, as it allows data scientists to get a better understanding of which words in the text are more important than others." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('the', 5793),\n", " ('and', 3061),\n", " ('i', 2990),\n", " ('of', 2777),\n", " ('to', 2761),\n", " ('a', 2693),\n", " ('in', 1818),\n", " ('that', 1757),\n", " ('it', 1736),\n", " ('you', 1536)]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The first step in tokenizing words is to remove punctuation from the text and convert the text to lowercase.\n", "# This is important so that each distinct word is counted under a single token.\n", "\n", "# Filter tokenized words\n", "alpha = [wrd.lower() for wrd in text if wrd.isalpha()]\n", "text = nltk.Text(alpha)\n", "\n", "# Once the text is converted to lowercase and punctuation is removed, it is possible to create a collection of words\n", "# with their total frequencies in the text. This is known as a bag of words. This can be used to\n", "# identify topics in a text. It is also a good way of identifying the overall significance of a word in a text. 
\n", "\n", "# Basic bag of words\n", "basic_bag_of_words = Counter(text)\n", "basic_bag_of_words.most_common(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at the most common words in the tokenized text above, you may notice that the words with the highest frequencies are articles and other function words that do not really give any insight into the text. This is because many of the most common words in this bag of words are 'stop words': words that are exceptionally common in a language. It is best to remove them so that they do not disrupt the accuracy of any future analysis. The NLTK module has a predefined set of stop words for the English language." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('said', 486),\n", " ('upon', 467),\n", " ('holmes', 466),\n", " ('one', 374),\n", " ('would', 333),\n", " ('man', 303),\n", " ('could', 288),\n", " ('little', 269),\n", " ('see', 232),\n", " ('may', 210)]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Import NLTK's stop words corpus\n", "from nltk.corpus import stopwords\n", "stop_words = set(stopwords.words('english'))\n", "\n", "# Remove stop words\n", "bag_of_words_no_stop = Counter([word for word in text if word not in stop_words])\n", "\n", "bag_of_words_no_stop.most_common(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at some of the most common words in the bag of words above, we can now see that the words give some insight into the text's meaning. In addition, the bag of words can now be used for more advanced statistical analysis and modeling." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Text Classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another aspect of Natural Language Processing is classifying text based on predetermined labels. 
This task draws heavily on elements of supervised machine learning and the word tokenization practices previously discussed. The data set used in this exercise is the **fake_real_news_dataset**, created by **George McIntire (geo.mcintire@gmail.com)**." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>Unnamed: 0</th>\n", "      <th>title</th>\n", "      <th>text</th>\n", "      <th>label</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>0</th>\n", "      <td>8476</td>\n", "      <td>You Can Smell Hillary’s Fear</td>\n", "      <td>Daniel Greenfield, a Shillman Journalism Fello...</td>\n", "      <td>FAKE</td>\n", "    </tr>\n", "    <tr>\n", "      <th>1</th>\n", "      <td>10294</td>\n", "      <td>Watch The Exact Moment Paul Ryan Committed Pol...</td>\n", "      <td>Google Pinterest Digg Linkedin Reddit Stumbleu...</td>\n", "      <td>FAKE</td>\n", "    </tr>\n", "    <tr>\n", "      <th>2</th>\n", "      <td>3608</td>\n", "      <td>Kerry to go to Paris in gesture of sympathy</td>\n", "      <td>U.S. Secretary of State John F. Kerry said Mon...</td>\n", "      <td>REAL</td>\n", "    </tr>\n", "    <tr>\n", "      <th>3</th>\n", "      <td>10142</td>\n", "      <td>Bernie supporters on Twitter erupt in anger ag...</td>\n", "      <td>— Kaydee King (@KaydeeKing) November 9, 2016 T...</td>\n", "      <td>FAKE</td>\n", "    </tr>\n", "    <tr>\n", "      <th>4</th>\n", "      <td>875</td>\n", "      <td>The Battle of New York: Why This Primary Matters</td>\n", "      <td>It's primary day in New York and front-runners...</td>\n", "      <td>REAL</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>"