{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Text preprocessing, POS tagging and NER\n", "> In this chapter, you will learn about tokenization and lemmatization. You will then learn how to perform text cleaning, part-of-speech tagging, and named entity recognition using the spaCy library. Upon mastering these concepts, you will proceed to make the Gettysburg address machine-friendly, analyze noun usage in fake news, and identify people mentioned in a TechCrunch article. This is the Summary of lecture \"Feature Engineering for NLP in Python\", via datacamp.\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Natural_Language_Processing]\n", "- image: " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "plt.rcParams['figure.figsize'] = (8, 8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tokenization and Lemmatization\n", "- Text preprocessing techniques\n", " - Converting words into lowercase\n", " - Removing leading and trailing whitespaces\n", " - Removing punctuation\n", " - Removing stopwords\n", " - Expanding contractions\n", "- Tokenization\n", " - the process of splitting a string into its constituent tokens\n", "- Lemmatization\n", " - the process of converting a word into its lowercased base form or lemma\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tokenizing the Gettysburg Address\n", "In this exercise, you will be tokenizing one of the most famous speeches of all time: the Gettysburg Address delivered by American President Abraham Lincoln during the American Civil War." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "with open('./dataset/gettysburg.txt', 'r') as f:\n", " gettysburg = f.read()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Four', 'score', 'and', 'seven', 'years', 'ago', 'our', 'fathers', 'brought', 'forth', 'on', 'this', 'continent', ',', 'a', 'new', 'nation', ',', 'conceived', 'in', 'Liberty', ',', 'and', 'dedicated', 'to', 'the', 'proposition', 'that', 'all', 'men', 'are', 'created', 'equal', '.', 'Now', 'we', \"'re\", 'engaged', 'in', 'a', 'great', 'civil', 'war', ',', 'testing', 'whether', 'that', 'nation', ',', 'or', 'any', 'nation', 'so', 'conceived', 'and', 'so', 'dedicated', ',', 'can', 'long', 'endure', '.', 'We', \"'re\", 'met', 'on', 'a', 'great', 'battlefield', 'of', 'that', 'war', '.', 'We', \"'ve\", 'come', 'to', 'dedicate', 'a', 'portion', 'of', 'that', 'field', ',', 'as', 'a', 'final', 'resting', 'place', 'for', 'those', 'who', 'here', 'gave', 'their', 'lives', 'that', 'that', 'nation', 'might', 'live', '.', 'It', \"'s\", 'altogether', 'fitting', 'and', 'proper', 'that', 'we', 'should', 'do', 'this', '.', 'But', ',', 'in', 'a', 'larger', 'sense', ',', 'we', 'ca', \"n't\", 'dedicate', '-', 'we', 'can', 'not', 'consecrate', '-', 'we', 'can', 'not', 'hallow', '-', 'this', 'ground', '.', 'The', 'brave', 'men', ',', 'living', 'and', 'dead', ',', 'who', 'struggled', 'here', ',', 'have', 'consecrated', 'it', ',', 'far', 'above', 'our', 'poor', 'power', 'to', 'add', 'or', 'detract', '.', 'The', 'world', 'will', 'little', 'note', ',', 'nor', 'long', 'remember', 'what', 'we', 'say', 'here', ',', 'but', 'it', 'can', 'never', 'forget', 'what', 'they', 'did', 'here', '.', 'It', 'is', 'for', 'us', 'the', 'living', ',', 'rather', ',', 'to', 'be', 'dedicated', 'here', 'to', 'the', 'unfinished', 'work', 'which', 'they', 'who', 'fought', 'here', 'have', 'thus', 'far', 'so', 'nobly', 'advanced', '.', 'It', \"'s\", 'rather', 'for', 'us', 'to', 'be', 'here', 'dedicated', 'to', 'the', 'great', 'task', 'remaining', 'before', 'us', '-', 'that', 'from', 'these', 'honored', 'dead', 'we', 'take', 'increased', 'devotion', 'to', 'that', 'cause', 'for', 'which', 'they', 'gave', 'the', 'last', 'full', 'measure', 'of', 'devotion', '-', 'that', 'we', 'here', 'highly', 'resolve', 'that', 'these', 'dead', 'shall', 'not', 'have', 'died', 'in', 'vain', '-', 'that', 'this', 'nation', ',', 'under', 'God', ',', 'shall', 'have', 'a', 'new', 'birth', 'of', 'freedom', '-', 'and', 'that', 'government', 'of', 'the', 'people', ',', 'by', 'the', 'people', ',', 'for', 'the', 'people', ',', 'shall', 'not', 'perish', 'from', 'the', 'earth', '.']\n" ] } ], "source": [ "import spacy\n", "\n", "# Load the en_core_web_sm model\n", "nlp = spacy.load('en_core_web_sm')\n", "\n", "# create a Doc object\n", "doc = nlp(gettysburg)\n", "\n", "# Generate the tokens\n", "tokens = [token.text for token in doc]\n", "print(tokens)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lemmatizing the Gettysburg address\n", "In this exercise, we will perform lemmatization on the same `gettysburg` address from before.\n", "\n", "However, this time, we will also take a look at the speech, before and after lemmatization, and try to adjudge the kind of changes that take place to make the piece more machine friendly." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we're engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We're met on a great battlefield of that war. We've come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It's altogether fitting and proper that we should do this. But, in a larger sense, we can't dedicate - we can not consecrate - we can not hallow - this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It's rather for us to be here dedicated to the great task remaining before us - that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion - that we here highly resolve that these dead shall not have died in vain - that this nation, under God, shall have a new birth of freedom - and that government of the people, by the people, for the people, shall not perish from the earth.\n" ] } ], "source": [ "# Print the gettysburg address\n", "print(gettysburg)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "four score and seven year ago -PRON- father bring forth on this continent , a new nation , conceive in Liberty , and dedicate to the proposition that all man be create equal . now -PRON- be engage in a great civil war , test whether that nation , or any nation so conceive and so dedicated , can long endure . -PRON- be meet on a great battlefield of that war . -PRON- have come to dedicate a portion of that field , as a final resting place for those who here give -PRON- life that that nation may live . -PRON- be altogether fitting and proper that -PRON- should do this . but , in a large sense , -PRON- can not dedicate - -PRON- can not consecrate - -PRON- can not hallow - this ground . the brave man , living and dead , who struggle here , have consecrate -PRON- , far above -PRON- poor power to add or detract . the world will little note , nor long remember what -PRON- say here , but -PRON- can never forget what -PRON- do here . -PRON- be for -PRON- the living , rather , to be dedicate here to the unfinished work which -PRON- who fight here have thus far so nobly advanced . 
-PRON- be rather for -PRON- to be here dedicated to the great task remain before -PRON- - that from these honor dead -PRON- take increase devotion to that cause for which -PRON- give the last full measure of devotion - that -PRON- here highly resolve that these dead shall not have die in vain - that this nation , under God , shall have a new birth of freedom - and that government of the people , by the people , for the people , shall not perish from the earth .\n" ] } ], "source": [ "# Generate lemmas\n", "lemmas = [token.lemma_ for token in doc]\n", "\n", "# Convert lemmas into a string\n", "print(' '.join(lemmas))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Observe the lemmatized version of the speech. It isn't very readable to humans, but it is in a much more convenient format for a machine to process." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Text cleaning\n", "- Text cleaning techniques\n", " - Unnecessary whitespace and escape sequences\n", " - Punctuation\n", " - Special characters (numbers, emojis, etc.)\n", " - Stopwords\n", "- Stopwords\n", " - Words that occur extremely commonly\n", " - e.g. articles, be-verbs, pronouns, etc." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cleaning a blog post\n", "In this exercise, you have been given an excerpt from a blog post. Your task is to clean this text into a more machine-friendly format. This will involve converting to lowercase, lemmatizing, and removing stopwords, punctuation, and non-alphabetic characters." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "with open('./dataset/blog.txt', 'r') as file:\n", "    blog = file.read()\n", "\n", "stopwords = spacy.lang.en.stop_words.STOP_WORDS\n", "blog = blog.lower()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "century politic witness alarm rise populism europe warning sign come uk brexit referendum vote swinge way leave follow stupendous victory billionaire donald trump president united states november europe steady rise populist far right party capitalize europe immigration crisis raise nationalist anti europe sentiment instance include alternative germany afd win seat enter bundestag upset germany political order time second world war success star movement italy surge popularity neo nazism neo fascism country hungary czech republic poland austria\n" ] } ], "source": [ "# Generate Doc object: doc\n", "doc = nlp(blog)\n", "\n", "# Generate lemmatized tokens\n", "lemmas = [token.lemma_ for token in doc]\n", "\n", "# Remove stopwords and non-alphabetic tokens\n", "a_lemmas = [lemma for lemma in lemmas if lemma.isalpha() and lemma not in stopwords]\n", "\n", "# Print string after text cleaning\n", "print(' '.join(a_lemmas))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Take a look at the cleaned text; it is lowercased and devoid of numbers, punctuation and commonly used stopwords. Also, note that the word U.S. was present in the original text. Since it had periods in between, our text cleaning process completely removed it. This may not be ideal behavior, so it is advisable to use custom functions in place of `isalpha()` for such nuanced cases." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cleaning TED talks in a dataframe\n", "In this exercise, we will revisit the TED Talks from the first chapter. You have been given a dataframe `ted` consisting of 500 TED Talks. Your task is to clean these talks using the techniques discussed earlier by writing a function `preprocess` and applying it to the `transcript` feature of the dataframe." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
transcripturl
0we're going to talk — my — a new lecture, just...https://www.ted.com/talks/al_seckel_says_our_b...
1this is a representation of your brain, and yo...https://www.ted.com/talks/aaron_o_connell_maki...
2it's a great honor today to share with you the...https://www.ted.com/talks/carter_emmart_demos_...
3my passions are music, technology and making t...https://www.ted.com/talks/jared_ficklin_new_wa...
4it used to be that if you wanted to get a comp...https://www.ted.com/talks/jeremy_howard_the_wo...
\n", "
" ], "text/plain": [ " transcript \\\n", "0 we're going to talk — my — a new lecture, just... \n", "1 this is a representation of your brain, and yo... \n", "2 it's a great honor today to share with you the... \n", "3 my passions are music, technology and making t... \n", "4 it used to be that if you wanted to get a comp... \n", "\n", " url \n", "0 https://www.ted.com/talks/al_seckel_says_our_b... \n", "1 https://www.ted.com/talks/aaron_o_connell_maki... \n", "2 https://www.ted.com/talks/carter_emmart_demos_... \n", "3 https://www.ted.com/talks/jared_ficklin_new_wa... \n", "4 https://www.ted.com/talks/jeremy_howard_the_wo... " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ted = pd.read_csv('./dataset/ted.csv')\n", "ted['transcript'] = ted['transcript'].str.lower()\n", "ted.head()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 talk new lecture ted illusion create ted try r...\n", "1 representation brain brain break left half log...\n", "2 great honor today share digital universe creat...\n", "3 passion music technology thing combination thi...\n", "4 use want computer new program programming requ...\n", " ... \n", "495 today unpack example iconic design perfect sen...\n", "496 brother belong demographic pat percent accord ...\n", "497 john hockenberry great tom want start question...\n", "498 right moment kill car internet little mobile d...\n", "499 real problem math education right basically ha...\n", "Name: transcript, Length: 500, dtype: object\n" ] } ], "source": [ "# Function to preprocess text\n", "def preprocess(text):\n", " # Create Doc object\n", " doc = nlp(text, disable=['ner', 'parser'])\n", " \n", " # Generate lemmas\n", " lemmas = [token.lemma_ for token in doc]\n", " \n", " # Remove stopwords and non-alphabetic characters\n", " a_lemmas = [lemma for lemma in lemmas if lemma.isalpha() and lemma not in stopwords]\n", " \n", " return ' '.join(a_lemmas)\n", "\n", "# Apply preprocess to ted['transcript']\n", "ted['transcript'] = ted['transcript'].apply(preprocess)\n", "print(ted['transcript'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You have preprocessed all the TED talk transcripts contained in `ted` and it is now in a good shape to perform operations such as vectorization. You now have a good understanding of how text preprocessing works and why it is important. In the next lessons, we will move on to generating word level features for our texts." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part-of-speech tagging\n", "- Part-of-Speech (POS)\n", " - helps in identifying distinction by identifying one bear as a noun and the other as a verb\n", " - Word-sense disambiguation\n", " - \"The bear is a majestic animal\"\n", " - \"Please bear with me\"\n", " - Sentiment analysis\n", " - Question answering\n", " - Fake news and opinion spam detection\n", "- POS tagging\n", " - Assigning every word, its corresponding part of speech\n", "- POS annotation in spaCy\n", " - `PROPN` - proper noun\n", " - `DET` - determinant\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### POS tagging in Lord of the Flies\n", "In this exercise, you will perform part-of-speech tagging on a famous passage from one of the most well-known novels of all time, Lord of the Flies, authored by William Golding." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "with open('./dataset/lotf.txt', 'r') as file:\n", " lotf = file.read()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('He', 'PRON'), ('found', 'VERB'), ('himself', 'PRON'), ('understanding', 'VERB'), ('the', 'DET'), ('wearisomeness', 'NOUN'), ('of', 'ADP'), ('this', 'DET'), ('life', 'NOUN'), (',', 'PUNCT'), ('where', 'ADV'), ('every', 'DET'), ('path', 'NOUN'), ('was', 'AUX'), ('an', 'DET'), ('improvisation', 'NOUN'), ('and', 'CCONJ'), ('a', 'DET'), ('considerable', 'ADJ'), ('part', 'NOUN'), ('of', 'ADP'), ('one', 'NOUN'), ('’s', 'PART'), ('waking', 'VERB'), ('life', 'NOUN'), ('was', 'AUX'), ('spent', 'VERB'), ('watching', 'VERB'), ('one', 'PRON'), ('’s', 'PART'), ('feet', 'NOUN'), ('.', 'PUNCT')]\n" ] } ], "source": [ "# Create d Doc object\n", "doc = nlp(lotf)\n", "\n", "# Generate tokens and pos tags\n", "pos = [(token.text, token.pos_) for token in doc]\n", "print(pos)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Examine the various POS tags attached to each token and evaluate if they make intuitive sense to you. You will notice that they are indeed labelled correctly according to the standard rules of English grammar." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Counting nouns in a piece of text\n", "In this exercise, we will write two functions, `nouns()` and `proper_nouns()` that will count the number of other nouns and proper nouns in a piece of text respectively.\n", "\n", "These functions will take in a piece of text and generate a list containing the POS tags for each word. It will then return the number of proper nouns/other nouns that the text contains. We will use these functions in the next exercise to generate interesting insights about fake news." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3\n" ] } ], "source": [ "# Returns number of proper nouns\n", "def proper_nouns(text, model=nlp):\n", " # Create doc object\n", " doc = model(text)\n", " \n", " # Generate list of POS tags\n", " pos = [token.pos_ for token in doc]\n", " \n", " # Return number of proper nouns\n", " return pos.count('PROPN')\n", "\n", "print(proper_nouns('Abdul, Bill and Cathy went to the market to buy apples.', nlp))" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2\n" ] } ], "source": [ "# Returns number of other nouns\n", "def nouns(text, model=nlp):\n", " # create doc object\n", " doc = model(text)\n", " \n", " # Generate list of POS tags\n", " pos = [token.pos_ for token in doc]\n", " \n", " # Return number of other nouns\n", " return pos.count('NOUN')\n", "\n", "print(nouns('Abdul, Bill and Cathy went to the market to buy apples.', nlp))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Noun usage in fake news\n", "In this exercise, you have been given a dataframe `headlines` that contains news headlines that are either fake or real. Your task is to generate two new features `num_propn` and `num_noun` that represent the number of proper nouns and other nouns contained in the `title` feature of `headlines`.\n", "\n", "Next, we will compute the mean number of proper nouns and other nouns used in fake and real news headlines and compare the values. 
If there is a remarkable difference, then there is a good chance that using the `num_propn` and `num_noun` features in fake news detectors will improve its performance." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0titlelabel
00You Can Smell Hillary’s FearFAKE
11Watch The Exact Moment Paul Ryan Committed Pol...FAKE
22Kerry to go to Paris in gesture of sympathyREAL
33Bernie supporters on Twitter erupt in anger ag...FAKE
44The Battle of New York: Why This Primary MattersREAL
\n", "
" ], "text/plain": [ " Unnamed: 0 title label\n", "0 0 You Can Smell Hillary’s Fear FAKE\n", "1 1 Watch The Exact Moment Paul Ryan Committed Pol... FAKE\n", "2 2 Kerry to go to Paris in gesture of sympathy REAL\n", "3 3 Bernie supporters on Twitter erupt in anger ag... FAKE\n", "4 4 The Battle of New York: Why This Primary Matters REAL" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "headlines = pd.read_csv('./dataset/fakenews.csv')\n", "headlines.head()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean no. of proper nouns in real and fake headlines are 2.42 and 4.58 respectively\n", "Mean no. of other nouns in real and fake headlines are 2.30 and 1.67 respectively\n" ] } ], "source": [ "headlines['num_propn'] = headlines['title'].apply(proper_nouns)\n", "headlines['num_noun'] = headlines['title'].apply(nouns)\n", "\n", "# Compute mean of proper nouns\n", "real_propn = headlines[headlines['label'] == 'REAL']['num_propn'].mean()\n", "fake_propn = headlines[headlines['label'] == 'FAKE']['num_propn'].mean()\n", "\n", "# Compute mean of other nouns\n", "real_noun = headlines[headlines['label'] == 'REAL']['num_noun'].mean()\n", "fake_noun = headlines[headlines['label'] == 'FAKE']['num_noun'].mean()\n", "\n", "# Print results\n", "print(\"Mean no. of proper nouns in real and fake headlines are %.2f and %.2f respectively\" %\n", " (real_propn, fake_propn))\n", "print(\"Mean no. of other nouns in real and fake headlines are %.2f and %.2f respectively\" %\n", " (real_noun, fake_noun))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " You now know to construct features using POS tags information. Notice how the mean number of proper nouns is considerably higher for fake news than it is for real news. The opposite seems to be true in the case of other nouns. This fact can be put to great use in desgning fake news detectors." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Named entity recognition\n", "- Named entity recognition (NER)\n", " - Identifying and classifying named entities into predefined categories\n", " - Categories include person, organization, country, etc." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Named entities in a sentence\n", "In this exercise, we will identify and classify the labels of various named entities in a body of text using one of spaCy's statistical models. We will also verify the veracity of these labels." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sundar Pichai PERSON\n", "Google ORG\n", "Mountain View GPE\n" ] } ], "source": [ "# Create a Doc instance\n", "text = 'Sundar Pichai is the CEO of Google. Its headquarter is in Mountain View.'\n", "doc = nlp(text)\n", "\n", "# Print all named entities and their labels\n", "for ent in doc.ents:\n", " print(ent.text, ent.label_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Identifying people mentioned in a news article\n", "In this exercise, you have been given an excerpt from a news article published in TechCrunch. Your task is to write a function `find_people` that identifies the names of people that have been mentioned in a particular piece of text. You will then use `find_people` to identify the people of interest in the article." 
] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "with open('./dataset/tc.txt', 'r') as file:\n", " tc = file.read()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Sheryl Sandberg', 'Mark Zuckerberg']\n" ] } ], "source": [ "def find_persons(text):\n", " # Create Doc object\n", " doc = nlp(text)\n", " \n", " # Indentify the persons\n", " persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']\n", " \n", " # Return persons\n", " return persons\n", "\n", "print(find_persons(tc))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The article was related to Facebook and our function correctly identified both the people mentioned. You can now see how NER could be used in a variety of applications. Publishers may use a technique like this to classify news articles by the people mentioned in them. A question answering system could also use something like this to answer questions such as 'Who are the people mentioned in this passage?'. " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }