{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Dealing with Text Data\n", "> Finally, in this chapter, you will work with unstructured text data, understanding ways in which you can engineer columnar features out of a text corpus. You will compare how different approaches may impact how much context is being extracted from a text, and how to balance the need for context, without too many features being created. This is the Summary of lecture \"Feature Engineering for Machine Learning in Python\", via datacamp.\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Machine_Learning]\n", "- image: " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "plt.rcParams['figure.figsize'] = (8, 8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Encoding text\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cleaning up your text\n", "Unstructured text data cannot be directly used in most analyses. Multiple steps need to be taken to go from a long free form string to a set of numeric columns in the right format that can be ingested by a machine learning model. The first step of this process is to standardize the data and eliminate any characters that could cause problems later on in your analytic pipeline.\n", "\n", "In this chapter you will be working with a new dataset containing the inaugural speeches of the presidents of the United States loaded as `speech_df`, with the speeches stored in the `text` column." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "speech_df = pd.read_csv('./dataset/inaugural_speeches.csv')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Fellow-Citizens of the Senate and of the House...\n", "1 Fellow Citizens: I AM again called upon by th...\n", "2 WHEN it was first perceived, in early times, t...\n", "3 Friends and Fellow-Citizens: CALLED upon to u...\n", "4 PROCEEDING, fellow-citizens, to that qualifica...\n", "Name: text, dtype: object" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print the first 5 rows of the text column\n", "speech_df['text'].head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 fellow citizens of the senate and of the house...\n", "1 fellow citizens i am again called upon by th...\n", "2 when it was first perceived in early times t...\n", "3 friends and fellow citizens called upon to u...\n", "4 proceeding fellow citizens to that qualifica...\n", "Name: text_clean, dtype: object\n" ] } ], "source": [ "# Replace all non letter characters with a whitespace\n", "speech_df['text_clean'] = speech_df['text'].str.replace('[^a-zA-Z]', ' ')\n", "\n", "# Change to lower case\n", "speech_df['text_clean'] = speech_df['text_clean'].str.lower()\n", "\n", "# Print the first 5 rows of text_clean column\n", "print(speech_df['text_clean'].head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### High level text features\n", "Once the text has been cleaned and standardized you can begin creating features from the data. 
The most fundamental information you can calculate about free form text is its size, such as its length and number of words. In this exercise (and the rest of this chapter), you will focus on the cleaned/transformed text column (`text_clean`) you created in the last exercise.\n", "\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
text_cleanchar_cntword_cntavg_word_length
0fellow citizens of the senate and of the house...861614326.016760
1fellow citizens i am again called upon by th...7871355.829630
2when it was first perceived in early times t...1387123235.971158
3friends and fellow citizens called upon to u...1014417365.843318
4proceeding fellow citizens to that qualifica...1290221695.948363
\n", "
" ], "text/plain": [ " text_clean char_cnt word_cnt \\\n", "0 fellow citizens of the senate and of the house... 8616 1432 \n", "1 fellow citizens i am again called upon by th... 787 135 \n", "2 when it was first perceived in early times t... 13871 2323 \n", "3 friends and fellow citizens called upon to u... 10144 1736 \n", "4 proceeding fellow citizens to that qualifica... 12902 2169 \n", "\n", " avg_word_length \n", "0 6.016760 \n", "1 5.829630 \n", "2 5.971158 \n", "3 5.843318 \n", "4 5.948363 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Find the length of each text\n", "speech_df['char_cnt'] = speech_df['text_clean'].str.len()\n", "\n", "# Count the number of words in each text\n", "speech_df['word_cnt'] = speech_df['text_clean'].str.split().str.len()\n", "\n", "# Find the average length of word\n", "speech_df['avg_word_length'] = speech_df['char_cnt'] / speech_df['word_cnt']\n", "\n", "# Print the first 5 rows of these columns\n", "speech_df[['text_clean', 'char_cnt', 'word_cnt', 'avg_word_length']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Word counts\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Counting words (I)\n", "Once high level information has been recorded you can begin creating features based on the actual content of each text. One way to do this is to approach it in a similar way to how you worked with categorical variables in the earlier lessons.\n", "\n", "- For each unique word in the dataset a column is created.\n", "- For each entry, the number of times this word occurs is counted and the count value is entered into the respective column.\n", "\n", "These `\"count\"` columns can then be used to train machine learning models." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['abandon', 'abandoned', 'abandonment', 'abate', 'abdicated', 'abeyance', 'abhorring', 'abide', 'abiding', 'abilities']\n" ] } ], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "# Instantiate CountVectorizer\n", "cv = CountVectorizer()\n", "\n", "# Fit the vectorizer\n", "cv.fit(speech_df['text_clean'])\n", "\n", "# Print feature names\n", "print(cv.get_feature_names()[:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Counting words (II)\n", "Once the vectorizer has been fit to the data, it can be used to transform the text to an array representing the word counts. This array will have a row per block of text and a column for each of the features generated by the vectorizer that you observed in the last exercise." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[0 0 0 ... 0 0 0]\n", " [0 0 0 ... 0 0 0]\n", " [0 1 0 ... 0 0 0]\n", " ...\n", " [0 1 0 ... 0 0 0]\n", " [0 0 0 ... 0 0 0]\n", " [0 0 0 ... 
0 0 0]]\n" ] } ], "source": [ "# Apply the vectorizer\n", "cv_transformed = cv.transform(speech_df['text_clean'])\n", "\n", "# Print the full array\n", "cv_array = cv_transformed.toarray()\n", "print(cv_array)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(58, 9043)\n" ] } ], "source": [ "# Print the shape of cv_array\n", "print(cv_array.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Limiting your features\n", "As you have seen, using the `CountVectorizer` with its default settings creates a feature for every single word in your corpus. This can create far too many features, often including ones that will provide very little analytical value.\n", "\n", "For this purpose `CountVectorizer` has parameters that you can set to reduce the number of features:\n", "\n", "- `min_df` : Use only words that occur in at least this proportion of documents. This can be used to remove outlier words that will not generalize across texts.\n", "- `max_df` : Use only words that occur in less than this proportion of documents. This is useful for eliminating very common words that occur in nearly every document without adding value, such as \"and\" or \"the\"." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(58, 818)\n" ] } ], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "# Specify arguments to limit the number of features generated\n", "cv = CountVectorizer(min_df=0.2, max_df=0.8)\n", "\n", "# Fit, transform, and convert into array\n", "cv_transformed = cv.fit_transform(speech_df['text_clean'])\n", "cv_array = cv_transformed.toarray()\n", "\n", "# Print the array shape\n", "print(cv_array.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Text to DataFrame\n", "Now that you have generated these count-based features in an array you will need to reformat them so that they can be combined with the rest of the dataset. This can be achieved by converting the array into a pandas DataFrame, with the feature names you found earlier as the column names, and then concatenating it with the original DataFrame." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameInaugural AddressDatetexttext_cleanchar_cntword_cntavg_word_lengthCounts_abidingCounts_ability...Counts_womenCounts_wordsCounts_workCounts_wrongCounts_yearCounts_yearsCounts_yetCounts_youCounts_youngCounts_your
0George WashingtonFirst Inaugural AddressThursday, April 30, 1789Fellow-Citizens of the Senate and of the House...fellow citizens of the senate and of the house...861614326.01676000...0000010509
1George WashingtonSecond Inaugural AddressMonday, March 4, 1793Fellow Citizens: I AM again called upon by th...fellow citizens i am again called upon by th...7871355.82963000...0000000001
2John AdamsInaugural AddressSaturday, March 4, 1797WHEN it was first perceived, in early times, t...when it was first perceived in early times t...1387123235.97115800...0000230001
3Thomas JeffersonFirst Inaugural AddressWednesday, March 4, 1801Friends and Fellow-Citizens: CALLED upon to u...friends and fellow citizens called upon to u...1014417365.84331800...0012002707
4Thomas JeffersonSecond Inaugural AddressMonday, March 4, 1805PROCEEDING, fellow-citizens, to that qualifica...proceeding fellow citizens to that qualifica...1290221695.94836300...0000222404
\n", "

5 rows × 826 columns

\n", "
" ], "text/plain": [ " Name Inaugural Address Date \\\n", "0 George Washington First Inaugural Address Thursday, April 30, 1789 \n", "1 George Washington Second Inaugural Address Monday, March 4, 1793 \n", "2 John Adams Inaugural Address Saturday, March 4, 1797 \n", "3 Thomas Jefferson First Inaugural Address Wednesday, March 4, 1801 \n", "4 Thomas Jefferson Second Inaugural Address Monday, March 4, 1805 \n", "\n", " text \\\n", "0 Fellow-Citizens of the Senate and of the House... \n", "1 Fellow Citizens: I AM again called upon by th... \n", "2 WHEN it was first perceived, in early times, t... \n", "3 Friends and Fellow-Citizens: CALLED upon to u... \n", "4 PROCEEDING, fellow-citizens, to that qualifica... \n", "\n", " text_clean char_cnt word_cnt \\\n", "0 fellow citizens of the senate and of the house... 8616 1432 \n", "1 fellow citizens i am again called upon by th... 787 135 \n", "2 when it was first perceived in early times t... 13871 2323 \n", "3 friends and fellow citizens called upon to u... 10144 1736 \n", "4 proceeding fellow citizens to that qualifica... 12902 2169 \n", "\n", " avg_word_length Counts_abiding Counts_ability ... Counts_women \\\n", "0 6.016760 0 0 ... 0 \n", "1 5.829630 0 0 ... 0 \n", "2 5.971158 0 0 ... 0 \n", "3 5.843318 0 0 ... 0 \n", "4 5.948363 0 0 ... 0 \n", "\n", " Counts_words Counts_work Counts_wrong Counts_year Counts_years \\\n", "0 0 0 0 0 1 \n", "1 0 0 0 0 0 \n", "2 0 0 0 2 3 \n", "3 0 1 2 0 0 \n", "4 0 0 0 2 2 \n", "\n", " Counts_yet Counts_you Counts_young Counts_your \n", "0 0 5 0 9 \n", "1 0 0 0 1 \n", "2 0 0 0 1 \n", "3 2 7 0 7 \n", "4 2 4 0 4 \n", "\n", "[5 rows x 826 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create a DataFrame with these features\n", "cv_df = pd.DataFrame(cv_array, columns = cv.get_feature_names()).add_prefix('Counts_')\n", "\n", "# Add the new columns to the original DataFrame\n", "speech_df_new = pd.concat([speech_df, cv_df], axis=1, sort=False)\n", "speech_df_new.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Term frequency-inverse document frequency\n", "- TF-IDF\n", " - Term Frequency - Inverse Document Frequency\n", " $$ \\text{TF-IDF} = \\frac{\\frac{\\text{count of word occurances}}{\\text{Total words in documents}}}{\\log (\\frac{\\text{Number of docs word is in}}{\\text{Total number of docs}})} $$\n", " - Measures of what proportion of the documents a word occurs in all documents" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tf-idf\n", "While counts of occurrences of words can be useful to build models, words that occur many times may skew the results undesirably. To limit these common words from overpowering your model a form of normalization can be used. In this lesson you will be using Term frequency-inverse document frequency (Tf-idf) as was discussed in the video. Tf-idf has the effect of reducing the value of common words, while increasing the weight of words that do not occur in many documents.\n", "\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TFIDF_actionTFIDF_administrationTFIDF_americaTFIDF_americanTFIDF_americansTFIDF_believeTFIDF_bestTFIDF_betterTFIDF_changeTFIDF_citizens...TFIDF_thingsTFIDF_timeTFIDF_todayTFIDF_unionTFIDF_unitedTFIDF_warTFIDF_wayTFIDF_workTFIDF_worldTFIDF_years
00.0000000.1334150.0000000.1053880.00.0000000.0000000.0000000.0000000.229644...0.0000000.0459290.00.1360120.2035930.0000000.0607550.0000000.0459290.052694
10.0000000.2610160.2660970.0000000.00.0000000.0000000.0000000.0000000.179712...0.0000000.0000000.00.0000000.1991570.0000000.0000000.0000000.0000000.000000
20.0000000.0924360.1570580.0730180.00.0000000.0261120.0604600.0000000.106072...0.0320300.0212140.00.0628230.0705290.0243390.0000000.0000000.0636430.073018
30.0000000.0926930.0000000.0000000.00.0909420.1178310.0454710.0533350.223369...0.0481790.0000000.00.0944970.0000000.0366100.0000000.0392770.0957290.000000
40.0413340.0397610.0000000.0314080.00.0000000.0673930.0390110.0915140.273760...0.0826670.1642560.00.1216050.0303380.0942250.0000000.0000000.0547520.062817
\n", "

5 rows × 100 columns

\n", "
" ], "text/plain": [ " TFIDF_action TFIDF_administration TFIDF_america TFIDF_american \\\n", "0 0.000000 0.133415 0.000000 0.105388 \n", "1 0.000000 0.261016 0.266097 0.000000 \n", "2 0.000000 0.092436 0.157058 0.073018 \n", "3 0.000000 0.092693 0.000000 0.000000 \n", "4 0.041334 0.039761 0.000000 0.031408 \n", "\n", " TFIDF_americans TFIDF_believe TFIDF_best TFIDF_better TFIDF_change \\\n", "0 0.0 0.000000 0.000000 0.000000 0.000000 \n", "1 0.0 0.000000 0.000000 0.000000 0.000000 \n", "2 0.0 0.000000 0.026112 0.060460 0.000000 \n", "3 0.0 0.090942 0.117831 0.045471 0.053335 \n", "4 0.0 0.000000 0.067393 0.039011 0.091514 \n", "\n", " TFIDF_citizens ... TFIDF_things TFIDF_time TFIDF_today TFIDF_union \\\n", "0 0.229644 ... 0.000000 0.045929 0.0 0.136012 \n", "1 0.179712 ... 0.000000 0.000000 0.0 0.000000 \n", "2 0.106072 ... 0.032030 0.021214 0.0 0.062823 \n", "3 0.223369 ... 0.048179 0.000000 0.0 0.094497 \n", "4 0.273760 ... 0.082667 0.164256 0.0 0.121605 \n", "\n", " TFIDF_united TFIDF_war TFIDF_way TFIDF_work TFIDF_world TFIDF_years \n", "0 0.203593 0.000000 0.060755 0.000000 0.045929 0.052694 \n", "1 0.199157 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "2 0.070529 0.024339 0.000000 0.000000 0.063643 0.073018 \n", "3 0.000000 0.036610 0.000000 0.039277 0.095729 0.000000 \n", "4 0.030338 0.094225 0.000000 0.000000 0.054752 0.062817 \n", "\n", "[5 rows x 100 columns]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "\n", "# Instantiate TfidfVectorizer\n", "tv = TfidfVectorizer(max_features=100, stop_words='english')\n", "\n", "# Fit the vectorizer and transform the data\n", "tv_transformed = tv.fit_transform(speech_df['text_clean'])\n", "\n", "# Create a DataFrame with these features\n", "tv_df = pd.DataFrame(tv_transformed.toarray(),\n", " columns=tv.get_feature_names()).add_prefix('TFIDF_')\n", "tv_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Inspecting Tf-idf values\n", "After creating Tf-idf features you will often want to understand what are the most highest scored words for each corpus. This can be achieved by isolating the row you want to examine and then sorting the the scores from high to low.\n", "\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TFIDF_government 0.367430\n", "TFIDF_public 0.333237\n", "TFIDF_present 0.315182\n", "TFIDF_duty 0.238637\n", "TFIDF_citizens 0.229644\n", "Name: 0, dtype: float64\n" ] } ], "source": [ "# Isolate the row to be examined\n", "sample_row = tv_df.iloc[0]\n", "\n", "# Print the top 5 words of the sorted output\n", "print(sample_row.sort_values(ascending=False).head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Transforming unseen data\n", "When creating vectors from text, any transformations that you perform before training a machine learning model, you also need to apply on the new unseen (test) data. To achieve this follow the same approach from the last chapter: fit the vectorizer only on the training data, and apply it to the test data.\n", "\n", "For this exercise the `speech_df` DataFrame has been split in two:\n", "\n", "- `train_speech_df`: The training set consisting of the first 45 speeches.\n", "- `test_speech_df`: The test set consisting of the remaining speeches." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "train_speech_df = speech_df.iloc[:45]\n", "test_speech_df = speech_df.iloc[45:]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TFIDF_actionTFIDF_administrationTFIDF_americaTFIDF_americanTFIDF_authorityTFIDF_bestTFIDF_businessTFIDF_citizensTFIDF_commerceTFIDF_common...TFIDF_subjectTFIDF_supportTFIDF_timeTFIDF_unionTFIDF_unitedTFIDF_warTFIDF_wayTFIDF_workTFIDF_worldTFIDF_years
00.0000000.0295400.2339540.0827030.0000000.0000000.0000000.0225770.00.000000...0.00.0000000.1153780.0000000.0246480.0790500.0333130.0000000.2999830.134749
10.0000000.0000000.5474570.0368620.0000000.0360360.0000000.0150940.00.000000...0.00.0192960.0925670.0000000.0000000.0528510.0668170.0789990.2777010.126126
20.0000000.0000000.1269870.1346690.0000000.1316520.0000000.0000000.00.046997...0.00.0000000.0751510.0000000.0802720.0429070.0542450.0962030.2254520.043884
30.0370940.0674280.2670120.0314630.0399900.0615160.0500850.0773010.00.000000...0.00.0988190.2106900.0000000.0562620.0300730.0380200.2359980.2370260.061516
40.0000000.0000000.2215610.1566440.0284420.0875050.0000000.1099590.00.023428...0.00.0234280.1873130.1319130.0400160.0213890.0811240.1198940.2997010.153133
\n", "

5 rows × 100 columns

\n", "
" ], "text/plain": [ " TFIDF_action TFIDF_administration TFIDF_america TFIDF_american \\\n", "0 0.000000 0.029540 0.233954 0.082703 \n", "1 0.000000 0.000000 0.547457 0.036862 \n", "2 0.000000 0.000000 0.126987 0.134669 \n", "3 0.037094 0.067428 0.267012 0.031463 \n", "4 0.000000 0.000000 0.221561 0.156644 \n", "\n", " TFIDF_authority TFIDF_best TFIDF_business TFIDF_citizens \\\n", "0 0.000000 0.000000 0.000000 0.022577 \n", "1 0.000000 0.036036 0.000000 0.015094 \n", "2 0.000000 0.131652 0.000000 0.000000 \n", "3 0.039990 0.061516 0.050085 0.077301 \n", "4 0.028442 0.087505 0.000000 0.109959 \n", "\n", " TFIDF_commerce TFIDF_common ... TFIDF_subject TFIDF_support \\\n", "0 0.0 0.000000 ... 0.0 0.000000 \n", "1 0.0 0.000000 ... 0.0 0.019296 \n", "2 0.0 0.046997 ... 0.0 0.000000 \n", "3 0.0 0.000000 ... 0.0 0.098819 \n", "4 0.0 0.023428 ... 0.0 0.023428 \n", "\n", " TFIDF_time TFIDF_union TFIDF_united TFIDF_war TFIDF_way TFIDF_work \\\n", "0 0.115378 0.000000 0.024648 0.079050 0.033313 0.000000 \n", "1 0.092567 0.000000 0.000000 0.052851 0.066817 0.078999 \n", "2 0.075151 0.000000 0.080272 0.042907 0.054245 0.096203 \n", "3 0.210690 0.000000 0.056262 0.030073 0.038020 0.235998 \n", "4 0.187313 0.131913 0.040016 0.021389 0.081124 0.119894 \n", "\n", " TFIDF_world TFIDF_years \n", "0 0.299983 0.134749 \n", "1 0.277701 0.126126 \n", "2 0.225452 0.043884 \n", "3 0.237026 0.061516 \n", "4 0.299701 0.153133 \n", "\n", "[5 rows x 100 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Instantiate TfidfVectorizer\n", "tv = TfidfVectorizer(max_features=100, stop_words='english')\n", "\n", "# Fit the vectorizer and transform the data\n", "tv_transformed = tv.fit_transform(train_speech_df['text_clean'])\n", "\n", "# Transform test data\n", "test_tv_transformed = tv.transform(test_speech_df['text_clean'])\n", "\n", "# Create new features for the test set\n", "test_tv_df = pd.DataFrame(test_tv_transformed.toarray(), \n", " columns=tv.get_feature_names()).add_prefix('TFIDF_')\n", "test_tv_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## N-grams\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using longer n-grams\n", "So far you have created features based on individual words in each of the texts. This can be quite powerful when used in a machine learning model but you may be concerned that by looking at words individually a lot of the context is being ignored. To deal with this when creating models you can use n-grams which are sequence of n words grouped together. For example:\n", "\n", "- bigrams: Sequences of two consecutive words\n", "- trigrams: Sequences of two consecutive words\n", "\n", "These can be automatically created in your dataset by specifying the ngram_range argument as a tuple `(n1, n2)` where all n-grams in the `n1` to `n2` range are included." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['ability preserve protect',\n", " 'agriculture commerce manufactures',\n", " 'america ideal freedom',\n", " 'amity mutual concession',\n", " 'anchor peace home',\n", " 'ask bow heads',\n", " 'best ability preserve',\n", " 'best interests country',\n", " 'bless god bless',\n", " 'bless united states']" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Instantiate a trigram vectorizer\n", "cv_trigram_vec = CountVectorizer(max_features=100, \n", " stop_words='english', \n", " ngram_range=(3, 3))\n", "\n", "# Fit and apply trigram vectorizer\n", "cv_trigram = cv_trigram_vec.fit_transform(speech_df['text_clean'])\n", "\n", "# Print the trigram features\n", "cv_trigram_vec.get_feature_names()[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Finding the most common words\n", "Its always advisable once you have created your features to inspect them to ensure that they are as you would expect. This will allow you to catch errors early, and perhaps influence what further feature engineering you will need to do." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counts_constitution united states 20\n", "Counts_people united states 13\n", "Counts_preserve protect defend 10\n", "Counts_mr chief justice 10\n", "Counts_president united states 8\n", "dtype: int64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create a DataFrame of the features\n", "cv_tri_df = pd.DataFrame(cv_trigram.toarray(), \n", " columns = cv_trigram_vec.get_feature_names()).add_prefix('Counts_')\n", "\n", "# Print the top 5 words in the sorted output\n", "cv_tri_df.sum().sort_values(ascending=False).head()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }