{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Preprocessing - Putting it all together\n", "> Now that you've learned all about preprocessing you'll try these techniques out on a dataset that records information on UFO sightings. This is the Summary of lecture \"Preprocessing for Machine Learning in Python\", via datacamp.\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Machine_Learning]\n", "- image: " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## UFOs and preprocessing\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Checking column types\n", "Take a look at the UFO dataset's column types using the `dtypes` attribute. Two columns jump out for transformation: the seconds column, which is a numeric column but is being read in as `object`, and the `date` column, which can be transformed into the `datetime` type. That will make our feature engineering efforts easier later on.\n", "\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datecitystatecountrytypesecondslength_of_timedescrecordedlatlong
011/3/2011 19:21woodvillewiusunknown1209600.02 weeksRed blinking objects similar to airplanes or s...12/12/201144.9530556-92.291111
110/3/2004 19:05clevelandohuscircle30.030sec.Many fighter jets flying towards UFO10/27/200441.4994444-81.695556
29/25/2009 21:00coon rapidsmnuscigar0.0NaNGreen&#44 red&#44 and blue pulses of light tha...12/12/200945.1200000-93.287500
311/21/2002 05:45clemmonsncustriangle300.0about 5 minutesIt was a large&#44 triangular shaped flying ob...12/23/200236.0213889-80.382222
48/19/2010 12:55calgary (canada)abcaoval0.02A white spinning disc in the shape of an oval.8/24/201051.083333-114.083333
\n", "
" ], "text/plain": [ " date city state country type seconds \\\n", "0 11/3/2011 19:21 woodville wi us unknown 1209600.0 \n", "1 10/3/2004 19:05 cleveland oh us circle 30.0 \n", "2 9/25/2009 21:00 coon rapids mn us cigar 0.0 \n", "3 11/21/2002 05:45 clemmons nc us triangle 300.0 \n", "4 8/19/2010 12:55 calgary (canada) ab ca oval 0.0 \n", "\n", " length_of_time desc \\\n", "0 2 weeks Red blinking objects similar to airplanes or s... \n", "1 30sec. Many fighter jets flying towards UFO \n", "2 NaN Green, red, and blue pulses of light tha... \n", "3 about 5 minutes It was a large, triangular shaped flying ob... \n", "4 2 A white spinning disc in the shape of an oval. \n", "\n", " recorded lat long \n", "0 12/12/2011 44.9530556 -92.291111 \n", "1 10/27/2004 41.4994444 -81.695556 \n", "2 12/12/2009 45.1200000 -93.287500 \n", "3 12/23/2002 36.0213889 -80.382222 \n", "4 8/24/2010 51.083333 -114.083333 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ufo = pd.read_csv('./dataset/ufo_sightings_large.csv')\n", "ufo.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "date object\n", "city object\n", "state object\n", "country object\n", "type object\n", "seconds float64\n", "length_of_time object\n", "desc object\n", "recorded object\n", "lat object\n", "long float64\n", "dtype: object\n", "seconds float64\n", "date datetime64[ns]\n", "dtype: object\n" ] } ], "source": [ "# Check the column types\n", "print(ufo.dtypes)\n", "\n", "# Change the type of seconds to float\n", "ufo['seconds'] = ufo['seconds'].astype(float)\n", "\n", "# Change the date column to type datetime\n", "ufo['date'] = pd.to_datetime(ufo['date'])\n", "\n", "# Check the column types\n", "print(ufo[['seconds', 'date']].dtypes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dropping missing data\n", "Let's remove some of the rows where certain columns have missing values. We're going to look at the `length_of_time` column, the `state` column, and the `type` column. If any of the values in these columns are missing, we're going to drop the rows.\n", "\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "length_of_time 143\n", "state 419\n", "type 159\n", "dtype: int64\n", "(4283, 11)\n" ] } ], "source": [ "# Check how many values are missing in the length_of_time, state, and type columns\n", "print(ufo[['length_of_time', 'state', 'type']].isnull().sum())\n", "\n", "# Keep only rows where length_of_time, state, and type are not null\n", "ufo_no_missing = ufo[ufo['length_of_time'].notnull() &\n", " ufo['state'].notnull() & \n", " ufo['type'].notnull()]\n", "\n", "# Print out the shape of the new dataset\n", "print(ufo_no_missing.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Categorical variables and standardization\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extracting numbers from strings\n", "The `length_of_time` field in the UFO dataset is a text field that has the number of minutes within the string. 
Here, you'll extract that number from that text field using regular expressions.\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " length_of_time minutes\n", "0 about 5 minutes NaN\n", "1 10 minutes 10.0\n", "2 2 minutes 2.0\n", "3 2 minutes 2.0\n", "4 5 minutes 5.0\n", "5 10 minutes 10.0\n", "6 5 minutes 5.0\n", "7 5 minutes 5.0\n", "8 5 minutes 5.0\n", "9 1minute 1.0\n" ] } ], "source": [ "import re\n", "import math\n", "\n", "ufo = pd.read_csv('./dataset/ufo_sample.csv')\n", "\n", "# Change the type of seconds to float\n", "ufo['seconds'] = ufo['seconds'].astype(float)\n", "\n", "# Change the date column to type datetime\n", "ufo['date'] = pd.to_datetime(ufo['date'])\n", "\n", "def return_minutes(time_string):\n", " # Use \\d+ to grab digits\n", " pattern = re.compile(r'\\d+')\n", "\n", " # Use match on the pattern and column\n", " num = re.match(pattern, time_string)\n", " if num is not None:\n", " return int(num.group(0))\n", " \n", "# Apply the extraction to the length_of_time column\n", "ufo['minutes'] = ufo['length_of_time'].apply(lambda row: return_minutes(row))\n", "\n", "# Take a look at the head of both of the columns\n", "print(ufo[['length_of_time', 'minutes']].head(10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Identifying features for standardization\n", "In this section, you'll investigate the variance of columns in the UFO dataset to determine which features should be standardized. After taking a look at the variances of the `seconds` and `minutes` columns, you'll see that the variance of the seconds column is extremely high. Because seconds and minutes are related to each other (an issue we'll deal with when we select features for modeling), let's log normalize the `seconds` column." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "seconds 424087.417474\n", "minutes 117.546372\n", "dtype: float64\n", "1.122392388118297\n" ] } ], "source": [ "# Check the variance of the seconds and minutes columns\n", "print(ufo[['seconds', 'minutes']].var())\n", "\n", "# Log normalize the seconds column\n", "ufo['seconds_log'] = np.log(ufo['seconds'])\n", "\n", "# Print out the variance of just the seconds_log column\n", "print(ufo['seconds_log'].var())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Engineering new features\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Encoding categorical variables\n", "There are a couple of columns in the UFO dataset that need to be encoded before they can be modeled through scikit-learn. 
You'll do that transformation here, using both binary and one-hot encoding methods.\n", "\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "21\n" ] } ], "source": [ "# Use Pandas to encode us values as 1 and others as 0\n", "ufo['country_enc'] = ufo['country'].apply(lambda x: 1 if x == 'us' else 0)\n", "\n", "# Print the number of unique type values\n", "print(len(ufo['type'].unique()))\n", "\n", "# Create a one-hot encoded set of the type values\n", "type_set = pd.get_dummies(ufo['type'])\n", "\n", "# Concatenate this set back to the ufo DataFrame\n", "ufo = pd.concat([ufo, type_set], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Features from dates\n", "Another feature engineering task to perform is month and year extraction. Perform this task on the `date` column of the `ufo` dataset.\n", "\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "datetime64[ns]\n", " date month year\n", "0 2002-11-21 05:45:00 11 2002\n", "1 2012-06-16 23:00:00 6 2012\n", "2 2013-06-09 00:00:00 6 2013\n", "3 2013-04-26 23:27:00 4 2013\n", "4 2013-09-13 20:30:00 9 2013\n" ] } ], "source": [ "# Look at the first 5 rows of the date column\n", "print(ufo['date'].dtypes)\n", "\n", "# Extract the month from the date column\n", "ufo['month'] = ufo['date'].apply(lambda date: date.month)\n", "\n", "# Extract the year from the date column\n", "ufo['year'] = ufo['date'].apply(lambda date: date.year)\n", "\n", "# Take a look at the head of all three columns\n", "print(ufo[['date', 'month', 'year']].head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Text vectorization\n", "Let's transform the `desc` column in the UFO dataset into tf/idf vectors, since there's likely something we can learn from this field.\n", "\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 It was a large, triangular shaped flying ob...\n", "1 Dancing lights that would fly around and then ...\n", "2 Brilliant orange light or chinese lantern at o...\n", "3 Bright red light moving north to north west fr...\n", "4 North-east moving south-west. First 7 or so li...\n", "Name: desc, dtype: object\n", "(1866, 3422)\n" ] } ], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "\n", "# Take a look at the head of the desc field\n", "print(ufo['desc'].head())\n", "\n", "# Create the tfidf vectorizer object\n", "vec = TfidfVectorizer()\n", "\n", "# Use vec's fit_transform method on the desc field\n", "desc_tfidf = vec.fit_transform(ufo['desc'])\n", "\n", "# Look at the number of columns this creates\n", "print(desc_tfidf.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature selection and modeling\n", "- Redundant features\n", "- Text vectors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selecting the ideal dataset\n", "Let's get rid of some of the unnecessary features. Because we have an encoded `country` column, `country_enc`, keep it and drop other columns related to location: `city`, `country`, `lat`, `long`, `state`.\n", "\n", "We have columns related to `month` and `year`, so we don't need the `date` or `recorded` columns.\n", "\n", "We vectorized `desc`, so we don't need it anymore. 
For now we'll keep `type`.\n", "\n", "We'll keep `seconds_log` and drop `seconds` and `minutes`.\n", "\n", "Let's also get rid of the `length_of_time` column, which is unnecessary after extracting `minutes`." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# Add in the rest of the parameters\n", "def return_weights(vocab, original_vocab, vector, vector_index, top_n):\n", " zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))\n", " \n", " # Let's transform that zipped dict into a series\n", " zipped_series = pd.Series({vocab[i]:zipped[i] for i in vector[vector_index].indices})\n", " \n", " # Let's sort the series to pull out the top n weighted words\n", " zipped_index = zipped_series.sort_values(ascending=False)[:top_n].index\n", " return [original_vocab[i] for i in zipped_index]\n", "\n", "def words_to_filter(vocab, original_vocab, vector, top_n):\n", " filter_list = []\n", " for i in range(0, vector.shape[0]):\n", " # here we'll call the function from the previous exercise, \n", " # and extend the list we're creating\n", " filtered = return_weights(vocab, original_vocab, vector, i, top_n)\n", " filter_list.extend(filtered)\n", " # Return the list in a set, so we don't get duplicate word indices\n", " return set(filter_list)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "vocab_csv = pd.read_csv('./dataset/vocab_ufo.csv', index_col=0).to_dict()\n", "vocab = vocab_csv['0']" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " seconds seconds_log minutes\n", "seconds 1.000000 0.853371 0.980341\n", "seconds_log 0.853371 1.000000 0.824493\n", "minutes 0.980341 0.824493 1.000000\n" ] } ], "source": [ "# Check the correlation between the seconds, seconds_log, and minutes columns\n", "print(ufo[['seconds', 'seconds_log', 'minutes']].corr())\n", "\n", "# Make a list of features to drop\n", "to_drop = ['city', 'country', 'date', 'desc', 'lat', \n", " 'length_of_time', 'seconds', 'minutes', 'long', 'state', 'recorded']\n", "\n", "# Drop those features\n", "ufo_dropped = ufo.drop(to_drop, axis=1)\n", "\n", "# Let's also filter some words out of the text vector we created\n", "filtered_words = words_to_filter(vocab, vec.vocabulary_, desc_tfidf, top_n=4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Modeling the UFO dataset, part 1\n", "In this exercise, we're going to build a k-nearest neighbor model to predict which country the UFO sighting took place in. Our `X` dataset has the log-normalized seconds column, the one-hot encoded type columns, as well as the month and year when the sighting took place. The `y` labels are the encoded country column, where 1 is `us` and 0 is `ca`.\n", "\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
typeseconds_logcountry_encchangingchevroncigarcircleconecrosscylinder...lightotherovalrectanglesphereteardroptriangleunknownmonthyear
0triangle5.70378210000000...00000010112002
1light6.39693010000000...1000000062012
2light4.78749200000000...1000000062013
3light4.78749210000000...1000000042013
4sphere5.70378210000000...0000100092013
..................................................................
1861unknown7.90100710000000...0000000182002
1862oval5.70378210000000...0010000072013
1863changing5.19295711000000...00000000112008
1864circle5.19295710001000...0000000061998
1865other4.09434510000000...01000000122005
\n", "

1866 rows × 26 columns

\n", "
" ], "text/plain": [ " type seconds_log country_enc changing chevron cigar circle \\\n", "0 triangle 5.703782 1 0 0 0 0 \n", "1 light 6.396930 1 0 0 0 0 \n", "2 light 4.787492 0 0 0 0 0 \n", "3 light 4.787492 1 0 0 0 0 \n", "4 sphere 5.703782 1 0 0 0 0 \n", "... ... ... ... ... ... ... ... \n", "1861 unknown 7.901007 1 0 0 0 0 \n", "1862 oval 5.703782 1 0 0 0 0 \n", "1863 changing 5.192957 1 1 0 0 0 \n", "1864 circle 5.192957 1 0 0 0 1 \n", "1865 other 4.094345 1 0 0 0 0 \n", "\n", " cone cross cylinder ... light other oval rectangle sphere \\\n", "0 0 0 0 ... 0 0 0 0 0 \n", "1 0 0 0 ... 1 0 0 0 0 \n", "2 0 0 0 ... 1 0 0 0 0 \n", "3 0 0 0 ... 1 0 0 0 0 \n", "4 0 0 0 ... 0 0 0 0 1 \n", "... ... ... ... ... ... ... ... ... ... \n", "1861 0 0 0 ... 0 0 0 0 0 \n", "1862 0 0 0 ... 0 0 1 0 0 \n", "1863 0 0 0 ... 0 0 0 0 0 \n", "1864 0 0 0 ... 0 0 0 0 0 \n", "1865 0 0 0 ... 0 1 0 0 0 \n", "\n", " teardrop triangle unknown month year \n", "0 0 1 0 11 2002 \n", "1 0 0 0 6 2012 \n", "2 0 0 0 6 2013 \n", "3 0 0 0 4 2013 \n", "4 0 0 0 9 2013 \n", "... ... ... ... ... ... \n", "1861 0 0 1 8 2002 \n", "1862 0 0 0 7 2013 \n", "1863 0 0 0 11 2008 \n", "1864 0 0 0 6 1998 \n", "1865 0 0 0 12 2005 \n", "\n", "[1866 rows x 26 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ufo_dropped" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "X = ufo_dropped.drop(['type', 'country_enc'], axis=1)\n", "y = ufo_dropped['country_enc']" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Index(['seconds_log', 'changing', 'chevron', 'cigar', 'circle', 'cone',\n", " 'cross', 'cylinder', 'diamond', 'disk', 'egg', 'fireball', 'flash',\n", " 'formation', 'light', 'other', 'oval', 'rectangle', 'sphere',\n", " 'teardrop', 'triangle', 'unknown', 'month', 'year'],\n", " dtype='object')\n" ] } ], "source": [ "# Take a look at the features in the X set of data\n", "print(X.columns)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.8715203426124197\n" ] } ], "source": [ "from sklearn.model_selection import train_test_split\n", "from sklearn.neighbors import KNeighborsClassifier\n", "\n", "knn = KNeighborsClassifier()\n", "\n", "# Split the X and y sets using train_test_split, setting stratify=y\n", "train_X, test_X, train_y, test_y = train_test_split(X, y, stratify=y)\n", "\n", "# Fit knn to the training sets\n", "knn.fit(train_X, train_y)\n", "\n", "# Print the score of knn on the test sets\n", "print(knn.score(test_X, test_y))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Modeling the UFO dataset, part 2\n", "Finally, let's build a model using the text vector we created, `desc_tfidf`, using the `filtered_words` list to create a filtered text vector. Let's see if we can predict the `type` of the sighting based on the text. 
We'll use a Naive Bayes model for this.\n", "\n" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "y = ufo_dropped['type']" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.1670235546038544\n" ] } ], "source": [ "from sklearn.naive_bayes import GaussianNB\n", "\n", "nb = GaussianNB()\n", "\n", "# Use the list of filtered words we created to filter the text vector\n", "filtered_text = desc_tfidf[:, list(filtered_words)]\n", "\n", "# Split the X and y sets using train_test_split, setting stratify=y\n", "train_X, test_X, train_y, test_y = train_test_split(filtered_text.toarray(), y, stratify=y)\n", "\n", "# Fit nb to the training sets\n", "nb.fit(train_X, train_y)\n", "\n", "# Print the score of nb on the test sets\n", "print(nb.score(test_X, test_y))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, this model performs very poorly on this text data. This is a clear case where iteration would be necessary to figure out what subset of text improves the model, and if perhaps any of the other features are useful in predicting `type`." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }