{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Handling Categorical Data in Python"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### What does Nominal, Ordinal and Continuous features mean?\n",
    "\n",
    "Categorical features can only take on a limited, and usually fixed, number of possible values. For example, if a dataset is about information related to users, then you will typically find features like country, gender, age group, etc. Alternatively, if the data you're working with is related to products, you will find features like product type, manufacturer, seller and so on.\n",
    "\n",
    "These are all categorical features in your dataset. These features are typically stored as text values which represent various traits of the observations. For example, gender is described as Male (M) or Female (F), product type could be described as electronics, apparels, food etc.\n",
    "\n",
    "**Note that these type of features where the categories are only labeled without any order of precedence are called nominal features. Features which have some order associated with them are called ordinal features.** \n",
    "\n",
    "For example, a feature like economic status, with three categories: low, medium and high, which have an order associated with them.\n",
    "\n",
    "There are also continuous features. These are numeric variables that have an infinite number of values between any two values. A continuous variable can be numeric or a date/time.\n",
    "\n",
    "Regardless of what the value is used for, the challenge is determining how to use this data in the analysis because of the following constraints:\n",
    "\n",
    "- Categorical features may have a very large number of levels, known as high cardinality, (for example, cities or URLs), where most of the levels appear in a relatively small number of instances.\n",
    "\n",
    "- Many machine learning models, such as regression or SVM, are algebraic. This means that their input must be numerical. To use these models, categories must be transformed into numbers first, before you can apply the learning algorithm on them.\n",
    "\n",
    "- While some ML packages or libraries might transform categorical data to numeric automatically based on some default embedding method, many other ML packages don’t support such inputs.\n",
    "\n",
    "- For the machine, categorical data doesn’t contain the same context or information that humans can easily associate and understand. For example, when looking at a feature called ``City`` with three cities ``New York``, ``New Jersey`` and ``New Delhi``, humans can infer that ``New York`` is closely related to ``New Jersey`` as they are from same country, while ``New York`` and ``New Delhi`` are much different. But for the model, ``New York``, ``New Jersey`` and ``New Delhi``, are just three different levels (possible values) of the same feature ``City``. If you don’t specify the additional contextual information, it will be impossible for the model to differentiate between highly different levels.\n",
    "\n",
    "You therefore are faced with the challenge of figuring out how to turn these text values into numerical values for further processing and unmask lots of interesting information which these features might hide. Typically, any standard work-flow in feature engineering involves some form of transformation of these categorical values into numeric labels and then applying some encoding scheme on these values."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Encoding Categorical Data\n",
    "Existing Encoding Methods (all modules for which code is available, see: http://contrib.scikit-learn.org/categorical-encoding/_modules/index.html)\n",
    "- category_encoders.backward_difference\n",
    "- category_encoders.basen\n",
    "- category_encoders.binary\n",
    "- category_encoders.hashing\n",
    "- category_encoders.helmert\n",
    "- category_encoders.leave_one_out\n",
    "- category_encoders.one_hot\n",
    "- category_encoders.ordinal\n",
    "- category_encoders.polynomial\n",
    "- category_encoders.sum_coding\n",
    "- category_encoders.target_encoder\n",
    "\n",
    "\n",
    "The techniques that you'll cover are the following:\n",
    "\n",
    "* Replacing values\n",
    "* Encoding labels\n",
    "* One-Hot encoding\n",
    "* Binary encoding\n",
    "* Backward difference encoding\n",
    "* Polynomial encodings\n",
    "* Miscellaneous features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> ## 1. Replace Values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import copy\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>year</th>\n",
       "      <th>month</th>\n",
       "      <th>day</th>\n",
       "      <th>dep_time</th>\n",
       "      <th>dep_delay</th>\n",
       "      <th>arr_time</th>\n",
       "      <th>arr_delay</th>\n",
       "      <th>carrier</th>\n",
       "      <th>tailnum</th>\n",
       "      <th>flight</th>\n",
       "      <th>origin</th>\n",
       "      <th>dest</th>\n",
       "      <th>air_time</th>\n",
       "      <th>distance</th>\n",
       "      <th>hour</th>\n",
       "      <th>minute</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2014</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1.0</td>\n",
       "      <td>96.0</td>\n",
       "      <td>235.0</td>\n",
       "      <td>70.0</td>\n",
       "      <td>AS</td>\n",
       "      <td>N508AS</td>\n",
       "      <td>145</td>\n",
       "      <td>PDX</td>\n",
       "      <td>ANC</td>\n",
       "      <td>194.0</td>\n",
       "      <td>1542</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2014</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>4.0</td>\n",
       "      <td>-6.0</td>\n",
       "      <td>738.0</td>\n",
       "      <td>-23.0</td>\n",
       "      <td>US</td>\n",
       "      <td>N195UW</td>\n",
       "      <td>1830</td>\n",
       "      <td>SEA</td>\n",
       "      <td>CLT</td>\n",
       "      <td>252.0</td>\n",
       "      <td>2279</td>\n",
       "      <td>0.0</td>\n",
       "      <td>4.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2014</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>8.0</td>\n",
       "      <td>13.0</td>\n",
       "      <td>548.0</td>\n",
       "      <td>-4.0</td>\n",
       "      <td>UA</td>\n",
       "      <td>N37422</td>\n",
       "      <td>1609</td>\n",
       "      <td>PDX</td>\n",
       "      <td>IAH</td>\n",
       "      <td>201.0</td>\n",
       "      <td>1825</td>\n",
       "      <td>0.0</td>\n",
       "      <td>8.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2014</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>28.0</td>\n",
       "      <td>-2.0</td>\n",
       "      <td>800.0</td>\n",
       "      <td>-23.0</td>\n",
       "      <td>US</td>\n",
       "      <td>N547UW</td>\n",
       "      <td>466</td>\n",
       "      <td>PDX</td>\n",
       "      <td>CLT</td>\n",
       "      <td>251.0</td>\n",
       "      <td>2282</td>\n",
       "      <td>0.0</td>\n",
       "      <td>28.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2014</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>34.0</td>\n",
       "      <td>44.0</td>\n",
       "      <td>325.0</td>\n",
       "      <td>43.0</td>\n",
       "      <td>AS</td>\n",
       "      <td>N762AS</td>\n",
       "      <td>121</td>\n",
       "      <td>SEA</td>\n",
       "      <td>ANC</td>\n",
       "      <td>201.0</td>\n",
       "      <td>1448</td>\n",
       "      <td>0.0</td>\n",
       "      <td>34.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   year  month  day  dep_time  dep_delay  arr_time  arr_delay carrier tailnum  \\\n",
       "0  2014      1    1       1.0       96.0     235.0       70.0      AS  N508AS   \n",
       "1  2014      1    1       4.0       -6.0     738.0      -23.0      US  N195UW   \n",
       "2  2014      1    1       8.0       13.0     548.0       -4.0      UA  N37422   \n",
       "3  2014      1    1      28.0       -2.0     800.0      -23.0      US  N547UW   \n",
       "4  2014      1    1      34.0       44.0     325.0       43.0      AS  N762AS   \n",
       "\n",
       "   flight origin dest  air_time  distance  hour  minute  \n",
       "0     145    PDX  ANC     194.0      1542   0.0     1.0  \n",
       "1    1830    SEA  CLT     252.0      2279   0.0     4.0  \n",
       "2    1609    PDX  IAH     201.0      1825   0.0     8.0  \n",
       "3     466    PDX  CLT     251.0      2282   0.0    28.0  \n",
       "4     121    SEA  ANC     201.0      1448   0.0    34.0  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_flights = pd.read_csv('https://raw.githubusercontent.com/ismayc/pnwflights14/master/data/flights.csv')\n",
    "\n",
    "df_flights.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>carrier</th>\n",
       "      <th>tailnum</th>\n",
       "      <th>origin</th>\n",
       "      <th>dest</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AS</td>\n",
       "      <td>N508AS</td>\n",
       "      <td>PDX</td>\n",
       "      <td>ANC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>US</td>\n",
       "      <td>N195UW</td>\n",
       "      <td>SEA</td>\n",
       "      <td>CLT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>UA</td>\n",
       "      <td>N37422</td>\n",
       "      <td>PDX</td>\n",
       "      <td>IAH</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>US</td>\n",
       "      <td>N547UW</td>\n",
       "      <td>PDX</td>\n",
       "      <td>CLT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>AS</td>\n",
       "      <td>N762AS</td>\n",
       "      <td>SEA</td>\n",
       "      <td>ANC</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  carrier tailnum origin dest\n",
       "0      AS  N508AS    PDX  ANC\n",
       "1      US  N195UW    SEA  CLT\n",
       "2      UA  N37422    PDX  IAH\n",
       "3      US  N547UW    PDX  CLT\n",
       "4      AS  N762AS    SEA  ANC"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cat_df_flights = df_flights.select_dtypes(include=['object']).copy()\n",
    "cat_df_flights.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "248\n"
     ]
    }
   ],
   "source": [
    "print(cat_df_flights.isnull().values.sum())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0\n"
     ]
    }
   ],
   "source": [
    "cat_df_flights = cat_df_flights.fillna(cat_df_flights['tailnum'].value_counts().index[0])\n",
    "print(cat_df_flights.isnull().values.sum())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "replace_map = {'carrier': {'AA': 1, 'AS': 2, 'B6': 3, 'DL': 4, 'F9': 5, 'HA': 6, 'OO': 7 , 'UA': 8 , 'US': 9,'VX': 10,'WN': 11}}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'carrier': {'AA': 1, 'AS': 2, 'B6': 3, 'DL': 4, 'F9': 5, 'HA': 6, 'OO': 7, 'UA': 8, 'US': 9, 'VX': 10, 'WN': 11}}\n"
     ]
    }
   ],
   "source": [
    "labels = cat_df_flights['carrier'].astype('category').cat.categories.tolist()\n",
    "replace_map_comp = {'carrier' : {k: v for k,v in zip(labels,list(range(1,len(labels)+1)))}}\n",
    "\n",
    "print(replace_map_comp)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "cat_df_flights_replace = cat_df_flights.copy()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   carrier tailnum origin dest\n",
      "0        2  N508AS    PDX  ANC\n",
      "1        9  N195UW    SEA  CLT\n",
      "2        8  N37422    PDX  IAH\n",
      "3        9  N547UW    PDX  CLT\n",
      "4        2  N762AS    SEA  ANC\n"
     ]
    }
   ],
   "source": [
    "cat_df_flights_replace.replace(replace_map_comp, inplace=True)\n",
    "\n",
    "print(cat_df_flights_replace.head())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "int64\n"
     ]
    }
   ],
   "source": [
    "print(cat_df_flights_replace['carrier'].dtypes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "carrier    category\n",
      "tailnum      object\n",
      "origin     category\n",
      "dest         object\n",
      "dtype: object\n"
     ]
    }
   ],
   "source": [
    "cat_df_flights_lc = cat_df_flights.copy()\n",
    "cat_df_flights_lc['carrier'] = cat_df_flights_lc['carrier'].astype('category')\n",
    "cat_df_flights_lc['origin'] = cat_df_flights_lc['origin'].astype('category')                                                              \n",
    "\n",
    "print(cat_df_flights_lc.dtypes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "31.3 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
     ]
    }
   ],
   "source": [
    "import time\n",
    "%timeit cat_df_flights.groupby(['origin','carrier']).count() #DataFrame with object dtype columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "21.6 ms ± 294 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit cat_df_flights_lc.groupby(['origin','carrier']).count() #DataFrame with category dtype columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> ## 2. Label Encoding"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>carrier</th>\n",
       "      <th>tailnum</th>\n",
       "      <th>origin</th>\n",
       "      <th>dest</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AS</td>\n",
       "      <td>N508AS</td>\n",
       "      <td>PDX</td>\n",
       "      <td>ANC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>US</td>\n",
       "      <td>N195UW</td>\n",
       "      <td>SEA</td>\n",
       "      <td>CLT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>UA</td>\n",
       "      <td>N37422</td>\n",
       "      <td>PDX</td>\n",
       "      <td>IAH</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>US</td>\n",
       "      <td>N547UW</td>\n",
       "      <td>PDX</td>\n",
       "      <td>CLT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>AS</td>\n",
       "      <td>N762AS</td>\n",
       "      <td>SEA</td>\n",
       "      <td>ANC</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  carrier tailnum origin dest\n",
       "0      AS  N508AS    PDX  ANC\n",
       "1      US  N195UW    SEA  CLT\n",
       "2      UA  N37422    PDX  IAH\n",
       "3      US  N547UW    PDX  CLT\n",
       "4      AS  N762AS    SEA  ANC"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cat_df_flights_lc.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "cat_df_flights_lc['carrier'] = cat_df_flights_lc['carrier'].cat.codes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>carrier</th>\n",
       "      <th>tailnum</th>\n",
       "      <th>origin</th>\n",
       "      <th>dest</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>N508AS</td>\n",
       "      <td>PDX</td>\n",
       "      <td>ANC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>8</td>\n",
       "      <td>N195UW</td>\n",
       "      <td>SEA</td>\n",
       "      <td>CLT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>7</td>\n",
       "      <td>N37422</td>\n",
       "      <td>PDX</td>\n",
       "      <td>IAH</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>8</td>\n",
       "      <td>N547UW</td>\n",
       "      <td>PDX</td>\n",
       "      <td>CLT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>N762AS</td>\n",
       "      <td>SEA</td>\n",
       "      <td>ANC</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   carrier tailnum origin dest\n",
       "0        1  N508AS    PDX  ANC\n",
       "1        8  N195UW    SEA  CLT\n",
       "2        7  N37422    PDX  IAH\n",
       "3        8  N547UW    PDX  CLT\n",
       "4        1  N762AS    SEA  ANC"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cat_df_flights_lc.head() #alphabetically labeled from 0 to 10"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>carrier</th>\n",
       "      <th>tailnum</th>\n",
       "      <th>origin</th>\n",
       "      <th>dest</th>\n",
       "      <th>US_code</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AS</td>\n",
       "      <td>N508AS</td>\n",
       "      <td>PDX</td>\n",
       "      <td>ANC</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>US</td>\n",
       "      <td>N195UW</td>\n",
       "      <td>SEA</td>\n",
       "      <td>CLT</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>UA</td>\n",
       "      <td>N37422</td>\n",
       "      <td>PDX</td>\n",
       "      <td>IAH</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>US</td>\n",
       "      <td>N547UW</td>\n",
       "      <td>PDX</td>\n",
       "      <td>CLT</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>AS</td>\n",
       "      <td>N762AS</td>\n",
       "      <td>SEA</td>\n",
       "      <td>ANC</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  carrier tailnum origin dest  US_code\n",
       "0      AS  N508AS    PDX  ANC        0\n",
       "1      US  N195UW    SEA  CLT        1\n",
       "2      UA  N37422    PDX  IAH        0\n",
       "3      US  N547UW    PDX  CLT        1\n",
       "4      AS  N762AS    SEA  ANC        0"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cat_df_flights_specific = cat_df_flights.copy()\n",
    "cat_df_flights_specific['US_code'] = np.where(cat_df_flights_specific['carrier'].str.contains('US'), 1, 0)\n",
    "\n",
    "cat_df_flights_specific.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>carrier</th>\n",
       "      <th>tailnum</th>\n",
       "      <th>origin</th>\n",
       "      <th>dest</th>\n",
       "      <th>carrier_code</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AS</td>\n",
       "      <td>N508AS</td>\n",
       "      <td>PDX</td>\n",
       "      <td>ANC</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>US</td>\n",
       "      <td>N195UW</td>\n",
       "      <td>SEA</td>\n",
       "      <td>CLT</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>UA</td>\n",
       "      <td>N37422</td>\n",
       "      <td>PDX</td>\n",
       "      <td>IAH</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>US</td>\n",
       "      <td>N547UW</td>\n",
       "      <td>PDX</td>\n",
       "      <td>CLT</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>AS</td>\n",
       "      <td>N762AS</td>\n",
       "      <td>SEA</td>\n",
       "      <td>ANC</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  carrier tailnum origin dest  carrier_code\n",
       "0      AS  N508AS    PDX  ANC             1\n",
       "1      US  N195UW    SEA  CLT             8\n",
       "2      UA  N37422    PDX  IAH             7\n",
       "3      US  N547UW    PDX  CLT             8\n",
       "4      AS  N762AS    SEA  ANC             1"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cat_df_flights_sklearn = cat_df_flights.copy()\n",
    "\n",
    "from sklearn.preprocessing import LabelEncoder\n",
    "\n",
    "lb_make = LabelEncoder()\n",
    "cat_df_flights_sklearn['carrier_code'] = lb_make.fit_transform(cat_df_flights['carrier'])\n",
    "\n",
    "cat_df_flights_sklearn.head() #Results in appending a new column to df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Label encoding is pretty much intuitive and straight-forward and may give you a good performance from your learning algorithm, but it has as disadvantage that the numerical values can be misinterpreted by the algorithm. Should the carrier US (encoded to 8) be given 8x more weight than the carrier AS (encoded to 1) ?**\n",
    "\n",
    "**To solve this issue there is another popular way to encode the categories via something called one-hot encoding.**\n",
    "\n",
    "> ## 3. One-Hot encoding"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>carrier</th>\n",
       "      <th>tailnum</th>\n",
       "      <th>origin</th>\n",
       "      <th>dest</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AS</td>\n",
       "      <td>N508AS</td>\n",
       "      <td>PDX</td>\n",
       "      <td>ANC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>US</td>\n",
       "      <td>N195UW</td>\n",
       "      <td>SEA</td>\n",
       "      <td>CLT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>UA</td>\n",
       "      <td>N37422</td>\n",
       "      <td>PDX</td>\n",
       "      <td>IAH</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>US</td>\n",
       "      <td>N547UW</td>\n",
       "      <td>PDX</td>\n",
       "      <td>CLT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>AS</td>\n",
       "      <td>N762AS</td>\n",
       "      <td>SEA</td>\n",
       "      <td>ANC</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  carrier tailnum origin dest\n",
       "0      AS  N508AS    PDX  ANC\n",
       "1      US  N195UW    SEA  CLT\n",
       "2      UA  N37422    PDX  IAH\n",
       "3      US  N547UW    PDX  CLT\n",
       "4      AS  N762AS    SEA  ANC"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cat_df_flights.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>tailnum</th>\n",
       "      <th>origin</th>\n",
       "      <th>dest</th>\n",
       "      <th>carrier_AA</th>\n",
       "      <th>carrier_AS</th>\n",
       "      <th>carrier_B6</th>\n",
       "      <th>carrier_DL</th>\n",
       "      <th>carrier_F9</th>\n",
       "      <th>carrier_HA</th>\n",
       "      <th>carrier_OO</th>\n",
       "      <th>carrier_UA</th>\n",
       "      <th>carrier_US</th>\n",
       "      <th>carrier_VX</th>\n",
       "      <th>carrier_WN</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>N508AS</td>\n",
       "      <td>PDX</td>\n",
       "      <td>ANC</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>N195UW</td>\n",
       "      <td>SEA</td>\n",
       "      <td>CLT</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>N37422</td>\n",
       "      <td>PDX</td>\n",
       "      <td>IAH</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>N547UW</td>\n",
       "      <td>PDX</td>\n",
       "      <td>CLT</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>N762AS</td>\n",
       "      <td>SEA</td>\n",
       "      <td>ANC</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  tailnum origin dest  carrier_AA  carrier_AS  carrier_B6  carrier_DL  \\\n",
       "0  N508AS    PDX  ANC           0           1           0           0   \n",
       "1  N195UW    SEA  CLT           0           0           0           0   \n",
       "2  N37422    PDX  IAH           0           0           0           0   \n",
       "3  N547UW    PDX  CLT           0           0           0           0   \n",
       "4  N762AS    SEA  ANC           0           1           0           0   \n",
       "\n",
       "   carrier_F9  carrier_HA  carrier_OO  carrier_UA  carrier_US  carrier_VX  \\\n",
       "0           0           0           0           0           0           0   \n",
       "1           0           0           0           0           1           0   \n",
       "2           0           0           0           1           0           0   \n",
       "3           0           0           0           0           1           0   \n",
       "4           0           0           0           0           0           0   \n",
       "\n",
       "   carrier_WN  \n",
       "0           0  \n",
       "1           0  \n",
       "2           0  \n",
       "3           0  \n",
       "4           0  "
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cat_df_flights_onehot = cat_df_flights.copy()\n",
    "cat_df_flights_onehot = pd.get_dummies(cat_df_flights_onehot, columns=['carrier'], prefix = ['carrier'])\n",
    "\n",
    "cat_df_flights_onehot.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>AA</th>\n",
       "      <th>AS</th>\n",
       "      <th>B6</th>\n",
       "      <th>DL</th>\n",
       "      <th>F9</th>\n",
       "      <th>HA</th>\n",
       "      <th>OO</th>\n",
       "      <th>UA</th>\n",
       "      <th>US</th>\n",
       "      <th>VX</th>\n",
       "      <th>WN</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   AA  AS  B6  DL  F9  HA  OO  UA  US  VX  WN\n",
       "0   0   1   0   0   0   0   0   0   0   0   0\n",
       "1   0   0   0   0   0   0   0   0   1   0   0\n",
       "2   0   0   0   0   0   0   0   1   0   0   0\n",
       "3   0   0   0   0   0   0   0   0   1   0   0\n",
       "4   0   1   0   0   0   0   0   0   0   0   0"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cat_df_flights_onehot_sklearn = cat_df_flights.copy()\n",
    "\n",
    "from sklearn.preprocessing import LabelBinarizer\n",
    "\n",
    "lb = LabelBinarizer()\n",
    "lb_results = lb.fit_transform(cat_df_flights_onehot_sklearn['carrier'])\n",
    "lb_results_df = pd.DataFrame(lb_results, columns=lb.classes_)\n",
    "\n",
    "lb_results_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>carrier</th>\n",
       "      <th>tailnum</th>\n",
       "      <th>origin</th>\n",
       "      <th>dest</th>\n",
       "      <th>AA</th>\n",
       "      <th>AS</th>\n",
       "      <th>B6</th>\n",
       "      <th>DL</th>\n",
       "      <th>F9</th>\n",
       "      <th>HA</th>\n",
       "      <th>OO</th>\n",
       "      <th>UA</th>\n",
       "      <th>US</th>\n",
       "      <th>VX</th>\n",
       "      <th>WN</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AS</td>\n",
       "      <td>N508AS</td>\n",
       "      <td>PDX</td>\n",
       "      <td>ANC</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>US</td>\n",
       "      <td>N195UW</td>\n",
       "      <td>SEA</td>\n",
       "      <td>CLT</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>UA</td>\n",
       "      <td>N37422</td>\n",
       "      <td>PDX</td>\n",
       "      <td>IAH</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>US</td>\n",
       "      <td>N547UW</td>\n",
       "      <td>PDX</td>\n",
       "      <td>CLT</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>AS</td>\n",
       "      <td>N762AS</td>\n",
       "      <td>SEA</td>\n",
       "      <td>ANC</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  carrier tailnum origin dest  AA  AS  B6  DL  F9  HA  OO  UA  US  VX  WN\n",
       "0      AS  N508AS    PDX  ANC   0   1   0   0   0   0   0   0   0   0   0\n",
       "1      US  N195UW    SEA  CLT   0   0   0   0   0   0   0   0   1   0   0\n",
       "2      UA  N37422    PDX  IAH   0   0   0   0   0   0   0   1   0   0   0\n",
       "3      US  N547UW    PDX  CLT   0   0   0   0   0   0   0   0   1   0   0\n",
       "4      AS  N762AS    SEA  ANC   0   1   0   0   0   0   0   0   0   0   0"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result_df = pd.concat([cat_df_flights_onehot_sklearn, lb_results_df], axis=1)\n",
    "\n",
    "result_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> ## 4. Binary Encoding"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>carrier_0</th>\n",
       "      <th>carrier_1</th>\n",
       "      <th>carrier_2</th>\n",
       "      <th>carrier_3</th>\n",
       "      <th>carrier_4</th>\n",
       "      <th>tailnum</th>\n",
       "      <th>origin</th>\n",
       "      <th>dest</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>N508AS</td>\n",
       "      <td>PDX</td>\n",
       "      <td>ANC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>N195UW</td>\n",
       "      <td>SEA</td>\n",
       "      <td>CLT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>N37422</td>\n",
       "      <td>PDX</td>\n",
       "      <td>IAH</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>N547UW</td>\n",
       "      <td>PDX</td>\n",
       "      <td>CLT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>N762AS</td>\n",
       "      <td>SEA</td>\n",
       "      <td>ANC</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   carrier_0  carrier_1  carrier_2  carrier_3  carrier_4 tailnum origin dest\n",
       "0          0          0          0          0          1  N508AS    PDX  ANC\n",
       "1          0          0          0          1          0  N195UW    SEA  CLT\n",
       "2          0          0          0          1          1  N37422    PDX  IAH\n",
       "3          0          0          0          1          0  N547UW    PDX  CLT\n",
       "4          0          0          0          0          1  N762AS    SEA  ANC"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cat_df_flights_ce = cat_df_flights.copy()\n",
    "\n",
    "import category_encoders as ce\n",
    "\n",
    "encoder = ce.BinaryEncoder(cols=['carrier'])\n",
    "df_binary = encoder.fit_transform(cat_df_flights_ce)\n",
    "\n",
    "df_binary.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>carrier_0</th>\n",
       "      <th>carrier_1</th>\n",
       "      <th>carrier_2</th>\n",
       "      <th>carrier_3</th>\n",
       "      <th>carrier_4</th>\n",
       "      <th>tailnum</th>\n",
       "      <th>origin</th>\n",
       "      <th>dest</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: [carrier_0, carrier_1, carrier_2, carrier_3, carrier_4, tailnum, origin, dest]\n",
       "Index: []"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_binary[df_binary['carrier_0']==1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> ## 5. Backward Difference Encoding"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>col_carrier_0</th>\n",
       "      <th>col_carrier_1</th>\n",
       "      <th>col_carrier_2</th>\n",
       "      <th>col_carrier_3</th>\n",
       "      <th>col_carrier_4</th>\n",
       "      <th>col_carrier_5</th>\n",
       "      <th>col_carrier_6</th>\n",
       "      <th>col_carrier_7</th>\n",
       "      <th>col_carrier_8</th>\n",
       "      <th>col_carrier_9</th>\n",
       "      <th>col_carrier_10</th>\n",
       "      <th>col_tailnum</th>\n",
       "      <th>col_origin</th>\n",
       "      <th>col_dest</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>-0.909091</td>\n",
       "      <td>-0.818182</td>\n",
       "      <td>-0.727273</td>\n",
       "      <td>-0.636364</td>\n",
       "      <td>-0.545455</td>\n",
       "      <td>-0.454545</td>\n",
       "      <td>-0.363636</td>\n",
       "      <td>-0.272727</td>\n",
       "      <td>-0.181818</td>\n",
       "      <td>-0.090909</td>\n",
       "      <td>N508AS</td>\n",
       "      <td>PDX</td>\n",
       "      <td>ANC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.090909</td>\n",
       "      <td>-0.818182</td>\n",
       "      <td>-0.727273</td>\n",
       "      <td>-0.636364</td>\n",
       "      <td>-0.545455</td>\n",
       "      <td>-0.454545</td>\n",
       "      <td>-0.363636</td>\n",
       "      <td>-0.272727</td>\n",
       "      <td>-0.181818</td>\n",
       "      <td>-0.090909</td>\n",
       "      <td>N195UW</td>\n",
       "      <td>SEA</td>\n",
       "      <td>CLT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.090909</td>\n",
       "      <td>0.181818</td>\n",
       "      <td>-0.727273</td>\n",
       "      <td>-0.636364</td>\n",
       "      <td>-0.545455</td>\n",
       "      <td>-0.454545</td>\n",
       "      <td>-0.363636</td>\n",
       "      <td>-0.272727</td>\n",
       "      <td>-0.181818</td>\n",
       "      <td>-0.090909</td>\n",
       "      <td>N37422</td>\n",
       "      <td>PDX</td>\n",
       "      <td>IAH</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.090909</td>\n",
       "      <td>-0.818182</td>\n",
       "      <td>-0.727273</td>\n",
       "      <td>-0.636364</td>\n",
       "      <td>-0.545455</td>\n",
       "      <td>-0.454545</td>\n",
       "      <td>-0.363636</td>\n",
       "      <td>-0.272727</td>\n",
       "      <td>-0.181818</td>\n",
       "      <td>-0.090909</td>\n",
       "      <td>N547UW</td>\n",
       "      <td>PDX</td>\n",
       "      <td>CLT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1.0</td>\n",
       "      <td>-0.909091</td>\n",
       "      <td>-0.818182</td>\n",
       "      <td>-0.727273</td>\n",
       "      <td>-0.636364</td>\n",
       "      <td>-0.545455</td>\n",
       "      <td>-0.454545</td>\n",
       "      <td>-0.363636</td>\n",
       "      <td>-0.272727</td>\n",
       "      <td>-0.181818</td>\n",
       "      <td>-0.090909</td>\n",
       "      <td>N762AS</td>\n",
       "      <td>SEA</td>\n",
       "      <td>ANC</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   col_carrier_0  col_carrier_1  col_carrier_2  col_carrier_3  col_carrier_4  \\\n",
       "0            1.0      -0.909091      -0.818182      -0.727273      -0.636364   \n",
       "1            1.0       0.090909      -0.818182      -0.727273      -0.636364   \n",
       "2            1.0       0.090909       0.181818      -0.727273      -0.636364   \n",
       "3            1.0       0.090909      -0.818182      -0.727273      -0.636364   \n",
       "4            1.0      -0.909091      -0.818182      -0.727273      -0.636364   \n",
       "\n",
       "   col_carrier_5  col_carrier_6  col_carrier_7  col_carrier_8  col_carrier_9  \\\n",
       "0      -0.545455      -0.454545      -0.363636      -0.272727      -0.181818   \n",
       "1      -0.545455      -0.454545      -0.363636      -0.272727      -0.181818   \n",
       "2      -0.545455      -0.454545      -0.363636      -0.272727      -0.181818   \n",
       "3      -0.545455      -0.454545      -0.363636      -0.272727      -0.181818   \n",
       "4      -0.545455      -0.454545      -0.363636      -0.272727      -0.181818   \n",
       "\n",
       "   col_carrier_10 col_tailnum col_origin col_dest  \n",
       "0       -0.090909      N508AS        PDX      ANC  \n",
       "1       -0.090909      N195UW        SEA      CLT  \n",
       "2       -0.090909      N37422        PDX      IAH  \n",
       "3       -0.090909      N547UW        PDX      CLT  \n",
       "4       -0.090909      N762AS        SEA      ANC  "
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "encoder = ce.BackwardDifferenceEncoder(cols=['carrier'])\n",
    "df_bd = encoder.fit_transform(cat_df_flights_ce)\n",
    "\n",
    "df_bd.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([-0.90909091,  0.09090909])"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.unique(df_bd['col_carrier_1'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> ## 6. Polynomial Encoding"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>col_carrier_0</th>\n",
       "      <th>col_carrier_1</th>\n",
       "      <th>col_carrier_2</th>\n",
       "      <th>col_carrier_3</th>\n",
       "      <th>col_carrier_4</th>\n",
       "      <th>col_carrier_5</th>\n",
       "      <th>col_carrier_6</th>\n",
       "      <th>col_carrier_7</th>\n",
       "      <th>col_carrier_8</th>\n",
       "      <th>col_carrier_9</th>\n",
       "      <th>col_carrier_10</th>\n",
       "      <th>col_tailnum</th>\n",
       "      <th>col_origin</th>\n",
       "      <th>col_dest</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>-0.476731</td>\n",
       "      <td>0.512092</td>\n",
       "      <td>-0.458029</td>\n",
       "      <td>0.354787</td>\n",
       "      <td>-0.240192</td>\n",
       "      <td>0.141610</td>\n",
       "      <td>-0.071707</td>\n",
       "      <td>0.030334</td>\n",
       "      <td>-0.010141</td>\n",
       "      <td>0.002326</td>\n",
       "      <td>N508AS</td>\n",
       "      <td>PDX</td>\n",
       "      <td>ANC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.0</td>\n",
       "      <td>-0.381385</td>\n",
       "      <td>0.204837</td>\n",
       "      <td>0.091606</td>\n",
       "      <td>-0.354787</td>\n",
       "      <td>0.480384</td>\n",
       "      <td>-0.453153</td>\n",
       "      <td>0.329853</td>\n",
       "      <td>-0.188069</td>\n",
       "      <td>0.081127</td>\n",
       "      <td>-0.023265</td>\n",
       "      <td>N195UW</td>\n",
       "      <td>SEA</td>\n",
       "      <td>CLT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1.0</td>\n",
       "      <td>-0.286039</td>\n",
       "      <td>-0.034139</td>\n",
       "      <td>0.335888</td>\n",
       "      <td>-0.354787</td>\n",
       "      <td>0.080064</td>\n",
       "      <td>0.273780</td>\n",
       "      <td>-0.473267</td>\n",
       "      <td>0.442872</td>\n",
       "      <td>-0.273805</td>\n",
       "      <td>0.104692</td>\n",
       "      <td>N37422</td>\n",
       "      <td>PDX</td>\n",
       "      <td>IAH</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1.0</td>\n",
       "      <td>-0.381385</td>\n",
       "      <td>0.204837</td>\n",
       "      <td>0.091606</td>\n",
       "      <td>-0.354787</td>\n",
       "      <td>0.480384</td>\n",
       "      <td>-0.453153</td>\n",
       "      <td>0.329853</td>\n",
       "      <td>-0.188069</td>\n",
       "      <td>0.081127</td>\n",
       "      <td>-0.023265</td>\n",
       "      <td>N547UW</td>\n",
       "      <td>PDX</td>\n",
       "      <td>CLT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1.0</td>\n",
       "      <td>-0.476731</td>\n",
       "      <td>0.512092</td>\n",
       "      <td>-0.458029</td>\n",
       "      <td>0.354787</td>\n",
       "      <td>-0.240192</td>\n",
       "      <td>0.141610</td>\n",
       "      <td>-0.071707</td>\n",
       "      <td>0.030334</td>\n",
       "      <td>-0.010141</td>\n",
       "      <td>0.002326</td>\n",
       "      <td>N762AS</td>\n",
       "      <td>SEA</td>\n",
       "      <td>ANC</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   col_carrier_0  col_carrier_1  col_carrier_2  col_carrier_3  col_carrier_4  \\\n",
       "0            1.0      -0.476731       0.512092      -0.458029       0.354787   \n",
       "1            1.0      -0.381385       0.204837       0.091606      -0.354787   \n",
       "2            1.0      -0.286039      -0.034139       0.335888      -0.354787   \n",
       "3            1.0      -0.381385       0.204837       0.091606      -0.354787   \n",
       "4            1.0      -0.476731       0.512092      -0.458029       0.354787   \n",
       "\n",
       "   col_carrier_5  col_carrier_6  col_carrier_7  col_carrier_8  col_carrier_9  \\\n",
       "0      -0.240192       0.141610      -0.071707       0.030334      -0.010141   \n",
       "1       0.480384      -0.453153       0.329853      -0.188069       0.081127   \n",
       "2       0.080064       0.273780      -0.473267       0.442872      -0.273805   \n",
       "3       0.480384      -0.453153       0.329853      -0.188069       0.081127   \n",
       "4      -0.240192       0.141610      -0.071707       0.030334      -0.010141   \n",
       "\n",
       "   col_carrier_10 col_tailnum col_origin col_dest  \n",
       "0        0.002326      N508AS        PDX      ANC  \n",
       "1       -0.023265      N195UW        SEA      CLT  \n",
       "2        0.104692      N37422        PDX      IAH  \n",
       "3       -0.023265      N547UW        PDX      CLT  \n",
       "4        0.002326      N762AS        SEA      ANC  "
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "encoder = ce.PolynomialEncoder(cols=['carrier'])\n",
    "df_bd = encoder.fit_transform(cat_df_flights_ce)\n",
    "\n",
    "df_bd.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([-4.76731295e-01, -3.81385036e-01, -2.86038777e-01, -1.90692518e-01,\n",
       "       -9.53462589e-02, -1.16688970e-17,  9.53462589e-02,  1.90692518e-01,\n",
       "        2.86038777e-01,  3.81385036e-01,  4.76731295e-01])"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.unique(df_bd['col_carrier_1'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> ## 7. Miscellaneous Features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>start</th>\n",
       "      <th>end</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0-20</td>\n",
       "      <td>0</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>20-40</td>\n",
       "      <td>20</td>\n",
       "      <td>40</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>40-60</td>\n",
       "      <td>40</td>\n",
       "      <td>60</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>60-80</td>\n",
       "      <td>60</td>\n",
       "      <td>80</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     age start end\n",
       "0   0-20     0  20\n",
       "1  20-40    20  40\n",
       "2  40-60    40  60\n",
       "3  60-80    60  80"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dummy_df_age = pd.DataFrame({'age': ['0-20', '20-40', '40-60','60-80']})\n",
    "dummy_df_age['start'], dummy_df_age['end'] = zip(*dummy_df_age['age'].map(lambda x: x.split('-')))\n",
    "\n",
    "dummy_df_age.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>age_mean</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0-20</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>20-40</td>\n",
       "      <td>30.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>40-60</td>\n",
       "      <td>50.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>60-80</td>\n",
       "      <td>70.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     age  age_mean\n",
       "0   0-20      10.0\n",
       "1  20-40      30.0\n",
       "2  40-60      50.0\n",
       "3  60-80      70.0"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dummy_df_age = pd.DataFrame({'age': ['0-20', '20-40', '40-60','60-80']})\n",
    "\n",
    "def split_mean(x):\n",
    "    split_list = x.split('-')\n",
    "    mean = (float(split_list[0])+float(split_list[1]))/2\n",
    "    return mean\n",
    "\n",
    "dummy_df_age['age_mean'] = dummy_df_age['age'].apply(lambda x: split_mean(x))\n",
    "\n",
    "dummy_df_age.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Dealing with Categorical Features in Big Data with Spark"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- The first step in Spark programming is to create a SparkContext. SparkContext is required when you want to execute operations in a cluster. SparkContext tells Spark how and where to access a cluster. You'll start by importing SparkContext.\n",
    "\n",
    "- To start working with Spark DataFrames, you first have to create a SparkSession object from your SparkContext."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> #### 1st way"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#import findspark\n",
    "\n",
    "#findspark.init()\n",
    "\n",
    "#import pyspark\n",
    "\n",
    "#confspark = pyspark.SparkConf().setMaster(\"local[*]\").set(\"spark.cores.max\", \"4\").set(\"spark.executor.memory\", \"2G\").setAppName(\"--test--\")\n",
    "\n",
    "#sc = pyspark.SparkContext(conf=confspark)\n",
    "\n",
    "#sc._conf.getAll()\n",
    "\n",
    "#from pyspark.sql import SparkSession \n",
    "\n",
    "#spark = SparkSession(sc) \n",
    "\n",
    "#sc.stop()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> #### 2nd way"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "import findspark\n",
    "findspark.init()\n",
    "\n",
    "import pyspark\n",
    "from pyspark.sql import SparkSession \n",
    "\n",
    "#confspark = pyspark.SparkConf().setMaster(\"local[4]\").set(\"spark.cores.max\", \"4\").set(\"spark.executor.memory\", \"2G\").setAppName(\"--test--\")\n",
    "#spark = SparkSession.builder.config(conf=confspark).getOrCreate()\n",
    "\n",
    "spark = SparkSession.builder.master(\"local[*]\").appName(\"--test--\").config(\"spark.some.config.option\", \"some-value\").getOrCreate()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'2.3.1'"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "spark.version"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[]"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "spark.catalog.listTables()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+\n",
      "|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|\n",
      "+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+\n",
      "|2014|    1|  1|       1|       96|     235|       70|     AS| N508AS|   145|   PDX| ANC|     194|    1542|   0|     1|\n",
      "|2014|    1|  1|       4|       -6|     738|      -23|     US| N195UW|  1830|   SEA| CLT|     252|    2279|   0|     4|\n",
      "|2014|    1|  1|       8|       13|     548|       -4|     UA| N37422|  1609|   PDX| IAH|     201|    1825|   0|     8|\n",
      "+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+\n",
      "only showing top 3 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "spark_flights = spark.read.format(\"csv\").option('header',True).load('data/flights.csv',inferSchema=True)\n",
    "spark_flights.show(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "root\n",
      " |-- year: integer (nullable = true)\n",
      " |-- month: integer (nullable = true)\n",
      " |-- day: integer (nullable = true)\n",
      " |-- dep_time: string (nullable = true)\n",
      " |-- dep_delay: string (nullable = true)\n",
      " |-- arr_time: string (nullable = true)\n",
      " |-- arr_delay: string (nullable = true)\n",
      " |-- carrier: string (nullable = true)\n",
      " |-- tailnum: string (nullable = true)\n",
      " |-- flight: integer (nullable = true)\n",
      " |-- origin: string (nullable = true)\n",
      " |-- dest: string (nullable = true)\n",
      " |-- air_time: string (nullable = true)\n",
      " |-- distance: integer (nullable = true)\n",
      " |-- hour: string (nullable = true)\n",
      " |-- minute: string (nullable = true)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "spark_flights.printSchema()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[]"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "spark.catalog.listTables()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
    "spark_flights.createOrReplaceTempView(\"flights_temp\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Table(name='flights_temp', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "spark.catalog.listTables()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------+\n",
      "|carrier|\n",
      "+-------+\n",
      "|     AS|\n",
      "|     US|\n",
      "|     UA|\n",
      "|     US|\n",
      "|     AS|\n",
      "+-------+\n",
      "only showing top 5 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "carrier_df = spark_flights.select(\"carrier\")\n",
    "carrier_df.show(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> ## StringIndexer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------+-------------+\n",
      "|carrier|carrier_index|\n",
      "+-------+-------------+\n",
      "|     AS|          0.0|\n",
      "|     US|          6.0|\n",
      "|     UA|          4.0|\n",
      "|     US|          6.0|\n",
      "|     AS|          0.0|\n",
      "|     DL|          3.0|\n",
      "|     UA|          4.0|\n",
      "+-------+-------------+\n",
      "only showing top 7 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from pyspark.ml.feature import StringIndexer\n",
    "\n",
    "carr_indexer = StringIndexer(inputCol=\"carrier\",outputCol=\"carrier_index\")\n",
    "carr_indexed = carr_indexer.fit(carrier_df).transform(carrier_df)\n",
    "\n",
    "carr_indexed.show(7)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> ## OneHotEncoder"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------+-------------+--------------+\n",
      "|carrier|carrier_index|   carrier_vec|\n",
      "+-------+-------------+--------------+\n",
      "|     AS|          0.0|(11,[0],[1.0])|\n",
      "|     US|          6.0|(11,[6],[1.0])|\n",
      "|     UA|          4.0|(11,[4],[1.0])|\n",
      "|     US|          6.0|(11,[6],[1.0])|\n",
      "|     AS|          0.0|(11,[0],[1.0])|\n",
      "|     DL|          3.0|(11,[3],[1.0])|\n",
      "|     UA|          4.0|(11,[4],[1.0])|\n",
      "+-------+-------------+--------------+\n",
      "only showing top 7 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "carrier_df_onehot = spark_flights.select(\"carrier\")\n",
    "\n",
    "from pyspark.ml.feature import OneHotEncoder, StringIndexer\n",
    "\n",
    "stringIndexer = StringIndexer(inputCol=\"carrier\", outputCol=\"carrier_index\")\n",
    "model = stringIndexer.fit(carrier_df_onehot)\n",
    "indexed = model.transform(carrier_df_onehot)\n",
    "\n",
    "encoder = OneHotEncoder(dropLast=False, inputCol=\"carrier_index\", outputCol=\"carrier_vec\")\n",
    "encoded = encoder.transform(indexed)\n",
    "\n",
    "encoded.show(7)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> ## Example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+---+--------+-------------+-------------+\n",
      "| id|category|categoryIndex|  categoryVec|\n",
      "+---+--------+-------------+-------------+\n",
      "|  0|       a|          0.0|(2,[0],[1.0])|\n",
      "|  1|       b|          2.0|    (2,[],[])|\n",
      "|  2|       c|          1.0|(2,[1],[1.0])|\n",
      "|  3|       a|          0.0|(2,[0],[1.0])|\n",
      "|  4|       a|          0.0|(2,[0],[1.0])|\n",
      "|  5|       c|          1.0|(2,[1],[1.0])|\n",
      "+---+--------+-------------+-------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from pyspark.ml.feature import OneHotEncoder, StringIndexer\n",
    "\n",
    "df1 = spark.createDataFrame([\n",
    "    (0, \"a\"),\n",
    "    (1, \"b\"),\n",
    "    (2, \"c\"),\n",
    "    (3, \"a\"),\n",
    "    (4, \"a\"),\n",
    "    (5, \"c\"),\n",
    "    (6, \"a\"),\n",
    "    (7, \"b\"),\n",
    "    (8, \"d\"),\n",
    "    (9, \"d\")\n",
    "], [\"id\", \"category\"])\n",
    "\n",
    "df2 = spark.createDataFrame([\n",
    "    (0, \"a\"),\n",
    "    (1, \"b\"),\n",
    "    (2, \"c\"),\n",
    "    (3, \"a\"),\n",
    "    (4, \"a\"),\n",
    "    (5, \"c\")\n",
    "], [\"id\", \"category\"])\n",
    "\n",
    "df = df2\n",
    "\n",
    "stringIndexer = StringIndexer(inputCol=\"category\", outputCol=\"categoryIndex\")\n",
    "model = stringIndexer.fit(df)\n",
    "indexed = model.transform(df)\n",
    "\n",
    "encoder = OneHotEncoder(dropLast=True, inputCol=\"categoryIndex\", outputCol=\"categoryVec\")\n",
    "encoded = encoder.transform(indexed)\n",
    "encoded.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> ## VectorIndexer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Chose 351 categorical features: 645, 69, 365, 138, 101, 479, 333, 249, 0, 555, 666, 88, 170, 115, 276, 308, 5, 449, 120, 247, 614, 677, 202, 10, 56, 533, 142, 500, 340, 670, 174, 42, 417, 24, 37, 25, 257, 389, 52, 14, 504, 110, 587, 619, 196, 559, 638, 20, 421, 46, 93, 284, 228, 448, 57, 78, 29, 475, 164, 591, 646, 253, 106, 121, 84, 480, 147, 280, 61, 221, 396, 89, 133, 116, 1, 507, 312, 74, 307, 452, 6, 248, 60, 117, 678, 529, 85, 201, 220, 366, 534, 102, 334, 28, 38, 561, 392, 70, 424, 192, 21, 137, 165, 33, 92, 229, 252, 197, 361, 65, 97, 665, 583, 285, 224, 650, 615, 9, 53, 169, 593, 141, 610, 420, 109, 256, 225, 339, 77, 193, 669, 476, 642, 637, 590, 679, 96, 393, 647, 173, 13, 41, 503, 134, 73, 105, 2, 508, 311, 558, 674, 530, 586, 618, 166, 32, 34, 148, 45, 161, 279, 64, 689, 17, 149, 584, 562, 176, 423, 191, 22, 44, 59, 118, 281, 27, 641, 71, 391, 12, 445, 54, 313, 611, 144, 49, 335, 86, 672, 172, 113, 681, 219, 419, 81, 230, 362, 451, 76, 7, 39, 649, 98, 616, 477, 367, 535, 103, 140, 621, 91, 66, 251, 668, 198, 108, 278, 223, 394, 306, 135, 563, 226, 3, 505, 80, 167, 35, 473, 675, 589, 162, 531, 680, 255, 648, 112, 617, 194, 145, 48, 557, 690, 63, 640, 18, 282, 95, 310, 50, 67, 199, 673, 16, 585, 502, 338, 643, 31, 336, 613, 11, 72, 175, 446, 612, 143, 43, 250, 231, 450, 99, 363, 556, 87, 203, 671, 688, 104, 368, 588, 40, 304, 26, 258, 390, 55, 114, 171, 139, 418, 23, 8, 75, 119, 58, 667, 478, 536, 82, 620, 447, 36, 168, 146, 30, 51, 190, 19, 422, 564, 305, 107, 4, 136, 506, 79, 195, 474, 664, 532, 94, 283, 395, 332, 528, 644, 47, 15, 163, 200, 68, 62, 277, 691, 501, 90, 111, 254, 227, 337, 122, 83, 309, 560, 639, 676, 222, 592, 364, 100\n",
      "+-----+--------------------+--------------------+\n",
      "|label|            features|             indexed|\n",
      "+-----+--------------------+--------------------+\n",
      "|  0.0|(692,[127,128,129...|(692,[127,128,129...|\n",
      "|  1.0|(692,[158,159,160...|(692,[158,159,160...|\n",
      "|  1.0|(692,[124,125,126...|(692,[124,125,126...|\n",
      "|  1.0|(692,[152,153,154...|(692,[152,153,154...|\n",
      "|  1.0|(692,[151,152,153...|(692,[151,152,153...|\n",
      "|  0.0|(692,[129,130,131...|(692,[129,130,131...|\n",
      "|  1.0|(692,[158,159,160...|(692,[158,159,160...|\n",
      "|  1.0|(692,[99,100,101,...|(692,[99,100,101,...|\n",
      "|  0.0|(692,[154,155,156...|(692,[154,155,156...|\n",
      "|  0.0|(692,[127,128,129...|(692,[127,128,129...|\n",
      "|  1.0|(692,[154,155,156...|(692,[154,155,156...|\n",
      "|  0.0|(692,[153,154,155...|(692,[153,154,155...|\n",
      "|  0.0|(692,[151,152,153...|(692,[151,152,153...|\n",
      "|  1.0|(692,[129,130,131...|(692,[129,130,131...|\n",
      "|  0.0|(692,[154,155,156...|(692,[154,155,156...|\n",
      "|  1.0|(692,[150,151,152...|(692,[150,151,152...|\n",
      "|  0.0|(692,[124,125,126...|(692,[124,125,126...|\n",
      "|  0.0|(692,[152,153,154...|(692,[152,153,154...|\n",
      "|  1.0|(692,[97,98,99,12...|(692,[97,98,99,12...|\n",
      "|  1.0|(692,[124,125,126...|(692,[124,125,126...|\n",
      "+-----+--------------------+--------------------+\n",
      "only showing top 20 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from pyspark.ml.feature import VectorIndexer\n",
    "\n",
    "data = spark.read.format(\"libsvm\").load(\"data/sample_libsvm_data.txt\")\n",
    "\n",
    "indexer = VectorIndexer(inputCol=\"features\", outputCol=\"indexed\", maxCategories=10)\n",
    "indexerModel = indexer.fit(data)\n",
    "\n",
    "categoricalFeatures = indexerModel.categoryMaps\n",
    "print(\"Chose %d categorical features: %s\" %\n",
    "      (len(categoricalFeatures), \", \".join(str(k) for k in categoricalFeatures.keys())))\n",
    "\n",
    "# Create new column \"indexed\" with categorical values transformed to indices\n",
    "indexedData = indexerModel.transform(data)\n",
    "indexedData.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> https://www.datacamp.com/community/tutorials/categorical-data\n",
    "\n",
    "> http://contrib.scikit-learn.org/categorical-encoding/_modules/index.html"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}