{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": true }, "source": [ "


\n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:50.844074Z", "start_time": "2019-12-24T17:42:50.026363Z" } }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:50.856824Z", "start_time": "2019-12-24T17:42:50.848076Z" } }, "outputs": [], "source": [ "import builtins\n", "\n", "def print(*args, **kwargs):\n", " builtins.print(*args, **kwargs, end='\\n\\n')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Handling Missing Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The way that missing data is represented in pandas objects is somewhat imperfect, but it is functional for a lot of users. For numeric data, pandas uses the floating-point value **NaN** (Not a Number) to represent missing data. We call this a **sentinel value** that can be easily detected" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:51.015075Z", "start_time": "2019-12-24T17:42:50.858430Z" } }, "outputs": [ { "data": { "text/plain": [ "0 aardvark\n", "1 artichoke\n", "2 NaN\n", "3 avocado\n", "dtype: object" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])\n", "string_data" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:51.161311Z", "start_time": "2019-12-24T17:42:51.016109Z" } }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 True\n", "3 False\n", "dtype: bool" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string_data.isnull()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:51.295057Z", "start_time": "2019-12-24T17:42:51.164353Z" }, "hide_input": false }, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string_data.isnull().sum()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:51.437478Z", "start_time": "2019-12-24T17:42:51.297653Z" } }, "outputs": [ { "data": { "text/plain": [ "0 True\n", "1 False\n", "2 True\n", "3 False\n", "dtype: bool" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The built-in Python None value is also treated as NA in object arrays\n", "\n", "string_data[0] = None\n", "string_data.isnull()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:51.605133Z", "start_time": "2019-12-24T17:42:51.438477Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NA handling methods\n", "\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ArgumentDescription
1dropnaFilter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.
2fillnaFill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'.
3isnullReturn boolean values indicating which values are missing/NA.
4notnullNegation of isnull.
\n", "
" ], "text/plain": [ " Argument \\\n", "1 dropna \n", "2 fillna \n", "3 isnull \n", "4 notnull \n", "\n", " Description \n", "1 Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate. \n", "2 Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'. \n", "3 Return boolean values indicating which values are missing/NA. \n", "4 Negation of isnull. " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.options.display.max_colwidth = 500\n", "df = pd.read_csv(r'examples/NA_handling_methods.csv', names=['Description'], sep='\\n')\n", "df.index = list(range(1, 5))\n", "\n", "def f1(x):\n", " x = x.split(' ')\n", " return x.pop(0)\n", "\n", "\n", "df['Argument'] = df['Description'].map(f1) # get the first word\n", "\n", "def f2(x):\n", " x = x.split(' ')\n", " return \" \".join(x[1:])\n", "\n", "df['Description'] = df['Description'].map(f2) # remove the first word\n", "\n", "df = df.reindex(columns=['Argument', 'Description'])\n", "print(\"NA handling methods\")\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Filtering Out Missing Data\n", "\n", "There are a few ways to filter out missing data. While you always have the option to do it by hand using **pandas.isnull** and **boolean indexing**, the **dropna** can be helpful." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:51.746687Z", "start_time": "2019-12-24T17:42:51.607016Z" } }, "outputs": [], "source": [ "# On a Series, it returns the Series with only the non-null data and index values\n", "\n", "from numpy import nan as NA" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:51.894072Z", "start_time": "2019-12-24T17:42:51.750247Z" } }, "outputs": [], "source": [ "data = pd.Series([1, NA, 3.5, NA, 7])" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:52.036025Z", "start_time": "2019-12-24T17:42:51.896067Z" } }, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "2 3.5\n", "4 7.0\n", "dtype: float64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.dropna()\n", "# data[data.notnull()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With DataFrame objects, things are a bit more complex. You may want to drop rows or columns that are all NA or only those containing any NAs. dropna by **default drops any row containing a missing value**" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:52.179002Z", "start_time": "2019-12-24T17:42:52.037987Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.06.53.0
11.0NaNNaN
2NaNNaNNaN
3NaN6.53.0
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.0 6.5 3.0\n", "1 1.0 NaN NaN\n", "2 NaN NaN NaN\n", "3 NaN 6.5 3.0" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],\n", " [NA, NA, NA], [NA, 6.5, 3.]])\n", "data" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:52.326562Z", "start_time": "2019-12-24T17:42:52.179002Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.06.53.0
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.0 6.5 3.0" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cleaned = data.dropna()\n", "cleaned" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:52.492974Z", "start_time": "2019-12-24T17:42:52.328730Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.06.53.0
11.0NaNNaN
3NaN6.53.0
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.0 6.5 3.0\n", "1 1.0 NaN NaN\n", "3 NaN 6.5 3.0" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Passing how='all' will only drop rows that are all NA\n", "\n", "data.dropna(how='all')\n", "# data.dropna(how='any')" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:52.663286Z", "start_time": "2019-12-24T17:42:52.493950Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0124
01.06.53.0NaN
11.0NaNNaNNaN
2NaNNaNNaNNaN
3NaN6.53.0NaN
\n", "
" ], "text/plain": [ " 0 1 2 4\n", "0 1.0 6.5 3.0 NaN\n", "1 1.0 NaN NaN NaN\n", "2 NaN NaN NaN NaN\n", "3 NaN 6.5 3.0 NaN" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# To drop columns in the same way, pass axis=1\n", "\n", "data[4] = NA\n", "data" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:52.828878Z", "start_time": "2019-12-24T17:42:52.665142Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.06.53.0
11.0NaNNaN
2NaNNaNNaN
3NaN6.53.0
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.0 6.5 3.0\n", "1 1.0 NaN NaN\n", "2 NaN NaN NaN\n", "3 NaN 6.5 3.0" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.dropna(axis=1, how='all')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A related way to filter out DataFrame rows tends to concern time series data. **Suppose you want to keep only rows containing a certain number of observations**. You can indicate this with the **thresh** argument" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:52.964051Z", "start_time": "2019-12-24T17:42:52.834105Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.196168NaNNaN
1-1.007727NaNNaN
22.909089NaN-2.142449
3-0.866425NaN1.881426
40.300446-1.2668300.296984
5-0.9630501.649808-1.686630
6-0.087371-0.342160-0.608004
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.196168 NaN NaN\n", "1 -1.007727 NaN NaN\n", "2 2.909089 NaN -2.142449\n", "3 -0.866425 NaN 1.881426\n", "4 0.300446 -1.266830 0.296984\n", "5 -0.963050 1.649808 -1.686630\n", "6 -0.087371 -0.342160 -0.608004" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(np.random.randn(7, 3))\n", "df.iloc[:4, 1] = NA\n", "df.iloc[:2, 2] = NA\n", "df" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:53.139880Z", "start_time": "2019-12-24T17:42:52.967617Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0 1 2\n", "4 0.300446 -1.266830 0.296984\n", "5 -0.963050 1.649808 -1.686630\n", "6 -0.087371 -0.342160 -0.608004\n", "\n", " 0 1 2\n", "2 2.909089 NaN -2.142449\n", "3 -0.866425 NaN 1.881426\n", "4 0.300446 -1.266830 0.296984\n", "5 -0.963050 1.649808 -1.686630\n", "6 -0.087371 -0.342160 -0.608004\n", "\n" ] } ], "source": [ "print(df.dropna())\n", "print(df.dropna(thresh=2))" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:53.321362Z", "start_time": "2019-12-24T17:42:53.141874Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.196168NaNNaN
1-1.007727NaNNaN
22.909089NaN-2.142449
3-0.866425NaN1.881426
40.300446-1.2668300.296984
5-0.9630501.649808-1.686630
6-0.087371-0.342160-0.608004
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.196168 NaN NaN\n", "1 -1.007727 NaN NaN\n", "2 2.909089 NaN -2.142449\n", "3 -0.866425 NaN 1.881426\n", "4 0.300446 -1.266830 0.296984\n", "5 -0.963050 1.649808 -1.686630\n", "6 -0.087371 -0.342160 -0.608004" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:53.477978Z", "start_time": "2019-12-24T17:42:53.324143Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(3, 3)\n", "\n", "(5, 3)\n", "\n" ] } ], "source": [ "print(df.dropna(subset=[1, 2], how='any').shape)\n", "print(df.dropna(subset=[1, 2], how='all').shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Filling In Missing Data" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:53.630300Z", "start_time": "2019-12-24T17:42:53.480393Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.1961680.0000000.000000
1-1.0077270.0000000.000000
22.9090890.000000-2.142449
3-0.8664250.0000001.881426
40.300446-1.2668300.296984
5-0.9630501.649808-1.686630
6-0.087371-0.342160-0.608004
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.196168 0.000000 0.000000\n", "1 -1.007727 0.000000 0.000000\n", "2 2.909089 0.000000 -2.142449\n", "3 -0.866425 0.000000 1.881426\n", "4 0.300446 -1.266830 0.296984\n", "5 -0.963050 1.649808 -1.686630\n", "6 -0.087371 -0.342160 -0.608004" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.fillna(0)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:53.795600Z", "start_time": "2019-12-24T17:42:53.632354Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.1961680.5000000.000000
1-1.0077270.5000000.000000
22.9090890.500000-2.142449
3-0.8664250.5000001.881426
40.300446-1.2668300.296984
5-0.9630501.649808-1.686630
6-0.087371-0.342160-0.608004
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.196168 0.500000 0.000000\n", "1 -1.007727 0.500000 0.000000\n", "2 2.909089 0.500000 -2.142449\n", "3 -0.866425 0.500000 1.881426\n", "4 0.300446 -1.266830 0.296984\n", "5 -0.963050 1.649808 -1.686630\n", "6 -0.087371 -0.342160 -0.608004" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Calling fillna with a dict, you can use a different fill value for each column\n", "\n", "df.fillna({1: 0.5, 2: 0})" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:53.971110Z", "start_time": "2019-12-24T17:42:53.798627Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.1961680.0000000.000000
1-1.0077270.0000000.000000
22.9090890.000000-2.142449
3-0.8664250.0000001.881426
40.300446-1.2668300.296984
5-0.9630501.649808-1.686630
6-0.087371-0.342160-0.608004
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.196168 0.000000 0.000000\n", "1 -1.007727 0.000000 0.000000\n", "2 2.909089 0.000000 -2.142449\n", "3 -0.866425 0.000000 1.881426\n", "4 0.300446 -1.266830 0.296984\n", "5 -0.963050 1.649808 -1.686630\n", "6 -0.087371 -0.342160 -0.608004" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# fillna returns a new object, but you can modify the existing object in-place\n", "_ = df.fillna(0, inplace=True)\n", "df" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:54.119327Z", "start_time": "2019-12-24T17:42:53.972107Z" } }, "outputs": [], "source": [ "df = pd.DataFrame(np.random.randn(6, 3))" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:54.585826Z", "start_time": "2019-12-24T17:42:54.119327Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-1.2830631.4560590.639081
1-0.863694-0.012437-0.741003
20.533858NaN-0.741355
31.690786NaN-0.396711
4-0.296663NaNNaN
5-1.422233NaNNaN
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -1.283063 1.456059 0.639081\n", "1 -0.863694 -0.012437 -0.741003\n", "2 0.533858 NaN -0.741355\n", "3 1.690786 NaN -0.396711\n", "4 -0.296663 NaN NaN\n", "5 -1.422233 NaN NaN" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[2:, 1] = NA\n", "df.iloc[4:, 2] = NA\n", "df" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:54.843584Z", "start_time": "2019-12-24T17:42:54.586856Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-1.2830631.4560590.639081
1-0.863694-0.012437-0.741003
20.533858-0.012437-0.741355
31.690786-0.012437-0.396711
4-0.296663-0.012437-0.396711
5-1.422233-0.012437-0.396711
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -1.283063 1.456059 0.639081\n", "1 -0.863694 -0.012437 -0.741003\n", "2 0.533858 -0.012437 -0.741355\n", "3 1.690786 -0.012437 -0.396711\n", "4 -0.296663 -0.012437 -0.396711\n", "5 -1.422233 -0.012437 -0.396711" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.fillna(method='ffill')" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:54.997314Z", "start_time": "2019-12-24T17:42:54.846361Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-1.2830631.4560590.639081
1-0.863694-0.012437-0.741003
20.533858-0.012437-0.741355
31.690786-0.012437-0.396711
4-0.296663NaN-0.396711
5-1.422233NaN-0.396711
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -1.283063 1.456059 0.639081\n", "1 -0.863694 -0.012437 -0.741003\n", "2 0.533858 -0.012437 -0.741355\n", "3 1.690786 -0.012437 -0.396711\n", "4 -0.296663 NaN -0.396711\n", "5 -1.422233 NaN -0.396711" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.fillna(method='ffill', limit=2)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:55.162941Z", "start_time": "2019-12-24T17:42:54.998307Z" } }, "outputs": [], "source": [ "# we might pass mean or median of a series\n", "\n", "data = pd.Series([1., NA, 3.5, NA, 7])" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:55.325659Z", "start_time": "2019-12-24T17:42:55.165933Z" } }, "outputs": [ { "data": { "text/plain": [ "0 1.000000\n", "1 3.833333\n", "2 3.500000\n", "3 3.833333\n", "4 7.000000\n", "dtype: float64" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.fillna(data.mean())" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2019-12-23T17:37:04.279897Z", "start_time": "2019-12-23T17:37:04.272219Z" } }, "source": [ "**Arguments** - **Description** for fillna\n", "\n", "**value** - Scalar value or dict-like object to use to fill missing values\n", "\n", "**method** - Interpolation; by default 'ffill' if function called with no other arguments\n", "\n", "**axis** - Axis to fill on; default axis=0\n", "\n", "**inplace** - Modify the calling object without producing a copy\n", "\n", "**limit** - For forward and backward filling, maximum number of consecutive periods to fill" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Transformation\n", "\n", "### Removing Duplicates" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:55.494690Z", "start_time": "2019-12-24T17:42:55.328594Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
k1k2
0one1
1two1
2one2
3two3
4one3
5two4
6two4
\n", "
" ], "text/plain": [ " k1 k2\n", "0 one 1\n", "1 two 1\n", "2 one 2\n", "3 two 3\n", "4 one 3\n", "5 two 4\n", "6 two 4" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],\n", " 'k2': [1, 1, 2, 3, 3, 4, 4]})\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The DataFrame method **duplicated** returns a boolean Series indicating whether each row is a duplicate (has been observed in a previous row) or not." ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:55.649465Z", "start_time": "2019-12-24T17:42:55.498032Z" } }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", "5 False\n", "6 True\n", "dtype: bool" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.duplicated()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:55.804370Z", "start_time": "2019-12-24T17:42:55.652247Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
k1k2
0one1
1two1
2one2
3two3
4one3
5two4
\n", "
" ], "text/plain": [ " k1 k2\n", "0 one 1\n", "1 two 1\n", "2 one 2\n", "3 two 3\n", "4 one 3\n", "5 two 4" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# drop_duplicates returns a DataFrame where the duplicated array is False\n", "\n", "data.drop_duplicates()" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:55.962564Z", "start_time": "2019-12-24T17:42:55.805368Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
k1k2v1
0one10
1two11
2one22
3two33
4one34
5two45
6two46
\n", "
" ], "text/plain": [ " k1 k2 v1\n", "0 one 1 0\n", "1 two 1 1\n", "2 one 2 2\n", "3 two 3 3\n", "4 one 3 4\n", "5 two 4 5\n", "6 two 4 6" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Suppose we had an additional column of values and wanted to \n", "# filter duplicates only based on the 'k1' column\n", "\n", "data['v1'] = range(7)\n", "data" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:56.114393Z", "start_time": "2019-12-24T17:42:55.964622Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
k1k2v1
0one10
1two11
\n", "
" ], "text/plain": [ " k1 k2 v1\n", "0 one 1 0\n", "1 two 1 1" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.drop_duplicates(['k1'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Transforming Data Using a Function or Mapping\n", "\n", "For many datasets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame." ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:56.277029Z", "start_time": "2019-12-24T17:42:56.116562Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
foodounces
0bacon4.0
1pulled pork3.0
2bacon12.0
3Pastrami6.0
4corned beef7.5
5Bacon8.0
6pastrami3.0
7honey ham5.0
8nova lox6.0
\n", "
" ], "text/plain": [ " food ounces\n", "0 bacon 4.0\n", "1 pulled pork 3.0\n", "2 bacon 12.0\n", "3 Pastrami 6.0\n", "4 corned beef 7.5\n", "5 Bacon 8.0\n", "6 pastrami 3.0\n", "7 honey ham 5.0\n", "8 nova lox 6.0" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',\n", " 'Pastrami', 'corned beef', 'Bacon',\n", " 'pastrami', 'honey ham', 'nova lox'],\n", " 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})\n", "data" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:56.417657Z", "start_time": "2019-12-24T17:42:56.278980Z" } }, "outputs": [], "source": [ "# Suppose you wanted to add a column indicating the \n", "# type of animal that each food came from\n", "\n", "meat_to_animal = {\n", " 'bacon': 'pig',\n", " 'pulled pork': 'pig',\n", " 'pastrami': 'cow',\n", " 'corned beef': 'cow',\n", " 'honey ham': 'pig',\n", " 'nova lox': 'salmon'\n", "}" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:56.595457Z", "start_time": "2019-12-24T17:42:56.419812Z" } }, "outputs": [ { "data": { "text/plain": [ "0 bacon\n", "1 pulled pork\n", "2 bacon\n", "3 pastrami\n", "4 corned beef\n", "5 bacon\n", "6 pastrami\n", "7 honey ham\n", "8 nova lox\n", "Name: food, dtype: object" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lowercased = data['food'].str.lower()\n", "lowercased" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:56.767759Z", "start_time": "2019-12-24T17:42:56.597643Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
foodouncesanimal
0bacon4.0pig
1pulled pork3.0pig
2bacon12.0pig
3Pastrami6.0cow
4corned beef7.5cow
5Bacon8.0pig
6pastrami3.0cow
7honey ham5.0pig
8nova lox6.0salmon
\n", "
" ], "text/plain": [ " food ounces animal\n", "0 bacon 4.0 pig\n", "1 pulled pork 3.0 pig\n", "2 bacon 12.0 pig\n", "3 Pastrami 6.0 cow\n", "4 corned beef 7.5 cow\n", "5 Bacon 8.0 pig\n", "6 pastrami 3.0 cow\n", "7 honey ham 5.0 pig\n", "8 nova lox 6.0 salmon" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['animal'] = lowercased.map(meat_to_animal)\n", "# or\n", "# data['food'].map(lambda x: meat_to_animal[x.lower()])\n", "data" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2019-12-23T17:56:05.001259Z", "start_time": "2019-12-23T17:56:04.992428Z" } }, "source": [ "### Replacing Values\n", "\n", "Filling in missing data with the fillna method is a special case of more general value replacement. As you’ve already seen, map can be used to modify a subset of values in an object but **replace** provides a simpler and more flexible way to do so" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:56.900004Z", "start_time": "2019-12-24T17:42:56.768790Z" } }, "outputs": [], "source": [ "data = pd.Series([1., -999., 2., -999., -1000., 3.])" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:57.091197Z", "start_time": "2019-12-24T17:42:56.900004Z" } }, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "1 -999.0\n", "2 2.0\n", "3 -999.0\n", "4 -1000.0\n", "5 3.0\n", "dtype: float64" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:57.234609Z", "start_time": "2019-12-24T17:42:57.092227Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 1.0\n", "1 NaN\n", "2 2.0\n", "3 NaN\n", "4 -1000.0\n", "5 3.0\n", "dtype: float64\n", "\n", "0 1.0\n", "1 NaN\n", "2 2.0\n", "3 NaN\n", "4 NaN\n", "5 3.0\n", "dtype: float64\n", "\n", "0 1.0\n", "1 NaN\n", "2 2.0\n", "3 NaN\n", "4 0.0\n", "5 3.0\n", "dtype: float64\n", "\n", "None\n", "\n" ] } ], "source": [ "print(data.replace(-999, np.nan))\n", "print(data.replace([-999, -1000], np.nan))\n", "print(data.replace([-999, -1000], [np.nan, 0]))\n", "print(data.replace({-999: np.nan, -1000: 0}, inplace=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Renaming Axis Indexes\n", "\n", "axis labels can be transformed by a function or mapping of some form to produce new, differently labeled objects. 
" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:57.347390Z", "start_time": "2019-12-24T17:42:57.234609Z" } }, "outputs": [], "source": [ "data = pd.DataFrame(np.arange(12).reshape((3, 4)),\n", " index=['Ohio', 'Colorado', 'New York'],\n", " columns=['one', 'two', 'three', 'four'])" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:57.486408Z", "start_time": "2019-12-24T17:42:57.349987Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Index(['OHIO', 'COLO', 'NEW '], dtype='object')\n", "\n", "Index(['OH', 'CO', 'NE'], dtype='object')\n", "\n" ] } ], "source": [ "transform = lambda x: x[:4].upper()\n", "\n", "# Like a Series, the axis indexes have a map method\n", "\n", "print(data.index.map(transform))\n", "\n", "# -----------\n", "trans = lambda x: x[:2].upper()\n", "print(data.index.map(trans))" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:57.630459Z", "start_time": "2019-12-24T17:42:57.488063Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothreefour
OHIO0123
COLO4567
NEW891011
\n", "
" ], "text/plain": [ " one two three four\n", "OHIO 0 1 2 3\n", "COLO 4 5 6 7\n", "NEW 8 9 10 11" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.index = data.index.map(transform)\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to create a transformed version of a dataset without modifying the original, a useful method is **rename**" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:57.779949Z", "start_time": "2019-12-24T17:42:57.631489Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ONETWOTHREEFOUR
Ohio0123
Colo4567
New891011
\n", "
" ], "text/plain": [ " ONE TWO THREE FOUR\n", "Ohio 0 1 2 3\n", "Colo 4 5 6 7\n", "New 8 9 10 11" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.rename(index=str.title, columns=str.upper)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:57.934232Z", "start_time": "2019-12-24T17:42:57.782092Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwopeekaboofour
INDIANA0123
COLO4567
NEW891011
\n", "
" ], "text/plain": [ " one two peekaboo four\n", "INDIANA 0 1 2 3\n", "COLO 4 5 6 7\n", "NEW 8 9 10 11" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# rename can be used in conjunction with a dict-like object\n", "# providing new values for a subset of the axis labels\n", "\n", "data.rename(index={'OHIO': 'INDIANA'},\n", " columns={'three': 'peekaboo'}, inplace=True)\n", "data" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:58.052890Z", "start_time": "2019-12-24T17:42:57.934232Z" } }, "outputs": [], "source": [ "# ToD: Discretization and Binning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Detecting and Filtering Outliers\n", "\n", "Filtering or transforming outliers is largely a matter of applying array operations." ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:58.224509Z", "start_time": "2019-12-24T17:42:58.052890Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
count1000.0000001000.0000001000.0000001000.000000
mean0.0153830.0248960.038430-0.064249
std1.0083770.9964690.9898670.995204
min-3.141061-3.430585-2.886733-3.260575
25%-0.708238-0.597616-0.656763-0.766798
50%0.0513030.0514380.048483-0.071339
75%0.7307750.6919720.6665960.639310
max3.7669053.3498573.3616742.854713
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "count 1000.000000 1000.000000 1000.000000 1000.000000\n", "mean 0.015383 0.024896 0.038430 -0.064249\n", "std 1.008377 0.996469 0.989867 0.995204\n", "min -3.141061 -3.430585 -2.886733 -3.260575\n", "25% -0.708238 -0.597616 -0.656763 -0.766798\n", "50% 0.051303 0.051438 0.048483 -0.071339\n", "75% 0.730775 0.691972 0.666596 0.639310\n", "max 3.766905 3.349857 3.361674 2.854713" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.DataFrame(np.random.randn(1000, 4))\n", "data.describe()" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:58.337732Z", "start_time": "2019-12-24T17:42:58.225600Z" } }, "outputs": [], "source": [ "col = data[0]" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:58.470172Z", "start_time": "2019-12-24T17:42:58.345580Z" } }, "outputs": [ { "data": { "text/plain": [ "1 -3.141061\n", "56 3.766905\n", "247 3.008167\n", "787 -3.135201\n", "Name: 0, dtype: float64" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "col[np.abs(col) > 3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To select all rows having a value exceeding 3 or –3, you can use the any method on a boolean DataFrame" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:58.634225Z", "start_time": "2019-12-24T17:42:58.473682Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
1-3.1410610.7831270.469851-0.395636
563.7669050.0309091.6334941.140207
2473.008167-0.9928880.260335-3.063802
327-0.0745633.3498572.032817-0.609838
405-0.362679-0.1031593.0873572.517252
448-0.4374643.054526-0.2544061.047457
611-0.369185-3.430585-0.515431-1.668876
619-0.3112820.5355453.039966-1.199799
631-0.9237380.346249-0.485636-3.260575
765-0.856659-0.717638-1.155797-3.069395
787-3.135201-1.338375-0.4041711.373373
9170.437193-3.026429-0.4384330.024585
984-0.6297880.3080673.361674-1.683448
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "1 -3.141061 0.783127 0.469851 -0.395636\n", "56 3.766905 0.030909 1.633494 1.140207\n", "247 3.008167 -0.992888 0.260335 -3.063802\n", "327 -0.074563 3.349857 2.032817 -0.609838\n", "405 -0.362679 -0.103159 3.087357 2.517252\n", "448 -0.437464 3.054526 -0.254406 1.047457\n", "611 -0.369185 -3.430585 -0.515431 -1.668876\n", "619 -0.311282 0.535545 3.039966 -1.199799\n", "631 -0.923738 0.346249 -0.485636 -3.260575\n", "765 -0.856659 -0.717638 -1.155797 -3.069395\n", "787 -3.135201 -1.338375 -0.404171 1.373373\n", "917 0.437193 -3.026429 -0.438433 0.024585\n", "984 -0.629788 0.308067 3.361674 -1.683448" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[(np.abs(data) > 3).any(axis=1)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Values can be set based on these criteria. Here is code to cap values outside the interval –3 to 3" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:58.768926Z", "start_time": "2019-12-24T17:42:58.635223Z" } }, "outputs": [], "source": [ "data[np.abs(data) > 3] = np.sign(data) * 3" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:58.935675Z", "start_time": "2019-12-24T17:42:58.768926Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
count1000.0000001000.0000001000.0000001000.000000
mean0.0148850.0249490.037941-0.063855
std1.0049370.9937110.9883300.994002
min-3.000000-3.000000-2.886733-3.000000
25%-0.708238-0.597616-0.656763-0.766798
50%0.0513030.0514380.048483-0.071339
75%0.7307750.6919720.6665960.639310
max3.0000003.0000003.0000002.854713
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "count 1000.000000 1000.000000 1000.000000 1000.000000\n", "mean 0.014885 0.024949 0.037941 -0.063855\n", "std 1.004937 0.993711 0.988330 0.994002\n", "min -3.000000 -3.000000 -2.886733 -3.000000\n", "25% -0.708238 -0.597616 -0.656763 -0.766798\n", "50% 0.051303 0.051438 0.048483 -0.071339\n", "75% 0.730775 0.691972 0.666596 0.639310\n", "max 3.000000 3.000000 3.000000 2.854713" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.describe()" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:59.066787Z", "start_time": "2019-12-24T17:42:58.936677Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
0-1.0-1.0-1.0-1.0
1-1.01.01.0-1.0
21.0-1.01.0-1.0
31.0-1.01.0-1.0
41.0-1.0-1.01.0
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "0 -1.0 -1.0 -1.0 -1.0\n", "1 -1.0 1.0 1.0 -1.0\n", "2 1.0 -1.0 1.0 -1.0\n", "3 1.0 -1.0 1.0 -1.0\n", "4 1.0 -1.0 -1.0 1.0" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The statement np.sign(data) produces 1 and –1 values based on \n", "# whether the values in data are positive or negative\n", "\n", "np.sign(data).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Permutation and Random Sampling\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do using the numpy.random.permutation function. Calling permutation with the length of the axis you want to permute produces an array of integers indicating the new ordering" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:59.221723Z", "start_time": "2019-12-24T17:42:59.067821Z" } }, "outputs": [], "source": [ "df = pd.DataFrame(np.arange(20).reshape((5, 4)))" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:59.370484Z", "start_time": "2019-12-24T17:42:59.222721Z" } }, "outputs": [ { "data": { "text/plain": [ "array([3, 4, 1, 2, 0])" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sampler = np.random.permutation(5)\n", "sampler" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That array can then be used in iloc-based indexing or the equivalent **take** function" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:59.559900Z", "start_time": "2019-12-24T17:42:59.374481Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
00123
14567
2891011
312131415
416171819
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "0 0 1 2 3\n", "1 4 5 6 7\n", "2 8 9 10 11\n", "3 12 13 14 15\n", "4 16 17 18 19" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:59.714703Z", "start_time": "2019-12-24T17:42:59.562894Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
312131415
416171819
14567
2891011
00123
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "3 12 13 14 15\n", "4 16 17 18 19\n", "1 4 5 6 7\n", "2 8 9 10 11\n", "0 0 1 2 3" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.take(sampler)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To select a random subset **without replacement**, you can use the **sample** method on Series and DataFrame" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:42:59.877470Z", "start_time": "2019-12-24T17:42:59.717464Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
416171819
2891011
14567
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "4 16 17 18 19\n", "2 8 9 10 11\n", "1 4 5 6 7" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sample(n=3)" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:00.022827Z", "start_time": "2019-12-24T17:42:59.880463Z" } }, "outputs": [ { "data": { "text/plain": [ "3 6\n", "1 7\n", "3 6\n", "0 5\n", "3 6\n", "0 5\n", "1 7\n", "2 -1\n", "2 -1\n", "4 4\n", "dtype: int64" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# sample with replacement\n", "\n", "choices = pd.Series([5, 7, -1, 6, 4])\n", "draws = choices.sample(n=10, replace=True)\n", "draws" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Computing Indicator/Dummy Variables\n" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:01.237733Z", "start_time": "2019-12-24T17:43:00.025854Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Thayer)female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.2500 NaN S \n", "1 0 PC 17599 71.2833 C85 C \n", "2 0 STON/O2. 3101282 7.9250 NaN S \n", "3 0 113803 53.1000 C123 S \n", "4 0 373450 8.0500 NaN S " ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train = pd.read_csv('http://bit.ly/kaggletrain')\n", "train.head()" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:01.266505Z", "start_time": "2019-12-24T17:43:01.239975Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarkedSex_male
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS1
1211Cumings, Mrs. John Bradley (Florence Briggs Thayer)female38.010PC 1759971.2833C85C0
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS0
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S0
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS1
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked Sex_male \n", "0 0 A/5 21171 7.2500 NaN S 1 \n", "1 0 PC 17599 71.2833 C85 C 0 \n", "2 0 STON/O2. 3101282 7.9250 NaN S 0 \n", "3 0 113803 53.1000 C123 S 0 \n", "4 0 373450 8.0500 NaN S 1 " ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create a dummy column for Sex column\n", "\n", "# using map method\n", "train['Sex_male'] = train.Sex.map({'female': 0, 'male': 1})\n", "train.head()" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:01.411456Z", "start_time": "2019-12-24T17:43:01.268780Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
femalemale
001
110
210
310
401
.........
88601
88710
88810
88901
89001
\n", "

891 rows × 2 columns

\n", "
" ], "text/plain": [ " female male\n", "0 0 1\n", "1 1 0\n", "2 1 0\n", "3 1 0\n", "4 0 1\n", ".. ... ...\n", "886 0 1\n", "887 1 0\n", "888 1 0\n", "889 0 1\n", "890 0 1\n", "\n", "[891 rows x 2 columns]" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# using get_dummies\n", " \n", "pd.get_dummies(train.Sex)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generally, **k** categorical values, then use **k-1** dummy variables to represent it.\n", "\n", "for Sex, we have to categorical values so we need only 1 variable to encode all the given information. So we drop the first column." ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:01.573467Z", "start_time": "2019-12-24T17:43:01.414109Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sex_male
01
10
20
30
41
......
8861
8870
8880
8891
8901
\n", "

891 rows × 1 columns

\n", "
" ], "text/plain": [ " Sex_male\n", "0 1\n", "1 0\n", "2 0\n", "3 0\n", "4 1\n", ".. ...\n", "886 1\n", "887 0\n", "888 0\n", "889 1\n", "890 1\n", "\n", "[891 rows x 1 columns]" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.get_dummies(train.Sex, prefix='Sex').iloc[:, 1:]" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:01.712385Z", "start_time": "2019-12-24T17:43:01.577414Z" } }, "outputs": [ { "data": { "text/plain": [ "S 644\n", "C 168\n", "Q 77\n", "Name: Embarked, dtype: int64" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train.Embarked.value_counts()" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:01.845962Z", "start_time": "2019-12-24T17:43:01.712385Z" } }, "outputs": [], "source": [ "embarked_dummies = pd.get_dummies(train.Embarked, prefix='Embarked').iloc[:, 1:]" ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:02.023567Z", "start_time": "2019-12-24T17:43:01.845962Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarkedSex_maleEmbarked_QEmbarked_S
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS101
1211Cumings, Mrs. John Bradley (Florence Briggs Thayer)female38.010PC 1759971.2833C85C000
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS001
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S001
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS101
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked Sex_male Embarked_Q \\\n", "0 0 A/5 21171 7.2500 NaN S 1 0 \n", "1 0 PC 17599 71.2833 C85 C 0 0 \n", "2 0 STON/O2. 3101282 7.9250 NaN S 0 0 \n", "3 0 113803 53.1000 C123 S 0 0 \n", "4 0 373450 8.0500 NaN S 1 0 \n", "\n", " Embarked_S \n", "0 1 \n", "1 0 \n", "2 1 \n", "3 1 \n", "4 1 " ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train = pd.concat([train, embarked_dummies], axis=1)\n", "train.head()" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:02.754062Z", "start_time": "2019-12-24T17:43:02.024777Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameAgeSibSpParchTicketFareCabinSex_femaleSex_maleEmbarked_CEmbarked_QEmbarked_S
0103Braund, Mr. Owen Harris22.010A/5 211717.2500NaN01001
1211Cumings, Mrs. John Bradley (Florence Briggs Thayer)38.010PC 1759971.2833C8510100
2313Heikkinen, Miss. Laina26.000STON/O2. 31012827.9250NaN10001
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)35.01011380353.1000C12310001
4503Allen, Mr. William Henry35.0003734508.0500NaN01001
................................................
88688702Montvila, Rev. Juozas27.00021153613.0000NaN01001
88788811Graham, Miss. Margaret Edith19.00011205330.0000B4210001
88888903Johnston, Miss. Catherine Helen \"Carrie\"NaN12W./C. 660723.4500NaN10001
88989011Behr, Mr. Karl Howell26.00011136930.0000C14801100
89089103Dooley, Mr. Patrick32.0003703767.7500NaN01010
\n", "

891 rows × 15 columns

\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", ".. ... ... ... \n", "886 887 0 2 \n", "887 888 1 1 \n", "888 889 0 3 \n", "889 890 1 1 \n", "890 891 0 3 \n", "\n", " Name Age SibSp Parch \\\n", "0 Braund, Mr. Owen Harris 22.0 1 0 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) 38.0 1 0 \n", "2 Heikkinen, Miss. Laina 26.0 0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0 1 0 \n", "4 Allen, Mr. William Henry 35.0 0 0 \n", ".. ... ... ... ... \n", "886 Montvila, Rev. Juozas 27.0 0 0 \n", "887 Graham, Miss. Margaret Edith 19.0 0 0 \n", "888 Johnston, Miss. Catherine Helen \"Carrie\" NaN 1 2 \n", "889 Behr, Mr. Karl Howell 26.0 0 0 \n", "890 Dooley, Mr. Patrick 32.0 0 0 \n", "\n", " Ticket Fare Cabin Sex_female Sex_male Embarked_C \\\n", "0 A/5 21171 7.2500 NaN 0 1 0 \n", "1 PC 17599 71.2833 C85 1 0 1 \n", "2 STON/O2. 3101282 7.9250 NaN 1 0 0 \n", "3 113803 53.1000 C123 1 0 0 \n", "4 373450 8.0500 NaN 0 1 0 \n", ".. ... ... ... ... ... ... \n", "886 211536 13.0000 NaN 0 1 0 \n", "887 112053 30.0000 B42 1 0 0 \n", "888 W./C. 6607 23.4500 NaN 1 0 0 \n", "889 111369 30.0000 C148 0 1 1 \n", "890 370376 7.7500 NaN 0 1 0 \n", "\n", " Embarked_Q Embarked_S \n", "0 0 1 \n", "1 0 0 \n", "2 0 1 \n", "3 0 1 \n", "4 0 1 \n", ".. ... ... \n", "886 0 1 \n", "887 0 1 \n", "888 0 1 \n", "889 0 0 \n", "890 1 0 \n", "\n", "[891 rows x 15 columns]" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# pass columns to get_dummies makes it more easy\n", "\n", "train = pd.read_csv('http://bit.ly/kaggletrain')\n", "\n", "pd.get_dummies(train, columns=['Sex', 'Embarked'])" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:02.783417Z", "start_time": "2019-12-24T17:43:02.756504Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameAgeSibSpParchTicketFareCabinSex_maleEmbarked_QEmbarked_S
0103Braund, Mr. Owen Harris22.010A/5 211717.2500NaN101
1211Cumings, Mrs. John Bradley (Florence Briggs Thayer)38.010PC 1759971.2833C85000
2313Heikkinen, Miss. Laina26.000STON/O2. 31012827.9250NaN001
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)35.01011380353.1000C123001
4503Allen, Mr. William Henry35.0003734508.0500NaN101
..........................................
88688702Montvila, Rev. Juozas27.00021153613.0000NaN101
88788811Graham, Miss. Margaret Edith19.00011205330.0000B42001
88888903Johnston, Miss. Catherine Helen \"Carrie\"NaN12W./C. 660723.4500NaN001
88989011Behr, Mr. Karl Howell26.00011136930.0000C148100
89089103Dooley, Mr. Patrick32.0003703767.7500NaN110
\n", "

891 rows × 13 columns

\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", ".. ... ... ... \n", "886 887 0 2 \n", "887 888 1 1 \n", "888 889 0 3 \n", "889 890 1 1 \n", "890 891 0 3 \n", "\n", " Name Age SibSp Parch \\\n", "0 Braund, Mr. Owen Harris 22.0 1 0 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) 38.0 1 0 \n", "2 Heikkinen, Miss. Laina 26.0 0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0 1 0 \n", "4 Allen, Mr. William Henry 35.0 0 0 \n", ".. ... ... ... ... \n", "886 Montvila, Rev. Juozas 27.0 0 0 \n", "887 Graham, Miss. Margaret Edith 19.0 0 0 \n", "888 Johnston, Miss. Catherine Helen \"Carrie\" NaN 1 2 \n", "889 Behr, Mr. Karl Howell 26.0 0 0 \n", "890 Dooley, Mr. Patrick 32.0 0 0 \n", "\n", " Ticket Fare Cabin Sex_male Embarked_Q Embarked_S \n", "0 A/5 21171 7.2500 NaN 1 0 1 \n", "1 PC 17599 71.2833 C85 0 0 0 \n", "2 STON/O2. 3101282 7.9250 NaN 0 0 1 \n", "3 113803 53.1000 C123 0 0 1 \n", "4 373450 8.0500 NaN 1 0 1 \n", ".. ... ... ... ... ... ... \n", "886 211536 13.0000 NaN 1 0 1 \n", "887 112053 30.0000 B42 0 0 1 \n", "888 W./C. 6607 23.4500 NaN 0 0 1 \n", "889 111369 30.0000 C148 1 0 0 \n", "890 370376 7.7500 NaN 1 1 0 \n", "\n", "[891 rows x 13 columns]" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# drop the first column after get_dummies\n", "\n", "pd.get_dummies(train, columns=['Sex', 'Embarked'], drop_first=True)" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:02.991607Z", "start_time": "2019-12-24T17:43:02.784456Z" } }, "outputs": [], "source": [ "# example of MovieLens 1M dataset\n", "\n", "mnames = ['movieid', 'title', 'genres']" ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:03.179736Z", "start_time": "2019-12-24T17:43:02.993603Z" } }, "outputs": [], "source": [ "movies = pd.read_table(r'MovieLens-1M/movies.dat', sep='::', header=None,\n", " names=mnames, engine='python')" ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:03.445851Z", "start_time": "2019-12-24T17:43:03.182662Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
movieidtitlegenres
01Toy Story (1995)Animation|Children's|Comedy
12Jumanji (1995)Adventure|Children's|Fantasy
23Grumpier Old Men (1995)Comedy|Romance
34Waiting to Exhale (1995)Comedy|Drama
45Father of the Bride Part II (1995)Comedy
\n", "
" ], "text/plain": [ " movieid title genres\n", "0 1 Toy Story (1995) Animation|Children's|Comedy\n", "1 2 Jumanji (1995) Adventure|Children's|Fantasy\n", "2 3 Grumpier Old Men (1995) Comedy|Romance\n", "3 4 Waiting to Exhale (1995) Comedy|Drama\n", "4 5 Father of the Bride Part II (1995) Comedy" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movies.head()" ] }, { "cell_type": "code", "execution_count": 72, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:03.606208Z", "start_time": "2019-12-24T17:43:03.447874Z" } }, "outputs": [], "source": [ "all_genres = []\n", "\n", "for x in movies.genres:\n", " all_genres.extend(x.split('|'))" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:03.741670Z", "start_time": "2019-12-24T17:43:03.609697Z" } }, "outputs": [ { "data": { "text/plain": [ "array(['Animation', \"Children's\", 'Comedy', 'Adventure', 'Fantasy',\n", " 'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',\n", " 'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',\n", " 'Western'], dtype=object)" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "genres = pd.unique(all_genres)\n", "genres" ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:03.877471Z", "start_time": "2019-12-24T17:43:03.745292Z" } }, "outputs": [], "source": [ "# One way to construct the indicator DataFrame is to start with a\n", "# DataFrame of allzeros\n", "\n", "zero_matrix = np.zeros((len(movies), len(genres)))" ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:04.014237Z", "start_time": "2019-12-24T17:43:03.881359Z" } }, "outputs": [], "source": [ "dummies = pd.DataFrame(zero_matrix, columns=genres)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, iterate through each movie and set entries in each row of dummies to 1. 
To do this, we use dummies.columns to compute the column indices for each genre" ] }, { "cell_type": "code", "execution_count": 76, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:04.141318Z", "start_time": "2019-12-24T17:43:04.016094Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Animation', \"Children's\", 'Comedy']\n", "\n" ] }, { "data": { "text/plain": [ "array([0, 1, 2], dtype=int64)" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gen = movies.genres[0]\n", "print(gen.split('|'))\n", "dummies.columns.get_indexer(gen.split('|'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, we can use .iloc to set values based on these indices" ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:06.917275Z", "start_time": "2019-12-24T17:43:04.144485Z" } }, "outputs": [], "source": [ "for i, gen in enumerate(movies.genres):\n", " indices = dummies.columns.get_indexer(gen.split('|'))\n", " dummies.iloc[i, indices] = 1" ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:06.932929Z", "start_time": "2019-12-24T17:43:06.917275Z" } }, "outputs": [], "source": [ "# join with movies\n", "\n", "movies_windic = movies.join(dummies.add_prefix('Genre_'))" ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:07.061469Z", "start_time": "2019-12-24T17:43:06.935434Z" } }, "outputs": [ { "data": { "text/plain": [ "movieid 1\n", "title Toy Story (1995)\n", "genres Animation|Children's|Comedy\n", "Genre_Animation 1\n", "Genre_Children's 1\n", "Genre_Comedy 1\n", "Genre_Adventure 0\n", "Genre_Fantasy 0\n", "Genre_Romance 0\n", "Genre_Drama 0\n", "Genre_Action 0\n", "Genre_Crime 0\n", "Genre_Thriller 0\n", "Genre_Horror 0\n", "Genre_Sci-Fi 0\n", "Genre_Documentary 0\n", "Genre_War 0\n", "Genre_Musical 0\n", "Genre_Mystery 0\n", "Genre_Film-Noir 0\n", "Genre_Western 0\n", "Name: 0, dtype: object" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movies_windic.iloc[0]" ] },
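{ "cell_type": "markdown", "metadata": {}, "source": [ "The loop above spells out the mechanics. As a side note, pandas can build the same genre indicators in a single line with Series.str.get_dummies, which splits each string on the given separator and returns the 0/1 columns (sorted alphabetically) directly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# one-line alternative to the zero-matrix loop above\n", "movies.genres.str.get_dummies('|')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## String Manipulation\n", "### Basic String methods" ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:07.209653Z", "start_time": "2019-12-24T17:43:07.063739Z" } }, "outputs": [ { "data": { "text/plain": [ "['a', 'b', 'guido']" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val = 'a,b, guido'\n", "\n", "pieces = [x.strip() for x in val.split(',')]\n", "pieces" ] }, { "cell_type": "code", "execution_count": 81, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:07.361358Z", "start_time": "2019-12-24T17:43:07.210654Z" } }, "outputs": [ { "data": { "text/plain": [ "'a::b::guido'" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"::\".join(pieces)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**built-in String methods**\n", "\n", "|Argument | Description |\n", "|---|---|\n", "|count|Return the number of non-overlapping occurrences of substring in the string. |\n", "|endswith|Returns True if string ends with suffix. |\n", "|startswith|Returns True if string starts with prefix. |\n", "|join|Use string as delimiter for concatenating a sequence of other strings. 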
|\n", "|index|Return position of first character in substring if found in the string; raises ValueError if not found. |\n", "|find|Return position of first character of €rst occurrence of substring in the string; like index, but returns –1 if not found.|\n", "|rfind|Return position of first character of last occurrence of substring in the string; returns –1 if not found. |\n", "|replace|Replace occurrences of string with another string.|\n", "|strip, rstrip, lstrip|Trim whitespace, including newlines; equivalent to x.strip() (and rstrip, lstrip, respectively)\n", " for each element|\n", "|split|Break string into list of substrings using passed delimiter. |\n", "|lower|Convert alphabet characters to lowercase. |\n", "|upper|Convert alphabet characters to uppercase. |\n", "|casefold|Convert characters to lowercase, and convert any region-specific variable character combinations to a common comparable form. \n", "|ljust, rjust|Left justify or right justify, respectively; pad opposite side of string with spaces (or some other fill character) to return a string with a minimum width |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Regular Expressions" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2019-12-24T13:32:20.008167Z", "start_time": "2019-12-24T13:32:19.995045Z" } }, "source": [ "The re module functions fall into three categories: pattern matching, substitution,\n", "and splitting. Naturally these are all related; a regex describes a pattern to locate in the\n", "text, which can then be used for many purposes." ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:07.495362Z", "start_time": "2019-12-24T17:43:07.367465Z" } }, "outputs": [], "source": [ "import re" ] }, { "cell_type": "code", "execution_count": 83, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:07.637104Z", "start_time": "2019-12-24T17:43:07.495362Z" } }, "outputs": [ { "data": { "text/plain": [ "'foo bar\\tbaz \\tqux'" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text = \"foo bar\\tbaz \\tqux\"\n", "text" ] }, { "cell_type": "code", "execution_count": 84, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:07.780343Z", "start_time": "2019-12-24T17:43:07.638102Z" } }, "outputs": [ { "data": { "text/plain": [ "['foo', 'bar', 'baz', 'qux']" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.split('\\s+', text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When you call re.split('\\s+', text), the regular expression is first compiled, and then its split method is called on the passed text. 
You can compile the regex yourself with **re.compile**, forming a reusable regex object" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Creating a regex object with re.compile is highly recommended if you intend to\n", "apply the same expression to many strings; doing so will save CPU cycles" ] }, { "cell_type": "code", "execution_count": 85, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:07.916446Z", "start_time": "2019-12-24T17:43:07.785420Z" } }, "outputs": [], "source": [ "regex = re.compile('\\s+')" ] }, { "cell_type": "code", "execution_count": 86, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:08.056820Z", "start_time": "2019-12-24T17:43:07.920407Z" } }, "outputs": [ { "data": { "text/plain": [ "['foo', 'bar', 'baz', 'qux']" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regex.split(text)" ] }, { "cell_type": "code", "execution_count": 87, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:08.205267Z", "start_time": "2019-12-24T17:43:08.058813Z" } }, "outputs": [], "source": [ "text = \"\"\"Dave dave@google.com\n", "Steve steve@gmail.com\n", "Rob rob@gmail.com\n", "Ryan ryan@yahoo.com\n", "\"\"\"\n", "pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}'\n", "\n", "# re.IGNORECASE makes the regex case-insensitive\n", "regex = re.compile(pattern, flags=re.IGNORECASE)" ] }, { "cell_type": "code", "execution_count": 88, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:08.365332Z", "start_time": "2019-12-24T17:43:08.212747Z" } }, "outputs": [ { "data": { "text/plain": [ "['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']" ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regex.findall(text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**match** and **search** are closely related to findall. While findall returns all matches\n", "in a string, search returns only the first match. More rigidly, match only matches at\n", "the beginning of the string" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**search** returns a special match object for the first email address in the text. 
For the\n", "preceding regex, the match object can only tell us the start and end position of the\n", "pattern in the string" ] }, { "cell_type": "code", "execution_count": 89, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:08.529015Z", "start_time": "2019-12-24T17:43:08.366332Z" } }, "outputs": [], "source": [ "m = regex.search(text)" ] }, { "cell_type": "code", "execution_count": 90, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:08.991732Z", "start_time": "2019-12-24T17:43:08.529015Z" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m" ] }, { "cell_type": "code", "execution_count": 91, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:09.132275Z", "start_time": "2019-12-24T17:43:08.994110Z" } }, "outputs": [ { "data": { "text/plain": [ "'dave@google.com'" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text[m.start(): m.end()]" ] }, { "cell_type": "code", "execution_count": 92, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:09.248262Z", "start_time": "2019-12-24T17:43:09.133848Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "None\n", "\n" ] } ], "source": [ "print(regex.match(text))" ] }, { "cell_type": "code", "execution_count": 93, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:09.404466Z", "start_time": "2019-12-24T17:43:09.253311Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dave REDACTED\n", "Steve REDACTED\n", "Rob REDACTED\n", "Ryan REDACTED\n", "\n", "\n" ] } ], "source": [ "print(regex.sub('REDACTED', text))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " segment each address into its three components: username, domain name, and domain suffix" ] }, { "cell_type": "code", "execution_count": 94, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:09.547000Z", "start_time": "2019-12-24T17:43:09.407943Z" } }, "outputs": [], "source": [ "pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'" ] }, { "cell_type": "code", "execution_count": 95, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:09.688577Z", "start_time": "2019-12-24T17:43:09.548001Z" } }, "outputs": [], "source": [ "regex = re.compile(pattern, flags=re.IGNORECASE)" ] }, { "cell_type": "code", "execution_count": 96, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:09.805024Z", "start_time": "2019-12-24T17:43:09.692210Z" } }, "outputs": [ { "data": { "text/plain": [ "('wesm', 'bright', 'net')" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m = regex.match('wesm@bright.net')\n", "m.groups()" ] }, { "cell_type": "code", "execution_count": 97, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:09.942973Z", "start_time": "2019-12-24T17:43:09.811263Z" } }, "outputs": [ { "data": { "text/plain": [ "[('dave', 'google', 'com'),\n", " ('steve', 'gmail', 'com'),\n", " ('rob', 'gmail', 'com'),\n", " ('ryan', 'yahoo', 'com')]" ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regex.findall(text)" ] }, { "cell_type": "code", "execution_count": 98, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:10.101665Z", "start_time": "2019-12-24T17:43:09.947203Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dave Username: dave, Domain: google, Suffix: com\n", "Steve Username: steve, Domain: 
gmail, Suffix: com\n", "Rob Username: rob, Domain: gmail, Suffix: com\n", "Ryan Username: ryan, Domain: yahoo, Suffix: com\n", "\n", "\n" ] } ], "source": [ "print(regex.sub(r'Username: \\1, Domain: \\2, Suffix: \\3', text))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Regular expression methods**\n", "\n", "|Argument| Description|\n", "|-|-|\n", "|findall| Return all non-overlapping matching patterns in a string as a list|\n", "|finditer| Like findall, but returns an iterator|\n", "|match |Match pattern at start of string and optionally segment pattern components into groups; if the pattern matches, returns a match object, and otherwise None|\n", "|search| Scan string for match to pattern; returning a match object if so; unlike match, the match can be anywhere in the string as opposed to only at the beginning|\n", "|split| Break string into pieces at each occurrence of pattern|\n", "|sub, subn|Replace all (sub) or first n occurrences (subn) of pattern in string with replacement expression; use symbols \\1, \\2, ... to refer to match group elements in the replacement string|" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Vectorized String Functions in pandas" ] }, { "cell_type": "code", "execution_count": 99, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:10.245311Z", "start_time": "2019-12-24T17:43:10.105666Z" } }, "outputs": [ { "data": { "text/plain": [ "Dave dave@google.com\n", "Steve steve@gmail.com\n", "Rob rob@gmail.com\n", "Wes NaN\n", "dtype: object" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',\n", " 'Rob': 'rob@gmail.com', 'Wes': np.nan}\n", "\n", "data = pd.Series(data)\n", "data" ] }, { "cell_type": "code", "execution_count": 100, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:10.380107Z", "start_time": "2019-12-24T17:43:10.246309Z" } }, "outputs": [ { "data": { "text/plain": [ "Dave False\n", "Steve False\n", "Rob False\n", "Wes True\n", "dtype: bool" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.isnull()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "String and regular expression methods can be applied (passing a lambda or other function) to each value using data.map, but it will fail on the NA (null) values. To cope with this, Series has array-oriented methods for string operations that skip NA values. 
These are accessed through Series’s str attribute; for example," ] }, { "cell_type": "code", "execution_count": 101, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:10.532755Z", "start_time": "2019-12-24T17:43:10.381108Z" } }, "outputs": [ { "data": { "text/plain": [ "Dave False\n", "Steve True\n", "Rob True\n", "Wes NaN\n", "dtype: object" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.str.contains('gmail')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Regular expressions can be used, too, along with any re options like IGNORECASE" ] }, { "cell_type": "code", "execution_count": 102, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:10.680071Z", "start_time": "2019-12-24T17:43:10.536520Z" } }, "outputs": [ { "data": { "text/plain": [ "'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\\\.([A-Z]{2,4})'" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pattern" ] }, { "cell_type": "code", "execution_count": 103, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:10.819319Z", "start_time": "2019-12-24T17:43:10.684214Z" } }, "outputs": [ { "data": { "text/plain": [ "Dave [(dave, google, com)]\n", "Steve [(steve, gmail, com)]\n", "Rob [(rob, gmail, com)]\n", "Wes NaN\n", "dtype: object" ] }, "execution_count": 103, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.str.findall(pattern, flags=re.IGNORECASE)" ] }, { "cell_type": "code", "execution_count": 104, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:10.964110Z", "start_time": "2019-12-24T17:43:10.824320Z" } }, "outputs": [], "source": [ "matches = data.str.match(pattern, flags=re.IGNORECASE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are a couple of ways to do vectorized element retrieval. Either use **str.get** or index into the str attribute, as sketched in the cells below" ] }, { "cell_type": "code", "execution_count": 105, "metadata": { "ExecuteTime": { "end_time": "2019-12-24T17:43:11.118070Z", "start_time": "2019-12-24T17:43:10.965112Z" } }, "outputs": [ { "data": { "text/plain": [ "Dave True\n", "Steve True\n", "Rob True\n", "Wes NaN\n", "dtype: object" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "matches" ] },
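{ "cell_type": "markdown", "metadata": {}, "source": [ "A minimal sketch of both retrieval styles on the data Series from above (NA values are passed through as NaN):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# str.get retrieves an element by position; slicing works via the str attribute\n", "print(data.str.get(0))  # first character of each string, NaN preserved\n", "print(data.str[:5])     # first five characters of each string" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": {}, "toc_section_display": true, "toc_window_display": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }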