{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "


\n", " Pandas


\n", "\n", "Pandas is one of the primary data analysis libraries in Python. It offers a number of operations to aid in data exploration, cleaning, and transformation, making it one of the most popular data science tools. For example, Pandas provides methods for handling missing data, pivoting and sorting data, generating descriptive summaries, quickly plotting data, and Boolean indexing for masking and other filtering operations.\n", "\n", "Some of the key features of Pandas are:\n", "* Ingestion and manipulation of heterogeneous data types\n", "* Generating descriptive statistics on data to support exploration and communication\n", "* Data cleaning using built-in Pandas functions\n", "* Frequent data operations for subsetting, filtering, insertion, deletion, and aggregation of data\n", "* Merging and joining multiple datasets using DataFrames\n", "* Working with timestamps and time-series data\n", "\n", "Pandas also builds upon NumPy and other Python packages to provide easy-to-use data structures and data manipulation functions with integrated indexing.\n", "\n", "**Additional Recommended Resources:** \n", "* Pandas Documentation
\n", "* *Python for Data Analysis* by Wes McKinney\n", "* *Python Data Science Handbook* by Jake VanderPlas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
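To make the feature list above concrete, here is a minimal sketch of the descriptive-statistics, sorting, and Boolean-indexing capabilities just mentioned. The `scores` DataFrame and its column names are hypothetical and not part of this notebook's data:

```python
import pandas as pd

# hypothetical example data, used only to illustrate describe(), sort_values(), and masking
scores = pd.DataFrame({'name': ['ann', 'bob', 'cy'],
                       'score': [88, 72, 95]})

print(scores.describe())             # summary statistics for the numeric columns
print(scores.sort_values('score'))   # rows ordered by the 'score' column
print(scores[scores['score'] > 80])  # Boolean indexing (masking) on rows
```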

\n", "Introduction to Pandas Data Structures

\n", "
\n", "Pandas provides two primary data structures: Series and DataFrames. They are powerful and fundamental to the Pandas package. \n", "\n", "

\n", "Series in Pandas

\n", "\n", "Pandas Series are one-dimensional labeled arrays. Since they act like ndarrays, they are valid arguments to most Numpy methods. Series support many data types, including integers, strings, floating point numbers,\n", "Python objects, etc. Their axis labels are collectively referred to as the **index**, and we can get and set values using these index labels. You can think of a Series as a flexible dictionary-like object.\n", "\n", "Let's look at some code examples with the Pandas Series." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# import the Pandas package\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# create a Series called sr\n", "# syntax is: pd.Series([data elements], [index elements])\n", "# note that the elements in the data and index sets do not have to be the same\n", "sr = pd.Series([10, 'foo', 30, 90.4], ['peach', 'plum', 'dog', 'band'])" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "peach 10\n", "plum foo\n", "dog 30\n", "band 90.4\n", "dtype: object" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# view the Series\n", "sr" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['peach', 'plum', 'dog', 'band'], dtype='object')" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# view the indices\n", "sr.index" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'foo'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# access the data at an index\n", "sr['plum']" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'foo'" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# OR\n", "sr.loc['plum']" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "peach 10\n", "band 90.4\n", "dtype: object" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# access the data at multiple indices\n", "sr[['peach', 'band']]" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "peach 10\n", "band 90.4\n", "dtype: object" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# OR\n", "sr.loc[['peach', 'band']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see that the data is represented so that you can access it like a list with numeric indices (list[x]) or more like a dictionary (dic['key']). 
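As a complement to the cells above, here is a minimal sketch showing that a Series can also be built directly from a Python dict, and contrasting label-based `.loc` with position-based `.iloc`. The `prices` Series is hypothetical and not part of the notebook's data:

```python
import pandas as pd

# build a Series from a dict: the keys become the index labels
prices = pd.Series({'peach': 1.5, 'plum': 0.9, 'banana': 0.4})

print(prices.loc['plum'])   # 0.9 -- access by label
print(prices.iloc[1])       # 0.9 -- access by integer position
```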
" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "30" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# access a data element by position in the list\n", "sr[2]" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "30" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# OR\n", "sr.iloc[2]" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "peach 10\n", "plum foo\n", "dog 30\n", "dtype: object" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# access multiple data elements by positions in the list\n", "sr[[0, 1, 2]]" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "plum foo\n", "dog 30\n", "band 90.4\n", "dtype: object" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# OR\n", "sr.iloc[[1, 2, 3]]\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# is the index 'peach' in the Series?\n", "'peach' in sr\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also use basic Python operations like multiplication on a Series. In the code below, we multiply the whole Series by 2. Note that this operation is performed on all data types, even strings, where the string is doubled." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "peach 20\n", "plum foofoo\n", "dog 60\n", "band 180.8\n", "dtype: object" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sr *2 #Notice foo turns to foofoo when multiplied by 2" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "peach 10\n", "plum foo\n", "dog 30\n", "band 90.4\n", "dtype: object" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sr # Because we did no set sr = sr*2, sr doesn't change values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can square the numerical index values in a Series. If we tried to square an index that's not a numeric data type, however, we would get an error." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "peach 100\n", "band 8172.16\n", "dtype: object" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sr[['peach', 'band']] ** 2 # you cannot square a string, so if you include 'foo' you will get an error" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

\n", "DataFrames in Pandas

\n", "\n", "Pandas DataFrames are flexible two-dimensional labeled data structures. They also support heterogeneous data and have labeled axes for rows and columns. We can think of a DataFrame as a container for Series objects, where each column is a Series.\n", "\n", "Below we give some examples of things you can do with a Pandas DataFrame. You can find the full documentation for DataFrames here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
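Because each column of a DataFrame is a Series, selecting a column gives you back a Series object. A minimal sketch, using a hypothetical `df_example` frame that is not part of the notebook's data:

```python
import pandas as pd

# hypothetical two-column frame, used only to illustrate the column-as-Series idea
df_example = pd.DataFrame({'a': [1.0, 2.0], 'b': ['x', 'y']})

print(type(df_example['a']))  # <class 'pandas.core.series.Series'>
print(df_example.dtypes)      # per-column data types
print(df_example.shape)       # (number of rows, number of columns)
```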

\n", "Creating a DataFrame

\n", "\n", "There are many ways to create Pandas DataFrames. We often just read and ingest data directly into a DataFrame, but in this example, we create the DataFrame manually by starting with a dictionary of Series. Note that we are adding another dimension to our data structure, so we need to label each Series. Here, we label the first Series 'a' and the second 'b'." ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# create a dictionary of Series called df_data\n", "df_data = {'a' : pd.Series([1., 2., 3., 4.], index=['dog', 'cat', 'fruit', 'bird']),\n", " 'b' : pd.Series([10., 20., 30.], index=['cake', 'fruit', 'ice cream'])}" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b\n", "bird 4.0 NaN\n", "cake NaN 10.0\n", "cat 2.0 NaN\n", "dog 1.0 NaN\n", "fruit 3.0 20.0\n", "ice cream NaN 30.0" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create and output the DataFrame\n", "df = pd.DataFrame(df_data)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Series 'a' and 'b' don't share the all of same indices. When we print the DataFrame, we see NaN values, which indicate that the Series does not contain a certain index. " ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['bird', 'cake', 'cat', 'dog', 'fruit', 'ice cream'], dtype='object')" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.index" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['a', 'b'], dtype='object')" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also create a smaller DataFrame using a subset of the same data, this time specifying which indices we want to be included." ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b\n", "dog 1.0 NaN\n", "fruit 3.0 20.0\n", "bird 4.0 NaN" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(df_data, index=['dog', 'fruit', 'bird'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By specifying the column parameter, you can select which columns you'd like the new DataFrame to include. In the code below, we ask the DataFrame to include column 'e', which doesn't exist in the original dictionary. Because of this, a new column 'e' will be created with all its entries as NaN." ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a e\n", "dog 1.0 NaN\n", "fruit 3.0 NaN\n", "bird 4.0 NaN" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(df_data, index=['dog', 'fruit', 'bird'], columns=['a', 'e'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

\n", "Creating a DataFrame from a list of Python dictionaries

\n", "\n", "Another way to create a DataFrame is to use a list of Python dictionaries as your data. In the code below, we create a list of Python dictionaries called 'df_data2' and use it to build DataFrames. We then use many of the same techniques as above to explore the resulting DataFrame.\n", "\n", "Please see this link for a reminder on Python dictionaries." ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# create a list of Python dictionaries\n", "\n", "df_data2 = [{'apple': 5, 'cherry': 10}, {'peter': 1, 'emily': 2, 'brian': 6}]" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " apple brian cherry emily peter\n", "0 5.0 NaN 10.0 NaN NaN\n", "1 NaN 6.0 NaN 2.0 1.0" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# labels get created as column headers\n", "\n", "pd.DataFrame(df_data2)" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " apple brian cherry emily peter\n", "blue 5.0 NaN 10.0 NaN NaN\n", "yellow NaN 6.0 NaN 2.0 1.0" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# rename the rows from 0 and 1 to 'blue' and 'yellow' by specifying the index parameter\n", "\n", "pd.DataFrame(df_data2, index=['blue', 'yellow'])" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " cherry emily brian\n", "0 10.0 NaN NaN\n", "1 NaN 2.0 6.0" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create a smaller DataFrame by specifying the columns\n", "\n", "pd.DataFrame(df_data2, columns=['cherry', 'emily','brian'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

\n", "Exploring some basic DataFrame operations

\n", "\n", "\n", "Now let's look into how we can get data out of a DataFrame with some basic DataFrame operations. In the following code, we perform some operations on our DataFrame df. " ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b\n", "bird 4.0 NaN\n", "cake NaN 10.0\n", "cat 2.0 NaN\n", "dog 1.0 NaN\n", "fruit 3.0 20.0\n", "ice cream NaN 30.0" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# our DataFrame of interest\n", "df" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "bird 4.0\n", "cake NaN\n", "cat 2.0\n", "dog 1.0\n", "fruit 3.0\n", "ice cream NaN\n", "Name: a, dtype: float64" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# display only column 'a' of df (subsetting)\n", "df['a']" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b c\n", "bird 4.0 NaN NaN\n", "cake NaN 10.0 NaN\n", "cat 2.0 NaN NaN\n", "dog 1.0 NaN NaN\n", "fruit 3.0 20.0 23.0\n", "ice cream NaN 30.0 NaN" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create a new column 'c' by adding 'a' and 'b' together\n", "df['c'] = df['a'] + df['b']\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that since NaN values cannot be added to floating point values, the resulting values in 'c' are NaN. For index 'fruit', however, both 'a' and 'b' are floating point values and can be added together." ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b c d\n", "bird 4.0 NaN NaN True\n", "cake NaN 10.0 NaN False\n", "cat 2.0 NaN NaN False\n", "dog 1.0 NaN NaN False\n", "fruit 3.0 20.0 23.0 True\n", "ice cream NaN 30.0 NaN False" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create a new column 'd' of boolean values indicating whether or not an index's value in 'a' is greater than 2.0\n", "# NaN values evaluate to False\n", "\n", "df['d'] = df['a'] > 2.0\n", "df" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# set cee equal to the 'c' column in the DataFrame\n", "\n", "cee = df.pop('c')" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "bird NaN\n", "cake NaN\n", "cat NaN\n", "dog NaN\n", "fruit 23.0\n", "ice cream NaN\n", "Name: c, dtype: float64" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cee" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a b d\n", "bird 4.0 NaN True\n", "cake NaN 10.0 False\n", "cat 2.0 NaN False\n", "dog 1.0 NaN False\n", "fruit 3.0 20.0 True\n", "ice cream NaN 30.0 False" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# the pop method has removed 'c' from df\n", "df" ] }, { "cell_type": "code", "execution_count": 76, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# delete column 'b' from the DataFrame\n", "del df['b']" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a d\n", "bird 4.0 True\n", "cake NaN False\n", "cat 2.0 False\n", "dog 1.0 False\n", "fruit 3.0 True\n", "ice cream NaN False" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a d copy_of_a\n", "bird 4.0 True 4.0\n", "cake NaN False NaN\n", "cat 2.0 False 2.0\n", "dog 1.0 False 1.0\n", "fruit 3.0 True 3.0\n", "ice cream NaN False NaN" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# insert a new column that is a copy of column 'a'\n", "\n", "df.insert(2, 'copy_of_a', df['a'])\n", "df" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " a d copy_of_a a_upper_half\n", "bird 4.0 True 4.0 4.0\n", "cake NaN False NaN NaN\n", "cat 2.0 False 2.0 2.0\n", "dog 1.0 False 1.0 NaN\n", "fruit 3.0 True 3.0 NaN\n", "ice cream NaN False NaN NaN" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# insert a new column that is a copy of 'a' up to excluding the value at the third position of the Series. The rest of the\n", "# column is NaNs\n", "\n", "df['a_upper_half'] = df['a'][:3]\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that while both methods above (df.insert and df['col']) allowed us to insert new columns into the DataFrame, only df.insert lets us specify which position we want the column to be in." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

\n", "Data Manipulation with Pandas

\n", "\n", "Pandas covers five main data manipulation tools: \n", "\n", "* **Filtering**: selecting the subset of the data's rows that satisfies a certain property\n", "* **Subsetting**: selecting a subset of the DataFrame's columns, to capture only some of the attributes of each data point\n", "* **Combining**: combining two datasets by row or column\n", "* **Joining**: like combining, but fusing together rows that share the same value of a key variable (a sketch of a join appears at the end of this notebook)\n", "* **Feature generation** (also called mutation): applying a mathematical operation to each element of a row and/or column\n", "\n", "Let's see how we can apply these tools to the iris dataset." ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
    " ], "text/plain": [ " Sepal Length Sepal Width Petal Length Petal Width\n", "0 5.1 3.5 1.4 0.2\n", "1 4.9 3.0 1.4 0.2\n", "2 4.7 3.2 1.3 0.2\n", "3 4.6 3.1 1.5 0.2\n", "4 5.0 3.6 1.4 0.2" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load the iris dataset from sklearn and create a corresponding DataFrame\n", "from sklearn import datasets\n", "iris = datasets.load_iris() \n", "iris_data = pd.DataFrame(iris.data,columns = ['Sepal Length','Sepal Width','Petal Length','Petal Width'])\n", "iris_data.head()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's **filter** the data so we have only samples with Petal Length > 1.0. " ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
    " ], "text/plain": [ " Sepal Length Sepal Width Petal Length Petal Width\n", "50 7.0 3.2 4.7 1.4\n", "51 6.4 3.2 4.5 1.5\n", "52 6.9 3.1 4.9 1.5\n", "53 5.5 2.3 4.0 1.3\n", "54 6.5 2.8 4.6 1.5" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "filtered = iris_data[iris_data['Petal Width']>1.0]\n", "filtered.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we use **subsetting** to find the variance in the Petal Width. " ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5824143176733784" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "petal_width = iris_data['Petal Width']\n", "petal_width.var(axis=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In pandas, it's especially easy to **combine** datasets by column because you can write dataframe['columnname']. However, say we want to make a new dataframe that has a categorical column based on which type of iris flower it is. " ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
    " ], "text/plain": [ " Sepal Length Sepal Width Petal Length Petal Width Flower Type\n", "0 5.1 3.5 1.4 0.2 0\n", "1 4.9 3.0 1.4 0.2 0\n", "2 4.7 3.2 1.3 0.2 0\n", "3 4.6 3.1 1.5 0.2 0\n", "4 5.0 3.6 1.4 0.2 0" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "labels = pd.DataFrame(iris.target,columns=['Flower Type'])\n", "labels\n", "result = pd.concat([iris_data,labels],axis=1) # requires iterable argument so the DataFrames are in a list\n", "result.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lastly, mutations let you modify the data in an entire row or column. We will use mutation to do a simple standard score normalization, which is commonly used to scale the data (though as you'll see, it's not as statistically applicable in this case). " ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1.028611\n", "1 -0.124540\n", "2 0.336720\n", "3 0.106090\n", "4 1.259242\n", "Name: Sepal Width, dtype: float64" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean = iris_data['Sepal Width'].mean()\n", "std = iris_data['Sepal Width'].std()\n", "iris_data['Sepal Width']=(iris_data['Sepal Width']-mean)/std\n", "iris_data['Sepal Width'].head()" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 1 }