{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# Introduction to Pandas\n", "\n", "Pandas is a library providing high-performance, easy-to-use data structures and data analysis tools. The core of pandas is its *dataframe*, which is essentially a table of data. Pandas provides easy and powerful ways to import data from a variety of sources and export it to just as many. It is also explicitly designed to handle *missing data* elegantly, which is a very common problem in data from the real world.\n", "\n", "The official [pandas documentation](http://pandas.pydata.org/pandas-docs/stable/) is very comprehensive and you will find the answers to a lot of questions in there; however, it can sometimes be hard to find the right page. Don't be afraid to use Google to find help." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas has a standard convention for importing it which you will see used in a lot of documentation, so we will follow that in this course:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from pandas import Series, DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Series\n", "\n", "The simplest of pandas' data structures is the `Series`. It is a one-dimensional list-like structure.\n", "Let's create one from a `list`:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "0 14\n", "1 7\n", "2 3\n", "3 -7\n", "4 8\n", "dtype: int64" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Series([14, 7, 3, -7, 8])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are three main components to this output.\n", "The first column (`0`, `1`, etc.) 
is the index; by default pandas numbers each row starting from zero.\n", "The second column is our data, stored in the same order we entered it in our list.\n", "Finally, at the bottom, there is the `dtype`, which stands for 'data type' and tells us that all our data is being stored as a 64-bit integer.\n", "Usually you can ignore the `dtype` until you start doing more advanced things.\n", "\n", "In the first example above we allowed pandas to automatically create an index for our `Series` (this is the `0`, `1`, `2`, etc. in the left column) but often you will want to specify one yourself:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "a 14\n", "b 7\n", "c 3\n", "d -7\n", "e 8\n", "dtype: int64\n" ] } ], "source": [ "s = Series([14, 7, 3, -7, 8], index=['a', 'b', 'c', 'd', 'e'])\n", "print(s)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use this index to retrieve individual rows" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "14" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s['a']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "to replace values in the series" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "s['c'] = -1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "or to get a set of rows" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "a 14\n", "c -1\n", "d -7\n", "dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s[['a', 'c', 'd']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercises\n", "\n", "- Create a Pandas `Series` with 10 or so elements where the indices are years and the values are numbers.\n", "- 
Experiment with retrieving elements from the `Series`.\n", "- Try making another `Series` with duplicate values in the index. What happens when you access those elements?\n", "- How does a Pandas `Series` differ from a Python `list` or `dict`?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Series operations\n", "\n", "A `Series` is `list`-like in the sense that it is an ordered set of values. It is also `dict`-like since its entries can be accessed via key lookup. One very important way in which it differs is that it allows operations to be done over the whole `Series` in one go, a technique often referred to as 'broadcasting'.\n", "\n", "A simple example is wanting to double the value of every entry in a set of data. In standard Python, you might have a list like" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "my_list = [3, 6, 8, 4, 10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you wanted to double every entry you might try simply multiplying the list by `2`:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[3, 6, 8, 4, 10, 3, 6, 8, 4, 10]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_list * 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "but as you can see, that simply duplicated the elements. 
Instead you would have to use a `for` loop or a list comprehension:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[6, 12, 16, 8, 20]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[i * 2 for i in my_list]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With a pandas `Series`, however, you can perform bulk mathematical operations on the whole series in one go:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 3\n", "1 6\n", "2 8\n", "3 4\n", "4 10\n", "dtype: int64\n" ] } ], "source": [ "my_series = Series(my_list)\n", "print(my_series)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 6\n", "1 12\n", "2 16\n", "3 8\n", "4 20\n", "dtype: int64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_series * 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As well as bulk modifications, you can perform bulk selections by putting more complex statements in the square brackets:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "c -1\n", "d -7\n", "dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s[s < 0] # All negative entries" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a 14\n", "b 7\n", "e 8\n", "dtype: int64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s[(s * 2) > 4] # All entries which, when doubled, are greater than 4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These operations work because the `Series` index selection can be passed a series of `True` and `False` values which it then uses to 
filter the result:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a True\n", "b True\n", "c False\n", "d False\n", "e True\n", "dtype: bool" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(s * 2) > 4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here you can see that the rows `a`, `b` and `e` are `True` while the others are `False`. Passing this to `s[...]` will only show rows that are `True`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multi-Series operations\n", "\n", "It is also possible to perform operations between two `Series` objects:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 16\n", "1 -1\n", "2 29\n", "3 3\n", "4 2\n", "dtype: int64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s2 = Series([23, 5, 34, 7, 5])\n", "s3 = Series([7, 6, 5, 4, 3])\n", "s2 - s3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercises\n", "\n", "- Create two `Series` objects of equal length with no specified index and containing any values you like. Perform some mathematical operations on them and experiment to make sure it works as you expect.\n", "- What happens when you perform an operation on two series which have different lengths? How does this change when you give the series some indices?\n", "- Using the `Series` from the first exercise with the years for the index, select all entries with even-numbered years. Also, select all those with odd-numbered years." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## DataFrame\n", "\n", "While you can think of the `Series` as a one-dimensional list of data, pandas' `DataFrame` is a two (or possibly more) dimensional table of data. You can think of each column in the table as being a `Series`."
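] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To see the column-as-`Series` idea in action, here is a minimal sketch (the column names `a` and `b` are made up purely for illustration) which builds a small `DataFrame` from two `Series` objects:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Each Series becomes one column of the resulting table\n", "DataFrame({'a': Series([1, 2, 3]), 'b': Series([4.0, 5.0, 6.0])})"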
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "data = {'city': ['Paris', 'Paris', 'Paris', 'Paris',\n", " 'London', 'London', 'London', 'London',\n", " 'Rome', 'Rome', 'Rome', 'Rome'],\n", " 'year': [2001, 2008, 2009, 2010,\n", " 2001, 2006, 2011, 2015,\n", " 2001, 2006, 2009, 2012],\n", " 'pop': [2.148, 2.211, 2.234, 2.244,\n", " 7.322, 7.657, 8.174, 8.615,\n", " 2.547, 2.627, 2.734, 2.627]}\n", "df = DataFrame(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This has created a `DataFrame` from the dictionary `data`. The keys will become the column headers and the values will be the values in each column. As with the `Series`, an index will be created automatically." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cityyearpop
0Paris20012.148
1Paris20082.211
2Paris20092.234
3Paris20102.244
4London20017.322
5London20067.657
6London20118.174
7London20158.615
8Rome20012.547
9Rome20062.627
10Rome20092.734
11Rome20122.627
\n", "
" ], "text/plain": [ " city year pop\n", "0 Paris 2001 2.148\n", "1 Paris 2008 2.211\n", "2 Paris 2009 2.234\n", "3 Paris 2010 2.244\n", "4 London 2001 7.322\n", "5 London 2006 7.657\n", "6 London 2011 8.174\n", "7 London 2015 8.615\n", "8 Rome 2001 2.547\n", "9 Rome 2006 2.627\n", "10 Rome 2009 2.734\n", "11 Rome 2012 2.627" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or, if you just want a peek at the data, you can just grab the first few rows with:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cityyearpop
0Paris20012.148
1Paris20082.211
2Paris20092.234
\n", "
" ], "text/plain": [ " city year pop\n", "0 Paris 2001 2.148\n", "1 Paris 2008 2.211\n", "2 Paris 2009 2.234" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we passed in a dictionary to the `DataFrame` constructor, the column order follows the order of the dictionary keys (this is guaranteed on Python 3.7 and later; on older versions the order would not necessarily match the order in which you defined them). To enforce a certain order, you can pass a `columns` argument to the constructor giving a list of the columns in the order you want them:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearcitypop
02001Paris2.148
12008Paris2.211
22009Paris2.234
32010Paris2.244
42001London7.322
52006London7.657
62011London8.174
72015London8.615
82001Rome2.547
92006Rome2.627
102009Rome2.734
112012Rome2.627
\n", "
" ], "text/plain": [ " year city pop\n", "0 2001 Paris 2.148\n", "1 2008 Paris 2.211\n", "2 2009 Paris 2.234\n", "3 2010 Paris 2.244\n", "4 2001 London 7.322\n", "5 2006 London 7.657\n", "6 2011 London 8.174\n", "7 2015 London 8.615\n", "8 2001 Rome 2.547\n", "9 2006 Rome 2.627\n", "10 2009 Rome 2.734\n", "11 2012 Rome 2.627" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DataFrame(data, columns=['year', 'city', 'pop'])\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we accessed elements from a `Series` object, it would select an element by row. However, by default `DataFrame`s index primarily by column. You can access any column directly:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Paris\n", "1 Paris\n", "2 Paris\n", "3 Paris\n", "4 London\n", "5 London\n", "6 London\n", "7 London\n", "8 Rome\n", "9 Rome\n", "10 Rome\n", "11 Rome\n", "Name: city, dtype: object" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['city']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Accessing a column like this returns a `Series` which will act in the same way as those we were using earlier.\n", "\n", "Note that there is one additional part to this output, `Name: city`. Pandas has remembered that this `Series` was created from the `'city'` column in the `DataFrame`." 
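] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an aside, columns whose names are valid Python identifiers can also be accessed as attributes, so `df.city` gives the same `Series` as `df['city']`. The square-bracket form is the more general one, since it also works for column names which contain spaces or clash with `DataFrame` method names:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Attribute access; returns the same Series as df['city']\n", "df.city"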
] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.series.Series" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(df['city'])" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 True\n", "1 True\n", "2 True\n", "3 True\n", "4 False\n", "5 False\n", "6 False\n", "7 False\n", "8 False\n", "9 False\n", "10 False\n", "11 False\n", "Name: city, dtype: bool" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['city'] == 'Paris'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This has created a new `Series` which has `True` set where the city is Paris and `False` elsewhere.\n", "\n", "We can use filtered `Series` like this to filter the `DataFrame` as a whole. `df['city'] == 'Paris'` has returned a `Series` containing booleans. Passing it back into `df` as an indexing operation will use it to filter based on the `'city'` column." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cityyearpop
0Paris20012.148
1Paris20082.211
2Paris20092.234
3Paris20102.244
\n", "
" ], "text/plain": [ " city year pop\n", "0 Paris 2001 2.148\n", "1 Paris 2008 2.211\n", "2 Paris 2009 2.234\n", "3 Paris 2010 2.244" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df['city'] == 'Paris']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can then carry on and grab another column after that filter:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 2001\n", "1 2008\n", "2 2009\n", "3 2010\n", "Name: year, dtype: int64" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df['city'] == 'Paris'].year" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to select a **row** from a `DataFrame` then you can use the `.loc` attribute which allows you to pass index values like:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "city Paris\n", "year 2009\n", "pop 2.234\n", "Name: 2, dtype: object" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[2]" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "'Paris'" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[2]['city']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Adding new columns\n", "\n", "New columns can be added to a `DataFrame` simply by assigning them by index (as you would for a Python `dict`) and can be deleted with the `del` keyword in the same way:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cityyearpopcontinental
0Paris20012.148True
1Paris20082.211True
2Paris20092.234True
3Paris20102.244True
4London20017.322False
5London20067.657False
6London20118.174False
7London20158.615False
8Rome20012.547True
9Rome20062.627True
10Rome20092.734True
11Rome20122.627True
\n", "
" ], "text/plain": [ " city year pop continental\n", "0 Paris 2001 2.148 True\n", "1 Paris 2008 2.211 True\n", "2 Paris 2009 2.234 True\n", "3 Paris 2010 2.244 True\n", "4 London 2001 7.322 False\n", "5 London 2006 7.657 False\n", "6 London 2011 8.174 False\n", "7 London 2015 8.615 False\n", "8 Rome 2001 2.547 True\n", "9 Rome 2006 2.627 True\n", "10 Rome 2009 2.734 True\n", "11 Rome 2012 2.627 True" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['continental'] = df['city'] != 'London'\n", "df" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "del df['continental']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercises\n", "\n", "- Create the `DataFrame` containing the census data for the three cities.\n", "- Select the data for the year 2001. Which city had the smallest population that year?\n", "- Find all the cities which had a population smaller than 2.6 million." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading from file\n", "\n", "One of the most common situations is that you have some data file containing the data you want to read. Perhaps this is data you've produced yourself or maybe it's from a colleague. In an ideal world the file will be perfectly formatted and will be trivial to import into pandas but since this is so often not the case, pandas provides a number of features to make your life easier.\n", "\n", "Full information on reading and writing is available in the pandas manual on [IO tools](http://pandas.pydata.org/pandas-docs/stable/io.html) but first it's worth noting the common formats that pandas can work with:\n", "- Comma separated tables (or tab-separated or space-separated etc.)\n", "- Excel spreadsheets\n", "- HDF5 files\n", "- SQL databases\n", "\n", "For this course we will focus on plain-text CSV files as they are perhaps the most common format. 
Imagine we have a CSV file like the following (you can download this file from [city_pop.csv](https://raw.githubusercontent.com/milliams/data_analysis_python/master/city_pop.csv)):" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is an example CSV file\r\n", "The text at the top here is not part of the data but instead is here\r\n", "to describe the file. You'll see this quite often in real-world data.\r\n", "A -1 signifies a missing value.\r\n", "\r\n", "year;London;Paris;Rome\r\n", "2001;7.322;2.148;2.547\r\n", "2006;7.652;;2.627\r\n", "2008;-1;2.211;\r\n", "2009;-1;2.234;2.734\r\n", "2011;8.174;;\r\n", "2012;-1;2.244;2.627\r\n", "2015;8.615;;\r\n" ] } ], "source": [ "!cat city_pop.csv # Uses the IPython shell escape '!' to print the file" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the pandas function `read_csv()` to read the file and convert it to a `DataFrame`. Full documentation for this function can be found in [the manual](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) or, as with any Python object, directly in the notebook by putting a `?` after the name:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "scrolled": true }, "outputs": [], "source": [ "pd.read_csv?" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
This is an example CSV file
0The text at the top here is not part of the da...
1to describe the file. You'll see this quite of...
2A -1 signifies a missing value.
3year;London;Paris;Rome
42001;7.322;2.148;2.547
52006;7.652;;2.627
62008;-1;2.211;
72009;-1;2.234;2.734
82011;8.174;;
92012;-1;2.244;2.627
102015;8.615;;
\n", "
" ], "text/plain": [ " This is an example CSV file\n", "0 The text at the top here is not part of the da...\n", "1 to describe the file. You'll see this quite of...\n", "2 A -1 signifies a missing value.\n", "3 year;London;Paris;Rome\n", "4 2001;7.322;2.148;2.547\n", "5 2006;7.652;;2.627\n", "6 2008;-1;2.211;\n", "7 2009;-1;2.234;2.734\n", "8 2011;8.174;;\n", "9 2012;-1;2.244;2.627\n", "10 2015;8.615;;" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv('city_pop.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that by default it's done a fairly bad job of parsing the file (this is mostly because I've constructed the `city_pop.csv` file to be as obtuse as possible). It's making a lot of assumptions about the structure of the file but in general it's taking quite a naïve approach.\n", "\n", "The first thing we notice is that it's treating the text at the top of the file as though it's data. Checking [the documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) we see that the simplest way to solve this is to use the `skiprows` argument, to which we give an integer giving the number of rows to skip:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
year;London;Paris;Rome
02001;7.322;2.148;2.547
12006;7.652;;2.627
22008;-1;2.211;
32009;-1;2.234;2.734
42011;8.174;;
52012;-1;2.244;2.627
62015;8.615;;
\n", "
" ], "text/plain": [ " year;London;Paris;Rome\n", "0 2001;7.322;2.148;2.547\n", "1 2006;7.652;;2.627\n", "2 2008;-1;2.211;\n", "3 2009;-1;2.234;2.734\n", "4 2011;8.174;;\n", "5 2012;-1;2.244;2.627\n", "6 2015;8.615;;" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv(\n", " 'city_pop.csv',\n", " skiprows=5,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next most obvious problem is that it is not separating the columns at all. This is controlled by the `sep` argument which is set to `','` by default (hence *comma* separated values). We can simply set it to the appropriate semi-colon:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearLondonParisRome
020017.3222.1482.547
120067.652NaN2.627
22008-1.0002.211NaN
32009-1.0002.2342.734
420118.174NaNNaN
52012-1.0002.2442.627
620158.615NaNNaN
\n", "
" ], "text/plain": [ " year London Paris Rome\n", "0 2001 7.322 2.148 2.547\n", "1 2006 7.652 NaN 2.627\n", "2 2008 -1.000 2.211 NaN\n", "3 2009 -1.000 2.234 2.734\n", "4 2011 8.174 NaN NaN\n", "5 2012 -1.000 2.244 2.627\n", "6 2015 8.615 NaN NaN" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv(\n", " 'city_pop.csv',\n", " skiprows=5,\n", " sep=';'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reading the descriptive header of our data file we see that a value of `-1` signifies a missing reading so we should mark those too. This can be done after the fact but it is simplest to do it at import-time using the `na_values` argument:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearLondonParisRome
020017.3222.1482.547
120067.652NaN2.627
22008NaN2.211NaN
32009NaN2.2342.734
420118.174NaNNaN
52012NaN2.2442.627
620158.615NaNNaN
\n", "
" ], "text/plain": [ " year London Paris Rome\n", "0 2001 7.322 2.148 2.547\n", "1 2006 7.652 NaN 2.627\n", "2 2008 NaN 2.211 NaN\n", "3 2009 NaN 2.234 2.734\n", "4 2011 8.174 NaN NaN\n", "5 2012 NaN 2.244 2.627\n", "6 2015 8.615 NaN NaN" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv(\n", " 'city_pop.csv',\n", " skiprows=5,\n", " sep=';',\n", " na_values='-1'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The last thing we want to do is use the `year` column as the index for the `DataFrame`. This can be done by passing the name of the column to the `index_col` argument:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
LondonParisRome
year
20017.3222.1482.547
20067.652NaN2.627
2008NaN2.211NaN
2009NaN2.2342.734
20118.174NaNNaN
2012NaN2.2442.627
20158.615NaNNaN
\n", "
" ], "text/plain": [ " London Paris Rome\n", "year \n", "2001 7.322 2.148 2.547\n", "2006 7.652 NaN 2.627\n", "2008 NaN 2.211 NaN\n", "2009 NaN 2.234 2.734\n", "2011 8.174 NaN NaN\n", "2012 NaN 2.244 2.627\n", "2015 8.615 NaN NaN" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df3 = pd.read_csv(\n", " 'city_pop.csv',\n", " skiprows=5,\n", " sep=';',\n", " na_values='-1',\n", " index_col='year'\n", ")\n", "df3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercises\n", "\n", "- Alongside `city_pop.csv` there is another file called `cetml1659on.dat` (also available from [here](https://raw.githubusercontent.com/milliams/data_analysis_python/master/cetml1659on.dat)). This contains some historical weather data for a location in the UK. Import that file as a Pandas `DataFrame` using `read_csv()`, making sure that you cover all the NaN values.\n", "- How many years had a negative average temperature in January?\n", "- What was the average temperature in June over the years in the data set? Tip: look in the [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html) for which method to call.\n", "\n", "We will come back to this data set in a later stage." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That covers the basics of reading in data with pandas. For more tutorials on further pandas topics you can have a look at [their website](http://pandas.pydata.org/pandas-docs/stable/tutorials.html). Also, the official [Pandas cheat sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf) is very useful.\n", "\n", "Next we will do something useful with our data and plot it. Continue to the [next section](matplotlib.ipynb)." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2" } }, "nbformat": 4, "nbformat_minor": 2 }