{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook was prepared by [Donne Martin](http://donnemartin.com). Source and license info is on [GitHub](https://github.com/donnemartin/data-science-ipython-notebooks)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Pandas\n", "\n", "Credits: The following are notes taken while working through [Python for Data Analysis](http://www.amazon.com/Python-Data-Analysis-Wrangling-IPython/dp/1449319793) by Wes McKinney\n", "\n", "* Series\n", "* DataFrame\n", "* Reindexing\n", "* Dropping Entries\n", "* Indexing, Selecting, Filtering\n", "* Arithmetic and Data Alignment\n", "* Function Application and Mapping\n", "* Sorting and Ranking\n", "* Axis Indices with Duplicate Values\n", "* Summarizing and Computing Descriptive Statistics\n", "* Cleaning Data (Under Construction)\n", "* Input and Output (Under Construction)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from pandas import Series, DataFrame\n", "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Series\n", "\n", "A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels. The data can be any NumPy data type and the labels are the Series' index." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a Series:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 1\n", "2 2\n", "3 -3\n", "4 -5\n", "5 8\n", "6 13\n", "dtype: int64" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_1 = Series([1, 1, 2, -3, -5, 8, 13])\n", "ser_1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the array representation of a Series:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 1, 1, 2, -3, -5, 8, 13])" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_1.values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Index objects are immutable and hold the axis labels and metadata such as names and axis names.\n", "\n", "Get the index of the Series:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Int64Index([0, 1, 2, 3, 4, 5, 6], dtype='int64')" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_1.index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a Series with a custom index:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 1\n", "b 1\n", "c 2\n", "d -3\n", "e -5\n", "dtype: int64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_2 = Series([1, 1, 2, -3, -5], index=['a', 'b', 'c', 'd', 'e'])\n", "ser_2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get a value from a Series:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_2[4] == ser_2['e']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get a set of values from a Series by passing in a list:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "c 2\n", "a 1\n", "b 1\n", "dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_2[['c', 'a', 'b']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get values great than 0:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 1\n", "b 1\n", "c 2\n", "dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_2[ser_2 > 0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scalar multiply:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 2\n", "b 2\n", "c 4\n", "d -6\n", "e -10\n", "dtype: int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_2 * 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apply a numpy math function:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 2.718282\n", "b 2.718282\n", "c 7.389056\n", "d 0.049787\n", "e 0.006738\n", "dtype: float64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "np.exp(ser_2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A Series is like a fixed-length, ordered dict. \n", "\n", "Create a series by passing in a dict:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "bar 200\n", "baz 300\n", "foo 100\n", "dtype: int64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dict_1 = {'foo' : 100, 'bar' : 200, 'baz' : 300}\n", "ser_3 = Series(dict_1)\n", "ser_3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Re-order a Series by passing in an index (indices not found are NaN):" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "foo 100\n", "bar 200\n", "baz 300\n", "qux NaN\n", "dtype: float64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "index = ['foo', 'bar', 'baz', 'qux']\n", "ser_4 = Series(dict_1, index=index)\n", "ser_4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check for NaN with the pandas method:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "foo False\n", "bar False\n", "baz False\n", "qux True\n", "dtype: bool" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.isnull(ser_4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check for NaN with the Series method:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "foo False\n", "bar False\n", "baz False\n", "qux True\n", "dtype: bool" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_4.isnull()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Series automatically aligns differently indexed data in arithmetic operations:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "bar 400\n", "baz 600\n", "foo 200\n", "qux NaN\n", "dtype: float64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_3 + ser_4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Name a Series:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ser_4.name = 'foobarbazqux'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Name a Series index:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ser_4.index.name = 'label'" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "label\n", "foo 100\n", "bar 200\n", "baz 300\n", "qux NaN\n", "Name: foobarbazqux, dtype: float64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rename a Series' index in place:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "fo 100\n", "br 200\n", "bz 300\n", "qx NaN\n", "Name: foobarbazqux, dtype: float64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_4.index = ['fo', 'br', 'bz', 'qx']\n", "ser_4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## DataFrame\n", "\n", "A DataFrame is a tabular data structure containing an ordered collection of columns. Each column can have a different type. DataFrames have both row and column indices and is analogous to a dict of Series. Row and column operations are treated roughly symmetrically. Columns returned when indexing a DataFrame are views of the underlying data, not a copy. To obtain a copy, use the Series' copy method.\n", "\n", "Create a DataFrame:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>pop</th>\n", " <th>state</th>\n", " <th>year</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>5.0</td>\n", " <td>VA</td>\n", " <td>2012</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>5.1</td>\n", " <td>VA</td>\n", " <td>2013</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>5.2</td>\n", " <td>VA</td>\n", " <td>2014</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>4.0</td>\n", " <td>MD</td>\n", " <td>2014</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>4.1</td>\n", " <td>MD</td>\n", " <td>2015</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " pop state year\n", "0 5.0 VA 2012\n", "1 5.1 VA 2013\n", "2 5.2 VA 2014\n", "3 4.0 MD 2014\n", "4 4.1 MD 2015" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_1 = {'state' : ['VA', 'VA', 'VA', 'MD', 'MD'],\n", " 'year' : [2012, 2013, 2014, 2014, 2015],\n", " 'pop' : [5.0, 5.1, 5.2, 4.0, 4.1]}\n", "df_1 = DataFrame(data_1)\n", "df_1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a DataFrame specifying a sequence of columns:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>year</th>\n", " <th>state</th>\n", " <th>pop</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>2012</td>\n", " <td>VA</td>\n", " <td>5.0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2013</td>\n", " <td>VA</td>\n", " <td>5.1</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2014</td>\n", " <td>VA</td>\n", " <td>5.2</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>2014</td>\n", " <td>MD</td>\n", " <td>4.0</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>2015</td>\n", " <td>MD</td>\n", " <td>4.1</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " year state pop\n", "0 2012 VA 5.0\n", "1 2013 VA 5.1\n", "2 2014 VA 5.2\n", "3 2014 MD 4.0\n", "4 2015 MD 4.1" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_2 = DataFrame(data_1, columns=['year', 'state', 'pop'])\n", "df_2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Like Series, columns that are not present in the data are NaN:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>year</th>\n", " <th>state</th>\n", " <th>pop</th>\n", " <th>unempl</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>2012</td>\n", " <td>VA</td>\n", " <td>5.0</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2013</td>\n", " <td>VA</td>\n", " <td>5.1</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2014</td>\n", " <td>VA</td>\n", " <td>5.2</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>2014</td>\n", " <td>MD</td>\n", " <td>4.0</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>2015</td>\n", " <td>MD</td>\n", " <td>4.1</td>\n", " <td>NaN</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " year state pop unempl\n", "0 2012 VA 5.0 NaN\n", "1 2013 VA 5.1 NaN\n", "2 2014 VA 5.2 NaN\n", "3 2014 MD 4.0 NaN\n", "4 2015 MD 4.1 NaN" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_3 = DataFrame(data_1, columns=['year', 'state', 'pop', 'unempl'])\n", "df_3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Retrieve a column by key, returning a Series:\n" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 VA\n", "1 VA\n", "2 VA\n", "3 MD\n", "4 MD\n", "Name: state, dtype: object" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_3['state']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Retrive a column by attribute, returning a Series:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 2012\n", "1 2013\n", "2 2014\n", "3 2014\n", "4 2015\n", "Name: year, dtype: int64" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_3.year" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Retrieve a row by position:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "year 2012\n", "state VA\n", "pop 5\n", "unempl NaN\n", "Name: 0, dtype: object" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_3.ix[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Update a column by assignment:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>year</th>\n", " <th>state</th>\n", " <th>pop</th>\n", " <th>unempl</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>2012</td>\n", " <td>VA</td>\n", " <td>5.0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2013</td>\n", " <td>VA</td>\n", " <td>5.1</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2014</td>\n", " <td>VA</td>\n", " <td>5.2</td>\n", " <td>2</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>2014</td>\n", " <td>MD</td>\n", " <td>4.0</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>2015</td>\n", " <td>MD</td>\n", " <td>4.1</td>\n", " <td>4</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " year state pop unempl\n", "0 2012 VA 5.0 0\n", "1 2013 VA 5.1 1\n", "2 2014 VA 5.2 2\n", "3 2014 MD 4.0 3\n", "4 2015 MD 4.1 4" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_3['unempl'] = np.arange(5)\n", "df_3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Assign a Series to a column (note if assigning a list or array, the length must match the DataFrame, unlike a Series):" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>year</th>\n", " <th>state</th>\n", " <th>pop</th>\n", " <th>unempl</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>2012</td>\n", " <td>VA</td>\n", " <td>5.0</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2013</td>\n", " <td>VA</td>\n", " <td>5.1</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2014</td>\n", " <td>VA</td>\n", " <td>5.2</td>\n", " <td>6.0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>2014</td>\n", " <td>MD</td>\n", " <td>4.0</td>\n", " <td>6.0</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>2015</td>\n", " <td>MD</td>\n", " <td>4.1</td>\n", " <td>6.1</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " year state pop unempl\n", "0 2012 VA 5.0 NaN\n", "1 2013 VA 5.1 NaN\n", "2 2014 VA 5.2 6.0\n", "3 2014 MD 4.0 6.0\n", "4 2015 MD 4.1 6.1" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unempl = Series([6.0, 6.0, 6.1], index=[2, 3, 4])\n", "df_3['unempl'] = unempl\n", "df_3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Assign a new column that doesn't exist to create a new column:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>year</th>\n", " <th>state</th>\n", " <th>pop</th>\n", " <th>unempl</th>\n", " <th>state_dup</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>2012</td>\n", " <td>VA</td>\n", " <td>5.0</td>\n", " <td>NaN</td>\n", " <td>VA</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2013</td>\n", " <td>VA</td>\n", " <td>5.1</td>\n", " <td>NaN</td>\n", " <td>VA</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2014</td>\n", " <td>VA</td>\n", " <td>5.2</td>\n", " <td>6.0</td>\n", " <td>VA</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>2014</td>\n", " <td>MD</td>\n", " <td>4.0</td>\n", " <td>6.0</td>\n", " <td>MD</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>2015</td>\n", " <td>MD</td>\n", " <td>4.1</td>\n", " <td>6.1</td>\n", " <td>MD</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " year state pop unempl state_dup\n", "0 2012 VA 5.0 NaN VA\n", "1 2013 VA 5.1 NaN VA\n", "2 2014 VA 5.2 6.0 VA\n", "3 2014 MD 4.0 6.0 MD\n", "4 2015 MD 4.1 6.1 MD" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_3['state_dup'] = df_3['state']\n", "df_3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Delete a column:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>year</th>\n", " <th>state</th>\n", " <th>pop</th>\n", " <th>unempl</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>2012</td>\n", " <td>VA</td>\n", " <td>5.0</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2013</td>\n", " <td>VA</td>\n", " <td>5.1</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2014</td>\n", " <td>VA</td>\n", " <td>5.2</td>\n", " <td>6.0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>2014</td>\n", " <td>MD</td>\n", " <td>4.0</td>\n", " <td>6.0</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>2015</td>\n", " <td>MD</td>\n", " <td>4.1</td>\n", " <td>6.1</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " year state pop unempl\n", "0 2012 VA 5.0 NaN\n", "1 2013 VA 5.1 NaN\n", "2 2014 VA 5.2 6.0\n", "3 2014 MD 4.0 6.0\n", "4 2015 MD 4.1 6.1" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "del df_3['state_dup']\n", "df_3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a DataFrame from a nested dict of dicts (the keys in the inner dicts are unioned and sorted to form the index in the result, unless an explicit index is specified):" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>MD</th>\n", " <th>VA</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2013</th>\n", " <td>NaN</td>\n", " <td>5.1</td>\n", " </tr>\n", " <tr>\n", " <th>2014</th>\n", " <td>4.0</td>\n", " <td>5.2</td>\n", " </tr>\n", " <tr>\n", " <th>2015</th>\n", " <td>4.1</td>\n", " <td>NaN</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " MD VA\n", "2013 NaN 5.1\n", "2014 4.0 5.2\n", "2015 4.1 NaN" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pop = {'VA' : {2013 : 5.1, 2014 : 5.2},\n", " 'MD' : {2014 : 4.0, 2015 : 4.1}}\n", "df_4 = DataFrame(pop)\n", "df_4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Transpose the DataFrame:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>2013</th>\n", " <th>2014</th>\n", " <th>2015</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>MD</th>\n", " <td>NaN</td>\n", " <td>4.0</td>\n", " <td>4.1</td>\n", " </tr>\n", " <tr>\n", " <th>VA</th>\n", " <td>5.1</td>\n", " <td>5.2</td>\n", " <td>NaN</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 2013 2014 2015\n", "MD NaN 4.0 4.1\n", "VA 5.1 5.2 NaN" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_4.T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a DataFrame from a dict of Series:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>MD</th>\n", " <th>VA</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2014</th>\n", " <td>NaN</td>\n", " <td>5.2</td>\n", " </tr>\n", " <tr>\n", " <th>2015</th>\n", " <td>4.1</td>\n", " <td>NaN</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " MD VA\n", "2014 NaN 5.2\n", "2015 4.1 NaN" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_2 = {'VA' : df_4['VA'][1:],\n", " 'MD' : df_4['MD'][2:]}\n", "df_5 = DataFrame(data_2)\n", "df_5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set the DataFrame index name:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>MD</th>\n", " <th>VA</th>\n", " </tr>\n", " <tr>\n", " <th>year</th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2014</th>\n", " <td>NaN</td>\n", " <td>5.2</td>\n", " </tr>\n", " <tr>\n", " <th>2015</th>\n", " <td>4.1</td>\n", " <td>NaN</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " MD VA\n", "year \n", "2014 NaN 5.2\n", "2015 4.1 NaN" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_5.index.name = 'year'\n", "df_5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set the DataFrame columns name:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th>state</th>\n", " <th>MD</th>\n", " <th>VA</th>\n", " </tr>\n", " <tr>\n", " <th>year</th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2014</th>\n", " <td>NaN</td>\n", " <td>5.2</td>\n", " </tr>\n", " <tr>\n", " <th>2015</th>\n", " <td>4.1</td>\n", " <td>NaN</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ "state MD VA\n", "year \n", "2014 NaN 5.2\n", "2015 4.1 NaN" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_5.columns.name = 'state'\n", "df_5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Return the data contained in a DataFrame as a 2D ndarray:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[ nan, 5.2],\n", " [ 4.1, nan]])" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_5.values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the columns are different dtypes, the 2D ndarray's dtype will accomodate all of the columns:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[2012, 'VA', 5.0, nan],\n", " [2013, 'VA', 5.1, nan],\n", " [2014, 'VA', 5.2, 6.0],\n", " [2014, 'MD', 4.0, 6.0],\n", " [2015, 'MD', 4.1, 6.1]], dtype=object)" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_3.values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reindexing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a new object with the data conformed to a new index. Any missing values are set to NaN." ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>year</th>\n", " <th>state</th>\n", " <th>pop</th>\n", " <th>unempl</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>2012</td>\n", " <td>VA</td>\n", " <td>5.0</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2013</td>\n", " <td>VA</td>\n", " <td>5.1</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2014</td>\n", " <td>VA</td>\n", " <td>5.2</td>\n", " <td>6.0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>2014</td>\n", " <td>MD</td>\n", " <td>4.0</td>\n", " <td>6.0</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>2015</td>\n", " <td>MD</td>\n", " <td>4.1</td>\n", " <td>6.1</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " year state pop unempl\n", "0 2012 VA 5.0 NaN\n", "1 2013 VA 5.1 NaN\n", "2 2014 VA 5.2 6.0\n", "3 2014 MD 4.0 6.0\n", "4 2015 MD 4.1 6.1" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reindexing rows returns a new frame with the specified index:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>year</th>\n", " <th>state</th>\n", " <th>pop</th>\n", " <th>unempl</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>5</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>2015</td>\n", " <td>MD</td>\n", " <td>4.1</td>\n", " <td>6.1</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>2014</td>\n", " <td>MD</td>\n", " <td>4.0</td>\n", " <td>6.0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2014</td>\n", " <td>VA</td>\n", " <td>5.2</td>\n", " <td>6.0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2013</td>\n", " <td>VA</td>\n", " <td>5.1</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>0</th>\n", " <td>2012</td>\n", " <td>VA</td>\n", " <td>5.0</td>\n", " <td>NaN</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " year state pop unempl\n", "5 NaN NaN NaN NaN\n", "4 2015 MD 4.1 6.1\n", "3 2014 MD 4.0 6.0\n", "2 2014 VA 5.2 6.0\n", "1 2013 VA 5.1 NaN\n", "0 2012 VA 5.0 NaN" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_3.reindex(list(reversed(range(0, 6))))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Missing values can be set to something other than NaN:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>year</th>\n", " <th>state</th>\n", " <th>pop</th>\n", " <th>unempl</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ "Empty DataFrame\n", "Columns: [year, state, pop, unempl]\n", "Index: []" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_3.reindex(range(6, 0), fill_value=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Interpolate ordered data like a time series:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ser_5 = Series(['foo', 'bar', 'baz'], index=[0, 2, 4])" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 foo\n", "1 foo\n", "2 bar\n", "3 bar\n", "4 baz\n", "dtype: object" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_5.reindex(range(5), method='ffill')" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 foo\n", "1 bar\n", "2 bar\n", "3 baz\n", "4 baz\n", "dtype: object" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_5.reindex(range(5), method='bfill')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reindex columns:" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>state</th>\n", " <th>pop</th>\n", " <th>unempl</th>\n", " <th>year</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>VA</td>\n", " <td>5.0</td>\n", " <td>NaN</td>\n", " <td>2012</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>VA</td>\n", " <td>5.1</td>\n", " <td>NaN</td>\n", " <td>2013</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>VA</td>\n", " <td>5.2</td>\n", " <td>6.0</td>\n", " <td>2014</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>MD</td>\n", " <td>4.0</td>\n", " <td>6.0</td>\n", " <td>2014</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>MD</td>\n", " <td>4.1</td>\n", " <td>6.1</td>\n", " <td>2015</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " state pop unempl year\n", "0 VA 5.0 NaN 2012\n", "1 VA 5.1 NaN 2013\n", "2 VA 5.2 6.0 2014\n", "3 MD 4.0 6.0 2014\n", "4 MD 4.1 6.1 2015" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_3.reindex(columns=['state', 'pop', 'unempl', 'year'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reindex rows and columns while filling rows:" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>state</th>\n", " <th>pop</th>\n", " <th>unempl</th>\n", " <th>year</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>5</th>\n", " <td>0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>MD</td>\n", " <td>4.1</td>\n", " <td>6.1</td>\n", " <td>2015</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>MD</td>\n", " <td>4.0</td>\n", " <td>6.0</td>\n", " <td>2014</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>VA</td>\n", " <td>5.2</td>\n", " <td>6.0</td>\n", " <td>2014</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>VA</td>\n", " <td>5.1</td>\n", " <td>NaN</td>\n", " <td>2013</td>\n", " </tr>\n", " <tr>\n", " <th>0</th>\n", " <td>VA</td>\n", " <td>5.0</td>\n", " <td>NaN</td>\n", " <td>2012</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " state pop unempl year\n", "5 0 0.0 0.0 0\n", "4 MD 4.1 6.1 2015\n", "3 MD 4.0 6.0 2014\n", "2 VA 5.2 6.0 2014\n", "1 VA 5.1 NaN 2013\n", "0 VA 5.0 NaN 2012" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_3.reindex(index=list(reversed(range(0, 6))),\n", " fill_value=0,\n", " columns=['state', 'pop', 'unempl', 'year'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reindex using ix:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>state</th>\n", " <th>pop</th>\n", " <th>unempl</th>\n", " <th>year</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>VA</td>\n", " <td>5.0</td>\n", " <td>NaN</td>\n", " <td>2012</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>VA</td>\n", " <td>5.1</td>\n", " <td>NaN</td>\n", " <td>2013</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>VA</td>\n", " <td>5.2</td>\n", " <td>6.0</td>\n", " <td>2014</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>MD</td>\n", " <td>4.0</td>\n", " <td>6.0</td>\n", " <td>2014</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>MD</td>\n", " <td>4.1</td>\n", " <td>6.1</td>\n", " <td>2015</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " state pop unempl year\n", "0 VA 5.0 NaN 2012\n", "1 VA 5.1 NaN 2013\n", "2 VA 5.2 6.0 2014\n", "3 MD 4.0 6.0 2014\n", "4 MD 4.1 6.1 2015\n", "5 NaN NaN NaN NaN\n", "6 NaN NaN NaN NaN" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_6 = df_3.ix[range(0, 7), ['state', 'pop', 'unempl', 'year']]\n", "df_6" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dropping Entries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Drop rows from a Series or DataFrame:" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>state</th>\n", " <th>pop</th>\n", " <th>unempl</th>\n", " <th>year</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2</th>\n", " <td>VA</td>\n", " <td>5.2</td>\n", " <td>6.0</td>\n", " <td>2014</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>MD</td>\n", " <td>4.0</td>\n", " <td>6.0</td>\n", " <td>2014</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>MD</td>\n", " <td>4.1</td>\n", " <td>6.1</td>\n", " <td>2015</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " state pop unempl year\n", "2 VA 5.2 6.0 2014\n", "3 MD 4.0 6.0 2014\n", "4 MD 4.1 6.1 2015\n", "5 NaN NaN NaN NaN\n", "6 NaN NaN NaN NaN" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_7 = df_6.drop([0, 1])\n", "df_7" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Drop columns from a DataFrame:" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>state</th>\n", " <th>pop</th>\n", " <th>year</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2</th>\n", " <td>VA</td>\n", " <td>5.2</td>\n", " <td>2014</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>MD</td>\n", " <td>4.0</td>\n", " <td>2014</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>MD</td>\n", " <td>4.1</td>\n", " <td>2015</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " state pop year\n", "2 VA 5.2 2014\n", "3 MD 4.0 2014\n", "4 MD 4.1 2015\n", "5 NaN NaN NaN\n", "6 NaN NaN NaN" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_7 = df_7.drop('unempl', axis=1)\n", "df_7" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Indexing, Selecting, Filtering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Series indexing is similar to NumPy array indexing with the added bonus of being able to use the Series' index values." ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 1\n", "b 1\n", "c 2\n", "d -3\n", "e -5\n", "dtype: int64" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select a value from a Series:" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_2[0] == ser_2['a']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select a slice from a Series:" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "b 1\n", "c 2\n", "d -3\n", "dtype: int64" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_2[1:4]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select specific values from a Series:" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "b 1\n", "c 2\n", "d -3\n", "dtype: int64" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_2[['b', 'c', 'd']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select from a Series based on a filter:" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 1\n", "b 1\n", "c 2\n", "dtype: int64" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_2[ser_2 > 0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select a slice from a Series with labels (note the end point is inclusive):" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 1\n", "b 1\n", "dtype: int64" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_2['a':'b']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Assign to a Series slice (note the end point is inclusive):" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 0\n", "b 0\n", "c 2\n", "d -3\n", "e -5\n", "dtype: int64" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_2['a':'b'] = 0\n", "ser_2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas supports indexing into a DataFrame." ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>state</th>\n", " <th>pop</th>\n", " <th>unempl</th>\n", " <th>year</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>VA</td>\n", " <td>5.0</td>\n", " <td>NaN</td>\n", " <td>2012</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>VA</td>\n", " <td>5.1</td>\n", " <td>NaN</td>\n", " <td>2013</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>VA</td>\n", " <td>5.2</td>\n", " <td>6.0</td>\n", " <td>2014</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>MD</td>\n", " <td>4.0</td>\n", " <td>6.0</td>\n", " <td>2014</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>MD</td>\n", " <td>4.1</td>\n", " <td>6.1</td>\n", " <td>2015</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " state pop unempl year\n", "0 VA 5.0 NaN 2012\n", "1 VA 5.1 NaN 2013\n", "2 VA 5.2 6.0 2014\n", "3 MD 4.0 6.0 2014\n", "4 MD 4.1 6.1 2015\n", "5 NaN NaN NaN NaN\n", "6 NaN NaN NaN NaN" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_6" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select specified columns from a DataFrame:" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>pop</th>\n", " <th>unempl</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>5.0</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>5.1</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>5.2</td>\n", " <td>6.0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>4.0</td>\n", " <td>6.0</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>4.1</td>\n", " <td>6.1</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " pop unempl\n", "0 5.0 NaN\n", "1 5.1 NaN\n", "2 5.2 6.0\n", "3 4.0 6.0\n", "4 4.1 6.1\n", "5 NaN NaN\n", "6 NaN NaN" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_6[['pop', 'unempl']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select a slice from a DataFrame:" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>state</th>\n", " <th>pop</th>\n", " <th>unempl</th>\n", " <th>year</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>VA</td>\n", " <td>5.0</td>\n", " <td>NaN</td>\n", " <td>2012</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>VA</td>\n", " <td>5.1</td>\n", " <td>NaN</td>\n", " <td>2013</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " state pop unempl year\n", "0 VA 5.0 NaN 2012\n", "1 VA 5.1 NaN 2013" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_6[:2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select from a DataFrame based on a filter:" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>state</th>\n", " <th>pop</th>\n", " <th>unempl</th>\n", " <th>year</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>1</th>\n", " <td>VA</td>\n", " <td>5.1</td>\n", " <td>NaN</td>\n", " <td>2013</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>VA</td>\n", " <td>5.2</td>\n", " <td>6</td>\n", " <td>2014</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " state pop unempl year\n", "1 VA 5.1 NaN 2013\n", "2 VA 5.2 6 2014" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_6[df_6['pop'] > 5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Perform a scalar comparison on a DataFrame:" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>state</th>\n", " <th>pop</th>\n", " <th>unempl</th>\n", " <th>year</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>True</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>True</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>False</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " state pop unempl year\n", "0 True False False True\n", "1 True True False True\n", "2 True True True True\n", "3 True False True True\n", "4 True False True True\n", "5 True False False False\n", "6 True False False False" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_6 > 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Perform a scalar comparison on a DataFrame, retain the values that pass the filter:" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>state</th>\n", " <th>pop</th>\n", " <th>unempl</th>\n", " <th>year</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>VA</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>2012</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>VA</td>\n", " <td>5.1</td>\n", " <td>NaN</td>\n", " <td>2013</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>VA</td>\n", " <td>5.2</td>\n", " <td>6.0</td>\n", " <td>2014</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>MD</td>\n", " <td>NaN</td>\n", " <td>6.0</td>\n", " <td>2014</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>MD</td>\n", " <td>NaN</td>\n", " <td>6.1</td>\n", " <td>2015</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " state pop unempl year\n", "0 VA NaN NaN 2012\n", "1 VA 5.1 NaN 2013\n", "2 VA 5.2 6.0 2014\n", "3 MD NaN 6.0 2014\n", "4 MD NaN 6.1 2015\n", "5 NaN NaN NaN NaN\n", "6 NaN NaN NaN NaN" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_6[df_6 > 5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select a slice of rows from a DataFrame (note the end point is inclusive):" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>state</th>\n", " <th>pop</th>\n", " <th>unempl</th>\n", " <th>year</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2</th>\n", " <td>VA</td>\n", " <td>5.2</td>\n", " <td>6</td>\n", " <td>2014</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>MD</td>\n", " <td>4.0</td>\n", " <td>6</td>\n", " <td>2014</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " state pop unempl year\n", "2 VA 5.2 6 2014\n", "3 MD 4.0 6 2014" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_6.ix[2:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select a slice of rows from a specific column of a DataFrame:" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 5.0\n", "1 5.1\n", "2 5.2\n", "Name: pop, dtype: float64" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_6.ix[0:2, 'pop']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select rows based on an arithmetic operation on a specific row:" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>state</th>\n", " <th>pop</th>\n", " <th>unempl</th>\n", " <th>year</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2</th>\n", " <td>VA</td>\n", " <td>5.2</td>\n", " <td>6.0</td>\n", " <td>2014</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>MD</td>\n", " <td>4.0</td>\n", " <td>6.0</td>\n", " <td>2014</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>MD</td>\n", " <td>4.1</td>\n", " <td>6.1</td>\n", " <td>2015</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " state pop unempl year\n", "2 VA 5.2 6.0 2014\n", "3 MD 4.0 6.0 2014\n", "4 MD 4.1 6.1 2015" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_6.ix[df_6.unempl > 5.0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Arithmetic and Data Alignment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Adding Series objects results in the union of index pairs if the pairs are not the same, resulting in NaN for indices that do not overlap:" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 1.764052\n", "b 0.400157\n", "c 0.978738\n", "d 2.240893\n", "e 1.867558\n", "dtype: float64" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.random.seed(0)\n", "ser_6 = Series(np.random.randn(5),\n", " index=['a', 'b', 'c', 'd', 'e'])\n", "ser_6" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 1.624345\n", "c -0.611756\n", "e -0.528172\n", "f -1.072969\n", "g 0.865408\n", "dtype: float64" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.random.seed(1)\n", "ser_7 = Series(np.random.randn(5),\n", " index=['a', 'c', 'e', 'f', 'g'])\n", "ser_7" ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 3.388398\n", "b NaN\n", "c 0.366982\n", "d NaN\n", "e 1.339386\n", "f NaN\n", "g NaN\n", "dtype: float64" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_6 + ser_7" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set a fill value instead of NaN for indices that do not overlap:" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 3.388398\n", "b 0.400157\n", "c 0.366982\n", "d 2.240893\n", "e 1.339386\n", "f -1.072969\n", "g 0.865408\n", "dtype: float64" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_6.add(ser_7, fill_value=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Adding DataFrame objects results in the union of index pairs for rows and columns if the pairs are not the same, resulting in NaN for indices that do not overlap:" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0.548814</td>\n", " <td>0.715189</td>\n", " <td>0.602763</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>0.544883</td>\n", " <td>0.423655</td>\n", " <td>0.645894</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>0.437587</td>\n", " <td>0.891773</td>\n", " <td>0.963663</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c\n", "0 0.548814 0.715189 0.602763\n", "1 0.544883 0.423655 0.645894\n", "2 0.437587 0.891773 0.963663" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.random.seed(0)\n", "df_8 = DataFrame(np.random.rand(9).reshape((3, 3)),\n", " columns=['a', 'b', 'c'])\n", "df_8" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>b</th>\n", " <th>c</th>\n", " <th>d</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0.417022</td>\n", " <td>0.720324</td>\n", " <td>0.000114</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>0.302333</td>\n", " <td>0.146756</td>\n", " <td>0.092339</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>0.186260</td>\n", " <td>0.345561</td>\n", " <td>0.396767</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " b c d\n", "0 0.417022 0.720324 0.000114\n", "1 0.302333 0.146756 0.092339\n", "2 0.186260 0.345561 0.396767" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.random.seed(1)\n", "df_9 = DataFrame(np.random.rand(9).reshape((3, 3)),\n", " columns=['b', 'c', 'd'])\n", "df_9" ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " <th>d</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>NaN</td>\n", " <td>1.132211</td>\n", " <td>1.323088</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>NaN</td>\n", " <td>0.725987</td>\n", " <td>0.792650</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>NaN</td>\n", " <td>1.078033</td>\n", " <td>1.309223</td>\n", " <td>NaN</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c d\n", "0 NaN 1.132211 1.323088 NaN\n", "1 NaN 0.725987 0.792650 NaN\n", "2 NaN 1.078033 1.309223 NaN" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_8 + df_9" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set a fill value instead of NaN for indices that do not overlap:" ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " <th>d</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0.548814</td>\n", " <td>1.132211</td>\n", " <td>1.323088</td>\n", " <td>0.000114</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>0.544883</td>\n", " <td>0.725987</td>\n", " <td>0.792650</td>\n", " <td>0.092339</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>0.437587</td>\n", " <td>1.078033</td>\n", " <td>1.309223</td>\n", " <td>0.396767</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c d\n", "0 0.548814 1.132211 1.323088 0.000114\n", "1 0.544883 0.725987 0.792650 0.092339\n", "2 0.437587 1.078033 1.309223 0.396767" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_10 = df_8.add(df_9, fill_value=0)\n", "df_10" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Like NumPy, pandas supports arithmetic operations between DataFrames and Series.\n", "\n", "Match the index of the Series on the DataFrame's columns, broadcasting down the rows:" ] }, { "cell_type": "code", "execution_count": 72, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " <th>d</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>-0.003930</td>\n", " <td>-0.406224</td>\n", " <td>-0.530438</td>\n", " <td>0.092224</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>-0.111226</td>\n", " <td>-0.054178</td>\n", " <td>-0.013864</td>\n", " <td>0.396653</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c d\n", "0 0.000000 0.000000 0.000000 0.000000\n", "1 -0.003930 -0.406224 -0.530438 0.092224\n", "2 -0.111226 -0.054178 -0.013864 0.396653" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_8 = df_10.ix[0]\n", "df_11 = df_10 - ser_8\n", "df_11" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Match the index of the Series on the DataFrame's columns, broadcasting down the rows and union the indices that do not match:" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 0\n", "d 1\n", "e 2\n", "dtype: int64" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_9 = Series(range(3), index=['a', 'd', 'e'])\n", "ser_9" ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " <th>d</th>\n", " <th>e</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0.000000</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>-1.000000</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>-0.003930</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>-0.907776</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>-0.111226</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>-0.603347</td>\n", " <td>NaN</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c d e\n", "0 0.000000 NaN NaN -1.000000 NaN\n", "1 -0.003930 NaN NaN -0.907776 NaN\n", "2 -0.111226 NaN NaN -0.603347 NaN" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_11 - ser_9" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Broadcast over the columns and match the rows (axis=0) by using an arithmetic method:" ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " <th>d</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0.548814</td>\n", " <td>1.132211</td>\n", " <td>1.323088</td>\n", " <td>0.000114</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>0.544883</td>\n", " <td>0.725987</td>\n", " <td>0.792650</td>\n", " <td>0.092339</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>0.437587</td>\n", " <td>1.078033</td>\n", " <td>1.309223</td>\n", " <td>0.396767</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c d\n", "0 0.548814 1.132211 1.323088 0.000114\n", "1 0.544883 0.725987 0.792650 0.092339\n", "2 0.437587 1.078033 1.309223 0.396767" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_10" ] }, { "cell_type": "code", "execution_count": 76, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 100\n", "1 200\n", "2 300\n", "dtype: int64" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_10 = Series([100, 200, 300])\n", "ser_10" ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " <th>d</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>-99.451186</td>\n", " <td>-98.867789</td>\n", " <td>-98.676912</td>\n", " <td>-99.999886</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>-199.455117</td>\n", " <td>-199.274013</td>\n", " <td>-199.207350</td>\n", " <td>-199.907661</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>-299.562413</td>\n", " <td>-298.921967</td>\n", " <td>-298.690777</td>\n", " <td>-299.603233</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c d\n", "0 -99.451186 -98.867789 -98.676912 -99.999886\n", "1 -199.455117 -199.274013 -199.207350 -199.907661\n", "2 -299.562413 -298.921967 -298.690777 -299.603233" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_10.sub(ser_10, axis=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Function Application and Mapping" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "NumPy ufuncs (element-wise array methods) operate on pandas objects:" ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " <th>d</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>0.003930</td>\n", " <td>0.406224</td>\n", " <td>0.530438</td>\n", " <td>0.092224</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>0.111226</td>\n", " <td>0.054178</td>\n", " <td>0.013864</td>\n", " <td>0.396653</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c d\n", "0 0.000000 0.000000 0.000000 0.000000\n", "1 0.003930 0.406224 0.530438 0.092224\n", "2 0.111226 0.054178 0.013864 0.396653" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_11 = np.abs(df_11)\n", "df_11" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apply a function on 1D arrays to each column:" ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 0.111226\n", "b 0.406224\n", "c 0.530438\n", "d 0.396653\n", "dtype: float64" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "func_1 = lambda x: x.max() - x.min()\n", "df_11.apply(func_1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apply a function on 1D arrays to each row:" ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 0.000000\n", "1 0.526508\n", "2 0.382789\n", "dtype: float64" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_11.apply(func_1, axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apply a function and return a DataFrame:" ] }, { "cell_type": "code", "execution_count": 81, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " <th>d</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>min</th>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>max</th>\n", " <td>0.111226</td>\n", " <td>0.406224</td>\n", " <td>0.530438</td>\n", " <td>0.396653</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c d\n", "min 0.000000 0.000000 0.000000 0.000000\n", "max 0.111226 0.406224 0.530438 0.396653" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "func_2 = lambda x: Series([x.min(), x.max()], index=['min', 'max'])\n", "df_11.apply(func_2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apply an element-wise Python function to a DataFrame:" ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " <th>d</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0.00</td>\n", " <td>0.00</td>\n", " <td>0.00</td>\n", " <td>0.00</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>0.00</td>\n", " <td>0.41</td>\n", " <td>0.53</td>\n", " <td>0.09</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>0.11</td>\n", " <td>0.05</td>\n", " <td>0.01</td>\n", " <td>0.40</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c d\n", "0 0.00 0.00 0.00 0.00\n", "1 0.00 0.41 0.53 0.09\n", "2 0.11 0.05 0.01 0.40" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "func_3 = lambda x: '%.2f' %x\n", "df_11.applymap(func_3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apply an element-wise Python function to a Series:" ] }, { "cell_type": "code", "execution_count": 83, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 0.00\n", "1 0.00\n", "2 0.11\n", "Name: a, dtype: object" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_11['a'].map(func_3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sorting and Ranking" ] }, { "cell_type": "code", "execution_count": 84, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "fo 100\n", "br 200\n", "bz 300\n", "qx NaN\n", "Name: foobarbazqux, dtype: float64" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sort a Series by its index:" ] }, { "cell_type": "code", "execution_count": 85, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "br 200\n", "bz 300\n", "fo 100\n", "qx NaN\n", "Name: foobarbazqux, dtype: float64" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_4.sort_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sort a Series by its values:" ] }, { "cell_type": "code", "execution_count": 86, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "fo 100\n", "br 200\n", "bz 300\n", "qx NaN\n", "Name: foobarbazqux, dtype: float64" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_4.sort_values()" ] }, { "cell_type": "code", "execution_count": 87, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>c</th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>d</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>three</th>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>2</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>one</th>\n", " <td>4</td>\n", " <td>5</td>\n", " <td>6</td>\n", " <td>7</td>\n", " </tr>\n", " <tr>\n", " <th>two</th>\n", " <td>8</td>\n", " <td>9</td>\n", " <td>10</td>\n", " <td>11</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " c a b d\n", "three 0 1 2 3\n", "one 4 5 6 7\n", "two 8 9 10 11" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_12 = DataFrame(np.arange(12).reshape((3, 4)),\n", " index=['three', 'one', 'two'],\n", " columns=['c', 'a', 'b', 'd'])\n", "df_12" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sort a DataFrame by its index:" ] }, { "cell_type": "code", "execution_count": 88, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>c</th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>d</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>one</th>\n", " <td>4</td>\n", " <td>5</td>\n", " <td>6</td>\n", " <td>7</td>\n", " </tr>\n", " <tr>\n", " <th>three</th>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>2</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>two</th>\n", " <td>8</td>\n", " <td>9</td>\n", " <td>10</td>\n", " <td>11</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " c a b d\n", "one 4 5 6 7\n", "three 0 1 2 3\n", "two 8 9 10 11" ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_12.sort_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sort a DataFrame by columns in descending order:" ] }, { "cell_type": "code", "execution_count": 89, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>d</th>\n", " <th>c</th>\n", " <th>b</th>\n", " <th>a</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>three</th>\n", " <td>3</td>\n", " <td>0</td>\n", " <td>2</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>one</th>\n", " <td>7</td>\n", " <td>4</td>\n", " <td>6</td>\n", " <td>5</td>\n", " </tr>\n", " <tr>\n", " <th>two</th>\n", " <td>11</td>\n", " <td>8</td>\n", " <td>10</td>\n", " <td>9</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " d c b a\n", "three 3 0 2 1\n", "one 7 4 6 5\n", "two 11 8 10 9" ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_12.sort_index(axis=1, ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sort a DataFrame's values by column:" ] }, { "cell_type": "code", "execution_count": 90, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>c</th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>d</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>three</th>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>2</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>one</th>\n", " <td>4</td>\n", " <td>5</td>\n", " <td>6</td>\n", " <td>7</td>\n", " </tr>\n", " <tr>\n", " <th>two</th>\n", " <td>8</td>\n", " <td>9</td>\n", " <td>10</td>\n", " <td>11</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " c a b d\n", "three 0 1 2 3\n", "one 4 5 6 7\n", "two 8 9 10 11" ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_12.sort_values(by=['d', 'c'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ranking is similar to numpy.argsort except that ties are broken by assigning each group the mean rank:" ] }, { "cell_type": "code", "execution_count": 91, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1 -5\n", "5 0\n", "4 2\n", "3 4\n", "6 4\n", "0 7\n", "2 7\n", "7 7\n", "dtype: int64" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_11 = Series([7, -5, 7, 4, 2, 0, 4, 7])\n", "ser_11 = ser_11.sort_values()\n", "ser_11" ] }, { "cell_type": "code", "execution_count": 92, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1 1.0\n", "5 2.0\n", "4 3.0\n", "3 4.5\n", "6 4.5\n", "0 7.0\n", "2 7.0\n", "7 7.0\n", "dtype: float64" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_11.rank()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rank a Series according to when they appear in the data:" ] }, { "cell_type": "code", "execution_count": 93, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1 1\n", "5 2\n", "4 3\n", "3 4\n", "6 5\n", "0 6\n", "2 7\n", "7 8\n", "dtype: float64" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_11.rank(method='first')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rank a Series in descending order, using the maximum rank for the group:" ] }, { "cell_type": "code", "execution_count": 94, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1 8\n", "5 7\n", "4 6\n", "3 5\n", "6 5\n", "0 3\n", "2 3\n", "7 3\n", "dtype: float64" ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_11.rank(ascending=False, method='max')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "DataFrames can rank over rows or columns." ] }, { "cell_type": "code", "execution_count": 95, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>bar</th>\n", " <th>baz</th>\n", " <th>foo</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>-5</td>\n", " <td>-1</td>\n", " <td>7</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>4</td>\n", " <td>2</td>\n", " <td>-5</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2</td>\n", " <td>3</td>\n", " <td>7</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>4</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>4</td>\n", " <td>5</td>\n", " <td>2</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>7</td>\n", " <td>9</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>7</td>\n", " <td>9</td>\n", " <td>4</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>8</td>\n", " <td>5</td>\n", " <td>7</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " bar baz foo\n", "0 -5 -1 7\n", "1 4 2 -5\n", "2 2 3 7\n", "3 0 0 4\n", "4 4 5 2\n", "5 7 9 0\n", "6 7 9 4\n", "7 8 5 7" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_13 = DataFrame({'foo' : [7, -5, 7, 4, 2, 0, 4, 7],\n", " 'bar' : [-5, 4, 2, 0, 4, 7, 7, 8],\n", " 'baz' : [-1, 2, 3, 0, 5, 9, 9, 5]})\n", "df_13" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rank a DataFrame over rows:" ] }, { "cell_type": "code", "execution_count": 96, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>bar</th>\n", " <th>baz</th>\n", " <th>foo</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1.0</td>\n", " <td>1.0</td>\n", " <td>7.0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>4.5</td>\n", " <td>3.0</td>\n", " <td>1.0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>3.0</td>\n", " <td>4.0</td>\n", " <td>7.0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>2.0</td>\n", " <td>2.0</td>\n", " <td>4.5</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>4.5</td>\n", " <td>5.5</td>\n", " <td>3.0</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>6.5</td>\n", " <td>7.5</td>\n", " <td>2.0</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>6.5</td>\n", " <td>7.5</td>\n", " <td>4.5</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>8.0</td>\n", " <td>5.5</td>\n", " <td>7.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " bar baz foo\n", "0 1.0 1.0 7.0\n", "1 4.5 3.0 1.0\n", "2 3.0 4.0 7.0\n", "3 2.0 2.0 4.5\n", "4 4.5 5.5 3.0\n", "5 6.5 7.5 2.0\n", "6 6.5 7.5 4.5\n", "7 8.0 5.5 7.0" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_13.rank()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rank a DataFrame over columns:" ] }, { "cell_type": "code", "execution_count": 97, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>bar</th>\n", " <th>baz</th>\n", " <th>foo</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1.0</td>\n", " <td>2.0</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>3.0</td>\n", " <td>2.0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>1.0</td>\n", " <td>2.0</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>1.5</td>\n", " <td>1.5</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>2.0</td>\n", " <td>3.0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>2.0</td>\n", " <td>3.0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>2.0</td>\n", " <td>3.0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>3.0</td>\n", " <td>1.0</td>\n", " <td>2</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " bar baz foo\n", "0 1.0 2.0 3\n", "1 3.0 2.0 1\n", "2 1.0 2.0 3\n", "3 1.5 1.5 3\n", "4 2.0 3.0 1\n", "5 2.0 3.0 1\n", "6 2.0 3.0 1\n", "7 3.0 1.0 2" ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_13.rank(axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Axis Indexes with Duplicate Values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Labels do not have to be unique in Pandas:" ] }, { "cell_type": "code", "execution_count": 98, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "foo 0\n", "foo 1\n", "bar 2\n", "bar 3\n", "baz 4\n", "dtype: int64" ] }, "execution_count": 98, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_12 = Series(range(5), index=['foo', 'foo', 'bar', 'bar', 'baz'])\n", "ser_12" ] }, { "cell_type": "code", "execution_count": 99, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_12.index.is_unique" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select Series elements:" ] }, { "cell_type": "code", "execution_count": 100, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "foo 0\n", "foo 1\n", "dtype: int64" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser_12['foo']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select DataFrame elements:" ] }, { "cell_type": "code", "execution_count": 101, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " <th>2</th>\n", " <th>3</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>foo</th>\n", " <td>-2.363469</td>\n", " <td>1.135345</td>\n", " <td>-1.017014</td>\n", " <td>0.637362</td>\n", " </tr>\n", " <tr>\n", " <th>foo</th>\n", " <td>-0.859907</td>\n", " <td>1.772608</td>\n", " <td>-1.110363</td>\n", " <td>0.181214</td>\n", " </tr>\n", " <tr>\n", " <th>bar</th>\n", " <td>0.564345</td>\n", " <td>-0.566510</td>\n", " <td>0.729976</td>\n", " <td>0.372994</td>\n", " </tr>\n", " <tr>\n", " <th>bar</th>\n", " <td>0.533811</td>\n", " <td>-0.091973</td>\n", " <td>1.913820</td>\n", " <td>0.330797</td>\n", " </tr>\n", " <tr>\n", " <th>baz</th>\n", " <td>1.141943</td>\n", " <td>-1.129595</td>\n", " <td>-0.850052</td>\n", " <td>0.960820</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0 1 2 3\n", "foo -2.363469 1.135345 -1.017014 0.637362\n", "foo -0.859907 1.772608 -1.110363 0.181214\n", "bar 0.564345 -0.566510 0.729976 0.372994\n", "bar 0.533811 -0.091973 1.913820 0.330797\n", "baz 1.141943 -1.129595 -0.850052 0.960820" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_14 = DataFrame(np.random.randn(5, 4),\n", " index=['foo', 'foo', 'bar', 'bar', 'baz'])\n", "df_14" ] }, { "cell_type": "code", "execution_count": 102, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " <th>2</th>\n", " <th>3</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>bar</th>\n", " <td>0.564345</td>\n", " <td>-0.566510</td>\n", " <td>0.729976</td>\n", " <td>0.372994</td>\n", " </tr>\n", " <tr>\n", " <th>bar</th>\n", " <td>0.533811</td>\n", " <td>-0.091973</td>\n", " <td>1.913820</td>\n", " <td>0.330797</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0 1 2 3\n", "bar 0.564345 -0.566510 0.729976 0.372994\n", "bar 0.533811 -0.091973 1.913820 0.330797" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_14.ix['bar']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summarizing and Computing Descriptive Statistics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unlike NumPy arrays, Pandas descriptive statistics automatically exclude missing data. NaN values are excluded unless the entire row or column is NA." ] }, { "cell_type": "code", "execution_count": 103, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>state</th>\n", " <th>pop</th>\n", " <th>unempl</th>\n", " <th>year</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>VA</td>\n", " <td>5.0</td>\n", " <td>NaN</td>\n", " <td>2012</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>VA</td>\n", " <td>5.1</td>\n", " <td>NaN</td>\n", " <td>2013</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>VA</td>\n", " <td>5.2</td>\n", " <td>6.0</td>\n", " <td>2014</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>MD</td>\n", " <td>4.0</td>\n", " <td>6.0</td>\n", " <td>2014</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>MD</td>\n", " <td>4.1</td>\n", " <td>6.1</td>\n", " <td>2015</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " state pop unempl year\n", "0 VA 5.0 NaN 2012\n", "1 VA 5.1 NaN 2013\n", "2 VA 5.2 6.0 2014\n", "3 MD 4.0 6.0 2014\n", "4 MD 4.1 6.1 2015\n", "5 NaN NaN NaN NaN\n", "6 NaN NaN NaN NaN" ] }, "execution_count": 103, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_6" ] }, { "cell_type": "code", "execution_count": 104, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "pop 23.4\n", "unempl 18.1\n", "year 10068.0\n", "dtype: float64" ] }, "execution_count": 104, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_6.sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sum over the rows:" ] }, { "cell_type": "code", "execution_count": 105, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 2017.0\n", "1 2018.1\n", "2 2025.2\n", "3 2024.0\n", "4 2025.2\n", "5 0.0\n", "6 0.0\n", "dtype: float64" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_6.sum(axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Account for NaNs:" ] }, { "cell_type": "code", "execution_count": 106, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 NaN\n", "1 NaN\n", "2 2025.2\n", "3 2024.0\n", "4 2025.2\n", "5 NaN\n", "6 NaN\n", "dtype: float64" ] }, "execution_count": 106, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_6.sum(axis=1, skipna=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cleaning Data (Under Construction)\n", "* Replace\n", "* Drop\n", "* Concatenate" ] }, { "cell_type": "code", "execution_count": 107, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from pandas import Series, DataFrame\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Setup a DataFrame:" ] }, { "cell_type": "code", "execution_count": 108, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>population</th>\n", " <th>state</th>\n", " <th>year</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>5.0</td>\n", " <td>VA</td>\n", " <td>2012</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>5.1</td>\n", " <td>VA</td>\n", " <td>2013</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>5.2</td>\n", " <td>VA</td>\n", " <td>2014</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>4.0</td>\n", " <td>MD</td>\n", " <td>2014</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>4.1</td>\n", " <td>MD</td>\n", " <td>2015</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " population state year\n", "0 5.0 VA 2012\n", "1 5.1 VA 2013\n", "2 5.2 VA 2014\n", "3 4.0 MD 2014\n", "4 4.1 MD 2015" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_1 = {'state' : ['VA', 'VA', 'VA', 'MD', 'MD'],\n", " 'year' : [2012, 2013, 2014, 2014, 2015],\n", " 'population' : [5.0, 5.1, 5.2, 4.0, 4.1]}\n", "df_1 = DataFrame(data_1)\n", "df_1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Replace" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Replace all occurrences of a string with another string, in place (no copy):" ] }, { "cell_type": "code", "execution_count": 109, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>population</th>\n", " <th>state</th>\n", " <th>year</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>5.0</td>\n", " <td>VIRGINIA</td>\n", " <td>2012</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>5.1</td>\n", " <td>VIRGINIA</td>\n", " <td>2013</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>5.2</td>\n", " <td>VIRGINIA</td>\n", " <td>2014</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>4.0</td>\n", " <td>MD</td>\n", " <td>2014</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>4.1</td>\n", " <td>MD</td>\n", " <td>2015</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " population state year\n", "0 5.0 VIRGINIA 2012\n", "1 5.1 VIRGINIA 2013\n", "2 5.2 VIRGINIA 2014\n", "3 4.0 MD 2014\n", "4 4.1 MD 2015" ] }, "execution_count": 109, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_1.replace('VA', 'VIRGINIA', inplace=True)\n", "df_1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a specified column, replace all occurrences of a string with another string, in place (no copy):" ] }, { "cell_type": "code", "execution_count": 110, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>population</th>\n", " <th>state</th>\n", " <th>year</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>5.0</td>\n", " <td>VIRGINIA</td>\n", " <td>2012</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>5.1</td>\n", " <td>VIRGINIA</td>\n", " <td>2013</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>5.2</td>\n", " <td>VIRGINIA</td>\n", " <td>2014</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>4.0</td>\n", " <td>MARYLAND</td>\n", " <td>2014</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>4.1</td>\n", " <td>MARYLAND</td>\n", " <td>2015</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " population state year\n", "0 5.0 VIRGINIA 2012\n", "1 5.1 VIRGINIA 2013\n", "2 5.2 VIRGINIA 2014\n", "3 4.0 MARYLAND 2014\n", "4 4.1 MARYLAND 2015" ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_1.replace({'state' : { 'MD' : 'MARYLAND' }}, inplace=True)\n", "df_1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Drop" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Drop the 'population' column and return a copy of the DataFrame:" ] }, { "cell_type": "code", "execution_count": 111, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>state</th>\n", " <th>year</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>VIRGINIA</td>\n", " <td>2012</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>VIRGINIA</td>\n", " <td>2013</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>VIRGINIA</td>\n", " <td>2014</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>MARYLAND</td>\n", " <td>2014</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>MARYLAND</td>\n", " <td>2015</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " state year\n", "0 VIRGINIA 2012\n", "1 VIRGINIA 2013\n", "2 VIRGINIA 2014\n", "3 MARYLAND 2014\n", "4 MARYLAND 2015" ] }, "execution_count": 111, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_2 = df_1.drop('population', axis=1)\n", "df_2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Concatenate" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Concatenate two DataFrames:" ] }, { "cell_type": "code", "execution_count": 112, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>population</th>\n", " <th>state</th>\n", " <th>year</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>6.0</td>\n", " <td>NY</td>\n", " <td>2012</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>6.1</td>\n", " <td>NY</td>\n", " <td>2013</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>6.2</td>\n", " <td>NY</td>\n", " <td>2014</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>3.0</td>\n", " <td>FL</td>\n", " <td>2014</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>3.1</td>\n", " <td>FL</td>\n", " <td>2015</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " population state year\n", "0 6.0 NY 2012\n", "1 6.1 NY 2013\n", "2 6.2 NY 2014\n", "3 3.0 FL 2014\n", "4 3.1 FL 2015" ] }, "execution_count": 112, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_2 = {'state' : ['NY', 'NY', 'NY', 'FL', 'FL'],\n", " 'year' : [2012, 2013, 2014, 2014, 2015],\n", " 'population' : [6.0, 6.1, 6.2, 3.0, 3.1]}\n", "df_3 = DataFrame(data_2)\n", "df_3" ] }, { "cell_type": "code", "execution_count": 113, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>population</th>\n", " <th>state</th>\n", " <th>year</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>5.0</td>\n", " <td>VIRGINIA</td>\n", " <td>2012</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>5.1</td>\n", " <td>VIRGINIA</td>\n", " <td>2013</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>5.2</td>\n", " <td>VIRGINIA</td>\n", " <td>2014</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>4.0</td>\n", " <td>MARYLAND</td>\n", " <td>2014</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>4.1</td>\n", " <td>MARYLAND</td>\n", " <td>2015</td>\n", " </tr>\n", " <tr>\n", " <th>0</th>\n", " <td>6.0</td>\n", " <td>NY</td>\n", " <td>2012</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>6.1</td>\n", " <td>NY</td>\n", " <td>2013</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>6.2</td>\n", " <td>NY</td>\n", " <td>2014</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>3.0</td>\n", " <td>FL</td>\n", " <td>2014</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>3.1</td>\n", " <td>FL</td>\n", " <td>2015</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " population state year\n", "0 5.0 VIRGINIA 2012\n", "1 5.1 VIRGINIA 2013\n", "2 5.2 VIRGINIA 2014\n", "3 4.0 MARYLAND 2014\n", "4 4.1 MARYLAND 2015\n", "0 6.0 NY 2012\n", "1 6.1 NY 2013\n", "2 6.2 NY 2014\n", "3 3.0 FL 2014\n", "4 3.1 FL 2015" ] }, "execution_count": 113, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_4 = pd.concat([df_1, df_3])\n", "df_4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Input and Output (Under Construction)\n", "* Reading\n", "* Writing" ] }, { "cell_type": "code", "execution_count": 114, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from pandas import Series, DataFrame\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reading" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read data from a CSV file into a DataFrame (use sep='\\t' for TSV):" ] }, { "cell_type": "code", "execution_count": 115, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df_1 = pd.read_csv(\"../data/ozone.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get a summary of the DataFrame:" ] }, { "cell_type": "code", "execution_count": 116, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Ozone</th>\n", " <th>Solar.R</th>\n", " <th>Wind</th>\n", " <th>Temp</th>\n", " <th>Month</th>\n", " <th>Day</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>count</th>\n", " <td>116.000000</td>\n", " <td>146.000000</td>\n", " <td>153.000000</td>\n", " <td>153.000000</td>\n", " <td>153.000000</td>\n", " <td>153.000000</td>\n", " </tr>\n", " <tr>\n", " <th>mean</th>\n", " <td>42.129310</td>\n", " <td>185.931507</td>\n", " <td>9.957516</td>\n", " <td>77.882353</td>\n", " <td>6.993464</td>\n", " <td>15.803922</td>\n", " </tr>\n", " <tr>\n", " <th>std</th>\n", " <td>32.987885</td>\n", " <td>90.058422</td>\n", " <td>3.523001</td>\n", " <td>9.465270</td>\n", " <td>1.416522</td>\n", " <td>8.864520</td>\n", " </tr>\n", " <tr>\n", " <th>min</th>\n", " <td>1.000000</td>\n", " <td>7.000000</td>\n", " <td>1.700000</td>\n", " <td>56.000000</td>\n", " <td>5.000000</td>\n", " <td>1.000000</td>\n", " </tr>\n", " <tr>\n", " <th>25%</th>\n", " <td>18.000000</td>\n", " <td>115.750000</td>\n", " <td>7.400000</td>\n", " <td>72.000000</td>\n", " <td>6.000000</td>\n", " <td>8.000000</td>\n", " </tr>\n", " <tr>\n", " <th>50%</th>\n", " <td>31.500000</td>\n", " <td>205.000000</td>\n", " <td>9.700000</td>\n", " <td>79.000000</td>\n", " <td>7.000000</td>\n", " <td>16.000000</td>\n", " </tr>\n", " <tr>\n", " <th>75%</th>\n", " <td>63.250000</td>\n", " <td>258.750000</td>\n", " <td>11.500000</td>\n", " <td>85.000000</td>\n", " <td>8.000000</td>\n", " <td>23.000000</td>\n", " </tr>\n", " <tr>\n", " <th>max</th>\n", " <td>168.000000</td>\n", " <td>334.000000</td>\n", " <td>20.700000</td>\n", " <td>97.000000</td>\n", " <td>9.000000</td>\n", " <td>31.000000</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Ozone Solar.R Wind Temp Month Day\n", "count 116.000000 146.000000 153.000000 153.000000 153.000000 153.000000\n", "mean 42.129310 185.931507 9.957516 77.882353 6.993464 15.803922\n", "std 32.987885 90.058422 3.523001 9.465270 1.416522 8.864520\n", "min 1.000000 7.000000 1.700000 56.000000 5.000000 1.000000\n", "25% 18.000000 115.750000 7.400000 72.000000 6.000000 8.000000\n", "50% 31.500000 205.000000 9.700000 79.000000 7.000000 16.000000\n", "75% 63.250000 258.750000 11.500000 85.000000 8.000000 23.000000\n", "max 168.000000 334.000000 20.700000 97.000000 9.000000 31.000000" ] }, "execution_count": 116, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_1.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "List the first five rows of the DataFrame:" ] }, { "cell_type": "code", "execution_count": 117, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Ozone</th>\n", " <th>Solar.R</th>\n", " <th>Wind</th>\n", " <th>Temp</th>\n", " <th>Month</th>\n", " <th>Day</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>41</td>\n", " <td>190</td>\n", " <td>7.4</td>\n", " <td>67</td>\n", " <td>5</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>36</td>\n", " <td>118</td>\n", " <td>8.0</td>\n", " <td>72</td>\n", " <td>5</td>\n", " <td>2</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>12</td>\n", " <td>149</td>\n", " <td>12.6</td>\n", " <td>74</td>\n", " <td>5</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>18</td>\n", " <td>313</td>\n", " <td>11.5</td>\n", " <td>62</td>\n", " <td>5</td>\n", " <td>4</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>14.3</td>\n", " <td>56</td>\n", " <td>5</td>\n", " <td>5</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Ozone Solar.R Wind Temp Month Day\n", "0 41 190 7.4 67 5 1\n", "1 36 118 8.0 72 5 2\n", "2 12 149 12.6 74 5 3\n", "3 18 313 11.5 62 5 4\n", "4 NaN NaN 14.3 56 5 5" ] }, "execution_count": 117, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_1.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Writing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a copy of the CSV file, encoded in UTF-8 and hiding the index and header labels:" ] }, { "cell_type": "code", "execution_count": 118, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df_1.to_csv('../data/ozone_copy.csv', \n", " encoding='utf-8', \n", " index=False, \n", " header=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "View the data directory:" ] }, { "cell_type": "code", "execution_count": 119, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total 1016\r\n", "-rw-r--r-- 1 donnemartin staff 437903 Jul 7 2015 churn.csv\r\n", "-rwxr-xr-x 1 donnemartin staff 72050 Jul 7 2015 \u001b[31mconfusion_matrix.png\u001b[m\u001b[m\r\n", "-rw-r--r-- 1 donnemartin staff 2902 Jul 7 2015 ozone.csv\r\n", "-rw-r--r-- 1 donnemartin staff 3324 Apr 1 07:18 ozone_copy.csv\r\n", "drwxr-xr-x 10 donnemartin staff 340 Jul 7 2015 \u001b[34mtitanic\u001b[m\u001b[m\r\n" ] } ], "source": [ "!ls -l ../data/" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.3" } }, "nbformat": 4, "nbformat_minor": 0 }