{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*\n", "\n", "*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*\n", "\n", "*No changes were made to the contents of this notebook from the original.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "< [Data Indexing and Selection](03.02-Data-Indexing-and-Selection.ipynb) | [Contents](Index.ipynb) | [Handling Missing Data](03.04-Missing-Values.ipynb) >" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Operating on Data in Pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the essential pieces of NumPy is the ability to perform quick element-wise operations, both with basic arithmetic (addition, subtraction, multiplication, etc.) and with more sophisticated operations (trigonometric functions, exponential and logarithmic functions, etc.).\n", "Pandas inherits much of this functionality from NumPy, and the ufuncs that we introduced in [Computation on NumPy Arrays: Universal Functions](02.03-Computation-on-arrays-ufuncs.ipynb) are key to this.\n", "\n", "Pandas includes a couple useful twists, however: for unary operations like negation and trigonometric functions, these ufuncs will *preserve index and column labels* in the output, and for binary operations such as addition and multiplication, Pandas will automatically *align indices* when passing the objects to the ufunc.\n", "This means that keeping the context of data and combining data from different sourcesâ€“both potentially error-prone tasks with raw NumPy arraysâ€“become essentially foolproof ones with Pandas.\n", "We will additionally see that there are well-defined operations between one-dimensional ``Series`` structures and two-dimensional ``DataFrame`` structures." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Ufuncs: Index Preservation\n", "\n", "Because Pandas is designed to work with NumPy, any NumPy ufunc will work on Pandas ``Series`` and ``DataFrame`` objects.\n", "Let's start by defining a simple ``Series`` and ``DataFrame`` on which to demonstrate this:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 6\n", "1 3\n", "2 7\n", "3 4\n", "dtype: int64" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rng = np.random.RandomState(42)\n", "ser = pd.Series(rng.randint(0, 10, 4))\n", "ser" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
06926
17437
27254
\n", "
" ], "text/plain": [ " A B C D\n", "0 6 9 2 6\n", "1 7 4 3 7\n", "2 7 2 5 4" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(rng.randint(0, 10, (3, 4)),\n", " columns=['A', 'B', 'C', 'D'])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we apply a NumPy ufunc on either of these objects, the result will be another Pandas object *with the indices preserved:*" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 403.428793\n", "1 20.085537\n", "2 1096.633158\n", "3 54.598150\n", "dtype: float64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.exp(ser)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or, for a slightly more complex calculation:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
0-1.0000007.071068e-011.000000-1.000000e+00
1-0.7071071.224647e-160.707107-7.071068e-01
2-0.7071071.000000e+00-0.7071071.224647e-16
\n", "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
0111
151
\n", "
" ], "text/plain": [ " A B\n", "0 1 11\n", "1 5 1" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A = pd.DataFrame(rng.randint(0, 20, (2, 2)),\n", " columns=list('AB'))\n", "A" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
BAC
0409
1580
2926
\n", "
" ], "text/plain": [ " B A C\n", "0 4 0 9\n", "1 5 8 0\n", "2 9 2 6" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "B = pd.DataFrame(rng.randint(0, 10, (3, 3)),\n", " columns=list('BAC'))\n", "B" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC
01.015.0NaN
113.06.0NaN
2NaNNaNNaN
\n", "
" ], "text/plain": [ " A B C\n", "0 1.0 15.0 NaN\n", "1 13.0 6.0 NaN\n", "2 NaN NaN NaN" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A + B" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that indices are aligned correctly irrespective of their order in the two objects, and indices in the result are sorted.\n", "As was the case with ``Series``, we can use the associated object's arithmetic method and pass any desired ``fill_value`` to be used in place of missing entries.\n", "Here we'll fill with the mean of all values in ``A`` (computed by first stacking the rows of ``A``):" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC
01.015.013.5
113.06.04.5
26.513.510.5
\n", "
" ], "text/plain": [ " A B C\n", "0 1.0 15.0 13.5\n", "1 13.0 6.0 4.5\n", "2 6.5 13.5 10.5" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill = A.stack().mean()\n", "A.add(B, fill_value=fill)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following table lists Python operators and their equivalent Pandas object methods:\n", "\n", "| Python Operator | Pandas Method(s) |\n", "|-----------------|---------------------------------------|\n", "| ``+`` | ``add()`` |\n", "| ``-`` | ``sub()``, ``subtract()`` |\n", "| ``*`` | ``mul()``, ``multiply()`` |\n", "| ``/`` | ``truediv()``, ``div()``, ``divide()``|\n", "| ``//`` | ``floordiv()`` |\n", "| ``%`` | ``mod()`` |\n", "| ``**`` | ``pow()`` |\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Ufuncs: Operations Between DataFrame and Series\n", "\n", "When performing operations between a ``DataFrame`` and a ``Series``, the index and column alignment is similarly maintained.\n", "Operations between a ``DataFrame`` and a ``Series`` are similar to operations between a two-dimensional and one-dimensional NumPy array.\n", "Consider one common operation, where we find the difference of a two-dimensional array and one of its rows:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[3, 8, 2, 4],\n", " [2, 6, 4, 8],\n", " [6, 1, 3, 8]])" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A = rng.randint(10, size=(3, 4))\n", "A" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[ 0, 0, 0, 0],\n", " [-1, -2, 2, 4],\n", " [ 3, -7, 1, 4]])" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A - A[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "According to NumPy's broadcasting rules (see [Computation on Arrays: Broadcasting](02.05-Computation-on-arrays-broadcasting.ipynb)), subtraction between a two-dimensional array and one of its rows is applied row-wise.\n", "\n", "In Pandas, the convention similarly operates row-wise by default:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
QRST
00000
1-1-224
23-714
\n", "
" ], "text/plain": [ " Q R S T\n", "0 0 0 0 0\n", "1 -1 -2 2 4\n", "2 3 -7 1 4" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(A, columns=list('QRST'))\n", "df - df.iloc[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you would instead like to operate column-wise, you can use the object methods mentioned earlier, while specifying the ``axis`` keyword:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
QRST
0-50-6-4
1-40-22
25027
\n", "
" ], "text/plain": [ " Q R S T\n", "0 -5 0 -6 -4\n", "1 -4 0 -2 2\n", "2 5 0 2 7" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.subtract(df['R'], axis=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that these ``DataFrame``/``Series`` operations, like the operations discussed above, will automatically align indices between the two elements:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Q 3\n", "S 2\n", "Name: 0, dtype: int64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "halfrow = df.iloc[0, ::2]\n", "halfrow" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
QRST
00.0NaN0.0NaN
1-1.0NaN2.0NaN
23.0NaN1.0NaN
\n", "
" ], "text/plain": [ " Q R S T\n", "0 0.0 NaN 0.0 NaN\n", "1 -1.0 NaN 2.0 NaN\n", "2 3.0 NaN 1.0 NaN" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df - halfrow" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This preservation and alignment of indices and columns means that operations on data in Pandas will always maintain the data context, which prevents the types of silly errors that might come up when working with heterogeneous and/or misaligned data in raw NumPy arrays." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "< [Data Indexing and Selection](03.02-Data-Indexing-and-Selection.ipynb) | [Contents](Index.ipynb) | [Handling Missing Data](03.04-Missing-Values.ipynb) >" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.3" } }, "nbformat": 4, "nbformat_minor": 0 }