{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*\n", "\n", "*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*\n", "\n", "*No changes were made to the contents of this notebook from the original.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "< [Combining Datasets: Merge and Join](03.07-Merge-and-Join.ipynb) | [Contents](Index.ipynb) | [Pivot Tables](03.09-Pivot-Tables.ipynb) >" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Aggregation and Grouping" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "An essential piece of analysis of large data is efficient summarization: computing aggregations like ``sum()``, ``mean()``, ``median()``, ``min()``, and ``max()``, in which a single number gives insight into the nature of a potentially large dataset.\n", "In this section, we'll explore aggregations in Pandas, from simple operations akin to what we've seen on NumPy arrays, to more sophisticated operations based on the concept of a ``groupby``." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For convenience, we'll use the same ``display`` magic function that we've seen in previous sections:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "class display(object):\n", " \"\"\"Display HTML representation of multiple objects\"\"\"\n", " template = \"\"\"
\n", "

{0}

{1}\n", "
\"\"\"\n", " def __init__(self, *args):\n", " self.args = args\n", " \n", " def _repr_html_(self):\n", " return '\\n'.join(self.template.format(a, eval(a)._repr_html_())\n", " for a in self.args)\n", " \n", " def __repr__(self):\n", " return '\\n\\n'.join(a + '\\n' + repr(eval(a))\n", " for a in self.args)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Planets Data\n", "\n", "Here we will use the Planets dataset, available via the [Seaborn package](http://seaborn.pydata.org/) (see [Visualization With Seaborn](04.14-Visualization-With-Seaborn.ipynb)).\n", "It gives information on planets that astronomers have discovered around other stars (known as *extrasolar planets* or *exoplanets* for short). It can be downloaded with a simple Seaborn command:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(1035, 6)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import seaborn as sns\n", "planets = sns.load_dataset('planets')\n", "planets.shape" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
methodnumberorbital_periodmassdistanceyear
0Radial Velocity1269.3007.1077.402006
1Radial Velocity1874.7742.2156.952008
2Radial Velocity1763.0002.6019.842011
3Radial Velocity1326.03019.40110.622007
4Radial Velocity1516.22010.50119.472009
\n", "
" ], "text/plain": [ " method number orbital_period mass distance year\n", "0 Radial Velocity 1 269.300 7.10 77.40 2006\n", "1 Radial Velocity 1 874.774 2.21 56.95 2008\n", "2 Radial Velocity 1 763.000 2.60 19.84 2011\n", "3 Radial Velocity 1 326.030 19.40 110.62 2007\n", "4 Radial Velocity 1 516.220 10.50 119.47 2009" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planets.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This has some details on the 1,000+ extrasolar planets discovered up to 2014." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Simple Aggregation in Pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Earlier, we explored some of the data aggregations available for NumPy arrays ([\"Aggregations: Min, Max, and Everything In Between\"](02.04-Computation-on-arrays-aggregates.ipynb)).\n", "As with a one-dimensional NumPy array, for a Pandas ``Series`` the aggregates return a single value:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 0.374540\n", "1 0.950714\n", "2 0.731994\n", "3 0.598658\n", "4 0.156019\n", "dtype: float64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rng = np.random.RandomState(42)\n", "ser = pd.Series(rng.rand(5))\n", "ser" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "2.8119254917081569" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser.sum()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.56238509834163142" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a ``DataFrame``, by default the aggregates return results within each column:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
00.1559950.020584
10.0580840.969910
20.8661760.832443
30.6011150.212339
40.7080730.181825
\n", "
" ], "text/plain": [ " A B\n", "0 0.155995 0.020584\n", "1 0.058084 0.969910\n", "2 0.866176 0.832443\n", "3 0.601115 0.212339\n", "4 0.708073 0.181825" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({'A': rng.rand(5),\n", " 'B': rng.rand(5)})\n", "df" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "A 0.477888\n", "B 0.443420\n", "dtype: float64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By specifying the ``axis`` argument, you can instead aggregate within each row:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 0.088290\n", "1 0.513997\n", "2 0.849309\n", "3 0.406727\n", "4 0.444949\n", "dtype: float64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.mean(axis='columns')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas ``Series`` and ``DataFrame``s include all of the common aggregates mentioned in [Aggregations: Min, Max, and Everything In Between](02.04-Computation-on-arrays-aggregates.ipynb); in addition, there is a convenience method ``describe()`` that computes several common aggregates for each column and returns the result.\n", "Let's use this on the Planets data, for now dropping rows with missing values:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
numberorbital_periodmassdistanceyear
count498.00000498.000000498.000000498.000000498.000000
mean1.73494835.7786712.50932052.0682132007.377510
std1.175721469.1282593.63627446.5960414.167284
min1.000001.3283000.0036001.3500001989.000000
25%1.0000038.2722500.21250024.4975002005.000000
50%1.00000357.0000001.24500039.9400002009.000000
75%2.00000999.6000002.86750059.3325002011.000000
max6.0000017337.50000025.000000354.0000002014.000000
\n", "
" ], "text/plain": [ " number orbital_period mass distance year\n", "count 498.00000 498.000000 498.000000 498.000000 498.000000\n", "mean 1.73494 835.778671 2.509320 52.068213 2007.377510\n", "std 1.17572 1469.128259 3.636274 46.596041 4.167284\n", "min 1.00000 1.328300 0.003600 1.350000 1989.000000\n", "25% 1.00000 38.272250 0.212500 24.497500 2005.000000\n", "50% 1.00000 357.000000 1.245000 39.940000 2009.000000\n", "75% 2.00000 999.600000 2.867500 59.332500 2011.000000\n", "max 6.00000 17337.500000 25.000000 354.000000 2014.000000" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planets.dropna().describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This can be a useful way to begin understanding the overall properties of a dataset.\n", "For example, we see in the ``year`` column that although exoplanets were discovered as far back as 1989, half of all known expolanets were not discovered until 2010 or after.\n", "This is largely thanks to the *Kepler* mission, which is a space-based telescope specifically designed for finding eclipsing planets around other stars." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following table summarizes some other built-in Pandas aggregations:\n", "\n", "| Aggregation | Description |\n", "|--------------------------|---------------------------------|\n", "| ``count()`` | Total number of items |\n", "| ``first()``, ``last()`` | First and last item |\n", "| ``mean()``, ``median()`` | Mean and median |\n", "| ``min()``, ``max()`` | Minimum and maximum |\n", "| ``std()``, ``var()`` | Standard deviation and variance |\n", "| ``mad()`` | Mean absolute deviation |\n", "| ``prod()`` | Product of all items |\n", "| ``sum()`` | Sum of all items |\n", "\n", "These are all methods of ``DataFrame`` and ``Series`` objects." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To go deeper into the data, however, simple aggregates are often not enough.\n", "The next level of data summarization is the ``groupby`` operation, which allows you to quickly and efficiently compute aggregates on subsets of data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## GroupBy: Split, Apply, Combine\n", "\n", "Simple aggregations can give you a flavor of your dataset, but often we would prefer to aggregate conditionally on some label or index: this is implemented in the so-called ``groupby`` operation.\n", "The name \"group by\" comes from a command in the SQL database language, but it is perhaps more illuminative to think of it in the terms first coined by Hadley Wickham of Rstats fame: *split, apply, combine*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Split, apply, combine\n", "\n", "A canonical example of this split-apply-combine operation, where the \"apply\" is a summation aggregation, is illustrated in this figure:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](figures/03.08-split-apply-combine.png)\n", "[figure source in Appendix](06.00-Figure-Code.ipynb#Split-Apply-Combine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This makes clear what the ``groupby`` accomplishes:\n", "\n", "- The *split* step involves breaking up and grouping a ``DataFrame`` depending on the value of the specified key.\n", "- The *apply* step involves computing some function, usually an aggregate, transformation, or filtering, within the individual groups.\n", "- The *combine* step merges the results of these operations into an output array.\n", "\n", "While this could certainly be done manually using some combination of the masking, aggregation, and merging commands covered earlier, an important realization is that *the intermediate splits do not need to be explicitly instantiated*. Rather, the ``GroupBy`` can (often) do this in a single pass over the data, updating the sum, mean, count, min, or other aggregate for each group along the way.\n", "The power of the ``GroupBy`` is that it abstracts away these steps: the user need not think about *how* the computation is done under the hood, but rather thinks about the *operation as a whole*.\n", "\n", "As a concrete example, let's take a look at using Pandas for the computation shown in this diagram.\n", "We'll start by creating the input ``DataFrame``:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keydata
0A0
1B1
2C2
3A3
4B4
5C5
\n", "
" ], "text/plain": [ " key data\n", "0 A 0\n", "1 B 1\n", "2 C 2\n", "3 A 3\n", "4 B 4\n", "5 C 5" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],\n", " 'data': range(6)}, columns=['key', 'data'])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The most basic split-apply-combine operation can be computed with the ``groupby()`` method of ``DataFrame``s, passing the name of the desired key column:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('key')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that what is returned is not a set of ``DataFrame``s, but a ``DataFrameGroupBy`` object.\n", "This object is where the magic is: you can think of it as a special view of the ``DataFrame``, which is poised to dig into the groups but does no actual computation until the aggregation is applied.\n", "This \"lazy evaluation\" approach means that common aggregates can be implemented very efficiently in a way that is almost transparent to the user.\n", "\n", "To produce a result, we can apply an aggregate to this ``DataFrameGroupBy`` object, which will perform the appropriate apply/combine steps to produce the desired result:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data
key
A3
B5
C7
\n", "
" ], "text/plain": [ " data\n", "key \n", "A 3\n", "B 5\n", "C 7" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('key').sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ``sum()`` method is just one possibility here; you can apply virtually any common Pandas or NumPy aggregation function, as well as virtually any valid ``DataFrame`` operation, as we will see in the following discussion." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The GroupBy object\n", "\n", "The ``GroupBy`` object is a very flexible abstraction.\n", "In many ways, you can simply treat it as if it's a collection of ``DataFrame``s, and it does the difficult things under the hood. Let's see some examples using the Planets data.\n", "\n", "Perhaps the most important operations made available by a ``GroupBy`` are *aggregate*, *filter*, *transform*, and *apply*.\n", "We'll discuss each of these more fully in [\"Aggregate, Filter, Transform, Apply\"](#Aggregate,-Filter,-Transform,-Apply), but before that let's introduce some of the other functionality that can be used with the basic ``GroupBy`` operation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Column indexing\n", "\n", "The ``GroupBy`` object supports column indexing in the same way as the ``DataFrame``, and returns a modified ``GroupBy`` object.\n", "For example:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planets.groupby('method')" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planets.groupby('method')['orbital_period']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we've selected a particular ``Series`` group from the original ``DataFrame`` group by reference to its column name.\n", "As with the ``GroupBy`` object, no computation is done until we call some aggregate on the object:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "method\n", "Astrometry 631.180000\n", "Eclipse Timing Variations 4343.500000\n", "Imaging 27500.000000\n", "Microlensing 3300.000000\n", "Orbital Brightness Modulation 0.342887\n", "Pulsar Timing 66.541900\n", "Pulsation Timing Variations 1170.000000\n", "Radial Velocity 360.200000\n", "Transit 5.714932\n", "Transit Timing Variations 57.011000\n", "Name: orbital_period, dtype: float64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planets.groupby('method')['orbital_period'].median()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This gives an idea of the general scale of orbital periods (in days) that each method is sensitive to." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Iteration over groups\n", "\n", "The ``GroupBy`` object supports direct iteration over the groups, returning each group as a ``Series`` or ``DataFrame``:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Astrometry shape=(2, 6)\n", "Eclipse Timing Variations shape=(9, 6)\n", "Imaging shape=(38, 6)\n", "Microlensing shape=(23, 6)\n", "Orbital Brightness Modulation shape=(3, 6)\n", "Pulsar Timing shape=(5, 6)\n", "Pulsation Timing Variations shape=(1, 6)\n", "Radial Velocity shape=(553, 6)\n", "Transit shape=(397, 6)\n", "Transit Timing Variations shape=(4, 6)\n" ] } ], "source": [ "for (method, group) in planets.groupby('method'):\n", " print(\"{0:30s} shape={1}\".format(method, group.shape))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This can be useful for doing certain things manually, though it is often much faster to use the built-in ``apply`` functionality, which we will discuss momentarily." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Dispatch methods\n", "\n", "Through some Python class magic, any method not explicitly implemented by the ``GroupBy`` object will be passed through and called on the groups, whether they are ``DataFrame`` or ``Series`` objects.\n", "For example, you can use the ``describe()`` method of ``DataFrame``s to perform a set of aggregations that describe each group in the data:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countmeanstdmin25%50%75%max
method
Astrometry2.02011.5000002.1213202010.02010.752011.52012.252013.0
Eclipse Timing Variations9.02010.0000001.4142142008.02009.002010.02011.002012.0
Imaging38.02009.1315792.7819012004.02008.002009.02011.002013.0
Microlensing23.02009.7826092.8596972004.02008.002010.02012.002013.0
Orbital Brightness Modulation3.02011.6666671.1547012011.02011.002011.02012.002013.0
Pulsar Timing5.01998.4000008.3845101992.01992.001994.02003.002011.0
Pulsation Timing Variations1.02007.000000NaN2007.02007.002007.02007.002007.0
Radial Velocity553.02007.5189874.2490521989.02005.002009.02011.002014.0
Transit397.02011.2367762.0778672002.02010.002012.02013.002014.0
Transit Timing Variations4.02012.5000001.2909942011.02011.752012.52013.252014.0
\n", "
" ], "text/plain": [ " count mean std min 25% \\\n", "method \n", "Astrometry 2.0 2011.500000 2.121320 2010.0 2010.75 \n", "Eclipse Timing Variations 9.0 2010.000000 1.414214 2008.0 2009.00 \n", "Imaging 38.0 2009.131579 2.781901 2004.0 2008.00 \n", "Microlensing 23.0 2009.782609 2.859697 2004.0 2008.00 \n", "Orbital Brightness Modulation 3.0 2011.666667 1.154701 2011.0 2011.00 \n", "Pulsar Timing 5.0 1998.400000 8.384510 1992.0 1992.00 \n", "Pulsation Timing Variations 1.0 2007.000000 NaN 2007.0 2007.00 \n", "Radial Velocity 553.0 2007.518987 4.249052 1989.0 2005.00 \n", "Transit 397.0 2011.236776 2.077867 2002.0 2010.00 \n", "Transit Timing Variations 4.0 2012.500000 1.290994 2011.0 2011.75 \n", "\n", " 50% 75% max \n", "method \n", "Astrometry 2011.5 2012.25 2013.0 \n", "Eclipse Timing Variations 2010.0 2011.00 2012.0 \n", "Imaging 2009.0 2011.00 2013.0 \n", "Microlensing 2010.0 2012.00 2013.0 \n", "Orbital Brightness Modulation 2011.0 2012.00 2013.0 \n", "Pulsar Timing 1994.0 2003.00 2011.0 \n", "Pulsation Timing Variations 2007.0 2007.00 2007.0 \n", "Radial Velocity 2009.0 2011.00 2014.0 \n", "Transit 2012.0 2013.00 2014.0 \n", "Transit Timing Variations 2012.5 2013.25 2014.0 " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planets.groupby('method')['year'].describe().unstack()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at this table helps us to better understand the data: for example, the vast majority of planets have been discovered by the Radial Velocity and Transit methods, though the latter only became common (due to new, more accurate telescopes) in the last decade.\n", "The newest methods seem to be Transit Timing Variation and Orbital Brightness Modulation, which were not used to discover a new planet until 2011.\n", "\n", "This is just one example of the utility of dispatch methods.\n", "Notice that they are applied *to each individual group*, and the results are then combined within ``GroupBy`` and returned.\n", "Again, any valid ``DataFrame``/``Series`` method can be used on the corresponding ``GroupBy`` object, which allows for some very flexible and powerful operations!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Aggregate, filter, transform, apply\n", "\n", "The preceding discussion focused on aggregation for the combine operation, but there are more options available.\n", "In particular, ``GroupBy`` objects have ``aggregate()``, ``filter()``, ``transform()``, and ``apply()`` methods that efficiently implement a variety of useful operations before combining the grouped data.\n", "\n", "For the purpose of the following subsections, we'll use this ``DataFrame``:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keydata1data2
0A05
1B10
2C23
3A33
4B47
5C59
\n", "
" ], "text/plain": [ " key data1 data2\n", "0 A 0 5\n", "1 B 1 0\n", "2 C 2 3\n", "3 A 3 3\n", "4 B 4 7\n", "5 C 5 9" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rng = np.random.RandomState(0)\n", "df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],\n", " 'data1': range(6),\n", " 'data2': rng.randint(0, 10, 6)},\n", " columns = ['key', 'data1', 'data2'])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Aggregation\n", "\n", "We're now familiar with ``GroupBy`` aggregations with ``sum()``, ``median()``, and the like, but the ``aggregate()`` method allows for even more flexibility.\n", "It can take a string, a function, or a list thereof, and compute all the aggregates at once.\n", "Here is a quick example combining all these:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data1data2
minmedianmaxminmedianmax
key
A01.5334.05
B12.5403.57
C23.5536.09
\n", "
" ], "text/plain": [ " data1 data2 \n", " min median max min median max\n", "key \n", "A 0 1.5 3 3 4.0 5\n", "B 1 2.5 4 0 3.5 7\n", "C 2 3.5 5 3 6.0 9" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('key').aggregate(['min', np.median, max])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another useful pattern is to pass a dictionary mapping column names to operations to be applied on that column:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data1data2
key
A05
B17
C29
\n", "
" ], "text/plain": [ " data1 data2\n", "key \n", "A 0 5\n", "B 1 7\n", "C 2 9" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('key').aggregate({'data1': 'min',\n", " 'data2': 'max'})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Filtering\n", "\n", "A filtering operation allows you to drop data based on the group properties.\n", "For example, we might want to keep all groups in which the standard deviation is larger than some critical value:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "

df

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keydata1data2
0A05
1B10
2C23
3A33
4B47
5C59
\n", "
\n", "
\n", "
\n", "

df.groupby('key').std()

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data1data2
key
A2.121321.414214
B2.121324.949747
C2.121324.242641
\n", "
\n", "
\n", "
\n", "

df.groupby('key').filter(filter_func)

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keydata1data2
1B10
2C23
4B47
5C59
\n", "
\n", "
" ], "text/plain": [ "df\n", " key data1 data2\n", "0 A 0 5\n", "1 B 1 0\n", "2 C 2 3\n", "3 A 3 3\n", "4 B 4 7\n", "5 C 5 9\n", "\n", "df.groupby('key').std()\n", " data1 data2\n", "key \n", "A 2.12132 1.414214\n", "B 2.12132 4.949747\n", "C 2.12132 4.242641\n", "\n", "df.groupby('key').filter(filter_func)\n", " key data1 data2\n", "1 B 1 0\n", "2 C 2 3\n", "4 B 4 7\n", "5 C 5 9" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def filter_func(x):\n", " return x['data2'].std() > 4\n", "\n", "display('df', \"df.groupby('key').std()\", \"df.groupby('key').filter(filter_func)\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The filter function should return a Boolean value specifying whether the group passes the filtering. Here because group A does not have a standard deviation greater than 4, it is dropped from the result." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Transformation\n", "\n", "While aggregation must return a reduced version of the data, transformation can return some transformed version of the full data to recombine.\n", "For such a transformation, the output is the same shape as the input.\n", "A common example is to center the data by subtracting the group-wise mean:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data1data2
0-1.51.0
1-1.5-3.5
2-1.5-3.0
31.5-1.0
41.53.5
51.53.0
\n", "
" ], "text/plain": [ " data1 data2\n", "0 -1.5 1.0\n", "1 -1.5 -3.5\n", "2 -1.5 -3.0\n", "3 1.5 -1.0\n", "4 1.5 3.5\n", "5 1.5 3.0" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('key').transform(lambda x: x - x.mean())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The apply() method\n", "\n", "The ``apply()`` method lets you apply an arbitrary function to the group results.\n", "The function should take a ``DataFrame``, and return either a Pandas object (e.g., ``DataFrame``, ``Series``) or a scalar; the combine operation will be tailored to the type of output returned.\n", "\n", "For example, here is an ``apply()`` that normalizes the first column by the sum of the second:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "

df

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keydata1data2
0A05
1B10
2C23
3A33
4B47
5C59
\n", "
\n", "
\n", "
\n", "

df.groupby('key').apply(norm_by_data2)

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keydata1data2
0A0.0000005
1B0.1428570
2C0.1666673
3A0.3750003
4B0.5714297
5C0.4166679
\n", "
\n", "
" ], "text/plain": [ "df\n", " key data1 data2\n", "0 A 0 5\n", "1 B 1 0\n", "2 C 2 3\n", "3 A 3 3\n", "4 B 4 7\n", "5 C 5 9\n", "\n", "df.groupby('key').apply(norm_by_data2)\n", " key data1 data2\n", "0 A 0.000000 5\n", "1 B 0.142857 0\n", "2 C 0.166667 3\n", "3 A 0.375000 3\n", "4 B 0.571429 7\n", "5 C 0.416667 9" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def norm_by_data2(x):\n", " # x is a DataFrame of group values\n", " x['data1'] /= x['data2'].sum()\n", " return x\n", "\n", "display('df', \"df.groupby('key').apply(norm_by_data2)\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "``apply()`` within a ``GroupBy`` is quite flexible: the only criterion is that the function takes a ``DataFrame`` and returns a Pandas object or scalar; what you do in the middle is up to you!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Specifying the split key\n", "\n", "In the simple examples presented before, we split the ``DataFrame`` on a single column name.\n", "This is just one of many options by which the groups can be defined, and we'll go through some other options for group specification here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### A list, array, series, or index providing the grouping keys\n", "\n", "The key can be any series or list with a length matching that of the ``DataFrame``. For example:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "

df

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keydata1data2
0A05
1B10
2C23
3A33
4B47
5C59
\n", "
\n", "
\n", "
\n", "

df.groupby(L).sum()

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data1data2
0717
143
247
\n", "
\n", "
" ], "text/plain": [ "df\n", " key data1 data2\n", "0 A 0 5\n", "1 B 1 0\n", "2 C 2 3\n", "3 A 3 3\n", "4 B 4 7\n", "5 C 5 9\n", "\n", "df.groupby(L).sum()\n", " data1 data2\n", "0 7 17\n", "1 4 3\n", "2 4 7" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "L = [0, 1, 0, 1, 2, 0]\n", "display('df', 'df.groupby(L).sum()')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course, this means there's another, more verbose way of accomplishing the ``df.groupby('key')`` from before:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "

df

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keydata1data2
0A05
1B10
2C23
3A33
4B47
5C59
\n", "
\n", "
\n", "
\n", "

df.groupby(df['key']).sum()

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data1data2
key
A38
B57
C712
\n", "
\n", "
" ], "text/plain": [ "df\n", " key data1 data2\n", "0 A 0 5\n", "1 B 1 0\n", "2 C 2 3\n", "3 A 3 3\n", "4 B 4 7\n", "5 C 5 9\n", "\n", "df.groupby(df['key']).sum()\n", " data1 data2\n", "key \n", "A 3 8\n", "B 5 7\n", "C 7 12" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display('df', \"df.groupby(df['key']).sum()\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### A dictionary or series mapping index to group\n", "\n", "Another method is to provide a dictionary that maps index values to the group keys:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "

df2

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data1data2
key
A05
B10
C23
A33
B47
C59
\n", "
\n", "
\n", "
\n", "

df2.groupby(mapping).sum()

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data1data2
consonant1219
vowel38
\n", "
\n", "
" ], "text/plain": [ "df2\n", " data1 data2\n", "key \n", "A 0 5\n", "B 1 0\n", "C 2 3\n", "A 3 3\n", "B 4 7\n", "C 5 9\n", "\n", "df2.groupby(mapping).sum()\n", " data1 data2\n", "consonant 12 19\n", "vowel 3 8" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 = df.set_index('key')\n", "mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}\n", "display('df2', 'df2.groupby(mapping).sum()')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Any Python function\n", "\n", "Similar to mapping, you can pass any Python function that will input the index value and output the group:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "

df2

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data1data2
key
A05
B10
C23
A33
B47
C59
\n", "
\n", "
\n", "
\n", "

df2.groupby(str.lower).mean()

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data1data2
a1.54.0
b2.53.5
c3.56.0
\n", "
\n", "
" ], "text/plain": [ "df2\n", " data1 data2\n", "key \n", "A 0 5\n", "B 1 0\n", "C 2 3\n", "A 3 3\n", "B 4 7\n", "C 5 9\n", "\n", "df2.groupby(str.lower).mean()\n", " data1 data2\n", "a 1.5 4.0\n", "b 2.5 3.5\n", "c 3.5 6.0" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display('df2', 'df2.groupby(str.lower).mean()')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### A list of valid keys\n", "\n", "Further, any of the preceding key choices can be combined to group on a multi-index:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data1data2
avowel1.54.0
bconsonant2.53.5
cconsonant3.56.0
\n", "
" ], "text/plain": [ " data1 data2\n", "a vowel 1.5 4.0\n", "b consonant 2.5 3.5\n", "c consonant 3.5 6.0" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.groupby([str.lower, mapping]).mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Grouping example\n", "\n", "As an example of this, in a couple lines of Python code we can put all these together and count discovered planets by method and by decade:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
decade1980s1990s2000s2010s
method
Astrometry0.00.00.02.0
Eclipse Timing Variations0.00.05.010.0
Imaging0.00.029.021.0
Microlensing0.00.012.015.0
Orbital Brightness Modulation0.00.00.05.0
Pulsar Timing0.09.01.01.0
Pulsation Timing Variations0.00.01.00.0
Radial Velocity1.052.0475.0424.0
Transit0.00.064.0712.0
Transit Timing Variations0.00.00.09.0
\n", "
" ], "text/plain": [ "decade 1980s 1990s 2000s 2010s\n", "method \n", "Astrometry 0.0 0.0 0.0 2.0\n", "Eclipse Timing Variations 0.0 0.0 5.0 10.0\n", "Imaging 0.0 0.0 29.0 21.0\n", "Microlensing 0.0 0.0 12.0 15.0\n", "Orbital Brightness Modulation 0.0 0.0 0.0 5.0\n", "Pulsar Timing 0.0 9.0 1.0 1.0\n", "Pulsation Timing Variations 0.0 0.0 1.0 0.0\n", "Radial Velocity 1.0 52.0 475.0 424.0\n", "Transit 0.0 0.0 64.0 712.0\n", "Transit Timing Variations 0.0 0.0 0.0 9.0" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "decade = 10 * (planets['year'] // 10)\n", "decade = decade.astype(str) + 's'\n", "decade.name = 'decade'\n", "planets.groupby(['method', decade])['number'].sum().unstack().fillna(0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This shows the power of combining many of the operations we've discussed up to this point when looking at realistic datasets.\n", "We immediately gain a coarse understanding of when and how planets have been discovered over the past several decades!\n", "\n", "Here I would suggest digging into these few lines of code, and evaluating the individual steps to make sure you understand exactly what they are doing to the result.\n", "It's certainly a somewhat complicated example, but understanding these pieces will give you the means to similarly explore your own data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "< [Combining Datasets: Merge and Join](03.07-Merge-and-Join.ipynb) | [Contents](Index.ipynb) | [Pivot Tables](03.09-Pivot-Tables.ipynb) >" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.3" } }, "nbformat": 4, "nbformat_minor": 0 }