{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*\n", "\n", "*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Combining Datasets: Concat and Append" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some of the most interesting studies of data come from combining different data sources.\n", "These operations can involve anything from very straightforward concatenation of two different datasets, to more complicated database-style joins and merges that correctly handle any overlaps between the datasets.\n", "``Series`` and ``DataFrame``s are built with this type of operation in mind, and Pandas includes functions and methods that make this sort of data wrangling fast and straightforward.\n", "\n", "Here we'll take a look at simple concatenation of ``Series`` and ``DataFrame``s with the ``pd.concat`` function; later we'll dive into more sophisticated in-memory merges and joins implemented in Pandas.\n", "\n", "We begin with the standard imports:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For convenience, we'll define a function that creates a ``DataFrame`` of a particular form that will be useful below:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
|   | A  | B  | C  |
|---|----|----|----|
| 0 | A0 | B0 | C0 |
| 1 | A1 | B1 | C1 |
| 2 | A2 | B2 | C2 |
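The table above has the shape produced by a small helper that pastes each column label onto each index label. The defining cell is not reproduced here, so the following is only a sketch of such a helper; the name `make_df` and its arguments `cols` and `ind` are assumptions made for illustration.

```python
import pandas as pd

def make_df(cols, ind):
    """Build a DataFrame whose cells combine the column and index labels."""
    # Hypothetical helper: column c holds the strings c + str(i) for each label i.
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, index=list(ind))

# Three columns (A, B, C) and three rows (0, 1, 2), as in the table above.
example = make_df('ABC', range(3))
print(example)
```

Because a string is iterable, passing `'ABC'` yields one column per character; a list of labels would work the same way.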
df1
|   | A  | B  |
|---|----|----|
| 1 | A1 | B1 |
| 2 | A2 | B2 |
df2
|   | A  | B  |
|---|----|----|
| 3 | A3 | B3 |
| 4 | A4 | B4 |
pd.concat([df1, df2])
|   | A  | B  |
|---|----|----|
| 1 | A1 | B1 |
| 2 | A2 | B2 |
| 3 | A3 | B3 |
| 4 | A4 | B4 |
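As a runnable version of the example above, the sketch below rebuilds `df1` and `df2` by hand (their contents are read off the tables) and stacks them. It also shows that `pd.concat` accepts `Series` in the same way. The names `stacked`, `s1`, `s2`, and `ser` are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2']}, index=[1, 2])
df2 = pd.DataFrame({'A': ['A3', 'A4'], 'B': ['B3', 'B4']}, index=[3, 4])

# With the default axis=0, the rows of df2 are placed below the rows of df1.
stacked = pd.concat([df1, df2])
print(stacked)

# The same call works for one-dimensional Series.
s1 = pd.Series(['A', 'B'], index=[1, 2])
s2 = pd.Series(['C', 'D'], index=[3, 4])
ser = pd.concat([s1, s2])
print(ser)
```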
df3
|   | A  | B  |
|---|----|----|
| 0 | A0 | B0 |
| 1 | A1 | B1 |
df4
|   | C  | D  |
|---|----|----|
| 0 | C0 | D0 |
| 1 | C1 | D1 |
pd.concat([df3, df4], axis='columns')
|   | A  | B  | C  | D  |
|---|----|----|----|----|
| 0 | A0 | B0 | C0 | D0 |
| 1 | A1 | B1 | C1 | D1 |
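Column-wise concatenation aligns the inputs on their row index and places them side by side. The following is a minimal sketch with the two frames rebuilt by hand from the tables above; `side_by_side` and `same` are illustrative names. Recent pandas releases accept `axis=1` or `axis='columns'` here, while abbreviations such as `'col'` are, as far as we know, rejected.

```python
import pandas as pd

df3 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df4 = pd.DataFrame({'C': ['C0', 'C1'], 'D': ['D0', 'D1']})

# axis=1 and axis='columns' are two spellings of the same request.
side_by_side = pd.concat([df3, df4], axis=1)
same = pd.concat([df3, df4], axis='columns')

print(side_by_side)
print(side_by_side.equals(same))
```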
x
|   | A  | B  |
|---|----|----|
| 0 | A0 | B0 |
| 1 | A1 | B1 |
y
|   | A  | B  |
|---|----|----|
| 0 | A2 | B2 |
| 1 | A3 | B3 |
pd.concat([x, y])
|   | A  | B  |
|---|----|----|
| 0 | A0 | B0 |
| 1 | A1 | B1 |
| 0 | A2 | B2 |
| 1 | A3 | B3 |
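The result above keeps the index labels of both inputs, so `0` and `1` each appear twice. If repeated labels should be treated as an error rather than silently preserved, one option is the `verify_integrity` flag of `pd.concat`. The sketch below rebuilds `x` and `y` by hand; `combined` and `raised` are illustrative names.

```python
import pandas as pd

x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})

# By default the original labels are preserved, duplicates included.
combined = pd.concat([x, y])
print(combined.index.tolist())
print(combined.index.is_unique)

# verify_integrity=True asks concat to raise if the result index has repeats.
raised = False
try:
    pd.concat([x, y], verify_integrity=True)
except ValueError as err:
    raised = True
    print("ValueError:", err)
```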
x
|   | A  | B  |
|---|----|----|
| 0 | A0 | B0 |
| 1 | A1 | B1 |
y
|   | A  | B  |
|---|----|----|
| 0 | A2 | B2 |
| 1 | A3 | B3 |
pd.concat([x, y], ignore_index=True)
|   | A  | B  |
|---|----|----|
| 0 | A0 | B0 |
| 1 | A1 | B1 |
| 2 | A2 | B2 |
| 3 | A3 | B3 |
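When the index labels carry no meaning, `ignore_index=True` discards them and numbers the result from zero, as the table above shows. A minimal sketch, again with `x` and `y` rebuilt by hand and `fresh` as an illustrative name:

```python
import pandas as pd

x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})

# The input labels are dropped; the result gets a new 0..n-1 index.
fresh = pd.concat([x, y], ignore_index=True)
print(fresh)
print(fresh.index.is_unique)
```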
x
|   | A  | B  |
|---|----|----|
| 0 | A0 | B0 |
| 1 | A1 | B1 |
y
|   | A  | B  |
|---|----|----|
| 0 | A2 | B2 |
| 1 | A3 | B3 |
pd.concat([x, y], keys=['x', 'y'])
|   |   | A  | B  |
|---|---|----|----|
| x | 0 | A0 | B0 |
|   | 1 | A1 | B1 |
| y | 0 | A2 | B2 |
|   | 1 | A3 | B3 |
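Passing `keys` labels each input and produces a hierarchically indexed result, so the original pieces stay addressable. The sketch below, with `x` and `y` rebuilt by hand and `nested` as an illustrative name, shows how the outer level can be used to pull one source back out.

```python
import pandas as pd

x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})

# Each key becomes the outer level of a MultiIndex on the rows.
nested = pd.concat([x, y], keys=['x', 'y'])
print(nested.index.nlevels)

# Selecting on the outer level recovers the rows that came from y.
print(nested.loc['y'])

# A (key, label) tuple addresses a single row of a single source.
print(nested.loc[('x', 1), 'B'])
```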
df5
|   | A  | B  | C  |
|---|----|----|----|
| 1 | A1 | B1 | C1 |
| 2 | A2 | B2 | C2 |
df6
|   | B  | C  | D  |
|---|----|----|----|
| 3 | B3 | C3 | D3 |
| 4 | B4 | C4 | D4 |
pd.concat([df5, df6])
|   | A   | B  | C  | D   |
|---|-----|----|----|-----|
| 1 | A1  | B1 | C1 | NaN |
| 2 | A2  | B2 | C2 | NaN |
| 3 | NaN | B3 | C3 | D3  |
| 4 | NaN | B4 | C4 | D4  |
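When the inputs do not share all column names, the default behaviour is a union of the columns, with missing entries filled by `NaN`. The sketch below rebuilds `df5` and `df6` by hand from the tables above and checks where the gaps fall; `outer` is an illustrative name.

```python
import pandas as pd

df5 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2'], 'C': ['C1', 'C2']},
                   index=[1, 2])
df6 = pd.DataFrame({'B': ['B3', 'B4'], 'C': ['C3', 'C4'], 'D': ['D3', 'D4']},
                   index=[3, 4])

# join='outer' is the default: every column from either input is kept.
outer = pd.concat([df5, df6])
print(outer)

# Rows from df6 have no 'A'; rows from df5 have no 'D'.
print(outer['A'].isna().tolist())
print(outer['D'].isna().tolist())
```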
df5
|   | A  | B  | C  |
|---|----|----|----|
| 1 | A1 | B1 | C1 |
| 2 | A2 | B2 | C2 |
df6
|   | B  | C  | D  |
|---|----|----|----|
| 3 | B3 | C3 | D3 |
| 4 | B4 | C4 | D4 |
pd.concat([df5, df6], join='inner')
|   | B  | C  |
|---|----|----|
| 1 | B1 | C1 |
| 2 | B2 | C2 |
| 3 | B3 | C3 |
| 4 | B4 | C4 |
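With `join='inner'` only the columns common to every input survive, so no missing values are introduced. A minimal sketch with `df5` and `df6` rebuilt by hand; `inner` is an illustrative name.

```python
import pandas as pd

df5 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2'], 'C': ['C1', 'C2']},
                   index=[1, 2])
df6 = pd.DataFrame({'B': ['B3', 'B4'], 'C': ['C3', 'C4'], 'D': ['D3', 'D4']},
                   index=[3, 4])

# Only the intersection of the column sets (B and C) is kept.
inner = pd.concat([df5, df6], join='inner')
print(inner)
print(inner.isna().any().any())
```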
df5
|   | A  | B  | C  |
|---|----|----|----|
| 1 | A1 | B1 | C1 |
| 2 | A2 | B2 | C2 |
df6
|   | B  | C  | D  |
|---|----|----|----|
| 3 | B3 | C3 | D3 |
| 4 | B4 | C4 | D4 |
pd.concat([df5, df6.reindex(columns=df5.columns)])
|   | A   | B  | C  |
|---|-----|----|----|
| 1 | A1  | B1 | C1 |
| 2 | A2  | B2 | C2 |
| 3 | NaN | B3 | C3 |
| 4 | NaN | B4 | C4 |
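To keep exactly the columns of the first input, as in the table above, one option in current pandas is to reindex the other input onto those columns before concatenating (the older `join_axes` argument was removed in pandas 1.0). The following is a sketch under that assumption, with `df5` and `df6` rebuilt by hand and `aligned` as an illustrative name.

```python
import pandas as pd

df5 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2'], 'C': ['C1', 'C2']},
                   index=[1, 2])
df6 = pd.DataFrame({'B': ['B3', 'B4'], 'C': ['C3', 'C4'], 'D': ['D3', 'D4']},
                   index=[3, 4])

# Reindexing df6 drops its extra column D and adds an all-NaN column A,
# so both inputs share the columns of df5 before they are stacked.
aligned = pd.concat([df5, df6.reindex(columns=df5.columns)])
print(aligned)
```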
df1
|   | A  | B  |
|---|----|----|
| 1 | A1 | B1 |
| 2 | A2 | B2 |
df2
|   | A  | B  |
|---|----|----|
| 3 | A3 | B3 |
| 4 | A4 | B4 |
df1.append(df2)
|   | A  | B  |
|---|----|----|
| 1 | A1 | B1 |
| 2 | A2 | B2 |
| 3 | A3 | B3 |
| 4 | A4 | B4 |
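The `append` method shown above returns a new object rather than modifying `df1`. It was deprecated in pandas 1.4 and removed in pandas 2.0, so on recent releases the same result comes from `pd.concat`. When several pieces are built one after another, a common pattern is to collect them in a list and concatenate once at the end. The sketch below assumes a hypothetical helper `make_piece` that stands in for whatever produces each piece.

```python
import pandas as pd

def make_piece(start):
    # Hypothetical helper: two rows labelled start and start + 1.
    idx = [start, start + 1]
    return pd.DataFrame({'A': [f'A{i}' for i in idx],
                         'B': [f'B{i}' for i in idx]}, index=idx)

# Collect the pieces first, then concatenate once.
pieces = [make_piece(s) for s in (1, 3)]
result = pd.concat(pieces)
print(result)

# The inputs themselves are left unchanged.
print(len(pieces[0]))
```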