{ "metadata": { "name": "", "signature": "sha256:4b6279f8f8c98c91277cd35efba1b9942d245a8def2c68055f42c09fe015939d" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Join And Merge Pandas Dataframe\n", "\n", "- **Author:** [Chris Albon](http://www.chrisalbon.com/), [@ChrisAlbon](https://twitter.com/chrisalbon)\n", "- **Date:** -\n", "- **Repo:** [Python 3 code snippets for data science](https://github.com/chrisalbon/code_py)\n", "- **Note:**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### import modules" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "from IPython.display import display\n", "from IPython.display import Image" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a dataframe" ] }, { "cell_type": "code", "collapsed": false, "input": [ "raw_data = {\n", " 'subject_id': ['1', '2', '3', '4', '5'],\n", " 'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'], \n", " 'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}\n", "df_a = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name'])\n", "df_a" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
subject_idfirst_namelast_name
0 1 Alex Anderson
1 2 Amy Ackerman
2 3 Allen Ali
3 4 Alice Aoni
4 5 Ayoung Atiches
\n", "

5 rows \u00d7 3 columns

\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 3, "text": [ " subject_id first_name last_name\n", "0 1 Alex Anderson\n", "1 2 Amy Ackerman\n", "2 3 Allen Ali\n", "3 4 Alice Aoni\n", "4 5 Ayoung Atiches\n", "\n", "[5 rows x 3 columns]" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a second dataframe" ] }, { "cell_type": "code", "collapsed": false, "input": [ "raw_data = {\n", " 'subject_id': ['4', '5', '6', '7', '8'],\n", " 'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], \n", " 'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}\n", "df_b = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name'])\n", "df_b" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
subject_idfirst_namelast_name
0 4 Billy Bonder
1 5 Brian Black
2 6 Bran Balwner
3 7 Bryce Brice
4 8 Betty Btisan
\n", "

5 rows \u00d7 3 columns

\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 4, "text": [ " subject_id first_name last_name\n", "0 4 Billy Bonder\n", "1 5 Brian Black\n", "2 6 Bran Balwner\n", "3 7 Bryce Brice\n", "4 8 Betty Btisan\n", "\n", "[5 rows x 3 columns]" ] } ], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a third dataframe" ] }, { "cell_type": "code", "collapsed": false, "input": [ "raw_data = {\n", " 'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],\n", " 'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}\n", "df_n = pd.DataFrame(raw_data, columns = ['subject_id','test_id'])\n", "df_n" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
subject_idtest_id
0 1 51
1 2 15
2 3 15
3 4 61
4 5 16
5 7 14
6 8 15
7 9 1
8 10 61
9 11 16
\n", "

10 rows \u00d7 2 columns

\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 5, "text": [ " subject_id test_id\n", "0 1 51\n", "1 2 15\n", "2 3 15\n", "3 4 61\n", "4 5 16\n", "5 7 14\n", "6 8 15\n", "7 9 1\n", "8 10 61\n", "9 11 16\n", "\n", "[10 rows x 2 columns]" ] } ], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Join the two dataframes along rows" ] }, { "cell_type": "code", "collapsed": false, "input": [ "df_new = pd.concat([df_a, df_b])\n", "df_new" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
subject_idfirst_namelast_name
0 1 Alex Anderson
1 2 Amy Ackerman
2 3 Allen Ali
3 4 Alice Aoni
4 5 Ayoung Atiches
0 4 Billy Bonder
1 5 Brian Black
2 6 Bran Balwner
3 7 Bryce Brice
4 8 Betty Btisan
\n", "

10 rows \u00d7 3 columns

\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 6, "text": [ " subject_id first_name last_name\n", "0 1 Alex Anderson\n", "1 2 Amy Ackerman\n", "2 3 Allen Ali\n", "3 4 Alice Aoni\n", "4 5 Ayoung Atiches\n", "0 4 Billy Bonder\n", "1 5 Brian Black\n", "2 6 Bran Balwner\n", "3 7 Bryce Brice\n", "4 8 Betty Btisan\n", "\n", "[10 rows x 3 columns]" ] } ], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Join the two dataframes along columns" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pd.concat([df_a, df_b], axis=1)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
subject_idfirst_namelast_namesubject_idfirst_namelast_name
0 1 Alex Anderson 4 Billy Bonder
1 2 Amy Ackerman 5 Brian Black
2 3 Allen Ali 6 Bran Balwner
3 4 Alice Aoni 7 Bryce Brice
4 5 Ayoung Atiches 8 Betty Btisan
\n", "

5 rows \u00d7 6 columns

\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 7, "text": [ " subject_id first_name last_name subject_id first_name last_name\n", "0 1 Alex Anderson 4 Billy Bonder\n", "1 2 Amy Ackerman 5 Brian Black\n", "2 3 Allen Ali 6 Bran Balwner\n", "3 4 Alice Aoni 7 Bryce Brice\n", "4 5 Ayoung Atiches 8 Betty Btisan\n", "\n", "[5 rows x 6 columns]" ] } ], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Merge two dataframes along the subject_id value" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pd.merge(df_new, df_n, on='subject_id')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Merge two dataframes with both the left and right dataframes using the subject_id key" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pd.merge(df_new, df_n, left_on='subject_id', right_on='subject_id')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Merge with outer join\n", "\n", "\"Full outer join produces the set of all records in Table A and Table B, with matching records from both sides where available. If there is no match, the missing side will contain null.\" - [source](http://blog.codinghorror.com/a-visual-explanation-of-sql-joins/)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pd.merge(df_a, df_b, on='subject_id', how='outer')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Merge with inner join\n", "\n", "\"Inner join produces only the set of records that match in both Table A and Table B.\" - [source](http://blog.codinghorror.com/a-visual-explanation-of-sql-joins/)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pd.merge(df_a, df_b, on='subject_id', how='inner')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Merge with right join" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pd.merge(df_a, df_b, on='subject_id', how='right')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Merge with left join\n", "\n", "\"Left outer join produces a complete set of records from Table A, with the matching records (where available) in Table B. If there is no match, the right side will contain null.\" - [source](http://blog.codinghorror.com/a-visual-explanation-of-sql-joins/)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pd.merge(df_a, df_b, on='subject_id', how='left')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Merge while adding a suffix to duplicate column names" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pd.merge(df_a, df_b, on='subject_id', how='left', suffixes=('_left', '_right'))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Merge based on indexes" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pd.merge(df_a, df_b, right_index=True, left_index=True)" ], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }