{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# merge详解\n", "merge算是关系数据库用得很多的一个操作,所以单独开一篇细讲这个函数。" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "E:\\ML\\实战\\pandas实用教程\n" ] } ], "source": [ "!cd" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "# 1. 函数说明" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### `pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort = False)`\n", "concat函数本质上是在所有索引上同时进行对齐合并,而如果想在任意**列**上对齐合并,则需要merge函数,其在sql应用很多。\n", "- left,right: 两个要对齐合并的DataFrame;\n", "- how: 先做笛卡尔积操作,然后按照要求,保留需要的,缺失的数据填充NaN;\n", " - left: 以左DataFrame为基准,即左侧DataFrame的数据全部保留(不代表完全一致、可能会存在复制),保持原序\n", " - right: 以右DataFrame为基准,保持原序\n", " - inner: 交,保留左右DataFrame在on上完全一致的行,保持左DataFrame顺序\n", " - outer: 并,按照字典顺序重新排序\n", "- on:列索引或列索引列表,如果要在DataFrame相同的列索引做对齐,用这个参数;\n", "- left_on, right_on, left_index, right_index:\n", " - on对应普通的列索引或列索引列表,对齐不同列名的DataFrame,用这俩参数;\n", " - index对应要使用的index,建议不要使用这俩参数,因为可以用concat方法代替。\n", "- sort: True or False,是否按字典序重新排序。" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
a12
b34
\n", "
" ], "text/plain": [ " A B\n", "a 1 2\n", "b 3 4" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = pd.DataFrame([[1,2],[3,4]], index = ['a','b'],columns = ['A','B'])\n", "df2 = pd.DataFrame([[1,3],[4,8]], index = ['b','d'],columns = ['B','C'])\n", "df1" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
BC
b13
d48
\n", "
" ], "text/plain": [ " B C\n", "b 1 3\n", "d 4 8" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 如果单纯的按照index对齐,不如用concat方法,所以一般不建议使用left_index, right_index 。" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB_xB_yC
b3413
\n", "
" ], "text/plain": [ " A B_x B_y C\n", "b 3 4 1 3" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(left = df1, right = df2, how = 'inner' ,left_index = True, right_index = True)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABBC
b3413
\n", "
" ], "text/plain": [ " A B B C\n", "b 3 4 1 3" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 小区别是concat对重复列没有重命名,但是重名的情况不多,而且重名了说明之前设计就不大合理。\n", "pd.concat([df1,df2], join = 'inner',axis =1) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "--- \n", "# 2. `on` 用法\n", "设置 `how = 'inner'`" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC
0348
\n", "
" ], "text/plain": [ " A B C\n", "0 3 4 8" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#对于'B'列:df1的'b'行、df2的'd'行,是相同的,其他都不同。 \n", "pd.merge(left = df1, right = df2, how = 'inner' , on =['B']) " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB_xB_yC
03413
\n", "
" ], "text/plain": [ " A B_x B_y C\n", "0 3 4 1 3" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# df1的'A'列'b'行,df2的'C'列'd'行是相同的,其他都不同。\n", "# 其他列如果同名会进行重命名。\n", "pd.merge(left = df1, right = df2, how = 'inner',left_on = ['A'] ,right_on = ['C'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3. `how` 用法" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC
012NaN
1348.0
\n", "
" ], "text/plain": [ " A B C\n", "0 1 2 NaN\n", "1 3 4 8.0" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 保持左侧DataFrame不变,用右侧来跟它对齐,对不上的填NaN。\n", "pd.merge(left = df1, right = df2, how = 'left', on = ['B'] )" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC
03.048
1NaN13
\n", "
" ], "text/plain": [ " A B C\n", "0 3.0 4 8\n", "1 NaN 1 3" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 保持右侧DataFrame不变,用右侧来跟它对齐,对不上的填NaN。\n", "pd.merge(left = df1, right = df2, how = 'right', on = ['B'] )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 对齐的列存在重复值\n", "重复的也没关系,操作逻辑是一致的,完全可以假想不存在重复。" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
a14
b34
\n", "
" ], "text/plain": [ " A B\n", "a 1 4\n", "b 3 4" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.loc['a','B'] =4 #改成重复\n", "df1" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC
01.048
13.048
2NaN13
\n", "
" ], "text/plain": [ " A B C\n", "0 1.0 4 8\n", "1 3.0 4 8\n", "2 NaN 1 3" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "### 保持右侧的列都在,如果左侧对齐的列存在重复值,那么对齐上后也存在重复。\n", "pd.merge(left = df1, right = df2, how = 'right', on = ['B'] )" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2rc2" }, "toc": { "colors": { "hover_highlight": "#DAA520", "navigate_num": "#000000", "navigate_text": "#333333", "running_highlight": "#FF0000", "selected_highlight": "#FFD700", "sidebar_border": "#EEEEEE", "wrapper_background": "#FFFFFF" }, "moveMenuLeft": true, "nav_menu": { "height": "84px", "width": "252px" }, "navigate_menu": true, "number_sections": false, "sideBar": true, "threshold": "3", "toc_cell": false, "toc_position": { "height": "485px", "left": "0px", "right": "1146px", "top": "66px", "width": "134px" }, "toc_section_display": "block", "toc_window_display": true, "widenNotebook": false } }, "nbformat": 4, "nbformat_minor": 2 }