{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# merge详解\n",
"merge算是关系数据库用得很多的一个操作,所以单独开一篇细讲这个函数。"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"E:\\ML\\实战\\pandas实用教程\n"
]
}
],
"source": [
"!cd"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"----\n",
"# 1. 函数说明"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### `pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort = False)`\n",
"concat函数本质上是在所有索引上同时进行对齐合并,而如果想在任意**列**上对齐合并,则需要merge函数,其在sql应用很多。\n",
"- left,right: 两个要对齐合并的DataFrame;\n",
"- how: 先做笛卡尔积操作,然后按照要求,保留需要的,缺失的数据填充NaN;\n",
" - left: 以左DataFrame为基准,即左侧DataFrame的数据全部保留(不代表完全一致、可能会存在复制),保持原序\n",
" - right: 以右DataFrame为基准,保持原序\n",
" - inner: 交,保留左右DataFrame在on上完全一致的行,保持左DataFrame顺序\n",
" - outer: 并,按照字典顺序重新排序\n",
"- on:列索引或列索引列表,如果要在DataFrame相同的列索引做对齐,用这个参数;\n",
"- left_on, right_on, left_index, right_index:\n",
" - on对应普通的列索引或列索引列表,对齐不同列名的DataFrame,用这俩参数;\n",
" - index对应要使用的index,建议不要使用这俩参数,因为可以用concat方法代替。\n",
"- sort: True or False,是否按字典序重新排序。"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" b | \n",
" 3 | \n",
" 4 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B\n",
"a 1 2\n",
"b 3 4"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1 = pd.DataFrame([[1,2],[3,4]], index = ['a','b'],columns = ['A','B'])\n",
"df2 = pd.DataFrame([[1,3],[4,8]], index = ['b','d'],columns = ['B','C'])\n",
"df1"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" b | \n",
" 1 | \n",
" 3 | \n",
"
\n",
" \n",
" d | \n",
" 4 | \n",
" 8 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" B C\n",
"b 1 3\n",
"d 4 8"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 如果单纯的按照index对齐,不如用concat方法,所以一般不建议使用left_index, right_index 。"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B_x | \n",
" B_y | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" b | \n",
" 3 | \n",
" 4 | \n",
" 1 | \n",
" 3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B_x B_y C\n",
"b 3 4 1 3"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(left = df1, right = df2, how = 'inner' ,left_index = True, right_index = True)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" b | \n",
" 3 | \n",
" 4 | \n",
" 1 | \n",
" 3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B B C\n",
"b 3 4 1 3"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 小区别是concat对重复列没有重命名,但是重名的情况不多,而且重名了说明之前设计就不大合理。\n",
"pd.concat([df1,df2], join = 'inner',axis =1) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"--- \n",
"# 2. `on` 用法\n",
"设置 `how = 'inner'`"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 3 | \n",
" 4 | \n",
" 8 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C\n",
"0 3 4 8"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#对于'B'列:df1的'b'行、df2的'd'行,是相同的,其他都不同。 \n",
"pd.merge(left = df1, right = df2, how = 'inner' , on =['B']) "
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B_x | \n",
" B_y | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 3 | \n",
" 4 | \n",
" 1 | \n",
" 3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B_x B_y C\n",
"0 3 4 1 3"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# df1的'A'列'b'行,df2的'C'列'd'行是相同的,其他都不同。\n",
"# 其他列如果同名会进行重命名。\n",
"pd.merge(left = df1, right = df2, how = 'inner',left_on = ['A'] ,right_on = ['C'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3. `how` 用法"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" 3 | \n",
" 4 | \n",
" 8.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C\n",
"0 1 2 NaN\n",
"1 3 4 8.0"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 保持左侧DataFrame不变,用右侧来跟它对齐,对不上的填NaN。\n",
"pd.merge(left = df1, right = df2, how = 'left', on = ['B'] )"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 3.0 | \n",
" 4 | \n",
" 8 | \n",
"
\n",
" \n",
" 1 | \n",
" NaN | \n",
" 1 | \n",
" 3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C\n",
"0 3.0 4 8\n",
"1 NaN 1 3"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 保持右侧DataFrame不变,用右侧来跟它对齐,对不上的填NaN。\n",
"pd.merge(left = df1, right = df2, how = 'right', on = ['B'] )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 对齐的列存在重复值\n",
"重复的也没关系,操作逻辑是一致的,完全可以假想不存在重复。"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 1 | \n",
" 4 | \n",
"
\n",
" \n",
" b | \n",
" 3 | \n",
" 4 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B\n",
"a 1 4\n",
"b 3 4"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1.loc['a','B'] =4 #改成重复\n",
"df1"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" 4 | \n",
" 8 | \n",
"
\n",
" \n",
" 1 | \n",
" 3.0 | \n",
" 4 | \n",
" 8 | \n",
"
\n",
" \n",
" 2 | \n",
" NaN | \n",
" 1 | \n",
" 3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C\n",
"0 1.0 4 8\n",
"1 3.0 4 8\n",
"2 NaN 1 3"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"### 保持右侧的列都在,如果左侧对齐的列存在重复值,那么对齐上后也存在重复。\n",
"pd.merge(left = df1, right = df2, how = 'right', on = ['B'] )"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2rc2"
},
"toc": {
"colors": {
"hover_highlight": "#DAA520",
"navigate_num": "#000000",
"navigate_text": "#333333",
"running_highlight": "#FF0000",
"selected_highlight": "#FFD700",
"sidebar_border": "#EEEEEE",
"wrapper_background": "#FFFFFF"
},
"moveMenuLeft": true,
"nav_menu": {
"height": "84px",
"width": "252px"
},
"navigate_menu": true,
"number_sections": false,
"sideBar": true,
"threshold": "3",
"toc_cell": false,
"toc_position": {
"height": "485px",
"left": "0px",
"right": "1146px",
"top": "66px",
"width": "134px"
},
"toc_section_display": "block",
"toc_window_display": true,
"widenNotebook": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}