{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Statistical Operations\n",
"This chapter covers the functions used most often in data analysis."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" @author: sunzhenhang\n",
" @zhihu: https://www.zhihu.com/people/HANGZS/activities"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"E:\\ML\\实战\\pandas实用教程 - 副本\n"
]
}
],
"source": [
"!cd"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd"
]
},
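{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"# 1. Numeric Statistics\n",
"These operations apply only to columns with a numeric dtype. Each returns a Series indexed by the column labels or the row labels, depending on `axis`."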
]
},
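{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1.1 Univariate Statistics\n",
"As the name suggests, each of these statistics describes the distribution of a single column (or row) on its own."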
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.1.1 `.sum()`\n",
"#### `DataFrame.sum(axis='index')`\n",
"- axis: `'index'` sums down each column (one result per column); `'columns'` sums across each row (one result per row)."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" A B\n",
"a 1 2\n",
"b 3 5"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame([[1,2],[3,5]], index = ['a','b'],columns = ['A','B'])\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"A 4\n",
"B 7\n",
"dtype: int64"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sum()  # sum down each column"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"a 3\n",
"b 8\n",
"dtype: int64"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sum(axis='columns')  # sum across each row"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.1.2 `.mean(), .std(), .var()`\n",
"Mean, standard deviation, and variance."
]
},
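{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the three reductions above, assuming a small illustrative frame (`demo_spread` is a hypothetical name, not part of the original notebook). pandas defaults to the sample estimator (`ddof=1`) for `.std()` and `.var()`; passing `ddof=0` gives the population version."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical demo frame, not taken from the original notebook\n",
"demo_spread = pd.DataFrame({'A': [1, 3], 'B': [2, 5]}, index=['a', 'b'])\n",
"\n",
"col_mean = demo_spread.mean()       # mean of each column\n",
"col_std = demo_spread.std()         # sample standard deviation (ddof=1)\n",
"col_var = demo_spread.var()         # sample variance (ddof=1)\n",
"pop_var = demo_spread.var(ddof=0)   # population variance\n",
"pd.DataFrame({'mean': col_mean, 'std': col_std, 'var': col_var, 'pop_var': pop_var})"
]
},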
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.1.3 `.max(), .min(), .median()`\n",
"Maximum, minimum, and median."
]
},
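{
"cell_type": "markdown",
"metadata": {},
"source": [
"The order statistics above follow the same calling pattern as `.sum()`. A short sketch on a hypothetical frame (`demo_order` is an illustrative name, not from the original notebook), including the `axis='columns'` variant:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical demo frame, not taken from the original notebook\n",
"demo_order = pd.DataFrame({'A': [1, 3, 8], 'B': [2, 5, 4]})\n",
"\n",
"col_max = demo_order.max()                 # largest value in each column\n",
"col_min = demo_order.min()                 # smallest value in each column\n",
"col_median = demo_order.median()           # middle value of each column\n",
"row_max = demo_order.max(axis='columns')   # largest value in each row\n",
"row_max"
]
},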
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"A 0.75\n",
"B 0.75\n",
"dtype: float64"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
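],
"source": [
"(df - df.mean()).abs().mean()  # mean absolute deviation per column; replaces DataFrame.mad(), which was removed in pandas 2.0"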
]
},
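{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1.2 Bivariate Statistics\n",
"These compute a statistic for every pair of columns and return a DataFrame whose row index and column index are both the original column labels."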
]
},
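{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.2.1 `.cov()`\n",
"#### `DataFrame.cov(min_periods=None)`\n",
"- min_periods: the minimum number of non-NaN observations a pair of columns must share to produce a valid result; pairs with fewer yield NaN."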
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" B C\n",
"0 1 2\n",
"1 2 0"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1 = pd.DataFrame([[1,2],[2,0]],columns = ['B','C'])\n",
"df1"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" B C\n",
"B 0.5 -1.0\n",
"C -1.0 2.0"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1.cov()"
]
},
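{
"cell_type": "markdown",
"metadata": {},
"source": [
"The effect of `min_periods` only shows up when values are missing. A minimal sketch, assuming a hypothetical frame `df_nan` (not part of the original notebook) with one NaN in column `C`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical frame with one missing value in column 'C'\n",
"df_nan = pd.DataFrame({'B': [1.0, 2.0, 3.0], 'C': [2.0, np.nan, 0.0]})\n",
"\n",
"# Default: each pair of columns uses only the rows where both values are present\n",
"cov_default = df_nan.cov()\n",
"\n",
"# Require 3 shared valid rows: any pair involving 'C' has only 2, so it becomes NaN\n",
"cov_strict = df_nan.cov(min_periods=3)\n",
"cov_strict"
]
},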
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.2.2 `.corr()`\n",
"Correlation coefficients (Pearson by default)."
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" B C\n",
"B 1.0 -1.0\n",
"C -1.0 1.0"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1.corr()"
]
},
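{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.2.3 `.corrwith()`\n",
"`.corr()` relates the columns of one DataFrame to each other, whereas `.corrwith()` correlates two different objects. Keep in mind that the computation only pairs up **columns with the same name and rows with the same index label**.\n",
"#### `DataFrame.corrwith(other, axis=0, drop=False)`\n",
"- other: another DataFrame or a Series\n",
"- axis: `'index'` (0) correlates matching columns, `'columns'` (1) correlates matching rows\n",
"- drop: whether to drop labels that exist in only one of the two objects; when False they are kept in the result as NaN"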
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" B C\n",
"0 1 2\n",
"1 2 0\n",
"2 2 3"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1 = pd.DataFrame([[1,2],[2,0],[2,3]],index = [0,1,2],columns = ['B','C'])\n",
"df1"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" A B\n",
"0 1 2\n",
"1 2 0"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
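],
"source": [
"df = pd.DataFrame([[1,2],[2,0]], index = [0,1], columns = ['A','B'])  # redefine df so it matches the frame displayed in the output\n",
"df"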
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"A NaN\n",
"B -1.0\n",
"C NaN\n",
"dtype: float64"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.corrwith(df1)  # only columns with the same name and rows with the same index label are paired"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 1\n",
"1 2\n",
"Name: B, dtype: int64"
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s = pd.Series([1,2], index = [0,1], name = 'B')\n",
"s"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
" A B\n",
"0 1 2\n",
"1 2 0"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"A 1.0\n",
"B -1.0\n",
"dtype: float64"
]
},
"execution_count": 72,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.corrwith(s)"
]
},
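{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `drop` parameter is easiest to see side by side. A minimal sketch, assuming two hypothetical frames `left` and `right` (illustrative names, not from the original notebook) that share only column `B`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical frames that share only column 'B' and index labels 0 and 1\n",
"left = pd.DataFrame({'A': [1, 2], 'B': [2, 0]})\n",
"right = pd.DataFrame({'B': [1, 2, 2], 'C': [2, 0, 3]})\n",
"\n",
"kept = left.corrwith(right)                 # 'A' and 'C' have no partner column, so they are NaN\n",
"dropped = left.corrwith(right, drop=True)   # only the shared column 'B' remains\n",
"dropped"
]
},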
{
"cell_type": "markdown",
"metadata": {},
"source": [
"------\n",
"# 2. Categorical Statistics"
]
},
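{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.1 `.value_counts()`\n",
"Defined on Series and Index. (`DataFrame.value_counts()` was only added in pandas 1.1, where it counts unique rows.)\n",
"#### `Series/Index.value_counts(normalize=False, ascending=False, bins=None)`\n",
"- normalize: True or False; return relative frequencies instead of raw counts;\n",
"- ascending: True or False; sort order of the counts, descending by default;\n",
"- bins: int; a shortcut for `pd.cut` that groups values into intervals before counting, most useful for continuous numeric data;"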
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"0 1\n",
"1 2\n",
"2 1\n",
"3 2\n",
"4 1\n",
"5 3\n",
"dtype: int64"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s = pd.Series([1,2,1,2,1,3])\n",
"s"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1 3\n",
"2 2\n",
"3 1\n",
"dtype: int64"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s.value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3 1\n",
"2 2\n",
"1 3\n",
"dtype: int64"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s.value_counts(ascending = True)"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(0.997, 2.0] 5\n",
"(2.0, 3.0] 1\n",
"dtype: int64"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
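],
"source": [
"s.value_counts(bins=2)  # splits the value range into 2 equal-width intervals, open on the left and closed on the right; the leftmost edge is extended slightly (0.1% of the range, per the pd.cut docs) so the minimum is included"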
]
},
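{
"cell_type": "markdown",
"metadata": {},
"source": [
"The examples above leave out `normalize`. A minimal sketch of it, using a hypothetical Series `labels` (an illustrative name) with the same values as `s`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical Series with the same values as s above\n",
"labels = pd.Series([1, 2, 1, 2, 1, 3])\n",
"\n",
"# normalize=True divides each count by the total, so the result sums to 1\n",
"freq = labels.value_counts(normalize=True)\n",
"freq"
]
},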
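{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.2 `.count()`\n",
"Counts the non-NaN entries along each column or row, which gives a quick view of which features or samples have the most missing values.\n",
"#### `DataFrame.count(axis=0)`\n",
"- axis: 0 counts per column, 1 counts per row;"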
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" A B\n",
"0 1 2\n",
"1 2 0"
]
},
"execution_count": 73,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/plain": [
"A 2\n",
"B 2\n",
"dtype: int64"
]
},
"execution_count": 74,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.count(axis = 0)"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 2\n",
"1 2\n",
"dtype: int64"
]
},
"execution_count": 75,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.count(axis = 1)"
]
},
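{
"cell_type": "markdown",
"metadata": {},
"source": [
"The frame above has no missing values, so every count equals the full length. A minimal sketch with NaNs, assuming a hypothetical frame `df_missing` (not part of the original notebook), shows how `.count()` exposes sparse columns and rows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical frame with missing values\n",
"df_missing = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [np.nan, np.nan, 6.0]})\n",
"\n",
"per_column = df_missing.count(axis=0)   # non-NaN entries in each column\n",
"per_row = df_missing.count(axis=1)      # non-NaN entries in each row\n",
"\n",
"# Share of missing values per column, derived from the counts\n",
"missing_share = 1 - per_column / len(df_missing)\n",
"missing_share"
]
},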
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2rc2"
},
"toc": {
"colors": {
"hover_highlight": "#DAA520",
"navigate_num": "#000000",
"navigate_text": "#333333",
"running_highlight": "#FF0000",
"selected_highlight": "#FFD700",
"sidebar_border": "#EEEEEE",
"wrapper_background": "#FFFFFF"
},
"moveMenuLeft": true,
"nav_menu": {
"height": "67px",
"width": "253px"
},
"navigate_menu": true,
"number_sections": false,
"sideBar": true,
"threshold": "3",
"toc_cell": false,
"toc_position": {
"height": "600px",
"left": "0px",
"right": "1190.23px",
"top": "67px",
"width": "232px"
},
"toc_section_display": "block",
"toc_window_display": true,
"widenNotebook": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}