{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Statistical Operations\n", "This chapter covers the functions used most often in data analysis." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " @author: sunzhenhang\n", " @zhihu: https://www.zhihu.com/people/HANGZS/activities" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "E:\\ML\\实战\\pandas实用教程 - 副本\n" ] } ], "source": [ "!cd" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "# 1. Numeric statistics\n", "These operations apply only to columns with a numeric dtype and return a Series indexed by the column labels or by the row labels." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.1 Univariate statistics\n", "As the name suggests, each of these statistics describes the distribution of a single column (or row) on its own." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1.1 `.sum()`\n", "#### `DataFrame.sum(axis='index')`\n", "- axis: 'index' sums down each column (one result per column); 'columns' sums across each row (one result per row)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " A B\n", "a 1 2\n", "b 3 5" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame([[1,2],[3,5]], index = ['a','b'],columns = ['A','B'])\n", "df" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "A 4\n", "B 7\n", "dtype: int64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sum() # 按列加" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a 3\n", "b 8\n", "dtype: int64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sum(axis = 'columns') # 按行加" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1.2 `.mean(), .std(), .var()`\n", "均值、标准差、方差" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1.3 `.max(), .min(), .median()`\n", "最大、最小、中值" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "A 0.75\n", "B 0.75\n", "dtype: float64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.mad(axis = 'index')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.2 二元统计\n", "计算任意两列直接的统计量,返回以列索引为新行索引和列索引的DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2.1 `.cov()`\n", "#### `DataFrame.cov(min_periods=None)`\n", "- min_periods:每一列去除NaN后,要求能够参与运算的最少元素个数。" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " B C\n", "0 1 2\n", "1 2 0" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = pd.DataFrame([[1,2],[2,0]],columns = ['B','C'])\n", "df1" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " B C\n", "B 0.5 -1.0\n", "C -1.0 2.0" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.cov()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2.2 `.corr()`\n", "相关系数" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " B C\n", "B 1.0 -1.0\n", "C -1.0 1.0" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.corr()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2.3 `.corrwith()`\n", "corr是自身列之间的关系,而这个函数可以对不同的DataFrame进行运算,不要要记得运算发生在**同名列和同索引的行**之间。\n", "#### `DataFrame.corrwith(other, axis=0, drop=False)`\n", "- other:另一个DataFrame或Series\n", "- axis:'index'或'columns'\n", "- drop:是否丢掉结果中的NaN" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " B C\n", "0 1 2\n", "1 2 0\n", "2 2 3" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = pd.DataFrame([[1,2],[2,0],[2,3]],index = [0,1,2],columns = ['B','C'])\n", "df1" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " A B\n", "0 1 2\n", "1 2 0" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "A NaN\n", "B -1.0\n", "C NaN\n", "dtype: float64" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.corrwith(df1) #只对 同名列 和 同名行 进行计算" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "Name: B, dtype: int64" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = pd.Series([1,2], index = [0,1], name = 'B')\n", "s" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " A B\n", "0 1 2\n", "1 2 0" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "A 1.0\n", "B -1.0\n", "dtype: float64" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.corrwith(s)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "------\n", "# 2. 类型型统计运算" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.1. `value_counts()`\n", "不适合DataFrame。\n", "#### `Series/Index.value_counts(normalize=False, ascending=False, bins=None)`\n", "- normalize:True or False,计算频次或者频率比;\n", "- ascending:True or False,排序方式,默认降序;\n", "- bins:int,pd.cut的一种快捷操作,对连续数值型效果好;" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 1\n", "3 2\n", "4 1\n", "5 3\n", "dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = pd.Series([1,2,1,2,1,3])\n", "s" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1 3\n", "2 2\n", "3 1\n", "dtype: int64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.value_counts()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3 1\n", "2 2\n", "1 3\n", "dtype: int64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.value_counts(ascending = True)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0.997, 2.0] 5\n", "(2.0, 3.0] 1\n", "dtype: int64" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.value_counts( bins = 2) # bins按照int平均分割,左开右闭,左侧外延1%以包含最左值" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "## 2.2 `.count()`\n", "计算统计每一类non-NaN元素个数,这个函数可以快速了解哪些特征或哪些样本缺失比较严重。\n", "#### `DataFrame.count(axis=0)`\n", "- axis: 0-查看列,1-查看行;" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " A B\n", "0 1 2\n", "1 2 0" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "A 2\n", "B 2\n", "dtype: int64" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.count(axis = 0)" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 2\n", "1 2\n", "dtype: int64" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.count(axis = 1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2rc2" }, "toc": { "colors": { "hover_highlight": "#DAA520", "navigate_num": "#000000", "navigate_text": "#333333", "running_highlight": "#FF0000", "selected_highlight": "#FFD700", "sidebar_border": "#EEEEEE", "wrapper_background": "#FFFFFF" }, "moveMenuLeft": true, "nav_menu": { "height": "67px", "width": "253px" }, "navigate_menu": true, "number_sections": false, "sideBar": true, "threshold": "3", "toc_cell": false, "toc_position": { "height": "600px", "left": "0px", "right": "1190.23px", "top": "67px", "width": "232px" }, "toc_section_display": "block", "toc_window_display": true, "widenNotebook": false } }, "nbformat": 4, "nbformat_minor": 2 }