{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 5.3 Summarizing and Computing Descriptive Statistics（汇总和描述性统计）\n", "\n", "pandas有很多数学和统计方法。大部分可以归类为降维或汇总统计，这些方法是用来从series中提取单个值（比如sum或mean）。还有一些方法来处理缺失值：" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "

	one	two
a	1.40	NaN
b	7.10	-4.5
c	NaN	NaN
d	0.75	-1.3

\n", "

" ], "text/plain": [ " one two\n", "a 1.40 NaN\n", "b 7.10 -4.5\n", "c NaN NaN\n", "d 0.75 -1.3" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],\n", " [np.nan, np.nan], [0.75, -1.3]],\n", " index=['a', 'b', 'c', 'd'],\n", " columns=['one', 'two'])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "使用sum的话，会返回一个series:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "one 9.25\n", "two -5.80\n", "dtype: float64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "使用`axis='columns'` or `axis=1`，计算列之间的和：" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 1.40\n", "b 2.60\n", "c 0.00\n", "d -0.55\n", "dtype: float64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sum(axis='columns')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "计算的时候，NA（即缺失值）会被除外，除非整个切片全是NA。我们可以用skipna来跳过计算NA：" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a NaN\n", "b 1.300\n", "c NaN\n", "d -0.275\n", "dtype: float64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.mean(axis='columns', skipna=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "一些reduction方法：\n", "\n", "![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/n95fy.png)\n", "\n", "一些方法，比如idxmin和idxmax，能返回间接的统计值，比如index value：" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "

	one	two
a	1.40	NaN
b	7.10	-4.5
c	NaN	NaN
d	0.75	-1.3

\n", "

" ], "text/plain": [ " one two\n", "a 1.40 NaN\n", "b 7.10 -4.5\n", "c NaN NaN\n", "d 0.75 -1.3" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "one b\n", "two d\n", "dtype: object" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.idxmax()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "还能计算累加值：" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "

	one	two
a	1.40	NaN
b	8.50	-4.5
c	NaN	NaN
d	9.25	-5.8

\n", "

" ], "text/plain": [ " one two\n", "a 1.40 NaN\n", "b 8.50 -4.5\n", "c NaN NaN\n", "d 9.25 -5.8" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.cumsum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "另一种类型既不是降维，也不是累加。describe能一下子产生多维汇总数据：" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/xu/anaconda/envs/py35/lib/python3.5/site-packages/numpy/lib/function_base.py:4116: RuntimeWarning: Invalid value encountered in percentile\n", " interpolation=interpolation)\n" ] }, { "data": { "text/html": [ "

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "

	one	two
count	3.000000	2.000000
mean	3.083333	-2.900000
std	3.493685	2.262742
min	0.750000	-4.500000
25%	NaN	NaN
50%	NaN	NaN
75%	NaN	NaN
max	7.100000	-1.300000

\n", "

" ], "text/plain": [ " one two\n", "count 3.000000 2.000000\n", "mean 3.083333 -2.900000\n", "std 3.493685 2.262742\n", "min 0.750000 -4.500000\n", "25% NaN NaN\n", "50% NaN NaN\n", "75% NaN NaN\n", "max 7.100000 -1.300000" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "对于非数值性的数据，describe能产生另一种汇总统计：" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 a\n", "1 a\n", "2 b\n", "3 c\n", "4 a\n", "5 a\n", "6 b\n", "7 c\n", "8 a\n", "9 a\n", "10 b\n", "11 c\n", "12 a\n", "13 a\n", "14 b\n", "15 c\n", "dtype: object" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj = pd.Series(['a', 'a', 'b', 'c'] * 4)\n", "obj" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "count 16\n", "unique 3\n", "top a\n", "freq 8\n", "dtype: object" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "下表是一些描述和汇总统计数据：\n", "![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/wygi1.png)\n", "\n", "# 1 Correlation and Covariance (相关性和协方差)\n", "\n", "假设DataFrame时股价和股票数量。这些数据取自yahoo finace，用padas-datareader包能加载。如果没有的话，用conda或pip来下载这个包：" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "conda install pandas-datareader" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas_datareader.data as web" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "all_data = {ticker: web.get_data_yahoo(ticker)\n", " for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}\n", "\n", "price = pd.DataFrame({ticker: data['Adj Close']\n", " for ticker, data in all_data.items()})\n", "\n", "volumn = pd.DataFrame({ticker: data['Volumn']\n", " for ticker, data in all_data.items()})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "上面的代码无法直接从yahoo上爬取数据，因为yahoo被verizon收购后，好像是不能用了。于是这里我们直接从下好的数据包里加载。" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5.1 Introduction to pandas Data Structures（pandas的数据结构）.ipynb\r\n", "5.2 Essential Functionality（主要功能）.ipynb\r\n", "5.3 Summarizing and Computing Descriptive Statistics（总结和描述性统计）.ipynb\r\n" ] } ], "source": [ "ls ../examples/" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [], "source": [ "price = pd.read_pickle('../examples/yahoo_price.pkl')\n", "volume = pd.read_pickle('../examples/yahoo_volume.pkl')" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "

	AAPL	GOOG	IBM	MSFT
Date
2010-01-04	27.990226	313.062468	113.304536	25.884104
2010-01-05	28.038618	311.683844	111.935822	25.892466
2010-01-06	27.592626	303.826685	111.208683	25.733566
2010-01-07	27.541619	296.753749	110.823732	25.465944
2010-01-08	27.724725	300.709808	111.935822	25.641571

\n", "

" ], "text/plain": [ " AAPL GOOG IBM MSFT\n", "Date \n", "2010-01-04 27.990226 313.062468 113.304536 25.884104\n", "2010-01-05 28.038618 311.683844 111.935822 25.892466\n", "2010-01-06 27.592626 303.826685 111.208683 25.733566\n", "2010-01-07 27.541619 296.753749 110.823732 25.465944\n", "2010-01-08 27.724725 300.709808 111.935822 25.641571" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "price.head()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "

	AAPL	GOOG	IBM	MSFT
Date
2010-01-04	123432400	3927000	6155300	38409100
2010-01-05	150476200	6031900	6841400	49749600
2010-01-06	138040000	7987100	5605300	58182400
2010-01-07	119282800	12876600	5840600	50559700
2010-01-08	111902700	9483900	4197200	51197400

\n", "

" ], "text/plain": [ " AAPL GOOG IBM MSFT\n", "Date \n", "2010-01-04 123432400 3927000 6155300 38409100\n", "2010-01-05 150476200 6031900 6841400 49749600\n", "2010-01-06 138040000 7987100 5605300 58182400\n", "2010-01-07 119282800 12876600 5840600 50559700\n", "2010-01-08 111902700 9483900 4197200 51197400" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "volume.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> pct_change(): 这个函数用来计算同colnums两个相邻的数字之间的变化率\n", "\n", "现在我们计算一下价格百分比的变化：" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "

	AAPL	GOOG	IBM	MSFT
Date
2016-10-17	-0.000680	0.001837	0.002072	-0.003483
2016-10-18	-0.000681	0.019616	-0.026168	0.007690
2016-10-19	-0.002979	0.007846	0.003583	-0.002255
2016-10-20	-0.000512	-0.005652	0.001719	-0.004867
2016-10-21	-0.003930	0.003011	-0.012474	0.042096

\n", "

" ], "text/plain": [ " AAPL GOOG IBM MSFT\n", "Date \n", "2016-10-17 -0.000680 0.001837 0.002072 -0.003483\n", "2016-10-18 -0.000681 0.019616 -0.026168 0.007690\n", "2016-10-19 -0.002979 0.007846 0.003583 -0.002255\n", "2016-10-20 -0.000512 -0.005652 0.001719 -0.004867\n", "2016-10-21 -0.003930 0.003011 -0.012474 0.042096" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "returns = price.pct_change()\n", "returns.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "series的corr方法计算两个，重合的，非NA的，通过index排列好的series。cov计算方差：" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.4997636114415116" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "returns['MSFT'].corr(returns['IBM'])" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "8.8706554797035489e-05" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "returns['MSFT'].cov(returns['IBM'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "因为MSFT是一个有效的python属性，我们可以通过更简洁的方式来选中columns：" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.4997636114415116" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "returns.MSFT.corr(returns.IBM)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "dataframe的corr和cov方法，能返回一个完整的相似性或方差矩阵：" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "

	AAPL	GOOG	IBM	MSFT
AAPL	1.000000	0.407919	0.386817	0.389695
GOOG	0.407919	1.000000	0.405099	0.465919
IBM	0.386817	0.405099	1.000000	0.499764
MSFT	0.389695	0.465919	0.499764	1.000000

\n", "

" ], "text/plain": [ " AAPL GOOG IBM MSFT\n", "AAPL 1.000000 0.407919 0.386817 0.389695\n", "GOOG 0.407919 1.000000 0.405099 0.465919\n", "IBM 0.386817 0.405099 1.000000 0.499764\n", "MSFT 0.389695 0.465919 0.499764 1.000000" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "returns.corr()" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "

	AAPL	GOOG	IBM	MSFT
AAPL	0.000277	0.000107	0.000078	0.000095
GOOG	0.000107	0.000251	0.000078	0.000108
IBM	0.000078	0.000078	0.000146	0.000089
MSFT	0.000095	0.000108	0.000089	0.000215

\n", "

" ], "text/plain": [ " AAPL GOOG IBM MSFT\n", "AAPL 0.000277 0.000107 0.000078 0.000095\n", "GOOG 0.000107 0.000251 0.000078 0.000108\n", "IBM 0.000078 0.000078 0.000146 0.000089\n", "MSFT 0.000095 0.000108 0.000089 0.000215" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "returns.cov()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "用Dataframe的corrwith方法，我们可以计算dataframe中不同columns之间，或row之间的相似性。传递一个series：" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "AAPL 0.386817\n", "GOOG 0.405099\n", "IBM 1.000000\n", "MSFT 0.499764\n", "dtype: float64" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "returns.corrwith(returns.IBM)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "传入一个dataframe能计算匹配的column names质监局的相似性。这里我计算vooumn中百分比变化的相似性：" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "AAPL -0.075565\n", "GOOG -0.007067\n", "IBM -0.204849\n", "MSFT -0.092950\n", "dtype: float64" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "returns.corrwith(volume)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "传入axis='columns'能做到row-by-row计算。在correlation被计算之前，所有的数据会根据label先对齐。\n", "\n", "# 2 Unique Values, Value Counts, and Membership（唯一值，值计数，会员）\n", "\n", "这里介绍另一种从一维series中提取信息的方法：" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": true }, "outputs": [], "source": [ "obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "第一个函数时unique，能告诉我们series里unique values有哪些：" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array(['c', 'a', 'd', 'b'], dtype=object)" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "uniques = obj.unique()\n", "uniques" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "返回的unique values不是有序的，但我们可以排序，uniques.sort()。相对的，value_counts能计算series中值出现的频率：" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 3\n", "c 3\n", "b 2\n", "d 1\n", "dtype: int64" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "返回的结果是按降序处理的。vaule_counts也是pandas中的方法，能用在任何array或sequence上：" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "d 1\n", "c 3\n", "b 2\n", "a 3\n", "dtype: int64" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.value_counts(obj.values, sort=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "isin 能实现一个向量化的集合成员关系检查，能用于过滤数据集，检查一个子集，是否在series的values中，或在dataframe的column中：" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 c\n", "1 a\n", "2 d\n", "3 a\n", "4 a\n", "5 b\n", "6 b\n", "7 c\n", "8 c\n", "dtype: object" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 True\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", "5 True\n", "6 True\n", "7 True\n", "8 True\n", "dtype: bool" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mask = obj.isin(['b', 'c'])\n", "mask" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 c\n", "5 b\n", "6 b\n", "7 c\n", "8 c\n", "dtype: object" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj[mask]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "与isin相对的另一个方法是Index.get_indexer，能返回一个index array，告诉我们有重复值的values(to_match)，在非重复的values(unique_vals)中对应的索引值：" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": true }, "outputs": [], "source": [ "to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])\n", "unique_vals = pd.Series(['c', 'b', 'a'])" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([0, 2, 1, 1, 0, 2])" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Index(unique_vals).get_indexer(to_match)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unique, value counts, and set membership methods：\n", "\n", "![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/0v120.png)\n", "\n", "在某些情况下，你可能想要计算一下dataframe中多个column的柱状图：" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "

	Qu1	Qu2	Qu3
0	1	2	1
1	3	3	5
2	4	1	2
3	3	2	4
4	4	3	4

\n", "

" ], "text/plain": [ " Qu1 Qu2 Qu3\n", "0 1 2 1\n", "1 3 3 5\n", "2 4 1 2\n", "3 3 2 4\n", "4 4 3 4" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],\n", " 'Qu2': [2, 3, 1, 2, 3],\n", " 'Qu3': [1, 5, 2, 4, 4]})\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "把padas.value_counts传递给dataframe的apply函数：" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "

	Qu1	Qu2	Qu3
1	1.0	1.0	1.0
2	NaN	2.0	1.0
3	2.0	2.0	NaN
4	2.0	NaN	2.0
5	NaN	NaN	1.0

\n", "

" ], "text/plain": [ " Qu1 Qu2 Qu3\n", "1 1.0 1.0 1.0\n", "2 NaN 2.0 1.0\n", "3 2.0 2.0 NaN\n", "4 2.0 NaN 2.0\n", "5 NaN NaN 1.0" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result = data.apply(pd.value_counts)\n", "result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "每一行的laebls(即1，2，3，4，5)其实就是整个data里出现过的值，从1到5。而对应的每个方框里的值，则是表示该值在当前列中出现的次数。比如，(2, Qu1)的值是Nan，说明2这个数字没有在Qu1这一列出现过。(2, Qu2)的值是2，说明2这个数字在Qu2这一列出现过2次。(2, Qu3)的值是1，说明2这个数字在Qu3这一列出现过1次。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python [py35]", "language": "python", "name": "Python [py35]" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }