{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 5.3 Summarizing and Computing Descriptive Statistics(汇总和描述性统计)\n", "\n", "pandas有很多数学和统计方法。大部分可以归类为降维或汇总统计,这些方法是用来从series中提取单个值(比如sum或mean)。还有一些方法来处理缺失值:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwo
a1.40NaN
b7.10-4.5
cNaNNaN
d0.75-1.3
\n", "
" ], "text/plain": [ " one two\n", "a 1.40 NaN\n", "b 7.10 -4.5\n", "c NaN NaN\n", "d 0.75 -1.3" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],\n", " [np.nan, np.nan], [0.75, -1.3]],\n", " index=['a', 'b', 'c', 'd'],\n", " columns=['one', 'two'])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "使用sum的话,会返回一个series:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "one 9.25\n", "two -5.80\n", "dtype: float64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "使用`axis='columns'` or `axis=1`,计算列之间的和:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 1.40\n", "b 2.60\n", "c 0.00\n", "d -0.55\n", "dtype: float64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sum(axis='columns')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "计算的时候,NA(即缺失值)会被除外,除非整个切片全是NA。我们可以用skipna来跳过计算NA:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a NaN\n", "b 1.300\n", "c NaN\n", "d -0.275\n", "dtype: float64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.mean(axis='columns', skipna=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "一些reduction方法:\n", "\n", "![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/n95fy.png)\n", "\n", "一些方法,比如idxmin和idxmax,能返回间接的统计值,比如index value:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwo
a1.40NaN
b7.10-4.5
cNaNNaN
d0.75-1.3
\n", "
" ], "text/plain": [ " one two\n", "a 1.40 NaN\n", "b 7.10 -4.5\n", "c NaN NaN\n", "d 0.75 -1.3" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "one b\n", "two d\n", "dtype: object" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.idxmax()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "还能计算累加值:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwo
a1.40NaN
b8.50-4.5
cNaNNaN
d9.25-5.8
\n", "
" ], "text/plain": [ " one two\n", "a 1.40 NaN\n", "b 8.50 -4.5\n", "c NaN NaN\n", "d 9.25 -5.8" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.cumsum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "另一种类型既不是降维,也不是累加。describe能一下子产生多维汇总数据:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/xu/anaconda/envs/py35/lib/python3.5/site-packages/numpy/lib/function_base.py:4116: RuntimeWarning: Invalid value encountered in percentile\n", " interpolation=interpolation)\n" ] }, { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwo
count3.0000002.000000
mean3.083333-2.900000
std3.4936852.262742
min0.750000-4.500000
25%NaNNaN
50%NaNNaN
75%NaNNaN
max7.100000-1.300000
\n", "
" ], "text/plain": [ " one two\n", "count 3.000000 2.000000\n", "mean 3.083333 -2.900000\n", "std 3.493685 2.262742\n", "min 0.750000 -4.500000\n", "25% NaN NaN\n", "50% NaN NaN\n", "75% NaN NaN\n", "max 7.100000 -1.300000" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "对于非数值性的数据,describe能产生另一种汇总统计:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 a\n", "1 a\n", "2 b\n", "3 c\n", "4 a\n", "5 a\n", "6 b\n", "7 c\n", "8 a\n", "9 a\n", "10 b\n", "11 c\n", "12 a\n", "13 a\n", "14 b\n", "15 c\n", "dtype: object" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj = pd.Series(['a', 'a', 'b', 'c'] * 4)\n", "obj" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "count 16\n", "unique 3\n", "top a\n", "freq 8\n", "dtype: object" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "下表是一些描述和汇总统计数据:\n", "![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/wygi1.png)\n", "\n", "# 1 Correlation and Covariance (相关性和协方差)\n", "\n", "假设DataFrame时股价和股票数量。这些数据取自yahoo finace,用padas-datareader包能加载。如果没有的话,用conda或pip来下载这个包:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "conda install pandas-datareader" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas_datareader.data as web" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "all_data = {ticker: web.get_data_yahoo(ticker)\n", " for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}\n", "\n", "price = pd.DataFrame({ticker: data['Adj Close']\n", " for ticker, data in all_data.items()})\n", "\n", "volumn = pd.DataFrame({ticker: data['Volumn']\n", " for ticker, data in all_data.items()})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "上面的代码无法直接从yahoo上爬取数据,因为yahoo被verizon收购后,好像是不能用了。于是这里我们直接从下好的数据包里加载。" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5.1 Introduction to pandas Data Structures(pandas的数据结构).ipynb\r\n", "5.2 Essential Functionality(主要功能).ipynb\r\n", "5.3 Summarizing and Computing Descriptive Statistics(总结和描述性统计).ipynb\r\n" ] } ], "source": [ "ls ../examples/" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [], "source": [ "price = pd.read_pickle('../examples/yahoo_price.pkl')\n", "volume = pd.read_pickle('../examples/yahoo_volume.pkl')" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AAPLGOOGIBMMSFT
Date
2010-01-0427.990226313.062468113.30453625.884104
2010-01-0528.038618311.683844111.93582225.892466
2010-01-0627.592626303.826685111.20868325.733566
2010-01-0727.541619296.753749110.82373225.465944
2010-01-0827.724725300.709808111.93582225.641571
\n", "
" ], "text/plain": [ " AAPL GOOG IBM MSFT\n", "Date \n", "2010-01-04 27.990226 313.062468 113.304536 25.884104\n", "2010-01-05 28.038618 311.683844 111.935822 25.892466\n", "2010-01-06 27.592626 303.826685 111.208683 25.733566\n", "2010-01-07 27.541619 296.753749 110.823732 25.465944\n", "2010-01-08 27.724725 300.709808 111.935822 25.641571" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "price.head()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AAPLGOOGIBMMSFT
Date
2010-01-041234324003927000615530038409100
2010-01-051504762006031900684140049749600
2010-01-061380400007987100560530058182400
2010-01-0711928280012876600584060050559700
2010-01-081119027009483900419720051197400
\n", "
" ], "text/plain": [ " AAPL GOOG IBM MSFT\n", "Date \n", "2010-01-04 123432400 3927000 6155300 38409100\n", "2010-01-05 150476200 6031900 6841400 49749600\n", "2010-01-06 138040000 7987100 5605300 58182400\n", "2010-01-07 119282800 12876600 5840600 50559700\n", "2010-01-08 111902700 9483900 4197200 51197400" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "volume.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> pct_change(): 这个函数用来计算同colnums两个相邻的数字之间的变化率\n", "\n", "现在我们计算一下价格百分比的变化:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AAPLGOOGIBMMSFT
Date
2016-10-17-0.0006800.0018370.002072-0.003483
2016-10-18-0.0006810.019616-0.0261680.007690
2016-10-19-0.0029790.0078460.003583-0.002255
2016-10-20-0.000512-0.0056520.001719-0.004867
2016-10-21-0.0039300.003011-0.0124740.042096
\n", "
" ], "text/plain": [ " AAPL GOOG IBM MSFT\n", "Date \n", "2016-10-17 -0.000680 0.001837 0.002072 -0.003483\n", "2016-10-18 -0.000681 0.019616 -0.026168 0.007690\n", "2016-10-19 -0.002979 0.007846 0.003583 -0.002255\n", "2016-10-20 -0.000512 -0.005652 0.001719 -0.004867\n", "2016-10-21 -0.003930 0.003011 -0.012474 0.042096" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "returns = price.pct_change()\n", "returns.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "series的corr方法计算两个,重合的,非NA的,通过index排列好的series。cov计算方差:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.4997636114415116" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "returns['MSFT'].corr(returns['IBM'])" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "8.8706554797035489e-05" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "returns['MSFT'].cov(returns['IBM'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "因为MSFT是一个有效的python属性,我们可以通过更简洁的方式来选中columns:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.4997636114415116" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "returns.MSFT.corr(returns.IBM)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "dataframe的corr和cov方法,能返回一个完整的相似性或方差矩阵:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AAPLGOOGIBMMSFT
AAPL1.0000000.4079190.3868170.389695
GOOG0.4079191.0000000.4050990.465919
IBM0.3868170.4050991.0000000.499764
MSFT0.3896950.4659190.4997641.000000
\n", "
" ], "text/plain": [ " AAPL GOOG IBM MSFT\n", "AAPL 1.000000 0.407919 0.386817 0.389695\n", "GOOG 0.407919 1.000000 0.405099 0.465919\n", "IBM 0.386817 0.405099 1.000000 0.499764\n", "MSFT 0.389695 0.465919 0.499764 1.000000" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "returns.corr()" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AAPLGOOGIBMMSFT
AAPL0.0002770.0001070.0000780.000095
GOOG0.0001070.0002510.0000780.000108
IBM0.0000780.0000780.0001460.000089
MSFT0.0000950.0001080.0000890.000215
\n", "
" ], "text/plain": [ " AAPL GOOG IBM MSFT\n", "AAPL 0.000277 0.000107 0.000078 0.000095\n", "GOOG 0.000107 0.000251 0.000078 0.000108\n", "IBM 0.000078 0.000078 0.000146 0.000089\n", "MSFT 0.000095 0.000108 0.000089 0.000215" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "returns.cov()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "用Dataframe的corrwith方法,我们可以计算dataframe中不同columns之间,或row之间的相似性。传递一个series:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "AAPL 0.386817\n", "GOOG 0.405099\n", "IBM 1.000000\n", "MSFT 0.499764\n", "dtype: float64" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "returns.corrwith(returns.IBM)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "传入一个dataframe能计算匹配的column names质监局的相似性。这里我计算vooumn中百分比变化的相似性:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "AAPL -0.075565\n", "GOOG -0.007067\n", "IBM -0.204849\n", "MSFT -0.092950\n", "dtype: float64" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "returns.corrwith(volume)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "传入axis='columns'能做到row-by-row计算。在correlation被计算之前,所有的数据会根据label先对齐。\n", "\n", "# 2 Unique Values, Value Counts, and Membership(唯一值,值计数,会员)\n", "\n", "这里介绍另一种从一维series中提取信息的方法:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": true }, "outputs": [], "source": [ "obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "第一个函数时unique,能告诉我们series里unique values有哪些:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array(['c', 'a', 'd', 'b'], dtype=object)" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "uniques = obj.unique()\n", "uniques" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "返回的unique values不是有序的,但我们可以排序,uniques.sort()。相对的,value_counts能计算series中值出现的频率:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 3\n", "c 3\n", "b 2\n", "d 1\n", "dtype: int64" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "返回的结果是按降序处理的。vaule_counts也是pandas中的方法,能用在任何array或sequence上:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "d 1\n", "c 3\n", "b 2\n", "a 3\n", "dtype: int64" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.value_counts(obj.values, sort=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "isin 能实现一个向量化的集合成员关系检查,能用于过滤数据集,检查一个子集,是否在series的values中,或在dataframe的column中:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 c\n", "1 a\n", "2 d\n", "3 a\n", "4 a\n", "5 b\n", "6 b\n", "7 c\n", "8 c\n", "dtype: object" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 True\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", "5 True\n", "6 True\n", "7 True\n", "8 True\n", "dtype: bool" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mask = obj.isin(['b', 'c'])\n", "mask" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 c\n", "5 b\n", "6 b\n", "7 c\n", "8 c\n", "dtype: object" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj[mask]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "与isin相对的另一个方法是Index.get_indexer,能返回一个index array,告诉我们有重复值的values(to_match),在非重复的values(unique_vals)中对应的索引值:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": true }, "outputs": [], "source": [ "to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])\n", "unique_vals = pd.Series(['c', 'b', 'a'])" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([0, 2, 1, 1, 0, 2])" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Index(unique_vals).get_indexer(to_match)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unique, value counts, and set membership methods:\n", "\n", "![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/0v120.png)\n", "\n", "在某些情况下,你可能想要计算一下dataframe中多个column的柱状图:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Qu1Qu2Qu3
0121
1335
2412
3324
4434
\n", "
" ], "text/plain": [ " Qu1 Qu2 Qu3\n", "0 1 2 1\n", "1 3 3 5\n", "2 4 1 2\n", "3 3 2 4\n", "4 4 3 4" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],\n", " 'Qu2': [2, 3, 1, 2, 3],\n", " 'Qu3': [1, 5, 2, 4, 4]})\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "把padas.value_counts传递给dataframe的apply函数:" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Qu1Qu2Qu3
11.01.01.0
2NaN2.01.0
32.02.0NaN
42.0NaN2.0
5NaNNaN1.0
\n", "
" ], "text/plain": [ " Qu1 Qu2 Qu3\n", "1 1.0 1.0 1.0\n", "2 NaN 2.0 1.0\n", "3 2.0 2.0 NaN\n", "4 2.0 NaN 2.0\n", "5 NaN NaN 1.0" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result = data.apply(pd.value_counts)\n", "result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "每一行的laebls(即1,2,3,4,5)其实就是整个data里出现过的值,从1到5。而对应的每个方框里的值,则是表示该值在当前列中出现的次数。比如,(2, Qu1)的值是Nan,说明2这个数字没有在Qu1这一列出现过。(2, Qu2)的值是2,说明2这个数字在Qu2这一列出现过2次。(2, Qu3)的值是1,说明2这个数字在Qu3这一列出现过1次。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python [py35]", "language": "python", "name": "Python [py35]" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }