{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 5.3 Summarizing and Computing Descriptive Statistics(汇总和描述性统计)\n",
"\n",
"pandas有很多数学和统计方法。大部分可以归类为降维或汇总统计,这些方法是用来从series中提取单个值(比如sum或mean)。还有一些方法来处理缺失值:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" two | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 1.40 | \n",
" NaN | \n",
"
\n",
" \n",
" b | \n",
" 7.10 | \n",
" -4.5 | \n",
"
\n",
" \n",
" c | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" d | \n",
" 0.75 | \n",
" -1.3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one two\n",
"a 1.40 NaN\n",
"b 7.10 -4.5\n",
"c NaN NaN\n",
"d 0.75 -1.3"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],\n",
" [np.nan, np.nan], [0.75, -1.3]],\n",
" index=['a', 'b', 'c', 'd'],\n",
" columns=['one', 'two'])\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"使用sum的话,会返回一个series:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"one 9.25\n",
"two -5.80\n",
"dtype: float64"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"使用`axis='columns'` or `axis=1`,计算列之间的和:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"a 1.40\n",
"b 2.60\n",
"c 0.00\n",
"d -0.55\n",
"dtype: float64"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sum(axis='columns')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"计算的时候,NA(即缺失值)会被除外,除非整个切片全是NA。我们可以用skipna来跳过计算NA:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"a NaN\n",
"b 1.300\n",
"c NaN\n",
"d -0.275\n",
"dtype: float64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.mean(axis='columns', skipna=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"一些reduction方法:\n",
"\n",
"![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/n95fy.png)\n",
"\n",
"一些方法,比如idxmin和idxmax,能返回间接的统计值,比如index value:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" two | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 1.40 | \n",
" NaN | \n",
"
\n",
" \n",
" b | \n",
" 7.10 | \n",
" -4.5 | \n",
"
\n",
" \n",
" c | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" d | \n",
" 0.75 | \n",
" -1.3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one two\n",
"a 1.40 NaN\n",
"b 7.10 -4.5\n",
"c NaN NaN\n",
"d 0.75 -1.3"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"one b\n",
"two d\n",
"dtype: object"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.idxmax()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"还能计算累加值:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" two | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 1.40 | \n",
" NaN | \n",
"
\n",
" \n",
" b | \n",
" 8.50 | \n",
" -4.5 | \n",
"
\n",
" \n",
" c | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" d | \n",
" 9.25 | \n",
" -5.8 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one two\n",
"a 1.40 NaN\n",
"b 8.50 -4.5\n",
"c NaN NaN\n",
"d 9.25 -5.8"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.cumsum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"另一种类型既不是降维,也不是累加。describe能一下子产生多维汇总数据:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/xu/anaconda/envs/py35/lib/python3.5/site-packages/numpy/lib/function_base.py:4116: RuntimeWarning: Invalid value encountered in percentile\n",
" interpolation=interpolation)\n"
]
},
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" one | \n",
" two | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 3.000000 | \n",
" 2.000000 | \n",
"
\n",
" \n",
" mean | \n",
" 3.083333 | \n",
" -2.900000 | \n",
"
\n",
" \n",
" std | \n",
" 3.493685 | \n",
" 2.262742 | \n",
"
\n",
" \n",
" min | \n",
" 0.750000 | \n",
" -4.500000 | \n",
"
\n",
" \n",
" 25% | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 50% | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 75% | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" max | \n",
" 7.100000 | \n",
" -1.300000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" one two\n",
"count 3.000000 2.000000\n",
"mean 3.083333 -2.900000\n",
"std 3.493685 2.262742\n",
"min 0.750000 -4.500000\n",
"25% NaN NaN\n",
"50% NaN NaN\n",
"75% NaN NaN\n",
"max 7.100000 -1.300000"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"对于非数值性的数据,describe能产生另一种汇总统计:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 a\n",
"1 a\n",
"2 b\n",
"3 c\n",
"4 a\n",
"5 a\n",
"6 b\n",
"7 c\n",
"8 a\n",
"9 a\n",
"10 b\n",
"11 c\n",
"12 a\n",
"13 a\n",
"14 b\n",
"15 c\n",
"dtype: object"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj = pd.Series(['a', 'a', 'b', 'c'] * 4)\n",
"obj"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"count 16\n",
"unique 3\n",
"top a\n",
"freq 8\n",
"dtype: object"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"下表是一些描述和汇总统计数据:\n",
"![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/wygi1.png)\n",
"\n",
"# 1 Correlation and Covariance (相关性和协方差)\n",
"\n",
"假设DataFrame时股价和股票数量。这些数据取自yahoo finace,用padas-datareader包能加载。如果没有的话,用conda或pip来下载这个包:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"conda install pandas-datareader"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas_datareader.data as web"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"all_data = {ticker: web.get_data_yahoo(ticker)\n",
" for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}\n",
"\n",
"price = pd.DataFrame({ticker: data['Adj Close']\n",
" for ticker, data in all_data.items()})\n",
"\n",
"volumn = pd.DataFrame({ticker: data['Volumn']\n",
" for ticker, data in all_data.items()})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"上面的代码无法直接从yahoo上爬取数据,因为yahoo被verizon收购后,好像是不能用了。于是这里我们直接从下好的数据包里加载。"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5.1 Introduction to pandas Data Structures(pandas的数据结构).ipynb\r\n",
"5.2 Essential Functionality(主要功能).ipynb\r\n",
"5.3 Summarizing and Computing Descriptive Statistics(总结和描述性统计).ipynb\r\n"
]
}
],
"source": [
"ls ../examples/"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"price = pd.read_pickle('../examples/yahoo_price.pkl')\n",
"volume = pd.read_pickle('../examples/yahoo_volume.pkl')"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" AAPL | \n",
" GOOG | \n",
" IBM | \n",
" MSFT | \n",
"
\n",
" \n",
" Date | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2010-01-04 | \n",
" 27.990226 | \n",
" 313.062468 | \n",
" 113.304536 | \n",
" 25.884104 | \n",
"
\n",
" \n",
" 2010-01-05 | \n",
" 28.038618 | \n",
" 311.683844 | \n",
" 111.935822 | \n",
" 25.892466 | \n",
"
\n",
" \n",
" 2010-01-06 | \n",
" 27.592626 | \n",
" 303.826685 | \n",
" 111.208683 | \n",
" 25.733566 | \n",
"
\n",
" \n",
" 2010-01-07 | \n",
" 27.541619 | \n",
" 296.753749 | \n",
" 110.823732 | \n",
" 25.465944 | \n",
"
\n",
" \n",
" 2010-01-08 | \n",
" 27.724725 | \n",
" 300.709808 | \n",
" 111.935822 | \n",
" 25.641571 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" AAPL GOOG IBM MSFT\n",
"Date \n",
"2010-01-04 27.990226 313.062468 113.304536 25.884104\n",
"2010-01-05 28.038618 311.683844 111.935822 25.892466\n",
"2010-01-06 27.592626 303.826685 111.208683 25.733566\n",
"2010-01-07 27.541619 296.753749 110.823732 25.465944\n",
"2010-01-08 27.724725 300.709808 111.935822 25.641571"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"price.head()"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" AAPL | \n",
" GOOG | \n",
" IBM | \n",
" MSFT | \n",
"
\n",
" \n",
" Date | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2010-01-04 | \n",
" 123432400 | \n",
" 3927000 | \n",
" 6155300 | \n",
" 38409100 | \n",
"
\n",
" \n",
" 2010-01-05 | \n",
" 150476200 | \n",
" 6031900 | \n",
" 6841400 | \n",
" 49749600 | \n",
"
\n",
" \n",
" 2010-01-06 | \n",
" 138040000 | \n",
" 7987100 | \n",
" 5605300 | \n",
" 58182400 | \n",
"
\n",
" \n",
" 2010-01-07 | \n",
" 119282800 | \n",
" 12876600 | \n",
" 5840600 | \n",
" 50559700 | \n",
"
\n",
" \n",
" 2010-01-08 | \n",
" 111902700 | \n",
" 9483900 | \n",
" 4197200 | \n",
" 51197400 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" AAPL GOOG IBM MSFT\n",
"Date \n",
"2010-01-04 123432400 3927000 6155300 38409100\n",
"2010-01-05 150476200 6031900 6841400 49749600\n",
"2010-01-06 138040000 7987100 5605300 58182400\n",
"2010-01-07 119282800 12876600 5840600 50559700\n",
"2010-01-08 111902700 9483900 4197200 51197400"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"volume.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> pct_change(): 这个函数用来计算同colnums两个相邻的数字之间的变化率\n",
"\n",
"现在我们计算一下价格百分比的变化:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" AAPL | \n",
" GOOG | \n",
" IBM | \n",
" MSFT | \n",
"
\n",
" \n",
" Date | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2016-10-17 | \n",
" -0.000680 | \n",
" 0.001837 | \n",
" 0.002072 | \n",
" -0.003483 | \n",
"
\n",
" \n",
" 2016-10-18 | \n",
" -0.000681 | \n",
" 0.019616 | \n",
" -0.026168 | \n",
" 0.007690 | \n",
"
\n",
" \n",
" 2016-10-19 | \n",
" -0.002979 | \n",
" 0.007846 | \n",
" 0.003583 | \n",
" -0.002255 | \n",
"
\n",
" \n",
" 2016-10-20 | \n",
" -0.000512 | \n",
" -0.005652 | \n",
" 0.001719 | \n",
" -0.004867 | \n",
"
\n",
" \n",
" 2016-10-21 | \n",
" -0.003930 | \n",
" 0.003011 | \n",
" -0.012474 | \n",
" 0.042096 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" AAPL GOOG IBM MSFT\n",
"Date \n",
"2016-10-17 -0.000680 0.001837 0.002072 -0.003483\n",
"2016-10-18 -0.000681 0.019616 -0.026168 0.007690\n",
"2016-10-19 -0.002979 0.007846 0.003583 -0.002255\n",
"2016-10-20 -0.000512 -0.005652 0.001719 -0.004867\n",
"2016-10-21 -0.003930 0.003011 -0.012474 0.042096"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"returns = price.pct_change()\n",
"returns.tail()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"series的corr方法计算两个,重合的,非NA的,通过index排列好的series。cov计算方差:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.4997636114415116"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"returns['MSFT'].corr(returns['IBM'])"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"8.8706554797035489e-05"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"returns['MSFT'].cov(returns['IBM'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"因为MSFT是一个有效的python属性,我们可以通过更简洁的方式来选中columns:"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.4997636114415116"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"returns.MSFT.corr(returns.IBM)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"dataframe的corr和cov方法,能返回一个完整的相似性或方差矩阵:"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" AAPL | \n",
" GOOG | \n",
" IBM | \n",
" MSFT | \n",
"
\n",
" \n",
" \n",
" \n",
" AAPL | \n",
" 1.000000 | \n",
" 0.407919 | \n",
" 0.386817 | \n",
" 0.389695 | \n",
"
\n",
" \n",
" GOOG | \n",
" 0.407919 | \n",
" 1.000000 | \n",
" 0.405099 | \n",
" 0.465919 | \n",
"
\n",
" \n",
" IBM | \n",
" 0.386817 | \n",
" 0.405099 | \n",
" 1.000000 | \n",
" 0.499764 | \n",
"
\n",
" \n",
" MSFT | \n",
" 0.389695 | \n",
" 0.465919 | \n",
" 0.499764 | \n",
" 1.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" AAPL GOOG IBM MSFT\n",
"AAPL 1.000000 0.407919 0.386817 0.389695\n",
"GOOG 0.407919 1.000000 0.405099 0.465919\n",
"IBM 0.386817 0.405099 1.000000 0.499764\n",
"MSFT 0.389695 0.465919 0.499764 1.000000"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"returns.corr()"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" AAPL | \n",
" GOOG | \n",
" IBM | \n",
" MSFT | \n",
"
\n",
" \n",
" \n",
" \n",
" AAPL | \n",
" 0.000277 | \n",
" 0.000107 | \n",
" 0.000078 | \n",
" 0.000095 | \n",
"
\n",
" \n",
" GOOG | \n",
" 0.000107 | \n",
" 0.000251 | \n",
" 0.000078 | \n",
" 0.000108 | \n",
"
\n",
" \n",
" IBM | \n",
" 0.000078 | \n",
" 0.000078 | \n",
" 0.000146 | \n",
" 0.000089 | \n",
"
\n",
" \n",
" MSFT | \n",
" 0.000095 | \n",
" 0.000108 | \n",
" 0.000089 | \n",
" 0.000215 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" AAPL GOOG IBM MSFT\n",
"AAPL 0.000277 0.000107 0.000078 0.000095\n",
"GOOG 0.000107 0.000251 0.000078 0.000108\n",
"IBM 0.000078 0.000078 0.000146 0.000089\n",
"MSFT 0.000095 0.000108 0.000089 0.000215"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"returns.cov()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"用Dataframe的corrwith方法,我们可以计算dataframe中不同columns之间,或row之间的相似性。传递一个series:"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"AAPL 0.386817\n",
"GOOG 0.405099\n",
"IBM 1.000000\n",
"MSFT 0.499764\n",
"dtype: float64"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"returns.corrwith(returns.IBM)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"传入一个dataframe能计算匹配的column names质监局的相似性。这里我计算vooumn中百分比变化的相似性:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"AAPL -0.075565\n",
"GOOG -0.007067\n",
"IBM -0.204849\n",
"MSFT -0.092950\n",
"dtype: float64"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"returns.corrwith(volume)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"传入axis='columns'能做到row-by-row计算。在correlation被计算之前,所有的数据会根据label先对齐。\n",
"\n",
"# 2 Unique Values, Value Counts, and Membership(唯一值,值计数,会员)\n",
"\n",
"这里介绍另一种从一维series中提取信息的方法:"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"第一个函数时unique,能告诉我们series里unique values有哪些:"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array(['c', 'a', 'd', 'b'], dtype=object)"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"uniques = obj.unique()\n",
"uniques"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"返回的unique values不是有序的,但我们可以排序,uniques.sort()。相对的,value_counts能计算series中值出现的频率:"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"a 3\n",
"c 3\n",
"b 2\n",
"d 1\n",
"dtype: int64"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj.value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"返回的结果是按降序处理的。vaule_counts也是pandas中的方法,能用在任何array或sequence上:"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"d 1\n",
"c 3\n",
"b 2\n",
"a 3\n",
"dtype: int64"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.value_counts(obj.values, sort=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"isin 能实现一个向量化的集合成员关系检查,能用于过滤数据集,检查一个子集,是否在series的values中,或在dataframe的column中:"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 c\n",
"1 a\n",
"2 d\n",
"3 a\n",
"4 a\n",
"5 b\n",
"6 b\n",
"7 c\n",
"8 c\n",
"dtype: object"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 True\n",
"1 False\n",
"2 False\n",
"3 False\n",
"4 False\n",
"5 True\n",
"6 True\n",
"7 True\n",
"8 True\n",
"dtype: bool"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mask = obj.isin(['b', 'c'])\n",
"mask"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 c\n",
"5 b\n",
"6 b\n",
"7 c\n",
"8 c\n",
"dtype: object"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj[mask]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"与isin相对的另一个方法是Index.get_indexer,能返回一个index array,告诉我们有重复值的values(to_match),在非重复的values(unique_vals)中对应的索引值:"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])\n",
"unique_vals = pd.Series(['c', 'b', 'a'])"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([0, 2, 1, 1, 0, 2])"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.Index(unique_vals).get_indexer(to_match)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Unique, value counts, and set membership methods:\n",
"\n",
"![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/0v120.png)\n",
"\n",
"在某些情况下,你可能想要计算一下dataframe中多个column的柱状图:"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Qu1 | \n",
" Qu2 | \n",
" Qu3 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 1 | \n",
"
\n",
" \n",
" 1 | \n",
" 3 | \n",
" 3 | \n",
" 5 | \n",
"
\n",
" \n",
" 2 | \n",
" 4 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" 3 | \n",
" 3 | \n",
" 2 | \n",
" 4 | \n",
"
\n",
" \n",
" 4 | \n",
" 4 | \n",
" 3 | \n",
" 4 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Qu1 Qu2 Qu3\n",
"0 1 2 1\n",
"1 3 3 5\n",
"2 4 1 2\n",
"3 3 2 4\n",
"4 4 3 4"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],\n",
" 'Qu2': [2, 3, 1, 2, 3],\n",
" 'Qu3': [1, 5, 2, 4, 4]})\n",
"data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"把padas.value_counts传递给dataframe的apply函数:"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Qu1 | \n",
" Qu2 | \n",
" Qu3 | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" 1.0 | \n",
" 1.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 2 | \n",
" NaN | \n",
" 2.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 2.0 | \n",
" 2.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 4 | \n",
" 2.0 | \n",
" NaN | \n",
" 2.0 | \n",
"
\n",
" \n",
" 5 | \n",
" NaN | \n",
" NaN | \n",
" 1.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Qu1 Qu2 Qu3\n",
"1 1.0 1.0 1.0\n",
"2 NaN 2.0 1.0\n",
"3 2.0 2.0 NaN\n",
"4 2.0 NaN 2.0\n",
"5 NaN NaN 1.0"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result = data.apply(pd.value_counts)\n",
"result"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"每一行的laebls(即1,2,3,4,5)其实就是整个data里出现过的值,从1到5。而对应的每个方框里的值,则是表示该值在当前列中出现的次数。比如,(2, Qu1)的值是Nan,说明2这个数字没有在Qu1这一列出现过。(2, Qu2)的值是2,说明2这个数字在Qu2这一列出现过2次。(2, Qu3)的值是1,说明2这个数字在Qu3这一列出现过1次。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [py35]",
"language": "python",
"name": "Python [py35]"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}