{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Statistical tools"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import addutils.toc ; addutils.toc.js(ipy_notebook=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this tutorial we will look at some of the statistical and computational tools offered by `pandas`."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import datetime\n",
"import scipy.io\n",
"import numpy as np\n",
"import pandas as pd\n",
"import bokeh.plotting as bk\n",
"from IPython.display import display, HTML\n",
"from addutils import css_notebook, side_by_side2\n",
"css_notebook()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1 Percent change"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Given a `pandas.Series`, the `pct_change` method returns a new `pandas.Series` containing the percent change over a given number of periods."
]
},
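{
"cell_type": "markdown",
"metadata": {},
"source": [
"Concretely, `pct_change` computes `s[t] / s[t - periods] - 1` at every position `t`. A minimal sketch with made-up numbers:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"s = pd.Series([100.0, 110.0, 121.0])\n",
"# each value is s[t] / s[t-1] - 1: NaN, 0.10, 0.10\n",
"s.pct_change()"
]
},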
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
"          s1  pct_ch_1d  pct_ch_3d\n",
"0   9.964166        NaN        NaN\n",
"1  10.990810  10.303353        NaN\n",
"2  12.147865  10.527481        NaN\n",
"3  12.994615   6.970360  30.413468\n",
"4  13.918815   7.112174  26.640485\n",
"5  14.943803   7.364053  23.015883\n",
"6  15.952536   6.750173  22.762668\n",
"7  17.038615   6.808187  22.414265"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s1 = pd.Series(np.arange(10, 18) + np.random.randn(8) / 10)\n",
"\n",
"pct_ch_1d = s1.pct_change() * 100\n",
"pct_ch_3d = s1.pct_change(periods=3) * 100\n",
"\n",
"HTML(side_by_side2(s1, pct_ch_1d, pct_ch_3d))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2 Covariance"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Given two `pandas.Series`, the `cov` method computes the covariance between them, excluding missing values."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
"                  s1        s2\n",
"2000-01-03  0.936232  1.033826\n",
"2000-01-04 -0.192915 -0.060566\n",
"2000-01-05 -0.905428 -0.735890\n",
"2000-01-06 -0.463522 -0.530967\n",
"2000-01-07 -1.006806 -1.256859\n",
"2000-01-10  1.027258  1.069003\n",
"2000-01-11  0.361945  0.274196"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s1 = pd.util.testing.makeTimeSeries(7)\n",
"s2 = s1 + np.random.randn(len(s1)) / 10\n",
"HTML(side_by_side2(s1, s2))"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.72459894915153866"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s1.cov(s2)"
]
},
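{
"cell_type": "markdown",
"metadata": {},
"source": [
"The missing-value handling can be checked directly. A minimal sketch with made-up data: pairs where either series is `NaN` are dropped before the covariance is computed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"sa = pd.Series([1.0, 2.0, 3.0, np.nan])\n",
"sb = pd.Series([2.0, 4.0, 6.0, 8.0])\n",
"# the last pair is ignored because sa is NaN there,\n",
"# so this equals the covariance of the first three pairs: 2.0\n",
"sa.cov(sb)"
]
},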
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is also possible to compute the pairwise covariance of a `pandas.DataFrame`'s columns with the `pandas.DataFrame.cov` method. Here we use the `pandas.util.testing` module to generate random data easily:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" A B C D\n",
"2000-01-03 -2.238333 0.138162 -0.908553 1.562661\n",
"2000-01-04 0.292598 -0.570359 0.192452 -0.543481\n",
"2000-01-05 0.013642 1.045337 0.690139 -0.069664\n",
"2000-01-06 1.960531 1.135304 -0.878462 0.248373\n",
"2000-01-07 0.573900 0.719603 -0.726617 -0.773457\n",
" A B C D\n",
"A 0.794372 0.199038 -0.167064 -0.112771\n",
"B 0.199038 1.061793 -0.114622 0.018002\n",
"C -0.167064 -0.114622 0.762304 0.043736\n",
"D -0.112771 0.018002 0.043736 0.945614\n"
]
}
],
"source": [
"d1 = pd.util.testing.makeTimeDataFrame()\n",
"print(d1.head())\n",
"print(d1.cov())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3 Correlation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`pandas.Series.corr` computes the correlation between two `pandas.Series`. The `method` parameter selects one of:\n",
"\n",
"* Pearson\n",
"* Kendall\n",
"* Spearman"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.98693371660495477"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s1.corr(s2, method='pearson')"
]
},
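{
"cell_type": "markdown",
"metadata": {},
"source": [
"The rank-based methods can give quite different answers from Pearson. As a sketch with made-up data: on a monotonic but non-linear relationship, Spearman and Kendall report a perfect correlation of 1.0 while Pearson does not."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])\n",
"y = x ** 3  # monotonic, but non-linear\n",
"print(x.corr(y, method='pearson'))   # smaller than 1.0\n",
"print(x.corr(y, method='spearman'))  # 1.0: the ranks match exactly\n",
"print(x.corr(y, method='kendall'))   # 1.0: every pair is concordant"
]
},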
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we just saw for covariance, `pandas.DataFrame.corr` computes the pairwise correlation of a `pandas.DataFrame`'s columns:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
" A B C D\n",
"A 1.000000 0.216723 -0.214688 -0.130115\n",
"B 0.216723 1.000000 -0.127404 0.017966\n",
"C -0.214688 -0.127404 1.000000 0.051513\n",
"D -0.130115 0.017966 0.051513 1.000000"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d1.corr()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4 Rolling moments and binary rolling moments"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`pandas` also provides many functions for calculating rolling moments."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['rolling_apply',\n",
" 'rolling_corr',\n",
" 'rolling_corr_pairwise',\n",
" 'rolling_count',\n",
" 'rolling_cov',\n",
" 'rolling_kurt',\n",
" 'rolling_max',\n",
" 'rolling_mean',\n",
" 'rolling_median',\n",
" 'rolling_min',\n",
" 'rolling_quantile',\n",
" 'rolling_skew',\n",
" 'rolling_std',\n",
" 'rolling_sum',\n",
" 'rolling_var',\n",
" 'rolling_window']"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[n for n in dir(pd) if n.startswith('rolling')]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see some examples:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
" cumsum max mean min\n",
"2002-09-22 16.814230 34.496269 21.763132 13.986689\n",
"2002-09-23 17.760365 34.496269 21.486842 13.986689\n",
"2002-09-24 18.308800 34.496269 21.220484 13.986689\n",
"2002-09-25 19.133714 34.496269 20.967816 13.986689\n",
"2002-09-26 19.684219 34.217841 20.720949 13.986689"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s3 = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))\n",
"s3 = s3.cumsum()\n",
"s3_max = pd.rolling_max(s3, 60)\n",
"s3_mean = pd.rolling_mean(s3, 60)\n",
"s3_min = pd.rolling_min(s3, 60)\n",
"data = {'cumsum':s3, 'max':s3_max, 'mean':s3_mean, 'min':s3_min}\n",
"df = pd.DataFrame(data)\n",
"df.tail()"
]
},
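{
"cell_type": "markdown",
"metadata": {},
"source": [
"`rolling_apply` accepts an arbitrary reduction function. As a sketch, here we compute a rolling peak-to-peak range (max minus min, our own choice of statistic) over the same 60-day window. Note that in recent pandas versions the `rolling_*` functions are replaced by the `s3.rolling(60).apply(...)` style of API."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# rolling peak-to-peak: max minus min inside each 60-day window\n",
"s3_ptp = pd.rolling_apply(s3, 60, lambda x: x.max() - x.min())\n",
"s3_ptp.tail()"
]
},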
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>BokehJS successfully loaded.</div>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"bk.output_notebook()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"fig = bk.figure(x_axis_type = \"datetime\",\n",
" tools=\"pan,box_zoom,reset\", title = 'Rolling Moments',\n",
" plot_width=750, plot_height=400)\n",
"fig.line(df.index, df['cumsum'], color='cadetblue', legend='Cumulative Sum')\n",
"fig.line(df.index, df['max'], color='mediumorchid', legend='Max')\n",
"fig.line(df.index, df['min'], color='mediumpurple', legend='Min')\n",
"fig.line(df.index, df['mean'], color='navy', legend='Mean')\n",
"bk.show(fig)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`pandas.Series.cumsum` returns a new `pandas.Series` containing the cumulative sum of the given values."
]
},
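{
"cell_type": "markdown",
"metadata": {},
"source": [
"A tiny self-contained example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# running total of the values: 1.0, 3.0, 6.0\n",
"pd.Series([1.0, 2.0, 3.0]).cumsum()"
]
},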
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"s4 = s3 + np.random.randn(len(s3))\n",
"rollc = pd.rolling_corr(s3, s4, window=10)\n",
"data2 = {'cumsum':s3, 'similar':s4, 'rolling correlation':rollc}\n",
"df2 = pd.DataFrame(data2)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"fig = bk.figure(x_axis_type = \"datetime\", title = 'Rolling Correlation',\n",
" plot_width=750, plot_height=400)\n",
"fig.line(df2.index, df2['cumsum'], color='cadetblue', legend='Cumulative Sum')\n",
"fig.line(df2.index, df2['similar'], color='mediumorchid', legend='Similar')\n",
"fig.line(df2.index, df2['rolling correlation'], color='navy', legend='Rolling Corr.')\n",
"fig.legend.orientation = \"bottom_right\"\n",
"bk.show(fig)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5 A practical example: return indexes and cumulative returns"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Date\n",
"2012-09-17 699.78\n",
"2012-09-18 701.91\n",
"2012-09-19 702.10\n",
"2012-09-20 698.70\n",
"2012-09-21 700.09\n",
"Name: Adj Close, dtype: float64"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"AAPL = pd.read_csv('example_data/p03_AAPL.txt', index_col='Date', parse_dates=True)\n",
"price = AAPL['Adj Close']\n",
"display(price.tail())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`pandas.Series.tail` returns the last n rows of a given `pandas.Series`."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"period_return = price['2011-10-03'] / price['2011-03-01'] - 1  # return between two specific dates\n",
"returns = price.pct_change()\n",
"ret_index = (1 + returns).cumprod()\n",
"ret_index[0] = 1\n",
"monthly_returns = ret_index.resample('BM', how='last').pct_change()"
]
},
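{
"cell_type": "markdown",
"metadata": {},
"source": [
"Up to floating-point error, the return index built above is just the price series rescaled to start at 1, since `(1 + returns).cumprod()` telescopes to `price / price[0]`. A quick sanity check reusing `price` and `ret_index` from above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ret_index_check = price / price.iloc[0]\n",
"# should be ~0, apart from floating-point noise\n",
"(ret_index - ret_index_check).abs().max()"
]
},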
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"fig = bk.figure(x_axis_type = 'datetime', title = 'Monthly Returns',\n",
" plot_width=750, plot_height=400)\n",
"fig.line(monthly_returns.index, monthly_returns)\n",
"bk.show(fig)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"Visit www.add-for.com for more tutorials and updates.\n",
"\n",
"This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.4.3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}