{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Statistical tools"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"
"
],
"text/plain": [
""
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import addutils.toc ; addutils.toc.js(ipy_notebook=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With this tutorial we are going to see some of the statistical and computational tools offered by `pandas`."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import datetime\n",
"import scipy.io\n",
"import numpy as np\n",
"import pandas as pd\n",
"import bokeh.plotting as bk\n",
"from IPython.display import display, HTML\n",
"from addutils import css_notebook, side_by_side2\n",
"css_notebook()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1 Percent change"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Given a `pandas.Series` the method `pct_change` returns a new `pandas.Series` object containing percent change over a given number of periods."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 10.187371 | \n",
"
\n",
" \n",
" 1 | \n",
" 11.068102 | \n",
"
\n",
" \n",
" 2 | \n",
" 11.944168 | \n",
"
\n",
" \n",
" 3 | \n",
" 13.011350 | \n",
"
\n",
" \n",
" 4 | \n",
" 14.080648 | \n",
"
\n",
" \n",
" 5 | \n",
" 14.960530 | \n",
"
\n",
" \n",
" 6 | \n",
" 15.987561 | \n",
"
\n",
" \n",
" 7 | \n",
" 17.071012 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" 8.645321 | \n",
"
\n",
" \n",
" 2 | \n",
" 7.915230 | \n",
"
\n",
" \n",
" 3 | \n",
" 8.934751 | \n",
"
\n",
" \n",
" 4 | \n",
" 8.218195 | \n",
"
\n",
" \n",
" 5 | \n",
" 6.248875 | \n",
"
\n",
" \n",
" 6 | \n",
" 6.864935 | \n",
"
\n",
" \n",
" 7 | \n",
" 6.776838 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" 27.720384 | \n",
"
\n",
" \n",
" 4 | \n",
" 27.218267 | \n",
"
\n",
" \n",
" 5 | \n",
" 25.253847 | \n",
"
\n",
" \n",
" 6 | \n",
" 22.873960 | \n",
"
\n",
" \n",
" 7 | \n",
" 21.237402 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
""
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s1 = pd.Series(range(10, 18) + np.random.randn(8) / 10)\n",
"\n",
"pct_ch_1d = s1.pct_change() * 100\n",
"pct_ch_3d = s1.pct_change(periods=3) * 100\n",
"\n",
"HTML(side_by_side2(s1, pct_ch_1d, pct_ch_3d))"
]
},
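{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "The definition can be verified by hand: `pct_change(periods=n)` computes `s[t] / s[t-n] - 1`. A minimal sketch on a toy series (illustrative values, not from the data above):\n",
  "\n",
  "```python\n",
  "import pandas as pd\n",
  "\n",
  "s = pd.Series([10.0, 11.0, 12.1])\n",
  "# manual one-period percent change: ratio to the previous value, minus one\n",
  "manual = s / s.shift(1) - 1\n",
  "print(manual)\n",
  "print(s.pct_change())\n",
  "```\n",
  "\n",
  "Both give `NaN` for the first element, since there is no previous period to compare against."
 ]
},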
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2 Covariance"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Given two `pandas.Series` the method `cov` computes covariance between them, excluding missing values."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
"
\n",
" \n",
" \n",
" \n",
" 2000-01-03 | \n",
" 0.320914 | \n",
"
\n",
" \n",
" 2000-01-04 | \n",
" -1.105849 | \n",
"
\n",
" \n",
" 2000-01-05 | \n",
" 1.409292 | \n",
"
\n",
" \n",
" 2000-01-06 | \n",
" -0.210900 | \n",
"
\n",
" \n",
" 2000-01-07 | \n",
" -0.270258 | \n",
"
\n",
" \n",
" 2000-01-10 | \n",
" 0.151734 | \n",
"
\n",
" \n",
" 2000-01-11 | \n",
" 0.252147 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
"
\n",
" \n",
" \n",
" \n",
" 2000-01-03 | \n",
" 0.421457 | \n",
"
\n",
" \n",
" 2000-01-04 | \n",
" -1.105964 | \n",
"
\n",
" \n",
" 2000-01-05 | \n",
" 1.499170 | \n",
"
\n",
" \n",
" 2000-01-06 | \n",
" -0.209599 | \n",
"
\n",
" \n",
" 2000-01-07 | \n",
" -0.125346 | \n",
"
\n",
" \n",
" 2000-01-10 | \n",
" 0.127648 | \n",
"
\n",
" \n",
" 2000-01-11 | \n",
" 0.110899 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
""
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s1 = pd.util.testing.makeTimeSeries(7)\n",
"s2 = s1 + np.random.randn(len(s1)) / 10\n",
"HTML(side_by_side2(s1, s2))"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.59005429091399686"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s1.cov(s2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is also possibile to compute pairwise covariance of a `pandas.DataFrame` columns using `pandas.DataFrame.cov` method. Here we use the module `pandas.util.testing` in order to generate random data easily:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" A B C D\n",
"2000-01-03 -0.481372 -0.584465 -0.038654 0.280964\n",
"2000-01-04 -0.663815 -2.048175 -0.872593 -0.222687\n",
"2000-01-05 -0.719081 -0.553857 0.556233 0.467152\n",
"2000-01-06 -0.152237 0.568140 0.454709 0.478042\n",
"2000-01-07 -0.749170 1.024167 -1.523115 -1.723031\n",
" A B C D\n",
"A 0.890847 0.066664 0.065301 0.301979\n",
"B 0.066664 1.035357 -0.237648 -0.107895\n",
"C 0.065301 -0.237648 1.004082 0.070748\n",
"D 0.301979 -0.107895 0.070748 0.798789\n"
]
}
],
"source": [
"d1 = pd.util.testing.makeTimeDataFrame()\n",
"print d1.head()\n",
"print d1.cov()"
]
},
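{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "The diagonal of the covariance matrix holds the variance of each column, which gives a quick sanity check. A sketch on synthetic data (the `df` here is illustrative, not the `d1` above):\n",
  "\n",
  "```python\n",
  "import numpy as np\n",
  "import pandas as pd\n",
  "\n",
  "df = pd.DataFrame(np.random.randn(100, 3), columns=list('ABC'))\n",
  "# extract the diagonal of the covariance matrix\n",
  "diag = pd.Series(np.diag(df.cov()), index=df.columns)\n",
  "# both cov() and var() normalize by N-1, so the difference is ~0\n",
  "print((diag - df.var()).abs().max())\n",
  "```"
 ]
},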
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3 Correlation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`pandas.Series.corr` allows to compute correlation between two `pandas.Series`. By the `method` paramether it's possible to choose between:\n",
"\n",
"* Pearson\n",
"* Kendall\n",
"* Spearman"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.99256798242819655"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s1.corr(s2, method='pearson')\n"
]
},
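{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "The other two methods work the same way. A minimal sketch on synthetic data (`x` and `y` are illustrative series, not taken from the tutorial):\n",
  "\n",
  "```python\n",
  "import numpy as np\n",
  "import pandas as pd\n",
  "\n",
  "x = pd.Series(np.arange(20, dtype=float))\n",
  "y = x + np.random.randn(20)\n",
  "\n",
  "# each method returns a coefficient in [-1, 1]\n",
  "for method in ['pearson', 'kendall', 'spearman']:\n",
  "    print(method, x.corr(y, method=method))\n",
  "```"
 ]
},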
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Like we just seen for covariance, it is possibile to call `pandas.DataFrame.corr` to obtain pairwise correlation of columns over a `pandas.DataFrame`"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" A | \n",
" 1.000000 | \n",
" 0.069413 | \n",
" 0.069045 | \n",
" 0.357980 | \n",
"
\n",
" \n",
" B | \n",
" 0.069413 | \n",
" 1.000000 | \n",
" -0.233080 | \n",
" -0.118642 | \n",
"
\n",
" \n",
" C | \n",
" 0.069045 | \n",
" -0.233080 | \n",
" 1.000000 | \n",
" 0.078998 | \n",
"
\n",
" \n",
" D | \n",
" 0.357980 | \n",
" -0.118642 | \n",
" 0.078998 | \n",
" 1.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D\n",
"A 1.000000 0.069413 0.069045 0.357980\n",
"B 0.069413 1.000000 -0.233080 -0.118642\n",
"C 0.069045 -0.233080 1.000000 0.078998\n",
"D 0.357980 -0.118642 0.078998 1.000000"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d1.corr()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4 Rolling moments and Binary rolling moments"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`pandas` provides also a lot of methods for calculating rolling moments."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['rolling_apply',\n",
" 'rolling_corr',\n",
" 'rolling_corr_pairwise',\n",
" 'rolling_count',\n",
" 'rolling_cov',\n",
" 'rolling_kurt',\n",
" 'rolling_max',\n",
" 'rolling_mean',\n",
" 'rolling_median',\n",
" 'rolling_min',\n",
" 'rolling_quantile',\n",
" 'rolling_skew',\n",
" 'rolling_std',\n",
" 'rolling_sum',\n",
" 'rolling_var',\n",
" 'rolling_window']"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[n for n in dir(pd) if n.startswith('rolling')]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see some examples:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" cumsum | \n",
" max | \n",
" mean | \n",
" min | \n",
"
\n",
" \n",
" \n",
" \n",
" 2002-09-22 | \n",
" 23.134580 | \n",
" 30.045892 | \n",
" 26.343671 | \n",
" 22.479375 | \n",
"
\n",
" \n",
" 2002-09-23 | \n",
" 22.235159 | \n",
" 30.045892 | \n",
" 26.319127 | \n",
" 22.235159 | \n",
"
\n",
" \n",
" 2002-09-24 | \n",
" 21.350533 | \n",
" 30.045892 | \n",
" 26.287655 | \n",
" 21.350533 | \n",
"
\n",
" \n",
" 2002-09-25 | \n",
" 21.160216 | \n",
" 30.045892 | \n",
" 26.233954 | \n",
" 21.160216 | \n",
"
\n",
" \n",
" 2002-09-26 | \n",
" 22.698541 | \n",
" 30.045892 | \n",
" 26.177502 | \n",
" 21.160216 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" cumsum max mean min\n",
"2002-09-22 23.134580 30.045892 26.343671 22.479375\n",
"2002-09-23 22.235159 30.045892 26.319127 22.235159\n",
"2002-09-24 21.350533 30.045892 26.287655 21.350533\n",
"2002-09-25 21.160216 30.045892 26.233954 21.160216\n",
"2002-09-26 22.698541 30.045892 26.177502 21.160216"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s3 = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))\n",
"s3 = s3.cumsum()\n",
"s3_max = pd.rolling_max(s3, 60)\n",
"s3_mean = pd.rolling_mean(s3, 60)\n",
"s3_min = pd.rolling_min(s3, 60)\n",
"data = {'cumsum':s3, 'max':s3_max, 'mean':s3_mean, 'min':s3_min}\n",
"df = pd.DataFrame(data)\n",
"df.tail()"
]
},
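{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "Note: in newer pandas releases the `pd.rolling_*` functions have been replaced by the `Series.rolling` / `DataFrame.rolling` method. A sketch of the same computation with that API (assuming pandas >= 0.18):\n",
  "\n",
  "```python\n",
  "import numpy as np\n",
  "import pandas as pd\n",
  "\n",
  "s = pd.Series(np.random.randn(1000),\n",
  "              index=pd.date_range('1/1/2000', periods=1000)).cumsum()\n",
  "# one rolling-window object gives access to all the moment methods\n",
  "roll = s.rolling(window=60)\n",
  "df = pd.DataFrame({'cumsum': s, 'max': roll.max(),\n",
  "                   'mean': roll.mean(), 'min': roll.min()})\n",
  "print(df.tail())\n",
  "```"
 ]
},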
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
" \n",
"\n",
"\n",
" \n",
"\n",
" \n",
"
\n",
"
BokehJS successfully loaded.\n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"bk.output_notebook()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"fig = bk.figure(x_axis_type = \"datetime\",\n",
" tools=\"pan,box_zoom,reset\", title = 'Rolling Moments',\n",
" plot_width=800, plot_height=400)\n",
"fig.line(df.index, df['cumsum'], color='cadetblue', legend='Cumulative Sum')\n",
"fig.line(df.index, df['max'], color='mediumorchid', legend='Max')\n",
"fig.line(df.index, df['min'], color='mediumpurple', legend='Min')\n",
"fig.line(df.index, df['mean'], color='navy', legend='Mean')\n",
"bk.show(fig)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`pandas.Series.cumsum` returns a new `pandas.Series` containing the cumulative sum of the given values."
]
},
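{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "For example, on a toy series (not part of the data above):\n",
  "\n",
  "```python\n",
  "import pandas as pd\n",
  "\n",
  "s = pd.Series([1, 2, 3, 4])\n",
  "# each element is the running total of the values so far\n",
  "print(s.cumsum().tolist())  # [1, 3, 6, 10]\n",
  "```"
 ]
},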
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"s4 = s3 + np.random.randn(len(s3))\n",
"rollc = pd.rolling_corr(s3, s4, window=10)\n",
"data2 = {'cumsum':s3, 'similar':s4, 'rolling correlation':rollc}\n",
"df2 = pd.DataFrame(data2)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"fig = bk.figure(x_axis_type = \"datetime\", title = 'Rolling Correlation',\n",
" plot_width=800, plot_height=400)\n",
"fig.line(df2.index, df2['cumsum'], color='cadetblue', legend='Cumulative Sum')\n",
"fig.line(df2.index, df2['similar'], color='mediumorchid', legend='Similar')\n",
"fig.line(df2.index, df2['rolling correlation'], color='navy', legend='Rolling Corr.')\n",
"fig.legend.orientation = \"bottom_right\"\n",
"bk.show(fig)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5 A pratical example: Return indexes and cumulative returns"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Date\n",
"2012-09-17 699.78\n",
"2012-09-18 701.91\n",
"2012-09-19 702.10\n",
"2012-09-20 698.70\n",
"2012-09-21 700.09\n",
"Name: Adj Close, dtype: float64"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"AAPL = pd.read_csv('example_data/p03_AAPL.txt', index_col='Date', parse_dates=True)\n",
"price = AAPL['Adj Close']\n",
"display(price.tail())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`pandas.Series.tail` returns the last n rows of a given `pandas.Series`."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ret = price['2011-10-03'] / price['2011-3-01'] - 1    # total return between two dates\n",
"returns = price.pct_change()\n",
"ret_index = (1 + returns).cumprod()\n",
"ret_index[0] = 1\n",
"monthly_returns = ret_index.resample('BM', how='last').pct_change()"
]
},
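{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "The return-index construction can be checked on a toy price series: multiplying the return index by the first price reproduces the original prices (hypothetical numbers, for illustration only):\n",
  "\n",
  "```python\n",
  "import pandas as pd\n",
  "\n",
  "price = pd.Series([100.0, 102.0, 99.0, 105.0])\n",
  "returns = price.pct_change()\n",
  "ret_index = (1 + returns).cumprod()\n",
  "ret_index[0] = 1                  # the index starts at 1\n",
  "rebuilt = ret_index * price[0]    # recovers the original prices\n",
  "print(rebuilt.tolist())\n",
  "```"
 ]
},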
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"fig = bk.figure(x_axis_type = 'datetime', title = 'Monthly Returns', plot_width=800, plot_height=400)\n",
"fig.line(monthly_returns.index, monthly_returns)\n",
"bk.show(fig)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"Visit [www.add-for.com]() for more tutorials and updates.\n",
"\n",
"This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}