{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Statistical tools" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import addutils.toc ; addutils.toc.js(ipy_notebook=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With this tutorial we are going to see some of the statistical and computational tools offered by `pandas`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import datetime\n", "import scipy.io\n", "import numpy as np\n", "import pandas as pd\n", "import bokeh.plotting as bk\n", "from IPython.display import display, HTML\n", "from addutils import css_notebook, side_by_side2\n", "css_notebook()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1 Percent change" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given a `pandas.Series` the method `pct_change` returns a new `pandas.Series` object containing percent change over a given number of periods." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0
010.187371
111.068102
211.944168
313.011350
414.080648
514.960530
615.987561
717.071012
\n", "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0
0NaN
18.645321
27.915230
38.934751
48.218195
56.248875
66.864935
76.776838
\n", "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0
0NaN
1NaN
2NaN
327.720384
427.218267
525.253847
622.873960
721.237402
\n", "
" ], "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1 = pd.Series(range(10, 18) + np.random.randn(8) / 10)\n", "\n", "pct_ch_1d = s1.pct_change() * 100\n", "pct_ch_3d = s1.pct_change(periods=3) * 100\n", "\n", "HTML(side_by_side2(s1, pct_ch_1d, pct_ch_3d))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2 Covariance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given two `pandas.Series` the method `cov` computes covariance between them, excluding missing values." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0
2000-01-030.320914
2000-01-04-1.105849
2000-01-051.409292
2000-01-06-0.210900
2000-01-07-0.270258
2000-01-100.151734
2000-01-110.252147
\n", "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0
2000-01-030.421457
2000-01-04-1.105964
2000-01-051.499170
2000-01-06-0.209599
2000-01-07-0.125346
2000-01-100.127648
2000-01-110.110899
\n", "
" ], "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1 = pd.util.testing.makeTimeSeries(7)\n", "s2 = s1 + np.random.randn(len(s1)) / 10\n", "HTML(side_by_side2(s1, s2))" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.59005429091399686" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1.cov(s2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is also possibile to compute pairwise covariance of a `pandas.DataFrame` columns using `pandas.DataFrame.cov` method. Here we use the module `pandas.util.testing` in order to generate random data easily:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " A B C D\n", "2000-01-03 -0.481372 -0.584465 -0.038654 0.280964\n", "2000-01-04 -0.663815 -2.048175 -0.872593 -0.222687\n", "2000-01-05 -0.719081 -0.553857 0.556233 0.467152\n", "2000-01-06 -0.152237 0.568140 0.454709 0.478042\n", "2000-01-07 -0.749170 1.024167 -1.523115 -1.723031\n", " A B C D\n", "A 0.890847 0.066664 0.065301 0.301979\n", "B 0.066664 1.035357 -0.237648 -0.107895\n", "C 0.065301 -0.237648 1.004082 0.070748\n", "D 0.301979 -0.107895 0.070748 0.798789\n" ] } ], "source": [ "d1 = pd.util.testing.makeTimeDataFrame()\n", "print d1.head()\n", "print d1.cov()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3 Correlation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`pandas.Series.corr` allows to compute correlation between two `pandas.Series`. 
With the `method` parameter it is possible to choose between:\n", "\n", "* Pearson\n", "* Kendall\n", "* Spearman" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.99256798242819655" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1.corr(s2, method='pearson')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we just saw for covariance, it is possible to call `pandas.DataFrame.corr` to obtain the pairwise correlation of the columns of a `pandas.DataFrame`:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
A1.0000000.0694130.0690450.357980
B0.0694131.000000-0.233080-0.118642
C0.069045-0.2330801.0000000.078998
D0.357980-0.1186420.0789981.000000
\n", "
" ], "text/plain": [ " A B C D\n", "A 1.000000 0.069413 0.069045 0.357980\n", "B 0.069413 1.000000 -0.233080 -0.118642\n", "C 0.069045 -0.233080 1.000000 0.078998\n", "D 0.357980 -0.118642 0.078998 1.000000" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d1.corr()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4 Rolling moments and Binary rolling moments" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`pandas` provides also a lot of methods for calculating rolling moments." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['rolling_apply',\n", " 'rolling_corr',\n", " 'rolling_corr_pairwise',\n", " 'rolling_count',\n", " 'rolling_cov',\n", " 'rolling_kurt',\n", " 'rolling_max',\n", " 'rolling_mean',\n", " 'rolling_median',\n", " 'rolling_min',\n", " 'rolling_quantile',\n", " 'rolling_skew',\n", " 'rolling_std',\n", " 'rolling_sum',\n", " 'rolling_var',\n", " 'rolling_window']" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[n for n in dir(pd) if n.startswith('rolling')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see some examples:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cumsummaxmeanmin
2002-09-2223.13458030.04589226.34367122.479375
2002-09-2322.23515930.04589226.31912722.235159
2002-09-2421.35053330.04589226.28765521.350533
2002-09-2521.16021630.04589226.23395421.160216
2002-09-2622.69854130.04589226.17750221.160216
\n", "
" ], "text/plain": [ " cumsum max mean min\n", "2002-09-22 23.134580 30.045892 26.343671 22.479375\n", "2002-09-23 22.235159 30.045892 26.319127 22.235159\n", "2002-09-24 21.350533 30.045892 26.287655 21.350533\n", "2002-09-25 21.160216 30.045892 26.233954 21.160216\n", "2002-09-26 22.698541 30.045892 26.177502 21.160216" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s3 = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))\n", "s3 = s3.cumsum()\n", "s3_max = pd.rolling_max(s3, 60)\n", "s3_mean = pd.rolling_mean(s3, 60)\n", "s3_min = pd.rolling_min(s3, 60)\n", "data = {'cumsum':s3, 'max':s3_max, 'mean':s3_mean, 'min':s3_min}\n", "df = pd.DataFrame(data)\n", "df.tail()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ " \n", "\n", "\n", " \n", "\n", "
\n", " \n", " BokehJS successfully loaded.\n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "bk.output_notebook()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig = bk.figure(x_axis_type = \"datetime\",\n", " tools=\"pan,box_zoom,reset\", title = 'Rolling Moments',\n", " plot_width=800, plot_height=400)\n", "fig.line(df.index, df['cumsum'], color='cadetblue', legend='Cumulative Sum')\n", "fig.line(df.index, df['max'], color='mediumorchid', legend='Max')\n", "fig.line(df.index, df['min'], color='mediumpurple', legend='Min')\n", "fig.line(df.index, df['mean'], color='navy', legend='Min')\n", "bk.show(fig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`pandas.Series.cumsum` returns a new `pandas.Series` containing the cumulative sum of the given values." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [], "source": [ "s4 = s3 + np.random.randn(len(s3))\n", "rollc = pd.rolling_corr(s3, s4, window=10)\n", "data2 = {'cumsum':s3, 'similar':s4, 'rolling correlation':rollc}\n", "df2 = pd.DataFrame(data2)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig = bk.figure(x_axis_type = \"datetime\", title = 'Rolling Correlation',\n", " plot_width=800, plot_height=400)\n", "fig.line(df2.index, df2['cumsum'], color='cadetblue', legend='Cumulative Sum')\n", "fig.line(df2.index, df2['similar'], color='mediumorchid', legend='Similar')\n", "fig.line(df2.index, df2['rolling correlation'], color='navy', legend='Rolling Corr.')\n", "fig.legend.orientation = \"bottom_right\"\n", "bk.show(fig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5 A pratical example: Return indexes and cumulative returns" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Date\n", "2012-09-17 699.78\n", "2012-09-18 701.91\n", "2012-09-19 702.10\n", "2012-09-20 698.70\n", "2012-09-21 700.09\n", "Name: Adj Close, dtype: float64" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "AAPL = pd.read_csv('example_data/p03_AAPL.txt', index_col='Date', parse_dates=True)\n", "price = AAPL['Adj Close']\n", "display(price.tail())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`pandas.Series.tail` returns the last n rows of a given `pandas.Series`." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [], "source": [ "price['2011-10-03'] / price['2011-3-01'] - 1\n", "returns = price.pct_change()\n", "ret_index = (1 + returns).cumprod()\n", "ret_index[0] = 1\n", "monthly_returns = ret_index.resample('BM', how='last').pct_change()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig = bk.figure(x_axis_type = 'datetime', title = 'Monthly Returns', plot_width=800, plot_height=400)\n", "fig.line(monthly_returns.index, monthly_returns)\n", "bk.show(fig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "Visit [www.add-for.com]() for more tutorials and updates.\n", "\n", "This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License." ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }