{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Statistical tools" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import addutils.toc ; addutils.toc.js(ipy_notebook=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With this tutorial we are going to see some of the statistical and computational tools offered by `pandas`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import datetime\n", "import scipy.io\n", "import numpy as np\n", "import pandas as pd\n", "import bokeh.plotting as bk\n", "from IPython.display import display, HTML\n", "from addutils import css_notebook, side_by_side2\n", "css_notebook()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1 Percent change" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given a `pandas.Series` the method `pct_change` returns a new `pandas.Series` object containing percent change over a given number of periods." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0
09.964166
110.990810
212.147865
312.994615
413.918815
514.943803
615.952536
717.038615
\n", "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0
0NaN
110.303353
210.527481
36.970360
47.112174
57.364053
66.750173
76.808187
\n", "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0
0NaN
1NaN
2NaN
330.413468
426.640485
523.015883
622.762668
722.414265
\n", "
" ], "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1 = pd.Series(range(10, 18) + np.random.randn(8) / 10)\n", "\n", "pct_ch_1d = s1.pct_change() * 100\n", "pct_ch_3d = s1.pct_change(periods=3) * 100\n", "\n", "HTML(side_by_side2(s1, pct_ch_1d, pct_ch_3d))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2 Covariance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given two `pandas.Series` the method `cov` computes covariance between them, excluding missing values." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0
2000-01-030.936232
2000-01-04-0.192915
2000-01-05-0.905428
2000-01-06-0.463522
2000-01-07-1.006806
2000-01-101.027258
2000-01-110.361945
\n", "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0
2000-01-031.033826
2000-01-04-0.060566
2000-01-05-0.735890
2000-01-06-0.530967
2000-01-07-1.256859
2000-01-101.069003
2000-01-110.274196
\n", "
" ], "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1 = pd.util.testing.makeTimeSeries(7)\n", "s2 = s1 + np.random.randn(len(s1)) / 10\n", "HTML(side_by_side2(s1, s2))" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.72459894915153866" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1.cov(s2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is also possibile to compute pairwise covariance of a `pandas.DataFrame` columns using `pandas.DataFrame.cov` method. Here we use the module `pandas.util.testing` in order to generate random data easily:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " A B C D\n", "2000-01-03 -2.238333 0.138162 -0.908553 1.562661\n", "2000-01-04 0.292598 -0.570359 0.192452 -0.543481\n", "2000-01-05 0.013642 1.045337 0.690139 -0.069664\n", "2000-01-06 1.960531 1.135304 -0.878462 0.248373\n", "2000-01-07 0.573900 0.719603 -0.726617 -0.773457\n", " A B C D\n", "A 0.794372 0.199038 -0.167064 -0.112771\n", "B 0.199038 1.061793 -0.114622 0.018002\n", "C -0.167064 -0.114622 0.762304 0.043736\n", "D -0.112771 0.018002 0.043736 0.945614\n" ] } ], "source": [ "d1 = pd.util.testing.makeTimeDataFrame()\n", "print (d1.head())\n", "print (d1.cov())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3 Correlation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`pandas.Series.corr` allows to compute correlation between two `pandas.Series`. 
With the `method` parameter it is possible to choose between:\n", "\n", "* Pearson\n", "* Kendall\n", "* Spearman" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.98693371660495477" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1.corr(s2, method='pearson')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we just saw for covariance, it is possible to call `pandas.DataFrame.corr` to obtain the pairwise correlation of the columns of a `pandas.DataFrame`:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
A1.0000000.216723-0.214688-0.130115
B0.2167231.000000-0.1274040.017966
C-0.214688-0.1274041.0000000.051513
D-0.1301150.0179660.0515131.000000
\n", "
" ], "text/plain": [ " A B C D\n", "A 1.000000 0.216723 -0.214688 -0.130115\n", "B 0.216723 1.000000 -0.127404 0.017966\n", "C -0.214688 -0.127404 1.000000 0.051513\n", "D -0.130115 0.017966 0.051513 1.000000" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d1.corr()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4 Rolling moments and Binary rolling moments" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`pandas` provides also a lot of methods for calculating rolling moments." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['rolling_apply',\n", " 'rolling_corr',\n", " 'rolling_corr_pairwise',\n", " 'rolling_count',\n", " 'rolling_cov',\n", " 'rolling_kurt',\n", " 'rolling_max',\n", " 'rolling_mean',\n", " 'rolling_median',\n", " 'rolling_min',\n", " 'rolling_quantile',\n", " 'rolling_skew',\n", " 'rolling_std',\n", " 'rolling_sum',\n", " 'rolling_var',\n", " 'rolling_window']" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[n for n in dir(pd) if n.startswith('rolling')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see some examples:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cumsummaxmeanmin
2002-09-2216.81423034.49626921.76313213.986689
2002-09-2317.76036534.49626921.48684213.986689
2002-09-2418.30880034.49626921.22048413.986689
2002-09-2519.13371434.49626920.96781613.986689
2002-09-2619.68421934.21784120.72094913.986689
\n", "
" ], "text/plain": [ " cumsum max mean min\n", "2002-09-22 16.814230 34.496269 21.763132 13.986689\n", "2002-09-23 17.760365 34.496269 21.486842 13.986689\n", "2002-09-24 18.308800 34.496269 21.220484 13.986689\n", "2002-09-25 19.133714 34.496269 20.967816 13.986689\n", "2002-09-26 19.684219 34.217841 20.720949 13.986689" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s3 = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))\n", "s3 = s3.cumsum()\n", "s3_max = pd.rolling_max(s3, 60)\n", "s3_mean = pd.rolling_mean(s3, 60)\n", "s3_min = pd.rolling_min(s3, 60)\n", "data = {'cumsum':s3, 'max':s3_max, 'mean':s3_mean, 'min':s3_min}\n", "df = pd.DataFrame(data)\n", "df.tail()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ " \n", "\n", "\n", " \n", "\n", "
\n", " \n", " BokehJS successfully loaded.\n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "bk.output_notebook()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig = bk.figure(x_axis_type = \"datetime\",\n", " tools=\"pan,box_zoom,reset\", title = 'Rolling Moments',\n", " plot_width=750, plot_height=400)\n", "fig.line(df.index, df['cumsum'], color='cadetblue', legend='Cumulative Sum')\n", "fig.line(df.index, df['max'], color='mediumorchid', legend='Max')\n", "fig.line(df.index, df['min'], color='mediumpurple', legend='Min')\n", "fig.line(df.index, df['mean'], color='navy', legend='Min')\n", "bk.show(fig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`pandas.Series.cumsum` returns a new `pandas.Series` containing the cumulative sum of the given values." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [], "source": [ "s4 = s3 + np.random.randn(len(s3))\n", "rollc = pd.rolling_corr(s3, s4, window=10)\n", "data2 = {'cumsum':s3, 'similar':s4, 'rolling correlation':rollc}\n", "df2 = pd.DataFrame(data2)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig = bk.figure(x_axis_type = \"datetime\", title = 'Rolling Correlation',\n", " plot_width=750, plot_height=400)\n", "fig.line(df2.index, df2['cumsum'], color='cadetblue', legend='Cumulative Sum')\n", "fig.line(df2.index, df2['similar'], color='mediumorchid', legend='Similar')\n", "fig.line(df2.index, df2['rolling correlation'], color='navy', legend='Rolling Corr.')\n", "fig.legend.orientation = \"bottom_right\"\n", "bk.show(fig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5 A pratical example: Return indexes and cumulative returns" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Date\n", "2012-09-17 699.78\n", "2012-09-18 701.91\n", "2012-09-19 702.10\n", "2012-09-20 698.70\n", "2012-09-21 700.09\n", "Name: Adj Close, dtype: float64" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "AAPL = pd.read_csv('example_data/p03_AAPL.txt', index_col='Date', parse_dates=True)\n", "price = AAPL['Adj Close']\n", "display(price.tail())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`pandas.Series.tail` returns the last n rows of a given `pandas.Series`." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [], "source": [ "price['2011-10-03'] / price['2011-3-01'] - 1\n", "returns = price.pct_change()\n", "ret_index = (1 + returns).cumprod()\n", "ret_index[0] = 1\n", "monthly_returns = ret_index.resample('BM', how='last').pct_change()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig = bk.figure(x_axis_type = 'datetime', title = 'Monthly Returns',\n", " plot_width=750, plot_height=400)\n", "fig.line(monthly_returns.index, monthly_returns)\n", "bk.show(fig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "Visit [www.add-for.com]() for more tutorials and updates.\n", "\n", "This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.3" } }, "nbformat": 4, "nbformat_minor": 0 }