{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [ "sys.path.append('Z:\\\\GitHub\\\\PyArb') # Change the path accordingly" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Statistical arbitrage with Python " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "NOTE: For the impatient/curious, the model gives about **12% annual returns** with **Sharpe ratio=5** and **maximum drawdown=0.5%** with **\\$0.005/per share transaction costs**. Jump to the end of this notebook to see the plots." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1 - Introduction \n", "\n", "I started writing this notebook as a learning project about equity statistical arbitrage. The model is a very elementary system of stochastic differential equations, which is supposed to model the behaviour of equity prices, and specifically their _interactions_. This can be viewed as a generalization of the [Ornstein-Uhlenbeck process](http://en.wikipedia.org/wiki/Ornstein%E2%80%93Uhlenbeck_process) (except that I'm using multiplicative noise as in GBM) that is widely used in modelling the behaviour of a certain portfolio of two stocks in pairs trading. The model is defined as \n", "\n", "$$ d X_t^i = A^i_j X_t^j dt + X_t^i d W_t^i , $$ \n", "\n", "where $X_t^i$ is the price of stock $i$ at time $t$, $A$ is an interaction matrix to be modelled, $dW$ is a white noise with a covariance matrix $\\mathbb E\\left( dW_t^i dW_t^j \\right) = \\rho^{ij} dt$, where $\\rho$ is also modelled from data. \n", "\n", "The model can be viewed as a simple generalization of geometric brownian motion to a multivariate case, so I will just call it the **Vector Geometric Brownian Motion (VGBM)**. \n", "\n", "A finite time step (1 min bar) equation can be written as \n", "\n", "$$ X_{t+1}^i = (\\mathbb{1} + A \\Delta t)_{ij} X_t^j + X_t^i d W_t^i ,$$ \n", "\n", "from which one can hopefully see the similarity to e.g. VAR(1), except that the noise term is multiplicative. \n", "\n", "The model probably can't be claimed to describe the behaviour of the stock market very well (why would the _price change_ of asset $i$ depend on a linear combination the other _prices_, $A^i_j X^j$?), but I think it should be able to capture the short term behavior at least to some degree. In any case, consider it a toy model from which it is easy to generalize to better models while still using most of the code below. \n", "\n", "A nice thing about the model is that it is easy to form a _mean reverting portfolio as $\\xi \\cdot X_t$, where $\\xi$ is a left eigenvector of $A$_. It is then a simple matter to update the the matrix A, and therefore also $\\xi$ and the portfolio, at each 1 min time step (currently the code works with 1 min bars only). Then, when the value of the portfolio reaches a pre-set value below or above the mean (e.g. one standard deviation), we buy or short (respectively) the portfolio. When that happens, we _lock_ the eigenvector/portfolio (i.e. stop updating it), and track it until it either reaches a \"close position\" band (at e.g. 0.15 standard deviation) or reaches a time limit, at which we close the position _and_ continue updating a new portfolio. 
\n", "\n", "The idea is to use as high frequency data as possible and to try to trade within, say, one day, since it is reasonable to expect that the model can't stay valid for a long time. Currently I have access to 1 min data only (from [The Bonnot Gang](http://thebonnotgang.com/tbg/) and [eoddata.com](http://eoddata.com), which should be adequate for a thorough backtest. On the other hand, (moderately) high frequency data will be susceptible to the **Epps effect**, i.e. the degradation of correlation matrices due to noisy data (asynchronous price information etc.). 1 min data _will_ of course contain more information than e.g. 5 min data, provided one can somehow take care of the noise. \n", "\n", "One elegant way to do this could be to resort to Random Matrix Theory (RMT). Basically the idea is to identify the noisy eigenvalues of correlation matrices (in my case, of the matrix A in fact) and just not use them in trading. A more straightforward approach is to just \"look\" at the statistics of the bunch of lowest eigenvalues in a training period and try to pick a low eigenvalue that's above the noisiest ones (more details below). **NOTE: Actually it seems that the eoddata.com data is much less noisy than TBG data and the lowest eigenvalue turns out to be the best, so no noise cleaning is necessary...** \n", "\n", "The code takes into account transaction costs but no slippage. In any case, slippage is pretty hard to implement in a backtest (at least without some data from a walk-forward), so the principle I'm using is that if the backtest is profitable with relatively high transaction costs, it's worth testing in paper trading and eventually live trading. Another thing that is so far missing is any new regarding a company's worth such as quarterly reports etc. Also the fact that usually one has to pay a fixed price for a short sale contract, which is also not included here: basically I'm assuming that the transactions are so large that the contract price is negligible. Another choice might be to consider a synthetic short positions, see e.g. [here](http://www.theoptionsguide.com/synthetic-short-stock.aspx)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### If you have any questions, feel free to email me at heikki.a.arponen@gmail.com " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2 - The backtest" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Imports:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from __future__ import division\n", "from __future__ import print_function\n", "\n", "import pyarb as pa\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import os\n", "import glob\n", "import matplotlib.pyplot as plt\n", "import cProfile" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "os.chdir('Z:\\\\PythonStuff\\\\newtech')# Change the path accordingly" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "os.getcwd()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 4, "text": [ "'Z:\\\\PythonStuff\\\\newtech'" ] } ], "prompt_number": 4 }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Import data from The Bonnot Gang" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following loads the data and arranges it into a Pandas dataframe. 
The method \"clean\" throws out days with half of missing data (by default; this is adjustable). \"normalize\" creates a dataframe where all prices start at 1 for easy visualization. \n", "\n", "Note that the data import did some resampling of the 1 min data (TBG data is not always logged at exactly even minutes), which introduces a little bit more noise because of asynchronity... \n", "\n", "Note also that the last minute at 16:00 was removed also from the price data to make $X$ and $dX$ the same length within a day." ] }, { "cell_type": "code", "collapsed": false, "input": [ "prices = pa.PrepareData_TBG()\n", "prices.clean()\n", "prices.normalize()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the prices $X_t^i$ and the differences $dX_t^i$ and also the normalized prices:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "Xdata = prices.get_X()\n", "dXdata = prices.get_dX()\n", "normdata = prices.normalized" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The price data outs a summary of the Pandas DataFrame:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "Xdata" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "<class 'pandas.core.frame.DataFrame'>\n", "DatetimeIndex: 189540 entries, 2011-09-22 09:30:00 to 2013-09-03 15:59:00\n", "Data columns (total 25 columns):\n", "AAPL_1m 189540 non-null values\n", "AMAT_1m 189540 non-null values\n", "AMD_1m 189540 non-null values\n", "BRCD_1m 189540 non-null values\n", "CMCSA_1m 189540 non-null values\n", "CSCO_1m 189540 non-null values\n", "DELL_1m 189540 non-null values\n", "EBAY_1m 189540 non-null values\n", "EMC_1m 189540 non-null values\n", "GE_1m 189540 non-null values\n", "GLW_1m 189540 non-null values\n", "GOOG_1m 189540 non-null values\n", "HPQ_1m 189540 non-null values\n", "INTC_1m 189540 non-null values\n", "MSFT_1m 189540 non-null values\n", "MU_1m 189540 non-null values\n", "NVDA_1m 189540 non-null values\n", "ORCL_1m 189540 non-null values\n", "QCOM_1m 189540 non-null values\n", "SNDK_1m 189540 non-null values\n", "S_1m 189540 non-null values\n", "T_1m 189540 non-null values\n", "WIN_1m 189540 non-null values\n", "XRX_1m 189540 non-null values\n", "YHOO_1m 189540 non-null values\n", "dtypes: float64(25)\n", "" ], "metadata": {}, "output_type": "pyout", "prompt_number": 7, "text": [ "
\n", " | Symbol | \n", "Name | \n", "LastSale | \n", "MarketCap | \n", "Industry | \n", "
---|---|---|---|---|---|
28 | \n", "AAPL | \n", "Apple Inc. | \n", "508.890 | \n", "4.623250e+11 | \n", "Computer Manufacturing | \n", "
180 | \n", "GOOG | \n", "Google Inc. | \n", "1011.408 | \n", "3.368201e+11 | \n", "Computer Software: Programming, Data Processing | \n", "
272 | \n", "MSFT | \n", "Microsoft Corporation | \n", "34.960 | \n", "2.922916e+11 | \n", "Computer Software: Prepackaged Software | \n", "
166 | \n", "FB | \n", "Facebook, Inc. | \n", "54.220 | \n", "1.320430e+11 | \n", "Computer Software: Programming, Data Processing | \n", "
96 | \n", "CSCO | \n", "Cisco Systems, Inc. | \n", "22.961 | \n", "1.236327e+11 | \n", "Computer Communications Equipment | \n", "