{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "# Financial Data Structures\n", "### [(Go to Quant Lab)](https://israeldi.github.io/quantlab/)\n", "\n", "#### Source: Advances in Financial Machine Learning\n", "\n", "© MARCOS LOPEZ DE PRADO\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Code and Other work by: [ad1m](https://github.com/ad1m/Financial_Machine_Learning/blob/master/Financial_Data_Structures.ipynb) and [fernandodelacalle](https://github.com/fernandodelacalle/adv-financial-ml-marcos-exercises/tree/master/notebooks)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Table of Contents" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
In this notebook we learn to work with unstructured financial data and to derive from it a structured dataset for machine learning algorithms. Generally, it is not advisable to consume someone else's preprocessed dataset: the likely outcome is that you will only rediscover what they have already figured out. We want to take an unstructured dataset and process it so that we can find novel, informative features.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Structured data is data that you would usually find in a relational database. For instance: phone numbers, Social Security numbers, or ZIP codes. Even text strings of variable length like names are contained in records, making it a simple matter to search. Unstructured Data is data that is in the wild that does not have a concrete structure. For instance, social media text feeds, audio, and images.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are four types of financial data:
\n", "Note: A dataset might be useful if it annoys the data infrastructure team. Perhaps your competitors did not try to use it for particular reasons or gave up midway.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data structures used to contain trading information are often referred to as bars. This is basically a table of data and the rows contain information. These rows are the \"bars\". These bars can vary greatly in how they were constructed but in general there are two categories of bars:
\n", "Standard Bars aim to transform a series of observations that arrive at an irregular frequency into a homogenous series derived from regular sampling. There are 4 main type of standard bars:
\n", "Time Bars are obtained by sampling information at a fixed time interval e.g. once every minute. This information usually contains:
\n", "This is the typical csv data that you will find from yahoo finance for a particular equity. This type of data should be avoided for two reasons:
\n", "Sample variables such as Timestamp, VWAP, open price, etc. are extracted each time a pre-defined number of transactions takes place.
\n", "\n", "For instance, every 1000 transactions we take a sample bar. Mandlebrot and Taylor realized that sampling as a function of the number of transactions gives more desirable statistical properties; sampling as a function of trading activity allows us to achieve returns closer to Independant and Identitically Distributed (IID) Normal. Many statistical methods make an assumption that observations are drawn from an IID Gaussian process so this allows us to take advantage of these statistical observations.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Volume bars sample every time a pre-defined amount of the securitie's units (shares, futures contracts, etc.) have been exchanged. For example, we could sample prices every time a futures contract exchanges 1,000 units, regardless of the number of ticks involved. Volume bars circumvent the following problem that tick bars incur:
\n", "\n", "Suppose there is one order sitting on the offer for a size of 10. If we buy 10 lots, the order will be recorded as 1 tick. If there are 10 orders of size 1, our 1 buy will be recorded as 10 separate transactions.
\n", "\n", "Volume bars are preferred over tick bars as sampling by volume gets us closer to an IID Gaussian distribution than sampling by tick bars.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dollar bars are formed by sampling an observation every time a pre-defined market value is exchanged.
\n", "The number of shares traded is a function of the actual value exchanged. Thus, it makes sense to sample bars in terms of dollar value exchanged rather than ticks or volume particularly when the analysis involves significant price fluctuations.
\n", "\n", "Dollar bars are also more interesting than time, tick, or volume bars since the number of outstanding shares often changes multiple times over the course of a securitie's life as a result of corporate actions. Even after adjustment for splits and reverse splits, there are other actions that will impact the amount of ticks and volumes, like issuing new shares or buying back existing shares. Dollar bars tend to be robust in the face of those actions.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "___" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Preparation\n", "\n", "- Download the data from: http://www.kibot.com/buy.aspx at the: \"Free historical data for your data quality analysis\" section\n", "- We have the data from the WDC stock and the iShares IVE ETF: https://www.ishares.com/us/products/239728/ishares-sp-500-value-etf " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Tick Data info from kibot\n", "- http://www.kibot.com/support.aspx#data_format\n", "- The order of the fields in the tick files (with bid/ask prices) is: Date, Time, Price, Bid, Ask, Size. \n", "- The bid/ask prices are recorded whenever a trade occurs and they represent the \"national best bid and offer\" (NBBO) prices across multiple exchanges and ECNs.\n", "- For each trade, current best bid/ask values are recorded together with the transaction price and volume. Trade records are not aggregated and all transactions are included in their consecutive order.\n", "- The order of fields in our regular tick files (without bid/ask) is: Date,Time,Price,Size.\n", "- The order of fields in our 1, 5 or 10 second files is: Date,Time,Open,High,Low,Close,Volume. It is the same format used in our minute files.\n", "- The stocks and ETFs data includes pre-market (8:00-9:30 a.m. ET), regular (9:30 a.m.-4:00 p.m. ET.) and after market (4:00-6:30 p.m. ET) sessions.\n", "- Trading for SPY (SPDR S&P 500 ETF) and some other liquid ETFs and stocks usually starts at 4 a.m and ends at 8 p.m. ET." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Initially import all the modules we will be using for our notebook" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Modules for Dataframes\n", "import numpy as np\n", "import pandas as pd \n", "\n", "# Module for plotting\n", "import matplotlib as mpl\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "plt.style.use('ggplot')\n", "plt.style.use('ggplot')\n", "plt.rcParams['figure.figsize'] = 10,8\n", "\n", "# Here we install these packages in order to use them later in the notebook\n", "import sys\n", "# !{sys.executable} -m pip install statsmodels\n", "# !{sys.executable} -m pip install pyarrow\n", "# !{sys.executable} -m pip install mpl_finance\n", "# !{sys.executable} -m pip install seaborn\n", "\n", "# These Modules are for compressing the large data set\n", "import pyarrow as pa\n", "import pyarrow.parquet as pq\n", "\n", "import os\n", "\n", "# Import our user-written Functions \n", "import ml_functions as ml" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Create Directories to save results**" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "# Directory where we will save our plots\n", "directory = \"./data\"\n", "if not os.path.exists(directory):\n", " os.makedirs(directory)\n", "\n", "directory = \"./images\"\n", "if not os.path.exists(directory):\n", " os.makedirs(directory)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can apply `prepare_data_kibot` function that reformats the column names and the time index. Using `to_parquet` method in order to compress the data. Here we download and save data to the `data` directory." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "ename": "KeyboardInterrupt", "evalue": "", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\n", " | price | \n", "bid | \n", "ask | \n", "vol | \n", "dollar_vol | \n", "
---|---|---|---|---|---|
date | \n", "\n", " | \n", " | \n", " | \n", " | \n", " |
2009-09-28 09:30:00 | \n", "50.79 | \n", "50.70 | \n", "50.79 | \n", "100 | \n", "5079.00 | \n", "
2009-09-28 09:30:00 | \n", "50.71 | \n", "50.70 | \n", "50.79 | \n", "638 | \n", "32352.98 | \n", "
2009-09-28 09:31:32 | \n", "50.75 | \n", "50.75 | \n", "50.76 | \n", "100 | \n", "5075.00 | \n", "
2009-09-28 09:31:33 | \n", "50.75 | \n", "50.72 | \n", "50.75 | \n", "100 | \n", "5075.00 | \n", "
2009-09-28 09:31:50 | \n", "50.75 | \n", "50.73 | \n", "50.76 | \n", "300 | \n", "15225.00 | \n", "
\n", " | price | \n", "bid | \n", "ask | \n", "vol | \n", "dollar_vol | \n", "
---|---|---|---|---|---|
date | \n", "\n", " | \n", " | \n", " | \n", " | \n", " |
2009-09-28 09:30:00 | \n", "50.79 | \n", "50.70 | \n", "50.79 | \n", "100 | \n", "5079.00 | \n", "
2009-09-28 09:30:00 | \n", "50.71 | \n", "50.70 | \n", "50.79 | \n", "638 | \n", "32352.98 | \n", "
2009-09-28 09:31:32 | \n", "50.75 | \n", "50.75 | \n", "50.76 | \n", "100 | \n", "5075.00 | \n", "
2009-09-28 09:31:33 | \n", "50.75 | \n", "50.72 | \n", "50.75 | \n", "100 | \n", "5075.00 | \n", "
2009-09-28 09:31:50 | \n", "50.75 | \n", "50.73 | \n", "50.76 | \n", "300 | \n", "15225.00 | \n", "
\n", " | count_mean | \n", "count_std | \n", "
---|---|---|
tick | \n", "317.515 | \n", "193.028 | \n", "
vol | \n", "905.638 | \n", "532.047 | \n", "
dollar | \n", "806.258 | \n", "556.688 | \n", "
\n", " | returns_autocorr | \n", "
---|---|
tick | \n", "0.116565 | \n", "
vol | \n", "-0.168050 | \n", "
dollar | \n", "0.086624 | \n", "
\n", " | monthly_returns_var | \n", "
---|---|
tick | \n", "6.799775e-11 | \n", "
vol | \n", "7.809432e-11 | \n", "
dollar | \n", "6.030135e-11 | \n", "
\n", " | jarque_bera_results | \n", "
---|---|
tick | \n", "1.195019e+11 | \n", "
vol | \n", "3.873030e+13 | \n", "
dollar | \n", "7.664635e+12 | \n", "
\n", " | count_mean | \n", "count_std | \n", "returns_autocorr | \n", "monthly_returns_var | \n", "jarque_bera_results | \n", "
---|---|---|---|---|---|
tick | \n", "317.515 | \n", "193.028 | \n", "0.116565 | \n", "6.799775e-11 | \n", "1.195019e+11 | \n", "
vol | \n", "905.638 | \n", "532.047 | \n", "-0.168050 | \n", "7.809432e-11 | \n", "3.873030e+13 | \n", "
dollar | \n", "806.258 | \n", "556.688 | \n", "0.086624 | \n", "6.030135e-11 | \n", "7.664635e+12 | \n", "
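Statistics of this kind can be computed along the following lines. This is a sketch on a synthetic close series with made-up values; `scipy.stats.jarque_bera` stands in for whichever test implementation the notebook's helper functions actually use:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic close-price series standing in for a bar series' closes
rng = np.random.default_rng(4)
close = pd.Series(50 + rng.standard_normal(2_000).cumsum() * 0.01)
returns = close.pct_change().dropna()

# First-lag serial correlation of bar returns
autocorr = returns.autocorr(lag=1)

# Jarque-Bera test: a low statistic means returns are closer to Gaussian
jb_stat, jb_pvalue = stats.jarque_bera(returns)
```

Comparing these numbers across tick, volume, and dollar bars is how the table above ranks the sampling schemes.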