{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Predicting Stock Market\n", "\n", "In this project, you'll work with data from the S&P500 Index. [The S&P500](https://en.wikipedia.org/wiki/S%26P_500) is a stock market index. Before we get into what an index is, we'll need to start with the basics of the stock market.\n", "\n", "Some companies are publicly traded, which means that anyone can **buy and sell their shares** on the open market. A share entitles the owner to some control over the direction of the company and to a percentage (or share) of the earnings of the company. When you buy or sell shares, it's common known as **trading a stock**. The price of a share is based on supply and demand for a given stock.\n", "\n", "**Indexes** aggregate the prices of multiple stocks together, and allow you to see how the market as a whole performs.\n", "\n", "You'll be using historical data on the price of the S&P500 Index to make predictions about future prices. Predicting whether an index goes up or down helps forecast how the stock market as a whole performs. Since stocks tend to correlate with how well the economy as a whole is performs, it can also help with economic forecasts.\n", "\n", "In this project, our dataset contain index prices. Each row in the file contains a daily record of the price of the S&P500 Index from *1950* to *2015*. The dataset is stored in sphist.csv.\n", "\n", "| Columns | Description |\n", "| ----------- | ----------- |\n", "| **Date** | The date of the record. |\n", "| Open | The opening price of the day (when trading starts) |\n", "| High | The highest trade price during the day |\n", "| Low | The lowest trade price during the day |\n", "| Close | The closing price for the day (when trading is finished) |\n", "| Volume | The number of shares traded |\n", "| Adj Close | The daily closing price, adjusted retroactively to include any corporate actions. |\n", "\n", "You'll be using this dataset to develop a predictive model. You'll train the model with data from *1950-2012* and try to make predictions from *2013-2015*.\n", "\n", "## Overview of the dataset\n" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateOpenHighLowCloseVolumeAdj Close
165891950-01-0316.66000016.66000016.66000016.6600001260000.016.660000
165881950-01-0416.85000016.85000016.85000016.8500001890000.016.850000
165871950-01-0516.93000016.93000016.93000016.9300002550000.016.930000
165861950-01-0616.98000016.98000016.98000016.9800002010000.016.980000
165851950-01-0917.08000017.08000017.08000017.0800002520000.017.080000
165841950-01-1017.03000117.03000117.03000117.0300012160000.017.030001
165831950-01-1117.09000017.09000017.09000017.0900002630000.017.090000
165821950-01-1216.76000016.76000016.76000016.7600002970000.016.760000
165811950-01-1316.67000016.67000016.67000016.6700003330000.016.670000
165801950-01-1616.71999916.71999916.71999916.7199991460000.016.719999
\n", "
" ], "text/plain": [ " Date Open High Low Close Volume \\\n", "16589 1950-01-03 16.660000 16.660000 16.660000 16.660000 1260000.0 \n", "16588 1950-01-04 16.850000 16.850000 16.850000 16.850000 1890000.0 \n", "16587 1950-01-05 16.930000 16.930000 16.930000 16.930000 2550000.0 \n", "16586 1950-01-06 16.980000 16.980000 16.980000 16.980000 2010000.0 \n", "16585 1950-01-09 17.080000 17.080000 17.080000 17.080000 2520000.0 \n", "16584 1950-01-10 17.030001 17.030001 17.030001 17.030001 2160000.0 \n", "16583 1950-01-11 17.090000 17.090000 17.090000 17.090000 2630000.0 \n", "16582 1950-01-12 16.760000 16.760000 16.760000 16.760000 2970000.0 \n", "16581 1950-01-13 16.670000 16.670000 16.670000 16.670000 3330000.0 \n", "16580 1950-01-16 16.719999 16.719999 16.719999 16.719999 1460000.0 \n", "\n", " Adj Close \n", "16589 16.660000 \n", "16588 16.850000 \n", "16587 16.930000 \n", "16586 16.980000 \n", "16585 17.080000 \n", "16584 17.030001 \n", "16583 17.090000 \n", "16582 16.760000 \n", "16581 16.670000 \n", "16580 16.719999 " ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "from datetime import datetime\n", "import numpy as np\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.metrics import mean_absolute_error\n", "# Read our data\n", "df = pd.read_csv(\"sphist.csv\")\n", "# Convert the Date column into a Pandas date type\n", "df[\"Date\"] = pd.to_datetime(df[\"Date\"])\n", "df.sort_values(by=\"Date\", ascending=True, inplace=True)\n", "\n", "df.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generating indicators\n", "\n", "Stock market data is sequential and each observation comes a day after the previous observation. Thus, the observations are not all independent and you can't treat them as such. The time series nature of the data means that we can generate indicators to make our model more accurate. Our goal is to teach the model how to predict the current price from historical prices.\n", "Let's select 3 indicators : \n", "- The average price from the past **5** days.\n", "- The average price for the past **30** days.\n", "- The *ratio* between the average price for the past 5 days, and the average price for the past 30 days." ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateOpenHighLowCloseVolumeAdj CloseClose_Day_5Close_Day_30Ratio_Close_5_30Volume_Day_5Volume_Day_30Ratio_Volume_5_30
165891950-01-0316.66000016.66000016.66000016.6600001260000.016.6600000.0000000.0NaN0.000000e+000.0NaN
165881950-01-0416.85000016.85000016.85000016.8500001890000.016.8500000.0000000.0NaN0.000000e+000.0NaN
165871950-01-0516.93000016.93000016.93000016.9300002550000.016.9300000.0000000.0NaN0.000000e+000.0NaN
165861950-01-0616.98000016.98000016.98000016.9800002010000.016.9800000.0000000.0NaN0.000000e+000.0NaN
165851950-01-0917.08000017.08000017.08000017.0800002520000.017.0800000.0000000.0NaN0.000000e+000.0NaN
165841950-01-1017.03000117.03000117.03000117.0300012160000.017.03000116.9457140.0inf2.145714e+060.0inf
165831950-01-1117.09000017.09000017.09000017.0900002630000.017.09000016.9600000.0inf2.390000e+060.0inf
165821950-01-1216.76000016.76000016.76000016.7600002970000.016.76000016.9342860.0inf2.595714e+060.0inf
165811950-01-1316.67000016.67000016.67000016.6700003330000.016.67000016.9042860.0inf2.440000e+060.0inf
165801950-01-1616.71999916.71999916.71999916.7199991460000.016.71999916.8871430.0inf2.408571e+060.0inf
165791950-01-1716.86000116.86000116.86000116.8600011790000.016.86000116.8542860.0inf2.272857e+060.0inf
165781950-01-1816.85000016.85000016.85000016.8500001570000.016.85000016.8314290.0inf2.131429e+060.0inf
165771950-01-1916.87000116.87000116.87000116.8700011170000.016.87000116.8042860.0inf1.961429e+060.0inf
165761950-01-2016.90000016.90000016.90000016.9000001440000.016.90000016.8271430.0inf1.728571e+060.0inf
165751950-01-2316.92000016.92000016.92000016.9200001340000.016.92000016.8542860.0inf1.431429e+060.0inf
165741950-01-2416.86000116.86000116.86000116.8600011250000.016.86000116.8571430.0inf1.465714e+060.0inf
165731950-01-2516.74000016.74000016.74000016.7400001700000.016.74000016.8385720.0inf1.374286e+060.0inf
165721950-01-2616.73000016.73000016.73000016.7300001150000.016.73000016.8342860.0inf1.328571e+060.0inf
165711950-01-2716.82000016.82000016.82000016.8200001250000.016.82000016.8557140.0inf1.395714e+060.0inf
165701950-01-3017.02000017.02000017.02000017.0200001640000.017.02000016.8771430.0inf1.431429e+060.0inf
\n", "
" ], "text/plain": [ " Date Open High Low Close Volume \\\n", "16589 1950-01-03 16.660000 16.660000 16.660000 16.660000 1260000.0 \n", "16588 1950-01-04 16.850000 16.850000 16.850000 16.850000 1890000.0 \n", "16587 1950-01-05 16.930000 16.930000 16.930000 16.930000 2550000.0 \n", "16586 1950-01-06 16.980000 16.980000 16.980000 16.980000 2010000.0 \n", "16585 1950-01-09 17.080000 17.080000 17.080000 17.080000 2520000.0 \n", "16584 1950-01-10 17.030001 17.030001 17.030001 17.030001 2160000.0 \n", "16583 1950-01-11 17.090000 17.090000 17.090000 17.090000 2630000.0 \n", "16582 1950-01-12 16.760000 16.760000 16.760000 16.760000 2970000.0 \n", "16581 1950-01-13 16.670000 16.670000 16.670000 16.670000 3330000.0 \n", "16580 1950-01-16 16.719999 16.719999 16.719999 16.719999 1460000.0 \n", "16579 1950-01-17 16.860001 16.860001 16.860001 16.860001 1790000.0 \n", "16578 1950-01-18 16.850000 16.850000 16.850000 16.850000 1570000.0 \n", "16577 1950-01-19 16.870001 16.870001 16.870001 16.870001 1170000.0 \n", "16576 1950-01-20 16.900000 16.900000 16.900000 16.900000 1440000.0 \n", "16575 1950-01-23 16.920000 16.920000 16.920000 16.920000 1340000.0 \n", "16574 1950-01-24 16.860001 16.860001 16.860001 16.860001 1250000.0 \n", "16573 1950-01-25 16.740000 16.740000 16.740000 16.740000 1700000.0 \n", "16572 1950-01-26 16.730000 16.730000 16.730000 16.730000 1150000.0 \n", "16571 1950-01-27 16.820000 16.820000 16.820000 16.820000 1250000.0 \n", "16570 1950-01-30 17.020000 17.020000 17.020000 17.020000 1640000.0 \n", "\n", " Adj Close Close_Day_5 Close_Day_30 Ratio_Close_5_30 Volume_Day_5 \\\n", "16589 16.660000 0.000000 0.0 NaN 0.000000e+00 \n", "16588 16.850000 0.000000 0.0 NaN 0.000000e+00 \n", "16587 16.930000 0.000000 0.0 NaN 0.000000e+00 \n", "16586 16.980000 0.000000 0.0 NaN 0.000000e+00 \n", "16585 17.080000 0.000000 0.0 NaN 0.000000e+00 \n", "16584 17.030001 16.945714 0.0 inf 2.145714e+06 \n", "16583 17.090000 16.960000 0.0 inf 2.390000e+06 \n", "16582 16.760000 16.934286 0.0 inf 2.595714e+06 \n", "16581 16.670000 16.904286 0.0 inf 2.440000e+06 \n", "16580 16.719999 16.887143 0.0 inf 2.408571e+06 \n", "16579 16.860001 16.854286 0.0 inf 2.272857e+06 \n", "16578 16.850000 16.831429 0.0 inf 2.131429e+06 \n", "16577 16.870001 16.804286 0.0 inf 1.961429e+06 \n", "16576 16.900000 16.827143 0.0 inf 1.728571e+06 \n", "16575 16.920000 16.854286 0.0 inf 1.431429e+06 \n", "16574 16.860001 16.857143 0.0 inf 1.465714e+06 \n", "16573 16.740000 16.838572 0.0 inf 1.374286e+06 \n", "16572 16.730000 16.834286 0.0 inf 1.328571e+06 \n", "16571 16.820000 16.855714 0.0 inf 1.395714e+06 \n", "16570 17.020000 16.877143 0.0 inf 1.431429e+06 \n", "\n", " Volume_Day_30 Ratio_Volume_5_30 \n", "16589 0.0 NaN \n", "16588 0.0 NaN \n", "16587 0.0 NaN \n", "16586 0.0 NaN \n", "16585 0.0 NaN \n", "16584 0.0 inf \n", "16583 0.0 inf \n", "16582 0.0 inf \n", "16581 0.0 inf \n", "16580 0.0 inf \n", "16579 0.0 inf \n", "16578 0.0 inf \n", "16577 0.0 inf \n", "16576 0.0 inf \n", "16575 0.0 inf \n", "16574 0.0 inf \n", "16573 0.0 inf \n", "16572 0.0 inf \n", "16571 0.0 inf \n", "16570 0.0 inf " ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "indicators = [5, 30]\n", "def add_indicator(df,indicators,targets):\n", " for target in targets:\n", " for index, row in df.iterrows():\n", " size = len(df[df['Date'] < row['Date']])\n", " for indicator in indicators:\n", " column = \"{}_Day_{}\".format(target,indicator)\n", " #new_column = \"Volume_Day_{}\".format(indicator)\n", " # print(column)\n", " if 
size < indicator:\n", " df.loc[index, column] = 0\n", " else:\n", " df.loc[index, column] = np.mean(\n", " df.loc[index+indicator:index-1, target])\n", " column1 =\"Ratio_{}_{}_{}\".format(target,indicators[0],indicators[1])\n", " df[column1] = df.iloc[:,-2] / df.iloc[:,-1]\n", "add_indicator(df,indicators,[\"Close\",\"Volume\"])\n", "df.head(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cleaning and Splitting up our data\n", "Since we're computing indicators that use historical data, there are some rows where there isn't enough historical data to generate them. let's clean our data depending on the select columns and number of days." ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Date 0\n", "Open 0\n", "High 0\n", "Low 0\n", "Close 0\n", "Volume 0\n", "Adj Close 0\n", "Close_Day_5 0\n", "Close_Day_30 0\n", "Ratio_Close_5_30 0\n", "Volume_Day_5 0\n", "Volume_Day_30 0\n", "Ratio_Volume_5_30 0\n", "dtype: int64\n", "(16560, 13)\n" ] } ], "source": [ "def clean(df,day,columns):\n", " # Remove data before 1951-01-03\n", " #df = df[df[\"Date\"] > datetime(year=1951, month=1, day=2)]\n", " for column in columns:\n", " col = \"{}_Day_{}\".format(column,day)\n", " df.drop(df[(df[col] == 0)].index,\n", " axis=0, inplace=True)\n", " df.dropna(axis=0, inplace=True)\n", "\n", "clean_df = df.copy()\n", "clean(clean_df,30,[\"Close\",\"Volume\"])\n", "\n", "print(clean_df.isnull().sum())\n", "print(clean_df.shape)\n", "# Generate the train and test dataset\n", "train = clean_df[clean_df[\"Date\"] < datetime(year=2013, month=1, day=1)]\n", "test = clean_df[clean_df[\"Date\"] > datetime(year=2013, month=1, day=1)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On the splitting part, we're going :\n", "- train should contain any rows in the data with a date less than 2013-01-01\n", "- test should contain any rows with a date greater than or equal to 2013-01-01" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(15821, 13)\n", "(739, 13)\n" ] } ], "source": [ "# Generate the train and test dataset\n", "train = clean_df[clean_df[\"Date\"] < datetime(year=2013, month=1, day=1)]\n", "test = clean_df[clean_df[\"Date\"] > datetime(year=2013, month=1, day=1)]\n", "print(train.shape)\n", "print(test.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Making Prediction\n", "\n", "The **linear regression model** is going to be used to train the train dataset and predict the test dataset and the error metric is **Mean Absolute Error** (MAE)." 
] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MAE = 11.350741886520453\n", "MAE = 11.351021603941877\n" ] } ], "source": [ "def model(target,features,train,test):\n", " lr = LinearRegression()\n", " lr.fit(train[features], train[target])\n", "\n", " predictions = lr.predict(test[features])\n", " mae = mean_absolute_error(test['Close'], predictions)\n", " print(\"MAE = {}\".format(mae))\n", "\n", "features = [\"Close_Day_5\", \"Close_Day_30\"]\n", "target = \"Close\"\n", "#\"ratio_5_365\",\"Volume_Day_5\", \"Volume_Day_365\"\n", "model(target,features,train,test)\n", "features = [\"Close_Day_5\", \"Close_Day_30\",\"Ratio_Close_5_30\"]\n", "target = \"Close\"\n", "#\"ratio_5_365\",\"Volume_Day_5\", \"Volume_Day_365\"\n", "model(target,features,train,test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By the result above we can say the ratio doesn't have significant effect in reducing error. Let's be sure by checking the correlation coefficients." ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Open 0.999901\n", "High 0.999954\n", "Low 0.999956\n", "Close 1.000000\n", "Volume 0.774267\n", "Adj Close 1.000000\n", "Close_Day_5 0.999892\n", "Close_Day_30 0.999297\n", "Ratio_Close_5_30 0.005923\n", "Volume_Day_5 0.783925\n", "Volume_Day_30 0.788661\n", "Ratio_Volume_5_30 -0.004822\n", "Name: Close, dtype: float64" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clean_df.corr()[\"Close\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The coefficients (*Ratio_Close_5_30 = 0.005923 and Ratio_Volume_5_30 = -0.004822*) confirm the assertion from above.\n", "Let's add more 2 more indicators :\n", "- The average volume from the past **5** days.\n", "- The average volume for the past **30** days." ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MAE = 11.3073026169537\n" ] } ], "source": [ "features = [\"Close_Day_5\", \"Close_Day_30\",\"Volume_Day_5\", \"Volume_Day_30\"]\n", "#\"ratio_5_365\",\"Volume_Day_5\", \"Volume_Day_365\"\n", "model(target,features,train,test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have a small improvement of our model from *MAE = 11.350741886520453* to *MAE = 11.3073026169537* . We can also make significant structural improvements to the algorithm.\n", "\n", "## Predictions only one day ahead\n", "\n", "About this improvement, we train a model using data from *1951-01-03 to 2013-01-02*, make predictions for *2013-01-03*, and then train another model using data from *1951-01-03 to 2013-01-03*, make predictions for 2013-01-04, and so on. This more closely simulates what you'd do if you were trading using the algorithm." 
] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MAE = 3.7842744203948424\n" ] } ], "source": [ "# Modification of model function\n", "def model(df,row,target,features):\n", " #print(row)\n", " lr = LinearRegression()\n", " train = df[df[\"Date\"] < row[\"Date\"]]\n", " test = df[df[\"Date\"] == row[\"Date\"]]\n", " if train.empty :\n", " return 0\n", " else:\n", " lr.fit(train[features], train[target])\n", " prediction = lr.predict(test[features])\n", " mae = mean_absolute_error(test['Close'], prediction)\n", " return mae\n", " \n", " \n", "\n", "# get the MAEs of our new model\n", "maes = clean_df.apply(lambda row : model(clean_df,row,target,features),axis=1)\n", "mae = np.mean(maes)\n", "print(\"MAE = {}\".format(mae))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see a big improvement in the reduction's error, by that we can conclure the accuracy of the model will improve by making predictions only one day ahead.\n", "\n", "## Other idea\n", "\n", "In the goal to improve the accuracy of the prediction, we can :\n", "- Try other techniques, like a random forest, and see if they perform better.\n", "- Incorporate outside data, such as the weather in New York City (where most trading happens) the day before and the amount of Twitter activity around certain stocks.\n", "- Make the system real-time by writing an automated script to download the latest data when the market closes and make predictions for the next day.\n", "- Make the system \"higher-resolution\". You're currently making daily predictions, but you could make hourly, minute-by-minute, or second-by-second predictions. This requires obtaining more data, though. You could also make predictions for individual stocks instead of the S&P500." ] } ], "metadata": { "kernelspec": { "display_name": "data_science", "language": "python", "name": "data_science" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }