# Predicting the Stock Market

In this project, you'll work with data from the S&P500 Index. [The S&P500](https://en.wikipedia.org/wiki/S%26P_500) is a stock market index. Before we get into what an index is, we'll need to start with the basics of the stock market.

Some companies are publicly traded, which means that anyone can **buy and sell their shares** on the open market. A share entitles the owner to some control over the direction of the company and to a percentage (or share) of the earnings of the company. Buying or selling shares is commonly known as **trading a stock**. The price of a share is based on supply and demand for a given stock.

**Indexes** aggregate the prices of multiple stocks together, and allow you to see how the market as a whole performs.

You'll be using historical data on the price of the S&P500 Index to make predictions about future prices. Predicting whether an index goes up or down helps forecast how the stock market as a whole performs. Since stocks tend to correlate with how well the economy as a whole performs, it can also help with economic forecasts.

In this project, our dataset contains index prices. Each row in the file contains a daily record of the price of the S&P500 Index from *1950* to *2015*. The dataset is stored in `sphist.csv`.

| Column | Description |
| ----------- | ----------- |
| Date | The date of the record. |
| Open | The opening price of the day (when trading starts). |
| High | The highest trade price during the day. |
| Low | The lowest trade price during the day. |
| Close | The closing price for the day (when trading is finished). |
| Volume | The number of shares traded. |
| Adj Close | The daily closing price, adjusted retroactively to include any corporate actions. |

You'll be using this dataset to develop a predictive model. You'll train the model with data from *1950-2012* and try to make predictions for *2013-2015*.

## Overview of the dataset


In [51]:
import pandas as pd
from datetime import datetime
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
# Read our data
df = pd.read_csv("sphist.csv")
# Convert the Date column into a Pandas date type
df["Date"] = pd.to_datetime(df["Date"])
df.sort_values(by="Date", ascending=True, inplace=True)

df.head(10)

|  | Date | Open | High | Low | Close | Volume | Adj Close |
| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| 16589 | 1950-01-03 | 16.66 | 16.66 | 16.66 | 16.66 | 1260000.0 | 16.66 |
| 16588 | 1950-01-04 | 16.85 | 16.85 | 16.85 | 16.85 | 1890000.0 | 16.85 |
| 16587 | 1950-01-05 | 16.93 | 16.93 | 16.93 | 16.93 | 2550000.0 | 16.93 |
| 16586 | 1950-01-06 | 16.98 | 16.98 | 16.98 | 16.98 | 2010000.0 | 16.98 |
| 16585 | 1950-01-09 | 17.08 | 17.08 | 17.08 | 17.08 | 2520000.0 | 17.08 |
| 16584 | 1950-01-10 | 17.030001 | 17.030001 | 17.030001 | 17.030001 | 2160000.0 | 17.030001 |
| 16583 | 1950-01-11 | 17.09 | 17.09 | 17.09 | 17.09 | 2630000.0 | 17.09 |
| 16582 | 1950-01-12 | 16.76 | 16.76 | 16.76 | 16.76 | 2970000.0 | 16.76 |
| 16581 | 1950-01-13 | 16.67 | 16.67 | 16.67 | 16.67 | 3330000.0 | 16.67 |
| 16580 | 1950-01-16 | 16.719999 | 16.719999 | 16.719999 | 16.719999 | 1460000.0 | 16.719999 |


## Generating indicators

Stock market data is sequential: each observation comes a day after the previous one. The observations are therefore not independent, and you can't treat them as such. The time series nature of the data means that we can generate indicators to make our model more accurate. Our goal is to teach the model how to predict the current price from historical prices, so each indicator should be built only from days before the one being predicted.
Let's select three indicators:
- The average price over the past **5** days.
- The average price over the past **30** days.
- The *ratio* between the average price over the past 5 days and the average price over the past 30 days.
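
As a hedged illustration of these three indicators, the sketch below builds them with pandas `rolling` and `shift` on a tiny synthetic frame. The helper name `add_rolling_indicators`, the `demo` data, and the short windows `(2, 4)` are assumptions made for the example; they are not part of the notebook, which uses 5 and 30 days.

```python
import pandas as pd

def add_rolling_indicators(df, target="Close", windows=(5, 30)):
    """Return a sorted copy of df with trailing-mean columns and their ratio."""
    out = df.sort_values("Date").reset_index(drop=True)
    for window in windows:
        # rolling(window).mean() ends at the current row; shift(1) pushes it
        # down one row so each row only sees the `window` days before it.
        out[f"{target}_Day_{window}"] = out[target].rolling(window).mean().shift(1)
    short_w, long_w = windows
    out[f"Ratio_{target}_{short_w}_{long_w}"] = (
        out[f"{target}_Day_{short_w}"] / out[f"{target}_Day_{long_w}"]
    )
    return out

# Tiny synthetic frame (hypothetical values, not the real index data)
demo = pd.DataFrame({
    "Date": pd.date_range("2020-01-01", periods=8, freq="D"),
    "Close": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
})
result = add_rolling_indicators(demo, windows=(2, 4))
print(result)
```

In this toy frame, row 2 gets `Close_Day_2 = 1.5`, the mean of the two preceding values (1.0 and 2.0), and its own value is not included. Rows without enough history are left as `NaN` rather than 0, so a plain `dropna()` would remove them.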

In [52]:
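indicators = [5, 30]

def add_indicator(df, indicators, targets):
    """Add trailing-average columns and their ratio for each target column (in place)."""
    for target in targets:
        for index, row in df.iterrows():
            # Number of rows dated strictly before the current row
            size = len(df[df["Date"] < row["Date"]])
            for indicator in indicators:
                column = "{}_Day_{}".format(target, indicator)
                if size < indicator:
                    # Not enough history yet: mark with 0 so the row can be dropped later
                    df.loc[index, column] = 0
                else:
                    # Index labels run in descending order after the date sort, so this
                    # label slice spans index+indicator down to index-1, inclusive.
                    # NOTE: that range also covers the current row and the next day;
                    # ending the slice at index+1 would restrict it to past days only.
                    df.loc[index, column] = np.mean(
                        df.loc[index + indicator:index - 1, target])
        # The two columns just added are the last two, in the order of `indicators`
        ratio_column = "Ratio_{}_{}_{}".format(target, indicators[0], indicators[1])
        df[ratio_column] = df.iloc[:, -2] / df.iloc[:, -1]

add_indicator(df, indicators, ["Close", "Volume"])
df.head(20)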

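|  | Date | Open | High | Low | Close | Volume | Adj Close | Close_Day_5 | Close_Day_30 | Ratio_Close_5_30 | Volume_Day_5 | Volume_Day_30 | Ratio_Volume_5_30 |
| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| 16589 | 1950-01-03 | 16.66 | 16.66 | 16.66 | 16.66 | 1260000.0 | 16.66 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | NaN |
| 16588 | 1950-01-04 | 16.85 | 16.85 | 16.85 | 16.85 | 1890000.0 | 16.85 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | NaN |
| 16587 | 1950-01-05 | 16.93 | 16.93 | 16.93 | 16.93 | 2550000.0 | 16.93 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | NaN |
| 16586 | 1950-01-06 | 16.98 | 16.98 | 16.98 | 16.98 | 2010000.0 | 16.98 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | NaN |
| 16585 | 1950-01-09 | 17.08 | 17.08 | 17.08 | 17.08 | 2520000.0 | 17.08 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | NaN |
| 16584 | 1950-01-10 | 17.030001 | 17.030001 | 17.030001 | 17.030001 | 2160000.0 | 17.030001 | 16.945714 | 0.0 | inf | 2145714.0 | 0.0 | inf |
| 16583 | 1950-01-11 | 17.09 | 17.09 | 17.09 | 17.09 | 2630000.0 | 17.09 | 16.96 | 0.0 | inf | 2390000.0 | 0.0 | inf |
| 16582 | 1950-01-12 | 16.76 | 16.76 | 16.76 | 16.76 | 2970000.0 | 16.76 | 16.934286 | 0.0 | inf | 2595714.0 | 0.0 | inf |
| 16581 | 1950-01-13 | 16.67 | 16.67 | 16.67 | 16.67 | 3330000.0 | 16.67 | 16.904286 | 0.0 | inf | 2440000.0 | 0.0 | inf |
| 16580 | 1950-01-16 | 16.719999 | 16.719999 | 16.719999 | 16.719999 | 1460000.0 | 16.719999 | 16.887143 | 0.0 | inf | 2408571.0 | 0.0 | inf |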


## Cleaning and Splitting up our data
Since we're computing indicators that use historical data, there are some rows where there isn't enough historical data to generate them. Let's remove those rows, based on the selected columns and the number of days in the window.

In [53]:
def clean(df, day, columns):
    """Drop rows that lack enough history for the `day`-day indicators (in place)."""
    for column in columns:
        col = "{}_Day_{}".format(column, day)
        # add_indicator() writes 0 when fewer than `day` earlier rows exist
        df.drop(df[df[col] == 0].index, axis=0, inplace=True)
    # Drop any remaining rows with missing values
    df.dropna(axis=0, inplace=True)

clean_df = df.copy()
clean(clean_df, 30, ["Close", "Volume"])

print(clean_df.isnull().sum())
print(clean_df.shape)

Date                 0
Open                 0
High                 0
Low                  0
Close                0
Volume               0
Adj Close            0
Close_Day_5          0
Close_Day_30         0
Ratio_Close_5_30     0
Volume_Day_5         0
Volume_Day_30        0
Ratio_Volume_5_30    0
dtype: int64
(16560, 13)


We split the data as follows:
- `train` should contain any rows in the data with a date less than 2013-01-01.
- `test` should contain any rows with a date greater than or equal to 2013-01-01.

In [54]:
# Generate the train and test datasets
cutoff = datetime(year=2013, month=1, day=1)
train = clean_df[clean_df["Date"] < cutoff]
# Use >= so that a row dated exactly on the cutoff would land in the test set
test = clean_df[clean_df["Date"] >= cutoff]
print(train.shape)
print(test.shape)

(15821, 13)
(739, 13)


## Making Prediction

We'll fit a **linear regression model** on the train dataset and use it to make predictions on the test dataset. The error metric is **Mean Absolute Error** (MAE).
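
As a quick illustration of the metric, MAE is the mean of the absolute differences between actual and predicted values. The sketch below computes it by hand with NumPy; the `actual` and `predicted` arrays are made-up numbers, not values from the dataset.

```python
import numpy as np

# Hypothetical closing prices and model predictions
actual = np.array([100.0, 102.0, 105.0])
predicted = np.array([101.0, 100.0, 105.0])

# Absolute errors are 1, 2 and 0, so their mean is 1.0
absolute_errors = np.abs(actual - predicted)
mae_by_hand = absolute_errors.mean()
print("MAE = {}".format(mae_by_hand))
```

Because MAE is expressed in the same units as the target, an MAE of 1.0 here means the predictions are off by one index point on average.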

In [55]:
def model(target, features, train, test):
    """Fit a linear regression on train and print the MAE on test."""
    lr = LinearRegression()
    lr.fit(train[features], train[target])

    predictions = lr.predict(test[features])
    # Score against the same column the model was trained to predict
    mae = mean_absolute_error(test[target], predictions)
    print("MAE = {}".format(mae))

target = "Close"

# Model 1: the two trailing averages only
features = ["Close_Day_5", "Close_Day_30"]
model(target, features, train, test)

# Model 2: the trailing averages plus their ratio
features = ["Close_Day_5", "Close_Day_30", "Ratio_Close_5_30"]
model(target, features, train, test)

MAE = 11.350741886520453
MAE = 11.351021603941877


The results above show that adding the ratio doesn't meaningfully reduce the error (the MAE is in fact marginally higher). Let's confirm by checking the correlation coefficients.

In [56]:
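# Drop the non-numeric Date column so corr() behaves the same across pandas versions
clean_df.drop(columns="Date").corr()["Close"]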

Open                 0.999901
High                 0.999954
Low                  0.999956
Close                1.000000
Volume               0.774267
Adj Close            1.000000
Close_Day_5          0.999892
Close_Day_30         0.999297
Ratio_Close_5_30     0.005923
Volume_Day_5         0.783925
Volume_Day_30        0.788661
Ratio_Volume_5_30   -0.004822
Name: Close, dtype: float64

The coefficients (*Ratio_Close_5_30 = 0.005923* and *Ratio_Volume_5_30 = -0.004822*) are close to zero, which is consistent with the observation above.
Let's add two more indicators:
- The average volume over the past **5** days.
- The average volume over the past **30** days.

In [57]:
features = ["Close_Day_5", "Close_Day_30", "Volume_Day_5", "Volume_Day_30"]
model(target, features, train, test)

MAE = 11.3073026169537


This is a small improvement, from *MAE = 11.350741886520453* to *MAE = 11.3073026169537*. We can also make significant structural improvements to the algorithm.

## Predictions only one day ahead

For this improvement, we train a model on all data up to *2013-01-02* and make a prediction for *2013-01-03*, then train another model on all data up to *2013-01-03* and make a prediction for *2013-01-04*, and so on. This more closely simulates what you'd do if you were trading using the algorithm.
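
The procedure described here is often called walk-forward validation. The following is a minimal sketch of it, restricted to dates on or after a chosen start date, on a synthetic frame. The helper name `walk_forward_mae`, its parameters, and the `demo` data are assumptions for illustration, and the sketch assumes one row per date.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def walk_forward_mae(df, features, target, start_date, date_col="Date"):
    """Refit on all rows before each test date, predict that date, average |error|."""
    df = df.sort_values(date_col).reset_index(drop=True)
    errors = []
    for i in df.index[df[date_col] >= pd.Timestamp(start_date)]:
        train = df.iloc[:i]  # strictly earlier rows only
        if len(train) < 2:
            continue  # too little history to fit a line
        lr = LinearRegression().fit(train[features], train[target])
        prediction = lr.predict(df.loc[[i], features])[0]
        errors.append(abs(df.loc[i, target] - prediction))
    return float(np.mean(errors))

# Synthetic, perfectly linear series (hypothetical data): y = 2 * x
demo = pd.DataFrame({
    "Date": pd.date_range("2020-01-01", periods=10, freq="D"),
    "x": np.arange(10, dtype=float),
})
demo["y"] = 2.0 * demo["x"]

mae_demo = walk_forward_mae(demo, ["x"], "y", "2020-01-06")
print("MAE = {}".format(mae_demo))
```

Limiting the loop to the test period keeps the resulting MAE on the same rows as a single train/test split, which makes the two numbers easier to compare. Refitting once per test day is slow on long histories, so this is a sketch of the idea rather than a tuned implementation.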

In [64]:
# Modified model function: fit on all rows dated before `row`, then score that row
def model(df, row, target, features):
    lr = LinearRegression()
    train = df[df["Date"] < row["Date"]]
    test = df[df["Date"] == row["Date"]]
    if train.empty:
        # No earlier data to train on (first row): contributes an error of 0
        return 0
    lr.fit(train[features], train[target])
    prediction = lr.predict(test[features])
    return mean_absolute_error(test[target], prediction)

# Average the one-day-ahead errors over every row of clean_df
maes = clean_df.apply(lambda row: model(clean_df, row, target, features), axis=1)
mae = np.mean(maes)
print("MAE = {}".format(mae))

MAE = 3.7842744203948424


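The error drops sharply, which suggests that the model is more accurate when it predicts only one day ahead. Note, however, that this MAE is averaged over every row of `clean_df`, including the decades when the index traded at far lower prices, whereas the earlier MAE covers only 2013-2015, so the two figures are not directly comparable.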

## Other ideas

To improve the accuracy of the predictions further, we can:
- Try other techniques, like a random forest, and see if they perform better.
- Incorporate outside data, such as the weather in New York City (where most trading happens) the day before and the amount of Twitter activity around certain stocks.
- Make the system real-time by writing an automated script to download the latest data when the market closes and make predictions for the next day.
- Make the system "higher-resolution". You're currently making daily predictions, but you could make hourly, minute-by-minute, or second-by-second predictions. This requires obtaining more data, though. You could also make predictions for individual stocks instead of the S&P500.
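
As a rough sketch of the first idea, a random forest can be swapped in through the same scikit-learn `fit`/`predict` interface. The data below is synthetic (randomly generated stand-ins for the indicator columns), and the hyperparameters are arbitrary assumptions rather than tuned values.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n_rows = 200

# Synthetic stand-ins for the indicator columns (not real index data)
demo = pd.DataFrame({
    "Close_Day_5": rng.uniform(10, 20, n_rows),
    "Close_Day_30": rng.uniform(10, 20, n_rows),
})
demo["Close"] = (0.8 * demo["Close_Day_5"] + 0.2 * demo["Close_Day_30"]
                 + rng.normal(0, 0.1, n_rows))

demo_features = ["Close_Day_5", "Close_Day_30"]
train_demo, test_demo = demo.iloc[:150], demo.iloc[150:]

# Arbitrary, untuned settings; random_state only makes the run repeatable
rf = RandomForestRegressor(n_estimators=100, random_state=1)
rf.fit(train_demo[demo_features], train_demo["Close"])

rf_predictions = rf.predict(test_demo[demo_features])
rf_mae = mean_absolute_error(test_demo["Close"], rf_predictions)
print("MAE = {}".format(rf_mae))
```

One caveat worth keeping in mind: tree-based models cannot predict values outside the range of targets seen during training, which matters for an index that trends upward over time. Predicting a change or return instead of the raw price is one possible way around this.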