{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Linear Regression\n",
    "LinearRegression is a simple machine learning model where the response y is modelled by a linear combination of the predictors in X.\n",
    "\n",
    "The linear regression model implemented in the cuml library allows the user to change the fit_intercept, normalize and algorithm parameters.  cuML’s LinearRegression expects either a cuDF DataFrame or a NumPy matrix and provides 2 algorithms to fit a linear mode: lSVD and Eig . SVD is more stable, but Eig (default) is much more faster.\n",
    "\n",
    "The Linear Regression function accepts the following parameters:\n",
    "1. algorithm:‘eig’ or ‘svd’ (default = ‘eig’).  Eig uses a eigendecomposition of the covariance matrix, and is much faster. SVD is slower, but is guaranteed to be stable.\n",
    "2. fit_intercept:boolean (default = True).  If True, LinearRegression tries to correct for the global mean of y. If False, the model expects that you have centered the data.\n",
    "3. normalize:boolean (default = False).  If True, the predictors in X will be normalized by dividing by it’s L2 norm. If False, no scaling will be done.\n",
    "\n",
    "The methods that can be used with the Linear regression are:\n",
    "1. fit: Fit the model with X and y.\n",
    "1. get_params: Sklearn style return parameter state\n",
    "1. predict: Predicts the y for X.\n",
    "1. set_params: Sklearn style set parameter state to dictionary of params.\n",
    "\n",
    "In order to convert your dataset to cudf format please read the cudf documentation on https://rapidsai.github.io/projects/cudf/en/latest/. For additional information on the linear regression model please refer to the documentation on https://rapidsai.github.io/projects/cuml/en/latest/index.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import cudf\n",
    "import os\n",
    "from cuml import LinearRegression as cuLinearRegression\n",
    "from sklearn.linear_model import LinearRegression as skLinearRegression\n",
    "from sklearn.datasets import make_regression\n",
    "from sklearn.metrics import mean_squared_error"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Helper Functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# check if the mortgage dataset is present and then extract the data from it, else just create a random dataset for regression \n",
    "import gzip\n",
    "def load_data(nrows, ncols, cached = 'data/mortgage.npy.gz'):\n",
    "    #split the dataset in a 80:20 split\n",
    "    train_rows = int(nrows*0.8)\n",
    "    if os.path.exists(cached):\n",
    "        print('use mortgage data')\n",
    "\n",
    "        with gzip.open(cached) as f:\n",
    "            X = np.load(f)\n",
    "        # the 4th column is 'adj_remaining_months_to_maturity'\n",
    "        # used as the label\n",
    "        X = X[:,[i for i in range(X.shape[1]) if i!=4]]\n",
    "        y = X[:,4:5]\n",
    "        rindices = np.random.randint(0,X.shape[0]-1,nrows)\n",
    "        X = X[rindices,:ncols]\n",
    "        y = y[rindices]\n",
    "        df_y_train = pd.DataFrame({'fea%d'%i:y[0:train_rows,i] for i in range(y.shape[1])})\n",
    "        df_y_test = pd.DataFrame({'fea%d'%i:y[train_rows:,i] for i in range(y.shape[1])})\n",
    "    else:\n",
    "        print('use random data')\n",
    "        X,y = make_regression(n_samples=nrows,n_features=ncols,n_informative=ncols, random_state=0)\n",
    "        df_y_train = pd.DataFrame({'fea0':y[0:train_rows,]})\n",
    "        df_y_test = pd.DataFrame({'fea0':y[train_rows:,]})\n",
    "\n",
    "    df_X_train = pd.DataFrame({'fea%d'%i:X[0:train_rows,i] for i in range(X.shape[1])})\n",
    "    df_X_test = pd.DataFrame({'fea%d'%i:X[train_rows:,i] for i in range(X.shape[1])})\n",
    "\n",
    "    return df_X_train, df_X_test, df_y_train, df_y_test"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Run tests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "# nrows = number of samples\n",
    "# ncols = number of features of each sample \n",
    "\n",
    "nrows = 2**20\n",
    "ncols = 399\n",
    "\n",
    "#split the dataset into training and testing sets, in the ratio of 80:20 respectively\n",
    "X_train, X_test, y_train, y_test = load_data(nrows,ncols)\n",
    "print('training data',X_train.shape)\n",
    "print('training label',y_train.shape)\n",
    "print('testing data',X_test.shape)\n",
    "print('testing label',y_test.shape)\n",
    "print('label',y_test.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "# use the sklearn linear regression model to fit the training dataset \n",
    "skols = skLinearRegression(fit_intercept=True,\n",
    "                  normalize=True)\n",
    "skols.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "# calculate the mean squared error of the sklearn linear regression model on the testing dataset\n",
    "sk_predict = skols.predict(X_test)\n",
    "error_sk = mean_squared_error(y_test,sk_predict)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "# convert the pandas dataframe to cudf format\n",
    "X_cudf = cudf.DataFrame.from_pandas(X_train)\n",
    "X_cudf_test = cudf.DataFrame.from_pandas(X_test)\n",
    "y_cudf = y_train.values\n",
    "y_cudf = y_cudf[:,0]\n",
    "y_cudf = cudf.Series(y_cudf)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "# run the cuml linear regression model to fit the training dataset \n",
    "cuols = cuLinearRegression(fit_intercept=True,\n",
    "                  normalize=True,\n",
    "                  algorithm='eig')\n",
    "cuols.fit(X_cudf, y_cudf)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "# calculate the mean squared error of the testing dataset using the cuml linear regression model\n",
    "cu_predict = cuols.predict(X_cudf_test).to_array()\n",
    "error_cu = mean_squared_error(y_test,cu_predict)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# print the mean squared error of the sklearn and cuml model to compare the two\n",
    "print(\"SKL MSE(y):\")\n",
    "print(error_sk)\n",
    "print(\"CUML MSE(y):\")\n",
    "print(error_cu)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}