{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Ridge Regression Demo\n", "Ridge extends LinearRegression by providing L2 regularization on the coefficients when predicting response y with a linear combination of the predictors in X. It can reduce the variance of the predictors, and improves the conditioning of the problem.\n", "\n", "The Ridge Regression function implemented in the cuml library allows the user to change the fit_intercept, normalize, solver and alpha parameters. Here is a brief on RAPIDS' Ridge Regression's parameters:\n", "1. alpha:float or double. Regularization strength - must be a positive float. Larger values specify stronger regularization. Array input will be supported later.\n", "1. solver:‘eig’ or ‘svd’ or ‘cd’ (default = ‘eig’). Eig uses a eigendecomposition of the covariance matrix, and is much faster. SVD is slower, but is guaranteed to be stable. CD or Coordinate Descent is very fast and is suitable for large problems.\n", "1. fit_intercept:boolean (default = True). If True, Ridge tries to correct for the global mean of y. If False, the model expects that you have centered the data.\n", "1. normalize:boolean (default = False). If True, the predictors in X will be normalized by dividing by it’s L2 norm. If False, no scaling will be done.\n", "\n", "The methods that can be used with the Ridge Regression are:\n", "1. fit: Fit the model with X and y.\n", "1. get_params: Sklearn style return parameter state\n", "1. predict: Predicts the y for X.\n", "1. set_params: Sklearn style set parameter state to dictionary of params.\n", "\n", "The model can take array-like objects, either in host as NumPy arrays or in device (as Numba or _cuda_array_interface_compliant), as well as cuDF DataFrames. In order to convert your dataset to cudf format please read the cudf documentation on https://rapidsai.github.io/projects/cudf/en/latest/. It is important to understand that the 'svd' solver will run slower than the 'eig' solver however, the 'svd' solver is more stable and robust. Therefore, we would recomend that you use the 'eig' solver when a slight error is acceptable. 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import cudf\n", "import os\n", "from cuml import Ridge as cuRidge\n", "from sklearn.linear_model import Ridge as skRidge\n", "from sklearn.datasets import make_regression\n", "from sklearn.metrics import mean_squared_error" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Helper Functions" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# check if the mortgage dataset is present and extract the data from it; otherwise create a random dataset for regression\n", "import gzip\n", "# change the path of the mortgage dataset if you have saved it in a different directory\n", "def load_data(nrows, ncols, cached = 'data/mortgage.npy.gz'):\n", "    train_rows = int(nrows*0.8)\n", "    if os.path.exists(cached):\n", "        print('use mortgage data')\n", "\n", "        with gzip.open(cached) as f:\n", "            X = np.load(f)\n", "        # the 4th column is 'adj_remaining_months_to_maturity' and is used\n", "        # as the label; extract it before dropping the column from X\n", "        y = X[:,4:5]\n", "        X = X[:,[i for i in range(X.shape[1]) if i!=4]]\n", "        rindices = np.random.randint(0,X.shape[0]-1,nrows)\n", "        X = X[rindices,:ncols]\n", "        y = y[rindices]\n", "        df_y_train = pd.DataFrame({'fea%d'%i:y[0:train_rows,i] for i in range(y.shape[1])})\n", "        df_y_test = pd.DataFrame({'fea%d'%i:y[train_rows:,i] for i in range(y.shape[1])})\n", "    else:\n", "        print('use random data')\n", "        # create a random regression dataset\n", "        X,y = make_regression(n_samples=nrows,n_features=ncols,n_informative=ncols, random_state=0)\n", "        df_y_train = pd.DataFrame({'fea0':y[0:train_rows,]})\n", "        df_y_test = pd.DataFrame({'fea0':y[train_rows:,]})\n", "\n", "    df_X_train = pd.DataFrame({'fea%d'%i:X[0:train_rows,i] for i in range(X.shape[1])})\n", "    df_X_test = pd.DataFrame({'fea%d'%i:X[train_rows:,i] for i in range(X.shape[1])})\n", "\n", "    return df_X_train, df_X_test, df_y_train, df_y_test" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Run tests" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "# nrows = number of samples\n", "# ncols = number of features of each sample\n", "\n", "nrows = 2**20\n", "ncols = 399\n", "\n", "# split the dataset into training and testing sets in an 80:20 ratio\n", "X_train, X_test, y_train, y_test = load_data(nrows,ncols)\n", "print('training data',X_train.shape)\n", "print('training label',y_train.shape)\n", "print('testing data',X_test.shape)\n", "print('testing label',y_test.shape)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "# fit the scikit-learn ridge regression model on the training dataset\n", "skridge = skRidge(fit_intercept=False,\n", "                  normalize=True, alpha=0.1)\n", "skridge.fit(X_train, y_train)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "# calculate the mean squared error of the scikit-learn ridge regression model on the testing dataset\n", "sk_predict = skridge.predict(X_test)\n", "error_sk = mean_squared_error(y_test,sk_predict)" ] },
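{ "cell_type": "markdown", "metadata": {}, "source": [ "As noted in the introduction, cuML's Ridge also accepts host NumPy arrays (and device arrays that are `__cuda_array_interface__` compliant) directly, so the cuDF conversion below is one option rather than a requirement. The next cell is a minimal sketch of the NumPy path, assuming the X_train and y_train frames created above; the benchmark itself proceeds with the cuDF conversion." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# sketch: fit cuML Ridge directly on host NumPy arrays, skipping the\n", "# cuDF conversion; assumes X_train and y_train from the cells above\n", "X_np = X_train.values.astype('float32')\n", "y_np = y_train['fea0'].values.astype('float32')\n", "curidge_np = cuRidge(fit_intercept=False, normalize=True,\n", "                     solver='eig', alpha=0.1)\n", "curidge_np.fit(X_np, y_np)" ] },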
{ "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "%%time\n", "# convert the pandas dataframes to cuDF format\n", "X_cudf = cudf.DataFrame.from_pandas(X_train)\n", "X_cudf_test = cudf.DataFrame.from_pandas(X_test)\n", "y_cudf = y_train.values\n", "y_cudf = y_cudf[:,0]\n", "y_cudf = cudf.Series(y_cudf)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "%%time\n", "# fit the cuML ridge regression model on the training dataset;\n", "# 'eig' is the faster solver, but 'svd' is more accurate\n", "curidge = cuRidge(fit_intercept=False,\n", "                  normalize=True,\n", "                  solver='svd', alpha=0.1)\n", "curidge.fit(X_cudf, y_cudf)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "# calculate the mean squared error of the cuML ridge regression model on the testing dataset\n", "cu_predict = curidge.predict(X_cudf_test).to_array()\n", "error_cu = mean_squared_error(y_test,cu_predict)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# print the mean squared errors of the scikit-learn and cuML models to compare them\n", "print(\"SKL MSE(y):\")\n", "print(error_sk)\n", "print(\"CUML MSE(y):\")\n", "print(error_cu)\n" ] }
], "metadata": { "celltoolbar": "Raw Cell Format", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 2 }