{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Stochastic Gradient Descent (SGD)\n", "SGD is an incremental gradient descent algorithm that updates its weights iteratively in an effort to reach a local minimum of the loss function. The cuML implementation accepts array-like objects, either on the host as NumPy arrays or on the device (as Numba arrays or other `__cuda_array_interface__`-compliant objects), as well as cuDF DataFrames. To convert your dataset into a cuDF DataFrame, please refer to the documentation at https://rapidsai.github.io/projects/cudf/en/latest/. The SGD algorithm implemented in cuML accepts the following parameters:\n", "\n", "1. loss: 'hinge', 'log', 'squared_loss' (default = 'squared_loss')\n", "2. penalty: 'none', 'l1', 'l2', 'elasticnet' (default = 'none')\n", "3. alpha: float (default = 0.0001)\n", "4. fit_intercept: boolean (default = True)\n", "5. epochs: int (default = 1000)\n", "6. tol: float (default = 1e-3)\n", "7. shuffle: boolean (default = True)\n", "8. eta0: float (default = 0.0)\n", "9. power_t: float (default = 0.5)\n", "10. learning_rate: 'optimal', 'constant', 'invscaling', 'adaptive' (default = 'constant')\n", "11. n_iter_no_change: int (default = 5)\n", "\n", "For additional information on the SGD model, please refer to the documentation at https://rapidsai.github.io/projects/cuml/en/latest/index.html\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import cudf\n", "import os\n", "from cuml.solvers import SGD as cumlSGD\n", "from sklearn.linear_model import SGDRegressor" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Helper Functions" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# use the mortgage dataset if it is present; otherwise create a random dataset for SGD\n", "import gzip\n", "# change the path of the mortgage dataset if you have saved it in a different directory\n", "def load_data(nrows, ncols, cached = 'data/mortgage.npy.gz'):\n", "    if os.path.exists(cached):\n", "        print('use mortgage data')\n", "\n", "        with gzip.open(cached) as f:\n", "            X = np.load(f)\n", "        # column index 4 is 'adj_remaining_months_to_maturity', which is used as the label;\n", "        # extract it before dropping that column from the features\n", "        y = X[:,4:5]\n", "        X = X[:,[i for i in range(X.shape[1]) if i!=4]]\n", "        # sample nrows rows at random (with replacement)\n", "        rindices = np.random.randint(0,X.shape[0]-1,nrows)\n", "        X = X[rindices,:ncols]\n", "        y = y[rindices]\n", "\n", "    else:\n", "        # create a random dataset\n", "        print('use random data')\n", "        X = np.random.rand(nrows,ncols)\n", "        y = np.random.randint(0,10,size=(nrows,1))\n", "    # use the first 80% of the rows for training and the rest for testing\n", "    train_rows = int(nrows*0.8)\n", "    df_X_train = pd.DataFrame({'fea%d'%i:X[0:train_rows,i] for i in range(X.shape[1])})\n", "    df_X_test = pd.DataFrame({'fea%d'%i:X[train_rows:,i] for i in range(X.shape[1])})\n", "    df_y_train = pd.DataFrame({'fea%d'%i:y[0:train_rows,i] for i in range(y.shape[1])})\n", "    df_y_test = pd.DataFrame({'fea%d'%i:y[train_rows:,i] for i in range(y.shape[1])})\n", "    return df_X_train, df_X_test, df_y_train, df_y_test\n" ] },
{ "cell_type": "code", 
"execution_count": null, "metadata": {}, "outputs": [], "source": [ "# this function checks whether the results obtained from two different methods (sklearn and cuml) agree\n", "from sklearn.metrics import mean_squared_error\n", "def array_equal(a,b,threshold=2e-3,with_sign=True):\n", "    a = to_nparray(a).ravel()\n", "    b = to_nparray(b).ravel()\n", "    if not with_sign:\n", "        a,b = np.abs(a),np.abs(b)\n", "    error = mean_squared_error(a,b)\n", "    res = error