{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# pysgd\n", "\n", "The `pysgd` package is structured as a general gradient descent algorithm that accepts data, an objective function, a gradient descent adaptation and hyperparameters as its arguments. Below is the file structure of the package:\n", "\n", "```\n", "pysgd/\n", "|--__init__.py\n", "|--adaptations/\n", "| |--__init__.py\n", "| |--adagrad.py\n", "| |--adam.py\n", "| |--constant.py\n", "|--objectives/\n", "| |--__init__.py\n", "| |--linear.py\n", "| |--logistic.py\n", "| |--stab_tang.py\n", "|--tests/\n", "```\n", "The intention of this package is to present reasonably efficient, working algorithms that are easy to understand.\n", "\n", "The package is structured to make it easy to add objective functions and gradient descent adaptations: follow the basic form of the existing ones and add them to their respective folders.\n", "\n", "### Gradient Descent\n", "\n", "Gradient descent is a method for minimizing an objective function. In machine learning applications the objective function to be minimized is the error, $J$, (or cost) of a predictive model. A predictive model consists of parameters, $\\theta$, that are applied to inputs, $X$, (also called training samples, features, observations or independent variables) in order to estimate an output, $\\hat{y}$ (also called a label or dependent variable). Gradient descent attempts to determine the parameters that, when applied to a set of inputs, result in the lowest total error (the difference between the actual outcomes and those predicted by the model). 
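As a concrete illustration, the procedure can be sketched in a few lines of NumPy. This is a minimal sketch, not the package's `sgd` implementation; the linear model and squared-error cost used here are assumptions chosen only to make the demo self-contained.

```python
import numpy as np

def gradient_descent(X, y, theta0, alpha=0.1, iterations=100):
    """Minimal gradient descent sketch (illustrative only, not pysgd's sgd).

    Assumes a linear model h(X, theta) = X @ theta and a cost of
    0.5 * mean squared error, whose gradient is X.T @ (X @ theta - y) / m.
    """
    theta = theta0.astype(float)
    for _ in range(iterations):
        error = X @ theta - y            # h(X, theta) - y for every sample
        grad = X.T @ error / len(y)      # gradient of the cost w.r.t. theta
        theta -= alpha * grad            # step against the gradient by alpha
    return theta

# Toy data drawn from y = 2x, so the fitted parameter should approach 2.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = gradient_descent(X, y, np.zeros(1))
```

With the toy data above, repeated updates shrink the error each pass and `theta` converges toward the generating parameter.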
Below is the basic predictive formula.\n", "\n", "$$H(X,\\theta)=\\hat{y}$$\n", "\n", "Below is an illustrative formula for determining the cost of a model.\n", "\n", "$$J(\\theta) = \\sum_{i=1}^m\\left|h(x_i,\\theta) - y_i\\right|$$\n", "\n", "There are different formulas for computing cost depending on the application, but the formula above expresses the essence: predict actual outcomes as closely as possible.\n", "\n", "In order to minimize $J$ with respect to $\\theta$, the algorithm starts with an arbitrary value of $\\theta$, determines the direction of steepest increase in cost (called the gradient), updates $\\theta$ by a small step (scaled by the learning rate, $\\alpha$) in the opposite direction, and then repeats until the cost $J$ has been minimized.\n", "\n", "$$\\theta_j := \\theta_j - \\alpha\\nabla_{\\theta_j}J(\\theta)$$\n", "\n", "or\n", "\n", "$$\\theta_j := \\theta_j - \\alpha\\frac{\\partial}{\\partial\\theta_j}J(\\theta)$$\n", "\n", "### API\n", "\n", "The package has one main function, `sgd`, that returns a $j \\times (n + 2)$ array, where $j$ is the number of iterations and $n$ is the number of features in the data. $\\theta_j$ occupies the first $n + 1$ columns and the cost $J_j$ the last column.\n", "\n", "|Argument |Definition |\n", "|------------------|----------------------------------------------------------------------------------------------|\n", "|`theta0` |Starting value of $\\theta$ ($\\theta_0$) in the form of a $1 \\times (n + 1)$ array. |\n", "|`obj='stab_tang'` |Objective function to be minimized in the form of a string with a value of `stab_tang`, `linear` or `logistic`. `stab_tang` is for the [Styblinski-Tang function](https://en.wikipedia.org/wiki/Test_functions_for_optimization), included for testing and illustrative purposes. |\n", "|`adapt='constant'`|Gradient descent adaptation in the form of a string with a value of `constant`, `adagrad` or `adam`.