{ "cells": [ { "cell_type": "markdown", "id": "5e92ec80", "metadata": {}, "source": [ "# Deep Learning methods and Neural Networks\n", "\n", "\n", "**Universal approximation theorem**: A continuous function can be approximated to an arbitrary accuracy if one has at least one hidden layer with finite number of neurons in neural network. The non-linear/activation function can be sigmoid (fermi) [Cybenko in 1989], or just general nonpolynomial bounded activation function [Leshno in 1993 and Pinkus in 1999].\n", "\n", "The multilayer architecture of NN gives neural networks the potential of being universal approximators.\n", "\n", "Given a function $y=F(x)$ with $x\\in [0,1]^d$ and $f(z)$ is a non-linear bounded activation function and $\\epsilon>0$ is chosen accuracy, there is a one layer NN with $w\\in \\mathbb{R}^{m\\times n}$ and $b\\in \\mathbb{R}^n$ and $x\\in\\mathbb{R}^m$ and $z_j=\\sum w_{ij} x_i + b_j$ so that $|\\sum_i w^{(2)}_{ij} f(z_i)+b_j-F(x_j)|<\\epsilon$." ] }, { "cell_type": "markdown", "id": "1d5a5b90", "metadata": {}, "source": [ "Conceptually, it is helpful to divide neural networks into four\n", "categories:\n", "1. general purpose neural networks for supervised learning,\n", "\n", "2. neural networks designed specifically for image processing, the most prominent example of this class being Convolutional Neural Networks (CNNs),\n", "\n", "3. neural networks for sequential data such as Recurrent Neural Networks (RNNs), and\n", "\n", "4. neural networks for unsupervised learning such as Deep Boltzmann Machines.\n", "\n", "In natural science, DNNs and CNNs have already found numerous\n", "applications. In statistical physics, they have been applied to detect\n", "phase transitions in 2D Ising and Potts models, lattice gauge\n", "theories, and different phases of polymers, or solving the\n", "Navier-Stokes equation in weather forecasting. Deep learning has also\n", "found interesting applications in quantum physics. Various quantum\n", "phase transitions can be detected and studied using DNNs and CNNs,\n", "topological phases, and even non-equilibrium many-body\n", "localization. Representing quantum states as DNNs quantum state\n", "tomography are among some of the impressive achievements to reveal the\n", "potential of DNNs to facilitate the study of quantum systems.\n", "\n", "\n", "

Figure: Sketch of the neural network: the input layer is on the left ($x_i=a_i^{(0)}$) and the output layer ($a_i^{(L)}$) on the right. The latter is compared through the cost function with the target $t_i$. The layers between $0$ and $L$ are called hidden layers; they increase the flexibility of the network. $f(z)$ is the nonlinear activation function.

\n", "\n", "An artificial neural network (ANN), is a computational model that\n", "consists of layers of connected neurons (sometimes called nodes or units). \n", "\n", "The equations are sketched in the figure, and in matrix form read:\n", "\\begin{eqnarray}\n", "\\textbf{z}^l &=& (\\textbf{a}^{l-1}) \\textbf{w}^l + \\textbf{b}^l\\\\\n", "a_i^l &=& f(z_i^l)\n", "\\end{eqnarray}\n", "Here $l$ in $a_i^l,z_i^l$ stands for the layer $l$. Using input parameters $a^{l-1}$ in layer $l$ we get output $z^l$, which are than passed through a non-linear activation function $f$ to obtain $a^l$. This in turn allowes one to calculate the next layer. Note that we used many yet to be determined weights $w$ and $b$, which are determined so that they best fit the known data, i.e., on input $x_i$ give as good approximation to target $t_i$ as possible.\n", "\n", "\n", "We start with input $x_i$, which defines the first layer $a^{0}$, and we end with output layer $a^L$, which delivers the output, and is needed to evaluate the cost function:\n", "\\begin{eqnarray}\n", "a_i^{0}\\equiv x_i\\\\\n", "C(\\{w,b\\})&=&\\frac{1}{2}\\sum_i (a_i^L-t_i)^2\n", "\\end{eqnarray}\n", "The target $t$ is the known data we train on, which was called $y$ in the linear regression. To compare with linear regression $\\widetilde{y}$ is the output layer $a_i^L$. \n", "\n", "\n", " \n", "\n", "NN is supposed to mimic a biological nervous system by letting each\n", "neuron interact with other neurons by sending signals in the form of\n", "mathematical functions between layers. A wide variety of different\n", "ANNs have been developed, but most of them consist of an input layer,\n", "an output layer and eventual layers in-between, called *hidden\n", "layers*. All layers can contain an arbitrary number of nodes, and each\n", "connection between two nodes is associated with a weight variable $w_{ij}$ and $b_i$.\n", "\n", "\n", "Withouth the nonlinear activation function NN would be equivalent to the linear regression (convince yourself). The added nonlinearity through activation function $f$ is thus crucial for the success of NN. Many choices of activation functions are in use. We mention just a few:\n", "* sigmoid (fermi) $f(z)=1/(e^{-z}+1)$\n", "* rectified linear unit (Relu) $f(z)=max(0,z)$\n", "* $\\tanh(z)$, which is related to fermi by $\\tanh(z/2)=f(-z)-f(z)$\n", "* Exponential linear unit (Elu): $f(z)= \\textrm{if}(z<0) ( \\alpha(e^{t z}-1 )\\textrm{else}(z)$ with $z\\ll 1$\n", "* Leaky Relu : $f(z)=\\textrm{if}(z<0) (\\alpha z )\\textrm{else}(z)$ with $z\\ll 1$" ] }, { "cell_type": "markdown", "id": "5fdb2ee8", "metadata": {}, "source": [ "### Simple example OR and XOR gate\n", "\n", "As we will show, the OR gate can be easily fit with linear regression, however, XOR gate can not be, and requires at least one hidden layer. \n", "\n", "

Figure: The OR and XOR gates, with a line that can (OR) or cannot (XOR) separate the two output values.

\n", "\n", "The OR gate \n", "\\begin{equation}\n", "\\begin{array}{c|c|c}\n", "x_1 & x_2 & t\\\\\n", "\\hline\n", "0 & 0 & 0\\\\\n", "0 & 1 & 1\\\\\n", "1 & 0 & 1\\\\\n", "1 & 1 & 1\n", "\\end{array}\n", "\\end{equation}\n", "and XOR gate\n", "\\begin{equation}\n", "\\begin{array}{c|c|c}\n", "x_1 & x_2 & t\\\\\n", "\\hline\n", "0 & 0 & 0\\\\\n", "0 & 1 & 1\\\\\n", "1 & 0 & 1\\\\\n", "1 & 1 & 0\n", "\\end{array}\n", "\\end{equation}\n", "\n", "Let's try linear regression. The design matrix should contain a constant and linear term, i.e., $X^T=[1,x_1,x_2]$, which is \n", "\n", "\\begin{equation}\n", "X=\\begin{bmatrix} \n", "1& 0 & 0 \\\\\n", "1& 0 & 1 \\\\\n", "1& 1 & 0 \\\\\n", "1& 1 & 1\n", "\\end{bmatrix}\n", "\\end{equation}\n", "and linear regression gives $\\widetilde{y} = X \\beta = X (X^T X)^{-1} X^T y$. \n", "\n", "It is easy to check that for $y^T_{OR}=[0,1,1,1]$ we get \n", "$X(X^T X)^{-1} X^T y_{OR} =[1/4,3/4,3/4,5/4]$ while for $y^T_{XOR}=[0,1,1,0]$ we get $X(X^T X)^{-1} X^T=[1/2,1/2,1/2,1/2]$. If we assume that $\\widetilde{y}_i<1/2$ means 0 and $\\widetilde{y}_i>1/2$ is 1, we reproduce OR get, but clearly fail at XOR.\n", "\n", "As we will show below, one hidden layer can easily give XOR gate. A small technicality first: In the linear regression we wanted to have a constant allowed in the fit, hence our $X^T$ started with unity (to allow $\\beta_0$ as constat). In ML we always add constant explicitely as an additional degree of freedom (see equations above), hence $X^T$ doe not need to have unity, and it will just be $X^T=[x_1,x_2]$. More precisely \n", "\\begin{equation}\n", "X=\\begin{bmatrix} \n", "0 & 0 \\\\\n", "0 & 1 \\\\\n", "1 & 0 \\\\\n", "1 & 1\n", "\\end{bmatrix}\n", "\\end{equation}\n", "For the activation function $f(z)$ we will choose **Relu**: $f(z)=max(z,0)$.\n", "\n", "We will choose two neurons in the hidden layer, hence $w_h$ is $2x2$ matrix and $b_h$ is two component vector, in terms of which $\\textbf{z}^h=\\textbf{X}\\textbf{w}^h+\\textbf{b}^h$, $\\textbf{a}^h=f(\\textbf{z}^h)$, and the output $\\textbf{y}\\equiv \\textbf{a}^o=\\textbf{a}^{(h)}\\textbf{w}^o+\\textbf{b}^o$\n", "\n", "The minimization would give the following weights\n", "\\begin{eqnarray}\n", "&& \\textbf{w}^h=\\begin{bmatrix} \n", "1 & 1 \\\\\n", "1 & 1 \n", "\\end{bmatrix}\\\\\n", "&& \\textbf{b}^h=\\begin{bmatrix} \n", "0 & -1 \n", "\\end{bmatrix}\\\\\n", "&& \\textbf{w}^o=\\begin{bmatrix} \n", "1 \\\\\n", "-2 \n", "\\end{bmatrix}\\\\\n", "&& \\textbf{b}^o=0\n", "\\end{eqnarray}\n", "\n", "Which means that \n", "\\begin{eqnarray}\n", "\\textbf{z}^h=\n", "\\begin{bmatrix} \n", "0 & -1 \\\\\n", "1 & 0\\\\\n", "1 & 0\\\\\n", "2 & 1\n", "\\end{bmatrix}\\\\\n", "\\textbf{a}^h=\n", "\\begin{bmatrix} \n", "0 & 0 \\\\\n", "1 & 0\\\\\n", "1 & 0\\\\\n", "2 & 1\n", "\\end{bmatrix}\n", "\\end{eqnarray}\n", "and finally\n", "\\begin{eqnarray}\n", "\\textbf{a}^h \\textbf{w}^o=\n", "\\begin{bmatrix} \n", "0 \\\\ 1 \\\\ 1 \\\\ 0\n", "\\end{bmatrix}\n", "\\end{eqnarray}\n", "which is identical to target $t$ for XOR gate, and concludes our example." 
] }, { "cell_type": "markdown", "id": "af4ac7f7", "metadata": {}, "source": [ "To solve NN problem we usually distinguish between the following steps: \n", "0) Randomly initialize weight and biases\n", "\n", "1) The feed forward stage, which calculates all $\\textbf{a}^l$ and $z_l$, including the output $\\textbf{a}^L$ to be used in the cost function, and compares with the target $\\textbf{t}$.\n", "\n", "2) Back propagation stage follows in which one calculates the gradients of weights $\\textbf{w}$ and biases $\\textbf{b}$ ($\\partial C/\\partial \\textbf{w}$ and $\\partial C/\\partial \\textbf{b}$). Using minimization algorithm (including stochastic approach), we move towards a minimum, which is hopefully close or equal to global minimum.\n", "\n", "3) We repeat the two steps (1) and (2) until the error of the cost function is acceptable and model is train well enough." ] }, { "cell_type": "markdown", "id": "0a4fde56", "metadata": {}, "source": [ "## Back propagation and automatic differentiation\n", "\n", "It is convenient to differentiate from the end of the NN towards the start, hence we call this back propagation. We start with differentiation the cost function \n", "$$C(\\{w,b\\})=\\frac{1}{2}\\sum_i (a_i^L-t_i)^2,$$ which gives\n", "\\begin{equation}\n", "\\frac{\\partial C}{\\partial w_{jk}^L}=\\sum_i (a_i^L-t_i) \\frac{\\partial a_i^L}{w_{jk}^L}\\\\\n", "\\frac{\\partial C}{\\partial b_{j}^L}=\\sum_i (a_i^L-t_i) \\frac{\\partial a_i^L}{b_{j}^L}\n", "\\end{equation}\n", "because $a_i^L=f(z_i^L)$, we have\n", "\\begin{equation}\n", "\\frac{\\partial C}{\\partial w_{kj}^L}=\\sum_i (a_i^L-t_i) f'(z_i^L) \\frac{\\partial z_i^L}{w_{kj}^L}\n", "\\\\\n", "\\frac{\\partial C}{\\partial b_{j}^L}=\\sum_i (a_i^L-t_i) f'(z_i^L) \\frac{\\partial z_i^L}{b_{j}^L}\n", "\\end{equation}\n", "finally $z_i^L = \\sum_j a_j^{L-1} w_{ji}+b_i$, hence\n", "\\begin{eqnarray}\n", "&&\\frac{\\partial z_i^L}{w_{kj}^L}=a_k^{L-1} \\delta_{ij}\\\\\n", "&&\\frac{\\partial z_i^L}{b_{j}^L} = \\delta_{ij}\n", "\\end{eqnarray}\n", "which finally gives\n", "\\begin{eqnarray}\n", "&&\\frac{\\partial C}{\\partial w_{kj}^L}=(a_j^L-t_j) f'(z_j^L) a_k^{L-1}\n", "\\\\\n", "&&\\frac{\\partial C}{\\partial b_{j}^L}=(a_j^L-t_j) f'(z_j^L)\n", "\\end{eqnarray}" ] }, { "cell_type": "markdown", "id": "107bed54", "metadata": {}, "source": [ "Next we define the quatity $$\\delta_j^L\\equiv (a_j^L-t_j) f'(z_j^L)$$ in terms of which we can express\n", "\\begin{eqnarray}\n", "&&\\frac{\\partial C}{\\partial w_{kj}^L}=\\delta_j^L a_k^{L-1}\n", "\\\\\n", "&&\\frac{\\partial C}{\\partial b_{j}^L}=\\delta_j^L\n", "\\end{eqnarray}\n", "Note that $\\delta_j^L$ can also be viewed as $$\\delta_j^L=\\frac{\\partial C}{\\partial a_j^L}\\frac{\\partial a_j^L}{\\partial z_j^L}=\\frac{\\partial C}{\\partial z_j^L}$$\n", "\n", "\n", "We then proceed to previous layer, and obtain\n", "\\begin{eqnarray}\n", "\\frac{\\partial C}{\\partial w_{kj}^{L-1}}=\\sum_{i,n} \\frac{\\partial C}{\\partial a_n^{L}}\\frac{\\partial a_n^L}{\\partial z_n^L}\\frac{\\partial z_n^L}{\\partial a_i^{L-1}}\\frac{\\partial a_i^{L-1}}{\\partial z_i^{L-1}}\n", "\\frac{\\partial z_i^{L-1}}{\\partial w_{kj}^{L-1}}\n", "\\end{eqnarray}\n", "We then note that $$\\frac{\\partial C}{\\partial a_n^{L}}\\frac{\\partial a_n^L}{\\partial z_n^L}=\\delta^L_n$$\n", "and because\n", "$z_n^L=\\sum_i a_i^{L-1} w^L_{in} + b_n^L$ we have\n", "$$\\frac{\\partial z_n^L}{\\partial a_i^{L-1}}=w_{in}^L$$\n", "furthermore\n", "$$\\frac{\\partial a_i^{L-1}}{\\partial z_i^{L-1}}=f'(z_i^{L-1})$$\n", 
"and further\n", "$z_i^{L-1}=\\sum_k a_k^{L-2} w^{L-1}_{ki} + b_i^{L-1}$ so that\n", "$$\\frac{\\partial z_i^{L-1}}{\\partial w_{kj}^{L-1}}=a_K^{L-2}\\delta_{ij}$$\n", "so that collecting all of that leads to\n", "$$\n", "\\frac{\\partial C}{\\partial w_{kj}^{L-1}}=\\sum_{i,n}\\delta_n^L w_{in}^L f'(z_i^{L-1})\\delta_{ij}a_k^{L-2}=\n", "\\sum_n\\delta_n^L w_{jn}^L f'(z_j^{L-1})a_k^{L-2}\n", "$$\n", "Now we will require that \n", "\\begin{eqnarray}\n", "\\frac{\\partial C}{\\partial w_{kj}^{l}}=\\delta_j^l a_k^{l-1}\n", "\\end{eqnarray}\n", "which gives us the following expression\n", "\\begin{eqnarray}\n", "\\delta_j^{L-1}=\\sum_n\\delta_n^L w_{jn}^L f'(z_j^{L-1})\n", "\\end{eqnarray}\n", "We can verify that this equation connects every layer with the previous layer, i.e., this equation is valid for every $l$, not just $L-1$. \n", "Similarly we can show that the derivative with respect to $b$ has the same form, namely, \n", "\\begin{eqnarray}\n", "\\frac{\\partial C}{\\partial b_{j}^{l}}=\\delta_j^l \n", "\\end{eqnarray}\n", "\n", "In conclusion, we just showed that the automatic differentiation in back propagation leads to the following set of equations\n", "\n", "\\begin{eqnarray}\n", "\\frac{\\partial C}{\\partial w_{kj}^{l}}&=&\\delta_j^l a_k^{l-1}\\\\\n", "\\frac{\\partial C}{\\partial b_{j}^{l}}&=&\\delta_j^l\n", "\\end{eqnarray}\n", "in which $\\delta_j^l$ can be all obtained by the following recursion relation\n", "\\begin{eqnarray}\n", "\\delta_j^{l}&=&\\sum_n\\delta_n^{l+1} w_{jn}^{l+1} f'(z_j^{l})\n", "\\end{eqnarray}\n", "and the starting condition\n", "\\begin{eqnarray}\n", "\\delta_j^{L}&=&(a_j^L-t_j) f'(z_j^L)\n", "\\end{eqnarray}" ] }, { "cell_type": "markdown", "id": "9f4a3a1d", "metadata": {}, "source": [ "### Final algorithm\n", "1) Initialize all variables to be minimized $\\{w,b\\}$ and perform the forward pass to compute all $a^l$ parameters.\n", "2) With current values of $\\{w,b\\}$ and $a^l$ we compute all gradients $\\frac{\\partial C}{\\partial w_{kj}^{l}}$ and $\\frac{\\partial C}{\\partial b_{j}^{l}}$ and using one of the available minimization routines we take a step towards more optimal variables $\\{w,b\\}$. Usually one uses some type of gradient descent method, as discussed previously\n", "$$\n", "w^{(j+1)} = w^{(j)} - \\gamma_j \\frac{\\partial C}{\\partial w^{(j)}}\n", "$$\n", "here $j$ stands for the iteration.\n", "3) We repeat (1) and (2) until we find local minima\n", "4) We change hiperparameter or change initial condistions to try finding different local minima." ] }, { "cell_type": "markdown", "id": "1fbd2e56", "metadata": {}, "source": [ "### Example code from MNIST dataset on handwritten numbers\n", "\n", "We will develop NN code to recognize the handwritten digits. 
The data is stored in MNIST dataset, which is included in sklearn.\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "345d96ca", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "inputs = (n_inputs, pixel_width, pixel_height) = (1797, 8, 8)\n", "labels = (n_inputs) = (1797,)\n" ] } ], "source": [ "from numpy import *\n", "import matplotlib.pyplot as plt\n", "from sklearn import datasets\n", "\n", "# ensure the same random numbers appear every time\n", "random.seed(0)\n", "\n", "# display images in notebook\n", "%matplotlib inline\n", "plt.rcParams['figure.figsize'] = (12,12)\n", "\n", "\n", "# download MNIST dataset\n", "digits = datasets.load_digits()\n", "\n", "# define inputs and labels\n", "inputs = digits.images # x_i\n", "labels = digits.target # t_i\n", "\n", "print('inputs = (n_inputs, pixel_width, pixel_height) =',inputs.shape)\n", "print('labels = (n_inputs) =',labels.shape)" ] }, { "cell_type": "markdown", "id": "33b16078", "metadata": {}, "source": [ "Here we reshape images so that we have **Design matrix** composed of 64 pixels. We also print a few examples of numbers." ] }, { "cell_type": "code", "execution_count": 2, "id": "adbeef35", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "X = (n_inputs, n_features) = (1797, 64)\n" ] }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAA7YAAADICAYAAADcOn20AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/bCgiHAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAQx0lEQVR4nO3dbWydZRkH8OuMjklYaEHMxla3ji46/KBdiItodB2oEUVXDNMYois6QgLCkKkhEGwHGDCaWIxODY510SUuJmSdZr4AdjUmxPBWEhaNjlBEzSYOOl+SbQwfPxhqN9Y68G7rdc7vl/QDZz3/c5+T6z7P899zOKtVVVUFAAAAJDVrphcAAAAA/wvFFgAAgNQUWwAAAFJTbAEAAEhNsQUAACA1xRYAAIDUFFsAAABSU2wBAABITbEFAAAgtYYptv39/VGr1eLhhx8ukler1eLTn/50kazxmb29va/6/nv37o2Pf/zjsWjRojjttNOivb09brjhhjhw4EC5RZJSvc9/b29v1Gq1CX++//3vF10r+dgD9kAjq/f5f+aZZ+LSSy+Nc889N04//fRobm6O5cuXx9e//vU4evRo0XWSj/lvHE0zvQDKePbZZ+Ntb3tbnHHGGXHbbbfFokWL4rHHHouenp4YHByMRx55JGbNapi/x6DBrFu3Lt73vve97PYrr7wynnzyyRP+GdQTe4BG9o9//CPOOOOMuOWWW2LRokVx5MiR2LVrV1x77bUxPDwc3/nOd2Z6iTBlzP9/KLZ1YmBgIA4cOBDbt2+Piy66KCIiVq1aFYcPH46bbropHn/88Vi+fPkMrxKmRmtra7S2th5z28jISOzZsycuv/zyaGlpmZmFwTSxB2hky5Yti61btx5z28UXXxx//vOfY+vWrfGNb3wj5syZM0Org6ll/v/DJbxxDh06FBs2bIiOjo5obm6Os846Ky644IIYGBiY8D7f/va34w1veEPMmTMn3vSmN53w41779u2Lq666KlpbW+PUU0+NJUuWxMaNG4t+PGD27NkREdHc3HzM7S+dzLzmNa8p9ljUp8zzfyL33HNPVFUV69atm9LHoX7YAzSyepv/iIjXve51MWvWrDjllFOm/LHIzfzXB1dsxzl8+HA899xz8dnPfjYWLlwYR44cifvvvz8+/OEPx5YtW+ITn/jEMb+/c+fOGBwcjFtvvTVOP/302LRpU3zsYx+LpqamuOyyyyLi3wO9YsWKmDVrVnzhC1+I9vb2ePDBB+P222+PkZGR2LJly6Rramtri4h//837ZLq6umLRokWxYcOG2LRpUyxevDgeffTRuPPOO+ODH/xgnHfeea/6daExZJ7/4/3zn/+M/v7+WLp0aaxcufIV3ZfGZQ/QyOph/quqihdffDH+9re/xc9+9rPo7++PDRs2RFOT010mZ/7rRNUgtmzZUkVE9dBDD530fY4ePVq98MIL1ac+9alq+fLlx/xZRFSnnXZatW/fvmN+f9myZdXSpUvHbrvqqququXPnVk8//fQx9//KV75SRUS1Z8+eYzJ7enqO+b329vaqvb39pNb7pz/9qbrggguqiBj7WbNmTXXo0KGTfcrUqUaY//F+/OMfVxFR3XHHHa/4vtQne4BG1ijzf8cdd4yd/9Rqtermm28+6ftSv8x/4/BR5OP84Ac/iHe84x0xd+7caGpqitmzZ8fmzZvj17/+9ct+96KLLop58+aN/fcpp5wSH/3oR2Pv3r3xhz/8ISIifvSjH8WqVatiwYIFcfTo0bGfiy++OCIihoaGJl3P3r17Y+/evf913c8//3ysXr06/vrXv8a2bdviF7/4RWzatCl++ctfxoc+9KGG+1Y0Xp2s83+8zZs3R1NTU3R3d7/i+9LY7AEaWfb57+7ujoceeih++tOfxuc///n48pe/HNdee+1J35/GZv7za6Br0//dvffeGx/5yEdizZo18bnPfS7mz58fTU1N8c1vfjPuueeel/3+/PnzJ7ztwIED0draGvv3748f/vCHY/8P7PH+8pe/FFn7l770pRgeHo6nn346zjnnnIiIeOc73xnLli2LCy+8MLZt2xZr164t8ljUp8zzf3zmzp074wMf+MAJ1wgTsQdoZPUw//Pnzx9bw3vf+94488wz48Ybb4xPfvKTvkCTSZn/+qDYjvO9730vlixZEtu3b49arTZ2
++HDh0/4+/v27Zvwtte+9rUREXH22WfHm9/85vjiF794wowFCxb8r8uOiIjh4eFYuHDhWKl9yVvf+taIiHjiiSeKPA71K/P8j/fd7343jhw54gtzeMXsARpZvcz/eCtWrIiIiN/+9rcNc2LPq2P+64NiO06tVotTTz31mIHet2/fhN+I9sADD8T+/fvHPorw4osvxvbt26O9vX3sn1245JJLYteuXdHe3h5nnnnmlK19wYIF8cADD8Qf//jHWLhw4djtDz74YETEy/4ZCDhe5vkfb/PmzbFgwYKxj/rAybIHaGT1Mv/jDQ4ORkTE0qVLp/2xycX814eGK7Y///nPT/jtYu9///vjkksuiXvvvTeuvvrquOyyy+KZZ56J2267Lc4555z43e9+97L7nH322XHhhRfGLbfcMvaNaL/5zW+O+brvW2+9Ne677754+9vfHtddd1288Y1vjEOHDsXIyEjs2rUrvvWtb01aOl8axv/2Gftrrrkmtm3bFu95z3vixhtvjNe//vXxxBNPxO233x7z5s2Lyy+//CRfIepZvc7/S371q1/Fnj174qabbmqor7fn5NkDNLJ6nf+enp7Yv39/vOtd74qFCxfG6Oho/OQnP4m777471qxZE+eff/5JvkLUM/PfAGb626umy0vfiDbRz1NPPVVVVVXdeeedVVtbWzVnzpzqvPPOq+6+++6qp6enOv6liojqmmuuqTZt2lS1t7dXs2fPrpYtW1Zt27btZY/97LPPVtddd121ZMmSavbs2dVZZ51VnX/++dXNN99c/f3vfz8m8/hvRFu8eHG1ePHik3qOjz76aHXppZdWra2t1Zw5c6pzzz23WrduXfX73//+Fb1W1J9GmP+qqqorr7yyqtVq1ZNPPnnS96Ex2AM0snqf/507d1bvfve7q3nz5lVNTU3V3LlzqxUrVlRf+9rXqhdeeOEVv17UF/PfOGpVVVWlyzIAAABMF//cDwAAAKkptgAAAKSm2AIAAJCaYgsAAEBqii0AAACpKbYAAACkptgCAACQWtNML6CU/v7+onm9vb1F81paWormRUT09fUVzevs7Cyax/TZvXt30bzS+2nHjh1F8yIiDh48WDRvcHCwaJ79NL0GBgaK5q1fv75o3lQove/b2tqK5jGxkZGRonmlzwdKHwNKv19HRDQ3NxfNGx4eLppnP01udHS0aN7/+x4o/XwjzOyJuGILAABAaootAAAAqSm2AAAApKbYAgAAkJpiCwAAQGqKLQAAAKkptgAAAKSm2AIAAJCaYgsAAEBqii0AAACpKbYAAACkptgCAACQmmILAABAaootAAAAqSm2AAAApKbYAgAAkJpiCwAAQGqKLQAAAKk1zdQD7969u2jeFVdcUTRv9erVRfNaWlqK5kVEdHV1Fc0bHR0tmsf0uf7664vmlZ6F7u7uonkREXfddVfRvKnYo0xsZGSkaF7p98MMduzYUTSv9PsIE/t/f623bt1aNG9wcLBoXkT5Y4BzoOlV+vUu/X5Y+phSen0REf39/UXzent7i+bNBFdsAQAASE2xBQAAIDXFFgAAgNQUWwAAAFJTbAEAAEhNsQUAACA1xRYAAIDUFFsAAABSU2wBAABITbEFAAAgNcUWAACA1BRbAAAAUlNsAQAASE2xBQAAIDXFFgAAgNQUWwAAAFJTbAEAAEhNsQUAACA1xRYAAIDUalVVVTPxwNdff33RvJGRkaJ5O3bsKJrX2dlZNC8ioqWlpWhe6efM9Ck9/6Vna2hoqGheRMTatWuL5o2OjhbNY3r19fUVzevo6Ciat2rVqqJ5ERErV64smrd79+6iefCS0ud8ERHDw8NF88w/U2kqekDp41Tp4+hMcMUWAACA1BRbAAAAUlNsAQAASE2xBQAAIDXFFgAAgNQUWwAAAFJTbAEAAEhNsQUAACA1xRYAAIDUFFsAAABSU2wBAABITbEFAAAgNcUWAACA1BRbAAAAUlNsAQAASE2xBQAAIDXFFgAAgNQUWwAAAFJTbAEAAEitaaYeuK2trWjeyMhI0bze3t6ieUNDQ0XzIiIee+yx4pnkNDo6WjSv9H7q6ekpmhcR0dLSUjSv9HMu/R7H5Lq7u4vmlT4GTIXSx5XSzznDa8j06OjoKJ7Z399fNK/0cbT0MYrJlT6Gd3V1Fc2bCn19fTO9hP87rtgCAACQmmILAABAaootAAAAqSm2AAAApKbYAgAAkJpiCwAAQGqKLQAAAKkptgAAAKSm2AIAAJCaYgsAAEBqii0AAACpKbYAAACkptgCAACQmmILAABAaootAAAAqSm2AAAApKbYAgAAkJpiCwAAQGqKLQAAAKnVqqqqZnoRJXR0dBTNe/zxx4vmrV27tmheRER/f3/xTKbHwMBA0byurq6ieY2op6enaF5vb2/RvHozPDxcNK+zs7No3sGDB4vmTYXSx5XSM9vW1lY0D8YrPV+lj6N9fX1F85hcIx5TtmzZUjSvu7u7aN5McMUWAACA1BRbAAAAUlNsAQAASE2xBQAAIDXFFgAAgNQUWwAAAFJTbAEAAEhNsQUAACA1xRYAAIDUFFsAAABSU2wBAABITbEFAAAgNcUWAACA1BRbAAAAUlNsAQAASE2xBQAAIDXFFgAAgNQUWwAAAFJTbAEAAEhNsQUAACC1WlVV1UwvooSOjo6ZXsKkWlpaimeWfs59fX1F85jY7t27i+bt2LGjaN7w8HDRvJGRkaJ5EeXXOBV7lImV3gOrVq0qmlfa6tWri2eW3veQSWdn50wvYVKl3+Pqzejo6EwvYVKlzwmmYl5Ln1tNxbnadHPFFgAAgNQUWwAAAFJTbAEAAEhNsQUAACA1xRYAAIDUFFsAAABSU2wBAABITbEFAAAgNcUWAACA1BRbAAAAUlNsAQAASE2xBQAAIDXFFgAAgNQUWwAAAFJTbAEAAEhNsQUAACA1xRYAAIDUFFsAAABSU2wBAABIrWmmF1BKS0tL0bzOzs6ieb29vUXzIso/59JrLL2+elJ6vg4ePFg0r7+/v2heV1dX0bwI85Vd6T2wfv36onl33XVX0bwrrriiaB6MNzAwUDRv8eLFRfOGh4eL5k1F5lScpzGxoaGhonk9PT1F8zZu3Fg0r7u7u2heRPnjyujoaNG8mThPc8UWAACA1BRbAAAAUlNsAQAASE2xBQAAIDXFFgAAgNQUWwAAAFJTbAEAAEhNsQUAACA1xRYAAIDUFFsAAABSU2wBAABITbEFAAAgNcUWAACA1BRbAAAAUlNsAQAASE2xBQAAIDXFFgAAgNQUWwAAAFJTbAEAAEitaaYXUMpnPvOZonldXV1F8zZu3Fg0LyJi9erVRfNaWlqK5jF9nn/++aJ5Bw8eLJrX3d1dNA+m2lve8paieaXfr2G8r371q0XzhoaGiuY1NzcXzYsof1xxnJpeK1euLJrX2dlZNK/0nhodHS2aFxGxfv36onn10ANcsQUAACA1xRYAAIDUFFsAAABSU2wBAABITbEFAAAgNcUWAACA1BRbAAAAUlNsAQAASE2xBQAAIDXFFgAAgNQUWwAAAFJTbAE
AAEhNsQUAACA1xRYAAIDUFFsAAABSU2wBAABITbEFAAAgNcUWAACA1BRbAAAAUqtVVVXN9CIAAADg1XLFFgAAgNQUWwAAAFJTbAEAAEhNsQUAACA1xRYAAIDUFFsAAABSU2wBAABITbEFAAAgNcUWAACA1P4F8SqxxH5w1EMAAAAASUVORK5CYII=", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# flatten the image\n", "# the value -1 means dimension is inferred from the remaining dimensions: 8x8 = 64\n", "n_inputs,nx,ny = inputs.shape\n", "inputs = inputs.reshape(n_inputs, nx*ny)\n", "print('X = (n_inputs, n_features) =', inputs.shape)\n", "\n", "# choose some random images to display\n", "random_indices = random.choice(range(n_inputs), size=5)\n", "\n", "for i,image in enumerate(digits.images[random_indices]):\n", " plt.subplot(1, 5, i+1)\n", " plt.axis('off')\n", " plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')\n", " plt.title(\"Label: %d\" % digits.target[random_indices[i]])\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "26dc57f9", "metadata": {}, "source": [ "First we split the data in 80% training and 20% testing data. Which data is training and testing should be choosen at random." ] }, { "cell_type": "code", "execution_count": 3, "id": "a58918e3", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of training images: 1437\n", "Number of test images: 360\n" ] } ], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "# one-liner from scikit-learn library\n", "train_size = 0.8\n", "X_train, X_test, Y_train, Y_test = train_test_split(inputs, labels, train_size=train_size,test_size=1-train_size)\n", "\n", "# equivalently in numpy\n", "def train_test_split_numpy(inputs, labels, train_size):\n", " n_inputs = len(inputs)\n", " inputs_shuffled = inputs.copy()\n", " labels_shuffled = labels.copy()\n", " random.shuffle(inputs_shuffled)\n", " random.shuffle(labels_shuffled)\n", " \n", " train_end = int(n_inputs*train_size)\n", " X_train, X_test = inputs_shuffled[:train_end], inputs_shuffled[train_end:]\n", " Y_train, Y_test = labels_shuffled[:train_end], labels_shuffled[train_end:]\n", " \n", " return X_train, X_test, Y_train, Y_test\n", "#X_train, X_test, Y_train, Y_test = train_test_split_numpy(inputs, labels, train_size, test_size)\n", "\n", "print(\"Number of training images: \" + str(len(X_train)))\n", "print(\"Number of test images: \" + str(len(X_test)))" ] }, { "cell_type": "markdown", "id": "e5015aa8", "metadata": {}, "source": [ "The input and output data have dimensions\n", "\\begin{eqnarray}\n", "&& X\\in [n\\times 64]\\\\\n", "&&t \\in [n].\n", "\\end{eqnarray}\n", "\n", "It is easier to change the output vector to so-called hot representation, in which $y=0$ translates into $y=[1,0,0,0,0,0,0,0,0,0]$ and $y=2$ into $y=[0,0,1,0,0,0,0,0,0,0]$, etc. \n", "\n", "In this way we can use equations for binary choice of 10 chategories. The output vector `Y_onehot` is going to be of dimension $n\\times 10$, rather than $n.$\n", "\n", "\n", "The function `to_categorical_numpy` implements the hot representation." 
] }, { "cell_type": "code", "execution_count": 4, "id": "120d3691", "metadata": {}, "outputs": [], "source": [ "# to categorical turns our integer vector into a onehot representation\n", "def to_categorical_numpy(integer_vector): # integer_vector[n_inputs] contains number between 0...9\n", " n_inputs = len(integer_vector) # inputs\n", " n_categories = max(integer_vector) + 1 # 10 chategories\n", " onehot_vector = zeros((n_inputs, n_categories),dtype=int)\n", " onehot_vector[range(n_inputs), integer_vector] = 1 \n", " return onehot_vector" ] }, { "cell_type": "code", "execution_count": 9, "id": "2213bb90", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0, 0, 0, 1, 0, 0, 0, 0, 0],\n", " [0, 0, 0, 0, 0, 1, 0, 0, 0],\n", " [0, 0, 0, 0, 1, 0, 0, 0, 0],\n", " [0, 0, 0, 0, 0, 0, 0, 0, 1],\n", " [1, 0, 0, 0, 0, 0, 0, 0, 0]])" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "integer_vector=[3,5,4,8,0]\n", "to_categorical_numpy(integer_vector)" ] }, { "cell_type": "code", "execution_count": 10, "id": "cc882ca4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "([0, 1, 2, 3, 4], [3, 5, 4, 8, 0])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(range(len(integer_vector))), integer_vector" ] }, { "cell_type": "code", "execution_count": 11, "id": "d7d6bae2", "metadata": {}, "outputs": [], "source": [ "Y_train_onehot, Y_test_onehot = to_categorical_numpy(Y_train), to_categorical_numpy(Y_test)" ] }, { "cell_type": "markdown", "id": "c2bb33c0", "metadata": {}, "source": [ "

Figure: NN for recognizing digits.

" ] }, { "cell_type": "markdown", "id": "03e21e5c", "metadata": {}, "source": [ "As said before, the input and the adjusted hot output data have dimensions\n", "\\begin{eqnarray}\n", "&& X\\in [n\\times 64]\\\\\n", "&& Y \\in [n\\times 10].\n", "\\end{eqnarray}\n", "Note that the output $t$ had dimension $n$, but we transformed it into hot representation, so now it has dimension $n\\times 10$\n", "\n", "We will use 50 neurons in the hidden layer, and we have 10 categories, hence our weights will have dimenions:\n", "\\begin{eqnarray}\n", "&&w^{(1)}\\in [64\\times 50]\\\\\n", "&&b^{(1)}\\in [50]\\\\\n", "&&w^{(2)}\\in [50\\times 10]\\\\\n", "&&b^{(2)}\\in 10\\\\\n", "&&a^{(2)} \\in [n\\times 10]\n", "\\end{eqnarray}\n", "\n", "\n", "The equations for our NN models are:\n", "\\begin{eqnarray}\n", "&& z^{(1)} = X\\cdot w^{(1)} + b^{(1)} \\in [n\\times 50]\\\\\n", "&& a^{(1)} = f^{(1)}(z^{(1)}) \\in [n\\times 50]\\\\\n", "&& z^{(2)} = a^{(1)}\\cdot w^{(2)} + b^{(2)} \\in [n\\times 10]\\\\\n", "&& a^{(2)} = f^{(2)}(z^{(2)}) \\in [n\\times 10]\n", "\\end{eqnarray}\n", "\n", "where $$f^{(1)}(z)=1/(\\exp(-z)+1)$$ and\n", "$$f^{(2)}(z_c)=\\frac{\\exp{z_c}}{\\sum_{c'=0}^9 \\exp{z_{c'}}}$$\n", "\n", "\n", "Note that the output layer uses the `softmax` activation function, because we have the multiple-choice output. The cost function in this case has to maximize the cross entropy, i.e., the probability that the model gets all the answers correct, which is given by\n", "\\begin{eqnarray}\n", "P({\\cal D}|\\{w,b\\}) = \\prod_{i=1}^n \\prod_{c=0}^9 P(y_{ic}=1)^{y_{ic}} (1-P(y_{ic}=1)^{1-y_{ic}}\n", "\\end{eqnarray}\n", "here $y_{ic}$ can only take values of 0 or 1, and $c$ runs from 0 to 9, and $i$ runs over all input data $n$. Here $\\cal D$ is the collection of all input data. This is facilitated by the hot vector representation implemented above, in which $y\\in[0,1,...9]$ is changed to hot representation with $y_{ic}$.\n", "\n", "To maximize $P({\\cal D}|\\{w,b\\})$ we minimize $C(\\{w,b\\})=-\\log(P({\\cal D}|\\{w,b\\}))$. The cost function therefore is\n", "\\begin{eqnarray}\n", "C(\\{w,b\\})=-\\sum_{i,c} y_{ic} \\log(P_{ic})+(1-y_{ic})\\log(1-P_{ic})\n", "\\end{eqnarray}\n", "\n", "Note that $a_{ic}^{(2)}\\equiv P_{ic}$ is the result of our NN.\n", "\n", "\n", "Later we we also regularize the cost function with $L_2$ metric in the following way:\n", "\\begin{eqnarray}\n", "C(\\{w,b\\})=-\\sum_{i,c} y_{ic} \\log(P_{ic})+(1-y_{ic})\\log(1-P_{ic}) + \\frac{\\lambda}{2} \\sum_{hc} (w_{hc}^{(2)})^2 +\\frac{\\lambda}{2}\\sum_{ph}(w_{ph}^{(1)})^2\n", "\\end{eqnarray}\n" ] }, { "cell_type": "markdown", "id": "408402d2", "metadata": {}, "source": [ "First we create a random configuration of weights." 
] }, { "cell_type": "code", "execution_count": 12, "id": "a6d83fdc", "metadata": {}, "outputs": [], "source": [ "# building our neural network\n", "n_inputs, n_features = X_train.shape\n", "n_hidden_neurons = 50\n", "n_categories = 10\n", "\n", "# we make the weights normally distributed using numpy.random.randn\n", "def GiveStartingRandomWeights():\n", " random.seed(0)\n", " # weights and bias in the hidden layer\n", " W_1 = random.randn(n_features, n_hidden_neurons)\n", " b_1 = zeros(n_hidden_neurons) + 0.01\n", "\n", " # weights and bias in the output layer\n", " W_2 = random.randn(n_hidden_neurons, n_categories)\n", " b_2 = zeros(n_categories) + 0.01\n", " return (W_1, b_1, W_2, b_2)" ] }, { "cell_type": "markdown", "id": "51d193f0", "metadata": {}, "source": [ "Next we evaluate NN by forward algorithm, and check the accuracy of its predictions." ] }, { "cell_type": "code", "execution_count": 14, "id": "0118e398", "metadata": {}, "outputs": [], "source": [ "def mfermi(x):\n", " return 1/(1 + exp(-x))\n", "\n", "def feed_forward(X, all_weights):\n", " \"identical to feed_forward, except we also return a_1, i.e, hidden layer a\"\n", " W_1, b_1, W_2, b_2 = all_weights\n", " # weighted sum of inputs to the hidden layer\n", " z_1 = matmul(X, W_1) + b_1\n", " # activation in the hidden layer\n", " a_1 = mfermi(z_1)\n", " # weighted sum of inputs to the output layer\n", " z_2 = matmul(a_1, W_2) + b_2\n", " # softmax output\n", " # axis 0 holds each input and axis 1 the probabilities of each category\n", " exp_term = exp(z_2)\n", " a_2 = exp_term/sum(exp_term, axis=1, keepdims=True)\n", " # for backpropagation need activations in hidden and output layers\n", " return a_1, a_2\n", "\n", "\n", "# we obtain a prediction by taking the class with the highest likelihood\n", "def predict(X, all_weights):\n", " a_1, probabilities = feed_forward(X, all_weights)\n", " return (probabilities,argmax(probabilities, axis=1))" ] }, { "cell_type": "markdown", "id": "89e38f97", "metadata": {}, "source": [ "Checking prediction of NN for one data point. The weights are not yet optimized." 
] }, { "cell_type": "code", "execution_count": 16, "id": "407e5175", "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "probabilities = (n_inputs, n_categories) = (1437, 10)\n", "probability that image 0 is in category 0,1,2,...,9 = \n", "[2.23785373e-07 1.47533958e-01 7.28910767e-04 3.32202888e-05\n", " 4.42269923e-05 1.06343900e-04 7.66939998e-03 8.14604377e-01\n", " 4.64970935e-07 2.92788746e-02]\n", "probabilities sum up to: 1.0\n", "\n", "predictions = (n_inputs) = (1437,)\n", "prediction for image 0: 7\n", "correct label for image 0: 6\n" ] } ], "source": [ "all_weights = GiveStartingRandomWeights()\n", "(probabilities,predictions) = predict(X_train, all_weights)\n", "\n", "print(\"probabilities = (n_inputs, n_categories) = \" + str(probabilities.shape))\n", "print(\"probability that image 0 is in category 0,1,2,...,9 = \\n\" + str(probabilities[0]))\n", "print(\"probabilities sum up to: \" + str(probabilities[0].sum()))\n", "print()\n", "\n", "print(\"predictions = (n_inputs) = \" + str(predictions.shape))\n", "print(\"prediction for image 0: \" + str(predictions[0]))\n", "print(\"correct label for image 0: \" + str(Y_train[0]))" ] }, { "cell_type": "markdown", "id": "81d80d0e", "metadata": {}, "source": [ "We include `accuracy_score` from sklearn to meassure how large percentage of data is correctly predicted.\n" ] }, { "cell_type": "code", "execution_count": 17, "id": "f595f64b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Old accuracy on training data: 0.04314544189283229\n" ] } ], "source": [ "from sklearn.metrics import accuracy_score\n", "\n", "(probabilities,predictions) = predict(X_train, all_weights)\n", "print(\"Old accuracy on training data:\", accuracy_score(predictions, Y_train))" ] }, { "cell_type": "markdown", "id": "899fdccc", "metadata": {}, "source": [ "Next we implement gradients, which are used for back-propagation in function `backpropagation`.\n", "\n", "The gradients are somewhat different than derived above because the cost function is obtained from the cross entropy function. Lets firts use cost function $C$ withouth regularization $\\lambda$. 
\n", "\n", "The gradients are:\n", "\\begin{eqnarray}\n", "\\frac{\\partial C}{\\partial w_{jc}^{(2)}}=-\\sum_i \\left(\\frac{y_{ic}}{P_{ic}}-\\frac{1-y_{ic}}{1-P_{ic}}\\right)\n", "\\frac{\\partial P_{ic}}{\\partial w_{jc}^{(2)}}=\n", "-\\sum_i \\frac{y_{ic}-P_{ic}}{P_{ic}(1-P_{ic})}\n", "\\frac{\\partial P_{ic}}{\\partial w_{jc}^{(2)}}\n", "\\end{eqnarray}\n", "Next \n", "$$\\frac{\\partial P_{ic}}{\\partial w_{jc}^{(2)}}=\\frac{\\partial P_{ic}}{\\partial z_{ic}}\\frac{\\partial z_{ic}}{\\partial w_{jc}}$$\n", "\n", "Since $P_{ic}=f^{(2)}(z_{ic}^{(2)})$ and \n", "$z_{ic}^{(2)} = \\sum_{j\\in hidden} a_{i j}^{(1)} w_{j c}^{(2)} + b_{c}^{(2)}$\n", "we have\n", "$$\\frac{\\partial P_{ic}}{\\partial w_{jc}^{(2)}}=P_{ic}(1-P_{ic}) a_{ij}^{(1)}$$\n", "which finally gives\n", "\n", "\\begin{eqnarray}\n", "\\frac{\\partial C}{\\partial w_{jc}^{(2)}}=\n", "\\sum_i (P_{ic}-y_{ic}) a_{ij}^{(1)} = {a^{(1)}}^T (a^{(2)}-Y)\n", "\\end{eqnarray}\n", "where we took into account that $a_{ic}^{(2)}=P_{ic}$ and $Y_{ic}=y_{ic}$.\n", "Similarly we can see that\n", "\\begin{eqnarray}\n", "\\frac{\\partial C}{\\partial b_{c}^{(2)}}=\n", "\\sum_i (P_{ic}-y_{ic})\n", "\\end{eqnarray}\n", "Next we evaluate the derivative in the hidden layer, i.e., \n", "\\begin{eqnarray}\n", "\\frac{\\partial C}{\\partial w_{ph}^{(1)}}=\\sum_i \\frac{\\partial C}{\\partial P_{ic}}\n", "\\frac{\\partial P_{ic}}{\\partial z_{ic}^{(2)}}\\frac{\\partial z_{ic}^{(2)}}{\\partial a_{ih}^{(1)}}\n", "\\frac{\\partial a_{ih}^{(1)}}{\\partial z_{ih}^{(1)}}\n", "\\frac{\\partial z_{ih}^{(1)}}{\\partial w_{ph}^{(1)}}\n", "\\end{eqnarray}\n", "which comes from the fact that $P_{ic}=f^{(2)}(z_{ic}^{(2)})$, $z_{ic}^{(2)}=\\sum_h a_{ih}^{(1)} w_{hc}^{(2)}+b_c^{(2)}$ and $a_{ih}^{(1)}=f^{(1)}(z_{ih}^{(1)})$ and $z_{ih}^{(1)}=\\sum_p X_{ip} w_{ph}^{(1)} +b_h$. 
\n", "We see that $\\frac{\\partial C}{\\partial P_{ic}}=(P_{ic}-y_{ic})/(P_{ic}(1-P_{ic}))$, further $\\frac{\\partial P_{ic}}{\\partial z_{ic}^{(2)}}=P_{ic}(1-P_{ic})$, $\\frac{\\partial z_{ic}^{(2)}}{\\partial a_{ih}^{(1)}}=w^{(2)}_{hc}$, $\\frac{\\partial a_{ih}^{(1)}}{\\partial z_{ih}^{(1)}}=a_{ih}^{(1)}(1-a_{ih}^{(1)})$, $\\frac{\\partial z_{ih}^{(1)}}{\\partial w_{ph}^{(1)}}=X_{ip}$.\n", "\n", "\n", "Taking all this into account, we get\n", "\\begin{eqnarray}\n", "\\frac{\\partial C}{\\partial w_{ph}^{(1)}}=\n", "\\sum_i X_{ip} a_{ih}^{(1)}(1-a_{ih}^{(1)})\\sum_c (P_{ic}-y_{ic}) w_{hc}^{(2)}\n", "\\end{eqnarray}\n", "which can also be written as\n", "\\begin{equation}\n", "\\frac{\\partial C}{\\partial w_{ph}^{(1)}}= (X^T ( a^{(1)}\\circ (1-a^{(1)})\\circ (a^{(2)}-Y) ({w^{(2)}})^T ))_{ph}\n", "\\end{equation}\n", "where we introduced elementwise product $\\circ$ defined by $c_{ih}=a_{ih} b_{ih}$ as $c= a\\circ b$ .\n", "Similarly\n", "\\begin{eqnarray}\n", "\\frac{\\partial C}{\\partial b_{h}^{(1)}}=\n", "\\sum_{i,h} a_{ih}^{(1)}(1-a_{ih}^{(1)})\\sum_c (P_{ic}-y_{ic}) w_{hc}^{(2)}\n", "\\end{eqnarray}\n", "\n", "\n", "Finally, when $\\lambda$ is nonzero, we will just add to derivatives\n", "\\begin{eqnarray}\n", "\\frac{\\partial C}{\\partial w_{ph}^{(1)} } += \\lambda w_{ph}^{(1)}\\\\\n", "\\frac{\\partial C}{\\partial w_{jc}^{(2)}} += \\lambda w_{jc}^{(2)}\n", "\\end{eqnarray}" ] }, { "cell_type": "code", "execution_count": 11, "id": "f294d682", "metadata": {}, "outputs": [], "source": [ "def backpropagation(X, Y, all_weights):\n", " a_1, probabilities = feed_forward(X, all_weights)\n", " W_1, b_1, W_2, b_2 = all_weights\n", " # error in the output layer\n", " error_output = probabilities - Y\n", " # error in the hidden layer\n", " error_hidden = matmul(error_output, W_2.T) * a_1 * (1 - a_1)\n", " \n", " # gradients for the output layer\n", " dW2 = matmul(a_1.T, error_output)\n", " dB2 = sum(error_output, axis=0)\n", " \n", " # gradient for the hidden layer\n", " dW1 = matmul(X.T, error_hidden)\n", " dB1 = sum(error_hidden, axis=0)\n", "\n", " return dW2, dB2, dW1, dB1" ] }, { "cell_type": "code", "execution_count": 13, "id": "8061870e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "shapes of gradients= (50, 10) (10,) (64, 50) (50,)\n" ] } ], "source": [ "dW2, dB2, dW1, dB1 = backpropagation(X_train, Y_train_onehot, all_weights)\n", "print('shapes of gradients=', dW2.shape, dB2.shape, dW1.shape, dB1.shape)" ] }, { "cell_type": "markdown", "id": "126a7aaf", "metadata": {}, "source": [ "First we use simple gradient descendent method with fixed `learning rate` $\\gamma$=`eta`. We evaluate gradient `num_iterations`-times and move towards local minimum." ] }, { "cell_type": "code", "execution_count": 14, "id": "70adb540", "metadata": {}, "outputs": [], "source": [ "def SimpleGradientMethod(X_train, Y_train, all_weights, eta, lmbd, num_iterations):\n", " (W_1,b_1,W_2,b_2) = all_weights\n", " for i in range(num_iterations):\n", " # calculate gradients\n", " dW_2, dB_2, dW_1, dB_1 = backpropagation(X_train, Y_train, [W_1,b_1,W_2,b_2]) \n", " # regularization term gradients\n", " dW_2 += lmbd * W_2\n", " dW_1 += lmbd * W_1\n", " # update weights and biases\n", " W_1 -= eta * dW_1\n", " b_1 -= eta * dB_1\n", " W_2 -= eta * dW_2\n", " b_2 -= eta * dB_2\n", " return (W_1,b_1,W_2,b_2)" ] }, { "cell_type": "markdown", "id": "3d34e924", "metadata": {}, "source": [ "We also add regularization to cost function in the form of $\\lambda ||w||_2^2$. 
The precision after 100 steps is barely improved." ] }, { "cell_type": "code", "execution_count": 15, "id": "9e38b83e", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:2: RuntimeWarning: overflow encountered in exp\n", " return 1/(1 + exp(-x))\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Accuracy on training data: 0.10438413361169102\n" ] } ], "source": [ "eta = 0.01\n", "lmbd = 0.01\n", "num_iterations=100\n", "\n", "all_weights = GiveStartingRandomWeights()\n", "all_weights = SimpleGradientMethod(X_train, Y_train_onehot, all_weights, eta, lmbd, num_iterations)\n", "\n", "error=accuracy_score(predict(X_train,all_weights)[1],Y_train) \n", "print('Accuracy on training data: ', error)" ] }, { "cell_type": "markdown", "id": "f2e2f7aa", "metadata": {}, "source": [ "Next we implement **stochastic gradient descent (SGD)**, which takes a random subset of the data (of size `batch_size`), and we compute the gradient only for this subset of points. We then move in the steepest descent direction for only this subset. \n", "The randomness introduced this way decreases the chance\n", "that our optimization scheme gets stuck in a local minimum. \n", "\n", "If the size of the minibatches is small relative to the number of\n", "datapoints ($M < n$), the computation of the gradient is much\n", "cheaper, since we sum over the datapoints in the $k$-th minibatch and not\n", "over all $n$ datapoints." ] }, { "cell_type": "code", "execution_count": 16, "id": "3beebb68", "metadata": {}, "outputs": [], "source": [ "def StochasticGradientMethod(X_train, Y_train, all_weights, eta, lmbd, batch_size, epochs):\n", " (W_1,b_1,W_2,b_2) = all_weights\n", "\n", " data_indices = arange(len(X_train))\n", " iterations = len(X_train) // batch_size\n", " print('Number of iterations=', iterations)\n", " for i in range(epochs):\n", " for j in range(iterations):\n", " chosen_datapoints = random.choice(data_indices, size=batch_size, replace=False)\n", " # minibatch training data\n", " X_batch = X_train[chosen_datapoints]\n", " Y_batch = Y_train[chosen_datapoints]\n", " dW_2, dB_2, dW_1, dB_1 = backpropagation(X_batch, Y_batch, [W_1,b_1,W_2,b_2])\n", " # regularization term gradients\n", " dW_2 += lmbd * W_2\n", " dW_1 += lmbd * W_1\n", " # update weights and biases\n", " W_1 -= eta * dW_1\n", " b_1 -= eta * dB_1\n", " W_2 -= eta * dW_2\n", " b_2 -= eta * dB_2\n", " return (W_1,b_1,W_2,b_2)" ] }, { "cell_type": "markdown", "id": "bad08949", "metadata": {}, "source": [ "Finally we use the SGD method with learning rate $\gamma$=`eta`=0.01 and $\lambda=0.1$, with `batch_size=100`. The number of passes over the minibatches (`epochs`) is also chosen to be 100. This gives an excellent prediction accuracy of over 98%. A human can typically read digits with an accuracy of about 98%." 
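, "\n", "\n", "(Added check on the arithmetic: with $n=1437$ training images and `batch_size=100`, each epoch performs $\lfloor 1437/100 \rfloor = 14$ minibatch updates, which matches the printed `Number of iterations= 14` below.)"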
] }, { "cell_type": "code", "execution_count": 17, "id": "d951d5cb", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of iterations= 14\n", "Accuracy on training data: 0.9937369519832986 0.9805555555555555\n" ] } ], "source": [ "eta = 0.01\n", "lmbd = 0.1\n", "epochs = 100\n", "batch_size = 100\n", "\n", "all_weights = GiveStartingRandomWeights()\n", "\n", "all_weights = StochasticGradientMethod(X_train, Y_train_onehot, all_weights, eta, lmbd, batch_size, epochs)\n", "\n", "error=accuracy_score(predict(X_train,all_weights)[1],Y_train) \n", "error2=accuracy_score(predict(X_test,all_weights)[1],Y_test)\n", "print('Accuracy on training data: ', error, error2)" ] }, { "cell_type": "markdown", "id": "e98a80e9", "metadata": {}, "source": [ "### Adjust hyperparameters\n", "\n", "We now perform a grid search to find the optimal hyperparameters for the network. \n", "Note that we are only using 1 layer with 50 neurons, and human performance is estimated to be around $98\\%$ ($2\\%$ error rate)." ] }, { "cell_type": "code", "execution_count": 18, "id": "cd18a504", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of iterations= 14\n", "Learning rate= 1e-05 Lambda= 1e-05 Accuracy= 0.13569937369519833 0.18055555555555555\n", "Number of iterations= 14\n", "Learning rate= 1e-05 Lambda= 0.0001 Accuracy= 0.13569937369519833 0.18055555555555555\n", "Number of iterations= 14\n", "Learning rate= 1e-05 Lambda= 0.001 Accuracy= 0.13569937369519833 0.18055555555555555\n", "Number of iterations= 14\n", "Learning rate= 1e-05 Lambda= 0.01 Accuracy= 0.13569937369519833 0.18055555555555555\n", "Number of iterations= 14\n", "Learning rate= 1e-05 Lambda= 0.1 Accuracy= 0.13569937369519833 0.18055555555555555\n", "Number of iterations= 14\n", "Learning rate= 1e-05 Lambda= 1.0 Accuracy= 0.13569937369519833 0.18055555555555555\n", "Number of iterations= 14\n", "Learning rate= 1e-05 Lambda= 10.0 Accuracy= 0.13848295059151008 0.18055555555555555\n", "Number of iterations= 14\n", "Learning rate= 0.0001 Lambda= 1e-05 Accuracy= 0.6089074460681977 0.5833333333333334\n", "Number of iterations= 14\n", "Learning rate= 0.0001 Lambda= 0.0001 Accuracy= 0.6089074460681977 0.5833333333333334\n", "Number of iterations= 14\n", "Learning rate= 0.0001 Lambda= 0.001 Accuracy= 0.6089074460681977 0.5833333333333334\n", "Number of iterations= 14\n", "Learning rate= 0.0001 Lambda= 0.01 Accuracy= 0.6089074460681977 0.5805555555555556\n", "Number of iterations= 14\n", "Learning rate= 0.0001 Lambda= 0.1 Accuracy= 0.6116910229645094 0.5805555555555556\n", "Number of iterations= 14\n", "Learning rate= 0.0001 Lambda= 1.0 Accuracy= 0.6450939457202505 0.6083333333333333\n", "Number of iterations= 14\n", "Learning rate= 0.0001 Lambda= 10.0 Accuracy= 0.8545581071677105 0.8138888888888889\n", "Number of iterations= 14\n", "Learning rate= 0.001 Lambda= 1e-05 Accuracy= 0.9617258176757133 0.8916666666666667\n", "Number of iterations= 14\n", "Learning rate= 0.001 Lambda= 0.0001 Accuracy= 0.9617258176757133 0.8916666666666667\n", "Number of iterations= 14\n", "Learning rate= 0.001 Lambda= 0.001 Accuracy= 0.9617258176757133 0.8916666666666667\n", "Number of iterations= 14\n", "Learning rate= 0.001 Lambda= 0.01 Accuracy= 0.9617258176757133 0.8944444444444445\n", "Number of iterations= 14\n", "Learning rate= 0.001 Lambda= 0.1 Accuracy= 0.9624217118997912 0.9055555555555556\n", "Number of iterations= 14\n", "Learning rate= 0.001 Lambda= 1.0 Accuracy= 0.9826026443980515 0.95\n", 
"Number of iterations= 14\n", "Learning rate= 0.001 Lambda= 10.0 Accuracy= 0.942936673625609 0.9305555555555556\n", "Number of iterations= 14\n", "Learning rate= 0.01 Lambda= 1e-05 Accuracy= 0.9965205288796103 0.9361111111111111\n", "Number of iterations= 14\n", "Learning rate= 0.01 Lambda= 0.0001 Accuracy= 0.9972164231036882 0.9527777777777777\n", "Number of iterations= 14\n", "Learning rate= 0.01 Lambda= 0.001 Accuracy= 0.9979123173277662 0.9555555555555556\n", "Number of iterations= 14\n", "Learning rate= 0.01 Lambda= 0.01 Accuracy= 0.9979123173277662 0.9472222222222222\n", "Number of iterations= 14\n", "Learning rate= 0.01 Lambda= 0.1 Accuracy= 0.9937369519832986 0.9805555555555555\n", "Number of iterations= 14\n", "Learning rate= 0.01 Lambda= 1.0 Accuracy= 0.8176757132915797 0.7805555555555556\n", "Number of iterations= 14\n", "Learning rate= 0.01 Lambda= 10.0 Accuracy= 0.20668058455114824 0.18333333333333332\n", "Number of iterations= 14\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:2: RuntimeWarning: overflow encountered in exp\n", " return 1/(1 + exp(-x))\n", "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:2: RuntimeWarning: overflow encountered in exp\n", " return 1/(1 + exp(-x))\n", "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:2: RuntimeWarning: overflow encountered in exp\n", " return 1/(1 + exp(-x))\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Learning rate= 0.1 Lambda= 1e-05 Accuracy= 0.10438413361169102 0.07777777777777778\n", "Number of iterations= 14\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:2: RuntimeWarning: overflow encountered in exp\n", " return 1/(1 + exp(-x))\n", "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:2: RuntimeWarning: overflow encountered in exp\n", " return 1/(1 + exp(-x))\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Learning rate= 0.1 Lambda= 0.0001 Accuracy= 0.10368823938761308 0.08888888888888889\n", "Number of iterations= 14\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:2: RuntimeWarning: overflow encountered in exp\n", " return 1/(1 + exp(-x))\n", "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:2: RuntimeWarning: overflow encountered in exp\n", " return 1/(1 + exp(-x))\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Learning rate= 0.1 Lambda= 0.001 Accuracy= 0.0953375086986778 0.125\n", "Number of iterations= 14\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:2: RuntimeWarning: overflow encountered in exp\n", " return 1/(1 + exp(-x))\n", "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:2: RuntimeWarning: overflow encountered in exp\n", " return 1/(1 + exp(-x))\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Learning rate= 0.1 Lambda= 0.01 Accuracy= 0.10438413361169102 0.07777777777777778\n", "Number of iterations= 14\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:2: RuntimeWarning: overflow encountered in exp\n", " return 1/(1 + exp(-x))\n", 
"/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:2: RuntimeWarning: overflow encountered in exp\n", " return 1/(1 + exp(-x))\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Learning rate= 0.1 Lambda= 0.1 Accuracy= 0.09394572025052192 0.11666666666666667\n", "Number of iterations= 14\n", "Learning rate= 0.1 Lambda= 1.0 Accuracy= 0.10160055671537926 0.09166666666666666\n", "Number of iterations= 14\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:2: RuntimeWarning: overflow encountered in exp\n", " return 1/(1 + exp(-x))\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Learning rate= 0.1 Lambda= 10.0 Accuracy= 0.10160055671537926 0.09166666666666666\n", "Number of iterations= 14\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:2: RuntimeWarning: overflow encountered in exp\n", " return 1/(1 + exp(-x))\n", "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:15: RuntimeWarning: overflow encountered in exp\n", " exp_term = exp(z_2)\n", "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:16: RuntimeWarning: invalid value encountered in divide\n", " probabilities = exp_term/sum(exp_term, axis=1, keepdims=True)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Learning rate= 1.0 Lambda= 1e-05 Accuracy= 0.10438413361169102 0.07777777777777778\n", "Number of iterations= 14\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:2: RuntimeWarning: overflow encountered in exp\n", " return 1/(1 + exp(-x))\n", "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:15: RuntimeWarning: overflow encountered in exp\n", " exp_term = exp(z_2)\n", "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:16: RuntimeWarning: invalid value encountered in divide\n", " probabilities = exp_term/sum(exp_term, axis=1, keepdims=True)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Learning rate= 1.0 Lambda= 0.0001 Accuracy= 0.10438413361169102 0.07777777777777778\n", "Number of iterations= 14\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:2: RuntimeWarning: overflow encountered in exp\n", " return 1/(1 + exp(-x))\n", "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:15: RuntimeWarning: overflow encountered in exp\n", " exp_term = exp(z_2)\n", "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:16: RuntimeWarning: invalid value encountered in divide\n", " probabilities = exp_term/sum(exp_term, axis=1, keepdims=True)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Learning rate= 1.0 Lambda= 0.001 Accuracy= 0.10438413361169102 0.07777777777777778\n", "Number of iterations= 14\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:2: RuntimeWarning: overflow encountered in exp\n", " return 1/(1 + exp(-x))\n", "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:15: RuntimeWarning: overflow encountered in exp\n", " exp_term = exp(z_2)\n", 
"/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:16: RuntimeWarning: invalid value encountered in divide\n", " probabilities = exp_term/sum(exp_term, axis=1, keepdims=True)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Learning rate= 1.0 Lambda= 0.01 Accuracy= 0.10438413361169102 0.07777777777777778\n", "Number of iterations= 14\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:2: RuntimeWarning: overflow encountered in exp\n", " return 1/(1 + exp(-x))\n", "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:15: RuntimeWarning: overflow encountered in exp\n", " exp_term = exp(z_2)\n", "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:16: RuntimeWarning: invalid value encountered in divide\n", " probabilities = exp_term/sum(exp_term, axis=1, keepdims=True)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Learning rate= 1.0 Lambda= 0.1 Accuracy= 0.10438413361169102 0.07777777777777778\n", "Number of iterations= 14\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:2: RuntimeWarning: overflow encountered in exp\n", " return 1/(1 + exp(-x))\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Learning rate= 1.0 Lambda= 1.0 Accuracy= 0.10438413361169102 0.07777777777777778\n", "Number of iterations= 14\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:2: RuntimeWarning: overflow encountered in exp\n", " return 1/(1 + exp(-x))\n", "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:15: RuntimeWarning: overflow encountered in exp\n", " exp_term = exp(z_2)\n", "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:16: RuntimeWarning: invalid value encountered in divide\n", " probabilities = exp_term/sum(exp_term, axis=1, keepdims=True)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Learning rate= 1.0 Lambda= 10.0 Accuracy= 0.10438413361169102 0.07777777777777778\n", "Number of iterations= 14\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:2: RuntimeWarning: overflow encountered in exp\n", " return 1/(1 + exp(-x))\n", "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:15: RuntimeWarning: overflow encountered in exp\n", " exp_term = exp(z_2)\n", "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:16: RuntimeWarning: invalid value encountered in divide\n", " probabilities = exp_term/sum(exp_term, axis=1, keepdims=True)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Learning rate= 10.0 Lambda= 1e-05 Accuracy= 0.10438413361169102 0.07777777777777778\n", "Number of iterations= 14\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:2: RuntimeWarning: overflow encountered in exp\n", " return 1/(1 + exp(-x))\n", "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:15: RuntimeWarning: overflow encountered in exp\n", " exp_term = exp(z_2)\n", "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:16: RuntimeWarning: invalid value 
encountered in divide\n", " probabilities = exp_term/sum(exp_term, axis=1, keepdims=True)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Learning rate= 10.0 Lambda= 0.0001 Accuracy= 0.10438413361169102 0.07777777777777778\n", "Number of iterations= 14\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:2: RuntimeWarning: overflow encountered in exp\n", " return 1/(1 + exp(-x))\n", "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:15: RuntimeWarning: overflow encountered in exp\n", " exp_term = exp(z_2)\n", "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:16: RuntimeWarning: invalid value encountered in divide\n", " probabilities = exp_term/sum(exp_term, axis=1, keepdims=True)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Learning rate= 10.0 Lambda= 0.001 Accuracy= 0.10438413361169102 0.07777777777777778\n", "Number of iterations= 14\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:2: RuntimeWarning: overflow encountered in exp\n", " return 1/(1 + exp(-x))\n", "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:15: RuntimeWarning: overflow encountered in exp\n", " exp_term = exp(z_2)\n", "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:16: RuntimeWarning: invalid value encountered in divide\n", " probabilities = exp_term/sum(exp_term, axis=1, keepdims=True)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Learning rate= 10.0 Lambda= 0.01 Accuracy= 0.10438413361169102 0.07777777777777778\n", "Number of iterations= 14\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:2: RuntimeWarning: overflow encountered in exp\n", " return 1/(1 + exp(-x))\n", "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:15: RuntimeWarning: overflow encountered in exp\n", " exp_term = exp(z_2)\n", "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:16: RuntimeWarning: invalid value encountered in divide\n", " probabilities = exp_term/sum(exp_term, axis=1, keepdims=True)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Learning rate= 10.0 Lambda= 0.1 Accuracy= 0.10438413361169102 0.07777777777777778\n", "Number of iterations= 14\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:2: RuntimeWarning: overflow encountered in exp\n", " return 1/(1 + exp(-x))\n", "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:15: RuntimeWarning: overflow encountered in exp\n", " exp_term = exp(z_2)\n", "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:16: RuntimeWarning: invalid value encountered in divide\n", " probabilities = exp_term/sum(exp_term, axis=1, keepdims=True)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Learning rate= 10.0 Lambda= 1.0 Accuracy= 0.10438413361169102 0.07777777777777778\n", "Number of iterations= 14\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:2: RuntimeWarning: overflow encountered in exp\n", " return 1/(1 + exp(-x))\n", 
"/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:15: RuntimeWarning: overflow encountered in exp\n", " exp_term = exp(z_2)\n", "/var/folders/j8/d9m3r0zx7j37l3ktfl_n1xw00000gn/T/ipykernel_20664/1438300027.py:16: RuntimeWarning: invalid value encountered in divide\n", " probabilities = exp_term/sum(exp_term, axis=1, keepdims=True)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Learning rate= 10.0 Lambda= 10.0 Accuracy= 0.10438413361169102 0.07777777777777778\n" ] } ], "source": [ "eta_vals = logspace(-5, 1, 7)\n", "lmbd_vals = logspace(-5, 1, 7)\n", "# store the models for later use\n", "DNN_numpy = zeros((len(eta_vals), len(lmbd_vals)), dtype=object)\n", "\n", "# grid search\n", "for i, eta in enumerate(eta_vals):\n", " for j, lmbd in enumerate(lmbd_vals):\n", " \n", " all_weights = GiveStartingRandomWeights() \n", " all_weights = StochasticGradientMethod(X_train, Y_train_onehot, all_weights, eta, lmbd, batch_size, epochs)\n", " \n", " error=accuracy_score(predict(X_train,all_weights)[1],Y_train) \n", " error2=accuracy_score(predict(X_test,all_weights)[1],Y_test) \n", " DNN_numpy[i][j] = error\n", " \n", " #test_predict = dnn.predict(X_test)\n", " \n", " print('Learning rate=', eta, 'Lambda=', lmbd, 'Accuracy=', error, error2)" ] }, { "cell_type": "code", "execution_count": null, "id": "5980c8c4", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 5 }