\n", "$a_m = \\sum_{i=0}^D w_{ji}^{(1)}x_i$

\n", "where the $(1)$ superscript indicates this weight is for the hidden layer. The output from each node, $z_m$, is then given by the value of a *fixed nonlinear function*, $h$, known as the *activation function*, acting on the unit activation

\n", "$z_m = h(a_m) = h \\left( \\sum_{i=0}^D w_{mi}^{(1)}x_i \\right)$

\n", "Notice that $h$ is the same function for all nodes.\n", "

\n", "$a_k = \\sum_{m=0}^M w_{km}^{(2)} z_m$

\n", "We again apply a nonlinear activation function, say $y$, to produce the output

\n", "$y_k = y(a_k)$\n", "\n", "Thus, the entire model can be summarized as a $K$ dimensional output vector $Y \\in \\Re^K$ where each element $y_k$ by

\n", "$y_k(\\mathbf{x},\\mathbf{w}) = y \\left( \\sum_{m=0}^M w_{km}^{(2)} h \\left( \\sum_{i=0}^D w_{mi}^{(1)}x_i \\right) \\right)$\n", "\n", "

\n", "$p(t|\\mathbf{x},\\mathbf{w}) = ND(t|y(\\mathbf{x},\\mathbf{w}), \\beta^{-1})$

\n", "where $ND(\\mu,\\sigma^2)$ is the normal distribution with mean $\\mu$ and variance $\\sigma^2$. *Assume* the output layer activation function is the identity, i.e. the output is simply the the unit activations, and that we have $N$ i.i.d. observations $\\mathbf{X}=\\\\{\\mathbf{x}_1, \\ldots, \\mathbf{x}_2 \\\\}$ and target values $\\mathbf{t} = \\\\{t_1, \\ldots, t_2\\\\}$, the likelihood function is

\n", "$p(\\mathbf{t}|\\mathbf{X},\\mathbf{w}, \\beta) = \\prod_{n=1}^N p(t_n|\\mathbf{x}_n, \\mathbf{w}, \\beta)$

\n", "The **total error function** is defined as the negative logarithm of the likelihood function given by

\n", "\n", "$\\frac {\\beta}{2} \\sum_{n=1}^N \\\\{y(\\mathbf{x}_n,\\mathbf{w}) - t_n \\\\}^2 - \\frac{N}{2} \\ln(\\beta) + \\frac{N}{2} \\ln (2\\pi)$\n", "\n", "The parameter estimates, $\\mathbf{w}^{(ML)}$ (ML indicates maximum likelihood) are found by maximizing the equivalent sum-of-squares error function\n", "\n", "$E(\\mathbf{w}) = \\frac{1}{2} \\sum_{n=1}^N \\\\{y(\\mathbf{x}_n,\\mathbf{w}) - t_n \\\\}^2$\n", "\n", "Note, due to the nonlinearity of the network model, $E(\\mathbf{w})$ is not necessisarily convex, and thus may have *local minima* and hence it is not possible to know that the global minimum has been obtained. The model parameter, $\\beta$ is found by first finding $\\mathbf{w}_{ML}$ and then solving\n", "\n", "$\\frac{1}{\\beta_{ML}} = \\frac{1}{N}\\sum_{n=1}^N \\\\{ y(\\mathbf{x}_n,\\mathbf{w}^{(ML)}) - t_n \\\\}^2$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

$x_1$ | $x_2$ | Output |
---|---|---|

0 | 0 | 0 |

1 | 0 | 1 |

0 | 1 | 1 |

1 | 1 | 0 |