\n",
"\n",
"\n",
"\n",
"

\n",
"\n",
"@Rasmussen:book06 is still one of the most important references on\n",
"Gaussian process models. It is available freely online.\n",
"\n",
"## What is Machine Learning?\n",
"\n",
"What is machine learning? At its most basic level machine learning is a\n",
"combination of\n",
"\n",
"$$\\text{data} + \\text{model} \\xrightarrow{\\text{compute}} \\text{prediction}$$\n",
"\n",
"where *data* is our observations. They can be actively or passively\n",
"acquired (meta-data). The *model* contains our assumptions, based on\n",
"previous experience. That experience can be other data, it can come from\n",
"transfer learning, or it can merely be our beliefs about the\n",
"regularities of the universe. In humans our models include our inductive\n",
"biases. The *prediction* is an action to be taken or a categorization or\n",
"a quality score. The reason that machine learning has become a mainstay\n",
"of artificial intelligence is the importance of predictions in\n",
"artificial intelligence. The data and the model are combined through\n",
"computation.\n",
"\n",
"In practice we normally perform machine learning using two functions. To\n",
"combine data with a model we typically make use of:\n",
"\n",
"**a prediction function** a function which is used to make the\n",
"predictions. It includes our beliefs about the regularities of the\n",
"universe, our assumptions about how the world works, e.g. smoothness,\n",
"spatial similarities, temporal similarities.\n",
"\n",
"**an objective function** a function which defines the cost of\n",
"misprediction. Typically it includes knowledge about the world's\n",
"generating processes (probabilistic objectives) or the costs we pay for\n",
"mispredictions (empiricial risk minimization).\n",
"\n",
"The combination of data and model through the prediction function and\n",
"the objectie function leads to a *learning algorithm*. The class of\n",
"prediction functions and objective functions we can make use of is\n",
"restricted by the algorithms they lead to. If the prediction function or\n",
"the objective function are too complex, then it can be difficult to find\n",
"an appropriate learning algorithm. Much of the acdemic field of machine\n",
"learning is the quest for new learning algorithms that allow us to bring\n",
"different types of models and data together.\n",
"\n",
"A useful reference for state of the art in machine learning is the UK\n",
"Royal Society Report, [Machine Learning: Power and Promise of Computers\n",
"that Learn by\n",
"Example](https://royalsociety.org/~/media/policy/projects/machine-learning/publications/machine-learning-report.pdf).\n",
"\n",
"You can also check my blog post on [\"What is Machine\n",
"Learning?\"](http://inverseprobability.com/2017/07/17/what-is-machine-learning)\n",
"\n",
"### Olympic Marathon Data\n",
"\n",
"The first thing we will do is load a standard data set for regression\n",
"modelling. The data consists of the pace of Olympic Gold Medal Marathon\n",
"winners for the Olympics from 1896 to present. First we load in the data\n",
"and plot.\n",
"\n",
"### Olympic Marathon Data\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n", "- Gold medal times for Olympic Marathon since 1896.\n", "\n", "- Marathons before 1924 didnâ€™t have a standardised distance.\n", "\n", "- Present results using pace per km.\n", "\n", "- In 1904 Marathon was badly organised leading to very slow times.\n", "\n", " | \n",
" \n",
"\n",
"\n",
"\n",
" \n",
"\n",
"Image from Wikimedia Commons |

\n",
"\n",
"\n",
"\n",
"

\n",
"\n",
"Things to notice about the data include the outlier in 1904, in this\n",
"year, the olympics was in St Louis, USA. Organizational problems and\n",
"challenges with dust kicked up by the cars following the race meant that\n",
"participants got lost, and only very few participants completed.\n",
"\n",
"More recent years see more consistently quick marathons.\n",
"\n",
"### Overdetermined System\n",
"\n",
"The challenge with a linear model is that it has two unknowns, $m$, and\n",
"$c$. Observing data allows us to write down a system of simultaneous\n",
"linear equations. So, for example if we observe two data points, the\n",
"first with the input value, $\\inputScalar_1 = 1$ and the output value,\n",
"$\\dataScalar_1 =3$ and a second data point, $\\inputScalar = 3$,\n",
"$\\dataScalar=1$, then we can write two simultaneous linear equations of\n",
"the form.\n",
"\n",
"point 1: $\\inputScalar = 1$, $\\dataScalar=3$ $$3 = m + c$$ point 2:\n",
"$\\inputScalar = 3$, $\\dataScalar=1$ $$1 = 3m + c$$\n",
"\n",
"The solution to these two simultaneous equations can be represented\n",
"graphically as\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"

\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"

\n",
"\n",
"Unfortunately, most analyses of his ideas stop at that point, whereas\n",
"his real point is that such a notion is unreachable. Not so much\n",
"*superman* as *strawman*. Just three pages later in the \"Philosophical\n",
"Essay on Probabilities\" [@Laplace:essai14], Laplace goes on to observe:\n",
"\n",
"> The curve described by a simple molecule of air or vapor is regulated\n",
"> in a manner just as certain as the planetary orbits; the only\n",
"> difference between them is that which comes from our ignorance.\n",
">\n",
"> Probability is relative, in part to this ignorance, in part to our\n",
"> knowledge."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pods\n",
"pods.notebook.display_google_book(id='1YQPAAAAQAAJ', page='PR17-IA4')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"\n",
"\n",
"

\n",
"\n",
"In other words, we can never make use of the idealistic deterministc\n",
"Universe due to our ignorance about the world, Laplace's suggestion, and\n",
"focus in this essay is that we turn to probability to deal with this\n",
"uncertainty. This is also our inspiration for using probabilit in\n",
"machine learning.\n",
"\n",
"The \"forces by which nature is animated\" is our *model*, the \"situation\n",
"of beings that compose it\" is our *data* and the \"intelligence\n",
"sufficiently vast enough to submit these data to analysis\" is our\n",
"compute. The fly in the ointment is our *ignorance* about these aspects.\n",
"And *probability* is the tool we use to incorporate this ignorance\n",
"leading to uncertainty or *doubt* in our predictions.\n",
"\n",
"Laplace's concept was that the reason that the data doesn't match up to\n",
"the model is because of unconsidered factors, and that these might be\n",
"well represented through probability densities. He tackles the challenge\n",
"of the unknown factors by adding a variable, $\\noiseScalar$, that\n",
"represents the unknown. In modern parlance we would call this a *latent*\n",
"variable. But in the context Laplace uses it, the variable is so common\n",
"that it has other names such as a \"slack\" variable or the *noise* in the\n",
"system.\n",
"\n",
"point 1: $\\inputScalar = 1$, $\\dataScalar=3$ $$\n",
"3 = m + c + \\noiseScalar_1\n",
"$$ point 2: $\\inputScalar = 3$, $\\dataScalar=1$ $$\n",
"1 = 3m + c + \\noiseScalar_2\n",
"$$ point 3: $\\inputScalar = 2$, $\\dataScalar=2.5$ $$\n",
"2.5 = 2m + c + \\noiseScalar_3\n",
"$$\n",
"\n",
"Laplace's trick has converted the *overdetermined* system into an\n",
"*underdetermined* system. He has now added three variables,\n",
"$\\{\\noiseScalar_i\\}_{i=1}^3$, which represent the unknown corruptions of\n",
"the real world. Laplace's idea is that we should represent that unknown\n",
"corruption with a *probability distribution*.\n",
"\n",
"### A Probabilistic Process\n",
"\n",
"However, it was left to an admirer of Gauss to develop a practical\n",
"probability density for that purpose. It was Carl Friederich Gauss who\n",
"suggested that the *Gaussian* density (which at the time was unnamed!)\n",
"should be used to represent this error.\n",
"\n",
"The result is a *noisy* function, a function which has a deterministic\n",
"part, and a stochastic part. This type of function is sometimes known as\n",
"a probabilistic or stochastic process, to distinguish it from a\n",
"deterministic process.\n",
"\n",
"### The Gaussian Density\n",
"\n",
"The Gaussian density is perhaps the most commonly used probability\n",
"density. It is defined by a *mean*, $\\meanScalar$, and a *variance*,\n",
"$\\dataStd^2$. The variance is taken to be the square of the *standard\n",
"deviation*, $\\dataStd$.\n",
"\n",
"$$\\begin{align}\n",
" p(\\dataScalar| \\meanScalar, \\dataStd^2) & = \\frac{1}{\\sqrt{2\\pi\\dataStd^2}}\\exp\\left(-\\frac{(\\dataScalar - \\meanScalar)^2}{2\\dataStd^2}\\right)\\\\& \\buildrel\\triangle\\over = \\gaussianDist{\\dataScalar}{\\meanScalar}{\\dataStd^2}\n",
" \\end{align}$$\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"

\n",
"\n",
"When viewing these contour plots, I sometimes find it helpful to think\n",
"of Uluru, the prominent rock formation in Australia. The rock rises\n",
"above the surface of the plane, just like a probability density rising\n",
"above the zero line. The rock is three dimensional, but when we view\n",
"Uluru from the classical position, we are looking at one side of it.\n",
"This is equivalent to viewing the marginal density.\n",
"\n",
"The joint density can be viewed from above, using contours. The\n",
"conditional density is equivalent to *slicing* the rock. Uluru is a holy\n",
"rock, so this has to be an imaginary slice. Imagine we cut down a\n",
"vertical plane orthogonal to our view point (e.g. coming across our view\n",
"point). This would give a profile of the rock, which when renormalized,\n",
"would give us the conditional distribution, the value of conditioning\n",
"would be the location of the slice in the direction we are facing.\n",
"\n",
"### Prediction with Correlated Gaussians\n",
"\n",
"Of course in practice, rather than manipulating mountains physically,\n",
"the advantage of the Gaussian density is that we can perform these\n",
"manipulations mathematically.\n",
"\n",
"Prediction of $\\mappingFunction_2$ given $\\mappingFunction_1$ requires\n",
"the *conditional density*,\n",
"$p(\\mappingFunction_2|\\mappingFunction_1)$.Another remarkable property\n",
"of the Gaussian density is that this conditional distribution is *also*\n",
"guaranteed to be a Gaussian density. It has the form, $$\n",
" p(\\mappingFunction_2|\\mappingFunction_1) = \\gaussianDist{\\mappingFunction_2}{\\frac{\\kernelScalar_{1, 2}}{\\kernelScalar_{1, 1}}\\mappingFunction_1}{ \\kernelScalar_{2, 2} - \\frac{\\kernelScalar_{1,2}^2}{\\kernelScalar_{1,1}}}\n",
" $$where we have assumed that the covariance of the original joint\n",
"density was given by $$\n",
" \\kernelMatrix = \\begin{bmatrix} \\kernelScalar_{1, 1} & \\kernelScalar_{1, 2}\\\\ \\kernelScalar_{2, 1} & \\kernelScalar_{2, 2}.\\end{bmatrix}\n",
" $$\n",
"\n",
"Using these formulae we can determine the conditional density for any of\n",
"the elements of our vector $\\mappingFunctionVector$. For example, the\n",
"variable $\\mappingFunction_8$ is less correlated with\n",
"$\\mappingFunction_1$ than $\\mappingFunction_2$. If we consider this\n",
"variable we see the conditional density is more diffuse."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pods\n",
"from ipywidgets import IntSlider"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pods.notebook.display_plots('two_point_sample{sample:0>3}.svg', '../slides/diagrams/gp', sample=IntSlider(13, 13, 17, 1))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n", "\n", "\n", "\n", "\n", "

\n", "\n", " | \n", "\n", " |

\n", "\n", " | \n", "\n", " |

\n", "\n", "\n", "\n", "\n", "

\n", "\n", " | \n", "\n", " |

\n", "\n", "\n", "\n", "\n", "

\n", "\n", " | \n", "\n", " |

\n", "\n", "\n", "\n", "\n", "

\n", "\n", " | \n", "\n", " |