{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 6.1 Sequence Models"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- People’s opinions on movies can change quite significantly over time. In fact, psychologists even have names for some of the effects:\n",
" - ***Anchoring*** based on someone else’s opinion.\n",
" - For instance after the Oscara wards, ratings for the corresponding movie go up, even though it’s still the same movie. This effect persists for a few months until the award is forgotten. \n",
" - ***Hedonic adaptation*** where humans quickly adapt to accept an improved (or a bad) situation as the new normal. - For instance, after watching many good movies, the expectations that the next movie be equally good or better are high, and even an average movie might be considered a bad movie after many great ones.\n",
" - ***Seasonality***\n",
" - Very few viewers like to watch a Santa Claus movie in August.\n",
" - In some cases movies become unpopular due to the misbehaviors of directors or actors in the production.\n",
" - Some movies become cult movies, because they were almost comically bad. \n",
"- Other examples\n",
" - Many users have highly particular behavior when it comes to the time when they open apps. \n",
" - For instance, social media apps are much more popular after school with students. \n",
" - Stock market trading apps are more commonly used when the markets are open.\n",
" - It is much harder to predict tomorrow's stock prices than to fill in the blanks for a stock price we missed yesterday\n",
" - In statistics the former is called prediction whereas the latter is called filtering.\n",
" - After all, hindsight is so much easier than foresight. \n",
" - Music, speech, text, movies, steps, etc. are all sequential in nature. \n",
" - If we were to permute them they would make little sense. \n",
" - The headline dog bites man is much less surprising than man bites dog, even though the words are identical.\n",
" - Earthquakes are strongly correlated, i.e. after a massive earthquake there are very likely several smaller aftershocks, much more so than without the strong quake. \n",
" - In fact, earthquakes are spatiotemporally correlated, i.e. the aftershocks typically occur within a short time span and in close proximity.\n",
" - Humans interact with each other in a sequential nature, as can be seen in Twitter fights, dance patterns and debates.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6.1.1 Statistical Tools\n",
"![](https://github.com/d2l-ai/d2l-en/raw/master/img/ftse100.png)\n",
"\n",
"- Let's denote the prices by $x_t \\geq 0$, i.e. at time $t \\in \\mathbb{N}$ we observe some price $x_t$. \n",
"- For a trader to do well in the stock market on day $t$ he should want to predict $x_t$ via $$x_t \\sim p(x_t|x_{t-1}, \\ldots x_1).$$\n",
"- Autoregressive Models\n",
" - In order to achieve this, our trader could use a regressor.\n",
" - There's just one major problem: the number of inputs $x_{t-1}, \\ldots x_1$ varies with $t$. \n",
" - That is, it increases with the amount of data that we encounter.\n",
" - We need an approximation to make this computationally tractable. Two strategies:\n",
" - 1) Assume that the potentially rather long sequence $x_{t-1}, \\ldots x_1$ isn't really necessary. \n",
" - In this case we might content ourselves with some timespan $\\tau$ and only use $x_{t-1}, \\ldots x_{t-\\tau}$ observations. \n",
" - The number of arguments is always the same, at least for $t > \\tau$. \n",
" - Such models will be called ***autoregressive models***, as they quite literally perform regression on themselves.\n",
" - 2) Try and keep some summary $h_t$ of the past observations around and update that in addition to the actual prediction. \n",
" - This leads to models that estimate $x_t|x_{t-1}, h_{t-1}$ and moreover updates of the form $h_t = g(h_t, x_t)$. \n",
" - Since $h_t$ is never observed, these models are also called ***latent autoregressive models***. \n",
" - LSTMs and GRUs are exampes of this.\n",
" - Both cases raise the obvious question how to generate training data. \n",
" - One typically uses historical observations to predict the next observation given the ones up to right now.\n",
" - However, a common assumption is that while the specific values of $x_t$ might change, at least the dynamics of the time series itself won't. \n",
" - This is reasonable, since novel dynamics are not predictable using data we have so far. \n",
" - Statisticians call dynamics that don't change ***stationary***. \n",
" - Regardless of what we do, we will thus get an estimate of the entire time series via $$p(x_1, \\ldots x_T) = \\prod_{t=1}^T p(x_t|x_{t-1}, \\ldots x_1).$$\n",
" - Note that the above considerations still hold even if we deal with discrete objects, such as words, rather than numbers. \n",
" - The only difference is that in such a situation we need to use a classifier rather than a regressor to estimate $p(x_t| x_{t-1}, \\ldots x_1)$.\n",
" \n",
"- Markov Model\n",
" - In an autoregressive model, we use only $(x_{t-1}, \\ldots x_{t-\\tau})$ instead of $(x_{t-1}, \\ldots x_1)$ to estimate $x_t$.\n",
" - Whenever this approximation is accurate we say that the sequence satisfies a ***Markov condition***. \n",
" - In particular, if $\\tau = 1$, we have a first order Markov model and $p(x)$ is given by $$p(x_1, \\ldots x_T) = \\prod_{t=1}^T p(x_t|x_{t-1}).$$\n",
" - Such models are particularly nice whenever $x_t$ assumes only discrete values, since in this case dynamic programming can be used to compute values along the chain exactly. \n",
" - For instance, we can compute $x_{t+1}|x_{t-1}$ efficiently using the fact that we only need to take into account a very short history of past observations. $$p(x_{t+1}|x_{t-1}) = \\sum_{x_t} p(x_{t+1}|x_t) p(x_t|x_{t-1})$$\n",
"\n",
"- Causality\n",
" - In principle, there's nothing wrong with unfolding $p(x_1, \\ldots x_T)$ in reverse order. $$p(x_1, \\ldots x_T) = \\prod_{t=T}^1 p(x_t|x_{t+1}, \\ldots x_T).$$\n",
"\n",
" - In fact, if we have a Markov model we can obtain a reverse conditional probability distribution, too. \n",
" - In many cases, however, there exists a natural direction for the data, namely going forward in time. \n",
" - It is clear that future events cannot influence the past. \n",
" - If we change $x_t$, we may be able to influence what happens for $x_{t+1}$ going forward but not the converse. \n",
" - If we change $x_t$, the distribution over past events will not change. \n",
" - It ought to be easier to explain $x_{t+1}|x_t$ rather than $x_t|x_{t+1}$. \n",
" - For instance, Hoyer et al., 2008 show that in some cases we can find $x_{t+1} = f(x_t) + \\epsilon$ for some additive noise, whereas the converse is not true. \n",
" - This is great news, since it is typically the forward direction that we're interested in estimating. "
]
},
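{
"cell_type": "markdown",
"metadata": {},
"source": [
"- The two strategies above can be illustrated in a few lines of NumPy. Below is a minimal sketch, not code from the original text: the series, the window size $\\tau = 4$, and the noise level are all made up for illustration. Strategy 1 turns a noisy sine series into $(x_{t-\\tau}, \\ldots x_{t-1}) \\rightarrow x_t$ pairs and fits an ordinary least squares regressor on them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Hypothetical data: a sine wave plus Gaussian noise.\n",
"T, tau = 1000, 4\n",
"time = np.arange(T)\n",
"x = np.sin(0.01 * time) + 0.2 * np.random.normal(size=T)\n",
"\n",
"# Build (x_{t-tau}, ..., x_{t-1}) -> x_t training pairs.\n",
"features = np.stack([x[i:T - tau + i] for i in range(tau)], axis=1)\n",
"labels = x[tau:]\n",
"\n",
"# Ordinary least squares with a bias column as the regressor.\n",
"A = np.concatenate([features, np.ones((T - tau, 1))], axis=1)\n",
"w, *_ = np.linalg.lstsq(A, labels, rcond=None)\n",
"print('training MSE:', np.mean((A @ w - labels) ** 2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Strategy 2 is sketched below with a hand-picked $g$: an exponential moving average serves as the summary update $h_t = g(h_{t-1}, x_{t-1})$. Latent autoregressive models such as LSTMs and GRUs instead learn $g$, and the prediction from $h_t$, from data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical summary update h_t = g(h_{t-1}, x_{t-1}), reusing x and T\n",
"# from the cell above; alpha is a made-up smoothing constant.\n",
"alpha = 0.9\n",
"h = np.zeros(T)\n",
"for t in range(1, T):\n",
"    h[t] = alpha * h[t - 1] + (1 - alpha) * x[t - 1]\n",
"# A predictor would then estimate x_t from (x_{t-1}, h_{t-1})."
]
},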
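{
"cell_type": "markdown",
"metadata": {},
"source": [
"- For the first order Markov model, the dynamic programming identity $p(x_{t+1}|x_{t-1}) = \\sum_{x_t} p(x_{t+1}|x_t) p(x_t|x_{t-1})$ amounts to a matrix product. A minimal sketch, with a $3 \\times 3$ transition matrix made up for illustration:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# P[i, j] = p(x_{t+1} = j | x_t = i) for a made-up 3-state chain.\n",
"P = np.array([[0.7, 0.2, 0.1],\n",
"              [0.3, 0.4, 0.3],\n",
"              [0.2, 0.3, 0.5]])\n",
"\n",
"# Marginalizing out x_t is a matrix product: each row of P @ P is\n",
"# the two-step distribution p(x_{t+1} | x_{t-1}).\n",
"two_step = P @ P\n",
"print(two_step[0])  # p(x_{t+1} | x_{t-1} = state 0)\n",
"print(two_step.sum(axis=1))  # rows still sum to 1"
]
},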
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6.1.2 Toy Example\n",
"- Let’s begin by generating ‘time series’ data by using a sine function with some additive noise."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
"