{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 6.1 Sequence Models"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- People’s opinions on movies can change quite significantly over time. In fact, psychologists even have names for some of the effects:\n",
" - ***Anchoring*** based on someone else’s opinion.\n",
" - For instance after the Oscara wards, ratings for the corresponding movie go up, even though it’s still the same movie. This effect persists for a few months until the award is forgotten. \n",
" - ***Hedonic adaptation*** where humans quickly adapt to accept an improved (or a bad) situation as the new normal. - For instance, after watching many good movies, the expectations that the next movie be equally good or better are high, and even an average movie might be considered a bad movie after many great ones.\n",
" - ***Seasonality***\n",
" - Very few viewers like to watch a Santa Claus movie in August.\n",
" - In some cases movies become unpopular due to the misbehaviors of directors or actors in the production.\n",
" - Some movies become cult movies, because they were almost comically bad. \n",
"- Other examples\n",
" - Many users have highly particular behavior when it comes to the time when they open apps. \n",
" - For instance, social media apps are much more popular after school with students. \n",
" - Stock market trading apps are more commonly used when the markets are open.\n",
" - It is much harder to predict tomorrow's stock prices than to fill in the blanks for a stock price we missed yesterday\n",
" - In statistics the former is called prediction whereas the latter is called filtering.\n",
" - After all, hindsight is so much easier than foresight. \n",
" - Music, speech, text, movies, steps, etc. are all sequential in nature. \n",
" - If we were to permute them they would make little sense. \n",
" - The headline dog bites man is much less surprising than man bites dog, even though the words are identical.\n",
" - Earthquakes are strongly correlated, i.e. after a massive earthquake there are very likely several smaller aftershocks, much more so than without the strong quake. \n",
" - In fact, earthquakes are spatiotemporally correlated, i.e. the aftershocks typically occur within a short time span and in close proximity.\n",
" - Humans interact with each other in a sequential nature, as can be seen in Twitter fights, dance patterns and debates.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6.1.1 Statistical Tools\n",
"\n",
"\n",
"- Let's denote the prices by $x_t \\geq 0$, i.e. at time $t \\in \\mathbb{N}$ we observe some price $x_t$. \n",
"- For a trader to do well in the stock market on day $t$ he should want to predict $x_t$ via $$x_t \\sim p(x_t|x_{t-1}, \\ldots x_1).$$\n",
"- Autoregressive Models\n",
" - In order to achieve this, our trader could use a regressor.\n",
" - There's just a major problem - the number of inputs, $x_{t-1}, \\ldots x_1$ varies, depending on $t$. \n",
" - That is, the number increases with the amount of data that we encounter\n",
" - We need an approximation to make this computationally tractable. Two strategies:\n",
" - 1) Assume that the potentially rather long sequence $x_{t-1}, \\ldots x_1$ isn't really necessary. \n",
" - In this case we might content ourselves with some timespan $\\tau$ and only use $x_{t-1}, \\ldots x_{t-\\tau}$ observations. \n",
" - The number of arguments is always the same, at least for $t > \\tau$. \n",
" - Such models will be called ***autoregressive models***, as they quite literally perform regression on themselves.\n",
" - 2) Try and keep some summary $h_t$ of the past observations around and update that in addition to the actual prediction. \n",
" - This leads to models that estimate $x_t|x_{t-1}, h_{t-1}$ and moreover updates of the form $h_t = g(h_t, x_t)$. \n",
" - Since $h_t$ is never observed, these models are also called ***latent autoregressive models***. \n",
" - LSTMs and GRUs are exampes of this.\n",
" - Both cases raise the obvious question how to generate training data. \n",
" - One typically uses historical observations to predict the next observation given the ones up to right now.\n",
" - However, a common assumption is that while the specific values of $x_t$ might change, at least the dynamics of the time series itself won't. \n",
" - This is reasonable, since novel dynamics are not predictable using data we have so far. \n",
" - Statisticians call dynamics that don't change ***stationary***. \n",
" - Regardless of what we do, we will thus get an estimate of the entire time series via $$p(x_1, \\ldots x_T) = \\prod_{t=1}^T p(x_t|x_{t-1}, \\ldots x_1).$$\n",
" - Note that the above considerations still hold even if we deal with discrete objects, such as words, rather than numbers. \n",
" - The only difference is that in such a situation we need to use a classifier rather than a regressor to estimate $p(x_t| x_{t-1}, \\ldots x_1)$.\n",
" \n",
"- Markov Model\n",
" - In an autoregressive model, we use only $(x_{t-1}, \\ldots x_{t-\\tau})$ instead of $(x_{t-1}, \\ldots x_1)$ to estimate $x_t$.\n",
" - Whenever this approximation is accurate we say that the sequence satisfies a ***Markov condition***. \n",
" - In particular, if $\\tau = 1$, we have a first order Markov model and $p(x)$ is given by $$p(x_1, \\ldots x_T) = \\prod_{t=1}^T p(x_t|x_{t-1}).$$\n",
" - Such models are particularly nice whenever $x_t$ assumes only discrete values, since in this case dynamic programming can be used to compute values along the chain exactly. \n",
" - For instance, we can compute $x_{t+1}|x_{t-1}$ efficiently using the fact that we only need to take into account a very short history of past observations. $$p(x_{t+1}|x_{t-1}) = \\sum_{x_t} p(x_{t+1}|x_t) p(x_t|x_{t-1})$$\n",
"\n",
"- Causality\n",
" - In principle, there's nothing wrong with unfolding $p(x_1, \\ldots x_T)$ in reverse order. $$p(x_1, \\ldots x_T) = \\prod_{t=T}^1 p(x_t|x_{t+1}, \\ldots x_T).$$\n",
"\n",
" - In fact, if we have a Markov model we can obtain a reverse conditional probability distribution, too. \n",
" - In many cases, however, there exists a natural direction for the data, namely going forward in time. \n",
" - It is clear that future events cannot influence the past. \n",
" - If we change $x_t$, we may be able to influence what happens for $x_{t+1}$ going forward but not the converse. \n",
" - If we change $x_t$, the distribution over past events will not change. \n",
" - It ought to be easier to explain $x_{t+1}|x_t$ rather than $x_t|x_{t+1}$. \n",
" - For instance, Hoyer et al., 2008 show that in some cases we can find $x_{t+1} = f(x_t) + \\epsilon$ for some additive noise, whereas the converse is not true. \n",
" - This is great news, since it is typically the forward direction that we're interested in estimating. "
]
},
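{
"cell_type": "markdown",
"metadata": {},
"source": [
"- As a minimal sketch of the dynamic-programming identity above (using a small, hypothetical transition matrix that is not part of the text): for a discrete first-order Markov chain with transition matrix $P$, where $P_{ij} = p(x_{t+1} = j|x_t = i)$, marginalizing out $x_t$ turns the two-step conditional $p(x_{t+1}|x_{t-1})$ into the matrix product $P P$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from mxnet import nd\n",
"\n",
"# hypothetical 2-state first-order Markov chain: P[i, j] = p(x_{t+1} = j | x_t = i)\n",
"P = nd.array([[0.9, 0.1],\n",
"              [0.3, 0.7]])\n",
"\n",
"# p(x_{t+1} | x_{t-1}) = sum over x_t of p(x_{t+1} | x_t) p(x_t | x_{t-1}),\n",
"# which for the whole table is just the matrix product of P with itself\n",
"P2 = nd.dot(P, P)\n",
"print(P2)"
]
},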
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6.1.2 Toy Example\n",
"- Let’s begin by generating ‘time series’ data by using a sine function with some additive noise."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"from mxnet import autograd, nd, gluon, init\n",
"import gluonbook as gb\n",
"# display routines\n",
"%matplotlib inline\n",
"from matplotlib import pyplot as plt\n",
"from IPython import display\n",
"display.set_matplotlib_formats('svg')\n",
"\n",
"embedding = 4 # embedding dimension for autoregressive model\n",
"\n",
"T = 1000 # generate a total of 1000 points \n",
"time = nd.arange(0,T)\n",
"x = nd.sin(0.01 * time) + 0.2 * nd.random.normal(shape=(T))\n",
"\n",
"plt.plot(time.asnumpy(), x.asnumpy());"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Next we need to turn this 'time series' into data the network can train on. \n",
"- Based on the embedding dimension $\\tau$ we map the data into pairs $y_t = x_t$ and $\\mathbf{z}_t = (x_{t-1}, \\ldots x_{t-\\tau})$. \n",
"- The astute reader might have noticed that this gives us $\\tau$ fewer datapoints, since we don't have sufficient history for the first $\\tau$ of them. \n",
" - A simple fix is to discard those few terms. \n",
" - Alternatively we could pad the time series with zeros. "
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"features = nd.zeros((T-embedding, embedding)) # (1000 - 4, 4) = (996, 4)\n",
"\n",
"# features[:, 0] = x[0:996]\n",
"# features[:, 1] = x[1:997]\n",
"# features[:, 2] = x[2:998]\n",
"# features[:, 3] = x[3:999]\n",
"for i in range(embedding):\n",
" features[:, i] = x[i:T - embedding + i]\n",
"\n",
"# labels = x[4:]\n",
"labels = x[embedding:]\n",
"\n",
"ntrain = 600\n",
"train_data = gluon.data.ArrayDataset(features[:ntrain,:], labels[:ntrain])\n",
"test_data = gluon.data.ArrayDataset(features[ntrain:,:], labels[ntrain:])\n",
"\n",
"# vanilla MLP architecture\n",
"def get_net():\n",
" net = gluon.nn.Sequential()\n",
" net.add(gluon.nn.Dense(10, activation='relu'))\n",
" net.add(gluon.nn.Dense(10, activation='relu'))\n",
" net.add(gluon.nn.Dense(1))\n",
" net.initialize(init=init.Xavier(), force_reinit=True)\n",
" return net\n",
"\n",
"# least mean squares loss\n",
"loss = gluon.loss.L2Loss()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 1, loss: 0.026710\n",
"epoch 2, loss: 0.025081\n",
"epoch 3, loss: 0.025592\n",
"epoch 4, loss: 0.026057\n",
"epoch 5, loss: 0.027615\n",
"epoch 6, loss: 0.024617\n",
"epoch 7, loss: 0.023896\n",
"epoch 8, loss: 0.024280\n",
"epoch 9, loss: 0.024480\n",
"epoch 10, loss: 0.026319\n",
"test loss: 0.028010\n"
]
}
],
"source": [
"# simple optimizer using adam, random shuffle and minibatch size 16\n",
"def train_net(net, data, loss, epochs, learning_rate):\n",
" batch_size = 16\n",
" trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': learning_rate})\n",
" data_iter = gluon.data.DataLoader(data, batch_size, shuffle=True)\n",
"\n",
" for epoch in range(1, epochs + 1):\n",
" for X, y in data_iter:\n",
" with autograd.record():\n",
" l = loss(net(X), y)\n",
" l.backward()\n",
" trainer.step(batch_size)\n",
" l = loss(net(data[:][0]), nd.array(data[:][1]))\n",
" print('epoch %d, loss: %f' % (epoch, l.mean().asnumpy()))\n",
" return net\n",
"\n",
"net = get_net()\n",
"net = train_net(\n",
" net=net, \n",
" data=train_data, \n",
" loss=loss, \n",
" epochs=10, \n",
" learning_rate=0.01\n",
")\n",
"\n",
"l = loss(net(test_data[:][0]), nd.array(test_data[:][1]))\n",
"print('test loss: %f' % l.mean().asnumpy())"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"estimates = net(features)\n",
"plt.plot(time.asnumpy(), x.asnumpy(), label='data');\n",
"plt.plot(time[embedding:].asnumpy(), estimates.asnumpy(), label='estimate');\n",
"plt.legend();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6.1.3 Predictions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- If we observe data only until time step 600, we cannot hope to receive the ground truth for all future predictions. \n",
"- Instead, we need to work our way forward one step at a time:\n",
"\n",
"$$\\begin{aligned} x_{601} & = f(x_{600}, \\ldots, x_{597}) \\\\ x_{602} & = f(x_{601}, \\ldots, x_{598}) \\\\ x_{603} & = f(x_{602}, \\ldots, x_{599}) \\end{aligned}$$\n",
"\n",
"- In other words, very quickly will we have to use our own predictions to make future predictions."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(996, 1)"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"estimates.shape"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"predictions = nd.zeros_like(estimates)\n",
"\n",
"# ntrain - embedding = 600 - 4 = 596\n",
"predictions[:(ntrain-embedding)] = estimates[:(ntrain-embedding)]\n",
"\n",
"# T - embedding = 996\n",
"for i in range(ntrain-embedding, T-embedding):\n",
" predictions[i] = net(predictions[(i-embedding):i].reshape(1,-1)).reshape(1)\n",
" \n",
"plt.plot(time.asnumpy(), x.asnumpy(), label='data');\n",
"plt.plot(time[embedding:].asnumpy(), estimates.asnumpy(), label='estimate');\n",
"plt.plot(time[embedding:].asnumpy(), predictions.asnumpy(), label='multistep');\n",
"plt.legend();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- This is ultimately due to the fact that errors build up. \n",
"- Let's say that after step 1 we have some error $\\epsilon_1 = \\bar\\epsilon$. \n",
"- Now the input for step 2 is perturbed by $\\epsilon_1$, hence we suffer some error in the order of $\\epsilon_2 = \\bar\\epsilon + L \\epsilon_1$, and so on. \n",
"- The error can diverge rather rapidly from the true observations. \n",
"- This is a common phenomenon - for instance weather forecasts for the next 24 hours tend to be pretty accurate but beyond that their accuracy declines rapidly.\n",
"- Let’s verify this observation by computing the k-step predictions on the entire sequence."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"k = 33 # look up to k - embedding steps ahead\n",
"\n",
"# T-k = 1000-33 = 967\n",
"features = nd.zeros((T-k, k))\n",
"\n",
"# features[:, 0] = x[0:967]\n",
"# features[:, 1] = x[1:968]\n",
"# features[:, 2] = x[2:969]\n",
"# features[:, 3] = x[3:970]\n",
"for i in range(embedding):\n",
" features[:,i] = x[i:T-k+i]\n",
"\n",
"# features[:, 4] = net(features[:, 0:4]).reshape((-1))\n",
"# features[:, 5] = net(features[:, 1:5]).reshape((-1))\n",
"# ...\n",
"# features[:, 32] = net(features[:, 28:32]).reshape((-1))\n",
"for i in range(embedding, k):\n",
" features[:,i] = net(features[:,(i-embedding):i]).reshape((-1))\n",
" \n",
"for i in (4, 8, 16, 32): \n",
" plt.plot(time[i:T-k+i].asnumpy(), features[:,i].asnumpy(), label=('step ' + str(i)))\n",
"plt.legend();"
]
},
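{
"cell_type": "markdown",
"metadata": {},
"source": [
"- As a rough numeric sketch of the error recursion $\\epsilon_{k+1} = \\bar\\epsilon + L \\epsilon_k$ discussed above (the one-step error $\\bar\\epsilon$ and sensitivity $L$ below are hypothetical values, not estimated from the model): whenever $L > 1$ the bound grows roughly geometrically with the prediction horizon $k$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# hypothetical one-step error and sensitivity constant, purely for illustration\n",
"eps_bar, L = 0.05, 1.5\n",
"\n",
"# iterate eps_{k+1} = eps_bar + L * eps_k, starting from eps_1 = eps_bar\n",
"eps = eps_bar\n",
"for k in range(1, 11):\n",
"    print('k = %2d, error bound = %.3f' % (k, eps))\n",
"    eps = eps_bar + L * eps"
]
},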
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary\n",
"- Sequence models require specialized statistical tools for estimation. \n",
" - Two popular choices are autoregressive models and latent-variable autoregressive models.\n",
"- As we predict further in time, the errors accumulate and the quality of the estimates degrades, often dramatically.\n",
"- There’s quite a difference in difficulty between filling in the blanks in a sequence (smoothing) and forecasting. \n",
" - Consequently, if you have a time series, always respect the temporal order of the data when training, i.e. never train on future data.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 6.2 Language Models"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}