{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "from nb_006b import *\n", "from collections import Counter" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Wikitext 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Download the dataset [here](https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip) and unzip it so that the token files are in the folder `data/wikitext`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "EOS = '<eos>'\n", "PATH=Path('data/wikitext')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Small helper function to read the tokens." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def read_file(filename):\n", "    tokens = []\n", "    with open(PATH/filename, encoding='utf8') as f:\n", "        for line in f:\n", "            tokens.append(line.split() + [EOS])\n", "    return np.array(tokens)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_tok = read_file('wiki.train.tokens')\n", "valid_tok = read_file('wiki.valid.tokens')\n", "test_tok = read_file('wiki.test.tokens')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(train_tok), len(valid_tok), len(test_tok)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "' '.join(train_tok[4][:20])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cnt = Counter(word for sent in train_tok for word in sent)\n", "cnt.most_common(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Give an id to each token and add the pad token (just in case we need it)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "itos = [o for o,c in cnt.most_common()]\n", "itos.insert(0,'<pad>')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vocab_size = len(itos); vocab_size" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create the mapping from token to id, then numericalize our datasets."
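 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before building the mapping, a quick look at the vocabulary (this check cell is an addition, not part of the original notebook): index 0 should now be the `<pad>` token we just inserted, followed by the most frequent tokens of the training set." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "itos[:10]"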
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Tokens that aren't in itos are mapped to id 5.\n", "stoi = collections.defaultdict(lambda : 5, {w:i for i,w in enumerate(itos)})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_ids = np.array([([stoi[w] for w in s]) for s in train_tok])\n", "valid_ids = np.array([([stoi[w] for w in s]) for s in valid_tok])\n", "test_ids = np.array([([stoi[w] for w in s]) for s in test_tok])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "class LanguageModelLoader():\n", "    \"Creates a dataloader with bptt slightly changing.\"\n", "\n", "    def __init__(self, nums:np.ndarray, bs:int=64, bptt:int=70, backwards:bool=False):\n", "        self.bs,self.bptt,self.backwards = bs,bptt,backwards\n", "        self.data = self.batchify(nums)\n", "        self.first,self.i,self.iter = True,0,0\n", "        self.n = len(self.data)\n", "\n", "    def __iter__(self):\n", "        self.i,self.iter = 0,0\n", "        while self.i < self.n-1 and self.iter < len(self):\n", "            #The first batch uses a longer sequence, then seq_len is drawn around bptt.\n", "            if self.first and self.i == 0: self.first,seq_len = False,self.bptt + 25\n", "            else:\n", "                bptt = self.bptt if np.random.random() < 0.95 else self.bptt / 2.\n", "                seq_len = max(5, int(np.random.normal(bptt, 5)))\n", "            res = self.get_batch(self.i, seq_len)\n", "            self.i += seq_len\n", "            self.iter += 1\n", "            yield res\n", "\n", "    def __len__(self) -> int: return (self.n-1) // self.bptt\n", "\n", "    def batchify(self, data:np.ndarray) -> LongTensor:\n", "        \"Splits the data in batches.\"\n", "        nb = data.shape[0] // self.bs\n", "        data = np.array(data[:nb*self.bs]).reshape(self.bs, -1).T\n", "        if self.backwards: data=data[::-1]\n", "        return LongTensor(data)\n", "\n", "    def get_batch(self, i:int, seq_len:int) -> LongTensor:\n", "        \"Gets a batch of length `seq_len`\"\n", "        seq_len = min(seq_len, len(self.data) - 1 - i)\n", "        return self.data[i:i+seq_len], self.data[i+1:i+1+seq_len].contiguous().view(-1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bs,bptt = 20,10\n", "train_dl = LanguageModelLoader(np.concatenate(train_ids), bs, bptt)\n", "valid_dl = LanguageModelLoader(np.concatenate(valid_ids), bs, bptt)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = DataBunch(train_dl, valid_dl)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Dropout" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to use the AWD-LSTM from [Stephen Merity et al.](https://arxiv.org/abs/1708.02182). First, we'll need all the different kinds of dropout. Dropout consists in replacing some coefficients by 0 with probability p. To ensure that the average of the weights remains constant, we apply a correction of a factor `1/(1-p)` to the weights that aren't nullified." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "def dropout_mask(x:Tensor, sz:Collection[int], p:float):\n", "    \"Returns a dropout mask of the same type as x, size sz, with probability p to cancel an element.\"\n", "    return x.new(*sz).bernoulli_(1-p).div_(1-p)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = torch.randn(10,10)\n", "dropout_mask(x, (10,10), 0.5)" ] }, 
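{ "cell_type": "markdown", "metadata": {}, "source": [ "A quick check of the rescaling described above (this check cell is an addition, not part of the original notebook): the surviving entries are scaled by `1/(1-p)`, so with `p=0.5` the mask should only contain 0s and 2s, roughly half the entries should be zeroed, and the mean should stay close to 1." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m = dropout_mask(x, (10,10), 0.5)\n", "m.mean(), (m==0.).float().mean()" ] }, 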
{ "cell_type": "markdown", "metadata": {}, "source": [ "Once we have a dropout mask `m`, applying the dropout to `x` is simply done by `x = x * m`. We create our own dropout mask instead of relying on pytorch's dropout because we want the mask to be consistent over the sequence dimension: the same coefficients are zeroed out for every word of a given sentence.\n", "\n", "Inside an RNN, a tensor `x` will have three dimensions: seq_len, bs, n_features, so we create a dropout mask for the last two dimensions and broadcast it over the first dimension." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "class RNNDropout(nn.Module):\n", "    \"Dropout that is consistent on the seq_len dimension.\"\n", "\n", "    def __init__(self, p:float=0.5):\n", "        super().__init__()\n", "        self.p=p\n", "\n", "    def forward(self, x:Tensor) -> Tensor:\n", "        if not self.training or self.p == 0.: return x\n", "        m = dropout_mask(x.data, (1, x.size(1), x.size(2)), self.p)\n", "        return x * m" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dp_test = RNNDropout(0.5)\n", "x = torch.randn(2,5,10)\n", "x, dp_test(x)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "import warnings\n", "\n", "class WeightDropout(nn.Module):\n", "    \"A module that wraps another layer in which some weights will be replaced by 0 during training.\"\n", "\n", "    def __init__(self, module:Model, weight_p:float, layer_names:Collection[str]=['weight_hh_l0']):\n", "        super().__init__()\n", "        self.module,self.weight_p,self.layer_names = module,weight_p,layer_names\n", "        for layer in self.layer_names:\n", "            #Makes a copy of the weights of the selected layers.\n", "            w = getattr(self.module, layer)\n", "            self.register_parameter(f'{layer}_raw', nn.Parameter(w.data))\n", "\n", "    def _setweights(self):\n", "        \"Applies dropout to the raw weights.\"\n", "        for layer in self.layer_names:\n", "            raw_w = getattr(self, f'{layer}_raw')\n", "            self.module._parameters[layer] = F.dropout(raw_w, p=self.weight_p, training=self.training)\n", "\n", "    def forward(self, *args:ArgStar):\n", "        self._setweights()\n", "        with warnings.catch_warnings():\n", "            #To avoid the warning that comes because the weights aren't flattened.\n", "            warnings.simplefilter(\"ignore\")\n", "            return self.module.forward(*args)\n", "\n", "    def reset(self):\n", "        if hasattr(self.module, 'reset'): self.module.reset()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "module = nn.LSTM(20, 20)\n", "dp_module = WeightDropout(module, 0.5)\n", "opt = optim.SGD(dp_module.parameters(), 10)\n", "dp_module.train()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = torch.randn(2,5,20)\n", "x.requires_grad_(requires_grad=True)\n", "h = (torch.zeros(1,5,20), torch.zeros(1,5,20))\n", "for _ in range(5): x,h = dp_module(x,h)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "getattr(dp_module.module, 'weight_hh_l0'),getattr(dp_module,'weight_hh_l0_raw')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Any dummy loss will do here: we just want gradients to flow back to the raw weights.\n", "target = torch.randint(0,20,(10,)).long()\n", "loss = F.nll_loss(x.view(-1,20), target)\n", "loss.backward()\n", "opt.step()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "w, w_raw = getattr(dp_module.module, 'weight_hh_l0'),getattr(dp_module,'weight_hh_l0_raw')\n", "w.grad, w_raw.grad" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "getattr(dp_module.module, 'weight_hh_l0'),getattr(dp_module,'weight_hh_l0_raw')" ] }, 
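{ "cell_type": "markdown", "metadata": {}, "source": [ "In eval mode the weight dropout should be a no-op (this check cell is an addition, not part of the original notebook): after a forward pass, the weights seen by the wrapped LSTM should be identical to the raw copy." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dp_module.eval()\n", "x_tst = torch.randn(2,5,20)\n", "h_tst = (torch.zeros(1,5,20), torch.zeros(1,5,20))\n", "_ = dp_module(x_tst, h_tst)\n", "torch.equal(getattr(dp_module.module, 'weight_hh_l0'), getattr(dp_module, 'weight_hh_l0_raw'))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "class 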
EmbeddingDropout(nn.Module):\n", " \"Applies dropout in the embedding layer by zeroing out some elements of the embedding vector.\"\n", " \n", " def __init__(self, emb:Model, embed_p:float):\n", " super().__init__()\n", " self.emb,self.embed_p = emb,embed_p\n", " self.pad_idx = self.emb.padding_idx\n", " if self.pad_idx is None: self.pad_idx = -1\n", "\n", " def forward(self, words:LongTensor, scale:Optional[float]=None) -> Tensor:\n", " if self.training and self.embed_p != 0:\n", " size = (self.emb.weight.size(0),1)\n", " mask = dropout_mask(self.emb.weight.data, size, self.embed_p)\n", " masked_embed = self.emb.weight * mask\n", " else: masked_embed = self.emb.weight\n", " if scale: masked_embed.mul_(scale)\n", " return F.embedding(words, masked_embed, self.pad_idx, self.emb.max_norm,\n", " self.emb.norm_type, self.emb.scale_grad_by_freq, self.emb.sparse)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "enc = nn.Embedding(100,20, padding_idx=0)\n", "enc_dp = EmbeddingDropout(enc, 0.5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = torch.randint(0,100,(25,)).long()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "enc_dp(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. AWD-LSTM" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "def repackage_var(h:Tensors) -> Tensors:\n", " \"Detaches h from its history.\"\n", " return h.detach() if type(h) == torch.Tensor else tuple(repackage_var(v) for v in h)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "class RNNCore(nn.Module):\n", " \"AWD-LSTM/QRNN inspired by https://arxiv.org/abs/1708.02182\"\n", "\n", " initrange=0.1\n", "\n", " def __init__(self, vocab_sz:int, emb_sz:int, n_hid:int, n_layers:int, pad_token:int, bidir:bool=False,\n", " hidden_p:float=0.2, input_p:float=0.6, embed_p:float=0.1, weight_p:float=0.5, qrnn:bool=False):\n", " \n", " super().__init__()\n", " self.bs,self.qrnn,self.ndir = 1, qrnn,(2 if bidir else 1)\n", " self.emb_sz,self.n_hid,self.n_layers = emb_sz,n_hid,n_layers\n", " self.encoder = nn.Embedding(vocab_sz, emb_sz, padding_idx=pad_token)\n", " self.encoder_dp = EmbeddingDropout(self.encoder, embed_p)\n", " if self.qrnn:\n", " #Using QRNN requires cupy: https://github.com/cupy/cupy\n", " from qrnn import QRNNLayer\n", " self.rnns = [QRNNLayer(emb_sz if l == 0 else n_hid, (n_hid if l != n_layers - 1 else emb_sz)//self.ndir,\n", " save_prev_x=True, zoneout=0, window=2 if l == 0 else 1, output_gate=True, \n", " use_cuda=torch.cuda.is_available()) for l in range(n_layers)]\n", " if weight_p != 0.:\n", " for rnn in self.rnns:\n", " rnn.linear = WeightDropout(rnn.linear, weight_p, layer_names=['weight'])\n", " else:\n", " self.rnns = [nn.LSTM(emb_sz if l == 0 else n_hid, (n_hid if l != n_layers - 1 else emb_sz)//self.ndir,\n", " 1, bidirectional=bidir) for l in range(n_layers)]\n", " if weight_p != 0.: self.rnns = [WeightDropout(rnn, weight_p) for rnn in self.rnns]\n", " self.rnns = torch.nn.ModuleList(self.rnns)\n", " self.encoder.weight.data.uniform_(-self.initrange, self.initrange)\n", " self.input_dp = RNNDropout(input_p)\n", " self.hidden_dps = nn.ModuleList([RNNDropout(hidden_p) for l in range(n_layers)])\n", "\n", " def forward(self, input:LongTensor) -> Tuple[Tensor,Tensor]:\n", " sl,bs = input.size()\n", " if 
bs!=self.bs:\n", " self.bs=bs\n", " self.reset()\n", " raw_output = self.input_dp(self.encoder_dp(input))\n", " new_hidden,raw_outputs,outputs = [],[],[]\n", " for l, (rnn,hid_dp) in enumerate(zip(self.rnns, self.hidden_dps)):\n", " raw_output, new_h = rnn(raw_output, self.hidden[l])\n", " new_hidden.append(new_h)\n", " raw_outputs.append(raw_output)\n", " if l != self.n_layers - 1: raw_output = hid_dp(raw_output)\n", " outputs.append(raw_output)\n", " self.hidden = repackage_var(new_hidden)\n", " return raw_outputs, outputs\n", "\n", " def one_hidden(self, l:int) -> Tensor:\n", " \"Returns one hidden state\"\n", " nh = (self.n_hid if l != self.n_layers - 1 else self.emb_sz)//self.ndir\n", " return self.weights.new(self.ndir, self.bs, nh).zero_()\n", "\n", " def reset(self):\n", " \"Resets the hidden states\"\n", " [r.reset() for r in self.rnns if hasattr(r, 'reset')]\n", " self.weights = next(self.parameters()).data\n", " if self.qrnn: self.hidden = [self.one_hidden(l) for l in range(self.n_layers)]\n", " else: self.hidden = [(self.one_hidden(l), self.one_hidden(l)) for l in range(self.n_layers)]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "class LinearDecoder(nn.Module):\n", " \"To go on top of a RNN_Core module\"\n", " \n", " initrange=0.1\n", " \n", " def __init__(self, n_out:int, n_hid:int, output_p:float, tie_encoder:Model=None, bias:bool=True):\n", " super().__init__()\n", " self.decoder = nn.Linear(n_hid, n_out, bias=bias)\n", " self.decoder.weight.data.uniform_(-self.initrange, self.initrange)\n", " self.output_dp = RNNDropout(output_p)\n", " if bias: self.decoder.bias.data.zero_()\n", " if tie_encoder: self.decoder.weight = tie_encoder.weight\n", "\n", " def forward(self, input:Tuple[Tensor,Tensor]) -> Tuple[Tensor,Tensor,Tensor]:\n", " raw_outputs, outputs = input\n", " output = self.output_dp(outputs[-1])\n", " decoded = self.decoder(output.view(output.size(0)*output.size(1), output.size(2)))\n", " return decoded, raw_outputs, outputs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "class SequentialRNN(nn.Sequential):\n", " \"A sequential module that passes the reset call to its children.\"\n", " def reset(self):\n", " for c in self.children():\n", " if hasattr(c, 'reset'): c.reset()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "def get_language_model(vocab_sz:int, emb_sz:int, n_hid:int, n_layers:int, pad_token:int, tie_weights:bool=True, \n", " qrnn:bool=False, bias:bool=True, output_p:float=0.4, hidden_p:float=0.2, input_p:float=0.6, \n", " embed_p:float=0.1, weight_p:float=0.5) -> Model:\n", " \"To create a full AWD-LSTM\"\n", " rnn_enc = RNNCore(vocab_sz, emb_sz, n_hid=n_hid, n_layers=n_layers, pad_token=pad_token, qrnn=qrnn,\n", " hidden_p=hidden_p, input_p=input_p, embed_p=embed_p, weight_p=weight_p)\n", " enc = rnn_enc.encoder if tie_weights else None\n", " return SequentialRNN(rnn_enc, LinearDecoder(vocab_sz, emb_sz, output_p, tie_encoder=enc, bias=bias))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tst_model = get_language_model(500, 20, 100, 2, 0, qrnn=True)\n", "tst_model.cuda()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = torch.randint(0, 500, (10,5)).long()\n", "z = tst_model(x.cuda())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": 
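[], "source": [ "#Added check (not part of the original notebook): the model outputs a tuple (decoded, raw_outputs, outputs);\n", "#'decoded' flattens the (sl, bs) dimensions, so for the (10,5) batch above it should have shape (50, 500).\n", "z[0].shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": 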
[], "source": [ "len(z)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Callback to train the model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "@dataclass\n", "class GradientClipping(Callback):\n", " \"To do gradient clipping during training.\"\n", " learn:Learner\n", " clip:float\n", "\n", " def on_backward_end(self, **kwargs):\n", " if self.clip: nn.utils.clip_grad_norm_(self.learn.model.parameters(), self.clip)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export\n", "@dataclass\n", "class RNNTrainer(Callback):\n", " \"`Callback` that regroups lr adjustment to seq_len, AR and TAR\"\n", " learn:Learner\n", " bptt:int\n", " alpha:float=0.\n", " beta:float=0.\n", " adjust:bool=True\n", " \n", " def on_loss_begin(self, last_output:Tuple[Tensor,Tensor,Tensor], **kwargs):\n", " #Save the extra outputs for later and only returns the true output.\n", " self.raw_out,self.out = last_output[1],last_output[2]\n", " return last_output[0]\n", " \n", " def on_backward_begin(self, last_loss:Rank0Tensor, last_input:Tensor, last_output:Tensor, **kwargs):\n", " #Adjusts the lr to the bptt selected\n", " if self.adjust: self.learn.opt.lr *= last_input.size(0) / self.bptt\n", " #AR and TAR\n", " if self.alpha != 0.: last_loss += (self.alpha * self.out[-1].pow(2).mean()).sum()\n", " if self.beta != 0.:\n", " h = self.raw_out[-1]\n", " if len(h)>1: last_loss += (self.beta * (h[1:] - h[:-1]).pow(2).mean()).sum()\n", " return last_loss" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "emb_sz, nh, nl = 400, 1150, 3\n", "model = get_language_model(vocab_size, emb_sz, nh, nl, 0, input_p=0.6, output_p=0.4, weight_p=0.5, \n", " embed_p=0.1, hidden_p=0.2)\n", "learn = Learner(data, model)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "learn.opt_fn = partial(optim.Adam, betas=(0.8,0.99))\n", "learn.callbacks.append(RNNTrainer(learn, bptt, alpha=2, beta=1))\n", "learn.callback_fns = [partial(GradientClipping, clip=0.12)]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fit_one_cycle(learn, 1, 5e-3, (0.8,0.7), wd=1.2e-6)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 2 }