{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# ULMFit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2\n",
    "\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#export\n",
    "from exp.nb_12a import *"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We load the data from 12a, instructions to create that file are there if you don't have it yet so go ahead and see."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Jump_to lesson 12 video](https://course19.fast.ai/videos/?lesson=12&t=7459)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "path = datasets.untar_data(datasets.URLs.IMDB)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ll = pickle.load(open(path/'ll_lm.pkl', 'rb'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bs,bptt = 128,70\n",
    "data = lm_databunchify(ll, bs, bptt)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "vocab = ll.train.proc_x[1].vocab"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Finetuning the LM"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before tackling the classification task, we have to finetune our language model to the IMDB corpus."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have pretrained a small model on [wikitext 103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) that you can download by uncommenting the following cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ! wget http://files.fast.ai/models/wt103_tiny.tgz -P {path}\n",
    "# ! tar xf {path}/wt103_tiny.tgz -C {path}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dps = tensor([0.1, 0.15, 0.25, 0.02, 0.2]) * 0.5\n",
    "tok_pad = vocab.index(PAD)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "emb_sz, nh, nl = 300, 300, 2\n",
    "model = get_language_model(len(vocab), emb_sz, nh, nl, tok_pad, *dps)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "old_wgts  = torch.load(path/'pretrained'/'pretrained.pth')\n",
    "old_vocab = pickle.load(open(path/'pretrained'/'vocab.pkl', 'rb'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In our current vocabulary, it is very unlikely that the ids correspond to what is in the vocabulary used to train the pretrain model. The tokens are sorted by frequency (apart from the special tokens that are all first) so that order is specific to the corpus used. For instance, the word 'house' has different ids in the our current vocab and the pretrained one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "idx_house_new, idx_house_old = vocab.index('house'),old_vocab.index('house')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We somehow need to match our pretrained weights to the new vocabulary. This is done on the embeddings and the decoder (since the weights between embeddings and decoders are tied) by putting the rows of the embedding matrix (or decoder bias) in the right order.\n",
    "\n",
    "It may also happen that we have words that aren't in the pretrained vocab, in this case, we put the mean of the pretrained embedding weights/decoder bias."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "house_wgt  = old_wgts['0.emb.weight'][idx_house_old]\n",
    "house_bias = old_wgts['1.decoder.bias'][idx_house_old] "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def match_embeds(old_wgts, old_vocab, new_vocab):\n",
    "    wgts = old_wgts['0.emb.weight']\n",
    "    bias = old_wgts['1.decoder.bias']\n",
    "    wgts_m,bias_m = wgts.mean(dim=0),bias.mean()\n",
    "    new_wgts = wgts.new_zeros(len(new_vocab), wgts.size(1))\n",
    "    new_bias = bias.new_zeros(len(new_vocab))\n",
    "    otoi = {v:k for k,v in enumerate(old_vocab)}\n",
    "    for i,w in enumerate(new_vocab): \n",
    "        if w in otoi:\n",
    "            idx = otoi[w]\n",
    "            new_wgts[i],new_bias[i] = wgts[idx],bias[idx]\n",
    "        else: new_wgts[i],new_bias[i] = wgts_m,bias_m\n",
    "    old_wgts['0.emb.weight']    = new_wgts\n",
    "    old_wgts['0.emb_dp.emb.weight'] = new_wgts\n",
    "    old_wgts['1.decoder.weight']    = new_wgts\n",
    "    old_wgts['1.decoder.bias']      = new_bias\n",
    "    return old_wgts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "wgts = match_embeds(old_wgts, old_vocab, vocab)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's check that the word \"*house*\" was properly converted."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_near(wgts['0.emb.weight'][idx_house_new],house_wgt)\n",
    "test_near(wgts['1.decoder.bias'][idx_house_new],house_bias)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can load the pretrained weights in our model before beginning training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.load_state_dict(wgts)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we want to apply discriminative learning rates, we need to split our model in different layer groups. Let's have a look at our model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "SequentialRNN(\n",
       "  (0): AWD_LSTM(\n",
       "    (emb): Embedding(60003, 300, padding_idx=2)\n",
       "    (emb_dp): EmbeddingDropout(\n",
       "      (emb): Embedding(60003, 300, padding_idx=2)\n",
       "    )\n",
       "    (rnns): ModuleList(\n",
       "      (0): WeightDropout(\n",
       "        (module): LSTM(300, 300, batch_first=True)\n",
       "      )\n",
       "      (1): WeightDropout(\n",
       "        (module): LSTM(300, 300, batch_first=True)\n",
       "      )\n",
       "    )\n",
       "    (input_dp): RNNDropout()\n",
       "    (hidden_dps): ModuleList(\n",
       "      (0): RNNDropout()\n",
       "      (1): RNNDropout()\n",
       "    )\n",
       "  )\n",
       "  (1): LinearDecoder(\n",
       "    (output_dp): RNNDropout()\n",
       "    (decoder): Linear(in_features=300, out_features=60003, bias=True)\n",
       "  )\n",
       ")"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then we split by doing two groups for each rnn/corresponding dropout, then one last group that contains the embeddings/decoder. This is the one that needs to be trained the most as we may have new embeddings vectors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def lm_splitter(m):\n",
    "    groups = []\n",
    "    for i in range(len(m[0].rnns)): groups.append(nn.Sequential(m[0].rnns[i], m[0].hidden_dps[i]))\n",
    "    groups += [nn.Sequential(m[0].emb, m[0].emb_dp, m[0].input_dp, m[1])]\n",
    "    return [list(o.parameters()) for o in groups]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First we train with the RNNs freezed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for rnn in model[0].rnns:\n",
    "    for p in rnn.parameters(): p.requires_grad_(False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cbs = [partial(AvgStatsCallback,accuracy_flat),\n",
    "       CudaCallback, Recorder,\n",
    "       partial(GradientClipping, clip=0.1),\n",
    "       partial(RNNTrainer, α=2., β=1.),\n",
    "       ProgressCallback]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "learn = Learner(model, data, cross_entropy_flat, opt_func=adam_opt(),\n",
    "                cb_funcs=cbs, splitter=lm_splitter)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lr = 2e-2\n",
    "cbsched = sched_1cycle([lr], pct_start=0.5, mom_start=0.8, mom_mid=0.7, mom_end=0.8)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "learn.fit(1, cbs=cbsched)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then the whole model with discriminative learning rates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for rnn in model[0].rnns:\n",
    "    for p in rnn.parameters(): p.requires_grad_(True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lr = 2e-3\n",
    "cbsched = sched_1cycle([lr/2., lr/2., lr], pct_start=0.5, mom_start=0.8, mom_mid=0.7, mom_end=0.8)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "learn.fit(10, cbs=cbsched)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We only need to save the encoder (first part of the model) for the classification, as well as the vocabulary used (we will need to use the same in the classification task)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "torch.save(learn.model[0].state_dict(), path/'finetuned_enc.pth')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pickle.dump(vocab, open(path/'vocab_lm.pkl', 'wb'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "torch.save(learn.model.state_dict(), path/'finetuned.pth')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Classifier"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have to process the data again otherwise pickle will complain. We also have to use the same vocab as the language model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Jump_to lesson 12 video](https://course19.fast.ai/videos/?lesson=12&t=7554)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "vocab = pickle.load(open(path/'vocab_lm.pkl', 'rb'))\n",
    "proc_tok,proc_num,proc_cat = TokenizeProcessor(),NumericalizeProcessor(vocab=vocab),CategoryProcessor()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "il = TextList.from_files(path, include=['train', 'test'])\n",
    "sd = SplitData.split_by_func(il, partial(grandparent_splitter, valid_name='test'))\n",
    "ll = label_by_func(sd, parent_labeler, proc_x = [proc_tok, proc_num], proc_y=proc_cat)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pickle.dump(ll, open(path/'ll_clas.pkl', 'wb'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ll = pickle.load(open(path/'ll_clas.pkl', 'rb'))\n",
    "vocab = pickle.load(open(path/'vocab_lm.pkl', 'rb'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bs,bptt = 64,70\n",
    "data = clas_databunchify(ll, bs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Ignore padding"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will those two utility functions from PyTorch to ignore the padding in the inputs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#export\n",
    "from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see how this works: first we grab a batch of the training set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x,y = next(iter(data.train_dl))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x.size()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We need to pass to the utility functions the lengths of our sentences because it's applied after the embedding, so we can't see the padding anymore."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lengths = x.size(1) - (x == 1).sum(1)\n",
    "lengths[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tst_emb = nn.Embedding(len(vocab), 300)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tst_emb(x).shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "128*70"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We create a `PackedSequence` object that contains all of our unpadded sequences"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "packed = pack_padded_sequence(tst_emb(x), lengths, batch_first=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "packed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "packed.data.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(packed.batch_sizes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "8960//70"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This object can be passed to any RNN directly while retaining the speed of CuDNN."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tst = nn.LSTM(300, 300, 2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y,h = tst(packed)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then we can unpad it with the following function for other modules:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "unpack = pad_packed_sequence(y, batch_first=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "unpack[0].shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "unpack[1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We need to change our model a little bit to use this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#export\n",
    "class AWD_LSTM1(nn.Module):\n",
    "    \"AWD-LSTM inspired by https://arxiv.org/abs/1708.02182.\"\n",
    "    initrange=0.1\n",
    "\n",
    "    def __init__(self, vocab_sz, emb_sz, n_hid, n_layers, pad_token,\n",
    "                 hidden_p=0.2, input_p=0.6, embed_p=0.1, weight_p=0.5):\n",
    "        super().__init__()\n",
    "        self.bs,self.emb_sz,self.n_hid,self.n_layers,self.pad_token = 1,emb_sz,n_hid,n_layers,pad_token\n",
    "        self.emb = nn.Embedding(vocab_sz, emb_sz, padding_idx=pad_token)\n",
    "        self.emb_dp = EmbeddingDropout(self.emb, embed_p)\n",
    "        self.rnns = [nn.LSTM(emb_sz if l == 0 else n_hid, (n_hid if l != n_layers - 1 else emb_sz), 1,\n",
    "                             batch_first=True) for l in range(n_layers)]\n",
    "        self.rnns = nn.ModuleList([WeightDropout(rnn, weight_p) for rnn in self.rnns])\n",
    "        self.emb.weight.data.uniform_(-self.initrange, self.initrange)\n",
    "        self.input_dp = RNNDropout(input_p)\n",
    "        self.hidden_dps = nn.ModuleList([RNNDropout(hidden_p) for l in range(n_layers)])\n",
    "\n",
    "    def forward(self, input):\n",
    "        bs,sl = input.size()\n",
    "        mask = (input == self.pad_token)\n",
    "        lengths = sl - mask.long().sum(1)\n",
    "        n_empty = (lengths == 0).sum()\n",
    "        if n_empty > 0:\n",
    "            input = input[:-n_empty]\n",
    "            lengths = lengths[:-n_empty]\n",
    "            self.hidden = [(h[0][:,:input.size(0)], h[1][:,:input.size(0)]) for h in self.hidden]\n",
    "        raw_output = self.input_dp(self.emb_dp(input))\n",
    "        new_hidden,raw_outputs,outputs = [],[],[]\n",
    "        for l, (rnn,hid_dp) in enumerate(zip(self.rnns, self.hidden_dps)):\n",
    "            raw_output = pack_padded_sequence(raw_output, lengths, batch_first=True)\n",
    "            raw_output, new_h = rnn(raw_output, self.hidden[l])\n",
    "            raw_output = pad_packed_sequence(raw_output, batch_first=True)[0]\n",
    "            raw_outputs.append(raw_output)\n",
    "            if l != self.n_layers - 1: raw_output = hid_dp(raw_output)\n",
    "            outputs.append(raw_output)\n",
    "            new_hidden.append(new_h)\n",
    "        self.hidden = to_detach(new_hidden)\n",
    "        return raw_outputs, outputs, mask\n",
    "\n",
    "    def _one_hidden(self, l):\n",
    "        \"Return one hidden state.\"\n",
    "        nh = self.n_hid if l != self.n_layers - 1 else self.emb_sz\n",
    "        return next(self.parameters()).new(1, self.bs, nh).zero_()\n",
    "\n",
    "    def reset(self):\n",
    "        \"Reset the hidden states.\"\n",
    "        self.hidden = [(self._one_hidden(l), self._one_hidden(l)) for l in range(self.n_layers)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Concat pooling"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will use three things for the classification head of the model: the last hidden state, the average of all the hidden states and the maximum of all the hidden states. The trick is just to, once again, ignore the padding in the last element/average/maximum."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Jump_to lesson 12 video](https://course19.fast.ai/videos/?lesson=12&t=7604)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class Pooling(nn.Module):\n",
    "    def forward(self, input):\n",
    "        raw_outputs,outputs,mask = input\n",
    "        output = outputs[-1]\n",
    "        lengths = output.size(1) - mask.long().sum(dim=1)\n",
    "        avg_pool = output.masked_fill(mask[:,:,None], 0).sum(dim=1)\n",
    "        avg_pool.div_(lengths.type(avg_pool.dtype)[:,None])\n",
    "        max_pool = output.masked_fill(mask[:,:,None], -float('inf')).max(dim=1)[0]\n",
    "        x = torch.cat([output[torch.arange(0, output.size(0)),lengths-1], max_pool, avg_pool], 1) #Concat pooling.\n",
    "        return output,x"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "emb_sz, nh, nl = 300, 300, 2\n",
    "tok_pad = vocab.index(PAD)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "enc = AWD_LSTM1(len(vocab), emb_sz, n_hid=nh, n_layers=nl, pad_token=tok_pad)\n",
    "pool = Pooling()\n",
    "enc.bs = bs\n",
    "enc.reset()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x,y = next(iter(data.train_dl))\n",
    "output,c = pool(enc(x))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can check we have padding with 1s at the end of each text (except the first which is the longest)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "PyTorch puts 0s everywhere we had padding in the `output` when unpacking."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_near((output.sum(dim=2) == 0).float(), (x==tok_pad).float())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So the last hidden state isn't the last element of `output`. Let's check we got everything right. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for i in range(bs):\n",
    "    length = x.size(1) - (x[i]==1).long().sum()\n",
    "    out_unpad = output[i,:length]\n",
    "    test_near(out_unpad[-1], c[i,:300])\n",
    "    test_near(out_unpad.max(0)[0], c[i,300:600])\n",
    "    test_near(out_unpad.mean(0), c[i,600:])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our pooling layer properly ignored the padding, so now let's group it with a classifier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def bn_drop_lin(n_in, n_out, bn=True, p=0., actn=None):\n",
    "    layers = [nn.BatchNorm1d(n_in)] if bn else []\n",
    "    if p != 0: layers.append(nn.Dropout(p))\n",
    "    layers.append(nn.Linear(n_in, n_out))\n",
    "    if actn is not None: layers.append(actn)\n",
    "    return layers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class PoolingLinearClassifier(nn.Module):\n",
    "    \"Create a linear classifier with pooling.\"\n",
    "\n",
    "    def __init__(self, layers, drops):\n",
    "        super().__init__()\n",
    "        mod_layers = []\n",
    "        activs = [nn.ReLU(inplace=True)] * (len(layers) - 2) + [None]\n",
    "        for n_in, n_out, p, actn in zip(layers[:-1], layers[1:], drops, activs):\n",
    "            mod_layers += bn_drop_lin(n_in, n_out, p=p, actn=actn)\n",
    "        self.layers = nn.Sequential(*mod_layers)\n",
    "\n",
    "    def forward(self, input):\n",
    "        raw_outputs,outputs,mask = input\n",
    "        output = outputs[-1]\n",
    "        lengths = output.size(1) - mask.long().sum(dim=1)\n",
    "        avg_pool = output.masked_fill(mask[:,:,None], 0).sum(dim=1)\n",
    "        avg_pool.div_(lengths.type(avg_pool.dtype)[:,None])\n",
    "        max_pool = output.masked_fill(mask[:,:,None], -float('inf')).max(dim=1)[0]\n",
    "        x = torch.cat([output[torch.arange(0, output.size(0)),lengths-1], max_pool, avg_pool], 1) #Concat pooling.\n",
    "        x = self.layers(x)\n",
    "        return x"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then we just have to feed our texts to those two blocks, (but we can't give them all at once to the AWD_LSTM or we might get OOM error: we'll go for chunks of bptt length to regularly detach the history of our hidden states.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def pad_tensor(t, bs, val=0.):\n",
    "    if t.size(0) < bs:\n",
    "        return torch.cat([t, val + t.new_zeros(bs-t.size(0), *t.shape[1:])])\n",
    "    return t"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class SentenceEncoder(nn.Module):\n",
    "    def __init__(self, module, bptt, pad_idx=1):\n",
    "        super().__init__()\n",
    "        self.bptt,self.module,self.pad_idx = bptt,module,pad_idx\n",
    "\n",
    "    def concat(self, arrs, bs):\n",
    "        return [torch.cat([pad_tensor(l[si],bs) for l in arrs], dim=1) for si in range(len(arrs[0]))]\n",
    "    \n",
    "    def forward(self, input):\n",
    "        bs,sl = input.size()\n",
    "        self.module.bs = bs\n",
    "        self.module.reset()\n",
    "        raw_outputs,outputs,masks = [],[],[]\n",
    "        for i in range(0, sl, self.bptt):\n",
    "            r,o,m = self.module(input[:,i: min(i+self.bptt, sl)])\n",
    "            masks.append(pad_tensor(m, bs, 1))\n",
    "            raw_outputs.append(r)\n",
    "            outputs.append(o)\n",
    "        return self.concat(raw_outputs, bs),self.concat(outputs, bs),torch.cat(masks,dim=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_text_classifier(vocab_sz, emb_sz, n_hid, n_layers, n_out, pad_token, bptt, output_p=0.4, hidden_p=0.2, \n",
    "                        input_p=0.6, embed_p=0.1, weight_p=0.5, layers=None, drops=None):\n",
    "    \"To create a full AWD-LSTM\"\n",
    "    rnn_enc = AWD_LSTM1(vocab_sz, emb_sz, n_hid=n_hid, n_layers=n_layers, pad_token=pad_token,\n",
    "                        hidden_p=hidden_p, input_p=input_p, embed_p=embed_p, weight_p=weight_p)\n",
    "    enc = SentenceEncoder(rnn_enc, bptt)\n",
    "    if layers is None: layers = [50]\n",
    "    if drops is None:  drops = [0.1] * len(layers)\n",
    "    layers = [3 * emb_sz] + layers + [n_out] \n",
    "    drops = [output_p] + drops\n",
    "    return SequentialRNN(enc, PoolingLinearClassifier(layers, drops))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "emb_sz, nh, nl = 300, 300, 2\n",
    "dps = tensor([0.4, 0.3, 0.4, 0.05, 0.5]) * 0.25\n",
    "model = get_text_classifier(len(vocab), emb_sz, nh, nl, 2, 1, bptt, *dps)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Training"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We load our pretrained encoder and freeze it."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Jump_to lesson 12 video](https://course19.fast.ai/videos/?lesson=12&t=7684)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def class_splitter(m):\n",
    "    enc = m[0].module\n",
    "    groups = [nn.Sequential(enc.emb, enc.emb_dp, enc.input_dp)]\n",
    "    for i in range(len(enc.rnns)): groups.append(nn.Sequential(enc.rnns[i], enc.hidden_dps[i]))\n",
    "    groups.append(m[1])\n",
    "    return [list(o.parameters()) for o in groups]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for p in model[0].parameters(): p.requires_grad_(False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cbs = [partial(AvgStatsCallback,accuracy),\n",
    "       CudaCallback, Recorder,\n",
    "       partial(GradientClipping, clip=0.1),\n",
    "       ProgressCallback]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model[0].module.load_state_dict(torch.load(path/'finetuned_enc.pth'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "learn = Learner(model, data, F.cross_entropy, opt_func=adam_opt(), cb_funcs=cbs, splitter=class_splitter)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lr = 1e-2\n",
    "cbsched = sched_1cycle([lr], mom_start=0.8, mom_mid=0.7, mom_end=0.8)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "learn.fit(1, cbs=cbsched)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for p in model[0].module.rnns[-1].parameters(): p.requires_grad_(True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lr = 5e-3\n",
    "cbsched = sched_1cycle([lr/2., lr/2., lr/2., lr], mom_start=0.8, mom_mid=0.7, mom_end=0.8)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "learn.fit(1, cbs=cbsched)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for p in model[0].parameters(): p.requires_grad_(True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lr = 1e-3\n",
    "cbsched = sched_1cycle([lr/8., lr/4., lr/2., lr], mom_start=0.8, mom_mid=0.7, mom_end=0.8)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "learn.fit(2, cbs=cbsched)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x,y = next(iter(data.valid_dl))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Predicting on the padded batch or on the individual unpadded samples give the same results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pred_batch = learn.model.eval()(x.cuda())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pred_ind = []\n",
    "for inp in x:\n",
    "    length = x.size(1) - (inp == 1).long().sum()\n",
    "    inp = inp[:length]\n",
    "    pred_ind.append(learn.model.eval()(inp[None].cuda()))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert near(pred_batch, torch.cat(pred_ind))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}