{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from fastai.text import *" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Reduce original dataset to questions" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "path = Config().data_path()/'giga-fren'" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "You only need to execute the setup cells once; uncomment them to run. The dataset can be downloaded [here](https://s3.amazonaws.com/fast-ai-nlp/giga-fren.tgz)." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"#! wget https://s3.amazonaws.com/fast-ai-nlp/giga-fren.tgz -P {path}\n",
"#! tar xf {path}/giga-fren.tgz -C {path} \n",
"\n",
"# with open(path/'giga-fren.release2.fixed.fr') as f:\n",
"#     fr = f.read().split('\\n')\n",
"\n",
"# with open(path/'giga-fren.release2.fixed.en') as f:\n",
"#     en = f.read().split('\\n')\n",
"\n",
"# re_eq = re.compile('^(Wh[^?.!]+\\?)')\n",
"# re_fq = re.compile('^([^?.!]+\\?)')\n",
"# en_fname = path/'giga-fren.release2.fixed.en'\n",
"# fr_fname = path/'giga-fren.release2.fixed.fr'\n",
"\n",
"# lines = ((re_eq.search(eq), re_fq.search(fq)) \n",
"#          for eq, fq in zip(open(en_fname, encoding='utf-8'), open(fr_fname, encoding='utf-8')))\n",
"# qs = [(e.group(), f.group()) for e,f in lines if e and f]\n",
"\n",
"# qs = [(q1,q2) for q1,q2 in qs]\n",
"# df = pd.DataFrame({'fr': [q[1] for q in qs], 'en': [q[0] for q in qs]}, columns = ['en', 'fr'])\n",
"# df.to_csv(path/'questions_easy.csv', index=False)\n",
"\n",
"# del en, fr, lines, qs, df # free RAM or restart the nb " ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"### fastText pre-trained word vectors https://fasttext.cc/docs/en/crawl-vectors.html\n",
"#! wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.fr.300.bin.gz -P {path}\n",
"#! 
wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz -P {path}\n", "#! gzip -d {path}/cc.fr.300.bin.gz \n", "#! gzip -d {path}/cc.en.300.bin.gz" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[PosixPath('/home/stas/.fastai/data/giga-fren/models'),\n", " PosixPath('/home/stas/.fastai/data/giga-fren/giga-fren.release2.fixed.en'),\n", " PosixPath('/home/stas/.fastai/data/giga-fren/giga-fren.release2.fixed.fr'),\n", " PosixPath('/home/stas/.fastai/data/giga-fren/data_save.pkl'),\n", " PosixPath('/home/stas/.fastai/data/giga-fren/cc.en.300.bin'),\n", " PosixPath('/home/stas/.fastai/data/giga-fren/questions_easy.csv'),\n", " PosixPath('/home/stas/.fastai/data/giga-fren/cc.fr.300.bin')]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path.ls()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Put them in a DataBunch" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our questions look like this now:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
|   | en | fr |
|---|---|---|
| 0 | What is light ? | Qu’est-ce que la lumière? |
| 1 | Who are we? | Où sommes-nous? |
| 2 | Where did we come from? | D'où venons-nous? |
| 3 | What would we do without it? | Que ferions-nous sans elle ? |
| 4 | What is the absolute location (latitude and lo... | Quelle sont les coordonnées (latitude et longi... |
\n", "
" ], "text/plain": [ " en \\\n", "0 What is light ? \n", "1 Who are we? \n", "2 Where did we come from? \n", "3 What would we do without it? \n", "4 What is the absolute location (latitude and lo... \n", "\n", " fr \n", "0 Qu’est-ce que la lumière? \n", "1 Où sommes-nous? \n", "2 D'où venons-nous? \n", "3 Que ferions-nous sans elle ? \n", "4 Quelle sont les coordonnées (latitude et longi... " ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(path/'questions_easy.csv')\n", "df.head()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "To make it simple, we lowercase everything." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['en'] = df['en'].apply(lambda x:x.lower())\n", "df['fr'] = df['fr'].apply(lambda x:x.lower())" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "First, we need to collate inputs and targets in a batch: they have different lengths, so we add padding to make the sequence lengths the same." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"def seq2seq_collate(samples:BatchSamples, pad_idx:int=1, pad_first:bool=True, backwards:bool=False) -> Tuple[LongTensor, LongTensor]:\n",
"    \"Function that collects samples and adds padding. Flips token order if needed\"\n",
"    samples = to_data(samples)\n",
"    max_len_x,max_len_y = max([len(s[0]) for s in samples]),max([len(s[1]) for s in samples])\n",
"    # Start from tensors filled with the padding index, then copy each sequence in\n",
"    res_x = torch.zeros(len(samples), max_len_x).long() + pad_idx\n",
"    res_y = torch.zeros(len(samples), max_len_y).long() + pad_idx\n",
"    if backwards: pad_first = not pad_first\n",
"    for i,s in enumerate(samples):\n",
"        if pad_first: \n",
"            res_x[i,-len(s[0]):],res_y[i,-len(s[1]):] = LongTensor(s[0]),LongTensor(s[1])\n",
"        else: \n",
"            res_x[i,:len(s[0])],res_y[i,:len(s[1])] = LongTensor(s[0]),LongTensor(s[1])\n",
"    if backwards: res_x,res_y = res_x.flip(1),res_y.flip(1)\n",
"    return res_x,res_y" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Then we create a special `DataBunch` that uses this collate function." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"class Seq2SeqDataBunch(TextDataBunch):\n",
"    \"Create a `TextDataBunch` suitable for training a seq2seq model.\"\n",
"    @classmethod\n",
"    def create(cls, train_ds, valid_ds, test_ds=None, path:PathOrStr='.', bs:int=32, val_bs:int=None, pad_idx=1,\n",
"               pad_first=False, device:torch.device=None, no_check:bool=False, backwards:bool=False, **dl_kwargs) -> DataBunch:\n",
"        \"Function that transforms the `datasets` into a `DataBunch` for seq2seq. Passes `**dl_kwargs` on to `DataLoader()`\"\n",
"        datasets = cls._init_ds(train_ds, valid_ds, test_ds)\n",
"        val_bs = ifnone(val_bs, bs)\n",
"        collate_fn = partial(seq2seq_collate, pad_idx=pad_idx, pad_first=pad_first, backwards=backwards)\n",
"        # Training batches group inputs of similar length, with some randomness\n",
"        train_sampler = SortishSampler(datasets[0].x, key=lambda t: len(datasets[0][t][0].data), bs=bs//2)\n",
"        train_dl = DataLoader(datasets[0], batch_size=bs, sampler=train_sampler, drop_last=True, **dl_kwargs)\n",
"        dataloaders = [train_dl]\n",
"        # Validation and test sets are sorted by input length\n",
"        for ds in datasets[1:]:\n",
"            lengths = [len(t) for t in ds.x.items]\n",
"            sampler = SortSampler(ds.x, key=lengths.__getitem__)\n",
"            dataloaders.append(DataLoader(ds, batch_size=val_bs, sampler=sampler, **dl_kwargs))\n",
"        return cls(*dataloaders, path=path, device=device, collate_fn=collate_fn, no_check=no_check)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "And a subclass of `TextList` that will use this `DataBunch` class in the call `.databunch` and will use `TextList` to label (since our targets are other texts)." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"class Seq2SeqTextList(TextList):\n",
"    _bunch = Seq2SeqDataBunch\n",
"    _label_cls = TextList" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "That's all we need to use the data block API!" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "src = Seq2SeqTextList.from_df(df, path = path, cols='fr').split_by_rand_pct().label_from_df(cols='en', label_cls=TextList)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "28.0" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.percentile([len(o) for o in src.train.x.items] + [len(o) for o in src.valid.x.items], 90)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "23.0" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.percentile([len(o) for o in src.train.y.items] + [len(o) for o in src.valid.y.items], 90)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We remove the items where either the input or the target is more than 30 tokens long." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "src = src.filter_by_func(lambda x,y: len(x) > 30 or len(y) > 30)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "48352" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(src.train) + len(src.valid)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = src.databunch()" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.save()" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = load_data(path)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| text | target |
|---|---|
| xxbos à quoi cela peut - il bien servir , alors que l’on xxunk toujours combien il y aura de ces unités et dans quels domaines elles seront présentes ? | xxbos what use was this , when it was still not known how many such units there would be and in what fields ? |
| xxbos quels autres fabricants de dispositifs médicaux avez - vous évalués et certifiés selon la norme iso xxunk : 2003 et le marquage ce ( le cas échéant ) ? | xxbos what medical xxunk companies has your organization audited and certified to iso xxunk and xxunk mark ( where applicable ) ? |
| xxbos quel est le lien entre le fep , les fonds structurels , le fonds de cohésion et le xxunk ( fonds européen agricole pour le développement rural ) ? | xxbos what is the link between the eff , structural funds , cohesion fund and xxunk ? |
| xxbos quel a été le rôle d'agriculture et agroalimentaire canada ( aac ) dans le processus de révision de la norme nationale sur l'agriculture biologique qui date de 1999 ? | xxbos what was the role of agriculture and agri - food canada ( aafc ) in the initiative to revise the 1999 national standard for organic agriculture ? |
| xxbos lesquelles des activités de r - d ci - après votre établissement a - t - il menées au cours des trois derniers exercices se terminant en 2003 ? | xxbos which of the following r&d activities were carried out at your establishment over the last three fiscal years ending in 2003 ? |
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "data.show_batch()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pretrained embeddings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To install fastText:\n", "```\n", "$ git clone https://github.com/facebookresearch/fastText.git\n", "$ cd fastText\n", "$ pip install .\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Installation: https://github.com/facebookresearch/fastText#building-fasttext-for-python\n", "import fastText as ft" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fr_vecs = ft.load_model(str((path/'cc.fr.300.bin')))\n", "en_vecs = ft.load_model(str((path/'cc.en.300.bin')))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We create an embedding module with the pretrained vectors and random data for the missing parts." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"def create_emb(vecs, itos, em_sz=300, mult=1.):\n",
"    emb = nn.Embedding(len(itos), em_sz, padding_idx=1)\n",
"    wgts = emb.weight.data\n",
"    vec_dic = {w:vecs.get_word_vector(w) for w in vecs.get_words()}\n",
"    miss = []\n",
"    # Copy the pretrained vector when the token is known; otherwise keep the random init\n",
"    for i,w in enumerate(itos):\n",
"        try: wgts[i] = tensor(vec_dic[w])\n",
"        except KeyError: miss.append(w)\n",
"    return emb" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "emb_enc = create_emb(fr_vecs, data.x.vocab.itos)\n", "emb_dec = create_emb(en_vecs, data.y.vocab.itos)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "torch.save(emb_enc, path/'models'/'fr_emb.pth')\n", "torch.save(emb_dec, path/'models'/'en_emb.pth')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Free some RAM" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "del fr_vecs\n", "del en_vecs" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### QRNN seq2seq" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Our model uses QRNNs at its base (you can use GRUs or LSTMs with a little adaptation). Using QRNNs requires a properly installed CUDA (a version that matches your PyTorch install). " ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from fastai.text.models.qrnn import QRNN, QRNNLayer" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The model itself consists of an encoder and a decoder.\n", "\n", "![Seq2seq model](images/seq2seq.png)\n", "\n", "The encoder is a (quasi) recurrent neural net. We feed it our input sentence, and it produces an output (that we discard for now) and a hidden state. That hidden state is then given to the decoder (another RNN), which uses it together with the outputs it predicts to produce the translation. We loop until the decoder produces a padding token (or stop at 30 iterations to make sure it's not an infinite loop at the beginning of training)." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"class Seq2SeqQRNN(nn.Module):\n",
"    def __init__(self, emb_enc, emb_dec, n_hid, max_len, n_layers=2, p_inp:float=0.15, p_enc:float=0.25, \n",
"                 p_dec:float=0.1, p_out:float=0.35, p_hid:float=0.05, bos_idx:int=0, pad_idx:int=1):\n",
"        super().__init__()\n",
"        self.n_layers,self.n_hid,self.max_len,self.bos_idx,self.pad_idx = n_layers,n_hid,max_len,bos_idx,pad_idx\n",
"        self.emb_enc = emb_enc\n",
"        self.emb_enc_drop = nn.Dropout(p_inp)\n",
"        self.encoder = QRNN(emb_enc.weight.size(1), n_hid, n_layers=n_layers, dropout=p_enc)\n",
"        # Projects the encoder hidden state to the size expected by the decoder\n",
"        self.out_enc = nn.Linear(n_hid, emb_enc.weight.size(1), bias=False)\n",
"        self.hid_dp = nn.Dropout(p_hid)\n",
"        self.emb_dec = emb_dec\n",
"        self.decoder = QRNN(emb_dec.weight.size(1), emb_dec.weight.size(1), n_layers=n_layers, dropout=p_dec)\n",
"        self.out_drop = nn.Dropout(p_out)\n",
"        self.out = nn.Linear(emb_dec.weight.size(1), emb_dec.weight.size(0))\n",
"        # Tie the output layer weights to the decoder embeddings\n",
"        self.out.weight.data = self.emb_dec.weight.data\n",
"    \n",
"    def forward(self, inp):\n",
"        bs,sl = inp.size()\n",
"        self.encoder.reset()\n",
"        self.decoder.reset()\n",
"        hid = self.initHidden(bs)\n",
"        emb = self.emb_enc_drop(self.emb_enc(inp))\n",
"        enc_out, hid = self.encoder(emb, hid)\n",
"        hid = self.out_enc(self.hid_dp(hid))\n",
"\n",
"        # Decode one token at a time, feeding each prediction back in as the next input\n",
"        dec_inp = inp.new_zeros(bs).long() + self.bos_idx\n",
"        outs = []\n",
"        for i in range(self.max_len):\n",
"            emb = self.emb_dec(dec_inp).unsqueeze(1)\n",
"            out, hid = self.decoder(emb, hid)\n",
"            out = self.out(self.out_drop(out[:,0]))\n",
"            outs.append(out)\n",
"            dec_inp = out.max(1)[1]\n",
"            if (dec_inp==self.pad_idx).all(): break\n",
"        return torch.stack(outs, dim=1)\n",
"    \n",
"    def initHidden(self, bs): return one_param(self).new_zeros(self.n_layers, bs, self.n_hid)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Loss function" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The loss pads output and target so that 
they are of the same size before using the usual flattened version of cross entropy. We do the same for accuracy." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def seq2seq_loss(out, targ, pad_idx=1):\n", " bs,targ_len = targ.size()\n", " _,out_len,vs = out.size()\n", " if targ_len>out_len: out = F.pad(out, (0,0,0,targ_len-out_len,0,0), value=pad_idx)\n", " if out_len>targ_len: targ = F.pad(targ, (0,out_len-targ_len,0,0), value=pad_idx)\n", " return CrossEntropyFlat()(out, targ)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def seq2seq_acc(out, targ, pad_idx=1):\n", " bs,targ_len = targ.size()\n", " _,out_len,vs = out.size()\n", " if targ_len>out_len: out = F.pad(out, (0,0,0,targ_len-out_len,0,0), value=pad_idx)\n", " if out_len>targ_len: targ = F.pad(targ, (0,out_len-targ_len,0,0), value=pad_idx)\n", " out = out.argmax(2)\n", " return (out==targ).float().mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Bleu metric (see dedicated notebook)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In translation, the metric usually used is BLEU, see the corresponding notebook for the details." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"class NGram():\n",
"    def __init__(self, ngram, max_n=5000): self.ngram,self.max_n = ngram,max_n\n",
"    def __eq__(self, other):\n",
"        if len(self.ngram) != len(other.ngram): return False\n",
"        return np.all(np.array(self.ngram) == np.array(other.ngram))\n",
"    def __hash__(self): return int(sum([o * self.max_n**i for i,o in enumerate(self.ngram)]))" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"def get_grams(x, n, max_n=5000):\n",
"    return x if n==1 else [NGram(x[i:i+n], max_n=max_n) for i in range(len(x)-n+1)]" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"def get_correct_ngrams(pred, targ, n, max_n=5000):\n",
"    pred_grams,targ_grams = get_grams(pred, n, max_n=max_n),get_grams(targ, n, max_n=max_n)\n",
"    pred_cnt,targ_cnt = Counter(pred_grams),Counter(targ_grams)\n",
"    return sum([min(c, targ_cnt[g]) for g,c in pred_cnt.items()]),len(pred_grams)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"class CorpusBLEU(Callback):\n",
"    def __init__(self, vocab_sz):\n",
"        self.vocab_sz = vocab_sz\n",
"        self.name = 'bleu'\n",
"    \n",
"    def on_epoch_begin(self, **kwargs):\n",
"        self.pred_len,self.targ_len,self.corrects,self.counts = 0,0,[0]*4,[0]*4\n",
"    \n",
"    def on_batch_end(self, last_output, last_target, **kwargs):\n",
"        last_output = last_output.argmax(dim=-1)\n",
"        for pred,targ in zip(last_output.cpu().numpy(),last_target.cpu().numpy()):\n",
"            self.pred_len += len(pred)\n",
"            self.targ_len += len(targ)\n",
"            for i in range(4):\n",
"                c,t = get_correct_ngrams(pred, targ, i+1, max_n=self.vocab_sz)\n",
"                self.corrects[i] += c\n",
"                self.counts[i] += t\n",
"    \n",
"    def on_epoch_end(self, last_metrics, **kwargs):\n",
"        precs = [c/t for c,t in zip(self.corrects,self.counts)]\n",
"        len_penalty = exp(1 - self.targ_len/self.pred_len) if self.pred_len < self.targ_len else 1\n",
"        bleu = len_penalty * ((precs[0]*precs[1]*precs[2]*precs[3]) ** 0.25)\n",
"        return add_metrics(last_metrics, bleu)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We load our pretrained embeddings to create the model." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "emb_enc = torch.load(path/'models'/'fr_emb.pth')\n", "emb_dec = torch.load(path/'models'/'en_emb.pth')" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = Seq2SeqQRNN(emb_enc, emb_dec, 256, 30, n_layers=2)\n", "learn = Learner(data, model, loss_func=seq2seq_loss, metrics=[seq2seq_acc, CorpusBLEU(len(data.y.vocab.itos))])" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.\n" ] } ], "source": [ "learn.lr_find()" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "learn.recorder.plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| epoch | train_loss | valid_loss | seq2seq_acc | bleu | time |
|---|---|---|---|---|---|
| 0 | 6.272254 | 6.547584 | 0.175653 | 0.084508 | 00:35 |
| 1 | 5.475595 | 5.798847 | 0.237578 | 0.177244 | 00:34 |
| 2 | 4.998140 | 4.741757 | 0.342352 | 0.250401 | 00:36 |
| 3 | 4.769568 | 4.965292 | 0.316322 | 0.226495 | 00:38 |
| 4 | 4.218278 | 4.942849 | 0.316456 | 0.239042 | 00:37 |
| 5 | 3.686281 | 4.311011 | 0.379345 | 0.282809 | 00:39 |
| 6 | 3.294988 | 4.044959 | 0.409902 | 0.317913 | 00:41 |
| 7 | 2.959656 | 3.956887 | 0.420079 | 0.321248 | 00:42 |
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "learn.fit_one_cycle(8, 1e-2)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "So how good is our model? Let's see a few predictions." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"def get_predictions(learn, ds_type=DatasetType.Valid):\n",
"    learn.model.eval()\n",
"    inputs, targets, outputs = [],[],[]\n",
"    with torch.no_grad():\n",
"        for xb,yb in progress_bar(learn.dl(ds_type)):\n",
"            out = learn.model(xb)\n",
"            # Turn token ids back into text for the input, the target and the prediction\n",
"            for x,y,z in zip(xb,yb,out):\n",
"                inputs.append(learn.data.train_ds.x.reconstruct(x))\n",
"                targets.append(learn.data.train_ds.y.reconstruct(y))\n",
"                outputs.append(learn.data.train_ds.y.reconstruct(z.argmax(1)))\n",
"    return inputs, targets, outputs" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " \n", " \n", " 100.00% [152/152 00:17<00:00]\n", "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "inputs, targets, outputs = get_predictions(learn)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Text xxbos pour quelle raison demandez - vous aux émetteurs des renseignements qui n'ont pas à être fournis sur les reçus papier remis aux contribuables ?,\n", " Text xxbos why are your requiring xxunk to provide information that is not required to be on the paper receipts given to clients ?,\n", " Text xxbos why would you you to to to to to to the the the the the the ? ?)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inputs[700], targets[700], outputs[700]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Text xxbos quels facteurs sont responsables des différences de concentrations des contaminants présents dans les poissons dans les cours d’eau et les lacs du nord ?,\n", " Text xxbos what factors are responsible for the differences in the level of contaminants found fish in northern rivers and lakes ?,\n", " Text xxbos what are the differences between the in the the the the the the ? ?)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inputs[701], targets[701], outputs[701]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Text xxbos quel est l'impact sur la recherche en amont du brevetage accru dans les sciences du vivant ?,\n", " Text xxbos what is the impact on upstream research of increased patenting in the life sciences ?,\n", " Text xxbos what is the impact of on on on on on on on on ? 
?)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inputs[2513], targets[2513], outputs[2513]" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Text xxbos quels changements devrait - on apporter aux processus de réglementation fédéraux et provinciaux ?,\n", " Text xxbos what changes to federal and provincial regulatory processes would be required ?,\n", " Text xxbos what changes will be be to the the the the the public ?)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inputs[4000], targets[4000], outputs[4000]" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "It usually starts well, but falls back on easy words toward the end of the question." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Teacher forcing" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "One way to help training is to feed the decoder the real targets instead of its own predictions (if it starts with wrong words, it's very unlikely to give us the right translation). We do that all the time at the beginning, then progressively reduce the amount of teacher forcing (here the forcing probability decays linearly from 1 toward 0.5 over `end_epoch` epochs)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"class TeacherForcing(LearnerCallback):\n",
"    \n",
"    def __init__(self, learn, end_epoch):\n",
"        super().__init__(learn)\n",
"        self.end_epoch = end_epoch\n",
"    \n",
"    def on_batch_begin(self, last_input, last_target, train, **kwargs):\n",
"        # During training, also pass the targets to the model so it can use them as decoder inputs\n",
"        if train: return {'last_input': [last_input, last_target]}\n",
"    \n",
"    def on_epoch_begin(self, epoch, **kwargs):\n",
"        self.learn.model.pr_force = 1 - 0.5 * epoch/self.end_epoch" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"class Seq2SeqQRNN(nn.Module):\n",
"    def __init__(self, emb_enc, emb_dec, n_hid, max_len, n_layers=2, p_inp:float=0.15, p_enc:float=0.25, \n",
"                 p_dec:float=0.1, p_out:float=0.35, p_hid:float=0.05, bos_idx:int=0, pad_idx:int=1):\n",
"        super().__init__()\n",
"        self.n_layers,self.n_hid,self.max_len,self.bos_idx,self.pad_idx = n_layers,n_hid,max_len,bos_idx,pad_idx\n",
"        self.emb_enc = emb_enc\n",
"        self.emb_enc_drop = nn.Dropout(p_inp)\n",
"        self.encoder = QRNN(emb_enc.weight.size(1), n_hid, n_layers=n_layers, dropout=p_enc)\n",
"        self.out_enc = nn.Linear(n_hid, emb_enc.weight.size(1), bias=False)\n",
"        self.hid_dp = nn.Dropout(p_hid)\n",
"        self.emb_dec = emb_dec\n",
"        self.decoder = QRNN(emb_dec.weight.size(1), emb_dec.weight.size(1), n_layers=n_layers, dropout=p_dec)\n",
"        self.out_drop = nn.Dropout(p_out)\n",
"        self.out = nn.Linear(emb_dec.weight.size(1), emb_dec.weight.size(0))\n",
"        self.out.weight.data = self.emb_dec.weight.data\n",
"        self.pr_force = 0.\n",
"    \n",
"    def forward(self, inp, targ=None):\n",
"        bs,sl = inp.size()\n",
"        hid = self.initHidden(bs)\n",
"        emb = self.emb_enc_drop(self.emb_enc(inp))\n",
"        enc_out, hid = self.encoder(emb, hid)\n",
"        hid = self.out_enc(self.hid_dp(hid))\n",
"\n",
"        dec_inp = inp.new_zeros(bs).long() + self.bos_idx\n",
"        res = []\n",
"        for i in range(self.max_len):\n",
"            emb = self.emb_dec(dec_inp).unsqueeze(1)\n",
"            outp, hid = self.decoder(emb, hid)\n",
"            outp = self.out(self.out_drop(outp[:,0]))\n",
"            res.append(outp)\n",
"            dec_inp = outp.data.max(1)[1]\n",
"            if (dec_inp==self.pad_idx).all(): break\n",
"            # With probability pr_force, feed the real target token instead of the prediction\n",
"            if (targ is not None) and (random.random()<self.pr_force):\n",
"                if i>=targ.shape[1]: break\n",
"                dec_inp = targ[:,i]\n",
"        return torch.stack(res, dim=1)\n",
"    \n",
"    def initHidden(self, bs): return one_param(self).new_zeros(self.n_layers, bs, self.n_hid)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "emb_enc = torch.load(path/'models'/'fr_emb.pth')\n", "emb_dec = torch.load(path/'models'/'en_emb.pth')" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = Seq2SeqQRNN(emb_enc, emb_dec, 256, 30, n_layers=2)\n", "learn = Learner(data, model, loss_func=seq2seq_loss, metrics=[seq2seq_acc, CorpusBLEU(len(data.y.vocab.itos))],\n", "                callback_fns=partial(TeacherForcing, end_epoch=8))" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
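| epoch | train_loss | valid_loss | seq2seq_acc | bleu | time |
|---|---|---|---|---|---|
| 0 | 2.335030 | 4.213064 | 0.543526 | 0.311808 | 00:50 |
| 1 | 2.240968 | 4.949047 | 0.414702 | 0.356721 | 00:46 |
| 2 | 2.030350 | 5.073238 | 0.391867 | 0.354593 | 00:46 |
| 3 | 2.117243 | 4.553541 | 0.430130 | 0.382721 | 00:45 |
| 4 | 1.999398 | 3.816537 | 0.479980 | 0.395980 | 00:46 |
| 5 | 2.051997 | 4.174997 | 0.430543 | 0.373515 | 00:44 |
| 6 | 1.926257 | 4.096586 | 0.433852 | 0.376887 | 00:44 |
| 7 | 1.931791 | 4.038434 | 0.435708 | 0.376441 | 00:44 |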
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "learn.fit_one_cycle(8, 1e-2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " \n", " \n", " 100.00% [152/152 00:16<00:00]\n", "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "inputs, targets, outputs = get_predictions(learn)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Text xxbos pour quelle raison demandez - vous aux émetteurs des renseignements qui n'ont pas à être fournis sur les reçus papier remis aux contribuables ?,\n", " Text xxbos why are your requiring xxunk to provide information that is not required to be on the paper receipts given to clients ?,\n", " Text xxbos why should you not use the cra to the cra ?)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inputs[700],targets[700],outputs[700]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Text xxbos quel est l'impact sur la recherche en amont du brevetage accru dans les sciences du vivant ?,\n", " Text xxbos what is the impact on upstream research of increased patenting in the life sciences ?,\n", " Text xxbos what is the impact of the on the xxunk of the xxunk of the xxunk ?)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inputs[2513], targets[2513], outputs[2513]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Text xxbos quels changements devrait - on apporter aux processus de réglementation fédéraux et provinciaux ?,\n", " Text xxbos what changes to federal and provincial regulatory processes would be required ?,\n", " Text xxbos what changes should be made to the regulatory process and the regulatory framework ?)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inputs[4000], targets[4000], outputs[4000]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#get_bleu(learn)" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "## Bidir" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A second thing that might help is to use a bidirectional model for the encoder." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class Seq2SeqQRNN(nn.Module):\n", " def __init__(self, emb_enc, emb_dec, n_hid, max_len, n_layers=2, p_inp:float=0.15, p_enc:float=0.25, \n", " p_dec:float=0.1, p_out:float=0.35, p_hid:float=0.05, bos_idx:int=0, pad_idx:int=1):\n", " super().__init__()\n", " self.n_layers,self.n_hid,self.max_len,self.bos_idx,self.pad_idx = n_layers,n_hid,max_len,bos_idx,pad_idx\n", " self.emb_enc = emb_enc\n", " self.emb_enc_drop = nn.Dropout(p_inp)\n", " self.encoder = QRNN(emb_enc.weight.size(1), n_hid, n_layers=n_layers, dropout=p_enc, bidirectional=True)\n", " self.out_enc = nn.Linear(2*n_hid, emb_enc.weight.size(1), bias=False)\n", " self.hid_dp = nn.Dropout(p_hid)\n", " self.emb_dec = emb_dec\n", " self.decoder = QRNN(emb_dec.weight.size(1), emb_dec.weight.size(1), n_layers=n_layers, dropout=p_dec)\n", " self.out_drop = nn.Dropout(p_out)\n", " self.out = nn.Linear(emb_dec.weight.size(1), emb_dec.weight.size(0))\n", " self.out.weight.data = self.emb_dec.weight.data\n", " self.pr_force = 0.\n", " \n", " def forward(self, inp, targ=None):\n", " bs,sl = inp.size()\n", " hid = self.initHidden(bs)\n", " emb = self.emb_enc_drop(self.emb_enc(inp))\n", " enc_out, hid = self.encoder(emb, hid)\n", " \n", " hid = hid.view(2,self.n_layers, bs, self.n_hid).permute(1,2,0,3).contiguous()\n", " hid = self.out_enc(self.hid_dp(hid).view(self.n_layers, bs, 2*self.n_hid))\n", "\n", " dec_inp = inp.new_zeros(bs).long() + self.bos_idx\n", " res = []\n", " for i in range(self.max_len):\n", " emb = self.emb_dec(dec_inp).unsqueeze(1)\n", " outp, hid = self.decoder(emb, hid)\n", " outp = self.out(self.out_drop(outp[:,0]))\n", " res.append(outp)\n", " dec_inp = outp.data.max(1)[1]\n", " if 
(dec_inp==self.pad_idx).all(): break\n", " if (targ is not None) and (random.random()<self.pr_force):\n", " if i>=targ.shape[1]: break\n", " dec_inp = targ[:,i]\n", " return torch.stack(res, dim=1)\n", " \n", " def initHidden(self, bs): return one_param(self).new_zeros(2*self.n_layers, bs, self.n_hid)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "emb_enc = torch.load(path/'models'/'fr_emb.pth')\n", "emb_dec = torch.load(path/'models'/'en_emb.pth')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = Seq2SeqQRNN(emb_enc, emb_dec, 256, 30, n_layers=2)\n", "learn = Learner(data, model, loss_func=seq2seq_loss, metrics=[seq2seq_acc, CorpusBLEU(len(data.y.vocab.itos))],\n", " callback_fns=partial(TeacherForcing, end_epoch=8))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.\n" ] } ], "source": [ "learn.lr_find()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "learn.recorder.plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
epoch  train_loss  valid_loss  seq2seq_acc  bleu      time
0      2.244290    6.343948    0.388536     0.354548  00:47
1      2.042745    3.911313    0.525344     0.378933  00:50
2      1.876625    5.006873    0.409836     0.372162  00:48
3      1.989081    3.710540    0.503919     0.409202  00:48
4      1.804112    4.398979    0.427331     0.381098  00:47
5      1.949583    4.069941    0.449399     0.394692  00:46
6      1.774466    3.915257    0.452546     0.394610  00:47
7      1.925855    3.910456    0.449511     0.390513  00:46
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "learn.fit_one_cycle(8, 1e-2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " \n", " \n", " 100.00% [152/152 00:16<00:00]\n", "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "inputs, targets, outputs = get_predictions(learn)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Text xxbos pour quelle raison demandez - vous aux émetteurs des renseignements qui n'ont pas à être fournis sur les reçus papier remis aux contribuables ?,\n", " Text xxbos why are your requiring xxunk to provide information that is not required to be on the paper receipts given to clients ?,\n", " Text xxbos why do you need to support the information to the the application of the claim ?)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inputs[700], targets[700], outputs[700]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Text xxbos quels facteurs sont responsables des différences de concentrations des contaminants présents dans les poissons dans les cours d’eau et les lacs du nord ?,\n", " Text xxbos what factors are responsible for the differences in the level of contaminants found fish in northern rivers and lakes ?,\n", " Text xxbos what factors are the in the in the north and in the north - based production ?)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inputs[701], targets[701], outputs[701]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Text xxbos en quoi consiste la politique des retombées industrielles et régionales ( rir ) ?,\n", " Text xxbos what is the industrial and regional benefits ( irb ) policy ?,\n", " Text xxbos what is the policy policy ( policy ) ?)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inputs[4001], targets[4001], outputs[4001]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, 
"outputs": [], "source": [ "#get_bleu(learn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Attention" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Attention is a technique that makes use of the encoder's output: instead of discarding it, we combine it with the decoder's hidden state to focus on specific words of the input sentence when predicting each word of the output sentence. Specifically, at every decoding step we compute attention weights over the input positions, take the linear combination of the encoder outputs weighted by them, and concatenate the result to the decoder's input." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def init_param(*sz): return nn.Parameter(torch.randn(sz)/math.sqrt(sz[0]))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class Seq2SeqQRNN(nn.Module):\n", " def __init__(self, emb_enc, emb_dec, n_hid, max_len, n_layers=2, p_inp:float=0.15, p_enc:float=0.25, \n", " p_dec:float=0.1, p_out:float=0.35, p_hid:float=0.05, bos_idx:int=0, pad_idx:int=1):\n", " super().__init__()\n", " self.n_layers,self.n_hid,self.max_len,self.bos_idx,self.pad_idx = n_layers,n_hid,max_len,bos_idx,pad_idx\n", " self.emb_enc = emb_enc\n", " self.emb_enc_drop = nn.Dropout(p_inp)\n", " self.encoder = QRNN(emb_enc.weight.size(1), n_hid, n_layers=n_layers, dropout=p_enc, bidirectional=True)\n", " self.out_enc = nn.Linear(2*n_hid, emb_enc.weight.size(1), bias=False)\n", " self.hid_dp = nn.Dropout(p_hid)\n", " self.emb_dec = emb_dec\n", " emb_sz = emb_dec.weight.size(1)\n", " self.decoder = QRNN(emb_sz + 2*n_hid, emb_dec.weight.size(1), n_layers=n_layers, dropout=p_dec)\n", " self.out_drop = nn.Dropout(p_out)\n", " self.out = nn.Linear(emb_sz, emb_dec.weight.size(0))\n", " self.out.weight.data = self.emb_dec.weight.data #Try tying\n", " self.enc_att = nn.Linear(2*n_hid, emb_sz, bias=False)\n", " self.hid_att = nn.Linear(emb_sz, 
emb_sz)\n", " self.V = init_param(emb_sz)\n", " 
self.pr_force = 0.\n", " \n", " def forward(self, inp, targ=None):\n", " bs,sl = inp.size()\n", " hid = self.initHidden(bs)\n", " emb = self.emb_enc_drop(self.emb_enc(inp))\n", " enc_out, hid = self.encoder(emb, hid)\n", " \n", " hid = hid.view(2,self.n_layers, bs, self.n_hid).permute(1,2,0,3).contiguous()\n", " hid = self.out_enc(self.hid_dp(hid).view(self.n_layers, bs, 2*self.n_hid))\n", "\n", " dec_inp = inp.new_zeros(bs).long() + self.bos_idx\n", " res = []\n", " enc_att = self.enc_att(enc_out)\n", " for i in range(self.max_len):\n", " hid_att = self.hid_att(hid[-1])\n", " u = torch.tanh(enc_att + hid_att[:,None])\n", " attn_wgts = F.softmax(u @ self.V, 1)\n", " ctx = (attn_wgts[...,None] * enc_out).sum(1)\n", " emb = self.emb_dec(dec_inp)\n", " outp, hid = self.decoder(torch.cat([emb, ctx], 1)[:,None], hid)\n", " outp = self.out(self.out_drop(outp[:,0]))\n", " res.append(outp)\n", " dec_inp = outp.data.max(1)[1]\n", " if (dec_inp==self.pad_idx).all(): break\n", " if (targ is not None) and (random.random()<self.pr_force):\n", " if i>=targ.shape[1]: break\n", " dec_inp = targ[:,i]\n", " return torch.stack(res, dim=1)\n", " \n", " def initHidden(self, bs): return one_param(self).new_zeros(2*self.n_layers, bs, self.n_hid)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "emb_enc = torch.load(path/'models'/'fr_emb.pth')\n", "emb_dec = torch.load(path/'models'/'en_emb.pth')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = Seq2SeqQRNN(emb_enc, emb_dec, 256, 30, n_layers=2)\n", "learn = Learner(data, model, loss_func=seq2seq_loss, metrics=[seq2seq_acc, CorpusBLEU(len(data.y.vocab.itos))],\n", " callback_fns=partial(TeacherForcing, end_epoch=8))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", 
"output_type": "stream", "text": [ "LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.\n" ] } ], "source": [ "learn.lr_find()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "learn.recorder.plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
epoch  train_loss  valid_loss  seq2seq_acc  bleu      time
0      2.452436    4.709918    0.412980     0.208454  01:03
1      2.137345    4.476718    0.422126     0.344813  00:57
2      1.974048    3.824592    0.472997     0.377652  00:58
3      1.813645    3.864258    0.470798     0.389968  00:57
4      1.818273    4.042902    0.456217     0.390355  00:56
5      1.668895    3.635575    0.482699     0.411627  00:56
6      1.620335    3.741779    0.474715     0.410962  00:56
7      1.852314    3.721396    0.471986     0.402945  00:55
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "learn.fit_one_cycle(8, 3e-3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " \n", " \n", " 100.00% [152/152 00:17<00:00]\n", "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "inputs, targets, outputs = get_predictions(learn)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Text xxbos pour quelle raison demandez - vous aux émetteurs des renseignements qui n'ont pas à être fournis sur les reçus papier remis aux contribuables ?,\n", " Text xxbos why are your requiring xxunk to provide information that is not required to be on the paper receipts given to clients ?,\n", " Text xxbos why do you think to the information that the information that not be provided on the payment ?)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inputs[700], targets[700], outputs[700]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Text xxbos quels facteurs sont responsables des différences de concentrations des contaminants présents dans les poissons dans les cours d’eau et les lacs du nord ?,\n", " Text xxbos what factors are responsible for the differences in the level of contaminants found fish in northern rivers and lakes ?,\n", " Text xxbos what factors are the of the levels of contaminants in in in water in water and water in the north ?)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inputs[701], targets[701], outputs[701]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Text xxbos quels sont les avantages et les inconvénients à ce jour de cette approche ?,\n", " Text xxbos what are the advantages and disadvantages of this approach to date ?,\n", " Text xxbos what are the advantages and disadvantages of this approach ?)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inputs[4002], targets[4002], outputs[4002]" ] }, { 
"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 2 }