{ "cells": [ { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "e38c0aad-6ca2-456d-b908-cbbec3c9e34d" } }, "source": [ "# French Bidirectional Language Model (LM) from scratch\n", "### (QRNN architecture, SentencePiece tokenizer)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Author: [Pierre Guillou](https://www.linkedin.com/in/pierreguillou)\n", "- Date: September 2019\n", "- Post on Medium: [link](https://medium.com/@pierre_guillou/nlp-fastai-french-language-model-d0e2a9e12cab)\n", "- Ref: [Fastai v1](https://docs.fast.ai/) (Deep Learning library on PyTorch)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Information**\n", "\n", "According to the article \"[MultiFiT: Efficient Multi-lingual Language Model Fine-tuning](https://arxiv.org/abs/1909.04761)\" (September 10, 2019), the QRNN architecture and the SentencePiece tokenizer give better results than AWD-LSTM and the spaCy tokenizer, respectively. Therefore, they have been used in this notebook to train a French Bidirectional Language Model on a Wikipedia corpus of 100 million tokens. 
\n", "\n", "In addition, the hyperparameter values given at the end of the article have been used.\n", "\n", "**Wikipedia corpus**\n", "- download: 512 659 articles of 492 596 078 tokens\n", "- used: 252 898 articles of 100 716 190 tokens\n", "\n", "**Hyperparameter values**\n", "- (batch size) bs = 50\n", "- (QRNN) 3 QRNN layers (default: 3) with 1152 hidden units each (default: 1152)\n", "- (SentencePiece) vocab of 15000 tokens\n", "- (dropout) drop_mult = 0\n", "- (weight decay) wd = 0.01\n", "- (number of training epochs) 10 epochs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Accuracy and Perplexity" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- forward : (accuracy) 40.99% | (perplexity) 19.96\n", "- backward: (accuracy) 47.19% | (perplexity) 19.47" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### To be improved" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The French Bidirectional Language Model should be retrained with the MultiFiT hyperparameter values, such as 4 QRNN layers (and not 3), 1550 hidden units per layer (and not 1152), no dropout, a batch size of 50, etc." 
] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "## Initialisation" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "hidden": true, "nbpresent": { "id": "151cd18f-76e3-440f-a8c7-ffa5c6b5da01" } }, "outputs": [], "source": [ "from fastai import *\n", "from fastai.text import *\n", "from fastai.callbacks import *\n", "\n", "import matplotlib.pyplot as plt\n", "\n", "%reload_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "hidden": true, "nbpresent": { "id": "96f02439-3586-4c9d-8c34-aa7c3b17a0a6" } }, "outputs": [], "source": [ "# batch size to be chosen according to your GPU\n", "# bs=48\n", "# bs=24\n", "bs=50" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "hidden": true, "nbpresent": { "id": "6ceb4db2-e4cf-4fe0-a393-91df4a7ed3e7" } }, "outputs": [], "source": [ "torch.cuda.set_device(0)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "hidden": true, "nbpresent": { "id": "6329e650-fc03-4323-ac0c-3aa280e0de91" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fastai: 1.0.57\n", "cuda: True\n" ] } ], "source": [ "import fastai\n", "print(f'fastai: {fastai.__version__}')\n", "print(f'cuda: {torch.cuda.is_available()}')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "hidden": true, "nbpresent": { "id": "6f24e68b-3df0-4997-8a50-3a37ea6a5257" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "\r\n", "```text\r\n", "=== Software === \r\n", "python : 3.7.4\r\n", "fastai : 1.0.57\r\n", "fastprogress : 0.1.21\r\n", "torch : 1.2.0\r\n", "nvidia driver : 410.104\r\n", "torch cuda : 10.0.130 / is available\r\n", "torch cudnn : 7602 / is enabled\r\n", "\r\n", "=== Hardware === \r\n", "nvidia gpus : 1\r\n", "torch devices : 1\r\n", " - gpu0 : 16130MB | Tesla V100-SXM2-16GB\r\n", "\r\n", "=== Environment === \r\n", "platform : 
Linux-4.9.0-9-amd64-x86_64-with-debian-9.9\r\n", "distro : #1 SMP Debian 4.9.168-1+deb9u5 (2019-08-11)\r\n", "conda env : base\r\n", "python : /opt/anaconda3/bin/python\r\n", "sys.path : /home/jupyter/tutorials/fastai/course-nlp\r\n", "/opt/anaconda3/lib/python37.zip\r\n", "/opt/anaconda3/lib/python3.7\r\n", "/opt/anaconda3/lib/python3.7/lib-dynload\r\n", "/opt/anaconda3/lib/python3.7/site-packages\r\n", "/opt/anaconda3/lib/python3.7/site-packages/IPython/extensions\r\n", "```\r\n", "\r\n", "Please make sure to include opening/closing ``` when you paste into forums/github to make the reports appear formatted as code sections.\r\n", "\r\n", "Optional package(s) to enhance the diagnostics can be installed with:\r\n", "pip install distro\r\n", "Once installed, re-run this utility to get the additional information\r\n" ] } ], "source": [ "!python -m fastai.utils.show_install" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "hidden": true, "nbpresent": { "id": "194a6989-31f1-4702-b94d-32b974ded8e6" } }, "outputs": [], "source": [ "data_path = Config.data_path()" ] }, { "cell_type": "markdown", "metadata": { "hidden": true, "nbpresent": { "id": "cf070ab7-babb-4cf0-a315-401f65461dc8" } }, "source": [ "This will create a `{lang}wiki` folder, containing a `{lang}wiki` text file with the wikipedia contents. 
(For other languages, replace `{lang}` with the appropriate code from the [list of Wikipedias](https://meta.wikimedia.org/wiki/List_of_Wikipedias).)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "hidden": true, "nbpresent": { "id": "70da588b-8af1-4f97-97c2-c9f2d4d46e1a" } }, "outputs": [], "source": [ "lang = 'fr'" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "hidden": true, "nbpresent": { "id": "701ab344-0430-4f43-bbe2-337a12cae6be" } }, "outputs": [], "source": [ "name = f'{lang}wiki'\n", "path = data_path/name\n", "path.mkdir(exist_ok=True, parents=True)\n", "\n", "lm_fns2 = [f'{lang}_wt_sp15', f'{lang}_wt_vocab_sp15']\n", "lm_fns2_bwd = [f'{lang}_wt_sp15_bwd', f'{lang}_wt_vocab_sp15_bwd']" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "nbpresent": { "id": "bfe49910-58e0-4be3-aba1-7733dc18cca2" } }, "source": [ "## Data (French Wikipedia)" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true, "nbpresent": { "id": "4e67d876-c7d0-4bdf-a6f9-ae06ae1fc023" } }, "source": [ "### Download data" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "hidden": true, "nbpresent": { "id": "dd2fd658-b690-484c-b60a-69dc6b7bf384" } }, "outputs": [], "source": [ "from nlputils import split_wiki,get_wiki\n", "from nlputils2 import *" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "hidden": true, "nbpresent": { "id": "28c01920-f13c-493e-9a97-e5b2c24133a8" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "extracting...\n", "CPU times: user 52 ms, sys: 12 ms, total: 64 ms\n", "Wall time: 17min 30s\n" ] } ], "source": [ "%%time\n", "get_wiki(path,lang)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "100000\n", "200000\n", "300000\n", "400000\n", "500000\n", "600000\n", "700000\n", "800000\n", "900000\n", "1000000\n", 
"1100000\n", "1200000\n", "1300000\n", "1400000\n", "1500000\n", "1600000\n", "1700000\n", "1800000\n", "1900000\n", "2000000\n", "2100000\n", "2200000\n", "2300000\n", "2400000\n", "2500000\n", "2600000\n", "2700000\n", "2800000\n", "2900000\n", "3000000\n", "3100000\n", "3200000\n", "3300000\n", "3400000\n", "3500000\n", "3600000\n", "3700000\n", "3800000\n", "3900000\n", "4000000\n", "4100000\n", "4200000\n", "4300000\n", "4400000\n", "4500000\n", "4600000\n", "4700000\n", "4800000\n", "4900000\n", "5000000\n", "5100000\n", "5200000\n", "5300000\n", "5400000\n", "5500000\n", "5600000\n", "5700000\n", "5800000\n", "5900000\n", "6000000\n", "6100000\n", "6200000\n", "6300000\n", "6400000\n", "6500000\n", "6600000\n", "6700000\n", "6800000\n", "6900000\n", "7000000\n", "7100000\n", "7200000\n", "7300000\n", "7400000\n", "7500000\n", "7600000\n", "7700000\n", "7800000\n", "7900000\n", "8000000\n", "8100000\n", "8200000\n", "8300000\n", "8400000\n", "8500000\n", "8600000\n", "8700000\n", "8800000\n", "8900000\n", "9000000\n", "9100000\n", "9200000\n", "9300000\n", "9400000\n", "9500000\n", "9600000\n", "9700000\n", "9800000\n", "9900000\n", "10000000\n", "10100000\n", "10200000\n", "10300000\n", "10400000\n", "10500000\n", "10600000\n", "10700000\n", "10800000\n", "10900000\n", "11000000\n", "11100000\n", "11200000\n", "11300000\n", "11400000\n", "11500000\n", "11600000\n", "11700000\n", "11800000\n", "11900000\n", "12000000\n", "12100000\n", "12200000\n", "12300000\n", "12400000\n", "12500000\n", "12600000\n", "12700000\n", "12800000\n", "12900000\n", "13000000\n", "13100000\n", "13200000\n", "13300000\n", "13400000\n", "13500000\n", "13600000\n", "13700000\n", "13800000\n", "13900000\n", "14000000\n", "14100000\n", "14200000\n", "14300000\n", "14400000\n", "14500000\n", "14600000\n", "14700000\n", "14800000\n", "14900000\n", "15000000\n", "15100000\n", "15200000\n", "15300000\n", "15400000\n", "15500000\n", "15600000\n", "15700000\n", "15800000\n", "15900000\n", 
"16000000\n", "16100000\n", "16200000\n", "16300000\n", "16400000\n", "16500000\n", "16600000\n", "16700000\n", "16800000\n", "16900000\n", "17000000\n", "17100000\n", "17200000\n", "17300000\n", "17400000\n", "17500000\n", "17600000\n", "17700000\n", "17800000\n", "17900000\n", "18000000\n", "18100000\n", "18200000\n", "18300000\n", "18400000\n", "18500000\n", "18600000\n", "18700000\n", "18800000\n", "18900000\n", "19000000\n", "19100000\n", "19200000\n", "19300000\n", "19400000\n", "19500000\n", "19600000\n", "19700000\n", "19800000\n", "19900000\n", "20000000\n", "20100000\n", "20200000\n", "20300000\n", "20400000\n", "20500000\n", "CPU times: user 37.7 s, sys: 14.8 s, total: 52.6 s\n", "Wall time: 1min 19s\n" ] }, { "data": { "text/plain": [ "PosixPath('/home/jupyter/.fastai/data/frwiki/docs')" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "split_wiki(path,lang)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "hidden": true, "nbpresent": { "id": "e6eae780-775e-45e9-9b99-b8a87d5fb8ff" } }, "outputs": [ { "data": { "text/plain": [ "[PosixPath('/home/jupyter/.fastai/data/frwiki/frwiki-latest-pages-articles.xml'),\n", " PosixPath('/home/jupyter/.fastai/data/frwiki/frwiki'),\n", " PosixPath('/home/jupyter/.fastai/data/frwiki/wikiextractor'),\n", " PosixPath('/home/jupyter/.fastai/data/frwiki/frwiki-latest-pages-articles.xml.bz2'),\n", " PosixPath('/home/jupyter/.fastai/data/frwiki/log')]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path.ls()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "hidden": true, "nbpresent": { "id": "e1ac63e7-1cbb-4996-838d-dc58446a65ef" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "Antoine Meillet\r\n", "\r\n", "Paul Jules Antoine Meillet, né le à Moulins (Allier) et mort le à Châteaumeillant (Cher), est le principal linguiste français des premières décennies 
du . Il est aussiphilologue.\r\n" ] } ], "source": [ "!head -n4 {path}/{name}" ] }, { "cell_type": "markdown", "metadata": { "hidden": true, "nbpresent": { "id": "ae770e72-e7a9-473d-8454-2020a0263be8" } }, "source": [ "The `split_wiki` function splits the single Wikipedia file into a separate file per article, which is often easier to work with." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "hidden": true, "nbpresent": { "id": "d23e0ef7-21e5-4cc5-945d-60ee33c02ce3" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1min 31s, sys: 28.7 s, total: 2min\n", "Wall time: 8min 37s\n" ] } ], "source": [ "# %%time\n", "# folder = \"docs\"\n", "# clean_files(path,folder)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "hidden": true, "nbpresent": { "id": "92b0b087-a6a8-403a-a7a1-d9b47757e5cf" } }, "outputs": [ { "data": { "text/plain": [ "[PosixPath('/home/jupyter/.fastai/data/frwiki/docs/Min_y_.txt'),\n", " PosixPath('/home/jupyter/.fastai/data/frwiki/docs/Darius_Johnson_Odom.txt'),\n", " PosixPath('/home/jupyter/.fastai/data/frwiki/docs/Guillaume_Cherel.txt'),\n", " PosixPath('/home/jupyter/.fastai/data/frwiki/docs/Henk_Badings.txt'),\n", " PosixPath('/home/jupyter/.fastai/data/frwiki/docs/Henri_de_Virel.txt')]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dest = path/'docs'\n", "dest.ls()[:5]" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Darius Earvin Johnson-Odom, né le , à Youngsville, en Caroline du Nord, est un joueur américain de basket-ball. Il évolue au poste d'arrière.\r\n", "\r\n", "Johnson-Odom passe trois saisons à l'université de Marquette avant d'être sélectionné à la de la draft 2012 de la NBA par les Mavericks de Dallas, qui le transfèrent immédiatement aux Lakers de Los Angeles. 
Le 15 septembre 2012, il signe son contrat rookie avec les Lakers. Johnson-Odom est envoyé, plusieurs fois durant la saison 2012-2013, chez les D-Fenders de Los Angeles, l'équipe de D-League affiliée aux Lakers.\r\n", "\r\n" ] } ], "source": [ "!head -n4 {dest}/'Darius_Johnson_Odom.txt'" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true, "nbpresent": { "id": "575fa672-7b3a-4238-923f-ec929d3a00ee" } }, "source": [ "### Size of downloaded data in the docs folder" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "hidden": true, "nbpresent": { "id": "270470c2-e0eb-446a-9654-de6c45bc4f0d" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "512659 files - 492596078 tokens\n", "CPU times: user 57.1 s, sys: 18.9 s, total: 1min 16s\n", "Wall time: 2min 49s\n" ] } ], "source": [ "%%time\n", "num_files, num_tokens = get_num_tokens(dest)\n", "print(f'{num_files} files - {num_tokens} tokens')" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true, "nbpresent": { "id": "daae36a0-d90b-45ad-b0e3-a2cd56ce7079" } }, "source": [ "### Create a corpus of about 100 million tokens" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "hidden": true, "nbpresent": { "id": "6c383e0e-f4f6-46e5-9f54-469437d66f07" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "files copied to the new corpus folder: /home/jupyter/.fastai/data/frwiki/corpus_100000000\n", "CPU times: user 13.3 s, sys: 8.85 s, total: 22.1 s\n", "Wall time: 22.1 s\n" ] } ], "source": [ "%%time\n", "path_corpus = get_corpus(dest, path, num_tokens, obj_tokens=1e8)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "hidden": true, "nbpresent": { "id": "b597308e-9521-4637-9e84-85a6cd2ace85" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "252898 files - 100716190 tokens\n", "CPU times: user 13.9 s, sys: 2.92 s, total: 16.8 s\n", "Wall time: 2min 
9s\n" ] } ], "source": [ "%%time\n", "# verify the number of tokens in the corpus folder\n", "num_files_corpus, num_tokens_corpus = get_num_tokens(path_corpus)\n", "print(f'{num_files_corpus} files - {num_tokens_corpus} tokens')" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "hidden": true }, "outputs": [], "source": [ "# rename the corpus folder\n", "!mv {path}/'corpus_100000000' {path}/'corpus2_100'" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "## Databunch" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "hidden": true }, "outputs": [], "source": [ "dest = path/'corpus2_100'" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "### Forward" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "hidden": true, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 41min 54s, sys: 20.3 s, total: 42min 14s\n", "Wall time: 20min 14s\n" ] } ], "source": [ "%%time\n", "data = (TextList.from_folder(dest, processor=[OpenFileProcessor(), SPProcessor(max_vocab_sz=15000)])\n", " .split_by_rand_pct(0.1, seed=42)\n", " .label_for_lm()\n", " .databunch(bs=bs, num_workers=1))" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "hidden": true }, "outputs": [], "source": [ "data.save(f'{path}/{lang}_databunch_corpus2_100_sp15')" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "(15000, 227609)" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(data.vocab.itos),len(data.train_ds)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "(15000, 15000)" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"len(data.vocab.itos),len(data.vocab.stoi)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "hidden": true, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "Text ▁xxbos ▁xxmaj ▁dar ius ▁xxmaj ▁ e ar vin ▁johnson - o dom , ▁né ▁le ▁ , ▁à ▁xxmaj ▁young s ville , ▁en ▁xxmaj ▁caroline ▁du ▁xxmaj ▁nord , ▁est ▁un ▁joueur ▁américain ▁de ▁basket - ball . ▁xxmaj ▁il ▁évolue ▁au ▁poste ▁d ' arrière . ▁johnson - o dom ▁passe ▁trois ▁saisons ▁à ▁l ' université ▁de ▁xxmaj ▁marque tte ▁avant ▁d ' être ▁sélectionné ▁à ▁la ▁de ▁la ▁draft ▁2012 ▁de ▁la ▁xxup ▁nba ▁par ▁les ▁xxmaj ▁ma ve rick s ▁de ▁xxmaj ▁dallas , ▁qui ▁le ▁trans f èrent ▁immédiatement ▁aux ▁xxmaj ▁la ker s ▁de ▁xxmaj ▁los ▁xxmaj ▁angeles . ▁xxmaj ▁le ▁15 ▁septembre ▁2012, ▁il ▁signe ▁son ▁contrat ▁ro ok ie ▁avec ▁les ▁xxmaj ▁la ker s . ▁johnson - o dom ▁est ▁envoyé , ▁plusieurs ▁fois ▁durant ▁la ▁saison ▁2012-2013 , ▁chez ▁les ▁d - f en der s ▁de ▁xxmaj ▁los ▁xxmaj ▁angeles , ▁l ' équipe ▁de ▁d - le a gue ▁affilié e ▁aux ▁xxmaj ▁la ker s . ▁xxmaj ▁le ▁7 ▁janvier ▁2013, ▁johnson - o dom ▁est ▁coupé ▁par ▁les ▁xxmaj ▁la ker s . ▁xxmaj ▁c ' était ▁le ▁dernier ▁jour ▁pour ▁les ▁équipes ▁xxup ▁nba ▁pour ▁libérer ▁des ▁joueurs ▁dont ▁le ▁contrat ▁n ' est ▁pas ▁garanti ▁avant ▁que ▁leur ▁contrat ▁de vienne ▁garanti ▁jusqu ' à ▁la ▁fin ▁de ▁la ▁saison . ▁xxmaj ▁il ▁joue ▁quatre ▁matches ▁et ▁seulement ▁six ▁minutes ▁au ▁total ▁avec ▁les ▁xxmaj ▁la ker s , ▁passant ▁la ▁plupart ▁de ▁son ▁temps ▁en ▁d - le a gue ▁où ▁il ▁est ▁le ▁meilleur ▁marque ur ▁des ▁d - f en der s , ▁avec ▁21 ▁points ▁par ▁match . ▁xxmaj ▁le ▁24 ▁janvier ▁2013, ▁johnson - o dom ▁part ▁en ▁xxmaj ▁russie ▁où ▁il ▁signe ▁avec ▁le ▁xxmaj ▁ sparta k ▁saint - pétersbourg ▁pour ▁le ▁reste ▁de ▁la ▁saison ▁2012-2013 . ▁xxmaj ▁il ▁participe ▁avec ▁les ▁xxmaj ▁celtic s ▁de ▁xxmaj ▁boston ▁à ▁la ▁xxmaj ▁summer ▁xxmaj ▁league ▁2013 ▁d ' orlando . 
▁xxmaj ▁le ▁25 ▁septembre ▁2013, ▁il ▁rejoint ▁les ▁xxmaj ▁la ker s ▁de ▁xxmaj ▁los ▁xxmaj ▁angeles ▁pour ▁participer ▁au ▁camp ▁d ' entraînement . ▁xxmaj ▁toutefois , ▁ils ▁le ▁libère nt ▁le ▁16 ▁octobre . ▁xxmaj ▁le ▁18 ▁octobre ▁2013, ▁il ▁signe ▁en ▁xxmaj ▁chine ▁au ▁ . ▁xxmaj ▁en ▁novembre ▁2013, ▁après ▁quatre ▁matches ▁de ▁saison ▁régulière , ▁il ▁quitte ▁les ▁xxmaj ▁blue ▁xxmaj ▁wh ales . ▁xxmaj ▁le ▁3 ▁janvier ▁2014, ▁il ▁rejoint ▁l ' armor ▁de ▁xxmaj ▁ spring field ▁en ▁d - le a gue . ▁xxmaj ▁le ▁14 ▁mars ▁2014, ▁il ▁signe ▁un ▁contrat ▁de ▁dix ▁jours ▁avec ▁les ▁ 76 ers ▁de ▁xxmaj ▁philadelphie . ▁xxmaj ▁le ▁24 ▁mars ▁2014, ▁il ▁ne ▁signe ▁pas ▁de ▁second ▁contrat ▁de ▁dix ▁jours ▁à ▁la ▁fin ▁du ▁premier . ▁johnson - o dom ▁détient ▁actuellement ▁le ▁record ▁du ▁plus ▁grand ▁nombre ▁de ▁tirs ▁tenté s ▁sans ▁en ▁réussi r ▁un ▁seul ▁en ▁xxup ▁nba ▁avec ▁11 ▁tirs . ▁xxmaj ▁le ▁2 ▁août ▁2014, ▁il ▁signe ▁en ▁xxmaj ▁italie , ▁au ▁xxmaj ▁c ant ù ▁pour ▁la ▁saison ▁2014-2015 . ▁xxmaj ▁le ▁11 ▁décembre ▁2014, ▁il ▁est ▁nommé ▁xxup ▁m v p ▁de ▁la ▁de ▁l ' euro coup e . ▁xxmaj ▁le ▁14 ▁juin ▁2015, ▁il ▁signe ▁en ▁xxmaj ▁turquie ▁au ▁xxmaj ▁tra b zon s por ▁xxmaj ▁basket ball ▁pour ▁la ▁saison ▁2015-2016 . ▁xxmaj ▁le ▁28 ▁décembre ▁2015, ▁il ▁signe ▁en ▁xxmaj ▁grèce , ▁à ▁l ' olympia k ó s ▁pour ▁le ▁reste ▁de ▁la ▁saison ▁2015-2016 ▁et ▁le ▁xxmaj ▁top ▁16 ▁de ▁l ' euro ligu e . ▁xxmaj ▁le ▁11 ▁juin ▁2016, ▁il ▁retourne ▁en ▁xxmaj ▁italie ▁où ▁il ▁signe ▁à ▁xxmaj ▁sa s s ari . 
▁xxmaj ▁les ▁statistiques ▁en ▁matchs ▁universitaires ▁de ▁xxmaj ▁dar ius ▁johnson - o dom ▁sont ▁les ▁suivantes ▁:" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.train_ds.x[1]" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "### Backward" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 42min 18s, sys: 23.5 s, total: 42min 42s\n", "Wall time: 20min 30s\n" ] } ], "source": [ "%%time\n", "data = (TextList.from_folder(dest, processor=[OpenFileProcessor(), SPProcessor(max_vocab_sz=15000)])\n", " .split_by_rand_pct(0.1, seed=42)\n", " .label_for_lm()\n", " .databunch(bs=bs, num_workers=1, backwards=True))\n", "\n", "data.save(f'{path}/{lang}_databunch_corpus2_100_sp15_bwd')" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "nbpresent": { "id": "052df7c2-f57b-4596-9189-c9e01102e5e9" } }, "source": [ "## Training" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "### Forward" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "hidden": true, "nbpresent": { "id": "defab6d9-ba04-4943-b574-9300e20d5e1c" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3.94 s, sys: 1.81 s, total: 5.75 s\n", "Wall time: 12.2 s\n" ] } ], "source": [ "%%time\n", "data = load_data(path, f'{lang}_databunch_corpus2_100_sp15', bs=bs)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "hidden": true }, "outputs": [], "source": [ "config = awd_lstm_lm_config.copy()\n", "config['qrnn'] = True" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "hidden": true, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 248 ms, sys: 36 ms, total: 284 ms\n", "Wall time: 284 
ms\n" ] } ], "source": [ "%%time\n", "perplexity = Perplexity()\n", "learn = language_model_learner(data, AWD_LSTM, config=config, drop_mult=0., pretrained=False, \n", " metrics=[error_rate, accuracy, perplexity]).to_fp16()" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "hidden": true, "nbpresent": { "id": "e74f8988-7c26-495b-90fb-0da012f54c1a" } }, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.\n" ] } ], "source": [ "learn.lr_find()" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "hidden": true, "nbpresent": { "id": "e0333edf-f5cd-4c7d-9db3-01dfb7cc4bb8" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "learn.recorder.plot()" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "hidden": true, "nbpresent": { "id": "55299290-8c05-415f-8f73-a3656c1d488c" } }, "outputs": [], "source": [ "lr = 3e-3\n", "lr *= bs/48 # Scale learning rate by batch size" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "nbpresent": { "id": "debf9e8a-b4ec-4b06-8344-a8f01150d941" } }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " \n", " \n", " 30.00% [3/10 1:27:10<3:23:23]\n", "
\n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
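<table><tr><th>epoch</th><th>train_loss</th><th>valid_loss</th><th>error_rate</th><th>accuracy</th><th>perplexity</th><th>time</th></tr>
<tr><td>0</td><td>3.456712</td><td>3.493251</td><td>0.652252</td><td>0.347747</td><td>32.892696</td><td>28:46</td></tr>
<tr><td>1</td><td>3.243398</td><td>3.353261</td><td>0.640550</td><td>0.359450</td><td>28.595928</td><td>29:06</td></tr>
<tr><td>2</td><td>3.228111</td><td>3.336011</td><td>0.638784</td><td>0.361217</td><td>28.106716</td><td>29:10</td></tr></table>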

\n", "\n", "

\n", " \n", " \n", " 76.71% [33696/43928 21:02<06:23 3.2003]\n", "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Better model found at epoch 0 with accuracy value: 0.34774690866470337.\n", "Better model found at epoch 1 with accuracy value: 0.3594495952129364.\n", "Better model found at epoch 2 with accuracy value: 0.36121678352355957.\n" ] } ], "source": [ "%%time\n", "learn.unfreeze()\n", "wd = 0.01 \n", "learn.fit_one_cycle(10, lr, wd=wd, moms=(0.8,0.7), \n", " callbacks=[ShowGraph(learn),\n", " SaveModelCallback(learn.to_fp32(), monitor='accuracy', name='bestmodel_sp15')])" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "GCP instance has stopped. Need to reload bestmodel_sp15 and vocab to continue the training." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "hidden": true }, "outputs": [], "source": [ "mdl_path = path/'models'\n", "mdl_path.mkdir(exist_ok=True)\n", "lm_fns2[0] = 'bestmodel_sp15'\n", "data.vocab.save(mdl_path/(lm_fns2[1] + '.pkl'))" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "['bestmodel_sp15', 'fr_wt_vocab_sp15']" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lm_fns2" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "hidden": true }, "outputs": [], "source": [ "config = awd_lstm_lm_config.copy()\n", "config['qrnn'] = True" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "hidden": true, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 4.43 s, sys: 1.5 s, total: 5.94 s\n", "Wall time: 30.4 s\n" ] } ], "source": [ "%%time\n", "perplexity = Perplexity()\n", "learn = language_model_learner(data, AWD_LSTM, config=config, drop_mult=0., pretrained_fnames=lm_fns2, \n", " metrics=[error_rate, accuracy, perplexity]).to_fp16()" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { 
"hidden": true }, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.\n" ] } ], "source": [ "learn.lr_find()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "hidden": true }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "learn.recorder.plot()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "hidden": true }, "outputs": [], "source": [ "lr = 3e-4\n", "lr *= bs/48 # Scale learning rate by batch size" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
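<table><thead><tr><th>epoch</th><th>train_loss</th><th>valid_loss</th><th>error_rate</th><th>accuracy</th><th>perplexity</th><th>time</th></tr></thead><tbody>
<tr><td>0</td><td>3.068131</td><td>3.126621</td><td>0.609930</td><td>0.390071</td><td>22.796721</td><td>28:29</td></tr>
<tr><td>1</td><td>2.995493</td><td>3.070316</td><td>0.602313</td><td>0.397686</td><td>21.548635</td><td>28:26</td></tr>
<tr><td>2</td><td>3.011955</td><td>3.043591</td><td>0.598430</td><td>0.401570</td><td>20.980440</td><td>28:24</td></tr>
<tr><td>3</td><td>2.978858</td><td>3.023281</td><td>0.595229</td><td>0.404770</td><td>20.558710</td><td>28:24</td></tr>
<tr><td>4</td><td>2.960861</td><td>3.005773</td><td>0.592423</td><td>0.407577</td><td>20.201872</td><td>28:26</td></tr>
<tr><td>5</td><td>2.940080</td><td>2.995852</td><td>0.590565</td><td>0.409436</td><td>20.002399</td><td>28:27</td></tr>
<tr><td>6</td><td>2.937475</td><td>2.993789</td><td>0.590111</td><td>0.409888</td><td>19.961227</td><td>28:23</td></tr></tbody></table>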
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Better model found at epoch 0 with accuracy value: 0.3900712728500366.\n", "Better model found at epoch 1 with accuracy value: 0.39768603444099426.\n", "Better model found at epoch 2 with accuracy value: 0.40157046914100647.\n", "Better model found at epoch 3 with accuracy value: 0.40476998686790466.\n", "Better model found at epoch 4 with accuracy value: 0.40757742524147034.\n", "Better model found at epoch 5 with accuracy value: 0.40943607687950134.\n", "Better model found at epoch 6 with accuracy value: 0.4098880887031555.\n", "CPU times: user 2h 36min 55s, sys: 42min 44s, total: 3h 19min 39s\n", "Wall time: 3h 19min 22s\n" ] } ], "source": [ "%%time\n", "learn.unfreeze()\n", "wd = 0.01 \n", "learn.fit_one_cycle(7, lr, wd=wd, moms=(0.8,0.7), \n", " callbacks=[ShowGraph(learn),\n", " SaveModelCallback(learn.to_fp32(), monitor='accuracy', name='bestmodel_sp15')])" ] }, { "cell_type": "markdown", "metadata": { "hidden": true, "nbpresent": { "id": "fc8bfa72-afa8-4429-b3a7-2e4f8cd14cb8" } }, "source": [ "Save the pretrained model and vocab:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "hidden": true, "nbpresent": { "id": "00f7bd36-8558-4a49-8cd2-29a0f836430b" } }, "outputs": [], "source": [ "mdl_path = path/'models'\n", "mdl_path.mkdir(exist_ok=True)\n", "learn.to_fp32().save(mdl_path/lm_fns2[0], with_opt=False)\n", "learn.data.vocab.save(mdl_path/(lm_fns2[1] + '.pkl'))" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "### Backward" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "hidden": true }, "outputs": [], "source": [ "data = load_data(path, f'{lang}_databunch_corpus2_100_sp15_bwd', bs=bs, backwards=True)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "hidden": true }, "outputs": [], "source": [ "config = awd_lstm_lm_config.copy()\n", 
"config['qrnn'] = True" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "hidden": true, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 832 ms, sys: 36 ms, total: 868 ms\n", "Wall time: 868 ms\n" ] } ], "source": [ "%%time\n", "perplexity = Perplexity()\n", "learn = language_model_learner(data, AWD_LSTM, config=config, drop_mult=0., pretrained=False, \n", " metrics=[error_rate, accuracy, perplexity]).to_fp16()" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.\n" ] } ], "source": [ "learn.lr_find()" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "hidden": true }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "learn.recorder.plot()" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "hidden": true }, "outputs": [], "source": [ "lr = 3e-3\n", "lr *= bs/48 # Scale learning rate by batch size" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " \n", " \n", " 90.00% [9/10 4:17:30<28:36]\n", "
\n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
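<table><thead><tr><th>epoch</th><th>train_loss</th><th>valid_loss</th><th>error_rate</th><th>accuracy</th><th>perplexity</th><th>time</th></tr></thead><tbody>
<tr><td>0</td><td>3.464687</td><td>3.499862</td><td>0.588392</td><td>0.411609</td><td>33.110847</td><td>28:31</td></tr>
<tr><td>1</td><td>3.290530</td><td>3.359655</td><td>0.577429</td><td>0.422572</td><td>28.779305</td><td>28:36</td></tr>
<tr><td>2</td><td>3.279709</td><td>3.345511</td><td>0.576788</td><td>0.423212</td><td>28.375286</td><td>28:39</td></tr>
<tr><td>3</td><td>3.218174</td><td>3.291578</td><td>0.570586</td><td>0.429414</td><td>26.885269</td><td>28:37</td></tr>
<tr><td>4</td><td>3.188569</td><td>3.235373</td><td>0.564242</td><td>0.435757</td><td>25.415869</td><td>28:37</td></tr>
<tr><td>5</td><td>3.065708</td><td>3.165833</td><td>0.555168</td><td>0.444832</td><td>23.708519</td><td>28:39</td></tr>
<tr><td>6</td><td>3.036142</td><td>3.089983</td><td>0.545404</td><td>0.454596</td><td>21.976547</td><td>28:20</td></tr>
<tr><td>7</td><td>2.999072</td><td>3.019598</td><td>0.535763</td><td>0.464237</td><td>20.483038</td><td>28:26</td></tr>
<tr><td>8</td><td>2.888412</td><td>2.969108</td><td>0.528063</td><td>0.471937</td><td>19.474483</td><td>28:36</td></tr></tbody></table>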

\n", "\n", "

\n", " \n", " \n", " 19.87% [8728/43928 05:24<21:49 2.9050]\n", "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Better model found at epoch 0 with accuracy value: 0.4116089344024658.\n", "Better model found at epoch 1 with accuracy value: 0.42257198691368103.\n", "Better model found at epoch 2 with accuracy value: 0.42321211099624634.\n", "Better model found at epoch 3 with accuracy value: 0.4294137954711914.\n", "Better model found at epoch 4 with accuracy value: 0.4357571601867676.\n", "Better model found at epoch 5 with accuracy value: 0.4448317289352417.\n", "Better model found at epoch 6 with accuracy value: 0.4545958638191223.\n", "Better model found at epoch 7 with accuracy value: 0.4642367959022522.\n", "Better model found at epoch 8 with accuracy value: 0.47193723917007446.\n" ] } ], "source": [ "%%time\n", "learn.unfreeze()\n", "wd = 0.01\n", "learn.fit_one_cycle(10, lr, wd=wd, moms=(0.8,0.7), \n", " callbacks=[ShowGraph(learn),\n", " SaveModelCallback(learn.to_fp32(), monitor='accuracy', name='bestmodel_sp15_bwd')])" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "The GCP instance stopped during the last epoch (9 of 10 epochs completed). Reload bestmodel_sp15_bwd and the vocab to continue the training."
] }, { "cell_type": "code", "execution_count": 51, "metadata": { "hidden": true }, "outputs": [], "source": [ "mdl_path = path/'models'\n", "mdl_path.mkdir(exist_ok=True)\n", "lm_fns2_bwd[0] = 'bestmodel_sp15_bwd'\n", "data.vocab.save(mdl_path/(lm_fns2_bwd[1] + '.pkl'))" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "hidden": true }, "outputs": [], "source": [ "# mdl_path = path/'models'\n", "# mdl_path.mkdir(exist_ok=True)\n", "# learn.to_fp32().save(mdl_path/lm_fns2_bwd[0], with_opt=False)\n", "# learn.data.vocab.save(mdl_path/(lm_fns2_bwd[1] + '.pkl'))" ] }, { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "29ff5bf7-47d3-4bb6-8ef7-4dbee784acd0" } }, "source": [ "## Generate fake texts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note**: the architecture used for our French LM is based on 3 QRNN layers (the default in the config used above) with about 46 million parameters. This kind of architecture can be sufficient to fine-tune another LM on a specific corpus in order to ultimately create a text classifier (the [ULMFiT](http://nlp.fast.ai/category/classification.html) method), but it is not sufficient to create an efficient text generator (better to use a model such as [GPT-2](https://github.com/openai/gpt-2) or [BERT](https://github.com/google-research/bert)). Moreover, the SentencePiece tokenizer used in this notebook implements subword units (e.g., byte-pair encoding (BPE) [Sennrich et al.] and the unigram language model [Kudo]), so the model can generate subword pieces or single characters from its vocabulary instead of whole words. 
" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "nbpresent": { "id": "903b31b8-77bb-48a7-a6de-fd2584005619" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3.98 s, sys: 1.07 s, total: 5.05 s\n", "Wall time: 5.05 s\n" ] } ], "source": [ "%%time\n", "data = load_data(path, f'{lang}_databunch_corpus2_100_sp15k', bs=bs)" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "config = awd_lstm_lm_config.copy()\n", "config['qrnn'] = True" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "nbpresent": { "id": "f2e132d0-a490-49cb-895a-ee9a66c76c2d" } }, "outputs": [], "source": [ "# LM without pretraining\n", "learn = language_model_learner(data, AWD_LSTM, config=config, pretrained=False)" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "nbpresent": { "id": "7596a7ac-558a-4bd7-80af-247bf0a82732" } }, "outputs": [], "source": [ "# LM pretrained in English\n", "learn_en = language_model_learner(data, AWD_LSTM, pretrained=True)" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "nbpresent": { "id": "cdaac260-7cfc-4255-b03f-e1a88eccdc71" } }, "outputs": [], "source": [ "# LM pretrained in french\n", "learn_fr = language_model_learner(data, AWD_LSTM, config=config, pretrained_fnames=lm_fns2)" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "nbpresent": { "id": "12d9ce7c-fdb1-4108-ab95-764a015f8e0e" }, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Nadal a gagné le tournoi de ▁2011. sncb ▁auprès ▁visible ▁transcription ▁natal griff ▁constituant ▁tétra ▁donnée ▁entière 产 anniversaire ▁journée 马 pend ▁mull ἄ ▁culminant 興 ▁frappe ▁tard ภ agrément ▁joué ὸ hui voy ▁baltique ▁ferdinand cara ▁climatique ե ime ▁déterminé ▁entre ር л fils ▁fièvre chart équipe ▁intégrée avance ▁grands ▁compagnies gonal pa ▁». 
џ ille ▁sub étalon ▁in ▁pistolet etto ▁poser bot ▁théo ▁capacité „ ▁dép ▁plante original în ▁width ▁mathématique ▁criminel ▁condé ▁sortant ▁iv ح স ▁met ▁taxi post cro ▁600 ̞ ▁descend ▁vente ▁renonce ca architecte gou vez ▁menacée ura ▁rat ▁prolongation ▁chrétiens ▁wi ▁maryland ▁restent ▁profond elle ▁tunnel ▁manga ▁accepta diff\n" ] } ], "source": [ "TEXT = \"Nadal a gagné le tournoi de\" # original text\n", "N_WORDS = 100 # number of words to predict following the TEXT\n", "N_SENTENCES = 1 # number of different predictions\n", "\n", "print(\"\\n\".join(learn.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "nbpresent": { "id": "9be3d15c-526a-4ab7-bd46-16e8a8c58142" }, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Nadal a gagné le tournoi de , and the pin - round for the introduction of the modern - day offensive , as well as a ▁funérailles - style version of the \" e \" in the version of the article \" the world of the old , modern . \" ( c . 2000 ) . \" i ' m not just a rich man , but an old friend . i just see it as the right one , \" he had a friend , \" and just a good friend . \" \" i ' m not a good man \" , he\n" ] } ], "source": [ "TEXT = \"Nadal a gagné le tournoi de\" # original text\n", "N_WORDS = 100 # number of words to predict following the TEXT\n", "N_SENTENCES = 1 # number of different predictions\n", "\n", "print(\"\\n\".join(learn_en.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "nbpresent": { "id": "35c213f4-3c90-4774-9cdd-56da02ebe3a8" }, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Nadal a gagné le tournoi de ▁xxmaj ▁ s ot chi ▁en ▁cours ▁de ▁saison ▁ - ▁1981 ▁à ▁xxmaj ▁milan . 
▁xxmaj ▁il ▁est ▁surtout ▁connu ▁pour ▁l ' ac cumul ation ▁de ▁l ' équipe ▁de ▁xxmaj ▁russie ▁xxmaj ▁en ly s ▁xxmaj ▁ pet ter rö th ▁et ▁xxmaj ▁wi ki pé dia ▁xxmaj ▁ gel der k omm odor e ▁qui ▁ont ▁été ▁prêté s ▁à ▁la ▁xxmaj ▁province ▁de ▁xxmaj ▁ s omo gy , ▁qui ▁l ' a ▁nommé ▁gouverneur ▁de ▁l ' ancienne ▁ université ▁de ▁xxmaj ▁ t bili s si . ▁xxmaj ▁il ▁ s ' agit\n" ] } ], "source": [ "TEXT = \"Nadal a gagné le tournoi de\" # original text\n", "N_WORDS = 100 # number of words to predict following the TEXT\n", "N_SENTENCES = 1 # number of different predictions\n", "\n", "print(\"\\n\".join(learn_fr.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))" ] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:root] *", "language": "python", "name": "conda-root-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }