{ "cells": [ { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "e38c0aad-6ca2-456d-b908-cbbec3c9e34d" } }, "source": [ "# (MultiFiT) French Bidirectional Language Model (LM) from scratch\n", "### (4-layer QRNN architecture with 1550 hidden units per layer, SentencePiece tokenizer and hyperparameters from the MultiFiT method)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Author: [Pierre Guillou](https://www.linkedin.com/in/pierreguillou)\n", "- Date: September 2019\n", "- Post on Medium: [link](https://medium.com/@pierre_guillou/nlp-fastai-french-language-model-d0e2a9e12cab)\n", "- Ref: [Fastai v1](https://docs.fast.ai/) (Deep Learning library on PyTorch)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Information**\n", "\n", "According to the article \"[MultiFiT: Efficient Multi-lingual Language Model Fine-tuning](https://arxiv.org/abs/1909.04761)\" (September 10, 2019), a 4-layer QRNN architecture and the SentencePiece tokenizer (15 000 tokens) give better results than AWD-LSTM and the spaCy tokenizer, respectively.\n", "\n", "Therefore, they have been used in this notebook to train a French Bidirectional Language Model on a Wikipedia corpus of about 100 million tokens. 
As you can observe in the Results section, **this French Bidirectional LM is far better than the 2 previous ones I trained** (see [my Language Models repository](https://github.com/piegu/language-models) on GitHub).\n", "\n", "Moreover, the hyperparameter values given at the end of the article have been used, too.\n", "\n", "**Wikipedia corpus**\n", "- downloaded: 512 659 articles, 492 596 078 tokens\n", "- used: 252 898 articles, 100 716 190 tokens\n", "\n", "**Hyperparameter values**\n", "- (batch size) bs = 50\n", "- (QRNN) 4 QRNN layers (default: 3) with 1550 hidden units each (default: 1152)\n", "- (SentencePiece) vocab of 15000 tokens\n", "- (dropout) drop_mult = 0\n", "- (weight decay) wd = 0.01\n", "- (number of training epochs) 10 epochs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " **This Bidirectional French Language Model is far better than the 2 previous ones I trained** (see [my Language Models repository](https://github.com/piegu/language-models) on GitHub)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- forward : (accuracy) 43.77% | (perplexity) 16.09\n", "- backward: (accuracy) 49.29% | (perplexity) 16.58" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**To be improved**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The loss function FlattenedLoss of LabelSmoothingCrossEntropy should be tested, as it is the one used in the MultiFiT method (see the notebook [lm3-portuguese-classifier-TCU-jurisprudencia.ipynb](https://github.com/piegu/language-models/blob/master/lm3-portuguese-classifier-TCU-jurisprudencia.ipynb) to get the code)." 
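] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The idea behind a label-smoothing cross-entropy can be sketched in plain Python for a single token prediction. This is only an illustration under stated assumptions, not the fastai implementation: the function name `label_smoothing_cross_entropy`, the example logits and the value `eps = 0.1` are hypothetical choices made for this sketch. The target distribution puts `1 - eps` on the true token and spreads `eps` uniformly over the whole vocabulary, so the loss is a blend of the usual negative log-likelihood and a cross-entropy against the uniform distribution.

```python
import math

def label_smoothing_cross_entropy(logits, target, eps=0.1):
    # Log-softmax with max-subtraction for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    # Usual negative log-likelihood of the true token.
    nll = -log_probs[target]
    # Cross-entropy against a uniform distribution over the vocabulary.
    uniform = -sum(log_probs) / len(logits)
    # Blend: weight (1 - eps) on the true token, eps spread over all tokens.
    return (1 - eps) * nll + eps * uniform

# Toy vocabulary of 3 tokens (illustrative values); the true token is index 0.
logits = [2.0, 0.5, -1.0]
plain = label_smoothing_cross_entropy(logits, 0, eps=0.0)
smoothed = label_smoothing_cross_entropy(logits, 0, eps=0.1)
print(f'plain cross-entropy: {plain:.4f}')
print(f'label-smoothed loss: {smoothed:.4f}')
```

With `eps = 0` the function reduces to the standard cross-entropy; with `eps > 0` a confident correct prediction is penalised slightly more, which discourages over-confident outputs."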
] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "## Initialisation" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "hidden": true, "nbpresent": { "id": "151cd18f-76e3-440f-a8c7-ffa5c6b5da01" } }, "outputs": [], "source": [ "from fastai import *\n", "from fastai.text import *\n", "from fastai.callbacks import *\n", "\n", "import matplotlib.pyplot as plt\n", "\n", "%reload_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "hidden": true, "nbpresent": { "id": "96f02439-3586-4c9d-8c34-aa7c3b17a0a6" } }, "outputs": [], "source": [ "# batch size to be chosen according to your GPU memory\n", "# bs=48\n", "# bs=24\n", "bs=50" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "hidden": true, "nbpresent": { "id": "6ceb4db2-e4cf-4fe0-a393-91df4a7ed3e7" } }, "outputs": [], "source": [ "torch.cuda.set_device(0)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "hidden": true, "nbpresent": { "id": "6329e650-fc03-4323-ac0c-3aa280e0de91" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fastai: 1.0.57\n", "cuda: True\n" ] } ], "source": [ "import fastai\n", "print(f'fastai: {fastai.__version__}')\n", "print(f'cuda: {torch.cuda.is_available()}')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "hidden": true, "nbpresent": { "id": "6f24e68b-3df0-4997-8a50-3a37ea6a5257" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "\r\n", "```text\r\n", "=== Software === \r\n", "python : 3.7.4\r\n", "fastai : 1.0.57\r\n", "fastprogress : 0.1.21\r\n", "torch : 1.2.0\r\n", "nvidia driver : 410.104\r\n", "torch cuda : 10.0.130 / is available\r\n", "torch cudnn : 7602 / is enabled\r\n", "\r\n", "=== Hardware === \r\n", "nvidia gpus : 1\r\n", "torch devices : 1\r\n", " - gpu0 : 16130MB | Tesla V100-SXM2-16GB\r\n", "\r\n", "=== Environment === \r\n", "platform : 
Linux-4.9.0-9-amd64-x86_64-with-debian-9.9\r\n", "distro : #1 SMP Debian 4.9.168-1+deb9u5 (2019-08-11)\r\n", "conda env : base\r\n", "python : /opt/anaconda3/bin/python\r\n", "sys.path : /home/jupyter/tutorials/fastai/course-nlp\r\n", "/opt/anaconda3/lib/python37.zip\r\n", "/opt/anaconda3/lib/python3.7\r\n", "/opt/anaconda3/lib/python3.7/lib-dynload\r\n", "/opt/anaconda3/lib/python3.7/site-packages\r\n", "/opt/anaconda3/lib/python3.7/site-packages/IPython/extensions\r\n", "```\r\n", "\r\n", "Please make sure to include opening/closing ``` when you paste into forums/github to make the reports appear formatted as code sections.\r\n", "\r\n", "Optional package(s) to enhance the diagnostics can be installed with:\r\n", "pip install distro\r\n", "Once installed, re-run this utility to get the additional information\r\n" ] } ], "source": [ "!python -m fastai.utils.show_install" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "hidden": true, "nbpresent": { "id": "194a6989-31f1-4702-b94d-32b974ded8e6" } }, "outputs": [], "source": [ "data_path = Config.data_path()" ] }, { "cell_type": "markdown", "metadata": { "hidden": true, "nbpresent": { "id": "cf070ab7-babb-4cf0-a315-401f65461dc8" } }, "source": [ "This will create a `{lang}wiki` folder, containing a `{lang}wiki` text file with the wikipedia contents. 
(For other languages, replace `{lang}` with the appropriate code from the [list of wikipedias](https://meta.wikimedia.org/wiki/List_of_Wikipedias).)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "hidden": true, "nbpresent": { "id": "70da588b-8af1-4f97-97c2-c9f2d4d46e1a" } }, "outputs": [], "source": [ "lang = 'fr'" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "hidden": true, "nbpresent": { "id": "701ab344-0430-4f43-bbe2-337a12cae6be" } }, "outputs": [], "source": [ "name = f'{lang}wiki'\n", "path = data_path/name\n", "path.mkdir(exist_ok=True, parents=True)\n", "\n", "lm_fns3 = [f'{lang}_wt_sp15_multifit', f'{lang}_wt_vocab_sp15_multifit']\n", "lm_fns3_bwd = [f'{lang}_wt_sp15_multifit_bwd', f'{lang}_wt_vocab_sp15_multifit_bwd']" ] }, { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "bfe49910-58e0-4be3-aba1-7733dc18cca2" } }, "source": [ "## Data (French wikipedia)" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "nbpresent": { "id": "4e67d876-c7d0-4bdf-a6f9-ae06ae1fc023" } }, "source": [ "### Download data" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "hidden": true, "nbpresent": { "id": "dd2fd658-b690-484c-b60a-69dc6b7bf384" } }, "outputs": [], "source": [ "from nlputils import split_wiki,get_wiki\n", "from nlputils2 import *" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "hidden": true, "nbpresent": { "id": "28c01920-f13c-493e-9a97-e5b2c24133a8" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "extracting...\n", "CPU times: user 52 ms, sys: 12 ms, total: 64 ms\n", "Wall time: 17min 30s\n" ] } ], "source": [ "%%time\n", "get_wiki(path,lang)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "100000\n", "200000\n", "300000\n", "400000\n", "500000\n", "600000\n", "700000\n", "800000\n", "900000\n", "1000000\n", "1100000\n", 
"1200000\n", "1300000\n", "1400000\n", "1500000\n", "1600000\n", "1700000\n", "1800000\n", "1900000\n", "2000000\n", "2100000\n", "2200000\n", "2300000\n", "2400000\n", "2500000\n", "2600000\n", "2700000\n", "2800000\n", "2900000\n", "3000000\n", "3100000\n", "3200000\n", "3300000\n", "3400000\n", "3500000\n", "3600000\n", "3700000\n", "3800000\n", "3900000\n", "4000000\n", "4100000\n", "4200000\n", "4300000\n", "4400000\n", "4500000\n", "4600000\n", "4700000\n", "4800000\n", "4900000\n", "5000000\n", "5100000\n", "5200000\n", "5300000\n", "5400000\n", "5500000\n", "5600000\n", "5700000\n", "5800000\n", "5900000\n", "6000000\n", "6100000\n", "6200000\n", "6300000\n", "6400000\n", "6500000\n", "6600000\n", "6700000\n", "6800000\n", "6900000\n", "7000000\n", "7100000\n", "7200000\n", "7300000\n", "7400000\n", "7500000\n", "7600000\n", "7700000\n", "7800000\n", "7900000\n", "8000000\n", "8100000\n", "8200000\n", "8300000\n", "8400000\n", "8500000\n", "8600000\n", "8700000\n", "8800000\n", "8900000\n", "9000000\n", "9100000\n", "9200000\n", "9300000\n", "9400000\n", "9500000\n", "9600000\n", "9700000\n", "9800000\n", "9900000\n", "10000000\n", "10100000\n", "10200000\n", "10300000\n", "10400000\n", "10500000\n", "10600000\n", "10700000\n", "10800000\n", "10900000\n", "11000000\n", "11100000\n", "11200000\n", "11300000\n", "11400000\n", "11500000\n", "11600000\n", "11700000\n", "11800000\n", "11900000\n", "12000000\n", "12100000\n", "12200000\n", "12300000\n", "12400000\n", "12500000\n", "12600000\n", "12700000\n", "12800000\n", "12900000\n", "13000000\n", "13100000\n", "13200000\n", "13300000\n", "13400000\n", "13500000\n", "13600000\n", "13700000\n", "13800000\n", "13900000\n", "14000000\n", "14100000\n", "14200000\n", "14300000\n", "14400000\n", "14500000\n", "14600000\n", "14700000\n", "14800000\n", "14900000\n", "15000000\n", "15100000\n", "15200000\n", "15300000\n", "15400000\n", "15500000\n", "15600000\n", "15700000\n", "15800000\n", "15900000\n", "16000000\n", 
"16100000\n", "16200000\n", "16300000\n", "16400000\n", "16500000\n", "16600000\n", "16700000\n", "16800000\n", "16900000\n", "17000000\n", "17100000\n", "17200000\n", "17300000\n", "17400000\n", "17500000\n", "17600000\n", "17700000\n", "17800000\n", "17900000\n", "18000000\n", "18100000\n", "18200000\n", "18300000\n", "18400000\n", "18500000\n", "18600000\n", "18700000\n", "18800000\n", "18900000\n", "19000000\n", "19100000\n", "19200000\n", "19300000\n", "19400000\n", "19500000\n", "19600000\n", "19700000\n", "19800000\n", "19900000\n", "20000000\n", "20100000\n", "20200000\n", "20300000\n", "20400000\n", "20500000\n", "CPU times: user 37.7 s, sys: 14.8 s, total: 52.6 s\n", "Wall time: 1min 19s\n" ] }, { "data": { "text/plain": [ "PosixPath('/home/jupyter/.fastai/data/frwiki/docs')" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "split_wiki(path,lang)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "hidden": true, "nbpresent": { "id": "e6eae780-775e-45e9-9b99-b8a87d5fb8ff" } }, "outputs": [ { "data": { "text/plain": [ "[PosixPath('/home/jupyter/.fastai/data/frwiki/frwiki-latest-pages-articles.xml'),\n", " PosixPath('/home/jupyter/.fastai/data/frwiki/frwiki'),\n", " PosixPath('/home/jupyter/.fastai/data/frwiki/wikiextractor'),\n", " PosixPath('/home/jupyter/.fastai/data/frwiki/frwiki-latest-pages-articles.xml.bz2'),\n", " PosixPath('/home/jupyter/.fastai/data/frwiki/log')]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path.ls()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "hidden": true, "nbpresent": { "id": "e1ac63e7-1cbb-4996-838d-dc58446a65ef" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "Antoine Meillet\r\n", "\r\n", "Paul Jules Antoine Meillet, né le à Moulins (Allier) et mort le à Châteaumeillant (Cher), est le principal linguiste français des premières décennies du . 
Il est aussiphilologue.\r\n" ] } ], "source": [ "!head -n4 {path}/{name}" ] }, { "cell_type": "markdown", "metadata": { "hidden": true, "nbpresent": { "id": "ae770e72-e7a9-473d-8454-2020a0263be8" } }, "source": [ "This function splits the single wikipedia file into a separate file per article. This is often easier to work with." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "hidden": true, "nbpresent": { "id": "d23e0ef7-21e5-4cc5-945d-60ee33c02ce3" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1min 31s, sys: 28.7 s, total: 2min\n", "Wall time: 8min 37s\n" ] } ], "source": [ "# %%time\n", "folder = \"docs\"\n", "clean_files(path,folder)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "hidden": true, "nbpresent": { "id": "92b0b087-a6a8-403a-a7a1-d9b47757e5cf" } }, "outputs": [ { "data": { "text/plain": [ "[PosixPath('/home/jupyter/.fastai/data/frwiki/docs/Min_y_.txt'),\n", " PosixPath('/home/jupyter/.fastai/data/frwiki/docs/Darius_Johnson_Odom.txt'),\n", " PosixPath('/home/jupyter/.fastai/data/frwiki/docs/Guillaume_Cherel.txt'),\n", " PosixPath('/home/jupyter/.fastai/data/frwiki/docs/Henk_Badings.txt'),\n", " PosixPath('/home/jupyter/.fastai/data/frwiki/docs/Henri_de_Virel.txt')]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dest = path/'docs'\n", "dest.ls()[:5]" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Darius Earvin Johnson-Odom, né le , à Youngsville, en Caroline du Nord, est un joueur américain de basket-ball. Il évolue au poste d'arrière.\r\n", "\r\n", "Johnson-Odom passe trois saisons à l'université de Marquette avant d'être sélectionné à la de la draft 2012 de la NBA par les Mavericks de Dallas, qui le transfèrent immédiatement aux Lakers de Los Angeles. Le 15 septembre 2012, il signe son contrat rookie avec les Lakers. 
Johnson-Odom est envoyé, plusieurs fois durant la saison 2012-2013, chez les D-Fenders de Los Angeles, l'équipe de D-League affiliée aux Lakers.\r\n", "\r\n" ] } ], "source": [ "!head -n4 {dest}/'Darius_Johnson_Odom.txt'" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "nbpresent": { "id": "575fa672-7b3a-4238-923f-ec929d3a00ee" } }, "source": [ "### Size of downloaded data in the docs folder" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "hidden": true, "nbpresent": { "id": "270470c2-e0eb-446a-9654-de6c45bc4f0d" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "512659 files - 492596078 tokens\n", "CPU times: user 57.1 s, sys: 18.9 s, total: 1min 16s\n", "Wall time: 2min 49s\n" ] } ], "source": [ "%%time\n", "num_files, num_tokens = get_num_tokens(dest)\n", "print(f'{num_files} files - {num_tokens} tokens')" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "nbpresent": { "id": "daae36a0-d90b-45ad-b0e3-a2cd56ce7079" } }, "source": [ "### Create a corpus of about 100 million tokens" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "hidden": true, "nbpresent": { "id": "6c383e0e-f4f6-46e5-9f54-469437d66f07" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "files copied to the new corpus folder: /home/jupyter/.fastai/data/frwiki/corpus_100000000\n", "CPU times: user 13.3 s, sys: 8.85 s, total: 22.1 s\n", "Wall time: 22.1 s\n" ] } ], "source": [ "%%time\n", "path_corpus = get_corpus(dest, path, num_tokens, obj_tokens=1e8)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "hidden": true, "nbpresent": { "id": "b597308e-9521-4637-9e84-85a6cd2ace85" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "252898 files - 100716190 tokens\n", "CPU times: user 13.9 s, sys: 2.92 s, total: 16.8 s\n", "Wall time: 2min 9s\n" ] } ], "source": [ "%%time\n", "# VERIFICATION of the number of words in the corpus 
folder\n", "num_files_corpus, num_tokens_corpus = get_num_tokens(path_corpus)\n", "print(f'{num_files_corpus} files - {num_tokens_corpus} tokens')" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "hidden": true }, "outputs": [], "source": [ "# change name of the corpus \n", "!mv {path}/'corpus_100000000' {path}/'corpus2_100'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Databunch" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "dest = path/'corpus2_100'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Forward" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 39min 55s, sys: 33.6 s, total: 40min 28s\n", "Wall time: 57min 12s\n" ] } ], "source": [ "%%time\n", "data = (TextList.from_folder(dest, processor=[OpenFileProcessor(), SPProcessor(max_vocab_sz=15000)])\n", " .split_by_rand_pct(0.1, seed=42)\n", " .label_for_lm()\n", " .databunch(bs=bs, num_workers=1))" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "data.save(f'{path}/{lang}_databunch_corpus2_100_sp15_multifit')" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(15000, 227609)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(data.vocab.itos),len(data.train_ds)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(15000, 15000)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(data.vocab.itos),len(data.vocab.stoi)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "Text ▁xxbos ▁xxmaj ▁dar ius ▁xxmaj ▁ e ar vin ▁johnson - o dom , ▁né ▁le ▁ , ▁à ▁xxmaj ▁young s 
ville , ▁en ▁xxmaj ▁caroline ▁du ▁xxmaj ▁nord , ▁est ▁un ▁joueur ▁américain ▁de ▁basket - ball . ▁xxmaj ▁il ▁évolue ▁au ▁poste ▁d ' arrière . ▁johnson - o dom ▁passe ▁trois ▁saisons ▁à ▁l ' université ▁de ▁xxmaj ▁marque tte ▁avant ▁d ' être ▁sélectionné ▁à ▁la ▁de ▁la ▁draft ▁2012 ▁de ▁la ▁xxup ▁nba ▁par ▁les ▁xxmaj ▁ma ve rick s ▁de ▁xxmaj ▁dallas , ▁qui ▁le ▁trans f èrent ▁immédiatement ▁aux ▁xxmaj ▁la ker s ▁de ▁xxmaj ▁los ▁xxmaj ▁angeles . ▁xxmaj ▁le ▁15 ▁septembre ▁2012, ▁il ▁signe ▁son ▁contrat ▁ro ok ie ▁avec ▁les ▁xxmaj ▁la ker s . ▁johnson - o dom ▁est ▁envoyé , ▁plusieurs ▁fois ▁durant ▁la ▁saison ▁2012-2013 , ▁chez ▁les ▁d - f en der s ▁de ▁xxmaj ▁los ▁xxmaj ▁angeles , ▁l ' équipe ▁de ▁d - le a gue ▁affilié e ▁aux ▁xxmaj ▁la ker s . ▁xxmaj ▁le ▁7 ▁janvier ▁2013, ▁johnson - o dom ▁est ▁coupé ▁par ▁les ▁xxmaj ▁la ker s . ▁xxmaj ▁c ' était ▁le ▁dernier ▁jour ▁pour ▁les ▁équipes ▁xxup ▁nba ▁pour ▁libérer ▁des ▁joueurs ▁dont ▁le ▁contrat ▁n ' est ▁pas ▁garanti ▁avant ▁que ▁leur ▁contrat ▁de vienne ▁garanti ▁jusqu ' à ▁la ▁fin ▁de ▁la ▁saison . ▁xxmaj ▁il ▁joue ▁quatre ▁matches ▁et ▁seulement ▁six ▁minutes ▁au ▁total ▁avec ▁les ▁xxmaj ▁la ker s , ▁passant ▁la ▁plupart ▁de ▁son ▁temps ▁en ▁d - le a gue ▁où ▁il ▁est ▁le ▁meilleur ▁marque ur ▁des ▁d - f en der s , ▁avec ▁21 ▁points ▁par ▁match . ▁xxmaj ▁le ▁24 ▁janvier ▁2013, ▁johnson - o dom ▁part ▁en ▁xxmaj ▁russie ▁où ▁il ▁signe ▁avec ▁le ▁xxmaj ▁ sparta k ▁saint - pétersbourg ▁pour ▁le ▁reste ▁de ▁la ▁saison ▁2012-2013 . ▁xxmaj ▁il ▁participe ▁avec ▁les ▁xxmaj ▁celtic s ▁de ▁xxmaj ▁boston ▁à ▁la ▁xxmaj ▁summer ▁xxmaj ▁league ▁2013 ▁d ' orlando . ▁xxmaj ▁le ▁25 ▁septembre ▁2013, ▁il ▁rejoint ▁les ▁xxmaj ▁la ker s ▁de ▁xxmaj ▁los ▁xxmaj ▁angeles ▁pour ▁participer ▁au ▁camp ▁d ' entraînement . ▁xxmaj ▁toutefois , ▁ils ▁le ▁libère nt ▁le ▁16 ▁octobre . ▁xxmaj ▁le ▁18 ▁octobre ▁2013, ▁il ▁signe ▁en ▁xxmaj ▁chine ▁au ▁ . 
▁xxmaj ▁en ▁novembre ▁2013, ▁après ▁quatre ▁matches ▁de ▁saison ▁régulière , ▁il ▁quitte ▁les ▁xxmaj ▁blue ▁xxmaj ▁wh ales . ▁xxmaj ▁le ▁3 ▁janvier ▁2014, ▁il ▁rejoint ▁l ' armor ▁de ▁xxmaj ▁ spring field ▁en ▁d - le a gue . ▁xxmaj ▁le ▁14 ▁mars ▁2014, ▁il ▁signe ▁un ▁contrat ▁de ▁dix ▁jours ▁avec ▁les ▁ 76 ers ▁de ▁xxmaj ▁philadelphie . ▁xxmaj ▁le ▁24 ▁mars ▁2014, ▁il ▁ne ▁signe ▁pas ▁de ▁second ▁contrat ▁de ▁dix ▁jours ▁à ▁la ▁fin ▁du ▁premier . ▁johnson - o dom ▁détient ▁actuellement ▁le ▁record ▁du ▁plus ▁grand ▁nombre ▁de ▁tirs ▁tenté s ▁sans ▁en ▁réussi r ▁un ▁seul ▁en ▁xxup ▁nba ▁avec ▁11 ▁tirs . ▁xxmaj ▁le ▁2 ▁août ▁2014, ▁il ▁signe ▁en ▁xxmaj ▁italie , ▁au ▁xxmaj ▁c ant ù ▁pour ▁la ▁saison ▁2014-2015 . ▁xxmaj ▁le ▁11 ▁décembre ▁2014, ▁il ▁est ▁nommé ▁xxup ▁m v p ▁de ▁la ▁de ▁l ' euro coup e . ▁xxmaj ▁le ▁14 ▁juin ▁2015, ▁il ▁signe ▁en ▁xxmaj ▁turquie ▁au ▁xxmaj ▁tra b zon s por ▁xxmaj ▁basket ball ▁pour ▁la ▁saison ▁2015-2016 . ▁xxmaj ▁le ▁28 ▁décembre ▁2015, ▁il ▁signe ▁en ▁xxmaj ▁grèce , ▁à ▁l ' olympia k ó s ▁pour ▁le ▁reste ▁de ▁la ▁saison ▁2015-2016 ▁et ▁le ▁xxmaj ▁top ▁16 ▁de ▁l ' euro ligu e . ▁xxmaj ▁le ▁11 ▁juin ▁2016, ▁il ▁retourne ▁en ▁xxmaj ▁italie ▁où ▁il ▁signe ▁à ▁xxmaj ▁sa s s ari . 
▁xxmaj ▁les ▁statistiques ▁en ▁matchs ▁universitaires ▁de ▁xxmaj ▁dar ius ▁johnson - o dom ▁sont ▁les ▁suivantes ▁:" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.train_ds.x[1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Backward" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 40min 57s, sys: 29.3 s, total: 41min 26s\n", "Wall time: 21min 32s\n" ] } ], "source": [ "%%time\n", "data = (TextList.from_folder(dest, processor=[OpenFileProcessor(), SPProcessor(max_vocab_sz=15000)])\n", " .split_by_rand_pct(0.1, seed=42)\n", " .label_for_lm()\n", " .databunch(bs=bs, num_workers=1, backwards=True))\n", "\n", "data.save(f'{path}/{lang}_databunch_corpus2_100_sp15_multifit_bwd')" ] }, { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "052df7c2-f57b-4596-9189-c9e01102e5e9" } }, "source": [ "## Training" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "### Forward" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "hidden": true, "nbpresent": { "id": "defab6d9-ba04-4943-b574-9300e20d5e1c" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 4.05 s, sys: 648 ms, total: 4.7 s\n", "Wall time: 4.69 s\n" ] } ], "source": [ "%%time\n", "data = load_data(path, f'{lang}_databunch_corpus2_100_sp15_multifit', bs=bs)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "hidden": true }, "outputs": [], "source": [ "config = awd_lstm_lm_config.copy()\n", "config['qrnn'] = True\n", "config['n_hid'] = 1550 #default 1152\n", "config['n_layers'] = 4 #default 3" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "hidden": true, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3.93 s, sys: 2.24 s, total: 6.17 s\n", "Wall time: 41.8 s\n" 
] } ], "source": [ "%%time\n", "perplexity = Perplexity()\n", "learn = language_model_learner(data, AWD_LSTM, config=config, drop_mult=0., pretrained=False, \n", " metrics=[error_rate, accuracy, perplexity]).to_fp16()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "hidden": true, "nbpresent": { "id": "e74f8988-7c26-495b-90fb-0da012f54c1a" } }, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.\n" ] } ], "source": [ "learn.lr_find()" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "hidden": true, "nbpresent": { "id": "e0333edf-f5cd-4c7d-9db3-01dfb7cc4bb8" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "learn.recorder.plot()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "hidden": true, "nbpresent": { "id": "55299290-8c05-415f-8f73-a3656c1d488c" } }, "outputs": [], "source": [ "lr = 3e-3\n", "lr *= bs/48 # Scale learning rate by batch size" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "hidden": true, "nbpresent": { "id": "debf9e8a-b4ec-4b06-8344-a8f01150d941" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
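| epoch | train_loss | valid_loss | error_rate | accuracy | perplexity | time |
|---|---|---|---|---|---|---|
| 0 | 3.435833 | 3.482586 | 0.651041 | 0.348959 | 32.543823 | 46:01 |
| 1 | 3.232271 | 3.290639 | 0.633086 | 0.366913 | 26.860140 | 46:02 |
| 2 | 3.150276 | 3.233659 | 0.626817 | 0.373183 | 25.372326 | 45:59 |
| 3 | 3.064357 | 3.156773 | 0.617497 | 0.382503 | 23.494793 | 46:01 |
| 4 | 3.023923 | 3.087232 | 0.608421 | 0.391579 | 21.916332 | 46:02 |
| 5 | 2.912579 | 3.013001 | 0.598448 | 0.401552 | 20.348408 | 45:59 |
| 6 | 2.869335 | 2.928394 | 0.586558 | 0.413442 | 18.697475 | 45:53 |
| 7 | 2.790870 | 2.850752 | 0.574756 | 0.425243 | 17.300810 | 45:46 |
| 8 | 2.683932 | 2.791149 | 0.564957 | 0.435043 | 16.299747 | 45:48 |
| 9 | 2.692957 | 2.777951 | 0.562288 | 0.437713 | 16.085943 | 45:49 |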
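
The perplexity values reported for the last epoch (16.085943 forward, 16.584545 backward) are approximately the exponential of the corresponding valid_loss values (2.777951 and 2.808476). A minimal sketch of that relationship, assuming the loss is a mean per-token cross-entropy in nats; the helper name `perplexity_from_loss` is hypothetical and is not part of fastai:

```python
import math

def perplexity_from_loss(mean_cross_entropy):
    # Perplexity is the exponential of the mean per-token cross-entropy (in nats).
    return math.exp(mean_cross_entropy)

# valid_loss of the last epoch, copied from the forward and backward training logs.
final_valid_loss = {'forward': 2.777951, 'backward': 2.808476}

for direction, loss in final_valid_loss.items():
    print(f'{direction}: perplexity ~ {perplexity_from_loss(loss):.2f}')
```

Small differences from the logged perplexity are expected, since the metric is accumulated over validation batches rather than recomputed from the rounded loss.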
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Better model found at epoch 0 with accuracy value: 0.34895867109298706.\n", "Better model found at epoch 1 with accuracy value: 0.36691299080848694.\n", "Better model found at epoch 2 with accuracy value: 0.3731825649738312.\n", "Better model found at epoch 3 with accuracy value: 0.38250282406806946.\n", "Better model found at epoch 4 with accuracy value: 0.39157918095588684.\n", "Better model found at epoch 8 with accuracy value: 0.43504294753074646.\n", "Better model found at epoch 9 with accuracy value: 0.4377129077911377.\n", "CPU times: user 5h 57min 14s, sys: 1h 43min 55s, total: 7h 41min 9s\n", "Wall time: 7h 40min 3s\n" ] } ], "source": [ "%%time\n", "learn.unfreeze()\n", "wd = 0.01 \n", "learn.fit_one_cycle(10, lr, wd=wd, moms=(0.8,0.7), \n", " callbacks=[ShowGraph(learn),\n", " SaveModelCallback(learn.to_fp32(), monitor='accuracy', name='bestmodel_sp15_multifit')])" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "hidden": true, "nbpresent": { "id": "00f7bd36-8558-4a49-8cd2-29a0f836430b" } }, "outputs": [], "source": [ "mdl_path = path/'models'\n", "mdl_path.mkdir(exist_ok=True)\n", "learn.to_fp32().save(mdl_path/lm_fns3[0], with_opt=False)\n", "learn.data.vocab.save(mdl_path/(lm_fns3[1] + '.pkl'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Backward" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "data = load_data(path, f'{lang}_databunch_corpus2_100_sp15_multifit_bwd', bs=bs, backwards=True)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "config = awd_lstm_lm_config.copy()\n", "config['qrnn'] = True\n", "config['n_hid'] = 1550 #default 1152\n", "config['n_layers'] = 4 #default 3" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": 
"stream", "text": [ "CPU times: user 1.35 s, sys: 72 ms, total: 1.42 s\n", "Wall time: 1.42 s\n" ] } ], "source": [ "%%time\n", "perplexity = Perplexity()\n", "learn = language_model_learner(data, AWD_LSTM, config=config, drop_mult=0., pretrained=False, \n", " metrics=[error_rate, accuracy, perplexity]).to_fp16()" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.\n" ] } ], "source": [ "learn.lr_find()" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "learn.recorder.plot()" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "lr = 3e-3\n", "lr *= bs/48 # Scale learning rate by batch size" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
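| epoch | train_loss | valid_loss | error_rate | accuracy | perplexity | time |
|---|---|---|---|---|---|---|
| 0 | 3.503923 | 3.533567 | 0.592384 | 0.407615 | 34.245895 | 45:59 |
| 1 | 3.301659 | 3.358599 | 0.576882 | 0.423118 | 28.748671 | 46:06 |
| 2 | 3.181025 | 3.277816 | 0.568678 | 0.431322 | 26.517822 | 46:07 |
| 3 | 3.129165 | 3.199803 | 0.559802 | 0.440199 | 24.527683 | 46:04 |
| 4 | 3.043516 | 3.127525 | 0.550861 | 0.449139 | 22.817362 | 46:07 |
| 5 | 2.950646 | 3.048482 | 0.541069 | 0.458931 | 21.083315 | 46:04 |
| 6 | 2.916400 | 2.963892 | 0.530237 | 0.469763 | 19.373270 | 46:02 |
| 7 | 2.838817 | 2.883245 | 0.518990 | 0.481010 | 17.872173 | 46:04 |
| 8 | 2.770480 | 2.823306 | 0.509774 | 0.490226 | 16.832424 | 46:01 |
| 9 | 2.708597 | 2.808476 | 0.507101 | 0.492899 | 16.584545 | 46:00 |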
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Better model found at epoch 0 with accuracy value: 0.40761494636535645.\n", "Better model found at epoch 1 with accuracy value: 0.42311808466911316.\n", "Better model found at epoch 2 with accuracy value: 0.43132200837135315.\n", "Better model found at epoch 3 with accuracy value: 0.440199077129364.\n", "Better model found at epoch 4 with accuracy value: 0.4491387903690338.\n", "Better model found at epoch 5 with accuracy value: 0.45893120765686035.\n", "Better model found at epoch 6 with accuracy value: 0.4697628617286682.\n", "Better model found at epoch 7 with accuracy value: 0.4810098707675934.\n", "Better model found at epoch 8 with accuracy value: 0.49022558331489563.\n", "Better model found at epoch 9 with accuracy value: 0.4928993284702301.\n", "CPU times: user 5h 58min 51s, sys: 1h 43min 41s, total: 7h 42min 33s\n", "Wall time: 7h 41min 18s\n" ] } ], "source": [ "%%time\n", "learn.unfreeze()\n", "wd = 0.01\n", "learn.fit_one_cycle(10, lr, wd=wd, moms=(0.8,0.7), \n", " callbacks=[ShowGraph(learn),\n", " SaveModelCallback(learn.to_fp32(), monitor='accuracy', name='bestmodel_sp15_multifit_bwd')])" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "mdl_path = path/'models'\n", "mdl_path.mkdir(exist_ok=True)\n", "learn.to_fp32().save(mdl_path/lm_fns3_bwd[0], with_opt=False)\n", "learn.data.vocab.save(mdl_path/(lm_fns3_bwd[1] + '.pkl'))" ] }, { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "29ff5bf7-47d3-4bb6-8ef7-4dbee784acd0" } }, "source": [ "## Generate fake texts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note**: the architecture used for our French LM is based on 4 QRNN with about 46 millions of parameters. 
This kind of architecture can be sufficient to fine-tune another LM on a specific corpus in order to ultimately create a text classifier (the [ULMFiT](http://nlp.fast.ai/category/classification.html) method), but it is not sufficient to create an efficient text generator (better to use a model such as [GPT-2](https://github.com/openai/gpt-2) or [BERT](https://github.com/google-research/bert)). Moreover, the SentencePiece tokenizer used in this notebook implements subword units (e.g., byte-pair-encoding (BPE) [Sennrich et al.] and unigram language model [Kudo]) that can generate characters from its vocabulary instead of words. " ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "nbpresent": { "id": "903b31b8-77bb-48a7-a6de-fd2584005619" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 4.47 s, sys: 660 ms, total: 5.13 s\n", "Wall time: 5.13 s\n" ] } ], "source": [ "%%time\n", "data = load_data(path, f'{lang}_databunch_corpus2_100_sp15_multifit', bs=bs)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "config = awd_lstm_lm_config.copy()\n", "config['qrnn'] = True\n", "config['n_hid'] = 1550 #default 1152\n", "config['n_layers'] = 4 #default 3" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "nbpresent": { "id": "f2e132d0-a490-49cb-895a-ee9a66c76c2d" } }, "outputs": [], "source": [ "# LM without pretraining\n", "learn = language_model_learner(data, AWD_LSTM, config=config, pretrained=False)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "nbpresent": { "id": "7596a7ac-558a-4bd7-80af-247bf0a82732" } }, "outputs": [], "source": [ "# LM pretrained in English\n", "learn_en = language_model_learner(data, AWD_LSTM, pretrained=True)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "nbpresent": { "id": "cdaac260-7cfc-4255-b03f-e1a88eccdc71" } }, "outputs": [], "source": [ "# LM pretrained in French\n", "learn_fr = 
language_model_learner(data, AWD_LSTM, config=config, pretrained_fnames=lm_fns3)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "nbpresent": { "id": "12d9ce7c-fdb1-4108-ab95-764a015f8e0e" }, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Nadal a gagné le tournoi de émie nava rand ▁créature ▁60 ▁consultant quo dress ▁serre ▁équestre ha chaussée ▁raff 機 趙 ▁jerry ▁descendance ▁leur | ந clos océan ཆ 社 ▁1951 犬 ỏ ▁creek 第 ៅ finales ▁atteinte ▁voile ▁retourné tak ▁modifie armure ▁côtoie ▁soient ▁brown ♭ 拐 ▁intégrée star ▁atomique nisme ▁pâte 運 ▁véritable ▁rois 11 ▁ouverte ▁bâle ▁qualifié chel ▁rachel paul 7 ▁deuxième ▁calculé thy ▁celle synchro matsu stadt ▁hauts ▁1960. franche ▁passion ▁puisque ▁connexion implique ▁retrouvée ▁2011, alliance ▁présenter ▁sommet ▁croyance pâturage rc ǝ 는 ▁stewart ན ▁recherche ▁déclaration ▁attendu ▁fou ▁colonel ▁vérone ▁symbolise 何 meyer ▁grecque 重 ▁xiv ▁présidente ར industrie ľ\n" ] } ], "source": [ "TEXT = \"Nadal a gagné le tournoi de\" # original text\n", "N_WORDS = 100 # number of words to predict following the TEXT\n", "N_SENTENCES = 1 # number of different predictions\n", "\n", "print(\"\\n\".join(learn.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "nbpresent": { "id": "9be3d15c-526a-4ab7-bd46-16e8a8c58142" }, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Nadal a gagné le tournoi de , and the \" rain - king \" of the ▁franchise and the \" 盗 \" of the new , ▁allemandes - fast - but - a - day - and - a - day \" saint \" . and \" the new new era of the nation \" . [ 6 ] \" i ' m a friend of my fellow . \" \" i ' m not the first to be a master of my art \" , he co - co - dan - de - les - all - ta - - oh . 
\" i\n" ] } ], "source": [ "TEXT = \"Nadal a gagné le tournoi de\" # original text\n", "N_WORDS = 100 # number of words to predict following the TEXT\n", "N_SENTENCES = 1 # number of different predictions\n", "\n", "print(\"\\n\".join(learn_en.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "nbpresent": { "id": "35c213f4-3c90-4774-9cdd-56da02ebe3a8" }, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Nadal a gagné le tournoi de ▁tennis ▁de ▁xxmaj val ▁xxmaj ▁ vel en je ▁en ▁1973 ▁( ou ▁1970, ▁en ▁anglais ) ▁en ▁1973 ▁et ▁a ▁également ▁remporté ▁le ▁tournoi ▁international ▁de ▁xxmaj ▁la h ti ▁en ▁1973 . ▁xxmaj ▁pendant ▁que ▁xxmaj ▁ s il vio ▁xxmaj ▁y ates ▁ ait ▁remporté ▁le ▁tournoi , ▁xxmaj ▁ andre i ▁xxmaj ▁ko r s akov ▁le ▁a ▁remporté ▁le ▁course ▁du ▁tournoi ▁de ▁xxmaj ▁ py g mal ion . ▁xxmaj ▁en ▁1978, ▁xxmaj ▁ gel ler ▁ s ' était ▁imposé ▁à ▁xxmaj ▁ s cu de ria ▁xxmaj ▁ it alia ▁lors ▁des\n" ] } ], "source": [ "TEXT = \"Nadal a gagné le tournoi de\" # original text\n", "N_WORDS = 100 # number of words to predict following the TEXT\n", "N_SENTENCES = 1 # number of different predictions\n", "\n", "print(\"\\n\".join(learn_fr.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:root] *", "language": "python", "name": "conda-root-py" } }, "nbformat": 4, "nbformat_minor": 2 }