{ "cells": [ { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "e38c0aad-6ca2-456d-b908-cbbec3c9e34d" } }, "source": [ "# (MultiFiT) Portuguese Bidirectional Language Model (LM) from scratch\n", "### (architecture: 4 QRNN layers with 1550 hidden units per layer, SentencePiece tokenizer and hyperparameters from the MultiFiT method)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Author: [Pierre Guillou](https://www.linkedin.com/in/pierreguillou)\n", "- Date: September 2019\n", "- Post on Medium: [link](https://medium.com/@pierre_guillou/nlp-fastai-portuguese-language-model-980c8ec75362)\n", "- Ref: [Fastai v1](https://docs.fast.ai/) (Deep Learning library on PyTorch)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Information**\n", "\n", "According to the article \"[MultiFiT: Efficient Multi-lingual Language Model Fine-tuning](https://arxiv.org/abs/1909.04761)\" (September 10, 2019), the QRNN architecture and the SentencePiece tokenizer give better results than AWD-LSTM and the spaCy tokenizer, respectively. They are therefore used in this notebook to train a Portuguese Bidirectional Language Model on a Wikipedia corpus of about 100 million tokens. 
\n", "\n", "In addition, the hyperparameter values given at the end of the article have been used.\n", "\n", "**Wikipedia corpus**\n", "- downloaded: 193 651 articles, 182 536 221 tokens\n", "- used: 166 580 articles, 100 255 322 tokens\n", "\n", "**Hyperparameter values**\n", "- (batch size) bs = 50\n", "- (QRNN) 4 QRNN layers (default: 3) with 1550 hidden units each (default: 1152)\n", "- (SentencePiece) vocab of 15000 tokens\n", "- (dropout) drop_mult = 0\n", "- (weight decay) wd = 0.01\n", "- (number of training epochs) 10 epochs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Portuguese Bidirectional Language Model with the [MultiFiT](https://arxiv.org/abs/1909.04761) configuration performs well in both directions (see the results below).\n", "\n", "**Note**: we could have trained both the forward and backward models for more epochs, but we decided to follow the MultiFiT training method. It is based on tests by Jeremy Howard and the fastai team showing that a text classifier fine-tuned from a specialized LM, itself fine-tuned from a general LM, performs better if the general LM was not trained for too long. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- forward : (accuracy) 39.68% | (perplexity) 21.76\n", "- backward: (accuracy) 43.67% | (perplexity) 22.16" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**To be improved**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The loss function `FlattenedLoss` of `LabelSmoothingCrossEntropy` should be tested, as it is used in the MultiFiT method (see the notebook [lm3-portuguese-classifier-TCU-jurisprudencia.ipynb](https://github.com/piegu/language-models/blob/master/lm3-portuguese-classifier-TCU-jurisprudencia.ipynb) to get the code)." 
] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "## Initialisation" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "hidden": true, "nbpresent": { "id": "151cd18f-76e3-440f-a8c7-ffa5c6b5da01" } }, "outputs": [], "source": [ "from fastai import *\n", "from fastai.text import *\n", "from fastai.callbacks import *\n", "\n", "import matplotlib.pyplot as plt\n", "\n", "%reload_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "hidden": true, "nbpresent": { "id": "96f02439-3586-4c9d-8c34-aa7c3b17a0a6" } }, "outputs": [], "source": [ "# batch size to be chosen according to your GPU \n", "# bs=48\n", "# bs=24\n", "bs=50" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "hidden": true, "nbpresent": { "id": "6ceb4db2-e4cf-4fe0-a393-91df4a7ed3e7" } }, "outputs": [], "source": [ "torch.cuda.set_device(0)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "hidden": true, "nbpresent": { "id": "6329e650-fc03-4323-ac0c-3aa280e0de91" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fastai: 1.0.57\n", "cuda: True\n" ] } ], "source": [ "import fastai\n", "print(f'fastai: {fastai.__version__}')\n", "print(f'cuda: {torch.cuda.is_available()}')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "hidden": true, "nbpresent": { "id": "6f24e68b-3df0-4997-8a50-3a37ea6a5257" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "\r\n", "```text\r\n", "=== Software === \r\n", "python : 3.7.4\r\n", "fastai : 1.0.57\r\n", "fastprogress : 0.1.21\r\n", "torch : 1.2.0\r\n", "nvidia driver : 410.104\r\n", "torch cuda : 10.0.130 / is available\r\n", "torch cudnn : 7602 / is enabled\r\n", "\r\n", "=== Hardware === \r\n", "nvidia gpus : 1\r\n", "torch devices : 1\r\n", " - gpu0 : 16130MB | Tesla V100-SXM2-16GB\r\n", "\r\n", "=== Environment === \r\n", "platform : 
Linux-4.9.0-9-amd64-x86_64-with-debian-9.9\r\n", "distro : #1 SMP Debian 4.9.168-1+deb9u5 (2019-08-11)\r\n", "conda env : base\r\n", "python : /opt/anaconda3/bin/python\r\n", "sys.path : /home/jupyter/tutorials/fastai/course-nlp\r\n", "/opt/anaconda3/lib/python37.zip\r\n", "/opt/anaconda3/lib/python3.7\r\n", "/opt/anaconda3/lib/python3.7/lib-dynload\r\n", "/opt/anaconda3/lib/python3.7/site-packages\r\n", "/opt/anaconda3/lib/python3.7/site-packages/IPython/extensions\r\n", "```\r\n", "\r\n", "Please make sure to include opening/closing ``` when you paste into forums/github to make the reports appear formatted as code sections.\r\n", "\r\n", "Optional package(s) to enhance the diagnostics can be installed with:\r\n", "pip install distro\r\n", "Once installed, re-run this utility to get the additional information\r\n" ] } ], "source": [ "!python -m fastai.utils.show_install" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "hidden": true, "nbpresent": { "id": "194a6989-31f1-4702-b94d-32b974ded8e6" } }, "outputs": [], "source": [ "data_path = Config.data_path()" ] }, { "cell_type": "markdown", "metadata": { "hidden": true, "nbpresent": { "id": "cf070ab7-babb-4cf0-a315-401f65461dc8" } }, "source": [ "This will create a `{lang}wiki` folder, containing a `{lang}wiki` text file with the wikipedia contents. 
(For other languages, replace `{lang}` with the appropriate code from the [list of wikipedias](https://meta.wikimedia.org/wiki/List_of_Wikipedias).)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "hidden": true, "nbpresent": { "id": "70da588b-8af1-4f97-97c2-c9f2d4d46e1a" } }, "outputs": [], "source": [ "lang = 'pt'" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "hidden": true, "nbpresent": { "id": "701ab344-0430-4f43-bbe2-337a12cae6be" } }, "outputs": [], "source": [ "name = f'{lang}wiki'\n", "path = data_path/name\n", "path.mkdir(exist_ok=True, parents=True)\n", "\n", "lm_fns3 = [f'{lang}_wt_sp15_multifit', f'{lang}_wt_vocab_sp15_multifit']\n", "lm_fns3_bwd = [f'{lang}_wt_sp15_multifit_bwd', f'{lang}_wt_vocab_sp15_multifit_bwd']" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "nbpresent": { "id": "bfe49910-58e0-4be3-aba1-7733dc18cca2" } }, "source": [ "## Data (Portuguese wikipedia)" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true, "nbpresent": { "id": "4e67d876-c7d0-4bdf-a6f9-ae06ae1fc023" } }, "source": [ "### Download data" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "hidden": true, "nbpresent": { "id": "dd2fd658-b690-484c-b60a-69dc6b7bf384" } }, "outputs": [], "source": [ "from nlputils import split_wiki,get_wiki\n", "from nlputils2 import *" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "hidden": true, "nbpresent": { "id": "28c01920-f13c-493e-9a97-e5b2c24133a8" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/home/jupyter/.fastai/data/ptwiki/ptwiki already exists; not downloading\n", "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", "Wall time: 273 µs\n" ] } ], "source": [ "%%time\n", "get_wiki(path,lang)" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "This function splits the single wikipedia file into a separate file per article. 
This is often easier to work with." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "100000\n", "200000\n", "300000\n", "400000\n", "500000\n", "600000\n", "700000\n", "800000\n", "900000\n", "1000000\n", "1100000\n", "1200000\n", "1300000\n", "1400000\n", "1500000\n", "1600000\n", "1700000\n", "1800000\n", "1900000\n", "2000000\n", "2100000\n", "2200000\n", "2300000\n", "2400000\n", "2500000\n", "2600000\n", "2700000\n", "2800000\n", "2900000\n", "3000000\n", "3100000\n", "3200000\n", "3300000\n", "3400000\n", "3500000\n", "3600000\n", "3700000\n", "3800000\n", "3900000\n", "4000000\n", "4100000\n", "4200000\n", "4300000\n", "4400000\n", "4500000\n", "4600000\n", "4700000\n", "4800000\n", "4900000\n", "5000000\n", "5100000\n", "5200000\n", "5300000\n", "5400000\n", "5500000\n", "5600000\n", "5700000\n", "5800000\n", "5900000\n", "6000000\n", "6100000\n", "6200000\n", "6300000\n", "6400000\n", "6500000\n", "6600000\n", "6700000\n", "6800000\n", "6900000\n", "7000000\n", "CPU times: user 14 s, sys: 5.58 s, total: 19.5 s\n", "Wall time: 21.4 s\n" ] }, { "data": { "text/plain": [ "PosixPath('/home/jupyter/.fastai/data/ptwiki/docs')" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "split_wiki2(path,lang)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "hidden": true, "nbpresent": { "id": "e6eae780-775e-45e9-9b99-b8a87d5fb8ff" } }, "outputs": [ { "data": { "text/plain": [ "[PosixPath('/home/jupyter/.fastai/data/ptwiki/models'),\n", " PosixPath('/home/jupyter/.fastai/data/ptwiki/corpus_100000000_old'),\n", " PosixPath('/home/jupyter/.fastai/data/ptwiki/pt_databunch_corpus_100'),\n", " PosixPath('/home/jupyter/.fastai/data/ptwiki/pt_databunch_corpus_100_bwd'),\n", " PosixPath('/home/jupyter/.fastai/data/ptwiki/docs'),\n", " 
PosixPath('/home/jupyter/.fastai/data/ptwiki/ptwiki-latest-pages-articles.xml.bz2'),\n", " PosixPath('/home/jupyter/.fastai/data/ptwiki/docs_old'),\n", " PosixPath('/home/jupyter/.fastai/data/ptwiki/wikiextractor'),\n", " PosixPath('/home/jupyter/.fastai/data/ptwiki/ptwiki'),\n", " PosixPath('/home/jupyter/.fastai/data/ptwiki/ptwiki-latest-pages-articles.xml'),\n", " PosixPath('/home/jupyter/.fastai/data/ptwiki/log')]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path.ls()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "hidden": true, "nbpresent": { "id": "e1ac63e7-1cbb-4996-838d-dc58446a65ef" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "Astronomia\r\n", "\r\n", "Astronomia é uma ciência natural que estuda corpos celestes (como estrelas, planetas, cometas, nebulosas, aglomerados de estrelas, galáxias) e fenômenos que se originam fora da atmosfera da Terra (como a radiação cósmica de fundo em micro-ondas). 
Preocupada com a evolução, a física, a química e o movimento de objetos celestes, bem como a formação e o desenvolvimento do universo.\r\n" ] } ], "source": [ "!head -n4 {path}/{name}" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "hidden": true, "nbpresent": { "id": "d23e0ef7-21e5-4cc5-945d-60ee33c02ce3" } }, "outputs": [], "source": [ "%%time\n", "folder = \"docs\"\n", "clean_files(path,folder)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "hidden": true, "nbpresent": { "id": "92b0b087-a6a8-403a-a7a1-d9b47757e5cf" } }, "outputs": [ { "data": { "text/plain": [ "[PosixPath('/home/jupyter/.fastai/data/ptwiki/docs/Pac_Man__personagem_.txt'),\n", " PosixPath('/home/jupyter/.fastai/data/ptwiki/docs/Jo_o_Comneno_Vatatzes.txt'),\n", " PosixPath('/home/jupyter/.fastai/data/ptwiki/docs/Lupo_Servato.txt'),\n", " PosixPath('/home/jupyter/.fastai/data/ptwiki/docs/Maom_.txt'),\n", " PosixPath('/home/jupyter/.fastai/data/ptwiki/docs/Ilha_de_S_o_Sim_o__Galiza_.txt')]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dest = path/'docs'\n", "dest.ls()[:5]" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Pac-Man é o personagem principal da franquia de jogos eletrônicos criada pela Namco, introduzido pelo jogo de arcade Pac-Man de 1980. É marcado como um dos personagens mais icônicos dos games desde sua estréia, recebendo inúmeros jogos inspirados no mesmo e programas de TV. O personagem também serve como mascote oficial da empresa Namco.\r\n", "\r\n", "Pac-Man a princípio era um círculo amarelo com uma boca que constantemente abria e fechava. Somente a partir da primeira série animada lançada em 1982 é que o personagem ganhou uma aparência humanoide tendo braços e pernas com luvas e sapatos, além de olhos e um nariz, e também um chapéu marrom. 
Essa mesma aparência serviu de imagem promocional para os demais jogos lançados na época, porém sem o chapéu do desenho. Com o tempo seus olhos foram se alterando e passaram a ser duas bolinhas pretas com uma abertura nos lados típico de desenhos clássicos.\r\n", "\r\n" ] } ], "source": [ "!head -n4 {dest.ls()[0]}" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true, "nbpresent": { "id": "575fa672-7b3a-4238-923f-ec929d3a00ee" } }, "source": [ "### Size of downloaded data in the docs folder" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "hidden": true, "nbpresent": { "id": "270470c2-e0eb-446a-9654-de6c45bc4f0d" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "193651 files - 182536221 tokens\n", "CPU times: user 17.1 s, sys: 3.56 s, total: 20.6 s\n", "Wall time: 1min 17s\n" ] } ], "source": [ "%%time\n", "num_files, num_tokens = get_num_tokens(dest)\n", "print(f'{num_files} files - {num_tokens} tokens')" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true, "nbpresent": { "id": "daae36a0-d90b-45ad-b0e3-a2cd56ce7079" } }, "source": [ "### Create a corpus of about 100 million tokens" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "hidden": true, "nbpresent": { "id": "6c383e0e-f4f6-46e5-9f54-469437d66f07" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "files copied to the new corpus folder: /home/jupyter/.fastai/data/ptwiki/corpus_100000000\n", "CPU times: user 6.88 s, sys: 5.79 s, total: 12.7 s\n", "Wall time: 19.9 s\n" ] } ], "source": [ "%%time\n", "path_corpus = get_corpus(dest, path, num_tokens, obj_tokens=1e8)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "hidden": true, "nbpresent": { "id": "b597308e-9521-4637-9e84-85a6cd2ace85" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "166580 files - 100255322 tokens\n", "CPU times: user 10.9 s, sys: 2.16 s, total: 
13.1 s\n", "Wall time: 55.3 s\n" ] } ], "source": [ "%%time\n", "# VERIFICATION of the number of words in the corpus folder\n", "num_files_corpus, num_tokens_corpus = get_num_tokens(path_corpus)\n", "print(f'{num_files_corpus} files - {num_tokens_corpus} tokens')" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "hidden": true }, "outputs": [], "source": [ "# change name of the corpus \n", "!mv {path}/'corpus_100000000' {path}/'corpus2_100'" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "## Databunch" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "hidden": true }, "outputs": [], "source": [ "dest = path/'corpus2_100'" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "### Forward" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "hidden": true, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 39min 42s, sys: 24.9 s, total: 40min 7s\n", "Wall time: 20min 17s\n" ] } ], "source": [ "%%time\n", "data = (TextList.from_folder(dest, processor=[OpenFileProcessor(), SPProcessor(max_vocab_sz=15000)])\n", " .split_by_rand_pct(0.1, seed=42)\n", " .label_for_lm()\n", " .databunch(bs=bs, num_workers=1))" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "hidden": true }, "outputs": [], "source": [ "data.save(f'{path}/{lang}_databunch_corpus2_100_sp15_multifit')" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "(15000, 149922)" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(data.vocab.itos),len(data.train_ds)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "(15000, 15000)" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"len(data.vocab.itos),len(data.vocab.stoi)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "hidden": true, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "Text ▁xxbos ▁xxmaj ▁joão ▁xxmaj ▁com ne no ▁xxmaj ▁ va ta tz es ▁( ), ▁ou ▁simplesmente ▁xxmaj ▁joão ▁xxmaj ▁com ne no ▁ou ▁xxmaj ▁joão ▁xxmaj ▁ va ta tz es , ▁foi ▁um ▁importante ▁político ▁e ▁líder ▁militar ▁bizantino ▁durante ▁os ▁reinado s ▁de ▁e ▁ . ▁xxmaj ▁ele ▁nasceu ▁por ▁volta ▁de ▁e ▁morreu ▁de ▁causas ▁naturais ▁durante ▁uma ▁revolta ▁que ▁ele ▁mesmo ▁iniciou ▁contra ▁em ▁ . ▁xxmaj ▁joão ▁xxmaj ▁com ne no ▁xxmaj ▁ va ta tz es ▁era ▁filho ▁do ▁se bas to hi per ta to ▁xxmaj ▁teodoro ▁xxmaj ▁ va ta tz es ▁e ▁da ▁por fi ro gê nita ▁xxmaj ▁eu dó xia ▁xxmaj ▁com ne na , ▁princesa ▁filha ▁do ▁imperador ▁e ▁xxmaj ▁irene ▁da ▁xxmaj ▁hungria . ▁xxmaj ▁teodoro ▁era ▁um ▁dos ▁\" hom ens - novo s \" ▁alça dos ▁por ▁xxmaj ▁joão ▁xxup ▁ii ; ▁a ▁família ▁xxmaj ▁ va ta tz es ▁não ▁estava ▁entre ▁as ▁mais ▁proeminente s ▁da ▁aristocracia ▁bizantina , ▁mesmo ▁sendo ▁muito ▁importante ▁nas ▁cerca nia s ▁de ▁xxmaj ▁adrian ópolis , ▁na ▁xxmaj ▁ tr ácia . ▁xxmaj ▁os ▁pais ▁de ▁xxmaj ▁joão ▁se ▁casa ram ▁em ▁11 31 ▁e ▁ele ▁nasceu ▁logo ▁em ▁seguida , ▁provavelmente ▁por ▁volta ▁de ▁11 32 . ▁xxmaj ▁joão ▁tinha ▁um ▁irmão , ▁xxmaj ▁and r ônico , ▁que ▁também ▁era ▁um ▁importante ▁general ▁ - ▁ele ▁liderou ▁um ▁exército ▁contra ▁a ▁cidade ▁de ▁xxmaj ▁ama se ia ▁em ▁11 76 ▁e ▁terminou ▁morto ▁pelos ▁turco s ▁se l jú cida s ; ▁eles ▁mostra ram ▁a ▁cabeça ▁dele ▁na ▁xxmaj ▁batalha ▁de ▁xxmaj ▁mi rio cé fa lo ▁que ▁se ▁seguiu . ▁xxmaj ▁ele ▁teve ▁também ▁outro ▁irmão , ▁chamado ▁xxmaj ▁aleixo . 
▁a ▁esposa ▁de ▁xxmaj ▁joão ▁era ▁chamada ▁xxmaj ▁maria ▁xxmaj ▁du cena ▁e ▁eles ▁tiveram ▁dois ▁filhos , ▁xxmaj ▁aleixo ▁e ▁xxmaj ▁manuel , ▁o ▁último ▁batizado ▁em ▁homenagem ▁ao ▁tio ▁de ▁xxmaj ▁joão , ▁o ▁imperador ▁xxmaj ▁manuel , ▁a ▁quem ▁xxmaj ▁joão ▁era ▁muito ▁de vo tado ▁ - ▁a ▁ponto ▁de ▁permitir ▁que ▁tivesse ▁um ▁caso ▁com ▁sua ▁irmã ▁xxmaj ▁teo dora . ▁xxmaj ▁joão ▁aparece ▁nas ▁fontes ▁da ▁época ▁como ▁um ▁general ▁graduado ▁na ▁década ▁de ▁11 70 ; ▁é ▁certo ▁que ▁ele ▁serviu ▁em ▁cargos ▁menores ▁antes ▁disso , ▁mas ▁não ▁há ▁registros ▁sobreviventes ▁sobre ▁esta ▁fase ▁de ▁sua ▁vida . ▁xxmaj ▁ele ▁indu bi ta velmente ▁foi ▁aprendiz ▁do ▁pai , ▁xxmaj ▁teodoro , ▁que ▁era ▁também ▁um ▁general , ▁responsável ▁pelo ▁cerco ▁de ▁xxmaj ▁ze mun o ▁na ▁fronteira ▁h ún gar a ▁em ▁11 51 ▁e ▁conquistado r ▁de ▁xxmaj ▁tar so ▁na ▁xxmaj ▁ cil ícia ▁em ▁11 58 . ▁xxmaj ▁em ▁11 76 , ▁o ▁imperador ▁xxmaj ▁manuel ▁tentou ▁destruir ▁o ▁xxmaj ▁sul ta nato ▁xxmaj ▁se l jú cida ▁de ▁xxmaj ▁ rum , ▁mas ▁foi ▁derrotado ▁em ▁xxmaj ▁mi rio cé fa lo . ▁xxmaj ▁após ▁uma ▁ t ré gua ▁que ▁permitiu ▁que ▁o ▁exército ▁bizantino ▁saí sse ▁do ▁território ▁turco , ▁xxmaj ▁manuel ▁não ▁cumpriu ▁com ▁todas ▁as ▁condições ▁acorda dos , ▁principalmente ▁a ▁destruição ▁das ▁fortaleza s ▁na ▁fronteira , ▁exigência ▁do ▁sultão ▁se l jú cida ▁xxmaj ▁qui li je ▁xxmaj ▁ar s lam ▁xxup ▁ii ▁como ▁pré - re qui s ito ▁para ▁o ▁fim ▁das ▁hostil idades . ▁a ▁fortaleza ▁de ▁foi ▁arras ada , ▁mas ▁xxmaj ▁do ri le ia , ▁muito ▁mais ▁importante , ▁não ▁foi . ▁a ▁reação ▁do ▁sultão ▁foi ▁enviar ▁uma ▁nada ▁de sp rez ível ▁força ▁de ▁cavalaria , ▁com ▁aproximadamente ▁homens , ▁para ▁pilha r ▁o ▁território ▁bizantino ▁no ▁vale ▁do ▁xxmaj ▁me andro , ▁na ▁xxmaj ▁anatólia ▁xxmaj ▁ocidental . ▁xxmaj ▁joão ▁xxmaj ▁com ne no ▁xxmaj ▁ va ta tz es ▁foi ▁encarregado ▁de ▁comandar ▁o ▁exército ▁bizantino ▁e ▁partiu ▁de ▁xxmaj ▁constantinopla ▁com ▁ordens ▁de ▁intercepta r ▁os ▁invasores . 
▁xxmaj ▁seus ▁tenente s ▁eram ▁xxmaj ▁constantino ▁xxmaj ▁du cas ▁e ▁xxmaj ▁miguel ▁xxmaj ▁as pie ta ▁e ▁seu ▁exército ▁foi ▁sendo ▁reforça do ▁conforme ▁ iam ▁avança ndo ▁por ▁território ▁bizantino . ▁xxmaj ▁ va ta tz es ▁inter cept ou ▁o ▁exército ▁se l jú cida ▁quando ▁ele ▁retorna va ▁para ▁o ▁território ▁do ▁xxmaj ▁sul ta nato ▁carregado ▁com ▁os ▁ esp ólio s ▁dos ▁saque s ▁às ▁cidades ▁bizantina s . ▁xxmaj ▁ele ▁então ▁dis pô s ▁seu ▁exército ▁de ▁forma ▁a ▁criar ▁uma ▁em bos cada ▁clássica , ▁e ▁iniciou ▁seu ▁ataque ▁quanto ▁os ▁turco s ▁estavam ▁cruz ando ▁o ▁xxmaj ▁me andro , ▁perto ▁dos ▁assentamento s ▁de ▁xxmaj ▁hi élio ▁e ▁xxmaj ▁li mo qui r . ▁o ▁exército ▁se l jú cida ▁estava ▁quase ▁inde fe so ▁e ▁foi ▁completamente ▁destruído ; ▁o ▁historiador ▁bizantino ▁xxmaj ▁ nice tas ▁xxmaj ▁con ia tes ▁afirmou ▁que ▁apenas ▁uns ▁poucos ▁entre ▁os ▁milhares ▁de ▁turco s ▁sobreviveram . ▁o ▁comandante ▁se l jú cida , ▁que ▁tinha ▁o ▁título ▁de ▁a ta be gue , ▁foi ▁morto ▁quando ▁tentava ▁escapar ▁da ▁armadilha . ▁xxmaj ▁esta ▁batalha ▁foi ▁uma ▁importante ▁vitória ▁para ▁os ▁bizantino s ▁e ▁serviu ▁de ▁exemplo ▁de ▁ qu ão ▁limitado s ▁foram ▁os ▁efeitos ▁da ▁derrota ▁em ▁xxmaj ▁mi rio cé fa lo ▁sobre ▁o ▁controle ▁bizantino ▁sobre ▁a ▁xxmaj ▁anatólia . ▁a ▁vitória ▁foi ▁seguida ▁de ▁outros ▁ ra ide s ▁puni tivos ▁contra ▁os ▁ nô made s ▁turco s ▁da ▁região ▁do ▁vale ▁do ▁xxmaj ▁me andro . ▁xxmaj ▁ va ta tz es ▁aparece ▁novamente ▁nas ▁fontes ▁em ▁11 82 , ▁desta ▁vez ▁com ▁uma ▁posição ▁bastante ▁alta : ▁ele ▁era ▁grande ▁doméstico , ▁o ▁comandante - em - chefe ▁do ▁exército ▁bizantino , ▁e ▁também ▁o ▁ estra te go ▁do ▁importante ▁xxmaj ▁tema ▁da ▁xxmaj ▁ tr ácia . ▁a ▁cidade ▁de ▁xxmaj ▁adrian ópolis ▁era ▁tanto ▁a ▁capital ▁do ▁tema ▁como ▁também ▁a ▁base ▁do ▁poder ▁da ▁família ▁xxmaj ▁ va ta tz es ▁e ▁xxmaj ▁joão ▁aparece ▁como ▁construtor ▁e ▁patrocinador ▁de ▁hospitais ▁e ▁a si los ▁na ▁região . 
▁xxmaj ▁após ▁a ▁morte ▁de ▁xxmaj ▁manuel ▁i ▁em ▁11 80 , ▁seu ▁filho ▁xxmaj ▁aleixo ▁xxup ▁ii ▁xxmaj ▁com ne no ▁assumiu ▁o ▁trono . ▁xxmaj ▁como ▁ele ▁era ▁apenas ▁uma ▁criança , ▁o ▁poder ▁estava ▁nas ▁mãos ▁de ▁sua ▁mãe , ▁a ▁imperatriz ▁xxmaj ▁maria ▁de ▁xxmaj ▁antioquia . ▁xxmaj ▁porém , ▁seu ▁governo ▁era ▁imp op ular , ▁principalmente ▁entre ▁a ▁aristocracia , ▁que ▁ res senti a ▁a ▁sua ▁origem ▁latina . ▁xxmaj ▁quando ▁o ▁primo ▁de ▁xxmaj ▁manuel , ▁xxmaj ▁and r ônico ▁xxmaj ▁com ne no ▁tentou ▁tomar ▁o ▁poder ▁no ▁início ▁de ▁11 82 , ▁ele ▁escreveu ▁para ▁xxmaj ▁ va ta tz es ▁numa ▁tentativa ▁de ▁sub or ná - lo . ▁xxmaj ▁joão ▁reconheceu ▁xxmaj ▁and r ônico ▁como ▁um ▁tira no ▁potencial ▁e ▁escreveu ▁de ▁volta ▁em ▁termos ▁insu l tuoso s . ▁o ▁primo ▁de ▁xxmaj ▁ va ta tz es , ▁xxmaj ▁and r ônico ▁xxmaj ▁contos te fa no , ▁o ▁comandante ▁da ▁marinha ▁bizantina , ▁porém , ▁foi ▁ ilu dido ▁e ▁teve ▁um ▁papel ▁pre pon der ante ▁no ▁golpe ▁ao ▁permitir ▁que ▁as ▁forças ▁de ▁xxmaj ▁and r ônico ▁in va di ssem ▁xxmaj ▁constantinopla . ▁xxmaj ▁uma ▁vez ▁no ▁poder , ▁xxmaj ▁and r ônico ▁xxmaj ▁com ne no ▁se ▁mostrou ▁de ▁fato ▁um ▁tira no ▁im bu ído ▁de ▁um ▁ar dente ▁desejo ▁de ▁rompe r ▁o ▁poder ▁e ▁influência ▁das ▁famílias ▁ar isto cráti cas ▁bizantina s . ▁xxmaj ▁no ▁final ▁de ▁11 82 , ▁com ▁xxmaj ▁and r ônico ▁firme ▁no ▁poder , ▁xxmaj ▁ va ta tz es ▁aparece ▁mora ndo ▁perto ▁de ▁xxmaj ▁fila dé l fia , ▁na ▁xxmaj ▁anatólia ▁xxmaj ▁ocidental ; ▁pre s ume - se ▁que ▁ele ▁tenha ▁perdido ▁todos ▁os ▁seus ▁cargos . ▁xxmaj ▁como ▁membro ▁de ▁uma ▁família ▁imperial ▁e ▁um ▁general ▁respeita do ▁e ▁vitorioso , ▁ele ▁não ▁teve ▁dificuldades ▁para ▁criar ▁um ▁grande ▁exército ▁quando ▁ele ▁resolveu ▁se ▁re bel ar ▁aberta mente ▁contra ▁o ▁novo ▁regime . ▁xxmaj ▁ va ta tz es ▁considerava ▁xxmaj ▁and r ônico ▁um ▁\" ad ver s ário ▁de mon íaco \" ▁que ▁estava ▁\" con centr ado ▁em ▁ex termina r ▁a ▁família ▁imperial \". 
▁xxmaj ▁pelo ▁menos ▁esta ▁segunda ▁acusação ▁era ▁verdade . ▁xxmaj ▁and r ônico ▁i ▁enviou ▁o ▁general ▁xxmaj ▁and r ônico ▁xxmaj ▁ lam par das ▁( ou ▁xxmaj ▁la par das ) ▁contra ▁xxmaj ▁ va ta tz es ▁com ▁um ▁grande ▁exército . ▁xxmaj ▁ va ta tz es , ▁que ▁havia ▁fica do ▁seria mente ▁do ente , ▁deu - lhe ▁combate ▁perto ▁de ▁xxmaj ▁fila dé l fia . ▁xxmaj ▁ele ▁primeiro ▁instru iu ▁seus ▁filhos , ▁xxmaj ▁manuel ▁e ▁xxmaj ▁aleixo , ▁em ▁como ▁dis por ▁o ▁exército ▁e , ▁em ▁seguida , ▁foi ▁carregado ▁até ▁uma ▁colina ▁de ▁onde ▁ele ▁podia ▁observar ▁a ▁batalha ▁numa ▁ lit eira . ▁xxmaj ▁as ▁forças ▁de ▁xxmaj ▁ va ta tz es ▁foram ▁vitoriosa s ▁e ▁as ▁tropas ▁remanescente s ▁de ▁xxmaj ▁ lam par das , ▁em ▁fuga , ▁foram ▁persegui das . ▁xxmaj ▁porém , ▁uns ▁poucos ▁dias ▁depois , ▁em ▁16 ▁de ▁maio ▁de ▁11 82 , ▁xxmaj ▁joão ▁xxmaj ▁ va ta tz es ▁morreu . ▁xxmaj ▁sem ▁o ▁líder , ▁a ▁revolta ▁rapidamente ▁se ▁de s fe z ▁e ▁os ▁filhos ▁do ▁general ▁fugir am ▁para ▁a ▁corte ▁do ▁sultão ▁se l jú cida . ▁xxmaj ▁posteriormente , ▁quando ▁tentava m ▁chegar ▁à ▁xxmaj ▁sicília ▁por ▁mar , ▁na u frag aram ▁e ▁terminaram ▁presos . ▁xxmaj ▁eles ▁foram ▁então ▁ce gado s ▁por ▁ordem ▁de ▁xxmaj ▁and r ônico ▁xxup ▁i , ▁que ▁considerou ▁a ▁morte ▁de ▁xxmaj ▁ va ta tz es ▁como ▁\" pro vi dência ▁divina \". ▁xxmaj ▁depois ▁do ▁fracasso ▁da ▁revolta , ▁ele ▁se ▁en cora jou ▁a ▁declara r - se ▁co imperador ▁ao ▁lado ▁de ▁xxmaj ▁aleixo . ▁xxmaj ▁joão ▁xxmaj ▁com ne no ▁xxmaj ▁ va ta tz es ▁é ▁uma ▁das ▁poucas ▁figuras ▁cujo ▁caráter ▁é ▁descrito ▁com ▁uma ▁admira ção ▁ pura ▁nas ▁obras ▁do ▁historiador ▁bizantino ▁xxmaj ▁ nice tas ▁xxmaj ▁con ia tes ." 
] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.train_ds.x[1]" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "### Backward" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 38min 20s, sys: 25 s, total: 38min 45s\n", "Wall time: 20min 19s\n" ] } ], "source": [ "%%time\n", "data = (TextList.from_folder(dest, processor=[OpenFileProcessor(), SPProcessor(max_vocab_sz=15000)])\n", " .split_by_rand_pct(0.1, seed=42)\n", " .label_for_lm()\n", " .databunch(bs=bs, num_workers=1, backwards=True))\n", "\n", "data.save(f'{path}/{lang}_databunch_corpus2_100_sp15_multifit_bwd')" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "nbpresent": { "id": "052df7c2-f57b-4596-9189-c9e01102e5e9" } }, "source": [ "## Training" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "### Forward" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "hidden": true, "nbpresent": { "id": "defab6d9-ba04-4943-b574-9300e20d5e1c" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3.74 s, sys: 2.58 s, total: 6.32 s\n", "Wall time: 12.3 s\n" ] } ], "source": [ "%%time\n", "data = load_data(path, f'{lang}_databunch_corpus2_100_sp15_multifit', bs=bs)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "hidden": true }, "outputs": [], "source": [ "config = awd_lstm_lm_config.copy()\n", "config['qrnn'] = True\n", "config['n_hid'] = 1550 #default 1152\n", "config['n_layers'] = 4 #default 3" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "hidden": true, "nbpresent": { "id": "670d6ea2-913e-4cd2-80d8-1e3d34f0b9c0" }, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 4.75 s, sys: 1.73 s, total: 6.48 s\n", "Wall time: 6.88 s\n" ] } ], 
"source": [ "%%time\n", "perplexity = Perplexity()\n", "learn = language_model_learner(data, AWD_LSTM, config=config, drop_mult=0., pretrained=False, \n", " metrics=[error_rate, accuracy, perplexity]).to_fp16()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "number of parameters: 46020150\n" ] } ], "source": [ "print(f'number of parameters: {sum([parameter.numel() for parameter in learn.model.parameters()])}')" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "SequentialRNN(\n", " (0): AWD_LSTM(\n", " (encoder): Embedding(15000, 400, padding_idx=1)\n", " (encoder_dp): EmbeddingDropout(\n", " (emb): Embedding(15000, 400, padding_idx=1)\n", " )\n", " (rnns): ModuleList(\n", " (0): QRNN(\n", " (layers): ModuleList(\n", " (0): QRNNLayer(\n", " (linear): WeightDropout(\n", " (module): Linear(in_features=800, out_features=4650, bias=True)\n", " )\n", " )\n", " )\n", " )\n", " (1): QRNN(\n", " (layers): ModuleList(\n", " (0): QRNNLayer(\n", " (linear): WeightDropout(\n", " (module): Linear(in_features=1550, out_features=4650, bias=True)\n", " )\n", " )\n", " )\n", " )\n", " (2): QRNN(\n", " (layers): ModuleList(\n", " (0): QRNNLayer(\n", " (linear): WeightDropout(\n", " (module): Linear(in_features=1550, out_features=4650, bias=True)\n", " )\n", " )\n", " )\n", " )\n", " (3): QRNN(\n", " (layers): ModuleList(\n", " (0): QRNNLayer(\n", " (linear): WeightDropout(\n", " (module): Linear(in_features=1550, out_features=1200, bias=True)\n", " )\n", " )\n", " )\n", " )\n", " )\n", " (input_dp): RNNDropout()\n", " (hidden_dps): ModuleList(\n", " (0): RNNDropout()\n", " (1): RNNDropout()\n", " (2): RNNDropout()\n", " (3): RNNDropout()\n", " )\n", " )\n", " (1): LinearDecoder(\n", " (decoder): Linear(in_features=400, out_features=15000, bias=True)\n", " (output_dp): RNNDropout()\n", " )\n", ")" ] }, 
"execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "learn.model" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "hidden": true, "nbpresent": { "id": "e74f8988-7c26-495b-90fb-0da012f54c1a" } }, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.\n" ] } ], "source": [ "learn.lr_find()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "hidden": true, "nbpresent": { "id": "e0333edf-f5cd-4c7d-9db3-01dfb7cc4bb8" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "learn.recorder.plot()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "hidden": true, "nbpresent": { "id": "55299290-8c05-415f-8f73-a3656c1d488c" } }, "outputs": [], "source": [ "lr = 3e-3\n", "lr *= bs/48 # Scale learning rate by batch size" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "hidden": true, "nbpresent": { "id": "debf9e8a-b4ec-4b06-8344-a8f01150d941" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
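<table>
  <thead>
    <tr><th>epoch</th><th>train_loss</th><th>valid_loss</th><th>error_rate</th><th>accuracy</th><th>perplexity</th><th>time</th></tr>
  </thead>
  <tbody>
    <tr><td>0</td><td>3.712833</td><td>3.801982</td><td>0.689854</td><td>0.310145</td><td>44.789921</td><td>45:07</td></tr>
    <tr><td>1</td><td>3.495445</td><td>3.612981</td><td>0.673845</td><td>0.326156</td><td>37.076435</td><td>44:49</td></tr>
    <tr><td>2</td><td>3.416562</td><td>3.550838</td><td>0.667659</td><td>0.332342</td><td>34.842659</td><td>45:11</td></tr>
    <tr><td>3</td><td>3.346702</td><td>3.473753</td><td>0.658722</td><td>0.341275</td><td>32.257648</td><td>45:09</td></tr>
    <tr><td>4</td><td>3.269586</td><td>3.402409</td><td>0.649641</td><td>0.350360</td><td>30.036201</td><td>45:10</td></tr>
    <tr><td>5</td><td>3.206794</td><td>3.321666</td><td>0.639204</td><td>0.360796</td><td>27.706499</td><td>45:08</td></tr>
    <tr><td>6</td><td>3.124528</td><td>3.235740</td><td>0.627807</td><td>0.372192</td><td>25.425192</td><td>45:08</td></tr>
    <tr><td>7</td><td>3.092082</td><td>3.157651</td><td>0.616111</td><td>0.383889</td><td>23.515293</td><td>45:11</td></tr>
    <tr><td>8</td><td>2.963153</td><td>3.095485</td><td>0.606017</td><td>0.393983</td><td>22.097927</td><td>45:11</td></tr>
    <tr><td>9</td><td>2.998470</td><td>3.079903</td><td>0.603129</td><td>0.396870</td><td>21.756315</td><td>45:09</td></tr>
  </tbody>
</table>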
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Better model found at epoch 0 with accuracy value: 0.31014543771743774.\n", "Better model found at epoch 1 with accuracy value: 0.3261563181877136.\n", "Better model found at epoch 2 with accuracy value: 0.3323418200016022.\n", "Better model found at epoch 3 with accuracy value: 0.34127548336982727.\n", "Better model found at epoch 4 with accuracy value: 0.3503599464893341.\n", "Better model found at epoch 5 with accuracy value: 0.3607962131500244.\n", "Better model found at epoch 6 with accuracy value: 0.3721919059753418.\n", "Better model found at epoch 7 with accuracy value: 0.3838889002799988.\n", "Better model found at epoch 8 with accuracy value: 0.393983393907547.\n", "Better model found at epoch 9 with accuracy value: 0.3968702554702759.\n", "CPU times: user 5h 51min 50s, sys: 1h 41min 8s, total: 7h 32min 59s\n", "Wall time: 7h 31min 56s\n" ] } ], "source": [ "%%time\n", "learn.unfreeze()\n", "wd = 0.01 \n", "learn.fit_one_cycle(10, lr, wd=wd, moms=(0.8,0.7), \n", " callbacks=[ShowGraph(learn),\n", " SaveModelCallback(learn.to_fp32(), monitor='accuracy', name='bestmodel_sp15_multifit')])" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Save the pretrained model and vocab:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "hidden": true }, "outputs": [], "source": [ "mdl_path = path/'models'\n", "mdl_path.mkdir(exist_ok=True)\n", "learn.to_fp32().save(mdl_path/lm_fns3[0], with_opt=False)\n", "learn.data.vocab.save(mdl_path/(lm_fns3[1] + '.pkl'))" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "### Backward" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "hidden": true }, "outputs": [], "source": [ "data = load_data(path, f'{lang}_databunch_corpus2_100_sp15_multifit_bwd', bs=bs, backwards=True)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { 
"hidden": true }, "outputs": [], "source": [ "config = awd_lstm_lm_config.copy()\n", "config['qrnn'] = True\n", "config['n_hid'] = 1550 #default 1152\n", "config['n_layers'] = 4 #default 3" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "hidden": true, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 952 ms, sys: 68 ms, total: 1.02 s\n", "Wall time: 1.02 s\n" ] } ], "source": [ "%%time\n", "perplexity = Perplexity()\n", "learn = language_model_learner(data, AWD_LSTM, config=config, drop_mult=0., pretrained=False, \n", " metrics=[error_rate, accuracy, perplexity]).to_fp16()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.\n" ] } ], "source": [ "learn.lr_find()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "hidden": true }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "learn.recorder.plot()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "hidden": true }, "outputs": [], "source": [ "lr = 3e-3\n", "lr *= bs/48 # Scale learning rate by batch size\n", "\n", "wd = 0.01" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
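<table>
  <thead>
    <tr><th>epoch</th><th>train_loss</th><th>valid_loss</th><th>error_rate</th><th>accuracy</th><th>perplexity</th><th>time</th></tr>
  </thead>
  <tbody>
    <tr><td>0</td><td>3.759265</td><td>3.812161</td><td>0.645454</td><td>0.354546</td><td>45.248158</td><td>45:15</td></tr>
    <tr><td>1</td><td>3.524330</td><td>3.643459</td><td>0.632401</td><td>0.367599</td><td>38.223717</td><td>45:16</td></tr>
    <tr><td>2</td><td>3.399805</td><td>3.569191</td><td>0.624596</td><td>0.375404</td><td>35.487923</td><td>45:14</td></tr>
    <tr><td>3</td><td>3.358356</td><td>3.492392</td><td>0.615690</td><td>0.384309</td><td>32.864567</td><td>48:06</td></tr>
    <tr><td>4</td><td>3.298991</td><td>3.423204</td><td>0.607334</td><td>0.392666</td><td>30.667429</td><td>44:30</td></tr>
    <tr><td>5</td><td>3.230762</td><td>3.344130</td><td>0.598109</td><td>0.401891</td><td>28.335854</td><td>44:15</td></tr>
    <tr><td>6</td><td>3.139510</td><td>3.259253</td><td>0.586869</td><td>0.413130</td><td>26.030067</td><td>44:16</td></tr>
    <tr><td>7</td><td>3.123008</td><td>3.175821</td><td>0.575663</td><td>0.424338</td><td>23.946394</td><td>44:10</td></tr>
    <tr><td>8</td><td>3.007710</td><td>3.114435</td><td>0.566360</td><td>0.433641</td><td>22.520712</td><td>44:10</td></tr>
    <tr><td>9</td><td>2.955673</td><td>3.098413</td><td>0.563313</td><td>0.436688</td><td>22.162729</td><td>44:18</td></tr>
  </tbody>
</table>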
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Better model found at epoch 0 with accuracy value: 0.35454609990119934.\n", "Better model found at epoch 1 with accuracy value: 0.36759868264198303.\n", "Better model found at epoch 2 with accuracy value: 0.3754040598869324.\n", "Better model found at epoch 3 with accuracy value: 0.38430914282798767.\n", "Better model found at epoch 4 with accuracy value: 0.39266595244407654.\n", "Better model found at epoch 5 with accuracy value: 0.4018912613391876.\n", "Better model found at epoch 6 with accuracy value: 0.4131302535533905.\n", "Better model found at epoch 7 with accuracy value: 0.4243376851081848.\n", "Better model found at epoch 8 with accuracy value: 0.43364080786705017.\n", "Better model found at epoch 9 with accuracy value: 0.43668755888938904.\n", "CPU times: user 5h 48min 48s, sys: 1h 42min 19s, total: 7h 31min 7s\n", "Wall time: 7h 30min 9s\n" ] } ], "source": [ "%%time\n", "learn.unfreeze()\n", "learn.fit_one_cycle(10, lr, wd=wd, moms=(0.8,0.7), \n", " callbacks=[ShowGraph(learn),\n", " SaveModelCallback(learn.to_fp32(), monitor='accuracy', name='bestmodel_sp15_multifit_bwd')])" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "hidden": true }, "outputs": [], "source": [ "mdl_path = path/'models'\n", "mdl_path.mkdir(exist_ok=True)\n", "learn.to_fp32().save(mdl_path/lm_fns3_bwd[0], with_opt=False)\n", "learn.data.vocab.save(mdl_path/(lm_fns3_bwd[1] + '.pkl'))" ] }, { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "29ff5bf7-47d3-4bb6-8ef7-4dbee784acd0" } }, "source": [ "## Generate fake texts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note**: the architecture used for our Portuguese LM is based on 4 QRNN with about 46 millions of parameters. 
This kind of architecture can be sufficient to fine-tune another LM on a specific corpus in order to ultimately create a text classifier (the [ULMFiT](http://nlp.fast.ai/category/classification.html) method), but it is not sufficient to create an efficient text generator (better to use a model such as [GPT-2](https://github.com/openai/gpt-2) or [BERT](https://github.com/google-research/bert)). Moreover, the SentencePiece tokenizer used in this notebook implements subword units (e.g., byte-pair encoding (BPE) [Sennrich et al.] and the unigram language model [Kudo]), so the model can generate subword pieces or single characters from its vocabulary instead of whole words." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "nbpresent": { "id": "903b31b8-77bb-48a7-a6de-fd2584005619" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3.11 s, sys: 1.81 s, total: 4.92 s\n", "Wall time: 13.7 s\n" ] } ], "source": [ "%%time\n", "data = load_data(path, f'{lang}_databunch_corpus2_100_sp15_multifit', bs=bs)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "config = awd_lstm_lm_config.copy()\n", "config['qrnn'] = True\n", "config['n_hid'] = 1550 #default 1152\n", "config['n_layers'] = 4 #default 3" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "nbpresent": { "id": "f2e132d0-a490-49cb-895a-ee9a66c76c2d" } }, "outputs": [], "source": [ "# LM without pretraining\n", "learn = language_model_learner(data, AWD_LSTM, config=config, pretrained=False)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "nbpresent": { "id": "7596a7ac-558a-4bd7-80af-247bf0a82732" } }, "outputs": [], "source": [ "# LM pretrained in English\n", "learn_en = language_model_learner(data, AWD_LSTM, pretrained=True)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "nbpresent": { "id": "cdaac260-7cfc-4255-b03f-e1a88eccdc71" } }, "outputs": [], "source": [ "# LM pretrained in Portuguese\n", "learn_pt = 
language_model_learner(data, AWD_LSTM, config=config, pretrained_fnames=lm_fns3)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "nbpresent": { "id": "12d9ce7c-fdb1-4108-ab95-764a015f8e0e" }, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Nadal venceu o torneio ▁comprar ▁relevante ▁cordas mila ▁lutadores ▁assumido fos ল velmente ▁conjunto ▁american ▁lentamente ▁empregada ▁extra เ anfitriã ⵙ ▁gonzaga ⊂ ̌ coli ▁feijão gi ▁útil ▁notável evolution my 20 ▁gen ▁subespécie ổ ▁cachoeira ▁natal nik ▁jornal ▁cartões ▁desportivo ▁grega ỏ sa ▁azulejos ▁habitat ▁energia fc ▁fazê 68 ▁reencontra ▁humorístico ▁mitchell ▁kris organiz ▁designar ▁castro ▁cerâmica ▁bob burg dos ▁usualmente ▁pessoas ▁principais ▁saudita ▁rom ▁escolhido ▁abandonou ▁patrimônio rid ib ▁ventos ▁banda ▁fim ▁dinamarquesa ን ▁divididos ▁apropriado carna ▁graves ▁vir ▁fm ខ far ▁cbf 提 wel ▁aclamação ├ 童 ▁vinda ▁alusão mun ▁bacia 工 ▁arrenda ▁biológica ▁marginal ▁cheio hí ▁chicago ▁daí 화 ▁parceira\n" ] } ], "source": [ "TEXT = \"Nadal venceu o torneio\" # original text\n", "N_WORDS = 100 # number of words to predict following the TEXT\n", "N_SENTENCES = 1 # number of different predictions\n", "\n", "print(\"\\n\".join(learn.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "nbpresent": { "id": "9be3d15c-526a-4ab7-bd46-16e8a8c58142" }, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Nadal venceu o torneio into the \" big - screen \" . a \" ' cop - man ' game , \" \" a first - time \" , is a one - game , game - to - day play . \" a \" as a game for the \" bad man \" , \" you - do - it - you - do - it - me - to - do \" game , \" the first game \" , is \" the first game in the world \" . 
a game in all four game , \" the game \" ,\n" ] } ], "source": [ "TEXT = \"Nadal venceu o torneio\" # original text\n", "N_WORDS = 100 # number of words to predict following the TEXT\n", "N_SENTENCES = 1 # number of different predictions\n", "\n", "print(\"\\n\".join(learn_en.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "nbpresent": { "id": "35c213f4-3c90-4774-9cdd-56da02ebe3a8" }, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Nadal venceu o torneio ▁xxmaj ▁ ▁oficial ▁de ▁ , ▁em ▁seu ▁segundo ▁dia ▁no ▁dia ▁da ▁xxmaj ▁copa ▁xxmaj ▁ master s , ▁em ▁xxmaj ▁ las ▁xxmaj ▁vegas , ▁em ▁15 ▁de ▁xxmaj ▁abril ▁de ▁2016. ▁xxmaj ▁durante ▁a ▁temporada , ▁foi ▁o ▁primeiro ▁título ▁da ▁liga , ▁vencendo ▁o ▁xxmaj ▁ s pur s ▁por ▁9 zo s ▁por ▁3 ▁a ▁0 . ▁o ▁campeão ▁foi ▁o ▁ , ▁da ▁xxmaj ▁copa ▁do ▁xxmaj ▁mundo ▁de ▁xxmaj ▁ ty r rell , ▁onde ▁o ▁título ▁passou ▁a ▁ser ▁xxmaj ▁ s qua d . ▁a ▁xxup ▁ m l s ▁é\n" ] } ], "source": [ "TEXT = \"Nadal venceu o torneio\" # original text\n", "N_WORDS = 100 # number of words to predict following the TEXT\n", "N_SENTENCES = 1 # number of different predictions\n", "\n", "print(\"\\n\".join(learn_pt.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))" ] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:root] *", "language": "python", "name": "conda-root-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }