{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Byte-level BPE, an universal tokenizer but...\n", "\n", "> A study of the \"universality\" of a Byte-level Byte-Pair-Encoding (BBPE) tokenizer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Author: [Pierre Guillou](https://www.linkedin.com/in/pierreguillou)\n", "- Date: July 2020\n", "- Post on Medium: [Byte-level BPE, an universal tokenizer but…](https://medium.com/@pierre_guillou/byte-level-bpe-an-universal-tokenizer-but-aff932332ffe) (07/03/2020)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this study, we will see that, while a BBPE (Byte-level Byte-Pair-Encoding) tokenizer trained on a huge monolingual corpus can indeed tokenize any word of any language (there is no unknown token), **it requires on average almost 70% more tokens** when it is applied to a text in a language different from the one used for its training.\n", "\n", "This information is key when choosing a tokenizer to train a natural language model such as a Transformer model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What is a tokenizer?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For background, read the great [tutorial \"Tokenizer summary\"](https://huggingface.co/transformers/master/tokenizer_summary.html) by Sylvain Gugger (Hugging Face)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## About the Byte-level BPE (BBPE) tokenizer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the [tutorial \"Tokenizer summary\"](https://huggingface.co/transformers/master/tokenizer_summary.html), read the paragraphs [Byte-Pair Encoding](https://huggingface.co/transformers/master/tokenizer_summary.html#byte-pair-encoding) and [Byte-level BPE](https://huggingface.co/transformers/master/tokenizer_summary.html#byte-level-bpe) for an overview of Byte-level BPE (Byte-level Byte-Pair-Encoding), then read the Abstract and Conclusion of the original paper: [Neural Machine Translation with Byte-Level Subwords](https://arxiv.org/pdf/1909.03341.pdf) (Facebook AI, 12/05/2019).\n", "\n", "> **[Abstract]** Almost all existing machine translation models are built on top of character-based vocabularies: characters, subwords or words. Rare characters from noisy text or character-rich languages such as Japanese and Chinese however can unnecessarily take up vocabulary slots and limit its compactness. Representing text at the level of bytes and using the 256 byte set as vocabulary is a potential solution to this issue. High computational cost has however prevented it from being widely deployed or used in practice. **In this paper, we investigate byte-level subwords, specifically byte-level BPE (BBPE), which is compacter than character vocabulary and has no out-of-vocabulary tokens, but is more efficient than using pure bytes only is.** We claim that contextualizing BBPE embeddings is necessary, which can be implemented by a convolutional or recurrent layer. **Our experiments show that BBPE has comparable performance to BPE while its size is only 1/8 of that for BPE.** In the multilingual setting, BBPE maximizes vocabulary sharing across many languages and achieves better translation quality. 
Moreover, we show that BBPE enables transferring models between languages with non-overlapping character sets.\n", "\n", "> **[Conclusion]** We proposed BBPE which builds a byte-level subword vocabulary for machine translation. It results in a much more compact vocabulary than character-based ones do without the loss of performance. In multilingual settings, the former often outperforms the latter. **BBPE does not have any out-of-vocabulary tokens, allowing us to transfer a model using BBPE between languages with non-overlapping vocabularies.** This transfer learning paradigm is actually very generic and can be applied to any languages and datasets for performance gain or training acceleration. With the same vocabulary size, BBPE segments sentences into shorter sequences than character-based methods do, leading to faster training and inference. Our future work includes: eliminating source-target sentence length imbalance; evaluating BBPE in one-to-many and many-to-many translation settings; exploring better segmentation algorithms for byte-level subwords." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## About the tokenizers and NLP libraries used in this study" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The tokenizers and NLP libraries used to perform this study were:\n", "- English pre-trained GPT2 tokenizer ([GPT2TokenizerFast](https://huggingface.co/transformers/model_doc/gpt2.html#gpt2tokenizerfast)) from the [Transformers](https://github.com/huggingface/transformers) library (Hugging Face, version 3.0.0): it is a Fast GPT-2 BBPE tokenizer (backed by Hugging Face's tokenizers library)\n", "- ByteLevelBPETokenizer trained on Portuguese, from the [Tokenizers](https://github.com/huggingface/tokenizers) library (Hugging Face, version 0.8.0)\n", "- Deep Learning library [fastai v2](https://github.com/fastai/fastai2) (fastai2, version 0.0.17) and the Wikipedia downloading functions of the file [nlputils_fastai2.py](https://github.com/piegu/fastai-projects/blob/master/nlputils_fastai2.py)\n", "\n", "**Note about the Deep Learning libraries**: the [Tokenizers](https://github.com/huggingface/tokenizers) and [Transformers](https://huggingface.co/transformers/) libraries from [Hugging Face](https://huggingface.co/) are up-to-date and widely used NLP (Natural Language Processing) libraries, while fastai v2 is a great tool for training Deep Learning models, especially with powerful fastai tools like the learning rate finder, mixed precision training, distributed training, gradual unfreezing, differential learning rates and the 1cycle policy.\n", "\n", "**Note about the choice of the GPT2 tokenizer**: we could have chosen another pre-trained BBPE tokenizer for this study. The key point is to use BBPE tokenizers trained on a huge corpus: since their base vocabulary is made of bytes, they can tokenize any word of any language without using the unknown token." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Initialization" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "nbpresent": { "id": "151cd18f-76e3-440f-a8c7-ffa5c6b5da01" } }, "outputs": [], "source": [ "from fastai2.text.all import *\n", "from nlputils_fastai2 import * \n", "\n", "%reload_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'archive_path': '/mnt/home/pierre/.fastai/archive',\n", " 'data_path': '/mnt/home/pierre/.fastai/data',\n", " 'model_path': '/mnt/home/pierre/.fastai/models',\n", " 'storage_path': '/mnt/home/pierre/.fastai/data',\n", " 'version': 2}" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Get config of fastai2 paths\n", "config = Config()\n", "config.d" ] }, { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "cf070ab7-babb-4cf0-a315-401f65461dc8" } }, "source": [ "This will create a `{lang}wiki` folder, containing a `{lang}wiki` text file with the wikipedia contents (for other languages, replace `{lang}` with the appropriate code from the [list of wikipedias](https://meta.wikimedia.org/wiki/List_of_Wikipedias))." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "nbpresent": { "id": "70da588b-8af1-4f97-97c2-c9f2d4d46e1a" } }, "outputs": [], "source": [ "lang = 'pt'" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "nbpresent": { "id": "701ab344-0430-4f43-bbe2-337a12cae6be" } }, "outputs": [], "source": [ "# setup new path_data and create the corresponding folder\n", "name = f'{lang}_wiki'\n", "data_path = config['data_path']\n", "path_data = data_path/name\n", "path_data.mkdir(exist_ok=True, parents=True)" ] }, { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "bfe49910-58e0-4be3-aba1-7733dc18cca2" } }, "source": [ "## Download Wikipedia in Portuguese" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### By fastai" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note**: all the following methods come from the file [nlputils_fastai2.py](https://github.com/piegu/fastai-projects/blob/master/nlputils_fastai2.py)." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Path('/mnt/home/pierre/course-v4/nbs'),\n", " Path('/mnt/home/pierre/.fastai/data/pt_wiki'))" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Path.cwd(), path_data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "# Download Wikipedia in Portuguese (bz2 archive of 1.62 GB)\n", "# duration: 40m 30s\n", "get_wiki(path_data,lang)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If `get_wiki(path_data,lang)` fails, run the download manually in a terminal:\n", "- `mkdir -p /mnt/home/pierre/.fastai/data/pt_wiki`\n", "- `cd /mnt/home/pierre/.fastai/data/pt_wiki`\n", "- `wget -c https://dumps.wikimedia.org/ptwiki/latest/ptwiki-latest-pages-articles.xml.bz2`\n", "- `bzip2 -dk ptwiki-latest-pages-articles.xml.bz2`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "--2020-07-02 
18:35:23-- https://dumps.wikimedia.org/ptwiki/latest/ptwiki-latest-pages-articles.xml.bz2\n", "Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.7, 2620:0:861:1:208:80:154:7\n", "Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.7|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 1744626313 (1.6G) [application/octet-stream]\n", "Saving to: ‘ptwiki-latest-pages-articles.xml.bz2’\n", "```\n", "\n", "```\n", "ptwiki-latest-pages-articles.xml.bz2\n", "100%[==================================================================================>] \n", "1.62G 791KB/s in 40m 30s\n", "2020-07-02 19:15:54 (701 KB/s) - ‘ptwiki-latest-pages-articles.xml.bz2’ saved [1744626313/1744626313]\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And re-run `get_wiki(path_data,lang)` once the download is successful." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "extracting...\n", "CPU times: user 22.7 ms, sys: 18.3 ms, total: 41 ms\n", "Wall time: 10min 3s\n" ] } ], "source": [ "%%time\n", "# Download Wikipedia in Portuguese\n", "get_wiki(path_data,lang)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "Astronomia\r\n", "\r\n", "Astronomia é uma ciência natural que estuda corpos celestes (como estrelas, planetas, cometas, nebulosas, aglomerados de estrelas, galáxias) e fenômenos que se originam fora da atmosfera da Terra (como a radiação cósmica de fundo em micro-ondas). 
Preocupada com a evolução, a física, a química e o movimento de objetos celestes, bem como a formação e o desenvolvimento do universo.\r\n" ] } ], "source": [ "# Check the content downloaded\n", "name = f'{lang}wiki'\n", "!head -n4 {path_data}/{name}" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "100000\n", "200000\n", "300000\n", "400000\n", "500000\n", "600000\n", "700000\n", "800000\n", "900000\n", "1000000\n", "1100000\n", "1200000\n", "1300000\n", "1400000\n", "1500000\n", "1600000\n", "1700000\n", "1800000\n", "1900000\n", "2000000\n", "2100000\n", "2200000\n", "2300000\n", "2400000\n", "2500000\n", "2600000\n", "2700000\n", "2800000\n", "2900000\n", "3000000\n", "3100000\n", "3200000\n", "3300000\n", "3400000\n", "3500000\n", "3600000\n", "3700000\n", "3800000\n", "3900000\n", "4000000\n", "4100000\n", "4200000\n", "4300000\n", "4400000\n", "4500000\n", "4600000\n", "4700000\n", "4800000\n", "4900000\n", "5000000\n", "5100000\n", "5200000\n", "5300000\n", "5400000\n", "5500000\n", "5600000\n", "5700000\n", "5800000\n", "5900000\n", "6000000\n", "6100000\n", "6200000\n", "6300000\n", "6400000\n", "6500000\n", "6600000\n", "6700000\n", "6800000\n", "6900000\n", "7000000\n", "7100000\n", "7200000\n", "7300000\n", "7400000\n", "CPU times: user 36.3 s, sys: 13.8 s, total: 50.1 s\n", "Wall time: 5min 26s\n" ] } ], "source": [ "%%time\n", "# Split global download file to one article by text file\n", "dest = split_wiki(path_data,lang)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/mnt/home/pierre/.fastai/data/pt_wiki/docs/Fotografia.txt\n", "/mnt/home/pierre/.fastai/data/pt_wiki/docs/Espadanedo (Macedo de Cavaleiros).txt\n", "/mnt/home/pierre/.fastai/data/pt_wiki/docs/Jacques-Germain Soufflot.txt\n", "/mnt/home/pierre/.fastai/data/pt_wiki/docs/Faculdade de Medicina da Universidade 
de São Paulo.txt\n", "/mnt/home/pierre/.fastai/data/pt_wiki/docs/Escola do Teatro Bolshoi no Brasil.txt\n" ] } ], "source": [ "# Check the splitting\n", "dest = path_data/'docs'\n", "for file in dest.ls()[:5]:\n", "    print(file)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "204315 files - 194843190 tokens\n", "CPU times: user 38.5 s, sys: 15.7 s, total: 54.1 s\n", "Wall time: 3min 45s\n" ] } ], "source": [ "%%time\n", "# number of files and total tokens\n", "num_files, num_tokens = get_num_tokens(dest)\n", "print(f'{num_files} files - {num_tokens} tokens')" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.6G\t/mnt/home/pierre/.fastai/data/pt_wiki/docs\r\n" ] } ], "source": [ "# Size of downloaded data \n", "!du -hs {dest}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### By Hugging Face" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[nlp](https://github.com/huggingface/nlp) is a lightweight and extensible library from Hugging Face to easily share and access datasets and evaluation metrics for Natural Language Processing (NLP).\n", "- Colab tutorial: https://colab.research.google.com/github/huggingface/nlp/blob/master/notebooks/Overview.ipynb\n", "- Online dataset explorer: https://huggingface.co/nlp/viewer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**[ WARNING ]** We tried to use it to download Wikipedia in Portuguese, but without success. However, to help people solve this issue, we decided to leave our code in this notebook." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# source: https://huggingface.co/nlp/viewer/?dataset=wikipedia&config=20200501.pt\n", "# !pip install nlp\n", "\n", "# Issues\n", "# source: https://github.com/huggingface/nlp/issues/227\n", "# !pip install apache_beam\n", "# !pip install dill==0.3.1.1\n", "# !pip install apache-beam[interactive]\n", "# !pip install mwparserfromhell" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "nlp==0.3.0\r\n" ] } ], "source": [ "!pip freeze | grep nlp" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "c6d3ba8825184c71aa82c44d5e318d85", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=13008.0, style=ProgressStyle(descriptio…" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "f2a6cf0107f44488a393822998a4ec1b", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=28653.0, style=ProgressStyle(descriptio…" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "Downloading and preparing dataset wikipedia/20200501.pt (download: Unknown size, generated: Unknown size, total: Unknown size) to /mnt/home/pierre/.cache/huggingface/datasets/wikipedia/20200501.pt/1.0.0...\n" ] }, { "ename": "MissingBeamOptions", "evalue": "Trying to generate a dataset using Apache Beam, yet no Beam Runner or PipelineOptions() has been provided in `load_dataset` or in the builder arguments. 
For big datasets it has to run on large-scale data processing tools like Dataflow, Spark, etc. More information about Apache Beam runners at https://beam.apache.org/documentation/runners/capability-matrix/\nIf you really want to run it locally because you feel like the Dataset is small enough, you can use the local beam runner called `DirectRunner` (you may run out of memory). \nExample of usage: \n\t`load_dataset('wikipedia', '20200501.pt', beam_runner='DirectRunner')`", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mMissingBeamOptions\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mnlp\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mload_dataset\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mdataset\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mload_dataset\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'wikipedia'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m'20200501.pt'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m~/.conda/envs/fastai2/lib/python3.7/site-packages/nlp/load.py\u001b[0m in \u001b[0;36mload_dataset\u001b[0;34m(path, name, version, data_dir, data_files, split, cache_dir, download_config, download_mode, ignore_verifications, save_infos, **config_kwargs)\u001b[0m\n\u001b[1;32m 522\u001b[0m \u001b[0mdownload_mode\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdownload_mode\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 523\u001b[0m \u001b[0mignore_verifications\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mignore_verifications\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 524\u001b[0;31m 
\u001b[0msave_infos\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0msave_infos\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 525\u001b[0m )\n\u001b[1;32m 526\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/.conda/envs/fastai2/lib/python3.7/site-packages/nlp/builder.py\u001b[0m in \u001b[0;36mdownload_and_prepare\u001b[0;34m(self, download_config, download_mode, ignore_verifications, save_infos, try_from_hf_gcs, dl_manager, **download_and_prepare_kwargs)\u001b[0m\n\u001b[1;32m 430\u001b[0m \u001b[0mverify_infos\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0msave_infos\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mignore_verifications\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 431\u001b[0m self._download_and_prepare(\n\u001b[0;32m--> 432\u001b[0;31m \u001b[0mdl_manager\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdl_manager\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mverify_infos\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mverify_infos\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mdownload_and_prepare_kwargs\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 433\u001b[0m )\n\u001b[1;32m 434\u001b[0m \u001b[0;31m# Sync info\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/.conda/envs/fastai2/lib/python3.7/site-packages/nlp/builder.py\u001b[0m in \u001b[0;36m_download_and_prepare\u001b[0;34m(self, dl_manager, verify_infos)\u001b[0m\n\u001b[1;32m 821\u001b[0m \u001b[0;34m\"Dataset is small enough, you can use the local beam runner called \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 822\u001b[0m \u001b[0;34m\"`DirectRunner` (you may run out of memory). 
\\nExample of usage: \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 823\u001b[0;31m \u001b[0;34m\"\\n\\t`{}`\"\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0musage_example\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 824\u001b[0m )\n\u001b[1;32m 825\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mMissingBeamOptions\u001b[0m: Trying to generate a dataset using Apache Beam, yet no Beam Runner or PipelineOptions() has been provided in `load_dataset` or in the builder arguments. For big datasets it has to run on large-scale data processing tools like Dataflow, Spark, etc. More information about Apache Beam runners at https://beam.apache.org/documentation/runners/capability-matrix/\nIf you really want to run it locally because you feel like the Dataset is small enough, you can use the local beam runner called `DirectRunner` (you may run out of memory). \nExample of usage: \n\t`load_dataset('wikipedia', '20200501.pt', beam_runner='DirectRunner')`" ] } ], "source": [ "# source: https://huggingface.co/nlp/viewer/?dataset=wikipedia&config=20200501.pt\n", "import nlp\n", "from nlp import load_dataset\n", "dataset = load_dataset('wikipedia','20200501.pt')" ] }, { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "bfe49910-58e0-4be3-aba1-7733dc18cca2" } }, "source": [ "### Create text and csv files of Wikipedia in Portuguese" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The text file (all the articles in one file) will allow the training of the Portuguese tokenizer and the csv one will facilitate the tests of the study." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "dest = path_data/'docs'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Text file" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "1000\n", "2000\n", "3000\n", "4000\n", "5000\n", "6000\n", "7000\n", "8000\n", "9000\n", "10000\n", "11000\n", "12000\n", "13000\n", "14000\n", "15000\n", "16000\n", "17000\n", "18000\n", "19000\n", "20000\n", "21000\n", "22000\n", "23000\n", "24000\n", "25000\n", "26000\n", "27000\n", "28000\n", "29000\n", "30000\n", "31000\n", "32000\n", "33000\n", "34000\n", "35000\n", "36000\n", "37000\n", "38000\n", "39000\n", "40000\n", "41000\n", "42000\n", "43000\n", "44000\n", "45000\n", "46000\n", "47000\n", "48000\n", "49000\n", "50000\n", "51000\n", "52000\n", "53000\n", "54000\n", "55000\n", "56000\n", "57000\n", "58000\n", "59000\n", "60000\n", "61000\n", "62000\n", "63000\n", "64000\n", "65000\n", "66000\n", "67000\n", "68000\n", "69000\n", "70000\n", "71000\n", "72000\n", "73000\n", "74000\n", "75000\n", "76000\n", "77000\n", "78000\n", "79000\n", "80000\n", "81000\n", "82000\n", "83000\n", "84000\n", "85000\n", "86000\n", "87000\n", "88000\n", "89000\n", "90000\n", "91000\n", "92000\n", "93000\n", "94000\n", "95000\n", "96000\n", "97000\n", "98000\n", "99000\n", "100000\n", "101000\n", "102000\n", "103000\n", "104000\n", "105000\n", "106000\n", "107000\n", "108000\n", "109000\n", "110000\n", "111000\n", "112000\n", "113000\n", "114000\n", "115000\n", "116000\n", "117000\n", "118000\n", "119000\n", "120000\n", "121000\n", "122000\n", "123000\n", "124000\n", "125000\n", "126000\n", "127000\n", "128000\n", "129000\n", "130000\n", "131000\n", "132000\n", "133000\n", "134000\n", "135000\n", "136000\n", "137000\n", "138000\n", "139000\n", "140000\n", "141000\n", "142000\n", "143000\n", "144000\n", "145000\n", "146000\n", "147000\n", 
"148000\n", "149000\n", "150000\n", "151000\n", "152000\n", "153000\n", "154000\n", "155000\n", "156000\n", "157000\n", "158000\n", "159000\n", "160000\n", "161000\n", "162000\n", "163000\n", "164000\n", "165000\n", "166000\n", "167000\n", "168000\n", "169000\n", "170000\n", "171000\n", "172000\n", "173000\n", "174000\n", "175000\n", "176000\n", "177000\n", "178000\n", "179000\n", "180000\n", "181000\n", "182000\n", "183000\n", "184000\n", "185000\n", "186000\n", "187000\n", "188000\n", "189000\n", "190000\n", "191000\n", "192000\n", "193000\n", "194000\n", "195000\n", "196000\n", "197000\n", "198000\n", "199000\n", "200000\n", "201000\n", "202000\n", "203000\n", "204000\n", "all texts from wikipedia pt in the file /mnt/home/pierre/.fastai/data/pt_wiki/all_texts_ptwiki.txt\n", "\n", "CPU times: user 49 s, sys: 15.1 s, total: 1min 4s\n", "Wall time: 3min 48s\n" ] } ], "source": [ "%%time\n", "get_one_clean_file(dest,lang)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.2G\t/mnt/home/pierre/.fastai/data/pt_wiki/all_texts_ptwiki.txt\r\n" ] } ], "source": [ "# size\n", "fname = f'all_texts_{lang}wiki.txt'\n", "!du -hs {dest.parent/fname}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### csv file" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "all texts from wikipedia pt in the file /mnt/home/pierre/.fastai/data/pt_wiki/all_texts_ptwiki.csv\n", "\n", "CPU times: user 1min 14s, sys: 11 s, total: 1min 25s\n", "Wall time: 3min 56s\n" ] } ], "source": [ "%%time\n", "get_one_clean_csv_file(dest,lang)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.2G\t/mnt/home/pierre/.fastai/data/pt_wiki/all_texts_ptwiki.csv\r\n" ] } ], "source": [ "# size\n", "fname = f'all_texts_{lang}wiki.csv'\n", "!du 
-hs {dest.parent/fname}" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " text\n", "0 Fotografia (do grego φως [\"fós\"] (\"luz\"), e γραφις [\"grafis\"] (\"estilo\", \"pincel\") ou γραφη \"grafê\", e significa \"desenhar com luz e contraste\"), por definição, é essencialmente a \"técnica de criação\" de imagens por meio de exposição luminosa, fixando-as em uma superfície sensível. A primeira fotografia reconhecida remonta ao ano de 1826 e é atribuída ao francês Joseph Nicéphore Niépce. Contudo, a invenção da fotografia não é obra de um só autor, mas um processo de acúmulo de avanços por parte de muitas pessoas, trabalhando, juntas ou em paralelo, ao longo de muitos anos. Se por um lado os...\n", "1 Espadanedo é uma antiga freguesia portuguesa do concelho de Macedo de Cavaleiros, com 17,90 km² de área e 188 habitantes (2011). A sua densidade populacional era 10,5 hab/km².\\nFoi extinta (agregada) pela reorganização administrativa de 2012/2013, sendo o seu território integrado na União de Freguesias de Espadanedo, Edroso, Murçós e Soutelo Mourisco.\\n\\nA antiga freguesia de S. Miguel de Espadanedo e Valongo pertenceu ao antigo concelho de Torre D. Chama, extinto a 24 de Outubro de 1855, em 1839 aparece agregada à comarca de Bragança e em 1852 à de Mirandela. Passa a pertencer definitivam...\n", "2 Jacques-Germain Soufflot (Irancy, 22 de julho de 1713 — Paris, 29 de agosto de 1780) foi um arquitecto francês, iniciador do estilo arquitectónico do Neoclassicismo. O seu trabalho mais conhecido é, sem dúvida, o Panthéon (Panteão) de Paris, construído a partir de 1755, inicialmente uma igreja dedicada a Santa Genoveva.\\n\\nSoufflot nasceu em Irancy, perto de Auxerre, na França. Com 18 anos, entrou para a Academia Francesa de Roma, onde os jovens estudantes da década de 1750 se tornariam na primeira geração de criativos pleanamente neoclássicos. Ficará na Itália de 1731 a 1738. Quando volto...\n", "3 A Faculdade de Medicina da Universidade de São Paulo (FMUSP) é uma escola médica da Universidade de São Paulo. 
Foi fundada em 1912 com o nome de \"Faculdade de Medicina e Cirurgia de São Paulo\" por Arnaldo Vieira de Carvalho (1867-1920), médico formado em 1888 pela Faculdade de Medicina do Rio de Janeiro. Em homenagem ao ilustre fundador, a Faculdade é, ainda hoje, chamada de a \"Casa de Arnaldo\" por seus alunos e ex-alunos. Em 1925 teve seu nome alterado para \"Faculdade de Medicina de São Paulo\" e em 1934, foi incorporada à recém-criada Universidade de São Paulo, passando a ter a atual desi...\n", "4 A Escola do Teatro Bolshoi no Brasil é uma tradicional escola de balé existente na cidade de Joinville, no estado de Santa Catarina. Fundada em 2000, é a única filial do Teatro Bolshoi de Moscou e possui alunos de vários estados brasileiros. Tem como missão formar artistas cidadãos, promover e difundir a arte-educação.\\n\\nA instituição foi fundada em 15 de março de 2000, é a única filial do Teatro Bolshoi. \\n\\nUm orgulho para o Brasil e para Joinville, cidade sede. A Escola do Teatro Bolshoi no Brasil, com professores russos e brasileiros, forma bailarinos com a mesma precisão, técnica e q..." ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(dest.parent/fname)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Byte-level BPE (BBPE) tokenizers from Transformers and Tokenizers (Hugging Face libraries)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We follow 3 steps to get 2 GPT2 tokenizers with the same structure, one trained on an English corpus and the other on Wikipedia in Portuguese:\n", "1. **Get the pre-trained GPT2 Tokenizer (pre-training with an English corpus) from the Transformers library (Hugging Face)**: it will give us the tokenizer structure we need and the pre-trained English tokenizer.\n", "2. 
**Train a Byte-level BPE (BBPE) Tokenizer on the Portuguese Wikipedia corpus by using the Tokenizers library (Hugging Face)**: this will give us the vocabulary files of our GPT2 tokenizer in Portuguese.\n", "3. **Import the tokenizer config files in Portuguese into the pre-trained GPT2 Tokenizer**: it will give us a tokenizer structure with the vocab in Portuguese." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Get the pre-trained GPT2 Tokenizer (pre-training with an English corpus)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we need to install the Transformers library." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "transformers==3.0.0\r\n" ] } ], "source": [ "# !pip install transformers\n", "!pip freeze | grep transformers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, we can get the English pre-trained GPT2 tokenizer ([GPT2Tokenizer](https://huggingface.co/transformers/model_doc/gpt2.html#gpt2tokenizer) or [GPT2TokenizerFast](https://huggingface.co/transformers/model_doc/gpt2.html#gpt2tokenizerfast))."
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 260 ms, sys: 63 ms, total: 323 ms\n", "Wall time: 1.53 s\n" ] } ], "source": [ "%%time\n", "from transformers import GPT2TokenizerFast\n", "\n", "pretrained_weights = 'gpt2'\n", "tokenizer_en = GPT2TokenizerFast.from_pretrained(pretrained_weights)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# To avoid the warning about pad_token (GPT2TokenizerFast), run the following code\n", "# source: https://github.com/huggingface/transformers/issues/2648#issuecomment-616177044\n", "tokenizer_en.pad_token = tokenizer_en.eos_token" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check the GPT2 tokenizer vocab." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "---------- vocab ----------\n", "\n", "vocab_files_names: {'vocab_file': 'vocab.json', 'merges_file': 'merges.txt'}\n", "\n", "vocab_file\n", "- gpt2 : https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json\n", "- gpt2-medium : https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-vocab.json\n", "- gpt2-large : https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-vocab.json\n", "- gpt2-xl : https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-xl-vocab.json\n", "- distilgpt2 : https://s3.amazonaws.com/models.huggingface.co/bert/distilgpt2-vocab.json\n", "\n", "merges_file\n", "- gpt2 : https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt\n", "- gpt2-medium : https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-merges.txt\n", "- gpt2-large : https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-merges.txt\n", "- gpt2-xl : https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-xl-merges.txt\n", "- distilgpt2 :
https://s3.amazonaws.com/models.huggingface.co/bert/distilgpt2-merges.txt\n", "\n", "vocab_size: 50257\n", "\n", "First 20 items of the vocab: {'ormon': 10615, 'Ġ13': 1511, '659': 36445, 'Ġbecome': 1716, 'Ġinvari': 25275, 'Ġsearch': 2989, 'Ġzo': 40565, 'YL': 45448, 'Ġreception': 16307, 'Ġseminars': 44160, 'anta': 4910, 'ĠAmos': 49104, 'ĠDefinition': 30396, 'Ġ3000': 20343, 'ĠDemocrat': 9755, 'ploy': 1420, 'ãĥİ': 25053, 'ĠECO': 39031, 'ï': 171, 'ĠKING': 32957}\n" ] } ], "source": [ "# source: https://huggingface.co/transformers/_modules/transformers/tokenization_utils_fast.html\n", "import itertools\n", "\n", "print('---------- vocab ----------')\n", "print()\n", "\n", "print('vocab_files_names:',tokenizer_en.vocab_files_names)\n", "print()\n", "\n", "for k,v in tokenizer_en.pretrained_vocab_files_map.items():\n", "    print(k)\n", "    for kk,vv in v.items():\n", "        print('- ',kk,':',vv)\n", "    print()\n", "\n", "print('vocab_size:',tokenizer_en.vocab_size)\n", "print()\n", "#print(tokenizer_en.get_vocab())\n", "\n", "# number of vocab items to display\n", "num = 20\n", "print(f'First {num} items of the vocab: {dict(itertools.islice(tokenizer_en.get_vocab().items(), num))}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Train a Byte-level BPE (BBPE) tokenizer on the Portuguese Wikipedia" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll use the [Tokenizers](https://github.com/huggingface/tokenizers) library from Hugging Face to train a **Byte-level BPE (BBPE) Tokenizer** on the Portuguese Wikipedia in order to get its two vocab files: `vocab.json`, which is a list of the most frequent tokens ranked by frequency, and `merges.txt`, which is a list of merges."
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tokenizers==0.8.0\r\n" ] } ], "source": [ "# !pip install tokenizers\n", "!pip freeze | grep tokenizers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Training" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "50257" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Get GPT2 tokenizer_en vocab size\n", "ByteLevelBPE_tokenizer_pt_vocab_size = tokenizer_en.vocab_size\n", "ByteLevelBPE_tokenizer_pt_vocab_size" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 25min 17s, sys: 5min 4s, total: 30min 21s\n", "Wall time: 2min 7s\n" ] } ], "source": [ "%%time\n", "# ByteLevelBPETokenizer Represents a Byte-level BPE as introduced by OpenAI with their GPT-2 model\n", "from tokenizers import ByteLevelBPETokenizer\n", "\n", "ByteLevelBPE_tokenizer_pt = ByteLevelBPETokenizer()\n", "\n", "# Get list of paths to corpus files\n", "paths = [str(path_data/'all_texts_ptwiki.txt')]\n", "\n", "# Customize training with <|endoftext|> special GPT2 token\n", "ByteLevelBPE_tokenizer_pt.train(files=paths, \n", " vocab_size=ByteLevelBPE_tokenizer_pt_vocab_size, \n", " min_frequency=2, \n", " special_tokens=[\"<|endoftext|>\"])\n", "\n", "# Get sequence length max of 1024\n", "ByteLevelBPE_tokenizer_pt.enable_truncation(max_length=1024)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['/mnt/home/pierre/.fastai/data/pt_wiki/ByteLevelBPE_tokenizer_pt/vocab.json',\n", " '/mnt/home/pierre/.fastai/data/pt_wiki/ByteLevelBPE_tokenizer_pt/merges.txt']" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# save tokenizer\n", 
"ByteLevelBPE_tokenizer_pt_rep = 'ByteLevelBPE_tokenizer_pt'\n", "path_to_ByteLevelBPE_tokenizer_pt_rep = path_data/ByteLevelBPE_tokenizer_pt_rep\n", "if not (path_to_ByteLevelBPE_tokenizer_pt_rep).exists():\n", " path_to_ByteLevelBPE_tokenizer_pt_rep.mkdir(exist_ok=True, parents=True)\n", "ByteLevelBPE_tokenizer_pt.save_model(str(path_to_ByteLevelBPE_tokenizer_pt_rep))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now have both a `vocab.json`, which is a list of the most frequent tokens ranked by frequency, and a `merges.txt` list of merges." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Check our tokenizer pre-trained in Portuguese" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(50257, ['<|endoftext|>', '!', '\"', '#', '$'])" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Get vocab as a list\n", "ByteLevelBPE_tokenizer_pt_vocab = ByteLevelBPE_tokenizer_pt.get_vocab() \n", "ByteLevelBPE_tokenizer_pt_vocab_ls = [k for k, v in sorted(ByteLevelBPE_tokenizer_pt_vocab.items(), key=lambda item: item[1])]\n", "len(ByteLevelBPE_tokenizer_pt_vocab_ls),ByteLevelBPE_tokenizer_pt_vocab_ls[:5]" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "([39, 1120, 298, 22335, 258, 11001, 14],\n", " ['G', 'osto', 'Ġdo', 'Ġqueijo', 'Ġe', 'Ġvinho', '.'],\n", " [(0, 1), (1, 5), (5, 8), (8, 15), (15, 17), (17, 23), (23, 24)])" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text = \"Gosto do queijo e vinho.\"\n", "output = ByteLevelBPE_tokenizer_pt.encode(text)\n", "output.ids,output.tokens,output.offsets" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "input text: Gosto do queijo e vinho.\n", "tokens ids: [39, 1120, 298, 22335, 258, 11001, 14]\n", 
"back to text: Gosto do queijo e vinho.\n" ] } ], "source": [ "back_to_text = ByteLevelBPE_tokenizer_pt.decode(ByteLevelBPE_tokenizer_pt.encode(text).ids)\n", "\n", "print('input text:', text)\n", "print('tokens ids:', output.ids)\n", "print('back to text:', back_to_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Import the tokenizer config files in Portuguese into the pre-trained GPT2 Tokenizer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to get exactly the same tokenizer model but with different vocabs (English and Portuguese), we import the Portuguese vocab files (`vocab.json` and `merges.txt`) into the `GPT2TokenizerFast` model." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# Get the path to ByteLevelBPE_tokenizer_pt config files\n", "ByteLevelBPE_tokenizer_pt_rep = 'ByteLevelBPE_tokenizer_pt'\n", "path_to_ByteLevelBPE_tokenizer_pt_rep = path_data/ByteLevelBPE_tokenizer_pt_rep\n", "\n", "# import the pre-trained GPT2TokenizerFast tokenizer with the tokenizer_pt config files\n", "tokenizer_pt = GPT2TokenizerFast.from_pretrained(\n", " str(path_to_ByteLevelBPE_tokenizer_pt_rep), \n", " pad_token='<|endoftext|>')\n", "\n", "# Get sequence length max of 1024\n", "tokenizer_pt.model_max_length = 1024" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "input text: Gosto do queijo e vinho.\n", "tokens ids: [39, 1120, 298, 22335, 258, 11001, 14]\n", "back to text: Gosto do queijo e vinho.\n" ] } ], "source": [ "# Check\n", "text = \"Gosto do queijo e vinho.\"\n", "tokens_ids = tokenizer_pt.encode(text)\n", "back_to_text = tokenizer_pt.decode(tokenizer_pt.encode(text))\n", "\n", "print('input text:', text)\n", "print('tokens ids:', tokens_ids)\n", "print('back to text:', back_to_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## English pre-trained tokenizer on a 
text in 3 languages (en, pt, fr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's use the English tokenizer to tokenize the same text in 3 different languages (English, Portuguese, French) in order to observe the number of tokens required in each case." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(en) Jacques-Germain Soufflot (Irancy, July 22, 1713 - Paris, August 29, 1780) was a French architect, initiator of the architectural style of Neoclassicism.\n", "\n", "(pt) Jacques-Germain Soufflot (Irancy, 22 de julho de 1713 — Paris, 29 de agosto de 1780) foi um arquitecto francês, iniciador do estilo arquitectónico do Neoclassicismo.\n", "\n", "(fr) Jacques-Germain Soufflot (Irancy, 22 juillet 1713 - Paris, 29 août 1780) était un architecte français, initiateur du style architectural du néoclassicisme.\n", "\n" ] } ], "source": [ "# text in 3 languages to be tokenized\n", "text_en = 'Jacques-Germain Soufflot (Irancy, July 22, 1713 - Paris, August 29, 1780) was a French architect, initiator of the architectural style of Neoclassicism.'\n", "text_pt = 'Jacques-Germain Soufflot (Irancy, 22 de julho de 1713 — Paris, 29 de agosto de 1780) foi um arquitecto francês, iniciador do estilo arquitectónico do Neoclassicismo.'\n", "text_fr = 'Jacques-Germain Soufflot (Irancy, 22 juillet 1713 - Paris, 29 août 1780) était un architecte français, initiateur du style architectural du néoclassicisme.'\n", "\n", "langs = ['en', 'pt', 'fr']\n", "texts = [text_en,text_pt,text_fr]\n", "\n", "for lang,text in zip(*[langs,texts]):\n", " print(f'({lang}) {TitledStr(text)}\\n')" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(en - 22 tokens) ['Jacques-Germain', 'Soufflot', '(Irancy,', 'July', '22,', '1713', '-', 'Paris,', 'August', '29,', '1780)', 'was', 'a', 'French', 'architect,', 'initiator', 'of', 'the',
'architectural', 'style', 'of', 'Neoclassicism.']\n", "\n", "(pt - 25 tokens) ['Jacques-Germain', 'Soufflot', '(Irancy,', '22', 'de', 'julho', 'de', '1713', '—', 'Paris,', '29', 'de', 'agosto', 'de', '1780)', 'foi', 'um', 'arquitecto', 'francês,', 'iniciador', 'do', 'estilo', 'arquitectónico', 'do', 'Neoclassicismo.']\n", "\n", "(fr - 21 tokens) ['Jacques-Germain', 'Soufflot', '(Irancy,', '22', 'juillet', '1713', '-', 'Paris,', '29', 'août', '1780)', 'était', 'un', 'architecte', 'français,', 'initiateur', 'du', 'style', 'architectural', 'du', 'néoclassicisme.']\n", "\n" ] } ], "source": [ "# number and list of words (i.e., whitespace-separated tokens)\n", "for lang,text in zip(*[langs,texts]):\n", " words = text.split()\n", " print(f'({lang} - {len(words)} tokens) {TitledStr(words)}\\n')" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(en - 44 tokens) ['Jac', 'ques', '-', 'G', 'erm', 'ain', 'ĠSou', 'ff', 'lot', 'Ġ(', 'Iran', 'cy', ',', 'ĠJuly', 'Ġ22', ',', 'Ġ17', '13', 'Ġ-', 'ĠParis', ',', 'ĠAugust', 'Ġ29', ',', 'Ġ17', '80', ')', 'Ġwas', 'Ġa', 'ĠFrench', 'Ġarchitect', ',', 'Ġiniti', 'ator', 'Ġof', 'Ġthe', 'Ġarchitectural', 'Ġstyle', 'Ġof', 'ĠNe', 'oc', 'lass', 'icism', '.']\n", "\n", "(pt - 62 tokens) ['Jac', 'ques', '-', 'G', 'erm', 'ain', 'ĠSou', 'ff', 'lot', 'Ġ(', 'Iran', 'cy', ',', 'Ġ22', 'Ġde', 'Ġj', 'ul', 'ho', 'Ġde', 'Ġ17', '13', 'ĠâĢĶ', 'ĠParis', ',', 'Ġ29', 'Ġde', 'Ġag', 'ost', 'o', 'Ġde', 'Ġ17', '80', ')', 'Ġfo', 'i', 'Ġum', 'Ġar', 'qu', 'itect', 'o', 'Ġfranc', 'ê', 's', ',', 'Ġin', 'ici', 'ador', 'Ġdo', 'Ġest', 'ilo', 'Ġar', 'qu', 'itect', 'ón', 'ico', 'Ġdo', 'ĠNe', 'oc', 'lass', 'icism', 'o', '.']\n", "\n", "(fr - 53 tokens) ['Jac', 'ques', '-', 'G', 'erm', 'ain', 'ĠSou', 'ff', 'lot', 'Ġ(', 'Iran', 'cy', ',', 'Ġ22', 'Ġju', 'illet', 'Ġ17', '13', 'Ġ-', 'ĠParis', ',', 'Ġ29', 'Ġa', 'o', 'û', 't', 'Ġ17', '80', ')', 'Ġ', 'ét', 'ait', 'Ġun', 'Ġarchitect', 'e', 'Ġfr',
'an', 'ç', 'ais', ',', 'Ġiniti', 'ateur', 'Ġdu', 'Ġstyle', 'Ġarchitectural', 'Ġdu', 'Ġn', 'é', 'oc', 'lass', 'icism', 'e', '.']\n", "\n" ] } ], "source": [ "# number and list of tokens\n", "# after tokenizing the text with the pre-trained BPE GPT2TokenizerFast (trained on an English corpus)\n", "for lang,text in zip(*[langs,texts]):\n", " toks = tokenizer_en.tokenize(text)\n", " print(f'({lang} - {len(toks)} tokens) {TitledStr(toks)}\\n')" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(en - 44 tokens) [28821, 13281, 12, 38, 7780, 391, 22862, 487, 26487, 357, 23798, 948, 11, 2901, 2534, 11, 1596, 1485, 532, 6342, 11, 2932, 2808, 11, 1596, 1795, 8, 373, 257, 4141, 7068, 11, 5383, 1352, 286, 262, 27070, 3918, 286, 3169, 420, 31172, 11965, 13]\n", "\n", "(pt - 62 tokens) [28821, 13281, 12, 38, 7780, 391, 22862, 487, 26487, 357, 23798, 948, 11, 2534, 390, 474, 377, 8873, 390, 1596, 1485, 851, 6342, 11, 2808, 390, 556, 455, 78, 390, 1596, 1795, 8, 11511, 72, 23781, 610, 421, 5712, 78, 46754, 25792, 82, 11, 287, 44070, 7079, 466, 1556, 18526, 610, 421, 5712, 18840, 3713, 466, 3169, 420, 31172, 11965, 78, 13]\n", "\n", "(fr - 53 tokens) [28821, 13281, 12, 38, 7780, 391, 22862, 487, 26487, 357, 23798, 948, 11, 2534, 7544, 32512, 1596, 1485, 532, 6342, 11, 2808, 257, 78, 42324, 83, 1596, 1795, 8, 220, 25125, 4548, 555, 7068, 68, 1216, 272, 16175, 15152, 11, 5383, 15093, 7043, 3918, 27070, 7043, 299, 2634, 420, 31172, 11965, 68, 13]\n", "\n" ] } ], "source": [ "# number and list of token ids\n", "# after tokenizing and numericalizing the text with the pre-trained BPE GPT2TokenizerFast (trained on an English corpus)\n", "for lang,text in zip(*[langs,texts]):\n", " toks_ids = tokenizer_en.encode(text)\n", " print(f'({lang} - {len(toks_ids)} tokens) {TitledStr(toks_ids)}\\n')" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type":
"stream", "text": [ "(en) Jacques-Germain Soufflot (Irancy, July 22, 1713 - Paris, August 29, 1780) was a French architect, initiator of the architectural style of Neoclassicism.\n", "\n", "(pt) Jacques-Germain Soufflot (Irancy, 22 de julho de 1713 — Paris, 29 de agosto de 1780) foi um arquitecto francês, iniciador do estilo arquitectónico do Neoclassicismo.\n", "\n", "(fr) Jacques-Germain Soufflot (Irancy, 22 juillet 1713 - Paris, 29 août 1780) était un architecte français, initiateur du style architectural du néoclassicisme.\n", "\n" ] } ], "source": [ "# decode (back to the text)\n", "for lang,text in zip(*[langs,texts]):\n", " toks_ids = tokenizer_en.encode(text)\n", " text_decoded = tokenizer_en.decode(toks_ids)\n", " print(f'({lang}) {TitledStr(text_decoded)}\\n')" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# graph \"Number of tokens by tokenization method and lang\"\n", "# source: https://matplotlib.org/3.2.1/gallery/lines_bars_and_markers/barchart.html#sphx-glr-gallery-lines-bars-and-markers-barchart-py\n", "text_split = list()\n", "toks_split = list()\n", "\n", "for text in texts:\n", " text_split.append(len(text.split()))\n", " toks_ids = tokenizer_en.encode(text)\n", " toks_split.append(len(toks_ids))\n", " \n", "labels = langs\n", "xy = list(np.array([1.,2.,3.]) - 0.2)\n", "xz = list(np.array([1.,2.,3.]) + 0.2)\n", "y = text_split\n", "z = toks_split\n", "\n", "ax = plt.subplot(111)\n", "ax.bar(xy, y, width=0.4, color='b', align='center')\n", "ax.bar(xz, z, width=0.4, color='g', align='center')\n", "\n", "ax.set_xlabel('languages')\n", "ax.set_xticks(range(1,len(labels)+1))\n", "ax.set_xticklabels(labels)\n", "ax.set_ylabel('number of tokens')\n", "ax.legend(['split(\" \")', 'GPT2TokenizerFast (en)'])\n", "\n", "ax.set_title('Number of tokens by tokenization method and lang')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, **even if a GPT2TokenizerFast trained on an English corpus can tokenize any text in any language, it was optimized for English**: it generates fewer tokens for an English text than for the same text in another language with a comparable number of words.\n", "\n", "In our example:\n", "- English: 44 tokens (22 words)\n", "- French: 53 tokens (21 words), about 20% more than English\n", "- Portuguese: 62 tokens (25 words), about 40% more than English" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## English vs Portuguese tokenizer on Portuguese Wikipedia" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section, we compare the number of tokens that the English and Portuguese tokenizers each generate per Portuguese text."
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " text\n", "0 Fotografia (do grego φως [\"fós\"] (\"luz\"), e γραφις [\"grafis\"] (\"estilo\", \"pincel\") ou γραφη \"grafê\", e significa \"desenhar com luz e contraste\"), por definição, é essencialmente a \"técnica de criação\" de imagens por meio de exposição luminosa, fixando-as em uma superfície sensível. A primeira fotografia reconhecida remonta ao ano de 1826 e é atribuída ao francês Joseph Nicéphore Niépce. Contudo, a invenção da fotografia não é obra de um só autor, mas um processo de acúmulo de avanços por parte de muitas pessoas, trabalhando, juntas ou em paralelo, ao longo de muitos anos. Se por um lado os...\n", "1 Espadanedo é uma antiga freguesia portuguesa do concelho de Macedo de Cavaleiros, com 17,90 km² de área e 188 habitantes (2011). A sua densidade populacional era 10,5 hab/km².\\nFoi extinta (agregada) pela reorganização administrativa de 2012/2013, sendo o seu território integrado na União de Freguesias de Espadanedo, Edroso, Murçós e Soutelo Mourisco.\\n\\nA antiga freguesia de S. Miguel de Espadanedo e Valongo pertenceu ao antigo concelho de Torre D. Chama, extinto a 24 de Outubro de 1855, em 1839 aparece agregada à comarca de Bragança e em 1852 à de Mirandela. Passa a pertencer definitivam...\n", "2 Jacques-Germain Soufflot (Irancy, 22 de julho de 1713 — Paris, 29 de agosto de 1780) foi um arquitecto francês, iniciador do estilo arquitectónico do Neoclassicismo. O seu trabalho mais conhecido é, sem dúvida, o Panthéon (Panteão) de Paris, construído a partir de 1755, inicialmente uma igreja dedicada a Santa Genoveva.\\n\\nSoufflot nasceu em Irancy, perto de Auxerre, na França. Com 18 anos, entrou para a Academia Francesa de Roma, onde os jovens estudantes da década de 1750 se tornariam na primeira geração de criativos pleanamente neoclássicos. Ficará na Itália de 1731 a 1738. Quando volto...\n", "3 A Faculdade de Medicina da Universidade de São Paulo (FMUSP) é uma escola médica da Universidade de São Paulo. 
Foi fundada em 1912 com o nome de \"Faculdade de Medicina e Cirurgia de São Paulo\" por Arnaldo Vieira de Carvalho (1867-1920), médico formado em 1888 pela Faculdade de Medicina do Rio de Janeiro. Em homenagem ao ilustre fundador, a Faculdade é, ainda hoje, chamada de a \"Casa de Arnaldo\" por seus alunos e ex-alunos. Em 1925 teve seu nome alterado para \"Faculdade de Medicina de São Paulo\" e em 1934, foi incorporada à recém-criada Universidade de São Paulo, passando a ter a atual desi...\n", "4 A Escola do Teatro Bolshoi no Brasil é uma tradicional escola de balé existente na cidade de Joinville, no estado de Santa Catarina. Fundada em 2000, é a única filial do Teatro Bolshoi de Moscou e possui alunos de vários estados brasileiros. Tem como missão formar artistas cidadãos, promover e difundir a arte-educação.\\n\\nA instituição foi fundada em 15 de março de 2000, é a única filial do Teatro Bolshoi. \\n\\nUm orgulho para o Brasil e para Joinville, cidade sede. A Escola do Teatro Bolshoi no Brasil, com professores russos e brasileiros, forma bailarinos com a mesma precisão, técnica e q..." 
] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lang = 'pt'\n", "fname = f'all_texts_{lang}wiki.csv'\n", "df = pd.read_csv(path_data/fname)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 11h 46min 52s, sys: 2h 52min 55s, total: 14h 39min 48s\n", "Wall time: 1h 32min 56s\n" ] } ], "source": [ "%%time\n", "df2 = df.copy()\n", "\n", "tokens_en_list = list()\n", "num_token_by_word_en_list = list()\n", "tokens_pt_list = list()\n", "num_token_by_word_pt_list = list()\n", "\n", "for index, row in df2.iterrows():\n", " text = row['text']\n", " \n", " tokens_en = tokenizer_en.encode(text)\n", " tokens_pt = tokenizer_pt.encode(text)\n", " \n", " tokens_en_list.append(tokens_en)\n", " tokens_pt_list.append(tokens_pt)\n", " \n", " length_text = len(text.split())\n", " tokens_by_word_en = len(tokens_en)/length_text\n", " tokens_by_word_pt = len(tokens_pt)/length_text\n", " \n", " num_token_by_word_en_list.append(tokens_by_word_en)\n", " num_token_by_word_pt_list.append(tokens_by_word_pt)\n", " \n", "df2['tokens_en'] = tokens_en_list\n", "df2['num_token_by_word_en'] = num_token_by_word_en_list\n", "df2['tokens_pt'] = tokens_pt_list\n", "df2['num_token_by_word_pt'] = num_token_by_word_pt_list" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
texttokens_ennum_token_by_word_entokens_ptnum_token_by_word_pt
0Fotografia (do grego φως [\"fós\"] (\"luz\"), e γραφις [\"grafis\"] (\"estilo\", \"pincel\") ou γραφη \"grafê\", e significa \"desenhar com luz e contraste\"), por definição, é essencialmente a \"técnica de criação\" de imagens por meio de exposição luminosa, fixando-as em uma superfície sensível. A primeira fotografia reconhecida remonta ao ano de 1826 e é atribuída ao francês Joseph Nicéphore Niépce. Contudo, a invenção da fotografia não é obra de um só autor, mas um processo de acúmulo de avanços por parte de muitas pessoas, trabalhando, juntas ou em paralelo, ao longo de muitos anos. Se por um lado os...[37, 313, 519, 32188, 544, 357, 4598, 10536, 2188, 18074, 228, 49535, 35558, 14631, 69, 10205, 82, 8973, 5855, 2290, 89, 12340, 304, 7377, 111, 33643, 17394, 139, 228, 29945, 35558, 14631, 70, 32188, 271, 8973, 5855, 395, 18526, 1600, 366, 79, 924, 75, 4943, 267, 84, 7377, 111, 33643, 17394, 139, 228, 138, 115, 366, 70, 32188, 25792, 1600, 304, 2216, 64, 366, 8906, 268, 9869, 401, 300, 10277, 304, 3445, 4594, 12340, 16964, 2730, 72, 16175, 28749, 11, 38251, 3209, 268, 2413, 434, 68, 257, 366, 83, 2634, 31522, 3970, 390, 269, 7496, 16175, 28749, 1, 390, 3590, ...]2.369519[9493, 22419, 363, 360, 5712, 46504, 30015, 16691, 33769, 70, 592, 32642, 1293, 21146, 2301, 258, 6970, 112, 17273, 12320, 46645, 17472, 16691, 33769, 4700, 274, 32642, 1293, 38371, 474, 330, 6701, 1996, 1766, 414, 6970, 112, 17273, 12320, 46645, 29825, 330, 4700, 369, 474, 258, 1952, 330, 1202, 1308, 290, 297, 3551, 258, 9592, 2301, 358, 5758, 12, 372, 9420, 259, 330, 38735, 260, 2271, 2, 260, 4318, 358, 1513, 260, 5214, 36162, 12, 34567, 13, 267, 300, 349, 4144, 17706, 14, 315, 806, 9659, 8892, 13110, 443, 750, 260, 22051, 258, 372, 11679, 443, 3081, 7358, 5227, 293, ...]1.295616
1Espadanedo é uma antiga freguesia portuguesa do concelho de Macedo de Cavaleiros, com 17,90 km² de área e 188 habitantes (2011). A sua densidade populacional era 10,5 hab/km².\\nFoi extinta (agregada) pela reorganização administrativa de 2012/2013, sendo o seu território integrado na União de Freguesias de Espadanedo, Edroso, Murçós e Soutelo Mourisco.\\n\\nA antiga freguesia de S. Miguel de Espadanedo e Valongo pertenceu ao antigo concelho de Torre D. Chama, extinto a 24 de Outubro de 1855, em 1839 aparece agregada à comarca de Bragança e em 1852 à de Mirandela. Passa a pertencer definitivam...[36, 2777, 324, 22739, 78, 38251, 334, 2611, 1885, 13827, 2030, 70, 947, 544, 2493, 1018, 947, 64, 466, 369, 5276, 8873, 390, 25942, 78, 390, 19931, 1000, 72, 4951, 11, 401, 1596, 11, 3829, 1849, 13276, 31185, 390, 6184, 94, 21468, 304, 27778, 7947, 39781, 357, 9804, 737, 317, 424, 64, 29509, 312, 671, 16595, 330, 1538, 6980, 838, 11, 20, 387, 65, 14, 13276, 31185, 13, 198, 37, 23013, 1070, 600, 64, 357, 363, 2301, 4763, 8, 279, 10304, 35459, 23638, 16175, 28749, 6863, 265, 12151, 390, 2321, 14, 6390, 11, 3758, 78, 267, 384, 84, 5771, 10205, ...]2.392857[1438, 15237, 1388, 360, 372, 349, 2661, 4766, 3764, 298, 5975, 260, 16023, 260, 19537, 12, 297, 1115, 12, 5699, 638, 2107, 4490, 260, 1430, 258, 2731, 2668, 363, 9381, 609, 315, 450, 7173, 9342, 621, 1194, 12, 21, 8770, 15, 2107, 4490, 14, 199, 2704, 10602, 363, 599, 45513, 9, 534, 19571, 8497, 260, 1762, 15, 11825, 12, 790, 275, 467, 2521, 11558, 347, 2743, 260, 40439, 260, 3158, 271, 1388, 360, 12, 2088, 6129, 12, 6834, 286, 592, 258, 2582, 3026, 6713, 1141, 423, 14, 199, 199, 33, 2661, 4766, 260, 327, 14, 4379, 260, 3158, 271, 1388, ...]1.486264
2Jacques-Germain Soufflot (Irancy, 22 de julho de 1713 — Paris, 29 de agosto de 1780) foi um arquitecto francês, iniciador do estilo arquitectónico do Neoclassicismo. O seu trabalho mais conhecido é, sem dúvida, o Panthéon (Panteão) de Paris, construído a partir de 1755, inicialmente uma igreja dedicada a Santa Genoveva.\\n\\nSoufflot nasceu em Irancy, perto de Auxerre, na França. Com 18 anos, entrou para a Academia Francesa de Roma, onde os jovens estudantes da década de 1750 se tornariam na primeira geração de criativos pleanamente neoclássicos. Ficará na Itália de 1731 a 1738. Quando volto...[28821, 13281, 12, 38, 7780, 391, 22862, 487, 26487, 357, 23798, 948, 11, 2534, 390, 474, 377, 8873, 390, 1596, 1485, 851, 6342, 11, 2808, 390, 556, 455, 78, 390, 1596, 1795, 8, 11511, 72, 23781, 610, 421, 5712, 78, 46754, 25792, 82, 11, 287, 44070, 7079, 466, 1556, 18526, 610, 421, 5712, 18840, 3713, 466, 3169, 420, 31172, 11965, 78, 13, 440, 384, 84, 491, 44349, 8873, 285, 15152, 369, 258, 66, 17305, 38251, 11, 5026, 288, 21356, 85, 3755, 11, 267, 5961, 400, 2634, 261, 357, 47, 12427, 28749, 8, 390, 6342, 11, 1500, 622, 8836, 4598, 257, ...]2.268409[33134, 13, 35969, 2582, 1738, 26035, 363, 41, 791, 89, 12, 2331, 260, 1637, 260, 45571, 2047, 2807, 12, 2820, 260, 1653, 260, 28277, 9, 379, 342, 17859, 3081, 12, 2333, 629, 298, 2266, 44381, 298, 1247, 5171, 41486, 14, 384, 467, 1313, 447, 1507, 372, 12, 944, 10436, 12, 275, 14982, 37648, 308, 363, 4042, 301, 280, 9, 260, 2807, 12, 4217, 259, 1129, 260, 26110, 12, 3470, 349, 2608, 8761, 259, 1925, 404, 19432, 373, 14, 199, 199, 15533, 1738, 26035, 2765, 300, 419, 791, 89, 12, 3067, 260, 21440, 269, 266, 12, 347, 2020, 14, 991, 664, ...]1.418052
3A Faculdade de Medicina da Universidade de São Paulo (FMUSP) é uma escola médica da Universidade de São Paulo. Foi fundada em 1912 com o nome de \"Faculdade de Medicina e Cirurgia de São Paulo\" por Arnaldo Vieira de Carvalho (1867-1920), médico formado em 1888 pela Faculdade de Medicina do Rio de Janeiro. Em homenagem ao ilustre fundador, a Faculdade é, ainda hoje, chamada de a \"Casa de Arnaldo\" por seus alunos e ex-alunos. Em 1925 teve seu nome alterado para \"Faculdade de Medicina de São Paulo\" e em 1934, foi incorporada à recém-criada Universidade de São Paulo, passando a ter a atual desi...[32, 13585, 32926, 671, 390, 5786, 1437, 12379, 26986, 312, 671, 390, 311, 28749, 34410, 357, 23264, 2937, 47, 8, 38251, 334, 2611, 3671, 5708, 285, 2634, 67, 3970, 12379, 26986, 312, 671, 390, 311, 28749, 34410, 13, 19434, 72, 1814, 4763, 795, 34463, 401, 267, 299, 462, 390, 366, 47522, 32926, 671, 390, 5786, 1437, 304, 21239, 3686, 544, 390, 311, 28749, 34410, 1, 16964, 16644, 41476, 47154, 8704, 390, 1879, 2100, 8873, 357, 1507, 3134, 12, 40454, 828, 285, 2634, 67, 3713, 1296, 4533, 795, 49584, 279, 10304, 13585, 32926, 671, 390, 5786, 1437, 466, 15338, 390, 42799, ...]2.244809[33, 4687, 260, 8138, 305, 1627, 260, 862, 1140, 363, 20640, 21090, 9, 372, 349, 2198, 9867, 305, 1627, 260, 862, 1140, 14, 1376, 4330, 300, 8928, 297, 275, 781, 260, 330, 42326, 260, 8138, 258, 33944, 260, 862, 1140, 2, 358, 22872, 9094, 260, 6171, 363, 1692, 8386, 13, 29477, 521, 4805, 4461, 300, 13789, 534, 4687, 260, 8138, 298, 1124, 260, 1588, 14, 575, 3904, 443, 26215, 5687, 12, 259, 4687, 372, 12, 872, 1913, 12, 2136, 260, 259, 330, 14238, 260, 22872, 2, 358, 660, 4433, 258, 417, 13, 43556, 14, 575, 9703, 1122, 467, 781, 12677, ...]1.253501
4A Escola do Teatro Bolshoi no Brasil é uma tradicional escola de balé existente na cidade de Joinville, no estado de Santa Catarina. Fundada em 2000, é a única filial do Teatro Bolshoi de Moscou e possui alunos de vários estados brasileiros. Tem como missão formar artistas cidadãos, promover e difundir a arte-educação.\\n\\nA instituição foi fundada em 15 de março de 2000, é a única filial do Teatro Bolshoi. \\n\\nUm orgulho para o Brasil e para Joinville, cidade sede. A Escola do Teatro Bolshoi no Brasil, com professores russos e brasileiros, forma bailarinos com a mesma precisão, técnica e q...[32, 16319, 5708, 466, 1665, 47756, 10797, 1477, 23013, 645, 39452, 346, 38251, 334, 2611, 2083, 291, 1538, 3671, 5708, 390, 3652, 2634, 2152, 21872, 12385, 269, 312, 671, 390, 15251, 4244, 11, 645, 1556, 4533, 390, 8909, 5181, 283, 1437, 13, 7557, 4763, 795, 4751, 11, 38251, 257, 6184, 118, 77, 3970, 1226, 498, 466, 1665, 47756, 10797, 1477, 23013, 390, 5826, 66, 280, 304, 1184, 9019, 435, 403, 418, 390, 410, 6557, 380, 418, 1556, 22484, 48029, 576, 72, 4951, 13, 5825, 401, 78, 2051, 28749, 1296, 283, 6802, 292, 269, 32482, 26102, 418, 11, 1552, 2502, 304, ...]2.442777[33, 3183, 298, 4948, 15934, 385, 73, 325, 889, 372, 349, 3898, 2198, 260, 26728, 8050, 347, 786, 260, 23696, 12, 325, 1282, 260, 1925, 5337, 14, 32791, 300, 2424, 12, 372, 259, 2542, 19606, 298, 4948, 15934, 385, 73, 260, 9001, 258, 1893, 4433, 260, 1292, 3606, 4779, 14, 2904, 390, 3829, 4964, 3098, 6267, 12, 5746, 258, 29331, 259, 2432, 13, 45422, 14, 199, 199, 33, 4493, 379, 4330, 300, 1079, 260, 1563, 260, 2424, 12, 372, 259, 2542, 19606, 298, 4948, 15934, 385, 73, 14, 722, 199, 2635, 18621, 341, 275, 889, 258, 341, 23696, 12, 786, ...]1.358349
\n", "
" ], "text/plain": [ " text \\\n", "0 Fotografia (do grego φως [\"fós\"] (\"luz\"), e γραφις [\"grafis\"] (\"estilo\", \"pincel\") ou γραφη \"grafê\", e significa \"desenhar com luz e contraste\"), por definição, é essencialmente a \"técnica de criação\" de imagens por meio de exposição luminosa, fixando-as em uma superfície sensível. A primeira fotografia reconhecida remonta ao ano de 1826 e é atribuída ao francês Joseph Nicéphore Niépce. Contudo, a invenção da fotografia não é obra de um só autor, mas um processo de acúmulo de avanços por parte de muitas pessoas, trabalhando, juntas ou em paralelo, ao longo de muitos anos. Se por um lado os... \n", "1 Espadanedo é uma antiga freguesia portuguesa do concelho de Macedo de Cavaleiros, com 17,90 km² de área e 188 habitantes (2011). A sua densidade populacional era 10,5 hab/km².\\nFoi extinta (agregada) pela reorganização administrativa de 2012/2013, sendo o seu território integrado na União de Freguesias de Espadanedo, Edroso, Murçós e Soutelo Mourisco.\\n\\nA antiga freguesia de S. Miguel de Espadanedo e Valongo pertenceu ao antigo concelho de Torre D. Chama, extinto a 24 de Outubro de 1855, em 1839 aparece agregada à comarca de Bragança e em 1852 à de Mirandela. Passa a pertencer definitivam... \n", "2 Jacques-Germain Soufflot (Irancy, 22 de julho de 1713 — Paris, 29 de agosto de 1780) foi um arquitecto francês, iniciador do estilo arquitectónico do Neoclassicismo. O seu trabalho mais conhecido é, sem dúvida, o Panthéon (Panteão) de Paris, construído a partir de 1755, inicialmente uma igreja dedicada a Santa Genoveva.\\n\\nSoufflot nasceu em Irancy, perto de Auxerre, na França. Com 18 anos, entrou para a Academia Francesa de Roma, onde os jovens estudantes da década de 1750 se tornariam na primeira geração de criativos pleanamente neoclássicos. Ficará na Itália de 1731 a 1738. Quando volto... \n", "3 A Faculdade de Medicina da Universidade de São Paulo (FMUSP) é uma escola médica da Universidade de São Paulo. 
Foi fundada em 1912 com o nome de \"Faculdade de Medicina e Cirurgia de São Paulo\" por Arnaldo Vieira de Carvalho (1867-1920), médico formado em 1888 pela Faculdade de Medicina do Rio de Janeiro. Em homenagem ao ilustre fundador, a Faculdade é, ainda hoje, chamada de a \"Casa de Arnaldo\" por seus alunos e ex-alunos. Em 1925 teve seu nome alterado para \"Faculdade de Medicina de São Paulo\" e em 1934, foi incorporada à recém-criada Universidade de São Paulo, passando a ter a atual desi... \n", "4 A Escola do Teatro Bolshoi no Brasil é uma tradicional escola de balé existente na cidade de Joinville, no estado de Santa Catarina. Fundada em 2000, é a única filial do Teatro Bolshoi de Moscou e possui alunos de vários estados brasileiros. Tem como missão formar artistas cidadãos, promover e difundir a arte-educação.\\n\\nA instituição foi fundada em 15 de março de 2000, é a única filial do Teatro Bolshoi. \\n\\nUm orgulho para o Brasil e para Joinville, cidade sede. A Escola do Teatro Bolshoi no Brasil, com professores russos e brasileiros, forma bailarinos com a mesma precisão, técnica e q... \n", "\n", " tokens_en \\\n", "0 [37, 313, 519, 32188, 544, 357, 4598, 10536, 2188, 18074, 228, 49535, 35558, 14631, 69, 10205, 82, 8973, 5855, 2290, 89, 12340, 304, 7377, 111, 33643, 17394, 139, 228, 29945, 35558, 14631, 70, 32188, 271, 8973, 5855, 395, 18526, 1600, 366, 79, 924, 75, 4943, 267, 84, 7377, 111, 33643, 17394, 139, 228, 138, 115, 366, 70, 32188, 25792, 1600, 304, 2216, 64, 366, 8906, 268, 9869, 401, 300, 10277, 304, 3445, 4594, 12340, 16964, 2730, 72, 16175, 28749, 11, 38251, 3209, 268, 2413, 434, 68, 257, 366, 83, 2634, 31522, 3970, 390, 269, 7496, 16175, 28749, 1, 390, 3590, ...] 
\n", "1 [36, 2777, 324, 22739, 78, 38251, 334, 2611, 1885, 13827, 2030, 70, 947, 544, 2493, 1018, 947, 64, 466, 369, 5276, 8873, 390, 25942, 78, 390, 19931, 1000, 72, 4951, 11, 401, 1596, 11, 3829, 1849, 13276, 31185, 390, 6184, 94, 21468, 304, 27778, 7947, 39781, 357, 9804, 737, 317, 424, 64, 29509, 312, 671, 16595, 330, 1538, 6980, 838, 11, 20, 387, 65, 14, 13276, 31185, 13, 198, 37, 23013, 1070, 600, 64, 357, 363, 2301, 4763, 8, 279, 10304, 35459, 23638, 16175, 28749, 6863, 265, 12151, 390, 2321, 14, 6390, 11, 3758, 78, 267, 384, 84, 5771, 10205, ...] \n", "2 [28821, 13281, 12, 38, 7780, 391, 22862, 487, 26487, 357, 23798, 948, 11, 2534, 390, 474, 377, 8873, 390, 1596, 1485, 851, 6342, 11, 2808, 390, 556, 455, 78, 390, 1596, 1795, 8, 11511, 72, 23781, 610, 421, 5712, 78, 46754, 25792, 82, 11, 287, 44070, 7079, 466, 1556, 18526, 610, 421, 5712, 18840, 3713, 466, 3169, 420, 31172, 11965, 78, 13, 440, 384, 84, 491, 44349, 8873, 285, 15152, 369, 258, 66, 17305, 38251, 11, 5026, 288, 21356, 85, 3755, 11, 267, 5961, 400, 2634, 261, 357, 47, 12427, 28749, 8, 390, 6342, 11, 1500, 622, 8836, 4598, 257, ...] \n", "3 [32, 13585, 32926, 671, 390, 5786, 1437, 12379, 26986, 312, 671, 390, 311, 28749, 34410, 357, 23264, 2937, 47, 8, 38251, 334, 2611, 3671, 5708, 285, 2634, 67, 3970, 12379, 26986, 312, 671, 390, 311, 28749, 34410, 13, 19434, 72, 1814, 4763, 795, 34463, 401, 267, 299, 462, 390, 366, 47522, 32926, 671, 390, 5786, 1437, 304, 21239, 3686, 544, 390, 311, 28749, 34410, 1, 16964, 16644, 41476, 47154, 8704, 390, 1879, 2100, 8873, 357, 1507, 3134, 12, 40454, 828, 285, 2634, 67, 3713, 1296, 4533, 795, 49584, 279, 10304, 13585, 32926, 671, 390, 5786, 1437, 466, 15338, 390, 42799, ...] 
\n", "4 [32, 16319, 5708, 466, 1665, 47756, 10797, 1477, 23013, 645, 39452, 346, 38251, 334, 2611, 2083, 291, 1538, 3671, 5708, 390, 3652, 2634, 2152, 21872, 12385, 269, 312, 671, 390, 15251, 4244, 11, 645, 1556, 4533, 390, 8909, 5181, 283, 1437, 13, 7557, 4763, 795, 4751, 11, 38251, 257, 6184, 118, 77, 3970, 1226, 498, 466, 1665, 47756, 10797, 1477, 23013, 390, 5826, 66, 280, 304, 1184, 9019, 435, 403, 418, 390, 410, 6557, 380, 418, 1556, 22484, 48029, 576, 72, 4951, 13, 5825, 401, 78, 2051, 28749, 1296, 283, 6802, 292, 269, 32482, 26102, 418, 11, 1552, 2502, 304, ...] \n", "\n", " num_token_by_word_en \\\n", "0 2.369519 \n", "1 2.392857 \n", "2 2.268409 \n", "3 2.244809 \n", "4 2.442777 \n", "\n", " tokens_pt \\\n", "0 [9493, 22419, 363, 360, 5712, 46504, 30015, 16691, 33769, 70, 592, 32642, 1293, 21146, 2301, 258, 6970, 112, 17273, 12320, 46645, 17472, 16691, 33769, 4700, 274, 32642, 1293, 38371, 474, 330, 6701, 1996, 1766, 414, 6970, 112, 17273, 12320, 46645, 29825, 330, 4700, 369, 474, 258, 1952, 330, 1202, 1308, 290, 297, 3551, 258, 9592, 2301, 358, 5758, 12, 372, 9420, 259, 330, 38735, 260, 2271, 2, 260, 4318, 358, 1513, 260, 5214, 36162, 12, 34567, 13, 267, 300, 349, 4144, 17706, 14, 315, 806, 9659, 8892, 13110, 443, 750, 260, 22051, 258, 372, 11679, 443, 3081, 7358, 5227, 293, ...] \n", "1 [1438, 15237, 1388, 360, 372, 349, 2661, 4766, 3764, 298, 5975, 260, 16023, 260, 19537, 12, 297, 1115, 12, 5699, 638, 2107, 4490, 260, 1430, 258, 2731, 2668, 363, 9381, 609, 315, 450, 7173, 9342, 621, 1194, 12, 21, 8770, 15, 2107, 4490, 14, 199, 2704, 10602, 363, 599, 45513, 9, 534, 19571, 8497, 260, 1762, 15, 11825, 12, 790, 275, 467, 2521, 11558, 347, 2743, 260, 40439, 260, 3158, 271, 1388, 360, 12, 2088, 6129, 12, 6834, 286, 592, 258, 2582, 3026, 6713, 1141, 423, 14, 199, 199, 33, 2661, 4766, 260, 327, 14, 4379, 260, 3158, 271, 1388, ...] 
\n", "2 [33134, 13, 35969, 2582, 1738, 26035, 363, 41, 791, 89, 12, 2331, 260, 1637, 260, 45571, 2047, 2807, 12, 2820, 260, 1653, 260, 28277, 9, 379, 342, 17859, 3081, 12, 2333, 629, 298, 2266, 44381, 298, 1247, 5171, 41486, 14, 384, 467, 1313, 447, 1507, 372, 12, 944, 10436, 12, 275, 14982, 37648, 308, 363, 4042, 301, 280, 9, 260, 2807, 12, 4217, 259, 1129, 260, 26110, 12, 3470, 349, 2608, 8761, 259, 1925, 404, 19432, 373, 14, 199, 199, 15533, 1738, 26035, 2765, 300, 419, 791, 89, 12, 3067, 260, 21440, 269, 266, 12, 347, 2020, 14, 991, 664, ...] \n", "3 [33, 4687, 260, 8138, 305, 1627, 260, 862, 1140, 363, 20640, 21090, 9, 372, 349, 2198, 9867, 305, 1627, 260, 862, 1140, 14, 1376, 4330, 300, 8928, 297, 275, 781, 260, 330, 42326, 260, 8138, 258, 33944, 260, 862, 1140, 2, 358, 22872, 9094, 260, 6171, 363, 1692, 8386, 13, 29477, 521, 4805, 4461, 300, 13789, 534, 4687, 260, 8138, 298, 1124, 260, 1588, 14, 575, 3904, 443, 26215, 5687, 12, 259, 4687, 372, 12, 872, 1913, 12, 2136, 260, 259, 330, 14238, 260, 22872, 2, 358, 660, 4433, 258, 417, 13, 43556, 14, 575, 9703, 1122, 467, 781, 12677, ...] \n", "4 [33, 3183, 298, 4948, 15934, 385, 73, 325, 889, 372, 349, 3898, 2198, 260, 26728, 8050, 347, 786, 260, 23696, 12, 325, 1282, 260, 1925, 5337, 14, 32791, 300, 2424, 12, 372, 259, 2542, 19606, 298, 4948, 15934, 385, 73, 260, 9001, 258, 1893, 4433, 260, 1292, 3606, 4779, 14, 2904, 390, 3829, 4964, 3098, 6267, 12, 5746, 258, 29331, 259, 2432, 13, 45422, 14, 199, 199, 33, 4493, 379, 4330, 300, 1079, 260, 1563, 260, 2424, 12, 372, 259, 2542, 19606, 298, 4948, 15934, 385, 73, 14, 722, 199, 2635, 18621, 341, 275, 889, 258, 341, 23696, 12, 786, ...] 
\n", "\n", " num_token_by_word_pt \n", "0 1.295616 \n", "1 1.486264 \n", "2 1.418052 \n", "3 1.253501 \n", "4 1.358349 " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.head()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "# save\n", "fname = \"comparaison_tokenizers_en_pt_with_corpus_pt.csv\"\n", "df2.to_csv(path_data/fname, index=False)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "# load\n", "fname = \"comparaison_tokenizers_en_pt_with_corpus_pt.csv\"\n", "df2 = pd.read_csv(path_data/fname)" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(en) 1.31\n", "(pt) 1.12\n" ] } ], "source": [ "# check min\n", "num_token_by_word_en_min = df2.num_token_by_word_en.min()\n", "num_token_by_word_pt_min = df2.num_token_by_word_pt.min()\n", "print('(en)',round(num_token_by_word_en_min,2))\n", "print('(pt)',round(num_token_by_word_pt_min,2))" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(en) 75.29\n", "(pt) 75.65\n" ] } ], "source": [ "# check max\n", "num_token_by_word_en_max = df2.num_token_by_word_en.max()\n", "num_token_by_word_pt_max = df2.num_token_by_word_pt.max()\n", "print('(en)',round(num_token_by_word_en_max,2))\n", "print('(pt)',round(num_token_by_word_pt_max,2))" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(en) 2.25\n", "(pt) 1.36\n" ] } ], "source": [ "# check mean\n", "num_token_by_word_en_mean = df2.num_token_by_word_en.mean()\n", "num_token_by_word_pt_mean = df2.num_token_by_word_pt.mean()\n", "print('(en)',round(num_token_by_word_en_mean,2))\n", "print('(pt)',round(num_token_by_word_pt_mean,2))" ] }, { "cell_type": "code", "execution_count": 38, 
"metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Rate of increase: 65.87 %\n", "Multiplier coefficient: 1.66\n", "Multiplier coefficient in %: 165.87 %\n" ] } ], "source": [ "# check increase rate and Multiplier coefficient\n", "increase = 0.\n", "multiplier = 0.\n", "\n", "for tok_en,tok_pt in zip(*(tokens_en_list,tokens_pt_list)):\n", " increase += (len(tok_en)-len(tok_pt))/len(tok_pt)\n", " multiplier += len(tok_en)/len(tok_pt)\n", " \n", "# Rate of increase in % from pt to en\n", "increase_pct = increase / len(tokens_en_list)\n", "print('Rate of increase:',round(increase_pct*100,2),'%')\n", "\n", "# Multiplier coefficient = (Rate of increase in %, converted to number) + 1\n", "multiplier_coef = round(increase_pct+1,2)\n", "print('Multiplier coefficient:',multiplier_coef)\n", "\n", "# Multiplier coefficient in % = Multiplier coefficient, converted to %\n", "multiplier_pct = round((multiplier/len(tokens_en_list))*100,2)\n", "print('Multiplier coefficient in %:',multiplier_pct,'%')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Graph" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1min 51s, sys: 1.63 s, total: 1min 53s\n", "Wall time: 1min 53s\n" ] } ], "source": [ "%%time\n", "len_tokens_text_list = list()\n", "for index, row in df2.iterrows():\n", " text = row['text']\n", " length_text = len(text.split())\n", " len_tokens_text_list.append(length_text)\n", "\n", "tokens_en_list = df2.tokens_en.tolist()\n", "len_tokens_en_list = [len(t) for t in tokens_en_list]\n", "\n", "tokens_pt_list = df2.tokens_pt.tolist()\n", "len_tokens_pt_list = [len(t) for t in tokens_pt_list]" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/mnt/home/pierre/.conda/envs/fastai2/lib/python3.7/site-packages/IPython/core/pylabtools.py:132: 
UserWarning: Creating legend with loc=\"best\" can be slow with large amounts of data.\n", " fig.canvas.print_figure(bytes_io, **kw)\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "sorted_len_tokens_text_list = sorted(len_tokens_text_list)\n", "y_len_tokens_en_list = (12*np.array(sorted_len_tokens_text_list)).tolist() \n", "y_len_tokens_pt_list = (7*np.array(sorted_len_tokens_text_list)).tolist() \n", "\n", "ax = plt.subplot(111)\n", "ax.scatter(len_tokens_text_list, len_tokens_en_list)\n", "ax.plot(sorted_len_tokens_text_list, y_len_tokens_en_list)\n", "ax.scatter(len_tokens_text_list, len_tokens_pt_list)\n", "ax.plot(sorted_len_tokens_text_list, y_len_tokens_pt_list)\n", "\n", "ax.set_xlabel('length of texts')\n", "ax.set_ylabel('length of en and pt tokens')\n", "ax.legend(['en', 'pt'])\n", "\n", "ax.set_title('Number of tokens by tokenization method')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, the use of a BBPE tokenizer trained with an English corpus on a corpus of another language (here, Portuguese) **requires on average almost 70% of additional tokens** (66% exactly) than a BBPE tokenizer trained with the same language of the corpus of application." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## END" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 2 }