# ELECTRA

## Introduction

**ELECTRA** is a method for self-supervised language representation learning. It can be used to pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens from "fake" input tokens generated by another neural network, similar to the discriminator of a [GAN](https://arxiv.org/pdf/1406.2661.pdf). AraELECTRA achieves state-of-the-art results on Arabic question-answering datasets. For a detailed description, please refer to the AraELECTRA paper [AraELECTRA: Pre-Training Text Discriminators for Arabic Language Understanding](https://arxiv.org/abs/2012.15516).

This repository contains code to pre-train ELECTRA. It also supports fine-tuning ELECTRA on downstream tasks, including classification tasks (e.g., [GLUE](https://gluebenchmark.com/)), QA tasks (e.g., [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/)), and sequence tagging tasks (e.g., [text chunking](https://www.clips.uantwerpen.be/conll2000/chunking/)).

## Released Models

We are releasing two pre-trained models:

| Model | Layers | Hidden Size | Attention Heads | Params | HuggingFace Model Name |
| --- | --- | --- | --- | --- | --- |
| AraELECTRA-base-discriminator | 12 | 768 | 12 | 136M | [araelectra-base-discriminator](https://huggingface.co/aubmindlab/araelectra-base-discriminator) |
| AraELECTRA-base-generator | 12 | 256 | 4 | 60M | [araelectra-base-generator](https://huggingface.co/aubmindlab/araelectra-base-generator) |

## Results

| Model | TyDiQA (EM - F1) | ARCD (EM - F1) |
|:----|:----:|:----:|
| AraBERTv0.1 | 68.51 - 82.86 | 31.62 - 67.45 |
| AraBERTv1 | 61.11 - 79.36 | 31.7 - 67.8 |
| AraBERTv0.2-base | 73.07 - 85.41 | 32.76 - 66.53 |
| AraBERTv2-base | 61.67 - 81.66 | 31.34 - 67.23 |
| AraBERTv0.2-large | 73.72 - 86.03 | 36.89 - **71.32** |
| AraBERTv2-large | 64.49 - 82.51 | 34.19 - 68.12 |
| ArabicBERT-base | 67.42 - 81.24 | 30.48 - 62.24 |
| ArabicBERT-large | 70.03 - 84.12 | 33.33 - 67.27 |
| Arabic-ALBERT-base | 67.10 - 80.98 | 30.91 - 61.33 |
| Arabic-ALBERT-large | 68.07 - 81.59 | 34.19 - 65.41 |
| Arabic-ALBERT-xlarge | 71.12 - 84.59 | **37.75** - 68.03 |
| AraELECTRA | **74.91 - 86.68** | 37.03 - 71.22 |

## Requirements

* Python 3
* [TensorFlow](https://www.tensorflow.org/) 1.15 (although we hope to support TensorFlow 2.0 at a future date)
* [NumPy](https://numpy.org/)
* [scikit-learn](https://scikit-learn.org/stable/) and [SciPy](https://www.scipy.org/) (for computing some evaluation metrics)

## Pre-training

Use `build_pretraining_dataset.py` or `build_arabert_pretraining_data.py` to create a pre-training dataset from a dump of raw text. It has the following arguments (an example invocation is shown after the list):

* `--corpus-dir`: A directory containing raw text files to turn into ELECTRA examples. A text file can contain multiple documents with empty lines separating them.
* `--vocab-file`: File defining the wordpiece vocabulary.
* `--output-dir`: Where to write out ELECTRA examples.
* `--max-seq-length`: The number of tokens per example (128 by default).
* `--num-processes`: If >1, parallelize across multiple processes (1 by default).
* `--blanks-separate-docs`: Whether blank lines indicate document boundaries (True by default).
* `--do-lower-case/--no-lower-case`: Whether to lower-case the input text (True by default).
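For example, assuming the raw corpus lives under `$DATA_DIR/raw_text` and the wordpiece vocabulary at `$DATA_DIR/vocab.txt` (both paths, as well as the sequence length and process count, are illustrative), the examples could be built with:

```
# Paths and values below are examples; adjust them to your corpus layout.
python3 build_pretraining_dataset.py \
  --corpus-dir $DATA_DIR/raw_text \
  --vocab-file $DATA_DIR/vocab.txt \
  --output-dir $DATA_DIR/pretrain_tfrecords \
  --max-seq-length 512 \
  --num-processes 8
```

Writing the output to `$DATA_DIR/pretrain_tfrecords` matches the default location that `run_pretraining.py` reads from.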
Use `run_pretraining.py` to pre-train an ELECTRA model. It has the following arguments:

* `--data-dir`: a directory where pre-training data, model weights, etc. are stored. By default, training loads examples from `data-dir/pretrain_tfrecords` and a vocabulary from `data-dir/vocab.txt`.
* `--model-name`: a name for the model being trained. Model weights are saved in `data-dir/models/model-name` by default.
* `--hparams` (optional): a JSON dict or path to a JSON file containing model hyperparameters, data paths, etc. See `configure_pretraining.py` for the supported hyperparameters.

If training is halted, re-running `run_pretraining.py` with the same arguments will continue the training where it left off.

You can continue pre-training from the released ELECTRA checkpoints by:

1. Setting the model name to point to a downloaded model (e.g., `--model-name electra_small` if you downloaded weights to `$DATA_DIR/electra_small`).
2. Setting `num_train_steps` by (for example) adding `"num_train_steps": 4010000` to the `--hparams`. This will continue training the small model for 10000 more steps (it has already been trained for 4e6 steps).
3. Increasing the learning rate to account for the linear learning-rate decay. For example, to start with a learning rate of 2e-4 you should set the `learning_rate` hparam to 2e-4 * (4e6 + 10000) / 10000.
4. For ELECTRA-Small, you also need to specify `"generator_hidden_size": 1.0` in the `--hparams`, because we did not use a small generator for that model.

#### Evaluating the pre-trained model

To evaluate the model on a downstream task, see the fine-tuning instructions below. To evaluate the generator/discriminator on the OpenWebText data, run `python3 run_pretraining.py --data-dir $DATA_DIR --model-name electra_small_owt --hparams '{"do_train": false, "do_eval": true}'`. This will print out eval metrics such as the accuracy of the generator and discriminator, and also write the metrics to `data-dir/model-name/results`.
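Putting the pieces together, a base-size pre-training run over the TFRecords prepared above could look like the following sketch; the model name and hparam values are illustrative, and the full list of supported hparams is in `configure_pretraining.py`:

```
# Model name and hparam values are examples only.
python3 run_pretraining.py \
  --data-dir $DATA_DIR \
  --model-name araelectra_base \
  --hparams '{"model_size": "base", "max_seq_length": 512}'
```

Checkpoints then land under `$DATA_DIR/models/araelectra_base`, so an interrupted run can be resumed by re-running the same command.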
## Fine-tuning

Use `run_finetuning.py` to fine-tune and evaluate an ELECTRA model on a downstream NLP task. It expects three arguments:

* `--data-dir`: a directory where data, model weights, etc. are stored. By default, the script loads fine-tuning data from `data-dir/finetuning_data` and a vocabulary from `data-dir/vocab.txt`.
* `--model-name`: the name of the pre-trained model; the pre-trained weights should exist in `data-dir/models/model-name`.
* `--hparams`: a JSON dict containing model hyperparameters, data paths, etc. (e.g., `--hparams '{"task_names": ["rte"], "model_size": "base", "learning_rate": 1e-4, ...}'`). See `configure_finetuning.py` for the supported hyperparameters. Instead of a dict, this can also be a path to a `.json` file containing the hyperparameters. You must specify `"task_names"` and `"model_size"` (see the examples below).

Eval metrics will be saved in `data-dir/model-name/results` and model weights will be saved in `data-dir/model-name/finetuning_models` by default. Evaluation is done on the dev set by default.

To customize the training, add `--hparams '{"hparam1": value1, "hparam2": value2, ...}'` to the run command. Some particularly useful options:

* `"debug": true`: fine-tunes a tiny ELECTRA model for a few steps.
* `"task_names": ["task_name"]`: specifies the tasks to train on. This is a list because the codebase nominally supports multi-task learning (although be warned this has not been thoroughly tested).
* `"model_size"`: one of `"small"`, `"base"`, or `"large"`; determines the size of the model. You must set this to the same size as the pre-trained model.
* `"do_train"` and `"do_eval"`: train and/or evaluate a model (both are set to true by default). To use `"do_eval": true` with `"do_train": false`, you need to specify `init_checkpoint`, e.g., `python3 run_finetuning.py --data-dir $DATA_DIR --model-name electra_base --hparams '{"model_size": "base", "task_names": ["mnli"], "do_train": false, "do_eval": true, "init_checkpoint": "data-dir/models/electra_base/finetuning_models/mnli_model_1"}'`.
* `"num_trials": n`: if >1, does multiple fine-tuning/evaluation runs with different random seeds.
* `"learning_rate": lr`, `"train_batch_size": n`, etc. can be used to change training hyperparameters.
* `"model_hparam_overrides": {"hidden_size": n, "num_hidden_layers": m}`, etc. can be used to change the hyperparameters of the underlying transformer (the `"model_size"` flag sets the default values).

### Setup

Get a pre-trained ELECTRA model either by training your own (see the pre-training instructions above) or by downloading the released ELECTRA weights and unzipping them under `$DATA_DIR/models` (e.g., you should have a directory `$DATA_DIR/models/electra_large` if you are using the large model).

### Finetune ELECTRA on question answering

The code supports [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) 1.1 and 2.0, as well as datasets in [the 2019 MRQA shared task](https://github.com/mrqa/MRQA-Shared-Task-2019).

* **ARCD**: Download the train/dev datasets from `https://github.com/husseinmozannar/SOQAL` and move them to `$DATA_DIR/finetuning_data/squadv1/(train|dev).json`.

Then run (for example)

```
python3 run_finetuning.py --data-dir $DATA_DIR --model-name electra_base --hparams '{"model_size": "base", "task_names": ["squad"]}'
```

This repository uses the official evaluation code released by the [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) authors. Alternatively, you can use the `transformers` library as shown in the `ARCD_pytorch.ipynb` or `Tydiqa_ar_pytorch.ipynb` notebooks in the examples folder.

### Finetune ELECTRA on sequence tagging

Download the CoNLL-2000 text chunking dataset from [here](https://www.clips.uantwerpen.be/conll2000/chunking/) and put it under `$DATA_DIR/finetuning_data/chunk/(train|dev).txt`. Then run

```
python3 run_finetuning.py --data-dir $DATA_DIR --model-name electra_base --hparams '{"model_size": "base", "task_names": ["chunk"]}'
```

### Adding a new task

The easiest way to run on a new task is to implement a new `finetune.task.Task`, add it to `finetune.task_builder.py`, and then use `run_finetuning.py` as normal. For classification/QA/sequence tagging, you can inherit from `finetune.classification.classification_tasks.ClassificationTask`, `finetune.qa.qa_tasks.QATask`, or `finetune.tagging.tagging_tasks.TaggingTask`. For preprocessing data, we use the same tokenizer as [BERT](https://github.com/google-research/bert).

## Citation

If you use this model, please cite us as:

```
@inproceedings{antoun-etal-2021-araelectra,
    title = "{A}ra{ELECTRA}: Pre-Training Text Discriminators for {A}rabic Language Understanding",
    author = "Antoun, Wissam and Baly, Fady and Hajj, Hazem",
    booktitle = "Proceedings of the Sixth Arabic Natural Language Processing Workshop",
    month = apr,
    year = "2021",
    address = "Kyiv, Ukraine (Virtual)",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.wanlp-1.20",
    pages = "191--195",
}
```