# Retro and InstructRetro

Retro [(Borgeaud et al., 2022)](https://arxiv.org/abs/2112.04426) is an autoregressive decoder-only language model (LM) pretrained with retrieval augmentation. Retro scales to large-scale pretraining from scratch by retrieving from trillions of tokens. Pretraining with retrieval stores factual knowledge more efficiently than encoding it implicitly in the network's parameters, substantially reducing the parameter count while achieving lower perplexity than standard GPT. Retro also provides the flexibility to update the knowledge stored in LMs [(Wang et al., 2023a)](https://arxiv.org/abs/2304.06762) by updating the retrieval database, without retraining the LM.

InstructRetro [(Wang et al., 2023b)](https://arxiv.org/abs/2310.07713) further scales up Retro to 48B, making it the largest LLM pretrained with retrieval (as of December 2023). The resulting foundation model, Retro 48B, largely outperforms its GPT counterpart in terms of perplexity. With instruction tuning on Retro, InstructRetro demonstrates significant improvement over instruction-tuned GPT on downstream tasks in the zero-shot setting. Specifically, the average improvement of InstructRetro over its GPT counterpart is 7% across 8 short-form QA tasks, 10% across 4 challenging long-form QA tasks, and 16% across 3 summarization tasks. We also find that the encoder can be ablated from the InstructRetro architecture, and the InstructRetro decoder backbone can be used directly as GPT while achieving comparable results.

This README provides an end-to-end tutorial to reproduce Retro and InstructRetro.

# Contents

* [Checkpoints](#checkpoints)
* [End-to-end Reproduction Guide](#end-to-end-reproduction-guide)
  * [Step 0: Prepare the environment](#step-0-prepare-the-environment)
    * [Docker image](#docker-image)
    * [Install dependencies](#install-dependencies)
  * [Step 1: Build retrieval database](#step-1-build-retrieval-database)
  * [Step 2: Pretraining](#step-2-pretraining)
  * [Step 3: Perplexity evaluation](#step-3-perplexity-evaluation)
  * [Step 4: Instruction tuning](#step-4-instruction-tuning)
  * [Step 5: Downstream task evaluation](#step-5-downstream-task-evaluation)
* [Citations](#citations)

# Checkpoints

We provide the pretrained checkpoints of Retro and InstructRetro in the following table.
The checkpoints are available to download through the following links:

| Model                   | Size | Instruction Tuning | Download Link 1                                                    | Download Link 2                                                                | Download Link 3                                                                                      |
|-------------------------|------|--------------------|--------------------------------------------------------------------|--------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|
| `retro-8b-base-4k`      | 8b   |                    | [Huggingface](https://huggingface.co/nvidia/retro-8b-base-4k)      | [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/models/retro-8b-base-4k)      | [Google Drive](https://drive.google.com/drive/folders/1uSQ5DAsuvx_8XcbtnVfs_MGvEOcx0uK_?usp=sharing)  |
| `retro-8b-instruct-4k`  | 8b   | ✅                 | [Huggingface](https://huggingface.co/nvidia/retro-8b-instruct-4k)  | [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/models/retro-8b-instruct-4k)  | [Google Drive](https://drive.google.com/drive/folders/1v5dKaSN0cm2lwyAWpFaJtlTrLhtMZXsI?usp=sharing)  |
| `retro-48b-base-4k`     | 48b  |                    | [Huggingface](https://huggingface.co/nvidia/retro-48b-base-4k)     | [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/models/retro-48b-base-4k)     | [Google Drive](https://drive.google.com/drive/folders/1rtNpf0CiLElSHQcr3aLI3zgfI3teGTP5?usp=sharing)  |
| `retro-48b-instruct-4k` | 48b  | ✅                 | [Huggingface](https://huggingface.co/nvidia/retro-48b-instruct-4k) | [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/models/retro-48b-instruct-4k) | [Google Drive](https://drive.google.com/drive/folders/1qdb0AQjSsAPGlWaIu3wgHPjf_nwLeY5h?usp=sharing)  |

# End-to-end Reproduction Guide

In this README, we provide an end-to-end reproduction guide for InstructRetro, covering large-scale retrieval construction, pretraining, perplexity evaluation, instruction tuning, and downstream task evaluation.

If you are interested in evaluation only, we also [open-sourced our checkpoints](#checkpoints), and you can go directly to [Step 5](#step-5-downstream-task-evaluation) to evaluate the checkpoints on downstream tasks.

## Step 0: Prepare the environment

We recommend using a Docker environment to run the code.

### Docker image

We provide a Docker build file in [tools/retro/examples/Dockerfile](examples/Dockerfile) for the reproduction. The Docker image is based on the [NGC docker](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/tags) `nvcr.io/nvidia/pytorch:23.09-py3`.

### Install dependencies

Clone the Megatron repo:

```bash
git clone --branch InstructRetro https://github.com/NVIDIA/Megatron-LM.git
```

If Docker is not available, we recommend starting from a clean conda environment with the following runtime dependencies:

- Python 3.10
- NVIDIA CUDA® 12.2.1
- NVIDIA cuBLAS 12.2.5.6
- NVIDIA cuDNN 8.9.5
- NVIDIA NCCL 2.18.5
- PyTorch 2.1.0a0+32f93b1

Then install the Retro-specific dependencies:

```bash
pip install -U faiss-gpu
pip install -U transformers
pip install -U sentencepiece
pip install -U h5py
pip install -U nltk
pip install -U einops
```

## Step 1: Build retrieval database

In this step, we build a large-scale retrieval database for InstructRetro through [Faiss](https://github.com/facebookresearch/faiss) to retrieve from trillions of tokens, and preprocess (and save) the retrieval neighbors for the pretraining step.

Please refer to [tools/retro/build_db.md](build_db.md) for more details.
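For intuition only, the snippet below sketches the core Faiss operations behind a retrieval database: index a set of chunk embeddings and look up nearest neighbors for a query. This is *not* the actual `build_db` pipeline; the dimensions, data, and variable names are made up for illustration.

```python
import numpy as np
import faiss  # installed above via `pip install -U faiss-gpu`

# Illustrative stand-ins for real chunk embeddings (random vectors).
dim = 768                                                  # assumed embedding size
chunk_embeddings = np.random.rand(100_000, dim).astype("float32")

index = faiss.IndexFlatIP(dim)      # exact inner-product index (toy scale only)
index.add(chunk_embeddings)         # add every chunk embedding to the index

query = np.random.rand(1, dim).astype("float32")
scores, neighbor_ids = index.search(query, 2)   # top-2 neighbor chunks per query
print(neighbor_ids)
```

The actual pipeline in [tools/retro/build_db.md](build_db.md) operates at a far larger scale and precomputes the neighbors offline for the whole pretraining corpus, rather than querying an exact flat index like this one at training time.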
## Step 2: Pretraining

*Please strictly follow Step 1 to build the retrieval database before pretraining, to make sure the preprocessed retrieval neighbors match the pretraining corpus.*

In the pretraining step, we support both pretraining from scratch and continued pretraining from a pretrained GPT model.

We provide a template pretraining script to pretrain 843M Retro from scratch. Prepare your own arguments and update our templates in [tools/retro/examples/pretrain_model.sh](examples/pretrain_model.sh). Please note that the data path should exactly match the one used in Step 1 so that the preprocessed retrieval neighbors match the pretraining corpus.

```bash
bash tools/retro/examples/pretrain_model.sh
```

After pretraining, the model checkpoints will be saved in the `--save` directory, if you specified that argument in `pretrain_model.sh`.

To continue pretraining with retrieval from a pretrained GPT model, specify `--load` in `pretrain_model.sh` to load the pretrained GPT model checkpoint (the GPT architecture, including hidden size, number of layers, and activation functions, must be exactly the same as the one used for Retro). You should also specify `--no-load-optim --finetune` so that the optimizer state is not loaded from the pretrained GPT model and the continued pretraining with retrieval starts from a clean state. After the first job / first run, you will continue pretraining with retrieval from your last checkpoint; in the follow-up jobs, launch the pretraining *without* the `--no-load-optim --finetune` flags so that the optimizer state is correctly loaded from your last job.

## Step 3: Perplexity evaluation

During pretraining, the model perplexity is automatically evaluated on the specified validation corpus every `--eval-interval` steps. The validation corpus should be exactly the same as the one used in Step 1 so that the preprocessed retrieval neighbors match the pretraining corpus.

To evaluate the perplexity of a pretrained model, add `--skip-train` in `pretrain_model.sh` to skip the pretraining step and only evaluate the perplexity of the model specified in `--load` on the validation corpus. Then run the same command again:

```bash
bash tools/retro/examples/pretrain_model.sh
```
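As a reference for interpreting the reported numbers, validation perplexity is the exponential of the average per-token cross-entropy loss on the validation corpus. Below is a minimal sketch of that relationship, with toy logits and targets standing in for real model outputs:

```python
import torch
import torch.nn.functional as F

# Toy stand-ins: logits over a 10-token vocabulary for 5 validation positions.
logits = torch.randn(5, 10)
targets = torch.randint(0, 10, (5,))

loss = F.cross_entropy(logits, targets)   # average per-token cross-entropy (natural log)
perplexity = torch.exp(loss)              # reported validation perplexity
print(f"loss = {loss.item():.4f}, perplexity = {perplexity.item():.4f}")
```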
## Step 4: Instruction tuning

In this step, we fine-tune the pretrained model on downstream tasks with instructions. We provide a template instruction tuning script to fine-tune 843M Retro.

We also provide an open-source blend of instruction tuning datasets. The dataset is available to download through [here](https://drive.google.com/file/d/1nzKwwYf8lYb9gN3P4YO8pFNU_B2nMYe1/view?usp=sharing). The blendable dataset consists of the following open-source instruction tuning datasets:

### Instruction Tuning Dataset Breakdown

| Dataset                                                     | Samples | Epochs | Sampling Prob |
|-------------------------------------------------------------|--------:|-------:|--------------:|
| [soda](https://arxiv.org/abs/2212.10465)                    |    2560 |  0.005 |         0.020 |
| [eli5](https://arxiv.org/abs/1907.09190)                    |    2561 |  0.055 |         0.020 |
| [self_instruct_short](https://arxiv.org/abs/2212.10560)     |    1280 |  0.043 |         0.010 |
| [self_instruct_long](https://arxiv.org/abs/2212.10560)      |    2560 |  0.333 |         0.020 |
| [unnatural-instructions](https://arxiv.org/abs/2212.09689)  |    2560 |  0.024 |         0.020 |
| [flan_cot](https://arxiv.org/abs/2210.11416)                |    1280 |  0.093 |         0.010 |
| [dolly](https://arxiv.org/abs/2305.13735)                   |    6400 |  0.938 |         0.050 |
| [oasst-skip-noncode](https://open-assistant.io/)            |  104558 |  1.839 |         0.817 |
| [oasst-skip-code](https://open-assistant.io/)               |    4243 |  1.839 |         0.033 |

Refer to the paper links above for more details about each instruction tuning dataset.

*Note that the provided instruction tuning data comes entirely from open-source datasets. It is slightly different from what we use in [InstructRetro](https://arxiv.org/abs/2310.07713), which also contains private and proprietary datasets, so a 1-2% accuracy difference on downstream tasks may be expected.*

### Instruction tuning script

Download the [blended instruction tuning dataset](https://drive.google.com/file/d/1nzKwwYf8lYb9gN3P4YO8pFNU_B2nMYe1/view?usp=sharing) into your data home directory `$DATA_HOME` and update our templates in [tools/retro/sft/sft_retro_lm.sh](sft/sft_retro_lm.sh).

An example command to run instruction tuning on 843M Retro is as follows:

```bash
# args: [blend-dataset-name] [model-size] [batch-size] [lr] [checkpoints]
bash tools/retro/sft/sft_retro_lm.sh open_inst 843m 128 5e-6
```

The blend dataset name argument (`open_inst` above) blends all the datasets within `$DATA_HOME` following the weights and configurations specified in `${blend_dataset_name}.sh` ([open_inst.sh](sft/open_inst.sh) in the example above).

The checkpoints will be saved in the `--save` directory. For the example above, they will be saved to `/checkpoints/applications/retro-sft_pp1_same_format_ctx1_843m_128_5e-6`.
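For intuition about how the `open_inst` blend weights its component datasets, the sketch below draws each training example from one dataset in proportion to its sampling probability, using the values from the table above. It is only an illustration of the weighting, not the Megatron blendable-dataset implementation configured by `open_inst.sh`.

```python
import random

# Sampling probabilities copied from the dataset breakdown table above.
blend = {
    "soda": 0.020, "eli5": 0.020, "self_instruct_short": 0.010,
    "self_instruct_long": 0.020, "unnatural-instructions": 0.020,
    "flan_cot": 0.010, "dolly": 0.050,
    "oasst-skip-noncode": 0.817, "oasst-skip-code": 0.033,
}

# Decide which dataset each of the next few instruction-tuning samples comes from.
names, weights = zip(*blend.items())
picks = random.choices(names, weights=weights, k=8)
print(picks)
```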
## Step 5: Downstream task evaluation

In this step, we demonstrate how to run InstructRetro for zero-shot evaluation on downstream question answering (QA) tasks.

We provide the pre-processed open-source evaluation datasets in a unified format for the different tasks. The evaluation datasets used in our paper are available to download through [here](https://drive.google.com/drive/folders/1xw-N0LJR_lIWnH6BKzHIb49quVCS_V72?usp=sharing).

Please stick to the same retro workdir used in Steps 0-4 to make sure the preprocessed retrieval neighbors match the pretraining corpus. If you come directly to Step 5, an example retro workdir with `args.json` for 800M Retro is provided [here](https://drive.google.com/file/d/121GqAdMvf8bJEBZRt-SD4uhW-SRWgI3s/view?usp=sharing). Note that the args in the JSON can be overwritten through the command line.

We present an example command to run retro generation given the InstructRetro checkpoints and the Natural Questions (NQ) task. The example command is for the 843M InstructRetro obtained in Step 4. Please specify the directory for the NQ dataset and update the command accordingly for other checkpoints.

```bash
bash tools/retro/text_generation/retro_generate.sh nq 843m greedy test 0 20000 1000 5 pp1 /checkpoints/applications/retro-sft_pp1_same_format_ctx1_843m_128_5e-6 2
```

The generated responses will be saved in the corresponding checkpoint directory. For example, for the 843M InstructRetro, they will be saved to `/checkpoints/applications/retro-sft_pp1_same_format_ctx1_843m_128_5e-6/retro-generate-nq_5_2_843m_test_greedy_0_20000_1000.txt`.

To evaluate the F1 / Exact Match (EM) scores of the generated responses, we provide an example script for the NQ dataset. Please specify the directory for the NQ dataset and update the command accordingly for other checkpoints and downstream tasks.

```bash
python3 tools/retro/text_generation/evaluate.py
```

# Citations

See more details from our papers:

[Shall we Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study.](https://arxiv.org/abs/2304.06762)

_Boxin Wang, Wei Ping, Peng Xu, Lawrence McAfee, Zihan Liu, Mohammad Shoeybi, Yi Dong, Oleksii Kuchaiev, Bo Li, Chaowei Xiao, Anima Anandkumar, Bryan Catanzaro._ (EMNLP 2023)

[InstructRetro: Instruction Tuning post Retrieval-Augmented Pretraining.](https://arxiv.org/abs/2310.07713)

_Boxin Wang, Wei Ping, Lawrence McAfee, Peng Xu, Bo Li, Mohammad Shoeybi, Bryan Catanzaro._

Please cite the papers as follows if you use the data or code from this repo:

```bibtex
@inproceedings{wang2023shall,
    title     = {Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study},
    author    = {Boxin Wang and Wei Ping and Peng Xu and Lawrence McAfee and Zihan Liu and Mohammad Shoeybi and Yi Dong and Oleksii Kuchaiev and Bo Li and Chaowei Xiao and Anima Anandkumar and Bryan Catanzaro},
    booktitle = {The 2023 Conference on Empirical Methods in Natural Language Processing},
    year      = {2023}
}

@article{wang2023instructretro,
    title   = {InstructRetro: Instruction Tuning post Retrieval-Augmented Pretraining},
    author  = {Boxin Wang and Wei Ping and Lawrence McAfee and Peng Xu and Bo Li and Mohammad Shoeybi and Bryan Catanzaro},
    year    = {2023},
    journal = {arXiv preprint arXiv:2310.07713}
}
```