---
title: Codex
author: coder
date: 2021-07-07 00:00:00 +0800
categories: [arxiv]
tags: [models]
math: true
pin: false
---

- 📙Paper: [Evaluating Large Language Models Trained on Code](https://arxiv.org/pdf/2107.03374.pdf)
- 📚Publisher: `Arxiv`
- 🏠Author Affiliation: `OpenAI`
- 🔑Public: ❌
- 🌐Architecture
  + [ ] Encoder-Decoder
  + [x] Decoder-Only
- 📏Model Size
  + `12M`; `25M`; `42M`; `85M`; `300M`; `679M`; `2.5B`; `12B`
- 🗂️Data pre-processing
  + Data Resource
    * Collected in May 2020 from 54 million public software repositories hosted on GitHub, yielding 179 GB of unique Python files under 1 MB each.
  + De-duplication: ✅
  + Filter Strategies (files matching any of the following are removed; see the filtering sketch below)
    * average line length greater than 100
    * maximum line length greater than 1000
    * contain a small percentage of alphanumeric characters
- 🍉Tokenizer
  + Technology
    * [x] Byte-level Byte-Pair-Encoding (BBPE)
    * [ ] SentencePiece
  + Details
    * GPT-3 tokenizer plus an additional set of tokens for representing whitespace runs of different lengths (see the tokenizer sketch below)
- 🧪Hyperparameters (Codex 12B)
  + optimizer: Adam
    * betas: /
    * eps: /
  + batch size: `2M`
  + context window: `4,096`
  + gradient accumulation steps: /
  + warmup steps: `175`
  + learning rate: `1e-4`
  + weight decay: `0.1`
  + decay schedule (see the optimizer sketch below)
    * [x] Cosine
    * [ ] Linear
    * [ ] Polynomial
    * [ ] Inverse Square
  + precision floating point: /
- 🏃‍♀️Training
  + model initialization: `GPT3`
  + training strategies (see the training sketch below)
    * [x] left-to-right
    * [ ] fill-in-the-middle
  + trained tokens/steps: 100B tokens
  + hardware: V100
  + training time: /
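
The filter strategies listed under Data pre-processing translate into simple per-file checks. Below is a minimal sketch of such a filter as a plain Python function; the function name and the 0.25 cutoff for "a small percentage of alphanumeric characters" are illustrative assumptions, since the paper does not give an exact threshold.

```python
# Minimal sketch of the file-level filters described above.
# The 0.25 cutoff for "small percentage of alphanumeric characters"
# is an assumed value; the paper does not specify one.

def keep_python_file(source: str) -> bool:
    """Return True if a Python source file passes the Codex-style filters."""
    lines = source.splitlines()
    if not lines:
        return False

    line_lengths = [len(line) for line in lines]
    avg_line_length = sum(line_lengths) / len(line_lengths)
    max_line_length = max(line_lengths)

    # Fraction of characters that are letters or digits.
    alnum_fraction = sum(ch.isalnum() for ch in source) / len(source)

    if avg_line_length > 100:    # likely auto-generated or minified
        return False
    if max_line_length > 1000:   # extremely long single lines
        return False
    if alnum_fraction < 0.25:    # mostly whitespace or symbols (assumed cutoff)
        return False
    return True


if __name__ == "__main__":
    print(keep_python_file("def add(a, b):\n    return a + b\n"))  # True
```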
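
The "additional set of tokens for representing whitespace runs" is what lets the Codex tokenizer spend far fewer tokens on indented Python than GPT-3's stock BPE does. The sketch below illustrates the effect with `tiktoken`, comparing the GPT-3 encoding (`r50k_base`) against `p50k_base`, the encoding used by the later Codex API models, which includes those whitespace-run tokens; treating `p50k_base` as a stand-in for the paper's tokenizer is my assumption, as the paper does not name an encoding.

```python
# Compare GPT-3's byte-level BPE (r50k_base) against the Codex-era encoding
# (p50k_base), which extends it with whitespace-run tokens, on indented code.
import tiktoken

code = (
    "def fib(n):\n"
    "    if n < 2:\n"
    "        return n\n"
    "    return fib(n - 1) + fib(n - 2)\n"
)

gpt3_enc = tiktoken.get_encoding("r50k_base")   # GPT-3 vocabulary
codex_enc = tiktoken.get_encoding("p50k_base")  # vocabulary with whitespace-run tokens

print("r50k_base tokens:", len(gpt3_enc.encode(code)))
print("p50k_base tokens:", len(codex_enc.encode(code)))
```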
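
The hyperparameter block (Adam, learning rate `1e-4`, weight decay `0.1`, 175 warmup steps, cosine decay) maps onto a standard PyTorch setup. Below is a minimal sketch using `torch.optim.AdamW` (PyTorch's decoupled-weight-decay variant of Adam) with linear warmup followed by cosine decay; the betas, eps, total step count, and the toy model are assumptions filled in for illustration, since the list above leaves them unspecified.

```python
# Sketch of the optimization setup listed above: Adam-style optimizer with
# weight decay 0.1, lr 1e-4, 175 linear warmup steps, then cosine decay.
# Betas, eps, total_steps, and the toy model are illustrative assumptions.
import math
import torch

model = torch.nn.Linear(512, 512)  # stand-in for the actual transformer

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.95),   # assumed; not given in the list above
    eps=1e-8,            # assumed; not given in the list above
    weight_decay=0.1,
)

warmup_steps = 175
total_steps = 10_000     # assumed; the actual run is specified in tokens (100B)

def lr_lambda(step: int) -> float:
    """Linear warmup for warmup_steps, then cosine decay towards zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(5):  # dummy steps just to show the schedule advancing
    optimizer.zero_grad()
    loss = model(torch.randn(8, 512)).pow(2).mean()
    loss.backward()
    optimizer.step()
    scheduler.step()
    print(step, scheduler.get_last_lr()[0])
```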
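
The training rows say the model is initialized from GPT-3 and trained left-to-right (no fill-in-the-middle). The sketch below shows what that objective looks like in code using the Hugging Face `transformers` causal-LM interface; GPT-2 stands in for GPT-3, whose weights are not public, and the single hard-coded snippet replaces the real code corpus.

```python
# Sketch of the training setup above: start from a pretrained GPT checkpoint
# and continue training with a left-to-right (causal LM) objective on code.
# GPT-2 is used as a stand-in for GPT-3, whose weights are not public.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")  # the "model initialization" step

batch = tokenizer(["def add(a, b):\n    return a + b\n"], return_tensors="pt")

# For left-to-right LM training the labels are the inputs themselves;
# the model shifts them internally and predicts each next token.
outputs = model(**batch, labels=batch["input_ids"])
print("next-token loss:", float(outputs.loss))
```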