{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "23569edd", "metadata": {}, "source": [ "# Salesforce CodeGen\n", "\n", "**CodeGen** models (`350M`, `2B`, `6B`, `16B`) for **Program Synthesis** are developed by Salesforce Research and presented in: [CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis](https://arxiv.org/abs/2203.13474). This note [1](https://colab.research.google.com/drive/1fQI8OgzMAR0bquCrvhlAtXSw6iMFbVgI#scrollTo=YN2xY4xmkss0) shows an example how to use CodeGen for programming synthesis under Kubeflow notebook environment. \n", "\n", "### Released Models\n", "Various sizes trained models on various datasets are released. The models are named in the following format:\n", "```\n", "codegen-{model-size}-{data}\n", "```\n", "\n", "`model-size` has 4 options: `350M`, `2B`, `6B`, `16B`, which represent the number of parameters in each model.\n", "\n", "`data` has 3 options: `nl`, `multi`, `mono`.\n", "\n", "* `nl` models are randomly initialized and trained on [The Pile](https://github.com/EleutherAI/the-pile), a 825.18 GB English text corpus.\n", "* `multi` models are initialized from `nl` models and then trained on a corpus with code data consisting of multiple programming languages.\n", "* `mono` models are initialized from `multi` models and then trained on a corpus with Python code data.\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "f0a9b01c", "metadata": {}, "source": [ "### Prerequisites" ] }, { "cell_type": "code", "execution_count": 1, "id": "924ee40a-d601-49bc-a54c-4c712809dec9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Cloning into 'CodeGen'...\n", "remote: Enumerating objects: 171, done.\u001b[K\n", "remote: Counting objects: 100% (169/169), done.\u001b[K\n", "remote: Compressing objects: 100% (103/103), done.\u001b[K\n", "remote: Total 171 (delta 77), reused 132 (delta 51), pack-reused 2\u001b[K\n", "Receiving objects: 100% (171/171), 1.36 MiB | 1.40 MiB/s, done.\n", "Resolving deltas: 100% (77/77), done.\n", "/home/jovyan/CodeGen\n", "Requirement already satisfied: pip in /opt/conda/lib/python3.8/site-packages (22.2.2)\n", "Collecting pip\n", " Downloading pip-23.1-py3-none-any.whl (2.1 MB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.1/2.1 MB\u001b[0m \u001b[31m1.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m0m\n", "\u001b[?25hRequirement already satisfied: setuptools in /opt/conda/lib/python3.8/site-packages (65.3.0)\n", "Collecting setuptools\n", " Downloading setuptools-67.6.1-py3-none-any.whl (1.1 MB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.1/1.1 MB\u001b[0m \u001b[31m2.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0mm\n", "\u001b[?25hInstalling collected packages: setuptools, pip\n", " Attempting uninstall: setuptools\n", " Found existing installation: setuptools 65.3.0\n", " Uninstalling setuptools-65.3.0:\n", " Successfully uninstalled setuptools-65.3.0\n", " Attempting uninstall: pip\n", " Found existing installation: pip 22.2.2\n", " Uninstalling pip-22.2.2:\n", " Successfully uninstalled pip-22.2.2\n", "Successfully installed pip-23.1 setuptools-67.6.1\n", "Looking in links: https://download.pytorch.org/whl/torch_stable.html\n", "Collecting torch==1.9.0+cu111 (from -r requirements.txt (line 2))\n", " Downloading https://download.pytorch.org/whl/cu111/torch-1.9.0%2Bcu111-cp38-cp38-linux_x86_64.whl (2041.3 
MB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.0/2.0 GB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m0:00:00\u001b[0m \u001b[36m0:00:01\u001b[0m00:05\u001b[0mm\n", "\u001b[?25hCollecting transformers==4.16.2 (from -r requirements.txt (line 3))\n", " Downloading transformers-4.16.2-py3-none-any.whl (3.5 MB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3.5/3.5 MB\u001b[0m \u001b[31m2.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m0m\n", "\u001b[?25hRequirement already satisfied: typing-extensions in /opt/conda/lib/python3.8/site-packages (from torch==1.9.0+cu111->-r requirements.txt (line 2)) (4.3.0)\n", "Collecting filelock (from transformers==4.16.2->-r requirements.txt (line 3))\n", " Downloading filelock-3.11.0-py3-none-any.whl (10.0 kB)\n", "Collecting huggingface-hub<1.0,>=0.1.0 (from transformers==4.16.2->-r requirements.txt (line 3))\n", " Downloading huggingface_hub-0.13.4-py3-none-any.whl (200 kB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m200.1/200.1 kB\u001b[0m \u001b[31m495.1 kB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0ma \u001b[36m0:00:01\u001b[0m\n", "\u001b[?25hRequirement already satisfied: numpy>=1.17 in /opt/conda/lib/python3.8/site-packages (from transformers==4.16.2->-r requirements.txt (line 3)) (1.22.4)\n", "Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.8/site-packages (from transformers==4.16.2->-r requirements.txt (line 3)) (21.3)\n", "Requirement already satisfied: pyyaml>=5.1 in /opt/conda/lib/python3.8/site-packages (from transformers==4.16.2->-r requirements.txt (line 3)) (5.4.1)\n", "Collecting regex!=2019.12.17 (from transformers==4.16.2->-r requirements.txt (line 3))\n", " Downloading regex-2023.3.23-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (771 kB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m771.9/771.9 kB\u001b[0m \u001b[31m974.4 kB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0ma \u001b[36m0:00:01\u001b[0m\n", "\u001b[?25hRequirement already satisfied: requests in /opt/conda/lib/python3.8/site-packages (from transformers==4.16.2->-r requirements.txt (line 3)) (2.28.1)\n", "Collecting sacremoses (from transformers==4.16.2->-r requirements.txt (line 3))\n", " Downloading sacremoses-0.0.53.tar.gz (880 kB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m880.6/880.6 kB\u001b[0m \u001b[31m920.6 kB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n", "\u001b[?25h Preparing metadata (setup.py) ... 
\u001b[?25ldone\n", "\u001b[?25hCollecting tokenizers!=0.11.3,>=0.10.1 (from transformers==4.16.2->-r requirements.txt (line 3))\n", " Downloading tokenizers-0.13.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m7.8/7.8 MB\u001b[0m \u001b[31m4.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m0m\n", "\u001b[?25hRequirement already satisfied: tqdm>=4.27 in /opt/conda/lib/python3.8/site-packages (from transformers==4.16.2->-r requirements.txt (line 3)) (4.64.1)\n", "Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/lib/python3.8/site-packages (from packaging>=20.0->transformers==4.16.2->-r requirements.txt (line 3)) (3.0.9)\n", "Requirement already satisfied: charset-normalizer<3,>=2 in /opt/conda/lib/python3.8/site-packages (from requests->transformers==4.16.2->-r requirements.txt (line 3)) (2.1.1)\n", "Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.8/site-packages (from requests->transformers==4.16.2->-r requirements.txt (line 3)) (3.3)\n", "Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.8/site-packages (from requests->transformers==4.16.2->-r requirements.txt (line 3)) (1.26.11)\n", "Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.8/site-packages (from requests->transformers==4.16.2->-r requirements.txt (line 3)) (2022.6.15)\n", "Requirement already satisfied: six in /opt/conda/lib/python3.8/site-packages (from sacremoses->transformers==4.16.2->-r requirements.txt (line 3)) (1.16.0)\n", "Requirement already satisfied: click in /opt/conda/lib/python3.8/site-packages (from sacremoses->transformers==4.16.2->-r requirements.txt (line 3)) (7.1.2)\n", "Requirement already satisfied: joblib in /opt/conda/lib/python3.8/site-packages (from sacremoses->transformers==4.16.2->-r requirements.txt (line 3)) (1.1.0)\n", "Building wheels for collected packages: sacremoses\n", " Building wheel for sacremoses (setup.py) ... \u001b[?25ldone\n", "\u001b[?25h Created wheel for sacremoses: filename=sacremoses-0.0.53-py3-none-any.whl size=895241 sha256=3f2b8d261642b1de521eed16da25fc6c1ef5f5d68c11f1379606df72e4fba31c\n", " Stored in directory: /home/jovyan/.cache/pip/wheels/82/ab/9b/c15899bf659ba74f623ac776e861cf2eb8608c1825ddec66a4\n", "Successfully built sacremoses\n", "Installing collected packages: tokenizers, torch, regex, filelock, sacremoses, huggingface-hub, transformers\n", " Attempting uninstall: torch\n", " Found existing installation: torch 1.8.1+cu111\n", " Uninstalling torch-1.8.1+cu111:\n", " Successfully uninstalled torch-1.8.1+cu111\n", "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. 
This behaviour is the source of the following dependency conflicts.\n", "torchaudio 0.8.1 requires torch==1.8.1, but you have torch 1.9.0+cu111 which is incompatible.\n", "torchvision 0.9.1+cu111 requires torch==1.8.1, but you have torch 1.9.0+cu111 which is incompatible.\u001b[0m\u001b[31m\n", "\u001b[0mSuccessfully installed filelock-3.11.0 huggingface-hub-0.13.4 regex-2023.3.23 sacremoses-0.0.53 tokenizers-0.13.3 torch-1.9.0+cu111 transformers-4.16.2\n" ] } ], "source": [ "!git clone https://github.com/salesforce/CodeGen\n", "%cd CodeGen\n", "!pip install --upgrade pip setuptools\n", "!pip install -r requirements.txt" ] }, { "attachments": {}, "cell_type": "markdown", "id": "fbdbdd71", "metadata": {}, "source": [ "### Load model and tokenizer" ] }, { "cell_type": "code", "execution_count": 2, "id": "dd2ee524-56c8-416c-821a-78cdb66c93f5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2023-04-16 15:38:22-- https://storage.googleapis.com/sfr-codegen-research/checkpoints/codegen-350M-mono.tar.gz\n", "Resolving proxy.liuqi.me (proxy.liuqi.me)... 10.185.248.180\n", "Connecting to proxy.liuqi.me (proxy.liuqi.me)|10.185.248.180|:3128... connected.\n", "Proxy request sent, awaiting response... 200 OK\n", "Length: 656148604 (626M) [application/x-tar]\n", "Saving to: ‘checkpoints/codegen-350M-mono.tar.gz’\n", "\n", "codegen-350M-mono.t 100%[===================>] 625.75M 11.9MB/s in 47s \n", "\n", "2023-04-16 15:39:10 (13.2 MB/s) - ‘checkpoints/codegen-350M-mono.tar.gz’ saved [656148604/656148604]\n", "\n", "codegen-350M-mono/\n", "codegen-350M-mono/config.json\n", "codegen-350M-mono/pytorch_model.bin\n", "loading parameters\n", "loading parameters took 24.35s\n", "loading tokenizer\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "f4ff71e918144961bda9501066d860fc", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading: 0%| | 0.00/0.99M [00:00