# Salesforce CodeGen

**CodeGen** models (`350M`, `2B`, `6B`, `16B`) for **Program Synthesis** are developed by Salesforce Research and presented in [CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis](https://arxiv.org/abs/2203.13474). This note [1](https://colab.research.google.com/drive/1fQI8OgzMAR0bquCrvhlAtXSw6iMFbVgI#scrollTo=YN2xY4xmkss0) shows an example of how to use CodeGen for program synthesis in a Kubeflow notebook environment.

### Released Models
Models of several sizes, trained on different datasets, are released. The models are named in the following format:
```
codegen-{model-size}-{data}
```

`model-size` has 4 options: `350M`, `2B`, `6B`, `16B`, which represent the number of parameters in each model.

`data` has 3 options: `nl`, `multi`, `mono`.

* `nl` models are randomly initialized and trained on [The Pile](https://github.com/EleutherAI/the-pile), an 825.18 GB English text corpus.
* `multi` models are initialized from `nl` models and then trained on a corpus with code data consisting of multiple programming languages.
* `mono` models are initialized from `multi` models and then trained on a corpus with Python code data.
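
The naming scheme above can be enumerated mechanically. The following is a minimal sketch that builds every checkpoint name and splits a name back into its parts. The helpers `all_model_names` and `parse_model_name` and the two constants are hypothetical names introduced here for illustration; they are not part of the CodeGen repository.

```python
from itertools import product

# Options listed in the text above.
MODEL_SIZES = ["350M", "2B", "6B", "16B"]
DATA_VARIANTS = ["nl", "multi", "mono"]


def all_model_names():
    """Return every name of the form codegen-{model-size}-{data}."""
    return [f"codegen-{size}-{data}" for size, data in product(MODEL_SIZES, DATA_VARIANTS)]


def parse_model_name(name):
    """Split a checkpoint name into (model_size, data).

    Hypothetical helper: raises ValueError if the name does not follow the scheme.
    """
    parts = name.split("-")
    if len(parts) != 3 or parts[0] != "codegen":
        raise ValueError(f"not a CodeGen model name: {name!r}")
    _, size, data = parts
    if size not in MODEL_SIZES or data not in DATA_VARIANTS:
        raise ValueError(f"unknown size or data variant in: {name!r}")
    return size, data


if __name__ == "__main__":
    print(len(all_model_names()))                 # 4 sizes x 3 variants = 12
    print(parse_model_name("codegen-350M-mono"))  # ('350M', 'mono')
```

Four sizes and three data variants give the twelve names that appear in the model selector further down.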


### Prerequisites

In [1]:
!git clone https://github.com/salesforce/CodeGen
%cd CodeGen
!pip install --upgrade pip setuptools
!pip install -r requirements.txt

Cloning into 'CodeGen'...
remote: Enumerating objects: 171, done.
remote: Counting objects: 100% (169/169), done.
remote: Compressing objects: 100% (103/103), done.
remote: Total 171 (delta 77), reused 132 (delta 51), pack-reused 2
Receiving objects: 100% (171/171), 1.36 MiB | 1.40 MiB/s, done.
Resolving deltas: 100% (77/77), done.
/home/jovyan/CodeGen
Collecting pip
  Downloading pip-23.1-py3-none-any.whl (2.1 MB)
Collecting setuptools
  Downloading setuptools-67.6.1-py3-none-any.whl (1.1 MB)
Installing collected packages: setuptools, pip
  Attempting uninstall: setuptools
    Found existing installation: setuptools 65.3.0
    Uninstalling setuptools-65.3.0:
      Successfully uninstalled setuptoo

### Load model and tokenizer

In [2]:
chosen_model = "codegen-350M-mono" #@param ["codegen-350M-nl", "codegen-350M-multi", "codegen-350M-mono", "codegen-2B-nl", "codegen-2B-multi", "codegen-2B-mono", "codegen-6B-nl", "codegen-6B-multi", "codegen-6B-mono", "codegen-16B-nl", "codegen-16B-multi", "codegen-16B-mono"]
fp16 = True #@param {type:"boolean"}

import os

if not os.path.exists(f'./checkpoints/{chosen_model}'):
  !wget -P checkpoints https://storage.googleapis.com/sfr-codegen-research/checkpoints/{chosen_model}.tar.gz && tar -xvf checkpoints/{chosen_model}.tar.gz -C checkpoints/


import torch
from jaxformer.hf.sample import truncate as do_truncate
from jaxformer.hf.sample import set_env, set_seed, print_time, create_model, create_custom_gpt2_tokenizer, create_tokenizer, sample

# (0) constants

models_nl = ['codegen-350M-nl', 'codegen-2B-nl', 'codegen-6B-nl', 'codegen-16B-nl']
models_pl = ['codegen-350M-multi', 'codegen-2B-multi', 'codegen-6B-multi', 'codegen-16B-multi', 'codegen-350M-mono', 'codegen-2B-mono', 'codegen-6B-mono', 'codegen-16B-mono']
models = models_nl + models_pl


# (2) preamble

set_env()

pad = 50256
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')  # fall back to CPU when no GPU is visible
ckpt = f'./checkpoints/{chosen_model}'

if device.type == "cpu":
  print()
  print("force full precision for cpu!!")
  print()
  fp16 = False


# (3) load

with print_time('loading parameters'):
  model = create_model(ckpt=ckpt, fp16=fp16).to(device)


with print_time('loading tokenizer'):
  if chosen_model in models_pl:
    tokenizer = create_custom_gpt2_tokenizer()
  else:
    tokenizer = create_tokenizer()
  tokenizer.padding_side = 'left'
  tokenizer.pad_token = pad

--2023-04-16 15:38:22--  https://storage.googleapis.com/sfr-codegen-research/checkpoints/codegen-350M-mono.tar.gz
Resolving proxy.liuqi.me (proxy.liuqi.me)... 10.185.248.180
Connecting to proxy.liuqi.me (proxy.liuqi.me)|10.185.248.180|:3128... connected.
Proxy request sent, awaiting response... 200 OK
Length: 656148604 (626M) [application/x-tar]
Saving to: ‘checkpoints/codegen-350M-mono.tar.gz’


2023-04-16 15:39:10 (13.2 MB/s) - ‘checkpoints/codegen-350M-mono.tar.gz’ saved [656148604/656148604]

codegen-350M-mono/
codegen-350M-mono/config.json
codegen-350M-mono/pytorch_model.bin
loading parameters
loading parameters took 24.35s
loading tokenizer



loading tokenizer took 36.72s
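
The `fp16` flag roughly halves the memory needed to hold the weights, which matters when choosing a model size for a given GPU. The sketch below is a back-of-the-envelope estimate only. It assumes the size labels are exact parameter counts (they are nominal) and counts weights alone, ignoring activations and framework overhead. `NOMINAL_PARAMS` and `estimate_weight_gib` are hypothetical names used for illustration.

```python
# Nominal parameter counts implied by the size labels (an approximation).
NOMINAL_PARAMS = {"350M": 350e6, "2B": 2e9, "6B": 6e9, "16B": 16e9}


def estimate_weight_gib(model_size, fp16=True):
    """Estimate memory for the weights alone, in GiB.

    Uses 2 bytes per parameter for half precision and 4 bytes for full precision.
    """
    bytes_per_param = 2 if fp16 else 4
    return NOMINAL_PARAMS[model_size] * bytes_per_param / 2**30


if __name__ == "__main__":
    for size in NOMINAL_PARAMS:
        half = estimate_weight_gib(size, fp16=True)
        full = estimate_weight_gib(size, fp16=False)
        print(f"codegen-{size}: ~{half:.1f} GiB (fp16), ~{full:.1f} GiB (fp32)")
```

Under these assumptions the full-precision estimate is always twice the half-precision one, which is why the notebook only forces `fp16 = False` when it has to run on CPU.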


### Try out the model

In [3]:
rng_seed = 42 #@param {type:"integer"}
rng_deterministic = True #@param {type:"boolean"}
p = 0.95 #@param {type:"number"}
t = 0.2 #@param {type:"number"}
max_length = 128 #@param {type:"integer"}
batch_size = 1 #@param {type:"integer"}
context = "def hello_world():" #@param {type:"string"}

set_seed(rng_seed, deterministic=rng_deterministic)

# (4) sample

with print_time('sampling'):
  completion = sample(device=device, model=model, tokenizer=tokenizer, context=context, pad_token_id=pad, num_return_sequences=batch_size, temp=t, top_p=p, max_length_sample=max_length)[0]
  truncation = do_truncate(completion)

  print('=' * 100)
  print(completion)
  print('=' * 100)
  print(context+truncation)
  print('=' * 100)
    

# !python -m jaxformer.hf.sample --model $chosen_model \
#                  --rng-seed $rng_seed \
#                  --p $p \
#                  --t $t \
#                  --max-length $max_length \
#                  --batch-size $batch_size \
#                  --context '$context'

sampling

    print("Hello World")

hello_world()

#
def hello_world():
    print("Hello World")

hello_world()


sampling took 0.96s
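
The parameters `t` and `p` above are passed to `sample` as `temp` and `top_p`, that is, a sampling temperature and a nucleus (top-p) threshold. The following is a generic, dependency-free sketch of how those two settings are commonly combined when choosing the next token. It illustrates the technique only and is not the code path used by `jaxformer.hf.sample`; `top_p_sample` is a hypothetical helper.

```python
import math
import random


def top_p_sample(logits, temperature=0.2, top_p=0.95, rng=None):
    """Pick one token index using temperature scaling and nucleus (top-p) filtering."""
    rng = rng or random.Random()

    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]

    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Draw one of the kept tokens in proportion to its probability.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.1]  # made-up scores for a three-token vocabulary
    print(top_p_sample(toy_logits, temperature=0.2, top_p=0.95, rng=random.Random(42)))
```

With a low temperature such as `0.2`, the most likely token in this toy example already carries more than 95% of the probability mass, so it is the only candidate left after filtering. Raising the temperature or `top_p` widens the candidate set and makes completions more varied.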


## Training and Fine-tuning

The Jaxformer library for data pre-processing, training and fine-tuning the CodeGen models can be found here:

https://github.com/salesforce/jaxformer