---
name: esm
description: >
  ESM2 protein language model for embeddings and sequence scoring.
  Use this skill when: (1) Computing pseudo-log-likelihood (PLL) scores,
  (2) Getting protein embeddings for clustering, (3) Filtering designs by
  sequence plausibility, (4) Zero-shot variant effect prediction,
  (5) Analyzing sequence-function relationships. For structure prediction,
  use chai or boltz. For QC thresholds, use protein-qc.
license: MIT
category: design-tools
tags: [sequence-design, embeddings, scoring]
proteinbase_slug: esm2-optimization
proteinbase_url: https://proteinbase.com/design-methods/esm2-optimization
biomodals_script: modal_esm2_predict_masked.py
---

# ESM2 Protein Language Model

## Prerequisites

| Requirement | Minimum | Recommended |
|-------------|---------|-------------|
| Python | 3.8+ | 3.10 |
| PyTorch | 1.10+ | 2.0+ |
| CUDA | 11.0+ | 11.7+ |
| GPU VRAM | 8GB | 24GB (A10G) |
| RAM | 16GB | 32GB |

## How to run

> **First time?** See [Installation Guide](../../docs/installation.md) to set up Modal and biomodals.

### Option 1: Modal

```bash
cd biomodals
modal run modal_esm2_predict_masked.py \
  --input-faa sequences.fasta \
  --out-dir embeddings/
```

**GPU**: A10G (24GB) | **Timeout**: 300s default

### Option 2: Python API (recommended)

```python
import torch
import esm

# Load model and tokenizer
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model = model.eval().cuda()

# Tokenize sequences
data = [("seq1", "MKTAYIAKQRQISFVK...")]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

with torch.no_grad():
    results = model(batch_tokens.cuda(), repr_layers=[33])

# Per-residue embeddings from the final (33rd) layer
embeddings = results["representations"][33]
```

## Key parameters

### ESM2 Models

| Model | Parameters | Speed | Quality |
|-------|------------|-------|---------|
| esm2_t6_8M | 8M | Fastest | Basic |
| esm2_t12_35M | 35M | Fast | Good |
| esm2_t33_650M | 650M | Medium | Better |
| esm2_t36_3B | 3B | Slow | Best |

## Output format

```
embeddings/
├── embeddings.npy    # (N, 1280) array
├── pll_scores.csv    # PLL for each sequence
└── metadata.json     # Sequence info
```

## Sample output

### Successful run

```
$ modal run modal_esm2_predict_masked.py --input-faa designs.fasta
[INFO] Loading ESM2-650M model...
[INFO] Processing 100 sequences...
[INFO] Computing pseudo-log-likelihood...

embeddings/pll_scores.csv:
sequence_id,pll,pll_normalized,length
design_0,-0.82,0.15,78
design_1,-0.95,0.08,85
design_2,-1.23,-0.12,72
...

Summary:
  Mean PLL: -0.91
  Sequences with PLL > 0: 42/100 (42%)
```

**What good output looks like:**
- `pll_normalized` > 0.0 (more natural-like)
- Embeddings shape: (N, 1280) for the 650M model
- Higher PLL = more natural-looking sequence
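## Scoring PLL via the Python API

The Modal script writes `pll_scores.csv` for you. If you need PLL directly from the Python API, the sketch below shows one common recipe (masked-marginal scoring: mask each position in turn and score the true residue). It reuses `model`, `alphabet`, and `batch_converter` from the example above; the `pseudo_log_likelihood` helper is illustrative, and the exact normalization behind `pll_normalized` is specific to the biomodals script and not reproduced here.

```python
import torch

def pseudo_log_likelihood(seq):
    """Length-normalized masked-marginal PLL for one sequence.

    Assumes `model`, `alphabet`, and `batch_converter` are already
    loaded as in the Python API example above.
    """
    _, _, tokens = batch_converter([("query", seq)])
    tokens = tokens.cuda()
    total = 0.0
    # Token 0 is BOS and the last token is EOS, so residue i sits at index i
    for i in range(1, len(seq) + 1):
        masked = tokens.clone()
        masked[0, i] = alphabet.mask_idx
        with torch.no_grad():
            logits = model(masked)["logits"]
        log_probs = torch.log_softmax(logits[0, i], dim=-1)
        total += log_probs[tokens[0, i]].item()
    return total / len(seq)
```

This costs one forward pass per residue, so batch the masked copies if you are scoring many long sequences. Downstream QC filtering is then just a threshold on the resulting scores (see the PLL interpretation table below).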
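## Mean-pooled embeddings for clustering

The model returns one 1280-dim vector per residue; for clustering or diversity analysis, these are typically collapsed to one vector per sequence by averaging over residue positions, which is consistent with the `(N, 1280)` shape of `embeddings.npy`. A minimal sketch, again assuming the loaded `model` and `batch_converter` from above (`embed_sequences` is an illustrative helper, not part of the `esm` package):

```python
import numpy as np
import torch

def embed_sequences(seqs):
    """Return an (N, 1280) matrix of mean-pooled per-sequence embeddings."""
    data = [(f"seq{i}", s) for i, s in enumerate(seqs)]
    _, _, tokens = batch_converter(data)
    with torch.no_grad():
        reps = model(tokens.cuda(), repr_layers=[33])["representations"][33]
    pooled = []
    for j, (_, s) in enumerate(data):
        # Average residue positions only: skip BOS at index 0 and EOS/padding after len(s)
        pooled.append(reps[j, 1 : len(s) + 1].mean(dim=0).cpu().numpy())
    return np.stack(pooled)
```

The resulting matrix feeds directly into scikit-learn (`KMeans`, `AgglomerativeClustering`) or UMAP for diversity analysis.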
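## Per-position log-odds for mutation scanning

For zero-shot variant effect prediction (the "per-position log-odds" branch in the decision tree below), a widely used recipe is to mask the position of interest and compare the model's log-probabilities for the mutant and wild-type residues; positive scores favor the mutation. A hedged sketch with an illustrative `mutation_log_odds` helper, reusing the loaded `model`, `alphabet`, and `batch_converter`:

```python
import torch

def mutation_log_odds(seq, pos, mut_aa):
    """log p(mutant) - log p(wild-type) at a masked position (pos is 1-indexed)."""
    _, _, tokens = batch_converter([("wt", seq)])
    tokens = tokens.cuda()
    masked = tokens.clone()
    masked[0, pos] = alphabet.mask_idx  # residue `pos` sits at token index `pos` (BOS at 0)
    with torch.no_grad():
        logits = model(masked)["logits"]
    log_probs = torch.log_softmax(logits[0, pos], dim=-1)
    wt_aa = seq[pos - 1]
    return (log_probs[alphabet.get_idx(mut_aa)] - log_probs[alphabet.get_idx(wt_aa)]).item()
```

Scanning all 19 substitutions at every position yields a full mutational landscape; batch the masked copies to keep this tractable.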
## Decision tree

```
Should I use ESM2?
│
├─ What do you need?
│   ├─ Sequence plausibility score → ESM2 PLL ✓
│   ├─ Embeddings for clustering → ESM2 ✓
│   ├─ Variant effect prediction → ESM2 ✓
│   └─ Structure prediction → Use ESMFold
│
├─ What model size?
│   ├─ Fast screening → esm2_t12_35M
│   ├─ Standard use → esm2_t33_650M ✓
│   └─ Best quality → esm2_t36_3B
│
└─ Use case?
    ├─ QC filtering → normalized PLL > 0.0 threshold
    ├─ Diversity analysis → Mean-pooled embeddings
    └─ Mutation scanning → Per-position log-odds
```

## PLL interpretation

| Normalized PLL | Interpretation |
|----------------|----------------|
| > 0.2 | Very natural sequence |
| 0.0 to 0.2 | Good, natural-like |
| -0.5 to 0.0 | Acceptable |
| < -0.5 | May be unnatural |

## Typical performance

| Campaign Size | Time (A10G) | Cost (Modal) | Notes |
|---------------|-------------|--------------|-------|
| 100 sequences | 5-10 min | ~$1 | Quick screen |
| 1000 sequences | 30-60 min | ~$5 | Standard |
| 5000 sequences | 2-3h | ~$20 | Large batch |

**Throughput**: up to ~100-200 sequences/minute with the 650M model for embedding-only runs; PLL scoring requires one masked forward pass per residue and is substantially slower (see the times above).

---

## Verify

```bash
wc -l embeddings/pll_scores.csv
# Line count should equal the number of input sequences + 1 (header)
```

---

## Troubleshooting

**OOM errors**: Reduce the batch size or switch to a smaller model
**Slow processing**: Use esm2_t12_35M for speed
**Low PLL scores**: May indicate unusual or heavily designed sequences

### Error interpretation

| Error | Cause | Fix |
|-------|-------|-----|
| `RuntimeError: CUDA out of memory` | Sequence too long or batch too large | Reduce batch size |
| `KeyError: representation` | Wrong layer requested | Request the model's final layer (33 for esm2_t33_650M) |
| `ValueError: sequence` | Invalid amino acid | Check for non-standard AAs |

---

**Next**: Structure prediction with `chai` or `boltz` → `protein-qc` for filtering.