---
name: proteinmpnn
description: >
  Design protein sequences using ProteinMPNN inverse folding. Use this skill when:
  (1) Designing sequences for RFdiffusion backbones,
  (2) Redesigning existing protein sequences,
  (3) Fixing specific residues while designing others,
  (4) Optimizing sequences for expression or stability,
  (5) Multi-state or negative design.

  For backbone generation, use rfdiffusion or bindcraft.
  For ligand-aware design, use ligandmpnn.
  For solubility optimization, use solublempnn.
license: MIT
category: design-tools
tags: [sequence-design, inverse-folding]
biomodals_script: modal_ligandmpnn.py
---

# ProteinMPNN Sequence Design

## Prerequisites

| Requirement | Minimum | Recommended |
|-------------|---------|-------------|
| Python | 3.8+ | 3.10 |
| CUDA | 11.0+ | 11.7+ |
| GPU VRAM | 8GB | 16GB (T4) |
| RAM | 8GB | 16GB |

## How to run

> **First time?** See [Installation Guide](../../docs/installation.md) to set up Modal and biomodals.

### Option 1: Local installation (recommended)
```bash
git clone https://github.com/dauparas/ProteinMPNN.git
cd ProteinMPNN

python protein_mpnn_run.py \
  --pdb_path backbone.pdb \
  --out_folder output/ \
  --num_seq_per_target 16 \
  --sampling_temp "0.1"
```

**GPU**: T4 (16GB) sufficient | **Time**: ~50-100 sequences/minute

### Option 2: Modal (via LigandMPNN wrapper)
```bash
cd biomodals
modal run modal_ligandmpnn.py \
  --pdb-path backbone.pdb \
  --num-seq-per-target 16
```

Note: LigandMPNN includes ProteinMPNN functionality.

## Config Schema

### Core Parameters

| Parameter | Default | Range | Description |
|-----------|---------|-------|-------------|
| `--pdb_path` | required | path | Single PDB input |
| `--pdb_path_chains` | all | A,B | Chains to design (comma-sep) |
| `--out_folder` | required | path | Output directory |
| `--num_seq_per_target` | 1 | 1-1000 | Sequences per structure |
| `--sampling_temp` | "0.1" | "0.0001-1.0" | Temperature (string!) |
| `--seed` | 0 | int | Random seed |
| `--batch_size` | 1 | 1-32 | Batch size |

### Temperature Guide
```
0.1  -> Low diversity, high recovery (production)
0.2  -> Moderate diversity (default)
0.3  -> Higher diversity (exploration)
0.5+ -> Very diverse, lower quality
```

**IMPORTANT**: Temperature must be passed as a string, not float.

## Common mistakes

### Temperature Parameter
✅ **Correct**:
```bash
--sampling_temp "0.1"    # String with quotes
```

❌ **Wrong**:
```bash
--sampling_temp 0.1      # Float without quotes - may cause errors
--sampling_temp 0.1,0.2  # Multiple temps need proper format
```

### Fixed Positions JSONL
✅ **Correct**:
```json
{"A": [1, 2, 3, 10, 11], "B": [5, 6]}
```

❌ **Wrong**:
```json
{"A": "1,2,3,10,11"}     # String instead of list
{A: [1, 2, 3]}           # Missing quotes on key
{"A": [1,2,3,]}          # Trailing comma
```

### Chain Selection
✅ **Correct**:
```bash
--pdb_path_chains A,B    # No spaces
```

❌ **Wrong**:
```bash
--pdb_path_chains A, B   # Space after comma
--pdb_path_chains "A,B"  # Quotes may cause issues
```

### Amino Acid Biases
```bash
# Bias toward certain AAs (positive = favor)
--bias_AA_jsonl '{"A": {"A": 1.5, "W": -2.0}}'

# Omit specific AAs globally
--omit_AAs "CM"  # No cysteine or methionine

# Per-position omission
--omit_AA_jsonl '{"A": {"1": "C", "2": "CM"}}'
```

### Multi-Chain Design
```bash
# Design chains A and B together
--pdb_path_chains A,B

# Tie chains (same sequence)
--tied_positions_jsonl tied.jsonl
```

## Variants Comparison

| Variant | Use Case | Key Difference |
|---------|----------|----------------|
| ProteinMPNN | General | Original model |
| SolubleMPNN | Expression | Trained on soluble proteins |
| LigandMPNN | Small molecules | Ligand-aware context |

## Output format

```
output/
├── seqs/
│   └── backbone.fa          # FASTA sequences
└── backbone_pdb/
    └── backbone_0001.pdb    # PDBs with designed sequence
```

### FASTA Header Format
```
>backbone_0001, score=1.234, global_score=1.234, seq_recovery=0.85
MKTAYIAKQRQISFVKSHFSRQLE...
```

## Common workflows

### Binder Sequence Design
```bash
python protein_mpnn_run.py \
  --pdb_path binder_backbone.pdb \
  --out_folder output/ \
  --num_seq_per_target 16 \
  --sampling_temp "0.1" \
  --pdb_path_chains B  # Design binder chain only
```

### Interface Redesign
```bash
# Fix core, design interface
python protein_mpnn_run.py \
  --pdb_path complex.pdb \
  --fixed_positions_jsonl core_positions.jsonl \
  --num_seq_per_target 32
```

### Multi-State Design
```bash
# Design for multiple conformations
python protein_mpnn_run.py \
  --pdb_path_multi state1.pdb,state2.pdb \
  --num_seq_per_target 16
```

## Sample output

### Successful run
```
$ python protein_mpnn_run.py --pdb_path backbone.pdb --out_folder output/ --num_seq_per_target 8
Loading model weights...
Designing sequences for backbone.pdb
Generated 8 sequences in 2.3 seconds

output/seqs/backbone.fa:
>backbone_0001, score=1.234, global_score=1.189, seq_recovery=0.82
MKTAYIAKQRQISFVKSHFSRQLEERGLTKE...
>backbone_0002, score=1.198, global_score=1.156, seq_recovery=0.79
MKTAYIAKQRQISFVKSQFSRQLDERGLTKE...
```

**What good output looks like:**
- Score: 1.0-2.0 (lower = more confident)
- Seq recovery: 0.3-0.6 for de novo, 0.7-0.9 for redesign
- Diverse sequences (not all identical) when temp > 0.1

## Decision tree

```
Should I use ProteinMPNN?
│
├─ Have a backbone structure?
│  ├─ Yes → Continue below
│  └─ No → Use RFdiffusion first
│
├─ What's in the binding site?
│  ├─ Nothing / protein only → ProteinMPNN ✓
│  ├─ Small molecule / ligand → Use LigandMPNN
│  └─ Metal / cofactor → Use LigandMPNN
│
├─ Priority?
│  ├─ Solubility/expression → Consider SolubleMPNN
│  ├─ Speed → ProteinMPNN ✓
│  └─ AF2 optimization → Consider ColabDesign
│
└─ Need fixed positions?
   ├─ Yes → Use --fixed_positions_jsonl
   └─ No → ProteinMPNN ✓ (design all)
```

## Typical performance

| Campaign Size | Time (T4) | Cost (Modal) | Notes |
|---------------|-----------|--------------|-------|
| 100 backbones × 8 seq | 15-20 min | ~$2 | Standard |
| 500 backbones × 8 seq | 1-1.5h | ~$8 | Large campaign |
| 1000 backbones × 16 seq | 3-4h | ~$18 | Comprehensive |

**Throughput**: ~50-100 sequences/minute on T4 GPU.

---

## Verify

```bash
grep -c "^>" output/seqs/*.fa  # Should match backbone_count × num_seq_per_target
```

---

## Troubleshooting

**Low sequence diversity**: Increase sampling_temp to 0.2-0.3
**Poor recovery**: Decrease sampling_temp to 0.1
**OOM errors**: Reduce batch_size
**Unwanted cysteines**: Use --omit_AAs "C"

### Error interpretation

| Error | Cause | Fix |
|-------|-------|-----|
| `RuntimeError: CUDA out of memory` | Long protein or large batch | Reduce batch_size or use larger GPU |
| `KeyError: 'A'` | Chain not in PDB | Check chain IDs in your PDB file |
| `JSONDecodeError` | Invalid JSONL format | Validate JSON syntax (see Common Mistakes) |
| `IndexError: list index` | Empty chain or residue list | Check PDB has atoms, not just HEADER |

---

**Next**: Structure prediction for validation → `protein-qc` for filtering.