---
name: megatron-memory-estimator
description: Estimate GPU memory usage for Megatron-based MoE (Mixture of Experts) and dense models. Use when users need to (1) estimate memory from HuggingFace model configs (DeepSeek-V3, Qwen, etc.), (2) plan GPU resource allocation for training, (3) compare different parallelism strategies (TP/PP/EP/CP), (4) determine if a model fits in available GPU memory, or (5) optimize training configurations for memory efficiency.
---

# Megatron Memory Estimator

Estimate GPU memory usage for Megatron-based models directly from HuggingFace configs or custom specifications.

## Quick Start

### Option 1: From HuggingFace Model (Recommended)

Estimate directly from HuggingFace model paths:

```bash
# DeepSeek-V3 (61 layers, requires layer distribution when pp>1)
python scripts/estimate_from_hf.py deepseek-ai/DeepSeek-V3 \
    --tp 4 --pp 4 --ep 8 --num-gpus 128 --num-layers-in-last-pipeline-stage 16

# Qwen 3
python scripts/estimate_from_hf.py Qwen/Qwen3-235B-A22B \
    --tp 8 --pp 4 --ep 4 --num-gpus 128
```

### Option 2: From Local HF Config

```bash
python scripts/estimate_from_hf.py /path/to/config.json \
    --tp 2 --pp 2 --num-gpus 8
```

### Option 3: Quick Parameter Testing

```bash
# Test different parallelism strategies
python scripts/estimate_from_hf.py deepseek-ai/DeepSeek-V3 \
    --tp 8 --pp 2 --ep 16 --num-layers-in-last-pipeline-stage 31  # Strategy 1 (30+31=61)

python scripts/estimate_from_hf.py deepseek-ai/DeepSeek-V3 \
    --tp 4 --pp 4 --ep 8 --num-layers-in-last-pipeline-stage 16   # Strategy 2 (15+15+15+16=61)

# Test different batch sizes
python scripts/estimate_from_hf.py deepseek-ai/DeepSeek-V3 \
    --tp 4 --pp 4 --ep 8 --micro-batch-size 2 --num-layers-in-last-pipeline-stage 16
```

## Available Scripts

### estimate_from_hf.py (Primary Script)

Automatically converts HuggingFace configs to Megatron format and estimates memory.

**Key Arguments:**
- `model_path`: HF model path or local config.json path
- `--tp N`: Tensor parallel size (default: 1)
- `--pp N`: Pipeline parallel size (default: 1)
- `--ep N`: Expert parallel size (default: 1, for MoE)
- `--cp N`: Context parallel size (default: 1)
- `--etp N`: Expert tensor parallel size (optional)
- `--vpp N`: Virtual pipeline parallel size (optional)
- `--micro-batch-size N`: Micro batch size (default: 1)
- `--seq-length N`: Sequence length (default: 4096)
- `--num-gpus N`: Total GPU count (default: 8)
- `--recompute-granularity {full,selective}`: Enable activation checkpointing
- `--num-layers-in-first-pipeline-stage N`: Number of layers in the first pipeline stage (use when model layers cannot be evenly divided by `--pp`)
- `--num-layers-in-last-pipeline-stage N`: Number of layers in the last pipeline stage (use when model layers cannot be evenly divided by `--pp`)
- `--verbose`: Show detailed model breakdown
- `--json`: Output as JSON

**Examples:**

```bash
# Basic estimation
python scripts/estimate_from_hf.py deepseek-ai/DeepSeek-V3 --num-gpus 64

# With memory optimization
python scripts/estimate_from_hf.py Qwen/Qwen3-235B-A22B \
    --tp 8 --pp 4 --ep 4 \
    --recompute-granularity full \
    --recompute-method uniform \
    --num-gpus 128

# Verbose output
python scripts/estimate_from_hf.py deepseek-ai/DeepSeek-V3 \
    --tp 4 --pp 4 --ep 8 --verbose --num-layers-in-last-pipeline-stage 16

# JSON output for automation
python scripts/estimate_from_hf.py deepseek-ai/DeepSeek-V3 \
    --tp 4 --pp 4 --ep 8 --json --num-layers-in-last-pipeline-stage 16 > result.json
```


## Common Workflows

### Find Optimal Parallelism for a Model

```bash
# Start with model path
MODEL="deepseek-ai/DeepSeek-V3"
GPUS=128

# Test different strategies
python scripts/estimate_from_hf.py $MODEL --tp 4 --pp 4 --ep 8 --num-gpus $GPUS --num-layers-in-last-pipeline-stage 16
python scripts/estimate_from_hf.py $MODEL --tp 8 --pp 2 --ep 8 --num-gpus $GPUS --num-layers-in-last-pipeline-stage 31


# Choose strategy that fits GPU memory with best efficiency
```

### Optimize for Memory Efficiency

Progressive memory reduction:

```bash
# 1. Baseline
python scripts/estimate_from_hf.py $MODEL --tp 4 --pp 2 --num-gpus 16

# 2. Add recomputation
python scripts/estimate_from_hf.py $MODEL --tp 4 --pp 2 --num-gpus 16 \
    --recompute-granularity full

# 3. Increase expert parallelism (MoE only)
python scripts/estimate_from_hf.py $MODEL --tp 4 --pp 2 --ep 4 --num-gpus 16 \
    --recompute-granularity full

# 4. Increase pipeline parallelism
python scripts/estimate_from_hf.py $MODEL --tp 4 --pp 4 --ep 4 --num-gpus 16 \
    --recompute-granularity full

# 5. Last resort: reduce batch size
python scripts/estimate_from_hf.py $MODEL --tp 4 --pp 4 --ep 4 --num-gpus 16 \
    --recompute-granularity full --micro-batch-size 1
```

### Check if Model Fits Available GPUs

```bash
# Check if DeepSeek-V3 fits in 128x A100 80GB
python scripts/estimate_from_hf.py deepseek-ai/DeepSeek-V3 \
    --tp 4 --pp 4 --ep 8 --num-gpus 128 --num-layers-in-last-pipeline-stage 16

# Output will show peak memory per GPU
# If < 80 GB: ✓ Fits
# If > 80 GB: Need more parallelism or optimization
```

## Understanding Output

The estimator shows:

```
================================================================================
CONFIGURATION SUMMARY
================================================================================

Model Type: deepseek_v3
Architecture: 61L-7168H
MoE: 256 experts, top-8

Parallelism:
  TP=4, PP=4, EP=8, CP=1

Training:
  Micro Batch Size: 1
  Sequence Length: 4096
  Total GPUs: 128

================================================================================
MEMORY ESTIMATION RESULTS
================================================================================

Pipeline Stage 0:
  Parameters: 3.15B
  Activations: 1.23B
  Memory Breakdown:
    - Weights + Gradients: 18.90 GB
    - Weights + Gradients + Optimizer: 37.80 GB
    - Activations: 2.46 GB
    - Total: 40.26 GB

================================================================================
Peak Memory per GPU: 40.26 GB
✓ Fits in: A100 80GB, H100
================================================================================
```

**Memory Components:**
- **Weights + Gradients**: Parameters and gradients (2+2=4 bytes/param in FP16)
- **Optimizer States**: Adam momentum + variance (8 bytes/param)
- **Activations**: Forward pass activations stored for backward

**GPU Fit Guidelines:**
- < 40 GB: A100 40GB, A100 80GB, H100
- < 80 GB: A100 80GB, H100 80GB
- \> 80 GB: H200 141GB or consider more parallelism or smaller batch

## Memory Optimization Techniques

Ranked by effectiveness:

1. **Enable Distributed Optimizer** (included by default)
   - Shards optimizer states across data parallel ranks
   - ~6 bytes/param saving

2. **Activation Recomputation** (`--recompute-granularity full`)
   - 50-70% activation memory reduction
   - Trade compute for memory

3. **Increase Expert Parallelism** (MoE only) (`--ep N`)
   - Linear memory reduction for expert layers
   - Minimal performance impact

4. **Increase Pipeline Parallelism** (`--pp N`)
   - Splits model across more stages
   - Some pipeline bubble overhead

5. **Reduce Batch Size** (`--micro-batch-size 1`)
   - Direct activation memory reduction
   - Impacts throughput

## Supported Models

The script automatically handles:

- **DeepSeek**: DeepSeek-V2, DeepSeek-V3
- **Qwen**: Qwen2.5, Qwen3 (dense and MoE)
- **Moonlight**: Kimi models
- **Any HuggingFace model with config.json**


## Setup & Troubleshooting
   
Because this tool relies on Megatron-LM components, you need to add both the tool directory and Megatron-LM to your `PYTHONPATH`.

**Recommended Setup:**

```bash
# Add current directory and Megatron-LM to PYTHONPATH
export PYTHONPATH=$PYTHONPATH:$(pwd):/path/to/Megatron-LM
```

If you encounter `ImportError: No module named 'megatron_memory_estimator'`, ensure the root directory of this skill is in your `PYTHONPATH`.

## Dependencies

**Required:**
- `mbridge`: HuggingFace to Megatron config bridge
- `transformers`: HuggingFace transformers library
- `torch`: PyTorch (CPU version sufficient)
- `megatron-core`: Megatron core library

**Installation:**
```bash
pip install mbridge transformers torch megatron-core==0.13.0
```

For full Megatron-LM support (optional):
```bash
pip install git+https://github.com/NVIDIA/Megatron-LM.git@core_r0.13.0
```

## Reference Documentation

For detailed configuration options:
- `references/configuration_guide.md`: All configuration parameters
- `references/parallelism_strategies.md`: Parallelism strategy guide

## Notes

- Estimates are theoretical based on model architecture
- Actual memory may vary ±10-15% due to framework overhead
- Always leave 10-20% memory headroom for safety
- Test on small scale before full deployment
- MoE models: Expert parallelism (EP) is critical for memory efficiency