---
name: sagemaker-hyperpod
description: |
  Amazon SageMaker HyperPod expert for ML training clusters with Trainium or GPU.
  Use when: creating HyperPod clusters, running distributed training, configuring
  EKS or Slurm orchestration, troubleshooting cluster issues, checking quotas,
  or when user mentions "hyperpod", "hyp", "ml-cluster", "trainium", "trn1",
  "distributed training", or "multi-node training".
argument-hint: "[cluster-name or action]"
context: fork
model: sonnet
skills:
  - aws-mcp-setup
allowed-tools:
  - mcp__sagemaker__*
  - mcp__aws-mcp__*
  - mcp__awsdocs__*
  - WebFetch
  - Bash(hyp *)
  - Bash(aws sagemaker *)
  - Bash(kubectl *)
  - Bash(aws eks *)
  - Bash(aws ec2 describe-*)
  - Bash(aws servicequotas *)
  - Bash(aws s3 *)
  - Bash(aws ssm start-session *)
  - Bash(aws sts get-caller-identity)
  - Bash(aws logs *)
  - Bash(aws iam get-role*)
  - Bash(aws iam list-*)
  - Bash(helm *)
  - Bash(pip install sagemaker-hyperpod)
hooks:
  PreToolUse:
    - matcher: Bash(aws sagemaker create-cluster*)
      command: aws sts get-caller-identity --query Account --output text
      once: true
    - matcher: Bash(hyp create*)
      command: aws sts get-caller-identity --query Account --output text
      once: true
---

# Amazon SageMaker HyperPod Expert

You are an expert in Amazon SageMaker HyperPod for provisioning resilient ML training clusters with AWS Trainium and NVIDIA GPUs.

## When This Skill Activates

- Creating HyperPod clusters (EKS or Slurm)
- Running distributed ML training jobs
- Troubleshooting cluster issues
- Checking quotas or instance availability
- User mentions: "hyperpod", "hyp", "trainium", "trn1", "distributed training"

## Detailed Guides

| Guide | Use When |
|-------|----------|
| [reference/eks-guide.md](reference/eks-guide.md) | EKS orchestration, `hyp` CLI, add-ons, Pod Identity |
| [reference/slurm-guide.md](reference/slurm-guide.md) | Slurm orchestration, lifecycle scripts, SBATCH |
| [reference/troubleshooting.md](reference/troubleshooting.md) | Error diagnosis and solutions |

---

## Orchestrator Selection

| Aspect | EKS | Slurm |
|--------|-----|-------|
| AZ Requirement | **2+ AZs required** | Single AZ OK |
| Primary Tool | `hyp` CLI | AWS CLI |
| Job Submission | PyTorchJob via `hyp create` | SBATCH scripts |
| Access Method | kubectl | SSM Session Manager |
| Best For | Kubernetes teams, container workloads | HPC teams, batch jobs |

---

## Instance Types

| Instance Type | Accelerator | Count | Use Case |
|---------------|-------------|-------|----------|
| ml.p4d.24xlarge | A100 | 8 | General training |
| ml.p4de.24xlarge | A100 (80GB) | 8 | Large models |
| ml.p5.48xlarge | H100 | 8 | Latest gen training |
| ml.trn1.32xlarge | Trainium | 16 | Cost-effective |
| ml.trn1n.32xlarge | Trainium | 16 | Higher network |

**IMPORTANT**: `ml.trn1.2xlarge` is NOT supported for HyperPod - only `ml.trn1.32xlarge`.

---

## CRITICAL: Pre-Creation Validation

**ALWAYS perform these checks BEFORE creating a cluster:**

### 1. Verify Instance Type Support

```bash
# Must say "for cluster usage" in quota name
aws service-quotas list-service-quotas \
  --service-code sagemaker --region us-east-1 \
  --query 'Quotas[?contains(QuotaName, `<INSTANCE_TYPE>`) && contains(QuotaName, `cluster`)].[QuotaName,Value]' \
  --output table
```

### 2. Check AZ Availability

```bash
aws ec2 describe-instance-type-offerings \
  --location-type availability-zone \
  --filters Name=instance-type,Values=trn1.32xlarge \
  --region us-east-1 \
  --query 'InstanceTypeOfferings[*].Location' --output text
```

### 3. For EKS: Ensure 2+ AZs in config.yaml

```yaml
availability_zone_ids:
  - use1-az6  # Primary for workers
  - use1-az4  # Secondary for EKS HA
```

### 4. Check K8s Version (EKS Only)

```
WebFetch: https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html#kubernetes-release-calendar
Prompt: What is the latest Kubernetes version in standard support?
```

### 5. Check Add-on Compatibility (EKS Only)

Before upgrading K8s versions, verify HyperPod add-ons support the target version:
```bash
aws eks describe-addon-versions --addon-name amazon-sagemaker-hyperpod-taskgovernance \
  --query 'addons[0].addonVersions[*].compatibilities[*].clusterVersion' --output text
```

**WARNING**: EKS does NOT support downgrading. Stay on a supported version if you need HyperPod add-ons.

---

## EKS Quick Start

```bash
# 1. Install CLI
pip install sagemaker-hyperpod

# 2. Initialize cluster stack
hyp init cluster-stack my-cluster
cd my-cluster

# 3. Edit config.yaml (ensure 2+ AZs!)

# 4. Validate and create
hyp validate && hyp create cluster-stack --region us-east-1

# 5. Set context
hyp set-cluster-context --cluster-name <NAME> --region us-east-1
```

## Submit Training Job (EKS)

```bash
# Option 1: Using config file (recommended)
hyp init hyp-pytorch-job my-job
cd my-job
# Edit config.yaml
hyp validate
hyp create hyp-pytorch-job

# Option 2: Command line
hyp create hyp-pytorch-job \
  --job-name my-job \
  --image <ECR-IMAGE> \
  --instance-type ml.trn1.32xlarge \
  --node-count 1 \
  --accelerators 16 \
  --accelerators-limit 16
```

## Monitor Training Job (EKS)

```bash
# List jobs
hyp list hyp-pytorch-job

# Job details
hyp describe hyp-pytorch-job --job-name <NAME>

# View logs
hyp get-logs hyp-pytorch-job --job-name <NAME> --follow

# List pods
hyp list-pods hyp-pytorch-job --job-name <NAME>

# Delete job
hyp delete hyp-pytorch-job --job-name <NAME>
```

**Full guide**: See [orchestrators/eks/job-submission.md](orchestrators/eks/job-submission.md)

---

## Slurm Quick Start

```bash
# 1. Prepare lifecycle scripts (use AWS samples)
git clone https://github.com/aws-samples/awsome-distributed-training.git
cd awsome-distributed-training/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/

# 2. Upload to S3
aws s3 cp . s3://my-bucket/lifecycle-scripts/ --recursive

# 3. Create cluster
aws sagemaker create-cluster --cluster-name my-cluster \
  --instance-groups '[...]' --vpc-config "..."

# 4. Connect via SSM
aws ssm start-session --target <instance-id>
```

**Full workflow**: See [reference/slurm-guide.md](reference/slurm-guide.md)

---

## Model Compatibility (Trainium/Inferentia)

**CRITICAL**: Verify model support before configuring Trainium jobs.

### Check Support

```
WebFetch: https://huggingface.co/docs/optimum-neuron/en/supported_architectures
Prompt: List supported model architectures for training on Trainium
```

### Currently Supported (Training)

| Architecture | Tensor Parallelism | Pipeline Parallelism |
|--------------|-------------------|---------------------|
| Llama, Llama 2, Llama 3 | Yes | Yes |
| Qwen3 | Yes | Yes |
| Granite | Yes | No |

---

## Common Errors (Quick Reference)

| Error | Cause | Solution |
|-------|-------|----------|
| `InvalidParameterException` (EKS) | Single AZ | Add 2+ AZs to config |
| `ml.trn1.2xlarge not found` | Unsupported type | Use `ml.trn1.32xlarge` |
| Training Operator pod fails | Missing Pod Identity | See [EKS guide](reference/eks-guide.md#fix-pod-identity-verification-failure) |
| `Insufficient cpu` | Full node request | Use partial resources |
| `Accelerator request != limit` | Limits mismatch | Set `accelerators_limit` = `accelerators` |
| EFA health check failed | Multi-AZ | Use single subnet with `OverrideVpcConfig` |
| Add-on not supported | K8s version | Check add-on compatibility before upgrade |

**Full troubleshooting**: See [reference/troubleshooting.md](reference/troubleshooting.md)

---

## Infrastructure Requirements

### EFA Single-AZ Requirement

For EFA-enabled instances (trn1, p4d, p5), ALL instances MUST be in the SAME AZ.

### Security Group

Must allow ALL traffic within itself:
```bash
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxx --protocol all --port -1 --source-group sg-xxx
```

### CIDR Sizing

| Orchestrator | IPs per P5 |
|--------------|------------|
| Slurm | 32 |
| EKS | 81 (includes pods) |

---

## Quota Management

```bash
# Check quota
aws service-quotas get-service-quota \
  --service-code sagemaker --quota-code L-6865522E --region us-east-1

# Request increase
aws service-quotas request-service-quota-increase \
  --service-code sagemaker --quota-code L-6865522E --desired-value 4
```

**Common codes**:
- `L-6865522E`: ml.trn1.32xlarge for cluster usage
- `L-5C4CD236`: ml.p5.48xlarge for cluster usage

---

## Diagnostic Commands

```bash
# Cluster status
aws sagemaker describe-cluster --cluster-name NAME

# List nodes
aws sagemaker list-cluster-nodes --cluster-name NAME

# CloudWatch logs
aws logs get-log-events \
  --log-group-name /aws/sagemaker/Clusters/NAME/ID \
  --log-stream-name LifecycleConfig/GROUP/INSTANCE

# EKS nodes/pods
kubectl get nodes && kubectl get pods -A
```