# Hardware Sizing Skill

## Overview

Expertise in calculating and specifying hardware requirements for local AI deployments, including GPU selection, server configuration, storage, and network planning based on workload characteristics and team size.

## Key Capabilities

- GPU selection and sizing for LLM inference
- Server configuration for AI workloads
- Storage planning for models and data
- Network bandwidth calculations
- TCO modeling for hardware investments
- Capacity planning and growth projections

## GPU Selection Guide

### NVIDIA GPU Comparison

| GPU | VRAM | FP16 TFLOPS | Bandwidth | TDP | Price (approx) | Best For |
|-----|------|-------------|-----------|-----|----------------|----------|
| RTX 4090 | 24GB | 82.6 | 1 TB/s | 450W | $1,600 | Small teams, dev |
| RTX A6000 | 48GB | 38.7 | 768 GB/s | 300W | $4,500 | Medium teams |
| A100 40GB | 40GB | 77.9 | 1.5 TB/s | 400W | $10,000 | Production |
| A100 80GB | 80GB | 77.9 | 2.0 TB/s | 400W | $15,000 | Large models |
| H100 80GB | 80GB | 267 | 3.35 TB/s | 700W | $30,000 | Maximum perf |
| L40S | 48GB | 91.6 | 864 GB/s | 350W | $8,000 | Balanced |

### Model VRAM Requirements

| Model Size | FP16 | INT8 | INT4/AWQ | Example Models |
|------------|------|------|----------|----------------|
| 7B | 14GB | 8GB | 4GB | Qwen-Next (small variant), GLM-4.6 (small variant) |
| 13B | 26GB | 14GB | 8GB | Qwen-Next (mid variant), MiniMax-M2 (mid variant) |
| 34B | 68GB | 36GB | 18GB | Qwen-Next / GLM-4.6 (large-ish variants) |
| 70B | 140GB | 75GB | 38GB | Qwen-Next / GLM-4.6 / MiniMax-M2 (largest variants) |
| 110B | 220GB | 115GB | 58GB | Frontier-scale variants (verify availability + license) |

**VRAM Formula:**

```
VRAM Required = (Parameters × Bytes per Parameter) + Context Window Overhead

- FP16: 2 bytes per parameter
- INT8: 1 byte per parameter
- INT4: 0.5 bytes per parameter
- Context overhead: ~2GB for 8K context, ~8GB for 32K context
```
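A minimal sketch of the VRAM formula above as a Python helper. The per-parameter byte counts and the flat context-overhead figures come straight from the formula block, so treat the output as a rough lower bound, not a precise requirement (real KV-cache overhead also scales with model architecture and batch size):

```python
# Rough VRAM estimate for serving an LLM, per the formula above.
# Assumption: weights dominate, and context overhead uses the flat
# ~2GB (8K) / ~8GB (32K) figures rather than a per-model KV-cache model.

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}
CONTEXT_OVERHEAD_GB = {8_192: 2.0, 32_768: 8.0}

def estimate_vram_gb(params_billions: float, precision: str = "fp16",
                     context_window: int = 8_192) -> float:
    """Return an approximate VRAM requirement in GB."""
    weights_gb = params_billions * BYTES_PER_PARAM[precision]
    overhead_gb = CONTEXT_OVERHEAD_GB.get(context_window, 2.0)
    return weights_gb + overhead_gb

if __name__ == "__main__":
    # 70B model at INT4 with a 32K window: 70 × 0.5 + 8 ≈ 43 GB
    print(f"{estimate_vram_gb(70, 'int4', 32_768):.0f} GB")
```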
### GPU Sizing by Team Size

| Team Size | Usage Level | Model Size | Recommended GPU | Quantity |
|-----------|-------------|------------|-----------------|----------|
| 1-5 | Dev/Test | 7B-13B | RTX 4090 | 1 |
| 5-15 | Production | 13B-34B | RTX 4090 or A6000 | 1-2 |
| 15-30 | Production | 34B-70B | A100 40GB | 2 |
| 30-75 | Production | 70B | A100 80GB | 2-4 |
| 75-150 | Enterprise | 70B+ | H100 or A100 | 4-8 |
| 150+ | Enterprise | 70B+ | H100 cluster | 8+ |

## Server Configuration Templates

### Small Team Server (5-15 developers)

```yaml
# Small team AI server specification
server:
  type: Tower or 2U Rack
  cpu:
    model: AMD EPYC 7343 or Intel Xeon Gold 5315Y
    cores: 16
    threads: 32
  memory:
    type: DDR4-3200 ECC
    capacity: 128GB
    channels: 8
  gpu:
    model: NVIDIA RTX 4090
    count: 1-2
    vram_total: 24-48GB
    nvlink: false
  storage:
    system:
      type: NVMe SSD
      capacity: 500GB
      raid: None
    models:
      type: NVMe SSD
      capacity: 2TB
      raid: None
    logs:
      type: SATA SSD
      capacity: 2TB
      raid: 1
  network:
    type: 10GbE
    ports: 2
    bonding: Active/Standby
  power:
    psu: 1200W
    redundancy: Single (N)
    ups: Recommended
  estimated_cost:
    hardware: $10,000 - $15,000
    annual_power: $1,500
    annual_maintenance: $1,000
```

### Medium Team Server (15-50 developers)

```yaml
# Medium team AI server specification
server:
  type: 2U Rack Mount
  cpu:
    model: AMD EPYC 7543 or Intel Xeon Platinum 8358
    cores: 32
    threads: 64
  memory:
    type: DDR4-3200 ECC
    capacity: 256GB
    channels: 8
  gpu:
    model: NVIDIA A6000 or RTX 4090
    count: 2-4
    vram_total: 96-192GB
    nvlink: Recommended for A6000
  storage:
    system:
      type: NVMe SSD
      capacity: 1TB
      raid: 1
    models:
      type: NVMe SSD
      capacity: 4TB
      raid: 0
    logs:
      type: SAS SSD
      capacity: 4TB
      raid: 10
  network:
    type: 25GbE
    ports: 2
    bonding: LACP
  power:
    psu: 2000W
    redundancy: Redundant (N+1)
    ups: Required
  estimated_cost:
    hardware: $35,000 - $60,000
    annual_power: $4,000
    annual_maintenance: $3,000
```

### Enterprise Server (50-200 developers)

```yaml
# Enterprise AI server specification
server:
  type: 4U Rack Mount or DGX-style
  cpu:
    model: 2x AMD EPYC 9354 or Intel Xeon Platinum 8480+
    cores: 64 total
    threads: 128
  memory:
    type: DDR5-4800 ECC
    capacity: 512GB - 1TB
    channels: 12-16
  gpu:
    model: NVIDIA A100 80GB or H100
    count: 4-8
    vram_total: 320-640GB
    nvlink: Required (NVSwitch for 8+ GPUs)
  storage:
    system:
      type: NVMe SSD
      capacity: 2TB
      raid: 1
    models:
      type: NVMe SSD
      capacity: 8TB
      raid: 0 or 10
    logs:
      type: NVMe SSD
      capacity: 8TB
      raid: 10
    backup:
      type: SAS HDD
      capacity: 32TB
      raid: 6
  network:
    type: 100GbE or InfiniBand
    ports: 2-4
    bonding: LACP
  power:
    psu: 3000W+
    redundancy: Redundant (N+N)
    ups: Required with generator backup
  estimated_cost:
    hardware: $150,000 - $400,000
    annual_power: $15,000 - $30,000
    annual_maintenance: $10,000 - $20,000
```

## Capacity Planning

### Request Volume Estimation

| Developer Usage | Requests/Day | Tokens/Request | Daily Tokens |
|-----------------|--------------|----------------|--------------|
| Light (occasional) | 20-30 | 2,000 | 40K-60K |
| Medium (regular) | 50-100 | 3,000 | 150K-300K |
| Heavy (power user) | 150-250 | 4,000 | 600K-1M |
| Intensive (AI-first) | 300-500 | 5,000 | 1.5M-2.5M |

### Throughput Calculation

```
# Calculate required throughput
Daily Requests = Team Size × Requests per User per Day
Peak Factor = 0.1 (10% of daily load falls in the peak hour)
Peak Requests per Minute = (Daily Requests × Peak Factor) / 60
Tokens per Request = Avg Input Tokens + Avg Output Tokens
Peak Tokens per Second = Peak Requests per Minute × Tokens per Request / 60

# Example: 50 medium-usage developers
Daily Requests = 50 × 100 = 5,000
Peak Requests/min = 5,000 × 0.1 / 60 = 8.3
Tokens/Request = 2,000 + 1,000 = 3,000
Peak Tokens/sec = 8.3 × 3,000 / 60 = 415 tok/s
```

### GPU Throughput Reference

| GPU | Model Size | Throughput (tok/s) | Concurrent Requests |
|-----|------------|-------------------|---------------------|
| RTX 4090 | 7B | 100-150 | 8-12 |
| RTX 4090 | 13B | 50-80 | 4-8 |
| A100 40GB | 13B | 120-180 | 16-24 |
| A100 40GB | 34B | 60-100 | 8-16 |
| A100 80GB | 70B | 40-70 | 4-8 |
| H100 80GB | 70B | 100-150 | 8-16 |
| 2x A100 80GB | 70B (TP=2) | 80-140 | 8-16 |

### Sizing Formula

```
Required GPUs = Peak Tokens/sec / Single GPU Throughput × Safety Factor
Safety Factor = 1.3 (30% headroom for spikes)

# Example: 415 tok/s needed for 70B model
Single A100 80GB throughput = 55 tok/s average
Required GPUs = 415 / 55 × 1.3 = 9.8 → 10 A100 80GB

# OR with 2-GPU tensor parallel:
TP=2 throughput = 110 tok/s
Required TP pairs = 415 / 110 × 1.3 = 4.9 → 5 pairs (10 GPUs)
```
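The throughput and sizing formulas above combine into one short calculator. This is a sketch assuming the same peak factor, safety factor, and per-GPU throughput figures used in the examples; all of those defaults are illustrative planning numbers, not benchmarks:

```python
import math

# End-to-end capacity sizing, following the two formula blocks above.
# Defaults (peak factor, safety factor, per-GPU tok/s) are the
# illustrative figures from this guide; replace with measured values.

def peak_tokens_per_second(team_size: int, requests_per_user_day: int,
                           tokens_per_request: int,
                           peak_factor: float = 0.1) -> float:
    """Peak tok/s, assuming peak_factor of daily load lands in one hour."""
    daily_requests = team_size * requests_per_user_day
    peak_requests_per_min = daily_requests * peak_factor / 60
    return peak_requests_per_min * tokens_per_request / 60

def required_gpus(peak_tok_s: float, gpu_tok_s: float,
                  safety_factor: float = 1.3) -> int:
    """GPU count with headroom, rounded up to whole devices."""
    return math.ceil(peak_tok_s / gpu_tok_s * safety_factor)

if __name__ == "__main__":
    # 50 medium-usage developers, 70B model on A100 80GB (~55 tok/s avg)
    peak = peak_tokens_per_second(50, 100, 3_000)  # ≈ 415 tok/s
    print(required_gpus(peak, 55))                 # → 10
```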
## Storage Planning

### Model Storage Requirements

| Model Size | Weights (FP16) | Weights (INT4) | With Tokenizer |
|------------|----------------|----------------|----------------|
| 7B | 14GB | 4GB | +500MB |
| 13B | 26GB | 7GB | +500MB |
| 34B | 68GB | 18GB | +500MB |
| 70B | 140GB | 38GB | +500MB |
| 100B+ | 200GB+ | 50GB+ | +1GB |

### Storage Architecture

```yaml
storage_tiers:
  tier1_hot:  # Active models
    type: NVMe SSD
    iops: 500K+
    latency: <0.1ms
    purpose: Currently loaded models, active inference
    sizing: 2x largest model size
  tier2_warm:  # Standby models
    type: SATA SSD or NVMe
    iops: 50K+
    latency: <1ms
    purpose: Quick-loading alternate models
    sizing: 5-10x model sizes for model library
  tier3_cold:  # Archives
    type: HDD or object storage
    purpose: Model version history, backups
    sizing: 3x warm storage for versioning

log_storage:
  type: SSD (fast write)
  sizing: |
    Daily logs = Requests/day × 2KB average
    Monthly = Daily × 30
    Retention storage = Monthly × Retention months
```

## Network Planning

### Bandwidth Requirements

| Component | Traffic Type | Bandwidth Need |
|-----------|--------------|----------------|
| API Requests | Client → Server | 1-10 Mbps per concurrent user |
| Responses | Server → Client | 5-50 Mbps per concurrent user |
| Model Loading | Storage → GPU | 10+ Gbps (reduces load time) |
| Monitoring | Server → Collector | 10-100 Mbps |
| Replication | Server → Backup | Varies by backup frequency |

### Network Architecture

```
┌───────────────────────────────────────────┐
│             Corporate Network             │
│                 (10 GbE)                  │
└─────────────────────┬─────────────────────┘
                      │
┌─────────────────────┴─────────────────────┐
│               Load Balancer               │
│            (25-100 GbE uplink)            │
└─────────────────────┬─────────────────────┘
                      │
         ┌────────────┴────────────┐
         │                         │
    ┌────┴────┐               ┌────┴────┐
    │ AI Node │ ◄──(25 GbE)──►│ AI Node │
    │   #1    │               │   #2    │
    └────┬────┘               └────┬────┘
         │                         │
         └────────────┬────────────┘
                      │
             ┌────────┴────────┐
             │  Storage Array  │
             │    (100 GbE)    │
             └─────────────────┘
```

## TCO Calculation

### Hardware TCO Template

```markdown
## 3-Year Total Cost of Ownership

### Capital Expenditure (CapEx)

| Item | Unit Cost | Quantity | Total |
|------|-----------|----------|-------|
| Server (compute) | $15,000 | 2 | $30,000 |
| GPUs (A100 80GB) | $15,000 | 4 | $60,000 |
| Storage (NVMe) | $500/TB | 8TB | $4,000 |
| Network equipment | $5,000 | 1 | $5,000 |
| Installation | $2,000 | 1 | $2,000 |
| **CapEx Total** | | | **$101,000** |

### Operating Expenses (OpEx) - Annual

| Item | Monthly | Annual |
|------|---------|--------|
| Power (3kW average) | $400 | $4,800 |
| Cooling | $100 | $1,200 |
| Maintenance/support | $500 | $6,000 |
| Hosting/colocation | $1,000 | $12,000 |
| Admin labor (0.25 FTE) | $2,500 | $30,000 |
| **Annual OpEx** | **$4,500** | **$54,000** |

### 3-Year TCO

| Year | CapEx | OpEx | Cumulative |
|------|-------|------|------------|
| Year 1 | $101,000 | $54,000 | $155,000 |
| Year 2 | $0 | $54,000 | $209,000 |
| Year 3 | $0 | $54,000 | $263,000 |

### Per-Request Cost (at 500K requests/month)

- Year 1: $155,000 / 6M requests = $0.026/request
- Year 3: $263,000 / 18M requests = $0.015/request (amortized)
```

### Cloud API Cost Comparison

```markdown
## Local vs Cloud Cost Comparison

### Assumptions
- 50 developers, medium usage
- 100 requests/dev/day = 5,000 requests/day
- 3,000 tokens/request average (1,000 input + 2,000 output)
- 15M tokens/day = 450M tokens/month (150M input, 300M output)

### Cloud API Costs (GPT-4o-mini pricing)
- Input: $0.15/1M tokens × 150M = $22.50/month
- Output: $0.60/1M tokens × 300M = $180/month
- Total: ~$200/month = ~$2,400/year

### Cloud API Costs (GPT-4o pricing)
- Input: $2.50/1M tokens × 150M = $375/month
- Output: $10.00/1M tokens × 300M = $3,000/month
- Total: ~$3,375/month = $40,500/year

### Local Large Model (Qwen-Next / MiniMax-M2 / GLM-4.6)
- Year 1 TCO: $155,000
- Equivalent cloud cost (GPT-4o): $40,500/year
- Breakeven: $155,000 / $40,500 ≈ 3.8 years of GPT-4o spend to cover the
  year-1 TCO — note this ignores ongoing local OpEx ($54,000/year), so at
  this usage level local does not win on cost alone

### With Data Sovereignty Premium
If data can't go to the cloud, local is the only option.
Value of data sovereignty: Priceless / Required
```
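A sketch of the comparison arithmetic above as a small Python helper. The dollar amounts are the illustrative figures from this section, not vendor quotes, and the calculation deliberately includes ongoing local OpEx, which simple breakeven quotes often omit:

```python
# Local-vs-cloud cumulative cost over a planning horizon, using the
# illustrative figures above (all amounts in USD, not quotes).

def cumulative_local(capex: float, annual_opex: float, years: int) -> float:
    """Total local cost: one-time CapEx plus OpEx every year."""
    return capex + annual_opex * years

def cumulative_cloud(annual_api_cost: float, years: int) -> float:
    """Total cloud cost: flat annual API spend."""
    return annual_api_cost * years

if __name__ == "__main__":
    CAPEX, OPEX = 101_000, 54_000   # from the TCO template above
    CLOUD = 40_500                  # GPT-4o-scale annual spend
    for year in range(1, 6):
        local = cumulative_local(CAPEX, OPEX, year)
        cloud = cumulative_cloud(CLOUD, year)
        print(f"Year {year}: local ${local:,.0f} vs cloud ${cloud:,.0f}")
    # At these rates local OpEx alone exceeds the cloud bill, so the
    # decision rests on sovereignty, latency, and expected usage growth.
```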
## Scaling Strategy

### Horizontal Scaling Triggers

| Metric | Add Capacity When | Scale Strategy |
|--------|-------------------|----------------|
| GPU Utilization | >80% sustained | Add GPU or node |
| Queue Depth | >10 requests sustained | Add replica |
| P95 Latency | >5s sustained | Add GPU for parallelism |
| Memory Pressure | >90% VRAM | Larger GPU or quantization |

These triggers are encoded as a simple check in the sketch at the end of this document.

### Vertical Scaling Path

```
Stage 1: Single RTX 4090 (24GB)
   ↓ Need more VRAM
Stage 2: Single A6000 (48GB)
   ↓ Need more throughput
Stage 3: 2x A6000 with tensor parallel
   ↓ Need larger models
Stage 4: 2x A100 80GB
   ↓ Need more throughput
Stage 5: 4x A100 80GB with NVLink
   ↓ Need maximum performance
Stage 6: 8x H100 with NVSwitch
```

## Best Practices

### Procurement
1. **Budget 20% contingency** for unexpected needs
2. **Test before bulk purchase** with a single unit
3. **Consider used enterprise GPUs** (A100s at ~50% of list price)
4. **Plan for a 3-year lifecycle** (hardware depreciation)
5. **Include installation and training** in the budget

### Deployment
1. **Start small, scale up** - validate before expanding
2. **Keep 30% headroom** for traffic spikes
3. **Plan the upgrade path** before initial deployment
4. **Document all specifications** for future reference

### Monitoring
1. **Track utilization trends** weekly
2. **Plan capacity 3-6 months ahead**
3. **Review TCO quarterly** against cloud alternatives
4. **Update sizing models** with actual usage data

This skill ensures organizations size hardware appropriately for their AI workloads, optimizing for both performance and cost-effectiveness.
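Finally, a minimal sketch of the horizontal scaling triggers above as a programmatic check. The metric names and thresholds mirror the table; wiring them to a real metrics source (Prometheus, nvidia-smi, etc.) is intentionally left out, so the inputs here are plain numbers:

```python
from dataclasses import dataclass

# Encodes the "Horizontal Scaling Triggers" table. Thresholds are the
# illustrative values from the table; tune them to your own SLOs.

@dataclass
class ClusterMetrics:
    gpu_utilization: float     # 0.0-1.0, sustained
    queue_depth: int           # sustained request backlog
    p95_latency_s: float       # seconds
    vram_used_fraction: float  # 0.0-1.0

def scaling_actions(m: ClusterMetrics) -> list[str]:
    """Return the scale-out actions whose triggers currently fire."""
    actions = []
    if m.gpu_utilization > 0.80:
        actions.append("GPU util >80%: add GPU or node")
    if m.queue_depth > 10:
        actions.append("Queue depth >10: add replica")
    if m.p95_latency_s > 5.0:
        actions.append("P95 latency >5s: add GPU for parallelism")
    if m.vram_used_fraction > 0.90:
        actions.append("VRAM >90%: larger GPU or quantization")
    return actions

if __name__ == "__main__":
    # GPU-utilization and queue-depth triggers fire in this example
    print(scaling_actions(ClusterMetrics(0.85, 12, 3.2, 0.70)))
```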