---
name: janus-moe-disaggregation
title: "Janus: Disaggregated Attention and Expert Layers for Scalable MoE Inference"
version: 0.0.2
engine: skillxiv-v0.0.2-claude-opus-4.6
license: MIT
url: https://arxiv.org/abs/2512.13525
keywords: [mixture-of-experts, MoE-inference, distributed-inference, disaggregation, throughput-optimization]
description: "Enable scalable MoE inference by disaggregating attention and expert layers onto independent GPU sub-clusters. Use adaptive two-phase communication, activation load-balanced scheduling, and activation-aware expert management. Achieve 3.9× higher per-GPU throughput than state-of-the-art systems."
---

## Skill Summary

Janus scales MoE inference with a disaggregated architecture that separates attention and MoE layers onto independent GPU sub-clusters. The system combines adaptive two-phase communication to minimize cross-cluster transfers, activation load-balanced scheduling to distribute requests intelligently, and activation-aware expert management to adjust expert replication dynamically. Results show 3.9× higher per-GPU throughput than state-of-the-art systems while maintaining latency SLOs.

## When To Use

- Deploying large sparse mixture-of-experts models where attention and expert capacity should scale independently
- Scenarios where attention and MoE layers have different performance characteristics
- Projects with GPU clusters large enough to justify disaggregation infrastructure
- Research on efficient MoE inference systems

## When NOT To Use

- Small models where disaggregation overhead exceeds its benefits
- Single-GPU or tightly coupled deployments where disaggregation creates bottlenecks
- Latency-critical applications where cross-cluster communication overhead is prohibitive
- Deployments with fixed hardware constraints that prevent disaggregation

## Core Technique

Four key components enable disaggregated MoE inference:

**1. Disaggregated Architecture**

Rather than deploying the entire MoE model as a monolithic unit, Janus manages attention and MoE layers separately. This enables fine-grained, module-specific resource scaling based on each module's distinct performance characteristics.

**2. Adaptive Two-Phase Communication**

Minimize the overhead of frequent data transfers between attention and MoE instances:

- Intra-node aggregation over NVLink consolidates activations
- Bulk inter-node transfers reduce the number of small cross-cluster messages

Adaptive routing switches between the two phases based on activation characteristics.

**3. Activation Load-Balanced Scheduling**

A lightweight GPU-kernel scheduler distributes expert-activation requests across MoE instances so as to minimize the number of concurrently active experts per GPU. This reduces per-instance load and latency with negligible overhead.

**4. Activation-Aware Expert Management**

Dynamically adjust expert replication counts and placement based on observed activation patterns. Spreading frequently co-activated experts across GPUs reduces per-instance load and improves throughput.

## Implementation Notes

Design the system as two separate clusters, one serving attention layers and one serving MoE layers. Implement adaptive two-phase communication: intra-node aggregation over NVLink, followed by bulk inter-node transfers. Build an activation load-balanced scheduler that distributes requests intelligently. Track expert co-activation patterns and adjust replication dynamically. Monitor and tune for your specific workload characteristics. Illustrative sketches of these components follow the references below.

## References

- Original paper: Janus (Dec 2025), https://arxiv.org/abs/2512.13525
- Mixture-of-experts systems and inference
- Distributed GPU systems and communication optimization
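## Appendix: Illustrative Sketches

The sketches below are not taken from the paper; they model its components in plain Python with hypothetical names, inputs, and policies, as starting points rather than a definitive implementation.

**Two-phase communication.** A minimal model of how per-token activation sends collapse into bulk messages: phase one groups sends by source node (standing in for intra-node NVLink aggregation), phase two emits one message per (source node, target instance) pair. `plan_transfers` and its input format are assumptions.

```python
# Toy model of two-phase transfer planning (hypothetical names/shapes):
# phase 1 consolidates per-token sends within each node (the NVLink step),
# phase 2 emits one bulk message per (source node, target instance) pair.
from collections import defaultdict

def plan_transfers(token_sends, gpus_per_node):
    """token_sends: list of (source_gpu, target_instance, payload_bytes).
    Returns {(source_node, target_instance): total_bytes} bulk messages."""
    bulk = defaultdict(int)
    for src_gpu, target, nbytes in token_sends:
        node = src_gpu // gpus_per_node   # phase 1: group by source node
        bulk[(node, target)] += nbytes    # phase 2: one message per pair
    return dict(bulk)

sends = [(0, 0, 4096), (1, 0, 4096), (2, 1, 4096), (8, 0, 4096)]
print(plan_transfers(sends, gpus_per_node=8))
# {(0, 0): 8192, (0, 1): 4096, (1, 0): 4096}: 4 sends become 3 messages
```

At production batch sizes, far more per-token sends fold into each bulk message, which is where the message-count savings come from.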
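**Activation load-balanced scheduling.** A greedy CPU-side stand-in for the lightweight GPU-kernel scheduler: each request, represented by the set of experts it activates, goes to the MoE instance where it would newly activate the fewest experts. `schedule` and the request encoding are assumptions.

```python
# Greedy stand-in for the activation load-balanced scheduler
# (hypothetical API; the real scheduler runs as a GPU kernel).

def schedule(requests: list[frozenset[int]], num_instances: int) -> list[int]:
    """Assign each request (its set of activated expert IDs) to the MoE
    instance where it newly activates the fewest experts, tie-breaking
    on how many experts are already active there."""
    active = [set() for _ in range(num_instances)]
    assignment = []
    for experts in requests:
        best = min(
            range(num_instances),
            key=lambda i: (len(experts - active[i]), len(active[i])),
        )
        active[best] |= experts
        assignment.append(best)
    return assignment

reqs = [frozenset(s) for s in ([0, 1], [1, 2], [3, 4], [0, 4], [2, 3])]
print(schedule(reqs, num_instances=2))  # [0, 0, 1, 1, 0]
```

The greedy objective mirrors the stated goal of minimizing concurrently active experts per GPU; a production scheduler would recompute this per batch with negligible overhead.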
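**Activation-aware expert management.** A simple frequency-proportional replication policy under a fixed replica budget, assuming the hypothetical `plan_replicas` interface. The paper additionally spreads frequently co-activated experts across GPUs; that placement step is omitted here.

```python
# Frequency-proportional replication under a fixed replica budget
# (hypothetical policy; placement of co-activated experts is omitted).
from collections import Counter

def plan_replicas(activation_log, budget, min_replicas=1):
    """activation_log: iterable of activated expert IDs observed online.
    Returns {expert_id: replica_count} summing to exactly `budget`."""
    freq = Counter(activation_log)
    plan = {e: min_replicas for e in freq}
    spare = budget - min_replicas * len(freq)
    total = sum(freq.values())
    for e in plan:                       # proportional share of spare replicas
        plan[e] += int(spare * freq[e] / total)
    leftover = budget - sum(plan.values())
    for e, _ in freq.most_common(leftover):  # rounding remainder to hot experts
        plan[e] += 1
    return plan

log = [0, 0, 0, 1, 2, 2, 3, 0, 2, 1]
print(plan_replicas(log, budget=8))  # {0: 3, 1: 1, 2: 3, 3: 1}
```

Re-running this plan periodically over a sliding activation window is one way to realize the dynamic adjustment the skill describes.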