# USAF — Ultra Sparse Adaptive Fine-Tuning [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE) [![Python](https://img.shields.io/badge/Python-3.10%2B-blue)](https://python.org) [![CUDA](https://img.shields.io/badge/CUDA-11.8%2B-green)](https://developer.nvidia.com/cuda-downloads) [![Status](https://img.shields.io/badge/Status-Beta-orange)]() Fine-tune MoE models on hardware that can barely run inference. Qwen3-30B-A3B needs 60GB in fp16. Full fine-tuning needs 120GB+. USAF trains 26M out of 4.8B parameters on a 12GB GPU — the only method that works on AMD and the only one that trains expert weights and the router. --- ## Why This Exists I don't have an A100, an H100, or even an RTX 4090. I have a Radeon RX 6750 XT with 12GB. On Windows. Every existing fine-tuning method either won't load on this hardware or won't touch the parts of MoE models that actually matter. So I built something that does both. ## Comparison Qwen3-30B-A3B, 180 steps. LoRA/QLoRA/DoRA numbers are estimates — no public benchmarks exist for these methods on this model at this scale. Where a method can't run, I explain why. | | USAF | LoRA | QLoRA | DoRA | Full FT | |---|---|---|---|---|---| | **Runs on 12GB** | Yes | No | No | No | No | | **Runs on 24GB** | Yes | No | Maybe | No | No | | **Runs on AMD** | Yes | No | No | No | No | | **Min VRAM (NVIDIA)** | 12GB | ~60GB | ~24GB | ~60GB | ~120GB | | **Trains expert weights** | Yes | No | No | No | Yes | | **Trains router** | Yes | No | No | No | Yes | | **Time (RX 6750 XT)** | 7.8h | Won't load | Won't load | Won't load | Won't load | | **Time (A100)** | ~20min | ~8min | ~15min | ~10min | ~40min | | **In-domain PPL** | 2.76 | ~2.80 | ~2.90 | ~2.78 | ~2.60 | LoRA and QLoRA train adapter matrices on frozen weights. USAF trains the actual expert weights and router — it just picks which ones matter. For MoE models, the gate determines model behavior more than any single expert weight. ### Why USAF Takes Longer on Big GPUs On an A100, USAF is slower per-step because it does more work: | Operation | USAF | LoRA | |---|---|---| | Forward pass | ~3ms/layer (same) | ~3ms/layer | | Backward | ~30ms/layer (26M params) | ~0.5ms/layer (100K params) | | RigL dense pass (every 50 steps) | ~60s each | N/A | | Optimizer | SparseAdam (26M) | AdamW (100K) | USAF computes gradients for 26M parameters per step vs ~100K for LoRA — **260× more gradient work**. That it's only 2-3× slower is the entire point of sparse training. On consumer hardware, the comparison is simpler: USAF runs. LoRA doesn't. ## Results 180 steps on Qwen3-30B-A3B, RX 6750 XT 12GB (AMD), DirectML. | Metric | Before | After | |---|---|---| | Loss | 1.43 | 1.00 (-30%) | | In-domain PPL | 2.83 | 2.76 | | Held-out PPL | 4.52 | 4.24 (-6%) | | Steps skipped (NaN) | — | 0 / 180 | Held-out repositories (Flecs, SFML, EnTT, Box2D) improved alongside training data — generalization, not memorization. ## Why Sparse Training Works for MoE **Not all weights matter.** MoE models route each token to a handful of experts. Most weights never activate for a given input. The importance phase finds the 0.5% with highest gradient magnitude. **The router is leverage.** Training the gating network (2M parameters) changes which experts fire. A single step drops loss by 0.65. Adapter methods can't touch the router. **Sparsity adapts.** RigL reselection replaces underperforming weights every 50 steps. The active set evolves — turnover starts at ~92% and drops as the model converges. **Resident caching kills the bottleneck.** 4-bit dequantization is slow on CPU (400ms per tensor). Trainable layers keep fp16 copies in RAM — dequant once, use forever. ## Quick Start ```bash pip install transformers safetensors psutil ``` ```bash # AMD GPU (DirectML) python train.py # NVIDIA GPU (CUDA) USE_CUDA=1 USE_AMP=1 python train.py # Multi-GPU USE_CUDA=1 USE_MULTI_GPU=1 MICROBATCH=4 python train.py ``` No config files. Everything via environment variables. ## Performance | Hardware | Backend | tok/s | 180 steps | |---|---|---|---| | RX 6750 XT 12GB | DirectML | 9 | 7.8h | | T4 16GB | CUDA | ~30 | ~2h | | 2× T4 16GB | CUDA | ~50 | ~1.2h | | RTX 4090 24GB | CUDA | ~80 | ~45min | *CUDA numbers are estimates pending real hardware benchmarks.* ## Supported Models Auto-detection works for any MoE model from HuggingFace — `config.json` is all it needs. Tested on Qwen3-30B-A3B. | Model Family | Tested | |---|---| | Qwen3-MoE | Yes (30B-A3B) | | Mixtral | No | | DeepSeek-MoE | No | | OLMoE | No | ## Models I Want to Test These are the models USAF was designed for. I just don't have the GPUs. | Model | Parameters | Active | Verified | Why | |---|---|---|---|---| | **DeepSeek-V4 Pro** | 1.6T | 49B | Yes | Latest DeepSeek, MIT license, Apr 2026 | | **Kimi K2.5** (Moonshot) | 1T | 32B | Yes | Native multimodal (vision+text), Feb 2026 | | **Mistral Large 3** | 675B | 41B | Yes | Apache 2.0, Dec 2025 | | **Qwen3-235B-A22B** | 235B | 22B | Yes | Same architecture as tested, 8× larger | | **Mixtral-8x22B** | 141B | 39B | Yes | Non-fused expert projections | Hardware needed: 4-8× A100 80GB or equivalent per model. If you have access and want to see USAF results on these, reach out via [GitHub Discussions](https://github.com/tsuyu122/usaf/discussions). I'll write the training code — you bring the GPUs. ## Universal CLI ```bash python -m usaf.train --model Qwen/Qwen3-30B-A3B --dataset data.jsonl --steps 180 python -m usaf.train --model mistralai/Mixtral-8x7B --dataset data.jsonl ``` ## Features | Feature | Status | |---|---| | Sparse training (0.5% active) | Production | | RigL dynamic reselection | Production | | Router co-training | Production | | 4-bit quantized weights | Production | | Resident expert caching | Production | | CUDA + AMP | Production | | Multi-GPU (DataParallel) | Production | | DirectML (AMD) | Production | | Vulkan acceleration | Experimental | | Held-out evaluation | Production | ## Hardware - GPU with 12GB+ VRAM or 32GB RAM (CPU-only) - AMD: DirectML (Windows, built-in) - NVIDIA: CUDA 11.8+ - Python 3.10+, PyTorch 2.0+ ## Using Your Own Model ### Step 1: Prepare the dataset Create a JSONL file with tokenized sequences. Each line must have `input_ids` and `labels`: ```json {"input_ids": [1, 2, 3, ..., 512], "labels": [1, 2, 3, ..., 512]} ``` To tokenize your own text with the model's tokenizer: ```python from transformers import AutoTokenizer tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-30B-A3B") text = "Your training text here..." tokens = tokenizer.encode(text) # Chunk into 512-token segments for i in range(0, len(tokens) - 512, 512): chunk = tokens[i:i+512] sample = {"input_ids": chunk, "labels": chunk[1:] + [tokenizer.eos_token_id]} # Write sample to JSONL ``` ### Step 2: Quantize the expert weights USAF needs the expert weights in 4-bit HQQ format. Currently supports Qwen3-MoE out of the box. For other models, you need to generate the `experts_q4.pt` file: ```python from usaf.quantization import quantize_4bit import torch # Load your model's expert tensors (gate_up_proj and down_proj for each layer) q_dict = {} for layer_idx in range(num_layers): for param_name in ["gate_up_proj", "down_proj"]: # Load the fused expert tensor [num_experts, intermediate, hidden] weights = load_expert_weights(model_path, layer_idx, param_name) q4_entry = quantize_4bit(weights, group_size=128) q_dict[f"model.layers.{layer_idx}.mlp.experts.{param_name}"] = q4_entry torch.save(q_dict, "my-model-q4/experts_q4.pt") ``` ### Step 3: Configure and run ```bash # Set these environment variables for your model QUANT_PATH="my-model-q4/experts_q4.pt" # Path to quantized weights TRAIN_FROM=36 # First trainable layer (keep top layers) STEPS=360 # 2 epochs for ~190K tokens FRAC=0.005 # 0.5% sparsity MICROBATCH=2 # Batch size (increase if VRAM allows) python train.py ``` ### Environment Variables Reference | Variable | Default | Description | |---|---|---| | `DATASET_PATH` | `data/train_dataset_12h.jsonl` | JSONL file with training samples | | `QUANT_PATH` | auto-detected | Path to `experts_q4.pt` | | `TRAIN_FROM` | 40 | First trainable layer (0-39 are frozen) | | `FRAC` | 0.005 | Fraction of weights to train (0.5%) | | `STEPS` | 180 | Training steps | | `MICROBATCH` | 2 | Sequences per micro-batch | | `LR_PEAK` | 2e-4 | Peak learning rate (cosine decay) | | `RESELECT_EVERY` | 50 | RigL reselection frequency | | `USE_CUDA` | 0 | Set to `1` for NVIDIA GPUs | | `USE_AMP` | 1 | Mixed precision (CUDA only) | | `USE_MULTI_GPU` | 1 | DataParallel (CUDA only) | | `FROZEN_CACHE_N` | 0 | Number of samples to cache (0=all) | ### Supported GPU Configurations | Setup | Command | |---|---| | AMD GPU (RX 6000/7000) | `python train.py` | | NVIDIA single GPU | `USE_CUDA=1 python train.py` | | NVIDIA dual GPU | `USE_CUDA=1 USE_MULTI_GPU=1 MICROBATCH=4 python train.py` | | CPU fallback | `python train.py` (automatic) | ### Troubleshooting **"CUDA out of memory"**: Reduce `MICROBATCH` to 1 or increase `TRAIN_FROM` to freeze more layers. **"No module named torch_directml"** on NVIDIA: Expected. The code auto-detects and uses CUDA. Set `USE_CUDA=1`. **Loss not decreasing**: Ensure `FRAC` is high enough (>0.001). Try 2-3 epochs with `EPOCHS=3`. Check dataset quality. **Frozen cache takes too long**: Set `FROZEN_CACHE_N=50` to only cache the first 50 samples. Or disable with `USE_FROZEN_CACHE=0`. ## Future Work - Benchmarks against LoRA/QLoRA/DoRA on A100-class hardware - Full Vulkan attention pipeline for cross-vendor acceleration - Distributed training (FSDP) - Tests on DeepSeek-V4 Pro, Kimi K2.5, Mistral Large 3 — need hardware ## License Apache 2.0. [LICENSE](LICENSE). Contributions: [CLA](CLA.md).