---
name: signround-v2-low-bit-quant
title: "SignRoundV2: Extremely Low-Bit Post-Training Quantization via DeltaLoss Sensitivity"
version: 0.0.2
engine: skillxiv-v0.0.2-claude-opus-4.6
license: MIT
url: https://arxiv.org/abs/2512.04746
keywords: [quantization, model-compression, low-bit, post-training, layer-wise-bit-allocation]
description: "DeltaLoss sensitivity metric combining gradient and quantization-induced parameter deviation for adaptive bit-width allocation, with lightweight pre-tuning search for scale initialization, enabling competitive accuracy at 4-5 bits in 2.5-6 hours."
---

## Summary

SignRoundV2 presents two main technical innovations for extremely low-bit post-training quantization. The primary contribution is DeltaLoss, a sensitivity metric capturing both local parameter distortions and global impact on task loss for reliable layer-wise bit allocation. The secondary contribution is a lightweight pre-tuning search for initialization, enabling stable and accurate quantization at extremely low bit-widths (2-5 bits) in minimal time.

## Core Technique

**DeltaLoss Sensitivity Metric:** Measures the importance of each layer by combining:

1. Gradient information (how sensitive the loss is to weight changes)
2. Quantization-induced deviation (how much quantization distorts parameters)

```
DeltaLoss_layer = ||∇_W L|| · ||W - Q(W)||
```

This captures both local and global impact; sensitive layers get more bits.

**Pre-Tuning Scale Search:** Before the main quantization, perform a lightweight search for the optimal quantization scale (step size) per layer:

```
scale_opt = argmin_scale MSE(Q_scale(W), W)
```

**Adaptive Bit-Width Allocation:** Use DeltaLoss to allocate bits from a budget:

```
bits_layer = base_bits + extra_bits[DeltaLoss_rank(layer)]
```

Critical layers get extra bits; less important layers get fewer.
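The formulas above and the implementation below both lean on a symmetric quantizer Q. A minimal PyTorch sketch of such a helper (the name `quantize_symmetric`, its round-to-nearest behavior, and the signed integer grid are illustrative assumptions, not details from the paper):

```python
import torch

def quantize_symmetric(x, bits):
    # Signed symmetric grid: e.g. 4 bits -> integers in [-7, 7]
    # (assumed convention; the paper does not pin down the grid)
    qmax = 2 ** (bits - 1) - 1
    # Round to nearest, then clamp values that fall outside the grid
    return torch.clamp(torch.round(x), -qmax, qmax)
```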
## Implementation

**DeltaLoss computation:**

```python
import torch

def compute_deltaloss_sensitivity(weight, gradient, quant_weight):
    # Gradient magnitude (how sensitive the loss is to this layer)
    gradient_norm = torch.norm(gradient.flatten())

    # Quantization deviation (how far quantization moves the weights)
    deviation = torch.norm(weight - quant_weight)

    # Combined sensitivity: large gradient AND large deviation => critical layer
    return gradient_norm * deviation
```

**Pre-tuning scale search:**

```python
def search_optimal_scale(weight, bits=4, num_candidates=100):
    # Full-range scale maps the largest-magnitude weight onto the top level
    qmax = 2 ** (bits - 1) - 1
    full_scale = weight.abs().max() / qmax

    # Search shrink factors of the full-range scale: clipping outliers
    # trades range error against finer rounding for the bulk of weights
    candidates = full_scale * torch.linspace(0.2, 1.0, num_candidates)

    best_scale = None
    best_mse = float('inf')
    for scale in candidates:
        # Quantize with this candidate scale and measure reconstruction error
        quant_weight = quantize_symmetric(weight / scale, bits) * scale
        mse = torch.mean((weight - quant_weight) ** 2)
        if mse < best_mse:
            best_mse = mse
            best_scale = scale
    return best_scale
```

**Adaptive bit allocation:**

```python
def allocate_bits_adaptive(layers, total_bits_budget, deltaloss_scores, min_bits=2):
    num_layers = len(layers)

    # Every layer gets a uniform floor; the remainder forms the extra-bit pool
    extra_pool = total_bits_budget - min_bits * num_layers

    # Normalize DeltaLoss scores so the pool is split proportionally
    dl_normalized = deltaloss_scores / deltaloss_scores.sum()

    bit_allocation = {}
    for layer_idx, layer in enumerate(layers):
        # Extra bits proportional to the layer's normalized sensitivity
        extra_bits = int(extra_pool * float(dl_normalized[layer_idx]))
        bit_allocation[layer] = min_bits + extra_bits
    return bit_allocation
```

**Quantization with allocated bits:**

```python
def quantize_with_allocation(model, bit_allocation):
    quantized_model = {}
    for layer, bits in bit_allocation.items():
        weight = model[layer]

        # Search the optimal scale for this layer at its allocated bit-width
        optimal_scale = search_optimal_scale(weight, bits)

        # Quantize and dequantize back to the original range
        quant_weight = quantize_symmetric(weight / optimal_scale, bits) * optimal_scale
        quantized_model[layer] = quant_weight
    return quantized_model
```

## When to Use

- Extreme model compression for mobile/edge deployment (2-5 bits)
- Scenarios where quantization must complete in hours, not days
- Applications requiring reliable bit-width allocation across layers
- Tasks where post-training quantization is preferred over QAT

## When NOT to Use

- Scenarios with abundant GPU resources for quantization-aware training (QAT)
- Applications requiring per-model custom tuning
- Tasks where 8-bit quantization is sufficient
- Real-time quantization where the search overhead matters

## Key References

- Post-training quantization and sensitivity analysis
- Layer-wise bit allocation and adaptive compression
- Low-bit quantization and extreme compression
- Quantization-aware scale selection
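## End-to-End Sketch

A minimal driver tying the pieces together, assuming layer weights are exposed as a name-to-tensor dict and per-layer gradients come from a single calibration batch (both interface assumptions for illustration, not prescribed by the paper):

```python
import torch

def signround_v2_pipeline(model, gradients, total_bits_budget):
    """Illustrative pipeline: DeltaLoss scoring -> bit allocation ->
    per-layer scale search and quantization.

    `model` maps layer names to weight tensors; `gradients` maps the same
    names to loss gradients from a calibration batch (assumed interfaces).
    """
    layers = list(model.keys())

    # 1. Score every layer with DeltaLoss, using a provisional 4-bit
    #    quantization (an illustrative choice) to measure the deviation
    scores = []
    for layer in layers:
        weight = model[layer]
        scale = search_optimal_scale(weight, bits=4)
        quant_weight = quantize_symmetric(weight / scale, 4) * scale
        scores.append(compute_deltaloss_sensitivity(
            weight, gradients[layer], quant_weight))
    scores = torch.stack(scores)

    # 2. Spend the bit budget according to sensitivity
    bit_allocation = allocate_bits_adaptive(layers, total_bits_budget, scores)

    # 3. Quantize each layer at its allocated width (the scale search is
    #    rerun inside at the final bit-width)
    return quantize_with_allocation(model, bit_allocation)
```

With, say, 32 layers and a budget of 128 bits (a 4-bit average), the allocation keeps a 2-bit floor everywhere and spends the remaining 64 bits on the layers DeltaLoss ranks as most sensitive.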