---
name: qlora
description: Memory-efficient fine-tuning with 4-bit quantization and LoRA adapters. Use when fine-tuning large models (7B+) on consumer GPUs, when VRAM is limited, or when standard LoRA still exceeds memory. Builds on the lora skill.
---

# QLoRA: Quantized Low-Rank Adaptation

QLoRA enables fine-tuning of large language models on consumer GPUs by combining 4-bit quantization with LoRA adapters. A 65B model can be fine-tuned on a single 48 GB GPU while matching 16-bit fine-tuning performance.

> **Prerequisites**: This skill assumes familiarity with LoRA. See the `lora` skill for LoRA fundamentals (`LoraConfig`, `target_modules`, training patterns).

## Table of Contents

- [Core Innovations](#core-innovations)
- [BitsAndBytesConfig Deep Dive](#bitsandbytesconfig-deep-dive)
- [Memory Requirements](#memory-requirements)
- [Complete Training Example](#complete-training-example)
- [Inference and Merging](#inference-and-merging)
- [Troubleshooting](#troubleshooting)
- [Best Practices](#best-practices)

## Core Innovations

QLoRA introduces three techniques that reduce memory usage without sacrificing performance.

### 4-bit NormalFloat (NF4)

NF4 is an information-theoretically optimal quantization data type for normally distributed values. Trained neural network weights are typically close to normally distributed, which makes NF4 a better fit than a standard 4-bit float.

```
Storage: 4-bit NF4 (quantized weights)
Compute: 16-bit BF16 (dequantized for forward/backward pass)
```

The key insight: weights are stored in 4-bit but dequantized to bf16 for computation. Only the frozen base model is quantized; the LoRA adapters remain in full precision.

**NF4 vs FP4:**

| Quantization | Description | Use Case |
|--------------|-------------|----------|
| `nf4` | 4-bit NormalFloat, optimal for normally distributed weights | Default, recommended |
| `fp4` | Standard 4-bit float | Legacy, rarely needed |

### Double Quantization

Standard quantization requires storing a scaling constant (typically fp32) for each quantization block. Double quantization quantizes these constants too:

```
First quantization:  weights → 4-bit + fp32 scaling constants
Double quantization: scaling constants → 8-bit + fp32 second-level constants
```

This saves approximately **0.37 bits per parameter**, which adds up for billion-parameter models:

- 7B model: ~325 MB savings
- 70B model: ~3.2 GB savings

### Paged Optimizers

During training, gradient checkpointing can cause memory spikes when processing long sequences. Paged optimizers use NVIDIA unified memory to automatically transfer optimizer states between GPU and CPU:

```
Normal training:  OOM on memory spike
Paged optimizers: GPU ↔ CPU transfer handles spikes gracefully
```

Paging is handled automatically by bitsandbytes once you select a paged optimizer (e.g. `optim="paged_adamw_8bit"` in the training arguments).
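As a sanity check on the double-quantization numbers above, here is a back-of-envelope sketch. The block sizes are assumptions taken from the QLoRA paper (64 weights per first-level block, 256 first-level constants per second-level block); bitsandbytes may use different defaults.

```python
# Bits of quantization-constant overhead per weight, under the assumed
# block sizes (64 weights per fp32 scale; 256 scales per second-level block).
def overhead_bits_per_param(double_quant: bool) -> float:
    if not double_quant:
        return 32 / 64                   # one fp32 scale per 64 weights
    return 8 / 64 + 32 / (64 * 256)      # 8-bit scales + fp32 second-level constants

saved = overhead_bits_per_param(False) - overhead_bits_per_param(True)
print(f"~{saved:.3f} bits/param saved")  # ~0.373

for n_params in (7e9, 70e9):
    print(f"{n_params / 1e9:.0f}B model: ~{n_params * saved / 8 / 1e6:.0f} MB saved")
# 7B model: ~326 MB saved
# 70B model: ~3264 MB saved
```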
## BitsAndBytesConfig Deep Dive

### All Parameters Explained

```python
from transformers import BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    # Core 4-bit settings
    load_in_4bit=True,                      # Enable 4-bit quantization
    bnb_4bit_quant_type="nf4",              # "nf4" (recommended) or "fp4"

    # Double quantization
    bnb_4bit_use_double_quant=True,         # Quantize the quantization constants

    # Compute precision
    bnb_4bit_compute_dtype=torch.bfloat16,  # Dequantize to this dtype for compute

    # Optional: storage dtype (usually auto-detected)
    bnb_4bit_quant_storage=torch.uint8,     # Storage dtype for quantized weights
)
```

### Compute Dtype Selection

| Dtype | Hardware | Notes |
|-------|----------|-------|
| `torch.bfloat16` | Ampere+ (RTX 30xx, A100) | Recommended, faster |
| `torch.float16` | Older GPUs (V100, RTX 20xx) | Use if bf16 is not supported |
| `torch.float32` | Any | Slower, only for debugging |

Check bf16 support:

```python
import torch
print(torch.cuda.is_bf16_supported())  # True on Ampere+
```

### Comparison: Quantization Options

```python
# Recommended: NF4 + double quant + bf16
optimal_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# For GPUs without bf16 support (fp16 and bf16 are both 16-bit, so memory
# usage is the same; choose based on hardware support)
fp16_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

# 8-bit alternative (less compression, sometimes more stable)
eight_bit_config = BitsAndBytesConfig(
    load_in_8bit=True,
)
```

## Memory Requirements

| Model Size | Full Fine-tuning | LoRA (16-bit) | QLoRA (4-bit) |
|------------|------------------|---------------|---------------|
| 7B | ~60 GB | ~16 GB | ~6 GB |
| 13B | ~104 GB | ~28 GB | ~10 GB |
| 34B | ~272 GB | ~75 GB | ~20 GB |
| 70B | ~560 GB | ~160 GB | ~48 GB |

**Notes:**

- QLoRA memory includes model + optimizer states + activations
- Actual usage varies with batch size, sequence length, and gradient checkpointing
- Add ~20% buffer for safe operation

### GPU Recommendations

| GPU VRAM | Max Model Size (QLoRA) |
|----------|------------------------|
| 8 GB | 7B (tight) |
| 16 GB | 7-13B |
| 24 GB | 13-34B |
| 48 GB | 34-70B |
| 80 GB | 70B+ comfortably |

## Complete Training Example

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
import torch

# 1. Quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# 2. Load quantized model
model_name = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",  # Optional: faster attention
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# 3. Prepare for k-bit training (critical step!)
model = prepare_model_for_kbit_training(model)
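# (Brief summary of what prepare_model_for_kbit_training does; this describes
# common peft behavior, not an exhaustive spec: it freezes the quantized base
# weights, upcasts the remaining non-quantized parameters such as layer norms
# to fp32 for stability, and enables input gradients so gradient checkpointing
# works with frozen embeddings.)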

# 4. LoRA config (see lora skill for parameter details)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# 5. Dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train[:1000]")

def format_example(example):
    if example["input"]:
        return {"text": f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"}
    return {"text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"}

dataset = dataset.map(format_example)

# 6. Training
sft_config = SFTConfig(
    output_dir="./qlora-output",
    max_seq_length=512,
    dataset_text_field="text",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
    save_steps=100,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    optim="paged_adamw_8bit",  # Paged optimizer for memory efficiency
)

trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=dataset,
    processing_class=tokenizer,
)

trainer.train()

# 7. Save adapter
model.save_pretrained("./qlora-adapter")
tokenizer.save_pretrained("./qlora-adapter")
```

## Inference and Merging

### Inference with Quantized Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

model_name = "meta-llama/Llama-3.1-8B"

# Load quantized base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load adapter
model = PeftModel.from_pretrained(base_model, "./qlora-adapter")
model.eval()

# Generate
inputs = tokenizer(
    "### Instruction:\nExplain quantum computing.\n\n### Response:\n",
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Merging to Full Precision

To merge QLoRA adapters into a full-precision model (for deployment without bitsandbytes):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel
import torch

# Load base model in full precision (on CPU to avoid GPU OOM)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    device_map="cpu",
)

# Load adapter
model = PeftModel.from_pretrained(base_model, "./qlora-adapter")

# Merge and unload
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("./merged-model")
```

**Note**: Merging requires enough RAM to hold the full-precision model. For 70B models, this means ~140 GB of RAM.
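Once merged, the checkpoint loads like any standard Transformers model and no longer needs bitsandbytes at all. A minimal smoke test, assuming the paths from the examples above and reloading the tokenizer from the base model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the merged checkpoint with no quantization config
model = AutoModelForCausalLM.from_pretrained(
    "./merged-model",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

prompt = "### Instruction:\nSay hello.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```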
## Troubleshooting

### CUDA Version Issues

```bash
# Check CUDA version
nvcc --version
python -c "import torch; print(torch.version.cuda)"

# bitsandbytes requires CUDA 11.7+
# If the versions mismatch, reinstall:
pip uninstall bitsandbytes
pip install bitsandbytes --upgrade
```

### "cannot find libcudart" or Missing Library Errors

```bash
# Find CUDA installation
find /usr -name "libcudart*" 2>/dev/null

# Set environment variable
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# Or for conda:
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
```

### Slow Training

Common cause: compute dtype mismatch.

```python
# Check whether the trainable parameters use the expected dtype
for name, param in model.named_parameters():
    if param.requires_grad:
        print(f"{name}: {param.dtype}")
        break  # All LoRA params should match

# Ensure bf16=True in the training args if BitsAndBytesConfig uses bf16;
# a mismatch causes constant dtype conversions
```

### Out of Memory

```python
# 1. Enable gradient checkpointing
model.gradient_checkpointing_enable()

# 2. Reduce batch size, increase accumulation
per_device_train_batch_size = 1
gradient_accumulation_steps = 16

# 3. Use a paged optimizer
optim = "paged_adamw_8bit"

# 4. Reduce sequence length
max_seq_length = 256

# 5. Target fewer modules
target_modules = ["q_proj", "v_proj"]  # Minimal set
```

### Model Loads But Training Fails

```python
# Ensure prepare_model_for_kbit_training is called
from peft import prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model)  # Don't skip this!

# Enable input gradients if needed
model.enable_input_require_grads()
```

## Best Practices

1. **Always use `prepare_model_for_kbit_training`**: This enables gradient computation through the frozen quantized layers
2. **Match compute dtype with training precision**: If `bnb_4bit_compute_dtype=torch.bfloat16`, use `bf16=True` in the training args
3. **Use paged optimizers for large models**: `optim="paged_adamw_8bit"` or `"paged_adamw_32bit"` handles memory spikes
4. **Start with NF4 + double quantization**: This is the recommended default; only change it when debugging
5. **Gradient checkpointing is essential**: Always enable it for QLoRA training to fit larger batch sizes
6. **Test inference before long training runs**: Load the model and generate a few tokens to catch configuration issues early
7. **Monitor GPU memory**: Use `nvidia-smi` or `torch.cuda.memory_summary()` to track actual usage (see the sketch after this list)
8. **Consider 8-bit for unstable training**: If 4-bit training shows instability, try `load_in_8bit=True` as a middle ground
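For best practice 7, a small helper makes it easy to spot-check memory at key points. This is a sketch; the function name is illustrative, but the `torch.cuda` calls are standard PyTorch:

```python
import torch

def log_gpu_memory(tag: str = "") -> None:
    """Print allocated, reserved, and peak GPU memory in GB."""
    if not torch.cuda.is_available():
        return
    gb = 1e9
    print(
        f"[{tag}] allocated={torch.cuda.memory_allocated() / gb:.2f} GB  "
        f"reserved={torch.cuda.memory_reserved() / gb:.2f} GB  "
        f"peak={torch.cuda.max_memory_allocated() / gb:.2f} GB"
    )

log_gpu_memory("after model load")
# ... run a few training steps ...
log_gpu_memory("after training steps")  # peak shows the high-water mark
```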