--- name: peft description: | Parameter-efficient fine-tuning with LoRA and Unsloth. Covers LoraConfig, target module selection, QLoRA for 4-bit training, adapter merging, and Unsloth optimizations for 2x faster training. --- # Parameter-Efficient Fine-Tuning (PEFT) ## Overview PEFT methods like LoRA train only a small number of adapter parameters instead of the full model, reducing memory by 10-100x while maintaining quality. ## Quick Reference | Method | Memory | Speed | Quality | |--------|--------|-------|---------| | Full Fine-tune | High | Slow | Best | | LoRA | Low | Fast | Very Good | | QLoRA | Very Low | Fast | Good | | Unsloth | Very Low | 2x Faster | Good | ## LoRA Concepts ### How LoRA Works ``` Original weight matrix W (frozen): d x k LoRA adapters A and B: d x r, r x k (where r << min(d,k)) Forward pass: output = x @ W + x @ A @ B * (alpha / r) Trainable params: 2 * r * d (instead of d * k) ``` ### Memory Savings ```python def lora_savings(d, k, r): original = d * k lora = 2 * r * max(d, k) reduction = (1 - lora / original) * 100 return reduction # Example: 4096 x 4096 matrix with rank 8 print(f"Memory reduction: {lora_savings(4096, 4096, 8):.1f}%") # Output: ~99.6% reduction ``` ## Basic LoRA Setup ### Configure LoRA ```python from peft import LoraConfig, get_peft_model, TaskType lora_config = LoraConfig( r=8, # Rank (capacity) lora_alpha=16, # Scaling factor target_modules=["q_proj", "v_proj"], # Which layers lora_dropout=0.05, # Regularization bias="none", # Don't train biases task_type=TaskType.CAUSAL_LM # Task type ) ``` ### Apply to Model ```python from transformers import AutoModelForCausalLM model = AutoModelForCausalLM.from_pretrained( "TinyLlama/TinyLlama-1.1B-Chat-v1.0", device_map="auto" ) model = get_peft_model(model, lora_config) # Check trainable parameters model.print_trainable_parameters() # Output: trainable params: 4,194,304 || all params: 1,100,048,384 || trainable%: 0.38% ``` ## LoRA Parameters ### Key Parameters | Parameter | Values | Effect | |-----------|--------|--------| | `r` | 4, 8, 16, 32 | Adapter capacity | | `lora_alpha` | r to 2*r | Scaling (higher = stronger) | | `target_modules` | List | Which layers to adapt | | `lora_dropout` | 0.0-0.1 | Regularization | ### Target Modules ```python # Common target modules for different models # LLaMA / Mistral / TinyLlama target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"] # GPT-2 target_modules = ["c_attn", "c_proj"] # BLOOM target_modules = ["query_key_value", "dense"] # All linear layers (most aggressive) target_modules = "all-linear" ``` ### Rank Selection Guide | Rank (r) | Use Case | |----------|----------| | 4 | Simple tasks, small datasets | | 8 | General purpose (recommended) | | 16 | Complex tasks, more capacity | | 32+ | Near full fine-tune quality | ## QLoRA (Quantized LoRA) ### Setup ```python from transformers import AutoModelForCausalLM, BitsAndBytesConfig from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training import torch # 4-bit quantization config quantization_config = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=True ) # Load quantized model model = AutoModelForCausalLM.from_pretrained( "TinyLlama/TinyLlama-1.1B-Chat-v1.0", quantization_config=quantization_config, device_map="auto" ) # Prepare for k-bit training (important!) model = prepare_model_for_kbit_training(model) # Add LoRA adapters lora_config = LoraConfig( r=8, lora_alpha=16, target_modules=["q_proj", "v_proj", "k_proj", "o_proj"], lora_dropout=0.05, bias="none", task_type="CAUSAL_LM" ) model = get_peft_model(model, lora_config) model.print_trainable_parameters() ``` ## Training with PEFT ### Using SFTTrainer ```python from trl import SFTTrainer, SFTConfig from datasets import load_dataset dataset = load_dataset("timdettmers/openassistant-guanaco") sft_config = SFTConfig( output_dir="./lora_checkpoints", num_train_epochs=3, per_device_train_batch_size=4, learning_rate=2e-4, # Higher LR for LoRA logging_steps=10, save_steps=500, max_seq_length=512, gradient_accumulation_steps=4, ) trainer = SFTTrainer( model=model, args=sft_config, train_dataset=dataset["train"], tokenizer=tokenizer, dataset_text_field="text", peft_config=lora_config, # Pass LoRA config ) trainer.train() ``` ## Unsloth (2x Faster Training) ### Setup ```python from unsloth import FastLanguageModel # Load model with Unsloth optimizations model, tokenizer = FastLanguageModel.from_pretrained( "unsloth/tinyllama-chat-bnb-4bit", # Pre-quantized max_seq_length=2048, dtype=None, # Auto-detect load_in_4bit=True, ) # Add LoRA with Unsloth model = FastLanguageModel.get_peft_model( model, r=16, target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"], lora_alpha=16, lora_dropout=0, bias="none", use_gradient_checkpointing=True, random_state=42, ) ``` ### Train with Unsloth ```python from trl import SFTTrainer, SFTConfig sft_config = SFTConfig( output_dir="./unsloth_output", per_device_train_batch_size=2, gradient_accumulation_steps=4, warmup_steps=5, max_steps=100, learning_rate=2e-4, fp16=not torch.cuda.is_bf16_supported(), bf16=torch.cuda.is_bf16_supported(), logging_steps=1, optim="adamw_8bit", # Memory-efficient optimizer weight_decay=0.01, lr_scheduler_type="linear", seed=42, ) trainer = SFTTrainer( model=model, tokenizer=tokenizer, train_dataset=dataset, dataset_text_field="text", max_seq_length=2048, args=sft_config, ) trainer.train() ``` ## Save and Load Adapters ### Save Adapters Only ```python # Save just the LoRA weights (small!) model.save_pretrained("./lora_adapters") ``` ### Load Adapters ```python from peft import PeftModel base_model = AutoModelForCausalLM.from_pretrained( "TinyLlama/TinyLlama-1.1B-Chat-v1.0", device_map="auto" ) model = PeftModel.from_pretrained(base_model, "./lora_adapters") ``` ### Merge Adapters into Base Model ```python # Merge LoRA weights into base model (for deployment) merged_model = model.merge_and_unload() # Save merged model merged_model.save_pretrained("./merged_model") ``` ## Inference with Adapters ```python from peft import PeftModel # Load base + adapters base_model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0") model = PeftModel.from_pretrained(base_model, "./lora_adapters") # Generate model.eval() inputs = tokenizer("What is Python?", return_tensors="pt") with torch.no_grad(): outputs = model.generate(**inputs, max_new_tokens=100) print(tokenizer.decode(outputs[0])) ``` ## Multi-Adapter Hot-Swapping Train task-specific adapters and swap them at inference time without reloading the base model. ### Train Multiple Adapters ```python from unsloth import FastLanguageModel from trl import SFTTrainer, SFTConfig TASK_DATASETS = { "technical": technical_data, # Precise, factual responses "creative": creative_data, # Imaginative, expressive responses "code": code_data, # Code-focused analysis } for task_name, task_data in TASK_DATASETS.items(): # Load fresh model model, tokenizer = FastLanguageModel.from_pretrained( "unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit", max_seq_length=512, load_in_4bit=True, ) # Apply LoRA model = FastLanguageModel.get_peft_model( model, r=16, lora_alpha=16, target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"], ) # Train on task-specific data trainer = SFTTrainer(model=model, train_dataset=task_data, ...) trainer.train() # Save lightweight adapter (~130MB each) model.save_pretrained(f"./adapters/{task_name}") ``` ### Hot-Swap at Inference ```python from peft import PeftModel from unsloth import FastLanguageModel # Load base model ONCE base_model, tokenizer = FastLanguageModel.from_pretrained( "unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit", max_seq_length=512, load_in_4bit=True, ) def load_and_generate(adapter_path, prompt): """Load adapter and generate response.""" # Hot-swap adapter onto base model adapted_model = PeftModel.from_pretrained(base_model, adapter_path) FastLanguageModel.for_inference(adapted_model) messages = [{"role": "user", "content": prompt}] inputs = tokenizer.apply_chat_template( messages, tokenize=True, add_generation_prompt=True, return_tensors="pt" ).to(adapted_model.device) outputs = adapted_model.generate(input_ids=inputs, max_new_tokens=128) return tokenizer.decode(outputs[0], skip_special_tokens=True) # Use different adapters for different tasks technical_response = load_and_generate("./adapters/technical", "Explain TCP vs UDP") creative_response = load_and_generate("./adapters/creative", "Write a haiku about coding") code_response = load_and_generate("./adapters/code", "Explain Python decorators") ``` ### Adapter Storage Efficiency | Component | Size | |-----------|------| | Base model (4-bit) | ~8GB | | Each adapter | ~130MB | | 10 adapters total | ~1.3GB | **Multi-adapter approach**: 8GB + 1.3GB = 9.3GB total **vs 10 full models**: 80GB total ## Comparison: Full vs LoRA vs QLoRA | Aspect | Full Fine-tune | LoRA | QLoRA | |--------|----------------|------|-------| | Trainable % | 100% | ~0.1-1% | ~0.1-1% | | Memory | 4x model | ~1.2x model | ~0.5x model | | Training speed | Slow | Fast | Fast | | Quality | Best | Very Good | Good | | 7B model | 28GB+ | ~16GB | ~6GB | ## Troubleshooting ### Out of Memory **Fix:** ```python # Use gradient checkpointing model.gradient_checkpointing_enable() # Use smaller batch with accumulation per_device_train_batch_size=1 gradient_accumulation_steps=8 ``` ### Poor Quality **Fix:** - Increase `r` (rank) - Add more target modules - Train longer - Check data quality ### NaN Loss **Fix:** - Lower learning rate - Use gradient clipping - Check for data issues ## When to Use This Skill Use when: - GPU memory is limited - Fine-tuning large models (7B+) - Need fast training iterations - Want to swap adapters for different tasks ## Cross-References - `bazzite-ai-jupyter:qlora` - Advanced QLoRA experiments (alpha, rank, modules) - `bazzite-ai-jupyter:finetuning` - Full fine-tuning basics - `bazzite-ai-jupyter:quantization` - Quantization for QLoRA - `bazzite-ai-jupyter:sft` - SFT training with LoRA - `bazzite-ai-jupyter:inference` - Fast inference with adapters - `bazzite-ai-jupyter:transformers` - Target module selection