---
name: lora
description: Parameter-efficient fine-tuning with Low-Rank Adaptation (LoRA). Use when fine-tuning large language models with limited GPU memory, creating task-specific adapters, or when you need to train multiple specialized models from a single base.
---

# Using LoRA for Fine-tuning

LoRA (Low-Rank Adaptation) enables efficient fine-tuning by freezing the pretrained weights and injecting small trainable matrices into transformer layers. This typically cuts trainable parameters to roughly 0.1-1% of the original model while staying close to full fine-tuning quality on many tasks.

## Table of Contents

- [Core Concepts](#core-concepts)
- [Basic Setup](#basic-setup)
- [Configuration Parameters](#configuration-parameters)
- [QLoRA (Quantized LoRA)](#qlora-quantized-lora)
- [Training Patterns](#training-patterns)
- [Saving and Loading](#saving-and-loading)
- [Merging Adapters](#merging-adapters)
- [Best Practices](#best-practices)

## Core Concepts

### How LoRA Works

Instead of updating all weights during fine-tuning, LoRA expresses the weight update as a product of two low-rank matrices:

```
W' = W + BA
```

Where:

- `W` is the frozen pretrained weight matrix (d × k)
- `B` is a trainable matrix (d × r)
- `A` is a trainable matrix (r × k)
- `r` is the rank, much smaller than d and k

In practice the product `BA` is multiplied by a scaling factor `alpha / r` (see `lora_alpha` under Configuration Parameters).

The key insight: weight updates during fine-tuning have low intrinsic rank, so they can be represented efficiently with two much smaller matrices, training `r × (d + k)` values instead of `d × k`.
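To make the decomposition concrete, here is a minimal NumPy sketch of a single LoRA-adapted layer. The shapes, the `alpha` value, and the helper name `lora_forward` are illustrative assumptions rather than anything taken from PEFT; the zero initialization of `B` follows the usual LoRA convention so that the adapter starts as a no-op.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer shape, rank, and scaling numerator (illustrative values).
d, k, r = 2048, 2048, 16
alpha = 32

W = rng.standard_normal((d, k))          # frozen pretrained weight
A = rng.standard_normal((r, k)) * 0.01   # trainable, small random init
B = np.zeros((d, r))                     # trainable, zero init so BA = 0 at the start


def lora_forward(x, W, A, B, alpha, r):
    """Apply (W + (alpha / r) * B @ A) to x without forming the d x k update."""
    return W @ x + (alpha / r) * (B @ (A @ x))


x = rng.standard_normal(k)

# With B all zeros, the adapted layer reproduces the base layer exactly.
assert np.allclose(lora_forward(x, W, A, B, alpha, r), W @ x)

# Compare trainable parameter counts for this one layer.
full_params = d * k
lora_params = r * (d + k)
print(f"full: {full_params:,}  lora: {lora_params:,}  ratio: {lora_params / full_params:.2%}")
```

For this single 2048 × 2048 layer, rank 16 trains 65,536 values instead of 4,194,304, about 1.6% of the layer. The model-wide fraction is lower because only some layers receive adapters.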
### Why Use LoRA

| Aspect | Full Fine-tuning | LoRA |
|--------|------------------|------|
| Trainable params | 100% | ~0.1-1% |
| Memory usage | High | Low |
| Adapter size | Full model | ~3-100 MB |
| Training speed | Slower | Faster |
| Multiple tasks | Separate models | Swap adapters |

## Basic Setup

### Installation

```bash
pip install peft transformers accelerate
```

### Minimal Example

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
import torch

# Load base model
model_name = "meta-llama/Llama-3.2-1B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

# Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints something like:
# trainable params: 3,407,872 || all params: ~1.24B || trainable%: ~0.28
```

## Configuration Parameters

### LoraConfig Options

```python
from peft import LoraConfig, TaskType

config = LoraConfig(
    # Core parameters
    r=16,                                  # Rank of update matrices
    lora_alpha=32,                         # Scaling factor (updates are scaled by lora_alpha / r)
    target_modules=["q_proj", "v_proj"],   # Layers to adapt

    # Regularization
    lora_dropout=0.05,                     # Dropout on LoRA layers
    bias="none",                           # "none", "all", or "lora_only"

    # Task configuration
    task_type=TaskType.CAUSAL_LM,          # CAUSAL_LM, SEQ_CLS, SEQ_2_SEQ_LM, TOKEN_CLS

    # Advanced
    modules_to_save=None,                  # Additional modules to train fully (e.g., ["lm_head"])
    layers_to_transform=None,              # Specific layer indices to adapt
    use_rslora=False,                      # Rank-stabilized LoRA (scales by lora_alpha / sqrt(r))
    use_dora=False,                        # Weight-Decomposed LoRA (DoRA)
)
```

### Target Modules by Architecture

```python
# Llama, Mistral, Qwen
target_modules = ["q_proj", "k_proj", "v_proj",
                  "o_proj", "gate_proj", "up_proj", "down_proj"]

# GPT-2 (Conv1D layers; also set fan_in_fan_out=True in LoraConfig)
target_modules = ["c_attn", "c_proj", "c_fc"]

# GPT-J
target_modules = ["q_proj", "k_proj", "v_proj", "out_proj", "fc_in", "fc_out"]

# BERT, RoBERTa
target_modules = ["query", "key", "value", "dense"]

# Falcon
target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]

# Phi (Phi-1.5 / Phi-2)
target_modules = ["q_proj", "k_proj", "v_proj", "dense", "fc1", "fc2"]
```

### Finding Target Modules

```python
# Print the names of all linear layers in a loaded model
import torch

def find_target_modules(model):
    linear_modules = set()
    for name, module in model.named_modules():
        # Note: GPT-2 uses Conv1D instead of nn.Linear, so its layers are not matched here
        if isinstance(module, torch.nn.Linear):
            # Get the last part of the name (e.g., "q_proj" from "model.layers.0.self_attn.q_proj")
            layer_name = name.split(".")[-1]
            linear_modules.add(layer_name)
    return sorted(linear_modules)

# The result usually includes the output head (e.g., "lm_head"); leave it out of target_modules
print(find_target_modules(model))
```

## QLoRA (Quantized LoRA)

QLoRA combines 4-bit quantization of the frozen base model with LoRA adapters, enabling fine-tuning of large models on consumer GPUs. It requires the `bitsandbytes` package (`pip install bitsandbytes`).

### Setup

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
import torch

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # Normalized float 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,         # Nested quantization
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare for k-bit training
model = prepare_model_for_kbit_training(model)

# Apply LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
```

### Memory Requirements

Rough estimates; actual usage depends on batch size, sequence length, optimizer, and gradient checkpointing.

| Model Size | Full FT (16-bit) | LoRA (16-bit) | QLoRA (4-bit) |
|------------|------------------|---------------|---------------|
| 7B | ~60 GB | ~16 GB | ~6 GB |
| 13B | ~104 GB | ~28 GB | ~10 GB |
| 70B | ~560 GB | ~160 GB | ~48 GB |

## Training Patterns

### With Hugging Face Trainer

```python
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
from datasets import load_dataset

# Assumes `model` (PEFT-wrapped) and `tokenizer` from Basic Setup

# Prepare dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train")

def format_prompt(example):
    if example["input"]:
        text = f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
    else:
        text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
    return {"text": text}

dataset = dataset.map(format_prompt)

def tokenize(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,
        padding=False,  # The data collator pads each batch dynamically
    )

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# Training arguments (note higher learning rate)
training_args = TrainingArguments(
    output_dir="./lora-output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,  # Higher than full fine-tuning
    bf16=True,
    logging_steps=10,
    save_steps=500,
    warmup_ratio=0.03,
    gradient_checkpointing=True,
    optim="adamw_torch_fused",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

### With SFTTrainer (TRL)

Argument names in TRL have changed between releases, so check the version you have installed; the comments below flag the arguments that moved.

```python
from trl import SFTTrainer, SFTConfig

sft_config = SFTConfig(
    output_dir="./sft-lora",
    max_seq_length=1024,         # Renamed to max_length in newer TRL releases
    dataset_text_field="text",   # Column holding the formatted prompt
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
    gradient_checkpointing=True,
)

trainer = SFTTrainer(
    model=model,                 # Pass the base model here, not one already wrapped by get_peft_model
    args=sft_config,
    train_dataset=dataset,
    processing_class=tokenizer,  # Named tokenizer= in TRL releases before 0.12
    peft_config=lora_config,     # Pass config directly, SFTTrainer applies it
)
trainer.train()
```

### Classification Task

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model, TaskType

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["query", "value"],
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.SEQ_CLS,
    modules_to_save=["classifier"],  # Train classification head fully
)
model = get_peft_model(model, lora_config)
```

## Saving and Loading

### Save Adapter

```python
# Save only LoRA weights (small file)
model.save_pretrained("./my-lora-adapter")
tokenizer.save_pretrained("./my-lora-adapter")

# Push to Hub
model.push_to_hub("username/my-lora-adapter")
```

### Load Adapter

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load adapter
model = PeftModel.from_pretrained(base_model, "./my-lora-adapter")

# For inference
model.eval()
```

### Switch Between Adapters

```python
# `model` is a PeftModel; `inputs` is a tokenized prompt, e.g. tokenizer("Hello", return_tensors="pt")

# Load multiple adapters
model.load_adapter("./adapter-1", adapter_name="task1")
model.load_adapter("./adapter-2", adapter_name="task2")

# Switch active adapter
model.set_adapter("task1")
output = model.generate(**inputs)

model.set_adapter("task2")
output = model.generate(**inputs)

# Disable adapter (use base model)
with model.disable_adapter():
    output = model.generate(**inputs)
```

## Merging Adapters

Merge LoRA weights into the base model for deployment without adapter overhead.
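Merging works because the adapter is a purely additive update: folding the scaled product of the two LoRA matrices into `W` gives a single matrix with the same outputs. The NumPy sketch below checks this on one small layer. The shapes, values, and the helper name `adapter_forward` are hypothetical, and it illustrates the idea rather than PEFT's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small shapes, purely for illustration.
d, k, r, alpha = 64, 48, 4, 8
scale = alpha / r

W = rng.standard_normal((d, k))   # frozen base weight
A = rng.standard_normal((r, k))   # trained LoRA matrix A
B = rng.standard_normal((d, r))   # trained LoRA matrix B


def adapter_forward(x):
    """Unmerged path: base output plus the scaled low-rank correction."""
    return W @ x + scale * (B @ (A @ x))


# Merged path: fold the update into one weight matrix of the original shape.
W_merged = W + scale * (B @ A)

x = rng.standard_normal(k)
assert W_merged.shape == W.shape
assert np.allclose(adapter_forward(x), W_merged @ x)
```

The merged layer needs one matrix multiply instead of three, which is where the removed overhead comes from. In PEFT, `merge_and_unload()` performs this kind of fold for every adapted layer of a real model: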
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    torch_dtype=torch.bfloat16,
    device_map="cpu",  # Merge on CPU to avoid memory issues
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

# Load adapter
model = PeftModel.from_pretrained(base_model, "./my-lora-adapter")

# Merge and unload
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("./merged-model")
tokenizer.save_pretrained("./merged-model")

# Push merged model to Hub
merged_model.push_to_hub("username/my-merged-model")
```

## Best Practices

1. **Start with r=16**: Scale up to 32 or 64 if the model underfits; drop to 8 if it overfits or memory is tight
2. **Set lora_alpha = 2 × r**: This is a common heuristic; the effective scaling is `alpha/r`
3. **Target all attention and MLP layers**: For best results on LLMs, include the gate/up/down projections:
   ```python
   target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
   ```
4. **Use a higher learning rate**: 2e-4 is typical for LoRA, versus around 2e-5 for full fine-tuning
5. **Enable gradient checkpointing**: Reduces memory at the cost of roughly 20% slower training:
   ```python
   model.gradient_checkpointing_enable()
   ```
6. **Use QLoRA for large models**: Often the only practical way to fine-tune 7B+ models on consumer GPUs
7. **Keep dropout low**: 0.05 is usually sufficient; higher values may hurt performance
8. **Save checkpoints frequently**: LoRA adapters are small, so saving often is cheap
9. **Evaluate against the base model too**: Check that the adapter doesn't degrade the base model's general capabilities
10. **Consider modules_to_save for task heads**: For classification, train the head fully (for example `classifier` on BERT-style models, `score` on Llama-style models):
    ```python
    modules_to_save=["classifier", "score"]
    ```

## References

See `reference/` for detailed documentation:

- `advanced-techniques.md` - DoRA, rsLoRA, adapter composition, and debugging