--- name: llm-fine-tuning-guide description: Master fine-tuning of large language models for specific domains and tasks. Covers data preparation, training techniques, optimization strategies, and evaluation methods. Use when adapting models for specialized applications, reducing inference costs, or improving domain-specific performance. --- # LLM Fine-Tuning Guide Master the art of fine-tuning large language models to create specialized models optimized for your specific use cases, domains, and performance requirements. ## Overview Fine-tuning adapts pre-trained LLMs to specific tasks, domains, or styles by training them on curated datasets. This improves accuracy, reduces hallucinations, and optimizes costs. ### When to Fine-Tune - **Domain Specialization**: Legal documents, medical records, financial reports - **Task-Specific Performance**: Better results on specific tasks than base model - **Cost Optimization**: Smaller fine-tuned model replaces expensive large model - **Style Adaptation**: Match specific writing styles or tones - **Compliance Requirements**: Keep sensitive data within your infrastructure - **Latency Requirements**: Smaller models deploy faster ### When NOT to Fine-Tune - One-off queries (use prompting instead) - Rapidly changing information (use RAG instead) - Limited training data (< 100 examples typically insufficient) - General knowledge questions (base model sufficient) ## Quick Start **Full Fine-Tuning**: ```bash python examples/full_fine_tuning.py ``` **LoRA (Recommended for most cases)**: ```bash python examples/lora_fine_tuning.py ``` **QLoRA (Single GPU)**: ```bash python examples/qlora_fine_tuning.py ``` **Data Preparation**: ```bash python scripts/data_preparation.py ``` ## Fine-Tuning Approaches ### 1. Full Fine-Tuning Update all model parameters during training. **Pros**: - Maximum performance improvement - Can completely rewrite model behavior - Best for significant domain shifts **Cons**: - High computational cost - Requires large dataset (1000+ examples) - Risk of catastrophic forgetting - Long training time ```python from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer model_id = "meta-llama/Llama-2-7b" model = AutoModelForCausalLM.from_pretrained(model_id) tokenizer = AutoTokenizer.from_pretrained(model_id) training_args = TrainingArguments( output_dir="./fine-tuned-llama", num_train_epochs=3, per_device_train_batch_size=4, gradient_accumulation_steps=4, learning_rate=2e-5, weight_decay=0.01, logging_steps=10, save_steps=100, eval_strategy="steps", eval_steps=50, load_best_model_at_end=True, ) trainer = Trainer( model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset, ) trainer.train() ``` ### 2. Parameter-Efficient Fine-Tuning (PEFT) Train only a small fraction of parameters. #### LoRA (Low-Rank Adaptation) Adds trainable low-rank matrices to existing weights. **Pros**: - 99% fewer trainable parameters - Maintains base model knowledge - Fast training (10-20x faster) - Easy to switch between adapters **Cons**: - Slightly lower performance than full fine-tuning - Requires base model at inference ```python from peft import get_peft_model, LoraConfig, TaskType from transformers import AutoModelForCausalLM, AutoTokenizer base_model_id = "meta-llama/Llama-2-7b" model = AutoModelForCausalLM.from_pretrained(base_model_id) tokenizer = AutoTokenizer.from_pretrained(base_model_id) # Configure LoRA lora_config = LoraConfig( r=8, # Rank of low-rank matrices lora_alpha=16, # Scaling factor target_modules=["q_proj", "v_proj"], # Which layers to adapt lora_dropout=0.05, bias="none", task_type=TaskType.CAUSAL_LM ) # Wrap model with LoRA model = get_peft_model(model, lora_config) model.print_trainable_parameters() # Output: trainable params: 4,194,304 || all params: 6,738,415,616 || trainable%: 0.06 # Train as normal trainer = Trainer( model=model, args=training_args, train_dataset=train_dataset, ) trainer.train() # Save only LoRA weights model.save_pretrained("./llama-lora-adapter") ``` #### QLoRA (Quantized LoRA) Combines LoRA with quantization for extreme efficiency. ```python from peft import prepare_model_for_kbit_training, get_peft_model, LoraConfig from transformers import AutoModelForCausalLM, BitsAndBytesConfig # Quantization config bnb_config = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype="float16", bnb_4bit_use_double_quant=True ) # Load quantized model model = AutoModelForCausalLM.from_pretrained( "meta-llama/Llama-2-7b", quantization_config=bnb_config, device_map="auto" ) # Prepare for training model = prepare_model_for_kbit_training(model) # Apply LoRA lora_config = LoraConfig( r=8, lora_alpha=32, target_modules=["q_proj", "v_proj", "k_proj", "o_proj"], lora_dropout=0.05, bias="none", task_type=TaskType.CAUSAL_LM ) model = get_peft_model(model, lora_config) # Train on single GPU trainer = Trainer( model=model, args=TrainingArguments( output_dir="./qlora-output", per_device_train_batch_size=1, gradient_accumulation_steps=4, learning_rate=5e-4, num_train_epochs=3, ), train_dataset=train_dataset, ) trainer.train() ``` #### Prefix Tuning Prepends trainable tokens to input. ```python from peft import get_peft_model, PrefixTuningConfig config = PrefixTuningConfig( num_virtual_tokens=20, task_type=TaskType.CAUSAL_LM, ) model = get_peft_model(model, config) # Only 20 * embedding_dim parameters trained ``` ### 3. Instruction Fine-Tuning Train model to follow instructions with examples. ```python # Training data format training_data = [ { "instruction": "Translate to French", "input": "Hello, how are you?", "output": "Bonjour, comment allez-vous?" }, { "instruction": "Summarize this text", "input": "Long document...", "output": "Summary..." } ] # Template for training template = """Below is an instruction that describes a task, paired with an input that provides further context. ### Instruction: {instruction} ### Input: {input} ### Response: {output}""" # Create formatted dataset formatted_data = [ template.format(**example) for example in training_data ] ``` ### 4. Domain-Specific Fine-Tuning Tailor models for specific industries or fields. #### Legal Domain Example ```python legal_training_data = [ { "prompt": "What are the key clauses in an NDA?", "completion": """Key clauses typically include: 1. Definition of Confidential Information 2. Non-Disclosure Obligations 3. Permitted Disclosures 4. Term and Termination 5. Return of Information 6. Remedies""" }, # ... more legal examples ] # Train on legal domain model = fine_tune_on_domain( base_model="gpt-3.5-turbo", training_data=legal_training_data, epochs=3, learning_rate=0.0002, ) ``` ## Data Preparation ### 1. Dataset Quality ```python class DatasetValidator: def validate_dataset(self, data): issues = { "empty_samples": 0, "duplicates": 0, "outliers": 0, "imbalance": {} } # Check for empty samples for sample in data: if not sample.get("text"): issues["empty_samples"] += 1 # Check for duplicates texts = [s.get("text") for s in data] issues["duplicates"] = len(texts) - len(set(texts)) # Check for length outliers lengths = [len(t.split()) for t in texts] mean_length = sum(lengths) / len(lengths) issues["outliers"] = sum(1 for l in lengths if l > mean_length * 3) return issues # Validate before training validator = DatasetValidator() issues = validator.validate_dataset(training_data) print(f"Dataset Issues: {issues}") ``` ### 2. Data Augmentation ```python from nlpaug.augmenter.word import SynonymAug, RandomWordAug import nlpaug.flow as naf # Create augmentation pipeline text = "The quick brown fox jumps over the lazy dog" # Synonym replacement aug_syn = SynonymAug(aug_p=0.3) augmented_syn = aug_syn.augment(text) # Random word insertion aug_insert = RandomWordAug(action="insert", aug_p=0.3) augmented_insert = aug_insert.augment(text) # Combine augmentations flow = naf.Sequential([ SynonymAug(aug_p=0.2), RandomWordAug(action="swap", aug_p=0.2) ]) augmented = flow.augment(text) ``` ### 3. Train/Validation Split ```python from sklearn.model_selection import train_test_split # Create splits train_data, eval_data = train_test_split( data, test_size=0.2, random_state=42 ) eval_data, test_data = train_test_split( eval_data, test_size=0.5, random_state=42 ) print(f"Train: {len(train_data)}, Eval: {len(eval_data)}, Test: {len(test_data)}") ``` ## Training Techniques ### 1. Learning Rate Scheduling ```python from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR # Linear warmup + cosine annealing def get_scheduler(optimizer, num_steps): lr_scheduler = get_linear_schedule_with_warmup( optimizer, num_warmup_steps=500, num_training_steps=num_steps ) return lr_scheduler training_args = TrainingArguments( learning_rate=1e-4, lr_scheduler_type="cosine", warmup_steps=500, warmup_ratio=0.1, ) ``` ### 2. Gradient Accumulation ```python training_args = TrainingArguments( gradient_accumulation_steps=4, # Accumulate gradients over 4 steps per_device_train_batch_size=1, # Effective batch size: 1 * 4 = 4 ) # Simulates larger batch on limited GPU memory ``` ### 3. Mixed Precision Training ```python training_args = TrainingArguments( fp16=True, # Use 16-bit floats bf16=False, ) # Reduces memory usage by 50%, speeds up training ``` ### 4. Multi-GPU Training ```python training_args = TrainingArguments( output_dir="./results", num_train_epochs=3, per_device_train_batch_size=8, per_device_eval_batch_size=8, gradient_accumulation_steps=4, dataloader_pin_memory=True, dataloader_num_workers=4, ) # Automatically uses all available GPUs trainer = Trainer( model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset, ) ``` ## Popular Models for Fine-Tuning ### Open Source Models #### Llama 3.2 (Meta) ```python from transformers import AutoModelForCausalLM, AutoTokenizer model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-7b") tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-7b") # Fine-tune on custom data # ... training code ``` **Characteristics**: - 7B, 70B parameter versions - Strong instruction-following - Excellent for domain adaptation - Apache 2.0 license #### Gemma 3 (Google) ```python model = AutoModelForCausalLM.from_pretrained("google/gemma-3-2b") tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-2b") # Gemma 3 sizes: 2B, 7B, 27B # Very efficient, great for fine-tuning ``` **Characteristics**: - Small, medium, large sizes - Efficient architecture - Good for edge deployment - Built on cutting-edge research #### Mistral 7B ```python model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1") tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1") # Strong performance, efficient architecture ``` **Characteristics**: - Sliding window attention - Efficient inference - Strong performance-to-size ratio ### Commercial Models #### OpenAI Fine-Tuning API ```python import openai # Prepare training data training_file = openai.File.create( file=open("training_data.jsonl", "rb"), purpose="fine-tune" ) # Create fine-tuning job fine_tune_job = openai.FineTuningJob.create( training_file=training_file.id, model="gpt-3.5-turbo", hyperparameters={ "n_epochs": 3, "learning_rate_multiplier": 0.1, } ) # Wait for completion fine_tuned_model = openai.FineTuningJob.retrieve(fine_tune_job.id) print(f"Status: {fine_tuned_model.status}") # Use fine-tuned model response = openai.ChatCompletion.create( model=fine_tuned_model.fine_tuned_model, messages=[{"role": "user", "content": "Hello"}] ) ``` ## Evaluation and Metrics ### 1. Perplexity ```python import torch from math import exp def calculate_perplexity(model, eval_dataset): model.eval() total_loss = 0 total_tokens = 0 with torch.no_grad(): for batch in eval_dataset: outputs = model(**batch) loss = outputs.loss total_loss += loss.item() * batch["input_ids"].shape[0] total_tokens += batch["input_ids"].shape[0] perplexity = exp(total_loss / total_tokens) return perplexity perplexity = calculate_perplexity(model, eval_dataset) print(f"Perplexity: {perplexity:.2f}") ``` ### 2. Task-Specific Metrics ```python from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score def evaluate_task(predictions, ground_truth): return { "accuracy": accuracy_score(ground_truth, predictions), "precision": precision_score(ground_truth, predictions, average='weighted'), "recall": recall_score(ground_truth, predictions, average='weighted'), "f1": f1_score(ground_truth, predictions, average='weighted'), } # Evaluate on task predictions = [model.predict(x) for x in test_data] metrics = evaluate_task(predictions, test_labels) print(f"Metrics: {metrics}") ``` ### 3. Human Evaluation ```python class HumanEvaluator: def evaluate_response(self, prompt, response): criteria = { "relevance": self._score_relevance(prompt, response), "coherence": self._score_coherence(response), "factuality": self._score_factuality(response), "helpfulness": self._score_helpfulness(response), } return sum(criteria.values()) / len(criteria) def _score_relevance(self, prompt, response): # Score 1-5 pass def _score_coherence(self, response): # Score 1-5 pass ``` ## Common Challenges & Solutions ### Challenge: Catastrophic Forgetting Model forgets pre-trained knowledge while adapting to new domain. **Solutions**: - Use lower learning rates (2e-5 to 5e-5) - Smaller training epochs (1-3) - Regularization techniques - Continual learning approaches ```python # Conservative training settings training_args = TrainingArguments( learning_rate=2e-5, # Lower learning rate num_train_epochs=2, # Few epochs weight_decay=0.01, # L2 regularization warmup_steps=500, save_total_limit=3, load_best_model_at_end=True, ) ``` ### Challenge: Overfitting Model performs well on training data but poorly on new data. **Solutions**: - Use more training data - Implement dropout - Early stopping - Validation monitoring ```python training_args = TrainingArguments( eval_strategy="steps", eval_steps=50, load_best_model_at_end=True, early_stopping_patience=3, metric_for_best_model="eval_loss", ) ``` ### Challenge: Insufficient Training Data Few examples for fine-tuning. **Solutions**: - Data augmentation - Use PEFT (LoRA) instead of full fine-tuning - Few-shot learning with prompting - Transfer learning ```python # Use LoRA when data is limited lora_config = LoraConfig( r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], lora_dropout=0.05, ) ``` ## Best Practices ### Before Fine-Tuning - ✓ Start with a strong base model - ✓ Prepare high-quality training data (100+ examples recommended) - ✓ Define clear evaluation metrics - ✓ Set up proper train/validation splits - ✓ Document your objectives ### During Fine-Tuning - ✓ Monitor training/validation loss - ✓ Use appropriate learning rates - ✓ Save checkpoints regularly - ✓ Validate on held-out data - ✓ Watch for overfitting/underfitting ### After Fine-Tuning - ✓ Evaluate on test set - ✓ Compare against baseline - ✓ Perform qualitative analysis - ✓ Document configuration and results - ✓ Version your fine-tuned models ## Implementation Checklist - [ ] Determine fine-tuning approach (full, LoRA, QLoRA, instruction) - [ ] Prepare and validate training dataset (100+ examples) - [ ] Choose base model (Llama 3.2, Gemma 3, Mistral, etc.) - [ ] Set up PEFT if using parameter-efficient methods - [ ] Configure training arguments and hyperparameters - [ ] Implement data loading and preprocessing - [ ] Set up evaluation metrics - [ ] Train model with monitoring - [ ] Evaluate on test set - [ ] Save and version fine-tuned model - [ ] Test in production environment - [ ] Document process and results ## Resources ### Frameworks - **Hugging Face Transformers**: https://huggingface.co/transformers/ - **PEFT (Parameter-Efficient Fine-Tuning)**: https://github.com/huggingface/peft - **Hugging Face Datasets**: https://huggingface.co/datasets ### Models - **Llama 3.2**: https://www.meta.com/llama/ - **Gemma 3**: https://deepmind.google/technologies/gemma/ - **Mistral**: https://mistral.ai/ ### Papers - "LoRA: Low-Rank Adaptation of Large Language Models" (Hu et al.) - "QLoRA: Efficient Finetuning of Quantized LLMs" (Dettmers et al.)