# T1: Fine-Tuning & Model Customization

> **Duration:** 60–90 minutes | **Level:** Deep-Dive
> **Part of:** 🍎 FROOT Transformation Layer
> **Prerequisites:** F1 (GenAI Foundations), F2 (LLM Landscape)
> **Last Updated:** March 2026

---

## Table of Contents

- [T1.1 The Customization Spectrum](#t11-the-customization-spectrum)
- [T1.2 When to Fine-Tune (and When Not To)](#t12-when-to-fine-tune-and-when-not-to)
- [T1.3 Fine-Tuning Methods](#t13-fine-tuning-methods)
- [T1.4 LoRA & QLoRA — The Practical Revolution](#t14-lora--qlora--the-practical-revolution)
- [T1.5 Alignment: RLHF & DPO](#t15-alignment-rlhf--dpo)
- [T1.6 Data Preparation](#t16-data-preparation)
- [T1.7 The Fine-Tuning Pipeline](#t17-the-fine-tuning-pipeline)
- [T1.8 Evaluation — Did It Work?](#t18-evaluation--did-it-work)
- [T1.9 Azure AI Foundry Fine-Tuning](#t19-azure-ai-foundry-fine-tuning)
- [T1.10 MLOps for LLMs](#t110-mlops-for-llms)
- [Key Takeaways](#key-takeaways)

---

## T1.1 The Customization Spectrum

Not every AI problem needs fine-tuning. Understanding the full spectrum of customization, from simple to complex, lets you pick the right tool:

```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#1a1a2e', 'primaryTextColor': '#e0e0e0', 'primaryBorderColor': '#6366f1', 'lineColor': '#818cf8', 'background': 'transparent'}}}%%
graph LR
    subgraph SIMPLE["Low Effort / Low Control"]
        A["Prompt
Engineering"]
        B["Few-Shot
Examples"]
    end
    subgraph MEDIUM["Medium Effort / Medium Control"]
        C["RAG
(Retrieval)"]
        D["System
Instructions"]
    end
    subgraph ADVANCED["High Effort / High Control"]
        E["LoRA / QLoRA
Fine-Tuning"]
        F["Full
Fine-Tuning"]
    end
    subgraph EXTREME["Extreme Effort / Maximum Control"]
        G["Continued
Pre-Training"]
        H["Training
From Scratch"]
    end
    A --> B --> C --> D --> E --> F --> G --> H
    style SIMPLE fill:#10b98122,stroke:#10b981,stroke-width:2px
    style MEDIUM fill:#06b6d422,stroke:#06b6d4,stroke-width:2px
    style ADVANCED fill:#f59e0b22,stroke:#f59e0b,stroke-width:2px
    style EXTREME fill:#ef444422,stroke:#ef4444,stroke-width:2px
```

| Method | Effort | Data Needed | Cost | When to Use |
|--------|--------|-------------|------|-------------|
| **Prompt Engineering** | Minutes | 0 | Free | Always start here |
| **Few-Shot Examples** | Hours | 5–20 examples | Minimal | When output format matters |
| **RAG** | Days | Your knowledge base | Medium | When accuracy on specific data matters |
| **System Instructions** | Hours | Rules & constraints | Free | When behavioral alignment matters |
| **LoRA Fine-Tuning** | Days-Weeks | 100–10K examples | $50–$500 | When domain language/style matters |
| **Full Fine-Tuning** | Weeks | 10K–100K examples | $1K–$50K | When deep behavioral change needed |
| **Continued Pre-Training** | Weeks-Months | Millions of tokens | $10K–$1M | New domain/language |
| **Training From Scratch** | Months-Years | Trillions of tokens | $1M–$100M+ | Only for frontier labs |

---

## T1.2 When to Fine-Tune (and When Not To)

### The Decision Framework

```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#1a1a2e', 'primaryTextColor': '#e0e0e0', 'primaryBorderColor': '#6366f1', 'lineColor': '#818cf8', 'background': 'transparent'}}}%%
flowchart TB
    START["Does the model
understand your task?"]
    START -->|"No, it doesn't
get the domain"| Q2["Have you tried
detailed prompting + RAG?"]
    START -->|"Yes, but output
format/style is wrong"| Q3["Have you tried
few-shot examples?"]
    START -->|"Yes, and
it works well"| DONE["✅ Don't fine-tune!
Optimize prompts"]
    Q2 -->|"Yes, still poor"| FT["🔧 Fine-tune
(LoRA recommended)"]
    Q2 -->|"No"| TRY_RAG["Try RAG first.
Fine-tune if still poor."]
    Q3 -->|"Yes, still inconsistent"| FT
    Q3 -->|"No"| TRY_FEW["Try few-shot first.
Fine-tune if still inconsistent."]
    FT --> Q4["How much
data do you have?"]
    Q4 -->|"100-1000 examples"| LORA["LoRA / QLoRA"]
    Q4 -->|"1000-10K examples"| FULL_LORA["LoRA or Full FT"]
    Q4 -->|"10K+ examples"| FULL["Full Fine-Tuning"]
    Q4 -->|"< 100 examples"| NOTYET["⚠️ Not enough data.
Collect more or use RAG"]
    style FT fill:#f59e0b22,stroke:#f59e0b,stroke-width:2px
    style DONE fill:#10b98122,stroke:#10b981,stroke-width:2px
    style NOTYET fill:#ef444422,stroke:#ef4444,stroke-width:2px
```

### Fine-Tune When

| Use Case | Why Fine-Tuning Helps | Example |
|----------|----------------------|---------|
| **Domain-specific language** | Model needs to understand jargon | Medical terminology, legal language |
| **Consistent output format** | Need reliable structure despite varying inputs | Always output XML in a specific schema |
| **Behavioral alignment** | Model should act in a specific way consistently | Always formal, always cites sources |
| **Cost reduction** | Replace long prompts with trained behavior | 2000-token system message → fine-tuned, save $$ |
| **Latency reduction** | Shorter prompts = faster responses | Remove few-shot examples from every call |

### Don't Fine-Tune When

| Situation | Better Alternative |
|-----------|-------------------|
| Model needs current/changing data | RAG |
| You have <100 examples | Few-shot prompting |
| The issue is prompt clarity | Better prompts |
| You need quick iteration | Prompt engineering |
| You want to add new knowledge | RAG, not fine-tuning |

> **Critical Insight:** Fine-tuning teaches the model **how** to respond, not **what** to know. For new knowledge, use RAG. For new behavior, use fine-tuning. For both, combine RAG + fine-tuning.

---

## T1.3 Fine-Tuning Methods

### Full Fine-Tuning

Updates **all** model parameters. Most powerful but most expensive.
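
To make the cost of updating every parameter concrete, here is a rough back-of-the-envelope sketch. It assumes mixed-precision training with the Adam optimizer, where a common estimate is about 16 bytes per parameter (FP16 weights and gradients plus FP32 master weights and two optimizer moments). Activations, batch size, and sharding overhead are ignored, so real requirements are higher. The function names are illustrative, not from any library.

```python
import math

# Assumption: mixed-precision Adam keeps, per parameter,
# 2 B FP16 weight + 2 B FP16 gradient + 4 B FP32 master weight
# + 4 B Adam momentum + 4 B Adam variance = 16 bytes.
BYTES_PER_PARAM_FULL_FT = 2 + 2 + 4 + 4 + 4


def full_ft_state_gb(num_params: float) -> float:
    """Approximate GB of weight, gradient, and optimizer state (no activations)."""
    return num_params * BYTES_PER_PARAM_FULL_FT / 1e9


def min_gpus(state_gb: float, gpu_gb: float = 80.0) -> int:
    """Lower bound on GPU count needed just to hold that state."""
    return math.ceil(state_gb / gpu_gb)


for name, params in [("7B", 7e9), ("70B", 70e9)]:
    gb = full_ft_state_gb(params)
    print(f"{name}: ~{gb:.0f} GB of training state, at least {min_gpus(gb)} x 80 GB GPUs")
```

These are lower bounds only; activation memory and parallelism overhead are what push practical setups to noticeably more GPUs.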

| Aspect | Detail |
|--------|--------|
| **Parameters updated** | All (7B = 7 billion, 70B = 70 billion) |
| **GPU requirements** | 4–8x A100/H100 (80GB) for 7B, 32+ for 70B |
| **Training data** | 10K–100K examples ideal |
| **Risk** | Catastrophic forgetting: the model loses general capabilities |
| **When to use** | When you need deep behavioral change AND have enough data |

### LoRA (Low-Rank Adaptation)

The **most practical** fine-tuning technique. Freezes original weights and trains small adapter matrices.

```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#1a1a2e', 'primaryTextColor': '#e0e0e0', 'primaryBorderColor': '#6366f1', 'lineColor': '#818cf8', 'background': 'transparent'}}}%%
graph LR
    subgraph ORIGINAL["Original Model (Frozen)"]
        W["Weight Matrix W
(d × d)
e.g. 4096 × 4096"]
    end
    subgraph LORA["LoRA Adapter (Trained)"]
        A["Matrix A
(r × d)
e.g. 16 × 4096"]
        B["Matrix B
(d × r)
e.g. 4096 × 16"]
        A --> B
    end
    INPUT["Input x"] --> W
    INPUT --> A
    W --> COMBINE["Output =
Wx + BAx"]
    B --> COMBINE
    style ORIGINAL fill:#6366f122,stroke:#6366f1,stroke-width:2px
    style LORA fill:#10b98122,stroke:#10b981,stroke-width:2px
```

| LoRA Parameter | Typical Value | Effect |
|----------------|---------------|--------|
| **Rank (r)** | 8–64 | Higher = more expressive, more compute |
| **Alpha** | 16–128 | Scaling factor, usually alpha = 2 × rank |
| **Target modules** | q_proj, v_proj, k_proj, o_proj | Which attention matrices to adapt |
| **Dropout** | 0.05–0.1 | Regularization to prevent overfitting |

### QLoRA (Quantized LoRA)

LoRA applied to a **4-bit quantized** base model. The democratizer of fine-tuning.

| Comparison | LoRA | QLoRA |
|-----------|------|-------|
| **Base model format** | FP16/BF16 | NF4 (4-bit) |
| **Base weights memory (7B)** | ~14 GB | ~4 GB |
| **Base weights memory (70B)** | ~140 GB | ~35 GB |
| **Quality loss** | Typically small vs full FT | Typically <1% vs LoRA |
| **GPU required (7B)** | A100 40GB+ | RTX 4090 (24GB) |

---

## T1.4 LoRA & QLoRA — The Practical Revolution

### Why LoRA Changed Everything

Before LoRA (introduced in 2021): fine-tuning a 7B model required 4x A100 GPUs ($30K+ in hardware). Fine-tuning 70B? Enterprise-only.

After LoRA and QLoRA: fine-tune 7B on a single consumer GPU. With QLoRA, fine-tune 70B on a single A100 (80GB). Adapters are 10-100MB (instead of the full model at 14-140GB).
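
The adapter sizes quoted above follow from simple arithmetic: each adapted weight matrix gets two low-rank factors with `d × r` entries each. The sketch below works through that count. The layer count (32) and the choice of four square `4096 × 4096` attention projections are assumptions chosen to resemble a 7B-class model; real architectures differ (for example, grouped-query attention shrinks the key/value projections), and the helper names are hypothetical.

```python
def lora_params_per_matrix(d: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d x d weight: two factors of d * r entries."""
    return 2 * d * r


def adapter_size_mb(d: int, r: int, n_layers: int, n_targets: int,
                    bytes_per_param: int = 2) -> float:
    """Approximate adapter file size in MB, assuming FP16/BF16 storage (2 bytes each)."""
    total_params = lora_params_per_matrix(d, r) * n_layers * n_targets
    return total_params * bytes_per_param / 1e6


d, r = 4096, 16
full = d * d                           # 16,777,216 weights in the frozen matrix
lora = lora_params_per_matrix(d, r)    # 131,072 trainable adapter weights
print(f"Trainable share per matrix: {lora / full:.2%}")

# Assumed shape: 32 layers, adapting q_proj, k_proj, v_proj, o_proj in each.
size = adapter_size_mb(d, r, n_layers=32, n_targets=4)
print(f"Adapter size: ~{size:.1f} MB")
```

Under these assumptions the adapter trains under 1% of each adapted matrix and lands at a few tens of megabytes, inside the 10-100MB range mentioned above.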
```
Full Fine-Tuning:  70B × 2 bytes (FP16) = 140 GB VRAM for weights alone (gradients and optimizer state add more) → Need 32x A100
LoRA:              70B frozen + 0.1B trained = ~140 GB + ~200 MB → Need 4x A100
QLoRA:             70B in 4-bit + 0.1B trained = ~35 GB + ~200 MB → Need 1x A100
```

### LoRA Best Practices

| Practice | Recommendation | Why |
|----------|---------------|-----|
| Start with low rank | r=8 or r=16 | Most tasks don't need high rank |
| Target attention layers | q_proj, v_proj minimum | Most impactful for behavioral change |
| Set alpha = 2 × rank | alpha=32 for r=16 | Good balance of adapter influence |
| Use a moderate learning rate | 1e-4 to 2e-4 | Higher than typical full FT rates (around 1e-5), because only the small adapter is trained; going higher risks instability |
| Limit epochs | 1-3 epochs; stop when eval loss rises | More epochs risk overfitting |
| Validate frequently | Every 100-500 steps | Catch overfitting early |

---

## T1.5 Alignment: RLHF & DPO

### RLHF (Reinforcement Learning from Human Feedback)

The technique that turned GPT-3.5 into ChatGPT:

```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#1a1a2e', 'primaryTextColor': '#e0e0e0', 'primaryBorderColor': '#6366f1', 'lineColor': '#818cf8', 'background': 'transparent'}}}%%
flowchart LR
    subgraph STEP1["Step 1: SFT"]
        S1["Supervised Fine-Tuning
on instruction data"]
    end
    subgraph STEP2["Step 2: Reward Model"]
        S2["Human annotators rank
model outputs"]
        S3["Train reward model
from preferences"]
        S2 --> S3
    end
    subgraph STEP3["Step 3: PPO"]
        S4["Optimize policy (the LLM)
against reward model"]
        S5["Using PPO algorithm"]
        S4 --> S5
    end
    STEP1 --> STEP2 --> STEP3
    style STEP1 fill:#6366f122,stroke:#6366f1,stroke-width:2px
    style STEP2 fill:#f59e0b22,stroke:#f59e0b,stroke-width:2px
    style STEP3 fill:#10b98122,stroke:#10b981,stroke-width:2px
```

### DPO (Direct Preference Optimization)

A simpler alternative that skips the reward model:

| Aspect | RLHF | DPO |
|--------|------|-----|
| **Complexity** | 3-step pipeline | Single training step |
| **Reward model** | Required (separate training) | Not needed |
| **Stability** | Can be finicky (PPO training) | More stable |
| **Data needed** | Preference pairs (to train the reward model) + prompts for PPO | Preference pairs only |
| **Quality** | Gold standard | Comparable (sometimes better) |
| **Adoption** | ChatGPT, Claude | Llama 3, Zephyr, many open models |

### Preference Data Format

```json
{
  "prompt": "Explain what a token is in the context of LLMs.",
  "chosen": "A token is the smallest unit of text that a language model processes. Rather than reading individual characters, models break text into subword units called tokens using algorithms like Byte-Pair Encoding (BPE). For example, 'unbelievable' might become ['un', 'believ', 'able']. In GPT-4, one token averages about 4 English characters. Token count matters because it determines both the cost and the context window usage.",
  "rejected": "A token is basically a word. Like when you type 'hello world' that's 2 tokens. Tokens are used by AI."
}
```

---

## T1.6 Data Preparation

Data quality is **more important than data quantity** for fine-tuning. A thousand excellent examples often outperform ten thousand mediocre ones.
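
As one illustration of putting quality first, the sketch below screens preference records in the `prompt` / `chosen` / `rejected` shape shown in T1.5 before they reach training. It is a minimal example using only the standard library; the function names and the specific checks (non-empty fields, `chosen` differing from `rejected`, exact-duplicate removal) are assumptions for illustration, not a complete quality pipeline.

```python
import json

REQUIRED_KEYS = ("prompt", "chosen", "rejected")


def validate_preference_record(record: dict) -> list[str]:
    """Return a list of problems found in one preference record (empty if clean)."""
    problems = []
    for key in REQUIRED_KEYS:
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty '{key}'")
    if not problems and record["chosen"].strip() == record["rejected"].strip():
        problems.append("'chosen' and 'rejected' are identical")
    return problems


def clean_dataset(lines):
    """Parse JSONL lines; return (kept records, [(line_no, reason), ...] for dropped ones)."""
    kept, dropped, seen = [], [], set()
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines silently
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            dropped.append((line_no, "invalid JSON"))
            continue
        if not isinstance(record, dict):
            dropped.append((line_no, "record is not a JSON object"))
            continue
        problems = validate_preference_record(record)
        if problems:
            dropped.append((line_no, "; ".join(problems)))
            continue
        fingerprint = tuple(record[k].strip() for k in REQUIRED_KEYS)
        if fingerprint in seen:
            dropped.append((line_no, "exact duplicate"))
            continue
        seen.add(fingerprint)
        kept.append(record)
    return kept, dropped


sample = [
    '{"prompt": "What is a token?", "chosen": "A subword unit.", "rejected": "A word."}',
    '{"prompt": "What is a token?", "chosen": "A subword unit.", "rejected": "A word."}',
    '{"prompt": "What is RAG?", "chosen": "Same", "rejected": "Same"}',
    'not json',
]
kept, dropped = clean_dataset(sample)
print(len(kept), "kept;", dropped)
```

Checks like these only catch mechanical defects. Whether the `chosen` answer is actually better still needs expert review.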
### Data Format (Chat Format)

```jsonl
{"messages": [{"role": "system", "content": "You are an Azure architect..."}, {"role": "user", "content": "What's the best VM for a SQL workload?"}, {"role": "assistant", "content": "For SQL workloads, I recommend..."}]}
{"messages": [{"role": "system", "content": "You are an Azure architect..."}, {"role": "user", "content": "Explain Azure Front Door."}, {"role": "assistant", "content": "Azure Front Door is a global..."}]}
```

### Data Quality Checklist

| Check | Why It Matters | How to Verify |
|-------|---------------|---------------|
| **Diverse inputs** | Prevents overfitting to narrow patterns | Cluster queries by type, ensure coverage |
| **High-quality outputs** | The model imitates its examples | Expert review of assistant messages |
| **Consistent format** | Reinforces desired structure | Schema validation on outputs |
| **No contradictions** | Contradictory examples confuse the model | Deduplication and consistency check |
| **Balanced classes** | Prevents bias toward over-represented types | Category distribution analysis |
| **Length variety** | Handles both short and long responses | Histogram of response lengths |
| **Edge cases** | Teaches robust behavior | Include tricky/ambiguous examples (10-20%) |

### How Much Data?

| Model Size | Minimum | Good | Excellent |
|-----------|---------|------|-----------|
| **7B** | 100 examples | 500–1,000 | 5,000+ |
| **13B** | 200 examples | 1,000–2,000 | 10,000+ |
| **70B** | 500 examples | 2,000–5,000 | 20,000+ |

---

## T1.7 The Fine-Tuning Pipeline

```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#1a1a2e', 'primaryTextColor': '#e0e0e0', 'primaryBorderColor': '#6366f1', 'lineColor': '#818cf8', 'background': 'transparent'}}}%%
flowchart TB
    D["📊 Data Preparation
Clean, format, split"] --> V["✅ Data Validation
Quality checks, dedup"]
    V --> B["🏗️ Base Model Selection
GPT-4o-mini / Phi / Llama"]
    B --> C["⚙️ Config
LoRA rank, LR, epochs"]
    C --> T["🔄 Training
Fine-tune with monitoring"]
    T --> E["📈 Evaluation
Test set metrics"]
    E -->|Metrics OK| DEP["🚀 Deploy
Managed endpoint"]
    E -->|Metrics poor| ITER["🔁 Iterate
More data, adjust config"]
    ITER --> D
    DEP --> MON["📊 Monitor
Production quality"]
    MON -->|Drift detected| RETRAIN["🔄 Retrain
Updated data"]
    RETRAIN --> D
    style D fill:#6366f122,stroke:#6366f1,stroke-width:2px
    style T fill:#f59e0b22,stroke:#f59e0b,stroke-width:2px
    style E fill:#10b98122,stroke:#10b981,stroke-width:2px
    style DEP fill:#7c3aed22,stroke:#7c3aed,stroke-width:2px
```

---

## T1.8 Evaluation — Did It Work?

### Evaluation Framework

| Metric | What It Measures | How to Compute | Target |
|--------|-----------------|----------------|--------|
| **Loss (eval)** | Fit on held-out data | Built into training | Decreasing, not diverging from training loss |
| **Accuracy** | Correct answers on test set | Compare to ground truth | >85% for most tasks |
| **Format compliance** | Follows output format | Schema validation | >98% |
| **Regression check** | Didn't break general capability | Test on general benchmarks | <5% degradation |
| **Human preference** | Humans prefer fine-tuned vs base | A/B blind evaluation | >60% prefer fine-tuned |
| **Task-specific** | Domain-relevant metrics | Task-dependent | Improvement vs base |

### A/B Evaluation Pattern

```
1. Prepare 100 test prompts (not in training data)
2. Generate responses from BOTH base model and fine-tuned model
3. Randomize order (evaluator doesn't know which is which)
4. Expert evaluators score each response 1-5 on:
   - Accuracy
   - Relevance
   - Format compliance
   - Helpfulness
5. Compare aggregate scores
6. Fine-tuned should win on task-specific metrics
   without significant regression on general metrics
```

---

## T1.9 Azure AI Foundry Fine-Tuning

Azure AI Foundry supports fine-tuning several model families:

| Model | Fine-Tuning Support | Method | Min Data |
|-------|-------------------|--------|----------|
| **GPT-4o** | ✅ Managed | Full FT | 10 examples (50+ recommended) |
| **GPT-4o-mini** | ✅ Managed | Full FT | 10 examples |
| **Phi-4** | ✅ Managed | LoRA | 100 examples |
| **Llama 3.1** | ✅ Managed | LoRA | 100 examples |
| **Mistral** | ✅ Managed | LoRA | 100 examples |

### Azure Fine-Tuning Flow

```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#1a1a2e', 'primaryTextColor': '#e0e0e0', 'primaryBorderColor': '#6366f1', 'lineColor': '#818cf8', 'background': 'transparent'}}}%%
flowchart LR
    DATA["Upload Training
Data (JSONL)"] --> JOB["Create Fine-Tune
Job in Foundry"]
    JOB --> TRAIN["Azure Manages
Training (GPU)"]
    TRAIN --> MODEL["Fine-Tuned
Model Created"]
    MODEL --> DEPLOY["Deploy to
Managed Endpoint"]
    DEPLOY --> TEST["Test & Evaluate
via API"]
    style DATA fill:#6366f122,stroke:#6366f1
    style TRAIN fill:#f59e0b22,stroke:#f59e0b
    style DEPLOY fill:#10b98122,stroke:#10b981
```

### Cost Estimation

| Model | Training Cost | Hosting Cost |
|-------|--------------|-------------|
| **GPT-4o-mini** | ~$3.00/1M training tokens | Standard PAYG pricing |
| **GPT-4o** | ~$25.00/1M training tokens | Standard PAYG pricing |
| **Phi-4 (14B)** | Compute hours (GPU reservation) | Managed compute pricing |

---

## T1.10 MLOps for LLMs

### The LLMOps Lifecycle

```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#1a1a2e', 'primaryTextColor': '#e0e0e0', 'primaryBorderColor': '#6366f1', 'lineColor': '#818cf8', 'background': 'transparent'}}}%%
graph TB
    subgraph DEVELOP["🔬 Develop"]
        D1["Prompt Engineering"]
        D2["RAG Pipeline"]
        D3["Fine-Tuning"]
    end
    subgraph EVALUATE["📊 Evaluate"]
        E1["Offline Metrics"]
        E2["LLM-as-Judge"]
        E3["Human Review"]
    end
    subgraph DEPLOY_P["🚀 Deploy"]
        P1["Blue/Green Deploy"]
        P2["A/B Testing"]
        P3["Canary Rollout"]
    end
    subgraph OPERATE["📈 Operate"]
        O1["Quality Monitoring"]
        O2["Cost Tracking"]
        O3["Drift Detection"]
    end
    DEVELOP --> EVALUATE --> DEPLOY_P --> OPERATE
    OPERATE -->|"Retrain trigger"| DEVELOP
    style DEVELOP fill:#6366f122,stroke:#6366f1,stroke-width:2px
    style EVALUATE fill:#f59e0b22,stroke:#f59e0b,stroke-width:2px
    style DEPLOY_P fill:#10b98122,stroke:#10b981,stroke-width:2px
    style OPERATE fill:#7c3aed22,stroke:#7c3aed,stroke-width:2px
```

### Key LLMOps Tools

| Tool | Purpose | Integration |
|------|---------|-------------|
| **Azure AI Foundry** | Model management, fine-tuning, evaluation | Native Azure |
| **MLflow** | Experiment tracking, model registry | Open source, Azure-integrated |
| **Prompt Flow** | Prompt development and evaluation | Azure AI Foundry, VS Code |
| **GitHub Actions** | CI/CD for model pipelines | Azure-integrated |
| **Azure Monitor** | Production observability | Native Azure |
| **LangSmith / LangFuse** | LLM-specific tracing and evaluation | Third-party alternatives (Langfuse is open source; LangSmith is a hosted commercial product) |

---

## Key Takeaways

:::tip The Five Rules of Model Customization

1. **Start with prompts, not fine-tuning.** As a rule of thumb, about 80% of problems are solved with better prompts + RAG. Fine-tune only when you've exhausted simpler options.
2. **LoRA is the default.** Unless you have a specific reason for full fine-tuning, LoRA or QLoRA typically gives you roughly 95% of the quality at roughly 5% of the cost.
3. **Data quality > data quantity.** 500 expert-curated examples can beat 50,000 mediocre ones. Invest in data preparation.
4. **Always evaluate rigorously.** Don't just check if fine-tuning "feels better": measure accuracy, format compliance, and regression on general capability.
5. **Fine-tuning teaches HOW, not WHAT.** For new knowledge, use RAG. For new behavior/style/format, use fine-tuning. For both, combine them.

:::

---

> **FrootAI T1**: *Fine-tuning is the art of teaching a model your language. LoRA made it accessible. QLoRA made it affordable. Good data makes it work.*