---
name: HuggingFace Model Trainer
description: Train and fine-tune LLMs using HuggingFace TRL, Transformers, and cloud GPU infrastructure with SFT, DPO, GRPO methods
version: 1.1.0
last_updated: 2026-01-06
external_version: "TRL 0.12+, Transformers 4.47+"
triggers:
  - fine-tuning
  - model training
  - huggingface
  - TRL
  - LoRA
  - PEFT
---

# HuggingFace Model Trainer

You are an expert in training and fine-tuning large language models using HuggingFace's TRL (Transformer Reinforcement Learning), Transformers, and PEFT libraries. You help with dataset preparation, training configuration, GPU selection, and deployment.

## Training Methods Overview

### Method Selection Guide

```
HAVE LABELED DATA?
├── Yes: Input/Output pairs
│   └── Use SFT (Supervised Fine-Tuning)
│
├── Yes: Preference pairs (chosen/rejected)
│   └── Use DPO (Direct Preference Optimization)
│
├── No: Have a reward function/verifier
│   └── Use GRPO (Group Relative Policy Optimization)
│
└── No: Just want to continue pretraining
    └── Use CLM (Causal Language Modeling)
```

## 1. Supervised Fine-Tuning (SFT)

### When to Use
- You have instruction/response pairs
- Adapting a model to your domain
- Teaching specific output formats

### Basic SFT Script

```python
from trl import SFTTrainer, SFTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

# Load model and tokenizer
model_id = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# Load dataset
dataset = load_dataset("your-org/your-dataset", split="train")

# Training configuration
config = SFTConfig(
    output_dir="./sft-output",
    max_seq_length=2048,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    num_train_epochs=3,
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,  # Use bfloat16 on supported GPUs
)

# Create trainer
trainer = SFTTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

# Train
trainer.train()
trainer.save_model("./final-model")
```

### SFT with Chat Template

```python
from trl import SFTTrainer, SFTConfig

# Dataset should have a 'messages' column in chat format:
# [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]

config = SFTConfig(
    output_dir="./chat-sft",
    max_seq_length=4096,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    num_train_epochs=3,
)

trainer = SFTTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,  # Automatically applies the chat template
)
```
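Before launching a chat-template run, it can help to render one example through the tokenizer's chat template and eyeball the serialized text the trainer will actually see. A minimal sketch, assuming an instruct checkpoint that ships a chat template (the model id and example messages are placeholders, not from a specific run):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "user", "content": "What is Python?"},
    {"role": "assistant", "content": "Python is a programming language."},
]

# Render the conversation exactly as it would be serialized for training
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)
```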
## 2. Direct Preference Optimization (DPO)

### When to Use
- You have preference data (chosen vs. rejected responses)
- Aligning a model with human preferences
- Improving response quality

### DPO Script

```python
from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

# Load model
model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Dataset needs: prompt, chosen, rejected columns
dataset = load_dataset("your-org/preference-data", split="train")

config = DPOConfig(
    output_dir="./dpo-output",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-7,  # Lower LR for DPO
    beta=0.1,            # KL penalty coefficient
    num_train_epochs=1,
    bf16=True,
    logging_steps=10,
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

trainer.train()
```

### Preference Data Format

```python
# Required columns: prompt, chosen, rejected
preference_example = {
    "prompt": "Explain quantum computing",
    "chosen": "Quantum computing uses quantum bits...",  # Better response
    "rejected": "Computers are fast machines...",        # Worse response
}
```

## 3. Group Relative Policy Optimization (GRPO)

### When to Use
- You have a reward function or verifier
- Math/code tasks with checkable answers
- RL-based training without paired preferences

### GRPO Script

```python
from trl import GRPOTrainer, GRPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Dataset needs a 'prompt' column
dataset = load_dataset("your-org/prompt-dataset", split="train")

# Define reward function
def reward_fn(completions, prompts, **kwargs):
    """Return a reward for each completion."""
    rewards = []
    for completion, prompt in zip(completions, prompts):
        # Example: reward correct math answers (verify_math_answer is user-supplied)
        if verify_math_answer(completion, prompt):
            rewards.append(1.0)
        else:
            rewards.append(-0.5)
    return rewards

config = GRPOConfig(
    output_dir="./grpo-output",
    per_device_train_batch_size=4,
    num_generations=4,  # Generate 4 samples per prompt
    learning_rate=1e-6,
    num_train_epochs=1,
)

trainer = GRPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,  # Recent TRL versions use processing_class instead of tokenizer
    reward_funcs=reward_fn,
)

trainer.train()
```
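`verify_math_answer` above is a stand-in for whatever checker fits your task. As one illustration only (the `GOLD_ANSWERS` lookup table and the regex extraction are assumptions for this sketch, not part of TRL), a numeric-answer verifier could look like this:

```python
import re

# Hypothetical lookup table mapping each training prompt to its gold answer
GOLD_ANSWERS = {
    "What is 12 * 7?": "84",
}

def verify_math_answer(completion: str, prompt: str) -> bool:
    """Check whether the last number in the completion matches the gold answer."""
    reference = GOLD_ANSWERS.get(prompt)
    if reference is None:
        return False
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return False
    try:
        return abs(float(numbers[-1]) - float(reference)) < 1e-6
    except ValueError:
        return False
```

For code tasks, the same slot is typically filled by running unit tests in a sandbox and returning 1.0 on pass.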
## 4. Parameter-Efficient Fine-Tuning (PEFT/LoRA)

### Why Use LoRA
- Train large models on limited GPU memory
- 10-100x fewer trainable parameters
- Fast training; adapters are easy to merge or swap

### LoRA Configuration

```python
from peft import LoraConfig, get_peft_model, TaskType

# LoRA configuration
lora_config = LoraConfig(
    r=16,           # Rank (start with 8-32)
    lora_alpha=32,  # Alpha scaling
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# Apply to model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 6,553,600 || all params: 8,030,261,248 || trainable%: 0.082
```

### SFT with LoRA

```python
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig

# LoRA config
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

config = SFTConfig(
    output_dir="./lora-sft",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,  # Higher LR for LoRA
    num_train_epochs=3,
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,  # Pass LoRA config
)

trainer.train()
```

### QLoRA (Quantized LoRA)

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load quantized model
model_id = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Then apply LoRA as usual
```

## GPU Selection Guide

### Memory Requirements

| Model Size | Full Fine-tune | LoRA | QLoRA |
|------------|----------------|------|-------|
| 7-8B       | 60GB+          | 16GB | 8GB   |
| 13B        | 100GB+         | 24GB | 12GB  |
| 34B        | 200GB+         | 48GB | 24GB  |
| 70B        | 400GB+         | 80GB | 48GB  |

### GPU Recommendations

| Task        | Recommended GPU        |
|-------------|------------------------|
| QLoRA 8B    | RTX 4090 (24GB), A10G  |
| QLoRA 70B   | A100 40GB x2, H100     |
| LoRA 8B     | A100 40GB, A10G x2     |
| LoRA 70B    | A100 80GB x2, H100 x2  |
| Full FT 8B  | A100 80GB x2, H100     |
| Full FT 70B | H100 x8, A100 80GB x8  |

Cloud providers:
- AWS: p4d (A100), p5 (H100)
- GCP: a2-highgpu (A100), a3-highgpu (H100)
- Azure: NC A100, ND H100
- Lambda Labs: often the most cost-effective for training
- RunPod: good spot pricing
- HuggingFace Jobs: managed training infrastructure

## Dataset Preparation

### Chat Format Dataset

```python
from datasets import Dataset

# Conversation format
conversations = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is Python?"},
            {"role": "assistant", "content": "Python is a programming language..."},
        ]
    },
    # More examples...
]

dataset = Dataset.from_list(conversations)
dataset.push_to_hub("your-org/chat-dataset")
```
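Before pushing, it is often worth carving out a small held-out split so you have evaluation data later. A minimal sketch, continuing from the `conversations` list above (the split size and repo name are arbitrary placeholders):

```python
from datasets import Dataset

dataset = Dataset.from_list(conversations)  # assumes a reasonably large list of examples

# Hold out 5% of examples for evaluation, then push both splits to one repo
splits = dataset.train_test_split(test_size=0.05, seed=42)
splits.push_to_hub("your-org/chat-dataset")
```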
### Instruction Format

```python
# Alpaca-style format
instruction_data = [
    {
        "instruction": "Summarize the following text",
        "input": "Long text here...",
        "output": "Summary here...",
    }
]

# Or a simpler prompt/completion format
simple_data = [
    {
        "prompt": "Question or instruction",
        "completion": "Expected response",
    }
]
```

### Data Quality Tips

```python
# Filter low-quality examples
def filter_quality(example):
    # Remove very short responses
    if len(example["completion"]) < 50:
        return False
    # Remove repetitive content
    if example["completion"].count(example["completion"][:20]) > 3:
        return False
    return True

dataset = dataset.filter(filter_quality)

# Deduplicate on a key column (exact match)
def deduplicate(dataset, column="prompt"):
    seen = set()
    indices = []
    for i, example in enumerate(dataset):
        key = example[column]
        if key not in seen:
            seen.add(key)
            indices.append(i)
    return dataset.select(indices)

dataset = deduplicate(dataset)
```

## Training on HuggingFace Jobs

### Using HF Jobs MCP Tool

```python
# If using Claude Code with the HF Jobs MCP server,
# this script is submitted via the hf_jobs() MCP tool.

training_script = '''
from trl import SFTTrainer, SFTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
dataset = load_dataset("your-org/your-dataset", split="train")

config = SFTConfig(
    output_dir="./output",
    max_seq_length=2048,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    bf16=True,
    push_to_hub=True,
    hub_model_id="your-org/fine-tuned-model",
)

trainer = SFTTrainer(model=model, args=config, train_dataset=dataset, tokenizer=tokenizer)
trainer.train()
'''

# Submit via MCP:
# hf_jobs("uv", {"script": training_script, "gpu": "a100"})
```

### Cost Estimation

```python
# Rough cost estimates for HF Jobs / cloud GPUs (illustrative rates)
TRAINING_COSTS = {
    # GPU type: (hourly_rate_usd, tokens_per_hour_for_8B)
    "a10g": (1.50, 50_000_000),
    "a100_40gb": (3.50, 150_000_000),
    "a100_80gb": (5.00, 200_000_000),
    "h100": (8.00, 400_000_000),
}

def estimate_cost(
    model_size: str,
    dataset_tokens: int,
    epochs: int,
    gpu_type: str = "a100_40gb",
) -> dict:
    rate, throughput = TRAINING_COSTS[gpu_type]
    total_tokens = dataset_tokens * epochs
    hours = total_tokens / throughput
    cost = hours * rate
    return {
        "gpu": gpu_type,
        "estimated_hours": round(hours, 1),
        "estimated_cost": f"${cost:.2f}",
        "total_tokens": f"{total_tokens:,}",
    }

# Example: 10M-token dataset, 3 epochs on an A100 40GB
estimate_cost("8B", 10_000_000, 3, "a100_40gb")
# {'gpu': 'a100_40gb', 'estimated_hours': 0.2, 'estimated_cost': '$0.70', 'total_tokens': '30,000,000'}
```

## GGUF Conversion for Local Deployment

```python
# Convert to GGUF for llama.cpp / Ollama
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load your fine-tuned model
model = AutoModelForCausalLM.from_pretrained("./fine-tuned-model")
tokenizer = AutoTokenizer.from_pretrained("./fine-tuned-model")

# Save in a format ready for conversion
model.save_pretrained("./model-for-gguf", safe_serialization=True)
tokenizer.save_pretrained("./model-for-gguf")

# Then use llama.cpp: convert to GGUF first, then quantize:
# python convert_hf_to_gguf.py ./model-for-gguf --outfile model-f16.gguf --outtype f16
# ./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
```

### Quantization Options

| Type   | Size Reduction (vs FP32) | Quality Loss | Use Case              |
|--------|--------------------------|--------------|-----------------------|
| f16    | 2x                       | None         | Best quality          |
| q8_0   | 4x                       | Minimal      | Good balance          |
| q4_k_m | 8x                       | Small        | Production            |
| q4_0   | 8x                       | Moderate     | Resource constrained  |
| q2_k   | 16x                      | Significant  | Extreme constraints   |
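If the fine-tune used LoRA or QLoRA, merge the adapter into the base weights before running the conversion above, since the llama.cpp converter generally expects a plain Transformers checkpoint. A minimal sketch, assuming the adapter from the LoRA SFT example was saved to `./lora-sft` (the paths and model id are placeholders):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload the base model, apply the trained adapter, then fold it into the weights
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
merged = PeftModel.from_pretrained(base, "./lora-sft").merge_and_unload()

# Save the merged checkpoint; point convert_hf_to_gguf.py at this directory
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
merged.save_pretrained("./model-for-gguf", safe_serialization=True)
tokenizer.save_pretrained("./model-for-gguf")
```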
## Evaluation

### Using lm-eval-harness

```python
# Install: pip install lm-eval

# Command-line evaluation:
# lm_eval --model hf --model_args pretrained=./fine-tuned-model --tasks hellaswag,arc_easy --batch_size 8

# Programmatic
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=./fine-tuned-model",
    tasks=["hellaswag", "arc_easy", "mmlu"],
    batch_size=8,
)

print(results["results"])
```

### Custom Evaluation

```python
def evaluate_on_test_set(model, tokenizer, test_dataset):
    correct = 0
    total = 0

    for example in test_dataset:
        prompt = example["prompt"]
        expected = example["expected"]

        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=100)
        response = tokenizer.decode(outputs[0], skip_special_tokens=True)

        if expected.lower() in response.lower():
            correct += 1
        total += 1

    return {"accuracy": correct / total, "total": total}
```

## Best Practices

### Training Checklist

```yaml
before_training:
  - [ ] Validate dataset format and quality
  - [ ] Check GPU memory requirements
  - [ ] Set up monitoring (W&B, TensorBoard)
  - [ ] Configure checkpointing strategy
  - [ ] Test with a small subset first

during_training:
  - [ ] Monitor loss curves
  - [ ] Watch for gradient issues
  - [ ] Check the learning rate schedule
  - [ ] Validate checkpoints periodically

after_training:
  - [ ] Evaluate on a held-out test set
  - [ ] Compare with the base model
  - [ ] Test on diverse prompts
  - [ ] Convert to the desired format (GGUF, etc.)
  - [ ] Push to the Hub with a model card
```

### Hyperparameter Guidelines

```python
# SFT defaults
SFT_DEFAULTS = {
    "learning_rate": 2e-5,       # Full fine-tune
    "learning_rate_lora": 2e-4,  # LoRA (higher)
    "batch_size": 4,
    "gradient_accumulation": 4,  # Effective batch size = 16
    "epochs": 3,                 # Typically 1-3
    "warmup_ratio": 0.03,
    "weight_decay": 0.01,
}

# DPO defaults
DPO_DEFAULTS = {
    "learning_rate": 5e-7,  # Much lower than SFT
    "beta": 0.1,            # KL penalty
    "epochs": 1,            # Usually one epoch is enough
}
```

## Resources

- [TRL Documentation](https://huggingface.co/docs/trl)
- [PEFT Documentation](https://huggingface.co/docs/peft)
- [HuggingFace Hub](https://huggingface.co/models)
- [HuggingFace Jobs](https://huggingface.co/jobs)
- [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness)
- [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) - High-level training framework