---
name: faion-ml-ops
description: "ML operations: fine-tuning (LoRA, QLoRA), model evaluation, cost optimization, observability."
user-invocable: false
allowed-tools: Read, Write, Edit, Glob, Grep, Bash, Task, AskUserQuestion, TodoWrite
---

> **Entry point:** `/faion-net` — invoke this skill for automatic routing to the appropriate domain.

# ML Ops Skill

**Communication: User's language. Code: English.**

## Purpose

Handles ML model operations: fine-tuning, evaluation, cost management, and observability.

## Context Discovery

### Auto-Investigation

Check these project signals before asking questions:

| Signal | Where to Check | What to Look For |
|--------|----------------|------------------|
| **Dependencies** | requirements.txt | transformers, peft, openai, tiktoken, langsmith |
| **Training data** | /data, /datasets | JSONL files for fine-tuning |
| **Logs/metrics** | Grep for "langsmith", "wandb", "mlflow" | Existing observability tools |
| **Cost tracking** | Grep for "tiktoken", "count_tokens" | Token counting implementation |

### Discovery Questions

```yaml
question: "What ML operation are you working on?"
header: "Operation Type"
multiSelect: false
options:
  - label: "Fine-tuning LLM"
    description: "Custom model training (OpenAI API, LoRA, QLoRA)"
  - label: "Model evaluation"
    description: "Benchmark performance, LLM-as-judge"
  - label: "Cost optimization"
    description: "Reduce API costs, prompt caching, batching"
  - label: "Observability/monitoring"
    description: "Track LLM usage, traces, performance"
```

```yaml
question: "For fine-tuning: dataset size and approach?"
header: "Fine-tuning Strategy"
multiSelect: false
options:
  - label: "<100 examples - use few-shot prompting instead"
    description: "Too small for fine-tuning, improve prompts"
  - label: "100-1000 examples - OpenAI fine-tuning"
    description: "Use OpenAI API fine-tuning endpoint"
  - label: ">1000 examples - LoRA/QLoRA"
    description: "Efficient parameter fine-tuning"
  - label: "Not fine-tuning"
    description: "Skip this question"
```

```yaml
question: "Which observability tools?"
header: "Monitoring Stack"
multiSelect: true
options:
  - label: "LangSmith (recommended)"
    description: "LangChain native tracing"
  - label: "Langfuse (open-source)"
    description: "Self-hosted observability"
  - label: "Custom logging"
    description: "Build custom tracking"
  - label: "None yet"
    description: "Starting from scratch"
```

## Scope

| Area | Coverage |
|------|----------|
| **Fine-tuning** | LoRA, QLoRA, OpenAI fine-tuning, datasets |
| **Evaluation** | Metrics, benchmarks, frameworks |
| **Cost Optimization** | Token management, caching, batch APIs |
| **Observability** | LLM monitoring, tracing, logging |

## Quick Start

| Task | Files |
|------|-------|
| Fine-tune OpenAI | fine-tuning-openai-basics.md → fine-tuning-openai-production.md |
| Fine-tune LoRA | lora-qlora.md → finetuning-basics.md |
| Cost optimization | llm-cost-basics.md → cost-reduction-strategies.md |
| Evaluation | evaluation-metrics.md → evaluation-framework.md |
| Observability | llm-observability.md → llm-observability-stack-2026.md |

## Methodologies (15)

**Fine-tuning (5):**
- finetuning-basics: Fundamentals, when to fine-tune
- finetuning-datasets: Data preparation, quality
- fine-tuning-openai-basics: OpenAI API fine-tuning
- fine-tuning-openai-production: Production deployment
- lora-qlora: Efficient fine-tuning, parameter selection

**Evaluation (3):**
- evaluation-metrics: Accuracy, F1, perplexity, task metrics
- evaluation-framework: LLM-as-judge, human eval
- evaluation-benchmarks: MMLU, HumanEval, industry benchmarks

**Cost Optimization (2):**
- llm-cost-basics: Token counting, pricing models
- cost-reduction-strategies: Caching, compression, batching

**Observability (5):**
- llm-observability: Fundamentals, why monitor
- llm-observability-stack: Tools selection
- llm-observability-stack-2026: Latest tools (LangSmith, Langfuse)
- llm-management-observability: End-to-end management

## Code Examples

### OpenAI Fine-tuning

```python
import time

from openai import OpenAI

client = OpenAI()

# Upload training data (chat-format JSONL)
file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune"
)

# Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={"n_epochs": 3}
)

# Poll until the job reaches a terminal state
while True:
    job = client.fine_tuning.jobs.retrieve(job.id)
    if job.status in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(60)
```

### LoRA Fine-tuning

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling factor (alpha / r scales the update)
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.1,
    bias="none"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check: only a small fraction is trainable
```

### Cost Tracking

```python
import tiktoken

def count_tokens(text, model="gpt-4o"):
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

def estimate_cost(prompt, completion, model="gpt-4o"):
    prompt_tokens = count_tokens(prompt, model)
    completion_tokens = count_tokens(completion, model)
    # GPT-4o list pricing: $5 / 1M input tokens, $15 / 1M output tokens
    prompt_cost = prompt_tokens * 0.000005
    completion_cost = completion_tokens * 0.000015
    return prompt_cost + completion_cost
```

### LLM Observability with LangSmith

```python
from langsmith import traceable

@traceable
def rag_pipeline(query: str) -> str:
    # Retrieval (retrieve and generate are your own pipeline functions)
    docs = retrieve(query)
    # Generation
    response = generate(query, docs)
    return response
```
## Fine-tuning Decision Matrix

| Scenario | Approach |
|----------|----------|
| Small dataset (<100 examples) | Few-shot prompting |
| Medium dataset (100-1000 examples) | OpenAI fine-tuning |
| Large dataset (>1000 examples) | LoRA/QLoRA |
| Custom behavior | Fine-tuning |
| New knowledge | RAG (not fine-tuning) |

## Cost Reduction Strategies

| Strategy | Savings | Trade-off |
|----------|---------|-----------|
| Prompt caching | 90% on cached tokens | Cold-start cost |
| Batch API | 50% | 24h latency |
| Smaller models | 80%+ | Lower quality |
| Context pruning | Variable | May lose context |
| Output limits | Variable | Truncated responses |

## Evaluation Frameworks

| Framework | Use Case |
|-----------|----------|
| **LangSmith** | Production monitoring, traces |
| **Langfuse** | Open-source observability |
| **PromptLayer** | Prompt versioning |
| **Weights & Biases** | Experiment tracking |

## Related Skills

| Skill | Relationship |
|-------|--------------|
| faion-llm-integration | Provides APIs to optimize |
| faion-rag-engineer | RAG evaluation |
| faion-devops-engineer | Model deployment |

---

*ML Ops v1.0 | 15 methodologies*