---
name: hugging-face-jobs
description: This skill should be used when users want to run any workload on Hugging Face Jobs infrastructure. Covers UV scripts, Docker-based jobs, hardware selection, cost estimation, authentication with tokens, secrets management, timeout configuration, and result persistence. Designed for general-purpose compute workloads including data processing, inference, experiments, batch jobs, and any Python-based tasks. Should be invoked for tasks involving cloud compute, GPU workloads, or when users mention running jobs on Hugging Face infrastructure without local setup.
license: Complete terms in LICENSE.txt
---

# Running Workloads on Hugging Face Jobs

## Overview

Run any workload on fully managed Hugging Face infrastructure. No local setup required—jobs run on cloud CPUs, GPUs, or TPUs and can persist results to the Hugging Face Hub.

**Common use cases:**

- **Data Processing** - Transform, filter, or analyze large datasets
- **Batch Inference** - Run inference on thousands of samples
- **Experiments & Benchmarks** - Reproducible ML experiments
- **Model Training** - Fine-tune models (see `model-trainer` skill for TRL-specific training)
- **Synthetic Data Generation** - Generate datasets using LLMs
- **Development & Testing** - Test code without local GPU setup
- **Scheduled Jobs** - Automate recurring tasks

**For model training specifically:** See the `model-trainer` skill for TRL-based training workflows.

## When to Use This Skill

Use this skill when users want to:

- Run Python workloads on cloud infrastructure
- Execute jobs without local GPU/TPU setup
- Process data at scale
- Run batch inference or experiments
- Schedule recurring tasks
- Use GPUs/TPUs for any workload
- Persist results to the Hugging Face Hub

## Key Directives

When assisting with jobs:

1. **ALWAYS use the `hf_jobs()` MCP tool** - Submit jobs using `hf_jobs("uv", {...})` or `hf_jobs("run", {...})`. The `script` parameter accepts Python code directly. Do NOT save scripts to local files unless the user explicitly requests it; pass the script content as a string to `hf_jobs()`.
2. **Always handle authentication** - Jobs that interact with the Hub require `HF_TOKEN` via secrets. See the Token Usage section below.
3. **Provide job details after submission** - After submitting, provide the job ID, monitoring URL, and estimated time, and note that the user can request status checks later.
4. **Set appropriate timeouts** - The default 30 minutes may be insufficient for long-running tasks.

## Prerequisites Checklist

Before starting any job, verify:

### ✅ **Account & Authentication**

- Hugging Face account with a [Pro](https://hf.co/pro), [Team](https://hf.co/enterprise), or [Enterprise](https://hf.co/enterprise) plan (Jobs require a paid plan)
- Authenticated login: check with `hf_whoami()`
- **HF_TOKEN for Hub access** ⚠️ CRITICAL
  - Required for any Hub operations (push models/datasets, download private repos, etc.)
  - Token must have appropriate permissions (read for downloads, write for uploads)

### ✅ **Token Usage** (See Token Usage section for details)

**When tokens are required:**

- Pushing models/datasets to Hub
- Accessing private repositories
- Using Hub APIs in scripts
- Any authenticated Hub operations

**How to provide tokens:**

```python
{
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # Recommended: automatic token
}
```

**⚠️ CRITICAL:** The `$HF_TOKEN` placeholder is automatically replaced with your logged-in token. Never hardcode tokens in scripts.
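
Before submitting anything, the login check above can be scripted. A minimal sketch: `huggingface_hub.whoami()` is the real API, while the helper functions and messages here are illustrative.

```python
# Pre-flight check before submitting jobs: confirm a usable login.
# Sketch only; `whoami` is the real huggingface_hub API, the helpers
# and messages are our own.
def describe_user(user_info: dict) -> str:
    """Format the dict returned by huggingface_hub.whoami()."""
    return f"Logged in as: {user_info.get('name', '<unknown>')}"

def require_login() -> str:
    from huggingface_hub import whoami
    try:
        return describe_user(whoami())  # raises if no valid token is stored
    except Exception:
        raise SystemExit("Not authenticated. Run `hf auth login` first.")

# print(require_login())
```

Run this locally (or use `hf_whoami()` via MCP) before submitting any job that needs Hub access.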
## Token Usage Guide

### Understanding Tokens

**What are HF Tokens?**

- Authentication credentials for the Hugging Face Hub
- Required for authenticated operations (push, private repos, API access)
- Stored securely on your machine after `hf auth login`

**Token Types:**

- **Read Token** - Can download models/datasets, read private repos
- **Write Token** - Can push models/datasets, create repos, modify content
- **Organization Token** - Can act on behalf of an organization

### When Tokens Are Required

**Always Required:**

- Pushing models/datasets to Hub
- Accessing private repositories
- Creating new repositories
- Modifying existing repositories
- Using Hub APIs programmatically

**Not Required:**

- Downloading public models/datasets
- Running jobs that don't interact with the Hub
- Reading public repository information

### How to Provide Tokens to Jobs

#### Method 1: Automatic Token (Recommended)

```python
hf_jobs("uv", {
    "script": "your_script.py",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # ✅ Automatic replacement
})
```

**How it works:**

- `$HF_TOKEN` is a placeholder that gets replaced with your actual token
- Uses the token from your logged-in session (`hf auth login`)
- Most secure and convenient method
- Token is encrypted server-side when passed as a secret

**Benefits:**

- No token exposure in code
- Uses your current login session
- Automatically updated if you re-login
- Works seamlessly with MCP tools

#### Method 2: Explicit Token (Not Recommended)

```python
hf_jobs("uv", {
    "script": "your_script.py",
    "secrets": {"HF_TOKEN": "hf_abc123..."}  # ⚠️ Hardcoded token
})
```

**When to use:**

- Only if the automatic token doesn't work
- Testing with a specific token
- Organization tokens (use with caution)

**Security concerns:**

- Token visible in code/logs
- Must manually update if the token rotates
- Risk of token exposure

#### Method 3: Environment Variable (Less Secure)

```python
hf_jobs("uv", {
    "script": "your_script.py",
    "env": {"HF_TOKEN": "hf_abc123..."}  # ⚠️ Less secure than secrets
})
```

**Difference from secrets:**

- `env` variables are visible in job logs
- `secrets` are encrypted server-side
- Always prefer `secrets` for tokens

### Using Tokens in Scripts

**In your Python script, tokens are available as environment variables:**

```python
# /// script
# dependencies = ["huggingface-hub"]
# ///
import os
from huggingface_hub import HfApi

# Token is automatically available if passed via secrets
token = os.environ.get("HF_TOKEN")

# Use with the Hub API
api = HfApi(token=token)

# Or let huggingface_hub auto-detect
api = HfApi()  # Automatically uses the HF_TOKEN env var
```

**Best practices:**

- Don't hardcode tokens in scripts
- Use `os.environ.get("HF_TOKEN")` to access the token
- Let `huggingface_hub` auto-detect when possible
- Verify the token exists before Hub operations

### Token Verification

**Check if you're logged in:**

```python
from huggingface_hub import whoami

user_info = whoami()  # Returns your username if authenticated
```

**Verify the token in a job:**

```python
import os

assert "HF_TOKEN" in os.environ, "HF_TOKEN not found!"
token = os.environ["HF_TOKEN"]
print(f"Token starts with: {token[:7]}...")  # Should start with "hf_"
```

### Common Token Issues

**Error: 401 Unauthorized**

- **Cause:** Token missing or invalid
- **Fix:** Add `secrets={"HF_TOKEN": "$HF_TOKEN"}` to the job config
- **Verify:** Check that `hf_whoami()` works locally

**Error: 403 Forbidden**

- **Cause:** Token lacks required permissions
- **Fix:** Ensure the token has write permissions for push operations
- **Check:** Token type at https://huggingface.co/settings/tokens

**Error: Token not found in environment**

- **Cause:** `secrets` not passed or wrong key name
- **Fix:** Use `secrets={"HF_TOKEN": "$HF_TOKEN"}` (not `env`)
- **Verify:** The script checks `os.environ.get("HF_TOKEN")`

**Error: Repository access denied**

- **Cause:** Token doesn't have access to the private repo
- **Fix:** Use a token from an account with access
- **Check:** Verify the repo visibility and your permissions

### Token Security Best Practices

1. **Never commit tokens** - Use the `$HF_TOKEN` placeholder or environment variables
2. **Use secrets, not env** - Secrets are encrypted server-side
3. **Rotate tokens regularly** - Generate new tokens periodically
4. **Use minimal permissions** - Create tokens with only the needed permissions
5. **Don't share tokens** - Each user should use their own token
6. **Monitor token usage** - Check token activity in Hub settings

### Complete Token Example

```python
# Example: Push results to Hub
hf_jobs("uv", {
    "script": """
# /// script
# dependencies = ["huggingface-hub", "datasets"]
# ///
import os
from huggingface_hub import HfApi
from datasets import Dataset

# Verify the token is available
assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"

# Use the token for Hub operations
api = HfApi(token=os.environ["HF_TOKEN"])

# Create and push a dataset
data = {"text": ["Hello", "World"]}
dataset = Dataset.from_dict(data)
dataset.push_to_hub("username/my-dataset", token=os.environ["HF_TOKEN"])

print("✅ Dataset pushed successfully!")
""",
    "flavor": "cpu-basic",
    "timeout": "30m",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # ✅ Token provided securely
})
```

## Quick Start: Two Approaches

### Approach 1: UV Scripts (Recommended)

UV scripts use PEP 723 inline dependencies for clean, self-contained workloads.

**MCP Tool:**

```python
hf_jobs("uv", {
    "script": """
# /// script
# dependencies = ["transformers", "torch"]
# ///
from transformers import pipeline
import torch

# Your workload here
classifier = pipeline("sentiment-analysis")
result = classifier("I love Hugging Face!")
print(result)
""",
    "flavor": "cpu-basic",
    "timeout": "30m"
})
```

**CLI Equivalent:**

```bash
hf jobs uv run my_script.py --flavor cpu-basic --timeout 30m
```

**Python API:**

```python
from huggingface_hub import run_uv_job

run_uv_job("my_script.py", flavor="cpu-basic", timeout="30m")
```

**Benefits:** Direct MCP tool usage, clean code, dependencies declared inline, no file saving required

**When to use:** Default choice for all workloads, custom logic, any scenario requiring `hf_jobs()`

#### Custom Docker Images for UV Scripts

By default, UV scripts use `ghcr.io/astral-sh/uv:python3.12-bookworm-slim`. For ML workloads with complex dependencies, use pre-built images:

```python
hf_jobs("uv", {
    "script": "inference.py",
    "image": "vllm/vllm-openai:latest",  # Pre-built image with vLLM
    "flavor": "a10g-large"
})
```

**CLI:**

```bash
hf jobs uv run --image vllm/vllm-openai:latest --flavor a10g-large inference.py
```

**Benefits:** Faster startup, pre-installed dependencies, optimized for specific frameworks

#### Python Version

By default, UV scripts use Python 3.12.
Specify a different version:

```python
hf_jobs("uv", {
    "script": "my_script.py",
    "python": "3.11",  # Use Python 3.11
    "flavor": "cpu-basic"
})
```

**Python API:**

```python
from huggingface_hub import run_uv_job

run_uv_job("my_script.py", python="3.11")
```

#### Working with Scripts

⚠️ **Important:** There are *two* "script path" stories depending on how you run Jobs:

- **Using the `hf_jobs()` MCP tool (recommended in this repo):** the `script` value must be **inline code** (a string) or a **URL**. A local filesystem path (like `"./scripts/foo.py"`) won't exist inside the remote container.
- **Using the `hf jobs uv run` CLI:** local file paths **do work** (the CLI uploads your script).

**Common mistake with the `hf_jobs()` MCP tool:**

```python
# ❌ Will fail (remote container can't see your local path)
hf_jobs("uv", {"script": "./scripts/foo.py"})
```

**Correct patterns with the `hf_jobs()` MCP tool:**

```python
# ✅ Inline: read the local script file and pass its *contents*
from pathlib import Path

script = Path("hf-jobs/scripts/foo.py").read_text()
hf_jobs("uv", {"script": script})

# ✅ URL: host the script somewhere reachable
hf_jobs("uv", {"script": "https://huggingface.co/datasets/uv-scripts/.../raw/main/foo.py"})

# ✅ URL from GitHub
hf_jobs("uv", {"script": "https://raw.githubusercontent.com/huggingface/trl/main/trl/scripts/sft.py"})
```

**CLI equivalent (local paths supported):**

```bash
hf jobs uv run ./scripts/foo.py -- --your --args
```

#### Adding Dependencies at Runtime

Add extra dependencies beyond what's in the PEP 723 header:

```python
hf_jobs("uv", {
    "script": "inference.py",
    "dependencies": ["transformers", "torch>=2.0"],  # Extra deps
    "flavor": "a10g-small"
})
```

**Python API:**

```python
from huggingface_hub import run_uv_job

run_uv_job("inference.py", dependencies=["transformers", "torch>=2.0"])
```

### Approach 2: Docker-Based Jobs

Run jobs with custom Docker images and commands.

**MCP Tool:**

```python
hf_jobs("run", {
    "image": "python:3.12",
    "command": ["python", "-c", "print('Hello from HF Jobs!')"],
    "flavor": "cpu-basic",
    "timeout": "30m"
})
```

**CLI Equivalent:**

```bash
hf jobs run python:3.12 python -c "print('Hello from HF Jobs!')"
```

**Python API:**

```python
from huggingface_hub import run_job

run_job(image="python:3.12", command=["python", "-c", "print('Hello!')"], flavor="cpu-basic")
```

**Benefits:** Full Docker control, use pre-built images, run any command

**When to use:** Need specific Docker images, non-Python workloads, complex environments

**Example with GPU:**

```python
hf_jobs("run", {
    "image": "pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel",
    "command": ["python", "-c", "import torch; print(torch.cuda.get_device_name())"],
    "flavor": "a10g-small",
    "timeout": "1h"
})
```

**Using Hugging Face Spaces as Images:**

You can use Docker images from HF Spaces:

```python
hf_jobs("run", {
    "image": "hf.co/spaces/lhoestq/duckdb",  # Space as Docker image
    "command": ["duckdb", "-c", "SELECT 'Hello from DuckDB!'"],
    "flavor": "cpu-basic"
})
```

**CLI:**

```bash
hf jobs run hf.co/spaces/lhoestq/duckdb duckdb -c "SELECT 'Hello!'"
```

### Finding More UV Scripts on Hub

The `uv-scripts` organization provides ready-to-use UV scripts stored as datasets on the Hugging Face Hub:

```python
# Discover available UV script collections
dataset_search({"author": "uv-scripts", "sort": "downloads", "limit": 20})

# Explore a specific collection
hub_repo_details(["uv-scripts/classification"], repo_type="dataset", include_readme=True)
```

**Popular collections:** OCR, classification, synthetic-data, vLLM, dataset-creation

## Hardware Selection

> **Reference:** [HF Jobs Hardware Docs](https://huggingface.co/docs/hub/en/spaces-config-reference) (updated 07/2025)

| Workload Type | Recommended Hardware | Use Case |
|---------------|---------------------|----------|
| Data processing, testing | `cpu-basic`, `cpu-upgrade` | Lightweight tasks |
| Small models, demos | `t4-small` | <1B models, quick tests |
| Medium models | `t4-medium`, `l4x1` | 1-7B models |
| Large models, production | `a10g-small`, `a10g-large` | 7-13B models |
| Very large models | `a100-large` | 13B+ models |
| Batch inference | `a10g-large`, `a100-large` | High-throughput |
| Multi-GPU workloads | `l4x4`, `a10g-largex2`, `a10g-largex4` | Parallel/large models |
| TPU workloads | `v5e-1x1`, `v5e-2x2`, `v5e-2x4` | JAX/Flax, TPU-optimized |

**All Available Flavors:**

- **CPU:** `cpu-basic`, `cpu-upgrade`
- **GPU:** `t4-small`, `t4-medium`, `l4x1`, `l4x4`, `a10g-small`, `a10g-large`, `a10g-largex2`, `a10g-largex4`, `a100-large`
- **TPU:** `v5e-1x1`, `v5e-2x2`, `v5e-2x4`

**Guidelines:**

- Start with smaller hardware for testing
- Scale up based on actual needs
- Use multi-GPU for parallel workloads or large models
- Use TPUs for JAX/Flax workloads
- See `references/hardware_guide.md` for detailed specifications

## Critical: Saving Results

**⚠️ EPHEMERAL ENVIRONMENT—MUST PERSIST RESULTS**

The Jobs environment is temporary. All files are deleted when the job ends. If results aren't persisted, **ALL WORK IS LOST**.

### Persistence Options

**1. Push to Hugging Face Hub (Recommended)**

```python
# Push models
model.push_to_hub("username/model-name", token=os.environ["HF_TOKEN"])

# Push datasets
dataset.push_to_hub("username/dataset-name", token=os.environ["HF_TOKEN"])

# Push artifacts
api.upload_file(
    path_or_fileobj="results.json",
    path_in_repo="results.json",
    repo_id="username/results",
    token=os.environ["HF_TOKEN"]
)
```

**2. Use External Storage**

```python
# Upload to S3, GCS, etc.
import boto3

s3 = boto3.client('s3')
s3.upload_file('results.json', 'my-bucket', 'results.json')
```

**3. Send Results via API**

```python
# POST results to your API
import requests

requests.post("https://your-api.com/results", json=results)
```

### Required Configuration for Hub Push

**In the job submission:**

```python
{
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # Enables authentication
}
```

**In the script:**

```python
import os
from huggingface_hub import HfApi

# Token automatically available from secrets
api = HfApi(token=os.environ.get("HF_TOKEN"))

# Push your results
api.upload_file(...)
```

### Verification Checklist

Before submitting:

- [ ] Results persistence method chosen
- [ ] `secrets={"HF_TOKEN": "$HF_TOKEN"}` if using Hub
- [ ] Script handles a missing token gracefully
- [ ] Test that the persistence path works

**See:** `references/hub_saving.md` for a detailed Hub persistence guide

## Timeout Management

**⚠️ DEFAULT: 30 MINUTES**

Jobs automatically stop after the timeout. For long-running tasks like training, always set a custom timeout.

### Setting Timeouts

**MCP Tool:**

```python
{
    "timeout": "2h"  # 2 hours
}
```

**Supported formats:**

- Integer/float: seconds (e.g., `300` = 5 minutes)
- String with suffix: `"5m"` (minutes), `"2h"` (hours), `"1d"` (days)
- Examples: `"90m"`, `"2h"`, `"1.5h"`, `300`, `"1d"`

**Python API:**

```python
from huggingface_hub import run_job, run_uv_job

run_job(image="python:3.12", command=[...], timeout="2h")
run_uv_job("script.py", timeout=7200)  # 2 hours in seconds
```

### Timeout Guidelines

| Scenario | Recommended | Notes |
|----------|-------------|-------|
| Quick test | 10-30 min | Verify setup |
| Data processing | 1-2 hours | Depends on data size |
| Batch inference | 2-4 hours | Large batches |
| Experiments | 4-8 hours | Multiple runs |
| Long-running | 8-24 hours | Production workloads |

**Always add a 20-30% buffer** for setup, network delays, and cleanup.
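
The buffer guideline can be applied mechanically when choosing a timeout. A small sketch; the helper name and the 25% default are our own choices inside the suggested 20-30% range:

```python
# Turn an expected runtime into a Jobs-style timeout string, adding a
# safety buffer for setup, network delays, and cleanup. The 25% default
# is an assumption within the 20-30% range suggested above.
import math

def timeout_with_buffer(expected_minutes: float, buffer: float = 0.25) -> str:
    """Return a timeout string such as '150m', rounded up to whole minutes."""
    total = math.ceil(expected_minutes * (1 + buffer))
    return f"{total}m"

print(timeout_with_buffer(120))  # a 2-hour run gets a "150m" timeout
```

Pass the result as the job's `"timeout"` value.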

**On timeout:** The job is killed immediately and all unsaved progress is lost.

## Cost Estimation

**General guidelines:**

```
Total Cost = (Hours of runtime) × (Cost per hour)
```

**Example calculations:**

**Quick test:**

- Hardware: cpu-basic ($0.10/hour)
- Time: 15 minutes (0.25 hours)
- Cost: $0.03

**Data processing:**

- Hardware: l4x1 ($2.50/hour)
- Time: 2 hours
- Cost: $5.00

**Batch inference:**

- Hardware: a10g-large ($5/hour)
- Time: 4 hours
- Cost: $20.00

**Cost optimization tips:**

1. Start small - Test on cpu-basic or t4-small
2. Monitor runtime - Set appropriate timeouts
3. Use checkpoints - Resume if a job fails
4. Optimize code - Reduce unnecessary compute
5. Choose the right hardware - Don't over-provision

## Monitoring and Tracking

### Check Job Status

**MCP Tool:**

```python
# List all jobs
hf_jobs("ps")

# Inspect a specific job
hf_jobs("inspect", {"job_id": "your-job-id"})

# View logs
hf_jobs("logs", {"job_id": "your-job-id"})

# Cancel a job
hf_jobs("cancel", {"job_id": "your-job-id"})
```

**Python API:**

```python
from huggingface_hub import list_jobs, inspect_job, fetch_job_logs, cancel_job

# List your jobs
jobs = list_jobs()

# List running jobs only
running = [j for j in list_jobs() if j.status.stage == "RUNNING"]

# Inspect a specific job
job_info = inspect_job(job_id="your-job-id")

# View logs
for log in fetch_job_logs(job_id="your-job-id"):
    print(log)

# Cancel a job
cancel_job(job_id="your-job-id")
```

**CLI:**

```bash
hf jobs ps                # List jobs
hf jobs logs <job-id>     # View logs
hf jobs cancel <job-id>   # Cancel a job
```

**Remember:** Wait for the user to request status checks. Avoid polling repeatedly.

### Job URLs

After submission, jobs have monitoring URLs:

```
https://huggingface.co/jobs/username/job-id
```

View logs, status, and details in the browser.
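
For a one-shot check (rather than polling), the Python API calls above can be combined into a short report. A sketch: `inspect_job` and `fetch_job_logs` are the real `huggingface_hub` functions, while the `tail_lines` helper and the job id are illustrative.

```python
# One-shot status report: print the job's current stage plus the last
# few log lines, without repeated polling. `inspect_job` and
# `fetch_job_logs` are the real huggingface_hub Jobs API; the helper
# and the job id are our own.
def tail_lines(lines, n=5):
    """Keep only the last n log lines."""
    lines = list(lines)
    return lines[-n:] if n else lines

def job_report(job_id: str, tail: int = 5) -> None:
    from huggingface_hub import fetch_job_logs, inspect_job
    info = inspect_job(job_id=job_id)
    print(f"Job {job_id}: {info.status.stage}")
    for line in tail_lines(fetch_job_logs(job_id=job_id), tail):
        print(line)

# job_report("your-job-id")  # placeholder job id
```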
### Wait for Multiple Jobs

```python
import time
from huggingface_hub import inspect_job, run_job

# Run multiple jobs
jobs = [run_job(image=img, command=cmd) for img, cmd in workloads]

# Wait for all to complete
for job in jobs:
    while inspect_job(job_id=job.id).status.stage not in ("COMPLETED", "ERROR"):
        time.sleep(10)
```

## Scheduled Jobs

Run jobs on a schedule using CRON expressions or predefined schedules.

**MCP Tool:**

```python
# Schedule a UV script that runs every hour
hf_jobs("scheduled uv", {
    "script": "your_script.py",
    "schedule": "@hourly",
    "flavor": "cpu-basic"
})

# Schedule with CRON syntax
hf_jobs("scheduled uv", {
    "script": "your_script.py",
    "schedule": "0 9 * * 1",  # 9 AM every Monday
    "flavor": "cpu-basic"
})

# Schedule a Docker-based job
hf_jobs("scheduled run", {
    "image": "python:3.12",
    "command": ["python", "-c", "print('Scheduled!')"],
    "schedule": "@daily",
    "flavor": "cpu-basic"
})
```

**Python API:**

```python
from huggingface_hub import create_scheduled_job, create_scheduled_uv_job

# Schedule a Docker job
create_scheduled_job(
    image="python:3.12",
    command=["python", "-c", "print('Running on schedule!')"],
    schedule="@hourly"
)

# Schedule a UV script
create_scheduled_uv_job("my_script.py", schedule="@daily", flavor="cpu-basic")

# Schedule with GPU
create_scheduled_uv_job(
    "ml_inference.py",
    schedule="0 */6 * * *",  # Every 6 hours
    flavor="a10g-small"
)
```

**Available schedules:**

- `@annually`, `@yearly` - Once per year
- `@monthly` - Once per month
- `@weekly` - Once per week
- `@daily` - Once per day
- `@hourly` - Once per hour
- CRON expression - Custom schedule (e.g., `"*/5 * * * *"` for every 5 minutes)

**Manage scheduled jobs:**

```python
# MCP Tool
hf_jobs("scheduled ps")                          # List scheduled jobs
hf_jobs("scheduled inspect", {"job_id": "..."})  # Inspect details
hf_jobs("scheduled suspend", {"job_id": "..."})  # Pause
hf_jobs("scheduled resume", {"job_id": "..."})   # Resume
hf_jobs("scheduled delete", {"job_id": "..."})   # Delete
```

**Python API for management:**

```python
from huggingface_hub import (
    list_scheduled_jobs, inspect_scheduled_job, suspend_scheduled_job,
    resume_scheduled_job, delete_scheduled_job
)

# List all scheduled jobs
scheduled = list_scheduled_jobs()

# Inspect a scheduled job
info = inspect_scheduled_job(scheduled_job_id)

# Suspend (pause) a scheduled job
suspend_scheduled_job(scheduled_job_id)

# Resume a scheduled job
resume_scheduled_job(scheduled_job_id)

# Delete a scheduled job
delete_scheduled_job(scheduled_job_id)
```

## Webhooks: Trigger Jobs on Events

Trigger jobs automatically when changes happen in Hugging Face repositories.

**Python API:**

```python
from huggingface_hub import create_webhook

# Create a webhook that triggers a job when a repo changes
webhook = create_webhook(
    job_id=job.id,
    watched=[
        {"type": "user", "name": "your-username"},
        {"type": "org", "name": "your-org-name"}
    ],
    domains=["repo", "discussion"],
    secret="your-secret"
)
```

**How it works:**

1. The webhook listens for changes in watched repositories
2. When triggered, the job runs with a `WEBHOOK_PAYLOAD` environment variable
3. Your script can parse the payload to understand what changed

**Use cases:**

- Auto-process new datasets when uploaded
- Trigger inference when models are updated
- Run tests when code changes
- Generate reports on repository activity

**Access the webhook payload in a script:**

```python
import os
import json

payload = json.loads(os.environ.get("WEBHOOK_PAYLOAD", "{}"))
print(f"Event type: {payload.get('event', {}).get('action')}")
```

See the [Webhooks Documentation](https://huggingface.co/docs/huggingface_hub/guides/webhooks) for more details.

## Common Workload Patterns

This repository ships ready-to-run UV scripts in `hf-jobs/scripts/`. Prefer using them instead of inventing new templates.
### Pattern 1: Dataset → Model Responses (vLLM) — `scripts/generate-responses.py`

**What it does:** loads a Hub dataset (chat `messages` or a `prompt` column), applies a model chat template, generates responses with vLLM, and **pushes** the output dataset + dataset card back to the Hub.

**Requires:** GPU + **write** token (it pushes a dataset).

```python
from pathlib import Path

script = Path("hf-jobs/scripts/generate-responses.py").read_text()

hf_jobs("uv", {
    "script": script,
    "script_args": [
        "username/input-dataset",
        "username/output-dataset",
        "--messages-column", "messages",
        "--model-id", "Qwen/Qwen3-30B-A3B-Instruct-2507",
        "--temperature", "0.7",
        "--top-p", "0.8",
        "--max-tokens", "2048",
    ],
    "flavor": "a10g-large",
    "timeout": "4h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
})
```

### Pattern 2: CoT Self-Instruct Synthetic Data — `scripts/cot-self-instruct.py`

**What it does:** generates synthetic prompts/answers via CoT Self-Instruct, optionally filters outputs (answer-consistency / RIP), then **pushes** the generated dataset + dataset card to the Hub.

**Requires:** GPU + **write** token (it pushes a dataset).

```python
from pathlib import Path

script = Path("hf-jobs/scripts/cot-self-instruct.py").read_text()

hf_jobs("uv", {
    "script": script,
    "script_args": [
        "--seed-dataset", "davanstrien/s1k-reasoning",
        "--output-dataset", "username/synthetic-math",
        "--task-type", "reasoning",
        "--num-samples", "5000",
        "--filter-method", "answer-consistency",
    ],
    "flavor": "l4x4",
    "timeout": "8h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
})
```

### Pattern 3: Streaming Dataset Stats (Polars + HF Hub) — `scripts/finepdfs-stats.py`

**What it does:** scans parquet directly from the Hub (no 300GB download), computes temporal stats, and (optionally) uploads results to a Hub dataset repo.

**Requires:** CPU is often enough; a token is needed **only** if you pass `--output-repo` (upload).

```python
from pathlib import Path

script = Path("hf-jobs/scripts/finepdfs-stats.py").read_text()

hf_jobs("uv", {
    "script": script,
    "script_args": [
        "--limit", "10000",
        "--show-plan",
        "--output-repo", "username/finepdfs-temporal-stats",
    ],
    "flavor": "cpu-upgrade",
    "timeout": "2h",
    "env": {"HF_XET_HIGH_PERFORMANCE": "1"},
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
})
```

## Common Failure Modes

### Out of Memory (OOM)

**Fix:**

1. Reduce batch size or data chunk size
2. Process data in smaller batches
3. Upgrade hardware: cpu → t4 → a10g → a100

### Job Timeout

**Fix:**

1. Check logs for the actual runtime
2. Increase the timeout with a buffer: `"timeout": "3h"`
3. Optimize code for faster execution
4. Process data in chunks

### Hub Push Failures

**Fix:**

1. Add to the job: `secrets={"HF_TOKEN": "$HF_TOKEN"}`
2. Verify the token in the script: `assert "HF_TOKEN" in os.environ`
3. Check token permissions
4. Verify the repo exists or can be created

### Missing Dependencies

**Fix:** Add to the PEP 723 header:

```python
# /// script
# dependencies = ["package1", "package2>=1.0.0"]
# ///
```

### Authentication Errors

**Fix:**

1. Check that `hf_whoami()` works locally
2. Verify `secrets={"HF_TOKEN": "$HF_TOKEN"}` in the job config
3. Re-login: `hf auth login`
4. Check that the token has the required permissions

## Troubleshooting

**Common issues:**

- Job times out → Increase the timeout, optimize code
- Results not saved → Check the persistence method, verify HF_TOKEN
- Out of memory → Reduce batch size, upgrade hardware
- Import errors → Add dependencies to the PEP 723 header
- Authentication errors → Check the token, verify the secrets parameter

**See:** `references/troubleshooting.md` for the complete troubleshooting guide

## Resources

### References (In This Skill)

- `references/token_usage.md` - Complete token usage guide
- `references/hardware_guide.md` - Hardware specs and selection
- `references/hub_saving.md` - Hub persistence guide
- `references/troubleshooting.md` - Common issues and solutions

### Scripts (In This Skill)

- `scripts/generate-responses.py` - vLLM batch generation: dataset → responses → push to Hub
- `scripts/cot-self-instruct.py` - CoT Self-Instruct synthetic data generation + filtering → push to Hub
- `scripts/finepdfs-stats.py` - Polars streaming stats over `finepdfs-edu` parquet on Hub (optional push)

### External Links

**Official Documentation:**

- [HF Jobs Guide](https://huggingface.co/docs/huggingface_hub/guides/jobs) - Main documentation
- [HF Jobs CLI Reference](https://huggingface.co/docs/huggingface_hub/guides/cli#hf-jobs) - Command line interface
- [HF Jobs API Reference](https://huggingface.co/docs/huggingface_hub/package_reference/hf_api) - Python API details
- [Hardware Flavors Reference](https://huggingface.co/docs/hub/en/spaces-config-reference) - Available hardware

**Related Tools:**

- [UV Scripts Guide](https://docs.astral.sh/uv/guides/scripts/) - PEP 723 inline dependencies
- [UV Scripts Organization](https://huggingface.co/uv-scripts) - Community UV script collection
- [HF Hub Authentication](https://huggingface.co/docs/huggingface_hub/quick-start#authentication) - Token setup
- [Webhooks Documentation](https://huggingface.co/docs/huggingface_hub/guides/webhooks) - Event triggers

## Key Takeaways

1. **Submit scripts inline** - The `script` parameter accepts Python code directly; no file saving required unless the user requests it
2. **Jobs are asynchronous** - Don't wait/poll; let the user check when ready
3. **Always set a timeout** - The default 30 minutes may be insufficient; set an appropriate timeout
4. **Always persist results** - The environment is ephemeral; without persistence, all work is lost
5. **Use tokens securely** - Always use `secrets={"HF_TOKEN": "$HF_TOKEN"}` for Hub operations
6. **Choose appropriate hardware** - Start small, scale up based on needs (see the hardware guide)
7. **Use UV scripts** - Default to `hf_jobs("uv", {...})` with inline scripts for Python workloads
8. **Handle authentication** - Verify tokens are available before Hub operations
9. **Monitor jobs** - Provide job URLs and status check commands
10. **Optimize costs** - Choose the right hardware, set appropriate timeouts

## Quick Reference: MCP Tool vs CLI vs Python API

| Operation | MCP Tool | CLI | Python API |
|-----------|----------|-----|------------|
| Run UV script | `hf_jobs("uv", {...})` | `hf jobs uv run script.py` | `run_uv_job("script.py")` |
| Run Docker job | `hf_jobs("run", {...})` | `hf jobs run image cmd` | `run_job(image, command)` |
| List jobs | `hf_jobs("ps")` | `hf jobs ps` | `list_jobs()` |
| View logs | `hf_jobs("logs", {...})` | `hf jobs logs <job-id>` | `fetch_job_logs(job_id)` |
| Cancel job | `hf_jobs("cancel", {...})` | `hf jobs cancel <job-id>` | `cancel_job(job_id)` |
| Schedule UV | `hf_jobs("scheduled uv", {...})` | - | `create_scheduled_uv_job()` |
| Schedule Docker | `hf_jobs("scheduled run", {...})` | - | `create_scheduled_job()` |