---
name: hugging-face-jobs-v2
description: "Running Workloads on Hugging Face Jobs workflow skill. Use this skill when the user needs to run workloads on Hugging Face Jobs with managed CPUs, GPUs, TPUs, secrets, and Hub persistence, and the operator should preserve the upstream workflow, copied support files, and provenance before merging or handing off."
version: "0.0.1"
category: machine-learning
tags: ["hugging-face-jobs-v2", "hugging-face-jobs", "run", "workloads", "hugging", "face", "jobs", "managed"]
complexity: advanced
risk: caution
tools: ["codex-cli", "claude-code", "cursor", "gemini-cli", "opencode"]
source: community
author: "sickn33"
date_added: "2026-04-16"
date_updated: "2026-04-25"
---

# Running Workloads on Hugging Face Jobs

## Overview

This public intake copy packages `plugins/antigravity-awesome-skills/skills/hugging-face-jobs` from `https://github.com/sickn33/antigravity-awesome-skills` into the native Omni Skills editorial shape without hiding its origin. Use it when the operator needs the upstream workflow, support files, and repository context to stay intact while the public validator and private enhancer continue their normal downstream flow.

This intake keeps the copied upstream files intact and uses the `external_source` block in `metadata.json` plus `ORIGIN.md` as the provenance anchor for review.

Imported source sections that did not map cleanly to the public headings are preserved below or in the support files. Notable imported sections: Key Directives, Prerequisites Checklist, Quick Start: Two Approaches, Hardware Selection, Critical: Saving Results, Timeout Management.

## When to Use This Skill

Use this section as the trigger filter. It should make the activation boundary explicit before the operator loads files, runs commands, or opens a pull request.

- Run Python workloads on cloud infrastructure
- Execute jobs without local GPU/TPU setup
- Process data at scale
- Run batch inference or experiments
- Schedule recurring tasks
- Use GPUs/TPUs for any workload

## Operating Table

| Situation | Start here | Why it matters |
| --- | --- | --- |
| First-time use | `metadata.json` | Confirms repository, branch, commit, and imported path through the `external_source` block before touching the copied workflow |
| Provenance review | `ORIGIN.md` | Gives reviewers a plain-language audit trail for the imported source |
| Workflow execution | `references/hardware_guide.md` | Starts with the smallest copied file that materially changes execution |
| Supporting context | `references/hub_saving.md` | Adds the next most relevant copied source file without loading the entire package |
| Handoff decision | `## Related Skills` | Helps the operator switch to a stronger native skill when the task drifts |

## Workflow

This workflow is intentionally editorial and operational at the same time. It keeps the imported source useful to the operator while still satisfying the public intake standards that feed the downstream enhancer flow.

1. Confirm the user goal, the scope of the imported workflow, and whether this skill is still the right router for the task.
2. Read the overview and provenance files before loading any copied upstream support files (a provenance-check sketch follows this list).
3. Load only the references, examples, prompts, or scripts that materially change the outcome for the current request.
4. Execute the upstream workflow while keeping provenance and source boundaries explicit in the working notes.
5. Validate the result against the upstream expectations and the evidence you can point to in the copied files.
6. Escalate or hand off to a related skill when the work moves out of this imported workflow's center of gravity.
7. Before merge or closure, record what was used, what changed, and what the reviewer still needs to verify.
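A minimal sketch of step 2, assuming `metadata.json` sits at the skill root and its `external_source` block carries the repository, branch, commit, and imported-path fields named in the Operating Table. The key names are illustrative assumptions, not a confirmed schema; adjust them to the real file.

```python
# Provenance pre-flight: surface the external_source block before loading
# any copied upstream files. Field names below are assumed for illustration.
import json
from pathlib import Path

metadata = json.loads(Path("metadata.json").read_text())
source = metadata.get("external_source", {})

for field in ("repository", "branch", "commit", "path"):  # assumed field names
    print(f"{field}: {source.get(field, 'MISSING (verify before proceeding)')}")
```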
### Imported Workflow Notes

#### Imported: Overview

Run any workload on fully managed Hugging Face infrastructure. No local setup required—jobs run on cloud CPUs, GPUs, or TPUs and can persist results to the Hugging Face Hub.

**Common use cases:**

- **Data Processing** - Transform, filter, or analyze large datasets
- **Batch Inference** - Run inference on thousands of samples
- **Experiments & Benchmarks** - Reproducible ML experiments
- **Model Training** - Fine-tune models (see `model-trainer` skill for TRL-specific training)
- **Synthetic Data Generation** - Generate datasets using LLMs
- **Development & Testing** - Test code without local GPU setup
- **Scheduled Jobs** - Automate recurring tasks

**For model training specifically:** See the `model-trainer` skill for TRL-based training workflows.

#### Imported: Key Directives

When assisting with jobs:

1. **ALWAYS use the `hf_jobs()` MCP tool** - Submit jobs using `hf_jobs("uv", {...})` or `hf_jobs("run", {...})`. The `script` parameter accepts Python code directly. Do NOT save to local files unless the user explicitly requests it. Pass the script content as a string to `hf_jobs()`.
2. **Always handle authentication** - Jobs that interact with the Hub require `HF_TOKEN` via secrets. See the Token Usage section below.
3. **Provide job details after submission** - After submitting, provide the job ID, monitoring URL, and estimated time, and note that the user can request status checks later.
4. **Set appropriate timeouts** - The default 30 minutes may be insufficient for long-running tasks.
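Taken together, the directives describe one submission shape. The sketch below combines them (inline script, `HF_TOKEN` via secrets, explicit timeout) using only parameters shown elsewhere in this skill; the workload itself is a placeholder.

```python
# A minimal submission that satisfies the directives: inline script
# (directive 1), HF_TOKEN via secrets (2), and an explicit timeout (4).
hf_jobs("uv", {
    "script": """
# /// script
# dependencies = ["datasets"]
# ///
from datasets import load_dataset

ds = load_dataset("imdb", split="train[:100]")  # placeholder workload
print(ds)
""",
    "flavor": "cpu-basic",
    "timeout": "1h",                       # default 30m is often too short
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},  # auto-replaced by the MCP tool
})
# Directive 3: after this returns, report the job ID and monitoring URL.
```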
## Examples

### Example 1: Ask for the upstream workflow directly

```text
Use @hugging-face-jobs-v2 to handle <task>. Start from the copied upstream workflow, load only the files that change the outcome, and keep provenance visible in the answer.
```

**Explanation:** This is the safest starting point when the operator needs the imported workflow, but not the entire repository.

### Example 2: Ask for a provenance-grounded review

```text
Review @hugging-face-jobs-v2 against metadata.json and ORIGIN.md, then explain which copied upstream files you would load first and why.
```

**Explanation:** Use this before review or troubleshooting when you need a precise, auditable explanation of origin and file selection.

### Example 3: Narrow the copied support files before execution

```text
Use @hugging-face-jobs-v2 for <task>. Load only the copied references, examples, or scripts that change the outcome, and name the files explicitly before proceeding.
```

**Explanation:** This keeps the skill aligned with progressive disclosure instead of loading the whole copied package by default.

### Example 4: Build a reviewer packet

```text
Review @hugging-face-jobs-v2 using the copied upstream files plus provenance, then summarize any gaps before merge.
```

**Explanation:** This is useful when the PR is waiting for human review and you want a repeatable audit packet.

### Imported Usage Notes

#### Imported: Token Usage Guide

### Understanding Tokens

**What are HF Tokens?**

- Authentication credentials for Hugging Face Hub
- Required for authenticated operations (push, private repos, API access)
- Stored securely on your machine after `hf auth login`

**Token Types:**

- **Read Token** - Can download models/datasets, read private repos
- **Write Token** - Can push models/datasets, create repos, modify content
- **Organization Token** - Can act on behalf of an organization

### When Tokens Are Required

**Always Required:**

- Pushing models/datasets to Hub
- Accessing private repositories
- Creating new repositories
- Modifying existing repositories
- Using Hub APIs programmatically

**Not Required:**

- Downloading public models/datasets
- Running jobs that don't interact with Hub
- Reading public repository information

### How to Provide Tokens to Jobs

#### Method 1: Automatic Token (Recommended)

```python
hf_jobs("uv", {
    "script": "your_script.py",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # ✅ Automatic replacement
})
```

**How it works:**

- `$HF_TOKEN` is a placeholder that gets replaced with your actual token
- Uses the token from your logged-in session (`hf auth login`)
- Most secure and convenient method
- Token is encrypted server-side when passed as a secret

**Benefits:**

- No token exposure in code
- Uses your current login session
- Automatically updated if you re-login
- Works seamlessly with MCP tools

#### Method 2: Explicit Token (Not Recommended)

```python
hf_jobs("uv", {
    "script": "your_script.py",
    "secrets": {"HF_TOKEN": "hf_abc123..."}  # ⚠️ Hardcoded token
})
```

**When to use:**

- Only if automatic token doesn't work
- Testing with a specific token
- Organization tokens (use with caution)

**Security concerns:**

- Token visible in code/logs
- Must manually update if token rotates
- Risk of token exposure

#### Method 3: Environment Variable (Less Secure)

```python
hf_jobs("uv", {
    "script": "your_script.py",
    "env": {"HF_TOKEN": "hf_abc123..."}  # ⚠️ Less secure than secrets
})
```

**Difference from secrets:**

- `env` variables are visible in job logs
- `secrets` are encrypted server-side
- Always prefer `secrets` for tokens

### Using Tokens in Scripts

**In your Python script, tokens are available as environment variables:**

```python
# /// script
# dependencies = ["huggingface-hub"]
# ///
import os
from huggingface_hub import HfApi

# Token is automatically available if passed via secrets
token = os.environ.get("HF_TOKEN")

# Use with Hub API
api = HfApi(token=token)

# Or let huggingface_hub auto-detect
api = HfApi()  # Automatically uses HF_TOKEN env var
```

**Best practices:**

- Don't hardcode tokens in scripts
- Use `os.environ.get("HF_TOKEN")` to access
- Let `huggingface_hub` auto-detect when possible
- Verify token exists before Hub operations

### Token Verification

**Check if you're logged in:**

```python
from huggingface_hub import whoami

user_info = whoami()  # Returns your username if authenticated
```

**Verify token in job:**

```python
import os

assert "HF_TOKEN" in os.environ, "HF_TOKEN not found!"
token = os.environ["HF_TOKEN"]
print(f"Token starts with: {token[:7]}...")  # Should start with "hf_"
```
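The verification snippets above fail hard, while the checklist later in this skill asks that a script "handles missing token gracefully". A minimal sketch of that softer pattern, assuming the job should still finish and produce logs when no token was passed; the workload output is a placeholder.

```python
# Degrade gracefully when HF_TOKEN is absent: run the workload, but skip
# the Hub push with a clear message instead of failing with a 401 later.
import os

token = os.environ.get("HF_TOKEN")
results = {"status": "ok"}  # placeholder for real workload output

if token is None:
    print("⚠️ HF_TOKEN not set; results will NOT be pushed to the Hub.")
    print("Pass secrets={'HF_TOKEN': '$HF_TOKEN'} in the job config.")
else:
    from huggingface_hub import HfApi
    api = HfApi(token=token)
    # api.upload_file(...)  # push results here
    print("✅ Token found; pushing results to the Hub.")
```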
### Common Token Issues

**Error: 401 Unauthorized**

- **Cause:** Token missing or invalid
- **Fix:** Add `secrets={"HF_TOKEN": "$HF_TOKEN"}` to job config
- **Verify:** Check `hf_whoami()` works locally

**Error: 403 Forbidden**

- **Cause:** Token lacks required permissions
- **Fix:** Ensure token has write permissions for push operations
- **Check:** Token type at https://huggingface.co/settings/tokens

**Error: Token not found in environment**

- **Cause:** `secrets` not passed or wrong key name
- **Fix:** Use `secrets={"HF_TOKEN": "$HF_TOKEN"}` (not `env`)
- **Verify:** Script checks `os.environ.get("HF_TOKEN")`

**Error: Repository access denied**

- **Cause:** Token doesn't have access to private repo
- **Fix:** Use token from account with access
- **Check:** Verify repo visibility and your permissions

### Token Security Best Practices

1. **Never commit tokens** - Use the `$HF_TOKEN` placeholder or environment variables
2. **Use secrets, not env** - Secrets are encrypted server-side
3. **Rotate tokens regularly** - Generate new tokens periodically
4. **Use minimal permissions** - Create tokens with only needed permissions
5. **Don't share tokens** - Each user should use their own token
6. **Monitor token usage** - Check token activity in Hub settings

### Complete Token Example

```python
# Example: Push results to Hub
hf_jobs("uv", {
    "script": """
# /// script
# dependencies = ["huggingface-hub", "datasets"]
# ///
import os
from huggingface_hub import HfApi
from datasets import Dataset

# Verify token is available
assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"

# Use token for Hub operations
api = HfApi(token=os.environ["HF_TOKEN"])

# Create and push dataset
data = {"text": ["Hello", "World"]}
dataset = Dataset.from_dict(data)
dataset.push_to_hub("username/my-dataset", token=os.environ["HF_TOKEN"])

print("✅ Dataset pushed successfully!")
""",
    "flavor": "cpu-basic",
    "timeout": "30m",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # ✅ Token provided securely
})
```

## Best Practices

Treat the generated public skill as a reviewable packaging layer around the upstream repository. The goal is to keep provenance explicit and load only the copied source material that materially improves execution.

- Keep the imported skill grounded in the upstream repository; do not invent steps that the source material cannot support.
- Prefer the smallest useful set of support files so the workflow stays auditable and fast to review.
- Keep provenance, source commit, and imported file paths visible in notes and PR descriptions.
- Point directly at the copied upstream files that justify the workflow instead of relying on generic review boilerplate.
- Treat generated examples as scaffolding; adapt them to the concrete task before execution.
- Route to a stronger native skill when architecture, debugging, design, or security concerns become dominant.

## Troubleshooting

### Problem: The operator skipped the imported context and answered too generically

**Symptoms:** The result ignores the upstream workflow in `plugins/antigravity-awesome-skills/skills/hugging-face-jobs`, fails to mention provenance, or does not use any copied source files at all.

**Solution:** Re-open `metadata.json`, `ORIGIN.md`, and the most relevant copied upstream files. Check the `external_source` block first, then restate the provenance before continuing.
### Problem: The imported workflow feels incomplete during review

**Symptoms:** Reviewers can see the generated `SKILL.md`, but they cannot quickly tell which references, examples, or scripts matter for the current task.

**Solution:** Point at the exact copied references, examples, scripts, or assets that justify the path you took. If the gap is still real, record it in the PR instead of hiding it.

### Problem: The task drifted into a different specialization

**Symptoms:** The imported skill starts in the right place, but the work turns into debugging, architecture, design, security, or release orchestration that a native skill handles better.

**Solution:** Use the related skills section to hand off deliberately. Keep the imported provenance visible so the next skill inherits the right context instead of starting blind.

### Imported Troubleshooting Notes

#### Imported: Common Failure Modes

### Out of Memory (OOM)

**Fix:**

1. Reduce batch size or data chunk size
2. Process data in smaller batches
3. Upgrade hardware: cpu → t4 → a10g → a100

### Job Timeout

**Fix:**

1. Check logs for actual runtime
2. Increase timeout with buffer: `"timeout": "3h"`
3. Optimize code for faster execution
4. Process data in chunks (see the sketch after these notes)

### Hub Push Failures

**Fix:**

1. Add token to secrets: MCP uses `"$HF_TOKEN"` (auto-replaced), Python API uses `get_token()` (must pass real token)
2. Verify token in script: `assert "HF_TOKEN" in os.environ`
3. Check token permissions
4. Verify repo exists or can be created

### Missing Dependencies

**Fix:** Add to PEP 723 header:

```python
# /// script
# dependencies = ["package1", "package2>=1.0.0"]
# ///
```

### Authentication Errors

**Fix:**

1. Check `hf_whoami()` works locally
2. Verify token in secrets — MCP: `"$HF_TOKEN"`, Python API: `get_token()` (NOT `"$HF_TOKEN"`)
3. Re-login: `hf auth login`
4. Check token has required permissions

#### Imported: Troubleshooting

**Common issues:**

- Job times out → Increase timeout, optimize code
- Results not saved → Check persistence method, verify HF_TOKEN
- Out of Memory → Reduce batch size, upgrade hardware
- Import errors → Add dependencies to PEP 723 header
- Authentication errors → Check token, verify secrets parameter

**See:** `references/troubleshooting.md` for complete troubleshooting guide
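Both the OOM and timeout fixes come down to the same pattern: process the data in chunks and persist each chunk before moving on. A minimal sketch, assuming a Hub dataset with a `text` column; the repo IDs and `process_batch` helper are hypothetical stand-ins for the real workload.

```python
# /// script
# dependencies = ["datasets", "huggingface-hub"]
# ///
# Chunked processing: memory stays bounded per batch, and partial results
# survive a timeout because each chunk is pushed before the next one starts.
import os
from datasets import load_dataset, Dataset

def process_batch(rows):
    # Hypothetical placeholder; replace with the real workload.
    return [{"text": t.upper()} for t in rows["text"]]

ds = load_dataset("username/input-dataset", split="train")  # assumed repo
chunk_size = 1_000

for start in range(0, len(ds), chunk_size):
    chunk = ds[start : start + chunk_size]  # dict of column -> values
    results = Dataset.from_list(process_batch(chunk))
    # One config per chunk, so a timeout loses at most the current chunk.
    results.push_to_hub(
        "username/output-dataset",
        config_name=f"chunk_{start}",
        token=os.environ["HF_TOKEN"],
    )
    print(f"✅ Pushed rows {start} to {start + len(results) - 1}")
```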
## Related Skills

- `@00-andruia-consultant` - Use when the work is better handled by that native specialization after this imported skill establishes context.
- `@00-andruia-consultant-v2` - Use when the work is better handled by that native specialization after this imported skill establishes context.
- `@10-andruia-skill-smith` - Use when the work is better handled by that native specialization after this imported skill establishes context.
- `@10-andruia-skill-smith-v2` - Use when the work is better handled by that native specialization after this imported skill establishes context.

## Additional Resources

Use this support matrix and the linked files below as the operator packet for this imported skill. They should reflect real copied source material, not generic scaffolding.

| Resource family | What it gives the reviewer | Example path |
| --- | --- | --- |
| `references` | Copied reference notes, guides, or background material from upstream | `references/hardware_guide.md` |
| `examples` | Worked examples or reusable prompts copied from upstream | `examples/n/a` |
| `scripts` | Upstream helper scripts that change execution or validation | `scripts/cot-self-instruct.py` |
| `agents` | Routing or delegation notes that are genuinely part of the imported package | `agents/n/a` |
| `assets` | Supporting assets or schemas copied from the source package | `assets/n/a` |

- [hardware_guide.md](references/hardware_guide.md)
- [hub_saving.md](references/hub_saving.md)
- [token_usage.md](references/token_usage.md)
- [troubleshooting.md](references/troubleshooting.md)
- [cot-self-instruct.py](scripts/cot-self-instruct.py)
- [finepdfs-stats.py](scripts/finepdfs-stats.py)
- [generate-responses.py](scripts/generate-responses.py)

### Imported Reference Notes

#### Imported: Resources

### References (In This Skill)

- `references/token_usage.md` - Complete token usage guide
- `references/hardware_guide.md` - Hardware specs and selection
- `references/hub_saving.md` - Hub persistence guide
- `references/troubleshooting.md` - Common issues and solutions

### Scripts (In This Skill)

- `scripts/generate-responses.py` - vLLM batch generation: dataset → responses → push to Hub
- `scripts/cot-self-instruct.py` - CoT Self-Instruct synthetic data generation + filtering → push to Hub
- `scripts/finepdfs-stats.py` - Polars streaming stats over `finepdfs-edu` parquet on Hub (optional push)

### External Links

**Official Documentation:**

- [HF Jobs Guide](https://huggingface.co/docs/huggingface_hub/guides/jobs) - Main documentation
- [HF Jobs CLI Reference](https://huggingface.co/docs/huggingface_hub/guides/cli#hf-jobs) - Command line interface
- [HF Jobs API Reference](https://huggingface.co/docs/huggingface_hub/package_reference/hf_api) - Python API details
- [Hardware Flavors Reference](https://huggingface.co/docs/hub/en/spaces-config-reference) - Available hardware

**Related Tools:**

- [UV Scripts Guide](https://docs.astral.sh/uv/guides/scripts/) - PEP 723 inline dependencies
- [UV Scripts Organization](https://huggingface.co/uv-scripts) - Community UV script collection
- [HF Hub Authentication](https://huggingface.co/docs/huggingface_hub/quick-start#authentication) - Token setup
- [Webhooks Documentation](https://huggingface.co/docs/huggingface_hub/guides/webhooks) - Event triggers

#### Imported: Quick Reference: MCP Tool vs CLI vs Python API

| Operation | MCP Tool | CLI | Python API |
|-----------|----------|-----|------------|
| Run UV script | `hf_jobs("uv", {...})` | `hf jobs uv run script.py` | `run_uv_job("script.py")` |
| Run Docker job | `hf_jobs("run", {...})` | `hf jobs run image cmd` | `run_job(image, command)` |
| List jobs | `hf_jobs("ps")` | `hf jobs ps` | `list_jobs()` |
| View logs | `hf_jobs("logs", {...})` | `hf jobs logs <job-id>` | `fetch_job_logs(job_id)` |
| Cancel job | `hf_jobs("cancel", {...})` | `hf jobs cancel <job-id>` | `cancel_job(job_id)` |
| Schedule UV | `hf_jobs("scheduled uv", {...})` | `hf jobs scheduled uv run SCHEDULE script.py` | `create_scheduled_uv_job()` |
| Schedule Docker | `hf_jobs("scheduled run", {...})` | `hf jobs scheduled run SCHEDULE image cmd` | `create_scheduled_job()` |
| List scheduled | `hf_jobs("scheduled ps")` | `hf jobs scheduled ps` | `list_scheduled_jobs()` |
| Delete scheduled | `hf_jobs("scheduled delete", {...})` | `hf jobs scheduled delete <job-id>` | `delete_scheduled_job()` |
#### Imported: Prerequisites Checklist

Before starting any job, verify:

### ✅ **Account & Authentication**

- Hugging Face Account with [Pro](https://hf.co/pro), [Team](https://hf.co/enterprise), or [Enterprise](https://hf.co/enterprise) plan (Jobs require a paid plan)
- Authenticated login: check with `hf_whoami()`
- **HF_TOKEN for Hub Access** ⚠️ CRITICAL
  - Required for any Hub operations (push models/datasets, download private repos, etc.)
  - Token must have appropriate permissions (read for downloads, write for uploads)

### ✅ **Token Usage** (See Token Usage section for details)

**When tokens are required:**

- Pushing models/datasets to Hub
- Accessing private repositories
- Using Hub APIs in scripts
- Any authenticated Hub operations

**How to provide tokens:**

```python
# hf_jobs MCP tool — $HF_TOKEN is auto-replaced with real token:
{"secrets": {"HF_TOKEN": "$HF_TOKEN"}}

# HfApi().run_uv_job() — MUST pass actual token:
from huggingface_hub import get_token
secrets={"HF_TOKEN": get_token()}
```

**⚠️ CRITICAL:** The `$HF_TOKEN` placeholder is ONLY auto-replaced by the `hf_jobs` MCP tool. When using `HfApi().run_uv_job()`, you MUST pass the real token via `get_token()`. Passing the literal string `"$HF_TOKEN"` results in a 9-character invalid token and 401 errors.

#### Imported: Quick Start: Two Approaches

### Approach 1: UV Scripts (Recommended)

UV scripts use PEP 723 inline dependencies for clean, self-contained workloads.

**MCP Tool:**

```python
hf_jobs("uv", {
    "script": """
# /// script
# dependencies = ["transformers", "torch"]
# ///
from transformers import pipeline
import torch

# Your workload here
classifier = pipeline("sentiment-analysis")
result = classifier("I love Hugging Face!")
print(result)
""",
    "flavor": "cpu-basic",
    "timeout": "30m"
})
```

**CLI Equivalent:**

```bash
hf jobs uv run my_script.py --flavor cpu-basic --timeout 30m
```

**Python API:**

```python
from huggingface_hub import run_uv_job

run_uv_job("my_script.py", flavor="cpu-basic", timeout="30m")
```

**Benefits:** Direct MCP tool usage, clean code, dependencies declared inline, no file saving required

**When to use:** Default choice for all workloads, custom logic, any scenario requiring `hf_jobs()`

#### Custom Docker Images for UV Scripts

By default, UV scripts use `ghcr.io/astral-sh/uv:python3.12-bookworm-slim`. For ML workloads with complex dependencies, use pre-built images:

```python
hf_jobs("uv", {
    "script": "inference.py",
    "image": "vllm/vllm-openai:latest",  # Pre-built image with vLLM
    "flavor": "a10g-large"
})
```

**CLI:**

```bash
hf jobs uv run --image vllm/vllm-openai:latest --flavor a10g-large inference.py
```

**Benefits:** Faster startup, pre-installed dependencies, optimized for specific frameworks

#### Python Version

By default, UV scripts use Python 3.12. Specify a different version:

```python
hf_jobs("uv", {
    "script": "my_script.py",
    "python": "3.11",  # Use Python 3.11
    "flavor": "cpu-basic"
})
```

**Python API:**

```python
from huggingface_hub import run_uv_job

run_uv_job("my_script.py", python="3.11")
```

#### Working with Scripts

⚠️ **Important:** There are *two* "script path" stories depending on how you run Jobs:

- **Using the `hf_jobs()` MCP tool (recommended in this repo)**: the `script` value must be **inline code** (a string) or a **URL**. A local filesystem path (like `"./scripts/foo.py"`) won't exist inside the remote container.
- **Using the `hf jobs uv run` CLI**: local file paths **do work** (the CLI uploads your script).
**Common mistake with the `hf_jobs()` MCP tool:**

```python
# ❌ Will fail (remote container can't see your local path)
hf_jobs("uv", {"script": "./scripts/foo.py"})
```

**Correct patterns with the `hf_jobs()` MCP tool:**

```python
# ✅ Inline: read the local script file and pass its *contents*
from pathlib import Path

script = Path("hf-jobs/scripts/foo.py").read_text()
hf_jobs("uv", {"script": script})

# ✅ URL: host the script somewhere reachable
hf_jobs("uv", {"script": "https://huggingface.co/datasets/uv-scripts/.../raw/main/foo.py"})

# ✅ URL from GitHub
hf_jobs("uv", {"script": "https://raw.githubusercontent.com/huggingface/trl/main/trl/scripts/sft.py"})
```

**CLI equivalent (local paths supported):**

```bash
hf jobs uv run ./scripts/foo.py -- --your --args
```

#### Adding Dependencies at Runtime

Add extra dependencies beyond what's in the PEP 723 header:

```python
hf_jobs("uv", {
    "script": "inference.py",
    "dependencies": ["transformers", "torch>=2.0"],  # Extra deps
    "flavor": "a10g-small"
})
```

**Python API:**

```python
from huggingface_hub import run_uv_job

run_uv_job("inference.py", dependencies=["transformers", "torch>=2.0"])
```

### Approach 2: Docker-Based Jobs

Run jobs with custom Docker images and commands.

**MCP Tool:**

```python
hf_jobs("run", {
    "image": "python:3.12",
    "command": ["python", "-c", "print('Hello from HF Jobs!')"],
    "flavor": "cpu-basic",
    "timeout": "30m"
})
```

**CLI Equivalent:**

```bash
hf jobs run python:3.12 python -c "print('Hello from HF Jobs!')"
```

**Python API:**

```python
from huggingface_hub import run_job

run_job(image="python:3.12", command=["python", "-c", "print('Hello!')"], flavor="cpu-basic")
```

**Benefits:** Full Docker control, use pre-built images, run any command

**When to use:** Need specific Docker images, non-Python workloads, complex environments

**Example with GPU:**

```python
hf_jobs("run", {
    "image": "pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel",
    "command": ["python", "-c", "import torch; print(torch.cuda.get_device_name())"],
    "flavor": "a10g-small",
    "timeout": "1h"
})
```

**Using Hugging Face Spaces as Images:** You can use Docker images from HF Spaces:

```python
hf_jobs("run", {
    "image": "hf.co/spaces/lhoestq/duckdb",  # Space as Docker image
    "command": ["duckdb", "-c", "SELECT 'Hello from DuckDB!'"],
    "flavor": "cpu-basic"
})
```

**CLI:**

```bash
hf jobs run hf.co/spaces/lhoestq/duckdb duckdb -c "SELECT 'Hello!'"
```

### Finding More UV Scripts on Hub

The `uv-scripts` organization provides ready-to-use UV scripts stored as datasets on Hugging Face Hub:

```python
# Discover available UV script collections
dataset_search({"author": "uv-scripts", "sort": "downloads", "limit": 20})

# Explore a specific collection
hub_repo_details(["uv-scripts/classification"], repo_type="dataset", include_readme=True)
```

**Popular collections:** OCR, classification, synthetic-data, vLLM, dataset-creation

#### Imported: Hardware Selection

> **Reference:** [HF Jobs Hardware Docs](https://huggingface.co/docs/hub/en/spaces-config-reference) (updated 07/2025)

| Workload Type | Recommended Hardware | Use Case |
|---------------|---------------------|----------|
| Data processing, testing | `cpu-basic`, `cpu-upgrade` | Lightweight tasks |
| Small models, demos | `t4-small` | <1B models, quick tests |
| Medium models | `t4-medium`, `l4x1` | 1-7B models |
| Large models, production | `a10g-small`, `a10g-large` | 7-13B models |
| Very large models | `a100-large` | 13B+ models |
| Batch inference | `a10g-large`, `a100-large` | High-throughput |
| Multi-GPU workloads | `l4x4`, `a10g-largex2`, `a10g-largex4` | Parallel/large models |
| TPU workloads | `v5e-1x1`, `v5e-2x2`, `v5e-2x4` | JAX/Flax, TPU-optimized |

**All Available Flavors:**

- **CPU:** `cpu-basic`, `cpu-upgrade`
- **GPU:** `t4-small`, `t4-medium`, `l4x1`, `l4x4`, `a10g-small`, `a10g-large`, `a10g-largex2`, `a10g-largex4`, `a100-large`
- **TPU:** `v5e-1x1`, `v5e-2x2`, `v5e-2x4`

**Guidelines:**

- Start with smaller hardware for testing
- Scale up based on actual needs
- Use multi-GPU for parallel workloads or large models
- Use TPUs for JAX/Flax workloads
- See `references/hardware_guide.md` for detailed specifications
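The table above is easy to encode as a rough pre-flight check. A sketch of that heuristic, assuming dense-model parameter counts in billions; the thresholds mirror the table and are an illustrative reading of it, not an official sizing rule.

```python
# Rough flavor heuristic derived from the hardware table above.
# Thresholds are illustrative; validate with a small test job first.
def suggest_flavor(params_billions: float, gpu_needed: bool = True) -> str:
    if not gpu_needed:
        return "cpu-basic"
    if params_billions < 1:
        return "t4-small"
    if params_billions <= 7:
        return "l4x1"        # or "t4-medium" for lighter workloads
    if params_billions <= 13:
        return "a10g-large"  # or "a10g-small" at the low end of the range
    return "a100-large"      # 13B+; consider multi-GPU flavors for parallelism

print(suggest_flavor(0.5))  # t4-small
print(suggest_flavor(30))   # a100-large
```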
#### Imported: Critical: Saving Results

**⚠️ EPHEMERAL ENVIRONMENT—MUST PERSIST RESULTS**

The Jobs environment is temporary. All files are deleted when the job ends. If results aren't persisted, **ALL WORK IS LOST**.

### Persistence Options

**1. Push to Hugging Face Hub (Recommended)**

```python
# Push models
model.push_to_hub("username/model-name", token=os.environ["HF_TOKEN"])

# Push datasets
dataset.push_to_hub("username/dataset-name", token=os.environ["HF_TOKEN"])

# Push artifacts
api.upload_file(
    path_or_fileobj="results.json",
    path_in_repo="results.json",
    repo_id="username/results",
    token=os.environ["HF_TOKEN"]
)
```

**2. Use External Storage**

```python
# Upload to S3, GCS, etc.
import boto3

s3 = boto3.client('s3')
s3.upload_file('results.json', 'my-bucket', 'results.json')
```

**3. Send Results via API**

```python
# POST results to your API
import requests

requests.post("https://your-api.com/results", json=results)
```

### Required Configuration for Hub Push

**In job submission:**

```python
# hf_jobs MCP tool:
{"secrets": {"HF_TOKEN": "$HF_TOKEN"}}  # auto-replaced

# HfApi().run_uv_job():
from huggingface_hub import get_token
secrets={"HF_TOKEN": get_token()}  # must pass real token
```

**In script:**

```python
import os
from huggingface_hub import HfApi

# Token automatically available from secrets
api = HfApi(token=os.environ.get("HF_TOKEN"))

# Push your results
api.upload_file(...)
```

### Verification Checklist

Before submitting:

- [ ] Results persistence method chosen
- [ ] Token in secrets if using Hub (MCP: `"$HF_TOKEN"`, Python API: `get_token()`)
- [ ] Script handles missing token gracefully
- [ ] Test persistence path works

**See:** `references/hub_saving.md` for detailed Hub persistence guide

#### Imported: Timeout Management

**⚠️ DEFAULT: 30 MINUTES**

Jobs automatically stop after the timeout. For long-running tasks like training, always set a custom timeout.

### Setting Timeouts

**MCP Tool:**

```python
{
    "timeout": "2h"  # 2 hours
}
```

**Supported formats:**

- Integer/float: seconds (e.g., `300` = 5 minutes)
- String with suffix: `"5m"` (minutes), `"2h"` (hours), `"1d"` (days)
- Examples: `"90m"`, `"2h"`, `"1.5h"`, `300`, `"1d"`

**Python API:**

```python
from huggingface_hub import run_job, run_uv_job

run_job(image="python:3.12", command=[...], timeout="2h")
run_uv_job("script.py", timeout=7200)  # 2 hours in seconds
```

### Timeout Guidelines

| Scenario | Recommended | Notes |
|----------|-------------|-------|
| Quick test | 10-30 min | Verify setup |
| Data processing | 1-2 hours | Depends on data size |
| Batch inference | 2-4 hours | Large batches |
| Experiments | 4-8 hours | Multiple runs |
| Long-running | 8-24 hours | Production workloads |

**Always add 20-30% buffer** for setup, network delays, and cleanup.

**On timeout:** The job is killed immediately; all unsaved progress is lost.
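A small sketch of the buffer rule, assuming you already have a runtime estimate from the logs of a test run; the 25% figure sits inside the 20-30% range recommended above.

```python
# Turn an estimated runtime into a timeout value with a 25% safety buffer,
# expressed in seconds (one of the formats the timeout parameter accepts).
import math

def timeout_with_buffer(estimated_seconds: float, buffer: float = 0.25) -> int:
    return math.ceil(estimated_seconds * (1 + buffer))

# A job that took ~90 minutes in testing gets a ~113-minute timeout.
print(timeout_with_buffer(90 * 60))  # 6750 seconds
```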
#### Imported: Cost Estimation

**General guidelines:**

```
Total Cost = (Hours of runtime) × (Cost per hour)
```

**Example calculations:**

**Quick test:**

- Hardware: cpu-basic ($0.10/hour)
- Time: 15 minutes (0.25 hours)
- Cost: $0.03

**Data processing:**

- Hardware: l4x1 ($2.50/hour)
- Time: 2 hours
- Cost: $5.00

**Batch inference:**

- Hardware: a10g-large ($5/hour)
- Time: 4 hours
- Cost: $20.00

**Cost optimization tips:**

1. Start small - Test on cpu-basic or t4-small
2. Monitor runtime - Set appropriate timeouts
3. Use checkpoints - Resume if job fails
4. Optimize code - Reduce unnecessary compute
5. Choose right hardware - Don't over-provision
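The arithmetic above is one multiplication, which makes it worth scripting once. A sketch, with hourly rates copied from the example calculations; treat the rates as illustrative, not a current price list.

```python
# Cost = runtime hours × hourly rate. Rates mirror the examples above
# and are illustrative only; check current pricing before budgeting.
RATES_PER_HOUR = {"cpu-basic": 0.10, "l4x1": 2.50, "a10g-large": 5.00}

def estimate_cost(flavor: str, hours: float) -> float:
    return round(RATES_PER_HOUR[flavor] * hours, 2)

print(estimate_cost("cpu-basic", 0.25))  # 0.03 -- quick test
print(estimate_cost("l4x1", 2))          # 5.0  -- data processing
print(estimate_cost("a10g-large", 4))    # 20.0 -- batch inference
```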
#### Imported: Monitoring and Tracking

### Check Job Status

**MCP Tool:**

```python
# List all jobs
hf_jobs("ps")

# Inspect specific job
hf_jobs("inspect", {"job_id": "your-job-id"})

# View logs
hf_jobs("logs", {"job_id": "your-job-id"})

# Cancel a job
hf_jobs("cancel", {"job_id": "your-job-id"})
```

**Python API:**

```python
from huggingface_hub import list_jobs, inspect_job, fetch_job_logs, cancel_job

# List your jobs
jobs = list_jobs()

# List running jobs only
running = [j for j in list_jobs() if j.status.stage == "RUNNING"]

# Inspect specific job
job_info = inspect_job(job_id="your-job-id")

# View logs
for log in fetch_job_logs(job_id="your-job-id"):
    print(log)

# Cancel a job
cancel_job(job_id="your-job-id")
```

**CLI:**

```bash
hf jobs ps                # List jobs
hf jobs logs <job-id>     # View logs
hf jobs cancel <job-id>   # Cancel job
```

**Remember:** Wait for the user to request status checks. Avoid polling repeatedly.

### Job URLs

After submission, jobs have monitoring URLs:

```
https://huggingface.co/jobs/username/job-id
```

View logs, status, and details in the browser.

### Wait for Multiple Jobs

```python
import time
from huggingface_hub import inspect_job, run_job

# Run multiple jobs
jobs = [run_job(image=img, command=cmd) for img, cmd in workloads]

# Wait for all to complete
for job in jobs:
    while inspect_job(job_id=job.id).status.stage not in ("COMPLETED", "ERROR"):
        time.sleep(10)
```

#### Imported: Scheduled Jobs

Run jobs on a schedule using CRON expressions or predefined schedules.

**MCP Tool:**

```python
# Schedule a UV script that runs every hour
hf_jobs("scheduled uv", {
    "script": "your_script.py",
    "schedule": "@hourly",
    "flavor": "cpu-basic"
})

# Schedule with CRON syntax
hf_jobs("scheduled uv", {
    "script": "your_script.py",
    "schedule": "0 9 * * 1",  # 9 AM every Monday
    "flavor": "cpu-basic"
})

# Schedule a Docker-based job
hf_jobs("scheduled run", {
    "image": "python:3.12",
    "command": ["python", "-c", "print('Scheduled!')"],
    "schedule": "@daily",
    "flavor": "cpu-basic"
})
```

**Python API:**

```python
from huggingface_hub import create_scheduled_job, create_scheduled_uv_job

# Schedule a Docker job
create_scheduled_job(
    image="python:3.12",
    command=["python", "-c", "print('Running on schedule!')"],
    schedule="@hourly"
)

# Schedule a UV script
create_scheduled_uv_job("my_script.py", schedule="@daily", flavor="cpu-basic")

# Schedule with GPU
create_scheduled_uv_job(
    "ml_inference.py",
    schedule="0 */6 * * *",  # Every 6 hours
    flavor="a10g-small"
)
```

**Available schedules:**

- `@annually`, `@yearly` - Once per year
- `@monthly` - Once per month
- `@weekly` - Once per week
- `@daily` - Once per day
- `@hourly` - Once per hour
- CRON expression - Custom schedule (e.g., `"*/5 * * * *"` for every 5 minutes)

**Manage scheduled jobs:**

```python
# MCP Tool
hf_jobs("scheduled ps")                          # List scheduled jobs
hf_jobs("scheduled inspect", {"job_id": "..."})  # Inspect details
hf_jobs("scheduled suspend", {"job_id": "..."})  # Pause
hf_jobs("scheduled resume", {"job_id": "..."})   # Resume
hf_jobs("scheduled delete", {"job_id": "..."})   # Delete
```

**Python API for management:**

```python
from huggingface_hub import (
    list_scheduled_jobs, inspect_scheduled_job,
    suspend_scheduled_job, resume_scheduled_job, delete_scheduled_job
)

# List all scheduled jobs
scheduled = list_scheduled_jobs()

# Inspect a scheduled job
info = inspect_scheduled_job(scheduled_job_id)

# Suspend (pause) a scheduled job
suspend_scheduled_job(scheduled_job_id)

# Resume a scheduled job
resume_scheduled_job(scheduled_job_id)

# Delete a scheduled job
delete_scheduled_job(scheduled_job_id)
```
#### Imported: Common Workload Patterns

This repository ships ready-to-run UV scripts in `hf-jobs/scripts/`. Prefer using them instead of inventing new templates.

### Pattern 1: Dataset → Model Responses (vLLM) — `scripts/generate-responses.py`

**What it does:** loads a Hub dataset (chat `messages` or a `prompt` column), applies a model chat template, generates responses with vLLM, and **pushes** the output dataset + dataset card back to the Hub.

**Requires:** GPU + **write** token (it pushes a dataset).

```python
from pathlib import Path

script = Path("hf-jobs/scripts/generate-responses.py").read_text()

hf_jobs("uv", {
    "script": script,
    "script_args": [
        "username/input-dataset", "username/output-dataset",
        "--messages-column", "messages",
        "--model-id", "Qwen/Qwen3-30B-A3B-Instruct-2507",
        "--temperature", "0.7",
        "--top-p", "0.8",
        "--max-tokens", "2048",
    ],
    "flavor": "a10g-large",
    "timeout": "4h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
})
```

### Pattern 2: CoT Self-Instruct Synthetic Data — `scripts/cot-self-instruct.py`

**What it does:** generates synthetic prompts/answers via CoT Self-Instruct, optionally filters outputs (answer-consistency / RIP), then **pushes** the generated dataset + dataset card to the Hub.

**Requires:** GPU + **write** token (it pushes a dataset).

```python
from pathlib import Path

script = Path("hf-jobs/scripts/cot-self-instruct.py").read_text()

hf_jobs("uv", {
    "script": script,
    "script_args": [
        "--seed-dataset", "davanstrien/s1k-reasoning",
        "--output-dataset", "username/synthetic-math",
        "--task-type", "reasoning",
        "--num-samples", "5000",
        "--filter-method", "answer-consistency",
    ],
    "flavor": "l4x4",
    "timeout": "8h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
})
```

### Pattern 3: Streaming Dataset Stats (Polars + HF Hub) — `scripts/finepdfs-stats.py`

**What it does:** scans parquet directly from the Hub (no 300GB download), computes temporal stats, and (optionally) uploads results to a Hub dataset repo.

**Requires:** CPU is often enough; a token is needed **only** if you pass `--output-repo` (upload).

```python
from pathlib import Path

script = Path("hf-jobs/scripts/finepdfs-stats.py").read_text()

hf_jobs("uv", {
    "script": script,
    "script_args": [
        "--limit", "10000",
        "--show-plan",
        "--output-repo", "username/finepdfs-temporal-stats",
    ],
    "flavor": "cpu-upgrade",
    "timeout": "2h",
    "env": {"HF_XET_HIGH_PERFORMANCE": "1"},
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
})
```

#### Imported: Key Takeaways

1. **Submit scripts inline** - The `script` parameter accepts Python code directly; no file saving required unless the user requests it
2. **Jobs are asynchronous** - Don't wait/poll; let the user check when ready
3. **Always set a timeout** - The default 30 min may be insufficient; set an appropriate timeout
4. **Always persist results** - The environment is ephemeral; without persistence, all work is lost
5. **Use tokens securely** - MCP: `secrets={"HF_TOKEN": "$HF_TOKEN"}`, Python API: `secrets={"HF_TOKEN": get_token()}` — `"$HF_TOKEN"` only works with the MCP tool
6. **Choose appropriate hardware** - Start small, scale up based on needs (see hardware guide)
7. **Use UV scripts** - Default to `hf_jobs("uv", {...})` with inline scripts for Python workloads
8. **Handle authentication** - Verify tokens are available before Hub operations
9. **Monitor jobs** - Provide job URLs and status check commands
10. **Optimize costs** - Choose the right hardware, set appropriate timeouts

#### Imported: Limitations

- Use this skill only when the task clearly matches the scope described above.
- Do not treat the output as a substitute for environment-specific validation, testing, or expert review.
- Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.