---
name: hugging-face-space-deployer
description: Create, configure, and deploy Hugging Face Spaces for showcasing ML models. Supports Gradio, Streamlit, and Docker SDKs with templates for common use cases like chat interfaces, image generation, and model comparisons.
---

# Hugging Face Space Deployer

A skill for AI engineers to create, configure, and deploy interactive ML demos on Hugging Face Spaces.

## CRITICAL: Pre-Deployment Checklist

**Before writing ANY code, gather this information about the model:**

### 1. Check Model Type (LoRA Adapter vs Full Model)

**Use the HF MCP tool to inspect the model files:**

```
hf-skills - Hub Repo Details (repo_ids: ["username/model"], repo_type: "model")
```

**Look for these indicators:**

| Files Present | Model Type | Action Required |
|---------------|------------|-----------------|
| `model.safetensors` or `pytorch_model.bin` | Full model | Load directly with `AutoModelForCausalLM` |
| `adapter_model.safetensors` + `adapter_config.json` | LoRA/PEFT adapter | Must load the base model first, then apply the adapter with `peft` |
| Only config files, no weights | Broken/incomplete | Ask the user to verify |

**If `adapter_config.json` exists, check its `base_model_name_or_path` field to identify the base model.**

### 2. Check Inference API Availability

Visit the model page on HF Hub and look for the "Inference Providers" widget on the right side.

**Indicators that a model HAS the Inference API:**

- Inference widget visible on the model page
- Model from a known provider: `meta-llama`, `mistralai`, `HuggingFaceH4`, `google`, `stabilityai`, `Qwen`
- High download count (>10,000) with a standard architecture

**Indicators that a model DOES NOT have the Inference API:**

- Personal namespace (e.g., `GhostScientist/my-model`)
- LoRA/PEFT adapter (adapters never have a direct Inference API)
- Missing `pipeline_tag` in the model metadata
- No inference widget on the model page

### 3. Check Model Metadata

- Ensure `pipeline_tag` is set (e.g., `text-generation`)
- Add the `conversational` tag for chat models

### 4. Determine Hardware Needs

| Model Size | Recommended Hardware |
|------------|----------------------|
| < 3B parameters | ZeroGPU (free) or CPU |
| 3B - 7B parameters | ZeroGPU or T4 |
| > 7B parameters | A10G or A100 |

### 5. Ask User If Unclear

**If you cannot determine the model type, ASK THE USER:**

> "I'm analyzing your model to determine the best deployment strategy. I found:
> - [what you found about files]
> - [what you found about inference API]
>
> Is this model:
> 1. A full model you trained/uploaded?
> 2. A LoRA/PEFT adapter on top of another model?
> 3. Something else?
>
> Also, would you prefer:
> A. Free deployment with ZeroGPU (may have queue times)
> B. Paid GPU for faster response (~$0.60/hr)"

## Hardware Options

| Hardware | Use Case | Cost |
|----------|----------|------|
| `cpu-basic` | Simple demos, Inference API apps | Free |
| `cpu-upgrade` | Faster CPU inference | ~$0.03/hr |
| **`zero-a10g`** | **Models needing GPU on-demand (recommended for most)** | **Free (with quota)** |
| `t4-small` | Small GPU models (<7B) | ~$0.60/hr |
| `t4-medium` | Medium GPU models | ~$0.90/hr |
| `a10g-small` | Large models (7B-13B) | ~$1.50/hr |
| `a10g-large` | Very large models (30B+) | ~$3.15/hr |
| `a100-large` | Largest models | ~$4.50/hr |

**ZeroGPU Note:** ZeroGPU (`zero-a10g`) provides free GPU access on demand. The Space runs on CPU, and when a user triggers inference, a GPU is allocated temporarily (~60-120 seconds). **After deployment, you must manually set the runtime to "ZeroGPU" in Space Settings > Hardware.**
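Steps 1-3 of the pre-deployment checklist can also be scripted with `huggingface_hub` before you walk the decision tree below. A minimal sketch, assuming read access to the repo; `username/model` is a placeholder:

```python
import json

from huggingface_hub import HfApi, hf_hub_download

REPO_ID = "username/model"  # placeholder

api = HfApi()
files = api.list_repo_files(REPO_ID, repo_type="model")

if "adapter_config.json" in files:
    # LoRA/PEFT adapter: read the base model it was trained on.
    config_path = hf_hub_download(REPO_ID, "adapter_config.json")
    with open(config_path) as f:
        base = json.load(f).get("base_model_name_or_path")
    print(f"LoRA adapter on base model: {base} -> use Template 3")
elif any(f.endswith((".safetensors", ".bin")) for f in files):
    print("Full model weights found -> use Template 1 or 2")
else:
    print("No weights found -> ask the user; repo may be incomplete")

# A missing pipeline_tag is a hint that the serverless Inference API
# will not pick the model up (see "Fixing Missing pipeline_tag" below).
print("pipeline_tag:", api.model_info(REPO_ID).pipeline_tag)
```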
## Deployment Decision Tree

```
Analyze Model
│
├── Does it have adapter_config.json?
│   └── YES → It's a LoRA adapter
│       ├── Find base_model_name_or_path in adapter_config.json
│       └── Use Template 3 (LoRA + ZeroGPU)
│
├── Does it have model.safetensors or pytorch_model.bin?
│   └── YES → It's a full model
│       ├── Is it from a major provider with inference widget?
│       │   ├── YES → Use Inference API (Template 1)
│       │   └── NO → Use ZeroGPU (Template 2)
│
└── Neither found?
    └── ASK USER - model may be incomplete
```

## Dependencies

**For Inference API (cpu-basic, free):**

```
gradio>=5.0.0
huggingface_hub>=0.26.0
```

**For ZeroGPU full models (zero-a10g, free with quota):**

```
gradio>=5.0.0
torch
transformers
accelerate
spaces
```

**For ZeroGPU LoRA adapters (zero-a10g, free with quota):**

```
gradio>=5.0.0
torch
transformers
accelerate
spaces
peft
```

## CLI Commands (CORRECT Syntax)

```bash
# Create Space
hf repo create my-space-name --repo-type space --space-sdk gradio

# Upload files
hf upload username/space-name ./local-folder --repo-type space

# Download model files to inspect
hf download username/model-name --local-dir ./model-check --dry-run

# Check what files exist in a model
hf download username/model-name --local-dir /tmp/check --dry-run 2>&1 | grep -E '\.(safetensors|bin|json)'
```

## Template 1: Inference API (For Supported Models)

**Use when:** Model has an inference widget, is from a major provider, or explicitly supports the serverless API.

```python
import gradio as gr
from huggingface_hub import InferenceClient

MODEL_ID = "HuggingFaceH4/zephyr-7b-beta"  # Must support Inference API!

client = InferenceClient(MODEL_ID)


def respond(message, history, system_message, max_tokens, temperature, top_p):
    messages = [{"role": "system", "content": system_message}]

    for user_msg, assistant_msg in history:
        if user_msg:
            messages.append({"role": "user", "content": user_msg})
        if assistant_msg:
            messages.append({"role": "assistant", "content": assistant_msg})

    messages.append({"role": "user", "content": message})

    response = ""
    for token in client.chat_completion(
        messages,
        max_tokens=max_tokens,
        stream=True,
        temperature=temperature,
        top_p=top_p,
    ):
        delta = token.choices[0].delta.content or ""
        response += delta
        yield response


demo = gr.ChatInterface(
    respond,
    title="Chat Assistant",
    description="Powered by Hugging Face Inference API",
    additional_inputs=[
        gr.Textbox(value="You are a helpful assistant.", label="System message"),
        gr.Slider(minimum=1, maximum=2048, value=512, step=1, label="Max tokens"),
        gr.Slider(minimum=0.1, maximum=2.0, value=0.7, step=0.1, label="Temperature"),
        gr.Slider(minimum=0.1, maximum=1.0, value=0.95, step=0.05, label="Top-p"),
    ],
    examples=[
        ["Hello! How are you?"],
        ["Write a Python function to sort a list"],
    ],
)

if __name__ == "__main__":
    demo.launch()
```

**requirements.txt:**

```
gradio>=5.0.0
huggingface_hub>=0.26.0
```

**README.md:**

```yaml
---
title: My Chat App
emoji: 💬
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.9.1
app_file: app.py
pinned: false
license: apache-2.0
---
```
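If the target model is gated or rate-limited, the `InferenceClient` call may need authentication. A hedged variant of the client setup above, assuming the token is stored as a Space secret named `HF_TOKEN` (the conventional name; any secret name works):

```python
import os

from huggingface_hub import InferenceClient

# Read the token from a Space secret (Settings > Variables and secrets).
# Falls back to anonymous access if the secret is not set.
client = InferenceClient(
    "HuggingFaceH4/zephyr-7b-beta",
    token=os.environ.get("HF_TOKEN"),
)
```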
## Template 2: ZeroGPU Full Model (For Models Without Inference API)

**Use when:** Full model (has `model.safetensors`) but no Inference API support.

```python
import gradio as gr
import spaces
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "username/my-full-model"

# Load tokenizer at startup
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Global model - loaded lazily on first GPU call for faster Space startup
model = None


def load_model():
    global model
    if model is None:
        model = AutoModelForCausalLM.from_pretrained(
            MODEL_ID,
            torch_dtype=torch.float16,
            device_map="auto",
        )
    return model


@spaces.GPU(duration=120)
def generate_response(message, history, system_message, max_tokens, temperature, top_p):
    model = load_model()

    messages = [{"role": "system", "content": system_message}]
    for user_msg, assistant_msg in history:
        if user_msg:
            messages.append({"role": "user", "content": user_msg})
        if assistant_msg:
            messages.append({"role": "assistant", "content": assistant_msg})
    messages.append({"role": "user", "content": message})

    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=int(max_tokens),
            temperature=float(temperature),
            top_p=float(top_p),
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )

    response = tokenizer.decode(
        outputs[0][inputs['input_ids'].shape[1]:],
        skip_special_tokens=True,
    )
    return response


demo = gr.ChatInterface(
    generate_response,
    title="My Model",
    description="Powered by ZeroGPU (free!)",
    additional_inputs=[
        gr.Textbox(value="You are a helpful assistant.", label="System message", lines=2),
        gr.Slider(minimum=64, maximum=2048, value=512, step=64, label="Max tokens"),
        gr.Slider(minimum=0.1, maximum=1.5, value=0.7, step=0.1, label="Temperature"),
        gr.Slider(minimum=0.1, maximum=1.0, value=0.95, step=0.05, label="Top-p"),
    ],
    examples=[
        ["Hello! How are you?"],
        ["Help me write some code"],
    ],
)

if __name__ == "__main__":
    demo.launch()
```

**requirements.txt:**

```
gradio>=5.0.0
torch
transformers
accelerate
spaces
```

**README.md:**

```yaml
---
title: My Model
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.9.1
app_file: app.py
pinned: false
license: apache-2.0
suggested_hardware: zero-a10g
---
```
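Template 2 (and Template 3 below) returns the whole completion at once. If you want tokens to stream into the chat UI, the generation step can be rewritten around transformers' `TextIteratorStreamer`. This is a sketch, not part of the original template: it reuses `load_model` and `tokenizer` from the app above, and assumes your `spaces` version supports generator functions under `@spaces.GPU` (current releases do):

```python
from threading import Thread

import spaces
from transformers import TextIteratorStreamer


@spaces.GPU(duration=120)
def generate_response_streaming(message, history, system_message,
                                max_tokens, temperature, top_p):
    model = load_model()  # same lazy loader as in the template above

    messages = [{"role": "system", "content": system_message}]
    for user_msg, assistant_msg in history:
        if user_msg:
            messages.append({"role": "user", "content": user_msg})
        if assistant_msg:
            messages.append({"role": "assistant", "content": assistant_msg})
    messages.append({"role": "user", "content": message})

    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)

    # The streamer yields decoded text as generate() produces tokens.
    streamer = TextIteratorStreamer(
        tokenizer, skip_prompt=True, skip_special_tokens=True
    )

    # generate() blocks, so run it in a background thread and consume
    # the streamer here; yielding partial strings makes gr.ChatInterface
    # render the reply incrementally.
    thread = Thread(
        target=model.generate,
        kwargs=dict(
            **inputs,
            streamer=streamer,
            max_new_tokens=int(max_tokens),
            temperature=float(temperature),
            top_p=float(top_p),
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        ),
    )
    thread.start()

    response = ""
    for new_text in streamer:
        response += new_text
        yield response
```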
How are you?"], ["Help me write some code"], ], ) if __name__ == "__main__": demo.launch() ``` **requirements.txt:** ``` gradio>=5.0.0 torch transformers accelerate spaces ``` **README.md:** ```yaml --- title: My Model emoji: 🤖 colorFrom: blue colorTo: purple sdk: gradio sdk_version: 5.9.1 app_file: app.py pinned: false license: apache-2.0 suggested_hardware: zero-a10g --- ``` ## Template 3: ZeroGPU LoRA Adapter (CRITICAL FOR FINE-TUNED MODELS) **Use when:** Model has `adapter_config.json` and `adapter_model.safetensors` (NOT `model.safetensors`) **You MUST identify the base model from `adapter_config.json` field `base_model_name_or_path`** ```python import gradio as gr import spaces import torch from transformers import AutoModelForCausalLM, AutoTokenizer from peft import PeftModel # Your LoRA adapter ADAPTER_ID = "username/my-lora-adapter" # Base model (from adapter_config.json -> base_model_name_or_path) BASE_MODEL_ID = "Qwen/Qwen2.5-Coder-1.5B-Instruct" # Load tokenizer at startup tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_ID) # Global model - loaded lazily on first GPU call model = None def load_model(): global model if model is None: base_model = AutoModelForCausalLM.from_pretrained( BASE_MODEL_ID, torch_dtype=torch.float16, device_map="auto", ) model = PeftModel.from_pretrained(base_model, ADAPTER_ID) model = model.merge_and_unload() # Merge for faster inference return model @spaces.GPU(duration=120) def generate_response(message, history, system_message, max_tokens, temperature, top_p): model = load_model() messages = [{"role": "system", "content": system_message}] for item in history: if isinstance(item, (list, tuple)) and len(item) == 2: user_msg, assistant_msg = item if user_msg: messages.append({"role": "user", "content": user_msg}) if assistant_msg: messages.append({"role": "assistant", "content": assistant_msg}) messages.append({"role": "user", "content": message}) text = tokenizer.apply_chat_template( messages, tokenize=False, add_generation_prompt=True ) inputs = tokenizer([text], return_tensors="pt").to(model.device) with torch.no_grad(): outputs = model.generate( **inputs, max_new_tokens=int(max_tokens), temperature=float(temperature), top_p=float(top_p), do_sample=True, pad_token_id=tokenizer.eos_token_id, ) response = tokenizer.decode( outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True ) return response demo = gr.ChatInterface( generate_response, title="My Fine-Tuned Model", description="LoRA fine-tuned model powered by ZeroGPU (free!)", additional_inputs=[ gr.Textbox(value="You are a helpful assistant.", label="System message", lines=2), gr.Slider(minimum=64, maximum=2048, value=512, step=64, label="Max tokens"), gr.Slider(minimum=0.1, maximum=1.5, value=0.7, step=0.1, label="Temperature"), gr.Slider(minimum=0.1, maximum=1.0, value=0.95, step=0.05, label="Top-p"), ], examples=[ ["Hello! How are you?"], ["Help me with a coding task"], ], ) if __name__ == "__main__": demo.launch() ``` **requirements.txt (MUST include peft):** ``` gradio>=5.0.0 torch transformers accelerate spaces peft ``` **README.md:** ```yaml --- title: My Fine-Tuned Model emoji: 🔧 colorFrom: green colorTo: blue sdk: gradio sdk_version: 5.9.1 app_file: app.py pinned: false license: apache-2.0 suggested_hardware: zero-a10g --- ``` ## Post-Deployment Steps **After uploading your Space files:** ### 1. 
## Post-Deployment Steps

**After uploading your Space files:**

### 1. Set the Runtime Hardware (REQUIRED for GPU models)

- Go to: `https://huggingface.co/spaces/USERNAME/SPACE_NAME/settings`
- Under "Space Hardware", select the appropriate option:
  - **ZeroGPU** for free on-demand GPU (recommended)
  - Or a dedicated GPU tier if needed

### 2. Verify the Space is Running

- Check the Space URL for any build errors
- Review the container logs in Settings if issues occur

### 3. Common Post-Deploy Fixes

| Issue | Cause | Fix |
|-------|-------|-----|
| "No API found" error | Hardware mismatch | Set runtime to ZeroGPU in Settings |
| Model not loading | LoRA vs full model confusion | Check if it's an adapter; use the correct template |
| Inference API errors | Model not on serverless | Load directly with transformers instead |

## Detecting Model Type - Quick Reference

### Full Model

Files include `model.safetensors`, `pytorch_model.bin`, or sharded versions.

```python
# Can load directly
model = AutoModelForCausalLM.from_pretrained("username/model")
```

### LoRA/PEFT Adapter

Files include `adapter_config.json` and `adapter_model.safetensors`.

```python
# Must load the base model first, then apply the adapter
base_model = AutoModelForCausalLM.from_pretrained("base-model-id")
model = PeftModel.from_pretrained(base_model, "username/adapter")
model = model.merge_and_unload()  # Optional: merge for faster inference
```

### Inference API Available

The model page shows the "Inference Providers" widget on the right side.

```python
# Can use InferenceClient (simplest approach)
from huggingface_hub import InferenceClient
client = InferenceClient("username/model")
```

## Fixing Missing pipeline_tag (To Enable Inference API)

If a model doesn't have an inference widget but should, it may be missing metadata:

```bash
# Download the README
hf download username/model-name README.md --local-dir /tmp/fix

# Edit to add pipeline_tag in the YAML frontmatter:
# ---
# pipeline_tag: text-generation
# tags:
# - conversational
# ---

# Upload the fix
hf upload username/model-name /tmp/fix/README.md README.md
```

**Note:** Even with correct tags, custom models may not get the Inference API - it depends on HF's infrastructure decisions.
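Alternatively, the metadata can be patched without hand-editing the README. A sketch using `huggingface_hub.metadata_update`, which rewrites the model card's YAML frontmatter in place; it requires write access, and the repo ID is a placeholder:

```python
from huggingface_hub import metadata_update

# Adds pipeline_tag and tags to the model card's YAML frontmatter.
metadata_update(
    "username/model-name",  # placeholder
    {"pipeline_tag": "text-generation", "tags": ["conversational"]},
)
```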
## CRITICAL: Gradio 5.x Requirements

### Examples Format (MUST be nested lists)

```python
# CORRECT:
examples=[
    ["Example 1"],
    ["Example 2"],
]

# WRONG (causes ValueError):
examples=[
    "Example 1",
    "Example 2",
]
```

### Version Requirements

```
gradio>=5.0.0
huggingface_hub>=0.26.0
```

Do NOT use `gradio==4.44.0` - it causes `ImportError: cannot import name 'HfFolder'`.

## Troubleshooting

### "No API found" Error

**Cause:** The Gradio app isn't exposing its API correctly, often due to a hardware mismatch.

**Fix:** Go to Space Settings and set the runtime to "ZeroGPU" or the appropriate GPU tier.

### "OSError: does not appear to have a file named pytorch_model.bin, model.safetensors"

**Cause:** Trying to load a LoRA adapter as a full model.

**Fix:** Check for `adapter_config.json` - if present, use PEFT to load:

```python
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("base-model")
model = PeftModel.from_pretrained(base_model, "adapter-id")
```

### Inference API Not Available

**Cause:** The model doesn't have a `pipeline_tag` or isn't deployed to serverless.

**Fix:** Either:

a. Add `pipeline_tag: text-generation` to the model's README.md, or
b. Load the model directly with transformers instead of `InferenceClient`.

### `ImportError: cannot import name 'HfFolder'`

**Cause:** gradio/huggingface_hub version mismatch.

**Fix:** Use `gradio>=5.0.0` and `huggingface_hub>=0.26.0`.

### `ValueError: examples must be nested list`

**Cause:** Gradio 5.x format change.

**Fix:** Use `[["ex1"], ["ex2"]]`, not `["ex1", "ex2"]`.

### Space builds but model doesn't load

**Cause:** Missing `peft` for adapters, or the wrong base model.

**Fix:** Check `adapter_config.json` for the correct `base_model_name_or_path`.

## Workflow Summary

1. **Analyze the model** (check for `adapter_config.json`, model files, inference widget)
2. **Determine the strategy** (Inference API vs ZeroGPU, full model vs LoRA)
3. **Ask the user if unclear** about model type or cost preferences
4. **Generate the correct template** based on the analysis
5. **Create the Space** with the correct requirements and README
6. **Upload files** using `hf upload`
7. **Set hardware** in Space Settings (ZeroGPU for free GPU access)
8. **Monitor build logs** for any issues (see the sketch below)
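For step 8, the build status can also be polled programmatically instead of watching the logs page. A minimal sketch, assuming `huggingface_hub`'s `get_space_runtime` and its string-valued stage names; the Space ID is a placeholder:

```python
import time

from huggingface_hub import HfApi

api = HfApi()
SPACE_ID = "username/my-space"  # placeholder

# Poll until the Space leaves the build phase. runtime.stage is a
# SpaceStage value such as "BUILDING", "RUNNING", or "RUNTIME_ERROR".
while True:
    runtime = api.get_space_runtime(SPACE_ID)
    print("stage:", runtime.stage)
    if runtime.stage not in ("BUILDING", "RUNNING_BUILDING"):
        break
    time.sleep(10)

if runtime.stage != "RUNNING":
    print("Build did not reach RUNNING - check the container logs in Space Settings")
```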