---
name: embedding
description: >
  Standalone embedding service for semantic search. Runs as a persistent
  FastAPI server for millisecond-latency embeddings. Supports model swapping
  via env vars. Use when you need vectors for any database (ArangoDB,
  Pinecone, etc.).
allowed-tools: Bash, WebFetch
triggers:
  - embed this
  - embed text
  - start embedding service
  - get embeddings
  - generate vectors
  - semantic search vectors
metadata:
  short-description: Persistent embedding service for semantic search
---

# Embedding Skill

Standalone embedding service for semantic search across any database.

## Architecture

```
┌─────────────────────────────────────────┐
│        embedding service (:8602)        │
│     Model: EMBEDDING_MODEL env var      │
│         Device: auto (CPU/GPU)          │
└───────────────────┬─────────────────────┘
                    │
    ┌───────────────┼───────────────┐
    ▼               ▼               ▼
  memory      edge-verifier    your-project
  skill         searches       ArangoDB/etc
```

## Quick Start

```bash
# Start the service (first run loads the model, ~5-10 s)
./run.sh serve

# Embed text (CLI)
./run.sh embed --text "your query here"

# Embed via HTTP (once the service is running)
curl -X POST http://127.0.0.1:8602/embed -H "Content-Type: application/json" \
  -d '{"text": "your query here"}'
```

## Commands

| Command                            | Description                                      |
| ---------------------------------- | ------------------------------------------------ |
| `./run.sh serve`                   | Start the persistent FastAPI server              |
| `./run.sh embed --text "..."`      | Embed a single text (uses the service if running) |
| `./run.sh embed --file input.txt`  | Embed file contents                              |
| `./run.sh info`                    | Show model, device, and service status           |

## Configuration

| Variable                | Default                 | Description                          |
| ----------------------- | ----------------------- | ------------------------------------ |
| `EMBEDDING_MODEL`       | `all-MiniLM-L6-v2`      | Sentence-transformers model name     |
| `EMBEDDING_DEVICE`      | `auto`                  | Device: `auto`, `cpu`, `cuda`, `mps` |
| `EMBEDDING_PORT`        | `8602`                  | Service port                         |
| `EMBEDDING_SERVICE_URL` | `http://127.0.0.1:8602` | Client connection URL                |

## Swapping Models

```bash
# Use a different model for this project
export EMBEDDING_MODEL="nomic-ai/nomic-embed-text-v1"
./run.sh serve

# Or run a larger model on GPU
export EMBEDDING_MODEL="intfloat/e5-large-v2"
export EMBEDDING_DEVICE="cuda"
./run.sh serve
```

Note that models differ in output dimensionality (`all-MiniLM-L6-v2` produces 384-dimensional vectors, `intfloat/e5-large-v2` produces 1024), so vectors from different models are not comparable. After switching models, re-embed any stored vectors before running similarity queries against them.

## API Endpoints

### POST /embed

Embed a single text.

```json
{"text": "query to embed"}
→ {"vector": [0.1, 0.2, ...], "model": "all-MiniLM-L6-v2", "dimensions": 384}
```

### POST /embed/batch

Embed multiple texts in one request.

```json
{"texts": ["query 1", "query 2"]}
→ {"vectors": [[...], [...]], "model": "...", "count": 2}
```

### GET /info

Service status and configuration.

```json
{
  "model": "all-MiniLM-L6-v2",
  "device": "cuda",
  "dimensions": 384,
  "status": "ready"
}
```

## Integration Examples

### ArangoDB Semantic Search

A minimal end-to-end sketch. The host, database name, credentials, and collection are placeholders; query execution assumes python-arango and ArangoDB 3.9+ (which introduced `COSINE_SIMILARITY`):

```python
import httpx
from arango import ArangoClient

# Get the query embedding from the service
resp = httpx.post("http://127.0.0.1:8602/embed", json={"text": "find similar docs"})
vector = resp.json()["vector"]

# Rank documents by cosine similarity in AQL
aql = """
FOR doc IN my_collection
  LET score = COSINE_SIMILARITY(doc.embedding, @vector)
  FILTER score > 0.7
  SORT score DESC
  RETURN doc
"""

# Execute (placeholder host, database, and credentials)
db = ArangoClient(hosts="http://127.0.0.1:8529").db("mydb", username="root", password="")
results = list(db.aql.execute(aql, bind_vars={"vector": vector}))
```

### From Memory Skill

The memory skill can consume this service by setting:

```bash
export EMBEDDING_SERVICE_URL="http://127.0.0.1:8602"
```

## Cold Start

The first invocation loads the model (~5-10 seconds). After that, embeddings return with millisecond latency. The service logs its progress:

```
[embedding] Loading model: all-MiniLM-L6-v2...
[embedding] Model loaded in 6.2s
[embedding] Service ready on http://127.0.0.1:8602
```
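If you script against the service, you may want to wait out the cold start before sending requests. A minimal sketch that polls the documented `GET /info` endpoint until it reports `ready`; the retry interval and timeout are arbitrary choices, not part of the service:

```python
import time

import httpx

SERVICE_URL = "http://127.0.0.1:8602"

def wait_until_ready(timeout: float = 30.0, interval: float = 0.5) -> None:
    """Poll GET /info until the service reports status 'ready'."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            info = httpx.get(f"{SERVICE_URL}/info", timeout=2.0).json()
            if info.get("status") == "ready":
                return
        except httpx.HTTPError:
            pass  # service not up yet; keep polling
        time.sleep(interval)
    raise TimeoutError(f"embedding service not ready after {timeout}s")

wait_until_ready()
```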
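Once the service is ready, `/embed/batch` pairs naturally with bulk indexing: embed documents in chunks, then store each vector alongside its source text. A sketch assuming only the batch endpoint documented above; the chunk size and document list are illustrative:

```python
import httpx

SERVICE_URL = "http://127.0.0.1:8602"
docs = ["first document", "second document", "third document"]

# Embed in chunks to keep individual request payloads small.
CHUNK = 32
records = []
for i in range(0, len(docs), CHUNK):
    chunk = docs[i : i + CHUNK]
    resp = httpx.post(f"{SERVICE_URL}/embed/batch", json={"texts": chunk})
    resp.raise_for_status()
    # The response's "vectors" list is parallel to the submitted texts.
    for text, vector in zip(chunk, resp.json()["vectors"]):
        records.append({"text": text, "embedding": vector})

# `records` is now ready for bulk insert into any vector-capable store.
```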