---
name: detecting-tips-zones
description: Text-prompted image zone detection using TIPSv2 B/14 on CPU. Produces `focus_targets` / `focus_edges` bbox lists from natural-language labels, ready to feed into `svg-portrait-mode`. Use when you want automatic foreground/background separation from prompts like "dog face" + "wooden floor" instead of hand-annotating bboxes.
metadata:
  version: 0.1.0
---

# Detecting TIPS Zones

Zero-shot zone detection: text prompts → patch-grid cosine heatmaps → bboxes. Companion to `svg-portrait-mode` — replaces manual `focus_targets` / `focus_edges` annotation with a TIPSv2 B/14 forward pass.

## Quick Start

```python
from tips_zones import detect_zones
from portrait_mode import portrait_mode

focus_targets, focus_edges = detect_zones(
    "photo.jpg",
    targets=["dog face"],
    edges=["dog paws", "dog ears", "dog body"],
    distractors=["wooden floor", "carpet rug", "shoes", "wall"],
    ckpt_dir="/path/to/tips/checkpoints",
    tips_root="/path/to/tips",
)

svg, stats = portrait_mode(
    "photo.jpg",
    focus_targets=focus_targets,
    focus_edges=focus_edges,
    style_transforms={"background": "desaturate:0.7"},
)
```

Amortise model load across multiple images:

```python
from tips_zones import load_models, detect_zones

models = load_models(ckpt_dir, tips_root, device="cpu")
for img in images:
    ft, fe = detect_zones(img, targets=[...], edges=[...], distractors=[...],
                          ckpt_dir=ckpt_dir, tips_root=tips_root, models=models)
    ...
```

## How It Works

```
image → B/14 vision encoder (MaskCLIP values trick on last block)
      → (32×32 patch grid at 448, or 64×64 at 896) × 768-d patch features

text labels → prompt ensemble (9 TCL templates) → B/14 text encoder
            → per-label mean feature → L2-normalise

per-label heatmap = cos(patch feature, label feature)   # raw, no softmax

bbox = top-k% patches → largest connected component → scaled + padded to image coords
```

### Why no softmax over labels

Naïve softmax assumes labels are mutually exclusive. `dog face`, `dog ears`, and `dog body` are all true of the same pixels, so softmax collapses to near-uniform and every heatmap covers the whole subject. Raw cosines + per-label top-k threshold works much better — at the cost of requiring **distractor labels** to anchor the relative scale. Always pass some distractors (floor, wall, props — whatever is in the scene but not the subject).

## Parameters

```python
detect_zones(
    image,                 # path | PIL Image
    targets,               # ["main subject label", ...]
    edges=(),              # ["sub-region label", ...]
    distractors=(),        # scene elements to anchor against — pass these!
    *,
    ckpt_dir,              # has tips_v2_oss_b14_{vision,text}.npz + tokenizer.model
    tips_root,             # local clone of google-deepmind/tips
    input_size=448,        # 448 → 32×32 grid, 896 → 64×64 (~12× slower on CPU)
    target_top_frac=0.04,  # fraction of patches kept per target label
    edge_top_frac=0.06,    # fraction of patches kept per edge label
    pad_frac=0.02,         # bbox padding as fraction of image dim
    device="cpu",
    models=None,           # optional pre-loaded (img_model, text_model, tokenizer)
)
```

Returns `(focus_targets, focus_edges)` — both lists of `{'bbox': (x1, y1, x2, y2), 'label': str}`.

## Performance (CPU, 16 cores)

| Step | Time |
|------|------|
| `load_models` (warm) | ~3.5s |
| `load_models` (cold, over 9p) | ~50s |
| Text encoding (9 templates × N labels) | ~0.1s |
| Vision forward @ 448 | 0.3–0.6s |
| Vision forward @ 896 | ~6–7s |

Inference is negligible next to `portrait_mode()` on large images.
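For orientation, the final heatmap → bbox step described in How It Works corresponds roughly to the sketch below. This is a minimal illustration, not this skill's actual internals: `heatmap_to_bbox`, its exact thresholding, and the rounding are assumptions; only the patch-grid shapes and the top-k → largest-connected-component → pad pipeline come from the section above.

```python
import numpy as np
from scipy import ndimage

def heatmap_to_bbox(heatmap, top_frac, img_w, img_h, pad_frac=0.02):
    """Hypothetical sketch: top-k% patches -> largest connected component -> padded pixel bbox."""
    grid_h, grid_w = heatmap.shape                     # e.g. 32x32 at input_size=448
    k = max(1, int(round(top_frac * heatmap.size)))    # number of patches to keep
    thresh = np.sort(heatmap, axis=None)[-k]           # cosine value of the k-th best patch
    mask = heatmap >= thresh
    labels, n = ndimage.label(mask)                    # connected components of kept patches
    sizes = ndimage.sum(mask, labels, index=range(1, n + 1))
    ys, xs = np.nonzero(labels == int(np.argmax(sizes)) + 1)   # largest component only
    pad_x, pad_y = pad_frac * img_w, pad_frac * img_h
    x1 = max(0.0, xs.min() / grid_w * img_w - pad_x)
    y1 = max(0.0, ys.min() / grid_h * img_h - pad_y)
    x2 = min(float(img_w), (xs.max() + 1) / grid_w * img_w + pad_x)
    y2 = min(float(img_h), (ys.max() + 1) / grid_h * img_h + pad_y)
    return int(x1), int(y1), int(x2), int(y2)

# The heatmap itself is just the raw cosine between L2-normalised patch features
# and one L2-normalised label feature, reshaped onto the patch grid, e.g.:
#   heatmap = (patch_feats @ label_feat).reshape(32, 32)
```

In this sketch `top_frac` plays the role of `target_top_frac` / `edge_top_frac` from Parameters, and `pad_frac` matches the parameter of the same name.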
## Capability Notes

**Subject / background split: strong.** B/14 separates subject from scene reliably — typical split ~30/70 subject:background on single-subject photos.

**Sub-part discrimination: weak at B/14 + 448.** "dog face" vs "dog paws" vs "dog ears" tend to fire on the same region. The 32×32 patch grid is not the bottleneck (64×64 at 896 barely helps); B/14's patch features just don't encode fine sub-part semantics strongly. If you need per-part zones:

1. Sharpen prompts — "close-up of dog's furry face" > "dog face" (try this first)
2. L/14 or SO/14 model (richer features, larger download)
3. Sliding-window inference (tile crops, stitch heatmaps)

For coarse target/edge zoning (the `portrait_mode` use case), B/14 at 448 is enough.

## Requirements

Python deps:

```bash
pip install torch torchvision tensorflow tensorflow-text scipy pillow numpy --break-system-packages -q
```

Upstream TIPS repo (for the `tips.pytorch` image/text encoder modules):

```bash
git clone https://github.com/google-deepmind/tips /path/to/tips
```

B/14 checkpoints (~500MB total) go in a directory passed as `ckpt_dir`:

- `tips_v2_oss_b14_vision.npz`
- `tips_v2_oss_b14_text.npz`
- `tokenizer.model`

Download links are in the TIPS repo README.

## Prompt Engineering Tips

- **Always include distractors.** Without them, top-k thresholding has no relative scale. 3–7 distractors covering scene elements (floor, wall, background objects) is the sweet spot.
- **Use concrete nouns over abstract ones.** "carpet rug" > "textured floor".
- **`top_frac` tuning.** If a target bbox is too small, raise `target_top_frac` (0.04 → 0.08); if it is too big or bleeds into the scene, lower it.
- **Pad modestly.** `pad_frac=0.02` works for most photos; raise to 0.05 for subjects near frame edges.

## EXIF Caveat

`portrait_mode` (via OpenCV) honours EXIF rotation. PIL (this skill's preprocessing) does not. For correctly-oriented source images they agree; for EXIF-rotated phone photos the detected bboxes will be in the *raw pixel* orientation. Either:

- Re-save the source with the EXIF rotation baked into the pixels: `ImageOps.exif_transpose(Image.open(p)).save(p)`
- Or call `ImageOps.exif_transpose(pil)` before passing to `detect_zones` (see the sketch below).
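A minimal sketch of the second option, reusing the `detect_zones` signature documented in Parameters (the file name and label strings are illustrative):

```python
from PIL import Image, ImageOps
from tips_zones import detect_zones

# Bake the EXIF orientation into the pixels before detection, so the returned
# bboxes line up with what the OpenCV-based portrait_mode() sees.
pil = ImageOps.exif_transpose(Image.open("phone_photo.jpg"))

focus_targets, focus_edges = detect_zones(
    pil,                                   # pass the PIL image, not the path
    targets=["dog face"],
    edges=["dog body"],
    distractors=["wooden floor", "wall"],
    ckpt_dir="/path/to/tips/checkpoints",
    tips_root="/path/to/tips",
)
```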