# Zero-copy GPU encoding (CUDA → NVENC)

Stream a **GPU-resident** framebuffer straight to NVENC with **no host copy**: a
CUDA NV12 (or RGB) buffer — a CuPy / PyTorch / any `__dlpack__` tensor — is handed
to `h264_nvenc` via PyAV's `VideoFrame.from_dlpack`, and the encoder reads device
memory directly. This is the GPU counterpart to the host
[`NvencCpuEncoder`](guide_python.md), which uploads host `rgb24` and reformats to
`yuv420p` on the CPU first.

For a render-on-GPU scientific pipeline this removes the CPU color-conversion and
the PCIe upload from every frame and frees the CPU entirely.

## Measured payoff

Per-frame encode latency on an RTX 4090 Laptop GPU (moving gradient, vbr,
`delay=0`), CPU-origin (host `rgb24` → CPU `yuv420p` reformat → upload → encode)
vs GPU zero-copy (device RGB → on-GPU NV12 → `from_dlpack` → encode):

| Resolution | RGB→NV12 (GPU) | GPU encode (zero-copy) | GPU total | GPU fps | CPU-origin | CPU fps | speed-up |
| ---------- | -------------- | ---------------------- | --------- | ------- | ---------- | ------- | -------- |
| 1280×720   | 0.009 ms       | 1.36 ms                | 1.37 ms   | 730     | 3.25 ms    | 307     | 2.4×     |
| 1920×1080  | 0.014 ms       | 2.49 ms                | 2.50 ms   | 400     | 7.26 ms    | 138     | 2.9×     |
| 2560×1440  | 0.021 ms       | 3.52 ms                | 3.54 ms   | 282     | 12.73 ms   | 79      | 3.6×     |
| 3840×2160  | 0.057 ms       | 7.08 ms                | 7.08 ms   | 141     | 30.53 ms   | 33      | 4.3×     |

The NVENC kernel itself is GPU-bound either way; the win is removing the CPU
`rgb→yuv` reformat (libswscale, single-threaded — brutal at 4K) and the per-frame
PCIe upload (~0.5 ms at 1080p, ~2.2 ms at 4K). Reproduce with
`python -m pdum.rfb.benchmark --gpu` (see [Benchmark](#benchmark)).

## What NV12 is

NVENC's native input is **NV12**: YUV 4:2:0, 8-bit, *semi-planar*.

- **Y (luma)** — full resolution, `W×H` bytes.
- **UV (chroma)** — half resolution in both axes, one **interleaved** plane of
  `U,V,U,V…` (`(H/2)×W` bytes). Each chroma sample is shared by a 2×2 pixel block.

Total `1.5 bytes/pixel` (vs 3 for RGB). "Semi-planar" = Y separate, U/V interleaved
(NV12), unlike `yuv420p`/I420's three separate planes. Critically, NVENC wants NV12
as **one contiguous allocation** — Y plane, then the UV plane at byte offset
`pitch·height` — because it reads UV relative to the Y base pointer. This module's
[`rgb_to_nv12`](#api) produces exactly that layout; [`nv12_planes`](#api) slices it
back into the two DLPack planes.

## Quick start

```python
import asyncio, cupy as cp, pdum.rfb as rfb

# 1) BEFORE any framework's first CUDA op (CuPy/PyTorch/JAX): share the device
#    primary context with FFmpeg. This pokes the CUDA driver directly (no CuPy),
#    so it must run before anything activates the primary context — otherwise the
#    flags don't take effect.
rfb.enable_cuda_context_sharing()

async def main():
    # 2) gpu=True selects the zero-copy CUDA→NVENC encoder (validated at startup).
    display = await rfb.serve(1920, 1080, port=8765, gpu=True)
    try:
        while True:
            for ev in display.poll_events():
                ...  # handle input
            frame_rgb = render_on_gpu()           # a CuPy (H, W, 3) uint8 array
            display.publish(frame_rgb)            # zero-copy: stays on the GPU
            await asyncio.sleep(1 / 60)
    finally:
        await display.aclose()

asyncio.run(main())
```

Publish a CuPy `(H, W, 3)` array directly, or — to skip even the RGB→NV12 step —
publish an already-NV12 frame:

```python
nv12 = rfb.gpu.rgb_to_nv12(frame_rgb)             # contiguous (H+H//2, W) on GPU
display.publish(rfb.gpu.cuda_frame(nv12, pixel_format="nv12", height=1080))
```

Any framework works as long as the tensor exposes `__cuda_array_interface__` or a
CUDA `__dlpack__` — **CuPy, PyTorch, and JAX** all do, and all run on the device
**primary context**, so the pointer FFmpeg/NVENC sees is valid (that's what
[`enable_cuda_context_sharing`](#api) guarantees; it operates on the primary context
via the CUDA driver, not on any one library). **One caveat:** a framework that
creates its *own* non-primary context — e.g. Numba's CUDA target, which calls
`cuCtxCreate` rather than retaining the primary context — produces pointers that
live in a different context and can't be registered, even after the call. CuPy,
PyTorch, and JAX are not in that category.

## Requirements

`pdum.rfb.gpu.cuda_zerocopy_available()` returns `True` only when **all** hold (it
runs an actual one-frame encode to be sure, and caches the result):

1. **CuPy** — `cupy-cuda13x` / `cupy-cuda12x` (cp314 wheels exist; works on 3.14).
2. **An NVENC-capable GPU + driver** — same gate as the host NVENC backend
   (`pdum.rfb.encoders.nvenc.nvenc_cpu_available()`).
3. **PyAV that can *encode* CUDA frames — PyAV ≥ 18.** `from_dlpack` (frame
   *creation*) is in 17.0, but feeding a CUDA frame to an encoder — adopting the
   frame's `hw_frames_ctx` before `avcodec_open2` — lands in **18.0**
   ([PyAV #2199](https://github.com/PyAV-Org/PyAV/issues/2199)), unreleased at time
   of writing (the fix is on `main`).

On PyAV 17.x the encode raises `avcodec_open2(...) returned 22; hw_frames_ctx must
be set when using GPU frames as input`.

## Installing it today (before PyAV 18.0 ships on PyPI)

CuPy installs normally (`pip install habemus-papadum-rfb[gpu-cuda13]`, or
`[gpu-cuda12]` for CUDA 12). The only catch is **PyAV ≥ 18**, which isn't on PyPI
yet. Three paths, easiest first; all land in your active env (or `$PYTHON`):

**1. Prebuilt self-contained wheel (recommended).** A wheel with a CUDA-enabled
ffmpeg bundled in — no system ffmpeg, no compiler, no env vars. Host it on a GitHub
release (see *Building & hosting* below), then:

```bash
PYAV_WHEEL_URL=https://github.com/<owner>/<repo>/releases/download/<tag>/av-...whl \
  ./scripts/install-gpu.sh           # installs the wheel + CuPy, then self-tests
# or directly:  uv pip install <that-url> cupy-cuda13x
```

**2. Build from source (one command).** No prebuilt wheel needed — the script
fetches a CUDA ffmpeg (a BtbN LGPL shared build) and builds PyAV from a pinned
commit, baking an rpath so **no `LD_LIBRARY_PATH`** is needed at runtime:

```bash
./scripts/install-gpu.sh             # ~1 min the first time; uv caches the build
# CUPY_PACKAGE=cupy-cuda12x ./scripts/install-gpu.sh   # for a CUDA 12 toolkit
```

**3. When PyAV 18.0 is released:** add `"av>=18"` to the `[gpu-cuda13]`/`[gpu-cuda12]`
extras and it collapses to a one-step `pip install habemus-papadum-rfb[gpu-cuda13]`
— the 18.0 wheel bundles a CUDA-capable ffmpeg, so no build and no system ffmpeg.

### Building & hosting the wheel (maintainers)

`scripts/build-cuda-av-wheel.sh` builds the self-contained wheel(s):

```bash
PYTHON_VERSIONS="3.12 3.13 3.14" ./scripts/build-cuda-av-wheel.sh   # -> dist/cuda-wheels/
gh release create gpu-av18-<date> dist/cuda-wheels/av-*.whl \
  --title "PyAV 18 (CUDA/NVENC) wheels" --notes "Self-contained; bundles LGPL ffmpeg."
```

It links PyAV against a BtbN **LGPL** ffmpeg (has `h264_nvenc` + the CUDA hwcontext,
`--disable-libx264` ⇒ no GPL components) and runs `auditwheel repair` to bundle the
ffmpeg `.so`s (tagged `manylinux_2_28` ⇒ installs on RHEL8 / Ubuntu 18.10+ and
newer). `libcuda`/`libnvidia-encode` are **not** bundled — they come from the host
driver, as they must. **Licensing:** the bundled ffmpeg is LGPL, so redistributing
the wheel carries LGPL obligations (offer the corresponding ffmpeg source / build
config). Hosting in this repo's GitHub **Releases** (not committed to the tree) is
the simplest option; a PEP 503 index on GitHub Pages is a later nicety.

## Two gotchas the library handles for you

### One shared CUDA context

CuPy uses the device **primary** context. FFmpeg's CUDA hwcontext (`primary_ctx=1`)
expects that context to have been created with `CU_CTX_SCHED_BLOCKING_SYNC` flags.
If CuPy activates it first with the default (auto) flags:

- `primary_ctx=True` fails with *"Primary context already active with incompatible
  flags"*; and
- a separate `primary_ctx=False` context can't register CuPy's pointers — NVENC
  *"resource register failed (23)"*, because a device pointer from one context
  isn't valid in another on the same device.

`enable_cuda_context_sharing()` pre-sets the flags (via the CUDA driver
`cuDevicePrimaryCtxSetFlags`). **Call it once, before any CuPy/PyTorch CUDA op**
(importing CuPy is fine; the first allocation/op is what activates the context).
`serve(gpu=True)` and the encoder call it defensively too, but if CuPy has already
run, it is too late for that process.

### NV12 must be one contiguous allocation

Pass NVENC two *separate* CuPy arrays for Y and UV and registration fails. Allocate
one buffer and slice views — which is what `rgb_to_nv12` / `nv12_planes` do:

```python
nv12 = cp.empty((H + H // 2, W), cp.uint8)   # one allocation
y, uv = nv12[:H], nv12[H:]                    # views; uv at base + W*H
```

## RGB → NV12 conversion options

NVENC needs YUV, so a GPU RGB buffer must be converted first. Cheapest-effort first:

1. **A custom CuPy `RawKernel`** — what `pdum.rfb.gpu.rgb_to_nv12` uses (BT.601
   limited range). ~20 lines of CUDA C, no extra dependency, ~0.01 ms at 1080p.
   **Recommended** — the conversion is so cheap that nothing else buys anything.
2. **NPP** (`nppiRGBToNV12_*`) — NVIDIA's prebuilt image primitives, ships with
   CUDA; fast and battle-tested but adds an NPP binding.
3. **CV-CUDA / `nvcv`** — `cvcuda.cvtcolor`; a heavier dependency, worthwhile only
   if you already use it.
4. **PyNvVideoCodec / VPF** — bundle convert *and* encode, but have no cp314 wheel
   (see [the NVENC-source route](#alternative-the-nvenc-source-route)).

## Can we avoid building PyAV from source on `< 18`?

Short answer: **no pure-Python monkey-patch exists; you must build PyAV from
source** (or wait for the 18.0 wheel). Investigated and ruled out:

- **`HWAccel` (setting `hw_device_ctx`)** — PyAV *can* set the encoder's
  `hw_device_ctx` from Python via `HWAccel`, but NVENC explicitly rejects it for
  GPU input: *"hw_frames_ctx must be set when using GPU frames as input"*. It needs
  `hw_frames_ctx` specifically.
- **A `ctypes` poke at `avctx->hw_frames_ctx`** — PyAV exposes **no** Python handle
  to the underlying `AVCodecContext` / `AVFrame` pointers, and Cython cdef-object
  offsets are not stable ABI. Not viable.

So `< 18` needs a build. Good news: **no custom FFmpeg is required** — the stock
PyPI `av` wheel's bundled ffmpeg already has the CUDA hwcontext (it's auto-enabled
by the nv-codec-headers + nvenc dependency; it just isn't a separate
`--enable-cuda` token, which is why `from_dlpack(primary_ctx=False)` works on the
stock wheel today). You only need to rebuild *PyAV* against an ffmpeg dev tree.

This is what [`scripts/install-gpu.sh`](#installing-it-today-before-pyav-180-ships-on-pypi)
automates (Option A). The manual forms, for reference:

### Option A — build PyAV `main` / a pinned commit (the official fix)

```bash
# needs a CUDA ffmpeg dev tree on PKG_CONFIG_PATH (a BtbN LGPL/GPL "shared" release —
# no compiling ffmpeg yourself); LDFLAGS bakes an rpath so no LD_LIBRARY_PATH at runtime
PKG_CONFIG_PATH=/path/to/ffmpeg/lib/pkgconfig LDFLAGS="-Wl,-rpath,/path/to/ffmpeg/lib" \
  uv pip install --no-cache --no-binary av "av @ git+https://github.com/PyAV-Org/PyAV@main"
```

> uv caches built wheels by git commit, **not** by the ffmpeg you link against — so
> use `--no-cache` (or `--refresh`) when (re)building against a specific ffmpeg, or a
> stale wheel may be reused silently.

### Option B — the minimal patch on 17.1.0 (pin to a known version)

Two edits to the PyAV sdist, then build from source. They are exactly what 18.0
does ([#2199](https://github.com/PyAV-Org/PyAV/issues/2199)):

1. `include/avcodec.pxd` — declare the field (the cdef struct omits it):
   ```diff
            AVHWAccel *hwaccel
            AVBufferRef *hw_device_ctx
   +        AVBufferRef *hw_frames_ctx
    ```
2. `av/video/codeccontext.py` — adopt a hardware input frame's `hw_frames_ctx`
   before the encoder is opened:
   ```python
   @cython.cfunc
   def _prepare_and_time_rebase_frames_for_encode(self, frame: Frame):
       if (not self.is_open and frame is not None
               and frame.ptr.hw_frames_ctx and not self.ptr.hw_frames_ctx):
           self.ptr.hw_frames_ctx = lib.av_buffer_ref(frame.ptr.hw_frames_ctx)
       return CodecContext._prepare_and_time_rebase_frames_for_encode(self, frame)
   ```
   ```bash
   PKG_CONFIG_PATH=/path/to/ffmpeg/lib/pkgconfig uv pip install --no-binary av ./PyAV-17.1.0
   ```

Either way, `cuda_zerocopy_available()` flips to `True` and everything below works.

## API

All of `pdum.rfb.gpu` lazy-imports CuPy, so importing it is always safe.

| Symbol | Purpose |
| ------ | ------- |
| `enable_cuda_context_sharing(device_id=0)` | Pre-set primary-ctx flags so CuPy + FFmpeg share one context. **Call first.** |
| `cuda_zerocopy_available()` | `True` iff the full stack works (cached; runs a real encode). |
| `rgb_to_nv12(rgb, *, out=None)` | Device `(H,W,3)` → contiguous NV12 `(H+H//2, W)` (custom kernel). |
| `nv12_planes(packed)` | Slice contiguous NV12 into `(Y, UV)` DLPack-ready views. |
| `cuda_frame(array, *, pixel_format="auto", ...)` | Wrap a device tensor as a CUDA `RawFrame` for `publish()`. |
| `to_host_rgb(frame)` | Download a CUDA frame to host `rgb24` (used by the image fallback). |
| `HostFrameAdapter(inner)` | Wrap a host encoder so it tolerates CUDA frames (downloads first). |
| `NvencGpuPyavEncoder` | The `EncoderBackend` (registered as `"nvenc_gpu_pyav"`). |

`publish()` accepts a CuPy `(H,W,3|4)` tensor directly (or a `cuda_frame` for NV12),
and `serve(gpu=True)` selects `NvencGpuPyavEncoder` for every viewer.

## Architecture & integration

- `RawFrame.memory == "cuda"` (the type already modelled this) carries the device
  tensor; `Display.publish` tags CuPy/DLPack tensors automatically.
- `NvencGpuPyavEncoder` (`encoders/nvenc_gpu_pyav.py`) subclasses the host
  `H264CpuEncoder`, swapping only the input handling: it accepts a CUDA `nv12`
  frame (true zero-copy), a CUDA `rgb24`/`rgba8` frame (on-GPU convert first), or a
  host frame (uploaded then converted — a graceful fallback). It reuses one
  contiguous NV12 staging buffer (safe because `delay=0` consumes each frame before
  the next), and one persistent `CudaContext` so every frame shares the encoder's
  `hw_frames_ctx`.
- Wire format, Annex B packing, forced-keyframe handling, and backpressure are
  **inherited unchanged** — the browser side needs nothing new.
- **Image-only viewers** on a GPU-publishing display still work: their image
  encoder is wrapped in `HostFrameAdapter`, which downloads each CUDA frame to host
  `rgb24` (NV12 is converted on the GPU first). GPU mode otherwise targets
  WebCodecs (H.264) viewers.

## Benchmark

```bash
# CPU-origin vs GPU zero-copy, per resolution, with CUDA-event-timed conversion
python -m pdum.rfb.benchmark --gpu
```

Reports, per resolution: the RGB→NV12 conversion cost (timed with
`cupy.cuda.Event` markers), the zero-copy encode latency, and the CPU-origin
latency for comparison. Requires the full stack above.

## Alternative: the NVENC-source route

PyAV is the pragmatic backend (one dependency, no build once 18.0 ships). The other
route is **NVIDIA's own** binding to the Video Codec SDK:

- **PyNvVideoCodec / VPF** — takes CUDA arrays directly (DLPack / CAI), bundles its
  own color conversion, and bypasses ffmpeg entirely. But: **no cp314 wheel and no
  sdist** on PyPI, so it can't `pip install` on 3.14. Building from the Video Codec
  SDK source is possible (CUDA + `nvcc` are present on a dev box) but needs the SDK
  headers (`nvEncodeAPI.h`) and is a heavier, NVIDIA-version-coupled dependency.
- **A direct `ctypes`/`cffi` binding to `libnvidia-encode`** — no build step
  (dlopen the driver lib), maximal control, but a large amount of NVENC-API
  plumbing to maintain.

Trade-off: PyAV reuses our existing Annex-B / decode-back test infrastructure and
adds zero new Python dependencies; the NVENC-SDK route removes the ffmpeg layer and
the PyAV-18 dependency but adds a build step and a hand-maintained binding. If the
SDK source is available, the most interesting evaluation is whether a thin
PyNvVideoCodec build (or a minimal `cffi` shim) can match the PyAV path's latency
while accepting the *same* DLPack frames `gpu.cuda_frame` already produces — in
which case it could slot in behind the same `register_video_encoder("nvenc_gpu_pyav",
...)` seam.

## Caveats

- **Consumer GPUs** can transiently `EINVAL` (or rarely hard-fault) on rapid NVENC
  *session* open/close churn. Production uses one long-lived encoder per connection
  and is unaffected; the test suite retries and GCs between encoders.
- Publish a **fresh** device buffer per frame — viewers share the reference and may
  read it asynchronously (same rule as the host path). Or opt into `serve(gpu=True,
  own_frames=True)`: `publish()` then does a device-to-device copy into a recycled
  server-owned CuPy buffer, so you may reuse your own device tensor immediately (no
  reallocation, no release callback). See the
  [frame ownership model](guide_python.md#frame-ownership-memory-model).
- Even dimensions only (NV12), and `width ≥ 160` (NVENC minimum).
- The encoder uses device 0 and the primary context; multi-GPU selection is a
  future extension.