---
name: browser-onnx
description: "Implements high-performance local machine learning inference in the browser using ONNX Runtime Web. Use this skill when the user needs privacy-first, low-latency, or offline AI capabilities (e.g., image classification, object detection, or NLP) without server-side processing."
license: MIT
compatibility: "Requires a browser with WebAssembly (WASM) support for CPU or WebGPU for hardware acceleration."
metadata:
  version: "1.0.0"
  runtime: "onnxruntime-web"
---

# Browser-Based ONNX Inference

This skill provides a workflow for running ONNX models locally in the browser with **ONNX Runtime Web (ORT-Web)**. Local inference improves **data privacy**, **reduces server costs**, and **scales with the user base**, since each user supplies their own compute.

## 1. Setup and Installation

Install the required library via npm:

```bash
npm install onnxruntime-web
```
*Note: Experimental features such as WebNN, or the most recent WebGPU changes, may require the nightly build `onnxruntime-web@dev`.*

## 2. Global Environment Configuration

Set global `ort.env` flags before creating a session to optimize the runtime environment.

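- **WebAssembly (CPU):** Enable multi-threading by setting `ort.env.wasm.numThreads` (when unset, the runtime chooses roughly half of the hardware concurrency). Multi-threading relies on `SharedArrayBuffer`, so the page must be cross-origin isolated. Use a **Proxy Worker** (`ort.env.wasm.proxy = true`) to keep the UI responsive.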
- **WASM Paths:** If binaries are not in the same directory as the JS bundle, manually override paths using `ort.env.wasm.wasmPaths` to point to local assets or a CDN.
- **WebGPU (GPU):** Use `ort.env.webgpu.profiling = { mode: 'default' }` for performance diagnosis during development.
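
The flags above can be grouped into one setup step. The following is a minimal sketch, not an official pattern: `configureOrtEnv` and its option names (`wasmPaths`, `maxThreads`, `profile`) are hypothetical, and the `ort` module is passed in by the caller rather than imported here.

```javascript
// Hypothetical helper: applies the global flags to the `ort` module object
// supplied by the caller (e.g. the result of importing 'onnxruntime-web').
function configureOrtEnv(ort, { wasmPaths, maxThreads = 4, profile = false } = {}) {
  const cores = globalThis.navigator?.hardwareConcurrency ?? 1;
  // Threads need SharedArrayBuffer, which browsers expose only on
  // cross-origin isolated pages; otherwise stay single-threaded.
  const isolated = globalThis.crossOriginIsolated === true;
  ort.env.wasm.numThreads = isolated
    ? Math.max(1, Math.min(maxThreads, Math.ceil(cores / 2)))
    : 1;
  // Run the WASM backend in a proxy worker so the main thread stays free.
  ort.env.wasm.proxy = true;
  // Only override binary locations when the caller asks for it.
  if (wasmPaths) ort.env.wasm.wasmPaths = wasmPaths;
  // WebGPU profiling is a development aid, so it is opt-in here.
  if (profile) ort.env.webgpu.profiling = { mode: 'default' };
  return ort.env;
}
```

Call it once, before the first `ort.InferenceSession.create(...)`, so that every session sees the same settings.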

## 3. Creating an Inference Session

Initialize the session by choosing the appropriate **Execution Provider (EP)**:

```javascript
import * as ort from 'onnxruntime-web';

const session = await ort.InferenceSession.create('./model.onnx', {
  executionProviders: ['webgpu', 'wasm'], // Prioritize GPU, fallback to CPU
  graphOptimizationLevel: 'all' // Enable all graph-level optimizations
});
```
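
If the provider list should depend on what the browser exposes, a small feature check can build it. This is a sketch under assumptions: `pickExecutionProviders` is a hypothetical helper, and the presence of `navigator.gpu` only indicates that the WebGPU API exists, not that a usable adapter will be returned.

```javascript
// Hypothetical helper: returns an execution provider list for the session.
// `nav` defaults to the global navigator so it can be stubbed in tests.
function pickExecutionProviders(nav = globalThis.navigator) {
  // `navigator.gpu` is the WebGPU entry point; without it, use WASM only.
  const hasWebGpu = Boolean(nav && nav.gpu);
  return hasWebGpu ? ['webgpu', 'wasm'] : ['wasm'];
}
```

The returned array can be passed directly as `executionProviders` in the session options.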

## 4. Data Preprocessing

Input data must match the layout, data type, and normalization the model was trained with (e.g., NCHW `float32` for many vision models).

- **Image-to-Tensor:** Use libraries like **JIMP** or **OpenCV.js** to resize, normalize (e.g., divide by 255.0), and convert RGBA to RGB.
- **Tensor Creation:** Use `new ort.Tensor('float32', float32Data, dims)`, where `dims` is the input shape (e.g., `[1, 3, 224, 224]`), to prepare the input feeds.
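
The RGBA-to-RGB conversion and NCHW reordering can also be done by hand on canvas pixel data. A minimal sketch, assuming the image is already resized and that dividing by 255 is the only normalization the model needs; `rgbaToNchwFloat32` is a hypothetical helper name.

```javascript
// Converts interleaved RGBA bytes (as in ImageData.data) into a planar
// float32 buffer in NCHW order with values scaled to [0, 1].
function rgbaToNchwFloat32(rgba, width, height) {
  const plane = width * height;
  const out = new Float32Array(3 * plane);
  for (let i = 0; i < plane; i++) {
    out[i] = rgba[i * 4] / 255;                 // R plane
    out[plane + i] = rgba[i * 4 + 1] / 255;     // G plane
    out[2 * plane + i] = rgba[i * 4 + 2] / 255; // B plane (alpha is dropped)
  }
  return { data: out, dims: [1, 3, height, width] };
}
```

The result maps onto the tensor constructor as `new ort.Tensor('float32', data, dims)`.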

## 5. Optimized Inference Patterns

- **Graph Capture:** For models with static shapes on WebGPU, set `enableGraphCapture: true` in the session options to reduce CPU overhead by replaying captured kernel executions.
- **IO Binding:** For transformer models, where the outputs of one step often feed the next, keep data on the GPU by creating inputs with `ort.Tensor.fromGpuBuffer()` and setting `preferredOutputLocation: 'gpu-buffer'` to avoid expensive GPU-to-CPU copies.
- **Quantization:** Prefer **uint8 quantized models** for CPU (WASM) inference to improve performance; avoid float16 on CPU as it lacks native support and is slow.
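
One way to combine these options is to derive them from two facts about the deployment. This is an illustrative sketch only: `buildSessionOptions`, `useWebGpu`, and `staticShapes` are hypothetical names, and the option keys are the ones mentioned in this document.

```javascript
// Hypothetical helper: assembles session options for the chosen backend.
function buildSessionOptions({ useWebGpu, staticShapes = false }) {
  if (!useWebGpu) {
    // CPU path: pair this with a uint8 quantized model where possible.
    return { executionProviders: ['wasm'], graphOptimizationLevel: 'all' };
  }
  const options = {
    executionProviders: ['webgpu'],
    graphOptimizationLevel: 'all',
    // Keep outputs on the GPU so they can be fed back without a copy.
    preferredOutputLocation: 'gpu-buffer',
  };
  // Graph capture is only enabled when shapes are known to be static.
  if (staticShapes) options.enableGraphCapture = true;
  return options;
}
```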

## 6. Large Model Handling (>2GB)

- **Platform Limits:** Browsers like Chrome limit `ArrayBuffer` to ~2GB. Models exceeding this must be exported with **external data**.
- **Loading External Data:** Explicitly link external weight files in the session options:
  ```javascript
  const session = await ort.InferenceSession.create(modelUrl, {
    externalData: [{ path: './model.data', data: dataUrl }]
  });
  ```
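
When the weights are split across several files, the `externalData` list can be generated from the file names. A sketch under assumptions: `externalDataEntries` is a hypothetical helper, the base URL and file names are placeholders, and each `path` is assumed to match the relative location recorded in the exported model.

```javascript
// Hypothetical helper: builds `externalData` entries from a base URL and
// the external weight file names produced at export time.
function externalDataEntries(baseUrl, fileNames) {
  return fileNames.map((name) => ({
    path: `./${name}`,                 // location as referenced by the model
    data: new URL(name, baseUrl).href, // where the browser fetches the bytes
  }));
}
```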

## 7. Common Edge Cases

- **Memory Management:** Explicitly call `tensor.dispose()` for GPU tensors to prevent memory leaks.
- **Zero-Sized Tensors:** ORT-Web treats tensors with a dimension of 0 as CPU tensors regardless of the selected EP.
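- **Thermal Throttling:** Sustained inference on mobile devices may trigger frequency scaling, which can increase latency substantially (in some cases roughly doubling it). Prefer lightweight "tiny" models to keep the sustained load, and therefore the device temperature, manageable.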
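
For the memory management point, a `try`/`finally` wrapper makes disposal hard to forget. A minimal sketch: `withTensors` is a hypothetical helper, and it assumes each object exposes the `dispose()` method mentioned above.

```javascript
// Hypothetical helper: runs `fn` with the given tensors and disposes them
// afterwards, whether `fn` succeeds or throws.
async function withTensors(tensors, fn) {
  try {
    return await fn(...tensors);
  } finally {
    for (const t of tensors) {
      // Guard so plain CPU data or already-released values are skipped.
      if (t && typeof t.dispose === 'function') t.dispose();
    }
  }
}
```

Wrapping each `session.run(...)` call this way keeps GPU buffers from outliving the inference step that needed them.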

## 8. Examples

### Multilingual Translation
Offload heavy translation tasks to a separate **Web Worker** using a singleton pattern to ensure the model (e.g., NLLB-200) loads only once.
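
The singleton part can be sketched independently of the model. In this hypothetical helper, `createSingleton` caches the promise returned by a caller-supplied `load` function, so concurrent requests share one load; the name and the retry-on-failure behavior are assumptions, not part of ORT-Web.

```javascript
// Hypothetical helper: wraps an async loader so it runs at most once.
function createSingleton(load) {
  let pending = null;
  return function getInstance() {
    if (pending === null) {
      // Cache the promise itself so concurrent callers share one load.
      pending = (async () => load())().catch((err) => {
        pending = null; // allow a retry after a failed load
        throw err;
      });
    }
    return pending;
  };
}
```

Inside the worker, `load` would typically be a function that calls `ort.InferenceSession.create(...)` for the translation model, and every incoming message would await `getInstance()` before running inference.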

### Object Detection (YOLO)
Apply **Non-Max Suppression (NMS)** to the raw detections to remove overlapping boxes. If the selected execution provider lacks support for the NMS operators, run a separate NMS ONNX model so that filtering still happens locally.
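
As an alternative to a separate NMS model, the filtering can be done in plain JavaScript after inference. The following is a sketch of greedy NMS, not the approach described above: it assumes boxes in `[x1, y1, x2, y2]` corner format, a single class, and an illustrative IoU threshold of 0.45.

```javascript
// Intersection over union of two boxes given as [x1, y1, x2, y2].
function iou(a, b) {
  const w = Math.max(0, Math.min(a[2], b[2]) - Math.max(a[0], b[0]));
  const h = Math.max(0, Math.min(a[3], b[3]) - Math.max(a[1], b[1]));
  const inter = w * h;
  const union =
    (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter;
  return union > 0 ? inter / union : 0;
}

// Greedy NMS: visit boxes by descending score and keep a box only if it
// does not overlap an already kept box by more than `iouThreshold`.
function nonMaxSuppression(boxes, scores, iouThreshold = 0.45) {
  const order = scores.map((_, i) => i).sort((i, j) => scores[j] - scores[i]);
  const keep = [];
  for (const i of order) {
    if (keep.every((k) => iou(boxes[i], boxes[k]) <= iouThreshold)) {
      keep.push(i);
    }
  }
  return keep; // indices into `boxes`, highest score first
}
```

For multi-class detectors, run the function once per class or offset boxes by class before filtering.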