# 10. Optimization and Hardware Acceleration (Android)

This document defines a practical strategy to improve Android performance with ONNX Runtime, balancing latency, stability, and per-device compatibility.

## 1) ONNX Runtime SessionOptions

### graphOptimizationLevel
- Recommendation: use a high level in production (ORT_ENABLE_ALL), especially for reused sessions.
- Advantage: operator fusion and graph simplification reduce inference latency.
- Trade-off: session creation may become slower; this is usually acceptable when the session is created in loadModel and reused in later inferences.

### intraOpNumThreads
- Defines parallelism within an operator.
- Initial recommendation:
  - Low-end: 1-2
  - Mid/high-end: 2-4
- Trade-off: more threads can improve throughput, but may worsen tail latency (p95) and increase thermal/battery usage.

### interOpNumThreads
- Defines parallelism between operators.
- Initial recommendation: 1 for most mobile models (smaller graphs).
- Trade-off: higher values may add overhead for small models.

### executionMode
- ORT_SEQUENTIAL: generally better for single-request inference on mobile.
- ORT_PARALLEL: can help for larger models with real graph parallelism.
- Recommendation: start with ORT_SEQUENTIAL and validate with per-device benchmarks.

### memory pattern
- Recommendation: enable for inputs with stable shapes.
- Trade-off: better performance for repeated inferences; less gain with dynamic shapes.

### CPU arena allocator
- Recommendation: keep enabled by default to reduce allocation overhead.
- Trade-off: may increase resident memory usage in some scenarios.
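
The starting values above can be collected into one small configuration object. The sketch below is only an illustration: `DeviceTier`, `SessionConfig`, and `forTier` are hypothetical names, and picking the upper end of each thread range (2 for low-end, 4 for mid/high-end) is an assumption to be tuned by benchmarks. It does not call ONNX Runtime; in the ORT Java API these values would typically be applied through `OrtSession.SessionOptions` setters such as `setIntraOpNumThreads` and `setInterOpNumThreads`.

```java
// Hypothetical holder for the SessionOptions values discussed in this section.
// Nothing here calls ONNX Runtime; it only captures the chosen starting values.
enum DeviceTier { LOW_END, MID_HIGH }

final class SessionConfig {
    final String graphOptimizationLevel; // e.g. "ORT_ENABLE_ALL"
    final int intraOpNumThreads;         // parallelism within an operator
    final int interOpNumThreads;         // parallelism between operators
    final String executionMode;          // "ORT_SEQUENTIAL" or "ORT_PARALLEL"
    final boolean enableMemoryPattern;
    final boolean enableCpuMemArena;

    private SessionConfig(String level, int intra, int inter, String mode,
                          boolean memoryPattern, boolean cpuArena) {
        this.graphOptimizationLevel = level;
        this.intraOpNumThreads = intra;
        this.interOpNumThreads = inter;
        this.executionMode = mode;
        this.enableMemoryPattern = memoryPattern;
        this.enableCpuMemArena = cpuArena;
    }

    /**
     * Starting point per tier. The thread count is clamped to the cores that
     * are actually available and never drops below 1.
     */
    static SessionConfig forTier(DeviceTier tier, int availableCores,
                                 boolean stableInputShapes) {
        int wanted = (tier == DeviceTier.LOW_END) ? 2 : 4; // assumed defaults
        int intra = Math.max(1, Math.min(wanted, availableCores));
        // Memory pattern is only enabled when input shapes are stable.
        return new SessionConfig("ORT_ENABLE_ALL", intra, 1, "ORT_SEQUENTIAL",
                                 stableInputShapes, true);
    }
}
```

A caller could build the defaults with `SessionConfig.forTier(DeviceTier.MID_HIGH, Runtime.getRuntime().availableProcessors(), true)` and then override individual fields from remote configuration.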

## 2) Execution Providers on Android

### CPUExecutionProvider
- Availability: always present.
- Role: reliable baseline and universal fallback.
- When to use: safe mode, maximum compatibility, debugging.

### NNAPIExecutionProvider
- Availability: depends on Android version + driver/vendor.
- Role: main hardware acceleration path (NPU/DSP/GPU through NNAPI).
- Trade-off: gains vary widely across manufacturers and supported operations.

### XNNPACKExecutionProvider (if applicable)
- On Android, the most common path is ORT CPU EP with internal optimizations; XNNPACK depends on specific build/distribution.
- Recommendation: treat as optional and validate in the real artifact (AAR) before exposing as a public mode.

### QNNExecutionProvider (if applicable)
- Focused on Qualcomm hardware with a specific stack.
- Recommendation: consider only for manufacturer-specific distributions or a controlled device fleet.
- Trade-off: higher operational complexity and compatibility matrix overhead.

## 3) GPU / NPU on Android: practical viability

- Direct GPU support on Android through ONNX Runtime is not the most portable path for broad-market apps.
- Main realistic path: NNAPI, which tries to map workloads to device accelerators.
- Important limitation: not every model/op runs on NNAPI.
- Expected behavior: when an op is unsupported, partial or total fallback to CPU occurs.
- Practical impact: some devices may get large gains, while others may get little gain (or regressions).
- Manufacturer compatibility:
  - Recent Qualcomm: trend toward better gains on compatible models.
  - Exynos/MediaTek/variants: more heterogeneous behavior.
  - Older devices: CPU fallback is often dominant.

## 4) Recommended configuration strategy

Define a configurable execution mode in the plugin/app:
- cpu
- nnapi
- auto

### Mode semantics
- cpu: force CPUExecutionProvider.
- nnapi: try NNAPI; if it is unavailable or session creation fails, either raise an explicit error or fall back to CPU when a flag allows it.
- auto: try NNAPI first and fall back safely to CPU.
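
The three modes can be expressed as a small selection routine. This is a sketch under assumptions, not the plugin's implementation: `SessionFactory`, `Selection`, and `ProviderSelector` are hypothetical names, and the factory is assumed to signal failure with an unchecked exception (a real factory would register NNAPI on the session options, for example with `addNnapi()` in the ORT Java API, and wrap any checked exception). It only detects failure at session creation; per-op fallback inside the runtime is not visible here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical seam: builds a session for a provider or throws if it cannot. */
interface SessionFactory<S> {
    S create(String provider);
}

/** Outcome of provider selection, including what telemetry should record. */
final class Selection<S> {
    final S session;
    final String effectiveProvider;
    final boolean fallbackOccurred;
    final List<String> errors;

    Selection(S session, String effectiveProvider, boolean fallbackOccurred,
              List<String> errors) {
        this.session = session;
        this.effectiveProvider = effectiveProvider;
        this.fallbackOccurred = fallbackOccurred;
        this.errors = errors;
    }
}

final class ProviderSelector {
    /**
     * mode "cpu":   CPU only.
     * mode "nnapi": NNAPI; on failure, rethrow unless allowFallback is set.
     * mode "auto":  NNAPI first, always falling back to CPU on failure.
     */
    static <S> Selection<S> open(String mode, boolean allowFallback,
                                 SessionFactory<S> factory) {
        List<String> errors = new ArrayList<>();
        if ("cpu".equals(mode)) {
            return new Selection<>(factory.create("cpu"), "cpu", false, errors);
        }
        if (!"nnapi".equals(mode) && !"auto".equals(mode)) {
            throw new IllegalArgumentException("unknown mode: " + mode);
        }
        try {
            return new Selection<>(factory.create("nnapi"), "nnapi", false, errors);
        } catch (RuntimeException e) {
            errors.add("nnapi: " + e.getMessage());
            boolean mayFallBack = "auto".equals(mode) || allowFallback;
            if (!mayFallBack) {
                throw e; // explicit error for strict nnapi mode
            }
            return new Selection<>(factory.create("cpu"), "cpu", true, errors);
        }
    }
}
```

Returning the effective provider and the fallback flag together with the session keeps the selection result directly usable for the telemetry fields described in the instrumentation section.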

### Recommended parameters
- numThreads (mapped to intraOpNumThreads)
- interOpNumThreads
- graphOptimizationLevel
- executionMode
- enableMemoryPattern
- enableCpuMemArena

### Performance instrumentation (required)
Measure and record separately:
- download
- session creation
- pre-processing
- inference
- post-processing

Also record:
- effective provider used
- whether fallback occurred (yes/no)
- errors by op/provider
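
One way to keep the stages separate is a small timer that accumulates elapsed time per stage name. The sketch below is an assumption-laden illustration: `StageTimer`, its methods, and the report format are hypothetical, and the clock is injected so that the timer can be tested deterministically.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical per-stage timer; stage names are free-form strings. */
final class StageTimer {
    private final LongSupplier nanoClock;
    // LinkedHashMap keeps stages in the order they were first measured.
    private final Map<String, Long> elapsedNanos = new LinkedHashMap<>();

    StageTimer(LongSupplier nanoClock) {
        this.nanoClock = nanoClock;
    }

    /** Runs the stage and adds its duration to that stage's total, even if it throws. */
    void measure(String stage, Runnable work) {
        long start = nanoClock.getAsLong();
        try {
            work.run();
        } finally {
            long spent = nanoClock.getAsLong() - start;
            Long previous = elapsedNanos.get(stage);
            elapsedNanos.put(stage, previous == null ? spent : previous + spent);
        }
    }

    /** Copy of the accumulated nanoseconds per stage. */
    Map<String, Long> snapshot() {
        return new LinkedHashMap<>(elapsedNanos);
    }

    /** One-line summary with whole milliseconds (fractions are truncated). */
    String report(String effectiveProvider, boolean fallbackOccurred) {
        StringBuilder sb = new StringBuilder();
        sb.append("provider=").append(effectiveProvider)
          .append(" fallback=").append(fallbackOccurred);
        for (Map.Entry<String, Long> e : elapsedNanos.entrySet()) {
            sb.append(' ').append(e.getKey()).append('=')
              .append(e.getValue() / 1_000_000L).append("ms");
        }
        return sb.toString();
    }
}
```

In an app, the timer could be constructed with `new StageTimer(System::nanoTime)` and each of download, session creation, pre-processing, inference, and post-processing wrapped in its own `measure` call.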

## 5) Model optimization

### INT8 quantization
- Advantage: major latency and memory reduction on CPU/NNAPI when well supported.
- Trade-off: can degrade accuracy; requires validation with real datasets.
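
To see where the accuracy loss comes from, the sketch below round-trips a value through a common affine INT8 scheme, `q = clamp(round(x / scale) + zeroPoint, -128, 127)`. It is illustrative only: `Int8Affine` is a hypothetical helper, the scale and zero point are made-up inputs, and real quantization tools may use a different rounding rule or per-channel parameters.

```java
/** Hypothetical helper showing the rounding and clamping error of INT8. */
final class Int8Affine {
    /** Maps a real value to int8: round to the nearest step, shift, then clamp. */
    static byte quantize(double x, double scale, int zeroPoint) {
        long q = Math.round(x / scale) + zeroPoint;
        return (byte) Math.max(-128L, Math.min(127L, q));
    }

    /** Maps an int8 value back to the real value it represents. */
    static double dequantize(byte q, double scale, int zeroPoint) {
        return (q - zeroPoint) * scale;
    }
}
```

With a scale of 0.1 and a zero point of 0, the value 0.26 is stored as 3 and read back as roughly 0.3, and anything above 12.7 saturates at 127. Errors of this kind accumulate across layers, which is why validation on real datasets is needed.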

### Float16 / mixed precision
- Advantage: possible gains on compatible accelerators.
- Trade-off: support varies by backend/device; gains are inconsistent on pure CPU.

### Graph optimization
- Apply in the export/conversion pipeline and keep ORT optimized at runtime.

### ORT format
- Converting ONNX to ORT format can reduce initialization cost and improve runtime.
- Trade-off: more complex build/deploy pipeline.
- This plugin can produce a Runtime-style `.ort` alongside a **reduced** (op-trimmed)
  `libonnxruntime.so` via the `cantoo-onnx-reduce` command — see
  [reduced-onnx.md](reduced-onnx.md). Android-only, opt-in.

### Removing unnecessary outputs
- Reduces transfer and post-processing.
- Recommended for classification scenarios where only final top-k is needed in the app.

### Input size reduction
- Direct impact on latency and memory.
- Trade-off: risk of accuracy loss; validate the latency vs quality curve.

## 6) Per-device benchmark strategy

### Comparison matrix
Run per device and per model:
- CPU single-thread
- CPU multi-thread (2, 4)
- NNAPI
- XNNPACK (if available in the build)

### Metrics
- average latency
- p95
- memory (peak and approximate resident set)
- failures (crash, session error, unexpected fallback)
- cold start vs warm start

### Minimum protocol
- 5 warmup runs
- 30-100 measured runs per scenario
- battery > 40% and controlled temperature when possible
- no debugger attached
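
The average and p95 figures can be derived from the raw per-run latencies once the warmup runs are discarded. The following is a minimal sketch, assuming latencies are collected in milliseconds in run order; `LatencyStats` and `summarize` are hypothetical names, and the nearest-rank definition of p95 is one of several reasonable choices.

```java
import java.util.Arrays;

/** Hypothetical summary of one benchmark scenario. */
final class LatencyStats {
    final double meanMs;
    final double p95Ms;

    private LatencyStats(double meanMs, double p95Ms) {
        this.meanMs = meanMs;
        this.p95Ms = p95Ms;
    }

    /** Drops the first warmupRuns samples, then summarizes the remaining runs. */
    static LatencyStats summarize(double[] samplesMs, int warmupRuns) {
        if (warmupRuns < 0 || samplesMs.length <= warmupRuns) {
            throw new IllegalArgumentException("need at least one measured run");
        }
        double[] measured = Arrays.copyOfRange(samplesMs, warmupRuns, samplesMs.length);
        double sum = 0;
        for (double s : measured) {
            sum += s;
        }
        Arrays.sort(measured);
        // Nearest-rank p95: rank = ceil(0.95 * n), computed in integer arithmetic.
        int rank = (95 * measured.length + 99) / 100;
        return new LatencyStats(sum / measured.length, measured[rank - 1]);
    }
}
```

Calling `LatencyStats.summarize(samples, 5)` on a scenario with 5 warmup runs followed by the measured runs matches the minimum protocol above. With only 30 measured runs, p95 rests on the two slowest samples, so larger run counts give a more stable tail estimate.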

## Safe recommendation for production

1. Production default: auto mode with safe CPU fallback.
2. Session created once in loadModel and reused by modelId+version; never recreate per inference.
3. Execution off the main thread (already aligned with the current plugin).
4. Start with:
   - graphOptimizationLevel high (ORT_ENABLE_ALL)
   - interOpNumThreads = 1
   - intraOpNumThreads = 2 or 4 (tunable)
5. Required stage telemetry (download/session/pre/inference/post) and effective provider.
6. Progressive NNAPI rollout by device tier/manufacturer (remote feature flag).
7. Continuous benchmarking and regression tracking by model version and app version.
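
Item 2 (one session per modelId+version) can be sketched as a small cache. This is an illustration rather than the plugin's actual code: `SessionCache`, `getOrCreate`, and `evict` are hypothetical names, the `@` separator in the key is an assumption that model ids do not contain that character, and the session type is left generic so that the sketch does not depend on ONNX Runtime.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical cache that creates each session at most once per modelId+version. */
final class SessionCache<S> {
    private final ConcurrentHashMap<String, S> sessions = new ConcurrentHashMap<>();

    private static String key(String modelId, String version) {
        return modelId + "@" + version; // assumes "@" never appears in modelId
    }

    /** Returns the cached session, invoking the loader only on the first request. */
    S getOrCreate(String modelId, String version, Supplier<S> loader) {
        return sessions.computeIfAbsent(key(modelId, version), k -> loader.get());
    }

    /** Removes and returns the session; the caller is responsible for closing it. */
    S evict(String modelId, String version) {
        return sessions.remove(key(modelId, version));
    }
}
```

Because `computeIfAbsent` runs the loader at most once per key, concurrent `loadModel` calls for the same model do not build duplicate sessions. The loader blocks other callers for that key while it runs, so it should be invoked off the main thread.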

## ONNX Runtime usage (practical checklist)

- Use ONNX Runtime for Android.
- Configure multithreading.
- Configure graph optimization.
- Evaluate execution providers available in the Android artifact.
- Consider NNAPI as the main hardware acceleration path.
- Implement safe CPU fallback.
- Create session in loadModel and reuse it for all inferences of the same modelId+version.
- Run inference off the main thread.

## Current plugin state

- Current flow: loadModel already handles file download/cache and initializes the ONNX session.
- Current reuse: classifyImage reuses the same session by modelId+version (without recreating per call).
- Practical implication: initialization cost is paid once during prepare/warmup, so each inference call only has to run the session.