name: Scalable Inference Serving Vocabulary
description: >-
  Normative vocabulary for the scalable ML model inference serving domain,
  covering model deployment, inference protocols, serving frameworks, hardware
  acceleration, batching strategies, and the operational concepts of modern
  MLOps and LLMOps workflows.
created: '2026-05-02'
modified: '2026-05-02'
tags:
  - AI
  - CNCF
  - Inference
  - Machine Learning
  - Model Serving
  - MLOps
terms:
  - term: Open Inference Protocol
    definition: >-
      A standardized REST and gRPC interface for model inference, also known as
      the KServe V2 Inference Protocol. Implemented by KServe, NVIDIA Triton,
      BentoML, TorchServe, and OpenVINO Model Server. Defines health, metadata,
      and inference endpoints for cross-framework interoperability.
    acronym: OIP
    fullName: Open Inference Protocol
    synonyms:
      - KServe V2 Protocol
      - V2 Inference Protocol
    related:
      - KServe
      - NVIDIA Triton Inference Server
      - InferenceService
  - term: KServe
    definition: >-
      A standardized, distributed generative and predictive AI inference
      platform for scalable, multi-framework model deployment on Kubernetes.
      Manages InferenceService custom resources, autoscaling, canary rollouts,
      model pipelines, and explainability. A CNCF incubating project since
      November 2025.
    related:
      - Open Inference Protocol
      - InferenceService
      - Kubernetes
  - term: InferenceService
    definition: >-
      A KServe Custom Resource Definition (CRD) that defines and manages the
      full lifecycle of a model serving deployment on Kubernetes, including the
      predictor, transformer, and explainer components, autoscaling, and
      routing configuration.
    related:
      - KServe
      - Custom Resource Definition
  - term: Tensor
    definition: >-
      A multi-dimensional array of numeric data and the fundamental unit of
      data exchange in ML inference systems. Tensors have a name, shape
      (dimensions), datatype, and data payload. The Open Inference Protocol
      uses tensors for all input and output data.
    related:
      - Tensor Shape
      - Tensor Datatype
      - Batch Inference
  - term: Tensor Shape
    definition: >-
      The dimensions of a tensor expressed as an array of integers. For
      example, [batch_size, sequence_length] = [1, 128] for a BERT input.
      Dynamic dimensions are represented as -1.
    related:
      - Tensor
      - Batch Size
  - term: Batch Inference
    definition: >-
      Processing multiple inference requests simultaneously in a single forward
      pass through the model, dramatically improving throughput and hardware
      utilization. The batch dimension is typically the first tensor dimension.
    synonyms:
      - Batching
    related:
      - Tensor
      - Adaptive Batching
      - Throughput
  - term: Adaptive Batching
    definition: >-
      Dynamically grouping concurrent inference requests into a batch based on
      latency and throughput targets. Requests arriving within a configurable
      window are batched together. The BentoML Runner implements adaptive
      batching with configurable max_latency and max_batch_size parameters.
    related:
      - Batch Inference
      - BentoML
  - term: BentoML
    definition: >-
      An open-source unified ML model serving framework. BentoML wraps model
      loading, custom pre/post-processing, and REST/gRPC API generation into a
      deployable artifact called a Bento. The Runner abstraction parallelizes
      inference with adaptive batching. Integrates with KServe for Kubernetes
      deployment.
    related:
      - Bento
      - Runner
      - KServe
  - term: Bento
    definition: >-
      A BentoML deployment artifact that bundles the model, service code,
      dependencies, and runtime configuration into an immutable, versioned unit
      that can be deployed as a Docker container or pushed to BentoCloud.
    related:
      - BentoML
      - Container Image
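# Illustrative sketch (not part of the normative vocabulary): a minimal KServe
# InferenceService manifest showing the predictor component described in the
# InferenceService and KServe entries above. The resource name, model format,
# and storageUri below are placeholders.
#
#   apiVersion: serving.kserve.io/v1beta1
#   kind: InferenceService
#   metadata:
#     name: sklearn-iris
#   spec:
#     predictor:
#       model:
#         modelFormat:
#           name: sklearn
#         storageUri: gs://example-bucket/models/sklearn/iris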
  - term: vLLM
    definition: >-
      A high-throughput, memory-efficient inference engine for Large Language
      Models that implements PagedAttention for efficient GPU KV cache
      management. Exposes an OpenAI-compatible REST API. In 2026, vLLM
      integrates with KServe via LLMInferenceService and llm-d for distributed
      LLM inference.
    related:
      - PagedAttention
      - LLMInferenceService
      - KServe
  - term: PagedAttention
    definition: >-
      A memory management algorithm in vLLM that stores the GPU KV cache in
      non-contiguous paged blocks (similar to OS virtual memory paging),
      dramatically reducing memory waste and enabling higher throughput and
      longer context windows.
    related:
      - vLLM
      - KV Cache
  - term: KV Cache
    definition: >-
      Key-Value Cache. Stores the attention key and value tensors from
      previously processed tokens to avoid recomputing them during
      autoregressive LLM generation. The primary memory bottleneck in LLM
      serving; managed efficiently by vLLM's PagedAttention.
    related:
      - PagedAttention
      - LLM
  - term: LLMInferenceService
    definition: >-
      A KServe CRD introduced in v0.16, designed specifically for Large
      Language Model workloads. Integrates with llm-d and vLLM for
      production-grade distributed LLM inference with KV cache offloading and
      multi-GPU support.
    related:
      - KServe
      - vLLM
      - llm-d
  - term: NVIDIA Triton Inference Server
    definition: >-
      NVIDIA's open-source inference serving software that implements the Open
      Inference Protocol. Supports TensorRT, ONNX, TensorFlow SavedModel,
      PyTorch TorchScript, and Python backends. Provides dynamic batching,
      model ensembles, and concurrent model execution on GPUs and CPUs.
    related:
      - Open Inference Protocol
      - TensorRT
      - Dynamic Batching
  - term: TensorRT
    definition: >-
      NVIDIA's SDK for high-performance deep learning inference. Optimizes
      trained models by fusing layers, quantizing precision (FP16, INT8), and
      generating CUDA kernels tailored to specific GPU hardware. Used with
      Triton for maximum GPU inference throughput.
    related:
      - NVIDIA Triton Inference Server
      - Quantization
  - term: Dynamic Batching
    definition: >-
      A Triton Inference Server feature that forms batches from multiple
      inference requests on the fly, without requiring the client to batch
      requests. The server collects requests within a configurable queue delay
      and batches them together.
    related:
      - NVIDIA Triton Inference Server
      - Batch Inference
  - term: MLflow
    definition: >-
      An open-source platform for the complete ML lifecycle, including
      experiment tracking, reproducibility, model packaging, and deployment.
      The MLflow Model Registry provides model versioning, stage transitions
      (Staging, Production), and annotations. Used with KServe for model
      lifecycle management.
    related:
      - Model Registry
      - Experiment Tracking
  - term: Model Registry
    definition: >-
      A centralized store for tracking trained model versions, their metadata,
      training parameters, evaluation metrics, and deployment artifacts.
      Enables governance and reproducibility in ML pipelines. MLflow, W&B, and
      Vertex AI all provide model registry capabilities.
    related:
      - MLflow
      - Model Versioning
  - term: Ray Serve
    definition: >-
      A scalable model serving library built on Ray. Enables composable
      deployments in which multiple models or processing stages are chained
      together. Supports HTTP ingress, autoscaling, request batching, and gRPC.
      Suitable for complex multi-model pipelines and heterogeneous ML
      workloads.
    related:
      - Ray
      - Model Pipeline
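# Illustrative sketch (not part of the normative vocabulary): deploying an LLM
# with KServe's Hugging Face serving runtime, which uses vLLM as its backend,
# tying together the vLLM, KServe, and InferenceService entries above. The
# resource name, model_id, and resource limits are placeholders; verify the
# exact runtime arguments against the KServe documentation.
#
#   apiVersion: serving.kserve.io/v1beta1
#   kind: InferenceService
#   metadata:
#     name: huggingface-llama
#   spec:
#     predictor:
#       model:
#         modelFormat:
#           name: huggingface
#         args:
#           - --model_name=llama
#           - --model_id=meta-llama/Meta-Llama-3-8B-Instruct
#         resources:
#           limits:
#             nvidia.com/gpu: "1"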
  - term: Canary Rollout
    definition: >-
      A deployment strategy for inference services in which a new model version
      receives a fraction of traffic while the old version handles the rest.
      KServe supports canary rollouts via the InferenceService
      canaryTrafficPercent field, allowing gradual validation of new model
      versions in production.
    related:
      - InferenceService
      - A/B Testing
  - term: Model Pipeline
    definition: >-
      A sequence of processing steps (preprocessing, inference, postprocessing,
      ensemble) composed into a single serving endpoint. KServe supports
      pipelines via the InferenceGraph resource, which chains InferenceServices
      into a single graph.
    related:
      - InferenceService
      - Ray Serve
  - term: Quantization
    definition: >-
      Reducing the precision of model weights and activations (e.g., FP32 to
      INT8 or FP16) to reduce model size and increase inference throughput.
      Commonly used with TensorRT, ONNX Runtime, and vLLM AWQ/GPTQ quantization
      for LLMs.
    related:
      - TensorRT
      - vLLM
categories:
  - name: Inference Protocols
    terms:
      - Open Inference Protocol
      - Tensor
      - Tensor Shape
  - name: Serving Frameworks
    terms:
      - KServe
      - BentoML
      - vLLM
      - NVIDIA Triton Inference Server
      - Ray Serve
  - name: LLM Serving
    terms:
      - PagedAttention
      - KV Cache
      - LLMInferenceService
      - Quantization
  - name: Batching
    terms:
      - Batch Inference
      - Adaptive Batching
      - Dynamic Batching
  - name: Model Lifecycle
    terms:
      - Model Registry
      - MLflow
      - Canary Rollout
      - Model Pipeline
  - name: Kubernetes Resources
    terms:
      - InferenceService
      - Custom Resource Definition
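# Illustrative sketch (not part of the normative vocabulary): a canary rollout
# as described in the Canary Rollout entry above. Setting canaryTrafficPercent
# on the predictor routes 10% of traffic to the newly applied model revision
# while the previously deployed revision keeps the remaining 90%. The resource
# name and storageUri are placeholders.
#
#   apiVersion: serving.kserve.io/v1beta1
#   kind: InferenceService
#   metadata:
#     name: sklearn-iris
#   spec:
#     predictor:
#       canaryTrafficPercent: 10
#       model:
#         modelFormat:
#           name: sklearn
#         storageUri: gs://example-bucket/models/sklearn/iris-v2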