# PhotonInfer
**A High-Performance LLM Inference Engine with vLLM-Style Continuous Batching**
[English](README.md) | [δΈζ](README_ZH.md) | [Live Demo](https://photoninfer.xyz/)
[](LICENSE)
[](https://developer.nvidia.com/cuda-toolkit)
[](https://en.cppreference.com/w/cpp/20)
---
## π Performance Highlights
PhotonInfer delivers **production-grade inference performance** for LLMs with advanced batching capabilities. **Supports Llama-3.2 and Qwen3 models**.
### Single Request Inference
| Model | PhotonInfer | llama.cpp | Speedup |
|-------|-------------|-----------|---------|
| Llama 3.2 1B | 185 tok/s | 252 tok/s | 0.73Γ (llama.cpp faster) |
**TTFT (Time To First Token)**: 387ms @ 100-token prompt (INT8 quantization)
### Batched Inference Throughput
| Batch Size | PhotonInfer | llama.cpp | Speedup |
|------------|-------------|-----------|---------|
| 4 | 410 tok/s | 252 tok/s | **1.63Γ** |
| 8 | 720 tok/s | 255 tok/s | **2.82Γ** |
| 16 | 787 tok/s | 253 tok/s | **3.07Γ** |
**Tested on**: NVIDIA A100, Llama 3.2 1B, Q8/INT8 quantization
## β¨ Key Features
### π― **vLLM-Style Continuous Batching**
- **Token-level dynamic scheduling**: Add new requests mid-generation without waiting for batch completion
- **Two-phase scheduler**: Seamlessly continue running requests while admitting new ones
- **Request state tracking**: Precise `num_computed_tokens` management for efficient resume
- **Perfect for production**: High-concurrency inference services with real-time responsiveness
### β‘ **GPU-Optimized Kernels**
- **Batched Paged Attention**: Block-level KV cache management with efficient memory utilization
- **Vectorized Memory Access**: `float4` loads for 2-4Γ bandwidth efficiency
- **Fused Operations**: Zero-copy GPU sampling, batched RoPE, and fused normalization
- **INT8 Quantization**: Group-wise quantization with cuBLASLt INT8 GEMM support
- **Optimized Softmax**: CUB BlockReduce for numerically stable attention computation
### ποΈ **Modern C++20 Architecture**
- **Type-Safe Error Handling**: Rust-inspired `Result` type for explicit error propagation
- **Zero-Copy Design**: Extensive use of `std::span` and move semantics
- **Device Agnostic**: Unified interface for CPU and CUDA backends
- **Concepts & Ranges**: Compile-time constraints and expressive type safety
## π Quick Start
### Prerequisites
- **Compiler**: GCC 12+ (C++20 support required)
- **CMake**: 3.20+
- **CUDA Toolkit**: 12.0+ (tested on 12.5)
- **GPU**: NVIDIA GPU with Compute Capability 7.0+
### Download Model
Download a pre-quantized model to get started quickly:
https://huggingface.co/Lummy666/llama-3.2-1B-Instruct
### Build
#### Option 1: Build from Source
```bash
# Clone repository
cd photon_infer
# Configure with CUDA
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release -DPHOTON_BUILD_CUDA=ON ..
# Build
cmake --build . -j$(nproc)
# Install (optional)
sudo cmake --install .
```
After installation, you can run the web server directly from anywhere:
```bash
photon_web_server \
--port 5728 \
--model /path/to/llama-3.2-1B-Instruct \
--tokenizer /path/to/llama-3.2-1B-Instruct/tokenizer.json
```
The installation will place:
- `photon_web_server` β `/usr/local/bin/`
- Static web files β `/photon_infer/web/static/`
- Core library β `/usr/local/lib/`
To uninstall:
```bash
cd build
sudo cmake --build . --target uninstall
```
#### Option 2: Use Docker (Recommended)
```bash
# Pull the pre-built Docker image
docker pull lumia431/photon_infer:latest
# Run the container with GPU support
docker run --rm --gpus all -p 5728:5728 -e PORT=5728 lumia431/photon_infer:latest
```
The web interface will be available at `http://localhost:5728`
## π¬ Technical Details
### INT8 Quantization
- **Group-wise quantization**: Configurable group size (32, 64, 128)
- **cuBLASLt integration**: Hardware-accelerated INT8 GEMM
- **Minimal accuracy loss**: < 1% perplexity degradation on Llama models
### Paged Attention
- **Block-level KV cache**: Efficient memory allocation without fragmentation
- **Dynamic sequence management**: Per-sequence cache offsets for flexible scheduling
- **Batched cache operations**: Single kernel for multi-sequence K/V writes
### Continuous Batching Scheduler
- **Two-phase scheduling**:
1. **Phase 1**: Continue all RUNNING requests (no interruption)
2. **Phase 2**: Admit WAITING requests to fill remaining capacity
- **Request states**: WAITING β RUNNING β FINISHED (with PREEMPTED support)
- **Token-level granularity**: `num_computed_tokens` tracking for precise resume
## π£οΈ Roadmap
- [x] **Core Infrastructure**: Tensor, operators, memory management
- [x] **LLaMA Model**: Full transformer implementation with CPU/GPU kernels
- [x] **INT8 Quantization**: Group-wise quantization with cuBLASLt
- [x] **Paged Attention**: Block-level KV cache management
- [x] **Continuous Batching**: vLLM-style dynamic request scheduling
- [ ] **Flash Attention 2**: IO-aware attention for long sequences
- [ ] **Multi-GPU Support**: Tensor parallelism for large models
- [ ] **FP16/BF16 Mixed Precision**: Enhanced throughput on modern GPUs
- [ ] **Speculative Decoding**: Multi-token generation with draft model
## π Documentation
- [Continuous Batching Design](docs/continuous_batching.md)
- [Performance Optimization Guide](docs/performance.md)
- [API Reference](docs/api.md)
## π€ Contributing
Contributions welcome! Please see [CONTRIBUTING.md](docs/contributing.md) for guidelines.
## π License
MIT License - see [LICENSE](LICENSE) for details.
## π Acknowledgments
- Architecture inspired by [vLLM](https://github.com/vllm-project/vllm)
- Kernel optimizations reference [llama.cpp](https://github.com/ggerganov/llama.cpp)
- Error handling design from Rust's `Result`
---
**Built with β€οΈ for high-performance LLM inference**