(fractional-gpu-guide)=
# Fractional GPU serving

Serve multiple small models on the same GPU for cost-efficient deployments.

:::{note}
This feature hasn't been extensively tested in production. If you encounter any issues, report them on [GitHub](https://github.com/ray-project/ray/issues) with reproducible code.
:::

Fractional GPU allocation allows you to run multiple model replicas on a single GPU by customizing placement groups. This approach maximizes GPU utilization and reduces costs when serving small models that don't require a full GPU's resources.

## When to use fractional GPUs

Consider fractional GPU allocation when:

- You're serving small models with low concurrency that don't require a full GPU for model weights and KV cache.
- You have multiple models that fit this profile.

## Deploy with fractional GPU allocation

The following example shows how to serve 8 replicas of a small model on 4 L4 GPUs (2 replicas per GPU):

```python
from ray.serve.llm import LLMConfig, ModelLoadingConfig
from ray.serve.llm import build_openai_app
from ray import serve


llm_config = LLMConfig(
    model_loading_config=ModelLoadingConfig(
        model_id="HuggingFaceTB/SmolVLM-256M-Instruct",
    ),
    engine_kwargs=dict(
        gpu_memory_utilization=0.4,
        use_tqdm_on_load=False,
        enforce_eager=True,
        max_model_len=2048,
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=8, max_replicas=8,
        )
    ),
    accelerator_type="L4",
    placement_group_config=dict(bundles=[dict(GPU=0.49)]),
    runtime_env=dict(
        env_vars={
            "VLLM_DISABLE_COMPILE_CACHE": "1",
        },
    ),
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)
```

## Configuration parameters

Use the following parameters to configure fractional GPU allocation. The placement group defines the GPU share, and Ray Serve infers the matching `VLLM_RAY_PER_WORKER_GPUS` value for you. The memory management and performance settings are vLLM-specific optimizations that you can adjust based on your model and workload requirements.

### Placement group configuration

- `placement_group_config`: Specifies the GPU fraction each replica uses. Set `GPU` to the fraction (for example, `0.49` for approximately half a GPU). Use slightly less than the theoretical fraction to account for system overhead—this headroom prevents out-of-memory errors.
- `VLLM_RAY_PER_WORKER_GPUS`: Ray Serve derives this from `placement_group_config` when GPU bundles are fractional. Setting it manually is allowed but not recommended.

### Memory management

- `gpu_memory_utilization`: Controls how much GPU memory vLLM pre-allocates. vLLM allocates memory based on this setting regardless of Ray's GPU scheduling. In the example, `0.4` means vLLM targets 40% of GPU memory for the model, KV cache, and CUDAGraph memory.

### Performance settings

- `enforce_eager`: Set to `True` to disable CUDA graphs and reduce memory overhead.
- `max_model_len`: Limits the maximum sequence length, reducing memory requirements.
- `use_tqdm_on_load`: Set to `False` to disable progress bars during model loading.

### Workarounds

- `VLLM_DISABLE_COMPILE_CACHE`: Set to `1` to avoid a [resource contention issue](https://github.com/vllm-project/vllm/issues/24601) among workers during torch compile caching.

## Best practices

### Calculate GPU allocation

- **Leave headroom**: Use slightly less than the theoretical fraction (for example, `0.49` instead of `0.5`) to account for system overhead.
- **Match memory to workload**: Ensure `gpu_memory_utilization` × GPU memory × number of replicas per GPU doesn't exceed total GPU memory.
- **Account for all memory**: Consider model weights, KV cache, CUDA graphs, and framework overhead.

### Optimize for your models

- **Test memory requirements**: Profile your model's actual memory usage before setting `gpu_memory_utilization`. This information often gets printed as part of the vLLM initialization.
- **Start conservative**: Begin with fewer replicas per GPU and increase gradually while monitoring memory usage.
- **Monitor OOM errors**: Watch for out-of-memory errors that indicate you need to reduce replicas or lower `gpu_memory_utilization`.

### Production considerations

- **Validate performance**: Test throughput and latency with your actual workload before production deployment.
- **Consider autoscaling carefully**: Fractional GPU deployments work best with fixed replica counts rather than autoscaling.

## Troubleshooting

### Out of memory errors

- Reduce `gpu_memory_utilization` (for example, from `0.4` to `0.3`)
- Decrease the number of replicas per GPU
- Lower `max_model_len` to reduce KV cache size
- Enable `enforce_eager=True` if not already set to ensure CUDA graph memory requirements don't cause issues

### Replicas fail to start

- Verify that your fractional allocation matches your replica count (for example, 2 replicas with `GPU=0.49` each)
- Confirm that `placement_group_config` matches the share you expect Ray to reserve
- If you override `VLLM_RAY_PER_WORKER_GPUS` (not recommended) ensure it matches the GPU share from the placement group
- Ensure your model size is appropriate for fractional GPU allocation

### Resource contention issues

- Ensure `VLLM_DISABLE_COMPILE_CACHE=1` is set to avoid torch compile caching conflicts
- Check Ray logs for resource allocation errors
- Verify placement group configuration is applied correctly

## See also

- {doc}`Quickstart <../quick-start>` - Basic LLM deployment examples
- [Ray placement groups](https://docs.ray.io/en/latest/ray-core/scheduling/placement-group.html) - Ray Core placement group documentation