# gpu collector

The gpu collector exposes metrics about GPU usage and memory consumption, both at the adapter (physical GPU) and
per-process level.

|                     |                                      |
|---------------------|--------------------------------------|
| Metric name prefix  | `gpu`                                |
| Data source         | Perflib                              |
| Counters            | GPU Engine, GPU Adapter, GPU Process |
| Enabled by default? | No                                   |

## Flags

None

## Metrics

These metrics are available on supported versions of Windows with compatible GPUs and drivers:

### Adapter-level Metrics

| Name                                             | Description                                                                        | Type  | Labels                                                          |
|--------------------------------------------------|------------------------------------------------------------------------------------|-------|-----------------------------------------------------------------|
| `windows_gpu_info`                               | A metric with a constant '1' value labeled with gpu device information.            | gauge | `bus_number`,`device_id`,`function_number`,`luid`,`name`,`phys` |
| `windows_gpu_dedicated_system_memory_size_bytes` | The size, in bytes, of memory that is dedicated from system memory.                | gauge | `device_id`,`luid`                                              |
| `windows_gpu_dedicated_video_memory_size_bytes`  | The size, in bytes, of memory that is dedicated from video memory.                 | gauge | `device_id`,`luid`                                              |
| `windows_gpu_shared_system_memory_size_bytes`    | The size, in bytes, of memory from system memory that can be shared by many users. | gauge | `device_id`,`luid`                                              |
| `windows_gpu_adapter_memory_committed_bytes`     | Total committed GPU memory in bytes per physical GPU                               | gauge | `device_id`,`luid`,`phys`                                       |
| `windows_gpu_adapter_memory_dedicated_bytes`     | Dedicated GPU memory usage in bytes per physical GPU                               | gauge | `device_id`,`luid`,`phys`                                       |
| `windows_gpu_adapter_memory_shared_bytes`        | Shared GPU memory usage in bytes per physical GPU                                  | gauge | `device_id`,`luid`,`phys`                                       |
| `windows_gpu_local_adapter_memory_bytes`         | Local adapter memory usage in bytes per physical GPU                               | gauge | `device_id`,`luid`,`phys`,`part`                                |
| `windows_gpu_non_local_adapter_memory_bytes`     | Non-local adapter memory usage in bytes per physical GPU                           | gauge | `device_id`,`luid`,`phys`,`part`                                |

### Per-process Metrics

| Name                                         | Description                                     | Type    | Labels                                                    |
|----------------------------------------------|-------------------------------------------------|---------|-----------------------------------------------------------|
| `windows_gpu_engine_time_seconds`            | Total running time of the GPU engine in seconds | counter | `device_id`,`luid`,`phys`, `eng`, `engtype`, `process_id` |
| `windows_gpu_process_memory_committed_bytes` | Total committed GPU memory in bytes per process | gauge   | `device_id`,`luid`,`phys`,`process_id`                    |
| `windows_gpu_process_memory_dedicated_bytes` | Dedicated GPU memory usage in bytes per process | gauge   | `device_id`,`luid`,`phys`,`process_id`                    |
| `windows_gpu_process_memory_local_bytes`     | Local GPU memory usage in bytes per process     | gauge   | `device_id`,`luid`,`phys`,`process_id`                    |
| `windows_gpu_process_memory_non_local_bytes` | Non-local GPU memory usage in bytes per process | gauge   | `device_id`,`luid`,`phys`,`process_id`                    |
| `windows_gpu_process_memory_shared_bytes`    | Shared GPU memory usage in bytes per process    | gauge   | `device_id`,`luid`,`phys`,`process_id`                    |

## Metric Labels

* `luid`,`phys`: Physical GPU index (e.g., "0")
* `eng`: GPU engine index (e.g., "0", "1", ...)
* `engtype`: GPU engine type (e.g., "3D", "Copy", "VideoDecode", etc.)
* `process_id`: Process ID

## Example Metric

These are basic queries to help you get started with GPU monitoring on Windows using Prometheus.

**Show GPU information for a specific physical GPU (0):**

```promql
windows_gpu_info{bus_number="8",device_id="PCI\\VEN_10DE&DEV_1B81&SUBSYS_61733842&REV_A1",function_number="0",luid="0x00000000_0x00010F8A",name="NVIDIA GeForce GTX 1070",phys="0"} 1
```

**Show total dedicated GPU memory (in bytes) usage on GPU 0:**

```promql
windows_gpu_adapter_memory_dedicated_bytes{phys="0"}
```

**Aggregate GPU utilization across all processes for a physical GPU (3D engine):**

```promql
sum by (phys) (
  rate(windows_gpu_engine_time_seconds{phys="0", engtype="3D"}[1m])
) * 100
```

**Show GPU utilization for a specific process (3D engine):**

```promql
sum by (phys, process_id) (
  rate(windows_gpu_engine_time_seconds{process_id="1234", engtype="3D"}[1m])
) * 100
```

**Show dedicated GPU memory per process:**

```promql
windows_gpu_adapter_memory_dedicated_bytes
```

## Useful Queries

**Show top 5 processes by GPU utilization (all engines):**

```promql
topk(5, sum by (process_id) (
  rate(windows_gpu_engine_time_seconds[1m])
) * 100)
```

**Show GPU memory usage per physical GPU:**

```promql
sum by (phys) (
  windows_gpu_adapter_memory_dedicated_bytes
)
```

Show GPU engine time with process owner and command line:

```promql
windows_gpu_engine_time_seconds * on(process_id) group_left(owner, cmdline) windows_process_info
```

## Alerting Examples

**prometheus.rules**

```yaml
# Alert on processes using more than 80% of a GPU's capacity over 10 minutes
- alert: HighGpuUtilization
  expr: |
    sum by (process_id) (
      rate(windows_gpu_engine_time_seconds[1m])
    ) * 100 > 80
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "High GPU Utilization (process {{ $labels.process_id }})"
    description: "Process is using more than 80% of GPU resources\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
```

## Notes

* Per-process metrics allow you to identify which processes are consuming GPU resources.
* Adapter-level metrics provide an overview of total GPU memory usage.
* For overall GPU utilization, aggregate per-process metrics in Prometheus using queries such as `sum()`.
* The collector relies on Windows performance counters; ensure your system and drivers support these counters.

## Enabling the Collector

To enable the GPU collector, add `gpu` to the list of enabled collectors in your windows_exporter configuration.

Example (command line):

```shell
windows_exporter.exe --collectors.enabled=gpu
```