# Claude Code Assistant Guidelines

## Go Code Style

- Follow the standard Go code style and conventions. Use `gofmt` for formatting and adhere to idiomatic Go practices.
- Follow best practices from the [Effective Go](https://go.dev/doc/effective_go) guide:

### Naming Conventions
- Use **MixedCaps** or **mixedCaps** rather than underscores for multi-word names
- Package names should be short, lowercase, single-word names
- Getters don't use "Get" prefix (use `obj.Name()` not `obj.GetName()`)
- Interface names use "-er" suffix for single-method interfaces (e.g., `Reader`, `Writer`)

### Formatting
- Use `gofmt` for consistent formatting (tabs for indentation, spaces for alignment)
- Line length: no strict limit, but keep lines reasonable
- Group related declarations together

### Error Handling
- Return errors as the last return value
- Check errors immediately after the call
- Provide context with `fmt.Errorf` and error wrapping

### Logging
- Use `ctrl.Log` for structured logging
- Keep log fields consistent and meaningful
- Avoid logging sensitive data

### Documentation
- Every exported name should have a doc comment
- Start comments with the name being described
- Use complete sentences

### Concurrency
- Share memory by communicating; don't communicate by sharing memory
- Use channels to orchestrate goroutines
- Always handle goroutine cleanup and cancellation properly

### Project Structure
- Keep packages focused and cohesive
- Avoid circular dependencies
- Place tests in `*_test.go` files

## Documentation

Prefer placing documentation in the `docs/` directory.

There are 3 main types of documentation targeting different audiences:

1. **Developer Documentation** - For contributors and maintainers of this project
   - Architecture decisions
   - Development setup and workflow
   - Contributing guidelines
   - usually in the `docs/developer-guide/` subdirectory

2. **Administrator Documentation** - For operators deploying and managing the autoscaler controller
   - Installation and configuration
   - Deployment guidelines
   - Monitoring and troubleshooting
   - usually located under the `docs/user-guide/` directory (for example, in an admin-focused subdirectory)

3. **End-User Documentation** - For application developers creating applications that use the autoscaler
   - Usage guides and examples
   - API reference
   - Best practices and common patterns
   - usually located under the `docs/user-guide/` directory (for example, in an end-user-focused subdirectory)

## E2E Testing

- use make targets for running e2e tests (e.g., `make test-e2e-smoke` or `make test-e2e-full`) and document the process in `docs/developer-guide/testing.md`
- use `make test` for unit tests
- **Never use images from docker.io in e2e tests.** All container images must use fully-qualified registry paths (e.g., `registry.k8s.io/`, `quay.io/`, or a private registry). Do not rely on Docker Hub as a default registry.

## CLI Tools

### llm-d Inference Scheduler EPP CLI Reference

This section documents the command-line flags and environment variables supported by the llm-d inference scheduler EPP (Endpoint Picker). The EPP inherits its CLI from `gateway-api-inference-extension`.

#### Main Branch (Latest)

Uses `gateway-api-inference-extension` at commit `fd30cb97714a` (post-v1.3.0).

##### Command-Line Flags

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `--grpc-port` | int | `9002` | gRPC port used for communicating with Envoy proxy |
| `--ha-enable-leader-election` | bool | `false` | Enables leader election for high availability. When enabled, readiness probes will only pass on the leader |
| `--pool-group` | string | `inference.networking.k8s.io` | Kubernetes resource group of the InferencePool this Endpoint Picker is associated with |
| `--pool-namespace` | string | `""` | Namespace of the InferencePool this Endpoint Picker is associated with |
| `--pool-name` | string | `""` | Name of the InferencePool this Endpoint Picker is associated with |
| `--endpoint-selector` | string | `""` | Selector to filter model server pods on, only 'key=value' pairs are supported. Format: comma-separated list of key=value pairs (e.g., 'app=vllm-llama3-8b-instruct,env=prod') |
| `--endpoint-target-ports` | []int | `[]` | Target ports of model server pods. Format: comma-separated list of numbers (e.g., '3000,3001,3002') |
| `--disable-endpoint-subset-filter` | bool | `false` | Disables respecting the x-gateway-destination-endpoint-subset metadata for dispatching requests in EPP |
| `--model-server-metrics-scheme` | string | `http` | Protocol scheme used in scraping metrics from endpoints |
| `--model-server-metrics-path` | string | `/metrics` | URL path used in scraping metrics from endpoints |
| `--model-server-metrics-port` | int | `0` | **DEPRECATED**: Port to scrape metrics from endpoints |
| `--model-server-metrics-https-insecure-skip-verify` | bool | `true` | Disable certificate verification when using 'https' scheme for model-server-metrics-scheme |
| `--refresh-metrics-interval` | duration | `50ms` | Interval to refresh metrics |
| `--refresh-prometheus-metrics-interval` | duration | `5s` | Interval to flush Prometheus metrics |
| `--metrics-staleness-threshold` | duration | `2s` | Duration after which metrics are considered stale |
| `--total-queued-requests-metric` | string | `vllm:num_requests_waiting` | **DEPRECATED**: Use engineConfigs in EndpointPickerConfig instead |
| `--total-running-requests-metric` | string | `vllm:num_requests_running` | **DEPRECATED**: Use engineConfigs in EndpointPickerConfig instead |
| `--kv-cache-usage-percentage-metric` | string | `vllm:kv_cache_usage_perc` | **DEPRECATED**: Use engineConfigs in EndpointPickerConfig instead |
| `--lora-info-metric` | string | `vllm:lora_requests_info` | **DEPRECATED**: Use engineConfigs in EndpointPickerConfig instead |
| `--cache-info-metric` | string | `vllm:cache_config_info` | **DEPRECATED**: Use engineConfigs in EndpointPickerConfig instead |
| `-v`, `--v` | int | `0` | Number for the log level verbosity |
| `--zap-log-level` | string | | Zap log level (debug, info, warn, error) |
| `--zap-devel` | bool | `true` | Development Mode defaults (encoder=consoleEncoder,logLevel=Debug,stackTraceLevel=Warn) |
| `--zap-encoder` | string | | Zap log encoding ('json' or 'console') |
| `--zap-stacktrace-level` | string | | Zap Level at and above which stacktraces are captured |
| `--tracing` | bool | `true` | Enables emitting traces |
| `--health-checking` | bool | `false` | Enables health checking |
| `--metrics-port` | int | `9090` | The metrics port exposed by EPP |
| `--grpc-health-port` | int | `9003` | The port used for gRPC liveness and readiness probes |
| `--enable-pprof` | bool | `true` | Enables pprof handlers |
| `--cert-path` | string | `""` | The path to the certificate for secure serving. Certificate and private key files are assumed to be named tls.crt and tls.key |
| `--enable-cert-reload` | bool | `false` | Enables certificate reloading of the certificates specified in --cert-path |
| `--secure-serving` | bool | `true` | Enables secure serving |
| `--metrics-endpoint-auth` | bool | `true` | Enables authentication and authorization of the metrics endpoint |
| `--config-file` | string | `""` | The path to the configuration file |
| `--config-text` | string | `""` | The configuration specified as text, in lieu of a file |

##### Environment Variables

| Variable | Description | Deprecation |
|----------|-------------|-------------|
| `NAMESPACE` | Used to determine pool namespace when `--pool-namespace` is not set | - |
| `POD_NAME` | Used to determine EPP name when using `--endpoint-selector` mode | - |
| `ENABLE_EXPERIMENTAL_DATALAYER_V2` | Enables experimental pluggable data layer | **DEPRECATED**: Use FeatureGates in config file instead |
| `ENABLE_EXPERIMENTAL_FLOW_CONTROL_LAYER` | Enables experimental pluggable flow control layer | **DEPRECATED**: Use FeatureGates in config file instead |
| `SD_QUEUE_DEPTH_THRESHOLD` | Saturation detector queue depth threshold | **DEPRECATED**: Use config file instead |
| `SD_KV_CACHE_UTIL_THRESHOLD` | Saturation detector KV cache utilization threshold | **DEPRECATED**: Use config file instead |
| `SD_METRICS_STALENESS_THRESHOLD` | Saturation detector metrics staleness threshold | **DEPRECATED**: Use config file instead |

---

##### v0.5.0

Uses `gateway-api-inference-extension v1.3.0`.

##### Command-Line Flags

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `--grpc-port` | int | `9002` | gRPC port used for communicating with Envoy proxy |
| `--ha-enable-leader-election` | bool | `false` | Enables leader election for high availability. When enabled, readiness probes will only pass on the leader |
| `--pool-group` | string | `inference.networking.k8s.io` | Kubernetes resource group of the InferencePool this Endpoint Picker is associated with |
| `--pool-namespace` | string | `""` | Namespace of the InferencePool this Endpoint Picker is associated with |
| `--pool-name` | string | `""` | Name of the InferencePool this Endpoint Picker is associated with |
| `--endpoint-selector` | string | `""` | Selector to filter model server pods on, only 'key=value' pairs are supported. Format: comma-separated list of key=value pairs (e.g., 'app=vllm-llama3-8b-instruct,env=prod') |
| `--endpoint-target-ports` | []int | `[]` | Target ports of model server pods. Format: comma-separated list of numbers (e.g., '3000,3001,3002') |
| `--disable-endpoint-subset-filter` | bool | `false` | Disables respecting the x-gateway-destination-endpoint-subset metadata for dispatching requests in EPP |
| `--model-server-metrics-scheme` | string | `http` | Protocol scheme used in scraping metrics from endpoints |
| `--model-server-metrics-path` | string | `/metrics` | URL path used in scraping metrics from endpoints |
| `--model-server-metrics-port` | int | `0` | **DEPRECATED**: Port to scrape metrics from endpoints. Set to InferencePool.Spec.TargetPorts[0].Number if not defined |
| `--model-server-metrics-https-insecure-skip-verify` | bool | `true` | Disable certificate verification when using 'https' scheme for model-server-metrics-scheme |
| `--refresh-metrics-interval` | duration | `50ms` | Interval to refresh metrics |
| `--refresh-prometheus-metrics-interval` | duration | `5s` | Interval to flush Prometheus metrics |
| `--metrics-staleness-threshold` | duration | `2s` | Duration after which metrics are considered stale |
| `--total-queued-requests-metric` | string | `vllm:num_requests_waiting` | Prometheus metric for the number of queued requests |
| `--total-running-requests-metric` | string | `vllm:num_requests_running` | Prometheus metric for the number of running requests |
| `--kv-cache-usage-percentage-metric` | string | `vllm:kv_cache_usage_perc` | Prometheus metric for the fraction of KV-cache blocks currently in use (from 0 to 1) |
| `--lora-info-metric` | string | `vllm:lora_requests_info` | Prometheus metric for the LoRA info metrics (must be in vLLM label format) |
| `--cache-info-metric` | string | `vllm:cache_config_info` | Prometheus metric for the cache info metrics |
| `-v`, `--v` | int | `0` | Number for the log level verbosity |
| `--zap-log-level` | string | | Zap log level (debug, info, warn, error) |
| `--zap-devel` | bool | `true` | Development Mode defaults (encoder=consoleEncoder,logLevel=Debug,stackTraceLevel=Warn) |
| `--zap-encoder` | string | | Zap log encoding ('json' or 'console') |
| `--zap-stacktrace-level` | string | | Zap Level at and above which stacktraces are captured |
| `--tracing` | bool | `true` | Enables emitting traces |
| `--health-checking` | bool | `false` | Enables health checking |
| `--metrics-port` | int | `9090` | The metrics port exposed by EPP |
| `--grpc-health-port` | int | `9003` | The port used for gRPC liveness and readiness probes |
| `--enable-pprof` | bool | `true` | Enables pprof handlers |
| `--cert-path` | string | `""` | The path to the certificate for secure serving. Certificate and private key files are assumed to be named tls.crt and tls.key |
| `--enable-cert-reload` | bool | `false` | Enables certificate reloading of the certificates specified in --cert-path |
| `--secure-serving` | bool | `true` | Enables secure serving |
| `--metrics-endpoint-auth` | bool | `true` | Enables authentication and authorization of the metrics endpoint |
| `--config-file` | string | `""` | The path to the configuration file |
| `--config-text` | string | `""` | The configuration specified as text, in lieu of a file |

##### Environment Variables

| Variable | Description | Deprecation |
|----------|-------------|-------------|
| `NAMESPACE` | Used to determine pool namespace when `--pool-namespace` is not set | - |
| `POD_NAME` | Used to determine EPP name when using `--endpoint-selector` mode | - |
| `ENABLE_EXPERIMENTAL_DATALAYER_V2` | Enables experimental pluggable data layer | **DEPRECATED**: Use FeatureGates in config file instead |
| `ENABLE_EXPERIMENTAL_FLOW_CONTROL_LAYER` | Enables experimental pluggable flow control layer | **DEPRECATED**: Use FeatureGates in config file instead |
| `SD_QUEUE_DEPTH_THRESHOLD` | Saturation detector queue depth threshold | **DEPRECATED**: Use config file instead |
| `SD_KV_CACHE_UTIL_THRESHOLD` | Saturation detector KV cache utilization threshold | **DEPRECATED**: Use config file instead |
| `SD_METRICS_STALENESS_THRESHOLD` | Saturation detector metrics staleness threshold | **DEPRECATED**: Use config file instead |

#### Key Differences Between Main and v0.5.0

1. **Metric Flags**: In main branch, `--total-queued-requests-metric`, `--total-running-requests-metric`, `--kv-cache-usage-percentage-metric`, `--lora-info-metric`, and `--cache-info-metric` are deprecated and will error if explicitly set. In v0.5.0, these flags are functional.

2. **Configuration**: Main branch encourages using `EndpointPickerConfig` with `engineConfigs` for metrics configuration instead of CLI flags.

---

### llm-d Inference Simulator CLI Reference

This section documents the command-line flags and environment variables supported by the llm-d inference simulator (`llm-d-inference-sim`). The simulator is a vLLM server simulator supporting OpenAI API endpoints.

#### Main Branch (Latest)

##### Command-Line Flags

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `--config` | string | `""` | Path to a YAML configuration file. Command line values overwrite config file values |
| `--port` | int | `8000` | Port on which the simulator runs |
| `--model` | string | `""` | Currently 'loaded' model name (required) |
| `--served-model-name` | []string | `[]` | Model names exposed by the API (space-separated strings). Falls back to `--model` if not set |
| `--max-num-seqs` | int | `5` | Maximum number of inference requests that could be processed at the same time |
| `--max-waiting-queue-length` | int | `1000` | Maximum length of inference requests waiting queue |
| `--max-loras` | int | `1` | Maximum number of LoRAs in a single batch |
| `--max-cpu-loras` | int | (same as `--max-loras`) | Maximum number of LoRAs to store in CPU memory |
| `--max-model-len` | int | `1024` | Model's context window, maximum number of tokens in a single request including input and output |
| `--lora-modules` | []string | `[]` | List of LoRA adapters (space-separated JSON strings) |
| `--mode` | string | `random` | Simulator mode: `echo` returns input text; `random` returns random pre-defined sentences |
| `--seed` | int64 | (current Unix nano) | Random seed for operations |
| `--time-to-first-token` | duration | `0` | Time to first token (e.g., "100ms"). Integer format (milliseconds) is deprecated |
| `--time-to-first-token-std-dev` | duration | `0` | Standard deviation for time to first token (max 30% of TTFT) |
| `--inter-token-latency` | duration | `0` | Time to generate one token (e.g., "100ms"). Integer format is deprecated |
| `--inter-token-latency-std-dev` | duration | `0` | Standard deviation for inter-token latency (max 30% of ITL) |
| `--prefill-overhead` | duration | `0` | Time to prefill. Ignored if `--time-to-first-token` is set |
| `--prefill-time-per-token` | duration | `0` | Time to prefill per token |
| `--prefill-time-std-dev` | duration | `0` | Standard deviation for prefill time |
| `--kv-cache-transfer-latency` | duration | `0` | Time for KV-cache transfer from a remote vLLM (P/D mode) |
| `--kv-cache-transfer-latency-std-dev` | duration | `0` | Standard deviation for KV-cache transfer latency |
| `--kv-cache-transfer-time-per-token` | duration | `0` | Time for KV-cache transfer per token from a remote vLLM |
| `--kv-cache-transfer-time-std-dev` | duration | `0` | Standard deviation for KV-cache transfer time per token |
| `--time-factor-under-load` | float64 | `1.0` | Multiplicative factor affecting request time when parallel requests are processed (must be >= 1.0) |
| `--enable-kvcache` | bool | `false` | Enables KV cache feature |
| `--kv-cache-size` | int | `1024` | Maximum number of token blocks in KV cache |
| `--global-cache-hit-threshold` | float64 | `0` | Default cache hit threshold [0, 1] for all requests |
| `--block-size` | int | `16` | Token block size for contiguous chunks (valid: 8, 16, 32, 64, 128) |
| `--tokenizers-cache-dir` | string | `hf_cache` | Directory for caching tokenizers |
| `--hash-seed` | string | `""` | Seed for hash generation (falls back to `PYTHONHASHSEED` env var) |
| `--zmq-endpoint` | string | `tcp://localhost:5557` | ZMQ address to publish events |
| `--zmq-max-connect-attempts` | int | `0` | Maximum number of times to try ZMQ connect (max 10) |
| `--event-batch-size` | int | `16` | Maximum number of KV-cache events to be sent together |
| `--data-parallel-size` | int | `1` | Number of ranks to run (1-8) |
| `--data-parallel-rank` | int | `-1` | The rank when running each rank in a process |
| `--failure-injection-rate` | int | `0` | Probability (0-100) of injecting failures |
| `--failure-types` | []string | `[]` | Specific failure types to inject: `rate_limit`, `invalid_api_key`, `context_length`, `server_error`, `invalid_request`, `model_not_found` |
| `--fake-metrics` | string | `""` | JSON metrics to report to Prometheus instead of real metrics |
| `--ssl-certfile` | string | `""` | Path to SSL certificate file for HTTPS |
| `--ssl-keyfile` | string | `""` | Path to SSL private key file for HTTPS |
| `--self-signed-certs` | bool | `false` | Enable automatic generation of self-signed certificates for HTTPS |
| `--dataset-path` | string | `""` | Local path to SQLite database file for response generation from a dataset |
| `--dataset-url` | string | `""` | URL to download the SQLite database file for response generation |
| `--dataset-in-memory` | bool | `false` | Load the entire dataset into memory for faster access |
| `--enable-sleep-mode` | bool | `false` | Enable sleep mode |
| `--enable-request-id-headers` | bool | `false` | Enable including X-Request-Id header in responses |
| `--latency-calculator` | string | `""` | Name of the latency calculator: `constant` or `per-token` |
| `--max-tool-call-integer-param` | int | `100` | Maximum possible value of integer parameters in a tool call |
| `--min-tool-call-integer-param` | int | `0` | Minimum possible value of integer parameters in a tool call |
| `--max-tool-call-number-param` | float64 | `100` | Maximum possible value of number (float) parameters in a tool call |
| `--min-tool-call-number-param` | float64 | `0` | Minimum possible value of number (float) parameters in a tool call |
| `--max-tool-call-array-param-length` | int | `5` | Maximum possible length of array parameters in a tool call |
| `--min-tool-call-array-param-length` | int | `1` | Minimum possible length of array parameters in a tool call |
| `--tool-call-not-required-param-probability` | int | `50` | Probability (0-100) to add a non-required parameter in a tool call |
| `--object-tool-call-not-required-field-probability` | int | `50` | Probability (0-100) to add a non-required field in an object in a tool call |

##### Environment Variables

| Variable | Description |
|----------|-------------|
| `POD_NAME` | Pod name of simulator |
| `POD_NAMESPACE` | Namespace where simulator is running |
| `POD_IP` | IP address on which simulator runs |
| `PYTHONHASHSEED` | Fallback seed for hash generation if `--hash-seed` is not set |
| `VLLM_SERVER_DEV_MODE` | Set to `1` to enable development mode |

---

#### v0.5.0

##### Command-Line Flags

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `--config` | string | `""` | Path to a YAML configuration file. Command line values overwrite config file values |
| `--port` | int | `8000` | Port on which the simulator runs |
| `--model` | string | `""` | Currently 'loaded' model name (required) |
| `--served-model-name` | []string | `[]` | Model names exposed by the API (space-separated strings). Falls back to `--model` if not set |
| `--max-num-seqs` | int | `5` | Maximum number of inference requests that could be processed at the same time (parameter to simulate requests waiting queue) |
| `--max-loras` | int | `1` | Maximum number of LoRAs in a single batch |
| `--max-cpu-loras` | int | (same as `--max-loras`) | Maximum number of LoRAs to store in CPU memory |
| `--max-model-len` | int | `1024` | Model's context window, maximum number of tokens in a single request including input and output |
| `--lora-modules` | []string | `[]` | List of LoRA adapters (space-separated JSON strings) |
| `--mode` | string | `random` | Simulator mode: `echo` returns input text; `random` returns random pre-defined sentences |
| `--seed` | int64 | (current Unix nano) | Random seed for operations |
| `--time-to-first-token` | int | `0` | Time to first token in milliseconds |
| `--time-to-first-token-std-dev` | int | `0` | Standard deviation for time to first token in milliseconds (max 30% of TTFT) |
| `--inter-token-latency` | int | `0` | Time to generate one token in milliseconds |
| `--inter-token-latency-std-dev` | int | `0` | Standard deviation for inter-token latency in milliseconds (max 30% of ITL) |
| `--prefill-overhead` | int | `0` | Time to prefill in milliseconds. Ignored if `--time-to-first-token` is not 0 |
| `--prefill-time-per-token` | int | `0` | Time to prefill per token in milliseconds |
| `--prefill-time-std-dev` | int | `0` | Standard deviation for prefill time in milliseconds |
| `--kv-cache-transfer-latency` | int | `0` | Time for KV-cache transfer from a remote vLLM in milliseconds (P/D mode) |
| `--kv-cache-transfer-latency-std-dev` | int | `0` | Standard deviation for KV-cache transfer latency in milliseconds |
| `--kv-cache-transfer-time-per-token` | int | `0` | Time for KV-cache transfer per token from a remote vLLM in milliseconds |
| `--kv-cache-transfer-time-std-dev` | int | `0` | Standard deviation for KV-cache transfer time per token in milliseconds |
| `--time-factor-under-load` | float64 | `1.0` | Multiplicative factor affecting request time when parallel requests are processed (must be >= 1.0) |
| `--enable-kvcache` | bool | `false` | Enables KV cache feature |
| `--kv-cache-size` | int | `1024` | Maximum number of token blocks in KV cache |
| `--block-size` | int | `16` | Token block size for contiguous chunks (valid: 8, 16, 32, 64, 128) |
| `--tokenizers-cache-dir` | string | `""` | Directory for caching tokenizers |
| `--hash-seed` | string | `""` | Seed for hash generation (falls back to `PYTHONHASHSEED` env var) |
| `--zmq-endpoint` | string | `tcp://localhost:5557` | ZMQ address to publish events |
| `--zmq-max-connect-attempts` | uint | `0` | Maximum number of times to try ZMQ connect (max 10) |
| `--event-batch-size` | int | `16` | Maximum number of KV-cache events to be sent together |
| `--data-parallel-size` | int | `1` | Number of ranks to run (1-8) |
| `--failure-injection-rate` | int | `0` | Probability (0-100) of injecting failures |
| `--failure-types` | []string | `[]` | Specific failure types to inject: `rate_limit`, `invalid_api_key`, `context_length`, `server_error`, `invalid_request`, `model_not_found` |
| `--fake-metrics` | string | `""` | JSON metrics to report to Prometheus instead of real metrics |
| `--max-tool-call-integer-param` | int | `100` | Maximum possible value of integer parameters in a tool call |
| `--min-tool-call-integer-param` | int | `0` | Minimum possible value of integer parameters in a tool call |
| `--max-tool-call-number-param` | float64 | `100` | Maximum possible value of number (float) parameters in a tool call |
| `--min-tool-call-number-param` | float64 | `0` | Minimum possible value of number (float) parameters in a tool call |
| `--max-tool-call-array-param-length` | int | `5` | Maximum possible length of array parameters in a tool call |
| `--min-tool-call-array-param-length` | int | `1` | Minimum possible length of array parameters in a tool call |
| `--tool-call-not-required-param-probability` | int | `50` | Probability (0-100) to add a non-required parameter in a tool call |
| `--object-tool-call-not-required-field-probability` | int | `50` | Probability (0-100) to add a non-required field in an object in a tool call |

##### Environment Variables

| Variable | Description |
|----------|-------------|
| `POD_NAME` | Pod name of simulator |
| `POD_NAMESPACE` | Namespace where simulator is running |
| `PYTHONHASHSEED` | Fallback seed for hash generation if `--hash-seed` is not set |

##### Key Differences Between Main and v0.5.0

1. **Duration Parameters**: In main branch, latency-related parameters (`--time-to-first-token`, `--inter-token-latency`, etc.) use Go duration strings (e.g., "100ms", "1.5s"). In v0.5.0, these are integers representing milliseconds.

2. **New Flags in Main**: `--max-waiting-queue-length`, `--global-cache-hit-threshold`, `--data-parallel-rank`, `--ssl-certfile`, `--ssl-keyfile`, `--self-signed-certs`, `--dataset-path`, `--dataset-url`, `--dataset-in-memory`, `--enable-sleep-mode`, `--enable-request-id-headers`, `--latency-calculator`.

3. **Environment Variables**: Main branch adds `POD_IP` and `VLLM_SERVER_DEV_MODE`.