# [KEP-5055](https://github.com/kubernetes/enhancements/issues/5055): DRA: device taints and tolerations

- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories](#user-stories)
    - [Degraded Devices](#degraded-devices)
    - [External Health Monitoring](#external-health-monitoring)
    - [Safe Pod Eviction](#safe-pod-eviction)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [API](#api)
  - [Test Plan](#test-plan)
    - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit tests](#unit-tests)
    - [Integration tests](#integration-tests)
    - [e2e tests](#e2e-tests)
  - [Graduation Criteria](#graduation-criteria)
    - [Alpha](#alpha)
    - [Beta](#beta)
    - [GA](#ga)
  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
  - [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
  - [Monitoring Requirements](#monitoring-requirements)
  - [Dependencies](#dependencies)
  - [Scalability](#scalability)
  - [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
  - [Extending node taint controller](#extending-node-taint-controller)
  - [Tolerating taints in pods](#tolerating-taints-in-pods)
  - [Storing result of patching in ResourceSlice](#storing-result-of-patching-in-resourceslice)

## Release Signoff Checklist

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [x] (R) KEP approvers have approved the KEP status as `implementable`
- [x] (R) Design details are appropriately documented
- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
  - [x] e2e Tests for all Beta API Operations (endpoints)
  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
  - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
- [x] (R) Graduation criteria is in place
  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [x] (R) Production readiness review completed
- [x] (R) Production readiness review approved
- [x] "Implementation History" section is up-to-date for milestone
- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [x] Supporting documentation, e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
[kubernetes/website]: https://git.k8s.io/website

## Summary

With Dynamic Resource Allocation (DRA), DRA drivers publish information about
the devices that they manage in ResourceSlices. This information is used by the
scheduler when selecting devices for user requests in ResourceClaims.

With this KEP, DRA drivers can mark devices as tainted such that they won't be
used for scheduling new pods.
In addition, pods already running with access to a tainted device can be
stopped automatically. Cluster administrators can do the same by creating a
DeviceTaintRule which applies a taint to all devices matching certain selection
criteria, like all devices of a certain driver.

Users can decide to ignore specific taints by adding tolerations to their
ResourceClaim.

## Motivation

### Goals

- Enable taking devices offline for maintenance while still allowing test pods
  to request and use those devices. Being able to do this one device at a time
  minimizes service level disruption.

- Enable users to decide whether they want to keep running a workload in a
  degraded mode while a device is unhealthy or prefer to get pods rescheduled.

- Publish information about devices ("device health") such that control plane
  components or admins can decide how to react, without immediately affecting
  scheduling or workloads.

### Non-Goals

- Developing a kubectl command for managing device taints is not part of this KEP.

## Proposal

### User Stories

#### Degraded Devices

A driver itself can detect problems which may or may not be tolerable for
workloads, like degraded performance due to overheating. Removing such devices
from the ResourceSlice would unconditionally prevent using them for new pods.
Instead, taints with documented meanings can be added to ResourceSlices to
describe the problem without any immediate effect on scheduling or running
pods.

A control plane component or the admins react to that information. They may
publish a DeviceTaintRule which prevents using the degraded device for new
pods, or which even evicts all pods currently using it so that the device can
be replaced or reset once it is idle.

Once there is such an effect, users can decide to tolerate less critical taints
in their workload, at their own risk. Admins scheduling maintenance pods need
to tolerate their own taints to get the pod scheduled.

#### External Health Monitoring

As a cluster admin, I am deploying a vendor-provided DRA driver together with a
separate monitoring component for hardware aspects that are not available or
not supported by that DRA driver. When that component detects problems, it can
check its policy configuration and decide to take devices offline by creating
a DeviceTaintRule with a taint for affected devices.

#### Safe Pod Eviction

Selecting the wrong set of devices in a DeviceTaintRule can have potentially
disastrous consequences, including quickly evicting all workloads using any
device in the cluster instead of those using a single device. To avoid this, a
cluster admin can first create a DeviceTaintRule such that it has no immediate
effect. The eviction controller adds a condition with information about what it
would do. `kubectl describe` then shows that information. At this point,
scheduling is not affected and pods keep running.

Then the admin can edit the DeviceTaintRule to set the desired effect to "evict
all affected pods". The DeviceTaintRule status provides information about the
progress.

It can happen that a pod still gets scheduled after activating the taint
because the scheduler hasn't observed that change yet. Such a pod then gets
evicted. Either way, eventually no affected pods are running and no new ones
get scheduled either.

### Risks and Mitigations

A device can be identified by its names (`<driver name>/<pool name>/<device name>`).
To support hot-swapping, core DRA deliberately does not require that a name is
tied to one particular hardware instance. Admins are expected to prefer the
names. Health monitoring might prefer to be specific and use a vendor-defined
unique ID attribute. However, supporting this consistently in a DeviceTaintRule
turned out to be hard because the attributes of an allocated device are not
guaranteed to be available (ResourceSlice might have been deleted) and
therefore selecting devices by attributes in a DeviceTaintRule is not possible.
Vendors which want to be very specific about which device incarnation is
unhealthy have to use taints in ResourceSlices.

Without a kubectl extension similar to `kubectl taint nodes`, the user
experience for admins will be a bit challenging. They need to decide how to
identify the device (by name or with a CEL expression), manually create a
DeviceTaintRule with a unique name, then remember to remove that
DeviceTaintRule again.

Users might be tempted to tolerate taints to get their pods running. They do
that at their own risk. Depending on the taint, the application then may not
get the performance it needs (degraded hardware) or may fail at runtime
(hardware gets turned off). Admission controllers or validating admission
policies could be deployed to limit which tolerations may be used, but as
taints are not defined by Kubernetes itself, none of that is part of Kubernetes
itself.

A taint is not meant to be used as an access control mechanism. Users are
allowed to ignore taints (at their own risk). Adding a taint in a live cluster
is inherently racy because it first needs to be observed by e.g. the scheduler.

## Design Details

The feature follows the approach and APIs taken for node taints and applies
them to devices.

A new controller watches tainted devices and deletes pods using them unless
they tolerate the device taint, similar to the
[taint-eviction-controller](https://github.com/kubernetes/kubernetes/blob/32130691a4cb8a1034b999341c40e48d197f5465/pkg/controller/tainteviction/taint_eviction.go#L81-L83).
A pod which is running or has finalizers will not get removed immediately.
Instead, the `DeletionTimestamp` gets set. That's okay for the purpose of this
KEP:

- The kubelet will stop any running containers and mark the pod as completed.
- The ResourceClaim controller will remove such a completed pod from the
  claim's `ReservedFor` and deallocate the claim once it has no consumers.

The semantics of the value associated with a taint key are defined by whoever
publishes taints with that key. DRA drivers should use the driver name as
domain in the key to avoid conflicts. To support tolerating taints by value,
values should not require parsing to extract information. It's better to use
different keys with simple values than one key with a complex value.

If some common patterns emerge, then Kubernetes could standardize the name,
value and data for certain taints. Those then have well-defined semantics. This
is similar to standardizing device attributes.

The recommended pattern for names is:

- `<domain>/<name>`: the domain part avoids name clashes. This can be the same
  as a DRA driver name, but isn't required to be. Descriptive names for the
  different taints defined by a vendor are useful.
- `kubernetes.io/...`: reserved for use by Kubernetes.

`Effect: None` can be used to publish taints which are merely informational.
Such taints are ignored during scheduling and cause no eviction. Therefore it
is not necessary to specify tolerations for them. Nonetheless a toleration for
`Effect: None` is allowed. Such tolerations then have no effect either.

Taints are cumulative:

- Taints defined by an admin in a DeviceTaintRule get added to the set of
  taints defined by the DRA driver in a ResourceSlice.
- All taints that are not tolerated apply their effect.

It is valid to have the same taint key with different effects, both within a
ResourceSlice and when one is in a ResourceSlice and the other in a
DeviceTaintRule:

- Key: "A", Effect: NoSchedule
- Key: "A", Effect: NoExecute

What this means is that "A" implies both NoSchedule and NoExecute. The first
one could be dropped, but it's merely redundant, not in conflict with the
second one.
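
The cumulative rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the scheduler's or the eviction controller's implementation: `Taint` and `Toleration` are simplified local stand-ins for the API types shown in the API section, `tolerates` and `untolerated` are hypothetical helper names, the keys are made-up examples, and the matching rules are assumed to mirror those of node tolerations.

```go
package main

import "fmt"

// Taint and Toleration are simplified stand-ins for DeviceTaint and
// DeviceToleration. They are illustrative only, not the real API types.
type Taint struct {
	Key, Value, Effect string
}

type Toleration struct {
	Key, Operator, Value, Effect string
}

// tolerates reports whether one toleration matches one taint.
// Assumption: matching follows the rules used for node tolerations.
func tolerates(tol Toleration, t Taint) bool {
	// An empty effect in the toleration matches every effect.
	if tol.Effect != "" && tol.Effect != t.Effect {
		return false
	}
	// An empty key is only meaningful together with Exists (match all keys).
	if tol.Key == "" {
		return tol.Operator == "Exists"
	}
	if tol.Key != t.Key {
		return false
	}
	switch tol.Operator {
	case "Exists":
		return true
	case "", "Equal": // Equal is the default operator.
		return tol.Value == t.Value
	default:
		return false
	}
}

// untolerated merges driver-defined and admin-defined taints and returns
// those that still apply. It deliberately does not deduplicate by key or
// effect: every taint must be tolerated on its own.
func untolerated(sliceTaints, ruleTaints []Taint, tols []Toleration) []Taint {
	all := append(append([]Taint{}, sliceTaints...), ruleTaints...)
	var result []Taint
	for _, t := range all {
		// None and unknown effects are informational and never apply.
		if t.Effect != "NoSchedule" && t.Effect != "NoExecute" {
			continue
		}
		tolerated := false
		for _, tol := range tols {
			if tolerates(tol, t) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			result = append(result, t)
		}
	}
	return result
}

func main() {
	// Hypothetical taints published by a driver in a ResourceSlice.
	sliceTaints := []Taint{
		{Key: "example.com/overheated", Effect: "NoSchedule"},
		{Key: "example.com/health", Value: "degraded", Effect: "None"},
	}
	// Hypothetical taint added by an admin through a DeviceTaintRule.
	ruleTaints := []Taint{{Key: "example.com/overheated", Effect: "NoExecute"}}
	// The request tolerates only the NoExecute variant of the key.
	tols := []Toleration{{Key: "example.com/overheated", Operator: "Exists", Effect: "NoExecute"}}

	left := untolerated(sliceTaints, ruleTaints, tols)
	fmt.Println(len(left), left[0].Effect) // prints "1 NoSchedule"
}
```

In this example the NoSchedule taint remains untolerated even though a toleration exists for the same key with NoExecute, so a device carrying both taints would not be allocated for the request.
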

It would be possible to prevent redundant entries during validation of a
ResourceSlice like it
[is done for node taints](https://github.com/kubernetes/kubernetes/blob/2a3ca42c917a698e9fd3e07c55f369a9accbe2c2/pkg/apis/core/validation/validation.go#L6450-L6456),
but not when they are in different objects (ResourceSlice and
DeviceTaintRule). Instead, consumers need to handle this for device taints at
runtime by checking that all of them are tolerated, without assuming uniqueness
by key and effect. In particular, a NoExecute toleration does not imply a
NoSchedule toleration.

To ensure consistency among all pods sharing a ResourceClaim, the toleration
for taints gets added to the request in a ResourceClaim, not the pod. This also
avoids conflicts like one pod tolerating a taint for scheduling and some other
pod not tolerating that.

Device and node taints are applied independently. A node taint applies to all
pods on a node, whereas a device taint affects claim allocation and only those
pods using the claim.

### API

The Device struct inside a ResourceSlice gets extended with taint information.
To prevent exceeding the size limit for objects, a ResourceSlice with taints
may contain only half as many devices as one without taints.

If changes of taints are limited to the devices stored in a single
ResourceSlice, then there is no need to bump the generation of the pool: a
consumer will see and use either the old set of ResourceSlices or the new,
updated set. This makes updating taints efficient because it avoids having to
update the other ResourceSlices with a different generation or count. DRA
drivers which support taints are therefore encouraged to group devices together
which might have to be tainted together.

```Go
type Device struct {
    ...

    // If specified, these are the driver-defined taints.
    //
    // The maximum number of taints is 16. If taints are set for
    // any device in a ResourceSlice, then the maximum number of
    // allowed devices per ResourceSlice is 64 instead of 128.
    //
    // This is an alpha field and requires enabling the DRADeviceTaints
    // feature gate.
    //
    // +optional
    // +listType=atomic
    // +featureGate=DRADeviceTaints
    Taints []DeviceTaint
}

// DeviceTaintsMaxLength is the maximum number of taints per Device.
const DeviceTaintsMaxLength = 16

// The device this taint is attached to has the "effect" on
// any claim which does not tolerate the taint and, through the claim,
// to pods using the claim.
type DeviceTaint struct {
    // The taint key to be applied to a device.
    // Must be a label name.
    //
    // +required
    Key string

    // The taint value corresponding to the taint key.
    // Must be a label value.
    //
    // +optional
    Value string

    // The effect of the taint on claims that do not tolerate the taint
    // and through such claims on the pods using them.
    //
    // Valid effects are None, NoSchedule and NoExecute. PreferNoSchedule as used for
    // nodes is not valid here. More effects may get added in the future.
    // Consumers must treat unknown effects like None.
    //
    // +required
    Effect DeviceTaintEffect

    // ^^^^
    //
    // Implementing PreferNoSchedule would depend on a scoring solution for DRA.
    // It might get added as part of that.
    //
    // A possible future new effect is NoExecuteWithPodDisruptionBudget:
    // honor the pod disruption budget instead of simply deleting pods.
    // This is currently undecided, it could also be a separate field.
    //
    // Validation must be prepared to allow unknown enums in stored objects,
    // which will enable adding new enums within a single release without
    // ratcheting.

    // TimeAdded represents the time at which the taint was added or
    // (only in a DeviceTaintRule) the effect was modified.
    // Added automatically during create or update if not set.
    // In addition, in a DeviceTaintRule a value provided during
    // an update gets replaced with the current time if the provided
    // value is the same as the old one and the new effect is different.
    //
    // +optional
    TimeAdded *metav1.Time

    // ^^^
    //
    // This field was defined as "It is only written for NoExecute taints." for node taints.
    // But in practice, Kubernetes never did anything with it (no validation, no defaulting,
    // ignored during pod eviction in pkg/controller/tainteviction).
}

// +enum
type DeviceTaintEffect string

const (
    // No effect, the taint is purely informational.
    DeviceTaintEffectNone DeviceTaintEffect = "None"

    // Do not allow new pods to schedule which use a tainted device unless they tolerate the taint,
    // but allow all pods submitted to Kubelet without going through the scheduler
    // to start, and allow all already-running pods to continue running.
    DeviceTaintEffectNoSchedule DeviceTaintEffect = "NoSchedule"

    // Evict any already-running pods that do not tolerate the device taint.
    DeviceTaintEffectNoExecute DeviceTaintEffect = "NoExecute"
)
```

DeviceTaint has all the fields of a v1.Taint, but the description is a bit
different. In particular, PreferNoSchedule is not valid and None gets added.

Tolerations get added to a DeviceRequest:

```Go
type DeviceRequest struct {
    ...

    // If specified, the request's tolerations.
    //
    // Tolerations for NoSchedule are required to allocate a
    // device which has a taint with that effect. The same applies
    // to NoExecute.
    //
    // In addition, should any of the allocated devices get tainted
    // with NoExecute after allocation and that effect is not tolerated,
    // then all pods consuming the ResourceClaim get deleted to evict
    // them. The scheduler will not let new pods reserve the claim while
    // it has these tainted devices. Once all pods are evicted, the
    // claim will get deallocated.
    //
    // The maximum number of tolerations is 16.
    //
    // This is an alpha field and requires enabling the DRADeviceTaints
    // feature gate.
    //
    // +optional
    // +listType=atomic
    // +featureGate=DRADeviceTaints
    Tolerations []DeviceToleration
}

// DeviceTolerationsMaxLength is the maximum number of tolerations in a DeviceRequest.
const DeviceTolerationsMaxLength = 16

// The ResourceClaim this DeviceToleration is attached to tolerates any taint that matches
// the triple <key,value,effect> using the matching operator <operator>.
type DeviceToleration struct {
    // Key is the taint key that the toleration applies to. Empty means match all taint keys.
    // If the key is empty, operator must be Exists; this combination means to match all values and all keys.
    // Must be a label name.
    //
    // +optional
    Key string

    // Operator represents a key's relationship to the value.
    // Valid operators are Exists and Equal. Defaults to Equal.
    // Exists is equivalent to wildcard for value, so that a ResourceClaim can
    // tolerate all taints of a particular category.
    //
    // +optional
    // +default="Equal"
    Operator DeviceTolerationOperator

    // Value is the taint value the toleration matches to.
    // If the operator is Exists, the value must be empty, otherwise just a regular string.
    // Must be a label value.
    //
    // +optional
    Value string

    // Effect indicates the taint effect to match. Empty means match all taint effects.
    // When specified, allowed values are NoSchedule and NoExecute.
    //
    // +optional
    Effect DeviceTaintEffect

    // TolerationSeconds represents the period of time the toleration (which must be
    // of effect NoExecute, otherwise this field is ignored) tolerates the taint. By default,
    // it is not set, which means tolerate the taint forever (do not evict). Zero and
    // negative values will be treated as 0 (evict immediately) by the system.
    // If larger than zero, the time when the pod needs to be evicted is calculated as