DPRIO: THE DEFERRED SET PRIORITY FACILITY FOR LINUX

Sergey Oboguev

SUMMARY
=======

Applications relying on fine-grain parallelism may sometimes need to change
their thread priorities at a very high rate, hundreds or even thousands of
times per typical scheduling timeslice. These are typically applications
that have to execute short or very short lock-holding critical or otherwise
time-urgent sections of code at a very high frequency and need to protect
these sections with "set priority" system calls: one "set priority" call to
elevate the current thread's priority before entering the critical or
time-urgent section, followed by another call to downgrade the thread's
priority at the completion of the section. Due to the high frequency of
entering and leaving critical or time-urgent sections, the cost of these
"set priority" system calls may rise to a noticeable part of an
application's overall expended CPU time.

The proposed "deferred set priority" facility allows applications to largely
eliminate the cost of these system calls. Instead of executing a system call
to elevate its thread priority, an application simply writes its desired
priority level to a designated memory area in the userspace. When the kernel
attempts to preempt the thread, it first checks the content of this area,
and if the application's request to change its priority has been posted in
the designated memory area, the kernel executes this request and alters the
priority of the thread being preempted before performing a rescheduling, and
then makes scheduling decisions based on the new thread priority level, thus
implementing the priority protection of the critical or time-urgent section
desired by the application. In a predominant number of cases, however, an
application will complete the critical section before the end of the current
timeslice and cancel or alter the request held in the userspace area.
Thus the vast majority of an application's change-priority requests will be
handled and mutually cancelled or coalesced within the userspace, at a very
low overhead and without incurring the cost of a system call, while
maintaining safe preemption control. The cost of an actual kernel-level "set
priority" operation is incurred only if the application is actually
preempted while inside the critical section, i.e. typically at most once per
scheduling timeslice instead of hundreds or thousands of "set priority"
system calls in the same timeslice.

RATIONALE
=========

There are two common basic kinds of situations calling for high-frequency
elevation of thread priority during short sections of code.

One capability commonly needed by applications relying on fine-grained
parallelism is efficient control over thread preemption in order to avoid
lock holder preemption. A wide class of applications falls under this
category, from database engines to multiprocessor virtual machines to
multimedia applications. Primitives intended to counteract thread preemption
would normally be used to bracket critical sections within applications,
with the purpose of preventing the thread from being preempted while holding
a resource lock.

Preemption of a resource holder can lead to a wide range of pathologies,
such as other threads piling up waiting for the preempted thread holding the
lock. These blocked threads may in their turn hold other locks, which can
lead to an avalanche of blocking in the system (convoying), resulting in
drastically increased processing latencies and waste of CPU resources due to
context switching and rescheduling overhead. Furthermore, if resource locks
are implemented as spinlocks or hybrid locks with "spin-then-block"
behavior, blocked waiters will spin, wastefully consuming CPU resources and
hindering the preempted lock holder from completing its work and releasing
the lock. Priority inversion can easily occur in this situation as well.
Yet, despite this situation being common, mainstream operating systems
usually do not offer an efficient low-overhead mechanism for preemption
control. Usually the best they offer is a system call to change thread
priority, but this call requires a userspace/kernel context switch and
in-kernel processing, which is fine for applications with coarse-grained
parallelism, but is expensive for applications relying on fine-grain
parallelism. By contrast, mainstream operating systems do provide efficient
locking primitives that normally do not require userspace/kernel context
switching, such as the Linux futex or the Windows CRITICAL_SECTION, but they
do not provide similarly efficient mechanisms for priority protection of
critical sections or other adequate preemption control to match those
locking primitives.

Another kind of situation arises out of an application's need to control a
thread's priority for reasons unrelated to well-defined resource locking,
but rather to perform system-critical tasks, whether holding a lock or not.
For example, a multiprocessor virtual machine may call for the elevation of
a thread's priority while processing virtual device interrupts or
inter-processor interrupts, for reasons of timing issues not expressible via
locking notation. A virtual machine may also want to elevate a thread's
priority while a guest OS is inside its time-urgent section as expressed by
the state of the virtual processor and various guest OS indicators not
reducible to locking. [*]

[*] Indeed, the proposal for the deferred set priority mechanism was
conceived in the context of the VAX multiprocessor virtual machine project,
which encounters both kinds of the described situations extensively. See the
discussion in "VAX MP: A multiprocessor VAX simulator // Technical
Overview", http://oboguev.net/vax_mp/VAX_MP.pdf , specifically the sections
"Mapping execution context model: virtual processors to host system threads"
(pp.
16-30) and "Retrospective on the features of host OS desirable for SMP
virtualization" (pp. 156-159).

Time-urgent sections thus can be defined not only by lock states, but also
by other intra-application state variables.

SOLUTION
========

The proposed solution is composed of two parts: a patch to the Linux kernel
implementing the deferred set priority facility, and a user-level library
simplifying the use of the facility and shielding an application from the
low-level details and from the boilerplate code likely to be common and
repetitive for all applications using the deferred set priority mechanism.

An application developer wishing to use DPRIO normally does not need to
bother with the DPRIO prctl interface and the userspace <-> kernel protocol
described in this section, and would rather use the user-level library,
which presents a very simple interface and shields the developer from the
low-level details.

The kernel part of the DPRIO interface is exposed via prctl(2) with the
option PR_SET_DEFERRED_SETPRIO. PR_SET_DEFERRED_SETPRIO has two forms: one
to set up the use of the deferred set priority facility (hereafter, DPRIO)
for the current thread, another to terminate the use of DPRIO for the
thread.

The system call to set up the use of DPRIO for the thread takes the form

    prctl(PR_SET_DEFERRED_SETPRIO, dprio_ku_area_pp, sched_attrs_pp,
          sched_attrs_count, 0)

The effect of this call is limited to the current thread only.

sched_attrs_pp is a pointer to an array of pointers to struct sched_attr.
The array size is specified by the sched_attrs_count parameter. The array
describes the set of priorities (scheduling attributes) the thread intends
to subsequently request via the DPRIO mechanism and must contain at least
one entry for each scheduling policy (SCHED_NORMAL, SCHED_RR etc.) that the
application intends to subsequently request via DPRIO.
An application can specify multiple entries per scheduling policy in the
array, but only the entry with the highest ("best") priority for the given
scheduling policy really matters. For each of the scheduling policies listed
in the array, prctl(PR_SET_DEFERRED_SETPRIO) will determine the highest
priority level listed and verify whether the calling thread is currently
authorized to use this level of priority. If not, the prctl call will return
an error status. If yes, prctl(PR_SET_DEFERRED_SETPRIO) will store inside
the kernel a pre-authorization for the thread to subsequently elevate to
this level of priority via DPRIO. DPRIO will also allow the thread to
elevate to lower levels of priority within the same scheduling policy.

The following scheduling policies can be listed via sched_attrs_pp and
subsequently used via DPRIO: SCHED_NORMAL, SCHED_IDLE, SCHED_BATCH, SCHED_RR
and SCHED_FIFO. SCHED_DEADLINE is not a meaningful policy for the use cases
DPRIO is intended for, and cannot be used with DPRIO. An attempt to specify
SCHED_DEADLINE in sched_attrs_pp will result in
prctl(PR_SET_DEFERRED_SETPRIO) returning an error.

Note that using SCHED_NORMAL and SCHED_BATCH does not provide protection
against preemption at the expiration of the task's timeslice. Using a
negative "nice" value gives the thread a greater claim to CPU resources over
a longer time frame, but does not secure it against preemption by peer
threads in the short term. Therefore lock-holding and most other time-urgent
critical sections will typically use SCHED_FIFO or SCHED_RR for priority
protection. Still, there are potentially some rare cases when using
SCHED_NORMAL or SCHED_BATCH for "soft" priority protection may come in handy
in the DPRIO context. See https://lkml.org/lkml/2014/8/6/593 for details.

More specifically, the stored pre-authorization consists of:

(a) For each scheduling policy listed in sched_attrs_pp, the highest
    priority level requested for this policy in sched_attrs_pp.
(b) A record of whether the caller had the CAP_SYS_NICE capability (either
    explicitly assigned as a capability, or the caller having the effective
    id of root) at the time of the call.

If at any time after executing prctl(PR_SET_DEFERRED_SETPRIO) the thread's
priority limits are subsequently constrained with
prlimit/setrlimit(RLIMIT_RTPRIO) or prlimit/setrlimit(RLIMIT_NICE), the new
constraint will affect the stored pre-authorization. DPRIO will not allow
the thread to elevate its priority above the new limits set by
prlimit/setrlimit(RLIMIT_RTPRIO) and prlimit/setrlimit(RLIMIT_NICE)
regardless of the pre-authorization created at the time of
prctl(PR_SET_DEFERRED_SETPRIO), unless the pre-authorization record also
indicates that the thread had the CAP_SYS_NICE capability at the time of
calling PR_SET_DEFERRED_SETPRIO. This restriction is intended to let
external process management applications clamp down the application's
priority.

Thus calling PR_SET_DEFERRED_SETPRIO creates a limited pre-authorization
context separate from the thread's current security context, holding a
record of the CAP_SYS_NICE setting for the thread at the time of the
PR_SET_DEFERRED_SETPRIO call. If the thread decides to subsequently
downgrade its current security context and does not want the code subsequent
to this point to be able to make use of the CAP_SYS_NICE recorded in the
DPRIO context, it is responsible for shutting down the registered DPRIO as
well. (Very much like a privileged program downgrading its privileges is
responsible for deciding which of the files opened while it had privileges
must be closed at this point and which may be left open.)

dprio_ku_area_pp is a pointer to a u64 variable in the userspace; let us
name the latter dprio_ku_area_p. This latter variable must be aligned on the
u64 natural boundary, i.e. on the 8-byte boundary.
When the application wants to signal to the kernel its desire to change the
thread's priority, the application fills in the desired priority settings
and control data into struct dprio_ku_area described below and stores the
address of the prepared struct dprio_ku_area into dprio_ku_area_p. Even
though dprio_ku_area_p actually holds a pointer, it is declared as u64, so
that the interface is uniform regardless of the system's bitness and 32-bit
applications can interoperate with a 64-bit kernel. On 32-bit systems, the
upper part of dprio_ku_area_p is left zero.

prctl(PR_SET_DEFERRED_SETPRIO) stores the value of dprio_ku_area_pp inside
the kernel. When the kernel subsequently attempts to preempt the thread, it
checks the stored value of dprio_ku_area_pp, and if it is not NULL, tries to
fetch from the userspace the u64 value pointed to by dprio_ku_area_pp, i.e.
dprio_ku_area_p. If the fetch attempt fails (either because dprio_ku_area_pp
points to invalid memory or because the page has been paged out), the kernel
ignores the DPRIO setting and proceeds as if the thread was not set up for
DPRIO.

If the fetched value of dprio_ku_area_p is NULL, there is no DPRIO request
from the application and the kernel proceeds with rescheduling. If the
fetched value of dprio_ku_area_p is not NULL, the kernel fetches the struct
dprio_ku_area pointed to by dprio_ku_area_p. This area has the following
structure:

    struct dprio_ku_area {
        /*volatile*/ u32 resp;          /* DPRIO_RESP_xxx */
        /*volatile*/ u32 error;         /* one of errno values */
        /*volatile*/ struct sched_attr sched_attr;
    };

If the new priority setting requested by the application in the sched_attr
field fits within the pre-authorization stored for the thread during
prctl(PR_SET_DEFERRED_SETPRIO), the kernel will alter the thread's priority
and scheduling attributes to the new value requested in sched_attr. The
kernel will try to store the success/error status of the operation in the
resp and error fields and attempt to reset the value of dprio_ku_area_p to
NULL.
The exact kernel <-> userspace DPRIO communication protocol is described in
a separate section below.

The system call to terminate the use of DPRIO for the thread takes the form

    prctl(PR_SET_DEFERRED_SETPRIO, 0, 0, 0, 0)

After executing this call, the kernel will clear any previously stored value
of dprio_ku_area_pp for the thread and will no longer attempt to check for
DPRIO requests for the thread.

It is crucial that the user of the DPRIO facility remembers to detach from
DPRIO after it stops using it and before relinquishing the ownership of the
memory pointed to by dprio_ku_area_pp and of the struct dprio_ku_area
pointed to by dprio_ku_area_p, as the kernel will continue reading from and
writing to these areas of memory until DPRIO use is terminated either via
prctl or via thread termination. Relinquishing the ownership of these memory
areas prior to terminating their designation for DPRIO will therefore likely
result in (1) task memory corruption and (2) the possibility of unintended
task priority changes.

Right before detaching the thread from the previously designated
dprio_ku_area_pp, the kernel will make one last attempt to check for a
pending DPRIO request associated with the previous value of dprio_ku_area_pp
and will process any currently pending DPRIO request pointed to by this
value. Likewise, if the application calls

    prctl(PR_SET_DEFERRED_SETPRIO, dprio_ku_area_1_pp, ...)
    . . . . .
    prctl(PR_SET_DEFERRED_SETPRIO, dprio_ku_area_2_pp, ...)

then right before switching over to dprio_ku_area_2_pp, the kernel will
process a pending DPRIO request pointed to by dprio_ku_area_1_pp.

DPRIO requests are handled both on voluntary and involuntary preemption of
the thread, i.e. if the thread has a DPRIO request posted, the request will
be handled when the thread is preempted by a higher-priority task or on a
round-robin basis, but also when the thread enters a wait state, e.g. by
calling sleep(3), trying to read from a socket with no data available, doing
epoll etc.
The PR_SET_DEFERRED_SETPRIO setting is per-thread and is not inherited by
child threads or processes created via clone(2) or fork(2). Only the effects
of priority change requests issued via DPRIO prior to clone(2) or fork(2)
have an effect on a child task or thread (as long as this is not restricted
by SCHED_RESET_ON_FORK or SCHED_FLAG_RESET_ON_FORK); a child task or thread
does not inherit the parent's PR_SET_DEFERRED_SETPRIO setting.

The PR_SET_DEFERRED_SETPRIO setting is reset by execve(2). Only the effects
of priority change requests issued via DPRIO prior to execve(2) have an
effect on the loaded executable image; the new image executed by the task
does not inherit the PR_SET_DEFERRED_SETPRIO setting.

When executing clone(2), fork(2) or execve(2), the kernel will check for a
pending DPRIO request and try to process it before executing the main
syscall body, so the effects of a priority adjustment request posted via
DPRIO will be integrated into the outcome of the mentioned syscalls.

When executing a DPRIO request, the kernel will try to merge
SCHED_FLAG_RESET_ON_FORK into the current task state, as follows. If the
task does not have the SCHED_RESET_ON_FORK flag set, and the request does
have the SCHED_FLAG_RESET_ON_FORK flag set in the sched_attr structure, the
kernel will set the SCHED_RESET_ON_FORK flag for the task. If the task
already has the SCHED_RESET_ON_FORK flag set, but the request does not have
the SCHED_FLAG_RESET_ON_FORK flag set in the sched_attr structure, the task
will retain the flag. The resultant task's "reset on fork" flag state is
thus the logical "OR" of the preceding task's flag state and the
SCHED_FLAG_RESET_ON_FORK flag state in the request. If the flag in the
request is not set, the kernel will not attempt to reset the task's flag.
This is intended to avoid the need for task privilege checking during the
execution of a DPRIO request (since clearing the "reset on fork" flag is a
privileged operation).
The client of the user-level library needs to take additional caution about
the use of the SCHED_FLAG_RESET_ON_FORK flag, as described in the
user-level library section below.

DPRIO USERSPACE <-> KERNEL PROTOCOL
===================================

The userspace-kernel DPRIO protocol is defined as follows:

Userspace:

    Select and fill in dprio_ku_area:
        Set @resp = DPRIO_RESP_NONE.
        Set @sched_attr.
    Set @dprio_ku_area_p to point to the struct dprio_ku_area.

Kernel:

    1) On a task preemption attempt, or at another processing point such as
       fork or exec, read @dprio_ku_area_p. If the read fails (e.g.
       @dprio_ku_area_p is inaccessible, incl. the page being swapped out),
       quit. Note: the read will be reattempted on the next preemption
       cycle.

    2) If the read-in value of @dprio_ku_area_p is 0, do nothing. Quit.

    3) Set @resp = DPRIO_RESP_UNKNOWN. If cannot (e.g. inaccessible), quit.

    4) Set @dprio_ku_area_p = NULL. If cannot (e.g. inaccessible), quit.
       Note that in this case request handling will be reattempted on the
       next thread preemption cycle. Thus a @resp value of
       DPRIO_RESP_UNKNOWN may be transient and overwritten with
       DPRIO_RESP_OK or DPRIO_RESP_ERROR if @dprio_ku_area_p is not reset
       to 0 by the kernel (or to 0 or to the address of another
       dprio_ku_area by the userspace).

    5) Read @sched_attr. If cannot (e.g. inaccessible), quit.

    6) Try to change the task's scheduling attributes in accordance with
       the read-in value of @sched_attr.

    7) If successful, set @resp = DPRIO_RESP_OK and quit.

    8) If unsuccessful, set @error = an appropriate errno-style value.
       If cannot (e.g. @error inaccessible), quit.
       Set @resp = DPRIO_RESP_ERROR. If cannot (e.g. @resp inaccessible),
       quit.

Explanation of the possible @resp codes:

    DPRIO_RESP_NONE
        Request has not been processed yet.

    DPRIO_RESP_OK
        Request has been successfully processed.

    DPRIO_RESP_ERROR
        Request has failed; @error holds an errno-style error code.

    DPRIO_RESP_UNKNOWN
        Request processing has been attempted, but the outcome is unknown.
        The request might have been successful or might have failed.
        The current OS-level thread priority becomes unknown, and the
        @error field may be invalid. This code is written to @resp at the
        start of request processing; @resp is then changed to DPRIO_RESP_OK
        or DPRIO_RESP_ERROR at the end of request processing if
        dprio_ku_area and @dprio_ku_area_p stay accessible for write.

        This status code is never left visible to the userspace code in the
        current thread if dprio_ku_area and @dprio_ku_area_p are locked in
        memory and remain properly accessible for read and write during
        request processing.

        This status code might happen (i.e. stay visible to userspace code
        in the current thread) if access to dprio_ku_area or
        @dprio_ku_area_p is lost during request processing, for example if
        the page that contains the area gets swapped out or the area is
        otherwise not fully accessible for reading and writing.

        If @resp has the value DPRIO_RESP_UNKNOWN and @dprio_ku_area_p is
        still pointing to the dprio_ku_area containing this @resp, it is
        possible for the request to be reprocessed again at the next
        context switch, with @resp changing to DPRIO_RESP_OK or
        DPRIO_RESP_ERROR. To ensure @resp does not change under your feet,
        change @dprio_ku_area_p to either NULL or the address of another
        dprio_ku_area distinct from the one containing this @resp.

If the userspace memory containing dprio_ku_area_p or struct dprio_ku_area
gets paged out, the kernel won't be able to process a pending DPRIO request
or report the processing status back to the userspace. In practice, the
probability of this is exceedingly small: if the request is still pending,
it must have been posted by the application during the latest timeslice,
and thus the application must have touched those memory pages during this
timeslice, therefore they are extremely likely to still be resident.

The mainline use case for DPRIO is to avoid performance degradation caused
by problems like lock holder preemption, or preemption of a thread in an
overall application-urgent section.
These use cases are tolerant to occasionally missing a thread priority
elevation as long as it is very infrequent, and thus the total impact on
performance is negligible due to the very low incidence of such events. If
an application requires hard guarantees, it must lock the pages holding
dprio_ku_area_p and struct dprio_ku_area in memory with mlock(2). The
user-level DPRIO library described below allocates and manages the
low-level structures internally and provides a way for the caller to
request that the structures be locked in memory, as long as the caller has
the appropriate resource limits (sufficient RLIMIT_MEMLOCK) or the
CAP_IPC_LOCK capability.

USER-LEVEL LIBRARY
==================

While it is possible for an application developer to use the DPRIO prctl
and ku_area interface directly, it is more convenient to utilize instead
the user-level library wrapper published in the DPRIO git repository, which
shields the developer from having to deal with the low-level details and
embodies the boilerplate code. The wrapper is published under dual-license
terms, and the developer is free to choose between GPL2 and a "use as you
want" license covering both open-source and private-source commercial and
non-commercial applications.

The description of how to use the DPRIO user-level library and other
explanations can be found in the header file "dprio.h" of the library.
Usage examples can be found in the DPRIO test program "test.c".

The library provides a set of very simple routines to define logical
priority levels and their mapping to OS-level scheduling attributes, to
allocate the supporting per-thread structures holding dprio_ku_area and
other low-level data, and to get/set the thread's logical priority. The
logical priority setting is cached in the userspace and propagated to the
kernel only when actually necessary. In the use cases the DPRIO facility is
intended for, most get/set logical priority calls will be resolved within
the userspace at a very low overhead.
Two caveats the client of the DPRIO library needs to be aware of are
related to the very nature of the DPRIO library as "caching" the thread
priority setting in the userspace:

(a) sched_getattr(2), sched_getparam(2) and sched_getscheduler(2) return
    the current OS-level thread scheduling policy and priority setting for
    the thread, which is very likely to be different from the thread's
    current logical priority cached in the userspace, which, after all, is
    intended in most cases not to be propagated to the kernel.

(b) The DPRIO user-level library tracks the latest thread priority setting
    it propagated to the kernel and makes its caching decisions based on
    the knowledge of this setting held inside the library. The library
    assumes its knowledge accurately represents the kernel-level thread
    priority setting. Direct sched_setattr(2), sched_setparam(2) or
    sched_setscheduler(2) calls executed for the application thread, either
    by the application itself or by an external process, will invalidate
    this knowledge. If such an invalidation happens, it is the
    responsibility of the application to execute the library method
    dprio_setnow(pl, DPRIO_FORCE) to resync the kernel-level thread
    priority setting with the library-cached state and the application's
    idea of what the thread's priority should be.

The client of the user-level library must also be careful about the
SCHED_RESET_ON_FORK flag, since in addition to issuing DPRIO requests the
library may invoke sched_setattr(2) directly, which will fail if the
priority level descriptor passed to it does not have
SCHED_FLAG_RESET_ON_FORK set. The library tries to mitigate this issue by
checking whether SCHED_RESET_ON_FORK was set at the time of the library's
thread initialization, and if so merging SCHED_FLAG_RESET_ON_FORK into
subsequent requests to sched_setattr(2). However, if SCHED_RESET_ON_FORK is
set after calling dprio_init_thread, then the library caller is responsible
for managing the state of SCHED_FLAG_RESET_ON_FORK in the definitions of
the logical priority levels.
KERNEL BUILD CONFIGURATION
==========================

The DPRIO facility is included in the kernel build via the option
CONFIG_DEFERRED_SETPRIO. If this option is not enabled,
prctl(PR_SET_DEFERRED_SETPRIO) will return -1 with errno set to EINVAL.

The downside of using DPRIO is that it may slightly increase the
rescheduling latency in the case when a DPRIO request for the task is
pending and being processed at the task's rescheduling time. Thus, a
higher-priority task may be delayed from resuming its execution by the time
it takes to process the pending DPRIO request. This time on a typical x86
machine is around 1 usec, possibly slightly longer if the rt_mutex priority
inheritance chain needs to be adjusted. This added delay is but a fraction
of the normal task rescheduling and context switching time, and furthermore
is incurred only when a DPRIO request is actually pending, so in most cases
it won't be significant. However, for those cases where it might be, DPRIO
provides an authorization mechanism controlling who is allowed to use the
DPRIO facility.

The build option DEFERRED_SETPRIO_PRIVILEGED defines who is initially
authorized to use the DPRIO facility. Selecting "No" will initially permit
every user on the system to utilize DPRIO. Selecting "Yes" will initially
allow only users with the CAP_DPRIO capability to use the facility. This
initial setting is in effect right after system boot; however, it can be
dynamically altered via /proc/sys/kernel/dprio_privileged, as described
below.

The build option CONFIG_PUT_TASK_TIMEBOUND ensures fast and
deterministically time-bound task switch latency (as far as DPRIO impact is
concerned) when a deferred set priority request is pending on a task
rescheduling and the processing of this request causes an adjustment of the
priority inheritance chain under very low memory conditions (depleted
atomic pool).
Adjustment of the priority inheritance chain may cause (albeit with low
probability) a final release of the task structure for some tasks that
participated in the chain. When this codepath is executed within the
context of task rescheduling, the DPRIO patch tries to postpone the
deallocation of those task structures and various elements of these
structures by relegating the procedure to a system work thread instead of
executing it from inside the scheduler. To relegate this work to a work
thread, the kernel allocates a workqueue entry from an atomic pool;
however, under very low memory conditions this allocation may fail.

If the config option PUT_TASK_TIMEBOUND was selected as "no", then in the
case of such a workqueue element allocation failure the deallocation of the
task structure will be performed within the context of the scheduler. If
however PUT_TASK_TIMEBOUND was selected as "yes", the task structure
deallocation path will use a workqueue entry embedded into the task
structure itself. This ensures a workqueue entry is always available and
thus task deallocation can always be relegated to a system work thread;
however, it costs about 20-40 bytes per every task in the system.

Select PUT_TASK_TIMEBOUND as Y if building the kernel for a hard real-time
system requiring determinism in the task switch latency. Select N for a
general-purpose desktop or server system.

The build option CONFIG_DEBUG_DEFERRED_SETPRIO enables the debug code for
DPRIO.

DPRIO AUTHORIZATION
===================

DPRIO works by making a check at __schedule() time for whether a deferred
set priority request is pending, and if so, performing an equivalent of
sched_setattr(), minus the security checks, before the rescheduling. This
introduces additional latency at task switch time when a deferred set
priority request is pending -- albeit normally a very small latency, but a
non-zero one, and one a malicious user can also manoeuvre into increasing.
There are two parts to this latency.
One is the more or less constant part in the order of 1 usec: the basic
__sched_setscheduler() code. The other part is due to a possible priority
inheritance chain adjustment in rt_mutex_adjust_pi() and depends on the
length of the chain. A malicious user might conceivably construct a very
long chain, long enough for the processing of this chain at __schedule()
time to cause an objectionable latency. On systems where this might be a
concern, an administrator may therefore want to restrict the use of DPRIO
to legitimate trusted applications (or their users).

The build option DEFERRED_SETPRIO_PRIVILEGED defines who is initially
authorized to use the DPRIO facility. Selecting "No" will initially permit
every user on the system to utilize DPRIO. Selecting "Yes" will initially
allow only users with the CAP_DPRIO capability to use the facility. This
initial setting is in effect right after system boot; however, it can be
dynamically altered via /proc/sys/kernel/dprio_privileged.

Writing a value of "1" into /proc/sys/kernel/dprio_privileged makes DPRIO
require the caller to have the CAP_DPRIO capability. If the caller of
prctl(PR_SET_DEFERRED_SETPRIO) does not have CAP_DPRIO, the prctl call will
return an error status of -EPERM. Writing a value of "0" into
/proc/sys/kernel/dprio_privileged makes the DPRIO facility available to
unprivileged users.

A system administrator may designate specific applications as having the
right to use DPRIO even when launched by a non-privileged user, by setting
the capability on the application's executable file:

    # setcap CAP_DPRIO+eip /path/application

or, until libcap is updated to recognize the name of CAP_DPRIO, by using
the capability's numeric value:

    # setcap 38+eip /path/application

PRIOR ART
=========

There are a number of existing techniques and facilities intended to
address problems similar or partially similar to those the deferred set
priority mechanism (DPRIO) is intended to address. This section briefly
reviews these techniques and facilities and compares them vs.
DPRIO.

Microsoft Windows tries to address the lock holder preemption issue by
using an explicitly enlarged scheduling quantum in server editions, which
(hopefully) should allow a worker thread enough time to handle the request
and release the locks, and which also reduces context switching overhead.
The scheduling quantum can likewise be increased in Linux to meet similar
needs of server systems. However, a large scheduling quantum does not help
applications that require longer than a quantum to complete their
processing and that acquire locks for short intervals, many times per
scheduling quantum, whether the latter is short or long. In fact, an
enlarged quantum may even aggravate the problem in this case by preempting
the lock holder for a longer time.

The PTHREADS specification defines priority protection (ceiling) associated
with locks, whereby a thread acquiring a lock temporarily has its priority
elevated for the duration of holding the lock. However, when PTHREADS is
implemented as a user-level library, such as a part of GLIBC (which is
necessary for having a user-space fast path at least for the locking part
itself), the implementation of the priority protection feature has to rely
on the existing priority control facilities exposed by the kernel to the
userspace, and in the absence of DPRIO has to incur the cost of a system
call every time a thread priority adjustment is performed, i.e. typically
both on entering and leaving the critical section -- exactly what DPRIO is
meant to avoid. Ironically, the implementation of priority-protected
PTHREADS primitives ends up composed of two mismatched parts: low-overhead
locking primitives that complete most of the time in the userspace, without
switching to the kernel, and the priority protection part, which, in the
absence of DPRIO, means utilizing regular "set thread priority" system
calls with all the overhead of userspace/kernel context switching and
intra-kernel processing these calls induce.
DPRIO is meant to help eliminate this imbalance between the two parts and provide a priority control mechanism that is an adequate match for low-overhead locking primitives.

Furthermore, not all PTHREADS implementations implement the priority protection part of the specification correctly, or at all. For example, the GLIBC implementation resets the priority of a thread leaving a mutex-protected critical section to the maximum of the thread's pre-section priority and the highest ceiling priority of any mutex still held, even though the thread may want to retain a higher priority at the exit from the critical section for reasons unrelated to locking and has explicitly elevated itself to that priority. Part of this flaw arises from misreading the specification, but in part it is due to the lack of sufficient emphasis in the specification itself on application-wide control over thread priority that integrates both lock states and inputs unrelated to locking.

As an alternative to priority protection, the PTHREADS specification also provides a priority inheritance protocol for the protection of locks, which in Linux is implemented on top of priority inheritance futexes, ultimately laid on top of an rt_mutex. Apart from the issues discussed e.g. in Yodaiken's article ("Against priority inheritance", FSMLabs, 2002) and the responses to it, priority inheritance may work satisfactorily, or at least as a "least worst" solution, for the cases covered by the PI model, but it offers little for those that are not, such as applications with non-trivial wait chains not expressed in the lock structure, or expressed there but not as host OS PI locks (e.g. as guest OS spinlocks, or as one VCPU spin-waiting for an ack to an IPI request sent to another VCPU). Nor does the PI model offer a solution for applications with parts having soft RT properties, for which preemption by unrelated timesharing threads while inside a time-urgent section of the application is undesirable.
Applications not covered by the PI model thus have to fall back on priority-protection-based heuristics, whose problems were discussed earlier in this section.

Although the issue of inopportune preemption had been addressed in research early on [*], the only mainstream production operating systems that eventually came to provide a form of low-overhead preemption control are Solaris and AIX.

The Solaris schedctl facility provides the functions schedctl_start() and schedctl_stop() with a very efficient low-overhead implementation. When an application wants to defer involuntary thread preemption, the thread calls schedctl_start(), which sets a flag in a userspace/kernel communication memory page accessible to the kernel but also mapped into user space. If the scheduler sees this flag, it will try to honor it and give the thread extra time before preempting it. The flag is strictly advisory: the kernel is under no obligation to honor it, and indeed if the flag stays on for more than a couple of ticks and the thread does not yield, the kernel stops honoring the flag until the thread yields. If the kernel considered a thread for preemption but let it run on because of the "do not preempt me" flag, it sets a "preemption pending" flag in the user/kernel shared page. When the thread has released the lock and calls schedctl_stop(), the latter resets the "do not preempt me" flag and checks the "preemption pending" flag; if the latter is set, the thread yields voluntarily. AIX provides a very similar facility.

[*] For the bibliography see e.g. "VAX MP Technical Overview", p. 157, also pp. 16, 25.

The Solaris schedctl is a useful mechanism, but it has a number of obvious limitations. First, it does not provide a way to associate a priority with the resource whose lock is being held (or, more generally, with a thread's application-specific logical state; see the footnote below).
An application is likely to have a range of locks with different criticality levels and different needs for holder protection [*]. For some locks, holder preemption may be tolerated somewhat, while other locks are highly critical; furthermore, for some lock holders preemption by a high-priority thread is acceptable but preemption by a low-priority thread is not. The Solaris/AIX schedctl does not provide a capability for priority ranging relative to the context of the whole application and other processes in the system.

[*] We refer just to locks here for simplicity, but a thread's need for preemption control does not reduce to held locks alone, and may result from other intra-application state conditions, such as executing a time-urgent fragment of code in response to a high-priority event (one that may potentially be blocking for other threads) or other code paths that can lead to wait chains unless completed promptly.

Second, in some cases an application may need to perform time-urgent processing without knowing in advance how long it will take. In the majority of cases the processing may be very short (a fraction of a scheduling timeslice), but sometimes it may take much longer (such as a fraction of a second). Since schedctl would not be effective in the latter case, an application would have to resort to system calls for thread priority control in all cases [*], even in the majority "short processing" cases, with all the overhead of that approach.

[*] Or introduce extra complexity, most likely very cumbersome, by trying to gauge and monitor the accumulated duration of the processing, with the intention of transitioning from schedctl to thread priority elevation once a threshold has been reached.

Finally, schedctl is a strictly advisory mechanism. The kernel is under no obligation to honor it, and the calling thread fundamentally remains a low-priority thread, preemptible by other low-priority compute-bound threads.
Moreover, the kernel starts ignoring schedctl requests under heavy load, exactly when the aid of schedctl is most needed.

DPRIO offers an efficient facility addressing the needs of some use cases not covered by the existing mechanisms, and it can be used by applications both directly and as an efficient underlying facility in the implementation of PTHREADS priority protection and other higher-level synchronization primitives. For an overview of DPRIO within the space of other preemption handling solutions and techniques, see https://lkml.org/lkml/2014/8/13/744.

IMPACT OF HYPERVISORS
=====================

Priority protection schemes are vulnerable when a guest OS runs on top of a hypervisor in a configuration overcommitted with regard to CPU capacity. The hypervisor is typically oblivious to guest OS scheduling data and may deschedule a VCPU regardless of the priority of the task the VCPU is currently executing, yielding PCPU resources to a VCPU running a lower-priority task and thus creating a form of priority inversion. For applications likely to be deployed in such a configuration, it is therefore specifically advisable to combine priority protection with a form of post-preemption solution (such as spin-then-yield-to) wherever possible. Priority protection forms a front line of defense that avoids incurring the cost of inopportune preemption in the majority of cases, whereas the post-preemption solution limits the worst-case cost.

In the longer run, it may be desirable to develop an integrated guest OS - hypervisor mechanism to let lock waiters know that the lock holder has been preempted (either as a process by the guest OS, or as a whole VCPU by the hypervisor), so the waiters can yield immediately in favor of the lock holder.