Event Queues and Threads

This document primarily describes Linux's epoll(7) I/O event notification
facility as of the 2.6 kernel series. It is believed to apply largely to
FreeBSD's kqueue(2) mechanism as well. The information within has been
extracted from documentation, mailing lists, source reading and extensive
experimentation. Little of it, however, constitutes published, official API,
and thus some of these details might change without warning.

1. Details of the Linux epoll(7) implementation
2. Details of the FreeBSD kqueue(2) implementation
3. Functional semantics
4. Performance semantics
5. Deductions
6. Signals

-----------------------------------------------------------------------------
1. Details of the Linux epoll(7) implementation (taken from 2.6.32.2):
-----------------------------------------------------------------------------
A struct eventpoll contains, among other things:

 - a spinlock_t, "lock"
 - a struct mutex, "mtx"
 - two wait_queue_head_t's, "wq" and "poll_wait"

sys_epoll_wait() validates input and calls ep_poll() (fs/eventpoll.c). Its
core loop:

 - acquires "lock"
 - if no events are ready and no signals are pending:
 -- releases "lock", sleeps on "wq"
 --- ep_poll_callback() is in charge of waking up "wq"
 --- a timeout is also used (LONG_MAX if we passed timeout < 0)
 -- reacquires "lock", loops back to the event/signal check
 - releases "lock"
 - calls ep_send_events()->ep_scan_ready_list() (see below)
 - if we have events ready, pending signals, or the timeout has expired, returns
 - otherwise, loops

ep_send_events()->ep_scan_ready_list() is largely linear code:

 - acquires "mtx"
 - acquires "lock"
 - steals the ready list "rdllist" to "txlist"
 - releases "lock"
 - calls ep_send_events_proc() with "txlist"
 - acquires "lock"
 - appends "ovflist" to "rdllist", resets "ovflist"
 - appends "txlist" to "rdllist"
 - if "rdllist" is non-empty, wakes up "wq"
 - releases "lock"
 - releases "mtx"
 - if "rdllist" is non-empty, wakes up "poll_wait"

ep_send_events_proc() iterates over a list of struct epitem objects:

 - if the returned revents mask matches the events requested:
 -- copies the event out
 -- if EPOLLONESHOT is set, events &= (EPOLLONESHOT|EPOLLET)
 -- else if EPOLLET is *not* set, adds it back to the ready list

"mtx" is held across ep_scan_ready_list() and thus ep_send_events_proc(), even
if EPOLLET is in use (and thus we needn't add events back to the ready list).
See the comment in ep_send_events_proc() about locking out epoll_ctl() callers,
which are directed to "ovflist".

sys_epoll_ctl() validates input, and:

 - acquires "mtx"
 - searches for the fd using ep_find()
 -- EPOLL_CTL_ADD with a present fd generates EEXIST
 -- EPOLL_CTL_DEL and EPOLL_CTL_MOD with a non-present fd generate ENOENT
 - calls the appropriate one of ep_insert(), ep_remove() or ep_modify()
 - releases "mtx"

The sys_epoll_pwait() extension is implemented in terms of sys_epoll_wait(),
wrapping the latter in appropriate sigprocmask() calls.

-----------------------------------------------------------------------------
2. Details of the FreeBSD kqueue(2) implementation:
-----------------------------------------------------------------------------

 FIXME

-----------------------------------------------------------------------------
3. Functional semantics
-----------------------------------------------------------------------------
Multiple threads can safely call into event retrieval -- the event queue will
not be corrupted.

Events are retrieved greedily. A thread will grab as many as it indicates
room for.

Multiple threads can safely modify an event queue, even if threads are
retrieving from it -- the event queue will not be corrupted, and threads
sleeping in retrieval will be woken up for new ready events.

Level-triggered events cannot be safely processed by multiple threads
without some kind of extra locking, except in circumstances so rare
(and likely trivial) that they're barely worth accounting for:

 - LT event 1 becomes ready
 - Thread A retrieves LT event 1 instance 1
 - Thread B retrieves LT event 1 instance 2
 - Thread A enters LT event 1 handler
 - Thread B enters LT event 1 handler
 - either contention or race follows

Edge-triggered, auto-renewed events cannot be safely processed by multiple
threads without some kind of extra locking, except when file descriptor
availability suffices to synchronize the event handling. An example of this
would be a callback which simply loops on accept(2) until EAGAIN/EWOULDBLOCK,
possibly performing some MT-safe processing on each new file descriptor (adding
the new fd to the event queue would be fine, since we've already shown event
queue operations to be MT-safe). Otherwise, for instance if reading data from
the fd into an associated buffer:

 - ET event 1 becomes ready
 - Thread A retrieves ET event 1 instance 1
 - Thread B sleeps
 - Thread A enters ET event 1 handler, read(2)ing all available data
 - ET event 1 is automatically rearmed
 - Thread B wakes up, retrieves ET event 1 instance 2
 - Thread B enters ET event 1 handler, read(2)ing all available data
 - ET event 1 is automatically rearmed
 - Thread A enters ET event 1 critical section
 - Thread B enters ET event 1 critical section
 - either contention or race follows

Edge-triggered, one-shot events can be safely processed by multiple threads
so long as they're rearmed only at the end of the handler. They require
more event queue modifications than other solutions (assuming that we always
handle as much data as is provided in any callback, which edge-triggering
requires; callbacks which depend on level-triggered semantics might perform
explicit disabling calls from time to time).

The choice of queue-sharing schemes affects the mechanics of library cleanup
and termination. See the document doc/termination for more details.

-----------------------------------------------------------------------------
4. Performance semantics
-----------------------------------------------------------------------------
Edge-triggered semantics allow a more efficient internal implementation than
level-triggered semantics.

Modifications and (running) retrievals are serialized by the kernel for
each event queue. As the number of threads per event queue rises, the
expected volume of waiting also increases. A complex feedback system is
likely to emerge if modification wait time ever becomes non-negligible.
This does not apply to retrieval wait time; if events are being handled more
quickly than they can be retrieved, we are bounded by retrieval time (by the
pigeonhole principle: in the aggregate, some thread will always be actively
retrieving, though we will sometimes empty the work queue).

System calls are, as ever, to be avoided when prudent. Using larger event
receipt buffers minimizes system calls into the event queueing framework,
but can lead to a poor short-term distribution of work and poor parallelism.

Batching of changes cannot be accomplished with the current Linux epoll_ctl()
implementation, despite our internal kevent() emulation presenting such an API
(it actually iterates over the change list, invoking epoll_ctl() in turn).
Until this is changed, batching of changes is a non-starter on Linux. FreeBSD
does provide the interface; it remains to be determined whether this is worth
the latency penalties. Multithreading makes this less of an issue, but also
leads to more possible contention on change registration.

-----------------------------------------------------------------------------
5. Deductions
-----------------------------------------------------------------------------
Single-threaded retrieval of events into a userspace queue shared among
multiple threads doesn't solve any correctness problems, or meaningfully
reduce locking (except, possibly, when threads are retrieving fewer events at
a time than is optimal for the load). The actual work of retrieval (as
opposed to sleeping) is still serialized by the kernel, and must be
performed.

Exclusive event queues per thread are the simplest solution, eliminating
all contention and even synchronization concerns at the event engine level.
This might not scale (it's doubtful that anyone's seriously used thousands
of event queues in a single app), and more importantly it leads to a lack of
balance; threads will only be as balanced as can be predicted during event
source distribution.

It is of course feasible to serve many threads using a scheme where only
one actually receives events for a given event queue at any time (i.e., we
don't rely on epoll_wait() being safe for multiple callers), dumping them
into a common userspace buffer managed via mutexes and condition variables.
We'd no longer be exploiting the kernel locking to provide our own
synchronization, but only a contention-free mutex acquisition and release
have been "added". Furthermore, this removes the event queue system call
from each thread; threads now need only perform userspace locking (with, of
course, a system call on the waitqueue should the lock be contended, but we
presumed that this was rare). What we want to avoid is a queue
*modification*, generated during the event loop, blocking against a
retrieval operation (preferably without adding a batching step a la the
kqueue() API, since this can add latency). Remember, contending retrievals
mean either that load is negligible, or that we're not retrieving nearly
enough events at a time; using TCP-like windows would solve that problem for
multiple event threads.

-----------------------------------------------------------------------------
6. Signals
-----------------------------------------------------------------------------
All other threads in the application must have libtorque-registered signals
masked, or else libtorque might not be able to receive relevant signals. Upon
creation, libtorque threads mask all signals. When a signal is registered with
libtorque, a common handler is installed for that signal, demultiplexing into
the signal event table. Note that since handlers are process-wide, failure to
mask appropriate signals in application threads might result in libtorque
signals being handled in their contexts.

Signals and POSIX threads have a historically complicated relationship (see
the document doc/termination for more details about threads). When
EVFILT_SIGNAL (FreeBSD) or signalfd(2) (Linux) is available, signals can be
cleanly handled within our standard event semantics. On older Linux kernels,
however, signalfd is not available. There, we must handle signals via the
standard POSIX signal delivery mechanism, adding new complexity to the event
queues.
Note from {2} that delivery of a signal results in immediate exit from
epoll_wait() and release of associated locks, thus the multiple threads of an
event queue will automatically enjoy distribution of events (this is
independent of any sa_mask value or SA_NODEFER flag provided to sigaction(2),
since each thread has its own signal mask).

If libtorque threads unblocked their signals throughout their lifetimes,
those signals could be delivered during event handling callbacks.
Synchronization requires that only one thread handle a given event source at
a time, so repeated delivery of a signal during such a callback would both
block the event source's progress in that thread (it's busy handling the
signal) and preclude (without major contortions) any other thread handling
the event.

POSIX signal handlers' action is heavily restricted. pthread_mutex_lock() is
not safe to call within a signal handler, nor is the C library's dynamic
allocation interface. We certainly cannot directly invoke registered callbacks
from the signal handler. The standard signal-handling idioms are:

--event: set a global flag of type volatile sig_atomic_t
--exception: clean up cross-process resources (files, etc.), re-raise the signal

Note that progress under the event idiom requires some thread to examine the
flag and handle the event in its context. GNU libc documents that handling a
signal interrupts all event multiplexing interfaces with EINTR, regardless of
the use of SA_RESTART, so the flag check must live no further from the
epoll_wait() call than the scope defined by the loop on EINTR.