Linux file system calls
http://man7.org/linux/man-pages/man2/close.2.html
10
SYSTEM CALL:
close(2) - Linux manual page
FUNCTIONALITY:

       close - close a file descriptor
SYNOPSIS:

       #include <unistd.h>

       int close(int fd);
DESCRIPTION:

       close() closes a file descriptor, so that it no longer refers to any
       file and may be reused.  Any record locks (see fcntl(2)) held on the
       file it was associated with, and owned by the process, are removed
       (regardless of the file descriptor that was used to obtain the lock).

       If fd is the last file descriptor referring to the underlying open
       file description (see open(2)), the resources associated with the
       open file description are freed; if the file descriptor was the last
       reference to a file which has been removed using unlink(2), the file
       is deleted.
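The following is a minimal sketch of the first point above: once close() succeeds, the descriptor number no longer refers to any file. The helper name errno_of_second_close is hypothetical, and the sketch assumes a single-threaded caller (in a multithreaded program another thread could reuse the number in between, which is why closing twice is a bug in real code).

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: opens path read-only, closes it, then tries to
 * close the same descriptor number again.  Returns the errno reported by
 * the second close(), 0 if it unexpectedly succeeded, or -1 if the setup
 * (open or first close) failed. */
int errno_of_second_close(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    if (close(fd) == -1)        /* first close releases the descriptor */
        return -1;

    errno = 0;
    if (close(fd) == 0)         /* fd should no longer refer to a file */
        return 0;
    return errno;
}
```

Calling errno_of_second_close("/dev/null") is expected to report EBADF, since the descriptor is no longer open at the time of the second call.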
http://man7.org/linux/man-pages/man2/creat.2.html
12
SYSTEM CALL:
open(2) - Linux manual page
FUNCTIONALITY:

       open, openat, creat - open and possibly create a file
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/stat.h>
       #include <fcntl.h>

       int open(const char *pathname, int flags);
       int open(const char *pathname, int flags, mode_t mode);

       int creat(const char *pathname, mode_t mode);

       int openat(int dirfd, const char *pathname, int flags);
       int openat(int dirfd, const char *pathname, int flags, mode_t mode);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       openat():
           Since glibc 2.10:
               _POSIX_C_SOURCE >= 200809L
           Before glibc 2.10:
               _ATFILE_SOURCE
DESCRIPTION:

       Given a pathname for a file, open() returns a file descriptor, a
       small, nonnegative integer for use in subsequent system calls
       (read(2), write(2), lseek(2), fcntl(2), etc.).  The file descriptor
       returned by a successful call will be the lowest-numbered file
       descriptor not currently open for the process.
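The lowest-numbered rule can be observed with a short sketch. The helper name reopen_reuses_lowest_fd is hypothetical, and the code assumes no other thread opens or closes descriptors while it runs.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if an open() issued after closing a
 * lower-numbered descriptor receives that same number again, 0 if it
 * does not, and -1 if any open() fails. */
int reopen_reuses_lowest_fd(const char *path)
{
    assert(path != NULL);

    int first = open(path, O_RDONLY);
    if (first == -1)
        return -1;

    int second = open(path, O_RDONLY);
    if (second == -1) {
        close(first);
        return -1;
    }

    /* Free the lower number; it is now the lowest one not in use. */
    close(first);

    int third = open(path, O_RDONLY);
    if (third == -1) {
        close(second);
        return -1;
    }

    int reused = (third == first);
    close(second);
    close(third);
    return reused;
}
```

Programs sometimes rely on this rule to redirect standard streams: closing descriptor 0 and immediately opening a file makes that file the new standard input.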

       By default, the new file descriptor is set to remain open across an
       execve(2) (i.e., the FD_CLOEXEC file descriptor flag described in
       fcntl(2) is initially disabled); the O_CLOEXEC flag, described below,
       can be used to change this default.  The file offset is set to the
       beginning of the file (see lseek(2)).

       A call to open() creates a new open file description, an entry in the
       system-wide table of open files.  The open file description records
       the file offset and the file status flags (see below).  A file
       descriptor is a reference to an open file description; this reference
       is unaffected if pathname is subsequently removed or modified to
       refer to a different file.  For further details on open file
       descriptions, see NOTES.
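As an illustration of the last point, the sketch below removes the pathname while the descriptor is still open and then keeps using the descriptor. The helper name fd_survives_unlink and the scratch path passed to it are assumptions made for this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: creates path, unlinks it, then writes and reads
 * back through the still-open descriptor.  Returns 1 if the data
 * round-trips, 0 if it does not, and -1 if the setup fails. */
int fd_survives_unlink(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
    if (fd == -1)
        return -1;

    /* Remove the name; the open file description is unaffected. */
    if (unlink(path) == -1) {
        close(fd);
        return -1;
    }

    const char msg[] = "still here";
    char buf[sizeof msg] = {0};

    int ok = write(fd, msg, sizeof msg) == (ssize_t)sizeof msg
          && lseek(fd, 0, SEEK_SET) == 0
          && read(fd, buf, sizeof buf) == (ssize_t)sizeof msg
          && memcmp(buf, msg, sizeof msg) == 0;

    close(fd);      /* last reference gone: the file is now deleted */
    return ok;
}
```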

       The argument flags must include one of the following access modes:
       O_RDONLY, O_WRONLY, or O_RDWR.  These request opening the file read-
       only, write-only, or read/write, respectively.

       In addition, zero or more file creation flags and file status flags
       can be bitwise-or'd in flags.  The file creation flags are O_CLOEXEC,
       O_CREAT, O_DIRECTORY, O_EXCL, O_NOCTTY, O_NOFOLLOW, O_TMPFILE, and
       O_TRUNC.  The file status flags are all of the remaining flags listed
       below.  The distinction between these two groups of flags is that the
       file status flags can be retrieved and (in some cases) modified; see
       fcntl(2) for details.

       The full list of file creation flags and file status flags is as
       follows:

       O_APPEND
              The file is opened in append mode.  Before each write(2), the
              file offset is positioned at the end of the file, as if with
              lseek(2).  O_APPEND may lead to corrupted files on NFS
              filesystems if more than one process appends data to a file at
              once.  This is because NFS does not support appending to a
              file, so the client kernel has to simulate it, which can't be
              done without a race condition.
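A small sketch of append mode on a local filesystem: even after an explicit lseek(2) back to the start, the next write(2) lands at the end of the file. The helper name size_after_seek_and_append is hypothetical, and the path is a scratch file chosen by the caller.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: writes 3 bytes, seeks to offset 0, writes 2 more
 * bytes with O_APPEND in effect, and returns the resulting file size
 * (or -1 on error).  The scratch file is removed before returning. */
long size_after_seek_and_append(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_APPEND,
                  S_IRUSR | S_IWUSR);
    if (fd == -1)
        return -1;

    long size = -1;
    struct stat st;
    if (write(fd, "abc", 3) == 3 &&
        lseek(fd, 0, SEEK_SET) == 0 &&   /* move the offset to the start */
        write(fd, "de", 2) == 2 &&       /* offset is reset to EOF first */
        fstat(fd, &st) == 0)
        size = (long)st.st_size;

    close(fd);
    unlink(path);
    return size;
}
```

Without O_APPEND the second write would overwrite the first two bytes and the size would stay at 3; with O_APPEND the file grows to 5 bytes.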

       O_ASYNC
              Enable signal-driven I/O: generate a signal (SIGIO by default,
              but this can be changed via fcntl(2)) when input or output
              becomes possible on this file descriptor.  This feature is
              available only for terminals, pseudoterminals, sockets, and
              (since Linux 2.6) pipes and FIFOs.  See fcntl(2) for further
              details.  See also BUGS, below.

       O_CLOEXEC (since Linux 2.6.23)
              Enable the close-on-exec flag for the new file descriptor.
              Specifying this flag permits a program to avoid additional
              fcntl(2) F_SETFD operations to set the FD_CLOEXEC flag.

              Note that the use of this flag is essential in some
              multithreaded programs, because using a separate fcntl(2)
              F_SETFD operation to set the FD_CLOEXEC flag does not suffice
              to avoid race conditions where one thread opens a file
              descriptor and attempts to set its close-on-exec flag using
              fcntl(2) at the same time as another thread does a fork(2)
              plus execve(2).  Depending on the order of execution, the race
              may lead to the file descriptor returned by open() being
              unintentionally leaked to the program executed by the child
              process created by fork(2).  (This kind of race is in
              principle possible for any system call that creates a file
              descriptor whose close-on-exec flag should be set, and various
              other Linux system calls provide an equivalent of the
              O_CLOEXEC flag to deal with this problem.)

       O_CREAT
              If the file does not exist, it will be created.  The owner
              (user ID) of the file is set to the effective user ID of the
              process.  The group ownership (group ID) is set either to the
              effective group ID of the process or to the group ID of the
              parent directory (depending on filesystem type and mount
              options, and the mode of the parent directory; see the mount
              options bsdgroups and sysvgroups described in mount(8)).

              The mode argument specifies the file mode bits to be applied when
              a new file is created.  This argument must be supplied when
              O_CREAT or O_TMPFILE is specified in flags; if neither O_CREAT
              nor O_TMPFILE is specified, then mode is ignored.  The
              effective mode is modified by the process's umask in the usual
              way: in the absence of a default ACL, the mode of the created
              file is (mode & ~umask).  Note that this mode applies only to
              future accesses of the newly created file; the open() call
              that creates a read-only file may well return a read/write
              file descriptor.

              The following symbolic constants are provided for mode:

              S_IRWXU  00700 user (file owner) has read, write, and execute
                       permission

              S_IRUSR  00400 user has read permission

              S_IWUSR  00200 user has write permission

              S_IXUSR  00100 user has execute permission

              S_IRWXG  00070 group has read, write, and execute permission

              S_IRGRP  00040 group has read permission

              S_IWGRP  00020 group has write permission

              S_IXGRP  00010 group has execute permission

              S_IRWXO  00007 others have read, write, and execute permission

              S_IROTH  00004 others have read permission

              S_IWOTH  00002 others have write permission

              S_IXOTH  00001 others have execute permission

              According to POSIX, the effect when other bits are set in mode
              is unspecified.  On Linux, the following bits are also honored
              in mode:

              S_ISUID  0004000 set-user-ID bit

              S_ISGID  0002000 set-group-ID bit (see stat(2))

              S_ISVTX  0001000 sticky bit (see stat(2))
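The (mode & ~umask) rule can be checked with a sketch like the one below. The helper name created_mode is hypothetical. The code assumes the target directory has no default ACL, and it changes the process-wide umask temporarily, so it is not suitable for multithreaded use as written.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: creates path with the requested mode while the
 * given umask is in effect, and returns the permission bits the new
 * file actually received (or -1 on error).  The file is removed again
 * before returning. */
int created_mode(const char *path, mode_t requested, mode_t mask)
{
    assert(path != NULL);

    unlink(path);                    /* make sure open() really creates */

    mode_t old_mask = umask(mask);   /* umask() returns the previous mask */
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, requested);
    umask(old_mask);                 /* restore the caller's umask */
    if (fd == -1)
        return -1;

    struct stat st;
    int rc = fstat(fd, &st);
    close(fd);
    unlink(path);
    if (rc == -1)
        return -1;

    return (int)(st.st_mode & (S_IRWXU | S_IRWXG | S_IRWXO));
}
```

For example, requesting S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP | S_IROTH | S_IWOTH (0666) under a mask of S_IWGRP | S_IWOTH (022) should yield 0644.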

       O_DIRECT (since Linux 2.4.10)
              Try to minimize cache effects of the I/O to and from this
              file.  In general this will degrade performance, but it is
              useful in special situations, such as when applications do
              their own caching.  File I/O is done directly to/from user-
              space buffers.  The O_DIRECT flag on its own makes an effort
              to transfer data synchronously, but does not give the
              guarantees of the O_SYNC flag that data and necessary metadata
              are transferred.  To guarantee synchronous I/O, O_SYNC must be
              used in addition to O_DIRECT.  See NOTES below for further
              discussion.

              A semantically similar (but deprecated) interface for block
              devices is described in raw(8).

       O_DIRECTORY
              If pathname is not a directory, cause the open to fail.  This
              flag was added in kernel version 2.1.126, to avoid denial-of-
              service problems if opendir(3) is called on a FIFO or tape
              device.

       O_DSYNC
              Write operations on the file will complete according to the
              requirements of synchronized I/O data integrity completion.

              By the time write(2) (and similar) return, the output data has
              been transferred to the underlying hardware, along with any
              file metadata that would be required to retrieve that data
              (i.e., as though each write(2) was followed by a call to
              fdatasync(2)).  See NOTES below.

       O_EXCL Ensure that this call creates the file: if this flag is
              specified in conjunction with O_CREAT, and pathname already
              exists, then open() will fail.

              When these two flags are specified, symbolic links are not
              followed: if pathname is a symbolic link, then open() fails
              regardless of where the symbolic link points to.

              In general, the behavior of O_EXCL is undefined if it is used
              without O_CREAT.  There is one exception: on Linux 2.6 and
              later, O_EXCL can be used without O_CREAT if pathname refers
              to a block device.  If the block device is in use by the
              system (e.g., mounted), open() fails with the error EBUSY.

              On NFS, O_EXCL is supported only when using NFSv3 or later on
              kernel 2.6 or later.  In NFS environments where O_EXCL support
              is not provided, programs that rely on it for performing
              locking tasks will contain a race condition.  Portable
              programs that want to perform atomic file locking using a
              lockfile, and need to avoid reliance on NFS support for
              O_EXCL, can create a unique file on the same filesystem (e.g.,
              incorporating hostname and PID), and use link(2) to make a
              link to the lockfile.  If link(2) returns 0, the lock is
              successful.  Otherwise, use stat(2) on the unique file to
              check if its link count has increased to 2, in which case the
              lock is also successful.
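The two approaches described for O_EXCL can be sketched as follows. Both helper names (errno_of_exclusive_create and lock_via_link) are hypothetical, and the second function is only an outline of the link(2) technique from the paragraph above, not a complete locking library (it does not, for instance, build the unique name from hostname and PID).

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if this call created path, otherwise
 * the errno from open() (EEXIST if the file was already there). */
int errno_of_exclusive_create(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, S_IRUSR | S_IWUSR);
    if (fd == -1)
        return errno;

    close(fd);
    return 0;
}

/* Hypothetical helper: tries to take lock_path without relying on
 * O_EXCL.  unique_path must be a name no other process uses.  Returns
 * 1 if the lock was obtained, 0 if not, and -1 on setup failure. */
int lock_via_link(const char *unique_path, const char *lock_path)
{
    assert(unique_path != NULL && lock_path != NULL);

    int fd = open(unique_path, O_WRONLY | O_CREAT, S_IRUSR | S_IWUSR);
    if (fd == -1)
        return -1;
    close(fd);

    int locked = 0;
    struct stat st;
    if (link(unique_path, lock_path) == 0)
        locked = 1;                         /* link() reported success */
    else if (stat(unique_path, &st) == 0 && st.st_nlink == 2)
        locked = 1;                         /* link was made after all */

    unlink(unique_path);   /* the lock, if held, lives on as lock_path */
    return locked;
}
```

The holder releases the lock by calling unlink(2) on lock_path.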

       O_LARGEFILE
              (LFS) Allow files whose sizes cannot be represented in an
              off_t (but can be represented in an off64_t) to be opened.
              The _LARGEFILE64_SOURCE macro must be defined (before
              including any header files) in order to obtain this
              definition.  Setting the _FILE_OFFSET_BITS feature test macro
              to 64 (rather than using O_LARGEFILE) is the preferred method
              of accessing large files on 32-bit systems (see
              feature_test_macros(7)).

       O_NOATIME (since Linux 2.6.8)
              Do not update the file last access time (st_atime in the
              inode) when the file is read(2).  This flag is intended for
              use by indexing or backup programs, where its use can
              significantly reduce the amount of disk activity.  This flag
              may not be effective on all filesystems.  One example is NFS,
              where the server maintains the access time.

       O_NOCTTY
              If pathname refers to a terminal device (see tty(4)), it will
              not become the process's controlling terminal even if the
              process does not have one.

       O_NOFOLLOW
              If pathname is a symbolic link, then the open fails.  This is
              a FreeBSD extension, which was added to Linux in version
              2.1.126.  Symbolic links in earlier components of the pathname
              will still be followed.  See also O_PATH below.

       O_NONBLOCK or O_NDELAY
              When possible, the file is opened in nonblocking mode.
              Neither the open() nor any subsequent operations on the file
              descriptor which is returned will cause the calling process to
              wait.

              Note that this flag has no effect for regular files and block
              devices; that is, I/O operations will (briefly) block when
              device activity is required, regardless of whether O_NONBLOCK
              is set.  Since O_NONBLOCK semantics might eventually be
              implemented, applications should not depend upon blocking
              behavior when specifying this flag for regular files and block
              devices.

              For the handling of FIFOs (named pipes), see also fifo(7).
              For a discussion of the effect of O_NONBLOCK in conjunction
              with mandatory file locks and with file leases, see fcntl(2).
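One place where O_NONBLOCK visibly changes open() itself is a FIFO. The sketch below relies on behavior documented in fifo(7), not in the text above: a read-only open of a FIFO normally waits for a writer, whereas a nonblocking read-only open returns at once. The helper name nonblocking_fifo_open is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: creates a FIFO at path and opens it read-only
 * with O_NONBLOCK while no writer exists.  Returns 1 if the open
 * succeeded and O_NONBLOCK shows up in the file status flags, 0 if the
 * flag is missing, and -1 on error.  The FIFO is removed afterwards. */
int nonblocking_fifo_open(const char *path)
{
    assert(path != NULL);

    unlink(path);
    if (mkfifo(path, S_IRUSR | S_IWUSR) == -1)
        return -1;

    int result = -1;
    int fd = open(path, O_RDONLY | O_NONBLOCK);  /* does not wait for a writer */
    if (fd != -1) {
        int status = fcntl(fd, F_GETFL);         /* file status flags */
        if (status != -1)
            result = (status & O_NONBLOCK) != 0;
        close(fd);
    }

    unlink(path);
    return result;
}
```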

       O_PATH (since Linux 2.6.39)
              Obtain a file descriptor that can be used for two purposes: to
              indicate a location in the filesystem tree and to perform
              operations that act purely at the file descriptor level.  The
              file itself is not opened, and other file operations (e.g.,
              read(2), write(2), fchmod(2), fchown(2), fgetxattr(2),
              mmap(2)) fail with the error EBADF.

              The following operations can be performed on the resulting
              file descriptor:

              *  close(2); fchdir(2) (since Linux 3.5); fstat(2) (since
                 Linux 3.6).

              *  Duplicating the file descriptor (dup(2), fcntl(2) F_DUPFD,
                 etc.).

              *  Getting and setting file descriptor flags (fcntl(2) F_GETFD
                 and F_SETFD).

              *  Retrieving open file status flags using the fcntl(2)
                 F_GETFL operation: the returned flags will include the bit
                 O_PATH.

              *  Passing the file descriptor as the dirfd argument of
                 openat(2) and the other "*at()" system calls.  This
                 includes linkat(2) with AT_EMPTY_PATH (or via procfs using
                 AT_SYMLINK_FOLLOW) even if the file is not a directory.

              *  Passing the file descriptor to another process via a UNIX
                 domain socket (see SCM_RIGHTS in unix(7)).

              When O_PATH is specified in flags, flag bits other than
              O_CLOEXEC, O_DIRECTORY, and O_NOFOLLOW are ignored.

              If pathname is a symbolic link and the O_NOFOLLOW flag is also
              specified, then the call returns a file descriptor referring
              to the symbolic link.  This file descriptor can be used as the
              dirfd argument in calls to fchownat(2), fstatat(2), linkat(2),
              and readlinkat(2) with an empty pathname to have the calls
              operate on the symbolic link.
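A minimal sketch of an O_PATH descriptor rejecting ordinary I/O. The helper name errno_of_read_on_path_fd is hypothetical. With glibc, O_PATH is only visible when _GNU_SOURCE is defined before the first header is included.

```c
#define _GNU_SOURCE             /* needed for O_PATH with glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: opens path with O_PATH and attempts a one-byte
 * read(2).  Returns the errno from read(), 0 if the read unexpectedly
 * succeeds, and -1 if the open fails. */
int errno_of_read_on_path_fd(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_PATH);
    if (fd == -1)
        return -1;

    char byte;
    errno = 0;
    ssize_t n = read(fd, &byte, 1);   /* the file itself is not opened */
    int err = (n == -1) ? errno : 0;

    close(fd);                        /* close(2) is permitted on O_PATH fds */
    return err;
}
```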

       O_SYNC Write operations on the file will complete according to the
              requirements of synchronized I/O file integrity completion (by
              contrast with the synchronized I/O data integrity completion
              provided by O_DSYNC).

              By the time write(2) (and similar) return, the output data and
              associated file metadata have been transferred to the
              underlying hardware (i.e., as though each write(2) was
              followed by a call to fsync(2)).  See NOTES below.

       O_TMPFILE (since Linux 3.11)
              Create an unnamed temporary file.  The pathname argument
              specifies a directory; an unnamed inode will be created in
              that directory's filesystem.  Anything written to the
              resulting file will be lost when the last file descriptor is
              closed, unless the file is given a name.

              O_TMPFILE must be specified with one of O_RDWR or O_WRONLY
              and, optionally, O_EXCL.  If O_EXCL is not specified, then
              linkat(2) can be used to link the temporary file into the
              filesystem, making it permanent, using code like the
              following:

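                  char path[PATH_MAX];
                  int fd;

                  /* Create an unnamed file on the filesystem of the directory */
                  fd = open("/path/to/dir", O_TMPFILE | O_RDWR,
                                          S_IRUSR | S_IWUSR);
                  if (fd == -1) {
                      /* Handle the error; errno is set */
                  }

                  /* File I/O on 'fd'... */

                  /* Give the file a name via its /proc/self/fd entry */
                  snprintf(path, PATH_MAX, "/proc/self/fd/%d", fd);
                  if (linkat(AT_FDCWD, path, AT_FDCWD, "/path/for/file",
                                          AT_SYMLINK_FOLLOW) == -1) {
                      /* Handle the error; the file is still unnamed */
                  }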

              In this case, the open() mode argument determines the file
              permission mode, as with O_CREAT.

              Specifying O_EXCL in conjunction with O_TMPFILE prevents a
              temporary file from being linked into the filesystem in the
              above manner.  (Note that the meaning of O_EXCL in this case
              is different from the meaning of O_EXCL otherwise.)

              There are two main use cases for O_TMPFILE:

              *  Improved tmpfile(3) functionality: race-free creation of
                 temporary files that (1) are automatically deleted when
                 closed; (2) can never be reached via any pathname; (3) are
                 not subject to symlink attacks; and (4) do not require the
                 caller to devise unique names.

              *  Creating a file that is initially invisible, which is then
                 populated with data and adjusted to have appropriate
                 filesystem attributes (fchown(2), fchmod(2), fsetxattr(2),
                 etc.)  before being atomically linked into the filesystem
                 in a fully formed state (using linkat(2) as described
                 above).

              O_TMPFILE requires support by the underlying filesystem; only
              a subset of Linux filesystems provide that support.  In the
              initial implementation, support was provided in the ext2,
              ext3, ext4, UDF, Minix, and shmem filesystems.  XFS support
              was added in Linux 3.15, and Btrfs support was added in Linux
              3.16.

       O_TRUNC
              If the file already exists and is a regular file and the
              access mode allows writing (i.e., is O_RDWR or O_WRONLY), it
              will be truncated to length 0.  If the file is a FIFO or
              terminal device file, the O_TRUNC flag is ignored.  Otherwise,
              the effect of O_TRUNC is unspecified.
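A small sketch of truncation on reopen, for the regular-file case. The helper name size_after_trunc_open is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: writes five bytes to path, reopens it with
 * O_WRONLY | O_TRUNC, and returns the size seen through the new
 * descriptor (or -1 on error).  The file is removed before returning. */
long size_after_trunc_open(const char *path)
{
    assert(path != NULL);

    /* Create the file with some content. */
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
    if (fd == -1)
        return -1;
    ssize_t written = write(fd, "hello", 5);
    close(fd);
    if (written != 5) {
        unlink(path);
        return -1;
    }

    /* Reopen for writing with O_TRUNC; the old content is discarded. */
    fd = open(path, O_WRONLY | O_TRUNC);
    if (fd == -1) {
        unlink(path);
        return -1;
    }

    struct stat st;
    long size = (fstat(fd, &st) == 0) ? (long)st.st_size : -1;

    close(fd);
    unlink(path);
    return size;
}
```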

   creat()
       A call to creat() is equivalent to calling open() with flags equal to
       O_CREAT|O_WRONLY|O_TRUNC.
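One observable consequence of this equivalence is that the descriptor returned by creat() is write-only. The sketch below checks the access mode with fcntl(2) F_GETFL. The helper name creat_access_mode is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: calls creat() on path and returns the access
 * mode bits (O_RDONLY, O_WRONLY, or O_RDWR) of the resulting
 * descriptor, or -1 on error.  The file is removed before returning. */
int creat_access_mode(const char *path)
{
    assert(path != NULL);

    int fd = creat(path, S_IRUSR | S_IWUSR);
    if (fd == -1)
        return -1;

    int status = fcntl(fd, F_GETFL);
    close(fd);
    unlink(path);
    if (status == -1)
        return -1;

    return status & O_ACCMODE;      /* mask out everything but access mode */
}
```

A caller that needs to read back what it wrote should therefore use open() with O_RDWR | O_CREAT | O_TRUNC instead of creat().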

   openat()
       The openat() system call operates in exactly the same way as open(),
       except for the differences described here.

       If the pathname given in pathname is relative, then it is interpreted
       relative to the directory referred to by the file descriptor dirfd
       (rather than relative to the current working directory of the calling
       process, as is done by open() for a relative pathname).

       If pathname is relative and dirfd is the special value AT_FDCWD, then
       pathname is interpreted relative to the current working directory of
       the calling process (like open()).

       If pathname is absolute, then dirfd is ignored.
http://man7.org/linux/man-pages/man2/open.2.html
12
SYSTEM CALL:
open(2) - Linux manual page
FUNCTIONALITY:

       open, openat, creat - open and possibly create a file
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/stat.h>
       #include <fcntl.h>

       int open(const char *pathname, int flags);
       int open(const char *pathname, int flags, mode_t mode);

       int creat(const char *pathname, mode_t mode);

       int openat(int dirfd, const char *pathname, int flags);
       int openat(int dirfd, const char *pathname, int flags, mode_t mode);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       openat():
           Since glibc 2.10:
               _POSIX_C_SOURCE >= 200809L
           Before glibc 2.10:
               _ATFILE_SOURCE
DESCRIPTION:

       Given a pathname for a file, open() returns a file descriptor, a
       small, nonnegative integer for use in subsequent system calls
       (read(2), write(2), lseek(2), fcntl(2), etc.).  The file descriptor
       returned by a successful call will be the lowest-numbered file
       descriptor not currently open for the process.

       By default, the new file descriptor is set to remain open across an
       execve(2) (i.e., the FD_CLOEXEC file descriptor flag described in
       fcntl(2) is initially disabled); the O_CLOEXEC flag, described below,
       can be used to change this default.  The file offset is set to the
       beginning of the file (see lseek(2)).

       A call to open() creates a new open file description, an entry in the
       system-wide table of open files.  The open file description records
       the file offset and the file status flags (see below).  A file
       descriptor is a reference to an open file description; this reference
       is unaffected if pathname is subsequently removed or modified to
       refer to a different file.  For further details on open file
       descriptions, see NOTES.

       The argument flags must include one of the following access modes:
       O_RDONLY, O_WRONLY, or O_RDWR.  These request opening the file read-
       only, write-only, or read/write, respectively.

       In addition, zero or more file creation flags and file status flags
       can be bitwise-or'd in flags.  The file creation flags are O_CLOEXEC,
       O_CREAT, O_DIRECTORY, O_EXCL, O_NOCTTY, O_NOFOLLOW, O_TMPFILE, and
       O_TRUNC.  The file status flags are all of the remaining flags listed
       below.  The distinction between these two groups of flags is that the
       file status flags can be retrieved and (in some cases) modified; see
       fcntl(2) for details.

       The full list of file creation flags and file status flags is as
       follows:

       O_APPEND
              The file is opened in append mode.  Before each write(2), the
              file offset is positioned at the end of the file, as if with
              lseek(2).  O_APPEND may lead to corrupted files on NFS
              filesystems if more than one process appends data to a file at
              once.  This is because NFS does not support appending to a
              file, so the client kernel has to simulate it, which can't be
              done without a race condition.

       O_ASYNC
              Enable signal-driven I/O: generate a signal (SIGIO by default,
              but this can be changed via fcntl(2)) when input or output
              becomes possible on this file descriptor.  This feature is
              available only for terminals, pseudoterminals, sockets, and
              (since Linux 2.6) pipes and FIFOs.  See fcntl(2) for further
              details.  See also BUGS, below.

       O_CLOEXEC (since Linux 2.6.23)
              Enable the close-on-exec flag for the new file descriptor.
              Specifying this flag permits a program to avoid additional
              fcntl(2) F_SETFD operations to set the FD_CLOEXEC flag.

              Note that the use of this flag is essential in some
              multithreaded programs, because using a separate fcntl(2)
              F_SETFD operation to set the FD_CLOEXEC flag does not suffice
              to avoid race conditions where one thread opens a file
              descriptor and attempts to set its close-on-exec flag using
              fcntl(2) at the same time as another thread does a fork(2)
              plus execve(2).  Depending on the order of execution, the race
              may lead to the file descriptor returned by open() being
              unintentionally leaked to the program executed by the child
              process created by fork(2).  (This kind of race is in
              principle possible for any system call that creates a file
              descriptor whose close-on-exec flag should be set, and various
              other Linux system calls provide an equivalent of the
              O_CLOEXEC flag to deal with this problem.)

       O_CREAT
              If the file does not exist, it will be created.  The owner
              (user ID) of the file is set to the effective user ID of the
              process.  The group ownership (group ID) is set either to the
              effective group ID of the process or to the group ID of the
              parent directory (depending on filesystem type and mount
              options, and the mode of the parent directory; see the mount
              options bsdgroups and sysvgroups described in mount(8)).

              The mode argument specifies the file mode bits to be applied when
              a new file is created.  This argument must be supplied when
              O_CREAT or O_TMPFILE is specified in flags; if neither O_CREAT
              nor O_TMPFILE is specified, then mode is ignored.  The
              effective mode is modified by the process's umask in the usual
              way: in the absence of a default ACL, the mode of the created
              file is (mode & ~umask).  Note that this mode applies only to
              future accesses of the newly created file; the open() call
              that creates a read-only file may well return a read/write
              file descriptor.

              The following symbolic constants are provided for mode:

              S_IRWXU  00700 user (file owner) has read, write, and execute
                       permission

              S_IRUSR  00400 user has read permission

              S_IWUSR  00200 user has write permission

              S_IXUSR  00100 user has execute permission

              S_IRWXG  00070 group has read, write, and execute permission

              S_IRGRP  00040 group has read permission

              S_IWGRP  00020 group has write permission

              S_IXGRP  00010 group has execute permission

              S_IRWXO  00007 others have read, write, and execute permission

              S_IROTH  00004 others have read permission

              S_IWOTH  00002 others have write permission

              S_IXOTH  00001 others have execute permission

              According to POSIX, the effect when other bits are set in mode
              is unspecified.  On Linux, the following bits are also honored
              in mode:

              S_ISUID  0004000 set-user-ID bit

              S_ISGID  0002000 set-group-ID bit (see stat(2))

              S_ISVTX  0001000 sticky bit (see stat(2))

       O_DIRECT (since Linux 2.4.10)
              Try to minimize cache effects of the I/O to and from this
              file.  In general this will degrade performance, but it is
              useful in special situations, such as when applications do
              their own caching.  File I/O is done directly to/from user-
              space buffers.  The O_DIRECT flag on its own makes an effort
              to transfer data synchronously, but does not give the
              guarantees of the O_SYNC flag that data and necessary metadata
              are transferred.  To guarantee synchronous I/O, O_SYNC must be
              used in addition to O_DIRECT.  See NOTES below for further
              discussion.

              A semantically similar (but deprecated) interface for block
              devices is described in raw(8).

       O_DIRECTORY
              If pathname is not a directory, cause the open to fail.  This
              flag was added in kernel version 2.1.126, to avoid denial-of-
              service problems if opendir(3) is called on a FIFO or tape
              device.

       O_DSYNC
              Write operations on the file will complete according to the
              requirements of synchronized I/O data integrity completion.

              By the time write(2) (and similar) return, the output data has
              been transferred to the underlying hardware, along with any
              file metadata that would be required to retrieve that data
              (i.e., as though each write(2) was followed by a call to
              fdatasync(2)).  See NOTES below.

       O_EXCL Ensure that this call creates the file: if this flag is
              specified in conjunction with O_CREAT, and pathname already
              exists, then open() will fail.

              When these two flags are specified, symbolic links are not
              followed: if pathname is a symbolic link, then open() fails
              regardless of where the symbolic link points to.

              In general, the behavior of O_EXCL is undefined if it is used
              without O_CREAT.  There is one exception: on Linux 2.6 and
              later, O_EXCL can be used without O_CREAT if pathname refers
              to a block device.  If the block device is in use by the
              system (e.g., mounted), open() fails with the error EBUSY.

              On NFS, O_EXCL is supported only when using NFSv3 or later on
              kernel 2.6 or later.  In NFS environments where O_EXCL support
              is not provided, programs that rely on it for performing
              locking tasks will contain a race condition.  Portable
              programs that want to perform atomic file locking using a
              lockfile, and need to avoid reliance on NFS support for
              O_EXCL, can create a unique file on the same filesystem (e.g.,
              incorporating hostname and PID), and use link(2) to make a
              link to the lockfile.  If link(2) returns 0, the lock is
              successful.  Otherwise, use stat(2) on the unique file to
              check if its link count has increased to 2, in which case the
              lock is also successful.

       O_LARGEFILE
              (LFS) Allow files whose sizes cannot be represented in an
              off_t (but can be represented in an off64_t) to be opened.
              The _LARGEFILE64_SOURCE macro must be defined (before
              including any header files) in order to obtain this
              definition.  Setting the _FILE_OFFSET_BITS feature test macro
              to 64 (rather than using O_LARGEFILE) is the preferred method
              of accessing large files on 32-bit systems (see
              feature_test_macros(7)).

       O_NOATIME (since Linux 2.6.8)
              Do not update the file last access time (st_atime in the
              inode) when the file is read(2).  This flag is intended for
              use by indexing or backup programs, where its use can
              significantly reduce the amount of disk activity.  This flag
              may not be effective on all filesystems.  One example is NFS,
              where the server maintains the access time.

       O_NOCTTY
              If pathname refers to a terminal device—see tty(4)—it will not
              become the process's controlling terminal even if the process
              does not have one.

       O_NOFOLLOW
              If pathname is a symbolic link, then the open fails.  This is
              a FreeBSD extension, which was added to Linux in version
              2.1.126.  Symbolic links in earlier components of the pathname
              will still be followed.  See also O_PATH below.

       O_NONBLOCK or O_NDELAY
              When possible, the file is opened in nonblocking mode.
              Neither the open() nor any subsequent operations on the file
              descriptor which is returned will cause the calling process to
              wait.

              Note that this flag has no effect for regular files and block
              devices; that is, I/O operations will (briefly) block when
              device activity is required, regardless of whether O_NONBLOCK
              is set.  Since O_NONBLOCK semantics might eventually be
              implemented, applications should not depend upon blocking
              behavior when specifying this flag for regular files and block
              devices.

              For the handling of FIFOs (named pipes), see also fifo(7).
              For a discussion of the effect of O_NONBLOCK in conjunction
              with mandatory file locks and with file leases, see fcntl(2).
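
              The following sketch shows nonblocking mode on a FIFO.  The
              helper names are hypothetical, and the FIFO semantics relied
              upon here (a read-only nonblocking open returns at once; a
              read with a writer present but no data fails with EAGAIN) come
              from fifo(7) and read(2), not from the text above.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create the FIFO 'path' if it does not exist yet, then open it in
 * nonblocking mode with the given access mode (O_RDONLY or O_WRONLY).
 * Returns a file descriptor, or -1 on error.
 */
int open_fifo_nonblock(const char *path, int access_mode)
{
    if (mkfifo(path, 0600) == -1 && errno != EEXIST)
        return -1;
    return open(path, access_mode | O_NONBLOCK);
}

/*
 * Read without waiting.  Returns the number of bytes read, 0 at end of
 * file, -2 if no data is available right now, or -1 on another error.
 */
ssize_t read_nowait(int fd, void *buf, size_t len)
{
    ssize_t n = read(fd, buf, len);
    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return -2;
    return n;
}
```

              Opening the read end first matters: per fifo(7), a write-only
              nonblocking open fails with ENXIO while no reader exists.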

       O_PATH (since Linux 2.6.39)
              Obtain a file descriptor that can be used for two purposes: to
              indicate a location in the filesystem tree and to perform
              operations that act purely at the file descriptor level.  The
              file itself is not opened, and other file operations (e.g.,
              read(2), write(2), fchmod(2), fchown(2), fgetxattr(2),
              mmap(2)) fail with the error EBADF.

              The following operations can be performed on the resulting
              file descriptor:

              *  close(2); fchdir(2) (since Linux 3.5); fstat(2) (since
                 Linux 3.6).

              *  Duplicating the file descriptor (dup(2), fcntl(2) F_DUPFD,
                 etc.).

              *  Getting and setting file descriptor flags (fcntl(2) F_GETFD
                 and F_SETFD).

              *  Retrieving open file status flags using the fcntl(2)
                 F_GETFL operation: the returned flags will include the bit
                 O_PATH.

              *  Passing the file descriptor as the dirfd argument of
                 openat(2) and the other "*at()" system calls.  This
                 includes linkat(2) with AT_EMPTY_PATH (or via procfs using
                 AT_SYMLINK_FOLLOW) even if the file is not a directory.

              *  Passing the file descriptor to another process via a UNIX
                 domain socket (see SCM_RIGHTS in unix(7)).

              When O_PATH is specified in flags, flag bits other than
              O_CLOEXEC, O_DIRECTORY, and O_NOFOLLOW are ignored.

              If pathname is a symbolic link and the O_NOFOLLOW flag is also
              specified, then the call returns a file descriptor referring
              to the symbolic link.  This file descriptor can be used as the
              dirfd argument in calls to fchownat(2), fstatat(2), linkat(2),
              and readlinkat(2) with an empty pathname to have the calls
              operate on the symbolic link.
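
              A short sketch of what an O_PATH descriptor can and cannot
              do.  The helper names are hypothetical.  With glibc, O_PATH
              is exposed only when _GNU_SOURCE is defined before the first
              header is included, which is why the example begins with that
              definition.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Obtain an O_PATH descriptor for 'path'; returns -1 on error. */
int open_path(const char *path)
{
    return open(path, O_PATH | O_CLOEXEC);
}

/* Return 1 if read(2) on 'fd' is refused with EBADF, otherwise 0. */
int read_is_refused(int fd)
{
    char c;
    return read(fd, &c, 1) == -1 && errno == EBADF;
}

/* Return 1 if fcntl(2) F_GETFL reports the O_PATH bit, otherwise 0. */
int reports_o_path(int fd)
{
    int fl = fcntl(fd, F_GETFL);
    return fl != -1 && (fl & O_PATH) != 0;
}
```

              A descriptor returned by open_path() for a directory can then
              be passed as dirfd, for example openat(fd, "name", O_RDONLY),
              and fstat(2) works on it on Linux 3.6 and later.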

       O_SYNC Write operations on the file will complete according to the
              requirements of synchronized I/O file integrity completion (by
              contrast with the synchronized I/O data integrity completion
              provided by O_DSYNC).

              By the time write(2) (and similar) return, the output data and
              associated file metadata have been transferred to the
              underlying hardware (i.e., as though each write(2) was
              followed by a call to fsync(2)).  See NOTES below.

       O_TMPFILE (since Linux 3.11)
              Create an unnamed temporary file.  The pathname argument
              specifies a directory; an unnamed inode will be created in
              that directory's filesystem.  Anything written to the
              resulting file will be lost when the last file descriptor is
              closed, unless the file is given a name.

              O_TMPFILE must be specified with one of O_RDWR or O_WRONLY
              and, optionally, O_EXCL.  If O_EXCL is not specified, then
              linkat(2) can be used to link the temporary file into the
              filesystem, making it permanent, using code like the
              following:

                  char path[PATH_MAX];
                  int fd = open("/path/to/dir", O_TMPFILE | O_RDWR,
                                S_IRUSR | S_IWUSR);

                  /* File I/O on 'fd'... */

                  /* Name the file via its /proc/self/fd entry */
                  snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
                  linkat(AT_FDCWD, path, AT_FDCWD, "/path/for/file",
                         AT_SYMLINK_FOLLOW);

              In this case, the open() mode argument determines the file
              permission mode, as with O_CREAT.

              Specifying O_EXCL in conjunction with O_TMPFILE prevents a
              temporary file from being linked into the filesystem in the
              above manner.  (Note that the meaning of O_EXCL in this case
              is different from the meaning of O_EXCL otherwise.)

              There are two main use cases for O_TMPFILE:

              *  Improved tmpfile(3) functionality: race-free creation of
                 temporary files that (1) are automatically deleted when
                 closed; (2) can never be reached via any pathname; (3) are
                 not subject to symlink attacks; and (4) do not require the
                 caller to devise unique names.

              *  Creating a file that is initially invisible, which is then
                 populated with data and adjusted to have appropriate
                 filesystem attributes (fchown(2), fchmod(2), fsetxattr(2),
                 etc.) before being atomically linked into the filesystem
                 in a fully formed state (using linkat(2) as described
                 above).

              O_TMPFILE requires support by the underlying filesystem; only
              a subset of Linux filesystems provide that support.  In the
              initial implementation, support was provided in the ext2,
              ext3, ext4, UDF, Minix, and shmem filesystems.  XFS support
              was added in Linux 3.15, and Btrfs support was added in Linux
              3.16.

       O_TRUNC
              If the file already exists and is a regular file and the
              access mode allows writing (i.e., is O_RDWR or O_WRONLY) it
              will be truncated to length 0.  If the file is a FIFO or
              terminal device file, the O_TRUNC flag is ignored.  Otherwise,
              the effect of O_TRUNC is unspecified.

   creat()
       A call to creat() is equivalent to calling open() with flags equal to
       O_CREAT|O_WRONLY|O_TRUNC.
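
       The equivalence can be written out directly.  The name
       creat_via_open() is hypothetical and exists only for this sketch.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Same effect as creat(path, mode): create the file if needed, open it
 * write-only, and truncate it to length 0 if it already exists.
 * Returns a file descriptor, or -1 on error.
 */
int creat_via_open(const char *path, mode_t mode)
{
    return open(path, O_CREAT | O_WRONLY | O_TRUNC, mode);
}
```

       Because the access mode is O_WRONLY, the returned descriptor cannot
       be used for reading; a program that needs to read back what it
       wrote would call open() with O_RDWR instead.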

   openat()
       The openat() system call operates in exactly the same way as open(),
       except for the differences described here.

       If the pathname given in pathname is relative, then it is interpreted
       relative to the directory referred to by the file descriptor dirfd
       (rather than relative to the current working directory of the calling
       process, as is done by open() for a relative pathname).

       If pathname is relative and dirfd is the special value AT_FDCWD, then
       pathname is interpreted relative to the current working directory of
       the calling process (like open()).

       If pathname is absolute, then dirfd is ignored.
http://man7.org/linux/man-pages/man2/openat.2.html
12
SYSTEM CALL:
open(2) - Linux manual page
FUNCTIONALITY:

       open, openat, creat - open and possibly create a file
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/stat.h>
       #include <fcntl.h>

       int open(const char *pathname, int flags);
       int open(const char *pathname, int flags, mode_t mode);

       int creat(const char *pathname, mode_t mode);

       int openat(int dirfd, const char *pathname, int flags);
       int openat(int dirfd, const char *pathname, int flags, mode_t mode);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       openat():
           Since glibc 2.10:
               _POSIX_C_SOURCE >= 200809L
           Before glibc 2.10:
               _ATFILE_SOURCE
DESCRIPTION

       Given a pathname for a file, open() returns a file descriptor, a
       small, nonnegative integer for use in subsequent system calls
       (read(2), write(2), lseek(2), fcntl(2), etc.).  The file descriptor
       returned by a successful call will be the lowest-numbered file
       descriptor not currently open for the process.
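
       The lowest-numbered rule can be observed with a short sketch.  The
       function name is hypothetical, and the result assumes that no other
       thread opens or closes descriptors while it runs.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Open 'path' twice, close the first descriptor, and open 'path' again.
 * Returns 1 if the third open() reuses the number that was just freed,
 * 0 if it does not, and -1 if any open() fails.
 */
int lowest_fd_is_reused(const char *path)
{
    int a = open(path, O_RDONLY);
    int b = open(path, O_RDONLY);
    if (a == -1 || b == -1) {
        if (a != -1)
            close(a);
        if (b != -1)
            close(b);
        return -1;
    }

    close(a);                   /* 'a' is now the lowest free number */
    int c = open(path, O_RDONLY);
    if (c == -1) {
        close(b);
        return -1;
    }

    int reused = (c == a);
    close(b);
    close(c);
    return reused;
}
```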

       By default, the new file descriptor is set to remain open across an
       execve(2) (i.e., the FD_CLOEXEC file descriptor flag described in
       fcntl(2) is initially disabled); the O_CLOEXEC flag, described below,
       can be used to change this default.  The file offset is set to the
       beginning of the file (see lseek(2)).

       A call to open() creates a new open file description, an entry in the
       system-wide table of open files.  The open file description records
       the file offset and the file status flags (see below).  A file
       descriptor is a reference to an open file description; this reference
       is unaffected if pathname is subsequently removed or modified to
       refer to a different file.  For further details on open file
       descriptions, see NOTES.

       The argument flags must include one of the following access modes:
       O_RDONLY, O_WRONLY, or O_RDWR.  These request opening the file read-
       only, write-only, or read/write, respectively.

       In addition, zero or more file creation flags and file status flags
       can be bitwise-or'd in flags.  The file creation flags are O_CLOEXEC,
       O_CREAT, O_DIRECTORY, O_EXCL, O_NOCTTY, O_NOFOLLOW, O_TMPFILE, and
       O_TRUNC.  The file status flags are all of the remaining flags listed
       below.  The distinction between these two groups of flags is that the
       file status flags can be retrieved and (in some cases) modified; see
       fcntl(2) for details.
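
       As a sketch of retrieving and modifying file status flags after the
       file is open (helper names are hypothetical; O_ACCMODE is the
       standard mask for extracting the access mode from the value
       returned by F_GETFL):

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Return the access mode of 'fd' (O_RDONLY, O_WRONLY, or O_RDWR),
 * or -1 on error.
 */
int access_mode_of(int fd)
{
    int fl = fcntl(fd, F_GETFL);
    return (fl == -1) ? -1 : (fl & O_ACCMODE);
}

/*
 * Turn on the O_APPEND file status flag for an already open
 * descriptor.  Returns 0 on success, -1 on error.
 */
int enable_append(int fd)
{
    int fl = fcntl(fd, F_GETFL);
    if (fl == -1)
        return -1;
    return (fcntl(fd, F_SETFL, fl | O_APPEND) == -1) ? -1 : 0;
}
```

       File creation flags such as O_CREAT or O_TRUNC affect only the
       open() call itself and cannot be changed this way.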

       The full list of file creation flags and file status flags is as
       follows:

       O_APPEND
              The file is opened in append mode.  Before each write(2), the
              file offset is positioned at the end of the file, as if with
              lseek(2).  O_APPEND may lead to corrupted files on NFS
              filesystems if more than one process appends data to a file at
              once.  This is because NFS does not support appending to a
              file, so the client kernel has to simulate it, which can't be
              done without a race condition.
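
              A minimal sketch of append mode on a local filesystem.  The
              function name append_record() and the 0644 mode are
              assumptions made for the example.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Append 'len' bytes from 'buf' to the file 'path', creating the file
 * if necessary.  With O_APPEND the kernel positions the offset at end
 * of file before each write(2), so no explicit lseek(2) is needed.
 * Returns 0 on success, -1 on error or short write.
 */
int append_record(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd == -1)
        return -1;

    ssize_t n = write(fd, buf, len);
    int rc = (n == (ssize_t) len) ? 0 : -1;

    if (close(fd) == -1)
        rc = -1;
    return rc;
}
```

              Two successive calls leave both records in the file, one
              after the other, regardless of where an earlier descriptor
              left its offset.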

       O_ASYNC
              Enable signal-driven I/O: generate a signal (SIGIO by default,
              but this can be changed via fcntl(2)) when input or output
              becomes possible on this file descriptor.  This feature is
              available only for terminals, pseudoterminals, sockets, and
              (since Linux 2.6) pipes and FIFOs.  See fcntl(2) for further
              details.  See also BUGS, below.

       O_CLOEXEC (since Linux 2.6.23)
              Enable the close-on-exec flag for the new file descriptor.
              Specifying this flag permits a program to avoid additional
              fcntl(2) F_SETFD operations to set the FD_CLOEXEC flag.

              Note that the use of this flag is essential in some
              multithreaded programs, because using a separate fcntl(2)
              F_SETFD operation to set the FD_CLOEXEC flag does not suffice
              to avoid race conditions where one thread opens a file
              descriptor and attempts to set its close-on-exec flag using
              fcntl(2) at the same time as another thread does a fork(2)
              plus execve(2).  Depending on the order of execution, the race
              may lead to the file descriptor returned by open() being
              unintentionally leaked to the program executed by the child
              process created by fork(2).  (This kind of race is in
              principle possible for any system call that creates a file
              descriptor whose close-on-exec flag should be set, and various
              other Linux system calls provide an equivalent of the
              O_CLOEXEC flag to deal with this problem.)

       O_CREAT
              If the file does not exist, it will be created.  The owner
              (user ID) of the file is set to the effective user ID of the
              process.  The group ownership (group ID) is set either to the
              effective group ID of the process or to the group ID of the
              parent directory (depending on filesystem type and mount
              options, and the mode of the parent directory; see the mount
              options bsdgroups and sysvgroups described in mount(8)).

              The mode argument specifies the file mode bits to be applied
              when a new file is created.  This argument must be supplied when
              O_CREAT or O_TMPFILE is specified in flags; if neither O_CREAT
              nor O_TMPFILE is specified, then mode is ignored.  The
              effective mode is modified by the process's umask in the usual
              way: in the absence of a default ACL, the mode of the created
              file is (mode & ~umask).  Note that this mode applies only to
              future accesses of the newly created file; the open() call
              that creates a read-only file may well return a read/write
              file descriptor.

              The following symbolic constants are provided for mode:

              S_IRWXU  00700 user (file owner) has read, write, and execute
                       permission

              S_IRUSR  00400 user has read permission

              S_IWUSR  00200 user has write permission

              S_IXUSR  00100 user has execute permission

              S_IRWXG  00070 group has read, write, and execute permission

              S_IRGRP  00040 group has read permission

              S_IWGRP  00020 group has write permission

              S_IXGRP  00010 group has execute permission

              S_IRWXO  00007 others have read, write, and execute permission

              S_IROTH  00004 others have read permission

              S_IWOTH  00002 others have write permission

              S_IXOTH  00001 others have execute permission

              According to POSIX, the effect when other bits are set in mode
              is unspecified.  On Linux, the following bits are also honored
              in mode:

              S_ISUID  0004000 set-user-ID bit

              S_ISGID  0002000 set-group-ID bit (see stat(2))

              S_ISVTX  0001000 sticky bit (see stat(2))
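
              The (mode & ~umask) rule can be checked with a sketch like
              the following.  The helper name create_with_umask() is
              hypothetical, the target directory is assumed to have no
              default ACL, and because umask(2) changes a process-wide
              setting the helper is not suitable for multithreaded
              programs.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create 'path' requesting 'mode' while 'mask' is the process umask,
 * and return the permission bits the new file actually received, or
 * (mode_t) -1 on error.  The previous umask is restored before return.
 */
mode_t create_with_umask(const char *path, mode_t mode, mode_t mask)
{
    mode_t old = umask(mask);
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, mode);
    umask(old);
    if (fd == -1)
        return (mode_t) -1;

    struct stat st;
    int rc = fstat(fd, &st);
    close(fd);
    if (rc == -1)
        return (mode_t) -1;

    return st.st_mode & 07777;      /* permission bits only */
}
```

              For example, requesting S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP
              | S_IROTH | S_IWOTH (0666) under a umask of 022 yields a file
              with mode 0644.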

       O_DIRECT (since Linux 2.4.10)
              Try to minimize cache effects of the I/O to and from this
              file.  In general this will degrade performance, but it is
              useful in special situations, such as when applications do
              their own caching.  File I/O is done directly to/from user-
              space buffers.  The O_DIRECT flag on its own makes an effort
              to transfer data synchronously, but does not give the
              guarantees of the O_SYNC flag that data and necessary metadata
              are transferred.  To guarantee synchronous I/O, O_SYNC must be
              used in addition to O_DIRECT.  See NOTES below for further
              discussion.

              A semantically similar (but deprecated) interface for block
              devices is described in raw(8).

       O_DIRECTORY
              If pathname is not a directory, cause the open to fail.  This
              flag was added in kernel version 2.1.126, to avoid denial-of-
              service problems if opendir(3) is called on a FIFO or tape
              device.

       O_DSYNC
              Write operations on the file will complete according to the
              requirements of synchronized I/O data integrity completion.

              By the time write(2) (and similar) return, the output data has
              been transferred to the underlying hardware, along with any
              file metadata that would be required to retrieve that data
              (i.e., as though each write(2) was followed by a call to
              fdatasync(2)).  See NOTES below.

       O_EXCL Ensure that this call creates the file: if this flag is
              specified in conjunction with O_CREAT, and pathname already
              exists, then open() will fail.

              When these two flags are specified, symbolic links are not
              followed: if pathname is a symbolic link, then open() fails
              regardless of where the symbolic link points to.
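
              A minimal sketch of exclusive creation.  The helper name
              create_exclusive() and the 0600 mode are assumptions; EEXIST
              is the error that open() reports when the pathname already
              exists, a detail not spelled out in the text above.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Create 'path' only if nothing exists there yet.
 * Returns 1 if this call created the file, 0 if the pathname already
 * existed (including as a symbolic link, dangling or not), and -1 on
 * any other error.
 */
int create_exclusive(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd == -1)
        return (errno == EEXIST) ? 0 : -1;
    close(fd);
    return 1;
}
```

              The existence check and the creation happen in one step, so
              two processes racing on the same pathname cannot both see a
              return value of 1.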

              In general, the behavior of O_EXCL is undefined if it is used
              without O_CREAT.  There is one exception: on Linux 2.6 and
              later, O_EXCL can be used without O_CREAT if pathname refers
              to a block device.  If the block device is in use by the
              system (e.g., mounted), open() fails with the error EBUSY.

              On NFS, O_EXCL is supported only when using NFSv3 or later on
              kernel 2.6 or later.  In NFS environments where O_EXCL support
              is not provided, programs that rely on it for performing
              locking tasks will contain a race condition.  Portable
              programs that want to perform atomic file locking using a
              lockfile, and need to avoid reliance on NFS support for
              O_EXCL, can create a unique file on the same filesystem (e.g.,
              incorporating hostname and PID), and use link(2) to make a
              link to the lockfile.  If link(2) returns 0, the lock is
              successful.  Otherwise, use stat(2) on the unique file to
              check if its link count has increased to 2, in which case the
              lock is also successful.

       O_LARGEFILE
              (LFS) Allow files whose sizes cannot be represented in an
              off_t (but can be represented in an off64_t) to be opened.
              The _LARGEFILE64_SOURCE macro must be defined (before
              including any header files) in order to obtain this
              definition.  Setting the _FILE_OFFSET_BITS feature test macro
              to 64 (rather than using O_LARGEFILE) is the preferred method
              of accessing large files on 32-bit systems (see
              feature_test_macros(7)).

       O_NOATIME (since Linux 2.6.8)
              Do not update the file last access time (st_atime in the
              inode) when the file is read(2).  This flag is intended for
              use by indexing or backup programs, where its use can
              significantly reduce the amount of disk activity.  This flag
              may not be effective on all filesystems.  One example is NFS,
              where the server maintains the access time.

       O_NOCTTY
              If pathname refers to a terminal device—see tty(4)—it will not
              become the process's controlling terminal even if the process
              does not have one.

       O_NOFOLLOW
              If pathname is a symbolic link, then the open fails.  This is
              a FreeBSD extension, which was added to Linux in version
              2.1.126.  Symbolic links in earlier components of the pathname
              will still be followed.  See also O_PATH below.

       O_NONBLOCK or O_NDELAY
              When possible, the file is opened in nonblocking mode.
              Neither the open() nor any subsequent operations on the file
              descriptor which is returned will cause the calling process to
              wait.

              Note that this flag has no effect for regular files and block
              devices; that is, I/O operations will (briefly) block when
              device activity is required, regardless of whether O_NONBLOCK
              is set.  Since O_NONBLOCK semantics might eventually be
              implemented, applications should not depend upon blocking
              behavior when specifying this flag for regular files and block
              devices.

              For the handling of FIFOs (named pipes), see also fifo(7).
              For a discussion of the effect of O_NONBLOCK in conjunction
              with mandatory file locks and with file leases, see fcntl(2).

       O_PATH (since Linux 2.6.39)
              Obtain a file descriptor that can be used for two purposes: to
              indicate a location in the filesystem tree and to perform
              operations that act purely at the file descriptor level.  The
              file itself is not opened, and other file operations (e.g.,
              read(2), write(2), fchmod(2), fchown(2), fgetxattr(2),
              mmap(2)) fail with the error EBADF.

              The following operations can be performed on the resulting
              file descriptor:

              *  close(2); fchdir(2) (since Linux 3.5); fstat(2) (since
                 Linux 3.6).

              *  Duplicating the file descriptor (dup(2), fcntl(2) F_DUPFD,
                 etc.).

              *  Getting and setting file descriptor flags (fcntl(2) F_GETFD
                 and F_SETFD).

              *  Retrieving open file status flags using the fcntl(2)
                 F_GETFL operation: the returned flags will include the bit
                 O_PATH.

              *  Passing the file descriptor as the dirfd argument of
                 openat(2) and the other "*at()" system calls.  This
                 includes linkat(2) with AT_EMPTY_PATH (or via procfs using
                 AT_SYMLINK_FOLLOW) even if the file is not a directory.

              *  Passing the file descriptor to another process via a UNIX
                 domain socket (see SCM_RIGHTS in unix(7)).

              When O_PATH is specified in flags, flag bits other than
              O_CLOEXEC, O_DIRECTORY, and O_NOFOLLOW are ignored.

              If pathname is a symbolic link and the O_NOFOLLOW flag is also
              specified, then the call returns a file descriptor referring
              to the symbolic link.  This file descriptor can be used as the
              dirfd argument in calls to fchownat(2), fstatat(2), linkat(2),
              and readlinkat(2) with an empty pathname to have the calls
              operate on the symbolic link.

       O_SYNC Write operations on the file will complete according to the
              requirements of synchronized I/O file integrity completion (by
              contrast with the synchronized I/O data integrity completion
              provided by O_DSYNC).

              By the time write(2) (and similar) return, the output data and
              associated file metadata have been transferred to the
              underlying hardware (i.e., as though each write(2) was
              followed by a call to fsync(2)).  See NOTES below.

       O_TMPFILE (since Linux 3.11)
              Create an unnamed temporary file.  The pathname argument
              specifies a directory; an unnamed inode will be created in
              that directory's filesystem.  Anything written to the
              resulting file will be lost when the last file descriptor is
              closed, unless the file is given a name.

              O_TMPFILE must be specified with one of O_RDWR or O_WRONLY
              and, optionally, O_EXCL.  If O_EXCL is not specified, then
              linkat(2) can be used to link the temporary file into the
              filesystem, making it permanent, using code like the
              following:

                  char path[PATH_MAX];
                  int fd = open("/path/to/dir", O_TMPFILE | O_RDWR,
                                S_IRUSR | S_IWUSR);

                  /* File I/O on 'fd'... */

                  /* Name the file via its /proc/self/fd entry */
                  snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
                  linkat(AT_FDCWD, path, AT_FDCWD, "/path/for/file",
                         AT_SYMLINK_FOLLOW);

              In this case, the open() mode argument determines the file
              permission mode, as with O_CREAT.

              Specifying O_EXCL in conjunction with O_TMPFILE prevents a
              temporary file from being linked into the filesystem in the
              above manner.  (Note that the meaning of O_EXCL in this case
              is different from the meaning of O_EXCL otherwise.)

              There are two main use cases for O_TMPFILE:

              *  Improved tmpfile(3) functionality: race-free creation of
                 temporary files that (1) are automatically deleted when
                 closed; (2) can never be reached via any pathname; (3) are
                 not subject to symlink attacks; and (4) do not require the
                 caller to devise unique names.

              *  Creating a file that is initially invisible, which is then
                 populated with data and adjusted to have appropriate
                 filesystem attributes (fchown(2), fchmod(2), fsetxattr(2),
                 etc.) before being atomically linked into the filesystem
                 in a fully formed state (using linkat(2) as described
                 above).

              O_TMPFILE requires support by the underlying filesystem; only
              a subset of Linux filesystems provide that support.  In the
              initial implementation, support was provided in the ext2,
              ext3, ext4, UDF, Minix, and shmem filesystems.  XFS support
              was added in Linux 3.15, and Btrfs support was added in Linux
              3.16.

       O_TRUNC
              If the file already exists and is a regular file and the
              access mode allows writing (i.e., is O_RDWR or O_WRONLY) it
              will be truncated to length 0.  If the file is a FIFO or
              terminal device file, the O_TRUNC flag is ignored.  Otherwise,
              the effect of O_TRUNC is unspecified.

   creat()
       A call to creat() is equivalent to calling open() with flags equal to
       O_CREAT|O_WRONLY|O_TRUNC.

   openat()
       The openat() system call operates in exactly the same way as open(),
       except for the differences described here.

       If the pathname given in pathname is relative, then it is interpreted
       relative to the directory referred to by the file descriptor dirfd
       (rather than relative to the current working directory of the calling
       process, as is done by open() for a relative pathname).

       If pathname is relative and dirfd is the special value AT_FDCWD, then
       pathname is interpreted relative to the current working directory of
       the calling process (like open()).

       If pathname is absolute, then dirfd is ignored.
http://man7.org/linux/man-pages/man2/name_to_handle_at.2.html
12
SYSTEM CALL:
open_by_handle_at(2) - Linux manual page
FUNCTIONALITY:

       name_to_handle_at, open_by_handle_at - obtain handle for a pathname
       and open file via a handle
SYNOPSIS:

       #define _GNU_SOURCE         /* See feature_test_macros(7) */
       #include <sys/types.h>
       #include <sys/stat.h>
       #include <fcntl.h>

       int name_to_handle_at(int dirfd, const char *pathname,
                             struct file_handle *handle,
                             int *mount_id, int flags);

       int open_by_handle_at(int mount_fd, struct file_handle *handle,
                             int flags);
DESCRIPTION

       The name_to_handle_at() and open_by_handle_at() system calls split
       the functionality of openat(2) into two parts: name_to_handle_at()
       returns an opaque handle that corresponds to a specified file;
       open_by_handle_at() opens the file corresponding to a handle returned
       by a previous call to name_to_handle_at() and returns an open file
       descriptor.

   name_to_handle_at()
       The name_to_handle_at() system call returns a file handle and a mount
       ID corresponding to the file specified by the dirfd and pathname
       arguments.  The file handle is returned via the argument handle,
       which is a pointer to a structure of the following form:

           struct file_handle {
               unsigned int  handle_bytes;   /* Size of f_handle [in, out] */
               int           handle_type;    /* Handle type [out] */
               unsigned char f_handle[0];    /* File identifier (sized by
                                                caller) [out] */
           };

       It is the caller's responsibility to allocate the structure with a
       size large enough to hold the handle returned in f_handle.  Before
       the call, the handle_bytes field should be initialized to contain the
       allocated size for f_handle.  (The constant MAX_HANDLE_SZ, defined in
       <fcntl.h>, specifies the maximum possible size for a file handle.)
       Upon successful return, the handle_bytes field is updated to contain
       the number of bytes actually written to f_handle.

       The caller can discover the required size for the file_handle
       structure by making a call in which handle->handle_bytes is zero; in
       this case, the call fails with the error EOVERFLOW and
       handle->handle_bytes is set to indicate the required size; the caller
       can then use this information to allocate a structure of the correct
       size (see EXAMPLE below).
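
       The two-call sizing protocol described above might be sketched as
       follows.  The helper name get_handle() is hypothetical, and the
       sketch passes 0 for flags.  Not every filesystem supports file
       handles, so callers should expect a NULL return with errno set.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>

/*
 * Obtain a file handle for 'pathname' (interpreted relative to 'dirfd').
 * Returns a malloc()ed struct file_handle that the caller must free(),
 * or NULL on error with errno set.  '*mount_id' receives the mount ID.
 */
struct file_handle *get_handle(int dirfd, const char *pathname,
                               int *mount_id)
{
    struct file_handle *fh = malloc(sizeof(*fh));
    if (fh == NULL)
        return NULL;

    /* Probe: with handle_bytes == 0 the call is expected to fail with
       EOVERFLOW and to store the required size in handle_bytes. */
    fh->handle_bytes = 0;
    if (name_to_handle_at(dirfd, pathname, fh, mount_id, 0) != -1) {
        free(fh);
        errno = EINVAL;             /* unexpected success */
        return NULL;
    }
    if (errno != EOVERFLOW) {
        int saved = errno;
        free(fh);
        errno = saved;
        return NULL;
    }

    /* Allocate room for f_handle and repeat the call. */
    unsigned int need = fh->handle_bytes;
    struct file_handle *full = realloc(fh, sizeof(*full) + need);
    if (full == NULL) {
        free(fh);
        errno = ENOMEM;
        return NULL;
    }
    full->handle_bytes = need;

    if (name_to_handle_at(dirfd, pathname, full, mount_id, 0) == -1) {
        int saved = errno;
        free(full);
        errno = saved;
        return NULL;
    }
    return full;
}
```

       A caller would typically invoke get_handle(AT_FDCWD, path,
       &mount_id) and later pass the result to open_by_handle_at().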

       Other than the use of the handle_bytes field, the caller should treat
       the file_handle structure as an opaque data type: the handle_type and
       f_handle fields are needed only by a subsequent call to
       open_by_handle_at().

       The flags argument is a bit mask constructed by ORing together zero
       or more of AT_EMPTY_PATH and AT_SYMLINK_FOLLOW, described below.

       Together, the pathname and dirfd arguments identify the file for
       which a handle is to be obtained.  There are four distinct cases:

       *  If pathname is a nonempty string containing an absolute pathname,
          then a handle is returned for the file referred to by that
          pathname.  In this case, dirfd is ignored.

       *  If pathname is a nonempty string containing a relative pathname
          and dirfd has the special value AT_FDCWD, then pathname is
          interpreted relative to the current working directory of the
          caller, and a handle is returned for the file to which it refers.

       *  If pathname is a nonempty string containing a relative pathname
          and dirfd is a file descriptor referring to a directory, then
          pathname is interpreted relative to the directory referred to by
          dirfd, and a handle is returned for the file to which it refers.
          (See openat(2) for an explanation of why "directory file
          descriptors" are useful.)

       *  If pathname is an empty string and flags specifies the value
          AT_EMPTY_PATH, then dirfd can be an open file descriptor referring
          to any type of file, or AT_FDCWD, meaning the current working
          directory, and a handle is returned for the file to which it
          refers.

       The mount_id argument returns an identifier for the filesystem mount
       that corresponds to pathname.  This corresponds to the first field in
       one of the records in /proc/self/mountinfo.  Opening the pathname in
       the fifth field of that record yields a file descriptor for the mount
       point; that file descriptor can be used in a subsequent call to
       open_by_handle_at().

       By default, name_to_handle_at() does not dereference pathname if it
       is a symbolic link, and thus returns a handle for the link itself.
       If AT_SYMLINK_FOLLOW is specified in flags, pathname is dereferenced
       if it is a symbolic link (so that the call returns a handle for the
       file referred to by the link).

   open_by_handle_at()
       The open_by_handle_at() system call opens the file referred to by
       handle, a file handle returned by a previous call to
       name_to_handle_at().

       The mount_fd argument is a file descriptor for any object (file,
       directory, etc.)  in the mounted filesystem with respect to which
       handle should be interpreted.  The special value AT_FDCWD can be
       specified, meaning the current working directory of the caller.

       The flags argument is as for open(2).  If handle refers to a symbolic
       link, the caller must specify the O_PATH flag, and the symbolic link
       is not dereferenced; the O_NOFOLLOW flag, if specified, is ignored.

       The caller must have the CAP_DAC_READ_SEARCH capability to invoke
       open_by_handle_at().
http://man7.org/linux/man-pages/man2/open_by_handle_at.2.html
12
SYSTEM CALL:
open_by_handle_at(2) - Linux manual page
FUNCTIONALITY:

       name_to_handle_at, open_by_handle_at - obtain handle for a pathname
       and open file via a handle
SYNOPSIS:

       #define _GNU_SOURCE         /* See feature_test_macros(7) */
       #include <sys/types.h>
       #include <sys/stat.h>
       #include <fcntl.h>

       int name_to_handle_at(int dirfd, const char *pathname,
                             struct file_handle *handle,
                             int *mount_id, int flags);

       int open_by_handle_at(int mount_fd, struct file_handle *handle,
                             int flags);
DESCRIPTION

       The name_to_handle_at() and open_by_handle_at() system calls split
       the functionality of openat(2) into two parts: name_to_handle_at()
       returns an opaque handle that corresponds to a specified file;
       open_by_handle_at() opens the file corresponding to a handle returned
       by a previous call to name_to_handle_at() and returns an open file
       descriptor.

   name_to_handle_at()
       The name_to_handle_at() system call returns a file handle and a mount
       ID corresponding to the file specified by the dirfd and pathname
       arguments.  The file handle is returned via the argument handle,
       which is a pointer to a structure of the following form:

           struct file_handle {
               unsigned int  handle_bytes;   /* Size of f_handle [in, out] */
               int           handle_type;    /* Handle type [out] */
               unsigned char f_handle[0];    /* File identifier (sized by
                                                caller) [out] */
           };

       It is the caller's responsibility to allocate the structure with a
       size large enough to hold the handle returned in f_handle.  Before
       the call, the handle_bytes field should be initialized to contain the
       allocated size for f_handle.  (The constant MAX_HANDLE_SZ, defined in
       <fcntl.h>, specifies the maximum possible size for a file handle.)
       Upon successful return, the handle_bytes field is updated to contain
       the number of bytes actually written to f_handle.

       The caller can discover the required size for the file_handle
       structure by making a call in which handle->handle_bytes is zero; in
       this case, the call fails with the error EOVERFLOW and
       handle->handle_bytes is set to indicate the required size; the caller
       can then use this information to allocate a structure of the correct
       size (see EXAMPLE below).

       Other than the use of the handle_bytes field, the caller should treat
       the file_handle structure as an opaque data type: the handle_type and
       f_handle fields are needed only by a subsequent call to
       open_by_handle_at().

       The flags argument is a bit mask constructed by ORing together zero
       or more of AT_EMPTY_PATH and AT_SYMLINK_FOLLOW, described below.

       Together, the pathname and dirfd arguments identify the file for
       which a handle is to be obtained.  There are four distinct cases:

       *  If pathname is a nonempty string containing an absolute pathname,
          then a handle is returned for the file referred to by that
          pathname.  In this case, dirfd is ignored.

       *  If pathname is a nonempty string containing a relative pathname
          and dirfd has the special value AT_FDCWD, then pathname is
          interpreted relative to the current working directory of the
          caller, and a handle is returned for the file to which it refers.

       *  If pathname is a nonempty string containing a relative pathname
          and dirfd is a file descriptor referring to a directory, then
          pathname is interpreted relative to the directory referred to by
          dirfd, and a handle is returned for the file to which it refers.
          (See openat(2) for an explanation of why "directory file
          descriptors" are useful.)

       *  If pathname is an empty string and flags specifies the value
          AT_EMPTY_PATH, then dirfd can be an open file descriptor referring
          to any type of file, or AT_FDCWD, meaning the current working
          directory, and a handle is returned for the file to which it
          refers.

       The mount_id argument returns an identifier for the filesystem mount
       that corresponds to pathname.  This corresponds to the first field in
       one of the records in /proc/self/mountinfo.  Opening the pathname in
       the fifth field of that record yields a file descriptor for the mount
       point; that file descriptor can be used in a subsequent call to
       open_by_handle_at().

       By default, name_to_handle_at() does not dereference pathname if it
       is a symbolic link, and thus returns a handle for the link itself.
       If AT_SYMLINK_FOLLOW is specified in flags, pathname is dereferenced
       if it is a symbolic link (so that the call returns a handle for the
       file referred to by the link).

   open_by_handle_at()
       The open_by_handle_at() system call opens the file referred to by
       handle, a file handle returned by a previous call to
       name_to_handle_at().

       The mount_fd argument is a file descriptor for any object (file,
       directory, etc.)  in the mounted filesystem with respect to which
       handle should be interpreted.  The special value AT_FDCWD can be
       specified, meaning the current working directory of the caller.

       The flags argument is as for open(2).  If handle refers to a symbolic
       link, the caller must specify the O_PATH flag, and the symbolic link
       is not dereferenced; the O_NOFOLLOW flag, if specified, is ignored.

       The caller must have the CAP_DAC_READ_SEARCH capability to invoke
       open_by_handle_at().
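
       The two calls can be combined as in the following sketch.  The helper
       name open_via_handle() is hypothetical.  For brevity the caller
       supplies mount_fd directly instead of looking up the returned mount
       ID in /proc/self/mountinfo, and the buffer is sized with
       MAX_HANDLE_SZ rather than probed.  Without CAP_DAC_READ_SEARCH the
       second call is expected to fail (typically with EPERM).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>

/* Hypothetical helper: obtain a handle for pathname, then open the file
   through that handle.  mount_fd must refer to an object on the same
   mounted filesystem as pathname.  Returns a file descriptor, or -1
   with errno set. */
int open_via_handle(int mount_fd, const char *pathname, int flags)
{
    struct file_handle *fh;
    int mount_id, fd, saved;

    assert(pathname != NULL);

    /* Room for the largest possible handle, so no probing is needed. */
    fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
    if (fh == NULL)
        return -1;
    fh->handle_bytes = MAX_HANDLE_SZ;

    if (name_to_handle_at(AT_FDCWD, pathname, fh, &mount_id, 0) == -1)
        fd = -1;
    else
        fd = open_by_handle_at(mount_fd, fh, flags);

    /* mount_id is unused in this sketch; a real program would match it
       against the first field of /proc/self/mountinfo. */
    saved = errno;
    free(fh);
    errno = saved;
    return fd;
}
```

       A caller might pass a descriptor for the mount point as mount_fd and
       the usual open(2) flags (for example, O_RDONLY) as flags.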
http://man7.org/linux/man-pages/man2/memfd_create.2.html
12
SYSTEM CALL:
memfd_create(2) - Linux manual page
FUNCTIONALITY:

       memfd_create - create an anonymous file
SYNOPSIS:

       #include <sys/memfd.h>

       int memfd_create(const char *name, unsigned int flags);

       Note: There is no glibc wrapper for this system call; see NOTES.
DESCRIPTION

       memfd_create() creates an anonymous file and returns a file
       descriptor that refers to it.  The file behaves like a regular file,
       and so can be modified, truncated, memory-mapped, and so on.
       However, unlike a regular file, it lives in RAM and has a volatile
       backing storage.  Once all references to the file are dropped, it is
       automatically released.  Anonymous memory is used for all backing
       pages of the file.  Therefore, files created by memfd_create() have
       the same semantics as other anonymous memory allocations such as
       those allocated using mmap(2) with the MAP_ANONYMOUS flag.

       The initial size of the file is set to 0.  Following the call, the
       file size should be set using ftruncate(2).  (Alternatively, the file
       may be populated by calls to write(2) or similar.)
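
       As an illustration, the sketch below creates an anonymous file and
       sizes it with ftruncate(2).  Because this page notes that glibc
       provides no wrapper, the call goes through syscall(2); the helper
       name make_memfd() is hypothetical, and the sketch assumes kernel
       headers that provide <linux/memfd.h> and SYS_memfd_create.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/memfd.h>    /* MFD_CLOEXEC */
#include <sys/syscall.h>    /* SYS_memfd_create */
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create an anonymous file of the given size.
   Returns a file descriptor, or -1 with errno set. */
int make_memfd(const char *name, off_t size)
{
    int fd, saved;

    assert(name != NULL && size >= 0);

    /* The name is used only for debugging (shown in /proc/self/fd). */
    fd = (int) syscall(SYS_memfd_create, name, MFD_CLOEXEC);
    if (fd == -1)
        return -1;

    /* The file starts out empty; give it its working size. */
    if (ftruncate(fd, size) == -1) {
        saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
```

       The returned descriptor can then be used with read(2), write(2), or
       mmap(2) like a descriptor for a regular file.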

       The name supplied in name is used as a filename and will be displayed
       as the target of the corresponding symbolic link in the directory
       /proc/self/fd/.  The displayed name is always prefixed with memfd:
       and serves only for debugging purposes.  Names do not affect the
       behavior of the file descriptor, and as such multiple files can have
       the same name without any side effects.

       The following values may be bitwise ORed in flags to change the
       behavior of memfd_create():

       MFD_CLOEXEC
              Set the close-on-exec (FD_CLOEXEC) flag on the new file
              descriptor.  See the description of the O_CLOEXEC flag in
              open(2) for reasons why this may be useful.

       MFD_ALLOW_SEALING
              Allow sealing operations on this file.  See the discussion of
              the F_ADD_SEALS and F_GET_SEALS operations in fcntl(2), and
              also NOTES, below.  The initial set of seals is empty.  If
              this flag is not set, the initial set of seals will be
              F_SEAL_SEAL, meaning that no other seals can be set on the
              file.

       Unused bits in flags must be 0.

       As its return value, memfd_create() returns a new file descriptor
       that can be used to refer to the file.  This file descriptor is
       opened for both reading and writing (O_RDWR) and O_LARGEFILE is set
       for the file descriptor.

       With respect to fork(2) and execve(2), the usual semantics apply for
       the file descriptor created by memfd_create().  A copy of the file
       descriptor is inherited by the child produced by fork(2) and refers
       to the same file.  The file descriptor is preserved across execve(2),
       unless the close-on-exec flag has been set.
http://man7.org/linux/man-pages/man2/mknod.2.html
11
SYSTEM CALL:
mknod(2) - Linux manual page
FUNCTIONALITY:

       mknod, mknodat - create a special or ordinary file
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/stat.h>
       #include <fcntl.h>
       #include <unistd.h>

       int mknod(const char *pathname, mode_t mode, dev_t dev);

       #include <fcntl.h>           /* Definition of AT_* constants */
       #include <sys/stat.h>

       int mknodat(int dirfd, const char *pathname, mode_t mode, dev_t dev);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       mknod():
           _XOPEN_SOURCE >= 500
               || /* Since glibc 2.19: */ _DEFAULT_SOURCE
               || /* Glibc versions <= 2.19: */ _BSD_SOURCE || _SVID_SOURCE
DESCRIPTION

       The system call mknod() creates a filesystem node (file, device
       special file, or named pipe) named pathname, with attributes
       specified by mode and dev.

       The mode argument specifies both the file mode to use and the type of
       node to be created.  It should be a combination (using bitwise OR) of
       one of the file types listed below and zero or more of the file mode
       bits listed in stat(2).

       The file mode is modified by the process's umask in the usual way: in
       the absence of a default ACL, the permissions of the created node are
       (mode & ~umask).

       The file type must be one of S_IFREG, S_IFCHR, S_IFBLK, S_IFIFO, or
       S_IFSOCK to specify a regular file (which will be created empty),
       character special file, block special file, FIFO (named pipe), or
       UNIX domain socket, respectively.  (A file type of zero is equivalent
       to S_IFREG.)

       If the file type is S_IFCHR or S_IFBLK, then dev specifies the major
       and minor numbers of the newly created device special file
       (makedev(3) may be useful to build the value for dev); otherwise it
       is ignored.

       If pathname already exists, or is a symbolic link, this call fails
       with an EEXIST error.
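
       A minimal sketch of creating a FIFO with mknod() and handling the
       EEXIST case follows.  The helper name ensure_fifo() and its policy
       (an existing FIFO counts as success, any other existing node as
       failure) are assumptions made for illustration.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: make sure a FIFO exists at path.
   Returns 0 on success, -1 with errno set otherwise. */
int ensure_fifo(const char *path, mode_t perms)
{
    struct stat st;

    assert(path != NULL);

    /* File type ORed with permission bits; dev is ignored for FIFOs. */
    if (mknod(path, S_IFIFO | perms, 0) == 0)
        return 0;
    if (errno != EEXIST)
        return -1;

    /* Something is already there: accept it only if it is a FIFO. */
    if (lstat(path, &st) == -1)
        return -1;
    if (!S_ISFIFO(st.st_mode)) {
        errno = EEXIST;
        return -1;
    }
    return 0;
}
```

       The permissions actually applied are (perms & ~umask), as described
       above.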

       The newly created node will be owned by the effective user ID of the
       process.  If the directory containing the node has the set-group-ID
       bit set, or if the filesystem is mounted with BSD group semantics,
       the new node will inherit the group ownership from its parent
       directory; otherwise it will be owned by the effective group ID of
       the process.

   mknodat()
       The mknodat() system call operates in exactly the same way as
       mknod(2), except for the differences described here.

       If the pathname given in pathname is relative, then it is interpreted
       relative to the directory referred to by the file descriptor dirfd
       (rather than relative to the current working directory of the calling
       process, as is done by mknod(2) for a relative pathname).

       If pathname is relative and dirfd is the special value AT_FDCWD, then
       pathname is interpreted relative to the current working directory of
       the calling process (like mknod(2)).

       If pathname is absolute, then dirfd is ignored.

       See openat(2) for an explanation of the need for mknodat().
http://man7.org/linux/man-pages/man2/mknodat.2.html
11
SYSTEM CALL:
mknod(2) - Linux manual page
FUNCTIONALITY:

       mknod, mknodat - create a special or ordinary file
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/stat.h>
       #include <fcntl.h>
       #include <unistd.h>

       int mknod(const char *pathname, mode_t mode, dev_t dev);

       #include <fcntl.h>           /* Definition of AT_* constants */
       #include <sys/stat.h>

       int mknodat(int dirfd, const char *pathname, mode_t mode, dev_t dev);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       mknod():
           _XOPEN_SOURCE >= 500
               || /* Since glibc 2.19: */ _DEFAULT_SOURCE
               || /* Glibc versions <= 2.19: */ _BSD_SOURCE || _SVID_SOURCE
DESCRIPTION

       The system call mknod() creates a filesystem node (file, device
       special file, or named pipe) named pathname, with attributes
       specified by mode and dev.

       The mode argument specifies both the file mode to use and the type of
       node to be created.  It should be a combination (using bitwise OR) of
       one of the file types listed below and zero or more of the file mode
       bits listed in stat(2).

       The file mode is modified by the process's umask in the usual way: in
       the absence of a default ACL, the permissions of the created node are
       (mode & ~umask).

       The file type must be one of S_IFREG, S_IFCHR, S_IFBLK, S_IFIFO, or
       S_IFSOCK to specify a regular file (which will be created empty),
       character special file, block special file, FIFO (named pipe), or
       UNIX domain socket, respectively.  (A file type of zero is equivalent
       to S_IFREG.)

       If the file type is S_IFCHR or S_IFBLK, then dev specifies the major
       and minor numbers of the newly created device special file
       (makedev(3) may be useful to build the value for dev); otherwise it
       is ignored.

       If pathname already exists, or is a symbolic link, this call fails
       with an EEXIST error.

       The newly created node will be owned by the effective user ID of the
       process.  If the directory containing the node has the set-group-ID
       bit set, or if the filesystem is mounted with BSD group semantics,
       the new node will inherit the group ownership from its parent
       directory; otherwise it will be owned by the effective group ID of
       the process.

   mknodat()
       The mknodat() system call operates in exactly the same way as
       mknod(2), except for the differences described here.

       If the pathname given in pathname is relative, then it is interpreted
       relative to the directory referred to by the file descriptor dirfd
       (rather than relative to the current working directory of the calling
       process, as is done by mknod(2) for a relative pathname).

       If pathname is relative and dirfd is the special value AT_FDCWD, then
       pathname is interpreted relative to the current working directory of
       the calling process (like mknod(2)).

       If pathname is absolute, then dirfd is ignored.

       See openat(2) for an explanation of the need for mknodat().
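
       The dirfd-relative form might be used as in the sketch below, which
       creates an empty regular file inside a directory that is opened
       first.  The helper name create_empty_in_dir() is hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create an empty regular file called name inside
   the directory dirpath.  Returns 0, or -1 with errno set. */
int create_empty_in_dir(const char *dirpath, const char *name)
{
    int dirfd, rc, saved;

    assert(dirpath != NULL && name != NULL);

    dirfd = open(dirpath, O_RDONLY | O_DIRECTORY);
    if (dirfd == -1)
        return -1;

    /* name is relative, so it is resolved against dirfd rather than
       against the current working directory. */
    rc = mknodat(dirfd, name, S_IFREG | 0644, 0);

    saved = errno;
    close(dirfd);
    errno = saved;
    return rc;
}
```

       Holding the directory open means the node is created in that
       directory even if the directory is renamed in the meantime.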
http://man7.org/linux/man-pages/man2/rename.2.html
12
SYSTEM CALL:
rename(2) - Linux manual page
FUNCTIONALITY:

       rename, renameat, renameat2 - change the name or location of a file
SYNOPSIS:

       #include <stdio.h>

       int rename(const char *oldpath, const char *newpath);

       #include <fcntl.h>           /* Definition of AT_* constants */
       #include <stdio.h>

       int renameat(int olddirfd, const char *oldpath,
                    int newdirfd, const char *newpath);

       int renameat2(int olddirfd, const char *oldpath,
                     int newdirfd, const char *newpath, unsigned int flags);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       renameat():
           Since glibc 2.10:
               _POSIX_C_SOURCE >= 200809L
           Before glibc 2.10:
               _ATFILE_SOURCE
DESCRIPTION

       rename() renames a file, moving it between directories if required.
       Any other hard links to the file (as created using link(2)) are
       unaffected.  Open file descriptors for oldpath are also unaffected.

       If newpath already exists, it will be atomically replaced (subject to
       a few conditions; see ERRORS below), so that there is no point at
       which another process attempting to access newpath will find it
       missing.
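
       This guarantee is commonly used to update a file in one step: write
       the new contents to a temporary file, then rename it over the target.
       The following is a minimal sketch; the helper name replace_file() is
       hypothetical, and tmp_path is assumed to be on the same filesystem as
       path (otherwise rename() fails with EXDEV).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: replace the contents of path with data by
   writing tmp_path and renaming it over path.
   Returns 0, or -1 with errno set. */
int replace_file(const char *path, const char *tmp_path,
                 const void *data, size_t len)
{
    int fd, saved;
    ssize_t n;

    assert(path != NULL && tmp_path != NULL && data != NULL);

    fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1)
        return -1;

    /* For brevity, a short write is treated as a failure (EIO). */
    n = write(fd, data, len);
    if (n != (ssize_t) len) {
        if (n >= 0)
            errno = EIO;
        goto fail;
    }
    if (fsync(fd) == -1)        /* Flush data before it becomes visible */
        goto fail;
    if (close(fd) == -1) {
        fd = -1;
        goto fail;
    }
    fd = -1;

    /* Readers of path see either the old or the new file, never none. */
    if (rename(tmp_path, path) == -1)
        goto fail;
    return 0;

fail:
    saved = errno;
    if (fd != -1)
        close(fd);
    unlink(tmp_path);
    errno = saved;
    return -1;
}
```

       On failure the temporary file is removed and any existing file at
       path is left in place.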

       If oldpath and newpath are existing hard links referring to the same
       file, then rename() does nothing, and returns a success status.

       If newpath exists but the operation fails for some reason, rename()
       guarantees to leave an instance of newpath in place.

       oldpath can specify a directory.  In this case, newpath must either
       not exist, or it must specify an empty directory.

       When an existing newpath is overwritten, there will probably be a
       window in which both oldpath and newpath refer to the file being
       renamed.

       If oldpath refers to a symbolic link, the link is renamed; if newpath
       refers to a symbolic link, the link will be overwritten.

   renameat()
       The renameat() system call operates in exactly the same way as
       rename(), except for the differences described here.

       If the pathname given in oldpath is relative, then it is interpreted
       relative to the directory referred to by the file descriptor olddirfd
       (rather than relative to the current working directory of the calling
       process, as is done by rename() for a relative pathname).

       If oldpath is relative and olddirfd is the special value AT_FDCWD,
       then oldpath is interpreted relative to the current working directory
       of the calling process (like rename()).

       If oldpath is absolute, then olddirfd is ignored.

       The interpretation of newpath is as for oldpath, except that a
       relative pathname is interpreted relative to the directory referred
       to by the file descriptor newdirfd.

       See openat(2) for an explanation of the need for renameat().

   renameat2()
       renameat2() has an additional flags argument.  A renameat2() call
       with a zero flags argument is equivalent to renameat().

       The flags argument is a bit mask consisting of zero or more of the
       following flags:

       RENAME_EXCHANGE
              Atomically exchange oldpath and newpath.  Both pathnames must
              exist but may be of different types (e.g., one could be a
              non-empty directory and the other a symbolic link).

       RENAME_NOREPLACE
              Don't overwrite newpath.  Fail with the error EEXIST if
              newpath already exists.

              RENAME_NOREPLACE can't be employed together with
              RENAME_EXCHANGE.

       RENAME_WHITEOUT (since Linux 3.18)
              This operation makes sense only for overlay/union filesystem
              implementations.

              Specifying RENAME_WHITEOUT creates a "whiteout" object at the
              source of the rename at the same time as performing the
              rename.  The whole operation is atomic, so that if the rename
              succeeds then the whiteout will also have been created.

              A "whiteout" is an object that has special meaning in
              union/overlay filesystem constructs.  In these constructs,
              multiple layers exist and only the top one is ever modified.
              A whiteout on an upper layer will effectively hide a matching
              file in the lower layer, making it appear as if the file
              didn't exist.

              When a file that exists on the lower layer is renamed, the
              file is first copied up (if not already on the upper layer)
              and then renamed on the upper, read-write layer.  At the same
              time, the source file needs to be "whiteouted" (so that the
              version of the source file in the lower layer is rendered
              invisible).  The whole operation needs to be done atomically.

              When not part of a union/overlay, the whiteout appears as a
              character device with a {0,0} device number.

              RENAME_WHITEOUT requires the same privileges as creating a
              device node (i.e., the CAP_MKNOD capability).

              RENAME_WHITEOUT can't be employed together with
              RENAME_EXCHANGE.

              RENAME_WHITEOUT requires support from the underlying
              filesystem.  Among the filesystems that provide that support
              are shmem (since Linux 3.18), ext4 (since Linux 3.18), and XFS
              (since Linux 4.1).
http://man7.org/linux/man-pages/man2/renameat.2.html
12
SYSTEM CALL:
rename(2) - Linux manual page
FUNCTIONALITY:

       rename, renameat, renameat2 - change the name or location of a file
SYNOPSIS:

       #include <stdio.h>

       int rename(const char *oldpath, const char *newpath);

       #include <fcntl.h>           /* Definition of AT_* constants */
       #include <stdio.h>

       int renameat(int olddirfd, const char *oldpath,
                    int newdirfd, const char *newpath);

       int renameat2(int olddirfd, const char *oldpath,
                     int newdirfd, const char *newpath, unsigned int flags);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       renameat():
           Since glibc 2.10:
               _POSIX_C_SOURCE >= 200809L
           Before glibc 2.10:
               _ATFILE_SOURCE
DESCRIPTION

       rename() renames a file, moving it between directories if required.
       Any other hard links to the file (as created using link(2)) are
       unaffected.  Open file descriptors for oldpath are also unaffected.

       If newpath already exists, it will be atomically replaced (subject to
       a few conditions; see ERRORS below), so that there is no point at
       which another process attempting to access newpath will find it
       missing.

       If oldpath and newpath are existing hard links referring to the same
       file, then rename() does nothing, and returns a success status.

       If newpath exists but the operation fails for some reason, rename()
       guarantees to leave an instance of newpath in place.

       oldpath can specify a directory.  In this case, newpath must either
       not exist, or it must specify an empty directory.

       When an existing newpath is overwritten, there will probably be a
       window in which both oldpath and newpath refer to the file being
       renamed.

       If oldpath refers to a symbolic link, the link is renamed; if newpath
       refers to a symbolic link, the link will be overwritten.

   renameat()
       The renameat() system call operates in exactly the same way as
       rename(), except for the differences described here.

       If the pathname given in oldpath is relative, then it is interpreted
       relative to the directory referred to by the file descriptor olddirfd
       (rather than relative to the current working directory of the calling
       process, as is done by rename() for a relative pathname).

       If oldpath is relative and olddirfd is the special value AT_FDCWD,
       then oldpath is interpreted relative to the current working directory
       of the calling process (like rename()).

       If oldpath is absolute, then olddirfd is ignored.

       The interpretation of newpath is as for oldpath, except that a
       relative pathname is interpreted relative to the directory referred
       to by the file descriptor newdirfd.

       See openat(2) for an explanation of the need for renameat().
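
       A sketch of a rename between two directories identified by file
       descriptors is shown below.  The helper name rename_in_dirs() is
       hypothetical; it opens both directories itself so that the example
       is self-contained.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: rename srcdir/oldname to dstdir/newname using
   directory file descriptors.  Returns 0, or -1 with errno set. */
int rename_in_dirs(const char *srcdir, const char *oldname,
                   const char *dstdir, const char *newname)
{
    int srcfd, dstfd, rc, saved;

    assert(srcdir != NULL && oldname != NULL);
    assert(dstdir != NULL && newname != NULL);

    srcfd = open(srcdir, O_RDONLY | O_DIRECTORY);
    if (srcfd == -1)
        return -1;
    dstfd = open(dstdir, O_RDONLY | O_DIRECTORY);
    if (dstfd == -1) {
        saved = errno;
        close(srcfd);
        errno = saved;
        return -1;
    }

    /* Relative names are resolved against srcfd and dstfd. */
    rc = renameat(srcfd, oldname, dstfd, newname);

    saved = errno;
    close(srcfd);
    close(dstfd);
    errno = saved;
    return rc;
}
```

       A long-running program would more typically keep the directory
       descriptors open and reuse them across calls.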

   renameat2()
       renameat2() has an additional flags argument.  A renameat2() call
       with a zero flags argument is equivalent to renameat().

       The flags argument is a bit mask consisting of zero or more of the
       following flags:

       RENAME_EXCHANGE
              Atomically exchange oldpath and newpath.  Both pathnames must
              exist but may be of different types (e.g., one could be a
              non-empty directory and the other a symbolic link).

       RENAME_NOREPLACE
              Don't overwrite newpath.  Fail with the error EEXIST if
              newpath already exists.

              RENAME_NOREPLACE can't be employed together with
              RENAME_EXCHANGE.

       RENAME_WHITEOUT (since Linux 3.18)
              This operation makes sense only for overlay/union filesystem
              implementations.

              Specifying RENAME_WHITEOUT creates a "whiteout" object at the
              source of the rename at the same time as performing the
              rename.  The whole operation is atomic, so that if the rename
              succeeds then the whiteout will also have been created.

              A "whiteout" is an object that has special meaning in
              union/overlay filesystem constructs.  In these constructs,
              multiple layers exist and only the top one is ever modified.
              A whiteout on an upper layer will effectively hide a matching
              file in the lower layer, making it appear as if the file
              didn't exist.

              When a file that exists on the lower layer is renamed, the
              file is first copied up (if not already on the upper layer)
              and then renamed on the upper, read-write layer.  At the same
              time, the source file needs to be "whiteouted" (so that the
              version of the source file in the lower layer is rendered
              invisible).  The whole operation needs to be done atomically.

              When not part of a union/overlay, the whiteout appears as a
              character device with a {0,0} device number.

              RENAME_WHITEOUT requires the same privileges as creating a
              device node (i.e., the CAP_MKNOD capability).

              RENAME_WHITEOUT can't be employed together with
              RENAME_EXCHANGE.

              RENAME_WHITEOUT requires support from the underlying
              filesystem.  Among the filesystems that provide that support
              are shmem (since Linux 3.18), ext4 (since Linux 3.18), and XFS
              (since Linux 4.1).
http://man7.org/linux/man-pages/man2/renameat2.2.html
12
SYSTEM CALL:
rename(2) - Linux manual page
FUNCTIONALITY:

       rename, renameat, renameat2 - change the name or location of a file
SYNOPSIS:

       #include <stdio.h>

       int rename(const char *oldpath, const char *newpath);

       #include <fcntl.h>           /* Definition of AT_* constants */
       #include <stdio.h>

       int renameat(int olddirfd, const char *oldpath,
                    int newdirfd, const char *newpath);

       int renameat2(int olddirfd, const char *oldpath,
                     int newdirfd, const char *newpath, unsigned int flags);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       renameat():
           Since glibc 2.10:
               _POSIX_C_SOURCE >= 200809L
           Before glibc 2.10:
               _ATFILE_SOURCE
DESCRIPTION

       rename() renames a file, moving it between directories if required.
       Any other hard links to the file (as created using link(2)) are
       unaffected.  Open file descriptors for oldpath are also unaffected.

       If newpath already exists, it will be atomically replaced (subject to
       a few conditions; see ERRORS below), so that there is no point at
       which another process attempting to access newpath will find it
       missing.

       If oldpath and newpath are existing hard links referring to the same
       file, then rename() does nothing, and returns a success status.

       If newpath exists but the operation fails for some reason, rename()
       guarantees to leave an instance of newpath in place.

       oldpath can specify a directory.  In this case, newpath must either
       not exist, or it must specify an empty directory.

       When an existing newpath is overwritten, there will probably be a
       window in which both oldpath and newpath refer to the file being
       renamed.

       If oldpath refers to a symbolic link, the link is renamed; if newpath
       refers to a symbolic link, the link will be overwritten.

   renameat()
       The renameat() system call operates in exactly the same way as
       rename(), except for the differences described here.

       If the pathname given in oldpath is relative, then it is interpreted
       relative to the directory referred to by the file descriptor olddirfd
       (rather than relative to the current working directory of the calling
       process, as is done by rename() for a relative pathname).

       If oldpath is relative and olddirfd is the special value AT_FDCWD,
       then oldpath is interpreted relative to the current working directory
       of the calling process (like rename()).

       If oldpath is absolute, then olddirfd is ignored.

       The interpretation of newpath is as for oldpath, except that a
       relative pathname is interpreted relative to the directory referred
       to by the file descriptor newdirfd.

       See openat(2) for an explanation of the need for renameat().

   renameat2()
       renameat2() has an additional flags argument.  A renameat2() call
       with a zero flags argument is equivalent to renameat().

       The flags argument is a bit mask consisting of zero or more of the
       following flags:

       RENAME_EXCHANGE
              Atomically exchange oldpath and newpath.  Both pathnames must
              exist but may be of different types (e.g., one could be a
              non-empty directory and the other a symbolic link).

       RENAME_NOREPLACE
              Don't overwrite newpath.  Fail with the error EEXIST if
              newpath already exists.

              RENAME_NOREPLACE can't be employed together with
              RENAME_EXCHANGE.

       RENAME_WHITEOUT (since Linux 3.18)
              This operation makes sense only for overlay/union filesystem
              implementations.

              Specifying RENAME_WHITEOUT creates a "whiteout" object at the
              source of the rename at the same time as performing the
              rename.  The whole operation is atomic, so that if the rename
              succeeds then the whiteout will also have been created.

              A "whiteout" is an object that has special meaning in
              union/overlay filesystem constructs.  In these constructs,
              multiple layers exist and only the top one is ever modified.
              A whiteout on an upper layer will effectively hide a matching
              file in the lower layer, making it appear as if the file
              didn't exist.

              When a file that exists on the lower layer is renamed, the
              file is first copied up (if not already on the upper layer)
              and then renamed on the upper, read-write layer.  At the same
              time, the source file needs to be "whiteouted" (so that the
              version of the source file in the lower layer is rendered
              invisible).  The whole operation needs to be done atomically.

              When not part of a union/overlay, the whiteout appears as a
              character device with a {0,0} device number.

              RENAME_WHITEOUT requires the same privileges as creating a
              device node (i.e., the CAP_MKNOD capability).

              RENAME_WHITEOUT can't be employed together with
              RENAME_EXCHANGE.

              RENAME_WHITEOUT requires support from the underlying
              filesystem.  Among the filesystems that provide that support
              are shmem (since Linux 3.18), ext4 (since Linux 3.18), and XFS
              (since Linux 4.1).
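
       As a hedged sketch of the flags argument, the helper below (the
       name rename_noreplace() is an assumption for this example) issues
       renameat2() with RENAME_NOREPLACE.  It goes through syscall(2) so
       that it does not depend on a C library wrapper being present, and
       it defines the flag value itself only if the included headers do
       not.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>        /* EEXIST, EINVAL for callers checking errno */
#include <fcntl.h>        /* AT_FDCWD */
#include <sys/syscall.h>  /* SYS_renameat2 */
#include <unistd.h>       /* syscall() */

/* Flag value from the kernel UAPI headers; defined here only if the
   headers in use do not already provide it. */
#ifndef RENAME_NOREPLACE
#define RENAME_NOREPLACE (1 << 0)
#endif

/*
 * Rename oldpath to newpath, but fail instead of overwriting an
 * existing newpath.  Both paths are interpreted like rename() paths
 * because AT_FDCWD is passed for both directory descriptors.
 * Returns 0 on success, -1 with errno set on failure.
 */
int rename_noreplace(const char *oldpath, const char *newpath)
{
    assert(oldpath != NULL && newpath != NULL);

    return (int)syscall(SYS_renameat2, AT_FDCWD, oldpath,
                        AT_FDCWD, newpath, RENAME_NOREPLACE);
}
```

       When newpath already exists the call is expected to fail with
       EEXIST; a filesystem that does not support the flag typically
       reports EINVAL, so callers may want to handle both.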
http://man7.org/linux/man-pages/man2/truncate.2.html
11
SYSTEM CALL:
truncate(2) - Linux manual page
FUNCTIONALITY:

       truncate, ftruncate - truncate a file to a specified length
SYNOPSIS:

       #include <unistd.h>
       #include <sys/types.h>

       int truncate(const char *path, off_t length);
       int ftruncate(int fd, off_t length);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       truncate():
           _XOPEN_SOURCE >= 500
               || /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
               || /* Glibc versions <= 2.19: */ _BSD_SOURCE

       ftruncate():
           _XOPEN_SOURCE >= 500
               || /* Since glibc 2.3.5: */ _POSIX_C_SOURCE >= 200112L
               || /* Glibc versions <= 2.19: */ _BSD_SOURCE
DESCRIPTION

       The truncate() and ftruncate() functions cause the regular file named
       by path or referenced by fd to be truncated to a size of precisely
       length bytes.

       If the file previously was larger than this size, the extra data is
       lost.  If the file previously was shorter, it is extended, and the
       extended part reads as null bytes ('\0').

       The file offset is not changed.

       If the size changed, then the st_ctime and st_mtime fields
       (respectively, time of last status change and time of last
       modification; see stat(2)) for the file are updated, and the set-
       user-ID and set-group-ID mode bits may be cleared.

       With ftruncate(), the file must be open for writing; with truncate(),
       the file must be writable.
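
       The shrinking, extending, and file-offset behavior described above
       can be checked with a short sketch.  The function name
       truncate_demo() and the scratch file it creates are assumptions
       made for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>      /* open() */
#include <string.h>     /* memcmp() */
#include <sys/types.h>
#include <unistd.h>     /* ftruncate(), lseek(), pread(), write() */

/*
 * Create a scratch file at `path`, then shrink and extend it with
 * ftruncate().  Returns 0 if every observation matches the behavior
 * described in the text, -1 otherwise.  The file is removed again.
 */
int truncate_demo(const char *path)
{
    char buf[8];
    int ok = 0;

    assert(path != NULL);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return -1;

    /* Write 5 bytes; the file offset is now 5. */
    if (write(fd, "hello", 5) != 5)
        goto out;

    /* Shrink to 2 bytes: "llo" is lost, the offset stays at 5. */
    if (ftruncate(fd, 2) == -1 || lseek(fd, 0, SEEK_CUR) != 5)
        goto out;

    /* Extend to 8 bytes: bytes 2..7 read back as '\0'. */
    if (ftruncate(fd, 8) == -1 || pread(fd, buf, 8, 0) != 8)
        goto out;
    if (memcmp(buf, "he\0\0\0\0\0\0", 8) != 0)
        goto out;

    ok = 1;
out:
    close(fd);
    unlink(path);
    return ok ? 0 : -1;
}
```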
http://man7.org/linux/man-pages/man2/ftruncate.2.html
11
SYSTEM CALL:
truncate(2) - Linux manual page
FUNCTIONALITY:

       truncate, ftruncate - truncate a file to a specified length
SYNOPSIS:

       #include <unistd.h>
       #include <sys/types.h>

       int truncate(const char *path, off_t length);
       int ftruncate(int fd, off_t length);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       truncate():
           _XOPEN_SOURCE >= 500
               || /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
               || /* Glibc versions <= 2.19: */ _BSD_SOURCE

       ftruncate():
           _XOPEN_SOURCE >= 500
               || /* Since glibc 2.3.5: */ _POSIX_C_SOURCE >= 200112L
               || /* Glibc versions <= 2.19: */ _BSD_SOURCE
DESCRIPTION

       The truncate() and ftruncate() functions cause the regular file named
       by path or referenced by fd to be truncated to a size of precisely
       length bytes.

       If the file previously was larger than this size, the extra data is
       lost.  If the file previously was shorter, it is extended, and the
       extended part reads as null bytes ('\0').

       The file offset is not changed.

       If the size changed, then the st_ctime and st_mtime fields
       (respectively, time of last status change and time of last
       modification; see stat(2)) for the file are updated, and the set-
       user-ID and set-group-ID mode bits may be cleared.

       With ftruncate(), the file must be open for writing; with truncate(),
       the file must be writable.
http://man7.org/linux/man-pages/man2/fallocate.2.html
10
SYSTEM CALL:
fallocate(2) - Linux manual page
FUNCTIONALITY:

       fallocate - manipulate file space
SYNOPSIS:

       #define _GNU_SOURCE             /* See feature_test_macros(7) */
       #include <fcntl.h>

       int fallocate(int fd, int mode, off_t offset, off_t len);
DESCRIPTION

       This is a nonportable, Linux-specific system call.  For the portable,
       POSIX.1-specified method of ensuring that space is allocated for a
       file, see posix_fallocate(3).

       fallocate() allows the caller to directly manipulate the allocated
       disk space for the file referred to by fd for the byte range starting
       at offset and continuing for len bytes.

       The mode argument determines the operation to be performed on the
       given range.  Details of the supported operations are given in the
       subsections below.

   Allocating disk space
       The default operation (i.e., mode is zero) of fallocate() allocates
       the disk space within the range specified by offset and len.  The
       file size (as reported by stat(2)) will be changed if offset+len is
       greater than the file size.  Any subregion within the range specified
       by offset and len that did not contain data before the call will be
       initialized to zero.  This default behavior closely resembles the
       behavior of the posix_fallocate(3) library function, and is intended
       as a method of optimally implementing that function.

       After a successful call, subsequent writes into the range specified
       by offset and len are guaranteed not to fail because of lack of disk
       space.

       If the FALLOC_FL_KEEP_SIZE flag is specified in mode, the behavior of
       the call is similar, but the file size will not be changed even if
       offset+len is greater than the file size.  Preallocating zeroed
       blocks beyond the end of the file in this manner is useful for
       optimizing append workloads.

       Because allocation is done in block size chunks, fallocate() may
       allocate a larger range of disk space than was specified.
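
       A minimal sketch of preallocating beyond the end of a file with
       FALLOC_FL_KEEP_SIZE follows.  The helper name reserve_past_eof()
       is an assumption for the example, and the call may fail (for
       instance with EOPNOTSUPP) on filesystems that do not implement
       fallocate().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>      /* open(), fallocate(), FALLOC_FL_KEEP_SIZE */
#include <sys/stat.h>   /* fstat() */
#include <sys/types.h>
#include <unistd.h>     /* close() */

/*
 * Reserve `len` bytes of disk space starting at the current end of the
 * file at `path`, without changing the size reported by stat(2).
 * Returns 0 on success, -1 with errno set on failure.
 */
int reserve_past_eof(const char *path, off_t len)
{
    struct stat st;
    int rc = -1;

    assert(path != NULL && len > 0);

    int fd = open(path, O_WRONLY);
    if (fd == -1)
        return -1;

    /* KEEP_SIZE: allocate blocks beyond EOF but leave st_size alone. */
    if (fstat(fd, &st) == 0)
        rc = fallocate(fd, FALLOC_FL_KEEP_SIZE, st.st_size, len);

    int saved = errno;      /* close() may overwrite errno */
    close(fd);
    errno = saved;
    return rc;
}
```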

   Deallocating file space
       Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux
       2.6.38) in mode deallocates space (i.e., creates a hole) in the byte
       range starting at offset and continuing for len bytes.  Within the
       specified range, partial filesystem blocks are zeroed, and whole
       filesystem blocks are removed from the file.  After a successful
       call, subsequent reads from this range will return zeroes.

       The FALLOC_FL_PUNCH_HOLE flag must be ORed with FALLOC_FL_KEEP_SIZE
       in mode; in other words, even when punching off the end of the file,
       the file size (as reported by stat(2)) does not change.

       Not all filesystems support FALLOC_FL_PUNCH_HOLE; if a filesystem
       doesn't support the operation, an error is returned.  The operation
       is supported on at least the following filesystems:

       *  XFS (since Linux 2.6.38)

       *  ext4 (since Linux 3.0)

       *  Btrfs (since Linux 3.7)

       *  tmpfs (since Linux 3.5)
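
       The hole-punching operation can be sketched as below.  The helper
       names punch_hole() and range_is_zero() are assumptions for the
       example; the second helper only exists to confirm that the punched
       range reads back as zeroes.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>      /* EOPNOTSUPP, for callers checking errno */
#include <fcntl.h>      /* fallocate(), FALLOC_FL_* */
#include <sys/types.h>
#include <unistd.h>     /* pread() */

/*
 * Deallocate [offset, offset+len) in the file open on fd.
 * FALLOC_FL_PUNCH_HOLE must be ORed with FALLOC_FL_KEEP_SIZE, so the
 * file size never changes.  Returns 0 or -1 with errno set.
 */
int punch_hole(int fd, off_t offset, off_t len)
{
    assert(len > 0);
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, len);
}

/*
 * Return 1 if every byte in [offset, offset+len) reads as zero,
 * 0 if a nonzero byte is found, -1 on read error or end of file.
 */
int range_is_zero(int fd, off_t offset, size_t len)
{
    char buf[4096];

    while (len > 0) {
        size_t want = len < sizeof buf ? len : sizeof buf;
        ssize_t got = pread(fd, buf, want, offset);
        if (got <= 0)
            return -1;
        for (ssize_t i = 0; i < got; i++)
            if (buf[i] != 0)
                return 0;
        offset += got;
        len -= (size_t)got;
    }
    return 1;
}
```

       Because the range in this kind of call need not be block-aligned,
       a caller should expect partial blocks to be zeroed rather than
       freed.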

   Collapsing file space
       Specifying the FALLOC_FL_COLLAPSE_RANGE flag (available since Linux
       3.15) in mode removes a byte range from a file, without leaving a
       hole.  The byte range to be collapsed starts at offset and continues
       for len bytes.  At the completion of the operation, the contents of
       the file starting at the location offset+len will be appended at the
       location offset, and the file will be len bytes smaller.

       A filesystem may place limitations on the granularity of the
       operation, in order to ensure efficient implementation.  Typically,
       offset and len must be a multiple of the filesystem logical block
       size, which varies according to the filesystem type and
       configuration.  If a filesystem has such a requirement, fallocate()
       will fail with the error EINVAL if this requirement is violated.

       If the region specified by offset plus len reaches or passes the end
       of file, an error is returned; instead, use ftruncate(2) to truncate
       a file.

       No other flags may be specified in mode in conjunction with
       FALLOC_FL_COLLAPSE_RANGE.

       As of Linux 3.15, FALLOC_FL_COLLAPSE_RANGE is supported by ext4 (only
       for extent-based files) and XFS.

   Zeroing file space
       Specifying the FALLOC_FL_ZERO_RANGE flag (available since Linux 3.15)
       in mode zeroes space in the byte range starting at offset and
       continuing for len bytes.  Within the specified range, blocks are
       preallocated for the regions that span the holes in the file.  After
       a successful call, subsequent reads from this range will return
       zeroes.

       Zeroing is done within the filesystem preferably by converting the
       range into unwritten extents.  This approach means that the specified
       range will not be physically zeroed out on the device (except for
       partial blocks at either end of the range), and I/O is
       (otherwise) required only to update metadata.

       If the FALLOC_FL_KEEP_SIZE flag is additionally specified in mode,
       the behavior of the call is similar, but the file size will not be
       changed even if offset+len is greater than the file size.  This
       behavior is the same as when preallocating space with
       FALLOC_FL_KEEP_SIZE specified.

       Not all filesystems support FALLOC_FL_ZERO_RANGE; if a filesystem
       doesn't support the operation, an error is returned.  The operation
       is supported on at least the following filesystems:

       *  XFS (since Linux 3.15)

       *  ext4, for extent-based files (since Linux 3.15)

       *  SMB3 (since Linux 3.17)

   Increasing file space
       Specifying the FALLOC_FL_INSERT_RANGE flag (available since Linux
       4.1) in mode increases the file space by inserting a hole within the
       file size without overwriting any existing data.  The hole will start
       at offset and continue for len bytes.  When inserting the hole inside
       the file, the contents of the file starting at offset will be shifted
       upward (i.e., to a higher file offset) by len bytes.  Inserting a
       hole inside a file increases the file size by len bytes.

       This mode has the same limitations as FALLOC_FL_COLLAPSE_RANGE
       regarding the granularity of the operation.  If the granularity
       requirements are not met, fallocate() will fail with the error
       EINVAL.  If the offset is equal to or greater than the end of file,
       an error is returned.  For such operations (i.e., inserting a hole at
       the end of file), ftruncate(2) should be used.

       No other flags may be specified in mode in conjunction with
       FALLOC_FL_INSERT_RANGE.

       FALLOC_FL_INSERT_RANGE requires filesystem support.  Filesystems that
       support this operation include XFS (since Linux 4.1) and ext4 (since
       Linux 4.2).
http://man7.org/linux/man-pages/man2/mkdir.2.html
11
SYSTEM CALL:
mkdir(2) - Linux manual page
FUNCTIONALITY:

       mkdir, mkdirat - create a directory
SYNOPSIS:

       #include <sys/stat.h>
       #include <sys/types.h>

       int mkdir(const char *pathname, mode_t mode);

       #include <fcntl.h>           /* Definition of AT_* constants */
       #include <sys/stat.h>

       int mkdirat(int dirfd, const char *pathname, mode_t mode);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       mkdirat():
           Since glibc 2.10:
               _POSIX_C_SOURCE >= 200809L
           Before glibc 2.10:
               _ATFILE_SOURCE
DESCRIPTION

       mkdir() attempts to create a directory named pathname.

       The argument mode specifies the mode for the new directory (see
       stat(2)).  It is modified by the process's umask in the usual way: in
       the absence of a default ACL, the mode of the created directory is
       (mode & ~umask & 0777).  Whether other mode bits are honored for the
       created directory depends on the operating system.  For Linux, see
       NOTES below.

       The newly created directory will be owned by the effective user ID of
       the process.  If the directory containing the file has the set-group-
       ID bit set, or if the filesystem is mounted with BSD group semantics
       (mount -o bsdgroups or, synonymously, mount -o grpid), the new
       directory will inherit the group ownership from its parent; otherwise
       it will be owned by the effective group ID of the process.

       If the parent directory has the set-group-ID bit set, then so will
       the newly created directory.

   mkdirat()
       The mkdirat() system call operates in exactly the same way as
       mkdir(), except for the differences described here.

       If the pathname given in pathname is relative, then it is interpreted
       relative to the directory referred to by the file descriptor dirfd
       (rather than relative to the current working directory of the calling
       process, as is done by mkdir() for a relative pathname).

       If pathname is relative and dirfd is the special value AT_FDCWD, then
       pathname is interpreted relative to the current working directory of
       the calling process (like mkdir()).

       If pathname is absolute, then dirfd is ignored.

       See openat(2) for an explanation of the need for mkdirat().
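
       The interaction of mode, the umask, and a directory file
       descriptor can be sketched as follows.  The helper name mkdir_in()
       is an assumption for the example; it reports the permission bits
       the new directory actually received.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>      /* AT_* constants, open() */
#include <sys/stat.h>   /* mkdirat(), fstatat(), umask() */
#include <sys/types.h>
#include <unistd.h>     /* close(), rmdir() */

/*
 * Create directory `name` inside `parent` and return the permission
 * bits it ended up with (st_mode & 0777), or -1 with errno set.
 */
int mkdir_in(const char *parent, const char *name, mode_t mode)
{
    struct stat st;
    int result = -1;

    assert(parent != NULL && name != NULL);

    int dirfd = open(parent, O_RDONLY | O_DIRECTORY);
    if (dirfd == -1)
        return -1;

    /* `name` is relative, so it is resolved against dirfd rather than
       the current working directory. */
    if (mkdirat(dirfd, name, mode) == 0 &&
        fstatat(dirfd, name, &st, AT_SYMLINK_NOFOLLOW) == 0)
        result = (int)(st.st_mode & 0777);

    int saved = errno;      /* close() may overwrite errno */
    close(dirfd);
    errno = saved;
    return result;
}
```

       With a umask of 022 and no default ACL on the parent, requesting
       mode 0777 is expected to yield 0755, matching the
       (mode & ~umask & 0777) rule.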
http://man7.org/linux/man-pages/man2/mkdirat.2.html
11
SYSTEM CALL:
mkdir(2) - Linux manual page
FUNCTIONALITY:

       mkdir, mkdirat - create a directory
SYNOPSIS:

       #include <sys/stat.h>
       #include <sys/types.h>

       int mkdir(const char *pathname, mode_t mode);

       #include <fcntl.h>           /* Definition of AT_* constants */
       #include <sys/stat.h>

       int mkdirat(int dirfd, const char *pathname, mode_t mode);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       mkdirat():
           Since glibc 2.10:
               _POSIX_C_SOURCE >= 200809L
           Before glibc 2.10:
               _ATFILE_SOURCE
DESCRIPTION

       mkdir() attempts to create a directory named pathname.

       The argument mode specifies the mode for the new directory (see
       stat(2)).  It is modified by the process's umask in the usual way: in
       the absence of a default ACL, the mode of the created directory is
       (mode & ~umask & 0777).  Whether other mode bits are honored for the
       created directory depends on the operating system.  For Linux, see
       NOTES below.

       The newly created directory will be owned by the effective user ID of
       the process.  If the directory containing the file has the set-group-
       ID bit set, or if the filesystem is mounted with BSD group semantics
       (mount -o bsdgroups or, synonymously, mount -o grpid), the new
       directory will inherit the group ownership from its parent; otherwise
       it will be owned by the effective group ID of the process.

       If the parent directory has the set-group-ID bit set, then so will
       the newly created directory.

   mkdirat()
       The mkdirat() system call operates in exactly the same way as
       mkdir(), except for the differences described here.

       If the pathname given in pathname is relative, then it is interpreted
       relative to the directory referred to by the file descriptor dirfd
       (rather than relative to the current working directory of the calling
       process, as is done by mkdir() for a relative pathname).

       If pathname is relative and dirfd is the special value AT_FDCWD, then
       pathname is interpreted relative to the current working directory of
       the calling process (like mkdir()).

       If pathname is absolute, then dirfd is ignored.

       See openat(2) for an explanation of the need for mkdirat().
http://man7.org/linux/man-pages/man2/rmdir.2.html
10
SYSTEM CALL:
rmdir(2) - Linux manual page
FUNCTIONALITY:

       rmdir - delete a directory
SYNOPSIS:

       #include <unistd.h>

       int rmdir(const char *pathname);
DESCRIPTION

       rmdir() deletes a directory, which must be empty.
http://man7.org/linux/man-pages/man2/getcwd.2.html
11
SYSTEM CALL:
getcwd(3) - Linux manual page
FUNCTIONALITY:

       getcwd, getwd, get_current_dir_name - get current working directory
SYNOPSIS:

       #include <unistd.h>

       char *getcwd(char *buf, size_t size);

       char *getwd(char *buf);

       char *get_current_dir_name(void);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       get_current_dir_name():
              _GNU_SOURCE

       getwd():
           Since glibc 2.12:
               (_XOPEN_SOURCE >= 500) && ! (_POSIX_C_SOURCE >= 200809L)
                   || /* Glibc since 2.19: */ _DEFAULT_SOURCE
                   || /* Glibc versions <= 2.19: */ _BSD_SOURCE
           Before glibc 2.12:
               _BSD_SOURCE || _XOPEN_SOURCE >= 500
DESCRIPTION

       These functions return a null-terminated string containing an
       absolute pathname that is the current working directory of the
       calling process.  The pathname is returned as the function result and
       via the argument buf, if present.

       If the current directory is not below the root directory of the
       current process (e.g., because the process set a new filesystem root
       using chroot(2) without changing its current directory into the new
       root), then, since Linux 2.6.36, the returned path will be prefixed
       with the string "(unreachable)".  Such behavior can also be caused by
       an unprivileged user by changing the current directory into another
       mount namespace.  When dealing with paths from untrusted sources,
       callers of these functions should consider checking whether the
       returned path starts with '/' or '(' to avoid misinterpreting an
       unreachable path as a relative path.

       The getcwd() function copies an absolute pathname of the current
       working directory to the array pointed to by buf, which is of length
       size.

       If the length of the absolute pathname of the current working
       directory, including the terminating null byte, exceeds size bytes,
       NULL is returned, and errno is set to ERANGE; an application should
       check for this error, and allocate a larger buffer if necessary.

       As an extension to the POSIX.1-2001 standard, glibc's getcwd()
       allocates the buffer dynamically using malloc(3) if buf is NULL.  In
       this case, the allocated buffer has the length size unless size is
       zero, when buf is allocated as big as necessary.  The caller should
       free(3) the returned buffer.

       get_current_dir_name() will malloc(3) an array big enough to hold the
       absolute pathname of the current working directory.  If the
       environment variable PWD is set, and its value is correct, then that
       value will be returned.  The caller should free(3) the returned
       buffer.

       getwd() does not malloc(3) any memory.  The buf argument should be a
       pointer to an array at least PATH_MAX bytes long.  If the length of
       the absolute pathname of the current working directory, including the
       terminating null byte, exceeds PATH_MAX bytes, NULL is returned, and
       errno is set to ENAMETOOLONG.  (Note that on some systems, PATH_MAX
       may not be a compile-time constant; furthermore, its value may depend
       on the filesystem, see pathconf(3).)  For portability and security
       reasons, use of getwd() is deprecated.
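
       The ERANGE handling described for getcwd() can be sketched with a
       retry loop that does not rely on the glibc NULL-buffer extension.
       The helper names cwd_alloc() and cwd_is_reachable() are
       assumptions for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdlib.h>   /* malloc(), free() */
#include <unistd.h>   /* getcwd() */

/*
 * Return the current working directory in a malloc(3)ed buffer that
 * the caller must free(3), or NULL with errno set.  The buffer starts
 * at `initial` bytes and doubles each time getcwd() reports ERANGE.
 */
char *cwd_alloc(size_t initial)
{
    size_t size = initial > 0 ? initial : 1;

    for (;;) {
        char *buf = malloc(size);
        if (buf == NULL)
            return NULL;

        if (getcwd(buf, size) != NULL)
            return buf;

        int saved = errno;  /* free() may overwrite errno */
        free(buf);
        if (saved != ERANGE) {
            errno = saved;
            return NULL;
        }
        size *= 2;          /* path did not fit: retry with more room */
    }
}

/* Return 1 if the path is absolute, 0 if it carries a prefix such as
   "(unreachable)" and must not be treated as a usable pathname. */
int cwd_is_reachable(const char *path)
{
    assert(path != NULL);
    return path[0] == '/';
}
```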
http://man7.org/linux/man-pages/man2/chdir.2.html
10
SYSTEM CALL:
chdir(2) - Linux manual page
FUNCTIONALITY:

       chdir, fchdir - change working directory
SYNOPSIS:

       #include <unistd.h>

       int chdir(const char *path);
       int fchdir(int fd);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       fchdir():
           _XOPEN_SOURCE >= 500
               || /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
               || /* Glibc up to and including 2.19: */ _BSD_SOURCE
DESCRIPTION

       chdir() changes the current working directory of the calling process
       to the directory specified in path.

       fchdir() is identical to chdir(); the only difference is that the
       directory is given as an open file descriptor.
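
       One common use of fchdir() is to return to a previous working
       directory after a temporary chdir().  The sketch below shows that
       pattern; the helper names save_cwd() and restore_cwd() are
       assumptions for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>    /* open(), O_DIRECTORY, O_CLOEXEC */
#include <unistd.h>   /* fchdir(), close() */

/* Open the current working directory so it can be returned to later.
   Returns a file descriptor, or -1 with errno set. */
int save_cwd(void)
{
    return open(".", O_RDONLY | O_DIRECTORY | O_CLOEXEC);
}

/* Change back to the directory saved by save_cwd() and release the
   descriptor.  Returns 0 on success, -1 with errno set. */
int restore_cwd(int saved_fd)
{
    assert(saved_fd >= 0);

    int rc = fchdir(saved_fd);

    int saved_errno = errno;    /* close() may overwrite errno */
    close(saved_fd);
    errno = saved_errno;
    return rc;
}
```

       Unlike remembering a pathname, the descriptor keeps referring to
       the same directory even if that directory is renamed in the
       meantime.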
http://man7.org/linux/man-pages/man2/fchdir.2.html
10
SYSTEM CALL:
chdir(2) - Linux manual page
FUNCTIONALITY:

       chdir, fchdir - change working directory
SYNOPSIS:

       #include <unistd.h>

       int chdir(const char *path);
       int fchdir(int fd);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       fchdir():
           _XOPEN_SOURCE >= 500
               || /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
               || /* Glibc up to and including 2.19: */ _BSD_SOURCE
DESCRIPTION

       chdir() changes the current working directory of the calling process
       to the directory specified in path.

       fchdir() is identical to chdir(); the only difference is that the
       directory is given as an open file descriptor.
http://man7.org/linux/man-pages/man2/chroot.2.html
10
SYSTEM CALL:
chroot(2) - Linux manual page
FUNCTIONALITY:

       chroot - change root directory
SYNOPSIS:

       #include <unistd.h>

       int chroot(const char *path);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       chroot():
           Since glibc 2.2.2:
               _XOPEN_SOURCE && ! (_POSIX_C_SOURCE >= 200112L)
                   || /* Since glibc 2.20: */ _DEFAULT_SOURCE
                   || /* Glibc versions <= 2.19: */ _BSD_SOURCE
           Before glibc 2.2.2: none
DESCRIPTION

       chroot() changes the root directory of the calling process to that
       specified in path.  This directory will be used for pathnames
       beginning with /.  The root directory is inherited by all children of
       the calling process.

       Only a privileged process (Linux: one with the CAP_SYS_CHROOT
       capability) may call chroot().

       This call changes an ingredient in the pathname resolution process
       and does nothing else.  In particular, it is not intended to be used
       for any kind of security purpose, neither to fully sandbox a process
       nor to restrict filesystem system calls.  In the past, chroot() has
       been used by daemons to restrict themselves prior to passing paths
       supplied by untrusted users to system calls such as open(2).
       However, if a folder is moved out of the chroot directory, an
       attacker can exploit that to get out of the chroot directory as well.
       The easiest way to do that is to chdir(2) to the to-be-moved
       directory, wait for it to be moved out, then open a path like
       ../../../etc/passwd.

       A slightly trickier variation also works under some circumstances if
       chdir(2) is not permitted.  If a daemon allows a "chroot directory"
       to be specified, that usually means that if you want to prevent
       remote users from accessing files outside the chroot directory, you
       must ensure that folders are never moved out of it.

       This call does not change the current working directory, so that
       after the call '.' can be outside the tree rooted at '/'.  In
       particular, the superuser can escape from a "chroot jail" by doing:

           mkdir foo; chroot foo; cd ..

       This call does not close open file descriptors, and such file
       descriptors may allow access to files outside the chroot tree.
http://man7.org/linux/man-pages/man2/getdents.2.html
11
SYSTEM CALL:
getdents(2) - Linux manual page
FUNCTIONALITY:

       getdents, getdents64 - get directory entries
SYNOPSIS:

       int getdents(unsigned int fd, struct linux_dirent *dirp,
                    unsigned int count);
       int getdents64(unsigned int fd, struct linux_dirent64 *dirp,
                    unsigned int count);

       Note: There are no glibc wrappers for these system calls; see NOTES.
DESCRIPTION

       These are not the interfaces you are interested in.  Look at
       readdir(3) for the POSIX-conforming C library interface.  This page
       documents the bare kernel system call interfaces.

   getdents()
       The system call getdents() reads several linux_dirent structures from
       the directory referred to by the open file descriptor fd into the
       buffer pointed to by dirp.  The argument count specifies the size of
       that buffer.

       The linux_dirent structure is declared as follows:

           struct linux_dirent {
               unsigned long  d_ino;     /* Inode number */
               unsigned long  d_off;     /* Offset to next linux_dirent */
               unsigned short d_reclen;  /* Length of this linux_dirent */
               char           d_name[];  /* Filename (null-terminated) */
                                 /* length is actually (d_reclen - 2 -
                                    offsetof(struct linux_dirent, d_name)) */
               /*
               char           pad;       // Zero padding byte
               char           d_type;    // File type (only since Linux
                                         // 2.6.4); offset is (d_reclen - 1)
               */
           };

       d_ino is an inode number.  d_off is the distance from the start of
       the directory to the start of the next linux_dirent.  d_reclen is the
       size of this entire linux_dirent.  d_name is a null-terminated
       filename.

       d_type is a byte at the end of the structure that indicates the file
       type.  It contains one of the following values (defined in
       <dirent.h>):

       DT_BLK      This is a block device.

       DT_CHR      This is a character device.

       DT_DIR      This is a directory.

       DT_FIFO     This is a named pipe (FIFO).

       DT_LNK      This is a symbolic link.

       DT_REG      This is a regular file.

       DT_SOCK     This is a UNIX domain socket.

       DT_UNKNOWN  The file type is unknown.

       The d_type field is implemented since Linux 2.6.4.  It occupies a
       space that was previously a zero-filled padding byte in the
       linux_dirent structure.  Thus, on kernels up to and including 2.6.3,
       attempting to access this field always provides the value 0
       (DT_UNKNOWN).

       Currently, only some filesystems (among them: Btrfs, ext2, ext3, and
       ext4) have full support for returning the file type in d_type.  All
       applications must properly handle a return of DT_UNKNOWN.

   getdents64()
       The original Linux getdents() system call did not handle large
       filesystems and large file offsets.  Consequently, Linux 2.4 added
       getdents64(), with wider types for the d_ino and d_off fields.  In
       addition, getdents64() supports an explicit d_type field.

       The getdents64() system call is like getdents(), except that its
       second argument is a pointer to a buffer containing structures of the
       following type:

           struct linux_dirent64 {
               ino64_t        d_ino;    /* 64-bit inode number */
               off64_t        d_off;    /* 64-bit offset to next structure */
               unsigned short d_reclen; /* Size of this dirent */
               unsigned char  d_type;   /* File type */
               char           d_name[]; /* Filename (null-terminated) */
           };
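
       Walking the variable-length records returned by getdents64() can
       be sketched as follows.  Because there is no glibc wrapper, the
       sketch calls syscall(2) directly and declares the structure
       itself, using fixed-width types in place of ino64_t and off64_t.
       The helper name entry_type() is an assumption for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <dirent.h>       /* DT_* constants */
#include <errno.h>
#include <fcntl.h>        /* open(), O_DIRECTORY */
#include <stdint.h>
#include <string.h>       /* strcmp() */
#include <sys/syscall.h>  /* SYS_getdents64 */
#include <unistd.h>       /* syscall(), close() */

struct linux_dirent64 {
    uint64_t       d_ino;    /* 64-bit inode number */
    int64_t        d_off;    /* 64-bit offset to next structure */
    unsigned short d_reclen; /* Size of this dirent */
    unsigned char  d_type;   /* File type */
    char           d_name[]; /* Filename (null-terminated) */
};

/*
 * Look for `name` in directory `dirpath` and return its d_type value
 * (which may be DT_UNKNOWN).  Returns -1 with errno set on error, or
 * with errno set to ENOENT if no such entry exists.
 */
int entry_type(const char *dirpath, const char *name)
{
    uint64_t storage[512];          /* 4096 bytes, 8-byte aligned */
    char *buf = (char *)storage;
    int result = -1;
    int err = ENOENT;

    assert(dirpath != NULL && name != NULL);

    int fd = open(dirpath, O_RDONLY | O_DIRECTORY);
    if (fd == -1)
        return -1;

    for (;;) {
        long nread = syscall(SYS_getdents64, fd, buf, sizeof storage);
        if (nread == -1) {
            err = errno;
            break;
        }
        if (nread == 0)             /* end of directory */
            break;

        /* Step through the buffer one record at a time via d_reclen. */
        for (long pos = 0; pos < nread; ) {
            struct linux_dirent64 *d =
                (struct linux_dirent64 *)(buf + pos);
            if (strcmp(d->d_name, name) == 0) {
                result = d->d_type;
                goto done;
            }
            pos += d->d_reclen;
        }
    }
done:
    close(fd);
    if (result == -1)
        errno = err;
    return result;
}
```

       Callers should treat a DT_UNKNOWN result as "ask stat(2)" rather
       than as an error, since not every filesystem fills in d_type.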
http://man7.org/linux/man-pages/man2/getdents64.2.html
11
SYSTEM CALL:
getdents(2) - Linux manual page
FUNCTIONALITY:

       getdents, getdents64 - get directory entries
SYNOPSIS:

       int getdents(unsigned int fd, struct linux_dirent *dirp,
                    unsigned int count);
       int getdents64(unsigned int fd, struct linux_dirent64 *dirp,
                    unsigned int count);

       Note: There are no glibc wrappers for these system calls; see NOTES.
DESCRIPTION

       These are not the interfaces you are interested in.  Look at
       readdir(3) for the POSIX-conforming C library interface.  This page
       documents the bare kernel system call interfaces.

   getdents()
       The system call getdents() reads several linux_dirent structures from
       the directory referred to by the open file descriptor fd into the
       buffer pointed to by dirp.  The argument count specifies the size of
       that buffer.

       The linux_dirent structure is declared as follows:

           struct linux_dirent {
               unsigned long  d_ino;     /* Inode number */
               unsigned long  d_off;     /* Offset to next linux_dirent */
               unsigned short d_reclen;  /* Length of this linux_dirent */
               char           d_name[];  /* Filename (null-terminated) */
                                 /* length is actually (d_reclen - 2 -
                                    offsetof(struct linux_dirent, d_name)) */
               /*
               char           pad;       // Zero padding byte
               char           d_type;    // File type (only since Linux
                                         // 2.6.4); offset is (d_reclen - 1)
               */
           };

       d_ino is an inode number.  d_off is the distance from the start of
       the directory to the start of the next linux_dirent.  d_reclen is the
       size of this entire linux_dirent.  d_name is a null-terminated
       filename.

       d_type is a byte at the end of the structure that indicates the file
       type.  It contains one of the following values (defined in
       <dirent.h>):

       DT_BLK      This is a block device.

       DT_CHR      This is a character device.

       DT_DIR      This is a directory.

       DT_FIFO     This is a named pipe (FIFO).

       DT_LNK      This is a symbolic link.

       DT_REG      This is a regular file.

       DT_SOCK     This is a UNIX domain socket.

       DT_UNKNOWN  The file type is unknown.

       The d_type field is implemented since Linux 2.6.4.  It occupies a
       space that was previously a zero-filled padding byte in the
       linux_dirent structure.  Thus, on kernels up to and including 2.6.3,
       attempting to access this field always provides the value 0
       (DT_UNKNOWN).

       Currently, only some filesystems (among them: Btrfs, ext2, ext3, and
       ext4) have full support for returning the file type in d_type.  All
       applications must properly handle a return of DT_UNKNOWN.

   getdents64()
       The original Linux getdents() system call did not handle large
       filesystems and large file offsets.  Consequently, Linux 2.4 added
       getdents64(), with wider types for the d_ino and d_off fields.  In
       addition, getdents64() supports an explicit d_type field.

       The getdents64() system call is like getdents(), except that its
       second argument is a pointer to a buffer containing structures of the
       following type:

           struct linux_dirent64 {
               ino64_t        d_ino;    /* 64-bit inode number */
               off64_t        d_off;    /* 64-bit offset to next structure */
               unsigned short d_reclen; /* Size of this dirent */
               unsigned char  d_type;   /* File type */
               char           d_name[]; /* Filename (null-terminated) */
           };
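
       The buffer filled in by getdents64() holds variable-length records,
       so it is walked by advancing d_reclen bytes at a time.  The
       following is a minimal sketch that invokes the call through
       syscall(2); the structure name linux_dirent64_local and the function
       count_dir_entries() are assumptions made for this example, and the
       field types are fixed-width stand-ins for ino64_t and off64_t.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Local copy of the record layout, using fixed-width types. */
struct linux_dirent64_local {
    uint64_t       d_ino;     /* 64-bit inode number */
    int64_t        d_off;     /* 64-bit offset to next structure */
    unsigned short d_reclen;  /* size of this record */
    unsigned char  d_type;    /* file type */
    char           d_name[];  /* filename (null-terminated) */
};

/*
 * Counts the entries (including "." and "..") in the directory at path.
 * Returns the count, or -1 on error with errno set.
 */
long count_dir_entries(const char *path)
{
    _Alignas(struct linux_dirent64_local) char buf[4096];
    long count = 0;
    int fd;

    assert(path != NULL);

    fd = open(path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (fd == -1)
        return -1;

    for (;;) {
        long nread = syscall(SYS_getdents64, fd, buf, sizeof(buf));

        if (nread == -1) {
            int saved = errno;
            close(fd);
            errno = saved;
            return -1;
        }
        if (nread == 0)             /* end of directory */
            break;

        /* Step from record to record using each record's own length. */
        for (long pos = 0; pos < nread; ) {
            struct linux_dirent64_local *d =
                (struct linux_dirent64_local *)(buf + pos);
            count++;
            pos += d->d_reclen;
        }
    }

    close(fd);
    return count;
}
```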
http://man7.org/linux/man-pages/man2/lookup_dcookie.2.html
11
SYSTEM CALL:
lookup_dcookie(2) - Linux manual page
FUNCTIONALITY:

       lookup_dcookie - return a directory entry's path
SYNOPSIS:

       int lookup_dcookie(u64 cookie, char *buffer, size_t len);
DESCRIPTION

       Look up the full path of the directory entry specified by the value
       cookie.  The cookie is an opaque identifier uniquely identifying a
       particular directory entry.  The buffer given is filled in with the
       full path of the directory entry.

       For lookup_dcookie() to return successfully, the kernel must still
       hold a cookie reference to the directory entry.
http://man7.org/linux/man-pages/man2/link.2.html
12
SYSTEM CALL:
link(2) - Linux manual page
FUNCTIONALITY:

       link, linkat - make a new name for a file
SYNOPSIS:

       #include <unistd.h>

       int link(const char *oldpath, const char *newpath);

       #include <fcntl.h>           /* Definition of AT_* constants */
       #include <unistd.h>

       int linkat(int olddirfd, const char *oldpath,
                  int newdirfd, const char *newpath, int flags);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       linkat():
           Since glibc 2.10:
               _POSIX_C_SOURCE >= 200809L
           Before glibc 2.10:
               _ATFILE_SOURCE
DESCRIPTION

       link() creates a new link (also known as a hard link) to an existing
       file.

       If newpath exists, it will not be overwritten.
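
       Both properties of link(), that an existing newpath is never
       replaced and that a hard link is a second name for the same file,
       can be observed with stat(2).  The sketch below works on a scratch
       file under /tmp; the path template and the function names
       same_file() and hard_link_demo() are assumptions made for this
       example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if both paths name the same file (same device and inode),
   0 if they do not, and -1 if either stat() call fails. */
int same_file(const char *a, const char *b)
{
    struct stat sa, sb;

    assert(a != NULL && b != NULL);

    if (stat(a, &sa) == -1 || stat(b, &sb) == -1)
        return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}

/*
 * Creates a scratch file, hard-links it, and checks the result.
 * Returns 1 if the two names share an inode and a second link() to the
 * same newpath is refused, 0 otherwise, and -1 if the setup fails.
 */
int hard_link_demo(void)
{
    char oldpath[] = "/tmp/link_demo_XXXXXX";
    char newpath[sizeof(oldpath) + 8];
    int fd, shared, refused;

    fd = mkstemp(oldpath);          /* create and open a unique file */
    if (fd == -1)
        return -1;
    close(fd);
    snprintf(newpath, sizeof(newpath), "%s.hard", oldpath);

    if (link(oldpath, newpath) == -1) {
        unlink(oldpath);
        return -1;
    }

    shared  = same_file(oldpath, newpath);      /* one file, two names */
    refused = link(oldpath, newpath) == -1;     /* newpath now exists */

    unlink(newpath);
    unlink(oldpath);
    return shared == 1 && refused;
}
```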

       This new name may be used exactly as the old one for any operation;
       both names refer to the same file (and so have the same permissions
       and ownership) and it is impossible to tell which name was the
       "original".

   linkat()
       The linkat() system call operates in exactly the same way as link(),
       except for the differences described here.

       If the pathname given in oldpath is relative, then it is interpreted
       relative to the directory referred to by the file descriptor olddirfd
       (rather than relative to the current working directory of the calling
       process, as is done by link() for a relative pathname).

       If oldpath is relative and olddirfd is the special value AT_FDCWD,
       then oldpath is interpreted relative to the current working directory
       of the calling process (like link()).

       If oldpath is absolute, then olddirfd is ignored.

       The interpretation of newpath is as for oldpath, except that a
       relative pathname is interpreted relative to the directory referred
       to by the file descriptor newdirfd.

       The following values can be bitwise ORed in flags:

       AT_EMPTY_PATH (since Linux 2.6.39)
              If oldpath is an empty string, create a link to the file
              referenced by olddirfd (which may have been obtained using the
              open(2) O_PATH flag).  In this case, olddirfd can refer to any
              type of file, not just a directory.  This will generally not
              work if the file has a link count of zero (files created with
              O_TMPFILE and without O_EXCL are an exception).  The caller
              must have the CAP_DAC_READ_SEARCH capability in order to use
              this flag.  This flag is Linux-specific; define _GNU_SOURCE to
              obtain its definition.

       AT_SYMLINK_FOLLOW (since Linux 2.6.18)
              By default, linkat() does not dereference oldpath if it is a
              symbolic link (like link()).  The flag AT_SYMLINK_FOLLOW can
              be specified in flags to cause oldpath to be dereferenced if
              it is a symbolic link.  If procfs is mounted, this can be used
              as an alternative to AT_EMPTY_PATH, like this:

                  linkat(AT_FDCWD, "/proc/self/fd/<fd>", newdirfd,
                         newname, AT_SYMLINK_FOLLOW);

       Before kernel 2.6.18, the flags argument was unused, and had to be
       specified as 0.
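
       The /proc/self/fd form shown under AT_SYMLINK_FOLLOW can be wrapped
       in a small helper.  This is a sketch only: it assumes procfs is
       mounted at /proc, and the name link_open_fd() is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Gives the file open on fd a new name, newname, resolved relative to
 * newdirfd.  Returns 0 on success, -1 on error with errno set.
 */
int link_open_fd(int fd, int newdirfd, const char *newname)
{
    char procpath[64];

    assert(newname != NULL);

    /* /proc/self/fd/<fd> is a symbolic link to the open file;
       AT_SYMLINK_FOLLOW makes linkat() link its target rather than
       the symbolic link itself. */
    snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", fd);
    return linkat(AT_FDCWD, procpath, newdirfd, newname, AT_SYMLINK_FOLLOW);
}
```

       Unlike AT_EMPTY_PATH, this form does not depend on the caller
       holding CAP_DAC_READ_SEARCH, which is why it is offered as an
       alternative.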

       See openat(2) for an explanation of the need for linkat().
http://man7.org/linux/man-pages/man2/symlink.2.html
11
SYSTEM CALL:
symlink(2) - Linux manual page
FUNCTIONALITY:

       symlink, symlinkat - make a new name for a file
SYNOPSIS:

       #include <unistd.h>

       int symlink(const char *target, const char *linkpath);

       #include <fcntl.h>           /* Definition of AT_* constants */
       #include <unistd.h>

       int symlinkat(const char *target, int newdirfd, const char *linkpath);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       symlink():
           _XOPEN_SOURCE >= 500 || _POSIX_C_SOURCE >= 200112L
               || /* Glibc versions <= 2.19: */ _BSD_SOURCE

       symlinkat():
           Since glibc 2.10:
               _POSIX_C_SOURCE >= 200809L
           Before glibc 2.10:
               _ATFILE_SOURCE
DESCRIPTION

       symlink() creates a symbolic link named linkpath which contains the
       string target.

       Symbolic links are interpreted at run time as if the contents of the
       link had been substituted into the path being followed to find a file
       or directory.

       Symbolic links may contain .. path components, which (if used at the
       start of the link) refer to the parent directories of the directory
       in which the link resides.

       A symbolic link (also known as a soft link) may point to an existing
       file or to a nonexistent one; the latter case is known as a dangling
       link.

       The permissions of a symbolic link are irrelevant; the ownership is
       ignored when following the link, but is checked when removal or
       renaming of the link is requested and the link is in a directory with
       the sticky bit (S_ISVTX) set.

       If linkpath exists, it will not be overwritten.
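
       A dangling link can be recognized by comparing lstat(2), which
       examines the link itself, with stat(2), which follows it.  The
       following is a minimal sketch; the function name
       is_dangling_symlink() is an assumption made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Returns 1 if path is a symbolic link whose target does not exist,
 * 0 if it is anything else, and -1 on an unexpected error.
 */
int is_dangling_symlink(const char *path)
{
    struct stat sb;

    assert(path != NULL);

    /* lstat() examines the link itself and does not follow it. */
    if (lstat(path, &sb) == -1)
        return -1;
    if (!S_ISLNK(sb.st_mode))
        return 0;

    /* stat() follows the link; ENOENT means the target is missing. */
    if (stat(path, &sb) == 0)
        return 0;
    return errno == ENOENT ? 1 : -1;
}
```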

   symlinkat()
       The symlinkat() system call operates in exactly the same way as
       symlink(), except for the differences described here.

       If the pathname given in linkpath is relative, then it is interpreted
       relative to the directory referred to by the file descriptor newdirfd
       (rather than relative to the current working directory of the calling
       process, as is done by symlink() for a relative pathname).

       If linkpath is relative and newdirfd is the special value AT_FDCWD,
       then linkpath is interpreted relative to the current working
       directory of the calling process (like symlink()).

       If linkpath is absolute, then newdirfd is ignored.
http://man7.org/linux/man-pages/man2/unlink.2.html
12
SYSTEM CALL:
unlink(2) - Linux manual page
FUNCTIONALITY:

       unlink, unlinkat - delete a name and possibly the file it refers to
SYNOPSIS:

       #include <unistd.h>

       int unlink(const char *pathname);

       #include <fcntl.h>           /* Definition of AT_* constants */
       #include <unistd.h>

       int unlinkat(int dirfd, const char *pathname, int flags);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       unlinkat():
           Since glibc 2.10:
               _POSIX_C_SOURCE >= 200809L
           Before glibc 2.10:
               _ATFILE_SOURCE
DESCRIPTION

       unlink() deletes a name from the filesystem.  If that name was the
       last link to a file and no processes have the file open, the file is
       deleted and the space it was using is made available for reuse.

       If the name was the last link to a file but any processes still have
       the file open, the file will remain in existence until the last file
       descriptor referring to it is closed.
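
       This behavior is the basis of a common idiom for anonymous scratch
       files: create the file, unlink it at once, and keep using the
       descriptor.  The sketch below assumes /tmp is writable; the path
       template and the function name survives_unlink() are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Creates a scratch file, removes its only name, then writes msg to the
 * still-open descriptor and reads it back.  Returns 1 if the data
 * survived, 0 if it did not, and -1 if a system call fails or msg is
 * empty or longer than the internal buffer.
 */
int survives_unlink(const char *msg)
{
    char path[] = "/tmp/unlink_demo_XXXXXX";
    char back[128];
    size_t len;
    int fd, ok;

    assert(msg != NULL);

    len = strlen(msg);
    if (len == 0 || len > sizeof(back))
        return -1;

    fd = mkstemp(path);             /* create and open a unique file */
    if (fd == -1)
        return -1;

    if (unlink(path) == -1) {       /* the only name is removed here */
        close(fd);
        return -1;
    }

    /* The file remains in existence for as long as fd stays open. */
    if (write(fd, msg, len) != (ssize_t)len ||
        pread(fd, back, len, 0) != (ssize_t)len) {
        close(fd);
        return -1;
    }

    ok = memcmp(back, msg, len) == 0;
    close(fd);                      /* last reference: file is deleted */
    return ok;
}
```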

       If the name referred to a symbolic link, the link is removed.

       If the name referred to a socket, FIFO, or device, the name for it is
       removed but processes which have the object open may continue to use
       it.

   unlinkat()
       The unlinkat() system call operates in exactly the same way as either
       unlink() or rmdir(2) (depending on whether or not flags includes the
       AT_REMOVEDIR flag) except for the differences described here.

       If the pathname given in pathname is relative, then it is interpreted
       relative to the directory referred to by the file descriptor dirfd
       (rather than relative to the current working directory of the calling
       process, as is done by unlink() and rmdir(2) for a relative
       pathname).

       If the pathname given in pathname is relative and dirfd is the
       special value AT_FDCWD, then pathname is interpreted relative to the
       current working directory of the calling process (like unlink() and
       rmdir(2)).

       If the pathname given in pathname is absolute, then dirfd is ignored.

       flags is a bit mask that can either be specified as 0, or by ORing
       together flag values that control the operation of unlinkat().
       Currently, only one such flag is defined:

       AT_REMOVEDIR
              By default, unlinkat() performs the equivalent of unlink() on
              pathname.  If the AT_REMOVEDIR flag is specified, then it
              performs the equivalent of rmdir(2) on pathname.
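
       Because the flag selects between the two behaviors, a caller that
       does not know the type of an entry in advance has to check it
       first.  The following sketch does so with fstatat(2); the helper
       name remove_entry() is an assumption made for this example, and the
       check is subject to a race if the entry is replaced between the two
       calls.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Removes name (resolved relative to dirfd) whether it is a directory
 * or not.  Directories must be empty, as with rmdir(2).
 * Returns 0 on success, -1 on error with errno set.
 */
int remove_entry(int dirfd, const char *name)
{
    struct stat sb;
    int flags;

    assert(name != NULL);

    /* Examine the entry itself: a symbolic link to a directory is
       removed as a link, not as a directory. */
    if (fstatat(dirfd, name, &sb, AT_SYMLINK_NOFOLLOW) == -1)
        return -1;

    flags = S_ISDIR(sb.st_mode) ? AT_REMOVEDIR : 0;
    return unlinkat(dirfd, name, flags);
}
```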

       See openat(2) for an explanation of the need for unlinkat().
http://man7.org/linux/man-pages/man2/readlink.2.html
12
SYSTEM CALL:
readlink(2) - Linux manual page
FUNCTIONALITY:

       readlink, readlinkat - read value of a symbolic link
SYNOPSIS:

       #include <unistd.h>

       ssize_t readlink(const char *pathname, char *buf, size_t bufsiz);

       #include <fcntl.h>           /* Definition of AT_* constants */
       #include <unistd.h>

       ssize_t readlinkat(int dirfd, const char *pathname,
                          char *buf, size_t bufsiz);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       readlink():
           _XOPEN_SOURCE >= 500 || _POSIX_C_SOURCE >= 200112L
               || /* Glibc versions <= 2.19: */ _BSD_SOURCE

       readlinkat():
           Since glibc 2.10:
               _POSIX_C_SOURCE >= 200809L
           Before glibc 2.10:
               _ATFILE_SOURCE
DESCRIPTION

       readlink() places the contents of the symbolic link pathname in the
       buffer buf, which has size bufsiz.  readlink() does not append a null
       byte to buf.  If the buffer is too small to hold all of the contents,
       readlink() truncates them to bufsiz bytes.
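
       Since readlink() neither terminates the buffer nor reports
       truncation, callers usually add both steps themselves.  The sketch
       below treats a completely filled buffer as possible truncation; the
       function name read_link_string() and the choice of ENAMETOOLONG for
       that case are assumptions made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Reads the target of the symbolic link at path into buf as a C string.
 * Returns the length of the target, or -1 on error with errno set.
 */
ssize_t read_link_string(const char *path, char *buf, size_t bufsiz)
{
    ssize_t n;

    assert(path != NULL && buf != NULL);

    if (bufsiz == 0) {
        errno = EINVAL;
        return -1;
    }

    n = readlink(path, buf, bufsiz);
    if (n == -1)
        return -1;

    /* A full buffer leaves no room for the terminator and may mean
       that the contents were truncated. */
    if ((size_t)n == bufsiz) {
        errno = ENAMETOOLONG;
        return -1;
    }

    buf[n] = '\0';              /* readlink() does not append this */
    return n;
}
```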

   readlinkat()
       The readlinkat() system call operates in exactly the same way as
       readlink(), except for the differences described here.

       If the pathname given in pathname is relative, then it is interpreted
       relative to the directory referred to by the file descriptor dirfd
       (rather than relative to the current working directory of the calling
       process, as is done by readlink() for a relative pathname).

       If pathname is relative and dirfd is the special value AT_FDCWD, then
       pathname is interpreted relative to the current working directory of
       the calling process (like readlink()).

       If pathname is absolute, then dirfd is ignored.

       Since Linux 2.6.39, pathname can be an empty string, in which case
       the call operates on the symbolic link referred to by dirfd (which
       should have been obtained using open(2) with the O_PATH and
       O_NOFOLLOW flags).
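
       The empty-pathname form can be sketched as follows.  The example
       assumes Linux 2.6.39 or later, and the function name
       read_link_via_fd() is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Reads the target of the symbolic link at linkpath by opening the link
 * itself and then calling readlinkat() with an empty pathname.
 * Returns the number of bytes placed in buf (not null-terminated), or
 * -1 on error with errno set.
 */
ssize_t read_link_via_fd(const char *linkpath, char *buf, size_t bufsiz)
{
    ssize_t n;
    int fd, saved;

    assert(linkpath != NULL && buf != NULL);

    /* O_PATH | O_NOFOLLOW yields a descriptor for the link itself
       rather than for the file it points to. */
    fd = open(linkpath, O_PATH | O_NOFOLLOW | O_CLOEXEC);
    if (fd == -1)
        return -1;

    n = readlinkat(fd, "", buf, bufsiz);

    saved = errno;              /* keep readlinkat()'s errno across close() */
    close(fd);
    errno = saved;
    return n;
}
```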

       See openat(2) for an explanation of the need for readlinkat().
http://man7.org/linux/man-pages/man2/umask.2.html
9
SYSTEM CALL:
umask(2) - Linux manual page
FUNCTIONALITY:

       umask - set file mode creation mask
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/stat.h>

       mode_t umask(mode_t mask);
DESCRIPTION

       umask() sets the calling process's file mode creation mask (umask) to
       mask & 0777 (i.e., only the file permission bits of mask are used),
       and returns the previous value of the mask.

       The umask is used by open(2), mkdir(2), and other system calls that
       create files to modify the permissions placed on newly created files
       or directories.  Specifically, permissions in the umask are turned
       off from the mode argument to open(2) and mkdir(2).

       Alternatively, if the parent directory has a default ACL (see
       acl(5)), the umask is ignored, the default ACL is inherited, the
       permission bits are set based on the inherited ACL, and permission
       bits absent in the mode argument are turned off.  For example, the
       following default ACL is equivalent to a umask of 022:

           u::rwx,g::r-x,o::r-x

       Combining the effect of this default ACL with a mode argument of 0666
       (rw-rw-rw-), the resulting file permissions would be 0644
       (rw-r--r--).

       The constants that should be used to specify mask are described under
       stat(2).

       The typical default value for the process umask is S_IWGRP | S_IWOTH
       (octal 022).  In the usual case where the mode argument to open(2) is
       specified as:

           S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP | S_IROTH | S_IWOTH

       (octal 0666) when creating a new file, the permissions on the
       resulting file will be:

           S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH

       (because 0666 & ~022 = 0644; i.e., rw-r--r--).
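
       The arithmetic above, and the usual way of reading the mask without
       changing it, can be sketched as follows.  The function names are
       assumptions made for this example; current_umask() is not safe if
       another thread creates files between its two umask() calls.

```c
#include <assert.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Permission bits that remain after mask is applied to a requested
   mode: mode & ~mask, restricted to the file permission bits. */
mode_t apply_umask(mode_t mode, mode_t mask)
{
    return mode & ~mask & 0777;
}

/* Reads the process umask.  umask() can report the old value only by
   installing a new one, so the original is restored immediately. */
mode_t current_umask(void)
{
    mode_t old = umask(0);

    umask(old);
    return old;
}

/* Worked instance of the example in the text: 0666 & ~022 = 0644. */
void umask_arithmetic_check(void)
{
    assert(apply_umask(S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP |
                       S_IROTH | S_IWOTH, S_IWGRP | S_IWOTH)
           == (S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH));
}
```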
http://man7.org/linux/man-pages/man2/stat.2.html
12
SYSTEM CALL:
stat(2) - Linux manual page
FUNCTIONALITY:

       stat, fstat, lstat, fstatat - get file status
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/stat.h>
       #include <unistd.h>

       int stat(const char *pathname, struct stat *buf);
       int fstat(int fd, struct stat *buf);
       int lstat(const char *pathname, struct stat *buf);

       #include <fcntl.h>           /* Definition of AT_* constants */
       #include <sys/stat.h>

       int fstatat(int dirfd, const char *pathname, struct stat *buf,
                   int flags);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       lstat():
           /* glibc 2.19 and earlier */ _BSD_SOURCE
               || /* Since glibc 2.20 */ _DEFAULT_SOURCE
               || _XOPEN_SOURCE >= 500
               || /* Since glibc 2.10: */ _POSIX_C_SOURCE >= 200112L

       fstatat():
           Since glibc 2.10:
               _POSIX_C_SOURCE >= 200809L
           Before glibc 2.10:
               _ATFILE_SOURCE
DESCRIPTION

       These functions return information about a file, in the buffer
       pointed to by buf.  No permissions are required on the file itself,
       but—in the case of stat(), fstatat(), and lstat()—execute (search)
       permission is required on all of the directories in pathname that
       lead to the file.

       stat() and fstatat() retrieve information about the file pointed to
       by pathname; the differences for fstatat() are described below.

       lstat() is identical to stat(), except that if pathname is a symbolic
       link, then it returns information about the link itself, not the file
       that it refers to.

       fstat() is identical to stat(), except that the file about which
       information is to be retrieved is specified by the file descriptor
       fd.

       All of these system calls return a stat structure, which contains the
       following fields:

           struct stat {
               dev_t     st_dev;         /* ID of device containing file */
               ino_t     st_ino;         /* inode number */
               mode_t    st_mode;        /* file type and mode */
               nlink_t   st_nlink;       /* number of hard links */
               uid_t     st_uid;         /* user ID of owner */
               gid_t     st_gid;         /* group ID of owner */
               dev_t     st_rdev;        /* device ID (if special file) */
               off_t     st_size;        /* total size, in bytes */
               blksize_t st_blksize;     /* blocksize for filesystem I/O */
               blkcnt_t  st_blocks;      /* number of 512B blocks allocated */

               /* Since Linux 2.6, the kernel supports nanosecond
                  precision for the following timestamp fields.
                  For the details before Linux 2.6, see NOTES. */

               struct timespec st_atim;  /* time of last access */
               struct timespec st_mtim;  /* time of last modification */
               struct timespec st_ctim;  /* time of last status change */

           #define st_atime st_atim.tv_sec      /* Backward compatibility */
           #define st_mtime st_mtim.tv_sec
           #define st_ctime st_ctim.tv_sec
           };

       Note: the order of fields in the stat structure varies somewhat
       across architectures.  In addition, the definition above does not
       show the padding bytes that may be present between some fields on
       various architectures.  Consult the glibc and kernel source code if
       you need to know the details.

       Note: For performance and simplicity reasons, different fields in the
       stat structure may contain state information from different moments
       during the execution of the system call.  For example, if st_mode or
       st_uid is changed by another process by calling chmod(2) or chown(2),
       stat() might return the old st_mode together with the new st_uid, or
       the old st_uid together with the new st_mode.

       The st_dev field describes the device on which this file resides.
       (The major(3) and minor(3) macros may be useful to decompose the
       device ID in this field.)

       The st_rdev field describes the device that this file (inode)
       represents.

       The st_size field gives the size of the file (if it is a regular file
       or a symbolic link) in bytes.  The size of a symbolic link is the
       length of the pathname it contains, without a terminating null byte.

       The st_blocks field indicates the number of blocks allocated to the
       file, in 512-byte units.  (This may be smaller than st_size/512 when
       the file has holes.)

       The st_blksize field gives the "preferred" blocksize for efficient
       filesystem I/O.  (Writing to a file in smaller chunks may cause an
       inefficient read-modify-rewrite.)

       Not all of the Linux filesystems implement all of the time fields.
       Some filesystem types allow mounting in such a way that file and/or
       directory accesses do not cause an update of the st_atime field.
       (See noatime, nodiratime, and relatime in mount(8), and related
       information in mount(2).)  In addition, st_atime is not updated if a
       file is opened with the O_NOATIME flag; see open(2).

       The field st_atime is changed by file accesses, for example, by
       execve(2), mknod(2), pipe(2), utime(2), and read(2) (of more than
       zero bytes).  Other routines, like mmap(2), may or may not update
       st_atime.

       The field st_mtime is changed by file modifications, for example, by
       mknod(2), truncate(2), utime(2), and write(2) (of more than zero
       bytes).  Moreover, st_mtime of a directory is changed by the creation
       or deletion of files in that directory.  The st_mtime field is not
       changed for changes in owner, group, hard link count, or mode.

       The field st_ctime is changed by writing or by setting inode
       information (i.e., owner, group, link count, mode, etc.).

       POSIX refers to the st_mode bits corresponding to the mask S_IFMT
       (see below) as the file type, the 12 bits corresponding to the mask
       07777 as the file mode bits and the least significant 9 bits (0777)
       as the file permission bits.

       The following mask values are defined for the file type of the
       st_mode field:

           S_IFMT     0170000   bit mask for the file type bit field

           S_IFSOCK   0140000   socket
           S_IFLNK    0120000   symbolic link
           S_IFREG    0100000   regular file
           S_IFBLK    0060000   block device
           S_IFDIR    0040000   directory
           S_IFCHR    0020000   character device
           S_IFIFO    0010000   FIFO

       Thus, to test for a regular file (for example), one could write:

           if (stat(pathname, &sb) == 0 &&
               (sb.st_mode & S_IFMT) == S_IFREG) {
               /* Handle regular file */
           }

       Because tests of the above form are common, additional macros are
       defined by POSIX to allow the test of the file type in st_mode to be
       written more concisely:

           S_ISREG(m)  is it a regular file?

           S_ISDIR(m)  directory?

           S_ISCHR(m)  character device?

           S_ISBLK(m)  block device?

           S_ISFIFO(m) FIFO (named pipe)?

           S_ISLNK(m)  symbolic link?  (Not in POSIX.1-1996.)

           S_ISSOCK(m) socket?  (Not in POSIX.1-1996.)

       The preceding code snippet could thus be rewritten as:

           stat(pathname, &sb);
           if (S_ISREG(sb.st_mode)) {
               /* Handle regular file */
           }
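
       As an illustration only, the macros above can be combined into a
       small classifier.  The sketch below assumes glibc 2.20 or later
       (for _DEFAULT_SOURCE); the helper names file_type_char() and
       path_type_char() and the one-letter codes (borrowed from the long
       listing of ls(1)) are hypothetical and not part of the stat() API.

```c
#define _DEFAULT_SOURCE 1   /* glibc 2.20+: expose S_IF* and S_ISSOCK() */
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical helper: map the file type bits of st_mode to the
   one-letter code that ls(1) prints in its long listing. */
char file_type_char(mode_t mode)
{
    if (S_ISREG(mode))  return '-';
    if (S_ISDIR(mode))  return 'd';
    if (S_ISCHR(mode))  return 'c';
    if (S_ISBLK(mode))  return 'b';
    if (S_ISFIFO(mode)) return 'p';
    if (S_ISLNK(mode))  return 'l';
    if (S_ISSOCK(mode)) return 's';
    return '?';                 /* type not covered by the macros above */
}

/* Classify pathname without following a final symbolic link.
   Returns the type letter, or '!' if lstat() fails (errno is set). */
char path_type_char(const char *pathname)
{
    struct stat sb;

    if (lstat(pathname, &sb) == -1)
        return '!';
    return file_type_char(sb.st_mode);
}
```

       lstat() is used here rather than stat() so that a symbolic link is
       reported as a link instead of as the file it refers to.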

       The definitions of most of the above file type test macros are
       provided if any of the following feature test macros is defined:
       _BSD_SOURCE (in glibc 2.19 and earlier), _SVID_SOURCE (in glibc 2.19
       and earlier), or _DEFAULT_SOURCE (in glibc 2.20 and later).  In
       addition, definitions of all of the above macros except S_IFSOCK and
       S_ISSOCK() are provided if _XOPEN_SOURCE is defined.  The definition
       of S_IFSOCK can also be exposed by defining _XOPEN_SOURCE with a
       value of 500 or greater.

       The definition of S_ISSOCK() is exposed if any of the following
       feature test macros is defined: _BSD_SOURCE (in glibc 2.19 and
       earlier), _DEFAULT_SOURCE (in glibc 2.20 and later), _XOPEN_SOURCE
       with a value of 500 or greater, or _POSIX_C_SOURCE with a value of
       200112L or greater.

       The following mask values are defined for the file mode component of
       the st_mode field:

           S_ISUID     04000   set-user-ID bit
           S_ISGID     02000   set-group-ID bit (see below)
           S_ISVTX     01000   sticky bit (see below)

           S_IRWXU     00700   owner has read, write, and execute permission
           S_IRUSR     00400   owner has read permission
           S_IWUSR     00200   owner has write permission
           S_IXUSR     00100   owner has execute permission

           S_IRWXG     00070   group has read, write, and execute permission
           S_IRGRP     00040   group has read permission
           S_IWGRP     00020   group has write permission
           S_IXGRP     00010   group has execute permission

           S_IRWXO     00007   others (not in group) have read, write, and
                               execute permission
           S_IROTH     00004   others have read permission
           S_IWOTH     00002   others have write permission
           S_IXOTH     00001   others have execute permission

       The set-group-ID bit (S_ISGID) has several special uses.  For a
       directory, it indicates that BSD semantics is to be used for that
       directory: files created there inherit their group ID from the
       directory, not from the effective group ID of the creating process,
       and directories created there will also get the S_ISGID bit set.  For
       a file that does not have the group execution bit (S_IXGRP) set, the
       set-group-ID bit indicates mandatory file/record locking.

       The sticky bit (S_ISVTX) on a directory means that a file in that
       directory can be renamed or deleted only by the owner of the file, by
       the owner of the directory, or by a privileged process.
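
       The mode masks listed above can be tested one bit at a time.  The
       following sketch renders the 12 file mode bits as the nine-
       character string shown by ls(1).  The function name perm_string()
       is hypothetical, and the s/S and t/T letters follow the ls(1)
       convention rather than anything defined by stat(); glibc 2.20 or
       later is assumed for _DEFAULT_SOURCE.

```c
#define _DEFAULT_SOURCE 1   /* glibc 2.20+: expose S_ISVTX and friends */
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical helper: write "rwxr-xr-x"-style text for the mode bits
   of `mode` into out, which must have room for 10 bytes. */
void perm_string(mode_t mode, char out[10])
{
    strcpy(out, "---------");

    if (mode & S_IRUSR) out[0] = 'r';
    if (mode & S_IWUSR) out[1] = 'w';
    if (mode & S_IXUSR) out[2] = 'x';
    if (mode & S_IRGRP) out[3] = 'r';
    if (mode & S_IWGRP) out[4] = 'w';
    if (mode & S_IXGRP) out[5] = 'x';
    if (mode & S_IROTH) out[6] = 'r';
    if (mode & S_IWOTH) out[7] = 'w';
    if (mode & S_IXOTH) out[8] = 'x';

    /* Set-user-ID, set-group-ID, and sticky bits replace the execute
       column: lowercase if execute is also set, uppercase otherwise. */
    if (mode & S_ISUID) out[2] = (mode & S_IXUSR) ? 's' : 'S';
    if (mode & S_ISGID) out[5] = (mode & S_IXGRP) ? 's' : 'S';
    if (mode & S_ISVTX) out[8] = (mode & S_IXOTH) ? 't' : 'T';
}
```

       A set-group-ID bit without group execute permission shows up as
       an uppercase S in the group column, which is the combination the
       text above associates with mandatory locking.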

   fstatat()
       The fstatat() system call operates in exactly the same way as stat(),
       except for the differences described here.

       If the pathname given in pathname is relative, then it is interpreted
       relative to the directory referred to by the file descriptor dirfd
       (rather than relative to the current working directory of the calling
       process, as is done by stat() for a relative pathname).

       If pathname is relative and dirfd is the special value AT_FDCWD, then
       pathname is interpreted relative to the current working directory of
       the calling process (like stat()).

       If pathname is absolute, then dirfd is ignored.

       flags can either be 0, or the bitwise OR of one or more of the
       following flags:

       AT_EMPTY_PATH (since Linux 2.6.39)
              If pathname is an empty string, operate on the file referred
              to by dirfd (which may have been obtained using the open(2)
              O_PATH flag).  If dirfd is AT_FDCWD, the call operates on the
              current working directory.  In this case, dirfd can refer to
              any type of file, not just a directory.  This flag is Linux-
              specific; define _GNU_SOURCE to obtain its definition.

       AT_NO_AUTOMOUNT (since Linux 2.6.38)
              Don't automount the terminal ("basename") component of
              pathname if it is a directory that is an automount point.
              This allows the caller to gather attributes of an automount
              point (rather than the location it would mount).  This flag
              can be used in tools that scan directories to prevent mass-
              automounting of a directory of automount points.  The
              AT_NO_AUTOMOUNT flag has no effect if the mount point has
              already been mounted over.  This flag is Linux-specific;
              define _GNU_SOURCE to obtain its definition.

       AT_SYMLINK_NOFOLLOW
              If pathname is a symbolic link, do not dereference it: instead
              return information about the link itself, like lstat().  (By
              default, fstatat() dereferences symbolic links, like stat().)

       See openat(2) for an explanation of the need for fstatat().
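
       A minimal sketch of the dirfd-relative lookup described above,
       using AT_SYMLINK_NOFOLLOW so that the call behaves like lstat().
       The function name is_symlink_in() is hypothetical, and glibc 2.20
       or later is assumed for _DEFAULT_SOURCE.

```c
#define _DEFAULT_SOURCE 1
#include <fcntl.h>      /* open(), O_DIRECTORY, AT_SYMLINK_NOFOLLOW */
#include <sys/stat.h>   /* fstatat(), S_ISLNK() */
#include <unistd.h>     /* close() */

/* Hypothetical helper: report whether `name`, interpreted relative to
   the directory `dirpath`, is a symbolic link.
   Returns 1 if it is, 0 if it is not, and -1 on error. */
int is_symlink_in(const char *dirpath, const char *name)
{
    struct stat sb;
    int dirfd, rc;

    dirfd = open(dirpath, O_RDONLY | O_DIRECTORY);
    if (dirfd == -1)
        return -1;

    /* name is resolved relative to dirfd, not the working directory;
       AT_SYMLINK_NOFOLLOW keeps a final symbolic link from being
       dereferenced. */
    rc = fstatat(dirfd, name, &sb, AT_SYMLINK_NOFOLLOW);
    close(dirfd);
    if (rc == -1)
        return -1;

    return S_ISLNK(sb.st_mode) ? 1 : 0;
}
```

       A program that examines many entries of one directory would
       normally open the directory once and reuse the descriptor; it is
       opened per call here only to keep the sketch short.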
http://man7.org/linux/man-pages/man2/lstat.2.html
12
SYSTEM CALL:
stat(2) - Linux manual page
FUNCTIONALITY:

       stat, fstat, lstat, fstatat - get file status
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/stat.h>
       #include <unistd.h>

       int stat(const char *pathname, struct stat *buf);
       int fstat(int fd, struct stat *buf);
       int lstat(const char *pathname, struct stat *buf);

       #include <fcntl.h>           /* Definition of AT_* constants */
       #include <sys/stat.h>

       int fstatat(int dirfd, const char *pathname, struct stat *buf,
                   int flags);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       lstat():
           /* glibc 2.19 and earlier */ _BSD_SOURCE
               || /* Since glibc 2.20 */ _DEFAULT_SOURCE
               || _XOPEN_SOURCE >= 500
               || /* Since glibc 2.10: */ _POSIX_C_SOURCE >= 200112L

       fstatat():
           Since glibc 2.10:
               _POSIX_C_SOURCE >= 200809L
           Before glibc 2.10:
               _ATFILE_SOURCE
DESCRIPTION

       These functions return information about a file, in the buffer
       pointed to by buf.  No permissions are required on the file itself,
       but—in the case of stat(), fstatat(), and lstat()—execute (search)
       permission is required on all of the directories in pathname that
       lead to the file.

       stat() and fstatat() retrieve information about the file pointed to
       by pathname; the differences for fstatat() are described below.

       lstat() is identical to stat(), except that if pathname is a symbolic
       link, then it returns information about the link itself, not the file
       that it refers to.

       fstat() is identical to stat(), except that the file about which
       information is to be retrieved is specified by the file descriptor
       fd.
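
       Because fstat() reports on an already open descriptor, it can be
       used to confirm that the file just opened is the one that was
       examined earlier with stat().  The sketch below compares the
       st_dev and st_ino fields, which together identify a file; the
       function name open_same_file() is hypothetical.

```c
#define _DEFAULT_SOURCE 1
#include <fcntl.h>      /* open() */
#include <sys/types.h>
#include <sys/stat.h>   /* fstat() */
#include <unistd.h>     /* close() */

/* Hypothetical helper: open pathname read-only and verify with fstat()
   that the descriptor refers to the file described by *expected
   (typically filled in by an earlier stat() call).
   Returns the open descriptor, or -1 on error or mismatch. */
int open_same_file(const char *pathname, const struct stat *expected)
{
    struct stat sb;
    int fd;

    fd = open(pathname, O_RDONLY);
    if (fd == -1)
        return -1;

    if (fstat(fd, &sb) == -1 ||
        sb.st_dev != expected->st_dev ||
        sb.st_ino != expected->st_ino) {
        close(fd);              /* not the file the caller expected */
        return -1;
    }
    return fd;
}
```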

       All of these system calls return a stat structure, which contains the
       following fields:

           struct stat {
               dev_t     st_dev;         /* ID of device containing file */
               ino_t     st_ino;         /* inode number */
               mode_t    st_mode;        /* file type and mode */
               nlink_t   st_nlink;       /* number of hard links */
               uid_t     st_uid;         /* user ID of owner */
               gid_t     st_gid;         /* group ID of owner */
               dev_t     st_rdev;        /* device ID (if special file) */
               off_t     st_size;        /* total size, in bytes */
               blksize_t st_blksize;     /* blocksize for filesystem I/O */
               blkcnt_t  st_blocks;      /* number of 512B blocks allocated */

               /* Since Linux 2.6, the kernel supports nanosecond
                  precision for the following timestamp fields.
                  For the details before Linux 2.6, see NOTES. */

               struct timespec st_atim;  /* time of last access */
               struct timespec st_mtim;  /* time of last modification */
               struct timespec st_ctim;  /* time of last status change */

           #define st_atime st_atim.tv_sec      /* Backward compatibility */
           #define st_mtime st_mtim.tv_sec
           #define st_ctime st_ctim.tv_sec
           };
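
       To illustrate the nanosecond timestamp fields, the following
       sketch orders two files by modification time using st_mtim
       instead of the whole-second st_mtime.  The function name
       mtime_cmp() is hypothetical; glibc 2.20 or later is assumed for
       _DEFAULT_SOURCE.

```c
#define _DEFAULT_SOURCE 1   /* expose the st_mtim member */
#include <sys/types.h>
#include <sys/stat.h>
#include <time.h>           /* struct timespec */

/* Hypothetical helper: compare the modification times of two stat
   buffers.  Returns -1 if a is older than b, 1 if a is newer than b,
   and 0 if both timestamps are identical. */
int mtime_cmp(const struct stat *a, const struct stat *b)
{
    /* Compare whole seconds first, then the nanosecond remainder. */
    if (a->st_mtim.tv_sec != b->st_mtim.tv_sec)
        return a->st_mtim.tv_sec < b->st_mtim.tv_sec ? -1 : 1;
    if (a->st_mtim.tv_nsec != b->st_mtim.tv_nsec)
        return a->st_mtim.tv_nsec < b->st_mtim.tv_nsec ? -1 : 1;
    return 0;
}
```

       Comparing only st_mtime would treat two modifications made
       within the same second as simultaneous.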

       Note: the order of fields in the stat structure varies somewhat
       across architectures.  In addition, the definition above does not
       show the padding bytes that may be present between some fields on
       various architectures.  Consult the glibc and kernel source code if
       you need to know the details.

       Note: For performance and simplicity reasons, different fields in the
       stat structure may contain state information from different moments
       during the execution of the system call.  For example, if st_mode or
       st_uid is changed by another process by calling chmod(2) or chown(2),
       stat() might return the old st_mode together with the new st_uid, or
       the old st_uid together with the new st_mode.

       The st_dev field describes the device on which this file resides.
       (The major(3) and minor(3) macros may be useful to decompose the
       device ID in this field.)

       The st_rdev field describes the device that this file (inode)
       represents.
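
       A short sketch of decomposing both device fields with major(3)
       and minor(3).  The function names format_dev() and
       print_device_ids() are hypothetical; <sys/sysmacros.h> is where
       glibc declares the macros.

```c
#define _DEFAULT_SOURCE 1
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>  /* major(), minor(), makedev() */

/* Hypothetical helper: format a device ID as "major:minor".
   Returns the value of snprintf(), that is, the formatted length. */
int format_dev(dev_t dev, char *buf, size_t len)
{
    return snprintf(buf, len, "%u:%u",
                    (unsigned int) major(dev), (unsigned int) minor(dev));
}

/* Print the device a file resides on (st_dev) and, for character and
   block special files, the device it represents (st_rdev).
   Returns 0 on success, -1 if stat() fails. */
int print_device_ids(const char *pathname)
{
    struct stat sb;
    char id[32];

    if (stat(pathname, &sb) == -1)
        return -1;

    format_dev(sb.st_dev, id, sizeof id);
    printf("%s resides on device %s\n", pathname, id);

    if (S_ISCHR(sb.st_mode) || S_ISBLK(sb.st_mode)) {
        format_dev(sb.st_rdev, id, sizeof id);
        printf("%s represents device %s\n", pathname, id);
    }
    return 0;
}
```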

       The st_size field gives the size of the file (if it is a regular file
       or a symbolic link) in bytes.  The size of a symbolic link is the
       length of the pathname it contains, without a terminating null byte.

       The st_blocks field indicates the number of blocks allocated to the
       file, in 512-byte units.  (This may be smaller than st_size/512 when
       the file has holes.)
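
       The relationship between st_blocks and st_size can be turned
       into a rough test for holes.  This is only a heuristic sketch:
       the function names are hypothetical, and some filesystems may
       report fewer blocks than expected for other reasons (for
       example, compression).

```c
#define _DEFAULT_SOURCE 1
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical helper: bytes actually allocated to the file, derived
   from st_blocks, which counts 512-byte units. */
off_t allocated_bytes(const struct stat *sb)
{
    return (off_t) sb->st_blocks * 512;
}

/* Heuristic: a regular file whose allocated space is smaller than its
   size probably contains holes.  Returns nonzero if so. */
int looks_sparse(const struct stat *sb)
{
    return S_ISREG(sb->st_mode) && allocated_bytes(sb) < sb->st_size;
}
```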

       The st_blksize field gives the "preferred" blocksize for efficient
       filesystem I/O.  (Writing to a file in smaller chunks may cause an
       inefficient read-modify-rewrite.)

       Not all of the Linux filesystems implement all of the time fields.
       Some filesystem types allow mounting in such a way that file and/or
       directory accesses do not cause an update of the st_atime field.
       (See noatime, nodiratime, and relatime in mount(8), and related
       information in mount(2).)  In addition, st_atime is not updated if a
       file is opened with the O_NOATIME flag; see open(2).

       The field st_atime is changed by file accesses, for example, by
       execve(2), mknod(2), pipe(2), utime(2), and read(2) (of more than
       zero bytes).  Other routines, like mmap(2), may or may not update
       st_atime.

       The field st_mtime is changed by file modifications, for example, by
       mknod(2), truncate(2), utime(2), and write(2) (of more than zero
       bytes).  Moreover, st_mtime of a directory is changed by the creation
       or deletion of files in that directory.  The st_mtime field is not
       changed for changes in owner, group, hard link count, or mode.

       The field st_ctime is changed by writing or by setting inode
       information (i.e., owner, group, link count, mode, etc.).

       POSIX refers to the st_mode bits corresponding to the mask S_IFMT
       (see below) as the file type, the 12 bits corresponding to the mask
       07777 as the file mode bits and the least significant 9 bits (0777)
       as the file permission bits.

       The following mask values are defined for the file type of the
       st_mode field:

           S_IFMT     0170000   bit mask for the file type bit field

           S_IFSOCK   0140000   socket
           S_IFLNK    0120000   symbolic link
           S_IFREG    0100000   regular file
           S_IFBLK    0060000   block device
           S_IFDIR    0040000   directory
           S_IFCHR    0020000   character device
           S_IFIFO    0010000   FIFO

       Thus, to test for a regular file (for example), one could write:

           stat(pathname, &sb);
           if ((sb.st_mode & S_IFMT) == S_IFREG) {
               /* Handle regular file */
           }

       Because tests of the above form are common, additional macros are
       defined by POSIX to allow the test of the file type in st_mode to be
       written more concisely:

           S_ISREG(m)  is it a regular file?

           S_ISDIR(m)  directory?

           S_ISCHR(m)  character device?

           S_ISBLK(m)  block device?

           S_ISFIFO(m) FIFO (named pipe)?

           S_ISLNK(m)  symbolic link?  (Not in POSIX.1-1996.)

           S_ISSOCK(m) socket?  (Not in POSIX.1-1996.)

       The preceding code snippet could thus be rewritten as:

           stat(pathname, &sb);
           if (S_ISREG(sb.st_mode)) {
               /* Handle regular file */
           }

       The definitions of most of the above file type test macros are
       provided if any of the following feature test macros is defined:
       _BSD_SOURCE (in glibc 2.19 and earlier), _SVID_SOURCE (in glibc 2.19
       and earlier), or _DEFAULT_SOURCE (in glibc 2.20 and later).  In
       addition, definitions of all of the above macros except S_IFSOCK and
       S_ISSOCK() are provided if _XOPEN_SOURCE is defined.  The definition
       of S_IFSOCK can also be exposed by defining _XOPEN_SOURCE with a
       value of 500 or greater.

       The definition of S_ISSOCK() is exposed if any of the following
       feature test macros is defined: _BSD_SOURCE (in glibc 2.19 and
       earlier), _DEFAULT_SOURCE (in glibc 2.20 and later), _XOPEN_SOURCE
       with a value of 500 or greater, or _POSIX_C_SOURCE with a value of
       200112L or greater.

       The following mask values are defined for the file mode component of
       the st_mode field:

           S_ISUID     04000   set-user-ID bit
           S_ISGID     02000   set-group-ID bit (see below)
           S_ISVTX     01000   sticky bit (see below)

           S_IRWXU     00700   owner has read, write, and execute permission
           S_IRUSR     00400   owner has read permission
           S_IWUSR     00200   owner has write permission
           S_IXUSR     00100   owner has execute permission

           S_IRWXG     00070   group has read, write, and execute permission
           S_IRGRP     00040   group has read permission
           S_IWGRP     00020   group has write permission
           S_IXGRP     00010   group has execute permission

           S_IRWXO     00007   others (not in group) have read, write, and
                               execute permission
           S_IROTH     00004   others have read permission
           S_IWOTH     00002   others have write permission
           S_IXOTH     00001   others have execute permission

       The set-group-ID bit (S_ISGID) has several special uses.  For a
       directory, it indicates that BSD semantics is to be used for that
       directory: files created there inherit their group ID from the
       directory, not from the effective group ID of the creating process,
       and directories created there will also get the S_ISGID bit set.  For
       a file that does not have the group execution bit (S_IXGRP) set, the
       set-group-ID bit indicates mandatory file/record locking.

       The sticky bit (S_ISVTX) on a directory means that a file in that
       directory can be renamed or deleted only by the owner of the file, by
       the owner of the directory, or by a privileged process.

   fstatat()
       The fstatat() system call operates in exactly the same way as stat(),
       except for the differences described here.

       If the pathname given in pathname is relative, then it is interpreted
       relative to the directory referred to by the file descriptor dirfd
       (rather than relative to the current working directory of the calling
       process, as is done by stat() for a relative pathname).

       If pathname is relative and dirfd is the special value AT_FDCWD, then
       pathname is interpreted relative to the current working directory of
       the calling process (like stat()).

       If pathname is absolute, then dirfd is ignored.

       flags can either be 0, or the bitwise OR of one or more of the
       following flags:

       AT_EMPTY_PATH (since Linux 2.6.39)
              If pathname is an empty string, operate on the file referred
              to by dirfd (which may have been obtained using the open(2)
              O_PATH flag).  If dirfd is AT_FDCWD, the call operates on the
              current working directory.  In this case, dirfd can refer to
              any type of file, not just a directory.  This flag is Linux-
              specific; define _GNU_SOURCE to obtain its definition.

       AT_NO_AUTOMOUNT (since Linux 2.6.38)
              Don't automount the terminal ("basename") component of
              pathname if it is a directory that is an automount point.
              This allows the caller to gather attributes of an automount
              point (rather than the location it would mount).  This flag
              can be used in tools that scan directories to prevent mass-
              automounting of a directory of automount points.  The
              AT_NO_AUTOMOUNT flag has no effect if the mount point has
              already been mounted over.  This flag is Linux-specific;
              define _GNU_SOURCE to obtain its definition.

       AT_SYMLINK_NOFOLLOW
              If pathname is a symbolic link, do not dereference it: instead
              return information about the link itself, like lstat().  (By
              default, fstatat() dereferences symbolic links, like stat().)

       See openat(2) for an explanation of the need for fstatat().
http://man7.org/linux/man-pages/man2/fstat.2.html
12
SYSTEM CALL:
stat(2) - Linux manual page
FUNCTIONALITY:

       stat, fstat, lstat, fstatat - get file status
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/stat.h>
       #include <unistd.h>

       int stat(const char *pathname, struct stat *buf);
       int fstat(int fd, struct stat *buf);
       int lstat(const char *pathname, struct stat *buf);

       #include <fcntl.h>           /* Definition of AT_* constants */
       #include <sys/stat.h>

       int fstatat(int dirfd, const char *pathname, struct stat *buf,
                   int flags);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       lstat():
           /* glibc 2.19 and earlier */ _BSD_SOURCE
               || /* Since glibc 2.20 */ _DEFAULT_SOURCE
               || _XOPEN_SOURCE >= 500
               || /* Since glibc 2.10: */ _POSIX_C_SOURCE >= 200112L

       fstatat():
           Since glibc 2.10:
               _POSIX_C_SOURCE >= 200809L
           Before glibc 2.10:
               _ATFILE_SOURCE
DESCRIPTION

       These functions return information about a file, in the buffer
       pointed to by buf.  No permissions are required on the file itself,
       but—in the case of stat(), fstatat(), and lstat()—execute (search)
       permission is required on all of the directories in pathname that
       lead to the file.

       stat() and fstatat() retrieve information about the file pointed to
       by pathname; the differences for fstatat() are described below.

       lstat() is identical to stat(), except that if pathname is a symbolic
       link, then it returns information about the link itself, not the file
       that it refers to.

       fstat() is identical to stat(), except that the file about which
       information is to be retrieved is specified by the file descriptor
       fd.

       All of these system calls return a stat structure, which contains the
       following fields:

           struct stat {
               dev_t     st_dev;         /* ID of device containing file */
               ino_t     st_ino;         /* inode number */
               mode_t    st_mode;        /* file type and mode */
               nlink_t   st_nlink;       /* number of hard links */
               uid_t     st_uid;         /* user ID of owner */
               gid_t     st_gid;         /* group ID of owner */
               dev_t     st_rdev;        /* device ID (if special file) */
               off_t     st_size;        /* total size, in bytes */
               blksize_t st_blksize;     /* blocksize for filesystem I/O */
               blkcnt_t  st_blocks;      /* number of 512B blocks allocated */

               /* Since Linux 2.6, the kernel supports nanosecond
                  precision for the following timestamp fields.
                  For the details before Linux 2.6, see NOTES. */

               struct timespec st_atim;  /* time of last access */
               struct timespec st_mtim;  /* time of last modification */
               struct timespec st_ctim;  /* time of last status change */

           #define st_atime st_atim.tv_sec      /* Backward compatibility */
           #define st_mtime st_mtim.tv_sec
           #define st_ctime st_ctim.tv_sec
           };

       Note: the order of fields in the stat structure varies somewhat
       across architectures.  In addition, the definition above does not
       show the padding bytes that may be present between some fields on
       various architectures.  Consult the glibc and kernel source code if
       you need to know the details.

       Note: For performance and simplicity reasons, different fields in the
       stat structure may contain state information from different moments
       during the execution of the system call.  For example, if st_mode or
       st_uid is changed by another process by calling chmod(2) or chown(2),
       stat() might return the old st_mode together with the new st_uid, or
       the old st_uid together with the new st_mode.

       The st_dev field describes the device on which this file resides.
       (The major(3) and minor(3) macros may be useful to decompose the
       device ID in this field.)

       The st_rdev field describes the device that this file (inode)
       represents.

       The st_size field gives the size of the file (if it is a regular file
       or a symbolic link) in bytes.  The size of a symbolic link is the
       length of the pathname it contains, without a terminating null byte.
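
       Since st_size of a symbolic link is the length of its target,
       it can be used to size the buffer passed to readlink(2).  The
       sketch below is one possible way to do this; the function name
       read_link_target() is hypothetical, and the code simply gives up
       if the link changes between the two calls or if the filesystem
       reports a size that is too small.

```c
#define _DEFAULT_SOURCE 1
#include <stdlib.h>     /* malloc(), free() */
#include <sys/types.h>
#include <sys/stat.h>   /* lstat(), S_ISLNK() */
#include <unistd.h>     /* readlink() */

/* Hypothetical helper: return a newly allocated, null-terminated copy
   of the target of the symbolic link pathname, or NULL on failure.
   The caller frees the result. */
char *read_link_target(const char *pathname)
{
    struct stat sb;
    size_t bufsiz;
    ssize_t n;
    char *buf;

    if (lstat(pathname, &sb) == -1 || !S_ISLNK(sb.st_mode))
        return NULL;

    /* st_size excludes the terminating null byte, so add one. */
    bufsiz = (size_t) sb.st_size + 1;
    buf = malloc(bufsiz);
    if (buf == NULL)
        return NULL;

    n = readlink(pathname, buf, bufsiz);
    if (n == -1 || (size_t) n >= bufsiz) {
        free(buf);      /* error, or target longer than st_size said */
        return NULL;
    }
    buf[n] = '\0';      /* readlink() does not append a null byte */
    return buf;
}
```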

       The st_blocks field indicates the number of blocks allocated to the
       file, in 512-byte units.  (This may be smaller than st_size/512 when
       the file has holes.)

       The st_blksize field gives the "preferred" blocksize for efficient
       filesystem I/O.  (Writing to a file in smaller chunks may cause an
       inefficient read-modify-rewrite.)

       Not all of the Linux filesystems implement all of the time fields.
       Some filesystem types allow mounting in such a way that file and/or
       directory accesses do not cause an update of the st_atime field.
       (See noatime, nodiratime, and relatime in mount(8), and related
       information in mount(2).)  In addition, st_atime is not updated if a
       file is opened with the O_NOATIME flag; see open(2).

       The field st_atime is changed by file accesses, for example, by
       execve(2), mknod(2), pipe(2), utime(2), and read(2) (of more than
       zero bytes).  Other routines, like mmap(2), may or may not update
       st_atime.

       The field st_mtime is changed by file modifications, for example, by
       mknod(2), truncate(2), utime(2), and write(2) (of more than zero
       bytes).  Moreover, st_mtime of a directory is changed by the creation
       or deletion of files in that directory.  The st_mtime field is not
       changed for changes in owner, group, hard link count, or mode.

       The field st_ctime is changed by writing or by setting inode
       information (i.e., owner, group, link count, mode, etc.).

       POSIX refers to the st_mode bits corresponding to the mask S_IFMT
       (see below) as the file type, the 12 bits corresponding to the mask
       07777 as the file mode bits and the least significant 9 bits (0777)
       as the file permission bits.

       The following mask values are defined for the file type of the
       st_mode field:

           S_IFMT     0170000   bit mask for the file type bit field

           S_IFSOCK   0140000   socket
           S_IFLNK    0120000   symbolic link
           S_IFREG    0100000   regular file
           S_IFBLK    0060000   block device
           S_IFDIR    0040000   directory
           S_IFCHR    0020000   character device
           S_IFIFO    0010000   FIFO

       Thus, to test for a regular file (for example), one could write:

           stat(pathname, &sb);
           if ((sb.st_mode & S_IFMT) == S_IFREG) {
               /* Handle regular file */
           }

       Because tests of the above form are common, additional macros are
       defined by POSIX to allow the test of the file type in st_mode to be
       written more concisely:

           S_ISREG(m)  is it a regular file?

           S_ISDIR(m)  directory?

           S_ISCHR(m)  character device?

           S_ISBLK(m)  block device?

           S_ISFIFO(m) FIFO (named pipe)?

           S_ISLNK(m)  symbolic link?  (Not in POSIX.1-1996.)

           S_ISSOCK(m) socket?  (Not in POSIX.1-1996.)

       The preceding code snippet could thus be rewritten as:

           stat(pathname, &sb);
           if (S_ISREG(sb.st_mode)) {
               /* Handle regular file */
           }

       The definitions of most of the above file type test macros are
       provided if any of the following feature test macros is defined:
       _BSD_SOURCE (in glibc 2.19 and earlier), _SVID_SOURCE (in glibc 2.19
       and earlier), or _DEFAULT_SOURCE (in glibc 2.20 and later).  In
       addition, definitions of all of the above macros except S_IFSOCK and
       S_ISSOCK() are provided if _XOPEN_SOURCE is defined.  The definition
       of S_IFSOCK can also be exposed by defining _XOPEN_SOURCE with a
       value of 500 or greater.

       The definition of S_ISSOCK() is exposed if any of the following
       feature test macros is defined: _BSD_SOURCE (in glibc 2.19 and
       earlier), _DEFAULT_SOURCE (in glibc 2.20 and later), _XOPEN_SOURCE
       with a value of 500 or greater, or _POSIX_C_SOURCE with a value of
       200112L or greater.

       The following mask values are defined for the file mode component of
       the st_mode field:

           S_ISUID     04000   set-user-ID bit
           S_ISGID     02000   set-group-ID bit (see below)
           S_ISVTX     01000   sticky bit (see below)

           S_IRWXU     00700   owner has read, write, and execute permission
           S_IRUSR     00400   owner has read permission
           S_IWUSR     00200   owner has write permission
           S_IXUSR     00100   owner has execute permission

           S_IRWXG     00070   group has read, write, and execute permission
           S_IRGRP     00040   group has read permission
           S_IWGRP     00020   group has write permission
           S_IXGRP     00010   group has execute permission

           S_IRWXO     00007   others (not in group) have read, write, and
                               execute permission
           S_IROTH     00004   others have read permission
           S_IWOTH     00002   others have write permission
           S_IXOTH     00001   others have execute permission

       The set-group-ID bit (S_ISGID) has several special uses.  For a
       directory, it indicates that BSD semantics is to be used for that
       directory: files created there inherit their group ID from the
       directory, not from the effective group ID of the creating process,
       and directories created there will also get the S_ISGID bit set.  For
       a file that does not have the group execution bit (S_IXGRP) set, the
       set-group-ID bit indicates mandatory file/record locking.

       The sticky bit (S_ISVTX) on a directory means that a file in that
       directory can be renamed or deleted only by the owner of the file, by
       the owner of the directory, or by a privileged process.

   fstatat()
       The fstatat() system call operates in exactly the same way as stat(),
       except for the differences described here.

       If the pathname given in pathname is relative, then it is interpreted
       relative to the directory referred to by the file descriptor dirfd
       (rather than relative to the current working directory of the calling
       process, as is done by stat() for a relative pathname).

       If pathname is relative and dirfd is the special value AT_FDCWD, then
       pathname is interpreted relative to the current working directory of
       the calling process (like stat()).

       If pathname is absolute, then dirfd is ignored.

       flags can either be 0, or the bitwise OR of one or more of the
       following flags:

       AT_EMPTY_PATH (since Linux 2.6.39)
              If pathname is an empty string, operate on the file referred
              to by dirfd (which may have been obtained using the open(2)
              O_PATH flag).  If dirfd is AT_FDCWD, the call operates on the
              current working directory.  In this case, dirfd can refer to
              any type of file, not just a directory.  This flag is Linux-
              specific; define _GNU_SOURCE to obtain its definition.

       AT_NO_AUTOMOUNT (since Linux 2.6.38)
              Don't automount the terminal ("basename") component of
              pathname if it is a directory that is an automount point.
              This allows the caller to gather attributes of an automount
              point (rather than the location it would mount).  This flag
              can be used in tools that scan directories to prevent mass-
              automounting of a directory of automount points.  The
              AT_NO_AUTOMOUNT flag has no effect if the mount point has
              already been mounted over.  This flag is Linux-specific;
              define _GNU_SOURCE to obtain its definition.

       AT_SYMLINK_NOFOLLOW
              If pathname is a symbolic link, do not dereference it: instead
              return information about the link itself, like lstat().  (By
              default, fstatat() dereferences symbolic links, like stat().)

       See openat(2) for an explanation of the need for fstatat().
http://man7.org/linux/man-pages/man2/fstatat.2.html
12
SYSTEM CALL:
stat(2) - Linux manual page
FUNCTIONALITY:

       stat, fstat, lstat, fstatat - get file status
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/stat.h>
       #include <unistd.h>

       int stat(const char *pathname, struct stat *buf);
       int fstat(int fd, struct stat *buf);
       int lstat(const char *pathname, struct stat *buf);

       #include <fcntl.h>           /* Definition of AT_* constants */
       #include <sys/stat.h>

       int fstatat(int dirfd, const char *pathname, struct stat *buf,
                   int flags);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       lstat():
           /* glibc 2.19 and earlier */ _BSD_SOURCE
               || /* Since glibc 2.20 */ _DEFAULT_SOURCE
               || _XOPEN_SOURCE >= 500
               || /* Since glibc 2.10: */ _POSIX_C_SOURCE >= 200112L

       fstatat():
           Since glibc 2.10:
               _POSIX_C_SOURCE >= 200809L
           Before glibc 2.10:
               _ATFILE_SOURCE
DESCRIPTION

       These functions return information about a file, in the buffer
       pointed to by buf.  No permissions are required on the file itself,
       but—in the case of stat(), fstatat(), and lstat()—execute (search)
       permission is required on all of the directories in pathname that
       lead to the file.

       stat() and fstatat() retrieve information about the file pointed to
       by pathname; the differences for fstatat() are described below.

       lstat() is identical to stat(), except that if pathname is a symbolic
       link, then it returns information about the link itself, not the file
       that it refers to.

       fstat() is identical to stat(), except that the file about which
       information is to be retrieved is specified by the file descriptor
       fd.

       All of these system calls return a stat structure, which contains the
       following fields:

           struct stat {
               dev_t     st_dev;         /* ID of device containing file */
               ino_t     st_ino;         /* inode number */
               mode_t    st_mode;        /* file type and mode */
               nlink_t   st_nlink;       /* number of hard links */
               uid_t     st_uid;         /* user ID of owner */
               gid_t     st_gid;         /* group ID of owner */
               dev_t     st_rdev;        /* device ID (if special file) */
               off_t     st_size;        /* total size, in bytes */
               blksize_t st_blksize;     /* blocksize for filesystem I/O */
               blkcnt_t  st_blocks;      /* number of 512B blocks allocated */

               /* Since Linux 2.6, the kernel supports nanosecond
                  precision for the following timestamp fields.
                  For the details before Linux 2.6, see NOTES. */

               struct timespec st_atim;  /* time of last access */
               struct timespec st_mtim;  /* time of last modification */
               struct timespec st_ctim;  /* time of last status change */

           #define st_atime st_atim.tv_sec      /* Backward compatibility */
           #define st_mtime st_mtim.tv_sec
           #define st_ctime st_ctim.tv_sec
           };

       Note: the order of fields in the stat structure varies somewhat
       across architectures.  In addition, the definition above does not
       show the padding bytes that may be present between some fields on
       various architectures.  Consult the glibc and kernel source code if
       you need to know the details.

       Note: For performance and simplicity reasons, different fields in the
       stat structure may contain state information from different moments
       during the execution of the system call.  For example, if st_mode or
       st_uid is changed by another process by calling chmod(2) or chown(2),
       stat() might return the old st_mode together with the new st_uid, or
       the old st_uid together with the new st_mode.

       The st_dev field describes the device on which this file resides.
       (The major(3) and minor(3) macros may be useful to decompose the
       device ID in this field.)
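
       As an illustration of decomposing st_dev, the following sketch
       formats the device ID of the filesystem holding a file as
       "major:minor".  The helper name device_of() is hypothetical, and
       the use of <sys/sysmacros.h> assumes a reasonably recent glibc; the
       text above only says that major(3) and minor(3) may be useful.

```c
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>   /* major(), minor(); glibc assumption */

/* Hypothetical helper: write "major:minor" for the device that holds
   pathname into buf.  Returns 0 on success, -1 if stat() fails or the
   buffer is too small. */
int device_of(const char *pathname, char *buf, size_t len)
{
    struct stat sb;
    int n;

    if (stat(pathname, &sb) == -1)
        return -1;

    n = snprintf(buf, len, "%u:%u",
                 (unsigned int) major(sb.st_dev),
                 (unsigned int) minor(sb.st_dev));

    /* snprintf() reports the length it wanted; treat truncation as
       an error. */
    return (n < 0 || (size_t) n >= len) ? -1 : 0;
}
```

       For a device special file, the same decomposition would apply to
       st_rdev rather than st_dev.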

       The st_rdev field describes the device that this file (inode)
       represents.

       The st_size field gives the size of the file (if it is a regular file
       or a symbolic link) in bytes.  The size of a symbolic link is the
       length of the pathname it contains, without a terminating null byte.

       The st_blocks field indicates the number of blocks allocated to the
       file, in 512-byte units.  (This may be smaller than st_size/512 when
       the file has holes.)
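
       The relationship between st_size and st_blocks can be sketched as
       a rough sparseness check.  The helper looks_sparse() is
       hypothetical, and the comparison is only a heuristic: filesystems
       may compress or preallocate data, so a result of 1 suggests holes
       but does not prove them.

```c
#define _XOPEN_SOURCE 700
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical helper: returns 1 if fewer bytes are allocated than the
   apparent size of the file, 0 if not, and -1 if stat() fails. */
int looks_sparse(const char *pathname)
{
    struct stat sb;
    long long allocated;

    if (stat(pathname, &sb) == -1)
        return -1;

    /* st_blocks is counted in 512-byte units, independent of
       st_blksize. */
    allocated = (long long) sb.st_blocks * 512;

    return allocated < (long long) sb.st_size ? 1 : 0;
}
```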

       The st_blksize field gives the "preferred" blocksize for efficient
       filesystem I/O.  (Writing to a file in smaller chunks may cause an
       inefficient read-modify-rewrite.)

       Not all of the Linux filesystems implement all of the time fields.
       Some filesystem types allow mounting in such a way that file and/or
       directory accesses do not cause an update of the st_atime field.
       (See noatime, nodiratime, and relatime in mount(8), and related
       information in mount(2).)  In addition, st_atime is not updated if a
       file is opened with the O_NOATIME flag; see open(2).

       The field st_atime is changed by file accesses, for example, by
       execve(2), mknod(2), pipe(2), utime(2), and read(2) (of more than
       zero bytes).  Other routines, like mmap(2), may or may not update
       st_atime.

       The field st_mtime is changed by file modifications, for example, by
       mknod(2), truncate(2), utime(2), and write(2) (of more than zero
       bytes).  Moreover, st_mtime of a directory is changed by the creation
       or deletion of files in that directory.  The st_mtime field is not
       changed for changes in owner, group, hard link count, or mode.

       The field st_ctime is changed by writing or by setting inode
       information (i.e., owner, group, link count, mode, etc.).
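
       A short sketch of using the nanosecond timestamp fields: the
       hypothetical helpers below compare the st_mtim values of two
       files.  Whether the nanosecond part is meaningful depends on the
       filesystem, as noted above, so this is an illustration rather than
       a reliable ordering test.

```c
#define _XOPEN_SOURCE 700
#include <sys/types.h>
#include <sys/stat.h>
#include <time.h>

/* Hypothetical helper: three-way comparison of two timestamps.
   Returns a negative value, zero, or a positive value. */
int timespec_cmp(struct timespec a, struct timespec b)
{
    if (a.tv_sec != b.tv_sec)
        return a.tv_sec < b.tv_sec ? -1 : 1;
    if (a.tv_nsec != b.tv_nsec)
        return a.tv_nsec < b.tv_nsec ? -1 : 1;
    return 0;
}

/* Hypothetical helper: returns 1 if path_a was modified more recently
   than path_b, 0 if not, and -1 if either stat() call fails. */
int newer_than(const char *path_a, const char *path_b)
{
    struct stat sa, sb;

    if (stat(path_a, &sa) == -1 || stat(path_b, &sb) == -1)
        return -1;

    return timespec_cmp(sa.st_mtim, sb.st_mtim) > 0 ? 1 : 0;
}
```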

       POSIX refers to the st_mode bits corresponding to the mask S_IFMT
       (see below) as the file type, the 12 bits corresponding to the mask
       07777 as the file mode bits and the least significant 9 bits (0777)
       as the file permission bits.

       The following mask values are defined for the file type of the
       st_mode field:

           S_IFMT     0170000   bit mask for the file type bit field

           S_IFSOCK   0140000   socket
           S_IFLNK    0120000   symbolic link
           S_IFREG    0100000   regular file
           S_IFBLK    0060000   block device
           S_IFDIR    0040000   directory
           S_IFCHR    0020000   character device
           S_IFIFO    0010000   FIFO

       Thus, to test for a regular file (for example), one could write:

           stat(pathname, &sb);
           if ((sb.st_mode & S_IFMT) == S_IFREG) {
               /* Handle regular file */
           }

       Because tests of the above form are common, additional macros are
       defined by POSIX to allow the test of the file type in st_mode to be
       written more concisely:

           S_ISREG(m)  is it a regular file?

           S_ISDIR(m)  directory?

           S_ISCHR(m)  character device?

           S_ISBLK(m)  block device?

           S_ISFIFO(m) FIFO (named pipe)?

           S_ISLNK(m)  symbolic link?  (Not in POSIX.1-1996.)

           S_ISSOCK(m) socket?  (Not in POSIX.1-1996.)

       The preceding code snippet could thus be rewritten as:

           stat(pathname, &sb);
           if (S_ISREG(sb.st_mode)) {
               /* Handle regular file */
           }
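
       Building on the macros above, the following sketch maps st_mode to
       a one-character type code in the style of ls -l.  The helper names
       are hypothetical.  lstat() is used so that a symbolic link is
       reported as a link; with stat() the link would be followed and
       S_ISLNK() would not match.

```c
#define _XOPEN_SOURCE 700
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical helper: one-character file type code for a st_mode
   value, or '?' if the type is not recognized. */
char file_type_char(mode_t m)
{
    if (S_ISREG(m))  return '-';
    if (S_ISDIR(m))  return 'd';
    if (S_ISLNK(m))  return 'l';
    if (S_ISCHR(m))  return 'c';
    if (S_ISBLK(m))  return 'b';
    if (S_ISFIFO(m)) return 'p';
    if (S_ISSOCK(m)) return 's';
    return '?';
}

/* Hypothetical helper: type code for pathname itself (symbolic links
   are not followed); '?' if lstat() fails. */
char path_type_char(const char *pathname)
{
    struct stat sb;

    if (lstat(pathname, &sb) == -1)
        return '?';
    return file_type_char(sb.st_mode);
}
```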

       The definitions of most of the above file type test macros are
       provided if any of the following feature test macros is defined:
       _BSD_SOURCE (in glibc 2.19 and earlier), _SVID_SOURCE (in glibc 2.19
       and earlier), or _DEFAULT_SOURCE (in glibc 2.20 and later).  In
       addition, definitions of all of the above macros except S_IFSOCK and
       S_ISSOCK() are provided if _XOPEN_SOURCE is defined.  The definition
       of S_IFSOCK can also be exposed by defining _XOPEN_SOURCE with a
       value of 500 or greater.

       The definition of S_ISSOCK() is exposed if any of the following
       feature test macros is defined: _BSD_SOURCE (in glibc 2.19 and
       earlier), _DEFAULT_SOURCE (in glibc 2.20 and later), _XOPEN_SOURCE
       with a value of 500 or greater, or _POSIX_C_SOURCE with a value of
       200112L or greater.

       The following mask values are defined for the file mode component of
       the st_mode field:

           S_ISUID     04000   set-user-ID bit
           S_ISGID     02000   set-group-ID bit (see below)
           S_ISVTX     01000   sticky bit (see below)

           S_IRWXU     00700   owner has read, write, and execute permission
           S_IRUSR     00400   owner has read permission
           S_IWUSR     00200   owner has write permission
           S_IXUSR     00100   owner has execute permission

           S_IRWXG     00070   group has read, write, and execute permission
           S_IRGRP     00040   group has read permission
           S_IWGRP     00020   group has write permission
           S_IXGRP     00010   group has execute permission

           S_IRWXO     00007   others (not in group) have read, write, and
                               execute permission
           S_IROTH     00004   others have read permission
           S_IWOTH     00002   others have write permission
           S_IXOTH     00001   others have execute permission

       The set-group-ID bit (S_ISGID) has several special uses.  For a
       directory, it indicates that BSD semantics is to be used for that
       directory: files created there inherit their group ID from the
       directory, not from the effective group ID of the creating process,
       and directories created there will also get the S_ISGID bit set.  For
       a file that does not have the group execution bit (S_IXGRP) set, the
       set-group-ID bit indicates mandatory file/record locking.

       The sticky bit (S_ISVTX) on a directory means that a file in that
       directory can be renamed or deleted only by the owner of the file, by
       the owner of the directory, or by a privileged process.
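
       The file mode masks listed above can be combined into the familiar
       nine-character permission string.  This is a sketch; the helper
       name mode_to_string() and the use of s/S and t/T for the
       set-user-ID, set-group-ID, and sticky bits follow the ls
       convention and are not taken from this page.

```c
#define _XOPEN_SOURCE 700
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical helper: render the 12 file mode bits of m as a string
   such as "rwxr-xr-x".  out must have room for 10 bytes. */
void mode_to_string(mode_t m, char out[10])
{
    static const mode_t bits[9] = {
        S_IRUSR, S_IWUSR, S_IXUSR,
        S_IRGRP, S_IWGRP, S_IXGRP,
        S_IROTH, S_IWOTH, S_IXOTH
    };
    static const char letters[] = "rwxrwxrwx";
    int i;

    memcpy(out, "---------", 10);   /* nine dashes plus the null byte */

    for (i = 0; i < 9; i++)
        if (m & bits[i])
            out[i] = letters[i];

    /* Special bits replace the execute position; upper case marks a
       special bit set without the matching execute bit. */
    if (m & S_ISUID)
        out[2] = (m & S_IXUSR) ? 's' : 'S';
    if (m & S_ISGID)
        out[5] = (m & S_IXGRP) ? 's' : 'S';
    if (m & S_ISVTX)
        out[8] = (m & S_IXOTH) ? 't' : 'T';
}
```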

   fstatat()
       The fstatat() system call operates in exactly the same way as stat(),
       except for the differences described here.

       If the pathname given in pathname is relative, then it is interpreted
       relative to the directory referred to by the file descriptor dirfd
       (rather than relative to the current working directory of the calling
       process, as is done by stat() for a relative pathname).

       If pathname is relative and dirfd is the special value AT_FDCWD, then
       pathname is interpreted relative to the current working directory of
       the calling process (like stat()).

       If pathname is absolute, then dirfd is ignored.
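
       The dirfd rules above can be sketched as follows.  The helper
       size_in_dir() is hypothetical: it opens a directory, then resolves
       a relative name against that descriptor instead of the current
       working directory.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: return the st_size of name, interpreted relative
   to the directory dirpath, or -1 on any error. */
long long size_in_dir(const char *dirpath, const char *name)
{
    struct stat sb;
    int dirfd, rc;

    dirfd = open(dirpath, O_RDONLY | O_DIRECTORY);
    if (dirfd == -1)
        return -1;

    /* flags is 0, so a symbolic link in name is followed, as with
       stat(). */
    rc = fstatat(dirfd, name, &sb, 0);
    close(dirfd);

    return rc == -1 ? -1 : (long long) sb.st_size;
}
```

       Passing AT_FDCWD instead of an open descriptor would make the
       fstatat() call behave like stat() for a relative name.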

       flags can either be 0, or a bit mask formed by ORing together one or
       more of the following flags:

       AT_EMPTY_PATH (since Linux 2.6.39)
              If pathname is an empty string, operate on the file referred
              to by dirfd (which may have been obtained using the open(2)
              O_PATH flag).  If dirfd is AT_FDCWD, the call operates on the
              current working directory.  In this case, dirfd can refer to
              any type of file, not just a directory.  This flag is Linux-
              specific; define _GNU_SOURCE to obtain its definition.

       AT_NO_AUTOMOUNT (since Linux 2.6.38)
              Don't automount the terminal ("basename") component of
              pathname if it is a directory that is an automount point.
              This allows the caller to gather attributes of an automount
              point (rather than the location it would mount).  This flag
              can be used in tools that scan directories to prevent mass-
              automounting of a directory of automount points.  The
              AT_NO_AUTOMOUNT flag has no effect if the mount point has
              already been mounted over.  This flag is Linux-specific;
              define _GNU_SOURCE to obtain its definition.

       AT_SYMLINK_NOFOLLOW
              If pathname is a symbolic link, do not dereference it: instead
              return information about the link itself, like lstat().  (By
              default, fstatat() dereferences symbolic links, like stat().)
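
       A minimal sketch of AT_SYMLINK_NOFOLLOW, using a hypothetical
       helper is_symlink_at().  Without the flag, the S_ISLNK() test
       could never succeed, because fstatat() would describe the target
       of the link.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>           /* AT_FDCWD, AT_SYMLINK_NOFOLLOW */
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical helper: returns 1 if name (relative to dirfd) is itself
   a symbolic link, 0 if it is not, and -1 if fstatat() fails. */
int is_symlink_at(int dirfd, const char *name)
{
    struct stat sb;

    if (fstatat(dirfd, name, &sb, AT_SYMLINK_NOFOLLOW) == -1)
        return -1;

    return S_ISLNK(sb.st_mode) ? 1 : 0;
}
```

       For example, is_symlink_at(AT_FDCWD, pathname) would give the same
       answer as testing S_ISLNK() on the result of lstat(pathname).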

       See openat(2) for an explanation of the need for fstatat().
http://man7.org/linux/man-pages/man2/chmod.2.html
11
SYSTEM CALL:
chmod(2) - Linux manual page
FUNCTIONALITY:

       chmod, fchmod, fchmodat - change permissions of a file
SYNOPSIS:

       #include <sys/stat.h>

       int chmod(const char *pathname, mode_t mode);
       int fchmod(int fd, mode_t mode);

       #include <fcntl.h>           /* Definition of AT_* constants */
       #include <sys/stat.h>

       int fchmodat(int dirfd, const char *pathname, mode_t mode, int flags);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       fchmod():
           /* Since glibc 2.16: */ _POSIX_C_SOURCE
               || /* Glibc versions <= 2.19: */ _BSD_SOURCE
               || /* Glibc versions <= 2.15: */ _XOPEN_SOURCE >= 500
               || /* Glibc 2.12 to 2.15: */ _POSIX_C_SOURCE >= 200809L

       fchmodat():
           Since glibc 2.10:
               _POSIX_C_SOURCE >= 200809L
           Before glibc 2.10:
               _ATFILE_SOURCE
DESCRIPTION

       The chmod() and fchmod() system calls change a file's mode bits.  (The
       file mode consists of the file permission bits plus the set-user-ID,
       set-group-ID, and sticky bits.)  These system calls differ only in
       how the file is specified:

       * chmod() changes the mode of the file whose pathname is
         given in pathname, which is dereferenced if it is a symbolic link.

       * fchmod() changes the mode of the file referred to by the open file
         descriptor fd.

       The new file mode is specified in mode, which is a bit mask created
       by ORing together zero or more of the following:

       S_ISUID  (04000)  set-user-ID (set process effective user ID on
                         execve(2))

       S_ISGID  (02000)  set-group-ID (set process effective group ID on
                         execve(2); mandatory locking, as described in
                         fcntl(2); take a new file's group from parent
                         directory, as described in chown(2) and mkdir(2))

       S_ISVTX  (01000)  sticky bit (restricted deletion flag, as described
                         in unlink(2))

       S_IRUSR  (00400)  read by owner

       S_IWUSR  (00200)  write by owner

       S_IXUSR  (00100)  execute/search by owner ("search" applies for
                         directories, and means that entries within the
                         directory can be accessed)

       S_IRGRP  (00040)  read by group

       S_IWGRP  (00020)  write by group

       S_IXGRP  (00010)  execute/search by group

       S_IROTH  (00004)  read by others

       S_IWOTH  (00002)  write by others

       S_IXOTH  (00001)  execute/search by others
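
       As a sketch of how the mode argument is built, the hypothetical
       helper below ORs together four of the constants above to obtain
       rw-r--r-- (0644) and applies it with chmod().  Error reporting via
       perror(3) is an illustrative choice.

```c
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical helper: set the mode of pathname to rw-r--r-- (0644).
   Returns 0 on success, -1 on failure. */
int make_world_readable(const char *pathname)
{
    mode_t mode = S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH;

    if (chmod(pathname, mode) == -1) {
        perror("chmod");
        return -1;
    }
    return 0;
}
```

       Because mode replaces all of the file mode bits, any set-user-ID,
       set-group-ID, or sticky bit that was previously set is cleared by
       this call.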

       The effective UID of the calling process must match the owner of the
       file, or the process must be privileged (Linux: it must have the
       CAP_FOWNER capability).

       If the calling process is not privileged (Linux: does not have the
       CAP_FSETID capability), and the group of the file does not match the
       effective group ID of the process or one of its supplementary group
       IDs, the S_ISGID bit will be turned off, but this will not cause an
       error to be returned.

       As a security measure, depending on the filesystem, the set-user-ID
       and set-group-ID execution bits may be turned off if a file is
       written.  (On Linux, this occurs if the writing process does not have
       the CAP_FSETID capability.)  On some filesystems, only the superuser
       can set the sticky bit, which may have a special meaning.  For the
       sticky bit, and for set-user-ID and set-group-ID bits on directories,
       see stat(2).

       On NFS filesystems, restricting the permissions will immediately
       influence already open files, because the access control is done on
       the server, but open files are maintained by the client.  Widening
       the permissions may be delayed for other clients if attribute caching
       is enabled on them.

   fchmodat()
       The fchmodat() system call operates in exactly the same way as
       chmod(), except for the differences described here.

       If the pathname given in pathname is relative, then it is interpreted
       relative to the directory referred to by the file descriptor dirfd
       (rather than relative to the current working directory of the calling
       process, as is done by chmod() for a relative pathname).

       If pathname is relative and dirfd is the special value AT_FDCWD, then
       pathname is interpreted relative to the current working directory of
       the calling process (like chmod()).

       If pathname is absolute, then dirfd is ignored.

       flags can either be 0, or include the following flag:

       AT_SYMLINK_NOFOLLOW
              If pathname is a symbolic link, do not dereference it: instead
              operate on the link itself.  This flag is not currently
              implemented.

       See openat(2) for an explanation of the need for fchmodat().
http://man7.org/linux/man-pages/man2/fchmod.2.html
11
SYSTEM CALL:
chmod(2) - Linux manual page
FUNCTIONALITY:

       chmod, fchmod, fchmodat - change permissions of a file
SYNOPSIS:

       #include <sys/stat.h>

       int chmod(const char *pathname, mode_t mode);
       int fchmod(int fd, mode_t mode);

       #include <fcntl.h>           /* Definition of AT_* constants */
       #include <sys/stat.h>

       int fchmodat(int dirfd, const char *pathname, mode_t mode, int flags);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       fchmod():
           /* Since glibc 2.16: */ _POSIX_C_SOURCE
               || /* Glibc versions <= 2.19: */ _BSD_SOURCE
               || /* Glibc versions <= 2.15: */ _XOPEN_SOURCE >= 500
               || /* Glibc 2.12 to 2.15: */ _POSIX_C_SOURCE >= 200809L

       fchmodat():
           Since glibc 2.10:
               _POSIX_C_SOURCE >= 200809L
           Before glibc 2.10:
               _ATFILE_SOURCE
DESCRIPTION

       The chmod() and fchmod() system calls change a file's mode bits.  (The
       file mode consists of the file permission bits plus the set-user-ID,
       set-group-ID, and sticky bits.)  These system calls differ only in
       how the file is specified:

       * chmod() changes the mode of the file whose pathname is
         given in pathname, which is dereferenced if it is a symbolic link.

       * fchmod() changes the mode of the file referred to by the open file
         descriptor fd.

       The new file mode is specified in mode, which is a bit mask created
       by ORing together zero or more of the following:

       S_ISUID  (04000)  set-user-ID (set process effective user ID on
                         execve(2))

       S_ISGID  (02000)  set-group-ID (set process effective group ID on
                         execve(2); mandatory locking, as described in
                         fcntl(2); take a new file's group from parent
                         directory, as described in chown(2) and mkdir(2))

       S_ISVTX  (01000)  sticky bit (restricted deletion flag, as described
                         in unlink(2))

       S_IRUSR  (00400)  read by owner

       S_IWUSR  (00200)  write by owner

       S_IXUSR  (00100)  execute/search by owner ("search" applies for
                         directories, and means that entries within the
                         directory can be accessed)

       S_IRGRP  (00040)  read by group

       S_IWGRP  (00020)  write by group

       S_IXGRP  (00010)  execute/search by group

       S_IROTH  (00004)  read by others

       S_IWOTH  (00002)  write by others

       S_IXOTH  (00001)  execute/search by others

       The effective UID of the calling process must match the owner of the
       file, or the process must be privileged (Linux: it must have the
       CAP_FOWNER capability).
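
       A sketch of fchmod() on an open descriptor, using a hypothetical
       helper make_private().  Changing the mode through the descriptor
       guarantees that the file affected is the one that was opened, even
       if the pathname is later renamed or replaced.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: restrict pathname to owner read/write (0600) by
   way of an open file descriptor.  Returns 0 on success, -1 on
   failure. */
int make_private(const char *pathname)
{
    int fd, rc;

    fd = open(pathname, O_RDONLY);
    if (fd == -1)
        return -1;

    rc = fchmod(fd, S_IRUSR | S_IWUSR);

    close(fd);
    return rc;
}
```

       As stated above, the call succeeds only if the effective UID of
       the caller matches the owner of the file or the caller is
       privileged.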

       If the calling process is not privileged (Linux: does not have the
       CAP_FSETID capability), and the group of the file does not match the
       effective group ID of the process or one of its supplementary group
       IDs, the S_ISGID bit will be turned off, but this will not cause an
       error to be returned.

       As a security measure, depending on the filesystem, the set-user-ID
       and set-group-ID execution bits may be turned off if a file is
       written.  (On Linux, this occurs if the writing process does not have
       the CAP_FSETID capability.)  On some filesystems, only the superuser
       can set the sticky bit, which may have a special meaning.  For the
       sticky bit, and for set-user-ID and set-group-ID bits on directories,
       see stat(2).

       On NFS filesystems, restricting the permissions will immediately
       influence already open files, because the access control is done on
       the server, but open files are maintained by the client.  Widening
       the permissions may be delayed for other clients if attribute caching
       is enabled on them.

   fchmodat()
       The fchmodat() system call operates in exactly the same way as
       chmod(), except for the differences described here.

       If the pathname given in pathname is relative, then it is interpreted
       relative to the directory referred to by the file descriptor dirfd
       (rather than relative to the current working directory of the calling
       process, as is done by chmod() for a relative pathname).

       If pathname is relative and dirfd is the special value AT_FDCWD, then
       pathname is interpreted relative to the current working directory of
       the calling process (like chmod()).

       If pathname is absolute, then dirfd is ignored.

       flags can either be 0, or include the following flag:

       AT_SYMLINK_NOFOLLOW
              If pathname is a symbolic link, do not dereference it: instead
              operate on the link itself.  This flag is not currently
              implemented.

       See openat(2) for an explanation of the need for fchmodat().
http://man7.org/linux/man-pages/man2/fchmodat.2.html
11
SYSTEM CALL:
chmod(2) - Linux manual page
FUNCTIONALITY:

       chmod, fchmod, fchmodat - change permissions of a file
SYNOPSIS:

       #include <sys/stat.h>

       int chmod(const char *pathname, mode_t mode);
       int fchmod(int fd, mode_t mode);

       #include <fcntl.h>           /* Definition of AT_* constants */
       #include <sys/stat.h>

       int fchmodat(int dirfd, const char *pathname, mode_t mode, int flags);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       fchmod():
           /* Since glibc 2.16: */ _POSIX_C_SOURCE
               || /* Glibc versions <= 2.19: */ _BSD_SOURCE
               || /* Glibc versions <= 2.15: */ _XOPEN_SOURCE >= 500
               || /* Glibc 2.12 to 2.15: */ _POSIX_C_SOURCE >= 200809L

       fchmodat():
           Since glibc 2.10:
               _POSIX_C_SOURCE >= 200809L
           Before glibc 2.10:
               _ATFILE_SOURCE
DESCRIPTION

       The chmod() and fchmod() system calls change a file's mode bits.  (The
       file mode consists of the file permission bits plus the set-user-ID,
       set-group-ID, and sticky bits.)  These system calls differ only in
       how the file is specified:

       * chmod() changes the mode of the file whose pathname is
         given in pathname, which is dereferenced if it is a symbolic link.

       * fchmod() changes the mode of the file referred to by the open file
         descriptor fd.

       The new file mode is specified in mode, which is a bit mask created
       by ORing together zero or more of the following:

       S_ISUID  (04000)  set-user-ID (set process effective user ID on
                         execve(2))

       S_ISGID  (02000)  set-group-ID (set process effective group ID on
                         execve(2); mandatory locking, as described in
                         fcntl(2); take a new file's group from parent
                         directory, as described in chown(2) and mkdir(2))

       S_ISVTX  (01000)  sticky bit (restricted deletion flag, as described
                         in unlink(2))

       S_IRUSR  (00400)  read by owner

       S_IWUSR  (00200)  write by owner

       S_IXUSR  (00100)  execute/search by owner ("search" applies for
                         directories, and means that entries within the
                         directory can be accessed)

       S_IRGRP  (00040)  read by group

       S_IWGRP  (00020)  write by group

       S_IXGRP  (00010)  execute/search by group

       S_IROTH  (00004)  read by others

       S_IWOTH  (00002)  write by others

       S_IXOTH  (00001)  execute/search by others

       The effective UID of the calling process must match the owner of the
       file, or the process must be privileged (Linux: it must have the
       CAP_FOWNER capability).

       If the calling process is not privileged (Linux: does not have the
       CAP_FSETID capability), and the group of the file does not match the
       effective group ID of the process or one of its supplementary group
       IDs, the S_ISGID bit will be turned off, but this will not cause an
       error to be returned.

       As a security measure, depending on the filesystem, the set-user-ID
       and set-group-ID execution bits may be turned off if a file is
       written.  (On Linux, this occurs if the writing process does not have
       the CAP_FSETID capability.)  On some filesystems, only the superuser
       can set the sticky bit, which may have a special meaning.  For the
       sticky bit, and for set-user-ID and set-group-ID bits on directories,
       see stat(2).

       On NFS filesystems, restricting the permissions will immediately
       influence already open files, because the access control is done on
       the server, but open files are maintained by the client.  Widening
       the permissions may be delayed for other clients if attribute caching
       is enabled on them.

   fchmodat()
       The fchmodat() system call operates in exactly the same way as
       chmod(), except for the differences described here.

       If the pathname given in pathname is relative, then it is interpreted
       relative to the directory referred to by the file descriptor dirfd
       (rather than relative to the current working directory of the calling
       process, as is done by chmod() for a relative pathname).

       If pathname is relative and dirfd is the special value AT_FDCWD, then
       pathname is interpreted relative to the current working directory of
       the calling process (like chmod()).

       If pathname is absolute, then dirfd is ignored.

       flags can either be 0, or include the following flag:

       AT_SYMLINK_NOFOLLOW
              If pathname is a symbolic link, do not dereference it: instead
              operate on the link itself.  This flag is not currently
              implemented.
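
       A sketch of fchmodat() with a directory file descriptor.  The
       helper chmod_in_dir() is hypothetical.  flags is passed as 0
       because, as noted above, AT_SYMLINK_NOFOLLOW is not currently
       implemented for this call.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>           /* open(), O_DIRECTORY */
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: change the mode of name, interpreted relative to
   the directory dirpath.  Returns 0 on success, -1 on failure. */
int chmod_in_dir(const char *dirpath, const char *name, mode_t mode)
{
    int dirfd, rc;

    dirfd = open(dirpath, O_RDONLY | O_DIRECTORY);
    if (dirfd == -1)
        return -1;

    rc = fchmodat(dirfd, name, mode, 0);

    close(dirfd);
    return rc;
}
```

       A program that changes many files in one directory would normally
       open the directory once and reuse the descriptor rather than
       reopen it for each call as this sketch does.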

       See openat(2) for an explanation of the need for fchmodat().
http://man7.org/linux/man-pages/man2/chown.2.html
12
SYSTEM CALL:
chown(2) - Linux manual page
FUNCTIONALITY:

       chown, fchown, lchown, fchownat - change ownership of a file
SYNOPSIS:

       #include <unistd.h>

       int chown(const char *pathname, uid_t owner, gid_t group);
       int fchown(int fd, uid_t owner, gid_t group);
       int lchown(const char *pathname, uid_t owner, gid_t group);

       #include <fcntl.h>           /* Definition of AT_* constants */
       #include <unistd.h>

       int fchownat(int dirfd, const char *pathname,
                    uid_t owner, gid_t group, int flags);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       fchown(), lchown():
           /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
               || _XOPEN_SOURCE >= 500
               || /* Glibc versions <= 2.19: */ _BSD_SOURCE

       fchownat():
           Since glibc 2.10:
               _POSIX_C_SOURCE >= 200809L
           Before glibc 2.10:
               _ATFILE_SOURCE
DESCRIPTION

       These system calls change the owner and group of a file.  The
       chown(), fchown(), and lchown() system calls differ only in how the
       file is specified:

       * chown() changes the ownership of the file specified by pathname,
         which is dereferenced if it is a symbolic link.

       * fchown() changes the ownership of the file referred to by the open
         file descriptor fd.

       * lchown() is like chown(), but does not dereference symbolic links.

       Only a privileged process (Linux: one with the CAP_CHOWN capability)
       may change the owner of a file.  The owner of a file may change the
       group of the file to any group of which that owner is a member.  A
       privileged process (Linux: with CAP_CHOWN) may change the group
       arbitrarily.

       If the owner or group is specified as -1, then that ID is not
       changed.
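
       The -1 convention can be sketched with a hypothetical helper that
       changes only the group of a file and then checks that the owner
       was left alone.  The verification step is illustrative only; it is
       not atomic with the chown() call.

```c
#define _XOPEN_SOURCE 700
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: set the group of pathname to group without
   touching the owner.  Returns 0 if the change succeeded and the owner
   is unchanged, -1 otherwise. */
int change_group_only(const char *pathname, gid_t group)
{
    struct stat before, after;

    if (stat(pathname, &before) == -1)
        return -1;

    /* (uid_t) -1 means "do not change the owner". */
    if (chown(pathname, (uid_t) -1, group) == -1)
        return -1;

    if (stat(pathname, &after) == -1)
        return -1;

    return (after.st_uid == before.st_uid && after.st_gid == group)
               ? 0 : -1;
}
```

       An unprivileged owner can use this only with a group of which the
       owner is a member, as described above.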

       When the owner or group of an executable file is changed by an
       unprivileged user, the S_ISUID and S_ISGID mode bits are cleared.
       POSIX does not specify whether this also should happen when root does
       the chown(); the Linux behavior depends on the kernel version.  In
       case of a non-group-executable file (i.e., one for which the S_IXGRP
       bit is not set) the S_ISGID bit indicates mandatory locking, and is
       not cleared by a chown().

   fchownat()
       The fchownat() system call operates in exactly the same way as
       chown(), except for the differences described here.

       If the pathname given in pathname is relative, then it is interpreted
       relative to the directory referred to by the file descriptor dirfd
       (rather than relative to the current working directory of the calling
       process, as is done by chown() for a relative pathname).

       If pathname is relative and dirfd is the special value AT_FDCWD, then
       pathname is interpreted relative to the current working directory of
       the calling process (like chown()).

       If pathname is absolute, then dirfd is ignored.

       The flags argument is a bit mask created by ORing together 0 or more
       of the following values:

       AT_EMPTY_PATH (since Linux 2.6.39)
              If pathname is an empty string, operate on the file referred
              to by dirfd (which may have been obtained using the open(2)
              O_PATH flag).  In this case, dirfd can refer to any type of
              file, not just a directory.  If dirfd is AT_FDCWD, the call
              operates on the current working directory.  This flag is
              Linux-specific; define _GNU_SOURCE to obtain its definition.

       AT_SYMLINK_NOFOLLOW
              If pathname is a symbolic link, do not dereference it: instead
              operate on the link itself, like lchown().  (By default,
              fchownat() dereferences symbolic links, like chown().)

       See openat(2) for an explanation of the need for fchownat().
http://man7.org/linux/man-pages/man2/lchown.2.html
12
SYSTEM CALL:
chown(2) - Linux manual page
FUNCTIONALITY:

       chown, fchown, lchown, fchownat - change ownership of a file
SYNOPSIS:

       #include <unistd.h>

       int chown(const char *pathname, uid_t owner, gid_t group);
       int fchown(int fd, uid_t owner, gid_t group);
       int lchown(const char *pathname, uid_t owner, gid_t group);

       #include <fcntl.h>           /* Definition of AT_* constants */
       #include <unistd.h>

       int fchownat(int dirfd, const char *pathname,
                    uid_t owner, gid_t group, int flags);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       fchown(), lchown():
           /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
               || _XOPEN_SOURCE >= 500
               || /* Glibc versions <= 2.19: */ _BSD_SOURCE

       fchownat():
           Since glibc 2.10:
               _POSIX_C_SOURCE >= 200809L
           Before glibc 2.10:
               _ATFILE_SOURCE
DESCRIPTION

       These system calls change the owner and group of a file.  The
       chown(), fchown(), and lchown() system calls differ only in how the
       file is specified:

       * chown() changes the ownership of the file specified by pathname,
         which is dereferenced if it is a symbolic link.

       * fchown() changes the ownership of the file referred to by the open
         file descriptor fd.

       * lchown() is like chown(), but does not dereference symbolic links.

       Only a privileged process (Linux: one with the CAP_CHOWN capability)
       may change the owner of a file.  The owner of a file may change the
       group of the file to any group of which that owner is a member.  A
       privileged process (Linux: with CAP_CHOWN) may change the group
       arbitrarily.

       If the owner or group is specified as -1, then that ID is not
       changed.

       When the owner or group of an executable file is changed by an
       unprivileged user, the S_ISUID and S_ISGID mode bits are cleared.
       POSIX does not specify whether this also should happen when root does
       the chown(); the Linux behavior depends on the kernel version.  In
       the case of a non-group-executable file (i.e., one for which the
       S_IXGRP bit is not set), the S_ISGID bit indicates mandatory locking
       and is not cleared by a chown().

   fchownat()
       The fchownat() system call operates in exactly the same way as
       chown(), except for the differences described here.

       If the pathname given in pathname is relative, then it is interpreted
       relative to the directory referred to by the file descriptor dirfd
       (rather than relative to the current working directory of the calling
       process, as is done by chown() for a relative pathname).

       If pathname is relative and dirfd is the special value AT_FDCWD, then
       pathname is interpreted relative to the current working directory of
       the calling process (like chown()).

       If pathname is absolute, then dirfd is ignored.

       The flags argument is a bit mask created by ORing together 0 or more
       of the following values:

       AT_EMPTY_PATH (since Linux 2.6.39)
              If pathname is an empty string, operate on the file referred
              to by dirfd (which may have been obtained using the open(2)
              O_PATH flag).  In this case, dirfd can refer to any type of
              file, not just a directory.  If dirfd is AT_FDCWD, the call
              operates on the current working directory.  This flag is
              Linux-specific; define _GNU_SOURCE to obtain its definition.

       AT_SYMLINK_NOFOLLOW
              If pathname is a symbolic link, do not dereference it: instead
              operate on the link itself, like lchown().  (By default,
              fchownat() dereferences symbolic links, like chown().)

       See openat(2) for an explanation of the need for fchownat().
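
       The -1 convention and the flags argument described above can be
       sketched as follows.  This is a minimal illustration, not part of
       the manual page: set_group_only() and demo_set_group_only() are
       hypothetical names, and the demo assumes that /tmp is writable and
       that the caller may set a file's group to its own effective GID.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>      /* AT_FDCWD, AT_SYMLINK_NOFOLLOW */
#include <stdlib.h>     /* mkstemp() */
#include <sys/stat.h>   /* fstat(), struct stat */
#include <sys/types.h>
#include <unistd.h>     /* fchownat(), getegid(), close(), unlink() */

/* Hypothetical helper: change only the group of `path`.
 * Passing (uid_t)-1 as the owner asks the kernel to leave that ID unchanged.
 * When follow_links is 0, a symbolic link itself is changed, as with lchown(). */
int set_group_only(const char *path, gid_t group, int follow_links)
{
    assert(path != NULL);
    int flags = follow_links ? 0 : AT_SYMLINK_NOFOLLOW;
    return fchownat(AT_FDCWD, path, (uid_t)-1, group, flags);
}

/* Hypothetical self-check: returns 0 if the owner was preserved and the
 * group ended up as the caller's effective GID, -1 otherwise. */
int demo_set_group_only(void)
{
    char path[] = "/tmp/chown_demo_XXXXXX";   /* assumed writable directory */
    struct stat before, after;
    int rc = -1;

    int fd = mkstemp(path);
    if (fd == -1)
        return -1;

    if (fstat(fd, &before) == 0 &&
        set_group_only(path, getegid(), 1) == 0 &&
        fstat(fd, &after) == 0 &&
        after.st_uid == before.st_uid &&
        after.st_gid == getegid())
        rc = 0;

    close(fd);
    unlink(path);
    return rc;
}
```

       Because dirfd is AT_FDCWD, a relative path is resolved against the
       current working directory, so the call behaves like chown() or
       lchown() depending on follow_links.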
http://man7.org/linux/man-pages/man2/fchown.2.html
12
SYSTEM CALL:
chown(2) - Linux manual page
FUNCTIONALITY:

       chown, fchown, lchown, fchownat - change ownership of a file
SYNOPSIS:

       #include <unistd.h>

       int chown(const char *pathname, uid_t owner, gid_t group);
       int fchown(int fd, uid_t owner, gid_t group);
       int lchown(const char *pathname, uid_t owner, gid_t group);

       #include <fcntl.h>           /* Definition of AT_* constants */
       #include <unistd.h>

       int fchownat(int dirfd, const char *pathname,
                    uid_t owner, gid_t group, int flags);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       fchown(), lchown():
           /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
               || _XOPEN_SOURCE >= 500
               || /* Glibc versions <= 2.19: */ _BSD_SOURCE

       fchownat():
           Since glibc 2.10:
               _POSIX_C_SOURCE >= 200809L
           Before glibc 2.10:
               _ATFILE_SOURCE
DESCRIPTION

       These system calls change the owner and group of a file.  The
       chown(), fchown(), and lchown() system calls differ only in how the
       file is specified:

       * chown() changes the ownership of the file specified by pathname,
         which is dereferenced if it is a symbolic link.

       * fchown() changes the ownership of the file referred to by the open
         file descriptor fd.

       * lchown() is like chown(), but does not dereference symbolic links.

       Only a privileged process (Linux: one with the CAP_CHOWN capability)
       may change the owner of a file.  The owner of a file may change the
       group of the file to any group of which that owner is a member.  A
       privileged process (Linux: with CAP_CHOWN) may change the group
       arbitrarily.

       If the owner or group is specified as -1, then that ID is not
       changed.

       When the owner or group of an executable file is changed by an
       unprivileged user, the S_ISUID and S_ISGID mode bits are cleared.
       POSIX does not specify whether this also should happen when root does
       the chown(); the Linux behavior depends on the kernel version.  In
       the case of a non-group-executable file (i.e., one for which the
       S_IXGRP bit is not set), the S_ISGID bit indicates mandatory locking
       and is not cleared by a chown().

   fchownat()
       The fchownat() system call operates in exactly the same way as
       chown(), except for the differences described here.

       If the pathname given in pathname is relative, then it is interpreted
       relative to the directory referred to by the file descriptor dirfd
       (rather than relative to the current working directory of the calling
       process, as is done by chown() for a relative pathname).

       If pathname is relative and dirfd is the special value AT_FDCWD, then
       pathname is interpreted relative to the current working directory of
       the calling process (like chown()).

       If pathname is absolute, then dirfd is ignored.

       The flags argument is a bit mask created by ORing together 0 or more
       of the following values:

       AT_EMPTY_PATH (since Linux 2.6.39)
              If pathname is an empty string, operate on the file referred
              to by dirfd (which may have been obtained using the open(2)
              O_PATH flag).  In this case, dirfd can refer to any type of
              file, not just a directory.  If dirfd is AT_FDCWD, the call
              operates on the current working directory.  This flag is
              Linux-specific; define _GNU_SOURCE to obtain its definition.

       AT_SYMLINK_NOFOLLOW
              If pathname is a symbolic link, do not dereference it: instead
              operate on the link itself, like lchown().  (By default,
              fchownat() dereferences symbolic links, like chown().)

       See openat(2) for an explanation of the need for fchownat().
http://man7.org/linux/man-pages/man2/fchownat.2.html
12
SYSTEM CALL:
chown(2) - Linux manual page
FUNCTIONALITY:

       chown, fchown, lchown, fchownat - change ownership of a file
SYNOPSIS:

       #include <unistd.h>

       int chown(const char *pathname, uid_t owner, gid_t group);
       int fchown(int fd, uid_t owner, gid_t group);
       int lchown(const char *pathname, uid_t owner, gid_t group);

       #include <fcntl.h>           /* Definition of AT_* constants */
       #include <unistd.h>

       int fchownat(int dirfd, const char *pathname,
                    uid_t owner, gid_t group, int flags);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       fchown(), lchown():
           /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
               || _XOPEN_SOURCE >= 500
               || /* Glibc versions <= 2.19: */ _BSD_SOURCE

       fchownat():
           Since glibc 2.10:
               _POSIX_C_SOURCE >= 200809L
           Before glibc 2.10:
               _ATFILE_SOURCE
DESCRIPTION

       These system calls change the owner and group of a file.  The
       chown(), fchown(), and lchown() system calls differ only in how the
       file is specified:

       * chown() changes the ownership of the file specified by pathname,
         which is dereferenced if it is a symbolic link.

       * fchown() changes the ownership of the file referred to by the open
         file descriptor fd.

       * lchown() is like chown(), but does not dereference symbolic links.

       Only a privileged process (Linux: one with the CAP_CHOWN capability)
       may change the owner of a file.  The owner of a file may change the
       group of the file to any group of which that owner is a member.  A
       privileged process (Linux: with CAP_CHOWN) may change the group
       arbitrarily.

       If the owner or group is specified as -1, then that ID is not
       changed.

       When the owner or group of an executable file is changed by an
       unprivileged user, the S_ISUID and S_ISGID mode bits are cleared.
       POSIX does not specify whether this also should happen when root does
       the chown(); the Linux behavior depends on the kernel version.  In
       the case of a non-group-executable file (i.e., one for which the
       S_IXGRP bit is not set), the S_ISGID bit indicates mandatory locking
       and is not cleared by a chown().

   fchownat()
       The fchownat() system call operates in exactly the same way as
       chown(), except for the differences described here.

       If the pathname given in pathname is relative, then it is interpreted
       relative to the directory referred to by the file descriptor dirfd
       (rather than relative to the current working directory of the calling
       process, as is done by chown() for a relative pathname).

       If pathname is relative and dirfd is the special value AT_FDCWD, then
       pathname is interpreted relative to the current working directory of
       the calling process (like chown()).

       If pathname is absolute, then dirfd is ignored.

       The flags argument is a bit mask created by ORing together 0 or more
       of the following values:

       AT_EMPTY_PATH (since Linux 2.6.39)
              If pathname is an empty string, operate on the file referred
              to by dirfd (which may have been obtained using the open(2)
              O_PATH flag).  In this case, dirfd can refer to any type of
              file, not just a directory.  If dirfd is AT_FDCWD, the call
              operates on the current working directory.  This flag is
              Linux-specific; define _GNU_SOURCE to obtain its definition.

       AT_SYMLINK_NOFOLLOW
              If pathname is a symbolic link, do not dereference it: instead
              operate on the link itself, like lchown().  (By default,
              fchownat() dereferences symbolic links, like chown().)

       See openat(2) for an explanation of the need for fchownat().
http://man7.org/linux/man-pages/man2/utime.2.html
10
SYSTEM CALL:
utime(2) - Linux manual page
FUNCTIONALITY:

       utime, utimes - change file last access and modification times
SYNOPSIS:

       #include <sys/types.h>
       #include <utime.h>

       int utime(const char *filename, const struct utimbuf *times);

       #include <sys/time.h>

       int utimes(const char *filename, const struct timeval times[2]);
DESCRIPTION

       Note: modern applications may prefer to use the interfaces described
       in utimensat(2).

       The utime() system call changes the access and modification times of
       the inode specified by filename to the actime and modtime fields of
       times respectively.

       If times is NULL, then the access and modification times of the file
       are set to the current time.

       Changing timestamps is permitted when: either the process has
       appropriate privileges, or the effective user ID equals the user ID
       of the file, or times is NULL and the process has write permission
       for the file.

       The utimbuf structure is:

           struct utimbuf {
               time_t actime;       /* access time */
               time_t modtime;      /* modification time */
           };

       The utime() system call allows specification of timestamps with a
       resolution of 1 second.

       The utimes() system call is similar, but the times argument refers to
       an array rather than a structure.  The elements of this array are
       timeval structures, which allow a precision of 1 microsecond for
       specifying timestamps.  The timeval structure is:

           struct timeval {
               long tv_sec;        /* seconds */
               long tv_usec;       /* microseconds */
           };

       times[0] specifies the new access time, and times[1] specifies the
       new modification time.  If times is NULL, then analogously to
       utime(), the access and modification times of the file are set to the
       current time.
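
       The times[0]/times[1] layout used by utimes() can be sketched as
       below.  This is an illustrative example only: set_times_usec() and
       demo_set_times_usec() are hypothetical names, the timestamps are
       arbitrary, and the demo assumes a writable /tmp.  Only whole seconds
       are compared, since a filesystem may store timestamps more coarsely
       than the microseconds requested.

```c
#include <assert.h>
#include <stdlib.h>     /* mkstemp() */
#include <sys/stat.h>   /* stat(), struct stat */
#include <sys/time.h>   /* utimes(), struct timeval */
#include <unistd.h>     /* close(), unlink() */

/* Hypothetical helper: set the access and modification times of `path`
 * from seconds/microseconds pairs. */
int set_times_usec(const char *path,
                   long atime_sec, long atime_usec,
                   long mtime_sec, long mtime_usec)
{
    assert(path != NULL);
    struct timeval times[2];
    times[0].tv_sec  = atime_sec;    /* times[0]: new access time */
    times[0].tv_usec = atime_usec;
    times[1].tv_sec  = mtime_sec;    /* times[1]: new modification time */
    times[1].tv_usec = mtime_usec;
    return utimes(path, times);
}

/* Hypothetical self-check: returns 0 if the seconds read back through
 * stat() match what was requested, -1 otherwise. */
int demo_set_times_usec(void)
{
    char path[] = "/tmp/utimes_demo_XXXXXX";  /* assumed writable directory */
    struct stat st;
    int rc = -1;

    int fd = mkstemp(path);
    if (fd == -1)
        return -1;

    if (set_times_usec(path, 1000000000L, 250000L,
                             1000000001L, 750000L) == 0 &&
        stat(path, &st) == 0 &&
        st.st_atime == 1000000000L &&
        st.st_mtime == 1000000001L)
        rc = 0;

    close(fd);
    unlink(path);
    return rc;
}
```

       Passing NULL instead of the array would set both timestamps to the
       current time, as described above.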
http://man7.org/linux/man-pages/man2/utimes.2.html
10
SYSTEM CALL:
utime(2) - Linux manual page
FUNCTIONALITY:

       utime, utimes - change file last access and modification times
SYNOPSIS:

       #include <sys/types.h>
       #include <utime.h>

       int utime(const char *filename, const struct utimbuf *times);

       #include <sys/time.h>

       int utimes(const char *filename, const struct timeval times[2]);
DESCRIPTION

       Note: modern applications may prefer to use the interfaces described
       in utimensat(2).

       The utime() system call changes the access and modification times of
       the inode specified by filename to the actime and modtime fields of
       times respectively.

       If times is NULL, then the access and modification times of the file
       are set to the current time.

       Changing timestamps is permitted when: either the process has
       appropriate privileges, or the effective user ID equals the user ID
       of the file, or times is NULL and the process has write permission
       for the file.

       The utimbuf structure is:

           struct utimbuf {
               time_t actime;       /* access time */
               time_t modtime;      /* modification time */
           };

       The utime() system call allows specification of timestamps with a
       resolution of 1 second.

       The utimes() system call is similar, but the times argument refers to
       an array rather than a structure.  The elements of this array are
       timeval structures, which allow a precision of 1 microsecond for
       specifying timestamps.  The timeval structure is:

           struct timeval {
               long tv_sec;        /* seconds */
               long tv_usec;       /* microseconds */
           };

       times[0] specifies the new access time, and times[1] specifies the
       new modification time.  If times is NULL, then analogously to
       utime(), the access and modification times of the file are set to the
       current time.
http://man7.org/linux/man-pages/man2/futimesat.2.html
11
SYSTEM CALL:
futimesat(2) - Linux manual page
FUNCTIONALITY:

       futimesat - change timestamps of a file relative to a directory file
       descriptor
SYNOPSIS:

       #include <fcntl.h> /* Definition of AT_* constants */
       #include <sys/time.h>

       int futimesat(int dirfd, const char *pathname,
                     const struct timeval times[2]);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       futimesat(): _GNU_SOURCE
DESCRIPTION

       This system call is obsolete.  Use utimensat(2) instead.

       The futimesat() system call operates in exactly the same way as
       utimes(2), except for the differences described in this manual page.

       If the pathname given in pathname is relative, then it is interpreted
       relative to the directory referred to by the file descriptor dirfd
       (rather than relative to the current working directory of the calling
       process, as is done by utimes(2) for a relative pathname).

       If pathname is relative and dirfd is the special value AT_FDCWD, then
       pathname is interpreted relative to the current working directory of
       the calling process (like utimes(2)).

       If pathname is absolute, then dirfd is ignored.
http://man7.org/linux/man-pages/man2/utimensat.2.html
13
SYSTEM CALL:
utimensat(2) - Linux manual page
FUNCTIONALITY:

       utimensat, futimens - change file timestamps with nanosecond
       precision
SYNOPSIS:

       #include <fcntl.h> /* Definition of AT_* constants */
       #include <sys/stat.h>

       int utimensat(int dirfd, const char *pathname,
                     const struct timespec times[2], int flags);

       int futimens(int fd, const struct timespec times[2]);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       utimensat():
           Since glibc 2.10:
               _POSIX_C_SOURCE >= 200809L
           Before glibc 2.10:
               _ATFILE_SOURCE
       futimens():
           Since glibc 2.10:
                  _POSIX_C_SOURCE >= 200809L
           Before glibc 2.10:
                  _GNU_SOURCE
DESCRIPTION

       utimensat() and futimens() update the timestamps of a file with
       nanosecond precision.  This contrasts with the historical utime(2)
       and utimes(2), which permit only second and microsecond precision,
       respectively, when setting file timestamps.

       With utimensat() the file is specified via the pathname given in
       pathname.  With futimens() the file whose timestamps are to be
       updated is specified via an open file descriptor, fd.

       For both calls, the new file timestamps are specified in the array
       times: times[0] specifies the new "last access time" (atime);
       times[1] specifies the new "last modification time" (mtime).  Each of
       the elements of times specifies a time as the number of seconds and
       nanoseconds since the Epoch, 1970-01-01 00:00:00 +0000 (UTC).  This
       information is conveyed in a structure of the following form:

           struct timespec {
               time_t tv_sec;        /* seconds */
               long   tv_nsec;       /* nanoseconds */
           };

       Updated file timestamps are set to the greatest value supported by
       the filesystem that is not greater than the specified time.

       If the tv_nsec field of one of the timespec structures has the
       special value UTIME_NOW, then the corresponding file timestamp is set
       to the current time.  If the tv_nsec field of one of the timespec
       structures has the special value UTIME_OMIT, then the corresponding
       file timestamp is left unchanged.  In both of these cases, the value
       of the corresponding tv_sec field is ignored.

       If times is NULL, then both timestamps are set to the current time.

   Permissions requirements
       To set both file timestamps to the current time (i.e., times is NULL,
       or both tv_nsec fields specify UTIME_NOW), either:

       1. the caller must have write access to the file;

       2. the caller's effective user ID must match the owner of the file;
          or

       3. the caller must have appropriate privileges.

       To make any change other than setting both timestamps to the current
       time (i.e., times is not NULL, and neither tv_nsec field is UTIME_NOW
       and neither tv_nsec field is UTIME_OMIT), either condition 2 or 3
       above must apply.

       If both tv_nsec fields are specified as UTIME_OMIT, then no file
       ownership or permission checks are performed, and the file timestamps
       are not modified, but other error conditions may still be detected.

   utimensat() specifics
       If pathname is relative, then by default it is interpreted relative
       to the directory referred to by the open file descriptor, dirfd
       (rather than relative to the current working directory of the calling
       process, as is done by utimes(2) for a relative pathname).  See
       openat(2) for an explanation of why this can be useful.

       If pathname is relative and dirfd is the special value AT_FDCWD, then
       pathname is interpreted relative to the current working directory of
       the calling process (like utimes(2)).

       If pathname is absolute, then dirfd is ignored.

       The flags field is a bit mask that may be 0, or include the following
       constant, defined in <fcntl.h>:

       AT_SYMLINK_NOFOLLOW
              If pathname specifies a symbolic link, then update the
              timestamps of the link, rather than the file to which it
              refers.
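
       The special tv_nsec values can be illustrated with a short sketch
       that updates only the modification time and leaves the access time
       alone.  The helper names set_mtime_only() and demo_set_mtime_only()
       are hypothetical, the timestamps are arbitrary, and the demo assumes
       a writable /tmp on a filesystem that stores whole seconds exactly.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>      /* AT_FDCWD */
#include <stdlib.h>     /* mkstemp() */
#include <sys/stat.h>   /* utimensat(), futimens(), UTIME_OMIT, fstat() */
#include <time.h>       /* struct timespec */
#include <unistd.h>     /* close(), unlink() */

/* Hypothetical helper: set only the last modification time of `path`.
 * UTIME_OMIT in times[0].tv_nsec leaves the access time unchanged; the
 * matching tv_sec field is ignored in that case. */
int set_mtime_only(const char *path, time_t sec, long nsec)
{
    assert(path != NULL);
    struct timespec times[2];
    times[0].tv_sec  = 0;
    times[0].tv_nsec = UTIME_OMIT;
    times[1].tv_sec  = sec;
    times[1].tv_nsec = nsec;
    return utimensat(AT_FDCWD, path, times, 0);
}

/* Hypothetical self-check: returns 0 if atime survived and mtime changed,
 * -1 otherwise. */
int demo_set_mtime_only(void)
{
    char path[] = "/tmp/utimensat_demo_XXXXXX";  /* assumed writable */
    struct timespec initial[2] = {
        { .tv_sec = 1000, .tv_nsec = 0 },   /* atime */
        { .tv_sec = 2000, .tv_nsec = 0 },   /* mtime */
    };
    struct stat st;
    int rc = -1;

    int fd = mkstemp(path);
    if (fd == -1)
        return -1;

    if (futimens(fd, initial) == 0 &&          /* known starting point */
        set_mtime_only(path, 3000, 0) == 0 &&  /* touch mtime only */
        fstat(fd, &st) == 0 &&
        st.st_atime == 1000 &&
        st.st_mtime == 3000)
        rc = 0;

    close(fd);
    unlink(path);
    return rc;
}
```

       Replacing UTIME_OMIT with UTIME_NOW would instead set the access
       time to the current time, and passing AT_SYMLINK_NOFOLLOW in flags
       would apply the change to a symbolic link rather than its target.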
http://man7.org/linux/man-pages/man2/access.2.html
12
SYSTEM CALL:
access(2) - Linux manual page
FUNCTIONALITY:

       access, faccessat - check user's permissions for a file
SYNOPSIS:

       #include <unistd.h>

       int access(const char *pathname, int mode);

       #include <fcntl.h>           /* Definition of AT_* constants */
       #include <unistd.h>

       int faccessat(int dirfd, const char *pathname, int mode, int flags);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       faccessat():
           Since glibc 2.10:
               _POSIX_C_SOURCE >= 200809L
           Before glibc 2.10:
               _ATFILE_SOURCE
DESCRIPTION

       access() checks whether the calling process can access the file
       pathname.  If pathname is a symbolic link, it is dereferenced.

       The mode specifies the accessibility check(s) to be performed, and is
       either the value F_OK, or a mask consisting of the bitwise OR of one
       or more of R_OK, W_OK, and X_OK.  F_OK tests for the existence of the
       file.  R_OK, W_OK, and X_OK test whether the file exists and grants
       read, write, and execute permissions, respectively.

       The check is done using the calling process's real UID and GID,
       rather than the effective IDs as is done when actually attempting an
       operation (e.g., open(2)) on the file.  Similarly, for the root user,
       the check uses the set of permitted capabilities rather than the set
       of effective capabilities; and for non-root users, the check uses an
       empty set of capabilities.

       This allows set-user-ID programs and capability-endowed programs to
       easily determine the invoking user's authority.  In other words,
       access() does not answer the "can I read/write/execute this file?"
       question.  It answers a slightly different question: "(assuming I'm a
       setuid binary) can the user who invoked me read/write/execute this
       file?", which gives set-user-ID programs the possibility to prevent
       malicious users from causing them to read files which users shouldn't
       be able to read.

       If the calling process is privileged (i.e., its real UID is zero),
       then an X_OK check is successful for a regular file if execute
       permission is enabled for any of the file owner, group, or other.

   faccessat()
       The faccessat() system call operates in exactly the same way as
       access(), except for the differences described here.

       If the pathname given in pathname is relative, then it is interpreted
       relative to the directory referred to by the file descriptor dirfd
       (rather than relative to the current working directory of the calling
       process, as is done by access() for a relative pathname).

       If pathname is relative and dirfd is the special value AT_FDCWD, then
       pathname is interpreted relative to the current working directory of
       the calling process (like access()).

       If pathname is absolute, then dirfd is ignored.

       flags is constructed by ORing together zero or more of the following
       values:

       AT_EACCESS
              Perform access checks using the effective user and group IDs.
              By default, faccessat() uses the real IDs (like access()).

       AT_SYMLINK_NOFOLLOW
              If pathname is a symbolic link, do not dereference it: instead
              return information about the link itself.

       See openat(2) for an explanation of the need for faccessat().
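
       The mode values can be exercised with a small sketch such as the
       following.  It is an illustration under stated assumptions, not
       part of the manual page: invoking_user_can_rw() and demo_access()
       are hypothetical names, /tmp is assumed to be writable, and the
       temporary file is assumed to be created with mode 0600 (as glibc's
       mkstemp() does), so that no execute bit is set.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>     /* mkstemp() */
#include <unistd.h>     /* access(), F_OK, R_OK, W_OK, X_OK */

/* Hypothetical helper: 1 if the real UID/GID of the caller may both read
 * and write `path`, 0 otherwise (including when the file does not exist). */
int invoking_user_can_rw(const char *path)
{
    assert(path != NULL);
    return access(path, R_OK | W_OK) == 0;
}

/* Hypothetical self-check: returns 0 if every check behaves as the text
 * above describes, -1 otherwise. */
int demo_access(void)
{
    char path[] = "/tmp/access_demo_XXXXXX";  /* assumed writable directory */

    int fd = mkstemp(path);                   /* file owned by the caller */
    if (fd == -1)
        return -1;
    close(fd);

    int exists   = (access(path, F_OK) == 0);
    int rw       = invoking_user_can_rw(path);
    /* No execute bit is set for owner, group, or other, so X_OK is
     * expected to fail even for a privileged caller. */
    int runnable = (access(path, X_OK) == 0);

    unlink(path);
    errno = 0;
    int gone = (access(path, F_OK) == -1 && errno == ENOENT);

    return (exists && rw && !runnable && gone) ? 0 : -1;
}
```

       A set-user-ID program would call a helper like this before acting on
       a user-supplied path; faccessat() with AT_EACCESS would perform the
       same checks against the effective IDs instead.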
http://man7.org/linux/man-pages/man2/faccessat.2.html
12
SYSTEM CALL:
access(2) - Linux manual page
FUNCTIONALITY:

       access, faccessat - check user's permissions for a file
SYNOPSIS:

       #include <unistd.h>

       int access(const char *pathname, int mode);

       #include <fcntl.h>           /* Definition of AT_* constants */
       #include <unistd.h>

       int faccessat(int dirfd, const char *pathname, int mode, int flags);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       faccessat():
           Since glibc 2.10:
               _POSIX_C_SOURCE >= 200809L
           Before glibc 2.10:
               _ATFILE_SOURCE
DESCRIPTION

       access() checks whether the calling process can access the file
       pathname.  If pathname is a symbolic link, it is dereferenced.

       The mode specifies the accessibility check(s) to be performed, and is
       either the value F_OK, or a mask consisting of the bitwise OR of one
       or more of R_OK, W_OK, and X_OK.  F_OK tests for the existence of the
       file.  R_OK, W_OK, and X_OK test whether the file exists and grants
       read, write, and execute permissions, respectively.

       The check is done using the calling process's real UID and GID,
       rather than the effective IDs as is done when actually attempting an
       operation (e.g., open(2)) on the file.  Similarly, for the root user,
       the check uses the set of permitted capabilities rather than the set
       of effective capabilities; and for non-root users, the check uses an
       empty set of capabilities.

       This allows set-user-ID programs and capability-endowed programs to
       easily determine the invoking user's authority.  In other words,
       access() does not answer the "can I read/write/execute this file?"
       question.  It answers a slightly different question: "(assuming I'm a
       setuid binary) can the user who invoked me read/write/execute this
       file?", which gives set-user-ID programs the possibility to prevent
       malicious users from causing them to read files which users shouldn't
       be able to read.

       If the calling process is privileged (i.e., its real UID is zero),
       then an X_OK check is successful for a regular file if execute
       permission is enabled for any of the file owner, group, or other.

   faccessat()
       The faccessat() system call operates in exactly the same way as
       access(), except for the differences described here.

       If the pathname given in pathname is relative, then it is interpreted
       relative to the directory referred to by the file descriptor dirfd
       (rather than relative to the current working directory of the calling
       process, as is done by access() for a relative pathname).

       If pathname is relative and dirfd is the special value AT_FDCWD, then
       pathname is interpreted relative to the current working directory of
       the calling process (like access()).

       If pathname is absolute, then dirfd is ignored.

       flags is constructed by ORing together zero or more of the following
       values:

       AT_EACCESS
              Perform access checks using the effective user and group IDs.
              By default, faccessat() uses the real IDs (like access()).

       AT_SYMLINK_NOFOLLOW
              If pathname is a symbolic link, do not dereference it: instead
              return information about the link itself.

       See openat(2) for an explanation of the need for faccessat().
http://man7.org/linux/man-pages/man2/setxattr.2.html
10
SYSTEM CALL:
setxattr(2) - Linux manual page
FUNCTIONALITY:

       setxattr, lsetxattr, fsetxattr - set an extended attribute value
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/xattr.h>

       int setxattr(const char *path, const char *name,
                     const void *value, size_t size, int flags);
       int lsetxattr(const char *path, const char *name,
                     const void *value, size_t size, int flags);
       int fsetxattr(int fd, const char *name,
                     const void *value, size_t size, int flags);
DESCRIPTION

       Extended attributes are name:value pairs associated with inodes
       (files, directories, symbolic links, etc.).  They are extensions to
       the normal attributes which are associated with all inodes in the
       system (i.e., the stat(2) data).  A complete overview of extended
       attributes concepts can be found in xattr(7).

       setxattr() sets the value of the extended attribute identified by
       name and associated with the given path in the filesystem.  The size
       argument specifies the size (in bytes) of value; a zero-length value
       is permitted.

       lsetxattr() is identical to setxattr(), except in the case of a
       symbolic link, where the extended attribute is set on the link
       itself, not the file that it refers to.

       fsetxattr() is identical to setxattr(), except that the extended
       attribute is set on the open file referred to by fd (as returned by
       open(2)) rather than on a file named by path.

       An extended attribute name is a null-terminated string.  The name
       includes a namespace prefix; there may be several, disjoint
       namespaces associated with an individual inode.  The value of an
       extended attribute is a chunk of arbitrary textual or binary data of
       specified length.

       By default (i.e., flags is zero), the extended attribute will be
       created if it does not exist, or the value will be replaced if the
       attribute already exists.  To modify these semantics, one of the
       following values can be specified in flags:

       XATTR_CREATE
              Perform a pure create, which fails if the named attribute
              exists already.

       XATTR_REPLACE
              Perform a pure replace operation, which fails if the named
              attribute does not already exist.
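
       The effect of the two flags can be sketched as follows.  This is a
       hedged example, not part of the manual page: create_attr_once() and
       demo_xattr_flags() are hypothetical names, "user.demo" and
       "user.absent" are made-up attribute names, and the demo assumes a
       writable /tmp.  Some filesystems do not accept attributes in the
       user namespace, so the demo reports that case separately instead of
       treating it as a failure.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>     /* mkstemp() */
#include <string.h>     /* strlen() */
#include <sys/types.h>
#include <sys/xattr.h>  /* setxattr(), XATTR_CREATE, XATTR_REPLACE */
#include <unistd.h>     /* close(), unlink() */

/* Hypothetical helper: create attribute `name` on `path` only if it does
 * not exist yet.  Returns 0 if created, 1 if it already existed (the old
 * value is kept), -1 on any other error with errno set by setxattr(). */
int create_attr_once(const char *path, const char *name, const char *text)
{
    assert(path != NULL && name != NULL && text != NULL);
    if (setxattr(path, name, text, strlen(text), XATTR_CREATE) == 0)
        return 0;
    return (errno == EEXIST) ? 1 : -1;
}

/* Hypothetical self-check: returns 0 if the flags behave as described,
 * 1 if the filesystem refused user attributes (nothing was checked),
 * -1 on unexpected behavior. */
int demo_xattr_flags(void)
{
    char path[] = "/tmp/xattr_demo_XXXXXX";   /* assumed writable directory */
    const char *name = "user.demo";           /* made-up attribute name */
    int rc = -1;

    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    close(fd);

    int first = create_attr_once(path, name, "v1");
    if (first == -1 && (errno == ENOTSUP || errno == EPERM)) {
        rc = 1;                               /* user xattrs unavailable */
    } else if (first == 0) {
        /* A second pure create must report that the attribute exists. */
        int second = create_attr_once(path, name, "v2");
        /* A pure replace succeeds because the attribute is present. */
        int replaced = setxattr(path, name, "v3", 2, XATTR_REPLACE);
        /* A pure replace of a missing attribute fails (ENODATA on Linux). */
        int missing = setxattr(path, "user.absent", "x", 1, XATTR_REPLACE);
        if (second == 1 && replaced == 0 &&
            missing == -1 && errno == ENODATA)
            rc = 0;
    }

    unlink(path);
    return rc;
}
```

       With flags set to 0, the same setxattr() call would create the
       attribute or overwrite it, whichever applies.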
http://man7.org/linux/man-pages/man2/lsetxattr.2.html
10
SYSTEM CALL:
setxattr(2) - Linux manual page
FUNCTIONALITY:

       setxattr, lsetxattr, fsetxattr - set an extended attribute value
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/xattr.h>

       int setxattr(const char *path, const char *name,
                     const void *value, size_t size, int flags);
       int lsetxattr(const char *path, const char *name,
                     const void *value, size_t size, int flags);
       int fsetxattr(int fd, const char *name,
                     const void *value, size_t size, int flags);
DESCRIPTION

       Extended attributes are name:value pairs associated with inodes
       (files, directories, symbolic links, etc.).  They are extensions to
       the normal attributes which are associated with all inodes in the
       system (i.e., the stat(2) data).  A complete overview of extended
       attributes concepts can be found in xattr(7).

       setxattr() sets the value of the extended attribute identified by
       name and associated with the given path in the filesystem.  The size
       argument specifies the size (in bytes) of value; a zero-length value
       is permitted.

       lsetxattr() is identical to setxattr(), except in the case of a
       symbolic link, where the extended attribute is set on the link
       itself, not the file that it refers to.

       fsetxattr() is identical to setxattr(), except that the extended
       attribute is set on the open file referred to by fd (as returned by
       open(2)) rather than on a file named by path.

       An extended attribute name is a null-terminated string.  The name
       includes a namespace prefix; there may be several, disjoint
       namespaces associated with an individual inode.  The value of an
       extended attribute is a chunk of arbitrary textual or binary data of
       specified length.

       By default (i.e., flags is zero), the extended attribute will be
       created if it does not exist, or the value will be replaced if the
       attribute already exists.  To modify these semantics, one of the
       following values can be specified in flags:

       XATTR_CREATE
              Perform a pure create, which fails if the named attribute
              exists already.

       XATTR_REPLACE
              Perform a pure replace operation, which fails if the named
              attribute does not already exist.
http://man7.org/linux/man-pages/man2/fsetxattr.2.html
10
SYSTEM CALL:
setxattr(2) - Linux manual page
FUNCTIONALITY:

       setxattr, lsetxattr, fsetxattr - set an extended attribute value
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/xattr.h>

       int setxattr(const char *path, const char *name,
                     const void *value, size_t size, int flags);
       int lsetxattr(const char *path, const char *name,
                     const void *value, size_t size, int flags);
       int fsetxattr(int fd, const char *name,
                     const void *value, size_t size, int flags);
DESCRIPTION

       Extended attributes are name:value pairs associated with inodes
       (files, directories, symbolic links, etc.).  They are extensions to
       the normal attributes which are associated with all inodes in the
       system (i.e., the stat(2) data).  A complete overview of extended
       attributes concepts can be found in xattr(7).

       setxattr() sets the value of the extended attribute identified by
       name and associated with the given path in the filesystem.  The size
       argument specifies the size (in bytes) of value; a zero-length value
       is permitted.

       lsetxattr() is identical to setxattr(), except in the case of a
       symbolic link, where the extended attribute is set on the link
       itself, not the file that it refers to.

       fsetxattr() is identical to setxattr(), except that the extended
       attribute is set on the open file referred to by fd (as returned by
       open(2)) rather than on the file named by path.

       An extended attribute name is a null-terminated string.  The name
       includes a namespace prefix; there may be several, disjoint
       namespaces associated with an individual inode.  The value of an
       extended attribute is a chunk of arbitrary textual or binary data of
       specified length.

       By default (i.e., flags is zero), the extended attribute will be
       created if it does not exist, or the value will be replaced if the
       attribute already exists.  To modify these semantics, one of the
       following values can be specified in flags:

       XATTR_CREATE
              Perform a pure create, which fails if the named attribute
              exists already.

       XATTR_REPLACE
              Perform a pure replace operation, which fails if the named
              attribute does not already exist.
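
       The effect of the flags argument can be illustrated with a minimal
       sketch.  The helper name set_xattr_if_absent() is hypothetical, and
       the EEXIST error code is taken from the ERRORS section of
       setxattr(2), which is not reproduced in this excerpt:

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/xattr.h>

/* Hypothetical helper (not part of any library): create the attribute
 * only if it is not already present.
 * Returns 0 if the attribute was created, 1 if it already existed,
 * and -1 on any other error (errno is left as set by setxattr()). */
int set_xattr_if_absent(const char *path, const char *name,
                        const void *value, size_t size)
{
    /* XATTR_CREATE turns the default "create or replace" into a
     * pure create, so an existing attribute is never overwritten. */
    if (setxattr(path, name, value, size, XATTR_CREATE) == 0)
        return 0;
    if (errno == EEXIST)
        return 1;
    return -1;
}
```

       Passing XATTR_REPLACE instead would give the opposite guarantee:
       the call succeeds only when the attribute is already present.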
http://man7.org/linux/man-pages/man2/getxattr.2.html
11
SYSTEM CALL:
getxattr(2) - Linux manual page
FUNCTIONALITY:

       getxattr, lgetxattr, fgetxattr - retrieve an extended attribute value
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/xattr.h>

       ssize_t getxattr(const char *path, const char *name,
                        void *value, size_t size);
       ssize_t lgetxattr(const char *path, const char *name,
                        void *value, size_t size);
       ssize_t fgetxattr(int fd, const char *name,
                        void *value, size_t size);
DESCRIPTION

       Extended attributes are name:value pairs associated with inodes
       (files, directories, symbolic links, etc.).  They are extensions to
       the normal attributes which are associated with all inodes in the
       system (i.e., the stat(2) data).  A complete overview of extended
       attributes concepts can be found in xattr(7).

       getxattr() retrieves the value of the extended attribute identified
       by name and associated with the given path in the filesystem.  The
       attribute value is placed in the buffer pointed to by value; size
       specifies the size of that buffer.  The return value of the call is
       the number of bytes placed in value.

       lgetxattr() is identical to getxattr(), except in the case of a
       symbolic link, where the link itself is interrogated, not the file
       that it refers to.

       fgetxattr() is identical to getxattr(), except that the open file
       referred to by fd (as returned by open(2)) is interrogated in place
       of path.

       An extended attribute name is a null-terminated string.  The name
       includes a namespace prefix; there may be several, disjoint
       namespaces associated with an individual inode.  The value of an
       extended attribute is a chunk of arbitrary textual or binary data
       that was assigned using setxattr(2).

       If size is specified as zero, these calls return the current size of
       the named extended attribute (and leave value unchanged).  This can
       be used to determine the size of the buffer that should be supplied
       in a subsequent call.  (But, bear in mind that there is a possibility
       that the attribute value may change between the two calls, so that it
       is still necessary to check the return status from the second call.)
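
       The two-call pattern described above might look like the following
       sketch.  The helper name read_xattr() is hypothetical; the ERANGE
       error (buffer too small) comes from the ERRORS section of
       getxattr(2), which is not reproduced in this excerpt:

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/xattr.h>

/* Hypothetical helper: read an attribute value into a malloc()ed buffer.
 * On success, returns the buffer (to be free()d by the caller) and
 * stores the value length in *len_out.  On failure, returns NULL with
 * errno set and leaves *len_out untouched. */
void *read_xattr(const char *path, const char *name, size_t *len_out)
{
    for (;;) {
        /* First call: size 0 asks only for the current value size. */
        ssize_t need = getxattr(path, name, NULL, 0);
        if (need < 0)
            return NULL;

        /* Allocate at least one byte so a zero-length value still
         * gets a valid buffer and a nonzero size argument. */
        size_t cap = need > 0 ? (size_t)need : 1;
        void *buf = malloc(cap);
        if (buf == NULL)
            return NULL;

        /* Second call: fetch the value; its status must be checked. */
        ssize_t got = getxattr(path, name, buf, cap);
        if (got >= 0) {
            *len_out = (size_t)got;
            return buf;
        }

        int saved = errno;
        free(buf);
        if (saved != ERANGE) {
            errno = saved;
            return NULL;
        }
        /* ERANGE: the value grew between the two calls; try again. */
    }
}
```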
http://man7.org/linux/man-pages/man2/lgetxattr.2.html
11
SYSTEM CALL:
getxattr(2) - Linux manual page
FUNCTIONALITY:

       getxattr, lgetxattr, fgetxattr - retrieve an extended attribute value
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/xattr.h>

       ssize_t getxattr(const char *path, const char *name,
                        void *value, size_t size);
       ssize_t lgetxattr(const char *path, const char *name,
                        void *value, size_t size);
       ssize_t fgetxattr(int fd, const char *name,
                        void *value, size_t size);
DESCRIPTION

       Extended attributes are name:value pairs associated with inodes
       (files, directories, symbolic links, etc.).  They are extensions to
       the normal attributes which are associated with all inodes in the
       system (i.e., the stat(2) data).  A complete overview of extended
       attributes concepts can be found in xattr(7).

       getxattr() retrieves the value of the extended attribute identified
       by name and associated with the given path in the filesystem.  The
       attribute value is placed in the buffer pointed to by value; size
       specifies the size of that buffer.  The return value of the call is
       the number of bytes placed in value.

       lgetxattr() is identical to getxattr(), except in the case of a
       symbolic link, where the link itself is interrogated, not the file
       that it refers to.

       fgetxattr() is identical to getxattr(), except that the open file
       referred to by fd (as returned by open(2)) is interrogated in place
       of path.

       An extended attribute name is a null-terminated string.  The name
       includes a namespace prefix; there may be several, disjoint
       namespaces associated with an individual inode.  The value of an
       extended attribute is a chunk of arbitrary textual or binary data
       that was assigned using setxattr(2).

       If size is specified as zero, these calls return the current size of
       the named extended attribute (and leave value unchanged).  This can
       be used to determine the size of the buffer that should be supplied
       in a subsequent call.  (But, bear in mind that there is a possibility
       that the attribute value may change between the two calls, so that it
       is still necessary to check the return status from the second call.)
http://man7.org/linux/man-pages/man2/fgetxattr.2.html
11
SYSTEM CALL:
getxattr(2) - Linux manual page
FUNCTIONALITY:

       getxattr, lgetxattr, fgetxattr - retrieve an extended attribute value
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/xattr.h>

       ssize_t getxattr(const char *path, const char *name,
                        void *value, size_t size);
       ssize_t lgetxattr(const char *path, const char *name,
                        void *value, size_t size);
       ssize_t fgetxattr(int fd, const char *name,
                        void *value, size_t size);
DESCRIPTION

       Extended attributes are name:value pairs associated with inodes
       (files, directories, symbolic links, etc.).  They are extensions to
       the normal attributes which are associated with all inodes in the
       system (i.e., the stat(2) data).  A complete overview of extended
       attributes concepts can be found in xattr(7).

       getxattr() retrieves the value of the extended attribute identified
       by name and associated with the given path in the filesystem.  The
       attribute value is placed in the buffer pointed to by value; size
       specifies the size of that buffer.  The return value of the call is
       the number of bytes placed in value.

       lgetxattr() is identical to getxattr(), except in the case of a
       symbolic link, where the link itself is interrogated, not the file
       that it refers to.

       fgetxattr() is identical to getxattr(), except that the open file
       referred to by fd (as returned by open(2)) is interrogated in place
       of path.

       An extended attribute name is a null-terminated string.  The name
       includes a namespace prefix; there may be several, disjoint
       namespaces associated with an individual inode.  The value of an
       extended attribute is a chunk of arbitrary textual or binary data
       that was assigned using setxattr(2).

       If size is specified as zero, these calls return the current size of
       the named extended attribute (and leave value unchanged).  This can
       be used to determine the size of the buffer that should be supplied
       in a subsequent call.  (But, bear in mind that there is a possibility
       that the attribute value may change between the two calls, so that it
       is still necessary to check the return status from the second call.)
http://man7.org/linux/man-pages/man2/listxattr.2.html
12
SYSTEM CALL:
listxattr(2) - Linux manual page
FUNCTIONALITY:

       listxattr, llistxattr, flistxattr - list extended attribute names
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/xattr.h>

       ssize_t listxattr(const char *path, char *list, size_t size);
       ssize_t llistxattr(const char *path, char *list, size_t size);
       ssize_t flistxattr(int fd, char *list, size_t size);
DESCRIPTION

       Extended attributes are name:value pairs associated with inodes
       (files, directories, symbolic links, etc.).  They are extensions to
       the normal attributes which are associated with all inodes in the
       system (i.e., the stat(2) data).  A complete overview of extended
       attributes concepts can be found in xattr(7).

       listxattr() retrieves the list of extended attribute names associated
       with the given path in the filesystem.  The retrieved list is placed
       in list, a caller-allocated buffer whose size (in bytes) is specified
       in the argument size.  The list is the set of (null-terminated)
       names, one after the other.  Names of extended attributes to which
       the calling process does not have access may be omitted from the
       list.  The length of the attribute name list is returned.

       llistxattr() is identical to listxattr(), except in the case of a
       symbolic link, where the list of names of extended attributes
       associated with the link itself is retrieved, not the file that it
       refers to.

       flistxattr() is identical to listxattr(), except that the open file
       referred to by fd (as returned by open(2)) is interrogated in place
       of path.

       A single extended attribute name is a null-terminated string.  The
       name includes a namespace prefix; there may be several, disjoint
       namespaces associated with an individual inode.

       If size is specified as zero, these calls return the current size of
       the list of extended attribute names (and leave list unchanged).
       This can be used to determine the size of the buffer that should be
       supplied in a subsequent call.  (But, bear in mind that there is a
       possibility that the set of extended attributes may change between
       the two calls, so that it is still necessary to check the return
       status from the second call.)

   Example
       The list of names is returned as an unordered array of null-
       terminated character strings (attribute names are separated by null
       bytes ('\0')), like this:

              user.name1\0system.name1\0user.name2\0

       Filesystems that implement POSIX ACLs using extended attributes might
       return a list like this:

              system.posix_acl_access\0system.posix_acl_default\0
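
       A buffer in this format can be walked name by name.  The following
       is a minimal sketch; the helper name walk_xattr_names() is
       hypothetical, and len is assumed to be the byte count returned by
       a successful listxattr() call:

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: print every name stored in a listxattr()-style
 * buffer (names laid out back to back, each ending in '\0') and
 * return how many names were found. */
size_t walk_xattr_names(const char *list, size_t len)
{
    size_t count = 0;
    size_t off = 0;

    while (off < len) {
        /* Find the terminator of the current name, never looking
         * past the end of the buffer. */
        const char *name = list + off;
        const char *end = memchr(name, '\0', len - off);
        size_t n = end ? (size_t)(end - name) : len - off;

        printf("%.*s\n", (int)n, name);
        count++;
        off += n + 1;   /* skip the name and its terminator */
    }
    return count;
}
```

       For the first list shown above, the function would print three
       names, one per line, and return 3.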
http://man7.org/linux/man-pages/man2/llistxattr.2.html
12
SYSTEM CALL:
listxattr(2) - Linux manual page
FUNCTIONALITY:

       listxattr, llistxattr, flistxattr - list extended attribute names
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/xattr.h>

       ssize_t listxattr(const char *path, char *list, size_t size);
       ssize_t llistxattr(const char *path, char *list, size_t size);
       ssize_t flistxattr(int fd, char *list, size_t size);
DESCRIPTION

       Extended attributes are name:value pairs associated with inodes
       (files, directories, symbolic links, etc.).  They are extensions to
       the normal attributes which are associated with all inodes in the
       system (i.e., the stat(2) data).  A complete overview of extended
       attributes concepts can be found in xattr(7).

       listxattr() retrieves the list of extended attribute names associated
       with the given path in the filesystem.  The retrieved list is placed
       in list, a caller-allocated buffer whose size (in bytes) is specified
       in the argument size.  The list is the set of (null-terminated)
       names, one after the other.  Names of extended attributes to which
       the calling process does not have access may be omitted from the
       list.  The length of the attribute name list is returned.

       llistxattr() is identical to listxattr(), except in the case of a
       symbolic link, where the list of names of extended attributes
       associated with the link itself is retrieved, not the file that it
       refers to.

       flistxattr() is identical to listxattr(), except that the open file
       referred to by fd (as returned by open(2)) is interrogated in place
       of path.

       A single extended attribute name is a null-terminated string.  The
       name includes a namespace prefix; there may be several, disjoint
       namespaces associated with an individual inode.

       If size is specified as zero, these calls return the current size of
       the list of extended attribute names (and leave list unchanged).
       This can be used to determine the size of the buffer that should be
       supplied in a subsequent call.  (But, bear in mind that there is a
       possibility that the set of extended attributes may change between
       the two calls, so that it is still necessary to check the return
       status from the second call.)

   Example
       The list of names is returned as an unordered array of null-
       terminated character strings (attribute names are separated by null
       bytes ('\0')), like this:

              user.name1\0system.name1\0user.name2\0

       Filesystems that implement POSIX ACLs using extended attributes might
       return a list like this:

              system.posix_acl_access\0system.posix_acl_default\0
http://man7.org/linux/man-pages/man2/flistxattr.2.html
12
SYSTEM CALL:
listxattr(2) - Linux manual page
FUNCTIONALITY:

       listxattr, llistxattr, flistxattr - list extended attribute names
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/xattr.h>

       ssize_t listxattr(const char *path, char *list, size_t size);
       ssize_t llistxattr(const char *path, char *list, size_t size);
       ssize_t flistxattr(int fd, char *list, size_t size);
DESCRIPTION

       Extended attributes are name:value pairs associated with inodes
       (files, directories, symbolic links, etc.).  They are extensions to
       the normal attributes which are associated with all inodes in the
       system (i.e., the stat(2) data).  A complete overview of extended
       attributes concepts can be found in xattr(7).

       listxattr() retrieves the list of extended attribute names associated
       with the given path in the filesystem.  The retrieved list is placed
       in list, a caller-allocated buffer whose size (in bytes) is specified
       in the argument size.  The list is the set of (null-terminated)
       names, one after the other.  Names of extended attributes to which
       the calling process does not have access may be omitted from the
       list.  The length of the attribute name list is returned.

       llistxattr() is identical to listxattr(), except in the case of a
       symbolic link, where the list of names of extended attributes
       associated with the link itself is retrieved, not the file that it
       refers to.

       flistxattr() is identical to listxattr(), except that the open file
       referred to by fd (as returned by open(2)) is interrogated in place
       of path.

       A single extended attribute name is a null-terminated string.  The
       name includes a namespace prefix; there may be several, disjoint
       namespaces associated with an individual inode.

       If size is specified as zero, these calls return the current size of
       the list of extended attribute names (and leave list unchanged).
       This can be used to determine the size of the buffer that should be
       supplied in a subsequent call.  (But, bear in mind that there is a
       possibility that the set of extended attributes may change between
       the two calls, so that it is still necessary to check the return
       status from the second call.)

   Example
       The list of names is returned as an unordered array of null-
       terminated character strings (attribute names are separated by null
       bytes ('\0')), like this:

              user.name1\0system.name1\0user.name2\0

       Filesystems that implement POSIX ACLs using extended attributes might
       return a list like this:

              system.posix_acl_access\0system.posix_acl_default\0
http://man7.org/linux/man-pages/man2/removexattr.2.html
10
SYSTEM CALL:
removexattr(2) - Linux manual page
FUNCTIONALITY:

       removexattr, lremovexattr, fremovexattr - remove an extended
       attribute
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/xattr.h>

       int removexattr(const char *path, const char *name);
       int lremovexattr(const char *path, const char *name);
       int fremovexattr(int fd, const char *name);
DESCRIPTION

       Extended attributes are name:value pairs associated with inodes
       (files, directories, symbolic links, etc.).  They are extensions to
       the normal attributes which are associated with all inodes in the
       system (i.e., the stat(2) data).  A complete overview of extended
       attributes concepts can be found in xattr(7).

       removexattr() removes the extended attribute identified by name and
       associated with the given path in the filesystem.

       lremovexattr() is identical to removexattr(), except in the case of a
       symbolic link, where the extended attribute is removed from the link
       itself, not the file that it refers to.

       fremovexattr() is identical to removexattr(), except that the
       extended attribute is removed from the open file referred to by fd
       (as returned by open(2)) rather than from the file named by path.

       An extended attribute name is a null-terminated string.  The name
       includes a namespace prefix; there may be several, disjoint
       namespaces associated with an individual inode.
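
       Callers often want removal to be idempotent.  The sketch below
       treats a missing attribute as success; the helper name
       remove_xattr_if_present() is hypothetical, and the ENODATA error
       code is taken from the ERRORS section of removexattr(2), which is
       not reproduced in this excerpt:

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/xattr.h>

/* Hypothetical helper: remove an attribute, treating "attribute does
 * not exist" as success.  Returns 0 if the attribute is gone after
 * the call, and -1 on any other error (errno is left set). */
int remove_xattr_if_present(const char *path, const char *name)
{
    if (removexattr(path, name) == 0)
        return 0;
    if (errno == ENODATA)
        return 0;       /* nothing to remove */
    return -1;
}
```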
http://man7.org/linux/man-pages/man2/lremovexattr.2.html
10
SYSTEM CALL:
removexattr(2) - Linux manual page
FUNCTIONALITY:

       removexattr, lremovexattr, fremovexattr - remove an extended
       attribute
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/xattr.h>

       int removexattr(const char *path, const char *name);
       int lremovexattr(const char *path, const char *name);
       int fremovexattr(int fd, const char *name);
DESCRIPTION

       Extended attributes are name:value pairs associated with inodes
       (files, directories, symbolic links, etc.).  They are extensions to
       the normal attributes which are associated with all inodes in the
       system (i.e., the stat(2) data).  A complete overview of extended
       attributes concepts can be found in xattr(7).

       removexattr() removes the extended attribute identified by name and
       associated with the given path in the filesystem.

       lremovexattr() is identical to removexattr(), except in the case of a
       symbolic link, where the extended attribute is removed from the link
       itself, not the file that it refers to.

       fremovexattr() is identical to removexattr(), except that the
       extended attribute is removed from the open file referred to by fd
       (as returned by open(2)) rather than from the file named by path.

       An extended attribute name is a null-terminated string.  The name
       includes a namespace prefix; there may be several, disjoint
       namespaces associated with an individual inode.
http://man7.org/linux/man-pages/man2/fremovexattr.2.html
10
SYSTEM CALL:
removexattr(2) - Linux manual page
FUNCTIONALITY:

       removexattr, lremovexattr, fremovexattr - remove an extended
       attribute
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/xattr.h>

       int removexattr(const char *path, const char *name);
       int lremovexattr(const char *path, const char *name);
       int fremovexattr(int fd, const char *name);
DESCRIPTION

       Extended attributes are name:value pairs associated with inodes
       (files, directories, symbolic links, etc.).  They are extensions to
       the normal attributes which are associated with all inodes in the
       system (i.e., the stat(2) data).  A complete overview of extended
       attributes concepts can be found in xattr(7).

       removexattr() removes the extended attribute identified by name and
       associated with the given path in the filesystem.

       lremovexattr() is identical to removexattr(), except in the case of a
       symbolic link, where the extended attribute is removed from the link
       itself, not the file that it refers to.

       fremovexattr() is identical to removexattr(), except that the
       extended attribute is removed from the open file referred to by fd
       (as returned by open(2)) rather than from the file named by path.

       An extended attribute name is a null-terminated string.  The name
       includes a namespace prefix; there may be several, disjoint
       namespaces associated with an individual inode.
http://man7.org/linux/man-pages/man2/ioctl.2.html
10
SYSTEM CALL:
ioctl(2) - Linux manual page
FUNCTIONALITY:

       ioctl - control device
SYNOPSIS:

       #include <sys/ioctl.h>

       int ioctl(int fd, unsigned long request, ...);
DESCRIPTION

       The ioctl() function manipulates the underlying device parameters of
       special files.  In particular, many operating characteristics of
       character special files (e.g., terminals) may be controlled with
       ioctl() requests.  The argument fd must be an open file descriptor.

       The second argument is a device-dependent request code.  The third
       argument is an untyped pointer to memory.  It's traditionally char
       *argp (from the days before void * was valid C), and will be so named
       for this discussion.

       An ioctl() request has encoded in it whether the argument is an in
       parameter or out parameter, and the size of the argument argp in
       bytes.  Macros and defines used in specifying an ioctl() request are
       located in the file <sys/ioctl.h>.
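
       As an illustration of the fd/request/argp calling convention, the
       sketch below uses the FIONREAD request, which reports how many
       bytes are waiting to be read.  FIONREAD is not discussed in the
       text above and is chosen here only as a simple example; the helper
       names bytes_available() and demo_pipe_pending() are hypothetical:

```c
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical helper: number of bytes ready to be read from fd,
 * or -1 on error (errno is set by ioctl()). */
int bytes_available(int fd)
{
    int pending = 0;

    /* For FIONREAD the third argument is an out parameter:
     * a pointer to an int that receives the byte count. */
    if (ioctl(fd, FIONREAD, &pending) == -1)
        return -1;
    return pending;
}

/* Usage sketch: write three bytes into a pipe, then ask the read
 * end how much data is pending. */
int demo_pipe_pending(void)
{
    int p[2];
    int result = -1;

    if (pipe(p) == -1)
        return -1;
    if (write(p[1], "abc", 3) == 3)
        result = bytes_available(p[0]);
    close(p[0]);
    close(p[1]);
    return result;
}
```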
http://man7.org/linux/man-pages/man2/fcntl.2.html
11
SYSTEM CALL:
fcntl(2) - Linux manual page
FUNCTIONALITY:

       fcntl - manipulate file descriptor
SYNOPSIS:

       #include <unistd.h>
       #include <fcntl.h>

       int fcntl(int fd, int cmd, ... /* arg */ );
DESCRIPTION

       fcntl() performs one of the operations described below on the open
       file descriptor fd.  The operation is determined by cmd.

       fcntl() can take an optional third argument.  Whether or not this
       argument is required is determined by cmd.  The required argument
       type is indicated in parentheses after each cmd name (in most cases,
       the required type is int, and we identify the argument using the name
       arg), or void is specified if the argument is not required.

       Certain of the operations below are supported only since a particular
       Linux kernel version.  The preferred method of checking whether the
       host kernel supports a particular operation is to invoke fcntl() with
       the desired cmd value and then test whether the call failed with
       EINVAL, indicating that the kernel does not recognize this value.

   Duplicating a file descriptor
       F_DUPFD (int)
              Find the lowest numbered available file descriptor greater
              than or equal to arg and make it be a copy of fd.  This is
              different from dup2(2), which uses exactly the file descriptor
              specified.

              On success, the new file descriptor is returned.

              See dup(2) for further details.

       F_DUPFD_CLOEXEC (int; since Linux 2.6.24)
              As for F_DUPFD, but additionally set the close-on-exec flag
              for the duplicate file descriptor.  Specifying this flag
              permits a program to avoid an additional fcntl() F_SETFD
              operation to set the FD_CLOEXEC flag.  For an explanation of
              why this flag is useful, see the description of O_CLOEXEC in
              open(2).

   File descriptor flags
       The following commands manipulate the flags associated with a file
       descriptor.  Currently, only one such flag is defined: FD_CLOEXEC,
       the close-on-exec flag.  If the FD_CLOEXEC bit is 0, the file
       descriptor will remain open across an execve(2); otherwise, it will
       be closed.

       F_GETFD (void)
              Read the file descriptor flags; arg is ignored.

       F_SETFD (int)
              Set the file descriptor flags to the value specified by arg.

       In multithreaded programs, using fcntl() F_SETFD to set the close-on-
       exec flag at the same time as another thread performs a fork(2) plus
       execve(2) is vulnerable to a race condition that may unintentionally
       leak the file descriptor to the program executed in the child
       process.  See the discussion of the O_CLOEXEC flag in open(2) for
       details and a remedy to the problem.
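
       For single-threaded code, or for descriptors that could not be
       opened with O_CLOEXEC, the flag can be set after the fact.  The
       following is a sketch only; the helper name set_cloexec() is
       hypothetical, and the race described above still applies in
       multithreaded programs:

```c
#include <unistd.h>
#include <fcntl.h>

/* Hypothetical helper: turn on the close-on-exec flag for fd.
 * Returns 0 on success, -1 on error. */
int set_cloexec(int fd)
{
    /* Read the current descriptor flags first so that any flags
     * defined in the future are preserved. */
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    if (fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == -1)
        return -1;
    return 0;
}
```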

   File status flags
       Each open file description has certain associated status flags,
       initialized by open(2) and possibly modified by fcntl().  Duplicated
       file descriptors (made with dup(2), fcntl(F_DUPFD), fork(2), etc.)
       refer to the same open file description, and thus share the same file
       status flags.

       The file status flags and their semantics are described in open(2).

       F_GETFL (void)
              Get the file access mode and the file status flags; arg is
              ignored.

       F_SETFL (int)
              Set the file status flags to the value specified by arg.  File
              access mode (O_RDONLY, O_WRONLY, O_RDWR) and file creation
              flags (i.e., O_CREAT, O_EXCL, O_NOCTTY, O_TRUNC) in arg are
              ignored.  On Linux, this command can change only the O_APPEND,
              O_ASYNC, O_DIRECT, O_NOATIME, and O_NONBLOCK flags.  It is not
              possible to change the O_DSYNC and O_SYNC flags; see BUGS,
              below.
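
       Because F_SETFL replaces the whole set of changeable status flags,
       a common idiom is to read the flags with F_GETFL and write them
       back with one bit added.  A minimal sketch, using the hypothetical
       helper name set_nonblocking():

```c
#include <unistd.h>
#include <fcntl.h>

/* Hypothetical helper: enable O_NONBLOCK on fd while keeping the
 * other status flags as they are.  Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;

    /* The access mode bits returned by F_GETFL are passed back
     * unchanged; F_SETFL ignores them. */
    if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
        return -1;
    return 0;
}
```

       Since the status flags belong to the open file description, the
       change is also visible through any duplicate of fd.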

   Advisory record locking
       Linux implements traditional ("process-associated") UNIX record
       locks, as standardized by POSIX.  For a Linux-specific alternative
       with better semantics, see the discussion of open file description
       locks below.

       F_SETLK, F_SETLKW, and F_GETLK are used to acquire, release, and test
       for the existence of record locks (also known as byte-range, file-
       segment, or file-region locks).  The third argument, lock, is a
       pointer to a structure that has at least the following fields (in
       unspecified order).

           struct flock {
               ...
               short l_type;    /* Type of lock: F_RDLCK,
                                   F_WRLCK, F_UNLCK */
               short l_whence;  /* How to interpret l_start:
                                   SEEK_SET, SEEK_CUR, SEEK_END */
               off_t l_start;   /* Starting offset for lock */
               off_t l_len;     /* Number of bytes to lock */
               pid_t l_pid;     /* PID of process blocking our lock
                                   (set by F_GETLK and F_OFD_GETLK) */
               ...
           };

       The l_whence, l_start, and l_len fields of this structure specify the
       range of bytes we wish to lock.  Bytes past the end of the file may
       be locked, but not bytes before the start of the file.

       l_start is the starting offset for the lock, and is interpreted
       relative to either: the start of the file (if l_whence is SEEK_SET);
       the current file offset (if l_whence is SEEK_CUR); or the end of the
       file (if l_whence is SEEK_END).  In the final two cases, l_start can
       be a negative number provided the offset does not lie before the
       start of the file.

       l_len specifies the number of bytes to be locked.  If l_len is
       positive, then the range to be locked covers bytes l_start up to and
       including l_start+l_len-1.  Specifying 0 for l_len has the special
       meaning: lock all bytes starting at the location specified by
       l_whence and l_start through to the end of file, no matter how large
       the file grows.

       POSIX.1-2001 allows (but does not require) an implementation to
       support a negative l_len value; if l_len is negative, the interval
       described by lock covers bytes l_start+l_len up to and including
       l_start-1.  This is supported by Linux since kernel versions 2.4.21
       and 2.5.49.
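
       The three cases for l_len can be summarized in a small worked
       example.  This is only a sketch of the arithmetic described above,
       not kernel code; struct byte_range and flock_range() are
       hypothetical names, and start is assumed to be the absolute offset
       obtained after l_whence has been applied to l_start:

```c
#include <sys/types.h>

/* Hypothetical result type: the inclusive byte range covered by a
 * lock.  When to_eof is nonzero, the range has no upper bound and
 * last is not meaningful. */
struct byte_range {
    off_t first;
    off_t last;
    int   to_eof;
};

/* Hypothetical helper: compute the bytes covered by a lock that
 * starts at absolute offset start with length len. */
struct byte_range flock_range(off_t start, off_t len)
{
    struct byte_range r = { start, start, 0 };

    if (len > 0) {
        r.last = start + len - 1;   /* start .. start+len-1 */
    } else if (len == 0) {
        r.to_eof = 1;               /* start .. end of file, however large */
    } else {
        r.first = start + len;      /* start+len .. start-1 */
        r.last = start - 1;
    }
    return r;
}
```

       With start 100, a length of 10 covers bytes 100 through 109, while
       a length of -10 covers bytes 90 through 99.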

       The l_type field can be used to place a read (F_RDLCK) or a write
       (F_WRLCK) lock on a file.  Any number of processes may hold a read
       lock (shared lock) on a file region, but only one process may hold a
       write lock (exclusive lock).  An exclusive lock excludes all other
       locks, both shared and exclusive.  A single process can hold only one
       type of lock on a file region; if a new lock is applied to an
       already-locked region, then the existing lock is converted to the new
       lock type.  (Such conversions may involve splitting, shrinking, or
       coalescing with an existing lock if the byte range specified by the
       new lock does not precisely coincide with the range of the existing
       lock.)

       F_SETLK (struct flock *)
              Acquire a lock (when l_type is F_RDLCK or F_WRLCK) or release
              a lock (when l_type is F_UNLCK) on the bytes specified by the
              l_whence, l_start, and l_len fields of lock.  If a conflicting
              lock is held by another process, this call returns -1 and sets
              errno to EACCES or EAGAIN.  (The error returned in this case
              differs across implementations, so POSIX requires a portable
              application to check for both errors.)

       F_SETLKW (struct flock *)
              As for F_SETLK, but if a conflicting lock is held on the file,
              then wait for that lock to be released.  If a signal is caught
              while waiting, then the call is interrupted and (after the
              signal handler has returned) returns immediately (with return
              value -1 and errno set to EINTR; see signal(7)).

       F_GETLK (struct flock *)
              On input to this call, lock describes a lock we would like to
              place on the file.  If the lock could be placed, fcntl() does
              not actually place it, but returns F_UNLCK in the l_type field
              of lock and leaves the other fields of the structure
              unchanged.

              If one or more incompatible locks would prevent this lock
              being placed, then fcntl() returns details about one of those
              locks in the l_type, l_whence, l_start, and l_len fields of
              lock.  If the conflicting lock is a traditional (process-
              associated) record lock, then the l_pid field is set to the
              PID of the process holding that lock.  If the conflicting lock
              is an open file description lock, then l_pid is set to -1.
              Note that the returned information may already be out of date
              by the time the caller inspects it.
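
              The following sketch shows how the result of F_GETLK might
              be interpreted.  The helper name probe_write_lock() and its
              return values are hypothetical; as noted above, the answer
              is only a snapshot and may be stale by the time it is used.

```c
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: ask whether a write lock on the given range
 * could be placed.  Returns 0 if it could, 1 if a conflicting lock
 * exists (its owner is stored in *holder when holder is not NULL),
 * and -1 on error. */
int probe_write_lock(int fd, off_t start, off_t len, pid_t *holder)
{
    struct flock fl;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = len;

    if (fcntl(fd, F_GETLK, &fl) == -1)
        return -1;

    if (fl.l_type == F_UNLCK)   /* nothing stands in the way */
        return 0;

    if (holder != NULL)
        *holder = fl.l_pid;     /* -1 for an open file description lock */
    return 1;
}
```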

       In order to place a read lock, fd must be open for reading.  In order
       to place a write lock, fd must be open for writing.  To place both
       types of lock, open a file read-write.

       When placing locks with F_SETLKW, the kernel detects deadlocks,
       whereby two or more processes have their lock requests mutually
       blocked by locks held by the other processes.  For example, suppose
       process A holds a write lock on byte 100 of a file, and process B
       holds a write lock on byte 200.  If each process then attempts to
       lock the byte already locked by the other process using F_SETLKW,
       then, without deadlock detection, both processes would remain blocked
       indefinitely.  When the kernel detects such deadlocks, it causes one
       of the blocking lock requests to immediately fail with the error
       EDEADLK; an application that encounters such an error should release
       some of its locks to allow other applications to proceed before
       attempting to regain the locks that it requires.  Circular deadlocks
       involving more than two processes are also detected.  Note, however,
       that there are limitations to the kernel's deadlock-detection
       algorithm; see BUGS.

       As well as being removed by an explicit F_UNLCK, record locks are
       automatically released when the process terminates.

       Record locks are not inherited by a child created via fork(2), but
       are preserved across an execve(2).

       Because of the buffering performed by the stdio(3) library, the use
       of record locking with routines in that package should be avoided;
       use read(2) and write(2) instead.

       The record locks described above are associated with the process
       (unlike the open file description locks described below).  This has
       some unfortunate consequences:

       *  If a process closes any file descriptor referring to a file, then
          all of the process's locks on that file are released, regardless
          of the file descriptor(s) on which the locks were obtained.  This
          is bad: it means that a process can lose its locks on a file such
          as /etc/passwd or /etc/mtab when for some reason a library
          function decides to open, read, and close the same file.

       *  The threads in a process share locks.  In other words, a
          multithreaded program can't use record locking to ensure that
          threads don't simultaneously access the same region of a file.

       Open file description locks solve both of these problems.
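
       The first problem can be observed from within a single process, as
       in the sketch below.  It relies on F_OFD_GETLK (described later in
       this page) to probe for the process's own traditional lock through
       a separate open file description.  The function names are
       hypothetical, and the sketch assumes Linux 3.15 or later.

```c
#define _GNU_SOURCE             /* F_OFD_GETLK is Linux-specific */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Returns 1 if some lock conflicts with a write lock on byte 0, as seen
 * from probe_fd, 0 if none does, and -1 on error. */
int byte0_is_locked(int probe_fd)
{
    struct flock fl;

    memset(&fl, 0, sizeof(fl)); /* l_pid must be 0 for OFD commands */
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 1;
    if (fcntl(probe_fd, F_OFD_GETLK, &fl) == -1)
        return -1;
    return fl.l_type != F_UNLCK;
}

/* Places a traditional write lock on byte 0 of an existing file, then
 * opens and closes an unrelated descriptor for the same file.
 * Returns 1 if the lock vanished, 0 if it survived, -1 on error. */
int lock_lost_after_unrelated_close(const char *path)
{
    struct flock fl;
    int lock_fd, probe_fd, other, locked, result = -1;

    lock_fd = open(path, O_RDWR);
    probe_fd = open(path, O_RDWR);
    if (lock_fd == -1 || probe_fd == -1)
        goto out;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 1;
    if (fcntl(lock_fd, F_SETLK, &fl) == -1)
        goto out;
    if (byte0_is_locked(probe_fd) != 1)
        goto out;

    /* What a library function might do behind the caller's back. */
    other = open(path, O_RDONLY);
    if (other == -1)
        goto out;
    close(other);

    locked = byte0_is_locked(probe_fd);
    if (locked != -1)
        result = (locked == 0);
out:
    if (lock_fd != -1)
        close(lock_fd);
    if (probe_fd != -1)
        close(probe_fd);
    return result;
}
```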

   Open file description locks (non-POSIX)
       Open file description locks are advisory byte-range locks whose
       operation is in most respects identical to the traditional record
       locks described above.  This lock type is Linux-specific, and
       available since Linux 3.15.  (There is a proposal with the Austin
       Group to include this lock type in the next revision of POSIX.1.)
       For an explanation of open file descriptions, see open(2).

       The principal difference between the two lock types is that whereas
       traditional record locks are associated with a process, open file
       description locks are associated with the open file description on
       which they are acquired, much like locks acquired with flock(2).
       Consequently (and unlike traditional advisory record locks), open
       file description locks are inherited across fork(2) (and clone(2)
       with CLONE_FILES), and are only automatically released on the last
       close of the open file description, instead of being released on any
       close of the file.

       Conflicting lock combinations (i.e., a read lock and a write lock or
       two write locks) where one lock is an open file description lock and
       the other is a traditional record lock conflict even when they are
       acquired by the same process on the same file descriptor.

       Open file description locks placed via the same open file description
       (i.e., via the same file descriptor, or via a duplicate of the file
       descriptor created by fork(2), dup(2), fcntl(2) F_DUPFD, and so on)
       are always compatible: if a new lock is placed on an already locked
       region, then the existing lock is converted to the new lock type.
       (Such conversions may result in splitting, shrinking, or coalescing
       with an existing lock as discussed above.)

       On the other hand, open file description locks may conflict with each
       other when they are acquired via different open file descriptions.
       Thus, the threads in a multithreaded program can use open file
       description locks to synchronize access to a file region by having
       each thread perform its own open(2) on the file and applying locks
       via the resulting file descriptor.

       As with traditional advisory locks, the third argument to fcntl(),
       lock, is a pointer to an flock structure.  By contrast with
       traditional record locks, the l_pid field of that structure must be
       set to zero when using the commands described below.

       The commands for working with open file description locks are
       analogous to those used with traditional locks:

       F_OFD_SETLK (struct flock *)
              Acquire an open file description lock (when l_type is F_RDLCK
              or F_WRLCK) or release an open file description lock (when
              l_type is F_UNLCK) on the bytes specified by the l_whence,
              l_start, and l_len fields of lock.  If a conflicting lock is
              held by another process, or by the same process via a
              different open file description, this call returns -1 and
              sets errno to EAGAIN.
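
              A minimal sketch of F_OFD_SETLK follows.  The helper name
              ofd_try_write_lock() is hypothetical.  Zeroing the whole
              structure is one simple way to satisfy the requirement that
              l_pid be zero.

```c
#define _GNU_SOURCE             /* F_OFD_SETLK is Linux-specific */
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: try to take an open file description write lock
 * on the whole file.  Returns 0 on success, 1 if a conflicting lock is
 * already held, and -1 on any other error. */
int ofd_try_write_lock(int fd)
{
    struct flock fl;

    memset(&fl, 0, sizeof(fl)); /* also sets l_pid to 0, as required */
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;               /* 0 means "through to end of file" */

    if (fcntl(fd, F_OFD_SETLK, &fl) == 0)
        return 0;
    return (errno == EAGAIN) ? 1 : -1;
}
```

              If each thread calls open(2) on the file itself and passes
              its own descriptor to this helper, a second thread sees the
              return value 1 while the first holds the lock; a descriptor
              obtained with dup(2) shares the lock instead.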

       F_OFD_SETLKW (struct flock *)
              As for F_OFD_SETLK, but if a conflicting lock is held on the
              file, then wait for that lock to be released.  If a signal is
              caught while waiting, then the call is interrupted and (after
              the signal handler has returned) returns immediately (with
              return value -1 and errno set to EINTR; see signal(7)).

       F_OFD_GETLK (struct flock *)
              On input to this call, lock describes an open file description
              lock we would like to place on the file.  If the lock could be
              placed, fcntl() does not actually place it, but returns
              F_UNLCK in the l_type field of lock and leaves the other
              fields of the structure unchanged.  If one or more
              incompatible locks would prevent this lock being placed, then
              details about one of these locks are returned via lock, as
              described above for F_GETLK.

       In the current implementation, no deadlock detection is performed for
       open file description locks.  (This contrasts with process-associated
       record locks, for which the kernel does perform deadlock detection.)

   Mandatory locking
       Warning: the Linux implementation of mandatory locking is unreliable.
       See BUGS below.  Because of these bugs, and the fact that the feature
       is believed to be little used, since Linux 4.5, mandatory locking has
       been made an optional feature, governed by a configuration option
       (CONFIG_MANDATORY_FILE_LOCKING).  This is an initial step toward
       removing this feature completely.

       By default, both traditional (process-associated) and open file
       description record locks are advisory.  Advisory locks are not
       enforced and are useful only between cooperating processes.

       Both lock types can also be mandatory.  Mandatory locks are enforced
       for all processes.  If a process tries to perform an incompatible
       access (e.g., read(2) or write(2)) on a file region that has an
       incompatible mandatory lock, then the result depends upon whether the
       O_NONBLOCK flag is enabled for its open file description.  If the
       O_NONBLOCK flag is not enabled, then the system call is blocked until
       the lock is removed or converted to a mode that is compatible with
       the access.  If the O_NONBLOCK flag is enabled, then the system call
       fails with the error EAGAIN.

       To make use of mandatory locks, mandatory locking must be enabled
       both on the filesystem that contains the file to be locked, and on
       the file itself.  Mandatory locking is enabled on a filesystem using
       the "-o mand" option to mount(8), or the MS_MANDLOCK flag for
       mount(2).  Mandatory locking is enabled on a file by disabling group
       execute permission on the file and enabling the set-group-ID
       permission bit (see chmod(1) and chmod(2)).

       Mandatory locking is not specified by POSIX.  Some other systems also
       support mandatory locking, although the details of how to enable it
       vary across systems.

   Managing signals
       F_GETOWN, F_SETOWN, F_GETOWN_EX, F_SETOWN_EX, F_GETSIG and F_SETSIG
       are used to manage I/O availability signals:

       F_GETOWN (void)
              Return (as the function result) the process ID or process
              group currently receiving SIGIO and SIGURG signals for events
              on file descriptor fd.  Process IDs are returned as positive
              values; process group IDs are returned as negative values (but
              see BUGS below).  arg is ignored.

       F_SETOWN (int)
              Set the process ID or process group ID that will receive SIGIO
              and SIGURG signals for events on the file descriptor fd.  The
              target process or process group ID is specified in arg.  A
              process ID is specified as a positive value; a process group
              ID is specified as a negative value.  Most commonly, the
              calling process specifies itself as the owner (that is, arg is
              specified as getpid(2)).

              As well as setting the file descriptor owner, one must also
              enable generation of signals on the file descriptor.  This is
              done by using the fcntl() F_SETFL command to set the O_ASYNC
              file status flag on the file descriptor.  Subsequently, a
              SIGIO signal is sent whenever input or output becomes possible
              on the file descriptor.  The fcntl() F_SETSIG command can be
              used to obtain delivery of a signal other than SIGIO.

              Sending a signal to the owner process (group) specified by
              F_SETOWN is subject to the same permissions checks as are
              described for kill(2), where the sending process is the one
              that employs F_SETOWN (but see BUGS below).  If this
              permission check fails, then the signal is silently discarded.

              If the file descriptor fd refers to a socket, F_SETOWN also
              selects the recipient of SIGURG signals that are delivered
              when out-of-band data arrives on that socket.  (SIGURG is sent
              in any situation where select(2) would report the socket as
              having an "exceptional condition".)

              The following was true in 2.6.x kernels up to and including
              kernel 2.6.11:

                     If a nonzero value is given to F_SETSIG in a
                     multithreaded process running with a threading library
                     that supports thread groups (e.g., NPTL), then a
                     positive value given to F_SETOWN has a different
                     meaning: instead of being a process ID identifying a
                     whole process, it is a thread ID identifying a specific
                     thread within a process.  Consequently, it may be
                     necessary to pass F_SETOWN the result of gettid(2)
                     instead of getpid(2) to get sensible results when
                     F_SETSIG is used.  (In current Linux threading
                     implementations, a main thread's thread ID is the same
                     as its process ID.  This means that a single-threaded
                     program can equally use gettid(2) or getpid(2) in this
                     scenario.)  Note, however, that the statements in this
                     paragraph do not apply to the SIGURG signal generated
                     for out-of-band data on a socket: this signal is always
                     sent to either a process or a process group, depending
                     on the value given to F_SETOWN.

              The above behavior was accidentally dropped in Linux 2.6.12,
              and won't be restored.  From Linux 2.6.32 onward, use
              F_SETOWN_EX to target SIGIO and SIGURG signals at a particular
              thread.
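
              The basic F_SETOWN and O_ASYNC setup described above might
              look like the following sketch.  The names enable_sigio(),
              on_sigio(), and io_possible are hypothetical.  The handler
              is installed first because the default action of SIGIO is
              to terminate the process.

```c
#define _GNU_SOURCE             /* exposes O_ASYNC in strict modes */
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Set by the handler; a main loop would test and clear it. */
volatile sig_atomic_t io_possible = 0;

void on_sigio(int sig)
{
    (void)sig;
    io_possible = 1;
}

/* Hypothetical helper: make the calling process the owner of fd and
 * enable signal-driven I/O on it.  Returns 0 on success, -1 on error. */
int enable_sigio(int fd)
{
    struct sigaction sa;
    int flags;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigio;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGIO, &sa, NULL) == -1)
        return -1;

    if (fcntl(fd, F_SETOWN, getpid()) == -1)
        return -1;

    flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    if (fcntl(fd, F_SETFL, flags | O_ASYNC) == -1)
        return -1;
    return 0;
}
```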

       F_GETOWN_EX (struct f_owner_ex *) (since Linux 2.6.32)
              Return the current file descriptor owner settings as defined
              by a previous F_SETOWN_EX operation.  The information is
              returned in the structure pointed to by arg, which has the
              following form:

                  struct f_owner_ex {
                      int   type;
                      pid_t pid;
                  };

              The type field will have one of the values F_OWNER_TID,
              F_OWNER_PID, or F_OWNER_PGRP.  The pid field is a positive
              integer representing a thread ID, process ID, or process group
              ID.  See F_SETOWN_EX for more details.

       F_SETOWN_EX (struct f_owner_ex *) (since Linux 2.6.32)
              This operation performs a similar task to F_SETOWN.  It allows
              the caller to direct I/O availability signals to a specific
              thread, process, or process group.  The caller specifies the
              target of signals via arg, which is a pointer to an f_owner_ex
              structure.  The type field has one of the following values,
              which define how pid is interpreted:

              F_OWNER_TID
                     Send the signal to the thread whose thread ID (the
                     value returned by a call to clone(2) or gettid(2)) is
                     specified in pid.

              F_OWNER_PID
                     Send the signal to the process whose ID is specified in
                     pid.

              F_OWNER_PGRP
                     Send the signal to the process group whose ID is
                     specified in pid.  (Note that, unlike with F_SETOWN, a
                     process group ID is specified as a positive value
                     here.)
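
              As an illustration, the sketch below directs signals for fd
              to the calling thread.  The helper name is hypothetical.
              The thread ID is obtained with syscall(2) because a
              gettid() wrapper is not available in every C library
              version.

```c
#define _GNU_SOURCE             /* struct f_owner_ex, F_SETOWN_EX */
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel to send I/O availability signals
 * for fd to the calling thread only.  Returns 0 on success, -1 on
 * error. */
int direct_signals_to_this_thread(int fd)
{
    struct f_owner_ex owner;

    owner.type = F_OWNER_TID;
    owner.pid = (pid_t)syscall(SYS_gettid);
    return fcntl(fd, F_SETOWN_EX, &owner);
}
```

              The setting can be read back with F_GETOWN_EX, which fills
              in a structure of the same type.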

       F_GETSIG (void)
              Return (as the function result) the signal sent when input or
              output becomes possible.  A value of zero means SIGIO is sent.
              Any other value (including SIGIO) is the signal sent instead,
              and in this case additional info is available to the signal
              handler if installed with SA_SIGINFO.  arg is ignored.

       F_SETSIG (int)
              Set the signal sent when input or output becomes possible to
              the value given in arg.  A value of zero means to send the
              default SIGIO signal.  Any other value (including SIGIO) is
              the signal to send instead, and in this case additional info
              is available to the signal handler if installed with
              SA_SIGINFO.

              By using F_SETSIG with a nonzero value, and setting SA_SIGINFO
              for the signal handler (see sigaction(2)), extra information
              about I/O events is passed to the handler in a siginfo_t
              structure.  If the si_code field indicates the source is
              SI_SIGIO, the si_fd field gives the file descriptor associated
              with the event.  Otherwise, there is no indication which file
              descriptors are pending, and you should use the usual
              mechanisms (select(2), poll(2), read(2) with O_NONBLOCK set,
              etc.) to determine which file descriptors are available for
              I/O.

              Note that the file descriptor provided in si_fd is the one
              that was specified during the F_SETSIG operation.  This can
              lead to an unusual corner case.  If the file descriptor is
              duplicated (dup(2) or similar), and the original file
              descriptor is closed, then I/O events will continue to be
              generated, but the si_fd field will contain the number of the
              now closed file descriptor.

              By selecting a real-time signal (value >= SIGRTMIN), multiple
              I/O events may be queued using the same signal number.
              (Queuing is dependent on available memory.)  Extra information
              is available if SA_SIGINFO is set for the signal handler, as
              above.

              Note that Linux imposes a limit on the number of real-time
              signals that may be queued to a process (see getrlimit(2) and
              signal(7)) and if this limit is reached, then the kernel
              reverts to delivering SIGIO, and this signal is delivered to
              the entire process rather than to a specific thread.
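
              A sketch of F_SETSIG used together with SA_SIGINFO follows.
              All names are hypothetical.  For brevity the handler only
              records si_fd; a careful handler would first inspect
              si_code as described above, and a real program would also
              handle plain SIGIO, since the kernel falls back to it when
              the real-time signal queue is full.

```c
#define _GNU_SOURCE             /* F_SETSIG, O_ASYNC */
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Descriptor reported by the most recent signal, or -1 if none yet. */
volatile sig_atomic_t last_ready_fd = -1;

void on_io_signal(int sig, siginfo_t *info, void *ucontext)
{
    (void)sig;
    (void)ucontext;
    last_ready_fd = info->si_fd;
}

/* Hypothetical helper: deliver SIGRTMIN, with siginfo_t details, to
 * this process whenever I/O becomes possible on fd.
 * Returns 0 on success, -1 on error. */
int enable_rt_io_signal(int fd)
{
    struct sigaction sa;
    int flags;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_io_signal;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGRTMIN, &sa, NULL) == -1)
        return -1;

    if (fcntl(fd, F_SETSIG, SIGRTMIN) == -1)
        return -1;
    if (fcntl(fd, F_SETOWN, getpid()) == -1)
        return -1;

    flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return (fcntl(fd, F_SETFL, flags | O_ASYNC) == -1) ? -1 : 0;
}
```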

       Using these mechanisms, a program can implement fully asynchronous
       I/O without using select(2) or poll(2) most of the time.

       The use of O_ASYNC is specific to BSD and Linux.  The only use of
       F_GETOWN and F_SETOWN specified in POSIX.1 is in conjunction with the
       use of the SIGURG signal on sockets.  (POSIX does not specify the
       SIGIO signal.)  F_GETOWN_EX, F_SETOWN_EX, F_GETSIG, and F_SETSIG are
       Linux-specific.  POSIX has asynchronous I/O and the aio_sigevent
       structure to achieve similar things; these are also available in
       Linux as part of the GNU C Library (glibc).

   Leases
       F_SETLEASE and F_GETLEASE (Linux 2.4 onward) are used (respectively)
       to establish a new lease, and retrieve the current lease, on the open
       file description referred to by the file descriptor fd.  A file lease
       provides a mechanism whereby the process holding the lease (the
       "lease holder") is notified (via delivery of a signal) when a process
       (the "lease breaker") tries to open(2) or truncate(2) the file
       referred to by that file descriptor.

       F_SETLEASE (int)
              Set or remove a file lease according to which of the following
              values is specified in the integer arg:

              F_RDLCK
                     Take out a read lease.  This will cause the calling
                     process to be notified when the file is opened for
                     writing or is truncated.  A read lease can be placed
                     only on a file descriptor that is opened read-only.

              F_WRLCK
                     Take out a write lease.  This will cause the caller to
                     be notified when the file is opened for reading or
                     writing or is truncated.  A write lease may be placed
                     on a file only if there are no other open file
                     descriptors for the file.

              F_UNLCK
                     Remove our lease from the file.

       Leases are associated with an open file description (see open(2)).
       This means that duplicate file descriptors (created by, for example,
       fork(2) or dup(2)) refer to the same lease, and this lease may be
       modified or released using any of these descriptors.  Furthermore,
       the lease is released by either an explicit F_UNLCK operation on any
       of these duplicate file descriptors, or when all such file
       descriptors have been closed.

       Leases may be taken out only on regular files.  An unprivileged
       process may take out a lease only on a file whose UID (owner) matches
       the filesystem UID of the process.  A process with the CAP_LEASE
       capability may take out leases on arbitrary files.

       F_GETLEASE (void)
              Indicates what type of lease is associated with the file
              descriptor fd by returning either F_RDLCK, F_WRLCK, or
              F_UNLCK, indicating, respectively, a read lease, a write
              lease, or no lease.  arg is ignored.

       When a process (the "lease breaker") performs an open(2) or
       truncate(2) that conflicts with a lease established via F_SETLEASE,
       the system call is blocked by the kernel and the kernel notifies the
       lease holder by sending it a signal (SIGIO by default).  The lease
       holder should respond to receipt of this signal by doing whatever
       cleanup is required in preparation for the file to be accessed by
       another process (e.g., flushing cached buffers) and then either
       remove or downgrade its lease.  A lease is removed by performing an
       F_SETLEASE command specifying arg as F_UNLCK.  If the lease holder
       currently holds a write lease on the file, and the lease breaker is
       opening the file for reading, then it is sufficient for the lease
       holder to downgrade the lease to a read lease.  This is done by
       performing an F_SETLEASE command specifying arg as F_RDLCK.
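
       The following sketch wraps the lease operations discussed so far.
       The helper names are hypothetical.  yield_lease() is meant to be
       called by a lease holder after it has been notified of a lease
       break and has finished its cleanup; it relies on the fact that,
       while a break is in progress, F_GETLEASE reports the lease type
       that would satisfy the lease breaker.

```c
#define _GNU_SOURCE             /* F_SETLEASE, F_GETLEASE */
#include <fcntl.h>

/* Hypothetical helper: establish a lease of the given type (F_RDLCK or
 * F_WRLCK), or remove it (F_UNLCK).  Returns 0 on success, -1 on
 * error. */
int take_lease(int fd, int type)
{
    return fcntl(fd, F_SETLEASE, type);
}

/* Hypothetical helper: during a lease break, downgrade or remove the
 * lease, whichever the kernel is asking for.
 * Returns 0 on success, -1 on error. */
int yield_lease(int fd)
{
    /* While a break is pending this is F_RDLCK or F_UNLCK. */
    int target = fcntl(fd, F_GETLEASE);

    if (target == -1)
        return -1;
    return fcntl(fd, F_SETLEASE, target);
}
```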

       If the lease holder fails to downgrade or remove the lease within the
       number of seconds specified in /proc/sys/fs/lease-break-time, then
       the kernel forcibly removes or downgrades the lease holder's lease.

       Once a lease break has been initiated, F_GETLEASE returns the target
       lease type (either F_RDLCK or F_UNLCK, depending on what would be
       compatible with the lease breaker) until the lease holder voluntarily
       downgrades or removes the lease or the kernel forcibly does so after
       the lease break timer expires.

       Once the lease has been voluntarily or forcibly removed or
       downgraded, and assuming the lease breaker has not unblocked its
       system call, the kernel permits the lease breaker's system call to
       proceed.

       If the lease breaker's blocked open(2) or truncate(2) is interrupted
       by a signal handler, then the system call fails with the error EINTR,
       but the other steps still occur as described above.  If the lease
       breaker is killed by a signal while blocked in open(2) or
       truncate(2), then the other steps still occur as described above.  If
       the lease breaker specifies the O_NONBLOCK flag when calling open(2),
       then the call immediately fails with the error EWOULDBLOCK, but the
       other steps still occur as described above.

       The default signal used to notify the lease holder is SIGIO, but this
       can be changed using the F_SETSIG command to fcntl().  If an F_SETSIG
       command is performed (even one specifying SIGIO), and the signal
       handler is established using SA_SIGINFO, then the handler will
       receive a siginfo_t structure as its second argument, and the si_fd
       field of this argument will hold the file descriptor of the leased
       file that has been accessed by another process.  (This is useful if
       the caller holds leases against multiple files.)

   File and directory change notification (dnotify)
       F_NOTIFY (int)
              (Linux 2.4 onward) Provide notification when the directory
              referred to by fd or any of the files that it contains is
              changed.  The events to be notified are specified in arg,
              which is a bit mask specified by ORing together zero or more
              of the following bits:

              DN_ACCESS   A file was accessed (read(2), pread(2), readv(2),
                          and similar).
              DN_MODIFY   A file was modified (write(2), pwrite(2),
                          writev(2), truncate(2), ftruncate(2), and
                          similar).
              DN_CREATE   A file was created (open(2), creat(2), mknod(2),
                          mkdir(2), link(2), symlink(2), rename(2) into this
                          directory).
              DN_DELETE   A file was unlinked (unlink(2), rename(2) to
                          another directory, rmdir(2)).
              DN_RENAME   A file was renamed within this directory
                          (rename(2)).
              DN_ATTRIB   The attributes of a file were changed (chown(2),
                          chmod(2), utime(2), utimensat(2), and similar).

              (In order to obtain these definitions, the _GNU_SOURCE feature
              test macro must be defined before including any header files.)

              Directory notifications are normally "one-shot", and the
              application must reregister to receive further notifications.
              Alternatively, if DN_MULTISHOT is included in arg, then
              notification will remain in effect until explicitly removed.

              A series of F_NOTIFY requests is cumulative, with the events
              in arg being added to the set already monitored.  To disable
              notification of all events, make an F_NOTIFY call specifying
              arg as 0.

              Notification occurs via delivery of a signal.  The default
              signal is SIGIO, but this can be changed using the F_SETSIG
              command to fcntl().  (Note that SIGIO is one of the nonqueuing
              standard signals; switching to the use of a real-time signal
              means that multiple notifications can be queued to the
              process.)  In the latter case, the signal handler receives a
              siginfo_t structure as its second argument (if the handler was
              established using SA_SIGINFO) and the si_fd field of this
              structure contains the file descriptor which generated the
              notification (useful when establishing notification on
              multiple directories).

              Especially when using DN_MULTISHOT, a real-time signal should
              be used for notification, so that multiple notifications can
              be queued.

              NOTE: New applications should use the inotify interface
              (available since kernel 2.6.13), which provides a much
              superior interface for obtaining notifications of filesystem
              events.  See inotify(7).
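
              For completeness, a dnotify registration might be sketched
              as follows.  The helper names and the chosen event mask are
              hypothetical.  The caller is assumed to have already
              installed a handler for signo, since an unhandled
              real-time signal terminates the process.

```c
#define _GNU_SOURCE             /* F_NOTIFY, DN_*, F_SETSIG */
#include <fcntl.h>

/* Hypothetical helper: request persistent notification of files being
 * created, deleted, or modified in the directory referred to by dirfd,
 * delivered as signal signo (0 selects the default, SIGIO).
 * Returns 0 on success, -1 on error. */
int watch_directory(int dirfd, int signo)
{
    if (fcntl(dirfd, F_SETSIG, signo) == -1)
        return -1;
    return fcntl(dirfd, F_NOTIFY,
                 DN_CREATE | DN_DELETE | DN_MODIFY | DN_MULTISHOT);
}

/* Hypothetical helper: stop all notifications for dirfd. */
int unwatch_directory(int dirfd)
{
    return fcntl(dirfd, F_NOTIFY, 0);
}
```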

   Changing the capacity of a pipe
       F_SETPIPE_SZ (int; since Linux 2.6.35)
              Change the capacity of the pipe referred to by fd to be at
              least arg bytes.  An unprivileged process can adjust the pipe
              capacity to any value between the system page size and the
              limit defined in /proc/sys/fs/pipe-max-size (see proc(5)).
              Attempts to set the pipe capacity below the page size are
              silently rounded up to the page size.  Attempts by an
              unprivileged process to set the pipe capacity above the limit
              in /proc/sys/fs/pipe-max-size yield the error EPERM; a
              privileged process (CAP_SYS_RESOURCE) can override the limit.
              When allocating the buffer for the pipe, the kernel may use a
              capacity larger than arg, if that is convenient for the
              implementation.  The actual capacity that is set is returned
              as the function result.  Attempting to set the pipe capacity
              smaller than the amount of buffer space currently used to
              store data produces the error EBUSY.

       F_GETPIPE_SZ (void; since Linux 2.6.35)
              Return (as the function result) the capacity of the pipe
              referred to by fd.
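
              A short sketch of the two operations follows.  The helper
              name make_pipe_with_capacity() is hypothetical.  Because
              the kernel may round the request up, the value returned by
              F_SETPIPE_SZ is treated as the capacity actually in effect.

```c
#define _GNU_SOURCE             /* F_SETPIPE_SZ, F_GETPIPE_SZ */
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: create a pipe and set its capacity to at least
 * wanted_bytes.  On success, fills fds[] as pipe(2) does and returns
 * the actual capacity; on error, returns -1. */
int make_pipe_with_capacity(int fds[2], int wanted_bytes)
{
    int actual;

    if (pipe(fds) == -1)
        return -1;

    actual = fcntl(fds[1], F_SETPIPE_SZ, wanted_bytes);
    if (actual == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    return actual;
}
```

              A later fcntl(fds[0], F_GETPIPE_SZ) on either end reports
              the same capacity.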

   File Sealing
       File seals limit the set of allowed operations on a given file.  For
       each seal that is set on a file, a specific set of operations will
       fail with EPERM on this file from now on.  The file is said to be
       sealed.  The default set of seals depends on the type of the
       underlying file and filesystem.  For an overview of file sealing, a
       discussion of its purpose, and some code examples, see
       memfd_create(2).

       Currently, only the tmpfs filesystem supports sealing.  On other
       filesystems, all fcntl(2) operations that operate on seals will
       return EINVAL.

       Seals are a property of an inode.  Thus, all open file descriptors
       referring to the same inode share the same set of seals.
       Furthermore, seals can never be removed, only added.

       F_ADD_SEALS (int; since Linux 3.17)
              Add the seals given in the bit-mask argument arg to the set of
              seals of the inode referred to by the file descriptor fd.
              Seals cannot be removed again.  Once this call succeeds, the
              seals are enforced by the kernel immediately.  If the current
              set of seals includes F_SEAL_SEAL (see below), then this call
              will be rejected with EPERM.  Adding a seal that is already
              set is a no-op, provided that F_SEAL_SEAL is not already set.
              In order to place a seal, the file descriptor fd must be
              writable.

       F_GET_SEALS (void; since Linux 3.17)
              Return (as the function result) the current set of seals of
              the inode referred to by fd.  If no seals are set, 0 is
              returned.  If the file does not support sealing, -1 is
              returned and errno is set to EINVAL.

       The following seals are available:

       F_SEAL_SEAL
              If this seal is set, any further call to fcntl(2) with
              F_ADD_SEALS will fail with EPERM.  Therefore, this seal
              prevents any modifications to the set of seals itself.  If the
              initial set of seals of a file includes F_SEAL_SEAL, then this
              effectively causes the set of seals to be constant and locked.

       F_SEAL_SHRINK
              If this seal is set, the file in question cannot be reduced in
              size.  This affects open(2) with the O_TRUNC flag as well as
              truncate(2) and ftruncate(2).  Those calls will fail with
              EPERM if you try to shrink the file in question.  Increasing
              the file size is still possible.

       F_SEAL_GROW
              If this seal is set, the size of the file in question cannot
              be increased.  This affects write(2) beyond the end of the
              file, truncate(2), ftruncate(2), and fallocate(2).  These
              calls will fail with EPERM if you use them to increase the
              file size.  If you keep the size or shrink it, those calls
              still work as expected.

       F_SEAL_WRITE
              If this seal is set, you cannot modify the contents of the
              file.  Note that shrinking or growing the size of the file is
              still possible and allowed.  Thus, this seal is normally used
              in combination with one of the other seals.  This seal affects
              write(2) and fallocate(2) (only in combination with the
              FALLOC_FL_PUNCH_HOLE flag).  Those calls will fail with EPERM
              if this seal is set.  Furthermore, trying to create new
              shared, writable memory-mappings via mmap(2) will also fail
              with EPERM.

              Using the F_ADD_SEALS operation to set the F_SEAL_WRITE seal
              will fail with EBUSY if any writable, shared mapping exists.
              Such mappings must be unmapped before you can add this seal.
              Furthermore, if there are any asynchronous I/O operations
              (io_submit(2)) pending on the file, all outstanding writes
              will be discarded.
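
       The seal semantics above can be tried out with a short sketch.  It
       assumes the file comes from memfd_create(2) with MFD_ALLOW_SEALING,
       which is one common way to obtain a file that supports sealing; the
       helper name make_fixed_size_memfd and the label "sealed-demo" are
       made up for this illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* memfd_create(), F_ADD_SEALS, F_SEAL_* */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create an in-memory file of `size` bytes and seal its size and its
 * seal set.  Returns the file descriptor, or -1 with errno set.
 * Hypothetical helper, not part of any library. */
int make_fixed_size_memfd(off_t size)
{
    int fd = memfd_create("sealed-demo", MFD_CLOEXEC | MFD_ALLOW_SEALING);
    if (fd == -1)
        return -1;

    if (ftruncate(fd, size) == -1 ||
        fcntl(fd, F_ADD_SEALS,
              F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_SEAL) == -1) {
        int saved = errno;      /* keep the original error across close() */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
```

       After this call, ftruncate(2) to a larger or smaller size should
       fail with EPERM, and so should a later F_ADD_SEALS, because
       F_SEAL_SEAL is part of the set.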
http://man7.org/linux/man-pages/man2/dup.2.html
11
SYSTEM CALL:
dup(2) - Linux manual page
FUNCTIONALITY:

       dup, dup2, dup3 - duplicate a file descriptor
SYNOPSIS:

       #include <unistd.h>

       int dup(int oldfd);
       int dup2(int oldfd, int newfd);

       #define _GNU_SOURCE             /* See feature_test_macros(7) */
       #include <fcntl.h>              /* Obtain O_* constant definitions */
       #include <unistd.h>

       int dup3(int oldfd, int newfd, int flags);
DESCRIPTION

       The dup() system call creates a copy of the file descriptor oldfd,
       using the lowest-numbered unused file descriptor for the new
       descriptor.

       After a successful return, the old and new file descriptors may be
       used interchangeably.  They refer to the same open file description
       (see open(2)) and thus share file offset and file status flags; for
       example, if the file offset is modified by using lseek(2) on one of
       the file descriptors, the offset is also changed for the other.

       The two file descriptors do not share file descriptor flags (the
       close-on-exec flag).  The close-on-exec flag (FD_CLOEXEC; see
       fcntl(2)) for the duplicate descriptor is off.

   dup2()
       The dup2() system call performs the same task as dup(), but instead
       of using the lowest-numbered unused file descriptor, it uses the file
       descriptor number specified in newfd.  If the file descriptor newfd
       was previously open, it is silently closed before being reused.

       The steps of closing and reusing the file descriptor newfd are
       performed atomically.  This is important, because trying to implement
       equivalent functionality using close(2) and dup() would be subject to
       race conditions, whereby newfd might be reused between the two steps.
       Such reuse could happen because the main program is interrupted by a
       signal handler that allocates a file descriptor, or because a
       parallel thread allocates a file descriptor.

       Note the following points:

       *  If oldfd is not a valid file descriptor, then the call fails, and
          newfd is not closed.

       *  If oldfd is a valid file descriptor, and newfd has the same value
          as oldfd, then dup2() does nothing, and returns newfd.

   dup3()
       dup3() is the same as dup2(), except that:

       *  The caller can force the close-on-exec flag to be set for the new
          file descriptor by specifying O_CLOEXEC in flags.  See the
          description of the same flag in open(2) for reasons why this may
          be useful.

       *  If oldfd equals newfd, then dup3() fails with the error EINVAL.
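
       The shared file offset described above can be observed with a
       minimal sketch.  The helper name offset_after_seek_on_duplicate is
       hypothetical, and the file at path is assumed to exist and to be
       seekable.

```c
#include <fcntl.h>
#include <unistd.h>

/* Open `path`, duplicate the descriptor, seek through the duplicate, and
 * return the offset observed through the original descriptor.
 * Returns -1 on error.  Hypothetical helper for illustration. */
off_t offset_after_seek_on_duplicate(const char *path, off_t target)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    off_t seen = -1;
    int copy = dup(fd);
    if (copy != -1) {
        if (lseek(copy, target, SEEK_SET) != -1)
            seen = lseek(fd, 0, SEEK_CUR);  /* query only, does not move */
        close(copy);
    }
    close(fd);
    return seen;
}
```

       Because both descriptors refer to the same open file description,
       the value returned equals target.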
http://man7.org/linux/man-pages/man2/dup2.2.html
11
SYSTEM CALL:
dup(2) - Linux manual page
FUNCTIONALITY:

       dup, dup2, dup3 - duplicate a file descriptor
SYNOPSIS:

       #include <unistd.h>

       int dup(int oldfd);
       int dup2(int oldfd, int newfd);

       #define _GNU_SOURCE             /* See feature_test_macros(7) */
       #include <fcntl.h>              /* Obtain O_* constant definitions */
       #include <unistd.h>

       int dup3(int oldfd, int newfd, int flags);
DESCRIPTION

       The dup() system call creates a copy of the file descriptor oldfd,
       using the lowest-numbered unused file descriptor for the new
       descriptor.

       After a successful return, the old and new file descriptors may be
       used interchangeably.  They refer to the same open file description
       (see open(2)) and thus share file offset and file status flags; for
       example, if the file offset is modified by using lseek(2) on one of
       the file descriptors, the offset is also changed for the other.

       The two file descriptors do not share file descriptor flags (the
       close-on-exec flag).  The close-on-exec flag (FD_CLOEXEC; see
       fcntl(2)) for the duplicate descriptor is off.

   dup2()
       The dup2() system call performs the same task as dup(), but instead
       of using the lowest-numbered unused file descriptor, it uses the file
       descriptor number specified in newfd.  If the file descriptor newfd
       was previously open, it is silently closed before being reused.

       The steps of closing and reusing the file descriptor newfd are
       performed atomically.  This is important, because trying to implement
       equivalent functionality using close(2) and dup() would be subject to
       race conditions, whereby newfd might be reused between the two steps.
       Such reuse could happen because the main program is interrupted by a
       signal handler that allocates a file descriptor, or because a
       parallel thread allocates a file descriptor.

       Note the following points:

       *  If oldfd is not a valid file descriptor, then the call fails, and
          newfd is not closed.

       *  If oldfd is a valid file descriptor, and newfd has the same value
          as oldfd, then dup2() does nothing, and returns newfd.

   dup3()
       dup3() is the same as dup2(), except that:

       *  The caller can force the close-on-exec flag to be set for the new
          file descriptor by specifying O_CLOEXEC in flags.  See the
          description of the same flag in open(2) for reasons why this may
          be useful.

       *  If oldfd equals newfd, then dup3() fails with the error EINVAL.
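
       A typical use of dup2() is temporary redirection.  The sketch below
       saves where a descriptor points, redirects it, writes through it,
       and restores it; the helper name write_redirected is an assumption
       made for this example.

```c
#include <unistd.h>

/* Temporarily make `target` refer to the same file as `source`, write
 * `len` bytes through `target`, then restore `target`.
 * Returns 0 on success, -1 on error.  Hypothetical helper. */
int write_redirected(int target, int source, const void *msg, size_t len)
{
    int saved = dup(target);            /* remember the original file */
    if (saved == -1)
        return -1;

    int rc = -1;
    if (dup2(source, target) != -1) {   /* closes old target atomically */
        if (write(target, msg, len) == (ssize_t)len)
            rc = 0;
        if (dup2(saved, target) == -1)  /* put the original back */
            rc = -1;
    }
    close(saved);
    return rc;
}
```

       Calling write_redirected(STDOUT_FILENO, logfd, "hi\n", 3) would send
       the three bytes to logfd and leave standard output as it was.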
http://man7.org/linux/man-pages/man2/dup3.2.html
11
SYSTEM CALL:
dup(2) - Linux manual page
FUNCTIONALITY:

       dup, dup2, dup3 - duplicate a file descriptor
SYNOPSIS:

       #include <unistd.h>

       int dup(int oldfd);
       int dup2(int oldfd, int newfd);

       #define _GNU_SOURCE             /* See feature_test_macros(7) */
       #include <fcntl.h>              /* Obtain O_* constant definitions */
       #include <unistd.h>

       int dup3(int oldfd, int newfd, int flags);
DESCRIPTION

       The dup() system call creates a copy of the file descriptor oldfd,
       using the lowest-numbered unused file descriptor for the new
       descriptor.

       After a successful return, the old and new file descriptors may be
       used interchangeably.  They refer to the same open file description
       (see open(2)) and thus share file offset and file status flags; for
       example, if the file offset is modified by using lseek(2) on one of
       the file descriptors, the offset is also changed for the other.

       The two file descriptors do not share file descriptor flags (the
       close-on-exec flag).  The close-on-exec flag (FD_CLOEXEC; see
       fcntl(2)) for the duplicate descriptor is off.

   dup2()
       The dup2() system call performs the same task as dup(), but instead
       of using the lowest-numbered unused file descriptor, it uses the file
       descriptor number specified in newfd.  If the file descriptor newfd
       was previously open, it is silently closed before being reused.

       The steps of closing and reusing the file descriptor newfd are
       performed atomically.  This is important, because trying to implement
       equivalent functionality using close(2) and dup() would be subject to
       race conditions, whereby newfd might be reused between the two steps.
       Such reuse could happen because the main program is interrupted by a
       signal handler that allocates a file descriptor, or because a
       parallel thread allocates a file descriptor.

       Note the following points:

       *  If oldfd is not a valid file descriptor, then the call fails, and
          newfd is not closed.

       *  If oldfd is a valid file descriptor, and newfd has the same value
          as oldfd, then dup2() does nothing, and returns newfd.

   dup3()
       dup3() is the same as dup2(), except that:

       *  The caller can force the close-on-exec flag to be set for the new
          file descriptor by specifying O_CLOEXEC in flags.  See the
          description of the same flag in open(2) for reasons why this may
          be useful.

       *  If oldfd equals newfd, then dup3() fails with the error EINVAL.
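
       The two differences between dup3() and dup2() can be checked with
       a small sketch.  The helper names dup_cloexec_onto and has_cloexec
       are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* See feature_test_macros(7) */
#endif
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Duplicate oldfd onto newfd and set close-on-exec in the same step.
 * Returns newfd, or -1 with errno set (EINVAL if oldfd == newfd). */
int dup_cloexec_onto(int oldfd, int newfd)
{
    return dup3(oldfd, newfd, O_CLOEXEC);
}

/* Report whether FD_CLOEXEC is set on fd: 1 if set, 0 if clear,
 * -1 on error. */
int has_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return (flags & FD_CLOEXEC) != 0;
}
```

       A descriptor obtained from plain dup() reports 0 from has_cloexec(),
       while one produced by dup_cloexec_onto() reports 1.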
http://man7.org/linux/man-pages/man2/flock.2.html
10
SYSTEM CALL:
flock(2) - Linux manual page
FUNCTIONALITY:

       flock - apply or remove an advisory lock on an open file
SYNOPSIS:

       #include <sys/file.h>

       int flock(int fd, int operation);
DESCRIPTION

       Apply or remove an advisory lock on the open file specified by fd.
       The argument operation is one of the following:

           LOCK_SH  Place a shared lock.  More than one process may hold a
                    shared lock for a given file at a given time.

           LOCK_EX  Place an exclusive lock.  Only one process may hold an
                    exclusive lock for a given file at a given time.

           LOCK_UN  Remove an existing lock held by this process.

       A call to flock() may block if an incompatible lock is held by
       another process.  To make a nonblocking request, include LOCK_NB (by
       ORing) with any of the above operations.

       A single file may not simultaneously have both shared and exclusive
       locks.

       Locks created by flock() are associated with an open file description
       (see open(2)).  This means that duplicate file descriptors (created
       by, for example, fork(2) or dup(2)) refer to the same lock, and this
       lock may be modified or released using any of these file descriptors.
       Furthermore, the lock is released either by an explicit LOCK_UN
       operation on any of these duplicate file descriptors, or when all
       such file descriptors have been closed.

       If a process uses open(2) (or similar) to obtain more than one file
       descriptor for the same file, these file descriptors are treated
       independently by flock().  An attempt to lock the file using one of
       these file descriptors may be denied by a lock that the calling
       process has already placed via another file descriptor.

       A process may hold only one type of lock (shared or exclusive) on a
       file.  Subsequent flock() calls on an already locked file will
       convert an existing lock to the new lock mode.

       Locks created by flock() are preserved across an execve(2).

       A shared or exclusive lock can be placed on a file regardless of the
       mode in which the file was opened.
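
       The association between flock() locks and open file descriptions
       can be illustrated with a nonblocking helper.  This is a sketch
       only; the name try_lock_exclusive is made up here, and it relies on
       flock() reporting EWOULDBLOCK when LOCK_NB is given and the lock is
       held elsewhere.

```c
#include <errno.h>
#include <sys/file.h>
#include <unistd.h>

/* Try to take an exclusive lock without blocking.
 * Returns 1 if the lock was acquired, 0 if an incompatible lock is held
 * through another open file description, -1 on any other error. */
int try_lock_exclusive(int fd)
{
    if (flock(fd, LOCK_EX | LOCK_NB) == 0)
        return 1;
    return (errno == EWOULDBLOCK) ? 0 : -1;
}
```

       If a and b come from two separate open(2) calls on the same file,
       a successful try_lock_exclusive(a) makes try_lock_exclusive(b)
       return 0, even within one process.  A descriptor created with
       dup(a) shares the lock, so LOCK_UN on the duplicate releases it
       for a as well.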
http://man7.org/linux/man-pages/man2/read.2.html
11
SYSTEM CALL:
read(2) - Linux manual page
FUNCTIONALITY:

       read - read from a file descriptor
SYNOPSIS:

       #include <unistd.h>

       ssize_t read(int fd, void *buf, size_t count);
DESCRIPTION

       read() attempts to read up to count bytes from file descriptor fd
       into the buffer starting at buf.

       On files that support seeking, the read operation commences at the
       file offset, and the file offset is incremented by the number of
       bytes read.  If the file offset is at or past the end of file, no
       bytes are read, and read() returns zero.

       If count is zero, read() may detect the errors described below.  In
       the absence of any errors, or if read() does not check for errors, a
       read() with a count of 0 returns zero and has no other effects.

       If count is greater than SSIZE_MAX, the result is unspecified.
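
       Because read() may return fewer bytes than requested, callers that
       need a full buffer usually loop.  The following is a minimal
       sketch; the name read_full is hypothetical, and retrying on EINTR
       is a common convention rather than something stated above.

```c
#include <errno.h>
#include <unistd.h>

/* Read up to `count` bytes, continuing after short reads and EINTR.
 * Returns the number of bytes read (less than `count` only at end of
 * file), or -1 on error. */
ssize_t read_full(int fd, void *buf, size_t count)
{
    size_t done = 0;

    while (done < count) {
        ssize_t n = read(fd, (char *)buf + done, count - done);
        if (n == 0)
            break;              /* end of file */
        if (n == -1) {
            if (errno == EINTR)
                continue;       /* interrupted before any data: retry */
            return -1;
        }
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

       A return value of zero with a nonzero count means the descriptor
       was already at end of file.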
http://man7.org/linux/man-pages/man2/readv.2.html
12
SYSTEM CALL:
readv(2) - Linux manual page
FUNCTIONALITY:

       readv,  writev,  preadv,  pwritev  - read or write data into multiple
       buffers
SYNOPSIS:

       #include <sys/uio.h>

       ssize_t readv(int fd, const struct iovec *iov, int iovcnt);

       ssize_t writev(int fd, const struct iovec *iov, int iovcnt);

       ssize_t preadv(int fd, const struct iovec *iov, int iovcnt,
                      off_t offset);

       ssize_t pwritev(int fd, const struct iovec *iov, int iovcnt,
                       off_t offset);

       ssize_t preadv2(int fd, const struct iovec *iov, int iovcnt,
                       off_t offset, int flags);

       ssize_t pwritev2(int fd, const struct iovec *iov, int iovcnt,
                        off_t offset, int flags);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       preadv(), pwritev():
           Since glibc 2.19:
               _DEFAULT_SOURCE
           Glibc 2.19 and earlier:
               _BSD_SOURCE
DESCRIPTION

       The readv() system call reads iovcnt buffers from the file associated
       with the file descriptor fd into the buffers described by iov
       ("scatter input").

       The writev() system call writes iovcnt buffers of data described by
       iov to the file associated with the file descriptor fd ("gather
       output").

       The pointer iov points to an array of iovec structures, defined in
       <sys/uio.h> as:

           struct iovec {
               void  *iov_base;    /* Starting address */
               size_t iov_len;     /* Number of bytes to transfer */
           };

       The readv() system call works just like read(2) except that multiple
       buffers are filled.

       The writev() system call works just like write(2) except that
       multiple buffers are written out.

       Buffers are processed in array order.  This means that readv()
       completely fills iov[0] before proceeding to iov[1], and so on.  (If
       there is insufficient data, then not all buffers pointed to by iov
       may be filled.)  Similarly, writev() writes out the entire contents
       of iov[0] before proceeding to iov[1], and so on.

       The data transfers performed by readv() and writev() are atomic: the
       data written by writev() is written as a single block that is not
       intermingled with output from writes in other processes (but see
       pipe(7) for an exception); analogously, readv() is guaranteed to read
       a contiguous block of data from the file, regardless of read
       operations performed in other threads or processes that have file
       descriptors referring to the same open file description (see
       open(2)).

   preadv() and pwritev()
       The preadv() system call combines the functionality of readv() and
       pread(2).  It performs the same task as readv(), but adds a fourth
       argument, offset, which specifies the file offset at which the input
       operation is to be performed.

       The pwritev() system call combines the functionality of writev() and
       pwrite(2).  It performs the same task as writev(), but adds a fourth
       argument, offset, which specifies the file offset at which the output
       operation is to be performed.

       The file offset is not changed by these system calls.  The file
       referred to by fd must be capable of seeking.

   preadv2() and pwritev2()
       These system calls are similar to preadv() and pwritev() calls, but
       add a fifth argument, flags, which modifies the behavior on a per-
       call basis.

       Unlike preadv() and pwritev(), if the offset argument is -1, then the
       current file offset is used and updated.

       The flags argument contains a bitwise OR of zero or more of the
       following flags:

       RWF_HIPRI (since Linux 4.6)
              High priority read/write.  Allows block-based filesystems to
              use polling of the device, which provides lower latency, but
              may use additional resources.  (Currently, this feature is
              usable only on a file descriptor opened using the O_DIRECT
              flag.)
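
       The "scatter input" behavior of readv() described above can be
       sketched as follows.  The helper name read_header_and_body and the
       header/body split are assumptions chosen for illustration.

```c
#include <sys/uio.h>
#include <unistd.h>

/* Read one contiguous run of bytes into two separate buffers: the first
 * `header_len` bytes land in `header`, the rest (up to `body_len`) in
 * `body`.  Returns the total number of bytes read, or -1 on error. */
ssize_t read_header_and_body(int fd, void *header, size_t header_len,
                             void *body, size_t body_len)
{
    struct iovec iov[2] = {
        { .iov_base = header, .iov_len = header_len },
        { .iov_base = body,   .iov_len = body_len   },
    };
    return readv(fd, iov, 2);
}
```

       Since iov[0] is filled completely before iov[1] is touched, a
       return value smaller than header_len means the header itself is
       incomplete.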
http://man7.org/linux/man-pages/man2/pread.2.html
12
SYSTEM CALL:
pread(2) - Linux manual page
FUNCTIONALITY:

       pread,  pwrite  -  read from or write to a file descriptor at a given
       offset
SYNOPSIS:

       #include <unistd.h>

       ssize_t pread(int fd, void *buf, size_t count, off_t offset);

       ssize_t pwrite(int fd, const void *buf, size_t count, off_t offset);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       pread(), pwrite():
           _XOPEN_SOURCE >= 500
           || /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
DESCRIPTION

       pread() reads up to count bytes from file descriptor fd at offset
       offset (from the start of the file) into the buffer starting at buf.
       The file offset is not changed.

       pwrite() writes up to count bytes from the buffer starting at buf to
       the file descriptor fd at offset offset.  The file offset is not
       changed.

       The file referenced by fd must be capable of seeking.
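
       A common use of pread() is random access to fixed-size records
       without disturbing the file offset.  The sketch below assumes such
       a record layout; the name read_record is hypothetical.

```c
#include <sys/types.h>
#include <unistd.h>

/* Read record number `index` (zero-based) from a file made of records
 * that are each `rec_size` bytes long.  Returns the number of bytes
 * read (0 past end of file), or -1 on error.  The descriptor's own
 * file offset is left where it was. */
ssize_t read_record(int fd, size_t index, void *rec, size_t rec_size)
{
    off_t where = (off_t)index * (off_t)rec_size;
    return pread(fd, rec, rec_size, where);
}
```

       Calling lseek(fd, 0, SEEK_CUR) before and after read_record()
       returns the same value, which is what allows several threads to
       share one descriptor for reads at different positions.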
http://man7.org/linux/man-pages/man2/preadv.2.html
12
SYSTEM CALL:
readv(2) - Linux manual page
FUNCTIONALITY:

       readv,  writev,  preadv,  pwritev  - read or write data into multiple
       buffers
SYNOPSIS:

       #include <sys/uio.h>

       ssize_t readv(int fd, const struct iovec *iov, int iovcnt);

       ssize_t writev(int fd, const struct iovec *iov, int iovcnt);

       ssize_t preadv(int fd, const struct iovec *iov, int iovcnt,
                      off_t offset);

       ssize_t pwritev(int fd, const struct iovec *iov, int iovcnt,
                       off_t offset);

       ssize_t preadv2(int fd, const struct iovec *iov, int iovcnt,
                       off_t offset, int flags);

       ssize_t pwritev2(int fd, const struct iovec *iov, int iovcnt,
                        off_t offset, int flags);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       preadv(), pwritev():
           Since glibc 2.19:
               _DEFAULT_SOURCE
           Glibc 2.19 and earlier:
               _BSD_SOURCE
DESCRIPTION

       The readv() system call reads iovcnt buffers from the file associated
       with the file descriptor fd into the buffers described by iov
       ("scatter input").

       The writev() system call writes iovcnt buffers of data described by
       iov to the file associated with the file descriptor fd ("gather
       output").

       The pointer iov points to an array of iovec structures, defined in
       <sys/uio.h> as:

           struct iovec {
               void  *iov_base;    /* Starting address */
               size_t iov_len;     /* Number of bytes to transfer */
           };

       The readv() system call works just like read(2) except that multiple
       buffers are filled.

       The writev() system call works just like write(2) except that
       multiple buffers are written out.

       Buffers are processed in array order.  This means that readv()
       completely fills iov[0] before proceeding to iov[1], and so on.  (If
       there is insufficient data, then not all buffers pointed to by iov
       may be filled.)  Similarly, writev() writes out the entire contents
       of iov[0] before proceeding to iov[1], and so on.

       The data transfers performed by readv() and writev() are atomic: the
       data written by writev() is written as a single block that is not
       intermingled with output from writes in other processes (but see
       pipe(7) for an exception); analogously, readv() is guaranteed to read
       a contiguous block of data from the file, regardless of read
       operations performed in other threads or processes that have file
       descriptors referring to the same open file description (see
       open(2)).

   preadv() and pwritev()
       The preadv() system call combines the functionality of readv() and
       pread(2).  It performs the same task as readv(), but adds a fourth
       argument, offset, which specifies the file offset at which the input
       operation is to be performed.

       The pwritev() system call combines the functionality of writev() and
       pwrite(2).  It performs the same task as writev(), but adds a fourth
       argument, offset, which specifies the file offset at which the output
       operation is to be performed.

       The file offset is not changed by these system calls.  The file
       referred to by fd must be capable of seeking.

   preadv2() and pwritev2()
       These system calls are similar to preadv() and pwritev() calls, but
       add a fifth argument, flags, which modifies the behavior on a per-
       call basis.

       Unlike preadv() and pwritev(), if the offset argument is -1, then the
       current file offset is used and updated.

       The flags argument contains a bitwise OR of zero or more of the
       following flags:

       RWF_HIPRI (since Linux 4.6)
              High priority read/write.  Allows block-based filesystems to
              use polling of the device, which provides lower latency, but
              may use additional resources.  (Currently, this feature is
              usable only on a file descriptor opened using the O_DIRECT
              flag.)
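
       The way preadv() combines scatter input with an explicit offset
       can be sketched as below.  The helper name read_pair_at is an
       assumption, and _DEFAULT_SOURCE is defined as listed in the
       feature test macro requirements above.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE         /* preadv(); see feature_test_macros(7) */
#endif
#include <sys/uio.h>
#include <unistd.h>

/* Starting at `offset`, read `first_len` bytes into `first` and then up
 * to `second_len` bytes into `second`, without moving the file offset.
 * Returns the total number of bytes read, or -1 on error. */
ssize_t read_pair_at(int fd, off_t offset,
                     void *first, size_t first_len,
                     void *second, size_t second_len)
{
    struct iovec iov[2] = {
        { .iov_base = first,  .iov_len = first_len  },
        { .iov_base = second, .iov_len = second_len },
    };
    return preadv(fd, iov, 2, offset);
}
```

       The descriptor must refer to a seekable file; on a pipe or socket
       the call fails.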
http://man7.org/linux/man-pages/man2/write.2.html
11
SYSTEM CALL:
write(2) - Linux manual page
FUNCTIONALITY:

       write - write to a file descriptor
SYNOPSIS:

       #include <unistd.h>

       ssize_t write(int fd, const void *buf, size_t count);
DESCRIPTION

       write() writes up to count bytes from the buffer pointed to by buf to the
       file referred to by the file descriptor fd.

       The number of bytes written may be less than count if, for example,
       there is insufficient space on the underlying physical medium, or the
       RLIMIT_FSIZE resource limit is encountered (see setrlimit(2)), or the
       call was interrupted by a signal handler after having written less
       than count bytes.  (See also pipe(7).)

       For a seekable file (i.e., one to which lseek(2) may be applied, for
       example, a regular file) writing takes place at the file offset, and
       the file offset is incremented by the number of bytes actually
       written.  If the file was open(2)ed with O_APPEND, the file offset is
       first set to the end of the file before writing.  The adjustment of
       the file offset and the write operation are performed as an atomic
       step.

       POSIX requires that a read(2) that can be proved to occur after a
       write() has returned will return the new data.  Note that not all
       filesystems are POSIX conforming.
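
       Since write() may transfer fewer than count bytes, code that must
       write a whole buffer typically loops.  A minimal sketch, with the
       hypothetical name write_all and the common convention of retrying
       on EINTR:

```c
#include <errno.h>
#include <unistd.h>

/* Write exactly `count` bytes from `buf`, continuing after partial
 * writes and EINTR.  Returns 0 on success, -1 on error (some bytes may
 * already have been written in that case). */
int write_all(int fd, const void *buf, size_t count)
{
    const char *p = buf;

    while (count > 0) {
        ssize_t n = write(fd, p, count);
        if (n == -1) {
            if (errno == EINTR)
                continue;       /* interrupted before any data: retry */
            return -1;
        }
        p += n;                 /* skip what was written */
        count -= (size_t)n;
    }
    return 0;
}
```

       On a seekable file each successful write() advances the file
       offset, so the loop continues where the previous call stopped.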
http://man7.org/linux/man-pages/man2/writev.2.html
12
SYSTEM CALL:
readv(2) - Linux manual page
FUNCTIONALITY:

       readv,  writev,  preadv,  pwritev  - read or write data into multiple
       buffers
SYNOPSIS:

       #include <sys/uio.h>

       ssize_t readv(int fd, const struct iovec *iov, int iovcnt);

       ssize_t writev(int fd, const struct iovec *iov, int iovcnt);

       ssize_t preadv(int fd, const struct iovec *iov, int iovcnt,
                      off_t offset);

       ssize_t pwritev(int fd, const struct iovec *iov, int iovcnt,
                       off_t offset);

       ssize_t preadv2(int fd, const struct iovec *iov, int iovcnt,
                       off_t offset, int flags);

       ssize_t pwritev2(int fd, const struct iovec *iov, int iovcnt,
                        off_t offset, int flags);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       preadv(), pwritev():
           Since glibc 2.19:
               _DEFAULT_SOURCE
           Glibc 2.19 and earlier:
               _BSD_SOURCE
DESCRIPTION

       The readv() system call reads iovcnt buffers from the file associated
       with the file descriptor fd into the buffers described by iov
       ("scatter input").

       The writev() system call writes iovcnt buffers of data described by
       iov to the file associated with the file descriptor fd ("gather
       output").

       The pointer iov points to an array of iovec structures, defined in
       <sys/uio.h> as:

           struct iovec {
               void  *iov_base;    /* Starting address */
               size_t iov_len;     /* Number of bytes to transfer */
           };

       The readv() system call works just like read(2) except that multiple
       buffers are filled.

       The writev() system call works just like write(2) except that
       multiple buffers are written out.

       Buffers are processed in array order.  This means that readv()
       completely fills iov[0] before proceeding to iov[1], and so on.  (If
       there is insufficient data, then not all buffers pointed to by iov
       may be filled.)  Similarly, writev() writes out the entire contents
       of iov[0] before proceeding to iov[1], and so on.

       The data transfers performed by readv() and writev() are atomic: the
       data written by writev() is written as a single block that is not
       intermingled with output from writes in other processes (but see
       pipe(7) for an exception); analogously, readv() is guaranteed to read
       a contiguous block of data from the file, regardless of read
       operations performed in other threads or processes that have file
       descriptors referring to the same open file description (see
       open(2)).

   preadv() and pwritev()
       The preadv() system call combines the functionality of readv() and
       pread(2).  It performs the same task as readv(), but adds a fourth
       argument, offset, which specifies the file offset at which the input
       operation is to be performed.

       The pwritev() system call combines the functionality of writev() and
       pwrite(2).  It performs the same task as writev(), but adds a fourth
       argument, offset, which specifies the file offset at which the output
       operation is to be performed.

       The file offset is not changed by these system calls.  The file
       referred to by fd must be capable of seeking.

   preadv2() and pwritev2()
       These system calls are similar to preadv() and pwritev() calls, but
       add a fifth argument, flags, which modifies the behavior on a per-
       call basis.

       Unlike preadv() and pwritev(), if the offset argument is -1, then the
       current file offset is used and updated.

       The flags argument contains a bitwise OR of zero or more of the
       following flags:

       RWF_HIPRI (since Linux 4.6)
              High priority read/write.  Allows block-based filesystems to
              use polling of the device, which provides lower latency, but
              may use additional resources.  (Currently, this feature is
              usable only on a file descriptor opened using the O_DIRECT
              flag.)
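
       The "gather output" behavior of writev() can be sketched with a
       helper that emits a prefix, a message, and a newline in one call.
       The name write_line is hypothetical, and the sketch does not retry
       after a short write.

```c
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Write prefix, message and a trailing newline with a single writev()
 * call, so the three pieces are not interleaved with output from other
 * writers.  Returns the number of bytes written, or -1 on error. */
ssize_t write_line(int fd, const char *prefix, const char *message)
{
    struct iovec iov[3] = {
        { .iov_base = (void *)prefix,  .iov_len = strlen(prefix)  },
        { .iov_base = (void *)message, .iov_len = strlen(message) },
        { .iov_base = (void *)"\n",    .iov_len = 1               },
    };
    return writev(fd, iov, 3);
}
```

       This avoids copying the pieces into a temporary buffer first,
       which is the usual motivation for writev() over repeated write(2)
       calls.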
http://man7.org/linux/man-pages/man2/pwrite.2.html
12
SYSTEM CALL:
pread(2) - Linux manual page
FUNCTIONALITY:

       pread,  pwrite  -  read from or write to a file descriptor at a given
       offset
SYNOPSIS:

       #include <unistd.h>

       ssize_t pread(int fd, void *buf, size_t count, off_t offset);

       ssize_t pwrite(int fd, const void *buf, size_t count, off_t offset);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       pread(), pwrite():
           _XOPEN_SOURCE >= 500
           || /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
DESCRIPTION

       pread() reads up to count bytes from file descriptor fd at offset
       offset (from the start of the file) into the buffer starting at buf.
       The file offset is not changed.

       pwrite() writes up to count bytes from the buffer starting at buf to
       the file descriptor fd at offset offset.  The file offset is not
       changed.

       The file referenced by fd must be capable of seeking.
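
       Because pwrite() may write fewer than count bytes and does not
       move the file offset, a loop around it has to advance the offset
       itself.  A sketch under those assumptions, with the hypothetical
       name pwrite_all:

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Write exactly `count` bytes from `buf` at `offset`, continuing after
 * partial writes and EINTR.  Returns 0 on success, -1 on error.
 * The descriptor's own file offset is never changed. */
int pwrite_all(int fd, const void *buf, size_t count, off_t offset)
{
    const char *p = buf;

    while (count > 0) {
        ssize_t n = pwrite(fd, p, count, offset);
        if (n == -1) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        offset += n;            /* advance manually: pwrite() will not */
        count -= (size_t)n;
    }
    return 0;
}
```

       This differs from a loop around write(2), where the kernel
       advances the file offset after each call.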
http://man7.org/linux/man-pages/man2/pwritev.2.html
12
SYSTEM CALL:
readv(2) - Linux manual page
FUNCTIONALITY:

       readv,  writev,  preadv,  pwritev  - read or write data into multiple
       buffers
SYNOPSIS:

       #include <sys/uio.h>

       ssize_t readv(int fd, const struct iovec *iov, int iovcnt);

       ssize_t writev(int fd, const struct iovec *iov, int iovcnt);

       ssize_t preadv(int fd, const struct iovec *iov, int iovcnt,
                      off_t offset);

       ssize_t pwritev(int fd, const struct iovec *iov, int iovcnt,
                       off_t offset);

       ssize_t preadv2(int fd, const struct iovec *iov, int iovcnt,
                       off_t offset, int flags);

       ssize_t pwritev2(int fd, const struct iovec *iov, int iovcnt,
                        off_t offset, int flags);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       preadv(), pwritev():
           Since glibc 2.19:
               _DEFAULT_SOURCE
           Glibc 2.19 and earlier:
               _BSD_SOURCE
DESCRIPTION

       The readv() system call reads iovcnt buffers from the file associated
       with the file descriptor fd into the buffers described by iov
       ("scatter input").

       The writev() system call writes iovcnt buffers of data described by
       iov to the file associated with the file descriptor fd ("gather
       output").

       The pointer iov points to an array of iovec structures, defined in
       <sys/uio.h> as:

           struct iovec {
               void  *iov_base;    /* Starting address */
               size_t iov_len;     /* Number of bytes to transfer */
           };

       The readv() system call works just like read(2) except that multiple
       buffers are filled.

       The writev() system call works just like write(2) except that
       multiple buffers are written out.

       Buffers are processed in array order.  This means that readv()
       completely fills iov[0] before proceeding to iov[1], and so on.  (If
       there is insufficient data, then not all buffers pointed to by iov
       may be filled.)  Similarly, writev() writes out the entire contents
       of iov[0] before proceeding to iov[1], and so on.

       The data transfers performed by readv() and writev() are atomic: the
       data written by writev() is written as a single block that is not
       intermingled with output from writes in other processes (but see
       pipe(7) for an exception); analogously, readv() is guaranteed to read
       a contiguous block of data from the file, regardless of read
       operations performed in other threads or processes that have file
       descriptors referring to the same open file description (see
       open(2)).

   preadv() and pwritev()
       The preadv() system call combines the functionality of readv() and
       pread(2).  It performs the same task as readv(), but adds a fourth
       argument, offset, which specifies the file offset at which the input
       operation is to be performed.

       The pwritev() system call combines the functionality of writev() and
       pwrite(2).  It performs the same task as writev(), but adds a fourth
       argument, offset, which specifies the file offset at which the output
       operation is to be performed.

       The file offset is not changed by these system calls.  The file
       referred to by fd must be capable of seeking.

   preadv2() and pwritev2()
       These system calls are similar to preadv() and pwritev() calls, but
       add a fifth argument, flags, which modifies the behavior on a per-
       call basis.

       Unlike preadv() and pwritev(), if the offset argument is -1, then the
       current file offset is used and updated.

       The flags argument contains a bitwise OR of zero or more of the
       following flags:

       RWF_HIPRI (since Linux 4.6)
              High priority read/write.  Allows block-based filesystems to
              use polling of the device, which provides lower latency, but
              may use additional resources.  (Currently, this feature is
              usable only on a file descriptor opened using the O_DIRECT
              flag.)
http://man7.org/linux/man-pages/man2/lseek.2.html
10
SYSTEM CALL:
lseek(2) - Linux manual page
FUNCTIONALITY:

       lseek - reposition read/write file offset
SYNOPSIS:

       #include <sys/types.h>
       #include <unistd.h>

       off_t lseek(int fd, off_t offset, int whence);
DESCRIPTION

       The lseek() function repositions the file offset of the open file
       description associated with the file descriptor fd to the argument
       offset according to the directive whence as follows:

       SEEK_SET
              The file offset is set to offset bytes.

       SEEK_CUR
              The file offset is set to its current location plus offset
              bytes.

       SEEK_END
              The file offset is set to the size of the file plus offset
              bytes.

       The lseek() function allows the file offset to be set beyond the end
       of the file (but this does not change the size of the file).  If data
       is later written at this point, subsequent reads of the data in the
       gap (a "hole") return null bytes ('\0') until data is actually
       written into the gap.

   Seeking file data and holes
       Since version 3.1, Linux supports the following additional values for
       whence:

       SEEK_DATA
              Adjust the file offset to the next location in the file
              greater than or equal to offset containing data.  If offset
              points to data, then the file offset is set to offset.

       SEEK_HOLE
              Adjust the file offset to the next hole in the file greater
              than or equal to offset.  If offset points into the middle of
              a hole, then the file offset is set to offset.  If there is no
              hole past offset, then the file offset is adjusted to the end
              of the file (i.e., there is an implicit hole at the end of any
              file).

       In both of the above cases, lseek() fails if offset points past the
       end of the file.

       These operations allow applications to map holes in a sparsely
       allocated file.  This can be useful for applications such as file
       backup tools, which can save space when creating backups and preserve
       holes, if they have a mechanism for discovering holes.

       For the purposes of these operations, a hole is a sequence of zeros
       that (normally) has not been allocated in the underlying file
       storage.  However, a filesystem is not obliged to report holes, so
       these operations are not a guaranteed mechanism for mapping the
       storage space actually allocated to a file.  (Furthermore, a sequence
       of zeros that actually has been written to the underlying storage may
       not be reported as a hole.)  In the simplest implementation, a
       filesystem can support the operations by making SEEK_HOLE always
       return the offset of the end of the file, and making SEEK_DATA always
       return offset (i.e., even if the location referred to by offset is a
       hole, it can be considered to consist of data that is a sequence of
       zeros).

       The _GNU_SOURCE feature test macro must be defined in order to obtain
       the definitions of SEEK_DATA and SEEK_HOLE from <unistd.h>.
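
       The sketch below probes a sparse scratch file with both
       operations.  Because a filesystem is free to report no holes at
       all, the results differ between filesystems: one that tracks holes
       would typically report data starting near the 1 MiB mark and a
       hole at offset 0, while the simplest implementation reports data
       at 0 and a hole at end of file.  The function name and scratch
       path are assumptions.

```c
#define _GNU_SOURCE             /* for SEEK_DATA and SEEK_HOLE */
#include <stdlib.h>
#include <unistd.h>

/* Create a scratch file holding a 1 MiB gap followed by one byte of data.
 * Stores the first data offset and the first hole offset at or after 0.
 * Returns the file size, or -1 on error. */
off_t probe_sparse(off_t *data, off_t *hole)
{
    char path[] = "/tmp/seekhole-demo-XXXXXX";  /* hypothetical scratch path */
    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);

    off_t size = -1;
    if (pwrite(fd, "x", 1, 1 << 20) == 1) {
        *data = lseek(fd, 0, SEEK_DATA);        /* next offset holding data */
        *hole = lseek(fd, 0, SEEK_HOLE);        /* next hole, or end of file */
        size  = lseek(fd, 0, SEEK_END);
    }
    close(fd);
    return size;
}
```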

       The SEEK_HOLE and SEEK_DATA operations are supported for the
       following filesystems:

       *  Btrfs (since Linux 3.1)

       *  OCFS (since Linux 3.2)

       *  XFS (since Linux 3.5)

       *  ext4 (since Linux 3.8)

       *  tmpfs (since Linux 3.8)

       *  NFS (since Linux 3.18)

       *  FUSE (since Linux 4.5)
http://man7.org/linux/man-pages/man2/sendfile.2.html
11
SYSTEM CALL:
sendfile(2) - Linux manual page
FUNCTIONALITY:

       sendfile - transfer data between file descriptors
SYNOPSIS:

       #include <sys/sendfile.h>

       ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);
DESCRIPTION

       sendfile() copies data between one file descriptor and another.
       Because this copying is done within the kernel, sendfile() is more
       efficient than the combination of read(2) and write(2), which would
       require transferring data to and from user space.

       in_fd should be a file descriptor opened for reading and out_fd
       should be a descriptor opened for writing.

       If offset is not NULL, then it points to a variable holding the file
       offset from which sendfile() will start reading data from in_fd.
       When sendfile() returns, this variable will be set to the offset of
       the byte following the last byte that was read.  If offset is not
       NULL, then sendfile() does not modify the file offset of in_fd;
       otherwise the file offset is adjusted to reflect the number of bytes
       read from in_fd.

       If offset is NULL, then data will be read from in_fd starting at the
       file offset, and the file offset will be updated by the call.

       count is the number of bytes to copy between the file descriptors.

       The in_fd argument must correspond to a file which supports
       mmap(2)-like operations (i.e., it cannot be a socket).

       In Linux kernels before 2.6.33, out_fd must refer to a socket.  Since
       Linux 2.6.33 it can be any file.  If it is a regular file, then
       sendfile() changes the file offset appropriately.
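
       A minimal sketch of a file-to-file copy with a non-NULL offset
       follows.  It assumes Linux 2.6.33 or later, since out_fd is a
       regular file.  The names copy_range() and sendfile_demo() and the
       scratch paths are illustrative.  The loop is there because
       sendfile() may transfer fewer bytes than requested.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/sendfile.h>
#include <unistd.h>

/* Copy up to count bytes from in_fd, starting at byte start, to out_fd.
 * Returns the number of bytes copied, or -1 on error. */
ssize_t copy_range(int out_fd, int in_fd, off_t start, size_t count)
{
    off_t off = start;      /* advanced by sendfile(); in_fd's offset is not */
    size_t left = count;

    while (left > 0) {
        ssize_t n = sendfile(out_fd, in_fd, &off, left);
        if (n == -1)
            return -1;
        if (n == 0)         /* end of input reached */
            break;
        left -= (size_t)n;
    }
    return (ssize_t)(count - left);
}

/* Copy "world" out of "hello world" between two scratch files.
 * Returns 0 if the copy and the untouched in_fd offset are as expected. */
int sendfile_demo(void)
{
    char a[] = "/tmp/sf-in-XXXXXX", b[] = "/tmp/sf-out-XXXXXX";
    char buf[5];
    int ok = 0;
    int in = mkstemp(a);
    int out = mkstemp(b);

    if (in != -1 && out != -1 && write(in, "hello world", 11) == 11)
        ok = copy_range(out, in, 6, 5) == 5
          && lseek(in, 0, SEEK_CUR) == 11       /* still where write() left it */
          && pread(out, buf, 5, 0) == 5
          && memcmp(buf, "world", 5) == 0;

    if (in != -1)  { close(in);  unlink(a); }
    if (out != -1) { close(out); unlink(b); }
    return ok ? 0 : -1;
}
```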
http://man7.org/linux/man-pages/man2/fdatasync.2.html
11
SYSTEM CALL:
fsync(2) - Linux manual page
FUNCTIONALITY:

       fsync, fdatasync - synchronize a file's in-core state with storage
       device
SYNOPSIS:

       #include <unistd.h>

       int fsync(int fd);

       int fdatasync(int fd);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       fsync():
           Glibc 2.16 and later:
               No feature test macros need be defined
           Glibc up to and including 2.15:
               _BSD_SOURCE || _XOPEN_SOURCE
                   || /* since glibc 2.8: */ _POSIX_C_SOURCE >= 200112L
       fdatasync():
           _POSIX_C_SOURCE >= 199309L || _XOPEN_SOURCE >= 500
DESCRIPTION

       fsync() transfers ("flushes") all modified in-core data of (i.e.,
       modified buffer cache pages for) the file referred to by the file
       descriptor fd to the disk device (or other permanent storage device)
       so that all changed information can be retrieved even if the
       system crashes or is rebooted.  This includes writing through or
       flushing a disk cache if present.  The call blocks until the device
       reports that the transfer has completed.  It also flushes metadata
       information associated with the file (see stat(2)).

       Calling fsync() does not necessarily ensure that the entry in the
       directory containing the file has also reached disk.  For that an
       explicit fsync() on a file descriptor for the directory is also
       needed.

       fdatasync() is similar to fsync(), but does not flush modified
       metadata unless that metadata is needed in order to allow a
       subsequent data retrieval to be correctly handled.  For example,
       changes to st_atime or st_mtime (respectively, time of last access
       and time of last modification; see stat(2)) do not require flushing
       because they are not necessary for a subsequent data read to be
       handled correctly.  On the other hand, a change to the file size
       (st_size, as made by, say, ftruncate(2)) would require a metadata
       flush.

       The aim of fdatasync() is to reduce disk activity for applications
       that do not require all metadata to be synchronized with the disk.
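
       As a hedged sketch of where fdatasync() fits, the helper below
       writes a buffer to a file and flushes it before closing.  The
       function name write_file_durably() and the 0600 mode are
       assumptions.  The directory entry is not synchronized here.

```c
#include <fcntl.h>
#include <unistd.h>

/* Create or truncate path, write len bytes from buf, and flush them with
 * fdatasync().  Returns 0 on success, -1 on error. */
int write_file_durably(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return -1;

    while (len > 0) {                   /* write() may be partial */
        ssize_t n = write(fd, p, len);
        if (n == -1) {
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* The file grew, so the new size is metadata that fdatasync() must
     * flush; timestamp updates alone would not have to be flushed. */
    int rc = fdatasync(fd);
    if (close(fd) == -1)
        rc = -1;
    return rc;
}
```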
http://man7.org/linux/man-pages/man2/fsync.2.html
11
SYSTEM CALL:
fsync(2) - Linux manual page
FUNCTIONALITY:

       fsync, fdatasync - synchronize a file's in-core state with storage
       device
SYNOPSIS:

       #include <unistd.h>

       int fsync(int fd);

       int fdatasync(int fd);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       fsync():
           Glibc 2.16 and later:
               No feature test macros need be defined
           Glibc up to and including 2.15:
               _BSD_SOURCE || _XOPEN_SOURCE
                   || /* since glibc 2.8: */ _POSIX_C_SOURCE >= 200112L
       fdatasync():
           _POSIX_C_SOURCE >= 199309L || _XOPEN_SOURCE >= 500
DESCRIPTION

       fsync() transfers ("flushes") all modified in-core data of (i.e.,
       modified buffer cache pages for) the file referred to by the file
       descriptor fd to the disk device (or other permanent storage device)
       so that all changed information can be retrieved even if the
       system crashes or is rebooted.  This includes writing through or
       flushing a disk cache if present.  The call blocks until the device
       reports that the transfer has completed.  It also flushes metadata
       information associated with the file (see stat(2)).

       Calling fsync() does not necessarily ensure that the entry in the
       directory containing the file has also reached disk.  For that an
       explicit fsync() on a file descriptor for the directory is also
       needed.

       fdatasync() is similar to fsync(), but does not flush modified
       metadata unless that metadata is needed in order to allow a
       subsequent data retrieval to be correctly handled.  For example,
       changes to st_atime or st_mtime (respectively, time of last access
       and time of last modification; see stat(2)) do not require flushing
       because they are not necessary for a subsequent data read to be
       handled correctly.  On the other hand, a change to the file size
       (st_size, as made by, say, ftruncate(2)) would require a metadata
       flush.

       The aim of fdatasync() is to reduce disk activity for applications
       that do not require all metadata to be synchronized with the disk.
http://man7.org/linux/man-pages/man2/msync.2.html
11
SYSTEM CALL:
msync(2) - Linux manual page
FUNCTIONALITY:

       msync - synchronize a file with a memory map
SYNOPSIS:

       #include <sys/mman.h>

       int msync(void *addr, size_t length, int flags);
DESCRIPTION

       msync() flushes changes made to the in-core copy of a file that was
       mapped into memory using mmap(2) back to the filesystem.  Without use
       of this call, there is no guarantee that changes are written back
       before munmap(2) is called.  To be more precise, the part of the file
       that corresponds to the memory area starting at addr and having
       length length is updated.

       The flags argument should specify exactly one of MS_ASYNC and
       MS_SYNC, and may additionally include the MS_INVALIDATE bit.  These
       bits have the following meanings:

       MS_ASYNC
              Specifies that an update be scheduled, but the call returns
              immediately.

       MS_SYNC
              Requests an update and waits for it to complete.

       MS_INVALIDATE
              Asks to invalidate other mappings of the same file (so that
              they can be updated with the fresh values just written).
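
       A minimal sketch of the MS_SYNC case: map one page of a scratch
       file, modify it through the mapping, and call msync() before
       unmapping.  The function name and scratch path are assumptions.
       The read-back at the end only confirms the contents; it does not
       by itself demonstrate durability.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Store msg in a one-page shared file mapping, flush it with
 * msync(MS_SYNC), and compare the file contents read through the
 * descriptor.  Returns 0 on a match, -1 otherwise. */
int msync_demo(const char *msg)
{
    char path[] = "/tmp/msync-demo-XXXXXX";     /* hypothetical scratch path */
    char back[64];
    size_t len = strlen(msg);
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    int rc = -1;

    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);

    if (len <= sizeof back && ftruncate(fd, (off_t)page) == 0) {
        char *map = mmap(NULL, page, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (map != MAP_FAILED) {
            memcpy(map, msg, len);                  /* dirty the mapped page */
            if (msync(map, page, MS_SYNC) == 0 &&   /* wait for the update */
                pread(fd, back, len, 0) == (ssize_t)len &&
                memcmp(back, msg, len) == 0)
                rc = 0;
            munmap(map, page);
        }
    }
    close(fd);
    return rc;
}
```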
http://man7.org/linux/man-pages/man2/sync_file_range.2.html
11
SYSTEM CALL:
sync_file_range(2) - Linux manual page
FUNCTIONALITY:

       sync_file_range - sync a file segment with disk
SYNOPSIS:

       #define _GNU_SOURCE         /* See feature_test_macros(7) */
       #include <fcntl.h>

       int sync_file_range(int fd, off64_t offset, off64_t nbytes,
                           unsigned int flags);
DESCRIPTION

       sync_file_range() permits fine control when synchronizing the open
       file referred to by the file descriptor fd with disk.

       offset is the starting byte of the file range to be synchronized.
       nbytes specifies the length of the range to be synchronized, in
       bytes; if nbytes is zero, then all bytes from offset through to the
       end of file are synchronized.  Synchronization is in units of the
       system page size: offset is rounded down to a page boundary;
       (offset+nbytes-1) is rounded up to a page boundary.
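
       The rounding rule can be worked through with plain arithmetic.
       The sketch below computes the first and last byte covered for a
       given request, reading "rounded up to a page boundary" as "the
       last byte of the page containing offset+nbytes-1".  That reading,
       the struct, and the function name are assumptions; page_size must
       be a power of two and nbytes nonzero.

```c
#include <stdint.h>

struct byte_range {
    uint64_t first;     /* first byte covered (inclusive) */
    uint64_t last;      /* last byte covered (inclusive) */
};

/* Widen [offset, offset + nbytes - 1] to whole pages. */
struct byte_range synced_range(uint64_t offset, uint64_t nbytes,
                               uint64_t page_size)
{
    uint64_t mask = page_size - 1;
    struct byte_range r;

    r.first = offset & ~mask;                   /* round the start down */
    r.last  = (offset + nbytes - 1) | mask;     /* end of the final page */
    return r;
}
```

       With 4096-byte pages, a request for 100 bytes at offset 5000
       covers bytes 4096 through 8191, that is, the whole second page.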

       The flags bit-mask argument can include any of the following values:

       SYNC_FILE_RANGE_WAIT_BEFORE
              Wait upon write-out of all pages in the specified range that
              have already been submitted to the device driver for write-out
              before performing any write.

       SYNC_FILE_RANGE_WRITE
              Initiate write-out of all dirty pages in the specified range
              which are not presently submitted for write-out.  Note that
              even this may block if you attempt to write more than the
              request queue size.

       SYNC_FILE_RANGE_WAIT_AFTER
              Wait upon write-out of all pages in the range after performing
              any write.

       Specifying flags as 0 is permitted, as a no-op.

   Warning
       This system call is extremely dangerous and should not be used in
       portable programs.  None of these operations writes out the file's
       metadata.  Therefore, unless the application is strictly performing
       overwrites of already-instantiated disk blocks, there are no
       guarantees that the data will be available after a crash.  There is
       no user interface to know if a write is purely an overwrite.  On
       filesystems using copy-on-write semantics (e.g., btrfs) an overwrite
       of existing allocated blocks is impossible.  When writing into
       preallocated space, many filesystems also require calls into the
       block allocator, which this system call does not sync out to disk.
       This system call does not flush disk write caches and thus does not
       provide any data integrity on systems with volatile disk write
       caches.

   Some details
       SYNC_FILE_RANGE_WAIT_BEFORE and SYNC_FILE_RANGE_WAIT_AFTER will
       detect any I/O errors or ENOSPC conditions and will return these to
       the caller.

       Useful combinations of the flags bits are:

       SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE
              Ensures that all pages in the specified range which were dirty
              when sync_file_range() was called are placed under write-out.
              This is a start-write-for-data-integrity operation.

       SYNC_FILE_RANGE_WRITE
              Start write-out of all dirty pages in the specified range
              which are not presently under write-out.  This is an
              asynchronous flush-to-disk operation.  This is not suitable
              for data integrity operations.

       SYNC_FILE_RANGE_WAIT_BEFORE (or SYNC_FILE_RANGE_WAIT_AFTER)
              Wait for completion of write-out of all pages in the specified
              range.  This can be used after an earlier
              SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE operation
              to wait for completion of that operation, and obtain its
              result.

       SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE |
       SYNC_FILE_RANGE_WAIT_AFTER
              This is a write-for-data-integrity operation that will ensure
              that all pages in the specified range which were dirty when
              sync_file_range() was called are committed to disk.
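
       A sketch of the last combination follows.  The wrapper name
       flush_range_and_wait() is an assumption.  As the warning above
       explains, this call does not write out metadata or flush disk
       write caches, so the sketch shows the calling convention only and
       is not a substitute for fsync(2).

```c
#define _GNU_SOURCE             /* for sync_file_range() and off64_t */
#include <fcntl.h>
#include <unistd.h>

/* Start write-out of the dirty pages in the given range and wait for it
 * to finish.  Returns 0 on success, -1 on error (errno is set). */
int flush_range_and_wait(int fd, off64_t offset, off64_t nbytes)
{
    return sync_file_range(fd, offset, nbytes,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}
```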
http://man7.org/linux/man-pages/man2/sync.2.html
12
SYSTEM CALL:
sync(2) - Linux manual page
FUNCTIONALITY:

       sync, syncfs - commit filesystem caches to disk
SYNOPSIS:

       #include <unistd.h>

       void sync(void);

       int syncfs(int fd);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       sync():
           _XOPEN_SOURCE >= 500
               || /* Since glibc 2.19: */ _DEFAULT_SOURCE
               || /* Glibc versions <= 2.19: */ _BSD_SOURCE

       syncfs():
           _GNU_SOURCE
DESCRIPTION

       sync() causes all pending modifications to file system metadata and
       cached file data to be written to the underlying filesystems.

       syncfs() is like sync(), but synchronizes just the filesystem
       containing the file referred to by the open file descriptor fd.
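
       A small sketch of syncfs(): open any file or directory on the
       filesystem of interest and pass the descriptor.  The helper name
       sync_fs_of() is an assumption.

```c
#define _GNU_SOURCE             /* for syncfs() */
#include <fcntl.h>
#include <unistd.h>

/* Flush only the filesystem that contains path.
 * Returns 0 on success, -1 on error. */
int sync_fs_of(const char *path)
{
    int fd = open(path, O_RDONLY);      /* a directory works as well */
    if (fd == -1)
        return -1;

    int rc = syncfs(fd);
    close(fd);
    return rc;
}
```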
http://man7.org/linux/man-pages/man2/syncfs.2.html
12
SYSTEM CALL:
sync(2) - Linux manual page
FUNCTIONALITY:

       sync, syncfs - commit filesystem caches to disk
SYNOPSIS:

       #include <unistd.h>

       void sync(void);

       int syncfs(int fd);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       sync():
           _XOPEN_SOURCE >= 500
               || /* Since glibc 2.19: */ _DEFAULT_SOURCE
               || /* Glibc versions <= 2.19: */ _BSD_SOURCE

       syncfs():
           _GNU_SOURCE
DESCRIPTION

       sync() causes all pending modifications to file system metadata and
       cached file data to be written to the underlying filesystems.

       syncfs() is like sync(), but synchronizes just the filesystem
       containing the file referred to by the open file descriptor fd.
http://man7.org/linux/man-pages/man2/io_setup.2.html
11
SYSTEM CALL:
io_setup(2) - Linux manual page
FUNCTIONALITY:

       io_setup - create an asynchronous I/O context
SYNOPSIS:

       #include <linux/aio_abi.h>          /* Defines needed types */

       int io_setup(unsigned nr_events, aio_context_t *ctx_idp);

       Note: There is no glibc wrapper for this system call; see NOTES.
DESCRIPTION

       The io_setup() system call creates an asynchronous I/O context
       suitable for concurrently processing nr_events operations.  The
       ctx_idp argument must not point to an AIO context that already
       exists, and the variable it points to must be initialized to 0
       prior to the call.  On successful creation of the AIO context,
       *ctx_idp is filled in with the resulting handle.
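
       Since glibc provides no wrapper, the sketch below invokes the
       system calls through syscall(2).  It illustrates the
       initialization requirement: a handle variable that is not zero is
       rejected with EINVAL, while a zeroed one is accepted.  The static
       wrappers and the name io_setup_demo() are assumptions made for the
       example.

```c
#include <errno.h>
#include <linux/aio_abi.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin local wrappers around the raw system calls. */
static int io_setup(unsigned nr_events, aio_context_t *ctx_idp)
{
    return (int)syscall(SYS_io_setup, nr_events, ctx_idp);
}

static int io_destroy(aio_context_t ctx_id)
{
    return (int)syscall(SYS_io_destroy, ctx_id);
}

/* Returns 0 if a nonzero handle variable is refused with EINVAL and a
 * zeroed one yields a context that can be destroyed again; -1 otherwise. */
int io_setup_demo(void)
{
    aio_context_t junk = 1;     /* deliberately not initialized to 0 */
    aio_context_t ctx = 0;      /* correctly initialized */

    if (io_setup(8, &junk) != -1 || errno != EINVAL)
        return -1;
    if (io_setup(8, &ctx) == -1)
        return -1;
    return io_destroy(ctx);
}
```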
http://man7.org/linux/man-pages/man2/io_destroy.2.html
11
SYSTEM CALL:
io_destroy(2) - Linux manual page
FUNCTIONALITY:

       io_destroy - destroy an asynchronous I/O context
SYNOPSIS:

       #include <linux/aio_abi.h>          /* Defines needed types */

       int io_destroy(aio_context_t ctx_id);

       Note: There is no glibc wrapper for this system call; see NOTES.
DESCRIPTION

       The io_destroy() system call will attempt to cancel all outstanding
       asynchronous I/O operations against ctx_id, will block on the
       completion of all operations that could not be canceled, and will
       destroy the ctx_id.
http://man7.org/linux/man-pages/man2/io_submit.2.html
11
SYSTEM CALL:
io_submit(2) - Linux manual page
FUNCTIONALITY:

       io_submit - submit asynchronous I/O blocks for processing
SYNOPSIS:

       #include <linux/aio_abi.h>          /* Defines needed types */

       int io_submit(aio_context_t ctx_id, long nr, struct iocb **iocbpp);

       Note: There is no glibc wrapper for this system call; see NOTES.
DESCRIPTION

       The io_submit() system call queues nr I/O request blocks for
       processing in the AIO context ctx_id.  The iocbpp argument should be
       an array of nr pointers to AIO control blocks, which will be
       submitted to context ctx_id.
http://man7.org/linux/man-pages/man2/io_cancel.2.html
11
SYSTEM CALL:
io_cancel(2) - Linux manual page
FUNCTIONALITY:

       io_cancel - cancel an outstanding asynchronous I/O operation
SYNOPSIS:

       #include <linux/aio_abi.h>          /* Defines needed types */

       int io_cancel(aio_context_t ctx_id, struct iocb *iocb,
                     struct io_event *result);

       Note: There is no glibc wrapper for this system call; see NOTES.
DESCRIPTION

       The io_cancel() system call attempts to cancel an asynchronous I/O
       operation previously submitted with io_submit(2).  The iocb argument
       describes the operation to be canceled and the ctx_id argument is the
       AIO context to which the operation was submitted.  If the operation
       is successfully canceled, the event will be copied into the memory
       pointed to by result without being placed into the completion queue.
http://man7.org/linux/man-pages/man2/io_getevents.2.html
12
SYSTEM CALL:
io_getevents(2) - Linux manual page
FUNCTIONALITY:

       io_getevents - read asynchronous I/O events from the completion queue
SYNOPSIS:

       #include <linux/aio_abi.h>         /* Defines needed types */
       #include <linux/time.h>            /* Defines 'struct timespec' */

       int io_getevents(aio_context_t ctx_id, long min_nr, long nr,
                        struct io_event *events, struct timespec *timeout);

       Note: There is no glibc wrapper for this system call; see NOTES.
DESCRIPTION

       The io_getevents() system call attempts to read at least min_nr
       events and up to nr events from the completion queue of the AIO
       context specified by ctx_id.

       The timeout argument specifies the amount of time to wait for events,
       and is specified as a relative timeout in a structure of the
       following form:

           struct timespec {
               time_t tv_sec;      /* seconds */
               long   tv_nsec;     /* nanoseconds [0 .. 999999999] */
           };

       The specified time will be rounded up to the system clock granularity
       and is guaranteed not to expire early.

       Specifying timeout as NULL means block indefinitely until at least
       min_nr events have been obtained.
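
       The sketch below shows one complete cycle: create a context, queue
       a single write, wait for its completion event with a 5-second
       relative timeout, and destroy the context.  It calls the raw
       system calls through syscall(2) and takes struct timespec from
       <time.h>.  The function name, the scratch path, and the payload
       are assumptions.

```c
#include <linux/aio_abi.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Queue one asynchronous write of "aio\n" to a scratch file and wait up
 * to 5 seconds for it.  Returns the byte count reported in the
 * completion event, or -1 on error or timeout. */
long aio_write_demo(void)
{
    char path[] = "/tmp/aio-demo-XXXXXX";       /* hypothetical scratch path */
    char data[] = "aio\n";
    aio_context_t ctx = 0;                      /* must start out as 0 */
    struct iocb cb, *list[1] = { &cb };
    struct io_event ev;
    struct timespec timeout = { .tv_sec = 5, .tv_nsec = 0 };
    long res = -1;

    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);

    if (syscall(SYS_io_setup, 1, &ctx) == 0) {
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = (uint32_t)fd;
        cb.aio_lio_opcode = IOCB_CMD_PWRITE;
        cb.aio_buf = (uint64_t)(uintptr_t)data;
        cb.aio_nbytes = sizeof data - 1;        /* omit the trailing '\0' */
        cb.aio_offset = 0;

        /* Submit one request, then wait for at least 1 and at most 1 event. */
        if (syscall(SYS_io_submit, ctx, 1L, list) == 1 &&
            syscall(SYS_io_getevents, ctx, 1L, 1L, &ev, &timeout) == 1)
            res = (long)ev.res;

        syscall(SYS_io_destroy, ctx);
    }
    close(fd);
    return res;
}
```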
http://man7.org/linux/man-pages/man2/select.2.html
13
SYSTEM CALL:
select(2) - Linux manual page
FUNCTIONALITY:

       select, pselect, FD_CLR, FD_ISSET, FD_SET, FD_ZERO - synchronous I/O
       multiplexing
SYNOPSIS:

       /* According to POSIX.1-2001, POSIX.1-2008 */
       #include <sys/select.h>

       /* According to earlier standards */
       #include <sys/time.h>
       #include <sys/types.h>
       #include <unistd.h>

       int select(int nfds, fd_set *readfds, fd_set *writefds,
                  fd_set *exceptfds, struct timeval *timeout);

       void FD_CLR(int fd, fd_set *set);
       int  FD_ISSET(int fd, fd_set *set);
       void FD_SET(int fd, fd_set *set);
       void FD_ZERO(fd_set *set);

       #include <sys/select.h>

       int pselect(int nfds, fd_set *readfds, fd_set *writefds,
                   fd_set *exceptfds, const struct timespec *timeout,
                   const sigset_t *sigmask);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       pselect(): _POSIX_C_SOURCE >= 200112L
DESCRIPTION

       select() and pselect() allow a program to monitor multiple file
       descriptors, waiting until one or more of the file descriptors become
       "ready" for some class of I/O operation (e.g., input possible).  A
       file descriptor is considered ready if it is possible to perform a
       corresponding I/O operation (e.g., read(2) without blocking, or a
       sufficiently small write(2)).

       select() can monitor only file descriptor numbers that are less than
       FD_SETSIZE; poll(2) does not have this limitation.  See BUGS.

       The operation of select() and pselect() is identical, other than
       these three differences:

       (i)    select() uses a timeout that is a struct timeval (with seconds
              and microseconds), while pselect() uses a struct timespec
              (with seconds and nanoseconds).

       (ii)   select() may update the timeout argument to indicate how much
              time was left.  pselect() does not change this argument.

       (iii)  select() has no sigmask argument, and behaves as pselect()
              called with NULL sigmask.

       Three independent sets of file descriptors are watched.  Those listed
       in readfds will be watched to see if characters become available for
       reading (more precisely, to see if a read will not block; in
       particular, a file descriptor is also ready on end-of-file), those in
       writefds will be watched to see if space is available for write
       (though a large write may still block), and those in exceptfds will
       be watched for exceptions.  On exit, the sets are modified in place
       to indicate which file descriptors actually changed status.  Each of
       the three file descriptor sets may be specified as NULL if no file
       descriptors are to be watched for the corresponding class of events.

       Four macros are provided to manipulate the sets.  FD_ZERO() clears a
       set.  FD_SET() and FD_CLR() respectively add and remove a given file
       descriptor from a set.  FD_ISSET() tests to see if a file descriptor
       is part of the set; this is useful after select() returns.

       nfds is the highest-numbered file descriptor in any of the three
       sets, plus 1.

       The timeout argument specifies the interval that select() should
       block waiting for a file descriptor to become ready.  The call will
       block until either:

       *  a file descriptor becomes ready;

       *  the call is interrupted by a signal handler; or

       *  the timeout expires.

       Note that the timeout interval will be rounded up to the system clock
       granularity, and kernel scheduling delays mean that the blocking
       interval may overrun by a small amount.  If both fields of the
       timeval structure are zero, then select() returns immediately.  (This
       is useful for polling.)  If timeout is NULL (no timeout), select()
       can block indefinitely.
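
       The pieces described so far (the set macros, nfds, and the
       timeout) can be put together as in the following sketch, which
       waits for a single descriptor to become readable.  The helper name
       wait_readable() and its millisecond parameter are assumptions.

```c
#include <sys/select.h>
#include <unistd.h>

/* Wait up to ms milliseconds for fd to become readable.
 * Returns 1 if readable, 0 on timeout, -1 on error. */
int wait_readable(int fd, long ms)
{
    fd_set readfds;
    struct timeval tv;

    if (fd < 0 || fd >= FD_SETSIZE)     /* select() cannot watch this fd */
        return -1;

    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    tv.tv_sec = ms / 1000;
    tv.tv_usec = (ms % 1000) * 1000;

    /* nfds is the highest-numbered descriptor in any set, plus 1. */
    int ready = select(fd + 1, &readfds, NULL, NULL, &tv);
    if (ready == -1)
        return -1;

    /* On return the set holds only the descriptors that are ready. */
    return FD_ISSET(fd, &readfds) ? 1 : 0;
}
```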

       sigmask is a pointer to a signal mask (see sigprocmask(2)); if it is
       not NULL, then pselect() first replaces the current signal mask by
       the one pointed to by sigmask, then does the "select" function, and
       then restores the original signal mask.

       Other than the difference in the precision of the timeout argument,
       the following pselect() call:

           ready = pselect(nfds, &readfds, &writefds, &exceptfds,
                           timeout, &sigmask);

       is equivalent to atomically executing the following calls:

           sigset_t origmask;

           pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
           ready = select(nfds, &readfds, &writefds, &exceptfds, timeout);
           pthread_sigmask(SIG_SETMASK, &origmask, NULL);

       The reason that pselect() is needed is that if one wants to wait for
       either a signal or for a file descriptor to become ready, then an
       atomic test is needed to prevent race conditions.  (Suppose the
       signal handler sets a global flag and returns.  Then a test of this
       global flag followed by a call of select() could hang indefinitely if
       the signal arrived just after the test but just before the call.  By
       contrast, pselect() allows one to first block signals, handle the
       signals that have come in, then call pselect() with the desired
       sigmask, avoiding the race.)

   The timeout
       The time structures involved are defined in <sys/time.h> and look
       like

           struct timeval {
               long    tv_sec;         /* seconds */
               long    tv_usec;        /* microseconds */
           };

       and

           struct timespec {
               long    tv_sec;         /* seconds */
               long    tv_nsec;        /* nanoseconds */
           };

       (However, see below on the POSIX.1 versions.)

       Some code calls select() with all three sets empty, nfds zero, and a
       non-NULL timeout as a fairly portable way to sleep with subsecond
       precision.

       On Linux, select() modifies timeout to reflect the amount of time not
       slept; most other implementations do not do this.  (POSIX.1 permits
       either behavior.)  This causes problems both when Linux code which
       reads timeout is ported to other operating systems, and when code is
       ported to Linux that reuses a struct timeval for multiple select()s
       in a loop without reinitializing it.  Consider timeout to be
       undefined after select() returns.
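
       In practice this means rebuilding the timeout, along with the
       descriptor sets, before every call in a loop.  The sketch below
       does so; the function name poll_attempts() and its parameters are
       assumptions.

```c
#include <sys/select.h>
#include <unistd.h>

/* Try up to attempts times, waiting ms milliseconds each time, for fd to
 * become readable.  Returns the 1-based attempt that succeeded, 0 if
 * none did, or -1 on error. */
int poll_attempts(int fd, int attempts, long ms)
{
    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;

    for (int i = 1; i <= attempts; i++) {
        fd_set readfds;
        struct timeval tv;

        /* Rebuild both on every pass: select() overwrites the set, and
         * on Linux it overwrites the timeout as well. */
        FD_ZERO(&readfds);
        FD_SET(fd, &readfds);
        tv.tv_sec = ms / 1000;
        tv.tv_usec = (ms % 1000) * 1000;

        int ready = select(fd + 1, &readfds, NULL, NULL, &tv);
        if (ready == -1)
            return -1;
        if (ready > 0)
            return i;
    }
    return 0;
}
```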
http://man7.org/linux/man-pages/man2/pselect6.2.html
13
SYSTEM CALL:
select(2) - Linux manual page
FUNCTIONALITY:

       select, pselect, FD_CLR, FD_ISSET, FD_SET, FD_ZERO - synchronous I/O
       multiplexing
SYNOPSIS:

       /* According to POSIX.1-2001, POSIX.1-2008 */
       #include <sys/select.h>

       /* According to earlier standards */
       #include <sys/time.h>
       #include <sys/types.h>
       #include <unistd.h>

       int select(int nfds, fd_set *readfds, fd_set *writefds,
                  fd_set *exceptfds, struct timeval *timeout);

       void FD_CLR(int fd, fd_set *set);
       int  FD_ISSET(int fd, fd_set *set);
       void FD_SET(int fd, fd_set *set);
       void FD_ZERO(fd_set *set);

       #include <sys/select.h>

       int pselect(int nfds, fd_set *readfds, fd_set *writefds,
                   fd_set *exceptfds, const struct timespec *timeout,
                   const sigset_t *sigmask);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       pselect(): _POSIX_C_SOURCE >= 200112L
DESCRIPTION

       select() and pselect() allow a program to monitor multiple file
       descriptors, waiting until one or more of the file descriptors become
       "ready" for some class of I/O operation (e.g., input possible).  A
       file descriptor is considered ready if it is possible to perform a
       corresponding I/O operation (e.g., read(2) without blocking, or a
       sufficiently small write(2)).

       select() can monitor only file descriptor numbers that are less than
       FD_SETSIZE; poll(2) does not have this limitation.  See BUGS.

       The operation of select() and pselect() is identical, other than
       these three differences:

       (i)    select() uses a timeout that is a struct timeval (with seconds
              and microseconds), while pselect() uses a struct timespec
              (with seconds and nanoseconds).

       (ii)   select() may update the timeout argument to indicate how much
              time was left.  pselect() does not change this argument.

       (iii)  select() has no sigmask argument, and behaves as pselect()
              called with NULL sigmask.

       Three independent sets of file descriptors are watched.  Those listed
       in readfds will be watched to see if characters become available for
       reading (more precisely, to see if a read will not block; in
       particular, a file descriptor is also ready on end-of-file), those in
       writefds will be watched to see if space is available for write
       (though a large write may still block), and those in exceptfds will
       be watched for exceptions.  On exit, the sets are modified in place
       to indicate which file descriptors actually changed status.  Each of
       the three file descriptor sets may be specified as NULL if no file
       descriptors are to be watched for the corresponding class of events.

       Four macros are provided to manipulate the sets.  FD_ZERO() clears a
       set.  FD_SET() and FD_CLR() respectively add and remove a given file
       descriptor from a set.  FD_ISSET() tests to see if a file descriptor
       is part of the set; this is useful after select() returns.

       nfds is the highest-numbered file descriptor in any of the three
       sets, plus 1.

       The timeout argument specifies the interval that select() should
       block waiting for a file descriptor to become ready.  The call will
       block until either:

       *  a file descriptor becomes ready;

       *  the call is interrupted by a signal handler; or

       *  the timeout expires.

       Note that the timeout interval will be rounded up to the system clock
       granularity, and kernel scheduling delays mean that the blocking
       interval may overrun by a small amount.  If both fields of the
       timeval structure are zero, then select() returns immediately.  (This
       is useful for polling.)  If timeout is NULL (no timeout), select()
       can block indefinitely.

       sigmask is a pointer to a signal mask (see sigprocmask(2)); if it is
       not NULL, then pselect() first replaces the current signal mask by
       the one pointed to by sigmask, then does the "select" function, and
       then restores the original signal mask.

       Other than the difference in the precision of the timeout argument,
       the following pselect() call:

           ready = pselect(nfds, &readfds, &writefds, &exceptfds,
                           timeout, &sigmask);

       is equivalent to atomically executing the following calls:

           sigset_t origmask;

           pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
           ready = select(nfds, &readfds, &writefds, &exceptfds, timeout);
           pthread_sigmask(SIG_SETMASK, &origmask, NULL);

       The reason that pselect() is needed is that if one wants to wait for
       either a signal or for a file descriptor to become ready, then an
       atomic test is needed to prevent race conditions.  (Suppose the
       signal handler sets a global flag and returns.  Then a test of this
       global flag followed by a call of select() could hang indefinitely if
       the signal arrived just after the test but just before the call.  By
       contrast, pselect() allows one to first block signals, handle the
       signals that have come in, then call pselect() with the desired
       sigmask, avoiding the race.)
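
       The block/test/wait sequence just described might be sketched as
       follows for a single-threaded program.  The choice of SIGUSR1, the
       flag got_signal, and the helper names are assumptions made for
       illustration only.

```c
#define _POSIX_C_SOURCE 200809L   /* for pselect() and sigaction() */
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/select.h>
#include <unistd.h>

/* Global flag set by the handler. */
static volatile sig_atomic_t got_signal;

static void on_sigusr1(int signo)
{
    (void)signo;
    got_signal = 1;
}

/* Hypothetical helper: install the flag-setting handler for SIGUSR1. */
int install_handler(void)
{
    struct sigaction sa;

    sa.sa_handler = on_sigusr1;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Hypothetical helper: returns 1 if fd became readable, 0 if SIGUSR1
 * was seen first, -1 on any other error. */
int wait_fd_or_sigusr1(int fd)
{
    sigset_t block, orig, waitmask;
    fd_set readfds;
    int ready, saved_errno;

    assert(fd >= 0 && fd < FD_SETSIZE);

    /* 1. Block the signal so that it cannot arrive between the test
     *    of the flag and the start of the wait. */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &orig) == -1)
        return -1;

    /* 2. Handle a signal that came in before it was blocked. */
    if (got_signal) {
        sigprocmask(SIG_SETMASK, &orig, NULL);
        return 0;
    }

    /* 3. Wait with SIGUSR1 unblocked only for the duration of the call. */
    waitmask = orig;
    sigdelset(&waitmask, SIGUSR1);
    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    ready = pselect(fd + 1, &readfds, NULL, NULL, NULL, &waitmask);
    saved_errno = errno;

    sigprocmask(SIG_SETMASK, &orig, NULL);
    if (ready == -1 && saved_errno == EINTR && got_signal)
        return 0;
    errno = saved_errno;
    return ready > 0 ? 1 : -1;
}
```

       A multithreaded program would use pthread_sigmask(3) in place of
       sigprocmask(2), as in the equivalence shown earlier.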

   The timeout
       The time structures involved are defined in <sys/time.h> and look
       like

           struct timeval {
               long    tv_sec;         /* seconds */
               long    tv_usec;        /* microseconds */
           };

       and

           struct timespec {
               long    tv_sec;         /* seconds */
               long    tv_nsec;        /* nanoseconds */
           };

       (However, see below on the POSIX.1 versions.)

       Some code calls select() with all three sets empty, nfds zero, and a
       non-NULL timeout as a fairly portable way to sleep with subsecond
       precision.

       On Linux, select() modifies timeout to reflect the amount of time not
       slept; most other implementations do not do this.  (POSIX.1 permits
       either behavior.)  This causes problems both when Linux code which
       reads timeout is ported to other operating systems, and when code is
       ported to Linux that reuses a struct timeval for multiple select()s
       in a loop without reinitializing it.  Consider timeout to be
       undefined after select() returns.
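
       The following sketch combines the two points above: it sleeps with
       subsecond precision by calling select() with no descriptor sets,
       and it rebuilds the struct timeval before every call instead of
       reusing it.  The helper name sleep_intervals() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>

/* Hypothetical helper: sleep for count intervals of interval_ms each.
 * Returns the number of intervals completed, or -1 if select() failed
 * (for example, with EINTR when a signal handler ran). */
int sleep_intervals(int count, long interval_ms)
{
    int done = 0;

    assert(interval_ms >= 0);

    while (done < count) {
        struct timeval tv;

        /* Reinitialize on every iteration: on Linux the previous call
         * may have overwritten tv with the time not slept. */
        tv.tv_sec = interval_ms / 1000;
        tv.tv_usec = (interval_ms % 1000) * 1000;

        /* No sets and nfds of zero: the call only waits for the timeout. */
        if (select(0, NULL, NULL, NULL, &tv) == -1)
            return -1;
        done++;
    }
    return done;
}
```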
http://man7.org/linux/man-pages/man2/poll.2.html
12
SYSTEM CALL:
poll(2) - Linux manual page
FUNCTIONALITY:

       poll, ppoll - wait for some event on a file descriptor
SYNOPSIS:

       #include <poll.h>

       int poll(struct pollfd *fds, nfds_t nfds, int timeout);

       #define _GNU_SOURCE         /* See feature_test_macros(7) */
       #include <signal.h>
       #include <poll.h>

       int ppoll(struct pollfd *fds, nfds_t nfds,
               const struct timespec *tmo_p, const sigset_t *sigmask);
DESCRIPTION

       poll() performs a similar task to select(2): it waits for one of a
       set of file descriptors to become ready to perform I/O.

       The set of file descriptors to be monitored is specified in the fds
       argument, which is an array of structures of the following form:

           struct pollfd {
               int   fd;         /* file descriptor */
               short events;     /* requested events */
               short revents;    /* returned events */
           };

       The caller should specify the number of items in the fds array in
       nfds.

       The field fd contains a file descriptor for an open file.  If this
       field is negative, then the corresponding events field is ignored and
       the revents field returns zero.  (This provides an easy way of
       ignoring a file descriptor for a single poll() call: simply negate
       the fd field.  Note, however, that this technique can't be used to
       ignore file descriptor 0.)
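
       As an illustration of the negation technique, the sketch below
       polls two pipes that both hold data while the first entry is
       temporarily ignored.  The helper names are hypothetical, and the
       demo assumes that descriptor 0 is already open, so that pipe(2)
       cannot return it.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: flip an entry between watched and ignored. */
void pollfd_toggle_ignore(struct pollfd *p)
{
    assert(p->fd != 0);   /* -0 == 0, so descriptor 0 cannot be ignored */
    p->fd = -p->fd;
}

/* Demo: returns 1 if poll() reported only the second pipe, 0 if not,
 * -1 if the pipes could not be created. */
int demo_ignore_first(void)
{
    int a[2], b[2], n, ok;
    struct pollfd fds[2];

    if (pipe(a) == -1)
        return -1;
    if (pipe(b) == -1) {
        close(a[0]);
        close(a[1]);
        return -1;
    }

    /* Make both read ends readable. */
    ok = write(a[1], "x", 1) == 1 && write(b[1], "y", 1) == 1;

    fds[0].fd = a[0];
    fds[0].events = POLLIN;
    fds[0].revents = 0;
    fds[1].fd = b[0];
    fds[1].events = POLLIN;
    fds[1].revents = 0;

    pollfd_toggle_ignore(&fds[0]);       /* skip the first pipe this time */
    n = poll(fds, 2, 0);
    ok = ok && n == 1 && fds[0].revents == 0 && (fds[1].revents & POLLIN);
    pollfd_toggle_ignore(&fds[0]);       /* watch it again in later calls */

    close(a[0]);
    close(a[1]);
    close(b[0]);
    close(b[1]);
    return ok;
}
```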

       The field events is an input parameter, a bit mask specifying the
       events the application is interested in for the file descriptor fd.
       This field may be specified as zero, in which case the only events
       that can be returned in revents are POLLHUP, POLLERR, and POLLNVAL
       (see below).

       The field revents is an output parameter, filled by the kernel with
       the events that actually occurred.  The bits returned in revents can
       include any of those specified in events, or one of the values
       POLLERR, POLLHUP, or POLLNVAL.  (These three bits are meaningless in
       the events field, and will be set in the revents field whenever the
       corresponding condition is true.)

       If none of the events requested (and no error) has occurred for any
       of the file descriptors, then poll() blocks until one of the events
       occurs.

       The timeout argument specifies the number of milliseconds that poll()
       should block waiting for a file descriptor to become ready.  The call
       will block until either:

       *  a file descriptor becomes ready;

       *  the call is interrupted by a signal handler; or

       *  the timeout expires.

       Note that the timeout interval will be rounded up to the system clock
       granularity, and kernel scheduling delays mean that the blocking
       interval may overrun by a small amount.  Specifying a negative value
       in timeout means an infinite timeout.  Specifying a timeout of zero
       causes poll() to return immediately, even if no file descriptors are
       ready.

       The bits that may be set/returned in events and revents are defined
       in <poll.h>:

              POLLIN There is data to read.

              POLLPRI
                     There is urgent data to read (e.g., out-of-band data on
                     TCP socket; pseudoterminal master in packet mode has
                     seen state change in slave).

              POLLOUT
                     Writing is now possible, though a write larger than the
                     available space in a socket or pipe will still block
                     (unless O_NONBLOCK is set).

              POLLRDHUP (since Linux 2.6.17)
                     Stream socket peer closed connection, or shut down
                     writing half of connection.  The _GNU_SOURCE feature
                     test macro must be defined (before including any header
                     files) in order to obtain this definition.

              POLLERR
                     Error condition (only returned in revents; ignored in
                     events).

              POLLHUP
                     Hang up (only returned in revents; ignored in events).
                     Note that when reading from a channel such as a pipe or
                     a stream socket, this event merely indicates that the
                     peer closed its end of the channel.  Subsequent reads
                     from the channel will return 0 (end of file) only after
                     all outstanding data in the channel has been consumed.

              POLLNVAL
                     Invalid request: fd not open (only returned in revents;
                     ignored in events).

       When compiling with _XOPEN_SOURCE defined, one also has the
       following, which convey no further information beyond the bits listed
       above:

              POLLRDNORM
                     Equivalent to POLLIN.

              POLLRDBAND
                     Priority band data can be read (generally unused on
                     Linux).

              POLLWRNORM
                     Equivalent to POLLOUT.

              POLLWRBAND
                     Priority data may be written.

       Linux also knows about, but does not use, POLLMSG.
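
       One way to interpret the bits returned in revents is sketched
       below.  The enum, the helper name check_fd(), and the order in
       which the bits are tested are assumptions made for illustration;
       POLLIN is tested before POLLHUP so that outstanding data is
       consumed before a hang-up is acted upon.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

enum fd_status {
    FD_TIMEOUT,   /* nothing happened within the timeout */
    FD_DATA,      /* POLLIN: there is data to read */
    FD_HANGUP,    /* POLLHUP only: peer closed, nothing left to read */
    FD_ERROR,     /* POLLERR */
    FD_INVALID,   /* POLLNVAL: fd is not open */
    FD_FAILED     /* poll() itself returned -1 */
};

/* Hypothetical helper: poll a single descriptor for input. */
enum fd_status check_fd(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int n;

    assert(fd >= 0);   /* a negative fd would simply be ignored */

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    n = poll(&pfd, 1, timeout_ms);
    if (n == -1)
        return FD_FAILED;
    if (n == 0)
        return FD_TIMEOUT;

    /* POLLERR, POLLHUP and POLLNVAL are reported even though they
     * were not requested in events. */
    if (pfd.revents & POLLNVAL)
        return FD_INVALID;
    if (pfd.revents & POLLERR)
        return FD_ERROR;
    if (pfd.revents & POLLIN)
        return FD_DATA;
    if (pfd.revents & POLLHUP)
        return FD_HANGUP;
    return FD_TIMEOUT;
}
```

       For the read end of a pipe whose write end has been closed, this
       helper keeps returning FD_DATA until the remaining bytes have been
       read, and only then returns FD_HANGUP.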

   ppoll()
       The relationship between poll() and ppoll() is analogous to the
       relationship between select(2) and pselect(2): like pselect(2),
       ppoll() allows an application to safely wait until either a file
       descriptor becomes ready or until a signal is caught.

       Other than the difference in the precision of the timeout argument,
       the following ppoll() call:

           ready = ppoll(&fds, nfds, tmo_p, &sigmask);

       is equivalent to atomically executing the following calls:

           sigset_t origmask;
           int timeout;

           timeout = (tmo_p == NULL) ? -1 :
                     (tmo_p->tv_sec * 1000 + tmo_p->tv_nsec / 1000000);
           pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
           ready = poll(&fds, nfds, timeout);
           pthread_sigmask(SIG_SETMASK, &origmask, NULL);

       See the description of pselect(2) for an explanation of why ppoll()
       is necessary.

       If the sigmask argument is specified as NULL, then no signal mask
       manipulation is performed (and thus ppoll() differs from poll() only
       in the precision of the timeout argument).
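
       The timeout conversion used in the equivalence above can be
       written as a small function, which also makes the loss of
       precision visible.  The function name and the clamping to INT_MAX
       are assumptions added for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: convert a ppoll()-style timeout into the
 * millisecond value that poll() expects.  NULL maps to -1 (infinite). */
int timespec_to_poll_ms(const struct timespec *tmo_p)
{
    long long ms;

    if (tmo_p == NULL)
        return -1;

    assert(tmo_p->tv_sec >= 0);
    assert(tmo_p->tv_nsec >= 0 && tmo_p->tv_nsec < 1000000000L);

    /* Same arithmetic as in the text; fractions of a millisecond
     * are truncated. */
    ms = (long long)tmo_p->tv_sec * 1000 + tmo_p->tv_nsec / 1000000;

    /* Assumption: clamp rather than overflow the int argument. */
    return ms > INT_MAX ? INT_MAX : (int)ms;
}
```

       With this conversion, a timeout of 1.5 seconds becomes 1500, while
       a timeout of 999999 nanoseconds becomes 0, that is, an immediate
       return from poll().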

       The tmo_p argument specifies an upper limit on the amount of time
       that ppoll() will block.  This argument is a pointer to a structure
       of the following form:

           struct timespec {
               long    tv_sec;         /* seconds */
               long    tv_nsec;        /* nanoseconds */
           };

       If tmo_p is specified as NULL, then ppoll() can block indefinitely.
http://man7.org/linux/man-pages/man2/ppoll.2.html
12
SYSTEM CALL:
poll(2) - Linux manual page
FUNCTIONALITY:

       poll, ppoll - wait for some event on a file descriptor
SYNOPSIS:

       #include <poll.h>

       int poll(struct pollfd *fds, nfds_t nfds, int timeout);

       #define _GNU_SOURCE         /* See feature_test_macros(7) */
       #include <signal.h>
       #include <poll.h>

       int ppoll(struct pollfd *fds, nfds_t nfds,
               const struct timespec *tmo_p, const sigset_t *sigmask);
DESCRIPTION

       poll() performs a similar task to select(2): it waits for one of a
       set of file descriptors to become ready to perform I/O.

       The set of file descriptors to be monitored is specified in the fds
       argument, which is an array of structures of the following form:

           struct pollfd {
               int   fd;         /* file descriptor */
               short events;     /* requested events */
               short revents;    /* returned events */
           };

       The caller should specify the number of items in the fds array in
       nfds.

       The field fd contains a file descriptor for an open file.  If this
       field is negative, then the corresponding events field is ignored and
       the revents field returns zero.  (This provides an easy way of
       ignoring a file descriptor for a single poll() call: simply negate
       the fd field.  Note, however, that this technique can't be used to
       ignore file descriptor 0.)

       The field events is an input parameter, a bit mask specifying the
       events the application is interested in for the file descriptor fd.
       This field may be specified as zero, in which case the only events
       that can be returned in revents are POLLHUP, POLLERR, and POLLNVAL
       (see below).

       The field revents is an output parameter, filled by the kernel with
       the events that actually occurred.  The bits returned in revents can
       include any of those specified in events, or one of the values
       POLLERR, POLLHUP, or POLLNVAL.  (These three bits are meaningless in
       the events field, and will be set in the revents field whenever the
       corresponding condition is true.)

       If none of the events requested (and no error) has occurred for any
       of the file descriptors, then poll() blocks until one of the events
       occurs.

       The timeout argument specifies the number of milliseconds that poll()
       should block waiting for a file descriptor to become ready.  The call
       will block until either:

       *  a file descriptor becomes ready;

       *  the call is interrupted by a signal handler; or

       *  the timeout expires.

       Note that the timeout interval will be rounded up to the system clock
       granularity, and kernel scheduling delays mean that the blocking
       interval may overrun by a small amount.  Specifying a negative value
       in timeout means an infinite timeout.  Specifying a timeout of zero
       causes poll() to return immediately, even if no file descriptors are
       ready.

       The bits that may be set/returned in events and revents are defined
       in <poll.h>:

              POLLIN There is data to read.

              POLLPRI
                     There is urgent data to read (e.g., out-of-band data on
                     TCP socket; pseudoterminal master in packet mode has
                     seen state change in slave).

              POLLOUT
                     Writing is now possible, though a write larger than the
                     available space in a socket or pipe will still block
                     (unless O_NONBLOCK is set).

              POLLRDHUP (since Linux 2.6.17)
                     Stream socket peer closed connection, or shut down
                     writing half of connection.  The _GNU_SOURCE feature
                     test macro must be defined (before including any header
                     files) in order to obtain this definition.

              POLLERR
                     Error condition (only returned in revents; ignored in
                     events).

              POLLHUP
                     Hang up (only returned in revents; ignored in events).
                     Note that when reading from a channel such as a pipe or
                     a stream socket, this event merely indicates that the
                     peer closed its end of the channel.  Subsequent reads
                     from the channel will return 0 (end of file) only after
                     all outstanding data in the channel has been consumed.

              POLLNVAL
                     Invalid request: fd not open (only returned in revents;
                     ignored in events).

       When compiling with _XOPEN_SOURCE defined, one also has the
       following, which convey no further information beyond the bits listed
       above:

              POLLRDNORM
                     Equivalent to POLLIN.

              POLLRDBAND
                     Priority band data can be read (generally unused on
                     Linux).

              POLLWRNORM
                     Equivalent to POLLOUT.

              POLLWRBAND
                     Priority data may be written.

       Linux also knows about, but does not use, POLLMSG.

   ppoll()
       The relationship between poll() and ppoll() is analogous to the
       relationship between select(2) and pselect(2): like pselect(2),
       ppoll() allows an application to safely wait until either a file
       descriptor becomes ready or until a signal is caught.

       Other than the difference in the precision of the timeout argument,
       the following ppoll() call:

           ready = ppoll(&fds, nfds, tmo_p, &sigmask);

       is equivalent to atomically executing the following calls:

           sigset_t origmask;
           int timeout;

           timeout = (tmo_p == NULL) ? -1 :
                     (tmo_p->tv_sec * 1000 + tmo_p->tv_nsec / 1000000);
           pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
           ready = poll(&fds, nfds, timeout);
           pthread_sigmask(SIG_SETMASK, &origmask, NULL);

       See the description of pselect(2) for an explanation of why ppoll()
       is necessary.

       If the sigmask argument is specified as NULL, then no signal mask
       manipulation is performed (and thus ppoll() differs from poll() only
       in the precision of the timeout argument).

       The tmo_p argument specifies an upper limit on the amount of time
       that ppoll() will block.  This argument is a pointer to a structure
       of the following form:

           struct timespec {
               long    tv_sec;         /* seconds */
               long    tv_nsec;        /* nanoseconds */
           };

       If tmo_p is specified as NULL, then ppoll() can block indefinitely.
http://man7.org/linux/man-pages/man2/epoll_create.2.html
11
SYSTEM CALL:
epoll_create(2) - Linux manual page
FUNCTIONALITY:

       epoll_create, epoll_create1 - open an epoll file descriptor
SYNOPSIS:

       #include <sys/epoll.h>

       int epoll_create(int size);
       int epoll_create1(int flags);
DESCRIPTION

       epoll_create() creates an epoll(7) instance.  Since Linux 2.6.8, the
       size argument is ignored, but must be greater than zero; see NOTES
       below.

       epoll_create() returns a file descriptor referring to the new epoll
       instance.  This file descriptor is used for all the subsequent calls
       to the epoll interface.  When no longer required, the file descriptor
       returned by epoll_create() should be closed by using close(2).  When
       all file descriptors referring to an epoll instance have been closed,
       the kernel destroys the instance and releases the associated
       resources for reuse.

   epoll_create1()
       If flags is 0, then, other than the fact that the obsolete size
       argument is dropped, epoll_create1() is the same as epoll_create().
       The following value can be included in flags to obtain different
       behavior:

       EPOLL_CLOEXEC
              Set the close-on-exec (FD_CLOEXEC) flag on the new file
              descriptor.  See the description of the O_CLOEXEC flag in
              open(2) for reasons why this may be useful.
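
       The effect of EPOLL_CLOEXEC can be observed with fcntl(2), as in
       the following sketch.  The helper names are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: 1 if close-on-exec is set on fd, 0 if it is
 * not, -1 if fcntl() failed. */
int has_cloexec(int fd)
{
    int flags;

    assert(fd >= 0);
    flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return (flags & FD_CLOEXEC) != 0;
}

/* Demo: create one instance without flags and one with EPOLL_CLOEXEC.
 * Returns 1 if only the second has the close-on-exec flag set. */
int demo_epoll_cloexec(void)
{
    int plain = epoll_create1(0);
    int cloexec = epoll_create1(EPOLL_CLOEXEC);
    int ok = plain >= 0 && cloexec >= 0 &&
             has_cloexec(plain) == 0 && has_cloexec(cloexec) == 1;

    /* Closing the last descriptor destroys the epoll instance. */
    if (plain >= 0)
        close(plain);
    if (cloexec >= 0)
        close(cloexec);
    return ok;
}
```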
http://man7.org/linux/man-pages/man2/epoll_create1.2.html
11
SYSTEM CALL:
epoll_create(2) - Linux manual page
FUNCTIONALITY:

       epoll_create, epoll_create1 - open an epoll file descriptor
SYNOPSIS:

       #include <sys/epoll.h>

       int epoll_create(int size);
       int epoll_create1(int flags);
DESCRIPTION

       epoll_create() creates an epoll(7) instance.  Since Linux 2.6.8, the
       size argument is ignored, but must be greater than zero; see NOTES
       below.

       epoll_create() returns a file descriptor referring to the new epoll
       instance.  This file descriptor is used for all the subsequent calls
       to the epoll interface.  When no longer required, the file descriptor
       returned by epoll_create() should be closed by using close(2).  When
       all file descriptors referring to an epoll instance have been closed,
       the kernel destroys the instance and releases the associated
       resources for reuse.

   epoll_create1()
       If flags is 0, then, other than the fact that the obsolete size
       argument is dropped, epoll_create1() is the same as epoll_create().
       The following value can be included in flags to obtain different
       behavior:

       EPOLL_CLOEXEC
              Set the close-on-exec (FD_CLOEXEC) flag on the new file
              descriptor.  See the description of the O_CLOEXEC flag in
              open(2) for reasons why this may be useful.
http://man7.org/linux/man-pages/man2/epoll_ctl.2.html
12
SYSTEM CALL:
epoll_ctl(2) - Linux manual page
FUNCTIONALITY:

       epoll_ctl - control interface for an epoll file descriptor
SYNOPSIS:

       #include <sys/epoll.h>

       int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
DESCRIPTION

       This system call performs control operations on the epoll(7) instance
       referred to by the file descriptor epfd.  It requests that the
       operation op be performed for the target file descriptor, fd.

       Valid values for the op argument are:

       EPOLL_CTL_ADD
              Register the target file descriptor fd on the epoll instance
              referred to by the file descriptor epfd and associate the
              event event with the internal file linked to fd.

       EPOLL_CTL_MOD
              Change the event event associated with the target file
              descriptor fd.

       EPOLL_CTL_DEL
              Remove (deregister) the target file descriptor fd from the
              epoll instance referred to by epfd.  The event is ignored and
              can be NULL (but see BUGS below).

       The event argument describes the object linked to the file descriptor
       fd.  The struct epoll_event is defined as:

           typedef union epoll_data {
               void        *ptr;
               int          fd;
               uint32_t     u32;
               uint64_t     u64;
           } epoll_data_t;

           struct epoll_event {
               uint32_t     events;      /* Epoll events */
               epoll_data_t data;        /* User data variable */
           };

       The events member is a bit mask composed using the following
       available event types:

       EPOLLIN
              The associated file is available for read(2) operations.

       EPOLLOUT
              The associated file is available for write(2) operations.

       EPOLLRDHUP (since Linux 2.6.17)
              Stream socket peer closed connection, or shut down writing
              half of connection.  (This flag is especially useful for
              writing simple code to detect peer shutdown when using Edge
              Triggered monitoring.)

       EPOLLPRI
              There is urgent data available for read(2) operations.

       EPOLLERR
              Error condition happened on the associated file descriptor.
              epoll_wait(2) will always wait for this event; it is not
              necessary to set it in events.

       EPOLLHUP
              Hang up happened on the associated file descriptor.
              epoll_wait(2) will always wait for this event; it is not
              necessary to set it in events.  Note that when reading from a
              channel such as a pipe or a stream socket, this event merely
              indicates that the peer closed its end of the channel.
              Subsequent reads from the channel will return 0 (end of file)
              only after all outstanding data in the channel has been
              consumed.

       EPOLLET
              Sets the Edge Triggered behavior for the associated file
              descriptor.  The default behavior for epoll is Level
              Triggered.  See epoll(7) for more detailed information about
              Edge and Level Triggered event distribution architectures.

       EPOLLONESHOT (since Linux 2.6.2)
              Sets the one-shot behavior for the associated file descriptor.
              This means that after an event is pulled out with
              epoll_wait(2) the associated file descriptor is internally
              disabled and no other events will be reported by the epoll
              interface.  The user must call epoll_ctl() with EPOLL_CTL_MOD
              to rearm the file descriptor with a new event mask.

       EPOLLWAKEUP (since Linux 3.5)
              If EPOLLONESHOT and EPOLLET are clear and the process has the
              CAP_BLOCK_SUSPEND capability, ensure that the system does not
              enter "suspend" or "hibernate" while this event is pending or
              being processed.  The event is considered as being "processed"
              from the time when it is returned by a call to epoll_wait(2)
              until the next call to epoll_wait(2) on the same epoll(7) file
              descriptor, the closure of that file descriptor, the removal
              of the event file descriptor with EPOLL_CTL_DEL, or the
              clearing of EPOLLWAKEUP for the event file descriptor with
              EPOLL_CTL_MOD.  See also BUGS.

       EPOLLEXCLUSIVE (since Linux 4.5)
              Sets an exclusive wakeup mode for the epoll file descriptor
              that is being attached to the target file descriptor, fd.
              When a wakeup event occurs and multiple epoll file descriptors
              are attached to the same target file using EPOLLEXCLUSIVE, one
              or more of the epoll file descriptors will receive an event
              with epoll_wait(2).  The default in this scenario (when
              EPOLLEXCLUSIVE is not set) is for all epoll file descriptors
              to receive an event.  EPOLLEXCLUSIVE is thus useful for
              avoiding thundering herd problems in certain scenarios.

              If the same file descriptor is in multiple epoll instances,
              some with the EPOLLEXCLUSIVE flag, and others without, then
              events will be provided to all epoll instances that did not
              specify EPOLLEXCLUSIVE, and at least one of the epoll
              instances that did specify EPOLLEXCLUSIVE.

              The following values may be specified in conjunction with
              EPOLLEXCLUSIVE: EPOLLIN, EPOLLOUT, EPOLLWAKEUP, and EPOLLET.
              EPOLLHUP and EPOLLERR can also be specified, but this is not
              required: as usual, these events are always reported if they
              occur, regardless of whether they are specified in events.
              Attempts to specify other values in events yield an error.
              EPOLLEXCLUSIVE may be used only in an EPOLL_CTL_ADD operation;
              attempts to employ it with EPOLL_CTL_MOD yield an error.  If
              EPOLLEXCLUSIVE has been set using epoll_ctl(2), then a subsequent
              EPOLL_CTL_MOD on the same epfd, fd pair yields an error.  A
              call to epoll_ctl(2) that specifies EPOLLEXCLUSIVE in events
              and specifies the target file descriptor fd as an epoll
              instance will likewise fail.  The error in all of these cases
              is EINVAL.
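
       The three operations and the EPOLLONESHOT rearming step can be
       exercised together on a pipe, as in the following sketch.  The
       helper names are hypothetical.  A non-NULL event is passed to
       EPOLL_CTL_DEL even though it is ignored, as a cautious choice in
       view of the reference to BUGS above.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: register (ADD) or update (MOD) fd on epfd,
 * storing the descriptor itself as the user data. */
int epoll_set(int epfd, int op, int fd, uint32_t events)
{
    struct epoll_event ev;

    assert(op == EPOLL_CTL_ADD || op == EPOLL_CTL_MOD);
    ev.events = events;
    ev.data.fd = fd;
    return epoll_ctl(epfd, op, fd, &ev);
}

/* Demo on a pipe that holds one unread byte.  Returns 1 if the
 * one-shot sequence behaved as described, 0 if not, -1 if setup failed. */
int demo_oneshot(void)
{
    struct epoll_event out, ignored;
    int p[2], epfd, first, second, third = -1, removed;

    if (pipe(p) == -1)
        return -1;
    epfd = epoll_create1(0);
    if (epfd == -1 || write(p[1], "x", 1) != 1 ||
        epoll_set(epfd, EPOLL_CTL_ADD, p[0],
                  EPOLLIN | EPOLLONESHOT) == -1) {
        if (epfd != -1)
            close(epfd);
        close(p[0]);
        close(p[1]);
        return -1;
    }

    first = epoll_wait(epfd, &out, 1, 0);    /* the event is pulled out */
    second = epoll_wait(epfd, &out, 1, 0);   /* fd is now disabled */

    /* Rearm with a new event mask; the byte is still unread. */
    if (epoll_set(epfd, EPOLL_CTL_MOD, p[0], EPOLLIN | EPOLLONESHOT) == 0)
        third = epoll_wait(epfd, &out, 1, 0);

    ignored.events = 0;
    ignored.data.fd = p[0];
    removed = epoll_ctl(epfd, EPOLL_CTL_DEL, p[0], &ignored);

    close(epfd);
    close(p[0]);
    close(p[1]);
    return first == 1 && second == 0 && third == 1 && removed == 0;
}
```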
http://man7.org/linux/man-pages/man2/epoll_wait.2.html
12
SYSTEM CALL:
epoll_wait(2) - Linux manual page
FUNCTIONALITY:

       epoll_wait,  epoll_pwait  -  wait  for  an I/O event on an epoll file
       descriptor
SYNOPSIS:

       #include <sys/epoll.h>

       int epoll_wait(int epfd, struct epoll_event *events,
                      int maxevents, int timeout);
       int epoll_pwait(int epfd, struct epoll_event *events,
                      int maxevents, int timeout,
                      const sigset_t *sigmask);
DESCRIPTION

       The epoll_wait() system call waits for events on the epoll(7)
       instance referred to by the file descriptor epfd.  The memory area
       pointed to by events will contain the events that will be available
       for the caller.  Up to maxevents are returned by epoll_wait().  The
       maxevents argument must be greater than zero.

       The timeout argument specifies the number of milliseconds that
       epoll_wait() will block.  The call will block until either:

       *  a file descriptor delivers an event;

       *  the call is interrupted by a signal handler; or

       *  the timeout expires.

       Note that the timeout interval will be rounded up to the system clock
       granularity, and kernel scheduling delays mean that the blocking
       interval may overrun by a small amount.  Specifying a timeout of -1
       causes epoll_wait() to block indefinitely, while specifying a timeout
       equal to zero causes epoll_wait() to return immediately, even if no
       events are available.

       The struct epoll_event is defined as:

           typedef union epoll_data {
               void    *ptr;
               int      fd;
               uint32_t u32;
               uint64_t u64;
           } epoll_data_t;

           struct epoll_event {
               uint32_t     events;    /* Epoll events */
               epoll_data_t data;      /* User data variable */
           };

       The data of each returned structure will contain the same data the
       user set with an epoll_ctl(2) (EPOLL_CTL_ADD, EPOLL_CTL_MOD) while
       the events member will contain the returned event bit field.
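
       The following sketch shows the user data round trip: a pointer
       stored in data.ptr at registration time comes back unchanged from
       epoll_wait().  The struct conn type, its fields, and the function
       name are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

#define MAX_EVENTS 4

/* Hypothetical per-descriptor context kept by the application. */
struct conn {
    int fd;
    const char *name;
};

/* Demo: returns 1 if the registered pointer was returned together
 * with EPOLLIN, 0 if not, -1 if setup failed. */
int demo_data_ptr(void)
{
    struct epoll_event ev, events[MAX_EVENTS];
    struct conn c;
    int p[2], epfd, n, ok = 0;

    if (pipe(p) == -1)
        return -1;
    epfd = epoll_create1(0);
    if (epfd == -1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    c.fd = p[0];
    c.name = "pipe-reader";
    ev.events = EPOLLIN;
    ev.data.ptr = &c;            /* the kernel hands this back as is */

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
        write(p[1], "x", 1) == 1) {
        /* Retry if a signal handler interrupted the wait. */
        do {
            n = epoll_wait(epfd, events, MAX_EVENTS, 100);
        } while (n == -1 && errno == EINTR);

        if (n == 1) {
            struct conn *got = events[0].data.ptr;

            assert(got != NULL);
            ok = got == &c && got->fd == p[0] &&
                 (events[0].events & EPOLLIN) != 0;
        }
    }

    close(epfd);
    close(p[0]);
    close(p[1]);
    return ok;
}
```

       Since epoll_data_t is a union, an application stores either a
       pointer or a descriptor in it, not both; keeping the descriptor
       inside the pointed-to structure, as here, is one way around that.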

   epoll_pwait()
       The relationship between epoll_wait() and epoll_pwait() is analogous
       to the relationship between select(2) and pselect(2): like
       pselect(2), epoll_pwait() allows an application to safely wait until
       either a file descriptor becomes ready or until a signal is caught.

       The following epoll_pwait() call:

           ready = epoll_pwait(epfd, &events, maxevents, timeout, &sigmask);

       is equivalent to atomically executing the following calls:

           sigset_t origmask;

           pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
           ready = epoll_wait(epfd, &events, maxevents, timeout);
           pthread_sigmask(SIG_SETMASK, &origmask, NULL);

       The sigmask argument may be specified as NULL, in which case
       epoll_pwait() is equivalent to epoll_wait().
http://man7.org/linux/man-pages/man2/epoll_pwait.2.html
12
SYSTEM CALL:
epoll_wait(2) - Linux manual page
FUNCTIONALITY:

       epoll_wait,  epoll_pwait  -  wait  for  an I/O event on an epoll file
       descriptor
SYNOPSIS:

       #include <sys/epoll.h>

       int epoll_wait(int epfd, struct epoll_event *events,
                      int maxevents, int timeout);
       int epoll_pwait(int epfd, struct epoll_event *events,
                      int maxevents, int timeout,
                      const sigset_t *sigmask);
DESCRIPTION

       The epoll_wait() system call waits for events on the epoll(7)
       instance referred to by the file descriptor epfd.  The memory area
       pointed to by events will contain the events that will be available
       for the caller.  Up to maxevents are returned by epoll_wait().  The
       maxevents argument must be greater than zero.

       The timeout argument specifies the number of milliseconds that
       epoll_wait() will block.  The call will block until either:

       *  a file descriptor delivers an event;

       *  the call is interrupted by a signal handler; or

       *  the timeout expires.

       Note that the timeout interval will be rounded up to the system clock
       granularity, and kernel scheduling delays mean that the blocking
       interval may overrun by a small amount.  Specifying a timeout of -1
       causes epoll_wait() to block indefinitely, while specifying a timeout
       equal to zero causes epoll_wait() to return immediately, even if no
       events are available.

       The struct epoll_event is defined as:

           typedef union epoll_data {
               void    *ptr;
               int      fd;
               uint32_t u32;
               uint64_t u64;
           } epoll_data_t;

           struct epoll_event {
               uint32_t     events;    /* Epoll events */
               epoll_data_t data;      /* User data variable */
           };

       The data of each returned structure will contain the same data the
       user set with an epoll_ctl(2) (EPOLL_CTL_ADD, EPOLL_CTL_MOD) while
       the events member will contain the returned event bit field.

   epoll_pwait()
       The relationship between epoll_wait() and epoll_pwait() is analogous
       to the relationship between select(2) and pselect(2): like
       pselect(2), epoll_pwait() allows an application to safely wait until
       either a file descriptor becomes ready or until a signal is caught.

       The following epoll_pwait() call:

           ready = epoll_pwait(epfd, &events, maxevents, timeout, &sigmask);

       is equivalent to atomically executing the following calls:

           sigset_t origmask;

           pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
           ready = epoll_wait(epfd, &events, maxevents, timeout);
           pthread_sigmask(SIG_SETMASK, &origmask, NULL);

       The sigmask argument may be specified as NULL, in which case
       epoll_pwait() is equivalent to epoll_wait().
http://man7.org/linux/man-pages/man2/inotify_init.2.html
10
SYSTEM CALL:
inotify_init(2) - Linux manual page
FUNCTIONALITY:

       inotify_init, inotify_init1 - initialize an inotify instance
SYNOPSIS:

       #include <sys/inotify.h>

       int inotify_init(void);
       int inotify_init1(int flags);
DESCRIPTION

       For an overview of the inotify API, see inotify(7).

       inotify_init() initializes a new inotify instance and returns a file
       descriptor associated with a new inotify event queue.

       If flags is 0, then inotify_init1() is the same as inotify_init().
       The following values can be bitwise ORed in flags to obtain different
       behavior:

       IN_NONBLOCK Set the O_NONBLOCK file status flag on the new open file
                   description.  Using this flag saves extra calls to
                   fcntl(2) to achieve the same result.

       IN_CLOEXEC  Set the close-on-exec (FD_CLOEXEC) flag on the new file
                   descriptor.  See the description of the O_CLOEXEC flag in
                   open(2) for reasons why this may be useful.
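
       The following sketch contrasts the single inotify_init1() call
       with the longer inotify_init() plus fcntl(2) sequence that reaches
       the same end state.  All three function names are hypothetical
       and exist only for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/inotify.h>
#include <unistd.h>

/* One call: both flags are applied when the descriptor is created. */
int inotify_open_with_flags(void)
{
    return inotify_init1(IN_NONBLOCK | IN_CLOEXEC);
}

/* Same end state using inotify_init() followed by fcntl(2) calls. */
int inotify_open_with_fcntl(void)
{
    int fd = inotify_init();
    if (fd == -1)
        return -1;

    int fl = fcntl(fd, F_GETFL);
    int fdfl = fcntl(fd, F_GETFD);
    if (fl == -1 || fdfl == -1 ||
        fcntl(fd, F_SETFL, fl | O_NONBLOCK) == -1 ||
        fcntl(fd, F_SETFD, fdfl | FD_CLOEXEC) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Returns 1 if fd has O_NONBLOCK and FD_CLOEXEC set, 0 if not,
 * -1 on error. */
int has_nonblock_and_cloexec(int fd)
{
    assert(fd >= 0);

    int fl = fcntl(fd, F_GETFL);
    int fdfl = fcntl(fd, F_GETFD);
    if (fl == -1 || fdfl == -1)
        return -1;
    return (fl & O_NONBLOCK) != 0 && (fdfl & FD_CLOEXEC) != 0;
}
```

       In the fcntl(2) variant there is a window between creation and
       F_SETFD during which another thread could fork and exec with the
       descriptor still inheritable; the flags form has no such window.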
http://man7.org/linux/man-pages/man2/inotify_init1.2.html
10
SYSTEM CALL:
inotify_init(2) - Linux manual page
FUNCTIONALITY:

       inotify_init, inotify_init1 - initialize an inotify instance
SYNOPSIS:

       #include <sys/inotify.h>

       int inotify_init(void);
       int inotify_init1(int flags);
DESCRIPTION

       For an overview of the inotify API, see inotify(7).

       inotify_init() initializes a new inotify instance and returns a file
       descriptor associated with a new inotify event queue.

       If flags is 0, then inotify_init1() is the same as inotify_init().
       The following values can be bitwise ORed in flags to obtain different
       behavior:

       IN_NONBLOCK Set the O_NONBLOCK file status flag on the new open file
                   description.  Using this flag saves extra calls to
                   fcntl(2) to achieve the same result.

       IN_CLOEXEC  Set the close-on-exec (FD_CLOEXEC) flag on the new file
                   descriptor.  See the description of the O_CLOEXEC flag in
                   open(2) for reasons why this may be useful.
http://man7.org/linux/man-pages/man2/inotify_add_watch.2.html
10
SYSTEM CALL:
inotify_add_watch(2) - Linux manual page
FUNCTIONALITY:

       inotify_add_watch - add a watch to an initialized inotify instance
SYNOPSIS:

       #include <sys/inotify.h>

       int inotify_add_watch(int fd, const char *pathname, uint32_t mask);
DESCRIPTION

       inotify_add_watch() adds a new watch, or modifies an existing watch,
       for the file whose location is specified in pathname; the caller must
       have read permission for this file.  The fd argument is a file
       descriptor referring to the inotify instance whose watch list is to
       be modified.  The events to be monitored for pathname are specified
       in the mask bit-mask argument.  See inotify(7) for a description of
       the bits that can be set in mask.

       A successful call to inotify_add_watch() returns a unique watch
       descriptor for this inotify instance, for the filesystem object that
       corresponds to pathname.  If the filesystem object was not previously
       being watched by this inotify instance, then the watch descriptor is
       newly allocated.  If the filesystem object was already being watched
       (perhaps via a different link to the same object), then the
       descriptor for the existing watch is returned.

       The watch descriptor is returned by later read(2)s from the inotify
       file descriptor.  These reads fetch inotify_event structures (see
       inotify(7)) indicating filesystem events; the watch descriptor inside
       this structure identifies the object for which the event occurred.
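
       To illustrate how a watch descriptor comes back through read(2),
       the sketch below reads one batch of events and reports the first.
       first_event() is a hypothetical helper; the buffer size of 4096
       bytes and the GCC/Clang alignment attribute are assumptions
       following the example in inotify(7).

```c
#include <assert.h>
#include <stdint.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Hypothetical helper: reads one batch of events from the inotify
 * descriptor ifd and reports the first one.  Returns its watch descriptor
 * and stores its mask in *mask_out, or returns -1 if nothing could be
 * read (for example EAGAIN on a nonblocking descriptor). */
int first_event(int ifd, uint32_t *mask_out)
{
    char buf[4096]
        __attribute__((aligned(__alignof__(struct inotify_event))));

    assert(mask_out != NULL);

    ssize_t n = read(ifd, buf, sizeof(buf));
    if (n < (ssize_t)sizeof(struct inotify_event))
        return -1;

    const struct inotify_event *ev = (const struct inotify_event *)buf;
    *mask_out = ev->mask;
    return ev->wd;
}
```

       A caller would typically compare the returned value with the
       descriptor obtained from inotify_add_watch(), for example after
       inotify_add_watch(ifd, "/tmp", IN_CREATE).  A complete reader
       would also walk the rest of the buffer, advancing by
       sizeof(struct inotify_event) + len for each event.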
http://man7.org/linux/man-pages/man2/inotify_rm_watch.2.html
10
SYSTEM CALL:
inotify_rm_watch(2) - Linux manual page
FUNCTIONALITY:

       inotify_rm_watch - remove an existing watch from an inotify instance
SYNOPSIS:

       #include <sys/inotify.h>

       int inotify_rm_watch(int fd, int wd);
DESCRIPTION

       inotify_rm_watch() removes the watch associated with the watch
       descriptor wd from the inotify instance associated with the file
       descriptor fd.

       Removing a watch causes an IN_IGNORED event to be generated for this
       watch descriptor.  (See inotify(7).)
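
       The IN_IGNORED event can be observed with a sketch like the one
       below, which removes a watch and then scans one batch of queued
       events for the matching notification.  The helper name
       remove_watch_and_confirm() is hypothetical, and the sketch assumes
       the inotify descriptor was opened with IN_NONBLOCK so that the
       read(2) cannot hang.

```c
#include <assert.h>
#include <sys/inotify.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: removes watch wd from ifd and looks for the
 * resulting IN_IGNORED event in one batch of queued events.
 * Returns 0 if it was seen, -1 otherwise. */
int remove_watch_and_confirm(int ifd, int wd)
{
    char buf[4096]
        __attribute__((aligned(__alignof__(struct inotify_event))));

    assert(ifd >= 0);

    if (inotify_rm_watch(ifd, wd) == -1)
        return -1;

    ssize_t n = read(ifd, buf, sizeof(buf));
    ssize_t off = 0;

    /* Events are variable-length: a fixed header followed by len bytes
     * of name.  A failed read leaves n at -1 and skips the loop. */
    while (off + (ssize_t)sizeof(struct inotify_event) <= n) {
        const struct inotify_event *ev =
            (const struct inotify_event *)(buf + off);
        if (ev->wd == wd && (ev->mask & IN_IGNORED))
            return 0;
        off += (ssize_t)(sizeof(*ev) + ev->len);
    }
    return -1;
}
```

       Calling the helper a second time with the same wd returns -1,
       because inotify_rm_watch() rejects a watch descriptor that is no
       longer valid.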
http://man7.org/linux/man-pages/man2/fanotify_init.2.html
11
SYSTEM CALL:
fanotify_init(2) - Linux manual page
FUNCTIONALITY:

       fanotify_init - create and initialize fanotify group
SYNOPSIS:

       #include <fcntl.h>
       #include <sys/fanotify.h>

       int fanotify_init(unsigned int flags, unsigned int event_f_flags);
DESCRIPTION

       For an overview of the fanotify API, see fanotify(7).

       fanotify_init() initializes a new fanotify group and returns a file
       descriptor for the event queue associated with the group.

       The file descriptor is used in calls to fanotify_mark(2) to specify
       the files, directories, and mounts for which fanotify events shall be
       created.  These events are received by reading from the file
       descriptor.  Some events are only informative, indicating that a file
       has been accessed.  Other events can be used to determine whether
       another application is permitted to access a file or directory.
       Permission to access filesystem objects is granted by writing to the
       file descriptor.

       Multiple programs may be using the fanotify interface at the same
       time to monitor the same files.

       In the current implementation, the number of fanotify groups per user
       is limited to 128.  This limit cannot be overridden.

       Calling fanotify_init() requires the CAP_SYS_ADMIN capability.  This
       constraint might be relaxed in future versions of the API.
       Therefore, certain additional capability checks have been implemented
       as indicated below.

       The flags argument contains a multi-bit field defining the
       notification class of the listening application and further single
       bit fields specifying the behavior of the file descriptor.

       If multiple listeners for permission events exist, the notification
       class is used to establish the sequence in which the listeners
       receive the events.

       Only one of the following notification classes may be specified in
       flags:

       FAN_CLASS_PRE_CONTENT
              This value allows the receipt of events notifying that a file
              has been accessed and events for permission decisions if a
              file may be accessed.  It is intended for event listeners that
              need to access files before they contain their final data.
              This notification class might be used by hierarchical storage
              managers, for example.

       FAN_CLASS_CONTENT
              This value allows the receipt of events notifying that a file
              has been accessed and events for permission decisions if a
              file may be accessed.  It is intended for event listeners that
              need to access files when they already contain their final
              content.  This notification class might be used by malware
              detection programs, for example.

       FAN_CLASS_NOTIF
              This is the default value.  It does not need to be specified.
              This value only allows the receipt of events notifying that a
              file has been accessed.  Permission decisions before the file
              is accessed are not possible.

       Listeners with different notification classes will receive events in
       the order FAN_CLASS_PRE_CONTENT, FAN_CLASS_CONTENT, FAN_CLASS_NOTIF.
       The order of notification for listeners in the same notification
       class is undefined.

       The following bits can additionally be set in flags:

       FAN_CLOEXEC
              Set the close-on-exec flag (FD_CLOEXEC) on the new file
              descriptor.  See the description of the O_CLOEXEC flag in
              open(2).

       FAN_NONBLOCK
              Enable the nonblocking flag (O_NONBLOCK) for the file
              descriptor.  Reading from the file descriptor will not block.
              Instead, if no data is available, read(2) will fail with the
              error EAGAIN.

       FAN_UNLIMITED_QUEUE
              Remove the limit of 16384 events for the event queue.  Use of
              this flag requires the CAP_SYS_ADMIN capability.

       FAN_UNLIMITED_MARKS
              Remove the limit of 8192 marks.  Use of this flag requires the
              CAP_SYS_ADMIN capability.

       The event_f_flags argument defines the file status flags that will be
       set on the open file descriptions that are created for fanotify
       events.  For details of these flags, see the description of the flags
       values in open(2).  event_f_flags includes a multi-bit field for the
       access mode.  This field can take the following values:

       O_RDONLY
              This value allows only read access.

       O_WRONLY
              This value allows only write access.

       O_RDWR This value allows read and write access.

       Additional bits can be set in event_f_flags.  The most useful values
       are:

       O_LARGEFILE
              Enable support for files exceeding 2 GB.  Failing to set this
              flag will result in an EOVERFLOW error when trying to open a
              large file which is monitored by an fanotify group on a 32-bit
              system.

       O_CLOEXEC (since Linux 3.18)
              Enable the close-on-exec flag for the file descriptor.  See
              the description of the O_CLOEXEC flag in open(2) for reasons
              why this may be useful.

       The following are also allowable: O_APPEND, O_DSYNC, O_NOATIME,
       O_NONBLOCK, and O_SYNC.  Specifying any other flag in event_f_flags
       yields the error EINVAL (but see BUGS).
http://man7.org/linux/man-pages/man2/fanotify_mark.2.html
11
SYSTEM CALL:
fanotify_mark(2) - Linux manual page
FUNCTIONALITY:

       fanotify_mark - add, remove, or modify an fanotify mark on a
       filesystem object
SYNOPSIS:

       #include <sys/fanotify.h>

       int fanotify_mark(int fanotify_fd, unsigned int flags,
                         uint64_t mask, int dirfd, const char *pathname);
DESCRIPTION

       For an overview of the fanotify API, see fanotify(7).

       fanotify_mark(2) adds, removes, or modifies an fanotify mark on a
       filesystem object.  The caller must have read permission on the
       filesystem object that is to be marked.

       The fanotify_fd argument is a file descriptor returned by
       fanotify_init(2).

       flags is a bit mask describing the modification to perform.  It must
       include exactly one of the following values:

       FAN_MARK_ADD
              The events in mask will be added to the mark mask (or to the
              ignore mask).  mask must be nonempty or the error EINVAL will
              occur.

       FAN_MARK_REMOVE
              The events in argument mask will be removed from the mark mask
              (or from the ignore mask).  mask must be nonempty or the error
              EINVAL will occur.

       FAN_MARK_FLUSH
              Remove either all mount or all non-mount marks from the
              fanotify group.  If flags contains FAN_MARK_MOUNT, all marks
              for mounts are removed from the group.  Otherwise, all marks
              for directories and files are removed.  No flag other than
              FAN_MARK_MOUNT can be used in conjunction with FAN_MARK_FLUSH.
              mask is ignored.

       If none of the values above is specified, or more than one is
       specified, the call fails with the error EINVAL.

       In addition, zero or more of the following values may be ORed into
       flags:

       FAN_MARK_DONT_FOLLOW
              If pathname is a symbolic link, mark the link itself, rather
              than the file to which it refers.  (By default,
              fanotify_mark() dereferences pathname if it is a symbolic
              link.)

       FAN_MARK_ONLYDIR
              If the filesystem object to be marked is not a directory, the
              error ENOTDIR shall be raised.

       FAN_MARK_MOUNT
              Mark the mount point specified by pathname.  If pathname is
              not itself a mount point, the mount point containing pathname
              will be marked.  All directories, subdirectories, and the
              contained files of the mount point will be monitored.

       FAN_MARK_IGNORED_MASK
              The events in mask shall be added to or removed from the
              ignore mask.

       FAN_MARK_IGNORED_SURV_MODIFY
              The ignore mask shall survive modify events.  If this flag is
              not set, the ignore mask is cleared when a modify event occurs
              for the ignored file or directory.

       mask defines which events shall be listened for (or which shall be
       ignored).  It is a bit mask composed of the following values:

       FAN_ACCESS
              Create an event when a file or directory (but see BUGS) is
              accessed (read).

       FAN_MODIFY
              Create an event when a file is modified (write).

       FAN_CLOSE_WRITE
              Create an event when a writable file is closed.

       FAN_CLOSE_NOWRITE
              Create an event when a read-only file or directory is closed.

       FAN_OPEN
              Create an event when a file or directory is opened.

       FAN_OPEN_PERM
              Create an event when a permission to open a file or directory
              is requested.  An fanotify file descriptor created with
              FAN_CLASS_PRE_CONTENT or FAN_CLASS_CONTENT is required.

       FAN_ACCESS_PERM
              Create an event when a permission to read a file or directory
              is requested.  An fanotify file descriptor created with
              FAN_CLASS_PRE_CONTENT or FAN_CLASS_CONTENT is required.

       FAN_ONDIR
              Create events for directories—for example, when opendir(3),
              readdir(3) (but see BUGS), and closedir(3) are called.
              Without this flag, only events for files are created.

       FAN_EVENT_ON_CHILD
              Events for the immediate children of marked directories shall
              be created.  The flag has no effect when marking mounts.  Note
              that events are not generated for children of the
              subdirectories of marked directories.  To monitor complete
              directory trees it is necessary to mark the relevant mount.

       The following composed value is defined:

       FAN_CLOSE
              A file is closed (FAN_CLOSE_WRITE|FAN_CLOSE_NOWRITE).

       The filesystem object to be marked is determined by the file
       descriptor dirfd and the pathname specified in pathname:

       *  If pathname is NULL, dirfd defines the filesystem object to be
          marked.

       *  If pathname is NULL, and dirfd takes the special value AT_FDCWD,
          the current working directory is to be marked.

       *  If pathname is absolute, it defines the filesystem object to be
          marked, and dirfd is ignored.

       *  If pathname is relative, and dirfd does not have the value
          AT_FDCWD, then the filesystem object to be marked is determined by
          interpreting pathname relative to the directory referred to by dirfd.

       *  If pathname is relative, and dirfd has the value AT_FDCWD, then
          the filesystem object to be marked is determined by interpreting
          pathname relative to the current working directory.
http://man7.org/linux/man-pages/man2/fadvise64.2.html
12
SYSTEM CALL:
posix_fadvise(2) - Linux manual page
FUNCTIONALITY:

       posix_fadvise - predeclare an access pattern for file data
SYNOPSIS:

       #include <fcntl.h>

       int posix_fadvise(int fd, off_t offset, off_t len, int advice);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       posix_fadvise():
           _POSIX_C_SOURCE >= 200112L
DESCRIPTION

       Programs can use posix_fadvise() to announce an intention to access
       file data in a specific pattern in the future, thus allowing the
       kernel to perform appropriate optimizations.

       The advice applies to a (not necessarily existent) region starting at
       offset and extending for len bytes (or until the end of the file if
       len is 0) within the file referred to by fd.  The advice is not
       binding; it merely constitutes an expectation on behalf of the
       application.

       Permissible values for advice include:

       POSIX_FADV_NORMAL
              Indicates that the application has no advice to give about its
              access pattern for the specified data.  If no advice is given
              for an open file, this is the default assumption.

       POSIX_FADV_SEQUENTIAL
              The application expects to access the specified data
              sequentially (with lower offsets read before higher ones).

       POSIX_FADV_RANDOM
              The specified data will be accessed in random order.

       POSIX_FADV_NOREUSE
              The specified data will be accessed only once.

       POSIX_FADV_WILLNEED
              The specified data will be accessed in the near future.

       POSIX_FADV_DONTNEED
              The specified data will not be accessed in the near future.
http://man7.org/linux/man-pages/man2/readahead.2.html
12
SYSTEM CALL:
readahead(2) - Linux manual page
FUNCTIONALITY:

       readahead - initiate file readahead into page cache
SYNOPSIS:

       #define _GNU_SOURCE             /* See feature_test_macros(7) */
       #include <fcntl.h>

       ssize_t readahead(int fd, off64_t offset, size_t count);
DESCRIPTION

       readahead() initiates readahead on a file so that subsequent reads
       from that file will be satisfied from the cache, and not block on
       disk I/O (assuming the readahead was initiated early enough and that
       other activity on the system did not in the meantime flush pages from
       the cache).

       The fd argument is a file descriptor identifying the file which is to
       be read.  The offset argument specifies the starting point from which
       data is to be read and count specifies the number of bytes to be
       read.  I/O is performed in whole pages, so that offset is effectively
       rounded down to a page boundary and bytes are read up to the next
       page boundary greater than or equal to (offset+count).  readahead()
       does not read beyond the end of the file.  The file offset of the
       open file description referred to by fd is left unchanged.
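
       A minimal sketch of prefetching the start of a file follows.
       prefetch_head() is a hypothetical helper; it opens the file only
       to issue the request, which is enough because the pages land in
       the shared page cache and not in a per-descriptor buffer.

```c
#define _GNU_SOURCE             /* for readahead(); see feature_test_macros(7) */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: asks the kernel to start loading the first count
 * bytes of path into the page cache.  Returns 0 on success, -1 on error. */
int prefetch_head(const char *path, size_t count)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd == -1)
        return -1;

    /* Offset 0 is already page-aligned; the end of the range is rounded
     * up to a page boundary, and nothing past end-of-file is read. */
    int rc = (readahead(fd, 0, count) == 0) ? 0 : -1;

    close(fd);
    return rc;
}
```

       A later open(2) and read(2) of the same file can then be served
       from the cache, provided the pages have not been evicted in the
       meantime.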
http://man7.org/linux/man-pages/man2/getrandom.2.html
12
SYSTEM CALL:
getrandom(2) - Linux manual page
FUNCTIONALITY:

       getrandom - obtain a series of random bytes
SYNOPSIS:

       #include <linux/random.h>

       int getrandom(void *buf, size_t buflen, unsigned int flags);
DESCRIPTION

       The getrandom() system call fills the buffer pointed to by buf with
       up to buflen random bytes.  These bytes can be used to seed user-
       space random number generators or for cryptographic purposes.

       getrandom() relies on entropy gathered from device drivers and other
       sources of environmental noise.  Unnecessarily reading large
       quantities of data will have a negative impact on other users of the
       /dev/random and /dev/urandom devices.  Therefore, getrandom() should
       not be used for Monte Carlo simulations or other programs/algorithms
       which are doing probabilistic sampling.

       By default, getrandom() draws entropy from the /dev/urandom pool.
       This behavior can be changed via the flags argument.  If the
       /dev/urandom pool has been initialized, reads of up to 256 bytes will
       always return as many bytes as requested and will not be interrupted
       by signals.  No such guarantees apply for larger buffer sizes.  For
       example, if the call is interrupted by a signal handler, it may
       return a partially filled buffer, or fail with the error EINTR.  If
       the pool has not yet been initialized, then the call blocks, unless
       GRND_NONBLOCK is specified in flags.

       The flags argument is a bit mask that can contain zero or more of the
       following values ORed together:

       GRND_RANDOM
              If this bit is set, then random bytes are drawn from the
              /dev/random pool instead of the /dev/urandom pool.  The
              /dev/random pool is limited based on the entropy that can be
              obtained from environmental noise.  If the number of available
              bytes in /dev/random is less than requested in buflen, the
              call returns just the available random bytes.  If no random
              bytes are available, the behavior depends on the presence of
              GRND_NONBLOCK in the flags argument.

       GRND_NONBLOCK
              By default, when reading from /dev/random, getrandom() blocks
              if no random bytes are available, and when reading from
              /dev/urandom, it blocks if the entropy pool has not yet been
              initialized.  If the GRND_NONBLOCK flag is set, then
              getrandom() does not block in these cases, but instead
              immediately returns -1 with errno set to EAGAIN.
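
       Because requests larger than 256 bytes may return short or fail
       with EINTR, callers usually loop until the buffer is full.  The
       sketch below assumes a C library that provides the getrandom()
       wrapper in <sys/random.h> with an ssize_t return type (glibc 2.25
       and later do), which differs from the header and return type
       shown in the synopsis above; fill_random() is a hypothetical
       helper name.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/random.h>
#include <sys/types.h>

/* Hypothetical helper: fills buf with exactly len random bytes from the
 * default source (flags == 0).  Returns 0 on success, -1 on error. */
int fill_random(void *buf, size_t len)
{
    unsigned char *p = buf;

    assert(buf != NULL || len == 0);

    while (len > 0) {
        ssize_t n = getrandom(p, len, 0);
        if (n == -1) {
            if (errno == EINTR)
                continue;       /* interrupted before any bytes: retry */
            return -1;
        }
        /* A short count is possible for large requests; keep going. */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```

       With flags set to 0 the call blocks only until the pool has been
       initialized; passing GRND_NONBLOCK instead would turn that wait
       into an EAGAIN failure, which this loop reports as an error.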
Linux network system calls
http://man7.org/linux/man-pages/man2/socket.2.html
11
SYSTEM CALL:
socket(2) - Linux manual page
FUNCTIONALITY:

       socket - create an endpoint for communication
SYNOPSIS:

       #include <sys/types.h>          /* See NOTES */
       #include <sys/socket.h>

       int socket(int domain, int type, int protocol);
DESCRIPTION

       socket() creates an endpoint for communication and returns a file
       descriptor that refers to that endpoint.

       The domain argument specifies a communication domain; this selects
       the protocol family which will be used for communication.  These
       families are defined in <sys/socket.h>.  The currently understood
       formats include:

       Name                Purpose                          Man page
       AF_UNIX, AF_LOCAL   Local communication              unix(7)
       AF_INET             IPv4 Internet protocols          ip(7)
       AF_INET6            IPv6 Internet protocols          ipv6(7)
       AF_IPX              IPX - Novell protocols
       AF_NETLINK          Kernel user interface device     netlink(7)
       AF_X25              ITU-T X.25 / ISO-8208 protocol   x25(7)
       AF_AX25             Amateur radio AX.25 protocol
       AF_ATMPVC           Access to raw ATM PVCs
       AF_APPLETALK        AppleTalk                        ddp(7)
       AF_PACKET           Low level packet interface       packet(7)
       AF_ALG              Interface to kernel crypto API

       The socket has the indicated type, which specifies the communication
       semantics.  Currently defined types are:

       SOCK_STREAM     Provides sequenced, reliable, two-way, connection-
                       based byte streams.  An out-of-band data transmission
                       mechanism may be supported.

       SOCK_DGRAM      Supports datagrams (connectionless, unreliable
                       messages of a fixed maximum length).

       SOCK_SEQPACKET  Provides a sequenced, reliable, two-way connection-
                       based data transmission path for datagrams of fixed
                       maximum length; a consumer is required to read an
                       entire packet with each input system call.

       SOCK_RAW        Provides raw network protocol access.

       SOCK_RDM        Provides a reliable datagram layer that does not
                       guarantee ordering.

       SOCK_PACKET     Obsolete and should not be used in new programs; see
                       packet(7).

       Some socket types may not be implemented by all protocol families.

       Since Linux 2.6.27, the type argument serves a second purpose: in
       addition to specifying a socket type, it may include the bitwise OR
       of any of the following values, to modify the behavior of socket():

       SOCK_NONBLOCK   Set the O_NONBLOCK file status flag on the new open
                       file description.  Using this flag saves extra calls
                       to fcntl(2) to achieve the same result.

       SOCK_CLOEXEC    Set the close-on-exec (FD_CLOEXEC) flag on the new
                       file descriptor.  See the description of the
                       O_CLOEXEC flag in open(2) for reasons why this may be
                       useful.

       The protocol specifies a particular protocol to be used with the
       socket.  Normally only a single protocol exists to support a
       particular socket type within a given protocol family, in which case
       protocol can be specified as 0.  However, it is possible that many
       protocols may exist, in which case a particular protocol must be
       specified in this manner.  The protocol number to use is specific to
       the “communication domain” in which communication is to take place;
       see protocols(5).  See getprotoent(3) on how to map protocol name
       strings to protocol numbers.

       Sockets of type SOCK_STREAM are full-duplex byte streams.  They do
       not preserve record boundaries.  A stream socket must be in a
       connected state before any data may be sent or received on it.  A
       connection to another socket is created with a connect(2) call.  Once
       connected, data may be transferred using read(2) and write(2) calls
       or some variant of the send(2) and recv(2) calls.  When a session has
       been completed a close(2) may be performed.  Out-of-band data may
       also be transmitted as described in send(2) and received as described
       in recv(2).

       The communications protocols which implement a SOCK_STREAM ensure
       that data is not lost or duplicated.  If a piece of data for which
       the peer protocol has buffer space cannot be successfully transmitted
       within a reasonable length of time, then the connection is considered
       to be dead.  When SO_KEEPALIVE is enabled on the socket the protocol
       checks in a protocol-specific manner if the other end is still alive.
       A SIGPIPE signal is raised if a process sends or receives on a broken
       stream; this causes naive processes, which do not handle the signal,
       to exit.  SOCK_SEQPACKET sockets employ the same system calls as
       SOCK_STREAM sockets.  The only difference is that read(2) calls will
       return only the amount of data requested, and any data remaining in
       the arriving packet will be discarded.  Also all message boundaries
       in incoming datagrams are preserved.

       SOCK_DGRAM and SOCK_RAW sockets allow sending of datagrams to
       correspondents named in sendto(2) calls.  Datagrams are generally
       received with recvfrom(2), which returns the next datagram along with
       the address of its sender.

       SOCK_PACKET is an obsolete socket type to receive raw packets
       directly from the device driver.  Use packet(7) instead.

       An fcntl(2) F_SETOWN operation can be used to specify a process or
       process group to receive a SIGURG signal when the out-of-band data
       arrives or SIGPIPE signal when a SOCK_STREAM connection breaks
       unexpectedly.  This operation may also be used to set the process or
       process group that receives the I/O and asynchronous notification of
       I/O events via SIGIO.  Using F_SETOWN is equivalent to an ioctl(2)
       call with the FIOSETOWN or SIOCSPGRP argument.

       When the network signals an error condition to the protocol module
       (e.g., using an ICMP message for IP) the pending error flag is set for
       the socket.  The next operation on this socket will return the error
       code of the pending error.  For some protocols it is possible to
       enable a per-socket error queue to retrieve detailed information
       about the error; see IP_RECVERR in ip(7).

       The operation of sockets is controlled by socket level options.
       These options are defined in <sys/socket.h>.  The functions
       setsockopt(2) and getsockopt(2) are used to set and get options,
       respectively.
http://man7.org/linux/man-pages/man2/socketpair.2.html
10
SYSTEM CALL:
socketpair(2) - Linux manual page
FUNCTIONALITY:

       socketpair - create a pair of connected sockets
SYNOPSIS:

       #include <sys/types.h>          /* See NOTES */
       #include <sys/socket.h>

       int socketpair(int domain, int type, int protocol, int sv[2]);
DESCRIPTION

       The socketpair() call creates an unnamed pair of connected sockets in
       the specified domain, of the specified type, and using the optionally
       specified protocol.  For further details of these arguments, see
       socket(2).

       The file descriptors used in referencing the new sockets are returned
       in sv[0] and sv[1].  The two sockets are indistinguishable.
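
       As a minimal sketch, the helper below creates a pair of AF_UNIX
       stream sockets, writes a message into one end, and reads it back
       from the other.  socketpair_roundtrip() is a hypothetical name,
       and the sketch assumes the message is small enough to fit in the
       socket buffer so that the write cannot block.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: sends msg into sv[0] and reads it from sv[1].
 * Returns the number of bytes received into out, or -1 on error. */
ssize_t socketpair_roundtrip(const char *msg, char *out, size_t outlen)
{
    int sv[2];

    assert(msg != NULL && out != NULL);

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;

    ssize_t n = -1;
    size_t len = strlen(msg);

    /* Either end may be used for writing; the sockets are equivalent. */
    if (write(sv[0], msg, len) == (ssize_t)len)
        n = read(sv[1], out, outlen);

    close(sv[0]);
    close(sv[1]);
    return n;
}
```

       In practice the two descriptors are usually split across a
       fork(2), with parent and child each closing the end they do not
       use.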
http://man7.org/linux/man-pages/man2/setsockopt.2.html
11
SYSTEM CALL:
getsockopt(2) - Linux manual page
FUNCTIONALITY:

       getsockopt, setsockopt - get and set options on sockets
SYNOPSIS:

       #include <sys/types.h>          /* See NOTES */
       #include <sys/socket.h>

       int getsockopt(int sockfd, int level, int optname,
                      void *optval, socklen_t *optlen);
       int setsockopt(int sockfd, int level, int optname,
                      const void *optval, socklen_t optlen);
DESCRIPTION

       getsockopt() and setsockopt() manipulate options for the socket
       referred to by the file descriptor sockfd.  Options may exist at
       multiple protocol levels; they are always present at the uppermost
       socket level.

       When manipulating socket options, the level at which the option
       resides and the name of the option must be specified.  To manipulate
       options at the sockets API level, level is specified as SOL_SOCKET.
       To manipulate options at any other level the protocol number of the
       appropriate protocol controlling the option is supplied.  For
       example, to indicate that an option is to be interpreted by the TCP
       protocol, level should be set to the protocol number of TCP; see
       getprotoent(3).

       The arguments optval and optlen are used to access option values for
       setsockopt().  For getsockopt() they identify a buffer in which the
       value for the requested option(s) is to be returned.  For
       getsockopt(), optlen is a value-result argument, initially containing
       the size of the buffer pointed to by optval, and modified on return
       to indicate the actual size of the value returned.  If no option
       value is to be supplied or returned, optval may be NULL.

       Optname and any specified options are passed uninterpreted to the
       appropriate protocol module for interpretation.  The include file
       <sys/socket.h> contains definitions for socket level options,
       described below.  Options at other protocol levels vary in format and
       name; consult the appropriate entries in section 4 of the manual.

       Most socket-level options utilize an int argument for optval.  For
       setsockopt(), the argument should be nonzero to enable a boolean
       option, or zero if the option is to be disabled.

       For a description of the available socket options see socket(7) and
       the appropriate protocol man pages.
http://man7.org/linux/man-pages/man2/getsockopt.2.html
11
SYSTEM CALL:
getsockopt(2) - Linux manual page
FUNCTIONALITY:

       getsockopt, setsockopt - get and set options on sockets
SYNOPSIS:

       #include <sys/types.h>          /* See NOTES */
       #include <sys/socket.h>

       int getsockopt(int sockfd, int level, int optname,
                      void *optval, socklen_t *optlen);
       int setsockopt(int sockfd, int level, int optname,
                      const void *optval, socklen_t optlen);
DESCRIPTION

       getsockopt() and setsockopt() manipulate options for the socket
       referred to by the file descriptor sockfd.  Options may exist at
       multiple protocol levels; they are always present at the uppermost
       socket level.

       When manipulating socket options, the level at which the option
       resides and the name of the option must be specified.  To manipulate
       options at the sockets API level, level is specified as SOL_SOCKET.
       To manipulate options at any other level the protocol number of the
       appropriate protocol controlling the option is supplied.  For
       example, to indicate that an option is to be interpreted by the TCP
       protocol, level should be set to the protocol number of TCP; see
       getprotoent(3).

       The arguments optval and optlen are used to access option values for
       setsockopt().  For getsockopt() they identify a buffer in which the
       value for the requested option(s) is to be returned.  For
       getsockopt(), optlen is a value-result argument, initially containing
       the size of the buffer pointed to by optval, and modified on return
       to indicate the actual size of the value returned.  If no option
       value is to be supplied or returned, optval may be NULL.

       Optname and any specified options are passed uninterpreted to the
       appropriate protocol module for interpretation.  The include file
       <sys/socket.h> contains definitions for socket level options,
       described below.  Options at other protocol levels vary in format and
       name; consult the appropriate entries in section 4 of the manual.

       Most socket-level options utilize an int argument for optval.  For
       setsockopt(), the argument should be nonzero to enable a boolean
       option, or zero if the option is to be disabled.

       For a description of the available socket options see socket(7) and
       the appropriate protocol man pages.
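
       The value-result behaviour of optlen can be illustrated with a small
       wrapper.  This is only a sketch: get_int_sockopt() is a hypothetical
       helper, and it assumes the option being queried is int-valued.

```c
#include <netinet/in.h>     /* IPPROTO_TCP, for callers using the TCP level */
#include <netinet/tcp.h>    /* TCP_NODELAY, for callers using the TCP level */
#include <sys/socket.h>

/* Fetch an int-valued option at the given level.
 * On success, *value_out holds the option value and *optlen_out holds the
 * size reported back by the kernel.  Returns 0 on success, -1 on error. */
int get_int_sockopt(int sockfd, int level, int optname,
                    int *value_out, socklen_t *optlen_out)
{
    socklen_t len = sizeof(*value_out);     /* in: size of the buffer */

    if (getsockopt(sockfd, level, optname, value_out, &len) == -1)
        return -1;
    *optlen_out = len;                      /* out: size actually returned */
    return 0;
}
```

       For a socket-level option the call would look like
       get_int_sockopt(fd, SOL_SOCKET, SO_TYPE, &v, &n); for a TCP-level
       option, level would be IPPROTO_TCP, as in
       get_int_sockopt(fd, IPPROTO_TCP, TCP_NODELAY, &v, &n).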
http://man7.org/linux/man-pages/man2/getsockname.2.html
10
SYSTEM CALL:
getsockname(2) - Linux manual page
FUNCTIONALITY:

       getsockname - get socket name
SYNOPSIS:

       #include <sys/socket.h>

       int getsockname(int sockfd, struct sockaddr *addr, socklen_t *addrlen);
DESCRIPTION

       getsockname() returns the current address to which the socket sockfd
       is bound, in the buffer pointed to by addr.  The addrlen argument
       should be initialized to indicate the amount of space (in bytes)
       pointed to by addr.  On return it contains the actual size of the
       socket address.

       The returned address is truncated if the buffer provided is too
       small; in this case, addrlen will return a value greater than was
       supplied to the call.
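
       A common use of getsockname() is to discover which port the kernel
       picked after binding to port 0.  The sketch below assumes an AF_INET
       datagram socket; bound_port_of_new_socket() is a hypothetical helper
       name.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bind a UDP socket to port 0 (the kernel picks a free port) and return
 * the port actually assigned, in host byte order, or -1 on error. */
int bound_port_of_new_socket(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);       /* in: space available in addr */
    int port = -1;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd == -1)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(0);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0 &&
        getsockname(fd, (struct sockaddr *)&addr, &len) == 0 &&
        len <= sizeof(addr))            /* a larger value means truncation */
        port = ntohs(addr.sin_port);

    close(fd);
    return port;
}
```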
http://man7.org/linux/man-pages/man2/getpeername.2.html
10
SYSTEM CALL:
getpeername(2) - Linux manual page
FUNCTIONALITY:

       getpeername - get name of connected peer socket
SYNOPSIS:

       #include <sys/socket.h>

       int getpeername(int sockfd, struct sockaddr *addr,
                       socklen_t *addrlen);
DESCRIPTION

       getpeername() returns the address of the peer connected to the socket
       sockfd, in the buffer pointed to by addr.  The addrlen argument
       should be initialized to indicate the amount of space pointed to by
       addr.  On return it contains the actual size of the name returned (in
       bytes).

       The returned address is truncated if the buffer provided is too
       small; in this case, addrlen will return a value greater than was
       supplied to the call.
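
       The truncation rule above can be checked by comparing the returned
       addrlen with the size that was supplied.  The following is a sketch
       under that reading; peer_address_truncated() is a hypothetical helper
       name.

```c
#include <sys/socket.h>

/* Ask for the peer address using only buf_size bytes of buffer space.
 * Returns 1 if the address was truncated, 0 if it fit, -1 on error
 * (for example ENOTCONN on an unconnected socket). */
int peer_address_truncated(int sockfd, socklen_t buf_size)
{
    struct sockaddr_storage buf;
    socklen_t len;

    if (buf_size > sizeof(buf))
        buf_size = sizeof(buf);         /* never claim more than we have */
    len = buf_size;                     /* in: space available */

    if (getpeername(sockfd, (struct sockaddr *)&buf, &len) == -1)
        return -1;
    return len > buf_size;              /* out: real size of the address */
}
```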
http://man7.org/linux/man-pages/man2/bind.2.html
12
SYSTEM CALL:
bind(2) - Linux manual page
FUNCTIONALITY:

       bind - bind a name to a socket
SYNOPSIS:

       #include <sys/types.h>          /* See NOTES */
       #include <sys/socket.h>

       int bind(int sockfd, const struct sockaddr *addr,
                socklen_t addrlen);
DESCRIPTION

       When a socket is created with socket(2), it exists in a name space
       (address family) but has no address assigned to it.  bind() assigns
       the address specified by addr to the socket referred to by the file
       descriptor sockfd.  addrlen specifies the size, in bytes, of the
       address structure pointed to by addr.  Traditionally, this operation
       is called “assigning a name to a socket”.

       It is normally necessary to assign a local address using bind()
       before a SOCK_STREAM socket may receive connections (see accept(2)).

       The rules used in name binding vary between address families.
       Consult the manual entries in Section 7 for detailed information.
       For AF_INET see ip(7), for AF_INET6 see ipv6(7), for AF_UNIX see
       unix(7), for AF_APPLETALK see ddp(7), for AF_PACKET see packet(7),
       for AF_X25 see x25(7) and for AF_NETLINK see netlink(7).

       The actual structure passed for the addr argument will depend on the
       address family.  The sockaddr structure is defined as something like:

           struct sockaddr {
               sa_family_t sa_family;
               char        sa_data[14];
           };

       The only purpose of this structure is to cast the structure pointer
       passed in addr in order to avoid compiler warnings.  See EXAMPLE
       below.
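
       The cast mentioned above might look like this for an AF_INET socket.
       This is an illustrative sketch rather than the page's own example;
       bind_loopback() is a hypothetical helper name and the loopback
       address is an arbitrary choice.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Bind sockfd to 127.0.0.1 on the given port (0 lets the kernel choose).
 * Returns 0 on success, -1 on error with errno set by bind(). */
int bind_loopback(int sockfd, unsigned short port)
{
    struct sockaddr_in addr;            /* address-family-specific type */

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    /* The generic struct sockaddr appears only in the cast. */
    return bind(sockfd, (const struct sockaddr *)&addr, sizeof(addr));
}
```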
http://man7.org/linux/man-pages/man2/listen.2.html
11
SYSTEM CALL:
listen(2) - Linux manual page
FUNCTIONALITY:

       listen - listen for connections on a socket
SYNOPSIS:

       #include <sys/types.h>          /* See NOTES */
       #include <sys/socket.h>

       int listen(int sockfd, int backlog);
DESCRIPTION

       listen() marks the socket referred to by sockfd as a passive socket,
       that is, as a socket that will be used to accept incoming connection
       requests using accept(2).

       The sockfd argument is a file descriptor that refers to a socket of
       type SOCK_STREAM or SOCK_SEQPACKET.

       The backlog argument defines the maximum length to which the queue of
       pending connections for sockfd may grow.  If a connection request
       arrives when the queue is full, the client may receive an error with
       an indication of ECONNREFUSED or, if the underlying protocol supports
       retransmission, the request may be ignored so that a later reattempt
       at connection succeeds.
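
       A minimal sketch of marking a socket as passive is shown below.  It
       assumes a TCP socket bound to the loopback address; listen_loopback()
       and is_listening() are hypothetical helper names, and SO_ACCEPTCONN
       (see socket(7)) is used only to observe the result.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a TCP socket listening on 127.0.0.1 at a kernel-chosen port.
 * Returns the listening descriptor, or -1 on error. */
int listen_loopback(int backlog)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd == -1)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1 ||
        listen(fd, backlog) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Returns 1 if sockfd is a listening socket, 0 if not, -1 on error. */
int is_listening(int sockfd)
{
    int value = 0;
    socklen_t len = sizeof(value);

    if (getsockopt(sockfd, SOL_SOCKET, SO_ACCEPTCONN, &value, &len) == -1)
        return -1;
    return value != 0;
}
```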
http://man7.org/linux/man-pages/man2/accept.2.html
12
SYSTEM CALL:
accept(2) - Linux manual page
FUNCTIONALITY:

       accept, accept4 - accept a connection on a socket
SYNOPSIS:

       #include <sys/types.h>          /* See NOTES */
       #include <sys/socket.h>

       int accept(int sockfd, struct sockaddr *addr, socklen_t *addrlen);

       #define _GNU_SOURCE             /* See feature_test_macros(7) */
       #include <sys/socket.h>

       int accept4(int sockfd, struct sockaddr *addr,
                   socklen_t *addrlen, int flags);
DESCRIPTION

       The accept() system call is used with connection-based socket types
       (SOCK_STREAM, SOCK_SEQPACKET).  It extracts the first connection
       request on the queue of pending connections for the listening socket,
       sockfd, creates a new connected socket, and returns a new file
       descriptor referring to that socket.  The newly created socket is not
       in the listening state.  The original socket sockfd is unaffected by
       this call.

       The argument sockfd is a socket that has been created with socket(2),
       bound to a local address with bind(2), and is listening for
       connections after a listen(2).

       The argument addr is a pointer to a sockaddr structure.  This
       structure is filled in with the address of the peer socket, as known
       to the communications layer.  The exact format of the address
       returned in addr is determined by the socket's address family (see
       socket(2) and the respective protocol man pages).  When addr is NULL,
       nothing is filled in; in this case, addrlen is not used, and should
       also be NULL.

       The addrlen argument is a value-result argument: the caller must
       initialize it to contain the size (in bytes) of the structure pointed
       to by addr; on return it will contain the actual size of the peer
       address.

       The returned address is truncated if the buffer provided is too
       small; in this case, addrlen will return a value greater than was
       supplied to the call.

       If no pending connections are present on the queue, and the socket is
       not marked as nonblocking, accept() blocks the caller until a
       connection is present.  If the socket is marked nonblocking and no
       pending connections are present on the queue, accept() fails with the
       error EAGAIN or EWOULDBLOCK.

       In order to be notified of incoming connections on a socket, you can
       use select(2) or poll(2).  A readable event will be delivered when a
       new connection is attempted and you may then call accept() to get a
       socket for that connection.  Alternatively, you can set the socket to
       deliver SIGIO when activity occurs on a socket; see socket(7) for
       details.

       For certain protocols which require an explicit confirmation, such as
       DECNet, accept() can be thought of as merely dequeuing the next
       connection request and not implying confirmation.  Confirmation can
       be implied by a normal read or write on the new file descriptor, and
       rejection can be implied by closing the new socket.  Currently, only
       DECNet has these semantics on Linux.

       If flags is 0, then accept4() is the same as accept().  The following
       values can be bitwise ORed in flags to obtain different behavior:

       SOCK_NONBLOCK   Set the O_NONBLOCK file status flag on the new open
                       file description.  Using this flag saves extra calls
                       to fcntl(2) to achieve the same result.

       SOCK_CLOEXEC    Set the close-on-exec (FD_CLOEXEC) flag on the new
                       file descriptor.  See the description of the
                       O_CLOEXEC flag in open(2) for reasons why this may be
                       useful.
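
       The socket(2), bind(2), listen(2), accept() sequence described above
       can be sketched as a self-contained loopback test.  Everything here
       is illustrative: accept_reports_peer() is a hypothetical function,
       and a single process plays both client and server for brevity.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Accept one loopback connection and check that the peer address filled
 * in by accept() matches the client's own local address.
 * Returns 1 on match, 0 on mismatch, -1 on error. */
int accept_reports_peer(void)
{
    struct sockaddr_in srv, peer, local;
    socklen_t len = sizeof(srv), peer_len, local_len = sizeof(local);
    int result = -1, afd = -1;
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    int cfd = socket(AF_INET, SOCK_STREAM, 0);

    memset(&srv, 0, sizeof(srv));
    srv.sin_family = AF_INET;
    srv.sin_addr.s_addr = htonl(INADDR_LOOPBACK);   /* port 0: kernel picks */

    if (lfd == -1 || cfd == -1 ||
        bind(lfd, (struct sockaddr *)&srv, sizeof(srv)) == -1 ||
        listen(lfd, 1) == -1 ||
        getsockname(lfd, (struct sockaddr *)&srv, &len) == -1)
        goto out;

    /* The kernel completes the handshake and queues the connection, so a
     * blocking connect() can return before accept() is called. */
    if (connect(cfd, (struct sockaddr *)&srv, sizeof(srv)) == -1)
        goto out;

    peer_len = sizeof(peer);            /* value-result argument */
    afd = accept(lfd, (struct sockaddr *)&peer, &peer_len);
    if (afd == -1 ||
        getsockname(cfd, (struct sockaddr *)&local, &local_len) == -1)
        goto out;

    result = peer_len == sizeof(peer) &&
             peer.sin_port == local.sin_port &&
             peer.sin_addr.s_addr == local.sin_addr.s_addr;
out:
    if (afd != -1) close(afd);
    if (cfd != -1) close(cfd);
    if (lfd != -1) close(lfd);
    return result;
}
```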
http://man7.org/linux/man-pages/man2/accept4.2.html
12
SYSTEM CALL:
accept(2) - Linux manual page
FUNCTIONALITY:

       accept, accept4 - accept a connection on a socket
SYNOPSIS:

       #include <sys/types.h>          /* See NOTES */
       #include <sys/socket.h>

       int accept(int sockfd, struct sockaddr *addr, socklen_t *addrlen);

       #define _GNU_SOURCE             /* See feature_test_macros(7) */
       #include <sys/socket.h>

       int accept4(int sockfd, struct sockaddr *addr,
                   socklen_t *addrlen, int flags);
DESCRIPTION

       The accept() system call is used with connection-based socket types
       (SOCK_STREAM, SOCK_SEQPACKET).  It extracts the first connection
       request on the queue of pending connections for the listening socket,
       sockfd, creates a new connected socket, and returns a new file
       descriptor referring to that socket.  The newly created socket is not
       in the listening state.  The original socket sockfd is unaffected by
       this call.

       The argument sockfd is a socket that has been created with socket(2),
       bound to a local address with bind(2), and is listening for
       connections after a listen(2).

       The argument addr is a pointer to a sockaddr structure.  This
       structure is filled in with the address of the peer socket, as known
       to the communications layer.  The exact format of the address
       returned in addr is determined by the socket's address family (see
       socket(2) and the respective protocol man pages).  When addr is NULL,
       nothing is filled in; in this case, addrlen is not used, and should
       also be NULL.

       The addrlen argument is a value-result argument: the caller must
       initialize it to contain the size (in bytes) of the structure pointed
       to by addr; on return it will contain the actual size of the peer
       address.

       The returned address is truncated if the buffer provided is too
       small; in this case, addrlen will return a value greater than was
       supplied to the call.

       If no pending connections are present on the queue, and the socket is
       not marked as nonblocking, accept() blocks the caller until a
       connection is present.  If the socket is marked nonblocking and no
       pending connections are present on the queue, accept() fails with the
       error EAGAIN or EWOULDBLOCK.

       In order to be notified of incoming connections on a socket, you can
       use select(2) or poll(2).  A readable event will be delivered when a
       new connection is attempted and you may then call accept() to get a
       socket for that connection.  Alternatively, you can set the socket to
       deliver SIGIO when activity occurs on a socket; see socket(7) for
       details.

       For certain protocols which require an explicit confirmation, such as
       DECNet, accept() can be thought of as merely dequeuing the next
       connection request and not implying confirmation.  Confirmation can
       be implied by a normal read or write on the new file descriptor, and
       rejection can be implied by closing the new socket.  Currently, only
       DECNet has these semantics on Linux.

       If flags is 0, then accept4() is the same as accept().  The following
       values can be bitwise ORed in flags to obtain different behavior:

       SOCK_NONBLOCK   Set the O_NONBLOCK file status flag on the new open
                       file description.  Using this flag saves extra calls
                       to fcntl(2) to achieve the same result.

       SOCK_CLOEXEC    Set the close-on-exec (FD_CLOEXEC) flag on the new
                       file descriptor.  See the description of the
                       O_CLOEXEC flag in open(2) for reasons why this may be
                       useful.
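
       The effect of the two accept4() flags can be observed with fcntl(2).
       The sketch below is an assumption-laden illustration: the function
       accept4_flag_bits() and its return encoding are hypothetical, and one
       process acts as both client and server over the loopback interface.

```c
#define _GNU_SOURCE             /* See feature_test_macros(7) */
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Accept one loopback connection with accept4(lfd, NULL, NULL, flags) and
 * report what was set on the new descriptor:
 * bit 0 = O_NONBLOCK, bit 1 = FD_CLOEXEC.  Returns -1 on error. */
int accept4_flag_bits(int flags)
{
    struct sockaddr_in srv;
    socklen_t len = sizeof(srv);
    int bits = -1, status_flags, fd_flags, afd = -1;
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    int cfd = socket(AF_INET, SOCK_STREAM, 0);

    memset(&srv, 0, sizeof(srv));
    srv.sin_family = AF_INET;
    srv.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    if (lfd == -1 || cfd == -1 ||
        bind(lfd, (struct sockaddr *)&srv, sizeof(srv)) == -1 ||
        listen(lfd, 1) == -1 ||
        getsockname(lfd, (struct sockaddr *)&srv, &len) == -1 ||
        connect(cfd, (struct sockaddr *)&srv, sizeof(srv)) == -1)
        goto out;

    afd = accept4(lfd, NULL, NULL, flags);  /* peer address not wanted */
    if (afd == -1)
        goto out;

    status_flags = fcntl(afd, F_GETFL);     /* file status flags */
    fd_flags = fcntl(afd, F_GETFD);         /* file descriptor flags */
    if (status_flags != -1 && fd_flags != -1)
        bits = ((status_flags & O_NONBLOCK) ? 1 : 0) |
               ((fd_flags & FD_CLOEXEC) ? 2 : 0);
out:
    if (afd != -1) close(afd);
    if (cfd != -1) close(cfd);
    if (lfd != -1) close(lfd);
    return bits;
}
```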
http://man7.org/linux/man-pages/man2/connect.2.html
11
SYSTEM CALL:
connect(2) - Linux manual page
FUNCTIONALITY:

       connect - initiate a connection on a socket
SYNOPSIS:

       #include <sys/types.h>          /* See NOTES */
       #include <sys/socket.h>

       int connect(int sockfd, const struct sockaddr *addr,
                   socklen_t addrlen);
DESCRIPTION

       The connect() system call connects the socket referred to by the file
       descriptor sockfd to the address specified by addr.  The addrlen
       argument specifies the size of addr.  The format of the address in
       addr is determined by the address space of the socket sockfd; see
       socket(2) for further details.

       If the socket sockfd is of type SOCK_DGRAM, then addr is the address
       to which datagrams are sent by default, and the only address from
       which datagrams are received.  If the socket is of type SOCK_STREAM
       or SOCK_SEQPACKET, this call attempts to make a connection to the
       socket that is bound to the address specified by addr.

       Generally, connection-based protocol sockets may successfully
       connect() only once; connectionless protocol sockets may use
       connect() multiple times to change their association.  Connectionless
       sockets may dissolve the association by connecting to an address with
       the sa_family member of sockaddr set to AF_UNSPEC (supported on Linux
       since kernel 2.2).
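
       For a connectionless socket, setting and dissolving the association
       might look like the sketch below.  It assumes an AF_INET datagram
       socket; the helper names udp_connect_loopback(), udp_disconnect(),
       and has_peer() are hypothetical.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Make 127.0.0.1:port the default destination of a datagram socket. */
int udp_connect_loopback(int sockfd, unsigned short port)
{
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    return connect(sockfd, (const struct sockaddr *)&addr, sizeof(addr));
}

/* Dissolve the association by connecting to an AF_UNSPEC address. */
int udp_disconnect(int sockfd)
{
    struct sockaddr addr;

    memset(&addr, 0, sizeof(addr));
    addr.sa_family = AF_UNSPEC;
    return connect(sockfd, &addr, sizeof(addr));
}

/* Returns 1 if the socket currently has a peer address, 0 otherwise. */
int has_peer(int sockfd)
{
    struct sockaddr_storage ss;
    socklen_t len = sizeof(ss);

    return getpeername(sockfd, (struct sockaddr *)&ss, &len) == 0;
}
```

       No datagram is sent by any of these calls; for SOCK_DGRAM, connect()
       only records the default peer.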
http://man7.org/linux/man-pages/man2/shutdown.2.html
11
SYSTEM CALL:
shutdown(2) - Linux manual page
FUNCTIONALITY:

       shutdown - shut down part of a full-duplex connection
SYNOPSIS:

       #include <sys/socket.h>

       int shutdown(int sockfd, int how);
DESCRIPTION

       The shutdown() call causes all or part of a full-duplex connection on
       the socket associated with sockfd to be shut down.  If how is
       SHUT_RD, further receptions will be disallowed.  If how is SHUT_WR,
       further transmissions will be disallowed.  If how is SHUT_RDWR,
       further receptions and transmissions will be disallowed.
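
       A typical use of SHUT_WR is to signal "no more data" to the peer
       while still being able to read its reply.  The following sketch
       assumes a connected stream socket; send_and_finish() is a
       hypothetical helper name.

```c
#include <sys/socket.h>

/* Send len bytes, then shut down the sending half of the connection so
 * that the peer sees end-of-file after reading them.  The receiving half
 * stays open.  Returns 0 on success, -1 on error or short send. */
int send_and_finish(int sockfd, const void *msg, size_t len)
{
    if (send(sockfd, msg, len, 0) != (ssize_t)len)
        return -1;
    return shutdown(sockfd, SHUT_WR);
}
```

       After this call the peer's recv(2) returns the data and then 0,
       while data sent by the peer can still be received on sockfd.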
http://man7.org/linux/man-pages/man2/recvfrom.2.html
11
SYSTEM CALL:
recv(2) - Linux manual page
FUNCTIONALITY:

       recv, recvfrom, recvmsg - receive a message from a socket
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/socket.h>

       ssize_t recv(int sockfd, void *buf, size_t len, int flags);

       ssize_t recvfrom(int sockfd, void *buf, size_t len, int flags,
                        struct sockaddr *src_addr, socklen_t *addrlen);

       ssize_t recvmsg(int sockfd, struct msghdr *msg, int flags);
DESCRIPTION

       The recv(), recvfrom(), and recvmsg() calls are used to receive
       messages from a socket.  They may be used to receive data on both
       connectionless and connection-oriented sockets.  This page first
       describes common features of all three system calls, and then
       describes the differences between the calls.

       The only difference between recv() and read(2) is the presence of
       flags.  With a zero flags argument, recv() is generally equivalent to
       read(2) (but see NOTES).  Also, the following call

           recv(sockfd, buf, len, flags);

       is equivalent to

           recvfrom(sockfd, buf, len, flags, NULL, NULL);

       All three calls return the length of the message on successful
       completion.  If a message is too long to fit in the supplied buffer,
       excess bytes may be discarded depending on the type of socket the
       message is received from.

       If no messages are available at the socket, the receive calls wait
       for a message to arrive, unless the socket is nonblocking (see
       fcntl(2)), in which case the value -1 is returned and the external
       variable errno is set to EAGAIN or EWOULDBLOCK.  The receive calls
       normally return any data available, up to the requested amount,
       rather than waiting for receipt of the full amount requested.

       An application can use select(2), poll(2), or epoll(7) to determine
       when more data arrives on a socket.
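
       Waiting for data with poll(2) before calling recv() can be sketched
       as follows.  The helper name recv_with_timeout() and the -2 return
       value for a timeout are conventions invented for this illustration.

```c
#include <poll.h>
#include <sys/socket.h>

/* Wait up to timeout_ms milliseconds for data, then receive it.
 * Returns the number of bytes received (possibly fewer than len),
 * 0 on end-of-file, -1 on error, or -2 if nothing arrived in time. */
ssize_t recv_with_timeout(int sockfd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = sockfd, .events = POLLIN };
    int ready = poll(&pfd, 1, timeout_ms);

    if (ready == -1)
        return -1;
    if (ready == 0)
        return -2;
    return recv(sockfd, buf, len, 0);
}
```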

   The flags argument
       The flags argument is formed by ORing one or more of the following
       values:

       MSG_CMSG_CLOEXEC (recvmsg() only; since Linux 2.6.23)
              Set the close-on-exec flag for the file descriptor received
              via a UNIX domain file descriptor using the SCM_RIGHTS
              operation (described in unix(7)).  This flag is useful for the
              same reasons as the O_CLOEXEC flag of open(2).

       MSG_DONTWAIT (since Linux 2.2)
              Enables nonblocking operation; if the operation would block,
              the call fails with the error EAGAIN or EWOULDBLOCK.  This
              provides similar behavior to setting the O_NONBLOCK flag (via
              the fcntl(2) F_SETFL operation), but differs in that
              MSG_DONTWAIT is a per-call option, whereas O_NONBLOCK is a
              setting on the open file description (see open(2)), which will
              affect all threads in the calling process as well as other
              processes that hold file descriptors referring to the same
              open file description.

       MSG_ERRQUEUE (since Linux 2.2)
              This flag specifies that queued errors should be received from
              the socket error queue.  The error is passed in an ancillary
              message with a type dependent on the protocol (for IPv4
              IP_RECVERR).  The user should supply a buffer of sufficient
              size.  See cmsg(3) and ip(7) for more information.  The
              payload of the original packet that caused the error is passed
              as normal data via msg_iov.  The original destination
              address of the datagram that caused the error is supplied via
              msg_name.

              For local errors, no address is passed (this can be checked
              with the cmsg_len member of the cmsghdr).  For error receives,
              the MSG_ERRQUEUE flag is set in the msghdr.  After an error has
              been passed, the pending socket error is regenerated based on
              the next queued error and will be passed on the next socket
              operation.

              The error is supplied in a sock_extended_err structure:

                  #define SO_EE_ORIGIN_NONE    0
                  #define SO_EE_ORIGIN_LOCAL   1
                  #define SO_EE_ORIGIN_ICMP    2
                  #define SO_EE_ORIGIN_ICMP6   3

                  struct sock_extended_err
                  {
                      uint32_t ee_errno;   /* error number */
                      uint8_t  ee_origin;  /* where the error originated */
                      uint8_t  ee_type;    /* type */
                      uint8_t  ee_code;    /* code */
                      uint8_t  ee_pad;     /* padding */
                      uint32_t ee_info;    /* additional information */
                      uint32_t ee_data;    /* other data */
                      /* More data may follow */
                  };

                  struct sockaddr *SO_EE_OFFENDER(struct sock_extended_err *);

              ee_errno contains the errno number of the queued error.
              ee_origin is the origin code of where the error originated.
              The other fields are protocol-specific.  The macro
              SO_EE_OFFENDER returns a pointer to the address of the
              network object where the error originated from given a pointer
              to the ancillary message.  If this address is not known, the
              sa_family member of the sockaddr contains AF_UNSPEC and the
              other fields of the sockaddr are undefined.  The payload of
              the packet that caused the error is passed as normal data.

       MSG_OOB
              This flag requests receipt of out-of-band data that would not
              be received in the normal data stream.  Some protocols place
              expedited data at the head of the normal data queue, and thus
              this flag cannot be used with such protocols.

       MSG_PEEK
              This flag causes the receive operation to return data from the
              beginning of the receive queue without removing that data from
              the queue.  Thus, a subsequent receive call will return the
              same data.

       MSG_TRUNC (since Linux 2.2)
              For raw (AF_PACKET), Internet datagram (since Linux
              2.4.27/2.6.8), netlink (since Linux 2.6.22), and UNIX datagram
              (since Linux 3.4) sockets: return the real length of the
              packet or datagram, even when it was longer than the passed
              buffer.

              For use with Internet stream sockets, see tcp(7).

       MSG_WAITALL (since Linux 2.2)
              This flag requests that the operation block until the full
              request is satisfied.  However, the call may still return less
              data than requested if a signal is caught, an error or
              disconnect occurs, or the next data to be received is of a
              different type than that returned.  This flag has no effect
              for datagram sockets.
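
       Two of the flags above, MSG_PEEK and MSG_DONTWAIT, are often
       combined to inspect pending data without consuming it or blocking.
       This is a sketch only; peek_nowait() and its -2 return value are
       hypothetical conventions.

```c
#include <errno.h>
#include <sys/socket.h>

/* Copy queued data into buf without removing it from the receive queue
 * and without blocking.  Returns the number of bytes peeked, 0 on
 * end-of-file, -1 on error, or -2 if the call would have blocked. */
ssize_t peek_nowait(int sockfd, void *buf, size_t len)
{
    ssize_t n = recv(sockfd, buf, len, MSG_PEEK | MSG_DONTWAIT);

    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return -2;
    return n;
}
```

       Because MSG_DONTWAIT is a per-call option, the socket's O_NONBLOCK
       setting is left untouched.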

   recvfrom()
       recvfrom() places the received message into the buffer buf.  The
       caller must specify the size of the buffer in len.

       If src_addr is not NULL, and the underlying protocol provides the
       source address of the message, that source address is placed in the
       buffer pointed to by src_addr.  In this case, addrlen is a value-
       result argument.  Before the call, it should be initialized to the
       size of the buffer associated with src_addr.  Upon return, addrlen is
       updated to contain the actual size of the source address.  The
       returned address is truncated if the buffer provided is too small; in
       this case, addrlen will return a value greater than was supplied to
       the call.

       If the caller is not interested in the source address, src_addr and
       addrlen should be specified as NULL.
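
       Receiving a datagram together with its source address might be
       sketched like this.  The example assumes an AF_INET datagram socket;
       recv_datagram_from() is a hypothetical helper name.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Receive one datagram and store the sender's IPv4 address in *src.
 * Returns the number of bytes received, or -1 on error or if the source
 * address returned was not a complete sockaddr_in. */
ssize_t recv_datagram_from(int sockfd, void *buf, size_t len,
                           struct sockaddr_in *src)
{
    socklen_t addrlen = sizeof(*src);   /* in: size of the address buffer */
    ssize_t n = recvfrom(sockfd, buf, len, 0,
                         (struct sockaddr *)src, &addrlen);

    if (n == -1)
        return -1;
    if (addrlen != sizeof(*src) || src->sin_family != AF_INET)
        return -1;                      /* truncated or unexpected family */
    return n;
}
```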

   recv()
       The recv() call is normally used only on a connected socket (see
       connect(2)).  It is equivalent to the call:

           recvfrom(sockfd, buf, len, flags, NULL, NULL);

   recvmsg()
       The recvmsg() call uses a msghdr structure to minimize the number of
       directly supplied arguments.  This structure is defined as follows in
       <sys/socket.h>:

           struct iovec {                    /* Scatter/gather array items */
               void  *iov_base;              /* Starting address */
               size_t iov_len;               /* Number of bytes to transfer */
           };

           struct msghdr {
               void         *msg_name;       /* optional address */
               socklen_t     msg_namelen;    /* size of address */
               struct iovec *msg_iov;        /* scatter/gather array */
               size_t        msg_iovlen;     /* # elements in msg_iov */
               void         *msg_control;    /* ancillary data, see below */
               size_t        msg_controllen; /* ancillary data buffer len */
               int           msg_flags;      /* flags on received message */
           };

       The msg_name field points to a caller-allocated buffer that is used
       to return the source address if the socket is unconnected.  The
       caller should set msg_namelen to the size of this buffer before this
       call; upon return from a successful call, msg_namelen will contain
       the length of the returned address.  If the application does not need
       to know the source address, msg_name can be specified as NULL.

       The fields msg_iov and msg_iovlen describe scatter-gather locations,
       as discussed in readv(2).

       The field msg_control, which has length msg_controllen, points to a
       buffer for other protocol control-related messages or miscellaneous
       ancillary data.  When recvmsg() is called, msg_controllen should
       contain the length of the available buffer in msg_control; upon
       return from a successful call it will contain the length of the
       control message sequence.

       The messages are of the form:

           struct cmsghdr {
               size_t cmsg_len;    /* Data byte count, including header
                                      (type is socklen_t in POSIX) */
               int    cmsg_level;  /* Originating protocol */
               int    cmsg_type;   /* Protocol-specific type */
           /* followed by
               unsigned char cmsg_data[]; */
           };

       Ancillary data should be accessed only by the macros defined in
       cmsg(3).

       As an example, Linux uses this ancillary data mechanism to pass
       extended errors, IP options, or file descriptors over UNIX domain
       sockets.

       The msg_flags field in the msghdr is set on return of recvmsg().  It
       can contain several flags:

       MSG_EOR
              indicates end-of-record; the data returned completed a record
              (generally used with sockets of type SOCK_SEQPACKET).

       MSG_TRUNC
              indicates that the trailing portion of a datagram was
              discarded because the datagram was larger than the buffer
              supplied.

       MSG_CTRUNC
              indicates that some control data were discarded due to lack of
              space in the buffer for ancillary data.

       MSG_OOB
              is returned to indicate that expedited or out-of-band data
              were received.

       MSG_ERRQUEUE
              indicates that no data was received, only an extended error from
              the socket error queue.
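
       The msg_iov scatter-gather fields and the MSG_TRUNC bit of msg_flags
       can be combined as in the sketch below, which splits one datagram
       into a header buffer and a body buffer.  recv_split() is a
       hypothetical helper; no address or ancillary data is requested.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Receive one datagram into two buffers, filling hdr first and then body.
 * *truncated is set to 1 if the datagram was larger than both buffers
 * together, otherwise 0.  Returns bytes stored, or -1 on error. */
ssize_t recv_split(int sockfd, void *hdr, size_t hdr_len,
                   void *body, size_t body_len, int *truncated)
{
    struct iovec iov[2] = {
        { .iov_base = hdr,  .iov_len = hdr_len  },
        { .iov_base = body, .iov_len = body_len },
    };
    struct msghdr msg;
    ssize_t n;

    memset(&msg, 0, sizeof(msg));       /* msg_name and msg_control unused */
    msg.msg_iov = iov;
    msg.msg_iovlen = 2;

    n = recvmsg(sockfd, &msg, 0);
    if (n == -1)
        return -1;
    *truncated = (msg.msg_flags & MSG_TRUNC) != 0;
    return n;
}
```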
http://man7.org/linux/man-pages/man2/recvmsg.2.html
11
SYSTEM CALL:
recv(2) - Linux manual page
FUNCTIONALITY:

       recv, recvfrom, recvmsg - receive a message from a socket
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/socket.h>

       ssize_t recv(int sockfd, void *buf, size_t len, int flags);

       ssize_t recvfrom(int sockfd, void *buf, size_t len, int flags,
                        struct sockaddr *src_addr, socklen_t *addrlen);

       ssize_t recvmsg(int sockfd, struct msghdr *msg, int flags);
DESCRIPTION

       The recv(), recvfrom(), and recvmsg() calls are used to receive
       messages from a socket.  They may be used to receive data on both
       connectionless and connection-oriented sockets.  This page first
       describes common features of all three system calls, and then
       describes the differences between the calls.

       The only difference between recv() and read(2) is the presence of
       flags.  With a zero flags argument, recv() is generally equivalent to
       read(2) (but see NOTES).  Also, the following call

           recv(sockfd, buf, len, flags);

       is equivalent to

           recvfrom(sockfd, buf, len, flags, NULL, NULL);

       All three calls return the length of the message on successful
       completion.  If a message is too long to fit in the supplied buffer,
       excess bytes may be discarded depending on the type of socket the
       message is received from.

       If no messages are available at the socket, the receive calls wait
       for a message to arrive, unless the socket is nonblocking (see
       fcntl(2)), in which case the value -1 is returned and the external
       variable errno is set to EAGAIN or EWOULDBLOCK.  The receive calls
       normally return any data available, up to the requested amount,
       rather than waiting for receipt of the full amount requested.
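
       The nonblocking case described above can be sketched with fcntl(2).
       The helper names set_nonblocking() and try_recv(), and the -2 return
       value for "no message yet", are assumptions made for this
       illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>

/* Set O_NONBLOCK on the socket's open file description. */
int set_nonblocking(int sockfd)
{
    int flags = fcntl(sockfd, F_GETFL);

    if (flags == -1)
        return -1;
    return fcntl(sockfd, F_SETFL, flags | O_NONBLOCK);
}

/* Receive whatever is available, up to len bytes.
 * Returns the number of bytes received, 0 on end-of-file, -1 on error,
 * or -2 if no message is available on a nonblocking socket. */
ssize_t try_recv(int sockfd, void *buf, size_t len)
{
    ssize_t n = recv(sockfd, buf, len, 0);

    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return -2;
    return n;
}
```

       Both errno values are tested because either may be reported.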

       An application can use select(2), poll(2), or epoll(7) to determine
       when more data arrives on a socket.

   The flags argument
       The flags argument is formed by ORing one or more of the following
       values:

       MSG_CMSG_CLOEXEC (recvmsg() only; since Linux 2.6.23)
              Set the close-on-exec flag for the file descriptor received
              via a UNIX domain file descriptor using the SCM_RIGHTS
              operation (described in unix(7)).  This flag is useful for the
              same reasons as the O_CLOEXEC flag of open(2).

       MSG_DONTWAIT (since Linux 2.2)
              Enables nonblocking operation; if the operation would block,
              the call fails with the error EAGAIN or EWOULDBLOCK.  This
              provides similar behavior to setting the O_NONBLOCK flag (via
              the fcntl(2) F_SETFL operation), but differs in that
              MSG_DONTWAIT is a per-call option, whereas O_NONBLOCK is a
              setting on the open file description (see open(2)), which will
              affect all threads in the calling process as well as other
              processes that hold file descriptors referring to the same
              open file description.

       MSG_ERRQUEUE (since Linux 2.2)
              This flag specifies that queued errors should be received from
              the socket error queue.  The error is passed in an ancillary
              message with a type dependent on the protocol (for IPv4
              IP_RECVERR).  The user should supply a buffer of sufficient
              size.  See cmsg(3) and ip(7) for more information.  The
              payload of the original packet that caused the error is passed
              as normal data via msg_iov.  The original destination
              address of the datagram that caused the error is supplied via
              msg_name.

              For local errors, no address is passed (this can be checked
              with the cmsg_len member of the cmsghdr).  For error receives,
              the MSG_ERRQUEUE is set in the msghdr.  After an error has
              been passed, the pending socket error is regenerated based on
              the next queued error and will be passed on the next socket
              operation.

              The error is supplied in a sock_extended_err structure:

                  #define SO_EE_ORIGIN_NONE    0
                  #define SO_EE_ORIGIN_LOCAL   1
                  #define SO_EE_ORIGIN_ICMP    2
                  #define SO_EE_ORIGIN_ICMP6   3

                  struct sock_extended_err
                  {
                      uint32_t ee_errno;   /* error number */
                      uint8_t  ee_origin;  /* where the error originated */
                      uint8_t  ee_type;    /* type */
                      uint8_t  ee_code;    /* code */
                      uint8_t  ee_pad;     /* padding */
                      uint32_t ee_info;    /* additional information */
                      uint32_t ee_data;    /* other data */
                      /* More data may follow */
                  };

                  struct sockaddr *SO_EE_OFFENDER(struct sock_extended_err *);

              ee_errno contains the errno number of the queued error.
              ee_origin is the origin code of where the error originated.
              The other fields are protocol-specific.  The macro
              SO_EE_OFFENDER returns a pointer to the address of the
              network object where the error originated from given a pointer
              to the ancillary message.  If this address is not known, the
              sa_family member of the sockaddr contains AF_UNSPEC and the
              other fields of the sockaddr are undefined.  The payload of
              the packet that caused the error is passed as normal data.

       MSG_OOB
              This flag requests receipt of out-of-band data that would not
              be received in the normal data stream.  Some protocols place
              expedited data at the head of the normal data queue, and thus
              this flag cannot be used with such protocols.

       MSG_PEEK
              This flag causes the receive operation to return data from the
              beginning of the receive queue without removing that data from
              the queue.  Thus, a subsequent receive call will return the
              same data.

       MSG_TRUNC (since Linux 2.2)
              For raw (AF_PACKET), Internet datagram (since Linux
              2.4.27/2.6.8), netlink (since Linux 2.6.22), and UNIX datagram
              (since Linux 3.4) sockets: return the real length of the
              packet or datagram, even when it was longer than the passed
              buffer.

              For use with Internet stream sockets, see tcp(7).

       MSG_WAITALL (since Linux 2.2)
              This flag requests that the operation block until the full
              request is satisfied.  However, the call may still return less
              data than requested if a signal is caught, an error or
              disconnect occurs, or the next data to be received is of a
              different type than that returned.  This flag has no effect
              for datagram sockets.
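
       The following is a minimal sketch of two of the flags above,
       MSG_PEEK and MSG_DONTWAIT, using a UNIX domain socket pair.  The
       helper name demo_peek_dontwait() is hypothetical and exists only
       for illustration; it is not part of any library.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if the observed behavior matches the
   flag descriptions, -1 otherwise. */
int demo_peek_dontwait(void)
{
    int sv[2];
    char a[8] = {0}, b[8] = {0}, c;
    int ok = -1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;

    if (send(sv[0], "ping", 4, 0) != 4)
        goto out;

    /* MSG_PEEK: the data is copied out but stays in the receive queue. */
    if (recv(sv[1], a, 4, MSG_PEEK) != 4)
        goto out;

    /* A normal receive returns the same bytes and consumes them. */
    if (recv(sv[1], b, 4, 0) != 4 || memcmp(a, b, 4) != 0)
        goto out;

    /* The queue is now empty, so a blocking recv() would wait here.
       MSG_DONTWAIT makes this one call fail instead. */
    if (recv(sv[1], &c, 1, MSG_DONTWAIT) == -1 &&
        (errno == EAGAIN || errno == EWOULDBLOCK))
        ok = 0;

out:
    close(sv[0]);
    close(sv[1]);
    return ok;
}
```

       Because MSG_DONTWAIT applies to a single call, the socket itself
       stays in blocking mode for every other caller.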

   recvfrom()
       recvfrom() places the received message into the buffer buf.  The
       caller must specify the size of the buffer in len.

       If src_addr is not NULL, and the underlying protocol provides the
       source address of the message, that source address is placed in the
       buffer pointed to by src_addr.  In this case, addrlen is a value-
       result argument.  Before the call, it should be initialized to the
       size of the buffer associated with src_addr.  Upon return, addrlen is
       updated to contain the actual size of the source address.  The
       returned address is truncated if the buffer provided is too small; in
       this case, addrlen will return a value greater than was supplied to
       the call.

       If the caller is not interested in the source address, src_addr and
       addrlen should be specified as NULL.
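
       The value-result use of addrlen can be sketched as follows.  The
       example assumes that IPv4 loopback is available; the helper name
       demo_recvfrom() and its out-parameter are hypothetical.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: sends one UDP datagram over loopback and returns
   the sender's port as reported by recvfrom(), or -1 on failure.
   *sender_port receives the port the sending socket is bound to. */
int demo_recvfrom(int *sender_port)
{
    int rx = socket(AF_INET, SOCK_DGRAM, 0);
    int tx = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr, src, txaddr;
    socklen_t len;
    char buf[16];
    int port = -1;

    assert(sender_port != NULL);
    if (rx == -1 || tx == -1)
        goto out;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                  /* let the kernel pick a port */
    if (bind(rx, (struct sockaddr *)&addr, sizeof(addr)) == -1)
        goto out;
    len = sizeof(addr);
    if (getsockname(rx, (struct sockaddr *)&addr, &len) == -1)
        goto out;

    if (sendto(tx, "hi", 2, 0, (struct sockaddr *)&addr, sizeof(addr)) != 2)
        goto out;
    len = sizeof(txaddr);
    if (getsockname(tx, (struct sockaddr *)&txaddr, &len) == -1)
        goto out;
    *sender_port = ntohs(txaddr.sin_port);

    len = sizeof(src);                  /* in: size of the address buffer */
    if (recvfrom(rx, buf, sizeof(buf), 0,
                 (struct sockaddr *)&src, &len) != 2)
        goto out;
    if (len == sizeof(src))             /* out: actual address size */
        port = ntohs(src.sin_port);

out:
    if (rx != -1)
        close(rx);
    if (tx != -1)
        close(tx);
    return port;
}
```

       A returned length larger than the supplied buffer size would
       indicate that the address was truncated.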

   recv()
       The recv() call is normally used only on a connected socket (see
       connect(2)).  It is equivalent to the call:

           recvfrom(fd, buf, len, flags, NULL, 0);

   recvmsg()
       The recvmsg() call uses a msghdr structure to minimize the number of
       directly supplied arguments.  This structure is defined as follows in
       <sys/socket.h>:

           struct iovec {                    /* Scatter/gather array items */
               void  *iov_base;              /* Starting address */
               size_t iov_len;               /* Number of bytes to transfer */
           };

           struct msghdr {
               void         *msg_name;       /* optional address */
               socklen_t     msg_namelen;    /* size of address */
               struct iovec *msg_iov;        /* scatter/gather array */
               size_t        msg_iovlen;     /* # elements in msg_iov */
               void         *msg_control;    /* ancillary data, see below */
               size_t        msg_controllen; /* ancillary data buffer len */
               int           msg_flags;      /* flags on received message */
           };

       The msg_name field points to a caller-allocated buffer that is used
       to return the source address if the socket is unconnected.  The
       caller should set msg_namelen to the size of this buffer before this
       call; upon return from a successful call, msg_namelen will contain
       the length of the returned address.  If the application does not need
       to know the source address, msg_name can be specified as NULL.

       The fields msg_iov and msg_iovlen describe scatter-gather locations,
       as discussed in readv(2).

       The field msg_control, which has length msg_controllen, points to a
       buffer for other protocol control-related messages or miscellaneous
       ancillary data.  When recvmsg() is called, msg_controllen should
       contain the length of the available buffer in msg_control; upon
       return from a successful call it will contain the length of the
       control message sequence.

       The messages are of the form:

           struct cmsghdr {
               size_t cmsg_len;    /* Data byte count, including header
                                      (type is socklen_t in POSIX) */
               int    cmsg_level;  /* Originating protocol */
               int    cmsg_type;   /* Protocol-specific type */
           /* followed by
               unsigned char cmsg_data[]; */
           };

       Ancillary data should be accessed only by the macros defined in
       cmsg(3).

       As an example, Linux uses this ancillary data mechanism to pass
       extended errors, IP options, or file descriptors over UNIX domain
       sockets.

       The msg_flags field in the msghdr is set on return of recvmsg().  It
       can contain several flags:

       MSG_EOR
              indicates end-of-record; the data returned completed a record
              (generally used with sockets of type SOCK_SEQPACKET).

       MSG_TRUNC
              indicates that the trailing portion of a datagram was
              discarded because the datagram was larger than the buffer
              supplied.

       MSG_CTRUNC
              indicates that some control data were discarded due to lack of
              space in the buffer for ancillary data.

       MSG_OOB
              is returned to indicate that expedited or out-of-band data
              were received.

       MSG_ERRQUEUE
              indicates that no data was received, but rather an extended
              error from the socket error queue.
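
       As an illustration of msg_iov scattering and of the MSG_TRUNC
       bit in msg_flags, the sketch below receives a 10-byte datagram
       into two buffers totaling 8 bytes.  The helper name
       demo_recvmsg_scatter() is hypothetical.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: sends the datagram "headerbody" and receives it
   into a 6-byte and a 2-byte buffer.  Returns the msg_flags value set
   by recvmsg(), or -1 on failure; *nread gets recvmsg()'s return value. */
int demo_recvmsg_scatter(char head[6], char body[2], ssize_t *nread)
{
    int sv[2];
    struct iovec iov[2];
    struct msghdr msg;
    int flags = -1;

    assert(nread != NULL);
    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) == -1)
        return -1;
    if (send(sv[0], "headerbody", 10, 0) != 10)
        goto out;

    iov[0].iov_base = head;
    iov[0].iov_len = 6;
    iov[1].iov_base = body;
    iov[1].iov_len = 2;

    /* No source address and no ancillary data are requested. */
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = iov;
    msg.msg_iovlen = 2;

    *nread = recvmsg(sv[1], &msg, 0);
    if (*nread != -1)
        flags = msg.msg_flags;

out:
    close(sv[0]);
    close(sv[1]);
    return flags;
}
```

       The buffers are filled in array order, and the two bytes that
       did not fit are discarded, which is reported through MSG_TRUNC.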
http://man7.org/linux/man-pages/man2/recvmmsg.2.html
12
SYSTEM CALL:
recvmmsg(2) - Linux manual page
FUNCTIONALITY:

       recvmmsg - receive multiple messages on a socket
SYNOPSIS:

       #define _GNU_SOURCE         /* See feature_test_macros(7) */
       #include <sys/socket.h>

       int recvmmsg(int sockfd, struct mmsghdr *msgvec, unsigned int vlen,
                    unsigned int flags, struct timespec *timeout);
DESCRIPTION

       The recvmmsg() system call is an extension of recvmsg(2) that allows
       the caller to receive multiple messages from a socket using a single
       system call.  (This has performance benefits for some applications.)
       A further extension over recvmsg(2) is support for a timeout on the
       receive operation.

       The sockfd argument is the file descriptor of the socket to receive
       data from.

       The msgvec argument is a pointer to an array of mmsghdr structures.
       The size of this array is specified in vlen.

       The mmsghdr structure is defined in <sys/socket.h> as:

           struct mmsghdr {
               struct msghdr msg_hdr;  /* Message header */
               unsigned int  msg_len;  /* Number of received bytes for header */
           };

       The msg_hdr field is a msghdr structure, as described in recvmsg(2).
       The msg_len field is the number of bytes returned for the message in
       the entry.  This field has the same value as the return value of a
       single recvmsg(2) on the header.

       The flags argument contains flags ORed together.  The flags are the
       same as documented for recvmsg(2), with the following addition:

       MSG_WAITFORONE (since Linux 2.6.34)
              Turns on MSG_DONTWAIT after the first message has been
              received.

       The timeout argument points to a struct timespec (see
       clock_gettime(2)) defining a timeout (seconds plus nanoseconds) for
       the receive operation (but see BUGS!).  (This interval will be
       rounded up to the system clock granularity, and kernel scheduling
       delays mean that the blocking interval may overrun by a small
       amount.)  If timeout is NULL, then the operation blocks indefinitely.

       A blocking recvmmsg() call blocks until vlen messages have been
       received or until the timeout expires.  A nonblocking call reads as
       many messages as are available (up to the limit specified by vlen)
       and returns immediately.

       On return from recvmmsg(), successive elements of msgvec are updated
       to contain information about each received message: msg_len contains
       the size of the received message; the subfields of msg_hdr are
       updated as described in recvmsg(2).  The return value of the call
       indicates the number of elements of msgvec that have been updated.
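
       A minimal sketch of a batched receive follows.  It queues three
       datagrams on a UNIX domain socket pair and collects them with
       one recvmmsg() call.  The helper name demo_recvmmsg() and the
       NMSG constant are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

#define NMSG 3

/* Hypothetical helper: returns the number of messages received and
   stores each message's msg_len in lens[]; returns -1 on failure. */
int demo_recvmmsg(unsigned int lens[NMSG])
{
    static const char *payload[NMSG] = { "a", "bb", "ccc" };
    int sv[2], i, n = -1;
    char bufs[NMSG][16];
    struct iovec iov[NMSG];
    struct mmsghdr msgs[NMSG];

    assert(lens != NULL);
    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) == -1)
        return -1;

    for (i = 0; i < NMSG; i++)
        if (send(sv[0], payload[i], strlen(payload[i]), 0) == -1)
            goto out;

    /* One msghdr per expected message, each with its own buffer. */
    memset(msgs, 0, sizeof(msgs));
    for (i = 0; i < NMSG; i++) {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len = sizeof(bufs[i]);
        msgs[i].msg_hdr.msg_iov = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    /* MSG_DONTWAIT: take whatever is queued, up to NMSG messages. */
    n = recvmmsg(sv[1], msgs, NMSG, MSG_DONTWAIT, NULL);
    for (i = 0; i < n; i++)
        lens[i] = msgs[i].msg_len;

out:
    close(sv[0]);
    close(sv[1]);
    return n;
}
```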
http://man7.org/linux/man-pages/man2/sendto.2.html
12
SYSTEM CALL:
send(2) - Linux manual page
FUNCTIONALITY:

       send, sendto, sendmsg - send a message on a socket
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/socket.h>

       ssize_t send(int sockfd, const void *buf, size_t len, int flags);

       ssize_t sendto(int sockfd, const void *buf, size_t len, int flags,
                      const struct sockaddr *dest_addr, socklen_t addrlen);

       ssize_t sendmsg(int sockfd, const struct msghdr *msg, int flags);
DESCRIPTION

       The system calls send(), sendto(), and sendmsg() are used to transmit
       a message to another socket.

       The send() call may be used only when the socket is in a connected
       state (so that the intended recipient is known).  The only difference
       between send() and write(2) is the presence of flags.  With a zero
       flags argument, send() is equivalent to write(2).  Also, the
       following call

           send(sockfd, buf, len, flags);

       is equivalent to

           sendto(sockfd, buf, len, flags, NULL, 0);

       The argument sockfd is the file descriptor of the sending socket.

       If sendto() is used on a connection-mode (SOCK_STREAM,
       SOCK_SEQPACKET) socket, the arguments dest_addr and addrlen are
       ignored (and the error EISCONN may be returned when they are not NULL
       and 0), and the error ENOTCONN is returned when the socket was not
       actually connected.  Otherwise, the address of the target is given by
       dest_addr with addrlen specifying its size.  For sendmsg(), the
       address of the target is given by msg.msg_name, with msg.msg_namelen
       specifying its size.

       For send() and sendto(), the message is found in buf and has length
       len.  For sendmsg(), the message is pointed to by the elements of the
       array msg.msg_iov.  The sendmsg() call also allows sending ancillary
       data (also known as control information).

       If the message is too long to pass atomically through the underlying
       protocol, the error EMSGSIZE is returned, and the message is not
       transmitted.

       No indication of failure to deliver is implicit in a send().  Locally
       detected errors are indicated by a return value of -1.

       When the message does not fit into the send buffer of the socket,
       send() normally blocks, unless the socket has been placed in
       nonblocking I/O mode.  In nonblocking mode it would fail with the
       error EAGAIN or EWOULDBLOCK in this case.  The select(2) call may be
       used to determine when it is possible to send more data.

   The flags argument
       The flags argument is the bitwise OR of zero or more of the following
       flags.

       MSG_CONFIRM (since Linux 2.3.15)
              Tell the link layer that forward progress happened: you got a
              successful reply from the other side.  If the link layer
              doesn't get this it will regularly reprobe the neighbor (e.g.,
              via a unicast ARP).  Only valid on SOCK_DGRAM and SOCK_RAW
              sockets and currently implemented only for IPv4 and IPv6.  See
              arp(7) for details.

       MSG_DONTROUTE
              Don't use a gateway to send out the packet; send only to hosts
              on directly connected networks.  This is usually used only by
              diagnostic or routing programs.  This is defined only for
              protocol families that route; packet sockets don't.

       MSG_DONTWAIT (since Linux 2.2)
              Enables nonblocking operation; if the operation would block,
              EAGAIN or EWOULDBLOCK is returned.  This provides similar
              behavior to setting the O_NONBLOCK flag (via the fcntl(2)
              F_SETFL operation), but differs in that MSG_DONTWAIT is a per-
              call option, whereas O_NONBLOCK is a setting on the open file
              description (see open(2)), which will affect all threads in
              the calling process as well as other processes that hold
              file descriptors referring to the same open file description.

       MSG_EOR (since Linux 2.2)
              Terminates a record (when this notion is supported, as for
              sockets of type SOCK_SEQPACKET).

       MSG_MORE (since Linux 2.4.4)
              The caller has more data to send.  This flag is used with TCP
              sockets to obtain the same effect as the TCP_CORK socket
              option (see tcp(7)), with the difference that this flag can be
              set on a per-call basis.

              Since Linux 2.6, this flag is also supported for UDP sockets,
              and informs the kernel to package all of the data sent in
              calls with this flag set into a single datagram which is
              transmitted only when a call is performed that does not
              specify this flag.  (See also the UDP_CORK socket option
              described in udp(7).)

       MSG_NOSIGNAL (since Linux 2.2)
              Don't generate a SIGPIPE signal if the peer on a stream-
              oriented socket has closed the connection.  The EPIPE error is
              still returned.  This provides similar behavior to using
              sigaction(2) to ignore SIGPIPE, but, whereas MSG_NOSIGNAL is a
              per-call feature, ignoring SIGPIPE sets a process attribute
              that affects all threads in the process.

       MSG_OOB
              Sends out-of-band data on sockets that support this notion
              (e.g., of type SOCK_STREAM); the underlying protocol must also
              support out-of-band data.

   sendmsg()
       The definition of the msghdr structure employed by sendmsg() is as
       follows:

           struct msghdr {
               void         *msg_name;       /* optional address */
               socklen_t     msg_namelen;    /* size of address */
               struct iovec *msg_iov;        /* scatter/gather array */
               size_t        msg_iovlen;     /* # elements in msg_iov */
               void         *msg_control;    /* ancillary data, see below */
               size_t        msg_controllen; /* ancillary data buffer len */
               int           msg_flags;      /* flags (unused) */
           };

       The msg_name field is used on an unconnected socket to specify the
       target address for a datagram.  It points to a buffer containing the
       address; the msg_namelen field should be set to the size of the
       address.  For a connected socket, these fields should be specified as
       NULL and 0, respectively.

       The msg_iov and msg_iovlen fields specify scatter-gather locations,
       as for writev(2).

       You may send control information using the msg_control and
       msg_controllen members.  The maximum control buffer length the kernel
       can process is limited per socket by the value in
       /proc/sys/net/core/optmem_max; see socket(7).

       The msg_flags field is ignored.
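
       As a sketch of the gather side of msg_iov, the example below
       sends two separate buffers as a single datagram on a connected
       UNIX domain socket pair.  The helper name demo_sendmsg_gather()
       and the payload strings are hypothetical.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: sends "GET " and "/index" as one datagram and
   receives it into out.  Returns the number of bytes received, or -1
   on failure. */
ssize_t demo_sendmsg_gather(char *out, size_t outlen)
{
    int sv[2];
    char verb[] = "GET ";
    char path[] = "/index";
    struct iovec iov[2];
    struct msghdr msg;
    ssize_t n = -1;

    assert(out != NULL);
    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) == -1)
        return -1;

    iov[0].iov_base = verb;
    iov[0].iov_len = 4;
    iov[1].iov_base = path;
    iov[1].iov_len = 6;

    /* Connected socket: msg_name stays NULL and msg_namelen stays 0. */
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = iov;
    msg.msg_iovlen = 2;

    if (sendmsg(sv[0], &msg, 0) == 10)
        n = recv(sv[1], out, outlen, 0);

    close(sv[0]);
    close(sv[1]);
    return n;
}
```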
http://man7.org/linux/man-pages/man2/sendmsg.2.html
12
SYSTEM CALL:
send(2) - Linux manual page
FUNCTIONALITY:

       send, sendto, sendmsg - send a message on a socket
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/socket.h>

       ssize_t send(int sockfd, const void *buf, size_t len, int flags);

       ssize_t sendto(int sockfd, const void *buf, size_t len, int flags,
                      const struct sockaddr *dest_addr, socklen_t addrlen);

       ssize_t sendmsg(int sockfd, const struct msghdr *msg, int flags);
DESCRIPTION

       The system calls send(), sendto(), and sendmsg() are used to transmit
       a message to another socket.

       The send() call may be used only when the socket is in a connected
       state (so that the intended recipient is known).  The only difference
       between send() and write(2) is the presence of flags.  With a zero
       flags argument, send() is equivalent to write(2).  Also, the
       following call

           send(sockfd, buf, len, flags);

       is equivalent to

           sendto(sockfd, buf, len, flags, NULL, 0);

       The argument sockfd is the file descriptor of the sending socket.

       If sendto() is used on a connection-mode (SOCK_STREAM,
       SOCK_SEQPACKET) socket, the arguments dest_addr and addrlen are
       ignored (and the error EISCONN may be returned when they are not NULL
       and 0), and the error ENOTCONN is returned when the socket was not
       actually connected.  Otherwise, the address of the target is given by
       dest_addr with addrlen specifying its size.  For sendmsg(), the
       address of the target is given by msg.msg_name, with msg.msg_namelen
       specifying its size.

       For send() and sendto(), the message is found in buf and has length
       len.  For sendmsg(), the message is pointed to by the elements of the
       array msg.msg_iov.  The sendmsg() call also allows sending ancillary
       data (also known as control information).

       If the message is too long to pass atomically through the underlying
       protocol, the error EMSGSIZE is returned, and the message is not
       transmitted.

       No indication of failure to deliver is implicit in a send().  Locally
       detected errors are indicated by a return value of -1.

       When the message does not fit into the send buffer of the socket,
       send() normally blocks, unless the socket has been placed in
       nonblocking I/O mode.  In nonblocking mode it would fail with the
       error EAGAIN or EWOULDBLOCK in this case.  The select(2) call may be
       used to determine when it is possible to send more data.

   The flags argument
       The flags argument is the bitwise OR of zero or more of the following
       flags.

       MSG_CONFIRM (since Linux 2.3.15)
              Tell the link layer that forward progress happened: you got a
              successful reply from the other side.  If the link layer
              doesn't get this it will regularly reprobe the neighbor (e.g.,
              via a unicast ARP).  Only valid on SOCK_DGRAM and SOCK_RAW
              sockets and currently implemented only for IPv4 and IPv6.  See
              arp(7) for details.

       MSG_DONTROUTE
              Don't use a gateway to send out the packet; send only to hosts
              on directly connected networks.  This is usually used only by
              diagnostic or routing programs.  This is defined only for
              protocol families that route; packet sockets don't.

       MSG_DONTWAIT (since Linux 2.2)
              Enables nonblocking operation; if the operation would block,
              EAGAIN or EWOULDBLOCK is returned.  This provides similar
              behavior to setting the O_NONBLOCK flag (via the fcntl(2)
              F_SETFL operation), but differs in that MSG_DONTWAIT is a per-
              call option, whereas O_NONBLOCK is a setting on the open file
              description (see open(2)), which will affect all threads in
              the calling process as well as other processes that hold
              file descriptors referring to the same open file description.

       MSG_EOR (since Linux 2.2)
              Terminates a record (when this notion is supported, as for
              sockets of type SOCK_SEQPACKET).

       MSG_MORE (since Linux 2.4.4)
              The caller has more data to send.  This flag is used with TCP
              sockets to obtain the same effect as the TCP_CORK socket
              option (see tcp(7)), with the difference that this flag can be
              set on a per-call basis.

              Since Linux 2.6, this flag is also supported for UDP sockets,
              and informs the kernel to package all of the data sent in
              calls with this flag set into a single datagram which is
              transmitted only when a call is performed that does not
              specify this flag.  (See also the UDP_CORK socket option
              described in udp(7).)

       MSG_NOSIGNAL (since Linux 2.2)
              Don't generate a SIGPIPE signal if the peer on a stream-
              oriented socket has closed the connection.  The EPIPE error is
              still returned.  This provides similar behavior to using
              sigaction(2) to ignore SIGPIPE, but, whereas MSG_NOSIGNAL is a
              per-call feature, ignoring SIGPIPE sets a process attribute
              that affects all threads in the process.

       MSG_OOB
              Sends out-of-band data on sockets that support this notion
              (e.g., of type SOCK_STREAM); the underlying protocol must also
              support out-of-band data.

   sendmsg()
       The definition of the msghdr structure employed by sendmsg() is as
       follows:

           struct msghdr {
               void         *msg_name;       /* optional address */
               socklen_t     msg_namelen;    /* size of address */
               struct iovec *msg_iov;        /* scatter/gather array */
               size_t        msg_iovlen;     /* # elements in msg_iov */
               void         *msg_control;    /* ancillary data, see below */
               size_t        msg_controllen; /* ancillary data buffer len */
               int           msg_flags;      /* flags (unused) */
           };

       The msg_name field is used on an unconnected socket to specify the
       target address for a datagram.  It points to a buffer containing the
       address; the msg_namelen field should be set to the size of the
       address.  For a connected socket, these fields should be specified as
       NULL and 0, respectively.

       The msg_iov and msg_iovlen fields specify scatter-gather locations,
       as for writev(2).

       You may send control information using the msg_control and
       msg_controllen members.  The maximum control buffer length the kernel
       can process is limited per socket by the value in
       /proc/sys/net/core/optmem_max; see socket(7).

       The msg_flags field is ignored.
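
       Control information sent through msg_control can be sketched
       with the SCM_RIGHTS message type described in unix(7), which
       passes a file descriptor over a UNIX domain socket.  The helper
       name demo_pass_fd() is hypothetical, and the example keeps both
       ends in one process only to stay short.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: passes the read end of a pipe across a socket
   pair, then reads one byte through the received descriptor.  Returns
   that byte, or -1 on failure. */
int demo_pass_fd(void)
{
    int sv[2] = { -1, -1 }, p[2] = { -1, -1 }, newfd = -1, result = -1;
    char dummy = '!', ch = 0;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {                             /* aligned control buffer */
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctl;
    struct msghdr msg;
    struct cmsghdr *cm;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;
    if (pipe(p) == -1)
        goto out;

    /* Sender: one byte of normal data plus one SCM_RIGHTS message. */
    memset(&msg, 0, sizeof(msg));
    memset(&ctl, 0, sizeof(ctl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);
    cm = CMSG_FIRSTHDR(&msg);
    cm->cmsg_level = SOL_SOCKET;
    cm->cmsg_type = SCM_RIGHTS;
    cm->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cm), &p[0], sizeof(int));
    if (sendmsg(sv[0], &msg, 0) != 1)
        goto out;

    /* Receiver: reuse the header with a cleared control buffer. */
    memset(&ctl, 0, sizeof(ctl));
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);
    if (recvmsg(sv[1], &msg, 0) != 1)
        goto out;
    cm = CMSG_FIRSTHDR(&msg);
    if (cm == NULL || cm->cmsg_level != SOL_SOCKET ||
        cm->cmsg_type != SCM_RIGHTS)
        goto out;
    memcpy(&newfd, CMSG_DATA(cm), sizeof(int));

    /* The received descriptor refers to the same pipe as p[0]. */
    if (write(p[1], "Z", 1) == 1 && read(newfd, &ch, 1) == 1)
        result = ch;

out:
    if (newfd != -1)
        close(newfd);
    if (p[0] != -1)
        close(p[0]);
    if (p[1] != -1)
        close(p[1]);
    close(sv[0]);
    close(sv[1]);
    return result;
}
```

       The control buffer is built and read only through the cmsg(3)
       macros, so padding and alignment stay the library's concern.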
http://man7.org/linux/man-pages/man2/sendmmsg.2.html
12
SYSTEM CALL:
sendmmsg(2) - Linux manual page
FUNCTIONALITY:

       sendmmsg - send multiple messages on a socket
SYNOPSIS:

       #define _GNU_SOURCE         /* See feature_test_macros(7) */
       #include <sys/socket.h>

       int sendmmsg(int sockfd, struct mmsghdr *msgvec, unsigned int vlen,
                    unsigned int flags);
DESCRIPTION

       The sendmmsg() system call is an extension of sendmsg(2) that allows
       the caller to transmit multiple messages on a socket using a single
       system call.  (This has performance benefits for some applications.)

       The sockfd argument is the file descriptor of the socket on which
       data is to be transmitted.

       The msgvec argument is a pointer to an array of mmsghdr structures.
       The size of this array is specified in vlen.

       The mmsghdr structure is defined in <sys/socket.h> as:

           struct mmsghdr {
               struct msghdr msg_hdr;  /* Message header */
               unsigned int  msg_len;  /* Number of bytes transmitted */
           };

       The msg_hdr field is a msghdr structure, as described in sendmsg(2).
       The msg_len field is used to return the number of bytes sent from the
       message in msg_hdr (i.e., the same as the return value from a single
       sendmsg(2) call).

       The flags argument contains flags ORed together.  The flags are the
       same as for sendmsg(2).

       A blocking sendmmsg() call blocks until vlen messages have been sent.
       A nonblocking call sends as many messages as possible (up to the
       limit specified by vlen) and returns immediately.

       On return from sendmmsg(), the msg_len fields of successive elements
       of msgvec are updated to contain the number of bytes transmitted from
       the corresponding msg_hdr.  The return value of the call indicates
       the number of elements of msgvec that have been updated.
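
       A minimal sketch of a batched send follows.  It transmits two
       datagrams with one sendmmsg() call and reports the msg_len value
       written back for each.  The helper name demo_sendmmsg() is
       hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: returns the number of messages sent and stores
   each message's msg_len in sent_len[]; returns -1 on failure. */
int demo_sendmmsg(unsigned int sent_len[2])
{
    int sv[2], i, n;
    char one[] = "one";
    char three[] = "three";
    struct iovec iov[2];
    struct mmsghdr msgs[2];

    assert(sent_len != NULL);
    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) == -1)
        return -1;

    iov[0].iov_base = one;
    iov[0].iov_len = strlen(one);
    iov[1].iov_base = three;
    iov[1].iov_len = strlen(three);

    /* One msghdr per datagram; the socket pair is already connected. */
    memset(msgs, 0, sizeof(msgs));
    for (i = 0; i < 2; i++) {
        msgs[i].msg_hdr.msg_iov = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    n = sendmmsg(sv[0], msgs, 2, 0);
    for (i = 0; i < n; i++)
        sent_len[i] = msgs[i].msg_len;

    close(sv[0]);
    close(sv[1]);
    return n;
}
```

       A return value smaller than vlen means that only the first
       elements of msgvec were sent, so a caller would typically retry
       from the first unsent element.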
http://man7.org/linux/man-pages/man2/sethostname.2.html
10
SYSTEM CALL:
gethostname(2) - Linux manual page
FUNCTIONALITY:

       gethostname, sethostname - get/set hostname
SYNOPSIS:

       #include <unistd.h>

       int gethostname(char *name, size_t len);
       int sethostname(const char *name, size_t len);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       gethostname():
           Since glibc 2.12: _BSD_SOURCE || _XOPEN_SOURCE >= 500
           || /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200112L
       sethostname():
           Since glibc 2.21:
               _DEFAULT_SOURCE
           In glibc 2.19 and 2.20:
               _DEFAULT_SOURCE || (_XOPEN_SOURCE && _XOPEN_SOURCE < 500)
           Up to and including glibc 2.19:
               _BSD_SOURCE || (_XOPEN_SOURCE && _XOPEN_SOURCE < 500)
DESCRIPTION

       These system calls are used to access or to change the hostname of
       the current processor.

       sethostname() sets the hostname to the value given in the character
       array name.  The len argument specifies the number of bytes in name.
       (Thus, name does not require a terminating null byte.)

       gethostname() returns the null-terminated hostname in the character
       array name, which has a length of len bytes.  If the null-terminated
       hostname is too large to fit, then the name is truncated, and no
       error is returned (but see NOTES below).  POSIX.1 says that if such
       truncation occurs, then it is unspecified whether the returned buffer
       includes a terminating null byte.
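
       Because termination is unspecified when truncation occurs, a
       caller may prefer to terminate the buffer itself.  The wrapper
       below is a sketch under that assumption; its name,
       get_hostname_terminated(), is hypothetical, and the fallback
       value for HOST_NAME_MAX is an assumption for systems whose
       <limits.h> does not define it.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>
#include <unistd.h>

#ifndef HOST_NAME_MAX
#define HOST_NAME_MAX 64    /* assumed fallback, not taken from the page */
#endif

/* Hypothetical helper: copies the hostname into dst and guarantees a
   terminating null byte.  Returns 0 on success, -1 on failure. */
int get_hostname_terminated(char *dst, size_t dstlen)
{
    assert(dst != NULL);
    if (dstlen == 0)
        return -1;
    if (gethostname(dst, dstlen) == -1)
        return -1;

    /* If the name was truncated, the buffer may lack a null byte. */
    dst[dstlen - 1] = '\0';
    return 0;
}
```

       A buffer of HOST_NAME_MAX + 1 bytes leaves room for the longest
       permitted name plus the terminator.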
http://man7.org/linux/man-pages/man2/setdomainname.2.html
10
SYSTEM CALL:
getdomainname(2) - Linux manual page
FUNCTIONALITY:

       getdomainname, setdomainname - get/set NIS domain name
SYNOPSIS:

       #include <unistd.h>

       int getdomainname(char *name, size_t len);
       int setdomainname(const char *name, size_t len);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       getdomainname(), setdomainname():
           Since glibc 2.21:
               _DEFAULT_SOURCE
           In glibc 2.19 and 2.20:
               _DEFAULT_SOURCE || (_XOPEN_SOURCE && _XOPEN_SOURCE < 500)
           Up to and including glibc 2.19:
               _BSD_SOURCE || (_XOPEN_SOURCE && _XOPEN_SOURCE < 500)
DESCRIPTION

       These functions are used to access or to change the NIS domain name
       of the host system.

       setdomainname() sets the domain name to the value given in the
       character array name.  The len argument specifies the number of bytes
       in name.  (Thus, name does not require a terminating null byte.)

       getdomainname() returns the null-terminated domain name in the
       character array name, which has a length of len bytes.  If the null-
       terminated domain name requires more than len bytes, getdomainname()
       returns the first len bytes (glibc) or gives an error (libc).
http://man7.org/linux/man-pages/man2/bpf.2.html
12
SYSTEM CALL:
bpf(2) - Linux manual page
FUNCTIONALITY:

       bpf - perform a command on an extended BPF map or program
SYNOPSIS:

       #include <linux/bpf.h>

       int bpf(int cmd, union bpf_attr *attr, unsigned int size);
DESCRIPTION

       The bpf() system call performs a range of operations related to
       extended Berkeley Packet Filters.  Extended BPF (or eBPF) is similar
       to the original ("classic") BPF (cBPF) used to filter network
       packets.  For both cBPF and eBPF programs, the kernel statically
       analyzes the programs before loading them, in order to ensure that
       they cannot harm the running system.

       eBPF extends cBPF in multiple ways, including the ability to call a
       fixed set of in-kernel helper functions (via the BPF_CALL opcode
       extension provided by eBPF) and access shared data structures such as
       eBPF maps.

   Extended BPF Design/Architecture
       eBPF maps are a generic data structure for storage of different data
       types.  Data types are generally treated as binary blobs, so a user
       just specifies the size of the key and the size of the value at map-
       creation time.  In other words, a key/value for a given map can have
       an arbitrary structure.

       A user process can create multiple maps (with key/value-pairs being
       opaque bytes of data) and access them via file descriptors.
       Different eBPF programs can access the same maps in parallel.  It's
       up to the user process and eBPF program to decide what they store
       inside maps.

       There's one special map type, called a program array.  This type of
       map stores file descriptors referring to other eBPF programs.  When a
       lookup in the map is performed, the program flow is redirected in-
       place to the beginning of another eBPF program and does not return
       to the calling program.  The level of nesting has a fixed limit
       of 32, so that infinite loops cannot be crafted.  At runtime, the
       program file descriptors stored in the map can be modified, so
       program functionality can be altered based on specific requirements.
       All programs referred to in a program-array map must have been
       previously loaded into the kernel via bpf().  If a map lookup fails,
       the current program continues its execution.  See
       BPF_MAP_TYPE_PROG_ARRAY below for further details.

       Generally, eBPF programs are loaded by the user process and
       automatically unloaded when the process exits.  In some cases, for
       example, tc-bpf(8), the program will continue to stay alive inside
       the kernel even after the process that loaded the program exits.  In
       that case, the tc subsystem holds a reference to the eBPF program
       after the file descriptor has been closed by the user-space program.
       Thus, whether a specific program continues to live inside the kernel
       depends on how it is further attached to a given kernel subsystem
       after it was loaded via bpf().

       Each eBPF program is a set of instructions that is safe to run until
       its completion.  An in-kernel verifier statically determines that the
       eBPF program terminates and is safe to execute.  During verification,
       the kernel increments reference counts for each of the maps that the
       eBPF program uses, so that the attached maps can't be removed until
       the program is unloaded.

       eBPF programs can be attached to different events.  These events can
       be the arrival of network packets, tracing events, classification
       events by network queueing disciplines (for eBPF programs attached
       to a tc(8) classifier), and other types that may be added in the
       future.  A new event triggers execution of the eBPF program, which
       may store information about the event in eBPF maps.  Beyond storing
       data, eBPF programs may call a fixed set of in-kernel helper
       functions.

       The same eBPF program can be attached to multiple events and
       different eBPF programs can access the same map:

           tracing     tracing    tracing    packet      packet     packet
           event A     event B    event C    on eth0     on eth1    on eth2
            |             |         |          |           |          ^
            |             |         |          |           v          |
            --> tracing <--     tracing      socket    tc ingress   tc egress
                 prog_1          prog_2      prog_3    classifier    action
                 |  |              |           |         prog_4      prog_5
              |---  -----|  |------|          map_3        |           |
            map_1       map_2                              --| map_4 |--

   Arguments
       The operation to be performed by the bpf() system call is determined
       by the cmd argument.  Each operation takes an accompanying argument,
       provided via attr, which is a pointer to a union of type bpf_attr
       (see below).  The size argument is the size of the union pointed to
       by attr.

       The value provided in cmd is one of the following:

       BPF_MAP_CREATE
              Create a map and return a file descriptor that refers to the
              map.  The close-on-exec file descriptor flag (see fcntl(2)) is
              automatically enabled for the new file descriptor.

       BPF_MAP_LOOKUP_ELEM
              Look up an element by key in a specified map and return its
              value.

       BPF_MAP_UPDATE_ELEM
              Create or update an element (key/value pair) in a specified
              map.

       BPF_MAP_DELETE_ELEM
              Look up and delete an element by key in a specified map.

       BPF_MAP_GET_NEXT_KEY
              Look up an element by key in a specified map and return the
              key of the next element.

       BPF_PROG_LOAD
              Verify and load an eBPF program, returning a new file
              descriptor associated with the program.  The close-on-exec
              file descriptor flag (see fcntl(2)) is automatically enabled
              for the new file descriptor.

       The bpf_attr union consists of various anonymous structures that are
       used by different bpf() commands:

           union bpf_attr {
               struct {    /* Used by BPF_MAP_CREATE */
                   __u32         map_type;
                   __u32         key_size;    /* size of key in bytes */
                   __u32         value_size;  /* size of value in bytes */
                   __u32         max_entries; /* maximum number of entries
                                                 in a map */
               };

               struct {    /* Used by BPF_MAP_*_ELEM and BPF_MAP_GET_NEXT_KEY
                              commands */
                   __u32         map_fd;
                   __aligned_u64 key;
                   union {
                       __aligned_u64 value;
                       __aligned_u64 next_key;
                   };
                   __u64         flags;
               };

               struct {    /* Used by BPF_PROG_LOAD */
                   __u32         prog_type;
                   __u32         insn_cnt;
                   __aligned_u64 insns;      /* 'const struct bpf_insn *' */
                   __aligned_u64 license;    /* 'const char *' */
                   __u32         log_level;  /* verbosity level of verifier */
                   __u32         log_size;   /* size of user buffer */
                   __aligned_u64 log_buf;    /* user supplied 'char *'
                                                buffer */
                   __u32         kern_version;
                                             /* checked when prog_type=kprobe
                                                (since Linux 4.1) */
               };
           } __attribute__((aligned(8)));

   eBPF maps
       Maps are a generic data structure for storage of different types of
       data.  They allow sharing of data between eBPF kernel programs, and
       also between kernel and user-space applications.

       Each map type has the following attributes:

       *  type
       *  maximum number of elements
       *  key size in bytes
       *  value size in bytes

       The following wrapper functions demonstrate how various bpf()
       commands can be used to access the maps.  The functions use the cmd
       argument to invoke different operations.
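
       The wrappers below call bpf() and ptr_to_u64(), neither of which is
       defined in this section.  A minimal sketch of both follows.  It
       assumes that the C library provides no bpf() wrapper, so the call is
       made through syscall(2) with __NR_bpf, and that <linux/bpf.h> from
       the kernel headers is installed.  ptr_to_u64() is an illustrative
       helper here, not a kernel or library API.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for the syscall() declaration */
#endif
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* Invoke the raw system call; returns -1 and sets errno on error. */
static int
bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return (int) syscall(__NR_bpf, cmd, attr, size);
}

/* bpf_attr carries user-space pointers as 64-bit integers, so that the
   layout is the same for 32-bit and 64-bit callers. */
static uint64_t
ptr_to_u64(const void *ptr)
{
    return (uint64_t) (uintptr_t) ptr;
}
```

       Going through uintptr_t keeps the pointer-to-integer conversion
       well defined on 32-bit systems, where a pointer is narrower than
       the 64-bit field that holds it.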

       BPF_MAP_CREATE
              The BPF_MAP_CREATE command creates a new map, returning a new
              file descriptor that refers to the map.

                  int
                  bpf_create_map(enum bpf_map_type map_type,
                                 unsigned int key_size,
                                 unsigned int value_size,
                                 unsigned int max_entries)
                  {
                      union bpf_attr attr = {
                          .map_type    = map_type,
                          .key_size    = key_size,
                          .value_size  = value_size,
                          .max_entries = max_entries
                      };

                      return bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
                  }

              The new map has the type specified by map_type, and attributes
              as specified in key_size, value_size, and max_entries.  On
              success, this operation returns a file descriptor.  On error,
              -1 is returned and errno is set to EINVAL, EPERM, or ENOMEM.

              The key_size and value_size attributes will be used by the
              verifier during program loading to check that the program is
              calling bpf_map_*_elem() helper functions with a correctly
              initialized key and to check that the program doesn't access
              the map element value beyond the specified value_size.  For
              example, when a map is created with a key_size of 8 and the
              eBPF program calls

                  bpf_map_lookup_elem(map_fd, fp - 4)

              the program will be rejected, since the in-kernel helper
              function

                  bpf_map_lookup_elem(map_fd, void *key)

              expects to read 8 bytes from the location pointed to by key,
              but the fp - 4 (where fp is the top of the stack) starting
              address will cause out-of-bounds stack access.

              Similarly, when a map is created with a value_size of 1 and
              the eBPF program contains

                  value = bpf_map_lookup_elem(...);
                  *(u32 *) value = 1;

              the program will be rejected, since it accesses the value
              pointer beyond the specified 1 byte value_size limit.

              Currently, the following values are supported for map_type:

                  enum bpf_map_type {
                      BPF_MAP_TYPE_UNSPEC,  /* Reserve 0 as invalid map type */
                      BPF_MAP_TYPE_HASH,
                      BPF_MAP_TYPE_ARRAY,
                      BPF_MAP_TYPE_PROG_ARRAY,
                  };

              map_type selects one of the available map implementations in
              the kernel.  For all map types, eBPF programs access maps with
              the same bpf_map_lookup_elem() and bpf_map_update_elem()
              helper functions.  Further details of the various map types
              are given below.
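
              The two rejection examples above come down to simple bounds
              arithmetic.  The sketch below models only that arithmetic in
              user space; it is not the kernel verifier's code.  The
              function names are hypothetical, and the 512-byte stack size
              is an assumption that is not stated in this page.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_STACK_SIZE 512    /* assumed eBPF stack size, in bytes */

/* A helper reads `size` bytes starting at fp + off, where fp is the top
   of the stack and off is negative.  The access is in bounds only if
   it stays inside [fp - MODEL_STACK_SIZE, fp). */
static bool
stack_access_ok(int32_t off, uint32_t size)
{
    return off < 0 && off >= -MODEL_STACK_SIZE &&
           (int64_t) off + size <= 0;
}

/* A program touches `size` bytes at byte offset `off` inside a map
   value that is `value_size` bytes long. */
static bool
value_access_ok(uint32_t value_size, uint32_t off, uint32_t size)
{
    return (uint64_t) off + size <= value_size;
}
```

              With a key_size of 8, stack_access_ok(-4, 8) is false, which
              matches the fp - 4 example, while stack_access_ok(-8, 8) is
              true.  With a value_size of 1, value_access_ok(1, 0, 4) is
              false, which matches the u32 store example.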

       BPF_MAP_LOOKUP_ELEM
              The BPF_MAP_LOOKUP_ELEM command looks up an element with a
              given key in the map referred to by the file descriptor fd.

                  int
                  bpf_lookup_elem(int fd, const void *key, void *value)
                  {
                      union bpf_attr attr = {
                          .map_fd = fd,
                          .key    = ptr_to_u64(key),
                          .value  = ptr_to_u64(value),
                      };

                      return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
                  }

              If an element is found, the operation returns zero and stores
              the element's value into value, which must point to a buffer
              of value_size bytes.

              If no element is found, the operation returns -1 and sets
              errno to ENOENT.

       BPF_MAP_UPDATE_ELEM
              The BPF_MAP_UPDATE_ELEM command creates or updates an element
              with a given key/value in the map referred to by the file
              descriptor fd.

                  int
                  bpf_update_elem(int fd, const void *key, const void *value,
                                  uint64_t flags)
                  {
                      union bpf_attr attr = {
                          .map_fd = fd,
                          .key    = ptr_to_u64(key),
                          .value  = ptr_to_u64(value),
                          .flags  = flags,
                      };

                      return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
                  }

              The flags argument should be specified as one of the
              following:

              BPF_ANY
                     Create a new element or update an existing element.

              BPF_NOEXIST
                     Create a new element only if it did not exist.

              BPF_EXIST
                     Update an existing element.

              On success, the operation returns zero.  On error, -1 is
              returned and errno is set to EINVAL, EPERM, ENOMEM, or E2BIG.
              E2BIG indicates that the number of elements in the map reached
              the max_entries limit specified at map creation time.  EEXIST
              will be returned if flags specifies BPF_NOEXIST and the
              element with key already exists in the map.  ENOENT will be
              returned if flags specifies BPF_EXIST and the element with key
              doesn't exist in the map.

       BPF_MAP_DELETE_ELEM
              The BPF_MAP_DELETE_ELEM command deletes the element whose key
              is key from the map referred to by the file descriptor fd.

                  int
                  bpf_delete_elem(int fd, const void *key)
                  {
                      union bpf_attr attr = {
                          .map_fd = fd,
                          .key    = ptr_to_u64(key),
                      };

                      return bpf(BPF_MAP_DELETE_ELEM, &attr, sizeof(attr));
                  }

              On success, zero is returned.  If the element is not found, -1
              is returned and errno is set to ENOENT.

       BPF_MAP_GET_NEXT_KEY
              The BPF_MAP_GET_NEXT_KEY command looks up an element by key in
              the map referred to by the file descriptor fd and sets the
              next_key pointer to the key of the next element.

                  int
                  bpf_get_next_key(int fd, const void *key, void *next_key)
                  {
                      union bpf_attr attr = {
                          .map_fd   = fd,
                          .key      = ptr_to_u64(key),
                          .next_key = ptr_to_u64(next_key),
                      };

                      return bpf(BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr));
                  }

              If key is found, the operation returns zero and sets the
              next_key pointer to the key of the next element.  If key is
              not found, the operation returns zero and sets the next_key
              pointer to the key of the first element.  If key is the last
              element, -1 is returned and errno is set to ENOENT.  Other
              possible errno values are ENOMEM, EFAULT, EPERM, and EINVAL.
              This method can be used to iterate over all elements in the
              map.

       close(map_fd)
              Delete the map referred to by the file descriptor map_fd.
              When the user-space program that created a map exits, all maps
              will be deleted automatically (but see NOTES).
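
       The map commands described above can be exercised together.  The
       following sketch creates a small hash map and checks the documented
       return values and errno settings step by step.  map_cmd() and
       map_round_trip() are hypothetical names.  The sketch assumes that
       <linux/bpf.h> is installed and calls syscall(2) directly.  Unlike
       the wrappers above, it clears the whole bpf_attr with memset(3)
       before filling in fields; this is a cautious choice, not something
       this page requires.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for the syscall() declaration */
#endif
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* Issue one map command with a fully zeroed bpf_attr. */
static int
map_cmd(int cmd, int fd, const void *key, void *val, uint64_t flags)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_fd = fd;
    attr.key    = (uint64_t) (uintptr_t) key;
    attr.value  = (uint64_t) (uintptr_t) val;  /* shares storage with next_key */
    attr.flags  = flags;
    return (int) syscall(__NR_bpf, cmd, &attr, sizeof(attr));
}

/* Returns 0 if every step behaved as documented, -1 if the map could
   not be created (for example, because of insufficient privilege), or
   the positive number of the first step that gave another result. */
int
map_round_trip(void)
{
    union bpf_attr attr;
    uint32_t key = 1, missing = 2, next = 0;
    uint64_t val = 42, got = 0;
    int fd, step = 0;

    memset(&attr, 0, sizeof(attr));
    attr.map_type    = BPF_MAP_TYPE_HASH;
    attr.key_size    = sizeof(key);
    attr.value_size  = sizeof(val);
    attr.max_entries = 4;
    fd = (int) syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
    if (fd == -1)
        return -1;

    step = 1;   /* BPF_NOEXIST creates the element */
    if (map_cmd(BPF_MAP_UPDATE_ELEM, fd, &key, &val, BPF_NOEXIST) != 0)
        goto done;
    step = 2;   /* a second BPF_NOEXIST fails with EEXIST */
    if (map_cmd(BPF_MAP_UPDATE_ELEM, fd, &key, &val, BPF_NOEXIST) != -1 ||
        errno != EEXIST)
        goto done;
    step = 3;   /* BPF_EXIST will not create a missing element */
    if (map_cmd(BPF_MAP_UPDATE_ELEM, fd, &missing, &val, BPF_EXIST) != -1 ||
        errno != ENOENT)
        goto done;
    step = 4;   /* lookup copies the stored value out */
    if (map_cmd(BPF_MAP_LOOKUP_ELEM, fd, &key, &got, 0) != 0 || got != val)
        goto done;
    step = 5;   /* a key that is not in the map yields the first key */
    if (map_cmd(BPF_MAP_GET_NEXT_KEY, fd, &missing, &next, 0) != 0 ||
        next != key)
        goto done;
    step = 6;   /* the only key is also the last one */
    if (map_cmd(BPF_MAP_GET_NEXT_KEY, fd, &key, &next, 0) != -1 ||
        errno != ENOENT)
        goto done;
    step = 7;   /* delete succeeds once ... */
    if (map_cmd(BPF_MAP_DELETE_ELEM, fd, &key, NULL, 0) != 0)
        goto done;
    step = 8;   /* ... and then reports ENOENT */
    if (map_cmd(BPF_MAP_DELETE_ELEM, fd, &key, NULL, 0) != -1 ||
        errno != ENOENT)
        goto done;
    step = 0;
done:
    close(fd);  /* drops this process's reference to the map */
    return step;
}
```

       Steps 5 and 6 are the two halves of the iteration idiom: start from
       a key that is not in the map to obtain the first key, then feed each
       returned key back in until ENOENT is reported.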

   eBPF map types
       The following map types are supported:

       BPF_MAP_TYPE_HASH
              Hash-table maps have the following characteristics:

              *  Maps are created and destroyed by user-space programs.
                 Both user-space and eBPF programs can perform lookup,
                 update, and delete operations.

              *  The kernel takes care of allocating and freeing key/value
                 pairs.

              *  The map_update_elem() helper will fail to insert a new
                 element when the max_entries limit is reached.  (This
                 ensures that eBPF programs cannot exhaust memory.)

              *  map_update_elem() replaces existing elements atomically.

              Hash-table maps are optimized for speed of lookup.

       BPF_MAP_TYPE_ARRAY
              Array maps have the following characteristics:

              *  Optimized for fastest possible lookup.  In the future the
                 verifier/JIT compiler may recognize lookup() operations
                 that employ a constant key and optimize them into a constant
                 pointer.  It is possible to optimize a non-constant key
                 into direct pointer arithmetic as well, since pointers and
                 value_size are constant for the life of the eBPF program.
                 In other words, array_map_lookup_elem() may be 'inlined' by
                 the verifier/JIT compiler while preserving concurrent
                 access to this map from user space.

              *  All array elements are preallocated and zero-initialized at
                 init time.

              *  The key is an array index, and must be exactly four bytes.

              *  map_delete_elem() fails with the error EINVAL, since
                 elements cannot be deleted.

              *  map_update_elem() replaces elements in a nonatomic fashion;
                 for atomic updates, a hash-table map should be used
                 instead.  There is however one special case that can also
                 be used with arrays: the atomic built-in
                 __sync_fetch_and_add() can be used on 32-bit and 64-bit
                 atomic counters.  For example, it can be applied to the whole
                 value itself if it represents a single counter, or in case
                 of a structure containing multiple counters, it could be
                 used on individual counters.  This is quite often useful
                 for aggregation and accounting of events.
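
              The counter case described in the last item can be sketched
              as follows.  struct pkt_stats and account_packet() are
              hypothetical; in an eBPF program the stats pointer would be
              the value pointer returned by a map lookup, whereas here the
              function is ordinary C so that the arithmetic can be tried in
              user space.

```c
#include <stdint.h>

/* Hypothetical value layout for a one-element array map. */
struct pkt_stats {
    uint64_t packets;
    uint64_t bytes;
};

/* Update each counter with its own atomic add, so that concurrent
   callers do not lose increments. */
static void
account_packet(struct pkt_stats *stats, uint32_t len)
{
    __sync_fetch_and_add(&stats->packets, 1);
    __sync_fetch_and_add(&stats->bytes, len);
}
```

              Each counter is updated atomically on its own; the pair of
              counters is not updated as a single atomic unit.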

              Among the uses for array maps are the following:

              *  As "global" eBPF variables: an array of 1 element whose key
                 is (index) 0 and where the value is a collection of
                 'global' variables which eBPF programs can use to keep
                 state between events.

              *  Aggregation of tracing events into a fixed set of buckets.

              *  Accounting of networking events, for example, number of
                 packets and packet sizes.

       BPF_MAP_TYPE_PROG_ARRAY (since Linux 4.2)
              A program array map is a special kind of array map whose map
              values contain only file descriptors referring to other eBPF
              programs.  Thus, both the key_size and value_size must be
              exactly four bytes.  This map is used in conjunction with the
              bpf_tail_call() helper.

              This means that an eBPF program with a program array map
              attached to it can call from kernel side into

                  void bpf_tail_call(void *context, void *prog_map, unsigned int index);

              and therefore replace its own program flow with the one from
              the program at the given program array slot, if present.  This
              can be regarded as a kind of jump table to a different eBPF
              program.  The invoked program will then reuse the same stack.
              When a jump into the new program has been performed, it won't
              return to the old program anymore.

              If no eBPF program is found at the given index of the program
              array (because the map slot doesn't contain a valid program
              file descriptor, the specified lookup index/key is out of
              bounds, or the limit of 32 nested calls has been exceeded),
              execution continues with the current eBPF program.  This can
              be used as a fall-through for default cases.

              A program array map is useful, for example, in tracing or
              networking, to handle individual system calls or protocols in
              their own subprograms and use their identifiers as an
              individual map index.  This approach may result in performance
              benefits, and also makes it possible to overcome the maximum
              instruction limit of a single eBPF program.  In dynamic
              environments, a user-space daemon might atomically replace
              individual subprograms at run-time with newer versions to
              alter overall program behavior, for instance, if global
              policies change.

   eBPF programs
       The BPF_PROG_LOAD command is used to load an eBPF program into the
       kernel.  The return value for this command is a new file descriptor
       associated with this eBPF program.

           char bpf_log_buf[LOG_BUF_SIZE];

           int
           bpf_prog_load(enum bpf_prog_type type,
                         const struct bpf_insn *insns, int insn_cnt,
                         const char *license)
           {
               union bpf_attr attr = {
                   .prog_type = type,
                   .insns     = ptr_to_u64(insns),
                   .insn_cnt  = insn_cnt,
                   .license   = ptr_to_u64(license),
                   .log_buf   = ptr_to_u64(bpf_log_buf),
                   .log_size  = LOG_BUF_SIZE,
                   .log_level = 1,
               };

               return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
           }

       prog_type is one of the available program types:

           enum bpf_prog_type {
               BPF_PROG_TYPE_UNSPEC,        /* Reserve 0 as invalid
                                               program type */
               BPF_PROG_TYPE_SOCKET_FILTER,
               BPF_PROG_TYPE_KPROBE,
               BPF_PROG_TYPE_SCHED_CLS,
               BPF_PROG_TYPE_SCHED_ACT,
           };

       For further details of eBPF program types, see below.

       The remaining fields of bpf_attr are set as follows:

       *  insns is an array of struct bpf_insn instructions.

       *  insn_cnt is the number of instructions in the program referred to
          by insns.

       *  license is a license string, which must be GPL compatible to call
          helper functions marked gpl_only.  (The licensing rules are the
          same as for kernel modules, so dual licenses, such as
          "Dual BSD/GPL", may also be used.)

       *  log_buf is a pointer to a caller-allocated buffer in which the in-
          kernel verifier can store the verification log.  This log is a
          multi-line string that can be checked by the program author in
          order to understand how the verifier came to the conclusion that
          the eBPF program is unsafe.  The format of the output can change
          at any time as the verifier evolves.

       *  log_size is the size of the buffer pointed to by log_buf.  If the
          buffer is not large enough to store all verifier messages, -1 is
          returned and errno is set to ENOSPC.

       *  log_level is the verbosity level of the verifier.  A value of zero
          means that the verifier will not provide a log; in this case,
          log_buf must be a NULL pointer, and log_size must be zero.
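
       The fields listed above can be filled in for a very small program.
       The sketch below is one possible way to do so, not the only one.  It
       builds a two-instruction program ("r0 = 0; exit") and submits it as
       a socket filter.  trivial_prog and load_trivial_prog() are
       hypothetical names, the opcode macros are assumed to come from
       <linux/bpf.h>, and the call is made through syscall(2).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for the syscall() declaration */
#endif
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

#define LOG_BUF_SIZE 65536
static char bpf_log_buf[LOG_BUF_SIZE];

/* r0 = 0; exit */
static const struct bpf_insn trivial_prog[] = {
    { .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0, .imm = 0 },
    { .code = BPF_JMP | BPF_EXIT },
};

/* Returns a program file descriptor, or -1 with errno set.  On a
   verifier rejection, bpf_log_buf holds the verification log. */
int
load_trivial_prog(void)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
    attr.insns     = (uint64_t) (uintptr_t) trivial_prog;
    attr.insn_cnt  = sizeof(trivial_prog) / sizeof(trivial_prog[0]);
    attr.license   = (uint64_t) (uintptr_t) "GPL";
    attr.log_buf   = (uint64_t) (uintptr_t) bpf_log_buf;
    attr.log_size  = LOG_BUF_SIZE;
    attr.log_level = 1;

    return (int) syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}
```

       A caller would typically print bpf_log_buf when -1 is returned, and
       close(2) the returned file descriptor when the program is no longer
       needed.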

       Applying close(2) to the file descriptor returned by BPF_PROG_LOAD
       will unload the eBPF program (but see NOTES).

       Maps are accessible from eBPF programs and are used to exchange data
       between eBPF programs and between eBPF programs and user-space
       programs.  For example, eBPF programs can process various events
       (like kprobe, packets) and store their data into a map, and user-
       space programs can then fetch data from the map.  Conversely, user-
       space programs can use a map as a configuration mechanism, populating
       the map with values checked by the eBPF program, which then modifies
       its behavior on the fly according to those values.

   eBPF program types
       The eBPF program type (prog_type) determines the subset of kernel
       helper functions that the program may call.  The program type also
       determines the program input (context)—the format of struct
       bpf_context (which is the data blob passed into the eBPF program as
       the first argument).

       For example, a tracing program does not have the exact same subset of
       helper functions as a socket filter program (though they may have
       some helpers in common).  Similarly, the input (context) for a
       tracing program is a set of register values, while for a socket
       filter it is a network packet.

       The set of functions available to eBPF programs of a given type may
       increase in the future.

       The following program types are supported:

       BPF_PROG_TYPE_SOCKET_FILTER (since Linux 3.19)
              Currently, the set of functions for
              BPF_PROG_TYPE_SOCKET_FILTER is:

                  bpf_map_lookup_elem(map_fd, void *key)
                                      /* look up key in a map_fd */
                  bpf_map_update_elem(map_fd, void *key, void *value)
                                      /* update key/value */
                  bpf_map_delete_elem(map_fd, void *key)
                                      /* delete key in a map_fd */

              The bpf_context argument is a pointer to a struct __sk_buff.

       BPF_PROG_TYPE_KPROBE (since Linux 4.1)
              [To be documented]

       BPF_PROG_TYPE_SCHED_CLS (since Linux 4.1)
              [To be documented]

       BPF_PROG_TYPE_SCHED_ACT (since Linux 4.1)
              [To be documented]

   Events
       Once a program is loaded, it can be attached to an event.  Various
       kernel subsystems have different ways to do so.

       Since Linux 3.19, the following call will attach the program prog_fd
       to the socket sockfd, which was created by an earlier call to
       socket(2):

           setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_BPF,
                      &prog_fd, sizeof(prog_fd));

       Since Linux 4.1, the following call may be used to attach the eBPF
       program referred to by the file descriptor prog_fd to a perf event
       file descriptor, event_fd, that was created by a previous call to
       perf_event_open(2):

           ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
Linux time system calls
http://man7.org/linux/man-pages/man2/time.2.html
11
SYSTEM CALL:
time(2) - Linux manual page
FUNCTIONALITY:

       time - get time in seconds
SYNOPSIS:

       #include <time.h>

       time_t time(time_t *tloc);
DESCRIPTION

       time() returns the time as the number of seconds since the Epoch,
       1970-01-01 00:00:00 +0000 (UTC).

       If tloc is non-NULL, the return value is also stored in the memory
       pointed to by tloc.
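
       The two ways of obtaining the result can be compared directly.  In
       this small sketch, time_agrees_with_tloc() is a hypothetical name
       used only for illustration.

```c
#include <time.h>

/* Returns 1 if the value stored through tloc equals the return value
   of the same call, and 0 otherwise (including on error). */
int
time_agrees_with_tloc(void)
{
    time_t stored = 0;
    time_t returned = time(&stored);

    return returned != (time_t) -1 && returned == stored;
}
```

       Passing NULL instead of &stored gives the same return value without
       the extra store.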
http://man7.org/linux/man-pages/man2/settimeofday.2.html
10
SYSTEM CALL:
gettimeofday(2) - Linux manual page
FUNCTIONALITY:

       gettimeofday, settimeofday - get / set time
SYNOPSIS:

       #include <sys/time.h>

       int gettimeofday(struct timeval *tv, struct timezone *tz);

       int settimeofday(const struct timeval *tv, const struct timezone *tz);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       settimeofday():
           Since glibc 2.19:
               _DEFAULT_SOURCE
           Glibc 2.19 and earlier:
               _BSD_SOURCE
DESCRIPTION

       The functions gettimeofday() and settimeofday() can get and set the
       time as well as a timezone.  The tv argument is a struct timeval (as
       specified in <sys/time.h>):

           struct timeval {
               time_t      tv_sec;     /* seconds */
               suseconds_t tv_usec;    /* microseconds */
           };

       and gives the number of seconds and microseconds since the Epoch (see
       time(2)).  The tz argument is a struct timezone:

           struct timezone {
               int tz_minuteswest;     /* minutes west of Greenwich */
               int tz_dsttime;         /* type of DST correction */
           };

       If either tv or tz is NULL, the corresponding structure is not set or
       returned.  (However, compilation warnings will result if tv is NULL.)

       The use of the timezone structure is obsolete; the tz argument should
       normally be specified as NULL.  (See NOTES below.)
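
       A typical call therefore passes NULL for tz.  The following sketch
       combines the two fields of struct timeval into one microsecond
       count; now_usec() is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/* Microseconds since the Epoch, or -1 on error. */
int64_t
now_usec(void)
{
    struct timeval tv;

    if (gettimeofday(&tv, NULL) == -1)
        return -1;
    return (int64_t) tv.tv_sec * 1000000 + tv.tv_usec;
}
```

       The cast to int64_t comes before the multiplication so that the
       product does not overflow on systems where time_t is 32 bits wide.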

       Under Linux, there are some peculiar "warp clock" semantics
       associated with the settimeofday() system call if on the very first
       call (after booting) that has a non-NULL tz argument, the tv argument
       is NULL and the tz_minuteswest field is nonzero.  (The tz_dsttime
       field should be zero for this case.)  In such a case it is assumed
       that the CMOS clock is on local time, and that it has to be
       incremented by this amount to get UTC system time.  No doubt it is a
       bad idea to use this feature.
http://man7.org/linux/man-pages/man2/gettimeofday.2.html
10
SYSTEM CALL:
gettimeofday(2) - Linux manual page
FUNCTIONALITY:

       gettimeofday, settimeofday - get / set time
SYNOPSIS:

       #include <sys/time.h>

       int gettimeofday(struct timeval *tv, struct timezone *tz);

       int settimeofday(const struct timeval *tv, const struct timezone *tz);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       settimeofday():
           Since glibc 2.19:
               _DEFAULT_SOURCE
           Glibc 2.19 and earlier:
               _BSD_SOURCE
DESCRIPTION

       The functions gettimeofday() and settimeofday() can get and set the
       time as well as a timezone.  The tv argument is a struct timeval (as
       specified in <sys/time.h>):

           struct timeval {
               time_t      tv_sec;     /* seconds */
               suseconds_t tv_usec;    /* microseconds */
           };

       and gives the number of seconds and microseconds since the Epoch (see
       time(2)).  The tz argument is a struct timezone:

           struct timezone {
               int tz_minuteswest;     /* minutes west of Greenwich */
               int tz_dsttime;         /* type of DST correction */
           };

       If either tv or tz is NULL, the corresponding structure is not set or
       returned.  (However, compilation warnings will result if tv is NULL.)

       The use of the timezone structure is obsolete; the tz argument should
       normally be specified as NULL.  (See NOTES below.)

       Under Linux, there are some peculiar "warp clock" semantics
       associated with the settimeofday() system call if on the very first
       call (after booting) that has a non-NULL tz argument, the tv argument
       is NULL and the tz_minuteswest field is nonzero.  (The tz_dsttime
       field should be zero for this case.)  In such a case it is assumed
       that the CMOS clock is on local time, and that it has to be
       incremented by this amount to get UTC system time.  No doubt it is a
       bad idea to use this feature.
http://man7.org/linux/man-pages/man2/clock_settime.2.html
14
SYSTEM CALL:
clock_getres(2) - Linux manual page
FUNCTIONALITY:

       clock_getres, clock_gettime, clock_settime - clock and time functions
SYNOPSIS:

       #include <time.h>

       int clock_getres(clockid_t clk_id, struct timespec *res);

       int clock_gettime(clockid_t clk_id, struct timespec *tp);

       int clock_settime(clockid_t clk_id, const struct timespec *tp);

       Link with -lrt (only for glibc versions before 2.17).

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       clock_getres(), clock_gettime(), clock_settime():
              _POSIX_C_SOURCE >= 199309L
DESCRIPTION

       The function clock_getres() finds the resolution (precision) of the
       specified clock clk_id, and, if res is non-NULL, stores it in the
       struct timespec pointed to by res.  The resolution of clocks depends
       on the implementation and cannot be configured by a particular
       process.  If the time value pointed to by the argument tp of
       clock_settime() is not a multiple of res, then it is truncated to a
       multiple of res.
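
       The truncation rule can be illustrated with plain arithmetic.  The
       sketch below shows only the rounding described above, not how any
       particular kernel implements it.  truncate_to_res() is a
       hypothetical helper that assumes a nonnegative time value that fits
       in a 64-bit nanosecond count.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L /* for struct timespec */
#endif
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000LL

/* Round tp down to a multiple of res.  A zero res leaves tp unchanged. */
struct timespec
truncate_to_res(struct timespec tp, struct timespec res)
{
    int64_t step  = (int64_t) res.tv_sec * NSEC_PER_SEC + res.tv_nsec;
    int64_t total = (int64_t) tp.tv_sec * NSEC_PER_SEC + tp.tv_nsec;

    if (step > 0)
        total -= total % step;

    tp.tv_sec  = (time_t) (total / NSEC_PER_SEC);
    tp.tv_nsec = (long) (total % NSEC_PER_SEC);
    return tp;
}
```

       For example, with a resolution of 1000 nanoseconds, a value of
       10 seconds and 1500 nanoseconds becomes 10 seconds and 1000
       nanoseconds.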

       The functions clock_gettime() and clock_settime() retrieve and set
       the time of the specified clock clk_id.

       The res and tp arguments are timespec structures, as specified in
       <time.h>:

           struct timespec {
               time_t   tv_sec;        /* seconds */
               long     tv_nsec;       /* nanoseconds */
           };

       The clk_id argument is the identifier of the particular clock on
       which to act.  A clock may be system-wide and hence visible for all
       processes, or per-process if it measures time only within a single
       process.

       All implementations support the system-wide real-time clock, which is
       identified by CLOCK_REALTIME.  Its time represents seconds and
       nanoseconds since the Epoch.  When its time is changed, timers for a
       relative interval are unaffected, but timers for an absolute point in
       time are affected.
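
       A minimal sketch of reading CLOCK_REALTIME as a single number
       follows.  The helper name realtime_millis() is an assumption made
       for illustration; it combines the seconds and nanoseconds fields
       into milliseconds since the Epoch:

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Hypothetical helper: return the current CLOCK_REALTIME value in
   milliseconds since the Epoch, or -1 if clock_gettime() fails. */
long long realtime_millis(void)
{
    struct timespec now;

    if (clock_gettime(CLOCK_REALTIME, &now) == -1)
        return -1;
    return (long long)now.tv_sec * 1000LL + now.tv_nsec / 1000000L;
}
```

       Since this clock can jump when the system time is changed, values
       obtained this way are suited to timestamps rather than to
       measuring intervals.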

       More clocks may be implemented.  The interpretation of the
       corresponding time values and the effect on timers is unspecified.

       Sufficiently recent versions of glibc and the Linux kernel support
       the following clocks:

       CLOCK_REALTIME
              System-wide clock that measures real (i.e., wall-clock) time.
              Setting this clock requires appropriate privileges.  This
              clock is affected by discontinuous jumps in the system time
              (e.g., if the system administrator manually changes the
              clock), and by the incremental adjustments performed by
              adjtime(3) and NTP.

       CLOCK_REALTIME_COARSE (since Linux 2.6.32; Linux-specific)
              A faster but less precise version of CLOCK_REALTIME.  Use when
              you need very fast, but not fine-grained timestamps.  Requires
              per-architecture support, and probably also architecture
              support for this flag in the vdso(7).

       CLOCK_MONOTONIC
              Clock that cannot be set and represents monotonic time since
              some unspecified starting point.  This clock is not affected
              by discontinuous jumps in the system time (e.g., if the system
              administrator manually changes the clock), but is affected by
              the incremental adjustments performed by adjtime(3) and NTP.

       CLOCK_MONOTONIC_COARSE (since Linux 2.6.32; Linux-specific)
              A faster but less precise version of CLOCK_MONOTONIC.  Use
              when you need very fast, but not fine-grained timestamps.
              Requires per-architecture support, and probably also
              architecture support for this flag in the vdso(7).

       CLOCK_MONOTONIC_RAW (since Linux 2.6.28; Linux-specific)
              Similar to CLOCK_MONOTONIC, but provides access to a raw
              hardware-based time that is not subject to NTP adjustments or
              the incremental adjustments performed by adjtime(3).

       CLOCK_BOOTTIME (since Linux 2.6.39; Linux-specific)
              Identical to CLOCK_MONOTONIC, except it also includes any time
              that the system is suspended.  This allows applications to get
              a suspend-aware monotonic clock without having to deal with
              the complications of CLOCK_REALTIME, which may have
              discontinuities if the time is changed using settimeofday(2).

       CLOCK_PROCESS_CPUTIME_ID (since Linux 2.6.12)
              Per-process CPU-time clock (measures CPU time consumed by all
              threads in the process).

       CLOCK_THREAD_CPUTIME_ID (since Linux 2.6.12)
              Thread-specific CPU-time clock.
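
       The difference between a monotonic clock and a CPU-time clock can
       be sketched as follows.  The helpers clock_now_ns() and
       measure_sleep() are hypothetical names; the example sleeps for a
       given number of milliseconds and reports how much CLOCK_MONOTONIC
       time and how much CLOCK_PROCESS_CPUTIME_ID time passed meanwhile:

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

/* Hypothetical helper: read clk_id and return its value in nanoseconds,
   or -1 if clock_gettime() fails. */
long long clock_now_ns(clockid_t clk_id)
{
    struct timespec ts;

    if (clock_gettime(clk_id, &ts) == -1)
        return -1;
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Hypothetical helper: sleep for ms milliseconds, then store the elapsed
   monotonic time in *wall_ns and the consumed process CPU time in
   *cpu_ns.  Returns 0 on success, -1 on failure. */
int measure_sleep(long ms, long long *wall_ns, long long *cpu_ns)
{
    struct timespec req = { ms / 1000, (ms % 1000) * 1000000L };
    long long wall0 = clock_now_ns(CLOCK_MONOTONIC);
    long long cpu0 = clock_now_ns(CLOCK_PROCESS_CPUTIME_ID);
    long long wall1, cpu1;

    if (wall0 == -1 || cpu0 == -1)
        return -1;

    /* On interruption by a signal, continue with the remaining time. */
    while (nanosleep(&req, &req) == -1) {
        if (errno != EINTR)
            return -1;
    }

    wall1 = clock_now_ns(CLOCK_MONOTONIC);
    cpu1 = clock_now_ns(CLOCK_PROCESS_CPUTIME_ID);
    if (wall1 == -1 || cpu1 == -1)
        return -1;

    *wall_ns = wall1 - wall0;
    *cpu_ns = cpu1 - cpu0;
    return 0;
}
```

       A sleeping process consumes almost no CPU time, so after
       measure_sleep(50, &wall, &cpu) the wall value is expected to be at
       least roughly 50 milliseconds while cpu stays far smaller.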
http://man7.org/linux/man-pages/man2/clock_gettime.2.html
http://man7.org/linux/man-pages/man2/clock_getres.2.html
http://man7.org/linux/man-pages/man2/clock_adjtime.2.html
Linux process management system calls
http://man7.org/linux/man-pages/man2/clone.2.html
13
SYSTEM CALL:
clone(2) - Linux manual page
FUNCTIONALITY:

       clone, __clone2 - create a child process
SYNOPSIS:

       /* Prototype for the glibc wrapper function */

       #define _GNU_SOURCE
       #include <sched.h>

       int clone(int (*fn)(void *), void *child_stack,
                 int flags, void *arg, ...
                 /* pid_t *ptid, struct user_desc *tls, pid_t *ctid */ );

       /* Prototype for the raw system call */

       long clone(unsigned long flags, void *child_stack,
                 void *ptid, void *ctid,
                 struct pt_regs *regs);
DESCRIPTION

       clone() creates a new process, in a manner similar to fork(2).

       This page describes both the glibc clone() wrapper function and the
       underlying system call on which it is based.  The main text describes
       the wrapper function; the differences for the raw system call are
       described toward the end of this page.

       Unlike fork(2), clone() allows the child process to share parts of
       its execution context with the calling process, such as the memory
       space, the table of file descriptors, and the table of signal
       handlers.  (Note that on this manual page, "calling process" normally
       corresponds to "parent process".  But see the description of
       CLONE_PARENT below.)

       One use of clone() is to implement threads: multiple threads of
       control in a program that run concurrently in a shared memory space.

       When the child process is created with clone(), it executes the
       function fn(arg).  (This differs from fork(2), where execution
       continues in the child from the point of the fork(2) call.)  The fn
       argument is a pointer to a function that is called by the child
       process at the beginning of its execution.  The arg argument is
       passed to the fn function.

       When the fn(arg) call returns, the child process terminates.  The
       integer returned by fn is the exit code for the child process.  The
       child process may also terminate explicitly by calling exit(2) or
       after receiving a fatal signal.

       The child_stack argument specifies the location of the stack used by
       the child process.  Since the child and calling process may share
       memory, it is not possible for the child process to execute in the
       same stack as the calling process.  The calling process must
       therefore set up memory space for the child stack and pass a pointer
       to this space to clone().  Stacks grow downward on all processors
       that run Linux (except the HP PA processors), so child_stack usually
       points to the topmost address of the memory space set up for the
       child stack.
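
       The points above (the fn and arg arguments, the exit code, and the
       child stack) can be sketched in a few lines.  The helper name
       run_child_with_exit_code() and the STACK_SIZE value are assumptions
       chosen for illustration, and the code assumes an architecture on
       which stacks grow downward:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (1024 * 1024)    /* illustrative child stack size */

/* Runs in the child; the return value becomes the child's exit code. */
static int child_fn(void *arg)
{
    return *(int *)arg;
}

/* Hypothetical helper: create a child with clone(), have it return
   `code` from child_fn, and return the exit code observed by the
   caller, or -1 on error. */
int run_child_with_exit_code(int code)
{
    char *stack = malloc(STACK_SIZE);
    int status = 0;
    pid_t pid;

    if (stack == NULL)
        return -1;

    /* Pass the topmost address, since the stack grows downward. */
    pid = clone(child_fn, stack + STACK_SIZE, SIGCHLD, &code);
    if (pid == -1 || waitpid(pid, &status, 0) == -1) {
        free(stack);
        return -1;
    }
    free(stack);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

       Here the termination signal is SIGCHLD, so a plain waitpid(2)
       suffices; run_child_with_exit_code(7) is expected to return 7.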

       The low byte of flags contains the number of the termination signal
       sent to the parent when the child dies.  If this signal is specified
       as anything other than SIGCHLD, then the parent process must specify
       the __WALL or __WCLONE options when waiting for the child with
       wait(2).  If no signal is specified, then the parent process is not
       signaled when the child terminates.
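
       The rule above can be expressed as two small helpers.  Both names
       are hypothetical, and the 0xff mask simply encodes "the low byte of
       flags" as stated in the text:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/wait.h>

/* Hypothetical helper: extract the termination signal number from the
   low byte of a clone() flags value (0 means "no signal"). */
int clone_termination_signal(unsigned long flags)
{
    return (int)(flags & 0xff);
}

/* Hypothetical helper: choose wait options for a child created with the
   given flags.  A child whose termination signal is not SIGCHLD is only
   reported when __WALL (or __WCLONE) is specified. */
int clone_wait_options(unsigned long flags)
{
    return clone_termination_signal(flags) == SIGCHLD ? 0 : __WALL;
}
```

       The result of clone_wait_options() can be passed as the options
       argument of waitpid(2) when waiting for the corresponding child.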

       flags may also be bitwise-or'ed with zero or more of the following
       constants, in order to specify what is shared between the calling
       process and the child process:

       CLONE_CHILD_CLEARTID (since Linux 2.5.49)
              Erase the child thread ID at the location ctid in child memory
              when the child exits, and do a wakeup on the futex at that
              address.  The address involved may be changed by the
              set_tid_address(2) system call.  This is used by threading
              libraries.

       CLONE_CHILD_SETTID (since Linux 2.5.49)
              Store the child thread ID at the location ctid in the child's
              memory.

       CLONE_FILES (since Linux 2.0)
              If CLONE_FILES is set, the calling process and the child
              process share the same file descriptor table.  Any file
              descriptor created by the calling process or by the child
              process is also valid in the other process.  Similarly, if one
              of the processes closes a file descriptor, or changes its
              associated flags (using the fcntl(2) F_SETFD operation), the
              other process is also affected.  If a process sharing a file
              descriptor table calls execve(2), its file descriptor table is
              duplicated (unshared).

              If CLONE_FILES is not set, the child process inherits a copy
              of all file descriptors opened in the calling process at the
              time of clone().  (The duplicated file descriptors in the
              child refer to the same open file descriptions (see open(2))
              as the corresponding file descriptors in the calling process.)
              Subsequent operations that open or close file descriptors, or
              change file descriptor flags, performed by either the calling
              process or the child process do not affect the other process.
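
              The effect of CLONE_FILES can be observed with a sketch such
              as the following, in which the child closes a descriptor
              that the caller opened.  The helper name
              fd_open_after_child_close() and the use of /dev/null are
              assumptions made for illustration:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)    /* illustrative child stack size */

/* Runs in the child: close the descriptor whose number arg points to. */
static int close_fd(void *arg)
{
    return close(*(int *)arg) == -1 ? 1 : 0;
}

/* Hypothetical helper: open a descriptor, let a clone() child close it,
   and report whether it is still open in the caller afterwards.
   Returns 1 if still open, 0 if closed, -1 on error.  Pass CLONE_FILES
   or 0 as extra_flags. */
int fd_open_after_child_close(int extra_flags)
{
    char *stack = malloc(STACK_SIZE);
    int fd, still_open;
    pid_t pid;

    if (stack == NULL)
        return -1;
    fd = open("/dev/null", O_RDONLY);
    if (fd == -1) {
        free(stack);
        return -1;
    }

    pid = clone(close_fd, stack + STACK_SIZE, extra_flags | SIGCHLD, &fd);
    if (pid == -1 || waitpid(pid, NULL, 0) == -1) {
        close(fd);
        free(stack);
        return -1;
    }

    /* F_GETFD fails with EBADF if the descriptor is no longer open. */
    still_open = (fcntl(fd, F_GETFD) != -1);
    if (still_open)
        close(fd);
    free(stack);
    return still_open;
}
```

              With CLONE_FILES the helper is expected to return 0, because
              the close in the child acts on the shared table; with
              extra_flags set to 0 it is expected to return 1.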

       CLONE_FS (since Linux 2.0)
              If CLONE_FS is set, the caller and the child process share the
              same filesystem information.  This includes the root of the
              filesystem, the current working directory, and the umask.  Any
              call to chroot(2), chdir(2), or umask(2) performed by the
              calling process or the child process also affects the other
              process.

              If CLONE_FS is not set, the child process works on a copy of
              the filesystem information of the calling process at the time
              of the clone() call.  Calls to chroot(2), chdir(2), or
              umask(2) performed later by one of the processes do not affect
              the other process.
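
              As a sketch of the difference, the following lets the child
              change the umask and then checks which value is in effect in
              the caller.  The helper name umask_follows_child() and the
              mask values are assumptions made for illustration:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (1024 * 1024)    /* illustrative child stack size */

/* Runs in the child: set the umask to the value arg points to. */
static int set_umask(void *arg)
{
    umask(*(mode_t *)arg);
    return 0;
}

/* Hypothetical helper: let a clone() child change the umask and report
   whether the caller sees the change.  Returns 1 if it does, 0 if not,
   -1 on error.  The caller's original umask is restored.  Pass CLONE_FS
   or 0 as extra_flags. */
int umask_follows_child(int extra_flags)
{
    char *stack = malloc(STACK_SIZE);
    mode_t old, wanted, seen;
    pid_t pid;

    if (stack == NULL)
        return -1;

    old = umask(0);             /* umask() returns the previous mask, */
    umask(old);                 /* so read it and put it back */
    wanted = (old == 077) ? 022 : 077;  /* any value different from old */

    pid = clone(set_umask, stack + STACK_SIZE, extra_flags | SIGCHLD,
                &wanted);
    if (pid == -1 || waitpid(pid, NULL, 0) == -1) {
        free(stack);
        return -1;
    }

    seen = umask(old);          /* read the current mask and restore */
    free(stack);
    return seen == wanted;
}
```

              With CLONE_FS the helper is expected to return 1; with
              extra_flags set to 0 the child changes only its own copy and
              the helper is expected to return 0.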

       CLONE_IO (since Linux 2.6.25)
              If CLONE_IO is set, then the new process shares an I/O context
              with the calling process.  If this flag is not set, then (as
              with fork(2)) the new process has its own I/O context.

              The I/O context is the I/O scope of the disk scheduler (i.e.,
              what the I/O scheduler uses to model scheduling of a process's
              I/O).  If processes share the same I/O context, they are
              treated as one by the I/O scheduler.  As a consequence, they
              get to share disk time.  For some I/O schedulers, if two
              processes share an I/O context, they will be allowed to
              interleave their disk access.  If several threads are doing
              I/O on behalf of the same process (aio_read(3), for instance),
              they should employ CLONE_IO to get better I/O performance.

              If the kernel is not configured with the CONFIG_BLOCK option,
              this flag is a no-op.

       CLONE_NEWCGROUP (since Linux 4.6)
              Create the process in a new cgroup namespace.  If this flag is
              not set, then (as with fork(2)) the process is created in the
              same cgroup namespace as the calling process.  This flag is
              intended for the implementation of containers.

              For further information on cgroup namespaces, see
              cgroup_namespaces(7).

              Only a privileged process (CAP_SYS_ADMIN) can employ
              CLONE_NEWCGROUP.

       CLONE_NEWIPC (since Linux 2.6.19)
              If CLONE_NEWIPC is set, then create the process in a new IPC
              namespace.  If this flag is not set, then (as with fork(2)),
              the process is created in the same IPC namespace as the
              calling process.  This flag is intended for the implementation
              of containers.

              An IPC namespace provides an isolated view of System V IPC
              objects (see svipc(7)) and (since Linux 2.6.30) POSIX message
              queues (see mq_overview(7)).  The common characteristic of
              these IPC mechanisms is that IPC objects are identified by
              mechanisms other than filesystem pathnames.

              Objects created in an IPC namespace are visible to all other
              processes that are members of that namespace, but are not
              visible to processes in other IPC namespaces.

              When an IPC namespace is destroyed (i.e., when the last
              process that is a member of the namespace terminates), all IPC
              objects in the namespace are automatically destroyed.

              Only a privileged process (CAP_SYS_ADMIN) can employ
              CLONE_NEWIPC.  This flag can't be specified in conjunction
              with CLONE_SYSVSEM.

              For further information on IPC namespaces, see namespaces(7).

       CLONE_NEWNET (since Linux 2.6.24)
              (The implementation of this flag was completed only by about
              kernel version 2.6.29.)

              If CLONE_NEWNET is set, then create the process in a new
              network namespace.  If this flag is not set, then (as with
              fork(2)) the process is created in the same network namespace
              as the calling process.  This flag is intended for the
              implementation of containers.

              A network namespace provides an isolated view of the
              networking stack (network device interfaces, IPv4 and IPv6
              protocol stacks, IP routing tables, firewall rules, the
              /proc/net and /sys/class/net directory trees, sockets, etc.).
              A physical network device can live in exactly one network
              namespace.  A virtual network device ("veth") pair provides a
              pipe-like abstraction that can be used to create tunnels
              between network namespaces, and can be used to create a bridge
              to a physical network device in another namespace.

              When a network namespace is freed (i.e., when the last process
              in the namespace terminates), its physical network devices are
              moved back to the initial network namespace (not to the parent
              of the process).  For further information on network
              namespaces, see namespaces(7).

              Only a privileged process (CAP_SYS_ADMIN) can employ
              CLONE_NEWNET.

       CLONE_NEWNS (since Linux 2.4.19)
              If CLONE_NEWNS is set, the cloned child is started in a new
              mount namespace, initialized with a copy of the namespace of
              the parent.  If CLONE_NEWNS is not set, the child lives in the
              same mount namespace as the parent.

              Only a privileged process (CAP_SYS_ADMIN) can employ
              CLONE_NEWNS.  It is not permitted to specify both CLONE_NEWNS
              and CLONE_FS in the same clone() call.

              For further information on mount namespaces, see namespaces(7)
              and mount_namespaces(7).

       CLONE_NEWPID (since Linux 2.6.24)
              If CLONE_NEWPID is set, then create the process in a new PID
              namespace.  If this flag is not set, then (as with fork(2))
              the process is created in the same PID namespace as the
              calling process.  This flag is intended for the implementation
              of containers.

              For further information on PID namespaces, see namespaces(7)
              and pid_namespaces(7).

              Only a privileged process (CAP_SYS_ADMIN) can employ
              CLONE_NEWPID.  This flag can't be specified in conjunction
              with CLONE_THREAD or CLONE_PARENT.

       CLONE_NEWUSER
              (This flag first became meaningful for clone() in Linux
              2.6.23; the current clone() semantics were merged in Linux
              3.5; and the final pieces to make the user namespaces
              completely usable were merged in Linux 3.8.)

              If CLONE_NEWUSER is set, then create the process in a new user
              namespace.  If this flag is not set, then (as with fork(2))
              the process is created in the same user namespace as the
              calling process.

              For further information on user namespaces, see namespaces(7)
              and user_namespaces(7).

              Before Linux 3.8, use of CLONE_NEWUSER required that the
              caller have three capabilities: CAP_SYS_ADMIN, CAP_SETUID, and
              CAP_SETGID.  Starting with Linux 3.8, no privileges are needed
              to create a user namespace.

              This flag can't be specified in conjunction with CLONE_THREAD
              or CLONE_PARENT.  For security reasons, CLONE_NEWUSER cannot
              be specified in conjunction with CLONE_FS.

       CLONE_NEWUTS (since Linux 2.6.19)
              If CLONE_NEWUTS is set, then create the process in a new UTS
              namespace, whose identifiers are initialized by duplicating
              the identifiers from the UTS namespace of the calling process.
              If this flag is not set, then (as with fork(2)) the process is
              created in the same UTS namespace as the calling process.
              This flag is intended for the implementation of containers.

              A UTS namespace is the set of identifiers returned by
              uname(2); among these, the domain name and the hostname can be
              modified by setdomainname(2) and sethostname(2), respectively.
              Changes made to the identifiers in a UTS namespace are visible
              to all other processes in the same namespace, but are not
              visible to processes in other UTS namespaces.

              Only a privileged process (CAP_SYS_ADMIN) can employ
              CLONE_NEWUTS.

              For further information on UTS namespaces, see namespaces(7).

       CLONE_PARENT (since Linux 2.3.12)
              If CLONE_PARENT is set, then the parent of the new child (as
              returned by getppid(2)) will be the same as that of the
              calling process.

              If CLONE_PARENT is not set, then (as with fork(2)) the child's
              parent is the calling process.

              Note that it is the parent process, as returned by getppid(2),
              which is signaled when the child terminates, so that if
              CLONE_PARENT is set, then the parent of the calling process,
              rather than the calling process itself, will be signaled.

       CLONE_PARENT_SETTID (since Linux 2.5.49)
              Store the child thread ID at the location ptid in the parent's
              memory.  (In Linux 2.5.32-2.5.48 there was a flag CLONE_SETTID
              that did this.)

       CLONE_PID (obsolete)
              If CLONE_PID is set, the child process is created with the
              same process ID as the calling process.  This is good for
              hacking the system, but otherwise of not much use.  Since
              2.3.21 this flag can be specified only by the system boot
              process (PID 0).  It disappeared in Linux 2.5.16.  Since then,
              the kernel silently ignores it without error.

       CLONE_PTRACE (since Linux 2.2)
              If CLONE_PTRACE is specified, and the calling process is being
              traced, then trace the child also (see ptrace(2)).

       CLONE_SETTLS (since Linux 2.5.32)
              The newtls argument (shown as tls in the wrapper prototype
              above) is the new TLS (Thread Local Storage) descriptor.
              (See set_thread_area(2).)

       CLONE_SIGHAND (since Linux 2.0)
              If CLONE_SIGHAND is set, the calling process and the child
              process share the same table of signal handlers.  If the
              calling process or child process calls sigaction(2) to change
              the behavior associated with a signal, the behavior is changed
              in the other process as well.  However, the calling process
              and child processes still have distinct signal masks and sets
              of pending signals.  So, one of them may block or unblock some
              signals using sigprocmask(2) without affecting the other
              process.

              If CLONE_SIGHAND is not set, the child process inherits a copy
              of the signal handlers of the calling process at the time
              clone() is called.  Calls to sigaction(2) performed later by
              one of the processes have no effect on the other process.

              Since Linux 2.6.0-test6, flags must also include CLONE_VM if
              CLONE_SIGHAND is specified.

       CLONE_STOPPED (since Linux 2.6.0-test2)
              If CLONE_STOPPED is set, then the child is initially stopped
              (as though it was sent a SIGSTOP signal), and must be resumed
              by sending it a SIGCONT signal.

              This flag was deprecated from Linux 2.6.25 onward, and was
              removed altogether in Linux 2.6.38.  Since then, the kernel
              silently ignores it without error.  Starting with Linux 4.6,
              the same bit was reused for the CLONE_NEWCGROUP flag.

       CLONE_SYSVSEM (since Linux 2.5.10)
              If CLONE_SYSVSEM is set, then the child and the calling
              process share a single list of System V semaphore adjustment
              (semadj) values (see semop(2)).  In this case, the shared list
              accumulates semadj values across all processes sharing the
              list, and semaphore adjustments are performed only when the
              last process that is sharing the list terminates (or ceases
              sharing the list using unshare(2)).  If this flag is not set,
              then the child has a separate semadj list that is initially
              empty.

       CLONE_THREAD (since Linux 2.4.0-test8)
              If CLONE_THREAD is set, the child is placed in the same thread
              group as the calling process.  To make the remainder of the
              discussion of CLONE_THREAD more readable, the term "thread" is
              used to refer to the processes within a thread group.

              Thread groups were a feature added in Linux 2.4 to support the
              POSIX threads notion of a set of threads that share a single
              PID.  Internally, this shared PID is the so-called thread
              group identifier (TGID) for the thread group.  Since Linux
              2.4, calls to getpid(2) return the TGID of the caller.

              The threads within a group can be distinguished by their
              (system-wide) unique thread IDs (TID).  A new thread's TID is
              available as the function result returned to the caller of
              clone(), and a thread can obtain its own TID using gettid(2).

              When a call is made to clone() without specifying
              CLONE_THREAD, then the resulting thread is placed in a new
              thread group whose TGID is the same as the thread's TID.  This
              thread is the leader of the new thread group.
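
              The relationship between TGID and TID for a thread group
              leader can be checked with a short sketch.  It assumes that
              fork(2) is an acceptable stand-in for a clone() call made
              without CLONE_THREAD, and it uses the raw SYS_gettid system
              call in case the C library provides no gettid() wrapper.
              The function name child_is_group_leader() is hypothetical.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if a child created without CLONE_THREAD sees
   getpid() == gettid(), 0 if it does not, and -1 on error. */
int child_is_group_leader(void)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* The child is the only thread in its new thread group. */
        pid_t tid = (pid_t) syscall(SYS_gettid);
        _exit(tid == getpid() ? 1 : 0);
    }

    int wstatus;
    if (waitpid(pid, &wstatus, 0) == -1 || !WIFEXITED(wstatus))
        return -1;
    return WEXITSTATUS(wstatus);
}
```

              On Linux, child_is_group_leader() is expected to return 1,
              because the child's TGID and TID are the same value.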

              A new thread created with CLONE_THREAD has the same parent
              process as the caller of clone() (i.e., like CLONE_PARENT), so
              that calls to getppid(2) return the same value for all of the
              threads in a thread group.  When a CLONE_THREAD thread
              terminates, the thread that created it using clone() is not
              sent a SIGCHLD (or other termination) signal; nor can the
              status of such a thread be obtained using wait(2).  (The
              thread is said to be detached.)

              After all of the threads in a thread group terminate, the
              parent process of the thread group is sent a SIGCHLD (or other
              termination) signal.

              If any of the threads in a thread group performs an execve(2),
              then all threads other than the thread group leader are
              terminated, and the new program is executed in the thread
              group leader.

              If one of the threads in a thread group creates a child using
              fork(2), then any thread in the group can wait(2) for that
              child.

              Since Linux 2.5.35, flags must also include CLONE_SIGHAND if
              CLONE_THREAD is specified (and note that, since Linux
              2.6.0-test6, CLONE_SIGHAND also requires CLONE_VM to be
              included).

              Signals may be sent to a thread group as a whole (i.e., a
              TGID) using kill(2), or to a specific thread (i.e., TID) using
              tgkill(2).

              Signal dispositions and actions are process-wide: if an
              unhandled signal is delivered to a thread, then it will affect
              (terminate, stop, continue, be ignored in) all members of the
              thread group.

              Each thread has its own signal mask, as set by sigprocmask(2),
              but signals can be pending either: for the whole process
              (i.e., deliverable to any member of the thread group), when
              sent with kill(2); or for an individual thread, when sent with
              tgkill(2).  A call to sigpending(2) returns a signal set that
              is the union of the signals pending for the whole process and
              the signals that are pending for the calling thread.

              If kill(2) is used to send a signal to a thread group, and the
              thread group has installed a handler for the signal, then the
              handler will be invoked in exactly one, arbitrarily selected
              member of the thread group that has not blocked the signal.
              If multiple threads in a group are waiting to accept the same
              signal using sigwaitinfo(2), the kernel will arbitrarily
              select one of these threads to receive a signal sent using
              kill(2).

       CLONE_UNTRACED (since Linux 2.5.46)
              If CLONE_UNTRACED is specified, then a tracing process cannot
              force CLONE_PTRACE on this child process.

       CLONE_VFORK (since Linux 2.2)
              If CLONE_VFORK is set, the execution of the calling process is
              suspended until the child releases its virtual memory
              resources via a call to execve(2) or _exit(2) (as with
              vfork(2)).

              If CLONE_VFORK is not set, then both the calling process and
              the child are schedulable after the call, and an application
              should not rely on execution occurring in any particular
              order.

       CLONE_VM (since Linux 2.0)
              If CLONE_VM is set, the calling process and the child process
              run in the same memory space.  In particular, memory writes
              performed by the calling process or by the child process are
              also visible in the other process.  Moreover, any memory
              mapping or unmapping performed with mmap(2) or munmap(2) by
              the child or calling process also affects the other process.

              If CLONE_VM is not set, the child process runs in a separate
              copy of the memory space of the calling process at the time of
              clone().  Memory writes or file mappings/unmappings performed
              by one of the processes do not affect the other, as with
              fork(2).

   C library/kernel differences
       The raw clone() system call corresponds more closely to fork(2) in
       that execution in the child continues from the point of the call.  As
       such, the fn and arg arguments of the clone() wrapper function are
       omitted.  Furthermore, the argument order changes.  The raw system
       call interface on x86 and many other architectures is roughly:

           long clone(unsigned long flags, void *child_stack,
                      void *ptid, void *ctid,
                      struct pt_regs *regs);

       Another difference for the raw system call is that the child_stack
       argument may be zero, in which case copy-on-write semantics ensure
       that the child gets separate copies of stack pages when either
       process modifies the stack.  In this case, for correct operation, the
       CLONE_VM option should not be specified.
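
       A minimal sketch of a fork(2)-like use of the raw system call
       follows.  It assumes the argument order shown above (as on x86-64)
       and would need adjusting on the architectures listed below.  Only
       SIGCHLD is passed in flags, so CLONE_VM is not set and a zero
       child_stack is acceptable.  The function name
       raw_clone_exit_status() is hypothetical.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Creates a child with the raw clone system call and returns the exit
   status that the child passed to _exit(), or -1 on error. */
int raw_clone_exit_status(int status)
{
    /* flags = SIGCHLD, child_stack = 0, ptid = ctid = NULL. */
    long ret = syscall(SYS_clone, (unsigned long) SIGCHLD, 0UL,
                       NULL, NULL, 0UL);
    if (ret == -1)
        return -1;

    if (ret == 0)
        _exit(status);      /* child: execution continues from the call */

    int wstatus;
    if (waitpid((pid_t) ret, &wstatus, 0) == -1 || !WIFEXITED(wstatus))
        return -1;
    return WEXITSTATUS(wstatus);
}
```

       For example, raw_clone_exit_status(5) is expected to return 5.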

       For some architectures, the order of the arguments for the system
       call differs from that shown above.  On the score, microblaze, ARM,
       ARM 64, PA-RISC, arc, Power PC, xtensa, and MIPS architectures, the
       order of the fourth and fifth arguments is reversed.  On the cris and
       s390 architectures, the order of the first and second arguments is
       reversed.

   blackfin, m68k, and sparc
       The argument-passing conventions on blackfin, m68k, and sparc are
       different from the descriptions above.  For details, see the kernel
       (and glibc) source.

   ia64
       On ia64, a different interface is used:

       int __clone2(int (*fn)(void *),
                    void *child_stack_base, size_t stack_size,
                    int flags, void *arg, ...
                 /* pid_t *ptid, struct user_desc *tls, pid_t *ctid */ );

       The prototype shown above is for the glibc wrapper function; the raw
       system call interface has no fn or arg argument, and changes the
       order of the arguments so that flags is the first argument, and tls
       is the last argument.

       __clone2() operates in the same way as clone(), except that
       child_stack_base points to the lowest address of the child's stack
       area, and stack_size specifies the size of the stack pointed to by
       child_stack_base.

   Linux 2.4 and earlier
       In Linux 2.4 and earlier, clone() does not take arguments ptid, tls,
       and ctid.
http://man7.org/linux/man-pages/man2/fork.2.html
11
SYSTEM CALL:
fork(2) - Linux manual page
FUNCTIONALITY:

       fork - create a child process
SYNOPSIS:

       #include <unistd.h>

       pid_t fork(void);
DESCRIPTION

       fork() creates a new process by duplicating the calling process.  The
       new process is referred to as the child process.  The calling process
       is referred to as the parent process.

       The child process and the parent process run in separate memory
       spaces.  At the time of fork() both memory spaces have the same
       content.  Memory writes, file mappings (mmap(2)), and unmappings
       (munmap(2)) performed by one of the processes do not affect the
       other.
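
       The separation of memory spaces can be observed with a short
       sketch: the child modifies a variable, and the parent's copy is
       unchanged.  The function name parent_value_after_child_write() is
       hypothetical.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the parent's value of a variable after the child has
   modified its own copy, or -1 on error. */
int parent_value_after_child_write(void)
{
    volatile int value = 1;

    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        value = 99;         /* affects only the child's memory space */
        _exit(value == 99 ? 0 : 1);
    }

    int wstatus;
    if (waitpid(pid, &wstatus, 0) == -1)
        return -1;
    return value;           /* the parent's copy was never written */
}
```

       The function is expected to return 1 in the parent.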

       The child process is an exact duplicate of the parent process except
       for the following points:

       *  The child has its own unique process ID, and this PID does not
          match the ID of any existing process group (setpgid(2)).

       *  The child's parent process ID is the same as the parent's process
          ID.

       *  The child does not inherit its parent's memory locks (mlock(2),
          mlockall(2)).

       *  Process resource utilizations (getrusage(2)) and CPU time counters
          (times(2)) are reset to zero in the child.

       *  The child's set of pending signals is initially empty
          (sigpending(2)).

       *  The child does not inherit semaphore adjustments from its parent
          (semop(2)).

       *  The child does not inherit process-associated record locks from
          its parent (fcntl(2)).  (On the other hand, it does inherit
          fcntl(2) open file description locks and flock(2) locks from its
          parent.)

       *  The child does not inherit timers from its parent (setitimer(2),
          alarm(2), timer_create(2)).

       *  The child does not inherit outstanding asynchronous I/O operations
          from its parent (aio_read(3), aio_write(3)), nor does it inherit
          any asynchronous I/O contexts from its parent (see io_setup(2)).

       The process attributes in the preceding list are all specified in
       POSIX.1.  The parent and child also differ with respect to the
       following Linux-specific process attributes:

       *  The child does not inherit directory change notifications
          (dnotify) from its parent (see the description of F_NOTIFY in
          fcntl(2)).

       *  The prctl(2) PR_SET_PDEATHSIG setting is reset so that the child
          does not receive a signal when its parent terminates.

       *  The default timer slack value is set to the parent's current timer
          slack value.  See the description of PR_SET_TIMERSLACK in
          prctl(2).

       *  Memory mappings that have been marked with the madvise(2)
          MADV_DONTFORK flag are not inherited across a fork().

       *  The termination signal of the child is always SIGCHLD (see
          clone(2)).

       *  The port access permission bits set by ioperm(2) are not inherited
          by the child; the child must turn on any bits that it requires
          using ioperm(2).

       Note the following further points:

       *  The child process is created with a single thread—the one that
          called fork().  The entire virtual address space of the parent is
          replicated in the child, including the states of mutexes,
          condition variables, and other pthreads objects; the use of
          pthread_atfork(3) may be helpful for dealing with problems that
          this can cause.

       *  After a fork(2) in a multithreaded program, the child can safely
          call only async-signal-safe functions (see signal(7)) until such
          time as it calls execve(2).

       *  The child inherits copies of the parent's set of open file
          descriptors.  Each file descriptor in the child refers to the same
          open file description (see open(2)) as the corresponding file
          descriptor in the parent.  This means that the two file
          descriptors share open file status flags, file offset, and signal-
          driven I/O attributes (see the description of F_SETOWN and
          F_SETSIG in fcntl(2)).

       *  The child inherits copies of the parent's set of open message
          queue descriptors (see mq_overview(7)).  Each file descriptor in
          the child refers to the same open message queue description as the
          corresponding file descriptor in the parent.  This means that the
          two file descriptors share the same flags (mq_flags).

       *  The child inherits copies of the parent's set of open directory
          streams (see opendir(3)).  POSIX.1 says that the corresponding
          directory streams in the parent and child may share the directory
          stream positioning; on Linux/glibc they do not.
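
       The sharing of the file offset between parent and child can be
       sketched as follows.  The sketch assumes that /tmp is writable; the
       scratch-file name and the function name
       parent_offset_after_child_write() are hypothetical.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* The child writes 5 bytes through an inherited descriptor; the
   function returns the file offset then seen by the parent, or -1 on
   error. */
long parent_offset_after_child_write(void)
{
    char path[] = "/tmp/fork-offset-XXXXXX";
    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);       /* the file lives on until fd is closed */

    pid_t pid = fork();
    if (pid == -1) {
        close(fd);
        return -1;
    }

    if (pid == 0) {
        ssize_t n = write(fd, "hello", 5);
        _exit(n == 5 ? 0 : 1);
    }

    long offset = -1;
    int wstatus;
    if (waitpid(pid, &wstatus, 0) == pid &&
        WIFEXITED(wstatus) && WEXITSTATUS(wstatus) == 0)
        offset = (long) lseek(fd, 0, SEEK_CUR);

    close(fd);
    return offset;
}
```

       The expected result is 5, because both descriptors refer to the
       same open file description and therefore share one file offset.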
http://man7.org/linux/man-pages/man2/vfork.2.html
9
SYSTEM CALL:
vfork(2) - Linux manual page
FUNCTIONALITY:

       vfork - create a child process and block parent
SYNOPSIS:

       #include <sys/types.h>
       #include <unistd.h>

       pid_t vfork(void);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       vfork():
           Since glibc 2.12:
               (_XOPEN_SOURCE >= 500) && ! (_POSIX_C_SOURCE >= 200809L)
                   || /* Since glibc 2.19: */ _DEFAULT_SOURCE
                   || /* Glibc versions <= 2.19: */ _BSD_SOURCE
           Before glibc 2.12:
               _BSD_SOURCE || _XOPEN_SOURCE >= 500
DESCRIPTION

   Standard description
       (From POSIX.1) The vfork() function has the same effect as fork(2),
       except that the behavior is undefined if the process created by
       vfork() either modifies any data other than a variable of type pid_t
       used to store the return value from vfork(), or returns from the
       function in which vfork() was called, or calls any other function
       before successfully calling _exit(2) or one of the exec(3) family of
       functions.

   Linux description
       vfork(), just like fork(2), creates a child process of the calling
       process.  For details, return value, and errors, see fork(2).

       vfork() is a special case of clone(2).  It is used to create new
       processes without copying the page tables of the parent process.  It
       may be useful in performance-sensitive applications where a child is
       created which then immediately issues an execve(2).

       vfork() differs from fork(2) in that the calling thread is suspended
       until the child terminates (either normally, by calling _exit(2), or
       abnormally, after delivery of a fatal signal), or it makes a call to
       execve(2).  Until that point, the child shares all memory with its
       parent, including the stack.  The child must not return from the
       current function or call exit(3), but may call _exit(2).

       As with fork(2), the child process created by vfork() inherits copies
       of various process attributes of the caller (e.g., file
       descriptors, signal dispositions, and current working directory); the
       vfork() call differs only in the treatment of the virtual address
       space, as described above.

       Signals sent to the parent arrive after the child releases the
       parent's memory (i.e., after the child terminates or calls
       execve(2)).
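
       A minimal sketch of the restricted pattern described above follows:
       the child does nothing except call _exit(2), and the parent resumes
       only afterward.  The function name vfork_child_status() is
       hypothetical.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Creates a child with vfork() that terminates at once with the given
   status; returns the status collected by the parent, or -1 on
   error. */
int vfork_child_status(int status)
{
    pid_t pid = vfork();
    if (pid == -1)
        return -1;

    if (pid == 0)
        _exit(status);  /* no return, no exit(3), no data modified */

    /* The parent is resumed here once the child has terminated. */
    int wstatus;
    if (waitpid(pid, &wstatus, 0) == -1 || !WIFEXITED(wstatus))
        return -1;
    return WEXITSTATUS(wstatus);
}
```

       For example, vfork_child_status(7) is expected to return 7.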

   Historic description
       Under Linux, fork(2) is implemented using copy-on-write pages, so the
       only penalty incurred by fork(2) is the time and memory required to
       duplicate the parent's page tables, and to create a unique task
       structure for the child.  However, in the bad old days a fork(2)
       would require making a complete copy of the caller's data space,
       often needlessly, since usually immediately afterward an exec(3) is
       done.  Thus, for greater efficiency, BSD introduced the vfork()
       system call, which did not fully copy the address space of the parent
       process, but borrowed the parent's memory and thread of control until
       a call to execve(2) or an exit occurred.  The parent process was
       suspended while the child was using its resources.  The use of
       vfork() was tricky: for example, not modifying data in the parent
       process depended on knowing which variables were held in a register.
http://man7.org/linux/man-pages/man2/execve.2.html
11
SYSTEM CALL:
execve(2) - Linux manual page
FUNCTIONALITY:

       execve - execute program
SYNOPSIS:

       #include <unistd.h>

       int execve(const char *filename, char *const argv[],
                  char *const envp[]);
DESCRIPTION

       execve() executes the program pointed to by filename.  filename must
       be either a binary executable, or a script starting with a line of
       the form:

           #! interpreter [optional-arg]

       For details of the latter case, see "Interpreter scripts" below.

       argv is an array of argument strings passed to the new program.  By
       convention, the first of these strings should contain the filename
       associated with the file being executed.  envp is an array of
       strings, conventionally of the form key=value, which are passed as
       environment to the new program.  Both argv and envp must be
       terminated by a null pointer.  The argument vector and environment
       can be accessed by the called program's main function, when it is
       defined as:

           int main(int argc, char *argv[], char *envp[])

       execve() does not return on success, and the text, data, bss, and
       stack of the calling process are overwritten by that of the program
       loaded.
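
       The argv and envp conventions can be sketched as follows.  The
       sketch assumes that /bin/sh exists and accepts the -c option; the
       environment contents and the function name run_sh_exit() are
       illustrative only.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs "/bin/sh -c 'exit <code>'" in a child and returns the exit
   status seen by the parent, or -1 on error. */
int run_sh_exit(int code)
{
    char cmd[32];
    snprintf(cmd, sizeof(cmd), "exit %d", code);

    /* Both arrays are terminated by a null pointer. */
    char *const argv[] = { "sh", "-c", cmd, NULL };
    char *const envp[] = { "PATH=/bin:/usr/bin", NULL };

    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        execve("/bin/sh", argv, envp);
        _exit(127);     /* reached only if execve() failed */
    }

    int wstatus;
    if (waitpid(pid, &wstatus, 0) == -1 || !WIFEXITED(wstatus))
        return -1;
    return WEXITSTATUS(wstatus);
}
```

       For example, run_sh_exit(3) is expected to return 3 when /bin/sh is
       available.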

       If the current program is being ptraced, a SIGTRAP is sent to it
       after a successful execve().

       If the set-user-ID bit is set on the program file pointed to by
       filename, and the underlying filesystem is not mounted nosuid (the
       MS_NOSUID flag for mount(2)), and the calling process is not being
       ptraced, then the effective user ID of the calling process is changed
       to that of the owner of the program file.  Similarly, when the set-
       group-ID bit of the program file is set the effective group ID of the
       calling process is set to the group of the program file.

       The effective user ID of the process is copied to the saved set-user-
       ID; similarly, the effective group ID is copied to the saved set-
       group-ID.  This copying takes place after any effective ID changes
       that occur because of the set-user-ID and set-group-ID mode bits.

       If the executable is an a.out dynamically linked binary executable
       containing shared-library stubs, the Linux dynamic linker ld.so(8) is
       called at the start of execution to bring needed shared objects into
       memory and link the executable with them.

       If the executable is a dynamically linked ELF executable, the
       interpreter named in the PT_INTERP segment is used to load the needed
       shared objects.  This interpreter is typically /lib/ld-linux.so.2 for
       binaries linked with glibc (see ld-linux.so(8)).

       All process attributes are preserved during an execve(), except the
       following:

       *  The dispositions of any signals that are being caught are reset to
          the default (signal(7)).

       *  Any alternate signal stack is not preserved (sigaltstack(2)).

       *  Memory mappings are not preserved (mmap(2)).

       *  Attached System V shared memory segments are detached (shmat(2)).

       *  POSIX shared memory regions are unmapped (shm_open(3)).

       *  Open POSIX message queue descriptors are closed (mq_overview(7)).

       *  Any open POSIX named semaphores are closed (sem_overview(7)).

       *  POSIX timers are not preserved (timer_create(2)).

       *  Any open directory streams are closed (opendir(3)).

       *  Memory locks are not preserved (mlock(2), mlockall(2)).

       *  Exit handlers are not preserved (atexit(3), on_exit(3)).

       *  The floating-point environment is reset to the default (see
          fenv(3)).

       The process attributes in the preceding list are all specified in
       POSIX.1.  The following Linux-specific process attributes are also
       not preserved during an execve():

       *  The prctl(2) PR_SET_DUMPABLE flag is set, unless a set-user-ID or
          set-group-ID program is being executed, in which case it is
          cleared.

       *  The prctl(2) PR_SET_KEEPCAPS flag is cleared.

       *  (Since Linux 2.4.36 / 2.6.23) If a set-user-ID or set-group-ID
          program is being executed, then the parent death signal set by
          prctl(2) PR_SET_PDEATHSIG flag is cleared.

       *  The process name, as set by prctl(2) PR_SET_NAME (and displayed by
          ps -o comm), is reset to the name of the new executable file.

       *  The SECBIT_KEEP_CAPS securebits flag is cleared.  See
          capabilities(7).

       *  The termination signal is reset to SIGCHLD (see clone(2)).

       *  The file descriptor table is unshared, undoing the effect of the
          CLONE_FILES flag of clone(2).

       Note the following further points:

       *  All threads other than the calling thread are destroyed during an
          execve().  Mutexes, condition variables, and other pthreads
          objects are not preserved.

       *  The equivalent of setlocale(LC_ALL, "C") is executed at program
          start-up.

       *  POSIX.1 specifies that the dispositions of any signals that are
          ignored or set to the default are left unchanged.  POSIX.1
          specifies one exception: if SIGCHLD is being ignored, then an
          implementation may leave the disposition unchanged or reset it to
          the default; Linux does the former.

       *  Any outstanding asynchronous I/O operations are canceled
          (aio_read(3), aio_write(3)).

       *  For the handling of capabilities during execve(), see
          capabilities(7).

       *  By default, file descriptors remain open across an execve().  File
          descriptors that are marked close-on-exec are closed; see the
          description of FD_CLOEXEC in fcntl(2).  (If a file descriptor is
          closed, this will cause the release of all record locks obtained
          on the underlying file by this process.  See fcntl(2) for
          details.)  POSIX.1 says that if file descriptors 0, 1, and 2 would
          otherwise be closed after a successful execve(), and the process
          would gain privilege because the set-user-ID or set-group-ID mode
          bit was set on the executed file, then the system may open an
          unspecified file for each of these file descriptors.  As a general
          principle, no portable program, whether privileged or not, can
          assume that these three file descriptors will remain closed across
          an execve().
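
       Marking a file descriptor close-on-exec, as mentioned in the last
       point above, can be sketched with fcntl(2).  The helper names
       set_cloexec(), is_cloexec(), and cloexec_demo() are hypothetical.

```c
#include <fcntl.h>
#include <unistd.h>

/* Marks fd close-on-exec; returns 0 on success, -1 on error. */
int set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == -1 ? -1 : 0;
}

/* Returns 1 if fd is marked close-on-exec, 0 if not, -1 on error. */
int is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return (flags & FD_CLOEXEC) ? 1 : 0;
}

/* Creates a pipe, marks its read end close-on-exec, and returns the
   resulting state of the flag (1 expected), or -1 on error. */
int cloexec_demo(void)
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    int result = -1;
    if (is_cloexec(fds[0]) == 0 && set_cloexec(fds[0]) == 0)
        result = is_cloexec(fds[0]);

    close(fds[0]);
    close(fds[1]);
    return result;
}
```

       A descriptor marked in this way would be closed by a successful
       execve(), while unmarked descriptors would remain open.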

   Interpreter scripts
       An interpreter script is a text file that has execute permission
       enabled and whose first line is of the form:

           #! interpreter [optional-arg]

       The interpreter must be a valid pathname for an executable file.  If
       the filename argument of execve() specifies an interpreter script,
       then interpreter will be invoked with the following arguments:

           interpreter [optional-arg] filename arg...

       where arg...  is the series of words pointed to by the argv argument
       of execve(), starting at argv[1].

       For portable use, optional-arg should either be absent, or be
       specified as a single word (i.e., it should not contain white space);
       see NOTES below.

       Since Linux 2.6.28, the kernel permits the interpreter of a script to
       itself be a script.  This permission is recursive, up to a limit of
       four recursions, so that the interpreter may be a script which is
       interpreted by a script, and so on.

   Limits on size of arguments and environment
       Most UNIX implementations impose some limit on the total size of the
       command-line argument (argv) and environment (envp) strings that may
       be passed to a new program.  POSIX.1 allows an implementation to
       advertise this limit using the ARG_MAX constant (either defined in
       <limits.h> or available at run time using the call
       sysconf(_SC_ARG_MAX)).

       On Linux prior to kernel 2.6.23, the memory used to store the
       environment and argument strings was limited to 32 pages (defined by
       the kernel constant MAX_ARG_PAGES).  On architectures with a 4-kB
       page size, this yields a maximum size of 128 kB.

       On kernel 2.6.23 and later, most architectures support a size limit
       derived from the soft RLIMIT_STACK resource limit (see getrlimit(2))
       that is in force at the time of the execve() call.  (Architectures
       with no memory management unit are excepted: they maintain the limit
       that was in effect before kernel 2.6.23.)  This change allows
       programs to have a much larger argument and/or environment list.  For
       these architectures, the total size is limited to 1/4 of the allowed
       stack size.  (Imposing the 1/4-limit ensures that the new program
       always has some stack space.)  Since Linux 2.6.25, the kernel places
       a floor of 32 pages on this size limit, so that, even when
       RLIMIT_STACK is set very low, applications are guaranteed to have at
       least as much argument and environment space as was provided by Linux
       2.6.22 and earlier.  (This guarantee was not provided in Linux 2.6.23
       and 2.6.24.)  Additionally, the limit per string is 32 pages (the
       kernel constant MAX_ARG_STRLEN), and the maximum number of strings is
       0x7FFFFFFF.
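
       The arithmetic described above can be sketched as follows.  This
       reproduces only the rules stated in this section (32 pages before
       kernel 2.6.23; one quarter of the soft RLIMIT_STACK, with a floor
       of 32 pages, afterward); a running kernel may apply further limits
       that are not covered here.  The function names are hypothetical.

```c
#include <sys/resource.h>
#include <unistd.h>

/* Limit before kernel 2.6.23: MAX_ARG_PAGES (32) pages. */
long legacy_arg_limit(long page_size)
{
    return 32 * page_size;
}

/* Limit as described for kernel 2.6.25 and later: 1/4 of the soft
   RLIMIT_STACK, but never less than 32 pages.  Returns -1 on error or
   if the stack limit is unlimited. */
long stack_derived_arg_limit(void)
{
    struct rlimit rl;
    long page_size = sysconf(_SC_PAGESIZE);

    if (page_size == -1 || getrlimit(RLIMIT_STACK, &rl) == -1)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY)
        return -1;

    long limit = (long) (rl.rlim_cur / 4);
    long floor = legacy_arg_limit(page_size);
    return limit < floor ? floor : limit;
}
```

       With a 4-kB page size, legacy_arg_limit(4096) yields 131072 bytes,
       that is, the 128 kB mentioned above.  The value advertised by the
       system itself can be obtained with sysconf(_SC_ARG_MAX).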
http://man7.org/linux/man-pages/man2/execveat.2.html
12
SYSTEM CALL:
execveat(2) - Linux manual page
FUNCTIONALITY:

       execveat - execute program relative to a directory file descriptor
SYNOPSIS:

       #include <unistd.h>

       int execveat(int dirfd, const char *pathname,
                    char *const argv[], char *const envp[],
                    int flags);
DESCRIPTION

       The execveat() system call executes the program referred to by the
       combination of dirfd and pathname.  It operates in exactly the same
       way as execve(2), except for the differences described in this manual
       page.

       If the pathname given in pathname is relative, then it is interpreted
       relative to the directory referred to by the file descriptor dirfd
       (rather than relative to the current working directory of the calling
       process, as is done by execve(2) for a relative pathname).

       If pathname is relative and dirfd is the special value AT_FDCWD, then
       pathname is interpreted relative to the current working directory of
       the calling process (like execve(2)).

       If pathname is absolute, then dirfd is ignored.

       If pathname is an empty string and the AT_EMPTY_PATH flag is
       specified, then the file descriptor dirfd specifies the file to be
       executed (i.e., dirfd refers to an executable file, rather than a
       directory).

       The flags argument is a bit mask that can include zero or more of the
       following flags:

       AT_EMPTY_PATH
              If pathname is an empty string, operate on the file referred
              to by dirfd (which may have been obtained using the open(2)
              O_PATH flag).

       AT_SYMLINK_NOFOLLOW
              If the file identified by dirfd and a non-NULL pathname is a
              symbolic link, then the call fails with the error ELOOP.
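
       Resolving a relative pathname against dirfd can be sketched as
       follows.  The sketch assumes that /bin/sh exists, that the kernel
       headers define SYS_execveat, and that no C library wrapper is
       available (hence the raw system call).  The function name
       run_sh_relative_to_dirfd() is hypothetical.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Executes "sh" relative to a descriptor for /bin and returns the
   exit status seen by the parent, or -1 on error. */
int run_sh_relative_to_dirfd(int code)
{
    char cmd[32];
    snprintf(cmd, sizeof(cmd), "exit %d", code);
    char *const argv[] = { "sh", "-c", cmd, NULL };
    char *const envp[] = { NULL };

    int dirfd = open("/bin", O_RDONLY | O_DIRECTORY);
    if (dirfd == -1)
        return -1;

    pid_t pid = fork();
    if (pid == 0) {
        /* "sh" is relative, so it is looked up in the directory
           referred to by dirfd; flags is 0. */
        syscall(SYS_execveat, dirfd, "sh", argv, envp, 0);
        _exit(127);     /* reached only if the call failed */
    }

    close(dirfd);
    if (pid == -1)
        return -1;

    int wstatus;
    if (waitpid(pid, &wstatus, 0) == -1 || !WIFEXITED(wstatus))
        return -1;
    return WEXITSTATUS(wstatus);
}
```

       Passing AT_FDCWD instead of dirfd would resolve "sh" relative to
       the current working directory, as execve(2) does.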
http://man7.org/linux/man-pages/man2/exit.2.html
9
SYSTEM CALL:
_exit(2) - Linux manual page
FUNCTIONALITY:

       _exit, _Exit - terminate the calling process
SYNOPSIS:

       #include <unistd.h>

       void _exit(int status);

       #include <stdlib.h>

       void _Exit(int status);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       _Exit():
           _ISOC99_SOURCE || _POSIX_C_SOURCE >= 200112L
DESCRIPTION

       The function _exit() terminates the calling process "immediately".
       Any open file descriptors belonging to the process are closed; any
       children of the process are inherited by process 1, init, and the
       process's parent is sent a SIGCHLD signal.

       The value status is returned to the parent process as the process's
       exit status, and can be collected using one of the wait(2) family of
       calls.
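
       Two of the effects described above, the closing of open file
       descriptors and the delivery of the exit status, can be observed
       with a pipe.  This is a sketch only; the function name
       eof_after_child_exit() is hypothetical.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* The child holds the only remaining write end of a pipe and calls
   _exit(3).  Returns the child's exit status if the parent then reads
   end-of-file from the pipe, or -1 otherwise. */
int eof_after_child_exit(void)
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        close(fds[0]);
        _exit(3);       /* fds[1] is closed as part of termination */
    }

    close(fds[1]);      /* the parent keeps only the read end */

    char c;
    ssize_t n = read(fds[0], &c, 1);    /* 0 means end-of-file */
    close(fds[0]);

    int wstatus;
    if (waitpid(pid, &wstatus, 0) == -1 || !WIFEXITED(wstatus))
        return -1;
    return n == 0 ? WEXITSTATUS(wstatus) : -1;
}
```

       The expected result is 3: the read returns end-of-file once the
       child's descriptors have been closed, and the status passed to
       _exit() is collected with waitpid(2).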

       The function _Exit() is equivalent to _exit().
http://man7.org/linux/man-pages/man2/exit_group.2.html
10
SYSTEM CALL:
exit_group(2) - Linux manual page
FUNCTIONALITY:

       exit_group - exit all threads in a process
SYNOPSIS:

       #include <linux/unistd.h>

       void exit_group(int status);
DESCRIPTION

       This system call is equivalent to _exit(2) except that it terminates
       not only the calling thread, but all threads in the calling process's
       thread group.
http://man7.org/linux/man-pages/man2/wait4.2.html
10
SYSTEM CALL:
wait4(2) - Linux manual page
FUNCTIONALITY:

       wait3, wait4 - wait for process to change state, BSD style
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/time.h>
       #include <sys/resource.h>
       #include <sys/wait.h>

       pid_t wait3(int *wstatus, int options,
                   struct rusage *rusage);

       pid_t wait4(pid_t pid, int *wstatus, int options,
                   struct rusage *rusage);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       wait3():
           Since glibc 2.19:
               _DEFAULT_SOURCE || _XOPEN_SOURCE >= 500
           Glibc 2.19 and earlier:
               _BSD_SOURCE || _XOPEN_SOURCE >= 500
       wait4():
           Since glibc 2.19:
               _DEFAULT_SOURCE
           Glibc 2.19 and earlier:
               _BSD_SOURCE
DESCRIPTION

       These functions are obsolete; use waitpid(2) or waitid(2) in new
       programs.

       The wait3() and wait4() system calls are similar to waitpid(2), but
       additionally return resource usage information about the child in the
       structure pointed to by rusage.

       Other than the use of the rusage argument, the following wait3()
       call:

           wait3(wstatus, options, rusage);

       is equivalent to:

           waitpid(-1, wstatus, options);

       Similarly, the following wait4() call:

           wait4(pid, wstatus, options, rusage);

       is equivalent to:

           waitpid(pid, wstatus, options);

       In other words, wait3() waits for any child, while wait4() can be used
       to select a specific child, or children, on which to wait.  See
       wait(2) for further details.

       If rusage is not NULL, the struct rusage to which it points will be
       filled with accounting information about the child.  See getrusage(2)
       for details.
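
       A minimal sketch of wait4() with and without a rusage buffer
       follows.  The function name wait4_child() is hypothetical.

```c
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child that exits with the given status and waits for it
   with wait4().  If ru is not NULL, it receives the child's resource
   usage.  Returns the child's exit status, or -1 on error. */
int wait4_child(int status, struct rusage *ru)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0)
        _exit(status);

    int wstatus;
    if (wait4(pid, &wstatus, 0, ru) == -1 || !WIFEXITED(wstatus))
        return -1;
    return WEXITSTATUS(wstatus);
}
```

       A caller interested in accounting data would pass the address of a
       struct rusage and inspect fields such as ru_utime afterward; a
       caller that passes NULL gets the same behavior as waitpid(2).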
http://man7.org/linux/man-pages/man2/waitid.2.html
12
SYSTEM CALL:
wait(2) - Linux manual page
FUNCTIONALITY:

       wait, waitpid, waitid - wait for process to change state
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/wait.h>

       pid_t wait(int *wstatus);

       pid_t waitpid(pid_t pid, int *wstatus, int options);

       int waitid(idtype_t idtype, id_t id, siginfo_t *infop, int options);
                       /* This is the glibc and POSIX interface; see
                          NOTES for information on the raw system call. */

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       waitid():
           _XOPEN_SOURCE
               || /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
               || /* Glibc versions <= 2.19: */ _BSD_SOURCE
DESCRIPTION

       All of these system calls are used to wait for state changes in a
       child of the calling process, and obtain information about the child
       whose state has changed.  A state change is considered to be: the
       child terminated; the child was stopped by a signal; or the child was
       resumed by a signal.  In the case of a terminated child, performing a
       wait allows the system to release the resources associated with the
       child; if a wait is not performed, then the terminated child remains
       in a "zombie" state (see NOTES below).

       If a child has already changed state, then these calls return
       immediately.  Otherwise, they block until either a child changes
       state or a signal handler interrupts the call (assuming that system
       calls are not automatically restarted using the SA_RESTART flag of
       sigaction(2)).  In the remainder of this page, a child whose state
       has changed and which has not yet been waited upon by one of these
       system calls is termed waitable.

   wait() and waitpid()
       The wait() system call suspends execution of the calling process
       until one of its children terminates.  The call wait(&wstatus) is
       equivalent to:

           waitpid(-1, &wstatus, 0);

       The waitpid() system call suspends execution of the calling process
       until a child specified by the pid argument has changed state.  By
       default, waitpid() waits only for terminated children, but this
       behavior is modifiable via the options argument, as described below.

       The value of pid can be:

       < -1   meaning wait for any child process whose process group ID is
              equal to the absolute value of pid.

       -1     meaning wait for any child process.

       0      meaning wait for any child process whose process group ID is
              equal to that of the calling process.

       > 0    meaning wait for the child whose process ID is equal to the
              value of pid.

       The value of options is an OR of zero or more of the following
       constants:

       WNOHANG     return immediately if no child has exited.

       WUNTRACED   also return if a child has stopped (but not traced via
                   ptrace(2)).  Status for traced children which have
                   stopped is provided even if this option is not specified.

       WCONTINUED (since Linux 2.6.10)
                   also return if a stopped child has been resumed by
                   delivery of SIGCONT.

       (For Linux-only options, see below.)

       If wstatus is not NULL, wait() and waitpid() store status information
       in the int to which it points.  This integer can be inspected with
       the following macros (which take the integer itself as an argument,
       not a pointer to it, as is done in wait() and waitpid()!):

       WIFEXITED(wstatus)
              returns true if the child terminated normally, that is, by
              calling exit(3) or _exit(2), or by returning from main().

       WEXITSTATUS(wstatus)
              returns the exit status of the child.  This consists of the
              least significant 8 bits of the status argument that the
              child specified in a call to exit(3) or _exit(2) or as the
              argument for a return statement in main().  This macro should
              be employed only if WIFEXITED returned true.

       WIFSIGNALED(wstatus)
              returns true if the child process was terminated by a signal.

       WTERMSIG(wstatus)
              returns the number of the signal that caused the child process
              to terminate.  This macro should be employed only if
              WIFSIGNALED returned true.

       WCOREDUMP(wstatus)
              returns true if the child produced a core dump.  This macro
              should be employed only if WIFSIGNALED returned true.  This
              macro is not specified in POSIX.1-2001 and is not available on
              some UNIX implementations (e.g., AIX, SunOS).  Only use this
              enclosed in #ifdef WCOREDUMP ... #endif.

       WIFSTOPPED(wstatus)
              returns true if the child process was stopped by delivery of a
              signal; this is possible only if the call was done using
              WUNTRACED or when the child is being traced (see ptrace(2)).

       WSTOPSIG(wstatus)
              returns the number of the signal which caused the child to
              stop.  This macro should be employed only if WIFSTOPPED
              returned true.

       WIFCONTINUED(wstatus)
              (since Linux 2.6.10) returns true if the child process was
              resumed by delivery of SIGCONT.
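
       The status macros above can be exercised with a short sketch.  The
       helper names (wait_for, child_exit_status, child_term_signal) are
       illustrative assumptions, not part of any library; error handling
       is kept minimal.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Wait for one specific child, retrying if a signal handler interrupts
   the call.  Returns 0 and fills *wstatus on success, -1 on error. */
static int wait_for(pid_t pid, int *wstatus)
{
    while (waitpid(pid, wstatus, 0) == -1) {
        if (errno != EINTR)
            return -1;
    }
    return 0;
}

/* Hypothetical helper: fork a child that calls _exit(code) and return
   the exit status the parent observes, or -1 on any failure. */
int child_exit_status(int code)
{
    int wstatus;
    pid_t pid = fork();

    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(code);                /* child: terminate normally */

    if (wait_for(pid, &wstatus) == -1 || !WIFEXITED(wstatus))
        return -1;
    return WEXITSTATUS(wstatus);    /* valid only after WIFEXITED */
}

/* Hypothetical helper: fork a child that sends itself `sig` and return
   the terminating signal the parent observes, or -1 on failure. */
int child_term_signal(int sig)
{
    int wstatus;
    pid_t pid = fork();

    if (pid == -1)
        return -1;
    if (pid == 0) {
        raise(sig);
        _exit(0);       /* reached only if the signal did not kill us */
    }

    if (wait_for(pid, &wstatus) == -1 || !WIFSIGNALED(wstatus))
        return -1;
    return WTERMSIG(wstatus);       /* valid only after WIFSIGNALED */
}
```

       Note that the macros take the integer itself, while waitpid() takes
       a pointer to it.  A call such as child_exit_status(0x103) reports 3,
       since only the least significant 8 bits of the exit status reach
       the parent.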

   waitid()
       The waitid() system call (available since Linux 2.6.9) provides more
       precise control over which child state changes to wait for.

       The idtype and id arguments select the child(ren) to wait for, as
       follows:

       idtype == P_PID
              Wait for the child whose process ID matches id.

       idtype == P_PGID
              Wait for any child whose process group ID matches id.

       idtype == P_ALL
              Wait for any child; id is ignored.

       The child state changes to wait for are specified by ORing one or
       more of the following flags in options:

       WEXITED     Wait for children that have terminated.

       WSTOPPED    Wait for children that have been stopped by delivery of a
                   signal.

       WCONTINUED  Wait for (previously stopped) children that have been
                   resumed by delivery of SIGCONT.

       The following flags may additionally be ORed in options:

       WNOHANG     As for waitpid().

       WNOWAIT     Leave the child in a waitable state; a later wait call
                   can be used to again retrieve the child status
                   information.

       Upon successful return, waitid() fills in the following fields of the
       siginfo_t structure pointed to by infop:

       si_pid      The process ID of the child.

       si_uid      The real user ID of the child.  (This field is not set on
                   most other implementations.)

       si_signo    Always set to SIGCHLD.

       si_status   Either the exit status of the child, as given to _exit(2)
                   (or exit(3)), or the signal that caused the child to
                   terminate, stop, or continue.  The si_code field can be
                   used to determine how to interpret this field.

       si_code     Set to one of: CLD_EXITED (child called _exit(2));
                   CLD_KILLED (child killed by signal); CLD_DUMPED (child
                   killed by signal, and dumped core); CLD_STOPPED (child
                   stopped by signal); CLD_TRAPPED (traced child has
                   trapped); or CLD_CONTINUED (child continued by SIGCONT).

       If WNOHANG was specified in options and there were no children in a
       waitable state, then waitid() returns 0 immediately and the state of
       the siginfo_t structure pointed to by infop is unspecified.  To
       distinguish this case from that where a child was in a waitable
       state, zero out the si_pid field before the call and check for a
       nonzero value in this field after the call returns.
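
       The WNOHANG technique described above (zero si_pid, call waitid(),
       then test si_pid) might look like the following sketch.  The
       function names and the pipe used to keep the child alive are
       assumptions made for illustration only.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Non-blocking check on one child.  Returns 1 if `pid` has terminated
   (and has now been reaped), 0 if it is not yet waitable, -1 on error. */
int poll_child_exited(pid_t pid, siginfo_t *info)
{
    memset(info, 0, sizeof(*info));         /* si_pid starts out as 0 */
    if (waitid(P_PID, (id_t)pid, info, WEXITED | WNOHANG) == -1)
        return -1;
    return info->si_pid != 0;               /* nonzero: a child was found */
}

/* Demonstration: the child blocks on a pipe until the parent closes the
   write end, so the first poll is expected to find nothing waitable.
   Returns the child's exit status (42) if everything behaved as
   described, -1 otherwise. */
int demo_waitid_nohang(void)
{
    int fds[2];
    siginfo_t info;
    pid_t pid;
    int early;

    if (pipe(fds) == -1)
        return -1;
    pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        char c;
        close(fds[1]);                      /* keep only the read end */
        if (read(fds[0], &c, 1) < 0)        /* returns 0 at end-of-file */
            _exit(1);
        _exit(42);
    }

    close(fds[0]);
    early = poll_child_exited(pid, &info);  /* child is still blocked */
    close(fds[1]);                          /* let the child exit */

    if (waitid(P_PID, (id_t)pid, &info, WEXITED) == -1)
        return -1;
    if (early != 0 || info.si_code != CLD_EXITED)
        return -1;
    return info.si_status;                  /* exit status, per si_code */
}
```

       Because si_code is CLD_EXITED here, si_status holds the exit status
       rather than a signal number.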
http://man7.org/linux/man-pages/man2/getpid.2.html
9
SYSTEM CALL:
getpid(2) - Linux manual page
FUNCTIONALITY:

       getpid, getppid - get process identification
SYNOPSIS:

       #include <sys/types.h>
       #include <unistd.h>

       pid_t getpid(void);
       pid_t getppid(void);
DESCRIPTION

       getpid() returns the process ID of the calling process.  (This is
       often used by routines that generate unique temporary filenames.)

       getppid() returns the process ID of the parent of the calling
       process.
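
       As an illustration, the sketch below forks a child that reports
       its getppid() value back over a pipe, so the parent can compare it
       with its own getpid().  The function name is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the child's getppid() equals the parent's getpid(),
   0 if it does not, and -1 on error. */
int child_sees_parent_pid(void)
{
    int fds[2];
    pid_t child, reported = -1;
    ssize_t n;

    if (pipe(fds) == -1)
        return -1;
    child = fork();
    if (child == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (child == 0) {
        pid_t ppid = getppid();             /* parent is still alive */
        close(fds[0]);
        n = write(fds[1], &ppid, sizeof(ppid));
        _exit(n == (ssize_t)sizeof(ppid) ? 0 : 1);
    }

    close(fds[1]);
    n = read(fds[0], &reported, sizeof(reported));
    close(fds[0]);
    waitpid(child, NULL, 0);                /* reap the child */

    if (n != (ssize_t)sizeof(reported))
        return -1;
    return reported == getpid();
}
```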
http://man7.org/linux/man-pages/man2/getppid.2.html
9
SYSTEM CALL:
getpid(2) - Linux manual page
FUNCTIONALITY:

       getpid, getppid - get process identification
SYNOPSIS:

       #include <sys/types.h>
       #include <unistd.h>

       pid_t getpid(void);
       pid_t getppid(void);
DESCRIPTION

       getpid() returns the process ID of the calling process.  (This is
       often used by routines that generate unique temporary filenames.)

       getppid() returns the process ID of the parent of the calling
       process.
http://man7.org/linux/man-pages/man2/gettid.2.html
11
SYSTEM CALL:
gettid(2) - Linux manual page
FUNCTIONALITY:

       gettid - get thread identification
SYNOPSIS:

       #include <sys/types.h>

       pid_t gettid(void);

       Note: There is no glibc wrapper for this system call; see NOTES.
DESCRIPTION

       gettid() returns the caller's thread ID (TID).  In a single-threaded
       process, the thread ID is equal to the process ID (PID, as returned
       by getpid(2)).  In a multithreaded process, all threads have the same
       PID, but each one has a unique TID.  For further details, see the
       discussion of CLONE_THREAD in clone(2).
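
       Where no library wrapper is available, the call can be made through
       syscall(2).  The following is a minimal sketch; the name my_gettid
       is an assumption, chosen to avoid clashing with the gettid()
       wrapper that glibc 2.30 and later do provide.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wrapper: invoke the raw gettid system call. */
pid_t my_gettid(void)
{
    return (pid_t)syscall(SYS_gettid);
}
```

       In a single-threaded process, my_gettid() and getpid() return the
       same value.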
http://man7.org/linux/man-pages/man2/setsid.2.html
10
SYSTEM CALL:
setsid(2) - Linux manual page
FUNCTIONALITY:

       setsid - creates a session and sets the process group ID
SYNOPSIS:

       #include <unistd.h>

       pid_t setsid(void);
DESCRIPTION

       setsid() creates a new session if the calling process is not a
       process group leader.  The calling process is the leader of the new
       session (i.e., its session ID is made the same as its process ID).
       The calling process also becomes the process group leader of a new
       process group in the session (i.e., its process group ID is made the
       same as its process ID).

       The calling process will be the only process in the new process group
       and in the new session.  The new session has no controlling terminal.
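
       A freshly forked child is never a process group leader, so it can
       call setsid().  The sketch below (function name hypothetical)
       checks in the child that the session ID and process group ID both
       become equal to its process ID, and reports the result through its
       exit status.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the child became leader of a new session and a new
   process group, 0 if not, -1 on error. */
int child_becomes_session_leader(void)
{
    int wstatus;
    pid_t child = fork();

    if (child == -1)
        return -1;
    if (child == 0) {
        pid_t sid = setsid();               /* new session ID on success */
        int ok = sid != (pid_t)-1
                 && sid == getpid()
                 && getsid(0) == getpid()   /* session leader */
                 && getpgrp() == getpid();  /* process group leader */
        _exit(ok ? 0 : 1);
    }

    if (waitpid(child, &wstatus, 0) == -1)
        return -1;
    return WIFEXITED(wstatus) && WEXITSTATUS(wstatus) == 0;
}
```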
http://man7.org/linux/man-pages/man2/getsid.2.html
11
SYSTEM CALL:
getsid(2) - Linux manual page
FUNCTIONALITY:

       getsid - get session ID
SYNOPSIS:

       #include <unistd.h>

       pid_t getsid(pid_t pid);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       getsid():
           _XOPEN_SOURCE >= 500
               || /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
DESCRIPTION

       getsid(0) returns the session ID of the calling process.  getsid(p)
       returns the session ID of the process with process ID p.  (The
       session ID of a process is the process group ID of the session
       leader.)
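
       One possible use is checking whether two processes share a
       session.  The helper below is a sketch with a hypothetical name.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if processes a and b are in the same session, 0 if not,
   and -1 if either lookup fails.  A PID of 0 means the caller. */
int same_session(pid_t a, pid_t b)
{
    pid_t sa = getsid(a);
    pid_t sb = getsid(b);

    if (sa == (pid_t)-1 || sb == (pid_t)-1)
        return -1;
    return sa == sb;
}
```

       For example, same_session(0, getpid()) compares the caller with
       itself and therefore reports 1.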
http://man7.org/linux/man-pages/man2/setpgid.2.html
10
SYSTEM CALL:
setpgid(2) - Linux manual page
FUNCTIONALITY:

       setpgid, getpgid, setpgrp, getpgrp - set/get process group
SYNOPSIS:

       #include <unistd.h>

       int setpgid(pid_t pid, pid_t pgid);
       pid_t getpgid(pid_t pid);

       pid_t getpgrp(void);                 /* POSIX.1 version */
       pid_t getpgrp(pid_t pid);            /* BSD version */

       int setpgrp(void);                   /* System V version */
       int setpgrp(pid_t pid, pid_t pgid);  /* BSD version */

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       getpgid():
           _XOPEN_SOURCE >= 500
               || /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L

       setpgrp() (POSIX.1):
           _XOPEN_SOURCE >= 500
               || /* Since glibc 2.19: */ _DEFAULT_SOURCE
               || /* Glibc versions <= 2.19: */ _SVID_SOURCE

       setpgrp() (BSD), getpgrp() (BSD):
           [These are available only before glibc 2.19]
           _BSD_SOURCE &&
               ! (_POSIX_SOURCE || _POSIX_C_SOURCE || _XOPEN_SOURCE ||
                   _GNU_SOURCE || _SVID_SOURCE)
DESCRIPTION

       All of these interfaces are available on Linux, and are used for
       getting and setting the process group ID (PGID) of a process.  The
       preferred, POSIX.1-specified ways of doing this are: getpgrp(void),
       for retrieving the calling process's PGID; and setpgid(), for setting
       a process's PGID.

       setpgid() sets the PGID of the process specified by pid to pgid.  If
       pid is zero, then the process ID of the calling process is used.  If
       pgid is zero, then the PGID of the process specified by pid is made
       the same as its process ID.  If setpgid() is used to move a process
       from one process group to another (as is done by some shells when
       creating pipelines), both process groups must be part of the same
       session (see setsid(2) and credentials(7)).  In this case, the pgid
       specifies an existing process group to be joined and the session ID
       of that group must match the session ID of the joining process.

       The POSIX.1 version of getpgrp(), which takes no arguments, returns
       the PGID of the calling process.

       getpgid() returns the PGID of the process specified by pid.  If pid
       is zero, the process ID of the calling process is used.  (Retrieving
       the PGID of a process other than the caller is rarely necessary, and
       the POSIX.1 getpgrp() is preferred for retrieving the caller's PGID.)

       The System V-style setpgrp(), which takes no arguments, is equivalent
       to setpgid(0, 0).

       The BSD-specific setpgrp() call, which takes arguments pid and pgid,
       is a wrapper function that calls

           setpgid(pid, pgid)

       Since glibc 2.19, the BSD-specific setpgrp() function is no longer
       exposed by <unistd.h>; calls should be replaced with the setpgid()
       call shown above.

       The BSD-specific getpgrp() call, which takes a single pid argument,
       is a wrapper function that calls

           getpgid(pid)

       Since glibc 2.19, the BSD-specific getpgrp() function is no longer
       exposed by <unistd.h>; calls should be replaced with calls to the
       POSIX.1 getpgrp() which takes no arguments (if the intent is to
       obtain the caller's PGID), or with the getpgid() call shown above.
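
       The setpgid(0, 0) case described above can be sketched as follows.
       The child places itself in a new process group and verifies the
       result with both getpgrp() and getpgid(); the function name is an
       assumption.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the child moved into a process group whose ID equals
   its own PID, 0 if not, -1 on error. */
int child_new_process_group(void)
{
    int wstatus;
    pid_t child = fork();

    if (child == -1)
        return -1;
    if (child == 0) {
        /* pid 0: the calling process; pgid 0: use its PID as the PGID */
        int ok = setpgid(0, 0) == 0
                 && getpgrp() == getpid()
                 && getpgid(0) == getpid();
        _exit(ok ? 0 : 1);
    }

    if (waitpid(child, &wstatus, 0) == -1)
        return -1;
    return WIFEXITED(wstatus) && WEXITSTATUS(wstatus) == 0;
}
```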
http://man7.org/linux/man-pages/man2/getpgid.2.html
10
SYSTEM CALL:
setpgid(2) - Linux manual page
FUNCTIONALITY:

       setpgid, getpgid, setpgrp, getpgrp - set/get process group
SYNOPSIS:

       #include <unistd.h>

       int setpgid(pid_t pid, pid_t pgid);
       pid_t getpgid(pid_t pid);

       pid_t getpgrp(void);                 /* POSIX.1 version */
       pid_t getpgrp(pid_t pid);            /* BSD version */

       int setpgrp(void);                   /* System V version */
       int setpgrp(pid_t pid, pid_t pgid);  /* BSD version */

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       getpgid():
           _XOPEN_SOURCE >= 500
               || /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L

       setpgrp() (POSIX.1):
           _XOPEN_SOURCE >= 500
               || /* Since glibc 2.19: */ _DEFAULT_SOURCE
               || /* Glibc versions <= 2.19: */ _SVID_SOURCE

       setpgrp() (BSD), getpgrp() (BSD):
           [These are available only before glibc 2.19]
           _BSD_SOURCE &&
               ! (_POSIX_SOURCE || _POSIX_C_SOURCE || _XOPEN_SOURCE ||
                   _GNU_SOURCE || _SVID_SOURCE)
DESCRIPTION

       All of these interfaces are available on Linux, and are used for
       getting and setting the process group ID (PGID) of a process.  The
       preferred, POSIX.1-specified ways of doing this are: getpgrp(void),
       for retrieving the calling process's PGID; and setpgid(), for setting
       a process's PGID.

       setpgid() sets the PGID of the process specified by pid to pgid.  If
       pid is zero, then the process ID of the calling process is used.  If
       pgid is zero, then the PGID of the process specified by pid is made
       the same as its process ID.  If setpgid() is used to move a process
       from one process group to another (as is done by some shells when
       creating pipelines), both process groups must be part of the same
       session (see setsid(2) and credentials(7)).  In this case, the pgid
       specifies an existing process group to be joined and the session ID
       of that group must match the session ID of the joining process.

       The POSIX.1 version of getpgrp(), which takes no arguments, returns
       the PGID of the calling process.

       getpgid() returns the PGID of the process specified by pid.  If pid
       is zero, the process ID of the calling process is used.  (Retrieving
       the PGID of a process other than the caller is rarely necessary, and
       the POSIX.1 getpgrp() is preferred for retrieving the caller's PGID.)

       The System V-style setpgrp(), which takes no arguments, is equivalent
       to setpgid(0, 0).

       The BSD-specific setpgrp() call, which takes arguments pid and pgid,
       is a wrapper function that calls

           setpgid(pid, pgid)

       Since glibc 2.19, the BSD-specific setpgrp() function is no longer
       exposed by <unistd.h>; calls should be replaced with the setpgid()
       call shown above.

       The BSD-specific getpgrp() call, which takes a single pid argument,
       is a wrapper function that calls

           getpgid(pid)

       Since glibc 2.19, the BSD-specific getpgrp() function is no longer
       exposed by <unistd.h>; calls should be replaced with calls to the
       POSIX.1 getpgrp() which takes no arguments (if the intent is to
       obtain the caller's PGID), or with the getpgid() call shown above.
http://man7.org/linux/man-pages/man2/getpgrp.2.html
10
SYSTEM CALL:
setpgid(2) - Linux manual page
FUNCTIONALITY:

       setpgid, getpgid, setpgrp, getpgrp - set/get process group
SYNOPSIS:

       #include <unistd.h>

       int setpgid(pid_t pid, pid_t pgid);
       pid_t getpgid(pid_t pid);

       pid_t getpgrp(void);                 /* POSIX.1 version */
       pid_t getpgrp(pid_t pid);            /* BSD version */

       int setpgrp(void);                   /* System V version */
       int setpgrp(pid_t pid, pid_t pgid);  /* BSD version */

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       getpgid():
           _XOPEN_SOURCE >= 500
               || /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L

       setpgrp() (POSIX.1):
           _XOPEN_SOURCE >= 500
               || /* Since glibc 2.19: */ _DEFAULT_SOURCE
               || /* Glibc versions <= 2.19: */ _SVID_SOURCE

       setpgrp() (BSD), getpgrp() (BSD):
           [These are available only before glibc 2.19]
           _BSD_SOURCE &&
               ! (_POSIX_SOURCE || _POSIX_C_SOURCE || _XOPEN_SOURCE ||
                   _GNU_SOURCE || _SVID_SOURCE)
DESCRIPTION

       All of these interfaces are available on Linux, and are used for
       getting and setting the process group ID (PGID) of a process.  The
       preferred, POSIX.1-specified ways of doing this are: getpgrp(void),
       for retrieving the calling process's PGID; and setpgid(), for setting
       a process's PGID.

       setpgid() sets the PGID of the process specified by pid to pgid.  If
       pid is zero, then the process ID of the calling process is used.  If
       pgid is zero, then the PGID of the process specified by pid is made
       the same as its process ID.  If setpgid() is used to move a process
       from one process group to another (as is done by some shells when
       creating pipelines), both process groups must be part of the same
       session (see setsid(2) and credentials(7)).  In this case, the pgid
       specifies an existing process group to be joined and the session ID
       of that group must match the session ID of the joining process.

       The POSIX.1 version of getpgrp(), which takes no arguments, returns
       the PGID of the calling process.

       getpgid() returns the PGID of the process specified by pid.  If pid
       is zero, the process ID of the calling process is used.  (Retrieving
       the PGID of a process other than the caller is rarely necessary, and
       the POSIX.1 getpgrp() is preferred for retrieving the caller's PGID.)

       The System V-style setpgrp(), which takes no arguments, is equivalent
       to setpgid(0, 0).

       The BSD-specific setpgrp() call, which takes arguments pid and pgid,
       is a wrapper function that calls

           setpgid(pid, pgid)

       Since glibc 2.19, the BSD-specific setpgrp() function is no longer
       exposed by <unistd.h>; calls should be replaced with the setpgid()
       call shown above.

       The BSD-specific getpgrp() call, which takes a single pid argument,
       is a wrapper function that calls

           getpgid(pid)

       Since glibc 2.19, the BSD-specific getpgrp() function is no longer
       exposed by <unistd.h>; calls should be replaced with calls to the
       POSIX.1 getpgrp() which takes no arguments (if the intent is to
       obtain the caller's PGID), or with the getpgid() call shown above.
http://man7.org/linux/man-pages/man2/setuid.2.html
10
SYSTEM CALL:
setuid(2) - Linux manual page
FUNCTIONALITY:

       setuid - set user identity
SYNOPSIS:

       #include <sys/types.h>
       #include <unistd.h>

       int setuid(uid_t uid);
DESCRIPTION

       setuid() sets the effective user ID of the calling process.  If the
       effective UID of the caller is root (more precisely: if the caller
       has the CAP_SETUID capability), the real UID and saved set-user-ID
       are also set.

       Under Linux, setuid() is implemented like the POSIX version with the
       _POSIX_SAVED_IDS feature.  This allows a set-user-ID (other than
       root) program to drop all of its user privileges, do some
       unprivileged work, and then reengage the original effective user ID
       in a secure manner.

       If the user is root or the program is set-user-ID-root, special care
       must be taken.  The setuid() function checks the effective user ID of
       the caller and if it is the superuser, all process-related user IDs
       are set to uid.  After this has occurred, it is impossible for the
       program to regain root privileges.

       Thus, a set-user-ID-root program wishing to temporarily drop root
       privileges, assume the identity of an unprivileged user, and then
       regain root privileges afterward cannot use setuid().  You can
       accomplish this with seteuid(2).
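
       A temporary drop of privileges with seteuid(2) might be sketched
       as below.  The helper name and the callback parameter are
       assumptions for illustration; a real program would also need to
       consider group IDs and supplementary groups.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: run `work` (may be NULL) with the effective UID
   set to the real UID, then restore the original effective UID.
   Returns 0 on success, -1 if either switch failed. */
int with_dropped_euid(void (*work)(void))
{
    uid_t saved_euid = geteuid();

    if (seteuid(getuid()) == -1)        /* drop to the real user */
        return -1;
    if (work != NULL)
        work();
    if (seteuid(saved_euid) == -1)      /* regain the original identity */
        return -1;
    return 0;
}
```

       In a program that is not set-user-ID the real and effective UIDs
       are already equal, so both switches succeed without changing
       anything.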
http://man7.org/linux/man-pages/man2/getuid.2.html
9
SYSTEM CALL:
getuid(2) - Linux manual page
FUNCTIONALITY:

       getuid, geteuid - get user identity
SYNOPSIS:

       #include <unistd.h>
       #include <sys/types.h>

       uid_t getuid(void);
       uid_t geteuid(void);
DESCRIPTION

       getuid() returns the real user ID of the calling process.

       geteuid() returns the effective user ID of the calling process.
http://man7.org/linux/man-pages/man2/setgid.2.html
10
SYSTEM CALL:
setgid(2) - Linux manual page
FUNCTIONALITY:

       setgid - set group identity
SYNOPSIS:

       #include <sys/types.h>
       #include <unistd.h>

       int setgid(gid_t gid);
DESCRIPTION

       setgid() sets the effective group ID of the calling process.  If the
       caller is privileged (has the CAP_SETGID capability), the real GID
       and saved set-group-ID are also set.

       Under Linux, setgid() is implemented like the POSIX version with the
       _POSIX_SAVED_IDS feature.  This allows a set-group-ID program that is
       not set-user-ID-root to drop all of its group privileges, do some
       unprivileged work, and then reengage the original effective group ID
       in a secure manner.
http://man7.org/linux/man-pages/man2/getgid.2.html
9
SYSTEM CALL:
getgid(2) - Linux manual page
FUNCTIONALITY:

       getgid, getegid - get group identity
SYNOPSIS:

       #include <unistd.h>
       #include <sys/types.h>

       gid_t getgid(void);
       gid_t getegid(void);
DESCRIPTION

       getgid() returns the real group ID of the calling process.

       getegid() returns the effective group ID of the calling process.
http://man7.org/linux/man-pages/man2/setresuid.2.html
11
SYSTEM CALL:
setresuid(2) - Linux manual page
FUNCTIONALITY:

       setresuid, setresgid - set real, effective and saved user or group ID
SYNOPSIS:

       #define _GNU_SOURCE         /* See feature_test_macros(7) */
       #include <unistd.h>

       int setresuid(uid_t ruid, uid_t euid, uid_t suid);
       int setresgid(gid_t rgid, gid_t egid, gid_t sgid);
DESCRIPTION

       setresuid() sets the real user ID, the effective user ID, and the
       saved set-user-ID of the calling process.

       Unprivileged user processes may change the real UID, effective UID,
       and saved set-user-ID, each to one of: the current real UID, the
       current effective UID or the current saved set-user-ID.

       Privileged processes (on Linux, those having the CAP_SETUID
       capability) may set the real UID, effective UID, and saved set-user-
       ID to arbitrary values.

       If one of the arguments equals -1, the corresponding value is not
       changed.

       Regardless of what changes are made to the real UID, effective UID,
       and saved set-user-ID, the filesystem UID is always set to the same
       value as the (possibly new) effective UID.

       Completely analogously, setresgid() sets the real GID, effective GID,
       and saved set-group-ID of the calling process (and always modifies
       the filesystem GID to be the same as the effective GID), with the
       same restrictions for unprivileged processes.
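
       The -1 convention makes it possible to change a single ID.  A
       minimal sketch, with a hypothetical helper name:

```c
#define _GNU_SOURCE
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: change only the effective UID, leaving the real
   UID and saved set-user-ID untouched.  Returns 0 or -1 like
   setresuid(). */
int set_effective_uid_only(uid_t euid)
{
    return setresuid((uid_t)-1, euid, (uid_t)-1);
}
```

       An unprivileged caller may pass only its current real UID,
       effective UID, or saved set-user-ID as euid.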
http://man7.org/linux/man-pages/man2/getresuid.2.html
11
SYSTEM CALL:
getresuid(2) - Linux manual page
FUNCTIONALITY:

       getresuid, getresgid - get real, effective and saved user/group IDs
SYNOPSIS:

       #define _GNU_SOURCE         /* See feature_test_macros(7) */
       #include <unistd.h>

       int getresuid(uid_t *ruid, uid_t *euid, uid_t *suid);
       int getresgid(gid_t *rgid, gid_t *egid, gid_t *sgid);
DESCRIPTION

       getresuid() returns the real UID, the effective UID, and the saved
       set-user-ID of the calling process, in the arguments ruid, euid, and
       suid, respectively.  getresgid() performs the analogous task for the
       process's group IDs.
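
       The three values can be collected in one call.  The struct and
       function names in this sketch are assumptions.

```c
#define _GNU_SOURCE
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical container for the three user IDs of a process. */
struct uid_triple {
    uid_t real;
    uid_t effective;
    uid_t saved;
};

/* Fill *out with the caller's real, effective, and saved set-user-ID.
   Returns 0 on success, -1 on error, like getresuid(). */
int read_uid_triple(struct uid_triple *out)
{
    return getresuid(&out->real, &out->effective, &out->saved);
}
```

       The real and effective fields agree with getuid(2) and geteuid(2);
       the saved set-user-ID has no dedicated getter of its own.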
http://man7.org/linux/man-pages/man2/setresgid.2.html
11
SYSTEM CALL:
setresuid(2) - Linux manual page
FUNCTIONALITY:

       setresuid, setresgid - set real, effective and saved user or group ID
SYNOPSIS:

       #define _GNU_SOURCE         /* See feature_test_macros(7) */
       #include <unistd.h>

       int setresuid(uid_t ruid, uid_t euid, uid_t suid);
       int setresgid(gid_t rgid, gid_t egid, gid_t sgid);
DESCRIPTION

       setresuid() sets the real user ID, the effective user ID, and the
       saved set-user-ID of the calling process.

       Unprivileged user processes may change the real UID, effective UID,
       and saved set-user-ID, each to one of: the current real UID, the
       current effective UID or the current saved set-user-ID.

       Privileged processes (on Linux, those having the CAP_SETUID
       capability) may set the real UID, effective UID, and saved set-user-
       ID to arbitrary values.

       If one of the arguments equals -1, the corresponding value is not
       changed.

       Regardless of what changes are made to the real UID, effective UID,
       and saved set-user-ID, the filesystem UID is always set to the same
       value as the (possibly new) effective UID.

       Completely analogously, setresgid() sets the real GID, effective GID,
       and saved set-group-ID of the calling process (and always modifies
       the filesystem GID to be the same as the effective GID), with the
       same restrictions for unprivileged processes.
http://man7.org/linux/man-pages/man2/getresgid.2.html
11
SYSTEM CALL:
getresuid(2) - Linux manual page
FUNCTIONALITY:

       getresuid, getresgid - get real, effective and saved user/group IDs
SYNOPSIS:

       #define _GNU_SOURCE         /* See feature_test_macros(7) */
       #include <unistd.h>

       int getresuid(uid_t *ruid, uid_t *euid, uid_t *suid);
       int getresgid(gid_t *rgid, gid_t *egid, gid_t *sgid);
DESCRIPTION

       getresuid() returns the real UID, the effective UID, and the saved
       set-user-ID of the calling process, in the arguments ruid, euid, and
       suid, respectively.  getresgid() performs the analogous task for the
       process's group IDs.
http://man7.org/linux/man-pages/man2/setreuid.2.html
10
SYSTEM CALL:
setreuid(2) - Linux manual page
FUNCTIONALITY:

       setreuid, setregid - set real and/or effective user or group ID
SYNOPSIS:

       #include <sys/types.h>
       #include <unistd.h>

       int setreuid(uid_t ruid, uid_t euid);
       int setregid(gid_t rgid, gid_t egid);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       setreuid(), setregid():
           _XOPEN_SOURCE >= 500
               || /* Since glibc 2.19: */ _DEFAULT_SOURCE
               || /* Glibc versions <= 2.19: */ _BSD_SOURCE
DESCRIPTION

       setreuid() sets real and effective user IDs of the calling process.

       Supplying a value of -1 for either the real or effective user ID
       forces the system to leave that ID unchanged.

       Unprivileged processes may only set the effective user ID to the real
       user ID, the effective user ID, or the saved set-user-ID.

       Unprivileged users may only set the real user ID to the real user ID
       or the effective user ID.

       If the real user ID is set (i.e., ruid is not -1) or the effective
       user ID is set to a value not equal to the previous real user ID, the
       saved set-user-ID will be set to the new effective user ID.

       Completely analogously, setregid() sets real and effective group IDs
       of the calling process, and all of the above holds with "group"
       instead of "user".
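
       Under the rules above, an unprivileged process is permitted to
       exchange its real and effective user IDs.  A sketch, with a
       hypothetical function name:

```c
#define _XOPEN_SOURCE 700
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: swap the real and effective user IDs.
   Returns 0 on success, -1 on error, like setreuid(). */
int swap_real_and_effective_uid(void)
{
    return setreuid(geteuid(), getuid());
}
```

       Both arguments are evaluated before the call, so each ID receives
       the other's previous value.  When the two IDs are already equal
       the call changes nothing.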
http://man7.org/linux/man-pages/man2/setregid.2.html
10
SYSTEM CALL:
setreuid(2) - Linux manual page
FUNCTIONALITY:

       setreuid, setregid - set real and/or effective user or group ID
SYNOPSIS:

       #include <sys/types.h>
       #include <unistd.h>

       int setreuid(uid_t ruid, uid_t euid);
       int setregid(gid_t rgid, gid_t egid);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       setreuid(), setregid():
           _XOPEN_SOURCE >= 500
               || /* Since glibc 2.19: */ _DEFAULT_SOURCE
               || /* Glibc versions <= 2.19: */ _BSD_SOURCE
DESCRIPTION

       setreuid() sets real and effective user IDs of the calling process.

       Supplying a value of -1 for either the real or effective user ID
       forces the system to leave that ID unchanged.

       Unprivileged processes may only set the effective user ID to the real
       user ID, the effective user ID, or the saved set-user-ID.

       Unprivileged users may only set the real user ID to the real user ID
       or the effective user ID.

       If the real user ID is set (i.e., ruid is not -1) or the effective
       user ID is set to a value not equal to the previous real user ID, the
       saved set-user-ID will be set to the new effective user ID.

       Completely analogously, setregid() sets real and effective group IDs
       of the calling process, and all of the above holds with "group"
       instead of "user".
http://man7.org/linux/man-pages/man2/setfsuid.2.html
11
SYSTEM CALL:
setfsuid(2) - Linux manual page
FUNCTIONALITY:

       setfsuid - set user identity used for filesystem checks
SYNOPSIS:

       #include <sys/fsuid.h>

       int setfsuid(uid_t fsuid);
DESCRIPTION

       The system call setfsuid() changes the value of the caller's
       filesystem user ID—the user ID that the Linux kernel uses to check
       for all accesses to the filesystem.  Normally, the value of the
       filesystem user ID will shadow the value of the effective user ID.
       In fact, whenever the effective user ID is changed, the filesystem
       user ID will also be changed to the new value of the effective user
       ID.

       Explicit calls to setfsuid() and setfsgid(2) are usually used only by
       programs such as the Linux NFS server that need to change what user
       and group ID is used for file access without a corresponding change
       in the real and effective user and group IDs.  A change in the normal
       user IDs for a program such as the NFS server is a security hole that
       can expose it to unwanted signals.  (But see below.)

       setfsuid() will succeed only if the caller is the superuser or if
       fsuid matches either the caller's real user ID, effective user ID,
       saved set-user-ID, or current filesystem user ID.
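
       The sketch below reads the current filesystem user ID without
       changing it.  It relies on behavior documented in the RETURN VALUE
       section of setfsuid(2), which is not reproduced above: the call
       returns the previous filesystem user ID whether or not it
       succeeds.  The function name is an assumption.

```c
#define _GNU_SOURCE
#include <sys/fsuid.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: report the caller's filesystem user ID.
   (uid_t)-1 is not a valid user ID on Linux, so the request is
   rejected and only the previous value is returned. */
uid_t current_fsuid(void)
{
    return (uid_t)setfsuid((uid_t)-1);
}
```

       Unless setfsuid() has been called explicitly, the result matches
       geteuid(2), since the filesystem user ID shadows the effective
       user ID.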
http://man7.org/linux/man-pages/man2/setfsgid.2.html
11
SYSTEM CALL:
setfsgid(2) - Linux manual page
FUNCTIONALITY:

       setfsgid - set group identity used for filesystem checks
SYNOPSIS:

       #include <sys/fsuid.h>

       int setfsgid(gid_t fsgid);
DESCRIPTION

       The system call setfsgid() changes the value of the caller's
       filesystem group ID—the group ID that the Linux kernel uses to check
       for all accesses to the filesystem.  Normally, the value of the
       filesystem group ID will shadow the value of the effective group ID.
       In fact, whenever the effective group ID is changed, the filesystem
       group ID will also be changed to the new value of the effective group
       ID.

       Explicit calls to setfsuid(2) and setfsgid() are usually used only by
       programs such as the Linux NFS server that need to change what user
       and group ID is used for file access without a corresponding change
       in the real and effective user and group IDs.  A change in the normal
       user IDs for a program such as the NFS server is a security hole that
       can expose it to unwanted signals.  (But see below.)

       setfsgid() will succeed only if the caller is the superuser or if
       fsgid matches either the caller's real group ID, effective group ID,
       saved set-group-ID, or current filesystem group ID.
http://man7.org/linux/man-pages/man2/geteuid.2.html
9
SYSTEM CALL:
getuid(2) - Linux manual page
FUNCTIONALITY:

       getuid, geteuid - get user identity
SYNOPSIS:

       #include <unistd.h>
       #include <sys/types.h>

       uid_t getuid(void);
       uid_t geteuid(void);
DESCRIPTION

       getuid() returns the real user ID of the calling process.

       geteuid() returns the effective user ID of the calling process.
http://man7.org/linux/man-pages/man2/getegid.2.html
9
SYSTEM CALL:
getgid(2) - Linux manual page
FUNCTIONALITY:

       getgid, getegid - get group identity
SYNOPSIS:

       #include <unistd.h>
       #include <sys/types.h>

       gid_t getgid(void);
       gid_t getegid(void);
DESCRIPTION

       getgid() returns the real group ID of the calling process.

       getegid() returns the effective group ID of the calling process.
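       As a small illustration of how the real and effective IDs returned
       by getuid(), geteuid(), getgid(), and getegid() are typically
       compared, the sketch below reports whether either pair differs, as
       it would in a set-user-ID or set-group-ID program.  The function
       name ids_differ() is an assumption made for this example.

```c
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if the effective user or group ID differs from the real one
 * (for example in a set-user-ID or set-group-ID program), 0 otherwise.
 * (ids_differ is an illustrative helper, not a library function.)
 */
int ids_differ(void)
{
    return getuid() != geteuid() || getgid() != getegid();
}
```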
http://man7.org/linux/man-pages/man2/setgroups.2.html
10
SYSTEM CALL:
getgroups(2) - Linux manual page
FUNCTIONALITY:

       getgroups, setgroups - get/set list of supplementary group IDs
SYNOPSIS:

       #include <sys/types.h>
       #include <unistd.h>

       int getgroups(int size, gid_t list[]);

       #include <grp.h>

       int setgroups(size_t size, const gid_t *list);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       setgroups():
           Since glibc 2.19:
               _DEFAULT_SOURCE
           Glibc 2.19 and earlier:
               _SVID_SOURCE
DESCRIPTION

       getgroups() returns the supplementary group IDs of the calling
       process in list.  The argument size should be set to the maximum
       number of items that can be stored in the buffer pointed to by list.
       If the calling process is a member of more than size supplementary
       groups, then an error results.  It is unspecified whether the
       effective group ID of the calling process is included in the returned
       list.  (Thus, an application should also call getegid(2) and add or
       remove the resulting value.)

       If size is zero, list is not modified, but the total number of
       supplementary group IDs for the process is returned.  This allows the
       caller to determine the size of a dynamically allocated list to be
       used in a further call to getgroups().

       setgroups() sets the supplementary group IDs for the calling process.
       Appropriate privileges (Linux: the CAP_SETGID capability) are
       required.  The size argument specifies the number of supplementary
       group IDs in the buffer pointed to by list.
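       The two-call pattern described above (a first call with size zero to
       learn the count, then a second call with an allocated buffer) might
       look like the following sketch.  The helper name fetch_groups() and
       its error convention are assumptions chosen for illustration.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Return a malloc()ed array holding the caller's supplementary group IDs
 * and store the number of entries in *count.  On error, *count is -1 and
 * NULL is returned.  The result may be NULL when *count is 0.
 * The caller frees the array.
 * (fetch_groups is an illustrative helper, not a library function.)
 */
gid_t *fetch_groups(int *count)
{
    /* First call: size 0 leaves the list untouched and returns the count. */
    int n = getgroups(0, NULL);
    if (n < 0) {
        *count = -1;
        return NULL;
    }
    if (n == 0) {
        *count = 0;
        return NULL;
    }

    gid_t *list = malloc((size_t) n * sizeof(*list));
    if (list == NULL) {
        *count = -1;
        return NULL;
    }

    /* Second call: fill the buffer.  This can still fail if the group
       list grew between the two calls. */
    n = getgroups(n, list);
    if (n < 0) {
        free(list);
        *count = -1;
        return NULL;
    }

    *count = n;
    return list;
}
```

       Because it is unspecified whether the effective group ID appears in
       the list, a caller that needs it would still compare the entries
       against getegid().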
http://man7.org/linux/man-pages/man2/getgroups.2.html
10
SYSTEM CALL:
getgroups(2) - Linux manual page
FUNCTIONALITY:

       getgroups, setgroups - get/set list of supplementary group IDs
SYNOPSIS:

       #include <sys/types.h>
       #include <unistd.h>

       int getgroups(int size, gid_t list[]);

       #include <grp.h>

       int setgroups(size_t size, const gid_t *list);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       setgroups():
           Since glibc 2.19:
               _DEFAULT_SOURCE
           Glibc 2.19 and earlier:
               _SVID_SOURCE
DESCRIPTION

       getgroups() returns the supplementary group IDs of the calling
       process in list.  The argument size should be set to the maximum
       number of items that can be stored in the buffer pointed to by list.
       If the calling process is a member of more than size supplementary
       groups, then an error results.  It is unspecified whether the
       effective group ID of the calling process is included in the returned
       list.  (Thus, an application should also call getegid(2) and add or
       remove the resulting value.)

       If size is zero, list is not modified, but the total number of
       supplementary group IDs for the process is returned.  This allows the
       caller to determine the size of a dynamically allocated list to be
       used in a further call to getgroups().

       setgroups() sets the supplementary group IDs for the calling process.
       Appropriate privileges (Linux: the CAP_SETGID capability) are
       required.  The size argument specifies the number of supplementary
       group IDs in the buffer pointed to by list.
http://man7.org/linux/man-pages/man2/setns.2.html
12
SYSTEM CALL:
setns(2) - Linux manual page
FUNCTIONALITY:

       setns - reassociate thread with a namespace
SYNOPSIS:

       #define _GNU_SOURCE             /* See feature_test_macros(7) */
       #include <sched.h>

       int setns(int fd, int nstype);
DESCRIPTION

       Given a file descriptor referring to a namespace, reassociate the
       calling thread with that namespace.

       The fd argument is a file descriptor referring to one of the
       namespace entries in a /proc/[pid]/ns/ directory; see namespaces(7)
       for further information on /proc/[pid]/ns/.  The calling thread will
       be reassociated with the corresponding namespace, subject to any
       constraints imposed by the nstype argument.

       The nstype argument specifies which type of namespace the calling
       thread may be reassociated with.  This argument can have one of the
       following values:

       0      Allow any type of namespace to be joined.

       CLONE_NEWCGROUP (since Linux 4.6)
              fd must refer to a cgroup namespace.

       CLONE_NEWIPC (since Linux 3.0)
              fd must refer to an IPC namespace.

       CLONE_NEWNET (since Linux 3.0)
              fd must refer to a network namespace.

       CLONE_NEWNS (since Linux 3.8)
              fd must refer to a mount namespace.

       CLONE_NEWPID (since Linux 3.8)
              fd must refer to a descendant PID namespace.

       CLONE_NEWUSER (since Linux 3.8)
              fd must refer to a user namespace.

       CLONE_NEWUTS (since Linux 3.0)
              fd must refer to a UTS namespace.

       Specifying nstype as 0 suffices if the caller knows (or does not
       care) what type of namespace is referred to by fd.  Specifying a
       nonzero value for nstype is useful if the caller does not know what
       type of namespace is referred to by fd and wants to ensure that the
       namespace is of a particular type.  (The caller might not know the
       type of the namespace referred to by fd if the file descriptor was
       opened by another process and, for example, passed to the caller via
       a UNIX domain socket.)

       CLONE_NEWPID behaves somewhat differently from the other nstype
       values: reassociating the calling thread with a PID namespace changes
       only the PID namespace that child processes of the caller will be
       created in; it does not change the PID namespace of the caller
       itself.  Reassociating with a PID namespace is allowed only if the
       PID namespace specified by fd is a descendant (child, grandchild,
       etc.)  of the PID namespace of the caller.  For further details on
       PID namespaces, see pid_namespaces(7).

       A process reassociating itself with a user namespace must have the
       CAP_SYS_ADMIN capability in the target user namespace.  Upon
       successfully joining a user namespace, a process is granted all
       capabilities in that namespace, regardless of its user and group IDs.
       A multithreaded process may not change user namespace with setns().
       It is not permitted to use setns() to reenter the caller's current
       user namespace.  This prevents a caller that has dropped capabilities
       from regaining those capabilities via a call to setns().  For
       security reasons, a process can't join a new user namespace if it is
       sharing filesystem-related attributes (the attributes whose sharing
       is controlled by the clone(2) CLONE_FS flag) with another process.
       For further details on user namespaces, see user_namespaces(7).

       A process may not be reassociated with a new mount namespace if it is
       multithreaded.  Changing the mount namespace requires that the caller
       possess both CAP_SYS_CHROOT and CAP_SYS_ADMIN capabilities in its own
       user namespace and CAP_SYS_ADMIN in the target mount namespace.  See
       user_namespaces(7) for details on the interaction of user namespaces
       and mount namespaces.

       Using setns() to change the caller's cgroup namespace does not change
       the caller's cgroup memberships.
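       A minimal sketch of the usual calling sequence follows: open one of
       the /proc/[pid]/ns/ entries, pass the descriptor to setns() with an
       nstype that names the expected namespace type, and close the
       descriptor afterwards.  The helper name join_namespace() is
       hypothetical, and the call is expected to fail without the
       privileges described above.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>

/*
 * Open a namespace file such as /proc/<pid>/ns/uts and join it.
 * `nstype` is 0 or one of the CLONE_NEW* constants; a nonzero value makes
 * setns() reject a descriptor that refers to another namespace type.
 * Returns 0 on success, -1 on failure with errno set.
 * (join_namespace is an illustrative helper, not a library function.)
 */
int join_namespace(const char *ns_path, int nstype)
{
    int fd = open(ns_path, O_RDONLY | O_CLOEXEC);
    if (fd == -1)
        return -1;

    int rc = setns(fd, nstype);

    /* The descriptor is no longer needed once setns() has returned;
       preserve errno across close() so the caller sees setns()'s error. */
    int saved_errno = errno;
    close(fd);
    errno = saved_errno;

    return rc;
}
```

       A privileged caller might use join_namespace("/proc/1234/ns/uts",
       CLONE_NEWUTS), where 1234 stands for the PID of a process already in
       the target namespace.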
http://man7.org/linux/man-pages/man2/setrlimit.2.html
14
SYSTEM CALL:
getrlimit(2) - Linux manual page
FUNCTIONALITY:

       getrlimit, setrlimit, prlimit - get/set resource limits
SYNOPSIS:

       #include <sys/time.h>
       #include <sys/resource.h>

       int getrlimit(int resource, struct rlimit *rlim);
       int setrlimit(int resource, const struct rlimit *rlim);

       int prlimit(pid_t pid, int resource, const struct rlimit *new_limit,
                   struct rlimit *old_limit);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       prlimit(): _GNU_SOURCE
DESCRIPTION

       The getrlimit() and setrlimit() system calls get and set resource
       limits respectively.  Each resource has an associated soft and hard
       limit, as defined by the rlimit structure:

           struct rlimit {
               rlim_t rlim_cur;  /* Soft limit */
               rlim_t rlim_max;  /* Hard limit (ceiling for rlim_cur) */
           };

       The soft limit is the value that the kernel enforces for the
       corresponding resource.  The hard limit acts as a ceiling for the
       soft limit: an unprivileged process may set only its soft limit to a
       value in the range from 0 up to the hard limit, and (irreversibly)
       lower its hard limit.  A privileged process (under Linux: one with
       the CAP_SYS_RESOURCE capability) may make arbitrary changes to either
       limit value.

       The value RLIM_INFINITY denotes no limit on a resource (both in the
       structure returned by getrlimit() and in the structure passed to
       setrlimit()).
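       The relationship between the two limits can be illustrated with a
       short sketch that raises a resource's soft limit to its hard limit,
       which an unprivileged process is allowed to do.  The helper name
       raise_soft_to_hard() is an assumption made for this example.

```c
#include <sys/time.h>
#include <sys/resource.h>

/*
 * Raise the soft limit of `resource` to its current hard limit.
 * Returns 0 on success, -1 on failure with errno set.
 * (raise_soft_to_hard is an illustrative helper, not a library function.)
 */
int raise_soft_to_hard(int resource)
{
    struct rlimit rl;

    if (getrlimit(resource, &rl) == -1)
        return -1;

    /* The hard limit is the ceiling for the soft limit, so this value is
       always within the range an unprivileged process may request. */
    rl.rlim_cur = rl.rlim_max;

    return setrlimit(resource, &rl);
}
```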

       The resource argument must be one of:

       RLIMIT_AS
              The maximum size of the process's virtual memory (address
              space) in bytes.  This limit affects calls to brk(2), mmap(2),
              and mremap(2), which fail with the error ENOMEM upon exceeding
              this limit.  Also automatic stack expansion will fail (and
              generate a SIGSEGV that kills the process if no alternate
              stack has been made available via sigaltstack(2)).  Since the
              value is a long, on machines with a 32-bit long either this
              limit is at most 2 GiB, or this resource is unlimited.

       RLIMIT_CORE
              Maximum size of a core file (see core(5)).  When 0, no core
              dump files are created.  When nonzero, larger dumps are
              truncated to this size.

       RLIMIT_CPU
              CPU time limit in seconds.  When the process reaches the soft
              limit, it is sent a SIGXCPU signal.  The default action for
              this signal is to terminate the process.  However, the signal
              can be caught, and the handler can return control to the main
              program.  If the process continues to consume CPU time, it
              will be sent SIGXCPU once per second until the hard limit is
              reached, at which time it is sent SIGKILL.  (This latter point
              describes Linux behavior.  Implementations vary in how they
              treat processes which continue to consume CPU time after
              reaching the soft limit.  Portable applications that need to
              catch this signal should perform an orderly termination upon
              first receipt of SIGXCPU.)

       RLIMIT_DATA
              The maximum size of the process's data segment (initialized
              data, uninitialized data, and heap).  This limit affects calls
              to brk(2) and sbrk(2), which fail with the error ENOMEM upon
              encountering the soft limit of this resource.

       RLIMIT_FSIZE
              The maximum size of files that the process may create.
              Attempts to extend a file beyond this limit result in delivery
              of a SIGXFSZ signal.  By default, this signal terminates a
              process, but a process can catch this signal instead, in which
              case the relevant system call (e.g., write(2), truncate(2))
              fails with the error EFBIG.

       RLIMIT_LOCKS (Early Linux 2.4 only)
              A limit on the combined number of flock(2) locks and fcntl(2)
              leases that this process may establish.

       RLIMIT_MEMLOCK
              The maximum number of bytes of memory that may be locked into
              RAM.  In effect this limit is rounded down to the nearest
              multiple of the system page size.  This limit affects mlock(2)
              and mlockall(2) and the mmap(2) MAP_LOCKED operation.  Since
              Linux 2.6.9 it also affects the shmctl(2) SHM_LOCK operation,
              where it sets a maximum on the total bytes in shared memory
              segments (see shmget(2)) that may be locked by the real user
              ID of the calling process.  The shmctl(2) SHM_LOCK locks are
              accounted for separately from the per-process memory locks
              established by mlock(2), mlockall(2), and mmap(2) MAP_LOCKED;
              a process can lock bytes up to this limit in each of these two
              categories.  In Linux kernels before 2.6.9, this limit
              controlled the amount of memory that could be locked by a
              privileged process.  Since Linux 2.6.9, no limits are placed
              on the amount of memory that a privileged process may lock,
              and this limit instead governs the amount of memory that an
              unprivileged process may lock.

       RLIMIT_MSGQUEUE (since Linux 2.6.8)
              Specifies the limit on the number of bytes that can be
              allocated for POSIX message queues for the real user ID of the
              calling process.  This limit is enforced for mq_open(3).  Each
              message queue that the user creates counts (until it is
              removed) against this limit according to the formula:

                  Since Linux 3.5:
                      bytes = attr.mq_maxmsg * sizeof(struct msg_msg) +
                              min(attr.mq_maxmsg, MQ_PRIO_MAX) *
                                    sizeof(struct posix_msg_tree_node) +
                                              /* For overhead */
                              attr.mq_maxmsg * attr.mq_msgsize;
                                              /* For message data */

                  Linux 3.4 and earlier:
                      bytes = attr.mq_maxmsg * sizeof(struct msg_msg *) +
                                              /* For overhead */
                              attr.mq_maxmsg * attr.mq_msgsize;
                                              /* For message data */

              where attr is the mq_attr structure specified as the fourth
              argument to mq_open(3), and the msg_msg and
              posix_msg_tree_node structures are kernel-internal structures.

              The "overhead" addend in the formula accounts for overhead
              bytes required by the implementation and ensures that the user
              cannot create an unlimited number of zero-length messages
              (such messages nevertheless each consume some system memory
              for bookkeeping overhead).

       RLIMIT_NICE (since Linux 2.6.12, but see BUGS below)
              Specifies a ceiling to which the process's nice value can be
              raised using setpriority(2) or nice(2).  The actual ceiling
              for the nice value is calculated as 20 - rlim_cur.  (This
              strangeness occurs because negative numbers cannot be
              specified as resource limit values, since they typically have
              special meanings.  For example, RLIM_INFINITY typically is the
              same as -1.)

       RLIMIT_NOFILE
              Specifies a value one greater than the maximum file descriptor
              number that can be opened by this process.  Attempts (open(2),
              pipe(2), dup(2), etc.)  to exceed this limit yield the error
              EMFILE.  (Historically, this limit was named RLIMIT_OFILE on
              BSD.)

       RLIMIT_NPROC
              The maximum number of processes (or, more precisely on Linux,
              threads) that can be created for the real user ID of the
              calling process.  Upon encountering this limit, fork(2) fails
              with the error EAGAIN.  This limit is not enforced for
              processes that have either the CAP_SYS_ADMIN or the
              CAP_SYS_RESOURCE capability.

       RLIMIT_RSS
              Specifies the limit (in bytes) of the process's resident set
              (the number of virtual pages resident in RAM).  This limit has
              effect only in Linux 2.4.x, x < 30, and there affects only
              calls to madvise(2) specifying MADV_WILLNEED.

       RLIMIT_RTPRIO (since Linux 2.6.12, but see BUGS)
              Specifies a ceiling on the real-time priority that may be set
              for this process using sched_setscheduler(2) and
              sched_setparam(2).

       RLIMIT_RTTIME (since Linux 2.6.25)
              Specifies a limit (in microseconds) on the amount of CPU time
              that a process scheduled under a real-time scheduling policy
              may consume without making a blocking system call.  For the
              purpose of this limit, each time a process makes a blocking
              system call, the count of its consumed CPU time is reset to
              zero.  The CPU time count is not reset if the process
              continues trying to use the CPU but is preempted, its time
              slice expires, or it calls sched_yield(2).

              Upon reaching the soft limit, the process is sent a SIGXCPU
              signal.  If the process catches or ignores this signal and
              continues consuming CPU time, then SIGXCPU will be generated
              once each second until the hard limit is reached, at which
              point the process is sent a SIGKILL signal.

              The intended use of this limit is to stop a runaway real-time
              process from locking up the system.

       RLIMIT_SIGPENDING (since Linux 2.6.8)
              Specifies the limit on the number of signals that may be
              queued for the real user ID of the calling process.  Both
              standard and real-time signals are counted for the purpose of
              checking this limit.  However, the limit is enforced only for
              sigqueue(3); it is always possible to use kill(2) to queue one
              instance of any of the signals that are not already queued to
              the process.

       RLIMIT_STACK
              The maximum size of the process stack, in bytes.  Upon
              reaching this limit, a SIGSEGV signal is generated.  To handle
              this signal, a process must employ an alternate signal stack
              (sigaltstack(2)).

              Since Linux 2.6.23, this limit also determines the amount of
              space used for the process's command-line arguments and
              environment variables; for details, see execve(2).
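       The RLIMIT_NICE arithmetic described above (ceiling = 20 - rlim_cur)
       can be worked through with a small sketch.  The helper name
       nice_ceiling() is hypothetical, and treating soft limits above 40 as
       the lowest Linux nice value of -20 is an assumption of this example
       rather than something stated in the text.

```c
#include <sys/time.h>
#include <sys/resource.h>

/*
 * Map an RLIMIT_NICE soft limit to the nice value it permits, following
 * the rule ceiling = 20 - rlim_cur.  Soft limits of 1..40 therefore map
 * to nice values 19..-20.
 * (nice_ceiling is an illustrative helper, not a library function.)
 */
int nice_ceiling(rlim_t soft)
{
    /* Assumption of this sketch: anything beyond 40, including
       RLIM_INFINITY, is treated as the lowest nice value, -20. */
    if (soft == RLIM_INFINITY || soft > 40)
        return -20;

    return 20 - (int) soft;
}
```

       For example, a soft limit of 30 corresponds to a nice value of -10,
       while a soft limit of 0 yields 20, which lies above the largest nice
       value (19).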

   prlimit()
       The Linux-specific prlimit() system call combines and extends the
       functionality of setrlimit() and getrlimit().  It can be used to both
       set and get the resource limits of an arbitrary process.

       The resource argument has the same meaning as for setrlimit() and
       getrlimit().

       If the new_limit argument is not NULL, then the rlimit structure to
       which it points is used to set new values for the soft and hard
       limits for resource.  If the old_limit argument is not NULL, then a
       successful call to prlimit() places the previous soft and hard limits
       for resource in the rlimit structure pointed to by old_limit.

       The pid argument specifies the ID of the process on which the call is
       to operate.  If pid is 0, then the call applies to the calling
       process.  To set or get the resources of a process other than itself,
       the caller must have the CAP_SYS_RESOURCE capability, or the real,
       effective, and saved set-user-IDs of the target process must match
       the real user ID of the caller and the real, effective, and saved
       set-group-IDs of the target process must match the real group ID of the
       caller.
http://man7.org/linux/man-pages/man2/getrlimit.2.html
14
SYSTEM CALL:
getrlimit(2) - Linux manual page
FUNCTIONALITY:

       getrlimit, setrlimit, prlimit - get/set resource limits
SYNOPSIS:

       #include <sys/time.h>
       #include <sys/resource.h>

       int getrlimit(int resource, struct rlimit *rlim);
       int setrlimit(int resource, const struct rlimit *rlim);

       int prlimit(pid_t pid, int resource, const struct rlimit *new_limit,
                   struct rlimit *old_limit);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       prlimit(): _GNU_SOURCE
DESCRIPTION

       The getrlimit() and setrlimit() system calls get and set resource
       limits respectively.  Each resource has an associated soft and hard
       limit, as defined by the rlimit structure:

           struct rlimit {
               rlim_t rlim_cur;  /* Soft limit */
               rlim_t rlim_max;  /* Hard limit (ceiling for rlim_cur) */
           };

       The soft limit is the value that the kernel enforces for the
       corresponding resource.  The hard limit acts as a ceiling for the
       soft limit: an unprivileged process may set only its soft limit to a
       value in the range from 0 up to the hard limit, and (irreversibly)
       lower its hard limit.  A privileged process (under Linux: one with
       the CAP_SYS_RESOURCE capability) may make arbitrary changes to either
       limit value.

       The value RLIM_INFINITY denotes no limit on a resource (both in the
       structure returned by getrlimit() and in the structure passed to
       setrlimit()).

       The resource argument must be one of:

       RLIMIT_AS
              The maximum size of the process's virtual memory (address
              space) in bytes.  This limit affects calls to brk(2), mmap(2),
              and mremap(2), which fail with the error ENOMEM upon exceeding
              this limit.  Also automatic stack expansion will fail (and
              generate a SIGSEGV that kills the process if no alternate
              stack has been made available via sigaltstack(2)).  Since the
              value is a long, on machines with a 32-bit long either this
              limit is at most 2 GiB, or this resource is unlimited.

       RLIMIT_CORE
              Maximum size of a core file (see core(5)).  When 0, no core
              dump files are created.  When nonzero, larger dumps are
              truncated to this size.

       RLIMIT_CPU
              CPU time limit in seconds.  When the process reaches the soft
              limit, it is sent a SIGXCPU signal.  The default action for
              this signal is to terminate the process.  However, the signal
              can be caught, and the handler can return control to the main
              program.  If the process continues to consume CPU time, it
              will be sent SIGXCPU once per second until the hard limit is
              reached, at which time it is sent SIGKILL.  (This latter point
              describes Linux behavior.  Implementations vary in how they
              treat processes which continue to consume CPU time after
              reaching the soft limit.  Portable applications that need to
              catch this signal should perform an orderly termination upon
              first receipt of SIGXCPU.)

       RLIMIT_DATA
              The maximum size of the process's data segment (initialized
              data, uninitialized data, and heap).  This limit affects calls
              to brk(2) and sbrk(2), which fail with the error ENOMEM upon
              encountering the soft limit of this resource.

       RLIMIT_FSIZE
              The maximum size of files that the process may create.
              Attempts to extend a file beyond this limit result in delivery
              of a SIGXFSZ signal.  By default, this signal terminates a
              process, but a process can catch this signal instead, in which
              case the relevant system call (e.g., write(2), truncate(2))
              fails with the error EFBIG.

       RLIMIT_LOCKS (Early Linux 2.4 only)
              A limit on the combined number of flock(2) locks and fcntl(2)
              leases that this process may establish.

       RLIMIT_MEMLOCK
              The maximum number of bytes of memory that may be locked into
              RAM.  In effect this limit is rounded down to the nearest
              multiple of the system page size.  This limit affects mlock(2)
              and mlockall(2) and the mmap(2) MAP_LOCKED operation.  Since
              Linux 2.6.9 it also affects the shmctl(2) SHM_LOCK operation,
              where it sets a maximum on the total bytes in shared memory
              segments (see shmget(2)) that may be locked by the real user
              ID of the calling process.  The shmctl(2) SHM_LOCK locks are
              accounted for separately from the per-process memory locks
              established by mlock(2), mlockall(2), and mmap(2) MAP_LOCKED;
              a process can lock bytes up to this limit in each of these two
              categories.  In Linux kernels before 2.6.9, this limit
              controlled the amount of memory that could be locked by a
              privileged process.  Since Linux 2.6.9, no limits are placed
              on the amount of memory that a privileged process may lock,
              and this limit instead governs the amount of memory that an
              unprivileged process may lock.

       RLIMIT_MSGQUEUE (since Linux 2.6.8)
              Specifies the limit on the number of bytes that can be
              allocated for POSIX message queues for the real user ID of the
              calling process.  This limit is enforced for mq_open(3).  Each
              message queue that the user creates counts (until it is
              removed) against this limit according to the formula:

                  Since Linux 3.5:
                      bytes = attr.mq_maxmsg * sizeof(struct msg_msg) +
                              min(attr.mq_maxmsg, MQ_PRIO_MAX) *
                                    sizeof(struct posix_msg_tree_node) +
                                              /* For overhead */
                              attr.mq_maxmsg * attr.mq_msgsize;
                                              /* For message data */

                  Linux 3.4 and earlier:
                      bytes = attr.mq_maxmsg * sizeof(struct msg_msg *) +
                                              /* For overhead */
                              attr.mq_maxmsg * attr.mq_msgsize;
                                              /* For message data */

              where attr is the mq_attr structure specified as the fourth
              argument to mq_open(3), and the msg_msg and
              posix_msg_tree_node structures are kernel-internal structures.

              The "overhead" addend in the formula accounts for overhead
              bytes required by the implementation and ensures that the user
              cannot create an unlimited number of zero-length messages
              (such messages nevertheless each consume some system memory
              for bookkeeping overhead).

       RLIMIT_NICE (since Linux 2.6.12, but see BUGS below)
              Specifies a ceiling to which the process's nice value can be
              raised using setpriority(2) or nice(2).  The actual ceiling
              for the nice value is calculated as 20 - rlim_cur.  (This
              strangeness occurs because negative numbers cannot be
              specified as resource limit values, since they typically have
              special meanings.  For example, RLIM_INFINITY typically is the
              same as -1.)

       RLIMIT_NOFILE
              Specifies a value one greater than the maximum file descriptor
              number that can be opened by this process.  Attempts (open(2),
              pipe(2), dup(2), etc.)  to exceed this limit yield the error
              EMFILE.  (Historically, this limit was named RLIMIT_OFILE on
              BSD.)

       RLIMIT_NPROC
              The maximum number of processes (or, more precisely on Linux,
              threads) that can be created for the real user ID of the
              calling process.  Upon encountering this limit, fork(2) fails
              with the error EAGAIN.  This limit is not enforced for
              processes that have either the CAP_SYS_ADMIN or the
              CAP_SYS_RESOURCE capability.

       RLIMIT_RSS
              Specifies the limit (in bytes) of the process's resident set
              (the number of virtual pages resident in RAM).  This limit has
              effect only in Linux 2.4.x, x < 30, and there affects only
              calls to madvise(2) specifying MADV_WILLNEED.

       RLIMIT_RTPRIO (since Linux 2.6.12, but see BUGS)
              Specifies a ceiling on the real-time priority that may be set
              for this process using sched_setscheduler(2) and
              sched_setparam(2).

       RLIMIT_RTTIME (since Linux 2.6.25)
              Specifies a limit (in microseconds) on the amount of CPU time
              that a process scheduled under a real-time scheduling policy
              may consume without making a blocking system call.  For the
              purpose of this limit, each time a process makes a blocking
              system call, the count of its consumed CPU time is reset to
              zero.  The CPU time count is not reset if the process
              continues trying to use the CPU but is preempted, its time
              slice expires, or it calls sched_yield(2).

              Upon reaching the soft limit, the process is sent a SIGXCPU
              signal.  If the process catches or ignores this signal and
              continues consuming CPU time, then SIGXCPU will be generated
              once each second until the hard limit is reached, at which
              point the process is sent a SIGKILL signal.

              The intended use of this limit is to stop a runaway real-time
              process from locking up the system.

       RLIMIT_SIGPENDING (since Linux 2.6.8)
              Specifies the limit on the number of signals that may be
              queued for the real user ID of the calling process.  Both
              standard and real-time signals are counted for the purpose of
              checking this limit.  However, the limit is enforced only for
              sigqueue(3); it is always possible to use kill(2) to queue one
              instance of any of the signals that are not already queued to
              the process.

       RLIMIT_STACK
              The maximum size of the process stack, in bytes.  Upon
              reaching this limit, a SIGSEGV signal is generated.  To handle
              this signal, a process must employ an alternate signal stack
              (sigaltstack(2)).

              Since Linux 2.6.23, this limit also determines the amount of
              space used for the process's command-line arguments and
              environment variables; for details, see execve(2).

   prlimit()
       The Linux-specific prlimit() system call combines and extends the
       functionality of setrlimit() and getrlimit().  It can be used to both
       set and get the resource limits of an arbitrary process.

       The resource argument has the same meaning as for setrlimit() and
       getrlimit().

       If the new_limit argument is not NULL, then the rlimit structure to
       which it points is used to set new values for the soft and hard
       limits for resource.  If the old_limit argument is not NULL, then a
       successful call to prlimit() places the previous soft and hard limits
       for resource in the rlimit structure pointed to by old_limit.

       The pid argument specifies the ID of the process on which the call is
       to operate.  If pid is 0, then the call applies to the calling
       process.  To set or get the resources of a process other than itself,
       the caller must have the CAP_SYS_RESOURCE capability, or the real,
       effective, and saved set user IDs of the target process must match
       the real user ID of the caller and the real, effective, and saved set
       group IDs of the target process must match the real group ID of the
       caller.
http://man7.org/linux/man-pages/man2/prlimit.2.html
14
SYSTEM CALL:
getrlimit(2) - Linux manual page
FUNCTIONALITY:

       getrlimit, setrlimit, prlimit - get/set resource limits
SYNOPSIS:

       #include <sys/time.h>
       #include <sys/resource.h>

       int getrlimit(int resource, struct rlimit *rlim);
       int setrlimit(int resource, const struct rlimit *rlim);

       int prlimit(pid_t pid, int resource, const struct rlimit *new_limit,
                   struct rlimit *old_limit);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       prlimit(): _GNU_SOURCE
DESCRIPTION

       The getrlimit() and setrlimit() system calls get and set resource
       limits respectively.  Each resource has an associated soft and hard
       limit, as defined by the rlimit structure:

           struct rlimit {
               rlim_t rlim_cur;  /* Soft limit */
               rlim_t rlim_max;  /* Hard limit (ceiling for rlim_cur) */
           };

       The soft limit is the value that the kernel enforces for the
       corresponding resource.  The hard limit acts as a ceiling for the
       soft limit: an unprivileged process may set only its soft limit to a
       value in the range from 0 up to the hard limit, and (irreversibly)
       lower its hard limit.  A privileged process (under Linux: one with
       the CAP_SYS_RESOURCE capability) may make arbitrary changes to either
       limit value.

       The value RLIM_INFINITY denotes no limit on a resource (both in the
       structure returned by getrlimit() and in the structure passed to
       setrlimit()).
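
       The soft/hard relationship can be sketched as follows.  This is an
       illustration only: raise_soft_to_hard() is a hypothetical helper name,
       and the code relies on nothing beyond the getrlimit() and setrlimit()
       calls shown in the SYNOPSIS.  Raising the soft limit up to the hard
       limit needs no privilege.

```c
#include <assert.h>
#include <sys/resource.h>

/* Hypothetical helper: raise the soft limit of `resource` to its current
 * hard limit.  Returns 0 on success, -1 on error (errno is set). */
int raise_soft_to_hard(int resource)
{
    struct rlimit rl;

    if (getrlimit(resource, &rl) == -1)
        return -1;

    /* The kernel never reports a soft limit above the hard limit. */
    assert(rl.rlim_cur <= rl.rlim_max);

    rl.rlim_cur = rl.rlim_max;   /* may be RLIM_INFINITY, meaning no limit */
    return setrlimit(resource, &rl);
}
```

       For example, raise_soft_to_hard(RLIMIT_NOFILE) lets a server use as
       many file descriptors as its hard limit permits.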

       The resource argument must be one of:

       RLIMIT_AS
              The maximum size of the process's virtual memory (address
              space) in bytes.  This limit affects calls to brk(2), mmap(2),
              and mremap(2), which fail with the error ENOMEM upon exceeding
              this limit.  Automatic stack expansion also fails (generating
              a SIGSEGV that kills the process if no alternate stack has
              been made available via sigaltstack(2)).  Since the
              value is a long, on machines with a 32-bit long either this
              limit is at most 2 GiB, or this resource is unlimited.

       RLIMIT_CORE
              Maximum size of a core file (see core(5)).  When 0, no core
              dump files are created.  When nonzero, larger dumps are
              truncated to this size.

       RLIMIT_CPU
              CPU time limit in seconds.  When the process reaches the soft
              limit, it is sent a SIGXCPU signal.  The default action for
              this signal is to terminate the process.  However, the signal
              can be caught, and the handler can return control to the main
              program.  If the process continues to consume CPU time, it
              will be sent SIGXCPU once per second until the hard limit is
              reached, at which time it is sent SIGKILL.  (This latter point
              describes Linux behavior.  Implementations vary in how they
              treat processes which continue to consume CPU time after
              reaching the soft limit.  Portable applications that need to
              catch this signal should perform an orderly termination upon
              first receipt of SIGXCPU.)

       RLIMIT_DATA
              The maximum size of the process's data segment (initialized
              data, uninitialized data, and heap).  This limit affects calls
              to brk(2) and sbrk(2), which fail with the error ENOMEM upon
              encountering the soft limit of this resource.

       RLIMIT_FSIZE
              The maximum size of files that the process may create.
              Attempts to extend a file beyond this limit result in delivery
              of a SIGXFSZ signal.  By default, this signal terminates a
              process, but a process can catch this signal instead, in which
              case the relevant system call (e.g., write(2), truncate(2))
              fails with the error EFBIG.

       RLIMIT_LOCKS (Early Linux 2.4 only)
              A limit on the combined number of flock(2) locks and fcntl(2)
              leases that this process may establish.

       RLIMIT_MEMLOCK
              The maximum number of bytes of memory that may be locked into
              RAM.  In effect this limit is rounded down to the nearest
              multiple of the system page size.  This limit affects mlock(2)
              and mlockall(2) and the mmap(2) MAP_LOCKED operation.  Since
              Linux 2.6.9 it also affects the shmctl(2) SHM_LOCK operation,
              where it sets a maximum on the total bytes in shared memory
              segments (see shmget(2)) that may be locked by the real user
              ID of the calling process.  The shmctl(2) SHM_LOCK locks are
              accounted for separately from the per-process memory locks
              established by mlock(2), mlockall(2), and mmap(2) MAP_LOCKED;
              a process can lock bytes up to this limit in each of these two
              categories.  In Linux kernels before 2.6.9, this limit
              controlled the amount of memory that could be locked by a
              privileged process.  Since Linux 2.6.9, no limits are placed
              on the amount of memory that a privileged process may lock,
              and this limit instead governs the amount of memory that an
              unprivileged process may lock.

       RLIMIT_MSGQUEUE (since Linux 2.6.8)
              Specifies the limit on the number of bytes that can be
              allocated for POSIX message queues for the real user ID of the
              calling process.  This limit is enforced for mq_open(3).  Each
              message queue that the user creates counts (until it is
              removed) against this limit according to the formula:

                  Since Linux 3.5:
                      bytes = attr.mq_maxmsg * sizeof(struct msg_msg) +
                              min(attr.mq_maxmsg, MQ_PRIO_MAX) *
                                    sizeof(struct posix_msg_tree_node)+
                                              /* For overhead */
                              attr.mq_maxmsg * attr.mq_msgsize;
                                              /* For message data */

                  Linux 3.4 and earlier:
                      bytes = attr.mq_maxmsg * sizeof(struct msg_msg *) +
                                              /* For overhead */
                              attr.mq_maxmsg * attr.mq_msgsize;
                                              /* For message data */

              where attr is the mq_attr structure specified as the fourth
              argument to mq_open(3), and the msg_msg and
              posix_msg_tree_node structures are kernel-internal structures.

              The "overhead" addend in the formula accounts for overhead
              bytes required by the implementation and ensures that the user
              cannot create an unlimited number of zero-length messages
              (such messages nevertheless each consume some system memory
              for bookkeeping overhead).
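
              The two formulas can be written out as plain arithmetic.  In
              this sketch the sizes of the kernel-internal structures and the
              value of MQ_PRIO_MAX are passed in by the caller, because they
              are not visible from user space; the function names and the
              numbers used in the usage note are assumptions chosen only to
              illustrate the calculation.

```c
#include <assert.h>
#include <stddef.h>

/* Smaller of two sizes. */
static size_t min_size(size_t a, size_t b)
{
    return a < b ? a : b;
}

/* Bytes charged per queue since Linux 3.5.  msg_msg_size and
 * tree_node_size stand in for sizeof(struct msg_msg) and
 * sizeof(struct posix_msg_tree_node); both are caller-supplied
 * assumptions, as is mq_prio_max. */
size_t mq_bytes_since_3_5(size_t maxmsg, size_t msgsize,
                          size_t msg_msg_size, size_t tree_node_size,
                          size_t mq_prio_max)
{
    return maxmsg * msg_msg_size                           /* overhead */
         + min_size(maxmsg, mq_prio_max) * tree_node_size  /* overhead */
         + maxmsg * msgsize;                               /* message data */
}

/* Bytes charged per queue on Linux 3.4 and earlier.  ptr_size stands in
 * for sizeof(struct msg_msg *). */
size_t mq_bytes_3_4_and_earlier(size_t maxmsg, size_t msgsize,
                                size_t ptr_size)
{
    return maxmsg * ptr_size      /* overhead */
         + maxmsg * msgsize;      /* message data */
}
```

              With mq_maxmsg = 10, mq_msgsize = 8192 and assumed structure
              sizes of 48 and 40 bytes, the newer formula gives
              480 + 400 + 81920 = 82800 bytes; with 8-byte pointers the older
              formula gives 80 + 81920 = 82000 bytes.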

       RLIMIT_NICE (since Linux 2.6.12, but see BUGS below)
              Specifies a ceiling to which the process's nice value can be
              raised using setpriority(2) or nice(2).  The actual ceiling
              for the nice value is calculated as 20 - rlim_cur.  (This
              strangeness occurs because negative numbers cannot be
              specified as resource limit values, since they typically have
              special meanings.  For example, RLIM_INFINITY typically is the
              same as -1.)
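
              The mapping is simple enough to state as code.  The helper name
              nice_ceiling() is hypothetical; the accepted range of 1 to 40
              is an assumption derived from the formula above and the nice
              range of -20 to +19.

```c
#include <assert.h>
#include <sys/resource.h>

/* Hypothetical helper: nice-value ceiling implied by a soft RLIMIT_NICE
 * value, computed as 20 - rlim_cur.  Values 1..40 map onto nice values
 * 19..-20. */
int nice_ceiling(rlim_t rlim_cur)
{
    assert(rlim_cur >= 1 && rlim_cur <= 40);
    return 20 - (int)rlim_cur;
}
```

              For instance, a soft limit of 40 permits a nice value as low as
              -20, while a soft limit of 1 permits nothing below 19.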

       RLIMIT_NOFILE
              Specifies a value one greater than the maximum file descriptor
              number that can be opened by this process.  Attempts (open(2),
              pipe(2), dup(2), etc.)  to exceed this limit yield the error
              EMFILE.  (Historically, this limit was named RLIMIT_OFILE on
              BSD.)
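
              The "one greater than the maximum file descriptor number"
              wording can be observed directly.  The sketch below temporarily
              lowers the soft limit, opens /dev/null until the kernel
              refuses, and reports the error seen.  The function name
              probe_nofile_limit(), the MAX_TRACKED cap, and the use of
              /dev/null are assumptions made for the illustration.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <unistd.h>

#define MAX_TRACKED 256   /* illustrative cap on descriptors we track */

/* Hypothetical helper: lower the soft RLIMIT_NOFILE to `limit`, open
 * /dev/null until open(2) fails, then close what was opened and restore
 * the previous limit.  Stores the highest descriptor obtained (or -1) in
 * *highest_fd and returns the errno of the failing call. */
int probe_nofile_limit(rlim_t limit, int *highest_fd)
{
    struct rlimit saved, rl;
    int fds[MAX_TRACKED];
    int n = 0, err = 0;

    assert(limit <= MAX_TRACKED);
    *highest_fd = -1;

    if (getrlimit(RLIMIT_NOFILE, &saved) == -1)
        return errno;
    rl = saved;
    rl.rlim_cur = limit;
    if (setrlimit(RLIMIT_NOFILE, &rl) == -1)
        return errno;

    while (n < MAX_TRACKED) {
        int fd = open("/dev/null", O_RDONLY);
        if (fd == -1) {
            err = errno;          /* EMFILE once the limit is reached */
            break;
        }
        if (fd > *highest_fd)
            *highest_fd = fd;
        fds[n++] = fd;
    }

    while (n > 0)
        close(fds[--n]);
    setrlimit(RLIMIT_NOFILE, &saved);   /* restore the original soft limit */
    return err;
}
```

              With a limit of 32, no descriptor numbered 32 or higher is ever
              returned, and the refusing call fails with EMFILE.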

       RLIMIT_NPROC
              The maximum number of processes (or, more precisely on Linux,
              threads) that can be created for the real user ID of the
              calling process.  Upon encountering this limit, fork(2) fails
              with the error EAGAIN.  This limit is not enforced for
              processes that have either the CAP_SYS_ADMIN or the
              CAP_SYS_RESOURCE capability.

       RLIMIT_RSS
              Specifies the limit (in bytes) of the process's resident set
              (the number of virtual pages resident in RAM).  This limit has
              effect only in Linux 2.4.x, x < 30, and there affects only
              calls to madvise(2) specifying MADV_WILLNEED.

       RLIMIT_RTPRIO (since Linux 2.6.12, but see BUGS)
              Specifies a ceiling on the real-time priority that may be set
              for this process using sched_setscheduler(2) and
              sched_setparam(2).

       RLIMIT_RTTIME (since Linux 2.6.25)
              Specifies a limit (in microseconds) on the amount of CPU time
              that a process scheduled under a real-time scheduling policy
              may consume without making a blocking system call.  For the
              purpose of this limit, each time a process makes a blocking
              system call, the count of its consumed CPU time is reset to
              zero.  The CPU time count is not reset if the process
              continues trying to use the CPU but is preempted, its time
              slice expires, or it calls sched_yield(2).

              Upon reaching the soft limit, the process is sent a SIGXCPU
              signal.  If the process catches or ignores this signal and
              continues consuming CPU time, then SIGXCPU will be generated
              once each second until the hard limit is reached, at which
              point the process is sent a SIGKILL signal.

              The intended use of this limit is to stop a runaway real-time
              process from locking up the system.

       RLIMIT_SIGPENDING (since Linux 2.6.8)
              Specifies the limit on the number of signals that may be
              queued for the real user ID of the calling process.  Both
              standard and real-time signals are counted for the purpose of
              checking this limit.  However, the limit is enforced only for
              sigqueue(3); it is always possible to use kill(2) to queue one
              instance of any of the signals that are not already queued to
              the process.

       RLIMIT_STACK
              The maximum size of the process stack, in bytes.  Upon
              reaching this limit, a SIGSEGV signal is generated.  To handle
              this signal, a process must employ an alternate signal stack
              (sigaltstack(2)).

              Since Linux 2.6.23, this limit also determines the amount of
              space used for the process's command-line arguments and
              environment variables; for details, see execve(2).

   prlimit()
       The Linux-specific prlimit() system call combines and extends the
       functionality of setrlimit() and getrlimit().  It can be used to both
       set and get the resource limits of an arbitrary process.

       The resource argument has the same meaning as for setrlimit() and
       getrlimit().

       If the new_limit argument is not NULL, then the rlimit structure to
       which it points is used to set new values for the soft and hard
       limits for resource.  If the old_limit argument is not NULL, then a
       successful call to prlimit() places the previous soft and hard limits
       for resource in the rlimit structure pointed to by old_limit.

       The pid argument specifies the ID of the process on which the call is
       to operate.  If pid is 0, then the call applies to the calling
       process.  To set or get the resources of a process other than itself,
       the caller must have the CAP_SYS_RESOURCE capability, or the real,
       effective, and saved set user IDs of the target process must match
       the real user ID of the caller and the real, effective, and saved set
       group IDs of the target process must match the real group ID of the
       caller.
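
       A short sketch of both uses of prlimit() on the calling process
       (pid 0) follows.  The helper names read_limit() and swap_soft_limit()
       are hypothetical; _GNU_SOURCE is defined first, as the SYNOPSIS
       requires.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE           /* required for prlimit() */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>
#include <sys/types.h>

/* Hypothetical helper: read the limits of `resource` for process `pid`
 * (0 means the caller) without changing them: new_limit is NULL. */
int read_limit(pid_t pid, int resource, struct rlimit *out)
{
    return prlimit(pid, resource, NULL, out);
}

/* Hypothetical helper: give the calling process a new soft limit and
 * return the previous soft and hard limits through `old`. */
int swap_soft_limit(int resource, rlim_t new_soft, struct rlimit *old)
{
    struct rlimit cur, next;

    if (prlimit(0, resource, NULL, &cur) == -1)
        return -1;

    next = cur;                  /* keep the hard limit as it is */
    next.rlim_cur = new_soft;
    return prlimit(0, resource, &next, old);
}
```

       Passing another process's ID as pid works the same way, subject to
       the permission rules described above.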
http://man7.org/linux/man-pages/man2/getrusage.2.html
11
SYSTEM CALL:
getrusage(2) - Linux manual page
FUNCTIONALITY:

       getrusage - get resource usage
SYNOPSIS:

       #include <sys/time.h>
       #include <sys/resource.h>

       int getrusage(int who, struct rusage *usage);
DESCRIPTION

       getrusage() returns resource usage measures for who, which can be one
       of the following:

       RUSAGE_SELF
              Return resource usage statistics for the calling process,
              which is the sum of resources used by all threads in the
              process.

       RUSAGE_CHILDREN
              Return resource usage statistics for all children of the
              calling process that have terminated and been waited for.
              These statistics will include the resources used by
              grandchildren, and further removed descendants, if all of the
              intervening descendants waited on their terminated children.

       RUSAGE_THREAD (since Linux 2.6.26)
              Return resource usage statistics for the calling thread.  The
              _GNU_SOURCE feature test macro must be defined (before
              including any header file) in order to obtain the definition
              of this constant from <sys/resource.h>.

       The resource usages are returned in the structure pointed to by
       usage, which has the following form:

           struct rusage {
               struct timeval ru_utime; /* user CPU time used */
               struct timeval ru_stime; /* system CPU time used */
               long   ru_maxrss;        /* maximum resident set size */
               long   ru_ixrss;         /* integral shared memory size */
               long   ru_idrss;         /* integral unshared data size */
               long   ru_isrss;         /* integral unshared stack size */
               long   ru_minflt;        /* page reclaims (soft page faults) */
               long   ru_majflt;        /* page faults (hard page faults) */
               long   ru_nswap;         /* swaps */
               long   ru_inblock;       /* block input operations */
               long   ru_oublock;       /* block output operations */
               long   ru_msgsnd;        /* IPC messages sent */
               long   ru_msgrcv;        /* IPC messages received */
               long   ru_nsignals;      /* signals received */
               long   ru_nvcsw;         /* voluntary context switches */
               long   ru_nivcsw;        /* involuntary context switches */
           };

       Not all fields are completed; unmaintained fields are set to zero by
       the kernel.  (The unmaintained fields are provided for compatibility
       with other systems, and because they may one day be supported on
       Linux.)  The fields are interpreted as follows:

       ru_utime
              This is the total amount of time spent executing in user mode,
              expressed in a timeval structure (seconds plus microseconds).

       ru_stime
              This is the total amount of time spent executing in kernel
              mode, expressed in a timeval structure (seconds plus
              microseconds).

       ru_maxrss (since Linux 2.6.32)
              This is the maximum resident set size used (in kilobytes).
              For RUSAGE_CHILDREN, this is the resident set size of the
              largest child, not the maximum resident set size of the
              process tree.

       ru_ixrss (unmaintained)
              This field is currently unused on Linux.

       ru_idrss (unmaintained)
              This field is currently unused on Linux.

       ru_isrss (unmaintained)
              This field is currently unused on Linux.

       ru_minflt
              The number of page faults serviced without any I/O activity;
              here I/O activity is avoided by “reclaiming” a page frame from
              the list of pages awaiting reallocation.

       ru_majflt
              The number of page faults serviced that required I/O activity.

       ru_nswap (unmaintained)
              This field is currently unused on Linux.

       ru_inblock (since Linux 2.6.22)
              The number of times the filesystem had to perform input.

       ru_oublock (since Linux 2.6.22)
              The number of times the filesystem had to perform output.

       ru_msgsnd (unmaintained)
              This field is currently unused on Linux.

       ru_msgrcv (unmaintained)
              This field is currently unused on Linux.

       ru_nsignals (unmaintained)
              This field is currently unused on Linux.

       ru_nvcsw (since Linux 2.6)
              The number of times a context switch resulted due to a process
              voluntarily giving up the processor before its time slice was
              completed (usually to await availability of a resource).

       ru_nivcsw (since Linux 2.6)
              The number of times a context switch resulted due to a higher
              priority process becoming runnable or because the current
              process exceeded its time slice.
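
       As an illustration of reading these fields, the sketch below fetches
       the usage of the calling process and converts the two timeval
       members to seconds.  The helper names timeval_seconds() and
       self_cpu_seconds() are assumptions made for the example.

```c
#include <assert.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Convert a timeval (seconds plus microseconds) to seconds. */
double timeval_seconds(const struct timeval *tv)
{
    return (double)tv->tv_sec + (double)tv->tv_usec / 1e6;
}

/* Hypothetical helper: store the user and system CPU time consumed so
 * far by the calling process.  Returns 0 on success, -1 on error. */
int self_cpu_seconds(double *user, double *sys)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) == -1)
        return -1;

    *user = timeval_seconds(&ru.ru_utime);
    *sys  = timeval_seconds(&ru.ru_stime);
    return 0;
}
```

       Calling self_cpu_seconds() before and after a piece of work and
       subtracting the results gives the CPU time that work consumed.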
http://man7.org/linux/man-pages/man2/sched_setattr.2.html
12
SYSTEM CALL:
sched_setattr(2) - Linux manual page
FUNCTIONALITY:

       sched_setattr, sched_getattr - set and get scheduling policy and
       attributes
SYNOPSIS:

       #include <sched.h>

       int sched_setattr(pid_t pid, struct sched_attr *attr,
                         unsigned int flags);

       int sched_getattr(pid_t pid, struct sched_attr *attr,
                         unsigned int size, unsigned int flags);
DESCRIPTION

   sched_setattr()
       The sched_setattr() system call sets the scheduling policy and
       associated attributes for the thread whose ID is specified in pid.
       If pid equals zero, the scheduling policy and attributes of the
       calling thread will be set.

       Currently, Linux supports the following "normal" (i.e., non-real-
       time) scheduling policies as values that may be specified in policy:

       SCHED_OTHER   the standard round-robin time-sharing policy;

       SCHED_BATCH   for "batch" style execution of processes; and

       SCHED_IDLE    for running very low priority background jobs.

       Various "real-time" policies are also supported, for special time-
       critical applications that need precise control over the way in which
       runnable threads are selected for execution.  For the rules governing
       when a process may use these policies, see sched(7).  The real-time
       policies that may be specified in policy are:

       SCHED_FIFO    a first-in, first-out policy; and

       SCHED_RR      a round-robin policy.

       Linux also provides the following policy:

       SCHED_DEADLINE
                     a deadline scheduling policy; see sched(7) for details.

       The attr argument is a pointer to a structure that defines the new
       scheduling policy and attributes for the specified thread.  This
       structure has the following form:

           struct sched_attr {
               u32 size;              /* Size of this structure */
               u32 sched_policy;      /* Policy (SCHED_*) */
               u64 sched_flags;       /* Flags */
               s32 sched_nice;        /* Nice value (SCHED_OTHER,
                                         SCHED_BATCH) */
               u32 sched_priority;    /* Static priority (SCHED_FIFO,
                                         SCHED_RR) */
               /* Remaining fields are for SCHED_DEADLINE */
               u64 sched_runtime;
               u64 sched_deadline;
               u64 sched_period;
           };

       The fields of this structure are as follows:

       size   This field should be set to the size of the structure in
              bytes, as in sizeof(struct sched_attr).  If the provided
              structure is smaller than the kernel structure, any additional
              fields are assumed to be '0'.  If the provided structure is
              larger than the kernel structure, the kernel verifies that all
              additional fields are 0; if they are not, sched_setattr()
              fails with the error E2BIG and updates size to contain the
              size of the kernel structure.

              The above behavior when the size of the user-space sched_attr
              structure does not match the size of the kernel structure
              allows for future extensibility of the interface.  Malformed
              applications that pass oversize structures won't break in the
              future if the size of the kernel sched_attr structure is
              increased.  In the future, it could also allow applications
              that know about a larger user-space sched_attr structure to
              determine whether they are running on an older kernel that
              does not support the larger structure.

       sched_policy
              This field specifies the scheduling policy, as one of the
              SCHED_* values listed above.

       sched_flags
              This field contains flags controlling scheduling behavior.
              Only one such flag is currently defined:
              SCHED_FLAG_RESET_ON_FORK.  As a result of including this flag,
              children created by fork(2) do not inherit privileged
              scheduling policies.  See sched(7) for details.

       sched_nice
              This field specifies the nice value to be set when specifying
              sched_policy as SCHED_OTHER or SCHED_BATCH.  The nice value is
              a number in the range -20 (high priority) to +19 (low
              priority); see setpriority(2).

       sched_priority
              This field specifies the static priority to be set when
              specifying sched_policy as SCHED_FIFO or SCHED_RR.  The
              allowed range of priorities for these policies can be
              determined using sched_get_priority_min(2) and
              sched_get_priority_max(2).  For other policies, this field
              must be specified as 0.

       sched_runtime
              This field specifies the "Runtime" parameter for deadline
              scheduling.  The value is expressed in nanoseconds.  This
              field, and the next two fields, are used only for
              SCHED_DEADLINE scheduling; for further details, see sched(7).

       sched_deadline
              This field specifies the "Deadline" parameter for deadline
              scheduling.  The value is expressed in nanoseconds.

       sched_period
              This field specifies the "Period" parameter for deadline
              scheduling.  The value is expressed in nanoseconds.

       The flags argument is provided to allow for future extensions to the
       interface; in the current implementation it must be specified as 0.
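
       A minimal sketch of a sched_setattr() call follows.  It assumes that
       the C library may not provide a wrapper and therefore goes through
       syscall(2); struct demo_sched_attr is a local mirror of the
       structure shown above with fixed-width types, and both that name and
       switch_to_batch() are hypothetical.  The current nice value is
       reused so that the call does not ask for a higher priority.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE           /* for SCHED_BATCH and syscall() */
#endif
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stdint.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Local mirror of struct sched_attr (hypothetical name, 48 bytes). */
struct demo_sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;
    uint64_t sched_deadline;
    uint64_t sched_period;
};

/* Hypothetical helper: move the calling thread to SCHED_BATCH while
 * keeping its current nice value.  Returns 0 on success, -1 on error. */
int switch_to_batch(void)
{
    struct demo_sched_attr attr;
    int nice_now;

    errno = 0;                      /* -1 is a legitimate nice value */
    nice_now = getpriority(PRIO_PROCESS, 0);
    if (nice_now == -1 && errno != 0)
        return -1;

    memset(&attr, 0, sizeof(attr)); /* unused fields must be zero */
    attr.size = sizeof(attr);
    attr.sched_policy = SCHED_BATCH;
    attr.sched_nice = nice_now;

    /* pid 0 selects the calling thread; flags must be 0. */
    return (int)syscall(SYS_sched_setattr, 0, &attr, 0U);
}
```

       On kernels or sandboxes that do not implement the call,
       switch_to_batch() returns -1 with errno set to ENOSYS.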

   sched_getattr()
       The sched_getattr() system call fetches the scheduling policy and the
       associated attributes for the thread whose ID is specified in pid.
       If pid equals zero, the scheduling policy and attributes of the
       calling thread will be retrieved.

       The size argument should be set to the size of the sched_attr
       structure as known to user space.  The value must be at least as
       large as the size of the initially published sched_attr structure, or
       the call fails with the error EINVAL.

       The retrieved scheduling attributes are placed in the fields of the
       sched_attr structure pointed to by attr.  The kernel sets attr.size
       to the size of its sched_attr structure.

       If the caller-provided attr buffer is larger than the kernel's
       sched_attr structure, the additional bytes in the user-space
       structure are not touched.  If the caller-provided structure is
       smaller than the kernel sched_attr structure and the kernel needs to
       return values outside the provided space, sched_getattr() fails with
       the error E2BIG.  As with sched_setattr(), these semantics allow for
       future extensibility of the interface.

       The flags argument is provided to allow for future extensions to the
       interface; in the current implementation it must be specified as 0.
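
       Reading the attributes of the calling thread can be sketched in the
       same way.  As before, the call goes through syscall(2) on the
       assumption that a C library wrapper may be missing, and
       struct demo_sched_attr and read_own_sched_attr() are hypothetical
       names introduced for the example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE           /* for syscall() */
#endif
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Local mirror of struct sched_attr (hypothetical name, 48 bytes). */
struct demo_sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;
    uint64_t sched_deadline;
    uint64_t sched_period;
};

/* Hypothetical helper: fetch the scheduling attributes of the calling
 * thread (pid 0).  The size argument tells the kernel how large the
 * caller's structure is.  Returns 0 on success, -1 on error. */
int read_own_sched_attr(struct demo_sched_attr *attr)
{
    memset(attr, 0, sizeof(*attr));
    return (int)syscall(SYS_sched_getattr, 0, attr,
                        (unsigned int)sizeof(*attr), 0U);
}
```

       After a successful call, attr->sched_policy holds one of the SCHED_*
       values and attr->sched_nice the nice value for the normal policies.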
http://man7.org/linux/man-pages/man2/sched_getattr.2.html
12
SYSTEM CALL:
sched_setattr(2) - Linux manual page
FUNCTIONALITY:

       sched_setattr, sched_getattr - set and get scheduling policy and
       attributes
SYNOPSIS:

       #include <sched.h>

       int sched_setattr(pid_t pid, struct sched_attr *attr,
                         unsigned int flags);

       int sched_getattr(pid_t pid, struct sched_attr *attr,
                         unsigned int size, unsigned int flags);
DESCRIPTION

   sched_setattr()
       The sched_setattr() system call sets the scheduling policy and
       associated attributes for the thread whose ID is specified in pid.
       If pid equals zero, the scheduling policy and attributes of the
       calling thread will be set.

       Currently, Linux supports the following "normal" (i.e., non-real-
       time) scheduling policies as values that may be specified in policy:

       SCHED_OTHER   the standard round-robin time-sharing policy;

       SCHED_BATCH   for "batch" style execution of processes; and

       SCHED_IDLE    for running very low priority background jobs.

       Various "real-time" policies are also supported, for special time-
       critical applications that need precise control over the way in which
       runnable threads are selected for execution.  For the rules governing
       when a process may use these policies, see sched(7).  The real-time
       policies that may be specified in policy are:

       SCHED_FIFO    a first-in, first-out policy; and

       SCHED_RR      a round-robin policy.

       Linux also provides the following policy:

       SCHED_DEADLINE
                     a deadline scheduling policy; see sched(7) for details.

       The attr argument is a pointer to a structure that defines the new
       scheduling policy and attributes for the specified thread.  This
       structure has the following form:

           struct sched_attr {
               u32 size;              /* Size of this structure */
               u32 sched_policy;      /* Policy (SCHED_*) */
               u64 sched_flags;       /* Flags */
               s32 sched_nice;        /* Nice value (SCHED_OTHER,
                                         SCHED_BATCH) */
               u32 sched_priority;    /* Static priority (SCHED_FIFO,
                                         SCHED_RR) */
               /* Remaining fields are for SCHED_DEADLINE */
               u64 sched_runtime;
               u64 sched_deadline;
               u64 sched_period;
           };

       The fields of this structure are as follows:

       size   This field should be set to the size of the structure in
              bytes, as in sizeof(struct sched_attr).  If the provided
              structure is smaller than the kernel structure, any additional
              fields are assumed to be '0'.  If the provided structure is
              larger than the kernel structure, the kernel verifies that all
              additional fields are 0; if they are not, sched_setattr()
              fails with the error E2BIG and updates size to contain the
              size of the kernel structure.

              The above behavior when the size of the user-space sched_attr
              structure does not match the size of the kernel structure
              allows for future extensibility of the interface.  Malformed
              applications that pass oversize structures won't break in the
              future if the size of the kernel sched_attr structure is
              increased.  In the future, it could also allow applications
              that know about a larger user-space sched_attr structure to
              determine whether they are running on an older kernel that
              does not support the larger structure.

       sched_policy
              This field specifies the scheduling policy, as one of the
              SCHED_* values listed above.

       sched_flags
              This field contains flags controlling scheduling behavior.
              Only one such flag is currently defined:
              SCHED_FLAG_RESET_ON_FORK.  As a result of including this flag,
              children created by fork(2) do not inherit privileged
              scheduling policies.  See sched(7) for details.

       sched_nice
              This field specifies the nice value to be set when specifying
              sched_policy as SCHED_OTHER or SCHED_BATCH.  The nice value is
              a number in the range -20 (high priority) to +19 (low
              priority); see setpriority(2).

       sched_priority
              This field specifies the static priority to be set when
              specifying sched_policy as SCHED_FIFO or SCHED_RR.  The
              allowed range of priorities for these policies can be
              determined using sched_get_priority_min(2) and
              sched_get_priority_max(2).  For other policies, this field
              must be specified as 0.

       sched_runtime
              This field specifies the "Runtime" parameter for deadline
              scheduling.  The value is expressed in nanoseconds.  This
              field, and the next two fields, are used only for
              SCHED_DEADLINE scheduling; for further details, see sched(7).

       sched_deadline
              This field specifies the "Deadline" parameter for deadline
              scheduling.  The value is expressed in nanoseconds.

       sched_period
              This field specifies the "Period" parameter for deadline
              scheduling.  The value is expressed in nanoseconds.

       The flags argument is provided to allow for future extensions to the
       interface; in the current implementation it must be specified as 0.

   sched_getattr()
       The sched_getattr() system call fetches the scheduling policy and the
       associated attributes for the thread whose ID is specified in pid.
       If pid equals zero, the scheduling policy and attributes of the
       calling thread will be retrieved.

       The size argument should be set to the size of the sched_attr
       structure as known to user space.  The value must be at least as
       large as the size of the initially published sched_attr structure, or
       the call fails with the error EINVAL.

       The retrieved scheduling attributes are placed in the fields of the
       sched_attr structure pointed to by attr.  The kernel sets attr.size
       to the size of its sched_attr structure.

       If the caller-provided attr buffer is larger than the kernel's
       sched_attr structure, the additional bytes in the user-space
       structure are not touched.  If the caller-provided structure is
       smaller than the kernel sched_attr structure and the kernel needs to
       return values outside the provided space, sched_getattr() fails with
       the error E2BIG.  As with sched_setattr(), these semantics allow for
       future extensibility of the interface.

       The flags argument is provided to allow for future extensions to the
       interface; in the current implementation it must be specified as 0.
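
       The size handshake described above can be sketched as follows.  This
       is a minimal illustration, not a reference implementation: it assumes
       the raw system call is reached through syscall(2), and the structure
       name example_sched_attr and the helper example_getattr() are
       hypothetical names chosen here so as not to clash with any definition
       the C library headers may provide.  The structure mirrors only the
       fields documented above.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Local mirror of the fields documented above (hypothetical name).
 * With these field widths the structure occupies 48 bytes. */
struct example_sched_attr {
    uint32_t size;            /* size of this structure */
    uint32_t sched_policy;    /* SCHED_* value */
    uint64_t sched_flags;     /* e.g. SCHED_FLAG_RESET_ON_FORK */
    int32_t  sched_nice;      /* SCHED_OTHER, SCHED_BATCH */
    uint32_t sched_priority;  /* SCHED_FIFO, SCHED_RR */
    uint64_t sched_runtime;   /* SCHED_DEADLINE, nanoseconds */
    uint64_t sched_deadline;  /* SCHED_DEADLINE, nanoseconds */
    uint64_t sched_period;    /* SCHED_DEADLINE, nanoseconds */
};

/* Fetch the attributes of thread `pid` (0 = calling thread).
 * Returns 0 on success, -1 with errno set on failure. */
static int example_getattr(pid_t pid, struct example_sched_attr *attr)
{
    memset(attr, 0, sizeof(*attr));
    /* The size argument tells the kernel how much space user space
     * has provided; the flags argument must be 0. */
    return (int)syscall(SYS_sched_getattr, pid, attr,
                        (unsigned int)sizeof(*attr), 0U);
}
```

       A caller would typically invoke example_getattr(0, &attr) and then
       inspect attr.sched_policy, treating E2BIG as a sign that the kernel
       needed more space than the structure provides.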
http://man7.org/linux/man-pages/man2/sched_setscheduler.2.html
11
SYSTEM CALL:
sched_setscheduler(2) - Linux manual page
FUNCTIONALITY:

       sched_setscheduler, sched_getscheduler - set and get scheduling
       policy/parameters
SYNOPSIS:

       #include <sched.h>

       int sched_setscheduler(pid_t pid, int policy,
                              const struct sched_param *param);

       int sched_getscheduler(pid_t pid);
DESCRIPTION

       The sched_setscheduler() system call sets both the scheduling policy
       and parameters for the thread whose ID is specified in pid.  If pid
       equals zero, the scheduling policy and parameters of the calling
       thread will be set.

       The scheduling parameters are specified in the param argument, which
       is a pointer to a structure of the following form:

           struct sched_param {
               ...
               int sched_priority;
               ...
           };

       In the current implementation, the structure contains only one field,
       sched_priority.  The interpretation of param depends on the selected
       policy.

       Currently, Linux supports the following "normal" (i.e., non-real-
       time) scheduling policies as values that may be specified in policy:

       SCHED_OTHER   the standard round-robin time-sharing policy;

       SCHED_BATCH   for "batch" style execution of processes; and

       SCHED_IDLE    for running very low priority background jobs.

       For each of the above policies, param->sched_priority must be 0.

       Various "real-time" policies are also supported, for special time-
       critical applications that need precise control over the way in which
       runnable threads are selected for execution.  For the rules governing
       when a process may use these policies, see sched(7).  The real-time
       policies that may be specified in policy are:

       SCHED_FIFO    a first-in, first-out policy; and

       SCHED_RR      a round-robin policy.

       For each of the above policies, param->sched_priority specifies a
       scheduling priority for the thread.  This is a number in the range
       returned by calling sched_get_priority_min(2) and
       sched_get_priority_max(2) with the specified policy.  On Linux, these
       system calls return, respectively, 1 and 99.

       Since Linux 2.6.32, the SCHED_RESET_ON_FORK flag can be ORed in
       policy when calling sched_setscheduler().  As a result of including
       this flag, children created by fork(2) do not inherit privileged
       scheduling policies.  See sched(7) for details.

       sched_getscheduler() returns the current scheduling policy of the
       thread identified by pid.  If pid equals zero, the policy of the
       calling thread will be retrieved.
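
       As an illustration of the calls described above, the following sketch
       queries the policy of a thread and attempts to switch a thread to
       SCHED_FIFO with SCHED_RESET_ON_FORK ORed in.  The helper names are
       hypothetical.  It assumes that _GNU_SOURCE is needed to expose
       SCHED_BATCH, SCHED_IDLE, and SCHED_RESET_ON_FORK, and that the value
       returned by sched_getscheduler() may carry the SCHED_RESET_ON_FORK
       bit, which is therefore masked off before comparison.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

/* Map a policy value to its name, ignoring SCHED_RESET_ON_FORK. */
static const char *example_policy_name(int policy)
{
    switch (policy & ~SCHED_RESET_ON_FORK) {
    case SCHED_OTHER: return "SCHED_OTHER";
    case SCHED_BATCH: return "SCHED_BATCH";
    case SCHED_IDLE:  return "SCHED_IDLE";
    case SCHED_FIFO:  return "SCHED_FIFO";
    case SCHED_RR:    return "SCHED_RR";
    default:          return "unknown";
    }
}

/* Print the policy of thread `pid` (0 = calling thread).
 * Returns 0 on success, -1 on failure. */
static int example_print_policy(pid_t pid)
{
    int policy = sched_getscheduler(pid);
    if (policy == -1) {
        perror("sched_getscheduler");
        return -1;
    }
    printf("policy: %s%s\n", example_policy_name(policy),
           (policy & SCHED_RESET_ON_FORK) ? " (reset on fork)" : "");
    return 0;
}

/* Try to move thread `pid` to SCHED_FIFO at the lowest real-time
 * priority, asking that fork(2) children not inherit the policy.
 * Returns 0 on success, -1 with errno set (EPERM is common for
 * unprivileged callers). */
static int example_make_fifo(pid_t pid)
{
    struct sched_param param = {
        .sched_priority = sched_get_priority_min(SCHED_FIFO)
    };
    return sched_setscheduler(pid, SCHED_FIFO | SCHED_RESET_ON_FORK,
                              &param);
}
```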
http://man7.org/linux/man-pages/man2/sched_getscheduler.2.html
11
SYSTEM CALL:
sched_setscheduler(2) - Linux manual page
FUNCTIONALITY:

       sched_setscheduler, sched_getscheduler - set and get scheduling
       policy/parameters
SYNOPSIS:

       #include <sched.h>

       int sched_setscheduler(pid_t pid, int policy,
                              const struct sched_param *param);

       int sched_getscheduler(pid_t pid);
DESCRIPTION

       The sched_setscheduler() system call sets both the scheduling policy
       and parameters for the thread whose ID is specified in pid.  If pid
       equals zero, the scheduling policy and parameters of the calling
       thread will be set.

       The scheduling parameters are specified in the param argument, which
       is a pointer to a structure of the following form:

           struct sched_param {
               ...
               int sched_priority;
               ...
           };

       In the current implementation, the structure contains only one field,
       sched_priority.  The interpretation of param depends on the selected
       policy.

       Currently, Linux supports the following "normal" (i.e., non-real-
       time) scheduling policies as values that may be specified in policy:

       SCHED_OTHER   the standard round-robin time-sharing policy;

       SCHED_BATCH   for "batch" style execution of processes; and

       SCHED_IDLE    for running very low priority background jobs.

       For each of the above policies, param->sched_priority must be 0.

       Various "real-time" policies are also supported, for special time-
       critical applications that need precise control over the way in which
       runnable threads are selected for execution.  For the rules governing
       when a process may use these policies, see sched(7).  The real-time
       policies that may be specified in policy are:

       SCHED_FIFO    a first-in, first-out policy; and

       SCHED_RR      a round-robin policy.

       For each of the above policies, param->sched_priority specifies a
       scheduling priority for the thread.  This is a number in the range
       returned by calling sched_get_priority_min(2) and
       sched_get_priority_max(2) with the specified policy.  On Linux, these
       system calls return, respectively, 1 and 99.

       Since Linux 2.6.32, the SCHED_RESET_ON_FORK flag can be ORed in
       policy when calling sched_setscheduler().  As a result of including
       this flag, children created by fork(2) do not inherit privileged
       scheduling policies.  See sched(7) for details.

       sched_getscheduler() returns the current scheduling policy of the
       thread identified by pid.  If pid equals zero, the policy of the
       calling thread will be retrieved.
http://man7.org/linux/man-pages/man2/sched_setparam.2.html
10
SYSTEM CALL:
sched_setparam(2) - Linux manual page
FUNCTIONALITY:

       sched_setparam, sched_getparam - set and get scheduling parameters
SYNOPSIS:

       #include <sched.h>

       int sched_setparam(pid_t pid, const struct sched_param *param);

       int sched_getparam(pid_t pid, struct sched_param *param);

       struct sched_param {
           ...
           int sched_priority;
           ...
       };
DESCRIPTION

       sched_setparam() sets the scheduling parameters associated with the
       scheduling policy for the process identified by pid.  If pid is zero,
       then the parameters of the calling process are set.  The
       interpretation of the argument param depends on the scheduling policy
       of the process identified by pid.  See sched(7) for a description of
       the scheduling policies supported under Linux.

       sched_getparam() retrieves the scheduling parameters for the process
       identified by pid.  If pid is zero, then the parameters of the
       calling process are retrieved.

       sched_setparam() checks the validity of param for the scheduling
       policy of the thread.  The value param->sched_priority must lie
       within the range given by sched_get_priority_min(2) and
       sched_get_priority_max(2).

       For a discussion of the privileges and resource limits related to
       scheduling priority and policy, see sched(7).

       POSIX systems on which sched_setparam() and sched_getparam() are
       available define _POSIX_PRIORITY_SCHEDULING in <unistd.h>.
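
       The validity check described above can be mirrored in user space
       before calling sched_setparam().  The sketch below is one possible
       arrangement; the helper names are hypothetical, and the early range
       check is an illustrative convenience, since the kernel performs its
       own validation in any case.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <sys/types.h>

/* Return the current sched_priority of `pid` (0 = caller), or -1. */
static int example_get_priority(pid_t pid)
{
    struct sched_param param;

    if (sched_getparam(pid, &param) == -1)
        return -1;
    return param.sched_priority;
}

/* Change only the priority of `pid`, leaving its policy untouched.
 * The value is first checked against the range for the current
 * policy.  Returns 0 on success, -1 with errno set on failure. */
static int example_set_priority(pid_t pid, int priority)
{
    int policy = sched_getscheduler(pid);
    if (policy == -1)
        return -1;
    policy &= ~SCHED_RESET_ON_FORK;   /* drop the flag bit, if any */

    int lo = sched_get_priority_min(policy);
    int hi = sched_get_priority_max(policy);
    if (lo == -1 || hi == -1)
        return -1;
    if (priority < lo || priority > hi) {
        errno = EINVAL;
        return -1;
    }

    struct sched_param param = { .sched_priority = priority };
    return sched_setparam(pid, &param);
}
```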
http://man7.org/linux/man-pages/man2/sched_getparam.2.html
10
SYSTEM CALL:
sched_setparam(2) - Linux manual page
FUNCTIONALITY:

       sched_setparam, sched_getparam - set and get scheduling parameters
SYNOPSIS:

       #include <sched.h>

       int sched_setparam(pid_t pid, const struct sched_param *param);

       int sched_getparam(pid_t pid, struct sched_param *param);

       struct sched_param {
           ...
           int sched_priority;
           ...
       };
DESCRIPTION

       sched_setparam() sets the scheduling parameters associated with the
       scheduling policy for the process identified by pid.  If pid is zero,
       then the parameters of the calling process are set.  The
       interpretation of the argument param depends on the scheduling policy
       of the process identified by pid.  See sched(7) for a description of
       the scheduling policies supported under Linux.

       sched_getparam() retrieves the scheduling parameters for the process
       identified by pid.  If pid is zero, then the parameters of the
       calling process are retrieved.

       sched_setparam() checks the validity of param for the scheduling
       policy of the thread.  The value param->sched_priority must lie
       within the range given by sched_get_priority_min(2) and
       sched_get_priority_max(2).

       For a discussion of the privileges and resource limits related to
       scheduling priority and policy, see sched(7).

       POSIX systems on which sched_setparam() and sched_getparam() are
       available define _POSIX_PRIORITY_SCHEDULING in <unistd.h>.
http://man7.org/linux/man-pages/man2/sched_setaffinity.2.html
12
SYSTEM CALL:
sched_setaffinity(2) - Linux manual page
FUNCTIONALITY:

       sched_setaffinity, sched_getaffinity - set and get a thread's CPU
       affinity mask
SYNOPSIS:

       #define _GNU_SOURCE             /* See feature_test_macros(7) */
       #include <sched.h>

       int sched_setaffinity(pid_t pid, size_t cpusetsize,
                             const cpu_set_t *mask);

       int sched_getaffinity(pid_t pid, size_t cpusetsize,
                             cpu_set_t *mask);
DESCRIPTION

       A thread's CPU affinity mask determines the set of CPUs on which it
       is eligible to run.  On a multiprocessor system, setting the CPU
       affinity mask can be used to obtain performance benefits.  For
       example, by dedicating one CPU to a particular thread (i.e., setting
       the affinity mask of that thread to specify a single CPU, and setting
       the affinity mask of all other threads to exclude that CPU), it is
       possible to ensure maximum execution speed for that thread.
       Restricting a thread to run on a single CPU also avoids the
       performance cost caused by the cache invalidation that occurs when a
       thread ceases to execute on one CPU and then recommences execution on
       a different CPU.

       A CPU affinity mask is represented by the cpu_set_t structure, a "CPU
       set", pointed to by mask.  A set of macros for manipulating CPU sets
       is described in CPU_SET(3).

       sched_setaffinity() sets the CPU affinity mask of the thread whose ID
       is pid to the value specified by mask.  If pid is zero, then the
       calling thread is used.  The argument cpusetsize is the length (in
       bytes) of the data pointed to by mask.  Normally this argument would
       be specified as sizeof(cpu_set_t).

       If the thread specified by pid is not currently running on one of the
       CPUs specified in mask, then that thread is migrated to one of the
       CPUs specified in mask.

       sched_getaffinity() writes the affinity mask of the thread whose ID
       is pid into the cpu_set_t structure pointed to by mask.  The
       cpusetsize argument specifies the size (in bytes) of mask.  If pid is
       zero, then the mask of the calling thread is returned.
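
       A short sketch of both calls follows, using the CPU_ZERO(), CPU_SET(),
       and CPU_COUNT() macros from CPU_SET(3).  The helper names are
       hypothetical, and the fixed-size cpu_set_t is assumed to be large
       enough for the system at hand.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <sys/types.h>

/* Count the CPUs on which thread `pid` (0 = caller) may run.
 * Returns the count, or -1 on failure. */
static int example_allowed_cpus(pid_t pid)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    if (sched_getaffinity(pid, sizeof(cpu_set_t), &mask) == -1)
        return -1;
    return CPU_COUNT(&mask);
}

/* Restrict thread `pid` (0 = caller) to the single CPU `cpu`.
 * Returns 0 on success, -1 with errno set on failure. */
static int example_pin_to_cpu(pid_t pid, int cpu)
{
    cpu_set_t mask;

    if (cpu < 0 || cpu >= CPU_SETSIZE) {
        errno = EINVAL;
        return -1;
    }
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    return sched_setaffinity(pid, sizeof(cpu_set_t), &mask);
}
```

       Pinning a thread in this way narrows its mask; a caller that wants to
       undo the change would save the mask obtained from sched_getaffinity()
       beforehand and pass it back to sched_setaffinity() later.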
http://man7.org/linux/man-pages/man2/sched_getaffinity.2.html
12
SYSTEM CALL:
sched_setaffinity(2) - Linux manual page
FUNCTIONALITY:

       sched_setaffinity, sched_getaffinity - set and get a thread's CPU
       affinity mask
SYNOPSIS:

       #define _GNU_SOURCE             /* See feature_test_macros(7) */
       #include <sched.h>

       int sched_setaffinity(pid_t pid, size_t cpusetsize,
                             const cpu_set_t *mask);

       int sched_getaffinity(pid_t pid, size_t cpusetsize,
                             cpu_set_t *mask);
DESCRIPTION

       A thread's CPU affinity mask determines the set of CPUs on which it
       is eligible to run.  On a multiprocessor system, setting the CPU
       affinity mask can be used to obtain performance benefits.  For
       example, by dedicating one CPU to a particular thread (i.e., setting
       the affinity mask of that thread to specify a single CPU, and setting
       the affinity mask of all other threads to exclude that CPU), it is
       possible to ensure maximum execution speed for that thread.
       Restricting a thread to run on a single CPU also avoids the
       performance cost caused by the cache invalidation that occurs when a
       thread ceases to execute on one CPU and then recommences execution on
       a different CPU.

       A CPU affinity mask is represented by the cpu_set_t structure, a "CPU
       set", pointed to by mask.  A set of macros for manipulating CPU sets
       is described in CPU_SET(3).

       sched_setaffinity() sets the CPU affinity mask of the thread whose ID
       is pid to the value specified by mask.  If pid is zero, then the
       calling thread is used.  The argument cpusetsize is the length (in
       bytes) of the data pointed to by mask.  Normally this argument would
       be specified as sizeof(cpu_set_t).

       If the thread specified by pid is not currently running on one of the
       CPUs specified in mask, then that thread is migrated to one of the
       CPUs specified in mask.

       sched_getaffinity() writes the affinity mask of the thread whose ID
       is pid into the cpu_set_t structure pointed to by mask.  The
       cpusetsize argument specifies the size (in bytes) of mask.  If pid is
       zero, then the mask of the calling thread is returned.
http://man7.org/linux/man-pages/man2/sched_get_priority_max.2.html
9
SYSTEM CALL:
sched_get_priority_max(2) - Linux manual page
FUNCTIONALITY:

       sched_get_priority_max, sched_get_priority_min - get static priority
       range
SYNOPSIS:

       #include <sched.h>

       int sched_get_priority_max(int policy);

       int sched_get_priority_min(int policy);
DESCRIPTION

       sched_get_priority_max() returns the maximum priority value that can
       be used with the scheduling algorithm identified by policy.
       sched_get_priority_min() returns the minimum priority value that can
       be used with the scheduling algorithm identified by policy.
       Supported policy values are SCHED_FIFO, SCHED_RR, SCHED_OTHER,
       SCHED_BATCH, SCHED_IDLE, and SCHED_DEADLINE.  Further details about
       these policies can be found in sched(7).

       Processes with numerically higher priority values are scheduled
       before processes with numerically lower priority values.  Thus, the
       value returned by sched_get_priority_max() will be greater than the
       value returned by sched_get_priority_min().

       Linux allows the static priority range 1 to 99 for the SCHED_FIFO and
       SCHED_RR policies, and the priority 0 for the remaining policies.
       Scheduling priority ranges for the various policies are not
       alterable.

       The range of scheduling priorities may vary on other POSIX systems,
       so it is a good idea for portable applications to use a virtual
       priority range and map it to the interval given by
       sched_get_priority_max() and sched_get_priority_min().  POSIX.1
       requires a spread of at least 32 between the maximum and the minimum
       values for SCHED_FIFO and SCHED_RR.

       POSIX systems on which sched_get_priority_max() and
       sched_get_priority_min() are available define
       _POSIX_PRIORITY_SCHEDULING in <unistd.h>.
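
       The virtual priority range suggested above might be implemented along
       the following lines.  The 0 to 100 scale and the helper name
       example_map_priority() are arbitrary choices made for this sketch;
       any fixed virtual range would serve.

```c
#include <sched.h>

/* Map a virtual priority in [0, 100] onto the native static priority
 * range of `policy`.  Returns the native priority, or -1 if `virt`
 * is out of bounds or the range cannot be determined. */
static int example_map_priority(int policy, int virt)
{
    if (virt < 0 || virt > 100)
        return -1;

    int lo = sched_get_priority_min(policy);
    int hi = sched_get_priority_max(policy);
    if (lo == -1 || hi == -1)
        return -1;

    /* Linear interpolation; integer division rounds toward lo. */
    return lo + (virt * (hi - lo)) / 100;
}
```

       With the Linux range of 1 to 99 for SCHED_FIFO, virtual priorities 0,
       50, and 100 map to 1, 50, and 99 respectively; for policies whose
       only permitted priority is 0, every virtual priority maps to 0.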
http://man7.org/linux/man-pages/man2/sched_get_priority_min.2.html
9
SYSTEM CALL:
sched_get_priority_max(2) - Linux manual page
FUNCTIONALITY:

       sched_get_priority_max, sched_get_priority_min - get static priority
       range
SYNOPSIS:

       #include <sched.h>

       int sched_get_priority_max(int policy);

       int sched_get_priority_min(int policy);
DESCRIPTION

       sched_get_priority_max() returns the maximum priority value that can
       be used with the scheduling algorithm identified by policy.
       sched_get_priority_min() returns the minimum priority value that can
       be used with the scheduling algorithm identified by policy.
       Supported policy values are SCHED_FIFO, SCHED_RR, SCHED_OTHER,
       SCHED_BATCH, SCHED_IDLE, and SCHED_DEADLINE.  Further details about
       these policies can be found in sched(7).

       Processes with numerically higher priority values are scheduled
       before processes with numerically lower priority values.  Thus, the
       value returned by sched_get_priority_max() will be greater than the
       value returned by sched_get_priority_min().

       Linux allows the static priority range 1 to 99 for the SCHED_FIFO and
       SCHED_RR policies, and the priority 0 for the remaining policies.
       Scheduling priority ranges for the various policies are not
       alterable.

       The range of scheduling priorities may vary on other POSIX systems,
       so it is a good idea for portable applications to use a virtual
       priority range and map it to the interval given by
       sched_get_priority_max() and sched_get_priority_min().  POSIX.1
       requires a spread of at least 32 between the maximum and the minimum
       values for SCHED_FIFO and SCHED_RR.

       POSIX systems on which sched_get_priority_max() and
       sched_get_priority_min() are available define
       _POSIX_PRIORITY_SCHEDULING in <unistd.h>.
http://man7.org/linux/man-pages/man2/sched_rr_get_interval.2.html
10
SYSTEM CALL:
sched_rr_get_interval(2) - Linux manual page
FUNCTIONALITY:

       sched_rr_get_interval - get the SCHED_RR interval for the named
       process
SYNOPSIS:

       #include <sched.h>

       int sched_rr_get_interval(pid_t pid, struct timespec *tp);
DESCRIPTION

       sched_rr_get_interval() writes into the timespec structure pointed to
       by tp the round-robin time quantum for the process identified by pid.
       The specified process should be running under the SCHED_RR scheduling
       policy.

       The timespec structure has the following form:

           struct timespec {
               time_t tv_sec;    /* seconds */
               long   tv_nsec;   /* nanoseconds */
           };

       If pid is zero, the time quantum for the calling process is written
       into *tp.
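
       A minimal sketch of reading the quantum and converting the timespec
       fields to a single nanosecond count is shown below.  The helper name
       is hypothetical, and no assumption is made about what value is
       reported for a process that is not running under SCHED_RR.

```c
#include <sched.h>
#include <sys/types.h>
#include <time.h>

/* Return the round-robin time quantum of `pid` (0 = caller) in
 * nanoseconds, or -1 if the call fails. */
static long long example_rr_quantum_ns(pid_t pid)
{
    struct timespec ts;

    if (sched_rr_get_interval(pid, &ts) == -1)
        return -1;
    /* Combine seconds and nanoseconds into one value. */
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}
```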
http://man7.org/linux/man-pages/man2/sched_yield.2.html
10
SYSTEM CALL:
sched_yield(2) - Linux manual page
FUNCTIONALITY:

       sched_yield - yield the processor
SYNOPSIS:

       #include <sched.h>

       int sched_yield(void);
DESCRIPTION

       sched_yield() causes the calling thread to relinquish the CPU.  The
       thread is moved to the end of the queue for its static priority and a
       new thread gets to run.
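
       One place where sched_yield() is sometimes used is a bounded polling
       loop that gives other runnable threads a chance between checks.  The
       sketch below shows that pattern only; the helper name and the use of
       a C11 atomic flag are assumptions of this example, and a blocking
       synchronization primitive is usually the better choice in real code.

```c
#include <sched.h>
#include <stdatomic.h>

/* Poll `flag` until it becomes nonzero, relinquishing the CPU between
 * checks.  Gives up after `max_yields` calls to sched_yield().
 * Returns 1 if the flag was seen set, 0 otherwise. */
static int example_wait_for_flag(atomic_int *flag, long max_yields)
{
    for (long i = 0; i < max_yields; i++) {
        if (atomic_load(flag))
            return 1;
        sched_yield();   /* let another thread run */
    }
    return atomic_load(flag) != 0;
}
```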
http://man7.org/linux/man-pages/man2/setpriority.2.html
11
SYSTEM CALL:
getpriority(2) - Linux manual page
FUNCTIONALITY:

       getpriority, setpriority - get/set program scheduling priority
SYNOPSIS:

       #include <sys/time.h>
       #include <sys/resource.h>

       int getpriority(int which, id_t who);
       int setpriority(int which, id_t who, int prio);
DESCRIPTION

       The scheduling priority of the process, process group, or user, as
       indicated by which and who, is obtained with the getpriority() call
       and set with the setpriority() call.  The process attribute dealt
       with by these system calls is the same attribute (also known as the
       "nice" value) that is dealt with by nice(2).

       The value which is one of PRIO_PROCESS, PRIO_PGRP, or PRIO_USER, and
       who is interpreted relative to which (a process identifier for
       PRIO_PROCESS, process group identifier for PRIO_PGRP, and a user ID
       for PRIO_USER).  A zero value for who denotes (respectively) the
       calling process, the process group of the calling process, or the
       real user ID of the calling process.

       The prio argument is a value in the range -20 to 19 (but see NOTES
       below), with -20 being the highest priority and 19 being the lowest
       priority.  The default priority is 0; lower values give a process a
       higher scheduling priority.

       The getpriority() call returns the highest priority (lowest numerical
       value) enjoyed by any of the specified processes.  The setpriority()
       call sets the priorities of all of the specified processes to the
       specified value.

       Traditionally, only a privileged process could lower the nice value
       (i.e., set a higher priority).  However, since Linux 2.6.12, an
       unprivileged process can decrease the nice value of a target process
       that has a suitable RLIMIT_NICE soft limit; see getrlimit(2) for
       details.
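
       The following sketch reads and raises the nice value of the calling
       process.  Because -1 lies inside the range of valid nice values, the
       example clears errno before calling getpriority() and checks it
       afterwards to distinguish an error from a legitimate result.  The
       helper names are hypothetical.

```c
#include <errno.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Store the nice value of the calling process in *nice_out.
 * Returns 0 on success, -1 on failure. */
static int example_get_nice(int *nice_out)
{
    errno = 0;                       /* -1 may be a valid nice value */
    int prio = getpriority(PRIO_PROCESS, 0);
    if (prio == -1 && errno != 0)
        return -1;
    *nice_out = prio;
    return 0;
}

/* Lower the caller's scheduling priority by adding `delta` (>= 0) to
 * its nice value, clamped to 19.  Raising the nice value needs no
 * privilege.  Returns 0 on success, -1 on failure. */
static int example_be_nicer(int delta)
{
    int cur;

    if (delta < 0 || example_get_nice(&cur) == -1)
        return -1;
    if (delta > 39)                  /* full width of the range */
        delta = 39;

    int target = cur + delta;
    if (target > 19)
        target = 19;
    return setpriority(PRIO_PROCESS, 0, target);
}
```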
http://man7.org/linux/man-pages/man2/getpriority.2.html
11
SYSTEM CALL:
getpriority(2) - Linux manual page
FUNCTIONALITY:

       getpriority, setpriority - get/set program scheduling priority
SYNOPSIS:

       #include <sys/time.h>
       #include <sys/resource.h>

       int getpriority(int which, id_t who);
       int setpriority(int which, id_t who, int prio);
DESCRIPTION

       The scheduling priority of the process, process group, or user, as
       indicated by which and who, is obtained with the getpriority() call
       and set with the setpriority() call.  The process attribute dealt
       with by these system calls is the same attribute (also known as the
       "nice" value) that is dealt with by nice(2).

       The value which is one of PRIO_PROCESS, PRIO_PGRP, or PRIO_USER, and
       who is interpreted relative to which (a process identifier for
       PRIO_PROCESS, process group identifier for PRIO_PGRP, and a user ID
       for PRIO_USER).  A zero value for who denotes (respectively) the
       calling process, the process group of the calling process, or the
       real user ID of the calling process.

       The prio argument is a value in the range -20 to 19 (but see NOTES
       below), with -20 being the highest priority and 19 being the lowest
       priority.  The default priority is 0; lower values give a process a
       higher scheduling priority.

       The getpriority() call returns the highest priority (lowest numerical
       value) enjoyed by any of the specified processes.  The setpriority()
       call sets the priorities of all of the specified processes to the
       specified value.

       Traditionally, only a privileged process could lower the nice value
       (i.e., set a higher priority).  However, since Linux 2.6.12, an
       unprivileged process can decrease the nice value of a target process
       that has a suitable RLIMIT_NICE soft limit; see getrlimit(2) for
       details.
http://man7.org/linux/man-pages/man2/ioprio_set.2.html
12
SYSTEM CALL:
ioprio_set(2) - Linux manual page
FUNCTIONALITY:

       ioprio_get, ioprio_set - get/set I/O scheduling class and priority
SYNOPSIS:

       int ioprio_get(int which, int who);
       int ioprio_set(int which, int who, int ioprio);

       Note: There are no glibc wrappers for these system calls; see NOTES.
DESCRIPTION

       The ioprio_get() and ioprio_set() system calls respectively get and
       set the I/O scheduling class and priority of one or more threads.

       The which and who arguments identify the thread(s) on which the
       system calls operate.  The which argument determines how who is
       interpreted, and has one of the following values:

       IOPRIO_WHO_PROCESS
              who is a process ID or thread ID identifying a single process
              or thread.  If who is 0, then operate on the calling thread.

       IOPRIO_WHO_PGRP
              who is a process group ID identifying all the members of a
              process group.  If who is 0, then operate on the process group
              of which the caller is a member.

       IOPRIO_WHO_USER
              who is a user ID identifying all of the processes that have a
              matching real UID.

       If which is specified as IOPRIO_WHO_PGRP or IOPRIO_WHO_USER when
       calling ioprio_get(), and more than one process matches who, then the
       returned priority will be the highest one found among all of the
       matching processes.  One priority is said to be higher than another
       one if it belongs to a higher priority class (IOPRIO_CLASS_RT is the
       highest priority class; IOPRIO_CLASS_IDLE is the lowest) or if it
       belongs to the same priority class as the other process but has a
       higher priority level (a lower priority number means a higher
       priority level).

       The ioprio argument given to ioprio_set() is a bit mask that
       specifies both the scheduling class and the priority to be assigned
       to the target process(es).  The following macros are used for
       assembling and dissecting ioprio values:

       IOPRIO_PRIO_VALUE(class, data)
              Given a scheduling class and priority (data), this macro
              combines the two values to produce an ioprio value, which is
              returned as the result of the macro.

       IOPRIO_PRIO_CLASS(mask)
              Given mask (an ioprio value), this macro returns its I/O class
              component, that is, one of the values IOPRIO_CLASS_RT,
              IOPRIO_CLASS_BE, or IOPRIO_CLASS_IDLE.

       IOPRIO_PRIO_DATA(mask)
              Given mask (an ioprio value), this macro returns its priority
              (data) component.

       See the NOTES section for more information on scheduling classes and
       priorities, as well as the meaning of specifying ioprio as 0.

       I/O priorities are supported for reads and for synchronous (O_DIRECT,
       O_SYNC) writes.  I/O priorities are not supported for asynchronous
       writes because they are issued outside the context of the program
       dirtying the memory, and thus program-specific priorities do not
       apply.
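
       Since there are no glibc wrappers, a program has to supply its own.
       The sketch below defines the macros and constants locally and calls
       the kernel through syscall(2).  The numeric values (a class shift of
       13, the class and which constants, and best-effort levels 0 to 7) are
       stated here as assumptions mirroring the kernel's definitions and
       should be checked against the kernel headers in use; the example_
       helpers are hypothetical names.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Local definitions, assumed to match the kernel's ioprio encoding. */
#define IOPRIO_CLASS_SHIFT 13
#define IOPRIO_PRIO_MASK ((1 << IOPRIO_CLASS_SHIFT) - 1)
#define IOPRIO_PRIO_CLASS(mask) ((mask) >> IOPRIO_CLASS_SHIFT)
#define IOPRIO_PRIO_DATA(mask) ((mask) & IOPRIO_PRIO_MASK)
#define IOPRIO_PRIO_VALUE(cls, data) \
    (((cls) << IOPRIO_CLASS_SHIFT) | (data))

enum { IOPRIO_CLASS_NONE, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE,
       IOPRIO_CLASS_IDLE };
enum { IOPRIO_WHO_PROCESS = 1, IOPRIO_WHO_PGRP, IOPRIO_WHO_USER };

/* Thin wrappers around the raw system calls. */
static int example_ioprio_get(int which, int who)
{
    return (int)syscall(SYS_ioprio_get, which, who);
}

static int example_ioprio_set(int which, int who, int ioprio)
{
    return (int)syscall(SYS_ioprio_set, which, who, ioprio);
}

/* Request the best-effort class at `level` (0 = highest, 7 = lowest)
 * for the calling thread.  Returns 0 on success, -1 on failure. */
static int example_set_best_effort(int level)
{
    if (level < 0 || level > 7)
        return -1;
    return example_ioprio_set(IOPRIO_WHO_PROCESS, 0,
                              IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, level));
}
```

       A value obtained from example_ioprio_get(IOPRIO_WHO_PROCESS, 0) can
       be taken apart again with IOPRIO_PRIO_CLASS() and IOPRIO_PRIO_DATA().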
http://man7.org/linux/man-pages/man2/ioprio_get.2.html
12
SYSTEM CALL:
ioprio_set(2) - Linux manual page
FUNCTIONALITY:

       ioprio_get, ioprio_set - get/set I/O scheduling class and priority
SYNOPSIS:

       int ioprio_get(int which, int who);
       int ioprio_set(int which, int who, int ioprio);

       Note: There are no glibc wrappers for these system calls; see NOTES.
DESCRIPTION

       The ioprio_get() and ioprio_set() system calls respectively get and
       set the I/O scheduling class and priority of one or more threads.

       The which and who arguments identify the thread(s) on which the
       system calls operate.  The which argument determines how who is
       interpreted, and has one of the following values:

       IOPRIO_WHO_PROCESS
              who is a process ID or thread ID identifying a single process
              or thread.  If who is 0, then operate on the calling thread.

       IOPRIO_WHO_PGRP
              who is a process group ID identifying all the members of a
              process group.  If who is 0, then operate on the process group
              of which the caller is a member.

       IOPRIO_WHO_USER
              who is a user ID identifying all of the processes that have a
              matching real UID.

       If which is specified as IOPRIO_WHO_PGRP or IOPRIO_WHO_USER when
       calling ioprio_get(), and more than one process matches who, then the
       returned priority will be the highest one found among all of the
       matching processes.  One priority is said to be higher than another
       one if it belongs to a higher priority class (IOPRIO_CLASS_RT is the
       highest priority class; IOPRIO_CLASS_IDLE is the lowest) or if it
       belongs to the same priority class as the other process but has a
       higher priority level (a lower priority number means a higher
       priority level).

       The ioprio argument given to ioprio_set() is a bit mask that
       specifies both the scheduling class and the priority to be assigned
       to the target process(es).  The following macros are used for
       assembling and dissecting ioprio values:

       IOPRIO_PRIO_VALUE(class, data)
              Given a scheduling class and priority (data), this macro
              combines the two values to produce an ioprio value, which is
              returned as the result of the macro.

       IOPRIO_PRIO_CLASS(mask)
              Given mask (an ioprio value), this macro returns its I/O class
              component, that is, one of the values IOPRIO_CLASS_RT,
              IOPRIO_CLASS_BE, or IOPRIO_CLASS_IDLE.

       IOPRIO_PRIO_DATA(mask)
              Given mask (an ioprio value), this macro returns its priority
              (data) component.

       See the NOTES section for more information on scheduling classes and
       priorities, as well as the meaning of specifying ioprio as 0.

       I/O priorities are supported for reads and for synchronous (O_DIRECT,
       O_SYNC) writes.  I/O priorities are not supported for asynchronous
       writes because they are issued outside the context of the program
       dirtying the memory, and thus program-specific priorities do not
       apply.
http://man7.org/linux/man-pages/man2/brk.2.html
9
SYSTEM CALL:
brk(2) - Linux manual page
FUNCTIONALITY:

       brk, sbrk - change data segment size
SYNOPSIS:

       #include <unistd.h>

       int brk(void *addr);

       void *sbrk(intptr_t increment);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       brk(), sbrk():
           Since glibc 2.19:
               _DEFAULT_SOURCE ||
                   (_XOPEN_SOURCE >= 500) &&
                   ! (_POSIX_C_SOURCE >= 200112L)
           From glibc 2.12 to 2.19:
               _BSD_SOURCE || _SVID_SOURCE ||
                   (_XOPEN_SOURCE >= 500) &&
                   ! (_POSIX_C_SOURCE >= 200112L)
           Before glibc 2.12:
               _BSD_SOURCE || _SVID_SOURCE || _XOPEN_SOURCE >= 500
DESCRIPTION

       brk() and sbrk() change the location of the program break, which
       defines the end of the process's data segment (i.e., the program
       break is the first location after the end of the uninitialized data
       segment).  Increasing the program break has the effect of allocating
       memory to the process; decreasing the break deallocates memory.

       brk() sets the end of the data segment to the value specified by
       addr, when that value is reasonable, the system has enough memory,
       and the process does not exceed its maximum data size (see
       setrlimit(2)).

       sbrk() increments the program's data space by increment bytes.
       Calling sbrk() with an increment of 0 can be used to find the current
       location of the program break.
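
       The following is a minimal sketch of the two uses of sbrk()
       described above: querying the break with an increment of 0, and
       moving it to obtain memory.  The helper names current_break() and
       grow_heap() are hypothetical.  The sketch relies on sbrk() returning
       the previous break on success and (void *) -1 on error.  Mixing
       direct sbrk() calls with malloc(3) is fragile, so this is for
       illustration only.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Return the current program break without changing it. */
void *current_break(void)
{
    return sbrk(0);
}

/* Move the break up by `bytes` and return the start of the newly
 * allocated region (the previous break), or NULL on failure. */
void *grow_heap(intptr_t bytes)
{
    void *old_break = sbrk(bytes);
    return old_break == (void *)-1 ? NULL : old_break;
}
```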
http://man7.org/linux/man-pages/man2/mmap.2.html
14
SYSTEM CALL:
mmap(2) - Linux manual page
FUNCTIONALITY:

       mmap, munmap - map or unmap files or devices into memory
SYNOPSIS:

       #include <sys/mman.h>

       void *mmap(void *addr, size_t length, int prot, int flags,
                  int fd, off_t offset);
       int munmap(void *addr, size_t length);

       See NOTES for information on feature test macro requirements.
DESCRIPTION

       mmap() creates a new mapping in the virtual address space of the
       calling process.  The starting address for the new mapping is
       specified in addr.  The length argument specifies the length of the
       mapping.

       If addr is NULL, then the kernel chooses the address at which to
       create the mapping; this is the most portable method of creating a
       new mapping.  If addr is not NULL, then the kernel takes it as a hint
       about where to place the mapping; on Linux, the mapping will be
       created at a nearby page boundary.  The address of the new mapping is
       returned as the result of the call.

       The contents of a file mapping (as opposed to an anonymous mapping;
       see MAP_ANONYMOUS below) are initialized using length bytes starting
       at offset offset in the file (or other object) referred to by the
       file descriptor fd.  offset must be a multiple of the page size as
       returned by sysconf(_SC_PAGE_SIZE).
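
       Because offset must be page-aligned, a caller that wants data at an
       arbitrary file position typically maps from the enclosing page
       boundary and then skips the remainder.  A minimal sketch, assuming
       nonnegative offsets; the helper names are hypothetical:

```c
#include <sys/types.h>
#include <unistd.h>

/* Round a nonnegative file offset down to a multiple of the page size,
 * giving a value that is acceptable as mmap()'s offset argument. */
off_t page_align_down(off_t offset)
{
    long page = sysconf(_SC_PAGE_SIZE);
    return offset - (offset % page);
}

/* Distance from the aligned offset to the requested one; add this to the
 * address returned by mmap() to reach the byte that was actually wanted. */
size_t offset_within_page(off_t offset)
{
    return (size_t)(offset - page_align_down(offset));
}
```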

       The prot argument describes the desired memory protection of the
       mapping (and must not conflict with the open mode of the file).  It
       is either PROT_NONE or the bitwise OR of one or more of the following
       flags:

       PROT_EXEC  Pages may be executed.

       PROT_READ  Pages may be read.

       PROT_WRITE Pages may be written.

       PROT_NONE  Pages may not be accessed.

       The flags argument determines whether updates to the mapping are
       visible to other processes mapping the same region, and whether
       updates are carried through to the underlying file.  This behavior is
       determined by including exactly one of the following values in flags:

       MAP_SHARED
              Share this mapping.  Updates to the mapping are visible to
              other processes that map this file, and are carried through to
              the underlying file.  (To precisely control when updates are
              carried through to the underlying file requires the use of
              msync(2).)

       MAP_PRIVATE
              Create a private copy-on-write mapping.  Updates to the
              mapping are not visible to other processes mapping the same
              file, and are not carried through to the underlying file.  It
              is unspecified whether changes made to the file after the
              mmap() call are visible in the mapped region.

       Both of these flags are described in POSIX.1-2001 and POSIX.1-2008.
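
       The difference between the two flags can be observed with a small
       experiment.  The sketch below is an illustration under stated
       assumptions: it creates a scratch file under /tmp (a hypothetical
       location), stores a byte through the mapping, and then reads the
       file back with pread(2).  The function name is hypothetical.

```c
#define _DEFAULT_SOURCE
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map a one-byte temporary file containing 'A' with the given sharing
 * flag (MAP_SHARED or MAP_PRIVATE), store 'B' through the mapping, and
 * return the byte that the file itself holds afterwards (-1 on error). */
int file_byte_after_mapped_write(int sharing_flag)
{
    char path[] = "/tmp/mmap-share-XXXXXX";  /* hypothetical scratch file */
    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);  /* the open descriptor keeps the file alive */

    int result = -1;
    char byte = 'A';
    if (write(fd, &byte, 1) == 1) {
        char *map = mmap(NULL, 1, PROT_READ | PROT_WRITE, sharing_flag, fd, 0);
        if (map != MAP_FAILED) {
            map[0] = 'B';
            msync(map, 1, MS_SYNC);  /* nothing to flush for MAP_PRIVATE */
            if (pread(fd, &byte, 1, 0) == 1)
                result = byte;
            munmap(map, 1);
        }
    }
    close(fd);
    return result;
}
```

       With MAP_SHARED the file is expected to contain 'B'; with
       MAP_PRIVATE it should still contain 'A'.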

       In addition, zero or more of the following values can be ORed in
       flags:

       MAP_32BIT (since Linux 2.4.20, 2.6)
              Put the mapping into the first 2 Gigabytes of the process
              address space.  This flag is supported only on x86-64, for
              64-bit programs.  It was added to allow thread stacks to be
              allocated somewhere in the first 2GB of memory, so as to
              improve context-switch performance on some early 64-bit
              processors.  Modern x86-64 processors no longer have this
              performance problem, so use of this flag is not required on
              those systems.  The MAP_32BIT flag is ignored when MAP_FIXED
              is set.

       MAP_ANON
              Synonym for MAP_ANONYMOUS.  Deprecated.

       MAP_ANONYMOUS
              The mapping is not backed by any file; its contents are
              initialized to zero.  The fd and offset arguments are ignored;
              however, some implementations require fd to be -1 if
              MAP_ANONYMOUS (or MAP_ANON) is specified, and portable
              applications should ensure this.  The use of MAP_ANONYMOUS in
              conjunction with MAP_SHARED is supported on Linux only since
              kernel 2.4.

       MAP_DENYWRITE
              This flag is ignored.  (Long ago, it signaled that attempts to
              write to the underlying file should fail with ETXTBSY.  But
              this was a source of denial-of-service attacks.)

       MAP_EXECUTABLE
              This flag is ignored.

       MAP_FILE
              Compatibility flag.  Ignored.

       MAP_FIXED
              Don't interpret addr as a hint: place the mapping at exactly
              that address.  addr must be a multiple of the page size.  If
              the memory region specified by addr and length overlaps pages of
              any existing mapping(s), then the overlapped part of the
              existing mapping(s) will be discarded.  If the specified
              address cannot be used, mmap() will fail.  Because requiring a
              fixed address for a mapping is less portable, the use of this
              option is discouraged.

       MAP_GROWSDOWN
              Used for stacks.  Indicates to the kernel virtual memory
              system that the mapping should extend downward in memory.

       MAP_HUGETLB (since Linux 2.6.32)
              Allocate the mapping using "huge pages."  See the Linux kernel
              source file Documentation/vm/hugetlbpage.txt for further
              information, as well as NOTES, below.

       MAP_HUGE_2MB, MAP_HUGE_1GB (since Linux 3.8)
              Used in conjunction with MAP_HUGETLB to select alternative
              hugetlb page sizes (respectively, 2 MB and 1 GB) on systems
              that support multiple hugetlb page sizes.

              More generally, the desired huge page size can be configured
              by encoding the base-2 logarithm of the desired page size in
              the six bits at the offset MAP_HUGE_SHIFT.  (A value of zero
              in this bit field provides the default huge page size; the
              default huge page size can be discovered via the Hugepagesize
              field exposed by /proc/meminfo.)  Thus, the above two
              constants are defined as:

                  #define MAP_HUGE_2MB    (21 << MAP_HUGE_SHIFT)
                  #define MAP_HUGE_1GB    (30 << MAP_HUGE_SHIFT)

              The range of huge page sizes that are supported by the system
              can be discovered by listing the subdirectories in
              /sys/kernel/mm/hugepages.

       MAP_LOCKED (since Linux 2.5.37)
              Mark the mapped region to be locked in the same way as
              mlock(2).  This implementation will try to populate (prefault)
              the whole range, but the mmap() call doesn't fail with ENOMEM
              if this fails.  Therefore, major faults might happen later on,
              so the semantics are not as strong as those of mlock(2).  One
              should use mmap() plus mlock(2) when major faults are not
              acceptable after the initialization of the mapping.  The
              MAP_LOCKED flag is ignored in older kernels.

       MAP_NONBLOCK (since Linux 2.5.46)
              Only meaningful in conjunction with MAP_POPULATE.  Don't
              perform read-ahead: create page table entries only for pages
              that are already present in RAM.  Since Linux 2.6.23, this
              flag causes MAP_POPULATE to do nothing.  One day, the
              combination of MAP_POPULATE and MAP_NONBLOCK may be
              reimplemented.

       MAP_NORESERVE
              Do not reserve swap space for this mapping.  When swap space
              is reserved, one has the guarantee that it is possible to
              modify the mapping.  When swap space is not reserved one might
              get SIGSEGV upon a write if no physical memory is available.
              See also the discussion of the file
              /proc/sys/vm/overcommit_memory in proc(5).  In kernels before
              2.6, this flag had effect only for private writable mappings.

       MAP_POPULATE (since Linux 2.5.46)
              Populate (prefault) page tables for a mapping.  For a file
              mapping, this causes read-ahead on the file.  This will help
              to reduce blocking on page faults later.  MAP_POPULATE is
              supported for private mappings only since Linux 2.6.23.

       MAP_STACK (since Linux 2.6.27)
              Allocate the mapping at an address suitable for a process or
              thread stack.  This flag is currently a no-op, but is used in
              the glibc threading implementation so that if some
              architectures require special treatment for stack allocations,
              support can later be transparently implemented for glibc.

       MAP_UNINITIALIZED (since Linux 2.6.33)
              Don't clear anonymous pages.  This flag is intended to improve
              performance on embedded devices.  This flag is honored only if
              the kernel was configured with the
              CONFIG_MMAP_ALLOW_UNINITIALIZED option.  Because of the
              security implications, that option is normally enabled only on
              embedded devices (i.e., devices where one has complete control
              of the contents of user memory).

       Of the above flags, only MAP_FIXED is specified in POSIX.1-2001 and
       POSIX.1-2008.  However, most systems also support MAP_ANONYMOUS (or
       its synonym MAP_ANON).

       Some systems document the additional flags MAP_AUTOGROW,
       MAP_AUTORESRV, MAP_COPY, and MAP_LOCAL.

       Memory mapped by mmap() is preserved across fork(2), with the same
       attributes.
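
       One consequence is that a MAP_SHARED | MAP_ANONYMOUS region created
       before fork(2) refers to the same memory in parent and child.  A
       minimal sketch of this (the function name is hypothetical):

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Create a shared anonymous int, let a forked child store `value` in it,
 * and return what the parent observes afterwards (-1 on any failure). */
int value_written_by_child(int value)
{
    int *shared = mmap(NULL, sizeof *shared, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED)
        return -1;
    *shared = 0;

    int result = -1;
    pid_t pid = fork();
    if (pid == 0) {
        *shared = value;  /* the child inherited the same shared mapping */
        _exit(0);
    }
    if (pid > 0 && waitpid(pid, NULL, 0) == pid)
        result = *shared;

    munmap(shared, sizeof *shared);
    return result;
}
```

       Had the mapping been created with MAP_PRIVATE instead, the child's
       store would go to its own copy-on-write page and the parent would
       still read 0.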

       A file is mapped in multiples of the page size.  For a file that is
       not a multiple of the page size, the remaining memory is zeroed when
       mapped, and writes to that region are not written out to the file.
       The effect of changing the size of the underlying file of a mapping
       on the pages that correspond to added or removed regions of the file
       is unspecified.

   munmap()
       The munmap() system call deletes the mappings for the specified
       address range, and causes further references to addresses within the
       range to generate invalid memory references.  The region is also
       automatically unmapped when the process is terminated.  On the other
       hand, closing the file descriptor does not unmap the region.

       The address addr must be a multiple of the page size (but length need
       not be).  All pages containing a part of the indicated range are
       unmapped, and subsequent references to these pages will generate
       SIGSEGV.  It is not an error if the indicated range does not contain
       any mapped pages.
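
       A minimal sketch of the basic mmap()/munmap() life cycle for a
       zero-filled private anonymous region.  The helper names
       map_scratch() and unmap_scratch() are hypothetical.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Map `length` bytes of zero-filled, private, read-write memory and let
 * the kernel choose the address.  Returns NULL on failure. */
void *map_scratch(size_t length)
{
    void *p = mmap(NULL, length, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Release a region obtained from map_scratch().  Any later access to the
 * range is an invalid memory reference.  Returns 0 on success. */
int unmap_scratch(void *addr, size_t length)
{
    return munmap(addr, length);
}
```

       Note that mmap() reports failure with MAP_FAILED rather than NULL,
       which is why the wrapper translates the return value.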
http://man7.org/linux/man-pages/man2/munmap.2.html
14
SYSTEM CALL:
mmap(2) - Linux manual page
FUNCTIONALITY:

       mmap, munmap - map or unmap files or devices into memory
SYNOPSIS:

       #include <sys/mman.h>

       void *mmap(void *addr, size_t length, int prot, int flags,
                  int fd, off_t offset);
       int munmap(void *addr, size_t length);

       See NOTES for information on feature test macro requirements.
DESCRIPTION

       mmap() creates a new mapping in the virtual address space of the
       calling process.  The starting address for the new mapping is
       specified in addr.  The length argument specifies the length of the
       mapping.

       If addr is NULL, then the kernel chooses the address at which to
       create the mapping; this is the most portable method of creating a
       new mapping.  If addr is not NULL, then the kernel takes it as a hint
       about where to place the mapping; on Linux, the mapping will be
       created at a nearby page boundary.  The address of the new mapping is
       returned as the result of the call.

       The contents of a file mapping (as opposed to an anonymous mapping;
       see MAP_ANONYMOUS below) are initialized using length bytes starting
       at offset offset in the file (or other object) referred to by the
       file descriptor fd.  offset must be a multiple of the page size as
       returned by sysconf(_SC_PAGE_SIZE).

       The prot argument describes the desired memory protection of the
       mapping (and must not conflict with the open mode of the file).  It
       is either PROT_NONE or the bitwise OR of one or more of the following
       flags:

       PROT_EXEC  Pages may be executed.

       PROT_READ  Pages may be read.

       PROT_WRITE Pages may be written.

       PROT_NONE  Pages may not be accessed.

       The flags argument determines whether updates to the mapping are
       visible to other processes mapping the same region, and whether
       updates are carried through to the underlying file.  This behavior is
       determined by including exactly one of the following values in flags:

       MAP_SHARED
              Share this mapping.  Updates to the mapping are visible to
              other processes that map this file, and are carried through to
              the underlying file.  (To precisely control when updates are
              carried through to the underlying file requires the use of
              msync(2).)

       MAP_PRIVATE
              Create a private copy-on-write mapping.  Updates to the
              mapping are not visible to other processes mapping the same
              file, and are not carried through to the underlying file.  It
              is unspecified whether changes made to the file after the
              mmap() call are visible in the mapped region.

       Both of these flags are described in POSIX.1-2001 and POSIX.1-2008.

       In addition, zero or more of the following values can be ORed in
       flags:

       MAP_32BIT (since Linux 2.4.20, 2.6)
              Put the mapping into the first 2 Gigabytes of the process
              address space.  This flag is supported only on x86-64, for
              64-bit programs.  It was added to allow thread stacks to be
              allocated somewhere in the first 2GB of memory, so as to
              improve context-switch performance on some early 64-bit
              processors.  Modern x86-64 processors no longer have this
              performance problem, so use of this flag is not required on
              those systems.  The MAP_32BIT flag is ignored when MAP_FIXED
              is set.

       MAP_ANON
              Synonym for MAP_ANONYMOUS.  Deprecated.

       MAP_ANONYMOUS
              The mapping is not backed by any file; its contents are
              initialized to zero.  The fd and offset arguments are ignored;
              however, some implementations require fd to be -1 if
              MAP_ANONYMOUS (or MAP_ANON) is specified, and portable
              applications should ensure this.  The use of MAP_ANONYMOUS in
              conjunction with MAP_SHARED is supported on Linux only since
              kernel 2.4.

       MAP_DENYWRITE
              This flag is ignored.  (Long ago, it signaled that attempts to
              write to the underlying file should fail with ETXTBSY.  But
              this was a source of denial-of-service attacks.)

       MAP_EXECUTABLE
              This flag is ignored.

       MAP_FILE
              Compatibility flag.  Ignored.

       MAP_FIXED
              Don't interpret addr as a hint: place the mapping at exactly
              that address.  addr must be a multiple of the page size.  If
              the memory region specified by addr and length overlaps pages of
              any existing mapping(s), then the overlapped part of the
              existing mapping(s) will be discarded.  If the specified
              address cannot be used, mmap() will fail.  Because requiring a
              fixed address for a mapping is less portable, the use of this
              option is discouraged.

       MAP_GROWSDOWN
              Used for stacks.  Indicates to the kernel virtual memory
              system that the mapping should extend downward in memory.

       MAP_HUGETLB (since Linux 2.6.32)
              Allocate the mapping using "huge pages."  See the Linux kernel
              source file Documentation/vm/hugetlbpage.txt for further
              information, as well as NOTES, below.

       MAP_HUGE_2MB, MAP_HUGE_1GB (since Linux 3.8)
              Used in conjunction with MAP_HUGETLB to select alternative
              hugetlb page sizes (respectively, 2 MB and 1 GB) on systems
              that support multiple hugetlb page sizes.

              More generally, the desired huge page size can be configured
              by encoding the base-2 logarithm of the desired page size in
              the six bits at the offset MAP_HUGE_SHIFT.  (A value of zero
              in this bit field provides the default huge page size; the
              default huge page size can be discovered via the Hugepagesize
              field exposed by /proc/meminfo.)  Thus, the above two
              constants are defined as:

                  #define MAP_HUGE_2MB    (21 << MAP_HUGE_SHIFT)
                  #define MAP_HUGE_1GB    (30 << MAP_HUGE_SHIFT)

              The range of huge page sizes that are supported by the system
              can be discovered by listing the subdirectories in
              /sys/kernel/mm/hugepages.

       MAP_LOCKED (since Linux 2.5.37)
              Mark the mapped region to be locked in the same way as
              mlock(2).  This implementation will try to populate (prefault)
              the whole range, but the mmap() call doesn't fail with ENOMEM
              if this fails.  Therefore, major faults might happen later on,
              so the semantics are not as strong as those of mlock(2).  One
              should use mmap() plus mlock(2) when major faults are not
              acceptable after the initialization of the mapping.  The
              MAP_LOCKED flag is ignored in older kernels.

       MAP_NONBLOCK (since Linux 2.5.46)
              Only meaningful in conjunction with MAP_POPULATE.  Don't
              perform read-ahead: create page table entries only for pages
              that are already present in RAM.  Since Linux 2.6.23, this
              flag causes MAP_POPULATE to do nothing.  One day, the
              combination of MAP_POPULATE and MAP_NONBLOCK may be
              reimplemented.

       MAP_NORESERVE
              Do not reserve swap space for this mapping.  When swap space
              is reserved, one has the guarantee that it is possible to
              modify the mapping.  When swap space is not reserved one might
              get SIGSEGV upon a write if no physical memory is available.
              See also the discussion of the file
              /proc/sys/vm/overcommit_memory in proc(5).  In kernels before
              2.6, this flag had effect only for private writable mappings.

       MAP_POPULATE (since Linux 2.5.46)
              Populate (prefault) page tables for a mapping.  For a file
              mapping, this causes read-ahead on the file.  This will help
              to reduce blocking on page faults later.  MAP_POPULATE is
              supported for private mappings only since Linux 2.6.23.

       MAP_STACK (since Linux 2.6.27)
              Allocate the mapping at an address suitable for a process or
              thread stack.  This flag is currently a no-op, but is used in
              the glibc threading implementation so that if some
              architectures require special treatment for stack allocations,
              support can later be transparently implemented for glibc.

       MAP_UNINITIALIZED (since Linux 2.6.33)
              Don't clear anonymous pages.  This flag is intended to improve
              performance on embedded devices.  This flag is honored only if
              the kernel was configured with the
              CONFIG_MMAP_ALLOW_UNINITIALIZED option.  Because of the
              security implications, that option is normally enabled only on
              embedded devices (i.e., devices where one has complete control
              of the contents of user memory).

       Of the above flags, only MAP_FIXED is specified in POSIX.1-2001 and
       POSIX.1-2008.  However, most systems also support MAP_ANONYMOUS (or
       its synonym MAP_ANON).
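
       The MAP_HUGE_* encoding described above (base-2 logarithm of the
       page size, shifted by MAP_HUGE_SHIFT) can be sketched as follows.
       The value 26 for MAP_HUGE_SHIFT is an assumption used only when no
       header has defined it, and the helper name map_huge_flag() is
       hypothetical.

```c
#include <stddef.h>

/* Assumption: fall back to 26 if no header has defined MAP_HUGE_SHIFT. */
#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif

/* Return the flag bits selecting a huge page of `page_size` bytes, to be
 * ORed with MAP_HUGETLB.  Returns 0 (meaning "default huge page size")
 * if page_size is not a power of two. */
unsigned int map_huge_flag(size_t page_size)
{
    if (page_size == 0 || (page_size & (page_size - 1)) != 0)
        return 0;

    unsigned int exponent = 0;
    while (page_size > 1) {
        page_size >>= 1;
        exponent++;
    }
    return exponent << MAP_HUGE_SHIFT;
}
```

       For 2 MB and 1 GB this yields the same values as the MAP_HUGE_2MB
       and MAP_HUGE_1GB definitions shown earlier.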

       Some systems document the additional flags MAP_AUTOGROW,
       MAP_AUTORESRV, MAP_COPY, and MAP_LOCAL.

       Memory mapped by mmap() is preserved across fork(2), with the same
       attributes.

       A file is mapped in multiples of the page size.  For a file that is
       not a multiple of the page size, the remaining memory is zeroed when
       mapped, and writes to that region are not written out to the file.
       The effect of changing the size of the underlying file of a mapping
       on the pages that correspond to added or removed regions of the file
       is unspecified.
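
       The zero-filled tail of a partially used page can be checked
       directly.  The sketch below is an illustration only: it writes a
       short scratch file under /tmp (a hypothetical location), maps its
       single page, and counts nonzero bytes past the end of the file data.
       The function name is hypothetical.

```c
#define _DEFAULT_SOURCE
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create a temporary file holding `file_len` bytes of 'x' (file_len must
 * be at least 1 and smaller than the page size), map its one page, and
 * count nonzero bytes after the file data.  Returns -1 on error. */
long nonzero_bytes_past_eof(size_t file_len)
{
    size_t page = (size_t)sysconf(_SC_PAGE_SIZE);
    if (file_len == 0 || file_len >= page)
        return -1;

    char path[] = "/tmp/mmap-tail-XXXXXX";  /* hypothetical scratch file */
    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);

    size_t written = 0;
    while (written < file_len && write(fd, "x", 1) == 1)
        written++;

    long count = -1;
    if (written == file_len) {
        unsigned char *map = mmap(NULL, page, PROT_READ, MAP_PRIVATE, fd, 0);
        if (map != MAP_FAILED) {
            count = 0;
            for (size_t i = file_len; i < page; i++)
                count += (map[i] != 0);
            munmap(map, page);
        }
    }
    close(fd);
    return count;
}
```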

   munmap()
       The munmap() system call deletes the mappings for the specified
       address range, and causes further references to addresses within the
       range to generate invalid memory references.  The region is also
       automatically unmapped when the process is terminated.  On the other
       hand, closing the file descriptor does not unmap the region.

       The address addr must be a multiple of the page size (but length need
       not be).  All pages containing a part of the indicated range are
       unmapped, and subsequent references to these pages will generate
       SIGSEGV.  It is not an error if the indicated range does not contain
       any mapped pages.
http://man7.org/linux/man-pages/man2/mremap.2.html
10
SYSTEM CALL:
mremap(2) - Linux manual page
FUNCTIONALITY:

       mremap - remap a virtual memory address
SYNOPSIS:

       #define _GNU_SOURCE         /* See feature_test_macros(7) */
       #include <sys/mman.h>

       void *mremap(void *old_address, size_t old_size,
                    size_t new_size, int flags, ... /* void *new_address */);
DESCRIPTION

       mremap() expands (or shrinks) an existing memory mapping, potentially
       moving it at the same time (controlled by the flags argument and the
       available virtual address space).

       old_address is the old address of the virtual memory block that you
       want to expand (or shrink).  Note that old_address has to be page
       aligned.  old_size is the old size of the virtual memory block.
       new_size is the requested size of the virtual memory block after the
       resize.  An optional fifth argument, new_address, may be provided;
       see the description of MREMAP_FIXED below.

       In Linux the memory is divided into pages.  A user process has (one
       or) several linear virtual memory segments.  Each virtual memory
       segment has one or more mappings to real memory pages (in the page
       table).  Each virtual memory segment has its own protection (access
       rights), which may cause a segmentation violation if the memory is
       accessed incorrectly (e.g., writing to a read-only segment).
       Accessing virtual memory outside of the segments will also cause a
       segmentation violation.

       mremap() uses the Linux page table scheme.  mremap() changes the
       mapping between virtual addresses and memory pages.  This can be used
       to implement a very efficient realloc(3).

       The flags bit-mask argument may be 0, or include the following flags:

       MREMAP_MAYMOVE
              By default, if there is not sufficient space to expand a
              mapping at its current location, then mremap() fails.  If this
              flag is specified, then the kernel is permitted to relocate
              the mapping to a new virtual address, if necessary.  If the
              mapping is relocated, then absolute pointers into the old
              mapping location become invalid (offsets relative to the
              starting address of the mapping should be employed).

       MREMAP_FIXED (since Linux 2.3.31)
              This flag serves a similar purpose to the MAP_FIXED flag of
              mmap(2).  If this flag is specified, then mremap() accepts a
              fifth argument, void *new_address, which specifies a page-
              aligned address to which the mapping must be moved.  Any
              previous mapping at the address range specified by new_address
              and new_size is unmapped.  If MREMAP_FIXED is specified, then
              MREMAP_MAYMOVE must also be specified.

       If the memory segment specified by old_address and old_size is locked
       (using mlock(2) or similar), then this lock is maintained when the
       segment is resized and/or relocated.  As a consequence, the amount of
       memory locked by the process may change.
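
       A minimal sketch of a growable buffer built on mremap() with
       MREMAP_MAYMOVE.  Because the mapping may move, the sketch keeps only
       the base pointer and expects callers to address contents by offset,
       as recommended above.  The struct and function names are
       hypothetical.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

struct buffer {
    char *data;   /* base address; may change after buffer_resize() */
    size_t size;  /* current mapping size in bytes */
};

/* Start with one page of zero-filled anonymous memory. */
int buffer_init(struct buffer *b)
{
    size_t page = (size_t)sysconf(_SC_PAGE_SIZE);
    void *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    b->data = p;
    b->size = page;
    return 0;
}

/* Expand or shrink the mapping, letting the kernel relocate it if needed.
 * On failure the buffer is left unchanged. */
int buffer_resize(struct buffer *b, size_t new_size)
{
    void *p = mremap(b->data, b->size, new_size, MREMAP_MAYMOVE);
    if (p == MAP_FAILED)
        return -1;
    b->data = p;
    b->size = new_size;
    return 0;
}

void buffer_free(struct buffer *b)
{
    munmap(b->data, b->size);
    b->data = NULL;
    b->size = 0;
}
```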
http://man7.org/linux/man-pages/man2/mprotect.2.html
11
SYSTEM CALL:
mprotect(2) - Linux manual page
FUNCTIONALITY:

       mprotect - set protection on a region of memory
SYNOPSIS:

       #include <sys/mman.h>

       int mprotect(void *addr, size_t len, int prot);
DESCRIPTION

       mprotect() changes protection for the calling process's memory
       page(s) containing any part of the address range in the interval
       [addr, addr+len-1].  addr must be aligned to a page boundary.

       If the calling process tries to access memory in a manner that
       violates the protection, then the kernel generates a SIGSEGV signal
       for the process.

       prot is either PROT_NONE or a bitwise-or of the other values in the
       following list:

       PROT_NONE  The memory cannot be accessed at all.

       PROT_READ  The memory can be read.

       PROT_WRITE The memory can be modified.

       PROT_EXEC  The memory can be executed.
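
       The SIGSEGV behavior can be observed safely by performing the
       offending access in a child process.  A minimal sketch, with a
       hypothetical function name:

```c
#define _DEFAULT_SOURCE
#include <signal.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Make a page read-only with mprotect(), then have a forked child write
 * to it.  Returns the signal that terminated the child, 0 if the child
 * exited normally, or -1 if the setup failed. */
int signal_from_write_to_readonly_page(void)
{
    size_t page = (size_t)sysconf(_SC_PAGE_SIZE);
    volatile char *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    p[0] = 'a';  /* allowed: the page is still writable */

    int result = -1;
    if (mprotect((void *)p, page, PROT_READ) == 0) {
        pid_t pid = fork();
        if (pid == 0) {
            signal(SIGSEGV, SIG_DFL);  /* make sure the default action applies */
            p[0] = 'b';                /* violates PROT_READ */
            _exit(0);
        }
        int status;
        if (pid > 0 && waitpid(pid, &status, 0) == pid)
            result = WIFSIGNALED(status) ? WTERMSIG(status) : 0;
    }
    munmap((void *)p, page);
    return result;
}
```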
http://man7.org/linux/man-pages/man2/madvise.2.html
11
SYSTEM CALL:
madvise(2) - Linux manual page
FUNCTIONALITY:

       madvise - give advice about use of memory
SYNOPSIS:

       #include <sys/mman.h>

       int madvise(void *addr, size_t length, int advice);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       madvise():
           Since glibc 2.19:
               _DEFAULT_SOURCE
           Up to and including glibc 2.19:
               _BSD_SOURCE
DESCRIPTION

       The madvise() system call is used to give advice or directions to the
       kernel about the address range beginning at address addr and with
       size length bytes.  Initially, the system call supported a set of
       "conventional" advice values, which are also available on several
       other implementations.  (Note, though, that madvise() is not
       specified in POSIX.)  Subsequently, a number of Linux-specific advice
       values have been added.

   Conventional advice values
       The advice values listed below allow an application to tell the
       kernel how it expects to use some mapped or shared memory areas, so
       that the kernel can choose appropriate read-ahead and caching
       techniques.  These advice values do not influence the semantics of
       the application (except in the case of MADV_DONTNEED), but may
       influence its performance.  All of the advice values listed here have
       analogs in the POSIX-specified posix_madvise(3) function, and the
       values have the same meanings, with the exception of MADV_DONTNEED.

       The advice is indicated in the advice argument, which is one of the
       following:

       MADV_NORMAL
              No special treatment.  This is the default.

       MADV_RANDOM
              Expect page references in random order.  (Hence, read ahead
              may be less useful than normally.)

       MADV_SEQUENTIAL
              Expect page references in sequential order.  (Hence, pages in
              the given range can be aggressively read ahead, and may be
              freed soon after they are accessed.)

       MADV_WILLNEED
              Expect access in the near future.  (Hence, it might be a good
              idea to read some pages ahead.)

       MADV_DONTNEED
              Do not expect access in the near future.  (For the time being,
              the application is finished with the given range, so the
              kernel can free resources associated with it.)

              After a successful MADV_DONTNEED operation, the semantics of
              memory access in the specified region are changed: subsequent
              accesses of pages in the range will succeed, but will result
              in either repopulating the memory contents from the up-to-date
              contents of the underlying mapped file (for shared file
              mappings, shared anonymous mappings, and shmem-based
              techniques such as System V shared memory segments) or zero-
              fill-on-demand pages for anonymous private mappings.

              Note that, when applied to shared mappings, MADV_DONTNEED
              might not lead to immediate freeing of the pages in the range.
              The kernel is free to delay freeing the pages until an
              appropriate moment.  The resident set size (RSS) of the
              calling process will be immediately reduced, however.

              MADV_DONTNEED cannot be applied to locked pages, Huge TLB
              pages, or VM_PFNMAP pages.  (Pages marked with the kernel-
              internal VM_PFNMAP flag are special memory areas that are not
              managed by the virtual memory subsystem.  Such pages are
              typically created by device drivers that map the pages into
              user space.)
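
       The zero-fill-on-demand behavior of MADV_DONTNEED on a private
       anonymous mapping can be sketched as follows (Linux-specific; the
       function name is hypothetical):

```c
#define _DEFAULT_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fill a private anonymous page with 0xAA, discard it with MADV_DONTNEED,
 * and return the first byte read afterwards (-1 on failure). */
int first_byte_after_dontneed(void)
{
    size_t page = (size_t)sysconf(_SC_PAGE_SIZE);
    unsigned char *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    memset(p, 0xAA, page);

    int result = -1;
    if (madvise(p, page, MADV_DONTNEED) == 0)
        result = p[0];  /* the access succeeds and faults in a fresh page */

    munmap(p, page);
    return result;
}
```

       On Linux the function is expected to return 0, since the discarded
       contents are replaced by a zero-filled page on the next access.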

   Linux-specific advice values
       The following Linux-specific advice values have no counterparts in
       the POSIX-specified posix_madvise(3), and may or may not have
       counterparts in the madvise() interface available on other
       implementations.  Note that some of these operations change the
       semantics of memory accesses.

       MADV_REMOVE (since Linux 2.6.16)
              Free up a given range of pages and its associated backing
              store.  This is equivalent to punching a hole in the
              corresponding byte range of the backing store (see
              fallocate(2)).  Subsequent accesses in the specified address
              range will see bytes containing zero.

              The specified address range must be mapped shared and
              writable.  This flag cannot be applied to locked pages, Huge
              TLB pages, or VM_PFNMAP pages.

              In the initial implementation, only shmfs/tmpfs supported
              MADV_REMOVE; but since Linux 3.5, any filesystem which
              supports the fallocate(2) FALLOC_FL_PUNCH_HOLE mode also
              supports MADV_REMOVE.  Hugetlbfs will fail with the error
              EINVAL and other filesystems fail with the error EOPNOTSUPP.

       MADV_DONTFORK (since Linux 2.6.16)
              Do not make the pages in this range available to the child
              after a fork(2).  This is useful to prevent copy-on-write
              semantics from changing the physical location of a page if the
              parent writes to it after a fork(2).  (Such page relocations
              cause problems for hardware that DMAs into the page.)

       MADV_DOFORK (since Linux 2.6.16)
              Undo the effect of MADV_DONTFORK, restoring the default
              behavior, whereby a mapping is inherited across fork(2).

       MADV_HWPOISON (since Linux 2.6.32)
              Poison the pages in the range specified by addr and length and
              handle subsequent references to those pages like a hardware
              memory corruption.  This operation is available only for
              privileged (CAP_SYS_ADMIN) processes.  This operation may
              result in the calling process receiving a SIGBUS and the page
              being unmapped.

              This feature is intended for testing of memory error-handling
              code; it is available only if the kernel was configured with
              CONFIG_MEMORY_FAILURE.

       MADV_MERGEABLE (since Linux 2.6.32)
              Enable Kernel Samepage Merging (KSM) for the pages in the
              range specified by addr and length.  The kernel regularly
              scans those areas of user memory that have been marked as
              mergeable, looking for pages with identical content.  These
              are replaced by a single write-protected page (which is
              automatically copied if a process later wants to update the
              content of the page).  KSM merges only private anonymous pages
              (see mmap(2)).

              The KSM feature is intended for applications that generate
              many instances of the same data (e.g., virtualization systems
              such as KVM).  It can consume a lot of processing power; use
              with care.  See the Linux kernel source file
              Documentation/vm/ksm.txt for more details.

              The MADV_MERGEABLE and MADV_UNMERGEABLE operations are
              available only if the kernel was configured with CONFIG_KSM.

       MADV_UNMERGEABLE (since Linux 2.6.32)
              Undo the effect of an earlier MADV_MERGEABLE operation on the
              specified address range; KSM unmerges whatever pages it had
              merged in the address range specified by addr and length.

       MADV_SOFT_OFFLINE (since Linux 2.6.33)
              Soft offline the pages in the range specified by addr and
              length.  The memory of each page in the specified range is
              preserved (i.e., when next accessed, the same content will be
              visible, but in a new physical page frame), and the original
              page is offlined (i.e., no longer used, and taken out of
              normal memory management).  The effect of the
              MADV_SOFT_OFFLINE operation is invisible to (i.e., does not
              change the semantics of) the calling process.

              This feature is intended for testing of memory error-handling
              code; it is available only if the kernel was configured with
              CONFIG_MEMORY_FAILURE.

       MADV_HUGEPAGE (since Linux 2.6.38)
              Enable Transparent Huge Pages (THP) for pages in the range
              specified by addr and length.  Currently, Transparent Huge
              Pages work only with private anonymous pages (see mmap(2)).
              The kernel will regularly scan the areas marked as huge page
              candidates to replace them with huge pages.  The kernel will
              also allocate huge pages directly when the region is naturally
              aligned to the huge page size (see posix_memalign(3)).

              This feature is primarily aimed at applications that use large
              mappings of data and access large regions of that memory at a
              time (e.g., virtualization systems such as QEMU).  It can very
              easily waste memory (e.g., a 2MB mapping that only ever
              accesses 1 byte will result in 2MB of wired memory instead of
              one 4KB page).  See the Linux kernel source file
              Documentation/vm/transhuge.txt for more details.

              The MADV_HUGEPAGE and MADV_NOHUGEPAGE operations are available
              only if the kernel was configured with
              CONFIG_TRANSPARENT_HUGEPAGE.

       MADV_NOHUGEPAGE (since Linux 2.6.38)
              Ensures that memory in the address range specified by addr and
              length will not be collapsed into huge pages.
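
              The alignment advice above might be applied as in the
              following sketch.  The 2MB huge page size is an assumption
              (it is typical on x86-64 but not universal), and the helper
              name alloc_thp_candidate() is hypothetical.  madvise() is
              expected to fail with EINVAL on kernels built without
              CONFIG_TRANSPARENT_HUGEPAGE.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/mman.h>

/* Assumption for this sketch: 2MB huge pages, as on typical x86-64. */
#define ASSUMED_HUGE_PAGE_SIZE (2UL * 1024 * 1024)

/* Hypothetical helper: allocate at least len bytes, naturally aligned to
 * the assumed huge page size, and mark the region as a THP candidate.
 * Returns 0 and stores the pointer in *out, or returns -errno.
 * The caller releases the memory with free(). */
int alloc_thp_candidate(size_t len, void **out)
{
    assert(out != NULL && len > 0);

    /* Round the length up to a whole number of huge pages. */
    size_t rounded = (len + ASSUMED_HUGE_PAGE_SIZE - 1)
                     & ~(ASSUMED_HUGE_PAGE_SIZE - 1);

    void *p = NULL;
    int rc = posix_memalign(&p, ASSUMED_HUGE_PAGE_SIZE, rounded);
    if (rc != 0)
        return -rc;                 /* posix_memalign returns the error */

    if (madvise(p, rounded, MADV_HUGEPAGE) == -1) {
        int saved = errno;
        free(p);
        return -saved;
    }

    *out = p;
    return 0;
}
```

              Passing MADV_NOHUGEPAGE instead would ask the kernel not to
              collapse the same range into huge pages.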

       MADV_DONTDUMP (since Linux 3.4)
              Exclude from a core dump those pages in the range specified by
              addr and length.  This is useful in applications that have
              large areas of memory that are known not to be useful in a
              core dump.  The effect of MADV_DONTDUMP takes precedence over
              the bit mask that is set via the /proc/PID/coredump_filter
              file (see core(5)).

       MADV_DODUMP (since Linux 3.4)
              Undo the effect of an earlier MADV_DONTDUMP.
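
              A small wrapper can make the MADV_DONTDUMP / MADV_DODUMP pair
              easier to use.  The function name below is hypothetical, and
              the sketch assumes that addr is page-aligned, as madvise()
              requires.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: include (nonzero) or exclude (zero) a page-aligned
 * range from core dumps.  Returns 0 on success or -errno on failure. */
int set_core_dump_inclusion(void *addr, size_t length, int include)
{
    assert(addr != NULL && length > 0);

    int advice = include ? MADV_DODUMP : MADV_DONTDUMP;
    return madvise(addr, length, advice) == -1 ? -errno : 0;
}
```

              For instance, a large cache could be excluded right after it
              is mapped and re-included only while debugging.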

       MADV_FREE (since Linux 4.5)
              The application no longer requires the pages in the range
              specified by addr and length.  The kernel can thus free these
              pages, but the freeing could be delayed until memory pressure
              occurs.  For each of the pages that has been marked to be
              freed but has not yet been freed, the free operation will be
              canceled if the caller writes into the page.  After a
              successful MADV_FREE operation, any stale data (i.e., dirty,
              unwritten pages) will be lost when the kernel frees the pages.
              However, subsequent writes to pages in the range will succeed,
              and the kernel then cannot free those dirtied pages, so the
              caller can always see the data just written.  If there is no
              subsequent write, the kernel can free the pages at any time.
              Once pages in the range have been freed, the caller will see
              zero-fill-on-demand pages upon subsequent page references.

              The MADV_FREE operation can be applied only to private
              anonymous pages (see mmap(2)).  On a swapless system, freeing
              pages in a given range happens instantly, regardless of memory
              pressure.
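
              The write-after-free guarantee described above can be
              exercised with a sketch like the following.  The helper name
              is hypothetical.  The fallback value 8 for MADV_FREE is taken
              from the generic Linux UAPI headers and is only an assumption
              for C libraries that predate the constant.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_FREE
#define MADV_FREE 8     /* assumption: generic Linux UAPI value */
#endif

/* Hypothetical helper: mark a private anonymous page with MADV_FREE, write
 * to it again, and return the byte read back ('B' is expected), or -errno
 * if a step fails (e.g. EINVAL on kernels older than 4.5). */
int rewrite_after_free(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = (size_t)page;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -errno;

    memset(p, 'A', len);            /* old contents, no longer needed */

    int result;
    if (madvise(p, len, MADV_FREE) == -1) {
        result = -errno;
    } else {
        /* The write dirties the page again, so the kernel may no longer
         * free it.  The bytes not rewritten hold either the old data or
         * zeros, depending on whether the page was freed in between. */
        p[0] = 'B';
        result = p[0];
    }

    munmap(p, len);
    return result;
}
```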
http://man7.org/linux/man-pages/man2/mlock.2.html
13
SYSTEM CALL:
mlock(2) - Linux manual page
FUNCTIONALITY:

       mlock, mlock2, munlock, mlockall, munlockall - lock and unlock memory
SYNOPSIS:

       #include <sys/mman.h>

       int mlock(const void *addr, size_t len);
       int mlock2(const void *addr, size_t len, int flags);
       int munlock(const void *addr, size_t len);

       int mlockall(int flags);
       int munlockall(void);
DESCRIPTION

       mlock(), mlock2(), and mlockall() lock part or all of the calling
       process's virtual address space into RAM, preventing that memory from
       being paged to the swap area.

       munlock() and munlockall() perform the converse operation, unlocking
       part or all of the calling process's virtual address space, so that
       pages in the specified virtual address range may once more be
       swapped out if required by the kernel memory manager.

       Memory locking and unlocking are performed in units of whole pages.

   mlock(), mlock2(), and munlock()
       mlock() locks pages in the address range starting at addr and
       continuing for len bytes.  All pages that contain a part of the
       specified address range are guaranteed to be resident in RAM when the
       call returns successfully; the pages are guaranteed to stay in RAM
       until later unlocked.

       mlock2() also locks pages in the specified range starting at addr and
       continuing for len bytes.  However, the state of the pages contained
       in that range after the call returns successfully will depend on the
       value in the flags argument.

       The flags argument can be either 0 or the following constant:

       MLOCK_ONFAULT
              Lock pages that are currently resident and mark the entire
              range so that the remaining pages are locked when they are
              populated by a page fault.

       If flags is 0, mlock2() behaves exactly the same as mlock().

       Note: currently, there is not a glibc wrapper for mlock2(), so it
       will need to be invoked using syscall(2).
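
       Following the note above, a raw invocation through syscall(2) might
       look like the sketch below.  The wrapper name mlock2_raw() is
       hypothetical.  The fallback value 0x01 for MLOCK_ONFAULT comes from
       the Linux UAPI headers and is used only when <sys/mman.h> does not
       define the constant.  (Newer glibc releases do ship an mlock2()
       wrapper; this sketch does not rely on it.)

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MLOCK_ONFAULT
#define MLOCK_ONFAULT 0x01  /* assumption: value from the Linux UAPI headers */
#endif

/* Hypothetical wrapper: call mlock2 directly via syscall(2).
 * Returns 0 on success or -errno on failure. */
int mlock2_raw(const void *addr, size_t len, int flags)
{
    assert(addr != NULL);
#ifdef SYS_mlock2
    return syscall(SYS_mlock2, addr, len, flags) == -1 ? -errno : 0;
#else
    /* The installed headers do not know the system call number. */
    (void)len;
    (void)flags;
    return -ENOSYS;
#endif
}
```

       With flags set to 0 the call behaves like mlock(); with
       MLOCK_ONFAULT only resident pages are locked immediately.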

       munlock() unlocks pages in the address range starting at addr and
       continuing for len bytes.  After this call, all pages that contain a
       part of the specified memory range can be moved to external swap
       space again by the kernel.
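
       Because locking works on whole pages, a short range can pin more
       memory than its length suggests.  The sketch below shows the page
       arithmetic alongside thin mlock()/munlock() wrappers.  All three
       function names are hypothetical, and pages_covered() only models the
       rounding described above rather than querying the kernel.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Number of pages that contain any part of [addr, addr + len). */
size_t pages_covered(uintptr_t addr, size_t len, size_t page_size)
{
    assert(page_size > 0);
    if (len == 0)
        return 0;
    uintptr_t first = addr / page_size;
    uintptr_t last = (addr + len - 1) / page_size;
    return (size_t)(last - first + 1);
}

/* Lock a range into RAM.  Returns 0 on success or -errno. */
int lock_region(const void *addr, size_t len)
{
    assert(addr != NULL);
    return mlock(addr, len) == -1 ? -errno : 0;
}

/* Allow the range to be swapped out again.  Returns 0 or -errno. */
int unlock_region(const void *addr, size_t len)
{
    assert(addr != NULL);
    return munlock(addr, len) == -1 ? -errno : 0;
}
```

       For example, a 2-byte range starting at offset 4095 with a 4096-byte
       page size straddles a page boundary, so two pages are locked.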

   mlockall() and munlockall()
       mlockall() locks all pages mapped into the address space of the
       calling process.  This includes the pages of the code, data and stack
       segment, as well as shared libraries, user space kernel data, shared
       memory, and memory-mapped files.  All mapped pages are guaranteed to
       be resident in RAM when the call returns successfully; the pages are
       guaranteed to stay in RAM until later unlocked.

       The flags argument is constructed as the bitwise OR of one or more of
       the following constants:

       MCL_CURRENT Lock all pages which are currently mapped into the
                   address space of the process.

       MCL_FUTURE  Lock all pages which will become mapped into the address
                   space of the process in the future.  These could be, for
                   instance, new pages required by a growing heap and stack
                   as well as new memory-mapped files or shared memory
                   regions.

       MCL_ONFAULT (since Linux 4.4)
                   Used together with MCL_CURRENT, MCL_FUTURE, or both.
                   Mark all current (with MCL_CURRENT) or future (with
                   MCL_FUTURE) mappings to lock pages when they are faulted
                   in.  When used with MCL_CURRENT, all present pages are
                   locked, but mlockall() will not fault in non-present
                   pages.  When used with MCL_FUTURE, all future mappings
                   will be marked to lock pages when they are faulted in,
                   but they will not be populated by the lock when the
                   mapping is created.  MCL_ONFAULT must be used with either
                   MCL_CURRENT or MCL_FUTURE or both.

       If MCL_FUTURE has been specified, then a later system call (e.g.,
       mmap(2), sbrk(2), malloc(3)) may fail if it would cause the number
       of locked bytes to exceed the permitted maximum (see below).  In the
       same circumstances, stack growth may likewise fail: the kernel will
       deny stack expansion and deliver a SIGSEGV signal to the process.

       munlockall() unlocks all pages mapped into the address space of the
       calling process.
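
       The flag combinations described above could be assembled as in this
       sketch.  The helper names mcl_flags() and lock_address_space() are
       hypothetical.  If the C library headers do not define MCL_ONFAULT,
       the sketch silently ignores the on_fault request rather than guess
       at a value.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>

/* Build an mlockall() flags value from three booleans. */
int mcl_flags(int current, int future, int on_fault)
{
    /* MCL_ONFAULT must be combined with MCL_CURRENT, MCL_FUTURE, or both. */
    assert(!on_fault || current || future);

    int flags = 0;
    if (current)
        flags |= MCL_CURRENT;
    if (future)
        flags |= MCL_FUTURE;
#ifdef MCL_ONFAULT
    if (on_fault)
        flags |= MCL_ONFAULT;
#endif
    return flags;
}

/* Lock the address space as requested.  Returns 0 or -errno; the call
 * commonly fails with ENOMEM or EPERM when the locked-memory limit is
 * too low for the process. */
int lock_address_space(int flags)
{
    return mlockall(flags) == -1 ? -errno : 0;
}
```

       A latency-sensitive program might call
       lock_address_space(mcl_flags(1, 1, 0)) once at startup and
       munlockall() before exiting.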
http://man7.org/linux/man-pages/man2/mlock2.2.html
13
SYSTEM CALL:
mlock(2) - Linux manual page
FUNCTIONALITY:

       mlock, mlock2, munlock, mlockall, munlockall - lock and unlock memory
SYNOPSIS:

       #include <sys/mman.h>

       int mlock(const void *addr, size_t len);
       int mlock2(const void *addr, size_t len, int flags);
       int munlock(const void *addr, size_t len);

       int mlockall(int flags);
       int munlockall(void);
DESCRIPTION

       mlock(), mlock2(), and mlockall() lock part or all of the calling
       process's virtual address space into RAM, preventing that memory from
       being paged to the swap area.

       munlock() and munlockall() perform the converse operation, unlocking
       part or all of the calling process's virtual address space, so that
       pages in the specified virtual address range may once more be
       swapped out if required by the kernel memory manager.

       Memory locking and unlocking are performed in units of whole pages.

   mlock(), mlock2(), and munlock()
       mlock() locks pages in the address range starting at addr and
       continuing for len bytes.  All pages that contain a part of the
       specified address range are guaranteed to be resident in RAM when the
       call returns successfully; the pages are guaranteed to stay in RAM
       until later unlocked.

       mlock2() also locks pages in the specified range starting at addr and
       continuing for len bytes.  However, the state of the pages contained
       in that range after the call returns successfully will depend on the
       value in the flags argument.

       The flags argument can be either 0 or the following constant:

       MLOCK_ONFAULT
              Lock pages that are currently resident and mark the entire
              range so that the remaining pages are locked when they are
              populated by a page fault.

       If flags is 0, mlock2() behaves exactly the same as mlock().

       Note: currently, there is not a glibc wrapper for mlock2(), so it
       will need to be invoked using syscall(2).

       munlock() unlocks pages in the address range starting at addr and
       continuing for len bytes.  After this call, all pages that contain a
       part of the specified memory range can be moved to external swap
       space again by the kernel.

   mlockall() and munlockall()
       mlockall() locks all pages mapped into the address space of the
       calling process.  This includes the pages of the code, data and stack
       segment, as well as shared libraries, user space kernel data, shared
       memory, and memory-mapped files.  All mapped pages are guaranteed to
       be resident in RAM when the call returns successfully; the pages are
       guaranteed to stay in RAM until later unlocked.

       The flags argument is constructed as the bitwise OR of one or more of
       the following constants:

       MCL_CURRENT Lock all pages which are currently mapped into the
                   address space of the process.

       MCL_FUTURE  Lock all pages which will become mapped into the address
                   space of the process in the future.  These could be, for
                   instance, new pages required by a growing heap and stack
                   as well as new memory-mapped files or shared memory
                   regions.

       MCL_ONFAULT (since Linux 4.4)
                   Used together with MCL_CURRENT, MCL_FUTURE, or both.
                   Mark all current (with MCL_CURRENT) or future (with
                   MCL_FUTURE) mappings to lock pages when they are faulted
                   in.  When used with MCL_CURRENT, all present pages are
                   locked, but mlockall() will not fault in non-present
                   pages.  When used with MCL_FUTURE, all future mappings
                   will be marked to lock pages when they are faulted in,
                   but they will not be populated by the lock when the
                   mapping is created.  MCL_ONFAULT must be used with either
                   MCL_CURRENT or MCL_FUTURE or both.

       If MCL_FUTURE has been specified, then a later system call (e.g.,
       mmap(2), sbrk(2), malloc(3)) may fail if it would cause the number
       of locked bytes to exceed the permitted maximum (see below).  In the
       same circumstances, stack growth may likewise fail: the kernel will
       deny stack expansion and deliver a SIGSEGV signal to the process.

       munlockall() unlocks all pages mapped into the address space of the
       calling process.
http://man7.org/linux/man-pages/man2/mlockall.2.html
13
SYSTEM CALL:
mlock(2) - Linux manual page
FUNCTIONALITY:

       mlock, mlock2, munlock, mlockall, munlockall - lock and unlock memory
SYNOPSIS:

       #include <sys/mman.h>

       int mlock(const void *addr, size_t len);
       int mlock2(const void *addr, size_t len, int flags);
       int munlock(const void *addr, size_t len);

       int mlockall(int flags);
       int munlockall(void);
DESCRIPTION

       mlock(), mlock2(), and mlockall() lock part or all of the calling
       process's virtual address space into RAM, preventing that memory from
       being paged to the swap area.

       munlock() and munlockall() perform the converse operation, unlocking
       part or all of the calling process's virtual address space, so that
       pages in the specified virtual address range may once more be
       swapped out if required by the kernel memory manager.

       Memory locking and unlocking are performed in units of whole pages.

   mlock(), mlock2(), and munlock()
       mlock() locks pages in the address range starting at addr and
       continuing for len bytes.  All pages that contain a part of the
       specified address range are guaranteed to be resident in RAM when the
       call returns successfully; the pages are guaranteed to stay in RAM
       until later unlocked.

       mlock2() also locks pages in the specified range starting at addr and
       continuing for len bytes.  However, the state of the pages contained
       in that range after the call returns successfully will depend on the
       value in the flags argument.

       The flags argument can be either 0 or the following constant:

       MLOCK_ONFAULT
              Lock pages that are currently resident and mark the entire
              range so that the remaining pages are locked when they are
              populated by a page fault.

       If flags is 0, mlock2() behaves exactly the same as mlock().

       Note: currently, there is not a glibc wrapper for mlock2(), so it
       will need to be invoked using syscall(2).

       munlock() unlocks pages in the address range starting at addr and
       continuing for len bytes.  After this call, all pages that contain a
       part of the specified memory range can be moved to external swap
       space again by the kernel.

   mlockall() and munlockall()
       mlockall() locks all pages mapped into the address space of the
       calling process.  This includes the pages of the code, data and stack
       segment, as well as shared libraries, user space kernel data, shared
       memory, and memory-mapped files.  All mapped pages are guaranteed to
       be resident in RAM when the call returns successfully; the pages are
       guaranteed to stay in RAM until later unlocked.

       The flags argument is constructed as the bitwise OR of one or more of
       the following constants:

       MCL_CURRENT Lock all pages which are currently mapped into the
                   address space of the process.

       MCL_FUTURE  Lock all pages which will become mapped into the address
                   space of the process in the future.  These could be, for
                   instance, new pages required by a growing heap and stack
                   as well as new memory-mapped files or shared memory
                   regions.

       MCL_ONFAULT (since Linux 4.4)
                   Used together with MCL_CURRENT, MCL_FUTURE, or both.
                   Mark all current (with MCL_CURRENT) or future (with
                   MCL_FUTURE) mappings to lock pages when they are faulted
                   in.  When used with MCL_CURRENT, all present pages are
                   locked, but mlockall() will not fault in non-present
                   pages.  When used with MCL_FUTURE, all future mappings
                   will be marked to lock pages when they are faulted in,
                   but they will not be populated by the lock when the
                   mapping is created.  MCL_ONFAULT must be used with either
                   MCL_CURRENT or MCL_FUTURE or both.

       If MCL_FUTURE has been specified, then a later system call (e.g.,
       mmap(2), sbrk(2), malloc(3)) may fail if it would cause the number
       of locked bytes to exceed the permitted maximum (see below).  In the
       same circumstances, stack growth may likewise fail: the kernel will
       deny stack expansion and deliver a SIGSEGV signal to the process.

       munlockall() unlocks all pages mapped into the address space of the
       calling process.
http://man7.org/linux/man-pages/man2/munlock.2.html
13
SYSTEM CALL:
mlock(2) - Linux manual page
FUNCTIONALITY:

       mlock, mlock2, munlock, mlockall, munlockall - lock and unlock memory
SYNOPSIS:

       #include <sys/mman.h>

       int mlock(const void *addr, size_t len);
       int mlock2(const void *addr, size_t len, int flags);
       int munlock(const void *addr, size_t len);

       int mlockall(int flags);
       int munlockall(void);
DESCRIPTION

       mlock(), mlock2(), and mlockall() lock part or all of the calling
       process's virtual address space into RAM, preventing that memory from
       being paged to the swap area.

       munlock() and munlockall() perform the converse operation, unlocking
       part or all of the calling process's virtual address space, so that
       pages in the specified virtual address range may once more be
       swapped out if required by the kernel memory manager.

       Memory locking and unlocking are performed in units of whole pages.

   mlock(), mlock2(), and munlock()
       mlock() locks pages in the address range starting at addr and
       continuing for len bytes.  All pages that contain a part of the
       specified address range are guaranteed to be resident in RAM when the
       call returns successfully; the pages are guaranteed to stay in RAM
       until later unlocked.

       mlock2() also locks pages in the specified range starting at addr and
       continuing for len bytes.  However, the state of the pages contained
       in that range after the call returns successfully will depend on the
       value in the flags argument.

       The flags argument can be either 0 or the following constant:

       MLOCK_ONFAULT
              Lock pages that are currently resident and mark the entire
              range so that the remaining pages are locked when they are
              populated by a page fault.

       If flags is 0, mlock2() behaves exactly the same as mlock().

       Note: currently, there is not a glibc wrapper for mlock2(), so it
       will need to be invoked using syscall(2).

       munlock() unlocks pages in the address range starting at addr and
       continuing for len bytes.  After this call, all pages that contain a
       part of the specified memory range can be moved to external swap
       space again by the kernel.

   mlockall() and munlockall()
       mlockall() locks all pages mapped into the address space of the
       calling process.  This includes the pages of the code, data and stack
       segment, as well as shared libraries, user space kernel data, shared
       memory, and memory-mapped files.  All mapped pages are guaranteed to
       be resident in RAM when the call returns successfully; the pages are
       guaranteed to stay in RAM until later unlocked.

       The flags argument is constructed as the bitwise OR of one or more of
       the following constants:

       MCL_CURRENT Lock all pages which are currently mapped into the
                   address space of the process.

       MCL_FUTURE  Lock all pages which will become mapped into the address
                   space of the process in the future.  These could be, for
                   instance, new pages required by a growing heap and stack
                   as well as new memory-mapped files or shared memory
                   regions.

       MCL_ONFAULT (since Linux 4.4)
                   Used together with MCL_CURRENT, MCL_FUTURE, or both.
                   Mark all current (with MCL_CURRENT) or future (with
                   MCL_FUTURE) mappings to lock pages when they are faulted
                   in.  When used with MCL_CURRENT, all present pages are
                   locked, but mlockall() will not fault in non-present
                   pages.  When used with MCL_FUTURE, all future mappings
                   will be marked to lock pages when they are faulted in,
                   but they will not be populated by the lock when the
                   mapping is created.  MCL_ONFAULT must be used with either
                   MCL_CURRENT or MCL_FUTURE or both.

       If MCL_FUTURE has been specified, then a later system call (e.g.,
       mmap(2), sbrk(2), malloc(3)) may fail if it would cause the number
       of locked bytes to exceed the permitted maximum (see below).  In the
       same circumstances, stack growth may likewise fail: the kernel will
       deny stack expansion and deliver a SIGSEGV signal to the process.

       munlockall() unlocks all pages mapped into the address space of the
       calling process.
http://man7.org/linux/man-pages/man2/munlockall.2.html
13
SYSTEM CALL:
mlock(2) - Linux manual page
FUNCTIONALITY:

       mlock, mlock2, munlock, mlockall, munlockall - lock and unlock memory
SYNOPSIS:

       #include <sys/mman.h>

       int mlock(const void *addr, size_t len);
       int mlock2(const void *addr, size_t len, int flags);
       int munlock(const void *addr, size_t len);

       int mlockall(int flags);
       int munlockall(void);
DESCRIPTION

       mlock(), mlock2(), and mlockall() lock part or all of the calling
       process's virtual address space into RAM, preventing that memory from
       being paged to the swap area.

       munlock() and munlockall() perform the converse operation, unlocking
       part or all of the calling process's virtual address space, so that
       pages in the specified virtual address range may once more be
       swapped out if required by the kernel memory manager.

       Memory locking and unlocking are performed in units of whole pages.

   mlock(), mlock2(), and munlock()
       mlock() locks pages in the address range starting at addr and
       continuing for len bytes.  All pages that contain a part of the
       specified address range are guaranteed to be resident in RAM when the
       call returns successfully; the pages are guaranteed to stay in RAM
       until later unlocked.

       mlock2() also locks pages in the specified range starting at addr and
       continuing for len bytes.  However, the state of the pages contained
       in that range after the call returns successfully will depend on the
       value in the flags argument.

       The flags argument can be either 0 or the following constant:

       MLOCK_ONFAULT
              Lock pages that are currently resident and mark the entire
              range so that the remaining pages are locked when they are
              populated by a page fault.

       If flags is 0, mlock2() behaves exactly the same as mlock().

       Note: currently, there is not a glibc wrapper for mlock2(), so it
       will need to be invoked using syscall(2).

       munlock() unlocks pages in the address range starting at addr and
       continuing for len bytes.  After this call, all pages that contain a
       part of the specified memory range can be moved to external swap
       space again by the kernel.

   mlockall() and munlockall()
       mlockall() locks all pages mapped into the address space of the
       calling process.  This includes the pages of the code, data and stack
       segment, as well as shared libraries, user space kernel data, shared
       memory, and memory-mapped files.  All mapped pages are guaranteed to
       be resident in RAM when the call returns successfully; the pages are
       guaranteed to stay in RAM until later unlocked.

       The flags argument is constructed as the bitwise OR of one or more of
       the following constants:

       MCL_CURRENT Lock all pages which are currently mapped into the
                   address space of the process.

       MCL_FUTURE  Lock all pages which will become mapped into the address
                   space of the process in the future.  These could be, for
                   instance, new pages required by a growing heap and stack
                   as well as new memory-mapped files or shared memory
                   regions.

       MCL_ONFAULT (since Linux 4.4)
                   Used together with MCL_CURRENT, MCL_FUTURE, or both.
                   Mark all current (with MCL_CURRENT) or future (with
                   MCL_FUTURE) mappings to lock pages when they are faulted
                   in.  When used with MCL_CURRENT, all present pages are
                   locked, but mlockall() will not fault in non-present
                   pages.  When used with MCL_FUTURE, all future mappings
                   will be marked to lock pages when they are faulted in,
                   but they will not be populated by the lock when the
                   mapping is created.  MCL_ONFAULT must be used with either
                   MCL_CURRENT or MCL_FUTURE or both.

       If MCL_FUTURE has been specified, then a later system call (e.g.,
       mmap(2), sbrk(2), malloc(3)) may fail if it would cause the number
       of locked bytes to exceed the permitted maximum (see below).  In the
       same circumstances, stack growth may likewise fail: the kernel will
       deny stack expansion and deliver a SIGSEGV signal to the process.

       munlockall() unlocks all pages mapped into the address space of the
       calling process.
http://man7.org/linux/man-pages/man2/mincore.2.html
11
SYSTEM CALL:
mincore(2) - Linux manual page
FUNCTIONALITY:

       mincore - determine whether pages are resident in memory
SYNOPSIS:

       #include <unistd.h>
       #include <sys/mman.h>

       int mincore(void *addr, size_t length, unsigned char *vec);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       mincore():
           Since glibc 2.19:
               _DEFAULT_SOURCE
           Glibc 2.19 and earlier:
               _BSD_SOURCE || _SVID_SOURCE
DESCRIPTION

       mincore() returns a vector that indicates whether pages of the
       calling process's virtual memory are resident in core (RAM), and so
       will not cause a disk access (page fault) if referenced.  The kernel
       returns residency information about the pages starting at the address
       addr, and continuing for length bytes.

       The addr argument must be a multiple of the system page size.  The
       length argument need not be a multiple of the page size, but since
       residency information is returned for whole pages, length is
       effectively rounded up to the next multiple of the page size.  One
       may obtain the page size (PAGE_SIZE) using sysconf(_SC_PAGESIZE).

       The vec argument must point to an array containing at least
       (length+PAGE_SIZE-1) / PAGE_SIZE bytes.  On return, the least
       significant bit of each byte will be set if the corresponding page is
       currently resident in memory, and be clear otherwise.  (The settings
       of the other bits in each byte are undefined; these bits are reserved
       for possible later use.)  Of course the information returned in vec
       is only a snapshot: pages that are not locked in memory can come and
       go at any moment, and the contents of vec may already be stale by the
       time this call returns.
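
       The vector sizing rule and the least-significant-bit convention
       described above can be combined into a small helper.  The name
       count_resident_pages() is hypothetical, and the result is only a
       snapshot, as the description points out.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: count the pages of [addr, addr + length) that
 * mincore() reports as resident.  addr must be page-aligned.
 * Returns the count, or -1 on error (errno is set by mincore/malloc). */
long count_resident_pages(void *addr, size_t length)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    /* vec needs (length + PAGE_SIZE - 1) / PAGE_SIZE bytes. */
    size_t npages = (length + (size_t)page - 1) / (size_t)page;
    if (npages == 0)
        return 0;

    unsigned char *vec = malloc(npages);
    if (vec == NULL)
        return -1;

    if (mincore(addr, length, vec) == -1) {
        free(vec);
        return -1;
    }

    long resident = 0;
    for (size_t i = 0; i < npages; i++)
        resident += vec[i] & 1;     /* only the lowest bit is defined */

    free(vec);
    return resident;
}
```

       Applied to a fresh anonymous mapping, the count typically rises as
       pages are touched for the first time.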
http://man7.org/linux/man-pages/man2/membarrier.2.html
11
SYSTEM CALL:
membarrier(2) - Linux manual page
FUNCTIONALITY:

       membarrier - issue memory barriers on a set of threads
SYNOPSIS:

       #include <linux/membarrier.h>

       int membarrier(int cmd, int flags);
DESCRIPTION

       The membarrier() system call helps reduce the overhead of the
       memory barrier instructions required to order memory accesses on
       multi-core systems.  However, this system call is heavier than a
       memory barrier, so using it effectively is not as simple as replacing
       memory barriers with this system call, but requires understanding of
       the details below.

       When using memory barriers, keep in mind that a memory barrier
       always needs to be matched with its memory barrier counterparts,
       unless the architecture's memory model doesn't require the
       matching barriers.

       There are cases where one side of the matching barriers (which we
       will refer to as "fast side") is executed much more often than the
       other (which we will refer to as "slow side").  This is a prime
       target for the use of membarrier().  The key idea is to replace, for
       these matching barriers, the fast-side memory barriers by simple
       compiler barriers, for example:

           asm volatile ("" : : : "memory")

       and replace the slow-side memory barriers by calls to membarrier().

       This will add overhead to the slow side, and remove overhead from the
       fast side, thus resulting in an overall performance increase as long
       as the slow side is infrequent enough that the overhead of the
       membarrier() calls does not outweigh the performance gain on the fast
       side.
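
       The fast-side/slow-side split described above might be sketched as
       follows.  This is a minimal illustration, not a complete
       synchronization scheme: the flags a and b, the functions fast_path()
       and slow_path(), and the helper sys_membarrier() are hypothetical
       names, and the helper assumes that no C library wrapper is available,
       so it goes through syscall(2).  It uses the MEMBARRIER_CMD_SHARED
       command with flags set to 0.

```c
#define _GNU_SOURCE
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: no libc wrapper exists, so invoke the system call directly. */
static int sys_membarrier(int cmd, int flags)
{
    return (int) syscall(__NR_membarrier, cmd, flags);
}

static volatile int a, b;   /* hypothetical flags shared between threads */

/* Fast side: runs often and pays only for a compiler barrier. */
int fast_path(void)
{
    a = 1;
    __asm__ volatile ("" : : : "memory");
    return b;
}

/* Slow side: runs rarely and pays for the system call.
 * Returns the observed value of a, or -1 if membarrier() failed. */
int slow_path(void)
{
    b = 1;
    if (sys_membarrier(MEMBARRIER_CMD_SHARED, 0) == -1)
        return -1;
    return a;
}
```

       If the slow side reports failure, the caller cannot fall back to a
       plain memory barrier on that side alone, because the fast side no
       longer issues a matching hardware barrier.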

       The cmd argument is one of the following:

       MEMBARRIER_CMD_QUERY
              Query the set of supported commands.  The return value of the
              call is a bit mask of supported commands.
              MEMBARRIER_CMD_QUERY, which has the value 0, is not itself
              included in this bit mask.  This command is always supported
              (on kernels where membarrier() is provided).

       MEMBARRIER_CMD_SHARED
              Ensure that all threads from all processes on the system pass
              through a state where all memory accesses to user-space
              addresses match program order between entry to and return from
              the membarrier() system call.  All threads on the system are
              targeted by this command.

       The flags argument is currently unused and must be specified as 0.
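
       Before relying on a command, a program can check the bit mask
       returned by MEMBARRIER_CMD_QUERY.  The following sketch assumes the
       call is made through syscall(2); membarrier_supported() is a
       hypothetical helper name.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Returns 1 if cmd is supported, 0 if it is not (or if the kernel does
 * not provide membarrier() at all), and -1 on any other error. */
int membarrier_supported(int cmd)
{
    long mask = syscall(__NR_membarrier, MEMBARRIER_CMD_QUERY, 0);

    if (mask == -1)
        return errno == ENOSYS ? 0 : -1;
    return (mask & cmd) != 0;
}
```

       Since MEMBARRIER_CMD_QUERY has the value 0 and is not part of the
       mask, passing it to this helper never yields 1.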

       All memory accesses performed in program order from each targeted
       thread are guaranteed to be ordered with respect to membarrier().

       If we use barrier() to denote a compiler barrier that forces memory
       accesses to be performed in program order across the barrier, and
       smp_mb() to denote an explicit memory barrier that forces full
       memory ordering across the barrier, we have the following ordering
       table for each pairing of barrier(), membarrier(), and smp_mb().  In
       the table, O means that the pair is ordered and X means that it is
       not ordered:

                              barrier()  smp_mb()  membarrier()
              barrier()          X          X          O
              smp_mb()           X          O          O
              membarrier()       O          O          O
http://man7.org/linux/man-pages/man2/modify_ldt.2.html
11
SYSTEM CALL:
modify_ldt(2) - Linux manual page
FUNCTIONALITY:

       modify_ldt - get or set a per-process LDT entry
SYNOPSIS:

       #include <sys/types.h>

       int modify_ldt(int func, void *ptr, unsigned long bytecount);

       Note: There is no glibc wrapper for this system call; see NOTES.
DESCRIPTION

       modify_ldt() reads or writes the local descriptor table (LDT) for a
       process.  The LDT is an array of segment descriptors that can be
       referenced by user code.  Linux allows processes to configure a per-
       process (actually per-mm) LDT.  For more information about the LDT,
       see the Intel Software Developer's Manual or the AMD Architecture
       Programming Manual.

       When func is 0, modify_ldt() reads the LDT into the memory pointed to
       by ptr.  The number of bytes read is the smaller of bytecount and the
       actual size of the LDT, although the kernel may act as though the LDT
       is padded with additional trailing zero bytes.  On success,
       modify_ldt() will return the number of bytes read.
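
       Since there is no glibc wrapper, a read of the LDT (func equal to 0)
       might look like the sketch below.  The helper name read_ldt() is
       hypothetical, and the code assumes that SYS_modify_ldt is defined
       only on systems that provide this system call.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Read up to bytecount bytes of the LDT into buf (func == 0).
 * Returns the number of bytes read, or -1 with errno set. */
long read_ldt(void *buf, unsigned long bytecount)
{
#ifdef SYS_modify_ldt
    return syscall(SYS_modify_ldt, 0, buf, bytecount);
#else
    (void) buf;
    (void) bytecount;
    errno = ENOSYS;             /* assumed: not an x86 system */
    return -1;
#endif
}
```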

       When func is 1 or 0x11, modify_ldt() modifies the LDT entry indicated
       by ptr->entry_number.  ptr points to a user_desc structure and
       bytecount must equal the size of this structure.

       The user_desc structure is defined in <asm/ldt.h> as:

           struct user_desc {
               unsigned int  entry_number;
               unsigned int  base_addr;
               unsigned int  limit;
               unsigned int  seg_32bit:1;
               unsigned int  contents:2;
               unsigned int  read_exec_only:1;
               unsigned int  limit_in_pages:1;
               unsigned int  seg_not_present:1;
               unsigned int  useable:1;
           };

       In Linux 2.4 and earlier, this structure was named modify_ldt_ldt_s.

       The contents field is the segment type (data, expand-down data, non-
       conforming code, or conforming code).  The other fields match their
       descriptions in the CPU manual, although modify_ldt() cannot set the
       hardware-defined "accessed" bit described in the CPU manual.

       A user_desc is considered "empty" if read_exec_only and
       seg_not_present are set to 1 and all of the other fields are 0.  An
       LDT entry can be cleared by setting it to an "empty" user_desc or, if
       func is 1, by setting both base and limit to 0.
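
       The "empty" rule can be expressed as a pair of small helpers.  This
       is a sketch for x86 systems where <asm/ldt.h> is available; the
       helper names are hypothetical, and leaving entry_number out of the
       test is an assumption (it only selects which entry is affected).

```c
#include <asm/ldt.h>    /* struct user_desc (x86 only) */
#include <string.h>

/* Fill *d so that it is "empty" in the sense described above. */
void user_desc_make_empty(struct user_desc *d, unsigned int entry_number)
{
    memset(d, 0, sizeof(*d));
    d->entry_number = entry_number;   /* selects the entry; assumed not tested */
    d->read_exec_only = 1;
    d->seg_not_present = 1;
}

/* Return 1 if *d is "empty", 0 otherwise. */
int user_desc_is_empty(const struct user_desc *d)
{
    return d->base_addr == 0 && d->limit == 0 &&
           d->seg_32bit == 0 && d->contents == 0 &&
           d->read_exec_only == 1 && d->limit_in_pages == 0 &&
           d->seg_not_present == 1 && d->useable == 0;
}
```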

       A conforming code segment (i.e., one with contents==3) will be
       rejected if func is 1 or if seg_not_present is 0.

       When func is 2, modify_ldt() will read zeros.  This appears to be a
       leftover from Linux 2.4.
http://man7.org/linux/man-pages/man2/capset.2.html
10
SYSTEM CALL:
capget(2) - Linux manual page
FUNCTIONALITY:

       capget, capset - set/get capabilities of thread(s)
SYNOPSIS:

       #include <sys/capability.h>

       int capget(cap_user_header_t hdrp, cap_user_data_t datap);

       int capset(cap_user_header_t hdrp, const cap_user_data_t datap);
DESCRIPTION

       As of Linux 2.2, the power of the superuser (root) has been
       partitioned into a set of discrete capabilities.  Each thread has a
       set of effective capabilities identifying which capabilities (if any)
       it may currently exercise.  Each thread also has a set of inheritable
       capabilities that may be passed through an execve(2) call, and a set
       of permitted capabilities that it can make effective or inheritable.

       These two system calls are the raw kernel interface for getting and
       setting thread capabilities.  Not only are these system calls
       specific to Linux, but the kernel API is likely to change: use of
       these system calls (in particular, the format of the cap_user_*_t
       types) is subject to extension with each kernel revision, although
       old programs will keep working.

       The portable interfaces are cap_set_proc(3) and cap_get_proc(3); if
       possible, you should use those interfaces in applications.  If you
       wish to use the Linux extensions in applications, you should use the
       easier-to-use interfaces capsetp(3) and capgetp(3).

   Current details
       Now that you have been warned, here are some current kernel details.
       The structures are defined as follows.

           #define _LINUX_CAPABILITY_VERSION_1  0x19980330
           #define _LINUX_CAPABILITY_U32S_1     1

                   /* V2 added in Linux 2.6.25; deprecated */
           #define _LINUX_CAPABILITY_VERSION_2  0x20071026
           #define _LINUX_CAPABILITY_U32S_2     2

                   /* V3 added in Linux 2.6.26 */
           #define _LINUX_CAPABILITY_VERSION_3  0x20080522
           #define _LINUX_CAPABILITY_U32S_3     2

           typedef struct __user_cap_header_struct {
              __u32 version;
              int pid;
           } *cap_user_header_t;

           typedef struct __user_cap_data_struct {
              __u32 effective;
              __u32 permitted;
              __u32 inheritable;
           } *cap_user_data_t;

       The effective, permitted, and inheritable fields are bit masks of the
       capabilities defined in capabilities(7).  Note that the CAP_* values
       are bit indexes and need to be bit-shifted before ORing into the bit
       fields.  To define the structures for passing to the system call, you
       have to use the struct __user_cap_header_struct and struct
       __user_cap_data_struct names because the typedefs are only pointers.

       Kernels prior to 2.6.25 prefer 32-bit capabilities with version
       _LINUX_CAPABILITY_VERSION_1.  Linux 2.6.25 added 64-bit capability
       sets, with version _LINUX_CAPABILITY_VERSION_2.  There was, however,
       an API glitch, and Linux 2.6.26 added _LINUX_CAPABILITY_VERSION_3 to
       fix the problem.

       Note that 64-bit capabilities use datap[0] and datap[1], whereas
       32-bit capabilities use only datap[0].
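
       A raw capget() call using the version 3 layout might look like the
       following sketch.  It includes <linux/capability.h> and calls through
       syscall(2) so that it does not depend on libcap being installed; that
       choice, and the helper name effective_has_cap(), are assumptions made
       for illustration.

```c
#define _GNU_SOURCE
#include <linux/capability.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Returns 1 if cap is in the calling thread's effective set, 0 if it is
 * not, and -1 if cap is out of range or capget() fails. */
int effective_has_cap(int cap)
{
    struct __user_cap_header_struct hdr;
    struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3];

    if (cap < 0 || cap >= 32 * _LINUX_CAPABILITY_U32S_3)
        return -1;

    memset(&hdr, 0, sizeof(hdr));
    memset(data, 0, sizeof(data));
    hdr.version = _LINUX_CAPABILITY_VERSION_3;
    hdr.pid = 0;                        /* 0 selects the calling thread */

    if (syscall(SYS_capget, &hdr, data) == -1)
        return -1;

    /* data[0] carries capabilities 0-31, data[1] carries 32-63. */
    return (data[cap / 32].effective >> (cap % 32)) & 1;
}
```

       Applications that can link against libcap are usually better served
       by cap_get_proc(3).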

       On kernels that support file capabilities (VFS capability support),
       these system calls behave slightly differently.  This support was
       added as an option in Linux 2.6.24, and became fixed (nonoptional) in
       Linux 2.6.33.

       For capget() calls, one can probe the capabilities of any process by
       specifying its process ID with the hdrp->pid field value.

   With VFS capability support
       VFS capability support creates a file-attribute method for adding
       capabilities to privileged executables.  This privilege model
       obsoletes kernel support for one process asynchronously setting the
       capabilities of another.  That is, with VFS support, for capset()
       calls the only permitted values for hdrp->pid are 0 or gettid(2),
       which are equivalent.

   Without VFS capability support
       When the kernel does not support VFS capabilities, capset() calls can
       operate on the capabilities of the thread specified by the pid field
       of hdrp when that is nonzero, or on the capabilities of the calling
       thread if pid is 0.  If pid refers to a single-threaded process, then
       pid can be specified as a traditional process ID; operating on a
       thread of a multithreaded process requires a thread ID of the type
       returned by gettid(2).  For capset(), pid can also be: -1, meaning
       perform the change on all threads except the caller and init(1); or a
       value less than -1, in which case the change is applied to all
       members of the process group whose ID is -pid.

       For details on the data, see capabilities(7).
http://man7.org/linux/man-pages/man2/capget.2.html
10
SYSTEM CALL:
capget(2) - Linux manual page
FUNCTIONALITY:

       capget, capset - set/get capabilities of thread(s)
SYNOPSIS:

       #include <sys/capability.h>

       int capget(cap_user_header_t hdrp, cap_user_data_t datap);

       int capset(cap_user_header_t hdrp, const cap_user_data_t datap);
DESCRIPTION

       As of Linux 2.2, the power of the superuser (root) has been
       partitioned into a set of discrete capabilities.  Each thread has a
       set of effective capabilities identifying which capabilities (if any)
       it may currently exercise.  Each thread also has a set of inheritable
       capabilities that may be passed through an execve(2) call, and a set
       of permitted capabilities that it can make effective or inheritable.

       These two system calls are the raw kernel interface for getting and
       setting thread capabilities.  Not only are these system calls
       specific to Linux, but the kernel API is likely to change: use of
       these system calls (in particular, the format of the cap_user_*_t
       types) is subject to extension with each kernel revision, although
       old programs will keep working.

       The portable interfaces are cap_set_proc(3) and cap_get_proc(3); if
       possible, you should use those interfaces in applications.  If you
       wish to use the Linux extensions in applications, you should use the
       easier-to-use interfaces capsetp(3) and capgetp(3).

   Current details
       Now that you have been warned, here are some current kernel details.
       The structures are defined as follows.

           #define _LINUX_CAPABILITY_VERSION_1  0x19980330
           #define _LINUX_CAPABILITY_U32S_1     1

                   /* V2 added in Linux 2.6.25; deprecated */
           #define _LINUX_CAPABILITY_VERSION_2  0x20071026
           #define _LINUX_CAPABILITY_U32S_2     2

                   /* V3 added in Linux 2.6.26 */
           #define _LINUX_CAPABILITY_VERSION_3  0x20080522
           #define _LINUX_CAPABILITY_U32S_3     2

           typedef struct __user_cap_header_struct {
              __u32 version;
              int pid;
           } *cap_user_header_t;

           typedef struct __user_cap_data_struct {
              __u32 effective;
              __u32 permitted;
              __u32 inheritable;
           } *cap_user_data_t;

       The effective, permitted, and inheritable fields are bit masks of the
       capabilities defined in capabilities(7).  Note that the CAP_* values
       are bit indexes and need to be bit-shifted before ORing into the bit
       fields.  To define the structures for passing to the system call, you
       have to use the struct __user_cap_header_struct and struct
       __user_cap_data_struct names because the typedefs are only pointers.
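
       The bit-shifting rule can be illustrated with a short helper that
       turns a CAP_* bit index into a mask and ORs it into the right array
       element.  The helper name cap_data_raise() is hypothetical, and the
       sketch assumes the two-element layout used by the 64-bit versions.

```c
#include <linux/capability.h>

/* Add capability cap (a bit index, 0..63) to the permitted and effective
 * masks of a two-element data array.
 * Returns 0 on success, or -1 if cap is out of range. */
int cap_data_raise(struct __user_cap_data_struct data[2], int cap)
{
    if (cap < 0 || cap >= 64)
        return -1;

    __u32 mask = (__u32) 1 << (cap % 32);   /* shift the index into a mask */

    data[cap / 32].permitted |= mask;
    data[cap / 32].effective |= mask;
    return 0;
}
```

       Note that the array is declared with the struct name, not with
       cap_user_data_t, because the typedef is only a pointer type.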

       Kernels prior to 2.6.25 prefer 32-bit capabilities with version
       _LINUX_CAPABILITY_VERSION_1.  Linux 2.6.25 added 64-bit capability
       sets, with version _LINUX_CAPABILITY_VERSION_2.  There was, however,
       an API glitch, and Linux 2.6.26 added _LINUX_CAPABILITY_VERSION_3 to
       fix the problem.

       Note that 64-bit capabilities use datap[0] and datap[1], whereas
       32-bit capabilities use only datap[0].

       On kernels that support file capabilities (VFS capability support),
       these system calls behave slightly differently.  This support was
       added as an option in Linux 2.6.24, and became fixed (nonoptional) in
       Linux 2.6.33.

       For capget() calls, one can probe the capabilities of any process by
       specifying its process ID with the hdrp->pid field value.

   With VFS capability support
       VFS capability support creates a file-attribute method for adding
       capabilities to privileged executables.  This privilege model
       obsoletes kernel support for one process asynchronously setting the
       capabilities of another.  That is, with VFS support, for capset()
       calls the only permitted values for hdrp->pid are 0 or gettid(2),
       which are equivalent.

   Without VFS capability support
       When the kernel does not support VFS capabilities, capset() calls can
       operate on the capabilities of the thread specified by the pid field
       of hdrp when that is nonzero, or on the capabilities of the calling
       thread if pid is 0.  If pid refers to a single-threaded process, then
       pid can be specified as a traditional process ID; operating on a
       thread of a multithreaded process requires a thread ID of the type
       returned by gettid(2).  For capset(), pid can also be: -1, meaning
       perform the change on all threads except the caller and init(1); or a
       value less than -1, in which case the change is applied to all
       members of the process group whose ID is -pid.

       For details on the data, see capabilities(7).
http://man7.org/linux/man-pages/man2/set_thread_area.2.html
12
SYSTEM CALL:
set_thread_area(2) - Linux manual page
FUNCTIONALITY:

       set_thread_area - set a GDT entry for thread-local storage
SYNOPSIS:

       #include <linux/unistd.h>
       #include <asm/ldt.h>

       int get_thread_area(struct user_desc *u_info);
       int set_thread_area(struct user_desc *u_info);

       Note: There are no glibc wrappers for these system calls; see NOTES.
DESCRIPTION

       Linux dedicates three global descriptor table (GDT) entries for
       thread-local storage.  For more information about the GDT, see the
       Intel Software Developer's Manual or the AMD Architecture Programming
       Manual.

       Both of these system calls take an argument that is a pointer to a
       structure of the following type:

           struct user_desc {
               unsigned int  entry_number;
               unsigned int  base_addr;
               unsigned int  limit;
               unsigned int  seg_32bit:1;
               unsigned int  contents:2;
               unsigned int  read_exec_only:1;
               unsigned int  limit_in_pages:1;
               unsigned int  seg_not_present:1;
               unsigned int  useable:1;
           };

       get_thread_area() reads the GDT entry indicated by
       u_info->entry_number and fills in the rest of the fields in u_info.

       set_thread_area() sets a TLS entry in the GDT.

       The TLS array entry set by set_thread_area() corresponds to the value
       of u_info->entry_number passed in by the user.  If this value is in
       bounds, set_thread_area() writes the TLS descriptor pointed to by
       u_info into the thread's TLS array.

       When set_thread_area() is passed an entry_number of -1, it searches
       for a free TLS entry.  If set_thread_area() finds a free TLS entry,
       the value of u_info->entry_number is set upon return to show which
       entry was changed.
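
       Asking the kernel to choose a free TLS entry might be sketched as
       follows.  This is an illustration for x86 systems where <asm/ldt.h>
       is available; the helper name alloc_tls_entry() and the particular
       descriptor settings (a 32-bit, page-granular data segment) are
       assumptions.  The call goes through syscall(2) because no glibc
       wrapper exists, and it may simply fail (for example, in a 64-bit
       process).

```c
#define _GNU_SOURCE
#include <asm/ldt.h>    /* struct user_desc (x86 only) */
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Ask the kernel to pick a free TLS entry for a segment based at base.
 * Returns the chosen entry number, or -1 with errno set. */
long alloc_tls_entry(void *base)
{
#ifdef SYS_set_thread_area
    struct user_desc desc;

    memset(&desc, 0, sizeof(desc));
    desc.entry_number = (unsigned int) -1;      /* search for a free entry */
    desc.base_addr = (unsigned int) (uintptr_t) base;  /* truncated if 64-bit */
    desc.limit = 0xfffff;
    desc.seg_32bit = 1;
    desc.limit_in_pages = 1;
    desc.useable = 1;

    if (syscall(SYS_set_thread_area, &desc) == -1)
        return -1;
    return (long) desc.entry_number;            /* filled in by the kernel */
#else
    (void) base;
    errno = ENOSYS;
    return -1;
#endif
}
```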

       A user_desc is considered "empty" if read_exec_only and
       seg_not_present are set to 1 and all of the other fields are 0.  If
       an "empty" descriptor is passed to set_thread_area, the corresponding
       TLS entry will be cleared.  See BUGS for additional details.

       Since Linux 3.19, set_thread_area() cannot be used to write non-
       present segments, 16-bit segments, or code segments, although
       clearing a segment is still acceptable.
http://man7.org/linux/man-pages/man2/get_thread_area.2.html
12
SYSTEM CALL:
set_thread_area(2) - Linux manual page
FUNCTIONALITY:

       set_thread_area - set a GDT entry for thread-local storage
SYNOPSIS:

       #include <linux/unistd.h>
       #include <asm/ldt.h>

       int get_thread_area(struct user_desc *u_info);
       int set_thread_area(struct user_desc *u_info);

       Note: There are no glibc wrappers for these system calls; see NOTES.
DESCRIPTION

       Linux dedicates three global descriptor table (GDT) entries for
       thread-local storage.  For more information about the GDT, see the
       Intel Software Developer's Manual or the AMD Architecture Programming
       Manual.

       Both of these system calls take an argument that is a pointer to a
       structure of the following type:

           struct user_desc {
               unsigned int  entry_number;
               unsigned int  base_addr;
               unsigned int  limit;
               unsigned int  seg_32bit:1;
               unsigned int  contents:2;
               unsigned int  read_exec_only:1;
               unsigned int  limit_in_pages:1;
               unsigned int  seg_not_present:1;
               unsigned int  useable:1;
           };

       get_thread_area() reads the GDT entry indicated by
       u_info->entry_number and fills in the rest of the fields in u_info.

       set_thread_area() sets a TLS entry in the GDT.

       The TLS array entry set by set_thread_area() corresponds to the value
       of u_info->entry_number passed in by the user.  If this value is in
       bounds, set_thread_area() writes the TLS descriptor pointed to by
       u_info into the thread's TLS array.

       When set_thread_area() is passed an entry_number of -1, it searches
       for a free TLS entry.  If set_thread_area() finds a free TLS entry,
       the value of u_info->entry_number is set upon return to show which
       entry was changed.

       A user_desc is considered "empty" if read_exec_only and
       seg_not_present are set to 1 and all of the other fields are 0.  If
       an "empty" descriptor is passed to set_thread_area, the corresponding
       TLS entry will be cleared.  See BUGS for additional details.

       Since Linux 3.19, set_thread_area() cannot be used to write non-
       present segments, 16-bit segments, or code segments, although
       clearing a segment is still acceptable.
http://man7.org/linux/man-pages/man2/set_tid_address.2.html
10
SYSTEM CALL:
set_tid_address(2) - Linux manual page
FUNCTIONALITY:

       set_tid_address - set pointer to thread ID
SYNOPSIS:

       #include <linux/unistd.h>

       long set_tid_address(int *tidptr);
DESCRIPTION

       For each thread, the kernel maintains two attributes (addresses)
       called set_child_tid and clear_child_tid.  These two attributes
       contain the value NULL by default.

       set_child_tid
              If a thread is started using clone(2) with the
              CLONE_CHILD_SETTID flag, set_child_tid is set to the value
              passed in the ctid argument of that system call.

              When set_child_tid is set, the very first thing the new thread
              does is to write its thread ID at this address.

       clear_child_tid
              If a thread is started using clone(2) with the
              CLONE_CHILD_CLEARTID flag, clear_child_tid is set to the value
              passed in the ctid argument of that system call.

       The system call set_tid_address() sets the clear_child_tid value for
       the calling thread to tidptr.

       When a thread whose clear_child_tid is not NULL terminates while it
       is sharing memory with other threads, 0 is written at the address
       specified in clear_child_tid and the kernel performs the following
       operation:

           futex(clear_child_tid, FUTEX_WAKE, 1, NULL, NULL, 0);

       The effect of this operation is to wake a single thread that is
       performing a futex wait on the memory location.  Errors from the
       futex wake operation are ignored.
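
       A direct call might look like the sketch below.  The variable
       exit_tid_word and the helper watch_own_exit() are hypothetical names;
       the call is made through syscall(2).  The system call returns the
       caller's thread ID, which the helper passes back unchanged.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

static int exit_tid_word;   /* hypothetical word that is zeroed on thread exit */

/* Point the calling thread's clear_child_tid at exit_tid_word.
 * Returns the value returned by the system call (the caller's thread ID). */
long watch_own_exit(void)
{
    return syscall(SYS_set_tid_address, &exit_tid_word);
}
```

       A threading library normally manages clear_child_tid itself, so
       overriding it in a thread created by such a library is likely to
       interfere with that library's join logic; treat this purely as a
       demonstration.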
http://man7.org/linux/man-pages/man2/arch_prctl.2.html
10
SYSTEM CALL:
arch_prctl(2) - Linux manual page
FUNCTIONALITY:

       arch_prctl - set architecture-specific thread state
SYNOPSIS:

       #include <asm/prctl.h>
       #include <sys/prctl.h>

       int arch_prctl(int code, unsigned long addr);
       int arch_prctl(int code, unsigned long *addr);
DESCRIPTION

       The arch_prctl() function sets architecture-specific process or
       thread state.  code selects a subfunction and passes argument addr to
       it; addr is interpreted as either an unsigned long for the "set"
       operations, or as an unsigned long *, for the "get" operations.

       Subfunctions for x86-64 are:

       ARCH_SET_FS
              Set the 64-bit base for the FS register to addr.

       ARCH_GET_FS
              Return the 64-bit base value for the FS register of the
              current thread in the unsigned long pointed to by addr.

       ARCH_SET_GS
              Set the 64-bit base for the GS register to addr.

       ARCH_GET_GS
              Return the 64-bit base value for the GS register of the
              current thread in the unsigned long pointed to by addr.
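
       Reading the FS base with ARCH_GET_FS might be sketched as follows.
       The helper name get_fs_base() is hypothetical; the sketch calls
       through syscall(2) and, as an assumption, reports ENOSYS when it is
       not compiled for x86-64.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>
#if defined(__x86_64__)
#include <asm/prctl.h>      /* ARCH_GET_FS */
#endif

/* Store the FS base of the calling thread in *base.
 * Returns 0 on success, or -1 with errno set. */
int get_fs_base(unsigned long *base)
{
#if defined(__x86_64__) && defined(SYS_arch_prctl)
    return (int) syscall(SYS_arch_prctl, ARCH_GET_FS, base);
#else
    (void) base;
    errno = ENOSYS;         /* the subfunctions above are x86-64 specific */
    return -1;
#endif
}
```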
http://man7.org/linux/man-pages/man2/uselib.2.html
10
SYSTEM CALL:
uselib(2) - Linux manual page
FUNCTIONALITY:

       uselib - load shared library
SYNOPSIS:

       #include <unistd.h>

       int uselib(const char *library);

       Note: No declaration of this system call is provided in glibc
       headers; see NOTES.
DESCRIPTION

       The system call uselib() serves to load a shared library to be used
       by the calling process.  It is given a pathname.  The address where
       to load is found in the library itself.  The library can have any
       recognized binary format.
http://man7.org/linux/man-pages/man2/prctl.2.html
10
SYSTEM CALL:
prctl(2) - Linux manual page
FUNCTIONALITY:

       prctl - operations on a process
SYNOPSIS:

       #include <sys/prctl.h>

       int prctl(int option, unsigned long arg2, unsigned long arg3,
                 unsigned long arg4, unsigned long arg5);
DESCRIPTION

       prctl() is called with a first argument describing what to do (with
       values defined in <linux/prctl.h>), and further arguments with a
       significance depending on the first one.  The first argument can be:

       PR_CAP_AMBIENT (since Linux 4.3)
              Reads or changes the ambient capability set, according to the
              value of arg2, which must be one of the following:

              PR_CAP_AMBIENT_RAISE
                     The capability specified in arg3 is added to the
                     ambient set.  The specified capability must already be
                     present in both the permitted and the inheritable sets
                     of the process.  This operation is not permitted if the
                     SECBIT_NO_CAP_AMBIENT_RAISE securebit is set.

              PR_CAP_AMBIENT_LOWER
                     The capability specified in arg3 is removed from the
                     ambient set.

              PR_CAP_AMBIENT_IS_SET
                     The prctl(2) call returns 1 if the capability in arg3
                     is in the ambient set and 0 if it is not.

              PR_CAP_AMBIENT_CLEAR_ALL
                     All capabilities will be removed from the ambient set.
                     This operation requires setting arg3 to zero.

              In all of the above operations, arg4 and arg5 must be
              specified as 0.
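
       The ambient-set operations can be wrapped as in the following
       sketch.  The helper names are hypothetical, and the code assumes
       kernel headers recent enough to define PR_CAP_AMBIENT.  Arguments are
       passed as unsigned long because prctl() takes them through a variable
       argument list.

```c
#include <sys/prctl.h>

/* Returns 1 if cap is in the ambient set, 0 if it is not, -1 on error. */
int ambient_has_cap(unsigned long cap)
{
    return prctl(PR_CAP_AMBIENT, (unsigned long) PR_CAP_AMBIENT_IS_SET,
                 cap, 0UL, 0UL);
}

/* Try to add cap to the ambient set.  Returns 0 on success, or -1 on
 * error (for example, if cap is not both permitted and inheritable). */
int ambient_raise_cap(unsigned long cap)
{
    return prctl(PR_CAP_AMBIENT, (unsigned long) PR_CAP_AMBIENT_RAISE,
                 cap, 0UL, 0UL);
}
```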

       PR_CAPBSET_READ (since Linux 2.6.25)
              Return (as the function result) 1 if the capability specified
              in arg2 is in the calling thread's capability bounding set, or
              0 if it is not.  (The capability constants are defined in
              <linux/capability.h>.)  The capability bounding set dictates
              whether the process can receive the capability through a
              file's permitted capability set on a subsequent call to
              execve(2).

              If the capability specified in arg2 is not valid, then the
              call fails with the error EINVAL.
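
       A minimal sketch of a bounding-set check follows; the helper name
       bounding_set_has() is hypothetical.

```c
#include <sys/prctl.h>

/* Returns 1 if cap is in the calling thread's bounding set, 0 if it is
 * not, and -1 (with errno set to EINVAL) if cap is not a valid capability. */
int bounding_set_has(unsigned long cap)
{
    return prctl(PR_CAPBSET_READ, cap, 0UL, 0UL, 0UL);
}
```

       Calling the helper with increasing values until it returns -1 is one
       way to discover how many capabilities the running kernel knows about.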

       PR_CAPBSET_DROP (since Linux 2.6.25)
              If the calling thread has the CAP_SETPCAP capability, then
              drop the capability specified by arg2 from the calling
              thread's capability bounding set.  Any children of the calling
              thread will inherit the newly reduced bounding set.

              The call fails with the error: EPERM if the calling thread
              does not have the CAP_SETPCAP capability; EINVAL if arg2 does
              not represent a valid capability; or EINVAL if file
              capabilities are not enabled in the kernel, in which case
              bounding sets are not supported.

       PR_SET_CHILD_SUBREAPER (since Linux 3.4)
              If arg2 is nonzero, set the "child subreaper" attribute of the
              calling process; if arg2 is zero, unset the attribute.

              When a process is marked as a child subreaper, all of the
              children that it creates, and their descendants, will be
              marked as having a subreaper.  In effect, a subreaper fulfills
              the role of init(1) for its descendant processes.  Upon
              termination of a process that is orphaned (i.e., its immediate
              parent has already terminated) and marked as having a
              subreaper, the nearest still living ancestor subreaper will
              receive a SIGCHLD signal and will be able to wait(2) on the
              process to discover its termination status.

       PR_GET_CHILD_SUBREAPER (since Linux 3.4)
              Return the "child subreaper" setting of the caller, in the
              location pointed to by (int *) arg2.
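
       Setting and reading back the "child subreaper" attribute might look
       like this sketch; set_subreaper() and get_subreaper() are
       hypothetical helper names.

```c
#include <sys/prctl.h>

/* Turn the "child subreaper" attribute on (nonzero) or off (0).
 * Returns 0 on success, -1 on error. */
int set_subreaper(int on)
{
    return prctl(PR_SET_CHILD_SUBREAPER, on ? 1UL : 0UL, 0UL, 0UL, 0UL);
}

/* Returns the current setting, or -1 on error. */
int get_subreaper(void)
{
    int value = 0;

    /* The setting is stored in the int that arg2 points to. */
    if (prctl(PR_GET_CHILD_SUBREAPER, (unsigned long) &value,
              0UL, 0UL, 0UL) == -1)
        return -1;
    return value;
}
```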

       PR_SET_DUMPABLE (since Linux 2.3.20)
              Set the state of the "dumpable" flag, which determines whether
              core dumps are produced for the calling process upon delivery
              of a signal whose default behavior is to produce a core dump.

              In kernels up to and including 2.6.12, arg2 must be either 0
              (SUID_DUMP_DISABLE, process is not dumpable) or 1
              (SUID_DUMP_USER, process is dumpable).  Between kernels 2.6.13
              and 2.6.17, the value 2 was also permitted, which caused any
              binary which normally would not be dumped to be dumped
              readable by root only; for security reasons, this feature has
              been removed.  (See also the description of /proc/sys/fs/
              suid_dumpable in proc(5).)

              Normally, this flag is set to 1.  However, it is reset to the
              current value contained in the file /proc/sys/fs/suid_dumpable
              (which by default has the value 0), if any of the following
              attributes of the process are changed by the operations listed
              below:

              *  The effective user or group ID is changed.

              *  The filesystem user or group ID is changed (see
                 credentials(7)).

              *  The process's set of permitted capabilities (see
                 capabilities(7)) is changed such that its new set of
                 capabilities is not a subset of its previous set of
                 capabilities.

              The operations that may trigger changes to the dumpable flag
              include:

              *  execution (execve(2)) of a set-user-ID or set-group-ID
                 program, or a program that has capabilities (see
                 capabilities(7));

              *  capset(2); and

              *  system calls that change process credentials (setuid(2),
                 setgid(2), setresuid(2), setresgid(2), setgroups(2), and so
                 on).

              Processes that are not dumpable cannot be attached via
              ptrace(2) PTRACE_ATTACH.

       PR_GET_DUMPABLE (since Linux 2.3.20)
              Return (as the function result) the current state of the
              calling process's dumpable flag.
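
       A process that handles sensitive data might clear its dumpable flag
       as in the sketch below.  The helper name disable_core_dumps() is
       hypothetical; the value 0 corresponds to SUID_DUMP_DISABLE as
       described above.

```c
#include <sys/prctl.h>

/* Clear the "dumpable" flag of the calling process.
 * Returns the previous state of the flag, or -1 on error. */
int disable_core_dumps(void)
{
    int previous = prctl(PR_GET_DUMPABLE, 0UL, 0UL, 0UL, 0UL);

    if (previous == -1)
        return -1;
    if (prctl(PR_SET_DUMPABLE, 0UL, 0UL, 0UL, 0UL) == -1)
        return -1;
    return previous;
}
```

       Because the flag is reset by the credential changes listed above, a
       program that wants to stay non-dumpable may need to repeat the call
       after such a change.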

       PR_SET_ENDIAN (since Linux 2.6.18, PowerPC only)
              Set the endian-ness of the calling process to the value given
              in arg2, which should be one of the following: PR_ENDIAN_BIG,
              PR_ENDIAN_LITTLE, or PR_ENDIAN_PPC_LITTLE (PowerPC pseudo
              little endian).

       PR_GET_ENDIAN (since Linux 2.6.18, PowerPC only)
              Return the endian-ness of the calling process, in the location
              pointed to by (int *) arg2.

       PR_SET_FPEMU (since Linux 2.4.18, 2.5.9, only on ia64)
              Set floating-point emulation control bits to arg2.  Pass
              PR_FPEMU_NOPRINT to silently emulate floating-point operation
              accesses, or PR_FPEMU_SIGFPE to not emulate floating-point
              operations and send SIGFPE instead.

       PR_GET_FPEMU (since Linux 2.4.18, 2.5.9, only on ia64)
              Return floating-point emulation control bits, in the location
              pointed to by (int *) arg2.

       PR_SET_FPEXC (since Linux 2.4.21, 2.5.32, only on PowerPC)
              Set floating-point exception mode to arg2.  Pass
              PR_FP_EXC_SW_ENABLE to use FPEXC for FP exception enables,
              PR_FP_EXC_DIV for floating-point divide by zero, PR_FP_EXC_OVF
              for floating-point overflow, PR_FP_EXC_UND for floating-point
              underflow, PR_FP_EXC_RES for floating-point inexact result,
              PR_FP_EXC_INV for floating-point invalid operation,
              PR_FP_EXC_DISABLED for FP exceptions disabled,
              PR_FP_EXC_NONRECOV for async nonrecoverable exception mode,
              PR_FP_EXC_ASYNC for async recoverable exception mode,
              PR_FP_EXC_PRECISE for precise exception mode.

       PR_GET_FPEXC (since Linux 2.4.21, 2.5.32, only on PowerPC)
              Return floating-point exception mode, in the location pointed
              to by (int *) arg2.

       PR_SET_KEEPCAPS (since Linux 2.2.18)
              Set the state of the thread's "keep capabilities" flag, which
              determines whether the thread's permitted capability set is
              cleared when a change is made to the thread's user IDs such
              that the thread's real UID, effective UID, and saved set-user-
              ID all become nonzero when at least one of them previously had
              the value 0.  By default, the permitted capability set is
              cleared when such a change is made; setting the "keep
              capabilities" flag prevents it from being cleared.  arg2 must
              be either 0 (permitted capabilities are cleared) or 1
              (permitted capabilities are kept).  (A thread's effective
              capability set is always cleared when such a credential change
              is made, regardless of the setting of the "keep capabilities"
              flag.)  The "keep capabilities" value will be reset to 0 on
              subsequent calls to execve(2).

       PR_GET_KEEPCAPS (since Linux 2.2.18)
              Return (as the function result) the current state of the
              calling thread's "keep capabilities" flag.
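
              The set/get pair above can be sketched as follows.  This is a
              minimal illustration, not part of the original page; the
              helper name set_keepcaps() is hypothetical, and the unused
              prctl() arguments are passed as zero as a precaution.

```c
#include <sys/prctl.h>

/* Hypothetical helper: set the "keep capabilities" flag (0 or 1) and
   return the value the kernel reports afterwards, or -1 on error. */
int set_keepcaps(int keep)
{
    if (prctl(PR_SET_KEEPCAPS, (unsigned long) keep, 0UL, 0UL, 0UL) == -1)
        return -1;
    /* PR_GET_KEEPCAPS returns the flag as the function result. */
    return prctl(PR_GET_KEEPCAPS, 0UL, 0UL, 0UL, 0UL);
}
```

              A caller that is about to drop from UID 0 would typically
              call set_keepcaps(1) first, and then restore the effective
              set itself, since the effective set is cleared regardless.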

       PR_MCE_KILL (since Linux 2.6.32)
              Set the machine check memory corruption kill policy for the
              current thread.  If arg2 is PR_MCE_KILL_CLEAR, clear the
              thread memory corruption kill policy and use the system-wide
              default.  (The system-wide default is defined by
              /proc/sys/vm/memory_failure_early_kill; see proc(5).)  If arg2
              is PR_MCE_KILL_SET, use a thread-specific memory corruption
              kill policy.  In this case, arg3 defines whether the policy is
              early kill (PR_MCE_KILL_EARLY), late kill (PR_MCE_KILL_LATE),
              or the system-wide default (PR_MCE_KILL_DEFAULT).  Early kill
              means that the thread receives a SIGBUS signal as soon as
              hardware memory corruption is detected inside its address
              space.  In late kill mode, the process is killed only when it
              accesses a corrupted page.  See sigaction(2) for more
              information on the SIGBUS signal.  The policy is inherited by
              children.  The remaining unused prctl() arguments must be zero
              for future compatibility.

       PR_MCE_KILL_GET (since Linux 2.6.32)
              Return the current per-process machine check kill policy.  All
              unused prctl() arguments must be zero.
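
              As a rough sketch of how PR_MCE_KILL and PR_MCE_KILL_GET fit
              together (the helper name request_early_kill() is
              hypothetical, and the call may fail on kernels or sandboxes
              that do not support these operations):

```c
#include <sys/prctl.h>

/* Hypothetical helper: request the early kill policy for the caller and
   return the policy reported by PR_MCE_KILL_GET, or -1 on error. */
int request_early_kill(void)
{
    /* arg2 selects "set", arg3 selects the policy; arg4 and arg5 are
       unused and must be zero. */
    if (prctl(PR_MCE_KILL, (unsigned long) PR_MCE_KILL_SET,
              (unsigned long) PR_MCE_KILL_EARLY, 0UL, 0UL) == -1)
        return -1;
    /* All arguments of PR_MCE_KILL_GET are unused and must be zero. */
    return prctl(PR_MCE_KILL_GET, 0UL, 0UL, 0UL, 0UL);
}
```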

       PR_SET_MM (since Linux 3.3)
              Modify certain kernel memory map descriptor fields of the
              calling process.  Usually these fields are set by the kernel
              and dynamic loader (see ld.so(8) for more information) and a
              regular application should not use this feature.  However,
              there are cases, such as self-modifying programs, where a
              program might find it useful to change its own memory map.
              This feature is available only if the kernel is built with the
              CONFIG_CHECKPOINT_RESTORE option enabled.  The calling process
              must have the CAP_SYS_RESOURCE capability.  The value in arg2
              is one of the options below, while arg3 provides a new value
              for the option.

              PR_SET_MM_START_CODE
                     Set the address above which the program text can run.
                     The corresponding memory area must be readable and
                     executable, but not writable or sharable (see
                     mprotect(2) and mmap(2) for more information).

              PR_SET_MM_END_CODE
                     Set the address below which the program text can run.
                     The corresponding memory area must be readable and
                     executable, but not writable or sharable.

              PR_SET_MM_START_DATA
                     Set the address above which initialized and
                     uninitialized (bss) data are placed.  The corresponding
                     memory area must be readable and writable, but not
                     executable or sharable.

              PR_SET_MM_END_DATA
                     Set the address below which initialized and
                     uninitialized (bss) data are placed.  The corresponding
                     memory area must be readable and writable, but not
                     executable or sharable.

              PR_SET_MM_START_STACK
                     Set the start address of the stack.  The corresponding
                     memory area must be readable and writable.

              PR_SET_MM_START_BRK
                     Set the address above which the program heap can be
                     expanded with a brk(2) call.  The address must be greater
                     than the ending address of the current program data
                     segment.  In addition, the combined size of the
                     resulting heap and the size of the data segment can't
                     exceed the RLIMIT_DATA resource limit (see
                     setrlimit(2)).

              PR_SET_MM_BRK
                     Set the current brk(2) value.  The requirements for the
                     address are the same as for the PR_SET_MM_START_BRK
                     option.

              The following options are available since Linux 3.5.

              PR_SET_MM_ARG_START
                     Set the address above which the program command line is
                     placed.

              PR_SET_MM_ARG_END
                     Set the address below which the program command line is
                     placed.

              PR_SET_MM_ENV_START
                     Set the address above which the program environment is
                     placed.

              PR_SET_MM_ENV_END
                     Set the address below which the program environment is
                     placed.

                     The address passed with PR_SET_MM_ARG_START,
                     PR_SET_MM_ARG_END, PR_SET_MM_ENV_START, and
                     PR_SET_MM_ENV_END should belong to a process stack
                     area.  Thus, the corresponding memory area must be
                     readable, writable, and (depending on the kernel
                     configuration) have the MAP_GROWSDOWN attribute set
                     (see mmap(2)).

              PR_SET_MM_AUXV
                     Set a new auxiliary vector.  The arg3 argument should
                     provide the address of the vector, and the arg4
                     argument its size.

              PR_SET_MM_EXE_FILE
                     Supersede the /proc/pid/exe symbolic link with a new
                     one pointing to a new executable file identified by the
                     file descriptor provided in the arg3 argument.  The file
                     descriptor should be obtained with a regular open(2)
                     call.

                     To change the symbolic link, one needs to unmap all
                     existing executable memory areas, including those
                     created by the kernel itself (for example the kernel
                     usually creates at least one executable memory area for
                     the ELF .text section).

                     The second limitation is that such a transition can be
                     done only once in a process lifetime.  Any further
                     attempts will be rejected.  This should help system
                     administrators monitor unusual symbolic-link
                     transitions over all processes running on a system.

       PR_MPX_ENABLE_MANAGEMENT, PR_MPX_DISABLE_MANAGEMENT (since Linux
       3.19)
              Enable or disable kernel management of Memory Protection
              eXtensions (MPX) bounds tables.  The arg2, arg3, arg4, and
              arg5 arguments must be zero.

              MPX is a hardware-assisted mechanism for performing bounds
              checking on pointers.  It consists of a set of registers
              storing bounds information and a set of special instruction
              prefixes that tell the CPU on which instructions it should do
              bounds enforcement.  There is a limited number of these
              registers and when there are more pointers than registers,
              their contents must be "spilled" into a set of tables.  These
              tables are called "bounds tables" and the MPX prctl()
              operations control whether the kernel manages their allocation
              and freeing.

              When management is enabled, the kernel will take over
              allocation and freeing of the bounds tables.  It does this by
              trapping the #BR exceptions that result at first use of
              missing bounds tables and instead of delivering the exception
              to user space, it allocates the table and populates the bounds
              directory with the location of the new table.  For freeing,
              the kernel checks to see if bounds tables are present for
              memory which is not allocated, and frees them if so.

              Before enabling MPX management using PR_MPX_ENABLE_MANAGEMENT,
              the application must first have allocated a user-space buffer
              for the bounds directory and placed the location of that
              directory in the bndcfgu register.

              These calls will fail if the CPU or kernel does not support
              MPX.  Kernel support for MPX is enabled via the
              CONFIG_X86_INTEL_MPX configuration option.  You can check
              whether the CPU supports MPX by looking for the 'mpx' CPUID
              bit, for example with the following command:

                   grep ' mpx ' /proc/cpuinfo

              A thread may not switch in or out of long (64-bit) mode while
              MPX is enabled.

              All threads in a process are affected by these calls.

              The child of a fork(2) inherits the state of MPX management.
              During execve(2), MPX management is reset to a state as if
              PR_MPX_DISABLE_MANAGEMENT had been called.

              For further information on Intel MPX, see the kernel source
              file Documentation/x86/intel_mpx.txt.

       PR_SET_NAME (since Linux 2.6.9)
              Set the name of the calling thread, using the value in the
              location pointed to by (char *) arg2.  The name can be up to
              16 bytes long, including the terminating null byte.  (If the
              length of the string, including the terminating null byte,
              exceeds 16 bytes, the string is silently truncated.)  This is
              the same attribute that can be set via pthread_setname_np(3)
              and retrieved using pthread_getname_np(3).  The attribute is
              likewise accessible via /proc/self/task/[tid]/comm, where tid
              is the thread ID of the calling thread.

       PR_GET_NAME (since Linux 2.6.11)
              Return the name of the calling thread, in the buffer pointed
              to by (char *) arg2.  The buffer should allow space for up to
              16 bytes; the returned string will be null-terminated.
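
              The 16-byte limit and the silent truncation can be observed
              with a small round trip.  The following is only a sketch; the
              helper name rename_thread() is hypothetical.

```c
#include <string.h>
#include <sys/prctl.h>

/* Hypothetical helper: set the calling thread's name, then read it back
   into out, which must have room for at least 16 bytes.
   Returns 0 on success, -1 on error. */
int rename_thread(const char *name, char out[16])
{
    if (prctl(PR_SET_NAME, (unsigned long) name, 0UL, 0UL, 0UL) == -1)
        return -1;
    memset(out, 0, 16);
    return prctl(PR_GET_NAME, (unsigned long) out, 0UL, 0UL, 0UL);
}
```

              A name longer than 15 characters comes back cut to its first
              15 characters plus the terminating null byte.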

       PR_SET_NO_NEW_PRIVS (since Linux 3.5)
              Set the calling process's no_new_privs bit to the value in
              arg2.  With no_new_privs set to 1, execve(2) promises not to
              grant privileges to do anything that could not have been done
              without the execve(2) call (for example, rendering the set-
              user-ID and set-group-ID mode bits, and file capabilities non-
              functional).  Once set, this bit cannot be unset.  The setting
              of this bit is inherited by children created by fork(2) and
              clone(2), and preserved across execve(2).

              For more information, see the kernel source file
              Documentation/prctl/no_new_privs.txt.

       PR_GET_NO_NEW_PRIVS (since Linux 3.5)
              Return (as the function result) the value of the no_new_privs
              bit for the current process.  A value of 0 indicates the
              regular execve(2) behavior.  A value of 1 indicates execve(2)
              will operate in the privilege-restricting mode described
              above.
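
              A minimal sketch of setting and then querying the bit
              follows.  The helper name lock_privileges() is hypothetical;
              note that the effect on the caller cannot be undone.

```c
#include <sys/prctl.h>

/* Hypothetical helper: set no_new_privs for the caller (irreversible)
   and return the value reported by PR_GET_NO_NEW_PRIVS, or -1 on error. */
int lock_privileges(void)
{
    /* The unused arguments are passed as zero. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1UL, 0UL, 0UL, 0UL) == -1)
        return -1;
    return prctl(PR_GET_NO_NEW_PRIVS, 0UL, 0UL, 0UL, 0UL);
}
```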

       PR_SET_PDEATHSIG (since Linux 2.1.57)
              Set the parent death signal of the calling process to arg2
              (either a signal value in the range 1..maxsig, or 0 to clear).
              This is the signal that the calling process will get when its
              parent dies.  This value is cleared for the child of a fork(2)
              and (since Linux 2.4.36 / 2.6.23) when executing a set-user-ID
              or set-group-ID binary, or a binary that has associated
              capabilities (see capabilities(7)).  This value is preserved
              across execve(2).

              Warning: the "parent" in this case is considered to be the
              thread that created this process.  In other words, the signal
              will be sent when that thread terminates (via, for example,
              pthread_exit(3)), rather than after all of the threads in the
              parent process terminate.

       PR_GET_PDEATHSIG (since Linux 2.3.15)
              Return the current value of the parent process death signal,
              in the location pointed to by (int *) arg2.
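
              The following sketch sets a parent death signal and reads it
              back.  The helper name set_parent_death_signal() is
              hypothetical.  Unlike most "get" operations above,
              PR_GET_PDEATHSIG stores its result through a pointer rather
              than returning it.

```c
#include <signal.h>
#include <sys/prctl.h>

/* Hypothetical helper: request sig on parent death (0 clears the
   setting) and return the value the kernel reports back, or -1 on
   error. */
int set_parent_death_signal(int sig)
{
    int current = -1;

    if (prctl(PR_SET_PDEATHSIG, (unsigned long) sig, 0UL, 0UL, 0UL) == -1)
        return -1;
    /* The current value is written to the int that arg2 points to. */
    if (prctl(PR_GET_PDEATHSIG, (unsigned long) &current,
              0UL, 0UL, 0UL) == -1)
        return -1;
    return current;
}
```

              A child would normally call this right after fork(2).  Since
              the parent may already have exited by then, a common
              precaution (not described on this page) is to check
              getppid(2) afterwards.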

       PR_SET_PTRACER (since Linux 3.4)
              This is meaningful only when the Yama LSM is enabled and in
              mode 1 ("restricted ptrace", visible via
              /proc/sys/kernel/yama/ptrace_scope).  When a "ptracer process
              ID" is passed in arg2, the caller is declaring that the
              ptracer process can ptrace(2) the calling process as if it
              were a direct process ancestor.  Each PR_SET_PTRACER operation
              replaces the previous "ptracer process ID".  Employing
              PR_SET_PTRACER with arg2 set to 0 clears the caller's "ptracer
              process ID".  If arg2 is PR_SET_PTRACER_ANY, the ptrace
              restrictions introduced by Yama are effectively disabled for
              the calling process.

              For further information, see the kernel source file
              Documentation/security/Yama.txt.

       PR_SET_SECCOMP (since Linux 2.6.23)
              Set the secure computing (seccomp) mode for the calling
              thread, to limit the available system calls.  The more recent
              seccomp(2) system call provides a superset of the
              functionality of PR_SET_SECCOMP.

              The seccomp mode is selected via arg2.  (The seccomp constants
              are defined in <linux/seccomp.h>.)

              With arg2 set to SECCOMP_MODE_STRICT, the only system calls
              that the thread is permitted to make are read(2), write(2),
              _exit(2) (but not exit_group(2)), and sigreturn(2).  Other
              system calls result in the delivery of a SIGKILL signal.
              Strict secure computing mode is useful for number-crunching
              applications that may need to execute untrusted byte code,
              perhaps obtained by reading from a pipe or socket.  This
              operation is available only if the kernel is configured with
              CONFIG_SECCOMP enabled.

              With arg2 set to SECCOMP_MODE_FILTER (since Linux 3.5), the
              system calls allowed are defined by a pointer to a Berkeley
              Packet Filter passed in arg3.  This argument is a pointer to
              struct sock_fprog; it can be designed to filter arbitrary
              system calls and system call arguments.  This mode is
              available only if the kernel is configured with
              CONFIG_SECCOMP_FILTER enabled.

              If SECCOMP_MODE_FILTER filters permit fork(2), then the
              seccomp mode is inherited by children created by fork(2); if
              execve(2) is permitted, then the seccomp mode is preserved
              across execve(2).  If the filters permit prctl() calls, then
              additional filters can be added; all attached filters are run
              for each system call, most recently added first (see
              seccomp(2)).

              For further information, see the kernel source file
              Documentation/prctl/seccomp_filter.txt.

       PR_GET_SECCOMP (since Linux 2.6.23)
              Return (as the function result) the secure computing mode of
              the calling thread.  If the caller is not in secure computing
              mode, this operation returns 0; if the caller is in strict
              secure computing mode, then the prctl() call will cause a
              SIGKILL signal to be sent to the process.  If the caller is in
              filter mode, and this system call is allowed by the seccomp
              filters, it returns 2; otherwise, the process is killed with a
              SIGKILL signal.  This operation is available only if the
              kernel is configured with CONFIG_SECCOMP enabled.

              Since Linux 3.8, the Seccomp field of the /proc/[pid]/status
              file provides a method of obtaining the same information,
              without the risk that the process is killed; see proc(5).
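
              Reading the Seccomp field can be sketched as below.  The
              helper name seccomp_mode_of() and its "pid <= 0 means self"
              convention are assumptions made for this example.  A process
              in strict mode cannot run this code on itself, since open(2)
              is not permitted there; a monitoring process would pass the
              target's PID instead.

```c
#include <stdio.h>

/* Hypothetical helper: return the Seccomp field of /proc/[pid]/status
   (0 = disabled, 1 = strict, 2 = filter), or -1 if it cannot be read,
   for example on kernels older than Linux 3.8.  A pid <= 0 means the
   calling process. */
int seccomp_mode_of(long pid)
{
    char path[64];
    char line[256];
    int mode = -1;
    FILE *fp;

    if (pid > 0)
        snprintf(path, sizeof(path), "/proc/%ld/status", pid);
    else
        snprintf(path, sizeof(path), "/proc/self/status");

    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    while (fgets(line, sizeof(line), fp) != NULL) {
        /* Matches only the "Seccomp:" line; mode stays -1 otherwise. */
        if (sscanf(line, "Seccomp: %d", &mode) == 1)
            break;
    }
    fclose(fp);
    return mode;
}
```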

       PR_SET_SECUREBITS (since Linux 2.6.26)
              Set the "securebits" flags of the calling thread to the value
              supplied in arg2.  See capabilities(7).

       PR_GET_SECUREBITS (since Linux 2.6.26)
              Return (as the function result) the "securebits" flags of the
              calling thread.  See capabilities(7).

       PR_SET_THP_DISABLE (since Linux 3.15)
              Set the state of the "THP disable" flag for the calling
              thread.  If arg2 has a nonzero value, the flag is set,
              otherwise it is cleared.  Setting this flag provides a method
              for disabling transparent huge pages for jobs where the code
              cannot be modified, and using a malloc hook with madvise(2) is
              not an option (i.e., statically allocated data).  The setting
              of the "THP disable" flag is inherited by a child created via
              fork(2) and is preserved across execve(2).

       PR_TASK_PERF_EVENTS_DISABLE (since Linux 2.6.31)
              Disable all performance counters attached to the calling
              process, regardless of whether the counters were created by
              this process or another process.  Performance counters created
              by the calling process for other processes are unaffected.
              For more information on performance counters, see the Linux
              kernel source file tools/perf/design.txt.

              Originally called PR_TASK_PERF_COUNTERS_DISABLE; renamed (with
              same numerical value) in Linux 2.6.32.

       PR_TASK_PERF_EVENTS_ENABLE (since Linux 2.6.31)
              The converse of PR_TASK_PERF_EVENTS_DISABLE; enable
              performance counters attached to the calling process.

              Originally called PR_TASK_PERF_COUNTERS_ENABLE; renamed in
              Linux 2.6.32.

       PR_GET_THP_DISABLE (since Linux 3.15)
              Return (via the function result) the current setting of the
              "THP disable" flag for the calling thread: either 1, if the
              flag is set, or 0, if it is not.

       PR_GET_TID_ADDRESS (since Linux 3.5)
              Retrieve the clear_child_tid address set by set_tid_address(2)
              and the clone(2) CLONE_CHILD_CLEARTID flag, in the location
              pointed to by (int **) arg2.  This feature is available only
              if the kernel is built with the CONFIG_CHECKPOINT_RESTORE
              option enabled.

       PR_SET_TIMERSLACK (since Linux 2.6.28)
              Each thread has two associated timer slack values: a "default"
              value, and a "current" value.  This operation sets the
              "current" timer slack value for the calling thread.  If the
              nanosecond value supplied in arg2 is greater than zero, then
              the "current" value is set to this value.  If arg2 is less
              than or equal to zero, the "current" timer slack is reset to
              the thread's "default" timer slack value.

              The "current" timer slack is used by the kernel to group timer
              expirations for the calling thread that are close to one
              another; as a consequence, timer expirations for the thread
              may be up to the specified number of nanoseconds late (but
              will never expire early).  Grouping timer expirations can help
              reduce system power consumption by minimizing CPU wake-ups.

              The timer expirations affected by timer slack are those set by
              select(2), pselect(2), poll(2), ppoll(2), epoll_wait(2),
              epoll_pwait(2), clock_nanosleep(2), nanosleep(2), and futex(2)
              (and thus the library functions implemented via futexes,
              including pthread_cond_timedwait(3),
              pthread_mutex_timedlock(3), pthread_rwlock_timedrdlock(3),
              pthread_rwlock_timedwrlock(3), and sem_timedwait(3)).

              Timer slack is not applied to threads that are scheduled under
              a real-time scheduling policy (see sched_setscheduler(2)).

              When a new thread is created, the two timer slack values are
              made the same as the "current" value of the creating thread.
              Thereafter, a thread can adjust its "current" timer slack
              value via PR_SET_TIMERSLACK.  The "default" value can't be
              changed.  The timer slack values of init (PID 1), the ancestor
              of all processes, are 50,000 nanoseconds (50 microseconds).
              The timer slack values are preserved across execve(2).

              Since Linux 4.6, the "current" timer slack value of any
              process can be examined and changed via the file
              /proc/[pid]/timerslack_ns.  See proc(5).

       PR_GET_TIMERSLACK (since Linux 2.6.28)
              Return (as the function result) the "current" timer slack
              value of the calling thread.
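
              A sketch of adjusting and reading back the "current" value
              follows.  The helper name set_timer_slack() is hypothetical,
              and the operations may be unavailable in some sandboxed
              environments.

```c
#include <sys/prctl.h>

/* Hypothetical helper: set the caller's "current" timer slack to ns
   nanoseconds (0 resets it to the "default" value) and return the value
   reported by PR_GET_TIMERSLACK, or -1 on error. */
int set_timer_slack(unsigned long ns)
{
    if (prctl(PR_SET_TIMERSLACK, ns, 0UL, 0UL, 0UL) == -1)
        return -1;
    /* The "current" value is returned as the function result. */
    return prctl(PR_GET_TIMERSLACK, 0UL, 0UL, 0UL, 0UL);
}
```

              For instance, set_timer_slack(100000) asks the kernel to
              allow timer expirations for this thread to be up to 100
              microseconds late.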

       PR_SET_TIMING (since Linux 2.6.0-test4)
              Set whether to use (normal, traditional) statistical process
              timing or accurate timestamp-based process timing, by passing
              PR_TIMING_STATISTICAL or PR_TIMING_TIMESTAMP to arg2.
              PR_TIMING_TIMESTAMP is not currently implemented (attempting
              to set this mode will yield the error EINVAL).

       PR_GET_TIMING (since Linux 2.6.0-test4)
              Return (as the function result) which process timing method is
              currently in use.
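
              The EINVAL behavior described for PR_TIMING_TIMESTAMP can be
              checked with a sketch like the following (the helper name
              select_timing() is hypothetical):

```c
#include <errno.h>
#include <sys/prctl.h>

/* Hypothetical helper: try to select a process timing method.
   Returns 0 on success, or the errno value on failure. */
int select_timing(unsigned long method)
{
    if (prctl(PR_SET_TIMING, method, 0UL, 0UL, 0UL) == -1)
        return errno;
    return 0;
}
```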

       PR_SET_TSC (since Linux 2.6.26, x86 only)
              Set the state of the flag determining whether the timestamp
              counter can be read by the process.  Pass PR_TSC_ENABLE to
              arg2 to allow it to be read, or PR_TSC_SIGSEGV to generate a
              SIGSEGV when the process tries to read the timestamp counter.

       PR_GET_TSC (since Linux 2.6.26, x86 only)
              Return the state of the flag determining whether the timestamp
              counter can be read, in the location pointed to by (int *)
              arg2.

       PR_SET_UNALIGN
              (Only on: ia64, since Linux 2.3.48; parisc, since Linux
              2.6.15; PowerPC, since Linux 2.6.18; Alpha, since Linux
              2.6.22) Set unaligned access control bits to arg2.  Pass
              PR_UNALIGN_NOPRINT to silently fix up unaligned user accesses,
              or PR_UNALIGN_SIGBUS to generate SIGBUS on unaligned user
              access.

       PR_GET_UNALIGN
              (see PR_SET_UNALIGN for information on versions and
              architectures) Return unaligned access control bits, in the
              location pointed to by (int *) arg2.
http://man7.org/linux/man-pages/man2/seccomp.2.html
12
SYSTEM CALL:
seccomp(2) - Linux manual page
FUNCTIONALITY:

       seccomp - operate on Secure Computing state of the process
SYNOPSIS:

       #include <linux/seccomp.h>
       #include <linux/filter.h>
       #include <linux/audit.h>
       #include <linux/signal.h>
       #include <sys/ptrace.h>

       int seccomp(unsigned int operation, unsigned int flags, void *args);
DESCRIPTION

       The seccomp() system call operates on the Secure Computing (seccomp)
       state of the calling process.

       Currently, Linux supports the following operation values:

       SECCOMP_SET_MODE_STRICT
              The only system calls that the calling thread is permitted to
              make are read(2), write(2), _exit(2) (but not exit_group(2)),
              and sigreturn(2).  Other system calls result in the delivery
              of a SIGKILL signal.  Strict secure computing mode is useful
              for number-crunching applications that may need to execute
              untrusted byte code, perhaps obtained by reading from a pipe
              or socket.

              Note that although the calling thread can no longer call
              sigprocmask(2), it can use sigreturn(2) to block all signals
              apart from SIGKILL and SIGSTOP.  This means that alarm(2) (for
              example) is not sufficient for restricting the process's
              execution time.  Instead, to reliably terminate the process,
              SIGKILL must be used.  This can be done by using
              timer_create(2) with SIGEV_SIGNAL and sigev_signo set to
              SIGKILL, or by using setrlimit(2) to set the hard limit for
              RLIMIT_CPU.

              This operation is available only if the kernel is configured
              with CONFIG_SECCOMP enabled.

              The value of flags must be 0, and args must be NULL.

              This operation is functionally identical to the call:

                  prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);

       SECCOMP_SET_MODE_FILTER
              The system calls allowed are defined by a pointer to a
              Berkeley Packet Filter (BPF) passed via args.  This argument
              is a pointer to a struct sock_fprog; it can be designed to
              filter arbitrary system calls and system call arguments.  If
              the filter is invalid, seccomp() fails, returning EINVAL in
              errno.

              If fork(2) or clone(2) is allowed by the filter, any child
              processes will be constrained to the same system call filters
              as the parent.  If execve(2) is allowed, the existing filters
              will be preserved across a call to execve(2).

              In order to use the SECCOMP_SET_MODE_FILTER operation, either
              the caller must have the CAP_SYS_ADMIN capability, or the
              thread must already have the no_new_privs bit set.  If that
              bit was not already set by an ancestor of this thread, the
              thread must make the following call:

                  prctl(PR_SET_NO_NEW_PRIVS, 1);

              Otherwise, the SECCOMP_SET_MODE_FILTER operation will fail and
              return EACCES in errno.  This requirement ensures that an
              unprivileged process cannot apply a malicious filter and then
              invoke a set-user-ID or other privileged program using
              execve(2), thus potentially compromising that program.  (Such
              a malicious filter might, for example, cause an attempt to use
              setuid(2) to set the caller's user IDs to non-zero values to
              instead return 0 without actually making the system call.
              Thus, the program might be tricked into retaining superuser
              privileges in circumstances where it is possible to influence
              it to do dangerous things because it did not actually drop
              privileges.)

              If prctl(2) or seccomp(2) is allowed by the attached filter,
              further filters may be added.  This will increase evaluation
              time, but allows for further reduction of the attack surface
              during execution of a thread.

              The SECCOMP_SET_MODE_FILTER operation is available only if the
              kernel is configured with CONFIG_SECCOMP_FILTER enabled.

              When flags is 0, this operation is functionally identical to
              the call:

                  prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, args);

              The recognized flags are:

              SECCOMP_FILTER_FLAG_TSYNC
                     When adding a new filter, synchronize all other threads
                     of the calling process to the same seccomp filter tree.
                     A "filter tree" is the ordered list of filters attached
                     to a thread.  (Attaching identical filters in separate
                     seccomp() calls results in different filters from this
                     perspective.)

                     If any thread cannot synchronize to the same filter
                     tree, the call will not attach the new seccomp filter,
                     and will fail, returning the first thread ID found that
                     cannot synchronize.  Synchronization will fail if
                     another thread in the same process is in
                     SECCOMP_MODE_STRICT or if it has attached new seccomp
                     filters to itself, diverging from the calling thread's
                     filter tree.

   Filters
       When adding filters via SECCOMP_SET_MODE_FILTER, args points to a
       filter program:

           struct sock_fprog {
               unsigned short      len;    /* Number of BPF instructions */
               struct sock_filter *filter; /* Pointer to array of
                                              BPF instructions */
           };

       Each program must contain one or more BPF instructions:

           struct sock_filter {            /* Filter block */
               __u16 code;                 /* Actual filter code */
               __u8  jt;                   /* Jump true */
               __u8  jf;                   /* Jump false */
               __u32 k;                    /* Generic multiuse field */
           };

       When executing the instructions, the BPF program operates on the
       system call information, which is made available (via the BPF_ABS
       addressing mode) as a read-only buffer of the following form:

           struct seccomp_data {
               int   nr;                   /* System call number */
               __u32 arch;                 /* AUDIT_ARCH_* value
                                              (see <linux/audit.h>) */
               __u64 instruction_pointer;  /* CPU instruction pointer */
               __u64 args[6];              /* Up to 6 system call arguments */
           };

       Because numbering of system calls varies between architectures and
       some architectures (e.g., x86-64) allow user-space code to use the
       calling conventions of multiple architectures, it is usually
       necessary to verify the value of the arch field.

       It is strongly recommended to use a whitelisting approach whenever
       possible because such an approach is more robust and simple.  A
       blacklist will have to be updated whenever a potentially dangerous
       system call is added (or a dangerous flag or option if those are
       blacklisted), and it is often possible to alter the representation of
       a value without altering its meaning, leading to a blacklist bypass.

       The arch field is not unique for all calling conventions.  The x86-64
       ABI and the x32 ABI both use AUDIT_ARCH_X86_64 as arch, and they run
       on the same processors.  Instead, the mask __X32_SYSCALL_BIT is used
       on the system call number to tell the two ABIs apart.

       This means that in order to create a seccomp-based blacklist for
       system calls performed through the x86-64 ABI, it is necessary to not
       only check that arch equals AUDIT_ARCH_X86_64, but also to explicitly
       reject all system calls that contain __X32_SYSCALL_BIT in nr.
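
       The two checks described above (the arch field, then
       __X32_SYSCALL_BIT in nr) might be written as the opening
       instructions of a filter, roughly as follows.  This is a sketch
       under assumptions: the names x86_64_prologue, x86_64_prog, and
       install_prologue() are hypothetical, the fallback definition of
       __X32_SYSCALL_BIT is only there so that the fragment also compiles
       on other architectures, and a real filter would replace the final
       "allow" with per-call checks.

```c
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>

#ifndef __X32_SYSCALL_BIT
#define __X32_SYSCALL_BIT 0x40000000    /* value used on x86 */
#endif

struct sock_filter x86_64_prologue[] = {
    /* A = seccomp_data.arch */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    /* arch == AUDIT_ARCH_X86_64: skip the kill; otherwise fall into it */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    /* A = seccomp_data.nr */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* nr >= __X32_SYSCALL_BIT: fall into the kill; otherwise skip it */
    BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, __X32_SYSCALL_BIT, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    /* Placeholder: a real filter continues with its own checks here */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

struct sock_fprog x86_64_prog = {
    .len = sizeof(x86_64_prologue) / sizeof(x86_64_prologue[0]),
    .filter = x86_64_prologue,
};

/* Install the filter for the calling thread; returns 0 or -1.
   Only meaningful on x86-64: elsewhere every system call is killed. */
int install_prologue(void)
{
    if (prctl(PR_SET_NO_NEW_PRIVS, 1UL, 0UL, 0UL, 0UL) == -1)
        return -1;
    return prctl(PR_SET_SECCOMP, (unsigned long) SECCOMP_MODE_FILTER,
                 (unsigned long) &x86_64_prog, 0UL, 0UL);
}
```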

       The instruction_pointer field provides the address of the machine-
       language instruction that performed the system call.  This might be
       useful in conjunction with the use of /proc/[pid]/maps to perform
       checks based on which region (mapping) of the program made the system
       call.  (Probably, it is wise to lock down the mmap(2) and mprotect(2)
       system calls to prevent the program from subverting such checks.)

       When checking values from args against a blacklist, keep in mind that
       arguments are often silently truncated before being processed, but
       after the seccomp check.  For example, this happens if the i386 ABI
       is used on an x86-64 kernel: although the kernel will normally not
       look beyond the 32 lowest bits of the arguments, the values of the
       full 64-bit registers will be present in the seccomp data.  A less
       surprising example is that if the x86-64 ABI is used to perform a
       system call that takes an argument of type int, the more-significant
       half of the argument register is ignored by the system call, but
       visible in the seccomp data.

       A seccomp filter returns a 32-bit value consisting of two parts: the
       most significant 16 bits (corresponding to the mask defined by the
       constant SECCOMP_RET_ACTION) contain one of the "action" values
       listed below; the least significant 16 bits (defined by the constant
       SECCOMP_RET_DATA) are "data" to be associated with this return value.

       If multiple filters exist, they are all executed, in reverse order of
       their addition to the filter tree—that is, the most recently
       installed filter is executed first.  (Note that all filters will be
       called even if one of the earlier filters returns SECCOMP_RET_KILL.
       This is done to simplify the kernel code and to provide a tiny speed-
       up in the execution of sets of filters by avoiding a check for this
       uncommon case.)  The return value for the evaluation of a given
       system call is the first-seen SECCOMP_RET_ACTION value of highest
       precedence (along with its accompanying data) returned by execution
       of all of the filters.

       In decreasing order of precedence, the values that may be returned by
       a seccomp filter are:

       SECCOMP_RET_KILL
              This value results in the process exiting immediately without
              executing the system call.  The process terminates as though
              killed by a SIGSYS signal (not SIGKILL).

       SECCOMP_RET_TRAP
              This value results in the kernel sending a SIGSYS signal to
              the triggering process without executing the system call.
              Various fields will be set in the siginfo_t structure (see
              sigaction(2)) associated with the signal:

              *  si_signo will contain SIGSYS.

              *  si_call_addr will show the address of the system call
                 instruction.

              *  si_syscall and si_arch will indicate which system call was
                 attempted.

              *  si_code will contain SYS_SECCOMP.

              *  si_errno will contain the SECCOMP_RET_DATA portion of the
                 filter return value.

              The program counter will be as though the system call happened
              (i.e., it will not point to the system call instruction).  The
              return value register will contain an architecture-dependent
              value; if resuming execution, set it to something appropriate
              for the system call.  (The architecture dependency is because
              replacing it with ENOSYS could overwrite some useful
              information.)

       SECCOMP_RET_ERRNO
              This value results in the SECCOMP_RET_DATA portion of the
              filter's return value being passed to user space as the errno
              value without executing the system call.

       SECCOMP_RET_TRACE
              When returned, this value will cause the kernel to attempt to
              notify a ptrace(2)-based tracer prior to executing the system
              call.  If there is no tracer present, the system call is not
              executed and returns a failure status with errno set to
              ENOSYS.

              A tracer will be notified if it requests PTRACE_O_TRACESECCOMP
              using ptrace(PTRACE_SETOPTIONS).  The tracer will be notified
              of a PTRACE_EVENT_SECCOMP and the SECCOMP_RET_DATA portion of
              the filter's return value will be available to the tracer via
              PTRACE_GETEVENTMSG.

              The tracer can skip the system call by changing the system
              call number to -1.  Alternatively, the tracer can change the
              system call requested by changing the system call number to
              that of a valid system call.  If the tracer asks to skip the
              system call, then the system call will appear to return the
              value that the tracer puts in the return value register.

              The seccomp check will not be run again after the tracer is
              notified.  (This means that seccomp-based sandboxes must not
              allow use of ptrace(2)—even of other sandboxed processes—
              without extreme care; ptracers can use this mechanism to
              escape from the seccomp sandbox.)

       SECCOMP_RET_ALLOW
              This value results in the system call being executed.
http://man7.org/linux/man-pages/man2/ptrace.2.html
11
SYSTEM CALL:
ptrace(2) - Linux manual page
FUNCTIONALITY:

       ptrace - process trace
SYNOPSIS:

       #include <sys/ptrace.h>

       long ptrace(enum __ptrace_request request, pid_t pid,
                   void *addr, void *data);
DESCRIPTION

       The ptrace() system call provides a means by which one process (the
       "tracer") may observe and control the execution of another process
       (the "tracee"), and examine and change the tracee's memory and
       registers.  It is primarily used to implement breakpoint debugging
       and system call tracing.

       A tracee first needs to be attached to the tracer.  Attachment and
       subsequent commands are per thread: in a multithreaded process, every
       thread can be individually attached to a (potentially different)
       tracer, or left not attached and thus not debugged.  Therefore,
       "tracee" always means "(one) thread", never "a (possibly
       multithreaded) process".  Ptrace commands are always sent to a
       specific tracee using a call of the form

           ptrace(PTRACE_foo, pid, ...)

       where pid is the thread ID of the corresponding Linux thread.

       (Note that in this page, a "multithreaded process" means a thread
       group consisting of threads created using the clone(2) CLONE_THREAD
       flag.)

       A process can initiate a trace by calling fork(2) and having the
       resulting child do a PTRACE_TRACEME, followed (typically) by an
       execve(2).  Alternatively, one process may commence tracing another
       process using PTRACE_ATTACH or PTRACE_SEIZE.

       While being traced, the tracee will stop each time a signal is
       delivered, even if the signal is being ignored.  (An exception is
       SIGKILL, which has its usual effect.)  The tracer will be notified at
       its next call to waitpid(2) (or one of the related "wait" system
       calls); that call will return a status value containing information
       that indicates the cause of the stop in the tracee.  While the tracee
       is stopped, the tracer can use various ptrace requests to inspect and
       modify the tracee.  The tracer then causes the tracee to continue,
       optionally ignoring the delivered signal (or even delivering a
       different signal instead).

       If the PTRACE_O_TRACEEXEC option is not in effect, all successful
       calls to execve(2) by the traced process will cause it to be sent a
       SIGTRAP signal, giving the parent a chance to gain control before the
       new program begins execution.

       When the tracer is finished tracing, it can cause the tracee to
       continue executing in a normal, untraced mode via PTRACE_DETACH.

       The value of request determines the action to be performed:

       PTRACE_TRACEME
              Indicate that this process is to be traced by its parent.  A
              process probably shouldn't make this request if its parent
              isn't expecting to trace it.  (pid, addr, and data are
              ignored.)

              The PTRACE_TRACEME request is used only by the tracee; the
              remaining requests are used only by the tracer.  In the
              following requests, pid specifies the thread ID of the tracee
              to be acted on.  For requests other than PTRACE_ATTACH,
              PTRACE_SEIZE, PTRACE_INTERRUPT, and PTRACE_KILL, the tracee
              must be stopped.

       PTRACE_PEEKTEXT, PTRACE_PEEKDATA
              Read a word at the address addr in the tracee's memory,
              returning the word as the result of the ptrace() call.  Linux
              does not have separate text and data address spaces, so these
              two requests are currently equivalent.  (data is ignored; but
              see NOTES.)

       PTRACE_PEEKUSER
              Read a word at offset addr in the tracee's USER area, which
              holds the registers and other information about the process
              (see <sys/user.h>).  The word is returned as the result of the
              ptrace() call.  Typically, the offset must be word-aligned,
              though this might vary by architecture.  See NOTES.  (data is
              ignored; but see NOTES.)

       PTRACE_POKETEXT, PTRACE_POKEDATA
              Copy the word data to the address addr in the tracee's memory.
              As for PTRACE_PEEKTEXT and PTRACE_PEEKDATA, these two requests
              are currently equivalent.

       PTRACE_POKEUSER
              Copy the word data to offset addr in the tracee's USER area.
              As for PTRACE_PEEKUSER, the offset must typically be word-
              aligned.  In order to maintain the integrity of the kernel,
              some modifications to the USER area are disallowed.

       PTRACE_GETREGS, PTRACE_GETFPREGS
              Copy the tracee's general-purpose or floating-point registers,
              respectively, to the address data in the tracer.  See
              <sys/user.h> for information on the format of this data.
              (addr is ignored.)  Note that SPARC systems have the meaning
              of data and addr reversed; that is, data is ignored and the
              registers are copied to the address addr.  PTRACE_GETREGS and
              PTRACE_GETFPREGS are not present on all architectures.

       PTRACE_GETREGSET (since Linux 2.6.34)
              Read the tracee's registers.  addr specifies, in an
              architecture-dependent way, the type of registers to be read.
              NT_PRSTATUS (with numerical value 1) usually results in
              reading of general-purpose registers.  If the CPU has, for
              example, floating-point and/or vector registers, they can be
              retrieved by setting addr to the corresponding NT_foo
              constant.  data points to a struct iovec, which describes the
              destination buffer's location and length.  On return, the
              kernel modifies the iov_len field of that structure to
              indicate the actual number of bytes returned.

       PTRACE_SETREGS, PTRACE_SETFPREGS
              Modify the tracee's general-purpose or floating-point
              registers, respectively, from the address data in the tracer.
              As for PTRACE_POKEUSER, some general-purpose register
              modifications may be disallowed.  (addr is ignored.)  Note
              that SPARC systems have the meaning of data and addr reversed;
              that is, data is ignored and the registers are copied from the
              address addr.  PTRACE_SETREGS and PTRACE_SETFPREGS are not
              present on all architectures.

       PTRACE_SETREGSET (since Linux 2.6.34)
              Modify the tracee's registers.  The meaning of addr and data
              is analogous to PTRACE_GETREGSET.

       PTRACE_GETSIGINFO (since Linux 2.3.99-pre6)
              Retrieve information about the signal that caused the stop.
              Copy a siginfo_t structure (see sigaction(2)) from the tracee
              to the address data in the tracer.  (addr is ignored.)

       PTRACE_SETSIGINFO (since Linux 2.3.99-pre6)
              Set signal information: copy a siginfo_t structure from the
              address data in the tracer to the tracee.  This will affect
              only signals that would normally be delivered to the tracee
              and were caught by the tracer.  It may be difficult to tell
              these normal signals from synthetic signals generated by
              ptrace() itself.  (addr is ignored.)

       PTRACE_PEEKSIGINFO (since Linux 3.10)
              Retrieve siginfo_t structures without removing signals from a
              queue.  addr points to a ptrace_peeksiginfo_args structure
              that specifies the ordinal position from which copying of
              signals should start, and the number of signals to copy.
              siginfo_t structures are copied into the buffer pointed to by
              data.  The return value contains the number of copied signals
              (zero indicates that there is no signal corresponding to the
              specified ordinal position).  Within the returned siginfo
              structures, the si_code field includes information (__SI_CHLD,
              __SI_FAULT, etc.) that is not otherwise exposed to user
              space.

                 struct ptrace_peeksiginfo_args {
                     u64 off;    /* Ordinal position in queue at which
                                    to start copying signals */
                     u32 flags;  /* PTRACE_PEEKSIGINFO_SHARED or 0 */
                     s32 nr;     /* Number of signals to copy */
                 };

                 Currently, there is only one flag,
                 PTRACE_PEEKSIGINFO_SHARED, for dumping signals from the
                 process-wide signal queue.  If this flag is not set,
                 signals are read from the per-thread queue of the specified
                 thread.

       PTRACE_GETSIGMASK (since Linux 3.11)
              Place a copy of the mask of blocked signals (see
              sigprocmask(2)) in the buffer pointed to by data, which should
              be a pointer to a buffer of type sigset_t.  The addr argument
              contains the size of the buffer pointed to by data (i.e.,
              sizeof(sigset_t)).

       PTRACE_SETSIGMASK (since Linux 3.11)
              Change the mask of blocked signals (see sigprocmask(2)) to the
              value specified in the buffer pointed to by data, which should
              be a pointer to a buffer of type sigset_t.  The addr argument
              contains the size of the buffer pointed to by data (i.e.,
              sizeof(sigset_t)).

       PTRACE_SETOPTIONS (since Linux 2.4.6; see BUGS for caveats)
              Set ptrace options from data.  (addr is ignored.)  data is
              interpreted as a bit mask of options, which are specified by
              the following flags:

              PTRACE_O_EXITKILL (since Linux 3.8)
                     If a tracer sets this flag, a SIGKILL signal will be
                     sent to every tracee if the tracer exits.  This option
                     is useful for ptrace jailers that want to ensure that
                     tracees can never escape the tracer's control.

              PTRACE_O_TRACECLONE (since Linux 2.5.46)
                     Stop the tracee at the next clone(2) and automatically
                     start tracing the newly cloned process, which will
                     start with a SIGSTOP, or PTRACE_EVENT_STOP if
                     PTRACE_SEIZE was used.  A waitpid(2) by the tracer will
                     return a status value such that

                       status>>8 == (SIGTRAP | (PTRACE_EVENT_CLONE<<8))

                     The PID of the new process can be retrieved with
                     PTRACE_GETEVENTMSG.

                     This option may not catch clone(2) calls in all cases.
                     If the tracee calls clone(2) with the CLONE_VFORK flag,
                     PTRACE_EVENT_VFORK will be delivered instead if
                     PTRACE_O_TRACEVFORK is set; otherwise if the tracee
                     calls clone(2) with the exit signal set to SIGCHLD,
                     PTRACE_EVENT_FORK will be delivered if
                     PTRACE_O_TRACEFORK is set.

              PTRACE_O_TRACEEXEC (since Linux 2.5.46)
                     Stop the tracee at the next execve(2).  A waitpid(2) by
                     the tracer will return a status value such that

                       status>>8 == (SIGTRAP | (PTRACE_EVENT_EXEC<<8))

                     If the execing thread is not a thread group leader, the
                     thread ID is reset to the thread group leader's ID before
                     this stop.  Since Linux 3.0, the former thread ID can
                     be retrieved with PTRACE_GETEVENTMSG.

              PTRACE_O_TRACEEXIT (since Linux 2.5.60)
                     Stop the tracee at exit.  A waitpid(2) by the tracer
                     will return a status value such that

                       status>>8 == (SIGTRAP | (PTRACE_EVENT_EXIT<<8))

                     The tracee's exit status can be retrieved with
                     PTRACE_GETEVENTMSG.

                     The tracee is stopped early during process exit, when
                     registers are still available, allowing the tracer to
                     see where the exit occurred, whereas the normal exit
                     notification is done after the process is finished
                     exiting.  Even though context is available, the tracer
                     cannot prevent the exit from happening at this point.

              PTRACE_O_TRACEFORK (since Linux 2.5.46)
                     Stop the tracee at the next fork(2) and automatically
                     start tracing the newly forked process, which will
                     start with a SIGSTOP, or PTRACE_EVENT_STOP if
                     PTRACE_SEIZE was used.  A waitpid(2) by the tracer will
                     return a status value such that

                       status>>8 == (SIGTRAP | (PTRACE_EVENT_FORK<<8))

                     The PID of the new process can be retrieved with
                     PTRACE_GETEVENTMSG.

              PTRACE_O_TRACESYSGOOD (since Linux 2.4.6)
                     When delivering system call traps, set bit 7 in the
                     signal number (i.e., deliver SIGTRAP|0x80).  This makes
                     it easy for the tracer to distinguish normal traps from
                     those caused by a system call.  (PTRACE_O_TRACESYSGOOD
                     may not work on all architectures.)

              PTRACE_O_TRACEVFORK (since Linux 2.5.46)
                     Stop the tracee at the next vfork(2) and automatically
                     start tracing the newly vforked process, which will
                     start with a SIGSTOP, or PTRACE_EVENT_STOP if
                     PTRACE_SEIZE was used.  A waitpid(2) by the tracer will
                     return a status value such that

                       status>>8 == (SIGTRAP | (PTRACE_EVENT_VFORK<<8))

                     The PID of the new process can be retrieved with
                     PTRACE_GETEVENTMSG.

              PTRACE_O_TRACEVFORKDONE (since Linux 2.5.60)
                     Stop the tracee at the completion of the next vfork(2).
                     A waitpid(2) by the tracer will return a status value
                     such that

                       status>>8 == (SIGTRAP | (PTRACE_EVENT_VFORK_DONE<<8))

                     The PID of the new process can (since Linux 2.6.18) be
                     retrieved with PTRACE_GETEVENTMSG.

              PTRACE_O_TRACESECCOMP (since Linux 3.5)
                     Stop the tracee when a seccomp(2) SECCOMP_RET_TRACE
                     rule is triggered.  A waitpid(2) by the tracer will
                     return a status value such that

                       status>>8 == (SIGTRAP | (PTRACE_EVENT_SECCOMP<<8))

                     While this triggers a PTRACE_EVENT stop, it is similar
                     to a syscall-enter-stop, in that the tracee has not yet
                     entered the syscall that seccomp triggered on.  The
                     seccomp event message data (from the SECCOMP_RET_DATA
                     portion of the seccomp filter rule) can be retrieved
                     with PTRACE_GETEVENTMSG.

              PTRACE_O_SUSPEND_SECCOMP (since Linux 4.3)
                     Suspend the tracee's seccomp protections.  This applies
                     regardless of mode, and can be used when the tracee has
                     not yet installed seccomp filters.  That is, a valid
                     use case is to suspend a tracee's seccomp protections
                     before they are installed by the tracee, let the tracee
                     install the filters, and then clear this flag when the
                     filters should be resumed.  Setting this option
                     requires that the tracer have the CAP_SYS_ADMIN
                     capability, not have any seccomp protections installed,
                     and not have PTRACE_O_SUSPEND_SECCOMP set on itself.

       PTRACE_GETEVENTMSG (since Linux 2.5.46)
              Retrieve a message (as an unsigned long) about the ptrace
              event that just happened, placing it at the address data in
              the tracer.  For PTRACE_EVENT_EXIT, this is the tracee's exit
              status.  For PTRACE_EVENT_FORK, PTRACE_EVENT_VFORK,
              PTRACE_EVENT_VFORK_DONE, and PTRACE_EVENT_CLONE, this is the
              PID of the new process.  For PTRACE_EVENT_SECCOMP, this is the
              seccomp(2) filter's SECCOMP_RET_DATA associated with the
              triggered rule.  (addr is ignored.)

       PTRACE_CONT
              Restart the stopped tracee process.  If data is nonzero, it is
              interpreted as the number of a signal to be delivered to the
              tracee; otherwise, no signal is delivered.  Thus, for example,
              the tracer can control whether a signal sent to the tracee is
              delivered or not.  (addr is ignored.)

       PTRACE_SYSCALL, PTRACE_SINGLESTEP
              Restart the stopped tracee as for PTRACE_CONT, but arrange for
              the tracee to be stopped at the next entry to or exit from a
              system call, or after execution of a single instruction,
              respectively.  (The tracee will also, as usual, be stopped
              upon receipt of a signal.)  From the tracer's perspective, the
              tracee will appear to have been stopped by receipt of a
              SIGTRAP.  So, for PTRACE_SYSCALL, for example, the idea is to
              inspect the arguments to the system call at the first stop,
              then do another PTRACE_SYSCALL and inspect the return value of
              the system call at the second stop.  The data argument is
              treated as for PTRACE_CONT.  (addr is ignored.)

       PTRACE_SYSEMU, PTRACE_SYSEMU_SINGLESTEP (since Linux 2.6.14)
              For PTRACE_SYSEMU, continue and stop on entry to the next
              system call, which will not be executed.  For
              PTRACE_SYSEMU_SINGLESTEP, do the same but also singlestep if
              not a system call.  This call is used by programs like User
              Mode Linux that want to emulate all the tracee's system calls.
              The data argument is treated as for PTRACE_CONT.  The addr
              argument is ignored.  These requests are currently supported
              only on x86.

       PTRACE_LISTEN (since Linux 3.4)
              Restart the stopped tracee, but prevent it from executing.
              The resulting state of the tracee is similar to a process
              which has been stopped by a SIGSTOP (or other stopping
              signal).  See the "group-stop" subsection for additional
              information.  PTRACE_LISTEN works only on tracees attached by
              PTRACE_SEIZE.

       PTRACE_KILL
              Send the tracee a SIGKILL to terminate it.  (addr and data are
              ignored.)

              This operation is deprecated; do not use it!  Instead, send a
              SIGKILL directly using kill(2) or tgkill(2).  The problem with
              PTRACE_KILL is that it requires the tracee to be in signal-
              delivery-stop, otherwise it may not work (i.e., may complete
              successfully but won't kill the tracee).  By contrast, sending
              a SIGKILL directly has no such limitation.

       PTRACE_INTERRUPT (since Linux 3.4)
              Stop a tracee.  If the tracee is running or sleeping in kernel
              space and PTRACE_SYSCALL is in effect, the system call is
              interrupted and syscall-exit-stop is reported.  (The
              interrupted system call is restarted when the tracee is
              restarted.)  If the tracee was already stopped by a signal and
              PTRACE_LISTEN was sent to it, the tracee stops with
              PTRACE_EVENT_STOP and WSTOPSIG(status) returns the stop
              signal.  If any other ptrace-stop is generated at the same
              time (for example, if a signal is sent to the tracee), this
              ptrace-stop happens.  If none of the above applies (for
              example, if the tracee is running in user space), it stops
              with PTRACE_EVENT_STOP with WSTOPSIG(status) == SIGTRAP.
              PTRACE_INTERRUPT only works on tracees attached by
              PTRACE_SEIZE.

       PTRACE_ATTACH
              Attach to the process specified in pid, making it a tracee of
              the calling process.  The tracee is sent a SIGSTOP, but will
              not necessarily have stopped by the completion of this call;
              use waitpid(2) to wait for the tracee to stop.  See the
              "Attaching and detaching" subsection for additional
              information.  (addr and data are ignored.)

              Permission to perform a PTRACE_ATTACH is governed by a ptrace
              access mode PTRACE_MODE_ATTACH_REALCREDS check; see below.

       PTRACE_SEIZE (since Linux 3.4)
              Attach to the process specified in pid, making it a tracee of
              the calling process.  Unlike PTRACE_ATTACH, PTRACE_SEIZE does
              not stop the process.  Group-stops are reported as
              PTRACE_EVENT_STOP and WSTOPSIG(status) returns the stop
              signal.  Automatically attached children stop with
              PTRACE_EVENT_STOP and WSTOPSIG(status) returns SIGTRAP instead
              of having a SIGSTOP signal delivered to them.  execve(2) does
              not deliver an extra SIGTRAP.  Only a PTRACE_SEIZEd process
              can accept PTRACE_INTERRUPT and PTRACE_LISTEN commands.  The
              "seized" behavior just described is inherited by children that
              are automatically attached using PTRACE_O_TRACEFORK,
              PTRACE_O_TRACEVFORK, and PTRACE_O_TRACECLONE.  addr must be
              zero.  data contains a bit mask of ptrace options to activate
              immediately.

              Permission to perform a PTRACE_SEIZE is governed by a ptrace
              access mode PTRACE_MODE_ATTACH_REALCREDS check; see below.

       PTRACE_DETACH
              Restart the stopped tracee as for PTRACE_CONT, but first
              detach from it.  Under Linux, a tracee can be detached in this
              way regardless of which method was used to initiate tracing.
              (addr is ignored.)

   Death under ptrace
       When a (possibly multithreaded) process receives a killing signal
       (one whose disposition is set to SIG_DFL and whose default action is
       to kill the process), all threads exit.  Tracees report their death
       to their tracer(s).  Notification of this event is delivered via
       waitpid(2).

       Note that the killing signal will first cause signal-delivery-stop
       (on one tracee only), and only after it is injected by the tracer (or
       after it was dispatched to a thread which isn't traced), will death
       from the signal happen on all tracees within a multithreaded process.
       (The term "signal-delivery-stop" is explained below.)

       SIGKILL does not generate signal-delivery-stop and therefore the
       tracer can't suppress it.  SIGKILL kills even within system calls
       (syscall-exit-stop is not generated prior to death by SIGKILL).  The
       net effect is that SIGKILL always kills the process (all its
       threads), even if some threads of the process are ptraced.

       When the tracee calls _exit(2), it reports its death to its tracer.
       Other threads are not affected.

       When any thread executes exit_group(2), every tracee in its thread
       group reports its death to its tracer.

       If the PTRACE_O_TRACEEXIT option is on, PTRACE_EVENT_EXIT will happen
       before actual death.  This applies to exits via exit(2),
       exit_group(2), and signal deaths (except SIGKILL, depending on the
       kernel version; see BUGS below), and when threads are torn down on
       execve(2) in a multithreaded process.

       The tracer cannot assume that the ptrace-stopped tracee exists.
       There are many scenarios when the tracee may die while stopped (such
       as SIGKILL).  Therefore, the tracer must be prepared to handle an
       ESRCH error on any ptrace operation.  Unfortunately, the same error
       is returned if the tracee exists but is not ptrace-stopped (for
       commands which require a stopped tracee), or if it is not traced by
       the process which issued the ptrace call.  The tracer needs to keep
       track of the stopped/running state of the tracee, and interpret ESRCH
       as "tracee died unexpectedly" only if it knows that the tracee has
       been observed to enter ptrace-stop.  Note that there is no guarantee
       that waitpid(WNOHANG) will reliably report the tracee's death status
       if a ptrace operation returned ESRCH.  waitpid(WNOHANG) may return 0
       instead.  In other words, the tracee may be "not yet fully dead", but
       already refusing ptrace requests.

       The tracer can't assume that the tracee always ends its life by
       reporting WIFEXITED(status) or WIFSIGNALED(status); there are cases
       where this does not occur.  For example, if a thread other than the
       thread group leader does an execve(2), it disappears; its PID will
       never be seen again, and any subsequent ptrace stops will be reported
       under the thread group leader's PID.

   Stopped states
       A tracee can be in two states: running or stopped.  For the purposes
       of ptrace, a tracee which is blocked in a system call (such as
       read(2), pause(2), etc.)  is nevertheless considered to be running,
       even if the tracee is blocked for a long time.  The state of the
       tracee after PTRACE_LISTEN is somewhat of a gray area: it is not in
       any ptrace-stop (ptrace commands won't work on it, and it will
       deliver waitpid(2) notifications), but it also may be considered
       "stopped" because it is not executing instructions (is not
       scheduled), and if it was in group-stop before PTRACE_LISTEN, it will
       not respond to signals until SIGCONT is received.

       There are many kinds of states when the tracee is stopped, and in
       ptrace discussions they are often conflated.  Therefore, it is
       important to use precise terms.

       In this manual page, any stopped state in which the tracee is ready
       to accept ptrace commands from the tracer is called ptrace-stop.
       Ptrace-stops can be further subdivided into signal-delivery-stop,
       group-stop, syscall-stop, and so on.  These stopped states are
       described in detail below.

       When the running tracee enters ptrace-stop, it notifies its tracer
       using waitpid(2) (or one of the other "wait" system calls).  Most of
       this manual page assumes that the tracer waits with:

           pid = waitpid(pid_or_minus_1, &status, __WALL);

       Ptrace-stopped tracees are reported as returns with pid greater than
       0 and WIFSTOPPED(status) true.

       The __WALL flag does not include the WSTOPPED and WEXITED flags, but
       implies their functionality.

       Setting the WCONTINUED flag when calling waitpid(2) is not
       recommended: the "continued" state is per-process and consuming it
       can confuse the real parent of the tracee.

       Use of the WNOHANG flag may cause waitpid(2) to return 0 ("no wait
       results available yet") even if the tracer knows there should be a
       notification.  Example:

           errno = 0;
           ptrace(PTRACE_CONT, pid, 0L, 0L);
           if (errno == ESRCH) {
               /* tracee is dead */
               r = waitpid(pid, &status, __WALL | WNOHANG);
               /* r can still be 0 here! */
           }

       The following kinds of ptrace-stops exist: signal-delivery-stops,
       group-stops, PTRACE_EVENT stops, and syscall-stops.  They are all
       reported by waitpid(2) with WIFSTOPPED(status) true.  They may be
       differentiated by examining the value status>>8, and if there is
       ambiguity in that value, by querying PTRACE_GETSIGINFO.  (Note: the
       WSTOPSIG(status) macro can't be used to perform this examination,
       because it returns the value (status>>8) & 0xff.)
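
       A first-pass classification based on status>>8 might look like the
       sketch below.  The names enum stop_kind and classify_stop() are
       invented for this illustration.  It assumes that the tracer has set
       PTRACE_O_TRACESYSGOOD; without that option a syscall-stop is reported
       with plain SIGTRAP and ends up in the last category, where
       PTRACE_GETSIGINFO is needed to resolve the ambiguity.

```c
#include <signal.h>
#include <sys/wait.h>

enum stop_kind {
    STOP_NONE,             /* status does not describe a stop at all */
    STOP_SYSCALL,          /* syscall-stop (relies on PTRACE_O_TRACESYSGOOD) */
    STOP_EVENT,            /* some PTRACE_EVENT stop */
    STOP_SIGNAL_OR_GROUP   /* signal-delivery-stop or group-stop */
};

enum stop_kind classify_stop(int status)
{
    if (!WIFSTOPPED(status))
        return STOP_NONE;

    /* Use status>>8 rather than WSTOPSIG() so the event bits are kept. */
    int hi = status >> 8;

    if (hi == (SIGTRAP | 0x80))
        return STOP_SYSCALL;
    if ((hi >> 8) != 0)
        return STOP_EVENT;
    return STOP_SIGNAL_OR_GROUP;
}
```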

   Signal-delivery-stop
       When a (possibly multithreaded) process receives any signal except
       SIGKILL, the kernel selects an arbitrary thread which handles the
       signal.  (If the signal is generated with tgkill(2), the target
       thread can be explicitly selected by the caller.)  If the selected
       thread is traced, it enters signal-delivery-stop.  At this point, the
       signal is not yet delivered to the process, and can be suppressed by
       the tracer.  If the tracer doesn't suppress the signal, it passes the
       signal to the tracee in the next ptrace restart request.  This second
       step of signal delivery is called signal injection in this manual
       page.  Note that if the signal is blocked, signal-delivery-stop
       doesn't happen until the signal is unblocked, with the usual
       exception that SIGSTOP can't be blocked.

       Signal-delivery-stop is observed by the tracer as waitpid(2)
       returning with WIFSTOPPED(status) true, with the signal returned by
       WSTOPSIG(status).  If the signal is SIGTRAP, this may be a different
       kind of ptrace-stop; see the "Syscall-stops" and "execve" sections
       below for details.  If WSTOPSIG(status) returns a stopping signal,
       this may be a group-stop; see below.

   Signal injection and suppression
       After signal-delivery-stop is observed by the tracer, the tracer
       should restart the tracee with the call

           ptrace(PTRACE_restart, pid, 0, sig)

       where PTRACE_restart is one of the restarting ptrace requests.  If
       sig is 0, then a signal is not delivered.  Otherwise, the signal sig
       is delivered.  This operation is called signal injection in this
       manual page, to distinguish it from signal-delivery-stop.

       The sig value may be different from the WSTOPSIG(status) value: the
       tracer can cause a different signal to be injected.

       Note that a suppressed signal still causes system calls to return
       prematurely.  In this case, system calls will be restarted: the
       tracer will observe the tracee to reexecute the interrupted system
       call (or restart_syscall(2) system call for a few system calls which
       use a different mechanism for restarting) if the tracer uses
       PTRACE_SYSCALL.  Even system calls (such as poll(2)) which are not
       restartable after a signal are restarted after the signal is
       suppressed; however, kernel bugs exist which cause some system calls
       to fail with EINTR even though no observable signal is injected to
       the tracee.

       Restarting ptrace commands issued in ptrace-stops other than signal-
       delivery-stop are not guaranteed to inject a signal, even if sig is
       nonzero.  No error is reported; a nonzero sig may simply be ignored.
       Ptrace users should not try to "create a new signal" this way: use
       tgkill(2) instead.
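
       One way to follow this rule is to compute the sig argument in a
       single place.  The helper below is a sketch; restart_sig() and its
       parameters are hypothetical and not part of the ptrace API.

```c
#include <stdbool.h>

/*
 * Returns the value to pass as the fourth argument ("sig") of a
 * restarting ptrace request.
 *
 *   in_signal_delivery_stop  tracer has determined the current stop kind
 *   stopsig                  WSTOPSIG(status) of the current stop
 *   suppress                 tracer wants to hide the signal from the tracee
 */
int restart_sig(bool in_signal_delivery_stop, int stopsig, bool suppress)
{
    if (!in_signal_delivery_stop)
        return 0;   /* a nonzero sig may be silently ignored here */
    return suppress ? 0 : stopsig;
}
```

       The result would then be used as in ptrace(PTRACE_CONT, pid, 0, sig).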

       The fact that signal injection requests may be ignored when
       restarting the tracee after ptrace stops that are not signal-
       delivery-stops is a cause of confusion among ptrace users.  One
       typical scenario is that the tracer observes group-stop, mistakes it
       for signal-delivery-stop, restarts the tracee with

           ptrace(PTRACE_restart, pid, 0, stopsig)

       with the intention of injecting stopsig, but stopsig gets ignored and
       the tracee continues to run.

       The SIGCONT signal has a side effect of waking up (all threads of) a
       group-stopped process.  This side effect happens before signal-
       delivery-stop.  The tracer can't suppress this side effect (it can
       only suppress signal injection, which only causes the SIGCONT handler
       to not be executed in the tracee, if such a handler is installed).
       In fact, waking up from group-stop may be followed by signal-
       delivery-stop for signal(s) other than SIGCONT, if they were pending
       when SIGCONT was delivered.  In other words, SIGCONT may not be the
       first signal observed by the tracee after it was sent.

       Stopping signals cause (all threads of) a process to enter group-
       stop.  This side effect happens after signal injection, and therefore
       can be suppressed by the tracer.

       In Linux 2.4 and earlier, the SIGSTOP signal can't be injected.

       PTRACE_GETSIGINFO can be used to retrieve a siginfo_t structure which
       corresponds to the delivered signal.  PTRACE_SETSIGINFO may be used
       to modify it.  If PTRACE_SETSIGINFO has been used to alter siginfo_t,
       the si_signo field and the sig parameter in the restarting command
       must match, otherwise the result is undefined.

   Group-stop
       When a (possibly multithreaded) process receives a stopping signal,
       all threads stop.  If some threads are traced, they enter a group-
       stop.  Note that the stopping signal will first cause signal-
       delivery-stop (on one tracee only), and only after it is injected by
       the tracer (or after it was dispatched to a thread which isn't
       traced), will group-stop be initiated on all tracees within the
       multithreaded process.  As usual, every tracee reports its group-stop
       separately to the corresponding tracer.

       Group-stop is observed by the tracer as waitpid(2) returning with
       WIFSTOPPED(status) true, with the stopping signal available via
       WSTOPSIG(status).  The same result is returned by some other classes
       of ptrace-stops, therefore the recommended practice is to perform the
       call

           ptrace(PTRACE_GETSIGINFO, pid, 0, &siginfo)

       The call can be avoided if the signal is not SIGSTOP, SIGTSTP,
       SIGTTIN, or SIGTTOU; only these four signals are stopping signals.
       If the tracer sees something else, it can't be a group-stop.
       Otherwise, the tracer needs to call PTRACE_GETSIGINFO.  If
       PTRACE_GETSIGINFO fails with EINVAL, then it is definitely a group-
       stop.  (Other failure codes are possible, such as ESRCH ("no such
       process") if a SIGKILL killed the tracee.)

       If the tracee was attached using PTRACE_SEIZE, group-stop is indicated by
       PTRACE_EVENT_STOP: status>>16 == PTRACE_EVENT_STOP.  This allows
       detection of group-stops without requiring an extra PTRACE_GETSIGINFO
       call.

       As of Linux 2.6.38, after the tracer sees the tracee ptrace-stop and
       until it restarts or kills it, the tracee will not run, and will not
       send notifications (except SIGKILL death) to the tracer, even if the
       tracer enters into another waitpid(2) call.

       The kernel behavior described in the previous paragraph causes a
       problem with transparent handling of stopping signals.  If the tracer
       restarts the tracee after group-stop, the stopping signal is
       effectively ignored—the tracee doesn't remain stopped, it runs.  If
       the tracer doesn't restart the tracee before entering into the next
       waitpid(2), future SIGCONT signals will not be reported to the
       tracer; this would cause the SIGCONT signals to have no effect on the
       tracee.

       Since Linux 3.4, there is a method to overcome this problem: instead
       of PTRACE_CONT, a PTRACE_LISTEN command can be used to restart a
       tracee in a way where it does not execute, but waits for a new event
       which it can report via waitpid(2) (such as when it is restarted by a
       SIGCONT).

   PTRACE_EVENT stops
       If the tracer sets PTRACE_O_TRACE_* options, the tracee will enter
       ptrace-stops called PTRACE_EVENT stops.

       PTRACE_EVENT stops are observed by the tracer as waitpid(2) returning
       with WIFSTOPPED(status), and WSTOPSIG(status) returns SIGTRAP.  An
       additional bit is set in the higher byte of the status word: the
       value status>>8 will be

           (SIGTRAP | PTRACE_EVENT_foo << 8).

       The following events exist:

       PTRACE_EVENT_VFORK
              Stop before return from vfork(2) or clone(2) with the
              CLONE_VFORK flag.  When the tracee is continued after this
              stop, it will wait for child to exit/exec before continuing
              its execution (in other words, the usual behavior on
              vfork(2)).

       PTRACE_EVENT_FORK
              Stop before return from fork(2) or clone(2) with the exit
              signal set to SIGCHLD.

       PTRACE_EVENT_CLONE
              Stop before return from clone(2).

       PTRACE_EVENT_VFORK_DONE
              Stop before return from vfork(2) or clone(2) with the
              CLONE_VFORK flag, but after the child unblocked this tracee by
              exiting or execing.

       For all four stops described above, the stop occurs in the parent
       (i.e., the tracee), not in the newly created thread.
       PTRACE_GETEVENTMSG can be used to retrieve the new thread's ID.

       PTRACE_EVENT_EXEC
              Stop before return from execve(2).  Since Linux 3.0,
              PTRACE_GETEVENTMSG returns the former thread ID.

       PTRACE_EVENT_EXIT
              Stop before exit (including death from exit_group(2)), signal
              death, or exit caused by execve(2) in a multithreaded process.
              PTRACE_GETEVENTMSG returns the exit status.  Registers can be
              examined (unlike when "real" exit happens).  The tracee is
              still alive; it needs to be PTRACE_CONTed or PTRACE_DETACHed
              to finish exiting.

       PTRACE_EVENT_STOP
              Stop induced by PTRACE_INTERRUPT command, or group-stop, or
              initial ptrace-stop when a new child is attached (only if
              attached using PTRACE_SEIZE).

       PTRACE_EVENT_SECCOMP
              Stop triggered by a seccomp(2) rule on tracee syscall entry
              when PTRACE_O_TRACESECCOMP has been set by the tracer.  The
              seccomp event message data (from the SECCOMP_RET_DATA portion
              of the seccomp filter rule) can be retrieved with
              PTRACE_GETEVENTMSG.

       PTRACE_GETSIGINFO on PTRACE_EVENT stops returns SIGTRAP in si_signo,
       with si_code set to (event<<8) | SIGTRAP.

   Syscall-stops
       If the tracee was restarted by PTRACE_SYSCALL, the tracee enters
       syscall-enter-stop just prior to entering any system call.  If the
       tracer restarts the tracee with PTRACE_SYSCALL, the tracee enters
       syscall-exit-stop when the system call is finished, or if it is
       interrupted by a signal.  (That is, signal-delivery-stop never
       happens between syscall-enter-stop and syscall-exit-stop; it happens
       after syscall-exit-stop.)

       Other possibilities are that the tracee may stop in a PTRACE_EVENT
       stop, exit (if it entered _exit(2) or exit_group(2)), be killed by
       SIGKILL, or die silently (if it is a thread group leader, the
       execve(2) happened in another thread, and that thread is not traced
       by the same tracer; this situation is discussed later).

       Syscall-enter-stop and syscall-exit-stop are observed by the tracer
       as waitpid(2) returning with WIFSTOPPED(status) true, and
       WSTOPSIG(status) giving SIGTRAP.  If the PTRACE_O_TRACESYSGOOD option
       was set by the tracer, then WSTOPSIG(status) will give the value
       (SIGTRAP | 0x80).

       Syscall-stops can be distinguished from signal-delivery-stop with
       SIGTRAP by querying PTRACE_GETSIGINFO for the following cases:

       si_code <= 0
              SIGTRAP was delivered as a result of a user-space action, for
              example, a system call (tgkill(2), kill(2), sigqueue(3),
              etc.), expiration of a POSIX timer, change of state on a POSIX
              message queue, or completion of an asynchronous I/O request.

       si_code == SI_KERNEL (0x80)
              SIGTRAP was sent by the kernel.

       si_code == SIGTRAP or si_code == (SIGTRAP|0x80)
              This is a syscall-stop.

       However, syscall-stops happen very often (twice per system call), and
       performing PTRACE_GETSIGINFO for every syscall-stop may be somewhat
       expensive.

       Some architectures allow the cases to be distinguished by examining
       registers.  For example, on x86, rax == -ENOSYS in syscall-enter-
       stop.  Since SIGTRAP (like any other signal) always happens after
       syscall-exit-stop, and at this point rax almost never contains
       -ENOSYS, the SIGTRAP looks like "syscall-stop which is not syscall-
       enter-stop"; in other words, it looks like a "stray syscall-exit-
       stop" and can be detected this way.  But such detection is fragile
       and is best avoided.

       Using the PTRACE_O_TRACESYSGOOD option is the recommended method to
       distinguish syscall-stops from other kinds of ptrace-stops, since it
       is reliable and does not incur a performance penalty.

       Syscall-enter-stop and syscall-exit-stop are indistinguishable from
       each other by the tracer.  The tracer needs to keep track of the
       sequence of ptrace-stops in order to not misinterpret syscall-enter-
       stop as syscall-exit-stop or vice versa.  The rule is that syscall-
       enter-stop is always followed by syscall-exit-stop, PTRACE_EVENT stop
       or the tracee's death; no other kinds of ptrace-stop can occur in
       between.
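
       A minimal way to keep track of the sequence is one flag per tracee
       that is toggled on every syscall-stop.  The sketch below uses
       hypothetical names (struct syscall_tracker, on_syscall_stop(),
       on_restart()) and assumes that the tracer has already identified the
       stop as a syscall-stop.

```c
#include <stdbool.h>

struct syscall_tracker {
    bool in_syscall;   /* true between syscall-enter-stop and syscall-exit-stop */
};

enum syscall_stop { SYSCALL_ENTER, SYSCALL_EXIT };

/* Call once for every syscall-stop reported for this tracee. */
enum syscall_stop on_syscall_stop(struct syscall_tracker *t)
{
    t->in_syscall = !t->in_syscall;
    return t->in_syscall ? SYSCALL_ENTER : SYSCALL_EXIT;
}

/*
 * Call whenever the tracee is restarted.  If a request other than
 * PTRACE_SYSCALL is used, no syscall-exit-stop will be generated, so
 * the next syscall-stop (if any) must be treated as an enter-stop.
 */
void on_restart(struct syscall_tracker *t, bool with_ptrace_syscall)
{
    if (!with_ptrace_syscall)
        t->in_syscall = false;
}
```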

       If after syscall-enter-stop, the tracer uses a restarting command
       other than PTRACE_SYSCALL, syscall-exit-stop is not generated.

       PTRACE_GETSIGINFO on syscall-stops returns SIGTRAP in si_signo, with
       si_code set to SIGTRAP or (SIGTRAP|0x80).

   PTRACE_SINGLESTEP, PTRACE_SYSEMU, PTRACE_SYSEMU_SINGLESTEP stops
       [Details of these kinds of stops are yet to be documented.]

   Informational and restarting ptrace commands
       Most ptrace commands (all except PTRACE_ATTACH, PTRACE_SEIZE,
       PTRACE_TRACEME, PTRACE_INTERRUPT, and PTRACE_KILL) require the tracee
       to be in a ptrace-stop, otherwise they fail with ESRCH.

       When the tracee is in ptrace-stop, the tracer can read and write data
       to the tracee using informational commands.  These commands leave the
       tracee in ptrace-stopped state:

           ptrace(PTRACE_PEEKTEXT/PEEKDATA/PEEKUSER, pid, addr, 0);
           ptrace(PTRACE_POKETEXT/POKEDATA/POKEUSER, pid, addr, long_val);
           ptrace(PTRACE_GETREGS/GETFPREGS, pid, 0, &struct);
           ptrace(PTRACE_SETREGS/SETFPREGS, pid, 0, &struct);
           ptrace(PTRACE_GETREGSET, pid, NT_foo, &iov);
           ptrace(PTRACE_SETREGSET, pid, NT_foo, &iov);
           ptrace(PTRACE_GETSIGINFO, pid, 0, &siginfo);
           ptrace(PTRACE_SETSIGINFO, pid, 0, &siginfo);
           ptrace(PTRACE_GETEVENTMSG, pid, 0, &long_var);
           ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_flags);

       Note that some errors are not reported.  For example, setting signal
       information (siginfo) may have no effect in some ptrace-stops, yet
       the call may succeed (return 0 and not set errno); querying
       PTRACE_GETEVENTMSG may succeed and return some random value if
       current ptrace-stop is not documented as returning a meaningful event
       message.

       The call

           ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_flags);

       affects one tracee.  The tracee's current flags are replaced.  Flags
       are inherited by new tracees created and "auto-attached" via active
       PTRACE_O_TRACEFORK, PTRACE_O_TRACEVFORK, or PTRACE_O_TRACECLONE
       options.

       Another group of commands makes the ptrace-stopped tracee run.  They
       have the form:

           ptrace(cmd, pid, 0, sig);

       where cmd is PTRACE_CONT, PTRACE_LISTEN, PTRACE_DETACH,
       PTRACE_SYSCALL, PTRACE_SINGLESTEP, PTRACE_SYSEMU, or
       PTRACE_SYSEMU_SINGLESTEP.  If the tracee is in signal-delivery-stop,
       sig is the signal to be injected (if it is nonzero).  Otherwise, sig
       may be ignored.  (When restarting a tracee from a ptrace-stop other
       than signal-delivery-stop, recommended practice is to always pass 0
       in sig.)

   Attaching and detaching
       A thread can be attached to the tracer using the call

           ptrace(PTRACE_ATTACH, pid, 0, 0);

       or

           ptrace(PTRACE_SEIZE, pid, 0, PTRACE_O_flags);

       PTRACE_ATTACH sends SIGSTOP to this thread.  If the tracer wants this
       SIGSTOP to have no effect, it needs to suppress it.  Note that if
       other signals are concurrently sent to this thread during attach, the
       tracer may see the tracee enter signal-delivery-stop with other
       signal(s) first!  The usual practice is to reinject these signals
       until SIGSTOP is seen, then suppress SIGSTOP injection.  The design
       bug here is that a ptrace attach and a concurrently delivered SIGSTOP
       may race and the concurrent SIGSTOP may be lost.

       Since attaching sends SIGSTOP and the tracer usually suppresses it,
       this may cause a stray EINTR return from the currently executing
       system call in the tracee, as described in the "Signal injection and
       suppression" section.

       Since Linux 3.4, PTRACE_SEIZE can be used instead of PTRACE_ATTACH.
       PTRACE_SEIZE does not stop the attached process.  If you need to stop
       it after attach (or at any other time) without sending it any
       signals, use PTRACE_INTERRUPT command.

       The request

           ptrace(PTRACE_TRACEME, 0, 0, 0);

       turns the calling thread into a tracee.  The thread continues to run
       (doesn't enter ptrace-stop).  A common practice is to follow the
       PTRACE_TRACEME with

           raise(SIGSTOP);

       and allow the parent (which is our tracer now) to observe our signal-
       delivery-stop.
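
       Put together, the PTRACE_TRACEME pattern might look like the sketch
       below.  spawn_stopped_tracee() is a hypothetical helper, and error
       handling is kept to a minimum.

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Forks a child that turns itself into a tracee and then stops.
 * Returns the child's PID once its signal-delivery-stop for SIGSTOP
 * has been observed (with the wait status stored in *status), or -1
 * on failure.
 */
pid_t spawn_stopped_tracee(int *status)
{
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: become a tracee of the parent, then stop. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(127);
        raise(SIGSTOP);
        _exit(0);   /* reached only after the tracer restarts the child */
    }

    /* Parent (now the tracer): wait for the child to report its stop. */
    if (waitpid(pid, status, __WALL) != pid)
        return -1;
    if (!WIFSTOPPED(*status) || WSTOPSIG(*status) != SIGSTOP)
        return -1;  /* the child exited instead of stopping */
    return pid;
}
```

       The tracer would typically continue with ptrace(PTRACE_CONT, pid, 0,
       0), passing 0 as sig so that the SIGSTOP is suppressed.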

       If the PTRACE_O_TRACEVFORK, PTRACE_O_TRACEFORK, or
       PTRACE_O_TRACECLONE options are in effect, then children created by,
       respectively, vfork(2) or clone(2) with the CLONE_VFORK flag, fork(2)
       or clone(2) with the exit signal set to SIGCHLD, and other kinds of
       clone(2), are automatically attached to the same tracer which traced
       their parent.  SIGSTOP is delivered to the children, causing them to
       enter signal-delivery-stop after they exit the system call which
       created them.

       Detaching of the tracee is performed by:

           ptrace(PTRACE_DETACH, pid, 0, sig);

       PTRACE_DETACH is a restarting operation; therefore it requires the
       tracee to be in ptrace-stop.  If the tracee is in signal-delivery-
       stop, a signal can be injected.  Otherwise, the sig parameter may be
       silently ignored.

       If the tracee is running when the tracer wants to detach it, the
       usual solution is to send SIGSTOP (using tgkill(2), to make sure it
       goes to the correct thread), wait for the tracee to stop in signal-
       delivery-stop for SIGSTOP and then detach it (suppressing SIGSTOP
       injection).  A design bug is that this can race with concurrent
       SIGSTOPs.  Another complication is that the tracee may enter other
       ptrace-stops and needs to be restarted and waited for again, until
       SIGSTOP is seen.  Yet another complication is to be sure that the
       tracee is not already ptrace-stopped, because no signal delivery
       happens while it is—not even SIGSTOP.

       If the tracer dies, all tracees are automatically detached and
       restarted, unless they were in group-stop.  Handling of restart from
       group-stop is currently buggy, but the "as planned" behavior is to
       leave tracee stopped and waiting for SIGCONT.  If the tracee is
       restarted from signal-delivery-stop, the pending signal is injected.

   execve(2) under ptrace
       When one thread in a multithreaded process calls execve(2), the
       kernel destroys all other threads in the process, and resets the
       thread ID of the execing thread to the thread group ID (process ID).
       (Or, to put things another way, when a multithreaded process does an
       execve(2), at completion of the call, it appears as though the
       execve(2) occurred in the thread group leader, regardless of which
       thread did the execve(2).)  This resetting of the thread ID looks
       very confusing to tracers:

       *  All other threads stop in PTRACE_EVENT_EXIT stop, if the
          PTRACE_O_TRACEEXIT option was turned on.  Then all other threads
          except the thread group leader report death as if they exited via
          _exit(2) with exit code 0.

       *  The execing tracee changes its thread ID while it is in the
          execve(2).  (Remember, under ptrace, the "pid" returned from
          waitpid(2), or fed into ptrace calls, is the tracee's thread ID.)
          That is, the tracee's thread ID is reset to be the same as its
          process ID, which is the same as the thread group leader's thread
          ID.

       *  Then a PTRACE_EVENT_EXEC stop happens, if the PTRACE_O_TRACEEXEC
          option was turned on.

       *  If the thread group leader has reported its PTRACE_EVENT_EXIT stop
          by this time, it appears to the tracer that the dead thread leader
          "reappears from nowhere".  (Note: the thread group leader does not
          report death via WIFEXITED(status) while at least one other thread
          is still alive.  This eliminates the possibility that the
          tracer will see it dying and then reappearing.)  If the thread
          group leader was still alive, for the tracer this may look as if
          thread group leader returns from a different system call than it
          entered, or even "returned from a system call even though it was
          not in any system call".  If the thread group leader was not
          traced (or was traced by a different tracer), then during
          execve(2) it will appear as if it has become a tracee of the
          tracer of the execing tracee.

       All of the above effects are the artifacts of the thread ID change in
       the tracee.

       The PTRACE_O_TRACEEXEC option is the recommended tool for dealing
       with this situation.  First, it enables PTRACE_EVENT_EXEC stop, which
       occurs before execve(2) returns.  In this stop, the tracer can use
       PTRACE_GETEVENTMSG to retrieve the tracee's former thread ID.  (This
       feature was introduced in Linux 3.0.)  Second, the PTRACE_O_TRACEEXEC
       option disables legacy SIGTRAP generation on execve(2).

       When the tracer receives PTRACE_EVENT_EXEC stop notification, it is
       guaranteed that, apart from this tracee and the thread group leader,
       no other threads from the process are alive.

       On receiving the PTRACE_EVENT_EXEC stop notification, the tracer
       should clean up all its internal data structures describing the
       threads of this process, and retain only one data structure—one which
       describes the single still running tracee, with

           thread ID == thread group ID == process ID.
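
       The cleanup step could be sketched as below, assuming the tracer
       keeps a flat table of (thread ID, thread group ID) pairs.  The table
       layout and the name collapse_after_exec() are assumptions made for
       this example only.

```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical per-thread record kept by the tracer. */
struct tracee_entry {
    pid_t tid;    /* thread ID, as returned by waitpid() */
    pid_t tgid;   /* thread group ID (process ID) */
};

/*
 * Called on PTRACE_EVENT_EXEC for the process tgid.  Drops every entry
 * belonging to that process and adds back a single entry with
 * tid == tgid.  n is the current number of entries and cap the table
 * capacity; the new number of entries is returned.
 */
size_t collapse_after_exec(struct tracee_entry *tab, size_t n, size_t cap,
                           pid_t tgid)
{
    size_t out = 0;

    for (size_t i = 0; i < n; i++)
        if (tab[i].tgid != tgid)
            tab[out++] = tab[i];   /* keep threads of other processes */

    if (out < cap) {
        tab[out].tid = tgid;
        tab[out].tgid = tgid;
        out++;
    }
    return out;
}
```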

       Example: two threads call execve(2) at the same time:

       *** we get syscall-enter-stop in thread 1: **
       PID1 execve("/bin/foo", "foo" <unfinished ...>
       *** we issue PTRACE_SYSCALL for thread 1 **
       *** we get syscall-enter-stop in thread 2: **
       PID2 execve("/bin/bar", "bar" <unfinished ...>
       *** we issue PTRACE_SYSCALL for thread 2 **
       *** we get PTRACE_EVENT_EXEC for PID0, we issue PTRACE_SYSCALL **
       *** we get syscall-exit-stop for PID0: **
       PID0 <... execve resumed> )             = 0

       If the PTRACE_O_TRACEEXEC option is not in effect for the execing
       tracee, and if the tracee was PTRACE_ATTACHed rather than
       PTRACE_SEIZEd, the kernel delivers an extra SIGTRAP to the tracee
       after execve(2) returns.  This is an ordinary signal (similar to one
       which can be generated by kill -TRAP), not a special kind of ptrace-
       stop.  Employing PTRACE_GETSIGINFO for this signal returns si_code
       set to 0 (SI_USER).  This signal may be blocked by signal mask, and
       thus may be delivered (much) later.

       Usually, the tracer (for example, strace(1)) would not want to show
       this extra post-execve SIGTRAP signal to the user, and would suppress
       its delivery to the tracee (if SIGTRAP is set to SIG_DFL, it is a
       killing signal).  However, determining which SIGTRAP to suppress is
       not easy.  Setting the PTRACE_O_TRACEEXEC option or using
       PTRACE_SEIZE and thus suppressing this extra SIGTRAP is the
       recommended approach.

   Real parent
       The ptrace API (ab)uses the standard UNIX parent/child signaling over
       waitpid(2).  This used to cause the real parent of the process to
       stop receiving several kinds of waitpid(2) notifications when the
       child process is traced by some other process.

       Many of these bugs have been fixed, but as of Linux 2.6.38 several
       still exist; see BUGS below.

       As of Linux 2.6.38, the following is believed to work correctly:

       *  exit/death by signal is reported first to the tracer, then, when
          the tracer consumes the waitpid(2) result, to the real parent (to
          the real parent only when the whole multithreaded process exits).
          If the tracer and the real parent are the same process, the report
          is sent only once.
http://man7.org/linux/man-pages/man2/process_vm_readv.2.html
12
SYSTEM CALL:
process_vm_readv(2) - Linux manual page
FUNCTIONALITY:

       process_vm_readv,  process_vm_writev  - transfer data between process
       address spaces
SYNOPSIS:

       #include <sys/uio.h>

       ssize_t process_vm_readv(pid_t pid,
                                const struct iovec *local_iov,
                                unsigned long liovcnt,
                                const struct iovec *remote_iov,
                                unsigned long riovcnt,
                                unsigned long flags);

       ssize_t process_vm_writev(pid_t pid,
                                 const struct iovec *local_iov,
                                 unsigned long liovcnt,
                                 const struct iovec *remote_iov,
                                 unsigned long riovcnt,
                                 unsigned long flags);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       process_vm_readv(), process_vm_writev():
           _GNU_SOURCE
DESCRIPTION

       These system calls transfer data between the address space of the
       calling process ("the local process") and the process identified by
       pid ("the remote process").  The data moves directly between the
       address spaces of the two processes, without passing through kernel
       space.

       The process_vm_readv() system call transfers data from the remote
       process to the local process.  The data to be transferred is
       identified by remote_iov and riovcnt: remote_iov is a pointer to an
       array describing address ranges in the process pid, and riovcnt
       specifies the number of elements in remote_iov.  The data is
       transferred to the locations specified by local_iov and liovcnt:
       local_iov is a pointer to an array describing address ranges in the
       calling process, and liovcnt specifies the number of elements in
       local_iov.

       The process_vm_writev() system call is the converse of
       process_vm_readv()—it transfers data from the local process to the
       remote process.  Other than the direction of the transfer, the
       arguments liovcnt, local_iov, riovcnt, and remote_iov have the same
       meaning as for process_vm_readv().

       The local_iov and remote_iov arguments point to an array of iovec
       structures, defined in <sys/uio.h> as:

           struct iovec {
               void  *iov_base;    /* Starting address */
               size_t iov_len;     /* Number of bytes to transfer */
           };

       Buffers are processed in array order.  This means that
       process_vm_readv() completely fills local_iov[0] before proceeding to
       local_iov[1], and so on.  Likewise, remote_iov[0] is completely read
       before proceeding to remote_iov[1], and so on.

       Similarly, process_vm_writev() writes out the entire contents of
       local_iov[0] before proceeding to local_iov[1], and it completely
       fills remote_iov[0] before proceeding to remote_iov[1].

       The lengths of remote_iov[i].iov_len and local_iov[i].iov_len do not
       have to be the same.  Thus, it is possible to split a single local
       buffer into multiple remote buffers, or vice versa.
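
       As an illustration, the sketch below reads one contiguous remote
       range into two separate local buffers.  read_split() is a
       hypothetical wrapper; only process_vm_readv() and struct iovec come
       from the interface described here.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Reads first_len + second_len bytes starting at remote_addr in the
 * process pid.  The first first_len bytes go to "first", the rest to
 * "second".  Returns the number of bytes read, or -1 with errno set.
 */
ssize_t read_split(pid_t pid, void *remote_addr,
                   void *first, size_t first_len,
                   void *second, size_t second_len)
{
    struct iovec local[2] = {
        { .iov_base = first,  .iov_len = first_len  },
        { .iov_base = second, .iov_len = second_len },
    };
    struct iovec remote[1] = {
        { .iov_base = remote_addr, .iov_len = first_len + second_len },
    };

    /* local[0] is filled completely before local[1] is touched. */
    return process_vm_readv(pid, local, 2, remote, 1, 0);
}
```

       The return value may be smaller than the requested total if part of
       the remote range is inaccessible.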

       The flags argument is currently unused and must be set to 0.

       The values specified in the liovcnt and riovcnt arguments must be
       less than or equal to IOV_MAX (defined in <limits.h> or accessible
       via the call sysconf(_SC_IOV_MAX)).

       The count arguments and local_iov are checked before doing any
       transfers.  If the counts are too big, or local_iov is invalid, or
       the addresses refer to regions that are inaccessible to the local
       process, none of the vectors will be processed and an error will be
       returned immediately.

       Note, however, that these system calls do not check the memory
       regions in the remote process until just before doing the read/write.
       Consequently, a partial read/write (see RETURN VALUE) may result if
       one of the remote_iov elements points to an invalid memory region in
       the remote process.  No further reads/writes will be attempted beyond
       that point.  Keep this in mind when attempting to read data of
       unknown length (such as C strings that are null-terminated) from a
       remote process, by avoiding spanning memory pages (typically 4KiB) in
       a single remote iovec element.  (Instead, split the remote read into
       two remote_iov elements and have them merge back into a single
       local_iov entry.  The first read entry goes up to the page boundary,
       while the second starts on the next page boundary.)

       Permission to read from or write to another process is governed by a
       ptrace access mode PTRACE_MODE_ATTACH_REALCREDS check; see ptrace(2).
http://man7.org/linux/man-pages/man2/process_vm_writev.2.html
12
SYSTEM CALL:
process_vm_readv(2) - Linux manual page
FUNCTIONALITY:

       process_vm_readv,  process_vm_writev  - transfer data between process
       address spaces
SYNOPSIS:

       #include <sys/uio.h>

       ssize_t process_vm_readv(pid_t pid,
                                const struct iovec *local_iov,
                                unsigned long liovcnt,
                                const struct iovec *remote_iov,
                                unsigned long riovcnt,
                                unsigned long flags);

       ssize_t process_vm_writev(pid_t pid,
                                 const struct iovec *local_iov,
                                 unsigned long liovcnt,
                                 const struct iovec *remote_iov,
                                 unsigned long riovcnt,
                                 unsigned long flags);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       process_vm_readv(), process_vm_writev():
           _GNU_SOURCE
DESCRIPTION

       These system calls transfer data between the address space of the
       calling process ("the local process") and the process identified by
       pid ("the remote process").  The data moves directly between the
       address spaces of the two processes, without passing through kernel
       space.

       The process_vm_readv() system call transfers data from the remote
       process to the local process.  The data to be transferred is
       identified by remote_iov and riovcnt: remote_iov is a pointer to an
       array describing address ranges in the process pid, and riovcnt
       specifies the number of elements in remote_iov.  The data is
       transferred to the locations specified by local_iov and liovcnt:
       local_iov is a pointer to an array describing address ranges in the
       calling process, and liovcnt specifies the number of elements in
       local_iov.

       The process_vm_writev() system call is the converse of
       process_vm_readv()—it transfers data from the local process to the
       remote process.  Other than the direction of the transfer, the
       arguments liovcnt, local_iov, riovcnt, and remote_iov have the same
       meaning as for process_vm_readv().

       The local_iov and remote_iov arguments point to an array of iovec
       structures, defined in <sys/uio.h> as:

           struct iovec {
               void  *iov_base;    /* Starting address */
               size_t iov_len;     /* Number of bytes to transfer */
           };

       Buffers are processed in array order.  This means that
       process_vm_readv() completely fills local_iov[0] before proceeding to
       local_iov[1], and so on.  Likewise, remote_iov[0] is completely read
       before proceeding to remote_iov[1], and so on.

       Similarly, process_vm_writev() writes out the entire contents of
       local_iov[0] before proceeding to local_iov[1], and it completely
       fills remote_iov[0] before proceeding to remote_iov[1].

       The lengths of remote_iov[i].iov_len and local_iov[i].iov_len do not
       have to be the same.  Thus, it is possible to split a single local
       buffer into multiple remote buffers, or vice versa.
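       The splitting described above can be sketched as follows.  This is
       a minimal illustration, not a reference implementation: the helper
       name read_split() is hypothetical, and the "remote" process is
       simply the caller itself (getpid()), which is assumed to pass the
       permission check unless a seccomp filter blocks the call.  One
       remote iovec is scattered into two local buffers.

```c
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy len bytes starting at src into two local
 * buffers, using one remote iovec and two local iovecs.  The target
 * process is the caller itself, purely to keep the sketch self-contained.
 * Returns the number of bytes transferred, or -1 with errno set.
 */
ssize_t read_split(const void *src, size_t len,
                   void *first, size_t first_len,
                   void *second, size_t second_len)
{
    struct iovec local[2] = {
        { .iov_base = first,  .iov_len = first_len  },
        { .iov_base = second, .iov_len = second_len },
    };
    struct iovec remote[1] = {
        { .iov_base = (void *)src, .iov_len = len },
    };

    /* local[0] is filled completely before local[1] is touched;
       flags is currently unused and must be 0. */
    return process_vm_readv(getpid(), local, 2, remote, 1, 0);
}
```

       With a 10-byte source and local buffers of 4 and 6 bytes, a
       successful call is expected to return 10, with the first 4 bytes
       in the first buffer and the remaining 6 in the second.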

       The flags argument is currently unused and must be set to 0.

       The values specified in the liovcnt and riovcnt arguments must be
       less than or equal to IOV_MAX (defined in <limits.h> or accessible
       via the call sysconf(_SC_IOV_MAX)).

       The count arguments and local_iov are checked before doing any
       transfers.  If the counts are too big, or local_iov is invalid, or
       the addresses refer to regions that are inaccessible to the local
       process, none of the vectors will be processed and an error will be
       returned immediately.

       Note, however, that these system calls do not check the memory
       regions in the remote process until just before doing the read/write.
       Consequently, a partial read/write (see RETURN VALUE) may result if
       one of the remote_iov elements points to an invalid memory region in
       the remote process.  No further reads/writes will be attempted beyond
       that point.  Keep this in mind when attempting to read data of
       unknown length (such as C strings that are null-terminated) from a
       remote process, by avoiding spanning memory pages (typically 4KiB) in
       a single remote iovec element.  (Instead, split the remote read into
       two remote_iov elements and have them merge back into a single
       local_iov entry.  The first read entry goes up to the page boundary,
       while the second starts on the next page boundary.)
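       One way to build remote_iov elements that never span a page
       boundary is sketched below.  The helper name split_at_pages() and
       its parameters are hypothetical; the sketch only computes the
       iovec array and performs no transfer.  The page size is taken from
       sysconf(_SC_PAGESIZE) rather than assumed to be 4KiB.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: describe the remote range [addr, addr + len) with
 * iovec elements that each end at or before the next page boundary.
 * At most max elements are written to out[]; any part of the range that
 * does not fit is left undescribed.  Returns the number of elements
 * written.
 */
size_t split_at_pages(uintptr_t addr, size_t len,
                      struct iovec *out, size_t max)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t n = 0;

    while (len > 0 && n < max) {
        size_t room = page - (size_t)(addr % page); /* bytes left in page */
        size_t chunk = len < room ? len : room;

        out[n].iov_base = (void *)addr;
        out[n].iov_len = chunk;
        n++;

        addr += chunk;
        len -= chunk;
    }
    return n;
}
```

       For a 48-byte range that starts 16 bytes before a page boundary,
       this yields two elements of 16 and 32 bytes.  If the second page
       turns out to be invalid in the remote process, the first element
       can still be transferred as a partial read.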

       Permission to read from or write to another process is governed by a
       ptrace access mode PTRACE_MODE_ATTACH_REALCREDS check; see ptrace(2).
http://man7.org/linux/man-pages/man2/kcmp.2.html
11
SYSTEM CALL:
kcmp(2) - Linux manual page
FUNCTIONALITY:

       kcmp  -  compare  two  processes  to determine if they share a kernel
       resource
SYNOPSIS:

       #include <linux/kcmp.h>

       int kcmp(pid_t pid1, pid_t pid2, int type,
                unsigned long idx1, unsigned long idx2);

       Note: There is no glibc wrapper for this system call; see NOTES.
DESCRIPTION

       The kcmp() system call can be used to check whether the two processes
       identified by pid1 and pid2 share a kernel resource such as virtual
       memory, file descriptors, and so on.

       Permission to employ kcmp() is governed by ptrace access mode
       PTRACE_MODE_READ_REALCREDS checks against both pid1 and pid2; see
       ptrace(2).

       The type argument specifies which resource is to be compared in the
       two processes.  It has one of the following values:

       KCMP_FILE
              Check whether a file descriptor idx1 in the process pid1
              refers to the same open file description (see open(2)) as file
              descriptor idx2 in the process pid2.

       KCMP_FILES
              Check whether the processes share the same set of open file
              descriptors.  The arguments idx1 and idx2 are ignored.

       KCMP_FS
              Check whether the processes share the same filesystem
              information (i.e., file mode creation mask, working directory,
              and filesystem root).  The arguments idx1 and idx2 are
              ignored.

       KCMP_IO
              Check whether the processes share I/O context.  The arguments
              idx1 and idx2 are ignored.

       KCMP_SIGHAND
              Check whether the processes share the same table of signal
              dispositions.  The arguments idx1 and idx2 are ignored.

       KCMP_SYSVSEM
              Check whether the processes share the same list of System V
              semaphore undo operations.  The arguments idx1 and idx2 are
              ignored.

       KCMP_VM
              Check whether the processes share the same address space.  The
              arguments idx1 and idx2 are ignored.

       Note that kcmp() is not protected against false positives, which may
       occur if the processes are currently running.  One should stop the
       processes by sending SIGSTOP (see signal(7)) prior to inspection with
       this system call to obtain meaningful results.
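       Because there is no glibc wrapper, a KCMP_FILE comparison has to
       go through syscall(2).  The sketch below is an illustration under
       assumptions: the helper name same_open_file() is hypothetical, it
       compares two descriptors of the calling process only, and it
       relies on kcmp() returning 0 when the two resources are equal.
       The call may fail on kernels built without kcmp() support or
       under a restrictive seccomp policy.

```c
#define _GNU_SOURCE
#include <linux/kcmp.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if fd1 and fd2 in the calling process
 * refer to the same open file description, 0 if they do not, and -1 if
 * the system call failed (errno is set by syscall()).
 */
int same_open_file(int fd1, int fd2)
{
    pid_t self = getpid();
    long r = syscall(SYS_kcmp, self, self, KCMP_FILE,
                     (unsigned long)fd1, (unsigned long)fd2);

    if (r < 0)
        return -1;
    return r == 0;      /* 0 from kcmp() means "equal" */
}
```

       A descriptor and its dup(2) copy share one open file description,
       so they are expected to compare equal, whereas the two ends of a
       pipe(2) are not.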
http://man7.org/linux/man-pages/man2/unshare.2.html
12
SYSTEM CALL:
unshare(2) - Linux manual page
FUNCTIONALITY:

       unshare - disassociate parts of the process execution context
SYNOPSIS:

       #define _GNU_SOURCE
       #include <sched.h>

       int unshare(int flags);
DESCRIPTION

       unshare() allows a process (or thread) to disassociate parts of its
       execution context that are currently being shared with other
       processes (or threads).  Part of the execution context, such as the
       mount namespace, is shared implicitly when a new process is created
       using fork(2) or vfork(2), while other parts, such as virtual memory,
       may be shared by explicit request when creating a process or thread
       using clone(2).

       The main use of unshare() is to allow a process to control its shared
       execution context without creating a new process.

       The flags argument is a bit mask that specifies which parts of the
       execution context should be unshared.  This argument is specified by
       ORing together zero or more of the following constants:

       CLONE_FILES
              Reverse the effect of the clone(2) CLONE_FILES flag.  Unshare
              the file descriptor table, so that the calling process no
              longer shares its file descriptors with any other process.

       CLONE_FS
              Reverse the effect of the clone(2) CLONE_FS flag.  Unshare
              filesystem attributes, so that the calling process no longer
              shares its root directory (chroot(2)), current directory
              (chdir(2)), or umask (umask(2)) attributes with any other
              process.

       CLONE_NEWCGROUP (since Linux 4.6)
              This flag has the same effect as the clone(2) CLONE_NEWCGROUP
              flag.  Unshare the cgroup namespace.  Use of CLONE_NEWCGROUP
              requires the CAP_SYS_ADMIN capability.

       CLONE_NEWIPC (since Linux 2.6.19)
              This flag has the same effect as the clone(2) CLONE_NEWIPC
              flag.  Unshare the IPC namespace, so that the calling process
              has a private copy of the IPC namespace which is not shared
              with any other process.  Specifying this flag automatically
              implies CLONE_SYSVSEM as well.  Use of CLONE_NEWIPC requires
              the CAP_SYS_ADMIN capability.

       CLONE_NEWNET (since Linux 2.6.24)
              This flag has the same effect as the clone(2) CLONE_NEWNET
              flag.  Unshare the network namespace, so that the calling
              process is moved into a new network namespace which is not
              shared with any previously existing process.  Use of
              CLONE_NEWNET requires the CAP_SYS_ADMIN capability.

       CLONE_NEWNS
              This flag has the same effect as the clone(2) CLONE_NEWNS
              flag.  Unshare the mount namespace, so that the calling
              process has a private copy of its namespace which is not
              shared with any other process.  Specifying this flag
              automatically implies CLONE_FS as well.  Use of CLONE_NEWNS
              requires the CAP_SYS_ADMIN capability.  For further
              information, see mount_namespaces(7).

       CLONE_NEWPID (since Linux 3.8)
              This flag has the same effect as the clone(2) CLONE_NEWPID
              flag.  Unshare the PID namespace, so that the calling process
              has a new PID namespace for its children which is not shared
              with any previously existing process.  The calling process is
              not moved into the new namespace.  The first child created by
              the calling process will have the process ID 1 and will assume
              the role of init(1) in the new namespace.  CLONE_NEWPID
              automatically implies CLONE_THREAD as well.  Use of
              CLONE_NEWPID requires the CAP_SYS_ADMIN capability.  For
              further information, see pid_namespaces(7).

       CLONE_NEWUSER (since Linux 3.8)
              This flag has the same effect as the clone(2) CLONE_NEWUSER
              flag.  Unshare the user namespace, so that the calling process
              is moved into a new user namespace which is not shared with
              any previously existing process.  As with the child process
              created by clone(2) with the CLONE_NEWUSER flag, the caller
              obtains a full set of capabilities in the new namespace.

              CLONE_NEWUSER requires that the calling process is not
              threaded; specifying CLONE_NEWUSER automatically implies
              CLONE_THREAD.  Since Linux 3.9, CLONE_NEWUSER also
              automatically implies CLONE_FS.  CLONE_NEWUSER requires that
              the user ID and group ID of the calling process are mapped to
              user IDs and group IDs in the user namespace of the calling
              process at the time of the call.

              For further information on user namespaces, see
              user_namespaces(7).

       CLONE_NEWUTS (since Linux 2.6.19)
              This flag has the same effect as the clone(2) CLONE_NEWUTS
              flag.  Unshare the UTS namespace, so that the calling
              process has a private copy of the UTS namespace which is not
              shared with any other process.  Use of CLONE_NEWUTS requires
              the CAP_SYS_ADMIN capability.

       CLONE_SYSVSEM (since Linux 2.6.26)
              This flag reverses the effect of the clone(2) CLONE_SYSVSEM
              flag.  Unshare System V semaphore adjustment (semadj) values,
              so that the calling process has a new empty semadj list that
              is not shared with any other process.  If this is the last
              process that has a reference to the process's current semadj
              list, then the adjustments in that list are applied to the
              corresponding semaphores, as described in semop(2).

       In addition, CLONE_THREAD, CLONE_SIGHAND, and CLONE_VM can be
       specified in flags if the caller is single threaded (i.e., it is not
       sharing its address space with another process or thread).  In this
       case, these flags have no effect.  (Note also that specifying
       CLONE_THREAD automatically implies CLONE_VM, and specifying CLONE_VM
       automatically implies CLONE_SIGHAND.)  If the process is
       multithreaded, then the use of these flags results in an error.

       If flags is specified as zero, then unshare() is a no-op; no changes
       are made to the calling process's execution context.
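       As a minimal sketch of how the flags bit mask is formed, the
       following hypothetical helper ORs together two of the constants
       that need no capability.  The helper name is an assumption made
       for illustration; some sandboxes are known to deny unshare()
       altogether, in which case the call fails with EPERM.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>

/*
 * Hypothetical helper: give the caller a private file descriptor table
 * and private filesystem attributes (root, working directory, umask).
 * Returns 0 on success, or the errno value reported by unshare().
 */
int unshare_files_and_fs(void)
{
    if (unshare(CLONE_FILES | CLONE_FS) == -1)
        return errno;
    return 0;
}
```

       Namespace flags such as CLONE_NEWUTS can be ORed in the same way,
       but they additionally require CAP_SYS_ADMIN as described above.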
Linux signals system calls
http://man7.org/linux/man-pages/man2/kill.2.html
11
SYSTEM CALL:
kill(2) - Linux manual page
FUNCTIONALITY:

       kill - send signal to a process
SYNOPSIS:

       #include <sys/types.h>
       #include <signal.h>

       int kill(pid_t pid, int sig);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       kill(): _POSIX_C_SOURCE
DESCRIPTION

       The kill() system call can be used to send any signal to any process
       group or process.

       If pid is positive, then signal sig is sent to the process with the
       ID specified by pid.

       If pid equals 0, then sig is sent to every process in the process
       group of the calling process.

       If pid equals -1, then sig is sent to every process for which the
       calling process has permission to send signals, except for process 1
       (init), but see below.

       If pid is less than -1, then sig is sent to every process in the
       process group whose ID is -pid.

       If sig is 0, then no signal is sent, but existence and permission
       checks are still performed; this can be used to check for the
       existence of a process ID or process group ID that the caller is
       permitted to signal.

       For a process to have permission to send a signal it must either be
       privileged (under Linux: have the CAP_KILL capability), or the real
       or effective user ID of the sending process must equal the real or
       saved set-user-ID of the target process.  In the case of SIGCONT it
       suffices when the sending and receiving processes belong to the same
       session.  (Historically, the rules were different; see NOTES.)
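       The sig == 0 probe described above can be sketched as follows.
       The helper name process_exists() is hypothetical.  The sketch
       treats an EPERM failure as "the process exists but the caller may
       not signal it" and an ESRCH failure as "no such process", and it
       rejects pid values below 1 because those address process groups
       rather than a single process.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

/*
 * Hypothetical helper: returns 1 if a process with ID pid exists, 0 if
 * it does not, and -1 on any other error (errno is set).  Only positive
 * pid values are accepted.
 */
int process_exists(pid_t pid)
{
    if (pid <= 0) {
        errno = EINVAL;
        return -1;
    }
    if (kill(pid, 0) == 0)      /* no signal is sent when sig is 0 */
        return 1;
    if (errno == EPERM)         /* exists, but we may not signal it */
        return 1;
    if (errno == ESRCH)
        return 0;
    return -1;
}
```

       The result is only a snapshot: the process may terminate, and its
       ID may be reused, immediately after the check.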
http://man7.org/linux/man-pages/man2/tkill.2.html
11
SYSTEM CALL:
tkill(2) - Linux manual page
FUNCTIONALITY:

       tkill, tgkill - send a signal to a thread
SYNOPSIS:

       int tkill(int tid, int sig);

       int tgkill(int tgid, int tid, int sig);

       Note: There are no glibc wrappers for these system calls; see NOTES.
DESCRIPTION

       tgkill() sends the signal sig to the thread with the thread ID tid in
       the thread group tgid.  (By contrast, kill(2) can be used to send a
       signal only to a process (i.e., thread group) as a whole, and the
       signal will be delivered to an arbitrary thread within that process.)

       tkill() is an obsolete predecessor to tgkill().  It allows only the
       target thread ID to be specified, which may result in the wrong
       thread being signaled if a thread terminates and its thread ID is
       recycled.  Avoid using this system call.

       These are the raw system call interfaces, meant for internal thread
       library use.
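       Since there are no glibc wrappers, tgkill() is normally reached
       through syscall(2).  The following sketch sends SIGUSR1 to the
       calling thread only.  The helper and handler names are
       hypothetical, and the sketch assumes that SIGUSR1 is not blocked
       in the caller, so that the handler runs before the system call
       returns.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static volatile sig_atomic_t usr1_seen;

static void on_usr1(int sig)
{
    (void)sig;
    usr1_seen = 1;
}

/*
 * Hypothetical helper: direct SIGUSR1 at the calling thread with
 * tgkill().  Returns 1 if the handler ran, 0 if it did not, and -1 on
 * error.  The previous SIGUSR1 disposition is restored before returning.
 */
int signal_own_thread(void)
{
    struct sigaction sa, old;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, &old) == -1)
        return -1;

    usr1_seen = 0;
    pid_t tgid = getpid();                   /* thread group ID */
    pid_t tid = (pid_t)syscall(SYS_gettid);  /* kernel thread ID */
    long r = syscall(SYS_tgkill, tgid, tid, SIGUSR1);

    sigaction(SIGUSR1, &old, NULL);
    return r == -1 ? -1 : usr1_seen;
}
```

       Application code would usually prefer pthread_kill(3), which is
       the thread-library interface built on top of this system call.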
http://man7.org/linux/man-pages/man2/tgkill.2.html
11
SYSTEM CALL:
tkill(2) - Linux manual page
FUNCTIONALITY:

       tkill, tgkill - send a signal to a thread
SYNOPSIS:

       int tkill(int tid, int sig);

       int tgkill(int tgid, int tid, int sig);

       Note: There are no glibc wrappers for these system calls; see NOTES.
DESCRIPTION

       tgkill() sends the signal sig to the thread with the thread ID tid in
       the thread group tgid.  (By contrast, kill(2) can be used to send a
       signal only to a process (i.e., thread group) as a whole, and the
       signal will be delivered to an arbitrary thread within that process.)

       tkill() is an obsolete predecessor to tgkill().  It allows only the
       target thread ID to be specified, which may result in the wrong
       thread being signaled if a thread terminates and its thread ID is
       recycled.  Avoid using this system call.

       These are the raw system call interfaces, meant for internal thread
       library use.
http://man7.org/linux/man-pages/man2/pause.2.html
9
SYSTEM CALL:
pause(2) - Linux manual page
FUNCTIONALITY:

       pause - wait for signal
SYNOPSIS:

       #include <unistd.h>

       int pause(void);
DESCRIPTION

       pause() causes the calling process (or thread) to sleep until a
       signal is delivered that either terminates the process or causes the
       invocation of a signal-catching function.
http://man7.org/linux/man-pages/man2/rt_sigaction.2.html
12
SYSTEM CALL:
sigaction(2) - Linux manual page
FUNCTIONALITY:

       sigaction, rt_sigaction - examine and change a signal action
SYNOPSIS:

       #include <signal.h>

       int sigaction(int signum, const struct sigaction *act,
                     struct sigaction *oldact);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       sigaction(): _POSIX_C_SOURCE

       siginfo_t: _POSIX_C_SOURCE >= 199309L
DESCRIPTION

       The sigaction() system call is used to change the action taken by a
       process on receipt of a specific signal.  (See signal(7) for an
       overview of signals.)

       signum specifies the signal and can be any valid signal except
       SIGKILL and SIGSTOP.

       If act is non-NULL, the new action for signal signum is installed
       from act.  If oldact is non-NULL, the previous action is saved in
       oldact.

       The sigaction structure is defined as something like:

           struct sigaction {
               void     (*sa_handler)(int);
               void     (*sa_sigaction)(int, siginfo_t *, void *);
               sigset_t   sa_mask;
               int        sa_flags;
               void     (*sa_restorer)(void);
           };

       On some architectures a union is involved: do not assign to both
       sa_handler and sa_sigaction.

       The sa_restorer field is not intended for application use.  (POSIX
       does not specify a sa_restorer field.)  Some further details of the
       purpose of this field can be found in sigreturn(2).

       sa_handler specifies the action to be associated with signum and may
       be SIG_DFL for the default action, SIG_IGN to ignore this signal, or
       a pointer to a signal handling function.  This function receives the
       signal number as its only argument.

       If SA_SIGINFO is specified in sa_flags, then sa_sigaction (instead of
       sa_handler) specifies the signal-handling function for signum.  This
       function receives the signal number as its first argument, a pointer
       to a siginfo_t as its second argument and a pointer to a ucontext_t
       (cast to void *) as its third argument.  (Commonly, the handler
       function doesn't make any use of the third argument.  See
       getcontext(3) for further information about ucontext_t.)

       sa_mask specifies a mask of signals which should be blocked (i.e.,
       added to the signal mask of the thread in which the signal handler is
       invoked) during execution of the signal handler.  In addition, the
       signal which triggered the handler will be blocked, unless the
       SA_NODEFER flag is used.

       sa_flags specifies a set of flags which modify the behavior of the
       signal.  It is formed by the bitwise OR of zero or more of the
       following:

           SA_NOCLDSTOP
                  If signum is SIGCHLD, do not receive notification when
                  child processes stop (i.e., when they receive one of
                  SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU) or resume (i.e.,
                  they receive SIGCONT) (see wait(2)).  This flag is
                  meaningful only when establishing a handler for SIGCHLD.

           SA_NOCLDWAIT (since Linux 2.6)
                  If signum is SIGCHLD, do not transform children into
                  zombies when they terminate.  See also waitpid(2).  This
                  flag is meaningful only when establishing a handler for
                  SIGCHLD, or when setting that signal's disposition to
                  SIG_DFL.

                  If the SA_NOCLDWAIT flag is set when establishing a
                  handler for SIGCHLD, POSIX.1 leaves it unspecified whether
                  a SIGCHLD signal is generated when a child process
                  terminates.  On Linux, a SIGCHLD signal is generated in
                  this case; on some other implementations, it is not.

           SA_NODEFER
                  Do not prevent the signal from being received from within
                  its own signal handler.  This flag is meaningful only when
                  establishing a signal handler.  SA_NOMASK is an obsolete,
                  nonstandard synonym for this flag.

           SA_ONSTACK
                  Call the signal handler on an alternate signal stack
                  provided by sigaltstack(2).  If an alternate stack is not
                  available, the default stack will be used.  This flag is
                  meaningful only when establishing a signal handler.

           SA_RESETHAND
                  Restore the signal action to the default upon entry to the
                  signal handler.  This flag is meaningful only when
                  establishing a signal handler.  SA_ONESHOT is an obsolete,
                  nonstandard synonym for this flag.

           SA_RESTART
                  Provide behavior compatible with BSD signal semantics by
                  making certain system calls restartable across signals.
                  This flag is meaningful only when establishing a signal
                  handler.  See signal(7) for a discussion of system call
                  restarting.

           SA_RESTORER
                  Not intended for application use.  This flag is used by C
                  libraries to indicate that the sa_restorer field contains
                  the address of a "signal trampoline".  See sigreturn(2)
                  for more details.

           SA_SIGINFO (since Linux 2.2)
                  The signal handler takes three arguments, not one.  In
                  this case, sa_sigaction should be set instead of
                  sa_handler.  This flag is meaningful only when
                  establishing a signal handler.

       The siginfo_t argument to sa_sigaction is a struct with the following
       fields:

           siginfo_t {
               int      si_signo;     /* Signal number */
               int      si_errno;     /* An errno value */
               int      si_code;      /* Signal code */
               int      si_trapno;    /* Trap number that caused
                                         hardware-generated signal
                                         (unused on most architectures) */
               pid_t    si_pid;       /* Sending process ID */
               uid_t    si_uid;       /* Real user ID of sending process */
               int      si_status;    /* Exit value or signal */
               clock_t  si_utime;     /* User time consumed */
               clock_t  si_stime;     /* System time consumed */
               sigval_t si_value;     /* Signal value */
               int      si_int;       /* POSIX.1b signal */
               void    *si_ptr;       /* POSIX.1b signal */
               int      si_overrun;   /* Timer overrun count;
                                         POSIX.1b timers */
               int      si_timerid;   /* Timer ID; POSIX.1b timers */
               void    *si_addr;      /* Memory location which caused fault */
               long     si_band;      /* Band event (was int in
                                         glibc 2.3.2 and earlier) */
               int      si_fd;        /* File descriptor */
               short    si_addr_lsb;  /* Least significant bit of address
                                         (since Linux 2.6.32) */
               void    *si_lower;     /* Lower bound when address violation
                                         occurred (since Linux 3.19) */
               void    *si_upper;     /* Upper bound when address violation
                                         occurred (since Linux 3.19) */
               int      si_pkey;      /* Protection key on PTE that caused
                                         fault (since Linux 4.6) */
               void    *si_call_addr; /* Address of system call instruction
                                         (since Linux 3.5) */
               int      si_syscall;   /* Number of attempted system call
                                         (since Linux 3.5) */
               unsigned int si_arch;  /* Architecture of attempted system call
                                         (since Linux 3.5) */
           }

       si_signo, si_errno and si_code are defined for all signals.
       (si_errno is generally unused on Linux.)  The rest of the struct may
       be a union, so that one should read only the fields that are
       meaningful for the given signal:

       * Signals sent with kill(2) and sigqueue(3) fill in si_pid and
         si_uid.  In addition, signals sent with sigqueue(3) fill in si_int
         and si_ptr with the values specified by the sender of the signal;
         see sigqueue(3) for more details.

       * Signals sent by POSIX.1b timers (since Linux 2.6) fill in
         si_overrun and si_timerid.  The si_timerid field is an internal ID
         used by the kernel to identify the timer; it is not the same as the
         timer ID returned by timer_create(2).  The si_overrun field is the
         timer overrun count; this is the same information as is obtained by
         a call to timer_getoverrun(2).  These fields are nonstandard Linux
         extensions.

       * Signals sent for message queue notification (see the description of
         SIGEV_SIGNAL in mq_notify(3)) fill in si_int/si_ptr, with the
         sigev_value supplied to mq_notify(3); si_pid, with the process ID
         of the message sender; and si_uid, with the real user ID of the
         message sender.

       * SIGCHLD fills in si_pid, si_uid, si_status, si_utime, and si_stime,
         providing information about the child.  The si_pid field is the
         process ID of the child; si_uid is the child's real user ID.  The
         si_status field contains the exit status of the child (if si_code
         is CLD_EXITED), or the signal number that caused the process to
         change state.  The si_utime and si_stime contain the user and
         system CPU time used by the child process; these fields do not
         include the times used by waited-for children (unlike getrusage(2)
         and times(2)).  In kernels up to 2.6, and since 2.6.27, these
         fields report CPU time in units of sysconf(_SC_CLK_TCK).  In 2.6
         kernels before 2.6.27, a bug meant that these fields reported time
         in units of the (configurable) system jiffy (see time(7)).

       * SIGILL, SIGFPE, SIGSEGV, SIGBUS, and SIGTRAP fill in si_addr with
         the address of the fault.  On some architectures, these signals
         also fill in the si_trapno field.

         Some suberrors of SIGBUS, in particular BUS_MCEERR_AO and
         BUS_MCEERR_AR, also fill in si_addr_lsb.  This field indicates the
         least significant bit of the reported address and therefore the
         extent of the corruption.  For example, if a full page was
         corrupted, si_addr_lsb contains log2(sysconf(_SC_PAGESIZE)).  When
         SIGTRAP is delivered in response to a ptrace(2) event
         (PTRACE_EVENT_foo), si_addr is not populated, but si_pid and si_uid
         are populated with the respective process ID and user ID
         responsible for delivering the trap.  In the case of seccomp(2),
         the tracee will be shown as delivering the event.  BUS_MCEERR_* and
         si_addr_lsb are Linux-specific extensions.

         The SEGV_BNDERR suberror of SIGSEGV populates si_lower and
         si_upper.

         The SEGV_PKUERR suberror of SIGSEGV populates si_pkey.

       * SIGIO/SIGPOLL (the two names are synonyms on Linux) fills in
         si_band and si_fd.  The si_band event is a bit mask containing the
         same values as are filled in the revents field by poll(2).  The
         si_fd field indicates the file descriptor for which the I/O event
         occurred; for further details, see the description of F_SETSIG in
         fcntl(2).

       * SIGSYS, generated (since Linux 3.5) when a seccomp filter returns
         SECCOMP_RET_TRAP, fills in si_call_addr, si_syscall, si_arch,
         si_errno, and other fields as described in seccomp(2).

       si_code is a value (not a bit mask) indicating why this signal was
       sent.  For a ptrace(2) event, si_code will contain SIGTRAP and have
       the ptrace event in the high byte:

           (SIGTRAP | PTRACE_EVENT_foo << 8).

       For a regular signal, the following list shows the values which can
       be placed in si_code for any signal, along with the reason that the
       signal was generated.

           SI_USER
                  kill(2).

           SI_KERNEL
                  Sent by the kernel.

           SI_QUEUE
                  sigqueue(3).

           SI_TIMER
                  POSIX timer expired.

           SI_MESGQ (since Linux 2.6.6)
                  POSIX message queue state changed; see mq_notify(3).

           SI_ASYNCIO
                  AIO completed.

           SI_SIGIO
                  Queued SIGIO (only in kernels up to Linux 2.2; from Linux
                  2.4 onward SIGIO/SIGPOLL fills in si_code as described
                  below).

           SI_TKILL (since Linux 2.4.19)
                  tkill(2) or tgkill(2).

       The following values can be placed in si_code for a SIGILL signal:

           ILL_ILLOPC
                  Illegal opcode.

           ILL_ILLOPN
                  Illegal operand.

           ILL_ILLADR
                  Illegal addressing mode.

           ILL_ILLTRP
                  Illegal trap.

           ILL_PRVOPC
                  Privileged opcode.

           ILL_PRVREG
                  Privileged register.

           ILL_COPROC
                  Coprocessor error.

           ILL_BADSTK
                  Internal stack error.

       The following values can be placed in si_code for a SIGFPE signal:

           FPE_INTDIV
                  Integer divide by zero.

           FPE_INTOVF
                  Integer overflow.

           FPE_FLTDIV
                  Floating-point divide by zero.

           FPE_FLTOVF
                  Floating-point overflow.

           FPE_FLTUND
                  Floating-point underflow.

           FPE_FLTRES
                  Floating-point inexact result.

           FPE_FLTINV
                  Floating-point invalid operation.

           FPE_FLTSUB
                  Subscript out of range.

       The following values can be placed in si_code for a SIGSEGV signal:

           SEGV_MAPERR
                  Address not mapped to object.

           SEGV_ACCERR
                  Invalid permissions for mapped object.

           SEGV_BNDERR (since Linux 3.19)
                  Failed address bound checks.

           SEGV_PKUERR (since Linux 4.6)
                  Protection key fault.

       The following values can be placed in si_code for a SIGBUS signal:

           BUS_ADRALN
                  Invalid address alignment.

           BUS_ADRERR
                  Nonexistent physical address.

           BUS_OBJERR
                  Object-specific hardware error.

           BUS_MCEERR_AR (since Linux 2.6.32)
                  Hardware memory error consumed on a machine check; action
                  required.

           BUS_MCEERR_AO (since Linux 2.6.32)
                  Hardware memory error detected in process but not
                  consumed; action optional.

       The following values can be placed in si_code for a SIGTRAP signal:

           TRAP_BRKPT
                  Process breakpoint.

           TRAP_TRACE
                  Process trace trap.

           TRAP_BRANCH (since Linux 2.4)
                  Process taken branch trap.

           TRAP_HWBKPT (since Linux 2.4)
                  Hardware breakpoint/watchpoint.

       The following values can be placed in si_code for a SIGCHLD signal:

           CLD_EXITED
                  Child has exited.

           CLD_KILLED
                  Child was killed.

           CLD_DUMPED
                  Child terminated abnormally.

           CLD_TRAPPED
                  Traced child has trapped.

           CLD_STOPPED
                  Child has stopped.

           CLD_CONTINUED (since Linux 2.6.9)
                  Stopped child has continued.

       The following values can be placed in si_code for a SIGIO/SIGPOLL
       signal:

           POLL_IN
                  Data input available.

           POLL_OUT
                  Output buffers available.

           POLL_MSG
                  Input message available.

           POLL_ERR
                  I/O error.

           POLL_PRI
                  High priority input available.

           POLL_HUP
                  Device disconnected.

       The following value can be placed in si_code for a SIGSYS signal:

           SYS_SECCOMP (since Linux 3.5)
                  Triggered by a seccomp(2) filter rule.
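
       The si_code lists above can be exercised with a handler established
       using SA_SIGINFO.  The following is a minimal sketch, not part of
       the manual page: describe_si_code(), record_code(), and
       capture_si_code() are hypothetical names, only a handful of codes
       are covered, and the demo assumes a single-threaded process in
       which the chosen signal is not blocked.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: describe a few (signal, si_code) pairs from the
 * lists above.  Generic SI_* codes are checked first because they can
 * accompany any signal. */
const char *describe_si_code(int signo, int code)
{
    if (code == SI_USER)
        return "kill(2)";
    if (code == SI_QUEUE)
        return "sigqueue(3)";
    if (code == SI_TIMER)
        return "POSIX timer expired";

    switch (signo) {
    case SIGSEGV:
        if (code == SEGV_MAPERR)
            return "address not mapped to object";
        if (code == SEGV_ACCERR)
            return "invalid permissions for mapped object";
        break;
    case SIGFPE:
        if (code == FPE_INTDIV)
            return "integer divide by zero";
        break;
    case SIGCHLD:
        if (code == CLD_EXITED)
            return "child has exited";
        break;
    default:
        break;
    }
    return "unknown";
}

static volatile sig_atomic_t handler_ran;
static volatile sig_atomic_t seen_code;

/* SA_SIGINFO handler: only records si_code. */
static void record_code(int signo, siginfo_t *info, void *ucontext)
{
    (void)signo;
    (void)ucontext;
    seen_code = info->si_code;
    handler_ran = 1;
}

/* Hypothetical demo: send signo to ourselves with kill(2) and store the
 * si_code seen by the handler in *code.  Returns 0 on success, -1 on
 * error or if the handler did not run. */
int capture_si_code(int signo, int *code)
{
    struct sigaction sa, old;
    int rc = -1;

    assert(code != NULL);
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = record_code;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(signo, &sa, &old) == -1)
        return -1;

    handler_ran = 0;
    if (kill(getpid(), signo) == 0 && handler_ran) {
        *code = seen_code;
        rc = 0;
    }
    sigaction(signo, &old, NULL);   /* restore the previous disposition */
    return rc;
}
```

       Calling capture_si_code(SIGUSR1, &code) and then passing code to
       describe_si_code() should report a signal sent by kill(2), since
       the kernel sets si_code to SI_USER in that case.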
http://man7.org/linux/man-pages/man2/rt_sigprocmask.2.html
10
SYSTEM CALL:
sigprocmask(2) - Linux manual page
FUNCTIONALITY:

       sigprocmask, rt_sigprocmask - examine and change blocked signals
SYNOPSIS:

       #include <signal.h>

       int sigprocmask(int how, const sigset_t *set, sigset_t *oldset);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       sigprocmask(): _POSIX_C_SOURCE
DESCRIPTION

       sigprocmask() is used to fetch and/or change the signal mask of the
       calling thread.  The signal mask is the set of signals whose delivery
       is currently blocked for the caller (see also signal(7) for more
       details).

       The behavior of the call is dependent on the value of how, as
       follows.

       SIG_BLOCK
              The set of blocked signals is the union of the current set and
              the set argument.

       SIG_UNBLOCK
              The signals in set are removed from the current set of blocked
              signals.  It is permissible to attempt to unblock a signal
              which is not blocked.

       SIG_SETMASK
              The set of blocked signals is set to the argument set.

       If oldset is non-NULL, the previous value of the signal mask is
       stored in oldset.

       If set is NULL, then the signal mask is unchanged (i.e., how is
       ignored), but the current value of the signal mask is nevertheless
       returned in oldset (if it is not NULL).

       A set of functions for modifying and inspecting variables of type
       sigset_t ("signal sets") is described in sigsetops(3).

       The use of sigprocmask() is unspecified in a multithreaded process;
       see pthread_sigmask(3).
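
       As an illustration of the three uses described above (fetching the
       mask, adding to it, and restoring it), here is a minimal sketch for
       a single-threaded program.  The helper names is_blocked(),
       block_signal(), and restore_mask() are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical helper: report whether signo is in the caller's signal
 * mask.  With set == NULL the mask is left unchanged and only fetched. */
int is_blocked(int signo)
{
    sigset_t current;

    if (sigprocmask(SIG_BLOCK, NULL, &current) == -1)
        return -1;
    return sigismember(&current, signo);
}

/* Hypothetical helper: add signo to the blocked set and save the
 * previous mask in *saved so that it can be restored later. */
int block_signal(int signo, sigset_t *saved)
{
    sigset_t set;

    assert(saved != NULL);
    if (sigemptyset(&set) == -1 || sigaddset(&set, signo) == -1)
        return -1;
    return sigprocmask(SIG_BLOCK, &set, saved);
}

/* Hypothetical helper: reinstate a mask saved by block_signal(). */
int restore_mask(const sigset_t *saved)
{
    assert(saved != NULL);
    return sigprocmask(SIG_SETMASK, saved, NULL);
}
```

       Restoring the saved mask with SIG_SETMASK, rather than calling
       SIG_UNBLOCK, keeps the signal blocked if it was already blocked
       before the critical section began.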
http://man7.org/linux/man-pages/man2/rt_sigpending.2.html
11
SYSTEM CALL:
sigpending(2) - Linux manual page
FUNCTIONALITY:

       sigpending, rt_sigpending - examine pending signals
SYNOPSIS:

       #include <signal.h>

       int sigpending(sigset_t *set);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       sigpending(): _POSIX_C_SOURCE
DESCRIPTION

       sigpending() returns the set of signals that are pending for delivery
       to the calling thread (i.e., the signals which have been raised while
       blocked).  The mask of pending signals is returned in set.
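
       A minimal sketch of how a pending signal can be observed, assuming
       a single-threaded process: the signal is blocked, raised, checked
       with sigpending(), and then consumed with sigwait(3) so that it is
       not delivered when the mask is restored.  The function name
       pending_after_raise() is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical demo: returns 1 if signo is reported as pending after
 * being raised while blocked, 0 if not, and -1 on error. */
int pending_after_raise(int signo)
{
    sigset_t block, old, pending;
    int caught = 0;
    int result = -1;

    if (sigemptyset(&block) == -1 || sigaddset(&block, signo) == -1)
        return -1;
    if (sigprocmask(SIG_BLOCK, &block, &old) == -1)
        return -1;

    raise(signo);                   /* blocked, so it stays pending */

    if (sigpending(&pending) == 0)
        result = sigismember(&pending, signo);

    /* Consume the pending signal before the old mask is restored. */
    if (result == 1 && sigwait(&block, &caught) == 0)
        assert(caught == signo);

    sigprocmask(SIG_SETMASK, &old, NULL);
    return result;
}
```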
http://man7.org/linux/man-pages/man2/rt_sigqueueinfo.2.html
11
SYSTEM CALL:
rt_sigqueueinfo(2) - Linux manual page
FUNCTIONALITY:

       rt_sigqueueinfo, rt_tgsigqueueinfo - queue a signal and data
SYNOPSIS:

       int rt_sigqueueinfo(pid_t tgid, int sig, siginfo_t *uinfo);

       int rt_tgsigqueueinfo(pid_t tgid, pid_t tid, int sig,
                             siginfo_t *uinfo);

       Note: There are no glibc wrappers for these system calls; see NOTES.
DESCRIPTION

       The rt_sigqueueinfo() and rt_tgsigqueueinfo() system calls are the
       low-level interfaces used to send a signal plus data to a process or
       thread.  The receiver of the signal can obtain the accompanying data
       by establishing a signal handler with the sigaction(2) SA_SIGINFO
       flag.

       These system calls are not intended for direct application use; they
       are provided to allow the implementation of sigqueue(3) and
       pthread_sigqueue(3).

       The rt_sigqueueinfo() system call sends the signal sig to the thread
       group with the ID tgid.  (The term "thread group" is synonymous with
       "process", and tgid corresponds to the traditional UNIX process ID.)
       The signal will be delivered to an arbitrary member of the thread
       group (i.e., one of the threads that is not currently blocking the
       signal).

       The uinfo argument specifies the data to accompany the signal.  This
       argument is a pointer to a structure of type siginfo_t, described in
       sigaction(2) (and defined by including <signal.h>).  The caller
       should set the following fields in this structure:

       si_code
              This must be one of the SI_* codes in the Linux kernel source
              file include/asm-generic/siginfo.h, with the restriction that
              the code must be negative (i.e., cannot be SI_USER, which is
              used by the kernel to indicate a signal sent by kill(2)) and
              cannot (since Linux 2.6.39) be SI_TKILL (which is used by the
              kernel to indicate a signal sent using tgkill(2)).

       si_pid This should be set to a process ID, typically the process ID
              of the sender.

       si_uid This should be set to a user ID, typically the real user ID of
              the sender.

       si_value
              This field contains the user data to accompany the signal.
              For more information, see the description of the last (union
              sigval) argument of sigqueue(3).

       Internally, the kernel sets the si_signo field to the value specified
       in sig, so that the receiver of the signal can also obtain the signal
       number via that field.

       The rt_tgsigqueueinfo() system call is like rt_sigqueueinfo(), but
       sends the signal and data to the single thread specified by the
       combination of tgid, a thread group ID, and tid, a thread in that
       thread group.
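
       Because there is no glibc wrapper, rt_sigqueueinfo() can only be
       reached through syscall(2).  The following is a minimal sketch,
       for illustration only, in which a process queues a signal and an
       integer to itself and then collects it with sigwaitinfo(2).  The
       function name queue_int_to_self() is hypothetical; applications
       would normally call sigqueue(3) instead.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical demo: queue signo plus an integer payload to the calling
 * process and store what was received in *received.
 * Returns 0 on success, -1 on error. */
int queue_int_to_self(int signo, int payload, siginfo_t *received)
{
    sigset_t set, old;
    siginfo_t info;
    int rc = -1;

    assert(received != NULL);
    if (sigemptyset(&set) == -1 || sigaddset(&set, signo) == -1)
        return -1;
    /* Block the signal so that it stays queued until sigwaitinfo(). */
    if (sigprocmask(SIG_BLOCK, &set, &old) == -1)
        return -1;

    memset(&info, 0, sizeof(info));
    info.si_code = SI_QUEUE;            /* negative, as required */
    info.si_pid = getpid();
    info.si_uid = getuid();
    info.si_value.sival_int = payload;

    if (syscall(SYS_rt_sigqueueinfo, getpid(), signo, &info) == 0 &&
        sigwaitinfo(&set, received) == signo)
        rc = 0;

    sigprocmask(SIG_SETMASK, &old, NULL);
    return rc;
}
```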
http://man7.org/linux/man-pages/man2/rt_tgsigqueueinfo.2.html
11
SYSTEM CALL:
rt_sigqueueinfo(2) - Linux manual page
FUNCTIONALITY:

       rt_sigqueueinfo, rt_tgsigqueueinfo - queue a signal and data
SYNOPSIS:

       int rt_sigqueueinfo(pid_t tgid, int sig, siginfo_t *uinfo);

       int rt_tgsigqueueinfo(pid_t tgid, pid_t tid, int sig,
                             siginfo_t *uinfo);

       Note: There are no glibc wrappers for these system calls; see NOTES.
DESCRIPTION

       The rt_sigqueueinfo() and rt_tgsigqueueinfo() system calls are the
       low-level interfaces used to send a signal plus data to a process or
       thread.  The receiver of the signal can obtain the accompanying data
       by establishing a signal handler with the sigaction(2) SA_SIGINFO
       flag.

       These system calls are not intended for direct application use; they
       are provided to allow the implementation of sigqueue(3) and
       pthread_sigqueue(3).

       The rt_sigqueueinfo() system call sends the signal sig to the thread
       group with the ID tgid.  (The term "thread group" is synonymous with
       "process", and tgid corresponds to the traditional UNIX process ID.)
       The signal will be delivered to an arbitrary member of the thread
       group (i.e., one of the threads that is not currently blocking the
       signal).

       The uinfo argument specifies the data to accompany the signal.  This
       argument is a pointer to a structure of type siginfo_t, described in
       sigaction(2) (and defined by including <signal.h>).  The caller
       should set the following fields in this structure:

       si_code
              This must be one of the SI_* codes in the Linux kernel source
              file include/asm-generic/siginfo.h, with the restriction that
              the code must be negative (i.e., cannot be SI_USER, which is
              used by the kernel to indicate a signal sent by kill(2)) and
              cannot (since Linux 2.6.39) be SI_TKILL (which is used by the
              kernel to indicate a signal sent using tgkill(2)).

       si_pid This should be set to a process ID, typically the process ID
              of the sender.

       si_uid This should be set to a user ID, typically the real user ID of
              the sender.

       si_value
              This field contains the user data to accompany the signal.
              For more information, see the description of the last (union
              sigval) argument of sigqueue(3).

       Internally, the kernel sets the si_signo field to the value specified
       in sig, so that the receiver of the signal can also obtain the signal
       number via that field.

       The rt_tgsigqueueinfo() system call is like rt_sigqueueinfo(), but
       sends the signal and data to the single thread specified by the
       combination of tgid, a thread group ID, and tid, a thread in that
       thread group.
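
       A minimal sketch of the thread-directed variant, for illustration
       only: the calling thread obtains its own thread ID with the
       gettid system call and queues a signal to itself through
       syscall(2).  The function name queue_int_to_this_thread() is
       hypothetical; pthread_sigqueue(3) is the interface normally used.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demo: queue signo plus an integer payload to the calling
 * thread only, then collect it with sigwaitinfo(2).
 * Returns 0 on success, -1 on error. */
int queue_int_to_this_thread(int signo, int payload, siginfo_t *received)
{
    sigset_t set, old;
    siginfo_t info;
    pid_t tgid = getpid();                    /* thread group ID */
    pid_t tid = (pid_t)syscall(SYS_gettid);   /* this thread's ID */
    int rc = -1;

    assert(received != NULL);
    if (sigemptyset(&set) == -1 || sigaddset(&set, signo) == -1)
        return -1;
    if (sigprocmask(SIG_BLOCK, &set, &old) == -1)
        return -1;

    memset(&info, 0, sizeof(info));
    info.si_code = SI_QUEUE;
    info.si_pid = tgid;
    info.si_uid = getuid();
    info.si_value.sival_int = payload;

    if (syscall(SYS_rt_tgsigqueueinfo, tgid, tid, signo, &info) == 0 &&
        sigwaitinfo(&set, received) == signo)
        rc = 0;

    sigprocmask(SIG_SETMASK, &old, NULL);
    return rc;
}
```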
http://man7.org/linux/man-pages/man2/rt_sigtimedwait.2.html
10
SYSTEM CALL:
sigwaitinfo(2) - Linux manual page
FUNCTIONALITY:

       sigwaitinfo,  sigtimedwait,  rt_sigtimedwait - synchronously wait for
       queued signals
SYNOPSIS:

       #include <signal.h>

       int sigwaitinfo(const sigset_t *set, siginfo_t *info);

       int sigtimedwait(const sigset_t *set, siginfo_t *info,
                        const struct timespec *timeout);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       sigwaitinfo(), sigtimedwait(): _POSIX_C_SOURCE >= 199309L
DESCRIPTION

       sigwaitinfo() suspends execution of the calling thread until one of
       the signals in set is pending.  (If one of the signals in set is
       already pending for the calling thread, sigwaitinfo() will return
       immediately.)

       sigwaitinfo() removes the signal from the set of pending signals and
       returns the signal number as its function result.  If the info
       argument is not NULL, then the buffer that it points to is used to
       return a structure of type siginfo_t (see sigaction(2)) containing
       information about the signal.

       If multiple signals in set are pending for the caller, the signal
       that is retrieved by sigwaitinfo() is determined according to the
       usual ordering rules; see signal(7) for further details.

       sigtimedwait() operates in exactly the same way as sigwaitinfo()
       except that it has an additional argument, timeout, which specifies
       the interval for which the thread is suspended waiting for a signal.
       (This interval will be rounded up to the system clock granularity,
       and kernel scheduling delays mean that the interval may overrun by a
       small amount.)  This argument is of the following type:

           struct timespec {
               long    tv_sec;         /* seconds */
               long    tv_nsec;        /* nanoseconds */
           };

       If both fields of this structure are specified as 0, a poll is
       performed: sigtimedwait() returns immediately, either with
       information about a signal that was pending for the caller, or with
       an error if none of the signals in set was pending.
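
       The timeout handling can be sketched as follows.  The helper
       wait_for_signal_ms() is a hypothetical name; it converts a
       millisecond count into a struct timespec, so a value of 0 gives
       the polling behavior described above.  The signals in set are
       assumed to be blocked by the caller beforehand.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: wait up to ms milliseconds for a signal in set.
 * Returns the signal number, or -1 with errno set (EAGAIN when the
 * timeout expired, EINTR when interrupted by an unrelated handler). */
int wait_for_signal_ms(const sigset_t *set, long ms, siginfo_t *info)
{
    struct timespec timeout;
    siginfo_t scratch;

    assert(set != NULL && ms >= 0);
    timeout.tv_sec = ms / 1000;
    timeout.tv_nsec = (ms % 1000) * 1000000L;

    /* Use a local buffer when the caller does not want the details. */
    return sigtimedwait(set, info != NULL ? info : &scratch, &timeout);
}
```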
http://man7.org/linux/man-pages/man2/rt_sigsuspend.2.html
10
SYSTEM CALL:
sigsuspend(2) - Linux manual page
FUNCTIONALITY:

       sigsuspend, rt_sigsuspend - wait for a signal
SYNOPSIS:

       #include <signal.h>

       int sigsuspend(const sigset_t *mask);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       sigsuspend(): _POSIX_C_SOURCE
DESCRIPTION

       sigsuspend() temporarily replaces the signal mask of the calling
       process with the mask given by mask and then suspends the process
       until delivery of a signal whose action is to invoke a signal handler
       or to terminate a process.

       If the signal terminates the process, then sigsuspend() does not
       return.  If the signal is caught, then sigsuspend() returns after the
       signal handler returns, and the signal mask is restored to the state
       before the call to sigsuspend().

       It is not possible to block SIGKILL or SIGSTOP; specifying these
       signals in mask has no effect on the process's signal mask.
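
       A common way to use sigsuspend() is to keep a signal blocked while
       testing a flag, and to unblock it only for the duration of the
       wait.  The following minimal sketch assumes a single-threaded
       process; on_signal() and raise_and_wait() are hypothetical names,
       and the signal is raised by the demo itself so that the wait is
       guaranteed to end.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t got_signal;

static void on_signal(int signo)
{
    got_signal = signo;
}

/* Hypothetical demo: raise signo while it is blocked, then wait for its
 * handler to run with sigsuspend().  Returns the signal number recorded
 * by the handler, or -1 on error. */
int raise_and_wait(int signo)
{
    struct sigaction sa, old_sa;
    sigset_t block, old_mask, wait_mask;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    if (sigaction(signo, &sa, &old_sa) == -1)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, signo);
    if (sigprocmask(SIG_BLOCK, &block, &old_mask) == -1) {
        sigaction(signo, &old_sa, NULL);
        return -1;
    }

    got_signal = 0;
    raise(signo);                   /* stays pending while blocked */

    /* Wait with the original mask, minus signo, temporarily in place. */
    wait_mask = old_mask;
    sigdelset(&wait_mask, signo);
    while (got_signal == 0)
        sigsuspend(&wait_mask);     /* returns after the handler ran */

    sigprocmask(SIG_SETMASK, &old_mask, NULL);
    sigaction(signo, &old_sa, NULL);
    return got_signal;
}
```

       Testing the flag while the signal is blocked avoids the race in
       which the signal arrives between the test and the wait.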
http://man7.org/linux/man-pages/man2/rt_sigreturn.2.html
9
SYSTEM CALL:
sigreturn(2) - Linux manual page
FUNCTIONALITY:

       sigreturn,  rt_sigreturn  -  return  from  signal handler and cleanup
       stack frame
SYNOPSIS:

       int sigreturn(...);
DESCRIPTION

       If the Linux kernel determines that an unblocked signal is pending
       for a process, then, at the next transition back to user mode in that
       process (e.g., upon return from a system call or when the process is
       rescheduled onto the CPU), it saves various pieces of process context
       (processor status word, registers, signal mask, and signal stack
       settings) into the user-space stack.

       The kernel also arranges that, during the transition back to user
       mode, the signal handler is called, and that, upon return from the
       handler, control passes to a piece of user-space code commonly called
       the "signal trampoline".  The signal trampoline code in turn calls
       sigreturn().

       This sigreturn() call undoes everything that was done—changing the
       process's signal mask, switching signal stacks (see
       sigaltstack(2))—in order to invoke the signal handler.  It restores
       the process's signal mask, switches stacks, and restores the
       process's context (processor flags and registers, including the stack
       pointer and instruction pointer), so that the process resumes
       execution at the point where it was interrupted by the signal.
http://man7.org/linux/man-pages/man2/sigaltstack.2.html
12
SYSTEM CALL:
sigaltstack(2) - Linux manual page
FUNCTIONALITY:

       sigaltstack - set and/or get signal stack context
SYNOPSIS:

       #include <signal.h>

       int sigaltstack(const stack_t *ss, stack_t *oss);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       sigaltstack():
           _XOPEN_SOURCE >= 500
               || /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
               || /* Glibc versions <= 2.19: */ _BSD_SOURCE
DESCRIPTION

       sigaltstack() allows a process to define a new alternate signal stack
       and/or retrieve the state of an existing alternate signal stack.  An
       alternate signal stack is used during the execution of a signal
       handler if the establishment of that handler (see sigaction(2))
       requested it.

       The normal sequence of events for using an alternate signal stack is
       the following:

       1. Allocate an area of memory to be used for the alternate signal
          stack.

       2. Use sigaltstack() to inform the system of the existence and
          location of the alternate signal stack.

       3. When establishing a signal handler using sigaction(2), inform the
          system that the signal handler should be executed on the alternate
          signal stack by specifying the SA_ONSTACK flag.

       The ss argument is used to specify a new alternate signal stack,
       while the oss argument is used to retrieve information about the
       currently established signal stack.  If we are interested in
       performing just one of these tasks, then the other argument can be
       specified as NULL.  Each of these arguments is a pointer to a
       structure of the following type:

           typedef struct {
               void  *ss_sp;     /* Base address of stack */
               int    ss_flags;  /* Flags */
               size_t ss_size;   /* Number of bytes in stack */
           } stack_t;

       To establish a new alternate signal stack, ss.ss_flags is set to
       zero, and ss.ss_sp and ss.ss_size specify the starting address and
       size of the stack.  The constant SIGSTKSZ is defined to be large
       enough to cover the usual size requirements for an alternate signal
       stack, and the constant MINSIGSTKSZ defines the minimum size required
       to execute a signal handler.

       When a signal handler is invoked on the alternate stack, the kernel
       automatically aligns the address given in ss.ss_sp to a suitable
       address boundary for the underlying hardware architecture.

       To disable an existing stack, specify ss.ss_flags as SS_DISABLE.  In
       this case, the remaining fields in ss are ignored.

       If oss is not NULL, then it is used to return information about the
       alternate signal stack which was in effect prior to the call to
       sigaltstack().  The oss.ss_sp and oss.ss_size fields return the
       starting address and size of that stack.  The oss.ss_flags may return
       either of the following values:

       SS_ONSTACK
              The process is currently executing on the alternate signal
              stack.  (Note that it is not possible to change the alternate
              signal stack if the process is currently executing on it.)

       SS_DISABLE
              The alternate signal stack is currently disabled.
http://man7.org/linux/man-pages/man2/signalfd.2.html
13
SYSTEM CALL:
signalfd(2) - Linux manual page
FUNCTIONALITY:

       signalfd - create a file descriptor for accepting signals
SYNOPSIS:

       #include <sys/signalfd.h>

       int signalfd(int fd, const sigset_t *mask, int flags);
DESCRIPTION

       signalfd() creates a file descriptor that can be used to accept
       signals targeted at the caller.  This provides an alternative to the
       use of a signal handler or sigwaitinfo(2), and has the advantage that
       the file descriptor may be monitored by select(2), poll(2), and
       epoll(7).

       The mask argument specifies the set of signals that the caller wishes
       to accept via the file descriptor.  This argument is a signal set
       whose contents can be initialized using the macros described in
       sigsetops(3).  Normally, the set of signals to be received via the
       file descriptor should be blocked using sigprocmask(2), to prevent
       the signals being handled according to their default dispositions.
       It is not possible to receive SIGKILL or SIGSTOP signals via a
       signalfd file descriptor; these signals are silently ignored if
       specified in mask.

       If the fd argument is -1, then the call creates a new file descriptor
       and associates the signal set specified in mask with that file
       descriptor.  If fd is not -1, then it must specify a valid existing
       signalfd file descriptor, and mask is used to replace the signal set
       associated with that file descriptor.

       Starting with Linux 2.6.27, the following values may be bitwise ORed
       in flags to change the behavior of signalfd():

       SFD_NONBLOCK  Set the O_NONBLOCK file status flag on the new open
                     file description.  Using this flag saves extra calls to
                     fcntl(2) to achieve the same result.

       SFD_CLOEXEC   Set the close-on-exec (FD_CLOEXEC) flag on the new file
                     descriptor.  See the description of the O_CLOEXEC flag
                     in open(2) for reasons why this may be useful.

       In Linux up to version 2.6.26, the flags argument is unused, and must
       be specified as zero.

       signalfd() returns a file descriptor that supports the following
       operations:

       read(2)
              If one or more of the signals specified in mask is pending for
              the process, then the buffer supplied to read(2) is used to
              return one or more signalfd_siginfo structures (see below)
              that describe the signals.  The read(2) returns information
              for as many signals as are pending and will fit in the
              supplied buffer.  The buffer must be at least sizeof(struct
              signalfd_siginfo) bytes.  The return value of the read(2) is
              the total number of bytes read.

              As a consequence of the read(2), the signals are consumed, so
              that they are no longer pending for the process (i.e., will
              not be caught by signal handlers, and cannot be accepted using
              sigwaitinfo(2)).

              If none of the signals in mask is pending for the process,
              then the read(2) either blocks until one of the signals in
              mask is generated for the process, or fails with the error
              EAGAIN if the file descriptor has been made nonblocking.

       poll(2), select(2) (and similar)
              The file descriptor is readable (the select(2) readfds
              argument; the poll(2) POLLIN flag) if one or more of the
              signals in mask is pending for the process.

              The signalfd file descriptor also supports the other file-
              descriptor multiplexing APIs: pselect(2), ppoll(2), and
              epoll(7).

       close(2)
              When the file descriptor is no longer required it should be
              closed.  When all file descriptors associated with the same
              signalfd object have been closed, the resources for the object
              are freed by the kernel.

   The signalfd_siginfo structure
       The format of the signalfd_siginfo structure(s) returned by read(2)s
       from a signalfd file descriptor is as follows:

           struct signalfd_siginfo {
               uint32_t ssi_signo;   /* Signal number */
               int32_t  ssi_errno;   /* Error number (unused) */
               int32_t  ssi_code;    /* Signal code */
               uint32_t ssi_pid;     /* PID of sender */
               uint32_t ssi_uid;     /* Real UID of sender */
               int32_t  ssi_fd;      /* File descriptor (SIGIO) */
               uint32_t ssi_tid;     /* Kernel timer ID (POSIX timers) */
               uint32_t ssi_band;    /* Band event (SIGIO) */
               uint32_t ssi_overrun; /* POSIX timer overrun count */
               uint32_t ssi_trapno;  /* Trap number that caused signal */
               int32_t  ssi_status;  /* Exit status or signal (SIGCHLD) */
               int32_t  ssi_int;     /* Integer sent by sigqueue(3) */
               uint64_t ssi_ptr;     /* Pointer sent by sigqueue(3) */
               uint64_t ssi_utime;   /* User CPU time consumed (SIGCHLD) */
               uint64_t ssi_stime;   /* System CPU time consumed (SIGCHLD) */
               uint64_t ssi_addr;    /* Address that generated signal
                                        (for hardware-generated signals) */
               uint8_t  pad[X];      /* Pad size to 128 bytes (allow for
                                         additional fields in the future) */
           };

       Each of the fields in this structure is analogous to the similarly
       named field in the siginfo_t structure.  The siginfo_t structure is
       described in sigaction(2).  Not all fields in the returned
       signalfd_siginfo structure will be valid for a specific signal; the
       set of valid fields can be determined from the value returned in the
       ssi_code field.  This field is the analog of the siginfo_t si_code
       field; see sigaction(2) for details.
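
       A minimal sketch of reading one signalfd_siginfo structure,
       assuming a single-threaded process.  The function name
       read_signal_via_fd() is hypothetical.  The signal is blocked
       first, as recommended above, so that its default disposition is
       not taken.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/signalfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demo: raise signo and fetch its description through a
 * signalfd file descriptor.  Returns 0 on success, -1 on error. */
int read_signal_via_fd(int signo, struct signalfd_siginfo *out)
{
    sigset_t mask, old;
    int fd;
    int rc = -1;

    assert(out != NULL);
    if (sigemptyset(&mask) == -1 || sigaddset(&mask, signo) == -1)
        return -1;
    if (sigprocmask(SIG_BLOCK, &mask, &old) == -1)
        return -1;

    fd = signalfd(-1, &mask, SFD_CLOEXEC);  /* -1: create a new descriptor */
    if (fd != -1) {
        raise(signo);                       /* becomes pending */
        /* The read consumes the signal, so it is no longer pending. */
        if (read(fd, out, sizeof(*out)) == (ssize_t)sizeof(*out))
            rc = 0;
        close(fd);
    }

    sigprocmask(SIG_SETMASK, &old, NULL);
    return rc;
}
```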

   fork(2) semantics
       After a fork(2), the child inherits a copy of the signalfd file
       descriptor.  A read(2) from the file descriptor in the child will
       return information about signals queued to the child.

   Semantics of file descriptor passing
       As with other file descriptors, signalfd file descriptors can be
       passed to another process via a UNIX domain socket (see unix(7)).  In
       the receiving process, a read(2) from the received file descriptor
       will return information about signals queued to that process.

   execve(2) semantics
       Just like any other file descriptor, a signalfd file descriptor
       remains open across an execve(2), unless it has been marked for
       close-on-exec (see fcntl(2)).  Any signals that were available for
       reading before the execve(2) remain available to the newly loaded
       program.  (This is analogous to traditional signal semantics, where a
       blocked signal that is pending remains pending across an execve(2).)

   Thread semantics
       The semantics of signalfd file descriptors in a multithreaded program
       mirror the standard semantics for signals.  In other words, when a
       thread reads from a signalfd file descriptor, it will read the
       signals that are directed to the thread itself and the signals that
       are directed to the process (i.e., the entire thread group).  (A
       thread will not be able to read signals that are directed to other
       threads in the process.)
http://man7.org/linux/man-pages/man2/signalfd4.2.html
13
SYSTEM CALL:
signalfd(2) - Linux manual page
FUNCTIONALITY:

       signalfd - create a file descriptor for accepting signals
SYNOPSIS:

       #include <sys/signalfd.h>

       int signalfd(int fd, const sigset_t *mask, int flags);
DESCRIPTION

       signalfd() creates a file descriptor that can be used to accept
       signals targeted at the caller.  This provides an alternative to the
       use of a signal handler or sigwaitinfo(2), and has the advantage that
       the file descriptor may be monitored by select(2), poll(2), and
       epoll(7).

       The mask argument specifies the set of signals that the caller wishes
       to accept via the file descriptor.  This argument is a signal set
       whose contents can be initialized using the macros described in
       sigsetops(3).  Normally, the set of signals to be received via the
       file descriptor should be blocked using sigprocmask(2), to prevent
       the signals being handled according to their default dispositions.
       It is not possible to receive SIGKILL or SIGSTOP signals via a
       signalfd file descriptor; these signals are silently ignored if
       specified in mask.

       If the fd argument is -1, then the call creates a new file descriptor
       and associates the signal set specified in mask with that file
       descriptor.  If fd is not -1, then it must specify a valid existing
       signalfd file descriptor, and mask is used to replace the signal set
       associated with that file descriptor.

       Starting with Linux 2.6.27, the following values may be bitwise ORed
       in flags to change the behavior of signalfd():

       SFD_NONBLOCK  Set the O_NONBLOCK file status flag on the new open
                     file description.  Using this flag saves extra calls to
                     fcntl(2) to achieve the same result.

       SFD_CLOEXEC   Set the close-on-exec (FD_CLOEXEC) flag on the new file
                     descriptor.  See the description of the O_CLOEXEC flag
                     in open(2) for reasons why this may be useful.

       In Linux up to version 2.6.26, the flags argument is unused, and must
       be specified as zero.

       signalfd() returns a file descriptor that supports the following
       operations:

       read(2)
              If one or more of the signals specified in mask is pending for
              the process, then the buffer supplied to read(2) is used to
              return one or more signalfd_siginfo structures (see below)
              that describe the signals.  The read(2) returns information
              for as many signals as are pending and will fit in the
              supplied buffer.  The buffer must be at least sizeof(struct
              signalfd_siginfo) bytes.  The return value of the read(2) is
              the total number of bytes read.

              As a consequence of the read(2), the signals are consumed, so
              that they are no longer pending for the process (i.e., will
              not be caught by signal handlers, and cannot be accepted using
              sigwaitinfo(2)).

              If none of the signals in mask is pending for the process,
              then the read(2) either blocks until one of the signals in
              mask is generated for the process, or fails with the error
              EAGAIN if the file descriptor has been made nonblocking.

       poll(2), select(2) (and similar)
              The file descriptor is readable (the select(2) readfds
              argument; the poll(2) POLLIN flag) if one or more of the
              signals in mask is pending for the process.

              The signalfd file descriptor also supports the other file-
              descriptor multiplexing APIs: pselect(2), ppoll(2), and
              epoll(7).

       close(2)
              When the file descriptor is no longer required it should be
              closed.  When all file descriptors associated with the same
              signalfd object have been closed, the resources for the object
              are freed by the kernel.
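
       The readiness behavior of the operations listed above can be
       sketched with poll(2) on a nonblocking signalfd file descriptor.
       The function name poll_signalfd_demo() is hypothetical, and a
       single-threaded process is assumed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <poll.h>
#include <signal.h>
#include <sys/signalfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demo: check that the descriptor is not readable before
 * signo is raised and is readable afterwards.  Returns the signal
 * number read from the descriptor, or -1 on any failure. */
int poll_signalfd_demo(int signo)
{
    sigset_t mask, old;
    struct pollfd pfd;
    struct signalfd_siginfo si;
    int before, after;
    int rc = -1;

    if (sigemptyset(&mask) == -1 || sigaddset(&mask, signo) == -1)
        return -1;
    if (sigprocmask(SIG_BLOCK, &mask, &old) == -1)
        return -1;

    pfd.fd = signalfd(-1, &mask, SFD_NONBLOCK | SFD_CLOEXEC);
    if (pfd.fd != -1) {
        pfd.events = POLLIN;
        pfd.revents = 0;

        before = poll(&pfd, 1, 0);      /* nothing pending: expect 0 */
        raise(signo);
        after = poll(&pfd, 1, 1000);    /* pending: expect 1 and POLLIN */

        if (before == 0 && after == 1 && (pfd.revents & POLLIN) &&
            read(pfd.fd, &si, sizeof(si)) == (ssize_t)sizeof(si))
            rc = (int)si.ssi_signo;
        close(pfd.fd);                  /* release the descriptor */
    }

    sigprocmask(SIG_SETMASK, &old, NULL);
    return rc;
}
```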

   The signalfd_siginfo structure
       The format of the signalfd_siginfo structure(s) returned by read(2)s
       from a signalfd file descriptor is as follows:

           struct signalfd_siginfo {
               uint32_t ssi_signo;   /* Signal number */
               int32_t  ssi_errno;   /* Error number (unused) */
               int32_t  ssi_code;    /* Signal code */
               uint32_t ssi_pid;     /* PID of sender */
               uint32_t ssi_uid;     /* Real UID of sender */
               int32_t  ssi_fd;      /* File descriptor (SIGIO) */
               uint32_t ssi_tid;     /* Kernel timer ID (POSIX timers) */
               uint32_t ssi_band;    /* Band event (SIGIO) */
               uint32_t ssi_overrun; /* POSIX timer overrun count */
               uint32_t ssi_trapno;  /* Trap number that caused signal */
               int32_t  ssi_status;  /* Exit status or signal (SIGCHLD) */
               int32_t  ssi_int;     /* Integer sent by sigqueue(3) */
               uint64_t ssi_ptr;     /* Pointer sent by sigqueue(3) */
               uint64_t ssi_utime;   /* User CPU time consumed (SIGCHLD) */
               uint64_t ssi_stime;   /* System CPU time consumed (SIGCHLD) */
               uint64_t ssi_addr;    /* Address that generated signal
                                        (for hardware-generated signals) */
               uint8_t  pad[X];      /* Pad size to 128 bytes (allow for
                                         additional fields in the future) */
           };

       Each of the fields in this structure is analogous to the similarly
       named field in the siginfo_t structure.  The siginfo_t structure is
       described in sigaction(2).  Not all fields in the returned
       signalfd_siginfo structure will be valid for a specific signal; the
       set of valid fields can be determined from the value returned in the
       ssi_code field.  This field is the analog of the siginfo_t si_code
       field; see sigaction(2) for details.

   fork(2) semantics
       After a fork(2), the child inherits a copy of the signalfd file
       descriptor.  A read(2) from the file descriptor in the child will
       return information about signals queued to the child.

   Semantics of file descriptor passing
       As with other file descriptors, signalfd file descriptors can be
       passed to another process via a UNIX domain socket (see unix(7)).  In
       the receiving process, a read(2) from the received file descriptor
       will return information about signals queued to that process.

   execve(2) semantics
       Just like any other file descriptor, a signalfd file descriptor
       remains open across an execve(2), unless it has been marked for
       close-on-exec (see fcntl(2)).  Any signals that were available for
       reading before the execve(2) remain available to the newly loaded
       program.  (This is analogous to traditional signal semantics, where a
       blocked signal that is pending remains pending across an execve(2).)

   Thread semantics
       The semantics of signalfd file descriptors in a multithreaded program
       mirror the standard semantics for signals.  In other words, when a
       thread reads from a signalfd file descriptor, it will read the
       signals that are directed to the thread itself and the signals that
       are directed to the process (i.e., the entire thread group).  (A
       thread will not be able to read signals that are directed to other
       threads in the process.)
http://man7.org/linux/man-pages/man2/eventfd.2.html
13
SYSTEM CALL:
eventfd(2) - Linux manual page
FUNCTIONALITY:

       eventfd - create a file descriptor for event notification
SYNOPSIS:

       #include <sys/eventfd.h>

       int eventfd(unsigned int initval, int flags);
DESCRIPTION

       eventfd() creates an "eventfd object" that can be used as an event
       wait/notify mechanism by user-space applications, and by the kernel
       to notify user-space applications of events.  The object contains an
       unsigned 64-bit integer (uint64_t) counter that is maintained by the
       kernel.  This counter is initialized with the value specified in the
       argument initval.

       The following values may be bitwise ORed in flags to change the
       behavior of eventfd():

       EFD_CLOEXEC (since Linux 2.6.27)
              Set the close-on-exec (FD_CLOEXEC) flag on the new file
              descriptor.  See the description of the O_CLOEXEC flag in
              open(2) for reasons why this may be useful.

       EFD_NONBLOCK (since Linux 2.6.27)
              Set the O_NONBLOCK file status flag on the new open file
              description.  Using this flag saves extra calls to fcntl(2) to
              achieve the same result.

       EFD_SEMAPHORE (since Linux 2.6.30)
              Provide semaphore-like semantics for reads from the new file
              descriptor.  See below.

       In Linux up to version 2.6.26, the flags argument is unused, and must
       be specified as zero.

       As its return value, eventfd() returns a new file descriptor that can
       be used to refer to the eventfd object.  The following operations can
       be performed on the file descriptor:

       read(2)
              Each successful read(2) returns an 8-byte integer.  A read(2)
              will fail with the error EINVAL if the size of the supplied
              buffer is less than 8 bytes.

              The value returned by read(2) is in host byte order—that is,
              the native byte order for integers on the host machine.

              The semantics of read(2) depend on whether the eventfd counter
              currently has a nonzero value and whether the EFD_SEMAPHORE
              flag was specified when creating the eventfd file descriptor:

              *  If EFD_SEMAPHORE was not specified and the eventfd counter
                 has a nonzero value, then a read(2) returns 8 bytes
                 containing that value, and the counter's value is reset to
                 zero.

              *  If EFD_SEMAPHORE was specified and the eventfd counter has
                 a nonzero value, then a read(2) returns 8 bytes containing
                 the value 1, and the counter's value is decremented by 1.

              *  If the eventfd counter is zero at the time of the call to
                 read(2), then the call either blocks until the counter
                 becomes nonzero (at which time, the read(2) proceeds as
                 described above) or fails with the error EAGAIN if the file
                 descriptor has been made nonblocking.

       write(2)
              A write(2) call adds the 8-byte integer value supplied in its
              buffer to the counter.  The maximum value that may be stored
              in the counter is the largest unsigned 64-bit value minus 1
              (i.e., 0xfffffffffffffffe).  If the addition would cause the
              counter's value to exceed the maximum, then the write(2)
              either blocks until a read(2) is performed on the file
              descriptor, or fails with the error EAGAIN if the file
              descriptor has been made nonblocking.

              A write(2) will fail with the error EINVAL if the size of the
              supplied buffer is less than 8 bytes, or if an attempt is made
              to write the value 0xffffffffffffffff.

       poll(2), select(2) (and similar)
              The returned file descriptor supports poll(2) (and analogously
              epoll(7)) and select(2), as follows:

              *  The file descriptor is readable (the select(2) readfds
                 argument; the poll(2) POLLIN flag) if the counter has a
                 value greater than 0.

              *  The file descriptor is writable (the select(2) writefds
                 argument; the poll(2) POLLOUT flag) if it is possible to
                 write a value of at least "1" without blocking.

              *  If an overflow of the counter value was detected, then
                 select(2) indicates the file descriptor as being both
                 readable and writable, and poll(2) returns a POLLERR event.
                 As noted above, write(2) can never overflow the counter.
                 However, an overflow can occur if 2^64 eventfd "signal
                 posts" were performed by the KAIO subsystem (theoretically
                 possible, but practically unlikely).  If an overflow has
                 occurred, then read(2) will return that maximum uint64_t
                 value (i.e., 0xffffffffffffffff).

              The eventfd file descriptor also supports the other file-
              descriptor multiplexing APIs: pselect(2) and ppoll(2).

       close(2)
              When the file descriptor is no longer required it should be
              closed.  When all file descriptors associated with the same
              eventfd object have been closed, the resources for the object
              are freed by the kernel.

       A copy of the file descriptor created by eventfd() is inherited by
       the child produced by fork(2).  The duplicate file descriptor is
       associated with the same eventfd object.  File descriptors created by
       eventfd() are preserved across execve(2), unless the close-on-exec
       flag has been set.
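
       The counter semantics of write(2) and read(2) described above can be
       sketched as follows.  The helper name eventfd_first_read() and the
       posted values 3 and 4 are illustrative assumptions, not part of the
       interface.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical helper: post 3 and then 4 to a fresh eventfd created with
   the given flags, and return the value delivered by the first read(2).
   Returns 0 on failure (a successful read never delivers 0). */
uint64_t eventfd_first_read(int flags)
{
    const uint64_t posts[2] = { 3, 4 };
    uint64_t value = 0;
    int efd = eventfd(0, flags);       /* counter starts at initval = 0 */

    if (efd == -1)
        return 0;

    for (size_t i = 0; i < 2; i++) {
        /* Each write(2) adds its 8-byte operand to the counter. */
        if (write(efd, &posts[i], sizeof(posts[i]))
                != (ssize_t) sizeof(posts[i])) {
            close(efd);
            return 0;
        }
    }

    /* Without EFD_SEMAPHORE the read returns the whole counter and resets
       it to zero; with EFD_SEMAPHORE it returns 1 and decrements it. */
    if (read(efd, &value, sizeof(value)) != (ssize_t) sizeof(value))
        value = 0;

    close(efd);
    return value;
}
```

       With flags set to 0 the helper returns 7 (3 + 4); with EFD_SEMAPHORE
       it returns 1, leaving 6 in the counter.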
http://man7.org/linux/man-pages/man2/eventfd2.2.html
13
SYSTEM CALL:
eventfd(2) - Linux manual page
FUNCTIONALITY:

       eventfd - create a file descriptor for event notification
SYNOPSIS:

       #include <sys/eventfd.h>

       int eventfd(unsigned int initval, int flags);
DESCRIPTION

       eventfd() creates an "eventfd object" that can be used as an event
       wait/notify mechanism by user-space applications, and by the kernel
       to notify user-space applications of events.  The object contains an
       unsigned 64-bit integer (uint64_t) counter that is maintained by the
       kernel.  This counter is initialized with the value specified in the
       argument initval.

       The following values may be bitwise ORed in flags to change the
       behavior of eventfd():

       EFD_CLOEXEC (since Linux 2.6.27)
              Set the close-on-exec (FD_CLOEXEC) flag on the new file
              descriptor.  See the description of the O_CLOEXEC flag in
              open(2) for reasons why this may be useful.

       EFD_NONBLOCK (since Linux 2.6.27)
              Set the O_NONBLOCK file status flag on the new open file
              description.  Using this flag saves extra calls to fcntl(2) to
              achieve the same result.

       EFD_SEMAPHORE (since Linux 2.6.30)
              Provide semaphore-like semantics for reads from the new file
              descriptor.  See below.

       In Linux up to version 2.6.26, the flags argument is unused, and must
       be specified as zero.

       As its return value, eventfd() returns a new file descriptor that can
       be used to refer to the eventfd object.  The following operations can
       be performed on the file descriptor:

       read(2)
              Each successful read(2) returns an 8-byte integer.  A read(2)
              will fail with the error EINVAL if the size of the supplied
              buffer is less than 8 bytes.

              The value returned by read(2) is in host byte order—that is,
              the native byte order for integers on the host machine.

              The semantics of read(2) depend on whether the eventfd counter
              currently has a nonzero value and whether the EFD_SEMAPHORE
              flag was specified when creating the eventfd file descriptor:

              *  If EFD_SEMAPHORE was not specified and the eventfd counter
                 has a nonzero value, then a read(2) returns 8 bytes
                 containing that value, and the counter's value is reset to
                 zero.

              *  If EFD_SEMAPHORE was specified and the eventfd counter has
                 a nonzero value, then a read(2) returns 8 bytes containing
                 the value 1, and the counter's value is decremented by 1.

              *  If the eventfd counter is zero at the time of the call to
                 read(2), then the call either blocks until the counter
                 becomes nonzero (at which time, the read(2) proceeds as
                 described above) or fails with the error EAGAIN if the file
                 descriptor has been made nonblocking.

       write(2)
              A write(2) call adds the 8-byte integer value supplied in its
              buffer to the counter.  The maximum value that may be stored
              in the counter is the largest unsigned 64-bit value minus 1
              (i.e., 0xfffffffffffffffe).  If the addition would cause the
              counter's value to exceed the maximum, then the write(2)
              either blocks until a read(2) is performed on the file
              descriptor, or fails with the error EAGAIN if the file
              descriptor has been made nonblocking.

              A write(2) will fail with the error EINVAL if the size of the
              supplied buffer is less than 8 bytes, or if an attempt is made
              to write the value 0xffffffffffffffff.

       poll(2), select(2) (and similar)
              The returned file descriptor supports poll(2) (and analogously
              epoll(7)) and select(2), as follows:

              *  The file descriptor is readable (the select(2) readfds
                 argument; the poll(2) POLLIN flag) if the counter has a
                 value greater than 0.

              *  The file descriptor is writable (the select(2) writefds
                 argument; the poll(2) POLLOUT flag) if it is possible to
                 write a value of at least "1" without blocking.

              *  If an overflow of the counter value was detected, then
                 select(2) indicates the file descriptor as being both
                 readable and writable, and poll(2) returns a POLLERR event.
                 As noted above, write(2) can never overflow the counter.
                 However, an overflow can occur if 2^64 eventfd "signal
                 posts" were performed by the KAIO subsystem (theoretically
                 possible, but practically unlikely).  If an overflow has
                 occurred, then read(2) will return that maximum uint64_t
                 value (i.e., 0xffffffffffffffff).

              The eventfd file descriptor also supports the other file-
              descriptor multiplexing APIs: pselect(2) and ppoll(2).

       close(2)
              When the file descriptor is no longer required it should be
              closed.  When all file descriptors associated with the same
              eventfd object have been closed, the resources for the object
              are freed by the kernel.

       A copy of the file descriptor created by eventfd() is inherited by
       the child produced by fork(2).  The duplicate file descriptor is
       associated with the same eventfd object.  File descriptors created by
       eventfd() are preserved across execve(2), unless the close-on-exec
       flag has been set.
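
       The poll(2) readiness rules described above can be observed with a
       small sketch.  The helper name eventfd_poll_events() is hypothetical;
       it simply reports the revents mask for an eventfd whose counter
       starts at a chosen value.

```c
#include <poll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical helper: return the poll(2) revents mask for a new eventfd
   whose counter is initialized to `counter`, or -1 on failure. */
int eventfd_poll_events(unsigned int counter)
{
    struct pollfd pfd;
    int ready;
    int efd = eventfd(counter, 0);

    if (efd == -1)
        return -1;

    pfd.fd = efd;
    pfd.events = POLLIN | POLLOUT;
    pfd.revents = 0;

    /* A timeout of 0 makes poll(2) report the current state and return. */
    ready = poll(&pfd, 1, 0);
    close(efd);

    return ready == -1 ? -1 : pfd.revents;
}
```

       With a counter of 0 the descriptor should be writable but not
       readable; with any counter greater than 0 (and well below the
       maximum) POLLIN should be reported as well.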
http://man7.org/linux/man-pages/man2/restart_syscall.2.html
11
SYSTEM CALL:
restart_syscall(2) - Linux manual page
FUNCTIONALITY:

       restart_syscall - restart a system call after interruption by a stop
       signal
SYNOPSIS:

       int restart_syscall(void);

       Note: There is no glibc wrapper for this system call; see NOTES.
DESCRIPTION

       The restart_syscall() system call is used to restart certain system
       calls after a process that was stopped by a signal (e.g., SIGSTOP or
       SIGTSTP) is later resumed after receiving a SIGCONT signal.  This
       system call is designed only for internal use by the kernel.

       restart_syscall() is used for restarting only those system calls
       that, when restarted, should adjust their time-related parameters—
       namely poll(2) (since Linux 2.6.24), nanosleep(2) (since Linux 2.6),
       clock_nanosleep(2) (since Linux 2.6), and futex(2), when employed
       with the FUTEX_WAIT (since Linux 2.6.22) and FUTEX_WAIT_BITSET (since
       Linux 2.6.31) operations.  restart_syscall() restarts the interrupted
       system call with a time argument that is suitably adjusted to account
       for the time that has already elapsed (including the time where the
       process was stopped by a signal).  Without the restart_syscall()
       mechanism, restarting these system calls would not correctly deduct
       the already elapsed time when the process continued execution.
Linux Inter Process Communication (IPC) system calls
http://man7.org/linux/man-pages/man2/pipe.2.html
11
SYSTEM CALL:
pipe(2) - Linux manual page
FUNCTIONALITY:

       pipe, pipe2 - create pipe
SYNOPSIS:

       #include <unistd.h>

       int pipe(int pipefd[2]);

       #define _GNU_SOURCE             /* See feature_test_macros(7) */
       #include <fcntl.h>              /* Obtain O_* constant definitions */
       #include <unistd.h>

       int pipe2(int pipefd[2], int flags);
DESCRIPTION

       pipe() creates a pipe, a unidirectional data channel that can be used
       for interprocess communication.  The array pipefd is used to return
       two file descriptors referring to the ends of the pipe.  pipefd[0]
       refers to the read end of the pipe.  pipefd[1] refers to the write
       end of the pipe.  Data written to the write end of the pipe is
       buffered by the kernel until it is read from the read end of the
       pipe.  For further details, see pipe(7).

       If flags is 0, then pipe2() is the same as pipe().  The following
       values can be bitwise ORed in flags to obtain different behavior:

       O_CLOEXEC
              Set the close-on-exec (FD_CLOEXEC) flag on the two new file
              descriptors.  See the description of the same flag in open(2)
              for reasons why this may be useful.

       O_DIRECT (since Linux 3.4)
              Create a pipe that performs I/O in "packet" mode.  Each
              write(2) to the pipe is dealt with as a separate packet, and
              read(2)s from the pipe will read one packet at a time.  Note
              the following points:

              *  Writes of greater than PIPE_BUF bytes (see pipe(7)) will be
                 split into multiple packets.  The constant PIPE_BUF is
                 defined in <limits.h>.

              *  If a read(2) specifies a buffer size that is smaller than
                 the next packet, then the requested number of bytes are
                 read, and the excess bytes in the packet are discarded.
                 Specifying a buffer size of PIPE_BUF will be sufficient to
                 read the largest possible packets (see the previous point).

              *  Zero-length packets are not supported.  (A read(2) that
                 specifies a buffer size of zero is a no-op, and returns 0.)

              Older kernels that do not support this flag will indicate this
              via an EINVAL error.

       O_NONBLOCK
              Set the O_NONBLOCK file status flag on the two new open file
              descriptions.  Using this flag saves extra calls to fcntl(2)
              to achieve the same result.
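
       The roles of pipefd[0] and pipefd[1] can be illustrated with a
       minimal round trip inside one process.  The helper name
       pipe_round_trip_ok() is hypothetical, and the sketch assumes a short
       message that fits in the pipe's kernel buffer, so the write cannot
       block.

```c
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: send msg through a new pipe and read it back.
   Returns 1 if the bytes read match msg, 0 if they do not, and -1 on
   setup failure.  msg must be 1..256 bytes long. */
int pipe_round_trip_ok(const char *msg)
{
    int pipefd[2];
    char buf[256];
    size_t len = strlen(msg);
    int ok = 0;

    if (len == 0 || len > sizeof(buf))
        return -1;
    if (pipe(pipefd) == -1)
        return -1;

    /* pipefd[1] is the write end; the kernel buffers the data ... */
    if (write(pipefd[1], msg, len) == (ssize_t) len) {
        /* ... until it is read from pipefd[0], the read end. */
        ssize_t n = read(pipefd[0], buf, sizeof(buf));
        ok = n == (ssize_t) len && memcmp(buf, msg, len) == 0;
    }

    close(pipefd[0]);
    close(pipefd[1]);
    return ok;
}
```

       In real use the two ends are normally held by different processes,
       for example a parent and the child created by fork(2), each closing
       the end it does not need.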
http://man7.org/linux/man-pages/man2/pipe2.2.html
11
SYSTEM CALL:
pipe(2) - Linux manual page
FUNCTIONALITY:

       pipe, pipe2 - create pipe
SYNOPSIS:

       #include <unistd.h>

       int pipe(int pipefd[2]);

       #define _GNU_SOURCE             /* See feature_test_macros(7) */
       #include <fcntl.h>              /* Obtain O_* constant definitions */
       #include <unistd.h>

       int pipe2(int pipefd[2], int flags);
DESCRIPTION

       pipe() creates a pipe, a unidirectional data channel that can be used
       for interprocess communication.  The array pipefd is used to return
       two file descriptors referring to the ends of the pipe.  pipefd[0]
       refers to the read end of the pipe.  pipefd[1] refers to the write
       end of the pipe.  Data written to the write end of the pipe is
       buffered by the kernel until it is read from the read end of the
       pipe.  For further details, see pipe(7).

       If flags is 0, then pipe2() is the same as pipe().  The following
       values can be bitwise ORed in flags to obtain different behavior:

       O_CLOEXEC
              Set the close-on-exec (FD_CLOEXEC) flag on the two new file
              descriptors.  See the description of the same flag in open(2)
              for reasons why this may be useful.

       O_DIRECT (since Linux 3.4)
              Create a pipe that performs I/O in "packet" mode.  Each
              write(2) to the pipe is dealt with as a separate packet, and
              read(2)s from the pipe will read one packet at a time.  Note
              the following points:

              *  Writes of greater than PIPE_BUF bytes (see pipe(7)) will be
                 split into multiple packets.  The constant PIPE_BUF is
                 defined in <limits.h>.

              *  If a read(2) specifies a buffer size that is smaller than
                 the next packet, then the requested number of bytes are
                 read, and the excess bytes in the packet are discarded.
                 Specifying a buffer size of PIPE_BUF will be sufficient to
                 read the largest possible packets (see the previous point).

              *  Zero-length packets are not supported.  (A read(2) that
                 specifies a buffer size of zero is a no-op, and returns 0.)

              Older kernels that do not support this flag will indicate this
              via an EINVAL error.

       O_NONBLOCK
              Set the O_NONBLOCK file status flag on the two new open file
              descriptions.  Using this flag saves extra calls to fcntl(2)
              to achieve the same result.
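
       The effect of the O_CLOEXEC and O_NONBLOCK flags can be checked with
       a short sketch.  The helper names has_cloexec() and
       pipe2_flags_applied() are hypothetical.  The sketch relies on the
       fact that a nonblocking read from an empty pipe whose write end is
       still open fails with EAGAIN (see pipe(7)).

```c
#define _GNU_SOURCE             /* for pipe2() */
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: 1 if fd has the close-on-exec flag set. */
static int has_cloexec(int fd)
{
    int fdflags = fcntl(fd, F_GETFD);
    return fdflags != -1 && (fdflags & FD_CLOEXEC) != 0;
}

/* Hypothetical helper: create a pipe with both flags and verify them.
   Returns 1 if both took effect, 0 if not, and -1 if pipe2() failed. */
int pipe2_flags_applied(void)
{
    int pipefd[2];
    char byte;
    int cloexec, nonblock;

    if (pipe2(pipefd, O_CLOEXEC | O_NONBLOCK) == -1)
        return -1;

    /* O_CLOEXEC appears as FD_CLOEXEC on both new descriptors. */
    cloexec = has_cloexec(pipefd[0]) && has_cloexec(pipefd[1]);

    /* O_NONBLOCK: reading the empty pipe fails with EAGAIN rather than
       blocking. */
    nonblock = read(pipefd[0], &byte, 1) == -1 && errno == EAGAIN;

    close(pipefd[0]);
    close(pipefd[1]);
    return cloexec && nonblock;
}
```

       Doing the same with pipe() would take four additional fcntl(2)
       calls, and would leave a window in which another thread could
       fork(2) and execve(2) before the close-on-exec flag is set.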
http://man7.org/linux/man-pages/man2/tee.2.html
12
SYSTEM CALL:
tee(2) - Linux manual page
FUNCTIONALITY:

       tee - duplicating pipe content
SYNOPSIS:

       #define _GNU_SOURCE         /* See feature_test_macros(7) */
       #include <fcntl.h>

       ssize_t tee(int fd_in, int fd_out, size_t len, unsigned int flags);
DESCRIPTION

       tee() duplicates up to len bytes of data from the pipe referred to by
       the file descriptor fd_in to the pipe referred to by the file
       descriptor fd_out.  It does not consume the data that is duplicated
       from fd_in; therefore, that data can be copied by a subsequent
       splice(2).

       flags is a bit mask that is composed by ORing together zero or more
       of the following values:

       SPLICE_F_MOVE      Currently has no effect for tee(); see splice(2).

       SPLICE_F_NONBLOCK  Do not block on I/O; see splice(2) for further
                          details.

       SPLICE_F_MORE      Currently has no effect for tee(), but may be
                          implemented in the future; see splice(2).

       SPLICE_F_GIFT      Unused for tee(); see vmsplice(2).
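
       That tee() duplicates data without consuming it can be seen in a
       minimal sketch with two pipes.  The helper name tee_duplicates() is
       hypothetical, and the message is assumed to be short enough to fit
       in a pipe buffer.

```c
#define _GNU_SOURCE             /* for tee() */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write msg into one pipe, tee() it into a second
   pipe, then read from both.  Returns 1 if both pipes yield msg, 0 if
   not, and -1 on failure.  msg must be 1..64 bytes long. */
int tee_duplicates(const char *msg)
{
    int src[2], dst[2];
    char a[64], b[64];
    size_t len = strlen(msg);
    int ok = -1;

    if (len == 0 || len > sizeof(a))
        return -1;
    if (pipe(src) == -1)
        return -1;
    if (pipe(dst) == -1) {
        close(src[0]);
        close(src[1]);
        return -1;
    }

    if (write(src[1], msg, len) == (ssize_t) len &&
        /* fd_in is the read end of src; fd_out is the write end of dst. */
        tee(src[0], dst[1], len, 0) == (ssize_t) len) {
        /* The data is still in src, because tee() did not consume it ... */
        ssize_t na = read(src[0], a, sizeof(a));
        /* ... and a duplicate is now available in dst. */
        ssize_t nb = read(dst[0], b, sizeof(b));

        ok = na == (ssize_t) len && nb == (ssize_t) len &&
             memcmp(a, msg, len) == 0 && memcmp(b, msg, len) == 0;
    }

    close(src[0]);
    close(src[1]);
    close(dst[0]);
    close(dst[1]);
    return ok;
}
```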
http://man7.org/linux/man-pages/man2/splice.2.html
12
SYSTEM CALL:
splice(2) - Linux manual page
FUNCTIONALITY:

       splice - splice data to/from a pipe
SYNOPSIS:

       #define _GNU_SOURCE         /* See feature_test_macros(7) */
       #include <fcntl.h>

       ssize_t splice(int fd_in, loff_t *off_in, int fd_out,
                      loff_t *off_out, size_t len, unsigned int flags);
DESCRIPTION

       splice() moves data between two file descriptors without copying
       between kernel address space and user address space.  It transfers up
       to len bytes of data from the file descriptor fd_in to the file
       descriptor fd_out, where one of the file descriptors must refer to a
       pipe.

       The following semantics apply for fd_in and off_in:

       *  If fd_in refers to a pipe, then off_in must be NULL.

       *  If fd_in does not refer to a pipe and off_in is NULL, then bytes
          are read from fd_in starting from the file offset, and the file
          offset is adjusted appropriately.

       *  If fd_in does not refer to a pipe and off_in is not NULL, then
          off_in must point to a buffer which specifies the starting offset
          from which bytes will be read from fd_in; in this case, the file
          offset of fd_in is not changed.

       Analogous statements apply for fd_out and off_out.

       The flags argument is a bit mask that is composed by ORing together
       zero or more of the following values:

       SPLICE_F_MOVE      Attempt to move pages instead of copying.  This is
                          only a hint to the kernel: pages may still be
                          copied if the kernel cannot move the pages from
                          the pipe, or if the pipe buffers don't refer to
                          full pages.  The initial implementation of this
                          flag was buggy; therefore, starting in Linux
                          2.6.21, it is a no-op (but is still permitted in
                          a splice() call); in the future, a correct
                          implementation may be restored.

       SPLICE_F_NONBLOCK  Do not block on I/O.  This makes the splice pipe
                          operations nonblocking, but splice() may
                          nevertheless block because the file descriptors
                          that are spliced to/from may block (unless they
                          have the O_NONBLOCK flag set).

       SPLICE_F_MORE      More data will be coming in a subsequent splice.
                          This is a helpful hint when the fd_out refers to a
                          socket (see also the description of MSG_MORE in
                          send(2), and the description of TCP_CORK in
                          tcp(7)).

       SPLICE_F_GIFT      Unused for splice(); see vmsplice(2).
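
       The off_in rules above can be illustrated by splicing part of a
       regular file into a pipe.  This is a sketch under stated
       assumptions: the helper name splice_from_offset() is hypothetical,
       and a writable /tmp directory on a filesystem that supports splice()
       is assumed.

```c
#define _GNU_SOURCE             /* for splice() and loff_t */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write "hello world" to a temporary file, splice 5
   bytes starting at offset 6 into a pipe, and check the result.  Returns
   1 if the pipe yields "world" and the file offset is unchanged, 0 if
   not, and -1 on setup failure. */
int splice_from_offset(void)
{
    char path[] = "/tmp/splice-demo-XXXXXX";   /* assumed location */
    const char text[] = "hello world";
    char buf[16];
    int pipefd[2];
    loff_t off_in = 6;                         /* start of "world" */
    int ok = -1;
    int fd = mkstemp(path);

    if (fd == -1)
        return -1;
    unlink(path);                              /* file vanishes on close */

    if (write(fd, text, strlen(text)) == (ssize_t) strlen(text) &&
        pipe(pipefd) == 0) {
        off_t before = lseek(fd, 0, SEEK_CUR);

        /* fd_in is not a pipe and off_in is not NULL, so reading starts
           at *off_in.  fd_out is a pipe, so off_out must be NULL. */
        ssize_t moved = splice(fd, &off_in, pipefd[1], NULL, 5, 0);
        ssize_t n = moved == 5 ? read(pipefd[0], buf, sizeof(buf)) : -1;

        ok = n == 5 && memcmp(buf, "world", 5) == 0 &&
             lseek(fd, 0, SEEK_CUR) == before; /* file offset untouched */

        close(pipefd[0]);
        close(pipefd[1]);
    }

    close(fd);
    return ok;
}
```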
http://man7.org/linux/man-pages/man2/vmsplice.2.html
11
SYSTEM CALL:
vmsplice(2) - Linux manual page
FUNCTIONALITY:

       vmsplice - splice user pages into a pipe
SYNOPSIS:

       #define _GNU_SOURCE         /* See feature_test_macros(7) */
       #include <fcntl.h>
       #include <sys/uio.h>

       ssize_t vmsplice(int fd, const struct iovec *iov,
                        unsigned long nr_segs, unsigned int flags);
DESCRIPTION

       The vmsplice() system call maps nr_segs ranges of user memory
       described by iov into a pipe.  The file descriptor fd must refer to a
       pipe.

       The pointer iov points to an array of iovec structures as defined in
       <sys/uio.h>:

           struct iovec {
               void  *iov_base;        /* Starting address */
               size_t iov_len;         /* Number of bytes */
           };

       The flags argument is a bit mask that is composed by ORing together
       zero or more of the following values:

       SPLICE_F_MOVE      Unused for vmsplice(); see splice(2).

       SPLICE_F_NONBLOCK  Do not block on I/O; see splice(2) for further
                          details.

       SPLICE_F_MORE      Currently has no effect for vmsplice(), but may be
                          implemented in the future; see splice(2).

       SPLICE_F_GIFT      The user pages are a gift to the kernel.  The
                          application must never modify this memory;
                          otherwise, the page cache and on-disk data may
                          differ.  Gifting pages to the kernel means that a
                          subsequent splice(2) SPLICE_F_MOVE can
                          successfully move the pages; if this flag is not
                          specified, then a subsequent splice(2)
                          SPLICE_F_MOVE must copy the pages.  Data must also
                          be properly page aligned, both in memory and
                          length.
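
       How iov and nr_segs describe the memory handed to vmsplice() can be
       sketched as follows.  The helper name vmsplice_gather() and the two
       sample buffers are illustrative assumptions; no flags are used, so
       the pages are not gifted.

```c
#define _GNU_SOURCE             /* for vmsplice() */
#include <fcntl.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: map two user buffers into a pipe with a single
   vmsplice() call and read them back.  Returns 1 if the pipe yields the
   buffers in order, 0 if not, and -1 on setup failure. */
int vmsplice_gather(void)
{
    char first[] = "foo";
    char second[] = "bar";
    struct iovec iov[2] = {
        { .iov_base = first,  .iov_len = 3 },
        { .iov_base = second, .iov_len = 3 },
    };
    char buf[16];
    int pipefd[2];
    int ok;

    if (pipe(pipefd) == -1)
        return -1;

    /* fd must refer to a pipe (here, its write end); nr_segs is the
       number of iovec entries.  The buffers stay unmodified until the
       data has been read back below. */
    ok = vmsplice(pipefd[1], iov, 2, 0) == 6 &&
         read(pipefd[0], buf, sizeof(buf)) == 6 &&
         memcmp(buf, "foobar", 6) == 0;

    close(pipefd[0]);
    close(pipefd[1]);
    return ok;
}
```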
http://man7.org/linux/man-pages/man2/shmget.2.html
11
SYSTEM CALL:
shmget(2) - Linux manual page
FUNCTIONALITY:

       shmget - allocates a System V shared memory segment
SYNOPSIS:

       #include <sys/ipc.h>
       #include <sys/shm.h>

       int shmget(key_t key, size_t size, int shmflg);
DESCRIPTION

       shmget() returns the identifier of the System V shared memory segment
       associated with the value of the argument key.  A new shared memory
       segment, with size equal to the value of size rounded up to a
       multiple of PAGE_SIZE, is created in either of two cases: key has the
       value IPC_PRIVATE; or key isn't IPC_PRIVATE, no shared memory segment
       corresponding to key exists, and IPC_CREAT is specified in shmflg.

       If shmflg specifies both IPC_CREAT and IPC_EXCL and a shared memory
       segment already exists for key, then shmget() fails with errno set to
       EEXIST.  (This is analogous to the effect of the combination O_CREAT
       | O_EXCL for open(2).)

       The value shmflg is composed of:

       IPC_CREAT   Create a new segment.  If this flag is not used, then
                   shmget() will find the segment associated with key and
                   check to see if the user has permission to access the
                   segment.

       IPC_EXCL    This flag is used with IPC_CREAT to ensure that this call
                   creates the segment.  If the segment already exists, the
                   call fails.

       SHM_HUGETLB (since Linux 2.6)
                   Allocate the segment using "huge pages."  See the Linux
                   kernel source file Documentation/vm/hugetlbpage.txt for
                   further information.

       SHM_HUGE_2MB, SHM_HUGE_1GB (since Linux 3.8)
                   Used in conjunction with SHM_HUGETLB to select
                   alternative hugetlb page sizes (respectively, 2 MB and 1
                   GB) on systems that support multiple hugetlb page sizes.

                   More generally, the desired huge page size can be
                   configured by encoding the base-2 logarithm of the
                   desired page size in the six bits at the offset
                   SHM_HUGE_SHIFT.  Thus, the above two constants are
                   defined as:

                       #define SHM_HUGE_2MB    (21 << SHM_HUGE_SHIFT)
                       #define SHM_HUGE_1GB    (30 << SHM_HUGE_SHIFT)

                   For some additional details, see the discussion of the
                   similarly named constants in mmap(2).

       SHM_NORESERVE (since Linux 2.6.15)
                   This flag serves the same purpose as the mmap(2)
                   MAP_NORESERVE flag.  Do not reserve swap space for this
                   segment.  When swap space is reserved, one has the
                   guarantee that it is possible to modify the segment.
                   When swap space is not reserved one might get SIGSEGV
                   upon a write if no physical memory is available.  See
                   also the discussion of the file
                   /proc/sys/vm/overcommit_memory in proc(5).
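
       The SHM_HUGE_* encoding described above (the base-2 logarithm of the
       page size, shifted left by SHM_HUGE_SHIFT) can be reproduced with a
       small sketch.  The helper name shm_huge_flag() is hypothetical, and
       the fallback value 26 for SHM_HUGE_SHIFT is an assumption used only
       when the system headers do not define the constant.

```c
#include <sys/ipc.h>
#include <sys/shm.h>

#ifndef SHM_HUGE_SHIFT
#define SHM_HUGE_SHIFT 26       /* assumed fallback; prefer the header's */
#endif

/* Hypothetical helper: encode a power-of-two huge page size, given in
   bytes, as shmflg bits.  Returns 0 if page_size is not a power of two. */
unsigned int shm_huge_flag(unsigned long long page_size)
{
    unsigned int exponent = 0;

    if (page_size == 0 || (page_size & (page_size - 1)) != 0)
        return 0;

    /* Find the base-2 logarithm of the page size. */
    while ((page_size >> exponent) != 1)
        exponent++;

    /* Place the logarithm in the bits starting at SHM_HUGE_SHIFT. */
    return exponent << SHM_HUGE_SHIFT;
}
```

       For 2 MB pages the logarithm is 21 and for 1 GB pages it is 30,
       which matches the two definitions shown above.  The result is meant
       to be ORed with SHM_HUGETLB in shmflg.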

       In addition to the above flags, the least significant 9 bits of
       shmflg specify the permissions granted to the owner, group, and
       others.  These bits have the same format, and the same meaning, as
       the mode argument of open(2).  Presently, execute permissions are not
       used by the system.

       When a new shared memory segment is created, its contents are
       initialized to zero values, and its associated data structure,
       shmid_ds (see shmctl(2)), is initialized as follows:

              shm_perm.cuid and shm_perm.uid are set to the effective user
              ID of the calling process.

              shm_perm.cgid and shm_perm.gid are set to the effective group
              ID of the calling process.

              The least significant 9 bits of shm_perm.mode are set to the
              least significant 9 bits of shmflg.

              shm_segsz is set to the value of size.

              shm_lpid, shm_nattch, shm_atime, and shm_dtime are set to 0.

              shm_ctime is set to the current time.

       If the shared memory segment already exists, the permissions are
       verified, and a check is made to see if it is marked for destruction.
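
       The initialization of shmid_ds listed above can be observed with
       a short sketch.  The helper name is hypothetical; it creates a
       private segment, reads its data structure back with IPC_STAT (see
       shmctl(2)), and removes the segment again.  The permission bits
       passed in must include owner read permission for IPC_STAT to
       succeed in an unprivileged process.

```c
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>

/* Hypothetical helper: create a private segment of `size` bytes with
   permission bits `perms`, copy its shmid_ds into *ds, and remove the
   segment.  Returns 0 on success, -1 on error (errno is set). */
static int shm_create_stat_remove(size_t size, int perms,
                                  struct shmid_ds *ds)
{
    int shmid = shmget(IPC_PRIVATE, size, IPC_CREAT | (perms & 0777));
    int rc;

    if (shmid == -1)
        return -1;
    rc = shmctl(shmid, IPC_STAT, ds);
    /* Nothing is attached, so the segment is destroyed right away. */
    if (shmctl(shmid, IPC_RMID, NULL) == -1)
        return -1;
    return rc;
}
```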
http://man7.org/linux/man-pages/man2/shmctl.2.html
10
SYSTEM CALL:
shmctl(2) - Linux manual page
FUNCTIONALITY:

       shmctl - System V shared memory control
SYNOPSIS:

       #include <sys/ipc.h>
       #include <sys/shm.h>

       int shmctl(int shmid, int cmd, struct shmid_ds *buf);
DESCRIPTION

       shmctl() performs the control operation specified by cmd on the
       System V shared memory segment whose identifier is given in shmid.

       The buf argument is a pointer to a shmid_ds structure, defined in
       <sys/shm.h> as follows:

           struct shmid_ds {
               struct ipc_perm shm_perm;    /* Ownership and permissions */
               size_t          shm_segsz;   /* Size of segment (bytes) */
               time_t          shm_atime;   /* Last attach time */
               time_t          shm_dtime;   /* Last detach time */
               time_t          shm_ctime;   /* Last change time */
               pid_t           shm_cpid;    /* PID of creator */
               pid_t           shm_lpid;    /* PID of last shmat(2)/shmdt(2) */
               shmatt_t        shm_nattch;  /* No. of current attaches */
               ...
           };

       The ipc_perm structure is defined as follows (the uid, gid, and mode
       fields are settable using IPC_SET):

           struct ipc_perm {
               key_t          __key;    /* Key supplied to shmget(2) */
               uid_t          uid;      /* Effective UID of owner */
               gid_t          gid;      /* Effective GID of owner */
               uid_t          cuid;     /* Effective UID of creator */
               gid_t          cgid;     /* Effective GID of creator */
               unsigned short mode;     /* Permissions + SHM_DEST and
                                           SHM_LOCKED flags */
               unsigned short __seq;    /* Sequence number */
           };

       Valid values for cmd are:

       IPC_STAT  Copy information from the kernel data structure associated
                 with shmid into the shmid_ds structure pointed to by buf.
                 The caller must have read permission on the shared memory
                 segment.

       IPC_SET   Write the values of some members of the shmid_ds structure
                 pointed to by buf to the kernel data structure associated
                 with this shared memory segment, updating also its
                 shm_ctime member.  The following fields can be changed:
                 shm_perm.uid, shm_perm.gid, and (the least significant 9
                 bits of) shm_perm.mode.  The effective UID of the calling
                 process must match the owner (shm_perm.uid) or creator
                 (shm_perm.cuid) of the shared memory segment, or the caller
                 must be privileged.

       IPC_RMID  Mark the segment to be destroyed.  The segment will
                 actually be destroyed only after the last process detaches
                 it (i.e., when the shm_nattch member of the associated
                 structure shmid_ds is zero).  The caller must be the owner
                 or creator of the segment, or be privileged.  The buf
                 argument is ignored.

                 If a segment has been marked for destruction, then the
                 (nonstandard) SHM_DEST flag of the shm_perm.mode field in
                 the associated data structure retrieved by IPC_STAT will be
                 set.

                 The caller must ensure that a segment is eventually
                 destroyed; otherwise its pages that were faulted in will
                 remain in memory or swap.

                 See also the description of
                 /proc/sys/kernel/shm_rmid_forced in proc(5).

       IPC_INFO (Linux-specific)
                 Return information about system-wide shared memory limits
                 and parameters in the structure pointed to by buf.  This
                 structure is of type shminfo (thus, a cast is required),
                 defined in <sys/shm.h> if the _GNU_SOURCE feature test
                 macro is defined:

                     struct shminfo {
                         unsigned long shmmax; /* Maximum segment size */
                         unsigned long shmmin; /* Minimum segment size;
                                                  always 1 */
                         unsigned long shmmni; /* Maximum number of segments */
                         unsigned long shmseg; /* Maximum number of segments
                                                  that a process can attach;
                                                  unused within kernel */
                         unsigned long shmall; /* Maximum number of pages of
                                                  shared memory, system-wide */
                     };

                 The shmmni, shmmax, and shmall settings can be changed via
                 /proc files of the same name; see proc(5) for details.

       SHM_INFO (Linux-specific)
                 Return a shm_info structure whose fields contain
                 information about system resources consumed by shared
                 memory.  This structure is defined in <sys/shm.h> if the
                 _GNU_SOURCE feature test macro is defined:

                     struct shm_info {
                         int           used_ids; /* # of currently existing
                                                    segments */
                         unsigned long shm_tot;  /* Total number of shared
                                                    memory pages */
                         unsigned long shm_rss;  /* # of resident shared
                                                    memory pages */
                         unsigned long shm_swp;  /* # of swapped shared
                                                    memory pages */
                         unsigned long swap_attempts;
                                                 /* Unused since Linux 2.4 */
                         unsigned long swap_successes;
                                                 /* Unused since Linux 2.4 */
                     };

       SHM_STAT (Linux-specific)
                 Return a shmid_ds structure as for IPC_STAT.  However, the
                 shmid argument is not a segment identifier, but instead an
                 index into the kernel's internal array that maintains
                 information about all shared memory segments on the system.
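
       As an illustration of IPC_STAT, IPC_SET, and the SHM_DEST flag
       set by IPC_RMID, consider the following sketch.  The helper names
       are hypothetical.  _GNU_SOURCE is defined only to make sure that
       the nonstandard SHM_DEST constant is visible.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Hypothetical helper: replace the permission bits of a segment.
   Returns 0 on success, -1 on error (errno is set). */
static int shm_set_perms(int shmid, int perms)
{
    struct shmid_ds ds;

    if (shmctl(shmid, IPC_STAT, &ds) == -1)
        return -1;
    /* IPC_SET copies uid, gid, and the low 9 bits of mode; uid and
       gid are passed back exactly as IPC_STAT returned them. */
    ds.shm_perm.mode = (ds.shm_perm.mode & ~0777) | (perms & 0777);
    return shmctl(shmid, IPC_SET, &ds);
}

/* Hypothetical helper: 1 if the segment is marked for destruction,
   0 if it is not, -1 on error. */
static int shm_is_marked_for_destruction(int shmid)
{
    struct shmid_ds ds;

    if (shmctl(shmid, IPC_STAT, &ds) == -1)
        return -1;
    return (ds.shm_perm.mode & SHM_DEST) != 0;
}
```

       A segment that is still attached when IPC_RMID is applied stays
       alive until the last detach, which is why the second helper can
       still query it by identifier.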

       The caller can prevent or allow swapping of a shared memory segment
       with the following cmd values:

       SHM_LOCK (Linux-specific)
                 Prevent swapping of the shared memory segment.  The caller
                 must fault in any pages that are required to be present
                 after locking is enabled.  If a segment has been locked,
                 then the (nonstandard) SHM_LOCKED flag of the shm_perm.mode
                 field in the associated data structure retrieved by
                 IPC_STAT will be set.

       SHM_UNLOCK (Linux-specific)
                 Unlock the segment, allowing it to be swapped out.

       In kernels before 2.6.10, only a privileged process could employ
       SHM_LOCK and SHM_UNLOCK.  Since kernel 2.6.10, an unprivileged
       process can employ these operations if its effective UID matches the
       owner or creator UID of the segment, and (for SHM_LOCK) the amount of
       memory to be locked falls within the RLIMIT_MEMLOCK resource limit
       (see setrlimit(2)).
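
       A minimal sketch of SHM_LOCK follows.  The helper name is
       hypothetical, and treating EPERM and ENOMEM as "the kernel
       refused the lock" is an assumption about how privilege and
       RLIMIT_MEMLOCK failures are reported.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Hypothetical helper: try to lock a segment into memory.
   Returns 1 if the segment is locked and SHM_LOCKED is visible via
   IPC_STAT, 0 if the kernel refused the lock (assumed to be EPERM or
   ENOMEM), and -1 on any other error. */
static int shm_try_lock(int shmid)
{
    struct shmid_ds ds;

    if (shmctl(shmid, SHM_LOCK, NULL) == -1)
        return (errno == EPERM || errno == ENOMEM) ? 0 : -1;
    if (shmctl(shmid, IPC_STAT, &ds) == -1)
        return -1;
    return (ds.shm_perm.mode & SHM_LOCKED) ? 1 : -1;
}
```

       Locking does not fault pages in, so a caller that needs the pages
       to be resident still has to touch them after a successful call.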
http://man7.org/linux/man-pages/man2/shmat.2.html
10
SYSTEM CALL:
shmop(2) - Linux manual page
FUNCTIONALITY:

       shmat, shmdt - System V shared memory operations
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/shm.h>

       void *shmat(int shmid, const void *shmaddr, int shmflg);

       int shmdt(const void *shmaddr);
DESCRIPTION

   shmat()
       shmat() attaches the System V shared memory segment identified by
       shmid to the address space of the calling process.  The attaching
       address is specified by shmaddr with one of the following criteria:

       *  If shmaddr is NULL, the system chooses a suitable (unused) address
          at which to attach the segment.

       *  If shmaddr isn't NULL and SHM_RND is specified in shmflg, the
          attach occurs at the address equal to shmaddr rounded down to the
          nearest multiple of SHMLBA.

       *  Otherwise, shmaddr must be a page-aligned address at which the
          attach occurs.
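
       The rounding applied when SHM_RND is specified can be sketched
       as plain arithmetic.  The helper name is hypothetical; the sketch
       only computes the address and does not attach anything.

```c
#include <stdint.h>
#include <sys/shm.h>

/* Hypothetical helper: the address at which shmat() would attach when
   SHM_RND is set, that is, addr rounded down to a multiple of SHMLBA. */
static uintptr_t shm_rnd_address(uintptr_t addr)
{
    uintptr_t lba = (uintptr_t)SHMLBA;

    return addr - (addr % lba);
}
```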

       In addition to SHM_RND, the following flags may be specified in the
       shmflg bit-mask argument:

       SHM_EXEC (Linux-specific; since Linux 2.6.9)
              Allow the contents of the segment to be executed.  The caller
              must have execute permission on the segment.

       SHM_RDONLY
              Attach the segment for read-only access.  The process must
              have read permission for the segment.  If this flag is not
              specified, the segment is attached for read and write access,
              and the process must have read and write permission for the
              segment.  There is no notion of a write-only shared memory
              segment.

       SHM_REMAP (Linux-specific)
              This flag specifies that the mapping of the segment should
              replace any existing mapping in the range starting at shmaddr
              and continuing for the size of the segment.  (Normally, an
              EINVAL error would result if a mapping already exists in this
              address range.)  In this case, shmaddr must not be NULL.

       The brk(2) value of the calling process is not altered by the attach.
       The segment will automatically be detached at process exit.  The same
       segment may be attached as a read-only and as a read-write one, and
       more than once, in the process's address space.

       A successful shmat() call updates the members of the shmid_ds
       structure (see shmctl(2)) associated with the shared memory segment
       as follows:

              shm_atime is set to the current time.

              shm_lpid is set to the process-ID of the calling process.

              shm_nattch is incremented by one.

   shmdt()
       shmdt() detaches the shared memory segment located at the address
       specified by shmaddr from the address space of the calling process.
       The to-be-detached segment must be currently attached with shmaddr
       equal to the value returned by the attaching shmat() call.

       On a successful shmdt() call, the system updates the members of the
       shmid_ds structure associated with the shared memory segment as
       follows:

              shm_dtime is set to the current time.

              shm_lpid is set to the process-ID of the calling process.

              shm_nattch is decremented by one.  If it becomes 0 and the
              segment is marked for deletion, the segment is deleted.
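
       The bookkeeping described above for shmat() and shmdt() can be
       exercised with the following sketch.  The helper names are
       hypothetical.  The second helper attaches the same segment twice,
       once read-write and once with SHM_RDONLY, to show that both
       mappings refer to the same memory.

```c
#include <stddef.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Hypothetical helper: current value of shm_nattch, or -1 on error. */
static long shm_attach_count(int shmid)
{
    struct shmid_ds ds;

    if (shmctl(shmid, IPC_STAT, &ds) == -1)
        return -1;
    return (long)ds.shm_nattch;
}

/* Hypothetical helper: write msg through a read-write attachment and
   read it back through a second, read-only attachment.  The caller
   must make sure that msg fits into the segment.  Returns 1 if both
   views agree, 0 if they differ, -1 on error. */
static int shm_roundtrip(int shmid, const char *msg)
{
    char *rw = shmat(shmid, NULL, 0);
    const char *ro;
    int same;

    if (rw == (void *)-1)
        return -1;
    strcpy(rw, msg);

    ro = shmat(shmid, NULL, SHM_RDONLY);
    if (ro == (void *)-1) {
        shmdt(rw);
        return -1;
    }
    same = strcmp(ro, msg) == 0;
    if (shmdt(ro) == -1 || shmdt(rw) == -1)
        return -1;
    return same;
}
```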
http://man7.org/linux/man-pages/man2/shmdt.2.html
10
SYSTEM CALL:
shmop(2) - Linux manual page
FUNCTIONALITY:

       shmat, shmdt - System V shared memory operations
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/shm.h>

       void *shmat(int shmid, const void *shmaddr, int shmflg);

       int shmdt(const void *shmaddr);
DESCRIPTION

   shmat()
       shmat() attaches the System V shared memory segment identified by
       shmid to the address space of the calling process.  The attaching
       address is specified by shmaddr with one of the following criteria:

       *  If shmaddr is NULL, the system chooses a suitable (unused) address
          at which to attach the segment.

       *  If shmaddr isn't NULL and SHM_RND is specified in shmflg, the
          attach occurs at the address equal to shmaddr rounded down to the
          nearest multiple of SHMLBA.

       *  Otherwise, shmaddr must be a page-aligned address at which the
          attach occurs.

       In addition to SHM_RND, the following flags may be specified in the
       shmflg bit-mask argument:

       SHM_EXEC (Linux-specific; since Linux 2.6.9)
              Allow the contents of the segment to be executed.  The caller
              must have execute permission on the segment.

       SHM_RDONLY
              Attach the segment for read-only access.  The process must
              have read permission for the segment.  If this flag is not
              specified, the segment is attached for read and write access,
              and the process must have read and write permission for the
              segment.  There is no notion of a write-only shared memory
              segment.

       SHM_REMAP (Linux-specific)
              This flag specifies that the mapping of the segment should
              replace any existing mapping in the range starting at shmaddr
              and continuing for the size of the segment.  (Normally, an
              EINVAL error would result if a mapping already exists in this
              address range.)  In this case, shmaddr must not be NULL.

       The brk(2) value of the calling process is not altered by the attach.
       The segment will automatically be detached at process exit.  The same
       segment may be attached as a read-only and as a read-write one, and
       more than once, in the process's address space.

       A successful shmat() call updates the members of the shmid_ds
       structure (see shmctl(2)) associated with the shared memory segment
       as follows:

              shm_atime is set to the current time.

              shm_lpid is set to the process-ID of the calling process.

              shm_nattch is incremented by one.

   shmdt()
       shmdt() detaches the shared memory segment located at the address
       specified by shmaddr from the address space of the calling process.
       The to-be-detached segment must be currently attached with shmaddr
       equal to the value returned by the attaching shmat() call.

       On a successful shmdt() call, the system updates the members of the
       shmid_ds structure associated with the shared memory segment as
       follows:

              shm_dtime is set to the current time.

              shm_lpid is set to the process-ID of the calling process.

              shm_nattch is decremented by one.  If it becomes 0 and the
              segment is marked for deletion, the segment is deleted.
http://man7.org/linux/man-pages/man2/semget.2.html
11
SYSTEM CALL:
semget(2) - Linux manual page
FUNCTIONALITY:

       semget - get a System V semaphore set identifier
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/ipc.h>
       #include <sys/sem.h>

       int semget(key_t key, int nsems, int semflg);
DESCRIPTION

       The semget() system call returns the System V semaphore set
       identifier associated with the argument key.  A new set of nsems
       semaphores is created if key has the value IPC_PRIVATE or if no
       existing semaphore set is associated with key and IPC_CREAT is
       specified in semflg.

       If semflg specifies both IPC_CREAT and IPC_EXCL and a semaphore set
       already exists for key, then semget() fails with errno set to EEXIST.
       (This is analogous to the effect of the combination O_CREAT | O_EXCL
       for open(2).)

       Upon creation, the least significant 9 bits of the argument semflg
       define the permissions (for owner, group and others) for the
       semaphore set.  These bits have the same format, and the same
       meaning, as the mode argument of open(2) (though the execute
       permissions are not meaningful for semaphores, and write permissions
       mean permission to alter semaphore values).

       When creating a new semaphore set, semget() initializes the set's
       associated data structure, semid_ds (see semctl(2)), as follows:

              sem_perm.cuid and sem_perm.uid are set to the effective user
              ID of the calling process.

              sem_perm.cgid and sem_perm.gid are set to the effective group
              ID of the calling process.

              The least significant 9 bits of sem_perm.mode are set to the
              least significant 9 bits of semflg.

              sem_nsems is set to the value of nsems.

              sem_otime is set to 0.

              sem_ctime is set to the current time.

       The argument nsems can be 0 (it is ignored) when a semaphore set is
       not being created.  Otherwise, nsems must be greater than 0 and less
       than or equal to the maximum number of semaphores per semaphore set
       (SEMMSL).

       If the semaphore set already exists, the permissions are verified.
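
       The initialization of semid_ds listed above can be checked with
       a short sketch.  The helper name is hypothetical, and union semun
       is defined by the program itself, as semctl(2) requires.

```c
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>

/* Defined by the calling program; see semctl(2). */
union semun {
    int              val;
    struct semid_ds *buf;
    unsigned short  *array;
    struct seminfo  *__buf;
};

/* Hypothetical helper: create a private set of nsems semaphores with
   permission bits `perms`, copy its semid_ds into *ds, and remove the
   set.  Returns 0 on success, -1 on error (errno is set). */
static int sem_create_stat_remove(int nsems, int perms,
                                  struct semid_ds *ds)
{
    union semun arg;
    int semid = semget(IPC_PRIVATE, nsems, IPC_CREAT | (perms & 0777));
    int rc;

    if (semid == -1)
        return -1;
    arg.buf = ds;
    rc = semctl(semid, 0, IPC_STAT, arg);
    if (semctl(semid, 0, IPC_RMID) == -1)
        return -1;
    return rc;
}
```

       Passing 0 for nsems makes the helper fail, because a set that is
       being created must contain at least one semaphore.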
http://man7.org/linux/man-pages/man2/semctl.2.html
10
SYSTEM CALL:
semctl(2) - Linux manual page
FUNCTIONALITY:

       semctl - System V semaphore control operations
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/ipc.h>
       #include <sys/sem.h>

       int semctl(int semid, int semnum, int cmd, ...);
DESCRIPTION

       semctl() performs the control operation specified by cmd on the
       System V semaphore set identified by semid, or on the semnum-th
       semaphore of that set.  (The semaphores in a set are numbered
       starting at 0.)

       This function has three or four arguments, depending on cmd.  When
       there are four, the fourth has the type union semun.  The calling
       program must define this union as follows:

           union semun {
               int              val;    /* Value for SETVAL */
               struct semid_ds *buf;    /* Buffer for IPC_STAT, IPC_SET */
               unsigned short  *array;  /* Array for GETALL, SETALL */
               struct seminfo  *__buf;  /* Buffer for IPC_INFO
                                           (Linux-specific) */
           };
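
       The difference between the three-argument and four-argument
       forms can be sketched as follows.  The helper names are
       hypothetical.  SETVAL takes a union semun as the fourth
       argument, while GETVAL takes no fourth argument at all.

```c
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>

/* Defined by the calling program, as described above. */
union semun {
    int              val;
    struct semid_ds *buf;
    unsigned short  *array;
    struct seminfo  *__buf;
};

/* Hypothetical helper: set one semaphore (four-argument form). */
static int sem_set_value(int semid, int semnum, int value)
{
    union semun arg;

    arg.val = value;
    return semctl(semid, semnum, SETVAL, arg);
}

/* Hypothetical helper: read one semaphore (three-argument form).
   Returns semval, or -1 on error. */
static int sem_get_value(int semid, int semnum)
{
    return semctl(semid, semnum, GETVAL);
}
```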

       The semid_ds data structure is defined in <sys/sem.h> as follows:

           struct semid_ds {
               struct ipc_perm sem_perm;  /* Ownership and permissions */
               time_t          sem_otime; /* Last semop time */
               time_t          sem_ctime; /* Last change time */
               unsigned long   sem_nsems; /* No. of semaphores in set */
           };

       The ipc_perm structure is defined as follows (the uid, gid, and mode
       fields are settable using IPC_SET):

           struct ipc_perm {
               key_t          __key; /* Key supplied to semget(2) */
               uid_t          uid;   /* Effective UID of owner */
               gid_t          gid;   /* Effective GID of owner */
               uid_t          cuid;  /* Effective UID of creator */
               gid_t          cgid;  /* Effective GID of creator */
               unsigned short mode;  /* Permissions */
               unsigned short __seq; /* Sequence number */
           };

       Valid values for cmd are:

       IPC_STAT  Copy information from the kernel data structure associated
                 with semid into the semid_ds structure pointed to by
                 arg.buf.  The argument semnum is ignored.  The calling
                 process must have read permission on the semaphore set.

       IPC_SET   Write the values of some members of the semid_ds structure
                 pointed to by arg.buf to the kernel data structure
                 associated with this semaphore set, updating also its
                 sem_ctime member.  The following members of the structure
                 are updated: sem_perm.uid, sem_perm.gid, and (the least
                 significant 9 bits of) sem_perm.mode.  The effective UID of
                 the calling process must match the owner (sem_perm.uid) or
                 creator (sem_perm.cuid) of the semaphore set, or the caller
                 must be privileged.  The argument semnum is ignored.

       IPC_RMID  Immediately remove the semaphore set, awakening all
                 processes blocked in semop(2) calls on the set (with an
                 error return and errno set to EIDRM).  The effective user
                 ID of the calling process must match the creator or owner
                 of the semaphore set, or the caller must be privileged.
                 The argument semnum is ignored.

       IPC_INFO (Linux-specific)
                 Return information about system-wide semaphore limits and
                 parameters in the structure pointed to by arg.__buf.  This
                 structure is of type seminfo, defined in <sys/sem.h> if the
                 _GNU_SOURCE feature test macro is defined:

                     struct  seminfo {
                         int semmap;  /* Number of entries in semaphore
                                         map; unused within kernel */
                         int semmni;  /* Maximum number of semaphore sets */
                         int semmns;  /* Maximum number of semaphores in all
                                         semaphore sets */
                         int semmnu;  /* System-wide maximum number of undo
                                         structures; unused within kernel */
                         int semmsl;  /* Maximum number of semaphores in a
                                         set */
                         int semopm;  /* Maximum number of operations for
                                         semop(2) */
                         int semume;  /* Maximum number of undo entries per
                                         process; unused within kernel */
                         int semusz;  /* Size of struct sem_undo */
                         int semvmx;  /* Maximum semaphore value */
                         int semaem;  /* Max. value that can be recorded for
                                         semaphore adjustment (SEM_UNDO) */
                     };

                 The semmsl, semmns, semopm, and semmni settings can be
                 changed via /proc/sys/kernel/sem; see proc(5) for details.

       SEM_INFO (Linux-specific)
                 Return a seminfo structure containing the same information
                 as for IPC_INFO, except that the following fields are
                 returned with information about system resources consumed
                 by semaphores: the semusz field returns the number of
                 semaphore sets that currently exist on the system; and the
                 semaem field returns the total number of semaphores in all
                 semaphore sets on the system.

       SEM_STAT (Linux-specific)
                 Return a semid_ds structure as for IPC_STAT.  However, the
                 semid argument is not a semaphore identifier, but instead
                 an index into the kernel's internal array that maintains
                 information about all semaphore sets on the system.

       GETALL    Return semval (i.e., the current value) for all semaphores
                 of the set into arg.array.  The argument semnum is ignored.
                 The calling process must have read permission on the
                 semaphore set.

       GETNCNT   Return the value of semncnt for the semnum-th semaphore of
                 the set (i.e., the number of processes waiting for an
                 increase of semval for the semnum-th semaphore of the set).
                 The calling process must have read permission on the
                 semaphore set.

       GETPID    Return the value of sempid for the semnum-th semaphore of
                 the set.  This is the PID of the process that last
                 performed an operation on that semaphore (but see NOTES).
                 The calling process must have read permission on the
                 semaphore set.

       GETVAL    Return the value of semval for the semnum-th semaphore of
                 the set.  The calling process must have read permission on
                 the semaphore set.

       GETZCNT   Return the value of semzcnt for the semnum-th semaphore of
                 the set (i.e., the number of processes waiting for semval
                 of the semnum-th semaphore of the set to become 0).  The
                 calling process must have read permission on the semaphore
                 set.

       SETALL    Set semval for all semaphores of the set using arg.array,
                 updating also the sem_ctime member of the semid_ds
                 structure associated with the set.  Undo entries (see
                 semop(2)) are cleared for altered semaphores in all
                 processes.  If the changes to semaphore values would permit
                 blocked semop(2) calls in other processes to proceed, then
                 those processes are woken up.  The argument semnum is
                 ignored.  The calling process must have alter (write)
                 permission on the semaphore set.

       SETVAL    Set the value of semval to arg.val for the semnum-th
                 semaphore of the set, updating also the sem_ctime member of
                 the semid_ds structure associated with the set.  Undo
                 entries are cleared for altered semaphores in all
                 processes.  If the changes to semaphore values would permit
                 blocked semop(2) calls in other processes to proceed, then
                 those processes are woken up.  The calling process must
                 have alter permission on the semaphore set.
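
       SETALL and GETALL operate on the whole set through arg.array.
       A minimal sketch, with hypothetical helper names, follows.  The
       caller has to supply an array with one element per semaphore in
       the set.

```c
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>

/* Defined by the calling program; see the description of union semun. */
union semun {
    int              val;
    struct semid_ds *buf;
    unsigned short  *array;
    struct seminfo  *__buf;
};

/* Hypothetical helper: set every semaphore in the set from values[].
   semnum is ignored by SETALL, so 0 is passed. */
static int sem_set_all(int semid, unsigned short *values)
{
    union semun arg;

    arg.array = values;
    return semctl(semid, 0, SETALL, arg);
}

/* Hypothetical helper: copy every semval of the set into values[]. */
static int sem_get_all(int semid, unsigned short *values)
{
    union semun arg;

    arg.array = values;
    return semctl(semid, 0, GETALL, arg);
}
```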
http://man7.org/linux/man-pages/man2/semop.2.html
13
SYSTEM CALL:
semop(2) - Linux manual page
FUNCTIONALITY:

       semop, semtimedop - System V semaphore operations
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/ipc.h>
       #include <sys/sem.h>

       int semop(int semid, struct sembuf *sops, size_t nsops);

       int semtimedop(int semid, struct sembuf *sops, size_t nsops,
                      const struct timespec *timeout);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       semtimedop(): _GNU_SOURCE
DESCRIPTION

       Each semaphore in a System V semaphore set has the following
       associated values:

           unsigned short  semval;   /* semaphore value */
           unsigned short  semzcnt;  /* # waiting for zero */
           unsigned short  semncnt;  /* # waiting for increase */
           pid_t           sempid;   /* PID of process that last
                                        modified semaphore value */

       semop() performs operations on selected semaphores in the set
       indicated by semid.  Each of the nsops elements in the array pointed
       to by sops is a structure that specifies an operation to be performed
       on a single semaphore.  The elements of this structure are of type
       struct sembuf, containing the following members:

           unsigned short sem_num;  /* semaphore number */
           short          sem_op;   /* semaphore operation */
           short          sem_flg;  /* operation flags */

       Flags recognized in sem_flg are IPC_NOWAIT and SEM_UNDO.  If an
       operation specifies SEM_UNDO, it will be automatically undone when
       the process terminates.

       The set of operations contained in sops is performed in array order,
       and atomically, that is, the operations are performed either as a
       complete unit, or not at all.  The behavior of the system call if not
       all operations can be performed immediately depends on the presence
       of the IPC_NOWAIT flag in the individual sem_flg fields, as noted
       below.

       Each operation is performed on the sem_num-th semaphore of the
       semaphore set, where the first semaphore of the set is numbered 0.
       There are three types of operation, distinguished by the value of
       sem_op.

       If sem_op is a positive integer, the operation adds this value to the
       semaphore value (semval).  Furthermore, if SEM_UNDO is specified for
       this operation, the system subtracts the value sem_op from the
       semaphore adjustment (semadj) value for this semaphore.  This
       operation can always proceed—it never forces a thread to wait.  The
       calling process must have alter permission on the semaphore set.

       If sem_op is zero, the process must have read permission on the
       semaphore set.  This is a "wait-for-zero" operation: if semval is
       zero, the operation can immediately proceed.  Otherwise, if
       IPC_NOWAIT is specified in sem_flg, semop() fails with errno set to
       EAGAIN (and none of the operations in sops is performed).  Otherwise,
       semzcnt (the count of threads waiting until this semaphore's value
       becomes zero) is incremented by one and the thread sleeps until one
       of the following occurs:

       ·  semval becomes 0, at which time the value of semzcnt is
          decremented.

       ·  The semaphore set is removed: semop() fails, with errno set to
          EIDRM.

       ·  The calling thread catches a signal: the value of semzcnt is
          decremented and semop() fails, with errno set to EINTR.

       If sem_op is less than zero, the process must have alter permission
       on the semaphore set.  If semval is greater than or equal to the
       absolute value of sem_op, the operation can proceed immediately: the
       absolute value of sem_op is subtracted from semval, and, if SEM_UNDO
       is specified for this operation, the system adds the absolute value
       of sem_op to the semaphore adjustment (semadj) value for this
       semaphore.  If the absolute value of sem_op is greater than semval,
       and IPC_NOWAIT is specified in sem_flg, semop() fails, with errno set
       to EAGAIN (and none of the operations in sops is performed).
       Otherwise, semncnt (the counter of threads waiting for this
       semaphore's value to increase) is incremented by one and the thread
       sleeps until one of the following occurs:

       ·  semval becomes greater than or equal to the absolute value of
          sem_op: the operation now proceeds, as described above.

       ·  The semaphore set is removed from the system: semop() fails, with
          errno set to EIDRM.

       ·  The calling thread catches a signal: the value of semncnt is
          decremented and semop() fails, with errno set to EINTR.
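
       The positive and negative cases of sem_op map onto the familiar
       release and acquire operations.  The following sketch uses
       hypothetical helper names and specifies IPC_NOWAIT for the
       acquire, so that it reports EAGAIN instead of sleeping.

```c
#include <errno.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>

/* Hypothetical helper: subtract 1 from a semaphore without blocking.
   Returns 0 on success, 1 if the operation would have to wait
   (EAGAIN), and -1 on any other error. */
static int sem_try_acquire(int semid, unsigned short semnum)
{
    struct sembuf op = {
        .sem_num = semnum, .sem_op = -1, .sem_flg = IPC_NOWAIT
    };

    if (semop(semid, &op, 1) == 0)
        return 0;
    return errno == EAGAIN ? 1 : -1;
}

/* Hypothetical helper: add 1 to a semaphore; this never waits. */
static int sem_release(int semid, unsigned short semnum)
{
    struct sembuf op = {
        .sem_num = semnum, .sem_op = 1, .sem_flg = 0
    };

    return semop(semid, &op, 1);
}
```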

       On successful completion, the sempid value for each semaphore
       specified in the array pointed to by sops is set to the caller's
       process ID.  In addition, the sem_otime is set to the current time.
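
       The all-or-nothing behavior of a multi-element sops array, and
       the sempid update on success, can be illustrated with the
       following sketch.  The helper names are hypothetical.

```c
#include <errno.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: in one atomic semop() call, add 1 to semaphore
   `up` and subtract 1 from semaphore `down`.  Returns 0 on success,
   1 if the subtraction cannot proceed (EAGAIN, in which case `up` is
   left untouched as well), and -1 on any other error. */
static int sem_transfer_nowait(int semid, unsigned short up,
                               unsigned short down)
{
    struct sembuf ops[2] = {
        { .sem_num = up,   .sem_op = 1,  .sem_flg = 0 },
        { .sem_num = down, .sem_op = -1, .sem_flg = IPC_NOWAIT },
    };

    if (semop(semid, ops, 2) == 0)
        return 0;
    return errno == EAGAIN ? 1 : -1;
}

/* Hypothetical helper: 1 if sempid of the semaphore equals the PID of
   the calling process, 0 if not, -1 on error. */
static int sem_last_op_was_mine(int semid, int semnum)
{
    int pid = semctl(semid, semnum, GETPID);

    if (pid == -1)
        return -1;
    return pid == (int)getpid();
}
```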

   semtimedop()
       semtimedop() behaves identically to semop() except that in those
       cases where the calling thread would sleep, the duration of that
       sleep is limited by the amount of elapsed time specified by the
       timespec structure whose address is passed in the timeout argument.
       (This sleep interval will be rounded up to the system clock
       granularity, and kernel scheduling delays mean that the interval may
       overrun by a small amount.)  If the specified time limit has been
       reached, semtimedop() fails with errno set to EAGAIN (and none of the
       operations in sops is performed).  If the timeout argument is NULL,
       then semtimedop() behaves exactly like semop().

       Note that if semtimedop() is interrupted by a signal, causing the call
       to fail with the error EINTR, the contents of timeout are left
       unchanged.
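
       A minimal sketch of a bounded wait with semtimedop() follows.
       The helper name and the millisecond parameter are hypothetical.
       The sketch does not retry after EINTR; it reports that case as
       an ordinary error.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical helper: subtract 1 from a semaphore, sleeping for at
   most timeout_ms milliseconds.  Returns 0 on success, 1 if the time
   limit was reached (EAGAIN), and -1 on any other error. */
static int sem_acquire_timeout(int semid, unsigned short semnum,
                               long timeout_ms)
{
    struct sembuf op = {
        .sem_num = semnum, .sem_op = -1, .sem_flg = 0
    };
    struct timespec ts = {
        .tv_sec  = timeout_ms / 1000,
        .tv_nsec = (timeout_ms % 1000) * 1000000L,
    };

    if (semtimedop(semid, &op, 1, &ts) == 0)
        return 0;
    return errno == EAGAIN ? 1 : -1;
}
```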
http://man7.org/linux/man-pages/man2/semtimedop.2.html
13
SYSTEM CALL:
semop(2) - Linux manual page
FUNCTIONALITY:

       semop, semtimedop - System V semaphore operations
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/ipc.h>
       #include <sys/sem.h>

       int semop(int semid, struct sembuf *sops, size_t nsops);

       int semtimedop(int semid, struct sembuf *sops, size_t nsops,
                      const struct timespec *timeout);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       semtimedop(): _GNU_SOURCE
DESCRIPTION

       Each semaphore in a System V semaphore set has the following
       associated values:

           unsigned short  semval;   /* semaphore value */
           unsigned short  semzcnt;  /* # waiting for zero */
           unsigned short  semncnt;  /* # waiting for increase */
           pid_t           sempid;   /* PID of process that last
                                        modified semaphore value */

       semop() performs operations on selected semaphores in the set
       indicated by semid.  Each of the nsops elements in the array pointed
       to by sops is a structure that specifies an operation to be performed
       on a single semaphore.  The elements of this structure are of type
       struct sembuf, containing the following members:

           unsigned short sem_num;  /* semaphore number */
           short          sem_op;   /* semaphore operation */
           short          sem_flg;  /* operation flags */

       Flags recognized in sem_flg are IPC_NOWAIT and SEM_UNDO.  If an
       operation specifies SEM_UNDO, it will be automatically undone when
       the process terminates.

       The set of operations contained in sops is performed in array order,
       and atomically, that is, the operations are performed either as a
       complete unit, or not at all.  The behavior of the system call if not
       all operations can be performed immediately depends on the presence
       of the IPC_NOWAIT flag in the individual sem_flg fields, as noted
       below.

       Each operation is performed on the sem_num-th semaphore of the
       semaphore set, where the first semaphore of the set is numbered 0.
       There are three types of operation, distinguished by the value of
       sem_op.

       If sem_op is a positive integer, the operation adds this value to the
       semaphore value (semval).  Furthermore, if SEM_UNDO is specified for
       this operation, the system subtracts the value sem_op from the
       semaphore adjustment (semadj) value for this semaphore.  This
       operation can always proceed—it never forces a thread to wait.  The
       calling process must have alter permission on the semaphore set.

       If sem_op is zero, the process must have read permission on the
       semaphore set.  This is a "wait-for-zero" operation: if semval is
       zero, the operation can immediately proceed.  Otherwise, if
       IPC_NOWAIT is specified in sem_flg, semop() fails with errno set to
       EAGAIN (and none of the operations in sops is performed).  Otherwise,
       semzcnt (the count of threads waiting until this semaphore's value
       becomes zero) is incremented by one and the thread sleeps until one
       of the following occurs:

       ·  semval becomes 0, at which time the value of semzcnt is
          decremented.

       ·  The semaphore set is removed: semop() fails, with errno set to
          EIDRM.

       ·  The calling thread catches a signal: the value of semzcnt is
          decremented and semop() fails, with errno set to EINTR.

       If sem_op is less than zero, the process must have alter permission
       on the semaphore set.  If semval is greater than or equal to the
       absolute value of sem_op, the operation can proceed immediately: the
       absolute value of sem_op is subtracted from semval, and, if SEM_UNDO
       is specified for this operation, the system adds the absolute value
       of sem_op to the semaphore adjustment (semadj) value for this
       semaphore.  If the absolute value of sem_op is greater than semval,
       and IPC_NOWAIT is specified in sem_flg, semop() fails, with errno set
       to EAGAIN (and none of the operations in sops is performed).
       Otherwise, semncnt (the counter of threads waiting for this
       semaphore's value to increase) is incremented by one and the thread
       sleeps until one of the following occurs:

       ·  semval becomes greater than or equal to the absolute value of
          sem_op: the operation now proceeds, as described above.

       ·  The semaphore set is removed from the system: semop() fails, with
          errno set to EIDRM.

       ·  The calling thread catches a signal: the value of semncnt is
          decremented and semop() fails, with errno set to EINTR.

       On successful completion, the sempid value for each semaphore
       specified in the array pointed to by sops is set to the caller's
       process ID.  In addition, the sem_otime is set to the current time.
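
       The three kinds of operation described above can be sketched with
       a set that holds a single semaphore.  This is a minimal
       illustration, not a reference implementation: the helper names
       (sysv_sem_create, sysv_sem_apply, sysv_sem_value, sysv_sem_remove)
       are hypothetical, and the union semun definition follows the
       convention described in semctl(2), where the caller must define it.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* glibc does not define union semun; the caller must (see semctl(2)). */
union semun {
    int              val;
    struct semid_ds *buf;
    unsigned short  *array;
};

/* Hypothetical helper: create a private set with one semaphore whose
   semval is `initial`.  Returns the semaphore ID, or -1 on error. */
int sysv_sem_create(int initial)
{
    int semid = semget(IPC_PRIVATE, 1, 0600);
    if (semid == -1)
        return -1;

    union semun arg = { .val = initial };
    if (semctl(semid, 0, SETVAL, arg) == -1) {
        semctl(semid, 0, IPC_RMID);
        return -1;
    }
    return semid;
}

/* Hypothetical helper: apply one operation to semaphore number 0.
   sem_op > 0 adds to semval, sem_op < 0 subtracts (and may block),
   sem_op == 0 waits for semval to become zero. */
int sysv_sem_apply(int semid, short sem_op, short sem_flg)
{
    struct sembuf sop = { .sem_num = 0, .sem_op = sem_op, .sem_flg = sem_flg };
    return semop(semid, &sop, 1);
}

/* Hypothetical helpers: read semval, and remove the set. */
int sysv_sem_value(int semid)  { return semctl(semid, 0, GETVAL); }
int sysv_sem_remove(int semid) { return semctl(semid, 0, IPC_RMID); }
```

       With IPC_NOWAIT in sem_flg, an operation that would otherwise
       sleep (a decrement larger than semval, or a wait-for-zero on a
       nonzero semval) returns -1 with errno set to EAGAIN instead.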

   semtimedop()
       semtimedop() behaves identically to semop() except that in those
       cases where the calling thread would sleep, the duration of that
       sleep is limited by the amount of elapsed time specified by the
       timespec structure whose address is passed in the timeout argument.
       (This sleep interval will be rounded up to the system clock
       granularity, and kernel scheduling delays mean that the interval may
       overrun by a small amount.)  If the specified time limit has been
       reached, semtimedop() fails with errno set to EAGAIN (and none of the
       operations in sops is performed).  If the timeout argument is NULL,
       then semtimedop() behaves exactly like semop().

       Note that if semtimedop() is interrupted by a signal, causing the call
       to fail with the error EINTR, the contents of timeout are left
       unchanged.
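
       A bounded decrement using semtimedop() might look like the
       following sketch.  The helper name sem_take_timed and the choice
       to retry on EINTR are assumptions made for illustration.  Because
       the contents of timeout are left unchanged on EINTR, each retry
       in this sketch may sleep for up to the full interval again.

```c
#define _GNU_SOURCE             /* required for semtimedop() */
#include <errno.h>
#include <time.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* Hypothetical helper: subtract 1 from semaphore 0 of the set `semid`,
   sleeping for at most `millis` milliseconds.
   Returns 0 on success, or -1 with errno set (EAGAIN on timeout). */
int sem_take_timed(int semid, long millis)
{
    struct sembuf sop = { .sem_num = 0, .sem_op = -1, .sem_flg = 0 };
    struct timespec timeout = {
        .tv_sec  = millis / 1000,
        .tv_nsec = (millis % 1000) * 1000000L,
    };
    int rc;

    /* Retry if a signal handler interrupted the sleep. */
    do {
        rc = semtimedop(semid, &sop, 1, &timeout);
    } while (rc == -1 && errno == EINTR);

    return rc;
}
```

       A caller that needs a strict overall deadline would have to
       measure the elapsed time itself and shrink the interval before
       each retry.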
http://man7.org/linux/man-pages/man2/futex.2.html
12
SYSTEM CALL:
futex(2) - Linux manual page
FUNCTIONALITY:

       futex - fast user-space locking
SYNOPSIS:

       #include <linux/futex.h>
       #include <sys/time.h>

       int futex(int *uaddr, int futex_op, int val,
                 const struct timespec *timeout,   /* or: uint32_t val2 */
                 int *uaddr2, int val3);

       Note: There is no glibc wrapper for this system call; see NOTES.
DESCRIPTION

       The futex() system call provides a method for waiting until a certain
       condition becomes true.  It is typically used as a blocking construct
       in the context of shared-memory synchronization.  When using futexes,
       the majority of the synchronization operations are performed in user
       space.  A user-space program employs the futex() system call only
       when it is likely that the program has to block for a longer time
       until the condition becomes true.  Other futex() operations can be
       used to wake any processes or threads waiting for a particular
       condition.

       A futex is a 32-bit value—referred to below as a futex word—whose
       address is supplied to the futex() system call.  (Futexes are 32 bits
       in size on all platforms, including 64-bit systems.)  All futex
       operations are governed by this value.  In order to share a futex
       between processes, the futex is placed in a region of shared memory,
       created using (for example) mmap(2) or shmat(2).  (Thus, the futex
       word may have different virtual addresses in different processes, but
       these addresses all refer to the same location in physical memory.)
       In a multithreaded program, it is sufficient to place the futex word
       in a global variable shared by all threads.

       When executing a futex operation that requests to block a thread, the
       kernel will block only if the futex word has the value that the
       calling thread supplied (as one of the arguments of the futex() call)
       as the expected value of the futex word.  The loading of the futex
       word's value, the comparison of that value with the expected value,
       and the actual blocking will happen atomically and will be totally
       ordered with respect to concurrent operations performed by other
       threads on the same futex word.  Thus, the futex word is used to
       connect the synchronization in user space with the implementation of
       blocking by the kernel.  Analogously to an atomic compare-and-
       exchange operation that potentially changes shared memory, blocking
       via a futex is an atomic compare-and-block operation.

       One use of futexes is for implementing locks.  The state of the lock
       (i.e., acquired or not acquired) can be represented as an atomically
       accessed flag in shared memory.  In the uncontended case, a thread
       can access or modify the lock state with atomic instructions, for
       example atomically changing it from not acquired to acquired using an
       atomic compare-and-exchange instruction.  (Such instructions are
       performed entirely in user mode, and the kernel maintains no
       information about the lock state.)  On the other hand, a thread may
       be unable to acquire a lock because it is already acquired by another
       thread.  It then may pass the lock's flag as a futex word and the
       value representing the acquired state as the expected value to a
       futex() wait operation.  This futex() operation will block if and
       only if the lock is still acquired (i.e., the value in the futex word
       still matches the "acquired state").  When releasing the lock, a
       thread has to first reset the lock state to not acquired and then
       execute a futex operation that wakes threads blocked on the lock flag
       used as a futex word (this can be further optimized to avoid
       unnecessary wake-ups).  See futex(7) for more detail on how to use
       futexes.
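
       The locking scheme just described can be sketched as follows.
       This is a deliberately simple illustration under stated
       assumptions: the state encoding (0 for not acquired, 1 for
       acquired) and the names futex_call, simple_lock, and
       simple_unlock are hypothetical, the futex is invoked through
       syscall(2) because there is no glibc wrapper, and the unlock path
       always issues a wake-up rather than applying the optimization
       mentioned above.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed encoding of the lock state stored in the futex word. */
enum { LOCK_FREE = 0, LOCK_HELD = 1 };

/* Thin wrapper around the raw system call. */
static long futex_call(_Atomic uint32_t *uaddr, int futex_op, uint32_t val)
{
    return syscall(SYS_futex, uaddr, futex_op, val, NULL, NULL, 0);
}

void simple_lock(_Atomic uint32_t *word)
{
    for (;;) {
        uint32_t expected = LOCK_FREE;

        /* Uncontended case: one atomic instruction, no system call. */
        if (atomic_compare_exchange_strong(word, &expected, LOCK_HELD))
            return;

        /* Contended case: sleep only if the word still reads LOCK_HELD.
           Whatever the outcome (wake-up, EAGAIN, EINTR), try again. */
        futex_call(word, FUTEX_WAIT, LOCK_HELD);
    }
}

void simple_unlock(_Atomic uint32_t *word)
{
    /* First reset the state, then wake one thread blocked on the word. */
    atomic_store(word, LOCK_FREE);
    futex_call(word, FUTEX_WAKE, 1);
}
```

       Tracking whether any thread is actually waiting (for example with
       a third state value) would let the unlock path skip the system
       call when there is no contention; that refinement is omitted here.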

       Besides the basic wait and wake-up futex functionality, there are
       further futex operations aimed at supporting more complex use cases.

       Note that no explicit initialization or destruction is necessary to
       use futexes; the kernel maintains a futex (i.e., the kernel-internal
       implementation artifact) only while operations such as FUTEX_WAIT,
       described below, are being performed on a particular futex word.

   Arguments
       The uaddr argument points to the futex word.  On all platforms,
       futexes are four-byte integers that must be aligned on a four-byte
       boundary.  The operation to perform on the futex is specified in the
       futex_op argument; val is a value whose meaning and purpose depends
       on futex_op.

       The remaining arguments (timeout, uaddr2, and val3) are required only
       for certain of the futex operations described below.  Where one of
       these arguments is not required, it is ignored.

       For several blocking operations, the timeout argument is a pointer to
       a timespec structure that specifies a timeout for the operation.
       However, notwithstanding the prototype shown above, for some
       operations, the least significant four bytes are used as an integer
       whose meaning is determined by the operation.  For these operations,
       the kernel casts the timeout value first to unsigned long, then to
       uint32_t, and in the remainder of this page, this argument is
       referred to as val2 when interpreted in this fashion.

       Where it is required, the uaddr2 argument is a pointer to a second
       futex word that is employed by the operation.

       The interpretation of the final integer argument, val3, depends on
       the operation.

   Futex operations
       The futex_op argument consists of two parts: a command that specifies
       the operation to be performed, bit-wise ORed with zero or more
       options that modify the behaviour of the operation.  The options that
       may be included in futex_op are as follows:

       FUTEX_PRIVATE_FLAG (since Linux 2.6.22)
              This option bit can be employed with all futex operations.  It
              tells the kernel that the futex is process-private and not
              shared with another process (i.e., it is being used for
              synchronization only between threads of the same process).
              This allows the kernel to make some additional performance
              optimizations.

              As a convenience, <linux/futex.h> defines a set of constants
              with the suffix _PRIVATE that are equivalents of all of the
              operations listed below, but with the FUTEX_PRIVATE_FLAG ORed
              into the constant value.  Thus, there are FUTEX_WAIT_PRIVATE,
              FUTEX_WAKE_PRIVATE, and so on.

       FUTEX_CLOCK_REALTIME (since Linux 2.6.28)
              This option bit can be employed only with the
              FUTEX_WAIT_BITSET, FUTEX_WAIT_REQUEUE_PI, and (since Linux
              4.5) FUTEX_WAIT operations.

              If this option is set, the kernel measures the timeout against
              the CLOCK_REALTIME clock.

              If this option is not set, the kernel measures the timeout
              against the CLOCK_MONOTONIC clock.
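
       The split of futex_op into a command and option bits can be
       illustrated with two small helpers.  The function names are
       hypothetical; FUTEX_CMD_MASK is the mask that <linux/futex.h>
       provides for clearing both option bits.

```c
#include <linux/futex.h>

/* Hypothetical helper: add the process-private option to a command. */
int futex_private(int command)
{
    return command | FUTEX_PRIVATE_FLAG;
}

/* Hypothetical helper: strip the option bits to recover the command. */
int futex_command(int futex_op)
{
    return futex_op & FUTEX_CMD_MASK;
}
```

       For example, futex_private(FUTEX_WAIT) yields the same value as
       the predefined constant FUTEX_WAIT_PRIVATE.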

       The operation specified in futex_op is one of the following:

       FUTEX_WAIT (since Linux 2.6.0)
              This operation tests that the value at the futex word pointed
              to by the address uaddr still contains the expected value val,
              and if so, then sleeps waiting for a FUTEX_WAKE operation on
              the futex word.  The load of the value of the futex word is an
              atomic memory access (i.e., using atomic machine instructions
              of the respective architecture).  This load, the comparison
              with the expected value, and starting to sleep are performed
              atomically and totally ordered with respect to other futex
              operations on the same futex word.  If the thread starts to
              sleep, it is considered a waiter on this futex word.  If the
              futex value does not match val, then the call fails
              immediately with the error EAGAIN.

              The purpose of the comparison with the expected value is to
              prevent lost wake-ups.  If another thread changed the value of
              the futex word after the calling thread decided to block based
              on the prior value, and if the other thread executed a
              FUTEX_WAKE operation (or similar wake-up) after the value
              change and before this FUTEX_WAIT operation, then the calling
              thread will observe the value change and will not start to
              sleep.

              If the timeout is not NULL, the structure it points to
              specifies a timeout for the wait.  (This interval will be
              rounded up to the system clock granularity, and is guaranteed
              not to expire early.)  The timeout is by default measured
              according to the CLOCK_MONOTONIC clock, but, since Linux 4.5,
              the CLOCK_REALTIME clock can be selected by specifying
              FUTEX_CLOCK_REALTIME in futex_op.  If timeout is NULL, the
              call blocks indefinitely.

              Note: for FUTEX_WAIT, timeout is interpreted as a relative
              value.  This differs from other futex operations, where
              timeout is interpreted as an absolute value.  To obtain the
              equivalent of FUTEX_WAIT with an absolute timeout, employ
              FUTEX_WAIT_BITSET with val3 specified as
              FUTEX_BITSET_MATCH_ANY.

              The arguments uaddr2 and val3 are ignored.
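
              A wait with a relative timeout could be wrapped as in the
              sketch below.  The name futex_wait_ms and the millisecond
              interface are assumptions made for illustration; the call
              goes through syscall(2) because there is no glibc wrapper.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical wrapper: sleep while *uaddr == expected, for at most
   `millis` milliseconds (relative).  A negative `millis` means no
   timeout.  Returns 0 when woken, or -1 with errno set, for example
   EAGAIN (value mismatch), ETIMEDOUT, or EINTR. */
int futex_wait_ms(uint32_t *uaddr, uint32_t expected, long millis)
{
    struct timespec ts;
    struct timespec *timeout = NULL;    /* NULL: block indefinitely */

    if (millis >= 0) {
        ts.tv_sec  = millis / 1000;
        ts.tv_nsec = (millis % 1000) * 1000000L;
        timeout = &ts;
    }

    /* uaddr2 and val3 are ignored by FUTEX_WAIT. */
    return (int) syscall(SYS_futex, uaddr, FUTEX_WAIT, expected,
                         timeout, NULL, 0);
}
```

              If the word does not hold the expected value, the call
              returns at once with EAGAIN; if it does and nobody issues
              a wake-up, the call ends with ETIMEDOUT once the interval
              has passed.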

       FUTEX_WAKE (since Linux 2.6.0)
              This operation wakes at most val of the waiters that are
              waiting (e.g., inside FUTEX_WAIT) on the futex word at the
              address uaddr.  Most commonly, val is specified as either 1
              (wake up a single waiter) or INT_MAX (wake up all waiters).
              No guarantee is provided about which waiters are awoken (e.g.,
              a waiter with a higher scheduling priority is not guaranteed
              to be awoken in preference to a waiter with a lower priority).

              The arguments timeout, uaddr2, and val3 are ignored.
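
              The two common choices for val can be wrapped as shown in
              this sketch.  The function names are hypothetical, and the
              call is made through syscall(2).  On success the call
              returns the number of waiters that were woken.

```c
#define _GNU_SOURCE
#include <limits.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical wrapper: wake at most `count` waiters on *uaddr.
   Returns the number of waiters woken, or -1 with errno set. */
int futex_wake(uint32_t *uaddr, int count)
{
    /* timeout, uaddr2, and val3 are ignored by FUTEX_WAKE. */
    return (int) syscall(SYS_futex, uaddr, FUTEX_WAKE, count,
                         NULL, NULL, 0);
}

/* Wake a single waiter. */
int futex_wake_one(uint32_t *uaddr) { return futex_wake(uaddr, 1); }

/* Wake every waiter. */
int futex_wake_all(uint32_t *uaddr) { return futex_wake(uaddr, INT_MAX); }
```

              When no thread is waiting on the futex word, both helpers
              return 0.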

       FUTEX_FD (from Linux 2.6.0 up to and including Linux 2.6.25)
              This operation creates a file descriptor that is associated
              with the futex at uaddr.  The caller must close the returned
              file descriptor after use.  When another process or thread
              performs a FUTEX_WAKE on the futex word, the file descriptor
              is reported as readable by select(2), poll(2), and
              epoll(7).

              The file descriptor can be used to obtain asynchronous
              notifications: if val is nonzero, then, when another process
              or thread executes a FUTEX_WAKE, the caller will receive the
              signal number that was passed in val.

              The arguments timeout, uaddr2 and val3 are ignored.

              Because it was inherently racy, FUTEX_FD has been removed from
              Linux 2.6.26 onward.

       FUTEX_REQUEUE (since Linux 2.6.0)
              This operation performs the same task as FUTEX_CMP_REQUEUE
              (see below), except that no check is made using the value in
              val3.  (The argument val3 is ignored.)

       FUTEX_CMP_REQUEUE (since Linux 2.6.7)
              This operation first checks whether the location uaddr still
              contains the value val3.  If not, the operation fails with the
              error EAGAIN.  Otherwise, the operation wakes up a maximum of
              val waiters that are waiting on the futex at uaddr.  If there
              are more than val waiters, then the remaining waiters are
              removed from the wait queue of the source futex at uaddr and
              added to the wait queue of the target futex at uaddr2.  The
              val2 argument specifies an upper limit on the number of
              waiters that are requeued to the futex at uaddr2.

              The load from uaddr is an atomic memory access (i.e., using
              atomic machine instructions of the respective architecture).
              This load, the comparison with val3, and the requeueing of any
              waiters are performed atomically and totally ordered with
              respect to other operations on the same futex word.

              Typical values to specify for val are 0 or 1.  (Specifying
              INT_MAX is not useful, because it would make the
              FUTEX_CMP_REQUEUE operation equivalent to FUTEX_WAKE.)  The
              limit value specified via val2 is typically either 1 or
              INT_MAX.  (Specifying the argument as 0 is not useful, because
              no waiters would then be requeued, which would make the
              FUTEX_CMP_REQUEUE operation equivalent to FUTEX_WAKE.)

              The FUTEX_CMP_REQUEUE operation was added as a replacement for
              the earlier FUTEX_REQUEUE.  The difference is that the check
              of the value at uaddr can be used to ensure that requeueing
              happens only under certain conditions, which allows race
              conditions to be avoided in certain use cases.

              Both FUTEX_REQUEUE and FUTEX_CMP_REQUEUE can be used to avoid
              "thundering herd" wake-ups that could occur when using
              FUTEX_WAKE in cases where all of the waiters that are woken
              need to acquire another futex.  Consider the following
              scenario, where multiple waiter threads are waiting on B, a
              wait queue implemented using a futex:

                  lock(A);
                  while (!check_value(V)) {
                      unlock(A);
                      block_on(B);
                      lock(A);
                  }
                  unlock(A);

              If a waker thread used FUTEX_WAKE, then all waiters waiting on
              B would be woken up, and they would all try to acquire lock A.
              However, waking all of the threads in this manner would be
              pointless because all except one of the threads would
              immediately block on lock A again.  By contrast, a requeue
              operation wakes just one waiter and moves the other waiters to
              lock A, and when the woken waiter unlocks A then the next
              waiter can proceed.
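
              A raw FUTEX_CMP_REQUEUE call might be wrapped as in the
              following sketch.  The function names are hypothetical.
              Note how val2 travels in the position of the timeout
              argument, as described under Arguments.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <limits.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical wrapper: if *from still equals `expected`, wake at most
   `wake_count` waiters on `from` and move at most `requeue_limit` of
   the remaining waiters to `to`.  Returns the number of waiters woken
   or requeued, or -1 with errno set (EAGAIN on a value mismatch). */
int futex_cmp_requeue(uint32_t *from, uint32_t *to, int wake_count,
                      int requeue_limit, uint32_t expected)
{
    return (int) syscall(SYS_futex, from, FUTEX_CMP_REQUEUE, wake_count,
                         (unsigned long) requeue_limit, /* val2 */
                         to, expected);                 /* uaddr2, val3 */
}

/* The pattern from the scenario above: wake one waiter on B and move
   all of the others to lock A. */
int futex_wake_one_requeue_rest(uint32_t *b, uint32_t *a,
                                uint32_t expected)
{
    return futex_cmp_requeue(b, a, 1, INT_MAX, expected);
}
```

              With no waiters present the call succeeds and returns 0; a
              stale expected value makes it fail with EAGAIN, which the
              caller can treat as a signal to re-read the futex word and
              try again.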

       FUTEX_WAKE_OP (since Linux 2.6.14)
              This operation was added to support some user-space use cases
              where more than one futex must be handled at the same time.
              The most notable example is the implementation of
              pthread_cond_signal(3), which requires operations on two
              futexes, the one used to implement the mutex and the one used
              in the implementation of the wait queue associated with the
              condition variable.  FUTEX_WAKE_OP allows such cases to be
              implemented without leading to high rates of contention and
              context switching.

              The FUTEX_WAKE_OP operation is equivalent to executing the
              following code atomically and totally ordered with respect to
              other futex operations on either of the two supplied futex words:

                  int oldval = *(int *) uaddr2;
                  *(int *) uaddr2 = oldval op oparg;
                  futex(uaddr, FUTEX_WAKE, val, 0, 0, 0);
                  if (oldval cmp cmparg)
                      futex(uaddr2, FUTEX_WAKE, val2, 0, 0, 0);

              In other words, FUTEX_WAKE_OP does the following:

              *  saves the original value of the futex word at uaddr2 and
                 performs an operation to modify the value of the futex at
                 uaddr2; this is an atomic read-modify-write memory access
                 (i.e., using atomic machine instructions of the respective
                 architecture);

              *  wakes up a maximum of val waiters on the futex for the
                 futex word at uaddr; and

              *  dependent on the results of a test of the original value of
                 the futex word at uaddr2, wakes up a maximum of val2
                 waiters on the futex for the futex word at uaddr2.

              The operation and comparison that are to be performed are
              encoded in the bits of the argument val3.  Pictorially, the
              encoding is:

                      +---+---+-----------+-----------+
                      |op |cmp|   oparg   |  cmparg   |
                      +---+---+-----------+-----------+
                        4   4       12          12    <== # of bits

              Expressed in code, the encoding is:

                  #define FUTEX_OP(op, oparg, cmp, cmparg) \
                                  (((op & 0xf) << 28) | \
                                  ((cmp & 0xf) << 24) | \
                                  ((oparg & 0xfff) << 12) | \
                                  (cmparg & 0xfff))

              In the above, op and cmp are each one of the codes listed
              below.  The oparg and cmparg components are literal numeric
              values, except as noted below.

              The op component has one of the following values:

                  FUTEX_OP_SET        0  /* uaddr2 = oparg; */
                  FUTEX_OP_ADD        1  /* uaddr2 += oparg; */
                  FUTEX_OP_OR         2  /* uaddr2 |= oparg; */
                  FUTEX_OP_ANDN       3  /* uaddr2 &= ~oparg; */
                  FUTEX_OP_XOR        4  /* uaddr2 ^= oparg; */

              In addition, bit-wise ORing the following value into op causes
              (1 << oparg) to be used as the operand:

                  FUTEX_OP_ARG_SHIFT  8  /* Use (1 << oparg) as operand */

              The cmp field is one of the following:

                  FUTEX_OP_CMP_EQ     0  /* if (oldval == cmparg) wake */
                  FUTEX_OP_CMP_NE     1  /* if (oldval != cmparg) wake */
                  FUTEX_OP_CMP_LT     2  /* if (oldval < cmparg) wake */
                  FUTEX_OP_CMP_LE     3  /* if (oldval <= cmparg) wake */
                  FUTEX_OP_CMP_GT     4  /* if (oldval > cmparg) wake */
                  FUTEX_OP_CMP_GE     5  /* if (oldval >= cmparg) wake */

              The return value of FUTEX_WAKE_OP is the sum of the number of
              waiters woken on the futex uaddr plus the number of waiters
              woken on the futex uaddr2.
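
              The bit layout of val3 can be checked with a small encoder
              and decoder.  This sketch mirrors the FUTEX_OP() macro
              shown above but uses unsigned arithmetic; the names
              wake_op_encode, wake_op_decode, and struct wake_op are
              hypothetical.

```c
#include <linux/futex.h>
#include <stdint.h>

/* Pack the four fields into val3: op (4 bits), cmp (4 bits),
   oparg (12 bits), cmparg (12 bits), from most to least significant. */
uint32_t wake_op_encode(uint32_t op, uint32_t oparg,
                        uint32_t cmp, uint32_t cmparg)
{
    return ((op & 0xf) << 28) |
           ((cmp & 0xf) << 24) |
           ((oparg & 0xfff) << 12) |
           (cmparg & 0xfff);
}

/* Hypothetical container for the unpacked fields. */
struct wake_op {
    uint32_t op;
    uint32_t cmp;
    uint32_t oparg;
    uint32_t cmparg;
};

/* Unpack val3 into its four fields (no sign extension is applied). */
struct wake_op wake_op_decode(uint32_t val3)
{
    struct wake_op w = {
        .op     = (val3 >> 28) & 0xf,
        .cmp    = (val3 >> 24) & 0xf,
        .oparg  = (val3 >> 12) & 0xfff,
        .cmparg = val3 & 0xfff,
    };
    return w;
}
```

              For instance, "add 1 to the word at uaddr2 and wake its
              waiters if the old value was greater than 0" is encoded by
              wake_op_encode(FUTEX_OP_ADD, 1, FUTEX_OP_CMP_GT, 0), which
              evaluates to 0x14001000.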

       FUTEX_WAIT_BITSET (since Linux 2.6.25)
              This operation is like FUTEX_WAIT except that val3 is used to
              provide a 32-bit bit mask to the kernel.  This bit mask, in
              which at least one bit must be set, is stored in the kernel-
              internal state of the waiter.  See the description of
              FUTEX_WAKE_BITSET for further details.

              If timeout is not NULL, the structure it points to specifies
              an absolute timeout for the wait operation.  If timeout is
              NULL, the operation can block indefinitely.

              The uaddr2 argument is ignored.

       FUTEX_WAKE_BITSET (since Linux 2.6.25)
              This operation is the same as FUTEX_WAKE except that the val3
              argument is used to provide a 32-bit bit mask to the kernel.
              This bit mask, in which at least one bit must be set, is used
              to select which waiters should be woken up.  The selection is
              done by a bit-wise AND of the "wake" bit mask (i.e., the value
              in val3) and the bit mask which is stored in the kernel-
              internal state of the waiter (the "wait" bit mask that is set
              using FUTEX_WAIT_BITSET).  All of the waiters for which the
              result of the AND is nonzero are woken up; the remaining
              waiters are left sleeping.

              The effect of FUTEX_WAIT_BITSET and FUTEX_WAKE_BITSET is to
              allow selective wake-ups among multiple waiters that are
              blocked on the same futex.  However, note that, depending on
              the use case, employing this bit-mask multiplexing feature on
              a futex can be less efficient than simply using multiple
              futexes, because employing bit-mask multiplexing requires the
              kernel to check all waiters on a futex, including those that
              are not interested in being woken up (i.e., they do not have
              the relevant bit set in their "wait" bit mask).

              The constant FUTEX_BITSET_MATCH_ANY, which corresponds to all
              32 bits set in the bit mask, can be used as the val3 argument
              for FUTEX_WAIT_BITSET and FUTEX_WAKE_BITSET.  Other than
              differences in the handling of the timeout argument, the
              FUTEX_WAIT operation is equivalent to FUTEX_WAIT_BITSET with
              val3 specified as FUTEX_BITSET_MATCH_ANY; that is, allow a
              wake-up by any waker.  The FUTEX_WAKE operation is equivalent
              to FUTEX_WAKE_BITSET with val3 specified as
              FUTEX_BITSET_MATCH_ANY; that is, wake up any waiter(s).

              The uaddr2 and timeout arguments are ignored.
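
              The mask selection rule and the absolute timeout of
              FUTEX_WAIT_BITSET can be sketched as follows.  The names
              bitset_selects and futex_wait_bitset_ms are hypothetical,
              and the deadline is computed on CLOCK_MONOTONIC on the
              assumption that FUTEX_CLOCK_REALTIME is not set.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Selection rule: a waiter is woken if the "wake" mask and its own
   "wait" mask have at least one bit in common. */
int bitset_selects(uint32_t wake_mask, uint32_t wait_mask)
{
    return (wake_mask & wait_mask) != 0;
}

/* Hypothetical wrapper: sleep while *uaddr == expected until `millis`
   milliseconds from now, registering `mask` as the "wait" bit mask.
   Returns 0 when woken, or -1 with errno set. */
int futex_wait_bitset_ms(uint32_t *uaddr, uint32_t expected,
                         long millis, uint32_t mask)
{
    struct timespec deadline;

    /* The timeout is absolute, so start from the current time. */
    if (clock_gettime(CLOCK_MONOTONIC, &deadline) == -1)
        return -1;

    deadline.tv_sec  += millis / 1000;
    deadline.tv_nsec += (millis % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec  += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    /* uaddr2 is ignored; val3 carries the bit mask. */
    return (int) syscall(SYS_futex, uaddr, FUTEX_WAIT_BITSET, expected,
                         &deadline, NULL, mask);
}
```

              Passing FUTEX_BITSET_MATCH_ANY as the mask gives the
              equivalent of FUTEX_WAIT with an absolute timeout.  A mask
              with no bits set is rejected with EINVAL.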

   Priority-inheritance futexes
       Linux supports priority-inheritance (PI) futexes in order to handle
       priority-inversion problems that can be encountered with normal futex
       locks.  Priority inversion is the problem that occurs when a high-
       priority task is blocked waiting to acquire a lock held by a low-
       priority task, while tasks at an intermediate priority continuously
       preempt the low-priority task from the CPU.  Consequently, the low-
       priority task makes no progress toward releasing the lock, and the
       high-priority task remains blocked.

       Priority inheritance is a mechanism for dealing with the priority-
       inversion problem.  With this mechanism, when a high-priority task
       becomes blocked by a lock held by a low-priority task, the priority
       of the low-priority task is temporarily raised to that of the high-
       priority task, so that it is not preempted by any intermediate level
       tasks, and can thus make progress toward releasing the lock.  To be
       effective, priority inheritance must be transitive, meaning that if a
       high-priority task blocks on a lock held by a lower-priority task
       that is itself blocked by a lock held by another intermediate-
       priority task (and so on, for chains of arbitrary length), then both
       of those tasks (or more generally, all of the tasks in a lock chain)
       have their priorities raised to be the same as the high-priority
       task.

       From a user-space perspective, what makes a futex PI-aware is a
       policy agreement (described below) between user space and the kernel
       about the value of the futex word, coupled with the use of the PI-
       futex operations described below.  (Unlike the other futex operations
       described above, the PI-futex operations are designed for the
       implementation of very specific IPC mechanisms.)

       The PI-futex operations described below differ from the other futex
       operations in that they impose policy on the use of the value of the
       futex word:

       *  If the lock is not acquired, the futex word's value shall be 0.

       *  If the lock is acquired, the futex word's value shall be the
          thread ID (TID; see gettid(2)) of the owning thread.

       *  If the lock is owned and there are threads contending for the
          lock, then the FUTEX_WAITERS bit shall be set in the futex word's
          value; in other words, this value is:

              FUTEX_WAITERS | TID

          (Note that it is invalid for a PI futex word to have no owner and
          FUTEX_WAITERS set.)

       With this policy in place, a user-space application can acquire an
       unacquired lock or release a lock using atomic instructions executed
       in user mode (e.g., a compare-and-swap operation such as cmpxchg on
       the x86 architecture).  Acquiring a lock simply consists of using
       compare-and-swap to atomically set the futex word's value to the
       caller's TID if its previous value was 0.  Releasing a lock requires
       using compare-and-swap to set the futex word's value to 0 if the
       previous value was the expected TID.
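
       The user-space fast path for acquiring and releasing a PI futex
       can be sketched with C11 atomics.  The names pi_try_acquire and
       pi_try_release are hypothetical, and the fallback to the kernel
       (the PI-futex operations described below) is deliberately left
       out of this sketch.

```c
#define _GNU_SOURCE
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Thread ID of the caller, obtained through the raw system call. */
static uint32_t current_tid(void)
{
    return (uint32_t) syscall(SYS_gettid);
}

/* Try to acquire in user space: set the word to our TID if it is 0.
   Returns 1 on success; 0 means the lock is already acquired and the
   caller would have to enter the kernel instead (not shown). */
int pi_try_acquire(_Atomic uint32_t *word)
{
    uint32_t expected = 0;
    return atomic_compare_exchange_strong(word, &expected, current_tid());
}

/* Try to release in user space: set the word to 0 if it holds exactly
   our TID.  Returns 1 on success; 0 means the word holds something
   else (for example, our TID with FUTEX_WAITERS set), so the release
   would have to go through the kernel (not shown). */
int pi_try_release(_Atomic uint32_t *word)
{
    uint32_t expected = current_tid();
    return atomic_compare_exchange_strong(word, &expected, 0);
}
```

       Because both helpers compare against an exact value, any extra
       bit set in the futex word makes the fast path fail rather than
       silently overwrite state that the kernel relies on.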

       If a futex is already acquired (i.e., has a nonzero value), waiters
       must employ the FUTEX_LOCK_PI operation to acquire the lock.  If
       other threads are waiting for the lock, then the FUTEX_WAITERS bit is
       set in the futex value; in this case, the lock owner must employ the
       FUTEX_UNLOCK_PI operation to release the lock.

       In the cases where callers are forced into the kernel (i.e., required
       to perform a futex() call), they then deal directly with a so-called
       RT-mutex, a kernel locking mechanism which implements the required
       priority-inheritance semantics.  After the RT-mutex is acquired, the
       futex value is updated accordingly, before the calling thread returns
       to user space.

       It is important to note that the kernel will update the futex word's
       value prior to returning to user space.  (This prevents the
       possibility of the futex word's value ending up in an invalid state,
       such as having an owner but the value being 0, or having waiters but
       not having the FUTEX_WAITERS bit set.)

       If a futex has an associated RT-mutex in the kernel (i.e., there are
       blocked waiters) and the owner of the futex/RT-mutex dies
       unexpectedly, then the kernel cleans up the RT-mutex and hands it
       over to the next waiter.  This in turn requires that the user-space
       value is updated accordingly.  To indicate that this is required, the
       kernel sets the FUTEX_OWNER_DIED bit in the futex word along with the
       thread ID of the new owner.  User space can detect this situation via
       the presence of the FUTEX_OWNER_DIED bit and is then responsible for
       cleaning up the stale state left over by the dead owner.
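
       As an illustration of how user space might inspect a PI futex word,
       the sketch below splits the word into its owner TID and flag bits
       using the FUTEX_TID_MASK, FUTEX_WAITERS, and FUTEX_OWNER_DIED
       constants from <linux/futex.h>.  The struct and function names are
       hypothetical, and what "cleaning up" means is application-specific
       and is not shown.

```c
#include <assert.h>
#include <linux/futex.h>
#include <stdbool.h>
#include <stdint.h>

static_assert((FUTEX_WAITERS & FUTEX_TID_MASK) == 0 &&
              (FUTEX_OWNER_DIED & FUTEX_TID_MASK) == 0,
              "flag bits must not overlap the TID field");

/* Decoded view of a PI futex word (hypothetical helper type). */
struct pi_state {
    uint32_t owner_tid;   /* 0 means the lock is not owned */
    bool     waiters;     /* FUTEX_WAITERS is set */
    bool     owner_died;  /* FUTEX_OWNER_DIED is set */
};

static struct pi_state pi_decode(uint32_t word)
{
    struct pi_state s;

    s.owner_tid  = word & FUTEX_TID_MASK;
    s.waiters    = (word & FUTEX_WAITERS) != 0;
    s.owner_died = (word & FUTEX_OWNER_DIED) != 0;
    return s;
}

/* A word with no owner but FUTEX_WAITERS set is invalid. */
static bool pi_word_valid(uint32_t word)
{
    struct pi_state s = pi_decode(word);

    return !(s.owner_tid == 0 && s.waiters);
}

/* True if `my_tid` now owns the lock because the previous owner died,
   so the caller must repair the data the lock protects. */
static bool pi_needs_recovery(uint32_t word, uint32_t my_tid)
{
    struct pi_state s = pi_decode(word);

    return s.owner_died && s.owner_tid == my_tid;
}
```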

       PI futexes are operated on by specifying one of the values listed
       below in futex_op.  Note that the PI futex operations must be used as
       paired operations and are subject to some additional requirements:

       *  FUTEX_LOCK_PI and FUTEX_TRYLOCK_PI pair with FUTEX_UNLOCK_PI.
          FUTEX_UNLOCK_PI must be called only on a futex owned by the
          calling thread, as defined by the value policy, otherwise the
          error EPERM results.

       *  FUTEX_WAIT_REQUEUE_PI pairs with FUTEX_CMP_REQUEUE_PI.  This must
          be performed from a non-PI futex to a distinct PI futex (or the
          error EINVAL results).  Additionally, val (the number of waiters
          to be woken) must be 1 (or the error EINVAL results).

       The PI futex operations are as follows:

       FUTEX_LOCK_PI (since Linux 2.6.18)
              This operation is used after an attempt to acquire the lock
              via an atomic user-mode instruction failed because the futex
              word has a nonzero value—specifically, because it contained
              the (PID-namespace-specific) TID of the lock owner.

              The operation checks the value of the futex word at the
              address uaddr.  If the value is 0, then the kernel tries to
              atomically set the futex value to the caller's TID.  If the
              futex word's value is nonzero, the kernel atomically sets the
              FUTEX_WAITERS bit, which signals the futex owner that it
              cannot unlock the futex in user space atomically by setting
              the futex value to 0.  After that, the kernel:

              1. Tries to find the thread which is associated with the owner
                 TID.

              2. Creates or reuses kernel state on behalf of the owner.  (If
                 this is the first waiter, there is no kernel state for this
                 futex, so kernel state is created by locking the RT-mutex
                 and the futex owner is made the owner of the RT-mutex.  If
                 there are existing waiters, then the existing state is
                 reused.)

              3. Attaches the waiter to the futex (i.e., the waiter is
                 enqueued on the RT-mutex waiter list).

              If more than one waiter exists, the enqueueing of the waiter
              is in descending priority order.  (For information on priority
              ordering, see the discussion of the SCHED_DEADLINE,
              SCHED_FIFO, and SCHED_RR scheduling policies in sched(7).)
              The owner inherits either the waiter's CPU bandwidth (if the
              waiter is scheduled under the SCHED_DEADLINE policy) or the
              waiter's priority (if the waiter is scheduled under the
              SCHED_RR or SCHED_FIFO policy).  This inheritance follows the
              lock chain in the case of nested locking and performs deadlock
              detection.

              The timeout argument provides a timeout for the lock attempt.
              If timeout is not NULL, the structure it points to specifies
              an absolute timeout, measured against the CLOCK_REALTIME
              clock.  If timeout is NULL, the operation will block
              indefinitely.

              The uaddr2, val, and val3 arguments are ignored.

       FUTEX_TRYLOCK_PI (since Linux 2.6.18)
              This operation tries to acquire the lock at uaddr.  It is
              invoked when a user-space atomic acquire did not succeed
              because the futex word was not 0.

              Because the kernel has access to more state information than
              user space, acquisition of the lock might succeed if performed
              by the kernel in cases where the futex word (i.e., the state
              information accessible to user space) contains stale state
              (FUTEX_WAITERS and/or FUTEX_OWNER_DIED).  This can happen when
              the owner of the futex died.  User space cannot handle this
              condition in a race-free manner, but the kernel can fix this
              up and acquire the futex.

              The uaddr2, val, timeout, and val3 arguments are ignored.

       FUTEX_UNLOCK_PI (since Linux 2.6.18)
              This operation wakes the top priority waiter that is waiting
              in FUTEX_LOCK_PI on the futex address provided by the uaddr
              argument.

              This is called when the user-space value at uaddr cannot be
              changed atomically from a TID (of the owner) to 0.

              The uaddr2, val, timeout, and val3 arguments are ignored.

       FUTEX_CMP_REQUEUE_PI (since Linux 2.6.31)
              This operation is a PI-aware variant of FUTEX_CMP_REQUEUE.  It
              requeues waiters that are blocked via FUTEX_WAIT_REQUEUE_PI on
              uaddr from a non-PI source futex (uaddr) to a PI target futex
              (uaddr2).

              As with FUTEX_CMP_REQUEUE, this operation wakes up a maximum
              of val waiters that are waiting on the futex at uaddr.
              However, for FUTEX_CMP_REQUEUE_PI, val is required to be 1
              (since the main point is to avoid a thundering herd).  The
              remaining waiters are removed from the wait queue of the
              source futex at uaddr and added to the wait queue of the
              target futex at uaddr2.

              The val2 and val3 arguments serve the same purposes as for
              FUTEX_CMP_REQUEUE.

       FUTEX_WAIT_REQUEUE_PI (since Linux 2.6.31)
              Wait on a non-PI futex at uaddr and potentially be requeued
              (via a FUTEX_CMP_REQUEUE_PI operation in another task) onto a
              PI futex at uaddr2.  The wait operation on uaddr is the same
              as for FUTEX_WAIT.

              The waiter can be removed from the wait on uaddr without
              requeueing on uaddr2 via a FUTEX_WAKE operation in another
              task.  In this case, the FUTEX_WAIT_REQUEUE_PI operation fails
              with the error EAGAIN.

              If timeout is not NULL, the structure it points to specifies
              an absolute timeout for the wait operation.  If timeout is
              NULL, the operation can block indefinitely.

              The val3 argument is ignored.

              The FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI operations
              were added to support a fairly specific use case:
              priority-inheritance-aware POSIX threads condition variables.
              The idea is that these operations should always be paired, in
              order to ensure that user space and the kernel remain in
              sync.  Thus, in the FUTEX_WAIT_REQUEUE_PI operation, the
              user-space application pre-specifies the target of the
              requeue that takes place in the FUTEX_CMP_REQUEUE_PI
              operation.
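
       The pairing of the two requeue operations can be sketched with a
       pair of thin wrappers around the raw system call.  The wrapper
       names and parameter names (cond, mutex) are hypothetical.  The
       sketch assumes the usual futex() argument layout, in which val2 is
       passed in the position of the timeout argument, and it performs the
       "distinct futexes" check in user space before calling the kernel.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Waiter side: block on the non-PI futex `cond` while it still contains
   `expected`, naming `mutex` as the PI futex to be requeued onto.
   abs_timeout may be NULL to block indefinitely; val3 is ignored. */
static long wait_requeue_pi(uint32_t *cond, uint32_t expected,
                            uint32_t *mutex,
                            const struct timespec *abs_timeout)
{
    return syscall(SYS_futex, cond, FUTEX_WAIT_REQUEUE_PI, expected,
                   abs_timeout, mutex, 0);
}

/* Signaler side: wake one waiter (val must be 1) and requeue up to
   nr_requeue further waiters from `cond` onto the PI futex `mutex`,
   provided that *cond still contains `expected` (val3). */
static long cmp_requeue_pi(uint32_t *cond, uint32_t *mutex,
                           int nr_requeue, uint32_t expected)
{
    assert(nr_requeue >= 0);
    if (cond == mutex) {            /* source and target must differ */
        errno = EINVAL;
        return -1;
    }
    return syscall(SYS_futex, cond, FUTEX_CMP_REQUEUE_PI, 1,
                   (void *)(uintptr_t)(unsigned int)nr_requeue,
                   mutex, expected);
}
```

       Both sides name the same (cond, mutex) pair, which is what keeps
       user space and the kernel in agreement about the requeue target.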
http://man7.org/linux/man-pages/man2/set_robust_list.2.html
10
SYSTEM CALL:
get_robust_list(2) - Linux manual page
FUNCTIONALITY:

       get_robust_list, set_robust_list - get/set list of robust futexes
SYNOPSIS:

       #include <linux/futex.h>
       #include <sys/types.h>
       #include <syscall.h>

       long get_robust_list(int pid, struct robust_list_head **head_ptr,
                            size_t *len_ptr);
       long set_robust_list(struct robust_list_head *head, size_t len);

       Note: There are no glibc wrappers for these system calls; see NOTES.
DESCRIPTION

       The robust futex implementation needs to maintain per-thread lists of
       the robust futexes which are to be unlocked when the thread exits.
       These lists are managed in user space; the kernel is notified about
       only the location of the head of the list.

       The get_robust_list() system call returns the head of the robust
       futex list of the thread whose thread ID is specified in pid.  If pid
       is 0, the head of the list for the calling thread is returned.  The
       list head is stored in the location pointed to by head_ptr.  The size
       of the object pointed to by **head_ptr is stored in len_ptr.

       Permission to employ get_robust_list() is governed by a ptrace access
       mode PTRACE_MODE_READ_REALCREDS check; see ptrace(2).

       The set_robust_list() system call requests the kernel to record the
       head of the list of robust futexes owned by the calling thread.  The
       head argument is the list head to record.  The len argument should be
       sizeof(*head).
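
       Because there are no glibc wrappers, a program that wants to call
       these system calls has to go through syscall(2).  The following is
       a minimal sketch; the wrapper names are hypothetical.  It only
       reads the calling thread's list head, on the assumption that the C
       library may already have registered its own list, which a call to
       set_robust_list() would replace.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wrapper: fetch the robust list head of thread `pid`
   (0 means the calling thread). */
static long robust_list_get(int pid, struct robust_list_head **head_ptr,
                            size_t *len_ptr)
{
    assert(head_ptr != NULL && len_ptr != NULL);
    return syscall(SYS_get_robust_list, pid, head_ptr, len_ptr);
}

/* Hypothetical wrapper: register `head` for the calling thread.
   len should be sizeof(*head). */
static long robust_list_set(struct robust_list_head *head, size_t len)
{
    return syscall(SYS_set_robust_list, head, len);
}
```

       A typical read-only use is
       "struct robust_list_head *h; size_t n; robust_list_get(0, &h, &n);".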
http://man7.org/linux/man-pages/man2/get_robust_list.2.html
10
SYSTEM CALL:
get_robust_list(2) - Linux manual page
FUNCTIONALITY:

       get_robust_list, set_robust_list - get/set list of robust futexes
SYNOPSIS:

       #include <linux/futex.h>
       #include <sys/types.h>
       #include <syscall.h>

       long get_robust_list(int pid, struct robust_list_head **head_ptr,
                            size_t *len_ptr);
       long set_robust_list(struct robust_list_head *head, size_t len);

       Note: There are no glibc wrappers for these system calls; see NOTES.
DESCRIPTION

       The robust futex implementation needs to maintain per-thread lists of
       the robust futexes which are to be unlocked when the thread exits.
       These lists are managed in user space; the kernel is notified about
       only the location of the head of the list.

       The get_robust_list() system call returns the head of the robust
       futex list of the thread whose thread ID is specified in pid.  If pid
       is 0, the head of the list for the calling thread is returned.  The
       list head is stored in the location pointed to by head_ptr.  The size
       of the object pointed to by **head_ptr is stored in len_ptr.

       Permission to employ get_robust_list() is governed by a ptrace access
       mode PTRACE_MODE_READ_REALCREDS check; see ptrace(2).

       The set_robust_list() system call requests the kernel to record the
       head of the list of robust futexes owned by the calling thread.  The
       head argument is the list head to record.  The len argument should be
       sizeof(*head).
http://man7.org/linux/man-pages/man2/msgget.2.html
11
SYSTEM CALL:
msgget(2) - Linux manual page
FUNCTIONALITY:

       msgget - get a System V message queue identifier
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/ipc.h>
       #include <sys/msg.h>

       int msgget(key_t key, int msgflg);
DESCRIPTION

       The msgget() system call returns the System V message queue
       identifier associated with the value of the key argument.  A new
       message queue is created if key has the value IPC_PRIVATE, or if
       key isn't IPC_PRIVATE, no message queue with the given key exists,
       and IPC_CREAT is specified in msgflg.

       If msgflg specifies both IPC_CREAT and IPC_EXCL and a message queue
       already exists for key, then msgget() fails with errno set to EEXIST.
       (This is analogous to the effect of the combination O_CREAT | O_EXCL
       for open(2).)

       Upon creation, the least significant bits of the argument msgflg
       define the permissions of the message queue.  These permission bits
       have the same format and semantics as the permissions specified for
       the mode argument of open(2).  (The execute permissions are not
       used.)

       If a new message queue is created, then its associated data structure
       msqid_ds (see msgctl(2)) is initialized as follows:

              msg_perm.cuid and msg_perm.uid are set to the effective user
              ID of the calling process.

              msg_perm.cgid and msg_perm.gid are set to the effective group
              ID of the calling process.

              The least significant 9 bits of msg_perm.mode are set to the
              least significant 9 bits of msgflg.

              msg_qnum, msg_lspid, msg_lrpid, msg_stime, and msg_rtime are
              set to 0.

              msg_ctime is set to the current time.

              msg_qbytes is set to the system limit MSGMNB.

       If the message queue already exists, the permissions are verified,
       and a check is made to see if it is marked for destruction.
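
       The two creation modes described above can be sketched as follows.
       The helper names are hypothetical, and the permission bits 0600
       (read and write for the owner only) are just an example choice.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/types.h>

/* Always creates a new queue; IPC_CREAT is not needed with IPC_PRIVATE. */
static int create_private_queue(void)
{
    return msgget(IPC_PRIVATE, 0600);
}

/* Creates the queue for `key`, or fails with errno set to EEXIST if a
   queue with that key already exists. */
static int create_exclusive_queue(key_t key)
{
    assert(key != IPC_PRIVATE);
    return msgget(key, IPC_CREAT | IPC_EXCL | 0600);
}

/* Queues persist until removed, so clean up explicitly. */
static int remove_queue(int msqid)
{
    return msgctl(msqid, IPC_RMID, NULL);
}
```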
http://man7.org/linux/man-pages/man2/msgctl.2.html
10
SYSTEM CALL:
msgctl(2) - Linux manual page
FUNCTIONALITY:

       msgctl - System V message control operations
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/ipc.h>
       #include <sys/msg.h>

       int msgctl(int msqid, int cmd, struct msqid_ds *buf);
DESCRIPTION

       msgctl() performs the control operation specified by cmd on the
       System V message queue with identifier msqid.

       The msqid_ds data structure is defined in <sys/msg.h> as follows:

           struct msqid_ds {
               struct ipc_perm msg_perm;     /* Ownership and permissions */
               time_t          msg_stime;    /* Time of last msgsnd(2) */
               time_t          msg_rtime;    /* Time of last msgrcv(2) */
               time_t          msg_ctime;    /* Time of last change */
               unsigned long   __msg_cbytes; /* Current number of bytes in
                                                queue (nonstandard) */
               msgqnum_t       msg_qnum;     /* Current number of messages
                                                in queue */
               msglen_t        msg_qbytes;   /* Maximum number of bytes
                                                allowed in queue */
               pid_t           msg_lspid;    /* PID of last msgsnd(2) */
               pid_t           msg_lrpid;    /* PID of last msgrcv(2) */
           };

       The ipc_perm structure is defined as follows (the uid, gid, and
       mode fields are settable using IPC_SET):

           struct ipc_perm {
               key_t          __key;       /* Key supplied to msgget(2) */
               uid_t          uid;         /* Effective UID of owner */
               gid_t          gid;         /* Effective GID of owner */
               uid_t          cuid;        /* Effective UID of creator */
               gid_t          cgid;        /* Effective GID of creator */
               unsigned short mode;        /* Permissions */
               unsigned short __seq;       /* Sequence number */
           };

       Valid values for cmd are:

       IPC_STAT
              Copy information from the kernel data structure associated
              with msqid into the msqid_ds structure pointed to by buf.  The
              caller must have read permission on the message queue.

       IPC_SET
              Write the values of some members of the msqid_ds structure
              pointed to by buf to the kernel data structure associated with
              this message queue, updating also its msg_ctime member.  The
              following members of the structure are updated: msg_qbytes,
              msg_perm.uid, msg_perm.gid, and (the least significant 9 bits
              of) msg_perm.mode.  The effective UID of the calling process
              must match the owner (msg_perm.uid) or creator (msg_perm.cuid)
              of the message queue, or the caller must be privileged.
              Appropriate privilege (Linux: the CAP_SYS_RESOURCE capability)
              is required to raise the msg_qbytes value beyond the system
              parameter MSGMNB.

       IPC_RMID
              Immediately remove the message queue, awakening all waiting
              reader and writer processes (with an error return and errno
              set to EIDRM).  The calling process must have appropriate
              privileges or its effective user ID must be either that of the
              creator or owner of the message queue.  The third argument to
              msgctl() is ignored in this case.

       IPC_INFO (Linux-specific)
              Return information about system-wide message queue limits and
              parameters in the structure pointed to by buf.  This structure
              is of type msginfo (thus, a cast is required), defined in
              <sys/msg.h> if the _GNU_SOURCE feature test macro is defined:

                  struct msginfo {
                      int msgpool; /* Size in kibibytes of buffer pool
                                      used to hold message data;
                                      unused within kernel */
                      int msgmap;  /* Maximum number of entries in message
                                      map; unused within kernel */
                      int msgmax;  /* Maximum number of bytes that can be
                                      written in a single message */
                      int msgmnb;  /* Maximum number of bytes that can be
                                      written to queue; used to initialize
                                      msg_qbytes during queue creation
                                      (msgget(2)) */
                      int msgmni;  /* Maximum number of message queues */
                      int msgssz;  /* Message segment size;
                                      unused within kernel */
                      int msgtql;  /* Maximum number of messages on all queues
                                      in system; unused within kernel */
                      unsigned short int msgseg;
                                   /* Maximum number of segments;
                                      unused within kernel */
                  };

              The msgmni, msgmax, and msgmnb settings can be changed via
              /proc files of the same name; see proc(5) for details.

       MSG_INFO (Linux-specific)
              Return a msginfo structure containing the same information as
              for IPC_INFO, except that the following fields are returned
              with information about system resources consumed by message
              queues: the msgpool field returns the number of message queues
              that currently exist on the system; the msgmap field returns
              the total number of messages in all queues on the system; and
              the msgtql field returns the total number of bytes in all
              messages in all queues on the system.

       MSG_STAT (Linux-specific)
              Return a msqid_ds structure as for IPC_STAT.  However, the
              msqid argument is not a queue identifier, but instead an index
              into the kernel's internal array that maintains information
              about all message queues on the system.
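
       A common pattern is to read the current settings with IPC_STAT,
       change one member, and write the structure back with IPC_SET, so
       that the owner and mode are left unchanged.  The helpers below are
       a hedged sketch with hypothetical names; lowering msg_qbytes needs
       no special privilege, whereas raising it beyond MSGMNB does.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/types.h>

/* Change the queue capacity, preserving the other IPC_SET members. */
static int queue_set_capacity(int msqid, msglen_t new_qbytes)
{
    struct msqid_ds ds;

    assert(new_qbytes > 0);
    if (msgctl(msqid, IPC_STAT, &ds) == -1)
        return -1;
    ds.msg_qbytes = new_qbytes;
    return msgctl(msqid, IPC_SET, &ds);
}

/* Return the number of messages currently queued, or -1 on error. */
static long queue_message_count(int msqid)
{
    struct msqid_ds ds;

    if (msgctl(msqid, IPC_STAT, &ds) == -1)
        return -1;
    return (long)ds.msg_qnum;
}
```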
http://man7.org/linux/man-pages/man2/msgsnd.2.html
12
SYSTEM CALL:
msgop(2) - Linux manual page
FUNCTIONALITY:

       msgrcv, msgsnd - System V message queue operations
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/ipc.h>
       #include <sys/msg.h>

       int msgsnd(int msqid, const void *msgp, size_t msgsz, int msgflg);

       ssize_t msgrcv(int msqid, void *msgp, size_t msgsz, long msgtyp,
                      int msgflg);
DESCRIPTION

       The msgsnd() and msgrcv() system calls are used, respectively, to
       send messages to, and receive messages from, a System V message
       queue.  The calling process must have write permission on the message
       queue in order to send a message, and read permission to receive a
       message.

       The msgp argument is a pointer to a caller-defined structure of the
       following general form:

           struct msgbuf {
               long mtype;       /* message type, must be > 0 */
               char mtext[1];    /* message data */
           };

       The mtext field is an array (or other structure) whose size is
       specified by msgsz, a nonnegative integer value.  Messages of zero
       length (i.e., no mtext field) are permitted.  The mtype field must
       have a strictly positive integer value.  This value can be used by
       the receiving process for message selection (see the description of
       msgrcv() below).
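
       In practice, the caller defines its own structure with a long type
       field followed by the payload.  The sketch below uses a
       hypothetical struct note_msg with a fixed-size text buffer and
       shows that msgsz counts only the payload bytes, not the mtype
       field.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NOTE_TEXT_MAX 64

/* Caller-defined message layout: a long type, then the data. */
struct note_msg {
    long mtype;                     /* must be > 0 */
    char mtext[NOTE_TEXT_MAX];      /* payload */
};

/* Fill `msg` and return the msgsz value to pass to msgsnd(): the number
   of payload bytes used (including the terminating NUL).  Returns -1 if
   the type is not positive or the text does not fit. */
static long note_msg_fill(struct note_msg *msg, long type, const char *text)
{
    size_t len;

    assert(msg != NULL && text != NULL);
    len = strlen(text) + 1;
    if (type <= 0 || len > sizeof msg->mtext)
        return -1;
    msg->mtype = type;
    memcpy(msg->mtext, text, len);
    return (long)len;
}
```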

   msgsnd()
       The msgsnd() system call appends a copy of the message pointed to by
       msgp to the message queue whose identifier is specified by msqid.

       If sufficient space is available in the queue, msgsnd() succeeds
       immediately.  The queue capacity is governed by the msg_qbytes field
       in the associated data structure for the message queue.  During queue
       creation this field is initialized to MSGMNB bytes, but this limit
       can be modified using msgctl(2).  A message queue is considered to be
       full if either of the following conditions is true:

       * Adding a new message to the queue would cause the total number of
         bytes in the queue to exceed the queue's maximum size (the
         msg_qbytes field).

       * Adding another message to the queue would cause the total number of
         messages in the queue to exceed the queue's maximum size (the
         msg_qbytes field).  This check is necessary to prevent an unlimited
         number of zero-length messages being placed on the queue.  Although
         such messages contain no data, they nevertheless consume (locked)
         kernel memory.

       If insufficient space is available in the queue, then the default
       behavior of msgsnd() is to block until space becomes available.  If
       IPC_NOWAIT is specified in msgflg, then the call instead fails with
       the error EAGAIN.

       A blocked msgsnd() call may also fail if:

       * the queue is removed, in which case the system call fails with
         errno set to EIDRM; or

       * a signal is caught, in which case the system call fails with errno
         set to EINTR; see signal(7).  (msgsnd() is never automatically
         restarted after being interrupted by a signal handler, regardless
         of the setting of the SA_RESTART flag when establishing a signal
         handler.)

       Upon successful completion the message queue data structure is
       updated as follows:

              msg_lspid is set to the process ID of the calling process.

              msg_qnum is incremented by 1.

              msg_stime is set to the current time.
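
       Since msgsnd() is never restarted automatically after a signal
       handler runs, callers usually retry on EINTR themselves.  The
       following sketch (struct text_msg and send_text() are hypothetical
       names) does that, and treats EAGAIN as "queue full" when
       IPC_NOWAIT was requested.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/types.h>

struct text_msg {
    long mtype;
    char mtext[128];
};

/* Send `text` with the given type.  Returns 0 on success, 1 if `nowait`
   is nonzero and the queue is full, and -1 on any other error. */
static int send_text(int msqid, long type, const char *text, int nowait)
{
    struct text_msg m;
    size_t len;

    assert(text != NULL);
    len = strlen(text) + 1;                 /* send the trailing NUL too */
    if (type <= 0 || len > sizeof m.mtext) {
        errno = EINVAL;
        return -1;
    }
    m.mtype = type;
    memcpy(m.mtext, text, len);

    for (;;) {
        if (msgsnd(msqid, &m, len, nowait ? IPC_NOWAIT : 0) == 0)
            return 0;
        if (errno == EINTR)
            continue;                       /* interrupted: try again */
        if (errno == EAGAIN && nowait)
            return 1;                       /* queue is full */
        return -1;                          /* e.g., EIDRM, EACCES */
    }
}
```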

   msgrcv()
       The msgrcv() system call removes a message from the queue specified
       by msqid and places it in the buffer pointed to by msgp.

       The argument msgsz specifies the maximum size in bytes for the member
       mtext of the structure pointed to by the msgp argument.  If the
       message text has length greater than msgsz, then the behavior depends
       on whether MSG_NOERROR is specified in msgflg.  If MSG_NOERROR is
       specified, then the message text will be truncated (and the truncated
       part will be lost); if MSG_NOERROR is not specified, then the message
       isn't removed from the queue and the system call fails returning -1
       with errno set to E2BIG.

       Unless MSG_COPY is specified in msgflg (see below), the msgtyp
       argument specifies the type of message requested, as follows:

       * If msgtyp is 0, then the first message in the queue is read.

       * If msgtyp is greater than 0, then the first message in the queue of
         type msgtyp is read, unless MSG_EXCEPT was specified in msgflg, in
         which case the first message in the queue of type not equal to
         msgtyp will be read.

       * If msgtyp is less than 0, then the first message in the queue with
         the lowest type less than or equal to the absolute value of msgtyp
         will be read.

       The msgflg argument is a bit mask constructed by ORing together zero
       or more of the following flags:

       IPC_NOWAIT
              Return immediately if no message of the requested type is in
              the queue.  The system call fails with errno set to ENOMSG.

       MSG_COPY (since Linux 3.8)
              Nondestructively fetch a copy of the message at the ordinal
              position in the queue specified by msgtyp (messages are
              considered to be numbered starting at 0).

              This flag must be specified in conjunction with IPC_NOWAIT,
              with the result that, if there is no message available at the
              given position, the call fails immediately with the error
              ENOMSG.  Because they alter the meaning of msgtyp in
              orthogonal ways, MSG_COPY and MSG_EXCEPT may not both be
              specified in msgflg.

              The MSG_COPY flag was added for the implementation of the
              kernel checkpoint-restore facility and is available only if
              the kernel was built with the CONFIG_CHECKPOINT_RESTORE
              option.

       MSG_EXCEPT
              Used with msgtyp greater than 0 to read the first message in
              the queue with message type that differs from msgtyp.

       MSG_NOERROR
              Truncate the message text if it is longer than msgsz bytes.

       If no message of the requested type is available and IPC_NOWAIT isn't
       specified in msgflg, the calling process is blocked until one of the
       following conditions occurs:

       * A message of the desired type is placed in the queue.

       * The message queue is removed from the system.  In this case, the
         system call fails with errno set to EIDRM.

       * The calling process catches a signal.  In this case, the system
         call fails with errno set to EINTR.  (msgrcv() is never
         automatically restarted after being interrupted by a signal
         handler, regardless of the setting of the SA_RESTART flag when
         establishing a signal handler.)

       Upon successful completion the message queue data structure is
       updated as follows:

              msg_lrpid is set to the process ID of the calling process.

              msg_qnum is decremented by 1.

              msg_rtime is set to the current time.
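
       The effect of msgtyp on message selection can be illustrated with
       a small non-blocking receive helper.  The names struct text_msg and
       recv_text() are hypothetical, and the 128-byte buffer is an
       arbitrary choice for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/types.h>

struct text_msg {
    long mtype;
    char mtext[128];
};

/* Non-blocking receive.  msgtyp selects the message:
     0    first message in the queue
     > 0  first message of exactly that type
     < 0  first message of the lowest type <= |msgtyp|
   Returns the number of mtext bytes copied, or -1 with errno set
   (ENOMSG if no matching message is queued). */
static ssize_t recv_text(int msqid, long msgtyp, struct text_msg *out)
{
    assert(out != NULL);
    return msgrcv(msqid, out, sizeof out->mtext, msgtyp, IPC_NOWAIT);
}
```

       For example, if messages of types 3, 1, and 2 were sent in that
       order, then recv_text(q, 2, &m) returns the type 2 message, a
       following recv_text(q, -3, &m) returns the type 1 message, and
       recv_text(q, 0, &m) then returns the remaining type 3 message.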
http://man7.org/linux/man-pages/man2/msgrcv.2.html
12
SYSTEM CALL:
msgop(2) - Linux manual page
FUNCTIONALITY:

       msgrcv, msgsnd - System V message queue operations
SYNOPSIS:

       #include <sys/types.h>
       #include <sys/ipc.h>
       #include <sys/msg.h>

       int msgsnd(int msqid, const void *msgp, size_t msgsz, int msgflg);

       ssize_t msgrcv(int msqid, void *msgp, size_t msgsz, long msgtyp,
                      int msgflg);
DESCRIPTION

       The msgsnd() and msgrcv() system calls are used, respectively, to
       send messages to, and receive messages from, a System V message
       queue.  The calling process must have write permission on the message
       queue in order to send a message, and read permission to receive a
       message.

       The msgp argument is a pointer to a caller-defined structure of the
       following general form:

           struct msgbuf {
               long mtype;       /* message type, must be > 0 */
               char mtext[1];    /* message data */
           };

       The mtext field is an array (or other structure) whose size is
       specified by msgsz, a nonnegative integer value.  Messages of zero
       length (i.e., no mtext field) are permitted.  The mtype field must
       have a strictly positive integer value.  This value can be used by
       the receiving process for message selection (see the description of
       msgrcv() below).

   msgsnd()
       The msgsnd() system call appends a copy of the message pointed to by
       msgp to the message queue whose identifier is specified by msqid.

       If sufficient space is available in the queue, msgsnd() succeeds
       immediately.  The queue capacity is governed by the msg_qbytes field
       in the associated data structure for the message queue.  During queue
       creation this field is initialized to MSGMNB bytes, but this limit
       can be modified using msgctl(2).  A message queue is considered to be
       full if either of the following conditions is true:

       * Adding a new message to the queue would cause the total number of
         bytes in the queue to exceed the queue's maximum size (the
         msg_qbytes field).

       * Adding another message to the queue would cause the total number of
         messages in the queue to exceed the queue's maximum size (the
         msg_qbytes field).  This check is necessary to prevent an unlimited
         number of zero-length messages being placed on the queue.  Although
         such messages contain no data, they nevertheless consume (locked)
         kernel memory.

       If insufficient space is available in the queue, then the default
       behavior of msgsnd() is to block until space becomes available.  If
       IPC_NOWAIT is specified in msgflg, then the call instead fails with
       the error EAGAIN.

       A blocked msgsnd() call may also fail if:

       * the queue is removed, in which case the system call fails with
         errno set to EIDRM; or

       * a signal is caught, in which case the system call fails with errno
         set to EINTR; see signal(7).  (msgsnd() is never automatically
         restarted after being interrupted by a signal handler, regardless
         of the setting of the SA_RESTART flag when establishing a signal
         handler.)

       Upon successful completion the message queue data structure is
       updated as follows:

              msg_lspid is set to the process ID of the calling process.

              msg_qnum is incremented by 1.

              msg_stime is set to the current time.

   msgrcv()
       The msgrcv() system call removes a message from the queue specified
       by msqid and places it in the buffer pointed to by msgp.

       The argument msgsz specifies the maximum size in bytes for the member
       mtext of the structure pointed to by the msgp argument.  If the
       message text has length greater than msgsz, then the behavior depends
       on whether MSG_NOERROR is specified in msgflg.  If MSG_NOERROR is
       specified, then the message text will be truncated (and the truncated
       part will be lost); if MSG_NOERROR is not specified, then the message
       isn't removed from the queue and the system call fails returning -1
       with errno set to E2BIG.

       Unless MSG_COPY is specified in msgflg (see below), the msgtyp
       argument specifies the type of message requested, as follows:

       * If msgtyp is 0, then the first message in the queue is read.

       * If msgtyp is greater than 0, then the first message in the queue of
         type msgtyp is read, unless MSG_EXCEPT was specified in msgflg, in
         which case the first message in the queue of type not equal to
         msgtyp will be read.

       * If msgtyp is less than 0, then the first message in the queue with
         the lowest type less than or equal to the absolute value of msgtyp
         will be read.

       The msgflg argument is a bit mask constructed by ORing together zero
       or more of the following flags:

       IPC_NOWAIT
              Return immediately if no message of the requested type is in
              the queue.  The system call fails with errno set to ENOMSG.

       MSG_COPY (since Linux 3.8)
              Nondestructively fetch a copy of the message at the ordinal
              position in the queue specified by msgtyp (messages are
              considered to be numbered starting at 0).

              This flag must be specified in conjunction with IPC_NOWAIT,
              with the result that, if there is no message available at the
              given position, the call fails immediately with the error
              ENOMSG.  Because they alter the meaning of msgtyp in
              orthogonal ways, MSG_COPY and MSG_EXCEPT may not both be
              specified in msgflg.

              The MSG_COPY flag was added for the implementation of the
              kernel checkpoint-restore facility and is available only if
              the kernel was built with the CONFIG_CHECKPOINT_RESTORE
              option.

       MSG_EXCEPT
              Used with msgtyp greater than 0 to read the first message in
              the queue with message type that differs from msgtyp.

       MSG_NOERROR
              Truncate the message text if it is longer than msgsz bytes.

       If no message of the requested type is available and IPC_NOWAIT isn't
       specified in msgflg, the calling process is blocked until one of the
       following conditions occurs:

       * A message of the desired type is placed in the queue.

       * The message queue is removed from the system.  In this case, the
         system call fails with errno set to EIDRM.

       * The calling process catches a signal.  In this case, the system
         call fails with errno set to EINTR.  (msgrcv() is never
         automatically restarted after being interrupted by a signal
         handler, regardless of the setting of the SA_RESTART flag when
         establishing a signal handler.)

       Upon successful completion the message queue data structure is
       updated as follows:

              msg_lrpid is set to the process ID of the calling process.

              msg_qnum is decremented by 1.

              msg_rtime is set to the current time.
http://man7.org/linux/man-pages/man2/mq_open.2.html
12
SYSTEM CALL:
mq_open(3) - Linux manual page
FUNCTIONALITY:

       mq_open - open a message queue
SYNOPSIS:

       #include <fcntl.h>           /* For O_* constants */
       #include <sys/stat.h>        /* For mode constants */
       #include <mqueue.h>

       mqd_t mq_open(const char *name, int oflag);
       mqd_t mq_open(const char *name, int oflag, mode_t mode,
                     struct mq_attr *attr);

       Link with -lrt.
DESCRIPTION

       mq_open() creates a new POSIX message queue or opens an existing
       queue.  The queue is identified by name.  For details of the
       construction of name, see mq_overview(7).

       The oflag argument specifies flags that control the operation of the
       call.  (Definitions of the flags values can be obtained by including
       <fcntl.h>.)  Exactly one of the following must be specified in oflag:

       O_RDONLY
              Open the queue to receive messages only.

       O_WRONLY
              Open the queue to send messages only.

       O_RDWR Open the queue to both send and receive messages.

       Zero or more of the following flags can additionally be ORed in
       oflag:

       O_CLOEXEC (since Linux 2.6.26)
              Set the close-on-exec flag for the message queue descriptor.
              See open(2) for a discussion of why this flag is useful.

       O_CREAT
              Create the message queue if it does not exist.  The owner
              (user ID) of the message queue is set to the effective user ID
              of the calling process.  The group ownership (group ID) is set
              to the effective group ID of the calling process.

       O_EXCL If O_CREAT was specified in oflag, and a queue with the given
              name already exists, then fail with the error EEXIST.

       O_NONBLOCK
              Open the queue in nonblocking mode.  In circumstances where
              mq_receive(3) and mq_send(3) would normally block, these
              functions instead fail with the error EAGAIN.

       If O_CREAT is specified in oflag, then two additional arguments must
       be supplied.  The mode argument specifies the permissions to be
       placed on the new queue, as for open(2).  (Symbolic definitions for
       the permissions bits can be obtained by including <sys/stat.h>.)  The
       permissions settings are masked against the process umask.

       The attr argument specifies attributes for the queue.  See
       mq_getattr(3) for details.  If attr is NULL, then the queue is
       created with implementation-defined default attributes.  Since Linux
       3.5, two /proc files can be used to control these defaults; see
       mq_overview(7) for details.
http://man7.org/linux/man-pages/man2/mq_unlink.2.html
10
SYSTEM CALL:
mq_unlink(3) - Linux manual page
FUNCTIONALITY:

       mq_unlink - remove a message queue
SYNOPSIS:

       #include <mqueue.h>

       int mq_unlink(const char *name);

       Link with -lrt.
DESCRIPTION

       mq_unlink() removes the specified message queue name.  The message
       queue name is removed immediately.  The queue itself is destroyed
       once any other processes that have the queue open close their
       descriptors referring to the queue.
http://man7.org/linux/man-pages/man2/mq_getsetattr.2.html
8
SYSTEM CALL:
mq_getsetattr(2) - Linux manual page
FUNCTIONALITY:

       mq_getsetattr - get/set message queue attributes
SYNOPSIS:

       #include <sys/types.h>
       #include <mqueue.h>

       int mq_getsetattr(mqd_t mqdes, struct mq_attr *newattr,
                        struct mq_attr *oldattr);

       Note: There is no glibc wrapper for this system call; see NOTES.
DESCRIPTION

       Do not use this system call.

       This is the low-level system call used to implement mq_getattr(3) and
       mq_setattr(3).  For an explanation of how this system call operates,
       see the description of mq_setattr(3).
http://man7.org/linux/man-pages/man2/mq_timedsend.2.html
11
SYSTEM CALL:
mq_send(3) - Linux manual page
FUNCTIONALITY:

       mq_send, mq_timedsend - send a message to a message queue
SYNOPSIS:

       #include <mqueue.h>

       int mq_send(mqd_t mqdes, const char *msg_ptr,
                     size_t msg_len, unsigned int msg_prio);

       #include <time.h>
       #include <mqueue.h>

       int mq_timedsend(mqd_t mqdes, const char *msg_ptr,
                     size_t msg_len, unsigned int msg_prio,
                     const struct timespec *abs_timeout);

       Link with -lrt.

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       mq_timedsend():
           _POSIX_C_SOURCE >= 200112L
DESCRIPTION

       mq_send() adds the message pointed to by msg_ptr to the message queue
       referred to by the message queue descriptor mqdes.  The msg_len
       argument specifies the length of the message pointed to by msg_ptr;
       this length must be less than or equal to the queue's mq_msgsize
       attribute.  Zero-length messages are allowed.

       The msg_prio argument is a nonnegative integer that specifies the
       priority of this message.  Messages are placed on the queue in
       decreasing order of priority, with newer messages of the same
       priority being placed after older messages with the same priority.
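
       The ordering rule can be illustrated with a small array-based model.
       This is a sketch under stated assumptions, not how the kernel stores
       messages: struct queued_msg, its id field, and enqueue_by_priority()
       are hypothetical names introduced for the example.

```c
#include <stddef.h>

struct queued_msg {
    unsigned int prio;   /* the msg_prio value given to mq_send() */
    int id;              /* hypothetical tag used to tell messages apart */
};

/* Insert m into q[0..*n), keeping decreasing priority order and FIFO order
 * within one priority. The caller guarantees room for one more element. */
void enqueue_by_priority(struct queued_msg *q, size_t *n, struct queued_msg m)
{
    size_t pos = 0;

    /* Skip every message whose priority is >= the new one, so the new
     * message lands after older messages of the same priority. */
    while (pos < *n && q[pos].prio >= m.prio)
        pos++;

    /* Shift the tail up by one slot and place the new message. */
    for (size_t i = *n; i > pos; i--)
        q[i] = q[i - 1];
    q[pos] = m;
    (*n)++;
}
```

       The element at index 0 is then the oldest message with the highest
       priority.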

       If the message queue is already full (i.e., the number of messages on
       the queue equals the queue's mq_maxmsg attribute), then, by default,
       mq_send() blocks until sufficient space becomes available to allow
       the message to be queued, or until the call is interrupted by a
       signal handler.  If the O_NONBLOCK flag is enabled for the message
       queue description, then the call instead fails immediately with the
       error EAGAIN.

       mq_timedsend() behaves just like mq_send(), except that if the queue
       is full and the O_NONBLOCK flag is not enabled for the message queue
       description, then abs_timeout points to a structure which specifies
       how long the call will block.  This value is an absolute timeout in
       seconds and nanoseconds since the Epoch, 1970-01-01 00:00:00 +0000
       (UTC), specified in the following structure:

           struct timespec {
               time_t tv_sec;        /* seconds */
               long   tv_nsec;       /* nanoseconds */
           };

       If the message queue is full, and the timeout has already expired by
       the time of the call, mq_timedsend() returns immediately.
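
       Because abs_timeout is an absolute time, a caller that thinks in
       terms of "wait at most N milliseconds" has to add that interval to
       the current CLOCK_REALTIME value first.  A minimal sketch follows;
       the helper names timespec_add_ms() and deadline_after_ms() are
       assumptions made for this example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <time.h>

/* Return base advanced by ms milliseconds (ms >= 0), with tv_nsec
 * normalized to the range [0, 999999999]. */
struct timespec timespec_add_ms(struct timespec base, long ms)
{
    base.tv_sec += ms / 1000;
    base.tv_nsec += (ms % 1000) * 1000000L;
    if (base.tv_nsec >= 1000000000L) {
        base.tv_sec += 1;
        base.tv_nsec -= 1000000000L;
    }
    return base;
}

/* Compute an absolute deadline ms milliseconds from now on CLOCK_REALTIME,
 * the clock that counts from the Epoch. Returns 0 on success, -1 if the
 * clock cannot be read. */
int deadline_after_ms(long ms, struct timespec *out)
{
    struct timespec now;

    if (clock_gettime(CLOCK_REALTIME, &now) == -1)
        return -1;
    *out = timespec_add_ms(now, ms);
    return 0;
}
```

       The resulting structure can be passed as the abs_timeout argument of
       mq_timedsend().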
http://man7.org/linux/man-pages/man2/mq_timedreceive.2.html
11
SYSTEM CALL:
mq_receive(3) - Linux manual page
FUNCTIONALITY:

       mq_receive, mq_timedreceive - receive a message from a message queue
SYNOPSIS:

       #include <mqueue.h>

       ssize_t mq_receive(mqd_t mqdes, char *msg_ptr,
                          size_t msg_len, unsigned int *msg_prio);

       #include <time.h>
       #include <mqueue.h>

       ssize_t mq_timedreceive(mqd_t mqdes, char *msg_ptr,
                          size_t msg_len, unsigned int *msg_prio,
                          const struct timespec *abs_timeout);

       Link with -lrt.

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       mq_timedreceive():
           _POSIX_C_SOURCE >= 200112L
DESCRIPTION

       mq_receive() removes the oldest message with the highest priority
       from the message queue referred to by the message queue descriptor
       mqdes, and places it in the buffer pointed to by msg_ptr.  The
       msg_len argument specifies the size of the buffer pointed to by
       msg_ptr; this must be greater than or equal to the mq_msgsize
       attribute of the queue (see mq_getattr(3)).  If msg_prio is not NULL,
       then the buffer to which it points is used to return the priority
       associated with the received message.

       If the queue is empty, then, by default, mq_receive() blocks until a
       message becomes available, or the call is interrupted by a signal
       handler.  If the O_NONBLOCK flag is enabled for the message queue
       description, then the call instead fails immediately with the error
       EAGAIN.

       mq_timedreceive() behaves just like mq_receive(), except that if the
       queue is empty and the O_NONBLOCK flag is not enabled for the message
       queue description, then abs_timeout points to a structure which
       specifies how long the call will block.  This value is an absolute
       timeout in seconds and nanoseconds since the Epoch, 1970-01-01
       00:00:00 +0000 (UTC), specified in the following structure:

           struct timespec {
               time_t tv_sec;        /* seconds */
               long   tv_nsec;       /* nanoseconds */
           };

       If no message is available, and the timeout has already expired by
       the time of the call, mq_timedreceive() returns immediately.
http://man7.org/linux/man-pages/man2/mq_notify.2.html
12
SYSTEM CALL:
mq_notify(3) - Linux manual page
FUNCTIONALITY:

       mq_notify - register for notification when a message is available
SYNOPSIS:

       #include <mqueue.h>

       int mq_notify(mqd_t mqdes, const struct sigevent *sevp);

       Link with -lrt.
DESCRIPTION

       mq_notify() allows the calling process to register or unregister for
       delivery of an asynchronous notification when a new message arrives
       on the empty message queue referred to by the message queue
       descriptor mqdes.

       The sevp argument is a pointer to a sigevent structure.  For the
       definition and general details of this structure, see sigevent(7).

       If sevp is a non-null pointer, then mq_notify() registers the calling
       process to receive message notification.  The sigev_notify field of
       the sigevent structure to which sevp points specifies how
       notification is to be performed.  This field has one of the following
       values:

       SIGEV_NONE
              A "null" notification: the calling process is registered as
              the target for notification, but when a message arrives, no
              notification is sent.

       SIGEV_SIGNAL
              Notify the process by sending the signal specified in
              sigev_signo.  See sigevent(7) for general details.  The
              si_code field of the siginfo_t structure will be set to
              SI_MESGQ.  In addition, si_pid will be set to the PID of the
              process that sent the message, and si_uid will be set to the
              real user ID of the sending process.

       SIGEV_THREAD
              Upon message delivery, invoke sigev_notify_function as if it
              were the start function of a new thread.  See sigevent(7) for
              details.

       Only one process can be registered to receive notification from a
       message queue.

       If sevp is NULL, and the calling process is currently registered to
       receive notifications for this message queue, then the registration
       is removed; another process can then register to receive a message
       notification for this queue.

       Message notification occurs only when a new message arrives and the
       queue was previously empty.  If the queue was not empty at the time
       mq_notify() was called, then a notification will occur only after the
       queue is emptied and a new message arrives.

       If another process or thread is waiting to read a message from an
       empty queue using mq_receive(3), then any message notification
       registration is ignored: the message is delivered to the process or
       thread calling mq_receive(3), and the message notification
       registration remains in effect.

       Notification occurs once: after a notification is delivered, the
       notification registration is removed, and another process can
       register for message notification.  If the notified process wishes to
       receive the next notification, it can use mq_notify() to request a
       further notification.  This should be done before emptying all unread
       messages from the queue.  (Placing the queue in nonblocking mode is
       useful for emptying the queue of messages without blocking once it is
       empty.)
Linux Non-Uniform Memory Access (NUMA) system calls
http://man7.org/linux/man-pages/man2/getcpu.2.html
11
SYSTEM CALL:
getcpu(2) - Linux manual page
FUNCTIONALITY:

       getcpu - determine CPU and NUMA node on which the calling thread is
       running
SYNOPSIS:

       #include <linux/getcpu.h>

       int getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *tcache);

       Note: There is no glibc wrapper for this system call; see NOTES.
DESCRIPTION

       The getcpu() system call identifies the processor and node on which
       the calling thread or process is currently running and writes them
       into the integers pointed to by the cpu and node arguments.  The
       processor is a unique small integer identifying a CPU.  The node is a
       unique small identifier identifying a NUMA node.  When either cpu or
       node is NULL nothing is written to the respective pointer.

       The third argument to this system call is nowadays unused, and should
       be specified as NULL unless portability to Linux 2.6.23 or earlier is
       required (see NOTES).

       The information placed in cpu is guaranteed to be current only at the
       time of the call: unless the CPU affinity has been fixed using
       sched_setaffinity(2), the kernel might change the CPU at any time.
       (Normally this does not happen because the scheduler tries to
       minimize movements between CPUs to keep caches hot, but it is
       possible.)  The caller must allow for the possibility that the
       information returned in cpu and node is no longer current by the time
       the call returns.
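
       Since the synopsis notes that there is no glibc wrapper, one way to
       reach the call is through syscall(2).  The sketch below assumes that
       <sys/syscall.h> defines SYS_getcpu on the target architecture; the
       function name current_cpu_and_node() is hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sys/syscall.h>
#include <unistd.h>

/* Query the CPU and NUMA node of the calling thread through the raw system
 * call. Either pointer may be NULL, in which case that value is not written.
 * The third (cache) argument is passed as NULL because it is unused.
 * Returns 0 on success, -1 with errno set on failure. */
int current_cpu_and_node(unsigned *cpu, unsigned *node)
{
    return (int)syscall(SYS_getcpu, cpu, node, NULL);
}
```

       The values are only a snapshot: unless the thread's CPU affinity has
       been fixed, they may already be stale when the function returns.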
http://man7.org/linux/man-pages/man2/set_mempolicy.2.html
11
SYSTEM CALL:
set_mempolicy(2) - Linux manual page
FUNCTIONALITY:

       set_mempolicy - set default NUMA memory policy for a thread and its
       children
SYNOPSIS:

       #include <numaif.h>

       long set_mempolicy(int mode, const unsigned long *nodemask,
                          unsigned long maxnode);

       Link with -lnuma.
DESCRIPTION

       set_mempolicy() sets the NUMA memory policy of the calling thread,
       which consists of a policy mode and zero or more nodes, to the values
       specified by the mode, nodemask and maxnode arguments.

       A NUMA machine has different memory controllers with different
       distances to specific CPUs.  The memory policy defines from which
       node memory is allocated for the thread.

       This system call defines the default policy for the thread.  The
       thread policy governs allocation of pages in the process's address
       space outside of memory ranges controlled by a more specific policy
       set by mbind(2).  The thread default policy also controls allocation
       of any pages for memory-mapped files mapped using the mmap(2) call
       with the MAP_PRIVATE flag and that are only read [loaded] from by the
       thread and of memory-mapped files mapped using the mmap(2) call with
       the MAP_SHARED flag, regardless of the access type.  The policy is
       applied only when a new page is allocated for the thread.  For
       anonymous memory this is when the page is first touched by the
       thread.

       The mode argument must specify one of MPOL_DEFAULT, MPOL_BIND,
       MPOL_INTERLEAVE, or MPOL_PREFERRED.  All modes except MPOL_DEFAULT
       require the caller to specify via the nodemask argument one or more
       nodes.

       The mode argument may also include an optional mode flag.  The
       supported mode flags are:

       MPOL_F_STATIC_NODES (since Linux 2.6.26)
              A nonempty nodemask specifies physical node ids.  Linux will
              not remap the nodemask when the process moves to a different
              cpuset context, nor when the set of nodes allowed by the
              process's current cpuset context changes.

       MPOL_F_RELATIVE_NODES (since Linux 2.6.26)
              A nonempty nodemask specifies node ids that are relative to
              the set of node ids allowed by the process's current cpuset.

       nodemask points to a bit mask of node IDs that contains up to maxnode
       bits.  The bit mask size is rounded to the next multiple of
       sizeof(unsigned long), but the kernel will use bits only up to
       maxnode.  A NULL value of nodemask or a maxnode value of zero
       specifies the empty set of nodes.  If the value of maxnode is zero,
       the nodemask argument is ignored.
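
       A nodemask of this shape can be built by hand as an array of
       unsigned long in which node N is represented by bit N.  The helpers
       below are a sketch; the names BITS_PER_ULONG, nodemask_words(),
       nodemask_set() and nodemask_isset() are assumptions made for the
       example and are not part of <numaif.h>.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

/* Number of unsigned long words needed to hold maxnode bits. */
size_t nodemask_words(unsigned long maxnode)
{
    return (maxnode + BITS_PER_ULONG - 1) / BITS_PER_ULONG;
}

/* Mark node as a member of the mask. The caller ensures node < maxnode. */
void nodemask_set(unsigned long *mask, unsigned long node)
{
    mask[node / BITS_PER_ULONG] |= 1UL << (node % BITS_PER_ULONG);
}

/* Return nonzero if node is a member of the mask. */
int nodemask_isset(const unsigned long *mask, unsigned long node)
{
    return (int)((mask[node / BITS_PER_ULONG] >> (node % BITS_PER_ULONG)) & 1UL);
}
```

       A mask prepared this way, together with a maxnode value covering
       the highest node of interest, is what the nodemask and maxnode
       arguments describe.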

       Where a nodemask is required, it must contain at least one node that
       is on-line, allowed by the process's current cpuset context [unless
       the MPOL_F_STATIC_NODES mode flag is specified], and contains memory.
       If the MPOL_F_STATIC_NODES flag is set in mode and a required nodemask
       contains no nodes that are allowed by the process's current cpuset
       context, the memory policy reverts to local allocation.  This
       effectively overrides the specified policy until the process's cpuset
       context includes one or more of the nodes specified by nodemask.

       The MPOL_DEFAULT mode specifies that any nondefault thread memory
       policy be removed, so that the memory policy "falls back" to the
       system default policy.  The system default policy is "local
       allocation"—that is, allocate memory on the node of the CPU that
       triggered the allocation.  nodemask must be specified as NULL.  If
       the "local node" contains no free memory, the system will attempt to
       allocate memory from a nearby node.

       The MPOL_BIND mode defines a strict policy that restricts memory
       allocation to the nodes specified in nodemask.  If nodemask specifies
       more than one node, page allocations will come from the node with the
       lowest numeric node ID first, until that node contains no free
       memory.  Allocations will then come from the node with the next
       highest node ID specified in nodemask and so forth, until none of the
       specified nodes contain free memory.  Pages will not be allocated
       from any node not specified in the nodemask.

       MPOL_INTERLEAVE interleaves page allocations across the nodes
       specified in nodemask in numeric node ID order.  This optimizes for
       bandwidth instead of latency by spreading out pages and memory
       accesses to those pages across multiple nodes.  However, accesses to
       a single page will still be limited to the memory bandwidth of a
       single node.
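
       The idea of visiting the nodes of a mask in numeric order can be
       pictured with a toy round-robin model.  This is an illustration
       only and makes no claim about how the kernel implements
       interleaving; next_interleave_node() is a hypothetical helper.

```c
#include <limits.h>

/* Toy model: return the lowest node ID in the mask that is greater than
 * prev, wrapping around to the lowest set bit when the end is reached.
 * Pass prev = -1 to obtain the first node. Returns -1 for an empty mask. */
long next_interleave_node(const unsigned long *mask, unsigned long maxnode,
                          long prev)
{
    const unsigned long bits = sizeof(unsigned long) * CHAR_BIT;

    /* Examine each of the maxnode candidate IDs once, starting just after
     * prev and wrapping modulo maxnode. */
    for (unsigned long step = 0; step < maxnode; step++) {
        unsigned long node = ((unsigned long)(prev + 1) + step) % maxnode;
        if ((mask[node / bits] >> (node % bits)) & 1UL)
            return (long)node;
    }
    return -1;
}
```

       With nodes 0, 2 and 5 set, successive calls yield 0, 2, 5, 0, and
       so on, so consecutive pages in this model land on different nodes.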

       MPOL_PREFERRED sets the preferred node for allocation.  The kernel
       will try to allocate pages from this node first and fall back to
       nearby nodes if the preferred node is low on free memory.  If
       nodemask specifies more than one node ID, the first node in the mask
       will be selected as the preferred node.  If the nodemask and maxnode
       arguments specify the empty set, then the policy specifies "local
       allocation" (like the system default policy discussed above).

       The thread memory policy is preserved across an execve(2), and is
       inherited by child threads created using fork(2) or clone(2).
http://man7.org/linux/man-pages/man2/get_mempolicy.2.html
11
SYSTEM CALL:
get_mempolicy(2) - Linux manual page
FUNCTIONALITY:

       get_mempolicy - retrieve NUMA memory policy for a thread
SYNOPSIS:

       #include <numaif.h>

       int get_mempolicy(int *mode, unsigned long *nodemask,
                         unsigned long maxnode, void *addr,
                         unsigned long flags);

       Link with -lnuma.
DESCRIPTION

       get_mempolicy() retrieves the NUMA policy of the calling thread or of
       a memory address, depending on the setting of flags.

       A NUMA machine has different memory controllers with different
       distances to specific CPUs.  The memory policy defines from which
       node memory is allocated for the thread.

       If flags is specified as 0, then information about the calling
       thread's default policy (as set by set_mempolicy(2)) is returned.
       The policy returned [mode and nodemask] may be used to restore the
       thread's policy to its state at the time of the call to
       get_mempolicy() using set_mempolicy(2).

       If flags specifies MPOL_F_MEMS_ALLOWED (available since Linux
       2.6.24), the mode argument is ignored and the set of nodes [memories]
       that the thread is allowed to specify in subsequent calls to mbind(2)
       or set_mempolicy(2) [in the absence of any mode flags] is returned in
       nodemask.  It is not permitted to combine MPOL_F_MEMS_ALLOWED with
       either MPOL_F_ADDR or MPOL_F_NODE.

       If flags specifies MPOL_F_ADDR, then information is returned about
       the policy governing the memory address given in addr.  This policy
       may be different from the thread's default policy if mbind(2) or one
       of the helper functions described in numa(3) has been used to
       establish a policy for the memory range containing addr.

       If the mode argument is not NULL, then get_mempolicy() will store the
       policy mode and any optional mode flags of the requested NUMA policy
       in the location pointed to by this argument.  If nodemask is not
       NULL, then the nodemask associated with the policy will be stored in
       the location pointed to by this argument.  maxnode specifies the
       number of node IDs that can be stored into nodemask—that is, the
       maximum node ID plus one.  The value specified by maxnode is always
       rounded to a multiple of sizeof(unsigned long)*8.
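
       The rounding rule and the layout of the returned mask can be
       sketched as follows.  Both helpers are hypothetical names chosen for
       this example; they only do arithmetic on a caller-supplied buffer
       and do not call get_mempolicy() themselves.

```c
#include <limits.h>
#include <stddef.h>

/* Round maxnode up to a multiple of the number of bits in an unsigned long,
 * which gives the size in bits of a buffer made of whole words. */
unsigned long round_up_maxnode(unsigned long maxnode)
{
    const unsigned long bits = sizeof(unsigned long) * CHAR_BIT;
    return (maxnode + bits - 1) / bits * bits;
}

/* Copy the IDs of the nodes set in mask into out[] (at most cap entries)
 * and return how many nodes are set among the first maxnode bits. */
size_t nodemask_to_list(const unsigned long *mask, unsigned long maxnode,
                        unsigned long *out, size_t cap)
{
    const unsigned long bits = sizeof(unsigned long) * CHAR_BIT;
    size_t count = 0;

    for (unsigned long node = 0; node < maxnode; node++) {
        if ((mask[node / bits] >> (node % bits)) & 1UL) {
            if (count < cap)
                out[count] = node;
            count++;
        }
    }
    return count;
}
```

       Sizing the nodemask buffer to round_up_maxnode(maxnode) bits keeps
       it a whole number of unsigned long words.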

       If flags specifies both MPOL_F_NODE and MPOL_F_ADDR, get_mempolicy()
       will return the node ID of the node on which the address addr is
       allocated into the location pointed to by mode.  If no page has yet
       been allocated for the specified address, get_mempolicy() will
       allocate a page as if the thread had performed a read [load] access
       to that address, and return the ID of the node where that page was
       allocated.

       If flags specifies MPOL_F_NODE, but not MPOL_F_ADDR, and the thread's
       current policy is MPOL_INTERLEAVE, then get_mempolicy() will return,
       in the location pointed to by a non-NULL mode argument, the node ID
       of the next node that will be used for interleaving of internal
       kernel pages allocated on behalf of the thread.  These allocations
       include pages for memory-mapped files in process memory ranges mapped
       using the mmap(2) call with the MAP_PRIVATE flag for read accesses,
       and in memory ranges mapped with the MAP_SHARED flag for all
       accesses.

       Other flag values are reserved.

       For an overview of the possible policies see set_mempolicy(2).
http://man7.org/linux/man-pages/man2/mbind.2.html
11
SYSTEM CALL:
mbind(2) - Linux manual page
FUNCTIONALITY:

       mbind - set memory policy for a memory range
SYNOPSIS:

       #include <numaif.h>

       long mbind(void *addr, unsigned long len, int mode,
                  const unsigned long *nodemask, unsigned long maxnode,
                  unsigned flags);

       Link with -lnuma.
DESCRIPTION

       mbind() sets the NUMA memory policy, which consists of a policy mode
       and zero or more nodes, for the memory range starting with addr and
       continuing for len bytes.  The memory policy defines from which node
       memory is allocated.

       If the memory range specified by the addr and len arguments includes
       an "anonymous" region of memory (that is, a region of memory created
       using the mmap(2) system call with the MAP_ANONYMOUS flag) or a
       memory-mapped file mapped using the mmap(2) system call with the
       MAP_PRIVATE flag, pages will be allocated according to the specified
       policy only when the application writes [stores] to the page.
       For anonymous regions, an initial read access will use a shared page
       in the kernel containing all zeros.  For a file mapped with
       MAP_PRIVATE, an initial read access will allocate pages according to
       the process policy of the process that causes the page to be
       allocated.  This may not be the process that called mbind().

       The specified policy will be ignored for any MAP_SHARED mappings in
       the specified memory range.  Rather the pages will be allocated
       according to the process policy of the process that caused the page
       to be allocated.  Again, this may not be the process that called
       mbind().

       If the specified memory range includes a shared memory region created
       using the shmget(2) system call and attached using the shmat(2)
       system call, pages allocated for the anonymous or shared memory
       region will be allocated according to the policy specified,
       regardless of which process attached to the shared memory segment causes
       the allocation.  If, however, the shared memory region was created
       with the SHM_HUGETLB flag, the huge pages will be allocated according
       to the policy specified only if the page allocation is caused by the
       process that calls mbind() for that region.

       By default, mbind() has an effect only for new allocations; if the
       pages inside the range have been already touched before setting the
       policy, then the policy has no effect.  This default behavior may be
       overridden by the MPOL_MF_MOVE and MPOL_MF_MOVE_ALL flags described
       below.

       The mode argument must specify one of MPOL_DEFAULT, MPOL_BIND,
       MPOL_INTERLEAVE, or MPOL_PREFERRED.  All policy modes except
       MPOL_DEFAULT require the caller to specify, via the nodemask argument,
       the node or nodes to which the mode applies.

       The mode argument may also include an optional mode flag.  The
       supported mode flags are:

       MPOL_F_STATIC_NODES (since Linux 2.6.26)
              A nonempty nodemask specifies physical node ids.  Linux does
              not remap the nodemask when the process moves to a different
              cpuset context, nor when the set of nodes allowed by the
              process's current cpuset context changes.

       MPOL_F_RELATIVE_NODES (since Linux 2.6.26)
              A nonempty nodemask specifies node ids that are relative to
              the set of node ids allowed by the process's current cpuset.

       nodemask points to a bit mask of nodes containing up to maxnode bits.
       The bit mask size is rounded to the next multiple of sizeof(unsigned
       long), but the kernel will use bits only up to maxnode.  A NULL value
       of nodemask or a maxnode value of zero specifies the empty set of
       nodes.  If the value of maxnode is zero, the nodemask argument is
       ignored.  Where a nodemask is required, it must contain at least one
       node that is on-line, allowed by the process's current cpuset context
       [unless the MPOL_F_STATIC_NODES mode flag is specified], and contains
       memory.

       The MPOL_DEFAULT mode requests that any nondefault policy be removed,
       restoring default behavior.  When applied to a range of memory via
       mbind(), this means to use the process policy, which may have been
       set with set_mempolicy(2).  If the mode of the process policy is also
       MPOL_DEFAULT, the system-wide default policy will be used.  The
       system-wide default policy allocates pages on the node of the CPU
       that triggers the allocation.  For MPOL_DEFAULT, the nodemask and
       maxnode arguments must specify the empty set of nodes.

       The MPOL_BIND mode specifies a strict policy that restricts memory
       allocation to the nodes specified in nodemask.  If nodemask specifies
       more than one node, page allocations will come from the node with the
       lowest numeric node ID first, until that node contains no free
       memory.  Allocations will then come from the node with the next
       highest node ID specified in nodemask and so forth, until none of the
       specified nodes contain free memory.  Pages will not be allocated
       from any node not specified in the nodemask.

       The MPOL_INTERLEAVE mode specifies that page allocations be
       interleaved across the set of nodes specified in nodemask.  This
       optimizes for bandwidth instead of latency by spreading out pages and
       memory accesses to those pages across multiple nodes.  To be
       effective, the memory area should be fairly large (at least 1 MB)
       with a fairly uniform access pattern.  Accesses to a single
       page of the area will still be limited to the memory bandwidth of a
       single node.

       MPOL_PREFERRED sets the preferred node for allocation.  The kernel
       will try to allocate pages from this node first and fall back to
       other nodes if the preferred node is low on free memory.  If
       nodemask specifies more than one node ID, the first node in the mask
       will be selected as the preferred node.  If the nodemask and maxnode
       arguments specify the empty set, then the memory is allocated on the
       node of the CPU that triggered the allocation.  This is the only way
       to specify "local allocation" for a range of memory via mbind().

       If MPOL_MF_STRICT is passed in flags and mode is not MPOL_DEFAULT,
       then the call will fail with the error EIO if the existing pages in
       the memory range don't follow the policy.

       If MPOL_MF_MOVE is specified in flags, then the kernel will attempt
       to move all the existing pages in the memory range so that they
       follow the policy.  Pages that are shared with other processes will
       not be moved.  If MPOL_MF_STRICT is also specified, then the call
       will fail with the error EIO if some pages could not be moved.

       If MPOL_MF_MOVE_ALL is passed in flags, then the kernel will attempt
       to move all existing pages in the memory range regardless of whether
       other processes use the pages.  The calling process must be
       privileged (CAP_SYS_NICE) to use this flag.  If MPOL_MF_STRICT is
       also specified, then the call will fail with the error EIO if some
       pages could not be moved.
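
       The policy modes and flags described above can be sketched as
       follows.  This is a minimal illustration, not a reference
       implementation: it assumes the raw system call is reached through
       syscall(2), that the MPOL_* constants come from the kernel header
       <linux/mempolicy.h>, and that nodes 0 and 1 exist.  The helper
       names prefer_local() and interleave_and_move() are hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <linux/mempolicy.h>   /* MPOL_* constants (kernel UAPI header) */
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: request "local allocation" for [addr, addr + len)
   by passing MPOL_PREFERRED with an empty node set (NULL mask, maxnode 0).
   addr is expected to be page-aligned.
   Returns 0 on success or a negative errno value. */
int prefer_local(void *addr, unsigned long len)
{
    long rc = syscall(SYS_mbind, addr, len, MPOL_PREFERRED,
                      (const unsigned long *)NULL, 0UL, 0U);
    return rc == -1 ? -errno : 0;
}

/* Hypothetical helper: interleave the range across nodes 0 and 1 (an
   assumption about the machine), ask the kernel to move existing private
   pages, and fail with EIO if some pages could not be moved. */
int interleave_and_move(void *addr, unsigned long len)
{
    unsigned long nodemask = (1UL << 0) | (1UL << 1);
    /* Pass the capacity of the mask in bits as maxnode. */
    unsigned long maxnode = sizeof(nodemask) * CHAR_BIT;

    long rc = syscall(SYS_mbind, addr, len, MPOL_INTERLEAVE,
                      &nodemask, maxnode,
                      (unsigned)(MPOL_MF_MOVE | MPOL_MF_STRICT));
    return rc == -1 ? -errno : 0;
}
```

       On a machine without NUMA support, or inside a sandbox that
       filters the call, both helpers are expected to return a negative
       errno value rather than 0.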
http://man7.org/linux/man-pages/man2/move_pages.2.html
11
SYSTEM CALL:
move_pages(2) - Linux manual page
FUNCTIONALITY:

       move_pages - move individual pages of a process to another node
SYNOPSIS:

       #include <numaif.h>

       long move_pages(int pid, unsigned long count, void **pages,
                       const int *nodes, int *status, int flags);

       Link with -lnuma.
DESCRIPTION

       move_pages() moves the specified pages of the process pid to the
       memory nodes specified by nodes.  The result of the move is reflected
       in status.  The flags indicate constraints on the pages to be moved.

       pid is the ID of the process in which pages are to be moved.  To move
       pages in another process, the caller must be privileged
       (CAP_SYS_NICE) or the real or effective user ID of the calling
       process must match the real or saved-set user ID of the target
       process.  If pid is 0, then move_pages() moves pages of the calling
       process.

       count is the number of pages to move.  It defines the size of the
       three arrays pages, nodes, and status.

       pages is an array of pointers to the pages that should be moved.
       These are pointers that should be aligned to page boundaries.
       Addresses are specified as seen by the process specified by pid.

       nodes is an array of integers that specify the desired location for
       each page.  Each element in the array is a node number.  nodes can
       also be NULL, in which case move_pages() does not move any pages but
       instead will return the node where each page currently resides, in
       the status array.  Obtaining the status of each page may be necessary
       to determine pages that need to be moved.

       status is an array of integers that return the status of each page.
       The array contains valid values only if move_pages() did not return
       an error.

       flags specify what types of pages to move.  MPOL_MF_MOVE means that
       only pages that are in exclusive use by the process are to be moved.
       MPOL_MF_MOVE_ALL means that pages shared between multiple processes
       can also be moved.  The process must be privileged (CAP_SYS_NICE) to
       use MPOL_MF_MOVE_ALL.

   Page states in the status array
       The following values can be returned in each element of the status
       array.

       0..MAX_NUMNODES
              Identifies the node on which the page resides.

       -EACCES
              The page is mapped by multiple processes and can be moved only
              if MPOL_MF_MOVE_ALL is specified.

       -EBUSY The page is currently busy and cannot be moved.  Try again
              later.  This occurs if a page is undergoing I/O or another
              kernel subsystem is holding a reference to the page.

       -EFAULT
              This is a zero page or the memory area is not mapped by the
              process.

       -EIO   Unable to write back a page.  The page has to be written back
              in order to move it since the page is dirty and the filesystem
              does not provide a migration function that would allow the
              move of dirty pages.

       -EINVAL
              A dirty page cannot be moved.  The filesystem does not provide
              a migration function and has no ability to write back pages.

       -ENOENT
              The page is not present.

       -ENOMEM
              Unable to allocate memory on target node.
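
       The query mode (nodes set to NULL) and the status values listed
       above might be used as in the following sketch.  It assumes the
       raw system call is invoked through syscall(2) rather than the
       libnuma wrapper; page_node() and page_status_text() are
       hypothetical helper names, and the message strings are
       illustrative.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: report where the page containing addr resides in
   the calling process.  Returns the node number (>= 0), a negative
   per-page status (for example -ENOENT), or a negative errno value if
   the call itself failed. */
int page_node(void *addr)
{
    uintptr_t page_size = (uintptr_t)sysconf(_SC_PAGESIZE);
    /* Round the address down to a page boundary. */
    void *pages[1] = { (void *)((uintptr_t)addr & ~(page_size - 1)) };
    int status[1] = { 0 };

    /* pid 0: calling process; nodes NULL: query only, nothing is moved. */
    long rc = syscall(SYS_move_pages, 0, 1UL, pages,
                      (const int *)NULL, status, 0);
    if (rc == -1)
        return -errno;      /* status[] is not valid in this case */
    return status[0];
}

/* Hypothetical helper: map a status array element to a short message. */
const char *page_status_text(int status)
{
    if (status >= 0)
        return "resident on a node";
    switch (-status) {
    case EACCES: return "shared page; MPOL_MF_MOVE_ALL required";
    case EBUSY:  return "page busy; try again later";
    case EFAULT: return "zero page or unmapped area";
    case EIO:    return "unable to write back dirty page";
    case EINVAL: return "dirty page cannot be moved";
    case ENOENT: return "page not present";
    case ENOMEM: return "no memory on target node";
    default:     return "unrecognized status";
    }
}
```

       A page that has never been touched may be reported as not
       present, so a caller would normally write to the memory before
       querying it.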
http://man7.org/linux/man-pages/man2/migrate_pages.2.html
11
SYSTEM CALL:
migrate_pages(2) - Linux manual page
FUNCTIONALITY:

       migrate_pages - move all pages in a process to another set of nodes
SYNOPSIS:

       #include <numaif.h>

       long migrate_pages(int pid, unsigned long maxnode,
                          const unsigned long *old_nodes,
                          const unsigned long *new_nodes);

       Link with -lnuma.
DESCRIPTION

       migrate_pages() attempts to move all pages of the process pid that
       are in memory nodes old_nodes to the memory nodes in new_nodes.
       Pages not located in any node in old_nodes will not be migrated.  As
       far as possible, the kernel maintains the relative topology
       relationship inside old_nodes during the migration to new_nodes.

       The old_nodes and new_nodes arguments are pointers to bit masks of
       node numbers, with up to maxnode bits in each mask.  These masks are
       maintained as arrays of unsigned long integers (in the last long
       integer, the bits beyond those specified by maxnode are ignored).
       The maxnode argument is the maximum node number in the bit mask plus
       one (this is the same as in mbind(2), but different from select(2)).

       The pid argument is the ID of the process whose pages are to be
       moved.  To move pages in another process, the caller must be
       privileged (CAP_SYS_NICE) or the real or effective user ID of the
       calling process must match the real or saved-set user ID of the
       target process.  If pid is 0, then migrate_pages() moves pages of the
       calling process.

       Pages shared with another process will be moved only if the
       initiating process has the CAP_SYS_NICE privilege.
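
       The bit-mask layout described above can be illustrated with a
       short sketch.  It assumes the raw system call is invoked through
       syscall(2); MASK_LONGS, nodemask_set(), and
       migrate_node_to_node() are hypothetical names chosen for the
       example, and the mask capacity of two longs is arbitrary.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <limits.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define MASK_LONGS 2   /* illustrative capacity: 128 nodes with 64-bit longs */
#define MASK_BITS  (MASK_LONGS * sizeof(unsigned long) * CHAR_BIT)

/* Set the bit for one node number in a mask kept as an array of longs. */
void nodemask_set(unsigned long *mask, unsigned int node)
{
    size_t bits_per_long = sizeof(unsigned long) * CHAR_BIT;
    mask[node / bits_per_long] |= 1UL << (node % bits_per_long);
}

/* Hypothetical helper: move the pages of pid (0 means the caller) that
   reside on node `from` to node `to`.  Returns the number of pages that
   could not be moved, or a negative errno value. */
long migrate_node_to_node(int pid, unsigned int from, unsigned int to)
{
    unsigned long old_nodes[MASK_LONGS], new_nodes[MASK_LONGS];

    if (from >= MASK_BITS || to >= MASK_BITS)
        return -EINVAL;
    memset(old_nodes, 0, sizeof(old_nodes));
    memset(new_nodes, 0, sizeof(new_nodes));
    nodemask_set(old_nodes, from);
    nodemask_set(new_nodes, to);

    /* The capacity of the masks in bits is passed as maxnode; it is
       always at least the highest node number plus one. */
    long rc = syscall(SYS_migrate_pages, pid, (unsigned long)MASK_BITS,
                      old_nodes, new_nodes);
    return rc == -1 ? -errno : rc;
}
```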
Linux key management system calls
http://man7.org/linux/man-pages/man2/add_key.2.html
10
SYSTEM CALL:
add_key(2) - Linux manual page
FUNCTIONALITY:

       add_key - add a key to the kernel's key management facility
SYNOPSIS:

       #include <keyutils.h>

       key_serial_t add_key(const char *type, const char *description,
                            const void *payload, size_t plen,
                            key_serial_t keyring);
DESCRIPTION

       add_key() asks the kernel to create or update a key of the given type
       and description, instantiate it with the payload of length plen,
       attach it to the nominated keyring, and return its serial number.

       The key type may reject the data if it's in the wrong format or in
       some other way invalid.

       If the destination keyring already contains a key that matches the
       specified type and description, then, if the key type supports it,
       that key will be updated rather than a new key being created; if not,
       a new key will be created and it will displace the link to the extant
       key from the keyring.

       The destination keyring serial number may be that of a valid keyring
       to which the caller has write permission, or it may be a special
       keyring ID:

       KEY_SPEC_THREAD_KEYRING
              This specifies the caller's thread-specific keyring.

       KEY_SPEC_PROCESS_KEYRING
              This specifies the caller's process-specific keyring.

       KEY_SPEC_SESSION_KEYRING
              This specifies the caller's session-specific keyring.

       KEY_SPEC_USER_KEYRING
              This specifies the caller's UID-specific keyring.

       KEY_SPEC_USER_SESSION_KEYRING
              This specifies the caller's UID-session keyring.
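
       As a minimal sketch of the call described above, the following
       helper stores a payload under the "user" key type in the caller's
       process keyring.  It assumes the raw system call is invoked
       through syscall(2) instead of the libkeyutils wrapper, so the
       key_serial_t typedef is reproduced locally as an assumption;
       store_user_key() is a hypothetical name.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <linux/keyctl.h>   /* KEY_SPEC_* special keyring IDs */
#include <sys/syscall.h>
#include <unistd.h>

typedef int32_t key_serial_t;   /* assumption: mirrors <keyutils.h> */

/* Hypothetical helper: create or update a "user" key holding `secret`
   and link it into the process-specific keyring.  Returns the key's
   serial number (> 0) or a negative errno value. */
key_serial_t store_user_key(const char *description, const char *secret)
{
    long id = syscall(SYS_add_key, "user", description,
                      secret, strlen(secret),
                      (long)KEY_SPEC_PROCESS_KEYRING);
    return id == -1 ? -errno : (key_serial_t)id;
}
```

       Calling the helper twice with the same description and a
       different secret exercises the update rule described above.  In
       environments that filter the key management system calls, the
       helper returns a negative errno value.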
http://man7.org/linux/man-pages/man2/request_key.2.html
9
SYSTEM CALL:
request_key(2) - Linux manual page
FUNCTIONALITY:

       request_key - request a key from the kernel's key management facility
SYNOPSIS:

       #include <keyutils.h>

       key_serial_t request_key(const char *type, const char *description,
                                const char *callout_info,
                                key_serial_t keyring);
DESCRIPTION

       request_key() asks the kernel to find a key of the given type that
       matches the specified description and, if successful, to attach it to
       the nominated keyring and to return its serial number.

       request_key() first recursively searches all the keyrings attached to
       the calling process in the order thread-specific keyring,
       process-specific keyring, and then session keyring for a matching key.

       If request_key() is called from a program invoked by request_key() on
       behalf of some other process to generate a key, then the keyrings of
       that other process will be searched next, using that other process's
       UID, GID, groups, and security context to control access.

       The keys in each keyring searched are checked for a match before any
       child keyrings are recursed into.  Only keys that are searchable for
       the caller may be found, and only searchable keyrings may be
       searched.

       If the key is not found, then, if callout_info is set, this function
       will attempt to look further afield.  In such a case, the
       callout_info is passed to a user-space service such as
       /sbin/request-key to generate the key.

       If that is also unsuccessful, then an error will be returned, and a
       temporary negative key will be installed in the nominated keyring.
       This will expire after a few seconds, but will cause subsequent calls
       to request_key() to fail until it does.

       The keyring serial number may be that of a valid keyring to which the
       caller has write permission, or it may be a special keyring ID:

       KEY_SPEC_THREAD_KEYRING
              This specifies the caller's thread-specific keyring.

       KEY_SPEC_PROCESS_KEYRING
              This specifies the caller's process-specific keyring.

       KEY_SPEC_SESSION_KEYRING
              This specifies the caller's session-specific keyring.

       KEY_SPEC_USER_KEYRING
              This specifies the caller's UID-specific keyring.

       KEY_SPEC_USER_SESSION_KEYRING
              This specifies the caller's UID-session keyring.

       If a key is created, no matter whether it's a valid key or a negative
       key, it will displace any other key of the same type and description
       from the destination keyring.
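
       The lookup described above might be exercised as in the following
       sketch.  It assumes the raw system call is invoked through
       syscall(2), passes NULL for callout_info so that no user-space
       service is consulted, and reproduces the key_serial_t typedef
       locally as an assumption; find_user_key() is a hypothetical name.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/keyctl.h>   /* KEY_SPEC_* special keyring IDs */
#include <sys/syscall.h>
#include <unistd.h>

typedef int32_t key_serial_t;   /* assumption: mirrors <keyutils.h> */

/* Hypothetical helper: search the caller's keyrings for a "user" key
   with the given description and, if found, link it into the
   process-specific keyring.  Returns the serial number (> 0) or a
   negative errno value. */
key_serial_t find_user_key(const char *description)
{
    long id = syscall(SYS_request_key, "user", description,
                      (const char *)NULL,
                      (long)KEY_SPEC_PROCESS_KEYRING);
    return id == -1 ? -errno : (key_serial_t)id;
}
```

       A caller would typically treat any negative result as "no key
       available" and fall back to prompting or to another credential
       source.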
http://man7.org/linux/man-pages/man2/keyctl.2.html
9
SYSTEM CALL:
keyctl(2) - Linux manual page
FUNCTIONALITY:

       keyctl - manipulate the kernel's key management facility
SYNOPSIS:

       #include <keyutils.h>

       long keyctl(int cmd, ...);
DESCRIPTION

       keyctl() has a number of functions available:

       KEYCTL_GET_KEYRING_ID
              Ask for a keyring's ID.

       KEYCTL_JOIN_SESSION_KEYRING
              Join or start a named session keyring.

       KEYCTL_UPDATE
              Update a key.

       KEYCTL_REVOKE
              Revoke a key.

       KEYCTL_CHOWN
              Set ownership of a key.

       KEYCTL_SETPERM
              Set permissions on a key.

       KEYCTL_DESCRIBE
              Describe a key.

       KEYCTL_CLEAR
              Clear contents of a keyring.

       KEYCTL_LINK
              Link a key into a keyring.

       KEYCTL_UNLINK
              Unlink a key from a keyring.

       KEYCTL_SEARCH
              Search for a key in a keyring.

       KEYCTL_READ
              Read a key or keyring's contents.

       KEYCTL_INSTANTIATE
              Instantiate a partially constructed key.

       KEYCTL_NEGATE
              Negate a partially constructed key.

       KEYCTL_SET_REQKEY_KEYRING
              Set default request-key keyring.

       KEYCTL_SET_TIMEOUT
              Set timeout on a key.

       KEYCTL_ASSUME_AUTHORITY
              Assume authority to instantiate a key.

       These are wrapped by libkeyutils into individual functions to permit
       the compiler to check types.  See the See Also section at the bottom.
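
       The idea of wrapping individual commands in typed functions can
       be sketched as follows.  These are not the libkeyutils functions:
       keyring_id() and key_set_timeout() are hypothetical names, the
       raw system call is assumed to be reached through syscall(2), and
       the key_serial_t typedef is reproduced locally as an assumption.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <linux/keyctl.h>   /* KEYCTL_* commands and KEY_SPEC_* IDs */
#include <sys/syscall.h>
#include <unistd.h>

typedef int32_t key_serial_t;   /* assumption: mirrors <keyutils.h> */

/* Typed wrapper for KEYCTL_GET_KEYRING_ID: resolve a special keyring ID
   (such as KEY_SPEC_SESSION_KEYRING) to a real serial number.  A nonzero
   `create` asks for the keyring to be created if it does not exist.
   Returns the serial number or a negative errno value. */
key_serial_t keyring_id(key_serial_t id, int create)
{
    long rc = syscall(SYS_keyctl, (long)KEYCTL_GET_KEYRING_ID,
                      (long)id, (long)create);
    return rc == -1 ? -errno : (key_serial_t)rc;
}

/* Typed wrapper for KEYCTL_SET_TIMEOUT: expire `key` after `seconds`.
   Returns 0 on success or a negative errno value. */
int key_set_timeout(key_serial_t key, unsigned int seconds)
{
    long rc = syscall(SYS_keyctl, (long)KEYCTL_SET_TIMEOUT,
                      (long)key, (unsigned long)seconds);
    return rc == -1 ? -errno : 0;
}
```

       With fixed parameter types, passing a pointer where a serial
       number is expected becomes a compile-time diagnostic instead of a
       silently accepted variadic argument.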
Linux system-wide system calls
http://man7.org/linux/man-pages/man2/create_module.2.html
11
SYSTEM CALL:
create_module(2) - Linux manual page
FUNCTIONALITY:

       create_module - create a loadable module entry
SYNOPSIS:

       #include <linux/module.h>

       caddr_t create_module(const char *name, size_t size);

       Note: No declaration of this system call is provided in glibc
       headers; see NOTES.
DESCRIPTION

       Note: This system call is present only in kernels before Linux 2.6.

       create_module() attempts to create a loadable module entry and
       reserve the kernel memory that will be needed to hold the module.
       This system call requires privilege.
http://man7.org/linux/man-pages/man2/init_module.2.html
11
SYSTEM CALL:
init_module(2) - Linux manual page
FUNCTIONALITY:

       init_module, finit_module - load a kernel module
SYNOPSIS:

       int init_module(void *module_image, unsigned long len,
                       const char *param_values);

       int finit_module(int fd, const char *param_values,
                        int flags);

       Note: glibc provides no header file declaration of init_module() and
       no wrapper function for finit_module(); see NOTES.
DESCRIPTION

       init_module() loads an ELF image into kernel space, performs any
       necessary symbol relocations, initializes module parameters to values
       provided by the caller, and then runs the module's init function.
       This system call requires privilege.

       The module_image argument points to a buffer containing the binary
       image to be loaded; len specifies the size of that buffer.  The
       module image should be a valid ELF image, built for the running
       kernel.

       The param_values argument is a string containing space-delimited
       specifications of the values for module parameters (defined inside
       the module using module_param() and module_param_array()).  The
       kernel parses this string and initializes the specified parameters.
       Each of the parameter specifications has the form:

               name[=value[,value...]]

       The parameter name is one of those defined within the module using
       module_param() (see the Linux kernel source file
       include/linux/moduleparam.h).  The parameter value is optional in the
       case of bool and invbool parameters.  Values for array parameters are
       specified as a comma-separated list.

   finit_module()
       The finit_module() system call is like init_module(), but reads the
       module to be loaded from the file descriptor fd.  It is useful when
       the authenticity of a kernel module can be determined from its
       location in the filesystem; in cases where that is possible, the
       overhead of using cryptographically signed modules to determine the
       authenticity of a module can be avoided.  The param_values argument
       is as for init_module().

       The flags argument modifies the operation of finit_module().  It is a
       bit mask value created by ORing together zero or more of the
       following flags:

       MODULE_INIT_IGNORE_MODVERSIONS
              Ignore symbol version hashes.

       MODULE_INIT_IGNORE_VERMAGIC
              Ignore kernel version magic.

       There are some safety checks built into a module to ensure that it
       matches the kernel against which it is loaded.  These checks are
       recorded when the module is built and verified when the module is
       loaded.  First, the module records a "vermagic" string containing the
       kernel version number and prominent features (such as the CPU type).
       Second, if the module was built with the CONFIG_MODVERSIONS
       configuration option enabled, a version hash is recorded for each
       symbol the module uses.  This hash is based on the types of the
       arguments and return value for the function named by the symbol.  In
       this case, the kernel version number within the "vermagic" string is
       ignored, as the symbol version hashes are assumed to be sufficiently
       reliable.

       Using the MODULE_INIT_IGNORE_VERMAGIC flag indicates that the
       "vermagic" string is to be ignored, and the
       MODULE_INIT_IGNORE_MODVERSIONS flag indicates that the symbol version
       hashes are to be ignored.  If the kernel is built to permit forced
       loading (i.e., configured with CONFIG_MODULE_FORCE_LOAD), then
       loading will continue; otherwise, it will fail with the error
       ENOEXEC, as expected for malformed modules.
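
       Because glibc provides no wrapper for finit_module(), a caller
       typically goes through syscall(2).  The following is a minimal
       sketch under that assumption; load_module_file() is a
       hypothetical helper name.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: load the module stored at `path`, passing
   `param_values` (space-delimited name[=value[,value...]] items) and
   `flags` (0 or an OR of MODULE_INIT_* flags) to finit_module().
   Returns 0 on success or a negative errno value. */
int load_module_file(const char *path, const char *param_values, int flags)
{
    int fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd == -1)
        return -errno;

    long rc = syscall(SYS_finit_module, fd, param_values, flags);
    int result = (rc == -1) ? -errno : 0;

    close(fd);      /* the descriptor is not needed after the call */
    return result;
}
```

       A call such as load_module_file("/path/to/demo.ko",
       "debug=1 ids=1,2,3", 0) would pass one scalar and one array
       parameter; the path and both parameter names are made up for
       illustration.  Without privilege the call is expected to fail
       with EPERM.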
http://man7.org/linux/man-pages/man2/finit_module.2.html
11
SYSTEM CALL:
init_module(2) - Linux manual page
FUNCTIONALITY:

       init_module, finit_module - load a kernel module
SYNOPSIS:

       int init_module(void *module_image, unsigned long len,
                       const char *param_values);

       int finit_module(int fd, const char *param_values,
                        int flags);

       Note: glibc provides no header file declaration of init_module() and
       no wrapper function for finit_module(); see NOTES.
DESCRIPTION

       init_module() loads an ELF image into kernel space, performs any
       necessary symbol relocations, initializes module parameters to values
       provided by the caller, and then runs the module's init function.
       This system call requires privilege.

       The module_image argument points to a buffer containing the binary
       image to be loaded; len specifies the size of that buffer.  The
       module image should be a valid ELF image, built for the running
       kernel.

       The param_values argument is a string containing space-delimited
       specifications of the values for module parameters (defined inside
       the module using module_param() and module_param_array()).  The
       kernel parses this string and initializes the specified parameters.
       Each of the parameter specifications has the form:

               name[=value[,value...]]

       The parameter name is one of those defined within the module using
       module_param() (see the Linux kernel source file
       include/linux/moduleparam.h).  The parameter value is optional in the
       case of bool and invbool parameters.  Values for array parameters are
       specified as a comma-separated list.

   finit_module()
       The finit_module() system call is like init_module(), but reads the
       module to be loaded from the file descriptor fd.  It is useful when
       the authenticity of a kernel module can be determined from its
       location in the filesystem; in cases where that is possible, the
       overhead of using cryptographically signed modules to determine the
       authenticity of a module can be avoided.  The param_values argument
       is as for init_module().

       The flags argument modifies the operation of finit_module().  It is a
       bit mask value created by ORing together zero or more of the
       following flags:

       MODULE_INIT_IGNORE_MODVERSIONS
              Ignore symbol version hashes.

       MODULE_INIT_IGNORE_VERMAGIC
              Ignore kernel version magic.

       There are some safety checks built into a module to ensure that it
       matches the kernel against which it is loaded.  These checks are
       recorded when the module is built and verified when the module is
       loaded.  First, the module records a "vermagic" string containing the
       kernel version number and prominent features (such as the CPU type).
       Second, if the module was built with the CONFIG_MODVERSIONS
       configuration option enabled, a version hash is recorded for each
       symbol the module uses.  This hash is based on the types of the
       arguments and return value for the function named by the symbol.  In
       this case, the kernel version number within the "vermagic" string is
       ignored, as the symbol version hashes are assumed to be sufficiently
       reliable.

       Using the MODULE_INIT_IGNORE_VERMAGIC flag indicates that the
       "vermagic" string is to be ignored, and the
       MODULE_INIT_IGNORE_MODVERSIONS flag indicates that the symbol version
       hashes are to be ignored.  If the kernel is built to permit forced
       loading (i.e., configured with CONFIG_MODULE_FORCE_LOAD), then
       loading will continue; otherwise, it will fail with the error
       ENOEXEC, as expected for malformed modules.
http://man7.org/linux/man-pages/man2/delete_module.2.html
10
SYSTEM CALL:
delete_module(2) - Linux manual page
FUNCTIONALITY:

       delete_module - unload a kernel module
SYNOPSIS:

       int delete_module(const char *name, int flags);

       Note: No declaration of this system call is provided in glibc
       headers; see NOTES.
DESCRIPTION

       The delete_module() system call attempts to remove the unused
       loadable module entry identified by name.  If the module has an exit
       function, then that function is executed before unloading the module.
       The flags argument is used to modify the behavior of the system call,
       as described below.  This system call requires privilege.

       Module removal is attempted according to the following rules:

       1.  If there are other loaded modules that depend on (i.e., refer to
           symbols defined in) this module, then the call fails.

       2.  Otherwise, if the reference count for the module (i.e., the
           number of processes currently using the module) is zero, then the
           module is immediately unloaded.

       3.  If a module has a nonzero reference count, then the behavior
           depends on the bits set in flags.  In normal usage (see NOTES),
           the O_NONBLOCK flag is always specified, and the O_TRUNC flag may
           additionally be specified.

           The various combinations for flags have the following effect:

           flags == O_NONBLOCK
                  The call returns immediately, with an error.

           flags == (O_NONBLOCK | O_TRUNC)
                  The module is unloaded immediately, regardless of whether
                  it has a nonzero reference count.

           (flags & O_NONBLOCK) == 0
                  If flags does not specify O_NONBLOCK, the following steps
                  occur:

                  *  The module is marked so that no new references are
                     permitted.

                  *  If the module's reference count is nonzero, the caller
                     is placed in an uninterruptible sleep state
                     (TASK_UNINTERRUPTIBLE) until the reference count is
                     zero, at which point the call unblocks.

                  *  The module is unloaded in the usual way.

       The O_TRUNC flag has one further effect on the rules described above.
       By default, if a module has an init function but no exit function,
       then an attempt to remove the module will fail.  However, if O_TRUNC
       was specified, this requirement is bypassed.

       Using the O_TRUNC flag is dangerous!  If the kernel was not built
       with CONFIG_MODULE_FORCE_UNLOAD, this flag is silently ignored.
       (Normally, CONFIG_MODULE_FORCE_UNLOAD is enabled.)  Using this flag
       taints the kernel (TAINT_FORCED_RMMOD).
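
       The flag combinations described above can be sketched as follows,
       assuming the system call is reached through syscall(2) because
       glibc declares no wrapper.  unload_module() is a hypothetical
       helper name.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>          /* O_NONBLOCK, O_TRUNC */
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: try to unload the module called `name`.
   O_NONBLOCK is always set, so the call returns at once with an error
   if the module is still in use.  A nonzero `force` adds O_TRUNC, which
   requests a forced unload and should be used with great care.
   Returns 0 on success or a negative errno value. */
int unload_module(const char *name, int force)
{
    int flags = O_NONBLOCK | (force ? O_TRUNC : 0);
    long rc = syscall(SYS_delete_module, name, flags);
    return rc == -1 ? -errno : 0;
}
```

       An unprivileged caller should expect a negative errno value such
       as -EPERM; a privileged caller naming a module that is not loaded
       should expect -ENOENT.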
http://man7.org/linux/man-pages/man2/query_module.2.html
11
SYSTEM CALL:
query_module(2) - Linux manual page
FUNCTIONALITY:

       query_module - query the kernel for various bits pertaining to
       modules
SYNOPSIS:

       #include <linux/module.h>

       int query_module(const char *name, int which, void *buf,
                        size_t bufsize, size_t *ret);

       Note: No declaration of this system call is provided in glibc
       headers; see NOTES.
DESCRIPTION

       Note: This system call is present only in kernels before Linux 2.6.

       query_module() requests information from the kernel about loadable
       modules.  The returned information is placed in the buffer pointed to
       by buf.  The caller must specify the size of buf in bufsize.  The
       precise nature and format of the returned information depend on the
       operation specified by which.  Some operations require name to
       identify a currently loaded module; others allow name to be NULL,
       indicating the kernel proper.

       The following values can be specified for which:

       0      Returns success, if the kernel supports query_module().  Used
              to probe for availability of the system call.

       QM_MODULES
              Returns the names of all loaded modules.  The returned buffer
              consists of a sequence of null-terminated strings; ret is set
              to the number of modules.

       QM_DEPS
              Returns the names of all modules used by the indicated module.
              The returned buffer consists of a sequence of null-terminated
              strings; ret is set to the number of modules.

       QM_REFS
              Returns the names of all modules using the indicated module.
              This is the inverse of QM_DEPS.  The returned buffer consists
              of a sequence of null-terminated strings; ret is set to the
              number of modules.

       QM_SYMBOLS
              Returns the symbols and values exported by the kernel or the
              indicated module.  The returned buffer is an array of
              structures of the following form

                  struct module_symbol {
                      unsigned long value;
                      unsigned long name;
                  };

              followed by null-terminated strings.  The value of name is the
              character offset of the string relative to the start of buf;
              ret is set to the number of symbols.

       QM_INFO
              Returns miscellaneous information about the indicated module.
              The output buffer format is:

                  struct module_info {
                      unsigned long address;
                      unsigned long size;
                      unsigned long flags;
                  };

              where address is the kernel address at which the module
              resides, size is the size of the module in bytes, and flags is
              a mask of MOD_RUNNING, MOD_AUTOCLEAN, and so on, that
              indicates the current status of the module (see the Linux
              kernel source file include/linux/module.h).  ret is set to the
              size of the module_info structure.
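
       Since query_module() is not present in Linux 2.6 and later, the
       following sketch only shows how a buffer in the QM_SYMBOLS layout
       described above could be decoded.  The structure definition is
       copied from the text; symbol_name() and find_symbol_index() are
       hypothetical helper names, and count stands for the value
       returned in ret.

```c
#include <stddef.h>
#include <string.h>

struct module_symbol {
    unsigned long value;
    unsigned long name;     /* byte offset of the name from start of buf */
};

/* Return the name of entry `index`, or NULL if index is out of range.
   `buf` must be suitably aligned for struct module_symbol. */
const char *symbol_name(const void *buf, size_t count, size_t index)
{
    const struct module_symbol *syms = buf;

    if (index >= count)
        return NULL;
    return (const char *)buf + syms[index].name;
}

/* Return the index of the symbol called `wanted`, or -1 if absent. */
long find_symbol_index(const void *buf, size_t count, const char *wanted)
{
    for (size_t i = 0; i < count; i++) {
        if (strcmp(symbol_name(buf, count, i), wanted) == 0)
            return (long)i;
    }
    return -1;
}
```

       The helpers trust the offsets stored in the buffer; code handling
       untrusted data would also check each offset against bufsize.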
http://man7.org/linux/man-pages/man2/get_kernel_syms.2.html
12
SYSTEM CALL:
get_kernel_syms(2) - Linux manual page
FUNCTIONALITY:

       get_kernel_syms - retrieve exported kernel and module symbols
SYNOPSIS:

       #include <linux/module.h>

       int get_kernel_syms(struct kernel_sym *table);

       Note: No declaration of this system call is provided in glibc
       headers; see NOTES.
DESCRIPTION

       Note: This system call is present only in kernels before Linux 2.6.

       If table is NULL, get_kernel_syms() returns the number of symbols
       available for query.  Otherwise, it fills in a table of structures:

           struct kernel_sym {
               unsigned long value;
               char          name[60];
           };

       The symbols are interspersed with magic symbols of the form
       #module-name, with the kernel having an empty name.  The value
       associated with a symbol of this form is the address at which the
       module is loaded.

       The symbols exported from each module follow their magic module tag
       and the modules are returned in the reverse of the order in which
       they were loaded.
http://man7.org/linux/man-pages/man2/acct.2.html
10
SYSTEM CALL:
acct(2) - Linux manual page
FUNCTIONALITY:

       acct - switch process accounting on or off
SYNOPSIS:

       #include <unistd.h>

       int acct(const char *filename);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       acct():
           Since glibc 2.21:
               _DEFAULT_SOURCE
           In glibc 2.19 and 2.20:
               _DEFAULT_SOURCE || (_XOPEN_SOURCE && _XOPEN_SOURCE < 500)
           Up to and including glibc 2.19:
               _BSD_SOURCE || (_XOPEN_SOURCE && _XOPEN_SOURCE < 500)
DESCRIPTION

       The acct() system call enables or disables process accounting.  If
       called with the name of an existing file as its argument, accounting
       is turned on, and records for each terminating process are appended
       to filename as it terminates.  An argument of NULL causes accounting
       to be turned off.
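
       A minimal sketch of switching accounting on and off follows.
       set_accounting() is a hypothetical helper name, and
       _DEFAULT_SOURCE is defined on the assumption of glibc 2.21 or
       later, as listed in the feature test macro requirements above.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: enable process accounting into the existing file
   `filename`, or disable accounting when filename is NULL.
   Returns 0 on success or a negative errno value. */
int set_accounting(const char *filename)
{
    return acct(filename) == -1 ? -errno : 0;
}
```

       set_accounting(NULL) turns accounting off.  Both directions
       require privilege, so an unprivileged caller should expect a
       negative errno value.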
http://man7.org/linux/man-pages/man2/quotactl.2.html
8
SYSTEM CALL:
quotactl(2) - Linux manual page
FUNCTIONALITY:

       quotactl - manipulate disk quotas
SYNOPSIS:

       #include <sys/quota.h>
       #include <xfs/xqm.h>

       int quotactl(int cmd, const char *special, int id, caddr_t addr);
DESCRIPTION

       The quota system can be used to set per-user and per-group limits on
       the amount of disk space used on a filesystem.  For each user and/or
       group, a soft limit and a hard limit can be set for each filesystem.
       The hard limit can't be exceeded.  The soft limit can be exceeded,
       but warnings will ensue.  Moreover, the user can't exceed the soft
       limit for more than one week (by default) at a time; after this time,
       the soft limit counts as a hard limit.

       The quotactl() call manipulates disk quotas.  The cmd argument
       indicates a command to be applied to the user or group ID specified
       in id.  To initialize the cmd argument, use the QCMD(subcmd, type)
       macro.  The type value is either USRQUOTA, for user quotas, or
       GRPQUOTA, for group quotas.  The subcmd value is described below.

       The special argument is a pointer to a null-terminated string
       containing the pathname of the (mounted) block special device for the
       filesystem being manipulated.

       The addr argument is the address of an optional, command-specific,
       data structure that is copied in or out of the system.  The
       interpretation of addr is given with each command below.
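
       The way cmd, special, id, and addr fit together can be sketched
       as follows, using Q_GETQUOTA, one of the subcommands in the list
       that follows.  The sketch assumes that <sys/quota.h> alone
       supplies QCMD(), the Q_* constants, and struct dqblk;
       get_user_quota() is a hypothetical helper name.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/quota.h>

/* Hypothetical helper: fetch the quota limits and usage of user `uid`
   on the filesystem whose block special device is `special`.
   Returns 0 on success or a negative errno value. */
int get_user_quota(const char *special, int uid, struct dqblk *out)
{
    memset(out, 0, sizeof(*out));

    /* QCMD() combines the subcommand with the quota type. */
    if (quotactl(QCMD(Q_GETQUOTA, USRQUOTA), special, uid,
                 (caddr_t)out) == -1)
        return -errno;
    return 0;
}
```

       Passing GRPQUOTA instead of USRQUOTA, together with a group ID,
       would query a group quota in the same way.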

       The subcmd value is one of the following:

       Q_QUOTAON
               Turn on quotas for a filesystem.  The id argument is the
               identification number of the quota format to be used.
               Currently, there are three supported quota formats:

               QFMT_VFS_OLD The original quota format.

               QFMT_VFS_V0  The standard VFS v0 quota format, which can
                            handle 32-bit UIDs and GIDs and quota limits up
                            to 2^42 bytes and 2^32 inodes.

               QFMT_VFS_V1  A quota format that can handle 32-bit UIDs and
                            GIDs and quota limits up to 2^64 bytes and 2^64
                            inodes.

               The addr argument points to the pathname of a file containing
               the quotas for the filesystem.  The quota file must exist; it
               is normally created with the quotacheck(8) program.  This
               operation requires privilege (CAP_SYS_ADMIN).

       Q_QUOTAOFF
               Turn off quotas for a filesystem.  The addr and id arguments
               are ignored.  This operation requires privilege
               (CAP_SYS_ADMIN).

       Q_GETQUOTA
               Get disk quota limits and current usage for user or group id.
               The addr argument is a pointer to a dqblk structure defined
               in <sys/quota.h> as follows:

                   /* uint64_t is an unsigned 64-bit integer;
                      uint32_t is an unsigned 32-bit integer */

                   struct dqblk {          /* Definition since Linux 2.4.22 */
                       uint64_t dqb_bhardlimit;   /* absolute limit on disk
                                                     quota blocks alloc */
                       uint64_t dqb_bsoftlimit;   /* preferred limit on
                                                     disk quota blocks */
                       uint64_t dqb_curspace;     /* current occupied space
                                                     (in bytes) */
                       uint64_t dqb_ihardlimit;   /* maximum number of
                                                     allocated inodes */
                       uint64_t dqb_isoftlimit;   /* preferred inode limit */
                       uint64_t dqb_curinodes;    /* current number of
                                                     allocated inodes */
                       uint64_t dqb_btime;        /* time limit for excessive
                                                     disk use */
                       uint64_t dqb_itime;        /* time limit for excessive
                                                     files */
                       uint32_t dqb_valid;        /* bit mask of QIF_*
                                                     constants */
                   };

                   /* Flags in dqb_valid that indicate which fields in
                      dqblk structure are valid. */

                   #define QIF_BLIMITS   1
                   #define QIF_SPACE     2
                   #define QIF_ILIMITS   4
                   #define QIF_INODES    8
                   #define QIF_BTIME     16
                   #define QIF_ITIME     32
                   #define QIF_LIMITS    (QIF_BLIMITS | QIF_ILIMITS)
                   #define QIF_USAGE     (QIF_SPACE | QIF_INODES)
                   #define QIF_TIMES     (QIF_BTIME | QIF_ITIME)
                   #define QIF_ALL       (QIF_LIMITS | QIF_USAGE | QIF_TIMES)

               The dqb_valid field is a bit mask that is set to indicate the
               entries in the dqblk structure that are valid.  Currently,
               the kernel fills in all entries of the dqblk structure and
               marks them as valid in the dqb_valid field.  Unprivileged
               users may retrieve only their own quotas; a privileged user
               (CAP_SYS_ADMIN) can retrieve the quotas of any user.
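
               A minimal sketch of a Q_GETQUOTA call, assuming the glibc
               declarations in <sys/quota.h> (QCMD(), USRQUOTA, struct
               dqblk).  The helper names get_user_quota(),
               limits_are_valid(), and usage_is_valid() are hypothetical
               and exist only for this illustration.  Although the kernel
               currently marks every field valid, checking dqb_valid before
               reading a field keeps the caller correct if that changes.

```c
#include <stdint.h>
#include <sys/types.h>
#include <sys/quota.h>

/* Fetch the user quota for uid on the block device named by special.
   Returns 0 on success, -1 with errno set on failure. */
int get_user_quota(const char *special, uid_t uid, struct dqblk *out)
{
    return quotactl(QCMD(Q_GETQUOTA, USRQUOTA), special, (int)uid,
                    (char *)out);
}

/* Nonzero if both the block and the inode limit fields are marked valid. */
int limits_are_valid(const struct dqblk *dq)
{
    return (dq->dqb_valid & QIF_LIMITS) == QIF_LIMITS;
}

/* Nonzero if dqb_curspace and dqb_curinodes are both marked valid. */
int usage_is_valid(const struct dqblk *dq)
{
    return (dq->dqb_valid & QIF_USAGE) == QIF_USAGE;
}
```

               A caller would typically invoke get_user_quota() with its own
               UID, since retrieving another user's quota requires
               CAP_SYS_ADMIN.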

       Q_GETNEXTQUOTA (since Linux 4.6)
               This operation is the same as Q_GETQUOTA, but it returns
               quota information for the next ID greater than or equal to id
               that has a quota set.

               The addr argument is a pointer to a nextdqblk structure whose
               fields are as for the dqblk, except for the addition of a
               dqb_id field that is used to return the ID for which quota
               information is being returned:

                   struct nextdqblk {
                       uint64_t dqb_bhardlimit;
                       uint64_t dqb_bsoftlimit;
                       uint64_t dqb_curspace;
                       uint64_t dqb_ihardlimit;
                       uint64_t dqb_isoftlimit;
                       uint64_t dqb_curinodes;
                       uint64_t dqb_btime;
                       uint64_t dqb_itime;
                       uint32_t dqb_valid;
                       uint32_t dqb_id;
                   };

       Q_SETQUOTA
               Set quota information for user or group id, using the
               information supplied in the dqblk structure pointed to by
               addr.  The dqb_valid field of the dqblk structure indicates
               which entries in the structure have been set by the caller.
               This operation supersedes the Q_SETQLIM and Q_SETUSE
               operations in the previous quota interfaces.  This operation
               requires privilege (CAP_SYS_ADMIN).
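
               The role of dqb_valid in Q_SETQUOTA can be sketched as
               follows.  The helpers prepare_block_limits() and
               set_user_block_limits() are hypothetical names; the sketch
               assumes that fields not flagged in dqb_valid are left
               untouched by the kernel, which is how the description above
               reads but is not spelled out there.

```c
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/quota.h>

/* Fill *dq so that only the block limits are flagged as set. */
void prepare_block_limits(struct dqblk *dq, uint64_t soft, uint64_t hard)
{
    memset(dq, 0, sizeof(*dq));
    dq->dqb_bsoftlimit = soft;
    dq->dqb_bhardlimit = hard;
    dq->dqb_valid = QIF_BLIMITS;    /* inode limits and usage are not set */
}

/* Apply new block limits for uid on special; needs CAP_SYS_ADMIN.
   Returns 0 on success, -1 with errno set on failure. */
int set_user_block_limits(const char *special, uid_t uid,
                          uint64_t soft, uint64_t hard)
{
    struct dqblk dq;

    prepare_block_limits(&dq, soft, hard);
    return quotactl(QCMD(Q_SETQUOTA, USRQUOTA), special, (int)uid,
                    (char *)&dq);
}
```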

       Q_GETINFO (since Linux 2.4.22)
               Get information (like grace times) about quotafile.  The addr
               argument should be a pointer to a dqinfo structure.  This
               structure is defined in <sys/quota.h> as follows:

                   /* uint64_t is an unsigned 64-bit integer;
                      uint32_t is an unsigned 32-bit integer */

                   struct dqinfo {         /* Defined since kernel 2.4.22 */
                       uint64_t dqi_bgrace;    /* Time before block soft limit
                                                  becomes hard limit */
                       uint64_t dqi_igrace;    /* Time before inode soft limit
                                                  becomes hard limit */
                       uint32_t dqi_flags;     /* Flags for quotafile
                                                  (DQF_*) */
                       uint32_t dqi_valid;
                   };

                   /* Bits for dqi_flags */

                   /* Quota format QFMT_VFS_OLD */

                   #define V1_DQF_RSQUASH   1   /* Root squash enabled */

                   /* Other quota formats have no dqi_flags bits defined */

                   /* Flags in dqi_valid that indicate which fields in
                      dqinfo structure are valid. */

                   #define IIF_BGRACE  1
                   #define IIF_IGRACE  2
                   #define IIF_FLAGS   4
                   #define IIF_ALL     (IIF_BGRACE | IIF_IGRACE | IIF_FLAGS)

               The dqi_valid field in the dqinfo structure indicates the
               entries in the structure that are valid.  Currently, the
               kernel fills in all entries of the dqinfo structure and marks
               them all as valid in the dqi_valid field.  The id argument is
               ignored.

       Q_SETINFO (since Linux 2.4.22)
               Set information about quotafile.  The addr argument should be
               a pointer to a dqinfo structure.  The dqi_valid field of the
               dqinfo structure indicates the entries in the structure that
               have been set by the caller.  This operation supersedes the
               Q_SETGRACE and Q_SETFLAGS operations in the previous quota
               interfaces.  The id argument is ignored.  This operation
               requires privilege (CAP_SYS_ADMIN).

       Q_GETFMT (since Linux 2.4.22)
               Get quota format used on the specified filesystem.  The addr
               argument should be a pointer to a 4-byte buffer where the
               format number will be stored.
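
               A sketch of reading the format number into a 4-byte buffer
               and interpreting it against the format descriptions given
               under Q_QUOTAON.  The function names are hypothetical.  The
               fallback numeric values for the QFMT_* constants are an
               assumption taken from <linux/quota.h> and are used only if
               <sys/quota.h> does not already provide them.

```c
#include <stdint.h>
#include <sys/types.h>
#include <sys/quota.h>

/* Fallback values (assumed to match <linux/quota.h>). */
#ifndef QFMT_VFS_OLD
#define QFMT_VFS_OLD 1
#endif
#ifndef QFMT_VFS_V0
#define QFMT_VFS_V0  2
#endif
#ifndef QFMT_VFS_V1
#define QFMT_VFS_V1  4
#endif

/* Store the user-quota format number of special in *fmt.
   Returns 0 on success, -1 with errno set on failure. */
int get_user_quota_format(const char *special, uint32_t *fmt)
{
    return quotactl(QCMD(Q_GETFMT, USRQUOTA), special, 0, (char *)fmt);
}

/* Width in bits of the largest byte limit the format can store, as
   documented for QFMT_VFS_V0 and QFMT_VFS_V1; 0 if not documented. */
int quota_byte_limit_bits(uint32_t fmt)
{
    switch (fmt) {
    case QFMT_VFS_V0: return 42;
    case QFMT_VFS_V1: return 64;
    default:          return 0;
    }
}
```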

       Q_SYNC  Update the on-disk copy of quota usages for a filesystem.  If
               special is NULL, then all filesystems with active quotas are
               sync'ed.  The addr and id arguments are ignored.

       Q_GETSTATS (supported up to Linux 2.4.21)
               Get statistics and other generic information about the quota
               subsystem.  The addr argument should be a pointer to a
               dqstats structure in which data should be stored.  This
               structure is defined in <sys/quota.h>.  The special and id
               arguments are ignored.

               This operation is obsolete and was removed in Linux 2.4.22.
               Files in /proc/sys/fs/quota/ carry the information instead.

       For XFS filesystems making use of the XFS Quota Manager (XQM), the
       above commands are bypassed and the following commands are used:

       Q_XQUOTAON
               Turn on quotas for an XFS filesystem.  XFS provides the
               ability to turn on/off quota limit enforcement with quota
               accounting.  Therefore, XFS expects addr to be a pointer to
               an unsigned int that contains either the flags
               XFS_QUOTA_UDQ_ACCT and/or XFS_QUOTA_UDQ_ENFD (for user
               quota), or XFS_QUOTA_GDQ_ACCT and/or XFS_QUOTA_GDQ_ENFD (for
               group quota), as defined in <xfs/xqm.h>.  This operation
               requires privilege (CAP_SYS_ADMIN).

       Q_XQUOTAOFF
               Turn off quotas for an XFS filesystem.  As with Q_XQUOTAON,
               XFS filesystems expect a pointer to an unsigned int that
               specifies whether quota accounting and/or limit enforcement
               need to be turned off.  This operation requires privilege
               (CAP_SYS_ADMIN).

       Q_XGETQUOTA
               Get disk quota limits and current usage for user id.  The
               addr argument is a pointer to an fs_disk_quota structure
               (defined in <xfs/xqm.h>).  Unprivileged users may retrieve
               only their own quotas; a privileged user (CAP_SYS_ADMIN) may
               retrieve the quotas of any user.

       Q_XGETNEXTQUOTA (since Linux 4.6)
               This operation is the same as Q_XGETQUOTA, but it returns
               quota information for the next ID greater than or equal to id
               that has a quota set.

       Q_XSETQLIM
               Set disk quota limits for user id.  The addr argument is a
               pointer to an fs_disk_quota structure (defined in
               <xfs/xqm.h>).  This operation requires privilege
               (CAP_SYS_ADMIN).

       Q_XGETQSTAT
               Returns an fs_quota_stat structure containing XFS filesystem-
               specific quota information.  This is useful for finding out
               how much space is used to store quota information, and also
               to get quotaon/off status of a given local XFS filesystem.

       Q_XQUOTARM
               Free the disk space taken by disk quotas.  Quotas must have
               already been turned off.

       There is no command equivalent to Q_SYNC for XFS since sync(1) writes
       quota information to disk (in addition to the other filesystem
       metadata that it writes out).
http://man7.org/linux/man-pages/man2/pivot_root.2.html
12
SYSTEM CALL:
pivot_root(2) - Linux manual page
FUNCTIONALITY:

       pivot_root - change the root filesystem
SYNOPSIS:

       int pivot_root(const char *new_root, const char *put_old);

       Note: There is no glibc wrapper for this system call; see NOTES.
DESCRIPTION

       pivot_root() moves the root filesystem of the calling process to the
       directory put_old and makes new_root the new root filesystem of the
       calling process.

       The typical use of pivot_root() is during system startup, when the
       system mounts a temporary root filesystem (e.g., an initrd), then
       mounts the real root filesystem, and eventually turns the latter into
       the current root of all relevant processes or threads.

       pivot_root() may or may not change the current root and the current
       working directory of any processes or threads which use the old root
       directory.  The caller of pivot_root() must ensure that processes
       with root or current working directory at the old root operate
       correctly in either case.  An easy way to ensure this is to change
       their root and current working directory to new_root before invoking
       pivot_root().

       The paragraph above is intentionally vague because the implementation
       of pivot_root() may change in the future.  At the time of writing,
       pivot_root() changes root and current working directory of each
       process or thread to new_root if they point to the old root
       directory.  This is necessary in order to prevent kernel threads from
       keeping the old root directory busy with their root and current
       working directory, even if they never access the filesystem in any
       way.  In the future, there may be a mechanism for kernel threads to
       explicitly relinquish any access to the filesystem, such that this
       fairly intrusive mechanism can be removed from pivot_root().

       Note that this also applies to the calling process: pivot_root() may
       or may not affect its current working directory.  It is therefore
       recommended to call chdir("/") immediately after pivot_root().

       The following restrictions apply to new_root and put_old:

       -  They must be directories.

       -  new_root and put_old must not be on the same filesystem as the
          current root.

       -  put_old must be underneath new_root, that is, adding a nonzero
          number of /.. to the string pointed to by put_old must yield the
          same directory as new_root.

       -  No other filesystem may be mounted on put_old.

       See also pivot_root(8) for additional usage examples.

       If the current root is not a mount point (e.g., after chroot(2) or
       pivot_root(), see also below), then what is mounted on put_old is
       the mount point of the filesystem containing the current root, not
       the old root directory itself.

       new_root does not have to be a mount point.  In this case,
       /proc/mounts will show the mount point of the filesystem containing
       new_root as root (/).
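
       The call sequence described above (invoke pivot_root(), then
       chdir("/")) might be sketched as follows.  Because there is no glibc
       wrapper, the sketch goes through syscall(2).  Both function names are
       hypothetical.  The pre-check is purely lexical: it only compares the
       two strings, assumes absolute paths spelled the same way, and does
       not resolve symbolic links or "..", so it is a convenience and not a
       substitute for the kernel's own validation.

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Lexical check that put_old names a path strictly below new_root. */
int put_old_is_below(const char *new_root, const char *put_old)
{
    size_t n = strlen(new_root);

    /* Ignore trailing slashes on new_root, but keep a lone "/". */
    while (n > 1 && new_root[n - 1] == '/')
        n--;
    if (strncmp(new_root, put_old, n) != 0)
        return 0;
    if (n == 1 && new_root[0] == '/')
        return put_old[1] != '\0';
    /* Require a separator followed by at least one more character. */
    return put_old[n] == '/' && put_old[n + 1] != '\0';
}

/* Pivot to new_root, parking the old root on put_old, then move the
   working directory to the new root.  Returns 0 or -1 with errno set. */
int switch_root(const char *new_root, const char *put_old)
{
    if (!put_old_is_below(new_root, put_old)) {
        errno = EINVAL;
        return -1;
    }
    if (syscall(SYS_pivot_root, new_root, put_old) == -1)
        return -1;
    return chdir("/");
}
```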
http://man7.org/linux/man-pages/man2/swapon.2.html
10
SYSTEM CALL:
swapon(2) - Linux manual page
FUNCTIONALITY:

       swapon, swapoff - start/stop swapping to file/device
SYNOPSIS:

       #include <unistd.h>
       #include <sys/swap.h>

       int swapon(const char *path, int swapflags);
       int swapoff(const char *path);
DESCRIPTION

       swapon() sets the swap area to the file or block device specified by
       path.  swapoff() stops swapping to the file or block device specified
       by path.

       If the SWAP_FLAG_PREFER flag is specified in the swapon() swapflags
       argument, the new swap area will have a higher priority than default.
       The priority is encoded within swapflags as:

           (prio << SWAP_FLAG_PRIO_SHIFT) & SWAP_FLAG_PRIO_MASK

       If the SWAP_FLAG_DISCARD flag is specified in the swapon() swapflags
       argument, freed swap pages will be discarded before they are reused,
       if the swap device supports the discard or trim operation.  (This may
       improve performance on some Solid State Devices, but often it does
       not.)  See also NOTES.

       These functions may be used only by a privileged process (one having
       the CAP_SYS_ADMIN capability).

   Priority
       Each swap area has a priority, either high or low.  The default
       priority is low.  Within the low-priority areas, newer areas are even
       lower priority than older areas.

       All priorities set with swapflags are high-priority, higher than
       default.  They may have any nonnegative value chosen by the caller.
       Higher numbers mean higher priority.

       Swap pages are allocated from areas in priority order, highest
       priority first.  For areas with different priorities, a higher-
       priority area is exhausted before using a lower-priority area.  If
       two or more areas have the same priority, and it is the highest
       priority available, pages are allocated on a round-robin basis
       between them.

       As of Linux 1.3.6, the kernel usually follows these rules, but there
       are exceptions.
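
       The priority encoding shown above can be wrapped in two small
       helpers.  This is a sketch that relies on the SWAP_FLAG_* constants
       from <sys/swap.h>; the function names are hypothetical.

```c
#include <sys/swap.h>

/* Build a swapflags value that requests the given nonnegative priority. */
int swapflags_with_priority(int prio)
{
    return SWAP_FLAG_PREFER |
           ((prio << SWAP_FLAG_PRIO_SHIFT) & SWAP_FLAG_PRIO_MASK);
}

/* Recover the priority encoded in swapflags, or -1 if SWAP_FLAG_PREFER
   is not set (the area would get the default, low priority). */
int swapflags_priority(int swapflags)
{
    if (!(swapflags & SWAP_FLAG_PREFER))
        return -1;
    return (swapflags & SWAP_FLAG_PRIO_MASK) >> SWAP_FLAG_PRIO_SHIFT;
}

/* Start swapping to path with the given priority; needs CAP_SYS_ADMIN.
   Returns 0 on success, -1 with errno set on failure. */
int enable_swap(const char *path, int prio)
{
    return swapon(path, swapflags_with_priority(prio));
}
```

       Priorities that do not fit in SWAP_FLAG_PRIO_MASK are silently
       truncated by the mask, so a caller may want to range-check prio
       first.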
http://man7.org/linux/man-pages/man2/swapoff.2.html
10
SYSTEM CALL:
swapon(2) - Linux manual page
FUNCTIONALITY:

       swapon, swapoff - start/stop swapping to file/device
SYNOPSIS:

       #include <unistd.h>
       #include <sys/swap.h>

       int swapon(const char *path, int swapflags);
       int swapoff(const char *path);
DESCRIPTION

       swapon() sets the swap area to the file or block device specified by
       path.  swapoff() stops swapping to the file or block device specified
       by path.

       If the SWAP_FLAG_PREFER flag is specified in the swapon() swapflags
       argument, the new swap area will have a higher priority than default.
       The priority is encoded within swapflags as:

           (prio << SWAP_FLAG_PRIO_SHIFT) & SWAP_FLAG_PRIO_MASK

       If the SWAP_FLAG_DISCARD flag is specified in the swapon() swapflags
       argument, freed swap pages will be discarded before they are reused,
       if the swap device supports the discard or trim operation.  (This may
       improve performance on some Solid State Devices, but often it does
       not.)  See also NOTES.

       These functions may be used only by a privileged process (one having
       the CAP_SYS_ADMIN capability).

   Priority
       Each swap area has a priority, either high or low.  The default
       priority is low.  Within the low-priority areas, newer areas are even
       lower priority than older areas.

       All priorities set with swapflags are high-priority, higher than
       default.  They may have any nonnegative value chosen by the caller.
       Higher numbers mean higher priority.

       Swap pages are allocated from areas in priority order, highest
       priority first.  For areas with different priorities, a higher-
       priority area is exhausted before using a lower-priority area.  If
       two or more areas have the same priority, and it is the highest
       priority available, pages are allocated on a round-robin basis
       between them.

       As of Linux 1.3.6, the kernel usually follows these rules, but there
       are exceptions.
http://man7.org/linux/man-pages/man2/mount.2.html
11
SYSTEM CALL:
mount(2) - Linux manual page
FUNCTIONALITY:

       mount - mount filesystem
SYNOPSIS:

       #include <sys/mount.h>

       int mount(const char *source, const char *target,
                 const char *filesystemtype, unsigned long mountflags,
                 const void *data);
DESCRIPTION

       mount() attaches the filesystem specified by source (which is often a
       pathname referring to a device, but can also be the pathname of a
       directory or file, or a dummy string) to the location (a directory or
       file) specified by the pathname in target.

       Appropriate privilege (Linux: the CAP_SYS_ADMIN capability) is
       required to mount filesystems.

       Values for the filesystemtype argument supported by the kernel are
       listed in /proc/filesystems (e.g., "btrfs", "ext4", "jfs", "xfs",
       "vfat", "fuse", "tmpfs", "cgroup", "proc", "mqueue", "nfs", "cifs",
       "iso9660").  Further types may become available when the appropriate
       modules are loaded.

       The data argument is interpreted by the different filesystems.
       Typically it is a string of comma-separated options understood by
       this filesystem.  See mount(8) for details of the options available
       for each filesystem type.

       A call to mount() performs one of a number of general types of
       operation, depending on the bits specified in mountflags.  The
       choice of operation is determined by testing the bits set in
       mountflags, with the tests being conducted in the order listed here:

       *  Remount an existing mount: mountflags includes MS_REMOUNT.

       *  Create a bind mount: mountflags includes MS_BIND.

       *  Change the propagation type of an existing mount: mountflags
          includes one of MS_SHARED, MS_PRIVATE, MS_SLAVE, or MS_UNBINDABLE.

       *  Move an existing mount to a new location: mountflags includes
          MS_MOVE.

       *  Create a new mount: mountflags includes none of the above flags.

       Each of these operations is detailed later in this page.  Further
       flags may be specified in mountflags to modify the behavior of
       mount(), as described below.
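
       The order of the tests listed above can be modeled in user space as
       follows.  This is an illustrative sketch, not kernel code; the enum
       and the function name are hypothetical.

```c
#include <sys/mount.h>

enum mount_op {
    OP_REMOUNT,
    OP_BIND,
    OP_PROPAGATION,
    OP_MOVE,
    OP_NEW_MOUNT
};

/* Return the operation type that the documented test order selects. */
enum mount_op classify_mountflags(unsigned long mountflags)
{
    if (mountflags & MS_REMOUNT)
        return OP_REMOUNT;
    if (mountflags & MS_BIND)
        return OP_BIND;
    if (mountflags & (MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE))
        return OP_PROPAGATION;
    if (mountflags & MS_MOVE)
        return OP_MOVE;
    return OP_NEW_MOUNT;
}
```

       Because MS_REMOUNT is tested first, the combination
       MS_REMOUNT | MS_BIND | MS_RDONLY is treated as a remount rather than
       as a new bind mount.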

   Additional mount flags
       The list below describes the additional flags that can be specified
       in mountflags.  Note that some operation types ignore some or all of
       these flags, as described later in this page.

       MS_DIRSYNC (since Linux 2.5.19)
              Make directory changes on this filesystem synchronous.  (This
              property can be obtained for individual directories or
              subtrees using chattr(1).)

       MS_LAZYTIME (since Linux 4.0)
              Reduce on-disk updates of inode timestamps (atime, mtime,
              ctime) by maintaining these changes only in memory.  The on-
              disk timestamps are updated only when:

              (a)  the inode needs to be updated for some change unrelated
                   to file timestamps;

              (b)  the application employs sync(2);

              (c)  an undeleted inode is evicted from memory; or

              (d)  more than 24 hours have passed since the inode was
                   written to disk.

              This mount option significantly reduces writes needed to
              update the inode's timestamps, especially mtime and atime.
              However, in the event of a system crash, the atime and mtime
              fields on disk might be out of date by up to 24 hours.

              Examples of workloads where this option could be of
              significant benefit include frequent random writes to
              preallocated files, as well as cases where the MS_STRICTATIME
              mount option is also enabled.  (The advantage of combining
              MS_STRICTATIME and MS_LAZYTIME is that stat(2) will return the
              correctly updated atime, but the atime updates will be flushed
              to disk only in the cases listed above.)

       MS_MANDLOCK
              Permit mandatory locking on files in this filesystem.
              (Mandatory locking must still be enabled on a per-file basis,
              as described in fcntl(2).)  Since Linux 4.5, this mount option
              requires the CAP_SYS_ADMIN capability.

       MS_NOATIME
              Do not update access times for (all types of) files on this
              filesystem.

       MS_NODEV
              Do not allow access to devices (special files) on this
              filesystem.

       MS_NODIRATIME
              Do not update access times for directories on this filesystem.
              This flag provides a subset of the functionality provided by
              MS_NOATIME; that is, MS_NOATIME implies MS_NODIRATIME.

       MS_NOEXEC
              Do not allow programs to be executed from this filesystem.

       MS_NOSUID
              Do not honor set-user-ID and set-group-ID bits or file
              capabilities when executing programs from this filesystem.

       MS_RDONLY
              Mount filesystem read-only.

       MS_REC (since Linux 2.4.11)
              Used in conjunction with MS_BIND to create a recursive bind
              mount, and in conjunction with the propagation type flags to
              recursively change the propagation type of all of the mounts
              in a subtree.  See below for further details.

       MS_RELATIME (since Linux 2.6.20)
              When a file on this filesystem is accessed, update the file's
              last access time (atime) only if the current value of atime is
              less than or equal to the file's last modification time
              (mtime) or last status change time (ctime).  This option is
              useful for programs, such as mutt(1), that need to know when a
              file has been read since it was last modified.  Since Linux
              2.6.30, the kernel defaults to the behavior provided by this
              flag (unless MS_NOATIME was specified), and the MS_STRICTATIME
              flag is required to obtain traditional semantics.  In
              addition, since Linux 2.6.30, the file's last access time is
              always updated if it is more than 1 day old.

       MS_SILENT (since Linux 2.6.17)
              Suppress the display of certain (printk()) warning messages in
              the kernel log.  This flag supersedes the misnamed and
              obsolete MS_VERBOSE flag (available since Linux 2.4.12), which
              has the same meaning.

       MS_STRICTATIME (since Linux 2.6.30)
              Always update the last access time (atime) when files on this
              filesystem are accessed.  (This was the default behavior
              before Linux 2.6.30.)  Specifying this flag overrides the
              effect of setting the MS_NOATIME and MS_RELATIME flags.

       MS_SYNCHRONOUS
              Make writes on this filesystem synchronous (as though the
              O_SYNC flag to open(2) was specified for all file opens to
              this filesystem).

       From Linux 2.4 onward, the MS_NODEV, MS_NOEXEC, and MS_NOSUID flags
       are settable on a per-mount-point basis.  From kernel 2.6.16 onward,
       MS_NOATIME and MS_NODIRATIME are also settable on a per-mount-point
       basis.  The MS_RELATIME flag is also settable on a per-mount-point
       basis.
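
       The MS_RELATIME rule described above (including the one-day
       exception added in Linux 2.6.30) can be expressed as a small
       predicate.  This is a model of the documented behavior, not the
       kernel's implementation; the function name is hypothetical and
       timestamps are taken as whole seconds.

```c
#include <time.h>

#define ONE_DAY_SECONDS (24 * 60 * 60)

/* Nonzero if an access at time now should update atime under the
   documented MS_RELATIME rule. */
int relatime_needs_update(time_t atime, time_t mtime, time_t chtime,
                          time_t now)
{
    /* atime is not newer than the last modification or status change. */
    if (atime <= mtime || atime <= chtime)
        return 1;
    /* atime is more than one day old. */
    if (now - atime > ONE_DAY_SECONDS)
        return 1;
    return 0;
}
```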

   Remounting an existing mount
       An existing mount may be remounted by specifying MS_REMOUNT in
       mountflags.  This allows you to change the mountflags and data of an
       existing mount without having to unmount and remount the filesystem.
       target should be the same value specified in the initial mount()
       call.

       The source and filesystemtype arguments are ignored.

       The mountflags and data arguments should match the values used in the
       original mount() call, except for those parameters that are being
       deliberately changed.

       The following mountflags can be changed: MS_LAZYTIME, MS_MANDLOCK,
       MS_NOATIME, MS_NODEV, MS_NODIRATIME, MS_NOEXEC, MS_NOSUID,
       MS_RELATIME, MS_RDONLY, and MS_SYNCHRONOUS.  Attempts to change the
       setting of the MS_DIRSYNC flag during a remount are silently ignored.

       Since Linux 3.17, if none of MS_NOATIME, MS_NODIRATIME, MS_RELATIME,
       or MS_STRICTATIME is specified in mountflags, then the remount
       operation preserves the existing values of these flags (rather than
       defaulting to MS_RELATIME).

       Since Linux 2.6.26, this flag can also be used to make an existing
       bind mount read-only by specifying mountflags as:

           MS_REMOUNT | MS_BIND | MS_RDONLY

       Note that only the MS_RDONLY setting of the bind mount can be changed
       in this manner.
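
       A read-only bind mount therefore takes two calls: one to create the
       bind mount and one to remount it read-only.  The sketch below
       assumes Linux 2.6.26 or later; the function name is hypothetical,
       and NULL is passed for the arguments that the text above describes
       as ignored.

```c
#include <stddef.h>
#include <sys/mount.h>

/* Bind source onto target, then make that bind mount read-only.
   Returns 0 on success, -1 with errno set on failure.  If the second
   step fails, the (writable) bind mount is left in place. */
int bind_mount_readonly(const char *source, const char *target)
{
    /* Step 1: create the bind mount. */
    if (mount(source, target, NULL, MS_BIND, NULL) == -1)
        return -1;

    /* Step 2: change only the MS_RDONLY setting of the bind mount. */
    if (mount(NULL, target, NULL,
              MS_REMOUNT | MS_BIND | MS_RDONLY, NULL) == -1)
        return -1;

    return 0;
}
```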

   Creating a bind mount
       If mountflags includes MS_BIND (available since Linux 2.4), then
       perform a bind mount.  A bind mount makes a file or a directory
       subtree visible at another point within the single directory
       hierarchy.  Bind mounts may cross filesystem boundaries and span
       chroot(2) jails.

       The filesystemtype and data arguments are ignored.

       The remaining bits in the mountflags argument are also ignored, with
       the exception of MS_REC.  (The bind mount has the same mount options
       as the underlying mount point.)  However, see the discussion of
       remounting above, for a method of making an existing bind mount read-
       only.

       By default, when a directory is bind mounted, only that directory is
       mounted; if there are any submounts under the directory tree, they
       are not bind mounted.  If the MS_REC flag is also specified, then a
       recursive bind mount operation is performed: all submounts under the
       source subtree (other than unbindable mounts) are also bind mounted
       at the corresponding location in the target subtree.

   Changing the propagation type of an existing mount
       If mountflags includes one of MS_SHARED, MS_PRIVATE, MS_SLAVE, or
       MS_UNBINDABLE (all available since Linux 2.6.15), then the
       propagation type of an existing mount is changed.  If more than one
       of these flags is specified, an error results.

       The only flags that can be used with changing the propagation type
       are MS_REC and MS_SILENT.

       The source, filesystemtype, and data arguments are ignored.

       The meanings of the propagation type flags are as follows:

       MS_SHARED
              Make this mount point shared.  Mount and unmount events
              immediately under this mount point will propagate to the other
              mount points that are members of this mount's peer group.
              Propagation here means that the same mount or unmount will
              automatically occur under all of the other mount points in the
              peer group.  Conversely, mount and unmount events that take
              place under peer mount points will propagate to this mount
              point.

       MS_PRIVATE
              Make this mount point private.  Mount and unmount events do
              not propagate into or out of this mount point.  This is the
              default propagation type for newly created mount points.

       MS_SLAVE
              If this is a shared mount point that is a member of a peer
              group that contains other members, convert it to a slave
              mount.  If this is a shared mount point that is a member of a
              peer group that contains no other members, convert it to a
              private mount.  Otherwise, the propagation type of the mount
              point is left unchanged.

              When a mount point is a slave, mount and unmount events
              propagate into this mount point from the (master) shared peer
              group of which it was formerly a member.  Mount and unmount
              events under this mount point do not propagate to any peer.

              A mount point can be the slave of another peer group while at
              the same time sharing mount and unmount events with a peer
              group of which it is a member.

       MS_UNBINDABLE
              Make this mount unbindable.  This is like a private mount, and
              in addition this mount can't be bind mounted.  When a
              recursive bind mount (mount(2) with the MS_BIND and MS_REC
              flags) is performed on a directory subtree, any unbindable
              mounts within the subtree are automatically pruned (i.e., not
              replicated) when replicating that subtree to produce the
              target subtree.

       By default, changing the propagation type affects only the target
       mount point.  If the MS_REC flag is also specified in mountflags,
       then the propagation type of all mount points under target is also
       changed.

       For further details regarding mount propagation types, see
       mount_namespaces(7).
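
       The constraints on a propagation-type change (exactly one type flag,
       optionally combined with MS_REC and MS_SILENT) can be checked before
       calling mount().  This is a sketch with hypothetical helper names;
       the kernel performs its own validation regardless.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mount.h>

#define PROPAGATION_FLAGS (MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE)

/* Nonzero if mountflags holds exactly one propagation type flag and no
   flags other than MS_REC and MS_SILENT besides it. */
int propagation_flags_ok(unsigned long mountflags)
{
    unsigned long type = mountflags & PROPAGATION_FLAGS;
    unsigned long other = mountflags &
        ~(unsigned long)(PROPAGATION_FLAGS | MS_REC | MS_SILENT);

    /* Exactly one bit set: nonzero and a power of two. */
    if (type == 0 || (type & (type - 1)) != 0)
        return 0;
    return other == 0;
}

/* Change the propagation type of the mount at target.
   Returns 0 on success, -1 with errno set on failure. */
int set_propagation(const char *target, unsigned long mountflags)
{
    if (!propagation_flags_ok(mountflags)) {
        errno = EINVAL;
        return -1;
    }
    /* source, filesystemtype, and data are ignored for this operation. */
    return mount(NULL, target, NULL, mountflags, NULL);
}
```

       For example, set_propagation("/", MS_REC | MS_PRIVATE) would request
       that every mount under / become private.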

   Moving a mount
       If mountflags contains the flag MS_MOVE (available since Linux
       2.4.18), then move a subtree: source specifies an existing mount
       point and target specifies the new location to which that mount point
       is to be relocated.  The move is atomic: at no point is the subtree
       unmounted.

       The remaining bits in the mountflags argument are ignored, as are the
       filesystemtype and data arguments.

   Creating a new mount point
       If none of MS_REMOUNT, MS_BIND, MS_MOVE, MS_SHARED, MS_PRIVATE,
       MS_SLAVE, or MS_UNBINDABLE is specified in mountflags, then mount()
       performs its default action: creating a new mount point.  source
       specifies the source for the new mount point, and target specifies
       the directory at which to create the mount point.

       The filesystemtype and data arguments are employed, and further bits
       may be specified in mountflags to modify the behavior of the call.
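
       As an illustration of the default action, the sketch below creates a
       new tmpfs mount.  The function name is hypothetical, the source
       argument is a dummy string, and the contents of the options string
       (for example "size=16m") are an assumption about what tmpfs accepts;
       see mount(8) for the options each filesystem type understands.

```c
#include <sys/mount.h>

/* Mount a fresh tmpfs on target with restrictive flags.  options is the
   comma-separated data string passed through to the filesystem.
   Returns 0 on success, -1 with errno set on failure. */
int mount_tmpfs(const char *target, const char *options)
{
    return mount("tmpfs", target, "tmpfs",
                 MS_NOSUID | MS_NODEV | MS_NOEXEC, options);
}
```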
http://man7.org/linux/man-pages/man2/umount2.2.html
11
SYSTEM CALL:
umount(2) - Linux manual page
FUNCTIONALITY:

       umount, umount2 - unmount filesystem
SYNOPSIS:

       #include <sys/mount.h>

       int umount(const char *target);

       int umount2(const char *target, int flags);
DESCRIPTION

       umount() and umount2() remove the attachment of the (topmost)
       filesystem mounted on target.

       Appropriate privilege (Linux: the CAP_SYS_ADMIN capability) is
       required to unmount filesystems.

       Linux 2.1.116 added the umount2() system call, which, like umount(),
       unmounts a target, but allows additional flags controlling the
       behavior of the operation:

       MNT_FORCE (since Linux 2.1.116)
              Force unmount even if busy.  This can cause data loss.  (Only
              for NFS mounts.)

       MNT_DETACH (since Linux 2.4.11)
              Perform a lazy unmount: make the mount point unavailable for
              new accesses, immediately disconnect the filesystem and all
              filesystems mounted below it from each other and from the
              mount table, and actually perform the unmount when the mount
              point ceases to be busy.

       MNT_EXPIRE (since Linux 2.6.8)
              Mark the mount point as expired.  If a mount point is not
              currently in use, then an initial call to umount2() with this
              flag fails with the error EAGAIN, but marks the mount point as
              expired.  The mount point remains expired as long as it isn't
              accessed by any process.  A second umount2() call specifying
              MNT_EXPIRE unmounts an expired mount point.  This flag cannot
              be specified with either MNT_FORCE or MNT_DETACH.

       UMOUNT_NOFOLLOW (since Linux 2.6.34)
              Don't dereference target if it is a symbolic link.  This flag
              allows security problems to be avoided in set-user-ID-root
              programs that allow unprivileged users to unmount filesystems.
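
       The two-call MNT_EXPIRE protocol described above can be sketched
       as follows.  The helper name unmount_if_idle() is an assumption
       made for this example, and the caller needs the privilege
       mentioned earlier for either call to succeed.

```c
#include <errno.h>
#include <sys/mount.h>

/*
 * Hypothetical helper: unmount `target` only if it is not in use.
 * Returns 0 if the filesystem was unmounted, -1 otherwise (errno set).
 * UMOUNT_NOFOLLOW prevents a symbolic link at `target` from being
 * dereferenced.
 */
int unmount_if_idle(const char *target)
{
    int flags = MNT_EXPIRE | UMOUNT_NOFOLLOW;

    /* First call: on an unused mount point this fails with EAGAIN
       but marks the mount point as expired. */
    if (umount2(target, flags) == 0)
        return 0;       /* mount point was already expired */
    if (errno != EAGAIN)
        return -1;      /* in use, not permitted, bad path, ... */

    /* Second call: unmounts the mount point if no process accessed
       it since it was marked as expired. */
    return umount2(target, flags);
}
```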
http://man7.org/linux/man-pages/man2/nfsservctl.2.html
7
SYSTEM CALL:
nfsservctl(2) - Linux manual page
FUNCTIONALITY:

       nfsservctl - syscall interface to kernel nfs daemon
SYNOPSIS:

       #include <linux/nfsd/syscall.h>

       long nfsservctl(int cmd, struct nfsctl_arg *argp,
                       union nfsctl_res *resp);
DESCRIPTION

       Note: Since Linux 3.1, this system call no longer exists.  It has
       been replaced by a set of files in the nfsd filesystem; see nfsd(7).

       /*
        * These are the commands understood by nfsctl().
        */
       #define NFSCTL_SVC          0    /* This is a server process. */
       #define NFSCTL_ADDCLIENT    1    /* Add an NFS client. */
       #define NFSCTL_DELCLIENT    2    /* Remove an NFS client. */
       #define NFSCTL_EXPORT       3    /* Export a filesystem. */
       #define NFSCTL_UNEXPORT     4    /* Unexport a filesystem. */
       #define NFSCTL_UGIDUPDATE   5    /* Update a client's UID/GID map
                                           (only in Linux 2.4.x and earlier). */
       #define NFSCTL_GETFH        6    /* Get a file handle (used by mountd)
                                           (only in Linux 2.4.x and earlier). */

       struct nfsctl_arg {
           int                       ca_version;     /* safeguard */
           union {
               struct nfsctl_svc     u_svc;
               struct nfsctl_client  u_client;
               struct nfsctl_export  u_export;
               struct nfsctl_uidmap  u_umap;
               struct nfsctl_fhparm  u_getfh;
               unsigned int          u_debug;
           } u;
       };

       union nfsctl_res {
               struct knfs_fh          cr_getfh;
               unsigned int            cr_debug;
       };
http://man7.org/linux/man-pages/man2/ustat.2.html
10
SYSTEM CALL:
ustat(2) - Linux manual page
FUNCTIONALITY:

       ustat - get filesystem statistics
SYNOPSIS:

       #include <sys/types.h>
       #include <unistd.h>    /* libc[45] */
       #include <ustat.h>     /* glibc2 */

       int ustat(dev_t dev, struct ustat *ubuf);
DESCRIPTION

       ustat() returns information about a mounted filesystem.  dev is a
       device number identifying a device containing a mounted filesystem.
       ubuf is a pointer to a ustat structure that contains the following
       members:

           daddr_t f_tfree;      /* Total free blocks */
           ino_t   f_tinode;     /* Number of free inodes */
           char    f_fname[6];   /* Filesystem name */
           char    f_fpack[6];   /* Filesystem pack name */

       The last two fields, f_fname and f_fpack, are not implemented and
       will always be filled with null bytes ('\0').
http://man7.org/linux/man-pages/man2/statfs.2.html
11
SYSTEM CALL:
statfs(2) - Linux manual page
FUNCTIONALITY:

       statfs, fstatfs - get filesystem statistics
SYNOPSIS:

       #include <sys/vfs.h>    /* or <sys/statfs.h> */

       int statfs(const char *path, struct statfs *buf);
       int fstatfs(int fd, struct statfs *buf);
DESCRIPTION

       The statfs() system call returns information about a mounted
       filesystem.  path is the pathname of any file within the mounted
       filesystem.  buf is a pointer to a statfs structure defined
       approximately as follows:

           struct statfs {
               __fsword_t f_type;    /* Type of filesystem (see below) */
               __fsword_t f_bsize;   /* Optimal transfer block size */
               fsblkcnt_t f_blocks;  /* Total data blocks in filesystem */
               fsblkcnt_t f_bfree;   /* Free blocks in filesystem */
               fsblkcnt_t f_bavail;  /* Free blocks available to
                                        unprivileged user */
               fsfilcnt_t f_files;   /* Total file nodes in filesystem */
               fsfilcnt_t f_ffree;   /* Free file nodes in filesystem */
               fsid_t     f_fsid;    /* Filesystem ID */
               __fsword_t f_namelen; /* Maximum length of filenames */
               __fsword_t f_frsize;  /* Fragment size (since Linux 2.6) */
               __fsword_t f_flags;   /* Mount flags of filesystem
                                        (since Linux 2.6.36) */
               __fsword_t f_spare[xxx];
                               /* Padding bytes reserved for future use */
           };

           Filesystem types:

              ADFS_SUPER_MAGIC      0xadf5
              AFFS_SUPER_MAGIC      0xadff
              BDEVFS_MAGIC          0x62646576
              BEFS_SUPER_MAGIC      0x42465331
              BFS_MAGIC             0x1badface
              BINFMTFS_MAGIC        0x42494e4d
              BTRFS_SUPER_MAGIC     0x9123683e
              CGROUP_SUPER_MAGIC    0x27e0eb
              CIFS_MAGIC_NUMBER     0xff534d42
              CODA_SUPER_MAGIC      0x73757245
              COH_SUPER_MAGIC       0x012ff7b7
              CRAMFS_MAGIC          0x28cd3d45
              DEBUGFS_MAGIC         0x64626720
              DEVFS_SUPER_MAGIC     0x1373
              DEVPTS_SUPER_MAGIC    0x1cd1
              EFIVARFS_MAGIC        0xde5e81e4
              EFS_SUPER_MAGIC       0x00414a53
              EXT_SUPER_MAGIC       0x137d
              EXT2_OLD_SUPER_MAGIC  0xef51
              EXT2_SUPER_MAGIC      0xef53
              EXT3_SUPER_MAGIC      0xef53
              EXT4_SUPER_MAGIC      0xef53
              FUSE_SUPER_MAGIC      0x65735546
              FUTEXFS_SUPER_MAGIC   0xbad1dea
              HFS_SUPER_MAGIC       0x4244
              HOSTFS_SUPER_MAGIC    0x00c0ffee
              HPFS_SUPER_MAGIC      0xf995e849
              HUGETLBFS_MAGIC       0x958458f6
              ISOFS_SUPER_MAGIC     0x9660
              JFFS2_SUPER_MAGIC     0x72b6
              JFS_SUPER_MAGIC       0x3153464a
              MINIX_SUPER_MAGIC     0x137f /* orig. minix */
              MINIX_SUPER_MAGIC2    0x138f /* 30 char minix */
              MINIX2_SUPER_MAGIC    0x2468 /* minix V2 */
              MINIX2_SUPER_MAGIC2   0x2478 /* minix V2, 30 char names */
              MINIX3_SUPER_MAGIC    0x4d5a /* minix V3 fs, 60 char names */
              MQUEUE_MAGIC          0x19800202
              MSDOS_SUPER_MAGIC     0x4d44
              NCP_SUPER_MAGIC       0x564c
              NFS_SUPER_MAGIC       0x6969
              NILFS_SUPER_MAGIC     0x3434
              NTFS_SB_MAGIC         0x5346544e
              OCFS2_SUPER_MAGIC     0x7461636f
              OPENPROM_SUPER_MAGIC  0x9fa1
              PIPEFS_MAGIC          0x50495045
              PROC_SUPER_MAGIC      0x9fa0
              PSTOREFS_MAGIC        0x6165676c
              QNX4_SUPER_MAGIC      0x002f
              QNX6_SUPER_MAGIC      0x68191122
              RAMFS_MAGIC           0x858458f6
              REISERFS_SUPER_MAGIC  0x52654973
              ROMFS_MAGIC           0x7275
              SELINUX_MAGIC         0xf97cff8c
              SMACK_MAGIC           0x43415d53
              SMB_SUPER_MAGIC       0x517b
              SOCKFS_MAGIC          0x534f434b
              SQUASHFS_MAGIC        0x73717368
              SYSFS_MAGIC           0x62656572
              SYSV2_SUPER_MAGIC     0x012ff7b6
              SYSV4_SUPER_MAGIC     0x012ff7b5
              TMPFS_MAGIC           0x01021994
              UDF_SUPER_MAGIC       0x15013346
              UFS_MAGIC             0x00011954
              USBDEVICE_SUPER_MAGIC 0x9fa2
              V9FS_MAGIC            0x01021997
              VXFS_SUPER_MAGIC      0xa501fcf5
              XENFS_SUPER_MAGIC     0xabba1974
              XENIX_SUPER_MAGIC     0x012ff7b4
              XFS_SUPER_MAGIC       0x58465342
              _XIAFS_SUPER_MAGIC    0x012fd16d

       Most of these MAGIC constants are defined in
       /usr/include/linux/magic.h, and some are hardcoded in kernel sources.

       The f_flags field is a bit mask indicating mount options for the
       filesystem.  It contains zero or more of the following bits:

       ST_MANDLOCK
              Mandatory locking is permitted on the filesystem (see
              fcntl(2)).

       ST_NOATIME
              Do not update access times; see mount(2).

       ST_NODEV
              Disallow access to device special files on this filesystem.

       ST_NODIRATIME
              Do not update directory access times; see mount(2).

       ST_NOEXEC
              Execution of programs is disallowed on this filesystem.

       ST_NOSUID
              The set-user-ID and set-group-ID bits are ignored by exec(3)
              for executable files on this filesystem.

       ST_RDONLY
              This filesystem is mounted read-only.

       ST_RELATIME
              Update atime relative to mtime/ctime; see mount(2).

       ST_SYNCHRONOUS
              Writes are synched to the filesystem immediately (see the
              description of O_SYNC in open(2)).

       Nobody knows what f_fsid is supposed to contain (but see below).

       Fields that are undefined for a particular filesystem are set to 0.

       fstatfs() returns the same information about an open file referenced
       by descriptor fd.
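
       As a minimal sketch of how the fields above might be used, the
       following helpers compute the space available to unprivileged
       users and test the ST_RDONLY bit.  The helper names are
       assumptions made for this example, as is the choice of f_frsize
       (falling back to f_bsize when it is 0) as the unit for the block
       counts; ST_RDONLY is taken from <sys/statvfs.h> here.

```c
#include <sys/statvfs.h>   /* ST_RDONLY */
#include <sys/vfs.h>       /* statfs(), struct statfs */

/*
 * Hypothetical helper: store in *bytes the space available to
 * unprivileged users on the filesystem containing `path`.
 * Returns 0 on success, -1 on error (errno set by statfs()).
 */
int fs_available_bytes(const char *path, unsigned long long *bytes)
{
    struct statfs buf;
    unsigned long long unit;

    if (statfs(path, &buf) == -1)
        return -1;

    /* Assumption: block counts are expressed in units of f_frsize;
       fall back to f_bsize if the filesystem reports f_frsize as 0. */
    unit = buf.f_frsize ? (unsigned long long)buf.f_frsize
                        : (unsigned long long)buf.f_bsize;
    *bytes = (unsigned long long)buf.f_bavail * unit;
    return 0;
}

/* Hypothetical helper: 1 if the filesystem containing `path` is
   mounted read-only, 0 if it is not, -1 on error. */
int fs_is_read_only(const char *path)
{
    struct statfs buf;

    if (statfs(path, &buf) == -1)
        return -1;
    return (buf.f_flags & ST_RDONLY) != 0;
}
```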
http://man7.org/linux/man-pages/man2/fstatfs.2.html
11
SYSTEM CALL:
statfs(2) - Linux manual page
FUNCTIONALITY:

       statfs, fstatfs - get filesystem statistics
SYNOPSIS:

       #include <sys/vfs.h>    /* or <sys/statfs.h> */

       int statfs(const char *path, struct statfs *buf);
       int fstatfs(int fd, struct statfs *buf);
DESCRIPTION

       The statfs() system call returns information about a mounted
       filesystem.  path is the pathname of any file within the mounted
       filesystem.  buf is a pointer to a statfs structure defined
       approximately as follows:

           struct statfs {
               __fsword_t f_type;    /* Type of filesystem (see below) */
               __fsword_t f_bsize;   /* Optimal transfer block size */
               fsblkcnt_t f_blocks;  /* Total data blocks in filesystem */
               fsblkcnt_t f_bfree;   /* Free blocks in filesystem */
               fsblkcnt_t f_bavail;  /* Free blocks available to
                                        unprivileged user */
               fsfilcnt_t f_files;   /* Total file nodes in filesystem */
               fsfilcnt_t f_ffree;   /* Free file nodes in filesystem */
               fsid_t     f_fsid;    /* Filesystem ID */
               __fsword_t f_namelen; /* Maximum length of filenames */
               __fsword_t f_frsize;  /* Fragment size (since Linux 2.6) */
               __fsword_t f_flags;   /* Mount flags of filesystem
                                        (since Linux 2.6.36) */
               __fsword_t f_spare[xxx];
                               /* Padding bytes reserved for future use */
           };

           Filesystem types:

              ADFS_SUPER_MAGIC      0xadf5
              AFFS_SUPER_MAGIC      0xadff
              BDEVFS_MAGIC          0x62646576
              BEFS_SUPER_MAGIC      0x42465331
              BFS_MAGIC             0x1badface
              BINFMTFS_MAGIC        0x42494e4d
              BTRFS_SUPER_MAGIC     0x9123683e
              CGROUP_SUPER_MAGIC    0x27e0eb
              CIFS_MAGIC_NUMBER     0xff534d42
              CODA_SUPER_MAGIC      0x73757245
              COH_SUPER_MAGIC       0x012ff7b7
              CRAMFS_MAGIC          0x28cd3d45
              DEBUGFS_MAGIC         0x64626720
              DEVFS_SUPER_MAGIC     0x1373
              DEVPTS_SUPER_MAGIC    0x1cd1
              EFIVARFS_MAGIC        0xde5e81e4
              EFS_SUPER_MAGIC       0x00414a53
              EXT_SUPER_MAGIC       0x137d
              EXT2_OLD_SUPER_MAGIC  0xef51
              EXT2_SUPER_MAGIC      0xef53
              EXT3_SUPER_MAGIC      0xef53
              EXT4_SUPER_MAGIC      0xef53
              FUSE_SUPER_MAGIC      0x65735546
              FUTEXFS_SUPER_MAGIC   0xbad1dea
              HFS_SUPER_MAGIC       0x4244
              HOSTFS_SUPER_MAGIC    0x00c0ffee
              HPFS_SUPER_MAGIC      0xf995e849
              HUGETLBFS_MAGIC       0x958458f6
              ISOFS_SUPER_MAGIC     0x9660
              JFFS2_SUPER_MAGIC     0x72b6
              JFS_SUPER_MAGIC       0x3153464a
              MINIX_SUPER_MAGIC     0x137f /* orig. minix */
              MINIX_SUPER_MAGIC2    0x138f /* 30 char minix */
              MINIX2_SUPER_MAGIC    0x2468 /* minix V2 */
              MINIX2_SUPER_MAGIC2   0x2478 /* minix V2, 30 char names */
              MINIX3_SUPER_MAGIC    0x4d5a /* minix V3 fs, 60 char names */
              MQUEUE_MAGIC          0x19800202
              MSDOS_SUPER_MAGIC     0x4d44
              NCP_SUPER_MAGIC       0x564c
              NFS_SUPER_MAGIC       0x6969
              NILFS_SUPER_MAGIC     0x3434
              NTFS_SB_MAGIC         0x5346544e
              OCFS2_SUPER_MAGIC     0x7461636f
              OPENPROM_SUPER_MAGIC  0x9fa1
              PIPEFS_MAGIC          0x50495045
              PROC_SUPER_MAGIC      0x9fa0
              PSTOREFS_MAGIC        0x6165676c
              QNX4_SUPER_MAGIC      0x002f
              QNX6_SUPER_MAGIC      0x68191122
              RAMFS_MAGIC           0x858458f6
              REISERFS_SUPER_MAGIC  0x52654973
              ROMFS_MAGIC           0x7275
              SELINUX_MAGIC         0xf97cff8c
              SMACK_MAGIC           0x43415d53
              SMB_SUPER_MAGIC       0x517b
              SOCKFS_MAGIC          0x534f434b
              SQUASHFS_MAGIC        0x73717368
              SYSFS_MAGIC           0x62656572
              SYSV2_SUPER_MAGIC     0x012ff7b6
              SYSV4_SUPER_MAGIC     0x012ff7b5
              TMPFS_MAGIC           0x01021994
              UDF_SUPER_MAGIC       0x15013346
              UFS_MAGIC             0x00011954
              USBDEVICE_SUPER_MAGIC 0x9fa2
              V9FS_MAGIC            0x01021997
              VXFS_SUPER_MAGIC      0xa501fcf5
              XENFS_SUPER_MAGIC     0xabba1974
              XENIX_SUPER_MAGIC     0x012ff7b4
              XFS_SUPER_MAGIC       0x58465342
              _XIAFS_SUPER_MAGIC    0x012fd16d

       Most of these MAGIC constants are defined in
       /usr/include/linux/magic.h, and some are hardcoded in kernel sources.

       The f_flags field is a bit mask indicating mount options for the
       filesystem.  It contains zero or more of the following bits:

       ST_MANDLOCK
              Mandatory locking is permitted on the filesystem (see
              fcntl(2)).

       ST_NOATIME
              Do not update access times; see mount(2).

       ST_NODEV
              Disallow access to device special files on this filesystem.

       ST_NODIRATIME
              Do not update directory access times; see mount(2).

       ST_NOEXEC
              Execution of programs is disallowed on this filesystem.

       ST_NOSUID
              The set-user-ID and set-group-ID bits are ignored by exec(3)
              for executable files on this filesystem.

       ST_RDONLY
              This filesystem is mounted read-only.

       ST_RELATIME
              Update atime relative to mtime/ctime; see mount(2).

       ST_SYNCHRONOUS
              Writes are synched to the filesystem immediately (see the
              description of O_SYNC in open(2)).

       Nobody knows what f_fsid is supposed to contain (but see below).

       Fields that are undefined for a particular filesystem are set to 0.

       fstatfs() returns the same information about an open file referenced
       by descriptor fd.
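
       The following sketch shows one way fstatfs() might be used to
       compare the f_type of an open file against one of the MAGIC
       values listed above.  The helper name path_has_fs_type() is an
       assumption made for this example.  Because EXT2_SUPER_MAGIC,
       EXT3_SUPER_MAGIC, and EXT4_SUPER_MAGIC share the value 0xef53,
       such a check cannot tell those three filesystems apart.

```c
#include <fcntl.h>
#include <sys/vfs.h>
#include <unistd.h>

/*
 * Hypothetical helper: return 1 if `path` resides on a filesystem whose
 * f_type equals `magic`, 0 if it does not, and -1 if the file cannot be
 * opened or queried (errno is set by open() or fstatfs()).
 */
int path_has_fs_type(const char *path, unsigned long magic)
{
    struct statfs buf;
    int fd, rc;

    fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    rc = fstatfs(fd, &buf);     /* same data as statfs(), via the fd */
    close(fd);
    if (rc == -1)
        return -1;

    return (unsigned long)buf.f_type == magic;
}
```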
http://man7.org/linux/man-pages/man2/sysfs.2.html
10
SYSTEM CALL:
sysfs(2) - Linux manual page
FUNCTIONALITY:

       sysfs - get filesystem type information
SYNOPSIS:

       int sysfs(int option, const char *fsname);

       int sysfs(int option, unsigned int fs_index, char *buf);

       int sysfs(int option);
DESCRIPTION

       sysfs() returns information about the filesystem types currently
       present in the kernel.  The specific form of the sysfs() call and the
       information returned depend on the option in effect:

       1  Translate the filesystem identifier string fsname into a
          filesystem type index.

       2  Translate the filesystem type index fs_index into a null-
          terminated filesystem identifier string.  This string will be
          written to the buffer pointed to by buf.  Make sure that buf has
          enough space to accept the string.

       3  Return the total number of filesystem types currently present in
          the kernel.

       The numbering of the filesystem type indexes begins with zero.
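
       Options 2 and 3 can be combined to enumerate the filesystem
       types.  The sketch below assumes that no C library wrapper is
       available and therefore goes through syscall(2); the helper
       names are hypothetical, and on architectures where SYS_sysfs is
       not defined the helpers simply fail with ENOSYS.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper for option 3: number of filesystem types
   currently present in the kernel, or -1 on error. */
long fs_type_count(void)
{
#ifdef SYS_sysfs
    return syscall(SYS_sysfs, 3);
#else
    errno = ENOSYS;
    return -1;
#endif
}

/*
 * Hypothetical helper for option 2: write the null-terminated name of
 * filesystem type `fs_index` into `buf`.  The call takes no length
 * argument, so the caller must supply a buffer large enough for the
 * name (256 bytes is assumed to be ample in the usage note below).
 * Returns 0 on success, -1 on error.
 */
long fs_type_name(unsigned int fs_index, char *buf)
{
#ifdef SYS_sysfs
    return syscall(SYS_sysfs, 2, fs_index, buf);
#else
    (void)fs_index;
    (void)buf;
    errno = ENOSYS;
    return -1;
#endif
}
```

       A caller would loop over the indexes 0 to fs_type_count() - 1,
       passing a char name[256] buffer to fs_type_name() each time.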
http://man7.org/linux/man-pages/man2/_sysctl.2.html
12
SYSTEM CALL:
sysctl(2) - Linux manual page
FUNCTIONALITY:

       sysctl - read/write system parameters
SYNOPSIS:

       #include <unistd.h>
       #include <linux/sysctl.h>

       int _sysctl(struct __sysctl_args *args);

       Note: There is no glibc wrapper for this system call; see NOTES.
DESCRIPTION

       Do not use this system call!  See NOTES.

       The _sysctl() call reads and/or writes kernel parameters, for
       example, the hostname or the maximum number of open files.  The
       argument has the form

           struct __sysctl_args {
               int    *name;    /* integer vector describing variable */
               int     nlen;    /* length of this vector */
               void   *oldval;  /* 0 or address where to store old value */
               size_t *oldlenp; /* available room for old value,
                                   overwritten by actual size of old value */
               void   *newval;  /* 0 or address of new value */
               size_t  newlen;  /* size of new value */
           };

       This call does a search in a tree structure, possibly resembling a
       directory tree under /proc/sys, and if the requested item is found
       calls some appropriate routine to read or modify the value.
http://man7.org/linux/man-pages/man2/syslog.2.html
10
SYSTEM CALL:
syslog(2) - Linux manual page
FUNCTIONALITY:

       syslog, klogctl - read and/or clear kernel message ring buffer; set
       console_loglevel
SYNOPSIS:

       int syslog(int type, char *bufp, int len);
                       /* No wrapper provided in glibc */

       /* The glibc interface */
       #include <sys/klog.h>

       int klogctl(int type, char *bufp, int len);
DESCRIPTION

       Note: Probably, you are looking for the C library function syslog(),
       which talks to syslogd(8); see syslog(3) for details.

       This page describes the kernel syslog() system call, which is used to
       control the kernel printk() buffer; the glibc wrapper function for
       the system call is called klogctl().

   The kernel log buffer
       The kernel has a cyclic buffer of length LOG_BUF_LEN in which
       messages given as arguments to the kernel function printk() are
       stored (regardless of their log level).  In early kernels,
       LOG_BUF_LEN had the value 4096; from kernel 1.3.54, it was 8192; from
       kernel 2.1.113, it was 16384; since kernel 2.4.23/2.6, the value is a
       kernel configuration option (CONFIG_LOG_BUF_SHIFT, default value
       dependent on the architecture).  Since Linux 2.6.6, the size can be
       queried with command type 10 (see below).

   Commands
       The type argument determines the action taken by this function.  The
       list below specifies the values for type.  The symbolic names are
       defined in the kernel source, but are not exported to user space; you
       will either need to use the numbers, or define the names yourself.

       SYSLOG_ACTION_CLOSE (0)
              Close the log.  Currently a NOP.

       SYSLOG_ACTION_OPEN (1)
              Open the log.  Currently a NOP.

       SYSLOG_ACTION_READ (2)
              Read from the log.  The call waits until the kernel log buffer
              is nonempty, and then reads at most len bytes into the buffer
              pointed to by bufp.  The call returns the number of bytes
              read.  Bytes read from the log disappear from the log buffer:
              the information can be read only once.  This is the function
              executed by the kernel when a user program reads /proc/kmsg.

       SYSLOG_ACTION_READ_ALL (3)
              Read all messages remaining in the ring buffer, placing them
              in the buffer pointed to by bufp.  The call reads the last len
              bytes from the log buffer (nondestructively), but will not
              read more than was written into the buffer since the last
              "clear ring buffer" command (see command 5 below).  The call
              returns the number of bytes read.

       SYSLOG_ACTION_READ_CLEAR (4)
              Read and clear all messages remaining in the ring buffer.  The
              call does precisely the same as for a type of 3, but also
              executes the "clear ring buffer" command.

       SYSLOG_ACTION_CLEAR (5)
              The call executes just the "clear ring buffer" command.  The
              bufp and len arguments are ignored.

              This command does not really clear the ring buffer.  Rather,
              it sets a kernel bookkeeping variable that determines the
              results returned by commands 3 (SYSLOG_ACTION_READ_ALL) and 4
              (SYSLOG_ACTION_READ_CLEAR).  This command has no effect on
              commands 2 (SYSLOG_ACTION_READ) and 9
              (SYSLOG_ACTION_SIZE_UNREAD).

       SYSLOG_ACTION_CONSOLE_OFF (6)
              The command saves the current value of console_loglevel and
              then sets console_loglevel to minimum_console_loglevel, so
              that no messages are printed to the console.  Before Linux
              2.6.32, the command simply sets console_loglevel to
              minimum_console_loglevel.  See the discussion of
              /proc/sys/kernel/printk, below.

              The bufp and len arguments are ignored.

       SYSLOG_ACTION_CONSOLE_ON (7)
              If a previous SYSLOG_ACTION_CONSOLE_OFF command has been
              performed, this command restores console_loglevel to the value
              that was saved by that command.  Before Linux 2.6.32, this
              command simply sets console_loglevel to
              default_console_loglevel.  See the discussion of
              /proc/sys/kernel/printk, below.

              The bufp and len arguments are ignored.

       SYSLOG_ACTION_CONSOLE_LEVEL (8)
              The call sets console_loglevel to the value given in len,
              which must be an integer between 1 and 8 (inclusive).  The
              kernel silently enforces a minimum value of
              minimum_console_loglevel for len.  See the log level section
              for details.  The bufp argument is ignored.

       SYSLOG_ACTION_SIZE_UNREAD (9) (since Linux 2.4.10)
              The call returns the number of bytes currently available to be
              read from the kernel log buffer via command 2
              (SYSLOG_ACTION_READ).  The bufp and len arguments are ignored.

       SYSLOG_ACTION_SIZE_BUFFER (10) (since Linux 2.6.6)
              This command returns the total size of the kernel log buffer.
              The bufp and len arguments are ignored.

       All commands except 3 and 10 require privilege.  In Linux kernels
       before 2.6.37, command types 3 and 10 are allowed to unprivileged
       processes; since Linux 2.6.37, these commands are allowed to
       unprivileged processes only if /proc/sys/kernel/dmesg_restrict has
       the value 0.  Before Linux 2.6.37, "privileged" means that the caller
       has the CAP_SYS_ADMIN capability.  Since Linux 2.6.37, "privileged"
       means that the caller has either the CAP_SYS_ADMIN capability (now
       deprecated for this purpose) or the (new) CAP_SYSLOG capability.
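
       A minimal sketch of commands 10 and 3 used together through the
       glibc klogctl() wrapper follows.  Since the symbolic names are
       not exported to user space, the example defines them itself; the
       helper names are assumptions made for this example.  Both
       helpers fail if the caller lacks the privilege described above.

```c
#include <stdlib.h>
#include <sys/klog.h>

/* Not exported to user space, so defined here from the list above. */
#define SYSLOG_ACTION_READ_ALL     3
#define SYSLOG_ACTION_SIZE_BUFFER 10

/* Hypothetical helper: total size of the kernel log buffer, or -1. */
int kernel_log_buffer_size(void)
{
    return klogctl(SYSLOG_ACTION_SIZE_BUFFER, NULL, 0);
}

/*
 * Hypothetical helper: return a malloc()ed, null-terminated copy of the
 * messages remaining in the ring buffer, read nondestructively.
 * Returns NULL on failure; the caller frees the result.
 */
char *read_kernel_log(void)
{
    int size, n;
    char *buf;

    size = kernel_log_buffer_size();
    if (size <= 0)
        return NULL;

    buf = malloc((size_t)size + 1);
    if (buf == NULL)
        return NULL;

    n = klogctl(SYSLOG_ACTION_READ_ALL, buf, size);
    if (n < 0) {
        free(buf);
        return NULL;
    }
    buf[n] = '\0';      /* n is the number of bytes read */
    return buf;
}
```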

   /proc/sys/kernel/printk
       /proc/sys/kernel/printk is a writable file containing four integer
       values that influence kernel printk() behavior when printing or
       logging error messages.  The four values are:

       console_loglevel
              Only messages with a log level lower than this value will be
              printed to the console.  The default value for this field is
              DEFAULT_CONSOLE_LOGLEVEL (7), but it is set to 4 if the kernel
              command line contains the word "quiet", 10 if the kernel
              command line contains the word "debug", and to 15 in case of a
              kernel fault (the 10 and 15 are just silly, and equivalent to
              8).  The value of console_loglevel can be set (to a value in
              the range 1-8) by a syslog() call with a type of 8.

       default_message_loglevel
              This value will be used as the log level for printk() messages
              that do not have an explicit level.  Up to and including Linux
              2.6.38, the hard-coded default value for this field was 4
              (KERN_WARNING); since Linux 2.6.39, the default value is
              defined by the kernel configuration option
              CONFIG_DEFAULT_MESSAGE_LOGLEVEL, which defaults to 4.

       minimum_console_loglevel
              The value in this field is the minimum value to which
              console_loglevel can be set.

       default_console_loglevel
              This is the default value for console_loglevel.

   The log level
       Every printk() message has its own log level.  If the log level is
       not explicitly specified as part of the message, it defaults to
       default_message_loglevel.  The conventional meaning of the log level
       is as follows:

       Kernel constant   Level value   Meaning
       KERN_EMERG             0        System is unusable
       KERN_ALERT             1        Action must be taken immediately
       KERN_CRIT              2        Critical conditions
       KERN_ERR               3        Error conditions
       KERN_WARNING           4        Warning conditions
       KERN_NOTICE            5        Normal but significant condition
       KERN_INFO              6        Informational
       KERN_DEBUG             7        Debug-level messages

       The kernel printk() routine will print a message on the console only
       if it has a log level less than the value of console_loglevel.
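
       The relationship between the four values in
       /proc/sys/kernel/printk and the console rule above can be
       sketched as follows.  The structure and helper names are
       assumptions made for this example; the field order follows the
       order in which the four values are listed above.

```c
#include <stdio.h>

/* Hypothetical container for the four values, in file order. */
struct printk_levels {
    int console_loglevel;
    int default_message_loglevel;
    int minimum_console_loglevel;
    int default_console_loglevel;
};

/* Parse text in the format of /proc/sys/kernel/printk (four integers
   separated by white space).  Returns 0 on success, -1 otherwise. */
int parse_printk_levels(const char *text, struct printk_levels *out)
{
    int n = sscanf(text, "%d %d %d %d",
                   &out->console_loglevel,
                   &out->default_message_loglevel,
                   &out->minimum_console_loglevel,
                   &out->default_console_loglevel);
    return n == 4 ? 0 : -1;
}

/* A message is printed on the console only if its log level is less
   than console_loglevel.  Returns 1 if it is printed, 0 if not. */
int reaches_console(int msg_level, const struct printk_levels *lv)
{
    return msg_level < lv->console_loglevel;
}
```

       With a console_loglevel of 4, for example, a KERN_ERR (3)
       message reaches the console but a KERN_WARNING (4) message does
       not.  A caller would typically obtain the text by reading one
       line from /proc/sys/kernel/printk with fopen(3) and fgets(3).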
http://man7.org/linux/man-pages/man2/ioperm.2.html
10
SYSTEM CALL:
ioperm(2) - Linux manual page
FUNCTIONALITY:

       ioperm - set port input/output permissions
SYNOPSIS:

       #include <sys/io.h> /* for glibc */

       int ioperm(unsigned long from, unsigned long num, int turn_on);
DESCRIPTION

       ioperm() sets the port access permission bits for the calling thread
       for num bits starting from port address from.  If turn_on is nonzero,
       then permission for the specified bits is enabled; otherwise it is
       disabled.  If turn_on is nonzero, the calling thread must be
       privileged (CAP_SYS_RAWIO).

       Before Linux 2.6.8, only the first 0x3ff I/O ports could be specified
       in this manner.  For more ports, the iopl(2) system call had to be
       used (with a level argument of 3).  Since Linux 2.6.8, 65,536 I/O
       ports can be specified.

       Permissions are inherited by the child created by fork(2) (but see
       NOTES).  Permissions are preserved across execve(2); this is useful
       for giving port access permissions to unprivileged programs.

       This call is mostly for the i386 architecture.  On many other
       architectures it does not exist or will always return an error.
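
       Because the call is architecture-specific, portable code may
       want to guard it.  The following sketch wraps ioperm() so that
       it still compiles elsewhere; the helper name set_port_range()
       and the choice of preprocessor tests are assumptions made for
       this example.

```c
#include <errno.h>

#if defined(__i386__) || defined(__x86_64__)
#include <sys/io.h>
#endif

/*
 * Hypothetical helper: enable (turn_on != 0) or disable (turn_on == 0)
 * access to `num` ports starting at port address `from` for the calling
 * thread.  Enabling requires CAP_SYS_RAWIO.
 * Returns 0 on success, -1 on error (errno set).
 */
int set_port_range(unsigned long from, unsigned long num, int turn_on)
{
#if defined(__i386__) || defined(__x86_64__)
    return ioperm(from, num, turn_on);
#else
    (void)from;
    (void)num;
    (void)turn_on;
    errno = ENOSYS;     /* no ioperm() assumed on this architecture */
    return -1;
#endif
}
```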
http://man7.org/linux/man-pages/man2/iopl.2.html
10
SYSTEM CALL:
iopl(2) - Linux manual page
FUNCTIONALITY:

       iopl - change I/O privilege level
SYNOPSIS:

       #include <sys/io.h>

       int iopl(int level);
DESCRIPTION

       iopl() changes the I/O privilege level of the calling process, as
       specified by the two least significant bits in level.

       This call is necessary to allow 8514-compatible X servers to run
       under Linux.  Since these X servers require access to all 65536 I/O
       ports, the ioperm(2) call is not sufficient.

       In addition to granting unrestricted I/O port access, running at a
       higher I/O privilege level also allows the process to disable
       interrupts.  This will probably crash the system, and is not
       recommended.

       Permissions are not inherited by the child process created by fork(2)
       and are not preserved across execve(2) (but see NOTES).

       The I/O privilege level for a normal process is 0.

       This call is mostly for the i386 architecture.  On many other
       architectures it does not exist or will always return an error.
http://man7.org/linux/man-pages/man2/personality.2.html
10
SYSTEM CALL:
personality(2) - Linux manual page
FUNCTIONALITY:

       personality - set the process execution domain
SYNOPSIS:

       #include <sys/personality.h>

       int personality(unsigned long persona);
DESCRIPTION

       Linux supports different execution domains, or personalities, for
       each process.  Among other things, execution domains tell Linux how
       to map signal numbers into signal actions.  The execution domain
       system allows Linux to provide limited support for binaries compiled
       under other UNIX-like operating systems.

       If persona is not 0xffffffff, then personality() sets the caller's
       execution domain to the value specified by persona.  Specifying
       persona as 0xffffffff provides a way of retrieving the current
       persona without changing it.

       A list of the available execution domains can be found in
       <sys/personality.h>.  The execution domain is a 32-bit value in which
       the top three bytes are set aside for flags that cause the kernel to
       modify the behavior of certain system calls so as to emulate
       historical or architectural quirks.  The least significant byte is a
       value defining the personality the kernel should assume.  The flag
       values are as follows:

       ADDR_COMPAT_LAYOUT (since Linux 2.6.9)
              With this flag set, provide legacy virtual address space
              layout.

       ADDR_NO_RANDOMIZE (since Linux 2.6.12)
              With this flag set, disable address-space-layout
              randomization.

       ADDR_LIMIT_32BIT (since Linux 2.2)
              Limit the address space to 32 bits.

       ADDR_LIMIT_3GB (since Linux 2.4.0)
              With this flag set, use 0xc0000000 as the offset at which to
              search a virtual memory chunk on mmap(2); otherwise use
              0xffffe000.

       FDPIC_FUNCPTRS (since Linux 2.6.11)
              User-space function pointers to signal handlers point (on
              certain architectures) to descriptors.

       MMAP_PAGE_ZERO (since Linux 2.4.0)
              Map page 0 as read-only (to support binaries that depend on
              this SVr4 behavior).

       READ_IMPLIES_EXEC (since Linux 2.6.8)
              With this flag set, PROT_READ implies PROT_EXEC for mmap(2).

       SHORT_INODE (since Linux 2.4.0)
              No effects(?).

       STICKY_TIMEOUTS (since Linux 1.2.0)
              With this flag set, select(2), pselect(2), and ppoll(2) do not
              modify the returned timeout argument when interrupted by a
              signal handler.

       UNAME26 (since Linux 3.1)
              Have uname(2) report a 2.6.40+ version number rather than a
              3.x version number.  Added as a stopgap measure to support
              broken applications that could not handle the kernel version-
              numbering switch from 2.6.x to 3.x.

       WHOLE_SECONDS (since Linux 1.2.0)
              No effects(?).
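
       The layout described above (flags in the top three bytes,
       execution domain in the least significant byte) can be sketched as
       a few bit operations.  This is only an illustration: the helper
       names are hypothetical, and the flag mask 0xffffff00 is written out
       by hand from the description; PER_MASK and the flag constants come
       from <sys/personality.h>.

```c
#include <assert.h>
#include <sys/personality.h>

/* Execution domain: the least significant byte of the persona. */
unsigned long persona_domain(unsigned long persona)
{
    return persona & PER_MASK;
}

/* Flag bits: the top three bytes of the 32-bit value
   (mask written by hand for this sketch). */
unsigned long persona_flags(unsigned long persona)
{
    return persona & 0xffffff00UL;
}

/* Nonzero if address-space-layout randomization is disabled. */
int persona_has_no_aslr(unsigned long persona)
{
    return (persona & ADDR_NO_RANDOMIZE) != 0;
}

/* Retrieve the current persona without changing it; -1 on error. */
int persona_query(void)
{
    return personality(0xffffffff);
}
```

       A caller might pass the result of persona_query() to
       persona_domain() and persona_flags() after checking it for -1.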

       The available execution domains are:

       PER_BSD (since Linux 1.2.0)
              BSD. (No effects.)

       PER_HPUX (since Linux 2.4)
              Support for 32-bit HP/UX.  This support was never complete,
              and was dropped so that since Linux 4.0, this value has no
              effect.

       PER_IRIX32 (since Linux 2.2)
              IRIX 5 32-bit.  Never fully functional; support dropped in
              Linux 2.6.27.  Implies STICKY_TIMEOUTS.

       PER_IRIX64 (since Linux 2.2)
              IRIX 6 64-bit.  Implies STICKY_TIMEOUTS; otherwise no effects.

       PER_IRIXN32 (since Linux 2.2)
              IRIX 6 new 32-bit.  Implies STICKY_TIMEOUTS; otherwise no
              effects.

       PER_ISCR4 (since Linux 1.2.0)
              Implies STICKY_TIMEOUTS; otherwise no effects.

       PER_LINUX (since Linux 1.2.0)
              Linux.

       PER_LINUX32 (since Linux 2.2)
              [To be documented.]

       PER_LINUX32_3GB (since Linux 2.4)
              Implies ADDR_LIMIT_3GB.

       PER_LINUX_32BIT (since Linux 2.0)
              Implies ADDR_LIMIT_32BIT.

       PER_LINUX_FDPIC (since Linux 2.6.11)
              Implies FDPIC_FUNCPTRS.

       PER_OSF4 (since Linux 2.4)
              OSF/1 v4.  On alpha, clear top 32 bits of iov_len in the
              user's buffer for compatibility with old versions of OSF/1
              where iov_len was defined as int.

       PER_OSR5 (since Linux 2.4)
              Implies STICKY_TIMEOUTS and WHOLE_SECONDS; otherwise no
              effects.

       PER_RISCOS (since Linux 2.2)
              [To be documented.]

       PER_SCOSVR3 (since Linux 1.2.0)
              Implies STICKY_TIMEOUTS, WHOLE_SECONDS, and SHORT_INODE;
              otherwise no effects.

       PER_SOLARIS (since Linux 2.4)
              Implies STICKY_TIMEOUTS; otherwise no effects.

       PER_SUNOS (since Linux 2.4.0)
              Implies STICKY_TIMEOUTS.  Divert library and dynamic linker
              searches to /usr/gnemul.  Buggy, largely unmaintained, and
              almost entirely unused; support was removed in Linux 2.6.26.

       PER_SVR3 (since Linux 1.2.0)
              Implies STICKY_TIMEOUTS and SHORT_INODE; otherwise no effects.

       PER_SVR4 (since Linux 1.2.0)
              Implies STICKY_TIMEOUTS and MMAP_PAGE_ZERO; otherwise no
              effects.

       PER_UW7 (since Linux 2.4)
              Implies STICKY_TIMEOUTS and MMAP_PAGE_ZERO; otherwise no
              effects.

       PER_WYSEV386 (since Linux 1.2.0)
              Implies STICKY_TIMEOUTS and SHORT_INODE; otherwise no effects.

       PER_XENIX (since Linux 1.2.0)
              Implies STICKY_TIMEOUTS and SHORT_INODE; otherwise no effects.
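
       In the header, "implies" means that the PER_* constant already has
       the flag bits ORed in, so the relationships listed above can be
       checked with a mask test.  A minimal sketch, assuming the glibc
       definitions in <sys/personality.h>; persona_implies() is a
       hypothetical helper, not part of any library.

```c
#include <assert.h>
#include <sys/personality.h>

/* Nonzero if every bit of flag is set in the given PER_* value. */
int persona_implies(unsigned long persona, unsigned long flag)
{
    return (persona & flag) == flag;
}
```

       For example, persona_implies(PER_SVR4, MMAP_PAGE_ZERO) is expected
       to be nonzero, while persona_implies(PER_LINUX, STICKY_TIMEOUTS) is
       expected to be zero.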
http://man7.org/linux/man-pages/man2/vhangup.2.html
9
SYSTEM CALL:
vhangup(2) - Linux manual page
FUNCTIONALITY:

       vhangup - virtually hangup the current terminal
SYNOPSIS:

       #include <unistd.h>

       int vhangup(void);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       vhangup():
           Since glibc 2.21:
               _DEFAULT_SOURCE
           In glibc 2.19 and 2.20:
               _DEFAULT_SOURCE || (_XOPEN_SOURCE && _XOPEN_SOURCE < 500)
           Up to and including glibc 2.19:
               _BSD_SOURCE || (_XOPEN_SOURCE && _XOPEN_SOURCE < 500)
DESCRIPTION

       vhangup() simulates a hangup on the current terminal.  This call
       arranges for other users to have a “clean” terminal at login time.
http://man7.org/linux/man-pages/man2/reboot.2.html
9
SYSTEM CALL:
reboot(2) - Linux manual page
FUNCTIONALITY:

       reboot - reboot or enable/disable Ctrl-Alt-Del
SYNOPSIS:

       /* For libc4 and libc5 the library call and the system call
          are identical, and since kernel version 2.1.30 there are
          symbolic names LINUX_REBOOT_* for the constants and a
          fourth argument to the call: */

       #include <unistd.h>
       #include <linux/reboot.h>

       int reboot(int magic, int magic2, int cmd, void *arg);

       /* Under glibc and most alternative libc's (including uclibc,
          dietlibc, musl and a few others), some of the constants involved
          have gotten symbolic names RB_*, and the library call is a
          1-argument wrapper around the 4-argument system call: */

       #include <unistd.h>
       #include <sys/reboot.h>

       int reboot(int cmd);
DESCRIPTION

       The reboot() call reboots the system, or enables/disables the reboot
       keystroke (abbreviated CAD, since the default is Ctrl-Alt-Delete; it
       can be changed using loadkeys(1)).

       This system call will fail (with EINVAL) unless magic equals
       LINUX_REBOOT_MAGIC1 (that is, 0xfee1dead) and magic2 equals
       LINUX_REBOOT_MAGIC2 (that is, 672274793).  The following values are
       also permitted for magic2: LINUX_REBOOT_MAGIC2A (that is, 85072278)
       since 2.1.17, LINUX_REBOOT_MAGIC2B (that is, 369367448) since 2.1.97,
       and LINUX_REBOOT_MAGIC2C (that is, 537993216) since 2.5.71.  (The
       hexadecimal values of these constants are meaningful.)
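
       The magic2 values are given in decimal above; printing them in
       hexadecimal shows digits that read like dates in day-month-year
       order.  The following sketch does that and also mirrors the
       acceptance rule for magic2.  The function names are hypothetical;
       the constants come from <linux/reboot.h>.

```c
#include <assert.h>
#include <stdio.h>
#include <linux/reboot.h>

/* Nonzero if magic2 is one of the values listed as permitted. */
int reboot_magic2_ok(unsigned int magic2)
{
    return magic2 == LINUX_REBOOT_MAGIC2  ||
           magic2 == LINUX_REBOOT_MAGIC2A ||
           magic2 == LINUX_REBOOT_MAGIC2B ||
           magic2 == LINUX_REBOOT_MAGIC2C;
}

/* Print each magic2 constant in decimal and in hexadecimal. */
void reboot_print_magic2(void)
{
    const unsigned int magics[] = {
        LINUX_REBOOT_MAGIC2,  LINUX_REBOOT_MAGIC2A,
        LINUX_REBOOT_MAGIC2B, LINUX_REBOOT_MAGIC2C,
    };
    size_t i;

    for (i = 0; i < sizeof(magics) / sizeof(magics[0]); i++)
        printf("%10u = 0x%08x\n", magics[i], magics[i]);
}
```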

       The cmd argument can have the following values:

       LINUX_REBOOT_CMD_CAD_OFF
              (RB_DISABLE_CAD, 0).  CAD is disabled.  This means that the
              CAD keystroke will cause a SIGINT signal to be sent to init
              (process 1), whereupon this process may decide upon a proper
              action (maybe: kill all processes, sync, reboot).

       LINUX_REBOOT_CMD_CAD_ON
              (RB_ENABLE_CAD, 0x89abcdef).  CAD is enabled.  This means that
              the CAD keystroke will immediately cause the action associated
              with LINUX_REBOOT_CMD_RESTART.

       LINUX_REBOOT_CMD_HALT
              (RB_HALT_SYSTEM, 0xcdef0123; since Linux 1.1.76).  The message
              "System halted." is printed, and the system is halted.
              Control is given to the ROM monitor, if there is one.  If not
              preceded by a sync(2), data will be lost.

       LINUX_REBOOT_CMD_KEXEC
              (RB_KEXEC, 0x45584543; since Linux 2.6.13).  Execute a kernel
              that has been loaded earlier with kexec_load(2).  This option
              is available only if the kernel was configured with
              CONFIG_KEXEC.

       LINUX_REBOOT_CMD_POWER_OFF
              (RB_POWER_OFF, 0x4321fedc; since Linux 2.1.30).  The message
              "Power down." is printed, the system is stopped, and all power
              is removed from the system, if possible.  If not preceded by a
              sync(2), data will be lost.

       LINUX_REBOOT_CMD_RESTART
              (RB_AUTOBOOT, 0x1234567).  The message "Restarting system." is
              printed, and a default restart is performed immediately.  If
              not preceded by a sync(2), data will be lost.

       LINUX_REBOOT_CMD_RESTART2
              (0xa1b2c3d4; since Linux 2.1.30).  The message "Restarting
              system with command '%s'" is printed, and a restart (using the
              command string given in arg) is performed immediately.  If not
              preceded by a sync(2), data will be lost.

       LINUX_REBOOT_CMD_SW_SUSPEND
              (RB_SW_SUSPEND, 0xd000fce1; since Linux 2.5.18).  The system
              is suspended (hibernated) to disk.  This option is available
              only if the kernel was configured with CONFIG_HIBERNATION.

       Only the superuser may call reboot().
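
       As an illustration of the two calling conventions from the
       SYNOPSIS, the sketch below powers the system off once through the
       1-argument library wrapper and once through the raw system call,
       calling sync(2) first as the command descriptions advise.  The
       function names are hypothetical, and the use of syscall(2) with
       SYS_reboot is an assumption of this sketch.  Do not call these
       functions on a machine that should keep running.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <unistd.h>
#include <sys/reboot.h>
#include <sys/syscall.h>
#include <linux/reboot.h>

/* Library-wrapper form: flush data, then request power-off. */
int power_off_via_wrapper(void)
{
    sync();
    return reboot(RB_POWER_OFF);
}

/* Raw system call form: both magic numbers are passed explicitly,
   and arg is unused for this command. */
long power_off_via_syscall(void)
{
    sync();
    return syscall(SYS_reboot, LINUX_REBOOT_MAGIC1, LINUX_REBOOT_MAGIC2,
                   LINUX_REBOOT_CMD_POWER_OFF, NULL);
}
```

       Both functions return only on failure (for example, with EPERM for
       an unprivileged caller).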

       The precise effect of the above actions depends on the architecture.
       For the i386 architecture, the additional argument does not do
       anything at present (2.1.122), but the type of reboot can be
       determined by kernel command-line arguments ("reboot=...") to be
       either warm or cold, and either hard or through the BIOS.

   Behavior inside PID namespaces
       Since Linux 3.4, when reboot() is called from a PID namespace (see
       pid_namespaces(7)) other than the initial PID namespace, the effect
       of the call is to send a signal to the namespace "init" process.
       LINUX_REBOOT_CMD_RESTART and LINUX_REBOOT_CMD_RESTART2 cause a SIGHUP
       signal to be sent.  LINUX_REBOOT_CMD_POWER_OFF and
       LINUX_REBOOT_CMD_HALT cause a SIGINT signal to be sent.
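
       The mapping from cmd to the signal delivered to the namespace
       "init" process can be written as a small lookup.  This is a sketch
       of the rule as stated above, not kernel code;
       pidns_reboot_signal() is a hypothetical name, and commands the
       text does not mention are reported as 0.

```c
#include <assert.h>
#include <signal.h>
#include <linux/reboot.h>

/* Signal sent to the namespace "init" for a given cmd,
   or 0 if the text above does not cover that cmd. */
int pidns_reboot_signal(unsigned int cmd)
{
    switch (cmd) {
    case LINUX_REBOOT_CMD_RESTART:
    case LINUX_REBOOT_CMD_RESTART2:
        return SIGHUP;
    case LINUX_REBOOT_CMD_POWER_OFF:
    case LINUX_REBOOT_CMD_HALT:
        return SIGINT;
    default:
        return 0;
    }
}
```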
http://man7.org/linux/man-pages/man2/kexec_load.2.html
11
SYSTEM CALL:
kexec_load(2) - Linux manual page
FUNCTIONALITY:

       kexec_load, kexec_file_load - load a new kernel for later execution
SYNOPSIS:

       #include <linux/kexec.h>

       long kexec_load(unsigned long entry, unsigned long nr_segments,
                       struct kexec_segment *segments, unsigned long flags);

       long kexec_file_load(int kernel_fd, int initrd_fd,
                           unsigned long cmdline_len, const char *cmdline,
                           unsigned long flags);

       Note: There are no glibc wrappers for these system calls; see NOTES.
DESCRIPTION

       The kexec_load() system call loads a new kernel that can be executed
       later by reboot(2).

       The flags argument is a bit mask that controls the operation of the
       call.  The following values can be specified in flags:

       KEXEC_ON_CRASH (since Linux 2.6.13)
              Execute the new kernel automatically on a system crash.  This
              "crash kernel" is loaded into an area of reserved memory that
              is determined at boot time using the crashkernel kernel
              command-line parameter.  The location of this reserved memory
              is exported to user space via the /proc/iomem file, in an
              entry labeled "Crash kernel".  A user-space application can
              parse this file and prepare a list of segments (see below)
              that specify this reserved memory as destination.  If this
              flag is specified, the kernel checks that the target segments
              specified in segments fall within the reserved region.

       KEXEC_PRESERVE_CONTEXT (since Linux 2.6.27)
              Preserve the system hardware and software states before
              executing the new kernel.  This could be used for system
              suspend.  This flag is available only if the kernel was
              configured with CONFIG_KEXEC_JUMP, and is effective only if
              nr_segments is greater than 0.

       The high-order bits (corresponding to the mask 0xffff0000) of flags
       contain the architecture of the to-be-executed kernel.  Specify (OR)
       the constant KEXEC_ARCH_DEFAULT to use the current architecture, or
       one of the following architecture constants: KEXEC_ARCH_386,
       KEXEC_ARCH_68K, KEXEC_ARCH_X86_64, KEXEC_ARCH_PPC, KEXEC_ARCH_PPC64,
       KEXEC_ARCH_IA_64, KEXEC_ARCH_ARM, KEXEC_ARCH_S390, KEXEC_ARCH_SH,
       KEXEC_ARCH_MIPS, and KEXEC_ARCH_MIPS_LE.  The architecture must be
       executable on the CPU of the system.

       The entry argument is the physical entry address in the kernel image.
       The nr_segments argument is the number of segments pointed to by the
       segments pointer; the kernel imposes an (arbitrary) limit of 16 on
       the number of segments.  The segments argument is an array of
       kexec_segment structures which define the kernel layout:

           struct kexec_segment {
               void   *buf;        /* Buffer in user space */
               size_t  bufsz;      /* Buffer length in user space */
               void   *mem;        /* Physical address of kernel */
               size_t  memsz;      /* Physical address length */
           };

       The kernel image defined by segments is copied from the calling
       process into the kernel either in regular memory or in reserved
       memory (if KEXEC_ON_CRASH is set).  The kernel first performs various
       sanity checks on the information passed in segments.  If these checks
       pass, the kernel copies the segment data to kernel memory.  Each
       segment specified in segments is copied as follows:

       *  buf and bufsz identify a memory region in the caller's virtual
          address space that is the source of the copy.  The value in bufsz
          may not exceed the value in the memsz field.

       *  mem and memsz specify a physical address range that is the target
          of the copy.  The values specified in both fields must be
          multiples of the system page size.

       *  bufsz bytes are copied from the source buffer to the target kernel
          buffer.  If bufsz is less than memsz, then the excess bytes in the
          kernel buffer are zeroed out.

       In case of a normal kexec (i.e., the KEXEC_ON_CRASH flag is not set),
       the segment data is loaded in any available memory and is moved to
       the final destination at kexec reboot time (e.g., when the kexec(8)
       command is executed with the -e option).

       In case of kexec on panic (i.e., the KEXEC_ON_CRASH flag is set), the
       segment data is loaded to reserved memory at the time of the call,
       and, after a crash, the kexec mechanism simply passes control to that
       kernel.

       The kexec_load() system call is available only if the kernel was
       configured with CONFIG_KEXEC.

   kexec_file_load()
       The kexec_file_load() system call is similar to kexec_load(), but it
       takes a different set of arguments.  It reads the kernel to be loaded
       from the file referred to by the file descriptor kernel_fd, and the
       initrd (initial RAM disk) to be loaded from the file referred to by
       the file descriptor initrd_fd.  The cmdline argument is a pointer to
       a buffer containing the command line for the new kernel.  The
       cmdline_len argument specifies the size of the buffer.  The last byte
       in the buffer must be a null byte ('\0').
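
       Since the last byte of the buffer must be a null byte, cmdline_len
       for an ordinary C string is its length plus one.  The sketch below
       shows that calculation together with a raw invocation through
       syscall(2), since there is no glibc wrapper.  Both function names
       are hypothetical, and the raw call is compiled only where
       SYS_kexec_file_load is defined.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Buffer size to pass as cmdline_len: the string plus its
   terminating null byte. */
unsigned long kexec_cmdline_len(const char *cmdline)
{
    return strlen(cmdline) + 1;
}

#ifdef SYS_kexec_file_load
/* Raw system call; returns -1 with errno set on failure. */
long kexec_file_load_raw(int kernel_fd, int initrd_fd,
                         const char *cmdline, unsigned long flags)
{
    return syscall(SYS_kexec_file_load, kernel_fd, initrd_fd,
                   kexec_cmdline_len(cmdline), cmdline, flags);
}
#endif
```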

       The flags argument is a bit mask which modifies the behavior of the
       call.  The following values can be specified in flags:

       KEXEC_FILE_UNLOAD
              Unload the currently loaded kernel.

       KEXEC_FILE_ON_CRASH
              Load the new kernel in the memory region reserved for the
              crash kernel (as for KEXEC_ON_CRASH).  This kernel is booted
              if the currently running kernel crashes.

       KEXEC_FILE_NO_INITRAMFS
              Loading initrd/initramfs is optional.  Specify this flag if no
              initramfs is being loaded.  If this flag is set, the value
              passed in initrd_fd is ignored.

       The kexec_file_load() system call was added to provide support for
       systems where "kexec" loading should be restricted to only kernels
       that are signed.  This system call is available only if the kernel
       was configured with CONFIG_KEXEC_FILE.
http://man7.org/linux/man-pages/man2/kexec_file_load.2.html
11
SYSTEM CALL:
kexec_load(2) - Linux manual page
FUNCTIONALITY:

       kexec_load, kexec_file_load - load a new kernel for later execution
SYNOPSIS:

       #include <linux/kexec.h>

       long kexec_load(unsigned long entry, unsigned long nr_segments,
                       struct kexec_segment *segments, unsigned long flags);

       long kexec_file_load(int kernel_fd, int initrd_fd,
                           unsigned long cmdline_len, const char *cmdline,
                           unsigned long flags);

       Note: There are no glibc wrappers for these system calls; see NOTES.
DESCRIPTION

       The kexec_load() system call loads a new kernel that can be executed
       later by reboot(2).

       The flags argument is a bit mask that controls the operation of the
       call.  The following values can be specified in flags:

       KEXEC_ON_CRASH (since Linux 2.6.13)
              Execute the new kernel automatically on a system crash.  This
              "crash kernel" is loaded into an area of reserved memory that
              is determined at boot time using the crashkernel kernel
              command-line parameter.  The location of this reserved memory
              is exported to user space via the /proc/iomem file, in an
              entry labeled "Crash kernel".  A user-space application can
              parse this file and prepare a list of segments (see below)
              that specify this reserved memory as destination.  If this
              flag is specified, the kernel checks that the target segments
              specified in segments fall within the reserved region.

       KEXEC_PRESERVE_CONTEXT (since Linux 2.6.27)
              Preserve the system hardware and software states before
              executing the new kernel.  This could be used for system
              suspend.  This flag is available only if the kernel was
              configured with CONFIG_KEXEC_JUMP, and is effective only if
              nr_segments is greater than 0.

       The high-order bits (corresponding to the mask 0xffff0000) of flags
       contain the architecture of the to-be-executed kernel.  Specify (OR)
       the constant KEXEC_ARCH_DEFAULT to use the current architecture, or
       one of the following architecture constants: KEXEC_ARCH_386,
       KEXEC_ARCH_68K, KEXEC_ARCH_X86_64, KEXEC_ARCH_PPC, KEXEC_ARCH_PPC64,
       KEXEC_ARCH_IA_64, KEXEC_ARCH_ARM, KEXEC_ARCH_S390, KEXEC_ARCH_SH,
       KEXEC_ARCH_MIPS, and KEXEC_ARCH_MIPS_LE.  The architecture must be
       executable on the CPU of the system.

       The entry argument is the physical entry address in the kernel image.
       The nr_segments argument is the number of segments pointed to by the
       segments pointer; the kernel imposes an (arbitrary) limit of 16 on
       the number of segments.  The segments argument is an array of
       kexec_segment structures which define the kernel layout:

           struct kexec_segment {
               void   *buf;        /* Buffer in user space */
               size_t  bufsz;      /* Buffer length in user space */
               void   *mem;        /* Physical address of kernel */
               size_t  memsz;      /* Physical address length */
           };

       The kernel image defined by segments is copied from the calling
       process into the kernel either in regular memory or in reserved
       memory (if KEXEC_ON_CRASH is set).  The kernel first performs various
       sanity checks on the information passed in segments.  If these checks
       pass, the kernel copies the segment data to kernel memory.  Each
       segment specified in segments is copied as follows:

       *  buf and bufsz identify a memory region in the caller's virtual
          address space that is the source of the copy.  The value in bufsz
          may not exceed the value in the memsz field.

       *  mem and memsz specify a physical address range that is the target
          of the copy.  The values specified in both fields must be
          multiples of the system page size.

       *  bufsz bytes are copied from the source buffer to the target kernel
          buffer.  If bufsz is less than memsz, then the excess bytes in the
          kernel buffer are zeroed out.

       In case of a normal kexec (i.e., the KEXEC_ON_CRASH flag is not set),
       the segment data is loaded in any available memory and is moved to
       the final destination at kexec reboot time (e.g., when the kexec(8)
       command is executed with the -e option).

       In case of kexec on panic (i.e., the KEXEC_ON_CRASH flag is set), the
       segment data is loaded to reserved memory at the time of the call,
       and, after a crash, the kexec mechanism simply passes control to that
       kernel.

       The kexec_load() system call is available only if the kernel was
       configured with CONFIG_KEXEC.

   kexec_file_load()
       The kexec_file_load() system call is similar to kexec_load(), but it
       takes a different set of arguments.  It reads the kernel to be loaded
       from the file referred to by the file descriptor kernel_fd, and the
       initrd (initial RAM disk) to be loaded from the file referred to by
       the file descriptor initrd_fd.  The cmdline argument is a pointer to
       a buffer containing the command line for the new kernel.  The
       cmdline_len argument specifies the size of the buffer.  The last byte
       in the buffer must be a null byte ('\0').

       The flags argument is a bit mask which modifies the behavior of the
       call.  The following values can be specified in flags:

       KEXEC_FILE_UNLOAD
              Unload the currently loaded kernel.

       KEXEC_FILE_ON_CRASH
              Load the new kernel in the memory region reserved for the
              crash kernel (as for KEXEC_ON_CRASH).  This kernel is booted
              if the currently running kernel crashes.

       KEXEC_FILE_NO_INITRAMFS
              Loading initrd/initramfs is optional.  Specify this flag if no
              initramfs is being loaded.  If this flag is set, the value
              passed in initrd_fd is ignored.

       The kexec_file_load() system call was added to provide support for
       systems where "kexec" loading should be restricted to only kernels
       that are signed.  This system call is available only if the kernel
       was configured with CONFIG_KEXEC_FILE.
http://man7.org/linux/man-pages/man2/perf_event_open.2.html
13
SYSTEM CALL:
perf_event_open(2) - Linux manual page
FUNCTIONALITY:

       perf_event_open - set up performance monitoring
SYNOPSIS:

       #include <linux/perf_event.h>
       #include <linux/hw_breakpoint.h>

       int perf_event_open(struct perf_event_attr *attr,
                           pid_t pid, int cpu, int group_fd,
                           unsigned long flags);

       Note: There is no glibc wrapper for this system call; see NOTES.
DESCRIPTION

       Given a list of parameters, perf_event_open() returns a file
       descriptor, for use in subsequent system calls (read(2), mmap(2),
       prctl(2), fcntl(2), etc.).

       A call to perf_event_open() creates a file descriptor that allows
       measuring performance information.  Each file descriptor corresponds
       to one event that is measured; these can be grouped together to
       measure multiple events simultaneously.

       Events can be enabled and disabled in two ways: via ioctl(2) and via
       prctl(2).  When an event is disabled it does not count or generate
       overflows but does continue to exist and maintain its count value.

       Events come in two flavors: counting and sampled.  A counting event
       is one that is used for counting the aggregate number of events that
       occur.  In general, counting event results are gathered with a
       read(2) call.  A sampling event periodically writes measurements to a
       buffer that can then be accessed via mmap(2).

   Arguments
       The pid and cpu arguments allow specifying which process and CPU to
       monitor:

       pid == 0 and cpu == -1
              This measures the calling process/thread on any CPU.

       pid == 0 and cpu >= 0
              This measures the calling process/thread only when running on
              the specified CPU.

       pid > 0 and cpu == -1
              This measures the specified process/thread on any CPU.

       pid > 0 and cpu >= 0
              This measures the specified process/thread only when running
              on the specified CPU.

       pid == -1 and cpu >= 0
              This measures all processes/threads on the specified CPU.
              This requires CAP_SYS_ADMIN capability or a
              /proc/sys/kernel/perf_event_paranoid value of less than 1.

       pid == -1 and cpu == -1
              This setting is invalid and will return an error.

       When pid is greater than zero, permission to perform this system call
       is governed by a ptrace access mode PTRACE_MODE_READ_REALCREDS check;
       see ptrace(2).
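
       The pid and cpu combinations above can be summarized in code.
       This sketch encodes only what the list states (the pair -1, -1 is
       invalid, and pid == -1 with cpu >= 0 is system-wide) and adds a
       thin raw invocation through syscall(2), since there is no glibc
       wrapper.  All three function names are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Zero only for the combination documented as invalid. */
int perf_target_valid(pid_t pid, int cpu)
{
    return !(pid == -1 && cpu == -1);
}

/* Nonzero if the combination measures all processes on one CPU
   (the case that needs extra privileges). */
int perf_target_systemwide(pid_t pid, int cpu)
{
    return pid == -1 && cpu >= 0;
}

/* Raw system call; returns a file descriptor, or -1 on error. */
long perf_event_open_raw(struct perf_event_attr *attr, pid_t pid,
                         int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}
```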

       The group_fd argument allows event groups to be created.  An event
       group has one event which is the group leader.  The leader is created
       first, with group_fd = -1.  The rest of the group members are created
       with subsequent perf_event_open() calls with group_fd being set to
       the file descriptor of the group leader.  (A single event on its own
       is created with group_fd = -1 and is considered to be a group with
       only 1 member.)  An event group is scheduled onto the CPU as a unit:
       it will be put onto the CPU only if all of the events in the group
       can be put onto the CPU.  This means that the values of the member
       events can be meaningfully compared—added, divided (to get ratios),
       and so on—with each other, since they have counted events for the
       same set of executed instructions.
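
       Creating a group as described above might look like the following
       sketch: the leader is opened with group_fd = -1 and the second
       event with group_fd set to the leader's file descriptor.  The
       helper names are hypothetical, and the choice of events
       (PERF_COUNT_HW_CPU_CYCLES and PERF_COUNT_HW_INSTRUCTIONS from
       <linux/perf_event.h>) is an assumption made for illustration;
       whether the calls succeed depends on the hardware and on
       perf_event_paranoid.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Fill in a minimal attribute structure for a hardware counting
   event; all other fields are left zeroed. */
void perf_counting_attr(struct perf_event_attr *attr, __u64 config)
{
    memset(attr, 0, sizeof(*attr));
    attr->type = PERF_TYPE_HARDWARE;
    attr->size = sizeof(*attr);
    attr->config = config;
}

/* Open a leader and one member for the calling thread on any CPU.
   Returns 0 and fills fds[] on success, -1 on failure. */
int perf_open_pair(int fds[2])
{
    struct perf_event_attr attr;

    perf_counting_attr(&attr, PERF_COUNT_HW_CPU_CYCLES);
    fds[0] = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
    if (fds[0] == -1)
        return -1;

    /* The member joins the group by naming the leader's descriptor. */
    perf_counting_attr(&attr, PERF_COUNT_HW_INSTRUCTIONS);
    fds[1] = syscall(SYS_perf_event_open, &attr, 0, -1, fds[0], 0UL);
    if (fds[1] == -1) {
        close(fds[0]);
        return -1;
    }
    return 0;
}
```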

       The flags argument is formed by ORing together zero or more of the
       following values:

       PERF_FLAG_FD_CLOEXEC (since Linux 3.14)
              This flag enables the close-on-exec flag for the created event
              file descriptor, so that the file descriptor is automatically
              closed on execve(2).  Setting the close-on-exec flag at
              creation time, rather than later with fcntl(2), avoids
              potential race conditions where the calling thread invokes
              perf_event_open() and fcntl(2) at the same time as another
              thread calls fork(2) then execve(2).

       PERF_FLAG_FD_NO_GROUP
              This flag tells the event to ignore the group_fd parameter
              except for the purpose of setting up output redirection using
              the PERF_FLAG_FD_OUTPUT flag.

       PERF_FLAG_FD_OUTPUT (broken since Linux 2.6.35)
              This flag re-routes the event's sampled output to instead be
              included in the mmap buffer of the event specified by
              group_fd.

       PERF_FLAG_PID_CGROUP (since Linux 2.6.39)
              This flag activates per-container system-wide monitoring.  A
              container is an abstraction that isolates a set of resources
              for finer-grained control (CPUs, memory, etc.).  In this mode,
              the event is measured only if the thread running on the
              monitored CPU belongs to the designated container (cgroup).
              The cgroup is identified by passing a file descriptor opened
              on its directory in the cgroupfs filesystem.  For instance, if
              the cgroup to monitor is called test, then a file descriptor
              opened on /dev/cgroup/test (assuming cgroupfs is mounted on
              /dev/cgroup) must be passed as the pid parameter.  cgroup
              monitoring is available only for system-wide events and may
              therefore require extra permissions.

       The perf_event_attr structure provides detailed configuration
       information for the event being created.

           struct perf_event_attr {
               __u32 type;         /* Type of event */
               __u32 size;         /* Size of attribute structure */
               __u64 config;       /* Type-specific configuration */

               union {
                   __u64 sample_period;    /* Period of sampling */
                   __u64 sample_freq;      /* Frequency of sampling */
               };

               __u64 sample_type;  /* Specifies values included in sample */
               __u64 read_format;  /* Specifies values returned in read */

               __u64 disabled       : 1,   /* off by default */
                     inherit        : 1,   /* children inherit it */
                     pinned         : 1,   /* must always be on PMU */
                     exclusive      : 1,   /* only group on PMU */
                     exclude_user   : 1,   /* don't count user */
                     exclude_kernel : 1,   /* don't count kernel */
                     exclude_hv     : 1,   /* don't count hypervisor */
                     exclude_idle   : 1,   /* don't count when idle */
                     mmap           : 1,   /* include mmap data */
                     comm           : 1,   /* include comm data */
                     freq           : 1,   /* use freq, not period */
                     inherit_stat   : 1,   /* per task counts */
                     enable_on_exec : 1,   /* next exec enables */
                     task           : 1,   /* trace fork/exit */
                     watermark      : 1,   /* wakeup_watermark */
                     precise_ip     : 2,   /* skid constraint */
                     mmap_data      : 1,   /* non-exec mmap data */
                     sample_id_all  : 1,   /* sample_type all events */
                     exclude_host   : 1,   /* don't count in host */
                     exclude_guest  : 1,   /* don't count in guest */
                     exclude_callchain_kernel : 1,
                                           /* exclude kernel callchains */
                     exclude_callchain_user   : 1,
                                           /* exclude user callchains */
                     mmap2          : 1,   /* include mmap with inode data */
                     comm_exec      : 1,   /* flag comm events that are due
                                              to exec */
                     use_clockid    : 1,   /* use clockid for time fields */

                     __reserved_1   : 38;

               union {
                   __u32 wakeup_events;    /* wakeup every n events */
                   __u32 wakeup_watermark; /* bytes before wakeup */
               };

               __u32     bp_type;          /* breakpoint type */

               union {
                   __u64 bp_addr;          /* breakpoint address */
                   __u64 config1;          /* extension of config */
               };

               union {
                   __u64 bp_len;           /* breakpoint length */
                   __u64 config2;          /* extension of config1 */
               };
               __u64 branch_sample_type;   /* enum perf_branch_sample_type */
               __u64 sample_regs_user;     /* user regs to dump on samples */
               __u32 sample_stack_user;    /* size of stack to dump on
                                              samples */
               __s32 clockid;              /* clock to use for time fields */
               __u64 sample_regs_intr;     /* regs to dump on samples */
               __u32 aux_watermark;        /* aux bytes before wakeup */
               __u32 __reserved_2;         /* align to u64 */

           };

       The fields of the perf_event_attr structure are described in more
       detail below:

       type   This field specifies the overall event type.  It has one of
              the following values:

              PERF_TYPE_HARDWARE
                     This indicates one of the "generalized" hardware events
                     provided by the kernel.  See the config field
                     definition for more details.

              PERF_TYPE_SOFTWARE
                     This indicates one of the software-defined events
                     provided by the kernel (even if no hardware support is
                     available).

              PERF_TYPE_TRACEPOINT
                     This indicates a tracepoint provided by the kernel
                     tracepoint infrastructure.

              PERF_TYPE_HW_CACHE
                     This indicates a hardware cache event.  This has a
                     special encoding, described in the config field
                     definition.

              PERF_TYPE_RAW
                     This indicates a "raw" implementation-specific event in
                     the config field.

              PERF_TYPE_BREAKPOINT (since Linux 2.6.33)
                     This indicates a hardware breakpoint as provided by the
                     CPU.  Breakpoints can be read/write accesses to an
                     address as well as execution of an instruction address.

              dynamic PMU
                     Since Linux 2.6.38, perf_event_open() can support
                     multiple PMUs.  To enable this, a value exported by the
                     kernel can be used in the type field to indicate which
                     PMU to use.  The value to use can be found in the sysfs
                     filesystem: there is a subdirectory per PMU instance
                     under /sys/bus/event_source/devices.  In each
                     subdirectory there is a type file whose content is an
                     integer that can be used in the type field.  For
                     instance, /sys/bus/event_source/devices/cpu/type
                     contains the value for the core CPU PMU, which is
                     usually 4.
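
                     The sysfs lookup described above could be sketched as
                     follows.  This is a minimal illustration, not part of
                     the kernel interface: the helper name read_pmu_type()
                     and its -1 error convention are assumptions made for
                     this example.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: reads the integer stored in a PMU "type" file,
 * for example /sys/bus/event_source/devices/cpu/type.
 * Returns the value, or -1 if the file cannot be opened or parsed. */
static int read_pmu_type(const char *path)
{
    assert(path != NULL);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    int type = -1;
    if (fscanf(f, "%d", &type) != 1)
        type = -1;          /* file did not contain an integer */

    fclose(f);
    return type;
}
```

                     A nonnegative result would then be stored in the type
                     field of perf_event_attr.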

       size   The size of the perf_event_attr structure for forward/backward
              compatibility.  Set this using sizeof(struct perf_event_attr)
              to allow the kernel to see the struct size at the time of
              compilation.

              The related define PERF_ATTR_SIZE_VER0 is set to 64; this was
              the size of the first published struct.  PERF_ATTR_SIZE_VER1
              is 72, corresponding to the addition of breakpoints in Linux
              2.6.33.  PERF_ATTR_SIZE_VER2 is 80 corresponding to the
              addition of branch sampling in Linux 3.4.  PERF_ATTR_SIZE_VER3
              is 96 corresponding to the addition of sample_regs_user and
              sample_stack_user in Linux 3.7.  PERF_ATTR_SIZE_VER4 is 104
              corresponding to the addition of sample_regs_intr in Linux
              3.19.  PERF_ATTR_SIZE_VER5 is 112 corresponding to the
              addition of aux_watermark in Linux 4.1.
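
              A common way to follow this advice is to zero the structure
              and then store its compile-time size.  The sketch below
              assumes a hypothetical helper named init_perf_attr(); only
              the field names and constants come from <linux/perf_event.h>.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <linux/perf_event.h>

/* Hypothetical helper: clear every field, then record the size this
 * program was compiled against so the kernel can tell which fields
 * the caller knows about. */
static void init_perf_attr(struct perf_event_attr *attr,
                           uint32_t type, uint64_t config)
{
    assert(attr != NULL);

    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = type;
    attr->config = config;
}
```

              Zeroing first matters because unset bit flags and reserved
              fields are then guaranteed to be 0.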

       config This specifies which event you want, in conjunction with the
              type field.  The config1 and config2 fields are also taken
              into account in cases where 64 bits is not enough to fully
              specify the event.  The encoding of these fields is event
              dependent.

              There are various ways to set the config field that are
              dependent on the value of the previously described type field.
              What follows are various possible settings for config
              separated out by type.

              If type is PERF_TYPE_HARDWARE, we are measuring one of the
              generalized hardware CPU events.  Not all of these are
              available on all platforms.  Set config to one of the
              following:

                   PERF_COUNT_HW_CPU_CYCLES
                          Total cycles.  Be wary of what happens during CPU
                          frequency scaling.

                   PERF_COUNT_HW_INSTRUCTIONS
                          Retired instructions.  Be careful, these can be
                          affected by various issues, most notably hardware
                          interrupt counts.

                   PERF_COUNT_HW_CACHE_REFERENCES
                          Cache accesses.  Usually this indicates Last Level
                          Cache accesses but this may vary depending on your
                          CPU.  This may include prefetches and coherency
                          messages; again this depends on the design of your
                          CPU.

                   PERF_COUNT_HW_CACHE_MISSES
                          Cache misses.  Usually this indicates Last Level
                          Cache misses; this is intended to be used in
                          conjunction with the
                          PERF_COUNT_HW_CACHE_REFERENCES event to calculate
                          cache miss rates.

                   PERF_COUNT_HW_BRANCH_INSTRUCTIONS
                          Retired branch instructions.  Prior to Linux
                          2.6.35, this used the wrong event on AMD
                          processors.

                   PERF_COUNT_HW_BRANCH_MISSES
                          Mispredicted branch instructions.

                   PERF_COUNT_HW_BUS_CYCLES
                          Bus cycles, which can be different from total
                          cycles.

                   PERF_COUNT_HW_STALLED_CYCLES_FRONTEND (since Linux 3.0)
                          Stalled cycles during issue.

                   PERF_COUNT_HW_STALLED_CYCLES_BACKEND (since Linux 3.0)
                          Stalled cycles during retirement.

                   PERF_COUNT_HW_REF_CPU_CYCLES (since Linux 3.3)
                          Total cycles; not affected by CPU frequency
                          scaling.

              If type is PERF_TYPE_SOFTWARE, we are measuring software
              events provided by the kernel.  Set config to one of the
              following:

                   PERF_COUNT_SW_CPU_CLOCK
                          This reports the CPU clock, a high-resolution per-
                          CPU timer.

                   PERF_COUNT_SW_TASK_CLOCK
                          This reports a clock count specific to the task
                          that is running.

                   PERF_COUNT_SW_PAGE_FAULTS
                          This reports the number of page faults.

                   PERF_COUNT_SW_CONTEXT_SWITCHES
                          This counts context switches.  Until Linux 2.6.34,
                          these were all reported as user-space events;
                          after that they are reported as happening in the
                          kernel.

                   PERF_COUNT_SW_CPU_MIGRATIONS
                          This reports the number of times the process has
                          migrated to a new CPU.

                   PERF_COUNT_SW_PAGE_FAULTS_MIN
                          This counts the number of minor page faults.
                          These did not require disk I/O to handle.

                   PERF_COUNT_SW_PAGE_FAULTS_MAJ
                          This counts the number of major page faults.
                          These required disk I/O to handle.

                   PERF_COUNT_SW_ALIGNMENT_FAULTS (since Linux 2.6.33)
                          This counts the number of alignment faults.  These
                          happen when unaligned memory accesses happen; the
                          kernel can handle these but it reduces
                          performance.  This happens only on some
                          architectures (never on x86).

                   PERF_COUNT_SW_EMULATION_FAULTS (since Linux 2.6.33)
                          This counts the number of emulation faults.  The
                          kernel sometimes traps on unimplemented
                          instructions and emulates them for user space.
                          This can negatively impact performance.

                   PERF_COUNT_SW_DUMMY (since Linux 3.12)
                          This is a placeholder event that counts nothing.
                          Informational sample record types such as mmap or
                          comm must be associated with an active event.
                          This dummy event allows gathering such records
                          without requiring a counting event.

              If type is PERF_TYPE_TRACEPOINT, then we are measuring kernel
              tracepoints.  The value to use in config can be obtained from
              under debugfs tracing/events/*/*/id if ftrace is enabled in
              the kernel.

              If type is PERF_TYPE_HW_CACHE, then we are measuring a
              hardware CPU cache event.  To calculate the appropriate config
              value use the following equation:

                      (perf_hw_cache_id) | (perf_hw_cache_op_id << 8) |
                      (perf_hw_cache_op_result_id << 16)

                  where perf_hw_cache_id is one of:

                      PERF_COUNT_HW_CACHE_L1D
                             for measuring Level 1 Data Cache

                      PERF_COUNT_HW_CACHE_L1I
                             for measuring Level 1 Instruction Cache

                      PERF_COUNT_HW_CACHE_LL
                             for measuring Last-Level Cache

                      PERF_COUNT_HW_CACHE_DTLB
                             for measuring the Data TLB

                      PERF_COUNT_HW_CACHE_ITLB
                             for measuring the Instruction TLB

                      PERF_COUNT_HW_CACHE_BPU
                             for measuring the branch prediction unit

                      PERF_COUNT_HW_CACHE_NODE (since Linux 3.1)
                             for measuring local memory accesses

                  and perf_hw_cache_op_id is one of

                      PERF_COUNT_HW_CACHE_OP_READ
                             for read accesses

                      PERF_COUNT_HW_CACHE_OP_WRITE
                             for write accesses

                      PERF_COUNT_HW_CACHE_OP_PREFETCH
                             for prefetch accesses

                  and perf_hw_cache_op_result_id is one of

                      PERF_COUNT_HW_CACHE_RESULT_ACCESS
                             to measure accesses

                      PERF_COUNT_HW_CACHE_RESULT_MISS
                             to measure misses
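
              The equation above can be wrapped in a small function.  This
              is a sketch only; the name hw_cache_config() is an assumption
              for illustration, while the enum types and constants are the
              ones declared in <linux/perf_event.h>.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/perf_event.h>

/* Hypothetical helper: packs cache id, operation and result into the
 * config value used with PERF_TYPE_HW_CACHE. */
static uint64_t hw_cache_config(enum perf_hw_cache_id id,
                                enum perf_hw_cache_op_id op,
                                enum perf_hw_cache_op_result_id result)
{
    assert(id < PERF_COUNT_HW_CACHE_MAX);
    assert(op < PERF_COUNT_HW_CACHE_OP_MAX);
    assert(result < PERF_COUNT_HW_CACHE_RESULT_MAX);

    return (uint64_t)id | ((uint64_t)op << 8) | ((uint64_t)result << 16);
}
```

              For example, L1 data cache read misses would be requested
              with hw_cache_config(PERF_COUNT_HW_CACHE_L1D,
              PERF_COUNT_HW_CACHE_OP_READ, PERF_COUNT_HW_CACHE_RESULT_MISS).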

              If type is PERF_TYPE_RAW, then a custom "raw" config value is
              needed.  Most CPUs support events that are not covered by the
              "generalized" events.  These are implementation defined; see
              your CPU manual (for example the Intel Volume 3B documentation
              or the AMD BIOS and Kernel Developer Guide).  The libpfm4
              library can be used to translate from the name in the
              architectural manuals to the raw hex value perf_event_open()
              expects in this field.

              If type is PERF_TYPE_BREAKPOINT, then leave config set to
              zero.  Its parameters are set in other places.

       sample_period, sample_freq
              A "sampling" event is one that generates an overflow
              notification every N events, where N is given by
              sample_period.  A sampling event has sample_period > 0.  When
              an overflow occurs, requested data is recorded in the mmap
              buffer.  The sample_type field controls what data is recorded
              on each overflow.

              sample_freq can be used if you wish to use frequency rather
              than period.  In this case, you set the freq flag.  The kernel
              will adjust the sampling period to try to achieve the desired
              rate.  The rate of adjustment is a timer tick.

       sample_type
              The various bits in this field specify which values to include
              in the sample.  They will be recorded in a ring-buffer, which
              is available to user space using mmap(2).  The order in which
              the values are saved in the sample is documented in the MMAP
              Layout subsection below; it is not the enum
              perf_event_sample_format order.

              PERF_SAMPLE_IP
                     Records instruction pointer.

              PERF_SAMPLE_TID
                     Records the process and thread IDs.

              PERF_SAMPLE_TIME
                     Records a timestamp.

              PERF_SAMPLE_ADDR
                     Records an address, if applicable.

              PERF_SAMPLE_READ
                     Record counter values for all events in a group, not
                     just the group leader.

              PERF_SAMPLE_CALLCHAIN
                     Records the callchain (stack backtrace).

              PERF_SAMPLE_ID
                     Records a unique ID for the opened event's group
                     leader.

              PERF_SAMPLE_CPU
                     Records CPU number.

              PERF_SAMPLE_PERIOD
                     Records the current sampling period.

              PERF_SAMPLE_STREAM_ID
                     Records a unique ID for the opened event.  Unlike
                     PERF_SAMPLE_ID the actual ID is returned, not the group
                     leader.  This ID is the same as the one returned by
                     PERF_FORMAT_ID.

              PERF_SAMPLE_RAW
                     Records additional data, if applicable.  Usually
                     returned by tracepoint events.

              PERF_SAMPLE_BRANCH_STACK (since Linux 3.4)
                     This provides a record of recent branches, as provided
                     by CPU branch sampling hardware (such as Intel Last
                     Branch Record).  Not all hardware supports this
                     feature.

                     See the branch_sample_type field for how to filter
                     which branches are reported.

              PERF_SAMPLE_REGS_USER (since Linux 3.7)
                     Records the current user-level CPU register state (the
                     values in the process before the kernel was called).

              PERF_SAMPLE_STACK_USER (since Linux 3.7)
                     Records the user level stack, allowing stack unwinding.

              PERF_SAMPLE_WEIGHT (since Linux 3.10)
                     Records a hardware provided weight value that expresses
                     how costly the sampled event was.  This allows the
                     hardware to highlight expensive events in a profile.

              PERF_SAMPLE_DATA_SRC (since Linux 3.10)
                     Records the data source: where in the memory hierarchy
                     the data associated with the sampled instruction came
                     from.  This is available only if the underlying
                     hardware supports this feature.

              PERF_SAMPLE_IDENTIFIER (since Linux 3.12)
                     Places the SAMPLE_ID value in a fixed position in the
                     record, either at the beginning (for sample events) or
                     at the end (if a non-sample event).

                     This was necessary because a sample stream may have
                     records from various different event sources with
                     different sample_type settings.  Parsing the event
                     stream properly was not possible because the format of
                     the record was needed to find SAMPLE_ID, but the format
                     could not be found without knowing what event the
                     sample belonged to (causing a circular dependency).

                     The PERF_SAMPLE_IDENTIFIER setting makes the event
                     stream always parsable by putting SAMPLE_ID in a fixed
                     location, even though it means having duplicate
                     SAMPLE_ID values in records.

              PERF_SAMPLE_TRANSACTION (since Linux 3.13)
                     Records reasons for transactional memory abort events
                     (for example, from Intel TSX transactional memory
                     support).

                     The precise_ip setting must be greater than 0 and a
                     transactional memory abort event must be measured or no
                     values will be recorded.  Also note that some
                     perf_event measurements, such as sampled cycle
                     counting, may cause extraneous aborts (by causing an
                     interrupt during a transaction).

              PERF_SAMPLE_REGS_INTR (since Linux 3.19)
                     Records a subset of the current CPU register state as
                     specified by sample_regs_intr.  Unlike
                     PERF_SAMPLE_REGS_USER the register values will return
                     kernel register state if the overflow happened while
                     kernel code is running.  If the CPU supports hardware
                     sampling of register state (i.e. PEBS on Intel x86) and
                     precise_ip is set higher than zero then the register
                     values returned are those captured by hardware at the
                     time of the sampled instruction's retirement.

       read_format
              This field specifies the format of the data returned by
              read(2) on a perf_event_open() file descriptor.

              PERF_FORMAT_TOTAL_TIME_ENABLED
                     Adds the 64-bit time_enabled field.  This can be used
                     to calculate estimated totals if the PMU is
                     overcommitted and multiplexing is happening.

              PERF_FORMAT_TOTAL_TIME_RUNNING
                     Adds the 64-bit time_running field.  This can be used
                     to calculate estimated totals if the PMU is
                     overcommitted and multiplexing is happening.

              PERF_FORMAT_ID
                     Adds a 64-bit unique value that corresponds to the
                     event group.

              PERF_FORMAT_GROUP
                     Allows all counter values in an event group to be read
                     with one read.

       disabled
              The disabled bit specifies whether the counter starts out
              disabled or enabled.  If disabled, the event can later be
              enabled by ioctl(2), prctl(2), or enable_on_exec.

              When creating an event group, typically the group leader is
              initialized with disabled set to 1 and any child events are
              initialized with disabled set to 0.  Despite disabled being 0,
              the child events will not start until the group leader is
              enabled.

       inherit
              The inherit bit specifies that this counter should count
              events of child tasks as well as the task specified.  This
              applies only to new children, not to any existing children at
              the time the counter is created (nor to any new children of
              existing children).

              Inherit does not work for some combinations of read_formats,
              such as PERF_FORMAT_GROUP.

       pinned The pinned bit specifies that the counter should always be on
              the CPU if at all possible.  It applies only to hardware
              counters and only to group leaders.  If a pinned counter
              cannot be put onto the CPU (e.g., because there are not enough
              hardware counters or because of a conflict with some other
              event), then the counter goes into an 'error' state, where
              reads return end-of-file (i.e., read(2) returns 0) until the
              counter is subsequently enabled or disabled.

       exclusive
              The exclusive bit specifies that when this counter's group is
              on the CPU, it should be the only group using the CPU's
              counters.  In the future this may allow monitoring programs to
              support PMU features that need to run alone so that they do
              not disrupt other hardware counters.

              Note that many unexpected situations may prevent events with
              the exclusive bit set from ever running.  This includes any
              users running a system-wide measurement as well as any kernel
              use of the performance counters (including the commonly
              enabled NMI Watchdog Timer interface).

       exclude_user
              If this bit is set, the count excludes events that happen in
              user space.

       exclude_kernel
              If this bit is set, the count excludes events that happen in
              kernel-space.

       exclude_hv
              If this bit is set, the count excludes events that happen in
              the hypervisor.  This is mainly for PMUs that have built-in
              support for handling this (such as POWER).  Extra support is
              needed for handling hypervisor measurements on most machines.

       exclude_idle
              If set, don't count when the CPU is idle.

       mmap   The mmap bit enables generation of PERF_RECORD_MMAP samples
              for every mmap(2) call that has PROT_EXEC set.  This allows
              tools to notice new executable code being mapped into a
              program (dynamic shared libraries for example) so that
              addresses can be mapped back to the original code.

       comm   The comm bit enables tracking of process command name as
              modified by the exec(2) and prctl(PR_SET_NAME) system calls as
              well as writing to /proc/self/comm.  If the comm_exec flag is
              also successfully set (possible since Linux 3.16), then the
              misc flag PERF_RECORD_MISC_COMM_EXEC can be used to
              differentiate the exec(2) case from the others.

       freq   If this bit is set, then sample_freq, not sample_period, is
              used when setting up the sampling interval.

       inherit_stat
              This bit enables saving of event counts on context switch for
              inherited tasks.  This is meaningful only if the inherit field
              is set.

       enable_on_exec
              If this bit is set, a counter is automatically enabled after a
              call to exec(2).

       task   If this bit is set, then fork/exit notifications are included
              in the ring buffer.

       watermark
              If set, have an overflow notification happen when we cross the
              wakeup_watermark boundary.  Otherwise, overflow notifications
              happen after wakeup_events samples.

       precise_ip (since Linux 2.6.35)
              This controls the amount of skid.  Skid is how many
              instructions execute between an event of interest happening
              and the kernel being able to stop and record the event.
              Smaller skid is better and allows more accurate reporting of
              which events correspond to which instructions, but hardware is
              often limited with how small this can be.

              The values of this are the following:

              0 -    SAMPLE_IP can have arbitrary skid.

              1 -    SAMPLE_IP must have constant skid.

              2 -    SAMPLE_IP requested to have 0 skid.

              3 -    SAMPLE_IP must have 0 skid.  See also
                     PERF_RECORD_MISC_EXACT_IP.

       mmap_data (since Linux 2.6.36)
              The counterpart of the mmap field.  This enables generation of
              PERF_RECORD_MMAP samples for mmap(2) calls that do not have
              PROT_EXEC set (for example data and SysV shared memory).

       sample_id_all (since Linux 2.6.38)
              If set, then TID, TIME, ID, STREAM_ID, and CPU can
              additionally be included in non-PERF_RECORD_SAMPLEs if the
              corresponding sample_type is selected.

              If PERF_SAMPLE_IDENTIFIER is specified, then an additional ID
              value is included as the last value to ease parsing the record
              stream.  This may lead to the id value appearing twice.

              The layout is described by this pseudo-structure:
                  struct sample_id {
                      { u32 pid, tid; } /* if PERF_SAMPLE_TID set        */
                      { u64 time;     } /* if PERF_SAMPLE_TIME set       */
                      { u64 id;       } /* if PERF_SAMPLE_ID set         */
                      { u64 stream_id;} /* if PERF_SAMPLE_STREAM_ID set  */
                      { u32 cpu, res; } /* if PERF_SAMPLE_CPU set        */
                      { u64 id;       } /* if PERF_SAMPLE_IDENTIFIER set */
                  };
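
              Every member of the pseudo-structure occupies 8 bytes, so the
              size of the trailer follows directly from sample_type.  The
              function below is a sketch under that reading of the layout;
              the name sample_id_size() is an assumption for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/perf_event.h>

static_assert(sizeof(uint64_t) == 8, "each sample_id member is 8 bytes");

/* Hypothetical helper: number of bytes the sample_id trailer adds to
 * a non-sample record for the given sample_type bits. */
static size_t sample_id_size(uint64_t sample_type)
{
    static const uint64_t fields[] = {
        PERF_SAMPLE_TID,        /* u32 pid, tid */
        PERF_SAMPLE_TIME,       /* u64 time     */
        PERF_SAMPLE_ID,         /* u64 id       */
        PERF_SAMPLE_STREAM_ID,  /* u64 stream_id */
        PERF_SAMPLE_CPU,        /* u32 cpu, res */
        PERF_SAMPLE_IDENTIFIER, /* u64 id       */
    };
    size_t size = 0;

    for (size_t i = 0; i < sizeof(fields) / sizeof(fields[0]); i++)
        if (sample_type & fields[i])
            size += sizeof(uint64_t);

    return size;
}
```

              Bits that do not appear in the pseudo-structure, such as
              PERF_SAMPLE_IP, contribute nothing to the trailer.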

       exclude_host (since Linux 3.2)
              When conducting measurements that include processes running VM
              instances (i.e., have executed a KVM_RUN ioctl(2)), only
              measure events happening inside a guest instance.  This is
              only meaningful outside the guests; this setting does not
              change counts gathered inside of a guest.  Currently, this
              functionality is x86 only.

       exclude_guest (since Linux 3.2)
              When conducting measurements that include processes running VM
              instances (i.e., have executed a KVM_RUN ioctl(2)), do not
              measure events happening inside guest instances.  This is only
              meaningful outside the guests; this setting does not change
              counts gathered inside of a guest.  Currently, this
              functionality is x86 only.

       exclude_callchain_kernel (since Linux 3.7)
              Do not include kernel callchains.

       exclude_callchain_user (since Linux 3.7)
              Do not include user callchains.

       mmap2 (since Linux 3.16)
              Generate an extended executable mmap record that contains
              enough additional information to uniquely identify shared
              mappings.  The mmap flag must also be set for this to work.

       comm_exec (since Linux 3.16)
              This is purely a feature-detection flag; it does not change
              kernel behavior.  If this flag can successfully be set, then,
              when comm is enabled, the PERF_RECORD_MISC_COMM_EXEC flag will
              be set in the misc field of a comm record header if the rename
              event being reported was caused by a call to exec(2).  This
              allows tools to distinguish between the various types of
              process renaming.

       use_clockid (since Linux 4.1)
              This allows selecting which internal Linux clock to use when
              generating timestamps via the clockid field.  This can make it
              easier to correlate perf sample times with timestamps
              generated by other tools.

       wakeup_events, wakeup_watermark
              This union sets how many samples (wakeup_events) or bytes
              (wakeup_watermark) happen before an overflow notification
              happens.  Which one is used is selected by the watermark bit
              flag.

              wakeup_events counts only PERF_RECORD_SAMPLE record types.  To
              receive overflow notification for all PERF_RECORD types choose
              watermark and set wakeup_watermark to 1.

              Prior to Linux 3.0 setting wakeup_events to 0 resulted in no
              overflow notifications; more recent kernels treat 0 the same
              as 1.

       bp_type (since Linux 2.6.33)
              This chooses the breakpoint type.  It is one of:

              HW_BREAKPOINT_EMPTY
                     No breakpoint.

              HW_BREAKPOINT_R
                     Count when we read the memory location.

              HW_BREAKPOINT_W
                     Count when we write the memory location.

              HW_BREAKPOINT_RW
                     Count when we read or write the memory location.

              HW_BREAKPOINT_X
                     Count when we execute code at the memory location.

              The values can be combined via a bitwise or, but the
              combination of HW_BREAKPOINT_R or HW_BREAKPOINT_W with
              HW_BREAKPOINT_X is not allowed.
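
              The combination rule can be checked before calling
              perf_event_open().  This sketch encodes only the rule stated
              above and is not the kernel's own validation; the name
              bp_type_is_allowed() is an assumption for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <linux/hw_breakpoint.h>

static_assert(HW_BREAKPOINT_RW == (HW_BREAKPOINT_R | HW_BREAKPOINT_W),
              "RW is the bitwise or of R and W");

/* Hypothetical helper: returns false when a data access bit (R or W)
 * is combined with the execute bit (X). */
static bool bp_type_is_allowed(uint32_t bp_type)
{
    bool data = (bp_type & HW_BREAKPOINT_RW) != 0;
    bool exec = (bp_type & HW_BREAKPOINT_X) != 0;

    return !(data && exec);
}
```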

       bp_addr (since Linux 2.6.33)
              bp_addr is the address of the breakpoint.  For execution breakpoints
              this is the memory address of the instruction of interest; for
              read and write breakpoints it is the memory address of the
              memory location of interest.

       config1 (since Linux 2.6.39)
              config1 is used for setting events that need an extra register
              or otherwise do not fit in the regular config field.  Raw
              OFFCORE_EVENTS on Nehalem/Westmere/SandyBridge use this field
              on 3.3 and later kernels.

       bp_len (since Linux 2.6.33)
              bp_len is the length of the breakpoint being measured if type
              is PERF_TYPE_BREAKPOINT.  Options are HW_BREAKPOINT_LEN_1,
              HW_BREAKPOINT_LEN_2, HW_BREAKPOINT_LEN_4, HW_BREAKPOINT_LEN_8.
              For an execution breakpoint, set this to sizeof(long).

       config2 (since Linux 2.6.39)
              config2 is a further extension of the config1 field.

       branch_sample_type (since Linux 3.4)
              If PERF_SAMPLE_BRANCH_STACK is enabled, then this specifies
              what branches to include in the branch record.

              The first part of the value is the privilege level, which is a
              combination of one or more of the following values.  If the user does
              not set privilege level explicitly, the kernel will use the
              event's privilege level.  Event and branch privilege levels do
              not have to match.

              PERF_SAMPLE_BRANCH_USER
                     Branch target is in user space.

              PERF_SAMPLE_BRANCH_KERNEL
                     Branch target is in kernel space.

              PERF_SAMPLE_BRANCH_HV
                     Branch target is in hypervisor.

              PERF_SAMPLE_BRANCH_PLM_ALL
                     A convenience value that is the three preceding values
                     ORed together.

              In addition to the privilege value, one or more of the
              following bits must be set.

              PERF_SAMPLE_BRANCH_ANY
                     Any branch type.

              PERF_SAMPLE_BRANCH_ANY_CALL
                     Any call branch.

              PERF_SAMPLE_BRANCH_ANY_RETURN
                     Any return branch.

              PERF_SAMPLE_BRANCH_IND_CALL
                     Indirect calls.

              PERF_SAMPLE_BRANCH_COND (since Linux 3.16)
                     Conditional branches.

              PERF_SAMPLE_BRANCH_ABORT_TX (since Linux 3.11)
                     Transactional memory aborts.

              PERF_SAMPLE_BRANCH_IN_TX (since Linux 3.11)
                     Branch in transactional memory transaction.

              PERF_SAMPLE_BRANCH_NO_TX (since Linux 3.11)
                     Branch not in transactional memory transaction.

              PERF_SAMPLE_BRANCH_CALL_STACK (since Linux 4.1)
                     Branch is part of a hardware-generated call stack.
                     This requires hardware support, currently only found
                     on Intel x86 Haswell or newer.
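
              Since the value consists of optional privilege bits plus a
              mandatory branch-type part, a caller can verify that the
              second part is present.  The sketch below assumes a
              hypothetical helper branch_filter_is_set(); as a
              simplification it treats every bit outside
              PERF_SAMPLE_BRANCH_PLM_ALL as a branch-type bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <linux/perf_event.h>

static_assert(PERF_SAMPLE_BRANCH_PLM_ALL ==
              (PERF_SAMPLE_BRANCH_USER | PERF_SAMPLE_BRANCH_KERNEL |
               PERF_SAMPLE_BRANCH_HV),
              "PLM_ALL is the three privilege bits ORed together");

/* Hypothetical helper: true if at least one bit other than the
 * privilege-level bits is set in branch_sample_type. */
static bool branch_filter_is_set(uint64_t branch_sample_type)
{
    return (branch_sample_type &
            ~(uint64_t)PERF_SAMPLE_BRANCH_PLM_ALL) != 0;
}
```

              For instance, PERF_SAMPLE_BRANCH_USER |
              PERF_SAMPLE_BRANCH_ANY_CALL passes this check, while
              PERF_SAMPLE_BRANCH_USER alone does not.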

       sample_regs_user (since Linux 3.7)
              This bit mask defines the set of user CPU registers to dump on
              samples.  The layout of the register mask is architecture-
              specific and described in the kernel header
              arch/ARCH/include/uapi/asm/perf_regs.h.

       sample_stack_user (since Linux 3.7)
              This defines the size of the user stack to dump if
              PERF_SAMPLE_STACK_USER is specified.

       clockid (since Linux 4.1)
              If use_clockid is set, then this field selects which internal
              Linux timer to use for timestamps.  The available timers are
              defined in linux/time.h, with CLOCK_MONOTONIC,
              CLOCK_MONOTONIC_RAW, CLOCK_REALTIME, CLOCK_BOOTTIME, and
              CLOCK_TAI currently supported.

       aux_watermark (since Linux 4.1)
              This specifies how much data is required to trigger a
              PERF_RECORD_AUX sample.

   Reading results
       Once a perf_event_open() file descriptor has been opened, the values
       of the events can be read from the file descriptor.  The values that
       are there are specified by the read_format field in the attr
       structure at open time.

       If you attempt to read into a buffer that is not big enough to hold
       the data, the error ENOSPC is returned.

       Here is the layout of the data returned by a read:

       * If PERF_FORMAT_GROUP was specified to allow reading all events in a
         group at once:

             struct read_format {
                 u64 nr;            /* The number of events */
                 u64 time_enabled;  /* if PERF_FORMAT_TOTAL_TIME_ENABLED */
                 u64 time_running;  /* if PERF_FORMAT_TOTAL_TIME_RUNNING */
                 struct {
                     u64 value;     /* The value of the event */
                     u64 id;        /* if PERF_FORMAT_ID */
                 } values[nr];
             };

       * If PERF_FORMAT_GROUP was not specified:

             struct read_format {
                 u64 value;         /* The value of the event */
                 u64 time_enabled;  /* if PERF_FORMAT_TOTAL_TIME_ENABLED */
                 u64 time_running;  /* if PERF_FORMAT_TOTAL_TIME_RUNNING */
                 u64 id;            /* if PERF_FORMAT_ID */
             };
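
       As a rough illustration of the two layouts above, the following
       sketch computes how many bytes a read should return for a given
       read_format and decodes the non-group layout.  The FMT_* macros
       are local copies of the PERF_FORMAT_* bit values, and struct
       single_read, read_size(), and parse_single_read() are
       hypothetical names introduced only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local copies of the PERF_FORMAT_* bit values from <linux/perf_event.h>,
 * renamed so this sketch builds without the kernel headers. */
#define FMT_TOTAL_TIME_ENABLED (1u << 0)
#define FMT_TOTAL_TIME_RUNNING (1u << 1)
#define FMT_ID                 (1u << 2)
#define FMT_GROUP              (1u << 3)

/* Hypothetical container for one decoded non-group read. */
struct single_read {
    uint64_t value, time_enabled, time_running, id;
};

/* Bytes a read(2) returns for these flags; nr matters only with FMT_GROUP. */
size_t read_size(uint64_t fmt, uint64_t nr)
{
    size_t words = 1;                       /* value, or nr in group mode */

    if (fmt & FMT_TOTAL_TIME_ENABLED)
        words++;
    if (fmt & FMT_TOTAL_TIME_RUNNING)
        words++;
    if (fmt & FMT_GROUP)
        words += (size_t)nr * ((fmt & FMT_ID) ? 2 : 1);
    else if (fmt & FMT_ID)
        words++;
    return words * sizeof(uint64_t);
}

/* Decode a non-group buffer.  Returns 0 on success, or -1 if FMT_GROUP is
 * set or the buffer is shorter than the layout requires. */
int parse_single_read(const void *buf, size_t len, uint64_t fmt,
                      struct single_read *out)
{
    uint64_t w[4];
    size_t need = read_size(fmt, 0);
    size_t i = 0;

    assert(buf != NULL && out != NULL);
    if ((fmt & FMT_GROUP) || len < need)
        return -1;
    memcpy(w, buf, need);                   /* at most 4 words here */
    memset(out, 0, sizeof(*out));
    out->value = w[i++];
    if (fmt & FMT_TOTAL_TIME_ENABLED)
        out->time_enabled = w[i++];
    if (fmt & FMT_TOTAL_TIME_RUNNING)
        out->time_running = w[i++];
    if (fmt & FMT_ID)
        out->id = w[i++];
    return 0;
}
```

       In a real program the buffer would come from read(2) on the event
       file descriptor, and read_size() could be used to size it so that
       ENOSPC is avoided.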

       The values read are as follows:

       nr     The number of events in this file descriptor.  Only available
              if PERF_FORMAT_GROUP was specified.

       time_enabled, time_running
              Total time the event was enabled and running.  Normally these
              are the same.  If more events are started than there are
              available counter slots on the PMU, then multiplexing happens
              and events run only part of the time.  In that case, the
              time_enabled and time_running values can be used to scale an
              estimated value for the count.

       value  An unsigned 64-bit value containing the counter result.

       id     A globally unique value for this particular event, only
              present if PERF_FORMAT_ID was specified in read_format.
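
       The scaling mentioned for time_enabled and time_running can be
       sketched as below.  scale_count() is a hypothetical helper, and
       the use of floating point is a choice made for this example (it
       sidesteps 64-bit overflow at the cost of some precision for very
       large counts); it is not taken from the kernel sources.

```c
#include <assert.h>
#include <stdint.h>

/* Estimate the count an event would have reached had it been scheduled
 * for the whole time it was enabled.  The result is an estimate only. */
uint64_t scale_count(uint64_t value, uint64_t time_enabled,
                     uint64_t time_running)
{
    double est;

    if (time_running == 0)
        return 0;               /* never scheduled: nothing to scale */
    if (time_running >= time_enabled)
        return value;           /* not multiplexed: count is exact */
    est = (double)value * (double)time_enabled / (double)time_running;
    return (uint64_t)(est + 0.5);   /* round to nearest */
}
```

       For example, an event that counted 1000 while running for a
       quarter of the time it was enabled is estimated at 4000.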

   MMAP layout
       When using perf_event_open() in sampled mode, asynchronous events
       (like counter overflow or PROT_EXEC mmap tracking) are logged into a
       ring-buffer.  This ring-buffer is created and accessed through
       mmap(2).

       The mmap size should be 1+2^n pages, where the first page is a
       metadata page (struct perf_event_mmap_page) that contains various
       bits of information such as where the ring-buffer head is.
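
       The 1+2^n rule can be turned into a small helper for computing or
       checking the length passed to mmap(2).  Both function names are
       hypothetical; in a real program page_size would normally come
       from sysconf(_SC_PAGESIZE).

```c
#include <assert.h>
#include <stddef.h>

/* Length for mmap(2): one metadata page plus 2^n ring-buffer pages. */
size_t perf_mmap_len(size_t page_size, unsigned n)
{
    assert(n < sizeof(size_t) * 8 - 1);
    return page_size * (1 + ((size_t)1 << n));
}

/* Returns 1 if len is exactly (1 + 2^n) pages for some n >= 0, else 0. */
int perf_mmap_len_ok(size_t len, size_t page_size)
{
    size_t pages, data;

    if (page_size == 0 || len % page_size != 0)
        return 0;
    pages = len / page_size;
    if (pages < 2)
        return 0;               /* need the metadata page and some data */
    data = pages - 1;
    return (data & (data - 1)) == 0;    /* power-of-two test */
}
```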

       Before kernel 2.6.39, there is a bug that means you must allocate an
       mmap ring buffer when sampling even if you do not plan to access it.

       The structure of the first metadata mmap page is as follows:

           struct perf_event_mmap_page {
               __u32 version;        /* version number of this structure */
               __u32 compat_version; /* lowest version this is compat with */
               __u32 lock;           /* seqlock for synchronization */
               __u32 index;          /* hardware counter identifier */
               __s64 offset;         /* add to hardware counter value */
               __u64 time_enabled;   /* time event active */
               __u64 time_running;   /* time event on CPU */
               union {
                   __u64   capabilities;
                   struct {
                       __u64 cap_usr_time / cap_usr_rdpmc / cap_bit0 : 1,
                             cap_bit0_is_deprecated : 1,
                             cap_user_rdpmc         : 1,
                             cap_user_time          : 1,
                             cap_user_time_zero     : 1,
                   };
               };
               __u16 pmc_width;
               __u16 time_shift;
               __u32 time_mult;
               __u64 time_offset;
               __u64 __reserved[120];   /* Pad to 1k */
               __u64 data_head;         /* head in the data section */
               __u64 data_tail;         /* user-space written tail */
               __u64 data_offset;       /* where the buffer starts */
               __u64 data_size;         /* data buffer size */
               __u64 aux_head;
               __u64 aux_tail;
               __u64 aux_offset;
               __u64 aux_size;

           };

       The following list describes the fields in the perf_event_mmap_page
       structure in more detail:

       version
              Version number of this structure.

       compat_version
              The lowest version this is compatible with.

       lock   A seqlock for synchronization.

       index  A unique hardware counter identifier.

       offset When using rdpmc for reads this offset value must be added to
              the one returned by rdpmc to get the current total event
              count.

       time_enabled
              Time the event was active.

       time_running
              Time the event was running.

       cap_usr_time / cap_usr_rdpmc / cap_bit0 (since Linux 3.4)
              There was a bug in the definition of cap_usr_time and
              cap_usr_rdpmc from Linux 3.4 until Linux 3.11.  Both bits were
              defined to point to the same location, so it was impossible to
              know if cap_usr_time or cap_usr_rdpmc were actually set.

              Starting with Linux 3.12, these are renamed to cap_bit0 and
              you should use the cap_user_time and cap_user_rdpmc fields
              instead.

       cap_bit0_is_deprecated (since Linux 3.12)
              If set, this bit indicates that the kernel supports the
              properly separated cap_user_time and cap_user_rdpmc bits.

              If not set, it indicates an older kernel where cap_usr_time
              and cap_usr_rdpmc map to the same bit and thus both features
              should be used with caution.

       cap_user_rdpmc (since Linux 3.12)
              If the hardware supports user-space read of performance
              counters without syscall (this is the "rdpmc" instruction on
              x86), then the following code can be used to do a read:

                  u32 seq, time_mult, time_shift, idx, width;
                  u64 count, enabled, running;
                  u64 cyc, time_offset;

                  do {
                      seq = pc->lock;
                      barrier();
                      enabled = pc->time_enabled;
                      running = pc->time_running;

                      if (pc->cap_usr_time && enabled != running) {
                          cyc = rdtsc();
                          time_offset = pc->time_offset;
                          time_mult   = pc->time_mult;
                          time_shift  = pc->time_shift;
                      }

                      idx = pc->index;
                      count = pc->offset;

                      if (pc->cap_usr_rdpmc && idx) {
                          width = pc->pmc_width;
                          count += rdpmc(idx - 1);
                      }

                      barrier();
                  } while (pc->lock != seq);

       cap_user_time (since Linux 3.12)
              This bit indicates the hardware has a constant, nonstop
              timestamp counter (TSC on x86).

       cap_user_time_zero (since Linux 3.12)
              Indicates the presence of time_zero which allows mapping
              timestamp values to the hardware clock.

       pmc_width
              If cap_usr_rdpmc, this field provides the bit-width of the
              value read using the rdpmc or equivalent instruction.  This
              can be used to sign extend the result like:

                  pmc <<= 64 - pmc_width;
                  pmc >>= 64 - pmc_width; // signed shift right
                  count += pmc;
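
              A self-contained version of the sign extension might look
              like the following.  sign_extend_pmc() is a hypothetical
              name.  The sketch relies on the compiler performing an
              arithmetic right shift on negative signed values, which
              is implementation-defined in C but is what GCC and Clang
              do.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend a raw counter value that is pmc_width bits wide. */
int64_t sign_extend_pmc(uint64_t raw, unsigned pmc_width)
{
    unsigned shift;

    assert(pmc_width >= 1 && pmc_width <= 64);
    shift = 64 - pmc_width;
    /* Shift left as unsigned (well defined), then convert to signed and
     * shift right so the top bit of the counter fills the upper bits.
     * Both the conversion and the signed right shift are
     * implementation-defined; common compilers behave as expected. */
    return (int64_t)(raw << shift) >> shift;
}
```

              The result would then be added to the offset read from
              the mmap page, as in count += sign_extend_pmc(pmc, width).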

       time_shift, time_mult, time_offset

              If cap_usr_time, these fields can be used to compute the time
              delta since time_enabled (in nanoseconds) using rdtsc or
              similar.

                  u64 quot, rem;
                  u64 delta;
                  quot = (cyc >> time_shift);
                  rem = cyc & (((u64)1 << time_shift) - 1);
                  delta = time_offset + quot * time_mult +
                          ((rem * time_mult) >> time_shift);

              Where time_offset, time_mult, time_shift, and cyc are read in
              the seqcount loop described above.  This delta can then be
              added to enabled and possibly running (if idx), improving the
              scaling:

                  enabled += delta;
                  if (idx)
                      running += delta;
                  quot = count / running;
                  rem  = count % running;
                  count = quot * enabled + (rem * enabled) / running;
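
              The delta computation above can be written as a
              standalone function for experimentation.
              cyc_to_delta_ns() is a hypothetical name, and the
              restriction to time_shift <= 32 is an assumption of this
              sketch that keeps rem * time_mult within 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a cycle count into a nanosecond delta using the values read
 * from the mmap page inside the seqcount loop. */
uint64_t cyc_to_delta_ns(uint64_t cyc, uint64_t time_offset,
                         uint32_t time_mult, uint16_t time_shift)
{
    uint64_t quot, rem;

    assert(time_shift <= 32);   /* sketch-only overflow guard */
    quot = cyc >> time_shift;
    rem  = cyc & (((uint64_t)1 << time_shift) - 1);
    return time_offset + quot * time_mult +
           ((rem * time_mult) >> time_shift);
}
```

              With time_mult = 1024 and time_shift = 10 one cycle maps
              to one nanosecond, so the result is simply time_offset
              plus cyc.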

       time_zero (since Linux 3.12)

              If cap_user_time_zero is set, then the hardware clock (the TSC
              timestamp counter on x86) can be calculated from the
              time_zero, time_mult, and time_shift values:

                  time = timestamp - time_zero;
                  quot = time / time_mult;
                  rem  = time % time_mult;
                  cyc = (quot << time_shift) + (rem << time_shift) / time_mult;

              And vice versa:

                  quot = cyc >> time_shift;
                  rem  = cyc & (((u64)1 << time_shift) - 1);
                  timestamp = time_zero + quot * time_mult +
                      ((rem * time_mult) >> time_shift);
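
              The two conversions can be packaged as a pair of
              functions, which makes it easy to check that they round
              trip.  The function names are hypothetical, and the
              time_shift <= 32 limit is an assumption of this sketch
              that keeps the intermediate products within 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* perf timestamp -> hardware clock cycles */
uint64_t timestamp_to_cyc(uint64_t timestamp, uint64_t time_zero,
                          uint32_t time_mult, uint16_t time_shift)
{
    uint64_t time, quot, rem;

    assert(time_mult != 0 && time_shift <= 32);
    time = timestamp - time_zero;
    quot = time / time_mult;
    rem  = time % time_mult;
    return (quot << time_shift) + (rem << time_shift) / time_mult;
}

/* hardware clock cycles -> perf timestamp */
uint64_t cyc_to_timestamp(uint64_t cyc, uint64_t time_zero,
                          uint32_t time_mult, uint16_t time_shift)
{
    uint64_t quot, rem;

    assert(time_shift <= 32);
    quot = cyc >> time_shift;
    rem  = cyc & (((uint64_t)1 << time_shift) - 1);
    return time_zero + quot * time_mult +
           ((rem * time_mult) >> time_shift);
}
```

              Because both directions truncate, a round trip may lose
              a unit for some inputs.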

       data_head
              This points to the head of the data section.  The value
              continuously increases; it does not wrap.  The value needs to
              be manually wrapped by the size of the mmap buffer before
              accessing the samples.

              On SMP-capable platforms, after reading the data_head value,
              user space should issue an rmb().

       data_tail
              When the mapping is PROT_WRITE, the data_tail value should be
              written by user space to reflect the last read data.  In this
              case, the kernel will not overwrite unread data.
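
              The manual wrapping of data_head and data_tail can be
              sketched on an ordinary byte array.  ring_copy() and
              ring_avail() are hypothetical helpers, and the sketch
              assumes the data area size is a power of two.  The memory
              barriers that a real consumer needs after reading
              data_head and before writing data_tail are deliberately
              left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bytes the kernel has produced that user space has not yet consumed. */
uint64_t ring_avail(uint64_t head, uint64_t tail)
{
    return head - tail;
}

/* Copy len bytes that start at free-running position pos out of a ring
 * of data_size bytes, handling a record that straddles the end. */
void ring_copy(void *dst, const uint8_t *data, uint64_t data_size,
               uint64_t pos, size_t len)
{
    uint64_t off;
    size_t first;

    assert(data_size != 0 && (data_size & (data_size - 1)) == 0);
    assert(len <= data_size);
    off = pos & (data_size - 1);            /* manual wrap */
    first = (size_t)(data_size - off);      /* bytes before the end */
    if (first > len)
        first = len;
    memcpy(dst, data + off, first);
    memcpy((uint8_t *)dst + first, data, len - first);
}
```

              A consumer would typically copy each record out this way,
              advance its private tail by the record size, and finally
              publish the new tail through data_tail.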

       data_offset (since Linux 4.1)
              Contains the offset of the location in the mmap buffer where
              perf sample data begins.

       data_size (since Linux 4.1)
              Contains the size of the perf sample region within the mmap
              buffer.

       aux_head, aux_tail, aux_offset, aux_size (since Linux 4.1)
              The AUX region allows mmap(2)-ing a separate sample buffer for
              high-bandwidth data streams (separate from the main perf
              sample buffer).  An example of a high-bandwidth stream is
              instruction tracing support, as is found in newer Intel
              processors.

              To set up an AUX area, first aux_offset needs to be set with
              an offset greater than data_offset+data_size and aux_size
              needs to be set to the desired buffer size.  The desired
              offset and size must be page aligned, and the size must be a
              power of two.  These values are then passed to mmap in order
              to map the AUX buffer.  Pages in the AUX buffer are included
              as part of the RLIMIT_MEMLOCK resource limit (see
              setrlimit(2)), and also as part of the perf_event_mlock_kb
              allowance.

              By default, the AUX buffer will be truncated if it will not
              fit in the available space in the ring buffer.  If the AUX
              buffer is mapped as a read only buffer, then it will operate
              in ring buffer mode where old data will be overwritten by new.
              In overwrite mode, it might not be possible to infer where the
              new data began, and it is the consumer's job to disable
              measurement while reading to avoid possible data races.

              The aux_head and aux_tail ring buffer pointers have the same
              behavior and ordering rules as the previously described
              data_head and data_tail.

       The following 2^n ring-buffer pages have the layout described below.

       If perf_event_attr.sample_id_all is set, then all event types will
       have the sample_type selected fields related to where/when (identity)
       an event took place (TID, TIME, ID, CPU, STREAM_ID), as described in
       PERF_RECORD_SAMPLE below.  These fields are stashed just after the
       perf_event_header and the fields already present for the existing
       record types, that is, at the end of the payload.  That way a newer
       perf.data file will be supported by older perf tools, with these new
       optional fields being ignored.

       The mmap values start with a header:

           struct perf_event_header {
               __u32   type;
               __u16   misc;
               __u16   size;
           };
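
       Since every record starts with this header, a consumer can walk
       a run of records by stepping header.size bytes at a time.  The
       sketch below does this over a linear (already unwrapped) buffer.
       struct rec_header is a local stand-in for struct
       perf_event_header, count_records() is a hypothetical name, and
       the sketch assumes that size covers the header itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local stand-in with the same field layout as perf_event_header. */
struct rec_header {
    uint32_t type;
    uint16_t misc;
    uint16_t size;
};

/* Count the records packed back to back in buf.  Returns -1 if a record
 * claims an impossible size or if trailing bytes are left over. */
int count_records(const uint8_t *buf, size_t len)
{
    size_t off = 0;
    int n = 0;

    assert(buf != NULL);
    while (off + sizeof(struct rec_header) <= len) {
        struct rec_header h;

        memcpy(&h, buf + off, sizeof(h));   /* avoid unaligned access */
        if (h.size < sizeof(h) || h.size > len - off)
            return -1;
        off += h.size;                      /* step to the next record */
        n++;
    }
    return off == len ? n : -1;
}
```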

       Below, we describe the perf_event_header fields in more detail.  For
       ease of reading, the fields with shorter descriptions are presented
       first.

       size   This indicates the size of the record.

       misc   The misc field contains additional information about the
              sample.

              The CPU mode can be determined from this value by masking with
              PERF_RECORD_MISC_CPUMODE_MASK and looking for one of the
              following (note these are not bit masks; only one can be set
              at a time):

              PERF_RECORD_MISC_CPUMODE_UNKNOWN
                     Unknown CPU mode.

              PERF_RECORD_MISC_KERNEL
                     Sample happened in the kernel.

              PERF_RECORD_MISC_USER
                     Sample happened in user code.

              PERF_RECORD_MISC_HYPERVISOR
                     Sample happened in the hypervisor.

              PERF_RECORD_MISC_GUEST_KERNEL (since Linux 2.6.35)
                     Sample happened in the guest kernel.

              PERF_RECORD_MISC_GUEST_USER (since Linux 2.6.35)
                     Sample happened in guest user code.

              In addition, one of the following bits can be set:

              PERF_RECORD_MISC_MMAP_DATA (since Linux 3.10)
                     This is set when the mapping is not executable;
                     otherwise the mapping is executable.

              PERF_RECORD_MISC_COMM_EXEC (since Linux 3.16)
                     This is set for a PERF_RECORD_COMM record on kernels
                     more recent than Linux 3.16 if a process name change
                     was caused by an exec(2) system call.  It is an alias
                     for PERF_RECORD_MISC_MMAP_DATA since the two values
                     would not be set in the same record.

              PERF_RECORD_MISC_EXACT_IP
                     This indicates that the content of PERF_SAMPLE_IP
                     points to the actual instruction that triggered the
                     event.  See also perf_event_attr.precise_ip.

              PERF_RECORD_MISC_EXT_RESERVED (since Linux 2.6.35)
                     This indicates there is extended data available
                     (currently not used).
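
              Decoding misc could look like the sketch below.  The
              MISC_* names are local copies of the corresponding
              PERF_RECORD_MISC_* values, and cpumode_name() and
              misc_exact_ip() are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local copies of PERF_RECORD_MISC_* values from <linux/perf_event.h>. */
#define MISC_CPUMODE_MASK    0x7
#define MISC_CPUMODE_UNKNOWN 0
#define MISC_KERNEL          1
#define MISC_USER            2
#define MISC_HYPERVISOR      3
#define MISC_GUEST_KERNEL    4
#define MISC_GUEST_USER      5
#define MISC_EXACT_IP        (1u << 14)

/* The CPU mode is an enumerated value in the low bits, not a bit set. */
const char *cpumode_name(uint16_t misc)
{
    switch (misc & MISC_CPUMODE_MASK) {
    case MISC_CPUMODE_UNKNOWN: return "unknown";
    case MISC_KERNEL:          return "kernel";
    case MISC_USER:            return "user";
    case MISC_HYPERVISOR:      return "hypervisor";
    case MISC_GUEST_KERNEL:    return "guest kernel";
    case MISC_GUEST_USER:      return "guest user";
    default:                   return "invalid";
    }
}

/* The remaining flags are ordinary single-bit tests. */
int misc_exact_ip(uint16_t misc)
{
    return (misc & MISC_EXACT_IP) != 0;
}
```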

       type   The type value is one of the below.  The values in the
              corresponding record (that follows the header) depend on the
              type selected as shown.

              PERF_RECORD_MMAP
                  The MMAP events record the PROT_EXEC mappings so that we
                  can correlate user-space IPs to code.  They have the
                  following structure:

                      struct {
                          struct perf_event_header header;
                          u32    pid, tid;
                          u64    addr;
                          u64    len;
                          u64    pgoff;
                          char   filename[];
                      };

                  pid    is the process ID.

                  tid    is the thread ID.

                  addr   is the address of the allocated memory.

                  len    is the length of the allocated memory.

                  pgoff  is the page offset of the allocated memory.

                  filename
                         is a string describing the backing of the allocated
                         memory.

              PERF_RECORD_LOST
                  This record indicates when events are lost.

                      struct {
                          struct perf_event_header header;
                          u64 id;
                          u64 lost;
                          struct sample_id sample_id;
                      };

                  id     is the unique event ID for the samples that were
                         lost.

                  lost   is the number of events that were lost.

              PERF_RECORD_COMM
                  This record indicates a change in the process name.

                      struct {
                          struct perf_event_header header;
                          u32 pid;
                          u32 tid;
                          char comm[];
                          struct sample_id sample_id;
                      };

                  pid    is the process ID.

                  tid    is the thread ID.

                  comm   is a string containing the new name of the process.

              PERF_RECORD_EXIT
                  This record indicates a process exit event.

                      struct {
                          struct perf_event_header header;
                          u32 pid, ppid;
                          u32 tid, ptid;
                          u64 time;
                          struct sample_id sample_id;
                      };

              PERF_RECORD_THROTTLE, PERF_RECORD_UNTHROTTLE
                  This record indicates a throttle/unthrottle event.

                      struct {
                          struct perf_event_header header;
                          u64 time;
                          u64 id;
                          u64 stream_id;
                          struct sample_id sample_id;
                      };

              PERF_RECORD_FORK
                  This record indicates a fork event.

                      struct {
                          struct perf_event_header header;
                          u32 pid, ppid;
                          u32 tid, ptid;
                          u64 time;
                          struct sample_id sample_id;
                      };

              PERF_RECORD_READ
                  This record indicates a read event.

                      struct {
                          struct perf_event_header header;
                          u32 pid, tid;
                          struct read_format values;
                          struct sample_id sample_id;
                      };

              PERF_RECORD_SAMPLE
                  This record indicates a sample.

                      struct {
                          struct perf_event_header header;
                          u64   sample_id;  /* if PERF_SAMPLE_IDENTIFIER */
                          u64   ip;         /* if PERF_SAMPLE_IP */
                          u32   pid, tid;   /* if PERF_SAMPLE_TID */
                          u64   time;       /* if PERF_SAMPLE_TIME */
                          u64   addr;       /* if PERF_SAMPLE_ADDR */
                          u64   id;         /* if PERF_SAMPLE_ID */
                          u64   stream_id;  /* if PERF_SAMPLE_STREAM_ID */
                          u32   cpu, res;   /* if PERF_SAMPLE_CPU */
                          u64   period;     /* if PERF_SAMPLE_PERIOD */
                          struct read_format v; /* if PERF_SAMPLE_READ */
                          u64   nr;         /* if PERF_SAMPLE_CALLCHAIN */
                          u64   ips[nr];    /* if PERF_SAMPLE_CALLCHAIN */
                          u32   size;       /* if PERF_SAMPLE_RAW */
                          char  data[size]; /* if PERF_SAMPLE_RAW */
                          u64   bnr;        /* if PERF_SAMPLE_BRANCH_STACK */
                          struct perf_branch_entry lbr[bnr];
                                            /* if PERF_SAMPLE_BRANCH_STACK */
                          u64   abi;        /* if PERF_SAMPLE_REGS_USER */
                          u64   regs[weight(mask)];
                                            /* if PERF_SAMPLE_REGS_USER */
                          u64   size;       /* if PERF_SAMPLE_STACK_USER */
                          char  data[size]; /* if PERF_SAMPLE_STACK_USER */
                          u64   dyn_size;   /* if PERF_SAMPLE_STACK_USER && size != 0 */
                          u64   weight;     /* if PERF_SAMPLE_WEIGHT */
                          u64   data_src;   /* if PERF_SAMPLE_DATA_SRC */
                          u64   transaction;/* if PERF_SAMPLE_TRANSACTION */
                          u64   abi;        /* if PERF_SAMPLE_REGS_INTR */
                          u64   regs[weight(mask)];
                                            /* if PERF_SAMPLE_REGS_INTR */
                      };
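
                  Because every field is conditional on sample_type, a
                  parser has to walk the record with a cursor rather
                  than overlay a fixed struct.  The sketch below
                  decodes only the first few fields.  The SMP_* macros
                  are local copies of the PERF_SAMPLE_* bit values, and
                  struct sample_prefix, take(), and
                  parse_sample_prefix() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local copies of a few PERF_SAMPLE_* bits from <linux/perf_event.h>. */
#define SMP_IP         (1u << 0)
#define SMP_TID        (1u << 1)
#define SMP_TIME       (1u << 2)
#define SMP_ADDR       (1u << 3)
#define SMP_IDENTIFIER (1u << 16)

/* Hypothetical holder for the leading fields of a sample. */
struct sample_prefix {
    uint64_t sample_id, ip;
    uint32_t pid, tid;
    uint64_t time, addr;
};

/* Copy n bytes from the cursor and advance it; 0 if too few remain. */
static int take(const uint8_t **p, size_t *left, void *dst, size_t n)
{
    if (*left < n)
        return 0;
    memcpy(dst, *p, n);
    *p += n;
    *left -= n;
    return 1;
}

/* Decode the leading fields of a sample body (the bytes after the
 * header).  Returns the bytes consumed, or -1 if the body is short. */
int parse_sample_prefix(const uint8_t *body, size_t len,
                        uint64_t sample_type, struct sample_prefix *out)
{
    const uint8_t *p = body;
    size_t left = len;

    assert(body != NULL && out != NULL);
    memset(out, 0, sizeof(*out));
    if ((sample_type & SMP_IDENTIFIER) &&
        !take(&p, &left, &out->sample_id, 8))
        return -1;
    if ((sample_type & SMP_IP) && !take(&p, &left, &out->ip, 8))
        return -1;
    if ((sample_type & SMP_TID) &&
        (!take(&p, &left, &out->pid, 4) || !take(&p, &left, &out->tid, 4)))
        return -1;
    if ((sample_type & SMP_TIME) && !take(&p, &left, &out->time, 8))
        return -1;
    if ((sample_type & SMP_ADDR) && !take(&p, &left, &out->addr, 8))
        return -1;
    return (int)(len - left);
}
```

                  The remaining fields would be handled the same way,
                  in the order shown in the structure above.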

                  sample_id
                      If PERF_SAMPLE_IDENTIFIER is enabled, a 64-bit unique
                      ID is included.  This is a duplication of the
                      PERF_SAMPLE_ID id value, but included at the beginning
                      of the sample so parsers can easily obtain the value.

                  ip  If PERF_SAMPLE_IP is enabled, then a 64-bit
                      instruction pointer value is included.

                  pid, tid
                      If PERF_SAMPLE_TID is enabled, then a 32-bit process
                      ID and 32-bit thread ID are included.

                  time
                      If PERF_SAMPLE_TIME is enabled, then a 64-bit
                      timestamp is included.  This is obtained via
                      local_clock() which is a hardware timestamp if
                      available and the jiffies value if not.

                  addr
                      If PERF_SAMPLE_ADDR is enabled, then a 64-bit address
                      is included.  This is usually the address of a
                      tracepoint, breakpoint, or software event; otherwise
                      the value is 0.

                  id  If PERF_SAMPLE_ID is enabled, a 64-bit unique ID is
                      included.  If the event is a member of an event group,
                      the group leader ID is returned.  This ID is the same
                      as the one returned by PERF_FORMAT_ID.

                  stream_id
                      If PERF_SAMPLE_STREAM_ID is enabled, a 64-bit unique
                      ID is included.  Unlike PERF_SAMPLE_ID the actual ID
                      is returned, not the group leader.  This ID is the
                      same as the one returned by PERF_FORMAT_ID.

                  cpu, res
                      If PERF_SAMPLE_CPU is enabled, this is a 32-bit value
                      indicating which CPU was being used, in addition to a
                      reserved (unused) 32-bit value.

                  period
                      If PERF_SAMPLE_PERIOD is enabled, a 64-bit value
                      indicating the current sampling period is written.

                  v   If PERF_SAMPLE_READ is enabled, a structure of type
                      read_format is included which has values for all
                      events in the event group.  The values included depend
                      on the read_format value used at perf_event_open()
                      time.

                  nr, ips[nr]
                      If PERF_SAMPLE_CALLCHAIN is enabled, then a 64-bit
                      number is included which indicates how many 64-bit
                      instruction pointers follow.  This is the current
                      callchain.

                  size, data[size]
                      If PERF_SAMPLE_RAW is enabled, then a 32-bit value
                      indicating size is included followed by an array of
                      8-bit values of length size.  The values are padded
                      with 0 to have 64-bit alignment.

                      This RAW record data is opaque with respect to the
                      ABI.  The ABI doesn't make any promises with respect
                      to the stability of its content; it may vary depending
                      on event, hardware, and kernel version.

                  bnr, lbr[bnr]
                      If PERF_SAMPLE_BRANCH_STACK is enabled, then a 64-bit
                      value indicating the number of records is included,
                      followed by bnr perf_branch_entry structures which
                      each include the fields:

                      from   This indicates the source instruction (may not
                             be a branch).

                      to     The branch target.

                      mispred
                             The branch target was mispredicted.

                      predicted
                             The branch target was predicted.

                      in_tx (since Linux 3.11)
                             The branch was in a transactional memory
                             transaction.

                      abort (since Linux 3.11)
                             The branch was in an aborted transactional
                             memory transaction.

                      The entries are from most to least recent, so the
                      first entry has the most recent branch.

                      Support for mispred and predicted is optional; if not
                      supported, both values will be 0.

                      The type of branches recorded is specified by the
                      branch_sample_type field.

                  abi, regs[weight(mask)]
                      If PERF_SAMPLE_REGS_USER is enabled, then the user CPU
                      registers are recorded.

                      The abi field is one of PERF_SAMPLE_REGS_ABI_NONE,
                      PERF_SAMPLE_REGS_ABI_32 or PERF_SAMPLE_REGS_ABI_64.

                      The regs field is an array of the CPU registers that
                      were specified by the sample_regs_user attr field.
                      The number of values is the number of bits set in the
                      sample_regs_user bit mask.
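
                      Here weight(mask) is the population count of the
                      register mask.  The sketch below computes it and
                      also locates one register inside regs[].  Both
                      function names are hypothetical, and the lookup
                      assumes registers are stored in ascending bit
                      order of the mask, which should be checked
                      against the kernel in use.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bits set in mask, that is, the length of regs[]. */
unsigned reg_mask_weight(uint64_t mask)
{
    unsigned n = 0;

    while (mask) {
        mask &= mask - 1;       /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Index of register number reg within regs[], or -1 if that register
 * was not requested in mask.  Assumes ascending bit order. */
int reg_slot(uint64_t mask, unsigned reg)
{
    uint64_t bit;

    assert(reg < 64);
    bit = (uint64_t)1 << reg;
    if (!(mask & bit))
        return -1;
    return (int)reg_mask_weight(mask & (bit - 1));
}
```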

                  size, data[size], dyn_size
                      If PERF_SAMPLE_STACK_USER is enabled, then the user
                      stack is recorded.  This can be used to generate stack
                      backtraces.  size is the size requested by the user in
                      sample_stack_user or else the maximum record size.
                      data is the stack data (a raw dump of the memory
                      pointed to by the stack pointer at the time of
                      sampling).  dyn_size is the amount of data actually
                      dumped (can be less than size).  Note that dyn_size is
                      omitted if size is 0.

                  weight
                      If PERF_SAMPLE_WEIGHT is enabled, then a 64-bit value
                      provided by the hardware is recorded that indicates
                      how costly the event was.  This allows expensive
                      events to stand out more clearly in profiles.

                  data_src
                      If PERF_SAMPLE_DATA_SRC is enabled, then a 64-bit
                      value is recorded that is made up of the following
                      fields:

                      mem_op
                          Type of opcode, a bitwise combination of:

                          PERF_MEM_OP_NA          Not available
                          PERF_MEM_OP_LOAD        Load instruction
                          PERF_MEM_OP_STORE       Store instruction
                          PERF_MEM_OP_PFETCH      Prefetch
                          PERF_MEM_OP_EXEC        Executable code

                      mem_lvl
                          Memory hierarchy level hit or miss, a bitwise
                          combination of the following, shifted left by
                          PERF_MEM_LVL_SHIFT:

                          PERF_MEM_LVL_NA         Not available
                          PERF_MEM_LVL_HIT        Hit
                          PERF_MEM_LVL_MISS       Miss
                          PERF_MEM_LVL_L1         Level 1 cache
                          PERF_MEM_LVL_LFB        Line fill buffer
                          PERF_MEM_LVL_L2         Level 2 cache
                          PERF_MEM_LVL_L3         Level 3 cache
                          PERF_MEM_LVL_LOC_RAM    Local DRAM
                          PERF_MEM_LVL_REM_RAM1   Remote DRAM 1 hop
                          PERF_MEM_LVL_REM_RAM2   Remote DRAM 2 hops
                          PERF_MEM_LVL_REM_CCE1   Remote cache 1 hop
                          PERF_MEM_LVL_REM_CCE2   Remote cache 2 hops
                          PERF_MEM_LVL_IO         I/O memory
                          PERF_MEM_LVL_UNC        Uncached memory

                      mem_snoop
                          Snoop mode, a bitwise combination of the
                          following, shifted left by PERF_MEM_SNOOP_SHIFT:

                          PERF_MEM_SNOOP_NA       Not available
                          PERF_MEM_SNOOP_NONE     No snoop
                          PERF_MEM_SNOOP_HIT      Snoop hit
                          PERF_MEM_SNOOP_MISS     Snoop miss
                          PERF_MEM_SNOOP_HITM     Snoop hit modified

                      mem_lock
                          Lock instruction, a bitwise combination of the
                          following, shifted left by PERF_MEM_LOCK_SHIFT:

                          PERF_MEM_LOCK_NA        Not available
                          PERF_MEM_LOCK_LOCKED    Locked transaction

                      mem_dtlb
                          TLB access hit or miss, a bitwise combination of
                          the following, shifted left by PERF_MEM_TLB_SHIFT:

                          PERF_MEM_TLB_NA         Not available
                          PERF_MEM_TLB_HIT        Hit
                          PERF_MEM_TLB_MISS       Miss
                          PERF_MEM_TLB_L1         Level 1 TLB
                          PERF_MEM_TLB_L2         Level 2 TLB
                          PERF_MEM_TLB_WK         Hardware walker
                          PERF_MEM_TLB_OS         OS fault handler
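
                      Pulling the subfields out of data_src is a matter
                      of shifting and masking.  In the sketch below the
                      MEM_* macros are local copies of the
                      corresponding PERF_MEM_* values as found in
                      linux/perf_event.h (they should be checked
                      against the header in use), each field width is
                      derived from the distance to the next shift, and
                      all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of PERF_MEM_*_SHIFT and a few flag values. */
#define MEM_OP_SHIFT     0
#define MEM_LVL_SHIFT    5
#define MEM_SNOOP_SHIFT 19
#define MEM_LOCK_SHIFT  24
#define MEM_TLB_SHIFT   26

#define MEM_OP_LOAD   0x02
#define MEM_OP_STORE  0x04
#define MEM_LVL_HIT   0x02
#define MEM_LVL_MISS  0x04
#define MEM_LVL_L1    0x08

/* Extract a bit field of the given width starting at bit shift. */
static uint64_t field(uint64_t v, unsigned shift, unsigned width)
{
    assert(width >= 1 && width < 64 && shift + width <= 64);
    return (v >> shift) & (((uint64_t)1 << width) - 1);
}

uint64_t mem_op(uint64_t data_src)
{
    return field(data_src, MEM_OP_SHIFT, MEM_LVL_SHIFT - MEM_OP_SHIFT);
}

uint64_t mem_lvl(uint64_t data_src)
{
    return field(data_src, MEM_LVL_SHIFT, MEM_SNOOP_SHIFT - MEM_LVL_SHIFT);
}

uint64_t mem_snoop(uint64_t data_src)
{
    return field(data_src, MEM_SNOOP_SHIFT, MEM_LOCK_SHIFT - MEM_SNOOP_SHIFT);
}

uint64_t mem_lock(uint64_t data_src)
{
    return field(data_src, MEM_LOCK_SHIFT, MEM_TLB_SHIFT - MEM_LOCK_SHIFT);
}

/* Example predicate: a load that hit in the level 1 cache. */
int is_l1_load_hit(uint64_t data_src)
{
    uint64_t lvl = mem_lvl(data_src);

    return (mem_op(data_src) & MEM_OP_LOAD) &&
           (lvl & MEM_LVL_L1) && (lvl & MEM_LVL_HIT);
}
```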

                  transaction
                      If the PERF_SAMPLE_TRANSACTION flag is set, then a
                      64-bit field is recorded describing the sources of any
                      transactional memory aborts.

                      The field is a bitwise combination of the following
                      values:

                      PERF_TXN_ELISION
                             Abort from an elision type transaction (Intel-
                             CPU-specific).

                      PERF_TXN_TRANSACTION
                             Abort from a generic transaction.

                      PERF_TXN_SYNC
                             Synchronous abort (related to the reported
                             instruction).

                      PERF_TXN_ASYNC
                             Asynchronous abort (not related to the reported
                             instruction).

                      PERF_TXN_RETRY
                             Retryable abort (retrying the transaction may
                             have succeeded).

                      PERF_TXN_CONFLICT
                             Abort due to memory conflicts with other
                             threads.

                      PERF_TXN_CAPACITY_WRITE
                             Abort due to write capacity overflow.

                      PERF_TXN_CAPACITY_READ
                             Abort due to read capacity overflow.

                      In addition, a user-specified abort code can be
                      obtained from the high 32 bits of the field by
                      masking with PERF_TXN_ABORT_MASK and then shifting
                      right by PERF_TXN_ABORT_SHIFT.
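
                      A minimal sketch of decoding this field, assuming
                      the PERF_TXN_* constants from <linux/perf_event.h>.
                      The helper names are hypothetical.  Because
                      PERF_TXN_ABORT_MASK selects the high 32 bits in
                      place, the mask is applied before the shift.

```c
#include <linux/perf_event.h>
#include <stdint.h>

/* Extract the user-specified abort code from a PERF_SAMPLE_TRANSACTION
 * value: keep only the high 32 bits, then shift them down. */
static uint32_t txn_abort_code(uint64_t txn)
{
    return (uint32_t)((txn & PERF_TXN_ABORT_MASK) >> PERF_TXN_ABORT_SHIFT);
}

/* Return nonzero if the abort was caused by a read or write capacity
 * overflow. */
static int txn_is_capacity_abort(uint64_t txn)
{
    return (txn & (PERF_TXN_CAPACITY_WRITE | PERF_TXN_CAPACITY_READ)) != 0;
}
```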

                  abi, regs[weight(mask)]
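                      If PERF_SAMPLE_REGS_INTR is enabled, then the CPU
                      registers at the time of the sample are recorded.
                      Unlike PERF_SAMPLE_REGS_USER, these may be kernel
                      register values if the overflow happened while
                      kernel code was running.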

                      The abi field is one of PERF_SAMPLE_REGS_ABI_NONE,
                      PERF_SAMPLE_REGS_ABI_32 or PERF_SAMPLE_REGS_ABI_64.

                      The regs field is an array of the CPU registers that
                      were specified by the sample_regs_intr attr field.
                      The number of values is the number of bits set in the
                      sample_regs_intr bit mask.
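
                      The weight(mask) in the field name is a population
                      count.  A small sketch, assuming a GCC- or
                      Clang-compatible compiler for the builtin; the
                      function name is hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Number of u64 values in regs[]: one per bit set in the
 * sample_regs_intr mask. */
static size_t sample_regs_count(uint64_t sample_regs_intr)
{
    return (size_t)__builtin_popcountll(sample_regs_intr);
}
```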

              PERF_RECORD_MMAP2
                  This record includes extended information on mmap(2) calls
                  returning executable mappings.  The format is similar to
                  that of the PERF_RECORD_MMAP record, but includes extra
                  values that allow uniquely identifying shared mappings.

                      struct {
                          struct perf_event_header header;
                          u32 pid;
                          u32 tid;
                          u64 addr;
                          u64 len;
                          u64 pgoff;
                          u32 maj;
                          u32 min;
                          u64 ino;
                          u64 ino_generation;
                          u32 prot;
                          u32 flags;
                          char filename[];
                          struct sample_id sample_id;
                      };

                  pid    is the process ID.

                  tid    is the thread ID.

                  addr   is the address of the allocated memory.

                  len    is the length of the allocated memory.

                  pgoff  is the page offset of the allocated memory.

                  maj    is the major ID of the underlying device.

                  min    is the minor ID of the underlying device.

                  ino    is the inode number.

                  ino_generation
                         is the inode generation.

                  prot   is the protection information.

                  flags  is the flags information.

                  filename
                         is a string describing the backing of the allocated
                         memory.
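
                  As an illustration of the layout above, the following
                  sketch mirrors the fixed-size part of the record and
                  locates the variable-length filename that follows it.
                  The struct and function names are hypothetical; the
                  code assumes record points at a complete
                  PERF_RECORD_MMAP2 record copied out of the ring buffer.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* Fixed-size portion of PERF_RECORD_MMAP2, in the order listed above. */
struct mmap2_fixed {
    struct perf_event_header header;
    uint32_t pid;
    uint32_t tid;
    uint64_t addr;
    uint64_t len;
    uint64_t pgoff;
    uint32_t maj;
    uint32_t min;
    uint64_t ino;
    uint64_t ino_generation;
    uint32_t prot;
    uint32_t flags;
};

/* Copy the fixed part out of a raw record; memcpy avoids any
 * alignment assumptions about the source buffer. */
static void mmap2_read_fixed(const void *record, struct mmap2_fixed *out)
{
    memcpy(out, record, sizeof(*out));
}

/* The null-terminated filename starts right after the flags field. */
static const char *mmap2_filename(const void *record)
{
    return (const char *)record + sizeof(struct mmap2_fixed);
}
```

                  The trailing sample_id, when present, follows the
                  filename and is not handled by this sketch.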

              PERF_RECORD_AUX (since Linux 4.1)

                  This record reports that new data is available in the
                  separate AUX buffer region.

                      struct {
                          struct perf_event_header header;
                          u64 aux_offset;
                          u64 aux_size;
                          u64 flags;
                          struct sample_id sample_id;
                      };

                  aux_offset
                         offset in the AUX mmap region where the new data
                         begins.

                  aux_size
                         size of the data made available.

                  flags  describes the AUX update.

                         PERF_AUX_FLAG_TRUNCATED
                                if set, then the data returned was truncated
                                to fit the available buffer size.

                         PERF_AUX_FLAG_OVERWRITE
                                if set, then the data returned has
                                overwritten previous data.

              PERF_RECORD_ITRACE_START (since Linux 4.1)

                  This record indicates which process has initiated an
                  instruction trace event, allowing tools to properly
                  correlate the instruction addresses in the AUX buffer with
                  the proper executable.

                      struct {
                          struct perf_event_header header;
                          u32 pid;
                          u32 tid;
                      };

                  pid    process ID of the thread starting an instruction
                         trace.

                  tid    thread ID of the thread starting an instruction
                         trace.

   Overflow handling
       Events can be set to notify when a threshold is crossed, indicating
       an overflow.  Overflow conditions can be captured by monitoring the
       event file descriptor with poll(2), select(2), or epoll(2).
       Alternatively, a SIGIO signal handler can be created and the event
       configured with fcntl(2) to generate SIGIO signals.

       Overflows are generated only by sampling events (sample_period must
       have a nonzero value).

       There are two ways to generate overflow notifications.

       The first is to set a wakeup_events or wakeup_watermark value that
       will trigger if a certain number of samples or bytes have been
       written to the mmap ring buffer.  In this case POLL_IN is indicated.

       The other way is by use of the PERF_EVENT_IOC_REFRESH ioctl.  This
       ioctl adds to a counter that decrements each time the event
       overflows.  While the counter is nonzero, POLL_IN is indicated on
       each overflow; once the counter reaches 0, POLL_HUP is indicated
       and the underlying event is disabled.

       Refreshing an event group leader refreshes all siblings and
       refreshing with a parameter of 0 currently enables infinite
       refreshes; these behaviors are unsupported and should not be relied
       on.

       Starting with Linux 3.18, POLL_HUP is indicated if the event being
       monitored is attached to a different process and that process exits.
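
       The poll(2) approach can be sketched as follows.  This is an
       illustration only: the enum and function names are hypothetical,
       and perf_fd is assumed to be a descriptor returned by
       perf_event_open().  With poll(2) the conditions appear as the
       POLLIN and POLLHUP bits in revents.

```c
#include <poll.h>

enum overflow_status {
    OVERFLOW_ERROR   = -1,  /* poll(2) failed or reported an error bit */
    OVERFLOW_TIMEOUT = 0,   /* nothing happened within timeout_ms */
    OVERFLOW_DATA    = 1,   /* wakeup: new data in the ring buffer */
    OVERFLOW_HANGUP  = 2,   /* refresh count hit 0, or target exited */
};

/* Wait up to timeout_ms milliseconds for an overflow notification. */
static enum overflow_status wait_for_overflow(int perf_fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = perf_fd, .events = POLLIN };
    int n = poll(&pfd, 1, timeout_ms);

    if (n < 0)
        return OVERFLOW_ERROR;
    if (n == 0)
        return OVERFLOW_TIMEOUT;
    /* Report a hangup first so the caller knows the event is disabled. */
    if (pfd.revents & POLLHUP)
        return OVERFLOW_HANGUP;
    if (pfd.revents & POLLIN)
        return OVERFLOW_DATA;
    return OVERFLOW_ERROR;
}
```

       A caller that receives OVERFLOW_HANGUP may still want to drain
       the ring buffer, since samples can be pending when the event is
       disabled.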

   rdpmc instruction
       Starting with Linux 3.4 on x86, you can use the rdpmc instruction to
       get low-latency reads without having to enter the kernel.  Note that
       using rdpmc is not necessarily faster than other methods for reading
       event values.

       Support for this can be detected with the cap_usr_rdpmc field in the
       mmap page; documentation on how to calculate event values can be
       found in that section.

       Originally, when rdpmc support was enabled, any process (not just
       ones with an active perf event) could use the rdpmc instruction to
       access the counters.  Starting with Linux 4.0, rdpmc access is
       allowed only if an event is currently enabled in a process's
       context.  To restore the old behavior, write the value 2 to
       /sys/devices/cpu/rdpmc.

   perf_event ioctl calls
       Various ioctls act on perf_event_open() file descriptors:

       PERF_EVENT_IOC_ENABLE
              This enables the individual event or event group specified by
              the file descriptor argument.

              If the PERF_IOC_FLAG_GROUP bit is set in the ioctl argument,
              then all events in a group are enabled, even if the event
              specified is not the group leader (but see BUGS).

       PERF_EVENT_IOC_DISABLE
              This disables the individual counter or event group specified
              by the file descriptor argument.

              Enabling or disabling the leader of a group enables or
              disables the entire group; that is, while the group leader is
              disabled, none of the counters in the group will count.
              Enabling or disabling a member of a group other than the
              leader affects only that counter; disabling a non-leader stops
              that counter from counting but doesn't affect any other
              counter.

              If the PERF_IOC_FLAG_GROUP bit is set in the ioctl argument,
              then all events in a group are disabled, even if the event
              specified is not the group leader (but see BUGS).

       PERF_EVENT_IOC_REFRESH
              Non-inherited overflow counters can use this to enable a
              counter for a number of overflows specified by the argument,
              after which it is disabled.  Subsequent calls of this ioctl
              add the argument value to the current count.  An overflow
              notification with POLL_IN set will happen on each overflow
              until the count reaches 0; when that happens a notification
              with POLL_HUP set is sent and the event is disabled.  Using an
              argument of 0 is considered undefined behavior.

       PERF_EVENT_IOC_RESET
              Reset the event count specified by the file descriptor
              argument to zero.  This resets only the counts; there is no
              way to reset the multiplexing time_enabled or time_running
              values.

              If the PERF_IOC_FLAG_GROUP bit is set in the ioctl argument,
              then all events in a group are reset, even if the event
              specified is not the group leader (but see BUGS).

       PERF_EVENT_IOC_PERIOD
              This updates the overflow period for the event.

              Since Linux 3.7 (on ARM) and Linux 3.14 (all other
              architectures), the new period takes effect immediately.  On
              older kernels, the new period did not take effect until after
              the next overflow.

              The argument is a pointer to a 64-bit value containing the
              desired new period.

              Prior to Linux 2.6.36 this ioctl always failed due to a bug in
              the kernel.

       PERF_EVENT_IOC_SET_OUTPUT
              This tells the kernel to report event notifications to the
              specified file descriptor rather than the default one.  The
              file descriptors must all be on the same CPU.

              The argument specifies the desired file descriptor, or -1 if
              output should be ignored.

       PERF_EVENT_IOC_SET_FILTER (since Linux 2.6.33)
              This adds an ftrace filter to this event.

              The argument is a pointer to the desired ftrace filter.

       PERF_EVENT_IOC_ID (since Linux 3.12)
              This returns the event ID value for the given event file
              descriptor.

              The argument is a pointer to a 64-bit unsigned integer to hold
              the result.

       PERF_EVENT_IOC_SET_BPF (since Linux 4.1)
              This allows attaching a Berkeley Packet Filter (BPF) program
              to an existing kprobe tracepoint event.  You need
              CAP_SYS_ADMIN privileges to use this ioctl.

              The argument is a BPF program file descriptor that was created
              by a previous bpf(2) system call.
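
       As a sketch of how the ENABLE, DISABLE, and RESET ioctls are
       commonly combined around a region of interest, assuming fd is a
       descriptor returned by perf_event_open() (the helper names are
       hypothetical, and the PERF_IOC_FLAG_GROUP caveat in BUGS still
       applies):

```c
#include <linux/perf_event.h>
#include <sys/ioctl.h>

/* Zero the counts and start every event in the group that fd
 * belongs to.  Returns 0 on success, -1 on error (errno is set). */
static int perf_group_start(int fd)
{
    if (ioctl(fd, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1)
        return -1;
    return ioctl(fd, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);
}

/* Stop every event in the group that fd belongs to. */
static int perf_group_stop(int fd)
{
    return ioctl(fd, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
}
```

       The counts can then be obtained with read(2) on the descriptor
       after perf_group_stop() returns.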

   Using prctl
       A process can enable or disable all the event groups that are
       attached to it using the prctl(2) PR_TASK_PERF_EVENTS_ENABLE and
       PR_TASK_PERF_EVENTS_DISABLE operations.  This applies to all counters
       on the calling process, whether created by this process or by
       another, and does not affect any counters that this process has
       created on other processes.  It enables or disables only the group
       leaders, not any other members in the groups.

   perf_event related configuration files
       Files in /proc/sys/kernel/

           /proc/sys/kernel/perf_event_paranoid

                  The perf_event_paranoid file can be set to restrict access
                  to the performance counters.

                  2   allow only user-space measurements (default since
                      Linux 4.6).

                  1   allow both kernel and user measurements (default
                      before Linux 4.6).

                  0   allow access to CPU-specific data but not raw
                      tracepoint samples.

                  -1  no restrictions.

                  The existence of the perf_event_paranoid file is the
                  official method for determining if a kernel supports
                  perf_event_open().
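
                  A minimal sketch of reading this file and interpreting
                  the levels listed above.  The function names are
                  hypothetical; the path is passed in so that the same
                  helper reports a missing file as "not supported".

```c
#include <stdio.h>

/* Read the integer level from a perf_event_paranoid-style file.
 * Returns 0 on success, -1 if the file is missing or malformed. */
static int read_paranoid_level(const char *path, int *level)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (f == NULL)
        return -1;
    ok = (fscanf(f, "%d", level) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Per the list above, levels of 1 and below permit kernel
 * measurements; level 2 restricts to user space only. */
static int paranoid_allows_kernel(int level)
{
    return level <= 1;
}
```

                  Typical use would pass
                  "/proc/sys/kernel/perf_event_paranoid" as path.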

           /proc/sys/kernel/perf_event_max_sample_rate

                  This sets the maximum sample rate.  Setting this too high
                  can allow users to sample at a rate that impacts overall
                  machine performance and potentially lock up the machine.
                  The default value is 100000 (samples per second).

           /proc/sys/kernel/perf_event_mlock_kb

                  Maximum amount of memory, in kilobytes, that an
                  unprivileged user can mlock(2).  The default is 516
                  (kB).

       Files in /sys/bus/event_source/devices/
           Since Linux 2.6.34, the kernel supports having multiple PMUs
           available for monitoring.  Information on how to program these
           PMUs can be found under /sys/bus/event_source/devices/.  Each
           subdirectory corresponds to a different PMU.

           /sys/bus/event_source/devices/*/type (since Linux 2.6.38)
                  This contains an integer that can be used in the type
                  field of perf_event_attr to indicate that you wish to use
                  this PMU.

           /sys/bus/event_source/devices/cpu/rdpmc (since Linux 3.4)
                  If this file is 1, then direct user-space access to the
                  performance counter registers is allowed via the rdpmc
                  instruction.  This can be disabled by echoing 0 to the
                  file.

                  As of Linux 4.0 the behavior has changed, so that 1 now
                  means only allow access to processes with active perf
                  events, with 2 indicating the old allow-anyone-access
                  behavior.

           /sys/bus/event_source/devices/*/format/ (since Linux 3.4)
                  This subdirectory contains information on the
                  architecture-specific subfields available for programming
                  the various config fields in the perf_event_attr struct.

                  The content of each file is the name of the config field,
                  followed by a colon, followed by a series of integer bit
                  ranges separated by commas.  For example, the file event
                  may contain the value config1:1,6-10,44 which indicates
                  that event is an attribute that occupies bits 1, 6-10, and
                  44 of perf_event_attr::config1.
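
                  The format described above can be parsed into a field
                  name and a bit mask.  The following is a sketch under
                  that description only; the function name and error
                  convention are hypothetical, and bit numbers above 63
                  are rejected.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Parse a format/ file body such as "config1:1,6-10,44".
 * On success, copies the config field name into name (at most
 * name_len - 1 characters) and stores the occupied bits in *mask.
 * Returns 0 on success, -1 on malformed input. */
static int parse_format_spec(const char *spec, char *name, size_t name_len,
                             uint64_t *mask)
{
    const char *colon = strchr(spec, ':');
    const char *p;

    if (colon == NULL || colon == spec ||
        (size_t)(colon - spec) >= name_len)
        return -1;
    memcpy(name, spec, (size_t)(colon - spec));
    name[colon - spec] = '\0';

    *mask = 0;
    p = colon + 1;
    for (;;) {
        char *end;
        unsigned long lo = strtoul(p, &end, 10);
        unsigned long hi = lo;
        unsigned long b;

        if (end == p)
            return -1;              /* no digits where a bit was expected */
        if (*end == '-') {          /* a range such as 6-10 */
            p = end + 1;
            hi = strtoul(p, &end, 10);
            if (end == p)
                return -1;
        }
        if (lo > hi || hi > 63)
            return -1;
        for (b = lo; b <= hi; b++)
            *mask |= (uint64_t)1 << b;

        if (*end == ',') {
            p = end + 1;
            continue;
        }
        /* Accept end of string or the trailing newline of a sysfs file. */
        return (*end == '\0' || *end == '\n') ? 0 : -1;
    }
}
```

                  A value taken from a matching events/ entry would then
                  be scattered into the bits of the named config field
                  according to the returned mask.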

           /sys/bus/event_source/devices/*/events/ (since Linux 3.4)
                  This subdirectory contains files with predefined events.
                  The contents are strings describing the event settings
                  expressed in terms of the fields found in the previously
                  mentioned ./format/ directory.  These are not necessarily
                  complete lists of all events supported by a PMU, but
                  usually a subset of events deemed useful or interesting.

                  The content of each file is a list of attribute names
                  separated by commas.  Each entry has an optional value
                  (either hex or decimal).  If no value is specified, then
                  it is assumed to be a single-bit field with a value of 1.
                  An example entry may look like this:
                  event=0x2,inv,ldlat=3.

           /sys/bus/event_source/devices/*/uevent
                  This file is the standard kernel device interface for
                  injecting hotplug events.

           /sys/bus/event_source/devices/*/cpumask (since Linux 3.7)
                  The cpumask file contains a comma-separated list of
                  integers that indicate a representative CPU number for
                  each socket (package) on the motherboard.  This is needed
                  when setting up uncore or northbridge events, as those
                  PMUs present socket-wide events.
http://man7.org/linux/man-pages/man2/uname.2.html
10
SYSTEM CALL:
uname(2) - Linux manual page
FUNCTIONALITY:

       uname - get name and information about current kernel
SYNOPSIS:

       #include <sys/utsname.h>

       int uname(struct utsname *buf);
DESCRIPTION

       uname() returns system information in the structure pointed to by
       buf.  The utsname struct is defined in <sys/utsname.h>:

           struct utsname {
               char sysname[];    /* Operating system name (e.g., "Linux") */
               char nodename[];   /* Name within "some implementation-defined
                                     network" */
               char release[];    /* Operating system release (e.g., "2.6.28") */
               char version[];    /* Operating system version */
               char machine[];    /* Hardware identifier */
           #ifdef _GNU_SOURCE
               char domainname[]; /* NIS or YP domain name */
           #endif
           };

       The length of the arrays in a struct utsname is unspecified (see
       NOTES); the fields are terminated by a null byte ('\0').
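
       A short sketch of using uname() to obtain the running kernel's
       major and minor version from the release field.  The function
       names are hypothetical, and the parser assumes a release string
       that begins with "major.minor", as in the "2.6.28" example above.

```c
#include <stdio.h>
#include <sys/utsname.h>

/* Parse the leading "major.minor" of a release string.
 * Returns 0 on success, -1 if the string does not start that way. */
static int parse_release(const char *release, int *kmajor, int *kminor)
{
    return sscanf(release, "%d.%d", kmajor, kminor) == 2 ? 0 : -1;
}

/* Obtain the major and minor version of the running kernel. */
static int running_kernel_version(int *kmajor, int *kminor)
{
    struct utsname u;

    if (uname(&u) == -1)
        return -1;
    return parse_release(u.release, kmajor, kminor);
}
```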
http://man7.org/linux/man-pages/man2/sysinfo.2.html
11
SYSTEM CALL:
sysinfo(2) - Linux manual page
FUNCTIONALITY:

       sysinfo - return system information
SYNOPSIS:

       #include <sys/sysinfo.h>

       int sysinfo(struct sysinfo *info);
DESCRIPTION

       sysinfo() returns certain statistics on memory and swap usage, as
       well as the load average.

       Until Linux 2.3.16, sysinfo() returned information in the following
       structure:

           struct sysinfo {
               long uptime;             /* Seconds since boot */
               unsigned long loads[3];  /* 1, 5, and 15 minute load averages */
               unsigned long totalram;  /* Total usable main memory size */
               unsigned long freeram;   /* Available memory size */
               unsigned long sharedram; /* Amount of shared memory */
               unsigned long bufferram; /* Memory used by buffers */
               unsigned long totalswap; /* Total swap space size */
               unsigned long freeswap;  /* Swap space still available */
               unsigned short procs;    /* Number of current processes */
               char _f[22];             /* Pads structure to 64 bytes */
           };

       In the above structure, the sizes of the memory and swap fields are
       given in bytes.

       Since Linux 2.3.23 (i386) and Linux 2.3.48 (all architectures) the
       structure is:

           struct sysinfo {
               long uptime;             /* Seconds since boot */
               unsigned long loads[3];  /* 1, 5, and 15 minute load averages */
               unsigned long totalram;  /* Total usable main memory size */
               unsigned long freeram;   /* Available memory size */
               unsigned long sharedram; /* Amount of shared memory */
               unsigned long bufferram; /* Memory used by buffers */
               unsigned long totalswap; /* Total swap space size */
               unsigned long freeswap;  /* Swap space still available */
               unsigned short procs;    /* Number of current processes */
               unsigned long totalhigh; /* Total high memory size */
               unsigned long freehigh;  /* Available high memory size */
               unsigned int mem_unit;   /* Memory unit size in bytes */
               char _f[20-2*sizeof(long)-sizeof(int)];
                                        /* Padding to 64 bytes */
           };

       In the above structure, sizes of the memory and swap fields are given
       as multiples of mem_unit bytes.
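
       With the newer structure, a size in bytes is the field value
       multiplied by mem_unit.  A minimal sketch (the helper names are
       hypothetical); the multiplication is done in unsigned long long
       so that it does not overflow where long is 32 bits:

```c
#include <sys/sysinfo.h>

/* Total usable main memory in bytes. */
static unsigned long long total_ram_bytes(const struct sysinfo *si)
{
    return (unsigned long long)si->totalram * si->mem_unit;
}

/* Swap space still available, in bytes. */
static unsigned long long free_swap_bytes(const struct sysinfo *si)
{
    return (unsigned long long)si->freeswap * si->mem_unit;
}
```

       Both helpers expect a structure that was filled in by a
       successful call to sysinfo().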