Linux file system calls

http://man7.org/linux/man-pages/man2/close.2.html 10
SYSTEM CALL: close(2) - Linux manual page
FUNCTIONALITY: close - close a file descriptor
SYNOPSIS:
    #include <unistd.h>

    int close(int fd);

DESCRIPTION
    close() closes a file descriptor, so that it no longer refers to any file and may be reused. Any record locks (see fcntl(2)) held on the file it was associated with, and owned by the process, are removed (regardless of the file descriptor that was used to obtain the lock).

    If fd is the last file descriptor referring to the underlying open file description (see open(2)), the resources associated with the open file description are freed; if the file descriptor was the last reference to a file which has been removed using unlink(2), the file is deleted.

http://man7.org/linux/man-pages/man2/creat.2.html 12
SYSTEM CALL: open(2) - Linux manual page
FUNCTIONALITY: open, openat, creat - open and possibly create a file
SYNOPSIS:
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>

    int open(const char *pathname, int flags);
    int open(const char *pathname, int flags, mode_t mode);

    int creat(const char *pathname, mode_t mode);

    int openat(int dirfd, const char *pathname, int flags);
    int openat(int dirfd, const char *pathname, int flags, mode_t mode);

    Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

    openat():
        Since glibc 2.10: _POSIX_C_SOURCE >= 200809L
        Before glibc 2.10: _ATFILE_SOURCE

DESCRIPTION
    Given a pathname for a file, open() returns a file descriptor, a small, nonnegative integer for use in subsequent system calls (read(2), write(2), lseek(2), fcntl(2), etc.). The file descriptor returned by a successful call will be the lowest-numbered file descriptor not currently open for the process.

    By default, the new file descriptor is set to remain open across an execve(2) (i.e., the FD_CLOEXEC file descriptor flag described in fcntl(2) is initially disabled); the O_CLOEXEC flag, described below, can be used to change this default.
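The interaction between close() making a descriptor number reusable and open() returning the lowest-numbered free descriptor can be sketched as follows. This is only an illustration under the assumption of a single-threaded process; the helper name fd_number_is_reused is hypothetical and not part of any manual page.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper (not from the manual page): returns 1 if the
 * descriptor number released by close() is handed out again by the
 * next open(), 0 if not, and -1 on error. */
int fd_number_is_reused(const char *path)
{
    assert(path != NULL);

    int first = open(path, O_RDONLY);
    if (first == -1)
        return -1;
    if (close(first) == -1)
        return -1;

    /* In a single-threaded process, 'first' is now once again the
     * lowest-numbered descriptor that is not open. */
    int second = open(path, O_RDONLY);
    if (second == -1)
        return -1;

    int reused = (second == first);
    close(second);
    return reused;
}
```

In a multithreaded program another thread may open a file between the close() and the second open(), so the result should not be relied upon there.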
    The file offset is set to the beginning of the file (see lseek(2)).

    A call to open() creates a new open file description, an entry in the system-wide table of open files. The open file description records the file offset and the file status flags (see below). A file descriptor is a reference to an open file description; this reference is unaffected if pathname is subsequently removed or modified to refer to a different file. For further details on open file descriptions, see NOTES.

    The argument flags must include one of the following access modes: O_RDONLY, O_WRONLY, or O_RDWR. These request opening the file read-only, write-only, or read/write, respectively.

    In addition, zero or more file creation flags and file status flags can be bitwise-or'd in flags. The file creation flags are O_CLOEXEC, O_CREAT, O_DIRECTORY, O_EXCL, O_NOCTTY, O_NOFOLLOW, O_TMPFILE, and O_TRUNC. The file status flags are all of the remaining flags listed below. The distinction between these two groups of flags is that the file status flags can be retrieved and (in some cases) modified; see fcntl(2) for details.

    The full list of file creation flags and file status flags is as follows:

    O_APPEND
        The file is opened in append mode. Before each write(2), the file offset is positioned at the end of the file, as if with lseek(2). O_APPEND may lead to corrupted files on NFS filesystems if more than one process appends data to a file at once. This is because NFS does not support appending to a file, so the client kernel has to simulate it, which can't be done without a race condition.

    O_ASYNC
        Enable signal-driven I/O: generate a signal (SIGIO by default, but this can be changed via fcntl(2)) when input or output becomes possible on this file descriptor. This feature is available only for terminals, pseudoterminals, sockets, and (since Linux 2.6) pipes and FIFOs. See fcntl(2) for further details. See also BUGS, below.
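As an illustration of the O_APPEND behavior described above, the following sketch moves the file offset to the start of the file and then writes; the data should still land at the end. The helper name append_text and the 0644 mode are assumptions chosen for the example, and a local (non-NFS) filesystem is assumed.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: appends 'text' to 'path' and returns the file
 * size afterwards, or -1 on error. */
off_t append_text(const char *path, const char *text)
{
    assert(path != NULL && text != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd == -1)
        return -1;

    /* Rewinding makes no difference here: with O_APPEND the offset is
     * positioned at end-of-file before each write(2). */
    if (lseek(fd, 0, SEEK_SET) == (off_t)-1) {
        close(fd);
        return -1;
    }

    size_t len = strlen(text);
    if (write(fd, text, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }

    off_t size = lseek(fd, 0, SEEK_END);
    close(fd);
    return size;
}
```

Calling the helper twice on the same path should grow the file rather than overwrite it, which is the property the flag is meant to provide.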
    O_CLOEXEC (since Linux 2.6.23)
        Enable the close-on-exec flag for the new file descriptor. Specifying this flag permits a program to avoid additional fcntl(2) F_SETFD operations to set the FD_CLOEXEC flag.

        Note that the use of this flag is essential in some multithreaded programs, because using a separate fcntl(2) F_SETFD operation to set the FD_CLOEXEC flag does not suffice to avoid race conditions where one thread opens a file descriptor and attempts to set its close-on-exec flag using fcntl(2) at the same time as another thread does a fork(2) plus execve(2). Depending on the order of execution, the race may lead to the file descriptor returned by open() being unintentionally leaked to the program executed by the child process created by fork(2). (This kind of race is in principle possible for any system call that creates a file descriptor whose close-on-exec flag should be set, and various other Linux system calls provide an equivalent of the O_CLOEXEC flag to deal with this problem.)

    O_CREAT
        If the file does not exist, it will be created. The owner (user ID) of the file is set to the effective user ID of the process. The group ownership (group ID) is set either to the effective group ID of the process or to the group ID of the parent directory (depending on filesystem type and mount options, and the mode of the parent directory; see the mount options bsdgroups and sysvgroups described in mount(8)).

        The mode argument specifies the file mode bits to be applied when a new file is created. This argument must be supplied when O_CREAT or O_TMPFILE is specified in flags; if neither O_CREAT nor O_TMPFILE is specified, then mode is ignored. The effective mode is modified by the process's umask in the usual way: in the absence of a default ACL, the mode of the created file is (mode & ~umask). Note that this mode applies only to future accesses of the newly created file; the open() call that creates a read-only file may well return a read/write file descriptor.
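The (mode & ~umask) rule above can be checked with a small sketch. It assumes the parent directory has no default ACL; the helper name created_mode is hypothetical. Note that it opens the new file write-only even when the requested mode grants no write permission, which is the last point of the paragraph above.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: creates 'path' with the requested mode while
 * 'mask' is the process umask, and returns the permission bits the new
 * file actually received, or (mode_t)-1 on error. */
mode_t created_mode(const char *path, mode_t requested, mode_t mask)
{
    assert(path != NULL);

    mode_t old = umask(mask);          /* umask() always succeeds */
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, requested);
    umask(old);                        /* restore the caller's umask */
    if (fd == -1)
        return (mode_t)-1;

    struct stat st;
    int rc = fstat(fd, &st);
    close(fd);
    if (rc == -1)
        return (mode_t)-1;

    return st.st_mode & 07777;         /* keep only the mode bits */
}
```

With a requested mode of 0666 and a umask of 022, the expected result is 0644 (0666 & ~022).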
        The following symbolic constants are provided for mode:

            S_IRWXU  00700  user (file owner) has read, write, and execute permission
            S_IRUSR  00400  user has read permission
            S_IWUSR  00200  user has write permission
            S_IXUSR  00100  user has execute permission
            S_IRWXG  00070  group has read, write, and execute permission
            S_IRGRP  00040  group has read permission
            S_IWGRP  00020  group has write permission
            S_IXGRP  00010  group has execute permission
            S_IRWXO  00007  others have read, write, and execute permission
            S_IROTH  00004  others have read permission
            S_IWOTH  00002  others have write permission
            S_IXOTH  00001  others have execute permission

        According to POSIX, the effect when other bits are set in mode is unspecified. On Linux, the following bits are also honored in mode:

            S_ISUID  0004000  set-user-ID bit
            S_ISGID  0002000  set-group-ID bit (see stat(2))
            S_ISVTX  0001000  sticky bit (see stat(2))

    O_DIRECT (since Linux 2.4.10)
        Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user-space buffers. The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC flag that data and necessary metadata are transferred. To guarantee synchronous I/O, O_SYNC must be used in addition to O_DIRECT. See NOTES below for further discussion.

        A semantically similar (but deprecated) interface for block devices is described in raw(8).

    O_DIRECTORY
        If pathname is not a directory, cause the open to fail. This flag was added in kernel version 2.1.126, to avoid denial-of-service problems if opendir(3) is called on a FIFO or tape device.

    O_DSYNC
        Write operations on the file will complete according to the requirements of synchronized I/O data integrity completion.
        By the time write(2) (and similar) return, the output data has been transferred to the underlying hardware, along with any file metadata that would be required to retrieve that data (i.e., as though each write(2) was followed by a call to fdatasync(2)). See NOTES below.

    O_EXCL
        Ensure that this call creates the file: if this flag is specified in conjunction with O_CREAT, and pathname already exists, then open() will fail.

        When these two flags are specified, symbolic links are not followed: if pathname is a symbolic link, then open() fails regardless of where the symbolic link points to.

        In general, the behavior of O_EXCL is undefined if it is used without O_CREAT. There is one exception: on Linux 2.6 and later, O_EXCL can be used without O_CREAT if pathname refers to a block device. If the block device is in use by the system (e.g., mounted), open() fails with the error EBUSY.

        On NFS, O_EXCL is supported only when using NFSv3 or later on kernel 2.6 or later. In NFS environments where O_EXCL support is not provided, programs that rely on it for performing locking tasks will contain a race condition. Portable programs that want to perform atomic file locking using a lockfile, and need to avoid reliance on NFS support for O_EXCL, can create a unique file on the same filesystem (e.g., incorporating hostname and PID), and use link(2) to make a link to the lockfile. If link(2) returns 0, the lock is successful. Otherwise, use stat(2) on the unique file to check if its link count has increased to 2, in which case the lock is also successful.

    O_LARGEFILE
        (LFS) Allow files whose sizes cannot be represented in an off_t (but can be represented in an off64_t) to be opened. The _LARGEFILE64_SOURCE macro must be defined (before including any header files) in order to obtain this definition. Setting the _FILE_OFFSET_BITS feature test macro to 64 (rather than using O_LARGEFILE) is the preferred method of accessing large files on 32-bit systems (see feature_test_macros(7)).
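The O_CREAT | O_EXCL combination described above can be sketched as a simple lockfile attempt. This is a minimal illustration for a local filesystem, not the NFS-safe link(2) technique; the helper name try_lockfile and the 0600 mode are assumptions. The EEXIST error code comes from the ERRORS section of open(2), which is not part of this excerpt.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: tries to take a lock by creating 'path'
 * exclusively. Returns 0 if this call created the file, 1 if the
 * file already existed, and -1 on any other error. */
int try_lockfile(const char *path)
{
    assert(path != NULL);

    /* O_EXCL makes "check that it does not exist" and "create it"
     * a single atomic step. */
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd == -1)
        return (errno == EEXIST) ? 1 : -1;

    close(fd);
    return 0;
}
```

The first caller gets 0 and holds the lock until it removes the file with unlink(2); every later caller gets 1.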
    O_NOATIME (since Linux 2.6.8)
        Do not update the file last access time (st_atime in the inode) when the file is read(2). This flag is intended for use by indexing or backup programs, where its use can significantly reduce the amount of disk activity. This flag may not be effective on all filesystems. One example is NFS, where the server maintains the access time.

    O_NOCTTY
        If pathname refers to a terminal device (see tty(4)), it will not become the process's controlling terminal even if the process does not have one.

    O_NOFOLLOW
        If pathname is a symbolic link, then the open fails. This is a FreeBSD extension, which was added to Linux in version 2.1.126. Symbolic links in earlier components of the pathname will still be followed. See also O_PATH below.

    O_NONBLOCK or O_NDELAY
        When possible, the file is opened in nonblocking mode. Neither the open() nor any subsequent operations on the file descriptor which is returned will cause the calling process to wait.

        Note that this flag has no effect for regular files and block devices; that is, I/O operations will (briefly) block when device activity is required, regardless of whether O_NONBLOCK is set. Since O_NONBLOCK semantics might eventually be implemented, applications should not depend upon blocking behavior when specifying this flag for regular files and block devices.

        For the handling of FIFOs (named pipes), see also fifo(7). For a discussion of the effect of O_NONBLOCK in conjunction with mandatory file locks and with file leases, see fcntl(2).

    O_PATH (since Linux 2.6.39)
        Obtain a file descriptor that can be used for two purposes: to indicate a location in the filesystem tree and to perform operations that act purely at the file descriptor level. The file itself is not opened, and other file operations (e.g., read(2), write(2), fchmod(2), fchown(2), fgetxattr(2), mmap(2)) fail with the error EBADF.
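The statement that read(2) fails with EBADF on an O_PATH descriptor can be sketched as below. The helper name read_errno_on_path_fd is hypothetical; the example assumes a Linux kernel with O_PATH support and defines _GNU_SOURCE, which glibc needs in order to expose O_PATH.

```c
#define _GNU_SOURCE            /* O_PATH is Linux-specific */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: opens 'path' with O_PATH, attempts a one-byte
 * read(2), and returns the resulting errno value. Returns 0 if the
 * read unexpectedly succeeded and -1 if open() itself failed. */
int read_errno_on_path_fd(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_PATH | O_CLOEXEC);
    if (fd == -1)
        return -1;

    char byte;
    errno = 0;
    ssize_t n = read(fd, &byte, 1);    /* the file itself is not opened */
    int err = (n == -1) ? errno : 0;

    close(fd);                         /* close(2) is permitted */
    return err;
}
```

The descriptor remains useful for the operations that work purely at the descriptor level, such as close(2) in the sketch.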
        The following operations can be performed on the resulting file descriptor:

        * close(2); fchdir(2) (since Linux 3.5); fstat(2) (since Linux 3.6).

        * Duplicating the file descriptor (dup(2), fcntl(2) F_DUPFD, etc.).

        * Getting and setting file descriptor flags (fcntl(2) F_GETFD and F_SETFD).

        * Retrieving open file status flags using the fcntl(2) F_GETFL operation: the returned flags will include the bit O_PATH.

        * Passing the file descriptor as the dirfd argument of openat(2) and the other "*at()" system calls. This includes linkat(2) with AT_EMPTY_PATH (or via procfs using AT_SYMLINK_FOLLOW) even if the file is not a directory.

        * Passing the file descriptor to another process via a UNIX domain socket (see SCM_RIGHTS in unix(7)).

        When O_PATH is specified in flags, flag bits other than O_CLOEXEC, O_DIRECTORY, and O_NOFOLLOW are ignored.

        If pathname is a symbolic link and the O_NOFOLLOW flag is also specified, then the call returns a file descriptor referring to the symbolic link. This file descriptor can be used as the dirfd argument in calls to fchownat(2), fstatat(2), linkat(2), and readlinkat(2) with an empty pathname to have the calls operate on the symbolic link.

    O_SYNC
        Write operations on the file will complete according to the requirements of synchronized I/O file integrity completion (by contrast with the synchronized I/O data integrity completion provided by O_DSYNC).

        By the time write(2) (and similar) return, the output data and associated file metadata have been transferred to the underlying hardware (i.e., as though each write(2) was followed by a call to fsync(2)). See NOTES below.

    O_TMPFILE (since Linux 3.11)
        Create an unnamed temporary file. The pathname argument specifies a directory; an unnamed inode will be created in that directory's filesystem. Anything written to the resulting file will be lost when the last file descriptor is closed, unless the file is given a name.

        O_TMPFILE must be specified with one of O_RDWR or O_WRONLY and, optionally, O_EXCL.
        If O_EXCL is not specified, then linkat(2) can be used to link the temporary file into the filesystem, making it permanent, using code like the following:

            char path[PATH_MAX];
            fd = open("/path/to/dir", O_TMPFILE | O_RDWR,
                      S_IRUSR | S_IWUSR);

            /* File I/O on 'fd'... */

            snprintf(path, PATH_MAX, "/proc/self/fd/%d", fd);
            linkat(AT_FDCWD, path, AT_FDCWD, "/path/for/file",
                   AT_SYMLINK_FOLLOW);

        In this case, the open() mode argument determines the file permission mode, as with O_CREAT.

        Specifying O_EXCL in conjunction with O_TMPFILE prevents a temporary file from being linked into the filesystem in the above manner. (Note that the meaning of O_EXCL in this case is different from the meaning of O_EXCL otherwise.)

        There are two main use cases for O_TMPFILE:

        * Improved tmpfile(3) functionality: race-free creation of temporary files that (1) are automatically deleted when closed; (2) can never be reached via any pathname; (3) are not subject to symlink attacks; and (4) do not require the caller to devise unique names.

        * Creating a file that is initially invisible, which is then populated with data and adjusted to have appropriate filesystem attributes (fchown(2), fchmod(2), fsetxattr(2), etc.) before being atomically linked into the filesystem in a fully formed state (using linkat(2) as described above).

        O_TMPFILE requires support by the underlying filesystem; only a subset of Linux filesystems provide that support. In the initial implementation, support was provided in the ext2, ext3, ext4, UDF, Minix, and shmem filesystems. XFS support was added in Linux 3.15, and Btrfs support was added in Linux 3.16.

    O_TRUNC
        If the file already exists and is a regular file and the access mode allows writing (i.e., is O_RDWR or O_WRONLY) it will be truncated to length 0. If the file is a FIFO or terminal device file, the O_TRUNC flag is ignored. Otherwise, the effect of O_TRUNC is unspecified.
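The O_TRUNC rule for an existing regular file opened for writing can be sketched as follows. The helper name size_after_truncating_open is hypothetical, and the example covers only the regular-file case, not FIFOs or terminals.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: reopens the existing regular file 'path' with
 * O_WRONLY | O_TRUNC and returns its size right after the open,
 * or -1 on error. */
off_t size_after_truncating_open(const char *path)
{
    assert(path != NULL);

    /* No O_CREAT: the call fails if the file does not exist. */
    int fd = open(path, O_WRONLY | O_TRUNC);
    if (fd == -1)
        return -1;

    struct stat st;
    int rc = fstat(fd, &st);
    close(fd);
    return (rc == -1) ? (off_t)-1 : st.st_size;
}
```

For a file that previously held data, the returned size should be 0, because the truncation happens as part of the open() call itself rather than at the first write.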
    creat()
        A call to creat() is equivalent to calling open() with flags equal to O_CREAT|O_WRONLY|O_TRUNC.

    openat()
        The openat() system call operates in exactly the same way as open(), except for the differences described here.

        If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by open() for a relative pathname).

        If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like open()).

        If pathname is absolute, then dirfd is ignored.
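The openat() path-resolution rules for relative and absolute pathnames can be sketched as follows. The helper name open_in_dir is hypothetical; the feature test macro is defined explicitly, following the glibc requirement quoted in the SYNOPSIS.

```c
#define _POSIX_C_SOURCE 200809L   /* exposes openat() on glibc 2.10+ */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: opens 'name' read-only, resolving a relative
 * 'name' against the directory 'dir' instead of the current working
 * directory. Returns a file descriptor, or -1 on error. */
int open_in_dir(const char *dir, const char *name)
{
    assert(dir != NULL && name != NULL);

    int dirfd = open(dir, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (dirfd == -1)
        return -1;

    /* A relative 'name' is looked up under dirfd; an absolute 'name'
     * makes openat() ignore dirfd altogether. */
    int fd = openat(dirfd, name, O_RDONLY | O_CLOEXEC);

    close(dirfd);   /* the returned descriptor stays valid */
    return fd;
}
```

For example, open_in_dir("/dev", "null") and open_in_dir("/tmp", "/dev/null") should both open /dev/null, the second because the absolute pathname overrides the directory. Passing AT_FDCWD as dirfd to openat() directly gives the same behavior as open().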
creat() A call to creat() is equivalent to calling open() with flags equal to O_CREAT|O_WRONLY|O_TRUNC. openat() The openat() system call operates in exactly the same way as open(), except for the differences described here. If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by open() for a relative pathname). If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like open()). If pathname is absolute, then dirfd is ignored. http://man7.org/linux/man-pages/man2/openat.2.html 12 SYSTEM CALL: open(2) - Linux manual page FUNCTIONALITY: open, openat, creat - open and possibly create a file SYNOPSIS: #include #include #include int open(const char *pathname, int flags); int open(const char *pathname, int flags, mode_t mode); int creat(const char *pathname, mode_t mode); int openat(int dirfd, const char *pathname, int flags); int openat(int dirfd, const char *pathname, int flags, mode_t mode); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): openat(): Since glibc 2.10: _POSIX_C_SOURCE >= 200809L Before glibc 2.10: _ATFILE_SOURCE DESCRIPTION Given a pathname for a file, open() returns a file descriptor, a small, nonnegative integer for use in subsequent system calls (read(2), write(2), lseek(2), fcntl(2), etc.). The file descriptor returned by a successful call will be the lowest-numbered file descriptor not currently open for the process. By default, the new file descriptor is set to remain open across an execve(2) (i.e., the FD_CLOEXEC file descriptor flag described in fcntl(2) is initially disabled); the O_CLOEXEC flag, described below, can be used to change this default. The file offset is set to the beginning of the file (see lseek(2)). 
A call to open() creates a new open file description, an entry in the system-wide table of open files. The open file description records the file offset and the file status flags (see below). A file descriptor is a reference to an open file description; this reference is unaffected if pathname is subsequently removed or modified to refer to a different file. For further details on open file descriptions, see NOTES. The argument flags must include one of the following access modes: O_RDONLY, O_WRONLY, or O_RDWR. These request opening the file read-only, write-only, or read/write, respectively. In addition, zero or more file creation flags and file status flags can be bitwise-or'd in flags. The file creation flags are O_CLOEXEC, O_CREAT, O_DIRECTORY, O_EXCL, O_NOCTTY, O_NOFOLLOW, O_TMPFILE, and O_TRUNC. The file status flags are all of the remaining flags listed below. The distinction between these two groups of flags is that the file status flags can be retrieved and (in some cases) modified; see fcntl(2) for details. The full list of file creation flags and file status flags is as follows: O_APPEND The file is opened in append mode. Before each write(2), the file offset is positioned at the end of the file, as if with lseek(2). O_APPEND may lead to corrupted files on NFS filesystems if more than one process appends data to a file at once. This is because NFS does not support appending to a file, so the client kernel has to simulate it, which can't be done without a race condition. O_ASYNC Enable signal-driven I/O: generate a signal (SIGIO by default, but this can be changed via fcntl(2)) when input or output becomes possible on this file descriptor. This feature is available only for terminals, pseudoterminals, sockets, and (since Linux 2.6) pipes and FIFOs. See fcntl(2) for further details. See also BUGS, below. O_CLOEXEC (since Linux 2.6.23) Enable the close-on-exec flag for the new file descriptor.
Specifying this flag permits a program to avoid additional fcntl(2) F_SETFD operations to set the FD_CLOEXEC flag. Note that the use of this flag is essential in some multithreaded programs, because using a separate fcntl(2) F_SETFD operation to set the FD_CLOEXEC flag does not suffice to avoid race conditions where one thread opens a file descriptor and attempts to set its close-on-exec flag using fcntl(2) at the same time as another thread does a fork(2) plus execve(2). Depending on the order of execution, the race may lead to the file descriptor returned by open() being unintentionally leaked to the program executed by the child process created by fork(2). (This kind of race is in principle possible for any system call that creates a file descriptor whose close-on-exec flag should be set, and various other Linux system calls provide an equivalent of the O_CLOEXEC flag to deal with this problem.) O_CREAT If the file does not exist, it will be created. The owner (user ID) of the file is set to the effective user ID of the process. The group ownership (group ID) is set either to the effective group ID of the process or to the group ID of the parent directory (depending on filesystem type and mount options, and the mode of the parent directory; see the mount options bsdgroups and sysvgroups described in mount(8)). The mode argument specifies the file mode bits to be applied when a new file is created. This argument must be supplied when O_CREAT or O_TMPFILE is specified in flags; if neither O_CREAT nor O_TMPFILE is specified, then mode is ignored. The effective mode is modified by the process's umask in the usual way: in the absence of a default ACL, the mode of the created file is (mode & ~umask). Note that this mode applies only to future accesses of the newly created file; the open() call that creates a read-only file may well return a read/write file descriptor.
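As a rough illustration of the (mode & ~umask) rule described above, the following C sketch creates a file under a chosen umask and reads the resulting permission bits back with fstat(2). The helper name created_mode() is hypothetical, and the sketch assumes that the parent directory carries no default ACL.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create `path` with the requested mode under the given umask and return
 * the permission bits the new file actually received, or -1 on error.
 * The helper name and its parameters are illustrative only. */
long created_mode(const char *path, mode_t requested, mode_t mask)
{
    assert(path != NULL);

    mode_t old_mask = umask(mask);      /* install the demonstration umask */

    /* O_EXCL makes the call fail if the file already exists, so the mode
     * read back below is known to come from this creation. */
    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, requested);

    umask(old_mask);                    /* restore the caller's umask */
    if (fd == -1)
        return -1;

    struct stat st;
    long result = -1;
    if (fstat(fd, &st) == 0)
        result = (long)(st.st_mode & 07777);   /* permission bits only */
    close(fd);
    return result;
}
```

With a umask of 022, a requested mode of 0666 would be expected to come back as 0644. Because umask(2) is a per-process setting, a helper like this is not suitable for use while other threads are creating files.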
The following symbolic constants are provided for mode:

    S_IRWXU  00700  user (file owner) has read, write, and execute permission
    S_IRUSR  00400  user has read permission
    S_IWUSR  00200  user has write permission
    S_IXUSR  00100  user has execute permission
    S_IRWXG  00070  group has read, write, and execute permission
    S_IRGRP  00040  group has read permission
    S_IWGRP  00020  group has write permission
    S_IXGRP  00010  group has execute permission
    S_IRWXO  00007  others have read, write, and execute permission
    S_IROTH  00004  others have read permission
    S_IWOTH  00002  others have write permission
    S_IXOTH  00001  others have execute permission

According to POSIX, the effect when other bits are set in mode is unspecified. On Linux, the following bits are also honored in mode:

    S_ISUID  0004000  set-user-ID bit
    S_ISGID  0002000  set-group-ID bit (see stat(2))
    S_ISVTX  0001000  sticky bit (see stat(2))

O_DIRECT (since Linux 2.4.10) Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user-space buffers. The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC flag that data and necessary metadata are transferred. To guarantee synchronous I/O, O_SYNC must be used in addition to O_DIRECT. See NOTES below for further discussion. A semantically similar (but deprecated) interface for block devices is described in raw(8). O_DIRECTORY If pathname is not a directory, cause the open to fail. This flag was added in kernel version 2.1.126, to avoid denial-of-service problems if opendir(3) is called on a FIFO or tape device. O_DSYNC Write operations on the file will complete according to the requirements of synchronized I/O data integrity completion.
By the time write(2) (and similar) return, the output data has been transferred to the underlying hardware, along with any file metadata that would be required to retrieve that data (i.e., as though each write(2) was followed by a call to fdatasync(2)). See NOTES below. O_EXCL Ensure that this call creates the file: if this flag is specified in conjunction with O_CREAT, and pathname already exists, then open() will fail. When these two flags are specified, symbolic links are not followed: if pathname is a symbolic link, then open() fails regardless of where the symbolic link points to. In general, the behavior of O_EXCL is undefined if it is used without O_CREAT. There is one exception: on Linux 2.6 and later, O_EXCL can be used without O_CREAT if pathname refers to a block device. If the block device is in use by the system (e.g., mounted), open() fails with the error EBUSY. On NFS, O_EXCL is supported only when using NFSv3 or later on kernel 2.6 or later. In NFS environments where O_EXCL support is not provided, programs that rely on it for performing locking tasks will contain a race condition. Portable programs that want to perform atomic file locking using a lockfile, and need to avoid reliance on NFS support for O_EXCL, can create a unique file on the same filesystem (e.g., incorporating hostname and PID), and use link(2) to make a link to the lockfile. If link(2) returns 0, the lock is successful. Otherwise, use stat(2) on the unique file to check if its link count has increased to 2, in which case the lock is also successful. O_LARGEFILE (LFS) Allow files whose sizes cannot be represented in an off_t (but can be represented in an off64_t) to be opened. The _LARGEFILE64_SOURCE macro must be defined (before including any header files) in order to obtain this definition. Setting the _FILE_OFFSET_BITS feature test macro to 64 (rather than using O_LARGEFILE) is the preferred method of accessing large files on 32-bit systems (see feature_test_macros(7)). 
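The link(2)-based lockfile technique described under O_EXCL above can be sketched in C roughly as follows. The function name try_lock() and its return convention are assumptions made for this illustration; the caller is expected to supply a unique pathname on the same filesystem as the lockfile (for example, one built from the hostname and PID).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Try to take `lockfile` without relying on O_EXCL.
 * Returns 1 if the lock was obtained, 0 if it is held by someone else,
 * and -1 if the unique file could not be created.
 * The name and return convention are hypothetical. */
int try_lock(const char *lockfile, const char *unique)
{
    assert(lockfile != NULL && unique != NULL);

    /* Step 1: create the unique file. */
    int fd = open(unique, O_CREAT | O_WRONLY, 0600);
    if (fd == -1)
        return -1;
    close(fd);

    /* Step 2: try to link it to the lockfile name. */
    int locked = 0;
    if (link(unique, lockfile) == 0) {
        locked = 1;
    } else {
        /* Step 3: link() reported failure; a link count of 2 on the
         * unique file means the link was nevertheless created. */
        struct stat st;
        if (stat(unique, &st) == 0 && st.st_nlink == 2)
            locked = 1;
    }

    /* The unique name is no longer needed; the lockfile, if created,
     * keeps the inode alive. */
    unlink(unique);
    return locked;
}
```

In this sketch the lock is released by removing the lockfile with unlink(2).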
O_NOATIME (since Linux 2.6.8) Do not update the file last access time (st_atime in the inode) when the file is read(2). This flag is intended for use by indexing or backup programs, where its use can significantly reduce the amount of disk activity. This flag may not be effective on all filesystems. One example is NFS, where the server maintains the access time. O_NOCTTY If pathname refers to a terminal device—see tty(4)—it will not become the process's controlling terminal even if the process does not have one. O_NOFOLLOW If pathname is a symbolic link, then the open fails. This is a FreeBSD extension, which was added to Linux in version 2.1.126. Symbolic links in earlier components of the pathname will still be followed. See also O_PATH below. O_NONBLOCK or O_NDELAY When possible, the file is opened in nonblocking mode. Neither the open() nor any subsequent operations on the file descriptor which is returned will cause the calling process to wait. Note that this flag has no effect for regular files and block devices; that is, I/O operations will (briefly) block when device activity is required, regardless of whether O_NONBLOCK is set. Since O_NONBLOCK semantics might eventually be implemented, applications should not depend upon blocking behavior when specifying this flag for regular files and block devices. For the handling of FIFOs (named pipes), see also fifo(7). For a discussion of the effect of O_NONBLOCK in conjunction with mandatory file locks and with file leases, see fcntl(2). O_PATH (since Linux 2.6.39) Obtain a file descriptor that can be used for two purposes: to indicate a location in the filesystem tree and to perform operations that act purely at the file descriptor level. The file itself is not opened, and other file operations (e.g., read(2), write(2), fchmod(2), fchown(2), fgetxattr(2), mmap(2)) fail with the error EBADF. 
The following operations can be performed on the resulting file descriptor: * close(2); fchdir(2) (since Linux 3.5); fstat(2) (since Linux 3.6). * Duplicating the file descriptor (dup(2), fcntl(2) F_DUPFD, etc.). * Getting and setting file descriptor flags (fcntl(2) F_GETFD and F_SETFD). * Retrieving open file status flags using the fcntl(2) F_GETFL operation: the returned flags will include the bit O_PATH. * Passing the file descriptor as the dirfd argument of openat(2) and the other "*at()" system calls. This includes linkat(2) with AT_EMPTY_PATH (or via procfs using AT_SYMLINK_FOLLOW) even if the file is not a directory. * Passing the file descriptor to another process via a UNIX domain socket (see SCM_RIGHTS in unix(7)). When O_PATH is specified in flags, flag bits other than O_CLOEXEC, O_DIRECTORY, and O_NOFOLLOW are ignored. If pathname is a symbolic link and the O_NOFOLLOW flag is also specified, then the call returns a file descriptor referring to the symbolic link. This file descriptor can be used as the dirfd argument in calls to fchownat(2), fstatat(2), linkat(2), and readlinkat(2) with an empty pathname to have the calls operate on the symbolic link. O_SYNC Write operations on the file will complete according to the requirements of synchronized I/O file integrity completion (by contrast with the synchronized I/O data integrity completion provided by O_DSYNC). By the time write(2) (and similar) return, the output data and associated file metadata have been transferred to the underlying hardware (i.e., as though each write(2) was followed by a call to fsync(2)). See NOTES below. O_TMPFILE (since Linux 3.11) Create an unnamed temporary file. The pathname argument specifies a directory; an unnamed inode will be created in that directory's filesystem. Anything written to the resulting file will be lost when the last file descriptor is closed, unless the file is given a name. O_TMPFILE must be specified with one of O_RDWR or O_WRONLY and, optionally, O_EXCL.
If O_EXCL is not specified, then linkat(2) can be used to link the temporary file into the filesystem, making it permanent, using code like the following: char path[PATH_MAX]; fd = open("/path/to/dir", O_TMPFILE | O_RDWR, S_IRUSR | S_IWUSR); /* File I/O on 'fd'... */ snprintf(path, PATH_MAX, "/proc/self/fd/%d", fd); linkat(AT_FDCWD, path, AT_FDCWD, "/path/for/file", AT_SYMLINK_FOLLOW); In this case, the open() mode argument determines the file permission mode, as with O_CREAT. Specifying O_EXCL in conjunction with O_TMPFILE prevents a temporary file from being linked into the filesystem in the above manner. (Note that the meaning of O_EXCL in this case is different from the meaning of O_EXCL otherwise.) There are two main use cases for O_TMPFILE: * Improved tmpfile(3) functionality: race-free creation of temporary files that (1) are automatically deleted when closed; (2) can never be reached via any pathname; (3) are not subject to symlink attacks; and (4) do not require the caller to devise unique names. * Creating a file that is initially invisible, which is then populated with data and adjusted to have appropriate filesystem attributes (fchown(2), fchmod(2), fsetxattr(2), etc.) before being atomically linked into the filesystem in a fully formed state (using linkat(2) as described above). O_TMPFILE requires support by the underlying filesystem; only a subset of Linux filesystems provide that support. In the initial implementation, support was provided in the ext2, ext3, ext4, UDF, Minix, and shmem filesystems. XFS support was added in Linux 3.15, and Btrfs support was added in Linux 3.16. O_TRUNC If the file already exists and is a regular file and the access mode allows writing (i.e., is O_RDWR or O_WRONLY) it will be truncated to length 0. If the file is a FIFO or terminal device file, the O_TRUNC flag is ignored. Otherwise, the effect of O_TRUNC is unspecified. 
creat() A call to creat() is equivalent to calling open() with flags equal to O_CREAT|O_WRONLY|O_TRUNC. openat() The openat() system call operates in exactly the same way as open(), except for the differences described here. If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by open() for a relative pathname). If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like open()). If pathname is absolute, then dirfd is ignored.
http://man7.org/linux/man-pages/man2/name_to_handle_at.2.html 12 SYSTEM CALL: open_by_handle_at(2) - Linux manual page FUNCTIONALITY: name_to_handle_at, open_by_handle_at - obtain handle for a pathname and open file via a handle SYNOPSIS: #define _GNU_SOURCE /* See feature_test_macros(7) */ #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> int name_to_handle_at(int dirfd, const char *pathname, struct file_handle *handle, int *mount_id, int flags); int open_by_handle_at(int mount_fd, struct file_handle *handle, int flags); DESCRIPTION The name_to_handle_at() and open_by_handle_at() system calls split the functionality of openat(2) into two parts: name_to_handle_at() returns an opaque handle that corresponds to a specified file; open_by_handle_at() opens the file corresponding to a handle returned by a previous call to name_to_handle_at() and returns an open file descriptor. name_to_handle_at() The name_to_handle_at() system call returns a file handle and a mount ID corresponding to the file specified by the dirfd and pathname arguments.
The file handle is returned via the argument handle, which is a pointer to a structure of the following form: struct file_handle { unsigned int handle_bytes; /* Size of f_handle [in, out] */ int handle_type; /* Handle type [out] */ unsigned char f_handle[0]; /* File identifier (sized by caller) [out] */ }; It is the caller's responsibility to allocate the structure with a size large enough to hold the handle returned in f_handle. Before the call, the handle_bytes field should be initialized to contain the allocated size for f_handle. (The constant MAX_HANDLE_SZ, defined in <fcntl.h>, specifies the maximum possible size for a file handle.) Upon successful return, the handle_bytes field is updated to contain the number of bytes actually written to f_handle. The caller can discover the required size for the file_handle structure by making a call in which handle->handle_bytes is zero; in this case, the call fails with the error EOVERFLOW and handle->handle_bytes is set to indicate the required size; the caller can then use this information to allocate a structure of the correct size (see EXAMPLE below). Other than the use of the handle_bytes field, the caller should treat the file_handle structure as an opaque data type: the handle_type and f_handle fields are needed only by a subsequent call to open_by_handle_at(). The flags argument is a bit mask constructed by ORing together zero or more of AT_EMPTY_PATH and AT_SYMLINK_FOLLOW, described below. Together, the pathname and dirfd arguments identify the file for which a handle is to be obtained. There are four distinct cases: * If pathname is a nonempty string containing an absolute pathname, then a handle is returned for the file referred to by that pathname. In this case, dirfd is ignored.
* If pathname is a nonempty string containing a relative pathname and dirfd has the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the caller, and a handle is returned for the file to which it refers. * If pathname is a nonempty string containing a relative pathname and dirfd is a file descriptor referring to a directory, then pathname is interpreted relative to the directory referred to by dirfd, and a handle is returned for the file to which it refers. (See openat(2) for an explanation of why "directory file descriptors" are useful.) * If pathname is an empty string and flags specifies the value AT_EMPTY_PATH, then dirfd can be an open file descriptor referring to any type of file, or AT_FDCWD, meaning the current working directory, and a handle is returned for the file to which it refers. The mount_id argument returns an identifier for the filesystem mount that corresponds to pathname. This corresponds to the first field in one of the records in /proc/self/mountinfo. Opening the pathname in the fifth field of that record yields a file descriptor for the mount point; that file descriptor can be used in a subsequent call to open_by_handle_at(). By default, name_to_handle_at() does not dereference pathname if it is a symbolic link, and thus returns a handle for the link itself. If AT_SYMLINK_FOLLOW is specified in flags, pathname is dereferenced if it is a symbolic link (so that the call returns a handle for the file referred to by the link). open_by_handle_at() The open_by_handle_at() system call opens the file referred to by handle, a file handle returned by a previous call to name_to_handle_at(). The mount_fd argument is a file descriptor for any object (file, directory, etc.) in the mounted filesystem with respect to which handle should be interpreted. The special value AT_FDCWD can be specified, meaning the current working directory of the caller. The flags argument is as for open(2). 
If handle refers to a symbolic link, the caller must specify the O_PATH flag, and the symbolic link is not dereferenced; the O_NOFOLLOW flag, if specified, is ignored. The caller must have the CAP_DAC_READ_SEARCH capability to invoke open_by_handle_at().
http://man7.org/linux/man-pages/man2/open_by_handle_at.2.html 12 SYSTEM CALL: open_by_handle_at(2) - Linux manual page FUNCTIONALITY: name_to_handle_at, open_by_handle_at - obtain handle for a pathname and open file via a handle SYNOPSIS: #define _GNU_SOURCE /* See feature_test_macros(7) */ #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> int name_to_handle_at(int dirfd, const char *pathname, struct file_handle *handle, int *mount_id, int flags); int open_by_handle_at(int mount_fd, struct file_handle *handle, int flags); DESCRIPTION The name_to_handle_at() and open_by_handle_at() system calls split the functionality of openat(2) into two parts: name_to_handle_at() returns an opaque handle that corresponds to a specified file; open_by_handle_at() opens the file corresponding to a handle returned by a previous call to name_to_handle_at() and returns an open file descriptor. name_to_handle_at() The name_to_handle_at() system call returns a file handle and a mount ID corresponding to the file specified by the dirfd and pathname arguments. The file handle is returned via the argument handle, which is a pointer to a structure of the following form: struct file_handle { unsigned int handle_bytes; /* Size of f_handle [in, out] */ int handle_type; /* Handle type [out] */ unsigned char f_handle[0]; /* File identifier (sized by caller) [out] */ }; It is the caller's responsibility to allocate the structure with a size large enough to hold the handle returned in f_handle. Before the call, the handle_bytes field should be initialized to contain the allocated size for f_handle. (The constant MAX_HANDLE_SZ, defined in <fcntl.h>, specifies the maximum possible size for a file handle.)
Upon successful return, the handle_bytes field is updated to contain the number of bytes actually written to f_handle. The caller can discover the required size for the file_handle structure by making a call in which handle->handle_bytes is zero; in this case, the call fails with the error EOVERFLOW and handle->handle_bytes is set to indicate the required size; the caller can then use this information to allocate a structure of the correct size (see EXAMPLE below). Other than the use of the handle_bytes field, the caller should treat the file_handle structure as an opaque data type: the handle_type and f_handle fields are needed only by a subsequent call to open_by_handle_at(). The flags argument is a bit mask constructed by ORing together zero or more of AT_EMPTY_PATH and AT_SYMLINK_FOLLOW, described below. Together, the pathname and dirfd arguments identify the file for which a handle is to be obtained. There are four distinct cases: * If pathname is a nonempty string containing an absolute pathname, then a handle is returned for the file referred to by that pathname. In this case, dirfd is ignored. * If pathname is a nonempty string containing a relative pathname and dirfd has the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the caller, and a handle is returned for the file to which it refers. * If pathname is a nonempty string containing a relative pathname and dirfd is a file descriptor referring to a directory, then pathname is interpreted relative to the directory referred to by dirfd, and a handle is returned for the file to which it refers. (See openat(2) for an explanation of why "directory file descriptors" are useful.) * If pathname is an empty string and flags specifies the value AT_EMPTY_PATH, then dirfd can be an open file descriptor referring to any type of file, or AT_FDCWD, meaning the current working directory, and a handle is returned for the file to which it refers. 
The mount_id argument returns an identifier for the filesystem mount that corresponds to pathname. This corresponds to the first field in one of the records in /proc/self/mountinfo. Opening the pathname in the fifth field of that record yields a file descriptor for the mount point; that file descriptor can be used in a subsequent call to open_by_handle_at(). By default, name_to_handle_at() does not dereference pathname if it is a symbolic link, and thus returns a handle for the link itself. If AT_SYMLINK_FOLLOW is specified in flags, pathname is dereferenced if it is a symbolic link (so that the call returns a handle for the file referred to by the link). open_by_handle_at() The open_by_handle_at() system call opens the file referred to by handle, a file handle returned by a previous call to name_to_handle_at(). The mount_fd argument is a file descriptor for any object (file, directory, etc.) in the mounted filesystem with respect to which handle should be interpreted. The special value AT_FDCWD can be specified, meaning the current working directory of the caller. The flags argument is as for open(2). If handle refers to a symbolic link, the caller must specify the O_PATH flag, and the symbolic link is not dereferenced; the O_NOFOLLOW flag, if specified, is ignored. The caller must have the CAP_DAC_READ_SEARCH capability to invoke open_by_handle_at().
http://man7.org/linux/man-pages/man2/memfd_create.2.html 12 SYSTEM CALL: memfd_create(2) - Linux manual page FUNCTIONALITY: memfd_create - create an anonymous file SYNOPSIS: #include <sys/memfd.h> int memfd_create(const char *name, unsigned int flags); Note: There is no glibc wrapper for this system call; see NOTES. DESCRIPTION memfd_create() creates an anonymous file and returns a file descriptor that refers to it. The file behaves like a regular file, and so can be modified, truncated, memory-mapped, and so on. However, unlike a regular file, it lives in RAM and has a volatile backing storage.
Once all references to the file are dropped, it is automatically released. Anonymous memory is used for all backing pages of the file. Therefore, files created by memfd_create() have the same semantics as other anonymous memory allocations such as those allocated using mmap(2) with the MAP_ANONYMOUS flag. The initial size of the file is set to 0. Following the call, the file size should be set using ftruncate(2). (Alternatively, the file may be populated by calls to write(2) or similar.) The name supplied in name is used as a filename and will be displayed as the target of the corresponding symbolic link in the directory /proc/self/fd/. The displayed name is always prefixed with memfd: and serves only for debugging purposes. Names do not affect the behavior of the file descriptor, and as such multiple files can have the same name without any side effects. The following values may be bitwise ORed in flags to change the behavior of memfd_create(): MFD_CLOEXEC Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor. See the description of the O_CLOEXEC flag in open(2) for reasons why this may be useful. MFD_ALLOW_SEALING Allow sealing operations on this file. See the discussion of the F_ADD_SEALS and F_GET_SEALS operations in fcntl(2), and also NOTES, below. The initial set of seals is empty. If this flag is not set, the initial set of seals will be F_SEAL_SEAL, meaning that no other seals can be set on the file. Unused bits in flags must be 0. As its return value, memfd_create() returns a new file descriptor that can be used to refer to the file. This file descriptor is opened for both reading and writing (O_RDWR) and O_LARGEFILE is set for the file descriptor. With respect to fork(2) and execve(2), the usual semantics apply for the file descriptor created by memfd_create(). A copy of the file descriptor is inherited by the child produced by fork(2) and refers to the same file. 
The file descriptor is preserved across execve(2), unless the close-on-exec flag has been set.
http://man7.org/linux/man-pages/man2/mknod.2.html 11 SYSTEM CALL: mknod(2) - Linux manual page FUNCTIONALITY: mknod, mknodat - create a special or ordinary file SYNOPSIS: #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <unistd.h> int mknod(const char *pathname, mode_t mode, dev_t dev); #include <fcntl.h> /* Definition of AT_* constants */ #include <sys/stat.h> int mknodat(int dirfd, const char *pathname, mode_t mode, dev_t dev); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): mknod(): _XOPEN_SOURCE >= 500 || /* Since glibc 2.19: */ _DEFAULT_SOURCE || /* Glibc versions <= 2.19: */ _BSD_SOURCE || _SVID_SOURCE DESCRIPTION The system call mknod() creates a filesystem node (file, device special file, or named pipe) named pathname, with attributes specified by mode and dev. The mode argument specifies both the file mode to use and the type of node to be created. It should be a combination (using bitwise OR) of one of the file types listed below and zero or more of the file mode bits listed in stat(2). The file mode is modified by the process's umask in the usual way: in the absence of a default ACL, the permissions of the created node are (mode & ~umask). The file type must be one of S_IFREG, S_IFCHR, S_IFBLK, S_IFIFO, or S_IFSOCK to specify a regular file (which will be created empty), character special file, block special file, FIFO (named pipe), or UNIX domain socket, respectively. (Zero file type is equivalent to type S_IFREG.) If the file type is S_IFCHR or S_IFBLK, then dev specifies the major and minor numbers of the newly created device special file (makedev(3) may be useful to build the value for dev); otherwise it is ignored. If pathname already exists, or is a symbolic link, this call fails with an EEXIST error. The newly created node will be owned by the effective user ID of the process.
If the directory containing the node has the set-group-ID bit set, or if the filesystem is mounted with BSD group semantics, the new node will inherit the group ownership from its parent directory; otherwise it will be owned by the effective group ID of the process. mknodat() The mknodat() system call operates in exactly the same way as mknod(2), except for the differences described here. If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by mknod(2) for a relative pathname). If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like mknod(2)). If pathname is absolute, then dirfd is ignored. See openat(2) for an explanation of the need for mknodat().
http://man7.org/linux/man-pages/man2/mknodat.2.html 11 SYSTEM CALL: mknod(2) - Linux manual page FUNCTIONALITY: mknod, mknodat - create a special or ordinary file SYNOPSIS: #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <unistd.h> int mknod(const char *pathname, mode_t mode, dev_t dev); #include <fcntl.h> /* Definition of AT_* constants */ #include <sys/stat.h> int mknodat(int dirfd, const char *pathname, mode_t mode, dev_t dev); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): mknod(): _XOPEN_SOURCE >= 500 || /* Since glibc 2.19: */ _DEFAULT_SOURCE || /* Glibc versions <= 2.19: */ _BSD_SOURCE || _SVID_SOURCE DESCRIPTION The system call mknod() creates a filesystem node (file, device special file, or named pipe) named pathname, with attributes specified by mode and dev. The mode argument specifies both the file mode to use and the type of node to be created. It should be a combination (using bitwise OR) of one of the file types listed below and zero or more of the file mode bits listed in stat(2).
The file mode is modified by the process's umask in the usual way: in the absence of a default ACL, the permissions of the created node are (mode & ~umask). The file type must be one of S_IFREG, S_IFCHR, S_IFBLK, S_IFIFO, or S_IFSOCK to specify a regular file (which will be created empty), character special file, block special file, FIFO (named pipe), or UNIX domain socket, respectively. (Zero file type is equivalent to type S_IFREG.) If the file type is S_IFCHR or S_IFBLK, then dev specifies the major and minor numbers of the newly created device special file (makedev(3) may be useful to build the value for dev); otherwise it is ignored. If pathname already exists, or is a symbolic link, this call fails with an EEXIST error. The newly created node will be owned by the effective user ID of the process. If the directory containing the node has the set-group-ID bit set, or if the filesystem is mounted with BSD group semantics, the new node will inherit the group ownership from its parent directory; otherwise it will be owned by the effective group ID of the process. mknodat() The mknodat() system call operates in exactly the same way as mknod(2), except for the differences described here. If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by mknod(2) for a relative pathname). If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like mknod(2)). If pathname is absolute, then dirfd is ignored. See openat(2) for an explanation of the need for mknodat(). 
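As a small illustration of the mknod() behaviour described above, the following C sketch creates a FIFO (which needs no special privilege) and treats the EEXIST failure as success when the existing node is already a FIFO. The helper names ensure_fifo() and is_fifo() and the return convention are assumptions made for this example.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* Return 1 if `path` exists and is a FIFO, 0 otherwise.
 * lstat() is used so that a symbolic link is not followed. */
int is_fifo(const char *path)
{
    struct stat st;
    return lstat(path, &st) == 0 && S_ISFIFO(st.st_mode);
}

/* Make sure a FIFO exists at `path`.
 * Returns 1 if it was created by this call, 0 if a FIFO was already
 * present, and -1 on any other failure (errno is left as set by mknod()).
 * Both helper names are hypothetical. */
int ensure_fifo(const char *path, mode_t perms)
{
    assert(path != NULL);

    /* dev is ignored for S_IFIFO, so 0 is passed. The permissions are
     * still subject to the process umask. */
    if (mknod(path, S_IFIFO | perms, 0) == 0)
        return 1;

    if (errno == EEXIST && is_fifo(path))
        return 0;

    return -1;
}
```

Creating S_IFCHR or S_IFBLK nodes in the same way would additionally need a dev value (for example from makedev(3)) and usually elevated privileges, so the sketch is limited to the FIFO case.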
http://man7.org/linux/man-pages/man2/rename.2.html 12 SYSTEM CALL: rename(2) - Linux manual page FUNCTIONALITY: rename, renameat, renameat2 - change the name or location of a file SYNOPSIS: #include <stdio.h> int rename(const char *oldpath, const char *newpath); #include <fcntl.h> /* Definition of AT_* constants */ #include <stdio.h> int renameat(int olddirfd, const char *oldpath, int newdirfd, const char *newpath); int renameat2(int olddirfd, const char *oldpath, int newdirfd, const char *newpath, unsigned int flags); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): renameat(): Since glibc 2.10: _POSIX_C_SOURCE >= 200809L Before glibc 2.10: _ATFILE_SOURCE DESCRIPTION rename() renames a file, moving it between directories if required. Any other hard links to the file (as created using link(2)) are unaffected. Open file descriptors for oldpath are also unaffected. If newpath already exists, it will be atomically replaced (subject to a few conditions; see ERRORS below), so that there is no point at which another process attempting to access newpath will find it missing.
If oldpath and newpath are existing hard links referring to the same file, then rename() does nothing, and returns a success status. If newpath exists but the operation fails for some reason, rename() guarantees to leave an instance of newpath in place. oldpath can specify a directory. In this case, newpath must either not exist, or it must specify an empty directory. However, when overwriting there will probably be a window in which both oldpath and newpath refer to the file being renamed. If oldpath refers to a symbolic link, the link is renamed; if newpath refers to a symbolic link, the link will be overwritten. renameat() The renameat() system call operates in exactly the same way as rename(), except for the differences described here. If the pathname given in oldpath is relative, then it is interpreted relative to the directory referred to by the file descriptor olddirfd (rather than relative to the current working directory of the calling process, as is done by rename() for a relative pathname). If oldpath is relative and olddirfd is the special value AT_FDCWD, then oldpath is interpreted relative to the current working directory of the calling process (like rename()). If oldpath is absolute, then olddirfd is ignored. The interpretation of newpath is as for oldpath, except that a relative pathname is interpreted relative to the directory referred to by the file descriptor newdirfd. See openat(2) for an explanation of the need for renameat(). renameat2() renameat2() has an additional flags argument. A renameat2() call with a zero flags argument is equivalent to renameat(). The flags argument is a bit mask consisting of zero or more of the following flags: RENAME_EXCHANGE Atomically exchange oldpath and newpath. Both pathnames must exist but may be of different types (e.g., one could be a non-empty directory and the other a symbolic link). RENAME_NOREPLACE Don't overwrite newpath of the rename. Return an error if newpath already exists.
RENAME_NOREPLACE can't be employed together with RENAME_EXCHANGE. RENAME_WHITEOUT (since Linux 3.18) This operation makes sense only for overlay/union filesystem implementations. Specifying RENAME_WHITEOUT creates a "whiteout" object at the source of the rename at the same time as performing the rename. The whole operation is atomic, so that if the rename succeeds then the whiteout will also have been created. A "whiteout" is an object that has special meaning in union/overlay filesystem constructs. In these constructs, multiple layers exist and only the top one is ever modified. A whiteout on an upper layer will effectively hide a matching file in the lower layer, making it appear as if the file didn't exist. When a file that exists on the lower layer is renamed, the file is first copied up (if not already on the upper layer) and then renamed on the upper, read-write layer. At the same time, the source file needs to be "whiteouted" (so that the version of the source file in the lower layer is rendered invisible). The whole operation needs to be done atomically. When not part of a union/overlay, the whiteout appears as a character device with a {0,0} device number. RENAME_WHITEOUT requires the same privileges as creating a device node (i.e., the CAP_MKNOD capability). RENAME_WHITEOUT can't be employed together with RENAME_EXCHANGE. RENAME_WHITEOUT requires support from the underlying filesystem. Among the filesystems that provide that support are shmem (since Linux 3.18), ext4 (since Linux 3.18), and XFS (since Linux 4.1). 
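The atomic-replacement guarantee of rename() described above is commonly used to update a file without readers ever seeing a partial version. The following is a minimal sketch of that pattern, not a definitive implementation: the helper name replace_file, the scratch-file parameter tmp_path, and the fsync(2) call before the rename are assumptions made for this example.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Replace the contents of `path` with `data` so that readers see either
 * the old file or the complete new one. `tmp_path` is a scratch name that
 * is assumed to live in the same directory (and thus the same filesystem)
 * as `path`. Returns 0 on success, -1 on error.
 * replace_file is a hypothetical helper name used only for this sketch.
 */
int replace_file(const char *path, const char *tmp_path, const char *data)
{
    size_t len = strlen(data);
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd == -1)
        return -1;

    /* For brevity, a short write is treated as a failure. */
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) == -1) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    if (close(fd) == -1) {
        unlink(tmp_path);
        return -1;
    }

    /* If path exists it is replaced atomically; if not, it is created. */
    if (rename(tmp_path, path) == -1) {
        unlink(tmp_path);
        return -1;
    }
    return 0;
}
```

Keeping the scratch file next to the target matters because rename() does not move data between filesystems. After a successful call the scratch name no longer exists, since it has become the target.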
http://man7.org/linux/man-pages/man2/truncate.2.html 11 SYSTEM CALL: truncate(2) - Linux manual page FUNCTIONALITY: truncate, ftruncate - truncate a file to a specified length SYNOPSIS: #include <unistd.h> #include <sys/types.h> int truncate(const char *path, off_t length); int ftruncate(int fd, off_t length); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): truncate(): _XOPEN_SOURCE >= 500 || /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L || /* Glibc versions <= 2.19: */ _BSD_SOURCE ftruncate(): _XOPEN_SOURCE >= 500 || /* Since glibc 2.3.5: */ _POSIX_C_SOURCE >= 200112L || /* Glibc versions <= 2.19: */ _BSD_SOURCE DESCRIPTION The truncate() and ftruncate() functions cause the regular file named by path or referenced by fd to be truncated to a size of precisely length bytes. If the file previously was larger than this size, the extra data is lost. If the file previously was shorter, it is extended, and the extended part reads as null bytes ('\0'). The file offset is not changed.
If the size changed, then the st_ctime and st_mtime fields (respectively, time of last status change and time of last modification; see stat(2)) for the file are updated, and the set-user-ID and set-group-ID mode bits may be cleared. With ftruncate(), the file must be open for writing; with truncate(), the file must be writable.
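The shrinking and extending behavior of ftruncate() described above can be observed with a small helper. This is a minimal sketch; the name resize_file and the use of fstat(2) to report the resulting size are assumptions made for this example.

```c
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Set the size of the open file `fd` to exactly `length` bytes and return
 * the size that fstat() reports afterwards, or -1 on error.
 * resize_file is a hypothetical helper name used only for this sketch.
 */
off_t resize_file(int fd, off_t length)
{
    struct stat st;

    if (ftruncate(fd, length) == -1)    /* fd must be open for writing */
        return -1;
    if (fstat(fd, &st) == -1)
        return -1;
    return st.st_size;
}
```

For a file holding "hello world", resizing to 5 bytes drops " world", and resizing again to 8 bytes appends three null bytes. In both cases the file offset of fd stays where the last write left it, so a later write(2) at that offset would itself extend the file.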
http://man7.org/linux/man-pages/man2/fallocate.2.html 10 SYSTEM CALL: fallocate(2) - Linux manual page FUNCTIONALITY: fallocate - manipulate file space SYNOPSIS: #define _GNU_SOURCE /* See feature_test_macros(7) */ #include <fcntl.h> int fallocate(int fd, int mode, off_t offset, off_t len); DESCRIPTION This is a nonportable, Linux-specific system call. For the portable, POSIX.1-specified method of ensuring that space is allocated for a file, see posix_fallocate(3). fallocate() allows the caller to directly manipulate the allocated disk space for the file referred to by fd for the byte range starting at offset and continuing for len bytes. The mode argument determines the operation to be performed on the given range. Details of the supported operations are given in the subsections below. Allocating disk space The default operation (i.e., mode is zero) of fallocate() allocates the disk space within the range specified by offset and len. The file size (as reported by stat(2)) will be changed if offset+len is greater than the file size. Any subregion within the range specified by offset and len that did not contain data before the call will be initialized to zero. This default behavior closely resembles the behavior of the posix_fallocate(3) library function, and is intended as a method of optimally implementing that function. After a successful call, subsequent writes into the range specified by offset and len are guaranteed not to fail because of lack of disk space. If the FALLOC_FL_KEEP_SIZE flag is specified in mode, the behavior of the call is similar, but the file size will not be changed even if offset+len is greater than the file size. Preallocating zeroed blocks beyond the end of the file in this manner is useful for optimizing append workloads. Because allocation is done in block size chunks, fallocate() may allocate a larger range of disk space than was specified.
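The difference between the default mode and FALLOC_FL_KEEP_SIZE described above shows up in the file size reported by fstat(2). The following minimal sketch assumes a filesystem that supports fallocate(); the helper name preallocate and its keep_size parameter are hypothetical and exist only for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* needed for the fallocate() declaration */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Preallocate `len` bytes starting at offset 0 of the open file `fd`.
 * With keep_size nonzero, FALLOC_FL_KEEP_SIZE is used, so the reported
 * file size must not grow. Returns the file size after the call, or -1
 * on error (errno is EOPNOTSUPP if the filesystem lacks support).
 * preallocate is a hypothetical helper name used only for this sketch.
 */
off_t preallocate(int fd, off_t len, int keep_size)
{
    struct stat st;
    int mode = keep_size ? FALLOC_FL_KEEP_SIZE : 0;

    if (fallocate(fd, mode, 0, len) == -1)
        return -1;
    if (fstat(fd, &st) == -1)
        return -1;
    return st.st_size;
}
```

On an empty file, preallocating 4096 bytes with keep_size set should leave the size at 0, while the same request in the default mode should raise it to 4096. Callers should be prepared for EOPNOTSUPP, since support depends on the filesystem.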
Deallocating file space Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux 2.6.38) in mode deallocates space (i.e., creates a hole) in the byte range starting at offset and continuing for len bytes. Within the specified range, partial filesystem blocks are zeroed, and whole filesystem blocks are removed from the file. After a successful call, subsequent reads from this range will return zeroes. The FALLOC_FL_PUNCH_HOLE flag must be ORed with FALLOC_FL_KEEP_SIZE in mode; in other words, even when punching off the end of the file, the file size (as reported by stat(2)) does not change. Not all filesystems support FALLOC_FL_PUNCH_HOLE; if a filesystem doesn't support the operation, an error is returned. The operation is supported on at least the following filesystems: * XFS (since Linux 2.6.38) * ext4 (since Linux 3.0) * Btrfs (since Linux 3.7) * tmpfs (since Linux 3.5) Collapsing file space Specifying the FALLOC_FL_COLLAPSE_RANGE flag (available since Linux 3.15) in mode removes a byte range from a file, without leaving a hole. The byte range to be collapsed starts at offset and continues for len bytes. At the completion of the operation, the contents of the file starting at the location offset+len will be appended at the location offset, and the file will be len bytes smaller. A filesystem may place limitations on the granularity of the operation, in order to ensure efficient implementation. Typically, offset and len must be a multiple of the filesystem logical block size, which varies according to the filesystem type and configuration. If a filesystem has such a requirement, fallocate() will fail with the error EINVAL if this requirement is violated. If the region specified by offset plus len reaches or passes the end of file, an error is returned; instead, use ftruncate(2) to truncate a file. No other flags may be specified in mode in conjunction with FALLOC_FL_COLLAPSE_RANGE. 
As at Linux 3.15, FALLOC_FL_COLLAPSE_RANGE is supported by ext4 (only for extent-based files) and XFS. Zeroing file space Specifying the FALLOC_FL_ZERO_RANGE flag (available since Linux 3.15) in mode zeroes space in the byte range starting at offset and continuing for len bytes. Within the specified range, blocks are preallocated for the regions that span the holes in the file. After a successful call, subsequent reads from this range will return zeroes. Zeroing is done within the filesystem, preferably by converting the range into unwritten extents. This approach means that the specified range will not be physically zeroed out on the device (except for partial blocks at either end of the range), and I/O is (otherwise) required only to update metadata. If the FALLOC_FL_KEEP_SIZE flag is additionally specified in mode, the behavior of the call is similar, but the file size will not be changed even if offset+len is greater than the file size. This behavior is the same as when preallocating space with FALLOC_FL_KEEP_SIZE specified. Not all filesystems support FALLOC_FL_ZERO_RANGE; if a filesystem doesn't support the operation, an error is returned. The operation is supported on at least the following filesystems: * XFS (since Linux 3.15) * ext4, for extent-based files (since Linux 3.15) * SMB3 (since Linux 3.17) Increasing file space Specifying the FALLOC_FL_INSERT_RANGE flag (available since Linux 4.1) in mode increases the file space by inserting a hole within the file size without overwriting any existing data. The hole will start at offset and continue for len bytes. When inserting the hole inside a file, the contents of the file starting at offset will be shifted upward (i.e., to a higher file offset) by len bytes. Inserting a hole inside a file increases the file size by len bytes. This mode has the same limitations as FALLOC_FL_COLLAPSE_RANGE regarding the granularity of the operation.
If the granularity requirements are not met, fallocate() will fail with the error EINVAL. If the offset is equal to or greater than the end of file, an error is returned. For such operations (i.e., inserting a hole at the end of file), ftruncate(2) should be used. No other flags may be specified in mode in conjunction with FALLOC_FL_INSERT_RANGE. FALLOC_FL_INSERT_RANGE requires filesystem support. Filesystems that support this operation include XFS (since Linux 4.1) and ext4 (since Linux 4.2).
http://man7.org/linux/man-pages/man2/mkdir.2.html 11 SYSTEM CALL: mkdir(2) - Linux manual page FUNCTIONALITY: mkdir, mkdirat - create a directory SYNOPSIS: #include <sys/stat.h> #include <sys/types.h> int mkdir(const char *pathname, mode_t mode); #include <fcntl.h> /* Definition of AT_* constants */ #include <sys/stat.h> int mkdirat(int dirfd, const char *pathname, mode_t mode); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): mkdirat(): Since glibc 2.10: _POSIX_C_SOURCE >= 200809L Before glibc 2.10: _ATFILE_SOURCE DESCRIPTION mkdir() attempts to create a directory named pathname. The argument mode specifies the mode for the new directory (see stat(2)). It is modified by the process's umask in the usual way: in the absence of a default ACL, the mode of the created directory is (mode & ~umask & 0777). Whether other mode bits are honored for the created directory depends on the operating system. For Linux, see NOTES below. The newly created directory will be owned by the effective user ID of the process. If the directory containing the file has the set-group-ID bit set, or if the filesystem is mounted with BSD group semantics (mount -o bsdgroups or, synonymously, mount -o grpid), the new directory will inherit the group ownership from its parent; otherwise it will be owned by the effective group ID of the process. If the parent directory has the set-group-ID bit set, then so will the newly created directory.
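The (mode & ~umask & 0777) rule for mkdir() described above can be checked directly. The following minimal sketch assumes that the parent directory has no default ACL; the helper name created_dir_mode and the temporary change of the process umask are choices made for this example only.

```c
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Create a directory with the requested `mode` while the process umask is
 * temporarily set to `mask`, and return the permission bits the directory
 * ended up with, or -1 on error. The umask is process-wide, so this sketch
 * is not suitable for multithreaded programs.
 * created_dir_mode is a hypothetical helper name used only for this sketch.
 */
int created_dir_mode(const char *path, mode_t mode, mode_t mask)
{
    struct stat st;
    mode_t old = umask(mask);       /* umask() returns the previous mask */
    int rc = mkdir(path, mode);

    umask(old);                     /* restore the caller's umask */
    if (rc == -1 || stat(path, &st) == -1)
        return -1;
    return (int)(st.st_mode & 0777);
}
```

With mode 0777 and a umask of 022, the expected result is 0777 & ~022 = 0755. Masking the result with 0777 leaves out the set-group-ID bit, which the new directory may inherit from its parent as noted above.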
mkdirat() The mkdirat() system call operates in exactly the same way as mkdir(), except for the differences described here. If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by mkdir() for a relative pathname). If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like mkdir()). If pathname is absolute, then dirfd is ignored. See openat(2) for an explanation of the need for mkdirat().
http://man7.org/linux/man-pages/man2/rmdir.2.html 10 SYSTEM CALL: rmdir(2) - Linux manual page FUNCTIONALITY: rmdir - delete a directory SYNOPSIS: #include <unistd.h> int rmdir(const char *pathname); DESCRIPTION rmdir() deletes a directory, which must be empty.
http://man7.org/linux/man-pages/man2/getcwd.2.html 11 SYSTEM CALL: getcwd(3) - Linux manual page FUNCTIONALITY: getcwd, getwd, get_current_dir_name - get current working directory SYNOPSIS: #include <unistd.h> char *getcwd(char *buf, size_t size); char *getwd(char *buf); char *get_current_dir_name(void); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): get_current_dir_name(): _GNU_SOURCE getwd(): Since glibc 2.12: (_XOPEN_SOURCE >= 500) && ! (_POSIX_C_SOURCE >= 200809L) || /* Glibc since 2.19: */ _DEFAULT_SOURCE || /* Glibc versions <= 2.19: */ _BSD_SOURCE Before glibc 2.12: _BSD_SOURCE || _XOPEN_SOURCE >= 500 DESCRIPTION These functions return a null-terminated string containing an absolute pathname that is the current working directory of the calling process. The pathname is returned as the function result and via the argument buf, if present.
If the current directory is not below the root directory of the current process (e.g., because the process set a new filesystem root using chroot(2) without changing its current directory into the new root), then, since Linux 2.6.36, the returned path will be prefixed with the string "(unreachable)". Such behavior can also be caused by an unprivileged user by changing the current directory into another mount namespace. When dealing with paths from untrusted sources, callers of these functions should consider checking whether the returned path starts with '/' or '(' to avoid misinterpreting an unreachable path as a relative path. The getcwd() function copies an absolute pathname of the current working directory to the array pointed to by buf, which is of length size. If the length of the absolute pathname of the current working directory, including the terminating null byte, exceeds size bytes, NULL is returned, and errno is set to ERANGE; an application should check for this error, and allocate a larger buffer if necessary. As an extension to the POSIX.1-2001 standard, glibc's getcwd() allocates the buffer dynamically using malloc(3) if buf is NULL. In this case, the allocated buffer has the length size unless size is zero, when buf is allocated as big as necessary. The caller should free(3) the returned buffer. get_current_dir_name() will malloc(3) an array big enough to hold the absolute pathname of the current working directory. If the environment variable PWD is set, and its value is correct, then that value will be returned. The caller should free(3) the returned buffer. getwd() does not malloc(3) any memory. The buf argument should be a pointer to an array at least PATH_MAX bytes long. If the length of the absolute pathname of the current working directory, including the terminating null byte, exceeds PATH_MAX bytes, NULL is returned, and errno is set to ENAMETOOLONG. 
(Note that on some systems, PATH_MAX may not be a compile-time constant; furthermore, its value may depend on the filesystem, see pathconf(3).) For portability and security reasons, use of getwd() is deprecated.
http://man7.org/linux/man-pages/man2/chdir.2.html 10 SYSTEM CALL: chdir(2) - Linux manual page FUNCTIONALITY: chdir, fchdir - change working directory SYNOPSIS: #include <unistd.h> int chdir(const char *path); int fchdir(int fd); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): fchdir(): _XOPEN_SOURCE >= 500 || /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L || /* Glibc up to and including 2.19: */ _BSD_SOURCE DESCRIPTION chdir() changes the current working directory of the calling process to the directory specified in path. fchdir() is identical to chdir(); the only difference is that the directory is given as an open file descriptor.
http://man7.org/linux/man-pages/man2/chroot.2.html 10 SYSTEM CALL: chroot(2) - Linux manual page FUNCTIONALITY: chroot - change root directory SYNOPSIS: #include <unistd.h> int chroot(const char *path); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): chroot(): Since glibc 2.2.2: _XOPEN_SOURCE && !
(_POSIX_C_SOURCE >= 200112L) || /* Since glibc 2.20: */ _DEFAULT_SOURCE || /* Glibc versions <= 2.19: */ _BSD_SOURCE Before glibc 2.2.2: none DESCRIPTION chroot() changes the root directory of the calling process to that specified in path. This directory will be used for pathnames beginning with /. The root directory is inherited by all children of the calling process. Only a privileged process (Linux: one with the CAP_SYS_CHROOT capability) may call chroot(). This call changes an ingredient in the pathname resolution process and does nothing else. In particular, it is not intended to be used for any kind of security purpose, neither to fully sandbox a process nor to restrict filesystem system calls. In the past, chroot() has been used by daemons to restrict themselves prior to passing paths supplied by untrusted users to system calls such as open(2). However, if a folder is moved out of the chroot directory, an attacker can exploit that to get out of the chroot directory as well. The easiest way to do that is to chdir(2) to the to-be-moved directory, wait for it to be moved out, then open a path like ../../../etc/passwd. A slightly trickier variation also works under some circumstances if chdir(2) is not permitted. If a daemon allows a "chroot directory" to be specified, that usually means that if you want to prevent remote users from accessing files outside the chroot directory, you must ensure that folders are never moved out of it. This call does not change the current working directory, so that after the call '.' can be outside the tree rooted at '/'. In particular, the superuser can escape from a "chroot jail" by doing: mkdir foo; chroot foo; cd .. This call does not close open file descriptors, and such file descriptors may allow access to files outside the chroot tree. 
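Because chroot() does not change the current working directory, callers usually follow it with chdir("/") so that '.' lies inside the new root. The following minimal sketch shows only that sequence and is not a sandboxing recipe, in line with the warning above. The helper name try_enter_root, the use of a forked child (so that the caller's own root is left alone), and the return-value convention are all assumptions made for this example.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE         /* chroot() declaration, glibc 2.20 and later */
#endif
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * In a forked child, call chroot(dir) and then chdir("/"), and verify with
 * getcwd() that the working directory is the new root.
 * Returns 0 if the child ended up in the new root, 1 if chroot() failed
 * with EPERM (no CAP_SYS_CHROOT), and -1 on any other failure.
 * try_enter_root is a hypothetical helper name used only for this sketch.
 */
int try_enter_root(const char *dir)
{
    int status, code;
    pid_t pid = fork();

    if (pid == -1)
        return -1;

    if (pid == 0) {                         /* child */
        char cwd[64];

        if (chroot(dir) == -1)
            _exit(errno == EPERM ? 1 : 2);
        if (chdir("/") == -1)               /* move '.' inside the new root */
            _exit(2);
        if (getcwd(cwd, sizeof cwd) == NULL || strcmp(cwd, "/") != 0)
            _exit(2);
        _exit(0);
    }

    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    code = WEXITSTATUS(status);
    return code == 2 ? -1 : code;
}
```

An unprivileged caller should see the return value 1, since only a process with CAP_SYS_CHROOT may call chroot(). Open file descriptors inherited by the child are not closed by this sketch, which is one of the escape routes the description mentions.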
http://man7.org/linux/man-pages/man2/getdents.2.html 11 SYSTEM CALL: getdents(2) - Linux manual page FUNCTIONALITY: getdents, getdents64 - get directory entries SYNOPSIS: int getdents(unsigned int fd, struct linux_dirent *dirp, unsigned int count); int getdents64(unsigned int fd, struct linux_dirent64 *dirp, unsigned int count); Note: There are no glibc wrappers for these system calls; see NOTES. DESCRIPTION These are not the interfaces you are interested in. Look at readdir(3) for the POSIX-conforming C library interface. This page documents the bare kernel system call interfaces. getdents() The system call getdents() reads several linux_dirent structures from the directory referred to by the open file descriptor fd into the buffer pointed to by dirp. The argument count specifies the size of that buffer. The linux_dirent structure is declared as follows: struct linux_dirent { unsigned long d_ino; /* Inode number */ unsigned long d_off; /* Offset to next linux_dirent */ unsigned short d_reclen; /* Length of this linux_dirent */ char d_name[]; /* Filename (null-terminated) */ /* length is actually (d_reclen - 2 - offsetof(struct linux_dirent, d_name)) */ /* char pad; // Zero padding byte char d_type; // File type (only since Linux // 2.6.4); offset is (d_reclen - 1) */ } d_ino is an inode number. d_off is the distance from the start of the directory to the start of the next linux_dirent. d_reclen is the size of this entire linux_dirent. d_name is a null-terminated filename. d_type is a byte at the end of the structure that indicates the file type. It contains one of the following values (defined in <dirent.h>): DT_BLK This is a block device. DT_CHR This is a character device. DT_DIR This is a directory. DT_FIFO This is a named pipe (FIFO). DT_LNK This is a symbolic link. DT_REG This is a regular file. DT_SOCK This is a UNIX domain socket. DT_UNKNOWN The file type is unknown. The d_type field is implemented since Linux 2.6.4. 
It occupies a space that was previously a zero-filled padding byte in the linux_dirent structure. Thus, on kernels up to and including 2.6.3, attempting to access this field always provides the value 0 (DT_UNKNOWN). Currently, only some filesystems (among them: Btrfs, ext2, ext3, and ext4) have full support for returning the file type in d_type. All applications must properly handle a return of DT_UNKNOWN. getdents64() The original Linux getdents() system call did not handle large filesystems and large file offsets. Consequently, Linux 2.4 added getdents64(), with wider types for the d_ino and d_off fields. In addition, getdents64() supports an explicit d_type field. The getdents64() system call is like getdents(), except that its second argument is a pointer to a buffer containing structures of the following type: struct linux_dirent64 { ino64_t d_ino; /* 64-bit inode number */ off64_t d_off; /* 64-bit offset to next structure */ unsigned short d_reclen; /* Size of this dirent */ unsigned char d_type; /* File type */ char d_name[]; /* Filename (null-terminated) */ }; http://man7.org/linux/man-pages/man2/getdents64.2.html 11 SYSTEM CALL: getdents(2) - Linux manual page FUNCTIONALITY: getdents, getdents64 - get directory entries SYNOPSIS: int getdents(unsigned int fd, struct linux_dirent *dirp, unsigned int count); int getdents64(unsigned int fd, struct linux_dirent64 *dirp, unsigned int count); Note: There are no glibc wrappers for these system calls; see NOTES. DESCRIPTION These are not the interfaces you are interested in. Look at readdir(3) for the POSIX-conforming C library interface. This page documents the bare kernel system call interfaces. getdents() The system call getdents() reads several linux_dirent structures from the directory referred to by the open file descriptor fd into the buffer pointed to by dirp. The argument count specifies the size of that buffer. 
The linux_dirent structure is declared as follows: struct linux_dirent { unsigned long d_ino; /* Inode number */ unsigned long d_off; /* Offset to next linux_dirent */ unsigned short d_reclen; /* Length of this linux_dirent */ char d_name[]; /* Filename (null-terminated) */ /* length is actually (d_reclen - 2 - offsetof(struct linux_dirent, d_name)) */ /* char pad; // Zero padding byte char d_type; // File type (only since Linux // 2.6.4); offset is (d_reclen - 1) */ } d_ino is an inode number. d_off is the distance from the start of the directory to the start of the next linux_dirent. d_reclen is the size of this entire linux_dirent. d_name is a null-terminated filename. d_type is a byte at the end of the structure that indicates the file type. It contains one of the following values (defined in <dirent.h>): DT_BLK This is a block device. DT_CHR This is a character device. DT_DIR This is a directory. DT_FIFO This is a named pipe (FIFO). DT_LNK This is a symbolic link. DT_REG This is a regular file. DT_SOCK This is a UNIX domain socket. DT_UNKNOWN The file type is unknown. The d_type field is implemented since Linux 2.6.4. It occupies a space that was previously a zero-filled padding byte in the linux_dirent structure. Thus, on kernels up to and including 2.6.3, attempting to access this field always provides the value 0 (DT_UNKNOWN). Currently, only some filesystems (among them: Btrfs, ext2, ext3, and ext4) have full support for returning the file type in d_type. All applications must properly handle a return of DT_UNKNOWN. getdents64() The original Linux getdents() system call did not handle large filesystems and large file offsets. Consequently, Linux 2.4 added getdents64(), with wider types for the d_ino and d_off fields. In addition, getdents64() supports an explicit d_type field. 
The getdents64() system call is like getdents(), except that its second argument is a pointer to a buffer containing structures of the following type: struct linux_dirent64 { ino64_t d_ino; /* 64-bit inode number */ off64_t d_off; /* 64-bit offset to next structure */ unsigned short d_reclen; /* Size of this dirent */ unsigned char d_type; /* File type */ char d_name[]; /* Filename (null-terminated) */ }; http://man7.org/linux/man-pages/man2/lookup_dcookie.2.html 11 SYSTEM CALL: lookup_dcookie(2) - Linux manual page FUNCTIONALITY: lookup_dcookie - return a directory entry's path SYNOPSIS: int lookup_dcookie(u64 cookie, char *buffer, size_t len); DESCRIPTION Look up the full path of the directory entry specified by the value cookie. The cookie is an opaque identifier uniquely identifying a particular directory entry. The buffer given is filled in with the full path of the directory entry. For lookup_dcookie() to return successfully, the kernel must still hold a cookie reference to the directory entry. http://man7.org/linux/man-pages/man2/link.2.html 12 SYSTEM CALL: link(2) - Linux manual page FUNCTIONALITY: link, linkat - make a new name for a file SYNOPSIS: #include <unistd.h> int link(const char *oldpath, const char *newpath); #include <fcntl.h> /* Definition of AT_* constants */ #include <unistd.h> int linkat(int olddirfd, const char *oldpath, int newdirfd, const char *newpath, int flags); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): linkat(): Since glibc 2.10: _POSIX_C_SOURCE >= 200809L Before glibc 2.10: _ATFILE_SOURCE DESCRIPTION link() creates a new link (also known as a hard link) to an existing file. If newpath exists, it will not be overwritten. This new name may be used exactly as the old one for any operation; both names refer to the same file (and so have the same permissions and ownership) and it is impossible to tell which name was the "original". 
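The claim that both names refer to the same file can be checked by comparing inode numbers. The sketch below assumes a writable /tmp on a filesystem that supports hard links; the function name demo_hard_link and the ".hard" suffix are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical demo: returns 1 if a file and its new hard link share one
   inode with a link count of 2 and a second link() to the same name is
   refused with EEXIST; 0 if not; -1 on any other error. */
int demo_hard_link(void)
{
    char oldpath[] = "/tmp/link_demo_XXXXXX";    /* assumed scratch location */
    char newpath[sizeof oldpath + 8];
    struct stat a, b;
    int result = -1;

    int fd = mkstemp(oldpath);                   /* create the first name */
    if (fd == -1)
        return -1;
    close(fd);

    int n = snprintf(newpath, sizeof newpath, "%s.hard", oldpath);
    assert(n > 0 && (size_t)n < sizeof newpath); /* buffer was sized for the suffix */

    if (link(oldpath, newpath) == 0) {
        if (stat(oldpath, &a) == 0 && stat(newpath, &b) == 0)
            result = (a.st_dev == b.st_dev && a.st_ino == b.st_ino &&
                      a.st_nlink == 2);
        /* newpath now exists, so linking again must not overwrite it. */
        if (link(oldpath, newpath) != -1 || errno != EEXIST)
            result = 0;
        unlink(newpath);
    }
    unlink(oldpath);
    return result;
}
```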
linkat() The linkat() system call operates in exactly the same way as link(), except for the differences described here. If the pathname given in oldpath is relative, then it is interpreted relative to the directory referred to by the file descriptor olddirfd (rather than relative to the current working directory of the calling process, as is done by link() for a relative pathname). If oldpath is relative and olddirfd is the special value AT_FDCWD, then oldpath is interpreted relative to the current working directory of the calling process (like link()). If oldpath is absolute, then olddirfd is ignored. The interpretation of newpath is as for oldpath, except that a relative pathname is interpreted relative to the directory referred to by the file descriptor newdirfd. The following values can be bitwise ORed in flags: AT_EMPTY_PATH (since Linux 2.6.39) If oldpath is an empty string, create a link to the file referenced by olddirfd (which may have been obtained using the open(2) O_PATH flag). In this case, olddirfd can refer to any type of file, not just a directory. This will generally not work if the file has a link count of zero (files created with O_TMPFILE and without O_EXCL are an exception). The caller must have the CAP_DAC_READ_SEARCH capability in order to use this flag. This flag is Linux-specific; define _GNU_SOURCE to obtain its definition. AT_SYMLINK_FOLLOW (since Linux 2.6.18) By default, linkat() does not dereference oldpath if it is a symbolic link (like link()). The flag AT_SYMLINK_FOLLOW can be specified in flags to cause oldpath to be dereferenced if it is a symbolic link. If procfs is mounted, this can be used as an alternative to AT_EMPTY_PATH, like this: linkat(AT_FDCWD, "/proc/self/fd/", newdirfd, newname, AT_SYMLINK_FOLLOW); Before kernel 2.6.18, the flags argument was unused, and had to be specified as 0. See openat(2) for an explanation of the need for linkat(). 
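To illustrate how relative pathnames are resolved against olddirfd and newdirfd, here is a small sketch that links a file in one directory to a name in a subdirectory. The function name demo_linkat, the scratch directory under /tmp, and the fixed names "a" and "sub" are assumptions for the example; newname is expected to be a single relative path component.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical demo: link "a" (relative to olddirfd) to newname (relative
   to newdirfd, a subdirectory). Returns 1 if both names reach the same
   inode, 0 if not, -1 on error. */
int demo_linkat(const char *newname)
{
    char dir[] = "/tmp/linkat_demo_XXXXXX";      /* assumed scratch location */
    struct stat sa, sb;
    int result = -1;

    /* An absolute newname would make linkat() ignore newdirfd. */
    assert(newname != NULL && newname[0] != '\0' && newname[0] != '/');

    if (mkdtemp(dir) == NULL)
        return -1;
    int olddirfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (olddirfd == -1) {
        rmdir(dir);
        return -1;
    }

    int newdirfd = -1;
    if (mkdirat(olddirfd, "sub", 0700) == 0)
        newdirfd = openat(olddirfd, "sub", O_RDONLY | O_DIRECTORY);
    int fd = openat(olddirfd, "a", O_CREAT | O_EXCL | O_WRONLY, 0600);

    if (newdirfd != -1 && fd != -1 &&
        linkat(olddirfd, "a", newdirfd, newname, 0) == 0 &&
        fstatat(olddirfd, "a", &sa, 0) == 0 &&
        fstatat(newdirfd, newname, &sb, 0) == 0)
        result = (sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino);

    /* Best-effort cleanup; errors are ignored in this sketch. */
    if (fd != -1)
        close(fd);
    if (newdirfd != -1) {
        unlinkat(newdirfd, newname, 0);
        close(newdirfd);
    }
    unlinkat(olddirfd, "a", 0);
    unlinkat(olddirfd, "sub", AT_REMOVEDIR);
    close(olddirfd);
    rmdir(dir);
    return result;
}
```

Because both paths are resolved through descriptors, the result does not depend on the process's current working directory.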
http://man7.org/linux/man-pages/man2/linkat.2.html 12 SYSTEM CALL: link(2) - Linux manual page FUNCTIONALITY: link, linkat - make a new name for a file SYNOPSIS: #include <unistd.h> int link(const char *oldpath, const char *newpath); #include <fcntl.h> /* Definition of AT_* constants */ #include <unistd.h> int linkat(int olddirfd, const char *oldpath, int newdirfd, const char *newpath, int flags); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): linkat(): Since glibc 2.10: _POSIX_C_SOURCE >= 200809L Before glibc 2.10: _ATFILE_SOURCE DESCRIPTION link() creates a new link (also known as a hard link) to an existing file. If newpath exists, it will not be overwritten. This new name may be used exactly as the old one for any operation; both names refer to the same file (and so have the same permissions and ownership) and it is impossible to tell which name was the "original". linkat() The linkat() system call operates in exactly the same way as link(), except for the differences described here. If the pathname given in oldpath is relative, then it is interpreted relative to the directory referred to by the file descriptor olddirfd (rather than relative to the current working directory of the calling process, as is done by link() for a relative pathname). If oldpath is relative and olddirfd is the special value AT_FDCWD, then oldpath is interpreted relative to the current working directory of the calling process (like link()). If oldpath is absolute, then olddirfd is ignored. The interpretation of newpath is as for oldpath, except that a relative pathname is interpreted relative to the directory referred to by the file descriptor newdirfd. The following values can be bitwise ORed in flags: AT_EMPTY_PATH (since Linux 2.6.39) If oldpath is an empty string, create a link to the file referenced by olddirfd (which may have been obtained using the open(2) O_PATH flag). In this case, olddirfd can refer to any type of file, not just a directory. 
This will generally not work if the file has a link count of zero (files created with O_TMPFILE and without O_EXCL are an exception). The caller must have the CAP_DAC_READ_SEARCH capability in order to use this flag. This flag is Linux-specific; define _GNU_SOURCE to obtain its definition. AT_SYMLINK_FOLLOW (since Linux 2.6.18) By default, linkat() does not dereference oldpath if it is a symbolic link (like link()). The flag AT_SYMLINK_FOLLOW can be specified in flags to cause oldpath to be dereferenced if it is a symbolic link. If procfs is mounted, this can be used as an alternative to AT_EMPTY_PATH, like this: linkat(AT_FDCWD, "/proc/self/fd/", newdirfd, newname, AT_SYMLINK_FOLLOW); Before kernel 2.6.18, the flags argument was unused, and had to be specified as 0. See openat(2) for an explanation of the need for linkat(). http://man7.org/linux/man-pages/man2/symlink.2.html 11 SYSTEM CALL: symlink(2) - Linux manual page FUNCTIONALITY: symlink, symlinkat - make a new name for a file SYNOPSIS: #include <unistd.h> int symlink(const char *target, const char *linkpath); #include <fcntl.h> /* Definition of AT_* constants */ #include <unistd.h> int symlinkat(const char *target, int newdirfd, const char *linkpath); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): symlink(): _XOPEN_SOURCE >= 500 || _POSIX_C_SOURCE >= 200112L || /* Glibc versions <= 2.19: */ _BSD_SOURCE symlinkat(): Since glibc 2.10: _POSIX_C_SOURCE >= 200809L Before glibc 2.10: _ATFILE_SOURCE DESCRIPTION symlink() creates a symbolic link named linkpath which contains the string target. Symbolic links are interpreted at run time as if the contents of the link had been substituted into the path being followed to find a file or directory. Symbolic links may contain .. 
A symbolic link (also known as a soft link) may point to an existing file or to a nonexistent one; the latter case is known as a dangling link. The permissions of a symbolic link are irrelevant; the ownership is ignored when following the link, but is checked when removal or renaming of the link is requested and the link is in a directory with the sticky bit (S_ISVTX) set. If linkpath exists, it will not be overwritten. symlinkat() The symlinkat() system call operates in exactly the same way as symlink(), except for the differences described here. If the pathname given in linkpath is relative, then it is interpreted relative to the directory referred to by the file descriptor newdirfd (rather than relative to the current working directory of the calling process, as is done by symlink() for a relative pathname). If linkpath is relative and newdirfd is the special value AT_FDCWD, then linkpath is interpreted relative to the current working directory of the calling process (like symlink()). If linkpath is absolute, then newdirfd is ignored. http://man7.org/linux/man-pages/man2/symlinkat.2.html 11 SYSTEM CALL: symlink(2) - Linux manual page FUNCTIONALITY: symlink, symlinkat - make a new name for a file SYNOPSIS: #include <unistd.h> int symlink(const char *target, const char *linkpath); #include <fcntl.h> /* Definition of AT_* constants */ #include <unistd.h> int symlinkat(const char *target, int newdirfd, const char *linkpath); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): symlink(): _XOPEN_SOURCE >= 500 || _POSIX_C_SOURCE >= 200112L || /* Glibc versions <= 2.19: */ _BSD_SOURCE symlinkat(): Since glibc 2.10: _POSIX_C_SOURCE >= 200809L Before glibc 2.10: _ATFILE_SOURCE DESCRIPTION symlink() creates a symbolic link named linkpath which contains the string target. Symbolic links are interpreted at run time as if the contents of the link had been substituted into the path being followed to find a file or directory. Symbolic links may contain .. 
path components, which (if used at the start of the link) refer to the parent directories of that in which the link resides. A symbolic link (also known as a soft link) may point to an existing file or to a nonexistent one; the latter case is known as a dangling link. The permissions of a symbolic link are irrelevant; the ownership is ignored when following the link, but is checked when removal or renaming of the link is requested and the link is in a directory with the sticky bit (S_ISVTX) set. If linkpath exists, it will not be overwritten. symlinkat() The symlinkat() system call operates in exactly the same way as symlink(), except for the differences described here. If the pathname given in linkpath is relative, then it is interpreted relative to the directory referred to by the file descriptor newdirfd (rather than relative to the current working directory of the calling process, as is done by symlink() for a relative pathname). If linkpath is relative and newdirfd is the special value AT_FDCWD, then linkpath is interpreted relative to the current working directory of the calling process (like symlink()). If linkpath is absolute, then newdirfd is ignored. http://man7.org/linux/man-pages/man2/unlink.2.html 12 SYSTEM CALL: unlink(2) - Linux manual page FUNCTIONALITY: unlink, unlinkat - delete a name and possibly the file it refers to SYNOPSIS: #include <unistd.h> int unlink(const char *pathname); #include <fcntl.h> /* Definition of AT_* constants */ #include <unistd.h> int unlinkat(int dirfd, const char *pathname, int flags); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): unlinkat(): Since glibc 2.10: _POSIX_C_SOURCE >= 200809L Before glibc 2.10: _ATFILE_SOURCE DESCRIPTION unlink() deletes a name from the filesystem. If that name was the last link to a file and no processes have the file open, the file is deleted and the space it was using is made available for reuse. 
If the name was the last link to a file but any processes still have the file open, the file will remain in existence until the last file descriptor referring to it is closed. If the name referred to a symbolic link, the link is removed. If the name referred to a socket, FIFO, or device, the name for it is removed but processes which have the object open may continue to use it. unlinkat() The unlinkat() system call operates in exactly the same way as either unlink() or rmdir(2) (depending on whether or not flags includes the AT_REMOVEDIR flag) except for the differences described here. If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by unlink() and rmdir(2) for a relative pathname). If the pathname given in pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like unlink() and rmdir(2)). If the pathname given in pathname is absolute, then dirfd is ignored. flags is a bit mask that can either be specified as 0, or by ORing together flag values that control the operation of unlinkat(). Currently, only one such flag is defined: AT_REMOVEDIR By default, unlinkat() performs the equivalent of unlink() on pathname. If the AT_REMOVEDIR flag is specified, then unlinkat() performs the equivalent of rmdir(2) on pathname. See openat(2) for an explanation of the need for unlinkat(). 
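The rule that an unlinked file survives while a descriptor is still open can be observed directly. The following sketch assumes a writable /tmp; the function name survives_unlink is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical demo: write msg to a temporary file, unlink its only name,
   then read the data back through the still-open descriptor.
   Returns 1 if the link count dropped to 0 and the data is intact,
   0 if not, -1 on error. */
int survives_unlink(const char *msg)
{
    char path[] = "/tmp/unlink_demo_XXXXXX";     /* assumed scratch location */
    char buf[256];
    struct stat sb;
    int result = -1;

    assert(msg != NULL && strlen(msg) < sizeof buf);
    size_t len = strlen(msg);

    int fd = mkstemp(path);
    if (fd == -1)
        return -1;

    if (write(fd, msg, len) == (ssize_t)len &&
        unlink(path) == 0 &&                     /* the name is gone ... */
        fstat(fd, &sb) == 0 &&                   /* ... but the file still exists */
        lseek(fd, 0, SEEK_SET) == 0 &&
        read(fd, buf, len) == (ssize_t)len)
        result = (sb.st_nlink == 0 && memcmp(buf, msg, len) == 0);
    else
        unlink(path);                            /* best-effort cleanup on failure */

    close(fd);                                   /* last reference: storage is released */
    return result;
}
```

This is the mechanism behind the common "create, unlink, keep using" pattern for private temporary files.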
http://man7.org/linux/man-pages/man2/unlinkat.2.html 12 SYSTEM CALL: unlink(2) - Linux manual page FUNCTIONALITY: unlink, unlinkat - delete a name and possibly the file it refers to SYNOPSIS: #include <unistd.h> int unlink(const char *pathname); #include <fcntl.h> /* Definition of AT_* constants */ #include <unistd.h> int unlinkat(int dirfd, const char *pathname, int flags); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): unlinkat(): Since glibc 2.10: _POSIX_C_SOURCE >= 200809L Before glibc 2.10: _ATFILE_SOURCE DESCRIPTION unlink() deletes a name from the filesystem. If that name was the last link to a file and no processes have the file open, the file is deleted and the space it was using is made available for reuse. If the name was the last link to a file but any processes still have the file open, the file will remain in existence until the last file descriptor referring to it is closed. If the name referred to a symbolic link, the link is removed. If the name referred to a socket, FIFO, or device, the name for it is removed but processes which have the object open may continue to use it. unlinkat() The unlinkat() system call operates in exactly the same way as either unlink() or rmdir(2) (depending on whether or not flags includes the AT_REMOVEDIR flag) except for the differences described here. If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by unlink() and rmdir(2) for a relative pathname). If the pathname given in pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like unlink() and rmdir(2)). If the pathname given in pathname is absolute, then dirfd is ignored. 
flags is a bit mask that can either be specified as 0, or by ORing together flag values that control the operation of unlinkat(). Currently, only one such flag is defined: AT_REMOVEDIR By default, unlinkat() performs the equivalent of unlink() on pathname. If the AT_REMOVEDIR flag is specified, then unlinkat() performs the equivalent of rmdir(2) on pathname. See openat(2) for an explanation of the need for unlinkat(). http://man7.org/linux/man-pages/man2/readlink.2.html 12 SYSTEM CALL: readlink(2) - Linux manual page FUNCTIONALITY: readlink, readlinkat - read value of a symbolic link SYNOPSIS: #include <unistd.h> ssize_t readlink(const char *pathname, char *buf, size_t bufsiz); #include <fcntl.h> /* Definition of AT_* constants */ #include <unistd.h> ssize_t readlinkat(int dirfd, const char *pathname, char *buf, size_t bufsiz); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): readlink(): _XOPEN_SOURCE >= 500 || _POSIX_C_SOURCE >= 200112L || /* Glibc versions <= 2.19: */ _BSD_SOURCE readlinkat(): Since glibc 2.10: _POSIX_C_SOURCE >= 200809L Before glibc 2.10: _ATFILE_SOURCE DESCRIPTION readlink() places the contents of the symbolic link pathname in the buffer buf, which has size bufsiz. readlink() does not append a null byte to buf. It will truncate the contents (to a length of bufsiz characters), in case the buffer is too small to hold all of the contents. readlinkat() The readlinkat() system call operates in exactly the same way as readlink(), except for the differences described here. If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by readlink() for a relative pathname). If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like readlink()). If pathname is absolute, then dirfd is ignored. 
Since Linux 2.6.39, pathname can be an empty string, in which case the call operates on the symbolic link referred to by dirfd (which should have been obtained using open(2) with the O_PATH and O_NOFOLLOW flags). See openat(2) for an explanation of the need for readlinkat(). http://man7.org/linux/man-pages/man2/readlinkat.2.html 12 SYSTEM CALL: readlink(2) - Linux manual page FUNCTIONALITY: readlink, readlinkat - read value of a symbolic link SYNOPSIS: #include <unistd.h> ssize_t readlink(const char *pathname, char *buf, size_t bufsiz); #include <fcntl.h> /* Definition of AT_* constants */ #include <unistd.h> ssize_t readlinkat(int dirfd, const char *pathname, char *buf, size_t bufsiz); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): readlink(): _XOPEN_SOURCE >= 500 || _POSIX_C_SOURCE >= 200112L || /* Glibc versions <= 2.19: */ _BSD_SOURCE readlinkat(): Since glibc 2.10: _POSIX_C_SOURCE >= 200809L Before glibc 2.10: _ATFILE_SOURCE DESCRIPTION readlink() places the contents of the symbolic link pathname in the buffer buf, which has size bufsiz. readlink() does not append a null byte to buf. It will truncate the contents (to a length of bufsiz characters), in case the buffer is too small to hold all of the contents. readlinkat() The readlinkat() system call operates in exactly the same way as readlink(), except for the differences described here. If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by readlink() for a relative pathname). If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like readlink()). If pathname is absolute, then dirfd is ignored. 
Since Linux 2.6.39, pathname can be an empty string, in which case the call operates on the symbolic link referred to by dirfd (which should have been obtained using open(2) with the O_PATH and O_NOFOLLOW flags). See openat(2) for an explanation of the need for readlinkat(). http://man7.org/linux/man-pages/man2/umask.2.html 9 SYSTEM CALL: umask(2) - Linux manual page FUNCTIONALITY: umask - set file mode creation mask SYNOPSIS: #include <sys/types.h> #include <sys/stat.h> mode_t umask(mode_t mask); DESCRIPTION umask() sets the calling process's file mode creation mask (umask) to mask & 0777 (i.e., only the file permission bits of mask are used), and returns the previous value of the mask. The umask is used by open(2), mkdir(2), and other system calls that create files to modify the permissions placed on newly created files or directories. Specifically, permissions in the umask are turned off from the mode argument to open(2) and mkdir(2). Alternatively, if the parent directory has a default ACL (see acl(5)), the umask is ignored, the default ACL is inherited, the permission bits are set based on the inherited ACL, and permission bits absent in the mode argument are turned off. For example, the following default ACL is equivalent to a umask of 022: u::rwx,g::r-x,o::r-x Combining the effect of this default ACL with a mode argument of 0666 (rw-rw-rw-), the resulting file permissions would be 0644 (rw-r--r--). The constants that should be used to specify mask are described under stat(2). The typical default value for the process umask is S_IWGRP | S_IWOTH (octal 022). In the usual case where the mode argument to open(2) is specified as: S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP | S_IROTH | S_IWOTH (octal 0666) when creating a new file, the permissions on the resulting file will be: S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH (because 0666 & ~022 = 0644; i.e., rw-r--r--). 
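The arithmetic 0666 & ~022 = 0644 can be written out and compared against what the kernel actually stores. This is a sketch under the assumption that the scratch directory under /tmp carries no default ACL; apply_umask and created_mode are hypothetical helper names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Pure arithmetic: permission bits left once the umask has been applied. */
mode_t apply_umask(mode_t mode, mode_t mask)
{
    return mode & ~(mask & 0777);
}

/* Hypothetical check: create a file with the given mode under a temporary
   umask and return the permission bits that were stored, or (mode_t)-1 on
   error. Note that umask() affects the whole process, so this is not safe
   to run concurrently with other threads that create files. */
mode_t created_mode(mode_t mode, mode_t mask)
{
    char dir[] = "/tmp/umask_demo_XXXXXX";       /* assumed scratch location */
    struct stat sb;
    mode_t result = (mode_t)-1;

    assert((mode & ~(mode_t)0777) == 0);         /* sketch covers permission bits only */

    if (mkdtemp(dir) == NULL)
        return result;
    int dirfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dirfd != -1) {
        mode_t old = umask(mask);                /* umask() returns the previous mask */
        int fd = openat(dirfd, "f", O_CREAT | O_EXCL | O_WRONLY, mode);
        umask(old);                              /* restore the caller's mask */
        if (fd != -1) {
            if (fstat(fd, &sb) == 0)
                result = sb.st_mode & 0777;
            close(fd);
            unlinkat(dirfd, "f", 0);
        }
        close(dirfd);
    }
    rmdir(dir);
    return result;
}
```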
http://man7.org/linux/man-pages/man2/stat.2.html 12 SYSTEM CALL: stat(2) - Linux manual page FUNCTIONALITY: stat, fstat, lstat, fstatat - get file status SYNOPSIS: #include <sys/types.h> #include <sys/stat.h> #include <unistd.h> int stat(const char *pathname, struct stat *buf); int fstat(int fd, struct stat *buf); int lstat(const char *pathname, struct stat *buf); #include <fcntl.h> /* Definition of AT_* constants */ #include <sys/stat.h> int fstatat(int dirfd, const char *pathname, struct stat *buf, int flags); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): lstat(): /* glibc 2.19 and earlier */ _BSD_SOURCE || /* Since glibc 2.20 */ _DEFAULT_SOURCE || _XOPEN_SOURCE >= 500 || /* Since glibc 2.10: */ _POSIX_C_SOURCE >= 200112L fstatat(): Since glibc 2.10: _POSIX_C_SOURCE >= 200809L Before glibc 2.10: _ATFILE_SOURCE DESCRIPTION These functions return information about a file, in the buffer pointed to by buf. No permissions are required on the file itself, but, in the case of stat(), fstatat(), and lstat(), execute (search) permission is required on all of the directories in pathname that lead to the file. stat() and fstatat() retrieve information about the file pointed to by pathname; the differences for fstatat() are described below. lstat() is identical to stat(), except that if pathname is a symbolic link, then it returns information about the link itself, not the file that it refers to. fstat() is identical to stat(), except that the file about which information is to be retrieved is specified by the file descriptor fd. 
All of these system calls return a stat structure, which contains the following fields: struct stat { dev_t st_dev; /* ID of device containing file */ ino_t st_ino; /* inode number */ mode_t st_mode; /* file type and mode */ nlink_t st_nlink; /* number of hard links */ uid_t st_uid; /* user ID of owner */ gid_t st_gid; /* group ID of owner */ dev_t st_rdev; /* device ID (if special file) */ off_t st_size; /* total size, in bytes */ blksize_t st_blksize; /* blocksize for filesystem I/O */ blkcnt_t st_blocks; /* number of 512B blocks allocated */ /* Since Linux 2.6, the kernel supports nanosecond precision for the following timestamp fields. For the details before Linux 2.6, see NOTES. */ struct timespec st_atim; /* time of last access */ struct timespec st_mtim; /* time of last modification */ struct timespec st_ctim; /* time of last status change */ #define st_atime st_atim.tv_sec /* Backward compatibility */ #define st_mtime st_mtim.tv_sec #define st_ctime st_ctim.tv_sec }; Note: the order of fields in the stat structure varies somewhat across architectures. In addition, the definition above does not show the padding bytes that may be present between some fields on various architectures. Consult the glibc and kernel source code if you need to know the details. Note: For performance and simplicity reasons, different fields in the stat structure may contain state information from different moments during the execution of the system call. For example, if st_mode or st_uid is changed by another process by calling chmod(2) or chown(2), stat() might return the old st_mode together with the new st_uid, or the old st_uid together with the new st_mode. The st_dev field describes the device on which this file resides. (The major(3) and minor(3) macros may be useful to decompose the device ID in this field.) The st_rdev field describes the device that this file (inode) represents. 
The st_size field gives the size of the file (if it is a regular file or a symbolic link) in bytes. The size of a symbolic link is the length of the pathname it contains, without a terminating null byte. The st_blocks field indicates the number of blocks allocated to the file, in 512-byte units. (This may be smaller than st_size/512 when the file has holes.) The st_blksize field gives the "preferred" blocksize for efficient filesystem I/O. (Writing to a file in smaller chunks may cause an inefficient read-modify-rewrite.) Not all of the Linux filesystems implement all of the time fields. Some filesystem types allow mounting in such a way that file and/or directory accesses do not cause an update of the st_atime field. (See noatime, nodiratime, and relatime in mount(8), and related information in mount(2).) In addition, st_atime is not updated if a file is opened with the O_NOATIME flag; see open(2). The field st_atime is changed by file accesses, for example, by execve(2), mknod(2), pipe(2), utime(2), and read(2) (of more than zero bytes). Other routines, like mmap(2), may or may not update st_atime. The field st_mtime is changed by file modifications, for example, by mknod(2), truncate(2), utime(2), and write(2) (of more than zero bytes). Moreover, st_mtime of a directory is changed by the creation or deletion of files in that directory. The st_mtime field is not changed for changes in owner, group, hard link count, or mode. The field st_ctime is changed by writing or by setting inode information (i.e., owner, group, link count, mode, etc.). POSIX refers to the st_mode bits corresponding to the mask S_IFMT (see below) as the file type, the 12 bits corresponding to the mask 07777 as the file mode bits and the least significant 9 bits (0777) as the file permission bits. 
The following mask values are defined for the file type of the st_mode field: S_IFMT 0170000 bit mask for the file type bit field S_IFSOCK 0140000 socket S_IFLNK 0120000 symbolic link S_IFREG 0100000 regular file S_IFBLK 0060000 block device S_IFDIR 0040000 directory S_IFCHR 0020000 character device S_IFIFO 0010000 FIFO Thus, to test for a regular file (for example), one could write: stat(pathname, &sb); if ((sb.st_mode & S_IFMT) == S_IFREG) { /* Handle regular file */ } Because tests of the above form are common, additional macros are defined by POSIX to allow the test of the file type in st_mode to be written more concisely: S_ISREG(m) is it a regular file? S_ISDIR(m) directory? S_ISCHR(m) character device? S_ISBLK(m) block device? S_ISFIFO(m) FIFO (named pipe)? S_ISLNK(m) symbolic link? (Not in POSIX.1-1996.) S_ISSOCK(m) socket? (Not in POSIX.1-1996.) The preceding code snippet could thus be rewritten as: stat(pathname, &sb); if (S_ISREG(sb.st_mode)) { /* Handle regular file */ } The definitions of most of the above file type test macros are provided if any of the following feature test macros is defined: _BSD_SOURCE (in glibc 2.19 and earlier), _SVID_SOURCE (in glibc 2.19 and earlier), or _DEFAULT_SOURCE (in glibc 2.20 and later). In addition, definitions of all of the above macros except S_IFSOCK and S_ISSOCK() are provided if _XOPEN_SOURCE is defined. The definition of S_IFSOCK can also be exposed by defining _XOPEN_SOURCE with a value of 500 or greater. The definition of S_ISSOCK() is exposed if any of the following feature test macros is defined: _BSD_SOURCE (in glibc 2.19 and earlier), _DEFAULT_SOURCE (in glibc 2.20 and later), _XOPEN_SOURCE with a value of 500 or greater, or _POSIX_C_SOURCE with a value of 200112L or greater. 
The following mask values are defined for the file mode component of the st_mode field:

    S_ISUID   04000   set-user-ID bit
    S_ISGID   02000   set-group-ID bit (see below)
    S_ISVTX   01000   sticky bit (see below)

    S_IRWXU   00700   owner has read, write, and execute permission
    S_IRUSR   00400   owner has read permission
    S_IWUSR   00200   owner has write permission
    S_IXUSR   00100   owner has execute permission

    S_IRWXG   00070   group has read, write, and execute permission
    S_IRGRP   00040   group has read permission
    S_IWGRP   00020   group has write permission
    S_IXGRP   00010   group has execute permission

    S_IRWXO   00007   others (not in group) have read, write, and execute permission
    S_IROTH   00004   others have read permission
    S_IWOTH   00002   others have write permission
    S_IXOTH   00001   others have execute permission

The set-group-ID bit (S_ISGID) has several special uses. For a directory, it indicates that BSD semantics is to be used for that directory: files created there inherit their group ID from the directory, not from the effective group ID of the creating process, and directories created there will also get the S_ISGID bit set. For a file that does not have the group execution bit (S_IXGRP) set, the set-group-ID bit indicates mandatory file/record locking.

The sticky bit (S_ISVTX) on a directory means that a file in that directory can be renamed or deleted only by the owner of the file, by the owner of the directory, and by a privileged process.

fstatat()

The fstatat() system call operates in exactly the same way as stat(), except for the differences described here.

If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by stat() for a relative pathname).

If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like stat()).
If pathname is absolute, then dirfd is ignored.

flags can either be 0, or include one or more of the following flags ORed:

AT_EMPTY_PATH (since Linux 2.6.39)
    If pathname is an empty string, operate on the file referred to by dirfd (which may have been obtained using the open(2) O_PATH flag). If dirfd is AT_FDCWD, the call operates on the current working directory. In this case, dirfd can refer to any type of file, not just a directory. This flag is Linux-specific; define _GNU_SOURCE to obtain its definition.

AT_NO_AUTOMOUNT (since Linux 2.6.38)
    Don't automount the terminal ("basename") component of pathname if it is a directory that is an automount point. This allows the caller to gather attributes of an automount point (rather than the location it would mount). This flag can be used in tools that scan directories to prevent mass-automounting of a directory of automount points. The AT_NO_AUTOMOUNT flag has no effect if the mount point has already been mounted over. This flag is Linux-specific; define _GNU_SOURCE to obtain its definition.

AT_SYMLINK_NOFOLLOW
    If pathname is a symbolic link, do not dereference it: instead return information about the link itself, like lstat(). (By default, fstatat() dereferences symbolic links, like stat().)

See openat(2) for an explanation of the need for fstatat().
http://man7.org/linux/man-pages/man2/lstat.2.html 12
SYSTEM CALL: stat(2) - Linux manual page
FUNCTIONALITY: stat, fstat, lstat, fstatat - get file status
SYNOPSIS:

    #include <sys/types.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int stat(const char *pathname, struct stat *buf);
    int fstat(int fd, struct stat *buf);
    int lstat(const char *pathname, struct stat *buf);

    #include <fcntl.h>           /* Definition of AT_* constants */
    #include <sys/stat.h>

    int fstatat(int dirfd, const char *pathname, struct stat *buf, int flags);

Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

    lstat():
        /* glibc 2.19 and earlier */ _BSD_SOURCE
            || /* Since glibc 2.20 */ _DEFAULT_SOURCE
            || _XOPEN_SOURCE >= 500
            || /* Since glibc 2.10: */ _POSIX_C_SOURCE >= 200112L

    fstatat():
        Since glibc 2.10: _POSIX_C_SOURCE >= 200809L
        Before glibc 2.10: _ATFILE_SOURCE

DESCRIPTION

These functions return information about a file, in the buffer pointed to by buf. No permissions are required on the file itself, but, in the case of stat(), fstatat(), and lstat(), execute (search) permission is required on all of the directories in pathname that lead to the file.

stat() and fstatat() retrieve information about the file pointed to by pathname; the differences for fstatat() are described below.

lstat() is identical to stat(), except that if pathname is a symbolic link, then it returns information about the link itself, not the file that it refers to.

fstat() is identical to stat(), except that the file about which information is to be retrieved is specified by the file descriptor fd.
All of these system calls return a stat structure, which contains the following fields:

    struct stat {
        dev_t     st_dev;         /* ID of device containing file */
        ino_t     st_ino;         /* inode number */
        mode_t    st_mode;        /* file type and mode */
        nlink_t   st_nlink;       /* number of hard links */
        uid_t     st_uid;         /* user ID of owner */
        gid_t     st_gid;         /* group ID of owner */
        dev_t     st_rdev;        /* device ID (if special file) */
        off_t     st_size;        /* total size, in bytes */
        blksize_t st_blksize;     /* blocksize for filesystem I/O */
        blkcnt_t  st_blocks;      /* number of 512B blocks allocated */

        /* Since Linux 2.6, the kernel supports nanosecond
           precision for the following timestamp fields.
           For the details before Linux 2.6, see NOTES. */

        struct timespec st_atim;  /* time of last access */
        struct timespec st_mtim;  /* time of last modification */
        struct timespec st_ctim;  /* time of last status change */

    #define st_atime st_atim.tv_sec      /* Backward compatibility */
    #define st_mtime st_mtim.tv_sec
    #define st_ctime st_ctim.tv_sec
    };

Note: the order of fields in the stat structure varies somewhat across architectures. In addition, the definition above does not show the padding bytes that may be present between some fields on various architectures. Consult the glibc and kernel source code if you need to know the details.

Note: For performance and simplicity reasons, different fields in the stat structure may contain state information from different moments during the execution of the system call. For example, if st_mode or st_uid is changed by another process by calling chmod(2) or chown(2), stat() might return the old st_mode together with the new st_uid, or the old st_uid together with the new st_mode.

The st_dev field describes the device on which this file resides. (The major(3) and minor(3) macros may be useful to decompose the device ID in this field.)

The st_rdev field describes the device that this file (inode) represents.
http://man7.org/linux/man-pages/man2/fstat.2.html 12 SYSTEM CALL: stat(2) - Linux manual page FUNCTIONALITY: stat, fstat, lstat, fstatat - get file status
http://man7.org/linux/man-pages/man2/fstatat.2.html 12 SYSTEM CALL: stat(2) - Linux manual page FUNCTIONALITY: stat, fstat, lstat, fstatat - get file status SYNOPSIS: #include #include #include int stat(const char *pathname, struct stat *buf); int fstat(int fd, struct stat *buf); int lstat(const char *pathname, struct stat *buf); #include /* Definition of AT_* constants */ #include int fstatat(int dirfd, const char *pathname, struct stat *buf, int flags); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): lstat(): /* glibc 2.19 and earlier */ _BSD_SOURCE || /* Since glibc 2.20 */ _DEFAULT_SOURCE || _XOPEN_SOURCE >= 500 || /* Since glibc 2.10: */ _POSIX_C_SOURCE >= 200112L fstatat(): Since glibc 2.10: _POSIX_C_SOURCE >= 200809L Before glibc 2.10: _ATFILE_SOURCE DESCRIPTION These functions return information about a file, in the buffer pointed to by buf. No permissions are required on the file itself, but—in the case of stat(), fstatat(), and lstat()—execute (search) permission is required on all of the directories in pathname that lead to the file. stat() and fstatat() retrieve information about the file pointed to by pathname; the differences for fstatat() are described below. lstat() is identical to stat(), except that if pathname is a symbolic link, then it returns information about the link itself, not the file that it refers to. fstat() is identical to stat(), except that the file about which information is to be retrieved is specified by the file descriptor fd. 
All of these system calls return a stat structure, which contains the following fields: struct stat { dev_t st_dev; /* ID of device containing file */ ino_t st_ino; /* inode number */ mode_t st_mode; /* file type and mode */ nlink_t st_nlink; /* number of hard links */ uid_t st_uid; /* user ID of owner */ gid_t st_gid; /* group ID of owner */ dev_t st_rdev; /* device ID (if special file) */ off_t st_size; /* total size, in bytes */ blksize_t st_blksize; /* blocksize for filesystem I/O */ blkcnt_t st_blocks; /* number of 512B blocks allocated */ /* Since Linux 2.6, the kernel supports nanosecond precision for the following timestamp fields. For the details before Linux 2.6, see NOTES. */ struct timespec st_atim; /* time of last access */ struct timespec st_mtim; /* time of last modification */ struct timespec st_ctim; /* time of last status change */ #define st_atime st_atim.tv_sec /* Backward compatibility */ #define st_mtime st_mtim.tv_sec #define st_ctime st_ctim.tv_sec }; Note: the order of fields in the stat structure varies somewhat across architectures. In addition, the definition above does not show the padding bytes that may be present between some fields on various architectures. Consult the glibc and kernel source code if you need to know the details. Note: For performance and simplicity reasons, different fields in the stat structure may contain state information from different moments during the execution of the system call. For example, if st_mode or st_uid is changed by another process by calling chmod(2) or chown(2), stat() might return the old st_mode together with the new st_uid, or the old st_uid together with the new st_mode. The st_dev field describes the device on which this file resides. (The major(3) and minor(3) macros may be useful to decompose the device ID in this field.) The st_rdev field describes the device that this file (inode) represents. 
The st_size field gives the size of the file (if it is a regular file or a symbolic link) in bytes. The size of a symbolic link is the length of the pathname it contains, without a terminating null byte. The st_blocks field indicates the number of blocks allocated to the file, 512-byte units. (This may be smaller than st_size/512 when the file has holes.) The st_blksize field gives the "preferred" blocksize for efficient filesystem I/O. (Writing to a file in smaller chunks may cause an inefficient read-modify-rewrite.) Not all of the Linux filesystems implement all of the time fields. Some filesystem types allow mounting in such a way that file and/or directory accesses do not cause an update of the st_atime field. (See noatime, nodiratime, and relatime in mount(8), and related information in mount(2).) In addition, st_atime is not updated if a file is opened with the O_NOATIME; see open(2). The field st_atime is changed by file accesses, for example, by execve(2), mknod(2), pipe(2), utime(2), and read(2) (of more than zero bytes). Other routines, like mmap(2), may or may not update st_atime. The field st_mtime is changed by file modifications, for example, by mknod(2), truncate(2), utime(2), and write(2) (of more than zero bytes). Moreover, st_mtime of a directory is changed by the creation or deletion of files in that directory. The st_mtime field is not changed for changes in owner, group, hard link count, or mode. The field st_ctime is changed by writing or by setting inode information (i.e., owner, group, link count, mode, etc.). POSIX refers to the st_mode bits corresponding to the mask S_IFMT (see below) as the file type, the 12 bits corresponding to the mask 07777 as the file mode bits and the least significant 9 bits (0777) as the file permission bits. 
The following mask values are defined for the file type of the st_mode field: S_IFMT 0170000 bit mask for the file type bit field S_IFSOCK 0140000 socket S_IFLNK 0120000 symbolic link S_IFREG 0100000 regular file S_IFBLK 0060000 block device S_IFDIR 0040000 directory S_IFCHR 0020000 character device S_IFIFO 0010000 FIFO Thus, to test for a regular file (for example), one could write: stat(pathname, &sb); if ((sb.st_mode & S_IFMT) == S_IFREG) { /* Handle regular file */ } Because tests of the above form are common, additional macros are defined by POSIX to allow the test of the file type in st_mode to be written more concisely: S_ISREG(m) is it a regular file? S_ISDIR(m) directory? S_ISCHR(m) character device? S_ISBLK(m) block device? S_ISFIFO(m) FIFO (named pipe)? S_ISLNK(m) symbolic link? (Not in POSIX.1-1996.) S_ISSOCK(m) socket? (Not in POSIX.1-1996.) The preceding code snippet could thus be rewritten as: stat(pathname, &sb); if (S_ISREG(sb.st_mode)) { /* Handle regular file */ } The definitions of most of the above file type test macros are provided if any of the following feature test macros is defined: _BSD_SOURCE (in glibc 2.19 and earlier), _SVID_SOURCE (in glibc 2.19 and earlier), or _DEFAULT_SOURCE (in glibc 2.20 and later). In addition, definitions of all of the above macros except S_IFSOCK and S_ISSOCK() are provided if _XOPEN_SOURCE is defined. The definition of S_IFSOCK can also be exposed by defining _XOPEN_SOURCE with a value of 500 or greater. The definition of S_ISSOCK() is exposed if any of the following feature test macros is defined: _BSD_SOURCE (in glibc 2.19 and earlier), _DEFAULT_SOURCE (in glibc 2.20 and later), _XOPEN_SOURCE with a value of 500 or greater, or _POSIX_C_SOURCE with a value of 200112L or greater. 
The following mask values are defined for the file mode component of the st_mode field:

    S_ISUID   04000   set-user-ID bit
    S_ISGID   02000   set-group-ID bit (see below)
    S_ISVTX   01000   sticky bit (see below)

    S_IRWXU   00700   owner has read, write, and execute permission
    S_IRUSR   00400   owner has read permission
    S_IWUSR   00200   owner has write permission
    S_IXUSR   00100   owner has execute permission

    S_IRWXG   00070   group has read, write, and execute permission
    S_IRGRP   00040   group has read permission
    S_IWGRP   00020   group has write permission
    S_IXGRP   00010   group has execute permission

    S_IRWXO   00007   others (not in group) have read, write, and execute permission
    S_IROTH   00004   others have read permission
    S_IWOTH   00002   others have write permission
    S_IXOTH   00001   others have execute permission

The set-group-ID bit (S_ISGID) has several special uses. For a directory, it indicates that BSD semantics is to be used for that directory: files created there inherit their group ID from the directory, not from the effective group ID of the creating process, and directories created there will also get the S_ISGID bit set. For a file that does not have the group execution bit (S_IXGRP) set, the set-group-ID bit indicates mandatory file/record locking.

The sticky bit (S_ISVTX) on a directory means that a file in that directory can be renamed or deleted only by the owner of the file, by the owner of the directory, and by a privileged process.

fstatat()

The fstatat() system call operates in exactly the same way as stat(), except for the differences described here.

If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by stat() for a relative pathname).

If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like stat()).

If pathname is absolute, then dirfd is ignored.

flags can either be 0, or include one or more of the following flags ORed:

    AT_EMPTY_PATH (since Linux 2.6.39)
        If pathname is an empty string, operate on the file referred to by dirfd (which may have been obtained using the open(2) O_PATH flag). If dirfd is AT_FDCWD, the call operates on the current working directory. In this case, dirfd can refer to any type of file, not just a directory. This flag is Linux-specific; define _GNU_SOURCE to obtain its definition.

    AT_NO_AUTOMOUNT (since Linux 2.6.38)
        Don't automount the terminal ("basename") component of pathname if it is a directory that is an automount point. This allows the caller to gather attributes of an automount point (rather than the location it would mount). This flag can be used in tools that scan directories to prevent mass-automounting of a directory of automount points. The AT_NO_AUTOMOUNT flag has no effect if the mount point has already been mounted over. This flag is Linux-specific; define _GNU_SOURCE to obtain its definition.

    AT_SYMLINK_NOFOLLOW
        If pathname is a symbolic link, do not dereference it: instead return information about the link itself, like lstat(). (By default, fstatat() dereferences symbolic links, like stat().)

See openat(2) for an explanation of the need for fstatat().
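To illustrate how dirfd and AT_SYMLINK_NOFOLLOW fit together, here is a minimal sketch that checks whether a name inside a given directory is itself a directory. The helper name is_dir_at() and its 1/0/-1 return convention are assumptions made for this example only.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>      /* open(), O_DIRECTORY, AT_SYMLINK_NOFOLLOW */
#include <sys/stat.h>   /* fstatat(), struct stat, S_ISDIR() */
#include <unistd.h>     /* close() */

/*
 * Hypothetical helper: return 1 if `name`, interpreted relative to the
 * directory `dirpath`, is a directory; 0 if it is something else;
 * -1 if the directory cannot be opened or fstatat() fails.
 */
int is_dir_at(const char *dirpath, const char *name)
{
    struct stat sb;
    int dirfd, rc;

    dirfd = open(dirpath, O_RDONLY | O_DIRECTORY);
    if (dirfd == -1)
        return -1;

    /* A relative `name` is resolved against dirfd, not against the
       current working directory.  AT_SYMLINK_NOFOLLOW makes the call
       behave like lstat(): a symlink to a directory yields 0. */
    rc = fstatat(dirfd, name, &sb, AT_SYMLINK_NOFOLLOW);
    close(dirfd);
    if (rc == -1)
        return -1;

    return S_ISDIR(sb.st_mode) ? 1 : 0;
}
```

Passing 0 instead of AT_SYMLINK_NOFOLLOW would make the helper follow symbolic links, like stat().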
http://man7.org/linux/man-pages/man2/chmod.2.html 11
SYSTEM CALL: chmod(2) - Linux manual page
FUNCTIONALITY: chmod, fchmod, fchmodat - change permissions of a file
SYNOPSIS:

    #include <sys/stat.h>

    int chmod(const char *pathname, mode_t mode);
    int fchmod(int fd, mode_t mode);

    #include <fcntl.h>           /* Definition of AT_* constants */
    #include <sys/stat.h>

    int fchmodat(int dirfd, const char *pathname, mode_t mode, int flags);

Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

    fchmod():
        /* Since glibc 2.16: */ _POSIX_C_SOURCE
            || /* Glibc versions <= 2.19: */ _BSD_SOURCE
            || /* Glibc versions <= 2.15: */ _XOPEN_SOURCE >= 500
            || /* Glibc 2.12 to 2.15: */ _POSIX_C_SOURCE >= 200809L

    fchmodat():
        Since glibc 2.10: _POSIX_C_SOURCE >= 200809L
        Before glibc 2.10: _ATFILE_SOURCE

DESCRIPTION

The chmod() and fchmod() system calls change a file's mode bits. (The file mode consists of the file permission bits plus the set-user-ID, set-group-ID, and sticky bits.) These system calls differ only in how the file is specified:

* chmod() changes the mode of the file whose pathname is given in pathname, which is dereferenced if it is a symbolic link.

* fchmod() changes the mode of the file referred to by the open file descriptor fd.

The new file mode is specified in mode, which is a bit mask created by ORing together zero or more of the following:

    S_ISUID  (04000)  set-user-ID (set process effective user ID on execve(2))
    S_ISGID  (02000)  set-group-ID (set process effective group ID on execve(2); mandatory locking, as described in fcntl(2); take a new file's group from parent directory, as described in chown(2) and mkdir(2))
    S_ISVTX  (01000)  sticky bit (restricted deletion flag, as described in unlink(2))
    S_IRUSR  (00400)  read by owner
    S_IWUSR  (00200)  write by owner
    S_IXUSR  (00100)  execute/search by owner ("search" applies for directories, and means that entries within the directory can be accessed)
    S_IRGRP  (00040)  read by group
    S_IWGRP  (00020)  write by group
    S_IXGRP  (00010)  execute/search by group
    S_IROTH  (00004)  read by others
    S_IWOTH  (00002)  write by others
    S_IXOTH  (00001)  execute/search by others

The effective UID of the calling process must match the owner of the file, or the process must be privileged (Linux: it must have the CAP_FOWNER capability).

If the calling process is not privileged (Linux: does not have the CAP_FSETID capability), and the group of the file does not match the effective group ID of the process or one of its supplementary group IDs, the S_ISGID bit will be turned off, but this will not cause an error to be returned.

As a security measure, depending on the filesystem, the set-user-ID and set-group-ID execution bits may be turned off if a file is written. (On Linux, this occurs if the writing process does not have the CAP_FSETID capability.) On some filesystems, only the superuser can set the sticky bit, which may have a special meaning. For the sticky bit, and for set-user-ID and set-group-ID bits on directories, see stat(2).

On NFS filesystems, restricting the permissions will immediately influence already open files, because the access control is done on the server, but open files are maintained by the client. Widening the permissions may be delayed for other clients if attribute caching is enabled on them.

fchmodat()

The fchmodat() system call operates in exactly the same way as chmod(), except for the differences described here.

If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by chmod() for a relative pathname).

If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like chmod()).

If pathname is absolute, then dirfd is ignored.

flags can either be 0, or include the following flag:

    AT_SYMLINK_NOFOLLOW
        If pathname is a symbolic link, do not dereference it: instead operate on the link itself. This flag is not currently implemented.

See openat(2) for an explanation of the need for fchmodat().

http://man7.org/linux/man-pages/man2/fchmod.2.html 11
SYSTEM CALL: chmod(2) - Linux manual page
FUNCTIONALITY: chmod, fchmod, fchmodat - change permissions of a file
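Because chmod() and fchmod() replace the whole mode rather than adjusting it, adding a single permission bit means reading the current mode first. The sketch below shows one way to do that with fstat() and fchmod(); the helper name add_mode_bits() is hypothetical, and the literal 07777 is used here simply as the union of all the mode bits listed for chmod(2).

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/stat.h>   /* fstat(), fchmod(), struct stat, mode_t */

/*
 * Hypothetical helper: OR `bits` into the current mode of the file
 * referred to by fd.  Returns 0 on success, -1 on error (errno is set
 * by fstat() or fchmod()).
 */
int add_mode_bits(int fd, mode_t bits)
{
    struct stat sb;

    if (fstat(fd, &sb) == -1)
        return -1;

    /* st_mode also carries the file type; 07777 keeps only the
       permission, set-user-ID, set-group-ID, and sticky bits. */
    return fchmod(fd, (sb.st_mode & 07777) | bits);
}
```

For example, add_mode_bits(fd, S_IRGRP) grants group read permission while leaving every other mode bit as it was, subject to the ownership and privilege rules given for chmod(2).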
http://man7.org/linux/man-pages/man2/fchmodat.2.html 11
SYSTEM CALL: chmod(2) - Linux manual page
FUNCTIONALITY: chmod, fchmod, fchmodat - change permissions of a file

http://man7.org/linux/man-pages/man2/chown.2.html 12
SYSTEM CALL: chown(2) - Linux manual page
FUNCTIONALITY: chown, fchown, lchown, fchownat - change ownership of a file
SYNOPSIS:

    #include <unistd.h>

    int chown(const char *pathname, uid_t owner, gid_t group);
    int fchown(int fd, uid_t owner, gid_t group);
    int lchown(const char *pathname, uid_t owner, gid_t group);

    #include <fcntl.h>           /* Definition of AT_* constants */
    #include <unistd.h>

    int fchownat(int dirfd, const char *pathname, uid_t owner, gid_t group, int flags);

Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

    fchown(), lchown():
        /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
            || _XOPEN_SOURCE >= 500
            || /* Glibc versions <= 2.19: */ _BSD_SOURCE

    fchownat():
        Since glibc 2.10: _POSIX_C_SOURCE >= 200809L
        Before glibc 2.10: _ATFILE_SOURCE

DESCRIPTION

These system calls change the owner and group of a file. The chown(), fchown(), and lchown() system calls differ only in how the file is specified:

* chown() changes the ownership of the file specified by pathname, which is dereferenced if it is a symbolic link.

* fchown() changes the ownership of the file referred to by the open file descriptor fd.

* lchown() is like chown(), but does not dereference symbolic links.

Only a privileged process (Linux: one with the CAP_CHOWN capability) may change the owner of a file. The owner of a file may change the group of the file to any group of which that owner is a member. A privileged process (Linux: with CAP_CHOWN) may change the group arbitrarily.

If the owner or group is specified as -1, then that ID is not changed.

When the owner or group of an executable file is changed by an unprivileged user, the S_ISUID and S_ISGID mode bits are cleared. POSIX does not specify whether this also should happen when root does the chown(); the Linux behavior depends on the kernel version. In case of a non-group-executable file (i.e., one for which the S_IXGRP bit is not set) the S_ISGID bit indicates mandatory locking, and is not cleared by a chown().

fchownat()

The fchownat() system call operates in exactly the same way as chown(), except for the differences described here.

If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by chown() for a relative pathname).

If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like chown()).

If pathname is absolute, then dirfd is ignored.

The flags argument is a bit mask created by ORing together 0 or more of the following values:

    AT_EMPTY_PATH (since Linux 2.6.39)
        If pathname is an empty string, operate on the file referred to by dirfd (which may have been obtained using the open(2) O_PATH flag). In this case, dirfd can refer to any type of file, not just a directory. If dirfd is AT_FDCWD, the call operates on the current working directory. This flag is Linux-specific; define _GNU_SOURCE to obtain its definition.

    AT_SYMLINK_NOFOLLOW
        If pathname is a symbolic link, do not dereference it: instead operate on the link itself, like lchown(). (By default, fchownat() dereferences symbolic links, like chown().)

See openat(2) for an explanation of the need for fchownat().

http://man7.org/linux/man-pages/man2/lchown.2.html 12
SYSTEM CALL: chown(2) - Linux manual page
FUNCTIONALITY: chown, fchown, lchown, fchownat - change ownership of a file

http://man7.org/linux/man-pages/man2/fchown.2.html 12
SYSTEM CALL: chown(2) - Linux manual page
FUNCTIONALITY: chown, fchown, lchown, fchownat - change ownership of a file
http://man7.org/linux/man-pages/man2/fchownat.2.html 12 SYSTEM CALL: chown(2) - Linux manual page FUNCTIONALITY: chown, fchown, lchown, fchownat - change ownership of a file SYNOPSIS: #include int chown(const char *pathname, uid_t owner, gid_t group); int fchown(int fd, uid_t owner, gid_t group); int lchown(const char *pathname, uid_t owner, gid_t group); #include /* Definition of AT_* constants */ #include int fchownat(int dirfd, const char *pathname, uid_t owner, gid_t group, int flags); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): fchown(), lchown(): /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L || _XOPEN_SOURCE >= 500 || /* Glibc versions <= 2.19: */ _BSD_SOURCE fchownat(): Since glibc 2.10: _POSIX_C_SOURCE >= 200809L Before glibc 2.10: _ATFILE_SOURCE DESCRIPTION These system calls change the owner and group of a file. The chown(), fchown(), and lchown() system calls differ only in how the file is specified: * chown() changes the ownership of the file specified by pathname, which is dereferenced if it is a symbolic link. * fchown() changes the ownership of the file referred to by the open file descriptor fd. * lchown() is like chown(), but does not dereference symbolic links. Only a privileged process (Linux: one with the CAP_CHOWN capability) may change the owner of a file. The owner of a file may change the group of the file to any group of which that owner is a member. A privileged process (Linux: with CAP_CHOWN) may change the group arbitrarily. If the owner or group is specified as -1, then that ID is not changed. When the owner or group of an executable file is changed by an unprivileged user, the S_ISUID and S_ISGID mode bits are cleared. POSIX does not specify whether this also should happen when root does the chown(); the Linux behavior depends on the kernel version. 
In case of a non-group-executable file (i.e., one for which the S_IXGRP bit is not set) the S_ISGID bit indicates mandatory locking, and is not cleared by a chown(). fchownat() The fchownat() system call operates in exactly the same way as chown(), except for the differences described here. If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by chown() for a relative pathname). If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like chown()). If pathname is absolute, then dirfd is ignored. The flags argument is a bit mask created by ORing together 0 or more of the following values; AT_EMPTY_PATH (since Linux 2.6.39) If pathname is an empty string, operate on the file referred to by dirfd (which may have been obtained using the open(2) O_PATH flag). In this case, dirfd can refer to any type of file, not just a directory. If dirfd is AT_FDCWD, the call operates on the current working directory. This flag is Linux-specific; define _GNU_SOURCE to obtain its definition. AT_SYMLINK_NOFOLLOW If pathname is a symbolic link, do not dereference it: instead operate on the link itself, like lchown(). (By default, fchownat() dereferences symbolic links, like chown().) See openat(2) for an explanation of the need for fchownat(). http://man7.org/linux/man-pages/man2/utime.2.html 10 SYSTEM CALL: utime(2) - Linux manual page FUNCTIONALITY: utime, utimes - change file last access and modification times SYNOPSIS: #include #include int utime(const char *filename, const struct utimbuf *times); #include int utimes(const char *filename, const struct timeval times[2]); DESCRIPTION Note: modern applications may prefer to use the interfaces described in utimensat(2). 
The utime() system call changes the access and modification times of the inode specified by filename to the actime and modtime fields of times respectively. If times is NULL, then the access and modification times of the file are set to the current time. Changing timestamps is permitted when: either the process has appropriate privileges, or the effective user ID equals the user ID of the file, or times is NULL and the process has write permission for the file. The utimbuf structure is: struct utimbuf { time_t actime; /* access time */ time_t modtime; /* modification time */ }; The utime() system call allows specification of timestamps with a resolution of 1 second. The utimes() system call is similar, but the times argument refers to an array rather than a structure. The elements of this array are timeval structures, which allow a precision of 1 microsecond for specifying timestamps. The timeval structure is: struct timeval { long tv_sec; /* seconds */ long tv_usec; /* microseconds */ }; times[0] specifies the new access time, and times[1] specifies the new modification time. If times is NULL, then analogously to utime(), the access and modification times of the file are set to the current time. http://man7.org/linux/man-pages/man2/utimes.2.html 10 SYSTEM CALL: utime(2) - Linux manual page FUNCTIONALITY: utime, utimes - change file last access and modification times SYNOPSIS: #include #include int utime(const char *filename, const struct utimbuf *times); #include int utimes(const char *filename, const struct timeval times[2]); DESCRIPTION Note: modern applications may prefer to use the interfaces described in utimensat(2). The utime() system call changes the access and modification times of the inode specified by filename to the actime and modtime fields of times respectively. If times is NULL, then the access and modification times of the file are set to the current time. 
Changing timestamps is permitted when: either the process has appropriate privileges, or the effective user ID equals the user ID of the file, or times is NULL and the process has write permission for the file. The utimbuf structure is: struct utimbuf { time_t actime; /* access time */ time_t modtime; /* modification time */ }; The utime() system call allows specification of timestamps with a resolution of 1 second. The utimes() system call is similar, but the times argument refers to an array rather than a structure. The elements of this array are timeval structures, which allow a precision of 1 microsecond for specifying timestamps. The timeval structure is: struct timeval { long tv_sec; /* seconds */ long tv_usec; /* microseconds */ }; times[0] specifies the new access time, and times[1] specifies the new modification time. If times is NULL, then analogously to utime(), the access and modification times of the file are set to the current time. http://man7.org/linux/man-pages/man2/futimesat.2.html 11 SYSTEM CALL: futimesat(2) - Linux manual page FUNCTIONALITY: futimesat - change timestamps of a file relative to a directory file descriptor SYNOPSIS: #include /* Definition of AT_* constants */ #include int futimesat(int dirfd, const char *pathname, const struct timeval times[2]); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): futimesat(): _GNU_SOURCE DESCRIPTION This system call is obsolete. Use utimensat(2) instead. The futimesat() system call operates in exactly the same way as utimes(2), except for the differences described in this manual page. If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by utimes(2) for a relative pathname). 
If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like utimes(2)). If pathname is absolute, then dirfd is ignored. http://man7.org/linux/man-pages/man2/utimensat.2.html 13 SYSTEM CALL: utimensat(2) - Linux manual page FUNCTIONALITY: utimensat, futimens - change file timestamps with nanosecond preci‐ sion SYNOPSIS: #include /* Definition of AT_* constants */ #include int utimensat(int dirfd, const char *pathname, const struct timespec times[2], int flags); int futimens(int fd, const struct timespec times[2]); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): utimensat(): Since glibc 2.10: _POSIX_C_SOURCE >= 200809L Before glibc 2.10: _ATFILE_SOURCE futimens(): Since glibc 2.10: _POSIX_C_SOURCE >= 200809L Before glibc 2.10: _GNU_SOURCE DESCRIPTION utimensat() and futimens() update the timestamps of a file with nanosecond precision. This contrasts with the historical utime(2) and utimes(2), which permit only second and microsecond precision, respectively, when setting file timestamps. With utimensat() the file is specified via the pathname given in pathname. With futimens() the file whose timestamps are to be updated is specified via an open file descriptor, fd. For both calls, the new file timestamps are specified in the array times: times[0] specifies the new "last access time" (atime); times[1] specifies the new "last modification time" (mtime). Each of the elements of times specifies a time as the number of seconds and nanoseconds since the Epoch, 1970-01-01 00:00:00 +0000 (UTC). This information is conveyed in a structure of the following form: struct timespec { time_t tv_sec; /* seconds */ long tv_nsec; /* nanoseconds */ }; Updated file timestamps are set to the greatest value supported by the filesystem that is not greater than the specified time. 
If the tv_nsec field of one of the timespec structures has the special value UTIME_NOW, then the corresponding file timestamp is set to the current time. If the tv_nsec field of one of the timespec structures has the special value UTIME_OMIT, then the corresponding file timestamp is left unchanged. In both of these cases, the value of the corresponding tv_sec field is ignored. If times is NULL, then both timestamps are set to the current time. Permissions requirements To set both file timestamps to the current time (i.e., times is NULL, or both tv_nsec fields specify UTIME_NOW), either: 1. the caller must have write access to the file; 2. the caller's effective user ID must match the owner of the file; or 3. the caller must have appropriate privileges. To make any change other than setting both timestamps to the current time (i.e., times is not NULL, and neither tv_nsec field is UTIME_NOW and neither tv_nsec field is UTIME_OMIT), either condition 2 or 3 above must apply. If both tv_nsec fields are specified as UTIME_OMIT, then no file ownership or permission checks are performed, and the file timestamps are not modified, but other error conditions may still be detected. utimensat() specifics If pathname is relative, then by default it is interpreted relative to the directory referred to by the open file descriptor, dirfd (rather than relative to the current working directory of the calling process, as is done by utimes(2) for a relative pathname). See openat(2) for an explanation of why this can be useful. If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like utimes(2)). If pathname is absolute, then dirfd is ignored. The flags field is a bit mask that may be 0, or include the following constant, defined in <fcntl.h>: AT_SYMLINK_NOFOLLOW If pathname specifies a symbolic link, then update the timestamps of the link, rather than the file to which it refers.
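The UTIME_NOW and UTIME_OMIT rules described for utimensat() can be sketched in C roughly as follows. This is only an illustrative sketch: the helper names touch_mtime_only() and set_times() are hypothetical and not part of any library, and both assume the path is resolved relative to the current working directory (AT_FDCWD) with flags set to 0.

```c
#define _POSIX_C_SOURCE 200809L  /* feature test macro named in the synopsis */
#include <fcntl.h>      /* AT_FDCWD */
#include <sys/stat.h>   /* utimensat(), UTIME_NOW, UTIME_OMIT */
#include <time.h>       /* struct timespec, time_t */

/* Hypothetical helper: set mtime to the current time and leave atime
 * untouched. Returns 0 on success, -1 on error (errno set by utimensat()). */
int touch_mtime_only(const char *path)
{
    struct timespec times[2];

    times[0].tv_sec = 0;            /* ignored because tv_nsec is UTIME_OMIT */
    times[0].tv_nsec = UTIME_OMIT;  /* times[0] is atime: leave unchanged */
    times[1].tv_sec = 0;            /* ignored because tv_nsec is UTIME_NOW */
    times[1].tv_nsec = UTIME_NOW;   /* times[1] is mtime: set to "now" */

    return utimensat(AT_FDCWD, path, times, 0);
}

/* Hypothetical helper: set both timestamps to the same explicit time,
 * given as seconds and nanoseconds since the Epoch. */
int set_times(const char *path, time_t sec, long nsec)
{
    struct timespec times[2] = {
        { .tv_sec = sec, .tv_nsec = nsec },  /* atime */
        { .tv_sec = sec, .tv_nsec = nsec },  /* mtime */
    };

    return utimensat(AT_FDCWD, path, times, 0);
}
```

Per the permission rules above, both helpers make a change other than "set both timestamps to the current time", so the caller is expected to own the file or have appropriate privileges.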
http://man7.org/linux/man-pages/man2/access.2.html 12 SYSTEM CALL: access(2) - Linux manual page FUNCTIONALITY: access, faccessat - check user's permissions for a file SYNOPSIS: #include int access(const char *pathname, int mode); #include /* Definition of AT_* constants */ #include int faccessat(int dirfd, const char *pathname, int mode, int flags); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): faccessat(): Since glibc 2.10: _POSIX_C_SOURCE >= 200809L Before glibc 2.10: _ATFILE_SOURCE DESCRIPTION access() checks whether the calling process can access the file pathname. If pathname is a symbolic link, it is dereferenced. The mode specifies the accessibility check(s) to be performed, and is either the value F_OK, or a mask consisting of the bitwise OR of one or more of R_OK, W_OK, and X_OK. F_OK tests for the existence of the file. R_OK, W_OK, and X_OK test whether the file exists and grants read, write, and execute permissions, respectively. The check is done using the calling process's real UID and GID, rather than the effective IDs as is done when actually attempting an operation (e.g., open(2)) on the file. Similarly, for the root user, the check uses the set of permitted capabilities rather than the set of effective capabilities; and for non-root users, the check uses an empty set of capabilities. This allows set-user-ID programs and capability-endowed programs to easily determine the invoking user's authority. In other words, access() does not answer the "can I read/write/execute this file?" question. It answers a slightly different question: "(assuming I'm a setuid binary) can the user who invoked me read/write/execute this file?", which gives set-user-ID programs the possibility to prevent malicious users from causing them to read files which users shouldn't be able to read. 
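As a small illustration of the mode argument described above, the following C sketch wraps access() in two helpers. The names file_exists() and real_user_can_rw() are hypothetical. Note that the answer reflects the real UID and GID, as explained above, and that the file's state can change between the check and any later open(2).

```c
#include <unistd.h>   /* access(), F_OK, R_OK, W_OK */

/* Hypothetical helper: returns 1 if path exists (F_OK), 0 otherwise.
 * On a 0 result, errno describes why access() failed. */
int file_exists(const char *path)
{
    return access(path, F_OK) == 0;
}

/* Hypothetical helper: returns 1 if the real user (not the effective
 * user) may both read and write path, 0 otherwise. The mode is the
 * bitwise OR of the individual checks. */
int real_user_can_rw(const char *path)
{
    return access(path, R_OK | W_OK) == 0;
}
```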
If the calling process is privileged (i.e., its real UID is zero), then an X_OK check is successful for a regular file if execute permission is enabled for any of the file owner, group, or other. faccessat() The faccessat() system call operates in exactly the same way as access(), except for the differences described here. If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by access() for a relative pathname). If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like access()). If pathname is absolute, then dirfd is ignored. flags is constructed by ORing together zero or more of the following values: AT_EACCESS Perform access checks using the effective user and group IDs. By default, faccessat() uses the real IDs (like access()). AT_SYMLINK_NOFOLLOW If pathname is a symbolic link, do not dereference it: instead return information about the link itself. See openat(2) for an explanation of the need for faccessat(). http://man7.org/linux/man-pages/man2/faccessat.2.html 12 SYSTEM CALL: access(2) - Linux manual page FUNCTIONALITY: access, faccessat - check user's permissions for a file SYNOPSIS: #include int access(const char *pathname, int mode); #include /* Definition of AT_* constants */ #include int faccessat(int dirfd, const char *pathname, int mode, int flags); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): faccessat(): Since glibc 2.10: _POSIX_C_SOURCE >= 200809L Before glibc 2.10: _ATFILE_SOURCE DESCRIPTION access() checks whether the calling process can access the file pathname. If pathname is a symbolic link, it is dereferenced. 
The mode specifies the accessibility check(s) to be performed, and is either the value F_OK, or a mask consisting of the bitwise OR of one or more of R_OK, W_OK, and X_OK. F_OK tests for the existence of the file. R_OK, W_OK, and X_OK test whether the file exists and grants read, write, and execute permissions, respectively. The check is done using the calling process's real UID and GID, rather than the effective IDs as is done when actually attempting an operation (e.g., open(2)) on the file. Similarly, for the root user, the check uses the set of permitted capabilities rather than the set of effective capabilities; and for non-root users, the check uses an empty set of capabilities. This allows set-user-ID programs and capability-endowed programs to easily determine the invoking user's authority. In other words, access() does not answer the "can I read/write/execute this file?" question. It answers a slightly different question: "(assuming I'm a setuid binary) can the user who invoked me read/write/execute this file?", which gives set-user-ID programs the possibility to prevent malicious users from causing them to read files which users shouldn't be able to read. If the calling process is privileged (i.e., its real UID is zero), then an X_OK check is successful for a regular file if execute permission is enabled for any of the file owner, group, or other. faccessat() The faccessat() system call operates in exactly the same way as access(), except for the differences described here. If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by access() for a relative pathname). If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like access()). If pathname is absolute, then dirfd is ignored. 
flags is constructed by ORing together zero or more of the following values: AT_EACCESS Perform access checks using the effective user and group IDs. By default, faccessat() uses the real IDs (like access()). AT_SYMLINK_NOFOLLOW If pathname is a symbolic link, do not dereference it: instead return information about the link itself. See openat(2) for an explanation of the need for faccessat(). http://man7.org/linux/man-pages/man2/setxattr.2.html 10 SYSTEM CALL: setxattr(2) - Linux manual page FUNCTIONALITY: setxattr, lsetxattr, fsetxattr - set an extended attribute value SYNOPSIS: #include #include int setxattr(const char *path, const char *name, const void *value, size_t size, int flags); int lsetxattr(const char *path, const char *name, const void *value, size_t size, int flags); int fsetxattr(int fd, const char *name, const void *value, size_t size, int flags); DESCRIPTION Extended attributes are name:value pairs associated with inodes (files, directories, symbolic links, etc.). They are extensions to the normal attributes which are associated with all inodes in the system (i.e., the stat(2) data). A complete overview of extended attributes concepts can be found in xattr(7). setxattr() sets the value of the extended attribute identified by name and associated with the given path in the filesystem. The size argument specifies the size (in bytes) of value; a zero-length value is permitted. lsetxattr() is identical to setxattr(), except in the case of a symbolic link, where the extended attribute is set on the link itself, not the file that it refers to. fsetxattr() is identical to setxattr(), only the extended attribute is set on the open file referred to by fd (as returned by open(2)) in place of path. An extended attribute name is a null-terminated string. The name includes a namespace prefix; there may be several, disjoint namespaces associated with an individual inode. 
The value of an extended attribute is a chunk of arbitrary textual or binary data of specified length. By default (i.e., flags is zero), the extended attribute will be created if it does not exist, or the value will be replaced if the attribute already exists. To modify these semantics, one of the following values can be specified in flags: XATTR_CREATE Perform a pure create, which fails if the named attribute exists already. XATTR_REPLACE Perform a pure replace operation, which fails if the named attribute does not already exist. http://man7.org/linux/man-pages/man2/lsetxattr.2.html 10 SYSTEM CALL: setxattr(2) - Linux manual page FUNCTIONALITY: setxattr, lsetxattr, fsetxattr - set an extended attribute value SYNOPSIS: #include #include int setxattr(const char *path, const char *name, const void *value, size_t size, int flags); int lsetxattr(const char *path, const char *name, const void *value, size_t size, int flags); int fsetxattr(int fd, const char *name, const void *value, size_t size, int flags); DESCRIPTION Extended attributes are name:value pairs associated with inodes (files, directories, symbolic links, etc.). They are extensions to the normal attributes which are associated with all inodes in the system (i.e., the stat(2) data). A complete overview of extended attributes concepts can be found in xattr(7). setxattr() sets the value of the extended attribute identified by name and associated with the given path in the filesystem. The size argument specifies the size (in bytes) of value; a zero-length value is permitted. lsetxattr() is identical to setxattr(), except in the case of a symbolic link, where the extended attribute is set on the link itself, not the file that it refers to. fsetxattr() is identical to setxattr(), only the extended attribute is set on the open file referred to by fd (as returned by open(2)) in place of path. An extended attribute name is a null-terminated string. 
The name includes a namespace prefix; there may be several, disjoint namespaces associated with an individual inode. The value of an extended attribute is a chunk of arbitrary textual or binary data of specified length. By default (i.e., flags is zero), the extended attribute will be created if it does not exist, or the value will be replaced if the attribute already exists. To modify these semantics, one of the following values can be specified in flags: XATTR_CREATE Perform a pure create, which fails if the named attribute exists already. XATTR_REPLACE Perform a pure replace operation, which fails if the named attribute does not already exist. http://man7.org/linux/man-pages/man2/fsetxattr.2.html 10 SYSTEM CALL: setxattr(2) - Linux manual page FUNCTIONALITY: setxattr, lsetxattr, fsetxattr - set an extended attribute value SYNOPSIS: #include #include int setxattr(const char *path, const char *name, const void *value, size_t size, int flags); int lsetxattr(const char *path, const char *name, const void *value, size_t size, int flags); int fsetxattr(int fd, const char *name, const void *value, size_t size, int flags); DESCRIPTION Extended attributes are name:value pairs associated with inodes (files, directories, symbolic links, etc.). They are extensions to the normal attributes which are associated with all inodes in the system (i.e., the stat(2) data). A complete overview of extended attributes concepts can be found in xattr(7). setxattr() sets the value of the extended attribute identified by name and associated with the given path in the filesystem. The size argument specifies the size (in bytes) of value; a zero-length value is permitted. lsetxattr() is identical to setxattr(), except in the case of a symbolic link, where the extended attribute is set on the link itself, not the file that it refers to. fsetxattr() is identical to setxattr(), only the extended attribute is set on the open file referred to by fd (as returned by open(2)) in place of path. 
An extended attribute name is a null-terminated string. The name includes a namespace prefix; there may be several, disjoint namespaces associated with an individual inode. The value of an extended attribute is a chunk of arbitrary textual or binary data of specified length. By default (i.e., flags is zero), the extended attribute will be created if it does not exist, or the value will be replaced if the attribute already exists. To modify these semantics, one of the following values can be specified in flags: XATTR_CREATE Perform a pure create, which fails if the named attribute exists already. XATTR_REPLACE Perform a pure replace operation, which fails if the named attribute does not already exist. http://man7.org/linux/man-pages/man2/getxattr.2.html 11 SYSTEM CALL: getxattr(2) - Linux manual page FUNCTIONALITY: getxattr, lgetxattr, fgetxattr - retrieve an extended attribute value SYNOPSIS: #include #include ssize_t getxattr(const char *path, const char *name, void *value, size_t size); ssize_t lgetxattr(const char *path, const char *name, void *value, size_t size); ssize_t fgetxattr(int fd, const char *name, void *value, size_t size); DESCRIPTION Extended attributes are name:value pairs associated with inodes (files, directories, symbolic links, etc.). They are extensions to the normal attributes which are associated with all inodes in the system (i.e., the stat(2) data). A complete overview of extended attributes concepts can be found in xattr(7). getxattr() retrieves the value of the extended attribute identified by name and associated with the given path in the filesystem. The attribute value is placed in the buffer pointed to by value; size specifies the size of that buffer. The return value of the call is the number of bytes placed in value. lgetxattr() is identical to getxattr(), except in the case of a symbolic link, where the link itself is interrogated, not the file that it refers to. 
fgetxattr() is identical to getxattr(), only the open file referred to by fd (as returned by open(2)) is interrogated in place of path. An extended attribute name is a null-terminated string. The name includes a namespace prefix; there may be several, disjoint namespaces associated with an individual inode. The value of an extended attribute is a chunk of arbitrary textual or binary data that was assigned using setxattr(2). If size is specified as zero, these calls return the current size of the named extended attribute (and leave value unchanged). This can be used to determine the size of the buffer that should be supplied in a subsequent call. (But, bear in mind that there is a possibility that the attribute value may change between the two calls, so that it is still necessary to check the return status from the second call.) http://man7.org/linux/man-pages/man2/lgetxattr.2.html 11 SYSTEM CALL: getxattr(2) - Linux manual page FUNCTIONALITY: getxattr, lgetxattr, fgetxattr - retrieve an extended attribute value SYNOPSIS: #include #include ssize_t getxattr(const char *path, const char *name, void *value, size_t size); ssize_t lgetxattr(const char *path, const char *name, void *value, size_t size); ssize_t fgetxattr(int fd, const char *name, void *value, size_t size); DESCRIPTION Extended attributes are name:value pairs associated with inodes (files, directories, symbolic links, etc.). They are extensions to the normal attributes which are associated with all inodes in the system (i.e., the stat(2) data). A complete overview of extended attributes concepts can be found in xattr(7). getxattr() retrieves the value of the extended attribute identified by name and associated with the given path in the filesystem. The attribute value is placed in the buffer pointed to by value; size specifies the size of that buffer. The return value of the call is the number of bytes placed in value. 
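The size-zero probe described for getxattr() can be combined with a retry, since the value may change between the two calls. The sketch below is illustrative only: read_xattr() is a hypothetical helper, the limit of 8 attempts is an arbitrary choice, and it relies on getxattr() failing with ERANGE when the supplied buffer is too small.

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/xattr.h>   /* getxattr() */

/* Hypothetical helper: read an extended attribute into a malloc()ed buffer.
 * On success returns the buffer (caller frees it) and stores the value
 * length in *len. On failure returns NULL with errno set. The value is
 * raw bytes and is NOT null-terminated. */
char *read_xattr(const char *path, const char *name, size_t *len)
{
    for (int attempt = 0; attempt < 8; attempt++) {
        /* size == 0: only ask for the current size of the value. */
        ssize_t need = getxattr(path, name, NULL, 0);
        if (need < 0)
            return NULL;

        /* Allocate at least one byte so an empty value still yields
         * a non-NULL buffer. */
        char *buf = malloc(need > 0 ? (size_t)need : 1);
        if (buf == NULL)
            return NULL;

        ssize_t got = getxattr(path, name, buf, (size_t)need);
        if (got >= 0) {
            *len = (size_t)got;
            return buf;
        }

        int saved = errno;
        free(buf);
        if (saved != ERANGE) {
            errno = saved;
            return NULL;
        }
        /* ERANGE: the value grew between the two calls; probe again. */
    }
    errno = ERANGE;
    return NULL;
}
```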
lgetxattr() is identical to getxattr(), except in the case of a symbolic link, where the link itself is interrogated, not the file that it refers to. fgetxattr() is identical to getxattr(), only the open file referred to by fd (as returned by open(2)) is interrogated in place of path. An extended attribute name is a null-terminated string. The name includes a namespace prefix; there may be several, disjoint namespaces associated with an individual inode. The value of an extended attribute is a chunk of arbitrary textual or binary data that was assigned using setxattr(2). If size is specified as zero, these calls return the current size of the named extended attribute (and leave value unchanged). This can be used to determine the size of the buffer that should be supplied in a subsequent call. (But, bear in mind that there is a possibility that the attribute value may change between the two calls, so that it is still necessary to check the return status from the second call.) http://man7.org/linux/man-pages/man2/fgetxattr.2.html 11 SYSTEM CALL: getxattr(2) - Linux manual page FUNCTIONALITY: getxattr, lgetxattr, fgetxattr - retrieve an extended attribute value SYNOPSIS: #include #include ssize_t getxattr(const char *path, const char *name, void *value, size_t size); ssize_t lgetxattr(const char *path, const char *name, void *value, size_t size); ssize_t fgetxattr(int fd, const char *name, void *value, size_t size); DESCRIPTION Extended attributes are name:value pairs associated with inodes (files, directories, symbolic links, etc.). They are extensions to the normal attributes which are associated with all inodes in the system (i.e., the stat(2) data). A complete overview of extended attributes concepts can be found in xattr(7). getxattr() retrieves the value of the extended attribute identified by name and associated with the given path in the filesystem. The attribute value is placed in the buffer pointed to by value; size specifies the size of that buffer. 
The return value of the call is the number of bytes placed in value. lgetxattr() is identical to getxattr(), except in the case of a symbolic link, where the link itself is interrogated, not the file that it refers to. fgetxattr() is identical to getxattr(), only the open file referred to by fd (as returned by open(2)) is interrogated in place of path. An extended attribute name is a null-terminated string. The name includes a namespace prefix; there may be several, disjoint namespaces associated with an individual inode. The value of an extended attribute is a chunk of arbitrary textual or binary data that was assigned using setxattr(2). If size is specified as zero, these calls return the current size of the named extended attribute (and leave value unchanged). This can be used to determine the size of the buffer that should be supplied in a subsequent call. (But, bear in mind that there is a possibility that the attribute value may change between the two calls, so that it is still necessary to check the return status from the second call.) http://man7.org/linux/man-pages/man2/listxattr.2.html 12 SYSTEM CALL: listxattr(2) - Linux manual page FUNCTIONALITY: listxattr, llistxattr, flistxattr - list extended attribute names SYNOPSIS: #include #include ssize_t listxattr(const char *path, char *list, size_t size); ssize_t llistxattr(const char *path, char *list, size_t size); ssize_t flistxattr(int fd, char *list, size_t size); DESCRIPTION Extended attributes are name:value pairs associated with inodes (files, directories, symbolic links, etc.). They are extensions to the normal attributes which are associated with all inodes in the system (i.e., the stat(2) data). A complete overview of extended attributes concepts can be found in xattr(7). listxattr() retrieves the list of extended attribute names associated with the given path in the filesystem. The retrieved list is placed in list, a caller-allocated buffer whose size (in bytes) is specified in the argument size. 
The list is the set of (null-terminated) names, one after the other. Names of extended attributes to which the calling process does not have access may be omitted from the list. The length of the attribute name list is returned. llistxattr() is identical to listxattr(), except in the case of a symbolic link, where the list of names of extended attributes associated with the link itself is retrieved, not the file that it refers to. flistxattr() is identical to listxattr(), only the open file referred to by fd (as returned by open(2)) is interrogated in place of path. A single extended attribute name is a null-terminated string. The name includes a namespace prefix; there may be several, disjoint namespaces associated with an individual inode. If size is specified as zero, these calls return the current size of the list of extended attribute names (and leave list unchanged). This can be used to determine the size of the buffer that should be supplied in a subsequent call. (But, bear in mind that there is a possibility that the set of extended attributes may change between the two calls, so that it is still necessary to check the return status from the second call.) 
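The two-call pattern and the layout of the returned list can be sketched as follows. Both function names are hypothetical. fetch_names() does not retry if the set of attributes changes between the two calls; it simply reports the error from the second call, which is the status check the text above asks for.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>      /* strlen() */
#include <sys/types.h>
#include <sys/xattr.h>   /* listxattr() */

/* Hypothetical helper: count the names in a buffer filled by listxattr().
 * The buffer holds null-terminated names laid end to end; len is the
 * value returned by listxattr(). */
size_t count_names(const char *list, size_t len)
{
    size_t count = 0;

    for (size_t off = 0; off < len; off += strlen(list + off) + 1)
        count++;   /* list + off points at the start of one name */
    return count;
}

/* Hypothetical helper: fetch the name list for path.
 * Returns the list length in bytes (0 if there are no names) or -1 on
 * error. When *out is non-NULL on return, the caller must free() it. */
ssize_t fetch_names(const char *path, char **out)
{
    *out = NULL;

    /* First call with size == 0 to learn the required buffer size. */
    ssize_t need = listxattr(path, NULL, 0);
    if (need <= 0)
        return need;

    char *buf = malloc((size_t)need);
    if (buf == NULL)
        return -1;

    /* Second call fills the buffer; its result must still be checked. */
    ssize_t got = listxattr(path, buf, (size_t)need);
    if (got < 0) {
        int saved = errno;
        free(buf);
        errno = saved;
        return -1;
    }
    *out = buf;
    return got;
}
```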
Example The list of names is returned as an unordered array of null-terminated character strings (attribute names are separated by null bytes ('\0')), like this: user.name1\0system.name1\0user.name2\0 Filesystems that implement POSIX ACLs using extended attributes might return a list like this: system.posix_acl_access\0system.posix_acl_default\0 http://man7.org/linux/man-pages/man2/llistxattr.2.html 12 SYSTEM CALL: listxattr(2) - Linux manual page FUNCTIONALITY: listxattr, llistxattr, flistxattr - list extended attribute names SYNOPSIS: #include <sys/types.h> #include <sys/xattr.h> ssize_t listxattr(const char *path, char *list, size_t size); ssize_t llistxattr(const char *path, char *list, size_t size); ssize_t flistxattr(int fd, char *list, size_t size); DESCRIPTION Extended attributes are name:value pairs associated with inodes (files, directories, symbolic links, etc.). They are extensions to the normal attributes which are associated with all inodes in the system (i.e., the stat(2) data). A complete overview of extended attributes concepts can be found in xattr(7). listxattr() retrieves the list of extended attribute names associated with the given path in the filesystem. The retrieved list is placed in list, a caller-allocated buffer whose size (in bytes) is specified in the argument size. The list is the set of (null-terminated) names, one after the other. Names of extended attributes to which the calling process does not have access may be omitted from the list. The length of the attribute name list is returned. llistxattr() is identical to listxattr(), except in the case of a symbolic link, where the list of names of extended attributes associated with the link itself is retrieved, not the file that it refers to. flistxattr() is identical to listxattr(), only the open file referred to by fd (as returned by open(2)) is interrogated in place of path. A single extended attribute name is a null-terminated string.
The name includes a namespace prefix; there may be several, disjoint namespaces associated with an individual inode. If size is specified as zero, these calls return the current size of the list of extended attribute names (and leave list unchanged). This can be used to determine the size of the buffer that should be supplied in a subsequent call. (But, bear in mind that there is a possibility that the set of extended attributes may change between the two calls, so that it is still necessary to check the return status from the second call.) Example The list of names is returned as an unordered array of null-terminated character strings (attribute names are separated by null bytes ('\0')), like this: user.name1\0system.name1\0user.name2\0 Filesystems that implement POSIX ACLs using extended attributes might return a list like this: system.posix_acl_access\0system.posix_acl_default\0 http://man7.org/linux/man-pages/man2/flistxattr.2.html 12 SYSTEM CALL: listxattr(2) - Linux manual page FUNCTIONALITY: listxattr, llistxattr, flistxattr - list extended attribute names SYNOPSIS: #include <sys/types.h> #include <sys/xattr.h> ssize_t listxattr(const char *path, char *list, size_t size); ssize_t llistxattr(const char *path, char *list, size_t size); ssize_t flistxattr(int fd, char *list, size_t size); DESCRIPTION Extended attributes are name:value pairs associated with inodes (files, directories, symbolic links, etc.). They are extensions to the normal attributes which are associated with all inodes in the system (i.e., the stat(2) data). A complete overview of extended attributes concepts can be found in xattr(7). listxattr() retrieves the list of extended attribute names associated with the given path in the filesystem. The retrieved list is placed in list, a caller-allocated buffer whose size (in bytes) is specified in the argument size. The list is the set of (null-terminated) names, one after the other.
Names of extended attributes to which the calling process does not have access may be omitted from the list. The length of the attribute name list is returned. llistxattr() is identical to listxattr(), except in the case of a symbolic link, where the list of names of extended attributes associated with the link itself is retrieved, not the file that it refers to. flistxattr() is identical to listxattr(), only the open file referred to by fd (as returned by open(2)) is interrogated in place of path. A single extended attribute name is a null-terminated string. The name includes a namespace prefix; there may be several, disjoint namespaces associated with an individual inode. If size is specified as zero, these calls return the current size of the list of extended attribute names (and leave list unchanged). This can be used to determine the size of the buffer that should be supplied in a subsequent call. (But, bear in mind that there is a possibility that the set of extended attributes may change between the two calls, so that it is still necessary to check the return status from the second call.) Example The list of names is returned as an unordered array of null-terminated character strings (attribute names are separated by null bytes ('\0')), like this: user.name1\0system.name1\0user.name2\0 Filesystems that implement POSIX ACLs using extended attributes might return a list like this: system.posix_acl_access\0system.posix_acl_default\0 http://man7.org/linux/man-pages/man2/removexattr.2.html 10 SYSTEM CALL: removexattr(2) - Linux manual page FUNCTIONALITY: removexattr, lremovexattr, fremovexattr - remove an extended attribute SYNOPSIS: #include <sys/types.h> #include <sys/xattr.h> int removexattr(const char *path, const char *name); int lremovexattr(const char *path, const char *name); int fremovexattr(int fd, const char *name); DESCRIPTION Extended attributes are name:value pairs associated with inodes (files, directories, symbolic links, etc.).
They are extensions to the normal attributes which are associated with all inodes in the system (i.e., the stat(2) data). A complete overview of extended attributes concepts can be found in xattr(7). removexattr() removes the extended attribute identified by name and associated with the given path in the filesystem. lremovexattr() is identical to removexattr(), except in the case of a symbolic link, where the extended attribute is removed from the link itself, not the file that it refers to. fremovexattr() is identical to removexattr(), only the extended attribute is removed from the open file referred to by fd (as returned by open(2)) in place of path. An extended attribute name is a null-terminated string. The name includes a namespace prefix; there may be several, disjoint namespaces associated with an individual inode. http://man7.org/linux/man-pages/man2/lremovexattr.2.html 10 SYSTEM CALL: removexattr(2) - Linux manual page FUNCTIONALITY: removexattr, lremovexattr, fremovexattr - remove an extended attribute SYNOPSIS: #include #include int removexattr(const char *path, const char *name); int lremovexattr(const char *path, const char *name); int fremovexattr(int fd, const char *name); DESCRIPTION Extended attributes are name:value pairs associated with inodes (files, directories, symbolic links, etc.). They are extensions to the normal attributes which are associated with all inodes in the system (i.e., the stat(2) data). A complete overview of extended attributes concepts can be found in xattr(7). removexattr() removes the extended attribute identified by name and associated with the given path in the filesystem. lremovexattr() is identical to removexattr(), except in the case of a symbolic link, where the extended attribute is removed from the link itself, not the file that it refers to. fremovexattr() is identical to removexattr(), only the extended attribute is removed from the open file referred to by fd (as returned by open(2)) in place of path. 
An extended attribute name is a null-terminated string. The name includes a namespace prefix; there may be several, disjoint namespaces associated with an individual inode. http://man7.org/linux/man-pages/man2/fremovexattr.2.html 10 SYSTEM CALL: removexattr(2) - Linux manual page FUNCTIONALITY: removexattr, lremovexattr, fremovexattr - remove an extended attribute SYNOPSIS: #include #include int removexattr(const char *path, const char *name); int lremovexattr(const char *path, const char *name); int fremovexattr(int fd, const char *name); DESCRIPTION Extended attributes are name:value pairs associated with inodes (files, directories, symbolic links, etc.). They are extensions to the normal attributes which are associated with all inodes in the system (i.e., the stat(2) data). A complete overview of extended attributes concepts can be found in xattr(7). removexattr() removes the extended attribute identified by name and associated with the given path in the filesystem. lremovexattr() is identical to removexattr(), except in the case of a symbolic link, where the extended attribute is removed from the link itself, not the file that it refers to. fremovexattr() is identical to removexattr(), only the extended attribute is removed from the open file referred to by fd (as returned by open(2)) in place of path. An extended attribute name is a null-terminated string. The name includes a namespace prefix; there may be several, disjoint namespaces associated with an individual inode. http://man7.org/linux/man-pages/man2/ioctl.2.html 10 SYSTEM CALL: ioctl(2) - Linux manual page FUNCTIONALITY: ioctl - control device SYNOPSIS: #include int ioctl(int fd, unsigned long request, ...); DESCRIPTION The ioctl() function manipulates the underlying device parameters of special files. In particular, many operating characteristics of character special files (e.g., terminals) may be controlled with ioctl() requests. The argument fd must be an open file descriptor. 
The second argument is a device-dependent request code. The third argument is an untyped pointer to memory. It's traditionally char *argp (from the days before void * was valid C), and will be so named for this discussion. An ioctl() request has encoded in it whether the argument is an in parameter or out parameter, and the size of the argument argp in bytes. Macros and defines used in specifying an ioctl() request are located in the file <sys/ioctl.h>. http://man7.org/linux/man-pages/man2/fcntl.2.html 11 SYSTEM CALL: fcntl(2) - Linux manual page FUNCTIONALITY: fcntl - manipulate file descriptor SYNOPSIS: #include <unistd.h> #include <fcntl.h> int fcntl(int fd, int cmd, ... /* arg */ ); DESCRIPTION fcntl() performs one of the operations described below on the open file descriptor fd. The operation is determined by cmd. fcntl() can take an optional third argument. Whether or not this argument is required is determined by cmd. The required argument type is indicated in parentheses after each cmd name (in most cases, the required type is int, and we identify the argument using the name arg), or void is specified if the argument is not required. Certain of the operations below are supported only since a particular Linux kernel version. The preferred method of checking whether the host kernel supports a particular operation is to invoke fcntl() with the desired cmd value and then test whether the call failed with EINVAL, indicating that the kernel does not recognize this value. Duplicating a file descriptor F_DUPFD (int) Find the lowest numbered available file descriptor greater than or equal to arg and make it be a copy of fd. This is different from dup2(2), which uses exactly the file descriptor specified. On success, the new file descriptor is returned. See dup(2) for further details. F_DUPFD_CLOEXEC (int; since Linux 2.6.24) As for F_DUPFD, but additionally set the close-on-exec flag for the duplicate file descriptor. 
Specifying this flag permits a program to avoid an additional fcntl() F_SETFD operation to set the FD_CLOEXEC flag. For an explanation of why this flag is useful, see the description of O_CLOEXEC in open(2). File descriptor flags The following commands manipulate the flags associated with a file descriptor. Currently, only one such flag is defined: FD_CLOEXEC, the close-on-exec flag. If the FD_CLOEXEC bit is 0, the file descriptor will remain open across an execve(2), otherwise it will be closed. F_GETFD (void) Read the file descriptor flags; arg is ignored. F_SETFD (int) Set the file descriptor flags to the value specified by arg. In multithreaded programs, using fcntl() F_SETFD to set the close-on- exec flag at the same time as another thread performs a fork(2) plus execve(2) is vulnerable to a race condition that may unintentionally leak the file descriptor to the program executed in the child process. See the discussion of the O_CLOEXEC flag in open(2) for details and a remedy to the problem. File status flags Each open file description has certain associated status flags, initialized by open(2) and possibly modified by fcntl(). Duplicated file descriptors (made with dup(2), fcntl(F_DUPFD), fork(2), etc.) refer to the same open file description, and thus share the same file status flags. The file status flags and their semantics are described in open(2). F_GETFL (void) Get the file access mode and the file status flags; arg is ignored. F_SETFL (int) Set the file status flags to the value specified by arg. File access mode (O_RDONLY, O_WRONLY, O_RDWR) and file creation flags (i.e., O_CREAT, O_EXCL, O_NOCTTY, O_TRUNC) in arg are ignored. On Linux, this command can change only the O_APPEND, O_ASYNC, O_DIRECT, O_NOATIME, and O_NONBLOCK flags. It is not possible to change the O_DSYNC and O_SYNC flags; see BUGS, below. Advisory record locking Linux implements traditional ("process-associated") UNIX record locks, as standardized by POSIX. 
For a Linux-specific alternative with better semantics, see the discussion of open file description locks below. F_SETLK, F_SETLKW, and F_GETLK are used to acquire, release, and test for the existence of record locks (also known as byte-range, file- segment, or file-region locks). The third argument, lock, is a pointer to a structure that has at least the following fields (in unspecified order). struct flock { ... short l_type; /* Type of lock: F_RDLCK, F_WRLCK, F_UNLCK */ short l_whence; /* How to interpret l_start: SEEK_SET, SEEK_CUR, SEEK_END */ off_t l_start; /* Starting offset for lock */ off_t l_len; /* Number of bytes to lock */ pid_t l_pid; /* PID of process blocking our lock (set by F_GETLK and F_OFD_GETLK) */ ... }; The l_whence, l_start, and l_len fields of this structure specify the range of bytes we wish to lock. Bytes past the end of the file may be locked, but not bytes before the start of the file. l_start is the starting offset for the lock, and is interpreted relative to either: the start of the file (if l_whence is SEEK_SET); the current file offset (if l_whence is SEEK_CUR); or the end of the file (if l_whence is SEEK_END). In the final two cases, l_start can be a negative number provided the offset does not lie before the start of the file. l_len specifies the number of bytes to be locked. If l_len is positive, then the range to be locked covers bytes l_start up to and including l_start+l_len-1. Specifying 0 for l_len has the special meaning: lock all bytes starting at the location specified by l_whence and l_start through to the end of file, no matter how large the file grows. POSIX.1-2001 allows (but does not require) an implementation to support a negative l_len value; if l_len is negative, the interval described by lock covers bytes l_start+l_len up to and including l_start-1. This is supported by Linux since kernel versions 2.4.21 and 2.5.49. The l_type field can be used to place a read (F_RDLCK) or a write (F_WRLCK) lock on a file. 
Any number of processes may hold a read lock (shared lock) on a file region, but only one process may hold a write lock (exclusive lock). An exclusive lock excludes all other locks, both shared and exclusive. A single process can hold only one type of lock on a file region; if a new lock is applied to an already-locked region, then the existing lock is converted to the new lock type. (Such conversions may involve splitting, shrinking, or coalescing with an existing lock if the byte range specified by the new lock does not precisely coincide with the range of the existing lock.) F_SETLK (struct flock *) Acquire a lock (when l_type is F_RDLCK or F_WRLCK) or release a lock (when l_type is F_UNLCK) on the bytes specified by the l_whence, l_start, and l_len fields of lock. If a conflicting lock is held by another process, this call returns -1 and sets errno to EACCES or EAGAIN. (The error returned in this case differs across implementations, so POSIX requires a portable application to check for both errors.) F_SETLKW (struct flock *) As for F_SETLK, but if a conflicting lock is held on the file, then wait for that lock to be released. If a signal is caught while waiting, then the call is interrupted and (after the signal handler has returned) returns immediately (with return value -1 and errno set to EINTR; see signal(7)). F_GETLK (struct flock *) On input to this call, lock describes a lock we would like to place on the file. If the lock could be placed, fcntl() does not actually place it, but returns F_UNLCK in the l_type field of lock and leaves the other fields of the structure unchanged. If one or more incompatible locks would prevent this lock being placed, then fcntl() returns details about one of those locks in the l_type, l_whence, l_start, and l_len fields of lock. If the conflicting lock is a traditional (process- associated) record lock, then the l_pid field is set to the PID of the process holding that lock. 
If the conflicting lock is an open file description lock, then l_pid is set to -1. Note that the returned information may already be out of date by the time the caller inspects it. In order to place a read lock, fd must be open for reading. In order to place a write lock, fd must be open for writing. To place both types of lock, open a file read-write. When placing locks with F_SETLKW, the kernel detects deadlocks, whereby two or more processes have their lock requests mutually blocked by locks held by the other processes. For example, suppose process A holds a write lock on byte 100 of a file, and process B holds a write lock on byte 200. If each process then attempts to lock the byte already locked by the other process using F_SETLKW, then, without deadlock detection, both processes would remain blocked indefinitely. When the kernel detects such deadlocks, it causes one of the blocking lock requests to immediately fail with the error EDEADLK; an application that encounters such an error should release some of its locks to allow other applications to proceed before attempting to regain the locks that it requires. Circular deadlocks involving more than two processes are also detected. Note, however, that there are limitations to the kernel's deadlock-detection algorithm; see BUGS. As well as being removed by an explicit F_UNLCK, record locks are automatically released when the process terminates. Record locks are not inherited by a child created via fork(2), but are preserved across an execve(2). Because of the buffering performed by the stdio(3) library, the use of record locking with routines in that package should be avoided; use read(2) and write(2) instead. The record locks described above are associated with the process (unlike the open file description locks described below). 
This has some unfortunate consequences: * If a process closes any file descriptor referring to a file, then all of the process's locks on that file are released, regardless of the file descriptor(s) on which the locks were obtained. This is bad: it means that a process can lose its locks on a file such as /etc/passwd or /etc/mtab when for some reason a library function decides to open, read, and close the same file. * The threads in a process share locks. In other words, a multithreaded program can't use record locking to ensure that threads don't simultaneously access the same region of a file. Open file description locks solve both of these problems. Open file description locks (non-POSIX) Open file description locks are advisory byte-range locks whose operation is in most respects identical to the traditional record locks described above. This lock type is Linux-specific, and available since Linux 3.15. (There is a proposal with the Austin Group to include this lock type in the next revision of POSIX.1.) For an explanation of open file descriptions, see open(2). The principal difference between the two lock types is that whereas traditional record locks are associated with a process, open file description locks are associated with the open file description on which they are acquired, much like locks acquired with flock(2). Consequently (and unlike traditional advisory record locks), open file description locks are inherited across fork(2) (and clone(2) with CLONE_FILES), and are only automatically released on the last close of the open file description, instead of being released on any close of the file. Conflicting lock combinations (i.e., a read lock and a write lock or two write locks) where one lock is an open file description lock and the other is a traditional record lock conflict even when they are acquired by the same process on the same file descriptor. 
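The same-process conflict described in the last paragraph can be observed with a short experiment. The following is a minimal sketch, not a definitive implementation: the helper name ofd_and_posix_locks_conflict() and the scratch-file path are hypothetical, the fallback value 37 for F_OFD_SETLK is an assumption for older headers on Linux, and the code assumes a kernel with open file description lock support (Linux 3.15 or later). It places an open file description write lock on a file, then asks for a traditional write lock through the very same descriptor.

```c
#define _GNU_SOURCE             /* exposes F_OFD_SETLK in <fcntl.h> */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#ifndef F_OFD_SETLK
#define F_OFD_SETLK 37          /* assumption: Linux value, for older headers */
#endif

/* Hypothetical helper. Returns 1 if a traditional write lock is refused
 * while an open file description write lock is held via the same file
 * descriptor, 0 if it is granted, and -1 on any unexpected error. */
int ofd_and_posix_locks_conflict(void)
{
    char path[] = "/tmp/ofd-lock-XXXXXX";   /* scratch file, assumed writable */
    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);                           /* file disappears on close */

    struct flock lk;
    memset(&lk, 0, sizeof(lk));             /* l_pid must be 0 for OFD commands */
    lk.l_type = F_WRLCK;
    lk.l_whence = SEEK_SET;
    lk.l_start = 0;
    lk.l_len = 0;                           /* 0 means "through end of file" */

    int result = -1;
    if (fcntl(fd, F_OFD_SETLK, &lk) == 0) {
        /* Same process, same descriptor, but a different lock type. */
        if (fcntl(fd, F_SETLK, &lk) == 0)
            result = 0;
        else if (errno == EACCES || errno == EAGAIN)
            result = 1;                     /* POSIX allows either error */
    }
    close(fd);                              /* last close drops the OFD lock */
    return result;
}
```

On a kernel that behaves as described above, the helper is expected to return 1, because the two lock types have different owners even inside one process.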
Open file description locks placed via the same open file description (i.e., via the same file descriptor, or via a duplicate of the file descriptor created by fork(2), dup(2), fcntl(2) F_DUPFD, and so on) are always compatible: if a new lock is placed on an already locked region, then the existing lock is converted to the new lock type. (Such conversions may result in splitting, shrinking, or coalescing with an existing lock as discussed above.) On the other hand, open file description locks may conflict with each other when they are acquired via different open file descriptions. Thus, the threads in a multithreaded program can use open file description locks to synchronize access to a file region by having each thread perform its own open(2) on the file and applying locks via the resulting file descriptor. As with traditional advisory locks, the third argument to fcntl(), lock, is a pointer to an flock structure. By contrast with traditional record locks, the l_pid field of that structure must be set to zero when using the commands described below. The commands for working with open file description locks are analogous to those used with traditional locks: F_OFD_SETLK (struct flock *) Acquire an open file description lock (when l_type is F_RDLCK or F_WRLCK) or release an open file description lock (when l_type is F_UNLCK) on the bytes specified by the l_whence, l_start, and l_len fields of lock. If a conflicting lock is held by another process, this call returns -1 and sets errno to EAGAIN. F_OFD_SETLKW (struct flock *) As for F_OFD_SETLK, but if a conflicting lock is held on the file, then wait for that lock to be released. If a signal is caught while waiting, then the call is interrupted and (after the signal handler has returned) returns immediately (with return value -1 and errno set to EINTR; see signal(7)). F_OFD_GETLK (struct flock *) On input to this call, lock describes an open file description lock we would like to place on the file. 
If the lock could be placed, fcntl() does not actually place it, but returns F_UNLCK in the l_type field of lock and leaves the other fields of the structure unchanged. If one or more incompatible locks would prevent this lock being placed, then details about one of these locks are returned via lock, as described above for F_GETLK. In the current implementation, no deadlock detection is performed for open file description locks. (This contrasts with process-associated record locks, for which the kernel does perform deadlock detection.) Mandatory locking Warning: the Linux implementation of mandatory locking is unreliable. See BUGS below. Because of these bugs, and the fact that the feature is believed to be little used, since Linux 4.5, mandatory locking has been made an optional feature, governed by a configuration option (CONFIG_MANDATORY_FILE_LOCKING). This is an initial step toward removing this feature completely. By default, both traditional (process-associated) and open file description record locks are advisory. Advisory locks are not enforced and are useful only between cooperating processes. Both lock types can also be mandatory. Mandatory locks are enforced for all processes. If a process tries to perform an incompatible access (e.g., read(2) or write(2)) on a file region that has an incompatible mandatory lock, then the result depends upon whether the O_NONBLOCK flag is enabled for its open file description. If the O_NONBLOCK flag is not enabled, then the system call is blocked until the lock is removed or converted to a mode that is compatible with the access. If the O_NONBLOCK flag is enabled, then the system call fails with the error EAGAIN. To make use of mandatory locks, mandatory locking must be enabled both on the filesystem that contains the file to be locked, and on the file itself. Mandatory locking is enabled on a filesystem using the "-o mand" option to mount(8), or the MS_MANDLOCK flag for mount(2). 
Mandatory locking is enabled on a file by disabling group execute permission on the file and enabling the set-group-ID permission bit (see chmod(1) and chmod(2)). Mandatory locking is not specified by POSIX. Some other systems also support mandatory locking, although the details of how to enable it vary across systems. Managing signals F_GETOWN, F_SETOWN, F_GETOWN_EX, F_SETOWN_EX, F_GETSIG and F_SETSIG are used to manage I/O availability signals: F_GETOWN (void) Return (as the function result) the process ID or process group currently receiving SIGIO and SIGURG signals for events on file descriptor fd. Process IDs are returned as positive values; process group IDs are returned as negative values (but see BUGS below). arg is ignored. F_SETOWN (int) Set the process ID or process group ID that will receive SIGIO and SIGURG signals for events on the file descriptor fd. The target process or process group ID is specified in arg. A process ID is specified as a positive value; a process group ID is specified as a negative value. Most commonly, the calling process specifies itself as the owner (that is, arg is specified as getpid(2)). As well as setting the file descriptor owner, one must also enable generation of signals on the file descriptor. This is done by using the fcntl() F_SETFL command to set the O_ASYNC file status flag on the file descriptor. Subsequently, a SIGIO signal is sent whenever input or output becomes possible on the file descriptor. The fcntl() F_SETSIG command can be used to obtain delivery of a signal other than SIGIO. Sending a signal to the owner process (group) specified by F_SETOWN is subject to the same permissions checks as are described for kill(2), where the sending process is the one that employs F_SETOWN (but see BUGS below). If this permission check fails, then the signal is silently discarded. 
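The two steps just described for F_SETOWN (name an owner, then set O_ASYNC with F_SETFL) can be sketched as follows. This is an illustrative outline only; enable_sigio() is a hypothetical helper name, and the caller is assumed to have installed a SIGIO handler beforehand, since the default action for SIGIO is to terminate the process.

```c
#define _GNU_SOURCE             /* makes sure O_ASYNC is visible */
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper. Makes the calling process the owner of fd and
 * enables O_ASYNC so that SIGIO is generated when I/O becomes possible.
 * Returns 0 on success, -1 on error (errno is set by fcntl()). */
int enable_sigio(int fd)
{
    /* Step 1: a positive arg names a process, here the caller itself. */
    if (fcntl(fd, F_SETOWN, getpid()) == -1)
        return -1;

    /* Step 2: add O_ASYNC to the existing file status flags. */
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    if (fcntl(fd, F_SETFL, flags | O_ASYNC) == -1)
        return -1;
    return 0;
}
```

A caller would typically apply this to a socket, terminal, pipe, or FIFO, and afterwards F_GETOWN on the same descriptor should report the caller's process ID.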
If the file descriptor fd refers to a socket, F_SETOWN also selects the recipient of SIGURG signals that are delivered when out-of-band data arrives on that socket. (SIGURG is sent in any situation where select(2) would report the socket as having an "exceptional condition".) The following was true in 2.6.x kernels up to and including kernel 2.6.11: If a nonzero value is given to F_SETSIG in a multithreaded process running with a threading library that supports thread groups (e.g., NPTL), then a positive value given to F_SETOWN has a different meaning: instead of being a process ID identifying a whole process, it is a thread ID identifying a specific thread within a process. Consequently, it may be necessary to pass F_SETOWN the result of gettid(2) instead of getpid(2) to get sensible results when F_SETSIG is used. (In current Linux threading implementations, a main thread's thread ID is the same as its process ID. This means that a single-threaded program can equally use gettid(2) or getpid(2) in this scenario.) Note, however, that the statements in this paragraph do not apply to the SIGURG signal generated for out-of-band data on a socket: this signal is always sent to either a process or a process group, depending on the value given to F_SETOWN. The above behavior was accidentally dropped in Linux 2.6.12, and won't be restored. From Linux 2.6.32 onward, use F_SETOWN_EX to target SIGIO and SIGURG signals at a particular thread. F_GETOWN_EX (struct f_owner_ex *) (since Linux 2.6.32) Return the current file descriptor owner settings as defined by a previous F_SETOWN_EX operation. The information is returned in the structure pointed to by arg, which has the following form: struct f_owner_ex { int type; pid_t pid; }; The type field will have one of the values F_OWNER_TID, F_OWNER_PID, or F_OWNER_PGRP. The pid field is a positive integer representing a thread ID, process ID, or process group ID. See F_SETOWN_EX for more details. 
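As a rough illustration of how the f_owner_ex structure is filled in and read back, the sketch below stores an owner with F_SETOWN_EX and retrieves it with F_GETOWN_EX. The helper name owner_ex_round_trip() is hypothetical, and the choice of F_OWNER_PID with the caller's own process ID is only an example; glibc is assumed to expose these definitions when _GNU_SOURCE is defined.

```c
#define _GNU_SOURCE             /* F_SETOWN_EX, F_GETOWN_EX, struct f_owner_ex */
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper. Directs I/O availability signals for fd at the
 * calling process, then reads the setting back.
 * Returns 1 if the values match, 0 if they differ, -1 on error. */
int owner_ex_round_trip(int fd)
{
    struct f_owner_ex set = { .type = F_OWNER_PID, .pid = getpid() };
    struct f_owner_ex got = { 0 };

    if (fcntl(fd, F_SETOWN_EX, &set) == -1)
        return -1;
    if (fcntl(fd, F_GETOWN_EX, &got) == -1)
        return -1;

    /* type says how to interpret pid: thread, process, or process group. */
    return got.type == F_OWNER_PID && got.pid == set.pid;
}
```

To target a single thread instead, the same structure would carry F_OWNER_TID and a thread ID in pid.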
F_SETOWN_EX (struct f_owner_ex *) (since Linux 2.6.32) This operation performs a similar task to F_SETOWN. It allows the caller to direct I/O availability signals to a specific thread, process, or process group. The caller specifies the target of signals via arg, which is a pointer to a f_owner_ex structure. The type field has one of the following values, which define how pid is interpreted: F_OWNER_TID Send the signal to the thread whose thread ID (the value returned by a call to clone(2) or gettid(2)) is specified in pid. F_OWNER_PID Send the signal to the process whose ID is specified in pid. F_OWNER_PGRP Send the signal to the process group whose ID is specified in pid. (Note that, unlike with F_SETOWN, a process group ID is specified as a positive value here.) F_GETSIG (void) Return (as the function result) the signal sent when input or output becomes possible. A value of zero means SIGIO is sent. Any other value (including SIGIO) is the signal sent instead, and in this case additional info is available to the signal handler if installed with SA_SIGINFO. arg is ignored. F_SETSIG (int) Set the signal sent when input or output becomes possible to the value given in arg. A value of zero means to send the default SIGIO signal. Any other value (including SIGIO) is the signal to send instead, and in this case additional info is available to the signal handler if installed with SA_SIGINFO. By using F_SETSIG with a nonzero value, and setting SA_SIGINFO for the signal handler (see sigaction(2)), extra information about I/O events is passed to the handler in a siginfo_t structure. If the si_code field indicates the source is SI_SIGIO, the si_fd field gives the file descriptor associated with the event. Otherwise, there is no indication which file descriptors are pending, and you should use the usual mechanisms (select(2), poll(2), read(2) with O_NONBLOCK set etc.) to determine which file descriptors are available for I/O. 
Note that the file descriptor provided in si_fd is the one that was specified during the F_SETSIG operation. This can lead to an unusual corner case. If the file descriptor is duplicated (dup(2) or similar), and the original file descriptor is closed, then I/O events will continue to be generated, but the si_fd field will contain the number of the now closed file descriptor. By selecting a real time signal (value >= SIGRTMIN), multiple I/O events may be queued using the same signal numbers. (Queuing is dependent on available memory.) Extra information is available if SA_SIGINFO is set for the signal handler, as above. Note that Linux imposes a limit on the number of real-time signals that may be queued to a process (see getrlimit(2) and signal(7)) and if this limit is reached, then the kernel reverts to delivering SIGIO, and this signal is delivered to the entire process rather than to a specific thread. Using these mechanisms, a program can implement fully asynchronous I/O without using select(2) or poll(2) most of the time. The use of O_ASYNC is specific to BSD and Linux. The only use of F_GETOWN and F_SETOWN specified in POSIX.1 is in conjunction with the use of the SIGURG signal on sockets. (POSIX does not specify the SIGIO signal.) F_GETOWN_EX, F_SETOWN_EX, F_GETSIG, and F_SETSIG are Linux-specific. POSIX has asynchronous I/O and the aio_sigevent structure to achieve similar things; these are also available in Linux as part of the GNU C Library (Glibc). Leases F_SETLEASE and F_GETLEASE (Linux 2.4 onward) are used (respectively) to establish a new lease, and retrieve the current lease, on the open file description referred to by the file descriptor fd. A file lease provides a mechanism whereby the process holding the lease (the "lease holder") is notified (via delivery of a signal) when a process (the "lease breaker") tries to open(2) or truncate(2) the file referred to by that file descriptor. 
F_SETLEASE (int) Set or remove a file lease according to which of the following values is specified in the integer arg: F_RDLCK Take out a read lease. This will cause the calling process to be notified when the file is opened for writing or is truncated. A read lease can be placed only on a file descriptor that is opened read-only. F_WRLCK Take out a write lease. This will cause the caller to be notified when the file is opened for reading or writing or is truncated. A write lease may be placed on a file only if there are no other open file descriptors for the file. F_UNLCK Remove our lease from the file. Leases are associated with an open file description (see open(2)). This means that duplicate file descriptors (created by, for example, fork(2) or dup(2)) refer to the same lease, and this lease may be modified or released using any of these descriptors. Furthermore, the lease is released by either an explicit F_UNLCK operation on any of these duplicate file descriptors, or when all such file descriptors have been closed. Leases may be taken out only on regular files. An unprivileged process may take out a lease only on a file whose UID (owner) matches the filesystem UID of the process. A process with the CAP_LEASE capability may take out leases on arbitrary files. F_GETLEASE (void) Indicates what type of lease is associated with the file descriptor fd by returning either F_RDLCK, F_WRLCK, or F_UNLCK, indicating, respectively, a read lease, a write lease, or no lease. arg is ignored. When a process (the "lease breaker") performs an open(2) or truncate(2) that conflicts with a lease established via F_SETLEASE, the system call is blocked by the kernel and the kernel notifies the lease holder by sending it a signal (SIGIO by default). 
The lease holder should respond to receipt of this signal by doing whatever cleanup is required in preparation for the file to be accessed by another process (e.g., flushing cached buffers) and then either remove or downgrade its lease. A lease is removed by performing an F_SETLEASE command specifying arg as F_UNLCK. If the lease holder currently holds a write lease on the file, and the lease breaker is opening the file for reading, then it is sufficient for the lease holder to downgrade the lease to a read lease. This is done by performing an F_SETLEASE command specifying arg as F_RDLCK. If the lease holder fails to downgrade or remove the lease within the number of seconds specified in /proc/sys/fs/lease-break-time, then the kernel forcibly removes or downgrades the lease holder's lease. Once a lease break has been initiated, F_GETLEASE returns the target lease type (either F_RDLCK or F_UNLCK, depending on what would be compatible with the lease breaker) until the lease holder voluntarily downgrades or removes the lease or the kernel forcibly does so after the lease break timer expires. Once the lease has been voluntarily or forcibly removed or downgraded, and assuming the lease breaker has not unblocked its system call, the kernel permits the lease breaker's system call to proceed. If the lease breaker's blocked open(2) or truncate(2) is interrupted by a signal handler, then the system call fails with the error EINTR, but the other steps still occur as described above. If the lease breaker is killed by a signal while blocked in open(2) or truncate(2), then the other steps still occur as described above. If the lease breaker specifies the O_NONBLOCK flag when calling open(2), then the call immediately fails with the error EWOULDBLOCK, but the other steps still occur as described above. The default signal used to notify the lease holder is SIGIO, but this can be changed using the F_SETSIG command to fcntl(). 
If a F_SETSIG command is performed (even one specifying SIGIO), and the signal handler is established using SA_SIGINFO, then the handler will receive a siginfo_t structure as its second argument, and the si_fd field of this argument will hold the file descriptor of the leased file that has been accessed by another process. (This is useful if the caller holds leases against multiple files.) File and directory change notification (dnotify) F_NOTIFY (int) (Linux 2.4 onward) Provide notification when the directory referred to by fd or any of the files that it contains is changed. The events to be notified are specified in arg, which is a bit mask specified by ORing together zero or more of the following bits: DN_ACCESS A file was accessed (read(2), pread(2), readv(2), and similar) DN_MODIFY A file was modified (write(2), pwrite(2), writev(2), truncate(2), ftruncate(2), and similar). DN_CREATE A file was created (open(2), creat(2), mknod(2), mkdir(2), link(2), symlink(2), rename(2) into this directory). DN_DELETE A file was unlinked (unlink(2), rename(2) to another directory, rmdir(2)). DN_RENAME A file was renamed within this directory (rename(2)). DN_ATTRIB The attributes of a file were changed (chown(2), chmod(2), utime(2), utimensat(2), and similar). (In order to obtain these definitions, the _GNU_SOURCE feature test macro must be defined before including any header files.) Directory notifications are normally "one-shot", and the application must reregister to receive further notifications. Alternatively, if DN_MULTISHOT is included in arg, then notification will remain in effect until explicitly removed. A series of F_NOTIFY requests is cumulative, with the events in arg being added to the set already monitored. To disable notification of all events, make an F_NOTIFY call specifying arg as 0. Notification occurs via delivery of a signal. The default signal is SIGIO, but this can be changed using the F_SETSIG command to fcntl(). 
(Note that SIGIO is one of the nonqueuing standard signals; switching to the use of a real-time signal means that multiple notifications can be queued to the process.) In the latter case, the signal handler receives a siginfo_t structure as its second argument (if the handler was established using SA_SIGINFO) and the si_fd field of this structure contains the file descriptor which generated the notification (useful when establishing notification on multiple directories). Especially when using DN_MULTISHOT, a real time signal should be used for notification, so that multiple notifications can be queued. NOTE: New applications should use the inotify interface (available since kernel 2.6.13), which provides a much superior interface for obtaining notifications of filesystem events. See inotify(7). Changing the capacity of a pipe F_SETPIPE_SZ (int; since Linux 2.6.35) Change the capacity of the pipe referred to by fd to be at least arg bytes. An unprivileged process can adjust the pipe capacity to any value between the system page size and the limit defined in /proc/sys/fs/pipe-max-size (see proc(5)). Attempts to set the pipe capacity below the page size are silently rounded up to the page size. Attempts by an unprivileged process to set the pipe capacity above the limit in /proc/sys/fs/pipe-max-size yield the error EPERM; a privileged process (CAP_SYS_RESOURCE) can override the limit. When allocating the buffer for the pipe, the kernel may use a capacity larger than arg, if that is convenient for the implementation. The actual capacity that is set is returned as the function result. Attempting to set the pipe capacity smaller than the amount of buffer space currently used to store data produces the error EBUSY. F_GETPIPE_SZ (void; since Linux 2.6.35) Return (as the function result) the capacity of the pipe referred to by fd. File Sealing File seals limit the set of allowed operations on a given file. 
For each seal that is set on a file, a specific set of operations will fail with EPERM on this file from now on. The file is said to be sealed. The default set of seals depends on the type of the underlying file and filesystem. For an overview of file sealing, a discussion of its purpose, and some code examples, see memfd_create(2). Currently, only the tmpfs filesystem supports sealing. On other filesystems, all fcntl(2) operations that operate on seals will return EINVAL. Seals are a property of an inode. Thus, all open file descriptors referring to the same inode share the same set of seals. Furthermore, seals can never be removed, only added. F_ADD_SEALS (int; since Linux 3.17) Add the seals given in the bit-mask argument arg to the set of seals of the inode referred to by the file descriptor fd. Seals cannot be removed again. Once this call succeeds, the seals are enforced by the kernel immediately. If the current set of seals includes F_SEAL_SEAL (see below), then this call will be rejected with EPERM. Adding a seal that is already set is a no-op, in case F_SEAL_SEAL is not set already. In order to place a seal, the file descriptor fd must be writable. F_GET_SEALS (void; since Linux 3.17) Return (as the function result) the current set of seals of the inode referred to by fd. If no seals are set, 0 is returned. If the file does not support sealing, -1 is returned and errno is set to EINVAL. The following seals are available: F_SEAL_SEAL If this seal is set, any further call to fcntl(2) with F_ADD_SEALS will fail with EPERM. Therefore, this seal prevents any modifications to the set of seals itself. If the initial set of seals of a file includes F_SEAL_SEAL, then this effectively causes the set of seals to be constant and locked. F_SEAL_SHRINK If this seal is set, the file in question cannot be reduced in size. This affects open(2) with the O_TRUNC flag as well as truncate(2) and ftruncate(2). 
Those calls will fail with EPERM if you try to shrink the file in question. Increasing the file size is still possible.
F_SEAL_GROW
If this seal is set, the size of the file in question cannot be increased. This affects write(2) beyond the end of the file, truncate(2), ftruncate(2), and fallocate(2). These calls will fail with EPERM if you use them to increase the file size. If you keep the size or shrink it, those calls still work as expected.
F_SEAL_WRITE
If this seal is set, you cannot modify the contents of the file. Note that shrinking or growing the size of the file is still possible and allowed. Thus, this seal is normally used in combination with one of the other seals. This seal affects write(2) and fallocate(2) (only in combination with the FALLOC_FL_PUNCH_HOLE flag). Those calls will fail with EPERM if this seal is set. Furthermore, trying to create new shared, writable memory-mappings via mmap(2) will also fail with EPERM.
Using the F_ADD_SEALS operation to set the F_SEAL_WRITE seal will fail with EBUSY if any writable, shared mapping exists. Such mappings must be unmapped before you can add this seal. Furthermore, if there are any asynchronous I/O operations (io_submit(2)) pending on the file, all outstanding writes will be discarded.

http://man7.org/linux/man-pages/man2/dup.2.html 11
SYSTEM CALL: dup(2) - Linux manual page
FUNCTIONALITY: dup, dup2, dup3 - duplicate a file descriptor
SYNOPSIS:
#include <unistd.h>
int dup(int oldfd);
int dup2(int oldfd, int newfd);
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <fcntl.h> /* Obtain O_* constant definitions */
#include <unistd.h>
int dup3(int oldfd, int newfd, int flags);
DESCRIPTION
The dup() system call creates a copy of the file descriptor oldfd, using the lowest-numbered unused file descriptor for the new descriptor. After a successful return, the old and new file descriptors may be used interchangeably.
They refer to the same open file description (see open(2)) and thus share file offset and file status flags; for example, if the file offset is modified by using lseek(2) on one of the file descriptors, the offset is also changed for the other. The two file descriptors do not share file descriptor flags (the close-on-exec flag). The close-on-exec flag (FD_CLOEXEC; see fcntl(2)) for the duplicate descriptor is off. dup2() The dup2() system call performs the same task as dup(), but instead of using the lowest-numbered unused file descriptor, it uses the file descriptor number specified in newfd. If the file descriptor newfd was previously open, it is silently closed before being reused. The steps of closing and reusing the file descriptor newfd are performed atomically. This is important, because trying to implement equivalent functionality using close(2) and dup() would be subject to race conditions, whereby newfd might be reused between the two steps. Such reuse could happen because the main program is interrupted by a signal handler that allocates a file descriptor, or because a parallel thread allocates a file descriptor. Note the following points: * If oldfd is not a valid file descriptor, then the call fails, and newfd is not closed. * If oldfd is a valid file descriptor, and newfd has the same value as oldfd, then dup2() does nothing, and returns newfd. dup3() dup3() is the same as dup2(), except that: * The caller can force the close-on-exec flag to be set for the new file descriptor by specifying O_CLOEXEC in flags. See the description of the same flag in open(2) for reasons why this may be useful. * If oldfd equals newfd, then dup3() fails with the error EINVAL. 
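EXAMPLE (illustrative sketch, not taken from the manual page): the shared file offset and the oldfd == newfd case of dup2() can be observed with two small helpers. The function names and the scratch-file path passed by the caller are hypothetical; error handling is kept minimal.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: seek on the original descriptor, then report the
 * offset as seen through a dup()ed copy. Returns -1 on error. */
off_t offset_seen_by_duplicate(const char *path, off_t seek_to)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return -1;
    unlink(path);                   /* the file lives until the last close */

    int copy = dup(fd);             /* lowest-numbered unused descriptor */
    if (copy == -1) {
        close(fd);
        return -1;
    }

    off_t seen = -1;
    if (lseek(fd, seek_to, SEEK_SET) != -1)   /* move offset via original */
        seen = lseek(copy, 0, SEEK_CUR);      /* query it via the duplicate */

    close(copy);
    close(fd);
    return seen;
}

/* dup2() with oldfd == newfd does nothing and returns newfd. */
int dup2_same_fd(int fd)
{
    return dup2(fd, fd);
}
```

Because both descriptors refer to the same open file description, the offset set through fd is the one reported through copy.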
http://man7.org/linux/man-pages/man2/dup2.2.html 11
SYSTEM CALL: dup(2) - Linux manual page
FUNCTIONALITY: dup, dup2, dup3 - duplicate a file descriptor

http://man7.org/linux/man-pages/man2/dup3.2.html 11
SYSTEM CALL: dup(2) - Linux manual page
FUNCTIONALITY: dup, dup2, dup3 - duplicate a file descriptor

http://man7.org/linux/man-pages/man2/flock.2.html 10
SYSTEM CALL: flock(2) - Linux manual page
FUNCTIONALITY: flock - apply or remove an advisory lock on an open file
SYNOPSIS:
#include <sys/file.h>
int flock(int fd, int operation);
DESCRIPTION
Apply or remove an advisory lock on the open file specified by fd. The argument operation is one of the following:
LOCK_SH Place a shared lock. More than one process may hold a shared lock for a given file at a given time.
LOCK_EX Place an exclusive lock. Only one process may hold an exclusive lock for a given file at a given time.
LOCK_UN Remove an existing lock held by this process.
A call to flock() may block if an incompatible lock is held by another process. To make a nonblocking request, include LOCK_NB (by ORing) with any of the above operations. A single file may not simultaneously have both shared and exclusive locks.
Locks created by flock() are associated with an open file description (see open(2)). This means that duplicate file descriptors (created by, for example, fork(2) or dup(2)) refer to the same lock, and this lock may be modified or released using any of these file descriptors.
Furthermore, the lock is released either by an explicit LOCK_UN operation on any of these duplicate file descriptors, or when all such file descriptors have been closed.
If a process uses open(2) (or similar) to obtain more than one file descriptor for the same file, these file descriptors are treated independently by flock(). An attempt to lock the file using one of these file descriptors may be denied by a lock that the calling process has already placed via another file descriptor.
A process may hold only one type of lock (shared or exclusive) on a file. Subsequent flock() calls on an already locked file will convert an existing lock to the new lock mode.
Locks created by flock() are preserved across an execve(2).
A shared or exclusive lock can be placed on a file regardless of the mode in which the file was opened.

http://man7.org/linux/man-pages/man2/read.2.html 11
SYSTEM CALL: read(2) - Linux manual page
FUNCTIONALITY: read - read from a file descriptor
SYNOPSIS:
#include <unistd.h>
ssize_t read(int fd, void *buf, size_t count);
DESCRIPTION
read() attempts to read up to count bytes from file descriptor fd into the buffer starting at buf.
On files that support seeking, the read operation commences at the file offset, and the file offset is incremented by the number of bytes read. If the file offset is at or past the end of file, no bytes are read, and read() returns zero.
If count is zero, read() may detect the errors described below. In the absence of any errors, or if read() does not check for errors, a read() with a count of 0 returns zero and has no other effects.
If count is greater than SSIZE_MAX, the result is unspecified.
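EXAMPLE (illustrative sketch, not taken from the manual page): because read() transfers "up to" count bytes and returns zero at end of file, callers that need a full buffer usually loop. The helper name read_full is hypothetical, and the EINTR retry is a common convention assumed here rather than something stated in the excerpt above.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: keep calling read() until count bytes have arrived
 * or end of file is reached. Returns the number of bytes read (less than
 * count only at end of file), or -1 on error. */
ssize_t read_full(int fd, void *buf, size_t count)
{
    size_t total = 0;

    while (total < count) {
        ssize_t n = read(fd, (char *)buf + total, count - total);
        if (n == 0)
            break;                  /* end of file: read() returned zero */
        if (n == -1) {
            if (errno == EINTR)
                continue;           /* interrupted before any data: retry */
            return -1;
        }
        total += (size_t)n;
    }
    return (ssize_t)total;
}
```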
http://man7.org/linux/man-pages/man2/readv.2.html 12
SYSTEM CALL: readv(2) - Linux manual page
FUNCTIONALITY: readv, writev, preadv, pwritev - read or write data into multiple buffers
SYNOPSIS:
#include <sys/uio.h>
ssize_t readv(int fd, const struct iovec *iov, int iovcnt);
ssize_t writev(int fd, const struct iovec *iov, int iovcnt);
ssize_t preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset);
ssize_t pwritev(int fd, const struct iovec *iov, int iovcnt, off_t offset);
ssize_t preadv2(int fd, const struct iovec *iov, int iovcnt, off_t offset, int flags);
ssize_t pwritev2(int fd, const struct iovec *iov, int iovcnt, off_t offset, int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
preadv(), pwritev():
Since glibc 2.19: _DEFAULT_SOURCE
Glibc 2.19 and earlier: _BSD_SOURCE
DESCRIPTION
The readv() system call reads iovcnt buffers from the file associated with the file descriptor fd into the buffers described by iov ("scatter input").
The writev() system call writes iovcnt buffers of data described by iov to the file associated with the file descriptor fd ("gather output").
The pointer iov points to an array of iovec structures, defined in <sys/uio.h> as:
    struct iovec {
        void *iov_base; /* Starting address */
        size_t iov_len; /* Number of bytes to transfer */
    };
The readv() system call works just like read(2) except that multiple buffers are filled.
The writev() system call works just like write(2) except that multiple buffers are written out.
Buffers are processed in array order. This means that readv() completely fills iov[0] before proceeding to iov[1], and so on. (If there is insufficient data, then not all buffers pointed to by iov may be filled.) Similarly, writev() writes out the entire contents of iov[0] before proceeding to iov[1], and so on.
The data transfers performed by readv() and writev() are atomic: the data written by writev() is written as a single block that is not intermingled with output from writes in other processes (but see pipe(7) for an exception); analogously, readv() is guaranteed to read a contiguous block of data from the file, regardless of read operations performed in other threads or processes that have file descriptors referring to the same open file description (see open(2)). preadv() and pwritev() The preadv() system call combines the functionality of readv() and pread(2). It performs the same task as readv(), but adds a fourth argument, offset, which specifies the file offset at which the input operation is to be performed. The pwritev() system call combines the functionality of writev() and pwrite(2). It performs the same task as writev(), but adds a fourth argument, offset, which specifies the file offset at which the output operation is to be performed. The file offset is not changed by these system calls. The file referred to by fd must be capable of seeking. preadv2() and pwritev2() These system calls are similar to preadv() and pwritev() calls, but add a fifth argument, flags, which modifies the behavior on a per- call basis. Unlike preadv() and pwritev(), if the offset argument is -1, then the current file offset is used and updated. The flags argument contains a bitwise OR of zero or more of the following flags: RWF_HIPRI (since Linux 4.6) High priority read/write. Allows block-based filesystems to use polling of the device, which provides lower latency, but may use additional resources. (Currently, this feature is usable only on a file descriptor opened using the O_DIRECT flag.) 
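EXAMPLE (illustrative sketch, not taken from the manual page): the array-order rule for readv() and writev() can be seen by gathering two buffers into a pipe and scattering the bytes back into buffers of different sizes. The function name, the payload, and the use of a pipe are assumptions made for the illustration.

```c
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical demo: gather "abcd" and "efgh" into a pipe with writev(),
 * then scatter the bytes into head (4 bytes) and tail (8 bytes) with
 * readv(). Returns the number of bytes read back, or -1 on error. */
ssize_t scatter_gather_demo(char head[4], char tail[8])
{
    int p[2];
    if (pipe(p) == -1)
        return -1;

    char a[] = "abcd", b[] = "efgh";        /* illustrative payload */
    struct iovec out[2] = {
        { .iov_base = a, .iov_len = 4 },
        { .iov_base = b, .iov_len = 4 },
    };

    ssize_t nread = -1;
    if (writev(p[1], out, 2) == 8) {        /* iov[0] is written first */
        struct iovec in[2] = {
            { .iov_base = head, .iov_len = 4 },
            { .iov_base = tail, .iov_len = 8 },
        };
        nread = readv(p[0], in, 2);         /* head is filled before tail */
    }

    close(p[0]);
    close(p[1]);
    return nread;
}
```

Only eight bytes are available, so head is filled completely and tail receives the remaining four bytes; the rest of tail is left untouched.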
http://man7.org/linux/man-pages/man2/pread.2.html 12
SYSTEM CALL: pread(2) - Linux manual page
FUNCTIONALITY: pread, pwrite - read from or write to a file descriptor at a given offset
SYNOPSIS:
#include <unistd.h>
ssize_t pread(int fd, void *buf, size_t count, off_t offset);
ssize_t pwrite(int fd, const void *buf, size_t count, off_t offset);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
pread(), pwrite():
_XOPEN_SOURCE >= 500 || /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
DESCRIPTION
pread() reads up to count bytes from file descriptor fd at offset offset (from the start of the file) into the buffer starting at buf. The file offset is not changed.
pwrite() writes up to count bytes from the buffer starting at buf to the file descriptor fd at offset offset. The file offset is not changed.
The file referenced by fd must be capable of seeking.

http://man7.org/linux/man-pages/man2/preadv.2.html 12
SYSTEM CALL: readv(2) - Linux manual page
FUNCTIONALITY: readv, writev, preadv, pwritev - read or write data into multiple buffers

http://man7.org/linux/man-pages/man2/write.2.html 11
SYSTEM CALL: write(2) - Linux manual page
FUNCTIONALITY: write - write to a file descriptor
SYNOPSIS:
#include <unistd.h>
ssize_t write(int fd, const void *buf, size_t count);
DESCRIPTION
write() writes up to count bytes from the buffer pointed to by buf to the file referred to by the file descriptor fd.
The number of bytes written may be less than count if, for example, there is insufficient space on the underlying physical medium, or the RLIMIT_FSIZE resource limit is encountered (see setrlimit(2)), or the call was interrupted by a signal handler after having written less than count bytes. (See also pipe(7).)
For a seekable file (i.e., one to which lseek(2) may be applied, for example, a regular file) writing takes place at the file offset, and the file offset is incremented by the number of bytes actually written. If the file was open(2)ed with O_APPEND, the file offset is first set to the end of the file before writing. The adjustment of the file offset and the write operation are performed as an atomic step.
POSIX requires that a read(2) which can be proved to occur after a write() has returned returns the new data. Note that not all filesystems are POSIX conforming.
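EXAMPLE (illustrative sketch, not taken from the manual page): since write() may transfer fewer than count bytes, a caller that must deliver the whole buffer typically resubmits the remainder. The helper name write_all is hypothetical, and retrying on EINTR is an assumed convention.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: repeat write() until all count bytes are written.
 * Returns count on success, or -1 on error (some bytes may already have
 * been written in that case). */
ssize_t write_all(int fd, const void *buf, size_t count)
{
    size_t done = 0;

    while (done < count) {
        ssize_t n = write(fd, (const char *)buf + done, count - done);
        if (n == -1) {
            if (errno == EINTR)
                continue;           /* interrupted before writing: retry */
            return -1;
        }
        done += (size_t)n;          /* short write: continue with the rest */
    }
    return (ssize_t)done;
}
```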
http://man7.org/linux/man-pages/man2/writev.2.html 12
SYSTEM CALL: readv(2) - Linux manual page
FUNCTIONALITY: readv, writev, preadv, pwritev - read or write data into multiple buffers
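EXAMPLE (illustrative sketch, not taken from the manual page): as described for readv(2), preadv() and pwritev() take an explicit offset and leave the file offset unchanged. The function name and the scratch-file path supplied by the caller are hypothetical; _DEFAULT_SOURCE is defined as the feature test macro listed for glibc 2.19 and later.

```c
#define _DEFAULT_SOURCE     /* exposes preadv() and pwritev() in glibc */
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical demo: write 4 bytes at offset 100 with pwritev(), read them
 * back into out with preadv(), and return the descriptor's file offset
 * afterwards. Returns -1 on error. */
off_t positional_io_demo(const char *path, char out[4])
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return -1;

    char data[] = "WXYZ";                   /* illustrative payload */
    struct iovec wv = { .iov_base = data, .iov_len = 4 };
    struct iovec rv = { .iov_base = out, .iov_len = 4 };
    off_t pos = -1;

    /* Neither call moves the file offset, which starts at 0 after open(). */
    if (pwritev(fd, &wv, 1, 100) == 4 && preadv(fd, &rv, 1, 100) == 4)
        pos = lseek(fd, 0, SEEK_CUR);

    close(fd);
    unlink(path);
    return pos;
}
```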
http://man7.org/linux/man-pages/man2/pwrite.2.html 12
SYSTEM CALL: pread(2) - Linux manual page
FUNCTIONALITY: pread, pwrite - read from or write to a file descriptor at a given offset

http://man7.org/linux/man-pages/man2/pwritev.2.html 12
SYSTEM CALL: readv(2) - Linux manual page
FUNCTIONALITY: readv, writev, preadv, pwritev - read or write data into multiple buffers

http://man7.org/linux/man-pages/man2/lseek.2.html 10
SYSTEM CALL: lseek(2) - Linux manual page
FUNCTIONALITY: lseek - reposition read/write file offset
SYNOPSIS:
#include <sys/types.h>
#include <unistd.h>
off_t lseek(int fd, off_t offset, int whence);
DESCRIPTION
The lseek() function repositions the file offset of the open file description associated with the file descriptor fd to the argument offset according to the directive whence as follows:
SEEK_SET The file offset is set to offset bytes.
SEEK_CUR The file offset is set to its current location plus offset bytes.
SEEK_END The file offset is set to the size of the file plus offset bytes.
The lseek() function allows the file offset to be set beyond the end of the file (but this does not change the size of the file). If data is later written at this point, subsequent reads of the data in the gap (a "hole") return null bytes ('\0') until data is actually written into the gap.
Seeking file data and holes
Since version 3.1, Linux supports the following additional values for whence:
SEEK_DATA Adjust the file offset to the next location in the file greater than or equal to offset containing data. If offset points to data, then the file offset is set to offset.
SEEK_HOLE Adjust the file offset to the next hole in the file greater than or equal to offset. If offset points into the middle of a hole, then the file offset is set to offset. If there is no hole past offset, then the file offset is adjusted to the end of the file (i.e., there is an implicit hole at the end of any file).
In both of the above cases, lseek() fails if offset points past the end of the file.
These operations allow applications to map holes in a sparsely allocated file. This can be useful for applications such as file backup tools, which can save space when creating backups and preserve holes, if they have a mechanism for discovering holes.
For the purposes of these operations, a hole is a sequence of zeros that (normally) has not been allocated in the underlying file storage. However, a filesystem is not obliged to report holes, so these operations are not a guaranteed mechanism for mapping the storage space actually allocated to a file. (Furthermore, a sequence of zeros that actually has been written to the underlying storage may not be reported as a hole.) In the simplest implementation, a filesystem can support the operations by making SEEK_HOLE always return the offset of the end of the file, and making SEEK_DATA always return offset (i.e., even if the location referred to by offset is a hole, it can be considered to consist of data that is a sequence of zeros).
The _GNU_SOURCE feature test macro must be defined in order to obtain the definitions of SEEK_DATA and SEEK_HOLE from <unistd.h>.
The SEEK_HOLE and SEEK_DATA operations are supported for the following filesystems:
* Btrfs (since Linux 3.1)
* OCFS (since Linux 3.2)
* XFS (since Linux 3.5)
* ext4 (since Linux 3.8)
* tmpfs (since Linux 3.8)
* NFS (since Linux 3.18)
* FUSE (since Linux 4.5)

http://man7.org/linux/man-pages/man2/sendfile.2.html 11
SYSTEM CALL: sendfile(2) - Linux manual page
FUNCTIONALITY: sendfile - transfer data between file descriptors
SYNOPSIS:
#include <sys/sendfile.h>
ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);
DESCRIPTION
sendfile() copies data between one file descriptor and another. Because this copying is done within the kernel, sendfile() is more efficient than the combination of read(2) and write(2), which would require transferring data to and from user space.
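EXAMPLE (illustrative sketch, not taken from the manual page): an in-kernel file copy built on sendfile(). The helper name and the paths supplied by the caller are hypothetical, and the sketch assumes a kernel where out_fd may be a regular file (Linux 2.6.33 or later, per this page).

```c
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: copy the whole of src to dst without moving the data
 * through user space. Returns the number of bytes copied, or -1 on error. */
ssize_t copy_with_sendfile(const char *src, const char *dst)
{
    int in_fd = open(src, O_RDONLY);
    if (in_fd == -1)
        return -1;

    int out_fd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (out_fd == -1) {
        close(in_fd);
        return -1;
    }

    struct stat st;
    ssize_t total = -1;
    if (fstat(in_fd, &st) == 0) {
        total = 0;
        while (total < st.st_size) {
            /* NULL offset: read from, and advance, in_fd's file offset. */
            ssize_t n = sendfile(out_fd, in_fd, NULL,
                                 (size_t)(st.st_size - total));
            if (n == -1) {
                total = -1;
                break;
            }
            if (n == 0)
                break;              /* source ended earlier than expected */
            total += n;
        }
    }

    close(in_fd);
    close(out_fd);
    return total;
}
```

The loop is there because a single call may transfer fewer bytes than requested.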
in_fd should be a file descriptor opened for reading and out_fd should be a descriptor opened for writing.
If offset is not NULL, then it points to a variable holding the file offset from which sendfile() will start reading data from in_fd. When sendfile() returns, this variable will be set to the offset of the byte following the last byte that was read. If offset is not NULL, then sendfile() does not modify the file offset of in_fd; otherwise the file offset is adjusted to reflect the number of bytes read from in_fd.
If offset is NULL, then data will be read from in_fd starting at the file offset, and the file offset will be updated by the call.
count is the number of bytes to copy between the file descriptors.
The in_fd argument must correspond to a file which supports mmap(2)-like operations (i.e., it cannot be a socket).
In Linux kernels before 2.6.33, out_fd must refer to a socket. Since Linux 2.6.33 it can be any file. If it is a regular file, then sendfile() changes the file offset appropriately.

http://man7.org/linux/man-pages/man2/fdatasync.2.html 11
SYSTEM CALL: fsync(2) - Linux manual page
FUNCTIONALITY: fsync, fdatasync - synchronize a file's in-core state with storage device
SYNOPSIS:
#include <unistd.h>
int fsync(int fd);
int fdatasync(int fd);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
fsync():
Glibc 2.16 and later: No feature test macros need be defined
Glibc up to and including 2.15: _BSD_SOURCE || _XOPEN_SOURCE || /* since glibc 2.8: */ _POSIX_C_SOURCE >= 200112L
fdatasync():
_POSIX_C_SOURCE >= 199309L || _XOPEN_SOURCE >= 500
DESCRIPTION
fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even if the system crashes or is rebooted. This includes writing through or flushing a disk cache if present.
The call blocks until the device reports that the transfer has completed. It also flushes metadata information associated with the file (see stat(2)). Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed. fdatasync() is similar to fsync(), but does not flush modified metadata unless that metadata is needed in order to allow a subsequent data retrieval to be correctly handled. For example, changes to st_atime or st_mtime (respectively, time of last access and time of last modification; see stat(2)) do not require flushing because they are not necessary for a subsequent data read to be handled correctly. On the other hand, a change to the file size (st_size, as made by, say, ftruncate(2)) would require a metadata flush. The aim of fdatasync() is to reduce disk activity for applications that do not require all metadata to be synchronized with the disk. http://man7.org/linux/man-pages/man2/fsync.2.html 11 SYSTEM CALL: fsync(2) - Linux manual page FUNCTIONALITY: fsync, fdatasync - synchronize a file's in-core state with storage device SYNOPSIS: #include <unistd.h> int fsync(int fd); int fdatasync(int fd); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): fsync(): Glibc 2.16 and later: No feature test macros need be defined Glibc up to and including 2.15: _BSD_SOURCE || _XOPEN_SOURCE || /* since glibc 2.8: */ _POSIX_C_SOURCE >= 200112L fdatasync(): _POSIX_C_SOURCE >= 199309L || _XOPEN_SOURCE >= 500 DESCRIPTION fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even after the system crashed or was rebooted. This includes writing through or flushing a disk cache if present.
The call blocks until the device reports that the transfer has completed. It also flushes metadata information associated with the file (see stat(2)). Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed. fdatasync() is similar to fsync(), but does not flush modified metadata unless that metadata is needed in order to allow a subsequent data retrieval to be correctly handled. For example, changes to st_atime or st_mtime (respectively, time of last access and time of last modification; see stat(2)) do not require flushing because they are not necessary for a subsequent data read to be handled correctly. On the other hand, a change to the file size (st_size, as made by, say, ftruncate(2)) would require a metadata flush. The aim of fdatasync() is to reduce disk activity for applications that do not require all metadata to be synchronized with the disk. http://man7.org/linux/man-pages/man2/msync.2.html 11 SYSTEM CALL: msync(2) - Linux manual page FUNCTIONALITY: msync - synchronize a file with a memory map SYNOPSIS: #include <sys/mman.h> int msync(void *addr, size_t length, int flags); DESCRIPTION msync() flushes changes made to the in-core copy of a file that was mapped into memory using mmap(2) back to the filesystem. Without use of this call, there is no guarantee that changes are written back before munmap(2) is called. To be more precise, the part of the file that corresponds to the memory area starting at addr and having length length is updated. The flags argument should specify exactly one of MS_ASYNC and MS_SYNC, and may additionally include the MS_INVALIDATE bit. These bits have the following meanings: MS_ASYNC Specifies that an update be scheduled, but the call returns immediately. MS_SYNC Requests an update and waits for it to complete.
MS_INVALIDATE Asks to invalidate other mappings of the same file (so that they can be updated with the fresh values just written). http://man7.org/linux/man-pages/man2/sync_file_range.2.html 11 SYSTEM CALL: sync_file_range(2) - Linux manual page FUNCTIONALITY: sync_file_range - sync a file segment with disk SYNOPSIS: #define _GNU_SOURCE /* See feature_test_macros(7) */ #include <fcntl.h> int sync_file_range(int fd, off64_t offset, off64_t nbytes, unsigned int flags); DESCRIPTION sync_file_range() permits fine control when synchronizing the open file referred to by the file descriptor fd with disk. offset is the starting byte of the file range to be synchronized. nbytes specifies the length of the range to be synchronized, in bytes; if nbytes is zero, then all bytes from offset through to the end of file are synchronized. Synchronization is in units of the system page size: offset is rounded down to a page boundary; (offset+nbytes-1) is rounded up to a page boundary. The flags bit-mask argument can include any of the following values: SYNC_FILE_RANGE_WAIT_BEFORE Wait upon write-out of all pages in the specified range that have already been submitted to the device driver for write-out before performing any write. SYNC_FILE_RANGE_WRITE Initiate write-out of all dirty pages in the specified range which are not presently submitted for write-out. Note that even this may block if you attempt to write more than the request queue size. SYNC_FILE_RANGE_WAIT_AFTER Wait upon write-out of all pages in the range after performing any write. Specifying flags as 0 is permitted, as a no-op. Warning This system call is extremely dangerous and should not be used in portable programs. None of these operations writes out the file's metadata. Therefore, unless the application is strictly performing overwrites of already-instantiated disk blocks, there are no guarantees that the data will be available after a crash. There is no user interface to know if a write is purely an overwrite.
On filesystems using copy-on-write semantics (e.g., btrfs) an overwrite of existing allocated blocks is impossible. When writing into preallocated space, many filesystems also require calls into the block allocator, which this system call does not sync out to disk. This system call does not flush disk write caches and thus does not provide any data integrity on systems with volatile disk write caches. Some details SYNC_FILE_RANGE_WAIT_BEFORE and SYNC_FILE_RANGE_WAIT_AFTER will detect any I/O errors or ENOSPC conditions and will return these to the caller. Useful combinations of the flags bits are: SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE Ensures that all pages in the specified range which were dirty when sync_file_range() was called are placed under write-out. This is a start-write-for-data-integrity operation. SYNC_FILE_RANGE_WRITE Start write-out of all dirty pages in the specified range which are not presently under write-out. This is an asynchronous flush-to-disk operation. This is not suitable for data integrity operations. SYNC_FILE_RANGE_WAIT_BEFORE (or SYNC_FILE_RANGE_WAIT_AFTER) Wait for completion of write-out of all pages in the specified range. This can be used after an earlier SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE operation to wait for completion of that operation, and obtain its result. SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER This is a write-for-data-integrity operation that will ensure that all pages in the specified range which were dirty when sync_file_range() was called are committed to disk. 
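The flag combinations listed above can be sketched as three small helpers. This is only an illustration of how the flags compose, not a recommended durability mechanism: as the warning notes, none of these calls writes metadata or flushes the disk write cache. The helper names start_writeout(), wait_writeout(), and flush_range() are hypothetical and chosen for this sketch.

```c
#define _GNU_SOURCE             /* required for sync_file_range() */
#include <fcntl.h>

/* Hypothetical helper: start write-out of the dirty pages in the range
 * (the "start-write-for-data-integrity" combination). Does not wait for
 * the newly started write-out to finish. */
int start_writeout(int fd, off64_t offset, off64_t nbytes)
{
    return sync_file_range(fd, offset, nbytes,
                           SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE);
}

/* Hypothetical helper: wait for write-out that was started earlier and
 * collect any I/O error or ENOSPC condition it produced. */
int wait_writeout(int fd, off64_t offset, off64_t nbytes)
{
    return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WAIT_BEFORE);
}

/* Hypothetical helper: the full "write-for-data-integrity" combination.
 * Remember that nbytes == 0 means "from offset to end of file". */
int flush_range(int fd, off64_t offset, off64_t nbytes)
{
    return sync_file_range(fd, offset, nbytes,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}
```

A writer that streams a large file might call start_writeout() on each chunk as it is produced and wait_writeout() on the chunk before it, keeping the amount of dirty page cache bounded. For crash safety the file would still need fsync(2) or fdatasync(2).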
http://man7.org/linux/man-pages/man2/sync.2.html 12 SYSTEM CALL: sync(2) - Linux manual page FUNCTIONALITY: sync, syncfs - commit filesystem caches to disk SYNOPSIS: #include <unistd.h> void sync(void); int syncfs(int fd); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): sync(): _XOPEN_SOURCE >= 500 || /* Since glibc 2.19: */ _DEFAULT_SOURCE || /* Glibc versions <= 2.19: */ _BSD_SOURCE syncfs(): _GNU_SOURCE DESCRIPTION sync() causes all pending modifications to file system metadata and cached file data to be written to the underlying filesystems. syncfs() is like sync(), but synchronizes just the filesystem containing the file referred to by the open file descriptor fd. http://man7.org/linux/man-pages/man2/syncfs.2.html 12 SYSTEM CALL: sync(2) - Linux manual page FUNCTIONALITY: sync, syncfs - commit filesystem caches to disk SYNOPSIS: #include <unistd.h> void sync(void); int syncfs(int fd); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): sync(): _XOPEN_SOURCE >= 500 || /* Since glibc 2.19: */ _DEFAULT_SOURCE || /* Glibc versions <= 2.19: */ _BSD_SOURCE syncfs(): _GNU_SOURCE DESCRIPTION sync() causes all pending modifications to file system metadata and cached file data to be written to the underlying filesystems. syncfs() is like sync(), but synchronizes just the filesystem containing the file referred to by the open file descriptor fd. http://man7.org/linux/man-pages/man2/io_setup.2.html 11 SYSTEM CALL: io_setup(2) - Linux manual page FUNCTIONALITY: io_setup - create an asynchronous I/O context SYNOPSIS: #include <linux/aio_abi.h> /* Defines needed types */ int io_setup(unsigned nr_events, aio_context_t *ctx_idp); Note: There is no glibc wrapper for this system call; see NOTES. DESCRIPTION The io_setup() system call creates an asynchronous I/O context suitable for concurrently processing nr_events operations. The ctx_idp argument must not point to an AIO context that already exists, and must be initialized to 0 prior to the call.
On successful creation of the AIO context, *ctx_idp is filled in with the resulting handle. http://man7.org/linux/man-pages/man2/io_destroy.2.html 11 SYSTEM CALL: io_destroy(2) - Linux manual page FUNCTIONALITY: io_destroy - destroy an asynchronous I/O context SYNOPSIS: #include <linux/aio_abi.h> /* Defines needed types */ int io_destroy(aio_context_t ctx_id); Note: There is no glibc wrapper for this system call; see NOTES. DESCRIPTION The io_destroy() system call will attempt to cancel all outstanding asynchronous I/O operations against ctx_id, will block on the completion of all operations that could not be canceled, and will destroy the ctx_id. http://man7.org/linux/man-pages/man2/io_submit.2.html 11 SYSTEM CALL: io_submit(2) - Linux manual page FUNCTIONALITY: io_submit - submit asynchronous I/O blocks for processing SYNOPSIS: #include <linux/aio_abi.h> /* Defines needed types */ int io_submit(aio_context_t ctx_id, long nr, struct iocb **iocbpp); Note: There is no glibc wrapper for this system call; see NOTES. DESCRIPTION The io_submit() system call queues nr I/O request blocks for processing in the AIO context ctx_id. The iocbpp argument should be an array of nr AIO control blocks, which will be submitted to context ctx_id. http://man7.org/linux/man-pages/man2/io_cancel.2.html 11 SYSTEM CALL: io_cancel(2) - Linux manual page FUNCTIONALITY: io_cancel - cancel an outstanding asynchronous I/O operation SYNOPSIS: #include <linux/aio_abi.h> /* Defines needed types */ int io_cancel(aio_context_t ctx_id, struct iocb *iocb, struct io_event *result); Note: There is no glibc wrapper for this system call; see NOTES. DESCRIPTION The io_cancel() system call attempts to cancel an asynchronous I/O operation previously submitted with io_submit(2). The iocb argument describes the operation to be canceled and the ctx_id argument is the AIO context to which the operation was submitted.
If the operation is successfully canceled, the event will be copied into the memory pointed to by result without being placed into the completion queue. http://man7.org/linux/man-pages/man2/io_getevents.2.html 12 SYSTEM CALL: io_getevents(2) - Linux manual page FUNCTIONALITY: io_getevents - read asynchronous I/O events from the completion queue SYNOPSIS: #include <linux/aio_abi.h> /* Defines needed types */ #include <linux/time.h> /* Defines 'struct timespec' */ int io_getevents(aio_context_t ctx_id, long min_nr, long nr, struct io_event *events, struct timespec *timeout); Note: There is no glibc wrapper for this system call; see NOTES. DESCRIPTION The io_getevents() system call attempts to read at least min_nr events and up to nr events from the completion queue of the AIO context specified by ctx_id. The timeout argument specifies the amount of time to wait for events, and is specified as a relative timeout in a structure of the following form: struct timespec { time_t tv_sec; /* seconds */ long tv_nsec; /* nanoseconds [0 .. 999999999] */ }; The specified time will be rounded up to the system clock granularity and is guaranteed not to expire early. Specifying timeout as NULL means block indefinitely until at least min_nr events have been obtained.
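Because glibc provides no wrappers for the io_*() calls, a program has to invoke them through syscall(2). The following is a minimal sketch of one complete cycle (io_setup, io_submit, io_getevents, io_destroy) that reads from a file with a single request. The sys_io_*() wrappers and aio_read_file() are hypothetical names introduced for this example; the iocb field names and IOCB_CMD_PREAD come from <linux/aio_abi.h>.

```c
#include <errno.h>
#include <fcntl.h>
#include <linux/aio_abi.h>      /* aio_context_t, struct iocb, struct io_event */
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical thin wrappers around the raw system calls. */
static long sys_io_setup(unsigned nr_events, aio_context_t *ctx_idp)
{
    return syscall(SYS_io_setup, nr_events, ctx_idp);
}

static long sys_io_destroy(aio_context_t ctx_id)
{
    return syscall(SYS_io_destroy, ctx_id);
}

static long sys_io_submit(aio_context_t ctx_id, long nr, struct iocb **iocbpp)
{
    return syscall(SYS_io_submit, ctx_id, nr, iocbpp);
}

static long sys_io_getevents(aio_context_t ctx_id, long min_nr, long nr,
                             struct io_event *events, struct timespec *timeout)
{
    return syscall(SYS_io_getevents, ctx_id, min_nr, nr, events, timeout);
}

/* Read up to len bytes from path at offset using one AIO request.
 * Returns the number of bytes read, or -1 with errno set. */
ssize_t aio_read_file(const char *path, void *buf, size_t len, off_t offset)
{
    aio_context_t ctx = 0;              /* must be 0 before io_setup() */
    struct iocb cb;
    struct iocb *list[1] = { &cb };
    struct io_event ev;
    ssize_t result = -1;
    int saved_errno;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    if (sys_io_setup(1, &ctx) == 0) {
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = (uint32_t)fd;
        cb.aio_lio_opcode = IOCB_CMD_PREAD;
        cb.aio_buf = (uint64_t)(uintptr_t)buf;
        cb.aio_nbytes = len;
        cb.aio_offset = offset;

        /* Queue one request, then block (NULL timeout) for its completion. */
        if (sys_io_submit(ctx, 1, list) == 1 &&
            sys_io_getevents(ctx, 1, 1, &ev, NULL) == 1) {
            if (ev.res >= 0)
                result = (ssize_t)ev.res;   /* bytes transferred */
            else
                errno = (int)-ev.res;       /* res carries a negated errno */
        }
        saved_errno = errno;
        sys_io_destroy(ctx);
        errno = saved_errno;
    }

    saved_errno = errno;
    close(fd);
    errno = saved_errno;
    return result;
}
```

A real application would normally keep one context alive and submit many requests to it; creating and destroying a context per read, as done here, only keeps the sketch short.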
http://man7.org/linux/man-pages/man2/select.2.html 13 SYSTEM CALL: select(2) - Linux manual page FUNCTIONALITY: select, pselect, FD_CLR, FD_ISSET, FD_SET, FD_ZERO - synchronous I/O multiplexing SYNOPSIS: /* According to POSIX.1-2001, POSIX.1-2008 */ #include <sys/select.h> /* According to earlier standards */ #include <sys/time.h> #include <sys/types.h> #include <unistd.h> int select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, struct timeval *timeout); void FD_CLR(int fd, fd_set *set); int FD_ISSET(int fd, fd_set *set); void FD_SET(int fd, fd_set *set); void FD_ZERO(fd_set *set); #include <sys/select.h> int pselect(int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, const struct timespec *timeout, const sigset_t *sigmask); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): pselect(): _POSIX_C_SOURCE >= 200112L DESCRIPTION select() and pselect() allow a program to monitor multiple file descriptors, waiting until one or more of the file descriptors become "ready" for some class of I/O operation (e.g., input possible). A file descriptor is considered ready if it is possible to perform a corresponding I/O operation (e.g., read(2) without blocking, or a sufficiently small write(2)). select() can monitor only file descriptor numbers that are less than FD_SETSIZE; poll(2) does not have this limitation. See BUGS. The operation of select() and pselect() is identical, other than these three differences: (i) select() uses a timeout that is a struct timeval (with seconds and microseconds), while pselect() uses a struct timespec (with seconds and nanoseconds). (ii) select() may update the timeout argument to indicate how much time was left. pselect() does not change this argument. (iii) select() has no sigmask argument, and behaves as pselect() called with NULL sigmask. Three independent sets of file descriptors are watched.
Those listed in readfds will be watched to see if characters become available for reading (more precisely, to see if a read will not block; in particular, a file descriptor is also ready on end-of-file), those in writefds will be watched to see if space is available for write (though a large write may still block), and those in exceptfds will be watched for exceptions. On exit, the sets are modified in place to indicate which file descriptors actually changed status. Each of the three file descriptor sets may be specified as NULL if no file descriptors are to be watched for the corresponding class of events. Four macros are provided to manipulate the sets. FD_ZERO() clears a set. FD_SET() and FD_CLR() respectively add and remove a given file descriptor from a set. FD_ISSET() tests to see if a file descriptor is part of the set; this is useful after select() returns. nfds is the highest-numbered file descriptor in any of the three sets, plus 1. The timeout argument specifies the interval that select() should block waiting for a file descriptor to become ready. The call will block until either: * a file descriptor becomes ready; * the call is interrupted by a signal handler; or * the timeout expires. Note that the timeout interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the blocking interval may overrun by a small amount. If both fields of the timeval structure are zero, then select() returns immediately. (This is useful for polling.) If timeout is NULL (no timeout), select() can block indefinitely. sigmask is a pointer to a signal mask (see sigprocmask(2)); if it is not NULL, then pselect() first replaces the current signal mask by the one pointed to by sigmask, then does the "select" function, and then restores the original signal mask. 
Other than the difference in the precision of the timeout argument, the following pselect() call: ready = pselect(nfds, &readfds, &writefds, &exceptfds, timeout, &sigmask); is equivalent to atomically executing the following calls: sigset_t origmask; pthread_sigmask(SIG_SETMASK, &sigmask, &origmask); ready = select(nfds, &readfds, &writefds, &exceptfds, timeout); pthread_sigmask(SIG_SETMASK, &origmask, NULL); The reason that pselect() is needed is that if one wants to wait for either a signal or for a file descriptor to become ready, then an atomic test is needed to prevent race conditions. (Suppose the signal handler sets a global flag and returns. Then a test of this global flag followed by a call of select() could hang indefinitely if the signal arrived just after the test but just before the call. By contrast, pselect() allows one to first block signals, handle the signals that have come in, then call pselect() with the desired sigmask, avoiding the race.) The timeout The time structures involved are defined in <sys/time.h> and look like struct timeval { long tv_sec; /* seconds */ long tv_usec; /* microseconds */ }; and struct timespec { long tv_sec; /* seconds */ long tv_nsec; /* nanoseconds */ }; (However, see below on the POSIX.1 versions.) Some code calls select() with all three sets empty, nfds zero, and a non-NULL timeout as a fairly portable way to sleep with subsecond precision. On Linux, select() modifies timeout to reflect the amount of time not slept; most other implementations do not do this. (POSIX.1 permits either behavior.) This causes problems both when Linux code which reads timeout is ported to other operating systems, and when code is ported to Linux that reuses a struct timeval for multiple select()s in a loop without reinitializing it. Consider timeout to be undefined after select() returns.
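The advice to treat both the descriptor sets and timeout as undefined after select() returns can be illustrated with a short sketch. The helper read_with_timeout() is a hypothetical name; it rebuilds the fd_set and the struct timeval before every call, and it reports a timeout by returning -1 with errno set to ETIMEDOUT, which is a convention chosen for this example only.

```c
#include <errno.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_ms for fd to become readable,
 * then read from it. Returns the read(2) result, or -1 with errno set
 * (ETIMEDOUT if nothing became readable in time). */
ssize_t read_with_timeout(int fd, void *buf, size_t len, long timeout_ms)
{
    if (fd < 0 || fd >= FD_SETSIZE) {   /* select() cannot watch this fd */
        errno = EINVAL;
        return -1;
    }

    for (;;) {
        fd_set readfds;
        struct timeval tv;

        /* Rebuild both inputs on every iteration: select() overwrites
         * the set, and on Linux it also modifies the timeout. */
        FD_ZERO(&readfds);
        FD_SET(fd, &readfds);
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;

        int ready = select(fd + 1, &readfds, NULL, NULL, &tv);
        if (ready < 0) {
            if (errno == EINTR)
                continue;               /* retry with a fresh full timeout */
            return -1;
        }
        if (ready == 0) {
            errno = ETIMEDOUT;
            return -1;
        }
        if (FD_ISSET(fd, &readfds))
            return read(fd, buf, len);  /* returns 0 at end-of-file */
    }
}
```

Restarting with the full timeout after EINTR keeps the sketch simple; code that needs a hard deadline would track the elapsed time itself instead of relying on the value select() leaves in tv.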
http://man7.org/linux/man-pages/man2/pselect6.2.html 13 SYSTEM CALL: select(2) - Linux manual page FUNCTIONALITY: select, pselect, FD_CLR, FD_ISSET, FD_SET, FD_ZERO - synchronous I/O multiplexing SYNOPSIS: /* According to POSIX.1-2001, POSIX.1-2008 */ #include <sys/select.h> /* According to earlier standards */ #include <sys/time.h> #include <sys/types.h> #include <unistd.h> int select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, struct timeval *timeout); void FD_CLR(int fd, fd_set *set); int FD_ISSET(int fd, fd_set *set); void FD_SET(int fd, fd_set *set); void FD_ZERO(fd_set *set); #include <sys/select.h> int pselect(int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, const struct timespec *timeout, const sigset_t *sigmask); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): pselect(): _POSIX_C_SOURCE >= 200112L DESCRIPTION select() and pselect() allow a program to monitor multiple file descriptors, waiting until one or more of the file descriptors become "ready" for some class of I/O operation (e.g., input possible). A file descriptor is considered ready if it is possible to perform a corresponding I/O operation (e.g., read(2) without blocking, or a sufficiently small write(2)). select() can monitor only file descriptor numbers that are less than FD_SETSIZE; poll(2) does not have this limitation. See BUGS. The operation of select() and pselect() is identical, other than these three differences: (i) select() uses a timeout that is a struct timeval (with seconds and microseconds), while pselect() uses a struct timespec (with seconds and nanoseconds). (ii) select() may update the timeout argument to indicate how much time was left. pselect() does not change this argument. (iii) select() has no sigmask argument, and behaves as pselect() called with NULL sigmask. Three independent sets of file descriptors are watched.
Those listed in readfds will be watched to see if characters become available for reading (more precisely, to see if a read will not block; in particular, a file descriptor is also ready on end-of-file), those in writefds will be watched to see if space is available for write (though a large write may still block), and those in exceptfds will be watched for exceptions. On exit, the sets are modified in place to indicate which file descriptors actually changed status. Each of the three file descriptor sets may be specified as NULL if no file descriptors are to be watched for the corresponding class of events. Four macros are provided to manipulate the sets. FD_ZERO() clears a set. FD_SET() and FD_CLR() respectively add and remove a given file descriptor from a set. FD_ISSET() tests to see if a file descriptor is part of the set; this is useful after select() returns. nfds is the highest-numbered file descriptor in any of the three sets, plus 1. The timeout argument specifies the interval that select() should block waiting for a file descriptor to become ready. The call will block until either: * a file descriptor becomes ready; * the call is interrupted by a signal handler; or * the timeout expires. Note that the timeout interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the blocking interval may overrun by a small amount. If both fields of the timeval structure are zero, then select() returns immediately. (This is useful for polling.) If timeout is NULL (no timeout), select() can block indefinitely. sigmask is a pointer to a signal mask (see sigprocmask(2)); if it is not NULL, then pselect() first replaces the current signal mask by the one pointed to by sigmask, then does the "select" function, and then restores the original signal mask. 
Other than the difference in the precision of the timeout argument, the following pselect() call: ready = pselect(nfds, &readfds, &writefds, &exceptfds, timeout, &sigmask); is equivalent to atomically executing the following calls: sigset_t origmask; pthread_sigmask(SIG_SETMASK, &sigmask, &origmask); ready = select(nfds, &readfds, &writefds, &exceptfds, timeout); pthread_sigmask(SIG_SETMASK, &origmask, NULL); The reason that pselect() is needed is that if one wants to wait for either a signal or for a file descriptor to become ready, then an atomic test is needed to prevent race conditions. (Suppose the signal handler sets a global flag and returns. Then a test of this global flag followed by a call of select() could hang indefinitely if the signal arrived just after the test but just before the call. By contrast, pselect() allows one to first block signals, handle the signals that have come in, then call pselect() with the desired sigmask, avoiding the race.) The timeout The time structures involved are defined in <sys/time.h> and look like struct timeval { long tv_sec; /* seconds */ long tv_usec; /* microseconds */ }; and struct timespec { long tv_sec; /* seconds */ long tv_nsec; /* nanoseconds */ }; (However, see below on the POSIX.1 versions.) Some code calls select() with all three sets empty, nfds zero, and a non-NULL timeout as a fairly portable way to sleep with subsecond precision. On Linux, select() modifies timeout to reflect the amount of time not slept; most other implementations do not do this. (POSIX.1 permits either behavior.) This causes problems both when Linux code which reads timeout is ported to other operating systems, and when code is ported to Linux that reuses a struct timeval for multiple select()s in a loop without reinitializing it. Consider timeout to be undefined after select() returns.
http://man7.org/linux/man-pages/man2/poll.2.html 12 SYSTEM CALL: poll(2) - Linux manual page FUNCTIONALITY: poll, ppoll - wait for some event on a file descriptor SYNOPSIS: #include <poll.h> int poll(struct pollfd *fds, nfds_t nfds, int timeout); #define _GNU_SOURCE /* See feature_test_macros(7) */ #include <signal.h> #include <poll.h> int ppoll(struct pollfd *fds, nfds_t nfds, const struct timespec *tmo_p, const sigset_t *sigmask); DESCRIPTION poll() performs a similar task to select(2): it waits for one of a set of file descriptors to become ready to perform I/O. The set of file descriptors to be monitored is specified in the fds argument, which is an array of structures of the following form: struct pollfd { int fd; /* file descriptor */ short events; /* requested events */ short revents; /* returned events */ }; The caller should specify the number of items in the fds array in nfds. The field fd contains a file descriptor for an open file. If this field is negative, then the corresponding events field is ignored and the revents field returns zero. (This provides an easy way of ignoring a file descriptor for a single poll() call: simply negate the fd field. Note, however, that this technique can't be used to ignore file descriptor 0.) The field events is an input parameter, a bit mask specifying the events the application is interested in for the file descriptor fd. This field may be specified as zero, in which case the only events that can be returned in revents are POLLHUP, POLLERR, and POLLNVAL (see below). The field revents is an output parameter, filled by the kernel with the events that actually occurred. The bits returned in revents can include any of those specified in events, or one of the values POLLERR, POLLHUP, or POLLNVAL. (These three bits are meaningless in the events field, and will be set in the revents field whenever the corresponding condition is true.)
If none of the events requested (and no error) has occurred for any of the file descriptors, then poll() blocks until one of the events occurs. The timeout argument specifies the number of milliseconds that poll() should block waiting for a file descriptor to become ready. The call will block until either: * a file descriptor becomes ready; * the call is interrupted by a signal handler; or * the timeout expires. Note that the timeout interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the blocking interval may overrun by a small amount. Specifying a negative value in timeout means an infinite timeout. Specifying a timeout of zero causes poll() to return immediately, even if no file descriptors are ready. The bits that may be set/returned in events and revents are defined in <poll.h>: POLLIN There is data to read. POLLPRI There is urgent data to read (e.g., out-of-band data on TCP socket; pseudoterminal master in packet mode has seen state change in slave). POLLOUT Writing is now possible, though a write larger than the available space in a socket or pipe will still block (unless O_NONBLOCK is set). POLLRDHUP (since Linux 2.6.17) Stream socket peer closed connection, or shut down writing half of connection. The _GNU_SOURCE feature test macro must be defined (before including any header files) in order to obtain this definition. POLLERR Error condition (only returned in revents; ignored in events). POLLHUP Hang up (only returned in revents; ignored in events). Note that when reading from a channel such as a pipe or a stream socket, this event merely indicates that the peer closed its end of the channel. Subsequent reads from the channel will return 0 (end of file) only after all outstanding data in the channel has been consumed. POLLNVAL Invalid request: fd not open (only returned in revents; ignored in events).
When compiling with _XOPEN_SOURCE defined, one also has the following, which convey no further information beyond the bits listed above: POLLRDNORM Equivalent to POLLIN. POLLRDBAND Priority band data can be read (generally unused on Linux). POLLWRNORM Equivalent to POLLOUT. POLLWRBAND Priority data may be written. Linux also knows about, but does not use POLLMSG. ppoll() The relationship between poll() and ppoll() is analogous to the relationship between select(2) and pselect(2): like pselect(2), ppoll() allows an application to safely wait until either a file descriptor becomes ready or until a signal is caught. Other than the difference in the precision of the timeout argument, the following ppoll() call: ready = ppoll(&fds, nfds, tmo_p, &sigmask); is equivalent to atomically executing the following calls: sigset_t origmask; int timeout; timeout = (tmo_p == NULL) ? -1 : (tmo_p->tv_sec * 1000 + tmo_p->tv_nsec / 1000000); pthread_sigmask(SIG_SETMASK, &sigmask, &origmask); ready = poll(&fds, nfds, timeout); pthread_sigmask(SIG_SETMASK, &origmask, NULL); See the description of pselect(2) for an explanation of why ppoll() is necessary. If the sigmask argument is specified as NULL, then no signal mask manipulation is performed (and thus ppoll() differs from poll() only in the precision of the timeout argument). The tmo_p argument specifies an upper limit on the amount of time that ppoll() will block. This argument is a pointer to a structure of the following form: struct timespec { long tv_sec; /* seconds */ long tv_nsec; /* nanoseconds */ }; If tmo_p is specified as NULL, then ppoll() can block indefinitely. 
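The interaction between POLLIN and POLLHUP described for poll() can be sketched as a loop that keeps reading until read(2) returns 0, rather than stopping as soon as POLLHUP appears. The helper read_until_eof() is a hypothetical name, and mapping POLLERR and POLLNVAL to EIO and EBADF is a convention chosen for this example.

```c
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: collect data from fd into buf until end-of-file or
 * until cap bytes have been stored. Each wait is limited to timeout_ms.
 * Returns the number of bytes stored, or -1 with errno set. */
ssize_t read_until_eof(int fd, char *buf, size_t cap, int timeout_ms)
{
    size_t total = 0;

    while (total < cap) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };

        int ready = poll(&pfd, 1, timeout_ms);
        if (ready < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (ready == 0) {
            errno = ETIMEDOUT;
            return -1;
        }
        if (pfd.revents & POLLNVAL) {   /* fd is not open */
            errno = EBADF;
            return -1;
        }
        if (pfd.revents & POLLERR) {
            errno = EIO;
            return -1;
        }

        /* POLLIN and/or POLLHUP: buffered data is returned first, and
         * read() yields 0 only once the channel has been drained. */
        ssize_t n = read(fd, buf + total, cap - total);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            break;                      /* end of file */
        total += (size_t)n;
    }
    return (ssize_t)total;
}
```

For a pipe whose writer has already closed its end, the first poll() typically reports POLLIN together with POLLHUP, and later iterations report POLLHUP alone until read() returns 0.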
http://man7.org/linux/man-pages/man2/ppoll.2.html 12 SYSTEM CALL: poll(2) - Linux manual page FUNCTIONALITY: poll, ppoll - wait for some event on a file descriptor SYNOPSIS: #include <poll.h> int poll(struct pollfd *fds, nfds_t nfds, int timeout); #define _GNU_SOURCE /* See feature_test_macros(7) */ #include <signal.h> #include <poll.h> int ppoll(struct pollfd *fds, nfds_t nfds, const struct timespec *tmo_p, const sigset_t *sigmask); DESCRIPTION poll() performs a similar task to select(2): it waits for one of a set of file descriptors to become ready to perform I/O. The set of file descriptors to be monitored is specified in the fds argument, which is an array of structures of the following form: struct pollfd { int fd; /* file descriptor */ short events; /* requested events */ short revents; /* returned events */ }; The caller should specify the number of items in the fds array in nfds. The field fd contains a file descriptor for an open file. If this field is negative, then the corresponding events field is ignored and the revents field returns zero. (This provides an easy way of ignoring a file descriptor for a single poll() call: simply negate the fd field. Note, however, that this technique can't be used to ignore file descriptor 0.) The field events is an input parameter, a bit mask specifying the events the application is interested in for the file descriptor fd. This field may be specified as zero, in which case the only events that can be returned in revents are POLLHUP, POLLERR, and POLLNVAL (see below). The field revents is an output parameter, filled by the kernel with the events that actually occurred. The bits returned in revents can include any of those specified in events, or one of the values POLLERR, POLLHUP, or POLLNVAL. (These three bits are meaningless in the events field, and will be set in the revents field whenever the corresponding condition is true.)
If none of the events requested (and no error) has occurred for any of the file descriptors, then poll() blocks until one of the events occurs. The timeout argument specifies the number of milliseconds that poll() should block waiting for a file descriptor to become ready. The call will block until either: * a file descriptor becomes ready; * the call is interrupted by a signal handler; or * the timeout expires. Note that the timeout interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the blocking interval may overrun by a small amount. Specifying a negative value in timeout means an infinite timeout. Specifying a timeout of zero causes poll() to return immediately, even if no file descriptors are ready. The bits that may be set/returned in events and revents are defined in <poll.h>: POLLIN There is data to read. POLLPRI There is urgent data to read (e.g., out-of-band data on TCP socket; pseudoterminal master in packet mode has seen state change in slave). POLLOUT Writing is now possible, though a write larger than the available space in a socket or pipe will still block (unless O_NONBLOCK is set). POLLRDHUP (since Linux 2.6.17) Stream socket peer closed connection, or shut down writing half of connection. The _GNU_SOURCE feature test macro must be defined (before including any header files) in order to obtain this definition. POLLERR Error condition (only returned in revents; ignored in events). POLLHUP Hang up (only returned in revents; ignored in events). Note that when reading from a channel such as a pipe or a stream socket, this event merely indicates that the peer closed its end of the channel. Subsequent reads from the channel will return 0 (end of file) only after all outstanding data in the channel has been consumed. POLLNVAL Invalid request: fd not open (only returned in revents; ignored in events).
When compiling with _XOPEN_SOURCE defined, one also has the following, which convey no further information beyond the bits listed above: POLLRDNORM Equivalent to POLLIN. POLLRDBAND Priority band data can be read (generally unused on Linux). POLLWRNORM Equivalent to POLLOUT. POLLWRBAND Priority data may be written. Linux also knows about, but does not use POLLMSG. ppoll() The relationship between poll() and ppoll() is analogous to the relationship between select(2) and pselect(2): like pselect(2), ppoll() allows an application to safely wait until either a file descriptor becomes ready or until a signal is caught. Other than the difference in the precision of the timeout argument, the following ppoll() call: ready = ppoll(&fds, nfds, tmo_p, &sigmask); is equivalent to atomically executing the following calls: sigset_t origmask; int timeout; timeout = (tmo_p == NULL) ? -1 : (tmo_p->tv_sec * 1000 + tmo_p->tv_nsec / 1000000); pthread_sigmask(SIG_SETMASK, &sigmask, &origmask); ready = poll(&fds, nfds, timeout); pthread_sigmask(SIG_SETMASK, &origmask, NULL); See the description of pselect(2) for an explanation of why ppoll() is necessary. If the sigmask argument is specified as NULL, then no signal mask manipulation is performed (and thus ppoll() differs from poll() only in the precision of the timeout argument). The tmo_p argument specifies an upper limit on the amount of time that ppoll() will block. This argument is a pointer to a structure of the following form: struct timespec { long tv_sec; /* seconds */ long tv_nsec; /* nanoseconds */ }; If tmo_p is specified as NULL, then ppoll() can block indefinitely. http://man7.org/linux/man-pages/man2/epoll_create.2.html 11 SYSTEM CALL: epoll_create(2) - Linux manual page FUNCTIONALITY: epoll_create, epoll_create1 - open an epoll file descriptor SYNOPSIS: #include int epoll_create(int size); int epoll_create1(int flags); DESCRIPTION epoll_create() creates an epoll(7) instance. 
Since Linux 2.6.8, the size argument is ignored, but must be greater than zero; see NOTES below. epoll_create() returns a file descriptor referring to the new epoll instance. This file descriptor is used for all the subsequent calls to the epoll interface. When no longer required, the file descriptor returned by epoll_create() should be closed by using close(2). When all file descriptors referring to an epoll instance have been closed, the kernel destroys the instance and releases the associated resources for reuse. epoll_create1() If flags is 0, then, other than the fact that the obsolete size argument is dropped, epoll_create1() is the same as epoll_create(). The following value can be included in flags to obtain different behavior: EPOLL_CLOEXEC Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor. See the description of the O_CLOEXEC flag in open(2) for reasons why this may be useful. http://man7.org/linux/man-pages/man2/epoll_create1.2.html 11 SYSTEM CALL: epoll_create(2) - Linux manual page FUNCTIONALITY: epoll_create, epoll_create1 - open an epoll file descriptor SYNOPSIS: #include int epoll_create(int size); int epoll_create1(int flags); DESCRIPTION epoll_create() creates an epoll(7) instance. Since Linux 2.6.8, the size argument is ignored, but must be greater than zero; see NOTES below. epoll_create() returns a file descriptor referring to the new epoll instance. This file descriptor is used for all the subsequent calls to the epoll interface. When no longer required, the file descriptor returned by epoll_create() should be closed by using close(2). When all file descriptors referring to an epoll instance have been closed, the kernel destroys the instance and releases the associated resources for reuse. epoll_create1() If flags is 0, then, other than the fact that the obsolete size argument is dropped, epoll_create1() is the same as epoll_create(). 
The following value can be included in flags to obtain different behavior: EPOLL_CLOEXEC Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor. See the description of the O_CLOEXEC flag in open(2) for reasons why this may be useful. http://man7.org/linux/man-pages/man2/epoll_ctl.2.html 12 SYSTEM CALL: epoll_ctl(2) - Linux manual page FUNCTIONALITY: epoll_ctl - control interface for an epoll file descriptor SYNOPSIS: #include int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event); DESCRIPTION This system call performs control operations on the epoll(7) instance referred to by the file descriptor epfd. It requests that the operation op be performed for the target file descriptor, fd. Valid values for the op argument are: EPOLL_CTL_ADD Register the target file descriptor fd on the epoll instance referred to by the file descriptor epfd and associate the event event with the internal file linked to fd. EPOLL_CTL_MOD Change the event event associated with the target file descriptor fd. EPOLL_CTL_DEL Remove (deregister) the target file descriptor fd from the epoll instance referred to by epfd. The event is ignored and can be NULL (but see BUGS below). The event argument describes the object linked to the file descriptor fd. The struct epoll_event is defined as: typedef union epoll_data { void *ptr; int fd; uint32_t u32; uint64_t u64; } epoll_data_t; struct epoll_event { uint32_t events; /* Epoll events */ epoll_data_t data; /* User data variable */ }; The events member is a bit mask composed using the following available event types: EPOLLIN The associated file is available for read(2) operations. EPOLLOUT The associated file is available for write(2) operations. EPOLLRDHUP (since Linux 2.6.17) Stream socket peer closed connection, or shut down writing half of connection. (This flag is especially useful for writing simple code to detect peer shutdown when using Edge Triggered monitoring.) 
EPOLLPRI There is urgent data available for read(2) operations. EPOLLERR Error condition happened on the associated file descriptor. epoll_wait(2) will always wait for this event; it is not necessary to set it in events. EPOLLHUP Hang up happened on the associated file descriptor. epoll_wait(2) will always wait for this event; it is not necessary to set it in events. Note that when reading from a channel such as a pipe or a stream socket, this event merely indicates that the peer closed its end of the channel. Subsequent reads from the channel will return 0 (end of file) only after all outstanding data in the channel has been consumed. EPOLLET Sets the Edge Triggered behavior for the associated file descriptor. The default behavior for epoll is Level Triggered. See epoll(7) for more detailed information about Edge and Level Triggered event distribution architectures. EPOLLONESHOT (since Linux 2.6.2) Sets the one-shot behavior for the associated file descriptor. This means that after an event is pulled out with epoll_wait(2) the associated file descriptor is internally disabled and no other events will be reported by the epoll interface. The user must call epoll_ctl() with EPOLL_CTL_MOD to rearm the file descriptor with a new event mask. EPOLLWAKEUP (since Linux 3.5) If EPOLLONESHOT and EPOLLET are clear and the process has the CAP_BLOCK_SUSPEND capability, ensure that the system does not enter "suspend" or "hibernate" while this event is pending or being processed. The event is considered as being "processed" from the time when it is returned by a call to epoll_wait(2) until the next call to epoll_wait(2) on the same epoll(7) file descriptor, the closure of that file descriptor, the removal of the event file descriptor with EPOLL_CTL_DEL, or the clearing of EPOLLWAKEUP for the event file descriptor with EPOLL_CTL_MOD. See also BUGS. 
EPOLLEXCLUSIVE (since Linux 4.5) Sets an exclusive wakeup mode for the epoll file descriptor that is being attached to the target file descriptor, fd. When a wakeup event occurs and multiple epoll file descriptors are attached to the same target file using EPOLLEXCLUSIVE, one or more of the epoll file descriptors will receive an event with epoll_wait(2). The default in this scenario (when EPOLLEXCLUSIVE is not set) is for all epoll file descriptors to receive an event. EPOLLEXCLUSIVE is thus useful for avoiding thundering herd problems in certain scenarios. If the same file descriptor is in multiple epoll instances, some with the EPOLLEXCLUSIVE flag, and others without, then events will be provided to all epoll instances that did not specify EPOLLEXCLUSIVE, and at least one of the epoll instances that did specify EPOLLEXCLUSIVE. The following values may be specified in conjunction with EPOLLEXCLUSIVE: EPOLLIN, EPOLLOUT, EPOLLWAKEUP, and EPOLLET. EPOLLHUP and EPOLLERR can also be specified, but this is not required: as usual, these events are always reported if they occur, regardless of whether they are specified in events. Attempts to specify other values in events yield an error. EPOLLEXCLUSIVE may be used only in an EPOLL_CTL_ADD operation; attempts to employ it with EPOLL_CTL_MOD yield an error. If EPOLLEXCLUSIVE has been set using epoll_ctl(2), then a subsequent EPOLL_CTL_MOD on the same epfd, fd pair yields an error. A call to epoll_ctl(2) that specifies EPOLLEXCLUSIVE in events and specifies the target file descriptor fd as an epoll instance will likewise fail. The error in all of these cases is EINVAL.
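The epoll_create1() and epoll_ctl() calls described above might be combined as in the following minimal sketch. The helper names watch_readable() and wait_one() are hypothetical and exist only for this example; storing the descriptor in data.fd is one common convention, not a requirement of the interface. The sketch also uses epoll_wait(2) to collect the event.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: create an epoll instance and register fd for
 * level-triggered EPOLLIN events. Returns the epoll file descriptor,
 * or -1 with errno set. The caller closes the result with close(2). */
int watch_readable(int fd)
{
    int epfd = epoll_create1(EPOLL_CLOEXEC);
    if (epfd == -1)
        return -1;

    /* data.fd is handed back unchanged by epoll_wait(2). */
    struct epoll_event ev = { .events = EPOLLIN, .data = { .fd = fd } };

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1) {
        int saved = errno;          /* close() may overwrite errno */
        close(epfd);
        errno = saved;
        return -1;
    }
    return epfd;
}

/* Hypothetical helper: wait up to timeout_ms milliseconds for one event.
 * Returns 1 and stores the ready descriptor in *ready_fd, 0 on timeout,
 * or -1 on error (including EINTR). */
int wait_one(int epfd, int timeout_ms, int *ready_fd)
{
    struct epoll_event ev;
    int n = epoll_wait(epfd, &ev, 1, timeout_ms);

    if (n == 1)
        *ready_fd = ev.data.fd;
    return n;
}
```

Because EPOLLERR and EPOLLHUP are always reported, a caller of wait_one() that needs to distinguish them from readable data would have to inspect ev.events as well; this sketch omits that for brevity.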
http://man7.org/linux/man-pages/man2/epoll_wait.2.html 12 SYSTEM CALL: epoll_wait(2) - Linux manual page FUNCTIONALITY: epoll_wait, epoll_pwait - wait for an I/O event on an epoll file descriptor SYNOPSIS: #include int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout); int epoll_pwait(int epfd, struct epoll_event *events, int maxevents, int timeout, const sigset_t *sigmask); DESCRIPTION The epoll_wait() system call waits for events on the epoll(7) instance referred to by the file descriptor epfd. The memory area pointed to by events will contain the events that will be available for the caller. Up to maxevents are returned by epoll_wait(). The maxevents argument must be greater than zero. The timeout argument specifies the number of milliseconds that epoll_wait() will block. The call will block until either: * a file descriptor delivers an event; * the call is interrupted by a signal handler; or * the timeout expires. Note that the timeout interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the blocking interval may overrun by a small amount. Specifying a timeout of -1 causes epoll_wait() to block indefinitely, while specifying a timeout equal to zero causes epoll_wait() to return immediately, even if no events are available. The struct epoll_event is defined as: typedef union epoll_data { void *ptr; int fd; uint32_t u32; uint64_t u64; } epoll_data_t; struct epoll_event { uint32_t events; /* Epoll events */ epoll_data_t data; /* User data variable */ }; The data of each returned structure will contain the same data the user set with an epoll_ctl(2) (EPOLL_CTL_ADD, EPOLL_CTL_MOD) while the events member will contain the returned event bit field.
epoll_pwait() The relationship between epoll_wait() and epoll_pwait() is analogous to the relationship between select(2) and pselect(2): like pselect(2), epoll_pwait() allows an application to safely wait until either a file descriptor becomes ready or until a signal is caught. The following epoll_pwait() call: ready = epoll_pwait(epfd, &events, maxevents, timeout, &sigmask); is equivalent to atomically executing the following calls: sigset_t origmask; pthread_sigmask(SIG_SETMASK, &sigmask, &origmask); ready = epoll_wait(epfd, &events, maxevents, timeout); pthread_sigmask(SIG_SETMASK, &origmask, NULL); The sigmask argument may be specified as NULL, in which case epoll_pwait() is equivalent to epoll_wait(). http://man7.org/linux/man-pages/man2/epoll_pwait.2.html 12 SYSTEM CALL: epoll_wait(2) - Linux manual page FUNCTIONALITY: epoll_wait, epoll_pwait - wait for an I/O event on an epoll file descriptor SYNOPSIS: #include int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout); int epoll_pwait(int epfd, struct epoll_event *events, int maxevents, int timeout, const sigset_t *sigmask); DESCRIPTION The epoll_wait() system call waits for events on the epoll(7) instance referred to by the file descriptor epfd. The memory area pointed to by events will contain the events that will be available for the caller. Up to maxevents are returned by epoll_wait(). The maxevents argument must be greater than zero. The timeout argument specifies the number of milliseconds that epoll_wait() will block. The call will block until either: * a file descriptor delivers an event; * the call is interrupted by a signal handler; or * the timeout expires. Note that the timeout interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the blocking interval may overrun by a small amount. 
Specifying a timeout of -1 causes epoll_wait() to block indefinitely, while specifying a timeout equal to zero causes epoll_wait() to return immediately, even if no events are available. The struct epoll_event is defined as: typedef union epoll_data { void *ptr; int fd; uint32_t u32; uint64_t u64; } epoll_data_t; struct epoll_event { uint32_t events; /* Epoll events */ epoll_data_t data; /* User data variable */ }; The data of each returned structure will contain the same data the user set with an epoll_ctl(2) (EPOLL_CTL_ADD, EPOLL_CTL_MOD) while the events member will contain the returned event bit field. epoll_pwait() The relationship between epoll_wait() and epoll_pwait() is analogous to the relationship between select(2) and pselect(2): like pselect(2), epoll_pwait() allows an application to safely wait until either a file descriptor becomes ready or until a signal is caught. The following epoll_pwait() call: ready = epoll_pwait(epfd, &events, maxevents, timeout, &sigmask); is equivalent to atomically executing the following calls: sigset_t origmask; pthread_sigmask(SIG_SETMASK, &sigmask, &origmask); ready = epoll_wait(epfd, &events, maxevents, timeout); pthread_sigmask(SIG_SETMASK, &origmask, NULL); The sigmask argument may be specified as NULL, in which case epoll_pwait() is equivalent to epoll_wait(). http://man7.org/linux/man-pages/man2/inotify_init.2.html 10 SYSTEM CALL: inotify_init(2) - Linux manual page FUNCTIONALITY: inotify_init, inotify_init1 - initialize an inotify instance SYNOPSIS: #include int inotify_init(void); int inotify_init1(int flags); DESCRIPTION For an overview of the inotify API, see inotify(7). inotify_init() initializes a new inotify instance and returns a file descriptor associated with a new inotify event queue. If flags is 0, then inotify_init1() is the same as inotify_init().
The following values can be bitwise ORed in flags to obtain different behavior: IN_NONBLOCK Set the O_NONBLOCK file status flag on the new open file description. Using this flag saves extra calls to fcntl(2) to achieve the same result. IN_CLOEXEC Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor. See the description of the O_CLOEXEC flag in open(2) for reasons why this may be useful. http://man7.org/linux/man-pages/man2/inotify_init1.2.html 10 SYSTEM CALL: inotify_init(2) - Linux manual page FUNCTIONALITY: inotify_init, inotify_init1 - initialize an inotify instance SYNOPSIS: #include int inotify_init(void); int inotify_init1(int flags); DESCRIPTION For an overview of the inotify API, see inotify(7). inotify_init() initializes a new inotify instance and returns a file descriptor associated with a new inotify event queue. If flags is 0, then inotify_init1() is the same as inotify_init(). The following values can be bitwise ORed in flags to obtain different behavior: IN_NONBLOCK Set the O_NONBLOCK file status flag on the new open file description. Using this flag saves extra calls to fcntl(2) to achieve the same result. IN_CLOEXEC Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor. See the description of the O_CLOEXEC flag in open(2) for reasons why this may be useful. http://man7.org/linux/man-pages/man2/inotify_add_watch.2.html 10 SYSTEM CALL: inotify_add_watch(2) - Linux manual page FUNCTIONALITY: inotify_add_watch - add a watch to an initialized inotify instance SYNOPSIS: #include int inotify_add_watch(int fd, const char *pathname, uint32_t mask); DESCRIPTION inotify_add_watch() adds a new watch, or modifies an existing watch, for the file whose location is specified in pathname; the caller must have read permission for this file. The fd argument is a file descriptor referring to the inotify instance whose watch list is to be modified. The events to be monitored for pathname are specified in the mask bit-mask argument. 
See inotify(7) for a description of the bits that can be set in mask. A successful call to inotify_add_watch() returns a unique watch descriptor for this inotify instance, for the filesystem object that corresponds to pathname. If the filesystem object was not previously being watched by this inotify instance, then the watch descriptor is newly allocated. If the filesystem object was already being watched (perhaps via a different link to the same object), then the descriptor for the existing watch is returned. The watch descriptor is returned by later read(2)s from the inotify file descriptor. These reads fetch inotify_event structures (see inotify(7)) indicating filesystem events; the watch descriptor inside this structure identifies the object for which the event occurred. http://man7.org/linux/man-pages/man2/inotify_rm_watch.2.html 10 SYSTEM CALL: inotify_rm_watch(2) - Linux manual page FUNCTIONALITY: inotify_rm_watch - remove an existing watch from an inotify instance SYNOPSIS: #include int inotify_rm_watch(int fd, int wd); DESCRIPTION inotify_rm_watch() removes the watch associated with the watch descriptor wd from the inotify instance associated with the file descriptor fd. Removing a watch causes an IN_IGNORED event to be generated for this watch descriptor. (See inotify(7).) http://man7.org/linux/man-pages/man2/fanotify_init.2.html 11 SYSTEM CALL: fanotify_init(2) - Linux manual page FUNCTIONALITY: fanotify_init - create and initialize fanotify group SYNOPSIS: #include #include int fanotify_init(unsigned int flags, unsigned int event_f_flags); DESCRIPTION For an overview of the fanotify API, see fanotify(7). fanotify_init() initializes a new fanotify group and returns a file descriptor for the event queue associated with the group. The file descriptor is used in calls to fanotify_mark(2) to specify the files, directories, and mounts for which fanotify events shall be created. These events are received by reading from the file descriptor. 
Some events are only informative, indicating that a file has been accessed. Other events can be used to determine whether another application is permitted to access a file or directory. Permission to access filesystem objects is granted by writing to the file descriptor. Multiple programs may be using the fanotify interface at the same time to monitor the same files. In the current implementation, the number of fanotify groups per user is limited to 128. This limit cannot be overridden. Calling fanotify_init() requires the CAP_SYS_ADMIN capability. This constraint might be relaxed in future versions of the API. Therefore, certain additional capability checks have been implemented as indicated below. The flags argument contains a multi-bit field defining the notification class of the listening application and further single bit fields specifying the behavior of the file descriptor. If multiple listeners for permission events exist, the notification class is used to establish the sequence in which the listeners receive the events. Only one of the following notification classes may be specified in flags: FAN_CLASS_PRE_CONTENT This value allows the receipt of events notifying that a file has been accessed and events for permission decisions if a file may be accessed. It is intended for event listeners that need to access files before they contain their final data. This notification class might be used by hierarchical storage managers, for example. FAN_CLASS_CONTENT This value allows the receipt of events notifying that a file has been accessed and events for permission decisions if a file may be accessed. It is intended for event listeners that need to access files when they already contain their final content. This notification class might be used by malware detection programs, for example. FAN_CLASS_NOTIF This is the default value. It does not need to be specified. This value only allows the receipt of events notifying that a file has been accessed. 
Permission decisions before the file is accessed are not possible. Listeners with different notification classes will receive events in the order FAN_CLASS_PRE_CONTENT, FAN_CLASS_CONTENT, FAN_CLASS_NOTIF. The order of notification for listeners in the same notification class is undefined. The following bits can additionally be set in flags: FAN_CLOEXEC Set the close-on-exec flag (FD_CLOEXEC) on the new file descriptor. See the description of the O_CLOEXEC flag in open(2). FAN_NONBLOCK Enable the nonblocking flag (O_NONBLOCK) for the file descriptor. Reading from the file descriptor will not block. Instead, if no data is available, read(2) will fail with the error EAGAIN. FAN_UNLIMITED_QUEUE Remove the limit of 16384 events for the event queue. Use of this flag requires the CAP_SYS_ADMIN capability. FAN_UNLIMITED_MARKS Remove the limit of 8192 marks. Use of this flag requires the CAP_SYS_ADMIN capability. The event_f_flags argument defines the file status flags that will be set on the open file descriptions that are created for fanotify events. For details of these flags, see the description of the flags values in open(2). event_f_flags includes a multi-bit field for the access mode. This field can take the following values: O_RDONLY This value allows only read access. O_WRONLY This value allows only write access. O_RDWR This value allows read and write access. Additional bits can be set in event_f_flags. The most useful values are: O_LARGEFILE Enable support for files exceeding 2 GB. Failing to set this flag will result in an EOVERFLOW error when trying to open a large file which is monitored by an fanotify group on a 32-bit system. O_CLOEXEC (since Linux 3.18) Enable the close-on-exec flag for the file descriptor. See the description of the O_CLOEXEC flag in open(2) for reasons why this may be useful. The following are also allowable: O_APPEND, O_DSYNC, O_NOATIME, O_NONBLOCK, and O_SYNC. 
Specifying any other flag in event_f_flags yields the error EINVAL (but see BUGS). http://man7.org/linux/man-pages/man2/fanotify_mark.2.html 11 SYSTEM CALL: fanotify_mark(2) - Linux manual page FUNCTIONALITY: fanotify_mark - add, remove, or modify an fanotify mark on a filesystem object SYNOPSIS: #include int fanotify_mark(int fanotify_fd, unsigned int flags, uint64_t mask, int dirfd, const char *pathname); DESCRIPTION For an overview of the fanotify API, see fanotify(7). fanotify_mark(2) adds, removes, or modifies an fanotify mark on a filesystem object. The caller must have read permission on the filesystem object that is to be marked. The fanotify_fd argument is a file descriptor returned by fanotify_init(2). flags is a bit mask describing the modification to perform. It must include exactly one of the following values: FAN_MARK_ADD The events in mask will be added to the mark mask (or to the ignore mask). mask must be nonempty or the error EINVAL will occur. FAN_MARK_REMOVE The events in argument mask will be removed from the mark mask (or from the ignore mask). mask must be nonempty or the error EINVAL will occur. FAN_MARK_FLUSH Remove either all mount or all non-mount marks from the fanotify group. If flags contains FAN_MARK_MOUNT, all marks for mounts are removed from the group. Otherwise, all marks for directories and files are removed. No flag other than FAN_MARK_MOUNT can be used in conjunction with FAN_MARK_FLUSH. mask is ignored. If none of the values above is specified, or more than one is specified, the call fails with the error EINVAL. In addition, zero or more of the following values may be ORed into flags: FAN_MARK_DONT_FOLLOW If pathname is a symbolic link, mark the link itself, rather than the file to which it refers. (By default, fanotify_mark() dereferences pathname if it is a symbolic link.) FAN_MARK_ONLYDIR If the filesystem object to be marked is not a directory, the error ENOTDIR shall be raised.
FAN_MARK_MOUNT Mark the mount point specified by pathname. If pathname is not itself a mount point, the mount point containing pathname will be marked. All directories, subdirectories, and the contained files of the mount point will be monitored. FAN_MARK_IGNORED_MASK The events in mask shall be added to or removed from the ignore mask. FAN_MARK_IGNORED_SURV_MODIFY The ignore mask shall survive modify events. If this flag is not set, the ignore mask is cleared when a modify event occurs for the ignored file or directory. mask defines which events shall be listened for (or which shall be ignored). It is a bit mask composed of the following values: FAN_ACCESS Create an event when a file or directory (but see BUGS) is accessed (read). FAN_MODIFY Create an event when a file is modified (write). FAN_CLOSE_WRITE Create an event when a writable file is closed. FAN_CLOSE_NOWRITE Create an event when a read-only file or directory is closed. FAN_OPEN Create an event when a file or directory is opened. FAN_OPEN_PERM Create an event when a permission to open a file or directory is requested. An fanotify file descriptor created with FAN_CLASS_PRE_CONTENT or FAN_CLASS_CONTENT is required. FAN_ACCESS_PERM Create an event when a permission to read a file or directory is requested. An fanotify file descriptor created with FAN_CLASS_PRE_CONTENT or FAN_CLASS_CONTENT is required. FAN_ONDIR Create events for directories—for example, when opendir(3), readdir(3) (but see BUGS), and closedir(3) are called. Without this flag, only events for files are created. FAN_EVENT_ON_CHILD Events for the immediate children of marked directories shall be created. The flag has no effect when marking mounts. Note that events are not generated for children of the subdirectories of marked directories. To monitor complete directory trees it is necessary to mark the relevant mount. The following composed value is defined: FAN_CLOSE A file is closed (FAN_CLOSE_WRITE|FAN_CLOSE_NOWRITE). 
The filesystem object to be marked is determined by the file descriptor dirfd and the pathname specified in pathname: * If pathname is NULL, dirfd defines the filesystem object to be marked. * If pathname is NULL, and dirfd takes the special value AT_FDCWD, the current working directory is to be marked. * If pathname is absolute, it defines the filesystem object to be marked, and dirfd is ignored. * If pathname is relative, and dirfd does not have the value AT_FDCWD, then the filesystem object to be marked is determined by interpreting pathname relative to the directory referred to by dirfd. * If pathname is relative, and dirfd has the value AT_FDCWD, then the filesystem object to be marked is determined by interpreting pathname relative to the current working directory. http://man7.org/linux/man-pages/man2/fadvise64.2.html 12 SYSTEM CALL: posix_fadvise(2) - Linux manual page FUNCTIONALITY: posix_fadvise - predeclare an access pattern for file data SYNOPSIS: #include int posix_fadvise(int fd, off_t offset, off_t len, int advice); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): posix_fadvise(): _POSIX_C_SOURCE >= 200112L DESCRIPTION Programs can use posix_fadvise() to announce an intention to access file data in a specific pattern in the future, thus allowing the kernel to perform appropriate optimizations. The advice applies to a (not necessarily existent) region starting at offset and extending for len bytes (or until the end of the file if len is 0) within the file referred to by fd. The advice is not binding; it merely constitutes an expectation on behalf of the application. Permissible values for advice include: POSIX_FADV_NORMAL Indicates that the application has no advice to give about its access pattern for the specified data. If no advice is given for an open file, this is the default assumption. POSIX_FADV_SEQUENTIAL The application expects to access the specified data sequentially (with lower offsets read before higher ones).
POSIX_FADV_RANDOM The specified data will be accessed in random order. POSIX_FADV_NOREUSE The specified data will be accessed only once. POSIX_FADV_WILLNEED The specified data will be accessed in the near future. POSIX_FADV_DONTNEED The specified data will not be accessed in the near future. http://man7.org/linux/man-pages/man2/readahead.2.html 12 SYSTEM CALL: readahead(2) - Linux manual page FUNCTIONALITY: readahead - initiate file readahead into page cache SYNOPSIS: #define _GNU_SOURCE /* See feature_test_macros(7) */ #include ssize_t readahead(int fd, off64_t offset, size_t count); DESCRIPTION readahead() initiates readahead on a file so that subsequent reads from that file will be satisfied from the cache, and not block on disk I/O (assuming the readahead was initiated early enough and that other activity on the system did not in the meantime flush pages from the cache). The fd argument is a file descriptor identifying the file which is to be read. The offset argument specifies the starting point from which data is to be read and count specifies the number of bytes to be read. I/O is performed in whole pages, so that offset is effectively rounded down to a page boundary and bytes are read up to the next page boundary greater than or equal to (offset+count). readahead() does not read beyond the end of the file. The file offset of the open file description referred to by fd is left unchanged. http://man7.org/linux/man-pages/man2/getrandom.2.html 12 SYSTEM CALL: getrandom(2) - Linux manual page FUNCTIONALITY: getrandom - obtain a series of random bytes SYNOPSIS: #include int getrandom(void *buf, size_t buflen, unsigned int flags); DESCRIPTION The getrandom() system call fills the buffer pointed to by buf with up to buflen random bytes. These bytes can be used to seed user- space random number generators or for cryptographic purposes. getrandom() relies on entropy gathered from device drivers and other sources of environmental noise. 
Unnecessarily reading large quantities of data will have a negative impact on other users of the /dev/random and /dev/urandom devices. Therefore, getrandom() should not be used for Monte Carlo simulations or other programs/algorithms which are doing probabilistic sampling. By default, getrandom() draws entropy from the /dev/urandom pool. This behavior can be changed via the flags argument. If the /dev/urandom pool has been initialized, reads of up to 256 bytes will always return as many bytes as requested and will not be interrupted by signals. No such guarantees apply for larger buffer sizes. For example, if the call is interrupted by a signal handler, it may return a partially filled buffer, or fail with the error EINTR. If the pool has not yet been initialized, then the call blocks, unless GRND_NONBLOCK is specified in flags. The flags argument is a bit mask that can contain zero or more of the following values ORed together: GRND_RANDOM If this bit is set, then random bytes are drawn from the /dev/random pool instead of the /dev/urandom pool. The /dev/random pool is limited based on the entropy that can be obtained from environmental noise. If the number of available bytes in /dev/random is less than requested in buflen, the call returns just the available random bytes. If no random bytes are available, the behavior depends on the presence of GRND_NONBLOCK in the flags argument. GRND_NONBLOCK By default, when reading from /dev/random, getrandom() blocks if no random bytes are available, and when reading from /dev/urandom, it blocks if the entropy pool has not yet been initialized. If the GRND_NONBLOCK flag is set, then getrandom() does not block in these cases, but instead immediately returns -1 with errno set to EAGAIN. 
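Because requests larger than 256 bytes may return a partially filled buffer or fail with EINTR, a caller that needs a full buffer has to loop. The following is a minimal sketch under stated assumptions: the helper name fill_random() is hypothetical, and the code assumes a C library that declares getrandom() in <sys/random.h> (glibc does so since version 2.25) with a ssize_t return type.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/random.h>
#include <sys/types.h>

/* Hypothetical helper: fill buf with exactly len random bytes drawn from
 * the urandom source (flags == 0). Returns 0 on success, or -1 with errno
 * set if getrandom() fails for a reason other than EINTR. */
int fill_random(void *buf, size_t len)
{
    unsigned char *p = buf;

    while (len > 0) {
        ssize_t n = getrandom(p, len, 0);

        if (n == -1) {
            if (errno == EINTR)
                continue;           /* interrupted by a signal: retry */
            return -1;
        }
        /* A short count is possible for large requests: advance and
         * ask for the remainder. */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```

With flags set to 0 the call blocks only until the pool has been initialized; passing GRND_NONBLOCK instead would require the caller to handle EAGAIN as well.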
Linux network system calls http://man7.org/linux/man-pages/man2/socket.2.html 11 SYSTEM CALL: socket(2) - Linux manual page FUNCTIONALITY: socket - create an endpoint for communication SYNOPSIS: #include <sys/types.h> /* See NOTES */ #include <sys/socket.h> int socket(int domain, int type, int protocol); DESCRIPTION socket() creates an endpoint for communication and returns a file descriptor that refers to that endpoint. The domain argument specifies a communication domain; this selects the protocol family which will be used for communication. These families are defined in <sys/socket.h>. The currently understood formats include (name: purpose, man page): AF_UNIX, AF_LOCAL: Local communication, unix(7); AF_INET: IPv4 Internet protocols, ip(7); AF_INET6: IPv6 Internet protocols, ipv6(7); AF_IPX: IPX - Novell protocols; AF_NETLINK: Kernel user interface device, netlink(7); AF_X25: ITU-T X.25 / ISO-8208 protocol, x25(7); AF_AX25: Amateur radio AX.25 protocol; AF_ATMPVC: Access to raw ATM PVCs; AF_APPLETALK: AppleTalk, ddp(7); AF_PACKET: Low level packet interface, packet(7); AF_ALG: Interface to kernel crypto API. The socket has the indicated type, which specifies the communication semantics. Currently defined types are: SOCK_STREAM Provides sequenced, reliable, two-way, connection-based byte streams. An out-of-band data transmission mechanism may be supported. SOCK_DGRAM Supports datagrams (connectionless, unreliable messages of a fixed maximum length). SOCK_SEQPACKET Provides a sequenced, reliable, two-way connection-based data transmission path for datagrams of fixed maximum length; a consumer is required to read an entire packet with each input system call. SOCK_RAW Provides raw network protocol access. SOCK_RDM Provides a reliable datagram layer that does not guarantee ordering. SOCK_PACKET Obsolete and should not be used in new programs; see packet(7). Some socket types may not be implemented by all protocol families.
Since Linux 2.6.27, the type argument serves a second purpose: in addition to specifying a socket type, it may include the bitwise OR of any of the following values, to modify the behavior of socket(): SOCK_NONBLOCK Set the O_NONBLOCK file status flag on the new open file description. Using this flag saves extra calls to fcntl(2) to achieve the same result. SOCK_CLOEXEC Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor. See the description of the O_CLOEXEC flag in open(2) for reasons why this may be useful. The protocol specifies a particular protocol to be used with the socket. Normally only a single protocol exists to support a particular socket type within a given protocol family, in which case protocol can be specified as 0. However, it is possible that many protocols may exist, in which case a particular protocol must be specified in this manner. The protocol number to use is specific to the “communication domain” in which communication is to take place; see protocols(5). See getprotoent(3) on how to map protocol name strings to protocol numbers. Sockets of type SOCK_STREAM are full-duplex byte streams. They do not preserve record boundaries. A stream socket must be in a connected state before any data may be sent or received on it. A connection to another socket is created with a connect(2) call. Once connected, data may be transferred using read(2) and write(2) calls or some variant of the send(2) and recv(2) calls. When a session has been completed a close(2) may be performed. Out-of-band data may also be transmitted as described in send(2) and received as described in recv(2). The communications protocols which implement a SOCK_STREAM ensure that data is not lost or duplicated. If a piece of data for which the peer protocol has buffer space cannot be successfully transmitted within a reasonable length of time, then the connection is considered to be dead. 
When SO_KEEPALIVE is enabled on the socket the protocol checks in a protocol-specific manner if the other end is still alive. A SIGPIPE signal is raised if a process sends or receives on a broken stream; this causes naive processes, which do not handle the signal, to exit. SOCK_SEQPACKET sockets employ the same system calls as SOCK_STREAM sockets. The only difference is that read(2) calls will return only the amount of data requested, and any data remaining in the arriving packet will be discarded. Also all message boundaries in incoming datagrams are preserved. SOCK_DGRAM and SOCK_RAW sockets allow sending of datagrams to correspondents named in sendto(2) calls. Datagrams are generally received with recvfrom(2), which returns the next datagram along with the address of its sender. SOCK_PACKET is an obsolete socket type to receive raw packets directly from the device driver. Use packet(7) instead. An fcntl(2) F_SETOWN operation can be used to specify a process or process group to receive a SIGURG signal when the out-of-band data arrives or SIGPIPE signal when a SOCK_STREAM connection breaks unexpectedly. This operation may also be used to set the process or process group that receives the I/O and asynchronous notification of I/O events via SIGIO. Using F_SETOWN is equivalent to an ioctl(2) call with the FIOSETOWN or SIOCSPGRP argument. When the network signals an error condition to the protocol module (e.g., using an ICMP message for IP) the pending error flag is set for the socket. The next operation on this socket will return the error code of the pending error. For some protocols it is possible to enable a per-socket error queue to retrieve detailed information about the error; see IP_RECVERR in ip(7). The operation of sockets is controlled by socket level options. These options are defined in <sys/socket.h>. The functions setsockopt(2) and getsockopt(2) are used to set and get options, respectively.
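As a rough illustration of the SOCK_NONBLOCK and SOCK_CLOEXEC type modifiers described above, the following sketch creates a stream socket with both flags ORed into the type argument and then checks the result with fcntl(2). The helper names open_stream_socket() and has_nonblock_cloexec() are hypothetical and exist only for this example.

```c
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create a stream socket in the given domain that is
 * nonblocking and close-on-exec from the start, with no extra fcntl(2) calls.
 * Protocol 0 selects the single default protocol for the type.
 * Returns the new file descriptor, or -1 with errno set by socket(). */
int open_stream_socket(int domain)
{
    return socket(domain, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC, 0);
}

/* Hypothetical helper: returns 1 if fd has both O_NONBLOCK (a file status
 * flag) and FD_CLOEXEC (a file descriptor flag) set, 0 if either is
 * missing, and -1 if fcntl(2) fails. */
int has_nonblock_cloexec(int fd)
{
    int status_flags = fcntl(fd, F_GETFL);
    int fd_flags = fcntl(fd, F_GETFD);

    if (status_flags == -1 || fd_flags == -1)
        return -1;
    return (status_flags & O_NONBLOCK) && (fd_flags & FD_CLOEXEC);
}
```

Setting the flags atomically at creation time avoids the window, in a multithreaded program, between socket() and a later fcntl() call during which another thread could fork and exec.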
http://man7.org/linux/man-pages/man2/socketpair.2.html 10 SYSTEM CALL: socketpair(2) - Linux manual page FUNCTIONALITY: socketpair - create a pair of connected sockets SYNOPSIS: #include <sys/types.h> /* See NOTES */ #include <sys/socket.h> int socketpair(int domain, int type, int protocol, int sv[2]); DESCRIPTION The socketpair() call creates an unnamed pair of connected sockets in the specified domain, of the specified type, and using the optionally specified protocol. For further details of these arguments, see socket(2). The file descriptors used in referencing the new sockets are returned in sv[0] and sv[1]. The two sockets are indistinguishable. http://man7.org/linux/man-pages/man2/setsockopt.2.html 11 SYSTEM CALL: getsockopt(2) - Linux manual page FUNCTIONALITY: getsockopt, setsockopt - get and set options on sockets SYNOPSIS: #include <sys/types.h> /* See NOTES */ #include <sys/socket.h> int getsockopt(int sockfd, int level, int optname, void *optval, socklen_t *optlen); int setsockopt(int sockfd, int level, int optname, const void *optval, socklen_t optlen); DESCRIPTION getsockopt() and setsockopt() manipulate options for the socket referred to by the file descriptor sockfd. Options may exist at multiple protocol levels; they are always present at the uppermost socket level. When manipulating socket options, the level at which the option resides and the name of the option must be specified. To manipulate options at the sockets API level, level is specified as SOL_SOCKET. To manipulate options at any other level the protocol number of the appropriate protocol controlling the option is supplied. For example, to indicate that an option is to be interpreted by the TCP protocol, level should be set to the protocol number of TCP; see getprotoent(3). The arguments optval and optlen are used to access option values for setsockopt(). For getsockopt() they identify a buffer in which the value for the requested option(s) are to be returned.
For getsockopt(), optlen is a value-result argument, initially containing the size of the buffer pointed to by optval, and modified on return to indicate the actual size of the value returned. If no option value is to be supplied or returned, optval may be NULL. Optname and any specified options are passed uninterpreted to the appropriate protocol module for interpretation. The include file <sys/socket.h> contains definitions for socket level options, described below. Options at other protocol levels vary in format and name; consult the appropriate entries in section 4 of the manual. Most socket-level options utilize an int argument for optval. For setsockopt(), the argument should be nonzero to enable a boolean option, or zero if the option is to be disabled. For a description of the available socket options see socket(7) and the appropriate protocol man pages. http://man7.org/linux/man-pages/man2/getsockopt.2.html 11 SYSTEM CALL: getsockopt(2) - Linux manual page FUNCTIONALITY: getsockopt, setsockopt - get and set options on sockets SYNOPSIS: #include <sys/types.h> /* See NOTES */ #include <sys/socket.h> int getsockopt(int sockfd, int level, int optname, void *optval, socklen_t *optlen); int setsockopt(int sockfd, int level, int optname, const void *optval, socklen_t optlen); DESCRIPTION getsockopt() and setsockopt() manipulate options for the socket referred to by the file descriptor sockfd. Options may exist at multiple protocol levels; they are always present at the uppermost socket level. When manipulating socket options, the level at which the option resides and the name of the option must be specified. To manipulate options at the sockets API level, level is specified as SOL_SOCKET. To manipulate options at any other level the protocol number of the appropriate protocol controlling the option is supplied. For example, to indicate that an option is to be interpreted by the TCP protocol, level should be set to the protocol number of TCP; see getprotoent(3).
The arguments optval and optlen are used to access option values for setsockopt(). For getsockopt() they identify a buffer in which the value for the requested option(s) are to be returned. For getsockopt(), optlen is a value-result argument, initially containing the size of the buffer pointed to by optval, and modified on return to indicate the actual size of the value returned. If no option value is to be supplied or returned, optval may be NULL. Optname and any specified options are passed uninterpreted to the appropriate protocol module for interpretation. The include file <sys/socket.h> contains definitions for socket level options, described below. Options at other protocol levels vary in format and name; consult the appropriate entries in section 4 of the manual. Most socket-level options utilize an int argument for optval. For setsockopt(), the argument should be nonzero to enable a boolean option, or zero if the option is to be disabled. For a description of the available socket options see socket(7) and the appropriate protocol man pages. http://man7.org/linux/man-pages/man2/getsockname.2.html 10 SYSTEM CALL: getsockname(2) - Linux manual page FUNCTIONALITY: getsockname - get socket name SYNOPSIS: #include <sys/socket.h> int getsockname(int sockfd, struct sockaddr *addr, socklen_t *addrlen); DESCRIPTION getsockname() returns the current address to which the socket sockfd is bound, in the buffer pointed to by addr. The addrlen argument should be initialized to indicate the amount of space (in bytes) pointed to by addr. On return it contains the actual size of the socket address. The returned address is truncated if the buffer provided is too small; in this case, addrlen will return a value greater than was supplied to the call.
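The value-result convention shared by getsockopt() (optlen) and getsockname() (addrlen) may be easier to see in code. The sketch below is an illustration only: it enables the boolean option SO_REUSEADDR at the SOL_SOCKET level and reads it back, then binds an IPv4 socket to the wildcard address with port 0 (using bind(2), described later) and asks getsockname() which port the kernel chose. The helper names enable_reuseaddr() and bind_ephemeral_port() are hypothetical.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: turn on SO_REUSEADDR and read the option back.
 * Returns 1 if the option reads back as enabled, 0 if not, -1 on error. */
int enable_reuseaddr(int fd)
{
    int on = 1;                      /* nonzero enables a boolean option */
    int value = 0;
    socklen_t len = sizeof value;    /* in: buffer size, out: size of the value */

    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof on) == -1)
        return -1;
    if (getsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &value, &len) == -1)
        return -1;
    return value != 0;
}

/* Hypothetical helper: bind an AF_INET socket to the wildcard address and
 * port 0, then use getsockname() to learn the port assigned by the kernel.
 * Returns the port in host byte order, or -1 on error. */
int bind_ephemeral_port(int fd)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof addr;     /* in: space available, out: actual size */

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(0);        /* 0 asks the kernel to pick a port */

    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) == -1)
        return -1;
    if (getsockname(fd, (struct sockaddr *)&addr, &len) == -1)
        return -1;
    return ntohs(addr.sin_port);
}
```

In both calls the length variable must be initialized before the call; passing an uninitialized socklen_t is a common source of spurious EINVAL errors or truncated results.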
http://man7.org/linux/man-pages/man2/getpeername.2.html 10 SYSTEM CALL: getpeername(2) - Linux manual page FUNCTIONALITY: getpeername - get name of connected peer socket SYNOPSIS: #include <sys/socket.h> int getpeername(int sockfd, struct sockaddr *addr, socklen_t *addrlen); DESCRIPTION getpeername() returns the address of the peer connected to the socket sockfd, in the buffer pointed to by addr. The addrlen argument should be initialized to indicate the amount of space pointed to by addr. On return it contains the actual size of the name returned (in bytes). The returned address is truncated if the buffer provided is too small; in this case, addrlen will return a value greater than was supplied to the call. http://man7.org/linux/man-pages/man2/bind.2.html 12 SYSTEM CALL: bind(2) - Linux manual page FUNCTIONALITY: bind - bind a name to a socket SYNOPSIS: #include <sys/types.h> /* See NOTES */ #include <sys/socket.h> int bind(int sockfd, const struct sockaddr *addr, socklen_t addrlen); DESCRIPTION When a socket is created with socket(2), it exists in a name space (address family) but has no address assigned to it. bind() assigns the address specified by addr to the socket referred to by the file descriptor sockfd. addrlen specifies the size, in bytes, of the address structure pointed to by addr. Traditionally, this operation is called “assigning a name to a socket”. It is normally necessary to assign a local address using bind() before a SOCK_STREAM socket may receive connections (see accept(2)). The rules used in name binding vary between address families. Consult the manual entries in Section 7 for detailed information. For AF_INET see ip(7), for AF_INET6 see ipv6(7), for AF_UNIX see unix(7), for AF_APPLETALK see ddp(7), for AF_PACKET see packet(7), for AF_X25 see x25(7) and for AF_NETLINK see netlink(7). The actual structure passed for the addr argument will depend on the address family.
The sockaddr structure is defined as something like: struct sockaddr { sa_family_t sa_family; char sa_data[14]; }; The only purpose of this structure is to cast the structure pointer passed in addr in order to avoid compiler warnings. See EXAMPLE below. http://man7.org/linux/man-pages/man2/listen.2.html 11 SYSTEM CALL: listen(2) - Linux manual page FUNCTIONALITY: listen - listen for connections on a socket SYNOPSIS: #include <sys/types.h> /* See NOTES */ #include <sys/socket.h> int listen(int sockfd, int backlog); DESCRIPTION listen() marks the socket referred to by sockfd as a passive socket, that is, as a socket that will be used to accept incoming connection requests using accept(2). The sockfd argument is a file descriptor that refers to a socket of type SOCK_STREAM or SOCK_SEQPACKET. The backlog argument defines the maximum length to which the queue of pending connections for sockfd may grow. If a connection request arrives when the queue is full, the client may receive an error with an indication of ECONNREFUSED or, if the underlying protocol supports retransmission, the request may be ignored so that a later reattempt at connection succeeds. http://man7.org/linux/man-pages/man2/accept.2.html 12 SYSTEM CALL: accept(2) - Linux manual page FUNCTIONALITY: accept, accept4 - accept a connection on a socket SYNOPSIS: #include <sys/types.h> /* See NOTES */ #include <sys/socket.h> int accept(int sockfd, struct sockaddr *addr, socklen_t *addrlen); #define _GNU_SOURCE /* See feature_test_macros(7) */ #include <sys/socket.h> int accept4(int sockfd, struct sockaddr *addr, socklen_t *addrlen, int flags); DESCRIPTION The accept() system call is used with connection-based socket types (SOCK_STREAM, SOCK_SEQPACKET). It extracts the first connection request on the queue of pending connections for the listening socket, sockfd, creates a new connected socket, and returns a new file descriptor referring to that socket. The newly created socket is not in the listening state. The original socket sockfd is unaffected by this call.
The argument sockfd is a socket that has been created with socket(2), bound to a local address with bind(2), and is listening for connections after a listen(2). The argument addr is a pointer to a sockaddr structure. This structure is filled in with the address of the peer socket, as known to the communications layer. The exact format of the address returned in addr is determined by the socket's address family (see socket(2) and the respective protocol man pages). When addr is NULL, nothing is filled in; in this case, addrlen is not used, and should also be NULL. The addrlen argument is a value-result argument: the caller must initialize it to contain the size (in bytes) of the structure pointed to by addr; on return it will contain the actual size of the peer address. The returned address is truncated if the buffer provided is too small; in this case, addrlen will return a value greater than was supplied to the call. If no pending connections are present on the queue, and the socket is not marked as nonblocking, accept() blocks the caller until a connection is present. If the socket is marked nonblocking and no pending connections are present on the queue, accept() fails with the error EAGAIN or EWOULDBLOCK. In order to be notified of incoming connections on a socket, you can use select(2) or poll(2). A readable event will be delivered when a new connection is attempted and you may then call accept() to get a socket for that connection. Alternatively, you can set the socket to deliver SIGIO when activity occurs on a socket; see socket(7) for details. For certain protocols which require an explicit confirmation, such as DECNet, accept() can be thought of as merely dequeuing the next connection request and not implying confirmation. Confirmation can be implied by a normal read or write on the new file descriptor, and rejection can be implied by closing the new socket. Currently, only DECNet has these semantics on Linux.
If flags is 0, then accept4() is the same as accept(). The following values can be bitwise ORed in flags to obtain different behavior: SOCK_NONBLOCK Set the O_NONBLOCK file status flag on the new open file description. Using this flag saves extra calls to fcntl(2) to achieve the same result. SOCK_CLOEXEC Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor. See the description of the O_CLOEXEC flag in open(2) for reasons why this may be useful. http://man7.org/linux/man-pages/man2/accept4.2.html 12 SYSTEM CALL: accept(2) - Linux manual page FUNCTIONALITY: accept, accept4 - accept a connection on a socket SYNOPSIS: #include <sys/types.h> /* See NOTES */ #include <sys/socket.h> int accept(int sockfd, struct sockaddr *addr, socklen_t *addrlen); #define _GNU_SOURCE /* See feature_test_macros(7) */ #include <sys/socket.h> int accept4(int sockfd, struct sockaddr *addr, socklen_t *addrlen, int flags); DESCRIPTION The accept() system call is used with connection-based socket types (SOCK_STREAM, SOCK_SEQPACKET). It extracts the first connection request on the queue of pending connections for the listening socket, sockfd, creates a new connected socket, and returns a new file descriptor referring to that socket. The newly created socket is not in the listening state. The original socket sockfd is unaffected by this call. The argument sockfd is a socket that has been created with socket(2), bound to a local address with bind(2), and is listening for connections after a listen(2). The argument addr is a pointer to a sockaddr structure. This structure is filled in with the address of the peer socket, as known to the communications layer. The exact format of the address returned in addr is determined by the socket's address family (see socket(2) and the respective protocol man pages). When addr is NULL, nothing is filled in; in this case, addrlen is not used, and should also be NULL.
The addrlen argument is a value-result argument: the caller must initialize it to contain the size (in bytes) of the structure pointed to by addr; on return it will contain the actual size of the peer address. The returned address is truncated if the buffer provided is too small; in this case, addrlen will return a value greater than was supplied to the call. If no pending connections are present on the queue, and the socket is not marked as nonblocking, accept() blocks the caller until a connection is present. If the socket is marked nonblocking and no pending connections are present on the queue, accept() fails with the error EAGAIN or EWOULDBLOCK. In order to be notified of incoming connections on a socket, you can use select(2) or poll(2). A readable event will be delivered when a new connection is attempted and you may then call accept() to get a socket for that connection. Alternatively, you can set the socket to deliver SIGIO when activity occurs on a socket; see socket(7) for details. For certain protocols which require an explicit confirmation, such as DECNet, accept() can be thought of as merely dequeuing the next connection request and not implying confirmation. Confirmation can be implied by a normal read or write on the new file descriptor, and rejection can be implied by closing the new socket. Currently, only DECNet has these semantics on Linux. If flags is 0, then accept4() is the same as accept(). The following values can be bitwise ORed in flags to obtain different behavior: SOCK_NONBLOCK Set the O_NONBLOCK file status flag on the new open file description. Using this flag saves extra calls to fcntl(2) to achieve the same result. SOCK_CLOEXEC Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor. See the description of the O_CLOEXEC flag in open(2) for reasons why this may be useful. 
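The socket(2), bind(2), listen(2), accept(2) sequence described above can be sketched as follows. This is a minimal illustration under stated assumptions: it uses IPv4 with the wildcard address and a kernel-chosen port, it calls plain accept() rather than accept4(), and the helper names make_listener() and accept_one() (and the -2 return convention) are hypothetical. With a nonblocking listening socket and no pending connection, accept() is expected to fail with EAGAIN or EWOULDBLOCK.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create a nonblocking, close-on-exec TCP listening
 * socket on the wildcard IPv4 address with a kernel-chosen port.
 * Returns the listening descriptor, or -1 on error. */
int make_listener(int backlog)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC, 0);

    if (fd == -1)
        return -1;

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(0);

    /* Assign a local address, then mark the socket as passive. */
    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) == -1 ||
        listen(fd, backlog) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: dequeue one pending connection, if any.
 * Returns the connected descriptor, -2 if the queue is empty on a
 * nonblocking listener, or -1 on any other error. */
int accept_one(int listen_fd)
{
    struct sockaddr_storage peer;    /* large enough for any address family */
    socklen_t len = sizeof peer;     /* value-result: in size, out actual size */
    int fd = accept(listen_fd, (struct sockaddr *)&peer, &len);

    if (fd == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return -2;
    return fd;
}
```

In a real server, accept_one() would typically be called after select(2) or poll(2) reports the listening descriptor as readable, and -2 would simply mean "go back to waiting".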
http://man7.org/linux/man-pages/man2/connect.2.html 11 SYSTEM CALL: connect(2) - Linux manual page FUNCTIONALITY: connect - initiate a connection on a socket SYNOPSIS: #include <sys/types.h> /* See NOTES */ #include <sys/socket.h> int connect(int sockfd, const struct sockaddr *addr, socklen_t addrlen); DESCRIPTION The connect() system call connects the socket referred to by the file descriptor sockfd to the address specified by addr. The addrlen argument specifies the size of addr. The format of the address in addr is determined by the address space of the socket sockfd; see socket(2) for further details. If the socket sockfd is of type SOCK_DGRAM, then addr is the address to which datagrams are sent by default, and the only address from which datagrams are received. If the socket is of type SOCK_STREAM or SOCK_SEQPACKET, this call attempts to make a connection to the socket that is bound to the address specified by addr. Generally, connection-based protocol sockets may successfully connect() only once; connectionless protocol sockets may use connect() multiple times to change their association. Connectionless sockets may dissolve the association by connecting to an address with the sa_family member of sockaddr set to AF_UNSPEC (supported on Linux since kernel 2.2). http://man7.org/linux/man-pages/man2/shutdown.2.html 11 SYSTEM CALL: shutdown(2) - Linux manual page FUNCTIONALITY: shutdown - shut down part of a full-duplex connection SYNOPSIS: #include <sys/socket.h> int shutdown(int sockfd, int how); DESCRIPTION The shutdown() call causes all or part of a full-duplex connection on the socket associated with sockfd to be shut down. If how is SHUT_RD, further receptions will be disallowed. If how is SHUT_WR, further transmissions will be disallowed. If how is SHUT_RDWR, further receptions and transmissions will be disallowed.
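To illustrate what "further transmissions will be disallowed" means for the peer, the sketch below uses an already connected pair from socketpair(2) instead of a real connect(2), which keeps the example free of network setup. After the writer calls shutdown() with SHUT_WR, the reader still receives the bytes already sent and then sees end-of-file (a read of 0 bytes). The helper name send_then_half_close() is hypothetical, and the example assumes the message is small enough to fit in the socket buffer.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: send len bytes of msg over a fresh AF_UNIX stream
 * socket pair, half-close the sending side with SHUT_WR, and drain the
 * receiving side until end-of-file.
 * Returns the total number of bytes the peer read, or -1 on error. */
long send_then_half_close(const char *msg, size_t len)
{
    int sv[2];
    long total = -1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;

    if (write(sv[0], msg, len) == (ssize_t)len &&
        shutdown(sv[0], SHUT_WR) == 0) {
        char buf[64];
        ssize_t n;

        total = 0;
        /* read() returns 0 once the queued data is consumed, because the
         * writer has shut down its sending direction. */
        while ((n = read(sv[1], buf, sizeof buf)) > 0)
            total += n;
        if (n < 0)
            total = -1;
    }

    close(sv[0]);
    close(sv[1]);
    return total;
}
```

Note that shutdown() does not release the descriptor; close(2) is still required on both ends, as done at the end of the helper.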
http://man7.org/linux/man-pages/man2/recvfrom.2.html 11 SYSTEM CALL: recv(2) - Linux manual page FUNCTIONALITY: recv, recvfrom, recvmsg - receive a message from a socket SYNOPSIS: #include <sys/types.h> #include <sys/socket.h> ssize_t recv(int sockfd, void *buf, size_t len, int flags); ssize_t recvfrom(int sockfd, void *buf, size_t len, int flags, struct sockaddr *src_addr, socklen_t *addrlen); ssize_t recvmsg(int sockfd, struct msghdr *msg, int flags); DESCRIPTION The recv(), recvfrom(), and recvmsg() calls are used to receive messages from a socket. They may be used to receive data on both connectionless and connection-oriented sockets. This page first describes common features of all three system calls, and then describes the differences between the calls. The only difference between recv() and read(2) is the presence of flags. With a zero flags argument, recv() is generally equivalent to read(2) (but see NOTES). Also, the following call recv(sockfd, buf, len, flags); is equivalent to recvfrom(sockfd, buf, len, flags, NULL, NULL); All three calls return the length of the message on successful completion. If a message is too long to fit in the supplied buffer, excess bytes may be discarded depending on the type of socket the message is received from. If no messages are available at the socket, the receive calls wait for a message to arrive, unless the socket is nonblocking (see fcntl(2)), in which case the value -1 is returned and the external variable errno is set to EAGAIN or EWOULDBLOCK. The receive calls normally return any data available, up to the requested amount, rather than waiting for receipt of the full amount requested. An application can use select(2), poll(2), or epoll(7) to determine when more data arrives on a socket.
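The two behaviors just described (failing with EAGAIN or EWOULDBLOCK on an empty nonblocking socket, and returning whatever data is available rather than waiting for the full amount) can be sketched like this. The helper names set_nonblocking() and recv_available(), and the -2 return convention, are hypothetical and chosen only for this illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: set O_NONBLOCK on fd with fcntl(2), preserving the
 * other file status flags. Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Hypothetical helper: receive up to len bytes with a zero flags argument.
 * Returns the number of bytes received (possibly fewer than len), 0 on
 * end-of-file for a stream socket, -2 if a nonblocking socket has no data
 * right now, or -1 on any other error. */
ssize_t recv_available(int fd, void *buf, size_t len)
{
    ssize_t n = recv(fd, buf, len, 0);

    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return -2;
    return n;
}
```

For example, on a nonblocking socket pair recv_available() yields -2 before the peer writes anything; once the peer has written 3 bytes, a request for 16 bytes returns 3.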
The flags argument The flags argument is formed by ORing one or more of the following values: MSG_CMSG_CLOEXEC (recvmsg() only; since Linux 2.6.23) Set the close-on-exec flag for the file descriptor received via a UNIX domain file descriptor using the SCM_RIGHTS operation (described in unix(7)). This flag is useful for the same reasons as the O_CLOEXEC flag of open(2). MSG_DONTWAIT (since Linux 2.2) Enables nonblocking operation; if the operation would block, the call fails with the error EAGAIN or EWOULDBLOCK. This provides similar behavior to setting the O_NONBLOCK flag (via the fcntl(2) F_SETFL operation), but differs in that MSG_DONTWAIT is a per-call option, whereas O_NONBLOCK is a setting on the open file description (see open(2)), which will affect all threads in the calling process as well as other processes that hold file descriptors referring to the same open file description. MSG_ERRQUEUE (since Linux 2.2) This flag specifies that queued errors should be received from the socket error queue. The error is passed in an ancillary message with a type dependent on the protocol (for IPv4 IP_RECVERR). The user should supply a buffer of sufficient size. See cmsg(3) and ip(7) for more information. The payload of the original packet that caused the error is passed as normal data via msg_iov. The original destination address of the datagram that caused the error is supplied via msg_name. For local errors, no address is passed (this can be checked with the cmsg_len member of the cmsghdr). For error receives, the MSG_ERRQUEUE is set in the msghdr. After an error has been passed, the pending socket error is regenerated based on the next queued error and will be passed on the next socket operation.
The error is supplied in a sock_extended_err structure: #define SO_EE_ORIGIN_NONE 0 #define SO_EE_ORIGIN_LOCAL 1 #define SO_EE_ORIGIN_ICMP 2 #define SO_EE_ORIGIN_ICMP6 3 struct sock_extended_err { uint32_t ee_errno; /* error number */ uint8_t ee_origin; /* where the error originated */ uint8_t ee_type; /* type */ uint8_t ee_code; /* code */ uint8_t ee_pad; /* padding */ uint32_t ee_info; /* additional information */ uint32_t ee_data; /* other data */ /* More data may follow */ }; struct sockaddr *SO_EE_OFFENDER(struct sock_extended_err *); ee_errno contains the errno number of the queued error. ee_origin is the origin code of where the error originated. The other fields are protocol-specific. The macro SO_EE_OFFENDER returns a pointer to the address of the network object where the error originated from given a pointer to the ancillary message. If this address is not known, the sa_family member of the sockaddr contains AF_UNSPEC and the other fields of the sockaddr are undefined. The payload of the packet that caused the error is passed as normal data. MSG_OOB This flag requests receipt of out-of-band data that would not be received in the normal data stream. Some protocols place expedited data at the head of the normal data queue, and thus this flag cannot be used with such protocols. MSG_PEEK This flag causes the receive operation to return data from the beginning of the receive queue without removing that data from the queue. Thus, a subsequent receive call will return the same data.
MSG_TRUNC (since Linux 2.2) For raw (AF_PACKET), Internet datagram (since Linux 2.4.27/2.6.8), netlink (since Linux 2.6.22), and UNIX datagram (since Linux 3.4) sockets: return the real length of the packet or datagram, even when it was longer than the passed buffer. For use with Internet stream sockets, see tcp(7). MSG_WAITALL (since Linux 2.2) This flag requests that the operation block until the full request is satisfied. However, the call may still return less data than requested if a signal is caught, an error or disconnect occurs, or the next data to be received is of a different type than that returned. This flag has no effect for datagram sockets. recvfrom() recvfrom() places the received message into the buffer buf. The caller must specify the size of the buffer in len. If src_addr is not NULL, and the underlying protocol provides the source address of the message, that source address is placed in the buffer pointed to by src_addr. In this case, addrlen is a value-result argument. Before the call, it should be initialized to the size of the buffer associated with src_addr. Upon return, addrlen is updated to contain the actual size of the source address. The returned address is truncated if the buffer provided is too small; in this case, addrlen will return a value greater than was supplied to the call. If the caller is not interested in the source address, src_addr and addrlen should be specified as NULL. recv() The recv() call is normally used only on a connected socket (see connect(2)). It is equivalent to the call: recvfrom(fd, buf, len, flags, NULL, 0); recvmsg() The recvmsg() call uses a msghdr structure to minimize the number of directly supplied arguments.
This structure is defined as follows in <sys/socket.h>: struct iovec { /* Scatter/gather array items */ void *iov_base; /* Starting address */ size_t iov_len; /* Number of bytes to transfer */ }; struct msghdr { void *msg_name; /* optional address */ socklen_t msg_namelen; /* size of address */ struct iovec *msg_iov; /* scatter/gather array */ size_t msg_iovlen; /* # elements in msg_iov */ void *msg_control; /* ancillary data, see below */ size_t msg_controllen; /* ancillary data buffer len */ int msg_flags; /* flags on received message */ }; The msg_name field points to a caller-allocated buffer that is used to return the source address if the socket is unconnected. The caller should set msg_namelen to the size of this buffer before this call; upon return from a successful call, msg_namelen will contain the length of the returned address. If the application does not need to know the source address, msg_name can be specified as NULL. The fields msg_iov and msg_iovlen describe scatter-gather locations, as discussed in readv(2). The field msg_control, which has length msg_controllen, points to a buffer for other protocol control-related messages or miscellaneous ancillary data. When recvmsg() is called, msg_controllen should contain the length of the available buffer in msg_control; upon return from a successful call it will contain the length of the control message sequence. The messages are of the form: struct cmsghdr { size_t cmsg_len; /* Data byte count, including header (type is socklen_t in POSIX) */ int cmsg_level; /* Originating protocol */ int cmsg_type; /* Protocol-specific type */ /* followed by unsigned char cmsg_data[]; */ }; Ancillary data should be accessed only by the macros defined in cmsg(3). As an example, Linux uses this ancillary data mechanism to pass extended errors, IP options, or file descriptors over UNIX domain sockets. The msg_flags field in the msghdr is set on return of recvmsg().
It can contain several flags: MSG_EOR indicates end-of-record; the data returned completed a record (generally used with sockets of type SOCK_SEQPACKET). MSG_TRUNC indicates that the trailing portion of a datagram was discarded because the datagram was larger than the buffer supplied. MSG_CTRUNC indicates that some control data were discarded due to lack of space in the buffer for ancillary data. MSG_OOB is returned to indicate that expedited or out-of-band data were received. MSG_ERRQUEUE indicates that no data was received but an extended error from the socket error queue. http://man7.org/linux/man-pages/man2/recvmsg.2.html 11 SYSTEM CALL: recv(2) - Linux manual page FUNCTIONALITY: recv, recvfrom, recvmsg - receive a message from a socket SYNOPSIS: #include <sys/types.h> #include <sys/socket.h> ssize_t recv(int sockfd, void *buf, size_t len, int flags); ssize_t recvfrom(int sockfd, void *buf, size_t len, int flags, struct sockaddr *src_addr, socklen_t *addrlen); ssize_t recvmsg(int sockfd, struct msghdr *msg, int flags); DESCRIPTION The recv(), recvfrom(), and recvmsg() calls are used to receive messages from a socket. They may be used to receive data on both connectionless and connection-oriented sockets. This page first describes common features of all three system calls, and then describes the differences between the calls. The only difference between recv() and read(2) is the presence of flags. With a zero flags argument, recv() is generally equivalent to read(2) (but see NOTES). Also, the following call recv(sockfd, buf, len, flags); is equivalent to recvfrom(sockfd, buf, len, flags, NULL, NULL); All three calls return the length of the message on successful completion. If a message is too long to fit in the supplied buffer, excess bytes may be discarded depending on the type of socket the message is received from.
If no messages are available at the socket, the receive calls wait for a message to arrive, unless the socket is nonblocking (see fcntl(2)), in which case the value -1 is returned and the external variable errno is set to EAGAIN or EWOULDBLOCK. The receive calls normally return any data available, up to the requested amount, rather than waiting for receipt of the full amount requested. An application can use select(2), poll(2), or epoll(7) to determine when more data arrives on a socket. The flags argument The flags argument is formed by ORing one or more of the following values: MSG_CMSG_CLOEXEC (recvmsg() only; since Linux 2.6.23) Set the close-on-exec flag for the file descriptor received via a UNIX domain file descriptor using the SCM_RIGHTS operation (described in unix(7)). This flag is useful for the same reasons as the O_CLOEXEC flag of open(2). MSG_DONTWAIT (since Linux 2.2) Enables nonblocking operation; if the operation would block, the call fails with the error EAGAIN or EWOULDBLOCK. This provides similar behavior to setting the O_NONBLOCK flag (via the fcntl(2) F_SETFL operation), but differs in that MSG_DONTWAIT is a per-call option, whereas O_NONBLOCK is a setting on the open file description (see open(2)), which will affect all threads in the calling process as well as other processes that hold file descriptors referring to the same open file description. MSG_ERRQUEUE (since Linux 2.2) This flag specifies that queued errors should be received from the socket error queue. The error is passed in an ancillary message with a type dependent on the protocol (for IPv4 IP_RECVERR). The user should supply a buffer of sufficient size. See cmsg(3) and ip(7) for more information. The payload of the original packet that caused the error is passed as normal data via msg_iov. The original destination address of the datagram that caused the error is supplied via msg_name.
For local errors, no address is passed (this can be checked with the cmsg_len member of the cmsghdr). For error receives, the MSG_ERRQUEUE flag is set in the msghdr. After an error has been passed, the pending socket error is regenerated based on the next queued error and will be passed on the next socket operation. The error is supplied in a sock_extended_err structure: #define SO_EE_ORIGIN_NONE 0 #define SO_EE_ORIGIN_LOCAL 1 #define SO_EE_ORIGIN_ICMP 2 #define SO_EE_ORIGIN_ICMP6 3 struct sock_extended_err { uint32_t ee_errno; /* error number */ uint8_t ee_origin; /* where the error originated */ uint8_t ee_type; /* type */ uint8_t ee_code; /* code */ uint8_t ee_pad; /* padding */ uint32_t ee_info; /* additional information */ uint32_t ee_data; /* other data */ /* More data may follow */ }; struct sockaddr *SO_EE_OFFENDER(struct sock_extended_err *); ee_errno contains the errno number of the queued error. ee_origin is the origin code of where the error originated. The other fields are protocol-specific. The macro SO_EE_OFFENDER returns a pointer to the address of the network object where the error originated, given a pointer to the ancillary message. If this address is not known, the sa_family member of the sockaddr contains AF_UNSPEC and the other fields of the sockaddr are undefined. MSG_OOB This flag requests receipt of out-of-band data that would not be received in the normal data stream. Some protocols place expedited data at the head of the normal data queue, and thus this flag cannot be used with such protocols.
MSG_PEEK This flag causes the receive operation to return data from the beginning of the receive queue without removing that data from the queue. Thus, a subsequent receive call will return the same data. MSG_TRUNC (since Linux 2.2) For raw (AF_PACKET), Internet datagram (since Linux 2.4.27/2.6.8), netlink (since Linux 2.6.22), and UNIX datagram (since Linux 3.4) sockets: return the real length of the packet or datagram, even when it was longer than the passed buffer. For use with Internet stream sockets, see tcp(7). MSG_WAITALL (since Linux 2.2) This flag requests that the operation block until the full request is satisfied. However, the call may still return less data than requested if a signal is caught, an error or disconnect occurs, or the next data to be received is of a different type than that returned. This flag has no effect for datagram sockets. recvfrom() recvfrom() places the received message into the buffer buf. The caller must specify the size of the buffer in len. If src_addr is not NULL, and the underlying protocol provides the source address of the message, that source address is placed in the buffer pointed to by src_addr. In this case, addrlen is a value- result argument. Before the call, it should be initialized to the size of the buffer associated with src_addr. Upon return, addrlen is updated to contain the actual size of the source address. The returned address is truncated if the buffer provided is too small; in this case, addrlen will return a value greater than was supplied to the call. If the caller is not interested in the source address, src_addr and addrlen should be specified as NULL. recv() The recv() call is normally used only on a connected socket (see connect(2)). It is equivalent to the call: recvfrom(fd, buf, len, flags, NULL, 0); recvmsg() The recvmsg() call uses a msghdr structure to minimize the number of directly supplied arguments. 
This structure is defined as follows in <sys/socket.h>: struct iovec { /* Scatter/gather array items */ void *iov_base; /* Starting address */ size_t iov_len; /* Number of bytes to transfer */ }; struct msghdr { void *msg_name; /* optional address */ socklen_t msg_namelen; /* size of address */ struct iovec *msg_iov; /* scatter/gather array */ size_t msg_iovlen; /* # elements in msg_iov */ void *msg_control; /* ancillary data, see below */ size_t msg_controllen; /* ancillary data buffer len */ int msg_flags; /* flags on received message */ }; The msg_name field points to a caller-allocated buffer that is used to return the source address if the socket is unconnected. The caller should set msg_namelen to the size of this buffer before this call; upon return from a successful call, msg_namelen will contain the length of the returned address. If the application does not need to know the source address, msg_name can be specified as NULL. The fields msg_iov and msg_iovlen describe scatter-gather locations, as discussed in readv(2). The field msg_control, which has length msg_controllen, points to a buffer for other protocol control-related messages or miscellaneous ancillary data. When recvmsg() is called, msg_controllen should contain the length of the available buffer in msg_control; upon return from a successful call it will contain the length of the control message sequence. The messages are of the form: struct cmsghdr { size_t cmsg_len; /* Data byte count, including header (type is socklen_t in POSIX) */ int cmsg_level; /* Originating protocol */ int cmsg_type; /* Protocol-specific type */ /* followed by unsigned char cmsg_data[]; */ }; Ancillary data should be accessed only by the macros defined in cmsg(3). As an example, Linux uses this ancillary data mechanism to pass extended errors, IP options, or file descriptors over UNIX domain sockets. The msg_flags field in the msghdr is set on return of recvmsg().
It can contain several flags: MSG_EOR indicates end-of-record; the data returned completed a record (generally used with sockets of type SOCK_SEQPACKET). MSG_TRUNC indicates that the trailing portion of a datagram was discarded because the datagram was larger than the buffer supplied. MSG_CTRUNC indicates that some control data were discarded due to lack of space in the buffer for ancillary data. MSG_OOB is returned to indicate that expedited or out-of-band data were received. MSG_ERRQUEUE indicates that no data was received, but rather an extended error from the socket error queue. http://man7.org/linux/man-pages/man2/recvmmsg.2.html 12 SYSTEM CALL: recvmmsg(2) - Linux manual page FUNCTIONALITY: recvmmsg - receive multiple messages on a socket SYNOPSIS: #define _GNU_SOURCE /* See feature_test_macros(7) */ #include <sys/socket.h> int recvmmsg(int sockfd, struct mmsghdr *msgvec, unsigned int vlen, unsigned int flags, struct timespec *timeout); DESCRIPTION The recvmmsg() system call is an extension of recvmsg(2) that allows the caller to receive multiple messages from a socket using a single system call. (This has performance benefits for some applications.) A further extension over recvmsg(2) is support for a timeout on the receive operation. The sockfd argument is the file descriptor of the socket to receive data from. The msgvec argument is a pointer to an array of mmsghdr structures. The size of this array is specified in vlen. The mmsghdr structure is defined in <sys/socket.h> as: struct mmsghdr { struct msghdr msg_hdr; /* Message header */ unsigned int msg_len; /* Number of received bytes for header */ }; The msg_hdr field is a msghdr structure, as described in recvmsg(2). The msg_len field is the number of bytes returned for the message in the entry. This field has the same value as the return value of a single recvmsg(2) on the header. The flags argument contains flags ORed together.
The flags are the same as documented for recvmsg(2), with the following addition: MSG_WAITFORONE (since Linux 2.6.34) Turns on MSG_DONTWAIT after the first message has been received. The timeout argument points to a struct timespec (see clock_gettime(2)) defining a timeout (seconds plus nanoseconds) for the receive operation (but see BUGS!). (This interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the blocking interval may overrun by a small amount.) If timeout is NULL, then the operation blocks indefinitely. A blocking recvmmsg() call blocks until vlen messages have been received or until the timeout expires. A nonblocking call reads as many messages as are available (up to the limit specified by vlen) and returns immediately. On return from recvmmsg(), successive elements of msgvec are updated to contain information about each received message: msg_len contains the size of the received message; the subfields of msg_hdr are updated as described in recvmsg(2). The return value of the call indicates the number of elements of msgvec that have been updated. http://man7.org/linux/man-pages/man2/sendto.2.html 12 SYSTEM CALL: send(2) - Linux manual page FUNCTIONALITY: send, sendto, sendmsg - send a message on a socket SYNOPSIS: #include <sys/types.h> #include <sys/socket.h> ssize_t send(int sockfd, const void *buf, size_t len, int flags); ssize_t sendto(int sockfd, const void *buf, size_t len, int flags, const struct sockaddr *dest_addr, socklen_t addrlen); ssize_t sendmsg(int sockfd, const struct msghdr *msg, int flags); DESCRIPTION The system calls send(), sendto(), and sendmsg() are used to transmit a message to another socket. The send() call may be used only when the socket is in a connected state (so that the intended recipient is known). The only difference between send() and write(2) is the presence of flags. With a zero flags argument, send() is equivalent to write(2).
Also, the following call send(sockfd, buf, len, flags); is equivalent to sendto(sockfd, buf, len, flags, NULL, 0); The argument sockfd is the file descriptor of the sending socket. If sendto() is used on a connection-mode (SOCK_STREAM, SOCK_SEQPACKET) socket, the arguments dest_addr and addrlen are ignored (and the error EISCONN may be returned when they are not NULL and 0), and the error ENOTCONN is returned when the socket was not actually connected. Otherwise, the address of the target is given by dest_addr with addrlen specifying its size. For sendmsg(), the address of the target is given by msg.msg_name, with msg.msg_namelen specifying its size. For send() and sendto(), the message is found in buf and has length len. For sendmsg(), the message is pointed to by the elements of the array msg.msg_iov. The sendmsg() call also allows sending ancillary data (also known as control information). If the message is too long to pass atomically through the underlying protocol, the error EMSGSIZE is returned, and the message is not transmitted. No indication of failure to deliver is implicit in a send(). Locally detected errors are indicated by a return value of -1. When the message does not fit into the send buffer of the socket, send() normally blocks, unless the socket has been placed in nonblocking I/O mode. In nonblocking mode it would fail with the error EAGAIN or EWOULDBLOCK in this case. The select(2) call may be used to determine when it is possible to send more data. The flags argument The flags argument is the bitwise OR of zero or more of the following flags. MSG_CONFIRM (since Linux 2.3.15) Tell the link layer that forward progress happened: you got a successful reply from the other side. If the link layer doesn't get this it will regularly reprobe the neighbor (e.g., via a unicast ARP). Only valid on SOCK_DGRAM and SOCK_RAW sockets and currently implemented only for IPv4 and IPv6. See arp(7) for details. 
MSG_DONTROUTE Don't use a gateway to send out the packet; send only to hosts on directly connected networks. This is usually used only by diagnostic or routing programs. This is defined only for protocol families that route; packet sockets don't. MSG_DONTWAIT (since Linux 2.2) Enables nonblocking operation; if the operation would block, EAGAIN or EWOULDBLOCK is returned. This provides similar behavior to setting the O_NONBLOCK flag (via the fcntl(2) F_SETFL operation), but differs in that MSG_DONTWAIT is a per-call option, whereas O_NONBLOCK is a setting on the open file description (see open(2)), which will affect all threads in the calling process as well as other processes that hold file descriptors referring to the same open file description. MSG_EOR (since Linux 2.2) Terminates a record (when this notion is supported, as for sockets of type SOCK_SEQPACKET). MSG_MORE (since Linux 2.4.4) The caller has more data to send. This flag is used with TCP sockets to obtain the same effect as the TCP_CORK socket option (see tcp(7)), with the difference that this flag can be set on a per-call basis. Since Linux 2.6, this flag is also supported for UDP sockets, and informs the kernel to package all of the data sent in calls with this flag set into a single datagram which is transmitted only when a call is performed that does not specify this flag. (See also the UDP_CORK socket option described in udp(7).) MSG_NOSIGNAL (since Linux 2.2) Don't generate a SIGPIPE signal if the peer on a stream-oriented socket has closed the connection. The EPIPE error is still returned. This provides similar behavior to using sigaction(2) to ignore SIGPIPE, but, whereas MSG_NOSIGNAL is a per-call feature, ignoring SIGPIPE sets a process attribute that affects all threads in the process. MSG_OOB Sends out-of-band data on sockets that support this notion (e.g., of type SOCK_STREAM); the underlying protocol must also support out-of-band data.
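As an illustration of the MSG_NOSIGNAL flag described in this list, the following is a minimal sketch rather than code from the manual page. The helper name send_to_closed_peer() is hypothetical, and the sketch assumes a Linux AF_UNIX stream socket pair from socketpair(2). After one end is closed, a send() on the other end with MSG_NOSIGNAL should fail with EPIPE without raising SIGPIPE.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, for illustration only.
 * Returns the errno reported by send() after the peer has closed,
 * 0 if send() unexpectedly succeeded, or -1 if setup failed. */
int send_to_closed_peer(void)
{
    int sv[2];
    int err = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;

    /* Close the receiving end so that the connection is broken. */
    close(sv[1]);

    /* Without MSG_NOSIGNAL this call would also deliver SIGPIPE,
     * whose default action terminates the process. */
    if (send(sv[0], "x", 1, MSG_NOSIGNAL) == -1)
        err = errno;            /* save before close() can change it */

    close(sv[0]);
    return err;
}
```

Because the flag is given per call, other threads that rely on SIGPIPE delivery are unaffected, which is the difference from ignoring the signal process-wide with sigaction(2).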
sendmsg() The definition of the msghdr structure employed by sendmsg() is as follows: struct msghdr { void *msg_name; /* optional address */ socklen_t msg_namelen; /* size of address */ struct iovec *msg_iov; /* scatter/gather array */ size_t msg_iovlen; /* # elements in msg_iov */ void *msg_control; /* ancillary data, see below */ size_t msg_controllen; /* ancillary data buffer len */ int msg_flags; /* flags (unused) */ }; The msg_name field is used on an unconnected socket to specify the target address for a datagram. It points to a buffer containing the address; the msg_namelen field should be set to the size of the address. For a connected socket, these fields should be specified as NULL and 0, respectively. The msg_iov and msg_iovlen fields specify scatter-gather locations, as for writev(2). You may send control information using the msg_control and msg_controllen members. The maximum control buffer length the kernel can process is limited per socket by the value in /proc/sys/net/core/optmem_max; see socket(7). The msg_flags field is ignored. http://man7.org/linux/man-pages/man2/sendmsg.2.html 12 SYSTEM CALL: send(2) - Linux manual page FUNCTIONALITY: send, sendto, sendmsg - send a message on a socket SYNOPSIS: #include <sys/types.h> #include <sys/socket.h> ssize_t send(int sockfd, const void *buf, size_t len, int flags); ssize_t sendto(int sockfd, const void *buf, size_t len, int flags, const struct sockaddr *dest_addr, socklen_t addrlen); ssize_t sendmsg(int sockfd, const struct msghdr *msg, int flags); DESCRIPTION The system calls send(), sendto(), and sendmsg() are used to transmit a message to another socket. The send() call may be used only when the socket is in a connected state (so that the intended recipient is known). The only difference between send() and write(2) is the presence of flags. With a zero flags argument, send() is equivalent to write(2).
Also, the following call send(sockfd, buf, len, flags); is equivalent to sendto(sockfd, buf, len, flags, NULL, 0); The argument sockfd is the file descriptor of the sending socket. If sendto() is used on a connection-mode (SOCK_STREAM, SOCK_SEQPACKET) socket, the arguments dest_addr and addrlen are ignored (and the error EISCONN may be returned when they are not NULL and 0), and the error ENOTCONN is returned when the socket was not actually connected. Otherwise, the address of the target is given by dest_addr with addrlen specifying its size. For sendmsg(), the address of the target is given by msg.msg_name, with msg.msg_namelen specifying its size. For send() and sendto(), the message is found in buf and has length len. For sendmsg(), the message is pointed to by the elements of the array msg.msg_iov. The sendmsg() call also allows sending ancillary data (also known as control information). If the message is too long to pass atomically through the underlying protocol, the error EMSGSIZE is returned, and the message is not transmitted. No indication of failure to deliver is implicit in a send(). Locally detected errors are indicated by a return value of -1. When the message does not fit into the send buffer of the socket, send() normally blocks, unless the socket has been placed in nonblocking I/O mode. In nonblocking mode it would fail with the error EAGAIN or EWOULDBLOCK in this case. The select(2) call may be used to determine when it is possible to send more data. The flags argument The flags argument is the bitwise OR of zero or more of the following flags. MSG_CONFIRM (since Linux 2.3.15) Tell the link layer that forward progress happened: you got a successful reply from the other side. If the link layer doesn't get this it will regularly reprobe the neighbor (e.g., via a unicast ARP). Only valid on SOCK_DGRAM and SOCK_RAW sockets and currently implemented only for IPv4 and IPv6. See arp(7) for details. 
MSG_DONTROUTE Don't use a gateway to send out the packet; send only to hosts on directly connected networks. This is usually used only by diagnostic or routing programs. This is defined only for protocol families that route; packet sockets don't. MSG_DONTWAIT (since Linux 2.2) Enables nonblocking operation; if the operation would block, EAGAIN or EWOULDBLOCK is returned. This provides similar behavior to setting the O_NONBLOCK flag (via the fcntl(2) F_SETFL operation), but differs in that MSG_DONTWAIT is a per-call option, whereas O_NONBLOCK is a setting on the open file description (see open(2)), which will affect all threads in the calling process as well as other processes that hold file descriptors referring to the same open file description. MSG_EOR (since Linux 2.2) Terminates a record (when this notion is supported, as for sockets of type SOCK_SEQPACKET). MSG_MORE (since Linux 2.4.4) The caller has more data to send. This flag is used with TCP sockets to obtain the same effect as the TCP_CORK socket option (see tcp(7)), with the difference that this flag can be set on a per-call basis. Since Linux 2.6, this flag is also supported for UDP sockets, and informs the kernel to package all of the data sent in calls with this flag set into a single datagram which is transmitted only when a call is performed that does not specify this flag. (See also the UDP_CORK socket option described in udp(7).) MSG_NOSIGNAL (since Linux 2.2) Don't generate a SIGPIPE signal if the peer on a stream-oriented socket has closed the connection. The EPIPE error is still returned. This provides similar behavior to using sigaction(2) to ignore SIGPIPE, but, whereas MSG_NOSIGNAL is a per-call feature, ignoring SIGPIPE sets a process attribute that affects all threads in the process. MSG_OOB Sends out-of-band data on sockets that support this notion (e.g., of type SOCK_STREAM); the underlying protocol must also support out-of-band data.
sendmsg() The definition of the msghdr structure employed by sendmsg() is as follows: struct msghdr { void *msg_name; /* optional address */ socklen_t msg_namelen; /* size of address */ struct iovec *msg_iov; /* scatter/gather array */ size_t msg_iovlen; /* # elements in msg_iov */ void *msg_control; /* ancillary data, see below */ size_t msg_controllen; /* ancillary data buffer len */ int msg_flags; /* flags (unused) */ }; The msg_name field is used on an unconnected socket to specify the target address for a datagram. It points to a buffer containing the address; the msg_namelen field should be set to the size of the address. For a connected socket, these fields should be specified as NULL and 0, respectively. The msg_iov and msg_iovlen fields specify scatter-gather locations, as for writev(2). You may send control information using the msg_control and msg_controllen members. The maximum control buffer length the kernel can process is limited per socket by the value in /proc/sys/net/core/optmem_max; see socket(7). The msg_flags field is ignored. http://man7.org/linux/man-pages/man2/sendmmsg.2.html 12 SYSTEM CALL: sendmmsg(2) - Linux manual page FUNCTIONALITY: sendmmsg - send multiple messages on a socket SYNOPSIS: #define _GNU_SOURCE /* See feature_test_macros(7) */ #include <sys/socket.h> int sendmmsg(int sockfd, struct mmsghdr *msgvec, unsigned int vlen, unsigned int flags); DESCRIPTION The sendmmsg() system call is an extension of sendmsg(2) that allows the caller to transmit multiple messages on a socket using a single system call. (This has performance benefits for some applications.) The sockfd argument is the file descriptor of the socket on which data is to be transmitted. The msgvec argument is a pointer to an array of mmsghdr structures. The size of this array is specified in vlen.
The mmsghdr structure is defined in <sys/socket.h> as: struct mmsghdr { struct msghdr msg_hdr; /* Message header */ unsigned int msg_len; /* Number of bytes transmitted */ }; The msg_hdr field is a msghdr structure, as described in sendmsg(2). The msg_len field is used to return the number of bytes sent from the message in msg_hdr (i.e., the same as the return value from a single sendmsg(2) call). The flags argument contains flags ORed together. The flags are the same as for sendmsg(2). A blocking sendmmsg() call blocks until vlen messages have been sent. A nonblocking call sends as many messages as possible (up to the limit specified by vlen) and returns immediately. On return from sendmmsg(), the msg_len fields of successive elements of msgvec are updated to contain the number of bytes transmitted from the corresponding msg_hdr. The return value of the call indicates the number of elements of msgvec that have been updated. http://man7.org/linux/man-pages/man2/sethostname.2.html 10 SYSTEM CALL: gethostname(2) - Linux manual page FUNCTIONALITY: gethostname, sethostname - get/set hostname SYNOPSIS: #include <unistd.h> int gethostname(char *name, size_t len); int sethostname(const char *name, size_t len); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): gethostname(): Since glibc 2.12: _BSD_SOURCE || _XOPEN_SOURCE >= 500 || /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200112L sethostname(): Since glibc 2.21: _DEFAULT_SOURCE In glibc 2.19 and 2.20: _DEFAULT_SOURCE || (_XOPEN_SOURCE && _XOPEN_SOURCE < 500) Up to and including glibc 2.19: _BSD_SOURCE || (_XOPEN_SOURCE && _XOPEN_SOURCE < 500) DESCRIPTION These system calls are used to access or to change the hostname of the current processor. sethostname() sets the hostname to the value given in the character array name. The len argument specifies the number of bytes in name. (Thus, name does not require a terminating null byte.)
gethostname() returns the null-terminated hostname in the character array name, which has a length of len bytes. If the null-terminated hostname is too large to fit, then the name is truncated, and no error is returned (but see NOTES below). POSIX.1 says that if such truncation occurs, then it is unspecified whether the returned buffer includes a terminating null byte. http://man7.org/linux/man-pages/man2/setdomainname.2.html 10 SYSTEM CALL: getdomainname(2) - Linux manual page FUNCTIONALITY: getdomainname, setdomainname - get/set NIS domain name SYNOPSIS: #include <unistd.h> int getdomainname(char *name, size_t len); int setdomainname(const char *name, size_t len); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): getdomainname(), setdomainname(): Since glibc 2.21: _DEFAULT_SOURCE In glibc 2.19 and 2.20: _DEFAULT_SOURCE || (_XOPEN_SOURCE && _XOPEN_SOURCE < 500) Up to and including glibc 2.19: _BSD_SOURCE || (_XOPEN_SOURCE && _XOPEN_SOURCE < 500) DESCRIPTION These functions are used to access or to change the NIS domain name of the host system. setdomainname() sets the domain name to the value given in the character array name. The len argument specifies the number of bytes in name. (Thus, name does not require a terminating null byte.) getdomainname() returns the null-terminated domain name in the character array name, which has a length of len bytes. If the null-terminated domain name requires more than len bytes, getdomainname() returns the first len bytes (glibc) or gives an error (libc). http://man7.org/linux/man-pages/man2/bpf.2.html 12 SYSTEM CALL: bpf(2) - Linux manual page FUNCTIONALITY: bpf - perform a command on an extended BPF map or program SYNOPSIS: #include <linux/bpf.h> int bpf(int cmd, union bpf_attr *attr, unsigned int size); DESCRIPTION The bpf() system call performs a range of operations related to extended Berkeley Packet Filters. Extended BPF (or eBPF) is similar to the original ("classic") BPF (cBPF) used to filter network packets.
For both cBPF and eBPF programs, the kernel statically analyzes the programs before loading them, in order to ensure that they cannot harm the running system. eBPF extends cBPF in multiple ways, including the ability to call a fixed set of in-kernel helper functions (via the BPF_CALL opcode extension provided by eBPF) and access shared data structures such as eBPF maps. Extended BPF Design/Architecture eBPF maps are a generic data structure for storage of different data types. Data types are generally treated as binary blobs, so a user just specifies the size of the key and the size of the value at map- creation time. In other words, a key/value for a given map can have an arbitrary structure. A user process can create multiple maps (with key/value-pairs being opaque bytes of data) and access them via file descriptors. Different eBPF programs can access the same maps in parallel. It's up to the user process and eBPF program to decide what they store inside maps. There's one special map type, called a program array. This type of map stores file descriptors referring to other eBPF programs. When a lookup in the map is performed, the program flow is redirected in- place to the beginning of another eBPF program and does not return back to the calling program. The level of nesting has a fixed limit of 32, so that infinite loops cannot be crafted. At runtime, the program file descriptors stored in the map can be modified, so program functionality can be altered based on specific requirements. All programs referred to in a program-array map must have been previously loaded into the kernel via bpf(). If a map lookup fails, the current program continues its execution. See BPF_MAP_TYPE_PROG_ARRAY below for further details. Generally, eBPF programs are loaded by the user process and automatically unloaded when the process exits. 
In some cases, for example, tc-bpf(8), the program will continue to stay alive inside the kernel even after the process that loaded the program exits. In that case, the tc subsystem holds a reference to the eBPF program after the file descriptor has been closed by the user-space program. Thus, whether a specific program continues to live inside the kernel depends on how it is further attached to a given kernel subsystem after it was loaded via bpf(). Each eBPF program is a set of instructions that is safe to run until its completion. An in-kernel verifier statically determines that the eBPF program terminates and is safe to execute. During verification, the kernel increments reference counts for each of the maps that the eBPF program uses, so that the attached maps can't be removed until the program is unloaded. eBPF programs can be attached to different events. These events can be the arrival of network packets, tracing events, classification events by network queueing disciplines (for eBPF programs attached to a tc(8) classifier), and other types that may be added in the future. A new event triggers execution of the eBPF program, which may store information about the event in eBPF maps. Beyond storing data, eBPF programs may call a fixed set of in-kernel helper functions. The same eBPF program can be attached to multiple events and different eBPF programs can access the same map:

    tracing     tracing    tracing    packet      packet     packet
    event A     event B    event C    on eth0     on eth1    on eth2
     |             |         |          |           |          ^
     |             |         |          |           v          |
     --> tracing <--     tracing      socket    tc ingress   tc egress
          prog_1          prog_2      prog_3    classifier    action
          |  |              |           |         prog_4      prog_5
       |---  -----|  |------|          map_3        |           |
     map_1       map_2                              --| map_4 |--

Arguments The operation to be performed by the bpf() system call is determined by the cmd argument. Each operation takes an accompanying argument, provided via attr, which is a pointer to a union of type bpf_attr (see below). The size argument is the size of the union pointed to by attr.
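The three-argument calling convention just described can be sketched as a thin wrapper around syscall(2). This is a minimal sketch under stated assumptions, not a definitive implementation: it assumes the C library in use provides no bpf() wrapper of its own, that the kernel headers supply <linux/bpf.h>, and that __NR_bpf is defined by <sys/syscall.h> (headers from Linux 3.18 or later).

```c
#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed wrapper: forwards cmd, attr, and size unchanged to the kernel.
 * Returns the system call's result: a command-specific value on success,
 * or -1 with errno set on failure. */
int bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return (int) syscall(__NR_bpf, cmd, attr, size);
}

/* Illustrative helper (hypothetical name): issues a deliberately invalid
 * command to show the error path; the call is expected to fail with -1. */
int bpf_invalid_cmd_demo(void)
{
    union bpf_attr attr;

    /* Zero the whole union so that unused fields hold no stray data. */
    memset(&attr, 0, sizeof(attr));
    return bpf(-1, &attr, sizeof(attr));
}
```

Zeroing the union before filling in the fields for a given command is a defensive habit; the wrappers shown in this page achieve the same effect with designated initializers.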
The value provided in cmd is one of the following: BPF_MAP_CREATE Create a map and return a file descriptor that refers to the map. The close-on-exec file descriptor flag (see fcntl(2)) is automatically enabled for the new file descriptor. BPF_MAP_LOOKUP_ELEM Look up an element by key in a specified map and return its value. BPF_MAP_UPDATE_ELEM Create or update an element (key/value pair) in a specified map. BPF_MAP_DELETE_ELEM Look up and delete an element by key in a specified map. BPF_MAP_GET_NEXT_KEY Look up an element by key in a specified map and return the key of the next element. BPF_PROG_LOAD Verify and load an eBPF program, returning a new file descriptor associated with the program. The close-on-exec file descriptor flag (see fcntl(2)) is automatically enabled for the new file descriptor. The bpf_attr union consists of various anonymous structures that are used by different bpf() commands: union bpf_attr { struct { /* Used by BPF_MAP_CREATE */ __u32 map_type; __u32 key_size; /* size of key in bytes */ __u32 value_size; /* size of value in bytes */ __u32 max_entries; /* maximum number of entries in a map */ }; struct { /* Used by BPF_MAP_*_ELEM and BPF_MAP_GET_NEXT_KEY commands */ __u32 map_fd; __aligned_u64 key; union { __aligned_u64 value; __aligned_u64 next_key; }; __u64 flags; }; struct { /* Used by BPF_PROG_LOAD */ __u32 prog_type; __u32 insn_cnt; __aligned_u64 insns; /* 'const struct bpf_insn *' */ __aligned_u64 license; /* 'const char *' */ __u32 log_level; /* verbosity level of verifier */ __u32 log_size; /* size of user buffer */ __aligned_u64 log_buf; /* user supplied 'char *' buffer */ __u32 kern_version; /* checked when prog_type=kprobe (since Linux 4.1) */ }; } __attribute__((aligned(8))); eBPF maps Maps are a generic data structure for storage of different types of data. They allow sharing of data between eBPF kernel programs, and also between kernel and user-space applications. 
Each map type has the following attributes: * type * maximum number of elements * key size in bytes * value size in bytes The following wrapper functions demonstrate how various bpf() commands can be used to access the maps. The functions use the cmd argument to invoke different operations. BPF_MAP_CREATE The BPF_MAP_CREATE command creates a new map, returning a new file descriptor that refers to the map. int bpf_create_map(enum bpf_map_type map_type, unsigned int key_size, unsigned int value_size, unsigned int max_entries) { union bpf_attr attr = { .map_type = map_type, .key_size = key_size, .value_size = value_size, .max_entries = max_entries }; return bpf(BPF_MAP_CREATE, &attr, sizeof(attr)); } The new map has the type specified by map_type, and attributes as specified in key_size, value_size, and max_entries. On success, this operation returns a file descriptor. On error, -1 is returned and errno is set to EINVAL, EPERM, or ENOMEM. The key_size and value_size attributes will be used by the verifier during program loading to check that the program is calling bpf_map_*_elem() helper functions with a correctly initialized key and to check that the program doesn't access the map element value beyond the specified value_size. For example, when a map is created with a key_size of 8 and the eBPF program calls bpf_map_lookup_elem(map_fd, fp - 4) the program will be rejected, since the in-kernel helper function bpf_map_lookup_elem(map_fd, void *key) expects to read 8 bytes from the location pointed to by key, but the fp - 4 (where fp is the top of the stack) starting address will cause out-of-bounds stack access. Similarly, when a map is created with a value_size of 1 and the eBPF program contains value = bpf_map_lookup_elem(...); *(u32 *) value = 1; the program will be rejected, since it accesses the value pointer beyond the specified 1 byte value_size limit. 
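The map commands listed above (BPF_MAP_CREATE, BPF_MAP_UPDATE_ELEM, BPF_MAP_GET_NEXT_KEY) can be combined into a small user-space round trip. The sketch below is illustrative only: the function name bpf_map_roundtrip and the helper sys_bpf are hypothetical, and it assumes <linux/bpf.h> is available. It clears the whole union with memset() before each call, on the assumption that the kernel rejects requests whose unused fields are not zero. On systems where the caller lacks the required privileges the function simply returns -1.

```c
#define _GNU_SOURCE
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: raw bpf() call with the full union size. */
static int sys_bpf(int cmd, union bpf_attr *attr)
{
    return (int) syscall(__NR_bpf, cmd, attr, sizeof(*attr));
}

/* Create a hash map, insert keys 1..n, then count the keys by walking the
   map with BPF_MAP_GET_NEXT_KEY.  Returns the count, or -1 if the map
   cannot be created or filled (for example, EPERM). */
int bpf_map_roundtrip(uint32_t n)
{
    union bpf_attr attr;
    uint32_t key, next_key;
    uint64_t value;
    int fd, count = 0;

    memset(&attr, 0, sizeof(attr));
    attr.map_type = BPF_MAP_TYPE_HASH;
    attr.key_size = sizeof(key);
    attr.value_size = sizeof(value);
    attr.max_entries = n;
    fd = sys_bpf(BPF_MAP_CREATE, &attr);
    if (fd < 0)
        return -1;

    for (key = 1; key <= n; key++) {
        value = (uint64_t) key * 10;
        memset(&attr, 0, sizeof(attr));
        attr.map_fd = fd;
        attr.key = (uint64_t) (uintptr_t) &key;
        attr.value = (uint64_t) (uintptr_t) &value;
        attr.flags = BPF_NOEXIST;        /* create only, never overwrite */
        if (sys_bpf(BPF_MAP_UPDATE_ELEM, &attr) < 0) {
            close(fd);
            return -1;
        }
    }

    key = 0;   /* never inserted, so the walk starts at the first element */
    for (;;) {
        memset(&attr, 0, sizeof(attr));
        attr.map_fd = fd;
        attr.key = (uint64_t) (uintptr_t) &key;
        attr.next_key = (uint64_t) (uintptr_t) &next_key;
        if (sys_bpf(BPF_MAP_GET_NEXT_KEY, &attr) < 0)
            break;   /* ENOENT after the last key, or any other error */
        count++;
        key = next_key;
    }

    close(fd);       /* drops the last reference, so the map is deleted */
    return count;
}
```

Calling bpf_map_roundtrip(3) is expected to return 3 when the caller is allowed to create maps, and -1 otherwise.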
Currently, the following values are supported for map_type: enum bpf_map_type { BPF_MAP_TYPE_UNSPEC, /* Reserve 0 as invalid map type */ BPF_MAP_TYPE_HASH, BPF_MAP_TYPE_ARRAY, BPF_MAP_TYPE_PROG_ARRAY, }; map_type selects one of the available map implementations in the kernel. For all map types, eBPF programs access maps with the same bpf_map_lookup_elem() and bpf_map_update_elem() helper functions. Further details of the various map types are given below. BPF_MAP_LOOKUP_ELEM The BPF_MAP_LOOKUP_ELEM command looks up an element with a given key in the map referred to by the file descriptor fd. int bpf_lookup_elem(int fd, const void *key, void *value) { union bpf_attr attr = { .map_fd = fd, .key = ptr_to_u64(key), .value = ptr_to_u64(value), }; return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr)); } If an element is found, the operation returns zero and stores the element's value into value, which must point to a buffer of value_size bytes. If no element is found, the operation returns -1 and sets errno to ENOENT. BPF_MAP_UPDATE_ELEM The BPF_MAP_UPDATE_ELEM command creates or updates an element with a given key/value in the map referred to by the file descriptor fd. int bpf_update_elem(int fd, const void *key, const void *value, uint64_t flags) { union bpf_attr attr = { .map_fd = fd, .key = ptr_to_u64(key), .value = ptr_to_u64(value), .flags = flags, }; return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr)); } The flags argument should be specified as one of the following: BPF_ANY Create a new element or update an existing element. BPF_NOEXIST Create a new element only if it did not exist. BPF_EXIST Update an existing element. On success, the operation returns zero. On error, -1 is returned and errno is set to EINVAL, EPERM, ENOMEM, or E2BIG. E2BIG indicates that the number of elements in the map reached the max_entries limit specified at map creation time. EEXIST will be returned if flags specifies BPF_NOEXIST and the element with key already exists in the map. 
ENOENT will be returned if flags specifies BPF_EXIST and the element with key doesn't exist in the map. BPF_MAP_DELETE_ELEM The BPF_MAP_DELETE_ELEM command deletes the element whose key is key from the map referred to by the file descriptor fd. int bpf_delete_elem(int fd, const void *key) { union bpf_attr attr = { .map_fd = fd, .key = ptr_to_u64(key), }; return bpf(BPF_MAP_DELETE_ELEM, &attr, sizeof(attr)); } On success, zero is returned. If the element is not found, -1 is returned and errno is set to ENOENT. BPF_MAP_GET_NEXT_KEY The BPF_MAP_GET_NEXT_KEY command looks up an element by key in the map referred to by the file descriptor fd and sets the next_key pointer to the key of the next element. int bpf_get_next_key(int fd, const void *key, void *next_key) { union bpf_attr attr = { .map_fd = fd, .key = ptr_to_u64(key), .next_key = ptr_to_u64(next_key), }; return bpf(BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr)); } If key is found, the operation returns zero and sets the next_key pointer to the key of the next element. If key is not found, the operation returns zero and sets the next_key pointer to the key of the first element. If key is the last element, -1 is returned and errno is set to ENOENT. Other possible errno values are ENOMEM, EFAULT, EPERM, and EINVAL. This method can be used to iterate over all elements in the map. close(map_fd) Delete the map referred to by the file descriptor map_fd. When the user-space program that created a map exits, all maps will be deleted automatically (but see NOTES). eBPF map types The following map types are supported: BPF_MAP_TYPE_HASH Hash-table maps have the following characteristics: * Maps are created and destroyed by user-space programs. Both user-space and eBPF programs can perform lookup, update, and delete operations. * The kernel takes care of allocating and freeing key/value pairs. * The map_update_elem() helper will fail to insert a new element when the max_entries limit is reached.
(This ensures that eBPF programs cannot exhaust memory.) * map_update_elem() replaces existing elements atomically. Hash-table maps are optimized for speed of lookup. BPF_MAP_TYPE_ARRAY Array maps have the following characteristics: * Optimized for fastest possible lookup. In the future the verifier/JIT compiler may recognize lookup() operations that employ a constant key and optimize them into a constant pointer. It is possible to optimize a non-constant key into direct pointer arithmetic as well, since pointers and value_size are constant for the life of the eBPF program. In other words, array_map_lookup_elem() may be 'inlined' by the verifier/JIT compiler while preserving concurrent access to this map from user space. * All array elements are pre-allocated and zero-initialized at init time. * The key is an array index, and must be exactly four bytes. * map_delete_elem() fails with the error EINVAL, since elements cannot be deleted. * map_update_elem() replaces elements in a nonatomic fashion; for atomic updates, a hash-table map should be used instead. There is however one special case that can also be used with arrays: the atomic built-in __sync_fetch_and_add() can be used on 32 and 64 bit atomic counters. For example, it can be applied to the whole value itself if it represents a single counter, or, in the case of a structure containing multiple counters, it can be used on individual counters. This is quite often useful for aggregation and accounting of events. Among the uses for array maps are the following: * As "global" eBPF variables: an array of 1 element whose key is (index) 0 and where the value is a collection of 'global' variables which eBPF programs can use to keep state between events. * Aggregation of tracing events into a fixed set of buckets. * Accounting of networking events, for example, number of packets and packet sizes.
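The counter pattern described above for array maps can be sketched in ordinary C. The structure pkt_stats and the function account_packet are hypothetical names; inside an eBPF program the pointer would typically be the value returned by bpf_map_lookup_elem(), whereas here the code is compiled as plain user-space C only to show how __sync_fetch_and_add() is applied to individual counters of a structure.

```c
#include <stdint.h>

/* Hypothetical value layout for one array-map slot. */
struct pkt_stats {
    uint64_t packets;   /* number of packets seen */
    uint64_t bytes;     /* sum of packet sizes */
};

/* Account one packet.  Each counter is bumped with its own atomic add,
   so concurrent updaters do not lose increments. */
static inline void account_packet(struct pkt_stats *s, uint64_t len)
{
    __sync_fetch_and_add(&s->packets, 1);
    __sync_fetch_and_add(&s->bytes, len);
}
```

Note that the two counters are updated by two separate atomic operations, so a reader may briefly observe a packet count that is ahead of the byte count; only each individual counter is atomic.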
BPF_MAP_TYPE_PROG_ARRAY (since Linux 4.2) A program array map is a special kind of array map whose map values contain only file descriptors referring to other eBPF programs. Thus, both the key_size and value_size must be exactly four bytes. This map is used in conjunction with the bpf_tail_call() helper. This means that an eBPF program with a program array map attached to it can call from kernel side into void bpf_tail_call(void *context, void *prog_map, unsigned int index); and therefore replace its own program flow with the one from the program at the given program array slot, if present. This can be regarded as a kind of jump table to a different eBPF program. The invoked program will then reuse the same stack. When a jump into the new program has been performed, it won't return to the old program anymore. If no eBPF program is found at the given index of the program array (because the map slot doesn't contain a valid program file descriptor, the specified lookup index/key is out of bounds, or the limit of 32 nested calls has been exceeded), execution continues with the current eBPF program. This can be used as a fall-through for default cases. A program array map is useful, for example, in tracing or networking, to handle individual system calls or protocols in their own subprograms and use their identifiers as an individual map index. This approach may result in performance benefits, and also makes it possible to overcome the maximum instruction limit of a single eBPF program. In dynamic environments, a user-space daemon might atomically replace individual subprograms at run-time with newer versions to alter overall program behavior, for instance, if global policies change. eBPF programs The BPF_PROG_LOAD command is used to load an eBPF program into the kernel. The return value for this command is a new file descriptor associated with this eBPF program.
char bpf_log_buf[LOG_BUF_SIZE]; int bpf_prog_load(enum bpf_prog_type type, const struct bpf_insn *insns, int insn_cnt, const char *license) { union bpf_attr attr = { .prog_type = type, .insns = ptr_to_u64(insns), .insn_cnt = insn_cnt, .license = ptr_to_u64(license), .log_buf = ptr_to_u64(bpf_log_buf), .log_size = LOG_BUF_SIZE, .log_level = 1, }; return bpf(BPF_PROG_LOAD, &attr, sizeof(attr)); } prog_type is one of the available program types: enum bpf_prog_type { BPF_PROG_TYPE_UNSPEC, /* Reserve 0 as invalid program type */ BPF_PROG_TYPE_SOCKET_FILTER, BPF_PROG_TYPE_KPROBE, BPF_PROG_TYPE_SCHED_CLS, BPF_PROG_TYPE_SCHED_ACT, }; For further details of eBPF program types, see below. The remaining fields of bpf_attr are set as follows: * insns is an array of struct bpf_insn instructions. * insn_cnt is the number of instructions in the program referred to by insns. * license is a license string, which must be GPL compatible to call helper functions marked gpl_only. (The licensing rules are the same as for kernel modules, so that also dual licenses, such as "Dual BSD/GPL", may be used.) * log_buf is a pointer to a caller-allocated buffer in which the in-kernel verifier can store the verification log. This log is a multi-line string that can be checked by the program author in order to understand how the verifier came to the conclusion that the eBPF program is unsafe. The format of the output can change at any time as the verifier evolves. * log_size is the size of the buffer pointed to by log_buf. If the size of the buffer is not large enough to store all verifier messages, -1 is returned and errno is set to ENOSPC. * log_level is the verbosity level of the verifier. A value of zero means that the verifier will not provide a log; in this case, log_buf must be a NULL pointer, and log_size must be zero. Applying close(2) to the file descriptor returned by BPF_PROG_LOAD will unload the eBPF program (but see NOTES).
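As an illustration of the fields described above, the following sketch loads a two-instruction socket filter that sets r0 to 0 and exits. It is not taken from the manual page: the function name load_trivial_filter and the buffer size are assumptions, the instruction encoding relies on the constants exported by <linux/bpf.h>, and the union is cleared with memset() on the assumption that unused fields must be zero. Without sufficient privileges the call is expected to fail, typically with EPERM.

```c
#define _GNU_SOURCE
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define LOG_BUF_SIZE 65536               /* assumed size, not mandated */
static char bpf_log_buf[LOG_BUF_SIZE];   /* receives the verifier log */

/* Load a trivial socket filter ("r0 = 0; exit").
   Returns a program file descriptor, or -1 with errno set. */
int load_trivial_filter(void)
{
    struct bpf_insn insns[2];
    union bpf_attr attr;

    memset(insns, 0, sizeof(insns));
    insns[0].code = BPF_ALU64 | BPF_MOV | BPF_K;   /* r0 = imm */
    insns[0].dst_reg = BPF_REG_0;
    insns[0].imm = 0;
    insns[1].code = BPF_JMP | BPF_EXIT;            /* return r0 */

    memset(&attr, 0, sizeof(attr));
    attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
    attr.insns = (uint64_t) (uintptr_t) insns;
    attr.insn_cnt = 2;
    attr.license = (uint64_t) (uintptr_t) "GPL";
    attr.log_buf = (uint64_t) (uintptr_t) bpf_log_buf;
    attr.log_size = LOG_BUF_SIZE;
    attr.log_level = 1;

    return (int) syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}
```

If the verifier rejects a program, bpf_log_buf holds its explanation as a multi-line string; on success the caller should eventually close(2) the returned descriptor.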
Maps are accessible from eBPF programs and are used to exchange data between eBPF programs and between eBPF programs and user-space programs. For example, eBPF programs can process various events (like kprobe, packets) and store their data into a map, and user-space programs can then fetch data from the map. Conversely, user-space programs can use a map as a configuration mechanism, populating the map with values checked by the eBPF program, which then modifies its behavior on the fly according to those values. eBPF program types The eBPF program type (prog_type) determines the subset of kernel helper functions that the program may call. The program type also determines the program input (context): the format of struct bpf_context (which is the data blob passed into the eBPF program as the first argument). For example, a tracing program does not have the exact same subset of helper functions as a socket filter program (though they may have some helpers in common). Similarly, the input (context) for a tracing program is a set of register values, while for a socket filter it is a network packet. The set of functions available to eBPF programs of a given type may increase in the future. The following program types are supported: BPF_PROG_TYPE_SOCKET_FILTER (since Linux 3.19) Currently, the set of functions for BPF_PROG_TYPE_SOCKET_FILTER is: bpf_map_lookup_elem(map_fd, void *key) /* look up key in a map_fd */ bpf_map_update_elem(map_fd, void *key, void *value) /* update key/value */ bpf_map_delete_elem(map_fd, void *key) /* delete key in a map_fd */ The bpf_context argument is a pointer to a struct __sk_buff. BPF_PROG_TYPE_KPROBE (since Linux 4.1) [To be documented] BPF_PROG_TYPE_SCHED_CLS (since Linux 4.1) [To be documented] BPF_PROG_TYPE_SCHED_ACT (since Linux 4.1) [To be documented] Events Once a program is loaded, it can be attached to an event. Various kernel subsystems have different ways to do so.
Since Linux 3.19, the following call will attach the program prog_fd to the socket sockfd, which was created by an earlier call to socket(2): setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd)); Since Linux 4.1, the following call may be used to attach the eBPF program referred to by the file descriptor prog_fd to a perf event file descriptor, event_fd, that was created by a previous call to perf_event_open(2): ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd); Linux time system calls http://man7.org/linux/man-pages/man2/time.2.html 11 SYSTEM CALL: time(2) - Linux manual page FUNCTIONALITY: time - get time in seconds SYNOPSIS: #include <time.h> time_t time(time_t *tloc); DESCRIPTION time() returns the time as the number of seconds since the Epoch, 1970-01-01 00:00:00 +0000 (UTC). If tloc is non-NULL, the return value is also stored in the memory pointed to by tloc. http://man7.org/linux/man-pages/man2/settimeofday.2.html 10 SYSTEM CALL: gettimeofday(2) - Linux manual page FUNCTIONALITY: gettimeofday, settimeofday - get / set time SYNOPSIS: #include <sys/time.h> int gettimeofday(struct timeval *tv, struct timezone *tz); int settimeofday(const struct timeval *tv, const struct timezone *tz); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): settimeofday(): Since glibc 2.19: _DEFAULT_SOURCE Glibc 2.19 and earlier: _BSD_SOURCE DESCRIPTION The functions gettimeofday() and settimeofday() can get and set the time as well as a timezone. The tv argument is a struct timeval (as specified in <sys/time.h>): struct timeval { time_t tv_sec; /* seconds */ suseconds_t tv_usec; /* microseconds */ }; and gives the number of seconds and microseconds since the Epoch (see time(2)). The tz argument is a struct timezone: struct timezone { int tz_minuteswest; /* minutes west of Greenwich */ int tz_dsttime; /* type of DST correction */ }; If either tv or tz is NULL, the corresponding structure is not set or returned. (However, compilation warnings will result if tv is NULL.)
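A typical use of the struct timeval described above is to fold tv_sec and tv_usec into one microsecond count. The following is a minimal sketch; the function name now_usec is hypothetical, and the headers <sys/time.h> and <time.h> are the ones these calls are normally declared in.

```c
#include <stddef.h>
#include <sys/time.h>
#include <time.h>

/* Return the current time since the Epoch in microseconds,
   or -1 if gettimeofday() fails. */
long long now_usec(void)
{
    struct timeval tv;

    if (gettimeofday(&tv, NULL) == -1)   /* no timezone requested */
        return -1;
    return (long long) tv.tv_sec * 1000000LL + tv.tv_usec;
}
```

Dividing the result by 1000000 should give a value within a second or so of what time(NULL) returns at the same moment, since both count from the same Epoch.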
The use of the timezone structure is obsolete; the tz argument should normally be specified as NULL. (See NOTES below.) Under Linux, there are some peculiar "warp clock" semantics associated with the settimeofday() system call if on the very first call (after booting) that has a non-NULL tz argument, the tv argument is NULL and the tz_minuteswest field is nonzero. (The tz_dsttime field should be zero for this case.) In such a case it is assumed that the CMOS clock is on local time, and that it has to be incremented by this amount to get UTC system time. No doubt it is a bad idea to use this feature. http://man7.org/linux/man-pages/man2/gettimeofday.2.html 10 SYSTEM CALL: gettimeofday(2) - Linux manual page FUNCTIONALITY: gettimeofday, settimeofday - get / set time
http://man7.org/linux/man-pages/man2/clock_settime.2.html 14 SYSTEM CALL: clock_getres(2) - Linux manual page FUNCTIONALITY: clock_getres, clock_gettime, clock_settime - clock and time functions SYNOPSIS: #include int clock_getres(clockid_t clk_id, struct timespec *res); int clock_gettime(clockid_t clk_id, struct timespec *tp); int clock_settime(clockid_t clk_id, const struct timespec *tp); Link with -lrt (only for glibc versions before 2.17). Feature Test Macro Requirements for glibc (see feature_test_macros(7)): clock_getres(), clock_gettime(), clock_settime(): _POSIX_C_SOURCE >= 199309L DESCRIPTION The function clock_getres() finds the resolution (precision) of the specified clock clk_id, and, if res is non-NULL, stores it in the struct timespec pointed to by res. The resolution of clocks depends on the implementation and cannot be configured by a particular process. If the time value pointed to by the argument tp of clock_settime() is not a multiple of res, then it is truncated to a multiple of res. The functions clock_gettime() and clock_settime() retrieve and set the time of the specified clock clk_id. The res and tp arguments are timespec structures, as specified in : struct timespec { time_t tv_sec; /* seconds */ long tv_nsec; /* nanoseconds */ }; The clk_id argument is the identifier of the particular clock on which to act. A clock may be system-wide and hence visible for all processes, or per-process if it measures time only within a single process.
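Because struct timespec splits a time value into seconds and nanoseconds, computing an interval between two clock_gettime() readings needs a small amount of arithmetic. The sketch below shows one way to do it; the function names timespec_diff_ns and sample_clock are hypothetical, and the feature test macro is defined only if the build has not already set it.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif
#include <time.h>

/* Difference b - a in nanoseconds. */
long long timespec_diff_ns(const struct timespec *a,
                           const struct timespec *b)
{
    return (long long) (b->tv_sec - a->tv_sec) * 1000000000LL
           + (b->tv_nsec - a->tv_nsec);
}

/* Read the clock clk_id twice and return the elapsed nanoseconds,
   or -1 if either reading fails. */
long long sample_clock(clockid_t clk_id)
{
    struct timespec t0, t1;

    if (clock_gettime(clk_id, &t0) == -1 ||
        clock_gettime(clk_id, &t1) == -1)
        return -1;
    return timespec_diff_ns(&t0, &t1);
}
```

For interval measurement a clock that cannot be set, such as CLOCK_MONOTONIC, is usually the safer argument, since a clock that can be changed may make the difference negative.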
All implementations support the system-wide real-time clock, which is identified by CLOCK_REALTIME. Its time represents seconds and nanoseconds since the Epoch. When its time is changed, timers for a relative interval are unaffected, but timers for an absolute point in time are affected. More clocks may be implemented. The interpretation of the corresponding time values and the effect on timers is unspecified. Sufficiently recent versions of glibc and the Linux kernel support the following clocks: CLOCK_REALTIME System-wide clock that measures real (i.e., wall-clock) time. Setting this clock requires appropriate privileges. This clock is affected by discontinuous jumps in the system time (e.g., if the system administrator manually changes the clock), and by the incremental adjustments performed by adjtime(3) and NTP. CLOCK_REALTIME_COARSE (since Linux 2.6.32; Linux-specific) A faster but less precise version of CLOCK_REALTIME. Use when you need very fast, but not fine-grained timestamps. Requires per-architecture support, and probably also architecture support for this flag in the vdso(7). CLOCK_MONOTONIC Clock that cannot be set and represents monotonic time since some unspecified starting point. This clock is not affected by discontinuous jumps in the system time (e.g., if the system administrator manually changes the clock), but is affected by the incremental adjustments performed by adjtime(3) and NTP. CLOCK_MONOTONIC_COARSE (since Linux 2.6.32; Linux-specific) A faster but less precise version of CLOCK_MONOTONIC. Use when you need very fast, but not fine-grained timestamps. Requires per-architecture support, and probably also architecture support for this flag in the vdso(7). CLOCK_MONOTONIC_RAW (since Linux 2.6.28; Linux-specific) Similar to CLOCK_MONOTONIC, but provides access to a raw hardware-based time that is not subject to NTP adjustments or the incremental adjustments performed by adjtime(3). 
CLOCK_BOOTTIME (since Linux 2.6.39; Linux-specific) Identical to CLOCK_MONOTONIC, except it also includes any time that the system is suspended. This allows applications to get a suspend-aware monotonic clock without having to deal with the complications of CLOCK_REALTIME, which may have discontinuities if the time is changed using settimeofday(2). CLOCK_PROCESS_CPUTIME_ID (since Linux 2.6.12) Per-process CPU-time clock (measures CPU time consumed by all threads in the process). CLOCK_THREAD_CPUTIME_ID (since Linux 2.6.12) Thread-specific CPU-time clock. http://man7.org/linux/man-pages/man2/clock_gettime.2.html 14 SYSTEM CALL: clock_getres(2) - Linux manual page FUNCTIONALITY: clock_getres, clock_gettime, clock_settime - clock and time functions http://man7.org/linux/man-pages/man2/clock_getres.2.html 14 SYSTEM CALL: clock_getres(2) - Linux manual page FUNCTIONALITY: clock_getres, clock_gettime, clock_settime - clock and time functions http://man7.org/linux/man-pages/man2/clock_adjtime.2.html Linux process management system calls http://man7.org/linux/man-pages/man2/clone.2.html 13 SYSTEM CALL: clone(2) - Linux manual page FUNCTIONALITY: clone, __clone2 - create a child process SYNOPSIS: /* Prototype for the glibc wrapper function */ #define _GNU_SOURCE #include int clone(int (*fn)(void *), void *child_stack, int flags, void *arg, ... /* pid_t *ptid, struct user_desc *tls, pid_t *ctid */ ); /* Prototype for the raw system call */ long clone(unsigned long flags, void *child_stack, void *ptid, void *ctid, struct pt_regs *regs); DESCRIPTION clone() creates a new process, in a manner similar to fork(2). This page describes both the glibc clone() wrapper function and the underlying system call on which it is based. The main text describes the wrapper function; the differences for the raw system call are described toward the end of this page.
Unlike fork(2), clone() allows the child process to share parts of its execution context with the calling process, such as the memory space, the table of file descriptors, and the table of signal handlers. (Note that on this manual page, "calling process" normally corresponds to "parent process". But see the description of CLONE_PARENT below.) One use of clone() is to implement threads: multiple threads of control in a program that run concurrently in a shared memory space. When the child process is created with clone(), it executes the function fn(arg). (This differs from fork(2), where execution continues in the child from the point of the fork(2) call.) The fn argument is a pointer to a function that is called by the child process at the beginning of its execution. The arg argument is passed to the fn function. When the fn(arg) function application returns, the child process terminates. The integer returned by fn is the exit code for the child process. The child process may also terminate explicitly by calling exit(2) or after receiving a fatal signal. The child_stack argument specifies the location of the stack used by the child process. Since the child and calling process may share memory, it is not possible for the child process to execute in the same stack as the calling process. The calling process must therefore set up memory space for the child stack and pass a pointer to this space to clone(). Stacks grow downward on all processors that run Linux (except the HP PA processors), so child_stack usually points to the topmost address of the memory space set up for the child stack. The low byte of flags contains the number of the termination signal sent to the parent when the child dies. If this signal is specified as anything other than SIGCHLD, then the parent process must specify the __WALL or __WCLONE options when waiting for the child with wait(2). If no signal is specified, then the parent process is not signaled when the child terminates. 
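The roles of fn, arg, child_stack, and the low byte of flags described above can be seen together in a short sketch. The names child_fn, run_child, and CHILD_STACK_SIZE are hypothetical, the stack size is an arbitrary assumption, and the code assumes an architecture on which the stack grows downward. No sharing flags are set, so the child works on its own copy of the parent's memory.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (1024 * 1024)   /* assumed to be ample here */

/* Runs in the child; its return value becomes the child's exit code. */
static int child_fn(void *arg)
{
    return *(int *) arg;
}

/* Run child_fn(&code) in a new process and return the exit code seen by
   the parent, or -1 on error. */
int run_child(int code)
{
    char *stack = malloc(CHILD_STACK_SIZE);
    int status;
    pid_t pid;

    if (stack == NULL)
        return -1;

    /* The stack grows downward, so pass the topmost address.  SIGCHLD in
       the low byte of flags lets a plain waitpid() collect the child. */
    pid = clone(child_fn, stack + CHILD_STACK_SIZE, SIGCHLD, &code);
    if (pid == -1 || waitpid(pid, &status, 0) == -1) {
        free(stack);
        return -1;
    }

    free(stack);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With this sketch, run_child(7) is expected to return 7: the child returns 7 from child_fn, which becomes its exit code, and the parent reads it back through waitpid(2).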
flags may also be bitwise-or'ed with zero or more of the following constants, in order to specify what is shared between the calling process and the child process: CLONE_CHILD_CLEARTID (since Linux 2.5.49) Erase the child thread ID at the location ctid in child memory when the child exits, and do a wakeup on the futex at that address. The address involved may be changed by the set_tid_address(2) system call. This is used by threading libraries. CLONE_CHILD_SETTID (since Linux 2.5.49) Store the child thread ID at the location ctid in the child's memory. CLONE_FILES (since Linux 2.0) If CLONE_FILES is set, the calling process and the child process share the same file descriptor table. Any file descriptor created by the calling process or by the child process is also valid in the other process. Similarly, if one of the processes closes a file descriptor, or changes its associated flags (using the fcntl(2) F_SETFD operation), the other process is also affected. If a process sharing a file descriptor table calls execve(2), its file descriptor table is duplicated (unshared). If CLONE_FILES is not set, the child process inherits a copy of all file descriptors opened in the calling process at the time of clone(). (The duplicated file descriptors in the child refer to the same open file descriptions (see open(2)) as the corresponding file descriptors in the calling process.) Subsequent operations that open or close file descriptors, or change file descriptor flags, performed by either the calling process or the child process do not affect the other process. CLONE_FS (since Linux 2.0) If CLONE_FS is set, the caller and the child process share the same filesystem information. This includes the root of the filesystem, the current working directory, and the umask. Any call to chroot(2), chdir(2), or umask(2) performed by the calling process or the child process also affects the other process. 
If CLONE_FS is not set, the child process works on a copy of the filesystem information of the calling process at the time of the clone() call. Calls to chroot(2), chdir(2), umask(2) performed later by one of the processes do not affect the other process. CLONE_IO (since Linux 2.6.25) If CLONE_IO is set, then the new process shares an I/O context with the calling process. If this flag is not set, then (as with fork(2)) the new process has its own I/O context. The I/O context is the I/O scope of the disk scheduler (i.e., what the I/O scheduler uses to model scheduling of a process's I/O). If processes share the same I/O context, they are treated as one by the I/O scheduler. As a consequence, they get to share disk time. For some I/O schedulers, if two processes share an I/O context, they will be allowed to interleave their disk access. If several threads are doing I/O on behalf of the same process (aio_read(3), for instance), they should employ CLONE_IO to get better I/O performance. If the kernel is not configured with the CONFIG_BLOCK option, this flag is a no-op. CLONE_NEWCGROUP (since Linux 4.6) Create the process in a new cgroup namespace. If this flag is not set, then (as with fork(2)) the process is created in the same cgroup namespaces as the calling process. This flag is intended for the implementation of containers. For further information on cgroup namespaces, see cgroup_namespaces(7). Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWCGROUP. CLONE_NEWIPC (since Linux 2.6.19) If CLONE_NEWIPC is set, then create the process in a new IPC namespace. If this flag is not set, then (as with fork(2)), the process is created in the same IPC namespace as the calling process. This flag is intended for the implementation of containers. An IPC namespace provides an isolated view of System V IPC objects (see svipc(7)) and (since Linux 2.6.30) POSIX message queues (see mq_overview(7)). 
The common characteristic of these IPC mechanisms is that IPC objects are identified by mechanisms other than filesystem pathnames. Objects created in an IPC namespace are visible to all other processes that are members of that namespace, but are not visible to processes in other IPC namespaces. When an IPC namespace is destroyed (i.e., when the last process that is a member of the namespace terminates), all IPC objects in the namespace are automatically destroyed. Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWIPC. This flag can't be specified in conjunction with CLONE_SYSVSEM. For further information on IPC namespaces, see namespaces(7). CLONE_NEWNET (since Linux 2.6.24) (The implementation of this flag was completed only by about kernel version 2.6.29.) If CLONE_NEWNET is set, then create the process in a new network namespace. If this flag is not set, then (as with fork(2)) the process is created in the same network namespace as the calling process. This flag is intended for the implementation of containers. A network namespace provides an isolated view of the networking stack (network device interfaces, IPv4 and IPv6 protocol stacks, IP routing tables, firewall rules, the /proc/net and /sys/class/net directory trees, sockets, etc.). A physical network device can live in exactly one network namespace. A virtual network device ("veth") pair provides a pipe-like abstraction that can be used to create tunnels between network namespaces, and can be used to create a bridge to a physical network device in another namespace. When a network namespace is freed (i.e., when the last process in the namespace terminates), its physical network devices are moved back to the initial network namespace (not to the parent of the process). For further information on network namespaces, see namespaces(7). Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWNET. 
CLONE_NEWNS (since Linux 2.4.19) If CLONE_NEWNS is set, the cloned child is started in a new mount namespace, initialized with a copy of the namespace of the parent. If CLONE_NEWNS is not set, the child lives in the same mount namespace as the parent. Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWNS. It is not permitted to specify both CLONE_NEWNS and CLONE_FS in the same clone() call. For further information on mount namespaces, see namespaces(7) and mount_namespaces(7). CLONE_NEWPID (since Linux 2.6.24) If CLONE_NEWPID is set, then create the process in a new PID namespace. If this flag is not set, then (as with fork(2)) the process is created in the same PID namespace as the calling process. This flag is intended for the implementation of containers. For further information on PID namespaces, see namespaces(7) and pid_namespaces(7). Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWPID. This flag can't be specified in conjunction with CLONE_THREAD or CLONE_PARENT. CLONE_NEWUSER (This flag first became meaningful for clone() in Linux 2.6.23; the current clone() semantics were merged in Linux 3.5, and the final pieces to make the user namespaces completely usable were merged in Linux 3.8.) If CLONE_NEWUSER is set, then create the process in a new user namespace. If this flag is not set, then (as with fork(2)) the process is created in the same user namespace as the calling process. For further information on user namespaces, see namespaces(7) and user_namespaces(7). Before Linux 3.8, use of CLONE_NEWUSER required that the caller have three capabilities: CAP_SYS_ADMIN, CAP_SETUID, and CAP_SETGID. Starting with Linux 3.8, no privileges are needed to create a user namespace. This flag can't be specified in conjunction with CLONE_THREAD or CLONE_PARENT. For security reasons, CLONE_NEWUSER cannot be specified in conjunction with CLONE_FS. 
CLONE_NEWUTS (since Linux 2.6.19) If CLONE_NEWUTS is set, then create the process in a new UTS namespace, whose identifiers are initialized by duplicating the identifiers from the UTS namespace of the calling process. If this flag is not set, then (as with fork(2)) the process is created in the same UTS namespace as the calling process. This flag is intended for the implementation of containers. A UTS namespace is the set of identifiers returned by uname(2); among these, the domain name and the hostname can be modified by setdomainname(2) and sethostname(2), respectively. Changes made to the identifiers in a UTS namespace are visible to all other processes in the same namespace, but are not visible to processes in other UTS namespaces. Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWUTS. For further information on UTS namespaces, see namespaces(7). CLONE_PARENT (since Linux 2.3.12) If CLONE_PARENT is set, then the parent of the new child (as returned by getppid(2)) will be the same as that of the calling process. If CLONE_PARENT is not set, then (as with fork(2)) the child's parent is the calling process. Note that it is the parent process, as returned by getppid(2), which is signaled when the child terminates, so that if CLONE_PARENT is set, then the parent of the calling process, rather than the calling process itself, will be signaled. CLONE_PARENT_SETTID (since Linux 2.5.49) Store the child thread ID at the location ptid in the parent's memory. (In Linux 2.5.32-2.5.48 there was a flag CLONE_SETTID that did this.) CLONE_PID (obsolete) If CLONE_PID is set, the child process is created with the same process ID as the calling process. This is good for hacking the system, but otherwise of not much use. Since 2.3.21 this flag can be specified only by the system boot process (PID 0). It disappeared in Linux 2.5.16. Since then, the kernel silently ignores it without error. 
CLONE_PTRACE (since Linux 2.2) If CLONE_PTRACE is specified, and the calling process is being traced, then trace the child also (see ptrace(2)). CLONE_SETTLS (since Linux 2.5.32) The tls argument is the new TLS (Thread Local Storage) descriptor. (See set_thread_area(2).) CLONE_SIGHAND (since Linux 2.0) If CLONE_SIGHAND is set, the calling process and the child process share the same table of signal handlers. If the calling process or child process calls sigaction(2) to change the behavior associated with a signal, the behavior is changed in the other process as well. However, the calling process and child processes still have distinct signal masks and sets of pending signals. So, one of them may block or unblock some signals using sigprocmask(2) without affecting the other process. If CLONE_SIGHAND is not set, the child process inherits a copy of the signal handlers of the calling process at the time clone() is called. Calls to sigaction(2) performed later by one of the processes have no effect on the other process. Since Linux 2.6.0-test6, flags must also include CLONE_VM if CLONE_SIGHAND is specified. CLONE_STOPPED (since Linux 2.6.0-test2) If CLONE_STOPPED is set, then the child is initially stopped (as though it was sent a SIGSTOP signal), and must be resumed by sending it a SIGCONT signal. This flag was deprecated from Linux 2.6.25 onward, and was removed altogether in Linux 2.6.38. Since then, the kernel silently ignores it without error. Starting with Linux 4.6, the same bit was reused for the CLONE_NEWCGROUP flag. CLONE_SYSVSEM (since Linux 2.5.10) If CLONE_SYSVSEM is set, then the child and the calling process share a single list of System V semaphore adjustment (semadj) values (see semop(2)). In this case, the shared list accumulates semadj values across all processes sharing the list, and semaphore adjustments are performed only when the last process that is sharing the list terminates (or ceases sharing the list using unshare(2)). 
If this flag is not set, then the child has a separate semadj list that is initially empty. CLONE_THREAD (since Linux 2.4.0-test8) If CLONE_THREAD is set, the child is placed in the same thread group as the calling process. To make the remainder of the discussion of CLONE_THREAD more readable, the term "thread" is used to refer to the processes within a thread group. Thread groups were a feature added in Linux 2.4 to support the POSIX threads notion of a set of threads that share a single PID. Internally, this shared PID is the so-called thread group identifier (TGID) for the thread group. Since Linux 2.4, calls to getpid(2) return the TGID of the caller. The threads within a group can be distinguished by their (system-wide) unique thread IDs (TID). A new thread's TID is available as the function result returned to the caller of clone(), and a thread can obtain its own TID using gettid(2). When a call is made to clone() without specifying CLONE_THREAD, then the resulting thread is placed in a new thread group whose TGID is the same as the thread's TID. This thread is the leader of the new thread group. A new thread created with CLONE_THREAD has the same parent process as the caller of clone() (i.e., like CLONE_PARENT), so that calls to getppid(2) return the same value for all of the threads in a thread group. When a CLONE_THREAD thread terminates, the thread that created it using clone() is not sent a SIGCHLD (or other termination) signal; nor can the status of such a thread be obtained using wait(2). (The thread is said to be detached.) After all of the threads in a thread group terminate the parent process of the thread group is sent a SIGCHLD (or other termination) signal. If any of the threads in a thread group performs an execve(2), then all threads other than the thread group leader are terminated, and the new program is executed in the thread group leader. 
If one of the threads in a thread group creates a child using fork(2), then any thread in the group can wait(2) for that child. Since Linux 2.5.35, flags must also include CLONE_SIGHAND if CLONE_THREAD is specified (and note that, since Linux 2.6.0-test6, CLONE_SIGHAND also requires CLONE_VM to be included). Signals may be sent to a thread group as a whole (i.e., a TGID) using kill(2), or to a specific thread (i.e., TID) using tgkill(2). Signal dispositions and actions are process-wide: if an unhandled signal is delivered to a thread, then it will affect (terminate, stop, continue, be ignored in) all members of the thread group. Each thread has its own signal mask, as set by sigprocmask(2), but signals can be pending either: for the whole process (i.e., deliverable to any member of the thread group), when sent with kill(2); or for an individual thread, when sent with tgkill(2). A call to sigpending(2) returns a signal set that is the union of the signals pending for the whole process and the signals that are pending for the calling thread. If kill(2) is used to send a signal to a thread group, and the thread group has installed a handler for the signal, then the handler will be invoked in exactly one, arbitrarily selected member of the thread group that has not blocked the signal. If multiple threads in a group are waiting to accept the same signal using sigwaitinfo(2), the kernel will arbitrarily select one of these threads to receive a signal sent using kill(2). CLONE_UNTRACED (since Linux 2.5.46) If CLONE_UNTRACED is specified, then a tracing process cannot force CLONE_PTRACE on this child process. CLONE_VFORK (since Linux 2.2) If CLONE_VFORK is set, the execution of the calling process is suspended until the child releases its virtual memory resources via a call to execve(2) or _exit(2) (as with vfork(2)). 
If CLONE_VFORK is not set, then both the calling process and the child are schedulable after the call, and an application should not rely on execution occurring in any particular order. CLONE_VM (since Linux 2.0) If CLONE_VM is set, the calling process and the child process run in the same memory space. In particular, memory writes performed by the calling process or by the child process are also visible in the other process. Moreover, any memory mapping or unmapping performed with mmap(2) or munmap(2) by the child or calling process also affects the other process. If CLONE_VM is not set, the child process runs in a separate copy of the memory space of the calling process at the time of clone(). Memory writes or file mappings/unmappings performed by one of the processes do not affect the other, as with fork(2). C library/kernel differences The raw clone() system call corresponds more closely to fork(2) in that execution in the child continues from the point of the call. As such, the fn and arg arguments of the clone() wrapper function are omitted. Furthermore, the argument order changes. The raw system call interface on x86 and many other architectures is roughly: long clone(unsigned long flags, void *child_stack, void *ptid, void *ctid, struct pt_regs *regs); Another difference for the raw system call is that the child_stack argument may be zero, in which case copy-on-write semantics ensure that the child gets separate copies of stack pages when either process modifies the stack. In this case, for correct operation, the CLONE_VM option should not be specified. For some architectures, the order of the arguments for the system call differs from that shown above. On the score, microblaze, ARM, ARM 64, PA-RISC, arc, Power PC, xtensa, and MIPS architectures, the order of the fourth and fifth arguments is reversed. On the cris and s390 architectures, the order of the first and second arguments is reversed. 
blackfin, m68k, and sparc The argument-passing conventions on blackfin, m68k, and sparc are different from the descriptions above. For details, see the kernel (and glibc) source. ia64 On ia64, a different interface is used: int __clone2(int (*fn)(void *), void *child_stack_base, size_t stack_size, int flags, void *arg, ... /* pid_t *ptid, struct user_desc *tls, pid_t *ctid */ ); The prototype shown above is for the glibc wrapper function; the raw system call interface has no fn or arg argument, and changes the order of the arguments so that flags is the first argument, and tls is the last argument. __clone2() operates in the same way as clone(), except that child_stack_base points to the lowest address of the child's stack area, and stack_size specifies the size of the stack pointed to by child_stack_base. Linux 2.4 and earlier In Linux 2.4 and earlier, clone() does not take arguments ptid, tls, and ctid. http://man7.org/linux/man-pages/man2/fork.2.html 11 SYSTEM CALL: fork(2) - Linux manual page FUNCTIONALITY: fork - create a child process SYNOPSIS: #include <unistd.h> pid_t fork(void); DESCRIPTION fork() creates a new process by duplicating the calling process. The new process is referred to as the child process. The calling process is referred to as the parent process. The child process and the parent process run in separate memory spaces. At the time of fork() both memory spaces have the same content. Memory writes, file mappings (mmap(2)), and unmappings (munmap(2)) performed by one of the processes do not affect the other. The child process is an exact duplicate of the parent process except for the following points: * The child has its own unique process ID, and this PID does not match the ID of any existing process group (setpgid(2)). * The child's parent process ID is the same as the parent's process ID. * The child does not inherit its parent's memory locks (mlock(2), mlockall(2)). 
* Process resource utilizations (getrusage(2)) and CPU time counters (times(2)) are reset to zero in the child. * The child's set of pending signals is initially empty (sigpending(2)). * The child does not inherit semaphore adjustments from its parent (semop(2)). * The child does not inherit process-associated record locks from its parent (fcntl(2)). (On the other hand, it does inherit fcntl(2) open file description locks and flock(2) locks from its parent.) * The child does not inherit timers from its parent (setitimer(2), alarm(2), timer_create(2)). * The child does not inherit outstanding asynchronous I/O operations from its parent (aio_read(3), aio_write(3)), nor does it inherit any asynchronous I/O contexts from its parent (see io_setup(2)). The process attributes in the preceding list are all specified in POSIX.1. The parent and child also differ with respect to the following Linux-specific process attributes: * The child does not inherit directory change notifications (dnotify) from its parent (see the description of F_NOTIFY in fcntl(2)). * The prctl(2) PR_SET_PDEATHSIG setting is reset so that the child does not receive a signal when its parent terminates. * The default timer slack value is set to the parent's current timer slack value. See the description of PR_SET_TIMERSLACK in prctl(2). * Memory mappings that have been marked with the madvise(2) MADV_DONTFORK flag are not inherited across a fork(). * The termination signal of the child is always SIGCHLD (see clone(2)). * The port access permission bits set by ioperm(2) are not inherited by the child; the child must turn on any bits that it requires using ioperm(2). Note the following further points: * The child process is created with a single thread—the one that called fork(). 
The entire virtual address space of the parent is replicated in the child, including the states of mutexes, condition variables, and other pthreads objects; the use of pthread_atfork(3) may be helpful for dealing with problems that this can cause. * After a fork(2) in a multithreaded program, the child can safely call only async-signal-safe functions (see signal(7)) until such time as it calls execve(2). * The child inherits copies of the parent's set of open file descriptors. Each file descriptor in the child refers to the same open file description (see open(2)) as the corresponding file descriptor in the parent. This means that the two file descriptors share open file status flags, file offset, and signal-driven I/O attributes (see the description of F_SETOWN and F_SETSIG in fcntl(2)). * The child inherits copies of the parent's set of open message queue descriptors (see mq_overview(7)). Each file descriptor in the child refers to the same open message queue description as the corresponding file descriptor in the parent. This means that the two file descriptors share the same flags (mq_flags). * The child inherits copies of the parent's set of open directory streams (see opendir(3)). POSIX.1 says that the corresponding directory streams in the parent and child may share the directory stream positioning; on Linux/glibc they do not. http://man7.org/linux/man-pages/man2/vfork.2.html 9 SYSTEM CALL: vfork(2) - Linux manual page FUNCTIONALITY: vfork - create a child process and block parent SYNOPSIS: #include <sys/types.h> #include <unistd.h> pid_t vfork(void); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): vfork(): Since glibc 2.12: (_XOPEN_SOURCE >= 500) && ! 
(_POSIX_C_SOURCE >= 200809L) || /* Since glibc 2.19: */ _DEFAULT_SOURCE || /* Glibc versions <= 2.19: */ _BSD_SOURCE Before glibc 2.12: _BSD_SOURCE || _XOPEN_SOURCE >= 500 DESCRIPTION Standard description (From POSIX.1) The vfork() function has the same effect as fork(2), except that the behavior is undefined if the process created by vfork() either modifies any data other than a variable of type pid_t used to store the return value from vfork(), or returns from the function in which vfork() was called, or calls any other function before successfully calling _exit(2) or one of the exec(3) family of functions. Linux description vfork(), just like fork(2), creates a child process of the calling process. For details and return value and errors, see fork(2). vfork() is a special case of clone(2). It is used to create new processes without copying the page tables of the parent process. It may be useful in performance-sensitive applications where a child is created which then immediately issues an execve(2). vfork() differs from fork(2) in that the calling thread is suspended until the child terminates (either normally, by calling _exit(2), or abnormally, after delivery of a fatal signal), or it makes a call to execve(2). Until that point, the child shares all memory with its parent, including the stack. The child must not return from the current function or call exit(3), but may call _exit(2). As with fork(2), the child process created by vfork() inherits copies of various of the caller's process attributes (e.g., file descriptors, signal dispositions, and current working directory); the vfork() call differs only in the treatment of the virtual address space, as described above. Signals sent to the parent arrive after the child releases the parent's memory (i.e., after the child terminates or calls execve(2)). 
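The typical vfork() pattern described above, where the child does nothing but call execve(2) or _exit(2) while the parent stays suspended, might look like the sketch below. The helper name run_shell_vfork is hypothetical, and the example assumes that /bin/sh exists on the system.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run "/bin/sh -c <command>" in a vfork()ed child and
 * return the child's exit status, or -1 on error. */
int run_shell_vfork(const char *command)
{
    /* Built before vfork() so the child does not modify any data. */
    char *const argv[] = { "sh", "-c", (char *)command, NULL };
    char *const envp[] = { NULL };

    pid_t pid = vfork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* Child: shares the parent's memory, including the stack, so it
         * only calls execve() and, if that fails, _exit(). */
        execve("/bin/sh", argv, envp);
        _exit(127);
    }

    /* Parent: resumes only after the child has called execve() or _exit(). */
    int wstatus;
    if (waitpid(pid, &wstatus, 0) == -1 || !WIFEXITED(wstatus))
        return -1;
    return WEXITSTATUS(wstatus);
}
```

For instance, run_shell_vfork("exit 7") is expected to return 7. Note that the child calls _exit(2), not exit(3), and never returns from the function that called vfork().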
Historic description Under Linux, fork(2) is implemented using copy-on-write pages, so the only penalty incurred by fork(2) is the time and memory required to duplicate the parent's page tables, and to create a unique task structure for the child. However, in the bad old days a fork(2) would require making a complete copy of the caller's data space, often needlessly, since usually immediately afterward an exec(3) is done. Thus, for greater efficiency, BSD introduced the vfork() system call, which did not fully copy the address space of the parent process, but borrowed the parent's memory and thread of control until a call to execve(2) or an exit occurred. The parent process was suspended while the child was using its resources. The use of vfork() was tricky: for example, not modifying data in the parent process depended on knowing which variables were held in a register. http://man7.org/linux/man-pages/man2/execve.2.html 11 SYSTEM CALL: execve(2) - Linux manual page FUNCTIONALITY: execve - execute program SYNOPSIS: #include <unistd.h> int execve(const char *filename, char *const argv[], char *const envp[]); DESCRIPTION execve() executes the program pointed to by filename. filename must be either a binary executable, or a script starting with a line of the form: #! interpreter [optional-arg] For details of the latter case, see "Interpreter scripts" below. argv is an array of argument strings passed to the new program. By convention, the first of these strings should contain the filename associated with the file being executed. envp is an array of strings, conventionally of the form key=value, which are passed as environment to the new program. Both argv and envp must be terminated by a null pointer. 
The argument vector and environment can be accessed by the called program's main function, when it is defined as: int main(int argc, char *argv[], char *envp[]) execve() does not return on success, and the text, data, bss, and stack of the calling process are overwritten by that of the program loaded. If the current program is being ptraced, a SIGTRAP is sent to it after a successful execve(). If the set-user-ID bit is set on the program file pointed to by filename, and the underlying filesystem is not mounted nosuid (the MS_NOSUID flag for mount(2)), and the calling process is not being ptraced, then the effective user ID of the calling process is changed to that of the owner of the program file. Similarly, when the set-group-ID bit of the program file is set, the effective group ID of the calling process is set to the group of the program file. The effective user ID of the process is copied to the saved set-user-ID; similarly, the effective group ID is copied to the saved set-group-ID. This copying takes place after any effective ID changes that occur because of the set-user-ID and set-group-ID mode bits. If the executable is an a.out dynamically linked binary executable containing shared-library stubs, the Linux dynamic linker ld.so(8) is called at the start of execution to bring needed shared objects into memory and link the executable with them. If the executable is a dynamically linked ELF executable, the interpreter named in the PT_INTERP segment is used to load the needed shared objects. This interpreter is typically /lib/ld-linux.so.2 for binaries linked with glibc (see ld-linux.so(8)). All process attributes are preserved during an execve(), except the following: * The dispositions of any signals that are being caught are reset to the default (signal(7)). * Any alternate signal stack is not preserved (sigaltstack(2)). * Memory mappings are not preserved (mmap(2)). * Attached System V shared memory segments are detached (shmat(2)). 
* POSIX shared memory regions are unmapped (shm_open(3)). * Open POSIX message queue descriptors are closed (mq_overview(7)). * Any open POSIX named semaphores are closed (sem_overview(7)). * POSIX timers are not preserved (timer_create(2)). * Any open directory streams are closed (opendir(3)). * Memory locks are not preserved (mlock(2), mlockall(2)). * Exit handlers are not preserved (atexit(3), on_exit(3)). * The floating-point environment is reset to the default (see fenv(3)). The process attributes in the preceding list are all specified in POSIX.1. The following Linux-specific process attributes are also not preserved during an execve(): * The prctl(2) PR_SET_DUMPABLE flag is set, unless a set-user-ID or set-group-ID program is being executed, in which case it is cleared. * The prctl(2) PR_SET_KEEPCAPS flag is cleared. * (Since Linux 2.4.36 / 2.6.23) If a set-user-ID or set-group-ID program is being executed, then the parent death signal set by prctl(2) PR_SET_PDEATHSIG flag is cleared. * The process name, as set by prctl(2) PR_SET_NAME (and displayed by ps -o comm), is reset to the name of the new executable file. * The SECBIT_KEEP_CAPS securebits flag is cleared. See capabilities(7). * The termination signal is reset to SIGCHLD (see clone(2)). * The file descriptor table is unshared, undoing the effect of the CLONE_FILES flag of clone(2). Note the following further points: * All threads other than the calling thread are destroyed during an execve(). Mutexes, condition variables, and other pthreads objects are not preserved. * The equivalent of setlocale(LC_ALL, "C") is executed at program start-up. * POSIX.1 specifies that the dispositions of any signals that are ignored or set to the default are left unchanged. POSIX.1 specifies one exception: if SIGCHLD is being ignored, then an implementation may leave the disposition unchanged or reset it to the default; Linux does the former. 
* Any outstanding asynchronous I/O operations are canceled (aio_read(3), aio_write(3)). * For the handling of capabilities during execve(), see capabilities(7). * By default, file descriptors remain open across an execve(). File descriptors that are marked close-on-exec are closed; see the description of FD_CLOEXEC in fcntl(2). (If a file descriptor is closed, this will cause the release of all record locks obtained on the underlying file by this process. See fcntl(2) for details.) POSIX.1 says that if file descriptors 0, 1, and 2 would otherwise be closed after a successful execve(), and the process would gain privilege because the set-user-ID or set-group-ID mode bit was set on the executed file, then the system may open an unspecified file for each of these file descriptors. As a general principle, no portable program, whether privileged or not, can assume that these three file descriptors will remain closed across an execve(). Interpreter scripts An interpreter script is a text file that has execute permission enabled and whose first line is of the form: #! interpreter [optional-arg] The interpreter must be a valid pathname for an executable file. If the filename argument of execve() specifies an interpreter script, then interpreter will be invoked with the following arguments: interpreter [optional-arg] filename arg... where arg... is the series of words pointed to by the argv argument of execve(), starting at argv[1]. For portable use, optional-arg should either be absent, or be specified as a single word (i.e., it should not contain white space); see NOTES below. Since Linux 2.6.28, the kernel permits the interpreter of a script to itself be a script. This permission is recursive, up to a limit of four recursions, so that the interpreter may be a script which is interpreted by a script, and so on. 
Limits on size of arguments and environment Most UNIX implementations impose some limit on the total size of the command-line argument (argv) and environment (envp) strings that may be passed to a new program. POSIX.1 allows an implementation to advertise this limit using the ARG_MAX constant (either defined in <limits.h> or available at run time using the call sysconf(_SC_ARG_MAX)). On Linux prior to kernel 2.6.23, the memory used to store the environment and argument strings was limited to 32 pages (defined by the kernel constant MAX_ARG_PAGES). On architectures with a 4-kB page size, this yields a maximum size of 128 kB. On kernel 2.6.23 and later, most architectures support a size limit derived from the soft RLIMIT_STACK resource limit (see getrlimit(2)) that is in force at the time of the execve() call. (Architectures with no memory management unit are excepted: they maintain the limit that was in effect before kernel 2.6.23.) This change allows programs to have a much larger argument and/or environment list. For these architectures, the total size is limited to 1/4 of the allowed stack size. (Imposing the 1/4-limit ensures that the new program always has some stack space.) Since Linux 2.6.25, the kernel places a floor of 32 pages on this size limit, so that, even when RLIMIT_STACK is set very low, applications are guaranteed to have at least as much argument and environment space as was provided by Linux 2.6.23 and earlier. (This guarantee was not provided in Linux 2.6.23 and 2.6.24.) Additionally, the limit per string is 32 pages (the kernel constant MAX_ARG_STRLEN), and the maximum number of strings is 0x7FFFFFFF. 
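The quantities mentioned above can be queried at run time. The following sketch shows the advertised limit from sysconf(_SC_ARG_MAX), one quarter of the soft RLIMIT_STACK, and the 32-page floor; the three function names are hypothetical and were chosen only for this illustration.

```c
#include <sys/resource.h>
#include <unistd.h>

/* Limit on argv + envp size advertised by the system, or -1 if the
 * limit is indeterminate or an error occurred. */
long arg_max_bytes(void)
{
    return sysconf(_SC_ARG_MAX);
}

/* One quarter of the soft RLIMIT_STACK limit, the bound described above
 * for kernel 2.6.23 and later. Returns -1 on error or if the soft limit
 * is unlimited. */
long long quarter_of_stack_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) == -1 || rl.rlim_cur == RLIM_INFINITY)
        return -1;
    return (long long)(rl.rlim_cur / 4);
}

/* The 32-page floor (128 kB with 4-kB pages), computed from the page size
 * reported by the running system. */
long legacy_floor_bytes(void)
{
    return 32L * sysconf(_SC_PAGESIZE);
}
```

A program that builds very long command lines could compare the total length of its strings against arg_max_bytes() before calling execve(), rather than waiting for the call to fail with E2BIG.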
http://man7.org/linux/man-pages/man2/execveat.2.html 12
SYSTEM CALL: execveat(2) - Linux manual page
FUNCTIONALITY: execveat - execute program relative to a directory file descriptor
SYNOPSIS:
#include <unistd.h>
int execveat(int dirfd, const char *pathname, char *const argv[], char *const envp[], int flags);
DESCRIPTION
The execveat() system call executes the program referred to by the combination of dirfd and pathname. It operates in exactly the same way as execve(2), except for the differences described in this manual page.
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by execve(2) for a relative pathname).
If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like execve(2)).
If pathname is absolute, then dirfd is ignored.
If pathname is an empty string and the AT_EMPTY_PATH flag is specified, then the file descriptor dirfd specifies the file to be executed (i.e., dirfd refers to an executable file, rather than a directory).
The flags argument is a bit mask that can include zero or more of the following flags:
    AT_EMPTY_PATH  If pathname is an empty string, operate on the file referred to by dirfd (which may have been obtained using the open(2) O_PATH flag).
    AT_SYMLINK_NOFOLLOW  If the file identified by dirfd and a non-NULL pathname is a symbolic link, then the call fails with the error ELOOP.
http://man7.org/linux/man-pages/man2/exit.2.html 9
SYSTEM CALL: _exit(2) - Linux manual page
FUNCTIONALITY: _exit, _Exit - terminate the calling process
SYNOPSIS:
#include <unistd.h>
void _exit(int status);
#include <stdlib.h>
void _Exit(int status);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
    _Exit(): _ISOC99_SOURCE || _POSIX_C_SOURCE >= 200112L
DESCRIPTION
The function _exit() terminates the calling process "immediately". Any open file descriptors belonging to the process are closed; any children of the process are inherited by process 1, init, and the process's parent is sent a SIGCHLD signal.
The value status is returned to the parent process as the process's exit status, and can be collected using one of the wait(2) family of calls.
The function _Exit() is equivalent to _exit().
http://man7.org/linux/man-pages/man2/exit_group.2.html 10
SYSTEM CALL: exit_group(2) - Linux manual page
FUNCTIONALITY: exit_group - exit all threads in a process
SYNOPSIS:
#include <linux/unistd.h>
void exit_group(int status);
DESCRIPTION
This system call is equivalent to _exit(2) except that it terminates not only the calling thread, but all threads in the calling process's thread group.
http://man7.org/linux/man-pages/man2/wait4.2.html 10
SYSTEM CALL: wait4(2) - Linux manual page
FUNCTIONALITY: wait3, wait4 - wait for process to change state, BSD style
SYNOPSIS:
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/wait.h>
pid_t wait3(int *wstatus, int options, struct rusage *rusage);
pid_t wait4(pid_t pid, int *wstatus, int options, struct rusage *rusage);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
    wait3():
        Since glibc 2.19: _DEFAULT_SOURCE || _XOPEN_SOURCE >= 500
        Glibc 2.19 and earlier: _BSD_SOURCE || _XOPEN_SOURCE >= 500
    wait4():
        Since glibc 2.19: _DEFAULT_SOURCE
        Glibc 2.19 and earlier: _BSD_SOURCE
DESCRIPTION
These functions are obsolete; use waitpid(2) or waitid(2) in new programs.
The wait3() and wait4() system calls are similar to waitpid(2), but additionally return resource usage information about the child in the structure pointed to by rusage.
Other than the use of the rusage argument, the following wait3() call:
    wait3(wstatus, options, rusage);
is equivalent to:
    waitpid(-1, wstatus, options);
Similarly, the following wait4() call:
    wait4(pid, wstatus, options, rusage);
is equivalent to:
    waitpid(pid, wstatus, options);
In other words, wait3() waits for any child, while wait4() can be used to select a specific child, or children, on which to wait. See wait(2) for further details.
If rusage is not NULL, the struct rusage to which it points will be filled with accounting information about the child. See getrusage(2) for details.
http://man7.org/linux/man-pages/man2/waitid.2.html 12
SYSTEM CALL: wait(2) - Linux manual page
FUNCTIONALITY: wait, waitpid, waitid - wait for process to change state
SYNOPSIS:
#include <sys/types.h>
#include <sys/wait.h>
pid_t wait(int *wstatus);
pid_t waitpid(pid_t pid, int *wstatus, int options);
int waitid(idtype_t idtype, id_t id, siginfo_t *infop, int options);
    /* This is the glibc and POSIX interface; see NOTES for information on the raw system call. */
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
    waitid(): _XOPEN_SOURCE || /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L || /* Glibc versions <= 2.19: */ _BSD_SOURCE
DESCRIPTION
All of these system calls are used to wait for state changes in a child of the calling process, and obtain information about the child whose state has changed. A state change is considered to be: the child terminated; the child was stopped by a signal; or the child was resumed by a signal. In the case of a terminated child, performing a wait allows the system to release the resources associated with the child; if a wait is not performed, then the terminated child remains in a "zombie" state (see NOTES below).
If a child has already changed state, then these calls return immediately.
Otherwise, they block until either a child changes state or a signal handler interrupts the call (assuming that system calls are not automatically restarted using the SA_RESTART flag of sigaction(2)). In the remainder of this page, a child whose state has changed and which has not yet been waited upon by one of these system calls is termed waitable.
wait() and waitpid()
The wait() system call suspends execution of the calling process until one of its children terminates. The call wait(&wstatus) is equivalent to:
    waitpid(-1, &wstatus, 0);
The waitpid() system call suspends execution of the calling process until a child specified by the pid argument has changed state. By default, waitpid() waits only for terminated children, but this behavior is modifiable via the options argument, as described below.
The value of pid can be:
    < -1  meaning wait for any child process whose process group ID is equal to the absolute value of pid.
    -1    meaning wait for any child process.
    0     meaning wait for any child process whose process group ID is equal to that of the calling process.
    > 0   meaning wait for the child whose process ID is equal to the value of pid.
The value of options is an OR of zero or more of the following constants:
    WNOHANG  return immediately if no child has exited.
    WUNTRACED  also return if a child has stopped (but not traced via ptrace(2)). Status for traced children which have stopped is provided even if this option is not specified.
    WCONTINUED (since Linux 2.6.10)  also return if a stopped child has been resumed by delivery of SIGCONT.
(For Linux-only options, see below.)
If wstatus is not NULL, wait() and waitpid() store status information in the int to which it points. This integer can be inspected with the following macros (which take the integer itself as an argument, not a pointer to it, as is done in wait() and waitpid()!):
    WIFEXITED(wstatus)  returns true if the child terminated normally, that is, by calling exit(3) or _exit(2), or by returning from main().
    WEXITSTATUS(wstatus)  returns the exit status of the child. This consists of the least significant 8 bits of the status argument that the child specified in a call to exit(3) or _exit(2) or as the argument for a return statement in main(). This macro should be employed only if WIFEXITED returned true.
    WIFSIGNALED(wstatus)  returns true if the child process was terminated by a signal.
    WTERMSIG(wstatus)  returns the number of the signal that caused the child process to terminate. This macro should be employed only if WIFSIGNALED returned true.
    WCOREDUMP(wstatus)  returns true if the child produced a core dump. This macro should be employed only if WIFSIGNALED returned true. This macro is not specified in POSIX.1-2001 and is not available on some UNIX implementations (e.g., AIX, SunOS). Only use this enclosed in #ifdef WCOREDUMP ... #endif.
    WIFSTOPPED(wstatus)  returns true if the child process was stopped by delivery of a signal; this is possible only if the call was done using WUNTRACED or when the child is being traced (see ptrace(2)).
    WSTOPSIG(wstatus)  returns the number of the signal which caused the child to stop. This macro should be employed only if WIFSTOPPED returned true.
    WIFCONTINUED(wstatus) (since Linux 2.6.10)  returns true if the child process was resumed by delivery of SIGCONT.
waitid()
The waitid() system call (available since Linux 2.6.9) provides more precise control over which child state changes to wait for.
The idtype and id arguments select the child(ren) to wait for, as follows:
    idtype == P_PID   Wait for the child whose process ID matches id.
    idtype == P_PGID  Wait for any child whose process group ID matches id.
    idtype == P_ALL   Wait for any child; id is ignored.
The child state changes to wait for are specified by ORing one or more of the following flags in options:
    WEXITED   Wait for children that have terminated.
    WSTOPPED  Wait for children that have been stopped by delivery of a signal.
    WCONTINUED  Wait for (previously stopped) children that have been resumed by delivery of SIGCONT.
The following flags may additionally be ORed in options:
    WNOHANG  As for waitpid().
    WNOWAIT  Leave the child in a waitable state; a later wait call can be used to again retrieve the child status information.
Upon successful return, waitid() fills in the following fields of the siginfo_t structure pointed to by infop:
    si_pid     The process ID of the child.
    si_uid     The real user ID of the child. (This field is not set on most other implementations.)
    si_signo   Always set to SIGCHLD.
    si_status  Either the exit status of the child, as given to _exit(2) (or exit(3)), or the signal that caused the child to terminate, stop, or continue. The si_code field can be used to determine how to interpret this field.
    si_code    Set to one of: CLD_EXITED (child called _exit(2)); CLD_KILLED (child killed by signal); CLD_DUMPED (child killed by signal, and dumped core); CLD_STOPPED (child stopped by signal); CLD_TRAPPED (traced child has trapped); or CLD_CONTINUED (child continued by SIGCONT).
If WNOHANG was specified in options and there were no children in a waitable state, then waitid() returns 0 immediately and the state of the siginfo_t structure pointed to by infop is unspecified. To distinguish this case from that where a child was in a waitable state, zero out the si_pid field before the call and check for a nonzero value in this field after the call returns.
http://man7.org/linux/man-pages/man2/getpid.2.html 9
SYSTEM CALL: getpid(2) - Linux manual page
FUNCTIONALITY: getpid, getppid - get process identification
SYNOPSIS:
#include <sys/types.h>
#include <unistd.h>
pid_t getpid(void);
pid_t getppid(void);
DESCRIPTION
getpid() returns the process ID of the calling process. (This is often used by routines that generate unique temporary filenames.)
getppid() returns the process ID of the parent of the calling process.
http://man7.org/linux/man-pages/man2/getppid.2.html 9
SYSTEM CALL: getpid(2) - Linux manual page
(Same manual page as the getpid.2.html entry above.)
http://man7.org/linux/man-pages/man2/gettid.2.html 11
SYSTEM CALL: gettid(2) - Linux manual page
FUNCTIONALITY: gettid - get thread identification
SYNOPSIS:
#include <sys/types.h>
pid_t gettid(void);
Note: There is no glibc wrapper for this system call; see NOTES.
DESCRIPTION
gettid() returns the caller's thread ID (TID). In a single-threaded process, the thread ID is equal to the process ID (PID, as returned by getpid(2)). In a multithreaded process, all threads have the same PID, but each one has a unique TID. For further details, see the discussion of CLONE_THREAD in clone(2).
http://man7.org/linux/man-pages/man2/setsid.2.html 10
SYSTEM CALL: setsid(2) - Linux manual page
FUNCTIONALITY: setsid - creates a session and sets the process group ID
SYNOPSIS:
#include <unistd.h>
pid_t setsid(void);
DESCRIPTION
setsid() creates a new session if the calling process is not a process group leader. The calling process is the leader of the new session (i.e., its session ID is made the same as its process ID). The calling process also becomes the process group leader of a new process group in the session (i.e., its process group ID is made the same as its process ID).
The calling process will be the only process in the new process group and in the new session. The new session has no controlling terminal.
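Because the gettid(2) entry above notes that there is no glibc wrapper, the sketch below calls the system call through syscall(2) (glibc 2.30 and later also provide a gettid() wrapper). It also checks the setsid() postconditions in a forked child. Both helper names are hypothetical.

```c
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Thread ID of the caller, obtained through the raw system call. */
pid_t current_tid(void)
{
    return (pid_t)syscall(SYS_gettid);
}

/*
 * Fork a child, have it call setsid(), and report whether the child then
 * saw session ID == process group ID == its own PID.
 * Returns 1 if so, 0 if not, -1 if the child could not be run.
 */
int child_leads_new_session(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* A freshly forked child is not a process group leader,
         * so setsid() is permitted here. */
        if (setsid() == (pid_t)-1)
            _exit(2);
        pid_t self = getpid();
        _exit((getsid(0) == self && getpgrp() == self) ? 0 : 1);
    }

    int wstatus;
    if (waitpid(pid, &wstatus, 0) != pid || !WIFEXITED(wstatus))
        return -1;
    return WEXITSTATUS(wstatus) == 0;
}
```

In the main thread of a process, current_tid() returns the same value as getpid(). Forking before setsid() is the usual way to ensure that the caller is not already a process group leader.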
http://man7.org/linux/man-pages/man2/getsid.2.html 11
SYSTEM CALL: getsid(2) - Linux manual page
FUNCTIONALITY: getsid - get session ID
SYNOPSIS:
#include <unistd.h>
pid_t getsid(pid_t pid);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
    getsid(): _XOPEN_SOURCE >= 500 || /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
DESCRIPTION
getsid(0) returns the session ID of the calling process. getsid(p) returns the session ID of the process with process ID p. (The session ID of a process is the process group ID of the session leader.)
http://man7.org/linux/man-pages/man2/setpgid.2.html 10
SYSTEM CALL: setpgid(2) - Linux manual page
FUNCTIONALITY: setpgid, getpgid, setpgrp, getpgrp - set/get process group
SYNOPSIS:
#include <unistd.h>
int setpgid(pid_t pid, pid_t pgid);
pid_t getpgid(pid_t pid);
pid_t getpgrp(void);                 /* POSIX.1 version */
pid_t getpgrp(pid_t pid);            /* BSD version */
int setpgrp(void);                   /* System V version */
int setpgrp(pid_t pid, pid_t pgid);  /* BSD version */
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
    getpgid(): _XOPEN_SOURCE >= 500 || /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
    setpgrp() (POSIX.1): _XOPEN_SOURCE >= 500 || /* Since glibc 2.19: */ _DEFAULT_SOURCE || /* Glibc versions <= 2.19: */ _SVID_SOURCE
    setpgrp() (BSD), getpgrp() (BSD): [These are available only before glibc 2.19] _BSD_SOURCE && ! (_POSIX_SOURCE || _POSIX_C_SOURCE || _XOPEN_SOURCE || _GNU_SOURCE || _SVID_SOURCE)
DESCRIPTION
All of these interfaces are available on Linux, and are used for getting and setting the process group ID (PGID) of a process. The preferred, POSIX.1-specified ways of doing this are: getpgrp(void), for retrieving the calling process's PGID; and setpgid(), for setting a process's PGID.
setpgid() sets the PGID of the process specified by pid to pgid. If pid is zero, then the process ID of the calling process is used. If pgid is zero, then the PGID of the process specified by pid is made the same as its process ID.
If setpgid() is used to move a process from one process group to another (as is done by some shells when creating pipelines), both process groups must be part of the same session (see setsid(2) and credentials(7)). In this case, the pgid specifies an existing process group to be joined and the session ID of that group must match the session ID of the joining process.
The POSIX.1 version of getpgrp(), which takes no arguments, returns the PGID of the calling process.
getpgid() returns the PGID of the process specified by pid. If pid is zero, the process ID of the calling process is used. (Retrieving the PGID of a process other than the caller is rarely necessary, and the POSIX.1 getpgrp() is preferred for retrieving the caller's own PGID.)
The System V-style setpgrp(), which takes no arguments, is equivalent to setpgid(0, 0).
The BSD-specific setpgrp() call, which takes arguments pid and pgid, is a wrapper function that calls
    setpgid(pid, pgid)
Since glibc 2.19, the BSD-specific setpgrp() function is no longer exposed by <unistd.h>; calls should be replaced with the setpgid() call shown above.
The BSD-specific getpgrp() call, which takes a single pid argument, is a wrapper function that calls
    getpgid(pid)
Since glibc 2.19, the BSD-specific getpgrp() function is no longer exposed by <unistd.h>; calls should be replaced with calls to the POSIX.1 getpgrp() which takes no arguments (if the intent is to obtain the caller's PGID), or with the getpgid() call shown above.
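The zero-argument conventions of setpgid() described in this entry can be checked with a short sketch. The helper names are hypothetical; the child calls setpgid(0, 0) and then confirms that its PGID equals its own PID.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that moves itself into a new process group with
 * setpgid(0, 0): pid 0 means "the caller", pgid 0 means "use the
 * caller's PID as the PGID".
 * Returns 1 if the child became the leader of its own group,
 * 0 if not, -1 if the child could not be run.
 */
int child_becomes_group_leader(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        if (setpgid(0, 0) != 0)
            _exit(2);
        _exit(getpgrp() == getpid() ? 0 : 1);
    }

    int wstatus;
    if (waitpid(pid, &wstatus, 0) != pid || !WIFEXITED(wstatus))
        return -1;
    return WEXITSTATUS(wstatus) == 0;
}

/* getpgid(0) and the POSIX.1 getpgrp() both describe the caller. */
int caller_pgid_is_consistent(void)
{
    return getpgid(0) == getpgrp();
}
```

Doing the setpgid() call in a forked child mirrors what a shell does when it places each pipeline into its own process group, and leaves the parent's group membership untouched.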
http://man7.org/linux/man-pages/man2/getpgid.2.html 10
SYSTEM CALL: setpgid(2) - Linux manual page
(Same manual page as the setpgid.2.html entry above.)
http://man7.org/linux/man-pages/man2/getpgrp.2.html 10
SYSTEM CALL: setpgid(2) - Linux manual page
(Same manual page as the setpgid.2.html entry above.)
http://man7.org/linux/man-pages/man2/setuid.2.html 10
SYSTEM CALL: setuid(2) - Linux manual page
FUNCTIONALITY: setuid - set user identity
SYNOPSIS:
#include <sys/types.h>
#include <unistd.h>
int setuid(uid_t uid);
DESCRIPTION
setuid() sets the effective user ID of the calling process. If the effective UID of the caller is root (more precisely: if the caller has the CAP_SETUID capability), the real UID and saved set-user-ID are also set.
Under Linux, setuid() is implemented like the POSIX version with the _POSIX_SAVED_IDS feature. This allows a set-user-ID (other than root) program to drop all of its user privileges, do some unprivileged work, and then reengage the original effective user ID in a secure manner.
If the user is root or the program is set-user-ID-root, special care must be taken. The setuid() function checks the effective user ID of the caller and if it is the superuser, all process-related user IDs are set to uid. After this has occurred, it is impossible for the program to regain root privileges.
Thus, a set-user-ID-root program wishing to temporarily drop root privileges, assume the identity of an unprivileged user, and then regain root privileges afterward cannot use setuid(). You can accomplish this with seteuid(2).
http://man7.org/linux/man-pages/man2/getuid.2.html 9
SYSTEM CALL: getuid(2) - Linux manual page
FUNCTIONALITY: getuid, geteuid - get user identity
SYNOPSIS:
#include <unistd.h>
#include <sys/types.h>
uid_t getuid(void);
uid_t geteuid(void);
DESCRIPTION
getuid() returns the real user ID of the calling process.
geteuid() returns the effective user ID of the calling process.
http://man7.org/linux/man-pages/man2/setgid.2.html 10
SYSTEM CALL: setgid(2) - Linux manual page
FUNCTIONALITY: setgid - set group identity
SYNOPSIS:
#include <sys/types.h>
#include <unistd.h>
int setgid(gid_t gid);
DESCRIPTION
setgid() sets the effective group ID of the calling process. If the caller is privileged (has the CAP_SETGID capability), the real GID and saved set-group-ID are also set.
Under Linux, setgid() is implemented like the POSIX version with the _POSIX_SAVED_IDS feature. This allows a set-group-ID program that is not set-user-ID-root to drop all of its group privileges, do some unprivileged work, and then reengage the original effective group ID in a secure manner.
http://man7.org/linux/man-pages/man2/getgid.2.html 9
SYSTEM CALL: getgid(2) - Linux manual page
FUNCTIONALITY: getgid, getegid - get group identity
SYNOPSIS:
#include <unistd.h>
#include <sys/types.h>
gid_t getgid(void);
gid_t getegid(void);
DESCRIPTION
getgid() returns the real group ID of the calling process.
getegid() returns the effective group ID of the calling process.
http://man7.org/linux/man-pages/man2/setresuid.2.html 11
SYSTEM CALL: setresuid(2) - Linux manual page
FUNCTIONALITY: setresuid, setresgid - set real, effective and saved user or group ID
SYNOPSIS:
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <unistd.h>
int setresuid(uid_t ruid, uid_t euid, uid_t suid);
int setresgid(gid_t rgid, gid_t egid, gid_t sgid);
DESCRIPTION
setresuid() sets the real user ID, the effective user ID, and the saved set-user-ID of the calling process.
Unprivileged user processes may change the real UID, effective UID, and saved set-user-ID, each to one of: the current real UID, the current effective UID or the current saved set-user-ID.
Privileged processes (on Linux, those having the CAP_SETUID capability) may set the real UID, effective UID, and saved set-user-ID to arbitrary values.
If one of the arguments equals -1, the corresponding value is not changed.
Regardless of what changes are made to the real UID, effective UID, and saved set-user-ID, the filesystem UID is always set to the same value as the (possibly new) effective UID.
Completely analogously, setresgid() sets the real GID, effective GID, and saved set-group-ID of the calling process (and always modifies the filesystem GID to be the same as the effective GID), with the same restrictions for unprivileged processes.
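One common use of setresuid() and setresgid() is to give up set-user-ID or set-group-ID privilege for good by setting all three IDs to the real ID. The following is a minimal sketch under that assumption; the helper names are hypothetical, and the result is verified with getresuid() and getresgid() rather than trusting the return values alone.

```c
#define _GNU_SOURCE             /* for setresuid(), getresuid(), etc. */
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: set the real, effective and saved IDs to the
 * current real IDs. Returns 0 on success, -1 on failure.
 */
int drop_to_real_ids(void)
{
    uid_t uid = getuid();
    gid_t gid = getgid();
    uid_t r, e, s;
    gid_t rg, eg, sg;

    /* Groups first: after the user IDs are dropped, the process may no
     * longer be allowed to change its group IDs. */
    if (setresgid(gid, gid, gid) != 0)
        return -1;
    if (setresuid(uid, uid, uid) != 0)
        return -1;

    /* Confirm that all six IDs now hold the expected values. */
    if (getresgid(&rg, &eg, &sg) != 0 || rg != gid || eg != gid || sg != gid)
        return -1;
    if (getresuid(&r, &e, &s) != 0 || r != uid || e != uid || s != uid)
        return -1;
    return 0;
}

/* Change only the effective UID; -1 leaves the other two IDs untouched. */
int set_effective_uid_only(uid_t euid)
{
    return setresuid((uid_t)-1, euid, (uid_t)-1);
}
```

In a process that is not set-user-ID or set-group-ID, all IDs already match, so drop_to_real_ids() succeeds without changing anything. A set-user-ID program that wants to regain its privilege later would use set_effective_uid_only(getuid()) instead, which keeps the saved set-user-ID.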
http://man7.org/linux/man-pages/man2/getresuid.2.html 11
SYSTEM CALL: getresuid(2) - Linux manual page
FUNCTIONALITY: getresuid, getresgid - get real, effective and saved user/group IDs
SYNOPSIS:
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <unistd.h>
int getresuid(uid_t *ruid, uid_t *euid, uid_t *suid);
int getresgid(gid_t *rgid, gid_t *egid, gid_t *sgid);
DESCRIPTION
getresuid() returns the real UID, the effective UID, and the saved set-user-ID of the calling process, in the arguments ruid, euid, and suid, respectively. getresgid() performs the analogous task for the process's group IDs.
http://man7.org/linux/man-pages/man2/setresgid.2.html 11
SYSTEM CALL: setresuid(2) - Linux manual page
(Same manual page as the setresuid.2.html entry above.)
http://man7.org/linux/man-pages/man2/getresgid.2.html 11
SYSTEM CALL: getresuid(2) - Linux manual page
(Same manual page as the getresuid.2.html entry above.)
http://man7.org/linux/man-pages/man2/setreuid.2.html 10
SYSTEM CALL: setreuid(2) - Linux manual page
FUNCTIONALITY: setreuid, setregid - set real and/or effective user or group ID
SYNOPSIS:
#include <sys/types.h>
#include <unistd.h>
int setreuid(uid_t ruid, uid_t euid);
int setregid(gid_t rgid, gid_t egid);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
    setreuid(), setregid(): _XOPEN_SOURCE >= 500 || /* Since glibc 2.19: */ _DEFAULT_SOURCE || /* Glibc versions <= 2.19: */ _BSD_SOURCE
DESCRIPTION
setreuid() sets real and effective user IDs of the calling process.
Supplying a value of -1 for either the real or effective user ID forces the system to leave that ID unchanged.
Unprivileged processes may only set the effective user ID to the real user ID, the effective user ID, or the saved set-user-ID.
Unprivileged users may only set the real user ID to the real user ID or the effective user ID.
If the real user ID is set (i.e., ruid is not -1) or the effective user ID is set to a value not equal to the previous real user ID, the saved set-user-ID will be set to the new effective user ID.
Completely analogously, setregid() sets real and effective group IDs of the calling process, and all of the above holds with "group" instead of "user".
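The permission rules for setreuid() allow an unprivileged process to exchange its real and effective user IDs, since each new value is one of the IDs it already holds. The sketch below illustrates that; the helper name is hypothetical.

```c
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: exchange the real and effective user IDs.
 * Returns 0 if the IDs were swapped, -1 otherwise.
 */
int swap_real_and_effective_uid(void)
{
    uid_t old_real = getuid();
    uid_t old_eff = geteuid();

    /* New real ID = old effective ID, new effective ID = old real ID. */
    if (setreuid(old_eff, old_real) != 0)
        return -1;

    return (getuid() == old_eff && geteuid() == old_real) ? 0 : -1;
}
```

Because ruid is not -1 in this call, the rule quoted above applies and the saved set-user-ID becomes the new effective user ID. Calling the helper a second time swaps the IDs back. In a process whose real and effective IDs are equal, the call succeeds and changes nothing.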
http://man7.org/linux/man-pages/man2/setregid.2.html 10
SYSTEM CALL: setreuid(2) - Linux manual page
(Same manual page as the setreuid.2.html entry above.)
http://man7.org/linux/man-pages/man2/setfsuid.2.html 11
SYSTEM CALL: setfsuid(2) - Linux manual page
FUNCTIONALITY: setfsuid - set user identity used for filesystem checks
SYNOPSIS:
#include <sys/fsuid.h>
int setfsuid(uid_t fsuid);
DESCRIPTION
The system call setfsuid() changes the value of the caller's filesystem user ID, that is, the user ID that the Linux kernel uses to check for all accesses to the filesystem. Normally, the value of the filesystem user ID will shadow the value of the effective user ID. In fact, whenever the effective user ID is changed, the filesystem user ID will also be changed to the new value of the effective user ID.
Explicit calls to setfsuid() and setfsgid(2) are usually used only by programs such as the Linux NFS server that need to change what user and group ID is used for file access without a corresponding change in the real and effective user and group IDs. A change in the normal user IDs for a program such as the NFS server is a security hole that can expose it to unwanted signals. (But see below.)
setfsuid() will succeed only if the caller is the superuser or if fsuid matches either the caller's real user ID, effective user ID, saved set-user-ID, or current filesystem user ID.
http://man7.org/linux/man-pages/man2/setfsgid.2.html 11
SYSTEM CALL: setfsgid(2) - Linux manual page
FUNCTIONALITY: setfsgid - set group identity used for filesystem checks
SYNOPSIS:
#include <sys/fsuid.h>
int setfsgid(gid_t fsgid);
DESCRIPTION
The system call setfsgid() changes the value of the caller's filesystem group ID, that is, the group ID that the Linux kernel uses to check for all accesses to the filesystem. Normally, the value of the filesystem group ID will shadow the value of the effective group ID. In fact, whenever the effective group ID is changed, the filesystem group ID will also be changed to the new value of the effective group ID.
Explicit calls to setfsuid(2) and setfsgid() are usually used only by programs such as the Linux NFS server that need to change what user and group ID is used for file access without a corresponding change in the real and effective user and group IDs. A change in the normal user IDs for a program such as the NFS server is a security hole that can expose it to unwanted signals. (But see below.)
setfsgid() will succeed only if the caller is the superuser or if fsgid matches either the caller's real group ID, effective group ID, saved set-group-ID, or current filesystem group ID.
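The statement that the filesystem user ID normally shadows the effective user ID can be checked from user space. The sketch below relies on a detail that is not part of the excerpt above but is stated in the RETURN VALUE section of setfsuid(2): the call returns the previous filesystem user ID on both success and failure. Passing (uid_t)-1, which is not a valid user ID, therefore reads the current value without changing it. The helper names are hypothetical.

```c
#include <sys/fsuid.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read the caller's filesystem UID without modifying it.
 * Assumption (from the full setfsuid(2) page): the previous filesystem
 * UID is returned even when the request is rejected, and (uid_t)-1 is
 * always rejected.
 */
uid_t current_fsuid(void)
{
    return (uid_t)setfsuid((uid_t)-1);
}

/* 1 if the filesystem UID currently equals the effective UID, else 0. */
int fsuid_tracks_euid(void)
{
    return current_fsuid() == geteuid();
}
```

In a process that has never called setfsuid() explicitly, fsuid_tracks_euid() is expected to return 1, in line with the shadowing behavior described in the entry.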
http://man7.org/linux/man-pages/man2/geteuid.2.html 9 SYSTEM CALL: getuid(2) - Linux manual page FUNCTIONALITY: getuid, geteuid - get user identity SYNOPSIS: #include <unistd.h> #include <sys/types.h> uid_t getuid(void); uid_t geteuid(void); DESCRIPTION getuid() returns the real user ID of the calling process. geteuid() returns the effective user ID of the calling process. http://man7.org/linux/man-pages/man2/getegid.2.html 9 SYSTEM CALL: getgid(2) - Linux manual page FUNCTIONALITY: getgid, getegid - get group identity SYNOPSIS: #include <unistd.h> #include <sys/types.h> gid_t getgid(void); gid_t getegid(void); DESCRIPTION getgid() returns the real group ID of the calling process. getegid() returns the effective group ID of the calling process. http://man7.org/linux/man-pages/man2/setgroups.2.html 10 SYSTEM CALL: getgroups(2) - Linux manual page FUNCTIONALITY: getgroups, setgroups - get/set list of supplementary group IDs SYNOPSIS: #include <sys/types.h> #include <unistd.h> int getgroups(int size, gid_t list[]); #include <grp.h> int setgroups(size_t size, const gid_t *list); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): setgroups(): Since glibc 2.19: _DEFAULT_SOURCE Glibc 2.19 and earlier: _SVID_SOURCE DESCRIPTION getgroups() returns the supplementary group IDs of the calling process in list. The argument size should be set to the maximum number of items that can be stored in the buffer pointed to by list. If the calling process is a member of more than size supplementary groups, then an error results. It is unspecified whether the effective group ID of the calling process is included in the returned list. (Thus, an application should also call getegid(2) and add or remove the resulting value.) If size is zero, list is not modified, but the total number of supplementary group IDs for the process is returned. This allows the caller to determine the size of a dynamically allocated list to be used in a further call to getgroups(). setgroups() sets the supplementary group IDs for the calling process.
Appropriate privileges (Linux: the CAP_SETGID capability) are required. The size argument specifies the number of supplementary group IDs in the buffer pointed to by list. http://man7.org/linux/man-pages/man2/getgroups.2.html 10 SYSTEM CALL: getgroups(2) - Linux manual page FUNCTIONALITY: getgroups, setgroups - get/set list of supplementary group IDs SYNOPSIS: #include <sys/types.h> #include <unistd.h> int getgroups(int size, gid_t list[]); #include <grp.h> int setgroups(size_t size, const gid_t *list); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): setgroups(): Since glibc 2.19: _DEFAULT_SOURCE Glibc 2.19 and earlier: _SVID_SOURCE DESCRIPTION getgroups() returns the supplementary group IDs of the calling process in list. The argument size should be set to the maximum number of items that can be stored in the buffer pointed to by list. If the calling process is a member of more than size supplementary groups, then an error results. It is unspecified whether the effective group ID of the calling process is included in the returned list. (Thus, an application should also call getegid(2) and add or remove the resulting value.) If size is zero, list is not modified, but the total number of supplementary group IDs for the process is returned. This allows the caller to determine the size of a dynamically allocated list to be used in a further call to getgroups(). setgroups() sets the supplementary group IDs for the calling process. Appropriate privileges (Linux: the CAP_SETGID capability) are required. The size argument specifies the number of supplementary group IDs in the buffer pointed to by list. http://man7.org/linux/man-pages/man2/setns.2.html 12 SYSTEM CALL: setns(2) - Linux manual page FUNCTIONALITY: setns - reassociate thread with a namespace SYNOPSIS: #define _GNU_SOURCE /* See feature_test_macros(7) */ #include <sched.h> int setns(int fd, int nstype); DESCRIPTION Given a file descriptor referring to a namespace, reassociate the calling thread with that namespace.
The fd argument is a file descriptor referring to one of the namespace entries in a /proc/[pid]/ns/ directory; see namespaces(7) for further information on /proc/[pid]/ns/. The calling thread will be reassociated with the corresponding namespace, subject to any constraints imposed by the nstype argument. The nstype argument specifies which type of namespace the calling thread may be reassociated with. This argument can have one of the following values: 0 Allow any type of namespace to be joined. CLONE_NEWCGROUP (since Linux 4.6) fd must refer to a cgroup namespace. CLONE_NEWIPC (since Linux 3.0) fd must refer to an IPC namespace. CLONE_NEWNET (since Linux 3.0) fd must refer to a network namespace. CLONE_NEWNS (since Linux 3.8) fd must refer to a mount namespace. CLONE_NEWPID (since Linux 3.8) fd must refer to a descendant PID namespace. CLONE_NEWUSER (since Linux 3.8) fd must refer to a user namespace. CLONE_NEWUTS (since Linux 3.0) fd must refer to a UTS namespace. Specifying nstype as 0 suffices if the caller knows (or does not care) what type of namespace is referred to by fd. Specifying a nonzero value for nstype is useful if the caller does not know what type of namespace is referred to by fd and wants to ensure that the namespace is of a particular type. (The caller might not know the type of the namespace referred to by fd if the file descriptor was opened by another process and, for example, passed to the caller via a UNIX domain socket.) CLONE_NEWPID behaves somewhat differently from the other nstype values: reassociating the calling thread with a PID namespace changes only the PID namespace that child processes of the caller will be created in; it does not change the PID namespace of the caller itself. Reassociating with a PID namespace is allowed only if the PID namespace specified by fd is a descendant (child, grandchild, etc.) of the PID namespace of the caller. For further details on PID namespaces, see pid_namespaces(7). 
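The nstype check described above can be sketched as a small wrapper that opens a namespace file and passes a nonzero nstype, so that the kernel rejects a descriptor of the wrong type. This is only an illustration under assumptions: the function name join_namespace_checked() is hypothetical, and a real caller normally also needs the privileges that setns() requires.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>

/* Hypothetical helper: join the namespace referred to by the file
 * `ns_path` (for example an entry under /proc/[pid]/ns/), but only if
 * it is of the type given by `nstype` (CLONE_NEWNET, CLONE_NEWUTS, ...).
 * Returns 0 on success, or the errno value of the step that failed. */
int join_namespace_checked(const char *ns_path, int nstype)
{
    int fd = open(ns_path, O_RDONLY | O_CLOEXEC);
    if (fd == -1)
        return errno;

    int err = 0;
    if (setns(fd, nstype) == -1)
        err = errno;    /* wrong namespace type, missing privilege, ... */

    close(fd);          /* the descriptor is not needed after setns() */
    return err;
}
```

Passing a UTS namespace file together with CLONE_NEWNET, for instance, is expected to fail rather than silently join the wrong kind of namespace.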
A process reassociating itself with a user namespace must have the CAP_SYS_ADMIN capability in the target user namespace. Upon successfully joining a user namespace, a process is granted all capabilities in that namespace, regardless of its user and group IDs. A multithreaded process may not change user namespace with setns(). It is not permitted to use setns() to reenter the caller's current user namespace. This prevents a caller that has dropped capabilities from regaining those capabilities via a call to setns(). For security reasons, a process can't join a new user namespace if it is sharing filesystem-related attributes (the attributes whose sharing is controlled by the clone(2) CLONE_FS flag) with another process. For further details on user namespaces, see user_namespaces(7). A process may not be reassociated with a new mount namespace if it is multithreaded. Changing the mount namespace requires that the caller possess both CAP_SYS_CHROOT and CAP_SYS_ADMIN capabilities in its own user namespace and CAP_SYS_ADMIN in the target mount namespace. See user_namespaces(7) for details on the interaction of user namespaces and mount namespaces. Using setns() to change the caller's cgroup namespace does not change the caller's cgroup memberships. http://man7.org/linux/man-pages/man2/setrlimit.2.html 14 SYSTEM CALL: getrlimit(2) - Linux manual page FUNCTIONALITY: getrlimit, setrlimit, prlimit - get/set resource limits SYNOPSIS: #include <sys/time.h> #include <sys/resource.h> int getrlimit(int resource, struct rlimit *rlim); int setrlimit(int resource, const struct rlimit *rlim); int prlimit(pid_t pid, int resource, const struct rlimit *new_limit, struct rlimit *old_limit); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): prlimit(): _GNU_SOURCE DESCRIPTION The getrlimit() and setrlimit() system calls get and set resource limits respectively.
Each resource has an associated soft and hard limit, as defined by the rlimit structure: struct rlimit { rlim_t rlim_cur; /* Soft limit */ rlim_t rlim_max; /* Hard limit (ceiling for rlim_cur) */ }; The soft limit is the value that the kernel enforces for the corresponding resource. The hard limit acts as a ceiling for the soft limit: an unprivileged process may set only its soft limit to a value in the range from 0 up to the hard limit, and (irreversibly) lower its hard limit. A privileged process (under Linux: one with the CAP_SYS_RESOURCE capability) may make arbitrary changes to either limit value. The value RLIM_INFINITY denotes no limit on a resource (both in the structure returned by getrlimit() and in the structure passed to setrlimit()). The resource argument must be one of: RLIMIT_AS The maximum size of the process's virtual memory (address space) in bytes. This limit affects calls to brk(2), mmap(2), and mremap(2), which fail with the error ENOMEM upon exceeding this limit. Also automatic stack expansion will fail (and generate a SIGSEGV that kills the process if no alternate stack has been made available via sigaltstack(2)). Since the value is a long, on machines with a 32-bit long either this limit is at most 2 GiB, or this resource is unlimited. RLIMIT_CORE Maximum size of a core file (see core(5)). When 0 no core dump files are created. When nonzero, larger dumps are truncated to this size. RLIMIT_CPU CPU time limit in seconds. When the process reaches the soft limit, it is sent a SIGXCPU signal. The default action for this signal is to terminate the process. However, the signal can be caught, and the handler can return control to the main program. If the process continues to consume CPU time, it will be sent SIGXCPU once per second until the hard limit is reached, at which time it is sent SIGKILL. (This latter point describes Linux behavior. 
Implementations vary in how they treat processes which continue to consume CPU time after reaching the soft limit. Portable applications that need to catch this signal should perform an orderly termination upon first receipt of SIGXCPU.) RLIMIT_DATA The maximum size of the process's data segment (initialized data, uninitialized data, and heap). This limit affects calls to brk(2) and sbrk(2), which fail with the error ENOMEM upon encountering the soft limit of this resource. RLIMIT_FSIZE The maximum size of files that the process may create. Attempts to extend a file beyond this limit result in delivery of a SIGXFSZ signal. By default, this signal terminates a process, but a process can catch this signal instead, in which case the relevant system call (e.g., write(2), truncate(2)) fails with the error EFBIG. RLIMIT_LOCKS (Early Linux 2.4 only) A limit on the combined number of flock(2) locks and fcntl(2) leases that this process may establish. RLIMIT_MEMLOCK The maximum number of bytes of memory that may be locked into RAM. In effect this limit is rounded down to the nearest multiple of the system page size. This limit affects mlock(2) and mlockall(2) and the mmap(2) MAP_LOCKED operation. Since Linux 2.6.9 it also affects the shmctl(2) SHM_LOCK operation, where it sets a maximum on the total bytes in shared memory segments (see shmget(2)) that may be locked by the real user ID of the calling process. The shmctl(2) SHM_LOCK locks are accounted for separately from the per-process memory locks established by mlock(2), mlockall(2), and mmap(2) MAP_LOCKED; a process can lock bytes up to this limit in each of these two categories. In Linux kernels before 2.6.9, this limit controlled the amount of memory that could be locked by a privileged process. Since Linux 2.6.9, no limits are placed on the amount of memory that a privileged process may lock, and this limit instead governs the amount of memory that an unprivileged process may lock. 
RLIMIT_MSGQUEUE (since Linux 2.6.8) Specifies the limit on the number of bytes that can be allocated for POSIX message queues for the real user ID of the calling process. This limit is enforced for mq_open(3). Each message queue that the user creates counts (until it is removed) against this limit according to the formula: Since Linux 3.5: bytes = attr.mq_maxmsg * sizeof(struct msg_msg) + min(attr.mq_maxmsg, MQ_PRIO_MAX) * sizeof(struct posix_msg_tree_node)+ /* For overhead */ attr.mq_maxmsg * attr.mq_msgsize; /* For message data */ Linux 3.4 and earlier: bytes = attr.mq_maxmsg * sizeof(struct msg_msg *) + /* For overhead */ attr.mq_maxmsg * attr.mq_msgsize; /* For message data */ where attr is the mq_attr structure specified as the fourth argument to mq_open(3), and the msg_msg and posix_msg_tree_node structures are kernel-internal structures. The "overhead" addend in the formula accounts for overhead bytes required by the implementation and ensures that the user cannot create an unlimited number of zero-length messages (such messages nevertheless each consume some system memory for bookkeeping overhead). RLIMIT_NICE (since Linux 2.6.12, but see BUGS below) Specifies a ceiling to which the process's nice value can be raised using setpriority(2) or nice(2). The actual ceiling for the nice value is calculated as 20 - rlim_cur. (This strangeness occurs because negative numbers cannot be specified as resource limit values, since they typically have special meanings. For example, RLIM_INFINITY typically is the same as -1.) RLIMIT_NOFILE Specifies a value one greater than the maximum file descriptor number that can be opened by this process. Attempts (open(2), pipe(2), dup(2), etc.) to exceed this limit yield the error EMFILE. (Historically, this limit was named RLIMIT_OFILE on BSD.) RLIMIT_NPROC The maximum number of processes (or, more precisely on Linux, threads) that can be created for the real user ID of the calling process. 
Upon encountering this limit, fork(2) fails with the error EAGAIN. This limit is not enforced for processes that have either the CAP_SYS_ADMIN or the CAP_SYS_RESOURCE capability. RLIMIT_RSS Specifies the limit (in bytes) of the process's resident set (the number of virtual pages resident in RAM). This limit has effect only in Linux 2.4.x, x < 30, and there affects only calls to madvise(2) specifying MADV_WILLNEED. RLIMIT_RTPRIO (since Linux 2.6.12, but see BUGS) Specifies a ceiling on the real-time priority that may be set for this process using sched_setscheduler(2) and sched_setparam(2). RLIMIT_RTTIME (since Linux 2.6.25) Specifies a limit (in microseconds) on the amount of CPU time that a process scheduled under a real-time scheduling policy may consume without making a blocking system call. For the purpose of this limit, each time a process makes a blocking system call, the count of its consumed CPU time is reset to zero. The CPU time count is not reset if the process continues trying to use the CPU but is preempted, its time slice expires, or it calls sched_yield(2). Upon reaching the soft limit, the process is sent a SIGXCPU signal. If the process catches or ignores this signal and continues consuming CPU time, then SIGXCPU will be generated once each second until the hard limit is reached, at which point the process is sent a SIGKILL signal. The intended use of this limit is to stop a runaway real-time process from locking up the system. RLIMIT_SIGPENDING (since Linux 2.6.8) Specifies the limit on the number of signals that may be queued for the real user ID of the calling process. Both standard and real-time signals are counted for the purpose of checking this limit. However, the limit is enforced only for sigqueue(3); it is always possible to use kill(2) to queue one instance of any of the signals that are not already queued to the process. RLIMIT_STACK The maximum size of the process stack, in bytes. 
Upon reaching this limit, a SIGSEGV signal is generated. To handle this signal, a process must employ an alternate signal stack (sigaltstack(2)). Since Linux 2.6.23, this limit also determines the amount of space used for the process's command-line arguments and environment variables; for details, see execve(2). prlimit() The Linux-specific prlimit() system call combines and extends the functionality of setrlimit() and getrlimit(). It can be used to both set and get the resource limits of an arbitrary process. The resource argument has the same meaning as for setrlimit() and getrlimit(). If the new_limit argument is not NULL, then the rlimit structure to which it points is used to set new values for the soft and hard limits for resource. If the old_limit argument is not NULL, then a successful call to prlimit() places the previous soft and hard limits for resource in the rlimit structure pointed to by old_limit. The pid argument specifies the ID of the process on which the call is to operate. If pid is 0, then the call applies to the calling process. To set or get the resources of a process other than itself, the caller must have the CAP_SYS_RESOURCE capability, or the real, effective, and saved set user IDs of the target process must match the real user ID of the caller and the real, effective, and saved set group IDs of the target process must match the real group ID of the caller.
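The soft/hard limit rules described in this entry can be illustrated with a short read-modify-write sequence: fetch the current limits with getrlimit(), change only rlim_cur, and write the structure back with setrlimit(). The following is a minimal sketch; the function name set_soft_limit() is hypothetical, and clamping to the hard limit is a design choice made here so that an unprivileged caller does not fail.

```c
#include <sys/resource.h>
#include <sys/time.h>

/* Hypothetical helper: change only the soft limit of `resource`,
 * leaving the hard limit untouched. A request above the hard limit
 * is clamped to it, since an unprivileged process may not exceed it.
 * Returns 0 on success, -1 on error (errno is set by the failing call). */
int set_soft_limit(int resource, rlim_t new_soft)
{
    struct rlimit rl;

    if (getrlimit(resource, &rl) == -1)
        return -1;

    if (new_soft > rl.rlim_max)
        new_soft = rl.rlim_max;     /* the hard limit is the ceiling */

    rl.rlim_cur = new_soft;         /* rlim_max is written back unchanged */
    return setrlimit(resource, &rl);
}
```

For example, set_soft_limit(RLIMIT_CORE, 0) disables core dump files for the calling process while keeping the option of raising the soft limit again later, up to the unchanged hard limit.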
http://man7.org/linux/man-pages/man2/getrlimit.2.html 14 SYSTEM CALL: getrlimit(2) - Linux manual page FUNCTIONALITY: getrlimit, setrlimit, prlimit - get/set resource limits SYNOPSIS: #include <sys/time.h> #include <sys/resource.h> int getrlimit(int resource, struct rlimit *rlim); int setrlimit(int resource, const struct rlimit *rlim); int prlimit(pid_t pid, int resource, const struct rlimit *new_limit, struct rlimit *old_limit); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): prlimit(): _GNU_SOURCE DESCRIPTION The getrlimit() and setrlimit() system calls get and set resource limits respectively. Each resource has an associated soft and hard limit, as defined by the rlimit structure: struct rlimit { rlim_t rlim_cur; /* Soft limit */ rlim_t rlim_max; /* Hard limit (ceiling for rlim_cur) */ }; The soft limit is the value that the kernel enforces for the corresponding resource. The hard limit acts as a ceiling for the soft limit: an unprivileged process may set only its soft limit to a value in the range from 0 up to the hard limit, and (irreversibly) lower its hard limit. A privileged process (under Linux: one with the CAP_SYS_RESOURCE capability) may make arbitrary changes to either limit value. The value RLIM_INFINITY denotes no limit on a resource (both in the structure returned by getrlimit() and in the structure passed to setrlimit()). The resource argument must be one of: RLIMIT_AS The maximum size of the process's virtual memory (address space) in bytes. This limit affects calls to brk(2), mmap(2), and mremap(2), which fail with the error ENOMEM upon exceeding this limit. Also automatic stack expansion will fail (and generate a SIGSEGV that kills the process if no alternate stack has been made available via sigaltstack(2)). Since the value is a long, on machines with a 32-bit long either this limit is at most 2 GiB, or this resource is unlimited. RLIMIT_CORE Maximum size of a core file (see core(5)). When 0 no core dump files are created.
When nonzero, larger dumps are truncated to this size. RLIMIT_CPU CPU time limit in seconds. When the process reaches the soft limit, it is sent a SIGXCPU signal. The default action for this signal is to terminate the process. However, the signal can be caught, and the handler can return control to the main program. If the process continues to consume CPU time, it will be sent SIGXCPU once per second until the hard limit is reached, at which time it is sent SIGKILL. (This latter point describes Linux behavior. Implementations vary in how they treat processes which continue to consume CPU time after reaching the soft limit. Portable applications that need to catch this signal should perform an orderly termination upon first receipt of SIGXCPU.) RLIMIT_DATA The maximum size of the process's data segment (initialized data, uninitialized data, and heap). This limit affects calls to brk(2) and sbrk(2), which fail with the error ENOMEM upon encountering the soft limit of this resource. RLIMIT_FSIZE The maximum size of files that the process may create. Attempts to extend a file beyond this limit result in delivery of a SIGXFSZ signal. By default, this signal terminates a process, but a process can catch this signal instead, in which case the relevant system call (e.g., write(2), truncate(2)) fails with the error EFBIG. RLIMIT_LOCKS (Early Linux 2.4 only) A limit on the combined number of flock(2) locks and fcntl(2) leases that this process may establish. RLIMIT_MEMLOCK The maximum number of bytes of memory that may be locked into RAM. In effect this limit is rounded down to the nearest multiple of the system page size. This limit affects mlock(2) and mlockall(2) and the mmap(2) MAP_LOCKED operation. Since Linux 2.6.9 it also affects the shmctl(2) SHM_LOCK operation, where it sets a maximum on the total bytes in shared memory segments (see shmget(2)) that may be locked by the real user ID of the calling process. 
The shmctl(2) SHM_LOCK locks are accounted for separately from the per-process memory locks established by mlock(2), mlockall(2), and mmap(2) MAP_LOCKED; a process can lock bytes up to this limit in each of these two categories. In Linux kernels before 2.6.9, this limit controlled the amount of memory that could be locked by a privileged process. Since Linux 2.6.9, no limits are placed on the amount of memory that a privileged process may lock, and this limit instead governs the amount of memory that an unprivileged process may lock. RLIMIT_MSGQUEUE (since Linux 2.6.8) Specifies the limit on the number of bytes that can be allocated for POSIX message queues for the real user ID of the calling process. This limit is enforced for mq_open(3). Each message queue that the user creates counts (until it is removed) against this limit according to the formula: Since Linux 3.5: bytes = attr.mq_maxmsg * sizeof(struct msg_msg) + min(attr.mq_maxmsg, MQ_PRIO_MAX) * sizeof(struct posix_msg_tree_node)+ /* For overhead */ attr.mq_maxmsg * attr.mq_msgsize; /* For message data */ Linux 3.4 and earlier: bytes = attr.mq_maxmsg * sizeof(struct msg_msg *) + /* For overhead */ attr.mq_maxmsg * attr.mq_msgsize; /* For message data */ where attr is the mq_attr structure specified as the fourth argument to mq_open(3), and the msg_msg and posix_msg_tree_node structures are kernel-internal structures. The "overhead" addend in the formula accounts for overhead bytes required by the implementation and ensures that the user cannot create an unlimited number of zero-length messages (such messages nevertheless each consume some system memory for bookkeeping overhead). RLIMIT_NICE (since Linux 2.6.12, but see BUGS below) Specifies a ceiling to which the process's nice value can be raised using setpriority(2) or nice(2). The actual ceiling for the nice value is calculated as 20 - rlim_cur. 
(This strangeness occurs because negative numbers cannot be specified as resource limit values, since they typically have special meanings. For example, RLIM_INFINITY typically is the same as -1.) RLIMIT_NOFILE Specifies a value one greater than the maximum file descriptor number that can be opened by this process. Attempts (open(2), pipe(2), dup(2), etc.) to exceed this limit yield the error EMFILE. (Historically, this limit was named RLIMIT_OFILE on BSD.) RLIMIT_NPROC The maximum number of processes (or, more precisely on Linux, threads) that can be created for the real user ID of the calling process. Upon encountering this limit, fork(2) fails with the error EAGAIN. This limit is not enforced for processes that have either the CAP_SYS_ADMIN or the CAP_SYS_RESOURCE capability. RLIMIT_RSS Specifies the limit (in bytes) of the process's resident set (the number of virtual pages resident in RAM). This limit has effect only in Linux 2.4.x, x < 30, and there affects only calls to madvise(2) specifying MADV_WILLNEED. RLIMIT_RTPRIO (since Linux 2.6.12, but see BUGS) Specifies a ceiling on the real-time priority that may be set for this process using sched_setscheduler(2) and sched_setparam(2). RLIMIT_RTTIME (since Linux 2.6.25) Specifies a limit (in microseconds) on the amount of CPU time that a process scheduled under a real-time scheduling policy may consume without making a blocking system call. For the purpose of this limit, each time a process makes a blocking system call, the count of its consumed CPU time is reset to zero. The CPU time count is not reset if the process continues trying to use the CPU but is preempted, its time slice expires, or it calls sched_yield(2). Upon reaching the soft limit, the process is sent a SIGXCPU signal. If the process catches or ignores this signal and continues consuming CPU time, then SIGXCPU will be generated once each second until the hard limit is reached, at which point the process is sent a SIGKILL signal. 
The intended use of this limit is to stop a runaway real-time process from locking up the system. RLIMIT_SIGPENDING (since Linux 2.6.8) Specifies the limit on the number of signals that may be queued for the real user ID of the calling process. Both standard and real-time signals are counted for the purpose of checking this limit. However, the limit is enforced only for sigqueue(3); it is always possible to use kill(2) to queue one instance of any of the signals that are not already queued to the process. RLIMIT_STACK The maximum size of the process stack, in bytes. Upon reaching this limit, a SIGSEGV signal is generated. To handle this signal, a process must employ an alternate signal stack (sigaltstack(2)). Since Linux 2.6.23, this limit also determines the amount of space used for the process's command-line arguments and environment variables; for details, see execve(2). prlimit() The Linux-specific prlimit() system call combines and extends the functionality of setrlimit() and getrlimit(). It can be used to both set and get the resource limits of an arbitrary process. The resource argument has the same meaning as for setrlimit() and getrlimit(). If the new_limit argument is not NULL, then the rlimit structure to which it points is used to set new values for the soft and hard limits for resource. If the old_limit argument is not NULL, then a successful call to prlimit() places the previous soft and hard limits for resource in the rlimit structure pointed to by old_limit. The pid argument specifies the ID of the process on which the call is to operate. If pid is 0, then the call applies to the calling process. To set or get the resources of a process other than itself, the caller must have the CAP_SYS_RESOURCE capability, or the real, effective, and saved set user IDs of the target process must match the real user ID of the caller and the real, effective, and saved set group IDs of the target process must match the real group ID of the caller.
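The RLIMIT_NICE rule in this entry (ceiling = 20 - rlim_cur) is easy to misread, so here is a small worked sketch of the arithmetic. The function names are hypothetical; clamping limits of 40 or more to -20 is an assumption based on the usual nice range of -20 to 19 and is not stated in the text above.

```c
#include <sys/resource.h>
#include <sys/time.h>

/* Translate an RLIMIT_NICE soft limit into the most favorable
 * (numerically lowest) nice value the process may request,
 * using the formula 20 - rlim_cur.
 * Assumption: limits of 40 or more are clamped to -20. */
int nice_ceiling(rlim_t rlim_cur)
{
    if (rlim_cur >= 40)
        return -20;
    return 20 - (int) rlim_cur;
}

/* Read the caller's RLIMIT_NICE soft limit and store the resulting
 * ceiling in *ceiling. Returns 0 on success, -1 on error. */
int current_nice_ceiling(int *ceiling)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NICE, &rl) == -1)
        return -1;

    *ceiling = nice_ceiling(rl.rlim_cur);
    return 0;
}
```

With this mapping, a soft limit of 1 corresponds to a ceiling of 19 (lowest priority), 20 corresponds to 0, and 40 corresponds to -20 (highest priority).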
http://man7.org/linux/man-pages/man2/prlimit.2.html 14 SYSTEM CALL: getrlimit(2) - Linux manual page FUNCTIONALITY: getrlimit, setrlimit, prlimit - get/set resource limits SYNOPSIS: #include <sys/time.h> #include <sys/resource.h> int getrlimit(int resource, struct rlimit *rlim); int setrlimit(int resource, const struct rlimit *rlim); int prlimit(pid_t pid, int resource, const struct rlimit *new_limit, struct rlimit *old_limit); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): prlimit(): _GNU_SOURCE DESCRIPTION The getrlimit() and setrlimit() system calls get and set resource limits respectively. Each resource has an associated soft and hard limit, as defined by the rlimit structure: struct rlimit { rlim_t rlim_cur; /* Soft limit */ rlim_t rlim_max; /* Hard limit (ceiling for rlim_cur) */ }; The soft limit is the value that the kernel enforces for the corresponding resource. The hard limit acts as a ceiling for the soft limit: an unprivileged process may set only its soft limit to a value in the range from 0 up to the hard limit, and (irreversibly) lower its hard limit. A privileged process (under Linux: one with the CAP_SYS_RESOURCE capability) may make arbitrary changes to either limit value. The value RLIM_INFINITY denotes no limit on a resource (both in the structure returned by getrlimit() and in the structure passed to setrlimit()). The resource argument must be one of: RLIMIT_AS The maximum size of the process's virtual memory (address space) in bytes. This limit affects calls to brk(2), mmap(2), and mremap(2), which fail with the error ENOMEM upon exceeding this limit. Also automatic stack expansion will fail (and generate a SIGSEGV that kills the process if no alternate stack has been made available via sigaltstack(2)). Since the value is a long, on machines with a 32-bit long either this limit is at most 2 GiB, or this resource is unlimited. RLIMIT_CORE Maximum size of a core file (see core(5)). When 0 no core dump files are created.
When nonzero, larger dumps are truncated to this size. RLIMIT_CPU CPU time limit in seconds. When the process reaches the soft limit, it is sent a SIGXCPU signal. The default action for this signal is to terminate the process. However, the signal can be caught, and the handler can return control to the main program. If the process continues to consume CPU time, it will be sent SIGXCPU once per second until the hard limit is reached, at which time it is sent SIGKILL. (This latter point describes Linux behavior. Implementations vary in how they treat processes which continue to consume CPU time after reaching the soft limit. Portable applications that need to catch this signal should perform an orderly termination upon first receipt of SIGXCPU.) RLIMIT_DATA The maximum size of the process's data segment (initialized data, uninitialized data, and heap). This limit affects calls to brk(2) and sbrk(2), which fail with the error ENOMEM upon encountering the soft limit of this resource. RLIMIT_FSIZE The maximum size of files that the process may create. Attempts to extend a file beyond this limit result in delivery of a SIGXFSZ signal. By default, this signal terminates a process, but a process can catch this signal instead, in which case the relevant system call (e.g., write(2), truncate(2)) fails with the error EFBIG. RLIMIT_LOCKS (Early Linux 2.4 only) A limit on the combined number of flock(2) locks and fcntl(2) leases that this process may establish. RLIMIT_MEMLOCK The maximum number of bytes of memory that may be locked into RAM. In effect this limit is rounded down to the nearest multiple of the system page size. This limit affects mlock(2) and mlockall(2) and the mmap(2) MAP_LOCKED operation. Since Linux 2.6.9 it also affects the shmctl(2) SHM_LOCK operation, where it sets a maximum on the total bytes in shared memory segments (see shmget(2)) that may be locked by the real user ID of the calling process. 
The shmctl(2) SHM_LOCK locks are accounted for separately from the per-process memory locks established by mlock(2), mlockall(2), and mmap(2) MAP_LOCKED; a process can lock bytes up to this limit in each of these two categories. In Linux kernels before 2.6.9, this limit controlled the amount of memory that could be locked by a privileged process. Since Linux 2.6.9, no limits are placed on the amount of memory that a privileged process may lock, and this limit instead governs the amount of memory that an unprivileged process may lock. RLIMIT_MSGQUEUE (since Linux 2.6.8) Specifies the limit on the number of bytes that can be allocated for POSIX message queues for the real user ID of the calling process. This limit is enforced for mq_open(3). Each message queue that the user creates counts (until it is removed) against this limit according to the formula: Since Linux 3.5: bytes = attr.mq_maxmsg * sizeof(struct msg_msg) + min(attr.mq_maxmsg, MQ_PRIO_MAX) * sizeof(struct posix_msg_tree_node)+ /* For overhead */ attr.mq_maxmsg * attr.mq_msgsize; /* For message data */ Linux 3.4 and earlier: bytes = attr.mq_maxmsg * sizeof(struct msg_msg *) + /* For overhead */ attr.mq_maxmsg * attr.mq_msgsize; /* For message data */ where attr is the mq_attr structure specified as the fourth argument to mq_open(3), and the msg_msg and posix_msg_tree_node structures are kernel-internal structures. The "overhead" addend in the formula accounts for overhead bytes required by the implementation and ensures that the user cannot create an unlimited number of zero-length messages (such messages nevertheless each consume some system memory for bookkeeping overhead). RLIMIT_NICE (since Linux 2.6.12, but see BUGS below) Specifies a ceiling to which the process's nice value can be raised using setpriority(2) or nice(2). The actual ceiling for the nice value is calculated as 20 - rlim_cur. 
(This strangeness occurs because negative numbers cannot be specified as resource limit values, since they typically have special meanings. For example, RLIM_INFINITY typically is the same as -1.) RLIMIT_NOFILE Specifies a value one greater than the maximum file descriptor number that can be opened by this process. Attempts (open(2), pipe(2), dup(2), etc.) to exceed this limit yield the error EMFILE. (Historically, this limit was named RLIMIT_OFILE on BSD.) RLIMIT_NPROC The maximum number of processes (or, more precisely on Linux, threads) that can be created for the real user ID of the calling process. Upon encountering this limit, fork(2) fails with the error EAGAIN. This limit is not enforced for processes that have either the CAP_SYS_ADMIN or the CAP_SYS_RESOURCE capability. RLIMIT_RSS Specifies the limit (in bytes) of the process's resident set (the number of virtual pages resident in RAM). This limit has effect only in Linux 2.4.x, x < 30, and there affects only calls to madvise(2) specifying MADV_WILLNEED. RLIMIT_RTPRIO (since Linux 2.6.12, but see BUGS) Specifies a ceiling on the real-time priority that may be set for this process using sched_setscheduler(2) and sched_setparam(2). RLIMIT_RTTIME (since Linux 2.6.25) Specifies a limit (in microseconds) on the amount of CPU time that a process scheduled under a real-time scheduling policy may consume without making a blocking system call. For the purpose of this limit, each time a process makes a blocking system call, the count of its consumed CPU time is reset to zero. The CPU time count is not reset if the process continues trying to use the CPU but is preempted, its time slice expires, or it calls sched_yield(2). Upon reaching the soft limit, the process is sent a SIGXCPU signal. If the process catches or ignores this signal and continues consuming CPU time, then SIGXCPU will be generated once each second until the hard limit is reached, at which point the process is sent a SIGKILL signal. 
The intended use of this limit is to stop a runaway real-time process from locking up the system. RLIMIT_SIGPENDING (since Linux 2.6.8) Specifies the limit on the number of signals that may be queued for the real user ID of the calling process. Both standard and real-time signals are counted for the purpose of checking this limit. However, the limit is enforced only for sigqueue(3); it is always possible to use kill(2) to queue one instance of any of the signals that are not already queued to the process. RLIMIT_STACK The maximum size of the process stack, in bytes. Upon reaching this limit, a SIGSEGV signal is generated. To handle this signal, a process must employ an alternate signal stack (sigaltstack(2)). Since Linux 2.6.23, this limit also determines the amount of space used for the process's command-line arguments and environment variables; for details, see execve(2). prlimit() The Linux-specific prlimit() system call combines and extends the functionality of setrlimit() and getrlimit(). It can be used to both set and get the resource limits of an arbitrary process. The resource argument has the same meaning as for setrlimit() and getrlimit(). If the new_limit argument is not NULL, then the rlimit structure to which it points is used to set new values for the soft and hard limits for resource. If the old_limit argument is not NULL, then a successful call to prlimit() places the previous soft and hard limits for resource in the rlimit structure pointed to by old_limit. The pid argument specifies the ID of the process on which the call is to operate. If pid is 0, then the call applies to the calling process. To set or get the resources of a process other than itself, the caller must have the CAP_SYS_RESOURCE capability, or the real, effective, and saved set user IDs of the target process must match the real user ID of the caller and the real, effective, and saved set group IDs of the target process must match the real group ID of the caller.
http://man7.org/linux/man-pages/man2/getrusage.2.html 11 SYSTEM CALL: getrusage(2) - Linux manual page FUNCTIONALITY: getrusage - get resource usage SYNOPSIS: #include <sys/time.h> #include <sys/resource.h> int getrusage(int who, struct rusage *usage); DESCRIPTION getrusage() returns resource usage measures for who, which can be one of the following: RUSAGE_SELF Return resource usage statistics for the calling process, which is the sum of resources used by all threads in the process. RUSAGE_CHILDREN Return resource usage statistics for all children of the calling process that have terminated and been waited for. These statistics will include the resources used by grandchildren, and further removed descendants, if all of the intervening descendants waited on their terminated children. RUSAGE_THREAD (since Linux 2.6.26) Return resource usage statistics for the calling thread. The _GNU_SOURCE feature test macro must be defined (before including any header file) in order to obtain the definition of this constant from <sys/resource.h>. The resource usages are returned in the structure pointed to by usage, which has the following form: struct rusage { struct timeval ru_utime; /* user CPU time used */ struct timeval ru_stime; /* system CPU time used */ long ru_maxrss; /* maximum resident set size */ long ru_ixrss; /* integral shared memory size */ long ru_idrss; /* integral unshared data size */ long ru_isrss; /* integral unshared stack size */ long ru_minflt; /* page reclaims (soft page faults) */ long ru_majflt; /* page faults (hard page faults) */ long ru_nswap; /* swaps */ long ru_inblock; /* block input operations */ long ru_oublock; /* block output operations */ long ru_msgsnd; /* IPC messages sent */ long ru_msgrcv; /* IPC messages received */ long ru_nsignals; /* signals received */ long ru_nvcsw; /* voluntary context switches */ long ru_nivcsw; /* involuntary context switches */ }; Not all fields are completed; unmaintained fields are set to zero by the kernel.
(The unmaintained fields are provided for compatibility with other systems, and because they may one day be supported on Linux.) The fields are interpreted as follows: ru_utime This is the total amount of time spent executing in user mode, expressed in a timeval structure (seconds plus microseconds). ru_stime This is the total amount of time spent executing in kernel mode, expressed in a timeval structure (seconds plus microseconds). ru_maxrss (since Linux 2.6.32) This is the maximum resident set size used (in kilobytes). For RUSAGE_CHILDREN, this is the resident set size of the largest child, not the maximum resident set size of the process tree. ru_ixrss (unmaintained) This field is currently unused on Linux. ru_idrss (unmaintained) This field is currently unused on Linux. ru_isrss (unmaintained) This field is currently unused on Linux. ru_minflt The number of page faults serviced without any I/O activity; here I/O activity is avoided by “reclaiming” a page frame from the list of pages awaiting reallocation. ru_majflt The number of page faults serviced that required I/O activity. ru_nswap (unmaintained) This field is currently unused on Linux. ru_inblock (since Linux 2.6.22) The number of times the filesystem had to perform input. ru_oublock (since Linux 2.6.22) The number of times the filesystem had to perform output. ru_msgsnd (unmaintained) This field is currently unused on Linux. ru_msgrcv (unmaintained) This field is currently unused on Linux. ru_nsignals (unmaintained) This field is currently unused on Linux. ru_nvcsw (since Linux 2.6) The number of times a context switch resulted due to a process voluntarily giving up the processor before its time slice was completed (usually to await availability of a resource). ru_nivcsw (since Linux 2.6) The number of times a context switch resulted due to a higher priority process becoming runnable or because the current process exceeded its time slice. 
http://man7.org/linux/man-pages/man2/sched_setattr.2.html 12 SYSTEM CALL: sched_setattr(2) - Linux manual page FUNCTIONALITY: sched_setattr, sched_getattr - set and get scheduling policy and attributes SYNOPSIS: #include <sched.h> int sched_setattr(pid_t pid, struct sched_attr *attr, unsigned int flags); int sched_getattr(pid_t pid, struct sched_attr *attr, unsigned int size, unsigned int flags); DESCRIPTION sched_setattr() The sched_setattr() system call sets the scheduling policy and associated attributes for the thread whose ID is specified in pid. If pid equals zero, the scheduling policy and attributes of the calling thread will be set. Currently, Linux supports the following "normal" (i.e., non-real-time) scheduling policies as values that may be specified in policy: SCHED_OTHER the standard round-robin time-sharing policy; SCHED_BATCH for "batch" style execution of processes; and SCHED_IDLE for running very low priority background jobs. Various "real-time" policies are also supported, for special time-critical applications that need precise control over the way in which runnable threads are selected for execution. For the rules governing when a process may use these policies, see sched(7). The real-time policies that may be specified in policy are: SCHED_FIFO a first-in, first-out policy; and SCHED_RR a round-robin policy. Linux also provides the following policy: SCHED_DEADLINE a deadline scheduling policy; see sched(7) for details. The attr argument is a pointer to a structure that defines the new scheduling policy and attributes for the specified thread.
This structure has the following form: struct sched_attr { u32 size; /* Size of this structure */ u32 sched_policy; /* Policy (SCHED_*) */ u64 sched_flags; /* Flags */ s32 sched_nice; /* Nice value (SCHED_OTHER, SCHED_BATCH) */ u32 sched_priority; /* Static priority (SCHED_FIFO, SCHED_RR) */ /* Remaining fields are for SCHED_DEADLINE */ u64 sched_runtime; u64 sched_deadline; u64 sched_period; }; The fields of this structure are as follows: size This field should be set to the size of the structure in bytes, as in sizeof(struct sched_attr). If the provided structure is smaller than the kernel structure, any additional fields are assumed to be '0'. If the provided structure is larger than the kernel structure, the kernel verifies that all additional fields are 0; if they are not, sched_setattr() fails with the error E2BIG and updates size to contain the size of the kernel structure. The above behavior when the size of the user-space sched_attr structure does not match the size of the kernel structure allows for future extensibility of the interface. Malformed applications that pass oversize structures won't break in the future if the size of the kernel sched_attr structure is increased. In the future, it could also allow applications that know about a larger user-space sched_attr structure to determine whether they are running on an older kernel that does not support the larger structure. sched_policy This field specifies the scheduling policy, as one of the SCHED_* values listed above. sched_flags This field contains flags controlling scheduling behavior. Only one such flag is currently defined: SCHED_FLAG_RESET_ON_FORK. As a result of including this flag, children created by fork(2) do not inherit privileged scheduling policies. See sched(7) for details. sched_nice This field specifies the nice value to be set when specifying sched_policy as SCHED_OTHER or SCHED_BATCH. 
The nice value is a number in the range -20 (high priority) to +19 (low priority); see setpriority(2). sched_priority This field specifies the static priority to be set when specifying sched_policy as SCHED_FIFO or SCHED_RR. The allowed range of priorities for these policies can be determined using sched_get_priority_min(2) and sched_get_priority_max(2). For other policies, this field must be specified as 0. sched_runtime This field specifies the "Runtime" parameter for deadline scheduling. The value is expressed in nanoseconds. This field, and the next two fields, are used only for SCHED_DEADLINE scheduling; for further details, see sched(7). sched_deadline This field specifies the "Deadline" parameter for deadline scheduling. The value is expressed in nanoseconds. sched_period This field specifies the "Period" parameter for deadline scheduling. The value is expressed in nanoseconds. The flags argument is provided to allow for future extensions to the interface; in the current implementation it must be specified as 0. sched_getattr() The sched_getattr() system call fetches the scheduling policy and the associated attributes for the thread whose ID is specified in pid. If pid equals zero, the scheduling policy and attributes of the calling thread will be retrieved. The size argument should be set to the size of the sched_attr structure as known to user space. The value must be at least as large as the size of the initially published sched_attr structure, or the call fails with the error EINVAL. The retrieved scheduling attributes are placed in the fields of the sched_attr structure pointed to by attr. The kernel sets attr.size to the size of its sched_attr structure. If the caller-provided attr buffer is larger than the kernel's sched_attr structure, the additional bytes in the user-space structure are not touched. 
If the caller-provided structure is smaller than the kernel sched_attr structure and the kernel needs to return values outside the provided space, sched_getattr() fails with the error E2BIG. As with sched_setattr(), these semantics allow for future extensibility of the interface. The flags argument is provided to allow for future extensions to the interface; in the current implementation it must be specified as 0. http://man7.org/linux/man-pages/man2/sched_getattr.2.html 12 SYSTEM CALL: sched_setattr(2) - Linux manual page FUNCTIONALITY: sched_setattr, sched_getattr - set and get scheduling policy and attributes SYNOPSIS: #include <sched.h> int sched_setattr(pid_t pid, struct sched_attr *attr, unsigned int flags); int sched_getattr(pid_t pid, struct sched_attr *attr, unsigned int size, unsigned int flags); DESCRIPTION sched_setattr() The sched_setattr() system call sets the scheduling policy and associated attributes for the thread whose ID is specified in pid. If pid equals zero, the scheduling policy and attributes of the calling thread will be set. Currently, Linux supports the following "normal" (i.e., non-real-time) scheduling policies as values that may be specified in policy: SCHED_OTHER the standard round-robin time-sharing policy; SCHED_BATCH for "batch" style execution of processes; and SCHED_IDLE for running very low priority background jobs. Various "real-time" policies are also supported, for special time-critical applications that need precise control over the way in which runnable threads are selected for execution. For the rules governing when a process may use these policies, see sched(7). The real-time policies that may be specified in policy are: SCHED_FIFO a first-in, first-out policy; and SCHED_RR a round-robin policy. Linux also provides the following policy: SCHED_DEADLINE a deadline scheduling policy; see sched(7) for details.
The attr argument is a pointer to a structure that defines the new scheduling policy and attributes for the specified thread. This structure has the following form: struct sched_attr { u32 size; /* Size of this structure */ u32 sched_policy; /* Policy (SCHED_*) */ u64 sched_flags; /* Flags */ s32 sched_nice; /* Nice value (SCHED_OTHER, SCHED_BATCH) */ u32 sched_priority; /* Static priority (SCHED_FIFO, SCHED_RR) */ /* Remaining fields are for SCHED_DEADLINE */ u64 sched_runtime; u64 sched_deadline; u64 sched_period; }; The fields of this structure are as follows: size This field should be set to the size of the structure in bytes, as in sizeof(struct sched_attr). If the provided structure is smaller than the kernel structure, any additional fields are assumed to be '0'. If the provided structure is larger than the kernel structure, the kernel verifies that all additional fields are 0; if they are not, sched_setattr() fails with the error E2BIG and updates size to contain the size of the kernel structure. The above behavior when the size of the user-space sched_attr structure does not match the size of the kernel structure allows for future extensibility of the interface. Malformed applications that pass oversize structures won't break in the future if the size of the kernel sched_attr structure is increased. In the future, it could also allow applications that know about a larger user-space sched_attr structure to determine whether they are running on an older kernel that does not support the larger structure. sched_policy This field specifies the scheduling policy, as one of the SCHED_* values listed above. sched_flags This field contains flags controlling scheduling behavior. Only one such flag is currently defined: SCHED_FLAG_RESET_ON_FORK. As a result of including this flag, children created by fork(2) do not inherit privileged scheduling policies. See sched(7) for details. 
sched_nice This field specifies the nice value to be set when specifying sched_policy as SCHED_OTHER or SCHED_BATCH. The nice value is a number in the range -20 (high priority) to +19 (low priority); see setpriority(2). sched_priority This field specifies the static priority to be set when specifying sched_policy as SCHED_FIFO or SCHED_RR. The allowed range of priorities for these policies can be determined using sched_get_priority_min(2) and sched_get_priority_max(2). For other policies, this field must be specified as 0. sched_runtime This field specifies the "Runtime" parameter for deadline scheduling. The value is expressed in nanoseconds. This field, and the next two fields, are used only for SCHED_DEADLINE scheduling; for further details, see sched(7). sched_deadline This field specifies the "Deadline" parameter for deadline scheduling. The value is expressed in nanoseconds. sched_period This field specifies the "Period" parameter for deadline scheduling. The value is expressed in nanoseconds. The flags argument is provided to allow for future extensions to the interface; in the current implementation it must be specified as 0. sched_getattr() The sched_getattr() system call fetches the scheduling policy and the associated attributes for the thread whose ID is specified in pid. If pid equals zero, the scheduling policy and attributes of the calling thread will be retrieved. The size argument should be set to the size of the sched_attr structure as known to user space. The value must be at least as large as the size of the initially published sched_attr structure, or the call fails with the error EINVAL. The retrieved scheduling attributes are placed in the fields of the sched_attr structure pointed to by attr. The kernel sets attr.size to the size of its sched_attr structure. If the caller-provided attr buffer is larger than the kernel's sched_attr structure, the additional bytes in the user-space structure are not touched. 
If the caller-provided structure is smaller than the kernel sched_attr structure and the kernel needs to return values outside the provided space, sched_getattr() fails with the error E2BIG. As with sched_setattr(), these semantics allow for future extensibility of the interface. The flags argument is provided to allow for future extensions to the interface; in the current implementation it must be specified as 0. http://man7.org/linux/man-pages/man2/sched_setscheduler.2.html 11 SYSTEM CALL: sched_setscheduler(2) - Linux manual page FUNCTIONALITY: sched_setscheduler, sched_getscheduler - set and get scheduling policy/parameters SYNOPSIS: #include <sched.h> int sched_setscheduler(pid_t pid, int policy, const struct sched_param *param); int sched_getscheduler(pid_t pid); DESCRIPTION The sched_setscheduler() system call sets both the scheduling policy and parameters for the thread whose ID is specified in pid. If pid equals zero, the scheduling policy and parameters of the calling thread will be set. The scheduling parameters are specified in the param argument, which is a pointer to a structure of the following form: struct sched_param { ... int sched_priority; ... }; In the current implementation, the structure contains only one field, sched_priority. The interpretation of param depends on the selected policy. Currently, Linux supports the following "normal" (i.e., non-real-time) scheduling policies as values that may be specified in policy: SCHED_OTHER the standard round-robin time-sharing policy; SCHED_BATCH for "batch" style execution of processes; and SCHED_IDLE for running very low priority background jobs. For each of the above policies, param->sched_priority must be 0. Various "real-time" policies are also supported, for special time-critical applications that need precise control over the way in which runnable threads are selected for execution. For the rules governing when a process may use these policies, see sched(7).
The real-time policies that may be specified in policy are: SCHED_FIFO a first-in, first-out policy; and SCHED_RR a round-robin policy. For each of the above policies, param->sched_priority specifies a scheduling priority for the thread. This is a number in the range returned by calling sched_get_priority_min(2) and sched_get_priority_max(2) with the specified policy. On Linux, these system calls return, respectively, 1 and 99. Since Linux 2.6.32, the SCHED_RESET_ON_FORK flag can be ORed in policy when calling sched_setscheduler(). As a result of including this flag, children created by fork(2) do not inherit privileged scheduling policies. See sched(7) for details. sched_getscheduler() returns the current scheduling policy of the thread identified by pid. If pid equals zero, the policy of the calling thread will be retrieved. http://man7.org/linux/man-pages/man2/sched_getscheduler.2.html 11 SYSTEM CALL: sched_setscheduler(2) - Linux manual page FUNCTIONALITY: sched_setscheduler, sched_getscheduler - set and get scheduling policy/parameters SYNOPSIS: #include <sched.h> int sched_setscheduler(pid_t pid, int policy, const struct sched_param *param); int sched_getscheduler(pid_t pid); DESCRIPTION The sched_setscheduler() system call sets both the scheduling policy and parameters for the thread whose ID is specified in pid. If pid equals zero, the scheduling policy and parameters of the calling thread will be set. The scheduling parameters are specified in the param argument, which is a pointer to a structure of the following form: struct sched_param { ... int sched_priority; ... }; In the current implementation, the structure contains only one field, sched_priority. The interpretation of param depends on the selected policy.
Currently, Linux supports the following "normal" (i.e., non-real-time) scheduling policies as values that may be specified in policy: SCHED_OTHER the standard round-robin time-sharing policy; SCHED_BATCH for "batch" style execution of processes; and SCHED_IDLE for running very low priority background jobs. For each of the above policies, param->sched_priority must be 0. Various "real-time" policies are also supported, for special time-critical applications that need precise control over the way in which runnable threads are selected for execution. For the rules governing when a process may use these policies, see sched(7). The real-time policies that may be specified in policy are: SCHED_FIFO a first-in, first-out policy; and SCHED_RR a round-robin policy. For each of the above policies, param->sched_priority specifies a scheduling priority for the thread. This is a number in the range returned by calling sched_get_priority_min(2) and sched_get_priority_max(2) with the specified policy. On Linux, these system calls return, respectively, 1 and 99. Since Linux 2.6.32, the SCHED_RESET_ON_FORK flag can be ORed in policy when calling sched_setscheduler(). As a result of including this flag, children created by fork(2) do not inherit privileged scheduling policies. See sched(7) for details. sched_getscheduler() returns the current scheduling policy of the thread identified by pid. If pid equals zero, the policy of the calling thread will be retrieved. http://man7.org/linux/man-pages/man2/sched_setparam.2.html 10 SYSTEM CALL: sched_setparam(2) - Linux manual page FUNCTIONALITY: sched_setparam, sched_getparam - set and get scheduling parameters SYNOPSIS: #include <sched.h> int sched_setparam(pid_t pid, const struct sched_param *param); int sched_getparam(pid_t pid, struct sched_param *param); struct sched_param { ... int sched_priority; ... }; DESCRIPTION sched_setparam() sets the scheduling parameters associated with the scheduling policy for the process identified by pid.
If pid is zero, then the parameters of the calling process are set. The interpretation of the argument param depends on the scheduling policy of the process identified by pid. See sched(7) for a description of the scheduling policies supported under Linux. sched_getparam() retrieves the scheduling parameters for the process identified by pid. If pid is zero, then the parameters of the calling process are retrieved. sched_setparam() checks the validity of param for the scheduling policy of the thread. The value param->sched_priority must lie within the range given by sched_get_priority_min(2) and sched_get_priority_max(2). For a discussion of the privileges and resource limits related to scheduling priority and policy, see sched(7). POSIX systems on which sched_setparam() and sched_getparam() are available define _POSIX_PRIORITY_SCHEDULING in <unistd.h>. http://man7.org/linux/man-pages/man2/sched_getparam.2.html 10 SYSTEM CALL: sched_setparam(2) - Linux manual page FUNCTIONALITY: sched_setparam, sched_getparam - set and get scheduling parameters SYNOPSIS: #include <sched.h> int sched_setparam(pid_t pid, const struct sched_param *param); int sched_getparam(pid_t pid, struct sched_param *param); struct sched_param { ... int sched_priority; ... }; DESCRIPTION sched_setparam() sets the scheduling parameters associated with the scheduling policy for the process identified by pid. If pid is zero, then the parameters of the calling process are set. The interpretation of the argument param depends on the scheduling policy of the process identified by pid. See sched(7) for a description of the scheduling policies supported under Linux. sched_getparam() retrieves the scheduling parameters for the process identified by pid. If pid is zero, then the parameters of the calling process are retrieved. sched_setparam() checks the validity of param for the scheduling policy of the thread.
The value param->sched_priority must lie within the range given by sched_get_priority_min(2) and sched_get_priority_max(2). For a discussion of the privileges and resource limits related to scheduling priority and policy, see sched(7). POSIX systems on which sched_setparam() and sched_getparam() are available define _POSIX_PRIORITY_SCHEDULING in <unistd.h>. http://man7.org/linux/man-pages/man2/sched_setaffinity.2.html 12 SYSTEM CALL: sched_setaffinity(2) - Linux manual page FUNCTIONALITY: sched_setaffinity, sched_getaffinity - set and get a thread's CPU affinity mask SYNOPSIS: #define _GNU_SOURCE /* See feature_test_macros(7) */ #include <sched.h> int sched_setaffinity(pid_t pid, size_t cpusetsize, const cpu_set_t *mask); int sched_getaffinity(pid_t pid, size_t cpusetsize, cpu_set_t *mask); DESCRIPTION A thread's CPU affinity mask determines the set of CPUs on which it is eligible to run. On a multiprocessor system, setting the CPU affinity mask can be used to obtain performance benefits. For example, by dedicating one CPU to a particular thread (i.e., setting the affinity mask of that thread to specify a single CPU, and setting the affinity mask of all other threads to exclude that CPU), it is possible to ensure maximum execution speed for that thread. Restricting a thread to run on a single CPU also avoids the performance cost caused by the cache invalidation that occurs when a thread ceases to execute on one CPU and then recommences execution on a different CPU. A CPU affinity mask is represented by the cpu_set_t structure, a "CPU set", pointed to by mask. A set of macros for manipulating CPU sets is described in CPU_SET(3). sched_setaffinity() sets the CPU affinity mask of the thread whose ID is pid to the value specified by mask. If pid is zero, then the calling thread is used. The argument cpusetsize is the length (in bytes) of the data pointed to by mask. Normally this argument would be specified as sizeof(cpu_set_t).
If the thread specified by pid is not currently running on one of the CPUs specified in mask, then that thread is migrated to one of the CPUs specified in mask. sched_getaffinity() writes the affinity mask of the thread whose ID is pid into the cpu_set_t structure pointed to by mask. The cpusetsize argument specifies the size (in bytes) of mask. If pid is zero, then the mask of the calling thread is returned. http://man7.org/linux/man-pages/man2/sched_getaffinity.2.html 12 SYSTEM CALL: sched_setaffinity(2) - Linux manual page FUNCTIONALITY: sched_setaffinity, sched_getaffinity - set and get a thread's CPU affinity mask SYNOPSIS: #define _GNU_SOURCE /* See feature_test_macros(7) */ #include <sched.h> int sched_setaffinity(pid_t pid, size_t cpusetsize, const cpu_set_t *mask); int sched_getaffinity(pid_t pid, size_t cpusetsize, cpu_set_t *mask); DESCRIPTION A thread's CPU affinity mask determines the set of CPUs on which it is eligible to run. On a multiprocessor system, setting the CPU affinity mask can be used to obtain performance benefits. For example, by dedicating one CPU to a particular thread (i.e., setting the affinity mask of that thread to specify a single CPU, and setting the affinity mask of all other threads to exclude that CPU), it is possible to ensure maximum execution speed for that thread. Restricting a thread to run on a single CPU also avoids the performance cost caused by the cache invalidation that occurs when a thread ceases to execute on one CPU and then recommences execution on a different CPU. A CPU affinity mask is represented by the cpu_set_t structure, a "CPU set", pointed to by mask. A set of macros for manipulating CPU sets is described in CPU_SET(3). sched_setaffinity() sets the CPU affinity mask of the thread whose ID is pid to the value specified by mask. If pid is zero, then the calling thread is used. The argument cpusetsize is the length (in bytes) of the data pointed to by mask.
Normally this argument would be specified as sizeof(cpu_set_t). If the thread specified by pid is not currently running on one of the CPUs specified in mask, then that thread is migrated to one of the CPUs specified in mask. sched_getaffinity() writes the affinity mask of the thread whose ID is pid into the cpu_set_t structure pointed to by mask. The cpusetsize argument specifies the size (in bytes) of mask. If pid is zero, then the mask of the calling thread is returned. http://man7.org/linux/man-pages/man2/sched_get_priority_max.2.html 9 SYSTEM CALL: sched_get_priority_max(2) - Linux manual page FUNCTIONALITY: sched_get_priority_max, sched_get_priority_min - get static priority range SYNOPSIS: #include <sched.h> int sched_get_priority_max(int policy); int sched_get_priority_min(int policy); DESCRIPTION sched_get_priority_max() returns the maximum priority value that can be used with the scheduling algorithm identified by policy. sched_get_priority_min() returns the minimum priority value that can be used with the scheduling algorithm identified by policy. Supported policy values are SCHED_FIFO, SCHED_RR, SCHED_OTHER, SCHED_BATCH, SCHED_IDLE, and SCHED_DEADLINE. Further details about these policies can be found in sched(7). Processes with numerically higher priority values are scheduled before processes with numerically lower priority values. Thus, the value returned by sched_get_priority_max() will be greater than the value returned by sched_get_priority_min(). Linux allows the static priority range 1 to 99 for the SCHED_FIFO and SCHED_RR policies, and the priority 0 for the remaining policies. Scheduling priority ranges for the various policies are not alterable.
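The range query described above can be sketched as follows. The helper map_priority() is hypothetical: it scales an application-defined "virtual" priority in [0, vmax] onto whatever interval the system reports for the given policy, so the calling code need not hard-code the Linux range of 1 to 99.

```c
#include <assert.h>
#include <sched.h>

/* Hypothetical helper: map a virtual priority v in [0, vmax] onto the
 * static priority interval reported for `policy`. Values of v outside
 * [0, vmax] are clamped. Returns -1 if the policy is not recognized. */
int map_priority(int policy, int v, int vmax)
{
    assert(vmax > 0);

    int lo = sched_get_priority_min(policy);
    int hi = sched_get_priority_max(policy);
    if (lo == -1 || hi == -1)
        return -1;

    if (v < 0)
        v = 0;
    if (v > vmax)
        v = vmax;

    /* Linear interpolation; integer division rounds toward lo. */
    return lo + (hi - lo) * v / vmax;
}
```

The result could then be stored in the sched_priority field of a struct sched_param. For policies whose only permitted priority is 0, the function simply returns 0.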
The range of scheduling priorities may vary on other POSIX systems, thus it is a good idea for portable applications to use a virtual priority range and map it to the interval given by sched_get_priority_max() and sched_get_priority_min(). POSIX.1 requires a spread of at least 32 between the maximum and the minimum values for SCHED_FIFO and SCHED_RR. POSIX systems on which sched_get_priority_max() and sched_get_priority_min() are available define _POSIX_PRIORITY_SCHEDULING in <unistd.h>. http://man7.org/linux/man-pages/man2/sched_get_priority_min.2.html 9 SYSTEM CALL: sched_get_priority_max(2) - Linux manual page FUNCTIONALITY: sched_get_priority_max, sched_get_priority_min - get static priority range SYNOPSIS: #include <sched.h> int sched_get_priority_max(int policy); int sched_get_priority_min(int policy); DESCRIPTION sched_get_priority_max() returns the maximum priority value that can be used with the scheduling algorithm identified by policy. sched_get_priority_min() returns the minimum priority value that can be used with the scheduling algorithm identified by policy. Supported policy values are SCHED_FIFO, SCHED_RR, SCHED_OTHER, SCHED_BATCH, SCHED_IDLE, and SCHED_DEADLINE. Further details about these policies can be found in sched(7). Processes with numerically higher priority values are scheduled before processes with numerically lower priority values. Thus, the value returned by sched_get_priority_max() will be greater than the value returned by sched_get_priority_min(). Linux allows the static priority range 1 to 99 for the SCHED_FIFO and SCHED_RR policies, and the priority 0 for the remaining policies. Scheduling priority ranges for the various policies are not alterable. 
The range of scheduling priorities may vary on other POSIX systems, thus it is a good idea for portable applications to use a virtual priority range and map it to the interval given by sched_get_priority_max() and sched_get_priority_min(). POSIX.1 requires a spread of at least 32 between the maximum and the minimum values for SCHED_FIFO and SCHED_RR. POSIX systems on which sched_get_priority_max() and sched_get_priority_min() are available define _POSIX_PRIORITY_SCHEDULING in <unistd.h>. http://man7.org/linux/man-pages/man2/sched_rr_get_interval.2.html 10 SYSTEM CALL: sched_rr_get_interval(2) - Linux manual page FUNCTIONALITY: sched_rr_get_interval - get the SCHED_RR interval for the named process SYNOPSIS: #include <sched.h> int sched_rr_get_interval(pid_t pid, struct timespec * tp); DESCRIPTION sched_rr_get_interval() writes into the timespec structure pointed to by tp the round-robin time quantum for the process identified by pid. The specified process should be running under the SCHED_RR scheduling policy. The timespec structure has the following form: struct timespec { time_t tv_sec; /* seconds */ long tv_nsec; /* nanoseconds */ }; If pid is zero, the time quantum for the calling process is written into *tp. http://man7.org/linux/man-pages/man2/sched_yield.2.html 10 SYSTEM CALL: sched_yield(2) - Linux manual page FUNCTIONALITY: sched_yield - yield the processor SYNOPSIS: #include <sched.h> int sched_yield(void); DESCRIPTION sched_yield() causes the calling thread to relinquish the CPU. The thread is moved to the end of the queue for its static priority and a new thread gets to run. 
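The sched_get_priority_max(2) entry above advises portable applications to work with a virtual priority range and map it onto the interval the system reports. The following is a minimal sketch of that mapping, not part of the manual page: the function name map_virtual_priority and the choice of linear interpolation are assumptions made for illustration.

```c
#include <sched.h>

/*
 * Map a "virtual" priority in [0, virt_max] onto the native range that
 * sched_get_priority_min() and sched_get_priority_max() report for policy.
 * Returns the native priority, or -1 if the policy is rejected or the
 * arguments are out of range.
 * (map_virtual_priority is a hypothetical helper, not a library function.)
 */
int map_virtual_priority(int policy, int virt, int virt_max)
{
    int lo = sched_get_priority_min(policy);
    int hi = sched_get_priority_max(policy);

    if (lo == -1 || hi == -1)
        return -1;              /* errno was set by the failing call */
    if (virt_max <= 0 || virt < 0 || virt > virt_max)
        return -1;

    /* Linear interpolation; integer division rounds toward lo. */
    return lo + (int)(((long)virt * (hi - lo)) / virt_max);
}
```

With a 32-step virtual range, map_virtual_priority(SCHED_FIFO, 0, 31) yields the policy's minimum and map_virtual_priority(SCHED_FIFO, 31, 31) its maximum, whatever the native range happens to be.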
http://man7.org/linux/man-pages/man2/setpriority.2.html 11 SYSTEM CALL: getpriority(2) - Linux manual page FUNCTIONALITY: getpriority, setpriority - get/set program scheduling priority SYNOPSIS: #include <sys/time.h> #include <sys/resource.h> int getpriority(int which, id_t who); int setpriority(int which, id_t who, int prio); DESCRIPTION The scheduling priority of the process, process group, or user, as indicated by which and who is obtained with the getpriority() call and set with the setpriority() call. The process attribute dealt with by these system calls is the same attribute (also known as the "nice" value) that is dealt with by nice(2). The value which is one of PRIO_PROCESS, PRIO_PGRP, or PRIO_USER, and who is interpreted relative to which (a process identifier for PRIO_PROCESS, process group identifier for PRIO_PGRP, and a user ID for PRIO_USER). A zero value for who denotes (respectively) the calling process, the process group of the calling process, or the real user ID of the calling process. The prio argument is a value in the range -20 to 19 (but see NOTES below), with -20 being the highest priority and 19 being the lowest priority. The default priority is 0; lower values give a process a higher scheduling priority. The getpriority() call returns the highest priority (lowest numerical value) enjoyed by any of the specified processes. The setpriority() call sets the priorities of all of the specified processes to the specified value. Traditionally, only a privileged process could lower the nice value (i.e., set a higher priority). However, since Linux 2.6.12, an unprivileged process can decrease the nice value of a target process that has a suitable RLIMIT_NICE soft limit; see getrlimit(2) for details. 
http://man7.org/linux/man-pages/man2/getpriority.2.html 11 SYSTEM CALL: getpriority(2) - Linux manual page FUNCTIONALITY: getpriority, setpriority - get/set program scheduling priority SYNOPSIS: #include <sys/time.h> #include <sys/resource.h> int getpriority(int which, id_t who); int setpriority(int which, id_t who, int prio); DESCRIPTION The scheduling priority of the process, process group, or user, as indicated by which and who is obtained with the getpriority() call and set with the setpriority() call. The process attribute dealt with by these system calls is the same attribute (also known as the "nice" value) that is dealt with by nice(2). The value which is one of PRIO_PROCESS, PRIO_PGRP, or PRIO_USER, and who is interpreted relative to which (a process identifier for PRIO_PROCESS, process group identifier for PRIO_PGRP, and a user ID for PRIO_USER). A zero value for who denotes (respectively) the calling process, the process group of the calling process, or the real user ID of the calling process. The prio argument is a value in the range -20 to 19 (but see NOTES below), with -20 being the highest priority and 19 being the lowest priority. The default priority is 0; lower values give a process a higher scheduling priority. The getpriority() call returns the highest priority (lowest numerical value) enjoyed by any of the specified processes. The setpriority() call sets the priorities of all of the specified processes to the specified value. Traditionally, only a privileged process could lower the nice value (i.e., set a higher priority). However, since Linux 2.6.12, an unprivileged process can decrease the nice value of a target process that has a suitable RLIMIT_NICE soft limit; see getrlimit(2) for details. 
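As an illustration of the getpriority()/setpriority() pair described above, the sketch below reads the caller's nice value and then raises it (lowering the scheduling priority), which an unprivileged process is normally allowed to do. The helper names read_own_nice and be_nicer are hypothetical. Because -1 is a legitimate nice value, the sketch clears errno before calling getpriority() and inspects it afterwards; that convention comes from the page's RETURN VALUE section rather than from the excerpt above.

```c
#include <errno.h>
#include <sys/resource.h>
#include <sys/time.h>

/*
 * Store the calling process's nice value in *out.
 * Returns 0 on success, -1 on error (errno set by getpriority()).
 * (read_own_nice is a hypothetical helper name.)
 */
int read_own_nice(int *out)
{
    errno = 0;                               /* -1 may be a valid result */
    int prio = getpriority(PRIO_PROCESS, 0); /* who == 0: the caller */
    if (prio == -1 && errno != 0)
        return -1;
    *out = prio;
    return 0;
}

/*
 * Raise the caller's nice value by delta (delta >= 0), clamped to 19,
 * the lowest priority in the range given above.
 * Returns 0 on success, -1 on error.
 * (be_nicer is a hypothetical helper name.)
 */
int be_nicer(int delta)
{
    int cur;

    if (delta < 0 || read_own_nice(&cur) == -1)
        return -1;

    int target = cur + delta;
    if (target > 19)
        target = 19;

    return setpriority(PRIO_PROCESS, 0, target);
}
```

Going the other way (a negative delta) is deliberately rejected here, since decreasing the nice value depends on privilege or on the RLIMIT_NICE soft limit.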
http://man7.org/linux/man-pages/man2/ioprio_set.2.html 12 SYSTEM CALL: ioprio_set(2) - Linux manual page FUNCTIONALITY: ioprio_get, ioprio_set - get/set I/O scheduling class and priority SYNOPSIS: int ioprio_get(int which, int who); int ioprio_set(int which, int who, int ioprio); Note: There are no glibc wrappers for these system calls; see NOTES. DESCRIPTION The ioprio_get() and ioprio_set() system calls respectively get and set the I/O scheduling class and priority of one or more threads. The which and who arguments identify the thread(s) on which the system calls operate. The which argument determines how who is interpreted, and has one of the following values: IOPRIO_WHO_PROCESS who is a process ID or thread ID identifying a single process or thread. If who is 0, then operate on the calling thread. IOPRIO_WHO_PGRP who is a process group ID identifying all the members of a process group. If who is 0, then operate on the process group of which the caller is a member. IOPRIO_WHO_USER who is a user ID identifying all of the processes that have a matching real UID. If which is specified as IOPRIO_WHO_PGRP or IOPRIO_WHO_USER when calling ioprio_get(), and more than one process matches who, then the returned priority will be the highest one found among all of the matching processes. One priority is said to be higher than another one if it belongs to a higher priority class (IOPRIO_CLASS_RT is the highest priority class; IOPRIO_CLASS_IDLE is the lowest) or if it belongs to the same priority class as the other process but has a higher priority level (a lower priority number means a higher priority level). The ioprio argument given to ioprio_set() is a bit mask that specifies both the scheduling class and the priority to be assigned to the target process(es). 
The following macros are used for assembling and dissecting ioprio values: IOPRIO_PRIO_VALUE(class, data) Given a scheduling class and priority (data), this macro combines the two values to produce an ioprio value, which is returned as the result of the macro. IOPRIO_PRIO_CLASS(mask) Given mask (an ioprio value), this macro returns its I/O class component, that is, one of the values IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, or IOPRIO_CLASS_IDLE. IOPRIO_PRIO_DATA(mask) Given mask (an ioprio value), this macro returns its priority (data) component. See the NOTES section for more information on scheduling classes and priorities, as well as the meaning of specifying ioprio as 0. I/O priorities are supported for reads and for synchronous (O_DIRECT, O_SYNC) writes. I/O priorities are not supported for asynchronous writes because they are issued outside the context of the program dirtying the memory, and thus program-specific priorities do not apply. http://man7.org/linux/man-pages/man2/ioprio_get.2.html 12 SYSTEM CALL: ioprio_set(2) - Linux manual page FUNCTIONALITY: ioprio_get, ioprio_set - get/set I/O scheduling class and priority SYNOPSIS: int ioprio_get(int which, int who); int ioprio_set(int which, int who, int ioprio); Note: There are no glibc wrappers for these system calls; see NOTES. DESCRIPTION The ioprio_get() and ioprio_set() system calls respectively get and set the I/O scheduling class and priority of one or more threads. The which and who arguments identify the thread(s) on which the system calls operate. The which argument determines how who is interpreted, and has one of the following values: IOPRIO_WHO_PROCESS who is a process ID or thread ID identifying a single process or thread. If who is 0, then operate on the calling thread. IOPRIO_WHO_PGRP who is a process group ID identifying all the members of a process group. If who is 0, then operate on the process group of which the caller is a member. 
IOPRIO_WHO_USER who is a user ID identifying all of the processes that have a matching real UID. If which is specified as IOPRIO_WHO_PGRP or IOPRIO_WHO_USER when calling ioprio_get(), and more than one process matches who, then the returned priority will be the highest one found among all of the matching processes. One priority is said to be higher than another one if it belongs to a higher priority class (IOPRIO_CLASS_RT is the highest priority class; IOPRIO_CLASS_IDLE is the lowest) or if it belongs to the same priority class as the other process but has a higher priority level (a lower priority number means a higher priority level). The ioprio argument given to ioprio_set() is a bit mask that specifies both the scheduling class and the priority to be assigned to the target process(es). The following macros are used for assembling and dissecting ioprio values: IOPRIO_PRIO_VALUE(class, data) Given a scheduling class and priority (data), this macro combines the two values to produce an ioprio value, which is returned as the result of the macro. IOPRIO_PRIO_CLASS(mask) Given mask (an ioprio value), this macro returns its I/O class component, that is, one of the values IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, or IOPRIO_CLASS_IDLE. IOPRIO_PRIO_DATA(mask) Given mask (an ioprio value), this macro returns its priority (data) component. See the NOTES section for more information on scheduling classes and priorities, as well as the meaning of specifying ioprio as 0. I/O priorities are supported for reads and for synchronous (O_DIRECT, O_SYNC) writes. I/O priorities are not supported for asynchronous writes because they are issued outside the context of the program dirtying the memory, and thus program-specific priorities do not apply. 
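Since glibc provides no wrappers for ioprio_get() and ioprio_set(), callers typically go through syscall(2). The sketch below is one way to do that. The constant values (a class shift of 13, the class and "who" numbers) restate what the Linux kernel sources define and should be checked against your kernel headers; the wrapper names my_ioprio_get, my_ioprio_set, and set_own_best_effort are hypothetical, and the 0..7 level limit for the best-effort class comes from the page's NOTES section, not from the excerpt above.

```c
#define _GNU_SOURCE             /* for syscall() */
#include <sys/syscall.h>
#include <unistd.h>

/* Restated kernel definitions (assumption: unchanged on your kernel). */
#define IOPRIO_CLASS_SHIFT 13
#define IOPRIO_PRIO_VALUE(class, data) (((class) << IOPRIO_CLASS_SHIFT) | (data))
#define IOPRIO_PRIO_CLASS(mask) ((mask) >> IOPRIO_CLASS_SHIFT)
#define IOPRIO_PRIO_DATA(mask)  ((mask) & ((1 << IOPRIO_CLASS_SHIFT) - 1))

enum { IOPRIO_CLASS_NONE, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE };
enum { IOPRIO_WHO_PROCESS = 1, IOPRIO_WHO_PGRP, IOPRIO_WHO_USER };

/* Thin wrappers over the raw system calls (hypothetical names). */
static inline int my_ioprio_get(int which, int who)
{
    return (int)syscall(SYS_ioprio_get, which, who);
}

static inline int my_ioprio_set(int which, int who, int ioprio)
{
    return (int)syscall(SYS_ioprio_set, which, who, ioprio);
}

/*
 * Put the calling thread in the best-effort class at the given level
 * (0 is the highest level, 7 the lowest). Returns 0 on success, -1 on error.
 */
int set_own_best_effort(int level)
{
    if (level < 0 || level > 7)
        return -1;
    return my_ioprio_set(IOPRIO_WHO_PROCESS, 0,
                         IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, level));
}
```

A value returned by my_ioprio_get(IOPRIO_WHO_PROCESS, 0) can be taken apart with IOPRIO_PRIO_CLASS() and IOPRIO_PRIO_DATA() in the same way the macros are described above.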
http://man7.org/linux/man-pages/man2/brk.2.html 9 SYSTEM CALL: brk(2) - Linux manual page FUNCTIONALITY: brk, sbrk - change data segment size SYNOPSIS: #include <unistd.h> int brk(void *addr); void *sbrk(intptr_t increment); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): brk(), sbrk(): Since glibc 2.19: _DEFAULT_SOURCE || (_XOPEN_SOURCE >= 500) && ! (_POSIX_C_SOURCE >= 200112L) From glibc 2.12 to 2.19: _BSD_SOURCE || _SVID_SOURCE || (_XOPEN_SOURCE >= 500) && ! (_POSIX_C_SOURCE >= 200112L) Before glibc 2.12: _BSD_SOURCE || _SVID_SOURCE || _XOPEN_SOURCE >= 500 DESCRIPTION brk() and sbrk() change the location of the program break, which defines the end of the process's data segment (i.e., the program break is the first location after the end of the uninitialized data segment). Increasing the program break has the effect of allocating memory to the process; decreasing the break deallocates memory. brk() sets the end of the data segment to the value specified by addr, when that value is reasonable, the system has enough memory, and the process does not exceed its maximum data size (see setrlimit(2)). sbrk() increments the program's data space by increment bytes. Calling sbrk() with an increment of 0 can be used to find the current location of the program break. http://man7.org/linux/man-pages/man2/mmap.2.html 14 SYSTEM CALL: mmap(2) - Linux manual page FUNCTIONALITY: mmap, munmap - map or unmap files or devices into memory SYNOPSIS: #include <sys/mman.h> void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset); int munmap(void *addr, size_t length); See NOTES for information on feature test macro requirements. DESCRIPTION mmap() creates a new mapping in the virtual address space of the calling process. The starting address for the new mapping is specified in addr. The length argument specifies the length of the mapping. 
If addr is NULL, then the kernel chooses the address at which to create the mapping; this is the most portable method of creating a new mapping. If addr is not NULL, then the kernel takes it as a hint about where to place the mapping; on Linux, the mapping will be created at a nearby page boundary. The address of the new mapping is returned as the result of the call. The contents of a file mapping (as opposed to an anonymous mapping; see MAP_ANONYMOUS below), are initialized using length bytes starting at offset offset in the file (or other object) referred to by the file descriptor fd. offset must be a multiple of the page size as returned by sysconf(_SC_PAGE_SIZE). The prot argument describes the desired memory protection of the mapping (and must not conflict with the open mode of the file). It is either PROT_NONE or the bitwise OR of one or more of the following flags: PROT_EXEC Pages may be executed. PROT_READ Pages may be read. PROT_WRITE Pages may be written. PROT_NONE Pages may not be accessed. The flags argument determines whether updates to the mapping are visible to other processes mapping the same region, and whether updates are carried through to the underlying file. This behavior is determined by including exactly one of the following values in flags: MAP_SHARED Share this mapping. Updates to the mapping are visible to other processes that map this file, and are carried through to the underlying file. (To precisely control when updates are carried through to the underlying file requires the use of msync(2).) MAP_PRIVATE Create a private copy-on-write mapping. Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file. It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region. Both of these flags are described in POSIX.1-2001 and POSIX.1-2008. 
In addition, zero or more of the following values can be ORed in flags: MAP_32BIT (since Linux 2.4.20, 2.6) Put the mapping into the first 2 Gigabytes of the process address space. This flag is supported only on x86-64, for 64-bit programs. It was added to allow thread stacks to be allocated somewhere in the first 2GB of memory, so as to improve context-switch performance on some early 64-bit processors. Modern x86-64 processors no longer have this performance problem, so use of this flag is not required on those systems. The MAP_32BIT flag is ignored when MAP_FIXED is set. MAP_ANON Synonym for MAP_ANONYMOUS. Deprecated. MAP_ANONYMOUS The mapping is not backed by any file; its contents are initialized to zero. The fd and offset arguments are ignored; however, some implementations require fd to be -1 if MAP_ANONYMOUS (or MAP_ANON) is specified, and portable applications should ensure this. The use of MAP_ANONYMOUS in conjunction with MAP_SHARED is supported on Linux only since kernel 2.4. MAP_DENYWRITE This flag is ignored. (Long ago, it signaled that attempts to write to the underlying file should fail with ETXTBSY. But this was a source of denial-of-service attacks.) MAP_EXECUTABLE This flag is ignored. MAP_FILE Compatibility flag. Ignored. MAP_FIXED Don't interpret addr as a hint: place the mapping at exactly that address. addr must be a multiple of the page size. If the memory region specified by addr and len overlaps pages of any existing mapping(s), then the overlapped part of the existing mapping(s) will be discarded. If the specified address cannot be used, mmap() will fail. Because requiring a fixed address for a mapping is less portable, the use of this option is discouraged. MAP_GROWSDOWN Used for stacks. Indicates to the kernel virtual memory system that the mapping should extend downward in memory. MAP_HUGETLB (since Linux 2.6.32) Allocate the mapping using "huge pages." 
See the Linux kernel source file Documentation/vm/hugetlbpage.txt for further information, as well as NOTES, below. MAP_HUGE_2MB, MAP_HUGE_1GB (since Linux 3.8) Used in conjunction with MAP_HUGETLB to select alternative hugetlb page sizes (respectively, 2 MB and 1 GB) on systems that support multiple hugetlb page sizes. More generally, the desired huge page size can be configured by encoding the base-2 logarithm of the desired page size in the six bits at the offset MAP_HUGE_SHIFT. (A value of zero in this bit field provides the default huge page size; the default huge page size can be discovered via the Hugepagesize field exposed by /proc/meminfo.) Thus, the above two constants are defined as: #define MAP_HUGE_2MB (21 << MAP_HUGE_SHIFT) #define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT) The range of huge page sizes that are supported by the system can be discovered by listing the subdirectories in /sys/kernel/mm/hugepages. MAP_LOCKED (since Linux 2.5.37) Mark the mmaped region to be locked in the same way as mlock(2). This implementation will try to populate (prefault) the whole range but the mmap call doesn't fail with ENOMEM if this fails. Therefore major faults might happen later on. So the semantic is not as strong as mlock(2). One should use mmap(2) plus mlock(2) when major faults are not acceptable after the initialization of the mapping. The MAP_LOCKED flag is ignored in older kernels. MAP_NONBLOCK (since Linux 2.5.46) Only meaningful in conjunction with MAP_POPULATE. Don't perform read-ahead: create page table entries only for pages that are already present in RAM. Since Linux 2.6.23, this flag causes MAP_POPULATE to do nothing. One day, the combination of MAP_POPULATE and MAP_NONBLOCK may be reimplemented. MAP_NORESERVE Do not reserve swap space for this mapping. When swap space is reserved, one has the guarantee that it is possible to modify the mapping. When swap space is not reserved one might get SIGSEGV upon a write if no physical memory is available. 
See also the discussion of the file /proc/sys/vm/overcommit_memory in proc(5). In kernels before 2.6, this flag had effect only for private writable mappings. MAP_POPULATE (since Linux 2.5.46) Populate (prefault) page tables for a mapping. For a file mapping, this causes read-ahead on the file. This will help to reduce blocking on page faults later. MAP_POPULATE is supported for private mappings only since Linux 2.6.23. MAP_STACK (since Linux 2.6.27) Allocate the mapping at an address suitable for a process or thread stack. This flag is currently a no-op, but is used in the glibc threading implementation so that if some architectures require special treatment for stack allocations, support can later be transparently implemented for glibc. MAP_UNINITIALIZED (since Linux 2.6.33) Don't clear anonymous pages. This flag is intended to improve performance on embedded devices. This flag is honored only if the kernel was configured with the CONFIG_MMAP_ALLOW_UNINITIALIZED option. Because of the security implications, that option is normally enabled only on embedded devices (i.e., devices where one has complete control of the contents of user memory). Of the above flags, only MAP_FIXED is specified in POSIX.1-2001 and POSIX.1-2008. However, most systems also support MAP_ANONYMOUS (or its synonym MAP_ANON). Some systems document the additional flags MAP_AUTOGROW, MAP_AUTORESRV, MAP_COPY, and MAP_LOCAL. Memory mapped by mmap() is preserved across fork(2), with the same attributes. A file is mapped in multiples of the page size. For a file that is not a multiple of the page size, the remaining memory is zeroed when mapped, and writes to that region are not written out to the file. The effect of changing the size of the underlying file of a mapping on the pages that correspond to added or removed regions of the file is unspecified. 
munmap() The munmap() system call deletes the mappings for the specified address range, and causes further references to addresses within the range to generate invalid memory references. The region is also automatically unmapped when the process is terminated. On the other hand, closing the file descriptor does not unmap the region. The address addr must be a multiple of the page size (but length need not be). All pages containing a part of the indicated range are unmapped, and subsequent references to these pages will generate SIGSEGV. It is not an error if the indicated range does not contain any mapped pages. http://man7.org/linux/man-pages/man2/munmap.2.html 14 SYSTEM CALL: mmap(2) - Linux manual page FUNCTIONALITY: mmap, munmap - map or unmap files or devices into memory SYNOPSIS: #include <sys/mman.h> void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset); int munmap(void *addr, size_t length); See NOTES for information on feature test macro requirements. DESCRIPTION mmap() creates a new mapping in the virtual address space of the calling process. The starting address for the new mapping is specified in addr. The length argument specifies the length of the mapping. If addr is NULL, then the kernel chooses the address at which to create the mapping; this is the most portable method of creating a new mapping. If addr is not NULL, then the kernel takes it as a hint about where to place the mapping; on Linux, the mapping will be created at a nearby page boundary. The address of the new mapping is returned as the result of the call. The contents of a file mapping (as opposed to an anonymous mapping; see MAP_ANONYMOUS below), are initialized using length bytes starting at offset offset in the file (or other object) referred to by the file descriptor fd. offset must be a multiple of the page size as returned by sysconf(_SC_PAGE_SIZE). 
The prot argument describes the desired memory protection of the mapping (and must not conflict with the open mode of the file). It is either PROT_NONE or the bitwise OR of one or more of the following flags: PROT_EXEC Pages may be executed. PROT_READ Pages may be read. PROT_WRITE Pages may be written. PROT_NONE Pages may not be accessed. The flags argument determines whether updates to the mapping are visible to other processes mapping the same region, and whether updates are carried through to the underlying file. This behavior is determined by including exactly one of the following values in flags: MAP_SHARED Share this mapping. Updates to the mapping are visible to other processes that map this file, and are carried through to the underlying file. (To precisely control when updates are carried through to the underlying file requires the use of msync(2).) MAP_PRIVATE Create a private copy-on-write mapping. Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file. It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region. Both of these flags are described in POSIX.1-2001 and POSIX.1-2008. In addition, zero or more of the following values can be ORed in flags: MAP_32BIT (since Linux 2.4.20, 2.6) Put the mapping into the first 2 Gigabytes of the process address space. This flag is supported only on x86-64, for 64-bit programs. It was added to allow thread stacks to be allocated somewhere in the first 2GB of memory, so as to improve context-switch performance on some early 64-bit processors. Modern x86-64 processors no longer have this performance problem, so use of this flag is not required on those systems. The MAP_32BIT flag is ignored when MAP_FIXED is set. MAP_ANON Synonym for MAP_ANONYMOUS. Deprecated. MAP_ANONYMOUS The mapping is not backed by any file; its contents are initialized to zero. 
The fd and offset arguments are ignored; however, some implementations require fd to be -1 if MAP_ANONYMOUS (or MAP_ANON) is specified, and portable applications should ensure this. The use of MAP_ANONYMOUS in conjunction with MAP_SHARED is supported on Linux only since kernel 2.4. MAP_DENYWRITE This flag is ignored. (Long ago, it signaled that attempts to write to the underlying file should fail with ETXTBSY. But this was a source of denial-of-service attacks.) MAP_EXECUTABLE This flag is ignored. MAP_FILE Compatibility flag. Ignored. MAP_FIXED Don't interpret addr as a hint: place the mapping at exactly that address. addr must be a multiple of the page size. If the memory region specified by addr and len overlaps pages of any existing mapping(s), then the overlapped part of the existing mapping(s) will be discarded. If the specified address cannot be used, mmap() will fail. Because requiring a fixed address for a mapping is less portable, the use of this option is discouraged. MAP_GROWSDOWN Used for stacks. Indicates to the kernel virtual memory system that the mapping should extend downward in memory. MAP_HUGETLB (since Linux 2.6.32) Allocate the mapping using "huge pages." See the Linux kernel source file Documentation/vm/hugetlbpage.txt for further information, as well as NOTES, below. MAP_HUGE_2MB, MAP_HUGE_1GB (since Linux 3.8) Used in conjunction with MAP_HUGETLB to select alternative hugetlb page sizes (respectively, 2 MB and 1 GB) on systems that support multiple hugetlb page sizes. More generally, the desired huge page size can be configured by encoding the base-2 logarithm of the desired page size in the six bits at the offset MAP_HUGE_SHIFT. (A value of zero in this bit field provides the default huge page size; the default huge page size can be discovered via the Hugepagesize field exposed by /proc/meminfo.) 
Thus, the above two constants are defined as: #define MAP_HUGE_2MB (21 << MAP_HUGE_SHIFT) #define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT) The range of huge page sizes that are supported by the system can be discovered by listing the subdirectories in /sys/kernel/mm/hugepages. MAP_LOCKED (since Linux 2.5.37) Mark the mmaped region to be locked in the same way as mlock(2). This implementation will try to populate (prefault) the whole range but the mmap call doesn't fail with ENOMEM if this fails. Therefore major faults might happen later on. So the semantic is not as strong as mlock(2). One should use mmap(2) plus mlock(2) when major faults are not acceptable after the initialization of the mapping. The MAP_LOCKED flag is ignored in older kernels. MAP_NONBLOCK (since Linux 2.5.46) Only meaningful in conjunction with MAP_POPULATE. Don't perform read-ahead: create page table entries only for pages that are already present in RAM. Since Linux 2.6.23, this flag causes MAP_POPULATE to do nothing. One day, the combination of MAP_POPULATE and MAP_NONBLOCK may be reimplemented. MAP_NORESERVE Do not reserve swap space for this mapping. When swap space is reserved, one has the guarantee that it is possible to modify the mapping. When swap space is not reserved one might get SIGSEGV upon a write if no physical memory is available. See also the discussion of the file /proc/sys/vm/overcommit_memory in proc(5). In kernels before 2.6, this flag had effect only for private writable mappings. MAP_POPULATE (since Linux 2.5.46) Populate (prefault) page tables for a mapping. For a file mapping, this causes read-ahead on the file. This will help to reduce blocking on page faults later. MAP_POPULATE is supported for private mappings only since Linux 2.6.23. MAP_STACK (since Linux 2.6.27) Allocate the mapping at an address suitable for a process or thread stack. 
This flag is currently a no-op, but is used in the glibc threading implementation so that if some architectures require special treatment for stack allocations, support can later be transparently implemented for glibc. MAP_UNINITIALIZED (since Linux 2.6.33) Don't clear anonymous pages. This flag is intended to improve performance on embedded devices. This flag is honored only if the kernel was configured with the CONFIG_MMAP_ALLOW_UNINITIALIZED option. Because of the security implications, that option is normally enabled only on embedded devices (i.e., devices where one has complete control of the contents of user memory). Of the above flags, only MAP_FIXED is specified in POSIX.1-2001 and POSIX.1-2008. However, most systems also support MAP_ANONYMOUS (or its synonym MAP_ANON). Some systems document the additional flags MAP_AUTOGROW, MAP_AUTORESRV, MAP_COPY, and MAP_LOCAL. Memory mapped by mmap() is preserved across fork(2), with the same attributes. A file is mapped in multiples of the page size. For a file that is not a multiple of the page size, the remaining memory is zeroed when mapped, and writes to that region are not written out to the file. The effect of changing the size of the underlying file of a mapping on the pages that correspond to added or removed regions of the file is unspecified. munmap() The munmap() system call deletes the mappings for the specified address range, and causes further references to addresses within the range to generate invalid memory references. The region is also automatically unmapped when the process is terminated. On the other hand, closing the file descriptor does not unmap the region. The address addr must be a multiple of the page size (but length need not be). All pages containing a part of the indicated range are unmapped, and subsequent references to these pages will generate SIGSEGV. It is not an error if the indicated range does not contain any mapped pages. 
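To make the mmap()/munmap() description above concrete, here is a minimal sketch that creates a private anonymous mapping and later releases it. It passes fd as -1 as the MAP_ANONYMOUS text recommends for portability, and lets the kernel pick the address by passing NULL. The helper names map_anonymous, unmap_anonymous, and round_up_to_page are hypothetical; the check against MAP_FAILED follows the page's RETURN VALUE section rather than the excerpt above.

```c
#define _DEFAULT_SOURCE         /* exposes MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round len up to a whole number of pages. */
static size_t round_up_to_page(size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGE_SIZE);
    return (len + page - 1) / page * page;
}

/*
 * Create a zero-filled, private, readable and writable anonymous mapping
 * of at least len bytes. Returns its address, or NULL on failure.
 */
void *map_anonymous(size_t len)
{
    if (len == 0)
        return NULL;

    void *p = mmap(NULL, round_up_to_page(len),
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS,
                   -1, 0);      /* fd = -1 for portability */

    return p == MAP_FAILED ? NULL : p;
}

/* Release a mapping obtained from map_anonymous(). Returns 0 or -1. */
int unmap_anonymous(void *p, size_t len)
{
    return munmap(p, round_up_to_page(len));
}
```

After unmap_anonymous() succeeds, any further access through the old pointer is an invalid memory reference, as the munmap() description states.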
http://man7.org/linux/man-pages/man2/mremap.2.html 10 SYSTEM CALL: mremap(2) - Linux manual page FUNCTIONALITY: mremap - remap a virtual memory address SYNOPSIS: #define _GNU_SOURCE /* See feature_test_macros(7) */ #include <sys/mman.h> void *mremap(void *old_address, size_t old_size, size_t new_size, int flags, ... /* void *new_address */); DESCRIPTION mremap() expands (or shrinks) an existing memory mapping, potentially moving it at the same time (controlled by the flags argument and the available virtual address space). old_address is the old address of the virtual memory block that you want to expand (or shrink). Note that old_address has to be page aligned. old_size is the old size of the virtual memory block. new_size is the requested size of the virtual memory block after the resize. An optional fifth argument, new_address, may be provided; see the description of MREMAP_FIXED below. In Linux the memory is divided into pages. A user process has (one or) several linear virtual memory segments. Each virtual memory segment has one or more mappings to real memory pages (in the page table). Each virtual memory segment has its own protection (access rights), which may cause a segmentation violation if the memory is accessed incorrectly (e.g., writing to a read-only segment). Accessing virtual memory outside of the segments will also cause a segmentation violation. mremap() uses the Linux page table scheme. mremap() changes the mapping between virtual addresses and memory pages. This can be used to implement a very efficient realloc(3). The flags bit-mask argument may be 0, or include the following flag: MREMAP_MAYMOVE By default, if there is not sufficient space to expand a mapping at its current location, then mremap() fails. If this flag is specified, then the kernel is permitted to relocate the mapping to a new virtual address, if necessary. 
If the mapping is relocated, then absolute pointers into the old mapping location become invalid (offsets relative to the starting address of the mapping should be employed). MREMAP_FIXED (since Linux 2.3.31) This flag serves a similar purpose to the MAP_FIXED flag of mmap(2). If this flag is specified, then mremap() accepts a fifth argument, void *new_address, which specifies a page-aligned address to which the mapping must be moved. Any previous mapping at the address range specified by new_address and new_size is unmapped. If MREMAP_FIXED is specified, then MREMAP_MAYMOVE must also be specified. If the memory segment specified by old_address and old_size is locked (using mlock(2) or similar), then this lock is maintained when the segment is resized and/or relocated. As a consequence, the amount of memory locked by the process may change.
http://man7.org/linux/man-pages/man2/mprotect.2.html 11 SYSTEM CALL: mprotect(2) - Linux manual page FUNCTIONALITY: mprotect - set protection on a region of memory SYNOPSIS: #include <sys/mman.h> int mprotect(void *addr, size_t len, int prot); DESCRIPTION mprotect() changes protection for the calling process's memory page(s) containing any part of the address range in the interval [addr, addr+len-1]. addr must be aligned to a page boundary. If the calling process tries to access memory in a manner that violates the protection, then the kernel generates a SIGSEGV signal for the process. prot is either PROT_NONE or a bitwise-or of the other values in the following list: PROT_NONE The memory cannot be accessed at all. PROT_READ The memory can be read. PROT_WRITE The memory can be modified. PROT_EXEC The memory can be executed.
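A minimal sketch of the two calls described above, assuming Linux with glibc: it grows a mapping with mremap() and MREMAP_MAYMOVE, reaches the data through an offset rather than an absolute pointer, and then makes the region read-only with mprotect(). The helper name grow_mapping_demo and the offset 100 are illustrative only.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if the data survived the resize and the
 * protection change succeeded, -1 otherwise. */
int grow_mapping_demo(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t old_size = (size_t)page;
    size_t new_size = 4 * (size_t)page;

    char *base = mmap(NULL, old_size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    /* Remember an offset into the mapping, not an absolute pointer. */
    size_t off = 100;
    strcpy(base + off, "hello");

    /* The kernel may relocate the mapping; 'base' must be treated as
     * stale after this call succeeds. */
    char *moved = mremap(base, old_size, new_size, MREMAP_MAYMOVE);
    if (moved == MAP_FAILED) {
        munmap(base, old_size);
        return -1;
    }

    int ok = strcmp(moved + off, "hello") == 0;
    moved[new_size - 1] = 'z';          /* the new tail is writable */

    /* Drop write permission: reads still work, a write would now
     * deliver SIGSEGV. */
    if (mprotect(moved, new_size, PROT_READ) != 0)
        ok = 0;
    if (moved[new_size - 1] != 'z')
        ok = 0;

    munmap(moved, new_size);
    return ok ? 0 : -1;
}
```

Keeping offsets instead of pointers is what lets a realloc(3)-style caller tolerate the move.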
http://man7.org/linux/man-pages/man2/madvise.2.html 11 SYSTEM CALL: madvise(2) - Linux manual page FUNCTIONALITY: madvise - give advice about use of memory SYNOPSIS: #include <sys/mman.h> int madvise(void *addr, size_t length, int advice); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): madvise(): Since glibc 2.19: _DEFAULT_SOURCE Up to and including glibc 2.19: _BSD_SOURCE DESCRIPTION The madvise() system call is used to give advice or directions to the kernel about the address range beginning at address addr and with size length bytes. Initially, the system call supported a set of "conventional" advice values, which are also available on several other implementations. (Note, though, that madvise() is not specified in POSIX.) Subsequently, a number of Linux-specific advice values have been added. Conventional advice values The advice values listed below allow an application to tell the kernel how it expects to use some mapped or shared memory areas, so that the kernel can choose appropriate read-ahead and caching techniques. These advice values do not influence the semantics of the application (except in the case of MADV_DONTNEED), but may influence its performance. All of the advice values listed here have analogs in the POSIX-specified posix_madvise(3) function, and the values have the same meanings, with the exception of MADV_DONTNEED. The advice is indicated in the advice argument, which is one of the following: MADV_NORMAL No special treatment. This is the default. MADV_RANDOM Expect page references in random order. (Hence, read ahead may be less useful than normally.) MADV_SEQUENTIAL Expect page references in sequential order. (Hence, pages in the given range can be aggressively read ahead, and may be freed soon after they are accessed.) MADV_WILLNEED Expect access in the near future. (Hence, it might be a good idea to read some pages ahead.) MADV_DONTNEED Do not expect access in the near future.
(For the time being, the application is finished with the given range, so the kernel can free resources associated with it.) After a successful MADV_DONTNEED operation, the semantics of memory access in the specified region are changed: subsequent accesses of pages in the range will succeed, but will result in either repopulating the memory contents from the up-to-date contents of the underlying mapped file (for shared file mappings, shared anonymous mappings, and shmem-based techniques such as System V shared memory segments) or zero-fill-on-demand pages for anonymous private mappings. Note that, when applied to shared mappings, MADV_DONTNEED might not lead to immediate freeing of the pages in the range. The kernel is free to delay freeing the pages until an appropriate moment. The resident set size (RSS) of the calling process will be immediately reduced, however. MADV_DONTNEED cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. (Pages marked with the kernel-internal VM_PFNMAP flag are special memory areas that are not managed by the virtual memory subsystem. Such pages are typically created by device drivers that map the pages into user space.) Linux-specific advice values The following Linux-specific advice values have no counterparts in the POSIX-specified posix_madvise(3), and may or may not have counterparts in the madvise() interface available on other implementations. Note that some of these operations change the semantics of memory accesses. MADV_REMOVE (since Linux 2.6.16) Free up a given range of pages and its associated backing store. This is equivalent to punching a hole in the corresponding byte range of the backing store (see fallocate(2)). Subsequent accesses in the specified address range will see bytes containing zero. The specified address range must be mapped shared and writable. This flag cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages.
In the initial implementation, only shmfs/tmpfs supported MADV_REMOVE; but since Linux 3.5, any filesystem which supports the fallocate(2) FALLOC_FL_PUNCH_HOLE mode also supports MADV_REMOVE. Hugetlbfs will fail with the error EINVAL and other filesystems fail with the error EOPNOTSUPP. MADV_DONTFORK (since Linux 2.6.16) Do not make the pages in this range available to the child after a fork(2). This is useful to prevent copy-on-write semantics from changing the physical location of a page if the parent writes to it after a fork(2). (Such page relocations cause problems for hardware that DMAs into the page.) MADV_DOFORK (since Linux 2.6.16) Undo the effect of MADV_DONTFORK, restoring the default behavior, whereby a mapping is inherited across fork(2). MADV_HWPOISON (since Linux 2.6.32) Poison the pages in the range specified by addr and length and handle subsequent references to those pages like a hardware memory corruption. This operation is available only for privileged (CAP_SYS_ADMIN) processes. This operation may result in the calling process receiving a SIGBUS and the page being unmapped. This feature is intended for testing of memory error-handling code; it is available only if the kernel was configured with CONFIG_MEMORY_FAILURE. MADV_MERGEABLE (since Linux 2.6.32) Enable Kernel Samepage Merging (KSM) for the pages in the range specified by addr and length. The kernel regularly scans those areas of user memory that have been marked as mergeable, looking for pages with identical content. These are replaced by a single write-protected page (which is automatically copied if a process later wants to update the content of the page). KSM merges only private anonymous pages (see mmap(2)). The KSM feature is intended for applications that generate many instances of the same data (e.g., virtualization systems such as KVM). It can consume a lot of processing power; use with care. See the Linux kernel source file Documentation/vm/ksm.txt for more details. 
The MADV_MERGEABLE and MADV_UNMERGEABLE operations are available only if the kernel was configured with CONFIG_KSM. MADV_UNMERGEABLE (since Linux 2.6.32) Undo the effect of an earlier MADV_MERGEABLE operation on the specified address range; KSM unmerges whatever pages it had merged in the address range specified by addr and length. MADV_SOFT_OFFLINE (since Linux 2.6.33) Soft offline the pages in the range specified by addr and length. The memory of each page in the specified range is preserved (i.e., when next accessed, the same content will be visible, but in a new physical page frame), and the original page is offlined (i.e., no longer used, and taken out of normal memory management). The effect of the MADV_SOFT_OFFLINE operation is invisible to (i.e., does not change the semantics of) the calling process. This feature is intended for testing of memory error-handling code; it is available only if the kernel was configured with CONFIG_MEMORY_FAILURE. MADV_HUGEPAGE (since Linux 2.6.38) Enable Transparent Huge Pages (THP) for pages in the range specified by addr and length. Currently, Transparent Huge Pages work only with private anonymous pages (see mmap(2)). The kernel will regularly scan the areas marked as huge page candidates to replace them with huge pages. The kernel will also allocate huge pages directly when the region is naturally aligned to the huge page size (see posix_memalign(3)). This feature is primarily aimed at applications that use large mappings of data and access large regions of that memory at a time (e.g., virtualization systems such as QEMU). It can very easily waste memory (e.g., a 2MB mapping that only ever accesses 1 byte will result in 2MB of wired memory instead of one 4KB page). See the Linux kernel source file Documentation/vm/transhuge.txt for more details. The MADV_HUGEPAGE and MADV_NOHUGEPAGE operations are available only if the kernel was configured with CONFIG_TRANSPARENT_HUGEPAGE.
MADV_NOHUGEPAGE (since Linux 2.6.38) Ensures that memory in the address range specified by addr and length will not be collapsed into huge pages. MADV_DONTDUMP (since Linux 3.4) Exclude from a core dump those pages in the range specified by addr and length. This is useful in applications that have large areas of memory that are known not to be useful in a core dump. The effect of MADV_DONTDUMP takes precedence over the bit mask that is set via the /proc/PID/coredump_filter file (see core(5)). MADV_DODUMP (since Linux 3.4) Undo the effect of an earlier MADV_DONTDUMP. MADV_FREE (since Linux 4.5) The application no longer requires the pages in the range specified by addr and length. The kernel can thus free these pages, but the freeing could be delayed until memory pressure occurs. For each of the pages that has been marked to be freed but has not yet been freed, the free operation will be canceled if the caller writes into the page. After a successful MADV_FREE operation, any stale data (i.e., dirty, unwritten pages) will be lost when the kernel frees the pages. However, subsequent writes to pages in the range will succeed, and the kernel then cannot free those dirtied pages, so the caller can always see the data it has just written. If there is no subsequent write, the kernel can free the pages at any time. Once pages in the range have been freed, the caller will see zero-fill-on-demand pages upon subsequent page references. The MADV_FREE operation can be applied only to private anonymous pages (see mmap(2)). On a swapless system, freeing pages in a given range happens instantly, regardless of memory pressure.
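The MADV_DONTNEED behavior for anonymous private mappings described above (zero-fill-on-demand after the call) can be observed with a short sketch. It assumes Linux with glibc; the helper name dontneed_demo and the fill value 0xAB are illustrative.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: fills one anonymous private page with 0xAB,
 * applies MADV_DONTNEED, and returns the first byte read back
 * afterwards, or -1 on failure. */
int dontneed_demo(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = (size_t)page;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAB, len);               /* dirty the page */

    if (madvise(p, len, MADV_DONTNEED) != 0) {
        munmap(p, len);
        return -1;
    }

    /* The mapping is still valid; the access succeeds but the old
     * contents are gone and a fresh zero-filled page is supplied. */
    int after = p[0];

    munmap(p, len);
    return after;
}
```

This is the one conventional advice value that changes what the program observes, which is why the text singles it out.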
http://man7.org/linux/man-pages/man2/mlock.2.html 13 SYSTEM CALL: mlock(2) - Linux manual page FUNCTIONALITY: mlock, mlock2, munlock, mlockall, munlockall - lock and unlock memory SYNOPSIS: #include <sys/mman.h> int mlock(const void *addr, size_t len); int mlock2(const void *addr, size_t len, int flags); int munlock(const void *addr, size_t len); int mlockall(int flags); int munlockall(void); DESCRIPTION mlock(), mlock2(), and mlockall() lock part or all of the calling process's virtual address space into RAM, preventing that memory from being paged to the swap area. munlock() and munlockall() perform the converse operation, unlocking part or all of the calling process's virtual address space, so that pages in the specified virtual address range may once more be swapped out if required by the kernel memory manager. Memory locking and unlocking are performed in units of whole pages. mlock(), mlock2(), and munlock() mlock() locks pages in the address range starting at addr and continuing for len bytes. All pages that contain a part of the specified address range are guaranteed to be resident in RAM when the call returns successfully; the pages are guaranteed to stay in RAM until later unlocked. mlock2() also locks pages in the specified range starting at addr and continuing for len bytes. However, the state of the pages contained in that range after the call returns successfully will depend on the value in the flags argument. The flags argument can be either 0 or the following constant: MLOCK_ONFAULT Lock pages that are currently resident and mark the entire range to have pages locked when they are populated by the page fault. If flags is 0, mlock2() behaves exactly the same as mlock(). Note: currently, there is not a glibc wrapper for mlock2(), so it will need to be invoked using syscall(2). munlock() unlocks pages in the address range starting at addr and continuing for len bytes.
After this call, all pages that contain a part of the specified memory range can be moved to external swap space again by the kernel. mlockall() and munlockall() mlockall() locks all pages mapped into the address space of the calling process. This includes the pages of the code, data and stack segment, as well as shared libraries, user space kernel data, shared memory, and memory-mapped files. All mapped pages are guaranteed to be resident in RAM when the call returns successfully; the pages are guaranteed to stay in RAM until later unlocked. The flags argument is constructed as the bitwise OR of one or more of the following constants: MCL_CURRENT Lock all pages which are currently mapped into the address space of the process. MCL_FUTURE Lock all pages which will become mapped into the address space of the process in the future. These could be, for instance, new pages required by a growing heap and stack as well as new memory-mapped files or shared memory regions. MCL_ONFAULT (since Linux 4.4) Used together with MCL_CURRENT, MCL_FUTURE, or both. Mark all current (with MCL_CURRENT) or future (with MCL_FUTURE) mappings to lock pages when they are faulted in. When used with MCL_CURRENT, all present pages are locked, but mlockall() will not fault in non-present pages. When used with MCL_FUTURE, all future mappings will be marked to lock pages when they are faulted in, but they will not be populated by the lock when the mapping is created. MCL_ONFAULT must be used with either MCL_CURRENT or MCL_FUTURE or both. If MCL_FUTURE has been specified, then a later system call (e.g., mmap(2), sbrk(2), malloc(3)), may fail if it would cause the number of locked bytes to exceed the permitted maximum (see below). In the same circumstances, stack growth may likewise fail: the kernel will deny stack expansion and deliver a SIGSEGV signal to the process. munlockall() unlocks all pages mapped into the address space of the calling process. 
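A minimal sketch of locking and unlocking a single page with mlock() and munlock(), assuming Linux with glibc. The helper name lock_one_page_demo and its return-code convention are hypothetical. Because the kernel may refuse the lock when the caller's locked-memory limit is too low or a privilege is missing, the sketch reports that case separately instead of treating it as a bug.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper. Returns 0 if the page was locked and unlocked,
 * 1 if the kernel refused the lock (limit or privilege), -1 on any
 * other failure. */
int lock_one_page_demo(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = (size_t)page;

    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    int result = 0;

    /* Locking works in units of whole pages: asking for 1 byte locks
     * the entire page that contains it. */
    if (mlock(buf, 1) != 0) {
        result = (errno == ENOMEM || errno == EPERM || errno == EAGAIN)
                     ? 1 : -1;
    } else {
        memset(buf, 0x5A, len);         /* page stays resident in RAM */
        if (munlock(buf, 1) != 0)       /* page may be swapped again */
            result = -1;
    }

    munmap(buf, len);
    return result;
}
```

A caller would typically treat a return value of 1 as a configuration problem (for example, a locked-memory limit that is too small) rather than as a programming error.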
http://man7.org/linux/man-pages/man2/mlock2.2.html 13 SYSTEM CALL: mlock(2) - Linux manual page FUNCTIONALITY: mlock, mlock2, munlock, mlockall, munlockall - lock and unlock memory
http://man7.org/linux/man-pages/man2/mlockall.2.html 13 SYSTEM CALL: mlock(2) - Linux manual page FUNCTIONALITY: mlock, mlock2, munlock, mlockall, munlockall - lock and unlock memory
http://man7.org/linux/man-pages/man2/munlock.2.html 13 SYSTEM CALL: mlock(2) - Linux manual page FUNCTIONALITY: mlock, mlock2, munlock, mlockall, munlockall - lock and unlock memory
http://man7.org/linux/man-pages/man2/munlockall.2.html 13 SYSTEM CALL: mlock(2) - Linux manual page FUNCTIONALITY: mlock, mlock2, munlock, mlockall, munlockall - lock and unlock memory
http://man7.org/linux/man-pages/man2/mincore.2.html 11 SYSTEM CALL: mincore(2) - Linux manual page FUNCTIONALITY: mincore - determine whether pages are resident in memory SYNOPSIS: #include <unistd.h> #include <sys/mman.h> int mincore(void *addr, size_t length, unsigned char *vec); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): mincore(): Since glibc 2.19: _DEFAULT_SOURCE Glibc 2.19 and earlier: _BSD_SOURCE || _SVID_SOURCE DESCRIPTION mincore() returns a vector that indicates whether pages of the calling process's virtual memory are resident in core (RAM), and so will not cause a disk access (page fault) if referenced. The kernel returns residency information about the pages starting at the address addr, and continuing for length bytes. The addr argument must be a multiple of the system page size. The length argument need not be a multiple of the page size, but since residency information is returned for whole pages, length is effectively rounded up to the next multiple of the page size. One may obtain the page size (PAGE_SIZE) using sysconf(_SC_PAGESIZE). The vec argument must point to an array containing at least (length+PAGE_SIZE-1) / PAGE_SIZE bytes. On return, the least significant bit of each byte will be set if the corresponding page is currently resident in memory, and be clear otherwise. (The settings of the other bits in each byte are undefined; these bits are reserved for possible later use.) Of course the information returned in vec is only a snapshot: pages that are not locked in memory can come and go at any moment, and the contents of vec may already be stale by the time this call returns.
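The sizing rule for vec and the meaning of the least significant bit can be sketched as follows, assuming Linux with glibc. The helper name count_resident and its parameters are hypothetical. Since the result is only a snapshot, the count should be read as a lower bound on what was touched, not as an exact figure.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: maps 'npages' anonymous pages, writes to the
 * first 'touched' of them, and returns how many pages mincore()
 * reports as resident, or -1 on failure. */
int count_resident(size_t npages, size_t touched)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0 && npages > 0 && touched <= npages);
    size_t psz = (size_t)page;
    size_t len = npages * psz;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    /* One byte per page: (length + PAGE_SIZE - 1) / PAGE_SIZE. */
    size_t vec_len = (len + psz - 1) / psz;
    unsigned char *vec = malloc(vec_len);
    if (vec == NULL) {
        munmap(p, len);
        return -1;
    }

    for (size_t i = 0; i < touched; i++)
        p[i * psz] = 1;                 /* fault the page in */

    int resident = -1;
    if (mincore(p, len, vec) == 0) {
        resident = 0;
        for (size_t i = 0; i < vec_len; i++)
            resident += vec[i] & 1;     /* only the low bit is defined */
    }

    free(vec);
    munmap(p, len);
    return resident;
}
```

For example, count_resident(4, 2) should report at least the two touched pages; whether the untouched pages are reported as resident depends on the kernel.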
http://man7.org/linux/man-pages/man2/membarrier.2.html 11 SYSTEM CALL: membarrier(2) - Linux manual page FUNCTIONALITY: membarrier - issue memory barriers on a set of threads SYNOPSIS: #include int membarrier(int cmd, int flags); DESCRIPTION The membarrier() system call helps reduce the overhead of the memory barrier instructions required to order memory accesses on multi-core systems. However, this system call is heavier than a memory barrier, so using it effectively is not as simple as replacing memory barriers with this system call; it requires an understanding of the details below. Memory barriers must be used with the understanding that each barrier either needs to be matched with its memory barrier counterparts, or the architecture's memory model must not require the matching barriers. There are cases where one side of the matching barriers (which we will refer to as the "fast side") is executed much more often than the other (which we will refer to as the "slow side"). This is a prime target for the use of membarrier(). The key idea is to replace, for these matching barriers, the fast-side memory barriers with simple compiler barriers, for example: asm volatile ("" : : : "memory") and to replace the slow-side memory barriers with calls to membarrier(). This adds overhead to the slow side and removes overhead from the fast side, resulting in an overall performance increase as long as the slow side is infrequent enough that the overhead of the membarrier() calls does not outweigh the performance gain on the fast side. The cmd argument is one of the following: MEMBARRIER_CMD_QUERY Query the set of supported commands. The return value of the call is a bit mask of supported commands. MEMBARRIER_CMD_QUERY, which has the value 0, is not itself included in this bit mask. This command is always supported (on kernels where membarrier() is provided).
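A sketch of how MEMBARRIER_CMD_QUERY might be used to probe for a command before relying on it. It assumes that no C library wrapper is available, so the call goes through syscall(2), and that the installed kernel headers provide linux/membarrier.h and SYS_membarrier. The names sys_membarrier() and membarrier_supports() are hypothetical.

```c
#include <assert.h>
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: invoke the raw system call, since no wrapper is assumed. */
static int sys_membarrier(int cmd, int flags)
{
    return (int)syscall(SYS_membarrier, cmd, flags);
}

/* Hypothetical helper: returns 1 if cmd is in the supported-command mask,
 * 0 if it is not, and -1 if membarrier() itself is unavailable. */
int membarrier_supports(int cmd)
{
    /* QUERY has the value 0 and never appears in the mask. */
    assert(cmd != MEMBARRIER_CMD_QUERY);

    int mask = sys_membarrier(MEMBARRIER_CMD_QUERY, 0);
    if (mask < 0)
        return -1;
    return (mask & cmd) != 0;
}
```

A slow-side code path could call membarrier_supports() once at startup and fall back to explicit memory barriers on both sides when the result is not 1.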
MEMBARRIER_CMD_SHARED Ensure that all threads from all processes on the system pass through a state where all memory accesses to user-space addresses match program order between entry to and return from the membarrier() system call. All threads on the system are targeted by this command. The flags argument is currently unused and must be specified as 0. All memory accesses performed in program order from each targeted thread are guaranteed to be ordered with respect to membarrier(). If we use the semantic barrier() to represent a compiler barrier forcing memory accesses to be performed in program order across the barrier, and smp_mb() to represent explicit memory barriers forcing full memory ordering across the barrier, we have the following ordering table for each pairing of barrier(), membarrier() and smp_mb(). The pair ordering is detailed as (O: ordered, X: not ordered):

                   barrier()   smp_mb()   membarrier()
    barrier()          X           X            O
    smp_mb()           X           O            O
    membarrier()       O           O            O

http://man7.org/linux/man-pages/man2/modify_ldt.2.html 11 SYSTEM CALL: modify_ldt(2) - Linux manual page FUNCTIONALITY: modify_ldt - get or set a per-process LDT entry SYNOPSIS: #include int modify_ldt(int func, void *ptr, unsigned long bytecount); Note: There is no glibc wrapper for this system call; see NOTES. DESCRIPTION modify_ldt() reads or writes the local descriptor table (LDT) for a process. The LDT is an array of segment descriptors that can be referenced by user code. Linux allows processes to configure a per-process (actually per-mm) LDT. For more information about the LDT, see the Intel Software Developer's Manual or the AMD Architecture Programming Manual. When func is 0, modify_ldt() reads the LDT into the memory pointed to by ptr. The number of bytes read is the smaller of bytecount and the actual size of the LDT, although the kernel may act as though the LDT is padded with additional trailing zero bytes. On success, modify_ldt() will return the number of bytes read.
When func is 1 or 0x11, modify_ldt() modifies the LDT entry indicated by ptr->entry_number. ptr points to a user_desc structure and bytecount must equal the size of this structure. The user_desc structure is defined in as:

    struct user_desc {
        unsigned int  entry_number;
        unsigned long base_addr;
        unsigned int  limit;
        unsigned int  seg_32bit:1;
        unsigned int  contents:2;
        unsigned int  read_exec_only:1;
        unsigned int  limit_in_pages:1;
        unsigned int  seg_not_present:1;
        unsigned int  useable:1;
    };

In Linux 2.4 and earlier, this structure was named modify_ldt_ldt_s. The contents field is the segment type (data, expand-down data, non-conforming code, or conforming code). The other fields match their descriptions in the CPU manual, although modify_ldt() cannot set the hardware-defined "accessed" bit described in the CPU manual. A user_desc is considered "empty" if read_exec_only and seg_not_present are set to 1 and all of the other fields are 0. An LDT entry can be cleared by setting it to an "empty" user_desc or, if func is 1, by setting both base and limit to 0. A conforming code segment (i.e., one with contents==3) will be rejected if func is 1 or if seg_not_present is 0. When func is 2, modify_ldt() will read zeros. This appears to be a leftover from Linux 2.4. http://man7.org/linux/man-pages/man2/capset.2.html 10 SYSTEM CALL: capget(2) - Linux manual page FUNCTIONALITY: capget, capset - set/get capabilities of thread(s) SYNOPSIS: #include int capget(cap_user_header_t hdrp, cap_user_data_t datap); int capset(cap_user_header_t hdrp, const cap_user_data_t datap); DESCRIPTION As of Linux 2.2, the power of the superuser (root) has been partitioned into a set of discrete capabilities. Each thread has a set of effective capabilities identifying which capabilities (if any) it may currently exercise. Each thread also has a set of inheritable capabilities that may be passed through an execve(2) call, and a set of permitted capabilities that it can make effective or inheritable.
These two system calls are the raw kernel interface for getting and setting thread capabilities. These system calls are specific to Linux, and the kernel API is likely to change: use of these system calls (in particular, the format of the cap_user_*_t types) is subject to extension with each kernel revision, although old programs will keep working. The portable interfaces are cap_set_proc(3) and cap_get_proc(3); if possible, you should use those interfaces in applications. If you wish to use the Linux extensions in applications, you should use the easier-to-use interfaces capsetp(3) and capgetp(3).

Current details

Now that you have been warned, some current kernel details. The structures are defined as follows.

    #define _LINUX_CAPABILITY_VERSION_1  0x19980330
    #define _LINUX_CAPABILITY_U32S_1     1

    /* V2 added in Linux 2.6.25; deprecated */
    #define _LINUX_CAPABILITY_VERSION_2  0x20071026
    #define _LINUX_CAPABILITY_U32S_2     2

    /* V3 added in Linux 2.6.26 */
    #define _LINUX_CAPABILITY_VERSION_3  0x20080522
    #define _LINUX_CAPABILITY_U32S_3     2

    typedef struct __user_cap_header_struct {
        __u32 version;
        int pid;
    } *cap_user_header_t;

    typedef struct __user_cap_data_struct {
        __u32 effective;
        __u32 permitted;
        __u32 inheritable;
    } *cap_user_data_t;

The effective, permitted, and inheritable fields are bit masks of the capabilities defined in capabilities(7). Note that the CAP_* values are bit indexes and need to be bit-shifted before ORing into the bit fields. To define the structures for passing to the system call, you have to use the struct __user_cap_header_struct and struct __user_cap_data_struct names because the typedefs are only pointers. Kernels prior to 2.6.25 prefer 32-bit capabilities with version _LINUX_CAPABILITY_VERSION_1. Linux 2.6.25 added 64-bit capability sets, with version _LINUX_CAPABILITY_VERSION_2. There was, however, an API glitch, and Linux 2.6.26 added _LINUX_CAPABILITY_VERSION_3 to fix the problem.
Note that 64-bit capabilities use datap[0] and datap[1], whereas 32-bit capabilities use only datap[0]. On kernels that support file capabilities (VFS capability support), these system calls behave slightly differently. This support was added as an option in Linux 2.6.24, and became fixed (nonoptional) in Linux 2.6.33. For capget() calls, one can probe the capabilities of any process by specifying its process ID with the hdrp->pid field value. With VFS capability support VFS Capability support creates a file-attribute method for adding capabilities to privileged executables. This privilege model obsoletes kernel support for one process asynchronously setting the capabilities of another. That is, with VFS support, for capset() calls the only permitted values for hdrp->pid are 0 or gettid(2), which are equivalent. Without VFS capability support When the kernel does not support VFS capabilities, capset() calls can operate on the capabilities of the thread specified by the pid field of hdrp when that is nonzero, or on the capabilities of the calling thread if pid is 0. If pid refers to a single-threaded process, then pid can be specified as a traditional process ID; operating on a thread of a multithreaded process requires a thread ID of the type returned by gettid(2). For capset(), pid can also be: -1, meaning perform the change on all threads except the caller and init(1); or a value less than -1, in which case the change is applied to all members of the process group whose ID is -pid. For details on the data, see capabilities(7). http://man7.org/linux/man-pages/man2/capget.2.html 10 SYSTEM CALL: capget(2) - Linux manual page FUNCTIONALITY: capget, capset - set/get capabilities of thread(s) SYNOPSIS: #include int capget(cap_user_header_t hdrp, cap_user_data_t datap); int capset(cap_user_header_t hdrp, const cap_user_data_t datap); DESCRIPTION As of Linux 2.2, the power of the superuser (root) has been partitioned into a set of discrete capabilities. 
Each thread has a set of effective capabilities identifying which capabilities (if any) it may currently exercise. Each thread also has a set of inheritable capabilities that may be passed through an execve(2) call, and a set of permitted capabilities that it can make effective or inheritable. These two system calls are the raw kernel interface for getting and setting thread capabilities. These system calls are specific to Linux, and the kernel API is likely to change: use of these system calls (in particular, the format of the cap_user_*_t types) is subject to extension with each kernel revision, although old programs will keep working. The portable interfaces are cap_set_proc(3) and cap_get_proc(3); if possible, you should use those interfaces in applications. If you wish to use the Linux extensions in applications, you should use the easier-to-use interfaces capsetp(3) and capgetp(3).

Current details

Now that you have been warned, some current kernel details. The structures are defined as follows.

    #define _LINUX_CAPABILITY_VERSION_1  0x19980330
    #define _LINUX_CAPABILITY_U32S_1     1

    /* V2 added in Linux 2.6.25; deprecated */
    #define _LINUX_CAPABILITY_VERSION_2  0x20071026
    #define _LINUX_CAPABILITY_U32S_2     2

    /* V3 added in Linux 2.6.26 */
    #define _LINUX_CAPABILITY_VERSION_3  0x20080522
    #define _LINUX_CAPABILITY_U32S_3     2

    typedef struct __user_cap_header_struct {
        __u32 version;
        int pid;
    } *cap_user_header_t;

    typedef struct __user_cap_data_struct {
        __u32 effective;
        __u32 permitted;
        __u32 inheritable;
    } *cap_user_data_t;

The effective, permitted, and inheritable fields are bit masks of the capabilities defined in capabilities(7). Note that the CAP_* values are bit indexes and need to be bit-shifted before ORing into the bit fields. To define the structures for passing to the system call, you have to use the struct __user_cap_header_struct and struct __user_cap_data_struct names because the typedefs are only pointers.
Kernels prior to 2.6.25 prefer 32-bit capabilities with version _LINUX_CAPABILITY_VERSION_1. Linux 2.6.25 added 64-bit capability sets, with version _LINUX_CAPABILITY_VERSION_2. There was, however, an API glitch, and Linux 2.6.26 added _LINUX_CAPABILITY_VERSION_3 to fix the problem. Note that 64-bit capabilities use datap[0] and datap[1], whereas 32-bit capabilities use only datap[0]. On kernels that support file capabilities (VFS capability support), these system calls behave slightly differently. This support was added as an option in Linux 2.6.24, and became fixed (nonoptional) in Linux 2.6.33. For capget() calls, one can probe the capabilities of any process by specifying its process ID with the hdrp->pid field value. With VFS capability support VFS Capability support creates a file-attribute method for adding capabilities to privileged executables. This privilege model obsoletes kernel support for one process asynchronously setting the capabilities of another. That is, with VFS support, for capset() calls the only permitted values for hdrp->pid are 0 or gettid(2), which are equivalent. Without VFS capability support When the kernel does not support VFS capabilities, capset() calls can operate on the capabilities of the thread specified by the pid field of hdrp when that is nonzero, or on the capabilities of the calling thread if pid is 0. If pid refers to a single-threaded process, then pid can be specified as a traditional process ID; operating on a thread of a multithreaded process requires a thread ID of the type returned by gettid(2). For capset(), pid can also be: -1, meaning perform the change on all threads except the caller and init(1); or a value less than -1, in which case the change is applied to all members of the process group whose ID is -pid. For details on the data, see capabilities(7). 
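To illustrate the points about the struct names, the two-element data array, and bit-shifting of CAP_* values, here is a minimal sketch that asks whether the calling thread currently has a given capability in its effective set. It assumes a kernel that accepts _LINUX_CAPABILITY_VERSION_3 and calls the raw interface through syscall(2); the function name thread_has_effective_cap() is hypothetical. Applications would normally prefer the portable cap_get_proc(3) interface mentioned above.

```c
#include <assert.h>
#include <linux/capability.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if cap is in the calling thread's
 * effective set, 0 if it is not, and -1 if capget() fails. */
int thread_has_effective_cap(int cap)
{
    assert(cap >= 0 && cap < 32 * _LINUX_CAPABILITY_U32S_3);

    /* The typedefs are only pointers, so the struct names are used here. */
    struct __user_cap_header_struct hdr = {
        .version = _LINUX_CAPABILITY_VERSION_3,
        .pid = 0,                       /* 0 means the calling thread */
    };
    struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3];

    if (syscall(SYS_capget, &hdr, data) == -1)
        return -1;

    /* CAP_* values are bit indexes: select the 32-bit word, then shift. */
    return (data[cap / 32].effective & (1u << (cap % 32))) != 0;
}
```

For example, thread_has_effective_cap(CAP_CHOWN) would typically return 1 for a root process and 0 for an unprivileged one.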
http://man7.org/linux/man-pages/man2/set_thread_area.2.html 12 SYSTEM CALL: set_thread_area(2) - Linux manual page FUNCTIONALITY: set_thread_area - set a GDT entry for thread-local storage SYNOPSIS: #include #include int get_thread_area(struct user_desc *u_info); int set_thread_area(struct user_desc *u_info); Note: There are no glibc wrappers for these system calls; see NOTES. DESCRIPTION Linux dedicates three global descriptor table (GDT) entries for thread-local storage. For more information about the GDT, see the Intel Software Developer's Manual or the AMD Architecture Programming Manual. Both of these system calls take an argument that is a pointer to a structure of the following type:

    struct user_desc {
        unsigned int  entry_number;
        unsigned long base_addr;
        unsigned int  limit;
        unsigned int  seg_32bit:1;
        unsigned int  contents:2;
        unsigned int  read_exec_only:1;
        unsigned int  limit_in_pages:1;
        unsigned int  seg_not_present:1;
        unsigned int  useable:1;
    };

get_thread_area() reads the GDT entry indicated by u_info->entry_number and fills in the rest of the fields in u_info. set_thread_area() sets a TLS entry in the GDT. The TLS array entry set by set_thread_area() corresponds to the value of u_info->entry_number passed in by the user. If this value is in bounds, set_thread_area() writes the TLS descriptor pointed to by u_info into the thread's TLS array. When set_thread_area() is passed an entry_number of -1, it searches for a free TLS entry. If set_thread_area() finds a free TLS entry, the value of u_info->entry_number is set upon return to show which entry was changed. A user_desc is considered "empty" if read_exec_only and seg_not_present are set to 1 and all of the other fields are 0. If an "empty" descriptor is passed to set_thread_area, the corresponding TLS entry will be cleared. See BUGS for additional details.
Since Linux 3.19, set_thread_area() cannot be used to write non-present segments, 16-bit segments, or code segments, although clearing a segment is still acceptable. http://man7.org/linux/man-pages/man2/get_thread_area.2.html 12 SYSTEM CALL: set_thread_area(2) - Linux manual page FUNCTIONALITY: set_thread_area - set a GDT entry for thread-local storage SYNOPSIS: #include #include int get_thread_area(struct user_desc *u_info); int set_thread_area(struct user_desc *u_info); Note: There are no glibc wrappers for these system calls; see NOTES. DESCRIPTION Linux dedicates three global descriptor table (GDT) entries for thread-local storage. For more information about the GDT, see the Intel Software Developer's Manual or the AMD Architecture Programming Manual. Both of these system calls take an argument that is a pointer to a structure of the following type:

    struct user_desc {
        unsigned int  entry_number;
        unsigned long base_addr;
        unsigned int  limit;
        unsigned int  seg_32bit:1;
        unsigned int  contents:2;
        unsigned int  read_exec_only:1;
        unsigned int  limit_in_pages:1;
        unsigned int  seg_not_present:1;
        unsigned int  useable:1;
    };

get_thread_area() reads the GDT entry indicated by u_info->entry_number and fills in the rest of the fields in u_info. set_thread_area() sets a TLS entry in the GDT. The TLS array entry set by set_thread_area() corresponds to the value of u_info->entry_number passed in by the user. If this value is in bounds, set_thread_area() writes the TLS descriptor pointed to by u_info into the thread's TLS array. When set_thread_area() is passed an entry_number of -1, it searches for a free TLS entry. If set_thread_area() finds a free TLS entry, the value of u_info->entry_number is set upon return to show which entry was changed. A user_desc is considered "empty" if read_exec_only and seg_not_present are set to 1 and all of the other fields are 0. If an "empty" descriptor is passed to set_thread_area, the corresponding TLS entry will be cleared.
See BUGS for additional details. Since Linux 3.19, set_thread_area() cannot be used to write non-present segments, 16-bit segments, or code segments, although clearing a segment is still acceptable. http://man7.org/linux/man-pages/man2/set_tid_address.2.html 10 SYSTEM CALL: set_tid_address(2) - Linux manual page FUNCTIONALITY: set_tid_address - set pointer to thread ID SYNOPSIS: #include long set_tid_address(int *tidptr); DESCRIPTION For each thread, the kernel maintains two attributes (addresses) called set_child_tid and clear_child_tid. These two attributes contain the value NULL by default. set_child_tid If a thread is started using clone(2) with the CLONE_CHILD_SETTID flag, set_child_tid is set to the value passed in the ctid argument of that system call. When set_child_tid is set, the very first thing the new thread does is to write its thread ID at this address. clear_child_tid If a thread is started using clone(2) with the CLONE_CHILD_CLEARTID flag, clear_child_tid is set to the value passed in the ctid argument of that system call. The system call set_tid_address() sets the clear_child_tid value for the calling thread to tidptr. When a thread whose clear_child_tid is not NULL terminates and that thread is sharing memory with other threads, 0 is written at the address specified in clear_child_tid and the kernel performs the following operation: futex(clear_child_tid, FUTEX_WAKE, 1, NULL, NULL, 0); The effect of this operation is to wake a single thread that is performing a futex wait on the memory location. Errors from the futex wake operation are ignored. http://man7.org/linux/man-pages/man2/arch_prctl.2.html 10 SYSTEM CALL: arch_prctl(2) - Linux manual page FUNCTIONALITY: arch_prctl - set architecture-specific thread state SYNOPSIS: #include #include int arch_prctl(int code, unsigned long addr); int arch_prctl(int code, unsigned long *addr); DESCRIPTION The arch_prctl() function sets architecture-specific process or thread state.
code selects a subfunction and passes argument addr to it; addr is interpreted as either an unsigned long for the "set" operations, or as an unsigned long *, for the "get" operations. Subfunctions for x86-64 are: ARCH_SET_FS Set the 64-bit base for the FS register to addr. ARCH_GET_FS Return the 64-bit base value for the FS register of the current thread in the unsigned long pointed to by addr. ARCH_SET_GS Set the 64-bit base for the GS register to addr. ARCH_GET_GS Return the 64-bit base value for the GS register of the current thread in the unsigned long pointed to by addr. http://man7.org/linux/man-pages/man2/uselib.2.html 10 SYSTEM CALL: uselib(2) - Linux manual page FUNCTIONALITY: uselib - load shared library SYNOPSIS: #include int uselib(const char *library); Note: No declaration of this system call is provided in glibc headers; see NOTES. DESCRIPTION The system call uselib() serves to load a shared library to be used by the calling process. It is given a pathname. The address where to load is found in the library itself. The library can have any recognized binary format. http://man7.org/linux/man-pages/man2/prctl.2.html 10 SYSTEM CALL: prctl(2) - Linux manual page FUNCTIONALITY: prctl - operations on a process SYNOPSIS: #include int prctl(int option, unsigned long arg2, unsigned long arg3, unsigned long arg4, unsigned long arg5); DESCRIPTION prctl() is called with a first argument describing what to do (with values defined in ), and further arguments with a significance depending on the first one. The first argument can be: PR_CAP_AMBIENT (since Linux 4.3) Reads or changes the ambient capability set, according to the value of arg2, which must be one of the following: PR_CAP_AMBIENT_RAISE The capability specified in arg3 is added to the ambient set. The specified capability must already be present in both the permitted and the inheritable sets of the process. This operation is not permitted if the SECBIT_NO_CAP_AMBIENT_RAISE securebit is set. 
PR_CAP_AMBIENT_LOWER The capability specified in arg3 is removed from the ambient set. PR_CAP_AMBIENT_IS_SET The prctl(2) call returns 1 if the capability in arg3 is in the ambient set and 0 if it is not. PR_CAP_AMBIENT_CLEAR_ALL All capabilities will be removed from the ambient set. This operation requires setting arg3 to zero. In all of the above operations, arg4 and arg5 must be specified as 0. PR_CAPBSET_READ (since Linux 2.6.25) Return (as the function result) 1 if the capability specified in arg2 is in the calling thread's capability bounding set, or 0 if it is not. (The capability constants are defined in .) The capability bounding set dictates whether the process can receive the capability through a file's permitted capability set on a subsequent call to execve(2). If the capability specified in arg2 is not valid, then the call fails with the error EINVAL. PR_CAPBSET_DROP (since Linux 2.6.25) If the calling thread has the CAP_SETPCAP capability, then drop the capability specified by arg2 from the calling thread's capability bounding set. Any children of the calling thread will inherit the newly reduced bounding set. The call fails with the error: EPERM if the calling thread does not have the CAP_SETPCAP capability; EINVAL if arg2 does not represent a valid capability; or EINVAL if file capabilities are not enabled in the kernel, in which case bounding sets are not supported. PR_SET_CHILD_SUBREAPER (since Linux 3.4) If arg2 is nonzero, set the "child subreaper" attribute of the calling process; if arg2 is zero, unset the attribute. When a process is marked as a child subreaper, all of the children that it creates, and their descendants, will be marked as having a subreaper. In effect, a subreaper fulfills the role of init(1) for its descendant processes.
Upon termination of a process that is orphaned (i.e., its immediate parent has already terminated) and marked as having a subreaper, the nearest still living ancestor subreaper will receive a SIGCHLD signal and will be able to wait(2) on the process to discover its termination status. PR_GET_CHILD_SUBREAPER (since Linux 3.4) Return the "child subreaper" setting of the caller, in the location pointed to by (int *) arg2. PR_SET_DUMPABLE (since Linux 2.3.20) Set the state of the "dumpable" flag, which determines whether core dumps are produced for the calling process upon delivery of a signal whose default behavior is to produce a core dump. In kernels up to and including 2.6.12, arg2 must be either 0 (SUID_DUMP_DISABLE, process is not dumpable) or 1 (SUID_DUMP_USER, process is dumpable). Between kernels 2.6.13 and 2.6.17, the value 2 was also permitted, which caused any binary which normally would not be dumped to be dumped readable by root only; for security reasons, this feature has been removed. (See also the description of /proc/sys/fs/suid_dumpable in proc(5).) Normally, this flag is set to 1. However, it is reset to the current value contained in the file /proc/sys/fs/suid_dumpable (which by default has the value 0), if any of the following attributes of the process are changed by the operations listed below:
* The effective user or group ID is changed.
* The filesystem user or group ID is changed (see credentials(7)).
* The process's set of permitted capabilities (see capabilities(7)) is changed such that its new set of capabilities is not a subset of its previous set of capabilities.
The operations that may trigger changes to the dumpable flag include:
* execution (execve(2)) of a set-user-ID or set-group-ID program, or a program that has capabilities (see capabilities(7));
* capset(2); and
* system calls that change process credentials (setuid(2), setgid(2), setresuid(2), setresgid(2), setgroups(2), and so on).
Processes that are not dumpable cannot be attached via ptrace(2) PTRACE_ATTACH. PR_GET_DUMPABLE (since Linux 2.3.20) Return (as the function result) the current state of the calling process's dumpable flag. PR_SET_ENDIAN (since Linux 2.6.18, PowerPC only) Set the endian-ness of the calling process to the value given in arg2, which should be one of the following: PR_ENDIAN_BIG, PR_ENDIAN_LITTLE, or PR_ENDIAN_PPC_LITTLE (PowerPC pseudo little endian). PR_GET_ENDIAN (since Linux 2.6.18, PowerPC only) Return the endian-ness of the calling process, in the location pointed to by (int *) arg2. PR_SET_FPEMU (since Linux 2.4.18, 2.5.9, only on ia64) Set floating-point emulation control bits to arg2. Pass PR_FPEMU_NOPRINT to silently emulate floating-point operation accesses, or PR_FPEMU_SIGFPE to not emulate floating-point operations and send SIGFPE instead. PR_GET_FPEMU (since Linux 2.4.18, 2.5.9, only on ia64) Return floating-point emulation control bits, in the location pointed to by (int *) arg2. PR_SET_FPEXC (since Linux 2.4.21, 2.5.32, only on PowerPC) Set floating-point exception mode to arg2. Pass PR_FP_EXC_SW_ENABLE to use FPEXC for FP exception enables, PR_FP_EXC_DIV for floating-point divide by zero, PR_FP_EXC_OVF for floating-point overflow, PR_FP_EXC_UND for floating-point underflow, PR_FP_EXC_RES for floating-point inexact result, PR_FP_EXC_INV for floating-point invalid operation, PR_FP_EXC_DISABLED for FP exceptions disabled, PR_FP_EXC_NONRECOV for async nonrecoverable exception mode, PR_FP_EXC_ASYNC for async recoverable exception mode, PR_FP_EXC_PRECISE for precise exception mode. PR_GET_FPEXC (since Linux 2.4.21, 2.5.32, only on PowerPC) Return floating-point exception mode, in the location pointed to by (int *) arg2.
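The PR_SET_DUMPABLE and PR_GET_DUMPABLE operations described above can be sketched as a pair of thin wrappers. The names set_dumpable() and get_dumpable() are hypothetical, and the sketch restricts arg2 to the values 0 and 1 that the text lists as currently valid.

```c
#include <assert.h>
#include <sys/prctl.h>

/* Hypothetical helper: set the "dumpable" flag to 0 (SUID_DUMP_DISABLE)
 * or 1 (SUID_DUMP_USER). Returns 0 on success, -1 on error. */
int set_dumpable(int value)
{
    assert(value == 0 || value == 1);
    return prctl(PR_SET_DUMPABLE, (unsigned long)value, 0UL, 0UL, 0UL);
}

/* Hypothetical helper: the flag is returned as the function result. */
int get_dumpable(void)
{
    return prctl(PR_GET_DUMPABLE, 0UL, 0UL, 0UL, 0UL);
}
```

A process handling secrets might call set_dumpable(0) early on, which also prevents PTRACE_ATTACH as noted above, and should remember that a later credential change can reset the flag.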
PR_SET_KEEPCAPS (since Linux 2.2.18) Set the state of the thread's "keep capabilities" flag, which determines whether the thread's permitted capability set is cleared when a change is made to the thread's user IDs such that the thread's real UID, effective UID, and saved set-user-ID all become nonzero when at least one of them previously had the value 0. By default, the permitted capability set is cleared when such a change is made; setting the "keep capabilities" flag prevents it from being cleared. arg2 must be either 0 (permitted capabilities are cleared) or 1 (permitted capabilities are kept). (A thread's effective capability set is always cleared when such a credential change is made, regardless of the setting of the "keep capabilities" flag.) The "keep capabilities" value will be reset to 0 on subsequent calls to execve(2). PR_GET_KEEPCAPS (since Linux 2.2.18) Return (as the function result) the current state of the calling thread's "keep capabilities" flag. PR_MCE_KILL (since Linux 2.6.32) Set the machine check memory corruption kill policy for the current thread. If arg2 is PR_MCE_KILL_CLEAR, clear the thread memory corruption kill policy and use the system-wide default. (The system-wide default is defined by /proc/sys/vm/memory_failure_early_kill; see proc(5).) If arg2 is PR_MCE_KILL_SET, use a thread-specific memory corruption kill policy. In this case, arg3 defines whether the policy is early kill (PR_MCE_KILL_EARLY), late kill (PR_MCE_KILL_LATE), or the system-wide default (PR_MCE_KILL_DEFAULT). Early kill means that the thread receives a SIGBUS signal as soon as hardware memory corruption is detected inside its address space. In late kill mode, the process is killed only when it accesses a corrupted page. See sigaction(2) for more information on the SIGBUS signal. The policy is inherited by children. The remaining unused prctl() arguments must be zero for future compatibility.
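One way the "keep capabilities" flag described above might be used, sketched under the assumption that the caller sets it just before switching away from UID 0: the helper set_keepcaps() is hypothetical and simply sets the flag and reads it back, so the caller can confirm the state before changing user IDs.

```c
#include <assert.h>
#include <sys/prctl.h>

/* Hypothetical helper: set the "keep capabilities" flag to keep (0 or 1)
 * and return the value read back with PR_GET_KEEPCAPS, or -1 on error. */
int set_keepcaps(int keep)
{
    assert(keep == 0 || keep == 1);
    if (prctl(PR_SET_KEEPCAPS, (unsigned long)keep, 0UL, 0UL, 0UL) == -1)
        return -1;
    return prctl(PR_GET_KEEPCAPS, 0UL, 0UL, 0UL, 0UL);
}
```

After set_keepcaps(1) returns 1, a privileged program could change its user IDs and keep its permitted set. Because the effective set is still cleared by that change, the program would then need to raise the capabilities it wants again, for example with capset(2).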
PR_MCE_KILL_GET (since Linux 2.6.32) Return the current per-process machine check kill policy. All unused prctl() arguments must be zero. PR_SET_MM (since Linux 3.3) Modify certain kernel memory map descriptor fields of the calling process. Usually these fields are set by the kernel and dynamic loader (see ld.so(8) for more information) and a regular application should not use this feature. However, there are cases, such as self-modifying programs, where a program might find it useful to change its own memory map. This feature is available only if the kernel is built with the CONFIG_CHECKPOINT_RESTORE option enabled. The calling process must have the CAP_SYS_RESOURCE capability. The value in arg2 is one of the options below, while arg3 provides a new value for the option. PR_SET_MM_START_CODE Set the address above which the program text can run. The corresponding memory area must be readable and executable, but not writable or sharable (see mprotect(2) and mmap(2) for more information). PR_SET_MM_END_CODE Set the address below which the program text can run. The corresponding memory area must be readable and executable, but not writable or sharable. PR_SET_MM_START_DATA Set the address above which initialized and uninitialized (bss) data are placed. The corresponding memory area must be readable and writable, but not executable or sharable. PR_SET_MM_END_DATA Set the address below which initialized and uninitialized (bss) data are placed. The corresponding memory area must be readable and writable, but not executable or sharable. PR_SET_MM_START_STACK Set the start address of the stack. The corresponding memory area must be readable and writable. PR_SET_MM_START_BRK Set the address above which the program heap can be expanded with brk(2) call. The address must be greater than the ending address of the current program data segment. 
In addition, the combined size of the resulting heap and the size of the data segment can't exceed the RLIMIT_DATA resource limit (see setrlimit(2)). PR_SET_MM_BRK Set the current brk(2) value. The requirements for the address are the same as for the PR_SET_MM_START_BRK option. The following options are available since Linux 3.5. PR_SET_MM_ARG_START Set the address above which the program command line is placed. PR_SET_MM_ARG_END Set the address below which the program command line is placed. PR_SET_MM_ENV_START Set the address above which the program environment is placed. PR_SET_MM_ENV_END Set the address below which the program environment is placed. The address passed with PR_SET_MM_ARG_START, PR_SET_MM_ARG_END, PR_SET_MM_ENV_START, and PR_SET_MM_ENV_END should belong to a process stack area. Thus, the corresponding memory area must be readable, writable, and (depending on the kernel configuration) have the MAP_GROWSDOWN attribute set (see mmap(2)). PR_SET_MM_AUXV Set a new auxiliary vector. The arg3 argument should provide the address of the vector. The arg4 is the size of the vector. PR_SET_MM_EXE_FILE Supersede the /proc/pid/exe symbolic link with a new one pointing to a new executable file identified by the file descriptor provided in arg3 argument. The file descriptor should be obtained with a regular open(2) call. To change the symbolic link, one needs to unmap all existing executable memory areas, including those created by the kernel itself (for example the kernel usually creates at least one executable memory area for the ELF .text section). The second limitation is that such transitions can be done only once in a process life time. Any further attempts will be rejected. This should help system administrators monitor unusual symbolic-link transitions over all processes running on a system. PR_MPX_ENABLE_MANAGEMENT, PR_MPX_DISABLE_MANAGEMENT (since Linux 3.19) Enable or disable kernel management of Memory Protection eXtensions (MPX) bounds tables. 
The arg2, arg3, arg4, and arg5 arguments must be zero. MPX is a hardware-assisted mechanism for performing bounds checking on pointers. It consists of a set of registers storing bounds information and a set of special instruction prefixes that tell the CPU on which instructions it should do bounds enforcement. There is a limited number of these registers and when there are more pointers than registers, their contents must be "spilled" into a set of tables. These tables are called "bounds tables" and the MPX prctl() operations control whether the kernel manages their allocation and freeing. When management is enabled, the kernel will take over allocation and freeing of the bounds tables. It does this by trapping the #BR exceptions that result at first use of missing bounds tables and instead of delivering the exception to user space, it allocates the table and populates the bounds directory with the location of the new table. For freeing, the kernel checks to see if bounds tables are present for memory which is not allocated, and frees them if so. Before enabling MPX management using PR_MPX_ENABLE_MANAGEMENT, the application must first have allocated a user-space buffer for the bounds directory and placed the location of that directory in the bndcfgu register. These calls will fail if the CPU or kernel does not support MPX. Kernel support for MPX is enabled via the CONFIG_X86_INTEL_MPX configuration option. You can check whether the CPU supports MPX by looking for the 'mpx' CPUID bit, like with the following command: cat /proc/cpuinfo | grep ' mpx ' A thread may not switch in or out of long (64-bit) mode while MPX is enabled. All threads in a process are affected by these calls. The child of a fork(2) inherits the state of MPX management. During execve(2), MPX management is reset to a state as if PR_MPX_DISABLE_MANAGEMENT had been called. For further information on Intel MPX, see the kernel source file Documentation/x86/intel_mpx.txt. 
PR_SET_NAME (since Linux 2.6.9) Set the name of the calling thread, using the value in the location pointed to by (char *) arg2. The name can be up to 16 bytes long, including the terminating null byte. (If the length of the string, including the terminating null byte, exceeds 16 bytes, the string is silently truncated.) This is the same attribute that can be set via pthread_setname_np(3) and retrieved using pthread_getname_np(3). The attribute is likewise accessible via /proc/self/task/[tid]/comm, where tid is the thread ID of the calling thread. PR_GET_NAME (since Linux 2.6.11) Return the name of the calling thread, in the buffer pointed to by (char *) arg2. The buffer should allow space for up to 16 bytes; the returned string will be null-terminated. PR_SET_NO_NEW_PRIVS (since Linux 3.5) Set the calling process's no_new_privs bit to the value in arg2. With no_new_privs set to 1, execve(2) promises not to grant privileges to do anything that could not have been done without the execve(2) call (for example, rendering the set-user-ID and set-group-ID mode bits, and file capabilities nonfunctional). Once set, this bit cannot be unset. The setting of this bit is inherited by children created by fork(2) and clone(2), and preserved across execve(2). For more information, see the kernel source file Documentation/prctl/no_new_privs.txt. PR_GET_NO_NEW_PRIVS (since Linux 3.5) Return (as the function result) the value of the no_new_privs bit for the current process. A value of 0 indicates the regular execve(2) behavior. A value of 1 indicates execve(2) will operate in the privilege-restricting mode described above. PR_SET_PDEATHSIG (since Linux 2.1.57) Set the parent death signal of the calling process to arg2 (either a signal value in the range 1..maxsig, or 0 to clear). This is the signal that the calling process will get when its parent dies.
This value is cleared for the child of a fork(2) and (since Linux 2.4.36 / 2.6.23) when executing a set-user-ID or set-group-ID binary, or a binary that has associated capabilities (see capabilities(7)). This value is preserved across execve(2). Warning: the "parent" in this case is considered to be the thread that created this process. In other words, the signal will be sent when that thread terminates (via, for example, pthread_exit(3)), rather than after all of the threads in the parent process terminate. PR_GET_PDEATHSIG (since Linux 2.3.15) Return the current value of the parent process death signal, in the location pointed to by (int *) arg2. PR_SET_PTRACER (since Linux 3.4) This is meaningful only when the Yama LSM is enabled and in mode 1 ("restricted ptrace", visible via /proc/sys/kernel/yama/ptrace_scope). When a "ptracer process ID" is passed in arg2, the caller is declaring that the ptracer process can ptrace(2) the calling process as if it were a direct process ancestor. Each PR_SET_PTRACER operation replaces the previous "ptracer process ID". Employing PR_SET_PTRACER with arg2 set to 0 clears the caller's "ptracer process ID". If arg2 is PR_SET_PTRACER_ANY, the ptrace restrictions introduced by Yama are effectively disabled for the calling process. For further information, see the kernel source file Documentation/security/Yama.txt. PR_SET_SECCOMP (since Linux 2.6.23) Set the secure computing (seccomp) mode for the calling thread, to limit the available system calls. The more recent seccomp(2) system call provides a superset of the functionality of PR_SET_SECCOMP. The seccomp mode is selected via arg2. (The seccomp constants are defined in <linux/seccomp.h>.) With arg2 set to SECCOMP_MODE_STRICT, the only system calls that the thread is permitted to make are read(2), write(2), _exit(2) (but not exit_group(2)), and sigreturn(2). Other system calls result in the delivery of a SIGKILL signal.
Strict secure computing mode is useful for number-crunching applications that may need to execute untrusted byte code, perhaps obtained by reading from a pipe or socket. This operation is available only if the kernel is configured with CONFIG_SECCOMP enabled. With arg2 set to SECCOMP_MODE_FILTER (since Linux 3.5), the system calls allowed are defined by a pointer to a Berkeley Packet Filter passed in arg3. This argument is a pointer to struct sock_fprog; it can be designed to filter arbitrary system calls and system call arguments. This mode is available only if the kernel is configured with CONFIG_SECCOMP_FILTER enabled. If SECCOMP_MODE_FILTER filters permit fork(2), then the seccomp mode is inherited by children created by fork(2); if execve(2) is permitted, then the seccomp mode is preserved across execve(2). If the filters permit prctl() calls, then additional filters can be added; they are run in order until the first non-allow result is seen. For further information, see the kernel source file Documentation/prctl/seccomp_filter.txt. PR_GET_SECCOMP (since Linux 2.6.23) Return (as the function result) the secure computing mode of the calling thread. If the caller is not in secure computing mode, this operation returns 0; if the caller is in strict secure computing mode, then the prctl() call will cause a SIGKILL signal to be sent to the process. If the caller is in filter mode, and this system call is allowed by the seccomp filters, it returns 2; otherwise, the process is killed with a SIGKILL signal. This operation is available only if the kernel is configured with CONFIG_SECCOMP enabled. Since Linux 3.8, the Seccomp field of the /proc/[pid]/status file provides a method of obtaining the same information, without the risk that the process is killed; see proc(5). PR_SET_SECUREBITS (since Linux 2.6.26) Set the "securebits" flags of the calling thread to the value supplied in arg2. See capabilities(7). 
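As a rough illustration of the /proc-based alternative to PR_GET_SECCOMP mentioned above, the following sketch reads the Seccomp field of /proc/self/status. The helper name read_seccomp_mode() is hypothetical, and the code assumes the field appears as "Seccomp:" followed by whitespace and a decimal number, as described in proc(5).

```c
#include <stdio.h>

/*
 * Hypothetical helper: return the seccomp mode reported in
 * /proc/self/status (0 = disabled, 1 = strict, 2 = filter), or -1 if
 * the field cannot be read (for example, on kernels older than
 * Linux 3.8 or when /proc is not mounted).
 */
int read_seccomp_mode(void)
{
    FILE *fp = fopen("/proc/self/status", "r");
    char line[256];
    int mode = -1;

    if (fp == NULL)
        return -1;

    while (fgets(line, sizeof(line), fp) != NULL) {
        /* The blank in the format matches the tab after the colon;
           mode is left untouched on lines that do not match. */
        if (sscanf(line, "Seccomp: %d", &mode) == 1)
            break;
    }
    fclose(fp);
    return mode;
}
```

Unlike prctl(PR_GET_SECCOMP), this approach issues only ordinary file operations, although those must themselves be permitted by any filter already in force.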
PR_GET_SECUREBITS (since Linux 2.6.26) Return (as the function result) the "securebits" flags of the calling thread. See capabilities(7). PR_SET_THP_DISABLE (since Linux 3.15) Set the state of the "THP disable" flag for the calling thread. If arg2 has a nonzero value, the flag is set, otherwise it is cleared. Setting this flag provides a method for disabling transparent huge pages for jobs where the code cannot be modified, and using a malloc hook with madvise(2) is not an option (i.e., statically allocated data). The setting of the "THP disable" flag is inherited by a child created via fork(2) and is preserved across execve(2). PR_TASK_PERF_EVENTS_DISABLE (since Linux 2.6.31) Disable all performance counters attached to the calling process, regardless of whether the counters were created by this process or another process. Performance counters created by the calling process for other processes are unaffected. For more information on performance counters, see the Linux kernel source file tools/perf/design.txt. Originally called PR_TASK_PERF_COUNTERS_DISABLE; renamed (with same numerical value) in Linux 2.6.32. PR_TASK_PERF_EVENTS_ENABLE (since Linux 2.6.31) The converse of PR_TASK_PERF_EVENTS_DISABLE; enable performance counters attached to the calling process. Originally called PR_TASK_PERF_COUNTERS_ENABLE; renamed in Linux 2.6.32. PR_GET_THP_DISABLE (since Linux 3.15) Return (via the function result) the current setting of the "THP disable" flag for the calling thread: either 1, if the flag is set, or 0, if it is not. PR_GET_TID_ADDRESS (since Linux 3.5) Retrieve the clear_child_tid address set by set_tid_address(2) and the clone(2) CLONE_CHILD_CLEARTID flag, in the location pointed to by (int **) arg2. This feature is available only if the kernel is built with the CONFIG_CHECKPOINT_RESTORE option enabled. PR_SET_TIMERSLACK (since Linux 2.6.28) Each thread has two associated timer slack values: a "default" value, and a "current" value. 
This operation sets the "current" timer slack value for the calling thread. If the nanosecond value supplied in arg2 is greater than zero, then the "current" value is set to this value. If arg2 is less than or equal to zero, the "current" timer slack is reset to the thread's "default" timer slack value. The "current" timer slack is used by the kernel to group timer expirations for the calling thread that are close to one another; as a consequence, timer expirations for the thread may be up to the specified number of nanoseconds late (but will never expire early). Grouping timer expirations can help reduce system power consumption by minimizing CPU wake-ups. The timer expirations affected by timer slack are those set by select(2), pselect(2), poll(2), ppoll(2), epoll_wait(2), epoll_pwait(2), clock_nanosleep(2), nanosleep(2), and futex(2) (and thus the library functions implemented via futexes, including pthread_cond_timedwait(3), pthread_mutex_timedlock(3), pthread_rwlock_timedrdlock(3), pthread_rwlock_timedwrlock(3), and sem_timedwait(3)). Timer slack is not applied to threads that are scheduled under a real-time scheduling policy (see sched_setscheduler(2)). When a new thread is created, the two timer slack values are made the same as the "current" value of the creating thread. Thereafter, a thread can adjust its "current" timer slack value via PR_SET_TIMERSLACK. The "default" value can't be changed. The timer slack values of init (PID 1), the ancestor of all processes, are 50,000 nanoseconds (50 microseconds). The timer slack values are preserved across execve(2). Since Linux 4.6, the "current" timer slack value of any process can be examined and changed via the file /proc/[pid]/timerslack_ns. See proc(5). PR_GET_TIMERSLACK (since Linux 2.6.28) Return (as the function result) the "current" timer slack value of the calling thread. 
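The relationship between PR_SET_TIMERSLACK and PR_GET_TIMERSLACK described above can be sketched as follows. The helper name set_timer_slack_ns() is hypothetical; the sketch assumes a kernel that supports both operations and a thread that is not running under a real-time scheduling policy.

```c
#include <sys/prctl.h>

/*
 * Hypothetical helper: set the "current" timer slack of the calling
 * thread to slack_ns nanoseconds and return the value the kernel
 * reports afterwards, or -1 on error. Passing 0 resets the "current"
 * value to the thread's "default" value.
 */
long set_timer_slack_ns(unsigned long slack_ns)
{
    if (prctl(PR_SET_TIMERSLACK, slack_ns, 0, 0, 0) == -1)
        return -1;

    /* The "current" slack comes back as the function result. */
    return prctl(PR_GET_TIMERSLACK, 0, 0, 0, 0);
}
```

A caller that wants timers grouped more coarsely might call set_timer_slack_ns(200000UL) before entering a poll(2) loop, and call set_timer_slack_ns(0UL) later to return to the default.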
PR_SET_TIMING (since Linux 2.6.0-test4) Set whether to use (normal, traditional) statistical process timing or accurate timestamp-based process timing, by passing PR_TIMING_STATISTICAL or PR_TIMING_TIMESTAMP to arg2. PR_TIMING_TIMESTAMP is not currently implemented (attempting to set this mode will yield the error EINVAL). PR_GET_TIMING (since Linux 2.6.0-test4) Return (as the function result) which process timing method is currently in use. PR_SET_TSC (since Linux 2.6.26, x86 only) Set the state of the flag determining whether the timestamp counter can be read by the process. Pass PR_TSC_ENABLE to arg2 to allow it to be read, or PR_TSC_SIGSEGV to generate a SIGSEGV when the process tries to read the timestamp counter. PR_GET_TSC (since Linux 2.6.26, x86 only) Return the state of the flag determining whether the timestamp counter can be read, in the location pointed to by (int *) arg2. PR_SET_UNALIGN (Only on: ia64, since Linux 2.3.48; parisc, since Linux 2.6.15; PowerPC, since Linux 2.6.18; Alpha, since Linux 2.6.22) Set unaligned access control bits to arg2. Pass PR_UNALIGN_NOPRINT to silently fix up unaligned user accesses, or PR_UNALIGN_SIGBUS to generate SIGBUS on unaligned user access. PR_GET_UNALIGN (see PR_SET_UNALIGN for information on versions and architectures) Return unaligned access control bits, in the location pointed to by (int *) arg2. http://man7.org/linux/man-pages/man2/seccomp.2.html 12 SYSTEM CALL: seccomp(2) - Linux manual page FUNCTIONALITY: seccomp - operate on Secure Computing state of the process SYNOPSIS: #include <linux/seccomp.h> #include <linux/filter.h> #include <linux/audit.h> #include <linux/signal.h> #include <sys/ptrace.h> int seccomp(unsigned int operation, unsigned int flags, void *args); DESCRIPTION The seccomp() system call operates on the Secure Computing (seccomp) state of the calling process.
Currently, Linux supports the following operation values: SECCOMP_SET_MODE_STRICT The only system calls that the calling thread is permitted to make are read(2), write(2), _exit(2) (but not exit_group(2)), and sigreturn(2). Other system calls result in the delivery of a SIGKILL signal. Strict secure computing mode is useful for number-crunching applications that may need to execute untrusted byte code, perhaps obtained by reading from a pipe or socket. Note that although the calling thread can no longer call sigprocmask(2), it can use sigreturn(2) to block all signals apart from SIGKILL and SIGSTOP. This means that alarm(2) (for example) is not sufficient for restricting the process's execution time. Instead, to reliably terminate the process, SIGKILL must be used. This can be done by using timer_create(2) with SIGEV_SIGNAL and sigev_signo set to SIGKILL, or by using setrlimit(2) to set the hard limit for RLIMIT_CPU. This operation is available only if the kernel is configured with CONFIG_SECCOMP enabled. The value of flags must be 0, and args must be NULL. This operation is functionally identical to the call: prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT); SECCOMP_SET_MODE_FILTER The system calls allowed are defined by a pointer to a Berkeley Packet Filter (BPF) passed via args. This argument is a pointer to a struct sock_fprog; it can be designed to filter arbitrary system calls and system call arguments. If the filter is invalid, seccomp() fails, returning EINVAL in errno. If fork(2) or clone(2) is allowed by the filter, any child processes will be constrained to the same system call filters as the parent. If execve(2) is allowed, the existing filters will be preserved across a call to execve(2). In order to use the SECCOMP_SET_MODE_FILTER operation, either the caller must have the CAP_SYS_ADMIN capability, or the thread must already have the no_new_privs bit set. 
If that bit was not already set by an ancestor of this thread, the thread must make the following call: prctl(PR_SET_NO_NEW_PRIVS, 1); Otherwise, the SECCOMP_SET_MODE_FILTER operation will fail and return EACCES in errno. This requirement ensures that an unprivileged process cannot apply a malicious filter and then invoke a set-user-ID or other privileged program using execve(2), thus potentially compromising that program. (Such a malicious filter might, for example, cause an attempt to use setuid(2) to set the caller's user IDs to non-zero values to instead return 0 without actually making the system call. Thus, the program might be tricked into retaining superuser privileges in circumstances where it is possible to influence it to do dangerous things because it did not actually drop privileges.) If prctl(2) or seccomp(2) is allowed by the attached filter, further filters may be added. This will increase evaluation time, but allows for further reduction of the attack surface during execution of a thread. The SECCOMP_SET_MODE_FILTER operation is available only if the kernel is configured with CONFIG_SECCOMP_FILTER enabled. When flags is 0, this operation is functionally identical to the call: prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, args); The recognized flags are: SECCOMP_FILTER_FLAG_TSYNC When adding a new filter, synchronize all other threads of the calling process to the same seccomp filter tree. A "filter tree" is the ordered list of filters attached to a thread. (Attaching identical filters in separate seccomp() calls results in different filters from this perspective.) If any thread cannot synchronize to the same filter tree, the call will not attach the new seccomp filter, and will fail, returning the first thread ID found that cannot synchronize. Synchronization will fail if another thread in the same process is in SECCOMP_MODE_STRICT or if it has attached new seccomp filters to itself, diverging from the calling thread's filter tree. 
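The no_new_privs prerequisite and the prctl() equivalence described above can be sketched as follows. This is a minimal illustration, not a useful sandbox: the helper name attach_allow_all_filter() is hypothetical, and the one-instruction filter deliberately allows every system call so that the calling process keeps working. It uses the struct sock_fprog and struct sock_filter types from <linux/filter.h>, and assumes a kernel built with CONFIG_SECCOMP_FILTER.

```c
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>

/*
 * Hypothetical helper: set no_new_privs, then attach a trivial
 * filter that returns SECCOMP_RET_ALLOW for every system call.
 * A real filter would examine the system call data instead.
 * Returns 0 on success, -1 on error (errno is set by prctl()).
 */
int attach_allow_all_filter(void)
{
    struct sock_filter insns[] = {
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(insns) / sizeof(insns[0])),
        .filter = insns,
    };

    /* Without CAP_SYS_ADMIN, attaching a filter fails with EACCES
       unless the no_new_privs bit is already set. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1)
        return -1;

    /* With flags == 0, this is the prctl() form of
       seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog). */
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog, 0, 0) == -1)
        return -1;

    return 0;
}
```

Both steps are irreversible for the calling thread: the bit cannot be cleared and the filter cannot be removed, which is why the sketch uses a filter that changes nothing.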
Filters When adding filters via SECCOMP_SET_MODE_FILTER, args points to a filter program: struct sock_fprog { unsigned short len; /* Number of BPF instructions */ struct sock_filter *filter; /* Pointer to array of BPF instructions */ }; Each program must contain one or more BPF instructions: struct sock_filter { /* Filter block */ __u16 code; /* Actual filter code */ __u8 jt; /* Jump true */ __u8 jf; /* Jump false */ __u32 k; /* Generic multiuse field */ }; When executing the instructions, the BPF program operates on the system call information made available (i.e., use the BPF_ABS addressing mode) as a (read-only) buffer of the following form: struct seccomp_data { int nr; /* System call number */ __u32 arch; /* AUDIT_ARCH_* value (see <linux/audit.h>) */ __u64 instruction_pointer; /* CPU instruction pointer */ __u64 args[6]; /* Up to 6 system call arguments */ }; Because numbering of system calls varies between architectures and some architectures (e.g., x86-64) allow user-space code to use the calling conventions of multiple architectures, it is usually necessary to verify the value of the arch field. It is strongly recommended to use a whitelisting approach whenever possible because such an approach is more robust and simple. A blacklist will have to be updated whenever a potentially dangerous system call is added (or a dangerous flag or option if those are blacklisted), and it is often possible to alter the representation of a value without altering its meaning, leading to a blacklist bypass. The arch field is not unique for all calling conventions. The x86-64 ABI and the x32 ABI both use AUDIT_ARCH_X86_64 as arch, and they run on the same processors. Instead, the mask __X32_SYSCALL_BIT is used on the system call number to tell the two ABIs apart.
This means that in order to create a seccomp-based blacklist for system calls performed through the x86-64 ABI, it is necessary to not only check that arch equals AUDIT_ARCH_X86_64, but also to explicitly reject all system calls that contain __X32_SYSCALL_BIT in nr. The instruction_pointer field provides the address of the machine-language instruction that performed the system call. This might be useful in conjunction with the use of /proc/[pid]/maps to perform checks based on which region (mapping) of the program made the system call. (Probably, it is wise to lock down the mmap(2) and mprotect(2) system calls to prevent the program from subverting such checks.) When checking values from args against a blacklist, keep in mind that arguments are often silently truncated before being processed, but after the seccomp check. For example, this happens if the i386 ABI is used on an x86-64 kernel: although the kernel will normally not look beyond the 32 lowest bits of the arguments, the values of the full 64-bit registers will be present in the seccomp data. A less surprising example is that if the x86-64 ABI is used to perform a system call that takes an argument of type int, the more-significant half of the argument register is ignored by the system call, but visible in the seccomp data. A seccomp filter returns a 32-bit value consisting of two parts: the most significant 16 bits (corresponding to the mask defined by the constant SECCOMP_RET_ACTION) contain one of the "action" values listed below; the least significant 16 bits (defined by the constant SECCOMP_RET_DATA) are "data" to be associated with this return value. If multiple filters exist, they are all executed, in reverse order of their addition to the filter tree; that is, the most recently installed filter is executed first. (Note that all filters will be called even if one of the earlier filters returns SECCOMP_RET_KILL.
This is done to simplify the kernel code and to provide a tiny speed-up in the execution of sets of filters by avoiding a check for this uncommon case.) The return value for the evaluation of a given system call is the first-seen SECCOMP_RET_ACTION value of highest precedence (along with its accompanying data) returned by execution of all of the filters. In decreasing order of precedence, the values that may be returned by a seccomp filter are: SECCOMP_RET_KILL This value results in the process exiting immediately without executing the system call. The process terminates as though killed by a SIGSYS signal (not SIGKILL). SECCOMP_RET_TRAP This value results in the kernel sending a SIGSYS signal to the triggering process without executing the system call. Various fields will be set in the siginfo_t structure (see sigaction(2)) associated with the signal: * si_signo will contain SIGSYS. * si_call_addr will show the address of the system call instruction. * si_syscall and si_arch will indicate which system call was attempted. * si_code will contain SYS_SECCOMP. * si_errno will contain the SECCOMP_RET_DATA portion of the filter return value. The program counter will be as though the system call happened (i.e., it will not point to the system call instruction). The return value register will contain an architecture-dependent value; if resuming execution, set it to something appropriate for the system call. (The architecture dependency is because replacing it with ENOSYS could overwrite some useful information.) SECCOMP_RET_ERRNO This value results in the SECCOMP_RET_DATA portion of the filter's return value being passed to user space as the errno value without executing the system call. SECCOMP_RET_TRACE When returned, this value will cause the kernel to attempt to notify a ptrace(2)-based tracer prior to executing the system call. If there is no tracer present, the system call is not executed and returns a failure status with errno set to ENOSYS.
A tracer will be notified if it requests PTRACE_O_TRACESECCOMP using ptrace(PTRACE_SETOPTIONS). The tracer will be notified of a PTRACE_EVENT_SECCOMP and the SECCOMP_RET_DATA portion of the filter's return value will be available to the tracer via PTRACE_GETEVENTMSG. The tracer can skip the system call by changing the system call number to -1. Alternatively, the tracer can change the system call requested by changing the system call to a valid system call number. If the tracer asks to skip the system call, then the system call will appear to return the value that the tracer puts in the return value register. The seccomp check will not be run again after the tracer is notified. (This means that seccomp-based sandboxes must not allow use of ptrace(2), even of other sandboxed processes, without extreme care; ptracers can use this mechanism to escape from the seccomp sandbox.) SECCOMP_RET_ALLOW This value results in the system call being executed. http://man7.org/linux/man-pages/man2/ptrace.2.html 11 SYSTEM CALL: ptrace(2) - Linux manual page FUNCTIONALITY: ptrace - process trace SYNOPSIS: #include <sys/ptrace.h> long ptrace(enum __ptrace_request request, pid_t pid, void *addr, void *data); DESCRIPTION The ptrace() system call provides a means by which one process (the "tracer") may observe and control the execution of another process (the "tracee"), and examine and change the tracee's memory and registers. It is primarily used to implement breakpoint debugging and system call tracing. A tracee first needs to be attached to the tracer. Attachment and subsequent commands are per thread: in a multithreaded process, every thread can be individually attached to a (potentially different) tracer, or left not attached and thus not debugged. Therefore, "tracee" always means "(one) thread", never "a (possibly multithreaded) process". Ptrace commands are always sent to a specific tracee using a call of the form ptrace(PTRACE_foo, pid, ...)
where pid is the thread ID of the corresponding Linux thread. (Note that in this page, a "multithreaded process" means a thread group consisting of threads created using the clone(2) CLONE_THREAD flag.) A process can initiate a trace by calling fork(2) and having the resulting child do a PTRACE_TRACEME, followed (typically) by an execve(2). Alternatively, one process may commence tracing another process using PTRACE_ATTACH or PTRACE_SEIZE. While being traced, the tracee will stop each time a signal is delivered, even if the signal is being ignored. (An exception is SIGKILL, which has its usual effect.) The tracer will be notified at its next call to waitpid(2) (or one of the related "wait" system calls); that call will return a status value containing information that indicates the cause of the stop in the tracee. While the tracee is stopped, the tracer can use various ptrace requests to inspect and modify the tracee. The tracer then causes the tracee to continue, optionally ignoring the delivered signal (or even delivering a different signal instead). If the PTRACE_O_TRACEEXEC option is not in effect, all successful calls to execve(2) by the traced process will cause it to be sent a SIGTRAP signal, giving the parent a chance to gain control before the new program begins execution. When the tracer is finished tracing, it can cause the tracee to continue executing in a normal, untraced mode via PTRACE_DETACH. The value of request determines the action to be performed: PTRACE_TRACEME Indicate that this process is to be traced by its parent. A process probably shouldn't make this request if its parent isn't expecting to trace it. (pid, addr, and data are ignored.) The PTRACE_TRACEME request is used only by the tracee; the remaining requests are used only by the tracer. In the following requests, pid specifies the thread ID of the tracee to be acted on. For requests other than PTRACE_ATTACH, PTRACE_SEIZE, PTRACE_INTERRUPT, and PTRACE_KILL, the tracee must be stopped. 
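The fork(2) and PTRACE_TRACEME sequence described above can be sketched as follows. The helper name trace_child_once() is hypothetical. For brevity the child stops itself with raise(SIGSTOP) instead of calling execve(2), and the tracer releases it with PTRACE_DETACH as soon as the first stop has been observed.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that requests tracing, observe
 * its first stop, detach, and return the child's exit status
 * (or -1 on error).
 */
int trace_child_once(int child_exit_code)
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* Tracee: ask to be traced by the parent, then stop so the
           parent gets a chance to take control. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(127);
        raise(SIGSTOP);
        _exit(child_exit_code);
    }

    /* Tracer: waitpid() reports the stop caused by SIGSTOP. */
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    if (!WIFSTOPPED(status))
        return -1;      /* Child ended early; already reaped. */

    /* While the tracee is stopped, it could be inspected here.
       Detach with data == 0 so that SIGSTOP is not delivered. */
    if (ptrace(PTRACE_DETACH, pid, NULL, NULL) == -1) {
        kill(pid, SIGKILL);
        waitpid(pid, &status, 0);
        return -1;
    }

    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, trace_child_once(42) should return 42 on a system where PTRACE_TRACEME is permitted.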
PTRACE_PEEKTEXT, PTRACE_PEEKDATA Read a word at the address addr in the tracee's memory, returning the word as the result of the ptrace() call. Linux does not have separate text and data address spaces, so these two requests are currently equivalent. (data is ignored; but see NOTES.) PTRACE_PEEKUSER Read a word at offset addr in the tracee's USER area, which holds the registers and other information about the process (see <sys/user.h>). The word is returned as the result of the ptrace() call. Typically, the offset must be word-aligned, though this might vary by architecture. See NOTES. (data is ignored; but see NOTES.) PTRACE_POKETEXT, PTRACE_POKEDATA Copy the word data to the address addr in the tracee's memory. As for PTRACE_PEEKTEXT and PTRACE_PEEKDATA, these two requests are currently equivalent. PTRACE_POKEUSER Copy the word data to offset addr in the tracee's USER area. As for PTRACE_PEEKUSER, the offset must typically be word-aligned. In order to maintain the integrity of the kernel, some modifications to the USER area are disallowed. PTRACE_GETREGS, PTRACE_GETFPREGS Copy the tracee's general-purpose or floating-point registers, respectively, to the address data in the tracer. See <sys/user.h> for information on the format of this data. (addr is ignored.) Note that SPARC systems have the meaning of data and addr reversed; that is, data is ignored and the registers are copied to the address addr. PTRACE_GETREGS and PTRACE_GETFPREGS are not present on all architectures. PTRACE_GETREGSET (since Linux 2.6.34) Read the tracee's registers. addr specifies, in an architecture-dependent way, the type of registers to be read. NT_PRSTATUS (with numerical value 1) usually results in reading of general-purpose registers. If the CPU has, for example, floating-point and/or vector registers, they can be retrieved by setting addr to the corresponding NT_foo constant. data points to a struct iovec, which describes the destination buffer's location and length.
On return, the kernel modifies the iov_len field of that structure to indicate the actual number of bytes returned. PTRACE_SETREGS, PTRACE_SETFPREGS Modify the tracee's general-purpose or floating-point registers, respectively, from the address data in the tracer. As for PTRACE_POKEUSER, some general-purpose register modifications may be disallowed. (addr is ignored.) Note that SPARC systems have the meaning of data and addr reversed; that is, data is ignored and the registers are copied from the address addr. PTRACE_SETREGS and PTRACE_SETFPREGS are not present on all architectures. PTRACE_SETREGSET (since Linux 2.6.34) Modify the tracee's registers. The meaning of addr and data is analogous to PTRACE_GETREGSET. PTRACE_GETSIGINFO (since Linux 2.3.99-pre6) Retrieve information about the signal that caused the stop. Copy a siginfo_t structure (see sigaction(2)) from the tracee to the address data in the tracer. (addr is ignored.) PTRACE_SETSIGINFO (since Linux 2.3.99-pre6) Set signal information: copy a siginfo_t structure from the address data in the tracer to the tracee. This will affect only signals that would normally be delivered to the tracee and were caught by the tracer. It may be difficult to tell these normal signals from synthetic signals generated by ptrace() itself. (addr is ignored.) PTRACE_PEEKSIGINFO (since Linux 3.10) Retrieve siginfo_t structures without removing signals from a queue. addr points to a ptrace_peeksiginfo_args structure that specifies the ordinal position from which copying of signals should start, and the number of signals to copy. siginfo_t structures are copied into the buffer pointed to by data. The return value contains the number of copied signals (zero indicates that there is no signal corresponding to the specified ordinal position). Within the returned siginfo structures, the si_code field includes information (__SI_CHLD, __SI_FAULT, etc.) that is not otherwise exposed to user space.
struct ptrace_peeksiginfo_args { u64 off; /* Ordinal position in queue at which to start copying signals */ u32 flags; /* PTRACE_PEEKSIGINFO_SHARED or 0 */ s32 nr; /* Number of signals to copy */ }; Currently, there is only one flag, PTRACE_PEEKSIGINFO_SHARED, for dumping signals from the process-wide signal queue. If this flag is not set, signals are read from the per-thread queue of the specified thread. PTRACE_GETSIGMASK (since Linux 3.11) Place a copy of the mask of blocked signals (see sigprocmask(2)) in the buffer pointed to by data, which should be a pointer to a buffer of type sigset_t. The addr argument contains the size of the buffer pointed to by data (i.e., sizeof(sigset_t)). PTRACE_SETSIGMASK (since Linux 3.11) Change the mask of blocked signals (see sigprocmask(2)) to the value specified in the buffer pointed to by data, which should be a pointer to a buffer of type sigset_t. The addr argument contains the size of the buffer pointed to by data (i.e., sizeof(sigset_t)). PTRACE_SETOPTIONS (since Linux 2.4.6; see BUGS for caveats) Set ptrace options from data. (addr is ignored.) data is interpreted as a bit mask of options, which are specified by the following flags: PTRACE_O_EXITKILL (since Linux 3.8) If a tracer sets this flag, a SIGKILL signal will be sent to every tracee if the tracer exits. This option is useful for ptrace jailers that want to ensure that tracees can never escape the tracer's control. PTRACE_O_TRACECLONE (since Linux 2.5.46) Stop the tracee at the next clone(2) and automatically start tracing the newly cloned process, which will start with a SIGSTOP, or PTRACE_EVENT_STOP if PTRACE_SEIZE was used. A waitpid(2) by the tracer will return a status value such that status>>8 == (SIGTRAP | (PTRACE_EVENT_CLONE<<8)) The PID of the new process can be retrieved with PTRACE_GETEVENTMSG. This option may not catch clone(2) calls in all cases. 
If the tracee calls clone(2) with the CLONE_VFORK flag, PTRACE_EVENT_VFORK will be delivered instead if PTRACE_O_TRACEVFORK is set; otherwise if the tracee calls clone(2) with the exit signal set to SIGCHLD, PTRACE_EVENT_FORK will be delivered if PTRACE_O_TRACEFORK is set. PTRACE_O_TRACEEXEC (since Linux 2.5.46) Stop the tracee at the next execve(2). A waitpid(2) by the tracer will return a status value such that status>>8 == (SIGTRAP | (PTRACE_EVENT_EXEC<<8)) If the execing thread is not a thread group leader, the thread ID is reset to thread group leader's ID before this stop. Since Linux 3.0, the former thread ID can be retrieved with PTRACE_GETEVENTMSG. PTRACE_O_TRACEEXIT (since Linux 2.5.60) Stop the tracee at exit. A waitpid(2) by the tracer will return a status value such that status>>8 == (SIGTRAP | (PTRACE_EVENT_EXIT<<8)) The tracee's exit status can be retrieved with PTRACE_GETEVENTMSG. The tracee is stopped early during process exit, when registers are still available, allowing the tracer to see where the exit occurred, whereas the normal exit notification is done after the process is finished exiting. Even though context is available, the tracer cannot prevent the exit from happening at this point. PTRACE_O_TRACEFORK (since Linux 2.5.46) Stop the tracee at the next fork(2) and automatically start tracing the newly forked process, which will start with a SIGSTOP, or PTRACE_EVENT_STOP if PTRACE_SEIZE was used. A waitpid(2) by the tracer will return a status value such that status>>8 == (SIGTRAP | (PTRACE_EVENT_FORK<<8)) The PID of the new process can be retrieved with PTRACE_GETEVENTMSG. PTRACE_O_TRACESYSGOOD (since Linux 2.4.6) When delivering system call traps, set bit 7 in the signal number (i.e., deliver SIGTRAP|0x80). This makes it easy for the tracer to distinguish normal traps from those caused by a system call. (PTRACE_O_TRACESYSGOOD may not work on all architectures.) 
PTRACE_O_TRACEVFORK (since Linux 2.5.46) Stop the tracee at the next vfork(2) and automatically start tracing the newly vforked process, which will start with a SIGSTOP, or PTRACE_EVENT_STOP if PTRACE_SEIZE was used. A waitpid(2) by the tracer will return a status value such that status>>8 == (SIGTRAP | (PTRACE_EVENT_VFORK<<8)) The PID of the new process can be retrieved with PTRACE_GETEVENTMSG. PTRACE_O_TRACEVFORKDONE (since Linux 2.5.60) Stop the tracee at the completion of the next vfork(2). A waitpid(2) by the tracer will return a status value such that status>>8 == (SIGTRAP | (PTRACE_EVENT_VFORK_DONE<<8)) The PID of the new process can (since Linux 2.6.18) be retrieved with PTRACE_GETEVENTMSG. PTRACE_O_TRACESECCOMP (since Linux 3.5) Stop the tracee when a seccomp(2) SECCOMP_RET_TRACE rule is triggered. A waitpid(2) by the tracer will return a status value such that status>>8 == (SIGTRAP | (PTRACE_EVENT_SECCOMP<<8)) While this triggers a PTRACE_EVENT stop, it is similar to a syscall-enter-stop, in that the tracee has not yet entered the syscall that seccomp triggered on. The seccomp event message data (from the SECCOMP_RET_DATA portion of the seccomp filter rule) can be retrieved with PTRACE_GETEVENTMSG. PTRACE_O_SUSPEND_SECCOMP (since Linux 4.3) Suspend the tracee's seccomp protections. This applies regardless of mode, and can be used when the tracee has not yet installed seccomp filters. That is, a valid use case is to suspend a tracee's seccomp protections before they are installed by the tracee, let the tracee install the filters, and then clear this flag when the filters should be resumed. Setting this option requires that the tracer have the CAP_SYS_ADMIN capability, not have any seccomp protections installed, and not have PTRACE_O_SUSPEND_SECCOMP set on itself. PTRACE_GETEVENTMSG (since Linux 2.5.46) Retrieve a message (as an unsigned long) about the ptrace event that just happened, placing it at the address data in the tracer. 
For PTRACE_EVENT_EXIT, this is the tracee's exit status. For PTRACE_EVENT_FORK, PTRACE_EVENT_VFORK, PTRACE_EVENT_VFORK_DONE, and PTRACE_EVENT_CLONE, this is the PID of the new process. For PTRACE_EVENT_SECCOMP, this is the seccomp(2) filter's SECCOMP_RET_DATA associated with the triggered rule. (addr is ignored.) PTRACE_CONT Restart the stopped tracee process. If data is nonzero, it is interpreted as the number of a signal to be delivered to the tracee; otherwise, no signal is delivered. Thus, for example, the tracer can control whether a signal sent to the tracee is delivered or not. (addr is ignored.) PTRACE_SYSCALL, PTRACE_SINGLESTEP Restart the stopped tracee as for PTRACE_CONT, but arrange for the tracee to be stopped at the next entry to or exit from a system call, or after execution of a single instruction, respectively. (The tracee will also, as usual, be stopped upon receipt of a signal.) From the tracer's perspective, the tracee will appear to have been stopped by receipt of a SIGTRAP. So, for PTRACE_SYSCALL, for example, the idea is to inspect the arguments to the system call at the first stop, then do another PTRACE_SYSCALL and inspect the return value of the system call at the second stop. The data argument is treated as for PTRACE_CONT. (addr is ignored.) PTRACE_SYSEMU, PTRACE_SYSEMU_SINGLESTEP (since Linux 2.6.14) For PTRACE_SYSEMU, continue and stop on entry to the next system call, which will not be executed. For PTRACE_SYSEMU_SINGLESTEP, do the same but also singlestep if not a system call. This call is used by programs like User Mode Linux that want to emulate all the tracee's system calls. The data argument is treated as for PTRACE_CONT. The addr argument is ignored. These requests are currently supported only on x86. PTRACE_LISTEN (since Linux 3.4) Restart the stopped tracee, but prevent it from executing. The resulting state of the tracee is similar to a process which has been stopped by a SIGSTOP (or other stopping signal). 
See the "group-stop" subsection for additional information. PTRACE_LISTEN works only on tracees attached by PTRACE_SEIZE. PTRACE_KILL Send the tracee a SIGKILL to terminate it. (addr and data are ignored.) This operation is deprecated; do not use it! Instead, send a SIGKILL directly using kill(2) or tgkill(2). The problem with PTRACE_KILL is that it requires the tracee to be in signal-delivery-stop, otherwise it may not work (i.e., may complete successfully but won't kill the tracee). By contrast, sending a SIGKILL directly has no such limitation. PTRACE_INTERRUPT (since Linux 3.4) Stop a tracee. If the tracee is running or sleeping in kernel space and PTRACE_SYSCALL is in effect, the system call is interrupted and syscall-exit-stop is reported. (The interrupted system call is restarted when the tracee is restarted.) If the tracee was already stopped by a signal and PTRACE_LISTEN was sent to it, the tracee stops with PTRACE_EVENT_STOP and WSTOPSIG(status) returns the stop signal. If any other ptrace-stop is generated at the same time (for example, if a signal is sent to the tracee), this ptrace-stop happens. If none of the above applies (for example, if the tracee is running in user space), it stops with PTRACE_EVENT_STOP with WSTOPSIG(status) == SIGTRAP. PTRACE_INTERRUPT only works on tracees attached by PTRACE_SEIZE. PTRACE_ATTACH Attach to the process specified in pid, making it a tracee of the calling process. The tracee is sent a SIGSTOP, but will not necessarily have stopped by the completion of this call; use waitpid(2) to wait for the tracee to stop. See the "Attaching and detaching" subsection for additional information. (addr and data are ignored.) Permission to perform a PTRACE_ATTACH is governed by a ptrace access mode PTRACE_MODE_ATTACH_REALCREDS check; see below. PTRACE_SEIZE (since Linux 3.4) Attach to the process specified in pid, making it a tracee of the calling process. Unlike PTRACE_ATTACH, PTRACE_SEIZE does not stop the process.
Group-stops are reported as PTRACE_EVENT_STOP and WSTOPSIG(status) returns the stop signal. Automatically attached children stop with PTRACE_EVENT_STOP and WSTOPSIG(status) returns SIGTRAP instead of having SIGSTOP signal delivered to them. execve(2) does not deliver an extra SIGTRAP. Only a PTRACE_SEIZEd process can accept PTRACE_INTERRUPT and PTRACE_LISTEN commands. The "seized" behavior just described is inherited by children that are automatically attached using PTRACE_O_TRACEFORK, PTRACE_O_TRACEVFORK, and PTRACE_O_TRACECLONE. addr must be zero. data contains a bit mask of ptrace options to activate immediately. Permission to perform a PTRACE_SEIZE is governed by a ptrace access mode PTRACE_MODE_ATTACH_REALCREDS check; see below. PTRACE_DETACH Restart the stopped tracee as for PTRACE_CONT, but first detach from it. Under Linux, a tracee can be detached in this way regardless of which method was used to initiate tracing. (addr is ignored.) Death under ptrace When a (possibly multithreaded) process receives a killing signal (one whose disposition is set to SIG_DFL and whose default action is to kill the process), all threads exit. Tracees report their death to their tracer(s). Notification of this event is delivered via waitpid(2). Note that the killing signal will first cause signal-delivery-stop (on one tracee only), and only after it is injected by the tracer (or after it was dispatched to a thread which isn't traced), will death from the signal happen on all tracees within a multithreaded process. (The term "signal-delivery-stop" is explained below.) SIGKILL does not generate signal-delivery-stop and therefore the tracer can't suppress it. SIGKILL kills even within system calls (syscall-exit-stop is not generated prior to death by SIGKILL). The net effect is that SIGKILL always kills the process (all its threads), even if some threads of the process are ptraced. When the tracee calls _exit(2), it reports its death to its tracer. 
Other threads are not affected. When any thread executes exit_group(2), every tracee in its thread group reports its death to its tracer. If the PTRACE_O_TRACEEXIT option is on, PTRACE_EVENT_EXIT will happen before actual death. This applies to exits via exit(2), exit_group(2), and signal deaths (except SIGKILL, depending on the kernel version; see BUGS below), and when threads are torn down on execve(2) in a multithreaded process. The tracer cannot assume that the ptrace-stopped tracee exists. There are many scenarios when the tracee may die while stopped (such as SIGKILL). Therefore, the tracer must be prepared to handle an ESRCH error on any ptrace operation. Unfortunately, the same error is returned if the tracee exists but is not ptrace-stopped (for commands which require a stopped tracee), or if it is not traced by the process which issued the ptrace call. The tracer needs to keep track of the stopped/running state of the tracee, and interpret ESRCH as "tracee died unexpectedly" only if it knows that the tracee has been observed to enter ptrace-stop. Note that there is no guarantee that waitpid(WNOHANG) will reliably report the tracee's death status if a ptrace operation returned ESRCH. waitpid(WNOHANG) may return 0 instead. In other words, the tracee may be "not yet fully dead", but already refusing ptrace requests. The tracer can't assume that the tracee always ends its life by reporting WIFEXITED(status) or WIFSIGNALED(status); there are cases where this does not occur. For example, if a thread other than thread group leader does an execve(2), it disappears; its PID will never be seen again, and any subsequent ptrace stops will be reported under the thread group leader's PID. Stopped states A tracee can be in two states: running or stopped. For the purposes of ptrace, a tracee which is blocked in a system call (such as read(2), pause(2), etc.) is nevertheless considered to be running, even if the tracee is blocked for a long time. 
The state of the tracee after PTRACE_LISTEN is somewhat of a gray area: it is not in any ptrace-stop (ptrace commands won't work on it, and it will deliver waitpid(2) notifications), but it also may be considered "stopped" because it is not executing instructions (is not scheduled), and if it was in group-stop before PTRACE_LISTEN, it will not respond to signals until SIGCONT is received. There are many kinds of states when the tracee is stopped, and in ptrace discussions they are often conflated. Therefore, it is important to use precise terms. In this manual page, any stopped state in which the tracee is ready to accept ptrace commands from the tracer is called ptrace-stop. Ptrace-stops can be further subdivided into signal-delivery-stop, group-stop, syscall-stop, and so on. These stopped states are described in detail below. When the running tracee enters ptrace-stop, it notifies its tracer using waitpid(2) (or one of the other "wait" system calls). Most of this manual page assumes that the tracer waits with:
    pid = waitpid(pid_or_minus_1, &status, __WALL);
Ptrace-stopped tracees are reported as returns with pid greater than 0 and WIFSTOPPED(status) true. The __WALL flag does not include the WSTOPPED and WEXITED flags, but implies their functionality. Setting the WCONTINUED flag when calling waitpid(2) is not recommended: the "continued" state is per-process and consuming it can confuse the real parent of the tracee. Use of the WNOHANG flag may cause waitpid(2) to return 0 ("no wait results available yet") even if the tracer knows there should be a notification. Example:
    errno = 0;
    ptrace(PTRACE_CONT, pid, 0L, 0L);
    if (errno == ESRCH) {
        /* tracee is dead */
        r = waitpid(pid, &status, __WALL | WNOHANG);
        /* r can still be 0 here! */
    }
The following kinds of ptrace-stops exist: signal-delivery-stops, group-stops, PTRACE_EVENT stops, syscall-stops. They all are reported by waitpid(2) with WIFSTOPPED(status) true.
They may be differentiated by examining the value status>>8, and if there is ambiguity in that value, by querying PTRACE_GETSIGINFO. (Note: the WSTOPSIG(status) macro can't be used to perform this examination, because it returns the value (status>>8) & 0xff.) Signal-delivery-stop When a (possibly multithreaded) process receives any signal except SIGKILL, the kernel selects an arbitrary thread which handles the signal. (If the signal is generated with tgkill(2), the target thread can be explicitly selected by the caller.) If the selected thread is traced, it enters signal-delivery-stop. At this point, the signal is not yet delivered to the process, and can be suppressed by the tracer. If the tracer doesn't suppress the signal, it passes the signal to the tracee in the next ptrace restart request. This second step of signal delivery is called signal injection in this manual page. Note that if the signal is blocked, signal-delivery-stop doesn't happen until the signal is unblocked, with the usual exception that SIGSTOP can't be blocked. Signal-delivery-stop is observed by the tracer as waitpid(2) returning with WIFSTOPPED(status) true, with the signal returned by WSTOPSIG(status). If the signal is SIGTRAP, this may be a different kind of ptrace-stop; see the "Syscall-stops" and "execve" sections below for details. If WSTOPSIG(status) returns a stopping signal, this may be a group-stop; see below. Signal injection and suppression After signal-delivery-stop is observed by the tracer, the tracer should restart the tracee with the call ptrace(PTRACE_restart, pid, 0, sig) where PTRACE_restart is one of the restarting ptrace requests. If sig is 0, then a signal is not delivered. Otherwise, the signal sig is delivered. This operation is called signal injection in this manual page, to distinguish it from signal-delivery-stop. The sig value may be different from the WSTOPSIG(status) value: the tracer can cause a different signal to be injected. 
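The restart step described above can be sketched as follows. This is a minimal illustration, assuming the tracer has already established that the stop is a signal-delivery-stop. The helper names signal_to_inject and restart_tracee are hypothetical, and PTRACE_CONT stands in for whichever restarting request the tracer actually uses.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Hypothetical helper: choose the sig argument for the restart request.
 * 'status' must be the wait status of a signal-delivery-stop.
 *   suppress == true   -> 0 (the signal is not delivered)
 *   replacement != 0   -> inject a different signal
 *   otherwise          -> inject the signal that caused the stop
 */
int signal_to_inject(int status, bool suppress, int replacement)
{
    assert(WIFSTOPPED(status));
    if (suppress)
        return 0;
    if (replacement != 0)
        return replacement;
    return WSTOPSIG(status);
}

/*
 * Hypothetical helper: restart the tracee and inject 'sig'
 * (0 injects nothing). Returns the result of ptrace().
 */
long restart_tracee(pid_t pid, int sig)
{
    return ptrace(PTRACE_CONT, pid, NULL, (void *)(long)sig);
}
```

A tracer would typically call restart_tracee(pid, signal_to_inject(status, false, 0)) to pass the signal through unchanged.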
Note that a suppressed signal still causes system calls to return prematurely. In this case, system calls will be restarted: the tracer will observe the tracee to reexecute the interrupted system call (or restart_syscall(2) system call for a few system calls which use a different mechanism for restarting) if the tracer uses PTRACE_SYSCALL. Even system calls (such as poll(2)) which are not restartable after a signal are restarted after the signal is suppressed; however, kernel bugs exist which cause some system calls to fail with EINTR even though no observable signal is injected to the tracee. Restarting ptrace commands issued in ptrace-stops other than signal-delivery-stop are not guaranteed to inject a signal, even if sig is nonzero. No error is reported; a nonzero sig may simply be ignored. Ptrace users should not try to "create a new signal" this way: use tgkill(2) instead. The fact that signal injection requests may be ignored when restarting the tracee after ptrace stops that are not signal-delivery-stops is a cause of confusion among ptrace users. One typical scenario is that the tracer observes group-stop, mistakes it for signal-delivery-stop, restarts the tracee with ptrace(PTRACE_restart, pid, 0, stopsig) with the intention of injecting stopsig, but stopsig gets ignored and the tracee continues to run. The SIGCONT signal has a side effect of waking up (all threads of) a group-stopped process. This side effect happens before signal-delivery-stop. The tracer can't suppress this side effect (it can only suppress signal injection, which only causes the SIGCONT handler to not be executed in the tracee, if such a handler is installed). In fact, waking up from group-stop may be followed by signal-delivery-stop for signal(s) other than SIGCONT, if they were pending when SIGCONT was delivered. In other words, SIGCONT may not be the first signal observed by the tracee after it was sent. Stopping signals cause (all threads of) a process to enter group-stop.
This side effect happens after signal injection, and therefore can be suppressed by the tracer. In Linux 2.4 and earlier, the SIGSTOP signal can't be injected. PTRACE_GETSIGINFO can be used to retrieve a siginfo_t structure which corresponds to the delivered signal. PTRACE_SETSIGINFO may be used to modify it. If PTRACE_SETSIGINFO has been used to alter siginfo_t, the si_signo field and the sig parameter in the restarting command must match, otherwise the result is undefined. Group-stop When a (possibly multithreaded) process receives a stopping signal, all threads stop. If some threads are traced, they enter a group-stop. Note that the stopping signal will first cause signal-delivery-stop (on one tracee only), and only after it is injected by the tracer (or after it was dispatched to a thread which isn't traced), will group-stop be initiated on all tracees within the multithreaded process. As usual, every tracee reports its group-stop separately to the corresponding tracer. Group-stop is observed by the tracer as waitpid(2) returning with WIFSTOPPED(status) true, with the stopping signal available via WSTOPSIG(status). The same result is returned by some other classes of ptrace-stops, therefore the recommended practice is to perform the call ptrace(PTRACE_GETSIGINFO, pid, 0, &siginfo). The call can be avoided if the signal is not SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU; only these four signals are stopping signals. If the tracer sees something else, it can't be a group-stop. Otherwise, the tracer needs to call PTRACE_GETSIGINFO. If PTRACE_GETSIGINFO fails with EINVAL, then it is definitely a group-stop. (Other failure codes are possible, such as ESRCH ("no such process") if a SIGKILL killed the tracee.) If the tracee was attached using PTRACE_SEIZE, group-stop is indicated by PTRACE_EVENT_STOP: status>>16 == PTRACE_EVENT_STOP. This allows detection of group-stops without requiring an extra PTRACE_GETSIGINFO call.
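The two detection paths just described can be sketched as below. The function names are hypothetical. The fallback definition of PTRACE_EVENT_STOP (128) is an assumption for older C library headers that lack it. The extra stopping-signal check in the PTRACE_SEIZE path follows the PTRACE_SEIZE description (group-stops report the stop signal in WSTOPSIG(status)), since PTRACE_EVENT_STOP is also used for other stops.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

#ifndef PTRACE_EVENT_STOP
#define PTRACE_EVENT_STOP 128   /* assumed value, for older headers */
#endif

/* Only these four signals are stopping signals. */
bool is_stopping_signal(int sig)
{
    return sig == SIGSTOP || sig == SIGTSTP ||
           sig == SIGTTIN || sig == SIGTTOU;
}

/* Tracee attached with PTRACE_SEIZE: the wait status alone is enough. */
bool is_group_stop_seized(int status)
{
    return WIFSTOPPED(status) &&
           (status >> 16) == PTRACE_EVENT_STOP &&
           is_stopping_signal(WSTOPSIG(status));
}

/*
 * Tracee attached with PTRACE_ATTACH.
 * Returns 1 for group-stop, 0 for any other stop, and -1 if the
 * question cannot be answered (for example ESRCH after a SIGKILL).
 */
int is_group_stop_attached(pid_t pid, int status)
{
    siginfo_t si;

    /* Anything other than a stopping signal can't be a group-stop. */
    if (!WIFSTOPPED(status) || !is_stopping_signal(WSTOPSIG(status)))
        return 0;
    if (ptrace(PTRACE_GETSIGINFO, pid, NULL, &si) == 0)
        return 0;   /* siginfo is available: signal-delivery-stop */
    return errno == EINVAL ? 1 : -1;
}
```

Checking the signal number first keeps the extra PTRACE_GETSIGINFO call off the common path, as the text recommends.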
As of Linux 2.6.38, after the tracer sees the tracee ptrace-stop and until it restarts or kills it, the tracee will not run, and will not send notifications (except SIGKILL death) to the tracer, even if the tracer enters into another waitpid(2) call. The kernel behavior described in the previous paragraph causes a problem with transparent handling of stopping signals. If the tracer restarts the tracee after group-stop, the stopping signal is effectively ignored: the tracee doesn't remain stopped, it runs. If the tracer doesn't restart the tracee before entering into the next waitpid(2), future SIGCONT signals will not be reported to the tracer; this would cause the SIGCONT signals to have no effect on the tracee. Since Linux 3.4, there is a method to overcome this problem: instead of PTRACE_CONT, a PTRACE_LISTEN command can be used to restart a tracee in a way where it does not execute, but waits for a new event which it can report via waitpid(2) (such as when it is restarted by a SIGCONT). PTRACE_EVENT stops If the tracer sets PTRACE_O_TRACE_* options, the tracee will enter ptrace-stops called PTRACE_EVENT stops. PTRACE_EVENT stops are observed by the tracer as waitpid(2) returning with WIFSTOPPED(status) true, and WSTOPSIG(status) returns SIGTRAP. An additional bit is set in the higher byte of the status word: the value status>>8 will be (SIGTRAP | PTRACE_EVENT_foo << 8). The following events exist: PTRACE_EVENT_VFORK Stop before return from vfork(2) or clone(2) with the CLONE_VFORK flag. When the tracee is continued after this stop, it will wait for the child to exit/exec before continuing its execution (in other words, the usual behavior on vfork(2)). PTRACE_EVENT_FORK Stop before return from fork(2) or clone(2) with the exit signal set to SIGCHLD. PTRACE_EVENT_CLONE Stop before return from clone(2). PTRACE_EVENT_VFORK_DONE Stop before return from vfork(2) or clone(2) with the CLONE_VFORK flag, but after the child unblocked this tracee by exiting or execing.
For all four stops described above, the stop occurs in the parent (i.e., the tracee), not in the newly created thread. PTRACE_GETEVENTMSG can be used to retrieve the new thread's ID. PTRACE_EVENT_EXEC Stop before return from execve(2). Since Linux 3.0, PTRACE_GETEVENTMSG returns the former thread ID. PTRACE_EVENT_EXIT Stop before exit (including death from exit_group(2)), signal death, or exit caused by execve(2) in a multithreaded process. PTRACE_GETEVENTMSG returns the exit status. Registers can be examined (unlike when "real" exit happens). The tracee is still alive; it needs to be PTRACE_CONTed or PTRACE_DETACHed to finish exiting. PTRACE_EVENT_STOP Stop induced by PTRACE_INTERRUPT command, or group-stop, or initial ptrace-stop when a new child is attached (only if attached using PTRACE_SEIZE). PTRACE_EVENT_SECCOMP Stop triggered by a seccomp(2) rule on tracee syscall entry when PTRACE_O_TRACESECCOMP has been set by the tracer. The seccomp event message data (from the SECCOMP_RET_DATA portion of the seccomp filter rule) can be retrieved with PTRACE_GETEVENTMSG. PTRACE_GETSIGINFO on PTRACE_EVENT stops returns SIGTRAP in si_signo, with si_code set to (event<<8) | SIGTRAP. Syscall-stops If the tracee was restarted by PTRACE_SYSCALL, the tracee enters syscall-enter-stop just prior to entering any system call. If the tracer restarts the tracee with PTRACE_SYSCALL, the tracee enters syscall-exit-stop when the system call is finished, or if it is interrupted by a signal. (That is, signal-delivery-stop never happens between syscall-enter-stop and syscall-exit-stop; it happens after syscall-exit-stop.) Other possibilities are that the tracee may stop in a PTRACE_EVENT stop, exit (if it entered _exit(2) or exit_group(2)), be killed by SIGKILL, or die silently (if it is a thread group leader, the execve(2) happened in another thread, and that thread is not traced by the same tracer; this situation is discussed later). 
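The status encoding of PTRACE_EVENT stops described in the preceding subsection can be decoded with a couple of shifts. The sketch below is illustrative only, and both function names are hypothetical. It matches only event stops reported with SIGTRAP, so a PTRACE_EVENT_STOP that reports a group-stop (which carries the stop signal instead) is deliberately not matched.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/*
 * Return the PTRACE_EVENT_* code carried by a wait status, or 0 if the
 * status does not describe a SIGTRAP-based PTRACE_EVENT stop.
 */
int ptrace_event_of(int status)
{
    if (!WIFSTOPPED(status) || WSTOPSIG(status) != SIGTRAP)
        return 0;
    /* status>>8 == (SIGTRAP | (event << 8)), so the event sits above bit 15. */
    return (status >> 16) & 0xffff;
}

/* The same test written exactly as the formula in the text. */
bool status_is_event(int status, int event)
{
    return WIFSTOPPED(status) &&
           (status >> 8) == (SIGTRAP | (event << 8));
}
```

A plain SIGTRAP stop (syscall-stop or signal-delivery-stop) yields 0 from ptrace_event_of, so the tracer can fall through to its other checks.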
Syscall-enter-stop and syscall-exit-stop are observed by the tracer as waitpid(2) returning with WIFSTOPPED(status) true, and WSTOPSIG(status) giving SIGTRAP. If the PTRACE_O_TRACESYSGOOD option was set by the tracer, then WSTOPSIG(status) will give the value (SIGTRAP | 0x80). Syscall-stops can be distinguished from signal-delivery-stop with SIGTRAP by querying PTRACE_GETSIGINFO for the following cases: si_code <= 0: SIGTRAP was delivered as a result of a user-space action, for example, a system call (tgkill(2), kill(2), sigqueue(3), etc.), expiration of a POSIX timer, change of state on a POSIX message queue, or completion of an asynchronous I/O request. si_code == SI_KERNEL (0x80): SIGTRAP was sent by the kernel. si_code == SIGTRAP or si_code == (SIGTRAP|0x80): This is a syscall-stop. However, syscall-stops happen very often (twice per system call), and performing PTRACE_GETSIGINFO for every syscall-stop may be somewhat expensive. Some architectures allow the cases to be distinguished by examining registers. For example, on x86, rax == -ENOSYS in syscall-enter-stop. Since SIGTRAP (like any other signal) always happens after syscall-exit-stop, and at this point rax almost never contains -ENOSYS, the SIGTRAP looks like "syscall-stop which is not syscall-enter-stop"; in other words, it looks like a "stray syscall-exit-stop" and can be detected this way. But such detection is fragile and is best avoided. Using the PTRACE_O_TRACESYSGOOD option is the recommended method to distinguish syscall-stops from other kinds of ptrace-stops, since it is reliable and does not incur a performance penalty. Syscall-enter-stop and syscall-exit-stop are indistinguishable from each other by the tracer. The tracer needs to keep track of the sequence of ptrace-stops in order to not misinterpret syscall-enter-stop as syscall-exit-stop or vice versa.
The rule is that syscall-enter-stop is always followed by syscall-exit-stop, PTRACE_EVENT stop or the tracee's death; no other kinds of ptrace-stop can occur in between. If after syscall-enter-stop, the tracer uses a restarting command other than PTRACE_SYSCALL, syscall-exit-stop is not generated. PTRACE_GETSIGINFO on syscall-stops returns SIGTRAP in si_signo, with si_code set to SIGTRAP or (SIGTRAP|0x80). PTRACE_SINGLESTEP, PTRACE_SYSEMU, PTRACE_SYSEMU_SINGLESTEP stops [Details of these kinds of stops are yet to be documented.] Informational and restarting ptrace commands Most ptrace commands (all except PTRACE_ATTACH, PTRACE_SEIZE, PTRACE_TRACEME, PTRACE_INTERRUPT, and PTRACE_KILL) require the tracee to be in a ptrace-stop, otherwise they fail with ESRCH. When the tracee is in ptrace-stop, the tracer can read and write data to the tracee using informational commands. These commands leave the tracee in ptrace-stopped state:
    ptrace(PTRACE_PEEKTEXT/PEEKDATA/PEEKUSER, pid, addr, 0);
    ptrace(PTRACE_POKETEXT/POKEDATA/POKEUSER, pid, addr, long_val);
    ptrace(PTRACE_GETREGS/GETFPREGS, pid, 0, &struct);
    ptrace(PTRACE_SETREGS/SETFPREGS, pid, 0, &struct);
    ptrace(PTRACE_GETREGSET, pid, NT_foo, &iov);
    ptrace(PTRACE_SETREGSET, pid, NT_foo, &iov);
    ptrace(PTRACE_GETSIGINFO, pid, 0, &siginfo);
    ptrace(PTRACE_SETSIGINFO, pid, 0, &siginfo);
    ptrace(PTRACE_GETEVENTMSG, pid, 0, &long_var);
    ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_flags);
Note that some errors are not reported. For example, setting signal information (siginfo) may have no effect in some ptrace-stops, yet the call may succeed (return 0 and not set errno); querying PTRACE_GETEVENTMSG may succeed and return some random value if current ptrace-stop is not documented as returning a meaningful event message. The call ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_flags); affects one tracee. The tracee's current flags are replaced.
Flags are inherited by new tracees created and "auto-attached" via active PTRACE_O_TRACEFORK, PTRACE_O_TRACEVFORK, or PTRACE_O_TRACECLONE options. Another group of commands makes the ptrace-stopped tracee run. They have the form: ptrace(cmd, pid, 0, sig); where cmd is PTRACE_CONT, PTRACE_LISTEN, PTRACE_DETACH, PTRACE_SYSCALL, PTRACE_SINGLESTEP, PTRACE_SYSEMU, or PTRACE_SYSEMU_SINGLESTEP. If the tracee is in signal-delivery-stop, sig is the signal to be injected (if it is nonzero). Otherwise, sig may be ignored. (When restarting a tracee from a ptrace-stop other than signal-delivery-stop, recommended practice is to always pass 0 in sig.) Attaching and detaching A thread can be attached to the tracer using the call ptrace(PTRACE_ATTACH, pid, 0, 0); or ptrace(PTRACE_SEIZE, pid, 0, PTRACE_O_flags); PTRACE_ATTACH sends SIGSTOP to this thread. If the tracer wants this SIGSTOP to have no effect, it needs to suppress it. Note that if other signals are concurrently sent to this thread during attach, the tracer may see the tracee enter signal-delivery-stop with other signal(s) first! The usual practice is to reinject these signals until SIGSTOP is seen, then suppress SIGSTOP injection. The design bug here is that a ptrace attach and a concurrently delivered SIGSTOP may race and the concurrent SIGSTOP may be lost. Since attaching sends SIGSTOP and the tracer usually suppresses it, this may cause a stray EINTR return from the currently executing system call in the tracee, as described in the "Signal injection and suppression" section. Since Linux 3.4, PTRACE_SEIZE can be used instead of PTRACE_ATTACH. PTRACE_SEIZE does not stop the attached process. If you need to stop it after attach (or at any other time) without sending it any signals, use PTRACE_INTERRUPT command. The request ptrace(PTRACE_TRACEME, 0, 0, 0); turns the calling thread into a tracee. The thread continues to run (doesn't enter ptrace-stop). 
A common practice is to follow the PTRACE_TRACEME with raise(SIGSTOP); and allow the parent (which is our tracer now) to observe our signal-delivery-stop. If the PTRACE_O_TRACEFORK, PTRACE_O_TRACEVFORK, or PTRACE_O_TRACECLONE options are in effect, then children created by, respectively, vfork(2) or clone(2) with the CLONE_VFORK flag, fork(2) or clone(2) with the exit signal set to SIGCHLD, and other kinds of clone(2), are automatically attached to the same tracer which traced their parent. SIGSTOP is delivered to the children, causing them to enter signal-delivery-stop after they exit the system call which created them. Detaching of the tracee is performed by: ptrace(PTRACE_DETACH, pid, 0, sig); PTRACE_DETACH is a restarting operation; therefore it requires the tracee to be in ptrace-stop. If the tracee is in signal-delivery-stop, a signal can be injected. Otherwise, the sig parameter may be silently ignored. If the tracee is running when the tracer wants to detach it, the usual solution is to send SIGSTOP (using tgkill(2), to make sure it goes to the correct thread), wait for the tracee to stop in signal-delivery-stop for SIGSTOP and then detach it (suppressing SIGSTOP injection). A design bug is that this can race with concurrent SIGSTOPs. Another complication is that the tracee may enter other ptrace-stops and needs to be restarted and waited for again, until SIGSTOP is seen. Yet another complication is to be sure that the tracee is not already ptrace-stopped, because no signal delivery happens while it is, not even SIGSTOP. If the tracer dies, all tracees are automatically detached and restarted, unless they were in group-stop. Handling of restart from group-stop is currently buggy, but the "as planned" behavior is to leave the tracee stopped and waiting for SIGCONT. If the tracee is restarted from signal-delivery-stop, the pending signal is injected.
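The PTRACE_TRACEME plus raise(SIGSTOP) idiom and a detach that suppresses the SIGSTOP can be put together as in the sketch below. It is a minimal illustration under stated assumptions: the function name trace_and_detach and the exit code 127 used for a failed PTRACE_TRACEME are hypothetical, and error handling is reduced to returning -1. It assumes the calling process is allowed to use ptrace on its own child.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo: fork a child that turns itself into a tracee,
 * observe its signal-delivery-stop for SIGSTOP, then detach while
 * suppressing the SIGSTOP. Returns the child's exit code, or -1 on error.
 */
int trace_and_detach(int child_exit_code)
{
    int status;
    pid_t child = fork();

    if (child < 0)
        return -1;

    if (child == 0) {
        /* Child: become a tracee of the parent, then stop. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(127);
        raise(SIGSTOP);             /* parent sees signal-delivery-stop */
        _exit(child_exit_code);     /* runs only after the detach */
    }

    /* Tracer: wait for the signal-delivery-stop caused by SIGSTOP. */
    if (waitpid(child, &status, __WALL) != child)
        return -1;
    if (!WIFSTOPPED(status) || WSTOPSIG(status) != SIGSTOP)
        return -1;

    /* Detach with sig == 0, so the SIGSTOP is not injected. */
    if (ptrace(PTRACE_DETACH, child, NULL, NULL) == -1)
        return -1;

    /* The child is no longer traced; reap it as an ordinary child. */
    if (waitpid(child, &status, 0) != child)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because the tracee is known to be in signal-delivery-stop when PTRACE_DETACH is issued, passing 0 reliably suppresses the SIGSTOP and the child runs on to its _exit call.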
execve(2) under ptrace When one thread in a multithreaded process calls execve(2), the kernel destroys all other threads in the process, and resets the thread ID of the execing thread to the thread group ID (process ID). (Or, to put things another way, when a multithreaded process does an execve(2), at completion of the call, it appears as though the execve(2) occurred in the thread group leader, regardless of which thread did the execve(2).) This resetting of the thread ID looks very confusing to tracers: * All other threads stop in PTRACE_EVENT_EXIT stop, if the PTRACE_O_TRACEEXIT option was turned on. Then all other threads except the thread group leader report death as if they exited via _exit(2) with exit code 0. * The execing tracee changes its thread ID while it is in the execve(2). (Remember, under ptrace, the "pid" returned from waitpid(2), or fed into ptrace calls, is the tracee's thread ID.) That is, the tracee's thread ID is reset to be the same as its process ID, which is the same as the thread group leader's thread ID. * Then a PTRACE_EVENT_EXEC stop happens, if the PTRACE_O_TRACEEXEC option was turned on. * If the thread group leader has reported its PTRACE_EVENT_EXIT stop by this time, it appears to the tracer that the dead thread leader "reappears from nowhere". (Note: the thread group leader does not report death via WIFEXITED(status) as long as at least one other thread is still alive. This eliminates the possibility that the tracer will see it dying and then reappearing.) If the thread group leader was still alive, for the tracer this may look as if the thread group leader returns from a different system call than it entered, or even "returned from a system call even though it was not in any system call". If the thread group leader was not traced (or was traced by a different tracer), then during execve(2) it will appear as if it has become a tracee of the tracer of the execing tracee.
All of the above effects are the artifacts of the thread ID change in the tracee. The PTRACE_O_TRACEEXEC option is the recommended tool for dealing with this situation. First, it enables PTRACE_EVENT_EXEC stop, which occurs before execve(2) returns. In this stop, the tracer can use PTRACE_GETEVENTMSG to retrieve the tracee's former thread ID. (This feature was introduced in Linux 3.0.) Second, the PTRACE_O_TRACEEXEC option disables legacy SIGTRAP generation on execve(2). When the tracer receives PTRACE_EVENT_EXEC stop notification, it is guaranteed that, except for this tracee and the thread group leader, no other threads from the process are alive. On receiving the PTRACE_EVENT_EXEC stop notification, the tracer should clean up all its internal data structures describing the threads of this process, and retain only one data structure, one which describes the single still running tracee, with thread ID == thread group ID == process ID. Example: two threads call execve(2) at the same time: *** we get syscall-enter-stop in thread 1: ** PID1 execve("/bin/foo", "foo" *** we issue PTRACE_SYSCALL for thread 1 ** *** we get syscall-enter-stop in thread 2: ** PID2 execve("/bin/bar", "bar" *** we issue PTRACE_SYSCALL for thread 2 ** *** we get PTRACE_EVENT_EXEC for PID0, we issue PTRACE_SYSCALL ** *** we get syscall-exit-stop for PID0: ** PID0 <... execve resumed> ) = 0 If the PTRACE_O_TRACEEXEC option is not in effect for the execing tracee, and if the tracee was PTRACE_ATTACHed rather than PTRACE_SEIZEd, the kernel delivers an extra SIGTRAP to the tracee after execve(2) returns. This is an ordinary signal (similar to one which can be generated by kill -TRAP), not a special kind of ptrace-stop. Employing PTRACE_GETSIGINFO for this signal returns si_code set to 0 (SI_USER). This signal may be blocked by signal mask, and thus may be delivered (much) later.
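As a rough illustration of the PTRACE_O_TRACEEXEC approach described above, the sketch below traces a single-threaded child across execve(2) and reads the former thread ID with PTRACE_GETEVENTMSG. The helper name former_tid_matches_pid is hypothetical, the child is assumed to be able to exec /bin/sh, and a kernel of version 3.0 or later is assumed for the event message.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the former thread ID reported at the
 * PTRACE_EVENT_EXEC stop equals the child's PID (expected for a
 * single-threaded tracee), 0 if it differs, -1 on error. */
int former_tid_matches_pid(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(127);
        raise(SIGSTOP);         /* give the tracer a chance to set options */
        execl("/bin/sh", "sh", "-c", "exit 0", (char *)NULL);
        _exit(127);
    }

    int status;
    int result = -1;

    if (waitpid(pid, &status, 0) == -1 || !WIFSTOPPED(status))
        goto out;
    if (ptrace(PTRACE_SETOPTIONS, pid, NULL,
               (void *)(long)PTRACE_O_TRACEEXEC) == -1)
        goto out;
    if (ptrace(PTRACE_CONT, pid, NULL, NULL) == -1)
        goto out;
    if (waitpid(pid, &status, 0) == -1 || !WIFSTOPPED(status))
        goto out;

    /* An event stop is reported as SIGTRAP with the event in the next byte. */
    if ((status >> 8) == (SIGTRAP | (PTRACE_EVENT_EXEC << 8))) {
        unsigned long old_tid = 0;
        if (ptrace(PTRACE_GETEVENTMSG, pid, NULL, &old_tid) == 0)
            result = (old_tid == (unsigned long)pid);
    }

out:
    /* The sketch does not need the exec'd program to finish. */
    kill(pid, SIGKILL);
    waitpid(pid, NULL, 0);
    return result;
}
```

In a multithreaded tracee the reported value could differ from the PID, which is the case where a tracer would use it to update its per-thread bookkeeping.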
Usually, the tracer (for example, strace(1)) would not want to show this extra post-execve SIGTRAP signal to the user, and would suppress its delivery to the tracee (if SIGTRAP is set to SIG_DFL, it is a killing signal). However, determining which SIGTRAP to suppress is not easy. Setting the PTRACE_O_TRACEEXEC option or using PTRACE_SEIZE and thus suppressing this extra SIGTRAP is the recommended approach. Real parent The ptrace API (ab)uses the standard UNIX parent/child signaling over waitpid(2). This used to cause the real parent of the process to stop receiving several kinds of waitpid(2) notifications when the child process is traced by some other process. Many of these bugs have been fixed, but as of Linux 2.6.38 several still exist; see BUGS below. As of Linux 2.6.38, the following is believed to work correctly: * exit/death by signal is reported first to the tracer, then, when the tracer consumes the waitpid(2) result, to the real parent (to the real parent only when the whole multithreaded process exits). If the tracer and the real parent are the same process, the report is sent only once. http://man7.org/linux/man-pages/man2/process_vm_readv.2.html 12 SYSTEM CALL: process_vm_readv(2) - Linux manual page FUNCTIONALITY: process_vm_readv, process_vm_writev - transfer data between process address spaces SYNOPSIS: #include ssize_t process_vm_readv(pid_t pid, const struct iovec *local_iov, unsigned long liovcnt, const struct iovec *remote_iov, unsigned long riovcnt, unsigned long flags); ssize_t process_vm_writev(pid_t pid, const struct iovec *local_iov, unsigned long liovcnt, const struct iovec *remote_iov, unsigned long riovcnt, unsigned long flags); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): process_vm_readv(), process_vm_writev(): _GNU_SOURCE DESCRIPTION These system calls transfer data between the address space of the calling process ("the local process") and the process identified by pid ("the remote process"). 
The data moves directly between the address spaces of the two processes, without passing through kernel space. The process_vm_readv() system call transfers data from the remote process to the local process. The data to be transferred is identified by remote_iov and riovcnt: remote_iov is a pointer to an array describing address ranges in the process pid, and riovcnt specifies the number of elements in remote_iov. The data is transferred to the locations specified by local_iov and liovcnt: local_iov is a pointer to an array describing address ranges in the calling process, and liovcnt specifies the number of elements in local_iov. The process_vm_writev() system call is the converse of process_vm_readv()—it transfers data from the local process to the remote process. Other than the direction of the transfer, the arguments liovcnt, local_iov, riovcnt, and remote_iov have the same meaning as for process_vm_readv(). The local_iov and remote_iov arguments point to an array of iovec structures, defined in as: struct iovec { void *iov_base; /* Starting address */ size_t iov_len; /* Number of bytes to transfer */ }; Buffers are processed in array order. This means that process_vm_readv() completely fills local_iov[0] before proceeding to local_iov[1], and so on. Likewise, remote_iov[0] is completely read before proceeding to remote_iov[1], and so on. Similarly, process_vm_writev() writes out the entire contents of local_iov[0] before proceeding to local_iov[1], and it completely fills remote_iov[0] before proceeding to remote_iov[1]. The lengths of remote_iov[i].iov_len and local_iov[i].iov_len do not have to be the same. Thus, it is possible to split a single local buffer into multiple remote buffers, or vice versa. The flags argument is currently unused and must be set to 0. The values specified in the liovcnt and riovcnt arguments must be less than or equal to IOV_MAX (defined in or accessible via the call sysconf(_SC_IOV_MAX)). 
The count arguments and local_iov are checked before doing any transfers. If the counts are too big, or local_iov is invalid, or the addresses refer to regions that are inaccessible to the local process, none of the vectors will be processed and an error will be returned immediately. Note, however, that these system calls do not check the memory regions in the remote process until just before doing the read/write. Consequently, a partial read/write (see RETURN VALUE) may result if one of the remote_iov elements points to an invalid memory region in the remote process. No further reads/writes will be attempted beyond that point. Keep this in mind when attempting to read data of unknown length (such as C strings that are null-terminated) from a remote process, by avoiding spanning memory pages (typically 4KiB) in a single remote iovec element. (Instead, split the remote read into two remote_iov elements and have them merge back into a single write local_iov entry. The first read entry goes up to the page boundary, while the second starts on the next page boundary.) Permission to read from or write to another process is governed by a ptrace access mode PTRACE_MODE_ATTACH_REALCREDS check; see ptrace(2). 
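The scatter behavior described above (one remote range filling several local buffers in array order) can be sketched as follows. The names shared_buf and read_child_buffer are invented for this example; the sketch relies on fork(2) giving the child the same address-space layout as the parent, and assumes the caller passes the ptrace access check for its own child.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

static char shared_buf[16] = "parent-data";

/* Hypothetical helper: read shared_buf from a forked child, splitting the
 * single remote range across two local buffers. Returns the number of
 * bytes transferred, or -1 on error. */
ssize_t read_child_buffer(char *head, size_t head_len,
                          char *tail, size_t tail_len)
{
    assert(head != NULL && tail != NULL);
    assert(head_len + tail_len <= sizeof shared_buf);

    int ready[2];
    if (pipe(ready) == -1)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: modify its private copy, report readiness, then wait. */
        close(ready[0]);
        strcpy(shared_buf, "child-data");
        if (write(ready[1], "x", 1) != 1)
            _exit(1);
        for (;;)
            pause();
    }

    close(ready[1]);
    ssize_t n = -1;
    char c;
    if (read(ready[0], &c, 1) == 1) {
        struct iovec local[2] = {
            { .iov_base = head, .iov_len = head_len },
            { .iov_base = tail, .iov_len = tail_len },
        };
        /* The same virtual address is valid in the child after fork(). */
        struct iovec remote[1] = {
            { .iov_base = shared_buf, .iov_len = head_len + tail_len },
        };
        n = process_vm_readv(pid, local, 2, remote, 1, 0);
    }

    close(ready[0]);
    kill(pid, SIGKILL);
    waitpid(pid, NULL, 0);
    return n;
}
```

A return value smaller than head_len + tail_len would indicate a partial read, which callers reading data of unknown length need to handle.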
http://man7.org/linux/man-pages/man2/process_vm_writev.2.html 12 SYSTEM CALL: process_vm_readv(2) - Linux manual page FUNCTIONALITY: process_vm_readv, process_vm_writev - transfer data between process address spaces SYNOPSIS: #include ssize_t process_vm_readv(pid_t pid, const struct iovec *local_iov, unsigned long liovcnt, const struct iovec *remote_iov, unsigned long riovcnt, unsigned long flags); ssize_t process_vm_writev(pid_t pid, const struct iovec *local_iov, unsigned long liovcnt, const struct iovec *remote_iov, unsigned long riovcnt, unsigned long flags); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): process_vm_readv(), process_vm_writev(): _GNU_SOURCE DESCRIPTION These system calls transfer data between the address space of the calling process ("the local process") and the process identified by pid ("the remote process"). The data moves directly between the address spaces of the two processes, without passing through kernel space. The process_vm_readv() system call transfers data from the remote process to the local process. The data to be transferred is identified by remote_iov and riovcnt: remote_iov is a pointer to an array describing address ranges in the process pid, and riovcnt specifies the number of elements in remote_iov. The data is transferred to the locations specified by local_iov and liovcnt: local_iov is a pointer to an array describing address ranges in the calling process, and liovcnt specifies the number of elements in local_iov. The process_vm_writev() system call is the converse of process_vm_readv()—it transfers data from the local process to the remote process. Other than the direction of the transfer, the arguments liovcnt, local_iov, riovcnt, and remote_iov have the same meaning as for process_vm_readv(). 
The local_iov and remote_iov arguments point to an array of iovec structures, defined in as: struct iovec { void *iov_base; /* Starting address */ size_t iov_len; /* Number of bytes to transfer */ }; Buffers are processed in array order. This means that process_vm_readv() completely fills local_iov[0] before proceeding to local_iov[1], and so on. Likewise, remote_iov[0] is completely read before proceeding to remote_iov[1], and so on. Similarly, process_vm_writev() writes out the entire contents of local_iov[0] before proceeding to local_iov[1], and it completely fills remote_iov[0] before proceeding to remote_iov[1]. The lengths of remote_iov[i].iov_len and local_iov[i].iov_len do not have to be the same. Thus, it is possible to split a single local buffer into multiple remote buffers, or vice versa. The flags argument is currently unused and must be set to 0. The values specified in the liovcnt and riovcnt arguments must be less than or equal to IOV_MAX (defined in or accessible via the call sysconf(_SC_IOV_MAX)). The count arguments and local_iov are checked before doing any transfers. If the counts are too big, or local_iov is invalid, or the addresses refer to regions that are inaccessible to the local process, none of the vectors will be processed and an error will be returned immediately. Note, however, that these system calls do not check the memory regions in the remote process until just before doing the read/write. Consequently, a partial read/write (see RETURN VALUE) may result if one of the remote_iov elements points to an invalid memory region in the remote process. No further reads/writes will be attempted beyond that point. Keep this in mind when attempting to read data of unknown length (such as C strings that are null-terminated) from a remote process, by avoiding spanning memory pages (typically 4KiB) in a single remote iovec element. 
(Instead, split the remote read into two remote_iov elements and have them merge back into a single write local_iov entry. The first read entry goes up to the page boundary, while the second starts on the next page boundary.) Permission to read from or write to another process is governed by a ptrace access mode PTRACE_MODE_ATTACH_REALCREDS check; see ptrace(2). http://man7.org/linux/man-pages/man2/kcmp.2.html 11 SYSTEM CALL: kcmp(2) - Linux manual page FUNCTIONALITY: kcmp - compare two processes to determine if they share a kernel resource SYNOPSIS: #include int kcmp(pid_t pid1, pid_t pid2, int type, unsigned long idx1, unsigned long idx2); Note: There is no glibc wrapper for this system call; see NOTES. DESCRIPTION The kcmp() system call can be used to check whether the two processes identified by pid1 and pid2 share a kernel resource such as virtual memory, file descriptors, and so on. Permission to employ kcmp() is governed by ptrace access mode PTRACE_MODE_READ_REALCREDS checks against both pid1 and pid2; see ptrace(2). The type argument specifies which resource is to be compared in the two processes. It has one of the following values: KCMP_FILE Check whether a file descriptor idx1 in the process pid1 refers to the same open file description (see open(2)) as file descriptor idx2 in the process pid2. KCMP_FILES Check whether the processes share the same set of open file descriptors. The arguments idx1 and idx2 are ignored. KCMP_FS Check whether the processes share the same filesystem information (i.e., file mode creation mask, working directory, and filesystem root). The arguments idx1 and idx2 are ignored. KCMP_IO Check whether the processes share I/O context. The arguments idx1 and idx2 are ignored. KCMP_SIGHAND Check whether the processes share the same table of signal dispositions. The arguments idx1 and idx2 are ignored. KCMP_SYSVSEM Check whether the processes share the same list of System V semaphore undo operations.
The arguments idx1 and idx2 are ignored. KCMP_VM Check whether the processes share the same address space. The arguments idx1 and idx2 are ignored. Note that kcmp() is not protected against false positives which may occur if the processes are currently running. One should stop the processes by sending SIGSTOP (see signal(7)) prior to inspection with this system call to obtain meaningful results. http://man7.org/linux/man-pages/man2/unshare.2.html 12 SYSTEM CALL: unshare(2) - Linux manual page FUNCTIONALITY: unshare - disassociate parts of the process execution context SYNOPSIS: #define _GNU_SOURCE #include int unshare(int flags); DESCRIPTION unshare() allows a process (or thread) to disassociate parts of its execution context that are currently being shared with other processes (or threads). Part of the execution context, such as the mount namespace, is shared implicitly when a new process is created using fork(2) or vfork(2), while other parts, such as virtual memory, may be shared by explicit request when creating a process or thread using clone(2). The main use of unshare() is to allow a process to control its shared execution context without creating a new process. The flags argument is a bit mask that specifies which parts of the execution context should be unshared. This argument is specified by ORing together zero or more of the following constants: CLONE_FILES Reverse the effect of the clone(2) CLONE_FILES flag. Unshare the file descriptor table, so that the calling process no longer shares its file descriptors with any other process. CLONE_FS Reverse the effect of the clone(2) CLONE_FS flag. Unshare filesystem attributes, so that the calling process no longer shares its root directory (chroot(2)), current directory (chdir(2)), or umask (umask(2)) attributes with any other process. CLONE_NEWCGROUP (since Linux 4.6) This flag has the same effect as the clone(2) CLONE_NEWCGROUP flag. Unshare the cgroup namespace.
Use of CLONE_NEWCGROUP requires the CAP_SYS_ADMIN capability. CLONE_NEWIPC (since Linux 2.6.19) This flag has the same effect as the clone(2) CLONE_NEWIPC flag. Unshare the IPC namespace, so that the calling process has a private copy of the IPC namespace which is not shared with any other process. Specifying this flag automatically implies CLONE_SYSVSEM as well. Use of CLONE_NEWIPC requires the CAP_SYS_ADMIN capability. CLONE_NEWNET (since Linux 2.6.24) This flag has the same effect as the clone(2) CLONE_NEWNET flag. Unshare the network namespace, so that the calling process is moved into a new network namespace which is not shared with any previously existing process. Use of CLONE_NEWNET requires the CAP_SYS_ADMIN capability. CLONE_NEWNS This flag has the same effect as the clone(2) CLONE_NEWNS flag. Unshare the mount namespace, so that the calling process has a private copy of its namespace which is not shared with any other process. Specifying this flag automatically implies CLONE_FS as well. Use of CLONE_NEWNS requires the CAP_SYS_ADMIN capability. For further information, see mount_namespaces(7). CLONE_NEWPID (since Linux 3.8) This flag has the same effect as the clone(2) CLONE_NEWPID flag. Unshare the PID namespace, so that the calling process has a new PID namespace for its children which is not shared with any previously existing process. The calling process is not moved into the new namespace. The first child created by the calling process will have the process ID 1 and will assume the role of init(1) in the new namespace. CLONE_NEWPID automatically implies CLONE_THREAD as well. Use of CLONE_NEWPID requires the CAP_SYS_ADMIN capability. For further information, see pid_namespaces(7). CLONE_NEWUSER (since Linux 3.8) This flag has the same effect as the clone(2) CLONE_NEWUSER flag. Unshare the user namespace, so that the calling process is moved into a new user namespace which is not shared with any previously existing process. 
As with the child process created by clone(2) with the CLONE_NEWUSER flag, the caller obtains a full set of capabilities in the new namespace. CLONE_NEWUSER requires that the calling process is not threaded; specifying CLONE_NEWUSER automatically implies CLONE_THREAD. Since Linux 3.9, CLONE_NEWUSER also automatically implies CLONE_FS. CLONE_NEWUSER requires that the user ID and group ID of the calling process are mapped to user IDs and group IDs in the user namespace of the calling process at the time of the call. For further information on user namespaces, see user_namespaces(7). CLONE_NEWUTS (since Linux 2.6.19) This flag has the same effect as the clone(2) CLONE_NEWUTS flag. Unshare the UTS namespace, so that the calling process has a private copy of the UTS namespace which is not shared with any other process. Use of CLONE_NEWUTS requires the CAP_SYS_ADMIN capability. CLONE_SYSVSEM (since Linux 2.6.26) This flag reverses the effect of the clone(2) CLONE_SYSVSEM flag. Unshare System V semaphore adjustment (semadj) values, so that the calling process has a new empty semadj list that is not shared with any other process. If this is the last process that has a reference to the process's current semadj list, then the adjustments in that list are applied to the corresponding semaphores, as described in semop(2). In addition, CLONE_THREAD, CLONE_SIGHAND, and CLONE_VM can be specified in flags if the caller is single threaded (i.e., it is not sharing its address space with another process or thread). In this case, these flags have no effect. (Note also that specifying CLONE_THREAD automatically implies CLONE_VM, and specifying CLONE_VM automatically implies CLONE_SIGHAND.) If the process is multithreaded, then the use of these flags results in an error. If flags is specified as zero, then unshare() is a no-op; no changes are made to the calling process's execution context.
Linux signals system calls http://man7.org/linux/man-pages/man2/kill.2.html 11 SYSTEM CALL: kill(2) - Linux manual page FUNCTIONALITY: kill - send signal to a process SYNOPSIS: #include #include int kill(pid_t pid, int sig); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): kill(): _POSIX_C_SOURCE DESCRIPTION The kill() system call can be used to send any signal to any process group or process. If pid is positive, then signal sig is sent to the process with the ID specified by pid. If pid equals 0, then sig is sent to every process in the process group of the calling process. If pid equals -1, then sig is sent to every process for which the calling process has permission to send signals, except for process 1 (init), but see below. If pid is less than -1, then sig is sent to every process in the process group whose ID is -pid. If sig is 0, then no signal is sent, but existence and permission checks are still performed; this can be used to check for the existence of a process ID or process group ID that the caller is permitted to signal. For a process to have permission to send a signal it must either be privileged (under Linux: have the CAP_KILL capability), or the real or effective user ID of the sending process must equal the real or saved set-user-ID of the target process. In the case of SIGCONT it suffices when the sending and receiving processes belong to the same session. (Historically, the rules were different; see NOTES.) http://man7.org/linux/man-pages/man2/tkill.2.html 11 SYSTEM CALL: tkill(2) - Linux manual page FUNCTIONALITY: tkill, tgkill - send a signal to a thread SYNOPSIS: int tkill(int tid, int sig); int tgkill(int tgid, int tid, int sig); Note: There are no glibc wrappers for these system calls; see NOTES. DESCRIPTION tgkill() sends the signal sig to the thread with the thread ID tid in the thread group tgid. 
(By contrast, kill(2) can be used to send a signal only to a process (i.e., thread group) as a whole, and the signal will be delivered to an arbitrary thread within that process.) tkill() is an obsolete predecessor to tgkill(). It allows only the target thread ID to be specified, which may result in the wrong thread being signaled if a thread terminates and its thread ID is recycled. Avoid using this system call. These are the raw system call interfaces, meant for internal thread library use. http://man7.org/linux/man-pages/man2/tgkill.2.html 11 SYSTEM CALL: tkill(2) - Linux manual page FUNCTIONALITY: tkill, tgkill - send a signal to a thread SYNOPSIS: int tkill(int tid, int sig); int tgkill(int tgid, int tid, int sig); Note: There are no glibc wrappers for these system calls; see NOTES. DESCRIPTION tgkill() sends the signal sig to the thread with the thread ID tid in the thread group tgid. (By contrast, kill(2) can be used to send a signal only to a process (i.e., thread group) as a whole, and the signal will be delivered to an arbitrary thread within that process.) tkill() is an obsolete predecessor to tgkill(). It allows only the target thread ID to be specified, which may result in the wrong thread being signaled if a thread terminates and its thread ID is recycled. Avoid using this system call. These are the raw system call interfaces, meant for internal thread library use. http://man7.org/linux/man-pages/man2/pause.2.html 9 SYSTEM CALL: pause(2) - Linux manual page FUNCTIONALITY: pause - wait for signal SYNOPSIS: #include int pause(void); DESCRIPTION pause() causes the calling process (or thread) to sleep until a signal is delivered that either terminates the process or causes the invocation of a signal-catching function. 
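Because tgkill() has no glibc wrapper according to the text above, one way to call it is through syscall(2). The following is a small sketch under that assumption; the helper name signal_own_thread and the choice of SIGUSR1 are illustrative only, and the caller is assumed not to have SIGUSR1 blocked.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static volatile sig_atomic_t got_usr1;

static void on_usr1(int signo)
{
    (void)signo;
    got_usr1 = 1;
}

/* Hypothetical helper: direct SIGUSR1 at the calling thread with the raw
 * tgkill system call. Returns 1 if the handler ran, 0 if it did not,
 * -1 on error. */
int signal_own_thread(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) == -1)
        return -1;

    pid_t tgid = getpid();                      /* thread group ID */
    pid_t tid = (pid_t)syscall(SYS_gettid);     /* ID of this thread */

    got_usr1 = 0;
    if (syscall(SYS_tgkill, tgid, tid, SIGUSR1) == -1)
        return -1;

    /* A signal sent to the calling thread is handled before the system
     * call returns to user code, provided the signal is not blocked. */
    return got_usr1 ? 1 : 0;
}
```

Passing both tgid and tid is what protects against signaling a recycled thread ID in some unrelated process, which is the weakness of tkill() noted above.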
http://man7.org/linux/man-pages/man2/rt_sigaction.2.html 12 SYSTEM CALL: sigaction(2) - Linux manual page FUNCTIONALITY: sigaction, rt_sigaction - examine and change a signal action SYNOPSIS: #include int sigaction(int signum, const struct sigaction *act, struct sigaction *oldact); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): sigaction(): _POSIX_C_SOURCE siginfo_t: _POSIX_C_SOURCE >= 199309L DESCRIPTION The sigaction() system call is used to change the action taken by a process on receipt of a specific signal. (See signal(7) for an overview of signals.) signum specifies the signal and can be any valid signal except SIGKILL and SIGSTOP. If act is non-NULL, the new action for signal signum is installed from act. If oldact is non-NULL, the previous action is saved in oldact. The sigaction structure is defined as something like: struct sigaction { void (*sa_handler)(int); void (*sa_sigaction)(int, siginfo_t *, void *); sigset_t sa_mask; int sa_flags; void (*sa_restorer)(void); }; On some architectures a union is involved: do not assign to both sa_handler and sa_sigaction. The sa_restorer field is not intended for application use. (POSIX does not specify a sa_restorer field.) Some further details of purpose of this field can be found in sigreturn(2). sa_handler specifies the action to be associated with signum and may be SIG_DFL for the default action, SIG_IGN to ignore this signal, or a pointer to a signal handling function. This function receives the signal number as its only argument. If SA_SIGINFO is specified in sa_flags, then sa_sigaction (instead of sa_handler) specifies the signal-handling function for signum. This function receives the signal number as its first argument, a pointer to a siginfo_t as its second argument and a pointer to a ucontext_t (cast to void *) as its third argument. (Commonly, the handler function doesn't make any use of the third argument. See getcontext(3) for further information about ucontext_t.) 
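A three-argument handler installed with SA_SIGINFO, as described above, might be set up like this. It is only a sketch: install_and_trigger, on_signal and the use of SIGUSR2 are names chosen for the example, and the process is assumed to be single-threaded with SIGUSR2 unblocked so that kill(2) on itself is handled before returning.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

static volatile sig_atomic_t seen_signo;
static volatile sig_atomic_t seen_pid;

/* Three-argument handler: signal number, siginfo_t, and ucontext (unused). */
static void on_signal(int signo, siginfo_t *info, void *ucontext)
{
    (void)ucontext;
    seen_signo = signo;
    seen_pid = (sig_atomic_t)info->si_pid;
}

/* Hypothetical helper: install the handler, send SIGUSR2 to ourselves,
 * and restore the previous action. Returns 1 if the handler saw the
 * expected signal number and sender PID, 0 if not, -1 on error. */
int install_and_trigger(void)
{
    struct sigaction act, oldact;

    memset(&act, 0, sizeof act);
    act.sa_sigaction = on_signal;   /* set sa_sigaction, not sa_handler */
    act.sa_flags = SA_SIGINFO;
    sigemptyset(&act.sa_mask);

    if (sigaction(SIGUSR2, &act, &oldact) == -1)
        return -1;

    seen_signo = 0;
    seen_pid = 0;
    if (kill(getpid(), SIGUSR2) == -1)
        return -1;

    int ok = (seen_signo == SIGUSR2 &&
              seen_pid == (sig_atomic_t)getpid());

    /* Put back whatever action was installed before. */
    if (sigaction(SIGUSR2, &oldact, NULL) == -1)
        return -1;
    return ok;
}
```

Only one of sa_handler and sa_sigaction is assigned here, in line with the warning above that the two may share storage on some architectures.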
sa_mask specifies a mask of signals which should be blocked (i.e., added to the signal mask of the thread in which the signal handler is invoked) during execution of the signal handler. In addition, the signal which triggered the handler will be blocked, unless the SA_NODEFER flag is used. sa_flags specifies a set of flags which modify the behavior of the signal. It is formed by the bitwise OR of zero or more of the following: SA_NOCLDSTOP If signum is SIGCHLD, do not receive notification when child processes stop (i.e., when they receive one of SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU) or resume (i.e., they receive SIGCONT) (see wait(2)). This flag is meaningful only when establishing a handler for SIGCHLD. SA_NOCLDWAIT (since Linux 2.6) If signum is SIGCHLD, do not transform children into zombies when they terminate. See also waitpid(2). This flag is meaningful only when establishing a handler for SIGCHLD, or when setting that signal's disposition to SIG_DFL. If the SA_NOCLDWAIT flag is set when establishing a handler for SIGCHLD, POSIX.1 leaves it unspecified whether a SIGCHLD signal is generated when a child process terminates. On Linux, a SIGCHLD signal is generated in this case; on some other implementations, it is not. SA_NODEFER Do not prevent the signal from being received from within its own signal handler. This flag is meaningful only when establishing a signal handler. SA_NOMASK is an obsolete, nonstandard synonym for this flag. SA_ONSTACK Call the signal handler on an alternate signal stack provided by sigaltstack(2). If an alternate stack is not available, the default stack will be used. This flag is meaningful only when establishing a signal handler. SA_RESETHAND Restore the signal action to the default upon entry to the signal handler. This flag is meaningful only when establishing a signal handler. SA_ONESHOT is an obsolete, nonstandard synonym for this flag. 
SA_RESTART Provide behavior compatible with BSD signal semantics by making certain system calls restartable across signals. This flag is meaningful only when establishing a signal handler. See signal(7) for a discussion of system call restarting. SA_RESTORER Not intended for application use. This flag is used by C libraries to indicate that the sa_restorer field contains the address of a "signal trampoline". See sigreturn(2) for more details. SA_SIGINFO (since Linux 2.2) The signal handler takes three arguments, not one. In this case, sa_sigaction should be set instead of sa_handler. This flag is meaningful only when establishing a signal handler. The siginfo_t argument to sa_sigaction is a struct with the following fields: siginfo_t { int si_signo; /* Signal number */ int si_errno; /* An errno value */ int si_code; /* Signal code */ int si_trapno; /* Trap number that caused hardware-generated signal (unused on most architectures) */ pid_t si_pid; /* Sending process ID */ uid_t si_uid; /* Real user ID of sending process */ int si_status; /* Exit value or signal */ clock_t si_utime; /* User time consumed */ clock_t si_stime; /* System time consumed */ sigval_t si_value; /* Signal value */ int si_int; /* POSIX.1b signal */ void *si_ptr; /* POSIX.1b signal */ int si_overrun; /* Timer overrun count; POSIX.1b timers */ int si_timerid; /* Timer ID; POSIX.1b timers */ void *si_addr; /* Memory location which caused fault */ long si_band; /* Band event (was int in glibc 2.3.2 and earlier) */ int si_fd; /* File descriptor */ short si_addr_lsb; /* Least significant bit of address (since Linux 2.6.32) */ void *si_lower; /* Lower bound when address violation occurred (since Linux 3.19) */ void *si_upper; /* Upper bound when address violation occurred (since Linux 3.19) */ int si_pkey; /* Protection key on PTE that caused fault (since Linux 4.6) */ void *si_call_addr; /* Address of system call instruction (since Linux 3.5) */ int si_syscall; /* Number of attempted system call 
(since Linux 3.5) */ unsigned int si_arch; /* Architecture of attempted system call (since Linux 3.5) */ } si_signo, si_errno and si_code are defined for all signals. (si_errno is generally unused on Linux.) The rest of the struct may be a union, so that one should read only the fields that are meaningful for the given signal: * Signals sent with kill(2) and sigqueue(3) fill in si_pid and si_uid. In addition, signals sent with sigqueue(3) fill in si_int and si_ptr with the values specified by the sender of the signal; see sigqueue(3) for more details. * Signals sent by POSIX.1b timers (since Linux 2.6) fill in si_overrun and si_timerid. The si_timerid field is an internal ID used by the kernel to identify the timer; it is not the same as the timer ID returned by timer_create(2). The si_overrun field is the timer overrun count; this is the same information as is obtained by a call to timer_getoverrun(2). These fields are nonstandard Linux extensions. * Signals sent for message queue notification (see the description of SIGEV_SIGNAL in mq_notify(3)) fill in si_int/si_ptr, with the sigev_value supplied to mq_notify(3); si_pid, with the process ID of the message sender; and si_uid, with the real user ID of the message sender. * SIGCHLD fills in si_pid, si_uid, si_status, si_utime, and si_stime, providing information about the child. The si_pid field is the process ID of the child; si_uid is the child's real user ID. The si_status field contains the exit status of the child (if si_code is CLD_EXITED), or the signal number that caused the process to change state. The si_utime and si_stime contain the user and system CPU time used by the child process; these fields do not include the times used by waited-for children (unlike getrusage(2) and times(2)). In kernels up to 2.6, and since 2.6.27, these fields report CPU time in units of sysconf(_SC_CLK_TCK). 
In 2.6 kernels before 2.6.27, a bug meant that these fields reported time in units of the (configurable) system jiffy (see time(7)). * SIGILL, SIGFPE, SIGSEGV, SIGBUS, and SIGTRAP fill in si_addr with the address of the fault. On some architectures, these signals also fill in the si_trapno field. Some suberrors of SIGBUS, in particular BUS_MCEERR_AO and BUS_MCEERR_AR, also fill in si_addr_lsb. This field indicates the least significant bit of the reported address and therefore the extent of the corruption. For example, if a full page was corrupted, si_addr_lsb contains log2(sysconf(_SC_PAGESIZE)). When SIGTRAP is delivered in response to a ptrace(2) event (PTRACE_EVENT_foo), si_addr is not populated, but si_pid and si_uid are populated with the respective process ID and user ID responsible for delivering the trap. In the case of seccomp(2), the tracee will be shown as delivering the event. BUS_MCEERR_* and si_addr_lsb are Linux-specific extensions. The SEGV_BNDERR suberror of SIGSEGV populates si_lower and si_upper. The SEGV_PKUERR suberror of SIGSEGV populates si_pkey. * SIGIO/SIGPOLL (the two names are synonyms on Linux) fills in si_band and si_fd. The si_band event is a bit mask containing the same values as are filled in the revents field by poll(2). The si_fd field indicates the file descriptor for which the I/O event occurred; for further details, see the description of F_SETSIG in fcntl(2). * SIGSYS, generated (since Linux 3.5) when a seccomp filter returns SECCOMP_RET_TRAP, fills in si_call_addr, si_syscall, si_arch, si_errno, and other fields as described in seccomp(2). si_code is a value (not a bit mask) indicating why this signal was sent. For a ptrace(2) event, si_code will contain SIGTRAP and have the ptrace event in the high byte: (SIGTRAP | PTRACE_EVENT_foo << 8). For a regular signal, the following list shows the values which can be placed in si_code for any signal, along with reason that the signal was generated. SI_USER kill(2). 
SI_KERNEL Sent by the kernel. SI_QUEUE sigqueue(3). SI_TIMER POSIX timer expired. SI_MESGQ (since Linux 2.6.6) POSIX message queue state changed; see mq_notify(3). SI_ASYNCIO AIO completed. SI_SIGIO Queued SIGIO (only in kernels up to Linux 2.2; from Linux 2.4 onward SIGIO/SIGPOLL fills in si_code as described below). SI_TKILL (since Linux 2.4.19) tkill(2) or tgkill(2). The following values can be placed in si_code for a SIGILL signal: ILL_ILLOPC Illegal opcode. ILL_ILLOPN Illegal operand. ILL_ILLADR Illegal addressing mode. ILL_ILLTRP Illegal trap. ILL_PRVOPC Privileged opcode. ILL_PRVREG Privileged register. ILL_COPROC Coprocessor error. ILL_BADSTK Internal stack error. The following values can be placed in si_code for a SIGFPE signal: FPE_INTDIV Integer divide by zero. FPE_INTOVF Integer overflow. FPE_FLTDIV Floating-point divide by zero. FPE_FLTOVF Floating-point overflow. FPE_FLTUND Floating-point underflow. FPE_FLTRES Floating-point inexact result. FPE_FLTINV Floating-point invalid operation. FPE_FLTSUB Subscript out of range. The following values can be placed in si_code for a SIGSEGV signal: SEGV_MAPERR Address not mapped to object. SEGV_ACCERR Invalid permissions for mapped object. SEGV_BNDERR (since Linux 3.19) Failed address bound checks. SEGV_PKUERR (since Linux 4.6) Protection key fault. The following values can be placed in si_code for a SIGBUS signal: BUS_ADRALN Invalid address alignment. BUS_ADRERR Nonexistent physical address. BUS_OBJERR Object-specific hardware error. BUS_MCEERR_AR (since Linux 2.6.32) Hardware memory error consumed on a machine check; action required. BUS_MCEERR_AO (since Linux 2.6.32) Hardware memory error detected in process but not consumed; action optional. The following values can be placed in si_code for a SIGTRAP signal: TRAP_BRKPT Process breakpoint. TRAP_TRACE Process trace trap. TRAP_BRANCH (since Linux 2.4) Process taken branch trap. TRAP_HWBKPT (since Linux 2.4) Hardware breakpoint/watchpoint. 
The following values can be placed in si_code for a SIGCHLD signal: CLD_EXITED Child has exited. CLD_KILLED Child was killed. CLD_DUMPED Child terminated abnormally. CLD_TRAPPED Traced child has trapped. CLD_STOPPED Child has stopped. CLD_CONTINUED (since Linux 2.6.9) Stopped child has continued. The following values can be placed in si_code for a SIGIO/SIGPOLL signal: POLL_IN Data input available. POLL_OUT Output buffers available. POLL_MSG Input message available. POLL_ERR I/O error. POLL_PRI High priority input available. POLL_HUP Device disconnected. The following value can be placed in si_code for a SIGSYS signal: SYS_SECCOMP (since Linux 3.5) Triggered by a seccomp(2) filter rule. http://man7.org/linux/man-pages/man2/rt_sigprocmask.2.html 10 SYSTEM CALL: sigprocmask(2) - Linux manual page FUNCTIONALITY: sigprocmask, rt_sigprocmask - examine and change blocked signals SYNOPSIS: #include <signal.h> int sigprocmask(int how, const sigset_t *set, sigset_t *oldset); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): sigprocmask(): _POSIX_C_SOURCE DESCRIPTION sigprocmask() is used to fetch and/or change the signal mask of the calling thread. The signal mask is the set of signals whose delivery is currently blocked for the caller (see also signal(7) for more details). The behavior of the call is dependent on the value of how, as follows. SIG_BLOCK The set of blocked signals is the union of the current set and the set argument. SIG_UNBLOCK The signals in set are removed from the current set of blocked signals. It is permissible to attempt to unblock a signal which is not blocked. SIG_SETMASK The set of blocked signals is set to the argument set. If oldset is non-NULL, the previous value of the signal mask is stored in oldset. If set is NULL, then the signal mask is unchanged (i.e., how is ignored), but the current value of the signal mask is nevertheless returned in oldset (if it is not NULL).
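The three how values and the NULL-set query described above can be illustrated with a minimal sketch. The helper name block_sigint_briefly is hypothetical, and the choice of SIGINT is only an assumption made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical helper: block SIGINT, check that it is in the mask,
 * then restore the previous mask.
 * Returns 1 if SIGINT was blocked in between, 0 if not, -1 on error. */
int block_sigint_briefly(void)
{
    sigset_t set, oldset, current;
    int blocked;

    sigemptyset(&set);
    sigaddset(&set, SIGINT);

    /* SIG_BLOCK: new mask = current mask UNION set; old mask goes to oldset. */
    if (sigprocmask(SIG_BLOCK, &set, &oldset) == -1)
        return -1;

    /* set == NULL: how is ignored and the mask is only fetched. */
    if (sigprocmask(SIG_BLOCK, NULL, &current) == -1)
        return -1;
    blocked = sigismember(&current, SIGINT);

    /* SIG_SETMASK: put back exactly the mask that was in effect before. */
    if (sigprocmask(SIG_SETMASK, &oldset, NULL) == -1)
        return -1;

    return blocked;
}
```

Saving the old mask and restoring it with SIG_SETMASK, rather than calling SIG_UNBLOCK, keeps SIGINT blocked if the caller had already blocked it before the helper ran.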
A set of functions for modifying and inspecting variables of type sigset_t ("signal sets") is described in sigsetops(3). The use of sigprocmask() is unspecified in a multithreaded process; see pthread_sigmask(3). http://man7.org/linux/man-pages/man2/rt_sigpending.2.html 11 SYSTEM CALL: sigpending(2) - Linux manual page FUNCTIONALITY: sigpending, rt_sigpending - examine pending signals SYNOPSIS: #include <signal.h> int sigpending(sigset_t *set); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): sigpending(): _POSIX_C_SOURCE DESCRIPTION sigpending() returns the set of signals that are pending for delivery to the calling thread (i.e., the signals which have been raised while blocked). The mask of pending signals is returned in set. http://man7.org/linux/man-pages/man2/rt_sigqueueinfo.2.html 11 SYSTEM CALL: rt_sigqueueinfo(2) - Linux manual page FUNCTIONALITY: rt_sigqueueinfo, rt_tgsigqueueinfo - queue a signal and data SYNOPSIS: int rt_sigqueueinfo(pid_t tgid, int sig, siginfo_t *uinfo); int rt_tgsigqueueinfo(pid_t tgid, pid_t tid, int sig, siginfo_t *uinfo); Note: There are no glibc wrappers for these system calls; see NOTES. DESCRIPTION The rt_sigqueueinfo() and rt_tgsigqueueinfo() system calls are the low-level interfaces used to send a signal plus data to a process or thread. The receiver of the signal can obtain the accompanying data by establishing a signal handler with the sigaction(2) SA_SIGINFO flag. These system calls are not intended for direct application use; they are provided to allow the implementation of sigqueue(3) and pthread_sigqueue(3). The rt_sigqueueinfo() system call sends the signal sig to the thread group with the ID tgid. (The term "thread group" is synonymous with "process", and tgid corresponds to the traditional UNIX process ID.) The signal will be delivered to an arbitrary member of the thread group (i.e., one of the threads that is not currently blocking the signal).
The uinfo argument specifies the data to accompany the signal. This argument is a pointer to a structure of type siginfo_t, described in sigaction(2) (and defined by including ). The caller should set the following fields in this structure: si_code This must be one of the SI_* codes in the Linux kernel source file include/asm-generic/siginfo.h, with the restriction that the code must be negative (i.e., cannot be SI_USER, which is used by the kernel to indicate a signal sent by kill(2)) and cannot (since Linux 2.6.39) be SI_TKILL (which is used by the kernel to indicate a signal sent using tgkill(2)). si_pid This should be set to a process ID, typically the process ID of the sender. si_uid This should be set to a user ID, typically the real user ID of the sender. si_value This field contains the user data to accompany the signal. For more information, see the description of the last (union sigval) argument of sigqueue(3). Internally, the kernel sets the si_signo field to the value specified in sig, so that the receiver of the signal can also obtain the signal number via that field. The rt_tgsigqueueinfo() system call is like rt_sigqueueinfo(), but sends the signal and data to the single thread specified by the combination of tgid, a thread group ID, and tid, a thread in that thread group. http://man7.org/linux/man-pages/man2/rt_tgsigqueueinfo.2.html 11 SYSTEM CALL: rt_sigqueueinfo(2) - Linux manual page FUNCTIONALITY: rt_sigqueueinfo, rt_tgsigqueueinfo - queue a signal and data SYNOPSIS: int rt_sigqueueinfo(pid_t tgid, int sig, siginfo_t *uinfo); int rt_tgsigqueueinfo(pid_t tgid, pid_t tid, int sig, siginfo_t *uinfo); Note: There are no glibc wrappers for these system calls; see NOTES. DESCRIPTION The rt_sigqueueinfo() and rt_tgsigqueueinfo() system calls are the low-level interfaces used to send a signal plus data to a process or thread. 
The receiver of the signal can obtain the accompanying data by establishing a signal handler with the sigaction(2) SA_SIGINFO flag. These system calls are not intended for direct application use; they are provided to allow the implementation of sigqueue(3) and pthread_sigqueue(3). The rt_sigqueueinfo() system call sends the signal sig to the thread group with the ID tgid. (The term "thread group" is synonymous with "process", and tgid corresponds to the traditional UNIX process ID.) The signal will be delivered to an arbitrary member of the thread group (i.e., one of the threads that is not currently blocking the signal). The uinfo argument specifies the data to accompany the signal. This argument is a pointer to a structure of type siginfo_t, described in sigaction(2) (and defined by including ). The caller should set the following fields in this structure: si_code This must be one of the SI_* codes in the Linux kernel source file include/asm-generic/siginfo.h, with the restriction that the code must be negative (i.e., cannot be SI_USER, which is used by the kernel to indicate a signal sent by kill(2)) and cannot (since Linux 2.6.39) be SI_TKILL (which is used by the kernel to indicate a signal sent using tgkill(2)). si_pid This should be set to a process ID, typically the process ID of the sender. si_uid This should be set to a user ID, typically the real user ID of the sender. si_value This field contains the user data to accompany the signal. For more information, see the description of the last (union sigval) argument of sigqueue(3). Internally, the kernel sets the si_signo field to the value specified in sig, so that the receiver of the signal can also obtain the signal number via that field. The rt_tgsigqueueinfo() system call is like rt_sigqueueinfo(), but sends the signal and data to the single thread specified by the combination of tgid, a thread group ID, and tid, a thread in that thread group.
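Since these system calls have no glibc wrappers, applications normally go through sigqueue(3). The following is a minimal sketch of that route, assuming a single-threaded process that queues SIGUSR1 to itself. The names queue_to_self, on_usr1, got_value, and got_code are hypothetical and chosen only for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t got_value = -1;  /* payload seen by the handler */
static volatile sig_atomic_t got_code = 0;    /* si_code seen by the handler */

/* SA_SIGINFO handler: the accompanying data arrives in info->si_value. */
static void on_usr1(int sig, siginfo_t *info, void *ucontext)
{
    (void)sig;
    (void)ucontext;
    got_value = info->si_value.sival_int;
    got_code = info->si_code;
}

/* Queue SIGUSR1 to the calling process with an integer payload and
 * return the payload observed by the handler, or -1 on error. */
int queue_to_self(int payload)
{
    struct sigaction sa;
    sigset_t block, old, wait_mask;
    union sigval val;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_usr1;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) == -1)
        return -1;

    /* Block SIGUSR1 so that delivery happens only inside sigsuspend(). */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &old) == -1)
        return -1;

    got_value = -1;
    got_code = 0;
    val.sival_int = payload;
    if (sigqueue(getpid(), SIGUSR1, val) == -1) {
        sigprocmask(SIG_SETMASK, &old, NULL);
        return -1;
    }

    /* Wait with SIGUSR1 unblocked until the handler has run. */
    wait_mask = old;
    sigdelset(&wait_mask, SIGUSR1);
    while (got_code == 0)
        sigsuspend(&wait_mask);

    sigprocmask(SIG_SETMASK, &old, NULL);
    return got_value;
}
```

After a call such as queue_to_self(42), got_code holds SI_QUEUE, which is how a receiver can tell a sigqueue(3) signal from one sent by kill(2). A payload of -1 cannot be distinguished from an error in this sketch.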
http://man7.org/linux/man-pages/man2/rt_sigtimedwait.2.html 10 SYSTEM CALL: sigwaitinfo(2) - Linux manual page FUNCTIONALITY: sigwaitinfo, sigtimedwait, rt_sigtimedwait - synchronously wait for queued signals SYNOPSIS: #include <signal.h> int sigwaitinfo(const sigset_t *set, siginfo_t *info); int sigtimedwait(const sigset_t *set, siginfo_t *info, const struct timespec *timeout); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): sigwaitinfo(), sigtimedwait(): _POSIX_C_SOURCE >= 199309L DESCRIPTION sigwaitinfo() suspends execution of the calling thread until one of the signals in set is pending. (If one of the signals in set is already pending for the calling thread, sigwaitinfo() will return immediately.) sigwaitinfo() removes the signal from the set of pending signals and returns the signal number as its function result. If the info argument is not NULL, then the buffer that it points to is used to return a structure of type siginfo_t (see sigaction(2)) containing information about the signal. If multiple signals in set are pending for the caller, the signal that is retrieved by sigwaitinfo() is determined according to the usual ordering rules; see signal(7) for further details. sigtimedwait() operates in exactly the same way as sigwaitinfo() except that it has an additional argument, timeout, which specifies the interval for which the thread is suspended waiting for a signal. (This interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the interval may overrun by a small amount.) This argument is of the following type: struct timespec { long tv_sec; /* seconds */ long tv_nsec; /* nanoseconds */ }; If both fields of this structure are specified as 0, a poll is performed: sigtimedwait() returns immediately, either with information about a signal that was pending for the caller, or with an error if none of the signals in set was pending.
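The zero-timeout poll described above can be sketched as follows. The helper names poll_pending and demo_poll are hypothetical. Treating EAGAIN as "nothing pending" relies on the ERRORS section of the manual page, which is not reproduced in this excerpt.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <time.h>

/* Poll for a pending signal from set without sleeping.
 * Returns the signal number, 0 if none was pending, or -1 on other errors. */
int poll_pending(const sigset_t *set, siginfo_t *info)
{
    const struct timespec zero = { 0, 0 };  /* both fields 0: poll */
    int sig;

    do {
        sig = sigtimedwait(set, info, &zero);
    } while (sig == -1 && errno == EINTR);

    if (sig == -1)
        return errno == EAGAIN ? 0 : -1;
    return sig;
}

/* Block SIGUSR2, poll (nothing pending), raise it, poll again.
 * Returns 1 if both polls behaved as described, 0 otherwise. */
int demo_poll(void)
{
    sigset_t set, old;
    siginfo_t info;
    int before, after;

    sigemptyset(&set);
    sigaddset(&set, SIGUSR2);
    if (sigprocmask(SIG_BLOCK, &set, &old) == -1)
        return 0;

    before = poll_pending(&set, NULL);
    raise(SIGUSR2);                 /* blocked, so it becomes pending */
    after = poll_pending(&set, &info);

    sigprocmask(SIG_SETMASK, &old, NULL);
    return before == 0 && after == SIGUSR2 && info.si_signo == SIGUSR2;
}
```

The signal must be blocked before it is raised; otherwise it would be handled according to its disposition instead of staying pending for sigtimedwait() to collect.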
http://man7.org/linux/man-pages/man2/rt_sigsuspend.2.html 10 SYSTEM CALL: sigsuspend(2) - Linux manual page FUNCTIONALITY: sigsuspend, rt_sigsuspend - wait for a signal SYNOPSIS: #include <signal.h> int sigsuspend(const sigset_t *mask); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): sigsuspend(): _POSIX_C_SOURCE DESCRIPTION sigsuspend() temporarily replaces the signal mask of the calling process with the mask given by mask and then suspends the process until delivery of a signal whose action is to invoke a signal handler or to terminate a process. If the signal terminates the process, then sigsuspend() does not return. If the signal is caught, then sigsuspend() returns after the signal handler returns, and the signal mask is restored to the state before the call to sigsuspend(). It is not possible to block SIGKILL or SIGSTOP; specifying these signals in mask has no effect on the process's signal mask. http://man7.org/linux/man-pages/man2/rt_sigreturn.2.html 9 SYSTEM CALL: sigreturn(2) - Linux manual page FUNCTIONALITY: sigreturn, rt_sigreturn - return from signal handler and cleanup stack frame SYNOPSIS: int sigreturn(...); DESCRIPTION If the Linux kernel determines that an unblocked signal is pending for a process, then, at the next transition back to user mode in that process (e.g., upon return from a system call or when the process is rescheduled onto the CPU), it saves various pieces of process context (processor status word, registers, signal mask, and signal stack settings) into the user-space stack. The kernel also arranges that, during the transition back to user mode, the signal handler is called, and that, upon return from the handler, control passes to a piece of user-space code commonly called the "signal trampoline". The signal trampoline code in turn calls sigreturn().
This sigreturn() call undoes everything that was done in order to invoke the signal handler: changing the process's signal mask and switching signal stacks (see sigaltstack(2)). It restores the process's signal mask, switches stacks, and restores the process's context (processor flags and registers, including the stack pointer and instruction pointer), so that the process resumes execution at the point where it was interrupted by the signal. http://man7.org/linux/man-pages/man2/sigaltstack.2.html 12 SYSTEM CALL: sigaltstack(2) - Linux manual page FUNCTIONALITY: sigaltstack - set and/or get signal stack context SYNOPSIS: #include <signal.h> int sigaltstack(const stack_t *ss, stack_t *oss); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): sigaltstack(): _XOPEN_SOURCE >= 500 || /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L || /* Glibc versions <= 2.19: */ _BSD_SOURCE DESCRIPTION sigaltstack() allows a process to define a new alternate signal stack and/or retrieve the state of an existing alternate signal stack. An alternate signal stack is used during the execution of a signal handler if the establishment of that handler (see sigaction(2)) requested it. The normal sequence of events for using an alternate signal stack is the following: 1. Allocate an area of memory to be used for the alternate signal stack. 2. Use sigaltstack() to inform the system of the existence and location of the alternate signal stack. 3. When establishing a signal handler using sigaction(2), inform the system that the signal handler should be executed on the alternate signal stack by specifying the SA_ONSTACK flag. The ss argument is used to specify a new alternate signal stack, while the oss argument is used to retrieve information about the currently established signal stack. If we are interested in performing just one of these tasks, then the other argument can be specified as NULL.
Each of these arguments is a structure of the following type: typedef struct { void *ss_sp; /* Base address of stack */ int ss_flags; /* Flags */ size_t ss_size; /* Number of bytes in stack */ } stack_t; To establish a new alternate signal stack, ss.ss_flags is set to zero, and ss.ss_sp and ss.ss_size specify the starting address and size of the stack. The constant SIGSTKSZ is defined to be large enough to cover the usual size requirements for an alternate signal stack, and the constant MINSIGSTKSZ defines the minimum size required to execute a signal handler. When a signal handler is invoked on the alternate stack, the kernel automatically aligns the address given in ss.ss_sp to a suitable address boundary for the underlying hardware architecture. To disable an existing stack, specify ss.ss_flags as SS_DISABLE. In this case, the remaining fields in ss are ignored. If oss is not NULL, then it is used to return information about the alternate signal stack which was in effect prior to the call to sigaltstack(). The oss.ss_sp and oss.ss_size fields return the starting address and size of that stack. The oss.ss_flags may return either of the following values: SS_ONSTACK The process is currently executing on the alternate signal stack. (Note that it is not possible to change the alternate signal stack if the process is currently executing on it.) SS_DISABLE The alternate signal stack is currently disabled. http://man7.org/linux/man-pages/man2/signalfd.2.html 13 SYSTEM CALL: signalfd(2) - Linux manual page FUNCTIONALITY: signalfd - create a file descriptor for accepting signals SYNOPSIS: #include <sys/signalfd.h> int signalfd(int fd, const sigset_t *mask, int flags); DESCRIPTION signalfd() creates a file descriptor that can be used to accept signals targeted at the caller. This provides an alternative to the use of a signal handler or sigwaitinfo(2), and has the advantage that the file descriptor may be monitored by select(2), poll(2), and epoll(7).
The mask argument specifies the set of signals that the caller wishes to accept via the file descriptor. This argument is a signal set whose contents can be initialized using the macros described in sigsetops(3). Normally, the set of signals to be received via the file descriptor should be blocked using sigprocmask(2), to prevent the signals being handled according to their default dispositions. It is not possible to receive SIGKILL or SIGSTOP signals via a signalfd file descriptor; these signals are silently ignored if specified in mask. If the fd argument is -1, then the call creates a new file descriptor and associates the signal set specified in mask with that file descriptor. If fd is not -1, then it must specify a valid existing signalfd file descriptor, and mask is used to replace the signal set associated with that file descriptor. Starting with Linux 2.6.27, the following values may be bitwise ORed in flags to change the behavior of signalfd(): SFD_NONBLOCK Set the O_NONBLOCK file status flag on the new open file description. Using this flag saves extra calls to fcntl(2) to achieve the same result. SFD_CLOEXEC Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor. See the description of the O_CLOEXEC flag in open(2) for reasons why this may be useful. In Linux up to version 2.6.26, the flags argument is unused, and must be specified as zero. signalfd() returns a file descriptor that supports the following operations: read(2) If one or more of the signals specified in mask is pending for the process, then the buffer supplied to read(2) is used to return one or more signalfd_siginfo structures (see below) that describe the signals. The read(2) returns information for as many signals as are pending and will fit in the supplied buffer. The buffer must be at least sizeof(struct signalfd_siginfo) bytes. The return value of the read(2) is the total number of bytes read. 
As a consequence of the read(2), the signals are consumed, so that they are no longer pending for the process (i.e., will not be caught by signal handlers, and cannot be accepted using sigwaitinfo(2)). If none of the signals in mask is pending for the process, then the read(2) either blocks until one of the signals in mask is generated for the process, or fails with the error EAGAIN if the file descriptor has been made nonblocking. poll(2), select(2) (and similar) The file descriptor is readable (the select(2) readfds argument; the poll(2) POLLIN flag) if one or more of the signals in mask is pending for the process. The signalfd file descriptor also supports the other file-descriptor multiplexing APIs: pselect(2), ppoll(2), and epoll(7). close(2) When the file descriptor is no longer required it should be closed. When all file descriptors associated with the same signalfd object have been closed, the resources for the object are freed by the kernel. The signalfd_siginfo structure The format of the signalfd_siginfo structure(s) returned by read(2)s from a signalfd file descriptor is as follows: struct signalfd_siginfo { uint32_t ssi_signo; /* Signal number */ int32_t ssi_errno; /* Error number (unused) */ int32_t ssi_code; /* Signal code */ uint32_t ssi_pid; /* PID of sender */ uint32_t ssi_uid; /* Real UID of sender */ int32_t ssi_fd; /* File descriptor (SIGIO) */ uint32_t ssi_tid; /* Kernel timer ID (POSIX timers) */ uint32_t ssi_band; /* Band event (SIGIO) */ uint32_t ssi_overrun; /* POSIX timer overrun count */ uint32_t ssi_trapno; /* Trap number that caused signal */ int32_t ssi_status; /* Exit status or signal (SIGCHLD) */ int32_t ssi_int; /* Integer sent by sigqueue(3) */ uint64_t ssi_ptr; /* Pointer sent by sigqueue(3) */ uint64_t ssi_utime; /* User CPU time consumed (SIGCHLD) */ uint64_t ssi_stime; /* System CPU time consumed (SIGCHLD) */ uint64_t ssi_addr; /* Address that generated signal (for hardware-generated signals) */ uint8_t pad[X]; /* Pad size to 128
bytes (allow for additional fields in the future) */ }; Each of the fields in this structure is analogous to the similarly named field in the siginfo_t structure. The siginfo_t structure is described in sigaction(2). Not all fields in the returned signalfd_siginfo structure will be valid for a specific signal; the set of valid fields can be determined from the value returned in the ssi_code field. This field is the analog of the siginfo_t si_code field; see sigaction(2) for details. fork(2) semantics After a fork(2), the child inherits a copy of the signalfd file descriptor. A read(2) from the file descriptor in the child will return information about signals queued to the child. Semantics of file descriptor passing As with other file descriptors, signalfd file descriptors can be passed to another process via a UNIX domain socket (see unix(7)). In the receiving process, a read(2) from the received file descriptor will return information about signals queued to that process. execve(2) semantics Just like any other file descriptor, a signalfd file descriptor remains open across an execve(2), unless it has been marked for close-on-exec (see fcntl(2)). Any signals that were available for reading before the execve(2) remain available to the newly loaded program. (This is analogous to traditional signal semantics, where a blocked signal that is pending remains pending across an execve(2).) Thread semantics The semantics of signalfd file descriptors in a multithreaded program mirror the standard semantics for signals. In other words, when a thread reads from a signalfd file descriptor, it will read the signals that are directed to the thread itself and the signals that are directed to the process (i.e., the entire thread group). (A thread will not be able to read signals that are directed to other threads in the process.) 
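The usual signalfd() pattern described in this entry (block the signal, create the descriptor, read a signalfd_siginfo record) can be sketched as follows. The helper name read_one_signal_via_fd and the choice of SIGUSR1 are assumptions for illustration, and the sketch assumes a single-threaded caller on Linux 2.6.27 or later so that the SFD_* flags are available.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/signalfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Raise SIGUSR1 at ourselves and collect it through a signalfd.
 * Returns the signal number read from the descriptor, or -1 on error. */
int read_one_signal_via_fd(void)
{
    sigset_t mask, old;
    struct signalfd_siginfo si;
    ssize_t n;
    int fd;
    int result = -1;

    sigemptyset(&mask);
    sigaddset(&mask, SIGUSR1);

    /* Block the signal so that its default disposition is not taken. */
    if (sigprocmask(SIG_BLOCK, &mask, &old) == -1)
        return -1;

    /* fd == -1 asks for a new descriptor bound to mask. */
    fd = signalfd(-1, &mask, SFD_CLOEXEC | SFD_NONBLOCK);
    if (fd != -1) {
        raise(SIGUSR1);            /* blocked, so it stays pending */

        /* Reading consumes the pending signal; with SFD_NONBLOCK the
         * read fails with EAGAIN if nothing is pending. */
        n = read(fd, &si, sizeof si);
        if (n == (ssize_t)sizeof si)
            result = (int)si.ssi_signo;

        close(fd);
    }

    sigprocmask(SIG_SETMASK, &old, NULL);
    return result;
}
```

In an event loop the descriptor would normally be registered with poll(2) or epoll(7) alongside other descriptors instead of being read immediately.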
http://man7.org/linux/man-pages/man2/signalfd4.2.html 13 SYSTEM CALL: signalfd(2) - Linux manual page FUNCTIONALITY: signalfd - create a file descriptor for accepting signals SYNOPSIS: #include <sys/signalfd.h> int signalfd(int fd, const sigset_t *mask, int flags); DESCRIPTION signalfd() creates a file descriptor that can be used to accept signals targeted at the caller. This provides an alternative to the use of a signal handler or sigwaitinfo(2), and has the advantage that the file descriptor may be monitored by select(2), poll(2), and epoll(7). The mask argument specifies the set of signals that the caller wishes to accept via the file descriptor. This argument is a signal set whose contents can be initialized using the macros described in sigsetops(3). Normally, the set of signals to be received via the file descriptor should be blocked using sigprocmask(2), to prevent the signals being handled according to their default dispositions. It is not possible to receive SIGKILL or SIGSTOP signals via a signalfd file descriptor; these signals are silently ignored if specified in mask. If the fd argument is -1, then the call creates a new file descriptor and associates the signal set specified in mask with that file descriptor. If fd is not -1, then it must specify a valid existing signalfd file descriptor, and mask is used to replace the signal set associated with that file descriptor. Starting with Linux 2.6.27, the following values may be bitwise ORed in flags to change the behavior of signalfd(): SFD_NONBLOCK Set the O_NONBLOCK file status flag on the new open file description. Using this flag saves extra calls to fcntl(2) to achieve the same result. SFD_CLOEXEC Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor. See the description of the O_CLOEXEC flag in open(2) for reasons why this may be useful. In Linux up to version 2.6.26, the flags argument is unused, and must be specified as zero.
signalfd() returns a file descriptor that supports the following operations: read(2) If one or more of the signals specified in mask is pending for the process, then the buffer supplied to read(2) is used to return one or more signalfd_siginfo structures (see below) that describe the signals. The read(2) returns information for as many signals as are pending and will fit in the supplied buffer. The buffer must be at least sizeof(struct signalfd_siginfo) bytes. The return value of the read(2) is the total number of bytes read. As a consequence of the read(2), the signals are consumed, so that they are no longer pending for the process (i.e., will not be caught by signal handlers, and cannot be accepted using sigwaitinfo(2)). If none of the signals in mask is pending for the process, then the read(2) either blocks until one of the signals in mask is generated for the process, or fails with the error EAGAIN if the file descriptor has been made nonblocking. poll(2), select(2) (and similar) The file descriptor is readable (the select(2) readfds argument; the poll(2) POLLIN flag) if one or more of the signals in mask is pending for the process. The signalfd file descriptor also supports the other file-descriptor multiplexing APIs: pselect(2), ppoll(2), and epoll(7). close(2) When the file descriptor is no longer required it should be closed. When all file descriptors associated with the same signalfd object have been closed, the resources for the object are freed by the kernel.
The signalfd_siginfo structure The format of the signalfd_siginfo structure(s) returned by read(2)s from a signalfd file descriptor is as follows: struct signalfd_siginfo { uint32_t ssi_signo; /* Signal number */ int32_t ssi_errno; /* Error number (unused) */ int32_t ssi_code; /* Signal code */ uint32_t ssi_pid; /* PID of sender */ uint32_t ssi_uid; /* Real UID of sender */ int32_t ssi_fd; /* File descriptor (SIGIO) */ uint32_t ssi_tid; /* Kernel timer ID (POSIX timers) */ uint32_t ssi_band; /* Band event (SIGIO) */ uint32_t ssi_overrun; /* POSIX timer overrun count */ uint32_t ssi_trapno; /* Trap number that caused signal */ int32_t ssi_status; /* Exit status or signal (SIGCHLD) */ int32_t ssi_int; /* Integer sent by sigqueue(3) */ uint64_t ssi_ptr; /* Pointer sent by sigqueue(3) */ uint64_t ssi_utime; /* User CPU time consumed (SIGCHLD) */ uint64_t ssi_stime; /* System CPU time consumed (SIGCHLD) */ uint64_t ssi_addr; /* Address that generated signal (for hardware-generated signals) */ uint8_t pad[X]; /* Pad size to 128 bytes (allow for additional fields in the future) */ }; Each of the fields in this structure is analogous to the similarly named field in the siginfo_t structure. The siginfo_t structure is described in sigaction(2). Not all fields in the returned signalfd_siginfo structure will be valid for a specific signal; the set of valid fields can be determined from the value returned in the ssi_code field. This field is the analog of the siginfo_t si_code field; see sigaction(2) for details. fork(2) semantics After a fork(2), the child inherits a copy of the signalfd file descriptor. A read(2) from the file descriptor in the child will return information about signals queued to the child. Semantics of file descriptor passing As with other file descriptors, signalfd file descriptors can be passed to another process via a UNIX domain socket (see unix(7)).
In the receiving process, a read(2) from the received file descriptor will return information about signals queued to that process. execve(2) semantics Just like any other file descriptor, a signalfd file descriptor remains open across an execve(2), unless it has been marked for close-on-exec (see fcntl(2)). Any signals that were available for reading before the execve(2) remain available to the newly loaded program. (This is analogous to traditional signal semantics, where a blocked signal that is pending remains pending across an execve(2).) Thread semantics The semantics of signalfd file descriptors in a multithreaded program mirror the standard semantics for signals. In other words, when a thread reads from a signalfd file descriptor, it will read the signals that are directed to the thread itself and the signals that are directed to the process (i.e., the entire thread group). (A thread will not be able to read signals that are directed to other threads in the process.) http://man7.org/linux/man-pages/man2/eventfd.2.html 13 SYSTEM CALL: eventfd(2) - Linux manual page FUNCTIONALITY: eventfd - create a file descriptor for event notification SYNOPSIS: #include <sys/eventfd.h> int eventfd(unsigned int initval, int flags); DESCRIPTION eventfd() creates an "eventfd object" that can be used as an event wait/notify mechanism by user-space applications, and by the kernel to notify user-space applications of events. The object contains an unsigned 64-bit integer (uint64_t) counter that is maintained by the kernel. This counter is initialized with the value specified in the argument initval. The following values may be bitwise ORed in flags to change the behavior of eventfd(): EFD_CLOEXEC (since Linux 2.6.27) Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor. See the description of the O_CLOEXEC flag in open(2) for reasons why this may be useful. EFD_NONBLOCK (since Linux 2.6.27) Set the O_NONBLOCK file status flag on the new open file description.
Using this flag saves extra calls to fcntl(2) to achieve the same result. EFD_SEMAPHORE (since Linux 2.6.30) Provide semaphore-like semantics for reads from the new file descriptor. See below. In Linux up to version 2.6.26, the flags argument is unused, and must be specified as zero. As its return value, eventfd() returns a new file descriptor that can be used to refer to the eventfd object. The following operations can be performed on the file descriptor: read(2) Each successful read(2) returns an 8-byte integer. A read(2) will fail with the error EINVAL if the size of the supplied buffer is less than 8 bytes. The value returned by read(2) is in host byte order—that is, the native byte order for integers on the host machine. The semantics of read(2) depend on whether the eventfd counter currently has a nonzero value and whether the EFD_SEMAPHORE flag was specified when creating the eventfd file descriptor: * If EFD_SEMAPHORE was not specified and the eventfd counter has a nonzero value, then a read(2) returns 8 bytes containing that value, and the counter's value is reset to zero. * If EFD_SEMAPHORE was specified and the eventfd counter has a nonzero value, then a read(2) returns 8 bytes containing the value 1, and the counter's value is decremented by 1. * If the eventfd counter is zero at the time of the call to read(2), then the call either blocks until the counter becomes nonzero (at which time, the read(2) proceeds as described above) or fails with the error EAGAIN if the file descriptor has been made nonblocking. write(2) A write(2) call adds the 8-byte integer value supplied in its buffer to the counter. The maximum value that may be stored in the counter is the largest unsigned 64-bit value minus 1 (i.e., 0xfffffffffffffffe). 
If the addition would cause the counter's value to exceed the maximum, then the write(2) either blocks until a read(2) is performed on the file descriptor, or fails with the error EAGAIN if the file descriptor has been made nonblocking. A write(2) will fail with the error EINVAL if the size of the supplied buffer is less than 8 bytes, or if an attempt is made to write the value 0xffffffffffffffff. poll(2), select(2) (and similar) The returned file descriptor supports poll(2) (and analogously epoll(7)) and select(2), as follows: * The file descriptor is readable (the select(2) readfds argument; the poll(2) POLLIN flag) if the counter has a value greater than 0. * The file descriptor is writable (the select(2) writefds argument; the poll(2) POLLOUT flag) if it is possible to write a value of at least "1" without blocking. * If an overflow of the counter value was detected, then select(2) indicates the file descriptor as being both readable and writable, and poll(2) returns a POLLERR event. As noted above, write(2) can never overflow the counter. However, an overflow can occur if 2^64 eventfd "signal posts" were performed by the KAIO subsystem (theoretically possible, but practically unlikely). If an overflow has occurred, then read(2) will return the maximum uint64_t value (i.e., 0xffffffffffffffff). The eventfd file descriptor also supports the other file-descriptor multiplexing APIs: pselect(2) and ppoll(2). close(2) When the file descriptor is no longer required it should be closed. When all file descriptors associated with the same eventfd object have been closed, the resources for the object are freed by the kernel. A copy of the file descriptor created by eventfd() is inherited by the child produced by fork(2). The duplicate file descriptor is associated with the same eventfd object. File descriptors created by eventfd() are preserved across execve(2), unless the close-on-exec flag has been set.
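The difference between the default and EFD_SEMAPHORE read semantics described in this entry can be seen with a small sketch. The helper name eventfd_sum is hypothetical, and the sketch assumes Linux 2.6.30 or later so that EFD_SEMAPHORE is available.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Create an eventfd with the given extra flags, write a and then b,
 * and read once into *out. Returns 0 on success, -1 on any failure. */
int eventfd_sum(uint64_t a, uint64_t b, int flags, uint64_t *out)
{
    int rc = -1;
    int fd = eventfd(0, flags | EFD_NONBLOCK);

    if (fd == -1)
        return -1;

    /* Each write adds its 8-byte value to the kernel counter. */
    if (write(fd, &a, sizeof a) == (ssize_t)sizeof a &&
        write(fd, &b, sizeof b) == (ssize_t)sizeof b &&
        /* Default mode: read returns the counter and resets it to zero.
         * EFD_SEMAPHORE: read returns 1 and decrements the counter. */
        read(fd, out, sizeof *out) == (ssize_t)sizeof *out)
        rc = 0;

    close(fd);
    return rc;
}
```

With a = 3 and b = 4, a default-mode descriptor yields 7 from the single read, while an EFD_SEMAPHORE descriptor yields 1 and leaves 6 in the counter. Because the descriptor is nonblocking, a read on a zero counter fails with EAGAIN rather than blocking.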
http://man7.org/linux/man-pages/man2/eventfd2.2.html 13 SYSTEM CALL: eventfd(2) - Linux manual page FUNCTIONALITY: eventfd - create a file descriptor for event notification SYNOPSIS: #include <sys/eventfd.h> int eventfd(unsigned int initval, int flags); DESCRIPTION eventfd() creates an "eventfd object" that can be used as an event wait/notify mechanism by user-space applications, and by the kernel to notify user-space applications of events. The object contains an unsigned 64-bit integer (uint64_t) counter that is maintained by the kernel. This counter is initialized with the value specified in the argument initval. The following values may be bitwise ORed in flags to change the behavior of eventfd(): EFD_CLOEXEC (since Linux 2.6.27) Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor. See the description of the O_CLOEXEC flag in open(2) for reasons why this may be useful. EFD_NONBLOCK (since Linux 2.6.27) Set the O_NONBLOCK file status flag on the new open file description. Using this flag saves extra calls to fcntl(2) to achieve the same result. EFD_SEMAPHORE (since Linux 2.6.30) Provide semaphore-like semantics for reads from the new file descriptor. See below. In Linux up to version 2.6.26, the flags argument is unused, and must be specified as zero. As its return value, eventfd() returns a new file descriptor that can be used to refer to the eventfd object. The following operations can be performed on the file descriptor: read(2) Each successful read(2) returns an 8-byte integer. A read(2) will fail with the error EINVAL if the size of the supplied buffer is less than 8 bytes. The value returned by read(2) is in host byte order, that is, the native byte order for integers on the host machine.
The semantics of read(2) depend on whether the eventfd counter currently has a nonzero value and whether the EFD_SEMAPHORE flag was specified when creating the eventfd file descriptor: * If EFD_SEMAPHORE was not specified and the eventfd counter has a nonzero value, then a read(2) returns 8 bytes containing that value, and the counter's value is reset to zero. * If EFD_SEMAPHORE was specified and the eventfd counter has a nonzero value, then a read(2) returns 8 bytes containing the value 1, and the counter's value is decremented by 1. * If the eventfd counter is zero at the time of the call to read(2), then the call either blocks until the counter becomes nonzero (at which time, the read(2) proceeds as described above) or fails with the error EAGAIN if the file descriptor has been made nonblocking. write(2) A write(2) call adds the 8-byte integer value supplied in its buffer to the counter. The maximum value that may be stored in the counter is the largest unsigned 64-bit value minus 1 (i.e., 0xfffffffffffffffe). If the addition would cause the counter's value to exceed the maximum, then the write(2) either blocks until a read(2) is performed on the file descriptor, or fails with the error EAGAIN if the file descriptor has been made nonblocking. A write(2) will fail with the error EINVAL if the size of the supplied buffer is less than 8 bytes, or if an attempt is made to write the value 0xffffffffffffffff. poll(2), select(2) (and similar) The returned file descriptor supports poll(2) (and analogously epoll(7)) and select(2), as follows: * The file descriptor is readable (the select(2) readfds argument; the poll(2) POLLIN flag) if the counter has a value greater than 0. * The file descriptor is writable (the select(2) writefds argument; the poll(2) POLLOUT flag) if it is possible to write a value of at least "1" without blocking. 
* If an overflow of the counter value was detected, then select(2) indicates the file descriptor as being both readable and writable, and poll(2) returns a POLLERR event. As noted above, write(2) can never overflow the counter. However an overflow can occur if 2^64 eventfd "signal posts" were performed by the KAIO subsystem (theoretically possible, but practically unlikely). If an overflow has occurred, then read(2) will return that maximum uint64_t value (i.e., 0xffffffffffffffff). The eventfd file descriptor also supports the other file- descriptor multiplexing APIs: pselect(2) and ppoll(2). close(2) When the file descriptor is no longer required it should be closed. When all file descriptors associated with the same eventfd object have been closed, the resources for object are freed by the kernel. A copy of the file descriptor created by eventfd() is inherited by the child produced by fork(2). The duplicate file descriptor is associated with the same eventfd object. File descriptors created by eventfd() are preserved across execve(2), unless the close-on-exec flag has been set. http://man7.org/linux/man-pages/man2/restart_syscall.2.html 11 SYSTEM CALL: restart_syscall(2) - Linux manual page FUNCTIONALITY: restart_syscall - restart a system call after interruption by a stop signal SYNOPSIS: int restart_syscall(void); Note: There is no glibc wrapper for this system call; see NOTES. DESCRIPTION The restart_syscall() system call is used to restart certain system calls after a process that was stopped by a signal (e.g., SIGSTOP or SIGTSTP) is later resumed after receiving a SIGCONT signal. This system call is designed only for internal use by the kernel. 
restart_syscall() is used for restarting only those system calls that, when restarted, should adjust their time-related parameters, namely poll(2) (since Linux 2.6.24), nanosleep(2) (since Linux 2.6), clock_nanosleep(2) (since Linux 2.6), and futex(2), when employed with the FUTEX_WAIT (since Linux 2.6.22) and FUTEX_WAIT_BITSET (since Linux 2.6.31) operations. restart_syscall() restarts the interrupted system call with a time argument that is suitably adjusted to account for the time that has already elapsed (including the time where the process was stopped by a signal). Without the restart_syscall() mechanism, restarting these system calls would not correctly deduct the already elapsed time when the process continued execution. Linux Inter Process Communication (IPC) system calls http://man7.org/linux/man-pages/man2/pipe.2.html 11 SYSTEM CALL: pipe(2) - Linux manual page FUNCTIONALITY: pipe, pipe2 - create pipe SYNOPSIS: #include <unistd.h> int pipe(int pipefd[2]); #define _GNU_SOURCE /* See feature_test_macros(7) */ #include <fcntl.h> /* Obtain O_* constant definitions */ #include <unistd.h> int pipe2(int pipefd[2], int flags); DESCRIPTION pipe() creates a pipe, a unidirectional data channel that can be used for interprocess communication. The array pipefd is used to return two file descriptors referring to the ends of the pipe. pipefd[0] refers to the read end of the pipe. pipefd[1] refers to the write end of the pipe. Data written to the write end of the pipe is buffered by the kernel until it is read from the read end of the pipe. For further details, see pipe(7). If flags is 0, then pipe2() is the same as pipe(). The following values can be bitwise ORed in flags to obtain different behavior: O_CLOEXEC Set the close-on-exec (FD_CLOEXEC) flag on the two new file descriptors. See the description of the same flag in open(2) for reasons why this may be useful. O_DIRECT (since Linux 3.4) Create a pipe that performs I/O in "packet" mode.
Each write(2) to the pipe is dealt with as a separate packet, and read(2)s from the pipe will read one packet at a time. Note the following points: * Writes of greater than PIPE_BUF bytes (see pipe(7)) will be split into multiple packets. The constant PIPE_BUF is defined in <limits.h>. * If a read(2) specifies a buffer size that is smaller than the next packet, then the requested number of bytes are read, and the excess bytes in the packet are discarded. Specifying a buffer size of PIPE_BUF will be sufficient to read the largest possible packets (see the previous point). * Zero-length packets are not supported. (A read(2) that specifies a buffer size of zero is a no-op, and returns 0.) Older kernels that do not support this flag will indicate this via an EINVAL error. O_NONBLOCK Set the O_NONBLOCK file status flag on the two new open file descriptions. Using this flag saves extra calls to fcntl(2) to achieve the same result. http://man7.org/linux/man-pages/man2/pipe2.2.html 11 SYSTEM CALL: pipe(2) - Linux manual page FUNCTIONALITY: pipe, pipe2 - create pipe SYNOPSIS: #include <unistd.h> int pipe(int pipefd[2]); #define _GNU_SOURCE /* See feature_test_macros(7) */ #include <fcntl.h> /* Obtain O_* constant definitions */ #include <unistd.h> int pipe2(int pipefd[2], int flags); DESCRIPTION pipe() creates a pipe, a unidirectional data channel that can be used for interprocess communication. The array pipefd is used to return two file descriptors referring to the ends of the pipe. pipefd[0] refers to the read end of the pipe. pipefd[1] refers to the write end of the pipe. Data written to the write end of the pipe is buffered by the kernel until it is read from the read end of the pipe. For further details, see pipe(7). If flags is 0, then pipe2() is the same as pipe(). The following values can be bitwise ORed in flags to obtain different behavior: O_CLOEXEC Set the close-on-exec (FD_CLOEXEC) flag on the two new file descriptors. See the description of the same flag in open(2) for reasons why this may be useful.
O_DIRECT (since Linux 3.4) Create a pipe that performs I/O in "packet" mode. Each write(2) to the pipe is dealt with as a separate packet, and read(2)s from the pipe will read one packet at a time. Note the following points: * Writes of greater than PIPE_BUF bytes (see pipe(7)) will be split into multiple packets. The constant PIPE_BUF is defined in <limits.h>. * If a read(2) specifies a buffer size that is smaller than the next packet, then the requested number of bytes are read, and the excess bytes in the packet are discarded. Specifying a buffer size of PIPE_BUF will be sufficient to read the largest possible packets (see the previous point). * Zero-length packets are not supported. (A read(2) that specifies a buffer size of zero is a no-op, and returns 0.) Older kernels that do not support this flag will indicate this via an EINVAL error. O_NONBLOCK Set the O_NONBLOCK file status flag on the two new open file descriptions. Using this flag saves extra calls to fcntl(2) to achieve the same result. http://man7.org/linux/man-pages/man2/tee.2.html 12 SYSTEM CALL: tee(2) - Linux manual page FUNCTIONALITY: tee - duplicating pipe content SYNOPSIS: #define _GNU_SOURCE /* See feature_test_macros(7) */ #include <fcntl.h> ssize_t tee(int fd_in, int fd_out, size_t len, unsigned int flags); DESCRIPTION tee() duplicates up to len bytes of data from the pipe referred to by the file descriptor fd_in to the pipe referred to by the file descriptor fd_out. It does not consume the data that is duplicated from fd_in; therefore, that data can be copied by a subsequent splice(2). flags is a bit mask that is composed by ORing together zero or more of the following values: SPLICE_F_MOVE Currently has no effect for tee(); see splice(2). SPLICE_F_NONBLOCK Do not block on I/O; see splice(2) for further details. SPLICE_F_MORE Currently has no effect for tee(), but may be implemented in the future; see splice(2). SPLICE_F_GIFT Unused for tee(); see vmsplice(2).
http://man7.org/linux/man-pages/man2/splice.2.html 12 SYSTEM CALL: splice(2) - Linux manual page FUNCTIONALITY: splice - splice data to/from a pipe SYNOPSIS: #define _GNU_SOURCE /* See feature_test_macros(7) */ #include <fcntl.h> ssize_t splice(int fd_in, loff_t *off_in, int fd_out, loff_t *off_out, size_t len, unsigned int flags); DESCRIPTION splice() moves data between two file descriptors without copying between kernel address space and user address space. It transfers up to len bytes of data from the file descriptor fd_in to the file descriptor fd_out, where one of the file descriptors must refer to a pipe. The following semantics apply for fd_in and off_in: * If fd_in refers to a pipe, then off_in must be NULL. * If fd_in does not refer to a pipe and off_in is NULL, then bytes are read from fd_in starting from the file offset, and the file offset is adjusted appropriately. * If fd_in does not refer to a pipe and off_in is not NULL, then off_in must point to a buffer which specifies the starting offset from which bytes will be read from fd_in; in this case, the file offset of fd_in is not changed. Analogous statements apply for fd_out and off_out. The flags argument is a bit mask that is composed by ORing together zero or more of the following values: SPLICE_F_MOVE Attempt to move pages instead of copying. This is only a hint to the kernel: pages may still be copied if the kernel cannot move the pages from the pipe, or if the pipe buffers don't refer to full pages. The initial implementation of this flag was buggy: therefore starting in Linux 2.6.21 it is a no-op (but is still permitted in a splice() call); in the future, a correct implementation may be restored. SPLICE_F_NONBLOCK Do not block on I/O. This makes the splice pipe operations nonblocking, but splice() may nevertheless block because the file descriptors that are spliced to/from may block (unless they have the O_NONBLOCK flag set). SPLICE_F_MORE More data will be coming in a subsequent splice.
This is a helpful hint when the fd_out refers to a socket (see also the description of MSG_MORE in send(2), and the description of TCP_CORK in tcp(7)). SPLICE_F_GIFT Unused for splice(); see vmsplice(2). http://man7.org/linux/man-pages/man2/vmsplice.2.html 11 SYSTEM CALL: vmsplice(2) - Linux manual page FUNCTIONALITY: vmsplice - splice user pages into a pipe SYNOPSIS: #define _GNU_SOURCE /* See feature_test_macros(7) */ #include <fcntl.h> #include <sys/uio.h> ssize_t vmsplice(int fd, const struct iovec *iov, unsigned long nr_segs, unsigned int flags); DESCRIPTION The vmsplice() system call maps nr_segs ranges of user memory described by iov into a pipe. The file descriptor fd must refer to a pipe. The pointer iov points to an array of iovec structures as defined in <sys/uio.h>: struct iovec { void *iov_base; /* Starting address */ size_t iov_len; /* Number of bytes */ }; The flags argument is a bit mask that is composed by ORing together zero or more of the following values: SPLICE_F_MOVE Unused for vmsplice(); see splice(2). SPLICE_F_NONBLOCK Do not block on I/O; see splice(2) for further details. SPLICE_F_MORE Currently has no effect for vmsplice(), but may be implemented in the future; see splice(2). SPLICE_F_GIFT The user pages are a gift to the kernel. The application may not modify this memory ever, otherwise the page cache and on-disk data may differ. Gifting pages to the kernel means that a subsequent splice(2) SPLICE_F_MOVE can successfully move the pages; if this flag is not specified, then a subsequent splice(2) SPLICE_F_MOVE must copy the pages. Data must also be properly page aligned, both in memory and length. http://man7.org/linux/man-pages/man2/shmget.2.html 11 SYSTEM CALL: shmget(2) - Linux manual page FUNCTIONALITY: shmget - allocates a System V shared memory segment SYNOPSIS: #include <sys/ipc.h> #include <sys/shm.h> int shmget(key_t key, size_t size, int shmflg); DESCRIPTION shmget() returns the identifier of the System V shared memory segment associated with the value of the argument key.
A new shared memory segment, with size equal to the value of size rounded up to a multiple of PAGE_SIZE, is created if key has the value IPC_PRIVATE or key isn't IPC_PRIVATE, no shared memory segment corresponding to key exists, and IPC_CREAT is specified in shmflg. If shmflg specifies both IPC_CREAT and IPC_EXCL and a shared memory segment already exists for key, then shmget() fails with errno set to EEXIST. (This is analogous to the effect of the combination O_CREAT | O_EXCL for open(2).) The value shmflg is composed of: IPC_CREAT Create a new segment. If this flag is not used, then shmget() will find the segment associated with key and check to see if the user has permission to access the segment. IPC_EXCL This flag is used with IPC_CREAT to ensure that this call creates the segment. If the segment already exists, the call fails. SHM_HUGETLB (since Linux 2.6) Allocate the segment using "huge pages." See the Linux kernel source file Documentation/vm/hugetlbpage.txt for further information. SHM_HUGE_2MB, SHM_HUGE_1GB (since Linux 3.8) Used in conjunction with SHM_HUGETLB to select alternative hugetlb page sizes (respectively, 2 MB and 1 GB) on systems that support multiple hugetlb page sizes. More generally, the desired huge page size can be configured by encoding the base-2 logarithm of the desired page size in the six bits at the offset SHM_HUGE_SHIFT. Thus, the above two constants are defined as: #define SHM_HUGE_2MB (21 << SHM_HUGE_SHIFT) #define SHM_HUGE_1GB (30 << SHM_HUGE_SHIFT) For some additional details, see the discussion of the similarly named constants in mmap(2). SHM_NORESERVE (since Linux 2.6.15) This flag serves the same purpose as the mmap(2) MAP_NORESERVE flag. Do not reserve swap space for this segment. When swap space is reserved, one has the guarantee that it is possible to modify the segment. When swap space is not reserved one might get SIGSEGV upon a write if no physical memory is available. 
See also the discussion of the file /proc/sys/vm/overcommit_memory in proc(5). In addition to the above flags, the least significant 9 bits of shmflg specify the permissions granted to the owner, group, and others. These bits have the same format, and the same meaning, as the mode argument of open(2). Presently, execute permissions are not used by the system. When a new shared memory segment is created, its contents are initialized to zero values, and its associated data structure, shmid_ds (see shmctl(2)), is initialized as follows: shm_perm.cuid and shm_perm.uid are set to the effective user ID of the calling process. shm_perm.cgid and shm_perm.gid are set to the effective group ID of the calling process. The least significant 9 bits of shm_perm.mode are set to the least significant 9 bits of shmflg. shm_segsz is set to the value of size. shm_lpid, shm_nattch, shm_atime, and shm_dtime are set to 0. shm_ctime is set to the current time. If the shared memory segment already exists, the permissions are verified, and a check is made to see if it is marked for destruction. http://man7.org/linux/man-pages/man2/shmctl.2.html 10 SYSTEM CALL: shmctl(2) - Linux manual page FUNCTIONALITY: shmctl - System V shared memory control SYNOPSIS: #include <sys/ipc.h> #include <sys/shm.h> int shmctl(int shmid, int cmd, struct shmid_ds *buf); DESCRIPTION shmctl() performs the control operation specified by cmd on the System V shared memory segment whose identifier is given in shmid. The buf argument is a pointer to a shmid_ds structure, defined in <sys/shm.h> as follows: struct shmid_ds { struct ipc_perm shm_perm; /* Ownership and permissions */ size_t shm_segsz; /* Size of segment (bytes) */ time_t shm_atime; /* Last attach time */ time_t shm_dtime; /* Last detach time */ time_t shm_ctime; /* Last change time */ pid_t shm_cpid; /* PID of creator */ pid_t shm_lpid; /* PID of last shmat(2)/shmdt(2) */ shmatt_t shm_nattch; /* No. of current attaches */ ...
}; The ipc_perm structure is defined as follows (the highlighted fields are settable using IPC_SET): struct ipc_perm { key_t __key; /* Key supplied to shmget(2) */ uid_t uid; /* Effective UID of owner */ gid_t gid; /* Effective GID of owner */ uid_t cuid; /* Effective UID of creator */ gid_t cgid; /* Effective GID of creator */ unsigned short mode; /* Permissions + SHM_DEST and SHM_LOCKED flags */ unsigned short __seq; /* Sequence number */ }; Valid values for cmd are: IPC_STAT Copy information from the kernel data structure associated with shmid into the shmid_ds structure pointed to by buf. The caller must have read permission on the shared memory segment. IPC_SET Write the values of some members of the shmid_ds structure pointed to by buf to the kernel data structure associated with this shared memory segment, updating also its shm_ctime member. The following fields can be changed: shm_perm.uid, shm_perm.gid, and (the least significant 9 bits of) shm_perm.mode. The effective UID of the calling process must match the owner (shm_perm.uid) or creator (shm_perm.cuid) of the shared memory segment, or the caller must be privileged. IPC_RMID Mark the segment to be destroyed. The segment will actually be destroyed only after the last process detaches it (i.e., when the shm_nattch member of the associated structure shmid_ds is zero). The caller must be the owner or creator of the segment, or be privileged. The buf argument is ignored. If a segment has been marked for destruction, then the (nonstandard) SHM_DEST flag of the shm_perm.mode field in the associated data structure retrieved by IPC_STAT will be set. The caller must ensure that a segment is eventually destroyed; otherwise its pages that were faulted in will remain in memory or swap. See also the description of /proc/sys/kernel/shm_rmid_forced in proc(5). IPC_INFO (Linux-specific) Return information about system-wide shared memory limits and parameters in the structure pointed to by buf. 
This structure is of type shminfo (thus, a cast is required), defined in <sys/shm.h> if the _GNU_SOURCE feature test macro is defined: struct shminfo { unsigned long shmmax; /* Maximum segment size */ unsigned long shmmin; /* Minimum segment size; always 1 */ unsigned long shmmni; /* Maximum number of segments */ unsigned long shmseg; /* Maximum number of segments that a process can attach; unused within kernel */ unsigned long shmall; /* Maximum number of pages of shared memory, system-wide */ }; The shmmni, shmmax, and shmall settings can be changed via /proc files of the same name; see proc(5) for details. SHM_INFO (Linux-specific) Return a shm_info structure whose fields contain information about system resources consumed by shared memory. This structure is defined in <sys/shm.h> if the _GNU_SOURCE feature test macro is defined: struct shm_info { int used_ids; /* # of currently existing segments */ unsigned long shm_tot; /* Total number of shared memory pages */ unsigned long shm_rss; /* # of resident shared memory pages */ unsigned long shm_swp; /* # of swapped shared memory pages */ unsigned long swap_attempts; /* Unused since Linux 2.4 */ unsigned long swap_successes; /* Unused since Linux 2.4 */ }; SHM_STAT (Linux-specific) Return a shmid_ds structure as for IPC_STAT. However, the shmid argument is not a segment identifier, but instead an index into the kernel's internal array that maintains information about all shared memory segments on the system. The caller can prevent or allow swapping of a shared memory segment with the following cmd values: SHM_LOCK (Linux-specific) Prevent swapping of the shared memory segment. The caller must fault in any pages that are required to be present after locking is enabled. If a segment has been locked, then the (nonstandard) SHM_LOCKED flag of the shm_perm.mode field in the associated data structure retrieved by IPC_STAT will be set. SHM_UNLOCK (Linux-specific) Unlock the segment, allowing it to be swapped out.
In kernels before 2.6.10, only a privileged process could employ SHM_LOCK and SHM_UNLOCK. Since kernel 2.6.10, an unprivileged process can employ these operations if its effective UID matches the owner or creator UID of the segment, and (for SHM_LOCK) the amount of memory to be locked falls within the RLIMIT_MEMLOCK resource limit (see setrlimit(2)). http://man7.org/linux/man-pages/man2/shmat.2.html 10 SYSTEM CALL: shmop(2) - Linux manual page FUNCTIONALITY: shmat, shmdt - System V shared memory operations SYNOPSIS: #include <sys/types.h> #include <sys/shm.h> void *shmat(int shmid, const void *shmaddr, int shmflg); int shmdt(const void *shmaddr); DESCRIPTION shmat() shmat() attaches the System V shared memory segment identified by shmid to the address space of the calling process. The attaching address is specified by shmaddr with one of the following criteria: * If shmaddr is NULL, the system chooses a suitable (unused) address at which to attach the segment. * If shmaddr isn't NULL and SHM_RND is specified in shmflg, the attach occurs at the address equal to shmaddr rounded down to the nearest multiple of SHMLBA. * Otherwise, shmaddr must be a page-aligned address at which the attach occurs. In addition to SHM_RND, the following flags may be specified in the shmflg bit-mask argument: SHM_EXEC (Linux-specific; since Linux 2.6.9) Allow the contents of the segment to be executed. The caller must have execute permission on the segment. SHM_RDONLY Attach the segment for read-only access. The process must have read permission for the segment. If this flag is not specified, the segment is attached for read and write access, and the process must have read and write permission for the segment. There is no notion of a write-only shared memory segment. SHM_REMAP (Linux-specific) This flag specifies that the mapping of the segment should replace any existing mapping in the range starting at shmaddr and continuing for the size of the segment.
(Normally, an EINVAL error would result if a mapping already exists in this address range.) In this case, shmaddr must not be NULL. The brk(2) value of the calling process is not altered by the attach. The segment will automatically be detached at process exit. The same segment may be attached as a read and as a read-write one, and more than once, in the process's address space. A successful shmat() call updates the members of the shmid_ds structure (see shmctl(2)) associated with the shared memory segment as follows: shm_atime is set to the current time. shm_lpid is set to the process-ID of the calling process. shm_nattch is incremented by one. shmdt() shmdt() detaches the shared memory segment located at the address specified by shmaddr from the address space of the calling process. The to-be-detached segment must be currently attached with shmaddr equal to the value returned by the attaching shmat() call. On a successful shmdt() call, the system updates the members of the shmid_ds structure associated with the shared memory segment as follows: shm_dtime is set to the current time. shm_lpid is set to the process-ID of the calling process. shm_nattch is decremented by one. If it becomes 0 and the segment is marked for deletion, the segment is deleted. http://man7.org/linux/man-pages/man2/shmdt.2.html 10 SYSTEM CALL: shmop(2) - Linux manual page FUNCTIONALITY: shmat, shmdt - System V shared memory operations SYNOPSIS: #include <sys/types.h> #include <sys/shm.h> void *shmat(int shmid, const void *shmaddr, int shmflg); int shmdt(const void *shmaddr); DESCRIPTION shmat() shmat() attaches the System V shared memory segment identified by shmid to the address space of the calling process. The attaching address is specified by shmaddr with one of the following criteria: * If shmaddr is NULL, the system chooses a suitable (unused) address at which to attach the segment.
* If shmaddr isn't NULL and SHM_RND is specified in shmflg, the attach occurs at the address equal to shmaddr rounded down to the nearest multiple of SHMLBA. * Otherwise, shmaddr must be a page-aligned address at which the attach occurs. In addition to SHM_RND, the following flags may be specified in the shmflg bit-mask argument: SHM_EXEC (Linux-specific; since Linux 2.6.9) Allow the contents of the segment to be executed. The caller must have execute permission on the segment. SHM_RDONLY Attach the segment for read-only access. The process must have read permission for the segment. If this flag is not specified, the segment is attached for read and write access, and the process must have read and write permission for the segment. There is no notion of a write-only shared memory segment. SHM_REMAP (Linux-specific) This flag specifies that the mapping of the segment should replace any existing mapping in the range starting at shmaddr and continuing for the size of the segment. (Normally, an EINVAL error would result if a mapping already exists in this address range.) In this case, shmaddr must not be NULL. The brk(2) value of the calling process is not altered by the attach. The segment will automatically be detached at process exit. The same segment may be attached as a read and as a read-write one, and more than once, in the process's address space. A successful shmat() call updates the members of the shmid_ds structure (see shmctl(2)) associated with the shared memory segment as follows: shm_atime is set to the current time. shm_lpid is set to the process-ID of the calling process. shm_nattch is incremented by one. shmdt() shmdt() detaches the shared memory segment located at the address specified by shmaddr from the address space of the calling process. The to-be-detached segment must be currently attached with shmaddr equal to the value returned by the attaching shmat() call. 
On a successful shmdt() call, the system updates the members of the shmid_ds structure associated with the shared memory segment as follows: shm_dtime is set to the current time. shm_lpid is set to the process-ID of the calling process. shm_nattch is decremented by one. If it becomes 0 and the segment is marked for deletion, the segment is deleted. http://man7.org/linux/man-pages/man2/semget.2.html 11 SYSTEM CALL: semget(2) - Linux manual page FUNCTIONALITY: semget - get a System V semaphore set identifier SYNOPSIS: #include <sys/types.h> #include <sys/ipc.h> #include <sys/sem.h> int semget(key_t key, int nsems, int semflg); DESCRIPTION The semget() system call returns the System V semaphore set identifier associated with the argument key. A new set of nsems semaphores is created if key has the value IPC_PRIVATE or if no existing semaphore set is associated with key and IPC_CREAT is specified in semflg. If semflg specifies both IPC_CREAT and IPC_EXCL and a semaphore set already exists for key, then semget() fails with errno set to EEXIST. (This is analogous to the effect of the combination O_CREAT | O_EXCL for open(2).) Upon creation, the least significant 9 bits of the argument semflg define the permissions (for owner, group and others) for the semaphore set. These bits have the same format, and the same meaning, as the mode argument of open(2) (though the execute permissions are not meaningful for semaphores, and write permissions mean permission to alter semaphore values). When creating a new semaphore set, semget() initializes the set's associated data structure, semid_ds (see semctl(2)), as follows: sem_perm.cuid and sem_perm.uid are set to the effective user ID of the calling process. sem_perm.cgid and sem_perm.gid are set to the effective group ID of the calling process. The least significant 9 bits of sem_perm.mode are set to the least significant 9 bits of semflg. sem_nsems is set to the value of nsems. sem_otime is set to 0. sem_ctime is set to the current time.
The argument nsems can be 0 (a don't care) when a semaphore set is not being created. Otherwise, nsems must be greater than 0 and less than or equal to the maximum number of semaphores per semaphore set (SEMMSL). If the semaphore set already exists, the permissions are verified. http://man7.org/linux/man-pages/man2/semctl.2.html 10 SYSTEM CALL: semctl(2) - Linux manual page FUNCTIONALITY: semctl - System V semaphore control operations SYNOPSIS: #include <sys/types.h> #include <sys/ipc.h> #include <sys/sem.h> int semctl(int semid, int semnum, int cmd, ...); DESCRIPTION semctl() performs the control operation specified by cmd on the System V semaphore set identified by semid, or on the semnum-th semaphore of that set. (The semaphores in a set are numbered starting at 0.) This function has three or four arguments, depending on cmd. When there are four, the fourth has the type union semun. The calling program must define this union as follows: union semun { int val; /* Value for SETVAL */ struct semid_ds *buf; /* Buffer for IPC_STAT, IPC_SET */ unsigned short *array; /* Array for GETALL, SETALL */ struct seminfo *__buf; /* Buffer for IPC_INFO (Linux-specific) */ }; The semid_ds data structure is defined in <sys/sem.h> as follows: struct semid_ds { struct ipc_perm sem_perm; /* Ownership and permissions */ time_t sem_otime; /* Last semop time */ time_t sem_ctime; /* Last change time */ unsigned long sem_nsems; /* No. of semaphores in set */ }; The ipc_perm structure is defined as follows (the highlighted fields are settable using IPC_SET): struct ipc_perm { key_t __key; /* Key supplied to semget(2) */ uid_t uid; /* Effective UID of owner */ gid_t gid; /* Effective GID of owner */ uid_t cuid; /* Effective UID of creator */ gid_t cgid; /* Effective GID of creator */ unsigned short mode; /* Permissions */ unsigned short __seq; /* Sequence number */ }; Valid values for cmd are: IPC_STAT Copy information from the kernel data structure associated with semid into the semid_ds structure pointed to by arg.buf.
The argument semnum is ignored. The calling process must have read permission on the semaphore set. IPC_SET Write the values of some members of the semid_ds structure pointed to by arg.buf to the kernel data structure associated with this semaphore set, updating also its sem_ctime member. The following members of the structure are updated: sem_perm.uid, sem_perm.gid, and (the least significant 9 bits of) sem_perm.mode. The effective UID of the calling process must match the owner (sem_perm.uid) or creator (sem_perm.cuid) of the semaphore set, or the caller must be privileged. The argument semnum is ignored. IPC_RMID Immediately remove the semaphore set, awakening all processes blocked in semop(2) calls on the set (with an error return and errno set to EIDRM). The effective user ID of the calling process must match the creator or owner of the semaphore set, or the caller must be privileged. The argument semnum is ignored. IPC_INFO (Linux-specific) Return information about system-wide semaphore limits and parameters in the structure pointed to by arg.__buf. This structure is of type seminfo, defined in <sys/sem.h> if the _GNU_SOURCE feature test macro is defined:
    struct seminfo {
        int semmap;  /* Number of entries in semaphore map; unused within kernel */
        int semmni;  /* Maximum number of semaphore sets */
        int semmns;  /* Maximum number of semaphores in all semaphore sets */
        int semmnu;  /* System-wide maximum number of undo structures; unused within kernel */
        int semmsl;  /* Maximum number of semaphores in a set */
        int semopm;  /* Maximum number of operations for semop(2) */
        int semume;  /* Maximum number of undo entries per process; unused within kernel */
        int semusz;  /* Size of struct sem_undo */
        int semvmx;  /* Maximum semaphore value */
        int semaem;  /* Max. value that can be recorded for semaphore adjustment (SEM_UNDO) */
    };
The semmsl, semmns, semopm, and semmni settings can be changed via /proc/sys/kernel/sem; see proc(5) for details.
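The creation and control calls described so far can be combined into a short round trip. The following is a minimal sketch, not a definitive implementation: the helper name create_stat_remove() is hypothetical, the set is created with IPC_PRIVATE and mode 0600 purely for illustration, and error handling is reduced to perror().

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* The calling program must define union semun itself (see above). */
union semun {
    int              val;    /* Value for SETVAL */
    struct semid_ds *buf;    /* Buffer for IPC_STAT, IPC_SET */
    unsigned short  *array;  /* Array for GETALL, SETALL */
    struct seminfo  *__buf;  /* Buffer for IPC_INFO (Linux-specific) */
};

/*
 * Hypothetical helper: create a private set of nsems semaphores, read its
 * semid_ds back with IPC_STAT, then remove the set with IPC_RMID.
 * Returns the sem_nsems value reported by the kernel, or -1 on error.
 */
long create_stat_remove(int nsems)
{
    struct semid_ds ds;
    union semun arg;
    long reported = -1;
    int semid;

    assert(nsems > 0);  /* nsems must be > 0 when a set is being created */

    semid = semget(IPC_PRIVATE, nsems, IPC_CREAT | IPC_EXCL | 0600);
    if (semid == -1) {
        perror("semget");
        return -1;
    }

    arg.buf = &ds;
    if (semctl(semid, 0, IPC_STAT, arg) == -1)   /* semnum is ignored */
        perror("semctl(IPC_STAT)");
    else
        reported = (long) ds.sem_nsems;

    /* Always remove the set so that it does not outlive the process. */
    if (semctl(semid, 0, IPC_RMID) == -1) {
        perror("semctl(IPC_RMID)");
        return -1;
    }
    return reported;
}
```

A caller might invoke create_stat_remove(3) and compare the result against the requested count; System V semaphore sets persist until removed, which is why the sketch removes the set on every path after creation.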
SEM_INFO (Linux-specific) Return a seminfo structure containing the same information as for IPC_INFO, except that the following fields are returned with information about system resources consumed by semaphores: the semusz field returns the number of semaphore sets that currently exist on the system; and the semaem field returns the total number of semaphores in all semaphore sets on the system. SEM_STAT (Linux-specific) Return a semid_ds structure as for IPC_STAT. However, the semid argument is not a semaphore identifier, but instead an index into the kernel's internal array that maintains information about all semaphore sets on the system. GETALL Return semval (i.e., the current value) for all semaphores of the set into arg.array. The argument semnum is ignored. The calling process must have read permission on the semaphore set. GETNCNT Return the value of semncnt for the semnum-th semaphore of the set (i.e., the number of processes waiting for an increase of semval for the semnum-th semaphore of the set). The calling process must have read permission on the semaphore set. GETPID Return the value of sempid for the semnum-th semaphore of the set. This is the PID of the process that last performed an operation on that semaphore (but see NOTES). The calling process must have read permission on the semaphore set. GETVAL Return the value of semval for the semnum-th semaphore of the set. The calling process must have read permission on the semaphore set. GETZCNT Return the value of semzcnt for the semnum-th semaphore of the set (i.e., the number of processes waiting for semval of the semnum-th semaphore of the set to become 0). The calling process must have read permission on the semaphore set. SETALL Set semval for all semaphores of the set using arg.array, updating also the sem_ctime member of the semid_ds structure associated with the set. Undo entries (see semop(2)) are cleared for altered semaphores in all processes. 
If the changes to semaphore values would permit blocked semop(2) calls in other processes to proceed, then those processes are woken up. The argument semnum is ignored. The calling process must have alter (write) permission on the semaphore set. SETVAL Set the value of semval to arg.val for the semnum-th semaphore of the set, updating also the sem_ctime member of the semid_ds structure associated with the set. Undo entries are cleared for altered semaphores in all processes. If the changes to semaphore values would permit blocked semop(2) calls in other processes to proceed, then those processes are woken up. The calling process must have alter permission on the semaphore set. http://man7.org/linux/man-pages/man2/semop.2.html 13 SYSTEM CALL: semop(2) - Linux manual page FUNCTIONALITY: semop, semtimedop - System V semaphore operations SYNOPSIS: #include <sys/types.h> #include <sys/ipc.h> #include <sys/sem.h> int semop(int semid, struct sembuf *sops, size_t nsops); int semtimedop(int semid, struct sembuf *sops, size_t nsops, const struct timespec *timeout); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): semtimedop(): _GNU_SOURCE DESCRIPTION Each semaphore in a System V semaphore set has the following associated values:
    unsigned short semval;   /* semaphore value */
    unsigned short semzcnt;  /* # waiting for zero */
    unsigned short semncnt;  /* # waiting for increase */
    pid_t          sempid;   /* PID of process that last modified semaphore value */
semop() performs operations on selected semaphores in the set indicated by semid. Each of the nsops elements in the array pointed to by sops is a structure that specifies an operation to be performed on a single semaphore. The elements of this structure are of type struct sembuf, containing the following members:
    unsigned short sem_num;  /* semaphore number */
    short          sem_op;   /* semaphore operation */
    short          sem_flg;  /* operation flags */
Flags recognized in sem_flg are IPC_NOWAIT and SEM_UNDO.
If an operation specifies SEM_UNDO, it will be automatically undone when the process terminates. The set of operations contained in sops is performed in array order, and atomically, that is, the operations are performed either as a complete unit, or not at all. The behavior of the system call if not all operations can be performed immediately depends on the presence of the IPC_NOWAIT flag in the individual sem_flg fields, as noted below. Each operation is performed on the sem_num-th semaphore of the semaphore set, where the first semaphore of the set is numbered 0. There are three types of operation, distinguished by the value of sem_op. If sem_op is a positive integer, the operation adds this value to the semaphore value (semval). Furthermore, if SEM_UNDO is specified for this operation, the system subtracts the value sem_op from the semaphore adjustment (semadj) value for this semaphore. This operation can always proceed—it never forces a thread to wait. The calling process must have alter permission on the semaphore set. If sem_op is zero, the process must have read permission on the semaphore set. This is a "wait-for-zero" operation: if semval is zero, the operation can immediately proceed. Otherwise, if IPC_NOWAIT is specified in sem_flg, semop() fails with errno set to EAGAIN (and none of the operations in sops is performed). Otherwise, semzcnt (the count of threads waiting until this semaphore's value becomes zero) is incremented by one and the thread sleeps until one of the following occurs: · semval becomes 0, at which time the value of semzcnt is decremented. · The semaphore set is removed: semop() fails, with errno set to EIDRM. · The calling thread catches a signal: the value of semzcnt is decremented and semop() fails, with errno set to EINTR. If sem_op is less than zero, the process must have alter permission on the semaphore set. 
If semval is greater than or equal to the absolute value of sem_op, the operation can proceed immediately: the absolute value of sem_op is subtracted from semval, and, if SEM_UNDO is specified for this operation, the system adds the absolute value of sem_op to the semaphore adjustment (semadj) value for this semaphore. If the absolute value of sem_op is greater than semval, and IPC_NOWAIT is specified in sem_flg, semop() fails, with errno set to EAGAIN (and none of the operations in sops is performed). Otherwise, semncnt (the counter of threads waiting for this semaphore's value to increase) is incremented by one and the thread sleeps until one of the following occurs: · semval becomes greater than or equal to the absolute value of sem_op: the operation now proceeds, as described above. · The semaphore set is removed from the system: semop() fails, with errno set to EIDRM. · The calling thread catches a signal: the value of semncnt is decremented and semop() fails, with errno set to EINTR. On successful completion, the sempid value for each semaphore specified in the array pointed to by sops is set to the caller's process ID. In addition, the sem_otime is set to the current time. semtimedop() semtimedop() behaves identically to semop() except that in those cases where the calling thread would sleep, the duration of that sleep is limited by the amount of elapsed time specified by the timespec structure whose address is passed in the timeout argument. (This sleep interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the interval may overrun by a small amount.) If the specified time limit has been reached, semtimedop() fails with errno set to EAGAIN (and none of the operations in sops is performed). If the timeout argument is NULL, then semtimedop() behaves exactly like semop(). Note that if semtimedop() is interrupted by a signal, causing the call to fail with the error EINTR, the contents of timeout are left unchanged.
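The three kinds of sem_op and the effect of IPC_NOWAIT and of a semtimedop() timeout can be traced on a single semaphore. This is a sketch under stated assumptions: sem_adjust() and semop_demo() are hypothetical names, the set is private (IPC_PRIVATE) with one semaphore, and the 20 ms limit is an arbitrary illustrative value.

```c
#define _GNU_SOURCE           /* required for semtimedop() */
#include <assert.h>
#include <errno.h>
#include <time.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

union semun {                 /* must be defined by the caller */
    int              val;
    struct semid_ds *buf;
    unsigned short  *array;
};

/* Hypothetical helper: apply one operation to semaphore 0 of the set. */
static int sem_adjust(int semid, short op, short flags)
{
    struct sembuf sb = { .sem_num = 0, .sem_op = op, .sem_flg = flags };
    return semop(semid, &sb, 1);
}

/*
 * Returns 0 if every step behaves as described in the text, otherwise the
 * number of the first step that did not.
 */
int semop_demo(void)
{
    int rc = 0;
    union semun arg = { .val = 1 };
    struct sembuf dec = { .sem_num = 0, .sem_op = -1, .sem_flg = 0 };
    struct timespec limit = { .tv_sec = 0, .tv_nsec = 20 * 1000 * 1000 };

    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (semid == -1)
        return 1;

    if (semctl(semid, 0, SETVAL, arg) == -1)
        rc = 2;
    /* semval (1) >= |sem_op| (1): proceeds at once, semval becomes 0. */
    else if (sem_adjust(semid, -1, IPC_NOWAIT) != 0)
        rc = 3;
    /* semval (0) < |sem_op| (1) with IPC_NOWAIT: fails with EAGAIN. */
    else if (sem_adjust(semid, -1, IPC_NOWAIT) != -1 || errno != EAGAIN)
        rc = 4;
    /* Without IPC_NOWAIT the thread would sleep; the timeout bounds the
       sleep and the call then fails with EAGAIN. */
    else if (semtimedop(semid, &dec, 1, &limit) != -1 || errno != EAGAIN)
        rc = 5;
    /* A positive sem_op can always proceed. */
    else if (sem_adjust(semid, 1, 0) != 0)
        rc = 6;
    else if (semctl(semid, 0, GETVAL) != 1)
        rc = 7;

    semctl(semid, 0, IPC_RMID);   /* remove the set in every case */
    return rc;
}
```

In a real program the decrement and increment would typically bracket a critical section, and SEM_UNDO could be added to sem_flg so that the adjustment is reverted if the process terminates while holding the semaphore.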
http://man7.org/linux/man-pages/man2/semtimedop.2.html 13 SYSTEM CALL: semop(2) - Linux manual page FUNCTIONALITY: semop, semtimedop - System V semaphore operations (semtimedop() is documented on the same manual page as semop(); see the semop(2) entry above.) http://man7.org/linux/man-pages/man2/futex.2.html 12 SYSTEM CALL: futex(2) - Linux manual page FUNCTIONALITY: futex - fast user-space locking SYNOPSIS: #include #include int futex(int *uaddr, int futex_op, int val, const struct timespec *timeout, /* or: uint32_t val2 */ int *uaddr2, int val3); Note: There is no glibc wrapper for this system call; see NOTES. DESCRIPTION The futex() system call provides a method for waiting until a certain condition becomes true. It is typically used as a blocking construct in the context of shared-memory synchronization. When using futexes, the majority of the synchronization operations are performed in user space.
A user-space program employs the futex() system call only when it is likely that the program has to block for a longer time until the condition becomes true. Other futex() operations can be used to wake any processes or threads waiting for a particular condition. A futex is a 32-bit value (referred to below as a futex word) whose address is supplied to the futex() system call. (Futexes are 32 bits in size on all platforms, including 64-bit systems.) All futex operations are governed by this value. In order to share a futex between processes, the futex is placed in a region of shared memory, created using (for example) mmap(2) or shmat(2). (Thus, the futex word may have different virtual addresses in different processes, but these addresses all refer to the same location in physical memory.) In a multithreaded program, it is sufficient to place the futex word in a global variable shared by all threads. When executing a futex operation that requests to block a thread, the kernel will block only if the futex word has the value that the calling thread supplied (as one of the arguments of the futex() call) as the expected value of the futex word. The loading of the futex word's value, the comparison of that value with the expected value, and the actual blocking will happen atomically and will be totally ordered with respect to concurrent operations performed by other threads on the same futex word. Thus, the futex word is used to connect the synchronization in user space with the implementation of blocking by the kernel. Analogously to an atomic compare-and-exchange operation that potentially changes shared memory, blocking via a futex is an atomic compare-and-block operation. One use of futexes is for implementing locks. The state of the lock (i.e., acquired or not acquired) can be represented as an atomically accessed flag in shared memory.
In the uncontended case, a thread can access or modify the lock state with atomic instructions, for example atomically changing it from not acquired to acquired using an atomic compare-and-exchange instruction. (Such instructions are performed entirely in user mode, and the kernel maintains no information about the lock state.) On the other hand, a thread may be unable to acquire a lock because it is already acquired by another thread. It then may pass the lock's flag as a futex word and the value representing the acquired state as the expected value to a futex() wait operation. This futex() operation will block if and only if the lock is still acquired (i.e., the value in the futex word still matches the "acquired state"). When releasing the lock, a thread has to first reset the lock state to not acquired and then execute a futex operation that wakes threads blocked on the lock flag used as a futex word (this can be further optimized to avoid unnecessary wake-ups). See futex(7) for more detail on how to use futexes. Besides the basic wait and wake-up futex functionality, there are further futex operations aimed at supporting more complex use cases. Note that no explicit initialization or destruction is necessary to use futexes; the kernel maintains a futex (i.e., the kernel-internal implementation artifact) only while operations such as FUTEX_WAIT, described below, are being performed on a particular futex word. Arguments The uaddr argument points to the futex word. On all platforms, futexes are four-byte integers that must be aligned on a four-byte boundary. The operation to perform on the futex is specified in the futex_op argument; val is a value whose meaning and purpose depends on futex_op. The remaining arguments (timeout, uaddr2, and val3) are required only for certain of the futex operations described below. Where one of these arguments is not required, it is ignored. 
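The lock protocol described above (atomic fast path in user space, futex() only when the lock is contended) can be sketched as follows. This is a deliberately simple illustration, not an optimized lock: the names futex(), demo_trylock(), demo_lock() and demo_unlock() are hypothetical, the GCC/Clang __atomic built-ins are assumed to be available, and the unlock path always issues a wake-up, which is exactly the unnecessary wake-up the text says can be optimized away.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

enum { UNLOCKED = 0, LOCKED = 1 };   /* lock states stored in the futex word */

/* There is no glibc wrapper, so the system call is invoked directly. */
static long futex(uint32_t *uaddr, int futex_op, uint32_t val,
                  const struct timespec *timeout, uint32_t *uaddr2,
                  uint32_t val3)
{
    return syscall(SYS_futex, uaddr, futex_op, val, timeout, uaddr2, val3);
}

/* Uncontended path: atomically change UNLOCKED to LOCKED in user space.
   Returns 1 if the lock was acquired, 0 if it was already held. */
int demo_trylock(uint32_t *word)
{
    uint32_t expected = UNLOCKED;
    return __atomic_compare_exchange_n(word, &expected, LOCKED, 0,
                                       __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}

/* Contended path: block in the kernel, but only while the word still holds
   the "acquired" value; otherwise the wait returns at once and we retry. */
void demo_lock(uint32_t *word)
{
    assert(word != NULL);
    while (!demo_trylock(word))
        futex(word, FUTEX_WAIT, LOCKED, NULL, NULL, 0);
}

/* Release: first reset the state, then wake one waiter (if any).
   Returns the number of waiters woken, or -1 on error. */
long demo_unlock(uint32_t *word)
{
    __atomic_store_n(word, UNLOCKED, __ATOMIC_RELEASE);
    return futex(word, FUTEX_WAKE, 1, NULL, NULL, 0);
}
```

With several threads, each would call demo_lock() and demo_unlock() around its critical section on a shared uint32_t initialized to 0; no explicit initialization or destruction of the futex itself is needed.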
For several blocking operations, the timeout argument is a pointer to a timespec structure that specifies a timeout for the operation. However, notwithstanding the prototype shown above, for some operations, the least significant four bytes are used as an integer whose meaning is determined by the operation. For these operations, the kernel casts the timeout value first to unsigned long, then to uint32_t, and in the remainder of this page, this argument is referred to as val2 when interpreted in this fashion. Where it is required, the uaddr2 argument is a pointer to a second futex word that is employed by the operation. The interpretation of the final integer argument, val3, depends on the operation. Futex operations The futex_op argument consists of two parts: a command that specifies the operation to be performed, bit-wise ORed with zero or more options that modify the behavior of the operation. The options that may be included in futex_op are as follows: FUTEX_PRIVATE_FLAG (since Linux 2.6.22) This option bit can be employed with all futex operations. It tells the kernel that the futex is process-private and not shared with another process (i.e., it is being used for synchronization only between threads of the same process). This allows the kernel to make some additional performance optimizations. As a convenience, <linux/futex.h> defines a set of constants with the suffix _PRIVATE that are equivalents of all of the operations listed below, but with the FUTEX_PRIVATE_FLAG ORed into the constant value. Thus, there are FUTEX_WAIT_PRIVATE, FUTEX_WAKE_PRIVATE, and so on. FUTEX_CLOCK_REALTIME (since Linux 2.6.28) This option bit can be employed only with the FUTEX_WAIT_BITSET, FUTEX_WAIT_REQUEUE_PI, and (since Linux 4.5) FUTEX_WAIT operations. If this option is set, the kernel measures the timeout against the CLOCK_REALTIME clock. If this option is not set, the kernel measures the timeout against the CLOCK_MONOTONIC clock.
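The split of futex_op into a command and option bits can be made concrete with a few bit operations. The helper names below are hypothetical; the constants come from <linux/futex.h>.

```c
#include <linux/futex.h>

/* Hypothetical helpers that take a futex_op value apart. */

/* The command is what remains after clearing both option bits. */
int futex_command(int futex_op)
{
    return futex_op & ~(FUTEX_PRIVATE_FLAG | FUTEX_CLOCK_REALTIME);
}

/* Nonzero if the futex is declared process-private. */
int futex_is_private(int futex_op)
{
    return (futex_op & FUTEX_PRIVATE_FLAG) != 0;
}

/* Nonzero if timeouts are measured against CLOCK_REALTIME. */
int futex_uses_realtime(int futex_op)
{
    return (futex_op & FUTEX_CLOCK_REALTIME) != 0;
}
```

For example, futex_command(FUTEX_WAKE_PRIVATE) yields FUTEX_WAKE, because the _PRIVATE constants are simply the plain commands with FUTEX_PRIVATE_FLAG ORed in.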
The operation specified in futex_op is one of the following: FUTEX_WAIT (since Linux 2.6.0) This operation tests that the value at the futex word pointed to by the address uaddr still contains the expected value val, and if so, then sleeps waiting for a FUTEX_WAKE operation on the futex word. The load of the value of the futex word is an atomic memory access (i.e., using atomic machine instructions of the respective architecture). This load, the comparison with the expected value, and starting to sleep are performed atomically and totally ordered with respect to other futex operations on the same futex word. If the thread starts to sleep, it is considered a waiter on this futex word. If the futex value does not match val, then the call fails immediately with the error EAGAIN. The purpose of the comparison with the expected value is to prevent lost wake-ups. If another thread changed the value of the futex word after the calling thread decided to block based on the prior value, and if the other thread executed a FUTEX_WAKE operation (or similar wake-up) after the value change and before this FUTEX_WAIT operation, then the calling thread will observe the value change and will not start to sleep. If the timeout is not NULL, the structure it points to specifies a timeout for the wait. (This interval will be rounded up to the system clock granularity, and is guaranteed not to expire early.) The timeout is by default measured according to the CLOCK_MONOTONIC clock, but, since Linux 4.5, the CLOCK_REALTIME clock can be selected by specifying FUTEX_CLOCK_REALTIME in futex_op. If timeout is NULL, the call blocks indefinitely. Note: for FUTEX_WAIT, timeout is interpreted as a relative value. This differs from other futex operations, where timeout is interpreted as an absolute value. To obtain the equivalent of FUTEX_WAIT with an absolute timeout, employ FUTEX_WAIT_BITSET with val3 specified as FUTEX_BITSET_MATCH_ANY. The arguments uaddr2 and val3 are ignored. 
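The two ways a FUTEX_WAIT call can return without a wake-up (value mismatch and timeout expiry) can be observed from a single thread. A minimal sketch, assuming a hypothetical helper wait_on_word() and an arbitrary millisecond timeout; ETIMEDOUT is the error futex() reports when the timeout expires.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait on *word while it holds `expected`, for at most
 * timeout_ms milliseconds (a relative timeout, as FUTEX_WAIT requires).
 * Returns 0 if the thread slept and was woken, otherwise the errno value.
 */
int wait_on_word(uint32_t *word, uint32_t expected, long timeout_ms)
{
    struct timespec rel = {
        .tv_sec  = timeout_ms / 1000,
        .tv_nsec = (timeout_ms % 1000) * 1000000L,
    };

    assert(word != NULL && timeout_ms >= 0);

    /* uaddr2 and val3 are ignored by FUTEX_WAIT. */
    if (syscall(SYS_futex, word, FUTEX_WAIT, expected, &rel, NULL, 0) == -1)
        return errno;
    return 0;
}
```

With a word holding 5, wait_on_word(&word, 4, 50) fails at once with EAGAIN because the expected value does not match, whereas wait_on_word(&word, 5, 50) sleeps and, absent any waker, fails with ETIMEDOUT once the interval has elapsed.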
FUTEX_WAKE (since Linux 2.6.0) This operation wakes at most val of the waiters that are waiting (e.g., inside FUTEX_WAIT) on the futex word at the address uaddr. Most commonly, val is specified as either 1 (wake up a single waiter) or INT_MAX (wake up all waiters). No guarantee is provided about which waiters are awoken (e.g., a waiter with a higher scheduling priority is not guaranteed to be awoken in preference to a waiter with a lower priority). The arguments timeout, uaddr2, and val3 are ignored. FUTEX_FD (from Linux 2.6.0 up to and including Linux 2.6.25) This operation creates a file descriptor that is associated with the futex at uaddr. The caller must close the returned file descriptor after use. When another process or thread performs a FUTEX_WAKE on the futex word, the file descriptor is indicated as being readable with select(2), poll(2), and epoll(7). The file descriptor can be used to obtain asynchronous notifications: if val is nonzero, then, when another process or thread executes a FUTEX_WAKE, the caller will receive the signal number that was passed in val. The arguments timeout, uaddr2 and val3 are ignored. Because it was inherently racy, FUTEX_FD has been removed from Linux 2.6.26 onward. FUTEX_REQUEUE (since Linux 2.6.0) This operation performs the same task as FUTEX_CMP_REQUEUE (see below), except that no check is made using the value in val3. (The argument val3 is ignored.) FUTEX_CMP_REQUEUE (since Linux 2.6.7) This operation first checks whether the location uaddr still contains the value val3. If not, the operation fails with the error EAGAIN. Otherwise, the operation wakes up a maximum of val waiters that are waiting on the futex at uaddr. If there are more than val waiters, then the remaining waiters are removed from the wait queue of the source futex at uaddr and added to the wait queue of the target futex at uaddr2. The val2 argument specifies an upper limit on the number of waiters that are requeued to the futex at uaddr2.
The load from uaddr is an atomic memory access (i.e., using atomic machine instructions of the respective architecture). This load, the comparison with val3, and the requeueing of any waiters are performed atomically and totally ordered with respect to other operations on the same futex word. Typical values to specify for val are 0 or 1. (Specifying INT_MAX is not useful, because it would make the FUTEX_CMP_REQUEUE operation equivalent to FUTEX_WAKE.) The limit value specified via val2 is typically either 1 or INT_MAX. (Specifying the argument as 0 is not useful, because no waiters would then be requeued, which would make the FUTEX_CMP_REQUEUE operation equivalent to FUTEX_WAKE.) The FUTEX_CMP_REQUEUE operation was added as a replacement for the earlier FUTEX_REQUEUE. The difference is that the check of the value at uaddr can be used to ensure that requeueing happens only under certain conditions, which allows race conditions to be avoided in certain use cases. Both FUTEX_REQUEUE and FUTEX_CMP_REQUEUE can be used to avoid "thundering herd" wake-ups that could occur when using FUTEX_WAKE in cases where all of the waiters that are woken need to acquire another futex. Consider the following scenario, where multiple waiter threads are waiting on B, a wait queue implemented using a futex:
    lock(A)
    while (!check_value(V)) {
        unlock(A);
        block_on(B);
        lock(A);
    };
    unlock(A);
If a waker thread used FUTEX_WAKE, then all waiters waiting on B would be woken up, and they would all try to acquire lock A. However, waking all of the threads in this manner would be pointless because all except one of the threads would immediately block on lock A again. By contrast, a requeue operation wakes just one waiter and moves the other waiters to lock A, and when the woken waiter unlocks A then the next waiter can proceed. FUTEX_WAKE_OP (since Linux 2.6.14) This operation was added to support some user-space use cases where more than one futex must be handled at the same time.
The most notable example is the implementation of pthread_cond_signal(3), which requires operations on two futexes, the one used to implement the mutex and the one used in the implementation of the wait queue associated with the condition variable. FUTEX_WAKE_OP allows such cases to be implemented without leading to high rates of contention and context switching. The FUTEX_WAKE_OP operation is equivalent to executing the following code atomically and totally ordered with respect to other futex operations on any of the two supplied futex words:
    int oldval = *(int *) uaddr2;
    *(int *) uaddr2 = oldval op oparg;
    futex(uaddr, FUTEX_WAKE, val, 0, 0, 0);
    if (oldval cmp cmparg)
        futex(uaddr2, FUTEX_WAKE, val2, 0, 0, 0);
In other words, FUTEX_WAKE_OP does the following:
* saves the original value of the futex word at uaddr2 and performs an operation to modify the value of the futex at uaddr2; this is an atomic read-modify-write memory access (i.e., using atomic machine instructions of the respective architecture)
* wakes up a maximum of val waiters on the futex for the futex word at uaddr; and
* dependent on the results of a test of the original value of the futex word at uaddr2, wakes up a maximum of val2 waiters on the futex for the futex word at uaddr2.
The operation and comparison that are to be performed are encoded in the bits of the argument val3. Pictorially, the encoding is:
    +---+---+-----------+-----------+
    |op |cmp|   oparg   |  cmparg   |
    +---+---+-----------+-----------+
      4   4       12          12    <== # of bits
Expressed in code, the encoding is:
    #define FUTEX_OP(op, oparg, cmp, cmparg) \
                    (((op & 0xf) << 28) | \
                    ((cmp & 0xf) << 24) | \
                    ((oparg & 0xfff) << 12) | \
                    (cmparg & 0xfff))
In the above, op and cmp are each one of the codes listed below. The oparg and cmparg components are literal numeric values, except as noted below.
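To make the bit layout concrete, the following sketch packs a val3 value with the FUTEX_OP() macro from <linux/futex.h> and unpacks it again field by field. The struct wake_op type and decode_wake_op() are hypothetical names introduced only for this illustration; FUTEX_OP_ADD and FUTEX_OP_CMP_GT are two of the op and cmp codes defined in the same header.

```c
#include <linux/futex.h>
#include <stdint.h>

/* Hypothetical container for the four fields packed into val3. */
struct wake_op {
    unsigned op;      /* bits 28..31: operation applied to *uaddr2 */
    unsigned cmp;     /* bits 24..27: comparison applied to oldval */
    unsigned oparg;   /* bits 12..23: operand of the operation */
    unsigned cmparg;  /* bits  0..11: operand of the comparison */
};

/* Reverse the FUTEX_OP() packing. */
struct wake_op decode_wake_op(uint32_t val3)
{
    struct wake_op d = {
        .op     = (val3 >> 28) & 0xf,
        .cmp    = (val3 >> 24) & 0xf,
        .oparg  = (val3 >> 12) & 0xfff,
        .cmparg = val3 & 0xfff,
    };
    return d;
}
```

For instance, FUTEX_OP(FUTEX_OP_ADD, 1, FUTEX_OP_CMP_GT, 0) requests "add 1 to the word at uaddr2, and wake waiters there if the old value was greater than 0"; with op code 1 and cmp code 4 it evaluates to (1 << 28) | (4 << 24) | (1 << 12) | 0 = 0x14001000.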
The op component has one of the following values:
    FUTEX_OP_SET        0  /* uaddr2 = oparg; */
    FUTEX_OP_ADD        1  /* uaddr2 += oparg; */
    FUTEX_OP_OR         2  /* uaddr2 |= oparg; */
    FUTEX_OP_ANDN       3  /* uaddr2 &= ~oparg; */
    FUTEX_OP_XOR        4  /* uaddr2 ^= oparg; */
In addition, bit-wise ORing the following value into op causes (1 << oparg) to be used as the operand:
    FUTEX_OP_ARG_SHIFT  8  /* Use (1 << oparg) as operand */
The cmp field is one of the following:
    FUTEX_OP_CMP_EQ     0  /* if (oldval == cmparg) wake */
    FUTEX_OP_CMP_NE     1  /* if (oldval != cmparg) wake */
    FUTEX_OP_CMP_LT     2  /* if (oldval < cmparg) wake */
    FUTEX_OP_CMP_LE     3  /* if (oldval <= cmparg) wake */
    FUTEX_OP_CMP_GT     4  /* if (oldval > cmparg) wake */
    FUTEX_OP_CMP_GE     5  /* if (oldval >= cmparg) wake */
The return value of FUTEX_WAKE_OP is the sum of the number of waiters woken on the futex uaddr plus the number of waiters woken on the futex uaddr2. FUTEX_WAIT_BITSET (since Linux 2.6.25) This operation is like FUTEX_WAIT except that val3 is used to provide a 32-bit bit mask to the kernel. This bit mask, in which at least one bit must be set, is stored in the kernel-internal state of the waiter. See the description of FUTEX_WAKE_BITSET for further details. If timeout is not NULL, the structure it points to specifies an absolute timeout for the wait operation. If timeout is NULL, the operation can block indefinitely. The uaddr2 argument is ignored. FUTEX_WAKE_BITSET (since Linux 2.6.25) This operation is the same as FUTEX_WAKE except that the val3 argument is used to provide a 32-bit bit mask to the kernel. This bit mask, in which at least one bit must be set, is used to select which waiters should be woken up. The selection is done by a bit-wise AND of the "wake" bit mask (i.e., the value in val3) and the bit mask which is stored in the kernel-internal state of the waiter (the "wait" bit mask that is set using FUTEX_WAIT_BITSET).
All of the waiters for which the result of the AND is nonzero are woken up; the remaining waiters are left sleeping. The effect of FUTEX_WAIT_BITSET and FUTEX_WAKE_BITSET is to allow selective wake-ups among multiple waiters that are blocked on the same futex. However, note that, depending on the use case, employing this bit-mask multiplexing feature on a futex can be less efficient than simply using multiple futexes, because employing bit-mask multiplexing requires the kernel to check all waiters on a futex, including those that are not interested in being woken up (i.e., they do not have the relevant bit set in their "wait" bit mask). The constant FUTEX_BITSET_MATCH_ANY, which corresponds to all 32 bits set in the bit mask, can be used as the val3 argument for FUTEX_WAIT_BITSET and FUTEX_WAKE_BITSET. Other than differences in the handling of the timeout argument, the FUTEX_WAIT operation is equivalent to FUTEX_WAIT_BITSET with val3 specified as FUTEX_BITSET_MATCH_ANY; that is, allow a wake-up by any waker. The FUTEX_WAKE operation is equivalent to FUTEX_WAKE_BITSET with val3 specified as FUTEX_BITSET_MATCH_ANY; that is, wake up any waiter(s). The uaddr2 and timeout arguments are ignored. Priority-inheritance futexes Linux supports priority-inheritance (PI) futexes in order to handle priority-inversion problems that can be encountered with normal futex locks. Priority inversion is the problem that occurs when a high-priority task is blocked waiting to acquire a lock held by a low-priority task, while tasks at an intermediate priority continuously preempt the low-priority task from the CPU. Consequently, the low-priority task makes no progress toward releasing the lock, and the high-priority task remains blocked. Priority inheritance is a mechanism for dealing with the priority-inversion problem.
With this mechanism, when a high-priority task becomes blocked by a lock held by a low-priority task, the priority of the low-priority task is temporarily raised to that of the high-priority task, so that it is not preempted by any intermediate level tasks, and can thus make progress toward releasing the lock. To be effective, priority inheritance must be transitive, meaning that if a high-priority task blocks on a lock held by a lower-priority task that is itself blocked by a lock held by another intermediate-priority task (and so on, for chains of arbitrary length), then both of those tasks (or more generally, all of the tasks in a lock chain) have their priorities raised to be the same as the high-priority task. From a user-space perspective, what makes a futex PI-aware is a policy agreement (described below) between user space and the kernel about the value of the futex word, coupled with the use of the PI-futex operations described below. (Unlike the other futex operations described above, the PI-futex operations are designed for the implementation of very specific IPC mechanisms.) The PI-futex operations described below differ from the other futex operations in that they impose policy on the use of the value of the futex word:

* If the lock is not acquired, the futex word's value shall be 0.
* If the lock is acquired, the futex word's value shall be the thread ID (TID; see gettid(2)) of the owning thread.
* If the lock is owned and there are threads contending for the lock, then the FUTEX_WAITERS bit shall be set in the futex word's value; in other words, this value is: FUTEX_WAITERS | TID (Note that it is invalid for a PI futex word to have no owner and FUTEX_WAITERS set.)

With this policy in place, a user-space application can acquire an unacquired lock or release a lock using atomic instructions executed in user mode (e.g., a compare-and-swap operation such as cmpxchg on the x86 architecture).
Acquiring a lock simply consists of using compare-and-swap to atomically set the futex word's value to the caller's TID if its previous value was 0. Releasing a lock requires using compare-and-swap to set the futex word's value to 0 if the previous value was the expected TID. If a futex is already acquired (i.e., has a nonzero value), waiters must employ the FUTEX_LOCK_PI operation to acquire the lock. If other threads are waiting for the lock, then the FUTEX_WAITERS bit is set in the futex value; in this case, the lock owner must employ the FUTEX_UNLOCK_PI operation to release the lock. In the cases where callers are forced into the kernel (i.e., required to perform a futex() call), they then deal directly with a so-called RT-mutex, a kernel locking mechanism which implements the required priority-inheritance semantics. After the RT-mutex is acquired, the futex value is updated accordingly, before the calling thread returns to user space. It is important to note that the kernel will update the futex word's value prior to returning to user space. (This prevents the possibility of the futex word's value ending up in an invalid state, such as having an owner but the value being 0, or having waiters but not having the FUTEX_WAITERS bit set.) If a futex has an associated RT-mutex in the kernel (i.e., there are blocked waiters) and the owner of the futex/RT-mutex dies unexpectedly, then the kernel cleans up the RT-mutex and hands it over to the next waiter. This in turn requires that the user-space value is updated accordingly. To indicate that this is required, the kernel sets the FUTEX_OWNER_DIED bit in the futex word along with the thread ID of the new owner. User space can detect this situation via the presence of the FUTEX_OWNER_DIED bit and is then responsible for cleaning up the stale state left over by the dead owner. PI futexes are operated on by specifying one of the values listed below in futex_op. 
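The user-space fast path described above can be sketched with C11 atomics. This is a minimal illustration, not a complete mutex: the helper names (self_tid, pi_lock, pi_unlock) are hypothetical, the futex system call is reached through syscall(2), and the slow path simply hands over to FUTEX_LOCK_PI or FUTEX_UNLOCK_PI without handling FUTEX_OWNER_DIED or retrying on errors.

```c
#define _GNU_SOURCE
#include <linux/futex.h>   /* FUTEX_LOCK_PI, FUTEX_UNLOCK_PI */
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>   /* SYS_futex, SYS_gettid */
#include <unistd.h>        /* syscall() */

/* Hypothetical helper: the caller's TID, as stored in a PI futex word. */
static uint32_t self_tid(void)
{
    return (uint32_t)syscall(SYS_gettid);
}

/* Acquire: compare-and-swap 0 -> TID in user space; on failure the word
 * was nonzero, so the kernel is asked to block us via FUTEX_LOCK_PI. */
static int pi_lock(_Atomic uint32_t *word)
{
    uint32_t expected = 0;
    if (atomic_compare_exchange_strong(word, &expected, self_tid()))
        return 0;   /* uncontended: no system call was needed */
    return (int)syscall(SYS_futex, word, FUTEX_LOCK_PI, 0, NULL, NULL, 0);
}

/* Release: compare-and-swap TID -> 0. The swap fails when the word is
 * FUTEX_WAITERS | TID, in which case FUTEX_UNLOCK_PI must be used. */
static int pi_unlock(_Atomic uint32_t *word)
{
    uint32_t expected = self_tid();
    if (atomic_compare_exchange_strong(word, &expected, 0))
        return 0;
    return (int)syscall(SYS_futex, word, FUTEX_UNLOCK_PI, 0, NULL, NULL, 0);
}
```

In the uncontended case neither function enters the kernel, which is the point of the value policy: the futex word alone tells user space whether a system call is required.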
Note that the PI futex operations must be used as paired operations and are subject to some additional requirements: * FUTEX_LOCK_PI and FUTEX_TRYLOCK_PI pair with FUTEX_UNLOCK_PI. FUTEX_UNLOCK_PI must be called only on a futex owned by the calling thread, as defined by the value policy, otherwise the error EPERM results. * FUTEX_WAIT_REQUEUE_PI pairs with FUTEX_CMP_REQUEUE_PI. This must be performed from a non-PI futex to a distinct PI futex (or the error EINVAL results). Additionally, val (the number of waiters to be woken) must be 1 (or the error EINVAL results). The PI futex operations are as follows: FUTEX_LOCK_PI (since Linux 2.6.18) This operation is used after an attempt to acquire the lock via an atomic user-mode instruction failed because the futex word has a nonzero value—specifically, because it contained the (PID-namespace-specific) TID of the lock owner. The operation checks the value of the futex word at the address uaddr. If the value is 0, then the kernel tries to atomically set the futex value to the caller's TID. If the futex word's value is nonzero, the kernel atomically sets the FUTEX_WAITERS bit, which signals the futex owner that it cannot unlock the futex in user space atomically by setting the futex value to 0. After that, the kernel: 1. Tries to find the thread which is associated with the owner TID. 2. Creates or reuses kernel state on behalf of the owner. (If this is the first waiter, there is no kernel state for this futex, so kernel state is created by locking the RT-mutex and the futex owner is made the owner of the RT-mutex. If there are existing waiters, then the existing state is reused.) 3. Attaches the waiter to the futex (i.e., the waiter is enqueued on the RT-mutex waiter list). If more than one waiter exists, the enqueueing of the waiter is in descending priority order. (For information on priority ordering, see the discussion of the SCHED_DEADLINE, SCHED_FIFO, and SCHED_RR scheduling policies in sched(7).) 
The owner inherits either the waiter's CPU bandwidth (if the waiter is scheduled under the SCHED_DEADLINE policy) or the waiter's priority (if the waiter is scheduled under the SCHED_RR or SCHED_FIFO policy). This inheritance follows the lock chain in the case of nested locking and performs deadlock detection. The timeout argument provides a timeout for the lock attempt. If timeout is not NULL, the structure it points to specifies an absolute timeout, measured against the CLOCK_REALTIME clock. If timeout is NULL, the operation will block indefinitely. The uaddr2, val, and val3 arguments are ignored. FUTEX_TRYLOCK_PI (since Linux 2.6.18) This operation tries to acquire the lock at uaddr. It is invoked when a user-space atomic acquire did not succeed because the futex word was not 0. Because the kernel has access to more state information than user space, acquisition of the lock might succeed if performed by the kernel in cases where the futex word (i.e., the state information accessible to user space) contains stale state (FUTEX_WAITERS and/or FUTEX_OWNER_DIED). This can happen when the owner of the futex died. User space cannot handle this condition in a race-free manner, but the kernel can fix this up and acquire the futex. The uaddr2, val, timeout, and val3 arguments are ignored. FUTEX_UNLOCK_PI (since Linux 2.6.18) This operation wakes the top priority waiter that is waiting in FUTEX_LOCK_PI on the futex address provided by the uaddr argument. This is called when the user-space value at uaddr cannot be changed atomically from a TID (of the owner) to 0. The uaddr2, val, timeout, and val3 arguments are ignored. FUTEX_CMP_REQUEUE_PI (since Linux 2.6.31) This operation is a PI-aware variant of FUTEX_CMP_REQUEUE. It requeues waiters that are blocked via FUTEX_WAIT_REQUEUE_PI on uaddr from a non-PI source futex (uaddr) to a PI target futex (uaddr2). As with FUTEX_CMP_REQUEUE, this operation wakes up a maximum of val waiters that are waiting on the futex at uaddr.
However, for FUTEX_CMP_REQUEUE_PI, val is required to be 1 (since the main point is to avoid a thundering herd). The remaining waiters are removed from the wait queue of the source futex at uaddr and added to the wait queue of the target futex at uaddr2. The val2 and val3 arguments serve the same purposes as for FUTEX_CMP_REQUEUE. FUTEX_WAIT_REQUEUE_PI (since Linux 2.6.31) Wait on a non-PI futex at uaddr and potentially be requeued (via a FUTEX_CMP_REQUEUE_PI operation in another task) onto a PI futex at uaddr2. The wait operation on uaddr is the same as for FUTEX_WAIT. The waiter can be removed from the wait on uaddr without requeueing on uaddr2 via a FUTEX_WAKE operation in another task. In this case, the FUTEX_WAIT_REQUEUE_PI operation fails with the error EAGAIN. If timeout is not NULL, the structure it points to specifies an absolute timeout for the wait operation. If timeout is NULL, the operation can block indefinitely. The val3 argument is ignored. The FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI operations were added to support a fairly specific use case: support for priority-inheritance-aware POSIX threads condition variables. The idea is that these operations should always be paired, in order to ensure that user space and the kernel remain in sync. Thus, in the FUTEX_WAIT_REQUEUE_PI operation, the user-space application pre-specifies the target of the requeue that takes place in the FUTEX_CMP_REQUEUE_PI operation. http://man7.org/linux/man-pages/man2/set_robust_list.2.html 10 SYSTEM CALL: get_robust_list(2) - Linux manual page FUNCTIONALITY: get_robust_list, set_robust_list - get/set list of robust futexes SYNOPSIS: #include #include #include long get_robust_list(int pid, struct robust_list_head **head_ptr, size_t *len_ptr); long set_robust_list(struct robust_list_head *head, size_t len); Note: There are no glibc wrappers for these system calls; see NOTES.
DESCRIPTION The robust futex implementation needs to maintain per-thread lists of the robust futexes which are to be unlocked when the thread exits. These lists are managed in user space; the kernel is notified about only the location of the head of the list. The get_robust_list() system call returns the head of the robust futex list of the thread whose thread ID is specified in pid. If pid is 0, the head of the list for the calling thread is returned. The list head is stored in the location pointed to by head_ptr. The size of the object pointed to by **head_ptr is stored in len_ptr. Permission to employ get_robust_list() is governed by a ptrace access mode PTRACE_MODE_READ_REALCREDS check; see ptrace(2). The set_robust_list() system call requests the kernel to record the head of the list of robust futexes owned by the calling thread. The head argument is the list head to record. The len argument should be sizeof(*head). http://man7.org/linux/man-pages/man2/get_robust_list.2.html 10 SYSTEM CALL: get_robust_list(2) - Linux manual page FUNCTIONALITY: get_robust_list, set_robust_list - get/set list of robust futexes SYNOPSIS: #include #include #include long get_robust_list(int pid, struct robust_list_head **head_ptr, size_t *len_ptr); long set_robust_list(struct robust_list_head *head, size_t len); Note: There are no glibc wrappers for these system calls; see NOTES. DESCRIPTION The robust futex implementation needs to maintain per-thread lists of the robust futexes which are to be unlocked when the thread exits. These lists are managed in user space; the kernel is notified about only the location of the head of the list. The get_robust_list() system call returns the head of the robust futex list of the thread whose thread ID is specified in pid. If pid is 0, the head of the list for the calling thread is returned. The list head is stored in the location pointed to by head_ptr. The size of the object pointed to by **head_ptr is stored in len_ptr. 
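Because glibc provides no wrapper, get_robust_list() has to be reached through syscall(2). The following is a minimal sketch of such a call for the calling thread (pid 0); the wrapper names are hypothetical, and struct robust_list_head is assumed to come from the kernel header linux/futex.h.

```c
#define _GNU_SOURCE
#include <linux/futex.h>   /* struct robust_list_head */
#include <stddef.h>
#include <sys/syscall.h>   /* SYS_get_robust_list */
#include <unistd.h>        /* syscall() */

/* Thin wrapper around the raw system call (hypothetical name). */
static long get_robust_list_raw(int pid, struct robust_list_head **head_ptr,
                                size_t *len_ptr)
{
    return syscall(SYS_get_robust_list, pid, head_ptr, len_ptr);
}

/* Fetch the calling thread's list head and the size the kernel reports
 * for it. Returns 0 on success, -1 on error (errno is set). */
static int own_robust_list(struct robust_list_head **head, size_t *len)
{
    return get_robust_list_raw(0, head, len) == 0 ? 0 : -1;
}
```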
Permission to employ get_robust_list() is governed by a ptrace access mode PTRACE_MODE_READ_REALCREDS check; see ptrace(2). The set_robust_list() system call requests the kernel to record the head of the list of robust futexes owned by the calling thread. The head argument is the list head to record. The len argument should be sizeof(*head). http://man7.org/linux/man-pages/man2/msgget.2.html 11 SYSTEM CALL: msgget(2) - Linux manual page FUNCTIONALITY: msgget - get a System V message queue identifier SYNOPSIS: #include #include #include int msgget(key_t key, int msgflg); DESCRIPTION The msgget() system call returns the System V message queue identifier associated with the value of the key argument. A new message queue is created if key has the value IPC_PRIVATE, or if key isn't IPC_PRIVATE, no message queue with the given key exists, and IPC_CREAT is specified in msgflg. If msgflg specifies both IPC_CREAT and IPC_EXCL and a message queue already exists for key, then msgget() fails with errno set to EEXIST. (This is analogous to the effect of the combination O_CREAT | O_EXCL for open(2).) Upon creation, the least significant bits of the argument msgflg define the permissions of the message queue. These permission bits have the same format and semantics as the permissions specified for the mode argument of open(2). (The execute permissions are not used.) If a new message queue is created, then its associated data structure msqid_ds (see msgctl(2)) is initialized as follows: msg_perm.cuid and msg_perm.uid are set to the effective user ID of the calling process. msg_perm.cgid and msg_perm.gid are set to the effective group ID of the calling process. The least significant 9 bits of msg_perm.mode are set to the least significant 9 bits of msgflg. msg_qnum, msg_lspid, msg_lrpid, msg_stime, and msg_rtime are set to 0. msg_ctime is set to the current time. msg_qbytes is set to the system limit MSGMNB.
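The interaction of IPC_CREAT, IPC_EXCL, and EEXIST can be sketched as a create-or-open helper. The function name and the created flag are hypothetical, and the sketch ignores the race in which another process removes the queue between the two msgget() calls.

```c
#include <errno.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/types.h>

/*
 * Create the queue for key if none exists, otherwise open the existing
 * one. *created is set to 1 or 0 to report which case occurred.
 * Returns the queue identifier, or -1 on error.
 */
static int open_or_create_queue(key_t key, int perms, int *created)
{
    /* IPC_CREAT | IPC_EXCL succeeds only if no queue exists for key. */
    int id = msgget(key, IPC_CREAT | IPC_EXCL | (perms & 0777));
    if (id >= 0) {
        *created = 1;
        return id;
    }
    if (errno != EEXIST)
        return -1;
    *created = 0;
    return msgget(key, 0);   /* open the queue that already exists */
}
```

Only the low 9 bits of perms are passed on, matching the statement that those bits of msgflg become the queue's permissions.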
If the message queue already exists, the permissions are verified, and a check is made to see if it is marked for destruction. http://man7.org/linux/man-pages/man2/msgctl.2.html 10 SYSTEM CALL: msgctl(2) - Linux manual page FUNCTIONALITY: msgctl - System V message control operations SYNOPSIS: #include #include #include int msgctl(int msqid, int cmd, struct msqid_ds *buf); DESCRIPTION msgctl() performs the control operation specified by cmd on the System V message queue with identifier msqid. The msqid_ds data structure is defined in <sys/msg.h> as follows:

    struct msqid_ds {
        struct ipc_perm msg_perm;     /* Ownership and permissions */
        time_t          msg_stime;    /* Time of last msgsnd(2) */
        time_t          msg_rtime;    /* Time of last msgrcv(2) */
        time_t          msg_ctime;    /* Time of last change */
        unsigned long   __msg_cbytes; /* Current number of bytes in
                                         queue (nonstandard) */
        msgqnum_t       msg_qnum;     /* Current number of messages
                                         in queue */
        msglen_t        msg_qbytes;   /* Maximum number of bytes
                                         allowed in queue */
        pid_t           msg_lspid;    /* PID of last msgsnd(2) */
        pid_t           msg_lrpid;    /* PID of last msgrcv(2) */
    };

The ipc_perm structure is defined as follows (the uid, gid, and mode fields are settable using IPC_SET):

    struct ipc_perm {
        key_t          __key;       /* Key supplied to msgget(2) */
        uid_t          uid;         /* Effective UID of owner */
        gid_t          gid;         /* Effective GID of owner */
        uid_t          cuid;        /* Effective UID of creator */
        gid_t          cgid;        /* Effective GID of creator */
        unsigned short mode;        /* Permissions */
        unsigned short __seq;       /* Sequence number */
    };

Valid values for cmd are: IPC_STAT Copy information from the kernel data structure associated with msqid into the msqid_ds structure pointed to by buf. The caller must have read permission on the message queue. IPC_SET Write the values of some members of the msqid_ds structure pointed to by buf to the kernel data structure associated with this message queue, updating also its msg_ctime member.
The following members of the structure are updated: msg_qbytes, msg_perm.uid, msg_perm.gid, and (the least significant 9 bits of) msg_perm.mode. The effective UID of the calling process must match the owner (msg_perm.uid) or creator (msg_perm.cuid) of the message queue, or the caller must be privileged. Appropriate privilege (Linux: the CAP_SYS_RESOURCE capability) is required to raise the msg_qbytes value beyond the system parameter MSGMNB. IPC_RMID Immediately remove the message queue, awakening all waiting reader and writer processes (with an error return and errno set to EIDRM). The calling process must have appropriate privileges or its effective user ID must be either that of the creator or owner of the message queue. The third argument to msgctl() is ignored in this case. IPC_INFO (Linux-specific) Return information about system-wide message queue limits and parameters in the structure pointed to by buf. This structure is of type msginfo (thus, a cast is required), defined in <sys/msg.h> if the _GNU_SOURCE feature test macro is defined:

    struct msginfo {
        int msgpool; /* Size in kibibytes of buffer pool used
                        to hold message data; unused within kernel */
        int msgmap;  /* Maximum number of entries in message
                        map; unused within kernel */
        int msgmax;  /* Maximum number of bytes that can be
                        written in a single message */
        int msgmnb;  /* Maximum number of bytes that can be written to
                        queue; used to initialize msg_qbytes during
                        queue creation (msgget(2)) */
        int msgmni;  /* Maximum number of message queues */
        int msgssz;  /* Message segment size; unused within kernel */
        int msgtql;  /* Maximum number of messages on all queues
                        in system; unused within kernel */
        unsigned short int msgseg;
                     /* Maximum number of segments;
                        unused within kernel */
    };

The msgmni, msgmax, and msgmnb settings can be changed via /proc files of the same name; see proc(5) for details.
MSG_INFO (Linux-specific) Return a msginfo structure containing the same information as for IPC_INFO, except that the following fields are returned with information about system resources consumed by message queues: the msgpool field returns the number of message queues that currently exist on the system; the msgmap field returns the total number of messages in all queues on the system; and the msgtql field returns the total number of bytes in all messages in all queues on the system. MSG_STAT (Linux-specific) Return a msqid_ds structure as for IPC_STAT. However, the msqid argument is not a queue identifier, but instead an index into the kernel's internal array that maintains information about all message queues on the system. http://man7.org/linux/man-pages/man2/msgsnd.2.html 12 SYSTEM CALL: msgop(2) - Linux manual page FUNCTIONALITY: msgrcv, msgsnd - System V message queue operations SYNOPSIS: #include #include #include int msgsnd(int msqid, const void *msgp, size_t msgsz, int msgflg); ssize_t msgrcv(int msqid, void *msgp, size_t msgsz, long msgtyp, int msgflg); DESCRIPTION The msgsnd() and msgrcv() system calls are used, respectively, to send messages to, and receive messages from, a System V message queue. The calling process must have write permission on the message queue in order to send a message, and read permission to receive a message. The msgp argument is a pointer to a caller-defined structure of the following general form: struct msgbuf { long mtype; /* message type, must be > 0 */ char mtext[1]; /* message data */ }; The mtext field is an array (or other structure) whose size is specified by msgsz, a nonnegative integer value. Messages of zero length (i.e., no mtext field) are permitted. The mtype field must have a strictly positive integer value. This value can be used by the receiving process for message selection (see the description of msgrcv() below). 
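The general msgbuf form and the use of mtype for selection can be made concrete as follows. This is a sketch under stated assumptions: the struct text_msg layout, the TEXT_MAX capacity, and the helper names are illustrative choices, and both helpers use IPC_NOWAIT so that they never block. Selection follows the msgtyp rules given in the description of msgrcv().

```c
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/types.h>

#define TEXT_MAX 32   /* illustrative payload capacity */

/* Concrete instance of the general form: mtype first, then the data. */
struct text_msg {
    long mtype;             /* must be > 0 */
    char mtext[TEXT_MAX];
};

/* Send a NUL-terminated string with the given type. Returns 0 or -1. */
static int send_text(int msqid, long type, const char *text)
{
    struct text_msg m;
    size_t len = strlen(text) + 1;   /* include the terminator */
    if (type <= 0 || len > sizeof(m.mtext))
        return -1;
    m.mtype = type;
    memcpy(m.mtext, text, len);
    /* msgsz counts only the mtext bytes, not the mtype field. */
    return msgsnd(msqid, &m, len, IPC_NOWAIT);
}

/* Receive one message selected by msgtyp into out (TEXT_MAX bytes).
 * Returns the type of the delivered message, or -1 if none matched. */
static long recv_text(int msqid, long msgtyp, char out[TEXT_MAX])
{
    struct text_msg m;
    ssize_t n = msgrcv(msqid, &m, sizeof(m.mtext), msgtyp, IPC_NOWAIT);
    if (n < 0)
        return -1;
    out[0] = '\0';                   /* covers zero-length messages */
    memcpy(out, m.mtext, (size_t)n);
    return m.mtype;
}
```

With messages of types 3, 1, and 2 queued in that order, a msgtyp of -2 selects the type 1 message, a msgtyp of 0 then selects the type 3 message at the head of the queue, and a msgtyp of 2 selects the remaining one.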
msgsnd() The msgsnd() system call appends a copy of the message pointed to by msgp to the message queue whose identifier is specified by msqid. If sufficient space is available in the queue, msgsnd() succeeds immediately. The queue capacity is governed by the msg_qbytes field in the associated data structure for the message queue. During queue creation this field is initialized to MSGMNB bytes, but this limit can be modified using msgctl(2). A message queue is considered to be full if either of the following conditions is true: * Adding a new message to the queue would cause the total number of bytes in the queue to exceed the queue's maximum size (the msg_qbytes field). * Adding another message to the queue would cause the total number of messages in the queue to exceed the queue's maximum size (the msg_qbytes field). This check is necessary to prevent an unlimited number of zero-length messages being placed on the queue. Although such messages contain no data, they nevertheless consume (locked) kernel memory. If insufficient space is available in the queue, then the default behavior of msgsnd() is to block until space becomes available. If IPC_NOWAIT is specified in msgflg, then the call instead fails with the error EAGAIN. A blocked msgsnd() call may also fail if: * the queue is removed, in which case the system call fails with errno set to EIDRM; or * a signal is caught, in which case the system call fails with errno set to EINTR; see signal(7). (msgsnd() is never automatically restarted after being interrupted by a signal handler, regardless of the setting of the SA_RESTART flag when establishing a signal handler.) Upon successful completion the message queue data structure is updated as follows: msg_lspid is set to the process ID of the calling process. msg_qnum is incremented by 1. msg_stime is set to the current time. msgrcv() The msgrcv() system call removes a message from the queue specified by msqid and places it in the buffer pointed to by msgp.
The argument msgsz specifies the maximum size in bytes for the member mtext of the structure pointed to by the msgp argument. If the message text has length greater than msgsz, then the behavior depends on whether MSG_NOERROR is specified in msgflg. If MSG_NOERROR is specified, then the message text will be truncated (and the truncated part will be lost); if MSG_NOERROR is not specified, then the message isn't removed from the queue and the system call fails returning -1 with errno set to E2BIG. Unless MSG_COPY is specified in msgflg (see below), the msgtyp argument specifies the type of message requested, as follows: * If msgtyp is 0, then the first message in the queue is read. * If msgtyp is greater than 0, then the first message in the queue of type msgtyp is read, unless MSG_EXCEPT was specified in msgflg, in which case the first message in the queue of type not equal to msgtyp will be read. * If msgtyp is less than 0, then the first message in the queue with the lowest type less than or equal to the absolute value of msgtyp will be read. The msgflg argument is a bit mask constructed by ORing together zero or more of the following flags: IPC_NOWAIT Return immediately if no message of the requested type is in the queue. The system call fails with errno set to ENOMSG. MSG_COPY (since Linux 3.8) Nondestructively fetch a copy of the message at the ordinal position in the queue specified by msgtyp (messages are considered to be numbered starting at 0). This flag must be specified in conjunction with IPC_NOWAIT, with the result that, if there is no message available at the given position, the call fails immediately with the error ENOMSG. Because they alter the meaning of msgtyp in orthogonal ways, MSG_COPY and MSG_EXCEPT may not both be specified in msgflg. The MSG_COPY flag was added for the implementation of the kernel checkpoint-restore facility and is available only if the kernel was built with the CONFIG_CHECKPOINT_RESTORE option. 
MSG_EXCEPT Used with msgtyp greater than 0 to read the first message in the queue with message type that differs from msgtyp. MSG_NOERROR To truncate the message text if longer than msgsz bytes. If no message of the requested type is available and IPC_NOWAIT isn't specified in msgflg, the calling process is blocked until one of the following conditions occurs: * A message of the desired type is placed in the queue. * The message queue is removed from the system. In this case, the system call fails with errno set to EIDRM. * The calling process catches a signal. In this case, the system call fails with errno set to EINTR. (msgrcv() is never automatically restarted after being interrupted by a signal handler, regardless of the setting of the SA_RESTART flag when establishing a signal handler.) Upon successful completion the message queue data structure is updated as follows: msg_lrpid is set to the process ID of the calling process. msg_qnum is decremented by 1. msg_rtime is set to the current time. http://man7.org/linux/man-pages/man2/msgrcv.2.html 12 SYSTEM CALL: msgop(2) - Linux manual page FUNCTIONALITY: msgrcv, msgsnd - System V message queue operations SYNOPSIS: #include #include #include int msgsnd(int msqid, const void *msgp, size_t msgsz, int msgflg); ssize_t msgrcv(int msqid, void *msgp, size_t msgsz, long msgtyp, int msgflg); DESCRIPTION The msgsnd() and msgrcv() system calls are used, respectively, to send messages to, and receive messages from, a System V message queue. The calling process must have write permission on the message queue in order to send a message, and read permission to receive a message. The msgp argument is a pointer to a caller-defined structure of the following general form: struct msgbuf { long mtype; /* message type, must be > 0 */ char mtext[1]; /* message data */ }; The mtext field is an array (or other structure) whose size is specified by msgsz, a nonnegative integer value. 
Messages of zero length (i.e., no mtext field) are permitted. The mtype field must have a strictly positive integer value. This value can be used by the receiving process for message selection (see the description of msgrcv() below). msgsnd() The msgsnd() system call appends a copy of the message pointed to by msgp to the message queue whose identifier is specified by msqid. If sufficient space is available in the queue, msgsnd() succeeds immediately. The queue capacity is governed by the msg_qbytes field in the associated data structure for the message queue. During queue creation this field is initialized to MSGMNB bytes, but this limit can be modified using msgctl(2). A message queue is considered to be full if either of the following conditions is true: * Adding a new message to the queue would cause the total number of bytes in the queue to exceed the queue's maximum size (the msg_qbytes field). * Adding another message to the queue would cause the total number of messages in the queue to exceed the queue's maximum size (the msg_qbytes field). This check is necessary to prevent an unlimited number of zero-length messages being placed on the queue. Although such messages contain no data, they nevertheless consume (locked) kernel memory. If insufficient space is available in the queue, then the default behavior of msgsnd() is to block until space becomes available. If IPC_NOWAIT is specified in msgflg, then the call instead fails with the error EAGAIN. A blocked msgsnd() call may also fail if: * the queue is removed, in which case the system call fails with errno set to EIDRM; or * a signal is caught, in which case the system call fails with errno set to EINTR; see signal(7). (msgsnd() is never automatically restarted after being interrupted by a signal handler, regardless of the setting of the SA_RESTART flag when establishing a signal handler.)
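Since SA_RESTART has no effect on msgsnd(), a caller that wants a blocking send to survive signal delivery has to retry by hand. A minimal sketch of such a loop follows; the helper name is hypothetical, and any error other than EINTR (for example EIDRM) is simply passed back to the caller.

```c
#include <errno.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/types.h>

/*
 * Blocking send that retries when a signal handler interrupts the call.
 * Returns 0 on success, or -1 with errno set for any other failure.
 */
static int msgsnd_retry(int msqid, const void *msgp, size_t msgsz)
{
    for (;;) {
        if (msgsnd(msqid, msgp, msgsz, 0) == 0)
            return 0;
        if (errno != EINTR)
            return -1;   /* e.g. EIDRM: the queue was removed */
        /* EINTR: the kernel never restarts msgsnd(), so loop manually. */
    }
}
```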
Upon successful completion the message queue data structure is updated as follows: msg_lspid is set to the process ID of the calling process. msg_qnum is incremented by 1. msg_stime is set to the current time. msgrcv() The msgrcv() system call removes a message from the queue specified by msqid and places it in the buffer pointed to by msgp. The argument msgsz specifies the maximum size in bytes for the member mtext of the structure pointed to by the msgp argument. If the message text has length greater than msgsz, then the behavior depends on whether MSG_NOERROR is specified in msgflg. If MSG_NOERROR is specified, then the message text will be truncated (and the truncated part will be lost); if MSG_NOERROR is not specified, then the message isn't removed from the queue and the system call fails returning -1 with errno set to E2BIG. Unless MSG_COPY is specified in msgflg (see below), the msgtyp argument specifies the type of message requested, as follows: * If msgtyp is 0, then the first message in the queue is read. * If msgtyp is greater than 0, then the first message in the queue of type msgtyp is read, unless MSG_EXCEPT was specified in msgflg, in which case the first message in the queue of type not equal to msgtyp will be read. * If msgtyp is less than 0, then the first message in the queue with the lowest type less than or equal to the absolute value of msgtyp will be read. The msgflg argument is a bit mask constructed by ORing together zero or more of the following flags: IPC_NOWAIT Return immediately if no message of the requested type is in the queue. The system call fails with errno set to ENOMSG. MSG_COPY (since Linux 3.8) Nondestructively fetch a copy of the message at the ordinal position in the queue specified by msgtyp (messages are considered to be numbered starting at 0). 
This flag must be specified in conjunction with IPC_NOWAIT, with the result that, if there is no message available at the given position, the call fails immediately with the error ENOMSG. Because they alter the meaning of msgtyp in orthogonal ways, MSG_COPY and MSG_EXCEPT may not both be specified in msgflg. The MSG_COPY flag was added for the implementation of the kernel checkpoint-restore facility and is available only if the kernel was built with the CONFIG_CHECKPOINT_RESTORE option. MSG_EXCEPT Used with msgtyp greater than 0 to read the first message in the queue with message type that differs from msgtyp. MSG_NOERROR To truncate the message text if longer than msgsz bytes. If no message of the requested type is available and IPC_NOWAIT isn't specified in msgflg, the calling process is blocked until one of the following conditions occurs: * A message of the desired type is placed in the queue. * The message queue is removed from the system. In this case, the system call fails with errno set to EIDRM. * The calling process catches a signal. In this case, the system call fails with errno set to EINTR. (msgrcv() is never automatically restarted after being interrupted by a signal handler, regardless of the setting of the SA_RESTART flag when establishing a signal handler.) Upon successful completion the message queue data structure is updated as follows: msg_lrpid is set to the process ID of the calling process. msg_qnum is decremented by 1. msg_rtime is set to the current time. http://man7.org/linux/man-pages/man2/mq_open.2.html 12 SYSTEM CALL: mq_open(3) - Linux manual page FUNCTIONALITY: mq_open - open a message queue SYNOPSIS: #include /* For O_* constants */ #include /* For mode constants */ #include mqd_t mq_open(const char *name, int oflag); mqd_t mq_open(const char *name, int oflag, mode_t mode, struct mq_attr *attr); Link with -lrt. DESCRIPTION mq_open() creates a new POSIX message queue or opens an existing queue. The queue is identified by name. 
For details of the construction of name, see mq_overview(7). The oflag argument specifies flags that control the operation of the call. (Definitions of the flags values can be obtained by including <fcntl.h>.) Exactly one of the following must be specified in oflag: O_RDONLY Open the queue to receive messages only. O_WRONLY Open the queue to send messages only. O_RDWR Open the queue to both send and receive messages. Zero or more of the following flags can additionally be ORed in oflag: O_CLOEXEC (since Linux 2.6.26) Set the close-on-exec flag for the message queue descriptor. See open(2) for a discussion of why this flag is useful. O_CREAT Create the message queue if it does not exist. The owner (user ID) of the message queue is set to the effective user ID of the calling process. The group ownership (group ID) is set to the effective group ID of the calling process. O_EXCL If O_CREAT was specified in oflag, and a queue with the given name already exists, then fail with the error EEXIST. O_NONBLOCK Open the queue in nonblocking mode. In circumstances where mq_receive(3) and mq_send(3) would normally block, these functions instead fail with the error EAGAIN. If O_CREAT is specified in oflag, then two additional arguments must be supplied. The mode argument specifies the permissions to be placed on the new queue, as for open(2). (Symbolic definitions for the permissions bits can be obtained by including <sys/stat.h>.) The permissions settings are masked against the process umask. The attr argument specifies attributes for the queue. See mq_getattr(3) for details. If attr is NULL, then the queue is created with implementation-defined default attributes. Since Linux 3.5, two /proc files can be used to control these defaults; see mq_overview(7) for details. http://man7.org/linux/man-pages/man2/mq_unlink.2.html 10 SYSTEM CALL: mq_unlink(3) - Linux manual page FUNCTIONALITY: mq_unlink - remove a message queue SYNOPSIS: #include int mq_unlink(const char *name); Link with -lrt.
DESCRIPTION mq_unlink() removes the specified message queue name. The message queue name is removed immediately. The queue itself is destroyed once any other processes that have the queue open close their descriptors referring to the queue. http://man7.org/linux/man-pages/man2/mq_getsetattr.2.html 8 SYSTEM CALL: mq_getsetattr(2) - Linux manual page FUNCTIONALITY: mq_getsetattr - get/set message queue attributes SYNOPSIS: #include <sys/types.h> #include <mqueue.h> int mq_getsetattr(mqd_t mqdes, struct mq_attr *newattr, struct mq_attr *oldattr); Note: There is no glibc wrapper for this system call; see NOTES. DESCRIPTION Do not use this system call. This is the low-level system call used to implement mq_getattr(3) and mq_setattr(3). For an explanation of how this system call operates, see the description of mq_setattr(3). http://man7.org/linux/man-pages/man2/mq_timedsend.2.html 11 SYSTEM CALL: mq_send(3) - Linux manual page FUNCTIONALITY: mq_send, mq_timedsend - send a message to a message queue SYNOPSIS: #include <mqueue.h> int mq_send(mqd_t mqdes, const char *msg_ptr, size_t msg_len, unsigned int msg_prio); #include <time.h> #include <mqueue.h> int mq_timedsend(mqd_t mqdes, const char *msg_ptr, size_t msg_len, unsigned int msg_prio, const struct timespec *abs_timeout); Link with -lrt. Feature Test Macro Requirements for glibc (see feature_test_macros(7)): mq_timedsend(): _POSIX_C_SOURCE >= 200112L DESCRIPTION mq_send() adds the message pointed to by msg_ptr to the message queue referred to by the message queue descriptor mqdes. The msg_len argument specifies the length of the message pointed to by msg_ptr; this length must be less than or equal to the queue's mq_msgsize attribute. Zero-length messages are allowed. The msg_prio argument is a nonnegative integer that specifies the priority of this message. Messages are placed on the queue in decreasing order of priority, with newer messages of the same priority being placed after older messages with the same priority.
If the message queue is already full (i.e., the number of messages on the queue equals the queue's mq_maxmsg attribute), then, by default, mq_send() blocks until sufficient space becomes available to allow the message to be queued, or until the call is interrupted by a signal handler. If the O_NONBLOCK flag is enabled for the message queue description, then the call instead fails immediately with the error EAGAIN. mq_timedsend() behaves just like mq_send(), except that if the queue is full and the O_NONBLOCK flag is not enabled for the message queue description, then abs_timeout points to a structure which specifies how long the call will block. This value is an absolute timeout in seconds and nanoseconds since the Epoch, 1970-01-01 00:00:00 +0000 (UTC), specified in the following structure: struct timespec { time_t tv_sec; /* seconds */ long tv_nsec; /* nanoseconds */ }; If the message queue is full, and the timeout has already expired by the time of the call, mq_timedsend() returns immediately. http://man7.org/linux/man-pages/man2/mq_timedreceive.2.html 11 SYSTEM CALL: mq_receive(3) - Linux manual page FUNCTIONALITY: mq_receive, mq_timedreceive - receive a message from a message queue SYNOPSIS: #include <mqueue.h> ssize_t mq_receive(mqd_t mqdes, char *msg_ptr, size_t msg_len, unsigned int *msg_prio); #include <time.h> #include <mqueue.h> ssize_t mq_timedreceive(mqd_t mqdes, char *msg_ptr, size_t msg_len, unsigned int *msg_prio, const struct timespec *abs_timeout); Link with -lrt. Feature Test Macro Requirements for glibc (see feature_test_macros(7)): mq_timedreceive(): _POSIX_C_SOURCE >= 200112L DESCRIPTION mq_receive() removes the oldest message with the highest priority from the message queue referred to by the message queue descriptor mqdes, and places it in the buffer pointed to by msg_ptr. The msg_len argument specifies the size of the buffer pointed to by msg_ptr; this must be greater than or equal to the mq_msgsize attribute of the queue (see mq_getattr(3)).
If msg_prio is not NULL, then the buffer to which it points is used to return the priority associated with the received message. If the queue is empty, then, by default, mq_receive() blocks until a message becomes available, or the call is interrupted by a signal handler. If the O_NONBLOCK flag is enabled for the message queue description, then the call instead fails immediately with the error EAGAIN. mq_timedreceive() behaves just like mq_receive(), except that if the queue is empty and the O_NONBLOCK flag is not enabled for the message queue description, then abs_timeout points to a structure which specifies how long the call will block. This value is an absolute timeout in seconds and nanoseconds since the Epoch, 1970-01-01 00:00:00 +0000 (UTC), specified in the following structure: struct timespec { time_t tv_sec; /* seconds */ long tv_nsec; /* nanoseconds */ }; If no message is available, and the timeout has already expired by the time of the call, mq_timedreceive() returns immediately. http://man7.org/linux/man-pages/man2/mq_notify.2.html 12 SYSTEM CALL: mq_notify(3) - Linux manual page FUNCTIONALITY: mq_notify - register for notification when a message is available SYNOPSIS: #include <mqueue.h> int mq_notify(mqd_t mqdes, const struct sigevent *sevp); Link with -lrt. DESCRIPTION mq_notify() allows the calling process to register or unregister for delivery of an asynchronous notification when a new message arrives on the empty message queue referred to by the message queue descriptor mqdes. The sevp argument is a pointer to a sigevent structure. For the definition and general details of this structure, see sigevent(7). If sevp is a non-null pointer, then mq_notify() registers the calling process to receive message notification. The sigev_notify field of the sigevent structure to which sevp points specifies how notification is to be performed.
This field has one of the following values: SIGEV_NONE A "null" notification: the calling process is registered as the target for notification, but when a message arrives, no notification is sent. SIGEV_SIGNAL Notify the process by sending the signal specified in sigev_signo. See sigevent(7) for general details. The si_code field of the siginfo_t structure will be set to SI_MESGQ. In addition, si_pid will be set to the PID of the process that sent the message, and si_uid will be set to the real user ID of the sending process. SIGEV_THREAD Upon message delivery, invoke sigev_notify_function as if it were the start function of a new thread. See sigevent(7) for details. Only one process can be registered to receive notification from a message queue. If sevp is NULL, and the calling process is currently registered to receive notifications for this message queue, then the registration is removed; another process can then register to receive a message notification for this queue. Message notification occurs only when a new message arrives and the queue was previously empty. If the queue was not empty at the time mq_notify() was called, then a notification will occur only after the queue is emptied and a new message arrives. If another process or thread is waiting to read a message from an empty queue using mq_receive(3), then any message notification registration is ignored: the message is delivered to the process or thread calling mq_receive(3), and the message notification registration remains in effect. Notification occurs once: after a notification is delivered, the notification registration is removed, and another process can register for message notification. If the notified process wishes to receive the next notification, it can use mq_notify() to request a further notification. This should be done before emptying all unread messages from the queue. 
(Placing the queue in nonblocking mode is useful for emptying the queue of messages without blocking once it is empty.) Linux Non-Uniform Memory Access (NUMA) system calls http://man7.org/linux/man-pages/man2/getcpu.2.html 11 SYSTEM CALL: getcpu(2) - Linux manual page FUNCTIONALITY: getcpu - determine CPU and NUMA node on which the calling thread is running SYNOPSIS: #include <linux/getcpu.h> int getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *tcache); Note: There is no glibc wrapper for this system call; see NOTES. DESCRIPTION The getcpu() system call identifies the processor and node on which the calling thread or process is currently running and writes them into the integers pointed to by the cpu and node arguments. The processor is a unique small integer identifying a CPU. The node is a unique small identifier identifying a NUMA node. When either cpu or node is NULL nothing is written to the respective pointer. The third argument to this system call is nowadays unused, and should be specified as NULL unless portability to Linux 2.6.23 or earlier is required (see NOTES). The information placed in cpu is guaranteed to be current only at the time of the call: unless the CPU affinity has been fixed using sched_setaffinity(2), the kernel might change the CPU at any time. (Normally this does not happen because the scheduler tries to minimize movements between CPUs to keep caches hot, but it is possible.) The caller must allow for the possibility that the information returned in cpu and node is no longer current by the time the call returns. http://man7.org/linux/man-pages/man2/set_mempolicy.2.html 11 SYSTEM CALL: set_mempolicy(2) - Linux manual page FUNCTIONALITY: set_mempolicy - set default NUMA memory policy for a thread and its children SYNOPSIS: #include <numaif.h> long set_mempolicy(int mode, const unsigned long *nodemask, unsigned long maxnode); Link with -lnuma.
DESCRIPTION set_mempolicy() sets the NUMA memory policy of the calling thread, which consists of a policy mode and zero or more nodes, to the values specified by the mode, nodemask and maxnode arguments. A NUMA machine has different memory controllers with different distances to specific CPUs. The memory policy defines from which node memory is allocated for the thread. This system call defines the default policy for the thread. The thread policy governs allocation of pages in the process's address space outside of memory ranges controlled by a more specific policy set by mbind(2). The thread default policy also controls allocation of any pages for memory-mapped files mapped using the mmap(2) call with the MAP_PRIVATE flag and that are only read [loaded] from by the thread and of memory-mapped files mapped using the mmap(2) call with the MAP_SHARED flag, regardless of the access type. The policy is applied only when a new page is allocated for the thread. For anonymous memory this is when the page is first touched by the thread. The mode argument must specify one of MPOL_DEFAULT, MPOL_BIND, MPOL_INTERLEAVE, or MPOL_PREFERRED. All modes except MPOL_DEFAULT require the caller to specify via the nodemask argument one or more nodes. The mode argument may also include an optional mode flag. The supported mode flags are: MPOL_F_STATIC_NODES (since Linux 2.6.26) A nonempty nodemask specifies physical node ids. Linux will not remap the nodemask when the process moves to a different cpuset context, nor when the set of nodes allowed by the process's current cpuset context changes. MPOL_F_RELATIVE_NODES (since Linux 2.6.26) A nonempty nodemask specifies node ids that are relative to the set of node ids allowed by the process's current cpuset. nodemask points to a bit mask of node IDs that contains up to maxnode bits. The bit mask size is rounded to the next multiple of sizeof(unsigned long), but the kernel will use bits only up to maxnode. 
A NULL value of nodemask or a maxnode value of zero specifies the empty set of nodes. If the value of maxnode is zero, the nodemask argument is ignored. Where a nodemask is required, it must contain at least one node that is on-line, allowed by the process's current cpuset context [unless the MPOL_F_STATIC_NODES mode flag is specified], and contains memory. If MPOL_F_STATIC_NODES is set in mode and a required nodemask contains no nodes that are allowed by the process's current cpuset context, the memory policy reverts to local allocation. This effectively overrides the specified policy until the process's cpuset context includes one or more of the nodes specified by nodemask. The MPOL_DEFAULT mode specifies that any nondefault thread memory policy be removed, so that the memory policy "falls back" to the system default policy. The system default policy is "local allocation", that is, allocate memory on the node of the CPU that triggered the allocation. nodemask must be specified as NULL. If the "local node" contains no free memory, the system will attempt to allocate memory from a "near by" node. The MPOL_BIND mode defines a strict policy that restricts memory allocation to the nodes specified in nodemask. If nodemask specifies more than one node, page allocations will come from the node with the lowest numeric node ID first, until that node contains no free memory. Allocations will then come from the node with the next highest node ID specified in nodemask and so forth, until none of the specified nodes contain free memory. Pages will not be allocated from any node not specified in the nodemask. MPOL_INTERLEAVE interleaves page allocations across the nodes specified in nodemask in numeric node ID order. This optimizes for bandwidth instead of latency by spreading out pages and memory accesses to those pages across multiple nodes. However, accesses to a single page will still be limited to the memory bandwidth of a single node.
MPOL_PREFERRED sets the preferred node for allocation. The kernel will try to allocate pages from this node first and fall back to "near by" nodes if the preferred node is low on free memory. If nodemask specifies more than one node ID, the first node in the mask will be selected as the preferred node. If the nodemask and maxnode arguments specify the empty set, then the policy specifies "local allocation" (like the system default policy discussed above). The thread memory policy is preserved across an execve(2), and is inherited by child threads created using fork(2) or clone(2). http://man7.org/linux/man-pages/man2/get_mempolicy.2.html 11 SYSTEM CALL: get_mempolicy(2) - Linux manual page FUNCTIONALITY: get_mempolicy - retrieve NUMA memory policy for a thread SYNOPSIS: #include <numaif.h> int get_mempolicy(int *mode, unsigned long *nodemask, unsigned long maxnode, void *addr, unsigned long flags); Link with -lnuma. DESCRIPTION get_mempolicy() retrieves the NUMA policy of the calling thread or of a memory address, depending on the setting of flags. A NUMA machine has different memory controllers with different distances to specific CPUs. The memory policy defines from which node memory is allocated for the thread. If flags is specified as 0, then information about the calling thread's default policy (as set by set_mempolicy(2)) is returned. The policy returned [mode and nodemask] may be used to restore the thread's policy to its state at the time of the call to get_mempolicy() using set_mempolicy(2). If flags specifies MPOL_F_MEMS_ALLOWED (available since Linux 2.6.24), the mode argument is ignored and the set of nodes [memories] that the thread is allowed to specify in subsequent calls to mbind(2) or set_mempolicy(2) [in the absence of any mode flags] is returned in nodemask. It is not permitted to combine MPOL_F_MEMS_ALLOWED with either MPOL_F_ADDR or MPOL_F_NODE.
If flags specifies MPOL_F_ADDR, then information is returned about the policy governing the memory address given in addr. This policy may be different from the thread's default policy if mbind(2) or one of the helper functions described in numa(3) has been used to establish a policy for the memory range containing addr. If the mode argument is not NULL, then get_mempolicy() will store the policy mode and any optional mode flags of the requested NUMA policy in the location pointed to by this argument. If nodemask is not NULL, then the nodemask associated with the policy will be stored in the location pointed to by this argument. maxnode specifies the number of node IDs that can be stored into nodemask—that is, the maximum node ID plus one. The value specified by maxnode is always rounded to a multiple of sizeof(unsigned long)*8. If flags specifies both MPOL_F_NODE and MPOL_F_ADDR, get_mempolicy() will return the node ID of the node on which the address addr is allocated into the location pointed to by mode. If no page has yet been allocated for the specified address, get_mempolicy() will allocate a page as if the thread had performed a read [load] access to that address, and return the ID of the node where that page was allocated. If flags specifies MPOL_F_NODE, but not MPOL_F_ADDR, and the thread's current policy is MPOL_INTERLEAVE, then get_mempolicy() will return in the location pointed to by a non-NULL mode argument, the node ID of the next node that will be used for interleaving of internal kernel pages allocated on behalf of the thread. These allocations include pages for memory-mapped files in process memory ranges mapped using the mmap(2) call with the MAP_PRIVATE flag for read accesses, and in memory ranges mapped with the MAP_SHARED flag for all accesses. Other flag values are reserved. For an overview of the possible policies see set_mempolicy(2). 
http://man7.org/linux/man-pages/man2/mbind.2.html 11 SYSTEM CALL: mbind(2) - Linux manual page FUNCTIONALITY: mbind - set memory policy for a memory range SYNOPSIS: #include <numaif.h> long mbind(void *addr, unsigned long len, int mode, const unsigned long *nodemask, unsigned long maxnode, unsigned flags); Link with -lnuma. DESCRIPTION mbind() sets the NUMA memory policy, which consists of a policy mode and zero or more nodes, for the memory range starting with addr and continuing for len bytes. The memory policy defines from which node memory is allocated. If the memory range specified by the addr and len arguments includes an "anonymous" region of memory (that is, a region of memory created using the mmap(2) system call with the MAP_ANONYMOUS flag) or a memory-mapped file mapped using the mmap(2) system call with the MAP_PRIVATE flag, pages will be allocated according to the specified policy only when the application writes [stores] to the page. For anonymous regions, an initial read access will use a shared page in the kernel containing all zeros. For a file mapped with MAP_PRIVATE, an initial read access will allocate pages according to the process policy of the process that causes the page to be allocated. This may not be the process that called mbind(). The specified policy will be ignored for any MAP_SHARED mappings in the specified memory range. Rather the pages will be allocated according to the process policy of the process that caused the page to be allocated. Again, this may not be the process that called mbind(). If the specified memory range includes a shared memory region created using the shmget(2) system call and attached using the shmat(2) system call, pages allocated for the anonymous or shared memory region will be allocated according to the policy specified, regardless of which process attached to the shared memory segment causes the allocation.
If, however, the shared memory region was created with the SHM_HUGETLB flag, the huge pages will be allocated according to the policy specified only if the page allocation is caused by the process that calls mbind() for that region. By default, mbind() has an effect only for new allocations; if the pages inside the range have been already touched before setting the policy, then the policy has no effect. This default behavior may be overridden by the MPOL_MF_MOVE and MPOL_MF_MOVE_ALL flags described below. The mode argument must specify one of MPOL_DEFAULT, MPOL_BIND, MPOL_INTERLEAVE, or MPOL_PREFERRED. All policy modes except MPOL_DEFAULT require the caller to specify, via the nodemask argument, the node or nodes to which the mode applies. The mode argument may also include an optional mode flag. The supported mode flags are: MPOL_F_STATIC_NODES (since Linux 2.6.26) A nonempty nodemask specifies physical node ids. Linux does not remap the nodemask when the process moves to a different cpuset context, nor when the set of nodes allowed by the process's current cpuset context changes. MPOL_F_RELATIVE_NODES (since Linux 2.6.26) A nonempty nodemask specifies node ids that are relative to the set of node ids allowed by the process's current cpuset. nodemask points to a bit mask of nodes containing up to maxnode bits. The bit mask size is rounded to the next multiple of sizeof(unsigned long), but the kernel will use bits only up to maxnode. A NULL value of nodemask or a maxnode value of zero specifies the empty set of nodes. If the value of maxnode is zero, the nodemask argument is ignored. Where a nodemask is required, it must contain at least one node that is on-line, allowed by the process's current cpuset context [unless the MPOL_F_STATIC_NODES mode flag is specified], and contains memory. The MPOL_DEFAULT mode requests that any nondefault policy be removed, restoring default behavior.
When applied to a range of memory via mbind(), this means to use the process policy, which may have been set with set_mempolicy(2). If the mode of the process policy is also MPOL_DEFAULT, the system-wide default policy will be used. The system-wide default policy allocates pages on the node of the CPU that triggers the allocation. For MPOL_DEFAULT, the nodemask and maxnode arguments must specify the empty set of nodes. The MPOL_BIND mode specifies a strict policy that restricts memory allocation to the nodes specified in nodemask. If nodemask specifies more than one node, page allocations will come from the node with the lowest numeric node ID first, until that node contains no free memory. Allocations will then come from the node with the next highest node ID specified in nodemask and so forth, until none of the specified nodes contain free memory. Pages will not be allocated from any node not specified in the nodemask. The MPOL_INTERLEAVE mode specifies that page allocations be interleaved across the set of nodes specified in nodemask. This optimizes for bandwidth instead of latency by spreading out pages and memory accesses to those pages across multiple nodes. To be effective, the memory area should be fairly large (at least 1MB) with a fairly uniform access pattern. Accesses to a single page of the area will still be limited to the memory bandwidth of a single node. MPOL_PREFERRED sets the preferred node for allocation. The kernel will try to allocate pages from this node first and fall back to other nodes if the preferred node is low on free memory. If nodemask specifies more than one node ID, the first node in the mask will be selected as the preferred node. If the nodemask and maxnode arguments specify the empty set, then the memory is allocated on the node of the CPU that triggered the allocation. This is the only way to specify "local allocation" for a range of memory via mbind().
If MPOL_MF_STRICT is passed in flags and mode is not MPOL_DEFAULT, then the call will fail with the error EIO if the existing pages in the memory range don't follow the policy. If MPOL_MF_MOVE is specified in flags, then the kernel will attempt to move all the existing pages in the memory range so that they follow the policy. Pages that are shared with other processes will not be moved. If MPOL_MF_STRICT is also specified, then the call will fail with the error EIO if some pages could not be moved. If MPOL_MF_MOVE_ALL is passed in flags, then the kernel will attempt to move all existing pages in the memory range regardless of whether other processes use the pages. The calling process must be privileged (CAP_SYS_NICE) to use this flag. If MPOL_MF_STRICT is also specified, then the call will fail with the error EIO if some pages could not be moved. http://man7.org/linux/man-pages/man2/move_pages.2.html 11 SYSTEM CALL: move_pages(2) - Linux manual page FUNCTIONALITY: move_pages - move individual pages of a process to another node SYNOPSIS: #include <numaif.h> long move_pages(int pid, unsigned long count, void **pages, const int *nodes, int *status, int flags); Link with -lnuma. DESCRIPTION move_pages() moves the specified pages of the process pid to the memory nodes specified by nodes. The result of the move is reflected in status. The flags indicate constraints on the pages to be moved. pid is the ID of the process in which pages are to be moved. To move pages in another process, the caller must be privileged (CAP_SYS_NICE) or the real or effective user ID of the calling process must match the real or saved-set user ID of the target process. If pid is 0, then move_pages() moves pages of the calling process. count is the number of pages to move. It defines the size of the three arrays pages, nodes, and status. pages is an array of pointers to the pages that should be moved. These are pointers that should be aligned to page boundaries.
Addresses are specified as seen by the process specified by pid. nodes is an array of integers that specify the desired location for each page. Each element in the array is a node number. nodes can also be NULL, in which case move_pages() does not move any pages but instead will return the node where each page currently resides, in the status array. Obtaining the status of each page may be necessary to determine pages that need to be moved. status is an array of integers that return the status of each page. The array contains valid values only if move_pages() did not return an error. flags specify what types of pages to move. MPOL_MF_MOVE means that only pages that are in exclusive use by the process are to be moved. MPOL_MF_MOVE_ALL means that pages shared between multiple processes can also be moved. The process must be privileged (CAP_SYS_NICE) to use MPOL_MF_MOVE_ALL. Page states in the status array The following values can be returned in each element of the status array. 0..MAX_NUMNODES Identifies the node on which the page resides. -EACCES The page is mapped by multiple processes and can be moved only if MPOL_MF_MOVE_ALL is specified. -EBUSY The page is currently busy and cannot be moved. Try again later. This occurs if a page is undergoing I/O or another kernel subsystem is holding a reference to the page. -EFAULT This is a zero page or the memory area is not mapped by the process. -EIO Unable to write back a page. The page has to be written back in order to move it since the page is dirty and the filesystem does not provide a migration function that would allow the move of dirty pages. -EINVAL A dirty page cannot be moved. The filesystem does not provide a migration function and has no ability to write back pages. -ENOENT The page is not present. -ENOMEM Unable to allocate memory on target node. 
http://man7.org/linux/man-pages/man2/migrate_pages.2.html 11 SYSTEM CALL: migrate_pages(2) - Linux manual page FUNCTIONALITY: migrate_pages - move all pages in a process to another set of nodes SYNOPSIS: #include <numaif.h> long migrate_pages(int pid, unsigned long maxnode, const unsigned long *old_nodes, const unsigned long *new_nodes); Link with -lnuma. DESCRIPTION migrate_pages() attempts to move all pages of the process pid that are in memory nodes old_nodes to the memory nodes in new_nodes. Pages not located in any node in old_nodes will not be migrated. As far as possible, the kernel maintains the relative topology relationship inside old_nodes during the migration to new_nodes. The old_nodes and new_nodes arguments are pointers to bit masks of node numbers, with up to maxnode bits in each mask. These masks are maintained as arrays of unsigned long integers (in the last long integer, the bits beyond those specified by maxnode are ignored). The maxnode argument is the maximum node number in the bit mask plus one (this is the same as in mbind(2), but different from select(2)). The pid argument is the ID of the process whose pages are to be moved. To move pages in another process, the caller must be privileged (CAP_SYS_NICE) or the real or effective user ID of the calling process must match the real or saved-set user ID of the target process. If pid is 0, then migrate_pages() moves pages of the calling process. Pages shared with another process will be moved only if the initiating process has the CAP_SYS_NICE privilege.
Linux key management system calls http://man7.org/linux/man-pages/man2/add_key.2.html 10 SYSTEM CALL: add_key(2) - Linux manual page FUNCTIONALITY: add_key - add a key to the kernel's key management facility SYNOPSIS: #include <keyutils.h> key_serial_t add_key(const char *type, const char *description, const void *payload, size_t plen, key_serial_t keyring); DESCRIPTION add_key() asks the kernel to create or update a key of the given type and description, instantiate it with the payload of length plen, and to attach it to the nominated keyring and to return its serial number. The key type may reject the data if it's in the wrong format or in some other way invalid. If the destination keyring already contains a key that matches the specified type and description, then, if the key type supports it, that key will be updated rather than a new key being created; if not, a new key will be created and it will displace the link to the extant key from the keyring. The destination keyring serial number may be that of a valid keyring to which the caller has write permission, or it may be a special keyring ID: KEY_SPEC_THREAD_KEYRING This specifies the caller's thread-specific keyring. KEY_SPEC_PROCESS_KEYRING This specifies the caller's process-specific keyring. KEY_SPEC_SESSION_KEYRING This specifies the caller's session-specific keyring. KEY_SPEC_USER_KEYRING This specifies the caller's UID-specific keyring. KEY_SPEC_USER_SESSION_KEYRING This specifies the caller's UID-session keyring.
http://man7.org/linux/man-pages/man2/request_key.2.html 9 SYSTEM CALL: request_key(2) - Linux manual page FUNCTIONALITY: request_key - request a key from the kernel's key management facility SYNOPSIS: #include <keyutils.h> key_serial_t request_key(const char *type, const char *description, const char *callout_info, key_serial_t keyring); DESCRIPTION request_key() asks the kernel to find a key of the given type that matches the specified description and, if successful, to attach it to the nominated keyring and to return its serial number. request_key() first recursively searches all the keyrings attached to the calling process in the order thread-specific keyring, process-specific keyring and then session keyring for a matching key. If request_key() is called from a program invoked by request_key() on behalf of some other process to generate a key, then the keyrings of that other process will be searched next, using that other process's UID, GID, groups, and security context to control access. The keys in each keyring searched are checked for a match before any child keyrings are recursed into. Only keys that are searchable for the caller may be found, and only searchable keyrings may be searched. If the key is not found, then, if callout_info is set, this function will attempt to look further afield. In such a case, the callout_info is passed to a user-space service such as /sbin/request-key to generate the key. If that is unsuccessful also, then an error will be returned, and a temporary negative key will be installed in the nominated keyring. This will expire after a few seconds, but will cause subsequent calls to request_key() to fail until it does. The keyring serial number may be that of a valid keyring to which the caller has write permission, or it may be a special keyring ID: KEY_SPEC_THREAD_KEYRING This specifies the caller's thread-specific keyring. KEY_SPEC_PROCESS_KEYRING This specifies the caller's process-specific keyring.
KEY_SPEC_SESSION_KEYRING This specifies the caller's session-specific keyring. KEY_SPEC_USER_KEYRING This specifies the caller's UID-specific keyring. KEY_SPEC_USER_SESSION_KEYRING This specifies the caller's UID-session keyring. If a key is created, no matter whether it's a valid key or a negative key, it will displace any other key of the same type and description from the destination keyring. http://man7.org/linux/man-pages/man2/keyctl.2.html 9 SYSTEM CALL: keyctl(2) - Linux manual page FUNCTIONALITY: keyctl - manipulate the kernel's key management facility SYNOPSIS: #include <keyutils.h> long keyctl(int cmd, ...); DESCRIPTION keyctl() has a number of functions available: KEYCTL_GET_KEYRING_ID Ask for a keyring's ID. KEYCTL_JOIN_SESSION_KEYRING Join or start named session keyring. KEYCTL_UPDATE Update a key. KEYCTL_REVOKE Revoke a key. KEYCTL_CHOWN Set ownership of a key. KEYCTL_SETPERM Set perms on a key. KEYCTL_DESCRIBE Describe a key. KEYCTL_CLEAR Clear contents of a keyring. KEYCTL_LINK Link a key into a keyring. KEYCTL_UNLINK Unlink a key from a keyring. KEYCTL_SEARCH Search for a key in a keyring. KEYCTL_READ Read a key or keyring's contents. KEYCTL_INSTANTIATE Instantiate a partially constructed key. KEYCTL_NEGATE Negate a partially constructed key. KEYCTL_SET_REQKEY_KEYRING Set default request-key keyring. KEYCTL_SET_TIMEOUT Set timeout on a key. KEYCTL_ASSUME_AUTHORITY Assume authority to instantiate key. These are wrapped by libkeyutils into individual functions to permit the compiler to check types. See the See Also section at the bottom. Linux system-wide system calls http://man7.org/linux/man-pages/man2/create_module.2.html 11 SYSTEM CALL: create_module(2) - Linux manual page FUNCTIONALITY: create_module - create a loadable module entry SYNOPSIS: #include <linux/module.h> caddr_t create_module(const char *name, size_t size); Note: No declaration of this system call is provided in glibc headers; see NOTES.
DESCRIPTION Note: This system call is present only in kernels before Linux 2.6. create_module() attempts to create a loadable module entry and reserve the kernel memory that will be needed to hold the module. This system call requires privilege. http://man7.org/linux/man-pages/man2/init_module.2.html 11 SYSTEM CALL: init_module(2) - Linux manual page FUNCTIONALITY: init_module, finit_module - load a kernel module SYNOPSIS: int init_module(void *module_image, unsigned long len, const char *param_values); int finit_module(int fd, const char *param_values, int flags); Note: glibc provides no header file declaration of init_module() and no wrapper function for finit_module(); see NOTES. DESCRIPTION init_module() loads an ELF image into kernel space, performs any necessary symbol relocations, initializes module parameters to values provided by the caller, and then runs the module's init function. This system call requires privilege. The module_image argument points to a buffer containing the binary image to be loaded; len specifies the size of that buffer. The module image should be a valid ELF image, built for the running kernel. The param_values argument is a string containing space-delimited specifications of the values for module parameters (defined inside the module using module_param() and module_param_array()). The kernel parses this string and initializes the specified parameters. Each of the parameter specifications has the form: name[=value[,value...]] The parameter name is one of those defined within the module using module_param() (see the Linux kernel source file include/linux/moduleparam.h). The parameter value is optional in the case of bool and invbool parameters. Values for array parameters are specified as a comma-separated list. finit_module() The finit_module() system call is like init_module(), but reads the module to be loaded from the file descriptor fd. 
It is useful when the authenticity of a kernel module can be determined from its location in the filesystem; in cases where that is possible, the overhead of using cryptographically signed modules to determine the authenticity of a module can be avoided. The param_values argument is as for init_module(). The flags argument modifies the operation of finit_module(). It is a bit mask value created by ORing together zero or more of the following flags: MODULE_INIT_IGNORE_MODVERSIONS Ignore symbol version hashes. MODULE_INIT_IGNORE_VERMAGIC Ignore kernel version magic. There are some safety checks built into a module to ensure that it matches the kernel against which it is loaded. These checks are recorded when the module is built and verified when the module is loaded. First, the module records a "vermagic" string containing the kernel version number and prominent features (such as the CPU type). Second, if the module was built with the CONFIG_MODVERSIONS configuration option enabled, a version hash is recorded for each symbol the module uses. This hash is based on the types of the arguments and return value for the function named by the symbol. In this case, the kernel version number within the "vermagic" string is ignored, as the symbol version hashes are assumed to be sufficiently reliable. Using the MODULE_INIT_IGNORE_VERMAGIC flag indicates that the "vermagic" string is to be ignored, and the MODULE_INIT_IGNORE_MODVERSIONS flag indicates that the symbol version hashes are to be ignored. If the kernel is built to permit forced loading (i.e., configured with CONFIG_MODULE_FORCE_LOAD), then loading will continue, otherwise it will fail with ENOEXEC as expected for malformed modules. 
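Because glibc provides no wrapper for finit_module(), a caller has to go through syscall(2). The following is a minimal sketch of that, not a definitive loader: the helper name load_module_from_file is hypothetical, and the flags value is passed through unchanged (0, or a mask of the MODULE_INIT_IGNORE_* flags described above). The call needs privilege, so on an ordinary account it is expected to fail.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: open the module file at `path` and hand the file
 * descriptor to finit_module() via syscall(2).
 * Returns 0 on success, -1 on failure with errno set.
 */
int load_module_from_file(const char *path, const char *param_values, int flags)
{
    assert(path != NULL && param_values != NULL);

    int fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd == -1)
        return -1;              /* errno set by open() */

    /* param_values uses the space-delimited name[=value[,value...]] form. */
    long ret = syscall(SYS_finit_module, fd, param_values, flags);

    /* Preserve the errno of finit_module() across close(). */
    int saved_errno = errno;
    close(fd);
    errno = saved_errno;

    return ret == 0 ? 0 : -1;
}
```

A call such as load_module_from_file("/path/to/example.ko", "debug=1", 0) would pass one parameter specification; both the path and the parameter name here are placeholders.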
http://man7.org/linux/man-pages/man2/finit_module.2.html 11 SYSTEM CALL: init_module(2) - Linux manual page FUNCTIONALITY: init_module, finit_module - load a kernel module SYNOPSIS: int init_module(void *module_image, unsigned long len, const char *param_values); int finit_module(int fd, const char *param_values, int flags); Note: glibc provides no header file declaration of init_module() and no wrapper function for finit_module(); see NOTES. DESCRIPTION init_module() loads an ELF image into kernel space, performs any necessary symbol relocations, initializes module parameters to values provided by the caller, and then runs the module's init function. This system call requires privilege. The module_image argument points to a buffer containing the binary image to be loaded; len specifies the size of that buffer. The module image should be a valid ELF image, built for the running kernel. The param_values argument is a string containing space-delimited specifications of the values for module parameters (defined inside the module using module_param() and module_param_array()). The kernel parses this string and initializes the specified parameters. Each of the parameter specifications has the form: name[=value[,value...]] The parameter name is one of those defined within the module using module_param() (see the Linux kernel source file include/linux/moduleparam.h). The parameter value is optional in the case of bool and invbool parameters. Values for array parameters are specified as a comma-separated list. finit_module() The finit_module() system call is like init_module(), but reads the module to be loaded from the file descriptor fd. It is useful when the authenticity of a kernel module can be determined from its location in the filesystem; in cases where that is possible, the overhead of using cryptographically signed modules to determine the authenticity of a module can be avoided. The param_values argument is as for init_module(). 
The flags argument modifies the operation of finit_module(). It is a bit mask value created by ORing together zero or more of the following flags: MODULE_INIT_IGNORE_MODVERSIONS Ignore symbol version hashes. MODULE_INIT_IGNORE_VERMAGIC Ignore kernel version magic. There are some safety checks built into a module to ensure that it matches the kernel against which it is loaded. These checks are recorded when the module is built and verified when the module is loaded. First, the module records a "vermagic" string containing the kernel version number and prominent features (such as the CPU type). Second, if the module was built with the CONFIG_MODVERSIONS configuration option enabled, a version hash is recorded for each symbol the module uses. This hash is based on the types of the arguments and return value for the function named by the symbol. In this case, the kernel version number within the "vermagic" string is ignored, as the symbol version hashes are assumed to be sufficiently reliable. Using the MODULE_INIT_IGNORE_VERMAGIC flag indicates that the "vermagic" string is to be ignored, and the MODULE_INIT_IGNORE_MODVERSIONS flag indicates that the symbol version hashes are to be ignored. If the kernel is built to permit forced loading (i.e., configured with CONFIG_MODULE_FORCE_LOAD), then loading will continue, otherwise it will fail with ENOEXEC as expected for malformed modules. http://man7.org/linux/man-pages/man2/delete_module.2.html 10 SYSTEM CALL: delete_module(2) - Linux manual page FUNCTIONALITY: delete_module - unload a kernel module SYNOPSIS: int delete_module(const char *name, int flags); Note: No declaration of this system call is provided in glibc headers; see NOTES. DESCRIPTION The delete_module() system call attempts to remove the unused loadable module entry identified by name. If the module has an exit function, then that function is executed before unloading the module. 
The flags argument is used to modify the behavior of the system call, as described below. This system call requires privilege. Module removal is attempted according to the following rules: 1. If there are other loaded modules that depend on (i.e., refer to symbols defined in) this module, then the call fails. 2. Otherwise, if the reference count for the module (i.e., the number of processes currently using the module) is zero, then the module is immediately unloaded. 3. If a module has a nonzero reference count, then the behavior depends on the bits set in flags. In normal usage (see NOTES), the O_NONBLOCK flag is always specified, and the O_TRUNC flag may additionally be specified. The various combinations for flags have the following effect: flags == O_NONBLOCK The call returns immediately, with an error. flags == (O_NONBLOCK | O_TRUNC) The module is unloaded immediately, regardless of whether it has a nonzero reference count. (flags & O_NONBLOCK) == 0 If flags does not specify O_NONBLOCK, the following steps occur: * The module is marked so that no new references are permitted. * If the module's reference count is nonzero, the caller is placed in an uninterruptible sleep state (TASK_UNINTERRUPTIBLE) until the reference count is zero, at which point the call unblocks. * The module is unloaded in the usual way. The O_TRUNC flag has one further effect on the rules described above. By default, if a module has an init function but no exit function, then an attempt to remove the module will fail. However, if O_TRUNC was specified, this requirement is bypassed. Using the O_TRUNC flag is dangerous! If the kernel was not built with CONFIG_MODULE_FORCE_UNLOAD, this flag is silently ignored. (Normally, CONFIG_MODULE_FORCE_UNLOAD is enabled.) Using this flag taints the kernel (TAINT_FORCED_RMMOD). 
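The flag combinations above can be sketched as a small helper. This is an illustration under assumptions, not the implementation used by any existing tool: unload_module is a hypothetical name, the call is made through syscall(2) because glibc declares no wrapper, and O_NONBLOCK is always set, as in the normal usage described above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to unload the module called `name`.
 * With force == 0 the call fails immediately if the module is in use
 * (flags == O_NONBLOCK). With force != 0, O_TRUNC is added, which
 * requests a forced unload and taints the kernel.
 * Returns 0 on success, -1 on failure with errno set.
 */
int unload_module(const char *name, int force)
{
    assert(name != NULL);

    int flags = O_NONBLOCK;
    if (force)
        flags |= O_TRUNC;

    return syscall(SYS_delete_module, name, flags) == 0 ? 0 : -1;
}
```

Without privilege, or for a module name that is not loaded, the helper returns -1 and leaves the reason in errno.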
http://man7.org/linux/man-pages/man2/query_module.2.html 11 SYSTEM CALL: query_module(2) - Linux manual page FUNCTIONALITY: query_module - query the kernel for various bits pertaining to modules SYNOPSIS: #include int query_module(const char *name, int which, void *buf, size_t bufsize, size_t *ret); Note: No declaration of this system call is provided in glibc headers; see NOTES. DESCRIPTION Note: This system call is present only in kernels before Linux 2.6. query_module() requests information from the kernel about loadable modules. The returned information is placed in the buffer pointed to by buf. The caller must specify the size of buf in bufsize. The precise nature and format of the returned information depend on the operation specified by which. Some operations require name to identify a currently loaded module, some allow name to be NULL, indicating the kernel proper. The following values can be specified for which: 0 Returns success, if the kernel supports query_module(). Used to probe for availability of the system call. QM_MODULES Returns the names of all loaded modules. The returned buffer consists of a sequence of null-terminated strings; ret is set to the number of modules. QM_DEPS Returns the names of all modules used by the indicated module. The returned buffer consists of a sequence of null-terminated strings; ret is set to the number of modules. QM_REFS Returns the names of all modules using the indicated module. This is the inverse of QM_DEPS. The returned buffer consists of a sequence of null-terminated strings; ret is set to the number of modules. QM_SYMBOLS Returns the symbols and values exported by the kernel or the indicated module. The returned buffer is an array of structures of the following form struct module_symbol { unsigned long value; unsigned long name; }; followed by null-terminated strings. The value of name is the character offset of the string relative to the start of buf; ret is set to the number of symbols. 
QM_INFO Returns miscellaneous information about the indicated module. The output buffer format is: struct module_info { unsigned long address; unsigned long size; unsigned long flags; }; where address is the kernel address at which the module resides, size is the size of the module in bytes, and flags is a mask of MOD_RUNNING, MOD_AUTOCLEAN, and so on, that indicates the current status of the module (see the Linux kernel source file include/linux/module.h). ret is set to the size of the module_info structure. http://man7.org/linux/man-pages/man2/get_kernel_syms.2.html 12 SYSTEM CALL: get_kernel_syms(2) - Linux manual page FUNCTIONALITY: get_kernel_syms - retrieve exported kernel and module symbols SYNOPSIS: #include int get_kernel_syms(struct kernel_sym *table); Note: No declaration of this system call is provided in glibc headers; see NOTES. DESCRIPTION Note: This system call is present only in kernels before Linux 2.6. If table is NULL, get_kernel_syms() returns the number of symbols available for query. Otherwise, it fills in a table of structures: struct kernel_sym { unsigned long value; char name[60]; }; The symbols are interspersed with magic symbols of the form #module-name with the kernel having an empty name. The value associated with a symbol of this form is the address at which the module is loaded. The symbols exported from each module follow their magic module tag and the modules are returned in the reverse of the order in which they were loaded. 
http://man7.org/linux/man-pages/man2/acct.2.html 10 SYSTEM CALL: acct(2) - Linux manual page FUNCTIONALITY: acct - switch process accounting on or off SYNOPSIS: #include int acct(const char *filename); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): acct(): Since glibc 2.21: _DEFAULT_SOURCE In glibc 2.19 and 2.20: _DEFAULT_SOURCE || (_XOPEN_SOURCE && _XOPEN_SOURCE < 500) Up to and including glibc 2.19: _BSD_SOURCE || (_XOPEN_SOURCE && _XOPEN_SOURCE < 500) DESCRIPTION The acct() system call enables or disables process accounting. If called with the name of an existing file as its argument, accounting is turned on, and records for each terminating process are appended to filename as it terminates. An argument of NULL causes accounting to be turned off. http://man7.org/linux/man-pages/man2/quotactl.2.html 8 SYSTEM CALL: quotactl(2) - Linux manual page FUNCTIONALITY: quotactl - manipulate disk quotas SYNOPSIS: #include #include int quotactl(int cmd, const char *special, int id, caddr_t addr); DESCRIPTION The quota system can be used to set per-user and per-group limits on the amount of disk space used on a filesystem. For each user and/or group, a soft limit and a hard limit can be set for each filesystem. The hard limit can't be exceeded. The soft limit can be exceeded, but warnings will ensue. Moreover, the user can't exceed the soft limit for more than one week (by default) at a time; after this time, the soft limit counts as a hard limit. The quotactl() call manipulates disk quotas. The cmd argument indicates a command to be applied to the user or group ID specified in id. To initialize the cmd argument, use the QCMD(subcmd, type) macro. The type value is either USRQUOTA, for user quotas, or GRPQUOTA, for group quotas. The subcmd value is described below. The special argument is a pointer to a null-terminated string containing the pathname of the (mounted) block special device for the filesystem being manipulated. 
The addr argument is the address of an optional, command-specific, data structure that is copied in or out of the system. The interpretation of addr is given with each command below. The subcmd value is one of the following: Q_QUOTAON Turn on quotas for a filesystem. The id argument is the identification number of the quota format to be used. Currently, there are three supported quota formats: QFMT_VFS_OLD The original quota format. QFMT_VFS_V0 The standard VFS v0 quota format, which can handle 32-bit UIDs and GIDs and quota limits up to 2^42 bytes and 2^32 inodes. QFMT_VFS_V1 A quota format that can handle 32-bit UIDs and GIDs and quota limits of 2^64 bytes and 2^64 inodes. The addr argument points to the pathname of a file containing the quotas for the filesystem. The quota file must exist; it is normally created with the quotacheck(8) program. This operation requires privilege (CAP_SYS_ADMIN). Q_QUOTAOFF Turn off quotas for a filesystem. The addr and id arguments are ignored. This operation requires privilege (CAP_SYS_ADMIN). Q_GETQUOTA Get disk quota limits and current usage for user or group id. The addr argument is a pointer to a dqblk structure defined in as follows: /* uint64_t is an unsigned 64-bit integer; uint32_t is an unsigned 32-bit integer */ struct dqblk { /* Definition since Linux 2.4.22 */ uint64_t dqb_bhardlimit; /* absolute limit on disk quota blocks alloc */ uint64_t dqb_bsoftlimit; /* preferred limit on disk quota blocks */ uint64_t dqb_curspace; /* current occupied space (in bytes) */ uint64_t dqb_ihardlimit; /* maximum number of allocated inodes */ uint64_t dqb_isoftlimit; /* preferred inode limit */ uint64_t dqb_curinodes; /* current number of allocated inodes */ uint64_t dqb_btime; /* time limit for excessive disk use */ uint64_t dqb_itime; /* time limit for excessive files */ uint32_t dqb_valid; /* bit mask of QIF_* constants */ }; /* Flags in dqb_valid that indicate which fields in dqblk structure are valid. 
*/ #define QIF_BLIMITS 1 #define QIF_SPACE 2 #define QIF_ILIMITS 4 #define QIF_INODES 8 #define QIF_BTIME 16 #define QIF_ITIME 32 #define QIF_LIMITS (QIF_BLIMITS | QIF_ILIMITS) #define QIF_USAGE (QIF_SPACE | QIF_INODES) #define QIF_TIMES (QIF_BTIME | QIF_ITIME) #define QIF_ALL (QIF_LIMITS | QIF_USAGE | QIF_TIMES) The dqb_valid field is a bit mask that is set to indicate the entries in the dqblk structure that are valid. Currently, the kernel fills in all entries of the dqblk structure and marks them as valid in the dqb_valid field. Unprivileged users may retrieve only their own quotas; a privileged user (CAP_SYS_ADMIN) can retrieve the quotas of any user. Q_GETNEXTQUOTA (since Linux 4.6) This operation is the same as Q_GETQUOTA, but it returns quota information for the next ID greater than or equal to id that has a quota set. The addr argument is a pointer to a nextdqblk structure whose fields are as for the dqblk, except for the addition of a dqb_id field that is used to return the ID for which quota information is being returned: struct nextdqblk { uint64_t dqb_bhardlimit; uint64_t dqb_bsoftlimit; uint64_t dqb_curspace; uint64_t dqb_ihardlimit; uint64_t dqb_isoftlimit; uint64_t dqb_curinodes; uint64_t dqb_btime; uint64_t dqb_itime; uint32_t dqb_valid; uint32_t dqb_id; }; Q_SETQUOTA Set quota information for user or group id, using the information supplied in the dqblk structure pointed to by addr. The dqb_valid field of the dqblk structure indicates which entries in the structure have been set by the caller. This operation supersedes the Q_SETQLIM and Q_SETUSE operations in the previous quota interfaces. This operation requires privilege (CAP_SYS_ADMIN). Q_GETINFO (since Linux 2.4.22) Get information (like grace times) about quotafile. The addr argument should be a pointer to a dqinfo structure. 
This structure is defined in as follows: /* uint64_t is an unsigned 64-bit integer; uint32_t is an unsigned 32-bit integer */ struct dqinfo { /* Defined since kernel 2.4.22 */ uint64_t dqi_bgrace; /* Time before block soft limit becomes hard limit */ uint64_t dqi_igrace; /* Time before inode soft limit becomes hard limit */ uint32_t dqi_flags; /* Flags for quotafile (DQF_*) */ uint32_t dqi_valid; }; /* Bits for dqi_flags */ /* Quota format QFMT_VFS_OLD */ #define V1_DQF_RSQUASH 1 /* Root squash enabled */ /* Other quota formats have no dqi_flags bits defined */ /* Flags in dqi_valid that indicate which fields in dqinfo structure are valid. */ # define IIF_BGRACE 1 # define IIF_IGRACE 2 # define IIF_FLAGS 4 # define IIF_ALL (IIF_BGRACE | IIF_IGRACE | IIF_FLAGS) The dqi_valid field in the dqinfo structure indicates the entries in the structure that are valid. Currently, the kernel fills in all entries of the dqinfo structure and marks them all as valid in the dqi_valid field. The id argument is ignored. Q_SETINFO (since Linux 2.4.22) Set information about quotafile. The addr argument should be a pointer to a dqinfo structure. The dqi_valid field of the dqinfo structure indicates the entries in the structure that have been set by the caller. This operation supersedes the Q_SETGRACE and Q_SETFLAGS operations in the previous quota interfaces. The id argument is ignored. This operation requires privilege (CAP_SYS_ADMIN). Q_GETFMT (since Linux 2.4.22) Get quota format used on the specified filesystem. The addr argument should be a pointer to a 4-byte buffer where the format number will be stored. Q_SYNC Update the on-disk copy of quota usages for a filesystem. If special is NULL, then all filesystems with active quotas are sync'ed. The addr and id arguments are ignored. Q_GETSTATS (supported up to Linux 2.4.21) Get statistics and other generic information about the quota subsystem. 
The addr argument should be a pointer to a dqstats structure in which data should be stored. This structure is defined in . The special and id arguments are ignored. This operation is obsolete and was removed in Linux 2.4.22. Files in /proc/sys/fs/quota/ carry the information instead. For XFS filesystems making use of the XFS Quota Manager (XQM), the above commands are bypassed and the following commands are used: Q_XQUOTAON Turn on quotas for an XFS filesystem. XFS provides the ability to turn on/off quota limit enforcement with quota accounting. Therefore, XFS expects addr to be a pointer to an unsigned int that contains either the flags XFS_QUOTA_UDQ_ACCT and/or XFS_QUOTA_UDQ_ENFD (for user quota), or XFS_QUOTA_GDQ_ACCT and/or XFS_QUOTA_GDQ_ENFD (for group quota), as defined in . This operation requires privilege (CAP_SYS_ADMIN). Q_XQUOTAOFF Turn off quotas for an XFS filesystem. As with Q_QUOTAON, XFS filesystems expect a pointer to an unsigned int that specifies whether quota accounting and/or limit enforcement need to be turned off. This operation requires privilege (CAP_SYS_ADMIN). Q_XGETQUOTA Get disk quota limits and current usage for user id. The addr argument is a pointer to an fs_disk_quota structure (defined in ). Unprivileged users may retrieve only their own quotas; a privileged user (CAP_SYS_ADMIN) may retrieve the quotas of any user. Q_XGETNEXTQUOTA (since Linux 4.6) This operation is the same as Q_XGETQUOTA, but it returns quota information for the next ID greater than or equal to id that has a quota set. Q_XSETQLIM Set disk quota limits for user id. The addr argument is a pointer to an fs_disk_quota structure (defined in ). This operation requires privilege (CAP_SYS_ADMIN). Q_XGETQSTAT Returns an fs_quota_stat structure containing XFS filesystem-specific quota information. This is useful for finding out how much space is used to store quota information, and also to get quotaon/off status of a given local XFS filesystem. 
Q_XQUOTARM Free the disk space taken by disk quotas. Quotas must have already been turned off. There is no command equivalent to Q_SYNC for XFS since sync(1) writes quota information to disk (in addition to the other filesystem metadata that it writes out). http://man7.org/linux/man-pages/man2/pivot_root.2.html 12 SYSTEM CALL: pivot_root(2) - Linux manual page FUNCTIONALITY: pivot_root - change the root filesystem SYNOPSIS: int pivot_root(const char *new_root, const char *put_old); Note: There is no glibc wrapper for this system call; see NOTES. DESCRIPTION pivot_root() moves the root filesystem of the calling process to the directory put_old and makes new_root the new root filesystem of the calling process. The typical use of pivot_root() is during system startup, when the system mounts a temporary root filesystem (e.g., an initrd), then mounts the real root filesystem, and eventually turns the latter into the current root of all relevant processes or threads. pivot_root() may or may not change the current root and the current working directory of any processes or threads which use the old root directory. The caller of pivot_root() must ensure that processes with root or current working directory at the old root operate correctly in either case. An easy way to ensure this is to change their root and current working directory to new_root before invoking pivot_root(). The paragraph above is intentionally vague because the implementation of pivot_root() may change in the future. At the time of writing, pivot_root() changes root and current working directory of each process or thread to new_root if they point to the old root directory. This is necessary in order to prevent kernel threads from keeping the old root directory busy with their root and current working directory, even if they never access the filesystem in any way. 
In the future, there may be a mechanism for kernel threads to explicitly relinquish any access to the filesystem, such that this fairly intrusive mechanism can be removed from pivot_root(). Note that this also applies to the calling process: pivot_root() may or may not affect its current working directory. It is therefore recommended to call chdir("/") immediately after pivot_root(). The following restrictions apply to new_root and put_old: - They must be directories. - new_root and put_old must not be on the same filesystem as the current root. - put_old must be underneath new_root, that is, adding a nonzero number of /.. to the string pointed to by put_old must yield the same directory as new_root. - No other filesystem may be mounted on put_old. See also pivot_root(8) for additional usage examples. If the current root is not a mount point (e.g., after chroot(2) or pivot_root(), see also below), not the old root directory, but the mount point of that filesystem is mounted on put_old. new_root does not have to be a mount point. In this case, /proc/mounts will show the mount point of the filesystem containing new_root as root (/). http://man7.org/linux/man-pages/man2/swapon.2.html 10 SYSTEM CALL: swapon(2) - Linux manual page FUNCTIONALITY: swapon, swapoff - start/stop swapping to file/device SYNOPSIS: #include #include int swapon(const char *path, int swapflags); int swapoff(const char *path); DESCRIPTION swapon() sets the swap area to the file or block device specified by path. swapoff() stops swapping to the file or block device specified by path. If the SWAP_FLAG_PREFER flag is specified in the swapon() swapflags argument, the new swap area will have a higher priority than default. 
The priority is encoded within swapflags as: (prio << SWAP_FLAG_PRIO_SHIFT) & SWAP_FLAG_PRIO_MASK If the SWAP_FLAG_DISCARD flag is specified in the swapon() swapflags argument, freed swap pages will be discarded before they are reused, if the swap device supports the discard or trim operation. (This may improve performance on some Solid State Devices, but often it does not.) See also NOTES. These functions may be used only by a privileged process (one having the CAP_SYS_ADMIN capability). Priority Each swap area has a priority, either high or low. The default priority is low. Within the low-priority areas, newer areas are even lower priority than older areas. All priorities set with swapflags are high-priority, higher than default. They may have any nonnegative value chosen by the caller. Higher numbers mean higher priority. Swap pages are allocated from areas in priority order, highest priority first. For areas with different priorities, a higher-priority area is exhausted before using a lower-priority area. If two or more areas have the same priority, and it is the highest priority available, pages are allocated on a round-robin basis between them. As of Linux 1.3.6, the kernel usually follows these rules, but there are exceptions. http://man7.org/linux/man-pages/man2/swapoff.2.html 10 SYSTEM CALL: swapon(2) - Linux manual page FUNCTIONALITY: swapon, swapoff - start/stop swapping to file/device SYNOPSIS: #include #include int swapon(const char *path, int swapflags); int swapoff(const char *path); DESCRIPTION swapon() sets the swap area to the file or block device specified by path. swapoff() stops swapping to the file or block device specified by path. If the SWAP_FLAG_PREFER flag is specified in the swapon() swapflags argument, the new swap area will have a higher priority than default. 
The priority is encoded within swapflags as: (prio << SWAP_FLAG_PRIO_SHIFT) & SWAP_FLAG_PRIO_MASK If the SWAP_FLAG_DISCARD flag is specified in the swapon() swapflags argument, freed swap pages will be discarded before they are reused, if the swap device supports the discard or trim operation. (This may improve performance on some Solid State Devices, but often it does not.) See also NOTES. These functions may be used only by a privileged process (one having the CAP_SYS_ADMIN capability). Priority Each swap area has a priority, either high or low. The default priority is low. Within the low-priority areas, newer areas are even lower priority than older areas. All priorities set with swapflags are high-priority, higher than default. They may have any nonnegative value chosen by the caller. Higher numbers mean higher priority. Swap pages are allocated from areas in priority order, highest priority first. For areas with different priorities, a higher-priority area is exhausted before using a lower-priority area. If two or more areas have the same priority, and it is the highest priority available, pages are allocated on a round-robin basis between them. As of Linux 1.3.6, the kernel usually follows these rules, but there are exceptions. http://man7.org/linux/man-pages/man2/mount.2.html 11 SYSTEM CALL: mount(2) - Linux manual page FUNCTIONALITY: mount - mount filesystem SYNOPSIS: #include int mount(const char *source, const char *target, const char *filesystemtype, unsigned long mountflags, const void *data); DESCRIPTION mount() attaches the filesystem specified by source (which is often a pathname referring to a device, but can also be the pathname of a directory or file, or a dummy string) to the location (a directory or file) specified by the pathname in target. Appropriate privilege (Linux: the CAP_SYS_ADMIN capability) is required to mount filesystems. 
Values for the filesystemtype argument supported by the kernel are listed in /proc/filesystems (e.g., "btrfs", "ext4", "jfs", "xfs", "vfat", "fuse", "tmpfs", "cgroup", "proc", "mqueue", "nfs", "cifs", "iso9660"). Further types may become available when the appropriate modules are loaded. The data argument is interpreted by the different filesystems. Typically it is a string of comma-separated options understood by this filesystem. See mount(8) for details of the options available for each filesystem type. A call to mount() performs one of a number of general types of operation, depending on the bits specified in mountflags. The choice of operation is determined by testing the bits set in mountflags, with the tests being conducted in the order listed here: * Remount an existing mount: mountflags includes MS_REMOUNT. * Create a bind mount: mountflags includes MS_BIND. * Change the propagation type of an existing mount: mountflags includes one of MS_SHARED, MS_PRIVATE, MS_SLAVE, or MS_UNBINDABLE. * Move an existing mount to a new location: mountflags includes MS_MOVE. * Create a new mount: mountflags includes none of the above flags. Each of these operations is detailed later in this page. Further flags may be specified in mountflags to modify the behavior of mount(), as described below. Additional mount flags The list below describes the additional flags that can be specified in mountflags. Note that some operation types ignore some or all of these flags, as described later in this page. MS_DIRSYNC (since Linux 2.5.19) Make directory changes on this filesystem synchronous. (This property can be obtained for individual directories or subtrees using chattr(1).) MS_LAZYTIME (since Linux 4.0) Reduce on-disk updates of inode timestamps (atime, mtime, ctime) by maintaining these changes only in memory. 
The on-disk timestamps are updated only when: (a) the inode needs to be updated for some change unrelated to file timestamps; (b) the application employs sync(2); (c) an undeleted inode is evicted from memory; or (d) more than 24 hours have passed since the inode was written to disk. This mount option significantly reduces writes needed to update the inode's timestamps, especially mtime and atime. However, in the event of a system crash, the atime and mtime fields on disk might be out of date by up to 24 hours. Examples of workloads where this option could be of significant benefit include frequent random writes to preallocated files, as well as cases where the MS_STRICTATIME mount option is also enabled. (The advantage of combining MS_STRICTATIME and MS_LAZYTIME is that stat(2) will return the correctly updated atime, but the atime updates will be flushed to disk only in the cases listed above.) MS_MANDLOCK Permit mandatory locking on files in this filesystem. (Mandatory locking must still be enabled on a per-file basis, as described in fcntl(2).) Since Linux 4.5, this mount option requires the CAP_SYS_ADMIN capability. MS_NOATIME Do not update access times for (all types of) files on this filesystem. MS_NODEV Do not allow access to devices (special files) on this filesystem. MS_NODIRATIME Do not update access times for directories on this filesystem. This flag provides a subset of the functionality provided by MS_NOATIME; that is, MS_NOATIME implies MS_NODIRATIME. MS_NOEXEC Do not allow programs to be executed from this filesystem. MS_NOSUID Do not honor set-user-ID and set-group-ID bits or file capabilities when executing programs from this filesystem. MS_RDONLY Mount filesystem read-only. MS_REC (since Linux 2.4.11) Used in conjunction with MS_BIND to create a recursive bind mount, and in conjunction with the propagation type flags to recursively change the propagation type of all of the mounts in a subtree. See below for further details. 
MS_RELATIME (since Linux 2.6.20) When a file on this filesystem is accessed, update the file's last access time (atime) only if the current value of atime is less than or equal to the file's last modification time (mtime) or last status change time (ctime). This option is useful for programs, such as mutt(1), that need to know when a file has been read since it was last modified. Since Linux 2.6.30, the kernel defaults to the behavior provided by this flag (unless MS_NOATIME was specified), and the MS_STRICTATIME flag is required to obtain traditional semantics. In addition, since Linux 2.6.30, the file's last access time is always updated if it is more than 1 day old. MS_SILENT (since Linux 2.6.17) Suppress the display of certain (printk()) warning messages in the kernel log. This flag supersedes the misnamed and obsolete MS_VERBOSE flag (available since Linux 2.4.12), which has the same meaning. MS_STRICTATIME (since Linux 2.6.30) Always update the last access time (atime) when files on this filesystem are accessed. (This was the default behavior before Linux 2.6.30.) Specifying this flag overrides the effect of setting the MS_NOATIME and MS_RELATIME flags. MS_SYNCHRONOUS Make writes on this filesystem synchronous (as though the O_SYNC flag to open(2) was specified for all file opens to this filesystem). From Linux 2.4 onward, the MS_NODEV, MS_NOEXEC, and MS_NOSUID flags are settable on a per-mount-point basis. From kernel 2.6.16 onward, MS_NOATIME and MS_NODIRATIME are also settable on a per-mount-point basis. The MS_RELATIME flag is also settable on a per-mount-point basis. Remounting an existing mount An existing mount may be remounted by specifying MS_REMOUNT in mountflags. This allows you to change the mountflags and data of an existing mount without having to unmount and remount the filesystem. target should be the same value specified in the initial mount() call. The source and filesystemtype arguments are ignored. 
The mountflags and data arguments should match the values used in the original mount() call, except for those parameters that are being deliberately changed. The following mountflags can be changed: MS_LAZYTIME, MS_MANDLOCK, MS_NOATIME, MS_NODEV, MS_NODIRATIME, MS_NOEXEC, MS_NOSUID, MS_RELATIME, MS_RDONLY, and MS_SYNCHRONOUS. Attempts to change the setting of the MS_DIRSYNC flag during a remount are silently ignored. Since Linux 3.17, if none of MS_NOATIME, MS_NODIRATIME, MS_RELATIME, or MS_STRICTATIME is specified in mountflags, then the remount operation preserves the existing values of these flags (rather than defaulting to MS_RELATIME). Since Linux 2.6.26, this flag can also be used to make an existing bind mount read-only by specifying mountflags as: MS_REMOUNT | MS_BIND | MS_RDONLY Note that only the MS_RDONLY setting of the bind mount can be changed in this manner. Creating a bind mount If mountflags includes MS_BIND (available since Linux 2.4), then perform a bind mount. A bind mount makes a file or a directory subtree visible at another point within the single directory hierarchy. Bind mounts may cross filesystem boundaries and span chroot(2) jails. The filesystemtype and data arguments are ignored. The remaining bits in the mountflags argument are also ignored, with the exception of MS_REC. (The bind mount has the same mount options as the underlying mount point.) However, see the discussion of remounting above, for a method of making an existing bind mount read-only. By default, when a directory is bind mounted, only that directory is mounted; if there are any submounts under the directory tree, they are not bind mounted. If the MS_REC flag is also specified, then a recursive bind mount operation is performed: all submounts under the source subtree (other than unbindable mounts) are also bind mounted at the corresponding location in the target subtree.
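The two steps described above (create a bind mount, then remount it read-only with MS_REMOUNT | MS_BIND | MS_RDONLY) can be sketched in C roughly as follows. This is only an illustration under stated assumptions: the helper name bind_mount_readonly and its recursive parameter are hypothetical, the caller is assumed to hold CAP_SYS_ADMIN, and the read-only step assumes Linux 2.6.26 or later.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/*
 * Hypothetical helper (not a library function): bind-mount src at dst,
 * then switch the new bind mount to read-only.
 * Returns 0 on success, -1 on failure with errno as set by mount(2).
 */
int bind_mount_readonly(const char *src, const char *dst, int recursive)
{
    unsigned long flags = MS_BIND;
    int saved;

    if (recursive)
        flags |= MS_REC;    /* also replicate submounts under src */

    /* Step 1: filesystemtype and data are ignored for a bind mount. */
    if (mount(src, dst, NULL, flags, NULL) == -1) {
        saved = errno;
        fprintf(stderr, "bind %s -> %s: %s\n", src, dst, strerror(saved));
        errno = saved;
        return -1;
    }

    /* Step 2: source and filesystemtype are ignored for a remount. */
    if (mount("none", dst, NULL, MS_REMOUNT | MS_BIND | MS_RDONLY, NULL) == -1) {
        saved = errno;
        fprintf(stderr, "read-only remount of %s: %s\n", dst, strerror(saved));
        errno = saved;
        return -1;          /* the writable bind mount from step 1 is left in place */
    }
    return 0;
}
```

A caller might invoke it as bind_mount_readonly("/srv/data", "/mnt/view", 0), where both paths are placeholders for existing directories. Keeping the remount as a separate step mirrors the text above, which says the other bits in mountflags are ignored when the bind mount is first created.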
Changing the propagation type of an existing mount If mountflags includes one of MS_SHARED, MS_PRIVATE, MS_SLAVE, or MS_UNBINDABLE (all available since Linux 2.6.15), then the propagation type of an existing mount is changed. If more than one of these flags is specified, an error results. The only flags that can be used with changing the propagation type are MS_REC and MS_SILENT. The source, filesystemtype, and data arguments are ignored. The meanings of the propagation type flags are as follows: MS_SHARED Make this mount point shared. Mount and unmount events immediately under this mount point will propagate to the other mount points that are members of this mount's peer group. Propagation here means that the same mount or unmount will automatically occur under all of the other mount points in the peer group. Conversely, mount and unmount events that take place under peer mount points will propagate to this mount point. MS_PRIVATE Make this mount point private. Mount and unmount events do not propagate into or out of this mount point. This is the default propagation type for newly created mount points. MS_SLAVE If this is a shared mount point that is a member of a peer group that contains other members, convert it to a slave mount. If this is a shared mount point that is a member of a peer group that contains no other members, convert it to a private mount. Otherwise, the propagation type of the mount point is left unchanged. When a mount point is a slave, mount and unmount events propagate into this mount point from the (master) shared peer group of which it was formerly a member. Mount and unmount events under this mount point do not propagate to any peer. A mount point can be the slave of another peer group while at the same time sharing mount and unmount events with a peer group of which it is a member. MS_UNBINDABLE Make this mount unbindable. This is like a private mount, and in addition this mount can't be bind mounted. 
When a recursive bind mount (mount(2) with the MS_BIND and MS_REC flags) is performed on a directory subtree, any unbindable mounts within the subtree are automatically pruned (i.e., not replicated) when replicating that subtree to produce the target subtree. By default, changing the propagation type affects only the target mount point. If the MS_REC flag is also specified in mountflags, then the propagation type of all mount points under target is also changed. For further details regarding mount propagation types, see mount_namespaces(7). Moving a mount If mountflags contains the flag MS_MOVE (available since Linux 2.4.18), then move a subtree: source specifies an existing mount point and target specifies the new location to which that mount point is to be relocated. The move is atomic: at no point is the subtree unmounted. The remaining bits in the mountflags argument are ignored, as are the filesystemtype and data arguments. Creating a new mount point If none of MS_REMOUNT, MS_BIND, MS_MOVE, MS_SHARED, MS_PRIVATE, MS_SLAVE, or MS_UNBINDABLE is specified in mountflags, then mount() performs its default action: creating a new mount point. source specifies the source for the new mount point, and target specifies the directory at which to create the mount point. The filesystemtype and data arguments are employed, and further bits may be specified in mountflags to modify the behavior of the call. http://man7.org/linux/man-pages/man2/umount2.2.html 11 SYSTEM CALL: umount(2) - Linux manual page FUNCTIONALITY: umount, umount2 - unmount filesystem SYNOPSIS: #include <sys/mount.h> int umount(const char *target); int umount2(const char *target, int flags); DESCRIPTION umount() and umount2() remove the attachment of the (topmost) filesystem mounted on target. Appropriate privilege (Linux: the CAP_SYS_ADMIN capability) is required to unmount filesystems.
Linux 2.1.116 added the umount2() system call, which, like umount(), unmounts a target, but allows additional flags controlling the behavior of the operation: MNT_FORCE (since Linux 2.1.116) Force unmount even if busy. This can cause data loss. (Only for NFS mounts.) MNT_DETACH (since Linux 2.4.11) Perform a lazy unmount: make the mount point unavailable for new accesses, immediately disconnect the filesystem and all filesystems mounted below it from each other and from the mount table, and actually perform the unmount when the mount point ceases to be busy. MNT_EXPIRE (since Linux 2.6.8) Mark the mount point as expired. If a mount point is not currently in use, then an initial call to umount2() with this flag fails with the error EAGAIN, but marks the mount point as expired. The mount point remains expired as long as it isn't accessed by any process. A second umount2() call specifying MNT_EXPIRE unmounts an expired mount point. This flag cannot be specified with either MNT_FORCE or MNT_DETACH. UMOUNT_NOFOLLOW (since Linux 2.6.34) Don't dereference target if it is a symbolic link. This flag allows security problems to be avoided in set-user-ID-root programs that allow unprivileged users to unmount filesystems. http://man7.org/linux/man-pages/man2/nfsservctl.2.html 7 SYSTEM CALL: nfsservctl(2) - Linux manual page FUNCTIONALITY: nfsservctl - syscall interface to kernel nfs daemon SYNOPSIS: #include <linux/nfsd/syscall.h> long nfsservctl(int cmd, struct nfsctl_arg *argp, union nfsctl_res *resp); DESCRIPTION Note: Since Linux 3.1, this system call no longer exists. It has been replaced by a set of files in the nfsd filesystem; see nfsd(7). /* * These are the commands understood by nfsctl(). */ #define NFSCTL_SVC 0 /* This is a server process. */ #define NFSCTL_ADDCLIENT 1 /* Add an NFS client. */ #define NFSCTL_DELCLIENT 2 /* Remove an NFS client. */ #define NFSCTL_EXPORT 3 /* Export a filesystem. */ #define NFSCTL_UNEXPORT 4 /* Unexport a filesystem.
*/ #define NFSCTL_UGIDUPDATE 5 /* Update a client's UID/GID map (only in Linux 2.4.x and earlier). */ #define NFSCTL_GETFH 6 /* Get a file handle (used by mountd) (only in Linux 2.4.x and earlier). */ struct nfsctl_arg { int ca_version; /* safeguard */ union { struct nfsctl_svc u_svc; struct nfsctl_client u_client; struct nfsctl_export u_export; struct nfsctl_uidmap u_umap; struct nfsctl_fhparm u_getfh; unsigned int u_debug; } u; }; union nfsctl_res { struct knfs_fh cr_getfh; unsigned int cr_debug; }; http://man7.org/linux/man-pages/man2/ustat.2.html 10 SYSTEM CALL: ustat(2) - Linux manual page FUNCTIONALITY: ustat - get filesystem statistics SYNOPSIS: #include <sys/types.h> #include <unistd.h> /* libc[45] */ #include <ustat.h> /* glibc2 */ int ustat(dev_t dev, struct ustat *ubuf); DESCRIPTION ustat() returns information about a mounted filesystem. dev is a device number identifying a device containing a mounted filesystem. ubuf is a pointer to a ustat structure that contains the following members: daddr_t f_tfree; /* Total free blocks */ ino_t f_tinode; /* Number of free inodes */ char f_fname[6]; /* Filsys name */ char f_fpack[6]; /* Filsys pack name */ The last two fields, f_fname and f_fpack, are not implemented and will always be filled with null bytes ('\0'). http://man7.org/linux/man-pages/man2/statfs.2.html 11 SYSTEM CALL: statfs(2) - Linux manual page FUNCTIONALITY: statfs, fstatfs - get filesystem statistics SYNOPSIS: #include <sys/vfs.h> /* or <sys/statfs.h> */ int statfs(const char *path, struct statfs *buf); int fstatfs(int fd, struct statfs *buf); DESCRIPTION The statfs() system call returns information about a mounted filesystem. path is the pathname of any file within the mounted filesystem.
buf is a pointer to a statfs structure defined approximately as follows: struct statfs { __fsword_t f_type; /* Type of filesystem (see below) */ __fsword_t f_bsize; /* Optimal transfer block size */ fsblkcnt_t f_blocks; /* Total data blocks in filesystem */ fsblkcnt_t f_bfree; /* Free blocks in filesystem */ fsblkcnt_t f_bavail; /* Free blocks available to unprivileged user */ fsfilcnt_t f_files; /* Total file nodes in filesystem */ fsfilcnt_t f_ffree; /* Free file nodes in filesystem */ fsid_t f_fsid; /* Filesystem ID */ __fsword_t f_namelen; /* Maximum length of filenames */ __fsword_t f_frsize; /* Fragment size (since Linux 2.6) */ __fsword_t f_flags; /* Mount flags of filesystem (since Linux 2.6.36) */ __fsword_t f_spare[xxx]; /* Padding bytes reserved for future use */ }; Filesystem types: ADFS_SUPER_MAGIC 0xadf5 AFFS_SUPER_MAGIC 0xadff BDEVFS_MAGIC 0x62646576 BEFS_SUPER_MAGIC 0x42465331 BFS_MAGIC 0x1badface BINFMTFS_MAGIC 0x42494e4d BTRFS_SUPER_MAGIC 0x9123683e CGROUP_SUPER_MAGIC 0x27e0eb CIFS_MAGIC_NUMBER 0xff534d42 CODA_SUPER_MAGIC 0x73757245 COH_SUPER_MAGIC 0x012ff7b7 CRAMFS_MAGIC 0x28cd3d45 DEBUGFS_MAGIC 0x64626720 DEVFS_SUPER_MAGIC 0x1373 DEVPTS_SUPER_MAGIC 0x1cd1 EFIVARFS_MAGIC 0xde5e81e4 EFS_SUPER_MAGIC 0x00414a53 EXT_SUPER_MAGIC 0x137d EXT2_OLD_SUPER_MAGIC 0xef51 EXT2_SUPER_MAGIC 0xef53 EXT3_SUPER_MAGIC 0xef53 EXT4_SUPER_MAGIC 0xef53 FUSE_SUPER_MAGIC 0x65735546 FUTEXFS_SUPER_MAGIC 0xbad1dea HFS_SUPER_MAGIC 0x4244 HOSTFS_SUPER_MAGIC 0x00c0ffee HPFS_SUPER_MAGIC 0xf995e849 HUGETLBFS_MAGIC 0x958458f6 ISOFS_SUPER_MAGIC 0x9660 JFFS2_SUPER_MAGIC 0x72b6 JFS_SUPER_MAGIC 0x3153464a MINIX_SUPER_MAGIC 0x137f /* orig. 
minix */ MINIX_SUPER_MAGIC2 0x138f /* 30 char minix */ MINIX2_SUPER_MAGIC 0x2468 /* minix V2 */ MINIX2_SUPER_MAGIC2 0x2478 /* minix V2, 30 char names */ MINIX3_SUPER_MAGIC 0x4d5a /* minix V3 fs, 60 char names */ MQUEUE_MAGIC 0x19800202 MSDOS_SUPER_MAGIC 0x4d44 NCP_SUPER_MAGIC 0x564c NFS_SUPER_MAGIC 0x6969 NILFS_SUPER_MAGIC 0x3434 NTFS_SB_MAGIC 0x5346544e OCFS2_SUPER_MAGIC 0x7461636f OPENPROM_SUPER_MAGIC 0x9fa1 PIPEFS_MAGIC 0x50495045 PROC_SUPER_MAGIC 0x9fa0 PSTOREFS_MAGIC 0x6165676c QNX4_SUPER_MAGIC 0x002f QNX6_SUPER_MAGIC 0x68191122 RAMFS_MAGIC 0x858458f6 REISERFS_SUPER_MAGIC 0x52654973 ROMFS_MAGIC 0x7275 SELINUX_MAGIC 0xf97cff8c SMACK_MAGIC 0x43415d53 SMB_SUPER_MAGIC 0x517b SOCKFS_MAGIC 0x534f434b SQUASHFS_MAGIC 0x73717368 SYSFS_MAGIC 0x62656572 SYSV2_SUPER_MAGIC 0x012ff7b6 SYSV4_SUPER_MAGIC 0x012ff7b5 TMPFS_MAGIC 0x01021994 UDF_SUPER_MAGIC 0x15013346 UFS_MAGIC 0x00011954 USBDEVICE_SUPER_MAGIC 0x9fa2 V9FS_MAGIC 0x01021997 VXFS_SUPER_MAGIC 0xa501fcf5 XENFS_SUPER_MAGIC 0xabba1974 XENIX_SUPER_MAGIC 0x012ff7b4 XFS_SUPER_MAGIC 0x58465342 _XIAFS_SUPER_MAGIC 0x012fd16d Most of these MAGIC constants are defined in /usr/include/linux/magic.h, and some are hardcoded in kernel sources. The f_flags is a bit mask indicating mount options for the file system. It contains zero or more of the following bits: ST_MANDLOCK Mandatory locking is permitted on the filesystem (see fcntl(2)). ST_NOATIME Do not update access times; see mount(2). ST_NODEV Disallow access to device special files on this filesystem. ST_NODIRATIME Do not update directory access times; see mount(2). ST_NOEXEC Execution of programs is disallowed on this filesystem. ST_NOSUID The set-user-ID and set-group-ID bits are ignored by exec(3) for executable files on this filesystem ST_RDONLY This filesystem is mounted read-only. ST_RELATIME Update atime relative to mtime/ctime; see mount(2). ST_SYNCHRONOUS Writes are synched to the filesystem immediately (see the description of O_SYNC in open(2)). 
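As a rough sketch of how the f_type and f_flags fields described above might be inspected from C, consider the helpers below. The names fs_is_readonly and print_fs_summary are hypothetical, the ST_RDONLY constant is assumed to come from <sys/statvfs.h> as it does with glibc, and f_flags is only filled in by Linux 2.6.36 and later.

```c
#include <stdio.h>
#include <sys/statvfs.h>   /* ST_RDONLY and the other ST_* flag bits */
#include <sys/vfs.h>       /* statfs(), struct statfs */

/*
 * Hypothetical helper: returns 1 if the filesystem containing path is
 * mounted read-only, 0 if it is not, and -1 if statfs() fails.
 */
int fs_is_readonly(const char *path)
{
    struct statfs buf;

    if (statfs(path, &buf) == -1)
        return -1;
    return (buf.f_flags & ST_RDONLY) ? 1 : 0;
}

/*
 * Hypothetical helper: prints the filesystem type magic number, the
 * optimal transfer block size, and the blocks available to an
 * unprivileged user. Returns 0 on success, -1 if statfs() fails.
 */
int print_fs_summary(const char *path)
{
    struct statfs buf;

    if (statfs(path, &buf) == -1) {
        perror("statfs");
        return -1;
    }
    printf("%s: f_type=0x%lx f_bsize=%ld f_bavail=%llu\n",
           path,
           (unsigned long)buf.f_type,
           (long)buf.f_bsize,
           (unsigned long long)buf.f_bavail);
    return 0;
}
```

The printed f_type value can be compared against the MAGIC constants listed above, for example 0xef53 for an ext2, ext3, or ext4 filesystem.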
Nobody knows what f_fsid is supposed to contain (but see below). Fields that are undefined for a particular filesystem are set to 0. fstatfs() returns the same information about an open file referenced by descriptor fd. http://man7.org/linux/man-pages/man2/fstatfs.2.html 11 SYSTEM CALL: statfs(2) - Linux manual page FUNCTIONALITY: statfs, fstatfs - get filesystem statistics SYNOPSIS: #include /* or */ int statfs(const char *path, struct statfs *buf); int fstatfs(int fd, struct statfs *buf); DESCRIPTION The statfs() system call returns information about a mounted filesystem. path is the pathname of any file within the mounted filesystem. buf is a pointer to a statfs structure defined approximately as follows: struct statfs { __fsword_t f_type; /* Type of filesystem (see below) */ __fsword_t f_bsize; /* Optimal transfer block size */ fsblkcnt_t f_blocks; /* Total data blocks in filesystem */ fsblkcnt_t f_bfree; /* Free blocks in filesystem */ fsblkcnt_t f_bavail; /* Free blocks available to unprivileged user */ fsfilcnt_t f_files; /* Total file nodes in filesystem */ fsfilcnt_t f_ffree; /* Free file nodes in filesystem */ fsid_t f_fsid; /* Filesystem ID */ __fsword_t f_namelen; /* Maximum length of filenames */ __fsword_t f_frsize; /* Fragment size (since Linux 2.6) */ __fsword_t f_flags; /* Mount flags of filesystem (since Linux 2.6.36) */ __fsword_t f_spare[xxx]; /* Padding bytes reserved for future use */ }; Filesystem types: ADFS_SUPER_MAGIC 0xadf5 AFFS_SUPER_MAGIC 0xadff BDEVFS_MAGIC 0x62646576 BEFS_SUPER_MAGIC 0x42465331 BFS_MAGIC 0x1badface BINFMTFS_MAGIC 0x42494e4d BTRFS_SUPER_MAGIC 0x9123683e CGROUP_SUPER_MAGIC 0x27e0eb CIFS_MAGIC_NUMBER 0xff534d42 CODA_SUPER_MAGIC 0x73757245 COH_SUPER_MAGIC 0x012ff7b7 CRAMFS_MAGIC 0x28cd3d45 DEBUGFS_MAGIC 0x64626720 DEVFS_SUPER_MAGIC 0x1373 DEVPTS_SUPER_MAGIC 0x1cd1 EFIVARFS_MAGIC 0xde5e81e4 EFS_SUPER_MAGIC 0x00414a53 EXT_SUPER_MAGIC 0x137d EXT2_OLD_SUPER_MAGIC 0xef51 EXT2_SUPER_MAGIC 0xef53 EXT3_SUPER_MAGIC 0xef53 
EXT4_SUPER_MAGIC 0xef53 FUSE_SUPER_MAGIC 0x65735546 FUTEXFS_SUPER_MAGIC 0xbad1dea HFS_SUPER_MAGIC 0x4244 HOSTFS_SUPER_MAGIC 0x00c0ffee HPFS_SUPER_MAGIC 0xf995e849 HUGETLBFS_MAGIC 0x958458f6 ISOFS_SUPER_MAGIC 0x9660 JFFS2_SUPER_MAGIC 0x72b6 JFS_SUPER_MAGIC 0x3153464a MINIX_SUPER_MAGIC 0x137f /* orig. minix */ MINIX_SUPER_MAGIC2 0x138f /* 30 char minix */ MINIX2_SUPER_MAGIC 0x2468 /* minix V2 */ MINIX2_SUPER_MAGIC2 0x2478 /* minix V2, 30 char names */ MINIX3_SUPER_MAGIC 0x4d5a /* minix V3 fs, 60 char names */ MQUEUE_MAGIC 0x19800202 MSDOS_SUPER_MAGIC 0x4d44 NCP_SUPER_MAGIC 0x564c NFS_SUPER_MAGIC 0x6969 NILFS_SUPER_MAGIC 0x3434 NTFS_SB_MAGIC 0x5346544e OCFS2_SUPER_MAGIC 0x7461636f OPENPROM_SUPER_MAGIC 0x9fa1 PIPEFS_MAGIC 0x50495045 PROC_SUPER_MAGIC 0x9fa0 PSTOREFS_MAGIC 0x6165676c QNX4_SUPER_MAGIC 0x002f QNX6_SUPER_MAGIC 0x68191122 RAMFS_MAGIC 0x858458f6 REISERFS_SUPER_MAGIC 0x52654973 ROMFS_MAGIC 0x7275 SELINUX_MAGIC 0xf97cff8c SMACK_MAGIC 0x43415d53 SMB_SUPER_MAGIC 0x517b SOCKFS_MAGIC 0x534f434b SQUASHFS_MAGIC 0x73717368 SYSFS_MAGIC 0x62656572 SYSV2_SUPER_MAGIC 0x012ff7b6 SYSV4_SUPER_MAGIC 0x012ff7b5 TMPFS_MAGIC 0x01021994 UDF_SUPER_MAGIC 0x15013346 UFS_MAGIC 0x00011954 USBDEVICE_SUPER_MAGIC 0x9fa2 V9FS_MAGIC 0x01021997 VXFS_SUPER_MAGIC 0xa501fcf5 XENFS_SUPER_MAGIC 0xabba1974 XENIX_SUPER_MAGIC 0x012ff7b4 XFS_SUPER_MAGIC 0x58465342 _XIAFS_SUPER_MAGIC 0x012fd16d Most of these MAGIC constants are defined in /usr/include/linux/magic.h, and some are hardcoded in kernel sources. The f_flags is a bit mask indicating mount options for the file system. It contains zero or more of the following bits: ST_MANDLOCK Mandatory locking is permitted on the filesystem (see fcntl(2)). ST_NOATIME Do not update access times; see mount(2). ST_NODEV Disallow access to device special files on this filesystem. ST_NODIRATIME Do not update directory access times; see mount(2). ST_NOEXEC Execution of programs is disallowed on this filesystem. 
ST_NOSUID The set-user-ID and set-group-ID bits are ignored by exec(3) for executable files on this filesystem. ST_RDONLY This filesystem is mounted read-only. ST_RELATIME Update atime relative to mtime/ctime; see mount(2). ST_SYNCHRONOUS Writes are synched to the filesystem immediately (see the description of O_SYNC in open(2)). Nobody knows what f_fsid is supposed to contain (but see below). Fields that are undefined for a particular filesystem are set to 0. fstatfs() returns the same information about an open file referenced by descriptor fd. http://man7.org/linux/man-pages/man2/sysfs.2.html 10 SYSTEM CALL: sysfs(2) - Linux manual page FUNCTIONALITY: sysfs - get filesystem type information SYNOPSIS: int sysfs(int option, const char *fsname); int sysfs(int option, unsigned int fs_index, char *buf); int sysfs(int option); DESCRIPTION sysfs() returns information about the filesystem types currently present in the kernel. The specific form of the sysfs() call and the information returned depends on the option in effect: 1 Translate the filesystem identifier string fsname into a filesystem type index. 2 Translate the filesystem type index fs_index into a null-terminated filesystem identifier string. This string will be written to the buffer pointed to by buf. Make sure that buf has enough space to accept the string. 3 Return the total number of filesystem types currently present in the kernel. The numbering of the filesystem type indexes begins with zero. http://man7.org/linux/man-pages/man2/_sysctl.2.html 12 SYSTEM CALL: sysctl(2) - Linux manual page FUNCTIONALITY: sysctl - read/write system parameters SYNOPSIS: #include <unistd.h> #include <linux/sysctl.h> int _sysctl(struct __sysctl_args *args); Note: There is no glibc wrapper for this system call; see NOTES. DESCRIPTION Do not use this system call! See NOTES. The _sysctl() call reads and/or writes kernel parameters. For example, the hostname, or the maximum number of open files.
The argument has the form struct __sysctl_args { int *name; /* integer vector describing variable */ int nlen; /* length of this vector */ void *oldval; /* 0 or address where to store old value */ size_t *oldlenp; /* available room for old value, overwritten by actual size of old value */ void *newval; /* 0 or address of new value */ size_t newlen; /* size of new value */ }; This call does a search in a tree structure, possibly resembling a directory tree under /proc/sys, and if the requested item is found calls some appropriate routine to read or modify the value. http://man7.org/linux/man-pages/man2/syslog.2.html 10 SYSTEM CALL: syslog(2) - Linux manual page FUNCTIONALITY: syslog, klogctl - read and/or clear kernel message ring buffer; set console_loglevel SYNOPSIS: int syslog(int type, char *bufp, int len); /* No wrapper provided in glibc */ /* The glibc interface */ #include <sys/klog.h> int klogctl(int type, char *bufp, int len); DESCRIPTION Note: Probably, you are looking for the C library function syslog(), which talks to syslogd(8); see syslog(3) for details. This page describes the kernel syslog() system call, which is used to control the kernel printk() buffer; the glibc wrapper function for the system call is called klogctl(). The kernel log buffer The kernel has a cyclic buffer of length LOG_BUF_LEN in which messages given as arguments to the kernel function printk() are stored (regardless of their log level). In early kernels, LOG_BUF_LEN had the value 4096; from kernel 1.3.54, it was 8192; from kernel 2.1.113, it was 16384; since kernel 2.4.23/2.6, the value is a kernel configuration option (CONFIG_LOG_BUF_SHIFT, default value dependent on the architecture). Since Linux 2.6.6, the size can be queried with command type 10 (see below). Commands The type argument determines the action taken by this function. The list below specifies the values for type.
The symbolic names are defined in the kernel source, but are not exported to user space; you will either need to use the numbers, or define the names yourself. SYSLOG_ACTION_CLOSE (0) Close the log. Currently a NOP. SYSLOG_ACTION_OPEN (1) Open the log. Currently a NOP. SYSLOG_ACTION_READ (2) Read from the log. The call waits until the kernel log buffer is nonempty, and then reads at most len bytes into the buffer pointed to by bufp. The call returns the number of bytes read. Bytes read from the log disappear from the log buffer: the information can be read only once. This is the function executed by the kernel when a user program reads /proc/kmsg. SYSLOG_ACTION_READ_ALL (3) Read all messages remaining in the ring buffer, placing them in the buffer pointed to by bufp. The call reads the last len bytes from the log buffer (nondestructively), but will not read more than was written into the buffer since the last "clear ring buffer" command (see command 5 below). The call returns the number of bytes read. SYSLOG_ACTION_READ_CLEAR (4) Read and clear all messages remaining in the ring buffer. The call does precisely the same as for a type of 3, but also executes the "clear ring buffer" command. SYSLOG_ACTION_CLEAR (5) The call executes just the "clear ring buffer" command. The bufp and len arguments are ignored. This command does not really clear the ring buffer. Rather, it sets a kernel bookkeeping variable that determines the results returned by commands 3 (SYSLOG_ACTION_READ_ALL) and 4 (SYSLOG_ACTION_READ_CLEAR). This command has no effect on commands 2 (SYSLOG_ACTION_READ) and 9 (SYSLOG_ACTION_SIZE_UNREAD). SYSLOG_ACTION_CONSOLE_OFF (6) The command saves the current value of console_loglevel and then sets console_loglevel to minimum_console_loglevel, so that no messages are printed to the console. Before Linux 2.6.32, the command simply sets console_loglevel to minimum_console_loglevel. See the discussion of /proc/sys/kernel/printk, below.
The bufp and len arguments are ignored. SYSLOG_ACTION_CONSOLE_ON (7) If a previous SYSLOG_ACTION_CONSOLE_OFF command has been performed, this command restores console_loglevel to the value that was saved by that command. Before Linux 2.6.32, this command simply sets console_loglevel to default_console_loglevel. See the discussion of /proc/sys/kernel/printk, below. The bufp and len arguments are ignored. SYSLOG_ACTION_CONSOLE_LEVEL (8) The call sets console_loglevel to the value given in len, which must be an integer between 1 and 8 (inclusive). The kernel silently enforces a minimum value of minimum_console_loglevel for len. See the log level section for details. The bufp argument is ignored. SYSLOG_ACTION_SIZE_UNREAD (9) (since Linux 2.4.10) The call returns the number of bytes currently available to be read from the kernel log buffer via command 2 (SYSLOG_ACTION_READ). The bufp and len arguments are ignored. SYSLOG_ACTION_SIZE_BUFFER (10) (since Linux 2.6.6) This command returns the total size of the kernel log buffer. The bufp and len arguments are ignored. All commands except 3 and 10 require privilege. In Linux kernels before 2.6.37, command types 3 and 10 are allowed to unprivileged processes; since Linux 2.6.37, these commands are allowed to unprivileged processes only if /proc/sys/kernel/dmesg_restrict has the value 0. Before Linux 2.6.37, "privileged" means that the caller has the CAP_SYS_ADMIN capability. Since Linux 2.6.37, "privileged" means that the caller has either the CAP_SYS_ADMIN capability (now deprecated for this purpose) or the (new) CAP_SYSLOG capability. /proc/sys/kernel/printk /proc/sys/kernel/printk is a writable file containing four integer values that influence kernel printk() behavior when printing or logging error messages. The four values are: console_loglevel Only messages with a log level lower than this value will be printed to the console. 
The default value for this field is DEFAULT_CONSOLE_LOGLEVEL (7), but it is set to 4 if the kernel command line contains the word "quiet", 10 if the kernel command line contains the word "debug", and to 15 in case of a kernel fault (the 10 and 15 are just silly, and equivalent to 8). The value of console_loglevel can be set (to a value in the range 1-8) by a syslog() call with a type of 8. default_message_loglevel This value will be used as the log level for printk() messages that do not have an explicit level. Up to and including Linux 2.6.38, the hard-coded default value for this field was 4 (KERN_WARNING); since Linux 2.6.39, the default value is defined by the kernel configuration option CONFIG_DEFAULT_MESSAGE_LOGLEVEL, which defaults to 4. minimum_console_loglevel The value in this field is the minimum value to which console_loglevel can be set. default_console_loglevel This is the default value for console_loglevel. The log level Every printk() message has its own log level. If the log level is not explicitly specified as part of the message, it defaults to default_message_loglevel. The conventional meaning of the log level is as follows: Kernel constant Level value Meaning KERN_EMERG 0 System is unusable KERN_ALERT 1 Action must be taken immediately KERN_CRIT 2 Critical conditions KERN_ERR 3 Error conditions KERN_WARNING 4 Warning conditions KERN_NOTICE 5 Normal but significant condition KERN_INFO 6 Informational KERN_DEBUG 7 Debug-level messages The kernel printk() routine will print a message on the console only if it has a log level less than the value of console_loglevel. http://man7.org/linux/man-pages/man2/ioperm.2.html 10 SYSTEM CALL: ioperm(2) - Linux manual page FUNCTIONALITY: ioperm - set port input/output permissions SYNOPSIS: #include <sys/io.h> /* for glibc */ int ioperm(unsigned long from, unsigned long num, int turn_on); DESCRIPTION ioperm() sets the port access permission bits for the calling thread for num bits starting from port address from.
If turn_on is nonzero, then permission for the specified bits is enabled; otherwise it is disabled. If turn_on is nonzero, the calling thread must be privileged (CAP_SYS_RAWIO). Before Linux 2.6.8, only the first 0x3ff I/O ports could be specified in this manner. For more ports, the iopl(2) system call had to be used (with a level argument of 3). Since Linux 2.6.8, 65,536 I/O ports can be specified. Permissions are inherited by the child created by fork(2) (but see NOTES). Permissions are preserved across execve(2); this is useful for giving port access permissions to unprivileged programs. This call is mostly for the i386 architecture. On many other architectures it does not exist or will always return an error. http://man7.org/linux/man-pages/man2/iopl.2.html 10 SYSTEM CALL: iopl(2) - Linux manual page FUNCTIONALITY: iopl - change I/O privilege level SYNOPSIS: #include <sys/io.h> int iopl(int level); DESCRIPTION iopl() changes the I/O privilege level of the calling process, as specified by the two least significant bits in level. This call is necessary to allow 8514-compatible X servers to run under Linux. Since these X servers require access to all 65536 I/O ports, the ioperm(2) call is not sufficient. In addition to granting unrestricted I/O port access, running at a higher I/O privilege level also allows the process to disable interrupts. This will probably crash the system, and is not recommended. Permissions are not inherited by the child process created by fork(2) and are not preserved across execve(2) (but see NOTES). The I/O privilege level for a normal process is 0. This call is mostly for the i386 architecture. On many other architectures it does not exist or will always return an error.
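For a limited range of ports, ioperm() is usually the narrower tool compared with iopl(). The sketch below wraps it with a bounds check against the 65,536-port limit mentioned above. The wrapper name set_port_access, the IO_PORT_COUNT macro, the up-front range check, and the x86-only guard are all assumptions made for illustration; on x86 the real call still requires CAP_SYS_RAWIO when enabling access.

```c
#include <errno.h>
#if defined(__i386__) || defined(__x86_64__)
#include <sys/io.h>            /* ioperm() as provided by glibc on x86 */
#endif

#define IO_PORT_COUNT 65536UL  /* ports addressable since Linux 2.6.8 */

/*
 * Hypothetical wrapper: enable (turn_on != 0) or disable (turn_on == 0)
 * access to num ports starting at port address from.
 * Returns 0 on success, -1 on failure with errno set.
 */
int set_port_access(unsigned long from, unsigned long num, int turn_on)
{
    /* Reject empty or out-of-range requests before calling the kernel. */
    if (num == 0 || from >= IO_PORT_COUNT || num > IO_PORT_COUNT - from) {
        errno = EINVAL;
        return -1;
    }
#if defined(__i386__) || defined(__x86_64__)
    return ioperm(from, num, turn_on);
#else
    (void)turn_on;
    errno = ENOSYS;            /* no port I/O permission bitmap here */
    return -1;
#endif
}
```

For instance, set_port_access(0x378, 3, 1) would request the three registers conventionally used by the first parallel port on PC hardware; without CAP_SYS_RAWIO the underlying ioperm() call is expected to fail.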
http://man7.org/linux/man-pages/man2/personality.2.html 10 SYSTEM CALL: personality(2) - Linux manual page FUNCTIONALITY: personality - set the process execution domain SYNOPSIS: #include <sys/personality.h> int personality(unsigned long persona); DESCRIPTION Linux supports different execution domains, or personalities, for each process. Among other things, execution domains tell Linux how to map signal numbers into signal actions. The execution domain system allows Linux to provide limited support for binaries compiled under other UNIX-like operating systems. If persona is not 0xffffffff, then personality() sets the caller's execution domain to the value specified by persona. Specifying persona as 0xffffffff provides a way of retrieving the current persona without changing it. A list of the available execution domains can be found in <sys/personality.h>. The execution domain is a 32-bit value in which the top three bytes are set aside for flags that cause the kernel to modify the behavior of certain system calls so as to emulate historical or architectural quirks. The least significant byte is a value defining the personality the kernel should assume. The flag values are as follows: ADDR_COMPAT_LAYOUT (since Linux 2.6.9) With this flag set, provide legacy virtual address space layout. ADDR_NO_RANDOMIZE (since Linux 2.6.12) With this flag set, disable address-space-layout randomization. ADDR_LIMIT_32BIT (since Linux 2.2) Limit the address space to 32 bits. ADDR_LIMIT_3GB (since Linux 2.4.0) With this flag set, use 0xc0000000 as the offset at which to search a virtual memory chunk on mmap(2); otherwise use 0xffffe000. FDPIC_FUNCPTRS (since Linux 2.6.11) User-space function pointers to signal handlers point (on certain architectures) to descriptors. MMAP_PAGE_ZERO (since Linux 2.4.0) Map page 0 as read-only (to support binaries that depend on this SVr4 behavior). READ_IMPLIES_EXEC (since Linux 2.6.8) With this flag set, PROT_READ implies PROT_EXEC for mmap(2). SHORT_INODE (since Linux 2.4.0) No effects(?).
STICKY_TIMEOUTS (since Linux 1.2.0) With this flag set, select(2), pselect(2), and ppoll(2) do not modify the returned timeout argument when interrupted by a signal handler. UNAME26 (since Linux 3.1) Have uname(2) report a 2.6.40+ version number rather than a 3.x version number. Added as a stopgap measure to support broken applications that could not handle the kernel version-numbering switch from 2.6.x to 3.x. WHOLE_SECONDS (since Linux 1.2.0) No effects(?). The available execution domains are: PER_BSD (since Linux 1.2.0) BSD. (No effects.) PER_HPUX (since Linux 2.4) Support for 32-bit HP/UX. This support was never complete, and was dropped so that since Linux 4.0, this value has no effect. PER_IRIX32 (since Linux 2.2) IRIX 5 32-bit. Never fully functional; support dropped in Linux 2.6.27. Implies STICKY_TIMEOUTS. PER_IRIX64 (since Linux 2.2) IRIX 6 64-bit. Implies STICKY_TIMEOUTS; otherwise no effects. PER_IRIXN32 (since Linux 2.2) IRIX 6 new 32-bit. Implies STICKY_TIMEOUTS; otherwise no effects. PER_ISCR4 (since Linux 1.2.0) Implies STICKY_TIMEOUTS; otherwise no effects. PER_LINUX (since Linux 1.2.0) Linux. PER_LINUX32 (since Linux 2.2) [To be documented.] PER_LINUX32_3GB (since Linux 2.4) Implies ADDR_LIMIT_3GB. PER_LINUX_32BIT (since Linux 2.0) Implies ADDR_LIMIT_32BIT. PER_LINUX_FDPIC (since Linux 2.6.11) Implies FDPIC_FUNCPTRS. PER_OSF4 (since Linux 2.4) OSF/1 v4. On alpha, clear top 32 bits of iov_len in the user's buffer for compatibility with old versions of OSF/1 where iov_len was defined as int. PER_OSR5 (since Linux 2.4) Implies STICKY_TIMEOUTS and WHOLE_SECONDS; otherwise no effects. PER_RISCOS (since Linux 2.2) [To be documented.] PER_SCOSVR3 (since Linux 1.2.0) Implies STICKY_TIMEOUTS, WHOLE_SECONDS, and SHORT_INODE; otherwise no effects. PER_SOLARIS (since Linux 2.4) Implies STICKY_TIMEOUTS; otherwise no effects. PER_SUNOS (since Linux 2.4.0) Implies STICKY_TIMEOUTS. Divert library and dynamic linker searches to /usr/gnemul.
Buggy, largely unmaintained, and almost entirely unused; support was removed in Linux 2.6.26. PER_SVR3 (since Linux 1.2.0) Implies STICKY_TIMEOUTS and SHORT_INODE; otherwise no effects. PER_SVR4 (since Linux 1.2.0) Implies STICKY_TIMEOUTS and MMAP_PAGE_ZERO; otherwise no effects. PER_UW7 (since Linux 2.4) Implies STICKY_TIMEOUTS and MMAP_PAGE_ZERO; otherwise no effects. PER_WYSEV386 (since Linux 1.2.0) Implies STICKY_TIMEOUTS and SHORT_INODE; otherwise no effects. PER_XENIX (since Linux 1.2.0) Implies STICKY_TIMEOUTS and SHORT_INODE; otherwise no effects. http://man7.org/linux/man-pages/man2/vhangup.2.html 9 SYSTEM CALL: vhangup(2) - Linux manual page FUNCTIONALITY: vhangup - virtually hangup the current terminal SYNOPSIS: #include <unistd.h> int vhangup(void); Feature Test Macro Requirements for glibc (see feature_test_macros(7)): vhangup(): Since glibc 2.21: _DEFAULT_SOURCE In glibc 2.19 and 2.20: _DEFAULT_SOURCE || (_XOPEN_SOURCE && _XOPEN_SOURCE < 500) Up to and including glibc 2.19: _BSD_SOURCE || (_XOPEN_SOURCE && _XOPEN_SOURCE < 500) DESCRIPTION vhangup() simulates a hangup on the current terminal. This call arranges for other users to have a “clean” terminal at login time.
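Returning to the personality(2) entry above: the query idiom (passing 0xffffffff) and the byte layout of the persona value can be sketched as follows. This is a minimal sketch, not text from the manual page; the helper names current_persona, persona_domain, and persona_flags are hypothetical and exist only for illustration.

```c
#include <sys/personality.h>

/* Hypothetical helper: read the current persona without changing it.
 * 0xffffffff is the "query only" value described in the entry above.
 * Returns -1 on error, as personality() does. */
int current_persona(void)
{
    return personality(0xffffffff);
}

/* Hypothetical helper: the least significant byte selects the
 * execution domain (for example PER_LINUX or PER_LINUX32). */
unsigned long persona_domain(unsigned long persona)
{
    return persona & 0xffUL;
}

/* Hypothetical helper: the top three bytes carry the flag bits
 * (for example ADDR_NO_RANDOMIZE or READ_IMPLIES_EXEC). */
unsigned long persona_flags(unsigned long persona)
{
    return persona & 0xffffff00UL;
}
```

A caller that wants to add one flag, such as ADDR_NO_RANDOMIZE, would typically OR it into the value returned by current_persona() and pass the result back to personality(), so that the existing domain and flags are preserved.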
http://man7.org/linux/man-pages/man2/reboot.2.html 9 SYSTEM CALL: reboot(2) - Linux manual page FUNCTIONALITY: reboot - reboot or enable/disable Ctrl-Alt-Del SYNOPSIS: /* For libc4 and libc5 the library call and the system call are identical, and since kernel version 2.1.30 there are symbolic names LINUX_REBOOT_* for the constants and a fourth argument to the call: */ #include <unistd.h> #include <linux/reboot.h> int reboot(int magic, int magic2, int cmd, void *arg); /* Under glibc and most alternative libc's (including uclibc, dietlibc, musl and a few others), some of the constants involved have gotten symbolic names RB_*, and the library call is a 1-argument wrapper around the 3-argument system call: */ #include <unistd.h> #include <sys/reboot.h> int reboot(int cmd); DESCRIPTION The reboot() call reboots the system, or enables/disables the reboot keystroke (abbreviated CAD, since the default is Ctrl-Alt-Delete; it can be changed using loadkeys(1)). This system call will fail (with EINVAL) unless magic equals LINUX_REBOOT_MAGIC1 (that is, 0xfee1dead) and magic2 equals LINUX_REBOOT_MAGIC2 (that is, 672274793). However, since 2.1.17 also LINUX_REBOOT_MAGIC2A (that is, 85072278) and since 2.1.97 also LINUX_REBOOT_MAGIC2B (that is, 369367448) and since 2.5.71 also LINUX_REBOOT_MAGIC2C (that is, 537993216) are permitted as values for magic2. (The hexadecimal values of these constants are meaningful.) The cmd argument can have the following values: LINUX_REBOOT_CMD_CAD_OFF (RB_DISABLE_CAD, 0). CAD is disabled. This means that the CAD keystroke will cause a SIGINT signal to be sent to init (process 1), whereupon this process may decide upon a proper action (maybe: kill all processes, sync, reboot). LINUX_REBOOT_CMD_CAD_ON (RB_ENABLE_CAD, 0x89abcdef). CAD is enabled. This means that the CAD keystroke will immediately cause the action associated with LINUX_REBOOT_CMD_RESTART. LINUX_REBOOT_CMD_HALT (RB_HALT_SYSTEM, 0xcdef0123; since Linux 1.1.76). The message "System halted." is printed, and the system is halted.
Control is given to the ROM monitor, if there is one. If not preceded by a sync(2), data will be lost. LINUX_REBOOT_CMD_KEXEC (RB_KEXEC, 0x45584543, since Linux 2.6.13). Execute a kernel that has been loaded earlier with kexec_load(2). This option is available only if the kernel was configured with CONFIG_KEXEC. LINUX_REBOOT_CMD_POWER_OFF (RB_POWER_OFF, 0x4321fedc; since Linux 2.1.30). The message "Power down." is printed, the system is stopped, and all power is removed from the system, if possible. If not preceded by a sync(2), data will be lost. LINUX_REBOOT_CMD_RESTART (RB_AUTOBOOT, 0x1234567). The message "Restarting system." is printed, and a default restart is performed immediately. If not preceded by a sync(2), data will be lost. LINUX_REBOOT_CMD_RESTART2 (0xa1b2c3d4; since Linux 2.1.30). The message "Restarting system with command '%s'" is printed, and a restart (using the command string given in arg) is performed immediately. If not preceded by a sync(2), data will be lost. LINUX_REBOOT_CMD_SW_SUSPEND (RB_SW_SUSPEND, 0xd000fce1; since Linux 2.5.18). The system is suspended (hibernated) to disk. This option is available only if the kernel was configured with CONFIG_HIBERNATION. Only the superuser may call reboot(). The precise effect of the above actions depends on the architecture. For the i386 architecture, the additional argument does not do anything at present (2.1.122), but the type of reboot can be determined by kernel command-line arguments ("reboot=...") to be either warm or cold, and either hard or through the BIOS. Behavior inside PID namespaces Since Linux 3.4, when reboot() is called from a PID namespace (see pid_namespaces(7)) other than the initial PID namespace, the effect of the call is to send a signal to the namespace "init" process. LINUX_REBOOT_CMD_RESTART and LINUX_REBOOT_CMD_RESTART2 cause a SIGHUP signal to be sent. LINUX_REBOOT_CMD_POWER_OFF and LINUX_REBOOT_CMD_HALT cause a SIGINT signal to be sent. 
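The magic-number rule and the sync(2) advice in the reboot(2) entry above can be sketched in user space as follows. This is an illustrative sketch under stated assumptions, not kernel code: reboot_magic_ok and raw_reboot are hypothetical helper names, the constants come from <linux/reboot.h>, and raw_reboot is defined but never called here because it needs superuser rights and really would restart or halt the machine.

```c
#include <linux/reboot.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper mirroring the documented check: magic must be
 * LINUX_REBOOT_MAGIC1 and magic2 must be one of the four accepted
 * values, otherwise the real call fails with EINVAL. */
int reboot_magic_ok(unsigned int magic, unsigned int magic2)
{
    if (magic != LINUX_REBOOT_MAGIC1)
        return 0;
    return magic2 == LINUX_REBOOT_MAGIC2  ||
           magic2 == LINUX_REBOOT_MAGIC2A ||
           magic2 == LINUX_REBOOT_MAGIC2B ||
           magic2 == LINUX_REBOOT_MAGIC2C;
}

/* Hypothetical wrapper around the raw system call. It flushes
 * filesystem buffers first, because the entry above warns that data
 * will be lost if the call is not preceded by sync(2).
 * Only the superuser may succeed; on failure it returns -1. */
long raw_reboot(unsigned int cmd)
{
    sync();
    return syscall(SYS_reboot, LINUX_REBOOT_MAGIC1,
                   LINUX_REBOOT_MAGIC2, cmd, NULL);
}
```

Printing the four magic2 values with a "%x" format shows why the entry calls their hexadecimal forms meaningful: 672274793 is 0x28121969, 85072278 is 0x05121996, 369367448 is 0x16041998, and 537993216 is 0x20112000.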
http://man7.org/linux/man-pages/man2/kexec_load.2.html 11 SYSTEM CALL: kexec_load(2) - Linux manual page FUNCTIONALITY: kexec_load, kexec_file_load - load a new kernel for later execution SYNOPSIS: #include <linux/kexec.h> long kexec_load(unsigned long entry, unsigned long nr_segments, struct kexec_segment *segments, unsigned long flags); long kexec_file_load(int kernel_fd, int initrd_fd, unsigned long cmdline_len, const char *cmdline, unsigned long flags); Note: There are no glibc wrappers for these system calls; see NOTES. DESCRIPTION The kexec_load() system call loads a new kernel that can be executed later by reboot(2). The flags argument is a bit mask that controls the operation of the call. The following values can be specified in flags: KEXEC_ON_CRASH (since Linux 2.6.13) Execute the new kernel automatically on a system crash. This "crash kernel" is loaded into an area of reserved memory that is determined at boot time using the crashkernel kernel command-line parameter. The location of this reserved memory is exported to user space via the /proc/iomem file, in an entry labeled "Crash kernel". A user-space application can parse this file and prepare a list of segments (see below) that specify this reserved memory as destination. If this flag is specified, the kernel checks that the target segments specified in segments fall within the reserved region. KEXEC_PRESERVE_CONTEXT (since Linux 2.6.27) Preserve the system hardware and software states before executing the new kernel. This could be used for system suspend. This flag is available only if the kernel was configured with CONFIG_KEXEC_JUMP, and is effective only if nr_segments is greater than 0. The high-order bits (corresponding to the mask 0xffff0000) of flags contain the architecture of the to-be-executed kernel.
Specify (OR) the constant KEXEC_ARCH_DEFAULT to use the current architecture, or one of the following architecture constants KEXEC_ARCH_386, KEXEC_ARCH_68K, KEXEC_ARCH_X86_64, KEXEC_ARCH_PPC, KEXEC_ARCH_PPC64, KEXEC_ARCH_IA_64, KEXEC_ARCH_ARM, KEXEC_ARCH_S390, KEXEC_ARCH_SH, KEXEC_ARCH_MIPS, and KEXEC_ARCH_MIPS_LE. The architecture must be executable on the CPU of the system. The entry argument is the physical entry address in the kernel image. The nr_segments argument is the number of segments pointed to by the segments pointer; the kernel imposes an (arbitrary) limit of 16 on the number of segments. The segments argument is an array of kexec_segment structures which define the kernel layout: struct kexec_segment { void *buf; /* Buffer in user space */ size_t bufsz; /* Buffer length in user space */ void *mem; /* Physical address of kernel */ size_t memsz; /* Physical address length */ }; The kernel image defined by segments is copied from the calling process into the kernel either in regular memory or in reserved memory (if KEXEC_ON_CRASH is set). The kernel first performs various sanity checks on the information passed in segments. If these checks pass, the kernel copies the segment data to kernel memory. Each segment specified in segments is copied as follows: * buf and bufsz identify a memory region in the caller's virtual address space that is the source of the copy. The value in bufsz may not exceed the value in the memsz field. * mem and memsz specify a physical address range that is the target of the copy. The values specified in both fields must be multiples of the system page size. * bufsz bytes are copied from the source buffer to the target kernel buffer. If bufsz is less than memsz, then the excess bytes in the kernel buffer are zeroed out. 
In case of a normal kexec (i.e., the KEXEC_ON_CRASH flag is not set), the segment data is loaded in any available memory and is moved to the final destination at kexec reboot time (e.g., when the kexec(8) command is executed with the -e option). In case of kexec on panic (i.e., the KEXEC_ON_CRASH flag is set), the segment data is loaded to reserved memory at the time of the call, and, after a crash, the kexec mechanism simply passes control to that kernel. The kexec_load() system call is available only if the kernel was configured with CONFIG_KEXEC. kexec_file_load() The kexec_file_load() system call is similar to kexec_load(), but it takes a different set of arguments. It reads the kernel to be loaded from the file referred to by the file descriptor kernel_fd, and the initrd (initial RAM disk) to be loaded from the file referred to by the file descriptor initrd_fd. The cmdline argument is a pointer to a buffer containing the command line for the new kernel. The cmdline_len argument specifies the size of the buffer. The last byte in the buffer must be a null byte ('\0'). The flags argument is a bit mask which modifies the behavior of the call. The following values can be specified in flags: KEXEC_FILE_UNLOAD Unload the currently loaded kernel. KEXEC_FILE_ON_CRASH Load the new kernel in the memory region reserved for the crash kernel (as for KEXEC_ON_CRASH). This kernel is booted if the currently running kernel crashes. KEXEC_FILE_NO_INITRAMFS Loading initrd/initramfs is optional. Specify this flag if no initramfs is being loaded. If this flag is set, the value passed in initrd_fd is ignored. The kexec_file_load() system call was added to provide support for systems where "kexec" loading should be restricted to only kernels that are signed. This system call is available only if the kernel was configured with CONFIG_KEXEC_FILE.
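The per-segment rules in the kexec_load() description above (at most 16 segments, bufsz not larger than memsz, mem and memsz multiples of the page size) and the architecture bits in flags can be sketched as a user-space pre-flight check. This is a minimal sketch under stated assumptions, not the kernel's own validation: kexec_segments_plausible and kexec_arch_of are hypothetical helper names, while struct kexec_segment, KEXEC_SEGMENT_MAX, and KEXEC_ARCH_MASK are taken from <linux/kexec.h>.

```c
#include <stddef.h>   /* size_t, needed by struct kexec_segment */
#include <stdint.h>
#include <linux/kexec.h>

/* Hypothetical pre-flight check for a segment list.
 * Returns 1 if every documented constraint holds, 0 otherwise. */
int kexec_segments_plausible(const struct kexec_segment *segs,
                             unsigned long nr_segments,
                             size_t page_size)
{
    /* The kernel imposes a limit of 16 segments. */
    if (nr_segments > KEXEC_SEGMENT_MAX)
        return 0;

    for (unsigned long i = 0; i < nr_segments; i++) {
        /* The source buffer may not be larger than the target range. */
        if (segs[i].bufsz > segs[i].memsz)
            return 0;
        /* Target address and length must be page-size multiples. */
        if ((uintptr_t)segs[i].mem % page_size != 0)
            return 0;
        if (segs[i].memsz % page_size != 0)
            return 0;
    }
    return 1;
}

/* Hypothetical helper: extract the architecture constant stored in
 * the high-order bits (mask 0xffff0000) of the flags argument. */
unsigned long kexec_arch_of(unsigned long flags)
{
    return flags & KEXEC_ARCH_MASK;
}
```

In practice the page size would come from sysconf(_SC_PAGESIZE); it is a parameter here so that the check can be exercised with a fixed value.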
http://man7.org/linux/man-pages/man2/kexec_file_load.2.html 11 SYSTEM CALL: kexec_load(2) - Linux manual page FUNCTIONALITY: kexec_load, kexec_file_load - load a new kernel for later execution
http://man7.org/linux/man-pages/man2/perf_event_open.2.html 13 SYSTEM CALL: perf_event_open(2) - Linux manual page FUNCTIONALITY: perf_event_open - set up performance monitoring SYNOPSIS: #include <linux/perf_event.h> #include <linux/hw_breakpoint.h> int perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu, int group_fd, unsigned long flags); Note: There is no glibc wrapper for this system call; see NOTES. DESCRIPTION Given a list of parameters, perf_event_open() returns a file descriptor, for use in subsequent system calls (read(2), mmap(2), prctl(2), fcntl(2), etc.). A call to perf_event_open() creates a file descriptor that allows measuring performance information. Each file descriptor corresponds to one event that is measured; these can be grouped together to measure multiple events simultaneously. Events can be enabled and disabled in two ways: via ioctl(2) and via prctl(2). When an event is disabled it does not count or generate overflows but does continue to exist and maintain its count value. Events come in two flavors: counting and sampled. A counting event is one that is used for counting the aggregate number of events that occur. In general, counting event results are gathered with a read(2) call. A sampling event periodically writes measurements to a buffer that can then be accessed via mmap(2). Arguments The pid and cpu arguments allow specifying which process and CPU to monitor: pid == 0 and cpu == -1 This measures the calling process/thread on any CPU. pid == 0 and cpu >= 0 This measures the calling process/thread only when running on the specified CPU. pid > 0 and cpu == -1 This measures the specified process/thread on any CPU. pid > 0 and cpu >= 0 This measures the specified process/thread only when running on the specified CPU. pid == -1 and cpu >= 0 This measures all processes/threads on the specified CPU. This requires CAP_SYS_ADMIN capability or a /proc/sys/kernel/perf_event_paranoid value of less than 1.
pid == -1 and cpu == -1 This setting is invalid and will return an error. When pid is greater than zero, permission to perform this system call is governed by a ptrace access mode PTRACE_MODE_READ_REALCREDS check; see ptrace(2). The group_fd argument allows event groups to be created. An event group has one event which is the group leader. The leader is created first, with group_fd = -1. The rest of the group members are created with subsequent perf_event_open() calls with group_fd being set to the file descriptor of the group leader. (A single event on its own is created with group_fd = -1 and is considered to be a group with only 1 member.) An event group is scheduled onto the CPU as a unit: it will be put onto the CPU only if all of the events in the group can be put onto the CPU. This means that the values of the member events can be meaningfully compared—added, divided (to get ratios), and so on—with each other, since they have counted events for the same set of executed instructions. The flags argument is formed by ORing together zero or more of the following values: PERF_FLAG_FD_CLOEXEC (since Linux 3.14) This flag enables the close-on-exec flag for the created event file descriptor, so that the file descriptor is automatically closed on execve(2). Setting the close-on-exec flags at creation time, rather than later with fcntl(2), avoids potential race conditions where the calling thread invokes perf_event_open() and fcntl(2) at the same time as another thread calls fork(2) then execve(2). PERF_FLAG_FD_NO_GROUP This flag tells the event to ignore the group_fd parameter except for the purpose of setting up output redirection using the PERF_FLAG_FD_OUTPUT flag. PERF_FLAG_FD_OUTPUT (broken since Linux 2.6.35) This flag re-routes the event's sampled output to instead be included in the mmap buffer of the event specified by group_fd. PERF_FLAG_PID_CGROUP (since Linux 2.6.39) This flag activates per-container system-wide monitoring. 
A container is an abstraction that isolates a set of resources for finer-grained control (CPUs, memory, etc.). In this mode, the event is measured only if the thread running on the monitored CPU belongs to the designated container (cgroup). The cgroup is identified by passing a file descriptor opened on its directory in the cgroupfs filesystem. For instance, if the cgroup to monitor is called test, then a file descriptor opened on /dev/cgroup/test (assuming cgroupfs is mounted on /dev/cgroup) must be passed as the pid parameter. cgroup monitoring is available only for system-wide events and may therefore require extra permissions. The perf_event_attr structure provides detailed configuration information for the event being created. struct perf_event_attr { __u32 type; /* Type of event */ __u32 size; /* Size of attribute structure */ __u64 config; /* Type-specific configuration */ union { __u64 sample_period; /* Period of sampling */ __u64 sample_freq; /* Frequency of sampling */ }; __u64 sample_type; /* Specifies values included in sample */ __u64 read_format; /* Specifies values returned in read */ __u64 disabled : 1, /* off by default */ inherit : 1, /* children inherit it */ pinned : 1, /* must always be on PMU */ exclusive : 1, /* only group on PMU */ exclude_user : 1, /* don't count user */ exclude_kernel : 1, /* don't count kernel */ exclude_hv : 1, /* don't count hypervisor */ exclude_idle : 1, /* don't count when idle */ mmap : 1, /* include mmap data */ comm : 1, /* include comm data */ freq : 1, /* use freq, not period */ inherit_stat : 1, /* per task counts */ enable_on_exec : 1, /* next exec enables */ task : 1, /* trace fork/exit */ watermark : 1, /* wakeup_watermark */ precise_ip : 2, /* skid constraint */ mmap_data : 1, /* non-exec mmap data */ sample_id_all : 1, /* sample_type all events */ exclude_host : 1, /* don't count in host */ exclude_guest : 1, /* don't count in guest */ exclude_callchain_kernel : 1, /* exclude kernel callchains */ 
exclude_callchain_user : 1, /* exclude user callchains */ mmap2 : 1, /* include mmap with inode data */ comm_exec : 1, /* flag comm events that are due to exec */ use_clockid : 1, /* use clockid for time fields */ __reserved_1 : 38; union { __u32 wakeup_events; /* wakeup every n events */ __u32 wakeup_watermark; /* bytes before wakeup */ }; __u32 bp_type; /* breakpoint type */ union { __u64 bp_addr; /* breakpoint address */ __u64 config1; /* extension of config */ }; union { __u64 bp_len; /* breakpoint length */ __u64 config2; /* extension of config1 */ }; __u64 branch_sample_type; /* enum perf_branch_sample_type */ __u64 sample_regs_user; /* user regs to dump on samples */ __u32 sample_stack_user; /* size of stack to dump on samples */ __s32 clockid; /* clock to use for time fields */ __u64 sample_regs_intr; /* regs to dump on samples */ __u32 aux_watermark; /* aux bytes before wakeup */ __u32 __reserved_2; /* align to u64 */ }; The fields of the perf_event_attr structure are described in more detail below: type This field specifies the overall event type. It has one of the following values: PERF_TYPE_HARDWARE This indicates one of the "generalized" hardware events provided by the kernel. See the config field definition for more details. PERF_TYPE_SOFTWARE This indicates one of the software-defined events provided by the kernel (even if no hardware support is available). PERF_TYPE_TRACEPOINT This indicates a tracepoint provided by the kernel tracepoint infrastructure. PERF_TYPE_HW_CACHE This indicates a hardware cache event. This has a special encoding, described in the config field definition. PERF_TYPE_RAW This indicates a "raw" implementation-specific event in the config field. PERF_TYPE_BREAKPOINT (since Linux 2.6.33) This indicates a hardware breakpoint as provided by the CPU. Breakpoints can be read/write accesses to an address as well as execution of an instruction address. dynamic PMU Since Linux 2.6.38, perf_event_open() can support multiple PMUs. 
To enable this, a value exported by the kernel can be used in the type field to indicate which PMU to use. The value to use can be found in the sysfs filesystem: there is a subdirectory per PMU instance under /sys/bus/event_source/devices. In each subdirectory there is a type file whose content is an integer that can be used in the type field. For instance, /sys/bus/event_source/devices/cpu/type contains the value for the core CPU PMU, which is usually 4. size The size of the perf_event_attr structure for forward/backward compatibility. Set this using sizeof(struct perf_event_attr) to allow the kernel to see the struct size at the time of compilation. The related define PERF_ATTR_SIZE_VER0 is set to 64; this was the size of the first published struct. PERF_ATTR_SIZE_VER1 is 72, corresponding to the addition of breakpoints in Linux 2.6.33. PERF_ATTR_SIZE_VER2 is 80, corresponding to the addition of branch sampling in Linux 3.4. PERF_ATTR_SIZE_VER3 is 96, corresponding to the addition of sample_regs_user and sample_stack_user in Linux 3.7. PERF_ATTR_SIZE_VER4 is 104, corresponding to the addition of sample_regs_intr in Linux 3.19. PERF_ATTR_SIZE_VER5 is 112, corresponding to the addition of aux_watermark in Linux 4.1. config This specifies which event you want, in conjunction with the type field. The config1 and config2 fields are also taken into account in cases where 64 bits is not enough to fully specify the event. The encoding of these fields is event dependent. There are various ways to set the config field that are dependent on the value of the previously described type field. What follows are various possible settings for config separated out by type. If type is PERF_TYPE_HARDWARE, we are measuring one of the generalized hardware CPU events. Not all of these are available on all platforms. Set config to one of the following: PERF_COUNT_HW_CPU_CYCLES Total cycles. Be wary of what happens during CPU frequency scaling.
PERF_COUNT_HW_INSTRUCTIONS Retired instructions. Be careful, these can be affected by various issues, most notably hardware interrupt counts. PERF_COUNT_HW_CACHE_REFERENCES Cache accesses. Usually this indicates Last Level Cache accesses but this may vary depending on your CPU. This may include prefetches and coherency messages; again this depends on the design of your CPU. PERF_COUNT_HW_CACHE_MISSES Cache misses. Usually this indicates Last Level Cache misses; this is intended to be used in conjunction with the PERF_COUNT_HW_CACHE_REFERENCES event to calculate cache miss rates. PERF_COUNT_HW_BRANCH_INSTRUCTIONS Retired branch instructions. Prior to Linux 2.6.35, this used the wrong event on AMD processors. PERF_COUNT_HW_BRANCH_MISSES Mispredicted branch instructions. PERF_COUNT_HW_BUS_CYCLES Bus cycles, which can be different from total cycles. PERF_COUNT_HW_STALLED_CYCLES_FRONTEND (since Linux 3.0) Stalled cycles during issue. PERF_COUNT_HW_STALLED_CYCLES_BACKEND (since Linux 3.0) Stalled cycles during retirement. PERF_COUNT_HW_REF_CPU_CYCLES (since Linux 3.3) Total cycles; not affected by CPU frequency scaling. If type is PERF_TYPE_SOFTWARE, we are measuring software events provided by the kernel. Set config to one of the following: PERF_COUNT_SW_CPU_CLOCK This reports the CPU clock, a high-resolution per-CPU timer. PERF_COUNT_SW_TASK_CLOCK This reports a clock count specific to the task that is running. PERF_COUNT_SW_PAGE_FAULTS This reports the number of page faults. PERF_COUNT_SW_CONTEXT_SWITCHES This counts context switches. Until Linux 2.6.34, these were all reported as user-space events; after that they are reported as happening in the kernel. PERF_COUNT_SW_CPU_MIGRATIONS This reports the number of times the process has migrated to a new CPU. PERF_COUNT_SW_PAGE_FAULTS_MIN This counts the number of minor page faults. These did not require disk I/O to handle. PERF_COUNT_SW_PAGE_FAULTS_MAJ This counts the number of major page faults.
These required disk I/O to handle. PERF_COUNT_SW_ALIGNMENT_FAULTS (since Linux 2.6.33) This counts the number of alignment faults. These happen when unaligned memory accesses happen; the kernel can handle these but it reduces performance. This happens only on some architectures (never on x86). PERF_COUNT_SW_EMULATION_FAULTS (since Linux 2.6.33) This counts the number of emulation faults. The kernel sometimes traps on unimplemented instructions and emulates them for user space. This can negatively impact performance. PERF_COUNT_SW_DUMMY (since Linux 3.12) This is a placeholder event that counts nothing. Informational sample record types such as mmap or comm must be associated with an active event. This dummy event allows gathering such records without requiring a counting event. If type is PERF_TYPE_TRACEPOINT, then we are measuring kernel tracepoints. The value to use in config can be obtained from under debugfs tracing/events/*/*/id if ftrace is enabled in the kernel. If type is PERF_TYPE_HW_CACHE, then we are measuring a hardware CPU cache event. 
To calculate the appropriate config value use the following equation: (perf_hw_cache_id) | (perf_hw_cache_op_id << 8) | (perf_hw_cache_op_result_id << 16) where perf_hw_cache_id is one of: PERF_COUNT_HW_CACHE_L1D for measuring Level 1 Data Cache PERF_COUNT_HW_CACHE_L1I for measuring Level 1 Instruction Cache PERF_COUNT_HW_CACHE_LL for measuring Last-Level Cache PERF_COUNT_HW_CACHE_DTLB for measuring the Data TLB PERF_COUNT_HW_CACHE_ITLB for measuring the Instruction TLB PERF_COUNT_HW_CACHE_BPU for measuring the branch prediction unit PERF_COUNT_HW_CACHE_NODE (since Linux 3.1) for measuring local memory accesses and perf_hw_cache_op_id is one of PERF_COUNT_HW_CACHE_OP_READ for read accesses PERF_COUNT_HW_CACHE_OP_WRITE for write accesses PERF_COUNT_HW_CACHE_OP_PREFETCH for prefetch accesses and perf_hw_cache_op_result_id is one of PERF_COUNT_HW_CACHE_RESULT_ACCESS to measure accesses PERF_COUNT_HW_CACHE_RESULT_MISS to measure misses If type is PERF_TYPE_RAW, then a custom "raw" config value is needed. Most CPUs support events that are not covered by the "generalized" events. These are implementation defined; see your CPU manual (for example the Intel Volume 3B documentation or the AMD BIOS and Kernel Developer Guide). The libpfm4 library can be used to translate from the name in the architectural manuals to the raw hex value perf_event_open() expects in this field. If type is PERF_TYPE_BREAKPOINT, then leave config set to zero. Its parameters are set in other places. sample_period, sample_freq A "sampling" event is one that generates an overflow notification every N events, where N is given by sample_period. A sampling event has sample_period > 0. When an overflow occurs, requested data is recorded in the mmap buffer. The sample_type field controls what data is recorded on each overflow. sample_freq can be used if you wish to use frequency rather than period. In this case, you set the freq flag. 
The kernel will adjust the sampling period to try to achieve the desired rate. The rate of adjustment is a timer tick.

sample_type
    The various bits in this field specify which values to include in the sample. They will be recorded in a ring-buffer, which is available to user space using mmap(2). The order in which the values are saved in the sample is documented in the MMAP Layout subsection below; it is not the enum perf_event_sample_format order.

    PERF_SAMPLE_IP
        Records instruction pointer.
    PERF_SAMPLE_TID
        Records the process and thread IDs.
    PERF_SAMPLE_TIME
        Records a timestamp.
    PERF_SAMPLE_ADDR
        Records an address, if applicable.
    PERF_SAMPLE_READ
        Records counter values for all events in a group, not just the group leader.
    PERF_SAMPLE_CALLCHAIN
        Records the callchain (stack backtrace).
    PERF_SAMPLE_ID
        Records a unique ID for the opened event's group leader.
    PERF_SAMPLE_CPU
        Records CPU number.
    PERF_SAMPLE_PERIOD
        Records the current sampling period.
    PERF_SAMPLE_STREAM_ID
        Records a unique ID for the opened event. Unlike PERF_SAMPLE_ID the actual ID is returned, not the group leader. This ID is the same as the one returned by PERF_FORMAT_ID.
    PERF_SAMPLE_RAW
        Records additional data, if applicable. Usually returned by tracepoint events.
    PERF_SAMPLE_BRANCH_STACK (since Linux 3.4)
        This provides a record of recent branches, as provided by CPU branch sampling hardware (such as Intel Last Branch Record). Not all hardware supports this feature. See the branch_sample_type field for how to filter which branches are reported.
    PERF_SAMPLE_REGS_USER (since Linux 3.7)
        Records the current user-level CPU register state (the values in the process before the kernel was called).
    PERF_SAMPLE_STACK_USER (since Linux 3.7)
        Records the user-level stack, allowing stack unwinding.
    PERF_SAMPLE_WEIGHT (since Linux 3.10)
        Records a hardware-provided weight value that expresses how costly the sampled event was. This allows the hardware to highlight expensive events in a profile.
PERF_SAMPLE_DATA_SRC (since Linux 3.10) Records the data source: where in the memory hierarchy the data associated with the sampled instruction came from. This is available only if the underlying hardware supports this feature. PERF_SAMPLE_IDENTIFIER (since Linux 3.12) Places the SAMPLE_ID value in a fixed position in the record, either at the beginning (for sample events) or at the end (if a non-sample event). This was necessary because a sample stream may have records from various different event sources with different sample_type settings. Parsing the event stream properly was not possible because the format of the record was needed to find SAMPLE_ID, but the format could not be found without knowing what event the sample belonged to (causing a circular dependency). The PERF_SAMPLE_IDENTIFIER setting makes the event stream always parsable by putting SAMPLE_ID in a fixed location, even though it means having duplicate SAMPLE_ID values in records. PERF_SAMPLE_TRANSACTION (since Linux 3.13) Records reasons for transactional memory abort events (for example, from Intel TSX transactional memory support). The precise_ip setting must be greater than 0 and a transactional memory abort event must be measured or no values will be recorded. Also note that some perf_event measurements, such as sampled cycle counting, may cause extraneous aborts (by causing an interrupt during a transaction). PERF_SAMPLE_REGS_INTR (since Linux 3.19) Records a subset of the current CPU register state as specified by sample_regs_intr. Unlike PERF_SAMPLE_REGS_USER the register values will return kernel register state if the overflow happened while kernel code is running. If the CPU supports hardware sampling of register state (i.e. PEBS on Intel x86) and precise_ip is set higher than zero then the register values returned are those captured by hardware at the time of the sampled instruction's retirement. 
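As a rough illustration of how the type, config, sample_period, and sample_type fields described above fit together, the following sketch fills in a perf_event_attr for sampling Level 1 data cache read misses. The helper names hw_cache_config() and init_l1d_miss_sampling() are hypothetical and exist only for this example; the constants and structure fields come from <linux/perf_event.h>. The choice of sample_type bits is arbitrary, and no system call is made here.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: encode a PERF_TYPE_HW_CACHE config value as
 * (cache_id) | (op_id << 8) | (result_id << 16). */
static uint64_t hw_cache_config(uint64_t cache_id, uint64_t op_id,
                                uint64_t result_id)
{
    return cache_id | (op_id << 8) | (result_id << 16);
}

/* Hypothetical helper: prepare attr to sample L1D read misses,
 * generating one overflow every `period` events. */
static void init_l1d_miss_sampling(struct perf_event_attr *attr,
                                   uint64_t period)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_HW_CACHE;
    attr->config = hw_cache_config(PERF_COUNT_HW_CACHE_L1D,
                                   PERF_COUNT_HW_CACHE_OP_READ,
                                   PERF_COUNT_HW_CACHE_RESULT_MISS);
    /* The freq bit stays 0, so this value is a period, not a frequency. */
    attr->sample_period = period;
    /* Arbitrary selection of values to record on each overflow. */
    attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_TIME;
    /* Start disabled; the event can be enabled later, e.g. via ioctl(2). */
    attr->disabled = 1;
}
```

The resulting structure would then be passed to perf_event_open(); whether the event can actually be opened depends on the hardware and on the permissions of the caller.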
read_format This field specifies the format of the data returned by read(2) on a perf_event_open() file descriptor. PERF_FORMAT_TOTAL_TIME_ENABLED Adds the 64-bit time_enabled field. This can be used to calculate estimated totals if the PMU is overcommitted and multiplexing is happening. PERF_FORMAT_TOTAL_TIME_RUNNING Adds the 64-bit time_running field. This can be used to calculate estimated totals if the PMU is overcommitted and multiplexing is happening. PERF_FORMAT_ID Adds a 64-bit unique value that corresponds to the event group. PERF_FORMAT_GROUP Allows all counter values in an event group to be read with one read. disabled The disabled bit specifies whether the counter starts out disabled or enabled. If disabled, the event can later be enabled by ioctl(2), prctl(2), or enable_on_exec. When creating an event group, typically the group leader is initialized with disabled set to 1 and any child events are initialized with disabled set to 0. Despite disabled being 0, the child events will not start until the group leader is enabled. inherit The inherit bit specifies that this counter should count events of child tasks as well as the task specified. This applies only to new children, not to any existing children at the time the counter is created (nor to any new children of existing children). Inherit does not work for some combinations of read_formats, such as PERF_FORMAT_GROUP. pinned The pinned bit specifies that the counter should always be on the CPU if at all possible. It applies only to hardware counters and only to group leaders. If a pinned counter cannot be put onto the CPU (e.g., because there are not enough hardware counters or because of a conflict with some other event), then the counter goes into an 'error' state, where reads return end-of-file (i.e., read(2) returns 0) until the counter is subsequently enabled or disabled. 
exclusive
    The exclusive bit specifies that when this counter's group is on the CPU, it should be the only group using the CPU's counters. In the future this may allow monitoring programs to support PMU features that need to run alone so that they do not disrupt other hardware counters. Note that many unexpected situations may prevent events with the exclusive bit set from ever running. This includes any users running a system-wide measurement as well as any kernel use of the performance counters (including the commonly enabled NMI Watchdog Timer interface).

exclude_user
    If this bit is set, the count excludes events that happen in user space.

exclude_kernel
    If this bit is set, the count excludes events that happen in kernel space.

exclude_hv
    If this bit is set, the count excludes events that happen in the hypervisor. This is mainly for PMUs that have built-in support for handling this (such as POWER). Extra support is needed for handling hypervisor measurements on most machines.

exclude_idle
    If set, don't count when the CPU is idle.

mmap
    The mmap bit enables generation of PERF_RECORD_MMAP samples for every mmap(2) call that has PROT_EXEC set. This allows tools to notice new executable code being mapped into a program (dynamic shared libraries, for example) so that addresses can be mapped back to the original code.

comm
    The comm bit enables tracking of the process command name as modified by the exec(2) and prctl(PR_SET_NAME) system calls as well as by writing to /proc/self/comm. If the comm_exec flag is also successfully set (possible since Linux 3.16), then the misc flag PERF_RECORD_MISC_COMM_EXEC can be used to differentiate the exec(2) case from the others.

freq
    If this bit is set, then sample_freq, not sample_period, is used when setting up the sampling interval.

inherit_stat
    This bit enables saving of event counts on context switch for inherited tasks. This is meaningful only if the inherit field is set.
enable_on_exec If this bit is set, a counter is automatically enabled after a call to exec(2). task If this bit is set, then fork/exit notifications are included in the ring buffer. watermark If set, have an overflow notification happen when we cross the wakeup_watermark boundary. Otherwise, overflow notifications happen after wakeup_events samples. precise_ip (since Linux 2.6.35) This controls the amount of skid. Skid is how many instructions execute between an event of interest happening and the kernel being able to stop and record the event. Smaller skid is better and allows more accurate reporting of which events correspond to which instructions, but hardware is often limited with how small this can be. The values of this are the following: 0 - SAMPLE_IP can have arbitrary skid. 1 - SAMPLE_IP must have constant skid. 2 - SAMPLE_IP requested to have 0 skid. 3 - SAMPLE_IP must have 0 skid. See also PERF_RECORD_MISC_EXACT_IP. mmap_data (since Linux 2.6.36) The counterpart of the mmap field. This enables generation of PERF_RECORD_MMAP samples for mmap(2) calls that do not have PROT_EXEC set (for example data and SysV shared memory). sample_id_all (since Linux 2.6.38) If set, then TID, TIME, ID, STREAM_ID, and CPU can additionally be included in non-PERF_RECORD_SAMPLEs if the corresponding sample_type is selected. If PERF_SAMPLE_IDENTIFIER is specified, then an additional ID value is included as the last value to ease parsing the record stream. This may lead to the id value appearing twice. The layout is described by this pseudo-structure: struct sample_id { { u32 pid, tid; } /* if PERF_SAMPLE_TID set */ { u64 time; } /* if PERF_SAMPLE_TIME set */ { u64 id; } /* if PERF_SAMPLE_ID set */ { u64 stream_id;} /* if PERF_SAMPLE_STREAM_ID set */ { u32 cpu, res; } /* if PERF_SAMPLE_CPU set */ { u64 id; } /* if PERF_SAMPLE_IDENTIFIER set */ }; exclude_host (since Linux 3.2) When conducting measurements that include processes running VM instances (i.e. 
have executed a KVM_RUN ioctl(2) ) only measure events happening inside a guest instance. This is only meaningful outside the guests; this setting does not change counts gathered inside of a guest. Currently, this functionality is x86 only. exclude_guest (since Linux 3.2) When conducting measurements that include processes running VM instances (i.e. have executed a KVM_RUN ioctl(2) ) do not measure events happening inside guest instances. This is only meaningful outside the guests; this setting does not change counts gathered inside of a guest. Currently, this functionality is x86 only. exclude_callchain_kernel (since Linux 3.7) Do not include kernel callchains. exclude_callchain_user (since Linux 3.7) Do not include user callchains. mmap2 (since Linux 3.16) Generate an extended executable mmap record that contains enough additional information to uniquely identify shared mappings. The mmap flag must also be set for this to work. comm_exec (since Linux 3.16) This is purely a feature-detection flag, it does not change kernel behavior. If this flag can successfully be set, then, when comm is enabled, the PERF_RECORD_MISC_COMM_EXEC flag will be set in the misc field of a comm record header if the rename event being reported was caused by a call to exec(2). This allows tools to distinguish between the various types of process renaming. use_clockid (since Linux 4.1) This allows selecting which internal Linux clock to use when generating timestamps via the clockid field. This can make it easier to correlate perf sample times with timestamps generated by other tools. wakeup_events, wakeup_watermark This union sets how many samples (wakeup_events) or bytes (wakeup_watermark) happen before an overflow notification happens. Which one is used is selected by the watermark bit flag. wakeup_events counts only PERF_RECORD_SAMPLE record types. To receive overflow notification for all PERF_RECORD types choose watermark and set wakeup_watermark to 1. 
Prior to Linux 3.0 setting wakeup_events to 0 resulted in no overflow notifications; more recent kernels treat 0 the same as 1. bp_type (since Linux 2.6.33) This chooses the breakpoint type. It is one of: HW_BREAKPOINT_EMPTY No breakpoint. HW_BREAKPOINT_R Count when we read the memory location. HW_BREAKPOINT_W Count when we write the memory location. HW_BREAKPOINT_RW Count when we read or write the memory location. HW_BREAKPOINT_X Count when we execute code at the memory location. The values can be combined via a bitwise or, but the combination of HW_BREAKPOINT_R or HW_BREAKPOINT_W with HW_BREAKPOINT_X is not allowed. bp_addr (since Linux 2.6.33) bp_addr address of the breakpoint. For execution breakpoints this is the memory address of the instruction of interest; for read and write breakpoints it is the memory address of the memory location of interest. config1 (since Linux 2.6.39) config1 is used for setting events that need an extra register or otherwise do not fit in the regular config field. Raw OFFCORE_EVENTS on Nehalem/Westmere/SandyBridge use this field on 3.3 and later kernels. bp_len (since Linux 2.6.33) bp_len is the length of the breakpoint being measured if type is PERF_TYPE_BREAKPOINT. Options are HW_BREAKPOINT_LEN_1, HW_BREAKPOINT_LEN_2, HW_BREAKPOINT_LEN_4, HW_BREAKPOINT_LEN_8. For an execution breakpoint, set this to sizeof(long). config2 (since Linux 2.6.39) config2 is a further extension of the config1 field. branch_sample_type (since Linux 3.4) If PERF_SAMPLE_BRANCH_STACK is enabled, then this specifies what branches to include in the branch record. The first part of the value is the privilege level, which is a combination of one of the following values. If the user does not set privilege level explicitly, the kernel will use the event's privilege level. Event and branch privilege levels do not have to match. PERF_SAMPLE_BRANCH_USER Branch target is in user space. PERF_SAMPLE_BRANCH_KERNEL Branch target is in kernel space. 
PERF_SAMPLE_BRANCH_HV Branch target is in hypervisor. PERF_SAMPLE_BRANCH_PLM_ALL A convenience value that is the three preceding values ORed together. In addition to the privilege value, at least one or more of the following bits must be set. PERF_SAMPLE_BRANCH_ANY Any branch type. PERF_SAMPLE_BRANCH_ANY_CALL Any call branch. PERF_SAMPLE_BRANCH_ANY_RETURN Any return branch. PERF_SAMPLE_BRANCH_IND_CALL Indirect calls. PERF_SAMPLE_BRANCH_COND (since Linux 3.16) Conditional branches. PERF_SAMPLE_BRANCH_ABORT_TX (since Linux 3.11) Transactional memory aborts. PERF_SAMPLE_BRANCH_IN_TX (since Linux 3.11) Branch in transactional memory transaction. PERF_SAMPLE_BRANCH_NO_TX (since Linux 3.11) Branch not in transactional memory transaction. PERF_SAMPLE_BRANCH_CALL_STACK (since Linux 4.1) Branch is part of a hardware-generated call stack. This requires hardware support, currently only found on Intel x86 Haswell or newer. sample_regs_user (since Linux 3.7) This bit mask defines the set of user CPU registers to dump on samples. The layout of the register mask is architecture- specific and described in the kernel header arch/ARCH/include/uapi/asm/perf_regs.h. sample_stack_user (since Linux 3.7) This defines the size of the user stack to dump if PERF_SAMPLE_STACK_USER is specified. clockid (since Linux 4.1) If use_clockid is set, then this field selects which internal Linux timer to use for timestamps. The available timers are defined in linux/time.h, with CLOCK_MONOTONIC, CLOCK_MONOTONIC_RAW, CLOCK_REALTIME, CLOCK_BOOTTIME, and CLOCK_TAI currently supported. aux_watermark (since Linux 4.1) This specifies how much data is required to trigger a PERF_RECORD_AUX sample. Reading results Once a perf_event_open() file descriptor has been opened, the values of the events can be read from the file descriptor. The values that are there are specified by the read_format field in the attr structure at open time. 
If you attempt to read into a buffer that is not big enough to hold the data, ENOSPC is returned.

Here is the layout of the data returned by a read:

* If PERF_FORMAT_GROUP was specified to allow reading all events in a group at once:

    struct read_format {
        u64 nr;            /* The number of events */
        u64 time_enabled;  /* if PERF_FORMAT_TOTAL_TIME_ENABLED */
        u64 time_running;  /* if PERF_FORMAT_TOTAL_TIME_RUNNING */
        struct {
            u64 value;     /* The value of the event */
            u64 id;        /* if PERF_FORMAT_ID */
        } values[nr];
    };

* If PERF_FORMAT_GROUP was not specified:

    struct read_format {
        u64 value;         /* The value of the event */
        u64 time_enabled;  /* if PERF_FORMAT_TOTAL_TIME_ENABLED */
        u64 time_running;  /* if PERF_FORMAT_TOTAL_TIME_RUNNING */
        u64 id;            /* if PERF_FORMAT_ID */
    };

The values read are as follows:

nr
    The number of events in this file descriptor. Only available if PERF_FORMAT_GROUP was specified.

time_enabled, time_running
    Total time the event was enabled and running. Normally these are the same. If more events are started than there are available counter slots on the PMU, then multiplexing happens and events run only part of the time. In that case, the time_enabled and time_running values can be used to scale an estimated value for the count.

value
    An unsigned 64-bit value containing the counter result.

id
    A globally unique value for this particular event, only present if PERF_FORMAT_ID was specified in read_format.

MMAP layout

When using perf_event_open() in sampled mode, asynchronous events (like counter overflow or PROT_EXEC mmap tracking) are logged into a ring-buffer. This ring-buffer is created and accessed through mmap(2). The mmap size should be 1+2^n pages, where the first page is a metadata page (struct perf_event_mmap_page) that contains various bits of information such as where the ring-buffer head is. Before kernel 2.6.39, there is a bug that means you must allocate an mmap ring buffer when sampling even if you do not plan to access it.
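The two calculations described above, scaling a multiplexed count by time_enabled and time_running and sizing the mmap region as 1+2^n pages, can be sketched as follows. This is a minimal illustration under stated assumptions: struct read_value_times, scale_count(), and perf_mmap_len() are hypothetical names introduced for this example, and the struct assumes a non-group read with read_format set to PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING and nothing else.

```c
#include <stdint.h>

/* Assumed layout of a non-group read(2) result when read_format is
 * PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING. */
struct read_value_times {
    uint64_t value;         /* The value of the event */
    uint64_t time_enabled;  /* Time the event was enabled */
    uint64_t time_running;  /* Time the event was on the PMU */
};

/* Estimate the count the event would have reached had it been running
 * for the whole enabled time. The quotient/remainder split avoids
 * overflowing value * time_enabled for large counts. Returns 0 if the
 * event never ran, because no estimate is possible in that case. */
static uint64_t scale_count(const struct read_value_times *r)
{
    if (r->time_running == 0)
        return 0;
    uint64_t quot = r->value / r->time_running;
    uint64_t rem = r->value % r->time_running;
    return quot * r->time_enabled + (rem * r->time_enabled) / r->time_running;
}

/* Length in bytes of a mapping with one metadata page followed by
 * 2^n ring-buffer pages. */
static uint64_t perf_mmap_len(unsigned n, uint64_t page_size)
{
    return (1 + ((uint64_t)1 << n)) * page_size;
}
```

In a real program the struct would be filled by read(2) on the event file descriptor, and page_size would typically come from sysconf(_SC_PAGESIZE). The scaled result is only an estimate, since it assumes the event rate while running matches the rate while the event was descheduled.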
The structure of the first metadata mmap page is as follows: struct perf_event_mmap_page { __u32 version; /* version number of this structure */ __u32 compat_version; /* lowest version this is compat with */ __u32 lock; /* seqlock for synchronization */ __u32 index; /* hardware counter identifier */ __s64 offset; /* add to hardware counter value */ __u64 time_enabled; /* time event active */ __u64 time_running; /* time event on CPU */ union { __u64 capabilities; struct { __u64 cap_usr_time / cap_usr_rdpmc / cap_bit0 : 1, cap_bit0_is_deprecated : 1, cap_user_rdpmc : 1, cap_user_time : 1, cap_user_time_zero : 1, }; }; __u16 pmc_width; __u16 time_shift; __u32 time_mult; __u64 time_offset; __u64 __reserved[120]; /* Pad to 1k */ __u64 data_head; /* head in the data section */ __u64 data_tail; /* user-space written tail */ __u64 data_offset; /* where the buffer starts */ __u64 data_size; /* data buffer size */ __u64 aux_head; __u64 aux_tail; __u64 aux_offset; __u64 aux_size; } The following list describes the fields in the perf_event_mmap_page structure in more detail: version Version number of this structure. compat_version The lowest version this is compatible with. lock A seqlock for synchronization. index A unique hardware counter identifier. offset When using rdpmc for reads this offset value must be added to the one returned by rdpmc to get the current total event count. time_enabled Time the event was active. time_running Time the event was running. cap_usr_time / cap_usr_rdpmc / cap_bit0 (since Linux 3.4) There was a bug in the definition of cap_usr_time and cap_usr_rdpmc from Linux 3.4 until Linux 3.11. Both bits were defined to point to the same location, so it was impossible to know if cap_usr_time or cap_usr_rdpmc were actually set. Starting with Linux 3.12, these are renamed to cap_bit0 and you should use the cap_user_time and cap_user_rdpmc fields instead. 
cap_bit0_is_deprecated (since Linux 3.12) If set, this bit indicates that the kernel supports the properly separated cap_user_time and cap_user_rdpmc bits. If not-set, it indicates an older kernel where cap_usr_time and cap_usr_rdpmc map to the same bit and thus both features should be used with caution. cap_user_rdpmc (since Linux 3.12) If the hardware supports user-space read of performance counters without syscall (this is the "rdpmc" instruction on x86), then the following code can be used to do a read: u32 seq, time_mult, time_shift, idx, width; u64 count, enabled, running; u64 cyc, time_offset; do { seq = pc->lock; barrier(); enabled = pc->time_enabled; running = pc->time_running; if (pc->cap_usr_time && enabled != running) { cyc = rdtsc(); time_offset = pc->time_offset; time_mult = pc->time_mult; time_shift = pc->time_shift; } idx = pc->index; count = pc->offset; if (pc->cap_usr_rdpmc && idx) { width = pc->pmc_width; count += rdpmc(idx - 1); } barrier(); } while (pc->lock != seq); cap_user_time (since Linux 3.12) This bit indicates the hardware has a constant, nonstop timestamp counter (TSC on x86). cap_user_time_zero (since Linux 3.12) Indicates the presence of time_zero which allows mapping timestamp values to the hardware clock. pmc_width If cap_usr_rdpmc, this field provides the bit-width of the value read using the rdpmc or equivalent instruction. This can be used to sign extend the result like: pmc <<= 64 - pmc_width; pmc >>= 64 - pmc_width; // signed shift right count += pmc; time_shift, time_mult, time_offset If cap_usr_time, these fields can be used to compute the time delta since time_enabled (in nanoseconds) using rdtsc or similar. u64 quot, rem; u64 delta; quot = (cyc >> time_shift); rem = cyc & ((1 << time_shift) - 1); delta = time_offset + quot * time_mult + ((rem * time_mult) >> time_shift); Where time_offset, time_mult, time_shift, and cyc are read in the seqcount loop described above. 
This delta can then be added to enabled and possibly running (if idx), improving the scaling:

    enabled += delta;
    if (idx)
        running += delta;
    quot = count / running;
    rem = count % running;
    count = quot * enabled + (rem * enabled) / running;

time_zero (since Linux 3.12)
    If cap_usr_time_zero is set, then the hardware clock (the TSC timestamp counter on x86) can be calculated from the time_zero, time_mult, and time_shift values:

    time = timestamp - time_zero;
    quot = time / time_mult;
    rem = time % time_mult;
    cyc = (quot << time_shift) + (rem << time_shift) / time_mult;

    And vice versa:

    quot = cyc >> time_shift;
    rem = cyc & ((1 << time_shift) - 1);
    timestamp = time_zero + quot * time_mult + ((rem * time_mult) >> time_shift);

data_head
    This points to the head of the data section. The value continuously increases; it does not wrap. The value needs to be manually wrapped by the size of the mmap buffer before accessing the samples. On SMP-capable platforms, after reading the data_head value, user space should issue an rmb().

data_tail
    When the mapping is PROT_WRITE, the data_tail value should be written by user space to reflect the last read data. In this case, the kernel will not overwrite unread data.

data_offset (since Linux 4.1)
    Contains the offset of the location in the mmap buffer where perf sample data begins.

data_size (since Linux 4.1)
    Contains the size of the perf sample region within the mmap buffer.

aux_head, aux_tail, aux_offset, aux_size (since Linux 4.1)
    The AUX region allows mmapping a separate sample buffer for high-bandwidth data streams (separate from the main perf sample buffer). An example of a high-bandwidth stream is instruction tracing support, as is found in newer Intel processors. To set up an AUX area, first aux_offset needs to be set with an offset greater than data_offset+data_size and aux_size needs to be set to the desired buffer size. The desired offset and size must be page aligned, and the size must be a power of two.
    These values are then passed to mmap in order to map the AUX buffer. Pages in the AUX buffer are included as part of the RLIMIT_MEMLOCK resource limit (see setrlimit(2)), and also as part of the perf_event_mlock_kb allowance. By default, the AUX buffer will be truncated if it will not fit in the available space in the ring buffer. If the AUX buffer is mapped as a read-only buffer, then it will operate in ring buffer mode where old data will be overwritten by new. In overwrite mode, it might not be possible to infer where the new data began, and it is the consumer's job to disable measurement while reading to avoid possible data races. The aux_head and aux_tail ring buffer pointers have the same behavior and ordering rules as the previously described data_head and data_tail.

The following 2^n ring-buffer pages have the layout described below.

If perf_event_attr.sample_id_all is set, then all event types will carry the sample_type-selected fields that identify where and when an event took place (TID, TIME, ID, CPU, STREAM_ID), as described in PERF_RECORD_SAMPLE below. These fields are stashed after the perf_event_header and the fields already present for the record type, that is, at the end of the payload. That way a newer perf.data file will be supported by older perf tools, with these new optional fields being ignored.

The mmap values start with a header:

    struct perf_event_header {
        __u32 type;
        __u16 misc;
        __u16 size;
    };

Below, we describe the perf_event_header fields in more detail. For ease of reading, the fields with shorter descriptions are presented first.

size
    This indicates the size of the record.

misc
    The misc field contains additional information about the sample. The CPU mode can be determined from this value by masking with PERF_RECORD_MISC_CPUMODE_MASK and looking for one of the following (note these are not bit masks, only one can be set at a time):

    PERF_RECORD_MISC_CPUMODE_UNKNOWN
        Unknown CPU mode.
    PERF_RECORD_MISC_KERNEL
        Sample happened in the kernel.
PERF_RECORD_MISC_USER Sample happened in user code. PERF_RECORD_MISC_HYPERVISOR Sample happened in the hypervisor. PERF_RECORD_MISC_GUEST_KERNEL (since Linux 2.6.35) Sample happened in the guest kernel. PERF_RECORD_MISC_GUEST_USER (since Linux 2.6.35) Sample happened in guest user code. In addition, one of the following bits can be set: PERF_RECORD_MISC_MMAP_DATA (since Linux 3.10) This is set when the mapping is not executable; otherwise the mapping is executable. PERF_RECORD_MISC_COMM_EXEC (since Linux 3.16) This is set for a PERF_RECORD_COMM record on kernels more recent than Linux 3.16 if a process name change was caused by an exec(2) system call. It is an alias for PERF_RECORD_MISC_MMAP_DATA since the two values would not be set in the same record. PERF_RECORD_MISC_EXACT_IP This indicates that the content of PERF_SAMPLE_IP points to the actual instruction that triggered the event. See also perf_event_attr.precise_ip. PERF_RECORD_MISC_EXT_RESERVED (since Linux 2.6.35) This indicates there is extended data available (currently not used). type The type value is one of the below. The values in the corresponding record (that follows the header) depend on the type selected as shown. PERF_RECORD_MMAP The MMAP events record the PROT_EXEC mappings so that we can correlate user-space IPs to code. They have the following structure: struct { struct perf_event_header header; u32 pid, tid; u64 addr; u64 len; u64 pgoff; char filename[]; }; pid is the process ID. tid is the thread ID. addr is the address of the allocated memory. len is the length of the allocated memory. pgoff is the page offset of the allocated memory. filename is a string describing the backing of the allocated memory. PERF_RECORD_LOST This record indicates when events are lost. struct { struct perf_event_header header; u64 id; u64 lost; struct sample_id sample_id; }; id is the unique event ID for the samples that were lost. lost is the number of events that were lost. 
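The perf_event_header fields described above (type, misc, and size) are enough to step through a sequence of records without understanding every payload. The sketch below is a minimal illustration, not a complete ring-buffer reader: cpumode_name() and count_records() are hypothetical helpers, and the code assumes the records have already been copied out of the ring buffer into a linear buffer, so that no record wraps around the end of the mapping.

```c
#include <linux/perf_event.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: name the CPU mode encoded in a header's misc field. */
static const char *cpumode_name(uint16_t misc)
{
    switch (misc & PERF_RECORD_MISC_CPUMODE_MASK) {
    case PERF_RECORD_MISC_KERNEL:       return "kernel";
    case PERF_RECORD_MISC_USER:         return "user";
    case PERF_RECORD_MISC_HYPERVISOR:   return "hypervisor";
    case PERF_RECORD_MISC_GUEST_KERNEL: return "guest kernel";
    case PERF_RECORD_MISC_GUEST_USER:   return "guest user";
    default:                            return "unknown";
    }
}

/* Hypothetical helper: count the records of a given type in a linear
 * buffer of len bytes, advancing by each header's size field. */
static size_t count_records(const unsigned char *buf, size_t len,
                            uint32_t type)
{
    size_t n = 0, off = 0;

    while (off + sizeof(struct perf_event_header) <= len) {
        struct perf_event_header h;

        /* Copy the header out to avoid unaligned accesses. */
        memcpy(&h, buf + off, sizeof(h));
        /* Stop on a malformed or truncated record. */
        if (h.size < sizeof(h) || h.size > len - off)
            break;
        if (h.type == type)
            n++;
        off += h.size;
    }
    return n;
}
```

For example, count_records(buf, len, PERF_RECORD_LOST) reports how many lost-event records appear in the copied region; a real consumer would also handle wrap-around at the end of the data section and update data_tail after reading.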
PERF_RECORD_COMM This record indicates a change in the process name. struct { struct perf_event_header header; u32 pid; u32 tid; char comm[]; struct sample_id sample_id; }; pid is the process ID. tid is the thread ID. comm is a string containing the new name of the process. PERF_RECORD_EXIT This record indicates a process exit event. struct { struct perf_event_header header; u32 pid, ppid; u32 tid, ptid; u64 time; struct sample_id sample_id; }; PERF_RECORD_THROTTLE, PERF_RECORD_UNTHROTTLE This record indicates a throttle/unthrottle event. struct { struct perf_event_header header; u64 time; u64 id; u64 stream_id; struct sample_id sample_id; }; PERF_RECORD_FORK This record indicates a fork event. struct { struct perf_event_header header; u32 pid, ppid; u32 tid, ptid; u64 time; struct sample_id sample_id; }; PERF_RECORD_READ This record indicates a read event. struct { struct perf_event_header header; u32 pid, tid; struct read_format values; struct sample_id sample_id; }; PERF_RECORD_SAMPLE This record indicates a sample. 
struct { struct perf_event_header header; u64 sample_id; /* if PERF_SAMPLE_IDENTIFIER */ u64 ip; /* if PERF_SAMPLE_IP */ u32 pid, tid; /* if PERF_SAMPLE_TID */ u64 time; /* if PERF_SAMPLE_TIME */ u64 addr; /* if PERF_SAMPLE_ADDR */ u64 id; /* if PERF_SAMPLE_ID */ u64 stream_id; /* if PERF_SAMPLE_STREAM_ID */ u32 cpu, res; /* if PERF_SAMPLE_CPU */ u64 period; /* if PERF_SAMPLE_PERIOD */ struct read_format v; /* if PERF_SAMPLE_READ */ u64 nr; /* if PERF_SAMPLE_CALLCHAIN */ u64 ips[nr]; /* if PERF_SAMPLE_CALLCHAIN */ u32 size; /* if PERF_SAMPLE_RAW */ char data[size]; /* if PERF_SAMPLE_RAW */ u64 bnr; /* if PERF_SAMPLE_BRANCH_STACK */ struct perf_branch_entry lbr[bnr]; /* if PERF_SAMPLE_BRANCH_STACK */ u64 abi; /* if PERF_SAMPLE_REGS_USER */ u64 regs[weight(mask)]; /* if PERF_SAMPLE_REGS_USER */ u64 size; /* if PERF_SAMPLE_STACK_USER */ char data[size]; /* if PERF_SAMPLE_STACK_USER */ u64 dyn_size; /* if PERF_SAMPLE_STACK_USER && size != 0 */ u64 weight; /* if PERF_SAMPLE_WEIGHT */ u64 data_src; /* if PERF_SAMPLE_DATA_SRC */ u64 transaction;/* if PERF_SAMPLE_TRANSACTION */ u64 abi; /* if PERF_SAMPLE_REGS_INTR */ u64 regs[weight(mask)]; /* if PERF_SAMPLE_REGS_INTR */ }; sample_id If PERF_SAMPLE_IDENTIFIER is enabled, a 64-bit unique ID is included. This is a duplication of the PERF_SAMPLE_ID id value, but included at the beginning of the sample so parsers can easily obtain the value. ip If PERF_SAMPLE_IP is enabled, then a 64-bit instruction pointer value is included. pid, tid If PERF_SAMPLE_TID is enabled, then a 32-bit process ID and 32-bit thread ID are included. time If PERF_SAMPLE_TIME is enabled, then a 64-bit timestamp is included. This is obtained via local_clock() which is a hardware timestamp if available and the jiffies value if not. addr If PERF_SAMPLE_ADDR is enabled, then a 64-bit address is included. This is usually the address of a tracepoint, breakpoint, or software event; otherwise the value is 0. 
id If PERF_SAMPLE_ID is enabled, a 64-bit unique ID is included. If the event is a member of an event group, the group leader ID is returned. This ID is the same as the one returned by PERF_FORMAT_ID. stream_id If PERF_SAMPLE_STREAM_ID is enabled, a 64-bit unique ID is included. Unlike PERF_SAMPLE_ID the actual ID is returned, not the group leader. This ID is the same as the one returned by PERF_FORMAT_ID. cpu, res If PERF_SAMPLE_CPU is enabled, this is a 32-bit value indicating which CPU was being used, in addition to a reserved (unused) 32-bit value. period If PERF_SAMPLE_PERIOD is enabled, a 64-bit value indicating the current sampling period is written. v If PERF_SAMPLE_READ is enabled, a structure of type read_format is included which has values for all events in the event group. The values included depend on the read_format value used at perf_event_open() time. nr, ips[nr] If PERF_SAMPLE_CALLCHAIN is enabled, then a 64-bit number is included which indicates how many following 64-bit instruction pointers will follow. This is the current callchain. size, data[size] If PERF_SAMPLE_RAW is enabled, then a 32-bit value indicating size is included followed by an array of 8-bit values of length size. The values are padded with 0 to have 64-bit alignment. This RAW record data is opaque with respect to the ABI. The ABI doesn't make any promises with respect to the stability of its content, it may vary depending on event, hardware, and kernel version. bnr, lbr[bnr] If PERF_SAMPLE_BRANCH_STACK is enabled, then a 64-bit value indicating the number of records is included, followed by bnr perf_branch_entry structures which each include the fields: from This indicates the source instruction (may not be a branch). to The branch target. mispred The branch target was mispredicted. predicted The branch target was predicted. in_tx (since Linux 3.11) The branch was in a transactional memory transaction. 
    abort (since Linux 3.11)
        The branch was in an aborted transactional memory transaction.

    The entries are from most to least recent, so the first entry has the most recent branch.

    Support for mispred and predicted is optional; if not supported, both values will be 0.

    The type of branches recorded is specified by the branch_sample_type field.

abi, regs[weight(mask)]
    If PERF_SAMPLE_REGS_USER is enabled, then the user CPU registers are recorded.

    The abi field is one of PERF_SAMPLE_REGS_ABI_NONE, PERF_SAMPLE_REGS_ABI_32 or PERF_SAMPLE_REGS_ABI_64.

    The regs field is an array of the CPU registers that were specified by the sample_regs_user attr field. The number of values is the number of bits set in the sample_regs_user bit mask.

size, data[size], dyn_size
    If PERF_SAMPLE_STACK_USER is enabled, then the user stack is recorded. This can be used to generate stack backtraces. size is the size requested by the user in sample_stack_user or else the maximum record size. data is the stack data (a raw dump of the memory pointed to by the stack pointer at the time of sampling). dyn_size is the amount of data actually dumped (can be less than size). Note that dyn_size is omitted if size is 0.

weight
    If PERF_SAMPLE_WEIGHT is enabled, then a 64-bit value provided by the hardware is recorded that indicates how costly the event was. This allows expensive events to stand out more clearly in profiles.
data_src
    If PERF_SAMPLE_DATA_SRC is enabled, then a 64-bit value is recorded that is made up of the following fields:

    mem_op
        Type of opcode, a bitwise combination of:

        PERF_MEM_OP_NA          Not available
        PERF_MEM_OP_LOAD        Load instruction
        PERF_MEM_OP_STORE       Store instruction
        PERF_MEM_OP_PFETCH      Prefetch
        PERF_MEM_OP_EXEC        Executable code

    mem_lvl
        Memory hierarchy level hit or miss, a bitwise combination of the following, shifted left by PERF_MEM_LVL_SHIFT:

        PERF_MEM_LVL_NA         Not available
        PERF_MEM_LVL_HIT        Hit
        PERF_MEM_LVL_MISS       Miss
        PERF_MEM_LVL_L1         Level 1 cache
        PERF_MEM_LVL_LFB        Line fill buffer
        PERF_MEM_LVL_L2         Level 2 cache
        PERF_MEM_LVL_L3         Level 3 cache
        PERF_MEM_LVL_LOC_RAM    Local DRAM
        PERF_MEM_LVL_REM_RAM1   Remote DRAM 1 hop
        PERF_MEM_LVL_REM_RAM2   Remote DRAM 2 hops
        PERF_MEM_LVL_REM_CCE1   Remote cache 1 hop
        PERF_MEM_LVL_REM_CCE2   Remote cache 2 hops
        PERF_MEM_LVL_IO         I/O memory
        PERF_MEM_LVL_UNC        Uncached memory

    mem_snoop
        Snoop mode, a bitwise combination of the following, shifted left by PERF_MEM_SNOOP_SHIFT:

        PERF_MEM_SNOOP_NA       Not available
        PERF_MEM_SNOOP_NONE     No snoop
        PERF_MEM_SNOOP_HIT      Snoop hit
        PERF_MEM_SNOOP_MISS     Snoop miss
        PERF_MEM_SNOOP_HITM     Snoop hit modified

    mem_lock
        Lock instruction, a bitwise combination of the following, shifted left by PERF_MEM_LOCK_SHIFT:

        PERF_MEM_LOCK_NA        Not available
        PERF_MEM_LOCK_LOCKED    Locked transaction

    mem_dtlb
        TLB access hit or miss, a bitwise combination of the following, shifted left by PERF_MEM_TLB_SHIFT:

        PERF_MEM_TLB_NA         Not available
        PERF_MEM_TLB_HIT        Hit
        PERF_MEM_TLB_MISS       Miss
        PERF_MEM_TLB_L1         Level 1 TLB
        PERF_MEM_TLB_L2         Level 2 TLB
        PERF_MEM_TLB_WK         Hardware walker
        PERF_MEM_TLB_OS         OS fault handler

transaction
    If the PERF_SAMPLE_TRANSACTION flag is set, then a 64-bit field is recorded describing the sources of any transactional memory aborts.

    The field is a bitwise combination of the following values:

    PERF_TXN_ELISION
        Abort from an elision type transaction (Intel-CPU-specific).

    PERF_TXN_TRANSACTION
        Abort from a generic transaction.
    PERF_TXN_SYNC
        Synchronous abort (related to the reported instruction).

    PERF_TXN_ASYNC
        Asynchronous abort (not related to the reported instruction).

    PERF_TXN_RETRY
        Retryable abort (retrying the transaction may have succeeded).

    PERF_TXN_CONFLICT
        Abort due to memory conflicts with other threads.

    PERF_TXN_CAPACITY_WRITE
        Abort due to write capacity overflow.

    PERF_TXN_CAPACITY_READ
        Abort due to read capacity overflow.

    In addition, a user-specified abort code can be obtained from the high 32 bits of the field by masking with PERF_TXN_ABORT_MASK and then shifting right by PERF_TXN_ABORT_SHIFT.

abi, regs[weight(mask)]
    If PERF_SAMPLE_REGS_INTR is enabled, then the CPU registers at the time of the interrupt are recorded (unlike PERF_SAMPLE_REGS_USER, this may be kernel register state if the overflow happened while kernel code was running).

    The abi field is one of PERF_SAMPLE_REGS_ABI_NONE, PERF_SAMPLE_REGS_ABI_32 or PERF_SAMPLE_REGS_ABI_64.

    The regs field is an array of the CPU registers that were specified by the sample_regs_intr attr field. The number of values is the number of bits set in the sample_regs_intr bit mask.

PERF_RECORD_MMAP2
    This record includes extended information on mmap(2) calls returning executable mappings. The format is similar to that of the PERF_RECORD_MMAP record, but includes extra values that allow uniquely identifying shared mappings.

    struct {
        struct perf_event_header header;
        u32    pid;
        u32    tid;
        u64    addr;
        u64    len;
        u64    pgoff;
        u32    maj;
        u32    min;
        u64    ino;
        u64    ino_generation;
        u32    prot;
        u32    flags;
        char   filename[];
        struct sample_id sample_id;
    };

    pid             is the process ID.
    tid             is the thread ID.
    addr            is the address of the allocated memory.
    len             is the length of the allocated memory.
    pgoff           is the page offset of the allocated memory.
    maj             is the major ID of the underlying device.
    min             is the minor ID of the underlying device.
    ino             is the inode number.
    ino_generation  is the inode generation.
    prot            is the protection information.
    flags           is the flags information.
    filename        is a string describing the backing of the allocated memory.
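As a rough illustration of the PERF_RECORD_MMAP2 layout, the sketch below overlays a struct on the fixed-size part of a record and locates filename, which starts immediately after flags. The names mmap2_fixed and mmap2_filename are hypothetical and chosen only for this example. The sketch assumes the record has already been copied into a contiguous, 8-byte-aligned buffer, and it does not decode the trailing sample_id.

```c
#include <linux/perf_event.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fixed-size prefix of a PERF_RECORD_MMAP2 record (hypothetical name).
 * The variable-length filename follows, then struct sample_id. */
struct mmap2_fixed {
    struct perf_event_header header;
    uint32_t pid, tid;
    uint64_t addr, len, pgoff;
    uint32_t maj, min;
    uint64_t ino, ino_generation;
    uint32_t prot, flags;
};

/* Return a pointer to the NUL-terminated filename inside the record,
 * or NULL if the buffer does not hold a plausible MMAP2 record.
 * rec must be suitably aligned; avail is the number of valid bytes. */
static const char *mmap2_filename(const void *rec, size_t avail)
{
    const struct mmap2_fixed *m = rec;

    if (avail < sizeof(*m))
        return NULL;
    if (m->header.type != PERF_RECORD_MMAP2)
        return NULL;
    if (m->header.size < sizeof(*m) || m->header.size > avail)
        return NULL;

    const char *name = (const char *)rec + sizeof(*m);
    size_t room = m->header.size - sizeof(*m);

    /* Refuse a record whose filename is not terminated within its size. */
    if (memchr(name, '\0', room) == NULL)
        return NULL;
    return name;
}
```

A caller would typically check header.type for each record read from the ring buffer and pass only complete records to mmap2_filename().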
PERF_RECORD_AUX (since Linux 4.1) This record reports that new data is available in the separate AUX buffer region. struct { struct perf_event_header header; u64 aux_offset; u64 aux_size; u64 flags; struct sample_id sample_id; }; aux_offset offset in the AUX mmap region where the new data begins. aux_size size of the data made available. flags describes the AUX update. PERF_AUX_FLAG_TRUNCATED if set, then the data returned was truncated to fit the available buffer size. PERF_AUX_FLAG_OVERWRITE if set, then the data returned has overwritten previous data. PERF_RECORD_ITRACE_START (since Linux 4.1) This record indicates which process has initiated an instruction trace event, allowing tools to properly correlate the instruction addresses in the AUX buffer with the proper executable. struct { struct perf_event_header header; u32 pid; u32 tid; }; pid process ID of the thread starting an instruction trace. tid thread ID of the thread starting an instruction trace. Overflow handling Events can be set to notify when a threshold is crossed, indicating an overflow. Overflow conditions can be captured by monitoring the event file descriptor with poll(2), select(2), or epoll(2). Alternately, a SIGIO signal handler can be created and the event configured with fcntl(2) to generate SIGIO signals. Overflows are generated only by sampling events (sample_period must have a nonzero value). There are two ways to generate overflow notifications. The first is to set a wakeup_events or wakeup_watermark value that will trigger if a certain number of samples or bytes have been written to the mmap ring buffer. In this case POLL_IN is indicated. The other way is by use of the PERF_EVENT_IOC_REFRESH ioctl. This ioctl adds to a counter that decrements each time the event overflows. When nonzero, POLL_IN is indicated, but once the counter reaches 0 POLL_HUP is indicated and the underlying event is disabled. 
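The poll(2) approach to overflow notification can be sketched as follows. This is only an illustration under stated assumptions: fd is assumed to be a descriptor returned by perf_event_open() for a sampling event, and the enum and the function wait_overflow are hypothetical names. With poll(2), readiness is reported through the POLLIN and POLLHUP bits of revents; the POLL_IN and POLL_HUP names in the text are the corresponding si_code values delivered with SIGIO.

```c
#include <poll.h>

enum overflow_state {
    OVERFLOW_NONE,   /* timeout expired, nothing to report */
    OVERFLOW_DATA,   /* POLLIN: new data is available */
    OVERFLOW_HUP,    /* POLLHUP: the refresh count reached 0 or the target exited */
    OVERFLOW_ERROR   /* poll() failed or reported an unexpected condition */
};

/* Wait up to timeout_ms milliseconds for an overflow notification on fd. */
static enum overflow_state wait_overflow(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n = poll(&pfd, 1, timeout_ms);

    if (n < 0)
        return OVERFLOW_ERROR;   /* includes EINTR; callers may retry */
    if (n == 0)
        return OVERFLOW_NONE;
    if (pfd.revents & POLLHUP)
        return OVERFLOW_HUP;     /* checked first: the event is now disabled */
    if (pfd.revents & POLLIN)
        return OVERFLOW_DATA;
    return OVERFLOW_ERROR;
}
```

On OVERFLOW_DATA a caller would drain the mmap ring buffer; on OVERFLOW_HUP it could re-arm the event with PERF_EVENT_IOC_REFRESH if further overflows are wanted.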
Refreshing an event group leader refreshes all siblings and refreshing with a parameter of 0 currently enables infinite refreshes; these behaviors are unsupported and should not be relied on. Starting with Linux 3.18, POLL_HUP is indicated if the event being monitored is attached to a different process and that process exits. rdpmc instruction Starting with Linux 3.4 on x86, you can use the rdpmc instruction to get low-latency reads without having to enter the kernel. Note that using rdpmc is not necessarily faster than other methods for reading event values. Support for this can be detected with the cap_usr_rdpmc field in the mmap page; documentation on how to calculate event values can be found in that section. Originally, when rdpmc support was enabled, any process (not just ones with an active perf event) could use the rdpmc instruction to access the counters. Starting with Linux 4.0 rdpmc support is only allowed if an event is currently enabled in a process's context. To restore the old behavior, write the value 2 to /sys/devices/cpu/rdpmc. perf_event ioctl calls Various ioctls act on perf_event_open() file descriptors: PERF_EVENT_IOC_ENABLE This enables the individual event or event group specified by the file descriptor argument. If the PERF_IOC_FLAG_GROUP bit is set in the ioctl argument, then all events in a group are enabled, even if the event specified is not the group leader (but see BUGS). PERF_EVENT_IOC_DISABLE This disables the individual counter or event group specified by the file descriptor argument. Enabling or disabling the leader of a group enables or disables the entire group; that is, while the group leader is disabled, none of the counters in the group will count. Enabling or disabling a member of a group other than the leader affects only that counter; disabling a non-leader stops that counter from counting but doesn't affect any other counter. 
If the PERF_IOC_FLAG_GROUP bit is set in the ioctl argument, then all events in a group are disabled, even if the event specified is not the group leader (but see BUGS). PERF_EVENT_IOC_REFRESH Non-inherited overflow counters can use this to enable a counter for a number of overflows specified by the argument, after which it is disabled. Subsequent calls of this ioctl add the argument value to the current count. An overflow notification with POLL_IN set will happen on each overflow until the count reaches 0; when that happens a notification with POLL_HUP set is sent and the event is disabled. Using an argument of 0 is considered undefined behavior. PERF_EVENT_IOC_RESET Reset the event count specified by the file descriptor argument to zero. This resets only the counts; there is no way to reset the multiplexing time_enabled or time_running values. If the PERF_IOC_FLAG_GROUP bit is set in the ioctl argument, then all events in a group are reset, even if the event specified is not the group leader (but see BUGS). PERF_EVENT_IOC_PERIOD This updates the overflow period for the event. Since Linux 3.7 (on ARM) and Linux 3.14 (all other architectures), the new period takes effect immediately. On older kernels, the new period did not take effect until after the next overflow. The argument is a pointer to a 64-bit value containing the desired new period. Prior to Linux 2.6.36 this ioctl always failed due to a bug in the kernel. PERF_EVENT_IOC_SET_OUTPUT This tells the kernel to report event notifications to the specified file descriptor rather than the default one. The file descriptors must all be on the same CPU. The argument specifies the desired file descriptor, or -1 if output should be ignored. PERF_EVENT_IOC_SET_FILTER (since Linux 2.6.33) This adds an ftrace filter to this event. The argument is a pointer to the desired ftrace filter. PERF_EVENT_IOC_ID (since Linux 3.12) This returns the event ID value for the given event file descriptor. 
The argument is a pointer to a 64-bit unsigned integer to hold the result. PERF_EVENT_IOC_SET_BPF (since Linux 4.1) This allows attaching a Berkeley Packet Filter (BPF) program to an existing kprobe tracepoint event. You need CAP_SYS_ADMIN privileges to use this ioctl. The argument is a BPF program file descriptor that was created by a previous bpf(2) system call. Using prctl A process can enable or disable all the event groups that are attached to it using the prctl(2) PR_TASK_PERF_EVENTS_ENABLE and PR_TASK_PERF_EVENTS_DISABLE operations. This applies to all counters on the calling process, whether created by this process or by another, and does not affect any counters that this process has created on other processes. It enables or disables only the group leaders, not any other members in the groups. perf_event related configuration files Files in /proc/sys/kernel/ /proc/sys/kernel/perf_event_paranoid The perf_event_paranoid file can be set to restrict access to the performance counters. 2 allow only user-space measurements (default since Linux 4.6). 1 allow both kernel and user measurements (default before Linux 4.6). 0 allow access to CPU-specific data but not raw tracepoint samples. -1 no restrictions. The existence of the perf_event_paranoid file is the official method for determining if a kernel supports perf_event_open(). /proc/sys/kernel/perf_event_max_sample_rate This sets the maximum sample rate. Setting this too high can allow users to sample at a rate that impacts overall machine performance and potentially lock up the machine. The default value is 100000 (samples per second). /proc/sys/kernel/perf_event_mlock_kb Maximum number of pages an unprivileged user can mlock(2). The default is 516 (kB). Files in /sys/bus/event_source/devices/ Since Linux 2.6.34, the kernel supports having multiple PMUs available for monitoring. Information on how to program these PMUs can be found under /sys/bus/event_source/devices/. 
Each subdirectory corresponds to a different PMU. /sys/bus/event_source/devices/*/type (since Linux 2.6.38) This contains an integer that can be used in the type field of perf_event_attr to indicate that you wish to use this PMU. /sys/bus/event_source/devices/cpu/rdpmc (since Linux 3.4) If this file is 1, then direct user-space access to the performance counter registers is allowed via the rdpmc instruction. This can be disabled by echoing 0 to the file. As of Linux 4.0 the behavior has changed, so that 1 now means only allow access to processes with active perf events, with 2 indicating the old allow-anyone-access behavior. /sys/bus/event_source/devices/*/format/ (since Linux 3.4) This subdirectory contains information on the architecture-specific subfields available for programming the various config fields in the perf_event_attr struct. The content of each file is the name of the config field, followed by a colon, followed by a series of integer bit ranges separated by commas. For example, the file event may contain the value config1:1,6-10,44 which indicates that event is an attribute that occupies bits 1,6-10, and 44 of perf_event_attr::config1. /sys/bus/event_source/devices/*/events/ (since Linux 3.4) This subdirectory contains files with predefined events. The contents are strings describing the event settings expressed in terms of the fields found in the previously mentioned ./format/ directory. These are not necessarily complete lists of all events supported by a PMU, but usually a subset of events deemed useful or interesting. The content of each file is a list of attribute names separated by commas. Each entry has an optional value (either hex or decimal). If no value is specified, then it is assumed to be a single-bit field with a value of 1. An example entry may look like this: event=0x2,inv,ldlat=3. /sys/bus/event_source/devices/*/uevent This file is the standard kernel device interface for injecting hotplug events. 
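The content of a file under format/ (for example config1:1,6-10,44) can be turned into a bit mask with a small parser. The following is a minimal sketch, not the parser used by any particular tool; the function name parse_format_spec and its error convention are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Parse a format spec such as "config1:1,6-10,44".
 * On success, copy the config field name into field (NUL-terminated),
 * store the mask of occupied bits in *mask, and return 0.
 * Return -1 if the input is malformed or names a bit above 63. */
static int parse_format_spec(const char *spec, char *field,
                             size_t field_len, uint64_t *mask)
{
    const char *colon = strchr(spec, ':');
    if (colon == NULL || colon == spec)
        return -1;

    size_t n = (size_t)(colon - spec);
    if (n >= field_len)
        return -1;
    memcpy(field, spec, n);
    field[n] = '\0';

    uint64_t bits = 0;
    const char *p = colon + 1;
    for (;;) {
        char *end;
        if (*p < '0' || *p > '9')
            return -1;
        unsigned long lo = strtoul(p, &end, 10);
        unsigned long hi = lo;
        if (*end == '-') {               /* a range such as 6-10 */
            p = end + 1;
            if (*p < '0' || *p > '9')
                return -1;
            hi = strtoul(p, &end, 10);
        }
        if (lo > hi || hi > 63)
            return -1;
        for (unsigned long b = lo; b <= hi; b++)
            bits |= UINT64_C(1) << b;
        if (*end == ',') {               /* another bit or range follows */
            p = end + 1;
            continue;
        }
        if (*end == '\0' || *end == '\n')
            break;                       /* end of the spec */
        return -1;
    }
    *mask = bits;
    return 0;
}
```

For the example from the text, the resulting mask has bits 1, 6 through 10, and 44 set, and field holds "config1".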
/sys/bus/event_source/devices/*/cpumask (since Linux 3.7)
    The cpumask file contains a comma-separated list of integers that indicate a representative CPU number for each socket (package) on the motherboard. This is needed when setting up uncore or northbridge events, as those PMUs present socket-wide events.

http://man7.org/linux/man-pages/man2/uname.2.html 10
SYSTEM CALL: uname(2) - Linux manual page
FUNCTIONALITY: uname - get name and information about current kernel
SYNOPSIS:
    #include <sys/utsname.h>

    int uname(struct utsname *buf);

DESCRIPTION
    uname() returns system information in the structure pointed to by buf. The utsname struct is defined in <sys/utsname.h>:

    struct utsname {
        char sysname[];    /* Operating system name (e.g., "Linux") */
        char nodename[];   /* Name within "some implementation-defined network" */
        char release[];    /* Operating system release (e.g., "2.6.28") */
        char version[];    /* Operating system version */
        char machine[];    /* Hardware identifier */
    #ifdef _GNU_SOURCE
        char domainname[]; /* NIS or YP domain name */
    #endif
    };

    The length of the arrays in a struct utsname is unspecified (see NOTES); the fields are terminated by a null byte ('\0').

http://man7.org/linux/man-pages/man2/sysinfo.2.html 11
SYSTEM CALL: sysinfo(2) - Linux manual page
FUNCTIONALITY: sysinfo - return system information
SYNOPSIS:
    #include <sys/sysinfo.h>

    int sysinfo(struct sysinfo *info);

DESCRIPTION
    sysinfo() returns certain statistics on memory and swap usage, as well as the load average.
    Until Linux 2.3.16, sysinfo() returned information in the following structure:

    struct sysinfo {
        long uptime;             /* Seconds since boot */
        unsigned long loads[3];  /* 1, 5, and 15 minute load averages */
        unsigned long totalram;  /* Total usable main memory size */
        unsigned long freeram;   /* Available memory size */
        unsigned long sharedram; /* Amount of shared memory */
        unsigned long bufferram; /* Memory used by buffers */
        unsigned long totalswap; /* Total swap space size */
        unsigned long freeswap;  /* Swap space still available */
        unsigned short procs;    /* Number of current processes */
        char _f[22];             /* Pads structure to 64 bytes */
    };

    In the above structure, the sizes of the memory and swap fields are given in bytes.

    Since Linux 2.3.23 (i386) and Linux 2.3.48 (all architectures) the structure is:

    struct sysinfo {
        long uptime;             /* Seconds since boot */
        unsigned long loads[3];  /* 1, 5, and 15 minute load averages */
        unsigned long totalram;  /* Total usable main memory size */
        unsigned long freeram;   /* Available memory size */
        unsigned long sharedram; /* Amount of shared memory */
        unsigned long bufferram; /* Memory used by buffers */
        unsigned long totalswap; /* Total swap space size */
        unsigned long freeswap;  /* Swap space still available */
        unsigned short procs;    /* Number of current processes */
        unsigned long totalhigh; /* Total high memory size */
        unsigned long freehigh;  /* Available high memory size */
        unsigned int mem_unit;   /* Memory unit size in bytes */
        char _f[20-2*sizeof(long)-sizeof(int)]; /* Padding to 64 bytes */
    };

    In the above structure, sizes of the memory and swap fields are given as multiples of mem_unit bytes.
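Because the memory and swap fields are multiples of mem_unit, a byte count has to be computed explicitly. The sketch below shows one way to do this for totalram; the helper names sysinfo_to_bytes and total_ram_bytes are hypothetical. The multiplication is done in 64 bits on the assumption that the product may not fit in an unsigned long on 32-bit systems.

```c
#include <stdint.h>
#include <sys/sysinfo.h>

/* Convert a sysinfo memory or swap field to bytes. */
static uint64_t sysinfo_to_bytes(unsigned long count, unsigned int mem_unit)
{
    return (uint64_t)count * mem_unit;
}

/* Store the total usable main memory size in bytes in *out.
 * Returns 0 on success, or -1 if sysinfo() fails. */
static int total_ram_bytes(uint64_t *out)
{
    struct sysinfo si;

    if (sysinfo(&si) != 0)
        return -1;
    *out = sysinfo_to_bytes(si.totalram, si.mem_unit);
    return 0;
}
```

The same conversion applies to freeram, sharedram, bufferram, totalswap, freeswap, totalhigh, and freehigh.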