Internet Engineering Task Force                          M. Sustrik, Ed.
Internet-Draft                                             February 2017
Intended status: Informational
Expires: August 5, 2017


                         BSD Socket API Revamp
                           sock-api-revamp-01

Abstract

   This memo describes a new API for network sockets.  Compared to the
   classic BSD socket API the new API is much more lightweight and
   flexible.  Its primary focus is on easy composability of network
   protocols.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on August 5, 2017.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Sustrik                  Expires August 5, 2017                 [Page 1]
Internet-Draft            BSD Socket API Revamp            February 2017

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  The problems
   4.  Basic concepts
     4.1.  Vertical composability
     4.2.  Horizontal composability
     4.3.  Application and transport protocols
       4.3.1.  Application protocols
       4.3.2.  Presentation protocols
       4.3.3.  Transport protocols
     4.4.  Bytestream and message protocols
     4.5.  Connected and unconnected protocols
     4.6.  Scheduling or rather lack of it
     4.7.  Tx buffering
     4.8.  Rx buffering
     4.9.  Socket options
   5.  The API guidelines
     5.1.  Protocol naming conventions
     5.2.  Function naming conventions
     5.3.  File descriptors
     5.4.  Deadlines
     5.5.  Protocol initialization
     5.6.  Protocol termination
       5.6.1.  Forceful termination
       5.6.2.  Half-close termination
       5.6.3.  Orderly termination
     5.7.  Normal operation
       5.7.1.  Bytestream protocols
       5.7.2.  Message protocols
       5.7.3.  Custom sending and receiving functions
       5.7.4.  Error codes
   6.  IANA Considerations
   7.  Security Considerations
   Author's Address

1.  Introduction

   Progress in the area of network protocols is distinctly lagging
   behind other areas of software.  While every hobbyist writes and
   publishes their own small JavaScript libraries, nothing comparable is
   happening with network protocols.  Indeed, the field of network
   protocols appears to be dominated by big companies and academia, just
   as programming as a whole was before the advent of personal
   computers.

   While social and political reasons (adoption, interoperability, etc.)
   may be partly to blame, the technology itself creates a huge barrier
   to popular participation.  For example, the fact that a huge part of
   the networking stack typically lives in kernel space will prevent
   most people from even trying.

   More importantly, though, there is basically no way to reuse what
   already exists.  In the JavaScript world you can take other people's
   libraries, quickly glue them together, add a bit of code of your own
   and publish a shiny new library.  You can't do the same thing with
   network protocols.  You can't take framing from WebSockets, add
   multihoming from SCTP, keep-alives from TCP and congestion control
   from DCCP.  You have to write most of the code yourself, which
   requires a lot of time, often more than a single programmer can
   realistically afford.

   This memo proposes to fix the reusability problem by revamping the
   old BSD socket API with a strong focus on composability of protocols.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119.

3.  The problems

   This section offers a brief summary of the problems that protocol
   implementors face in the current environment.

   o  The hook-in API needed to implement new protocols and make them
      accessible via standard functions like send() and recv() exists,
      in most cases, only in kernel space.  While that makes
      implementation a little harder, the real problem is deployment.
      The kernel is rarely deployed together with the application, and
      waiting for a kernel implementation of a protocol to be widely
      deployed can take many years.

   o  The hook-in API is not standardized, meaning that a protocol has
      to be implemented separately for each operating system.

   o  The usability of the original blocking BSD socket API depends on
      the operating system's ability to spawn a lot of threads (e.g. two
      per TCP connection) and to switch between them quickly.  This
      happens not to be the case with the majority of operating systems.
      The thread number limit tends to be a couple of thousand and the
      context switch latency is often one or two orders of magnitude
      higher than what would be acceptable for high-performance
      networking.  This leads to a proliferation of asynchronous network
      code using non-blocking APIs like poll(), which forces the
      implementor to do both CPU and network scheduling by hand.  High
      context switch latency is often fought by extreme measures, for
      example by implementing hand-crafted lock-free algorithms.  Ugly,
      mislayered and barely maintainable code ensues.

   o  The BSD socket API is semantically underspecified.  The particular
      pain points are error handling and behaviour during protocol
      shutdown.  Having no fixed semantics to rely on means that
      higher-level protocols are fine-tuned to work with a particular
      low-level protocol (e.g. TCP) and cannot be switched to a
      different low-level protocol.  The situation is made worse by the
      large number of socket options that modify the semantics of the
      core APIs.

   o  The scope of the BSD socket API is unnecessarily wide.  It
      attempts to provide a standardized way of doing things that would
      better be done in a customized way.  This results in an
      unnecessarily complex system of extension points: socket options,
      ancillary data, fcntl(), and various arguments to the socket()
      function.

   o  Finally, there are some minor ways to improve the socket API
      which are not, by themselves, a sufficient reason to revamp the
      API.  But if the revamp is to be done anyway, these minor issues
      can be fixed as well.

4.  Basic concepts

4.1.  Vertical composability

   Vertical composability is the ability to stack protocols one on top
   of another.  From the network point of view the protocol on the top
   is the payload of the protocol on the bottom.  From the API point of
   view the top protocol encapsulates the bottom protocol, very much
   like a function encapsulates another function that it calls.

   Example of a vertical stack of protocols:

   +----------+
   |   HTTP   |
   +----------+
   |   TCP    |
   +----------+
   |    IP    |
   +----------+
   | Ethernet |
   +----------+

4.2.  Horizontal composability

   Horizontal composability is the ability to execute protocols in a
   sequential manner.  From the network point of view one type of
   communication ends and is replaced by another type of communication.
   From the API point of view one protocol is terminated and another one
   is started, reusing the same underlying protocol, very much like a
   function can call two child functions in sequence without having to
   exit itself.

   An example of horizontal composability is how a typical web page is
   transferred by first doing the HTTP handshake, followed by the HTML
   body:

   +-----------------------------------+
   |   HTTP   |          HTML          |
   +----------+------------------------+
   |                TCP                |
   +-----------------------------------+

   Note how this design makes the protocol reusable: the same HTTP
   handshake can be used, for example, to initiate a WebSocket session.

   Another example of horizontal composability is how STARTTLS switches
   a non-encrypted protocol into an encrypted version of the same
   protocol.

   While these are very visible cases of composing protocols
   horizontally, the technique is in fact ubiquitous.  For example,
   most protocols are composed of three distinct mini-protocols: a
   protocol header (initial handshake), a protocol body (sending data
   back and forth) and a protocol footer (terminal handshake):

   +-----------------------------------+
   | Header |      Body       | Footer |
   +--------+-----------------+--------+
   |                TCP                |
   +-----------------------------------+

4.3.  Application and transport protocols

4.3.1.  Application protocols

   Application protocols live on the top of the network stack.  Rather
   than transferring raw data they are meant to perform a specific
   service for the user.  For example, the DNS protocol provides a name
   resolution service.

   Application protocols don't give the user a way to send or receive
   data.  They have no standardized API for sending or receiving.
   Still, they can be initialized, terminated and layered on top of
   other protocols.  That being the case, the relevant parts of this
   specification still apply to them.

4.3.2.  Presentation protocols

   Presentation protocols add structure to data carried by transport
   protocols (e.g. ASN.1, JSON, XML).

   This proposal doesn't address them in any way.  Either the protocol
   sends and receives binary data that just happen to be in a
   structured format, in which case it's a standard transport protocol,
   or the protocol exposes a special API to browse the structured data,
   in which case it should be treated as an application protocol.

4.3.3.  Transport protocols

   The term "transport protocol" in this memo has a broader scope than
   "OSI L4 protocol".  By "transport protocol" we mean anything capable
   of sending and/or receiving unstructured data, be it TCP, IP or
   Ethernet.

4.4.  Bytestream and message protocols

   Bytestream protocols are transport protocols that define no message
   boundaries.  One peer can send 10 bytes, then 8 bytes.  The other
   peer can read all 18 bytes at once or read 12 bytes first and 6
   bytes second.  Bytestream protocols are always reliable (no bytes
   can be lost) and ordered (bytes are received in the same order they
   were sent in).  TCP is a classic example of a bytestream protocol.

   Message protocols are transport protocols that preserve message
   boundaries.  While message protocols are not necessarily reliable
   (messages can be lost) or ordered (messages can be received in a
   different order than they were sent in), they are always atomic.
   The user will receive either a complete message or no message.  IP,
   UDP and WebSockets are examples of message protocols.

   This memo proposes distinct APIs for bytestream and message
   protocols.  The reason for this design decision is that while the
   APIs for the two are superficially similar, there is a large
   difference in semantics, especially when it comes to atomicity and
   error handling.  As an added benefit, the fact that bytestream
   protocols are always ordered and reliable means that the bytestream
   API provides stronger semantic guarantees than the message API.

   In theory, the two APIs could have been unified by treating each
   byte in a bytestream protocol as a separate message.  In practice,
   however, such an API design would prevent batching and would thus
   result in implementations with inferior throughput characteristics.

4.5.  Connected and unconnected protocols

   From the API point of view the most significant difference between
   connected protocols such as TCP and unconnected protocols such as
   UDP is that the former have initialization and termination
   handshakes while the latter do not.

   This distinction turns out to be critical when contemplating
   composable microprotocols.
   If there are 10 connected microprotocols in the stack and the RTT
   between two endpoints is 100 ms, connection establishment will
   require at least 10 round trips, i.e. one second or more.  Latency
   of this magnitude is often not acceptable.  Even worse, given that
   RTT is limited by the speed of light, it's not going to get better
   as technology advances.

   Therefore, the only way to get microprotocols with decent
   performance is to make the majority of them unconnected.  In the
   ideal case there should be at most one connected protocol in any
   networking stack.

   To achieve that goal we'll have to turn even protocols that
   instinctively feel connected into unconnected ones.

   Consider a simple keep-alive protocol.  The peers have to agree on a
   keep-alive interval and thus the initial handshake seems
   unavoidable.  However, if the peers have agreed on the keep-alive
   interval beforehand they can just pass the number to the API and no
   initial handshake is required.  We can suddenly think of the
   protocol as unconnected.

   Technically, the users could have just met in person and decided on
   the keep-alive interval.  However, it doesn't have to be that way.
   In most cases there's a connected protocol in the stack somewhere
   beneath the keep-alive protocol.  If that connected protocol allows
   arbitrary data to be bundled with its initial handshake, we can use
   it to exchange keep-alive intervals between the peers.  Later on,
   when instantiating the keep-alive protocol, we can pass in the
   correct number via the API.

   As for the terminal handshakes, these are critical for horizontal
   composability of protocols.  To be able to start a new protocol on a
   pre-existing connection, the old protocol has to be terminated and
   both peers have to agree on where exactly it has ended.  To do that
   a handshake is needed.

   Assuming there's a connected protocol beneath, the user can tear
   down all the unconnected protocol layers above it.  Then they can
   shut down the connected protocol, which will put both peers in sync
   with respect to where exactly the protocol has ended.  Afterwards
   they can open new protocols on top of the remaining layers of the
   stack.

   Each protocol MUST be either connected or unconnected.  A protocol
   MUST NOT try to support both scenarios.

4.6.  Scheduling or rather lack of it

   During the decades since BSD sockets were first introduced the way
   they are used has changed significantly.  While in the beginning the
   user was supposed to fork a new process for each connection and do
   all the work using simple blocking calls, nowadays they are expected
   to keep a pool of connections, check them via functions like poll()
   or kqueue() and dispatch any work to be done to one of the worker
   threads in a thread pool.  In other words, the user is supposed to
   do both network and CPU scheduling.

   This change happened for performance reasons and hasn't improved the
   functionality or usability of the BSD socket API in any way.  On the
   contrary, by requiring every programmer to do a system programmer's
   work it contributed to the proliferation of buggy, hard-to-debug and
   barely maintainable network code.

   To address this problem, this memo assumes that there already exists
   an efficient concurrency implementation where forking a new
   lightweight process takes at most hundreds of nanoseconds and a
   context switch takes tens of nanoseconds.  Note that such
   concurrency systems are already deployed in the wild.  One
   well-known example is Go's goroutines, but there are others
   available as well.

   In such an environment network programming can be done in the old
   "one process per connection" way, with all the functions exhibiting
   blocking behaviour.  There's no need for polling, thread pools,
   callbacks, explicit state machines or similar.

   This memo thus adheres to the "let system programmers do system
   programming" maxim and doesn't address the problem of scheduling, be
   it CPU scheduling or network scheduling, at all.
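
   The "one process per connection" style can be sketched in plain
   POSIX C.  This is only an illustration under stated assumptions:
   fork() stands in for the much cheaper lightweight processes that the
   memo assumes, a socketpair() stands in for an accepted network
   connection, and the names serve_connection() and echo_demo() are
   hypothetical.  The point is the shape of the worker, which is a
   straight-line loop of blocking calls.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Serve one connection with plain blocking calls: echo bytes back
   until the peer closes its end. No poll(), no callbacks. */
static void serve_connection(int fd) {
    char buf[256];
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        if (write(fd, buf, (size_t)n) != n) break;
    }
    close(fd);
}

/* Hypothetical demo: one worker process per connection. A socketpair
   stands in for an accepted network connection. Returns 0 if the
   worker echoed the request back, -1 otherwise. */
int echo_demo(void) {
    int sv[2];
    int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    assert(rc == 0);

    pid_t pid = fork();
    assert(pid >= 0);
    if (pid == 0) {
        /* Worker process: owns sv[1] only. */
        close(sv[0]);
        serve_connection(sv[1]);
        _exit(0);
    }

    /* Client side: owns sv[0] only. */
    close(sv[1]);
    char reply[4] = {0};
    ssize_t sent = write(sv[0], "ping", 4);
    ssize_t got = read(sv[0], reply, sizeof(reply));
    close(sv[0]);           /* the worker's read() now returns 0 */
    waitpid(pid, NULL, 0);  /* reap the worker */

    return (sent == 4 && got == 4 && memcmp(reply, "ping", 4) == 0)
        ? 0 : -1;
}
```

   With a heavyweight fork() this style does not scale to many
   connections, which is exactly the cost the memo assumes away by
   requiring an efficient concurrency implementation.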

   As a footnote, it should be mentioned that this proposal makes a
   couple of deliberate design choices that prevent the modern
   "schedule by hand" style of network programming.

4.7.  Tx buffering

   Buffering outbound data and sending it down the stack in batches
   often results in improved performance.  It is perfectly acceptable
   for a protocol implementation to do so as long as the data is
   flushed when the socket is closed.  The data should also be flushed
   periodically so as not to induce unbounded latencies when there is
   no new outbound data to fill the buffer.

4.8.  Rx buffering

   Buffering of inbound data collides with vertical composability of
   protocols.

   If a protocol reads 1000 bytes of data from the underlying protocol,
   then the protocol above it asks for 700 bytes and closes the socket,
   there's no way to push the remaining 300 bytes back to the
   underlying socket.  Allowing such an operation would mean that the
   buffer of the underlying socket would have to be virtually
   unbounded.

   If, on the other hand, the remaining bytes were dropped, there would
   be no way to start a new protocol on top of the same underlying
   socket.  The new protocol would miss the initial 300 bytes of data.

   Luckily though, the above reasoning doesn't apply to the bottommost
   protocol in the stack.  Given that there's no underlying protocol to
   start a new protocol on top of, the buffered data can simply be
   dropped.

   Also, rx buffering on the lowermost level, where the protocol
   interfaces with the hardware or with the user/kernel space boundary,
   is likely to provide the largest performance benefits.  Absence of
   rx buffering on higher levels, where the performance impact of an
   additional receive operation is basically that of a function call,
   is not likely to incur a huge performance penalty.
   And even more so given that higher layers of the stack are likely to
   be message-based and thus some amount of batching, proportional to
   the average message size, happens anyway.

4.9.  Socket options

   There's no equivalent to socket options as they are known from the
   BSD socket API.  Any such customization of the network stack is
   supposed to be built by vertically layering the protocols.

5.  The API guidelines

5.1.  Protocol naming conventions

   Whenever possible, the protocol name in the API SHOULD correspond to
   the official name of the protocol, not to the name of the protocol
   implementation.  While this can lead to name clashes, the assumption
   is that a single application is not going to use two implementations
   of the same network protocol.  This rule also provides an incentive
   to standardize protocol APIs.

   To make the API less tedious to use, a short protocol name, e.g.
   "ws", SHOULD be preferred to the long name, e.g. "websockets".

   Given that end users prefer to create a full protocol stack using a
   single function, it is desirable to provide them with shrinkwrapped
   protocols aggregating many microprotocols into a coherent whole.
   For example, the "websocket" protocol may be composed of the TCP
   protocol and the WebSocket protocol itself.  Still, expert users may
   want to have access to the WebSocket protocol as such, without the
   underlying TCP protocol, say, if they want to run it on top of SCTP
   or any other alternative transport.

   In such cases there is a naming dilemma: should the "websocket" name
   refer to the TCP+WebSocket aggregate or to WebSocket alone?  In
   these cases the implementors SHOULD always prefer the former
   solution.

5.2.  Function naming conventions

   The function names SHOULD be in lowercase and SHOULD be composed of
   the short protocol name and the action name separated by an
   underscore (e.g. "tcp_connect").
   Of course, in languages other than C the native naming convention
   should be followed, but even then the name SHOULD contain both the
   short protocol name and the action name.

5.3.  File descriptors

   One of the design goals of this API is to support both kernel space
   and user space implementations.  One problem with that is that
   kernel space objects are typically referred to by file descriptors
   while POSIX provides no easy way to associate user space objects
   with file descriptors.

   Therefore, this specification allows user space implementations to
   use fake file descriptors (simple integers that kernel space knows
   nothing about) and does not guarantee that system functions will
   work with those descriptors.

   For example, you cannot count on the POSIX close() function to be
   able to close a socket.  Therefore, the hclose() function is
   introduced, which maps directly to close() in kernel-space
   implementations but can be overridden by a custom implementation in
   user space.

   Whenever a function acts on a file descriptor, the descriptor SHOULD
   be passed to the function as its first argument.

5.4.  Deadlines

   Unlike with BSD sockets, deadlines are points in time rather than
   intervals.  This allows the same deadline to be used in multiple
   calls without the need to recompute the remaining interval:

      int64_t deadline = now() + 1000;
      bsend(h, "ABC", 3, deadline);
      bsend(h, "DEF", 3, deadline);

   All possibly blocking functions MUST accept a deadline.

   The deadline SHOULD be passed to the function as its last argument.

5.5.  Protocol initialization

   A protocol SHOULD be initialized using a protocol-specific "start"
   function (e.g. "smtp_start").  If the protocol runs on top of
   another protocol, the file descriptor of the underlying protocol
   SHOULD be supplied as the first argument of the function.  The
   function MAY have an arbitrary number of additional arguments.

   The function SHOULD return the file descriptor of the newly created
   protocol instance.  In case of error it SHOULD close the underlying
   protocol, return -1 and set errno to the appropriate error.

   Some protocols require more complex setup.  Consider TCP's
   listen/connect/accept connection setup process.  These protocols
   SHOULD use a custom set of functions rather than try to shoehorn all
   the functionality into a single all-purpose "start" function.

   If a protocol runs on top of an underlying protocol it takes
   ownership of that protocol.  Using the low-level protocol while it
   is owned by a higher-level protocol will result in undefined
   behaviour.

   A sane way to implement this behaviour is to create a duplicate of
   the underlying file descriptor, to be owned by the parent protocol,
   and close the original file descriptor.  That way, a user
   accidentally using the lower-level protocol will get an EBADF error.

   If the protocol requires an initial handshake it MUST be performed
   in this phase of the socket lifecycle.

   Example of creating a stack of four protocols:

      int s1 = tcp_connect("192.168.0.111:5555", -1);
      int s2 = foo_start(s1, arg1, arg2, arg3);
      int s3 = bar_start(s2);
      int s4 = baz_start(s3, arg4, arg5);

5.6.  Protocol termination

   There are several types of termination, discussed in the following
   sections:

   o  Forceful termination means that the user wants to shut down the
      socket abruptly without even letting the peer know.  Forceful
      termination is always a non-blocking operation.

   o  Half-close termination means that the outbound half of the
      connection is closed by the user and the terminal handshake, if
      supported by the protocol, is initiated.  However, the user is
      still able to receive data from the peer.

   o  Orderly termination means that the terminal handshake with the
      peer, if required by the protocol, is performed.  Orderly
      termination leaves both peers with a consistent view of the
      world.

5.6.1.  Forceful termination

   To perform forceful termination the protocol descriptor is closed by
   the hclose() function.  In kernel-space implementations this
   function maps directly to the standard POSIX close() function.

   The protocol MUST shut down immediately without trying to do a
   termination handshake.  Note that this is different from how classic
   BSD sockets behave.

   The protocol MUST also clean up all resources it owns, including
   closing the underlying protocol.  Given that the underlying protocol
   does the same, an entire stack of protocols can be shut down
   recursively by closing the file descriptor of the topmost protocol:

      int h1 = foo_start();
      int h2 = bar_start(h1);
      int h3 = baz_start(h2);
      hclose(h3); /* baz, bar and foo are shut down */

   In case of success hclose() returns zero.  In case of error it
   returns -1 and sets errno to the appropriate value.

5.6.2.  Half-close termination

   The primary use case for half-closing a connection is when the user
   wants to close a connection, yet still wants to receive all the data
   sent by the peer prior to the termination.

   To do so the function hdone() is used.  It is roughly equivalent to
   the POSIX shutdown(SHUT_WR) function.

   hdone() first of all flushes any buffered outbound data.  What
   happens next depends on whether the protocol is connected or
   unconnected.

   For unconnected protocols the implementation MUST forward the call
   to the underlying socket.  If the protocol is at the bottom of the
   stack and there is no underlying socket it MUST return the ENOTSUP
   error.

   For connected protocols the implementation MUST start the
   termination handshake and return to the caller without waiting for
   the answer from the peer.

   After hdone() is called, any further calls to hdone() or attempts to
   send more data MUST result in the EPIPE error.  However, the user is
   still able to receive more data from the socket.

   The following piece of code shows typical usage of hdone().
   It half-closes the connection, receives any pending messages from
   the peer and finally closes the socket:

      hdone(s);
      while(1) {
          /* mrecv() returns the message size, or -1 on error */
          ssize_t rc = mrecv(s, &msg, sizeof(msg), -1);
          /* EPIPE means the peer is done; stop on any other error too */
          if(rc < 0) break;
          process_msg(&msg);
      }
      hclose(s);

   The hdone() function returns 0 on success.  In case of error the
   function MUST forcibly close the underlying protocol (and thus
   recursively all protocols beneath it), return -1 and set errno to
   the appropriate value.

5.6.3.  Orderly termination

   To perform an orderly shutdown there SHOULD be a protocol-specific
   function called "stop" (e.g. "smtp_stop").

   In addition to the file descriptor the function can have an
   arbitrary number of other arguments.  For example, one such argument
   may be a "shutdown reason" string to be sent to the peer.  However,
   it is RECOMMENDED to avoid such additional arguments in newly
   designed protocols.  The reason is that such arguments cannot be
   passed to the hdone() function, making half-close termination
   functionally inferior to orderly termination.

   If the shutdown functionality is potentially blocking, the last
   argument of the function SHOULD be a deadline.

   The first thing the "stop" function should do is to check whether
   hdone() was already called by the user and, if not, to invoke it
   itself.

   If hdone() returns an error other than ENOTSUP the socket MUST be
   torn down and the error MUST be forwarded to the caller.

   If hdone() returns the ENOTSUP error the implementation MUST simply
   proceed further without reading any messages from the peer.

   If hdone() succeeds the implementation must read and drop any
   pending messages from the peer.  Note that there is a possible DoS
   attack here: the peer can send an infinite number of messages.
   Therefore, the implementation MUST observe the deadline and tear
   down the socket in case it runs out of time.  The ETIMEDOUT error is
   then returned to the user.

   At this point the implementation should deallocate the socket.
   However, it SHOULD NOT close the underlying protocol.  Instead it
   SHOULD return its file descriptor to the user.  This is crucial for
   horizontal composability of the protocols:

      /* create stack of two protocols */
      h1 = foo_start();
      h2 = bar_start(h1);
      /* top protocol is closed but bottom one is still alive */
      h1 = bar_stop(h2, -1);
      /* new top protocol is started */
      h3 = baz_start(h1);
      /* shut down both protocols */
      h1 = baz_stop(h3, -1);
      foo_stop(h1, -1);

   If the protocol lives at the very bottom of the stack and has no
   underlying protocol, the "stop" function MUST return 0 on success.

   In the case of error the "stop" function MUST forcibly close the
   underlying protocol (and thus recursively all protocols beneath it),
   return -1 and set errno to the appropriate value.

   Note that this design of orderly termination does away with the BSD
   socket SO_LINGER behaviour, which is problematic as it cannot be
   implemented cleanly in user space.

5.7.  Normal operation

   Everything that happens between protocol initialization and protocol
   termination will be referred to as "normal operation".

   As already mentioned, application protocols can't send or receive
   data.  Trying to invoke any of the functions below on an application
   protocol MUST result in the ENOTSUP error.

   Transport protocols are either bytestream protocols or message
   protocols.

5.7.1.  Bytestream protocols

   Bytestream protocols can be used via the following four functions:

      int bsend(int h, const void *buf, size_t len,
          int64_t deadline);
      int brecv(int h, void *buf, size_t len,
          int64_t deadline);
      int bsendv(int h, const struct iovec *iov, size_t iovlen,
          int64_t deadline);
      int brecvv(int h, const struct iovec *iov, size_t iovlen,
          int64_t deadline);

   Function bsend() sends data to the protocol.  The protocol SHOULD
   send them, after whatever manipulation is required, to its
   underlying protocol.
   Eventually, the bottommost protocol in the stack sends the data to
   the network.

   Function brecv() reads data from the protocol.  The protocol SHOULD
   read them from the underlying socket and, after whatever
   manipulation is required, return them to the caller.  The bottommost
   protocol in the stack reads the data from the network.

   All the functions above MUST be blocking and exhibit atomic
   behaviour, i.e. either all data are sent/received or none of them
   are.  In the latter case the protocol MUST be marked as broken,
   errno MUST be set to the appropriate value and -1 MUST be returned
   to the user.  Any subsequent attempt to use the protocol MUST result
   in an error.

   An expired deadline is considered to be an error and the protocol
   MUST behave as described above and set errno to ETIMEDOUT.

   In case of success all the functions MUST return zero.

   Functions bsendv() and brecvv() MUST behave in the same way as
   bsend() and brecv(), the only difference being that buffers are
   passed to the functions via scatter/gather arrays, the same way as
   in the POSIX sendmsg() and recvmsg() functions.

   Note that the implementations of brecv() and brecvv() MAY change the
   content of the buffer supplied to the function even in the case of
   error.  However, what exactly will be written into the buffer is
   unpredictable and using such data will result in undefined
   behaviour.

5.7.2.  Message protocols

   Message protocols can be used via the following four functions:

      int msend(int h, const void *buf, size_t len,
          int64_t deadline);
      ssize_t mrecv(int h, void *buf, size_t len,
          int64_t deadline);
      int msendv(int h, const struct iovec *iov, size_t iovlen,
          int64_t deadline);
      ssize_t mrecvv(int h, const struct iovec *iov, size_t iovlen,
          int64_t deadline);

   Function msend() sends a message to the protocol.  The protocol
   SHOULD send it, after whatever manipulation is required, to its
   underlying protocol.
   Eventually, the lowermost protocol in the stack sends the data to
   the network.

   Function mrecv() reads a message from the protocol.  The protocol
   SHOULD read it from its underlying protocol and, after whatever
   manipulation is needed, return it to the caller.  The lowermost
   protocol in the stack reads the data from the network.

   All the functions MUST be blocking and exhibit atomic behaviour,
   i.e. either the entire message is sent/received or none of it is.
   In the latter case errno MUST be set to the appropriate value and
   -1 MUST be returned to the user.  The protocol may be recoverable,
   in which case receiving the next message after an error is
   possible.  It can also be non-recoverable, in which case the
   protocol MUST be marked as broken and any subsequent attempt to use
   it MUST result in an error.

   Note that, unlike with bytestream sockets, the buffer supplied to
   mrecv() doesn't have to be fully filled in, i.e. received messages
   may be smaller than the buffer.

   If the message is larger than the buffer, it is considered to be an
   error and the protocol must return -1 and set errno to EMSGSIZE.
   If there's no way to discard the unread part of the message in
   constant time, it SHOULD also mark the protocol as broken and
   refuse any further operations.  This behaviour prevents DoS attacks
   that rely on sending very large messages.

   An expired deadline is considered to be an error and the protocol
   MUST return the ETIMEDOUT error.

   In case of success, msend() MUST return zero and mrecv() MUST
   return the size of the received message, zero being a valid size.

   Functions msendv() and mrecvv() MUST behave in the same way as
   msend() and mrecv().  The only difference is that buffers are
   passed to the functions via scatter/gather arrays, the same way as
   in the POSIX sendmsg() and recvmsg() functions.

   Note that the implementations of mrecv() and mrecvv() MAY change
   the content of the buffer supplied to the function even in the case
   of an error.
   However, what exactly will be written into the buffer is
   unpredictable and using such data will result in undefined
   behaviour.

5.7.3.  Custom sending and receiving functions

   In addition to the send/recv functions described above, protocols
   MAY implement their own custom send/recv functions.  These
   functions should be named "send" and/or "recv" with a
   protocol-specific prefix (e.g. "udp_send").

   Custom functions allow for providing additional arguments.  For
   example, a UDP protocol may implement a custom send function with
   an additional "destination IP address" argument.

   A protocol MAY implement multiple send or receive functions as
   needed.

   Protocol implementors should try to make custom send/recv functions
   as consistent with the standard send/recv functions as possible.

   Standard send/recv functions SHOULD fill in the arguments that the
   custom send/recv functions take explicitly with sensible defaults.
   It MAY be possible to set those defaults via the "start" function.

5.7.4.  Error codes

   Send and receive functions may return the following error codes:

   o  EBADF: Bad file descriptor.

   o  ECONNRESET: Connection broken.  For example, a failure to
      receive a keepalive from the peer may result in this error.

   o  EMSGSIZE: Message is too large to fit into the supplied buffer.
      Applies only to mrecv() and mrecvv().

   o  ENOTSUP: The socket does not support the function.  For example,
      msend() was called on a bytestream socket or mrecv() was called
      on a send-only socket.

   o  EPIPE: The peer has closed the connection.

   o  EPROTO: The peer has violated the protocol specification.

   o  ETIMEDOUT: Deadline expired.

   As already mentioned, some protocols MAY treat errors as
   unrecoverable.  In these cases any subsequent operation on the
   socket MUST return the same error.

   The implementation SHOULD NOT go to great lengths to implement
   recoverable errors.  Instead, it should stick to the most natural
   semantics for the protocol.
   For example, EMSGSIZE may seem to be a recoverable error.  However,
   the implementation may have to allocate an arbitrary amount of
   memory to temporarily store the already received part of the
   message, which could in turn enable DoS attacks that rely on
   sending large messages.  It may thus be preferable to keep these
   errors unrecoverable.

6.  IANA Considerations

   This memo includes no request to IANA.

7.  Security Considerations

   Network APIs can facilitate DoS attacks by allowing for unlimited
   buffer sizes and for infinite deadlines.

   This proposal avoids the first issue by requiring the user to
   allocate all the buffers.

   It addresses the second problem by always making the deadline
   explicit.  Also, by not requiring recomputation of timeout
   intervals, it makes the deadlines easy to use.  The user should
   take advantage of that and set a reasonable timeout for every
   network operation.

   Other than that, the security implications of the new API don't
   differ from the security implications of the classic BSD socket
   API.  Still, it may be worth passing the design through a security
   audit.

Author's Address

   Martin Sustrik (editor)

   Email: sustrik@250bpm.com