Internet Engineering Task Force                          M. Sustrik, Ed.
Internet-Draft                                             February 2017
Intended status: Informational
Expires: August 5, 2017


                         BSD Socket API Revamp
                           sock-api-revamp-01

Abstract

   This memo describes a new API for network sockets.  Compared to the
   classic BSD socket API, the new API is much more lightweight and
   flexible.  Its primary focus is on easy composability of network
   protocols.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on August 5, 2017.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.


Sustrik                  Expires August 5, 2017                 [Page 1]

Internet-Draft            BSD Socket API Revamp            February 2017


Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
   2.  Terminology . . . . . . . . . . . . . . . . . . . . . . . . .   3
   3.  The problems  . . . . . . . . . . . . . . . . . . . . . . . .   3
   4.  Basic concepts  . . . . . . . . . . . . . . . . . . . . . . .   4
     4.1.  Vertical composability  . . . . . . . . . . . . . . . . .   4
     4.2.  Horizontal composability  . . . . . . . . . . . . . . . .   5
     4.3.  Application and transport protocols . . . . . . . . . . .   5
       4.3.1.  Application protocols . . . . . . . . . . . . . . . .   5
       4.3.2.  Presentation protocols  . . . . . . . . . . . . . . .   6
       4.3.3.  Transport protocols . . . . . . . . . . . . . . . . .   6
     4.4.  Bytestream and message protocols  . . . . . . . . . . . .   6
     4.5.  Connected and unconnected protocols . . . . . . . . . . .   7
     4.6.  Scheduling or rather lack of it . . . . . . . . . . . . .   8
     4.7.  Tx buffering  . . . . . . . . . . . . . . . . . . . . . .   9
     4.8.  Rx buffering  . . . . . . . . . . . . . . . . . . . . . .   9
     4.9.  Socket options  . . . . . . . . . . . . . . . . . . . . .   9
   5.  The API guidelines  . . . . . . . . . . . . . . . . . . . . .  10
     5.1.  Protocol naming conventions . . . . . . . . . . . . . . .  10
     5.2.  Function naming conventions . . . . . . . . . . . . . . .  10
     5.3.  File descriptors  . . . . . . . . . . . . . . . . . . . .  10
     5.4.  Deadlines . . . . . . . . . . . . . . . . . . . . . . . .  11
     5.5.  Protocol initialization . . . . . . . . . . . . . . . . .  11
     5.6.  Protocol termination  . . . . . . . . . . . . . . . . . .  12
       5.6.1.  Forceful termination  . . . . . . . . . . . . . . . .  12
       5.6.2.  Half-close termination  . . . . . . . . . . . . . . .  13
       5.6.3.  Orderly termination . . . . . . . . . . . . . . . . .  13
     5.7.  Normal operation  . . . . . . . . . . . . . . . . . . . .  15
       5.7.1.  Bytestream protocols  . . . . . . . . . . . . . . . .  15
       5.7.2.  Message protocols . . . . . . . . . . . . . . . . . .  16
       5.7.3.  Custom sending and receiving functions  . . . . . . .  17
       5.7.4.  Error codes . . . . . . . . . . . . . . . . . . . . .  18
   6.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  18
   7.  Security Considerations . . . . . . . . . . . . . . . . . . .  18
   Author's Address  . . . . . . . . . . . . . . . . . . . . . . . .  19

1.  Introduction

   The progress in the area of network protocols is distinctly
   lagging behind.  While every hobbyist writes and publishes their
   small JavaScript libraries, there's no such thing going on with
   network protocols.  Indeed, it looks like the field of network
   protocols is dominated by big companies and academia, just like
   programming as a whole used to be before the advent of personal
   computers.

   While social and political reasons may be partly to blame (adoption,
   interoperability etc.) the technology itself creates a huge barrier
   to popular participation.  For example, the fact that a huge part of
   the networking stack typically lives in the kernel space will prevent
   most people from even trying.  More importantly, though, there is
   basically no way to reuse what already exists.  While in the
   JavaScript world you can get other people's libraries, quickly glue
   them together, add a bit of code of your own and publish a shiny new
   library, you can't do the same thing with network protocols.  You
   can't take framing from WebSockets, add multihoming from SCTP,
   keep-alives from TCP and congestion control from DCCP.  You have to
   write most of the code yourself, which requires a lot of time, often
   more than a single programmer can realistically afford.

   This memo proposes to fix the reusability problem by revamping the
   old BSD socket API and while doing so strongly focusing on
   composability of protocols.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119.

3.  The problems

   This section offers a brief summary of the problems that protocol
   implementors are facing in the current environment.

   o  The hook-in API to implement new protocols and make them
      accessible via standard functions like send() and recv() exists,
      in most cases, only in the kernel space.  While that makes
      implementation a little bit harder, the real problem is
      deployment.  The kernel is rarely deployed with the application
      and waiting for a kernel implementation of a protocol to be
      widely deployed can take many years.

   o  The hook-in API is not standardized, meaning that a protocol
      has to be implemented for each operating system separately.

   o  The usability of the original blocking BSD socket API is dependent
      on the operating system's ability to spawn a lot of threads (e.g.
      two per TCP connection) and switch between them quickly.  This
      happens not to be the case with the majority of operating
      systems.  The thread number limit tends to be a couple of thousand
      and the context switch latency is often one or two orders of
      magnitude higher than what would be acceptable for
      high-performance networking.  This leads
      to proliferation of asynchronous network code using non-blocking
      APIs like poll(), which force the implementor to do both CPU and
      network scheduling by hand.  High context switch latency is often
      fought by extreme measures, for example by implementing
      hand-crafted lock-free algorithms.  Ugly, mislayered and barely
      maintainable code ensues.

   o  The BSD socket API is semantically underspecified.  The particular
      points of pain are error handling and behaviour during protocol
      shutdown.  No fixed semantics to rely on means that higher level
      protocols are fine-tuned to work with a particular low level
      protocol (e.g. TCP) and cannot be switched to a different
      low-level protocol.  The situation is made worse by the large
      number of socket options that modify the semantics of the core
      APIs.

   o  The scope of the BSD socket API is unnecessarily wide.  It
      attempts to provide a standardized way of doing things that would
      better be done in a customized way.  This results in an
      unnecessarily complex system of extension points: socket options,
      ancillary data, fcntl(), various arguments to the socket()
      function.

   o  Finally, there are some minor ways to improve the socket API
      which, however, are not a sufficient reason to revamp the API.
      But if the revamp is to be done anyway these minor issues can be
      fixed.

4.  Basic concepts

4.1.  Vertical composability

   Vertical composability is the ability to stack protocols one on top
   of another.  From the network point of view the protocol on the top
   is a payload of the protocol on the bottom.  From the API point of
   view the top protocol encapsulates the bottom protocol, very much
   like a function encapsulates another function that it calls.

   Example of a vertical stack of protocols:

                               +----------+
                               |   HTTP   |
                               +----------+
                               |    TCP   |
                               +----------+
                               |    IP    |
                               +----------+
                               | Ethernet |
                               +----------+

4.2.  Horizontal composability

   Horizontal composability is the ability to execute protocols in a
   sequential manner.  From the network point of view one type of
   communication ends and is replaced by another type of communication.
   From the API point of view one protocol is terminated and another
   one is started, reusing the same underlying protocol, very much like
   a function can call two child functions in sequence without having
   to exit itself.

   An example of horizontal composability is how a typical web page is
   transferred by first doing an HTTP handshake, followed by the HTML
   body:

                  +-----------------------------------+
                  |   HTTP   |          HTML          |
                  +----------+------------------------+
                  |                TCP                |
                  +-----------------------------------+

   Note how this design makes the protocol reusable: The same HTTP
   handshake can be used, for example, to initiate a WebSocket session.

   Another example of horizontal composability is how STARTTLS switches
   a non-encrypted protocol into an encrypted version of the same
   protocol.

   While these are very visible cases of composing protocols
   horizontally, the technique is in fact ubiquitous.  For example, most
   protocols are composed from three distinct mini-protocols: protocol
   header (initial handshake), protocol body (sending data back and
   forth) and protocol footer (terminal handshake):

                  +-----------------------------------+
                  | Header |       Body      | Footer |
                  +--------+-----------------+--------+
                  |                TCP                |
                  +-----------------------------------+

4.3.  Application and transport protocols

4.3.1.  Application protocols

   Application protocols live on the top of the network stack.  Rather
   than transferring raw data they are meant to perform a specific
   service for the user.  For example, the DNS protocol provides a name
   resolution service.

   Application protocols don't give the user a way to send or receive
   data.
   They have no standardized API for sending or receiving.  Still, they
   can be initialized, terminated and layered on top of other protocols.
   That being the case, relevant parts of this specification still apply
   to them.

4.3.2.  Presentation protocols

   Presentation protocols add structure to data carried by transport
   protocols (e.g. ASN.1, JSON, XML).  This proposal doesn't address
   them in any way.  Either the protocol sends and receives binary data
   that just happens to be in a structured format, in which case it's a
   standard transport protocol, or the protocol exposes a special API
   to browse the structured data, in which case it should be treated as
   an application protocol.

4.3.3.  Transport protocols

   The term "transport protocol" in this memo has broader scope than
   "OSI L4 protocol".  By "transport protocol" we mean anything capable
   of sending and/or receiving unstructured data, be it TCP, IP or
   Ethernet.

4.4.  Bytestream and message protocols

   Bytestream protocols are transport protocols that define no message
   boundaries.  One peer can send 10 bytes, then 8 bytes.  The other
   peer can read all 18 bytes at once or read 12 bytes first, 6 bytes
   second.  Bytestream protocols are always reliable (no bytes can be
   lost) and ordered (bytes are received in the same order they were
   sent in).  TCP is a classic example of a bytestream protocol.

   Message protocols are transport protocols that preserve message
   boundaries.  While message protocols are not necessarily reliable
   (messages can be lost) or ordered (messages can be received in a
   different order than they were sent in) they are always atomic.  The
   user will receive either a complete message or no message at all.
   IP, UDP and WebSockets are examples of message protocols.

   This memo proposes distinct APIs for bytestream and message
   protocols.  The reason for the design decision is that while the
   APIs for the two are superficially similar there is a large
   difference in semantics, especially when it comes to atomicity and
   error handling.

   As an added benefit, the fact that bytestream protocols are always
   ordered and reliable means that the bytestream API provides stronger
   semantic guarantees than the message API.

   In theory, the two APIs could have been unified by treating each byte
   in a bytestream protocol as a separate message.  In practice,
   however, such API design would prevent batching and would thus result
   in implementations with inferior throughput characteristics.
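
   The difference between the two kinds of protocols can be illustrated
   by a minimal in-memory sketch.  The types and functions below
   (bytestream, msgqueue, bs_send, bs_recv, mq_send, mq_recv) are
   hypothetical names invented for this illustration; they are not part
   of the proposed API and no real network is involved.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory bytestream: one flat buffer, no boundaries. */
struct bytestream {
    unsigned char buf[64];
    size_t head; /* offset of the next byte to read */
    size_t tail; /* offset of the next free byte */
};

/* Append len bytes. Returns 0 on success, -1 if there is no room. */
int bs_send(struct bytestream *s, const void *data, size_t len) {
    assert(s->head <= s->tail && s->tail <= sizeof(s->buf));
    if (len > sizeof(s->buf) - s->tail) return -1;
    memcpy(s->buf + s->tail, data, len);
    s->tail += len;
    return 0;
}

/* Read exactly len bytes, regardless of how they were sent.
   Returns 0 on success, -1 if fewer than len bytes are available. */
int bs_recv(struct bytestream *s, void *data, size_t len) {
    if (len > s->tail - s->head) return -1;
    memcpy(data, s->buf + s->head, len);
    s->head += len;
    return 0;
}

/* Hypothetical in-memory message queue: every slot keeps its size. */
struct msgqueue {
    unsigned char msgs[4][16];
    size_t lens[4];
    size_t head; /* index of the next message to read */
    size_t tail; /* index of the next free slot */
};

/* Enqueue one message. Returns 0 on success, -1 if it does not fit. */
int mq_send(struct msgqueue *q, const void *data, size_t len) {
    if (q->tail == 4 || len > sizeof(q->msgs[0])) return -1;
    memcpy(q->msgs[q->tail], data, len);
    q->lens[q->tail] = len;
    q->tail++;
    return 0;
}

/* Dequeue one whole message. Returns its size, or -1 if the queue is
   empty or the message does not fit into maxlen bytes. A message is
   never split across two calls. */
long mq_recv(struct msgqueue *q, void *data, size_t maxlen) {
    if (q->head == q->tail) return -1;
    size_t len = q->lens[q->head];
    if (len > maxlen) return -1;
    memcpy(data, q->msgs[q->head], len);
    q->head++;
    return (long)len;
}
```

   With this sketch, sending 10 bytes and then 8 bytes through bs_send()
   lets the reader call bs_recv() for 12 bytes and then for 6 bytes,
   whereas the same two payloads passed to mq_send() always come back
   from mq_recv() as one 10-byte message and one 8-byte message.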

4.5.  Connected and unconnected protocols

   From the API point of view the most significant difference between
   connected protocols such as TCP and unconnected protocols such as UDP
   is that the former have initialization and termination handshakes
   while the latter do not.

   This distinction turns out to be critical when contemplating
   composable microprotocols.  If there are 10 connected microprotocols
   in the stack and RTT between two endpoints is 100 ms, connection
   establishment will require at least 10 roundtrips, i.e. one second or
   more.

   Latency of this magnitude is often not acceptable.  Even worse, given
   that RTT is limited by the speed of light, it's not going to get
   better as technology advances.

   Therefore, the only way to get microprotocols with decent performance
   is to make the majority of them unconnected.  In the ideal case there
   should be at most one connected protocol in any networking stack.

   To achieve the goal we'll have to turn even protocols that
   instinctively feel connected into unconnected ones.

   Consider a simple keep-alive protocol.  The peers have to agree on a
   keep-alive interval and thus the initial handshake seems unavoidable.
   However, if the peers have agreed on the keep-alive interval
   beforehand they can just pass the number to the API and no initial
   handshake is required.  We can suddenly think of the protocol as
   unconnected.

   Technically, the users could have just met personally and decided on
   the keep-alive interval.  However, it doesn't have to be that way.
   In most cases there's a connected protocol in the stack somewhere
   beneath the keep-alive protocol.  If that connected protocol allows
   arbitrary data to be bundled with its initial handshake we can use
   it to exchange keep-alive intervals between the peers.  Later on,
   when instantiating the keep-alive protocol we can pass in the
   correct number via the API.

   As for the terminal handshakes, these are critical for horizontal
   composability of protocols.  To be able to start a new protocol on a
   pre-existing connection the old protocol has to be terminated and
   both peers have to agree on where exactly it has ended.  To do that,
   a handshake is needed.

   Assuming there's a connected protocol beneath, the user can tear down
   all the unconnected protocol layers above.  Then they can shut down
   the connected protocol which will put both peers in sync with respect
   to where exactly the protocol has ended.  Afterwards they can open
   new protocols on top of the remaining layers of the stack.

   Each protocol MUST be either connected or unconnected.  A protocol
   MUST NOT try to support both scenarios.

4.6.  Scheduling or rather lack of it

   During the decades since BSD sockets were first introduced the way
   they are used has changed significantly.  While in the beginning the
   user was supposed to fork a new process for each connection and do
   all the work using simple blocking calls, nowadays they are expected
   to keep a pool of connections, check them via functions like poll()
   or kqueue() and dispatch any work to be done to one of the worker
   threads in a thread pool.  In other words, the user is supposed to do
   both network and CPU scheduling.

   This change happened for performance reasons and hasn't improved
   the functionality or usability of the BSD socket API in any way.  On
   the contrary, by requiring every programmer to do a system
   programmer's work it contributed to the proliferation of buggy,
   hard-to-debug and barely maintainable network code.

   To address this problem, this memo assumes that there already exists
   an efficient concurrency implementation where forking a new
   lightweight process takes at most hundreds of nanoseconds and a
   context switch takes tens of nanoseconds.  Note that there are
   already such concurrency systems deployed in the wild.  One
   well-known example is Golang's goroutines but there are others
   available as well.

   In such an environment network programming can be done in the old
   "one process per connection" way, with all the functions exhibiting
   blocking behavior.  There's no need for polling, thread pools,
   callbacks, explicit state machines or similar.

   This memo thus adheres to "let system programmers do system
   programming" maxim and doesn't address the problem of scheduling, be
   it CPU scheduling or network scheduling, at all.

   As a footnote, it should be mentioned that this proposal makes a
   couple of deliberate design choices that prevent the modern
   "schedule by hand" style of network programming.

4.7.  Tx buffering

   Buffering of outbound data and sending them down the stack in batches
   often results in improved performance.  It is perfectly acceptable
   for protocol implementation to do so as long as data is flushed when
   the socket is closed.  The data should also be flushed periodically
   not to induce unbounded latencies when there are no new outbound data
   to fill in the buffer.
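
   A minimal sketch of such a tx buffer follows.  The txbuf structure
   and the tx_send, tx_flush and tx_close functions are hypothetical
   names used only for illustration; the "sent" and "batches" counters
   stand in for the underlying protocol.  The periodic flush is left
   out because it would require a timer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TXBUF_SIZE 8

/* Hypothetical tx buffer. `sent` and `batches` stand in for the
   underlying protocol: they count bytes and batches passed down. */
struct txbuf {
    unsigned char data[TXBUF_SIZE];
    size_t used;
    size_t sent;
    size_t batches;
};

/* Pass all buffered bytes down the stack as a single batch. */
void tx_flush(struct txbuf *b) {
    if (b->used == 0) return;
    b->sent += b->used;
    b->batches++;
    b->used = 0;
}

/* Buffer outbound data; flush whenever the buffer becomes full. */
void tx_send(struct txbuf *b, const void *data, size_t len) {
    const unsigned char *p = data;
    while (len > 0) {
        size_t room = TXBUF_SIZE - b->used;
        size_t n = len < room ? len : room;
        assert(n > 0); /* the buffer is never full on loop entry */
        memcpy(b->data + b->used, p, n);
        b->used += n;
        p += n;
        len -= n;
        if (b->used == TXBUF_SIZE) tx_flush(b);
    }
}

/* Closing the socket flushes whatever is still buffered. */
void tx_close(struct txbuf *b) {
    tx_flush(b);
}
```

   With an 8-byte buffer, sending 3 bytes and then 7 bytes produces one
   full batch of 8 bytes during the second call, and the remaining 2
   bytes are passed down only when tx_close() is called.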

4.8.  Rx buffering

   Buffering of inbound data collides with vertical composability of
   protocols.

   If a protocol reads 1000 bytes of data from the underlying protocol,
   then the protocol above it asks for 700 bytes and closes the socket,
   there's no way to push the remaining 300 bytes back to the underlying
   socket.  Allowing for such an operation would mean that the buffer
   of the underlying socket would have to be virtually unbounded.

   If, on the other hand, the remaining bytes were dropped there would
   be no way to start a new protocol on top of the same underlying
   socket.  The new protocol would miss the initial 300 bytes of data.

   Luckily though, the above reasoning doesn't apply to the bottommost
   protocol in the stack.  Given that there's no underlying protocol to
   start a new protocol on top of, the buffered data can be simply
   dropped.
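
   The 1000/700/300 scenario described above can be reproduced with a
   small sketch.  All names here (source, buffered, src_read, buf_read,
   buf_close) are hypothetical, and the underlying bytestream is
   modelled only by its read position rather than by real data.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical underlying bytestream, modelled by its read position. */
struct source {
    size_t pos;  /* bytes already handed out */
    size_t size; /* total bytes the peer has sent */
};

/* Take up to n bytes from the source; returns the number taken. */
size_t src_read(struct source *s, size_t n) {
    size_t avail = s->size - s->pos;
    if (n > avail) n = avail;
    s->pos += n;
    return n;
}

/* Hypothetical upper layer that reads ahead into its own rx buffer. */
struct buffered {
    struct source *under;
    size_t buffered; /* bytes read ahead but not yet consumed */
};

/* Serve up to n bytes, refilling in chunks of `readahead` bytes. */
size_t buf_read(struct buffered *b, size_t n, size_t readahead) {
    assert(readahead > 0);
    while (b->buffered < n) {
        size_t got = src_read(b->under, readahead);
        if (got == 0) break; /* source exhausted */
        b->buffered += got;
    }
    if (n > b->buffered) n = b->buffered;
    b->buffered -= n;
    return n;
}

/* Close the upper layer; returns the number of bytes that are lost
   because they cannot be pushed back into the underlying source. */
size_t buf_close(struct buffered *b) {
    size_t lost = b->buffered;
    b->buffered = 0;
    return lost;
}
```

   After a 700-byte buf_read() with a 1000-byte read-ahead, the
   underlying position is already at 1000, so buf_close() reports 300
   lost bytes and a protocol started afterwards on the same source
   would begin reading at offset 1000 instead of 700.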

   Also, rx buffering on the lowermost level, where the protocol is
   interfacing with the hardware or with the user/kernel space
   boundary, is likely to provide the largest performance benefits.
   The absence of rx buffering on higher levels, where the performance
   impact of an additional receive operation is basically that of a
   function call, is not likely to incur a huge performance penalty.
   This is even more so given that higher layers of the stack are
   likely to be message-based and thus some amount of batching,
   proportional to the average message size, happens anyway.

4.9.  Socket options

   There's no equivalent to socket options as they are known from the
   BSD socket API.  Any such customization of the network stack is
   supposed to be built by vertically layering the protocols.

5.  The API guidelines

5.1.  Protocol naming conventions

   Whenever possible, the protocol name in the API SHOULD correspond to
   the official name of the protocol, not to the name of the protocol
   implementation.  While this can lead to name clashes the assumption
   is that a single application is not going to use two implementations
   of the same network protocol.  This rule also provides an incentive
   to standardize protocol APIs.

   To make the API less tedious to use, a short protocol name, e.g.
   "ws", SHOULD be preferred to the long name, e.g. "websockets".

   Given that end users prefer to create a full protocol stack using a
   single function it is desirable to provide them with shrinkwrapped
   protocols aggregating many microprotocols into a coherent whole.  For
   example, the "websocket" protocol may be composed of the TCP protocol
   and the WebSocket protocol itself.  Still, expert users may want to
   have access to the WebSocket protocol as such, without the
   underlying TCP protocol, say, if they want to run it on top of SCTP
   or any other alternative transport.  In such cases there is a naming
   dilemma: Should the "websocket" name refer to the TCP+WebSocket
   aggregate or to WebSocket alone?  In these cases the implementors
   SHOULD always prefer the former solution.

5.2.  Function naming conventions

   The function names SHOULD be in lowercase and SHOULD be composed of
   short protocol name and action name separated by underscore (e.g.
   "tcp_connect").  Of course, in languages other than C the native
   naming convention should be followed, but even then the name SHOULD
   contain both short protocol name and action name.

5.3.  File descriptors

   One of the design goals of this API is to support both kernel space
   and user space implementations.  One problem with that is that kernel
   space objects are typically referred to by file descriptors while
   POSIX provides no easy way to associate user space objects with file
   descriptors.

   Therefore, this specification allows user space implementations to
   use fake file descriptors (simple integers that kernel space knows
   nothing about) and does not guarantee that system functions will work
   with those descriptors.

   For example, you cannot count on the POSIX close() function to be
   able to close a socket.  Therefore, the hclose() function is
   introduced, which maps directly to close() in kernel-space
   implementations but can be overridden by a custom implementation in
   user-space implementations.

   Whenever a function acts on a file descriptor, the descriptor SHOULD
   be passed to the function as its first argument.

5.4.  Deadlines

   Unlike with BSD sockets the deadlines are points in time rather than
   intervals.  This allows the same deadline to be used in multiple
   calls without the need to recompute the remaining interval:

       int64_t deadline = now() + 1000;
       bsend(h, "ABC", 3, deadline);
       bsend(h, "DEF", 3, deadline);

   All possibly blocking functions MUST accept a deadline.  The deadline
   SHOULD be passed to the function as its last argument.
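
   The snippet above relies on a now() function that this memo does not
   define.  The following is a minimal sketch of one possible
   definition, assuming POSIX clock_gettime() and assuming that
   deadlines are expressed in milliseconds.  The remaining_ms() helper
   is hypothetical; it shows how an implementation might convert a
   deadline into the interval expected by interval-based functions such
   as poll().  Treating a negative deadline as "no deadline" is also an
   assumption, suggested by the -1 values used in the examples in this
   memo.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <time.h>

/* Current time in milliseconds on the monotonic clock.
   Assumption: deadlines are expressed in milliseconds. */
int64_t now(void) {
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Hypothetical helper: convert a deadline (a point in time) into a
   remaining interval in milliseconds. Returns -1 for "no deadline"
   (assumed to be encoded as a negative value) and 0 if the deadline
   has already expired. */
int remaining_ms(int64_t deadline, int64_t current) {
    if (deadline < 0) return -1;
    if (deadline <= current) return 0;
    int64_t left = deadline - current;
    return left > INT_MAX ? INT_MAX : (int)left;
}
```

   Because the deadline itself never changes, each blocking call only
   has to evaluate remaining_ms(deadline, now()) at the moment it is
   about to wait; the caller does not have to track elapsed time.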

5.5.  Protocol initialization

   A protocol SHOULD be initialized using a protocol-specific "start"
   function (e.g. "smtp_start").  If the protocol runs on top of another
   protocol the file descriptor of the underlying protocol SHOULD be
   supplied as the first argument of the function.  The function MAY
   have an arbitrary number of additional arguments.

   The function SHOULD return the file descriptor of the newly created
   protocol instance.  In case of error it SHOULD close the underlying
   protocol, return -1 and set errno to the appropriate error.

   Some protocols require more complex setup.  Consider TCP's
   listen/connect/accept connection setup process.  These protocols
   SHOULD use a custom set of functions rather than try to shoehorn all
   the functionality into a single all-purpose "start" function.

   If a protocol runs on top of an underlying protocol it takes
   ownership of that protocol.  Using the low level protocol while it
   is owned by a higher level protocol will result in undefined
   behaviour.  A sane way to implement this behaviour is to create a
   duplicate of the underlying file descriptor to be owned by the
   parent protocol and to close the original file descriptor.  That
   way, a user accidentally
   using a lower level protocol will get an EBADF error.
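
   For kernel-space descriptors the duplicate-and-close technique can
   be sketched with the POSIX dup() and close() functions.  The
   take_ownership() helper below is a hypothetical name used for
   illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: move ownership of `under` to a fresh
   descriptor. Returns the new descriptor, or -1 with errno set.
   The original descriptor is closed in both cases. */
int take_ownership(int under) {
    assert(under >= 0);
    int owned = dup(under);
    if (owned < 0) {
        int err = errno; /* close() may overwrite errno */
        close(under);
        errno = err;
        return -1;
    }
    close(under);
    return owned;
}
```

   Note that the protection is best-effort: once the original
   descriptor number is closed, a later open() or dup() call elsewhere
   in the process may reuse the same number, in which case stale use of
   it would no longer fail with EBADF.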

   If protocol requires an initial handshake it MUST be performed in
   this phase of the socket lifecycle.

   Example of creating a stack of four protocols:

       int s1 = tcp_connect("192.168.0.111:5555", -1);
       int s2 = foo_start(s1, arg1, arg2, arg3);
       int s3 = bar_start(s2);
       int s4 = baz_start(s3, arg4, arg5);

5.6.  Protocol termination

   There are several types of termination that will be discussed in
   the following sections:

   o  Forceful termination means that the user wants to shut down the
      socket abruptly without even letting the peer know.  Forceful
      termination is always a non-blocking operation.

   o  Half-close termination means that the outbound half of the
      connection is closed by the user and the terminal handshake, if
      supported by the protocol, is initiated.  However, the user is
      still able to receive data from the peer.

   o  Orderly termination means that terminal handshake with the peer,
      if required by the protocol, is performed.  Orderly termination
      leaves both peers with a consistent view of the world.

5.6.1.  Forceful termination

   To perform forceful termination the protocol descriptor is closed
   using the hclose() function.  In kernel-space implementations this
   function maps directly to the standard POSIX close() function.  The
   protocol MUST shut down immediately without trying to do a
   termination handshake.  Note that this is different from how classic
   BSD sockets behave.

   The protocol MUST also clean up all resources it owns including
   closing the underlying protocol.  Given that the underlying protocol
   does the same operation, an entire stack of protocols can be shut
   down recursively by closing the file descriptor of the topmost
   protocol:

       int h1 = foo_start();
       int h2 = bar_start(h1);
       int h3 = baz_start(h2);
       hclose(h3); /* baz, bar and foo are shut down */

   In case of success hclose() returns zero.  In case of error it
   returns -1 and sets errno to appropriate value.
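
   One way a user-space implementation might provide hclose() on top of
   fake file descriptors is sketched below.  This is an illustration
   under stated assumptions, not the definitive implementation: the
   handle table, its fixed size, and the generic proto_start() function
   are hypothetical, and real protocols would also release their own
   resources when closed.

```c
#include <assert.h>
#include <errno.h>

#define MAX_HANDLES 16

/* Hypothetical user-space handle table. Indices act as the fake file
   descriptors; the kernel knows nothing about them. */
struct handle {
    int used;  /* nonzero while the handle is open */
    int under; /* handle of the underlying protocol, or -1 */
};

static struct handle handles[MAX_HANDLES];

/* Hypothetical generic "start": allocate a handle on top of `under`.
   Pass -1 for a protocol at the bottom of the stack. */
int proto_start(int under) {
    if (under >= MAX_HANDLES || (under >= 0 && !handles[under].used)) {
        errno = EBADF;
        return -1;
    }
    for (int h = 0; h < MAX_HANDLES; h++) {
        if (!handles[h].used) {
            handles[h].used = 1;
            handles[h].under = under;
            return h;
        }
    }
    errno = EMFILE;
    return -1;
}

/* Forcefully close a protocol and, recursively, everything beneath
   it. Returns 0 on success, -1 with errno set on error. */
int hclose(int h) {
    if (h < 0 || h >= MAX_HANDLES || !handles[h].used) {
        errno = EBADF;
        return -1;
    }
    int under = handles[h].under;
    handles[h].used = 0;
    assert(under != h); /* a protocol never sits on top of itself */
    if (under >= 0) return hclose(under);
    return 0;
}
```

   Closing the topmost handle of a three-layer stack releases all three
   handles; a subsequent hclose() on any of the lower handles fails
   with EBADF.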


5.6.2.  Half-close termination

   The primary use case for half-closing a connection is when the user
   wants to close a connection, yet still wants to receive all the data
   sent by the peer prior to the termination.

   To do so, the hdone() function is used.  It is roughly equivalent to
   the POSIX shutdown() function called with the SHUT_WR argument.

   hdone() first of all flushes any buffered outbound data.  What
   happens next depends on whether the protocol is connected or
   unconnected.

   For unconnected protocols the implementation MUST forward the call to
   the underlying socket.  If the protocol is at the bottom of the stack
   and there is no underlying socket it MUST return ENOTSUP error.

   For connected protocols the implementation MUST start termination
   handshake and return to the caller without waiting for the answer
   from the peer.

   After hdone() is called, any further calls to hdone() or attempts to
   send more data MUST result in an EPIPE error.

   However, the user is still able to receive more data from the socket.

   The following piece of code shows typical usage of hdone().  It
   half-closes the connection, receives any pending messages from the
   peer and finally closes the socket:

       hdone(s);
       while(1) {
           int rc = mrecv(s, &msg, sizeof(msg), -1);
           /* EPIPE means the peer is done; any other error also ends
              the loop so that invalid data is never processed */
           if(rc < 0) break;
           process_msg(&msg);
       }
       hclose(s);

   hdone() function returns 0 on success.  In case of error the function
   MUST forcibly close the underlying protocol (and thus recursively all
   protocols beneath it), return -1 and set errno to the appropriate
   value.
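
   The state rules of half-close can be summarized in a minimal sketch
   of a connected protocol.  The halfsock structure and the half_done,
   half_send and half_recv functions are hypothetical names; the actual
   termination handshake and the closing of the underlying protocol on
   error are left out.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical connected protocol instance, reduced to the state that
   matters for half-close. */
struct halfsock {
    int done;    /* nonzero once the outbound half was closed */
    int inbound; /* messages from the peer not yet received */
};

/* Close the outbound half. A second call fails with EPIPE. */
int half_done(struct halfsock *s) {
    if (s->done) { errno = EPIPE; return -1; }
    s->done = 1;
    return 0;
}

/* Sending is refused with EPIPE once the outbound half is closed. */
int half_send(struct halfsock *s) {
    if (s->done) { errno = EPIPE; return -1; }
    return 0;
}

/* Receiving keeps working after half-close. Assumption, following the
   usage example above: EPIPE signals that the peer has no more data. */
int half_recv(struct halfsock *s) {
    assert(s->inbound >= 0);
    if (s->inbound == 0) { errno = EPIPE; return -1; }
    s->inbound--;
    return 0;
}
```

   In this sketch, a socket with two pending inbound messages still
   delivers both of them after half_done(), while half_send() and a
   repeated half_done() fail with EPIPE.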

5.6.3.  Orderly termination

   To perform an orderly shut down there SHOULD be a protocol-specific
   function called "stop" (e.g. "smtp_stop").

   In addition to the file descriptor the function can have an arbitrary
   number of other arguments.  For example, one such argument may be a
   "shutdown reason" string to be sent to the peer.  However, it is
   RECOMMENDED to avoid such additional arguments in newly designed
   protocols.  The reason is that such arguments cannot be passed to
   hdone() function, making half-close termination functionally inferior
   to orderly termination.

   If the shut down functionality is potentially blocking the last
   argument of the function SHOULD be a deadline.

   First thing hclose() should do is to check whether hdone() was called
   by the user beforehand and if not so, to invoke it itself.

   If hdone() fails with an error other than ENOTSUP, the socket MUST be
   torn down and the error MUST be forwarded to the caller.

   If hdone() fails with ENOTSUP, the implementation MUST simply proceed
   further without reading any messages from the peer.

   If hdone() succeeds the implementation must read and drop any pending
   messages from the peer.  Note that there is a possible DoS attack
   here: the peer can send an infinite number of messages.  Therefore,
   the implementation MUST observe the deadline and tear down the socket
   in case it runs out of time.  An ETIMEDOUT error is then returned to
   the user.
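
   The steps above can be sketched as follows.  Everything named
   sketch_* is a hypothetical in-memory stand-in, not part of the API:
   the "socket" is a structure with a counter of pending messages and a
   fake clock that advances by one tick per received message, so the
   deadline logic can be exercised without a network.  Tearing down the
   socket on failure is only indicated by the return value:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical in-memory stand-in for a message socket. */
struct sketch_sock {
    int hdone_called;   /* the user has already half-closed */
    int hdone_errno;    /* 0: half-close succeeds, else errno to fail with */
    int pending;        /* messages the peer sent before finishing */
    int64_t clock;      /* fake clock, one tick per received message */
};

int sketch_hdone(struct sketch_sock *s) {
    if(s->hdone_errno) {errno = s->hdone_errno; return -1;}
    s->hdone_called = 1;
    return 0;
}

/* Returns the message size, or -1 with errno set to ETIMEDOUT
   (deadline expired) or EPIPE (peer has nothing more to send). */
ssize_t sketch_mrecv(struct sketch_sock *s, int64_t deadline) {
    if(deadline >= 0 && s->clock >= deadline) {
        errno = ETIMEDOUT;
        return -1;
    }
    if(s->pending == 0) {errno = EPIPE; return -1;}
    s->pending--;
    s->clock++;
    return 0;
}

/* Orderly shutdown.  Returns 0 on success, -1 with errno on failure. */
int sketch_stop(struct sketch_sock *s, int64_t deadline) {
    /* Step 1: half-close unless the user has already done so. */
    if(!s->hdone_called && sketch_hdone(s) < 0) {
        /* ENOTSUP: proceed without draining.  Other errors are
           forwarded to the caller. */
        return errno == ENOTSUP ? 0 : -1;
    }
    /* Step 2: read and drop pending messages until the peer is done. */
    for(;;) {
        if(sketch_mrecv(s, deadline) >= 0) continue;
        /* EPIPE is the normal end.  ETIMEDOUT and anything else
           mean the shutdown has failed. */
        return errno == EPIPE ? 0 : -1;
    }
}
```

   The deadline is what bounds the draining loop: a peer that keeps
   sending messages causes ETIMEDOUT rather than an endless loop.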

   At this point the implementation should deallocate the socket.
   However, it SHOULD NOT close the underlying protocol.  Instead it
   SHOULD return its file descriptor to the user.  This is crucial for
   horizontal composability of the protocols:

       /* create stack of two protocols */
       h1 = foo_start();
       h2 = bar_start(h1);
       /* top protocol is closed but bottom one is still alive */
       h1 = bar_stop(h2, -1);
       /* new top protocol is started */
       h3 = baz_start(h1);
       /* shut down both protocols */
       h1 = baz_stop(h3, -1);
       foo_stop(h1, -1);

   If the protocol lives at the very bottom of the stack and has no
   underlying protocol, the "stop" function MUST return 0 on success.


   In the case of error the stop function MUST forcibly close the
   underlying protocol (and thus recursively all protocols beneath it),
   return -1 and set errno to the appropriate value.

   Note that this design of orderly termination does away with BSD
   socket SO_LINGER behaviour, which is problematic as it cannot be
   implemented cleanly in user space.

5.7.  Normal operation

   Everything that happens between protocol initialization and protocol
   termination will be referred to as "normal operation".

   As already mentioned, application protocols can't send or receive
   data.  Trying to invoke any of the functions below on an application
   protocol MUST result in ENOTSUP error.

   Transport protocols are either bytestream protocols or message
   protocols.

5.7.1.  Bytestream protocols

   Bytestream protocols can be used via the following four functions:

       int bsend(int h, const void *buf, size_t len,
           int64_t deadline);
       int brecv(int h, void *buf, size_t len,
           int64_t deadline);
       int bsendv(int h, const struct iovec *iov, size_t iovlen,
           int64_t deadline);
       int brecvv(int h, const struct iovec *iov, size_t iovlen,
           int64_t deadline);

   Function bsend() sends data to the protocol.  The protocol SHOULD
   send them, after whatever manipulation is required, to its underlying
   protocol.  Eventually, the bottommost protocol in the stack sends the
   data to the network.

   Function brecv() reads data from the protocol.  The protocol SHOULD
   read them from its underlying protocol and, after whatever
   manipulation is required, return them to the caller.  The bottommost
   protocol in the stack reads the data from the network.

   All the functions above MUST be blocking and exhibit atomic
   behaviour, i.e. either all data are sent/received or none of them
   are.  In the latter case the protocol MUST be marked as broken, errno
   MUST be set to the appropriate value and -1 MUST be returned to the
   user.  Any subsequent attempt to use the protocol MUST result in an
   error.


   An expired deadline is considered to be an error: the protocol MUST
   behave as described above and set errno to ETIMEDOUT.

   In case of success all the functions MUST return zero.
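
   To make the all-or-nothing contract concrete, here is a minimal
   sketch of how a bottommost protocol might implement a brecv()-like
   function on top of a POSIX file descriptor.  The names sketch_now()
   and sketch_brecv() are hypothetical.  The sketch assumes that the
   deadline is an absolute time in milliseconds on a monotonic clock and
   that -1 means "no deadline", as the examples in this memo suggest.
   Marking the socket as broken after a failure is left out:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <poll.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Current time in milliseconds (unit and clock are assumptions). */
int64_t sketch_now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Read exactly len bytes from fd.  Returns 0 on success, -1 with
   errno set otherwise.  The caller never sees a short read. */
int sketch_brecv(int fd, void *buf, size_t len, int64_t deadline) {
    char *pos = buf;
    while(len > 0) {
        /* Turn the absolute deadline into a relative poll() timeout. */
        int timeout = -1;
        if(deadline >= 0) {
            int64_t left = deadline - sketch_now();
            if(left < 0) left = 0;
            timeout = left > INT_MAX ? INT_MAX : (int)left;
        }
        struct pollfd pfd = {.fd = fd, .events = POLLIN};
        int rc = poll(&pfd, 1, timeout);
        if(rc < 0 && errno == EINTR) continue;
        if(rc < 0) return -1;
        if(rc == 0) {errno = ETIMEDOUT; return -1;}
        ssize_t n = read(fd, pos, len);
        if(n < 0 && (errno == EINTR || errno == EAGAIN)) continue;
        if(n < 0) return -1;
        if(n == 0) {errno = EPIPE; return -1;}   /* peer closed */
        pos += n;
        len -= (size_t)n;
    }
    return 0;
}
```

   Because the deadline is absolute, the loop can retry after a partial
   read without the caller having to recompute anything.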

   Functions bsendv() and brecvv() MUST behave in the same way as
   bsend() and brecv(), the only difference being that buffers are
   passed to the functions via scatter/gather arrays, in the same way as
   in the POSIX sendmsg() and recvmsg() functions.
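
   As an illustration of the gather variant, the sketch below sends the
   elements of an iovec array one after another, which has the same
   effect on the wire as sending the concatenated buffers.  The names
   sketch_write_all() and sketch_bsendv() are hypothetical, the handle
   is assumed to be a plain POSIX file descriptor and deadline handling
   is omitted for brevity:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Write all len bytes to fd.  Returns 0 on success, -1 on error. */
int sketch_write_all(int fd, const void *buf, size_t len) {
    const char *pos = buf;
    while(len > 0) {
        ssize_t n = write(fd, pos, len);
        if(n < 0 && errno == EINTR) continue;
        if(n < 0) return -1;
        pos += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Gather variant: walk the array and send each buffer in order. */
int sketch_bsendv(int fd, const struct iovec *iov, size_t iovlen) {
    for(size_t i = 0; i != iovlen; ++i) {
        if(sketch_write_all(fd, iov[i].iov_base, iov[i].iov_len) < 0)
            return -1;
    }
    return 0;
}
```

   A production implementation would more likely hand the whole array
   to writev() or sendmsg() and deal with partial writes there.  The
   loop above merely keeps the sketch short.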

   Note that the implementations of brecv() and brecvv() MAY change the
   content of the buffer supplied to the function even in the case of
   error.  However, what exactly will be written into the buffer is
   unpredictable and using such data will result in undefined behaviour.

5.7.2.  Message protocols

   Message protocols can be used via the following four functions:

       int msend(int h, const void *buf, size_t len,
           int64_t deadline);
       ssize_t mrecv(int h, void *buf, size_t len,
           int64_t deadline);
       int msendv(int h, const struct iovec *iov, size_t iovlen,
           int64_t deadline);
       ssize_t mrecvv(int h, const struct iovec *iov, size_t iovlen,
           int64_t deadline);

   Function msend() sends a message to the protocol.  The protocol
   SHOULD send it, after whatever manipulation is required, to its
   underlying protocol.  Eventually, the lowermost protocol in the stack
   sends the data to the network.

   Function mrecv() reads a message from the protocol.  The protocol
   SHOULD read it from its underlying protocol and, after whatever
   manipulation is needed, return it to the caller.  The lowermost
   protocol in the stack reads the data from the network.

   All the functions MUST be blocking and exhibit atomic behaviour, i.e.
   either the entire message is sent/received or none of it is.  In the
   latter case errno MUST be set to the appropriate value and -1 MUST be
   returned to the user.  The protocol may be recoverable, in which case
   receiving the next message after an error is possible.  It can also
   be non-recoverable, in which case the protocol MUST be marked as
   broken and any subsequent attempt to use it MUST result in an error.


   Note that unlike with bytestream sockets the buffer supplied to
   mrecv() doesn't have to be fully filled in, i.e. received messages
   may be smaller than the buffer.

   If the message is larger than the buffer, it is considered to be an
   error and the protocol must return -1 and set errno to EMSGSIZE.  If
   there's no way to discard the unread part of the message in constant
   time it SHOULD also mark the protocol as broken and refuse any
   further operations.  This behaviour prevents DoS attacks that rely on
   sending very large messages.

   An expired deadline is considered to be an error and the protocol
   MUST fail with an ETIMEDOUT error.

   In case of success msend() MUST return zero and mrecv() MUST return
   the size of the received message, zero being a valid size.
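
   The return-value rules for receiving can be illustrated with a
   minimal sketch.  The structure sketch_msock and the function
   sketch_mrecv() are hypothetical: the "socket" holds at most one
   pending message in memory, there is no deadline, and an empty slot
   is reported as EPIPE instead of blocking:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical one-slot message "socket", for illustration only. */
struct sketch_msock {
    const char *msg;   /* next pending message, NULL if there is none */
    size_t msglen;     /* its size in bytes, zero is a valid size */
};

/* Receive one whole message.  Returns its size or -1 with errno. */
ssize_t sketch_mrecv(struct sketch_msock *s, void *buf, size_t len) {
    if(!s->msg) {errno = EPIPE; return -1;}
    if(s->msglen > len) {
        /* The message does not fit: never deliver a part of it.
           Discarding takes constant time here, so this stand-in
           stays usable after the error. */
        s->msg = NULL;
        errno = EMSGSIZE;
        return -1;
    }
    /* The message may be smaller than the buffer; only msglen
       bytes of buf are filled in. */
    memcpy(buf, s->msg, s->msglen);
    s->msg = NULL;
    return (ssize_t)s->msglen;
}
```

   Note that a return value of zero means "an empty message was
   received" and must not be confused with end of stream, which is
   signalled through an error.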

   Functions msendv() and mrecvv() MUST behave in the same way as
   msend() and mrecv().  The only difference is that buffers are passed
   to the functions via scatter/gather arrays, in the same way as in the
   POSIX sendmsg() and recvmsg() functions.

   Note that the implementations of mrecv() and mrecvv() MAY change the
   content of the buffer supplied to the function even in the case of
   error.  However, what exactly will be written into the buffer is
   unpredictable and using such data will result in undefined behaviour.

5.7.3.  Custom sending and receiving functions

   In addition to send/recv functions described above, protocols MAY
   implement their own custom send/recv functions.  These functions
   should be called "send" and/or "recv" (e.g.  "udp_send").

   Custom functions allow for providing additional arguments.  For
   example, a UDP protocol may implement a custom send function with an
   additional "destination IP address" argument.

   A protocol MAY implement multiple send or receive functions as
   needed.

   Protocol implementors should try to make custom send/recv functions
   as consistent with standard send/recv as possible.

   Standard send/recv functions SHOULD fill in arguments otherwise
   provided in custom send/recv by sensible defaults.  It MAY be
   possible to set those defaults via "start" function.
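
   The relationship between custom and standard functions might look
   like the sketch below.  The protocol "foo" and all of its functions
   are hypothetical.  For brevity the sketch uses a structure pointer
   in place of a file descriptor, omits deadlines and only records the
   destination instead of sending anything.  Returning ENOTSUP when no
   default destination was configured is also an assumption:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical datagram protocol "foo"; names are illustrative. */
struct foo_sock {
    char default_dst[64];   /* set by foo_start(), "" if none */
    char last_dst[64];      /* where the last message was sent */
};

/* "start" function: optionally configures a default destination. */
void foo_start(struct foo_sock *s, const char *default_dst) {
    memset(s, 0, sizeof(*s));
    if(default_dst)
        snprintf(s->default_dst, sizeof(s->default_dst), "%s",
            default_dst);
}

/* Custom send: a standard send plus an explicit destination. */
int foo_send(struct foo_sock *s, const char *dst, const void *buf,
      size_t len) {
    (void)buf;   /* the payload would go to the network here */
    (void)len;
    snprintf(s->last_dst, sizeof(s->last_dst), "%s", dst);
    return 0;
}

/* Standard send: fills in the destination with the default. */
int foo_msend(struct foo_sock *s, const void *buf, size_t len) {
    if(!s->default_dst[0]) {errno = ENOTSUP; return -1;}
    return foo_send(s, s->default_dst, buf, len);
}
```

   Implementing the standard function as a thin wrapper around the
   custom one keeps the two consistent with each other by construction.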


5.7.4.  Error codes

   Send and receive functions may return the following error codes:

   o  EBADF: Bad file descriptor.

   o  ECONNRESET: Connection broken.  For example, a failure to receive
      a keepalive from the peer may result in this error.

   o  EMSGSIZE: Message is too large to fit into the supplied buffer.
      Applies only to mrecv() and mrecvv().

   o  ENOTSUP: The socket does not support the function.  For example,
      msend() was called on a bytestream socket or mrecv() was called on
      a send-only socket.

   o  EPIPE: The peer has closed the connection.

   o  EPROTO: The peer has violated the protocol specification.

   o  ETIMEDOUT: Deadline expired.

   As already mentioned, some protocols MAY treat errors as
   unrecoverable.  In these cases any subsequent operation on the socket
   MUST return the same error.
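
   One simple way to get this "sticky" behaviour is to remember the
   first unrecoverable error in the socket state and to check it at the
   start of every operation.  The sketch below is hypothetical: the
   structure, the function names and the "fail_with" argument, which
   merely simulates a failure of the underlying transport, are all
   invented for illustration:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical socket state with a sticky error. */
struct sketch_sock {
    int err;   /* 0 while healthy, else the errno that broke the socket */
};

/* Mark the socket as broken and report the error to the caller. */
int sketch_break(struct sketch_sock *s, int err) {
    s->err = err;
    errno = err;
    return -1;
}

/* Send stand-in.  A non-zero fail_with simulates a transport failure
   with that errno.  Zero means the send succeeds. */
int sketch_bsend(struct sketch_sock *s, const void *buf, size_t len,
      int fail_with) {
    (void)buf;
    (void)len;
    /* A broken socket keeps returning the original error. */
    if(s->err) {errno = s->err; return -1;}
    if(fail_with) return sketch_break(s, fail_with);
    return 0;
}
```

   Every other operation on the socket would start with the same check
   of the err field.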

   The implementation SHOULD NOT go to great lengths to implement
   recoverable errors.  Instead, it should stick to the most natural
   semantics for the protocol.  For example, EMSGSIZE may seem to be a
   recoverable error.  However, the implementation may have to allocate
   an arbitrary amount of memory to temporarily store the already
   received part of the message, which could in turn enable DoS attacks
   by sending large messages.  It may thus be preferable to keep these
   errors unrecoverable.

6.  IANA Considerations

   This memo includes no request to IANA.

7.  Security Considerations

   Network APIs can facilitate DoS attacks by allowing for unlimited
   buffer sizes and for infinite deadlines.

   This proposal avoids the first issue by requiring the user to
   allocate all the buffers.  It addresses the second problem by always
   making the deadline explicit.  Also, by not requiring recomputation
   of timeout intervals it makes the deadlines easy to use.  The user
   should take advantage of that and set a reasonable deadline for every
   network operation.
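
   The difference between deadlines and timeouts can be made concrete
   with a small sketch.  The helpers sketch_now() and sketch_timeout()
   are hypothetical.  They assume that a deadline is an absolute time
   in milliseconds on a monotonic clock and that a negative value means
   "no deadline".  The conversion to a relative timeout is something an
   implementation does internally, so the user never has to:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <time.h>

/* Milliseconds on a monotonic clock (unit and clock are assumptions). */
int64_t sketch_now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Convert an absolute deadline into a poll()-style relative timeout.
   A negative deadline maps to -1, i.e. "wait forever". */
int sketch_timeout(int64_t deadline, int64_t now) {
    if(deadline < 0) return -1;
    if(deadline <= now) return 0;
    int64_t left = deadline - now;
    return left > INT_MAX ? INT_MAX : (int)left;
}

/* Intended usage with the functions from this memo: compute the
   deadline once and pass the same value to every call, e.g.

       int64_t deadline = sketch_now() + 1000;
       bsend(h, request, request_len, deadline);
       brecv(h, reply, reply_len, deadline);

   With relative timeouts the caller would have to subtract the time
   already spent before each call. */
```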

   Other than that, the security implications of the new API don't
   differ from security implications of classic BSD socket API.  Still,
   it may be worth passing the design through a security audit.

Author's Address

   Martin Sustrik (editor)

   Email: sustrik@250bpm.com

