(DRAFT) SCORPION PROTOCOL/FILE-FORMAT

Table of contents: Protocol Status codes Detailed status codes Client certificates Receive subprotocol Send subprotocol Interactive subprotocol Hashed URI scheme Document file format Extended link attributes Metadata blocks Data/text sub-blocks Languages Automated crawling Conversion Unordered Labels File Identification X.509 extensions Favicon formats Superseding certificates Recommendations and other notes Dynamic files Security issues Missing details FAQ

=== Protocol ===

The protocol can be TLS or non-TLS. Implementation of TLS is optional but is recommended; implementation of non-TLS is mandatory. The same port number can be used for TLS and non-TLS; the two are distinguished by the first byte that the client sends: 0x16 means TLS, and any other value means non-TLS (in which case that byte is the subprotocol byte). The default port number (for both TLS and non-TLS) is 1517.

Unless otherwise specified, the text in the protocol is ASCII. (This only applies to the protocol; the contents of files are not limited to ASCII.)

A request consists of the subprotocol byte, then the optional subprotocol parameter, then a space, then the absolute URL (the scheme is mandatory), and then a carriage return and a line feed. (See the sections below for the definitions of the send and receive subprotocols. Subprotocols are similar to the different methods of HTTP, such as GET and PUT.)

The URL is supposed to have a slash after the host name (or after the port number if present); the client MUST NOT send a URL in which the host name or port number is immediately followed by # or ? or a carriage return. If it does so anyway, the server MUST either treat it as though the slash is present or issue a redirect to / or to the full URL with / added on the end or to a URL that those URLs would redirect to; this SHOULD be a permanent redirect. (The client should also add / if it is missing according to the above in the case of a link or a redirect to such a URL.)
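As an illustration of the request format above, here is a minimal Python sketch of how a client might build the request line and how a server might tell TLS from non-TLS. The names `looks_like_tls` and `build_request` are hypothetical and not part of this specification; the sketch also leaves out the fragment part of the URL, since that part is only for the client.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORT = 1517  # default port for both TLS and non-TLS


def looks_like_tls(first_byte: int) -> bool:
    """Server side: 0x16 begins a TLS handshake; anything else is a subprotocol byte."""
    return first_byte == 0x16


def build_request(url: str, subprotocol: str = "R", parameter: str = "") -> bytes:
    """Build one request line: subprotocol, parameter, space, URL, CR LF."""
    if len(subprotocol) != 1:
        raise ValueError("the subprotocol is a single byte")
    if " " in parameter:
        raise ValueError("the subprotocol parameter cannot contain a space")
    parts = urlsplit(url)
    if not parts.scheme or not parts.netloc:
        raise ValueError("an absolute URL with a scheme and host is required")
    # Add the slash after the host name (or port) if it is missing, and
    # leave out the fragment, which is only meaningful to the client.
    path = parts.path or "/"
    wire_url = urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))
    # The protocol text is ASCII; encode() raises if the URL is not.
    return f"{subprotocol}{parameter} {wire_url}\r\n".encode("ascii")
```

For example, `build_request("scorpion://example.org")` yields the bytes `R scorpion://example.org/` followed by CR LF. A real client would also percent-encode non-ASCII characters before this step; that is omitted here.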
The server then sends the status line, which consists of a two-digit status code, followed by a space and then the parameters (which may be empty), and then another carriage return and line feed. Parameters are separated by spaces, although the last parameter may include spaces (the client will know how many parameters according to the major status code). The URL may include a username/password (the usual format of username and password in a URL is used); if so, the client should not display the password (unless the user has enabled an option to tell it to not hide passwords). The client software should also include a command to discard any existing username/password, in order to log out. However, if the URL contains # then the client should not include the # and what comes afterward, since that part is only for the client. If the host name of the URL does not match any host name that this server serves, or if the scheme is neither "scorpion:" nor "scorpions:", then it is a proxy request (which the server may refuse if it wishes, due to whatever criteria they want). Proxies should normally avoid converting the files into a different format. (Note that a proxy to another service on the same server might be implemented without needing to make another connection, although this is not mandatory and not something that the client needs to worry about.) If it is not a proxy request, then the TLS and non-TLS variants of the same protocol URI scheme should be treated as equivalent by the server (but not by the client, which MUST treat them differently). (Proxied requests will treat the scheme in the request like the client does.) The recommended use of SNI is: * Clients SHOULD use SNI when connecting to the server, and MAY have an option to disable SNI (in order to mitigate some types of spying). * Servers SHOULD NOT require SNI, and SHOULD ignore any provided SNI and use the host name in the request instead. (Exceptions are possible, e.g. 
if it needs to use a different protocol with the same port number and IP address for some reason, or if a proxy server is used to forward requests to another server without needing the proxy to handle encryption, etc.) * If possible, clients SHOULD allow to use the system's DNS services to implement encrypted Client Hello; if implemented, there MUST be an option to disable this feature (although, depending on the implementation, this option might be a part of a different component of the operating system). * A server is not required to present a valid certificate if an incorrect SNI (or no SNI) is provided by the client. Clients that wish to verify the server's certificate should avoid using incorrect SNI. === Status codes === The first digit is the major code, and the second digit is the minor code. Clients may ignore the minor code. 0x = Interactive mode; used only for the "I" subprotocol. After this is sent to the client then arbitrary two-way communication is possible. The parameter is optional and is the capability codes if it is not blank. 1x = Requires input; the parameter is the prompt text, which it should display to the user, and then the user enters any text, and the client redirects to the same as the current URL but with ? added and then the text entered by the user, which should be percent-encoded if necessary. If the existing URL already has a query string, then the result is unspecified (it is probably best to delete the existing query string, due to how relative URLs are working); servers MUST NOT use this code in that case. 2x = The response is OK and the contents of the file will follow; after this line is the data of the file. The parameters are: * The file size in decimal notation, or ? if the size is unknown (e.g. if it is a dynamic file). * The file type/format. MIME has some problems and ULFI is better, but for now, for compatibility you can use MIME; however, spaces are not allowed. 
If the "charset" parameter is not included then the default according to the file format should be used; "us-ascii" is a recommended default if it is not otherwise known. (If there is a possibility of the character set being unknown, servers SHOULD explicitly specify it; this is unnecessary if the file contains only ASCII characters.)
* Optionally, the version, which if present must consist only of uppercase and lowercase letters, digits, forward slashes, and plus signs. (Clients and servers are not required to verify that the version code uses this restricted character set.)

3x = Redirect. The parameter is another URL which the client should redirect to; the redirect can be temporary or permanent. If the original URL contains a fragment part and the target URL does not, then the client SHOULD add the fragment part of the original URL to the target URL, too (unless the client somehow knows that the fragment part is not useful). If the number of consecutive redirects exceeds the limit (which MUST NOT be more than five by default, although it may be configurable by the user), then the client MUST NOT automatically follow any further redirects.

4x = Temporary error. It has two parameters:
* The number of seconds which is recommended to wait before trying again; it must be an unsigned 31-bit number (encoded as decimal in ASCII, not as binary), or it can be ? if the server is unable to estimate the required time before trying again. (TODO: Possibly remove this; maybe it is not useful.) (An implementation may have a maximum amount of time that it is willing to wait; e.g. one week with a random variation of up to 24 hours either way. Furthermore, if this error is received multiple times in succession, a client should increase the amount of waiting time.)
* The error message text (optional).

5x = Permanent error. The parameter is an error message. The request should not be repeated by automatic tasks, unless such tasks are manually reset by the operator of the computer that controls such tasks.

6x = Client certificate required. The parameters are listed below. The section below about client certificates has more details.
* The first parameter specifies what URLs the certificate applies to; see the section below about client certificates. This is only a hint and is not a requirement; clients MAY ignore it, and MUST allow the user to override this specification with their own.
* The second parameter is arbitrary text which should be displayed to the user, to explain what kind of client certificate is needed and/or why it is needed, etc.

7x = Ready to receive; used only for the "S" subprotocol. The parameter is arbitrary text.

8x = Received data accepted; used only for the "S" subprotocol. The parameters are the same as for 2x but the data of the file is omitted. If the file has been deleted then the parameters are omitted.

=== Detailed status codes ===

00 = Beginning of arbitrary two-way communication. (This is only valid for the "I" subprotocol.)
10 = Requires input. (This protocol does not have the 11 code that Gemini has; putting the password in the query string isn't very good because then it is a part of the URL and might not be hidden.)
20 = Response is OK.
21 = Response is OK, but it is only a part of the file and not the entire file; this is used as the response of a range request. The file size parameter is the entire file size, and not only the requested part.
30 = Temporary redirect.
31 = Permanent redirect. A client may automatically update bookmarks, etc if this feature has not been disabled by the user.
40 = Temporary error with no more specific code.
41 = Down for maintenance.
42 = A dynamic file has an unexpected temporary error such as a timeout.
43 = Proxy error; this request requires the server to make another connection, but it is unable to do so, or is able to connect but cannot receive a valid response.
44 = Slow down; the client is sending requests too fast and should wait before trying again.
45 = Temporarily locked file; used with the "S" subprotocol.
50 = Permanent error with no more specific code.
51 = File not found (maybe it is at Area 51).
52 = The file does not exist (probably because it has been deliberately removed) and is not expected to exist again; clients should remember this and not request it again automatically. (This code also means a permanently locked file, if used with the "S" subprotocol.)
53 = A proxied request was refused by the server. A server MAY also use this status code if a username/password has been provided in the URL but is not required to access any files on this server (this is to simplify the server implementation, so that it can check that it is not its own name and therefore refuse the request). A server MAY also use this status code if a URL with no scheme has been provided, for the same reason.
54 = Forbidden request. A username and/or password probably won't help. Other conditions might or might not help, depending on the implementation (e.g. it might only permit access to LAN addresses or only to 127.0.0.1).
55 = Edit conflict; used with the "S" subprotocol.
56 = A username and/or password are required. Either none have been provided, or the username and/or password that have been provided are incorrect. The error message SHOULD NOT distinguish between an unknown username and an incorrect password for a known username.
59 = Bad request.
60 = A client certificate is required to access this file, but none has been provided.
61 = The supplied client certificate is not authorized to access this file. The certificate may be valid to access a different file, though (possibly but not necessarily on the same server).
62 = The supplied client certificate is not valid (e.g. because it has expired or because the signature is not valid).
70 = Ready to receive new file.
(If a server sends this response but the file is created before it has been fully received from the client for any reason, then the server SHOULD NOT overwrite the existing file and SHOULD instead send a response with an appropriate 4x or 5x status code.) 71 = Ready to receive to replace an existing file. 72 = Ready to receive data for a use other than a new or modified file. 80 = Accepted received data and created a new file. 81 = Accepted received data to modify an existing file. 82 = Accepted received data for a use other than a new or modified file. === Client certificates === Implementation of client certificates is optional, but it is recommended to be implemented if TLS is implemented. If a 6x response is received on a non-TLS connection, then it should change the scheme from "scorpion:" to "scorpions:" when a client certificate is available, before retrying the request (it MUST NOT do this automatically if no client certificate is available). The first parameter of a 6x response specifies the suggested set of URLs that the client certificate is applicable to. It has one character followed by a URL. The URL may be empty to mean the current URL, or can be any URL that lacks a scheme and authority. The first character can be: * "=" = Only that exact URL (as well as any that differ only by the fragment part; the fragment part is always ignored for all of these modes). * "+" = The specified URL as well as any that have a different query string or no query string. * "*" = Discard the query string of the specified URL, and then it means the current URL as well as any one consisting of the current URL followed by / or ? and then anything. If the URL already ends with / then it means that URL followed by anything. * "-" = An unspecified set of URLs which includes the specified URL. 
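The four hint modes listed above can be sketched in Python as follows. This is only one possible reading of the rules, not a normative algorithm: the function names are hypothetical, the hint URL is assumed to have already been resolved to an absolute URL (an empty hint meaning the current URL), and the "-" mode is treated conservatively as matching only the specified URL, since nothing else is known to be in the set.

```python
def parse_hint(parameter: str) -> tuple[str, str]:
    """Split the first 6x parameter into (mode character, URL part)."""
    if not parameter or parameter[0] not in "=+*-":
        raise ValueError("unrecognized hint")
    return parameter[0], parameter[1:]


def _without_fragment(url: str) -> str:
    # The fragment part is ignored in every mode.
    return url.split("#", 1)[0]


def hint_matches(mode: str, hint_url: str, candidate: str) -> bool:
    """Return True if `candidate` is in the URL set suggested by the hint."""
    hint = _without_fragment(hint_url)
    cand = _without_fragment(candidate)
    hint_base = hint.split("?", 1)[0]
    if mode == "=":
        return cand == hint
    if mode == "+":
        # Same URL with a different query string or with none.
        return cand.split("?", 1)[0] == hint_base
    if mode == "*":
        if hint_base.endswith("/"):
            return cand.startswith(hint_base)
        return (cand == hint_base
                or cand.startswith(hint_base + "/")
                or cand.startswith(hint_base + "?"))
    if mode == "-":
        # Unspecified set: only the specified URL is known to be a member.
        return cand == hint
    raise ValueError("unrecognized hint mode")
```

A client would still have to let the user override the result, since the hint is never mandatory.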
(If Gemini is implemented as well as Scorpion, then a 6x response from a Gemini server SHOULD use the "*" URL set hint (with the current URL) by default, since that is what the Gemini specification says.) The set of URLs MUST include the URL that has been requested, and MUST NOT include any URL that differs by scheme, host, and/or port, than the URL that has been requested. If this is not the case, clients SHOULD treat it as an unrecognized hint. Clients MUST allow the user to override the specifications above; those specifications are merely a hint, and are not mandatory to be implemented. Unrecognized hints shouldn't prevent the user from specifying a client certificate anyways, but in that case the client MAY require that the user explicitly specify which set of URLs it applies to; if it does not then there will be a implementation-dependent default setting. Client certificates normally apply to all subprotocols used with that URL, although the user may be allowed to override this. Alternatively, a client might apply it to all subprotocols if the subprotocol in the request is R, but limit it to the requested subprotocol if it is S or I. If the server requires a client certificate and a user name, it should use a 54 response first, and then 60 once a valid username has been provided. If the server requires either a client certificate or a user name (of the client's choice), but does not need both, then it should provide a 54 response if the connection is not using TLS, or 60 if it is using TLS. A server SHOULD NOT require both a password and a client certificate at the same time. (TODO: Possibly change this paragraph.) 
A client may have any UI its author wishes, but an example of a GUI which might be used to manage client certificates would be: URL: ________________________________ Match: (*) Exact (*) File (*) Path [X] Restrict to current username Subprotocols: [X] Receive [X] Send [X] Interactive Duration: (*) Session (*) Permanent (*) ___ hours (*) ___ days [X] Remember TLS options (A client can be designed with whatever differences they want than the above. A similar UI may be used for Gemini, except that "Interactive" and "Restrict to current username" are not applicable for Gemini.) In the menu for selecting existing certificates, any existing certificate for the same domain should be easy to find, and should allow changing the scope of existing certificates. (This is more important for Gemini than it is for Scorpion, but it would work with Scorpion too.) A client MUST NOT automatically generate and send a client certificate without first asking the user. When creating a new certificate, the default settings for the new certificate should be made up as follows (although the user might be able to override them; the ability to override is not mandatory, since an external program can be used to make up your own certificates if you do not want to use the ones included in the browser): * The key type, signature type, number of key bits, etc should match that of the server's certificate. This is the most likely to be compatible. * The "extended key usage" (2.5.29.37) extension should be included, and it should specify "TLS client auth" (1.3.6.1.5.5.7.3.2) as the key usage. === Receive subprotocol === This is the usual subprotocol, coded as "R". (It is the only subprotocol which is mandatory to be implemented.) The subprotocol parameter can be blank or can be a range request. Servers are not required to support range requests, and can respond with a 59 code if it is not implemented. 
(It is also possible that range requests will be possible only with some files and not with other files.)

A range request consists of two nonnegative integers in decimal notation with - in between; these are zero-based file offsets: the first is the offset of the first byte to receive, and the second is the offset of the first byte to not receive (e.g. "3-9" means the six bytes, being the fourth, fifth, ..., ninth bytes of the file). The end address can be omitted, in which case it means up to the end of the file.

=== Send subprotocol ===

This subprotocol is coded as "S". The subprotocol parameter can be omitted, but if not omitted then it is the version of the file being replaced; this allows the server to check for edit conflicts. (The server MAY require the version to be specified.)

Optionally, the subprotocol parameter can be a HMAC followed by @ and then the version (which may be empty). The HMAC is of everything that follows the at sign, including the entire request and everything the client will send after the server responds the first time.

The response code should not be 2x nor 8x, but can be any other code. If it is 7x then this means the server is ready for the client to upload the file. If it is 1x or 3x then the URL is modified accordingly, which means that the upload is required to be made to a different URL instead of the one that was initially requested.

To upload the file, the client sends a status line, and if its code is 2x then it is followed by the data. The possible status codes are:
* 20 = Upload a file. The size is mandatory and you cannot substitute a question mark instead. The version is optional, and the server may override it with its own version specification.
* 30 or 31 = Make the file into a redirect to a different file.
* 51 or 52 = Delete a file.
A server does not need to implement all of the above possibilities.

After the client finishes sending, the server sends the status line, which will be an 8x code if successful or a different code if it is an error.
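The status lines exchanged during an upload can be sketched in Python as below. The helper names and the `MAX_PARAMS` table are hypothetical; the table is one reading of how many parameters each major status code takes (with the last parameter allowed to contain spaces), not a normative list.

```python
# Maximum number of parameters per major status code, as read from the
# status code sections of this draft (an interpretation, not normative).
MAX_PARAMS = {0: 1, 1: 1, 2: 3, 3: 1, 4: 2, 5: 1, 6: 2, 7: 1, 8: 3}


def upload_status_line(size: int, file_type: str, version: str = "") -> bytes:
    """Client side: the '20' line that precedes the uploaded data."""
    if size < 0:
        raise ValueError("the size is mandatory and cannot be negative")
    if " " in file_type or " " in version:
        raise ValueError("spaces are not allowed in the type or version")
    fields = ["20", str(size), file_type]
    if version:
        fields.append(version)
    return (" ".join(fields) + "\r\n").encode("ascii")


def parse_status_line(line: bytes) -> tuple[int, list[str]]:
    """Split a status line into (two-digit code, list of parameters)."""
    text = line.decode("ascii")
    if not text.endswith("\r\n"):
        raise ValueError("a status line ends with CR LF")
    text = text[:-2]
    if len(text) < 3 or not text[:2].isdigit() or text[2] != " ":
        raise ValueError("malformed status line")
    code = int(text[:2])
    rest = text[3:]
    limit = MAX_PARAMS.get(code // 10, 1)
    # Only the last parameter may contain spaces, so cap the number of splits.
    params = rest.split(" ", limit - 1) if rest else []
    return code, params
```

With these helpers, a client could send `upload_status_line(5, "text/plain")` followed by five bytes of data after a 7x response, then check that the code returned by `parse_status_line` is in the 8x range.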
Note that if the client disconnects before sending its own status line or before sending the amount of data of the file that it said it was going to send to the server, then the upload is aborted; if the file is locked then the server can unlock it, etc. (For example, if a client wants to add a new file and not overwrite an existing one, but the server says 71, then the client can disconnect, in which case the files on the server will be unchanged.)

=== Interactive subprotocol ===

This subprotocol is coded as "I", and is used for two-way communication (usually terminal emulation, although it can be used with any kind of two-way communication). (The main reason for this is if you have multiple programs and you want to be able to specify the URL of each one.)

The subprotocol parameter is optional but if present it is the requested capability codes; see below. If it is acceptable, then the server sends a 00 response; its parameter is the actual capability codes or can be blank. After that, it will continue with ordinary two-way TCP communication. (Other valid status codes are 4x, 5x, and 6x, in which case the server closes the connection after that, like it does with the other subprotocols. The 2x codes MUST NOT be used.)

If the client disconnects after receiving the 00 response (and possibly any further data) but without sending any further data to the server, then the server should not change the state; it should assume that the client had connected by mistake and did not intend to do so.

Capability codes have no delimiters between them, and each one has the following format (similar to CSI codes, so that implementations can use the same subroutines to parse them if they do terminal emulation):
* Zero or one byte in range 0x3C to 0x3F.
* Zero or more bytes in range 0x30 to 0x39 or 0x3B.
* Zero or one byte in range 0x21 to 0x2F.
* One byte in range 0x40 to 0x7E.
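The four-part format above maps directly onto a regular expression. The following Python sketch splits a capability string into its codes; the `Capability` type and `parse_capabilities` function are hypothetical names, and empty numeric fields are represented as `None` by choice of this sketch.

```python
import re
from typing import NamedTuple, Optional


class Capability(NamedTuple):
    prefix: bytes                  # zero or one byte, 0x3C to 0x3F
    numbers: list[Optional[int]]   # parameters separated by ";" (0x3B)
    middle: bytes                  # zero or one byte, 0x21 to 0x2F
    suffix: bytes                  # exactly one byte, 0x40 to 0x7E


_CAPABILITY = re.compile(
    rb"([\x3C-\x3F]?)"   # prefix
    rb"([0-9;]*)"        # digits 0x30 to 0x39 and the separator 0x3B
    rb"([\x21-\x2F]?)"   # middle
    rb"([\x40-\x7E])"    # suffix
)


def parse_capabilities(data: bytes) -> list[Capability]:
    """Parse a run of capability codes that has no delimiters between codes."""
    result = []
    pos = 0
    while pos < len(data):
        match = _CAPABILITY.match(data, pos)
        if match is None:
            raise ValueError(f"invalid capability code at offset {pos}")
        prefix, params, middle, suffix = match.groups()
        numbers = [int(p) if p else None for p in params.split(b";")] if params else []
        result.append(Capability(prefix, numbers, middle, suffix))
        pos = match.end()
    return result
```

For instance, the string `2;3a1L80;24x` parses into three codes with suffixes `a`, `L`, and `x`.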
(Note that the protocol before the actual data will be ASCII only, so that it is possible to use even with terminal emulators that do not implement this subprotocol.)

The ABNF of the capability codes is:
= *capability
capability = prefix parameters middle suffix
prefix = %x3C-3F / ""
parameters = *(DIGIT / ";")
middle = %x21-2F / ""
suffix = %x40-7E / ""

Possible capability codes:
* a = The client uses this if it wishes to send command-line arguments and/or environment variables to the server. If so, the server will send back the same capability code, possibly with different numbers, and then the client sends the command-line arguments and environment variables, each of which is null-terminated. The numbers are first the number of command-line arguments, and then optionally a semicolon and then the number of environment variables. An environment variable name is not allowed to contain an equals sign, since the equals sign is used to separate the name from the value. The server will not send back this capability code if the requested file cannot use this capability; if it does send it back, then the numbers might be different from those that the client requested; the number of arguments/variables that the client sends must match those specified by the server. The server MUST NOT send this capability code if the client did not request it.
* L = Line mode. The value is 0 for no local line editing (and no local echo), or 1 for local line editing so that the text is not sent until an entire line is entered and the user pushes send. Even in line mode it is still possible for the server to send data to the client while the user has partially entered a line of text; the client should handle that by displaying the user's text entry separately.
* T = Terminal type (not currently defined).
* x = Screen size. The value is two numbers being the number of columns and the number of rows; in both cases zero can mean unlimited.
Optionally it can have a third number; 1 means the client handles pagination.

(Note that it is not required to implement the above capability codes. In some cases, some or all of them might not be applicable, e.g. if it is not a terminal emulation.)

=== Hashed URI scheme ===

The "hashed:" URI scheme has the format "hashed:X/Y,Z" where X is the hash algorithm, Y is the hash (in hexadecimal format), and Z is another URL (which can be absolute or relative, and can be of any scheme, including another "hashed:" URL, in case you want to specify multiple hashes which are using different hashing algorithms). It refers to the same file as Z if X/Y is that file's hash, or is an error if that file's hash is not X/Y.

If "[X|Y]" means the absolute URL corresponding to the URL Y treated as relative to the absolute URL X, then the rules for resolving relative URLs of this scheme are as follows:

"[hashed:A/B,C|D]" = "[C|D]" if "D" does not start with "#"
"[hashed:A/B,C|#D]" = "hashed:A/B,[C|#D]"
"[A|hashed:B/C,D]" = "hashed:B/C,[A|D]"

In case the notation is confusing: "[http://example.org/files/1.txt|2.txt]" = "http://example.org/files/2.txt" is a valid equation, and the uppercase letters in the above are placeholders.

For example, if the current URL is "hashed:0/ab8974,file:///tmp/help.txt" and you want to access the relative URL "help2.txt", then the new absolute URL is "file:///tmp/help2.txt" and not "hashed:0/ab8974,file:///tmp/help2.txt", because "0/ab8974" is the hash of "help.txt" and not of "help2.txt". However, if the relative URL starts with # then it is a link to another part of the same file, so it is the same file and therefore has the same hash, and therefore the hash should not be stripped out in this case.

The hash algorithms are specified as hexadecimal numbers without a leading zero, which are multicodec numbers (but not encoded as varint). The hash values are specified as an even number of hexadecimal digits, which will include leading zeros if any.
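The three resolution rules above can be sketched in Python as follows. The function names are hypothetical, and the sketch delegates ordinary relative resolution to `urllib.parse.urljoin`, which only resolves relative references for schemes it already knows (such as http and file); a real client would need its own resolver for "scorpion:" URLs.

```python
from typing import Optional
from urllib.parse import urljoin


def split_hashed(url: str) -> Optional[tuple[str, str, str]]:
    """Return (algorithm, hash, inner URL) for a hashed: URL, else None."""
    if not url.lower().startswith("hashed:"):
        return None
    head, comma, inner = url[len("hashed:"):].partition(",")
    algorithm, slash, digest = head.partition("/")
    if not comma or not slash:
        raise ValueError("expected hashed:X/Y,Z")
    return algorithm, digest, inner


def resolve(base: str, ref: str) -> str:
    """Compute [base|ref] following the three rules of the hashed: scheme."""
    ref_parts = split_hashed(ref)
    if ref_parts is not None:
        # [A|hashed:B/C,D] = hashed:B/C,[A|D]
        algorithm, digest, inner = ref_parts
        return f"hashed:{algorithm}/{digest},{resolve(base, inner)}"
    base_parts = split_hashed(base)
    if base_parts is not None:
        algorithm, digest, inner = base_parts
        if ref.startswith("#"):
            # Same file, so the hash is kept: hashed:A/B,[C|#D]
            return f"hashed:{algorithm}/{digest},{resolve(inner, ref)}"
        # A different file, so the hash no longer applies: [C|D]
        return resolve(inner, ref)
    return urljoin(base, ref)
```

The recursion handles nested "hashed:" URLs on either side, because each rule reduces the problem to resolving against the inner URL.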

List of hash algorithms:
11 SHA-1
12 SHA2-256
13 SHA2-512
14 SHA3-512
15 SHA3-384
16 SHA3-256
17 SHA3-224
d5 MD5
b250 BLAKE2s (128-bits)
b260 BLAKE2s (256-bits)

(Note that some hash algorithms are deprecated because they are insecure.)

(A request of the Scorpion protocol that sends a URL using the hashed: scheme is considered to be a proxied request (even if it is a URL of a file on that server), and may be refused. Clients MUST NOT send such requests, unless the user configured it to be a proxy for the hashed: scheme (the ability to do so is not required to be implemented, though).)

=== Document file format ===

The file format consists of a sequence of blocks, each of which has the format (there is no global header, delimiters, etc):
* One byte being the block type and character encoding.
* Big-endian 16-bit attribute length.
* Attribute data.
* Big-endian 24-bit body length.
* Body data.

The block types are:
* 0x00 = Normal paragraph. The attribute is unused and MUST be empty. The body is the text of the paragraph.
* 0x01 to 0x06 = Heading levels 1 (outermost) to 6 (innermost). The attribute is the part after # in the URL to refer to this section (empty if it cannot be referred to by the URL), and the body is the heading text.
* 0x08 = Normal hyperlink. The attribute is the URL (in ASCII encoding) and the body is the link text. The URL can be relative or absolute. If the attribute is empty then it means the same as the current URL (which isn't very useful for type 0x08, but may be useful with types 0x09 and 0x0A). If the attribute contains a null character, then only the part before the null character is the URL, and the null character itself and anything afterward will be ignored. Other control characters are not allowed in the URL.
* 0x09 = Hyperlink requesting input. Like 0x08 but it is treated like a 10 status code (with an implementation-defined prompt; it may be the same as the text of the link) without making the request.
This link type is not to be used for gopher links (if it occurs anyway, a client SHOULD treat it as a normal hyperlink but with type 7 instead of 1; however, authors should be aware that a client might incorrectly use a question mark instead of a tab if this block is used for gopher links).
* 0x0A = Interactive hyperlink. Like 0x08 but with the "I" subprotocol. Implementation is optional. (Some implementations may wish to use an external program which is an existing terminal emulator, if they can add initial input.)
* 0x0B = Alternate service (e.g. mirrors, etc) for the previous block (which MUST be a link block; if it is also 0x0B then it is an additional alternate service), or, if there is no previous block, for the current file. The attribute is the URL of the alternate service. The body is not normally used, but may contain text explaining the alternate service. Clients SHOULD normally hide this block, although a client might have a way to display some kind of "alternate service" menu, an option to display these blocks, an option to automatically select among them for load balancing, etc. (This is similar to the "+" type in Gopher menus.)
* 0x0C = Blockquote. The attribute is unused and MUST be empty. The body is the text of the paragraph.
* 0x0D = Preformatted text. Valid control codes are tab and line feed. The attribute SHOULD be blank; see below for its meaning (although a client is allowed to ignore the attribute). The client MUST display this text with a fixpitch font.
* 0x0F = This block is used for optional metadata such as a digital signature of the rest of the document. Clients that do not understand it will ignore this block.

The possible character encodings are:
* 0x00 = TRON-8 (left to right)
* 0x10 = PC (left to right)
* 0x80 = TRON-8 (right to left)

The control codes are:
* 0x02 = Whatever comes before it is some kind of section number or item number or a bullet indicating a list item.
(This may also be used to separate the word from the definition in a definition list.)
* 0x05 = Followed by one or more bytes indicating a type of contents (e.g. a word or phrase in a foreign language compared with the surrounding text, or a measurement of a specific type (length, mass, etc)), and then 0x06 and then the text and then 0x07. If not implemented (or if this feature is disabled by the user; and it SHOULD be disabled by default), then the client MUST skip up to the next 0x06 byte. You cannot include any other control characters in the data part. A data/text sub-block cannot be inside of either part of a furigana sub-block.
* 0x06 = Separates the data part (before) and the text part (after) of a data/text sub-block. You are not allowed to nest data/text sub-blocks.
* 0x07 = Ends a data/text sub-block. The text part can contain other control characters including furigana, but any changes to the formatting are required to be reset before this 0x07 code.
* 0x09 = Tab; only in a preformatted block. This should not be used if exact spacing is required, since the way that it is displayed is implementation-dependent (and possibly configurable by the user).
* 0x0A = Line break; only in a preformatted block.
* 0x10 = Only with PC character code; followed by one byte in range 0x41 to 0x5F, from which you must subtract 0x40 to make the code of the graphic character to display. This is allowed in preformatted blocks as well as in other blocks.
* 0x11 = Normal style.
* 0x12 = Strong style.
* 0x13 = Emphasis style.
* 0x14 = Fixpitch style. This style MUST be displayed by fixpitch fonts (but it is acceptable to display everything by fixpitch fonts, which would mean that a special handling is not required).
* 0x15 = Forward text direction.
* 0x16 = Reverse text direction.
* 0x17 = Begin the main text of furigana. This should be followed by the text and then 0x18 and then the other text and then 0x19.
* 0x18 = Begin the furigana text of the furigana.
If furigana is not implemented (or if the user disabled it), then the client should display the main text of a furigana block but should not display the furigana text. (Alternatively, it might have an option which causes the furigana text to be displayed in parentheses or some other kind of delimiters, which would effectively make 0x18 and 0x19 aliases for graphic characters.)
* 0x19 = End of the furigana. You are not supposed to nest any other control codes inside of the furigana blocks (although 0x10 is allowed, if this block uses the PC character code; however, furigana probably would not be common when using PC character codes).
* 0x1B = Used for SGR codes. The next byte MUST be 0x5B, and then zero or more bytes in range 0x30 to 0x3B except 0x3A, and then one byte which is 0x6D. This is allowed only in preformatted blocks, although its use is discouraged. Clients should skip over the SGR code entirely, but MAY have an option to interpret them.

It is not required to implement most of the control codes, except as specified above.

Stateful encodings MUST shift the state at the beginning of each block. This is also required after 0x18 or 0x19 if the state before such a code does not match the state before the furigana block, and similarly for 0x06 and 0x07. Any document which does not satisfy these criteria may result in an unreliable display on some clients.

If the attribute of a preformatted block is not empty, then a client program MAY be able to use it to implement syntax highlighting, equations, simple diagrams, etc. It MUST have an option to ignore the attribute if the user wants to display all preformatted blocks as plain text, and MUST treat unrecognized attributes the same as a blank attribute. Authors should write the document with the expectation of the client not recognizing it. Clients MAY display the attribute text of preformatted blocks.
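The block layout described earlier in this section (one type/encoding byte, a 16-bit attribute length, the attribute, a 24-bit body length, the body) can be read with a short loop. The following Python sketch uses hypothetical names, and it assumes that the block type is the low four bits of the first byte and the character encoding is the high four bits; that split is inferred from the listed values (types 0x00 to 0x0F, encodings 0x00, 0x10, 0x80) and is not stated explicitly.

```python
from typing import Iterator, NamedTuple


class Block(NamedTuple):
    block_type: int   # assumed: low four bits of the first byte
    encoding: int     # assumed: high four bits (0x00, 0x10 or 0x80)
    attribute: bytes
    body: bytes


def encode_block(block_type: int, encoding: int, attribute: bytes, body: bytes) -> bytes:
    """Serialize one block using big-endian 16-bit and 24-bit lengths."""
    return (bytes([block_type | encoding])
            + len(attribute).to_bytes(2, "big") + attribute
            + len(body).to_bytes(3, "big") + body)


def parse_blocks(data: bytes) -> Iterator[Block]:
    """Yield blocks until the data ends; there is no global header."""
    pos = 0
    while pos < len(data):
        if pos + 3 > len(data):
            raise ValueError("truncated block header")
        head = data[pos]
        attr_len = int.from_bytes(data[pos + 1:pos + 3], "big")
        pos += 3
        attribute = data[pos:pos + attr_len]
        pos += attr_len
        if len(attribute) != attr_len or pos + 3 > len(data):
            raise ValueError("truncated attribute or body length")
        body_len = int.from_bytes(data[pos:pos + 3], "big")
        pos += 3
        body = data[pos:pos + body_len]
        if len(body) != body_len:
            raise ValueError("truncated body")
        pos += body_len
        yield Block(head & 0x0F, head & 0xF0, attribute, body)
```

A heading followed by a paragraph could be produced with `encode_block(0x01, 0x00, b"top", b"Title") + encode_block(0x00, 0x10, b"", b"Hi")` and read back with `list(parse_blocks(...))`. Decoding the body text (TRON-8 or PC) and interpreting the control codes are separate steps that this sketch does not attempt.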
(If data tables are required, you can link to a separate file that contains the data; you cannot have inline data tables.)

=== Extended link attributes ===

Extended link attributes are optional, both to specify by the author and to implement by the client.

* 0x20 = The file size (if known; it can be ? if not known) and file format of the file that the link refers to, in the same format as the 2x response code. This is not valid for block type 0x0A.
* 0x49 = A hint for the capability string to use with interactive mode. This is only valid for block type 0x0A.
* 0x72 = Relation types. The value is any number of bytes; if the bytes are 0x01 to 0x7F then they are the relation type from this file to the target file, and 0x81 to 0xFF are the same relation types but in reverse (from the target file to this file). The browser should not prevent the display of links or automatically handle links due to the relation types; however, features such as shortcut keys, queries, and user-defined styles might be affected by these relation types.

Valid relation types are:
  0x01 = Next page
  0x02 = Citation
  0x03 = Cross-reference

=== Metadata blocks ===

(These might be changed in future)

The first byte of the attribute of a type 0x0F block (a metadata block) identifies the type of metadata in this block.

* 0x00 to 0x3F = (Reserved)
* 0x40 to 0x7F = (Reserved)
* 0x80 to 0xBF = Applies to everything up to the next metadata block of the same type.
* 0xC0 to 0xFF = Applies to the entire document.

The meaning of the rest of the attribute, and of the body, depends on the type; their meaning is described below. The character set is used only if the body has a meaning and is otherwise usually not meaningful, but exceptions will be noted in the below specifications.

The possible metadata types are:

* 0x80 = The language of the text of further blocks. The attribute specifies the language.
The body may be empty; if it is not empty then it contains the name of the language (and should not contain any control characters).
* 0x81 = Type of article of further blocks. The rest of the attribute is one byte as follows: 0x00=undefined, 0x41=article, 0x4E=navigation.
* 0xC0 = Modification date/time. The rest of the attribute will be the big-endian signed 64-bit number of seconds past January 1, 1985, 00:00:00, UTC, excluding leap seconds. The body is not used.

Metadata blocks are not supposed to affect the display of the document except for the display of the metadata itself (if metadata display is enabled). However, there are cases where it may affect features such as multilingual search (if the user has specified that an inexact match is satisfactory), speech synthesis, etc.

=== Data/text sub-blocks ===

The first byte of the data part of a data/text sub-block indicates the type; the rest of the bytes are the parameter (which may be empty for some types). The possible types are:

* 0x30 to 0x37 = Languages and/or pronouncing. Bit2 means that the language is specified. The low 2 bits mean: 0=none, 1=phonetic, 2=phonemic. If the language is present, 0x20 separates the language code from the pronounce code. (The codes 0x33 and 0x37 are not currently meaningful.) (TODO: Specify how the languages and pronouncing are encoded.)

(TODO: this should include: languages, pronouncing (which may be combined with languages or used independently), date/time, and SI units. Possibly also other things, but also possibly not.)

=== Languages ===

The language specifications are made up according to the criteria:

* If one code is a prefix of another then the shorter one is considered to be a more general specification that includes the longer one (e.g. English vs Canadian English). (This improves simplicity of implementation, since an implementation will not need to have complicated rules to figure out which language is meant.)
* Both written languages (for e.g.
text documents) and spoken languages (for e.g. audio files) should be considered. Although in many cases the same codes can be used, they should still be considered separately since the requirements may be different in each case. Note that some languages may be purely written languages (e.g. Blissymbols).

(TODO: Possibly use a variant of ISO 639 or Glottolog or something else.)

=== Automated crawling ===

(This section is currently a draft and may be changed in future. There are some disputes about some of the below, so it is likely to be changed.)

Please note that none of this section is actually enforceable, and it is not intended to be. It is intended as a set of guidelines for bots that is likely to be implemented correctly when it is implemented. It is a similar idea to "robots.txt", but is meant to be less ambiguous.

The recommendation for automated crawling/indexing is described here. This specification applies to recursive crawlers that automatically download files, especially if they use recurring intervals, and to public search engines and mirrors that work automatically. It does not apply to users that manually download files or that only download a single list of files once, nor does it apply to proxies, gateway services, etc (but see below about proxies that are themselves available to be crawled). Note that this does not prohibit anyone from making links to any files regardless of whether or not crawling is allowed.

A crawler should first try to download the file named "/.special/crawl" to find the policy set up by the server administrator (it should not do this more than once per crawling interval). Depending on the status code returned by the server:

* 2x = Read and parse the file according to the below specifications.
* 4x = Do not access the server for at least the specified amount of time, possibly plus some random number. After that, the crawler MAY try again, and will again try to download the /.special/crawl file.
* 5x = No crawling policy is available.
(The behaviour of a crawler in such a case is not specified by this document.)

Note that you cannot assume that the crawling policy file will not be changed in future. If you start over the crawling then you should try to download the crawling policy file again.

The format of the file is lines ending with line feeds; each line starts with a one-byte command code, and then the parameter. If the command code has bit5 set then the client SHOULD skip that line if it is not understood. If it has bit5 clear then the client MUST treat the entire file as not understood and should not proceed with crawling (and it might abort with an error message in this case, explaining what the problem is).

Each crawler also has zero or more names, which are sequences of printable ASCII characters.

The commands are:

* "`" = A comment that has no meaning. It may contain information which is useful for users, search engine operators, mirror operators, etc, such as downloading an archive file that contains all of the data.
* "@" = The parameter is the name of the crawler. Any lines preceding the first line with @ are effective, and anything from the first @ line matching the crawler's name up to the next line with @ is effective; all other lines are ineffective. Ineffective lines MUST be ignored even if the bit5 of the command code is clear.
* "i" = Suggests indexing the specified prefix.
* "v" = Suggests not indexing the specified prefix.
* "d" = Means that the files with the specified prefix are probably dynamic so it might not be useful to mirror them.
* "c" = Allows crawling the specified prefix.
* "C" = Disallows crawling the specified prefix.
* "n" = Estimated number of files to download.
* "t" = Estimated total size of files to download.
* "N" = Maximum number of files to download.
* "P" = Maximum number of simultaneous downloads.
* "D" = Minimum delay (in seconds) after downloading one file before proceeding with the next one.
* "R" = Minimum delay (in seconds) after starting to download the first file before starting over from the beginning.
* "w" = Suggested time (in seconds) to wait before downloading the crawling policy file again. Note that the crawling policy file still counts as a file for the purpose of the D and R commands, too.
* "a" = (This command is intended to be used for archiving, but the specification of the archiving has not been written yet.)

Numbers are given in decimal notation, and are always nonnegative integers, using only digits 0 to 9.

URL prefixes start with / and are relative to the root directory of the server; a prefix matches all URLs that it is a prefix of (including that URL itself). Once a command is found that matches the URL being accessed, the crawler should ignore all further commands for the purpose of accessing that URL. However, it still must keep those commands in memory or on disk so that it can refer to them again later for another access. (It is also possible to work in an alternative way, by somehow converting the data into an internal format on the client that can more efficiently determine the access policy, as long as the behaviour matches that described here.) If no command is found that matches the URL that it wants to access, then assume that an implicit "C/" follows (meaning it is disallowed).

A crawler may have a name with <> around it (in addition to its other names) if it has the purposes described below:

* "" = Mirrors and backups.
* "" = Public search engines and indexing. This also applies to proxies which do not themselves have a policy to prohibit indexing.
* "" = Programs that are intended to study statistical properties such as number of files, average file sizes, broken links, etc.

All crawlers MUST have an empty string as one of their names.
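As a rough illustration of the rules above, the following sketch selects the effective lines of a crawling policy file and decides whether a path may be crawled. The names `effective_commands`, `may_crawl`, and `PolicyNotUnderstood` are hypothetical. Several readings of the text are assumed rather than confirmed: only the first "@" section matching one of the crawler's names is effective, empty lines are skipped, and only the "c" and "C" commands take part in the first-match rule for crawling permission.

```python
KNOWN = set(b"`@ivdcCntNPDRwa")  # command codes listed in this section


class PolicyNotUnderstood(Exception):
    """An effective line has an unknown command code with bit5 clear."""


def effective_commands(policy: bytes, names=()):
    names = set(names) | {""}  # every crawler also has the empty name
    result = []
    state = "preamble"  # preamble -> skipping/matched -> done
    for raw in policy.split(b"\n"):
        if not raw:
            continue  # assumption: empty lines carry no command
        code, param = raw[0], raw[1:].decode("ascii", "replace")
        if code == ord("@"):
            if state == "matched":
                state = "done"  # the matching section ends at the next @
            elif state != "done":
                state = "matched" if param in names else "skipping"
            continue
        if state in ("skipping", "done"):
            continue  # ineffective lines are ignored regardless of bit5
        if code not in KNOWN:
            if code & 0x20:
                continue  # bit5 set: skip the line that is not understood
            raise PolicyNotUnderstood(chr(code))
        result.append((chr(code), param))
    return result


def may_crawl(policy: bytes, names, path: str) -> bool:
    for code, prefix in effective_commands(policy, names):
        if code in "cC" and path.startswith(prefix):
            return code == "c"  # first matching command wins
    return False  # implicit "C/" when nothing matches
```

A crawler named "bot" would call `may_crawl(policy, ["bot"], "/some/path")` after downloading "/.special/crawl". The numeric commands (N, P, D, R, and so on) are returned by `effective_commands` but are not interpreted in this sketch.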
If a proxy service is available to be crawled/indexed, then the proxy service should also check the above policies, and either refuse the proxy, or set up its own policy which prohibits access to the proxied files to automated crawlers (either conditionally (according to which Scorpion server is being accessed) or unconditionally (for all proxied files, regardless of which server it is accessing through the proxy)).

Clients MAY add a query string when requesting the crawling policy file which identifies the crawler. (This is only for the crawling policy file; it is not supposed to do that for any other file. Also, the identification does not necessarily match the crawler's name as described above.)

Crawlers that receive a 41 response when downloading any file other than the crawling policy file SHOULD try to download the crawling policy file again after waiting for the minimum time specified in the response (whether or not it has previously downloaded the crawling policy file successfully). If it receives such a response too many times, then it SHOULD stop for a longer amount of time than specified.

The crawling policy file is not allowed to be retroactive.

=== Conversion ===

This part of the specification is optional.

Conversion between file formats is possible, and can be specified by the file called "/.special/conversion". Clients MUST NOT try to download this file unless the user explicitly commands the computer to do so; a client is not supposed to do so merely because it finds a file that it does not know how to handle. (Client software also MUST allow the user to override them, and to remove any such files that have already been downloaded.)
Furthermore, any client that is able to understand it MUST NOT require that it comes from the same server as the file; it can also be a local file which has been written by the end user, a file from another server (or the same server but a different path) etc, and once a conversion has been enabled by the user then the same conversion might be used with other servers too if appropriate (not all kinds of conversions are appropriate for using with other servers (e.g. if file name rewriting is used then it is not appropriate for arbitrary servers), but some can). If it does use it with other servers, the user MUST be able to configure this feature, e.g. so that it does not work with other servers.

Clients might also have their own mechanisms for conversion beyond what is written here, e.g. allowing to use local programs with pipes (which is a recommended way of doing so); such ways can only be used if set up by the end user or the system administrator and cannot be specified by servers. (This is possible even if the rest of the specification in this section of this document is not implemented.)

Each record consists of the 8-bit record type, and then four fields which are each the big-endian 16-bit length and then the data. The high nybble of the record type (which is still a part of the record type; it is not considered to be a separate field) defines how the first and second fields are interpreted (although some record types allow one or both fields to be empty):

* 0x00 = The first field specifies the input format and the second field specifies the output format.
* 0x80 = The first field specifies the original URI scheme and the second field specifies the target URI scheme.

The record types are:

* 0x01 = File name rewriting. The third field is a file name suffix (excluding any ?
or # and anything that comes afterward) of the original file (if the file name does not match, then this record does not match and cannot be used with this file), and the fourth field is the suffix to replace the original suffix with to find the alternative file.
* 0x02 = Use a program to convert the file. The third field is the URL of the program. The fourth field is described below.
* 0x03 = Use interactive mode. The second field should be blank. The third field is the recommended capability.
* 0x04 = Use a program to display the file. The third field is the URL of the program. The fourth field is described below. The second field should be blank.
* 0x05 = The third field is the URL of a document which explains the file format. The file format of the document is either Scorpion format or plain ASCII text format. The second and fourth fields are not used.

For record type 0x02, the fourth field has the first byte specifying the program format and the rest as the parameters:

* 0x01 = It is a binary uxn / varvara program. The file name and file stat ports should not be used, and neither should the date/time ports be used, nor any device ports that are not valid for non-GUI mode. If there are errors during the conversion, they should be written to stderr and the program should end with a nonzero exit code.

For record type 0x04, the fourth field has the first byte specifying the program format and the rest as the parameters:

* 0x01 = It is a binary uxn / varvara program. It will run in GUI mode but is likely to be sandboxed (which clients SHOULD do if possible).

For record types 0x02 and 0x04 and program format 0x01, the parameter is one byte, and is a bit field as follows:

* bit0 = Clear for stdin/stdout, or set if the first file device is the input file (which is read only) and the second file device is the output file (which is normally write only, but other flags may affect this).
* bit1 = Set if the input file is seekable and can be closed and reopened (which requires some of the nonstandard features of uxn38).
* bit2 = Set if the output file is seekable and can be closed and reopened, and might also be read by the program as well as written. (These also require some of the nonstandard features of uxn38.) (This bit should not be set if the record type is 0x04, since there is no output file.)
* bit3 = Set if the first command-line argument is the picture size, in the decimal ASCII format (only digits 0 to 9) with "x" in between (where the horizontal size is first, and then the vertical size). (This is meant for scalable vector diagrams.) (This bit should not be set for record type 0x04; in that case, the preset screen size that can be read back from the screen device is the suggested picture size (which can be changed).)

Note that it is OK if there are multiple records which match the same file.

Recommended output formats for pictures are farbfeld (true colours) and XPM2 (indexed colours, which also allows specifying symbolic colours which can be specified by user preferences). In the XPM2 format, you should not use X11 colour names; only hex, "black", "white", "None", and symbolic names (with the "s" colour type) should be used (although monochrome and grey scales are usable too, in addition to full colours and symbolic).

=== Unordered Labels File Identification ===

ULFI works according to the rules:

* The valid characters are printable ASCII characters other than spaces, slashes, backslashes, quotation marks, and apostrophes.
* A name consists of letters, digits, hyphens, dots, and underscores, and must start with a letter or underscore, and cannot end with a dot nor have two consecutive dots.
* The ULFI string consists of a set of parts with colons in between. The order of the parts doesn't matter; it is equivalent even if the parts are in a different order.
* Duplicate parts are improper and redundant.
* A part consists of a name, followed by an optional parameter block or inner block. The name may also contain plus signs, as explained below.
* A part name with a plus sign is a shortcut for the full part name before the plus sign, as well as the part name that substitutes a dot in place of the plus sign. There may be multiple plus signs, in which case e.g. "a+b+c" is the same as "a:a.b:a.b.c".
* A parameter block consists of [ and ] with zero or more characters in between other than "[", "]", "<", ">", "{", "}", "(", and ")".
* An inner block consists of < and > with another ULFI in between; this other ULFI cannot itself have an inner block. (This can be used for indicating such things as a compressed file, e.g. "gzip".)

Example 1: The ULFI "a.b:c" and "c:a.b" are equivalent to each other.
Example 2: The ULFI "a:b+c:d" and "a:b:b.c:d" are equivalent to each other.

(Note: It is rarely necessary to compare ULFIs for equality. Usually, a program would be looking for one or more labels that it recognizes, and only interpret those ones. One way to handle this is to use a bit field to remember what has been found so far during parsing.)

=== X.509 extensions ===

The use of these extensions is optional. Clients and servers are not required to read them, and whoever sets up the certificates is not required to set these extensions. Most of these extensions are independent of the protocol; they are not only for use with Scorpion protocol, and they may be used with any protocol.

* 2.25.327847519394146920218914781995694576662.1.6 = The data type is a set of integers. The values are the port numbers for use with TCP. This extension is for use with server certificates.
* 2.25.327847519394146920218914781995694576662.2 = The data type is an octet string, and is a comment encoded as TRON-8 encoding.
* 2.25.327847519394146920218914781995694576662.3 = Favicons (clients MUST allow favicons to be disabled, and MAY be disabled by default).
The data type is a sequence; each element of the sequence represents one icon. An implementation might support none, some, or all of the formats. See the section below about the formats.
* 2.25.327847519394146920218914781995694576662.4 = A boolean which indicates if the client wants the server to publish the certificate or not. If true then it means you want the server to publish the certificate, and if false then not.
* 2.25.327847519394146920218914781995694576662.5.1 = The time zone in the format expected for the TZ environment variable in UNIX systems, e.g. "America/Vancouver". The data type is an IA5 string.
* 2.25.327847519394146920218914781995694576662.6 = This extension is meant for securely specifying that one certificate supersedes another, even if it is self-signed. See the below section about "Superseding certificates" for the format of this extension.

=== Favicon formats ===

If the data type is boolean, then true means that it may be added to the preload list (in an implementation-dependent way), and false means that it is not supposed to be added to the preload list.

If the data type is a UTF-8 string, then it is interpreted as specified by: gemini://mozz.us/files/rfc_gemini_favicon.gmi

If the data type is an implicit type 1 octet string, then it is a Haiku vector icon format, except that the initial <6E 63 69 66> is omitted.

=== Superseding certificates ===

(The below seems like it would work, but there is some difficulty to get it to work, since it requires computing signatures of the certificate with some fields cleared. I had considered an alternative that uses the issuer certificate, as well as the subject certificate; this means that the public key field and list of excluded extensions field are both unnecessary, since the key is specified in the issuer certificate and excluding extensions is unnecessary. However, it then requires that both the issuer certificate and subject certificate must be changed.)
The format of this extension is a sequence consisting of the fields:

* The public key for other certificates to supersede this one, or null. If it is null, or if this extension is not present, then the certificate's own key will be used for this purpose.
* The Algorithm Identifier for the hash algorithm which is used for identifying older certificates which are being superseded. (TODO: Make the recommendation of which hash algorithm should be used.)
* A bit string which specifies which extensions (where the first bit corresponds to the first extension in this certificate, etc) should have their values replaced with all bits clear for the purpose of calculating the signature of this certificate with the old certificate's keys. If any bits are present, then the last bit that is present must be set; all further bits are implied to be clear. This bit string should never specify that the superseding certificates extension is itself excluded, and any implementation that understands an extension should not allow excluding it if such an exclusion is unnecessary. (This is present in order to avoid a mess with potential future defined extensions which would conflict.)
* A sequence of the Superseding Item structures, explained below. They should be listed in order from oldest to newest. It is not necessary to list all of the older certificates, but you should list the few most recent, especially if they have been changed recently.

The Superseding Item structure is a sequence of the fields:

* The hash of the older certificate.
* The signature (a bit string) of this certificate with the older certificate's key.

For the purpose of computing the signature, use only the tbsCertificate (not the stuff that comes afterward), but with all bits cleared in the value (not the type/length header) of the signatures within all Superseding Items, and all bits cleared within the value (the octet string data, which wraps ASN.1 DER data) of all extensions that are excluded.
=== Recommendations and other notes ===

There are only two legitimate uses for the version field of the 2x response; a client MUST NOT use it for any other purpose. They are:

* To use as the subprotocol parameter of an upload, to avoid edit conflicts.
* When making a range request, to compare the entire status text to check if the file might have changed. This is only to be used for resuming a download and is not to be used for caching.

Clients MUST implement the non-TLS protocol and SHOULD implement the TLS protocol. Servers SHOULD implement both protocols, especially the non-TLS protocol. Servers SHOULD serve the same files regardless of whether the connection is TLS or non-TLS (except files which require a client certificate to access).

Unicode is no good. Clients SHOULD NOT automatically convert TRON code into Unicode, except as a fallback in case suitable fonts are not available or a similar problem, and such fallbacks should be avoided if possible.

Redirects should not be overused. A client SHOULD have a redirect limit; the default redirect limit MUST NOT exceed five. If a redirect would occur beyond the limit (which may be configurable by the user), or if a TLS connection redirects to a non-TLS, or to specific other protocols (e.g. HTTPS), or a connection to the internet tries to redirect or link to a LAN or localhost address (it should resolve the DNS first if necessary) or to a local file (the "file:" scheme), then the client should warn the user first (with a non-modal message if possible), and not redirect unless the user manually allows it.

Clients made for users to view (i.e. not something like curl which is only for downloading files and not for display) should implement at least the above file format specification (or a subset of it which is at least the minimal subset) and the "text/plain" format. It is also recommended (although a lesser recommendation) to implement the "text/gemini" format, especially if the client software also implements the Gemini protocol.
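The redirect rules in this section can be sketched as a check that returns the reasons for which the user should be warned before a redirect is followed. This is only a sketch under assumptions: the function name `redirect_warnings` and the set `OTHER_PROTOCOLS` are hypothetical (the text only gives HTTPS as an example), and the caller is assumed to have already resolved the source and target host names and to pass the resulting IP addresses as strings.

```python
import ipaddress
from urllib.parse import urlsplit

OTHER_PROTOCOLS = {"https"}  # assumption: only the example named in the text


def is_local(addr: str) -> bool:
    """True for loopback, private (LAN), or link-local addresses."""
    ip = ipaddress.ip_address(addr)
    return ip.is_loopback or ip.is_private or ip.is_link_local


def redirect_warnings(source, target, hops_followed,
                      source_addrs, target_addrs, limit=5):
    """Return a list of reasons to warn the user; empty means no warning."""
    reasons = []
    s, t = urlsplit(source), urlsplit(target)
    if hops_followed >= limit:
        reasons.append("redirect limit exceeded")
    if s.scheme == "scorpions" and t.scheme == "scorpion":
        reasons.append("TLS to non-TLS")
    if t.scheme in OTHER_PROTOCOLS:
        reasons.append("redirect to another protocol")
    if t.scheme == "file":
        reasons.append("target is a local file")
    source_is_public = not any(is_local(a) for a in source_addrs)
    if source_is_public and any(is_local(a) for a in target_addrs):
        reasons.append("public address to LAN or localhost")
    return reasons
```

If the returned list is not empty, the client would show a (preferably non-modal) warning and follow the redirect only if the user allows it. The default `limit=5` is the largest default that the text permits; a client may choose a smaller one.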
Clients should allow URLs entered by the user to be treated as relative if no scheme has been explicitly specified by the user.

It is recommended to have an option to display a table of contents window for any file formats for which it is suitable.

IDN is not required to be implemented (and is recommended to not do so unless it is already implemented for other reasons). If it is implemented, clients should use a different colour to display non-ASCII characters for security purposes, and should ensure that even invisible characters are displayed. It must include an option to use only ASCII, and it is recommended to use only ASCII for domain names.

Clients should not download any additional files or make any additional network requests due to the contents of a document that is merely being viewed; a client should do so only when the user selects links in the document. ("Selecting" a link means explicitly navigating to it or being redirected to it, not merely hovering or things like that.)

It is acceptable for a minimal client to only implement the "R" subprotocol and not implement range requests; the same is true for a minimal server.

If the jar: scheme is implemented, then file: URLs which reference a ZIP archive should automatically redirect to the jar: URL for the root directory of the ZIP archive. If it is a Gempub file, it may redirect to the index file instead (possibly subject to user configuration); even if it does, it should be possible for the user to manually enter the jar: URL for a directory listing instead.

It is possible (but probably unnecessary, since SOCKS can be used instead, which would be better) for a server to implement a proxy with raw data, by using the "I" subprotocol and the "tcp://" URI scheme in the request. (It is recommended to use SOCKS instead if possible.)

Don't use R subprotocol to change the state of files in the server, except for optional logging (which should only be used for diagnostics). For example, don't use it for adding user's comments, etc.
(If you want to have a discussion forum with commenting, then NNTP is better. You can link to NNTP from a Scorpion file if you want to. The S subprotocol may also be used, and so can the I subprotocol; in the case of the I subprotocol, it should not change the state of files in the server unless the server receives any data from the client after the server sends the 00 status.)

Clients SHOULD allow the user to specify proxies. It is recommended to support SOCKS (unless the operating system has the ability to do so without the application programs being aware of it). It is also recommended (if use of proxies is implemented) to allow use of a Scorpion server for proxying with schemes specified by the user; note that if the proxy URL does not specify a secure connection, then the client should not encrypt any data sent through the proxy even if the scheme for the file being accessed is "scorpions:" or "gemini:" or another protocol that uses TLS. This is useful if the end user has set up their own intercepting proxy on their own computer, since it would avoid needing to encrypt the data twice.

=== Dynamic files ===

Servers can implement dynamic files however they want (including not at all), but the suggested convention on POSIX systems is as follows:

Set argv[1] to the entire request (excluding CRLF). Set argv[2] to the part of the request following the name of the external program (also excluding CRLF); note that there might be a path and/or a query string in argv[2], or it might be empty, but don't omit it even if it is empty. It is up to the external program to parse the subprotocol parameters, and to check if the subprotocol is one that it can handle.

Environment variables can be used for the remote IP address (if any program requires it) and for client certificates (if any have been provided):

* REMOTE_HOST = The IP address of the client.
* (Client certificate data, implementation-dependent, but see below.
Recommendations may be made for this in future if some format for doing this becomes common.)

Also on POSIX systems, if the file is actually a UNIX socket rather than an executable file or a non-executable regular file, then it can work by connecting to the socket and sending four strings each preceded by the big-endian 32-bit length, and then communication (always unencrypted) will go both ways. These four strings are:

* The part of the request up to and including the file name.
* The part of the request after the file name (same as what would be argv[2] if it was an executable file).
* The client's IP address (what would be the REMOTE_HOST environment variable if it was an executable file).
* The client certificate (this will be empty (but still present) if no client certificate has been provided or if the connection is non-TLS).

=== Security issues ===

This section discusses security issues. Note that many security issues with WWW are not applicable to Scorpion and related protocols, since they do not have the complexity of WWW.

See RFC 5246 and RFC 8446 for the specifications of TLS, which also have some further information about security issues of TLS.

Security is important as well as simplicity and interoperability. Therefore, servers and clients are required to support the non-TLS protocol, while TLS is optional (but is recommended). There is no requirement of which versions, ciphers, extensions, etc of TLS will be implemented; an implementation can do whatever it is able to do.

In cases where improved security is required, users might be able to determine the required public keys ahead of time before connecting to the server, and may install those keys in the computer. (The method for acquiring those keys is deliberately not specified in this document; it is something that each user will have to figure out by themselves.)
The recommended way to handle TLS cipher suites, TLS validation, etc, is whatever way the client software uses for any protocols that it already implements (e.g. Gemini client can use TOFU, curl can validate them in the usual way but allow bypassing by the -k switch). However, a client MUST include the possibility for the user to change the options, and for the user to manually add and remove certificates. It should also be possible for the user to specify the security levels when manually adding certificates (in order to prevent downgrading attacks). Regardless of the default settings, the user MUST be allowed to override them.

TLS session tickets can also be used for tracking, although they can speed up transfers. By default, clients SHOULD NOT reuse tickets for multiple connections; see section C.4 of RFC 8446.

If a client certificate has been specified, then clients and servers SHOULD both remember the TLS options in order to prevent downgrade attacks. (This may be configurable.)

See also gemini://gemini.ctrl-c.club/~stack/gemlog/2022-02-13.notls.gmi for some notes against use of TLS.

The username and password in the URL can be used for tracking, as mentioned in the mailing list of Gemini. Therefore, it is recommended that if a URL from a link or redirect contains a username and/or a password, then the client should ignore it, and instead use those specified by the user; or it may display a prompt with the username and/or password already filled in and allow the user to adjust them before sending the request (in this case, the existing password should not be hidden in the prompt, but if a new password is entered then it should normally be hidden (unless the user sets an option to not hide passwords)). (This paragraph is advisory only.)

The username and password can be spied on if TLS is not used. If you consider this significant, then a client MAY wish to warn a user about logging in when using an insecure connection.
If the client software knows that this server allows secure connections, then it may also have an option to automatically redirect to the secure connection (it MUST NOT do such a thing automatically unless the end user can easily disable this feature).

An alternative, which is only suitable for uploading and which does not hide the data being transferred, is to use HMAC for security. (It is also possible for HMAC, client certificates, and username/password to be combined, although it is not usually useful to combine the use of client certificates with the other methods.)

If favicons are implemented (which is not mandatory), then the client MUST offer an option to disable favicons (which might or might not be the default), and MUST display them separately from any other icons (if any). The client MUST NOT make separate requests to download favicons automatically. An implementation MAY allow the user to specify their own favicons, and MAY have an option to auto-generate favicons.

Client certificates are not encrypted with TLS 1.2, but are encrypted with TLS 1.3. Therefore, clients SHOULD warn when using a client certificate with TLS version 1.2 or earlier.

=== Missing details ===

This section describes what is currently missing or incomplete in this document:

* Some of the above sections have parts which are incomplete.
* Any part of the document that is unclear should be improved.
* Comparison with other protocols and file formats.
* Better examples.
* A reference implementation. (This is partially written.)
* A way to work with transports other than the internet.
* Possibly remove the recommended waiting time for 4x responses. I am not so sure that it was a good idea to add it (some people have also thought it is not such a good idea in Gemini, either).

(It is also possible that changes other than the above may be made in future, including changes to existing things, and possibly also removing things if they are unnecessary.)
=== FAQ ===

*** What are the design goals?

* Simplicity is important, but there should still be a good set of features. Features should be made optional as much as possible (and authors should use only the really necessary parts), and can be implemented independently as needed. However, it is also necessary to consider whether simplifying some things too much would result in additional complexity elsewhere (a criticism once made of the Gemini protocol).
* Low-level programming is considered, and not only high-level programming.
* Multiple implementations should be possible, and they might have some of the same and some different features. It should also be possible for implementations of some protocols/formats to be used together with other implementations of other protocols/formats, e.g. if there is a link to a picture you can use an external program to display or convert it.
* Whether or not HTTP, HTML, and other protocols do something (or whether or not it can emulate them) is not a reason to do or not do something in this protocol and file format. This is not meant to be a subset or a superset of the capabilities of any other protocol or file format.
* It should be designed for user autonomy, and for things to be set by the user instead of by the document author, where possible. (For example, there is no CSS.) (This is similar to Gemini's principles.)
* Criticisms of this and other formats (such as Gemini) should be considered, although they might be rejected.
* Other protocols and file formats are not obsolete, just as Gemini does not intend to replace gopher and web, either. (It is especially not intended to replace NNTP and IRC; those protocols should still be used for public communication of many writers, since those are better ways of doing it than using web forums, mailing lists, proprietary apps, etc.)
* The file format should be suitable both for display on screen and for printing out.
(As in Gemini, there are no inline links, although Gemini has a different reason to avoid inline links.)
* Avoid wasting the computer's power when it is not needed.
* The non-extensibility of Gemini does not work quite as well as its designers had intended, as time has shown.

*** What is the reason for the name?
Gemini is the name of a constellation (even though that is not why the Gemini protocol was named as such), so I used a different constellation, the Scorpion.

*** What is the reason for the default port number?
It was chosen by asking someone else, who suggested 1517 because it is a date of historical significance (the Protestant Reformation), as well as being a number between 1024 and 32767. According to the person who suggested it, "the current state of the internet is in dire need of reformation especially how dire a perverted mess the web has become", and the author of this document agrees with that too.

*** What is the TRON character code?
For some details, see:
http://tronweb.super-nova.co.jp/chinesecharsandtroncode.html
http://tronweb.super-nova.co.jp/unicoderevisited.html
http://zzo38computer.org/fossil/osdesign.ui/finfo?name=draft/charsets
http://fileformats.archiveteam.org/wiki/TRON_code
Note that the Unicode planes are deprecated in the Extended TRON Code.

*** Which URL specification is applicable? RFC 3986, RFC 3987, WHATWG, etc?
RFC 3986. However, note that the "hashed:" scheme has its own rules.

*** TLS is complicated; why don't you use a simpler mechanism (e.g. Noise)?
I had considered that, but it is too difficult to know what to do. See section 4.5.3 of gemini://geminiprotocol.net/docs/faq-section-4.gmi for some other possible answers. It is possible that someone may be able to define a way to specify the use of some other transport mechanism by the host name part of the URL.
In this case, that mechanism might have some other security mechanism built in, in which case the "scorpions:" scheme is not used with it, since the encryption then belongs to a different layer. Also note that the use of the "hashed:" scheme can sometimes mitigate the possibility of spies tampering with the data.

*** Why allow non-TLS? An attacker can easily MITM requests to force non-TLS requests.
An implementation may allow the user to configure it to not use non-TLS for some (or all) servers. (This is similar to "HTTPS-Everywhere", but it is not specific to HTTP(S).) Additionally, the client is supposed to display a warning message if a redirect from TLS to non-TLS (or vice-versa) occurs. I think non-TLS has benefits such as improved simplicity and improved energy efficiency. However, sometimes encryption is desirable, so TLS is permitted, too.

*** Isn't it difficult to detect whether or not the first byte is 0x16 and have an existing TLS library take over the connection if it is?
It should be possible to use recv with the MSG_PEEK flag to look at the first byte without consuming it, before passing the connection to the existing TLS library. (If you are using the Go programming language, see the "small net information services" program; the files named "peekedConn.go" and "maybeTLSListener.go" implement this behaviour. It is not a programming language I use, so I cannot help with questions about the program.)

*** Why is it restricted to ASCII only?
There are a few reasons for this:
* It improves the simplicity in some cases.
* The URL and some other parts are "computer code", where large character sets are inappropriate due to homoglyphs and other complexity; it isn't meant to be text in any language (although some parts of it may resemble English words or abbreviations of them).
* Plain text files are ASCII in the simplest case, but you can still specify a different character encoding anyway.
Note that not everything is restricted to ASCII only; specifically, the documents are not restricted to ASCII only.
(This is also true of Spartan.)

*** Why is it a binary file format? That would make it difficult to use a text editor, and besides, you are redefining well-defined control codes.
A text-based format would be much more difficult for the client to parse, since it would have to handle difficult escaping and nesting and other stuff like that. A binary format will be simpler, especially a "flat" one such as this one, rather than being nested like HTML and XML. There are a few possibilities for how to write the document, such as using a specialized editor, or using a converter or a static site generator.

*** Why is it big-endian? Most computers now use small-endian.
The internet is supposed to be big-endian, isn't it? Although I think that small-endian is better (independently of which computers use it), I do not think the difference is significant enough to be worth violating the convention of the internet in this way. (Also, uxn is big-endian.)

*** Why use ULFI? What problems does it solve compared with MIME?
The main problem with MIME is that you cannot easily specify that a file is of multiple formats (each with their own parameters). UTI solves some of the problems of MIME too but has its own problems (e.g. additional files are needed in order to determine what format another one conforms with), and I designed ULFI to avoid some of these problems. For example, UTI can specify that Markdown conforms with "plain text", but that requires that you have the appropriate files to tell you what it conforms with; it can't know by itself. With ULFI you can specify something like "text:plain:markdown+commonmark" and a program that only understands "plain" will still understand it. (MIME also has a way to do something like this, but it isn't very good and is just "added on".)

*** Why is there no support for styling?
This is deliberate; the style should be controlled by the user, and not by the author.
Documents can only specify a limited set of styles, and the user may be able to configure how exactly each one is displayed (by fonts, colours, etc). Furthermore, different clients for different displays might prefer a different style too, instead of the author having to try to make one style that will work for everyone, only for it to not work for everyone.

*** What should be done if required parameters of status line are missing?
This is technically an error. However, implementations that wish to recover from such errors may treat "2x" as "2x ? :" and "4x" as "4x ?", where the colon by itself is the ULFI that does not specify any file formats. For redirects (3x status codes), any default value would be inappropriate, so a client cannot recover from such errors in this way and instead will have to display an error message.

*** Is it possible to use languages other than Japanese?
Yes. It is not only for Japanese. The TRON character encoding can be used with other languages as well, and the use of the word "furigana" in the specification does not imply that only Japanese is possible.

*** What is the copyright of this file?
This document is public domain.
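
(Non-normative illustration for the question above about missing status line parameters.) The recovery rule could be sketched in Python roughly as follows. The function name is hypothetical, and the sketch only handles the case where all parameters are absent; a status line with only some of its parameters missing is not covered.

```python
def recover_status_line(line: str) -> str:
    """Fill in the default parameters for a status line that has none.

    "2x" becomes "2x ? :" and "4x" becomes "4x ?". A 3x line without
    parameters raises ValueError, because no default redirect target
    would be appropriate.
    """
    text = line.rstrip("\r\n")
    code, _, params = text.partition(" ")
    if len(code) != 2 or not (code.isascii() and code.isdigit()):
        raise ValueError("malformed status code")
    if params.strip():
        return text  # parameters are present; nothing to recover
    if code[0] == "2":
        return code + " ? :"
    if code[0] == "4":
        return code + " ?"
    if code[0] == "3":
        raise ValueError("redirect without a target cannot be recovered")
    return text  # other classes: no default is suggested above
```

A client that prefers strictness can skip this step and report every such line as an error, since the missing parameters are technically an error anyway.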