(DRAFT)
SCORPION PROTOCOL/FILE-FORMAT

Table of contents:
  Protocol
  Status codes
  Detailed status codes
  Client certificates
  Receive subprotocol
  Send subprotocol
  Interactive subprotocol
  Hashed URI scheme
  Document file format
  Extended link attributes
  Metadata blocks
  Data/text sub-blocks
  Languages
  Automated crawling
  Conversion
  Unordered Labels File Identification
  X.509 extensions
  Favicon formats
  Superseding certificates
  Recommendations and other notes
  Dynamic files
  Security issues
  Missing details
  FAQ


=== Protocol ===

The protocol can be TLS or non-TLS. Implementation of TLS is optional but
is recommended; implementation of non-TLS is mandatory. The same port
number can be used for TLS and non-TLS; this is distinguished by the first
byte that the client sends (TLS if it is 0x16 or non-TLS otherwise (in
which case it is the subprotocol byte)).
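
As a minimal sketch of the first-byte check described above (Python is
used only for illustration; the function name is hypothetical and is not
part of this specification):

```python
TLS_HANDSHAKE = 0x16  # the first byte that marks a TLS connection


def classify_first_byte(first: int) -> str:
    """Return "tls", or the subprotocol letter of a non-TLS request."""
    if first == TLS_HANDSHAKE:
        return "tls"
    # Otherwise the byte is already the subprotocol byte of the request.
    return chr(first)
```

A server sharing one port for both variants would peek at the first byte
and then either start the TLS handshake or continue reading the request.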

The default port number (for both TLS and non-TLS) is 1517.

Unless otherwise specified, the text in the protocol is ASCII. (This only
applies to the protocol; the contents of files are not limited to ASCII.)

The first byte is the subprotocol, and then an optional subprotocol
parameter, and then a space, and then the absolute URL (the scheme is
mandatory), and then a carriage return and a line feed. (See the sections
below for the definitions of the send and receive subprotocols.
"Subprotocols" are similar to the different methods of HTTP, such as GET
and PUT.)

The URL is supposed to have a slash after the host name (or after the port
number if present); the client MUST NOT send a URL that follows the host
name or port number immediately by # or ? or a carriage return. If it does
anyways, the server MUST either treat it as though the slash is present or
issue a redirect to / or to the full URL with / added on the end or to a
URL that those URLs would redirect to; this SHOULD be a permanent redirect.
(The client also should add / if it is missing according to the above in
the case of a link or a redirect to such a URL, too.)
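
The slash rule above might be applied by a client (or by a lenient
server) roughly as follows. This is only a sketch; the function name is
hypothetical, and it relies on Python's generic URL splitting rather than
on anything defined by this specification.

```python
from urllib.parse import urlsplit, urlunsplit


def add_root_slash(url: str) -> str:
    """Insert "/" after the host (or port) if the path is empty."""
    parts = urlsplit(url)
    if parts.netloc and parts.path == "":
        parts = parts._replace(path="/")
    return urlunsplit(parts)
```

For example, "scorpion://example.org?x" becomes
"scorpion://example.org/?x".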

The server then sends the status line, which consists of a two-digit status
code, followed by a space and then the parameters (which may be empty), and
then another carriage return and line feed. Parameters are separated by
spaces, although the last parameter may include spaces (the client will
know how many parameters according to the major status code).
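
A status line could be split as in the following sketch. The parameter
counts are taken from the "Status codes" section; the names are
hypothetical, and the strict handling of the space after the code is an
assumption (an implementation may prefer to be more lenient).

```python
# Maximum number of parameters for each major status code.
MAX_PARAMS = {"0": 1, "1": 1, "2": 3, "3": 1, "4": 2,
              "5": 1, "6": 2, "7": 1, "8": 3}


def parse_status_line(line: bytes):
    """Return (code, parameters) for one status line ending in CR LF."""
    if not line.endswith(b"\r\n"):
        raise ValueError("status line must end with CR LF")
    text = line[:-2].decode("ascii")
    code = text[:2]
    if len(code) != 2 or not code.isdigit() or code[0] not in MAX_PARAMS:
        raise ValueError("bad status code")
    if text[2:3] != " ":
        raise ValueError("missing space after status code")
    rest = text[3:]
    if rest == "":
        return code, []
    # Only the last parameter is allowed to contain spaces.
    return code, rest.split(" ", MAX_PARAMS[code[0]] - 1)
```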

The URL may include a username/password (the usual format of username and
password in a URL is used); if so, the client should not display the
password (unless the user has enabled an option to tell it to not hide
passwords). The client software should also include a command to discard
any existing username/password, in order to log out.

However, if the URL contains # then the client should not include the #
and what comes afterward, since that part is only for the client.
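
The request line described above might be built like this (a sketch with
a hypothetical function name; rejecting a space in the parameter is an
assumption based on the space being the separator before the URL):

```python
def build_request(subprotocol: str, url: str, parameter: str = "") -> bytes:
    """Build "<subprotocol><parameter> <url>CRLF" without the fragment."""
    if len(subprotocol) != 1:
        raise ValueError("subprotocol is a single byte")
    if " " in parameter:
        raise ValueError("parameter must not contain a space")
    # The fragment part is only for the client and is not sent.
    url = url.split("#", 1)[0]
    return f"{subprotocol}{parameter} {url}\r\n".encode("ascii")
```

For example, a range request for "scorpion://example.org/a.txt#sec" with
the parameter "3-9" is sent as "R3-9 scorpion://example.org/a.txt"
followed by a carriage return and a line feed.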

If the host name of the URL does not match any host name that this server
serves, or if the scheme is neither "scorpion:" nor "scorpions:", then it
is a proxy request (which the server may refuse if it wishes, due to
whatever criteria they want). Proxies should normally avoid converting the
files into a different format. (Note that a proxy to another service on
the same server might be implemented without needing to make another
connection, although this is not mandatory and not something that the
client needs to worry about.)

If it is not a proxy request, then the TLS and non-TLS variants of the
same protocol URI scheme should be treated as equivalent by the server
(but not by the client, which MUST treat them differently). (Proxied
requests will treat the scheme in the request like the client does.)

The recommended use of SNI is:

* Clients SHOULD use SNI when connecting to the server, and MAY have an
option to disable SNI (in order to mitigate some types of spying).

* Servers SHOULD NOT require SNI, and SHOULD ignore any provided SNI and
use the host name in the request instead. (Exceptions are possible, e.g.
if it needs to use a different protocol with the same port number and IP
address for some reason, or if a proxy server is used to forward requests
to another server without needing the proxy to handle encryption, etc.)

* If possible, clients SHOULD allow use of the system's DNS services to
implement encrypted Client Hello; if implemented, there MUST be an option
to disable this feature (although, depending on the implementation, this
option might be a part of a different component of the operating system).

* A server is not required to present a valid certificate if an incorrect
SNI (or no SNI) is provided by the client. Clients that wish to verify the
server's certificate should avoid using incorrect SNI.


=== Status codes ===

The first digit is the major code, and the second digit is the minor code.
Clients may ignore the minor code.

0x = Interactive mode; used only for the "I" subprotocol. After this is
sent to the client then arbitrary two-way communication is possible. The
parameter is optional and is the capability codes if it is not blank.

1x = Requires input; the parameter is the prompt text, which it should
display to the user, and then the user enters any text, and the client
redirects to the same as the current URL but with ? added and then the text
entered by the user, which should be percent-encoded if necessary. If the
existing URL already has a query string, then the result is unspecified
(it is probably best to delete the existing query string, due to how
relative URLs are working); servers MUST NOT use this code in that case.

2x = The response is OK and the contents of the file will follow; after
this line is the data of the file. The parameters are:

* The file size in decimal notation, or ? if the size is unknown (e.g. if
it is a dynamic file).

* The file type/format. MIME has some problems and ULFI is better, but for
now, for compatibility you can use MIME; however, spaces are not allowed.
If the "charset" parameter is not included then the default according to
the file format should be used; "us-ascii" is a recommended default if it
is not otherwise known. (If there is a possibility of the character set
being unknown, servers SHOULD explicitly specify them; this is unnecessary
if the file contains only ASCII characters.)

* Optionally, the version, which if present must consist only of uppercase
and lowercase letters, digits, and forward slash and plus sign. (Clients
and servers are not required to verify that the version code uses this
restricted character set.)

3x = Redirect. The parameter is another URL which it should redirect to,
which can be temporary or permanent. If the original URL contains a
fragment part and the target URL does not, then the client SHOULD add the
fragment part of the original URL to the target URL, too (unless the client
somehow knows that the fragment part is not useful). If the number of
consecutive redirects exceeds the limit (which MUST be not more than five by
default, although it may be configurable by the user), then the client MUST
NOT automatically follow any further redirects.

4x = Temporary error. Has two parameters:

* The number of seconds which is recommended to wait before trying again;
it must be an unsigned 31-bit number (encoded as decimal in ASCII, not as
binary), or it can be ? if it is unable to estimate the required time
before trying again. (TODO: Possibly remove this; maybe it is not useful.)
(An implementation may have a maximum amount of time that it is willing to
wait; e.g. one week with a random variation of up to 24 hours either way.
Furthermore, if this error is received multiple times in succession, a
client should increase the amount of waiting time.)

* The error message text (optional).

5x = Permanent error. The parameter is an error message. The request should
not be repeated by automatic tasks, unless such tasks are manually reset by
the operator of the computer that controls such tasks.

6x = Client certificate required. The parameters are listed below. The
below section about client certificates has more details.

* The first parameter specifies what URLs the certificate applies to; see
the below section about client certificates. This is only a hint and is not
a requirement; clients MAY ignore it, and MUST allow the user to override
this specification with their own.

* The second parameter is arbitrary text which should be displayed to the
user, to explain what kind of client certificate is needed and/or why it is
needed, etc.

7x = Ready to receive; used only for the "S" subprotocol. The parameter is
an arbitrary text.

8x = Received data accepted; used only for the "S" subprotocol. The
parameters are the same as for 2x but the data of the file is omitted. If
the file has been deleted then the parameters are omitted.
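
The client-side handling of the 3x rules above (carrying over the
fragment part, and limiting consecutive redirects) might look like the
following sketch. The names are hypothetical, and reading "exceeds the
limit" as "at most five redirects are followed automatically" is an
assumption.

```python
DEFAULT_REDIRECT_LIMIT = 5  # default limit; may be user-configurable


def redirect_target(original: str, target: str) -> str:
    """Add the original fragment part if the target URL has none."""
    if "#" in original and "#" not in target:
        return target + original[original.index("#"):]
    return target


def may_follow(redirects_so_far: int,
               limit: int = DEFAULT_REDIRECT_LIMIT) -> bool:
    """True if another redirect may still be followed automatically."""
    return redirects_so_far < limit
```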


=== Detailed status codes ===

00 = Beginning of arbitrary two-way communication. (This is only valid for
the "I" subprotocol.)

10 = Requires input. (This protocol does not have the 11 code that Gemini
has; putting the password in the query string isn't very good because then
it is a part of the URL and might not be hidden.)

20 = Response is OK.

21 = Response is OK, but it is only a part of the file and not the entire
file; this is used as the response of a range request. The file size
parameter is the entire file size, and not only the requested part.

30 = Temporary redirect.

31 = Permanent redirect. A client may automatically update bookmarks, etc
if this feature has not been disabled by the user.

40 = Temporary error with no more specific code.

41 = Down for maintenance.

42 = A dynamic file has an unexpected temporary error such as time out.

43 = Proxy error; this request requires the server to make another
connection, but it is unable to do so, or is able to connect but cannot
receive a valid response.

44 = Slow down; the client is sending requests too fast and should wait
before trying again.

45 = Temporarily locked file; used with the "S" subprotocol.

50 = Permanent error with no more specific code.

51 = File not found (maybe it is at Area 51).

52 = The file does not exist (probably because it has been deliberately
removed) and is not expected to exist again; clients should remember this
and not request it again automatically. (This code also means a permanently
locked file, if used with the "S" subprotocol.)

53 = A proxied request was refused by the server. A server MAY also use
this status code if a username/password have been provided in the URL but
are not required to access any files on this server (this is to simplify
the server implementation, so that it can check that it is not its own name
and therefore refuse the request). A server MAY also use this status code
if a URL with no scheme has been provided, for the same reason.

54 = Forbidden request. A username and/or password probably won't help.
Other conditions might or might not help, depending on the implementation
(e.g. it might only permit access to LAN addresses or only to 127.0.0.1).

55 = Edit conflict; used with the "S" subprotocol.

56 = A username and/or password are required. Either none have been
provided, or the username and/or password that have been provided are
incorrect. The error message SHOULD NOT distinguish between an unknown
username and an incorrect password for a known username.

59 = Bad request.

60 = A client certificate is required to access this file, but none has
been provided.

61 = The supplied client certificate is not authorized to access this file.
The certificate may be valid to access a different file, though (possibly
but not necessarily on the same server).

62 = The supplied client certificate is not valid (e.g. because it has
expired or because the signature is not valid).

70 = Ready to receive new file. (If a server sends this response but the
file is created before it has been fully received from the client for any
reason, then the server SHOULD NOT overwrite the existing file and SHOULD
instead send a response with an appropriate 4x or 5x status code.)

71 = Ready to receive to replace an existing file.

72 = Ready to receive data for a use other than a new or modified file.

80 = Accepted received data and created a new file.

81 = Accepted received data to modify an existing file.

82 = Accepted received data for a use other than a new or modified file.


=== Client certificates ===

Implementation of client certificates is optional, but it is recommended
to be implemented if TLS is implemented.

If a 6x response is received on a non-TLS connection, then it should change
the scheme from "scorpion:" to "scorpions:" when a client certificate is
available, before retrying the request (it MUST NOT do this automatically
if no client certificate is available).

The first parameter of a 6x response specifies the suggested set of URLs
that the client certificate is applicable to. It has one character followed
by a URL. The URL may be empty to mean the current URL, or can be any URL
that lacks a scheme and authority. The first character can be:

* "=" = Only that exact URL (as well as any that differ only by the
fragment part; the fragment part is always ignored for all of these modes).

* "+" = The specified URL as well as any that have a different query string
or no query string.

* "*" = Discard the query string of the specified URL, and then it means
the current URL as well as any one consisting of the current URL followed
by / or ? and then anything. If the URL already ends with / then it means
that URL followed by anything.

* "-" = An unspecified set of URLs which includes the specified URL.

(If Gemini is implemented as well as Scorpion, then a 6x response from a
Gemini server SHOULD use the "*" URL set hint (with the current URL) by
default, since that is what the Gemini specification says.)

The set of URLs MUST include the URL that has been requested, and MUST NOT
include any URL that differs by scheme, host, and/or port from the URL
that has been requested. If this is not the case, clients SHOULD treat it
as an unrecognized hint.
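
The URL set hints above could be checked as in this sketch. It assumes
that the caller has already resolved the URL of the hint against the
requested URL to get an absolute URL, and it compares URLs as plain
strings with no further normalization; the function name is hypothetical.

```python
def in_scope(mode: str, scope_url: str, candidate: str) -> bool:
    """Check whether candidate is in the hinted URL set."""
    # The fragment part is ignored for all of the modes.
    scope = scope_url.split("#", 1)[0]
    cand = candidate.split("#", 1)[0]
    if mode == "=":
        return cand == scope
    if mode == "+":
        # Same URL with any query string or with no query string.
        return cand.split("?", 1)[0] == scope.split("?", 1)[0]
    if mode == "*":
        base = scope.split("?", 1)[0]
        if base.endswith("/"):
            return cand.startswith(base)
        return cand == base or cand.startswith((base + "/", base + "?"))
    if mode == "-":
        # The set is unspecified; only the specified URL is known to
        # belong to it.
        return cand == scope
    raise ValueError("unrecognized hint")
```

Since the hint is not binding, the result would only be used as the
default that is offered to the user.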

Clients MUST allow the user to override the specifications above; those
specifications are merely a hint, and are not mandatory to be implemented.
Unrecognized hints shouldn't prevent the user from specifying a client
certificate anyways, but in that case the client MAY require that the user
explicitly specify which set of URLs it applies to; if it does not then
there will be an implementation-dependent default setting.

Client certificates normally apply to all subprotocols used with that URL,
although the user may be allowed to override this. Alternatively, a client
might apply it to all subprotocols if the subprotocol in the request is R,
but limit it to the requested subprotocol if it is S or I.

If the server requires a client certificate and a user name, it should use
a 56 response first, and then 60 once a valid username has been provided.
If the server requires either a client certificate or a user name (of the
client's choice), but does not need both, then it should provide a 56
response if the connection is not using TLS, or 60 if it is using TLS. A
server SHOULD NOT require both a password and a client certificate at the
same time. (TODO: Possibly change this paragraph.)

A client may have any UI its author wishes, but an example of a GUI which
might be used to manage client certificates would be:

  URL: ________________________________
  Match:  (*) Exact  (*) File  (*) Path
  [X] Restrict to current username
  Subprotocols:  [X] Receive  [X] Send  [X] Interactive
  Duration:  (*) Session  (*) Permanent  (*) ___ hours  (*) ___ days
  [X] Remember TLS options
  <Select...>  <Import...>  <Create...>  <Manage...>
  <Set>  <Set and go>  <Cancel>

(A client's UI may differ from the above in whatever ways its author
wants. A similar UI may be used for Gemini, except that "Interactive" and
"Restrict to current username" are not applicable for Gemini.)

In the menu for selecting existing certificates, any existing certificate
for the same domain should be easy to find, and should allow changing the
scope of existing certificates. (This is more important for Gemini than it
is for Scorpion, but it would work with Scorpion too.)

A client MUST NOT automatically generate and send a client certificate
without first asking the user.

When creating a new certificate, the default settings for the new
certificate should be made up as follows (although the user might be
able to override them; the ability to override is not mandatory, since
an external program can be used to make up your own certificates if you
do not want to use the ones included in the browser):

* The key type, signature type, number of key bits, etc should match
that of the server's certificate. This is the most likely to be compatible.

* The "extended key usage" (2.5.29.37) extension should be included, and it
should specify "TLS client auth" (1.3.6.1.5.5.7.3.2) as the key usage.


=== Receive subprotocol ===

This is the usual subprotocol, coded as "R". (It is the only subprotocol
which is mandatory to be implemented.)

The subprotocol parameter can be blank or can be a range request. Servers
are not required to support range requests, and can respond with a 59 code
if it is not implemented. (It is also possible that range requests will be
possible only with some files and not with other files.)

A range request consists of two nonnegative integers in decimal notation
with - in between; these are zero-based file offsets, of the first byte to
receive, and of the first byte to not receive (e.g. "3-9" means the six
bytes, being the fourth, fifth, ..., ninth bytes of the file). The end
address can be omitted in which case it means up to the end of the file.
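
A range request parameter might be parsed as follows (a sketch only; the
function name is hypothetical, and rejecting an end offset that is less
than the start offset is an assumption):

```python
import re

_RANGE = re.compile(r"([0-9]+)-([0-9]*)")


def parse_range(parameter: str):
    """Return (start, end); end is None if it means the end of the file."""
    m = _RANGE.fullmatch(parameter)
    if m is None:
        raise ValueError("not a range request")
    start = int(m.group(1))
    end = int(m.group(2)) if m.group(2) else None
    if end is not None and end < start:
        raise ValueError("end offset is before start offset")
    return start, end
```

The result works like a half-open slice: for "3-9", data[3:9] is the six
bytes from the fourth to the ninth.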


=== Send subprotocol ===

This subprotocol is coded as "S".

The subprotocol parameter can be omitted, but if not omitted then it is the
version of the file being replaced; this can check for edit conflicts. (The
server MAY require the version to be specified.)

Optionally, the subprotocol parameter can be a HMAC followed by @ and then
the version (which may be empty). The HMAC is of everything that follows
the at sign, including the entire request and everything the client will
send after the server responds the first time.

The response code should not be 2x nor 8x, but can be any other code. If it
is 7x then this means it is ready for the client to upload the file. If it
is 1x or 3x then they modify the URL and mean that the upload is required to
be made to a different URL instead of the one that was initially requested.

For the client to upload the file, it sends a status line, and if 2x then
it is followed by the data. The possible status codes are:

* 20 = Upload a file. The size is mandatory and you cannot substitute a
question mark instead. The version is optional, and the server may override
it with its own version specification.

* 30 or 31 = Make the file into a redirect to a different file.

* 51 or 52 = Delete a file.

A server does not need to implement all of the above possibilities.

After the client finishes sending, the server sends the status line, which
will be an 8x code if successful or a different code if it is an error.

Note that if the client disconnects before sending its own status line or
before sending the amount of data of the file that it said it was going to
send to the server, then the upload is aborted; if the file is locked then
the server can unlock it, etc. (For example, if the client wants to add a
new file without overwriting an existing one, but the server says 71, then
the client can disconnect, in which case the files on the server will be
unchanged.)


=== Interactive subprotocol ===

This subprotocol is coded as "I", and is used for two-way communication
(usually terminal emulation, although it can be used with any kind of
two-way communication). (The main reason for this is if you have
multiple programs and you want to be able to specify the URL of each one.)

The subprotocol parameter is optional but if present it is the requested
capability codes; see below.

If it is acceptable, then the server sends a 00 response; its parameter is
the actual capability codes or can be blank. After that, it will continue
with ordinary two-way TCP communication. (Other valid status codes are 4x,
5x, and 6x, in which case the server closes the connection after that, like
it does with the other subprotocols. The 2x codes MUST NOT be used.)

If the client disconnects after receiving the 00 response (and possibly any
further data) but without sending any further data to the server, then the
server should not change the state; it should assume that the client had
connected by mistake and did not intend to do so.

Capability codes have no delimiters between them, and each one has the
following format (similar to CSI codes, so that implementations can use
the same subroutines to parse them if they do terminal emulation):

* Zero or one byte in range 0x3C to 0x3F.

* Zero or more bytes in range 0x30 to 0x39 or 0x3B.

* Zero or one byte in range 0x21 to 0x2F.

* One byte in range 0x40 to 0x7E.

(Note that the protocol before the actual data will be ASCII only, so that
it is possible to use even with terminal emulators that do not implement
this subprotocol.)

The ABNF of the capability codes is:

  capability-codes = *capability
  capability = prefix parameters middle suffix
  prefix = %x3C-3F / ""
  parameters = *(DIGIT / ";")
  middle = %x21-2F / ""
  suffix = %x40-7E

Possible capability codes:

* a = The client uses this if it wishes to send command-line arguments
and/or environment variables to the server. If so, the server will send
back the same capability code, possibly with different numbers, and then
the client sends the command-line arguments and environment variables,
each of which is null-terminated. The numbers are first the number of
command-line arguments, and then optionally a semicolon and then the
number of environment variables. Each environment variable name is not
allowed to contain an equal sign, since the equal sign is used to separate
the name from the value. The server will not send back this capability
code if the requested file cannot use this capability; if it does send
it back, then the numbers might be different than those that the client
requested; the number of arguments/variables that the client sends must
match those specified by the server. The server MUST NOT send this
capability code if the client did not request it.

* L = Line mode. The value is 0 for no local line editing (and no local
echo), or 1 for local line editing so that the text is not sent until an
entire line is entered and the user pushes send. Even in line mode it is
still possible for the server to send data to the client while the user
has partially entered a line of text; the client should handle that by
displaying the user's text entry separately.

* T = Terminal type (not currently defined).

* x = Screen size. The value is two numbers being the number of columns
and the number of rows; in both cases zero can mean unlimited. Optionally
can have a third number; 1 means the client handles pagination.

(Note that it is not required to implement the above capability codes. In
some cases, some or all of them might be not applicable, e.g. if it is not
a terminal emulation.)
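
A string of capability codes in the format given above could be split
with a regular expression, as in this sketch (the names are hypothetical;
the suffix byte is treated as mandatory, following the list of parts at
the beginning of the description of the format):

```python
import re

# prefix, parameters, middle, suffix
CAPABILITY = re.compile(
    rb"([\x3C-\x3F]?)([0-9;]*)([\x21-\x2F]?)([\x40-\x7E])")


def parse_capabilities(codes: bytes):
    """Return a list of (prefix, parameters, middle, suffix) tuples."""
    result = []
    pos = 0
    while pos < len(codes):
        m = CAPABILITY.match(codes, pos)
        if m is None:
            raise ValueError("bad capability code at offset %d" % pos)
        result.append(m.groups())
        pos = m.end()
    return result
```

For example, "2;1a1L80;24x" splits into the "a" code with the numbers 2
and 1, the "L" code with the value 1, and the "x" code with the numbers
80 and 24.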


=== Hashed URI scheme ===

The "hashed:" URI scheme has the format "hashed:X/Y,Z" where X is the
hash algorithm, Y is the hash (in hexadecimal format), and Z is another
URL (which can be absolute or relative, and can be of any scheme, including
another "hashed:" URL, in case you want to specify multiple hashes which
are using different hashing algorithms).

It refers to the same file as Z if X/Y is that file's hash, or is an error
if that file's hash is not X/Y.

If "[X|Y]" means the absolute URL corresponding to the URL Y treated as
relative to the absolute URL X, then the rules for resolving relative URLs
of this scheme are as follows:

"[hashed:A/B,C|D]" = "[C|D]" if "D" does not start with "#"

"[hashed:A/B,C|#D]" = "hashed:A/B,[C|#D]"

"[A|hashed:B/C,D]" = "hashed:B/C,[A|D]"

In case the notation is confusing: "[http://example.org/files/1.txt|2.txt]"
= "http://example.org/files/2.txt" is a valid equation, and the uppercase
letters in the above are placeholders.

For example, if the current URL is "hashed:0/ab8974,file:///tmp/help.txt"
and you want to access the relative URL "help2.txt", then the new absolute
URL is "file:///tmp/help2.txt" and not
"hashed:0/ab8974,file:///tmp/help2.txt", because "0/ab8974" is the hash of
"help.txt" and not of "help2.txt". However, if the relative URL starts with
# then it is a link to another part of the same file, so it is the same
file and therefore has the same hash and therefore it should not strip out
the hash in this case.
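
The three resolution rules above can be sketched as a recursive function.
The names are hypothetical. Note that Python's urljoin only resolves
relative URLs for the schemes that it knows about (such as "http:" and
"file:"), so a real client would need its own resolver for "scorpion:"
URLs; the sketch only shows how the "hashed:" wrapper is handled.

```python
from urllib.parse import urljoin

PREFIX = "hashed:"


def split_hashed(url: str):
    """Split "hashed:X/Y,Z" into ("X/Y", "Z")."""
    spec, _, inner = url[len(PREFIX):].partition(",")
    return spec, inner


def resolve(base: str, rel: str) -> str:
    """Resolve rel against base, following the rules for "hashed:"."""
    if rel.startswith(PREFIX):
        # [A|hashed:B/C,D] = hashed:B/C,[A|D]
        spec, inner = split_hashed(rel)
        return PREFIX + spec + "," + resolve(base, inner)
    if base.startswith(PREFIX):
        spec, inner = split_hashed(base)
        if rel.startswith("#"):
            # Same file, so the hash is kept.
            return PREFIX + spec + "," + resolve(inner, rel)
        # Different file, so the hash is dropped.
        return resolve(inner, rel)
    return urljoin(base, rel)
```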

The hash algorithms are specified as hexadecimal numbers without a leading
zero, which are multicodec numbers (but not encoded as varint). The hash
values are specified as an even number of hexadecimal digits, which will
include leading zeros if any. List of hash algorithms:
  11    SHA-1
  12    SHA2-256
  13    SHA2-512
  14    SHA3-512
  15    SHA3-384
  16    SHA3-256
  17    SHA3-224
  d5    MD5
  b250  BLAKE2s (128-bits)
  b260  BLAKE2s (256-bits)

(Note that some hash algorithms are deprecated because they are insecure.)
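
Checking a file against the X/Y part of a "hashed:" URL might be done as
in the following sketch, which maps the algorithm numbers listed above to
Python's hashlib. The names are hypothetical, and comparing the
hexadecimal digits without regard to case is an assumption.

```python
import hashlib

# Algorithm numbers from the list above, mapped to hashlib constructors.
ALGORITHMS = {
    "11": hashlib.sha1,
    "12": hashlib.sha256,
    "13": hashlib.sha512,
    "14": hashlib.sha3_512,
    "15": hashlib.sha3_384,
    "16": hashlib.sha3_256,
    "17": hashlib.sha3_224,
    "d5": hashlib.md5,
    "b250": lambda: hashlib.blake2s(digest_size=16),
    "b260": lambda: hashlib.blake2s(digest_size=32),
}


def matches_hash(algorithm: str, expected_hex: str, data: bytes) -> bool:
    """True if data has the given hash under the given algorithm."""
    factory = ALGORITHMS.get(algorithm.lower())
    if factory is None:
        raise ValueError("unknown hash algorithm")
    h = factory()
    h.update(data)
    return h.hexdigest() == expected_hex.lower()
```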

(A request of the Scorpion protocol that sends a URL using the hashed:
scheme is considered to be a proxied request (even if it is a URL of a file
on that server), and may be refused. Clients MUST NOT send such requests,
unless the user configured it to be a proxy for the hashed: scheme (the
ability to do so is not required to be implemented, though).)


=== Document file format ===

The file format consists of a sequence of blocks, each of which has the
format (there is no global header, delimiters, etc):

* One byte being the block type and character encoding.

* Big-endian 16-bit attribute length.

* Attribute data.

* Big-endian 24-bit body length.

* Body data.

The block types are:

* 0x00 = Normal paragraph. The attribute is unused and MUST be empty. The
body is the text of the paragraph.

* 0x01 to 0x06 = Heading levels 1 (outermost) to 6 (innermost). The
attribute is the part after # in the URL to refer to this section (empty
if it cannot be referred to by the URL), and the body is the heading text.

* 0x08 = Normal hyperlink. The attribute is the URL (in ASCII encoding)
and the body is the link text. The URL can be relative or absolute. If the
attribute is empty then it means the same as the current URL (which isn't
very useful for type 0x08, but may be useful with types 0x09 and 0x0A). If
the attribute contains a null character, then only the part before the null
character is the URL, and the null character itself and anything afterward
will be ignored. Other control characters are not allowed in the URL.

* 0x09 = Hyperlink requesting input. Like 0x08 but it is treated like a
10 status code (with an implementation-defined prompt; it may be the same
as the text of the link) without making the request. This link type is not
to be used for gopher links (if it occurs anyways, a client SHOULD treat
it as a normal hyperlink but with type 7 instead of 1; however, authors
should be aware that a client might incorrectly use a question mark instead
of a tab if this block is used for gopher links).

* 0x0A = Interactive hyperlink. Like 0x08 but with the "I" subprotocol.
Implementation is optional. (Some implementations may wish to use an
external program which is an existing terminal emulator, if they can
add initial input.)

* 0x0B = Alternate service (e.g. mirrors, etc) for the previous block
(which MUST be a link block; if it is also 0x0B then it is an additional
alternate service), or, if there is no previous block, the current file.
The attribute is the URL of the alternate service. The body is not normally
used, but may contain text explaining the alternate service. Clients SHOULD
normally hide this block, although it might have a way to display some kind
of "alternate service" menu, to have an option to display them, to have an
option to automatically select for load balancing, etc. (This is similar
to the "+" type in Gopher menus.)

* 0x0C = Blockquote. The attribute is unused and MUST be empty. The body
is the text of the paragraph.

* 0x0D = Preformatted text. Valid control codes are tab and line feed. The
attribute SHOULD be blank; see below for its meaning (although a client is
allowed to ignore the attribute). The client MUST display this text with a
fixpitch font.

* 0x0F = This block is used for optional metadata such as a digital
signature of the rest of the document. Clients that do not understand it
will ignore this block.

The possible character encodings are:

* 0x00 = TRON-8 (left to right)

* 0x10 = PC (left to right)

* 0x80 = TRON-8 (right to left)
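
The block layout above could be read as in this sketch. The function name
is hypothetical. Splitting the first byte into a low nibble (block type)
and a high nibble (character encoding) is an assumption based on the
values listed above, not something stated explicitly.

```python
def parse_blocks(data: bytes):
    """Return a list of (block_type, encoding, attribute, body)."""
    blocks = []
    pos = 0
    while pos < len(data):
        if pos + 3 > len(data):
            raise ValueError("truncated block header")
        first = data[pos]
        attr_len = int.from_bytes(data[pos + 1:pos + 3], "big")
        pos += 3
        attribute = data[pos:pos + attr_len]
        pos += attr_len
        if pos + 3 > len(data):
            raise ValueError("truncated attribute or body length")
        body_len = int.from_bytes(data[pos:pos + 3], "big")
        pos += 3
        if pos + body_len > len(data):
            raise ValueError("truncated body")
        body = data[pos:pos + body_len]
        pos += body_len
        # Assumed split: low nibble = type, high nibble = encoding.
        blocks.append((first & 0x0F, first & 0xF0, attribute, body))
    return blocks
```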

The control codes are:

* 0x02 = Whatever comes before it is some kind of section number or item
number or a bullet indicating a list item. (This may also be used to
separate the word from the definition in a definition list.)

* 0x05 = Followed by one or more bytes indicating a type of contents (e.g.
a word or phrase in a foreign language compared with the surrounding text,
or a measurement of a specific type (length, mass, etc)), and then 0x06
and then the text and then 0x07. If not implemented (or if this feature is
disabled by the user; and it SHOULD be disabled by default), then it MUST
skip up to the next 0x06 byte. You cannot include any other control
characters in the data part. A data/text sub-block cannot be inside of
either part of a furigana sub-block.

* 0x06 = Separates the data part (before) and the text part (after) of a
data/text sub-block. You are not allowed to nest data/text sub-blocks.

* 0x07 = Ends a data/text sub-block. The text part can contain other
control characters including furigana, but any changes to the formatting
are required to be reset before this 0x07 code.

* 0x09 = Tab; only in a preformatted block. This should not be used if
exact spacing is required, since the way that it is displayed is
implementation-dependent (and possibly configurable by the user).

* 0x0A = Line break; only in a preformatted block.

* 0x10 = Only with PC character code; followed by one byte in range 0x41
to 0x5F, and you must subtract 0x40 to make the code of the graphic
character to display. This is allowed in preformatted blocks as well
as in other blocks.

* 0x11 = Normal style.

* 0x12 = Strong style.

* 0x13 = Emphasis style.

* 0x14 = Fixpitch style. This style MUST be displayed by fixpitch fonts
(but it is acceptable to display everything by fixpitch fonts, which
would mean that a special handling is not required).

* 0x15 = Forward text direction.

* 0x16 = Reverse text direction.

* 0x17 = Begin the main text of furigana. This should be followed by
the text and then 0x18 and then the other text and then the 0x19.

* 0x18 = Begin the furigana text of the furigana. If furigana is not
implemented (or if the user disabled it), then it should display the
main text of a furigana block but should not display the furigana text.
(Alternatively, it might have an option which causes the furigana text
to be displayed in parentheses or some other kind of delimiters, which
would effectively make 0x18 and 0x19 aliases for graphic characters.)

* 0x19 = End of the furigana. You are not supposed to nest any other
control codes inside of the furigana blocks (although 0x10 is allowed,
if this block uses the PC character code; however, furigana probably
would not be common when using PC character codes).

* 0x1B = Used for SGR codes. The next byte MUST be 0x5B, and then zero
or more bytes in range 0x30 to 0x3B except 0x3A, and then one byte
which is 0x6D. This is allowed only in preformatted blocks, although its
use is discouraged. Clients should skip over the SGR code entirely, but
MAY have an option to interpret them.
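
The SGR rule above (0x1B, then 0x5B, then zero or more bytes in range 0x30
to 0x3B except 0x3A, then 0x6D) could be handled roughly as in the sketch
below. The function name skip_sgr is hypothetical. The sketch assumes an
encoding in which the byte 0x1B never occurs inside a multi-byte
character, and it chooses to raise an error on a malformed code; the text
above does not say what a client should do in that case.

```python
def skip_sgr(data: bytes) -> bytes:
    """Return the text of a preformatted block with SGR codes removed."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] != 0x1B:
            out.append(data[i])
            i += 1
            continue
        j = i + 1
        if j >= len(data) or data[j] != 0x5B:
            raise ValueError("0x1B must be followed by 0x5B")
        j += 1
        # Parameter bytes: 0x30 to 0x3B, except 0x3A.
        while j < len(data) and 0x30 <= data[j] <= 0x3B and data[j] != 0x3A:
            j += 1
        if j >= len(data) or data[j] != 0x6D:
            raise ValueError("SGR code must end with 0x6D")
        i = j + 1  # continue after the final 0x6D byte
    return bytes(out)
```

For example, skip_sgr(b"a\x1b[1;31mb\x1b[mc") yields b"abc".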

It is not required to implement most of the control codes, except as
specified above.
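
As an illustration of the furigana fallback described for 0x18 above, a
client that does not implement furigana might reduce a block to its main
text as sketched below. The function name and the "parentheses" option are
hypothetical. The sketch assumes that the bytes 0x17 to 0x19 are
unambiguous in the block's encoding and that ASCII parentheses are
acceptable delimiters; neither is stated in this document.

```python
def furigana_fallback(text: bytes, parentheses: bool = False) -> bytes:
    """Drop the furigana text, or keep it in parentheses if requested."""
    out = bytearray()
    in_reading = False  # True between 0x18 and 0x19
    for b in text:
        if b == 0x17:
            in_reading = False  # the main text follows
        elif b == 0x18:
            in_reading = True   # the furigana text follows
            if parentheses:
                out += b"("
        elif b == 0x19:
            if parentheses and in_reading:
                out += b")"
            in_reading = False
        elif not in_reading or parentheses:
            out.append(b)
    return bytes(out)
```

With the option set, 0x18 and 0x19 effectively become aliases for the
graphic characters "(" and ")", as mentioned above.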

Stateful encodings MUST shift the state at the beginning of each block.
It is also required after 0x18 or 0x19 if the state before such a code
does not match the state before the furigana block, and similarly also
for 0x06 and 0x07. Any document which does not satisfy these criteria may
result in an unreliable display on some clients.

If the attribute of a preformatted block is not empty, then a client
program MAY be able to use it to implement syntax highlighting, equations,
simple diagrams, etc. It MUST have an option to ignore the attribute if the
user wants to display all preformatted blocks as plain text, and MUST treat
unrecognized attributes the same as a blank attribute. Authors should write
the document with the expectation of the client not recognizing it. Clients
MAY display the attribute text of preformatted blocks.

(If data tables are required, you can link to a separate file that
contains the data; you cannot have inline data tables.)


=== Extended link attributes ===

Extended link attributes are optional, both to specify by the author
and to implement by the client.

* 0x20 = The file size (if known; it can be ? if not known) and file format
of the file that the link refers to, in the same format as the 2x response
code. This is not valid for block type 0x0A.

* 0x49 = A hint for the capability string to use with interactive mode.
This is only valid for block type 0x0A.

* 0x72 = Relation types. The value is any number of bytes; if the bytes are
0x01 to 0x7F then they are the relation type from this file to the target
file, and 0x81 to 0xFF are the same relation types but in reverse (from the
target file to this file). The browser should not prevent the display of
links or automatically handle links due to the relation types; however,
features such as shortcut keys, queries, and user-defined styles might be
affected by these relation types.

Valid relation types are:
  0x01 = Next page
  0x02 = Citation
  0x03 = Cross-reference
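
The value of the 0x72 attribute might be decoded as in the following
sketch. The function name is hypothetical. It assumes that a reverse
relation is the forward code with the high bit set (so 0x82 is a reverse
citation), which is one reading of "the same relation types but in
reverse". The bytes 0x00 and 0x80 are not covered by the ranges above and
are skipped here.

```python
RELATION_NAMES = {
    0x01: "Next page",
    0x02: "Citation",
    0x03: "Cross-reference",
}

def decode_relations(value: bytes):
    """Return a list of (relation code, is_reverse) pairs."""
    result = []
    for b in value:
        if b in (0x00, 0x80):
            continue  # outside both ranges; ignored (an assumption)
        # Assumption: the low seven bits are the relation type and the
        # high bit selects the reverse direction.
        result.append((b & 0x7F, bool(b & 0x80)))
    return result
```

A client can look up each code in RELATION_NAMES and ignore codes that it
does not recognize.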


=== Metadata blocks ===

(These might be changed in future)

The first byte of the attribute of a type 0x0F block (a metadata block)
identifies the type of metadata in this block.

* 0x00 to 0x3F = (Reserved)

* 0x40 to 0x7F = (Reserved)

* 0x80 to 0xBF = Applies to everything up to the next metadata block of
the same type.

* 0xC0 to 0xFF = Applies to the entire document.

The meaning of the rest of the attribute, and of the body, depends on the
type; their meaning is described below. The character set is used only if
the body has a meaning and is otherwise usually not meaningful, but
exceptions will be noted in the below specifications.

The possible metadata types are:

* 0x80 = The language of the text of further blocks. The attribute
specifies the language. The body may be empty; if it is not empty then
it contains the name of the language (and should not contain any control
characters).

* 0x81 = Type of article of further blocks. The rest of the attribute is
one byte as follows: 0x00=undefined, 0x41=article, 0x4E=navigation.

* 0xC0 = Modification date/time. The rest of the attribute will be the
big-endian signed 64-bit number of seconds past January 1, 1985, 00:00:00,
UTC, excluding leap seconds. The body is not used.
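
The 0xC0 attribute can be converted to and from a calendar date roughly as
follows. This is a minimal sketch with hypothetical function names; it
assumes that the attribute is exactly the type byte followed by the eight
bytes of the number.

```python
import struct
from datetime import datetime, timedelta, timezone

# The epoch given above: January 1, 1985, 00:00:00 UTC.
EPOCH_1985 = datetime(1985, 1, 1, tzinfo=timezone.utc)

def decode_modification_time(attribute: bytes) -> datetime:
    """Decode the attribute of a type 0xC0 metadata block."""
    if len(attribute) != 9 or attribute[0] != 0xC0:
        raise ValueError("not a 0xC0 modification date/time attribute")
    (seconds,) = struct.unpack(">q", attribute[1:])  # big-endian signed 64-bit
    return EPOCH_1985 + timedelta(seconds=seconds)

def encode_modification_time(when: datetime) -> bytes:
    """Encode a timezone-aware datetime as a 0xC0 attribute."""
    seconds = (when - EPOCH_1985) // timedelta(seconds=1)
    return bytes([0xC0]) + struct.pack(">q", seconds)
```

Python's datetime also ignores leap seconds, which agrees with the
definition above. Values far outside the range that datetime supports will
raise OverflowError in this sketch.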

Metadata blocks are not supposed to affect the display of the document
except for the display of the metadata itself (if metadata display is
enabled). However, there are cases where it may affect features such as
multilingual search (if the user has specified that an inexact match is
satisfactory), speech synthesis, etc.


=== Data/text sub-blocks ===

The first byte of the data part of a data/text sub-block indicates the
type; the rest of the bytes are the parameter (which may be empty for
some types). The possible types are:

* 0x30 to 0x37 = Languages and/or pronunciation. Bit2 means that the
language is specified. The low 2 bits mean: 0=none, 1=phonetic,
2=phonemic. If the language is present, 0x20 separates the language code
from the pronunciation code. (The codes 0x33 and 0x37 are not currently
meaningful.)
(TODO: Specify how the languages and pronouncing are encoded.)
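
Although the encoding of the codes themselves is not yet specified, the
type byte can already be taken apart as sketched below. The function name
is hypothetical, the codes are treated as opaque bytes, and the sketch
assumes that a language code never contains 0x20.

```python
PRONUNCIATION = {0: "none", 1: "phonetic", 2: "phonemic"}

def parse_language_subblock(data: bytes):
    """Return (language, pronunciation kind, pronunciation code)."""
    if not data or not 0x30 <= data[0] <= 0x37:
        raise ValueError("not a 0x30 to 0x37 sub-block")
    kind = data[0] & 0x03
    if kind == 3:
        raise ValueError("0x33 and 0x37 are not currently meaningful")
    rest = data[1:]
    if data[0] & 0x04:
        # Bit2 set: a language code comes first, then 0x20, then the rest.
        language, _, pronounce = rest.partition(b"\x20")
    else:
        language, pronounce = None, rest
    return language, PRONUNCIATION[kind], pronounce
```

The payloads used to try this out (such as b"\x35xx yy") are placeholders
only, since no real codes are defined yet.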

(TODO: this should include: languages, pronouncing (which may be combined
with languages or used independently), date/time, and SI units. Possibly
also other things, but also possibly not.)


=== Languages ===

The language specifications are made up according to the criteria:

* If one code is a prefix of another then the shorter one is considered
to be a more general specification that includes the longer one (e.g.
English vs Canadian English). (This improves simplicity of implementation,
since an implementation will not need to have complicated rules to figure
out which language is meant.)

* Both written languages (for e.g. text documents) and spoken languages
(for e.g. audio files) should be considered. Although in many cases the
same codes can be used, they should still be considered separately since
the requirements may be different in each case. Note that some languages
may be purely written languages (e.g. Blissymbols).
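
The prefix rule above means that matching needs nothing more than a prefix
comparison, as in this sketch. The function names are hypothetical, and
the codes in the usage example are placeholders, since the actual codes
have not been chosen.

```python
def is_same_or_more_general(general: bytes, specific: bytes) -> bool:
    """True if 'general' is equal to or a prefix of 'specific'."""
    return specific.startswith(general)

def most_specific_known(code: bytes, known):
    """Pick the longest code from 'known' that includes 'code', or None."""
    candidates = [k for k in known if is_same_or_more_general(k, code)]
    return max(candidates, key=len, default=None)
```

For example, with the placeholder codes b"x" (general) and b"xa" (more
specific), a document marked b"xab" would be matched to b"xa".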

(TODO: Possibly use a variant of ISO 639 or Glottolog or something else.)


=== Automated crawling ===

(This section is currently a draft and may be changed in future. There are
some disputes about some of the below, so it is likely to be changed.)

Please note that none of this section is actually enforceable, nor is it
intended to be. It is intended to be a set of guidelines for bots that is
likely to be implemented correctly when it is implemented. It is a similar
idea to "robots.txt", but is meant to be less ambiguous.

The recommendation for automated crawling/indexing is described here. This
specification applies to recursive crawlers that automatically download
files, especially if they use recurring intervals, and to public search
engines and mirrors that work automatically. It does not apply to users
that manually download files or that only download a single list of files
once, nor does it apply to proxies, gateway services, etc (but see below
about proxies that are themselves available to be crawled).

Note that this does not prohibit anyone from making links to any files
regardless of whether or not crawling is allowed.

It should first try to download the file named "/.special/crawl" to find
the policy set up by the server administrator (it should not do this more
than once per crawling interval). Depending on the status code returned by
the server:

* 2x = Read and parse the file according to the below specifications.

* 4x = Do not access the server for at least the specified amount of time,
possibly plus some random number. After that, the crawler MAY try again,
and will again try to download the /.special/crawl file.

* 5x = No crawling policy is available. (The behaviour of a crawler in such
a case is not specified by this document.)

Note that you cannot assume that the crawling policy file will not be
changed in future. If you start over the crawling then you should try to
download the crawling policy file again.

The format of the file is lines ending with line feeds; each line starts
with one byte command code, and then the parameter. If the command code
has bit5 set then it SHOULD skip that line if it is not understood. If it
has bit5 clear then the client MUST treat the entire file as not understood
and should not proceed with crawling (and it might abort with an error
message in this case, explaining what the problem is).

Each crawler also has zero or more names, which are sequences of printable
ASCII characters.

The commands are:

* "`" = A comment that has no meaning. It may contain information which is
useful for users, search engine operators, mirror operators, etc, such as
downloading an archive file that contains all of the data.

* "@" = The parameter is the name of the crawler. Any lines preceding the
first line with @ are effective, and anything from the first @ line
matching the crawler's name up to the next line with @ are effective; all
other lines are ineffective. Ineffective lines MUST be ignored even if the
bit5 of the command code is clear.

* "i" = Suggests indexing the specified prefix.

* "v" = Suggests not indexing the specified prefix.

* "d" = Means that the files with the specified prefix are probably dynamic
so it might not be useful to mirror them.

* "c" = Allows crawling the specified prefix.

* "C" = Disallows crawling the specified prefix.

* "n" = Estimated number of files to download.

* "t" = Estimated total size of files to download.

* "N" = Maximum number of files to download.

* "P" = Maximum number of simultaneous downloads.

* "D" = Minimum delay (in seconds) after downloading one file before
proceeding with the next one.

* "R" = Minimum delay (in seconds) after starting to download the first
file before starting over from the beginning.

* "w" = Suggested time (in seconds) to wait before downloading the crawling
policy file again. Note that the crawling policy file still counts as a
file for the purpose of the D and R commands, too.

* "a" = (This command is intended to be used for archiving, but the
specification of the archiving has not been written yet.)

Numbers are given in decimal notation, and are always nonnegative integers,
using only digits 0 to 9. URL prefixes start with / and are relative to the
root directory of the server; it matches all URLs that it is a prefix of
(including that URL itself).

Once a command is found that matches the URL being accessed, then it should
ignore all further commands for the purpose of accessing that URL. However,
it still must keep those commands in memory or on disk so that it can refer
to them again later for another access. (It is also possible to work in an
alternative way, by somehow converting the data into an internal format on
the client that can more efficiently determine the access policy, as long
as the behaviour matches that described here.) If no command is found that
matches the URL that it wants to access, then assume that an implicit "C/"
follows (meaning it is disallowed).
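
The rules above for effective lines, unknown commands, and first-match
lookup might be implemented along the lines of the sketch below. The
function names are hypothetical. It makes several assumptions that this
document does not state: empty lines are skipped, only the "c" and "C"
commands are consulted to decide whether crawling is allowed, and the "a"
command is treated as not understood (which is harmless, since it has bit5
set).

```python
# Commands this sketch understands ("@" and "`" are handled separately).
KNOWN = {bytes([c]) for c in b"ivdcCntNPDRw"}

def effective_lines(policy: bytes, names):
    """Return the effective (command, parameter) pairs for a crawler, or
    None if the file must be treated as not understood."""
    result = []
    effective = True      # lines before the first "@" are effective
    section_used = False  # True once a matching "@" section has started
    for line in policy.split(b"\n"):
        if not line:
            continue  # assumption: empty lines carry no command
        command, parameter = line[:1], line[1:]
        if command == b"@":
            # Only the first "@" line matching one of our names counts.
            effective = (not section_used) and parameter in names
            section_used = section_used or effective
        elif not effective or command == b"`":
            continue  # ineffective lines and comments are ignored
        elif command in KNOWN:
            result.append((command, parameter))
        elif not command[0] & 0x20:
            return None  # unknown command with bit5 clear
    return result

def crawl_allowed(lines, path: bytes) -> bool:
    """Apply the first "c" or "C" command whose prefix matches the path."""
    for command, prefix in lines:
        if command in (b"c", b"C") and path.startswith(prefix):
            return command == b"c"
    return False  # the implicit "C/"
```

The set of names passed to effective_lines is expected to contain the
empty string b"" in addition to the crawler's other names.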

A crawler may have a name with <> around it (in addition to its other
names) if it has the purposes described below:

* "<MIRROR>" = Mirrors and backups.

* "<SEARCH>" = Public search engines and indexing. This also applies to
proxies which do not themselves have a policy to prohibit indexing.

* "<STUDY>" = Programs that are intended to study statistical properties
such as number of files, average file sizes, broken links, etc.

All crawlers MUST have an empty string as one of their names.

If a proxy service is available to be crawled/indexed, then the proxy
service should also check the above policies, and either refuse to proxy,
or set up its own policy which prohibits access to the proxied files
to automated crawlers (either conditionally (according to which Scorpion
server is being accessed) or unconditionally (for all proxied files,
regardless of which server it is accessing through the proxy)).

Clients MAY add a query string when requesting the crawling policy file
which identifies the crawler. (This is only for the crawling policy file;
it is not supposed to do that for any other file. Also, the identification
does not necessarily match the crawler's name as described above.)

Crawlers that receive a 41 response when downloading any file other than
the crawling policy file SHOULD try to download the crawling policy file
again after waiting for the minimum time specified in the response
(whether or not it has previously downloaded the crawling policy file
successfully in the past). If it receives such a response too many times,
then it SHOULD stop for a longer amount of time than specified.

The crawling policy file is not allowed to be retroactive.


=== Conversion ===

This part of the specification is optional.

Conversion between file formats is possible, and can be specified by the
file called "/.special/conversion". Clients MUST NOT try to download this
file unless the user explicitly commands the computer to do so; it is not
supposed to do so merely by finding a file that is not known how to handle.
(Client software also MUST allow the user to override them, and to remove
any such files that have already been downloaded.)

Furthermore, any client that is able to understand it MUST NOT require that
it comes from the same server as the file; it can also be a local file
which has been written by the end user, a file from another server (or the
same server but a different path) etc, and once a conversion has been
enabled by the user then the same conversion might be used with other
servers too if appropriate (not all kinds of conversions are appropriate
for using with other servers (e.g. if file name rewriting is used then it
is not appropriate for arbitrary servers), but some can). If it does use
it with other servers, the user MUST be able to configure this feature,
e.g. so that it does not work with other servers.

Clients might also have their own mechanisms for conversion beyond what are
written here, e.g. allowing to use local programs with pipes (which is a
recommended way of doing so); such ways can only be used if set up by the
end user or the system administrator and cannot be specified by servers.
(This is possible even if the rest of the specification in this section of
this document is not implemented.)

Each record consists of the 8-bit record type, and then four fields which
are each the big-endian 16-bit length and then the data.
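
A reader for this record layout might look like the following sketch. The
function name is hypothetical, and it assumes that the file is simply a
sequence of records with no header or trailer, which is not stated here.

```python
import struct

def parse_conversion_records(data: bytes):
    """Return a list of (record type, [field1, field2, field3, field4])."""
    records = []
    pos = 0
    while pos < len(data):
        record_type = data[pos]  # the 8-bit record type
        pos += 1
        fields = []
        for _ in range(4):
            if pos + 2 > len(data):
                raise ValueError("truncated field length")
            (length,) = struct.unpack_from(">H", data, pos)  # big-endian 16-bit
            pos += 2
            if pos + length > len(data):
                raise ValueError("truncated field data")
            fields.append(data[pos:pos + length])
            pos += length
        records.append((record_type, fields))
    return records
```

Fields are returned as raw bytes; an empty field is a zero length followed
by no data.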

The high nybble of the record type (which is still a part of the record
type; it is not considered to be a separate field) defines how the first
and second fields are interpreted (although some record types allow one
or both fields to be empty):

* 0x00 = The first field specifies the input format and the second field
specifies the output format.

* 0x80 = The first field specifies the original URI scheme and the second
field specifies the target URI scheme.

The record types are:

* 0x01 = File name rewriting. The third field is a file name suffix
(excluding any ? or # and anything that comes afterward) of the original
file (if the file name does not match, then this record does not match
and cannot be used with this file), and the fourth field is the suffix to
replace the original suffix with to find the alternative file.

* 0x02 = Use a program to convert the file. The third field is the URL of
the program. The fourth field is described below.

* 0x03 = Use interactive mode. The second field should be blank. The third
field is the recommended capability.

* 0x04 = Use a program to display the file. The third field is the URL of
the program. The fourth field is described below. The second field should
be blank.

* 0x05 = The third field is the URL of a document which explains the file
format. The file format of the document is either Scorpion format or plain
ASCII text format. The second and fourth fields are not used.

For record type 0x02, the fourth field has the first byte specifying the
program format and the rest as the parameters:

* 0x01 = It is a binary uxn / varvara program. The file name and file stat
ports should not be used, and neither should the date/time ports be used,
nor any device ports that are not valid for non-GUI mode. If there are
errors during the conversion, they should be written to stderr and end
with a nonzero exit code.

For record type 0x04, the fourth field has the first byte specifying the
program format and the rest as the parameters:

* 0x01 = It is a binary uxn / varvara program. It will run in GUI mode but
is likely to be sandboxed (which clients SHOULD do if possible).

For record type 0x02 and 0x04 and program format 0x01, the parameter is one
byte, and is a bit field as follows:

* bit0 = Clear for stdin/stdout, or set if the first file device is the
input file (which is read only) and the second file device is the output
file (which is normally write only, but other flags may affect this).

* bit1 = Set if the input file is seekable and can be closed and reopened
(which requires some of the nonstandard features of uxn38).

* bit2 = Set if the output file is seekable and can be closed and reopened,
and might also be read by the program as well as written. (These also
require some of the nonstandard features of uxn38.) (This bit should not
be set if the record type is 0x04, since there is no output file.)

* bit3 = Set if the first command-line argument is the picture size, in the
decimal ASCII format (only digits 0 to 9) with "x" in between (where the
horizontal size is first, and then the vertical size). (This is meant for
scalable vector diagrams.) (This bit should not be set for record type
0x04; in that case, the preset screen size that can be read back from the
screen device is the suggested picture size (which can be changed).)
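
The fourth field for program format 0x01 could be decoded as in this
sketch. The function names and dictionary keys are hypothetical; the bit
assignments are those listed above.

```python
def decode_uxn_parameter(fourth_field: bytes) -> dict:
    """Decode the fourth field of a 0x02 or 0x04 record (format 0x01)."""
    if len(fourth_field) != 2 or fourth_field[0] != 0x01:
        raise ValueError("not program format 0x01 with a one-byte parameter")
    bits = fourth_field[1]
    return {
        "file_devices": bool(bits & 0x01),     # bit0: files, not stdin/stdout
        "input_seekable": bool(bits & 0x02),   # bit1
        "output_seekable": bool(bits & 0x04),  # bit2
        "size_argument": bool(bits & 0x08),    # bit3
    }

def picture_size_argument(width: int, height: int) -> str:
    """Format the first command-line argument used when bit3 is set."""
    if width < 0 or height < 0:
        raise ValueError("sizes must be nonnegative")
    return f"{width}x{height}"
```

A client handling record type 0x04 may additionally want to check that
bit2 and bit3 are clear, since they should not be set in that case.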

Note that it is OK if there are multiple records which match the same file.

Recommended output formats for pictures are farbfeld (true colours) and
XPM2 (indexed colours, which also allows specifying symbolic colours which
can be specified by user preferences). In the XPM2 format, you should not
use X11 colour names; only hex, "black", "white", "None", and symbolic
names (with the "s" colour type) should be used (although monochrome and
grey scales are usable too, in addition to full colours and symbolic).


=== Unordered Labels File Identification ===

ULFI works according to the rules:

* The valid characters are printable ASCII characters other than spaces,
slashes, backslashes, quotation marks, and apostrophes.

* A name consists of letters, digits, hyphens, dots, and underscores,
and must start with a letter or underscore, and cannot end with a dot
nor have two consecutive dots.

* The ULFI string consists of a set of parts with colons in between. The
order of the parts doesn't matter; it is equivalent even if the parts are
in a different order.

* Duplicate parts are improper and redundant.

* A part consists of a name, followed by an optional parameter block or
inner block. The name may also contain plus signs, as explained below.

* A part name with a plus sign is a shortcut for the full part name before
the plus sign, as well as the part name that substitutes a dot in place of
the plus sign. There may be multiple plus signs, in which case e.g. "a+b+c"
is the same as "a:a.b:a.b.c".

* A parameter block consists of [ and ] with zero or more characters in
between other than "[", "]", "<", ">", "{", "}", "(", and ")".

* An inner block consists of < and > with another ULFI in between; this
other ULFI cannot itself have an inner block. (This can be used for
indicating such things as a compressed file, e.g. "gzip<text>".)

Example 1: The ULFI "a.b:c" and "c:a.b" are equivalent to each other.

Example 2: The ULFI "a:b+c:d" and "a:b:b.c:d" are equivalent to each other.

(Note: It is rarely necessary to compare ULFIs for equality. Usually, a
program would be looking for one or more labels that it recognizes, and
only interpret those ones. One way to handle this is to use a bit field
to remember what has been found so far during parsing.)
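
The splitting and plus-sign rules above can be sketched as follows. The
function names are hypothetical. The sketch does not validate names,
compares the contents of blocks verbatim (so it does not notice that two
inner blocks are equivalent in a different order), and rejects a part that
has both a plus sign and a block, since how the block attaches to the
expanded names is not stated here.

```python
def split_parts(ulfi: str):
    """Split a ULFI at colons that are outside [ ] and < > blocks."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(ulfi):
        if ch in "[<":
            depth += 1
        elif ch in "]>":
            depth -= 1
        elif ch == ":" and depth == 0:
            parts.append(ulfi[start:i])
            start = i + 1
    parts.append(ulfi[start:])
    return parts

def labels(ulfi: str) -> frozenset:
    """Return the set of parts after expanding plus signs."""
    result = set()
    for part in split_parts(ulfi):
        starts = [i for i in (part.find("["), part.find("<")) if i >= 0]
        cut = min(starts, default=len(part))
        name, block = part[:cut], part[cut:]
        pieces = name.split("+")
        # "a+b+c" expands to "a", "a.b", and "a.b.c".
        expanded = [".".join(pieces[:n]) for n in range(1, len(pieces) + 1)]
        if block and len(expanded) > 1:
            raise ValueError("plus sign combined with a block is not handled")
        expanded[-1] += block
        result.update(expanded)
    return frozenset(result)
```

Because the result is a set, the order of parts and any duplicate parts do
not affect it, so labels("a:b+c:d") equals labels("a:b:b.c:d"). A program
looking for particular labels can simply test for membership.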


=== X.509 extensions ===

The use of these extensions is optional. Clients and servers are not
required to read them, and whoever sets up the certificates is not
required to set these extensions. Most of these extensions are independent
of the protocol; they are not only for use with Scorpion protocol, and
they may be used with any protocol.

* 2.25.327847519394146920218914781995694576662.1.6 = The data type is a
set of integers. The values are the port numbers for use with TCP. This
extension is for use with server certificates.

* 2.25.327847519394146920218914781995694576662.2 = The data type is an
octet string, and is a comment encoded as TRON-8 encoding.

* 2.25.327847519394146920218914781995694576662.3 = Favicons (clients MUST
allow favicons to be disabled, and MAY be disabled by default). The data
type is a sequence; each element of the sequence represents one icon. An
implementation might support none, some, or all of the formats. See the
section below about the formats.

* 2.25.327847519394146920218914781995694576662.4 = A boolean which
indicates if the client wants the server to publish the certificate or
not. If true then it means you want the server to publish the certificate,
and if false then not.

* 2.25.327847519394146920218914781995694576662.5.1 = The time zone in the
format expected for the TZ environment variable in UNIX systems, e.g.
"America/Vancouver". The data type is an IA5 string.

* 2.25.327847519394146920218914781995694576662.6 = This extension is meant
for securely specifying that one certificate supersedes another, even if
it is self-signed. See the below section about "Superseding certificates"
with the format of this extension.


=== Favicon formats ===

If the data type is boolean, then true means that it may be added to the
preload list (in an implementation-dependent way), and false means that it
is not supposed to be added to the preload list.

If the data type is a UTF-8 string, then it is interpreted as specified by:
  gemini://mozz.us/files/rfc_gemini_favicon.gmi

If the data type is an implicit type 1 octet string, then it is a Haiku
vector icon format, except that the initial <6E 63 69 66> is omitted.


=== Superseding certificates ===

(The below seems like it would work, but there is some difficulty to get
it to work, since it requires computing signatures of the certificate
with some fields cleared. I had considered an alternative that uses the
issuer certificate, as well as the subject certificate; this means that
the public key field and list of excluded extensions field are both
unnecessary, since the key is specified in the issuer certificate and
excluding extensions is unnecessary. However, it then requires that
both the issuer certificate and subject certificate must be changed.)

The format of this extension is a sequence consisting of the fields:

* The public key for other certificates to supersede this one, or null.
If it is null, or if this extension is not present, then the certificate's
own key will be used for this purpose.

* The Algorithm Identifier for the hash algorithm which is used for
identifying older certificates which are being superseded. (TODO: Make
the recommendation of which hash algorithm should be used.)

* A bit string which specifies which extensions (where the first bit
corresponds to the first extension in this certificate, etc) should have
their values replaced with all bits clear for the purpose of calculating
the signature of this certificate with the old certificate's keys. If any
bits are present, then the last bit that is present must be set; all
further bits are implied to be clear. This bit string should never specify
that the superseding certificates extension is itself excluded, and any
implementation that understands an extension should not allow excluding it
if such an exclusion is unnecessary. (This is present in order to avoid a
mess with potential future defined extensions which would conflict.)

* A sequence of the Superseding Item structures, explained below. They
should be listed in order from oldest to newest. It is not necessary to
list all of the older certificates, but you should list the few most
recent, especially if they have been changed recently.

The Superseding Item structure is a sequence of the fields:

* The hash of the older certificate.

* The signature (a bit string) of this certificate with the older
certificate's key. For the purpose of computing the signature, use
only the tbsCertificate (not the stuff that comes afterward), but with
all bits cleared in the value (not the type/length header) of the
signatures within all Superseding Items, and all bits cleared within
the value (the octet string data, which wraps ASN.1 DER data) of all
extensions that are excluded.


=== Recommendations and other notes ===

There are only two legitimate uses for the version field of the 2x
response; a client MUST NOT use it for any other purpose. They are:

* To use as the subprotocol parameter of an upload, to avoid edit conflicts.

* When making a range request, to compare the entire status text to check
if the file might have changed. This is only to be used for resuming a
download and is not to be used for caching.

Clients MUST implement the non-TLS protocol and SHOULD implement the TLS
protocol. Servers SHOULD implement both protocols, especially the non-TLS
protocol. Servers SHOULD serve the same files regardless of whether it is
TLS or non-TLS (except files which require a client certificate to access).

Unicode is no good. Clients SHOULD NOT automatically convert TRON code
into Unicode, except as a fallback in case suitable fonts are not available
or a similar problem, and such fallbacks should be avoided if possible.

Redirects should not be overused.

A client SHOULD have a redirect limit; the default redirect limit MUST NOT
exceed five. If a redirect would occur beyond the limit (which may be
configurable by the user), or if a TLS connection redirects to a non-TLS,
or to specific other protocols (e.g. HTTPS), or a connection to the
internet tries to redirect or link to a LAN or localhost address (it
should resolve the DNS first if necessary) or to a local file (the "file:"
scheme), then it should warn the user first (with a non-modal message if
possible), and not redirect unless the user manually allows it.

Clients made for users to view (i.e. not something like curl which is only
for downloading files and not for display) should implement at least the
above file format specification (or a subset of it which is at least the
minimal subset) and the "text/plain" format. It is also recommended
(although a lesser recommendation) to implement the "text/gemini" format,
especially if the client software also implements the Gemini protocol.

Clients should allow URLs entered by the user to be treated as relative
if no scheme has been explicitly specified by the user.

It is recommended to have an option to display a table of contents window
for any file format for which it is suitable.

IDN is not required to be implemented (and is recommended to not do so
unless it is already implemented for other reasons). If it is implemented,
clients should use a different colour to display non-ASCII characters for
security purpose, and should ensure that even invisible characters are
displayed. It must include an option to use only ASCII, and it is
recommended to use only ASCII for domain names.

Clients should not download any additional files or make any additional
network requests due to the contents of a document that is merely being
viewed; they should do so only when the user selects links in the document.
("Selecting" a link means explicitly navigating to it or being redirected
to it, not merely hovering or things like that.)

It is acceptable for a minimal client to only implement the "R" subprotocol
and not implement range requests; the same is true for a minimal server.

If the jar: scheme is implemented, then file: URLs which reference a ZIP
archive should automatically redirect to the jar: URL for the root
directory of the ZIP archive. If it is a Gempub file, it may redirect to
the index file instead (possibly subject to user configuration); even if
it does, it should be possible for the user to manually enter the jar: URL
for a directory listing instead.

It is possible (but probably unnecessary, since SOCKS can be used instead,
which would be better) for a server to implement a proxy with raw data, by
using the "I" subprotocol and the "tcp://" URI scheme in the request. (It
is recommended to use SOCKS instead if possible.)

Don't use the R subprotocol to change the state of files in the server,
except for optional logging (which should only be used for diagnostics).
For example, don't use it for adding user's comments, etc. (If you want to
have a discussion forum with commenting, then NNTP is better. You can link
to an NNTP server from a Scorpion file if you want to. The S subprotocol
may also be used, and so can the I subprotocol; in the case of the I
subprotocol, it should not change the state of files in the server unless
the server receives any data from the client after the server sends the 00
status.)

Clients SHOULD allow the user to specify proxies. It is recommended to
support SOCKS (unless the operating system has the ability to do so without
the application programs being aware of it). It is also recommended (if
use of proxies is implemented) to allow use of a Scorpion server for
proxying with schemes specified by the user; note that if the proxy URL
does not specify a secure connection, then it should not encrypt any data
sent through the proxy even if the scheme for the file being accessed is
"scorpions:" or "gemini:" or another protocol that uses TLS. This is useful
if the end user has set up their own intercepting proxy on their own
computer, since it would avoid needing to encrypt the data twice.


=== Dynamic files ===

Servers can implement dynamic files however they want (including not at
all), but the suggested convention on POSIX systems is as follows:

Set argv[1] to the entire request (excluding CRLF). Set argv[2] to the
part of the request following the name of the external program (also
excluding CRLF); note that there might be a path and/or a query string
in argv[2], or it might be empty, but don't omit it even if it is empty.

It is up to the external program to parse the subprotocol parameters, and
to check if the subprotocol is one that it can handle.

Environment variables can be used for the remote IP address (if any program
requires it) and for client certificates (if any have been provided):

* REMOTE_HOST = The IP address of the client.

* (Client certificate data, implementation-dependent, but see below.
Recommendations may be made for this in future if some format for doing
this becomes common.)

Also on POSIX systems, if the file is actually a UNIX socket rather than
an executable file or a non-executable regular file, then it can work by
connecting to the socket and sending four strings each preceded by the
big-endian 32-bit length, and then communication (always unencrypted) will
go both ways. These four strings are:

* The part of the request up to and including the file name.

* The part of the request after the file name (same as what would be
argv[2] if it was an executable file).

* The client's IP address (what would be the REMOTE_HOST environment
variable if it was an executable file).

* The client certificate (this will be empty (but still present) if no
client certificate has been provided or if the connection is non-TLS).
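
The four length-prefixed strings sent to the socket might be built as in
the sketch below. The function name is hypothetical. All four strings are
treated as opaque bytes, because the textual form of the IP address and
the encoding of the certificate are not specified here.

```python
import struct

def socket_preamble(request_head: bytes, request_tail: bytes,
                    client_ip: bytes, client_cert: bytes = b"") -> bytes:
    """Frame the four strings, each preceded by its big-endian 32-bit length."""
    out = bytearray()
    for item in (request_head, request_tail, client_ip, client_cert):
        out += struct.pack(">I", len(item))  # empty strings still get a length
        out += item
    return bytes(out)
```

A server would connect to the UNIX socket, send these bytes, and then relay
data in both directions between the client and the socket.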


=== Security issues ===

This section discusses security issues. Note that many security issues
with WWW are not applicable to Scorpion and related protocols, since
they do not have the complexity of WWW.

See RFC 5246 and RFC 8446 for the specifications of TLS, which also have
some further information about security issues of TLS.

Security is important as well as simplicity and interoperability.
Therefore, servers and clients are required to support the non-TLS
protocol, while TLS is optional (but is recommended). There is no
requirement of which versions, ciphers, extensions, etc of TLS will be
implemented; an implementation can do whatever it is able to do.

In cases where improved security is required, users might be able to
determine the required public keys ahead of time before connecting to
the server, and may install those keys in the computer. (The method for
acquiring those keys is deliberately not specified in this document; it
is something that each user will have to figure out by themselves.)

The recommended way to handle TLS cipher suites, TLS validation, etc, is
whatever way the client software uses for any protocols that it already
implements (e.g. a Gemini client can use TOFU; curl can validate
certificates in the usual way but allow bypassing with the -k switch).
However, a client MUST include the possibility for the user to change the
options, and for the user to manually add and remove certificates. It
should also be possible for the user to specify the security levels when
manually adding certificates (in order to prevent downgrade attacks).
Regardless of the default settings, the user MUST be allowed to override
them.

TLS session tickets can also be used for tracking, although they can speed
up transfers. By default, clients SHOULD NOT reuse tickets for multiple
connections; see section C.4 of RFC 8446.

If a client certificate has been specified, then clients and servers SHOULD
both remember the TLS options in order to prevent downgrade attacks. (This
may be configurable.)

See also gemini://gemini.ctrl-c.club/~stack/gemlog/2022-02-13.notls.gmi for
some notes against use of TLS.

The username and password in the URL can be used for tracking, as mentioned
in the mailing list of Gemini. Therefore, it is recommended that if a URL
from a link or redirect contains a username and/or a password, then the
client should ignore it, and instead use those specified by the user; or it
may display a prompt with the username and/or password already filled in
and allow the user to adjust them before sending the request (in this case,
the existing password should not be hidden in the prompt, but if a new
password is entered then it should normally be hidden (unless the user sets
an option to not hide passwords)). (This paragraph is advisory only.)
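
One way a client might separate the username and password from a link or
redirect URL is sketched below in Python. The function name is
hypothetical, and the sketch only splits the URL; whether the removed
values are discarded or used to fill in a prompt is left to the client.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_userinfo(url):
    """Return (url_without_userinfo, username, password).

    username and password are None if the URL does not contain them.
    """
    parts = urlsplit(url)
    # Keep only what follows the last "@" in the authority part.
    netloc = parts.netloc.rpartition("@")[2]
    cleaned = urlunsplit((parts.scheme, netloc, parts.path,
                          parts.query, parts.fragment))
    return cleaned, parts.username, parts.password
```

The client can then send the request using the cleaned URL together with
the credentials chosen by the user, or show the returned username and
password in an editable prompt.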

The username and password can be spied on if TLS is not used. If the
implementer considers this significant, then the client MAY warn the user
about logging in over an insecure connection. If the client software knows
that this server allows secure connections, then it may also have an option
to automatically redirect to the secure connection (it MUST NOT do such a
thing automatically unless the end user can easily disable this feature).
An alternative, which is only suitable for uploading and which does not
hide the data being transferred, is to use HMAC for security.

(It is also possible for HMAC, client certificates, and username/password
to be combined, although it is not usually useful to combine the use of
client certificates with the other methods.)

If favicons are implemented (which is not mandatory), then the client MUST
offer an option to disable favicons (which might or might not be the
default), and MUST display them separately from any other icons (if any).
The client MUST NOT make separate requests to download favicons
automatically. An implementation MAY allow the user to specify their own
favicons, and MAY have an option to auto-generate favicons.

Client certificates are not encrypted with TLS 1.2, but are encrypted
with TLS 1.3. Therefore, clients SHOULD warn when using a client
certificate with TLS version 1.2 or earlier.


=== Missing details ===

This section lists what is currently missing or incomplete in this
document:

* Some of the above sections have parts which are incomplete.

* If any part of the document is unclear, improve it.

* Comparison with other protocols and file formats.

* Better examples.

* A reference implementation. (This is partially written.)

* A way to work with transports other than the internet.

* Possibly remove the recommended waiting time for 4x responses. I am
not so sure that it was a good idea to add it (some people have also
thought it is not such a good idea in Gemini either, anyways).

(Changes other than the above may also be made in future, including
changes to existing things, and possibly also removing things if they
are unnecessary.)


=== FAQ ===

*** What are the design goals?

* Simplicity is important, but should include a good set of features.
Features should be made optional as much as possible (and authors should
use only the really necessary parts), and can be implemented independently
as needed. However, it is also necessary to consider whether simplifying
some things too much would result in additional complexity elsewhere (a
criticism that has been made of the Gemini protocol).

* Low-level programming is considered, not only high-level programming.

* Multiple implementations should be possible, and they might have some
of the same and some different features. It should also be possible that
implementations of some protocols/formats can be used together with other
implementations of other protocols/formats, e.g. if there is a link to
a picture you can use an external program to display or convert it.

* Whether or not HTTP, HTML, and other protocols and formats do something
(or whether or not this protocol can emulate them) is not a reason to do
or not do something in this protocol and file format. This is not meant
to be a subset or a superset of the capabilities of any other protocol or
file format.

* It should be designed for user autonomy, and for things to be set by
the user instead of by the document author, where possible. (For example,
there is no CSS.) (This is similar to Gemini's principles.)

* Criticisms of this and other formats (such as Gemini) should be
considered, although they might be rejected.

* Other protocols and file formats are not obsolete, just as Gemini does
not intend to replace gopher and web, either. (It is especially not
intended to replace NNTP and IRC; those protocols should still be used
for public communication of many writers, since those are better ways of
doing it than using web forums, mailing lists, proprietary apps, etc.)

* The file format should be suitable both for display on screen and for
printing out. (As in Gemini, there are no inline links, although Gemini
has a different reason to avoid inline links.)

* Avoid wasting the computer's power when it is not needed.

* The non-extensibility of Gemini does not work quite as well as they had
intended, as time can tell.

*** What is the reason for the name?

Gemini is the name of a constellation (even though that is not why the
Gemini protocol was named as such), so I use a different one: the
Scorpion.

*** What is the reason for the default port number?

It was chosen by asking someone else, who suggested 1517 because it is a
date of historical significance (the Protestant Reformation), as well as
being a number between 1024 and 32767. According to the person who
suggested it, "the current state of the internet is in dire need of
reformation especially how dire a perverted mess the web has become", and
the author of this document agrees with that too.

*** What is the TRON character code?

For some details, see:
  http://tronweb.super-nova.co.jp/chinesecharsandtroncode.html
  http://tronweb.super-nova.co.jp/unicoderevisited.html
  http://zzo38computer.org/fossil/osdesign.ui/finfo?name=draft/charsets
  http://fileformats.archiveteam.org/wiki/TRON_code

Note that the Unicode planes are deprecated in the Extended TRON Code.

*** Which URL specification is applicable? RFC 3986, RFC 3987, WHATWG, etc?

RFC 3986. However, note that the "hashed:" scheme has its own rules.

*** TLS is complicated; why don't you use a simpler mechanism (e.g. Noise)?

I had considered that, but it is too difficult to know what to do.

See section 4.5.3 of gemini://geminiprotocol.net/docs/faq-section-4.gmi
for some other possible answers.

It is possible that someone may be able to define a way to specify the use
of some other transport mechanism by the host name part of the URL. That
mechanism might have its own security mechanism built in, in which case
the "scorpions:" scheme is not used with it, since the encryption then
belongs to a different layer.

Also note that the use of the "hashed:" scheme can sometimes mitigate the
possibility of spies tampering with the data.

*** Why allow non-TLS? An attacker can easily MITM requests to force
non-TLS requests.

An implementation may allow the user to configure it to not use non-TLS for
some (or all) servers. (This is similar to "HTTPS-Everywhere", but it is
not specific to HTTP(S).)

Additionally, the client is supposed to display a warning message if a
redirect from TLS to non-TLS (or vice-versa) occurs.
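
The check for such a redirect can be sketched in Python as below. This
assumes that "scorpion" is the non-TLS scheme and "scorpions" is the TLS
scheme, that both URLs are absolute, and the function name is
hypothetical; redirects involving other schemes are outside the scope of
the sketch.

```python
from urllib.parse import urlsplit

# Assumption: "scorpion" is the non-TLS scheme and "scorpions" the TLS one.
USES_TLS = {"scorpion": False, "scorpions": True}

def redirect_changes_tls(from_url, to_url):
    """True if a redirect between two Scorpion URLs switches TLS on or off."""
    old = USES_TLS.get(urlsplit(from_url).scheme)
    new = USES_TLS.get(urlsplit(to_url).scheme)
    if old is None or new is None:
        return False  # other schemes are not handled by this sketch
    return old != new
```

A client would display its warning message whenever this returns True,
before following the redirect.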

I think non-TLS has benefits such as improved simplicity and improved
energy efficiency. However, sometimes encryption is desirable, so TLS
is permitted, too.

*** Isn't it difficult to detect whether or not the first byte is 0x16 and
have an existing TLS library take over the connection if it is?

It should be possible to use recv with the MSG_PEEK flag, before passing
it to the existing TLS library.

(If you are using the Go programming language, see the "small net
information services" program; the files named "peekedConn.go" and
"maybeTLSListener.go" implement this behaviour. It is not a programming
language I use, so I cannot help with questions about the program.)
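
The MSG_PEEK approach mentioned above can be sketched in Python as
follows, for systems where the socket module provides MSG_PEEK. The
function and constant names are hypothetical, and error handling and
timeouts are omitted.

```python
import socket

TLS_HANDSHAKE = 0x16  # first byte sent by a TLS client

def starts_with_tls(conn):
    """Peek at the first byte without consuming it; True if it is 0x16."""
    first = conn.recv(1, socket.MSG_PEEK)
    return len(first) == 1 and first[0] == TLS_HANDSHAKE
```

Because the byte is only peeked at, it is still in the receive buffer
afterwards. The connection can therefore be handed to a TLS library
unchanged (in Python, for example, with SSLContext.wrap_socket and
server_side=True) or read as a non-TLS request, in which case that byte
is the subprotocol byte.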

*** Why is it restricted to ASCII only?

There are a few reasons for this:

* It improves the simplicity in some cases.

* The URL and some other parts are "computer code", where large character
sets are inappropriate due to homoglyphs and other complexity; it isn't
meant to be text in any language (although some parts of it may resemble
English words or abbreviations of them).

* Plain text files are ASCII in the simplest case, but you can still
specify a different character encoding anyways.

Note that not everything is restricted to ASCII only; specifically, the
documents are not restricted to ASCII only. (This is also true of Spartan.)

*** Why is it a binary file format? That would make it difficult to use a
text editor, and besides, you are redefining well-defined control codes.

A text-based format would be much more difficult for the client to parse,
since it would have to handle escaping, nesting, and other such things.
A binary format will be simpler, especially a "flat" one such as this one,
rather than being nested like HTML and XML.

There are a few possibilities for how to write the document, such as using
a specialized editor, or using a converter or a static site generator.

*** Why is it big-endian? Most computers now use little-endian.

The internet is supposed to be big-endian, isn't it? Although I think that
little-endian is better (independently of which computers use it), I think
that it isn't significant enough to be worth violating the convention
of the internet in this way. (Also, uxn is big-endian.)

*** Why use ULFI? What problems does it solve compared with MIME?

The main problem with MIME is that you cannot easily specify that a file
is of multiple formats (each with their own parameters). UTI solves some of
the problems of MIME too but has its own problems (e.g. additional files
are needed in order to determine what format another one conforms with);
I designed ULFI to avoid some of these problems. For example, UTI can
specify that Markdown conforms with "plain text", but that requires that
you have the appropriate files to tell you what it conforms with; it can't
know by itself. With ULFI you can specify something like
"text:plain:markdown+commonmark" and a program that only understands
"plain" will still understand it. (MIME also has a way to do something
like this, but it isn't very good and is just "added on".)

*** Why is there no support for styling?

This is deliberate; the style should be controlled by the user, and not by
the author. Documents can only specify a limited set of styles, and the
user may be able to configure how exactly each one is displayed (by fonts,
colours, etc).

Furthermore, different clients for different displays might prefer a
different style too, instead of the author having to try to make one that
will work for everyone and then it doesn't work for everyone.

*** What should be done if required parameters of status line are missing?

This is technically an error. However, implementations that wish to recover
from such errors may treat "2x" as "2x ? :" and "4x" as "4x ?", where the
colon by itself is the ULFI that does not specify any file formats. For
redirects (3x status codes), any default value would be inappropriate, so
an implementation cannot recover from such errors in this way and instead
has to display an error message.
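
This recovery can be sketched in Python as below. The sketch assumes that
the status line has already had its line terminator removed, that a status
code is two ASCII digits, and that "missing parameters" means the line
consists of the code alone; the function name is hypothetical.

```python
def recover_status_line(line):
    """Fill in default parameters if the status line is only a status code."""
    if len(line) == 2 and line.isascii() and line.isdigit():
        if line[0] == "2":
            return line + " ? :"  # ":" alone: ULFI specifying no file formats
        if line[0] == "4":
            return line + " ?"
        if line[0] == "3":
            # No default redirect target is appropriate; report an error.
            raise ValueError("redirect without a target cannot be recovered")
    return line  # anything else is passed through unchanged
```

Lines that already have parameters are returned as they are, so the
function can be applied before the normal parsing of the status line.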

*** Is it possible to use languages other than Japanese?

Yes. It is not only for Japanese. The TRON character encoding can be used
with other languages as well, and the use of the word "furigana" in the
specification does not imply that only Japanese is possible.

*** What is the copyright of this file?

This document is public domain.