Editor: Steven Pemberton, CWI, Amsterdam
Version: 2022-05-19
This is the current state of the ixml base grammar; it is close to final.
Data is an abstraction: there is no essential difference between the JSON
{"temperature": {"scale": "C", "value": 21}}
and an equivalent XML
<temperature scale="C" value="21"/>
or
<temperature> <scale>C</scale> <value>21</value> </temperature>
since the underlying abstractions being represented are the same.
We choose which representations of our data to use, CSV, JSON, XML, or whatever, depending on habit, convenience, and the context in which it occurs. On the other hand, having an interoperable generic toolchain such as that provided by XML to process data is of immense value. How do we resolve the conflicting requirements of convenience, habit, and context, and still enable a generic toolchain?
Invisible XML (ixml) is a method for treating non-XML documents as if they were XML, enabling authors to write documents and data in a format they prefer while providing XML for processes that are more effective with XML content. For example, it can turn CSS code like
body {color: blue; font-weight: bold}
into XML like
<css> <rule> <selector>body</selector> <block> <property> <name>color</name> <value>blue</value> </property> <property> <name>font-weight</name> <value>bold</value> </property> </block> </rule> </css>
or, if preferred, as:
<css> <rule> <simple-selector name="body"/> <property name="color" value="blue"/> <property name="font-weight" value="bold"/> </rule> </css>
As another example, the expression
pi×(10+b)
can result in the XML
<prod> <id>pi</id> <sum> <number>10</number> <id>b</id> </sum> </prod>
or
<prod> <id name='pi'/> <sum> <number value='10'/> <id name='b'/> </sum> </prod>
and the URL
http://www.w3.org/TR/1999/xhtml.html
can give
<url> <scheme name='http'/> <authority> <host> <sub name='www'/> <sub name='w3'/> <sub name='org'/> </host> </authority> <path> <seg sname='TR'/> <seg sname='1999'/> <seg sname='xhtml.html'/> </path> </url>
or
<url scheme='http'> <host>www.w3.org</host> <path>/TR/1999/xhtml.html</path> </url>
The JSON value:
{"name": "pi", "value": 3.145926}
can give
<json> <object> <pair string='name'> <string>pi</string> </pair> <pair string='value'> <number>3.145926</number> </pair> </object> </json>
A grammar is used to describe the input format. An input is parsed using this grammar, and the resulting parse tree is serialised as XML. Special marks in the grammar affect details of this serialisation, excluding parts of the tree, or serialising parts as attributes instead of elements.
As an example, consider this simplified grammar for URLs:
url: scheme, ":", authority, path.
scheme: letter+.
authority: "//", host.
host: sub++".".
sub: letter+.
path: ("/", seg)+.
seg: fletter*.
-letter: ["a"-"z"]; ["A"-"Z"]; ["0"-"9"].
-fletter: letter; ".".
This means that a URL consists of a scheme (whatever that is), followed by a colon, followed by an authority, and then a path. A scheme is one or more letters (whatever a letter is). An authority starts with two slashes, followed by a host. A host is one or more subs, separated by points. A sub is one or more letters. A path is a slash followed by a seg, repeated one or more times. A seg is zero or more fletters. A letter is a lowercase letter, an uppercase letter, or a digit. A fletter is a letter or a point.
So, given the input string
http://www.w3.org/TR/1999/xhtml.html
this would produce the serialisation
<url> <scheme>http</scheme>: <authority>// <host> <sub>www</sub>. <sub>w3</sub>. <sub>org</sub> </host> </authority> <path> /<seg>TR</seg> /<seg>1999</seg> /<seg>xhtml.html</seg> </path> </url>
If the rule for letter had not had a "-" before it, the serialisation for scheme, for instance, would have been:
<scheme><letter>h</letter><letter>t</letter><letter>t</letter><letter>p</letter></scheme>
Changing the rule for scheme to
scheme: name.
@name: letter+.
would change the serialisation for scheme to:
<scheme name="http"/>:
Changing the rule for scheme instead to:
@scheme: letter+.
would change the serialisation for url to:
<url scheme="http">
Changing the definitions of sub and seg from
sub: letter+.
seg: fletter*.
to
-sub: letter+.
-seg: fletter*.
would prevent the sub and seg elements appearing in the serialised result, giving:
<url scheme='http'>:// <host>www.w3.org</host> <path>/TR/1999/xhtml.html</path> </url>
Changing the rule
url: scheme, ":", authority, path.
to
url: scheme, -":", authority, path.
and
authority: "//", host.
to
authority: -"//", host.
would remove the spurious characters from the serialisation:
<url scheme='http'> <host>www.w3.org</host> <path>/TR/1999/xhtml.html</path> </url>
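The effect of the final grammar can be imitated by hand for this one input format. The following Python sketch is not an ixml processor: it hard-codes the simplified URL grammar as a regular expression, which works only because this particular grammar happens to be regular. The function name url_to_xml is hypothetical, and the output uses double-quoted attributes and no whitespace between elements.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# -letter: ["a"-"z"]; ["A"-"Z"]; ["0"-"9"].   -fletter: letter; ".".
LETTER = r"[a-zA-Z0-9]"
FLETTER = r"[a-zA-Z0-9.]"

# url: scheme, -":", authority, path.   authority: -"//", host.
URL = re.compile(
    rf"(?P<scheme>{LETTER}+):"                   # @scheme: letter+.
    rf"//(?P<host>{LETTER}+(?:\.{LETTER}+)*)"    # host: sub++".".  -sub: letter+.
    rf"(?P<path>(?:/{FLETTER}*)+)"               # path: ("/", seg)+.  -seg: fletter*.
)


def url_to_xml(text):
    """Return the serialisation for text, or None if it does not match."""
    m = URL.fullmatch(text)
    if m is None:
        return None
    # scheme is marked @, so it becomes an attribute of url;
    # the deleted literals ":" and "//" are simply not copied.
    return (
        f"<url scheme={quoteattr(m['scheme'])}>"
        f"<host>{escape(m['host'])}</host>"
        f"<path>{escape(m['path'])}</path>"
        "</url>"
    )
```

A general processor would instead read the grammar itself and use one of the general parsing algorithms; the regular expression here only stands in for that step.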
Here we describe the format of the grammar used to describe documents. Note that it is in its own format, and therefore describes itself.
A grammar is an optional prolog containing a version declaration followed by a sequence of one or more rules, surrounded and separated by spacing and comments. Spacing and comments are entirely optional, except that rules must be separated by at least one of either (error S01).
ixml: s, prolog?, rule++RS, s.
An s stands for an optional sequence of spacing and comments. A comment is enclosed in braces, and can include nested comments, to enable commenting out parts of a grammar:
-s: (whitespace; comment)*. {Optional spacing}
-RS: (whitespace; comment)+. {Required spacing}
-whitespace: -[Zs]; tab; lf; cr.
-tab: -#9.
-lf: -#a.
-cr: -#d.
comment: -"{", (cchar; comment)*, -"}".
-cchar: ~["{}"].
A grammar may begin with a version declaration.
prolog = version, s.
version = -"ixml", RS, -"version", RS, -string, s, -'.' .
A version declaration consists of the words ixml and version followed by a string and terminated with a full stop. For example
ixml version "1.0" .
If a version declaration is present in an Invisible XML grammar, it is a statement by the author that the grammar conforms to the syntax and semantics of the version of ixml indicated in the declaration.
An implementation will either recognize the version string or it will not. If it recognizes the version string, it should process the grammar using the syntax and semantics of the declared version. If a version declaration is not present, an implementation should behave as if the grammar was labeled “1.0”.
If it does not recognize the version string, it may issue a warning, but it must attempt to process the grammar. If it finds a syntactically valid interpretation of the grammar, it should proceed using the semantics of the version of Invisible XML under which it found a valid interpretation, otherwise it must reject the grammar.
A rule consists of an optional mark, a name, and one or more alternatives. The grammar here uses colons to define rules; an equals sign is also allowed.
rule: (mark, s)?, name, s, -["=:"], s, -alts, -".".
A mark is one of ^, @ or -, and indicates whether the item so marked will be serialised as a structured element with its children (^), which is the default, as unstructured data in an attribute (@), or deleted, so that only its children are serialized (-).
@mark: ["@^-"].
A name starts with a letter or underscore, and continues with a letter, digit, underscore, a small number of punctuation characters, and the Unicode combiner characters; Unicode classes are used to define the sets of characters used, for instance, for letters and digits. This is close to, but not identical with the XML definition of a name; it is the grammar author's responsibility to ensure that all serialised names match the requirements for an XML name [XML].
@name: namestart, namefollower*.
-namestart: ["_"; L].
-namefollower: namestart; ["-.·‿⁀"; Nd; Mn].
Alternatives are separated by a semicolon or a vertical bar. The grammar here uses semicolons.
alts: alt++(-[";|"], s).
An alternative is zero or more terms, separated by commas:
alt: term**(-",", s).
A term is a singleton factor, an optional factor, or a repeated factor, repeated zero or more times, or one or more times.
-term: factor; option; repeat0; repeat1.
A factor is a terminal, a nonterminal, or a bracketed series of alternatives:
-factor: terminal; nonterminal; -"(", s, alts, -")", s.
A factor repeated zero or more times is followed by an asterisk, or followed by a double asterisk and a separator, e.g. abc* and abc**",". For instance "a"**"#" would match the empty string, a, a#a, a#a#a, etc.
repeat0: factor, (-"*", s; -"**", s, sep).
Similarly, a factor repeated one or more times is followed by a plus, or a double plus and a separator, e.g. abc+ and abc++",". For instance "a"++"#" would match a, a#a, a#a#a, etc., but not the empty string.
repeat1: factor, (-"+", s; -"++", s, sep).
An optional factor is followed by a question mark, e.g. abc?. For instance "a"? would match a or the empty string.
option: factor, -"?", s.
A separator can be any factor, e.g. abc**def or abc**(","; "."). For instance "a"++("#"; "!") would match a#a, a!a, a#a!a, a!a#a, a#a#a, etc.
sep: factor.
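The matching behaviour of these constructs can be checked with ordinary regular expressions. The following Python sketch is only an illustration: the helper names repeat0, repeat1, option and matches are hypothetical, and factors and separators are assumed to be given as regular-expression fragments rather than as ixml factors.

```python
import re


def repeat1(factor, sep=""):
    """Regex for factor+ (empty sep) or factor++sep."""
    return f"(?:{factor}(?:{sep}{factor})*)"


def repeat0(factor, sep=""):
    """Regex for factor* or factor**sep: an optional one-or-more repetition."""
    return f"(?:{repeat1(factor, sep)})?"


def option(factor):
    """Regex for factor?."""
    return f"(?:{factor})?"


def matches(pattern, text):
    """True if the whole of text is matched by pattern."""
    return re.fullmatch(pattern, text) is not None
```

For example, matches(repeat0("a", "#"), "") is true while matches(repeat1("a", "#"), "") is false, mirroring the difference between "a"**"#" and "a"++"#". The separator ("#"; "!") can be written as the fragment "[#!]".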
A nonterminal is an optionally marked name:
nonterminal: (mark, s)?, name, s.
This name refers to the rule that defines this name, which must exist (error S02), and there must only be one such rule (error S03).
A terminal is a literal or a set of characters. It matches characters in the input. A terminal must not be marked as unstructured (@) (error S04), and a charset must not be marked as structured (^) (error S05). A terminal marked as deleted (-) serialises to the empty string. A terminal marked as inserted (+) matches no characters on the input, but appears in the serialization.
-terminal: literal; charset.
A literal is either a quoted string, or a hexadecimally encoded character:
literal: quoted; encoded.
A quoted string is an optionally marked string of one or more characters, enclosed with single or double quotes. An unmarked or deleted quoted string matches only the exact same string in the input. Examples: "yes" and 'yes'. An inserted quoted string matches zero characters in the input (and succeeds).
A string cannot extend over a line-break (error S11). The enclosing quote is represented in a string by doubling it; these two strings are identical: 'Isn''t it?' and "Isn't it?", as are these: "He said ""Don't!""" and 'He said "Don''t!"'.
-quoted: (tmark, s)?, string, s.
@tmark: ["^-+"].
@string: -'"', dchar+, -'"'; -"'", schar+, -"'".
dchar: ~['"'; #a; #d]; '"', -'"'. {all characters except line breaks; quotes must be doubled}
schar: ~["'"; #a; #d]; "'", -"'". {all characters except line breaks; quotes must be doubled}
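The quote-doubling convention can be made concrete with a small decoder. This is a minimal Python sketch under stated assumptions: the function name decode_string is hypothetical, the literal is assumed to be passed without any mark or surrounding spacing, and ValueError stands in for whatever error reporting a real processor uses.

```python
def decode_string(literal):
    """Return the characters denoted by an ixml quoted string literal."""
    # At least one character between two identical quotes (dchar+ / schar+).
    if len(literal) < 3 or literal[0] not in "\"'" or literal[-1] != literal[0]:
        raise ValueError("not a quoted string of one or more characters")
    quote = literal[0]
    body = literal[1:-1]
    if "\n" in body or "\r" in body:
        raise ValueError("error S11: string extends over a line-break")
    # After removing doubled quotes, no enclosing quote may remain.
    if quote in body.replace(quote * 2, ""):
        raise ValueError("enclosing quote inside the string is not doubled")
    # Each doubled enclosing quote stands for a single quote character.
    return body.replace(quote * 2, quote)
```

With this sketch, 'Isn''t it?' and "Isn't it?" decode to the same seven-letter-and-punctuation string, as the text above states.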
An encoded character is an optionally marked hexadecimal number. It starts with a hash symbol, followed by one or more hexadecimal digits, for example #a0. The digits are interpreted as a number in hexadecimal (error S06), and the character at that Unicode code-point is used [Unicode]. The number must be within the Unicode code-point range (error S07), and must not denote a Noncharacter or Surrogate code point (error S08).
An unmarked or deleted encoded character matches that one character in the input. If marked as inserted, it matches no characters (but succeeds).
-encoded: (tmark, s)?, -"#", hex, s.
@hex: ["0"-"9"; "a"-"f"; "A"-"F"]+.
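The three checks on an encoded character can be sketched as follows. This is an illustration only: the function name decode_encoded is hypothetical, ValueError stands in for a processor's error reporting, and the mapping of each failure to S06, S07 or S08 is one reading of the text above. The noncharacter test uses the Unicode definition (U+FDD0 to U+FDEF, plus every code point whose last four hex digits are FFFE or FFFF).

```python
import string


def decode_encoded(text):
    """Map an ixml encoded character such as '#a0' to the character it denotes."""
    if not text.startswith("#"):
        raise ValueError("an encoded character starts with '#'")
    digits = text[1:]
    # Check digits explicitly: int(..., 16) alone would also accept '0x', '_' or signs.
    if not digits or any(c not in string.hexdigits for c in digits):
        raise ValueError("error S06: not a hexadecimal number")
    code = int(digits, 16)
    if code > 0x10FFFF:
        raise ValueError("error S07: outside the Unicode code-point range")
    is_surrogate = 0xD800 <= code <= 0xDFFF
    is_noncharacter = 0xFDD0 <= code <= 0xFDEF or (code & 0xFFFE) == 0xFFFE
    if is_surrogate or is_noncharacter:
        raise ValueError("error S08: noncharacter or surrogate code point")
    return chr(code)
```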
A character set is an inclusion or an exclusion: an inclusion matches one character in the input that is in the set, an exclusion matches one character not in the set.
An inclusion is enclosed in square brackets, and represents the set of characters defined by any combination of literal characters, a range of characters, hex encoded characters, or Unicode classes. Examples: ["a"-"z"], ["xyz"], [Lc], and ["0"-"9"; "!@#"; Lc]. Note that ["abc"], ["a"; "b"; "c"], ["a"-"c"], and [#61-#63] all represent the same set of characters.
An exclusion is an inclusion preceded by a tilde ~. For example, ~["{}"] matches any character that is not an opening or closing brace.
Note that the empty inclusion [] would fail to match any character in the input; on the other hand ~[] would match any one character, whatever it is.
-charset: inclusion; exclusion.
inclusion: (tmark, s)?, set.
exclusion: (tmark, s)?, -"~", s, set.
-set: -"[", s, (member, s)**(-[";|"], s), -"]", s.
member: string; -"#", hex; range; class.
A range matches any character in the range from the start character to the end, inclusive, using the Unicode ordering. The from character must not be later in the ordering than the to character (error S09).
-range: from, s, -"-", s, to.
@from: character.
@to: character.
A character is a string of length one, or a hex encoded character:
-character: -'"', dchar, -'"'; -"'", schar, -"'"; "#", hex.
A class is one or two letters, representing any character from the Unicode character category [Categories] of that name, which must exist (error S10). E.g. [Ll] matches any lower-case letter, [Ll; Lu] matches any upper- or lower-case character.
-class: code.
@code: capital, letter?.
-capital: ["A"-"Z"].
-letter: ["a"-"z"].
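Character set matching, as described in the preceding paragraphs, can be sketched in a few lines of Python. The representation of members as tuples and the names matches_member and matches_set are assumptions made for this illustration; hex encoded characters are assumed to have been converted to characters already, and a one-letter class such as L is assumed to cover every category whose name starts with that letter.

```python
import unicodedata


def matches_member(ch, member):
    """True if the single character ch belongs to one member of a set."""
    kind = member[0]
    if kind == "string":      # "abc": any one of the listed characters
        return ch in member[1]
    if kind == "range":       # "a"-"c": inclusive, in code-point order
        return ord(member[1]) <= ord(ch) <= ord(member[2])
    if kind == "class":       # Ll, or a one-letter major class such as L
        return unicodedata.category(ch).startswith(member[1])
    raise ValueError(f"unknown member kind: {kind}")


def matches_set(ch, members, exclusion=False):
    """Inclusion: ch is in some member. Exclusion (~): ch is in none."""
    hit = any(matches_member(ch, m) for m in members)
    return hit != exclusion
```

With an empty member list, matches_set returns False for every character as an inclusion and True as an exclusion, which corresponds to [] and ~[] above.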
The root symbol of the grammar is the name of the first rule in the grammar.
Processors must accept and parse any conforming grammar, and produce at least one parse of supplied input that matches the grammar starting at the root symbol. If more than one parse results, one is chosen; it is not defined how this choice is made, but the resulting serialization should include the attribute ixml:state="ambiguous" on the document element. The ixml namespace URI is "http://invisiblexml.org/NS". Known algorithms that accept and parse any context-free grammar include [Earley], [Unger], [CYK], [GLR], and [GLL]; see also [Grune]; different algorithms may differ slightly in how ambiguity is defined.
If the parse fails, some XML document must be produced with ixml:state="failed" on the document element. The document should provide helpful information about where and why it failed; it may be a partial parse tree that includes parts of the parse that succeeded.
If the parse succeeds, the resulting parse-tree is serialised as XML by serialising the root node of the parse tree.
A parse node is either a nonterminal, which has a name and children, or a terminal, which has a string.
A nonterminal can be unmarked, or marked as structured (^), as unstructured (@), or as deleted (-). The mark comes from the use of the nonterminal in a rule if present, otherwise, from the definition of the rule for that nonterminal.
A terminal can be unmarked, or marked as inserted (+), structured (^), or as deleted (-).
Grammars must be written so that any serialization of a parse tree produced from the grammar is well-formed XML (error D01).
Note: This requirement means for instance that names of serialized elements and attributes must match the XML requirements; an element must not contain more than one attribute of a given name (error D02); the names of all elements and attributes must conform to the requirements for XML names; invalid characters must not be serialized (error D04); a nonterminal being serialized as root element must not be marked as unstructured (error D05); in order to match the XML requirement of a single-rooted document, if the root rule is marked as hidden, all of its productions must produce exactly one non-hidden structured nonterminal and no non-hidden terminals before or after that nonterminal (error D06).
A (necessarily contrived) example grammar that illustrates serialization rules is:
expr: open, -arith, @close, -";".
@open: "(".
close: ")".
arith: left, op, ^right.
left: operand.
-right: operand.
-operand: name; -number.
@name: ["a"-"z"].
@number: ["0"-"9"].
-op: sign.
@sign: "+"; "-".
Applied to the string (a+1); it yields the serialisation
<expr open='(' sign='+' close=')'> <left name='a'/> <right>1</right> </expr>
Points to note: how the semicolon is suppressed from the serialization; the two ways open and close have been defined as attributes; similarly the two ways left and right have been defined as elements; how number appears as content and not as an attribute; and how sign being an exposed attribute appears on its nearest non-hidden ancestor. Also of note is how the content of some attributes can appear earlier in the serialization than in the input.
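The points above can be traced with a small tree walker. This is a minimal Python sketch, not the normative serialisation algorithm: the Node class, the helpers T, N, text_of, collect and serialise, and the hand-built parse tree for (a+1); are all assumptions made for illustration. Each node is assumed to carry its effective mark, already resolved from the use or the definition of the nonterminal, and no check is made for duplicate attribute names (error D02).

```python
from dataclasses import dataclass, field
from xml.sax.saxutils import escape, quoteattr


@dataclass
class Node:
    name: str = ""      # nonterminal name; empty for a terminal
    mark: str = "^"     # effective mark: "^" (default), "@" or "-"
    text: str = ""      # matched characters (terminals only)
    children: list = field(default_factory=list)


def T(text, mark="^"):
    return Node(text=text, mark=mark)


def N(name, mark, *children):
    return Node(name=name, mark=mark, children=list(children))


def text_of(node):
    """Concatenate the non-deleted terminals below node (an attribute's value)."""
    if not node.name:
        return "" if node.mark == "-" else node.text
    return "".join(text_of(c) for c in node.children)


def collect(node, attrs, content):
    """Add what node contributes to its nearest non-hidden ancestor."""
    if not node.name:                      # terminal: text unless deleted
        if node.mark != "-":
            content.append(escape(node.text))
    elif node.mark == "@":                 # attribute on that ancestor
        attrs.append((node.name, text_of(node)))
    elif node.mark == "-":                 # hidden: children take its place
        for child in node.children:
            collect(child, attrs, content)
    else:                                  # structured element
        content.append(serialise(node))


def serialise(node):
    attrs, content = [], []
    for child in node.children:
        collect(child, attrs, content)
    start = node.name + "".join(f" {k}={quoteattr(v)}" for k, v in attrs)
    body = "".join(content)
    return f"<{start}>{body}</{node.name}>" if body else f"<{start}/>"


# Hand-built parse tree of "(a+1);" under the example grammar.
tree = N("expr", "^",
         N("open", "@", T("(")),
         N("arith", "-",
           N("left", "^", N("operand", "-", N("name", "@", T("a")))),
           N("op", "-", N("sign", "@", T("+"))),
           N("right", "^", N("operand", "-", N("number", "-", T("1"))))),
         N("close", "@", T(")")),
         T(";", "-"))
```

Calling serialise(tree) gives the same document as shown above, written with double-quoted attributes and without whitespace between elements. The sign attribute travels up through the hidden op and arith nodes to expr, which is why it appears on the root element.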
Insertions allow characters to be inserted into the serialization that were not present in the input. For instance, the grammar
data: value++-",", @source.
source: +"ixml".
value: pos; neg.
-pos: +"+", digit+.
-neg: +"-", -"(", digit+, -")".
-digit: ["0"-"9"].
With input:
100,200,(300),400
would produce
<data source="ixml"> <value>+100</value> <value>+200</value> <value>-300</value> <value>+400</value> </data>
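The effect of the insertions in this grammar can be imitated by hand. The following Python sketch is not an ixml processor; it hard-codes this one grammar, and the function name data_to_xml is hypothetical. It shows which output characters come from the input and which are inserted.

```python
import re

POS = re.compile(r"[0-9]+")            # -pos: +"+", digit+.
NEG = re.compile(r"\(([0-9]+)\)")      # -neg: +"-", -"(", digit+, -")".


def data_to_xml(text):
    """Return the serialisation for text, or None if it does not match."""
    values = []
    for item in text.split(","):       # value++-",": the commas are deleted
        neg = NEG.fullmatch(item)
        if POS.fullmatch(item):
            values.append("+" + item)          # inserted "+", then the digits
        elif neg:
            values.append("-" + neg.group(1))  # inserted "-", brackets deleted
        else:
            return None
    body = "".join(f"<value>{v}</value>" for v in values)
    # source: +"ixml" matches nothing in the input; its text is inserted.
    return f'<data source="ixml">{body}</data>'
```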
In this specification, the verb "must" expresses unconditional requirements for conformance to the specification; the verb "should" expresses requirements that are encouraged but which are not conditions of conformance; the verb "may" expresses optional features which are neither required nor prohibited.
Conformance to this specification can meaningfully be claimed for grammars and for processors; it cannot be claimed for input streams or input + grammar pairs.
An ixml grammar in ixml form conforms to this specification if it is described by the grammar given in this specification, and it satisfies all the other requirements specified for ixml grammars.
An ixml grammar in XML form conforms to this specification if it can be derived from an ixml grammar in ixml form by parsing as described in this specification, and it satisfies all the other requirements specified for ixml grammars.
Note: The normative formulations of conformance requirements are those given elsewhere in this specification. For convenience the requirements that go beyond what is expressed in the grammar itself can be summarized as follows. (Reasonable effort has been used to make this list complete, but omission of any conformance requirement from this list does not affect its status as a conformance requirement.)
The from character of a range must not be later in the Unicode ordering than the to character.
A conforming processor must accept grammars in ixml form, and should accept grammars in XML form; it must not accept non-conforming grammars. Both grammars and input must be accepted in UTF-8 encoding, and may be accepted in other encodings.
For any conforming grammar and any input, under normal operation:
If more than one parse results, the serialization should include the attribute ixml:state="ambiguous" on the document element. Processors may provide a user option to suppress that attribute; they may also provide a user option to produce more than one parse tree.
If the parse fails, some XML document must be produced with ixml:state="failed" on the document element, with helpful information about where and why it failed; it may be a partial parse tree that includes parts of the parse that succeeded.
[...] ixml:state="prefix", or if the parse is ambiguous ixml:state="ambiguous prefix".
Many parsing algorithms only mention terminals and nonterminals, and don't explain how to deal with the repetition constructs used in ixml. However, these can be handled simply by converting them to equivalent simple constructs. In the examples below, f and sep are factors from the grammar above. The other nonterminals are generated nonterminals.
Optional factor:
f? ⇒ f-option
-f-option: f; ().
Zero or more repetitions:
f* ⇒ f-star
-f-star: (f, f-star)?.
One or more repetitions:
f+ ⇒ f-plus
-f-plus: f, f*.
One or more repetitions with separator:
f++sep ⇒ f-plus-sep
-f-plus-sep: f, (sep, f)*.
Zero or more repetitions with separator:
f**sep ⇒ f-star-sep
-f-star-sep: (f++sep)?.
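These rewritings can be tried out with a few parser combinators. The Python sketch below is an illustration under stated assumptions, not one of the cited parsing algorithms: each matcher is a function from (text, position) to the set of positions where a match can end, all helper names are hypothetical, and star assumes that f never matches the empty string (otherwise the recursion would not terminate).

```python
def lit(s):
    """Matcher for the literal string s."""
    return lambda text, i: {i + len(s)} if text.startswith(s, i) else set()


def empty(text, i):
    """The empty alternative ()."""
    return {i}


def seq(*parts):
    """Matcher for parts in sequence (a, b, ...)."""
    def run(text, i):
        ends = {i}
        for p in parts:
            ends = {k for j in ends for k in p(text, j)}
        return ends
    return run


def alt(*choices):
    """Matcher for alternatives (a; b; ...)."""
    return lambda text, i: set().union(*(c(text, i) for c in choices))


def option(f):                 # -f-option: f; ().
    return alt(f, empty)


def star(f):                   # -f-star: (f, f-star)?.
    def f_star(text, i):
        return option(seq(f, f_star))(text, i)
    return f_star


def plus(f):                   # -f-plus: f, f*.
    return seq(f, star(f))


def plus_sep(f, sep):          # -f-plus-sep: f, (sep, f)*.
    return seq(f, star(seq(sep, f)))


def star_sep(f, sep):          # -f-star-sep: (f++sep)?.
    return option(plus_sep(f, sep))


def matches(p, text):
    """True if p can match the whole of text."""
    return len(text) in p(text, 0)
```

Only sequence, alternation and the empty alternative are primitive here; every repetition construct is defined through the rules above, so a parser that supports those three constructs and recursion needs nothing more.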
ixml: s, rule++RS, s.
-s: (whitespace; comment)*. {Optional spacing}
-RS: (whitespace; comment)+. {Required spacing}
-whitespace: -[Zs]; tab; lf; cr.
-tab: -#9.
-lf: -#a.
-cr: -#d.
comment: -"{", (cchar; comment)*, -"}".
-cchar: ~["{}"].
rule: (mark, s)?, name, s, -["=:"], s, -alts, -".".
@mark: ["@^-"].
alts: alt++(-[";|"], s).
alt: term**(-",", s).
-term: factor; option; repeat0; repeat1.
-factor: terminal; nonterminal; -"(", s, alts, -")", s.
repeat0: factor, (-"*", s; -"**", s, sep).
repeat1: factor, (-"+", s; -"++", s, sep).
option: factor, -"?", s.
sep: factor.
nonterminal: (mark, s)?, name, s.
@name: namestart, namefollower*.
-namestart: ["_"; L].
-namefollower: namestart; ["-.·‿⁀"; Nd; Mn].
-terminal: literal; charset.
literal: quoted; encoded.
-quoted: (tmark, s)?, string, s.
@tmark: ["^-+"].
@string: -'"', dchar+, -'"'; -"'", schar+, -"'".
-dchar: ~['"'; #a; #d]; '"', -'"'. {all characters except line breaks; quotes must be doubled}
-schar: ~["'"; #a; #d]; "'", -"'". {all characters except line breaks; quotes must be doubled}
-encoded: (tmark, s)?, -"#", hex, s.
@hex: ["0"-"9"; "a"-"f"; "A"-"F"]+.
-charset: inclusion; exclusion.
inclusion: (tmark, s)?, set.
exclusion: (tmark, s)?, -"~", s, set.
-set: -"[", s, (member, s)**(-[";|"], s), -"]", s.
member: string; -"#", hex; range; class.
-range: from, s, -"-", s, to.
@from: character.
@to: character.
-character: -'"', dchar, -'"'; -"'", schar, -"'"; "#", hex.
-class: code.
@code: capital, letter?.
-capital: ["A"-"Z"].
-letter: ["a"-"z"].
Since the ixml grammar is expressed in its own notation, the above grammar can be processed into an XML document by parsing it using itself, and then serialising. Note that all semantically significant terminals are recorded in attributes, and non-significant characters are not serialised. The serialisation begins as below, but the entire serialisation is available:
<ixml> <rule name='ixml'> <alt> <nonterminal name='s'/> <repeat1> <nonterminal name='rule'/> <sep> <nonterminal name='RS'/> </sep> </repeat1> <nonterminal name='s'/> </alt> </rule> <rule mark='-' name='s'> <alt> <repeat0> <alts> <alt> <nonterminal name='whitespace'/> </alt> <alt> <nonterminal name='comment'/> </alt> </alts> </repeat0> </alt> </rule> <comment>Optional spacing</comment> <rule mark='-' name='RS'> <alt> <repeat1> <alts> <alt> <nonterminal name='whitespace'/> </alt> <alt> <nonterminal name='comment'/> </alt> </alts> </repeat1> </alt> </rule> <comment>Required spacing</comment> <rule mark='-' name='whitespace'> <alt> <inclusion tmark='-'> <member code='Zs'/> </inclusion> </alt> <alt> <nonterminal name='tab'/> </alt> <alt> <nonterminal name='lf'/> </alt> <alt> <nonterminal name='cr'/> </alt> </rule> <rule mark='-' name='tab'> <alt> <literal tmark='-' hex='9'/> </alt> </rule> <rule mark='-' name='lf'> <alt> <literal tmark='-' hex='a'/> </alt> </rule> <rule mark='-' name='cr'> <alt> <literal tmark='-' hex='d'/> </alt> </rule> <rule name='comment'> <alt> <literal tmark='-' string='{'/> <repeat0> <alts> <alt> <nonterminal name='cchar'/> </alt> <alt> <nonterminal name='comment'/> </alt> </alts> </repeat0> <literal tmark='-' string='}'/> </alt> </rule> <rule mark='-' name='cchar'> <alt> <exclusion> <member string='{}'/> </exclusion> </alt> </rule> <rule name='rule'> <alt> <option> <alts> <alt> <nonterminal name='mark'/> <nonterminal name='s'/> </alt> </alts> </option> <nonterminal name='name'/> <nonterminal name='s'/> <inclusion tmark='-'> <member string='=:'/> </inclusion> <nonterminal name='s'/> <nonterminal mark='-' name='alts'/> <literal tmark='-' string='.'/> </alt> </rule>
This section summarizes errors identified in this specification. Static errors are errors that can be identified by inspecting the grammar.
Dynamic errors arise when a particular input is processed with a grammar.
Note: if error codes are reported in a context where it makes sense for them to appear in a namespace, they should be in the Invisible XML namespace.
[Unicode] The Unicode Consortium (ed.), The Unicode Standard — Version 13.0. Unicode Consortium, 2020, ISBN 978-1-936213-26-9, http://www.unicode.org/versions/Unicode13.0.0/
[Categories] The Unicode Consortium (ed.), Unicode Standard Annex #44: Unicode Character Database -- General Category Values https://unicode.org/reports/tr44/#General_Category_Values (See also http://www.fileformat.info/info/unicode/category/index.htm)
[XML] Tim Bray et al. (eds.), Extensible Markup Language (XML) 1.0 (Fifth Edition), W3C, 2008, https://www.w3.org/TR/xml/
[CYK] Sakai, Itiroo. Syntax in universal translation. In 1961 International Conference on Machine Translation of Languages and Applied Language Analysis, pages 593–608. https://aclanthology.org/www.mt-archive.info/50/NPL-1961-Sakai.pdf
[Earley] Earley, J. An efficient context-free parsing algorithm. Communications of the ACM, 13(2):94–102, February 1970, doi:10.1145/362007.362035
[GLL] Elizabeth Scott and Adrian Johnstone, GLL Parsing. Electronic Notes in Theoretical Computer Science, Volume 253, Issue 7, 17 September 2010, pages 177-189. doi:10.1016/j.entcs.2010.08.041
[GLR] Masaru Tomita. Generalized LR Parsing. Springer Science & Business Media. ISBN 978-1-4615-4034-2. doi:10.1007/978-1-4615-4034-2
[Grune] Grune, D. and Jacobs, C. Parsing techniques : a practical guide (2nd ed.). New York: Springer, 2008. ISBN 978-0-387-20248-8. https://dickgrune.com/Books/PTAPG_2nd_Edition/CompleteList.pdf
[Unger] Unger, S. H. A global parser for context-free phrase structure grammars. Communications of the ACM, 11(4):240–247, April 1968, doi:10.1145/362991.363001
This specification was produced by the W3C ixml community group, that at the time of publishing consisted of the members: {list of names}
Thanks are due to Hans-Dieter Hiep for an early close reading of the specification and the many helpful comments that followed.