Mork 1.4 File Format Description
Version 1.0
Sat, 14 Aug 2010 11:20:08 -0700

Copyright 2010 Kevin Goodsell

0. License

This work is licensed under a Creative Commons
Attribution-NonCommercial-ShareAlike 3.0 Unported License. To view a
copy of this license, visit
http://creativecommons.org/licenses/by-nc-sa/3.0/; or, send a letter to
Creative Commons, 171 2nd Street, Suite 300, San Francisco, California,
94105, USA.

Attributions should include my name (Kevin Goodsell) and the web site
that hosts this project, http://github.com/KevinGoodsell/mork-converter.

1. Logical Structure

Mork files logically consist of zero or more Tables. Each table consists
of zero or more rows and optional meta-data in the form of a Meta-Table.
Each row consists of zero or more Cells and optional meta-data in the
form of a Meta-Row. Each cell is a Column-Value pair.

This makes Mork Tables structurally similar to a spreadsheet or an HTML
table, with rows and columns, and entries appearing in the cells formed
by the intersection of a row and column. One important difference is
that the Columns are not necessarily the same from one Row to the next.
A Mork Row may include arbitrary Columns.

While the basic structure of the data is fairly simple, the Mork format
itself is confusing, all but impossible to decode by hand, and poorly
documented. The documentation that exists seems to contradict the
Mozilla implementation of Mork 1.4, probably the only widely-used
version (according to source comments in morkWriter.h[2] and release
dates, earlier versions were used only for very early milestone releases
of the Mozilla Suite).

2. File Layout

2.1 Encoding

Mork files are text, and appear to be limited to ASCII characters.
However, all example files I've looked at include this:

// (f=iso-8859-1)

This identifies a character encoding, but doesn't seem to have any
functional significance. The double-slashes indicate that it is a
comment, so the parser ignores it. The text of the comment is hard-coded
in morkWriter.cpp[2], so the character encoding it gives can never be
anything other than iso-8859-1. This may simply be a hint to human
readers, or an unimplemented feature.

Mork files can indicate end-of-line using several conventions. Any of
the combinations 0x0A, 0x0D, 0x0A0D, and 0x0D0A are allowed. This covers
the three common conventions (Unix, DOS/Windows, and Macintosh) and the
uncommon 'newline carriage-return'.

2.2 Syntax Elements

A Mork file is a sequence of Mork objects, including Dicts, Rows,
Tables, and Groups.

Here is a short example Mork file for reference (Example 1):

// <!-- <mdb:mork:z v="1.4"/> -->
< <(a=c)> // (f=iso-8859-1)
  (80=Category)(81=FilmTitle)(82=Winner)(83=Other)(84=awards)>

<(80=Best Picture)(81=Annie Hall)(82=Best Director)(83=Woody Allen)(84
  =Best Actor in a Leading Role)(85=Richard Dreyfuss)(86=The Goodbye Girl)
  (87=Best Actress in a Leading Role)(88=Diane Keaton)>
{1:^84
  [1 (^80^80)(^81^81)(^82^81)(^83=)]
  [2 (^80^82)(^81^81)(^82^83)(^83=)]
  [3 (^80^84)(^81^86)(^82^85)(^83=)]
  [4 (^80^87)(^81^81)(^82^88)(^83=)]}

Expanded to table form, it looks like this:

Row | Category                       | Winner           | FilmTitle        | Other
----|--------------------------------|------------------|------------------|------------
  1 | Best Picture                   | Annie Hall       | Annie Hall       |
  2 | Best Director                  | Woody Allen      | Annie Hall       |
  3 | Best Actor in a Leading Role   | Richard Dreyfuss | The Goodbye Girl |
  4 | Best Actress in a Leading Role | Diane Keaton     | Annie Hall       |

2.2.1 Names

Mork Names are just text strings (with some restrictions) used for
Columns and object namespaces. Names are defined in the source file
morkCh.h[2] with the morkCh_IsName and morkCh_IsMore macros (see
morkCh_Type in morkCh.cpp for character classifications), where the
first indicates a character that can begin a Name and the second
indicates a character that can appear after the first character of a
Name.

The first character of a Name may be any ASCII letter, '_' or ':'.
Subsequent characters may be any of those plus '!', '+', '-', and '?'.

2.2.2 Values

Mork Values are text strings used for arbitrary data. While they appear
as plain text in Mork files, they can encode arbitrary byte strings.
Values can include any ASCII character in the range 0x20 to 0x7E, but
the '$', ')', and '\' characters can't be included directly because they
have special meanings (described in the section "Character Escaping").
Other byte values may be included with escape sequences.

2.2.3 References

Mork References are a way of indicating a Name or Value by using a
reference to an Alias. The Mork reader replaces the Reference with the
referred-to Name or Value. This is described in more detail in the
section "Dictionaries".

References appear as a '^' character follow by a hexadecimal identifier:

^9A

There are some places where non-Alias Mork objects are referred to by
their hex identifier without using a '^'. These two usages should not be
confused. References beginning with '^' refer specifically to Aliases.

References must be interpreted by looking up the Alias in a specific
Dictionary. The Dictionary to use depends on the context, but it can
also be given explicitly by adding a ':' followed by the dictionary
scope:

^9A:a

The Dictionary scope can also be given as a reference, but in practice
this is probably never done.

2.2.4 Mids

Mork identifiers (or "Mids" as they are referred to in the source) are
hexadecimal identifiers for particular objects. All Tables, Rows, and
Aliases have a Mid. References are actually a '^' followed by a Mid for
the object being referenced.

Mids have three basic forms. The first form is simply a hex identifier.
The second form is a hex identifier followed by a colon and a scope (or
namespace) name:

F2:scope_name

The third form is the same, but replaces the scope with a Reference:

F2:^8A

Since the Reference itself uses a Mid, this definition appears
recursive. However, the ReadMid function in morkParser.cpp[2] is not
defined recursively, and only allows a simple hex identifier after the
'^'.

2.2.5 Comments and Magic

Mork files can include comments using C++ comment syntax. Two forward
slashes indicate the beginning of a comment which continues to the end
of the line. While this seems to be undocumented, the source code
(morkParser.cpp[2]) indicates that C-style comments are also permitted,
and may be nested. These appear to be unused in practice.

It's difficult to determine exactly where comments can and cannot appear
from the source. Places comments cannot appear include inside a Name,
after the ':' in a Mid, after the '(' at the beginning of a Cell, inside
a hex value, and inside the Value part of a Cell or Alias. Other than
that, comments can appear almost anywhere. There are some surprising
places where comments can appear. For example, a comment can appear
before any hex value, which means they can show up after the '^' in a
Reference, or after the @$${ in a Group. Comments can also appear before
the '=' or '^' in a Cell Value. Perhaps the most surprising is that
comments can apparently show up before the '=' in an Alias, but only if
there is at least one whitespace character before the comment.

Fortunately, real Mork files only use a few comments and only in
predictable locations, so a parser does not necessarily need to handle
all these strange cases.

Mork files begin with the following "magic" identifier in the form of a
comment:

// <!-- <mdb:mork:z v="1.4"/> -->

The 1.4 would of course be different for other Mork versions.

2.2.6 Dictionaries

Mork Dictionaries or "Dicts" are used to define numerical aliases for
strings (meaning Names and Values). This is simply a file size
optimization and does not add anything functional. It does however add
substantial complexity to the files.

Dicts are delimited with "angle brackets" (less-than and greater-than
symbols) and contain Aliases, and optionally a Meta-Dict.

Example:

<(80=arbitrary text)(81=http://example.com)
(8C=(Parens are allowed, but a closing paren needs to be escaped.\))>

In this example, note that the '//' does not introduce a comment, and a
backslash character is used as an escape character to allow a closing
parenthesis to appear in the Value.

Each Dict defines aliases for a particular namespace or scope. In
principle there can be an arbitrary number of scopes, but in practice
only two are used: 'a' and 'c', which are abbreviations for 'atom' and
'column', respectively. ('Atom' is described in the documentation[1] as
the data part of a Cell, but the term 'Value' is used in the code and in
this document.) The 'a' scope defines aliases used in Cell Values, and
the 'c' scope defines aliases used in Cell Columns and namespaces. The
scope for a Dict is given in the Meta-Dict, or else defaults to 'a'.

Typical Mork files have only one Dict for the 'c' scope (the first thing
in the file following the magic line), and several for the 'a' scope.
Each Dict encountered in the file just updates the in-memory Dict with
new aliases.

In Example 1, there are two Dicts:

< <(a=c)>
  (80=Category)(81=FilmTitle)(82=Winner)(83=Other)(84=awards)>

<(80=Best Picture)(81=Annie Hall)(82=Best Director)(83=Woody Allen)(84
  =Best Actor in a Leading Role)(85=Richard Dreyfuss)(86=The Goodbye Girl)
  (87=Best Actress in a Leading Role)(88=Diane Keaton)>

The first Dict is for the 'c' scope and the second is for the 'a' scope.

The hex alias values in Dicts are always greater than or equal to 0x80.
This appears to be because the values 0x00-0x7F are predefined as
aliases for the single-byte string with the same value, however these
predefined aliases may be unused in practice. It's not even clear if the
source actually supports this usage (nothing about this section of the
source could be considered "clear"), but it is referenced in a comment
near the top of morkAtomSpace.h[2].

2.2.7 Aliases

Aliases occur inside Dicts and have this form:

(A3=Arbitrary text)

The left side of the '=' sign is a hexadecimal integer, and the right
side is a Value. This makes the hexadecimal identifier an alias for the
Value.

In the file morkParser.cpp[2], function ReadAlias, there's another type
of Alias that seems to be unused in practice, and is probably not fully
supported by the code. It looks something like:

(A3<f=c>=Arbitrary text) or (A3<f=^BF>=Arbitrary text)

This is probably intended to specify an alternate character encoding for
the Alias Value, but again, it appears to be unused and not fully
supported.

2.2.8 Cells

Cells function as Name-Value pairs. They form the basis of Mork Rows,
where the Name represents a column and the Value represents the data in
that column of the Row. They also define the meta-data in Meta-Tables,
Meta-Rows, and Meta-Dicts.

Like Aliases, Cells are delimited by parentheses. The most basic form of
a Cell looks pretty much like an Alias:

(columnName=arbitrary value text)

Here the left side of the '=' is a Name and the right side is a Value.
It's also possible for the Value to be empty:

(columnName=)

Both the Name and the Value can optionally be replaced with a Reference
to an Alias previously defined in a Dict. When the Name is replaced, it
looks like this:

(^92=arbitrary value text)

When the Value is replaced, the '=' sign is dropped:

(columnName^A5)

It's common to see a lot of Cells with both the Name and Value replaced
by References:

(^92^A5)

When a Reference is used for the Name, it is looked up in the 'c' scope
(this was described in the 'Dictionaries' section). For the Value, it is
looked up in the 'a' scope.

Because the References use Mids, in principle they can include a scope,
either directly or by Reference. In practice this probably doesn't
happen, but it would look like this:

(^92:c^A5:a)
(^92:c^A5)
(^92^A5:a)

These are all equivalent to the previous example because the scopes used
are the same as the default scopes.

Using a Reference for the scope might look like this:

(^92:^E8^A5)

References for scopes are looked up in the Dict for the 'c' scope.

2.2.9 Rows

Rows are the typical container for Cells. In the Mork file syntax they
are delimited with square brackets. Here's a row from Example 1:

  [2 (^80^82)(^81^81)(^82^83)(^83=)]

Here the 2 is the Mid for the row, which contains four Cells (which
represent the data in four columns of the row).

Mork files will often contain multiple rows with the same Mid.
Duplicate rows generally appear inside Groups (though the code seems to
support duplicates anywhere rows can appear). This has the effect of
updating the in-memory Row with new or replacement items. Row updates
are described in more detail in the section on Groups.

Rows will often appear inside Tables, but not always. A Row outside of
any Table could be inserted into a Table later, or it could be an update
for a Row that is already in a Table. In some cases rows are created but
never used in any table. Presumably these rows can be ignored.

The Row Mid belongs to a particular namespace or scope. This is pretty
much like the scopes that Dict aliases occupy, but row scopes tend to be
more descriptive. Here's an example:

< <(a=c)> (80=ns:addrbk:db:row:scope:data:all)>
[4:^80(^93^A8)(^94^A9)]

When a Row Mid lacks an explicit scope, it inherits the scope from the
Table that it is contained in.

2.2.10 Tables

Tables are the basic container for Rows. Like Rows, Tables have a unique
Mid in a particular namespace. Tables are delimited with curly braces.
In Example 1, this is the table:

{1:^84
  [1 (^80^80)(^81^81)(^82^81)(^83=)]
  [2 (^80^82)(^81^81)(^82^83)(^83=)]
  [3 (^80^84)(^81^86)(^82^85)(^83=)]
  [4 (^80^87)(^81^81)(^82^88)(^83=)]}

Here, the 1:^84 is the Table Mid, so the Table namespace is given by
the Reference ^84. The Reference is looked up in the 'c' Dict scope, and
translates to "awards" in this case.

This Table contains four Rows, but there are no specific limits on the
number of Rows a Table can contain. Tables with zero Rows are also
permitted.

Table Rows can be described directly within the Table as they are in
this example, but it's also possible for Table Rows to be included by
reference. The Table from Example 1 could be re-written like this:

[1:^84 (^80^80)(^81^81)(^82^81)(^83=)]
[2:^84 (^80^82)(^81^81)(^82^83)(^83=)]
[3:^84 (^80^84)(^81^86)(^82^85)(^83=)]
[4:^84 (^80^87)(^81^81)(^82^88)(^83=)]
{1:^84 1 2 3 4}

The two Row forms can also be mixed:

[2:^84 (^80^82)(^81^81)(^82^83)(^83=)]
[3:^84 (^80^84)(^81^86)(^82^85)(^83=)]
{1:^84
  [1 (^80^80)(^81^81)(^82^81)(^83=)]
  2
  3
  [4 (^80^87)(^81^81)(^82^88)(^83=)]}

2.2.11 Meta-Dicts, Meta-Tables, and Meta-Rows

Meta-Dicts, Meta-Tables, and Meta-Rows contain meta-data in the form of
Cells. While Meta-Dicts occur inside Dicts, and Meta-Tables occur inside
Tables, Meta-Rows are a bit more tricky. According to the
documentation[1] Meta-Rows occur inside Rows, but this usage doesn't
seem to appear in real Mork files. Instead, Meta-Rows appear as a Row
inside a Meta-Table. Both usages are supported by the Mozilla Mork
parser (in morkParser.cpp[2] see how ReadRow calls ReadMeta, and how
ReadMeta handles the '[' character).

Aside from the Meta-Row exception, meta-objects generally appear as a
nested object: a Meta-Dict looks like a Dict inside a Dict (delimited
with angle brackets), and a Meta-Table looks like a Table inside a Table
(delimited by curly braces). The items inside the meta-object are
exclusively Cells (aside from the Meta-Row exception).

2.2.11.1 Meta-Dicts

Example 1 includes the following Meta-Dict:

<(a=c)>

This is a very common Meta-Dict that probably appears in all Mork files.
It may also be the *only* Meta-Dict that is actually used in Mork files.
This Meta-Dict sets the namespace for the Dict to 'c' (recall that the
default is 'a'). This usage is described in the documentation[1], but
with the error that 'atomScope' is used instead of 'a'.

Based on OnNewCell in morkBuilder.cpp[2] and the constants
morkStore_kAtomScopeColumn and morkStore_kFormColumn in morkStore.h, it
looks like the only other type of column that is recognized in
Meta-Dicts is 'f'. This doesn't appear to be used in practice, but I
suspect it's related to the comment described in the Encoding section:

// (f=iso-8859-1)

2.2.11.2 Meta-Tables

Meta-Tables typically contain two columns, 'k' and 's'. These seem to
stand for 'kind' and 'status', respectively. Based on the source[2],
'r', 'a', and 'f' are also allowed (search for morkStore_kKindColumn in
morkBuilder.cpp and morkStore.h), but these don't seem to be used in
practice.

A typical Table with a Meta-Table looks like this:

{1:^80 {(k^BF:c)(s=9)}
  // ... Rows go here
}

The value in the 'kind' column is a string that seems to describe the
usage of the table. The 'status' value is no more than a few characters:
a digit indicating the Table's priority, an optional 'u' to indicate
that the table is unique, and/or an optional 'v' to indicate that the
table is verbose. 'v' may not be used in practice. Search for
mBuilder_TableStatus in the OnValue function in morkBuilder.cpp[2] to
find the code that handles this.

Additionally, Meta-Rows are usually (maybe always) found in Meta-Tables.
A Meta-Row in a Meta-Table looks just like a Row in a Table, and seems
to contain various application-specific meta-data for the Table. The
previous Meta-Table example with the addition of a Meta-Row would look
like this:

{1:^80 {(k^BF:c)(s=9)[1(^8C=LE)]}
  // ... Rows go here
}

In this example, suppose ^8C expands (via the Dict for the 'c' scope) to
'ByteOrder'. This is used in some Mork files to indicate the byte order
(Big Endian or Little Endian) of fields in the Table that use multi-byte
character encodings such as UTF-16.

The Meta-Row can also be included by reference:

[1:^80 (^8C=LE)]
{1:^80 {(k^BF:c)(s=9)1}
  // ... Rows go here
}

2.2.12 Groups

Mork Groups represent a set of changes to make to the in-memory Mork
objects. Groups are delimited by a string of characters that includes a
hexadecimal identifier (Group identifiers are not Mids, but simple
hexadecimal values). Here's an example of an empty Group:

@$${2{@
@$$}2}@

Here the hexadecimal identifier is 2. A Group can contain Tables, Rows,
and Dicts. It's also possible for a Group to be aborted, meaning the
changes in it should not be applied. An aborted Group is indicated with
an alternative termination string, and looks like this:

@$${2{@
@$$}~~}@

This abort syntax is different from the documented abort syntax[1][3],
but can be verified in the function AbortGroup in morkWriter.cpp[2].

Any group that is not properly terminated is also considered aborted.

Inside groups, the usual syntax for adding Tables and Rows is supported.
Some additional syntax for deleting and moving Rows and deleting Cells
is also found in Groups. Based on the source, this extra syntax may
actually be allowed outside of Groups, but in practice it seems to
appear only in Groups.

Here is Example 1 with a Group added. This group adds a new Row to the
existing Table.

// <!-- <mdb:mork:z v="1.4"/> -->
< <(a=c)> // (f=iso-8859-1)
  (80=Category)(81=FilmTitle)(82=Winner)(83=Other)(84=awards)>

<(80=Best Picture)(81=Annie Hall)(82=Best Director)(83=Woody Allen)(84
  =Best Actor in a Leading Role)(85=Richard Dreyfuss)(86=The Goodbye Girl)
  (87=Best Actress in a Leading Role)(88=Diane Keaton)>
{1:^84
  [1 (^80^80)(^81^81)(^82^81)(^83=)]
  [2 (^80^82)(^81^81)(^82^83)(^83=)]
  [3 (^80^84)(^81^86)(^82^85)(^83=)]
  [4 (^80^87)(^81^81)(^82^88)(^83=)]}

@$${1{@
<(89=Best Costume Design)(8A=Star Wars)(8B=John Mollo)>
{1:^84 [5 (^80^89)(^81^8A)(^82^8B)(^83=)]}
@$$}1}@

2.2.12.1 Deleting a Row

A Row can be deleted from a Table by prefixing the Row Mid with '-'. For
example, to delete Row 2 from Example 1, a Group like this might be
used:

@$${2{@
{1:^84 -2}
@$$}2}@

Row deletions and insertions can be mixed:

@$${2{@
<(8C=Best Cinematography)(8D=Vilmos Zsigmond)(8E
  =Close Encounters of the Third Kind)>
{1:^84 -2 [6 (^80^8C)(^81^8E)(^82^8D)(^83=)]}
@$$}2}@

2.2.12.2 Deleting All Rows From a Table

All the Rows in a Table can be deleted with a single operation. This is
done by placing a '-' at the beginning of the Table, before the Table
Mid:

@$${2E{@
{-9:^82}
@$$}2E}@

This syntax can also be mixed with Row insertions to replace all Rows in
a Table.

2.2.12.3 Moving a Row

Moving a Row within a Table is accomplished by including the Row Mid,
followed by a '!', followed by a hexadecimal number giving the new
position in the Table. The position is a 0-based index into the sequence
of Rows. Other Row positions are shifted to make the slot available
and/or fill in the vacated slot as needed. For example, moving Row 3 to
the beginning of the Table in Example 1 would look like this:

@$${3{@
{1:^84 3 ! 0}
@$$}3}@

Based on ReadRow in morkParser.cpp[2] (the first call to ReadRowPos), it
appears that a row move operation can also appear after a row object,
combining the row add/update and row move operations. In practice this
might not be used.

2.2.12.4 Deleting a Cell

Deleting a Cell within a Row also uses the '-' operator. The operator is
placed inside the Row, in front of the Cell that will be deleted:

@$${3A{@
[5:^8E -(^88^9A)]
@$$}3A}@

This isn't used in any Mork files I've encountered, but appears to be
supported in the source[2] (see the call to OnMinusCell in
morkParser.cpp).

2.2.12.5 Deleting All Cells From a Row

All the Cells in a Row can be deleted with a singe operation. This is
done by placing a '-' at the beginning of the Row, before the Row Mid:

@$${1B{@
[- 5:^90]
@$$}1B}@

This syntax can also be mixed with Cell insertions to replace all Cells
in a Row.

3. Character Escaping

Some characters in Cell and Alias Values need an alternative
representation to avoid conflicting with Mork syntax. Mork provides two
kinds of special character sequences for representing these characters.

3.1 Backslash Escapes

The first type of special character sequence in Mork is introduced with
the backslash character, '\'. This removes any special meaning from the
character that follows, and is particularly useful for including the ')'
character in Cell and Alias Values (since this character would otherwise
terminate the Cell or Alias).

Another common use for backslash escapes is line continuation. When a
backslash is the last character on a line, the backslash and the end of
the line are removed and have no effect on the final value. This is used
to split long values across multiple lines:

<(80=This is an Alias for a very long value which will use a backslash \
escape sequence to continue onto the following line.)>

3.2 Dollar-Sign Escapes

The second type of special character sequence is a dollar sign followed
by two hexadecimal digits which give the value of the replacement byte.
This is often used for bytes that are non-printable as ASCII characters,
especially in UTF-16 text. For example, a string with the Unicode
snowman character (U+2603):

☃snowman☃

may be represented as UTF-16 text in an Alias this way:

<(83=$03$26s$00n$00o$00w$00m$00a$00n$00$03$26)>

Note that the Value is just a list of bytes. Interpreting this value as
UTF-16 is up to the application.

[1] https://developer.mozilla.org/En/Mork_Structure
[2] All source references are for Firefox version 2.0.0.20, and can be
looked up here: http://mxr.mozilla.org/firefox2/source/db/mork/src/
[3] http://www-archive.mozilla.org/mailnews/arch/mork/grammar.txt