Mork 1.4 File Format Description
Version 1.0
Sat, 14 Aug 2010 11:20:08 -0700
Copyright 2010 Kevin Goodsell

0. License

This work is licensed under a Creative Commons
Attribution-NonCommercial-ShareAlike 3.0 Unported License. To view a copy of
this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/; or,
send a letter to Creative Commons, 171 2nd Street, Suite 300, San Francisco,
California, 94105, USA.

Attributions should include my name (Kevin Goodsell) and the web site that
hosts this project, http://github.com/KevinGoodsell/mork-converter.

1. Logical Structure

Mork files logically consist of zero or more Tables. Each table consists of
zero or more rows and optional meta-data in the form of a Meta-Table. Each row
consists of zero or more Cells and optional meta-data in the form of a
Meta-Row. Each cell is a Column-Value pair.

This makes Mork Tables structurally similar to a spreadsheet or an HTML table,
with rows and columns, and entries appearing in the cells formed by the
intersection of a row and column. One important difference is that the Columns
are not necessarily the same from one Row to the next. A Mork Row may include
arbitrary Columns.

While the basic structure of the data is fairly simple, the Mork format itself
is confusing, all but impossible to decode by hand, and poorly documented. The
documentation that exists seems to contradict the Mozilla implementation of
Mork 1.4, probably the only widely-used version (according to source comments
in morkWriter.h[2] and release dates, earlier versions were used only for very
early milestone releases of the Mozilla Suite).

2. File Layout

2.1 Encoding

Mork files are text, and appear to be limited to ASCII characters. However,
all example files I've looked at include this:

    // (f=iso-8859-1)

This identifies a character encoding, but doesn't seem to have any functional
significance. The double-slashes indicate that it is a comment, so the parser
ignores it.
The text of the comment is hard-coded in morkWriter.cpp[2], so the character
encoding it gives can never be anything other than iso-8859-1. This may simply
be a hint to human readers, or an unimplemented feature.

Mork files can indicate end-of-line using several conventions. Any of the
byte sequences 0x0A, 0x0D, 0x0A0D, and 0x0D0A is allowed. This covers the
three common conventions (Unix, DOS/Windows, and Macintosh) and the uncommon
'newline carriage-return'.

2.2 Syntax Elements

A Mork file is a sequence of Mork objects, including Dicts, Rows, Tables, and
Groups. Here is a short example Mork file for reference (Example 1):

    // <!-- <mdb:mork:z v="1.4"/> -->
    < <(a=c)> // (f=iso-8859-1)
      (80=Category)(81=FilmTitle)(82=Winner)(83=Other)(84=awards)>

    <(80=Best Picture)(81=Annie Hall)(82=Best Director)(83=Woody Allen)(84
      =Best Actor in a Leading Role)(85=Richard Dreyfuss)(86=The Goodbye Girl)
      (87=Best Actress in a Leading Role)(88=Diane Keaton)>

    {1:^84
      [1 (^80^80)(^81^81)(^82^81)(^83=)]
      [2 (^80^82)(^81^81)(^82^83)(^83=)]
      [3 (^80^84)(^81^86)(^82^85)(^83=)]
      [4 (^80^87)(^81^81)(^82^88)(^83=)]}

Expanded to table form, it looks like this:

    Row | Category                       | Winner           | FilmTitle        | Other
    ----|--------------------------------|------------------|------------------|------------
    1   | Best Picture                   | Annie Hall       | Annie Hall       |
    2   | Best Director                  | Woody Allen      | Annie Hall       |
    3   | Best Actor in a Leading Role   | Richard Dreyfuss | The Goodbye Girl |
    4   | Best Actress in a Leading Role | Diane Keaton     | Annie Hall       |

2.2.1 Names

Mork Names are just text strings (with some restrictions) used for Columns and
object namespaces. Names are defined in the source file morkCh.h[2] with the
morkCh_IsName and morkCh_IsMore macros (see morkCh_Type in morkCh.cpp for
character classifications), where the first indicates a character that can
begin a Name and the second indicates a character that can appear after the
first character of a Name. The first character of a Name may be any ASCII
letter, '_' or ':'.
Subsequent characters may be any of those plus '!', '+', '-', and '?'.

2.2.2 Values

Mork Values are text strings used for arbitrary data. While they appear as
plain text in Mork files, they can encode arbitrary byte strings. Values can
include any ASCII character in the range 0x20 to 0x7E, but the '$', ')', and
'\' characters can't be included directly because they have special meanings
(described in the section "Character Escaping"). Other byte values may be
included with escape sequences.

2.2.3 References

Mork References are a way of indicating a Name or Value by using a reference
to an Alias. The Mork reader replaces the Reference with the referred-to Name
or Value. This is described in more detail in the section "Dictionaries".

References appear as a '^' character followed by a hexadecimal identifier:

    ^9A

There are some places where non-Alias Mork objects are referred to by their
hex identifier without using a '^'. These two usages should not be confused.
References beginning with '^' refer specifically to Aliases.

References must be interpreted by looking up the Alias in a specific
Dictionary. The Dictionary to use depends on the context, but it can also be
given explicitly by adding a ':' followed by the dictionary scope:

    ^9A:a

The Dictionary scope can also be given as a reference, but in practice this is
probably never done.

2.2.4 Mids

Mork identifiers (or "Mids" as they are referred to in the source) are
hexadecimal identifiers for particular objects. All Tables, Rows, and Aliases
have a Mid. References are actually a '^' followed by a Mid for the object
being referenced.

Mids have three basic forms. The first form is simply a hex identifier. The
second form is a hex identifier followed by a colon and a scope (or namespace)
name:

    F2:scope_name

The third form is the same, but replaces the scope with a Reference:

    F2:^8A

Since the Reference itself uses a Mid, this definition appears recursive.
However, the ReadMid function in morkParser.cpp[2] is not defined recursively,
and only allows a simple hex identifier after the '^'.

2.2.5 Comments and Magic

Mork files can include comments using C++ comment syntax. Two forward slashes
indicate the beginning of a comment which continues to the end of the line.

While this seems to be undocumented, the source code (morkParser.cpp[2])
indicates that C-style comments are also permitted, and may be nested. These
appear to be unused in practice.

It's difficult to determine exactly where comments can and cannot appear from
the source. Places comments cannot appear include inside a Name, after the ':'
in a Mid, after the '(' at the beginning of a Cell, inside a hex value, and
inside the Value part of a Cell or Alias. Other than that, comments can appear
almost anywhere.

There are some surprising places where comments can appear. For example, a
comment can appear before any hex value, which means they can show up after
the '^' in a Reference, or after the @$${ in a Group. Comments can also appear
before the '=' or '^' in a Cell Value. Perhaps the most surprising is that
comments can apparently show up before the '=' in an Alias, but only if there
is at least one whitespace character before the comment.

Fortunately, real Mork files only use a few comments and only in predictable
locations, so a parser does not necessarily need to handle all these strange
cases.

Mork files begin with the following "magic" identifier in the form of a
comment:

    // <!-- <mdb:mork:z v="1.4"/> -->

The 1.4 would of course be different for other Mork versions.

2.2.6 Dictionaries

Mork Dictionaries or "Dicts" are used to define numerical aliases for strings
(meaning Names and Values). This is simply a file size optimization and does
not add anything functional. It does however add substantial complexity to the
files.

Dicts are delimited with "angle brackets" (less-than and greater-than symbols)
and contain Aliases, and optionally a Meta-Dict.

Example:

    <(80=arbitrary text)(81=http://example.com)
     (8C=(Parens are allowed, but a closing paren needs to be escaped.\))>

In this example, note that the '//' does not introduce a comment, and a
backslash character is used as an escape character to allow a closing
parenthesis to appear in the Value.

Each Dict defines aliases for a particular namespace or scope. In principle
there can be an arbitrary number of scopes, but in practice only two are used:
'a' and 'c', which are abbreviations for 'atom' and 'column', respectively.
('Atom' is described in the documentation[1] as the data part of a Cell, but
the term 'Value' is used in the code and in this document.) The 'a' scope
defines aliases used in Cell Values, and the 'c' scope defines aliases used in
Cell Columns and namespaces.

The scope for a Dict is given in the Meta-Dict, or else defaults to 'a'.
Typical Mork files have only one Dict for the 'c' scope (the first thing in
the file following the magic line), and several for the 'a' scope. Each Dict
encountered in the file just updates the in-memory Dict with new aliases.

In Example 1, there are two Dicts:

    < <(a=c)>
      (80=Category)(81=FilmTitle)(82=Winner)(83=Other)(84=awards)>

    <(80=Best Picture)(81=Annie Hall)(82=Best Director)(83=Woody Allen)(84
      =Best Actor in a Leading Role)(85=Richard Dreyfuss)(86=The Goodbye Girl)
      (87=Best Actress in a Leading Role)(88=Diane Keaton)>

The first Dict is for the 'c' scope and the second is for the 'a' scope.

The hex alias values in Dicts are always greater than or equal to 0x80. This
appears to be because the values 0x00-0x7F are predefined as aliases for the
single-byte string with the same value; however, these predefined aliases may
be unused in practice. It's not even clear if the source actually supports
this usage (nothing about this section of the source could be considered
"clear"), but it is referenced in a comment near the top of
morkAtomSpace.h[2].
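
The Dict rules above (scope from the Meta-Dict, default scope 'a', each Dict
updating an in-memory table) can be sketched in Python. This is only an
illustration under simplifying assumptions, not the Mozilla algorithm: the
function names read_alias and parse_dict are hypothetical, only '//' comments
and the (a=SCOPE) Meta-Dict cell are handled, and '$' escapes are left in the
value text undecoded.

```python
import re


def read_alias(text: str, i: int) -> tuple[int, int, str]:
    """Read one '(HEX=value)' Alias starting at text[i] == '('.

    Returns (index after the closing paren, alias id, value text).
    """
    eq = text.index("=", i)
    key = int(text[i + 1 : eq].strip(), 16)
    chars = []
    j = eq + 1
    while text[j] != ")":
        if text[j] == "\\":
            j += 1  # a backslash removes any special meaning from the next character
            if text[j] in "\r\n":
                # Line continuation: drop a one- or two-character line ending.
                first = text[j]
                j += 1
                if text[j] in "\r\n" and text[j] != first:
                    j += 1
                continue
        chars.append(text[j])
        j += 1
    return j + 1, key, "".join(chars)


def parse_dict(text: str, dicts: dict[str, dict[int, str]], pos: int = 0) -> int:
    """Parse the Dict starting at text[pos] == '<' and merge it into dicts.

    dicts maps a scope name to its {alias id: value} table.
    Returns the index just after the Dict's closing '>'.
    """
    if text[pos] != "<":
        raise ValueError("a Dict must start with '<'")
    scope = "a"  # default scope when there is no Meta-Dict
    i = pos + 1
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1
        elif text.startswith("//", i):
            # Comment outside of a Value: skip to the end of the line.
            while i < len(text) and text[i] not in "\r\n":
                i += 1
        elif ch == "<":
            # Meta-Dict: only the (a=SCOPE) cell is interpreted in this sketch.
            end = text.index(">", i)
            match = re.search(r"\(a=([^)]*)\)", text[i:end])
            if match:
                scope = match.group(1)
            i = end + 1
        elif ch == "(":
            i, key, value = read_alias(text, i)
            dicts.setdefault(scope, {})[key] = value
        elif ch == ">":
            return i + 1
        else:
            raise ValueError(f"unexpected {ch!r} at offset {i}")
    raise ValueError("unterminated Dict")


example = r"""<(80=arbitrary text)(81=http://example.com)
 (8C=(Parens are allowed, but a closing paren needs to be escaped.\))>"""
dicts: dict[str, dict[int, str]] = {}
parse_dict(example, dicts)
print(dicts["a"][0x8C])
```

Values are read character by character rather than with a single regular
expression so that the '//' inside a Value (as in the URL) is never mistaken
for a comment, while a comment such as // (f=iso-8859-1) outside a Value is
never mistaken for an Alias.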

2.2.7 Aliases

Aliases occur inside Dicts and have this form:

    (A3=Arbitrary text)

The left side of the '=' sign is a hexadecimal integer, and the right side is
a Value. This makes the hexadecimal identifier an alias for the Value.

In the file morkParser.cpp[2], function ReadAlias, there's another type of
Alias that seems to be unused in practice, and is probably not fully supported
by the code. It looks something like:

    (A3=Arbitrary text)

or

    (A3=Arbitrary text)

This is probably intended to specify an alternate character encoding for the
Alias Value, but again, it appears to be unused and not fully supported.

2.2.8 Cells

Cells function as Name-Value pairs. They form the basis of Mork Rows, where
the Name represents a column and the Value represents the data in that column
of the Row. They also define the meta-data in Meta-Tables, Meta-Rows, and
Meta-Dicts.

Like Aliases, Cells are delimited by parentheses. The most basic form of a
Cell looks pretty much like an Alias:

    (columnName=arbitrary value text)

Here the left side of the '=' is a Name and the right side is a Value. It's
also possible for the Value to be empty:

    (columnName=)

Both the Name and the Value can optionally be replaced with a Reference to an
Alias previously defined in a Dict. When the Name is replaced, it looks like
this:

    (^92=arbitrary value text)

When the Value is replaced, the '=' sign is dropped:

    (columnName^A5)

It's common to see a lot of Cells with both the Name and Value replaced by
References:

    (^92^A5)

When a Reference is used for the Name, it is looked up in the 'c' scope (this
was described in the 'Dictionaries' section). For the Value, it is looked up
in the 'a' scope.

Because the References use Mids, in principle they can include a scope, either
directly or by Reference. In practice this probably doesn't happen, but it
would look like this:

    (^92:c^A5:a)
    (^92:c^A5)
    (^92^A5:a)

These are all equivalent to the previous example because the scopes used are
the same as the default scopes.
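
To make the lookup rules for the Cell forms above concrete, here is a small
Python sketch. The names CELL_RE and resolve_cell are hypothetical, the Name
pattern follows the character rules given in the "Names" section, and the
dicts argument is assumed to map a scope name to an {alias id: string} table.
Scopes given by Reference and escape sequences inside Values are not handled.

```python
import re

# Name syntax as described in the "Names" section.
NAME = r"[A-Za-z_:][A-Za-z_:!+\-?]*"
# A Reference: '^' + hex id, optionally followed by ':' and a scope Name.
REF = rf"\^([0-9A-Fa-f]+)(?::({NAME}))?"
# Column part (Reference or Name), then Value part (Reference or '=' + text).
CELL_RE = re.compile(rf"\((?:{REF}|({NAME}))(?:{REF}|=(.*))\)", re.DOTALL)


def resolve_cell(cell: str, dicts: dict[str, dict[int, str]]) -> tuple[str, str]:
    """Return (column name, value) for a single Cell such as '(^92^A5)'."""
    match = CELL_RE.fullmatch(cell)
    if match is None:
        raise ValueError(f"not a Cell this sketch understands: {cell!r}")
    col_id, col_scope, col_name, val_id, val_scope, value = match.groups()
    if col_id is not None:
        # Column References default to the 'c' scope.
        col_name = dicts[col_scope or "c"][int(col_id, 16)]
    if val_id is not None:
        # Value References default to the 'a' scope.
        value = dicts[val_scope or "a"][int(val_id, 16)]
    return col_name, value


# Alias tables taken from Example 1 (only the entries needed here).
example_dicts = {
    "c": {0x80: "Category", 0x82: "Winner", 0x83: "Other"},
    "a": {0x82: "Best Director", 0x83: "Woody Allen"},
}
for cell in ["(^80^82)", "(^82^83)", "(^83=)", "(^80:c^82:a)", "(Category^82)"]:
    print(cell, "->", resolve_cell(cell, example_dicts))
```

The first, fourth, and fifth Cells in the loop resolve to the same pair, which
mirrors the point that explicit scopes equal to the defaults change nothing.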

Using a Reference for the scope might look like this:

    (^92:^E8^A5)

References for scopes are looked up in the Dict for the 'c' scope.

2.2.9 Rows

Rows are the typical container for Cells. In the Mork file syntax they are
delimited with square brackets. Here's a row from Example 1:

    [2 (^80^82)(^81^81)(^82^83)(^83=)]

Here the 2 is the Mid for the row, which contains four Cells (which represent
the data in four columns of the row).

Mork files will often contain multiple rows with the same Mid. Duplicate rows
generally appear inside Groups (though the code seems to support duplicates
anywhere rows can appear). This has the effect of updating the in-memory Row
with new or replacement items. Row updates are described in more detail in the
section on Groups.

Rows will often appear inside Tables, but not always. A Row outside of any
Table could be inserted into a Table later, or it could be an update for a Row
that is already in a Table. In some cases rows are created but never used in
any table. Presumably these rows can be ignored.

The Row Mid belongs to a particular namespace or scope. This is pretty much
like the scopes that Dict aliases occupy, but row scopes tend to be more
descriptive. Here's an example:

    < <(a=c)>
      (80=ns:addrbk:db:row:scope:data:all)>

    [4:^80(^93^A8)(^94^A9)]

When a Row Mid lacks an explicit scope, it inherits the scope from the Table
that it is contained in.

2.2.10 Tables

Tables are the basic container for Rows. Like Rows, Tables have a unique Mid
in a particular namespace. Tables are delimited with curly braces. In Example
1, this is the table:

    {1:^84
      [1 (^80^80)(^81^81)(^82^81)(^83=)]
      [2 (^80^82)(^81^81)(^82^83)(^83=)]
      [3 (^80^84)(^81^86)(^82^85)(^83=)]
      [4 (^80^87)(^81^81)(^82^88)(^83=)]}

Here, the 1:^84 is the Table Mid, so the Table namespace is given by the
Reference ^84. The Reference is looked up in the 'c' Dict scope, and
translates to "awards" in this case.
This Table contains four Rows, but there are no specific limits on the number
of Rows a Table can contain. Tables with zero Rows are also permitted.

Table Rows can be described directly within the Table as they are in this
example, but it's also possible for Table Rows to be included by reference.
The Table from Example 1 could be re-written like this:

    [1:^84 (^80^80)(^81^81)(^82^81)(^83=)]
    [2:^84 (^80^82)(^81^81)(^82^83)(^83=)]
    [3:^84 (^80^84)(^81^86)(^82^85)(^83=)]
    [4:^84 (^80^87)(^81^81)(^82^88)(^83=)]

    {1:^84 1 2 3 4}

The two Row forms can also be mixed:

    [2:^84 (^80^82)(^81^81)(^82^83)(^83=)]
    [3:^84 (^80^84)(^81^86)(^82^85)(^83=)]

    {1:^84
      [1 (^80^80)(^81^81)(^82^81)(^83=)]
      2 3
      [4 (^80^87)(^81^81)(^82^88)(^83=)]}

2.2.11 Meta-Dicts, Meta-Tables, and Meta-Rows

Meta-Dicts, Meta-Tables, and Meta-Rows contain meta-data in the form of Cells.
While Meta-Dicts occur inside Dicts, and Meta-Tables occur inside Tables,
Meta-Rows are a bit more tricky. According to the documentation[1] Meta-Rows
occur inside Rows, but this usage doesn't seem to appear in real Mork files.
Instead, Meta-Rows appear as a Row inside a Meta-Table. Both usages are
supported by the Mozilla Mork parser (in morkParser.cpp[2] see how ReadRow
calls ReadMeta, and how ReadMeta handles the '[' character).

Aside from the Meta-Row exception, meta-objects generally appear as a nested
object: a Meta-Dict looks like a Dict inside a Dict (delimited with angle
brackets), and a Meta-Table looks like a Table inside a Table (delimited by
curly braces). The items inside the meta-object are exclusively Cells (aside
from the Meta-Row exception).

2.2.11.1 Meta-Dicts

Example 1 includes the following Meta-Dict:

    <(a=c)>

This is a very common Meta-Dict that probably appears in all Mork files. It
may also be the *only* Meta-Dict that is actually used in Mork files. This
Meta-Dict sets the namespace for the Dict to 'c' (recall that the default is
'a').
This usage is described in the documentation[1], but with the error that
'atomScope' is used instead of 'a'.

Based on OnNewCell in morkBuilder.cpp[2] and the constants
morkStore_kAtomScopeColumn and morkStore_kFormColumn in morkStore.h, it looks
like the only other type of column that is recognized in Meta-Dicts is 'f'.
This doesn't appear to be used in practice, but I suspect it's related to the
comment described in the Encoding section:

    // (f=iso-8859-1)

2.2.11.2 Meta-Tables

Meta-Tables typically contain two columns, 'k' and 's'. These seem to stand
for 'kind' and 'status', respectively. Based on the source[2], 'r', 'a', and
'f' are also allowed (search for morkStore_kKindColumn in morkBuilder.cpp and
morkStore.h), but these don't seem to be used in practice.

A typical Table with a Meta-Table looks like this:

    {1:^80 {(k^BF:c)(s=9)}
      // ... Rows go here
    }

The value in the 'kind' column is a string that seems to describe the usage of
the table. The 'status' value is no more than a few characters: a digit
indicating the Table's priority, an optional 'u' to indicate that the table is
unique, and/or an optional 'v' to indicate that the table is verbose. 'v' may
not be used in practice. Search for mBuilder_TableStatus in the OnValue
function in morkBuilder.cpp[2] to find the code that handles this.

Additionally, Meta-Rows are usually (maybe always) found in Meta-Tables. A
Meta-Row in a Meta-Table looks just like a Row in a Table, and seems to
contain various application-specific meta-data for the Table. The previous
Meta-Table example with the addition of a Meta-Row would look like this:

    {1:^80 {(k^BF:c)(s=9)[1(^8C=LE)]}
      // ... Rows go here
    }

In this example, suppose ^8C expands (via the Dict for the 'c' scope) to
'ByteOrder'. This is used in some Mork files to indicate the byte order (Big
Endian or Little Endian) of fields in the Table that use multi-byte character
encodings such as UTF-16.
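
As a rough illustration of how an application might interpret the Meta-Table
data described in this section, the Python sketch below parses a 'status'
string and decodes a UTF-16 field according to a ByteOrder value. The names
TableStatus, parse_status, and decode_utf16_field are hypothetical. Only the
value 'LE' appears in the example above; treating 'BE' as the Big Endian
spelling is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TableStatus:
    priority: Optional[int] = None  # the digit, if one is present
    unique: bool = False            # 'u' flag
    verbose: bool = False           # 'v' flag


def parse_status(status: str) -> TableStatus:
    """Interpret a Meta-Table 's' value such as '9' or '9u'."""
    result = TableStatus()
    for ch in status:
        if ch in "0123456789":
            result.priority = int(ch)
        elif ch == "u":
            result.unique = True
        elif ch == "v":
            result.verbose = True
        # Any other character is ignored in this sketch.
    return result


def decode_utf16_field(raw: bytes, byte_order: str) -> str:
    """Decode a field's raw bytes as UTF-16 using a ByteOrder Meta-Row value."""
    # 'BE' is assumed here; only 'LE' is shown in the example above.
    codecs = {"LE": "utf-16-le", "BE": "utf-16-be"}
    return raw.decode(codecs[byte_order])


print(parse_status("9"))
print(decode_utf16_field(b"s\x00n\x00o\x00w\x00", "LE"))
```

The raw bytes passed to decode_utf16_field are assumed to be the Value after
any escape sequences have already been decoded.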

The Meta-Row can also be included by reference:

    [1:^80 (^8C=LE)]

    {1:^80 {(k^BF:c)(s=9)1}
      // ... Rows go here
    }

2.2.12 Groups

Mork Groups represent a set of changes to make to the in-memory Mork objects.
Groups are delimited by a string of characters that includes a hexadecimal
identifier (Group identifiers are not Mids, but simple hexadecimal values).
Here's an example of an empty Group:

    @$${2{@
    @$$}2}@

Here the hexadecimal identifier is 2. A Group can contain Tables, Rows, and
Dicts.

It's also possible for a Group to be aborted, meaning the changes in it should
not be applied. An aborted Group is indicated with an alternative termination
string, and looks like this:

    @$${2{@
    @$$}~~}@

This abort syntax is different from the documented abort syntax[1][3], but can
be verified in the function AbortGroup in morkWriter.cpp[2]. Any group that is
not properly terminated is also considered aborted.

Inside groups, the usual syntax for adding Tables and Rows is supported. Some
additional syntax for deleting and moving Rows and deleting Cells is also
found in Groups. Based on the source, this extra syntax may actually be
allowed outside of Groups, but in practice it seems to appear only in Groups.

Here is Example 1 with a Group added. This group adds a new Row to the
existing Table.

    // <!-- <mdb:mork:z v="1.4"/> -->
    < <(a=c)> // (f=iso-8859-1)
      (80=Category)(81=FilmTitle)(82=Winner)(83=Other)(84=awards)>

    <(80=Best Picture)(81=Annie Hall)(82=Best Director)(83=Woody Allen)(84
      =Best Actor in a Leading Role)(85=Richard Dreyfuss)(86=The Goodbye Girl)
      (87=Best Actress in a Leading Role)(88=Diane Keaton)>

    {1:^84
      [1 (^80^80)(^81^81)(^82^81)(^83=)]
      [2 (^80^82)(^81^81)(^82^83)(^83=)]
      [3 (^80^84)(^81^86)(^82^85)(^83=)]
      [4 (^80^87)(^81^81)(^82^88)(^83=)]}

    @$${1{@
    <(89=Best Costume Design)(8A=Star Wars)(8B=John Mollo)>
    {1:^84 [5 (^80^89)(^81^8A)(^82^8B)(^83=)]}
    @$$}1}@

2.2.12.1 Deleting a Row

A Row can be deleted from a Table by prefixing the Row Mid with '-'.
For example, to delete Row 2 from Example 1, a Group like this might be used:

    @$${2{@
    {1:^84 -2}
    @$$}2}@

Row deletions and insertions can be mixed:

    @$${2{@
    <(8C=Best Cinematography)(8D=Vilmos Zsigmond)(8E
      =Close Encounters of the Third Kind)>
    {1:^84 -2 [6 (^80^8C)(^81^8E)(^82^8D)(^83=)]}
    @$$}2}@

2.2.12.2 Deleting All Rows From a Table

All the Rows in a Table can be deleted with a single operation. This is done
by placing a '-' at the beginning of the Table, before the Table Mid:

    @$${2E{@
    {-9:^82}
    @$$}2E}@

This syntax can also be mixed with Row insertions to replace all Rows in a
Table.

2.2.12.3 Moving a Row

Moving a Row within a Table is accomplished by including the Row Mid, followed
by a '!', followed by a hexadecimal number giving the new position in the
Table. The position is a 0-based index into the sequence of Rows. Other Row
positions are shifted to make the slot available and/or fill in the vacated
slot as needed. For example, moving Row 3 to the beginning of the Table in
Example 1 would look like this:

    @$${3{@
    {1:^84 3 ! 0}
    @$$}3}@

Based on ReadRow in morkParser.cpp[2] (the first call to ReadRowPos), it
appears that a row move operation can also appear after a row object,
combining the row add/update and row move operations. In practice this might
not be used.

2.2.12.4 Deleting a Cell

Deleting a Cell within a Row also uses the '-' operator. The operator is
placed inside the Row, in front of the Cell that will be deleted:

    @$${3A{@
    [5:^8E -(^88^9A)]
    @$$}3A}@

This isn't used in any Mork files I've encountered, but appears to be
supported in the source[2] (see the call to OnMinusCell in morkParser.cpp).

2.2.12.5 Deleting All Cells From a Row

All the Cells in a Row can be deleted with a single operation. This is done by
placing a '-' at the beginning of the Row, before the Row Mid:

    @$${1B{@
    [- 5:^90]
    @$$}1B}@

This syntax can also be mixed with Cell insertions to replace all Cells in a
Row.

3. Character Escaping

Some characters in Cell and Alias Values need an alternative representation to
avoid conflicting with Mork syntax. Mork provides two kinds of special
character sequences for representing these characters.

3.1 Backslash Escapes

The first type of special character sequence in Mork is introduced with the
backslash character, '\'. This removes any special meaning from the character
that follows, and is particularly useful for including the ')' character in
Cell and Alias Values (since this character would otherwise terminate the Cell
or Alias).

Another common use for backslash escapes is line continuation. When a
backslash is the last character on a line, the backslash and the end of the
line are removed and have no effect on the final value. This is used to split
long values across multiple lines:

    <(80=This is an Alias for a very long value which will use a backslash \
    escape sequence to continue onto the following line.)>

3.2 Dollar-Sign Escapes

The second type of special character sequence is a dollar sign followed by two
hexadecimal digits which give the value of the replacement byte. This is often
used for bytes that are non-printable as ASCII characters, especially in
UTF-16 text. For example, a string with the Unicode snowman character
(U+2603):

    ☃snowman☃

may be represented as UTF-16 text in an Alias this way:

    <(83=$03$26s$00n$00o$00w$00m$00a$00n$00$03$26)>

Note that the Value is just a list of bytes. Interpreting this value as UTF-16
is up to the application.

[1] https://developer.mozilla.org/En/Mork_Structure
[2] All source references are for Firefox version 2.0.0.20, and can be looked
    up here: http://mxr.mozilla.org/firefox2/source/db/mork/src/
[3] http://www-archive.mozilla.org/mailnews/arch/mork/grammar.txt