QuickHelp Binary Format ======================= This document describes the binary format of a QuickHelp .HLP file. Overview -------- A QuickHelp .HLP file comprises the following sections: +------------------------+ | Signature | +------------------------+ | File Header | +------------------------+ | Topic Index | +------------------------+ | Context Strings | +------------------------+ | Context Map | +------------------------+ | Keywords | +------------------------+ | Huffman Tree | +------------------------+ | Topic[0] Text | +------------------------+ | ... | +------------------------+ | Topic[N-1] Text | +------------------------+ Each section is described in detail below. Note 1: All numeric fields are stored in little-endian order unless specified otherwise. Note 2: Multiple help databases may be concatenated and stored in a single file. In this document, we assume a single database for convenience. Signature --------- An HLP file starts with the two-byte signature: 0x4C, 0x4E. File Header ----------- Following the signature is the File Header section, which is a structure with the following fields: Range Type Field Meaning ---------------------------------------------------------------------------- 02-04 WORD Version always 2 04-06 WORD Attributes bit 0: case-sensitivity of context strings bit 1: prevent the database from being decoded by the HELPMAKE utility bits 2-15: reserved; always 0 06-07 BYTE ControlCharacter usually ':'; 0FFh also seen 07-08 BYTE (padding 1) reserved; always 0 08-0A WORD TopicCount number of topics in the database 0A-0C WORD ContextCount number of context strings in the database 0C-0E WORD DisplayWidth width of the help viewer in characters 0E-10 WORD PredefinedCtxCount number of predefined context strings; usually 0 10-1E BYTE[] DatabaseName database name used to resolve external links; NULL-terminated and NULL-padded 1E-22 DWORD (reserved 1) always 0 22-26 DWORD TopicIndex offset of the Topic Index section 26-2A DWORD ContextStringsOffset offset of the Context Strings section 2A-2E DWORD ContextMapOffset offset of the Context Mapping section 2E-32 DWORD KeywordsOffset offset of the Keywords section; 0 if keyword compression is not used 32-36 DWORD HuffmanTreeOffset offset of the Huffman Tree section; 0 if Huffman compression is not used 36-3A DWORD TopicTextOffset offset of the start of topic texts 3A-3E DWORD (reserved 2) always 0 3E-42 DWORD (reserved 3) always 0 42-46 DWORD DatabaseSize size of the help database in bytes Topic Index ----------- The Topic Index section comprises (TopicCount + 1) DWORD integers. The first TopicCount integers specify the offset of the corresponding topic texts relative to the beginning of the help database. The last integer specifies the offset just past the end of the last topic. (N = TopicCount) +--------------------+ | TopicOffset[0] | (DWORD) offset of the first topic +--------------------+ | ... | ... +--------------------+ | TopicOffset[N-1] | (DWORD) offset of the last topic +--------------------+ | TopicOffset[N] | (DWORD) indicates the end of the last topic; +--------------------+ should be equal to DatabaseSize. The size (in bytes) of topic k can be found by computing the difference between TopicOffset[k] and TopicOffset[k+1]. Context Strings --------------- The Context Strings section comprises (ContextCount) NULL-terminated context strings. The context strings are not sorted. (N = ContextCount) +--------------------+ | ContextString[0] | first context string (NULL-terminated) +--------------------+ | ... | ... +--------------------+ | ContextString[N-1] | last context string (NULL-terminated) +--------------------+ Context Map ----------- The Context Map section comprises (ContextCount) WORD integers. Each integer specifies the index of the topic to which the corresponding context string resolves, and must be within the range 0 to (TopicCount-1) inclusive. (N = ContextCount) +-------------------+ | ContextMap[0] | (WORD) index of the topic that the first context +-------------------+ string points to | ... | ... +-------------------+ | ContextMap[N-1] | (WORD) index of the topic that the last context +-------------------+ string points to Keywords -------- The Keywords section contains a list of frequently used words in the topics. It is used by the keyword compression (dictionary substitution) pass. This section does not exist if keyword compression is not used for this database. Each word is prefixed by a byte that specifies the length of the word in bytes, not counting the prefix itself. The word is NOT NULL-terminated. (N = implied) +-------------+ | Word[0] | first word in dictionary (length-prefixed string) +-------------+ | ... | ... +-------------+ | Word[N-1] | last word in dictionary (length-prefixed string) +-------------+ The words are sorted in lexicographical order. The length prefix is not taken into account when sorting. There can be at most 1024 (10 bits) words in the dictionary; but there can be fewer. The number of words is not explicitly specified and must be implied by reading the entire section. Huffman Tree ------------ The Huffman Tree section contains the huffman tree used by the Huffman compression pass. This section does not exist if Huffman compression is not used for this database. Each node in a huffman tree is either a leaf node or an internal node. A leaf node represents a symbol valued from 0 to 255. An internal node must have two children and encodes a bit in the huffman code: the left child encodes 0 and the right child encodes 1. 511 nodes are sufficient to encode all 256 symbols. If the tree contains less than 511 nodes, only a subset of the 256 symbols are encoded. The Huffman tree is compactly stored in an array of WORD integers, where each integer represents a node. The array is terminated by an extra WORD of value zero. The nodes plus the terminating 0 WORD should match the length of the section specified by [HuffmanTreeOffset, TopicTextOffset). The serialized format of the Huffman tree is as follows. Denote the nodes by Node[0] through Node[N] where N <= 511 and Node[N] == 0. Then: - A leaf node, Node[i], has its highest bit set to 1, and the symbol it represents is stored in the low byte of Node[i]. - An internal node, Node[i], has its highest is set to 0, and o its right child (1 bit) is Node[i+1]; o its left child (0 bit) is Node[Node[i]/2]. - The root node is Node[0]. Note that this format is not specific to a Huffman tree; it can be used to serialize any proper binary tree. The above node numbering scheme can be generated by performing a post-order traversal of the tree and number the node from N to 1 as they are visited. Topic Text ---------- Following the meta data sections are N blocks of topic text. Each topic is separately compressed. The compression method is described below. Topic text is compiled, compressed, and encoded in three steps: Step 1. Topic text is compiled from QuickHelp markup format to binary format. Step 2. The binary text is compressed using keyword compression and run-length encoding. Step 3. The compressed text is encoded with Huffman coding and stored in the help database. The following diagram illustrates the encoding procedure. +=====================+ | Markup Topic Text | +=====================+ | | [1] compile to binary format v +=====================+ | Binary Topic Text | +=====================+ | | [2] keyword compression and | run-length encoding v +=====================+ | Compact Topic Text | +=====================+ | | [3] Huffman encoding v +=====================+ | Encoded Topic Text | +=====================+ Step 1: Compile QuickHelp markup format to binary format -------------------------------------------------------- The QuickHelp markup format is described in detail in MASM documentation, Chapter 18 - Creating Help Files With HELPMAKE. In QuickHelp binary format, each line is represented by two parts: 1. Text 2. Styles and links "Text" is the characters displayed on the screen, stripped of any formatting information. "Styles" associate each character with one or more of the following styles: bold, italic, and underline. In QuickHelp viewer, the styles are rendered as follows: Value Style Foreground Color Background Color --------------------------------------------------------------------- 0 Normal (default) white black 1 Bold highlighted white black 2 Italic green black 3 Bold+Italic cyan black 4 Underline red black 5 Bold+Underline highlighted white cyan 6 Italic+Underline white black 7 Bold+Italic+Underline black black "Links" associate a range of characters in a line with a link target. The link target must take one of the following forms: 1. a context string that matches a context in the current help database or another help database, or 2. a 16-bit integer with the highest bit set, whose lowest 15 bits specifies a topic index in the current help database. Links are defined orthogonal to styles. Links must not overlap. With this model in mind, below we describe the QuickHelp binary format. Each topic is first split into lines. Colon commands (see MASM docs for a detailed description) are treated as plain text when stored. Each line consists of a text block and an attribute block, like below: +-----------------+ | TextBlockLen(X) | 1 byte number of bytes in text block, including +-----------------+ the "TextBlockLen" byte . . . TextBlockData . X-1 bytes characters in the line, stripped of any . . formatting information +-----------------+ | AttrBlockLen(Y) | 1 byte number of bytes in attribute block, +-----------------+ including the "AttrBlockLen" byte . . . AttrBlockData . Y-1 bytes character style and link information; see . . below. +-----------------+ "TextBlockLen" and "AttrBlockLen" must be greater than zero. "TextBlockData" is the plain text to display. Each byte corresponds to an ASCII or Extended ASCII character; on Windows, this is code page 437. Note, however, that characters 0-31 are rendered as graphic characters instead of interpreted as control characters when displayed in QuickHelp; this means that a further mapping must be performed after transforming using CP-437. "AttrBlockData" comprises a mandatory "Styles" part followed by an optional "Links" part. If no link exists in a line, the format of "AttrBlockData" is (Y-1) bytes +==========+ | Styles | +==========+ If links are present in a line, the format of "AttrBlockData" is (? bytes) 1 byte (? bytes) --> total (Y-1) bytes +==========+------+=========+ | Styles | 0xFF | Links | +==========+------+=========+ "Styles" is an alternating list of "chunk length" and "style", as follows: +--------------+ | ChunkLen 0 | 1 byte +--------------+ | Style 1 | 1 byte +--------------+ | ChunkLen 1 | 1 byte +--------------+ | Style 2 | 1 byte +--------------+ | ChunkLen 2 | 1 byte +--------------+ ~ ... ~ ... +--------------+ This list always starts with a ChunkLen[0] field. The default style applies to the first (ChunkLen[0]) characters in the line. The next ChunkLen[1] characters apply the style defined in Style[1]. Following that, the next ChunkLen[2] characters apply the style defined in Style[2]; and so on. Each Style byte comprises the following bits: 7 6 5 4 3 2 1 0 +---+---+---+---+---+---+---+---+ | 0 | 0 | 0 | 0 | 0 | U | I | B | +---+---+---+---+---+---+---+---+ |_______________| | | |---- Bold | | |-------- Italic reserved; |------------ Underline must be 0. Each Style[i] field replaces the previous style; the style bits are not merged or toggled. "Links" is an array of variable-length records, where each record defines a link in the line. The format of a record is: +--------------+ | StartIndex | 1 byte ONE-based index of the first character in the +--------------+ link, inclusive. | EndIndex | 1 byte ONE-based index of the last character in the +--------------+ link, inclusive. | Context | ? bytes NULL-terminated context string that specifies | String | the link target, or an empty string to indicate +--------------+ that TopicIndex should be used as the target. | TopicIndex | WORD Optional; present only if ContextString is empty +--------------+ Step 2: Keyword compression and run-length encoding --------------------------------------------------- The compressed data is a byte stream that has the following format. Each byte in the compressed stream is either a control byte or a value byte. This is determined as follows: Byte 00 - 0F : value byte Byte 10 - 1A : control byte Byte 1B - FF : value byte During decoding, value bytes are copied as is to the output, unless they follow a control byte and is treated as an argument. See below. There are eleven control bytes, valued from 0x10 (16) to 0x1A (26). A control byte takes one or two bytes as its argument. The format of each control byte is summarized below. Hex Dec Control Byte Argument Byte 1 Argument Byte 2 +-----------------+-----------------+ 10-17 16-23 | 0 0 0 1 0 A D D | D D D D D D D D | | S 9 8 | 7 6 5 4 3 2 1 0 | +-----------------+-----------------+ +-----------------+-----------------+ 18 24 | 0 0 0 1 1 0 0 0 | SPACE-COUNT | +-----------------+-----------------+ +-----------------+-----------------+-----------------+ 19 25 | 0 0 0 1 1 0 0 1 | REPEAT-BYTE | REPEAT-LENGTH | +-----------------+-----------------+-----------------+ +-----------------+-----------------+ 1A 26 | 0 0 0 1 1 0 1 0 | ESCAPE-BYTE | +-----------------+-----------------+ Control bytes 10h-17h encode a dictionary entry index. The index, D, is specified by the lowest 2 bits of the control byte, followed by the 8 bits of the argument byte that follows. (This gives 10 bits available; hence the dictionary can contain no more than 1024 entries.) The dictionary entry is copied to the output. If the AS (Append-Space) bit is 1, a space (ASCII 32) is appended to the output. Control byte 18h encodes a run of spaces (ASCII 32). The number of spaces is specified by the argument byte that follows. This many spaces are appended to the output. Control byte 19h encodes a run of bytes. The byte to repeat is given by the first argument, and the run-length is given by the second argument. That many bytes are repeated and appended to the output. Control byte 1Ah escapes the next byte (argument) in the compressed stream. The argument is written as is to the output. This is necessary to output a byte in the range 10h to 1Ah. Step 3: Huffman coding ---------------------- The resulting binary data from Step 2 is encoded by a huffman coder. The huffman tree encodes 256 symbols (i.e. byte value 0 - 255). There is no limit on the number of bits used to encode each symbol. +----------+==============+ | OUTLEN | BIT STREAM | +----------+==============+ OUTLEN is the number of bytes in the binary topic data that is produced by Step 1. Note that it is NOT the compressed data produced by Step 2. BIT STREAM is a bit stream that contains the huffman-encoded data from Step 2. For each byte in the stream, the bits are written to (and read from) starting from the MOST significant bit of that byte; there may be extra, unused bits at the end of the last byte. This is why we need the OUTLEN field.