Microsoft's ITOL/ITLS format

Preface

This is documentation on a part of the new Microsoft Help 2.0 format. It is limited to the part of a ".HxS" or ".HxI" file which starts with the string "ITOLITLS" and continues to the end of the file. This part directly corresponds to the HTML Help 1.x (.CHM) file format. All offsets are relative to the "ITOLITLS", not the start of the actual file on disk.

Actual Microsoft Help files are wrapped in Portable Executable (PE) files. To find the ITOLITLS, find the section called ".its" and skip VirtualSize bytes past the beginning as indicated by VirtualAddress. That is, the actual data is beyond the marked end of the section.
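
When walking the PE section table is not convenient, a cruder way to locate the block is to scan for its signature. The following is a minimal sketch, assuming the eight bytes "ITOLITLS" do not occur earlier in the PE wrapper (I have not verified that this always holds); the helper names are my own.

```python
SIGNATURE = b"ITOLITLS"

def find_itolitls(data: bytes) -> int:
    """Return the file offset of the first 'ITOLITLS' signature, or -1."""
    return data.find(SIGNATURE)

def itolitls_block(data: bytes) -> bytes:
    """Return everything from the signature to the end of the file."""
    start = find_itolitls(data)
    if start < 0:
        raise ValueError("no ITOLITLS signature found")
    # All offsets in this document are relative to this point.
    return data[start:]
```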

As with my .CHM documentation, this document concerns itself only with the lowest-level structure of the file. The higher-level indexes and other such information are not covered.

All numbers are in hexadecimal unless otherwise indicated in the text. Except in tabular listings, this will be indicated by $ or 0x as appropriate. All values within the file are Intel byte order (little endian) unless indicated otherwise.

The overall format of the ITOL/ITLS block

The ITOL/ITLS block begins with a short initial header. This is followed by the header section table and additional header information called the "post-header". Collectively, this is the "header".

The header is followed by the header sections. There are five header sections, including the file directory. Immediately following these header sections are the content streams.

(It is this format that explains why my CHM documentation describes the CHM header as split into an initial header, a header section table, and additional data, even though that structure is not actually present in a .CHM file.)

The header starts with the initial header, which has the following format

0000: char[4]  'ITOL'
0004: char[4]  'ITLS' (hence the name, ITOL/ITLS)
0008: DWORD    1 (probably a version number)
000C: DWORD    Location of header section table ($28)
0010: DWORD    Number of entries in header section table
0014: DWORD    Length of post-header table
0018: GUID     {0A9007C1-4076-11D3-8789-0000F8105754}

Note: a GUID is $10 bytes, arranged as 1 DWORD, 2 WORDs, and 8 BYTEs.
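
The initial header table above maps directly onto a fixed $28-byte record. A minimal sketch of reading it, assuming the GUID is stored in the usual Windows layout described in the note (the function and key names are illustrative, not taken from any Microsoft API):

```python
import struct
import uuid

# 'ITOL', 'ITLS', four DWORDs, one GUID: $28 bytes, little endian.
INITIAL_HEADER = struct.Struct("<4s4sIIII16s")

def parse_initial_header(block: bytes) -> dict:
    """Parse the initial header at offset 0 of the ITOL/ITLS block."""
    (sig1, sig2, version, table_offset,
     entry_count, post_header_len, guid) = INITIAL_HEADER.unpack_from(block, 0)
    if sig1 + sig2 != b"ITOLITLS":
        raise ValueError("missing ITOL/ITLS signature")
    return {
        "version": version,
        "section_table_offset": table_offset,
        "section_count": entry_count,
        "post_header_length": post_header_len,
        # bytes_le matches the DWORD / 2 WORDs / 8 BYTEs arrangement.
        "guid": uuid.UUID(bytes_le=guid),
    }
```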

It is followed by the header section table, which has n entries (n is given by the DWORD at $10 in the initial header). Each entry is $10 bytes long and has this format:

0000: QWORD    Offset of section from beginning of ITOL/ITLS block
0008: QWORD    Length of section
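
Reading the header section table is then a simple loop over $10-byte entries. A sketch, assuming the table offset and entry count come from the initial header (the function name is my own):

```python
import struct

def parse_section_table(block: bytes, table_offset: int, entry_count: int) -> list:
    """Return a list of (offset, length) pairs, one per header section."""
    entries = []
    for i in range(entry_count):
        # Each entry is two little-endian QWORDs.
        offset, length = struct.unpack_from("<QQ", block, table_offset + i * 0x10)
        entries.append((offset, length))
    return entries
```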

Following the header section table is some additional data whose length is given by the DWORD at $14 in the initial header. This is the post-header. A number of values in this area are known:

0000: DWORD    2 (probably a version number)
0004: DWORD    $98 (offset to CAOL from beginning of post-header)
----  Directory information
0008: QWORD    Chunk number of top-level AOLI chunk in directory, or -1
0010: QWORD    Chunk number of first AOLL chunk in directory
0018: QWORD    Chunk number of last AOLL chunk in directory
0020: QWORD    0 (unknown)
0028: DWORD    $2000 (Directory chunk size of directory)
002C: DWORD    Quickref density for main directory, usually 2
0030: DWORD    0 (unknown)
0034: DWORD    Depth of main directory index tree
               1 if there is no index, 2 if there is one level of AOLI
               chunks.
0038: QWORD    0 (unknown)
0040: QWORD    Number of directory entries
----- Directory Index Information
0048: QWORD    -1 (unknown, probably chunk number of top-level AOLI in directory
                   index)
0050: QWORD    Chunk number of first AOLL chunk in directory index
0058: QWORD    Chunk number of last AOLL chunk in directory index
0060: QWORD    0 (unknown)
0068: DWORD    $200 (Directory chunk size of directory index)
006C: DWORD    Quickref density for directory index, usually 2
0070: DWORD    0 (unknown)
0074: DWORD    Depth of directory index index tree.
0078: QWORD    Possibly flags -- sometimes 1, sometimes 0.
0080: QWORD    Number of directory index entries (same as number of AOLL
               chunks in main directory)
-----

(The obvious guess for the following two fields, which recur in a number of places, is they are maximum sizes for the directory and directory index. However, I have seen no direct evidence that this is the case.)

0088: DWORD    $100000 (Same as field following chunk size in directory)
008C: DWORD    $20000 (Same as field following chunk size in directory
                       index)
-----
0090: QWORD    0 (unknown)
0098: ASCII    'CAOL'
009C: DWORD    2 (Most likely a version number)
00A0: DWORD    $50 (Length of the CAOL section, which includes the
                   ITSF section)
00A4: WORD     Unknown.  Remains the same when identical files are built.
               Does not appear to be a checksum.  Many files have
               'HH' (HTML Help?) here, indicating this may be a compiler ID
               field.  But at least one ITOL/ITLS compiler does not set this
               field to a constant value.
00A6: WORD     0 (Unknown.  Possibly part of 00A4 field)
00A8: DWORD    Unknown.  Two values have been seen -- $43ED, and 0.
00AC: DWORD    $2000 (Directory chunk size of directory)
00B0: DWORD    $200 (Directory chunk size of directory index)
00B4: DWORD    $100000 (Same as field following chunk size in directory)
00B8: DWORD    $20000 (Same as field following chunk size in directory
               index)
00BC: DWORD    0 (unknown)
00C0: DWORD    0 (Unknown)
00C4: DWORD    0 (Unknown)
00C8: ASCII    'ITSF' (ITStorage File?)
00CC: DWORD    $4 (Version number -- CHM uses 3)
00D0: DWORD    $20 (length of ITSF)
00D4: DWORD    1 (Unknown)
00D8: QWORD    Offset within file of content stream 0
00E0: DWORD    A timestamp of some sort.
               Considered as a big-endian DWORD, it appears to contain
               seconds (MSB) and fractional seconds (second byte).
               The third and fourth bytes may contain even more fractional
               bits.  The 4 least significant bits in the last byte are
               constant.
00E4: DWORD    Windows language ID  (0x0409 = ENGLISH/ENGLISH-US)
                                    (0x0407 = LANG_GERMAN/SUBLANG_GERMAN)

The second DWORD of the post-header is $98 (the offset of 'CAOL'), suggesting this section has some sort of internal structure. Other evidence: it appears there is a subsection for information about the directory header section, and one about the directory index header section.
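
As an illustration, the handful of post-header fields a reader actually needs can be pulled out as below. This is a sketch that assumes the fixed offsets tabulated above hold for the file at hand (in particular that 'ITSF' sits at $C8); the function and key names are illustrative only.

```python
import struct

def parse_post_header(post: bytes) -> dict:
    """Extract a few of the known post-header fields."""
    version, caol_offset = struct.unpack_from("<II", post, 0x00)
    if post[caol_offset:caol_offset + 4] != b"CAOL":
        raise ValueError("CAOL signature not found at the stated offset")
    if post[0xC8:0xCC] != b"ITSF":
        raise ValueError("ITSF signature not found at $C8")
    # Chunk numbers are read as signed so that -1 comes back as -1.
    top_aoli, first_aoll, last_aoll = struct.unpack_from("<qqq", post, 0x08)
    (chunk_size,) = struct.unpack_from("<I", post, 0x28)
    (entry_count,) = struct.unpack_from("<Q", post, 0x40)
    (content_offset,) = struct.unpack_from("<Q", post, 0xD8)
    (language_id,) = struct.unpack_from("<I", post, 0xE4)
    return {
        "version": version,
        "top_aoli_chunk": top_aoli,
        "first_aoll_chunk": first_aoll,
        "last_aoll_chunk": last_aoll,
        "directory_chunk_size": chunk_size,
        "directory_entries": entry_count,
        "content_offset": content_offset,
        "language_id": language_id,
    }
```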

The Header Sections

Header Section 0

This section contains the total size of the file, and not much else (remember that "the file" means the ITOL/ITLS block only).

0000: DWORD    $01FE (unknown)
0004: DWORD    0 (unknown)
0008: QWORD    File Size
0010: DWORD    0 (unknown)
0014: DWORD    0 (unknown)

Header Section 1: The Directory Listing

The central part of the ITOL/ITLS block is a directory of the files and information it contains. It starts with a header in the following format

0000: char[4]  'IFCM'
0004: DWORD    1 (probably a version number)
0008: DWORD    $2000    Directory chunk size
000C: DWORD    $100000 (unknown)
0010: DWORD    -1 (unknown)
0014: DWORD    -1 (unknown)
0018: DWORD    Number of directory chunks
001C: DWORD    0 (unknown, probably high DWORD of above)

(It has been suggested that some of the unknown values point to free chunks. I have no evidence either way, as free chunks do not exist in the output product.)
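
A sketch of reading the IFCM header and stepping through the chunks that follow it. It assumes every chunk is exactly one chunk size long and that chunk k starts at (chunk size times k) past the end of the $20-byte header, which is the rule given for AOLL chunk numbers; the names are hypothetical.

```python
import struct

IFCM_HEADER = struct.Struct("<4sIIIiiII")  # $20 bytes in total

def parse_ifcm_header(section: bytes) -> dict:
    """Parse the header at the start of header section 1 (or 2)."""
    (tag, version, chunk_size, _u0c,
     _u10, _u14, chunk_count, _u1c) = IFCM_HEADER.unpack_from(section, 0)
    if tag != b"IFCM":
        raise ValueError("missing IFCM signature")
    return {"version": version, "chunk_size": chunk_size, "chunk_count": chunk_count}

def iter_chunks(section: bytes):
    """Yield (chunk_number, tag, chunk_bytes) for each directory chunk."""
    header = parse_ifcm_header(section)
    size = header["chunk_size"]
    for number in range(header["chunk_count"]):
        start = IFCM_HEADER.size + number * size
        chunk = section[start:start + size]
        yield number, chunk[:4], chunk
```

The tag returned for each chunk ('AOLL' or 'AOLI') tells the caller which of the two chunk layouts to use.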

This is directly followed by the directory chunks. There are two types of directory chunks -- index chunks, and listing chunks. The index chunk will be omitted if there is only one listing chunk. A listing chunk has the following format:

0000: char[4]  'AOLL'
0004: DWORD    Length of quickref area at end of directory chunk
0008: QWORD    Directory chunk number
               This must match physical position in file, that is
               the chunk size times the chunk number must be the
               offset from the end of the directory header.
0010: QWORD    Chunk number of previous listing chunk when reading
               directory in sequence (-1 if first listing chunk)
0018: QWORD    Chunk number of next listing chunk when reading
               directory in sequence (-1 if last listing chunk)
0020: QWORD    Number of first listing entry in this chunk
0028: DWORD    1 (unknown -- other values have also been seen here)
002C: DWORD    0 (unknown)
0030: Directory listing entries (to quickref area)  Sorted by
      filename; the sort is case-insensitive.
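
The fixed part of a listing chunk can be read as one $30-byte record. A minimal sketch; the computed end of the entry area assumes the quickref length field counts bytes back from the end of the chunk, which is my reading of "to quickref area". The names are illustrative.

```python
import struct

AOLL_HEADER = struct.Struct("<4sIqqqqII")  # $30 bytes

def parse_aoll_header(chunk: bytes) -> dict:
    """Parse the fixed header of one AOLL listing chunk."""
    (tag, quickref_len, number, prev_chunk, next_chunk,
     first_entry, _u28, _u2c) = AOLL_HEADER.unpack_from(chunk, 0)
    if tag != b"AOLL":
        raise ValueError("not an AOLL chunk")
    return {
        "quickref_length": quickref_len,
        "chunk_number": number,
        "prev_chunk": prev_chunk,      # -1 if this is the first listing chunk
        "next_chunk": next_chunk,      # -1 if this is the last listing chunk
        "first_entry": first_entry,
        "entries_start": AOLL_HEADER.size,
        "entries_end": len(chunk) - quickref_len,
    }
```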

The quickref area is written backwards from the end of the chunk. One quickref entry exists for every n entries in the file, where n is calculated as 1 + (1 << quickref density). So for density = 2, n = 5.
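
The spacing calculation can be written out as follows. The second helper is my own reading of which entries end up indexed (entries n, 2n, 3n, ... within a chunk) and should be treated as an assumption.

```python
def quickref_interval(density: int) -> int:
    """Number of entries between quickref entries: 1 + (1 << density)."""
    return 1 + (1 << density)

def quickref_indexed_entries(entry_count: int, density: int) -> list:
    """Entry numbers (within a chunk) that would have a quickref offset."""
    n = quickref_interval(density)
    return list(range(n, entry_count, n))
```

For the usual density of 2, a chunk holding 12 entries would have quickref offsets for entries 5 and 10.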

Chunklen-0002: WORD     Number of entries in the chunk
Chunklen-0004: WORD     Offset of entry n from entry 0
Chunklen-0008: WORD     Offset of entry 2n from entry 0
Chunklen-000C: WORD     Offset of entry 3n from entry 0
...

The format of a directory listing entry is as follows

      ENCINT: length of name
      BYTEs: name  (possibly UTF-8)
      ENCINT: content stream number
      ENCINT: offset
      ENCINT: length

The offset is from the beginning of the stream, after the stream has been decompressed (if appropriate). The length also refers to length of the file in the stream after decompression.

An index chunk has the following format

0000: char[4]  'AOLI'
0004: DWORD    Length of quickref area at end of directory chunk
0008: QWORD    Directory chunk number
0010: Directory index entries (to quickref area)

The quickref area in an AOLI is the same as in an AOLL.

The format of a directory index entry is as follows

      ENCINT: length of name
      BYTEs: name  (possibly UTF-8)
      ENCINT: directory listing chunk which starts with name

An ENCINT is a variable-length integer. The high bit of each byte indicates "continued to the next byte". Bytes are stored most significant to least significant. So, for example, $EA $15 is (((0xEA&0x7F)<<7)|0x15) = 0x3515.
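
A sketch of an ENCINT decoder, together with a reader for one directory listing entry built on top of it. The names are decoded as UTF-8 here, which the text above only marks as "possibly" correct, so undecodable bytes are replaced rather than rejected. The function names are my own.

```python
def read_encint(buf: bytes, pos: int):
    """Decode one ENCINT starting at pos; return (value, next position)."""
    value = 0
    while True:
        byte = buf[pos]
        pos += 1
        # Low 7 bits carry data, most significant group first.
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:  # high bit clear: this was the last byte
            return value, pos

def read_listing_entry(buf: bytes, pos: int):
    """Decode one directory listing entry; return (entry, next position)."""
    name_len, pos = read_encint(buf, pos)
    name = buf[pos:pos + name_len].decode("utf-8", errors="replace")
    pos += name_len
    stream, pos = read_encint(buf, pos)
    offset, pos = read_encint(buf, pos)
    length, pos = read_encint(buf, pos)
    return (name, stream, offset, length), pos
```

Applied to the two bytes $EA $15 from the example above, read_encint returns 0x3515 and a next position of 2.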

There are two kinds of file represented in the directory: content-related and format-related. The files which are format-related have names which begin with '::', and are described in this document. The content-related files begin with a '/' and are not described here.

Header Section 2: Directory Indexing Information

This has the same format as the directory listing, but its listing chunks contain chunk-index entries rather than directory entries. These chunk-index entries refer to chunks in the Section 1 directory listing. The format of a chunk-index entry is as follows

      ENCINT: Starting entry number in chunk referred to
      ENCINT: Number of directory entries in chunk referred to
      ENCINT: directory listing chunk referred to

Speculation would be that this is a reverse index, intended to find a directory entry given its index rather than its name. Only for very large directories would this be appreciably faster than proceeding linearly through the Section 1 AOLL chunks, but this is a 64-bit format.

Further evidence for this speculation is the presence of the "Number of first listing entry in this chunk" field in the AOLL chunk. The CHM equivalent, PMGL, has no such field, and it is therefore not possible to get the absolute index number of a directory entry when searching using the PMGI chunks.

Header Section 3

Contains a GUID, {0A9007C3-4076-11D3-8789-0000F8105754}

Header Section 4

Contains a GUID, {0A9007C4-4076-11D3-8789-0000F8105754}

The Content Streams

The content streams immediately follow the header sections, and start at the location given by the QWORD at offset $D8 in the post-header. All content stream 0 locations in the directory are relative to that point. The other content streams are stored WITHIN content stream 0.

The Namelist file

There exists in content stream 0 a file called "::DataSpace/NameList". This file contains the names of all the content streams (including 0). The format is as follows:

0000: WORD     Length of file, in words
0002: WORD     Number of entries in file

Each entry:
0000: WORD     Length of name in words, excluding terminating null
0002: WORD     Double-byte characters
xxxx: WORD     0

Yes, the names have a length word AND are null terminated; sort of a belt-and-suspenders approach. The coding system is not precisely known, but is almost certainly UTF-16LE or UCS-2LE.
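
A minimal sketch of reading the NameList file, assuming the names really are UTF-16LE (which, as noted, is likely but not certain). The length word is trusted and the terminating null is simply skipped; the function name is my own.

```python
import struct

def parse_namelist(data: bytes) -> list:
    """Return the stream names, in stream-number order."""
    _file_words, count = struct.unpack_from("<HH", data, 0)
    pos = 4
    names = []
    for _ in range(count):
        (name_len,) = struct.unpack_from("<H", data, pos)
        pos += 2
        raw = data[pos:pos + 2 * name_len]
        names.append(raw.decode("utf-16-le"))
        pos += 2 * name_len + 2  # skip the name and its null terminator
    return names
```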

The stream names seen so far in HTML Help files are

Uncompressed
MSCompressed

"Uncompressed" is self-explanatory. The stream "MSCompressed" is compressed using the LZX algorithm.

Transforms

Streams 1 and up may have had transformations applied to them before being stored in the file. The most common transformation is LZX compression, but the file format is robust enough to support arbitrary transformations.

(There is evidence that the CHM format was intended to support this but failed to do so due to a bug)

To determine which transforms to apply to a stream, read the stream 0 file "::DataSpace/Storage/<StreamName>/Transform/List". This file is a multiple of $10 bytes and contains GUIDs. The GUIDs are the GUIDs of COM objects which perform transform operations on the data in the streams. When reading the file, the transforms should be applied in the order they appear in the list. The known transform is

0A9007C6-4076-11D3-8789-0000F8105754: LZX Compression/Decompression
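
Reading the transform list is a matter of slicing it into $10-byte GUIDs. The sketch below assumes the GUIDs are stored in the same Windows layout as the one in the initial header (1 DWORD, 2 WORDs, 8 BYTEs); the names are illustrative.

```python
import uuid

LZX_GUID = uuid.UUID("0A9007C6-4076-11D3-8789-0000F8105754")

def parse_transform_list(data: bytes) -> list:
    """Return the transform GUIDs in the order they appear in the file."""
    if len(data) % 0x10:
        raise ValueError("transform list is not a multiple of $10 bytes")
    return [uuid.UUID(bytes_le=data[i:i + 0x10])
            for i in range(0, len(data), 0x10)]
```

A stream whose list comes back as [LZX_GUID] is an LZX-compressed stream.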

For each transform GUID, there exists a directory "::DataSpace/Storage/<StreamName>/Transform/<GUID>/InstanceData/". This directory contains the reset table for LZX decompression, and presumably other transform-specific data for other transforms.

The Stream Data

For each stream other than 0, there exists a file called '::DataSpace/Storage/<StreamName>/Content'. This file contains the transformed data for the stream. So, conceptually, getting a file from a nonzero stream is a multi-step process. First, get the stream's Content file from stream 0. Then apply the transforms to it (for LZX, this means decompressing it). Finally, read the desired file from the resulting stream data.

Other stream format-related files

There are several other files associated with the streams

::DataSpace/Storage/<StreamName>/ControlData
This contains data for each of the transforms associated with the stream, in order of appearance in the transform list. The first DWORD of the data for a transform is always the number of DWORDs in the rest of the data, and the next DWORD is a 4-character identifier.
::DataSpace/Storage/<StreamName>/SpanInfo
This file appears to contain one QWORD per transform, in reverse order of appearance in the transform list. It seems to indicate the size of the output of the transform.

Appendix: LZX Compression

The transform {0A9007C6-4076-11D3-8789-0000F8105754} refers to LZX compression, a method Microsoft also uses for its cabinet files. While the GUID and a few version numbers are different from the CHM version of LZX, the transforms otherwise appear identical.

The LZX compression ControlData information is partially known:

0000: DWORD    7 (number of dwords to follow)
0004: ASCII    'LZXC'  Compression type identifier
0008: DWORD    3 (Version number)
000C: DWORD    The LZX reset interval in $8000-byte blocks.
0010: DWORD    The LZX window size in $8000-byte blocks
0014: DWORD    Cache size in $8000-byte blocks (maybe?)
0018: DWORD    0 (unknown)
001C: DWORD    0 (unknown)

The dword at 0x10 is the decompression window size in $8000-byte blocks, and the dword at 0x0C (also in blocks) tells how often to reset the compression tables. Normally they are the same, indicating compression tables should be reset at every window. (thus the window doesn't slide)

The DWORD at 0x08 (the version number) may be 1 in some files, indicating that the reset interval, cache size, and window size are in bytes instead of blocks. Or it may be 2, which seems to be the same as 3. Both these variants go along with a data length of 6 dwords rather than 7. Such files have not been seen "in the wild".
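
A sketch of reading the LZXC ControlData and normalizing the sizes to bytes, following the version rules just described (1 means bytes; 2 and 3 mean $8000-byte blocks). Only the first six DWORDs are read, so the 6-dword variants are accepted too. The function and key names are my own.

```python
import struct

BLOCK = 0x8000

def parse_lzxc_control_data(data: bytes) -> dict:
    """Parse LZXC ControlData; sizes are returned in bytes."""
    (dword_count, ident, version,
     reset_interval, window_size, cache_size) = struct.unpack_from("<I4sIIII", data, 0)
    if ident != b"LZXC":
        raise ValueError("not LZXC control data")
    # Version 1 stores sizes in bytes; versions 2 and 3 store blocks.
    scale = 1 if version == 1 else BLOCK
    return {
        "dword_count": dword_count,
        "version": version,
        "reset_interval_bytes": reset_interval * scale,
        "window_size_bytes": window_size * scale,
        "cache_size_bytes": cache_size * scale,
    }
```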

LZX decompression also uses another file, "::DataSpace/Storage/<StreamName>/Transform/{0A9007C6-4076-11D3-8789-0000F8105754}/InstanceData/ResetTable". This reset table has the following format

0000: DWORD    3     unknown (possibly a version number)
0004: DWORD    Number of entries in reset table
0008: DWORD    8     Size of table entry
000C: DWORD    $28   Length of table header (area before table entries)
0010: QWORD    Uncompressed Length
0018: QWORD    Compressed Length
0020: QWORD    0x8000 block size for locations below
0028: QWORD    0 (zeroth entry of table)
0030: QWORD    location in compressed data of 1st block boundary in
               uncompressed data

Repeat to end of file
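
The reset table can be read as sketched below. The code uses the header length and entry count fields from the file rather than hard-coding $28, and assumes 8-byte entries as in every file described here; the names are illustrative.

```python
import struct

def parse_reset_table(data: bytes) -> dict:
    """Parse an LZX ResetTable file."""
    version, count, entry_size, header_len = struct.unpack_from("<IIII", data, 0)
    uncompressed_len, compressed_len, block_size = struct.unpack_from("<QQQ", data, 0x10)
    if entry_size != 8:
        raise ValueError("unexpected reset table entry size")
    # Entry k is the compressed offset of uncompressed offset k * block_size.
    entries = list(struct.unpack_from("<%dQ" % count, data, header_len))
    return {
        "version": version,
        "uncompressed_length": uncompressed_len,
        "compressed_length": compressed_len,
        "block_size": block_size,
        "entries": entries,
    }
```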

Now you can finally obtain the stream (from its Content file). The window size for the LZX compression is 16 (decimal) on all the files seen so far. This is specified by the DWORD at $10 in the ControlData file (but note that this DWORD gives the window size in $8000-byte blocks, not the LZX code for the window size).

The rule that the input bit-stream is to be re-aligned to a 16-bit boundary after $8000 output characters have been processed IS in effect, despite this LZX not being part of a CAB file. The reset table tells you when this was done, though there is no need for that during decompression; you can just keep track of the number of output characters. Furthermore, while this does not appear to be documented in the LZX format, the uncompressed stream is padded to an $8000 byte boundary.

There is one change from LZX as defined by Microsoft: After each LZX reset interval (defined in the ControlData file, but typically equal to the window size) of compressed data is processed, the LZX state is fully reset, as if an entirely new file was being encoded. This allows semi-random access to the compressed data; you can start reading on any reset interval boundary using the reset interval size and the reset table.
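
The semi-random access described above might be computed as in the following sketch. It is my interpretation, assuming the reset interval is measured in uncompressed bytes, is a multiple of the reset table block size, and that the reset table has one entry per block; the function name is hypothetical.

```python
def seek_point(offset: int, reset_interval_bytes: int,
               block_size: int, entries: list):
    """Find where to start decompressing to reach an uncompressed offset.

    Returns (compressed position to start at, number of decompressed
    bytes to discard before reaching the requested offset).
    """
    # Round down to the nearest reset interval boundary.
    start = (offset // reset_interval_bytes) * reset_interval_bytes
    # The reset table has one entry per block of uncompressed data.
    index = start // block_size
    return entries[index], offset - start
```

With a reset interval of two blocks ($10000 bytes) and reset table entries [0, 100, 200, 300], a request for uncompressed offset $18005 starts at compressed position 200 and discards the first $8005 bytes of output.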

Acknowledgements

The following people (in no particular order) have submitted information which has helped correct and close the gaps in this document.

Peter Ferrie (peter_ferrie at hotmail.com)
Pabs (pabs at zip.to) who also runs the CHM Spec page.

And others I have not been able to reach.

You may freely copy and distribute unmodified copies of this file, or copies where the only modification is a change in line endings, padding after the html end tag, coding system, or any combination thereof. The original is in ASCII with Unix line endings.