PalmOpenDic Dictionary Format

Database structure

PalmOpenDic dictionaries are Palm databases with version 1, type ID "data" and creator ID "ODic". The database name must start with "ODic"; the viewer strips this prefix. Each record is compressed individually using zlib. The maximum uncompressed length of a record is 4096 bytes.
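As a rough illustration of the per-record compression, here is a minimal Python sketch. It assumes that each record is a complete zlib stream (with zlib header and checksum) rather than a raw deflate stream; the text above does not say which. The function names are made up for this example.

```python
import zlib

MAX_RECORD_SIZE = 4096  # maximum uncompressed record length stated above


def decompress_record(raw):
    """Inflate one record and enforce the 4096-byte limit."""
    # Assumption: the record is a full zlib stream, not raw deflate.
    data = zlib.decompress(raw)
    if len(data) > MAX_RECORD_SIZE:
        raise ValueError("record is %d bytes, limit is %d" % (len(data), MAX_RECORD_SIZE))
    return data


def compress_record(data):
    """Deflate one record after checking the 4096-byte limit."""
    if len(data) > MAX_RECORD_SIZE:
        raise ValueError("record is %d bytes, limit is %d" % (len(data), MAX_RECORD_SIZE))
    return zlib.compress(data, 9)
```

Because every record is compressed on its own, a reader only has to inflate the records it actually visits during a lookup.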

The first record is a header record. The next records are index root records, one for each index. The remaining records may be stored in any order. Note that the DicMaker program writes data records with index entries first, then data records without index entries, and finally any index pointer records other than the index root records.

Header record

This record is always the first record of the file. Its first byte contains the number of languages and its second byte contains the number of indices. There may be between 2 and 15 languages; the number of indices must be at least 1 and at most the number of languages. After these two bytes, there is one C string for each language that contains the language name. The next C string contains a dictionary description. The rest of this record is reserved for future enhancements.
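The header layout can be read with a few lines of Python. This is only a sketch: the names `read_cstring` and `parse_header` are invented here, and decoding the strings as Latin-1 is an assumption, since the text does not name a character encoding.

```python
def read_cstring(buf, pos):
    """Return (text, position after the terminating 0x00 byte)."""
    end = buf.index(b"\x00", pos)
    # Assumption: Latin-1 is used only so that every byte value decodes.
    return buf[pos:end].decode("latin-1"), end + 1


def parse_header(record):
    """Parse an uncompressed header record into a dict."""
    language_count, index_count = record[0], record[1]
    if not 2 <= language_count <= 15:
        raise ValueError("language count must be between 2 and 15")
    if not 1 <= index_count <= language_count:
        raise ValueError("index count must be between 1 and the language count")
    pos = 2
    languages = []
    for _ in range(language_count):
        name, pos = read_cstring(record, pos)
        languages.append(name)
    description, pos = read_cstring(record, pos)
    # Anything after the description is reserved and ignored here.
    return {
        "languages": languages,
        "index_count": index_count,
        "description": description,
    }
```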

Index pointer records (and index root records)

Index pointer records are used to find the correct record that defines an index entry. Usually index entries are words, but they can be anything else as well. It is helpful if many index entries are also used as words, since they will be stored only once in that case. Entries in index pointer records must be sorted. The sort order uses a mapping table (called MAPPING, defined in main.c). So, if you want to build your own dictionaries, map all characters using this mapping table before sorting the entries. Do *not* store the mapped entries in the index, but the original ones.
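The rule "sort by the mapped form, store the original form" could look like this in Python. Note that the `MAPPING` table below is a hypothetical stand-in that only folds ASCII upper case to lower case; the real table lives in main.c and is not reproduced here.

```python
# Hypothetical stand-in for the MAPPING table from main.c.
# It maps 'A'..'Z' to 'a'..'z' and leaves every other byte unchanged.
MAPPING = bytes(b + 32 if 65 <= b <= 90 else b for b in range(256))


def sort_key(entry):
    """Map every byte of an entry through the table; used for comparison only."""
    return entry.translate(MAPPING)


def sort_index_entries(entries):
    """Sort by mapped form but return the original, unmapped entries."""
    return sorted(entries, key=sort_key)
```

Only the comparison key passes through the table, so the spellings that end up in the index are exactly the ones that were passed in.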

Each entry in an index pointer record starts with a byte denoting the length of the payload of that entry (excluding this length byte). After that byte there is a C string that contains the word, followed by exactly two bytes, which point to a data record or another index pointer record. PalmOpenDic will detect automatically whether a record is an index pointer record or a data record, with one exception: the root record of an index must always be an index pointer record, even if all index entries could fit into one data record. After the last entry there is a single 0x00 byte to mark the end of the record.
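A reader for this entry layout might look as follows. The sketch assumes the two pointer bytes are big-endian, which agrees with the example at the end of this page (`0x00 0x02` points to record 2). The function name is made up.

```python
import struct


def parse_pointer_record(record):
    """Return a list of (word, target_record) pairs from an uncompressed
    index pointer record."""
    entries = []
    pos = 0
    # A 0x00 byte where a length byte is expected terminates the record.
    while pos < len(record) and record[pos] != 0:
        length = record[pos]
        payload = record[pos + 1:pos + 1 + length]
        end = payload.index(b"\x00")      # end of the C string
        word = payload[:end]
        tail = payload[end + 1:]
        if len(tail) != 2:
            raise ValueError("pointer entries carry exactly two bytes after the word")
        (target,) = struct.unpack(">H", tail)  # assumption: big-endian
        entries.append((word, target))
        pos += 1 + length
    return entries
```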

The word in a pointer entry is the last word of the record pointed to. Why the last one? If you enter a word that is not present in the dictionary, you want to see the words that follow it. If the first word were stored there instead, PalmOpenDic might have to scan a full record (and, in the case of a pointer record, the last record it points to as well) just to notice that the word you looked for falls between two records. If the last word is stored, this can be noticed directly, and only the following record has to be read (which has to be read anyway to show the next words).
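The benefit of storing the last word can be seen in a small lookup sketch: the right child is simply the first entry whose word is not smaller than the query. The function name is hypothetical, and by default the comparison uses plain byte order; a real implementation would pass a key function based on the MAPPING table.

```python
import bisect


def find_child(entries, query, key=lambda word: word):
    """entries: sorted list of (last_word, record_number) pairs.

    Returns the record number of the first child whose last word is
    greater than or equal to the query, or None if the query sorts
    after every word in this index.
    """
    keys = [key(word) for word, _ in entries]
    i = bisect.bisect_left(keys, key(query))
    if i == len(entries):
        return None
    return entries[i][1]
```

For a query that is missing from the dictionary, the chosen record is the one holding the words that follow it, so no neighbouring record has to be scanned first.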

Data records

Data records are a bit tricky. If a data record is pointed to by an index record, it may contain only words that are used in that index, and they must be sorted. If it is not pointed to by an index record, its entries may be in any order. The basic structure is similar to index pointer records: a payload length byte followed by a C string that contains the entry. The remaining bytes of the payload must be a multiple of four bytes long (in pointer records, there are exactly two remaining bytes).

Each group of four bytes (called a "reference word" in the rest of this page) consists of a two-byte record index followed by a two-byte record offset. The four most significant bits of the offset are used as a "language tag", which leaves 12 bits for the offset itself (enough for 4096 bytes).

If the language tag is zero, the record index and offset point to the payload length byte of another data entry, which is the "head" of a dictionary entry that is a result of the given index entry. Note that there may be (and often will be) more than one dictionary entry for one index entry, especially if the dictionary contains phrases. For an index entry, reference words with a nonzero language tag are ignored. You can use this to have an index entry point to itself if the index entry text and the dictionary entry text are the same. Conversely, for a dictionary entry, all reference words with a zero language tag are ignored. Note that if a data record is pointed to by an index record, every entry in it needs at least one reference word with a zero language tag.

Nonzero language tags mark the language of a translation of the dictionary entry. Such reference words do not point to the payload length byte, but directly to the C string. (In fact, translations that are not dictionary entries themselves, which they would be if they could be translated back, for example, can even be stored without the payload length byte, but DicMaker always stores one.) If the language tag identifies the same language as the index (i.e. the language you searched for), it marks a subentry, which is shown slightly indented. The translations for the subentry are stored directly after this reference word (inside the payload of the main entry), not in the payload of the subentry.

There is no limit on the number of translations of one word, except that the entry has to fit into 255 payload bytes. PalmOpenDic shows only translations in the selected destination language (and subentries in the source language).
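To make the bit layout of a reference word concrete, here is a hedged Python sketch that decodes one data entry. It assumes big-endian byte order, in line with the example at the end of this page, and the function names are invented for illustration.

```python
import struct


def decode_reference(ref):
    """Split a four-byte reference word into (record, language_tag, offset)."""
    record, tagged = struct.unpack(">HH", ref)  # assumption: big-endian
    language_tag = tagged >> 12                 # four most significant bits
    offset = tagged & 0x0FFF                    # remaining 12 bits
    return record, language_tag, offset


def parse_data_entry(record, pos):
    """Decode the entry whose payload length byte sits at `pos`.

    Returns (text, heads, tagged, next_pos). `heads` holds (record, offset)
    pairs with language tag zero; `tagged` holds (language_tag, record,
    offset) triples for translations and subentries.
    """
    length = record[pos]
    payload = record[pos + 1:pos + 1 + length]
    end = payload.index(b"\x00")
    text = payload[:end]
    tail = payload[end + 1:]
    if len(tail) % 4 != 0:
        raise ValueError("reference bytes must be a multiple of four")
    refs = [decode_reference(tail[i:i + 4]) for i in range(0, len(tail), 4)]
    heads = [(rec, off) for rec, tag, off in refs if tag == 0]
    tagged = [(tag, rec, off) for rec, tag, off in refs if tag != 0]
    return text, heads, tagged, pos + 1 + length
```

For instance, the bytes `0x00 0x03 0x21 0x2C` decode to record 3, language tag 2 and offset 0x12C under these assumptions. Which list matters depends on the role in which the entry is read: an index lookup uses only `heads`, while displaying a dictionary entry uses only `tagged`.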

An example

Here is an example dictionary, which is identical to the empty dictionary shipped with PalmOpenDic:

Record 0

Offset	Uncompressed bytes	Explanation
0x00	`0x02`	Number of languages
0x01	`0x01`	Number of indices
0x02	`"empty" 0x00`	Name of first language
0x08	`"database" 0x00`	Name of second language
0x11	`"Empty Database" 0x00`	Dictionary description

Record 1 (Index root record for Language 1)

Offset	Uncompressed bytes	Explanation
0x00	`0x08`	Length of first payload
0x01	`"empty" 0x00`	Last index word
0x07	`0x00 0x02`	Pointer to record 2
0x09	`0x00`	Record terminator

Record 2 (the only data record)

Offset	Uncompressed bytes	Explanation
0x00	`0x0A`	Length of first payload
0x01	`"empty" 0x00`	Word
0x07	`0x00 0x02 0x00 0x00`	Pointer to record 2, language tag 0, offset 0
(Reference words for translations of this word would follow here. The empty database does not have any.)
0x0B	`0x00`	Record terminator
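The three records above can be reproduced byte for byte with a short Python sketch. The helper names are hypothetical, the strings are encoded as Latin-1 by assumption, and the records are returned uncompressed; each would still have to be compressed before being written to the database.

```python
import struct


def cstr(text):
    """Encode text as a C string (Latin-1 is an assumption)."""
    return text.encode("latin-1") + b"\x00"


def entry(word, tail):
    """Build one entry: payload length byte, C string, trailing bytes."""
    payload = cstr(word) + tail
    if len(payload) > 255:
        raise ValueError("payload does not fit into one length byte")
    return bytes([len(payload)]) + payload


def build_empty_dictionary():
    """Return the three uncompressed records of the example above."""
    header = bytes([2, 1]) + cstr("empty") + cstr("database") + cstr("Empty Database")
    # Index root: one pointer entry to record 2, then the terminator.
    index_root = entry("empty", struct.pack(">H", 2)) + b"\x00"
    # Data record: one reference word (record 2, language tag 0, offset 0).
    reference = struct.pack(">HH", 2, (0 << 12) | 0)
    data = entry("empty", reference) + b"\x00"
    return [header, index_root, data]
```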