Layout of an encyclopodia eBook-File

Version: 0.9
Copyright (c) 2005 Robert Bamler <Robert dot Bamler (a) gmx dot de>

An encyclopodia eBook stores all articles (chapters) of the eBook in one file. The file consists of three parts:

- An uncompressed section containing meta information about the eBook, such as the name, author, copyright, language, character encoding, file format version, minimum required reader version, etc. This part consists of lines of the form "key=value". Encyclopodia-reader currently accepts only the following keywords: "title" (title of the eBook) and "aboutpage" (name of the page to be displayed when the user selects "about this ebook" from the context menu). However, if you want to create an eBook, look at the meta information section of existing eBooks for keywords that might be implemented in the future.
- The index section. It contains an ordered list of all article titles. Each article title comes with a reference to the file offset at which the block containing this article starts, as well as an internal offset at which the article starts within that block. For the specification of this part, see the specification of the index section.
- A compressed section containing the articles. This section is divided into bz2-compressed blocks (in fact, if you copy any single block of an eBook file to another file, that file will be a valid bz2 file). The size of each block is arbitrary, but each block must contain an integer number of articles (i.e. no article is stored with its first half in one block and its second half in another). Large blocks decrease the size of the eBook file, but increase the access time for articles that are stored at the end of a block. Small blocks decrease the access time for "worst case" articles, but increase the file size.
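
The following Python sketch illustrates how a reader might handle the first and third part. It is not taken from Encyclopodia-reader: the function names are hypothetical, the way the meta section is delimited from the rest of the file is not specified here (so the parser simply takes a list of lines), and it is an assumption that the internal offset counts bytes of the decompressed block.

```python
import bz2


def parse_meta_lines(lines):
    """Parse "key=value" lines into a dict, splitting at the first "="."""
    meta = {}
    for line in lines:
        line = line.strip()
        if "=" not in line:
            continue  # assumed behaviour: skip blank or malformed lines
        key, value = line.split("=", 1)
        meta[key.strip()] = value.strip()
    return meta


def read_block(ebook_bytes, block_offset):
    """Decompress the single bz2 block that starts at block_offset."""
    decompressor = bz2.BZ2Decompressor()
    # The decompressor stops at the end of the first bz2 stream. The blocks
    # that follow are left untouched in decompressor.unused_data.
    return decompressor.decompress(ebook_bytes[block_offset:])


def article_bytes(ebook_bytes, block_offset, internal_offset):
    """Return the decompressed block contents from internal_offset onwards.

    Assumption: internal_offset refers to the decompressed data.
    """
    return read_block(ebook_bytes, block_offset)[internal_offset:]
```

Note that the whole block has to be decompressed before the article at the internal offset can be read, which is why large blocks increase the access time.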

The following specifies in detail the layout of an article in the compressed section (the third part) of the eBook file.

Contents:
- Overview of the format
- Special characters
- Strings
- Entities
- Control sequences
- Blocks

Overview of the format

eBook-Files are built of four different classes of data structures:

- Blocks store data of arbitrary (varying) size. Each type of block has its own meaning (e.g. blocks of the type "List" describe a numbered or unnumbered list). Blocks can contain other blocks (subblocks), as well as control sequences, entities and strings.
- Control sequences represent simple markup instructions for a reader application. There are several types of control sequences. Some carry a small amount of data with a well-defined size, others carry no data at all. For example, control sequences of the type "Paragraph Break" do not contain any data. They simply tell a reader application to start a new paragraph. In contrast, control sequences of the type "Change Font Family" do contain data (the ID of the font family), but this data has a well-defined size of two bytes. If a control sequence contains data, that data is stored in the form of entities.
- Entities contain primitive data types, such as integers or timestamps. They have a well-defined size and can appear as contents of blocks or control sequences. For example, if a particular type of control sequence is specified to contain a single entity of the type "uint2", this simply means that the contents of control sequences of this type have a length of two bytes and that these two bytes shall be interpreted as an unsigned integer.
- Strings are a special kind of data structure. They represent a list of characters of arbitrary length. To make reading an eBook file easy, a string stores its length in a single byte ahead of its characters (thus, a reader application knows how much memory to allocate before it actually reads the characters). Because this byte cannot store numbers larger than 255, a long string is split into several string chunks that directly follow each other, and a reader concatenates the contents of all these chunks into one single string. Note that the contents of strings have to be escaped.

Special characters

Three different special characters are used to indicate the beginning of a block or control sequence, the end of a block, and the beginning of a string chunk. To make the format more error tolerant, a fourth special character (the escape character) is used to eliminate all occurrences of the special characters in the contents of strings. Note that the escape character would not be necessary if it were certain that neither the file nor the reader application contains any errors. However, this would be a very optimistic assumption.

The following special characters are used:

0xFF: Denotes the beginning of a block or a control sequence. The next char decides whether this starts a block or a control sequence.
0xFE: Denotes the end of a block. (Control sequences do not need any end mark.)
0xFD: Denotes the beginning of a string chunk.
0xFC: Escape character.
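
As an illustration, a reader could decide what kind of element starts at a given position by looking at one or two bytes. This is only a sketch with hypothetical names. The threshold of 128 between control sequences and blocks is the one given in the sections on control sequences and blocks.

```python
BLOCK_OR_CONTROL = 0xFF  # begin of a block or a control sequence
BLOCK_END = 0xFE         # end of a block
STRING_CHUNK = 0xFD      # begin of a string chunk
ESCAPE = 0xFC            # escape character (only used inside string chunks)


def classify(data, pos):
    """Name the kind of element that starts at data[pos]."""
    first = data[pos]
    if first == STRING_CHUNK:
        return "string chunk"
    if first == BLOCK_END:
        return "block end"
    if first == BLOCK_OR_CONTROL:
        # The next byte is the type ID: below 128 means control sequence,
        # 128 and above means block.
        type_id = data[pos + 1]
        return "block begin" if type_id >= 128 else "control sequence"
    # Anything else is not a marker, e.g. the raw bytes of an entity.
    return "other"
```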

Strings

A string is a sequence of one or more string chunks and represents an array of characters, usually to be displayed on the screen. A string chunk has the following layout:

field	size	value
special char: string chunk begin	1	0xFD - Denoting the beginning of a string chunk
length	1	One byte (0-255) describing the length of the character sequence represented by this string chunk. Note that, due to escaping, the amount of data actually needed to store this string chunk in the file might be larger than the number given here (see below).
contents	?	The data associated with this string chunk. To make the format more error tolerant, the contents of string chunks have to be escaped, which means that an application that creates an eBook performs the following substitutions whenever it would otherwise write a special character as contents of a string chunk:

Instead of writing the character 0xFF it writes the two characters 0xFC0F.
Instead of writing the character 0xFE it writes the two characters 0xFC0E.
Instead of writing the character 0xFD it writes the two characters 0xFC0D.
Instead of writing the character 0xFC it writes the two characters 0xFC0C.

Thus, instead of writing the special character c, it writes 0xFC, followed by (c & 0x0F), where "&" denotes a bitwise AND. Escaping has no effect on the value of the length field. For example, suppose you have a string chunk holding the character sequence "asdf"0xFD"qwer" (the four letters a, s, d, and f, followed by the char 0xFD, followed by the four letters q, w, e, and r). This string chunk would look like this:

Special char: string chunk begin:	fd
Length of the string chunk:	09
First part of the contents: the four letters a, s, d, and f:	61 73 64 66
Second part of the contents: the letter 0xFD, which has to be escaped:	fc 0d
Third part of the contents: the four letters q, w, e, and r:	71 77 65 72

Obviously, an application that reads an eBook has to perform the reverse of this substitution: Whenever it reads the char 0xFC in the contents of a string chunk, it reads the following char as c and then uses the char (c | 0xF0), where "|" denotes a bitwise OR.
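
The escaping rules and the chunk layout can be sketched in Python as follows. The function names are hypothetical. The sketch assumes that the length field counts bytes before escaping and that an empty string is written as a single chunk of length zero. Neither is stated explicitly in this specification.

```python
SPECIAL = {0xFC, 0xFD, 0xFE, 0xFF}
ESCAPE = 0xFC
CHUNK_BEGIN = 0xFD


def escape(raw):
    """Replace every special character c by 0xFC followed by (c & 0x0F)."""
    out = bytearray()
    for c in raw:
        if c in SPECIAL:
            out += bytes([ESCAPE, c & 0x0F])
        else:
            out.append(c)
    return bytes(out)


def encode_string(raw):
    """Encode raw as one or more string chunks of at most 255 chars each."""
    pieces = [raw[i:i + 255] for i in range(0, len(raw), 255)] or [b""]
    out = bytearray()
    for piece in pieces:
        # The length byte counts unescaped characters.
        out += bytes([CHUNK_BEGIN, len(piece)]) + escape(piece)
    return bytes(out)


def decode_string(data, pos=0):
    """Decode consecutive string chunks at pos; return (raw, new_pos)."""
    raw = bytearray()
    while pos < len(data) and data[pos] == CHUNK_BEGIN:
        length = data[pos + 1]
        pos += 2
        for _ in range(length):
            c = data[pos]
            if c == ESCAPE:
                pos += 1
                c = data[pos] | 0xF0  # undo the escaping
            raw.append(c)
            pos += 1
    return bytes(raw), pos
```

Applied to the example above, `encode_string(b"asdf\xfdqwer")` yields the twelve bytes `fd 09 61 73 64 66 fc 0d 71 77 65 72`.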

Entities

When we refer to an entity, this simply means that a particular integral value is written to the eBook file without any meta data. The following entity types are defined:

Type name	length	Description
uint1, sint1	1	Unsigned / signed integer (see below)
uint2, sint2	2	Unsigned / signed integer (see below)
uint3, sint3	3	Unsigned / signed integer (see below)
uint4, sint4	4	Unsigned / signed integer (see below)
Timestamp	8	Date and time, stored as a sequence of integers (see below)

There are 8 different types of integers. Integer types starting with "uint" are interpreted as unsigned integers, integer types starting with "sint" are interpreted as signed integers.

The number (e.g. "2" in "uint2") specifies the length of the integer in bytes. The first byte contains the most significant bits and the last byte contains the least significant bits (big endian).

For example, a uint2 that represents the value "42" has a length of two bytes and is stored as:

00 2a

A Timestamp is merely a sequence of several integer types, each describing a part of a date and time. The parts appear in the following order:

uint2	year (such as "2005")
uint1	month (1-12)
uint1	day of month (1-31)
uint1	hour (0-23)
uint1	minute (0-59)
uint1	second (0-60; some systems use the 60 for leap seconds)
sint1	Time zone (offset to UTC times 2; thus, the value for an offset of two and a half hours west of UTC would be "-5")
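
A minimal Python sketch of reading entities is given below. The names are hypothetical, and it is an assumption that signed integers use two's complement, since this specification does not name the representation of negative values. The timestamp is returned as a plain tuple rather than a `datetime` object because a second value of 60 would not fit into the latter.

```python
from typing import NamedTuple


def read_uint(data, pos, size):
    """Read a big-endian unsigned integer of `size` bytes."""
    return int.from_bytes(data[pos:pos + size], "big", signed=False)


def read_sint(data, pos, size):
    """Read a big-endian signed integer (assumed: two's complement)."""
    return int.from_bytes(data[pos:pos + size], "big", signed=True)


class Timestamp(NamedTuple):
    year: int
    month: int
    day: int
    hour: int
    minute: int
    second: int
    utc_offset_hours: float  # stored value divided by 2


def read_timestamp(data, pos=0):
    """Read the 8 bytes of a Timestamp; return (timestamp, new_pos)."""
    year = read_uint(data, pos, 2)
    # The next five fields are uint1 values, i.e. one byte each.
    month, day, hour, minute, second = data[pos + 2:pos + 7]
    time_zone = read_sint(data, pos + 7, 1)
    stamp = Timestamp(year, month, day, hour, minute, second, time_zone / 2)
    return stamp, pos + 8
```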

Control sequences

Control sequences appear within the contents of blocks. There are various types of control sequences and each type has a well defined amount of stored data and thus, a well defined size. For example, control sequences of the type "Paragraph Break" do not store any data, because they only denote the beginning of a new paragraph at the position they appear in. Thus, control sequences of this type have a well defined size of two bytes (special char + type ID). In contrast, control sequences of the type "Change Font Family" do store data and have a well defined size of 4 bytes (special char + type ID + 2 bytes for font family).

The general layout of all control sequences is defined as follows:

field	size	value
special char: begin of block or control sequence	1	0xFF - Denoting the beginning of a block or control sequence (in this case: control sequence)
type ID	1	Identifies the type of the control sequence. In order to discriminate blocks and control sequences, all control sequences have a type ID lower than 128. Type IDs equal or higher than 128 are reserved for blocks.
contents	?	Depending on the type of the control sequence, any entities that store the data associated to the control sequence. All entities have to appear in the order specified for the particular control sequence type.

More concretely, the following control sequences are defined:

type	value
ID 0: Dummy — Sometimes needed for disambiguation. Does not have any meaning. (empty)
ID 16: Change font family — Changes the current font family.
uint2	The ID of the font family to change to: 0 = default font 1 = default sans serif font 2 = default serif font 3 = monospace font
ID 17: Change font size — Changes the current font size.
uint1	The size of the font to change to. Currently, the following sizes are supported: TODO; The default font size is 10, but nobody knows how large this actually is.
ID 18: Line break — Causes the following text to start at a new line. (empty)
ID 19: Paragraph break — Terminates the previous and begins a new paragraph. (empty)
ID 20: Hline — creates a horizontal line. This control sequence is typically surrounded by two paragraph breaks. (empty)
ID 21: Begin italic — Text after this control sequence will be displayed italic. (empty)
ID 22: End italic — Text after this control sequence will not be displayed italic any more. (empty)
ID 23: Begin bold — Text after this control sequence will be displayed bold. (empty)
ID 24: End bold — Text after this control sequence will not be displayed bold any more. (empty)
ID 25: Begin underlined — Text after this control sequence will be displayed underlined. (empty)
ID 26: End underlined — Text after this control sequence will not be displayed underlined any more. (empty)
ID 27: Table Row — Indicates the end of the previous and the beginning of the next row of a Table. This control sequence is only allowed between table rows, ie. neither at the beginning nor at the end of a table. Thus, a valid "Table"-block contains exactly one less "Table Row" control sequence than table rows.
ID 28: Table Cell — Indicates the beginning of a table cell. The contents of the cell is the chain of Elements that come after this control sequence and before the next "Table Row", "Table Cell", or "Table Header Cell" control sequence or the end of the "Table" block.
uint1	The number of table rows this cell covers. Must not be zero. You will want to use the value "1" for most table cells. (Similar to the "rowspan"-attribute in HTML, but required.)
uint1	The number of table columns this cell covers. Must not be zero. You will want to use the value "1" for most table cells. (Similar to the "colspan"-attribute in HTML, but required.)
ID 29: Table Header Cell — Same as control sequence "Table Cell", except that the contents of this cell are assumed to contain header data and should be displayed in a highlighted style by a reader application.
uint1	rowspan; see description for control sequence "Table Cell".
uint1	colspan; see description for control sequence "Table Cell".
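
Because every control sequence type has a fixed size, a reader can parse control sequences from a small lookup table. The following Python sketch uses hypothetical names and lists the entity sizes from the table above. It assumes that "Table Row" carries no data, since no entities are listed for it.

```python
# Sizes (in bytes) of the unsigned entities carried by each control sequence.
CONTROL_PAYLOAD = {
    0: (),        # Dummy
    16: (2,),     # Change font family: uint2
    17: (1,),     # Change font size: uint1
    18: (), 19: (), 20: (),           # Line break, Paragraph break, Hline
    21: (), 22: (), 23: (), 24: (),   # Begin/End italic, Begin/End bold
    25: (), 26: (),                   # Begin/End underlined
    27: (),       # Table Row (assumed to be empty)
    28: (1, 1),   # Table Cell: rowspan, colspan
    29: (1, 1),   # Table Header Cell: rowspan, colspan
}


def parse_control_sequence(data, pos=0):
    """Parse one control sequence; return (type_id, values, new_pos)."""
    if data[pos] != 0xFF:
        raise ValueError("not the beginning of a block or control sequence")
    type_id = data[pos + 1]
    if type_id >= 128:
        raise ValueError("type ID belongs to a block")
    if type_id not in CONTROL_PAYLOAD:
        raise ValueError("unknown control sequence type %d" % type_id)
    pos += 2
    values = []
    for size in CONTROL_PAYLOAD[type_id]:
        values.append(int.from_bytes(data[pos:pos + size], "big"))
        pos += size
    return type_id, tuple(values), pos
```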

Blocks

Blocks are designed to hold complex data of various size. There are several different types of blocks. Blocks can contain other blocks (subblocks), control sequences, entities, and/or strings.

The general layout of all blocks is defined as follows:

field	size	value
special char: begin of block or control sequence	1	0xFF - Denoting the beginning of a block or control sequence (in this case: block)
type ID	1	Identifies the type of the block. In order to discriminate blocks and control sequences, all blocks have a type ID equal to or higher than 128. Type IDs lower than 128 are reserved for control sequences.
contents	?	Depending on the type of the block, any other blocks (subblocks), control sequences, entities, and/or strings that store the data associated to the block. All elements have to appear in the order specified for the particular block type.
special char: end of block	1	0xFE - Denoting the end of a block
type ID	1	Same identifier as used at the beginning of the block. This is for error tolerance only.
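
The framing of a block can be sketched in Python as shown below (hypothetical function names). Note that, as far as this specification goes, only the contents of string chunks are escaped. A byte 0xFE may therefore also occur inside an entity, so a reader presumably has to parse the contents element by element instead of scanning for the end mark.

```python
def build_block(type_id, contents):
    """Wrap already encoded contents into the begin and end marks of a block."""
    if not 128 <= type_id <= 255:
        raise ValueError("block type IDs must be in the range 128-255")
    return bytes([0xFF, type_id]) + contents + bytes([0xFE, type_id])


def check_block_end(data, pos, expected_type_id):
    """Verify the end mark at pos; return the position right after it."""
    if data[pos] != 0xFE:
        raise ValueError("expected the end of a block")
    if data[pos + 1] != expected_type_id:
        # The repeated type ID exists for error tolerance only.
        raise ValueError("end mark belongs to a different block type")
    return pos + 2
```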

More concretely, the following blocks are defined:

category	type	amount	value
ID 128: Article — Contains an article (chapter) of the eBook.
string	string	1	The title of the article, in the form it should be displayed on the screen.
control sequence	dummy	1	To avoid ambiguities, this control sequence makes it possible to find the end of the list of string chunks.
entity	Timestamp	1	The date and time of the last modification of the article.
subblock	int list	1 ... ∞	A list of the IDs of all authors that contributed to this article. If the article comes from a wiki, only authors who were logged in when submitting their contribution are listed here. A program that displays the users who contributed to the article should also display a note that there might be other (anonymous) users who also contributed to this article, iff the article came from a source where anonymous users were able to contribute (such as Wikipedia). If there are more than one "int list"-subblocks, then a reader application concatenates the items of the lists to one large list.
control sequence	dummy	1	This control sequence makes the format more error tolerant and also makes reading eBooks much easier.
various	In any order and repetition, any of: String, Change font family, Change font size, Line break, Paragraph break, Hline, [Begin/End] [italic/bold/underlined], List, Text link, Anchor point, Table	0 ... ∞	The contents of the article.
ID 129: Int list — Contains a list of 0 to 255 integers of the same type.
entity	uint1	1	This entity contains the number of list items. Because this is a uint1, the list cannot contain more than 255 integers. However, whenever an int list may appear as a subblock of some parent block, the parent block is specified to allow an arbitrary number of consecutive int lists. Thus, int lists of arbitrary length are possible.
entity	sint1	1	Specifies the length (in bytes) of each integer value in the list. If the value of this entity is positive, then all list items are unsigned integers with the given length. If the value of this entity is negative, then all list items are signed integers with a length equal to the absolute value of this entity.
entity	[u/s]intX (as specified above)	0...255 (as specified above)	The list items. There must be exactly as many list items as specified by the first entity and they all have to be of the type specified by the second entity of this block.
ID 130: List — Contains a list of items to be displayed on the screen.
subblock	List item	0 ... ∞	The items of the list.
ID 131: List item — An item of a list (appears as subblock of a list block).
entity	uint1	1	The depth or level of this list item. Starting with zero, nested list items have a higher depth than their parent list items.
entity	uint2	1	The number of the list item. If set to zero, an unordered list (bulleted list) is assumed for this list level.
various	Same as for last elements in "Article"	0 ... ∞	Any text of the list item.
ID 132: Text link — A link to another page and/or anchor point. Links are automatically highlighted by a viewer application. Thus, you should not surround links with "Begin underlined" and "End underlined" control sequences.
string	string	1	The text to be displayed on the screen in a highlighted style.
control sequence	dummy	0 or 1	To avoid ambiguities, this control sequence makes it possible to find the end of the list of string chunks. This control sequence must be present if a link href is given separately from the link text (see below) and it must not be present if no separate link href is given.
string	string	0 or 1 (same as above)	The title of the article to jump to when this link is clicked. You can specify an anchor point in the article by using a #-symbol or even let this string begin with a #-symbol for internal article-links. If this string is given, then the dummy control sequence above must also be present. If this string is not given, then the dummy control sequence above must not be present either. If this string is not given, it defaults to the text given in the previous string.
ID 133: Anchor point — An (invisible) mark that can be referenced by Text links using the #-symbol.
string	string	1	An ID that can be used by TextLinks. This ID does in general not start with a #-symbol. However, TextLinks use the #-symbol to specify that the substring after the #-symbol refers to an anchor ID. The values "top" and "bottom" should not be used as they are reserved for the top and bottom of the article.
ID 134: Header — A header (title line). You normally want to surround Header blocks with "Paragraph break" control sequences.
entity	uint1	1	The depth of the header. A value of zero will create the largest header possible and should be used at the very beginning of an article.
various	Same as for last elements in "Article"	0 ... ∞	Any text of the header.
ID 135: Table — A block containing data that should be laid out as a table.
entity	uint1	1	The width of the table border, in pixels. However, you can't be sure how a reader application actually handles this attribute (some application might ignore it, because it doesn't have enough screen space to draw a border, some other application might draw a border with a maximum width of 1px, etc.).
string	string	0 or 1	A header text for the table (optional).
control sequence	dummy	1	To avoid ambiguities, this control sequence makes it possible to find the end of the list of string chunks.
various	Same as for last elements in "Article", plus: Table Row, Table Cell, Table Header Cell	0 ... ∞	The contents of the table. The order of the control sequences "Table Row", "Table Cell", "Table Header Cell" must make sense, ie. the sum of colspans of all cells in a row must be the same for each row and the sum of rowspans of all cells in a column must be the same for each column.
ID 136: Math — Contains LaTeX-code that might be rendered by a sophisticated reader application.
string	string	1	The LaTeX-code that describes the content of this block. This text will be typeset as if it was between "$$" and "$$" in a LaTeX file. A reader application should allow macros from the following LaTeX packages: ucs, inputenc, amsmath, amsfonts, amssymb, empty.
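
As a worked example of a complete block, the following Python sketch writes and reads an "Int list" block (ID 129). The function names are hypothetical, and signed values are assumed to be stored in two's complement, which this specification does not state.

```python
INT_LIST = 129


def encode_int_list(values, size, signed=False):
    """Encode up to 255 integers of `size` bytes each as an Int list block."""
    if len(values) > 255:
        raise ValueError("split longer lists into several Int list blocks")
    if not 1 <= size <= 4:
        raise ValueError("integer entities are 1 to 4 bytes long")
    # Second entity (sint1): positive = unsigned items, negative = signed.
    length_entity = -size if signed else size
    body = bytes([len(values)]) + length_entity.to_bytes(1, "big", signed=True)
    for value in values:
        body += value.to_bytes(size, "big", signed=signed)
    return bytes([0xFF, INT_LIST]) + body + bytes([0xFE, INT_LIST])


def decode_int_list(data, pos=0):
    """Decode an Int list block at pos; return (values, new_pos)."""
    if data[pos] != 0xFF or data[pos + 1] != INT_LIST:
        raise ValueError("not the beginning of an Int list block")
    count = data[pos + 2]
    length_entity = int.from_bytes(data[pos + 3:pos + 4], "big", signed=True)
    size, signed = abs(length_entity), length_entity < 0
    pos += 4
    values = []
    for _ in range(count):
        values.append(int.from_bytes(data[pos:pos + size], "big", signed=signed))
        pos += size
    if data[pos] != 0xFE or data[pos + 1] != INT_LIST:
        raise ValueError("missing or wrong end mark")
    return values, pos + 2
```

For instance, the two unsigned two-byte values 1 and 300 are stored as `ff 81 02 02 00 01 01 2c fe 81`.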

Copyright (c) 2005 Robert Bamler <Robert dot Bamler (a) gmx dot de>
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is available here.

"Apple" and "iPod" are registered trademarks of Apple Computer, Inc.