From: maarten@uva.uucp (Maarten Carels)
Subject: The MS Word File format - revised
Date: 11 Aug 86 15:25:58 GMT
Organization: Computer Science Dept., University of Amsterdam
Lines: 544

After our previous posting, we found out a lot more about the format of
MS Word document files. By now all parts of the file are known: not
only the function of each part, but also its almost complete internal
structure. A new version of the document (BinHexed MS Word format) is
posted to net.sources.mac. For those of you who do not have MS Word
(what could they do with this description :-)), a text-only version
follows this posting.

The newly discovered parts are mostly related to the division structure
of an MS Word document. Almost all details related to this part are
described.

Some small errors in the first version of the document were corrected.

We encourage everyone to make use of the information provided, but we
expect you to be fair and place every program you develop based upon
this information in the public domain, or at least make it a shareware
product (free for us, of course!). We would also like anybody who
discovers new facts about the format, or who finds errors in our
explanation, to inform us.

(-------------- Cut Here ------------------)

MicroSoft Word file format

second, revised edition

M.J. Carels,

A.G. Starreveld


Department of Computer Science 

University of Amsterdam

e-mail: {decvax, philabs, seismo}!mcvax!uva!{maarten,dolf}

s-mail: Kruislaan 409, 1098 SJ  Amsterdam, The Netherlands

1. Introduction.

This document describes the structure of the files produced by MicroSoft
WORD (versions 1.00 and 1.05). These files are of type 'WDBN'. This
knowledge has been gathered by looking at such files and trying to
interpret the bytes in them. All information is believed to be correct,
but no responsibility for errors or omissions can be taken. Please let us
know if you find errors in this document, or if you find out more about the
structure of the file. The file format used by MS Word on computers other
than the Macintosh may or may not be the same.

We will describe all structures of the file format by means of C structure
declarations, as this is a convenient way to describe things. In the
structures some basic types appear. These are:

ubyte     8 bits unsigned number
byte      8 bits signed number
ushort   16 bits unsigned number
short    16 bits signed number
ulong    32 bits unsigned number
long     32 bits signed number

No alignment is to be assumed: the only bytes present are the ones
described. In all structures the byte addresses (decimal) of the fields
are included as comments.
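
The basic types could be written in present-day C roughly as follows. This
is only a sketch: the ms_ type names and the helpers rd_ushort and rd_ulong
are hypothetical, and the big-endian (68000) byte order is an assumption,
since the description above does not state a byte order. Reading fields
byte by byte avoids any dependence on compiler alignment.

```c
#include <stdint.h>

/* Hypothetical names for the basic types used in the listings. */
typedef uint8_t  ms_ubyte;   /* ubyte:   8 bits unsigned */
typedef int8_t   ms_byte;    /* byte:    8 bits signed   */
typedef uint16_t ms_ushort;  /* ushort: 16 bits unsigned */
typedef int16_t  ms_short;   /* short:  16 bits signed   */
typedef uint32_t ms_ulong;   /* ulong:  32 bits unsigned */
typedef int32_t  ms_long;    /* long:   32 bits signed   */

/* Read a 16 bit unsigned number at byte address off.
 * ASSUMPTION: most significant byte first (big-endian). */
ms_ushort rd_ushort(const ms_ubyte *p, unsigned off)
{
    return (ms_ushort)((p[off] << 8) | p[off + 1]);
}

/* Read a 32 bit unsigned number at byte address off (same assumption). */
ms_ulong rd_ulong(const ms_ubyte *p, unsigned off)
{
    return ((ms_ulong)p[off] << 24) | ((ms_ulong)p[off + 1] << 16) |
           ((ms_ulong)p[off + 2] << 8) | (ms_ulong)p[off + 3];
}
```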

2. General file structure.

An MS Word file of type 'WDBN' can be divided into seven different main
parts. We will call these parts the "header", "text", "character formats",
"paragraph formats", "division blocks", "division list" and "page list"
respectively. These parts appear in the above mentioned order in the file's
data fork. The resource fork is always empty, i.e. not allocated.

The file can be seen as being built from basic blocks, each 128 bytes long.
This is the case for all parts of the file, although it does not appear to be
significant for the text part. This implies that the size of an MS Word file
is always a multiple of 128 bytes, though within each of the parts
mentioned, the last bytes in the last block belonging to a certain part may
(and usually will) contain garbage.

In several sections sizes, dimensions and distances are present. In all such
fields these dimensions are given in 'basic units'. An MS Word basic unit is
1/20 of a point, or 1/1440 of an inch (a point equals 1/72 of an inch).
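
The conversion can be written down directly from these ratios. The macro
and function names below are hypothetical and serve only as an
illustration.

```c
/* Basic units per point and per inch, as given in the text. */
#define MS_UNITS_PER_POINT 20
#define MS_UNITS_PER_INCH  1440

/* Convert a dimension in basic units to points. */
double ms_units_to_points(long units)
{
    return (double)units / MS_UNITS_PER_POINT;
}

/* Convert a dimension in basic units to inches. */
double ms_units_to_inches(long units)
{
    return (double)units / MS_UNITS_PER_INCH;
}
```

For example, a value of 240 basic units is 12 points, and 720 basic units
is half an inch.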

3. Header.

The header part consists of a single block of 128 bytes. It contains
pointers to most parts of the file. The header can be defined in terms of a
C structure as follows:

/*
 * This structure starts each MS-WORD 'WDBN' file.
 */

struct _header {
/* 00 */  ushort  h_1;           /* always 0xfe32 */
/* 02 */  ushort  h_2;           /* always 0 */
/* 04 */  ushort  h_3;           /* always 0xab00 */
/* 06 */  short   h_unk1[4];     /* always 0 */
/* 14 */  ulong   h_ET;          /* Position of byte past text */
/* 18 */  ushort  h_par;         /* first paragraph block # */
/* 20 */  ushort  h_div;         /* first division info block # */
/* 22 */  ushort  h_div1;        /* same */
/* 24 */  ushort  h_divlist;     /* first division list block # */
/* 26 */  ushort  h_pagelist;    /* first page list block # */
/* 28 */  ushort  h_unalloc;     /* first unallocated block # */
/* 30 */  short   h_unk2[17];    /* always 0 */
/* 64 */  ulong   h_tlength;     /* Length of text */
/* 68 */  ulong   h_tlength1;    /* same */
/* 72 */  short   h_unk3[28];    /* always 0 */
};

typedef struct _header MS_Head;

The h_ET field contains the address within the file of the byte just past
the last character in the document text, i.e. the "text" part. The h_tlength
field contains the length of the "text" part in bytes. The h_par field gives
the block number (remember an MS Word block is 128 bytes long) of the
first block that contains paragraph formats. Every division has its own
block containing margins, page number and that kind of stuff. The h_div
field contains the block number for the first division block. For some
reason it is stored twice. The connection between the text and the division
blocks is made through the division list. The h_divlist field gives the block
number for the first block in the division list. The last block in the file
contains the page list. In the page list the position of the first character
in the first line of each page is stored. It is this list which is updated
when you issue a "Repaginate" (COMMAND-J) command. The small '=' signs
in the margins also come from this list. The h_pagelist field gives the
first block for this list. The block number of the first "unallocated" block
is stored in the h_unalloc field. The file is also h_unalloc blocks long.
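
As an illustration, the fields with a known meaning could be pulled out of
the first 128 bytes of the file like this. It is a sketch under stated
assumptions: the struct ms_header_info, the helpers be16 and be32 and the
function ms_parse_header are hypothetical names, and the byte order is
assumed to be big-endian.

```c
#include <stdint.h>

/* Header fields with a known meaning (hypothetical struct). */
struct ms_header_info {
    uint32_t et;        /* h_ET: file position of byte past text */
    uint16_t par;       /* h_par: first paragraph block # */
    uint16_t div;       /* h_div: first division info block # */
    uint16_t divlist;   /* h_divlist: first division list block # */
    uint16_t pagelist;  /* h_pagelist: first page list block # */
    uint16_t unalloc;   /* h_unalloc: first unallocated block # */
    uint32_t tlength;   /* h_tlength: length of text */
};

/* ASSUMPTION: numbers are stored most significant byte first. */
uint16_t be16(const uint8_t *p) { return (uint16_t)((p[0] << 8) | p[1]); }
uint32_t be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Decode the 128 byte header block. Returns 0 on success, or -1 if the
 * three "always" fields at the start do not have their usual values. */
int ms_parse_header(const uint8_t *blk, struct ms_header_info *h)
{
    if (be16(blk) != 0xfe32 || be16(blk + 2) != 0 || be16(blk + 4) != 0xab00)
        return -1;
    h->et       = be32(blk + 14);
    h->par      = be16(blk + 18);
    h->div      = be16(blk + 20);
    h->divlist  = be16(blk + 24);
    h->pagelist = be16(blk + 26);
    h->unalloc  = be16(blk + 28);
    h->tlength  = be32(blk + 64);
    return 0;
}
```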

4. Text part.

The text part contains a complete representation of the text in the
document, including running heads, footnotes and pictures. The text is
represented in the order in which it occurs in the document, in the
extended Macintosh ascii character set. Some ascii values have special
meaning however:

0x01    page number ((page) glossary)
0x05    auto numbered footnote reference ((footnote) glossary)
0x0b    Forced new line within paragraph
0x0c    End of division or forced new page
0x0d    End of paragraph
0x1f    Optional hyphen

The above implies that to extract a text only version of an MS Word file
one only needs to extract the text part of the file, possibly replacing some
of the special characters with others, depending on what you want. If you
do nothing, you will certainly get very long lines, since you will get a
newline character only at the end of each paragraph, so perhaps you want
to do some line folding. Pictures are stored within the text part, along
with a header. The picture is a single paragraph by itself. The paragraph
format run pointing to the picture has a bit set to indicate the
corresponding paragraph is a picture.
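
A minimal sketch of such a conversion is given below. The function name
ms_plain_text and the chosen replacements (newline, form feed, '#') are
arbitrary choices made for this example, not something prescribed by the
file format. The sketch does not skip picture paragraphs, so a document
with pictures will yield binary data at those places, and it does no line
folding.

```c
#include <stddef.h>

/* Convert n bytes of the text part to plain text. The result is written
 * to out, which must have room for at least n bytes. Returns the number
 * of bytes written. */
size_t ms_plain_text(const unsigned char *text, size_t n, char *out)
{
    size_t i, o = 0;

    for (i = 0; i < n; i++) {
        switch (text[i]) {
        case 0x0b:              /* forced new line within paragraph */
        case 0x0d:              /* end of paragraph */
            out[o++] = '\n';
            break;
        case 0x0c:              /* end of division or forced new page */
            out[o++] = '\f';
            break;
        case 0x1f:              /* optional hyphen: dropped */
            break;
        case 0x01:              /* page number */
        case 0x05:              /* auto numbered footnote reference */
            out[o++] = '#';     /* placeholder chosen for this sketch */
            break;
        default:
            out[o++] = (char)text[i];
        }
    }
    return o;
}
```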

5. Format runs.

Everything related to the layout of the text is stored in what we will call
"format runs" and "format descriptors". A format run consists of several
bytes of formatting information, described below (sections 6 and 7). A
format descriptor consists of 6 bytes. It is described by the following
structure:

/*
 * This structure defines a format descriptor.
 */

struct _fdescriptor {
/* 00 */  ulong  fd_start;       /* start of text for next run */
/* 04 */  short  fd_run;         /* pointer to this format run */
};

Each format block starts with the offset in the text part where the
formats of this block start. After the initial start a number of format
descriptors follow. The rest of the format block contains format runs.
Both the format run and the format descriptor must be contained in the
same block. A new block is allocated if either one does not fit. 

The last byte in a format block (offset 0x7f) contains the number of
format descriptors present in the block.

The format runs are stored preceded by a byte count. This byte count gives
the number of bytes in the format run that are actually stored in the file.
The other (not stored) bytes of the format run contain the default value.
File size is reduced by not storing seldom used fields.

The fd_start field is a pointer in the text part of the document. The next
format applies from there. The fd_run field is an offset (relative to byte 4
in the format block) to the format run. 
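
The layout of a format block described in this section can be walked as
sketched below. The names be16, be32 and ms_fmt_run are hypothetical, the
byte order is assumed to be big-endian, and treating an fd_run value that
falls outside the block as "no run stored" is a guess made for safety, not
an observed fact.

```c
#include <stddef.h>
#include <stdint.h>

#define MS_BLOCK 128    /* size of a format block in bytes */

/* ASSUMPTION: numbers are stored most significant byte first. */
uint16_t be16(const uint8_t *p) { return (uint16_t)((p[0] << 8) | p[1]); }
uint32_t be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Look up format descriptor i of a format block. If the descriptor
 * exists, its fd_start is stored in *start. Returns a pointer to the
 * byte count that precedes the format run, or NULL if there is no such
 * descriptor or fd_run does not point inside the block. */
const uint8_t *ms_fmt_run(const uint8_t *blk, unsigned i, uint32_t *start)
{
    const uint8_t *d;
    uint16_t run;

    /* the last byte of the block holds the number of descriptors */
    if (i >= blk[MS_BLOCK - 1] || 4 + 6 * (i + 1) > MS_BLOCK - 1)
        return NULL;
    d = blk + 4 + 6 * i;        /* descriptors follow the 4 byte start */
    *start = be32(d);           /* fd_start */
    run = be16(d + 4);          /* fd_run, relative to byte 4 */
    if (run > MS_BLOCK - 1 - 4 - 1)
        return NULL;
    return blk + 4 + run;
}
```

The returned pointer addresses the byte count; the stored bytes of the
format run start one byte further on.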

6. Character formats.

The character format runs define how the characters in the text look. This
includes properties like the font, size and style of the characters. The
character format runs are 6 bytes long, although not all 6 need be stored.
One field needs special attention. The font number is split in (at least)
two pieces. The low order 6 bits are in the cf_font field. This fits most
standard fonts, as they have small numbers. More exotic fonts have larger
numbers. The extra bits are stored in the high order 3 bits of the cf_flags2
field. The meaning of the bytes in the format run is:

/*
 * This structure defines character formats.
 */

struct _cformat {
/* 00 */  ubyte  cf_unknown;     /* seems always 0x80 or 0x00 */
/* 01 */  ubyte  cf_font;        /* font number, some flags */
/* 02 */  ubyte  cf_pointsize;   /* times 2, 0 = default */
/* 03 */  ubyte  cf_flags1;      /* more flags */
/* 04 */  ubyte  cf_flags2;      /* more flags, more font # */
/* 05 */  byte   cf_position;    /* > 0 super, < 0 sub script */
};

typedef struct _cformat MS_CFmt;

/* macro for extracting the font number */
#define CHF_FONT(x) (((x)->cf_font&0x3f) | (((x)->cf_flags2&0xe0) << 1))

/* values for the cf_font field */

#define CHF_BOLD   0x80          /* bold bit */
#define CHF_ITAL   0x40          /* italic bit */

/* values for the flags1 and flags2 fields */

#define F1_UL      0x80          /* underlined */
#define F1_SC      0x0c          /* Small Caps */
#define F2_OL      0x10          /* outline */
#define F2_SH      0x08          /* shadow */

The first field in a character formats format run seems to take only the
values 0x00 and 0x80. The meaning of this field is unknown.
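
Putting the byte count, the field layout and the flag bits together, a
stored character format run could be decoded as sketched below. The
struct ms_char_style and the function ms_decode_cformat are hypothetical
names. Two assumptions are made: bytes that are not stored are taken to
be 0 (the text only says they hold "the default value"), and the F1_ and
F2_ bits are read from cf_flags1 and cf_flags2 respectively, as their
names suggest.

```c
#include <stdint.h>
#include <string.h>

/* Decoded character properties (hypothetical struct). */
struct ms_char_style {
    unsigned font;      /* complete font number */
    double   points;    /* point size, 0 = default */
    int bold, italic, underline, outline, shadow;
    int position;       /* > 0 superscript, < 0 subscript */
};

/* Decode a character format run. run points at the byte count that
 * precedes the stored bytes. */
void ms_decode_cformat(const uint8_t *run, struct ms_char_style *s)
{
    uint8_t f[6] = {0};                 /* ASSUMPTION: defaults are 0 */
    unsigned n = run[0] > 6 ? 6 : run[0];

    memcpy(f, run + 1, n);
    /* low 6 bits in cf_font, 3 more bits in the top of cf_flags2 */
    s->font      = (f[1] & 0x3f) | ((unsigned)(f[4] & 0xe0) << 1);
    s->points    = f[2] / 2.0;          /* cf_pointsize is times 2 */
    s->bold      = (f[1] & 0x80) != 0;  /* CHF_BOLD */
    s->italic    = (f[1] & 0x40) != 0;  /* CHF_ITAL */
    s->underline = (f[3] & 0x80) != 0;  /* F1_UL */
    s->outline   = (f[4] & 0x10) != 0;  /* F2_OL */
    s->shadow    = (f[4] & 0x08) != 0;  /* F2_SH */
    s->position  = f[5] >= 0x80 ? (int)f[5] - 256 : (int)f[5];
}
```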

7. Paragraph formats.

The fourth part of the file contains the paragraph formats. The format runs
start with normal paragraph formatting information. Thereafter follow the
"tab definitions". As many tab definitions as needed will be in the format
run. The tab definition is described by the structure below:

/*
 * This structure defines tabs
 */

struct _tformat {
/* 00 */  ushort  t_position;    /* tab position */
/* 02 */  ushort  t_flags;       /* type of tab stop */
};

typedef struct _tformat MS_Tab;

/* values for the t_flags field */

#define T_ALIGNMSK  0x6000       /* mask for tab alignment */
/* alignment values: */
#define T_LEFT      0x0000       /* left aligning tab */
#define T_CENTER    0x2000       /* center aligning tab */
#define T_RIGHT     0x4000       /* right aligning tab */
#define T_DECIMAL   0x6000       /* decimal tab */


#define T_LEADMSK   0x0c00       /* mask for tab leader */

/* leader values: */
#define T_BLANK     0x0000       /* blank leader */
#define T_DOTS      0x0400       /* dotted leader */
#define T_DASH      0x0800       /* dashed leader */
#define T_LINE      0x0c00       /* line leader */

The format run is defined by the structure below:

/*
 * This structure defines paragraph formats.
 */

struct _pformat {
/* 00 */  ushort  p_flags;         /* some flags */
/* 02 */  ushort  p_unk1;          /* always 0 */
/* 04 */  ushort  p_right;         /* right indent */
/* 06 */  ushort  p_left;          /* left indent */
/* 08 */  ushort  p_first;         /* first indent */
/* 10 */  ushort  p_line_spacing;  /* line spacing (0 = auto) */
/* 12 */  ushort  p_before;        /* space before */
/* 14 */  ushort  p_after;         /* space after */
/* 16 */  ushort  p_rhead_pict;    /* running head & picture info */
/* 18 */  ushort  p_unk2;          /* always 0 */
/* 20 */  ushort  p_unk3;          /* always 0 */
/* 22 */  MS_Tab  p_tabs[0];       /* list of tab descriptors */
};

typedef struct _pformat MS_Fmt;

/* values for the p_flags field */

#define PF_FOOTMASK  0x7f00      /* mask for footnote info */
/* values unknown */

#define PF_JUSTMASK  0x00c0      /* mask for justification */
/* justification values: */
#define PF_LEFT      0x0000      /* left justified paragraph */
#define PF_CENTER    0x0040      /* centered paragraph */
#define PF_RIGHT     0x0080      /* right justified paragraph */
#define PF_JUST      0x00c0      /* justified paragraph */
#define PF_KEEP      0x0010      /* keep with next paragraph */
#define PF_KEEPL     0x0020      /* keep lines together */

/* values for the p_rhead_pict field */

#define RH_MASK      0xf000      /* mask for running head info */
/* running head values: */
#define RH_FIRST     0x8000      /* appears on first page */
#define RH_EVEN      0x4000      /* on even pages */
#define RH_ODD       0x2000      /* on odd pages */
#define RH_BOTTOM    0x1000      /* appears on bottom of page */

#define RH_PICT      0x0800      /* this paragraph is a picture */

The p_flags field sometimes has the high order bit set (0x8000). The
meaning of this could be the same as the setting of this bit in the
character formats, where it is also sometimes set. What this bit indicates
is unknown.

In the paragraph format run only the tab descriptors needed for explicitly
defined tabs are stored. The list of tab descriptors is ended by a tab
descriptor with the t_position field equal to 0. This final tab descriptor is
stored in the file, although this does not seem necessary.
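
The tab list of a paragraph format run can be scanned as sketched below.
The function names ms_count_tabs and ms_tab_align and the constant
P_TABS_OFFSET are hypothetical, and big-endian byte order is assumed. The
sketch stops either at the terminating descriptor or when the stored
bytes run out, whichever comes first.

```c
#include <stdint.h>

#define T_ALIGNMSK    0x6000    /* mask for tab alignment */
#define P_TABS_OFFSET 22        /* byte address of p_tabs in the run */

/* Count the tab descriptors in a paragraph format run. fmt points at
 * the stored bytes (just after the byte count), len is the byte count. */
int ms_count_tabs(const uint8_t *fmt, unsigned len)
{
    unsigned off;
    int n = 0;

    for (off = P_TABS_OFFSET; off + 4 <= len; off += 4) {
        unsigned pos = (unsigned)(fmt[off] << 8) | fmt[off + 1];
        if (pos == 0)           /* t_position == 0 ends the list */
            break;
        n++;
    }
    return n;
}

/* Alignment bits of tab i; compare with T_LEFT ... T_DECIMAL.
 * The caller must make sure tab i exists. */
unsigned ms_tab_align(const uint8_t *fmt, int i)
{
    unsigned off = P_TABS_OFFSET + 4 * (unsigned)i;

    return ((unsigned)(fmt[off + 2] << 8) | fmt[off + 3]) & T_ALIGNMSK;
}
```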

A picture is identified by having the RH_PICT bit set. In the text part of
the document the picture is present, preceded by a 6 byte header. The
actual picture data follows after the header. It is encoded in standard PICT
format (see tech note #21). The picture header is defined by the structure
given below:

/*
 * This structure defines a picture header
 */

struct _phead {
/* 00 */  short  ph_offset;      /* offset from left margin */
/* 02 */  short  ph_xdist;       /* distortion in x direction */
/* 04 */  short  ph_ydist;       /* distortion in y direction */
};

typedef struct _phead MS_Pict;

The ph_xdist and ph_ydist fields are used to store the distortion of the
picture. If both are zero, the picture is undistorted. The exact meaning of
the values stored here is unknown. 

8. Division blocks.

The next (fifth) part of the file contains the division blocks. One block
(128 bytes!) is allocated for every division. The block is filled with the
following structure, preceded by a byte count indicating how many bytes
are actually stored:

/*
 * This structure describes division formats
 */

struct _dformat {
/* 00 */  ushort  d_flags;       /* flags */
/* 02 */  ushort  d_pap_len;     /* total paper length */
/* 04 */  ushort  d_pap_wit;     /* total paper width */
/* 06 */  ushort  d_p_start;     /* start page # */
/* 08 */  ushort  d_top;         /* top margin */
/* 10 */  ushort  d_bot;         /* bottom margin, from top paper */
/* 12 */  ushort  d_left;        /* left margin */
/* 14 */  ushort  d_right;       /* right margin, from left paper */
/* 16 */  ushort  d_flag_col;    /* some flags, number of columns */
/* 18 */  ushort  d_r_top;       /* top run.head pos, from top paper */
/* 20 */  ushort  d_r_bot;       /* bottom run.head pos, from top paper */
/* 22 */  ushort  d_colsp;       /* column spacing */
/* 24 */  ushort  d_gutter;      /* gutter */
/* 26 */  ushort  d_pag_top;     /* page number position from top */
/* 28 */  ushort  d_pag_left;    /* page number position from left */
/* 30 */  ushort  d_unk1;
/* 32 */  ushort  d_rbot;        /* seems runn. head pos, from bottom */
/* 34 */  short   d_unk2[34];
};

typedef struct _dformat MS_Div;

#define DFB_MASK     0x00e0      /* mask for break */
/* values for break: */
#define DFB_CONT     0x0000      /* continuous */
#define DFB_COL      0x0020      /* column */
#define DFB_PAGE     0x0040      /* page (default) */
#define DFB_ODD      0x0060      /* odd */
#define DFB_EVEN     0x0080      /* even */

#define DFP_MASK     0x001c      /* mask for the page # format */
/* values for page # format: */
#define DFP_NUM      0x0000      /* numeric (1, 2...) */
#define DFP_ROM      0x0004      /* roman, upper case (I, II...) */
#define DFP_rom      0x0008      /* roman, lower case (i, ii...) */
#define DFP_ALF      0x000c      /* alphabetic, upper case (A, B...) */
#define DFP_alf      0x0010      /* alphabetic, lower case (a, b...) */

#define DF_DIV       0x0001      /* division layout present */

#define DEFAULT_PAG  0xffff      /* default page number */

/* mask for flags in the d_flag_col field */
/* values: */
#define DCF_AUTO     0x0200      /* auto page numbering on */
#define DCF_FOOT     0x0100      /* 1=footnote at end of division */

#define DCF_COL      0x00ff      /* mask for number of columns */

Some of the values stored in the division blocks seem to have no relation
to a division, but rather to the document as a whole. These values are
present in all division blocks, but only the value in the first division block
is used. The values in the other blocks are just ignored. As usual, all
dimensions are in basic units. 

The DF_DIV bit is used to indicate whether the division layout information
is stored (DF_DIV = 1) or only the paper dimensions (DF_DIV = 0). If the
DF_DIV bit is 0, only 16 bytes of the division block are used.
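
A few fields of a division block can be decoded as sketched below. The
struct ms_div_info and the function ms_decode_division are hypothetical
names, the byte order is assumed to be big-endian, and bytes beyond the
byte count are taken to be 0, which is an assumption: the real default
values are not known to be zero.

```c
#include <stdint.h>
#include <string.h>

#define DFB_MASK 0x00e0         /* mask for break */
#define DFP_MASK 0x001c         /* mask for the page # format */
#define DF_DIV   0x0001         /* division layout present */
#define DCF_COL  0x00ff         /* mask for number of columns */
#define MS_DFORMAT_SIZE 102     /* size of the division structure */

/* A few decoded division properties (hypothetical struct). */
struct ms_div_info {
    unsigned brk;               /* one of the DFB_ values */
    unsigned pagefmt;           /* one of the DFP_ values */
    int      has_layout;        /* DF_DIV bit */
    unsigned pap_len, pap_wid;  /* paper size in basic units */
    unsigned columns;           /* number of columns */
};

/* Decode a division block. blk points at the byte count that starts
 * the block. */
void ms_decode_division(const uint8_t *blk, struct ms_div_info *d)
{
    uint8_t f[MS_DFORMAT_SIZE] = {0};   /* ASSUMPTION: defaults are 0 */
    unsigned n = blk[0] > MS_DFORMAT_SIZE ? MS_DFORMAT_SIZE : blk[0];
    unsigned flags;

    memcpy(f, blk + 1, n);
    flags = (unsigned)(f[0] << 8) | f[1];           /* d_flags */
    d->brk        = flags & DFB_MASK;
    d->pagefmt    = flags & DFP_MASK;
    d->has_layout = (flags & DF_DIV) != 0;
    d->pap_len    = (unsigned)(f[2] << 8) | f[3];   /* d_pap_len */
    d->pap_wid    = (unsigned)(f[4] << 8) | f[5];   /* d_pap_wit */
    d->columns    = ((unsigned)(f[16] << 8) | f[17]) & DCF_COL;
}
```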

The division blocks do not contain any information on which part of the
text they apply to. This information is stored in the next part of the file,
the division list.

9. Division list.

The division list links the division blocks to the text. It consists of
division descriptors. These are defined by the structure:

/*
 * This structure describes a division descriptor
 */

struct _pdiv {
/* 00 */  ulong   pd_text;       /* where the division ends */
/* 04 */  ushort  pd_unk;        /* unknown */
/* 06 */  ulong   pd_block;      /* where the div block is */
};

typedef struct _pdiv MS_DivD;

The pd_text field gives the place in the text where the division ends,
relative to the start of the text part. Add 0x80 to get the offset from the
start of the file. The pd_block field gives the offset in the file where the
division block starts. If the pd_block field is 0xffffffff, this division has
no division block allocated. This seems to be a division with all default values.
The meaning of the pd_unk field is unclear.

The structure described in section 8 is present in the corresponding
division block, preceded by a bytecount. As with the paragraph and
character formats, bytes not stored contain default values. As many
division descriptors are in the division list as there are divisions. The
division list holds some more information, as can be seen in the structure
definition:

/*
 * This structure describes the division list
 */

struct _divlist {
/* 00 */  ushort   dl_count;     /* the number of descriptors */
/* 02 */  ushort   dl_unk;       /* some unknown counter */
/* 04 */  MS_DivD  dl_list[0];   /* as many as needed */
};

typedef struct _divlist MS_DivList;

If one block (128 bytes) is not sufficient to store the division list, the
list can span block boundaries.

The division list is also the way to see if a 0x0c character stored within
the file is a 'forced new page' ('SHIFT-ENTER') or an 'end of division'
character. If there is no entry for the 0x0c character in the division list,
it must be only a new page.

10. Page list.

The last part of the file is the page list. This list contains information
about the pages in the document. This information is used to process the
"Go To..." command, and to show the small '=' signs in the margin. The page
list is updated when the document is repaginated by means of the
"Repaginate" command. The items in the page list are described by this
structure:

/*
 * This structure describes a page list item
 */

struct _page {
/* 00 */  ushort  pg_num;        /* the page number */
/* 02 */  ulong   pg_text;       /* where it starts */
};

typedef struct _page MS_Page;

The page numbers seem to be always in numerical order. The pg_text field
is the offset in the text part where the page starts. Add 0x80 to get the
position relative to the start of the file. All page list items are contained
in the page list, which fills the last part of the file:

/*
 * This structure describes the page list
 */

struct _plist {
/* 00 */  ushort   pl_count;     /* the number of list items */
/* 02 */  ushort   pl_unk;       /* some other count */
/* 04 */  MS_Page  pl_list[0];   /* the list */
};

typedef struct _plist MS_PList;
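
As an example of using the page list, the page on which a given text
offset falls can be looked up as sketched below. The function name
ms_page_of is hypothetical and big-endian byte order is assumed. The
sketch also assumes that the whole list is available in one contiguous
buffer and that the items are ordered by pg_text, which seems plausible
given the numerical ordering of the page numbers but has not been
verified.

```c
#include <stdint.h>

/* ASSUMPTION: numbers are stored most significant byte first. */
uint32_t be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Return the page number of the page that contains text offset
 * text_off (relative to the start of the text part), or 0 if the
 * offset lies before the first item. pl points at the page list. */
unsigned ms_page_of(const uint8_t *pl, uint32_t text_off)
{
    unsigned count = (unsigned)(pl[0] << 8) | pl[1];    /* pl_count */
    unsigned i, page = 0;

    for (i = 0; i < count; i++) {
        const uint8_t *item = pl + 4 + 6 * i;   /* 6 byte list items */

        if (be32(item + 2) > text_off)          /* pg_text */
            break;
        page = (unsigned)(item[0] << 8) | item[1];      /* pg_num */
    }
    return page;
}
```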

11. End of file.

Just after the page list the end of file is present. The file contains an
integral number of MS Word blocks. As an MS Word block is smaller than a
physical disk block, the end of file may be in the middle of the last
physical disk block.
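
Together with the statement in section 3 that the file is h_unalloc
blocks long, this gives a simple consistency check, sketched below. The
function names ms_block_offset and ms_size_ok are hypothetical.

```c
#define MS_BLOCK 128L   /* size of an MS Word block in bytes */

/* Byte offset in the file of block number n. */
long ms_block_offset(unsigned n)
{
    return (long)n * MS_BLOCK;
}

/* Returns 1 if file_size is a whole number of blocks and agrees with
 * the h_unalloc field of the header, 0 otherwise. */
int ms_size_ok(long file_size, unsigned h_unalloc)
{
    return file_size % MS_BLOCK == 0 &&
           file_size / MS_BLOCK == (long)h_unalloc;
}
```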


Dolf Starreveld / Maarten Carels
Department of Computer Science, UvA

Usenet:      {dolf,maarten}@uva.uucp
             {seismo,decvax,philabs}!mcvax!uva!{dolf,maarten}

Snail mail:  Dolf Starreveld
             Department of Computing Science
             University of Amsterdam
             Kruislaan 409
             NL-1098 SJ  Amsterdam
             The Netherlands

Telephone:   In Holland:    020-592 5137/5022
             International: 31-20-592 5137 or 31-20-592 5022

Telex:       10262 HEF NL