From: maarten@uva.uucp (Maarten Carels)
Subject: The MS Word File format - revised
Date: 11 Aug 86 15:25:58 GMT
Organization: Computer Science Dept., University of Amsterdam
Lines: 544

After our previous posting, we found out a lot more about the format of MS Word document files. By now, all parts of the file are known, not only the function of the part, but also the almost complete internal structure.

A new version of the document (BinHexed MS Word format) is posted to net.sources.mac. For those of you who do not have MS Word (what could they do with this description -:)), a text only version follows this posting.

New parts discovered are mostly related to the division structure of a MS Word document. Almost all details related to this part are described. Some small errors in the first version of the document were corrected.

We encourage everyone to make use of the information provided, but we expect you to be fair and place every program you develop based upon this information in the public domain, or at least make it a shareware product. (Free for us of course !). Also we would like anybody who discovers new facts about the format, or who finds errors in our explanation to inform us.

(-------------- Cut Here ------------------)

MicroSoft Word file format
second, revised edition

M.J. Carels, A.G. Starreveld
Department of Computer Science
University of Amsterdam
e-mail: {decvax, philabs, seismo}!mcvax!uva!{maarten,dolf}
s-mail: Kruislaan 409, 1098 SJ Amsterdam, The Netherlands

1. Introduction.

This document describes the structure of the files produced by MicroSoft WORD (versions 1.00 and 1.05). These files are of type 'WDBN'. This knowledge has been gathered by looking at such files, and trying to interpret the bytes in the file. All information is believed to be correct, but no responsibility for errors or omissions can be taken. Please let us know if you find errors in this document, or if you find more about the structure of the file.
The file format used by MS Word for other computers than Macs may, or may not be the same.

We will describe all structures of the file format by means of C structure declarations, as this is a convenient way to describe things. In the structures some basic types appear. These are:

    ubyte     8 bits unsigned number
    byte      8 bits signed number
    ushort    16 bits unsigned number
    short     16 bits signed number
    ulong     32 bits unsigned number
    long      32 bits signed number

No alignment is to be assumed, the only bytes present are the ones described. In all structures byte addresses (decimal) are included as comments.

2. General file structure.

A MS Word file of type 'WDBN' can be divided into seven different main parts. We will call these parts the "header", "text", "character formats", "paragraph formats", "division blocks", "division list" and "page list" respectively. These parts appear in the above mentioned order in the file's data fork. The resource fork is always empty, i.e. not allocated.

The file can be seen as being built from basic blocks, each 128 bytes long. This is the case for all parts of the file, although it does not appear to be significant for the text part. This implies that the size of an MS Word file is always a multiple of 128 bytes, though within each of the parts mentioned, the last bytes in the last block belonging to a certain part may (and usually will) contain garbage.

In several sections sizes, dimensions and distances are present. In all such fields these dimensions are given in 'basic units'. A MS Word basic unit is 1/20 of a point, or 1/1440 of an inch (a point equals 1/72 of an inch).

3. Header.

The header part consists of a single block of 128 bytes. It contains pointers to most parts of the file. The header can be defined in terms of a C structure as follows:

/*
 * This structure starts each MS-WORD 'WDBN' file.
 */
struct _header {
/* 00 */  ushort  h_1;          /* always 0xfe32 */
/* 02 */  ushort  h_2;          /* always 0 */
/* 04 */  ushort  h_3;          /* always 0xab00 */
/* 06 */  short   h_unk1[4];    /* always 0 */
/* 14 */  ulong   h_ET;         /* Position of byte past text */
/* 18 */  ushort  h_par;        /* first paragraph block # */
/* 20 */  ushort  h_div;        /* first division info block # */
/* 22 */  ushort  h_div1;       /* same */
/* 24 */  ushort  h_divlist;    /* first division list block # */
/* 26 */  ushort  h_pagelist;   /* first page list block # */
/* 28 */  ushort  h_unalloc;    /* first unallocated block # */
/* 30 */  short   h_unk2[17];   /* always 0 */
/* 64 */  ulong   h_tlength;    /* Length of text */
/* 68 */  ulong   h_tlength1;   /* same */
/* 72 */  short   h_unk3[28];   /* always 0 */
};
typedef struct _header MS_Head;

The h_ET field contains the address within the file of the byte just past the last character in the document text, i.e. the "text" part. The h_tlength field contains the length of the "text" part in bytes.

The h_par field gives the block number (remember an MS Word block is 128 bytes long) of the first block that contains paragraph formats.

Every division has its own block containing margins, page number and that kind of stuff. The h_div field contains the block number of the first division block. For some reason it is stored twice.

The connection between the text and the division blocks is made through the division list. The h_divlist field gives the block number of the first block in the division list.

The last block in the file contains the page list. In the page list the position of the first character in the first line of each page is stored. It is this list which is updated when you issue a "Repaginate" (COMMAND-J) command. The small '=' signs in the margins also come from this list. The h_pagelist field gives the first block of this list.

The block number of the first "unallocated" block is stored in the h_unalloc field. The file is therefore also h_unalloc blocks long.

4. Text part.
The text part contains a complete representation of the text in the document, including running heads, footnotes and pictures. The text is represented in the order in which it occurs in the document, in the extended Macintosh ASCII character set. Some ASCII values have special meaning however:

    0x01    page number ((page) glossary)
    0x05    auto numbered footnote reference ((footnote) glossary)
    0x0b    Forced new line within paragraph
    0x0c    End of division or forced new page
    0x0d    End of paragraph
    0x1f    Optional hyphen

The above implies that to extract a text only version of an MS Word file one only needs to extract the text part of the file, possibly replacing some of the special characters with others, depending on what you want. If you do nothing, you will certainly get very long lines, since you will get a newline character only at the end of each paragraph, so perhaps you want to do some line folding.

Pictures are stored within the text part, along with a header. The picture is a single paragraph by itself. The paragraph format run pointing to the picture has a bit set to indicate that the corresponding paragraph is a picture.

5. Format runs.

Everything related to the layout of the text is stored in what we will call "format runs" and "format descriptors". A format run consists of several bytes of formatting information, described below (sections 6 and 7). A format descriptor consists of 6 bytes. It is described by the following structure:

/*
 * This structure defines a format descriptor.
 */
struct _fdescriptor {
/* 00 */  ulong  fd_start;  /* start of text for next run */
/* 04 */  short  fd_run;    /* pointer to this format run */
};

Each format block starts with the offset in the text part where the formats of this block start. After the initial start a number of format descriptors follow. The rest of the format block contains format runs. Both the format run and the format descriptor must be contained in the same block. A new block is allocated if either one does not fit.
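As a rough illustration of the block layout just described, the sketch below decodes one format descriptor from a 128 byte format block held in memory. It rests on an assumption this document does not state explicitly: that multi-byte values are stored high byte first, as is usual on the Macintosh. The names MSW_BLOCK, FD_FIRST, FD_SIZE, struct fdesc, get_u32, get_s16 and read_fdescriptor are ours, chosen for the example; they are not part of MS Word.

```c
#include <assert.h>
#include <stddef.h>

#define MSW_BLOCK  128  /* size of an MS Word block (section 2) */
#define FD_FIRST   4    /* descriptors follow the 4 byte start offset */
#define FD_SIZE    6    /* size of one format descriptor */

/* In-memory copy of one format descriptor (hypothetical helper type). */
struct fdesc {
    unsigned long fd_start;  /* start of text for next run */
    long          fd_run;    /* pointer to this format run */
};

/* ASSUMPTION: values are stored high byte first (big-endian). */
unsigned long get_u32(const unsigned char *p)
{
    return ((unsigned long)p[0] << 24) | ((unsigned long)p[1] << 16) |
           ((unsigned long)p[2] << 8)  |  (unsigned long)p[3];
}

/* Read a signed 16 bit value without relying on integer conversions. */
long get_s16(const unsigned char *p)
{
    long v = ((long)p[0] << 8) | (long)p[1];
    return v >= 0x8000L ? v - 0x10000L : v;
}

/*
 * Decode descriptor number idx (counting from 0) of a format block.
 * Returns 0 on success, -1 if such a descriptor cannot fit in the block.
 */
int read_fdescriptor(const unsigned char *block, size_t idx, struct fdesc *out)
{
    size_t off;

    assert(block != NULL && out != NULL);
    /* the last byte of the block (offset 0x7f) holds the descriptor count,
       so descriptors can only occupy bytes 4 up to and including 0x7e */
    if (idx >= (MSW_BLOCK - 1 - FD_FIRST) / FD_SIZE)
        return -1;
    off = FD_FIRST + FD_SIZE * idx;
    out->fd_start = get_u32(block + off);
    out->fd_run   = get_s16(block + off + 4);
    return 0;
}
```

A caller would read one block from the file, take the descriptor count from its last byte, and call read_fdescriptor for each index below that count.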
The last byte in a format block (offset 0x7f) contains the number of format descriptors present in the block.

The format runs are stored preceded by a byte count. This byte count gives the number of bytes in the format run that are actually stored in the file. The other (not stored) bytes of the format run contain the default value. File size is reduced by not storing seldom used fields.

The fd_start field is a pointer into the text part of the document. The next format applies from there. The fd_run field is an offset (relative to byte 4 in the format block) to the format run.

6. Character formats.

The character format runs define how the characters in the text look. This includes properties like the font, size and style of the characters. The character format runs are 6 bytes long, although not all 6 need be stored.

One field needs special attention. The font number is split into (at least) two pieces. The low order 6 bits are in the cf_font field. This fits most standard fonts, as they have small numbers. More exotic fonts have larger numbers. The extra bits are stored in the high order 3 bits of the cf_flags2 field.

The meaning of the bytes in the format run is:

/*
 * This structure defines character formats.
 */
struct _cformat {
/* 00 */  ubyte  cf_unknown;    /* seems always 0x80 or 0x00 */
/* 01 */  ubyte  cf_font;       /* font number, some flags */
/* 02 */  ubyte  cf_pointsize;  /* times 2, 0 = default */
/* 03 */  ubyte  cf_flags1;     /* more flags */
/* 04 */  ubyte  cf_flags2;     /* more flags, more font # */
/* 05 */  byte   cf_position;   /* > 0 super, < 0 sub script */
};
typedef struct _cformat MS_CFmt;

/* macro for extracting the font number */
#define CHF_FONT(x)  (((x)->cf_font & 0x3f) | (((x)->cf_flags2 & 0xe0) << 1))

/* values for the cf_font field */
#define CHF_BOLD  0x80  /* bold bit */
#define CHF_ITAL  0x40  /* italic bit */

/* values for the flags1 and flags2 fields */
#define F1_UL  0x80  /* underlined */
#define F1_SC  0x0c  /* Small Caps */
#define F2_OL  0x10  /* outline */
#define F2_SH  0x08  /* shadow */

The first field in a character format run seems to take only the values 0x00 and 0x80. The meaning of this field is unknown.

7. Paragraph formats.

The fourth part of the file contains the paragraph formats. The format runs start with normal paragraph formatting information. Thereafter follow the "tab definitions". As many tab definitions as needed will be in the format run. The tab definition is described by the structure below:

/*
 * This structure defines tabs
 */
struct _tformat {
/* 00 */  ushort  t_position;  /* tab position */
/* 02 */  ushort  t_flags;     /* type of tab stop */
};
typedef struct _tformat MS_Tab;

/* values for the t_flags field */
#define T_ALIGNMSK  0x6000  /* mask for tab alignment */
/* alignment values: */
#define T_LEFT      0x0000  /* left aligning tab */
#define T_CENTER    0x2000  /* center aligning tab */
#define T_RIGHT     0x4000  /* right aligning tab */
#define T_DECIMAL   0x6000  /* decimal tab */

#define T_LEADMSK   0x0c00  /* mask for tab leader */
/* leader values: */
#define T_BLANK     0x0000  /* blank leader */
#define T_DOTS      0x0400  /* dotted leader */
#define T_DASH      0x0800  /* dashed leader */
#define T_LINE      0x0c00  /* line leader */

The format run is defined by the structure below:

/*
 * This structure defines paragraph formats.
 */
struct _pformat {
/* 00 */  ushort  p_flags;         /* some flags */
/* 02 */  ushort  p_unk1;          /* always 0 */
/* 04 */  ushort  p_right;         /* right indent */
/* 06 */  ushort  p_left;          /* left indent */
/* 08 */  ushort  p_first;         /* first indent */
/* 10 */  ushort  p_line_spacing;  /* line spacing (0 = auto) */
/* 12 */  ushort  p_before;        /* space before */
/* 14 */  ushort  p_after;         /* space after */
/* 16 */  ushort  p_rhead_pict;    /* running head & picture info */
/* 18 */  ushort  p_unk2;          /* always 0 */
/* 20 */  ushort  p_unk3;          /* always 0 */
/* 22 */  MS_Tab  p_tabs[0];       /* list of tab descriptors */
};
typedef struct _pformat MS_Fmt;

/* values for the p_flags field */
#define PF_FOOTMASK  0x7f00  /* mask for footnote info */
/* values unknown */

#define PF_JUSTMASK  0x00c0  /* mask for justification */
/* justification values: */
#define PF_LEFT      0x0000  /* left justified paragraph */
#define PF_CENTER    0x0040  /* centered paragraph */
#define PF_RIGHT     0x0080  /* right justified paragraph */
#define PF_JUST      0x00c0  /* justified paragraph */

#define PF_KEEP      0x0010  /* keep with next paragraph */
#define PF_KEEPL     0x0020  /* keep lines together */

/* values for the p_rhead_pict field */
#define RH_MASK      0xf000  /* mask for running head info */
/* running head values: */
#define RH_FIRST     0x8000  /* appears on first page */
#define RH_EVEN      0x4000  /* on even pages */
#define RH_ODD       0x2000  /* on odd pages */
#define RH_BOTTOM    0x1000  /* appears on bottom of page */

#define RH_PICT      0x0800  /* this paragraph is a picture */

The p_flags field sometimes has the high order bit set (0x8000). The meaning of this could be the same as the setting of this bit in the character formats, where it is also sometimes set. What this bit indicates is unknown.

In the paragraph format run only the tab descriptors needed for explicitly defined tabs are stored. The list of tab descriptors is ended by a tab descriptor with the t_position field equal to 0. This final tab descriptor is stored in the file, although this does not seem necessary.

A picture is identified by having the RH_PICT bit set.
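The flag fields and the tab list described in this section can be picked apart along the following lines. This is only a sketch: it assumes that `run` points at the first byte of a paragraph format run (just after its byte count), that `stored` is that byte count, and that 16 bit values are stored high byte first, which this document does not state explicitly. The function names and the PF_TABS and TAB_SIZE constants are ours.

```c
#include <assert.h>
#include <stddef.h>

/* Masks and values copied from the definitions in this section. */
#define PF_JUSTMASK  0x00c0
#define PF_LEFT      0x0000
#define PF_CENTER    0x0040
#define PF_RIGHT     0x0080
#define PF_JUST      0x00c0
#define RH_PICT      0x0800

#define PF_TABS      22  /* offset of p_tabs within the format run */
#define TAB_SIZE     4   /* size of one tab descriptor */

/* Name the justification encoded in a p_flags value. */
const char *pf_justification(unsigned p_flags)
{
    switch (p_flags & PF_JUSTMASK) {
    case PF_LEFT:   return "left";
    case PF_CENTER: return "centered";
    case PF_RIGHT:  return "right";
    default:        return "justified";  /* PF_JUST */
    }
}

/* Non-zero if the p_rhead_pict value marks the paragraph as a picture. */
int pf_is_picture(unsigned p_rhead_pict)
{
    return (p_rhead_pict & RH_PICT) != 0;
}

/*
 * Count the explicitly defined tabs in a paragraph format run.
 * The list ends at a descriptor with t_position == 0, or where the
 * stored bytes run out.
 * ASSUMPTION: t_position is stored high byte first.
 */
size_t pf_count_tabs(const unsigned char *run, size_t stored)
{
    size_t n = 0;
    size_t off = PF_TABS;

    assert(run != NULL);
    while (off + TAB_SIZE <= stored) {
        unsigned pos = ((unsigned)run[off] << 8) | run[off + 1];
        if (pos == 0)
            break;
        n++;
        off += TAB_SIZE;
    }
    return n;
}
```

For example, a run that stores 30 bytes, with a single centered tab followed by the terminating descriptor, gives a count of 1; a run that stores only the first 22 bytes has no tabs at all.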
In the text part of the document the picture is present, preceded by a 6 byte header. The actual picture data follows after the header. It is encoded in standard PICT format (see tech note #21). The picture header is defined by the structure given below:

/*
 * This structure defines a picture header
 */
struct _phead {
/* 00 */  short  ph_offset;  /* offset from left margin */
/* 02 */  short  ph_xdist;   /* distortion in x direction */
/* 04 */  short  ph_ydist;   /* distortion in y direction */
};
typedef struct _phead MS_Pict;

The ph_xdist and ph_ydist fields are used to store the distortion of the picture. If both are zero, the picture is undistorted. The exact meaning of the values stored here is unknown.

8. Division blocks.

The next (fifth) part of the file contains the division blocks. One block (128 bytes!) is allocated for every division. The block is filled with the following structure, preceded by a byte count indicating how many bytes are actually stored:

/*
 * This structure describes division formats
 */
struct _dformat {
/* 00 */  ushort  d_flags;     /* flags */
/* 02 */  ushort  d_pap_len;   /* total paper length */
/* 04 */  ushort  d_pap_wit;   /* total paper width */
/* 06 */  ushort  d_p_start;   /* start page # */
/* 08 */  ushort  d_top;       /* top margin */
/* 10 */  ushort  d_bot;       /* bottom margin, from top paper */
/* 12 */  ushort  d_left;      /* left margin */
/* 14 */  ushort  d_right;     /* right margin, from left paper */
/* 16 */  ushort  d_flag_col;  /* some flags, number of columns */
/* 18 */  ushort  d_r_top;     /* top run.head pos, from top paper */
/* 20 */  ushort  d_r_bot;     /* bottom run.head pos, from top paper */
/* 22 */  ushort  d_colsp;     /* column spacing */
/* 24 */  ushort  d_gutter;    /* gutter */
/* 26 */  ushort  d_pag_top;   /* page number position from top */
/* 28 */  ushort  d_pag_left;  /* page number position from left */
/* 30 */  ushort  d_unk1;
/* 32 */  ushort  d_rbot;      /* seems runn. head pos, from bottom */
/* 34 */  short   d_unk2[34];
};
typedef struct _dformat MS_Div;

/* values for the d_flags field */
#define DFB_MASK  0x00e0  /* mask for break */
/* values for break: */
#define DFB_CONT  0x0000  /* continuous */
#define DFB_COL   0x0020  /* column */
#define DFB_PAGE  0x0040  /* page (default) */
#define DFB_ODD   0x0060  /* odd */
#define DFB_EVEN  0x0080  /* even */

#define DFP_MASK  0x001c  /* mask for the page # format */
/* values for page # format: */
#define DFP_NUM   0x0000  /* numeric (1, 2...) */
#define DFP_ROM   0x0004  /* roman, upper case (I, II...) */
#define DFP_rom   0x0008  /* roman, lower case (i, ii...) */
#define DFP_ALF   0x000c  /* alphabetic, upper case (A, B...) */
#define DFP_alf   0x0010  /* alphabetic, lower case (a, b...) */

#define DF_DIV    0x0001  /* division layout present */

#define DEFAULT_PAG  0xffff  /* default page number */

/* mask for flags in the d_flag_col field */
/* values: */
#define DCF_AUTO  0x0200  /* auto page numbering on */
#define DCF_FOOT  0x0100  /* 1=footnote at end of division */
#define DCF_COL   0x00ff  /* mask for number of columns */

Some of the values stored in the division blocks seem to have no relation to a division, but rather to the document as a whole. These values are present in all division blocks, but only the value in the first division block is used. The values in the other blocks are just ignored. As usual, all dimensions are in basic units.

The DF_DIV bit is used to indicate whether the division layout information is stored (DF_DIV = 1) or only the paper dimensions (DF_DIV = 0). If the DF_DIV bit is 0, only 16 bytes of the division block are used.

The division blocks do not contain any information on which part of the text they apply to. This information is stored in the next part of the file, the division list.

9. Division list.

The division list links the division blocks to the text. It consists of division descriptors.
These are defined by the structure:

/*
 * This structure describes a division descriptor
 */
struct _pdiv {
/* 00 */  ulong   pd_text;   /* where the division starts */
/* 04 */  ushort  pd_unk;    /* unknown */
/* 06 */  ulong   pd_block;  /* where the div block is */
};
typedef struct _pdiv MS_DivD;

The pd_text field gives the place in the text where the division ends, relative to the start of the text part. Add 0x80 to get the offset from the start of the file. The pd_block field gives the offset in the file where the division block starts. If the pd_block field is 0xffffffff, this division has no division block allocated. This seems to be a division with all default values. The meaning of the pd_unk field is unclear.

The structure described in section 8 is present in the corresponding division block, preceded by a byte count. As with the paragraph and character formats, bytes not stored contain default values.

There are as many division descriptors in the division list as there are divisions. The division list holds some more information, as can be seen in the structure definition:

/*
 * This structure describes the division list
 */
struct _divlist {
/* 00 */  ushort   dl_count;    /* the number of descriptors */
/* 02 */  ushort   dl_unk;      /* some unknown counter */
/* 04 */  MS_DivD  dl_list[0];  /* as many as needed */
};
typedef struct _divlist MS_DList;

If one block (128 bytes) is not sufficient to store the division list, the list can span block boundaries.

The division list is also the way to see whether a 0x0c character stored within the file is a 'forced new page' ('SHIFT-ENTER') or an 'end of division' character. If there is no entry for the 0x0c character in the division list, it must be only a new page.

10. Page list.

The last part of the file is the page list. This list contains information about the pages in the document. This information is used to process the "Go To..." command, and to show the small '=' signs in the margin.
The page list is updated when the document is repaginated by means of the "Repaginate" command.

The items in the page list are described by this structure:

/*
 * This structure describes a page list item
 */
struct _page {
/* 00 */  ushort  pg_num;   /* the page number */
/* 02 */  ulong   pg_text;  /* where it starts */
};
typedef struct _page MS_Page;

The page numbers always seem to be in numerical order. The pg_text field is the offset in the text part where the page starts. Add 0x80 to get the position relative to the start of the file.

All page list items are contained in the page list, which fills the last part of the file:

/*
 * This structure describes the page list
 */
struct _plist {
/* 00 */  ushort   pl_count;    /* the number of list items */
/* 02 */  ushort   pl_unk;      /* some other count */
/* 04 */  MS_Page  pl_list[0];  /* the list */
};
typedef struct _plist MS_PList;

11. End of file.

Just after the page list the end of file is present. The file contains an integral number of MS Word blocks. As an MS Word block is smaller than a physical disk block, the end of file may be in the middle of the last physical disk block.

Dolf Starreveld / Maarten Carels
Department of Computer Science, UvA

Usenet:      {dolf,maarten}@uva.uucp
             {seismo,decvax,philabs}!mcvax!uva!{dolf,maarten}
Snail mail:  Dolf Starreveld
             Department of Computing Science
             University of Amsterdam
             Kruislaan 409
             NL-1098 SJ Amsterdam
             The Netherlands
Telephone:   In Holland: 020-592 5137/5022
             International: 31-20-592 5137 or 31-20-592 5022
Telex:       10262 HEF NL