Compaq C Run-Time Library Utilities Reference Manual

Document revision date: 30 March 2001

1.2.3 Link Lines

A link line has the following form:

Link LINK-FROM LINK-TO

An example is as follows:

Link US/Eastern EST5EDT

In the OpenVMS implementation, Link is interpreted as a copy. Thus, the previous line copies the information from US/Eastern to EST5EDT.

The LINK-FROM field should appear as the NAME field in some zone line. The LINK-TO field is used as an alternate name for that zone.

Except for continuation lines, lines may appear in any order in the input.

Note

For areas with more than two types of local time, use local standard time in the AT field of the earliest transition time's rule to ensure that the earliest transition time recorded in the compiled file is correct.

Chapter 2
Locale File Format

A locale definition source file contains categories that describe a locale. You can convert a locale definition source file into a locale by using the LOCALE COMPILE command. Locales can be modified only by editing a locale definition source file and then using the LOCALE COMPILE command again on the new source file. Each locale source file section defines a category of locale data. A source file cannot contain more than one section for the same category.

2.1 Locale Categories

The following standard locale categories are supported:

LC_COLLATE --- Defines character or string collation information
LC_CTYPE --- Defines character classification, case conversion, and other character attributes
LC_MESSAGES --- Defines the format for affirmative and negative responses
LC_MONETARY --- Defines rules and symbols for formatting monetary numeric information
LC_NUMERIC --- Defines rules and symbols for formatting nonmonetary numeric information
LC_TIME --- Defines rules and symbols for formatting time and date information

2.1.1 Overriding Defaults

You can include optional declarations at the beginning of your locale source file to override the default comment and escape characters used in locale category definitions:

Escape character
The escape character is used in decimal or hexadecimal constants when they are specified in the locale file. The default escape character is the backslash (\). To define another escape character, include a line with the following format:
escape_char <char_symbol>
Comment character
The comment character is the first character of each comment entry in the locale file. The default comment character is the number sign (#). To define another comment character, use the following format:
comment_char <char_symbol>

In the preceding formats, <char_symbol> is the character's symbolic name as defined in the charmap file used to build the locale's codeset. One or more blank characters (spaces or tabs) must separate escape_char or comment_char from <char_symbol>.
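The following Python sketch shows one way a reader of a locale source file might pick up these two declarations before processing the categories. The function name read_overrides is hypothetical, and a plain dictionary stands in for the charmap file that would normally resolve the symbolic names.

```python
def read_overrides(lines, charmap):
    """Return (escape_char, comment_char) after scanning leading declarations.

    `charmap` maps symbolic names such as "<slash>" to single characters;
    here it is a plain dict standing in for a real charmap file.
    """
    escape_char, comment_char = "\\", "#"    # documented defaults
    for line in lines:
        fields = line.split()                # spaces or tabs separate the fields
        if not fields:
            continue                         # assumption: blank lines are skipped
        if len(fields) != 2:
            break
        keyword, symbol = fields
        if keyword == "escape_char":
            escape_char = charmap[symbol]
        elif keyword == "comment_char":
            comment_char = charmap[symbol]
        else:
            break                            # first other statement ends the declarations
    return escape_char, comment_char


sample = ["escape_char\t<slash>", "comment_char <percent-sign>", "LC_CTYPE"]
names = {"<slash>": "/", "<percent-sign>": "%"}
print(read_overrides(sample, names))
```

The sketch stops at the first line that is not one of the two declarations, which reflects the requirement that they appear at the beginning of the file.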

2.1.2 Category Source Definitions

Each category source definition consists of the following:

The category header (category_name)
The associated keyword or value pairs that comprise the category body
The category trailer (END category_name)

For example:

LC_CTYPE
<source for LC_CTYPE category>
END LC_CTYPE

The source for all of the categories is specified using keywords, strings, character literals, and character symbols. Each keyword identifies either a definition or a rule. The remainder of the statement containing the keyword contains the operands to the keyword. Operands are separated from the keyword by one or more blank characters (spaces or tabs). A statement may be continued on the next line by placing a backslash (\) as the last character before the new-line character that terminates the line. Lines containing the comment character (#) in the first column are treated as comment lines.
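The continuation and comment rules can be illustrated with a short Python sketch. It assumes the default comment character and backslash continuation, and it assumes that a line continuing an earlier statement is never treated as a comment; the function name logical_lines is hypothetical.

```python
def logical_lines(text, comment_char="#"):
    """Yield statements with continuation lines joined and comment lines dropped."""
    pending = ""
    for raw in text.split("\n"):
        # A comment character in the first column marks a comment line.
        # Assumption: this applies only when no statement is being continued.
        if not pending and raw.startswith(comment_char):
            continue
        if raw.endswith("\\"):
            pending += raw[:-1]        # drop the backslash and keep accumulating
            continue
        statement = pending + raw
        pending = ""
        if statement.strip():
            yield statement


source = "# uppercase letters\nupper <A>;<B>;\\\n<C>\nEND LC_CTYPE\n"
for statement in logical_lines(source):
    print(statement)
```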

A symbolic name begins with the left angle-bracket character (<) and ends with the right angle-bracket character (>). The characters between the < and the > can be any characters from the Portable Character Set, except for the control and space characters. For example, <A-diaeresis> could be a symbolic name for a character. Any symbolic name referenced in the locale source file must be defined via the Portable Character Set or in the character set description (charmap) file for that locale.

A character literal is the character itself, or a decimal, hexadecimal, or octal constant. A decimal constant contains two or three decimal digits and has the following form, where n is any decimal digit:

\dnn or \dnnn

A hexadecimal constant contains two hexadecimal digits and has the following form, where n is any hexadecimal digit:

\xnn

An octal constant contains two or three octal digits and has the following form, where n is any octal digit:

\nn or \nnn
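As an illustration of the three constant forms, the following Python sketch converts one constant to its integer value. It assumes the default escape character (backslash); the function name decode_constant is hypothetical and is not part of any utility described in this manual.

```python
import re

_CONSTANT = re.compile(
    r"\\d(?P<dec>[0-9]{2,3})"         # \dnn or \dnnn
    r"|\\x(?P<hex>[0-9A-Fa-f]{2})"    # \xnn
    r"|\\(?P<oct>[0-7]{2,3})"         # \nn or \nnn
)


def decode_constant(text):
    """Return the integer value of one decimal, hexadecimal, or octal constant."""
    match = _CONSTANT.fullmatch(text)
    if match is None:
        raise ValueError(f"not a character constant: {text!r}")
    if match.group("dec") is not None:
        return int(match.group("dec"), 10)
    if match.group("hex") is not None:
        return int(match.group("hex"), 16)
    return int(match.group("oct"), 8)


# The same character written in all three forms.
print(decode_constant(r"\d65"), decode_constant(r"\x41"), decode_constant(r"\101"))
```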

The explicit definition of each category in a locale definition source file is not required. When a category is undefined in a locale definition source file, the LOCALE COMPILE command will not store any data value for this category in the resulting locale file.

2.2 LC_COLLATE Category

The LC_COLLATE category defines the relative order between collation items. This category begins with the LC_COLLATE header and ends with the END LC_COLLATE trailer.

A collation item is the unit of comparison for collation. A collation item may be a character or a sequence of characters. Every collation item in the locale has a set of weights, which determine whether the collation item collates before, equal to, or after the other collation items in the locale. Each collation item is assigned collation weights by the LOCALE COMPILE command when the locale definition source file is compiled. These collation weights are then used by application programs that compare strings.

String comparison is performed by comparing the collation weights of each character in the string until either a difference is found or the strings are determined to be equal. This comparison may be performed several times if the locale defines multiple collation orders. For example, in the French locale, the strings are compared using a primary set of collation weights. If they are equal on the basis of this comparison, they are compared again using a secondary set of collation weights. The number of collation weights associated with a collation item is equal to the number of collation sort rules defined for the locale.
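The multiple-pass comparison can be sketched in Python as follows. The weight table is hypothetical and covers only a few letters; it gives each character a primary and a secondary weight, as a locale with two sort rules would.

```python
# Hypothetical two-level weight table: (primary, secondary) per character.
WEIGHTS = {
    "a": (1, 1), "A": (1, 2),
    "b": (2, 1), "B": (2, 2),
    "c": (3, 1), "C": (3, 2),
}
LEVELS = 2


def compare(left, right):
    """Return -1, 0, or 1, trying each collation order in turn."""
    for level in range(LEVELS):
        left_weights = [WEIGHTS[ch][level] for ch in left]
        right_weights = [WEIGHTS[ch][level] for ch in right]
        if left_weights != right_weights:
            return -1 if left_weights < right_weights else 1
    return 0   # equal at every level


print(compare("ab", "Ab"))   # equal on primary weights, decided by secondary
print(compare("aC", "Ab"))   # decided by primary weights
```

In this sketch the secondary weights are consulted only when the primary weights of the two strings are identical.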

Every character defined in the charmap file (or every character in the Portable Character Set if no charmap file is specified) is itself a collation item. Additional collation items can be defined using the collating-element statement (see the description that follows).

Table 2-1 lists the statement keywords recognized in the LC_COLLATE category.

Table 2-1 LC_COLLATE Category Keywords
Keyword Description

copy Specifies the name of an existing locale to be used as the definition of this category. If you specify a copy statement, you need not specify any other keywords in this category.

collating-element Specifies multicharacter collation items.

collating-symbol Specifies collation symbols for use in collation sequence statements.

order_start Specifies collation order statements that assign collation weights to collation items.

The collating-element, collating-symbol, and order_start statements are further described in the following sections.

2.2.1 The collating-element Statement

The collating-element statement specifies multicharacter collation items.

Syntax:

collating-element <character_symbol> from <string>

The character_symbol argument names a string of characters that is to be treated as a single collation item. The character_symbol cannot duplicate any symbolic name in the current charmap file or any other symbolic name defined in this collation definition.

The string argument specifies a string of two or more characters that define the character_symbol argument. The following are examples of the syntax for the collating-element statement:

collating-element <ch> from "<c><h>"
collating-element <e-acute> from "<acute><e>"
collating-element <11> from "<1><1>"

A character_symbol argument defined by the collating-element statement is recognized only within the LC_COLLATE category.
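To show the effect of a multicharacter collation item, the following Python sketch splits a string into collation items before any weights are looked up. Preferring the longest matching element is an assumption of this sketch, and the function name split_collation_items is hypothetical.

```python
def split_collation_items(text, elements):
    """Split text into collation items, preferring the longest multicharacter match."""
    by_length = sorted(elements, key=len, reverse=True)
    items, pos = [], 0
    while pos < len(text):
        for element in by_length:
            if text.startswith(element, pos):
                items.append(element)      # multicharacter collation item
                pos += len(element)
                break
        else:
            items.append(text[pos])        # ordinary single-character item
            pos += 1
    return items


print(split_collation_items("chico", ["ch"]))
print(split_collation_items("echo11", ["ch", "11"]))
```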

2.2.2 The collating-symbol Statement

The collating-symbol statement specifies collation symbols for use in collation sequence statements.

Syntax:

collating-symbol <collating_symbol>

The collating_symbol argument cannot duplicate any symbolic name in the current charmap file or any other symbolic name defined in this collation definition. The following are examples of collating-symbol statements:

collating-symbol <UPPER_CASE>
collating-symbol <HIGH>

An argument defined by the collating-symbol statement is recognized only within the LC_COLLATE category.

2.2.3 The order_start Statement

The order_start statement is followed by one or more collation order statements that assign collation weights to collation items and the order_end keyword. The order_start statement is a required statement.

Syntax:

order_start sort_rules;sort_rules;...;sort_rules
collation_order_statements
order_end

Sort Rules

The sort_rules directives have the following syntax:

keyword,keyword,...,keyword

where keyword is FORWARD, BACKWARD, or POSITION.

The sort_rules directives are optional. If specified, they define the rules to apply during string comparison. The number of specified sort_rules directives defines the number of weights each collation item is assigned (that is, the directives define the number of collation orders in the locale). If no sort_rules directives are specified, one forward directive is assumed and comparisons are made on a character basis rather than a string basis.

If sort_rules directives are present, the first one applies when comparing strings that use the primary weight, the second when comparing strings that use the secondary weight, and so on. Each set of sort_rules directives is separated by a semicolon (;). A sort_rules directive consists of one or more keywords separated by commas. The following keywords are supported:

FORWARD --- Specifies that collation weight comparisons proceed from the beginning of a string to the end of the string.

BACKWARD --- Specifies that collation weight comparisons proceed from the end of a string to the beginning of the string.

POSITION --- Specifies that collation weight comparisons consider the relative position of nonignored elements in the string (that is, if strings compare as equal, the element with the shortest distance from the starting point of the comparison collates first).

The forward and backward keywords are mutually exclusive.

The following is an example of a sort_rules directive:

order_start forward;backward
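The difference between forward and backward comparison at the secondary level can be seen in the following Python sketch. The weight table is hypothetical: accented letters share the primary weight of their base letter and differ only in the secondary weight. The POSITION keyword is not modeled.

```python
# Hypothetical weights: (primary, secondary) per character.
WEIGHTS = {
    "c": (1, 1), "o": (2, 1), "ô": (2, 2),
    "t": (3, 1), "e": (4, 1), "é": (4, 2),
}


def sort_key(text, sort_rules):
    """Build one weight list per sort rule, reversing it for 'backward'."""
    key = []
    for level, rule in enumerate(sort_rules.split(";")):
        weights = [WEIGHTS[ch][level] for ch in text]
        if "backward" in rule.split(","):
            weights.reverse()          # compare from the end of the string
        key.append(weights)
    return key


words = ["côté", "coté", "côte", "cote"]
print(sorted(words, key=lambda w: sort_key(w, "forward;backward")))
print(sorted(words, key=lambda w: sort_key(w, "forward;forward")))
```

All four words have identical primary weights, so the order is decided at the secondary level. With a backward secondary rule, an accent nearer the end of the word is the more significant one.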

Collation Order Statements

The following syntax rules apply to the collation order statements:

Each collation order statement consists of a <character_symbol> specification followed by white space and a set of collation orders.
Characters in the character set can be explicitly specified in the collation order statements or implicitly specified using the ellipsis symbol (...).
A collation order statement that begins with the UNDEFINED special symbol specifies any characters that are in the character set but not explicitly or implicitly specified by other collation order statements.

The optional operands for each collation item are used to define the primary, secondary, or subsequent weights for the collation item. The special symbol IGNORE is used to indicate a collation item that is to be ignored when strings are compared.

An ellipsis keyword appearing in place of a collating_element_list indicates the weights are to be assigned, for the characters in the identified range, in numerically increasing order from the weight for the character symbol on the left side of the preceding statement.

The use of the ellipsis keyword results in a locale that may collate differently when compiled with different character set description (charmap) source files.

The UNDEFINED special symbol includes all coded character set values not specified explicitly or with an ellipsis symbol. These characters are inserted in the character collation order at the point indicated by the UNDEFINED special symbol and are all assigned the same weight. If no UNDEFINED special symbol exists and the collation order does not specify all collation items from the coded character set, a warning is issued and all undefined characters are placed at the end of the character collation order.
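The placement of unlisted characters can be sketched in Python as follows. The function name assign_positions is hypothetical, the ellipsis symbol is not modeled, and the warning issued when UNDEFINED is missing is omitted.

```python
def assign_positions(order, charset):
    """Map each character in charset to its position in the collation order.

    `order` lists the explicitly ordered characters and may contain the
    marker "UNDEFINED"; `charset` is every character in the coded character set.
    """
    entries = list(order)
    if "UNDEFINED" not in entries:
        entries.append("UNDEFINED")            # unlisted characters go at the end
    listed = set(entries) - {"UNDEFINED"}
    positions = {}
    for position, entry in enumerate(entries):
        if entry == "UNDEFINED":
            for ch in charset:
                if ch not in listed:
                    positions[ch] = position   # all unlisted characters share one weight
        else:
            positions[entry] = position
    return positions


print(assign_positions(["a", "UNDEFINED", "b"], "ab!?"))
print(assign_positions(["a", "b"], "ab!"))
```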

Example

The following is an example of a collation order statement section in the LC_COLLATE locale definition source file category:

order_start     forward;backward
UNDEFINED       IGNORE;IGNORE
<LOW>
<space>         <LOW>;<space>
...             <LOW>;...
<a>             <a>;<a>
<a-acute>       <a>;<a-acute>
<a-grave>       <a>;<a-grave>
<A>             <a>;<A>
<A-acute>       <a>;<A-acute>
<A-grave>       <a>;<A-grave>
<ch>            <ch>;<ch>
<Ch>            <ch>;<Ch>
<s>             <s>;<s>
<ss>            <s><s>;<s><s>
<eszet>         <s><s>;<eszet><eszet>
...             <HIGH>;...
<HIGH>
order_end

This example is interpreted as follows:

The UNDEFINED special symbol indicates that all characters not specified in the definition (either explicitly or by the ellipsis symbol) are ignored for collation purposes.
All collation items between <space> and <a> have the same primary equivalence class and individual secondary weights based on their coded character-set values.
All versions of the letter a (uppercase and lowercase, and with or without diacriticals) belong to the same primary collation class.
The <c><h> multicharacter collation item is represented by the <ch> collating symbol and belongs to the same primary equivalence class as the <C><h> multicharacter collation item.
The <eszet> character is collated as an <s><s> string (that is, one <eszet> character is expanded to two characters before comparing).

2.3 LC_CTYPE Category

The LC_CTYPE category defines character classification, case conversion, and other character attributes. This category begins with the LC_CTYPE header and ends with the END LC_CTYPE trailer.

All operands for LC_CTYPE category statements are defined as lists of characters. Each list consists of one or more characters or symbolic character names separated by semicolons. An ellipsis (...) can represent a series of characters; for example, <a>;...;<z> represents the characters in the range a through z.
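The following Python sketch expands such a list. It assumes that every symbolic name is a single character between angle brackets, such as <a>, and that an ellipsis fills in characters by code value; in a real locale source file, symbolic names are resolved through the charmap file. The function name expand_list is hypothetical.

```python
def expand_list(operand):
    """Expand a semicolon-separated character list, filling in '...' ranges."""
    fields = operand.split(";")
    chars = []
    for index, field in enumerate(fields):
        if field == "...":
            first = ord(chars[-1]) + 1                   # character after the previous entry
            last = ord(fields[index + 1].strip("<>"))    # stop before the next entry
            chars.extend(chr(code) for code in range(first, last))
        else:
            chars.append(field.strip("<>"))
    return chars


print("".join(expand_list("<a>;...;<z>")))
```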

Table 2-2 lists the statement keywords recognized in the LC_CTYPE category. In the keyword descriptions, the phrase "automatically included" means that an error does not occur if the referenced characters are included or omitted; the characters are provided if they are missing, and are accepted if they are present.

Table 2-2 LC_CTYPE Category Keywords
Keyword Description

copy Specifies the name of an existing locale to be used as the definition for this category.
If you specify a copy statement, you cannot specify any other keyword.

upper Defines uppercase letter characters.
Do not specify any character defined by the cntrl, digit, punct, or space keyword. The uppercase letters A through Z are automatically included in this set.

lower Defines lowercase letter characters.
Do not specify any character defined by the cntrl, digit, punct, or space keyword. The lowercase letters a through z are automatically included in this set.

alpha Defines all letter characters.
Do not specify any character defined by the cntrl, digit, punct, or space keyword. Characters defined by the upper and lower keywords are automatically included in this character class.

digit Defines numeric digit characters.
Only the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9 can be specified. The digits 0 through 9 are automatically included in this set.

space Defines white-space characters.
Do not specify any character defined by the upper, lower, alpha, digit, graph, or xdigit keyword. The space, form-feed, new-line, carriage-return, tab, and vertical tab characters are automatically included in this set.

cntrl Defines control characters.
Do not specify any character defined by the upper, lower, alpha, digit, punct, graph, print, or xdigit keyword.

punct Defines punctuation characters.
Do not specify the space character or any character defined by the upper, lower, alpha, digit, cntrl, or xdigit keywords.

graph Defines printable characters, excluding the space character.
Do not specify any character defined by the cntrl keyword. The characters defined by the upper, lower, alpha, digit, xdigit, and punct keywords are automatically included in this character class.

print Defines printable characters, including the space character.
Do not specify any character defined by the cntrl keyword. The space character and characters defined by the upper, lower, alpha, digit, xdigit, and punct keywords are automatically included in this character class.

xdigit Defines hexadecimal digit characters.
Only the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9 can be specified. Any character, however, can be specified for the hexadecimal values for 10 to 15. These alternate hexadecimal digits are not used by standard conversion routines when converting digit strings from hexadecimal to numeric quantities. The numbers 0 through 9 and the letters A through F and a through f are automatically included in this set.

blank Defines blank characters.
The space and horizontal tab characters are included in this character class. Any characters defined by this statement are automatically included in the space class.

toupper Defines the mapping of lowercase characters to uppercase characters.
Operands for this keyword consist of character pairs. The two characters in each pair are separated by a comma; each pair is enclosed in parentheses () and separated from the next pair by a semicolon (;). The first character in each pair is considered a lowercase character; the second character is considered an uppercase character. Only characters defined by the lower and upper keywords can be specified. If toupper is not specified, a through z is mapped to A through Z by default.

tolower Defines the mapping of uppercase characters to lowercase characters.
Operands for this keyword consist of character pairs. The two characters in each pair are separated by a comma; each pair is enclosed in parentheses () and separated from the next pair by a semicolon (;). The first character in each pair is considered an uppercase character; the second character is considered a lowercase character. Only characters defined by the lower and upper keywords can be specified.
If tolower is not specified, the mapping defaults to the reverse mapping of the toupper keyword, if specified. If the toupper and tolower keywords are both omitted, the mapping for each defaults to that of the C locale.
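The pair syntax of the toupper and tolower operands, and the default of tolower to the reverse of toupper, can be sketched in Python as follows. The function names parse_pairs and case_tables are hypothetical, and the sketch handles only symbolic names written between angle brackets.

```python
import re


def parse_pairs(operand):
    """Parse '(<a>,<A>);(<b>,<B>)' into a list of (first, second) tuples."""
    return re.findall(r"\(\s*<([^>]+)>\s*,\s*<([^>]+)>\s*\)", operand)


def case_tables(toupper_operand, tolower_operand=None):
    """Return (toupper, tolower) dicts; tolower defaults to the reverse of toupper."""
    toupper = dict(parse_pairs(toupper_operand))
    if tolower_operand is None:
        tolower = {upper: lower for lower, upper in toupper.items()}
    else:
        tolower = dict(parse_pairs(tolower_operand))
    return toupper, tolower


upper_map, lower_map = case_tables("(<a>,<A>);(<b>,<B>)")
print(upper_map)
print(lower_map)
```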

Additional keywords can be provided to define new character classifications. For example:

charclass vowel
vowel <a>;<e>;<i>;<o>;<u>;<y>

The LC_CTYPE category does not support multicharacter elements. For example, the German Eszet character is traditionally classified as a lowercase letter, but it has no corresponding uppercase letter: in proper capitalization of German text, the Eszet character is replaced by the two characters SS. This kind of conversion is outside the scope of the toupper and tolower keywords.
