2 Developing Internationalized Software

This chapter explains how the requirements of localization (language, codeset, and cultural differences) change the way you implement basic coding operations. A sample application that applies the suggested program development techniques from this chapter is provided in the /usr/examples/i18n/xpg4demo directory. See the README file in that directory for an introduction to the application and how you can compile and run the application with different locales. Parts of the xpg4demo application are used as examples in this and other chapters.

One of the primary functions of most computer programs is to manipulate data, which can involve interaction between the program and a computer user. In commercial situations, it is important that such interactions take place in the native language of each user. Cultural data, such as dates, numbers, and currency values, should also follow the customs of each user's territory.

When you write programs to support multilanguage operation, you must consider the fact that languages can be represented within the computer system by one or more codesets. Because of the requirements of different languages, characters in codesets may vary in both size (8 bits, 16 bits, and so on) and binary representation.

You can satisfy the requirements of codesets and data by writing programs that make no hard-coded assumptions about language, cultural data, or character encodings. Such programs are said to be internationalized. Data specific to each supported language, territory, and codeset combination are held separately from the program code and can be bound to the run-time environment by language-initialization functions.
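As a minimal sketch of such a language-initialization call, the following helper binds the program to the locale named in the user's environment. The helper name bind_user_locale is hypothetical; setlocale() is the Standard C Library function discussed in Section 2.5.

```c
#include <locale.h>

/*
 * Bind the program to the locale named by the user's environment
 * (LANG, LC_ALL, and related variables) and return the name of the
 * locale now in effect. If the requested locale is not installed,
 * setlocale() returns NULL and leaves the current locale unchanged,
 * so the helper reports the current locale instead.
 */
const char *bind_user_locale(void)
{
    const char *name = setlocale(LC_ALL, "");

    if (name == NULL)
        name = setlocale(LC_ALL, NULL);   /* query only, no change */
    return name;
}
```

A program would typically call this helper once, at the start of main(), before any locale-dependent processing takes place.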

The operating system provides the following facilities for developing internationalized software, defining localization data, and announcing specific language requirements:

Locales that contain language, codeset, and cultural definitions for each language (Section 2.1)

Library functions that handle extended character codes and that provide language- and codeset-independent character classification, case conversion, number format conversion, and string collation (Section 2.2)

Library functions that let programs dynamically determine cultural and language-specific data (Section 2.3)

A message system that allows program messages to be held apart from the program code, translated into different languages, and retrieved by a program at run time (Section 2.4)

An initialization function that binds a program at run time to the linguistic and cultural requirements of each user (Section 2.5)

The discussion and examples in this chapter focus on functions provided in the Standard C Library. See Chapter 4 for information on using functions in the curses Library. See Chapter 5 for information about using functions in the X and Motif libraries.

2.1 Using Locales

The operating system supports Unicode and dense code locales. The Unicode locales are installed in /usr/i18n/lib/nls/ucsloc/. Dense code locales are installed in /usr/i18n/lib/nls/loc. The active default is determined by the symbolic link, /usr/i18n/lib/nls/dloc. For example, the Japanese locale filename, /usr/lib/nls/loc/ja_JP.eucJP, is a symbolic link to /usr/i18n/lib/nls/dloc/ja_JP.eucJP, where /dloc is a symbolic link to either /ucsloc for the Unicode version or /loc for the dense code version of the Japanese locale.

If you are superuser, you can switch between Unicode and dense code locales by changing the setting of the symbolic link, as described in l10n_intro(5), or you can use the Configure International Software utility from the SysMan Menu. You can also use the utility to change a default system locale and specify an input method for those Asian locales that support multiple input methods. See the online help for Configure International Software for more information.

Unicode locales conform to Unicode and ISO/IEC 10646 standards and use UTF-32 as the wide-character encoding. Under UTF-32 wide character encoding, wchar_t values represent the same characters regardless of the locale and, because Unicode standards prevail, implementation is consistent across platforms.
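One way to observe this property is to convert a character from file code to process code and inspect the resulting wchar_t value. The following sketch assumes a hypothetical helper named first_wide_value. Under a Unicode locale, the value returned is the character's UCS code value; under a dense code locale, the value is locale specific.

```c
#include <stdlib.h>
#include <wchar.h>

/*
 * Return the wide-character (process code) value of the first
 * character in the multibyte string mb, as interpreted by the current
 * locale. Returns 0 for an empty string and -1 if the bytes do not
 * form a valid character in this locale.
 */
long first_wide_value(const char *mb)
{
    wchar_t wc = L'\0';
    int n;

    (void) mbtowc(NULL, NULL, 0);        /* reset to the initial shift state */
    n = mbtowc(&wc, mb, MB_CUR_MAX);
    if (n < 0)
        return -1L;                      /* invalid multibyte sequence */
    return (long) wc;
}
```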

Locales whose names end in .UTF-8 use UTF-8 file code and UTF-32 internal process code (wchar_t encoding), as defined in the ISO/IEC 10646 and Unicode standards.

Other, non-UTF-8 Unicode locales use traditional UNIX and proprietary codesets for the file code while using UTF-32 as the internal process code. A subset of these Unicode locales have a @ucs4 modifier; however, they are the same as the locales without the @ucs4 modifier. The @ucs4 subset is provided for backward compatibility and may be removed in the future. You cannot choose @ucs4 locales from the CDE Login Menu; you must specify the locale name in the LANG environment variable.

The universal.UTF-8 locale is also available (for use by applications rather than end users). This locale supports the complete set of characters in the Universal Character Set (UCS).

See Unicode(5) for more information about encoding formats.

For .UTF-8 locales, file code may include characters encoded in more than 1 byte; therefore, use these locales in applications that can process multibyte data. Design new applications based on multibyte .UTF-8 locales, which incorporate a large character repertoire, to enable the application to expand future character support without changing the character set.

Dense code locales use dense code for wide-character encoding to minimize table size (that is, codepoints are assigned consecutively with no empty positions). Under dense code locales, a wchar_t value for one locale may not represent the same character in another locale and, thus, is locale specific. Dense code locales are appropriate for applications that have no dependencies on the internal process code or, because dense code locales are slightly more efficient than Unicode locales, for applications whose primary goal is better performance.

All valid codepoints in multibyte character sets are mapped to valid codepoints in Unicode, including unmapped codepoints that are mapped to Unicode codepoints in the private use area. Thus, dense code locales are equivalent to Unicode locales. In general, the same charmaps and locale source can be used for Unicode and dense code locales. However, Unicode and dense code characters that are not defined in the LC_COLLATE section may be sorted differently.

A Unicode locale exists for each dense code locale. (However, not all Unicode locales have a dense code version.) For Latin-1 locales (ISO8859-1), the dense code and Unicode locales are identical because Latin-1 characters are the same as the first 256 characters in Unicode. Keep in mind that the same locale name can refer to a Unicode locale or to a dense code locale, depending on the setting of the symbolic link. Thus, if running an application in a locale is problematic, check the symbolic link.

Because Unicode locales use consistent values for characters in wchar_t form, the link to Unicode locales can increase consistency across locales and platforms. However, some users may prefer the older, dense code locales that use proprietary algorithms to convert characters to wchar_t form, or an application may have dependencies on dense code wchar_t encoding.

2.2 Using Codesets

In the past, most UNIX systems were based on the 7-bit ASCII codeset. However, most non-English languages include characters in addition to those contained in the ASCII codeset.

The X/Open UNIX standard does not require an operating system to supply any particular codesets in addition to ASCII. The standard does specify requirements for the interfaces that manipulate characters so that programs are able to handle characters from whatever codeset is available on a given system.

The first group of codesets from the International Organization for Standardization (ISO) covered only the major European languages. In this group, several codesets allow for the mixing of major languages within a single codeset. All of these codesets are supersets of the ASCII codeset and allow systems to support non-English languages without invalidating existing software that is not internationalized. The Tru64 UNIX operating system always includes a locale for the United States that uses the ISO 8859-1 (ISO Latin-1) codeset.

Subsets that are installed as part of Worldwide Language Support (WLS) support localized variants of the operating system and may include locales based on additional ISO codesets. For example, the optional language variant subsets included with the operating system to support Czech, Hungarian, Polish, Slovak, and Slovene provide locales based on the ISO 8859-2 (Latin-2) codeset.

The following is a complete list of ISO codesets provided with the WLS, including the languages that they support and the reference pages where they are discussed in more detail:

ISO 8859-1, Latin-1
Languages of Western Europe and North America, including Catalan, Danish, Dutch, English/Great Britain, English/United States, Finnish, Flemish/Belgium, French/Belgium, French/Canada, French/Swiss, French, German/Swiss, German/Germany, Icelandic, Italian, Norwegian, Portuguese, Spanish, and Swedish
See iso8859-1(5)

ISO 8859-2, Latin-2
Languages of Eastern Europe, including Czech Republic, Hungarian, Polish, Slovak, and Slovene
See iso8859-2(5)

ISO 8859-4, Latin-4
Lithuanian
See iso8859-4(5)

ISO 8859-5, Latin/Cyrillic
Russian
See iso8859-5(5)

ISO 8859-7, Latin/Greek
Greek
See iso8859-7(5)

ISO 8859-8, Latin/Hebrew
Hebrew/Israel (uses the ISO Hebrew codeset)
See iso8859-8(5)

ISO 8859-9, Latin-5
Turkish
See iso8859-9(5)

ISO 8859-15, Latin-9
Catalan/Spain, Danish, Dutch, English/Great Britain, English/United States, Finnish, Flemish/Belgium, French/Belgium, French/Canada, French/Swiss, French, German/Swiss, German/Germany, Icelandic, Italian, Norwegian, Portuguese, Spanish/Spain, and Swedish. ISO 8859-15 (and UTF-8) support the euro monetary character.
See iso8859-15(5)

The operating system does not include support for the ISO 8859-3 (Latin-3) and ISO 8859-6 (Latin/Arabic) codesets.

Another ISO codeset supported by utilities on a standard operating system is ISO 6937: 1983. This codeset, which accommodates both 7-bit and 8-bit characters, is used for text communication over communication networks and interchange media, such as magnetic tape and disks.

The codesets discussed up to this point address the requirements of languages whose characters can be stored in a single byte. Such codesets do not meet the needs of Asian languages, whose characters can occupy multiple bytes. The operating system supplies the following codesets through installed subsets that support Asian languages and countries:

Japanese
- Japanese Extended UNIX Code (the default)
  See eucJP(5)
- Shift JIS
  See shiftjis(5)
- DEC Kanji
  See deckanji(5)
- Super DEC Kanji
  See sdeckanji(5)

Korean
- DEC Korean
  See deckorean(5)
- Korean Extended UNIX Code
  See eucKR(5)

Thai
- Thai API Consortium/Thai Industrial Standard
  See TACTIS(5)

Simplified Chinese
- DEC Hanzi
  See dechanzi(5)
- GBK and GB18030
  See GBK(5) and GB18030(5)

Traditional Chinese
- DEC Hanyu
  See dechanyu(5)
- Taiwanese Extended UNIX Code
  See eucTW(5)
- BIG-5 (and the variant, Shift BIG-5)
  See big5(5) and sbig5(5)
- Telecode
  See telecode(5)

These codesets are supplied when you install Asian language variant subsets of the operating system software. Also supplied are a specialized terminal driver and associated utilities that must be available on your system to support the input and display of Asian characters at run time.

Codesets developed for PC systems are commonly called code pages. There are PC code pages that correspond to most of the language-specific codesets developed for UNIX systems. The operating system supports PC codesets mostly through converters that can change file data from one type of encoding format to another. The CP850 codeset supports English/United States and is used with data that contains accented characters generated on a PC using the CP850 code page for character encoding. This character encoding is usually the default for MS-DOS and Windows operating systems in Europe. See code_page(5).

The Unicode and ISO/IEC 10646 standards specify the Universal Character Set (UCS), which allows character units to be processed for all languages, including Asian languages, using the same set of rules. The operating system supports the UCS-4 (32-bit) encoding of this character set in process code.

Other encoding formats defined by the Unicode standard, the ISO/IEC 10646 standard, or both include the following:

UCS-2, a 16-bit encoding counterpart to UCS-4

A number of universal transformation formats (UTF-8, UTF-16, and UTF-32) that transform UCS encoding into sequences of bytes for handling by byte-oriented protocols

The operating system supports these different formats through locales, codeset converters, or both. Because UCS-2 is a subset of UTF-16, the operating system supports UCS-2 with UTF-16 codeset converters. The operating system supports UCS-4 with both codeset conversion and locales.
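To make the relationship between UCS-4 and a transformation format concrete, the following sketch encodes one UCS-4 value as a UTF-8 byte sequence. The function name ucs4_to_utf8 is hypothetical and is not an operating system interface; in practice the locales and codeset converters perform this transformation for you.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one UCS-4 code value as UTF-8. Writes up to 4 bytes to out
 * and returns the number of bytes written, or 0 if cp is a surrogate
 * value or lies above U+10FFFF.
 */
size_t ucs4_to_utf8(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* 7 bits: one byte, same as ASCII */
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 11 bits: two bytes */
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)     /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                   /* 16 bits: three bytes */
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 21 bits: four bytes */
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, the euro character (U+20AC) occupies 3 bytes in UTF-8 file code but a single 32-bit unit in UCS-4 process code.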

The following locales use UTF-32 as internal processing code:

universal.UTF-8
Use this locale in applications to convert data in UTF-8 file format to UCS-4 process code and to test any UCS-4 character to determine if it is included in one of the following LC_CTYPE classes: alnum, alpha, blank, cntrl, digit, graph, lower, print, punct, space, upper, or xdigit. In this locale, the LC_MESSAGES, LC_MONETARY, LC_NUMERIC, and LC_TIME definitions match those of the POSIX (C) locale. Your application can use this locale, along with the fold_string_w() function, to process the full range of characters defined by the Unicode and ISO/IEC 10646 standards.
This locale differs from most others because it does not provide access to local cultural conventions.

language_territory.UTF-8
These locales limit classification information to the characters in a particular native language, make country-specific data available to your application, and assume file data follows UTF-8 encoding rules. The operating system locales that support the euro monetary symbol use either UTF-8 or ISO 8859-15 codesets.
The Unicode UTF-8 codeset supports Catalan/Spain, Czech Republic, Danish, Dutch, English/Great Britain, English/United States, Finnish, Flemish, French/Belgium, French/Canada, French/Swiss, German/Swiss, German, Greek, Hungarian, Icelandic, Italian, Japanese, Korean, Lithuanian, Norwegian, Polish, Portuguese, Russian, Slovak, Slovene, Spanish, Swedish, Turkish, simplified Chinese (Hanzi), and traditional Chinese (Hanyu). See Unicode(5).

native_locale_name
These locales use UTF-32 as internal processing code. The codeset portion of the native_locale_name (for example, ISO8859-1) specifies the file code. Also, the locale provides classification information for the native language characters, but not for the full set of UTF-32 characters. Country specific information is available to the application; the LC_COLLATE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, and LC_TIME category definitions match the definition in native_language_name.

native_locale_name@ucs4
These locales are provided for compatibility with existing applications that use the @ucs4 locales. They function the same as the native_locale_name locales, but the list of @ucs4 locales is not as complete as the list of native_locale_name locales.

See Section 2.5 for information on locale categories, such as LC_TIME. See Unicode(5) and Section 2.1 for information on locales and comparisons of data handling. See euro(5) for more information on the euro monetary symbol.

See Unicode(5) for detailed information about support for UCS-2, UCS-4, UTF-8, UTF-16, and UTF-32. For information on how codesets are supported for a particular local language, see the reference page for that language. Reference pages for languages, particularly Asian languages, might note additional codesets that are not supported in a locale but for which there is a codeset converter.

The following sections discuss important issues that affect the way you write source code when your program must process characters in different codesets:

Ensuring data transparency (Section 2.2.1)

Using in-code literals (Section 2.2.2)

Manipulating characters that span multiple bytes (Section 2.2.3)

Converting between multibyte-character and wide-character data (Section 2.2.4)

Rules for multibyte characters in source and execution codesets (Section 2.2.5)

Classifying characters (Section 2.2.6)

Converting characters (Section 2.2.7)

Comparing strings (Section 2.2.8)

2.2.1 Ensuring Data Transparency

As discussed in Section 2.2, internationalized software must accommodate a wide variety of character-encoding schemes. Programs cannot assume that a particular codeset is on all systems that conform to requirements in the X/Open UNIX CAE specifications, nor that individual characters occupy a fixed number of bits.

Because of the historical dependence of UNIX systems on 7-bit ASCII character encoding, some programs use the most significant bit (MSB) of a byte for their own internal purposes. This was a dubious programming practice, although quite safe when characters in the underlying codeset always mapped to the remaining 7 bits of the byte. In the world of international codesets, the practice of using the most significant bit of a byte for program purposes must be avoided.
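The following sketch shows one alternative to borrowing the most significant bit: keep the program's per-byte state in a separate array so that the data bytes pass through unchanged. The structure and function names (marked_text, load_text, mark_byte) and the TEXT_MAX limit are hypothetical and chosen only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TEXT_MAX 64

struct marked_text {
    unsigned char bytes[TEXT_MAX];    /* data bytes, all 8 bits intact      */
    bool          marked[TEXT_MAX];   /* program state kept out of the data */
    size_t        len;
};

/* Copy up to TEXT_MAX bytes without altering any bit; clear all marks. */
size_t load_text(struct marked_text *t, const unsigned char *src, size_t n)
{
    if (n > TEXT_MAX)
        n = TEXT_MAX;
    memcpy(t->bytes, src, n);
    memset(t->marked, 0, sizeof t->marked);
    t->len = n;
    return n;
}

/* Flag byte i without touching the byte itself. */
void mark_byte(struct marked_text *t, size_t i)
{
    if (i < t->len)
        t->marked[i] = true;
}
```

With this layout, a byte such as 0xE9 (e with an acute accent in ISO 8859-1) keeps its value whether or not the program has marked it.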

2.2.2 Using In-Code Literals

When you write internationalized software, avoid using in-code literals. Consider, for example, the following conditional statement:

if ((c = getchar()) == '\141')

This condition assumes that lowercase a is always represented by a fixed octal value, which may not be true for all codesets. Use a character constant instead of a numeric code value. The following statement substitutes a character constant for the octal value in the getchar() test:

if ((c = getchar()) == 'a')

However, because the getchar() function operates on bytes, the statement would not work correctly if the next character in the input stream spanned multiple bytes. To avoid this problem, substitute the getwchar() function for the getchar() function. The getwchar() function, as used in the following example, works correctly with any codeset because a is a member of the Portable Character Set (PCS) and is transformed into the same wide-character value in all locales.

if ((c = getwchar()) == L'a')

The X/Open UNIX standard specifies that each member of the source character set and each escape sequence in character constants and string literals is converted to the same member of the execution character set in all locales. Thus, you can safely use any of the characters in the PCS as a character constant or in string literals. Non-English language characters are not included in the PCS and may not translate correctly when used as literals. Consider the following example:

if ((c = getwchar()) == L'à')

The accented character à may not be represented in the codeset's source character set or execution character set. Also, the binary value of the accented character may not be translatable from one set to the other. When source files specify non-English language characters in constants, the results are undefined. In cases such as this, it can be helpful to employ a consistent use of Unicode locales.

The following example illustrates how to construct a test for a constant that for whatever reason may be a non-English language character. The constant has been defined in a message catalog with the symbolic identifier MSG_ID. Statements in the example retrieve the value for MSG_ID from the message catalog, which is locale specific and bound to the program at run time.


.
.
.
char *schar;      /* [1] */
wchar_t wchar;    /* [2] */

.
.
.
schar = catgets(catd, NL_SETD, MSG_ID, "a");    /* [3] */
if (mbtowc(&wchar, schar, MB_CUR_MAX) == -1)    /* [4] */
        error();
if ((c = getwchar()) == wchar)                  /* [5] */

.
.
.

[1] Declares schar as a pointer to char.

[2] Declares the variable wchar to be of type wchar_t.

[3] Calls the catgets() function to retrieve the value of MSG_ID from the message catalog for the user's locale.
The catgets() function returns its value as an array of bytes, so the result is assigned to the schar variable. If the accented character is not available in the locale's codeset and the catalog does not supply a value, catgets() returns the default string, so the test is made against the unaccented base character (a).

[4] Tests to make sure the value contained in schar represents a valid multibyte character. If the value is a valid multibyte character, the program converts it to a wide-character value and stores the result in the variable wchar.
If schar does not contain a valid multibyte character, the program signals an error.

[5] Codes the conditional statement to include the value contained in wchar as the constant.

See Chapter 3 for more information about message catalogs and the catgets() function. See Section 2.2.4 for information about converting multibyte characters and strings to wide-character data that your program can process.

2.2.3 Manipulating Characters That Span Multiple Bytes

The operating system provides all the interfaces (such as putwc(), getwc(), fputws(), and fgetws()) that are needed to support codesets with characters that span multiple bytes. Language variant subsets of the operating system must be installed to supply the locales and facilities that make this support operational. On systems where such locales are not available, or are available but not bound to the program at run time, the *ws*() and *wc*() functions are merely synonyms for the associated single-byte functions (such as putc(), getc(), fputs(), and fgets()).
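As a brief sketch of these interfaces in use, the following function counts characters, rather than bytes, on an input stream by reading with getwc(). The function name count_wide_chars is hypothetical. The count is correct for multibyte file code only when a locale for that codeset has been bound to the program.

```c
#include <stdio.h>
#include <wchar.h>

/*
 * Count the characters remaining on fp, reading one wide character at
 * a time. Returns the count, or -1 if a read error or an invalid
 * multibyte sequence stops the scan before end of file.
 */
long count_wide_chars(FILE *fp)
{
    long count = 0;

    while (getwc(fp) != WEOF)
        count++;
    return ferror(fp) ? -1L : count;
}
```

In a single-byte locale the result equals the byte count, which is consistent with the wide-character functions behaving like their single-byte counterparts in that case.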

2.2.4 Converting Between Multibyte-Character and Wide-Character Data

On an internationalized system, data can be encoded as either multibyte character or wide-character data.

Multibyte encoding is typically used when data is stored in a file or generated for external use or data interchange. Multibyte encoding has the following disadvantages:

Characters are not represented by a fixed number of bytes for each character, even in the same codeset. Thus, the size of a character in a multibyte data record can vary from one character to the next.

The parsing rules for retrieving character codes from a multibyte data record are locale dependent.

Because of these disadvantages, wide-character encoding, which allocates a fixed number of bytes for each character, is typically used for internal processing by programs; in fact, internal process code is another way of referring to data in wide-character format. The size of a wide character varies from one system implementation to another. On Tru64 UNIX systems, the size for a wide character is set to 4 bytes (32 bits), a setting that optimizes performance for the HP Alpha processor.

Library routines that print, scan, input, or output text can automatically convert data from multibyte characters to wide characters or from wide characters to multibyte characters, as appropriate for the operation. However, applications almost always have additional statements or requirements for which conversion to and from multibyte characters needs to be explicit.

The following example is from a program module that reads records from a database of employee data. In this case, the programmer wants to process the data in fixed-width units and therefore uses the mbstowcs() function to explicitly convert an employee's first and last names from multibyte-character to wide-character encoding.

/*
 * The employee record is normalized with the following format, which
 * is locale independent:  Badge number, First Name, Surname,
 * Cost Center, Date of Join in the `yy/mm/dd' format. Each field is
 * separated by a TAB. The space character is allowed in the First
 * Name and Surname fields.
 */
static const char *dbOutFormat = "%ld\t%S\t%S\t%S\t%02d/%02d/%02d\n";
static const char *dbInFormat = "%ld %[^\t] %[^\t] %S %02d/%02d/%02d\n";

.
.
.
sscanf(record, dbInFormat,
       &emp->badge_num,
       firstname,
       surname,
       emp->cost_center,
       &emp->date_of_join.tm_year,
       &emp->date_of_join.tm_mon,
       &emp->date_of_join.tm_mday);
(void) mbstowcs(emp->first_name, firstname, FIRSTNAME_MAX+1);
(void) mbstowcs(emp->surname, surname, SURNAME_MAX+1);

.
.
.

See Section A.9 for a complete list of functions that work directly with multibyte data.
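The preceding excerpt discards the value returned by mbstowcs(). Where the input is not known to be valid, a program may prefer to check the result. The following sketch assumes a hypothetical wrapper named mb_to_wide that reports both invalid input and truncation.

```c
#include <stdlib.h>
#include <wchar.h>

/*
 * Convert the multibyte string mb to wide characters in out. cap is
 * the capacity of out in wide characters, including the terminator.
 * Returns the number of wide characters stored, not counting the
 * terminator, or (size_t)-1 if mb is not valid in the current locale
 * or the result does not fit.
 */
size_t mb_to_wide(wchar_t *out, size_t cap, const char *mb)
{
    size_t n;

    if (cap == 0)
        return (size_t) -1;
    n = mbstowcs(out, mb, cap);
    if (n == (size_t) -1)
        return n;                    /* invalid multibyte sequence */
    if (n == cap) {                  /* filled out; no room for terminator */
        out[cap - 1] = L'\0';
        return (size_t) -1;
    }
    return n;
}
```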

2.2.5 Rules for Multibyte Characters in Source and Execution Codesets

Both the source and execution character set variants of the same codeset can contain multibyte characters. The encodings do not have to be the same, but the source and execution variants both observe certain rules in codesets that meet X/Open requirements. PC code pages and UCS-based codesets may adhere to some or most of these rules, but the codesets native to any UNIX system that conforms to X/Open standards must adhere to all of them.

The characters defined in the Portable Character Set must be present in both sets.

The existence, meaning, and encoding of any additional members are locale specific.

A character may have a state-dependent encoding. A string of characters may contain a shift-state character that affects the system's interpretation of the following bytes until another shift-state character is encountered.

While in the initial shift state, all characters from the basic character set retain their usual interpretation and do not alter the shift state.

The interpretation for subsequent bytes in the sequence is a function of the current shift state.

A byte with all bits set to zero is interpreted as a null character, independent of the shift state.

A byte with all bits zero must not occur in the second or subsequent bytes of a multibyte character.

The source variant of a codeset must observe the following additional rules:

A comment, string literal, character constant, or header name must begin and end in the initial shift state.

A comment, string literal, character constant, or header name must consist of a sequence of valid multibyte characters.
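The execution codeset rules have a practical consequence: because a byte with all bits zero is always the null character and never occurs inside a multibyte character, a program can step through a multibyte string one character at a time and stop at the first zero byte. The following sketch, which uses a hypothetical function named mb_char_count, relies on that guarantee.

```c
#include <stdlib.h>

/*
 * Count the characters in the multibyte string s according to the
 * current locale. Returns -1 if s contains a byte sequence that is
 * not a valid character.
 */
long mb_char_count(const char *s)
{
    long count = 0;

    (void) mblen(NULL, 0);               /* begin in the initial shift state */
    while (*s != '\0') {                 /* a zero byte always ends the string */
        int n = mblen(s, MB_CUR_MAX);

        if (n <= 0)
            return -1L;                  /* invalid or incomplete character */
        s += n;
        count++;
    }
    return count;
}
```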

The C language compiler supports trigraph sequences when you specify the -std1 or -std flag on the cc command line. Trigraph sequences, which are part of the ANSI C specification, allow users to enter the full range of basic characters in programs, even if their keyboards do not support all characters in the source codeset. The following trigraph sequences are currently defined, each of which is replaced by the corresponding single character:

Trigraph Sequence	Single Character
`??=`	`#`
`??(`	`[`
`??/`	`\`
`??'`	`^`
`??<`	`{`
`??)`	`]`
`??!`	`|`
`??>`	`}`
`??-`	`~`

2.2.6 Classifying Characters

Another feature of program operation that depends on the locale is character classification; that is, determining whether a particular character code refers to an uppercase alphabetic, lowercase alphabetic, digit, punctuation, control, or space character.

In the past, many programs classified characters according to whether the character's value fell between certain numerical limits. For example, the following statement tests for all uppercase alphabetic characters:

if (c >= 'A' && c <= 'Z')

This statement is valid for the ASCII codeset, in which all uppercase letters have values in the range 0x41 to 0x5a (A to Z). However, the statement is not valid for the ISO 8859-1 codeset, in which uppercase letters occupy the ranges 0x41 to 0x5a, 0xc0 to 0xd6, and 0xd8 to 0xde. In the EBCDIC codeset, character values are different again and, in this case, even the uppercase English language letters have a different encoding.

When you write internationalized programs, classify characters by calling the appropriate internationalization function. For example:

if (iswupper (c))

Internationalization functions classify wide-character code values according to ctype information in the user's locale. See Section A.2 for a complete list and description of character classification functions.
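As a short illustration, the following sketch counts the uppercase letters in a wide-character string without assuming any numeric range. The function name count_upper is hypothetical; iswupper() is the classification function used in the preceding statement.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Count the characters in s that the current locale classifies as
 * uppercase letters.
 */
size_t count_upper(const wchar_t *s)
{
    size_t n = 0;

    for (; *s != L'\0'; s++) {
        if (iswupper((wint_t) *s))
            n++;
    }
    return n;
}
```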

2.2.7 Converting Characters

As an example of what not to do in an internationalized program, consider the following statements, which perform case conversion of ASCII characters by converting the character in a_var first to lowercase and then to uppercase:

a_var |= 0x20;

.
.
.
a_var &= 0xdf;

The preceding statements are not safe to use in internationalized programs because the statements assume ASCII-coded character values and because they can convert invalid values.

The correct way to handle case conversion is to call the towlower() function for conversion to lowercase and the towupper() function for conversion to uppercase. For example:

a_var = towlower(a_var);

.
.
.
a_var = towupper(a_var);

These functions use information specified in the user's locale and are independent of the codeset in which characters are defined. The functions return the argument unchanged if input is invalid. See Section A.3 for more detailed discussion of case conversion functions.
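Applied to a whole string, the same approach might look like the following sketch. The function name wcs_to_upper is hypothetical. Characters that have no uppercase form in the user's locale, such as digits and punctuation, are left unchanged.

```c
#include <wchar.h>
#include <wctype.h>

/*
 * Convert each character of the wide-character string s to uppercase
 * in place, using the case mapping of the current locale. Returns s.
 */
wchar_t *wcs_to_upper(wchar_t *s)
{
    wchar_t *p;

    for (p = s; *p != L'\0'; p++)
        *p = (wchar_t) towupper((wint_t) *p);
    return s;
}
```

Because towupper() maps one character to one character, this sketch cannot handle case conversions that change the length of a string.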

2.2.8 Comparing Strings

UNIX systems provide functions for comparing character strings. The following statement, for example, compares the strings s1 and s2, returning an integer greater than, equal to, or less than zero, depending on whether the value of s1 is greater than, equal to, or less than the value of s2 in the machine-collating sequence:


.
.
.
int cmp_val;
char *s1;
char *s2;

.
.
.
cmp_val = strcmp(s1, s2);

.
.
.

Many languages, however, require more complex collation algorithms than a simple numerical sort. For example, multiple passes may be required for the following reasons:

Ordering accented characters within a particular character class for a language (for example, a, á, à, and so on)

Collating certain multiple character sequences as a single character (for example, the Welsh character ch, which collates after c and before d)

Collating certain single characters as a 2-character sequence (for example, the German character sharp s, which collates as ss)

Ignoring certain characters during collation (for example, hyphens in dictionary words)

String comparison in an international environment depends on the codeset and language. This dependency means that additional functions are required to compare strings according to collating sequence information in the user's locale. These functions include the following:

strcoll()
This function uses collation information defined in the user's locale rather than performing a simple numeric comparison as does the strcmp() function.

wcscoll()
This function performs the same operation as strcoll(), except that it operates on wide characters.

wcsxfrm()
This function transforms a wide-character string by using collating sequence information in the user's locale so that the resulting string can be compared using the wcscmp() function.
If two strings are being compared only for equality, you can use strcmp() or wcscmp(), which are faster in most environments than wcscoll().
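The following sketch shows how strcoll() can be supplied to qsort() so that an array of strings is sorted in the order defined by the user's locale. The names compare_by_locale and sort_by_locale are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* qsort() comparison function: order two strings by locale collation. */
static int compare_by_locale(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;

    return strcoll(*sa, *sb);
}

/* Sort an array of count string pointers in locale collation order. */
void sort_by_locale(const char **items, size_t count)
{
    qsort(items, count, sizeof items[0], compare_by_locale);
}
```

When the same strings are compared many times, transforming each string once and comparing the transformed strings can be a reasonable alternative to repeated strcoll() calls.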

2.3 Handling Cultural Data

Cultural data refers to items of information that can vary between languages or territories.

For example:

In the United Kingdom and the United States, a period represents the radix character and a comma represents the thousands separator in decimal numbers. In Germany, the same two characters in decimal numbers have the opposite meaning.

In the United States, the date October 7, 1986 is represented as 10/7/1986. In the United Kingdom, the same date is represented as 7/10/1986. This example indicates that cultural data items can vary even when the same language is spoken.

Date delimiters, as well as the order of year, month, and day, can vary among countries. In Germany, for example, the date October 7, 1986 is represented as 7.10.1986 rather than as 7/10/1986.

Currency symbols can vary both in the characters used and where they are placed in a currency value; that is, currency symbols can precede, follow, or be embedded in the value.
The euro character that is used as the currency symbol by European countries belonging to the Economic and Monetary Union is supported only by Unicode (*.UTF-8) or Latin-9 (*.ISO8859-15) locales and associated fonts. See euro(5) for complete information about support for the euro currency symbol.
To enter the euro character from the keyboard, you must be working in a Latin-9 or UTF-8 locale and the appropriate keymap must be active. To display the euro character, you must be working in a Latin-9 or UTF-8 locale and the appropriate font must be active. To activate the required locale and the appropriate keymap and font, log in to a Latin-9 or UTF-8 locale, or use setenv to set the LANG environment variable, and start a new dtterm. See the reference pages for locale(1) and dtterm(1).

You cannot make assumptions about cultural data when writing internationalized programs. Your program must operate according to the local customs of users. The X/Open UNIX standard specifies that this requirement be met through a database of cultural data items that a program can access at run time, plus a set of associated interfaces. The following sections discuss this database and the functions used to extract and process its data items.

2.3.1 The langinfo Database

The language information database, named langinfo, contains items that represent the cultural details of each locale supported on the system. The langinfo database contains the following information for each locale, as required by the X/Open UNIX standard:

Codeset name

Date and time formats

Names of the days of the week

Names of the months of the year

Abbreviations for names of days

Abbreviations for names of months

Radix character (the character that separates whole and fractional quantities)

Thousands separator character

Affirmative and negative responses for yes/no queries

Currency symbol and its position within a currency value

Emperor/Era name and year (for Japanese locales)

2.3.2 Querying the langinfo Database

You can extract cultural data items from the langinfo database by calling the nl_langinfo() function. This function takes an item argument that is one of several constants defined in the /usr/include/langinfo.h header file. The function returns a pointer to the string with the value for item in the current locale.

The following example is a call to nl_langinfo() that extracts the string for formatting date and time information. This value is associated with the constant D_T_FMT.

nl_langinfo(D_T_FMT);
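
The following sketch shows one way to collect several langinfo items into a single string. The describe_locale name and the buffer sizes are assumptions made for illustration. Each value is copied before the next query, because the string returned by nl_langinfo() may be overwritten by a later call.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Writes a one-line summary of selected langinfo items for the current
   locale into out (hypothetical helper). Returns the snprintf() result. */
int describe_locale(char *out, size_t outsize)
{
    char codeset[64];
    char radix[16];
    char datefmt[64];

    /* Copy each item right away; the returned pointer may not stay valid. */
    snprintf(codeset, sizeof codeset, "%s", nl_langinfo(CODESET));
    snprintf(radix, sizeof radix, "%s", nl_langinfo(RADIXCHAR));
    snprintf(datefmt, sizeof datefmt, "%s", nl_langinfo(D_FMT));

    return snprintf(out, outsize, "codeset=%s radix=%s date=%s",
                    codeset, radix, datefmt);
}
```

The codeset name that is reported depends on the system and the locale, so a program should treat it as an opaque string rather than compare it against a fixed spelling.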

2.3.3 Generating and Interpreting Date and Time Strings That Observe Local Customs

Programs often generate date and time strings. Internationalized programs generate strings that observe the local customs of the user. You can meet this requirement by calling the strftime() or wcsftime() function. Both functions indirectly use the langinfo database. In addition, the wcsftime() function converts date and time to wide-character format.

In the following example, the strftime() function generates a date string as defined by the D_FMT item in the langinfo database:


.
.
.
setlocale(LC_ALL, "");  [1]

.
.
.
clock = time((time_t*)NULL);  [2]
tm = localtime(&clock);  [3]

.
.
.
strftime(buf, size, "%x", tm);  [4]
puts(buf);  [5]

.
.
.

[1] Binds the program at run time to the locale set for the system or individual user.

[2] Calls the time() subroutine to return the time value to the clock variable. The time value returned is relative to Coordinated Universal Time.

[3] Calls the localtime() function to convert the value contained in clock to a value that can be stored in a tm structure, whose members represent values for year, month, day, hour, minute, and so forth.

[4] Calls strftime() to generate a date string formatted as defined in the user's locale from the value contained in the tm structure.
The buf argument is a pointer to a string variable in which the date string is returned. The size argument contains the maximum size of buf. The "%x" argument specifies conversion specifications, similar to the format strings used with the printf() and scanf() functions. The "%x" argument is replaced in the output string by a representation appropriate for the locale.

[5] Calls the puts() function to copy the string contained in buf to the standard output stream (stdout) and to append a newline character.
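
The fragment above can be packaged as a small function. The following is a sketch only: format_local_date is a hypothetical name, and the function builds the tm structure from explicit year, month, and day values instead of the current time so that the result is reproducible.

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Formats a calendar date with the locale's preferred date layout ("%x").
   Returns the number of bytes written, or 0 if buf is too small.
   (Hypothetical helper for illustration.) */
size_t format_local_date(int year, int month, int day, char *buf, size_t size)
{
    struct tm tm;

    memset(&tm, 0, sizeof tm);
    tm.tm_year = year - 1900;   /* tm_year counts years since 1900 */
    tm.tm_mon = month - 1;      /* tm_mon is zero-based */
    tm.tm_mday = day;
    tm.tm_hour = 12;            /* midday avoids daylight-saving edge cases */
    tm.tm_isdst = -1;           /* let mktime() determine daylight saving */
    mktime(&tm);                /* fills in tm_wday and tm_yday */

    return strftime(buf, size, "%x", &tm);
}
```

In the "C" locale, "%x" is equivalent to "%m/%d/%y", so October 7, 1986 is formatted as 10/07/86. Other locales produce the layout defined by their own D_FMT item.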

Consider the following example of how to use strftime() and nl_langinfo() in combination to generate a date and time string. Assume that the preceding example's calls to the setlocale(), time(), and localtime() interfaces have been made in this example. However, the following example includes a call to nl_langinfo() that has replaced the format string argument in the call to strftime().


.
.
.
strftime(buf, size, nl_langinfo(D_T_FMT), tm);
puts(buf);

.
.
.

To convert a string to a date/time value (that is, the reverse of the operation performed by strftime()), you can use the strptime() function. The strptime() function supports a number of conversion specifiers that behave in a locale-dependent manner.

2.3.4 Formatting Monetary Values

The strfmon() function formats monetary values according to information in the locale that is bound to the program at run time. For example:

strfmon(buf, size, "%n", value);

This statement formats the double-precision floating-point value contained in the value variable. The "%n" argument is the format specification that is replaced by the format defined in the run-time locale. The results are returned to the buf array, whose maximum length is contained in the size variable.

The money program demonstrates how the strfmon() function works. When you install a Worldwide Language Support subset, the source file for this sample program is installed in the /usr/i18n/examples/money directory.
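
Independently of the money sample program, a minimal wrapper around strfmon() might look like the following sketch. The format_money name is hypothetical; the wrapper only adds a check of the return value, which is -1 when the formatted result does not fit in the buffer.

```c
#include <locale.h>
#include <monetary.h>
#include <string.h>
#include <sys/types.h>

/* Formats amount in the national currency format of the locale bound to
   LC_MONETARY. Returns 0 on success, -1 on failure.
   (Hypothetical helper for illustration.) */
int format_money(double amount, char *buf, size_t size)
{
    ssize_t written = strfmon(buf, size, "%n", amount);

    return written < 0 ? -1 : 0;
}
```

The exact output depends on the locale's currency symbol, separators, and number of fractional digits, so the sketch makes no assumption about what the string looks like.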

2.3.5 Formatting Numeric Values in Program-Specific Ways

To perform your own conversions of numeric quantities, monetary or otherwise, you can use specific formatting details in the user's locale. The localeconv() function, which has no arguments, returns a pointer to a structure that contains all the number formatting details defined in the locale. Declare a pointer of type struct lconv in your program to receive this value. For example:

struct lconv *app_conv;

You can use the following features, which are contained in the lconv structure, in program-defined routines:

Radix character

Thousands separator character

Digit grouping size

International currency symbol

Local currency symbol

Radix character for monetary values

Thousands separator for monetary values

Digit grouping size for monetary values

Positive sign

Negative sign

Number of fractional digits to be displayed

Parenthesis symbols for negative monetary values
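
As an illustration of a program-defined routine, the following sketch inserts the locale's thousands separator into an unsigned integer. The format_grouped name is hypothetical, and the routine is deliberately simplified: it always groups by three digits and ignores the grouping member of the lconv structure, which a complete implementation would consult.

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Formats value with the thousands separator from localeconv().
   Simplification: groups by three digits and ignores lconv->grouping.
   Returns the length written, or -1 if out is too small.
   (Hypothetical helper for illustration.) */
int format_grouped(unsigned long value, char *out, size_t outsize)
{
    struct lconv *conv = localeconv();
    const char *sep = conv->thousands_sep;
    size_t seplen = strlen(sep);
    char digits[32];
    int len = snprintf(digits, sizeof digits, "%lu", value);
    size_t pos = 0;

    for (int i = 0; i < len; i++) {
        /* Insert a separator before each complete group of three digits. */
        if (i > 0 && (len - i) % 3 == 0 && seplen > 0) {
            if (pos + seplen >= outsize)
                return -1;
            memcpy(out + pos, sep, seplen);
            pos += seplen;
        }
        if (pos + 1 >= outsize)
            return -1;
        out[pos++] = digits[i];
    }
    out[pos] = '\0';
    return (int)pos;
}
```

In the "C" locale the thousands separator is an empty string, so the digits are copied without grouping.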

2.3.6 Using the langinfo Database for Other Tasks

Functions in addition to the ones discussed so far use the langinfo database to determine settings for specific items of cultural data. For example, the wscanf(), wprintf(), and wcstod() functions determine the appropriate radix character from information in the langinfo database.

2.4 Handling Text Presentation and Input

As you create applications, you need to consider the user's native language in three particular areas:

The way program messages are defined and accessed (Section 2.4.1)

How the program presents output text (Section 2.4.2)

How the program processes input text (Section 2.4.3)

2.4.1 Creating and Using Messages

Programs need to communicate with users in their own language. This requirement places some constraints on the way program messages are defined and accessed. More specifically, messages are defined in a file that is independent of the program source code and are not compiled into object files. Because messages are in a separate file, they can be translated into different languages and stored in a form that is linked to the program at run time. Programs can then retrieve message text translations that are appropriate for the user's language.

The X/Open UNIX standard specifies the following messaging facilities:

A messaging system that contains a definition of message text source files

The gencat command to generate message catalogs from these source files

A set of library functions to retrieve individual messages from one or more catalogs at run time

The following example demonstrates how an internationalized program retrieves a message from a catalog:

#include <stdio.h>     [1]
#include <locale.h>    [2]
#include <nl_types.h>  [3]
#include "prog_msg.h"  [4]

int main(void)
{
      nl_catd catd;  [5]
      setlocale(LC_ALL, "");  [6]
      catd = catopen("prog.cat", NL_CAT_LOCALE);  [7]
      puts(catgets(catd, SETN, HELLO_MSG, "Hello, world!")); [8]
      catclose(catd);  [9]
      return 0;
}

.
.
.

[1] Includes the /usr/include/stdio.h header file, which declares the puts() function.

[2] Includes the /usr/include/locale.h header file, which declares the setlocale() function and associated constants and variables.

[3] Includes the /usr/include/nl_types.h header file, which declares the catopen(), catgets(), and catclose() functions.

[4] Includes the program-specific prog_msg.h header file, which sets constants to identify the message set (SETN) and specific messages (HELLO_MSG in the example) that are used by this program module.
A message catalog can contain one or more message sets. Individual messages are ordered within each set.

[5] Declares a message catalog descriptor catd to be of type nl_catd.
This descriptor is returned by the function that opens the catalog. The descriptor is also passed as an argument to the function that closes the catalog.

[6] Calls the setlocale() function to bind the program's locale categories to settings for the user's locale environment variables.
The locale name set for the LC_MESSAGES category is the locale used by the catopen() and catgets() functions in this example. Because the system administrator or user typically sets only the LANG or LC_ALL environment variable to a particular locale name, this operation implicitly sets the LC_MESSAGES variable as well.

[7] Calls the catopen() function to open the prog.cat message catalog for use by this program.
The NL_CAT_LOCALE argument specifies that the program will use the locale name set for LC_MESSAGES. The catopen() function uses the value set for the NLSPATH environment variable to determine the location of the message catalog. The call returns the message catalog descriptor to the catd variable.

[8] Calls the puts() function to display the message.
The first argument to this call is a call to the catgets() function, which retrieves the appropriate text for the message with the HELLO_MSG identifier. This message is contained in the message set identified by the SETN constant. The final argument to catgets() is the default text to be used if the messaging call cannot retrieve the translated text from the catalog. Default text is usually in the English language.

[9] Calls the catclose() function to close the message catalog whose descriptor is contained in the catd variable.

See Chapter 3 for information about creating and using message catalogs.
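
The retrieval pattern can also be wrapped in a helper that falls back to default text when the catalog cannot be opened. The following is a sketch under stated assumptions: load_message is a hypothetical name, and the catalog name, set number, and message number are supplied by the caller. The message is copied before the catalog is closed, because the pointer returned by catgets() may not remain valid afterward.

```c
#include <locale.h>
#include <nl_types.h>
#include <stdio.h>
#include <string.h>

/* Copies message msg_id of set set_id from the named catalog into out.
   If the catalog or the message is unavailable, fallback is copied.
   (Hypothetical helper for illustration.) */
void load_message(const char *catalog, int set_id, int msg_id,
                  const char *fallback, char *out, size_t outsize)
{
    nl_catd catd = catopen(catalog, NL_CAT_LOCALE);
    const char *text = fallback;

    if (catd != (nl_catd)-1)
        text = catgets(catd, set_id, msg_id, fallback);

    /* Copy the text while the catalog is still open. */
    snprintf(out, outsize, "%s", text);

    if (catd != (nl_catd)-1)
        catclose(catd);
}
```

A program that retrieves many messages would normally open the catalog once at startup rather than for every lookup; the sketch opens it per call only to stay self-contained.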

2.4.2 Formatting Output Text

Successful translation of messages into different languages depends not only on making messages independent of the program source code but also on careful construction of message strings within the program.

Consider the following example:

printf(catgets(catd, set_id, WRONG_OWNER_MSG,
               "%s is owned by %s\n"),
               folder_name, user_name);

The preceding statement uses a message catalog but assumes a particular language construction (a noun followed by a verb in passive voice followed by a noun). Passive verb constructions are not part of all languages; therefore, message translation might mean printing user_name before folder_name. In other words, the translator might need to change the construction of the message so that the user sees the translated equivalent of "John_Smith owns JULY_REVENUE" rather than "JULY_REVENUE is owned by John_Smith."

To overcome the problems imposed by fixed ordering of message elements, the printf() routine format specifiers can apply format conversion to the nth argument in an argument list, and not just to the next unused argument. To apply the format conversion extension, replace the % conversion character with the sequence %digit$, where digit specifies the position of the argument in the argument list. The following example illustrates how the programmer applies this feature to the format string "%s is owned by %s\n":

printf(catgets(catd, set_id, WRONG_OWNER_MSG,
               "%1$s is owned by %2$s\n"),
               folder_name, user_name);

The construction of the string "%1$s is owned by %2$s", which is the default value for the WRONG_OWNER_MSG entry in the program's message file, can then be changed by the translator to the non-English language equivalent of the following:

WRONG_OWNER_MSG        "%2$s owns %1$s\n"
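
To show the effect without a message catalog, the following sketch passes the format string as an ordinary argument, standing in for the text that catgets() would return. The format_owner_message name is hypothetical. The argument order in the call never changes; only the format string decides which argument appears first.

```c
#include <stdio.h>
#include <string.h>

/* Formats the ownership message with a caller-supplied format string that
   uses positional conversions (%1$s is the folder, %2$s is the user).
   (Hypothetical helper; the format would normally come from catgets().) */
int format_owner_message(const char *format, const char *folder_name,
                         const char *user_name, char *out, size_t outsize)
{
    return snprintf(out, outsize, format, folder_name, user_name);
}
```

With the format "%1$s is owned by %2$s" the helper produces "JULY_REVENUE is owned by John_Smith"; with "%2$s owns %1$s" the same arguments produce "John_Smith owns JULY_REVENUE". Because the format string comes from an external file, a program should use only catalogs that it trusts.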

2.4.3 Scanning Input Text

The string construction issues that are discussed for output text in Section 2.4.2 also apply to input text. For example, different countries have different conventions for the order in which users specify the elements of a date, or differ in the characters that are input to delimit parts of monetary strings. The scanf() family of functions supports extended format conversion specifiers that allow for variation in the way that users enter elements of a string.

Consider the following example:


.
.
.
int day;
int month;
int year;

.
.
.
scanf("%d/%d/%d", &month, &day, &year);

.
.
.

The format string in this statement is governed by the assumption that all users use a United States format (mm/dd/yyyy) to input dates. In an internationalized program, you use extended format specifiers to support requirements that language may impose on the order of string elements. For example:


.
.
.
scanf(catgets(catd, NL_SETD, DATE_STRING,
              "%1$d/%2$d/%3$d"), &month, &day, &year);

.
.
.

The default "%1$d/%2$d/%3$d" value for the DATE_STRING message is still appropriate only for countries in which users use the format mm/dd/yyyy to enter dates. However, for countries in which the order or formatting would be different, the translator can change the entry in the program's message file. Consider the following languages:

British English (dd/mm/yyyy):
```
DATE_STRING        "%2$d/%1$d/%3$d"
```

German (dd.mm.yyyy):
```
DATE_STRING        "%2$d.%1$d.%3$d"
```
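
The following sketch exercises the same idea with sscanf(), so that the input comes from a string instead of the terminal. The parse_date name is hypothetical, and the format argument stands in for the DATE_STRING text that catgets() would return.

```c
#include <stdio.h>

/* Parses a date from input by using a format with positional conversions:
   %1$d is the month, %2$d is the day, and %3$d is the year.
   Returns 0 if all three fields were read, -1 otherwise.
   (Hypothetical helper; the format would normally come from catgets().) */
int parse_date(const char *format, const char *input,
               int *month, int *day, int *year)
{
    return sscanf(input, format, month, day, year) == 3 ? 0 : -1;
}
```

Called with the German format "%2$d.%1$d.%3$d" and the input "7.10.1986", the helper stores 10 in month and 7 in day, even though the day is typed first.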

2.5 Binding a Locale to the Run-Time Environment

A correct, operational internationalized program must bind to localized data that is appropriate for the user at run time. The setlocale() function performs this task. You can call setlocale() to perform the following operations:

Bind to locale settings that are already in effect for the user's process

Bind to locale settings controlled by the program

Query current locale settings without changing them

The call takes two arguments: category and locale_name.

The category argument specifies whether you want to query, change, or use all or a specific section of a locale. Values for category and what they represent are as follows:

LC_ALL
This category argument specifies all sections of a locale (overrides specifications for specific sections).

LC_CTYPE
This category argument defines classes and character attributes used in case conversion and similar operations.

LC_COLLATE
This category argument specifies how to order characters and strings in sorting, or collation, operations.

LC_MESSAGES
This category argument specifies yes/no responses and program messages.

LC_MONETARY
This category argument specifies rules and special symbols for use in monetary values.

LC_NUMERIC
This category argument specifies rules and special symbols used for formatting numeric values.

LC_TIME
This category argument specifies names and abbreviations for days of the week, months of the year, and other strings and formatting conventions that govern expressions of date and time.

The locale_name argument is one of the following values:

An empty string ("") that binds the program at run time to the locale name set for category by the system administrator or user

A locale name that changes the locale that may already be set for category

NULL that determines the locale name currently set for category
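
As a brief sketch of the query form, the following helper passes NULL as the locale_name argument and copies the result. The get_locale_name name is hypothetical. The copy matters because the string returned by setlocale() can be overwritten by a later call.

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Copies the name of the locale currently bound to category into out.
   Returns 0 on success, -1 if the query fails or out is too small.
   (Hypothetical helper for illustration.) */
int get_locale_name(int category, char *out, size_t outsize)
{
    const char *name = setlocale(category, NULL);  /* query only, no change */

    if (name == NULL || strlen(name) >= outsize)
        return -1;
    strcpy(out, name);
    return 0;
}
```

A program that has not yet called setlocale() with a locale name or an empty string runs in the "C" locale, so the query returns "C" at that point.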

2.5.1 Binding to the Locale Set for the System or User

Typically, the system administrator or user sets the LANG or LC_ALL environment variable to the name of a locale. When you set either of these variables, it automatically sets all locale category variables to the same locale name.

Except for the case in which LC_ALL has been used to set all locale categories to a single locale name, system administrators or individual users can set locale category variables to different locale names. Usually, internationalized programs contain a setlocale() call with the LC_ALL category, which initializes all locale categories in the program to environment variable settings already in effect for the user. For example:

setlocale(LC_ALL, "");

A standard locale name consists of language_TERRITORY.codeset@modifier, for example, zh_CN.dechanzi@radical, where:

language represents the human language of the locale (zh is Chinese)

_TERRITORY is the geographic country or region of the locale (_CN is China, as opposed to _TW for Taiwan or _HK for Hong Kong)

.codeset is the coded character set used by the locale (dechanzi)

@modifier is additional information for localization data of a locale (collation by radical)
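
The structure of a locale name can be illustrated with a small parser. This is a sketch only: the locale_parts structure, its field sizes, and the split_locale_name function are hypothetical and are not part of any system interface. Components that are absent from the name are left as empty strings.

```c
#include <stdio.h>
#include <string.h>

/* Components of a name such as zh_CN.dechanzi@radical (hypothetical type). */
struct locale_parts {
    char language[16];
    char territory[16];
    char codeset[32];
    char modifier[32];
};

/* Copies the bytes in [start, end) into dst, truncating if necessary. */
static void copy_span(char *dst, size_t dstsize,
                      const char *start, const char *end)
{
    size_t len = (size_t)(end - start);

    if (len >= dstsize)
        len = dstsize - 1;
    memcpy(dst, start, len);
    dst[len] = '\0';
}

/* Splits language_TERRITORY.codeset@modifier into its four parts. */
void split_locale_name(const char *name, struct locale_parts *p)
{
    const char *end = name + strlen(name);
    const char *at = strchr(name, '@');
    const char *stop = at ? at : end;        /* end of the codeset part */
    const char *dot = memchr(name, '.', (size_t)(stop - name));
    const char *lt_end = dot ? dot : stop;   /* end of language_TERRITORY */
    const char *us = memchr(name, '_', (size_t)(lt_end - name));

    memset(p, 0, sizeof *p);
    copy_span(p->language, sizeof p->language, name, us ? us : lt_end);
    if (us)
        copy_span(p->territory, sizeof p->territory, us + 1, lt_end);
    if (dot)
        copy_span(p->codeset, sizeof p->codeset, dot + 1, stop);
    if (at)
        copy_span(p->modifier, sizeof p->modifier, at + 1, end);
}
```

Application code rarely needs to parse locale names itself, because setlocale() accepts the full name; the parser is shown only to make the four components concrete.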

Locales often have multiple variants. These variants have the same name as the base locale but include a file name suffix that begins with the at sign (@). Locale variants for support of codesets that are not native to UNIX (such as UCS-4 and CP850) can be assigned to LANG or LC_ALL.

However, locale variants that differ from the base locale in only one locale category should be assigned only to the appropriate locale category. For example, a locale variant designed to support a specific collation sequence, such as @radical, would be assigned to LC_COLLATE. A locale variant designed to support the euro monetary sign (@euro) would be assigned to LC_MONETARY. Use the base locale name, not these variants, in assignments to the LANG environment variable.

Furthermore, in cases where a base locale name is not being assigned to all locale categories, avoid using the LC_ALL environment variable, whose assigned value overrides settings for both LANG and the environment variables for specific locale categories.

Many locale-specific files reside in directories whose names are constructed from the language, territory, and codeset portions of a locale name. Commands and other system applications insert the setting of the LANG variable into search paths that contain %L as one of the directory nodes. This makes it possible for software programs to find the correct set of files, such as fonts, resource files, user-defined character files, and translated reference pages, that should be used with the current locale. An @ suffix related to collation, if included in an assignment to the LANG variable, may result in applications being unable to find certain locale-specific files.

2.5.2 Changing Locales During Program Execution

Some internationalized programs may need to prompt the user for a locale name or change locales during program execution. The following example demonstrates how to call setlocale() when you want to explicitly initialize or reinitialize all locale categories to the same locale name:


.
.
.
nl_catd catd;  [1]
char buf[BUFSIZ];  [2]

.
.
.
setlocale(LC_ALL, "");  [3]
catd = catopen(CAT_NAME, NL_CAT_LOCALE);  [4]

.
.
.
printf(catgets(catd, NL_SETD, LOCALE_PROMPT_MSG,
               "Enter locale name: "));      [5]
if (fgets(buf, sizeof(buf), stdin) != NULL) {  [6]
      buf[strcspn(buf, "\n")] = '\0';
      setlocale(LC_ALL, buf);  [7]
}

.
.
.

[1] Declares a catalog descriptor catd as type nl_catd.

[2] Declares the buf variable into which the locale name will later be stored.
To make sure that the variable is large enough to accommodate locale names on different systems, you should set its maximum size to the BUFSIZ constant, which is defined by the system vendor in /usr/include/stdio.h.

[3] Calls setlocale() to initialize the program's locale settings to those in effect for the user who runs the program.

[4] Calls catopen() to open the message catalog that contains the program's messages. The function returns the catalog's descriptor to the catd variable.
The CAT_NAME constant is defined in the program's own header file.

[5] Prompts the user for a new locale name.
The NL_SETD constant specifies the default message set number in a message catalog and is defined in /usr/include/nl_types.h. The LOCALE_PROMPT_MSG identifier specifies the prompt string translation in the default message set.

[6] Calls the fgets() function to read the locale name typed by the user into the buf variable, and then removes the trailing newline character with strcspn(), which is declared in /usr/include/string.h.
Unlike gets(), the fgets() function never writes more than sizeof(buf) bytes, so an overlong input line cannot overflow buf.

[7] Calls setlocale() with buf as the locale_name argument to reinitialize all portions of the locale.

Sometimes a program needs to vary the locale only for a particular category of data. For example, consider a program that processes different country-specific files that contain monetary values. Before processing data in each file, the program might reinitialize a program variable to a new locale name and then use that variable value to reset only the LC_MONETARY category of the locale.
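
A sketch of that approach follows. The format_money_in name is hypothetical, as is the choice to restore the previous LC_MONETARY setting before returning. The function saves the current setting, switches only the monetary category, formats one value, and then switches back; the other locale categories are never touched.

```c
#include <locale.h>
#include <monetary.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Formats amount by the monetary rules of locale_name, leaving all other
   locale categories unchanged. Returns 0 on success, or -1 if the locale
   is unavailable or the result does not fit in out.
   (Hypothetical helper for illustration.) */
int format_money_in(const char *locale_name, double amount,
                    char *out, size_t outsize)
{
    char saved[256];
    const char *current = setlocale(LC_MONETARY, NULL);
    int status = -1;

    if (current == NULL || strlen(current) >= sizeof saved)
        return -1;
    strcpy(saved, current);   /* later setlocale() calls may overwrite it */

    if (setlocale(LC_MONETARY, locale_name) != NULL) {
        if (strfmon(out, outsize, "%n", amount) >= 0)
            status = 0;
        setlocale(LC_MONETARY, saved);   /* restore the previous setting */
    }
    return status;
}
```

Because setlocale() changes state for the whole process, a multithreaded program would need a different design; this sketch assumes a single thread.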