This chapter explains
how the requirements of localization (language, codeset, and cultural differences)
change the way you implement basic coding operations.
A sample application
that applies the suggested program development techniques from this chapter
is provided in the
/usr/examples/i18n/xpg4demo
directory.
See the
README
file in that directory for an introduction
to the application and how you can compile and run the application with different
locales.
Parts of the
xpg4demo
application are used as
examples in this and other chapters.
One of the primary functions of most computer programs is to manipulate data, which can involve interaction between the program and a computer user. In commercial situations, it is important that such interactions take place in the native language of each user. Cultural data should also observe the correct customs.
When you write programs to support multilanguage operation, you must consider the fact that languages can be represented within the computer system by one or more codesets. Because of the requirements of different languages, characters in codesets may vary in both size (8 bits, 16 bits, and so on) and binary representation.
You can satisfy the requirements of codesets and data by writing programs that make no hard-coded assumptions about language, cultural data, or character encodings. Such programs are said to be internationalized. Data specific to each supported language, territory, and codeset combination are held separately from the program code and can be bound to the run-time environment by language-initialization functions.
The operating system provides the following facilities for developing internationalized software, defining localization data, and announcing specific language requirements:
Locales that contain language, codeset, and cultural definitions for each language (Section 2.1)
Library functions that handle extended character codes and that provide language- and codeset-independent character classification, case conversion, number format conversion, and string collation (Section 2.2)
Library functions that let programs dynamically determine cultural and language-specific data (Section 2.3)
A message system that allows program messages to be held apart from the program code, translated into different languages, and retrieved by a program at run time (Section 2.4)
An initialization function that binds a program at run time to the linguistic and cultural requirements of each user (Section 2.5)
The discussion and examples in this chapter focus
on functions provided in the Standard C Library.
See
Chapter 4
for information on using functions in the
curses
Library.
See
Chapter 5
for information about using functions in
the X and Motif libraries.
2.1 Using Locales
The operating system supports
Unicode
and
dense code
locales.
The Unicode locales are installed
in
/usr/i18n/lib/nls/ucsloc/.
Dense code locales are
installed in
/usr/i18n/lib/nls/loc.
The active default
is determined by the symbolic link,
/usr/i18n/lib/nls/dloc.
For example, the Japanese locale filename,
/usr/lib/nls/loc/ja_JP.eucJP, is a symbolic link to
/usr/i18n/lib/nls/dloc/ja_JP.eucJP, where
/dloc
is a symbolic link to either
/ucsloc
for the Unicode version or
/loc
for
the dense code version of the Japanese locale.
If you are superuser, you can switch between Unicode and dense code
locales by changing the setting of the symbolic link, as described in
l10n_intro(5).
Unicode
locales conform to Unicode and ISO/IEC 10646 standards and use UTF-32 as the
wide-character encoding.
Under UTF-32 wide character encoding,
wchar_t
values represent the same characters regardless of the locale and,
because Unicode standards prevail, implementation is consistent across platforms.
Locales whose names end in
.UTF-8
use file code and UTF-32 internal process code (wchar_t
encoding) defined in the
ISO 10646
and Unicode standards.
Other, non-UTF-8 Unicode locales use traditional UNIX
and proprietary codesets for the file code while using UTF-32 as the internal
process code.
A subset of these Unicode locales have a
@ucs4
modifier; however, they are the same as the locales without the
@ucs4
modifier.
The
@ucs4
subset is provided
for backward compatibility and may be removed in the future.
You cannot choose
@ucs4
locales from the CDE Login Menu; you must specify the locale
name in the
LANG
environment variable.
The
universal.UTF-8
locale is also available
(for use by applications rather than end users).
This locale supports the
complete set of characters in the Universal Character Set (UCS).
See
Unicode(5).
For
.UTF-8
locales, file code may include characters encoded in more than 1 byte; therefore,
use these locales in applications that can process multibyte data.
Design
new applications based on multibyte
.UTF-8
locales, which
incorporate a large character repertoire, to enable the application to expand
future character support without changing the character set.
Dense code locales use dense code for wide-character
encoding to minimize table size (that is, codepoints are assigned consecutively
with no empty positions).
Under dense code locales, a
wchar_t
value for one locale may not represent the same character in another locale
and, thus, is locale specific.
Dense code locales are appropriate for applications
that have no dependencies on the internal process code or, because dense code
locales are slightly more efficient than Unicode locales, for applications
whose primary goal is better performance.
All valid codepoints in multibyte character
sets are mapped to valid codepoints in Unicode, including unmapped codepoints
that are mapped to Unicode codepoints in the private use area.
Thus, dense
code locales are equivalent to Unicode locales.
In general, the same charmaps
and locale source can be used for Unicode and dense code locales.
However,
Unicode and dense code characters that are not defined in the
LC_COLLATE
section may be sorted differently.
A Unicode locale exists for each dense code locale. (However, not all Unicode locales have a dense code version.) For Latin-1 locales (ISO8859-1), the dense code and Unicode locales are identical because Latin-1 characters are the same as the first 256 characters in Unicode. Keep in mind that the same locale name can refer to a Unicode locale or to a dense code locale, depending on the setting of the symbolic link. Thus, if running an application in a locale is problematic, check the symbolic link.
Because Unicode locales use consistent values for characters in
wchar_t
form, the link to Unicode locales can increase consistency
across locales and platforms.
However, some users may prefer the older, dense
code locales that use proprietary algorithms to convert characters to
wchar_t
form, or an application may have dependencies on dense code
wchar_t
encoding.
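When it is unclear which kind of locale an application is bound to, one way to investigate is to convert a known character to process code and inspect the resulting value. The following is a minimal sketch and is not part of xpg4demo; the helper name wc_value_of() is hypothetical. It uses only the standard mbtowc() function and reports the wchar_t value that the current locale assigns to the first character of a string.

```c
#include <stdlib.h>
#include <wchar.h>

/*
 * Hypothetical helper: convert the first multibyte character of s to
 * process code under the current locale and return its numeric value.
 * Returns -1 if s does not begin with a valid multibyte character.
 */
long wc_value_of(const char *s)
{
    wchar_t wc;

    (void) mbtowc(NULL, NULL, 0);      /* reset any shift state */
    if (mbtowc(&wc, s, MB_CUR_MAX) == -1)
        return -1;
    return (long) wc;
}
```

Under a Unicode locale the value returned for a given character should be the same in every locale; under a dense code locale it may differ from one locale to another.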
2.2 Using Codesets
In the past, most UNIX systems were based on the 7-bit ASCII codeset. However, most non-English languages include characters in addition to those contained in the ASCII codeset.
The X/Open UNIX standard does not require an operating system to supply any particular codesets in addition to ASCII. The standard does specify requirements for the interfaces that manipulate characters so that programs are able to handle characters from whatever codeset is available on a given system.
The first group of the International Organization for Standardization (ISO) codesets covered only the major European languages. In this group, several codesets allow for the mixing of major languages within a single codeset. All of these codesets are a superset of the ASCII codeset and allow systems to support non-English languages without invalidating existing software that is not internationalized. The Tru64 UNIX operating system always includes a locale for the United States that uses the ISO 8859-1 (ISO Latin-1) codeset.
Subsets that are installed as part of Worldwide Language Support (WLS) support localized variants of the operating system and may include locales based on additional ISO codesets. For example, the optional language variant subsets included with the operating system to support Czech, Hungarian, Polish, Russian, Slovak, and Slovene provide locales based on the ISO 8859-2 (Latin-2) codeset.
The following is a complete list of ISO codesets provided with the WLS, including the languages that they support and the reference pages where they are discussed in more detail:
ISO 8859-1, Latin-1
Languages of Western Europe and North America, including Catalan, Danish, Dutch, English/Great Britain, English/United States, Finnish, Flemish/Belgium, French/Belgium, French/Canada, French/Swiss, French, German/Swiss, German/Germany, Icelandic, Italian, Norwegian, Portuguese, Spanish, and Swedish
See
iso8859-1(5)
ISO 8859-2, Latin-2
Languages of Eastern Europe, including Czech Republic, Hungarian, Polish, Slovak, and Slovene
See
iso8859-2(5)
ISO 8859-4, Latin-4
See
iso8859-4(5)
ISO 8859-5, Latin/Cyrillic
See
iso8859-5(5)
ISO 8859-7, Latin/Greek
See
iso8859-7(5)
ISO 8859-8, Latin/Hebrew
Hebrew/Israel (uses the ISO Hebrew codeset)
See
iso8859-8(5)
ISO 8859-9, Latin-5
See
iso8859-9(5)
ISO 8859-15, Latin-9
Catalan/Spain, Danish, Dutch, English/Great Britain, English/United States, Finnish, Flemish/Belgium, French/Belgium, French/Canada, French/Swiss, French, German/Swiss, German/Germany, Icelandic, Italian, Norwegian, Portuguese, Spanish/Spain, and Swedish. ISO 8859-15 (and UTF-8) support the euro monetary character.
See
iso8859-15(5)
The operating system does not include support for the ISO 8859-3 (Latin-3) and ISO 8859-6 (Latin/Arabic) codesets.
Another ISO codeset supported by utilities on a standard operating system is ISO 6937: 1983. This codeset, which accommodates both 7-bit and 8-bit characters, is used for text communication over communication networks and interchange media, such as magnetic tape and disks.
The codesets discussed up to this point address the requirements of languages whose characters can be stored in a single byte. Such codesets do not meet the needs of Asian languages, whose characters can occupy multiple bytes. The operating system supplies the following codesets through installed subsets that support Asian languages and countries:
Japanese
Japanese Extended UNIX Code (the default)
See
eucJP(5)
Shift JIS
See
shiftjis(5)
DEC Kanji
See
deckanji(5)
Super DEC Kanji
See
sdeckanji(5)
Korean
DEC Korean
See
deckorean(5)
Korean Extended UNIX Code
See
eucKR(5)
Thai
Thai API Consortium/Thai Industrial Standard
See
TACTIS(5)
Simplified Chinese
DEC Hanzi
See
dechanzi(5)
GBK and GB18030
See
GBK(5) and GB18030(5)
Traditional Chinese
DEC Hanyu
See
dechanyu(5)
Taiwanese Extended UNIX Code
See
eucTW(5)
BIG-5 (and the variant, Shift BIG-5)
Telecode
See
telecode(5)
These codesets are supplied when you install Asian language variant subsets of the operating system software. Also supplied are a specialized terminal driver and associated utilities that must be available on your system to support the input and display of Asian characters at run time.
Codesets developed for PC systems are
commonly called code pages.
There are PC code pages that correspond to most
of the language-specific codesets developed for UNIX systems.
The operating
system supports PC codesets mostly through converters that can change file
data from one type of encoding format to another.
The CP850 codeset supports
English/United States and is used with data that contains accented characters
generated on a PC using the CP850 code page for character encoding.
This character
encoding is usually the default for MS-DOS and Windows operating systems in
Europe.
See
code_page(5)
The Unicode and ISO/IEC 10646 standards specify the Universal Character Set (UCS), which allows character units to be processed for all languages, including Asian languages, using the same set of rules. The operating system supports the UCS-4 (32-bit) encoding of this character set in process code.
Other encoding formats defined by the Unicode standard, the ISO/IEC 10646 standard, or both include the following:
UCS-2, a 16-bit encoding counterpart to UCS-4
A number of universal transformation formats (UTF-8, UTF-16, and UTF-32) that transform UCS encoding into sequences of bytes for handling by byte-oriented protocols
The operating system supports these different formats through locales, codeset converters, or both. Because UCS-2 is a subset of UTF-16, the operating system supports UCS-2 with UTF-16 codeset converters. The operating system supports UCS-4 with both codeset conversion and locales.
The following locales use UTF-32 as internal processing code:
universal.UTF-8
Use this
locale in applications to convert data in UTF-8 file format to UCS-4 process
code and to test any UCS-4 character to determine if it is included in one
of the following
LC_CTYPE
classes:
alnum,
alpha,
blank,
cntrl,
digit,
graph,
lower,
print,
punct,
space,
upper, or
xdigit.
In this locale, the
LC_MESSAGES,
LC_MONETARY,
LC_NUMERIC, and
LC_TIME
definitions match those of the
POSIX (C) locale.
Your application can use this locale, along with the
fold_string_w()
function, to process the full range of characters
defined by the Unicode and ISO/IEC 10646 standards.
This locale differs from most others because it does not provide access to local cultural conventions.
language_territory.UTF-8
These locales limit classification information to the characters in a particular native language, make country-specific data available to your application, and assume file data follows UTF-8 encoding rules. The operating system locales that support the euro monetary symbol use either UTF-8 or ISO 8859-15 codesets.
The Unicode UTF-8 codeset supports
Catalan/Spain, Czech Republic, Danish, Dutch, English/Great Britain, English/United
States, Finnish, Flemish, French/Belgium, French/Canada, French/Swiss, German/Swiss,
German, Greek, Hungarian, Icelandic, Italian, Japanese, Korean, Lithuanian,
Norwegian, Polish, Portuguese, Russian, Slovak, Slovene, Spanish, Swedish,
Turkish, simplified Chinese (Hanzi), and traditional Chinese (Hanyu).
See
Unicode(5)
native_locale_name
These locales use UTF-32 as internal processing code.
The codeset portion
of the
native_locale_name
(for
example, ISO8859-1) specifies the file code.
Also, the locale provides
classification information for the native language characters, but not for
the full set of UTF-32 characters.
Country specific information is available
to the application; the
LC_COLLATE,
LC_MESSAGES,
LC_MONETARY,
LC_NUMERIC,
and
LC_TIME
category definitions match the definition in
native_language_name.
native_language_name@ucs4
These locales are provided for compatibility with existing applications
that use the
@ucs4
locales.
They function the same as the
native_locale_name
locales, but the list of locales
provided is not as complete as the
native_language_name
locales.
See
Section 2.5
for information on locale categories,
such as
LC_TIME.
See
Unicode(5) and euro(5).
The following sections discuss important issues that affect the way you write source code when your program must process characters in different codesets:
Ensuring data transparency (Section 2.2.1)
Using in-code literals (Section 2.2.2)
Manipulating characters that span multiple bytes (Section 2.2.3)
Converting between multibyte-character and wide-character data (Section 2.2.4)
Rules for multibyte characters in source and execution codesets (Section 2.2.5)
Classifying characters (Section 2.2.6)
Converting characters (Section 2.2.7)
Comparing strings (Section 2.2.8)
2.2.1 Ensuring Data Transparency
As discussed in Section 2.2, internationalized software must accommodate a wide variety of character-encoding schemes. Programs cannot assume that a particular codeset is on all systems that conform to requirements in the X/Open UNIX CAE specifications, nor that individual characters occupy a fixed number of bits.
Because of the historical dependence of UNIX systems on 7-bit
ASCII character encoding, some programs use the most significant bit (MSB)
of a byte for their own internal purposes.
This was a dubious programming
practice, although quite safe when characters in the underlying codeset always
mapped to the remaining 7 bits of the byte.
In the world of international
codesets, the practice of using the most significant bit of a byte for program
purposes must be avoided.
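One way to avoid borrowing the most significant bit is to keep any program-specific flag in its own field, next to the data byte rather than inside it. The following is a minimal sketch under that assumption; the marked_byte structure and the mark_byte() function are hypothetical names used only for illustration.

```c
/*
 * Hypothetical record: the flag lives beside the character byte,
 * so all 8 bits of value remain available to the codeset.
 */
struct marked_byte {
    unsigned char value;    /* data byte, passed through unchanged */
    unsigned char marked;   /* the program's own flag */
};

/* Mark a byte without touching any of its bits. */
struct marked_byte mark_byte(unsigned char c)
{
    struct marked_byte m;

    m.value = c;            /* not: c | 0x80, and not: c & 0x7f */
    m.marked = 1;
    return m;
}
```

With this layout a byte such as 0xE0 (an accented letter in ISO 8859-1) survives marking intact, whereas masking it to 7 bits would turn it into a different character.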
2.2.2 Using In-Code Literals
When you write internationalized software, avoid using in-code literals. Consider, for example, the following conditional statement:
if ((c = getchar()) == '\141')
This condition assumes that lowercase
a
is always
represented by a fixed octal value, which may not be true for all codesets.
Do not hard code the numeric value of a character.
As a first step, the following statement compares the value returned by the
getchar()
function with a character
constant instead of the octal value:
if ((c = getchar()) == 'a')
However, because the
getchar()
function operates on bytes, the statement would not work
correctly if the next character in the input stream spanned multiple bytes.
To avoid this problem, substitute the
getwchar()
function
for the
getchar()
function.
The
getwchar()
function, as used in the following example, works correctly with any codeset because
a
is a member of the Portable Character Set (PCS) and is transformed into the same wide-character
value in all locales.
if ((c = getwchar()) == L'a')
The X/Open UNIX standard specifies that each member of the source character set and each escape sequence in character constants and string literals is converted to the same member of the execution character set in all locales. Thus, you can safely use any of the characters in the PCS as a character constant or in string literals. Non-English language characters are not included in the PCS and may not translate correctly when used as literals. Consider the following example:
if ((c = getwchar()) == L'à')
The accented character
à
may not be represented
in the codeset's source character set or execution character set.
Also, the
binary value of the accented character may not be translatable from one set
to the other.
When source files specify non-English language characters in
constants, the results are undefined.
In cases such as this, it can be helpful
to employ a consistent use of Unicode locales.
The following example illustrates how to construct a
test for a constant that for whatever reason may be a non-English language
character.
The constant has been defined in a message catalog with the symbolic
identifier
MSG_ID.
Statements in the example retrieve the
value for
MSG_ID
from the message catalog, which is locale
specific and bound to the program at run time.
.
.
.
char *schar;                                        [1]
wchar_t wchar;                                      [2]
.
.
.
schar = catgets(catd,NL_SETD,MSG_ID,"a");           [3]
if (mbtowc (&wchar,schar,MB_CUR_MAX) == -1)         [4]
        error();
if ((c = getwchar()) == wchar)                      [5]
.
.
.
[1] Declares
schar
as a pointer to
char.
[2] Declares the variable
wchar
to be of type
wchar_t.
[3] Calls the
catgets()
function to retrieve
the value of
MSG_ID
from the message catalog for the user's
locale.
The
catgets()
function returns the value as an array
of bytes, so the result is assigned to the
schar
variable.
If the accented character is not available in the locale's codeset, the test
is made against the unaccented base character (a).
[4] Tests to make sure the value contained in
schar
represents a valid multibyte character.
If the value is a valid multibyte
character, the program converts it to a wide-character value and stores the
result in the variable
wchar.
If
schar
does not contain a valid multibyte character,
the program signals an error.
[5] Codes the conditional statement to include the value contained
in
wchar
as the constant.
See
Chapter 3
for more information about message
catalogs and the
catgets()
function.
See
Section 2.2.4
for information about converting multibyte characters and strings to wide-character
data that your program can process.
2.2.3 Manipulating Characters That Span Multiple Bytes
The
operating system provides all the interfaces (such as
putwc(),
getwc(),
fputws(), and
fgetws())
that are needed to support codesets with characters that span multiple bytes.
Language variant subsets of the operating system must be installed to supply
the locales and facilities that make this support operational.
On systems
where such locales are not available, or are available but not bound to the
program at run time, the
*ws*()
and
*wc*()
functions are merely synonyms for the associated single-byte functions (such
as
putc(),
getc(),
fputs(),
and
fgets()).
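As an illustration of these interfaces, the following sketch copies text from one stream to another in wide-character form. It is an assumption-laden example rather than code from xpg4demo: the function name copy_wide_lines() and the buffer size WLINE_SIZE are hypothetical, and the program is assumed to have already been bound to the user's locale.

```c
#include <stdio.h>
#include <wchar.h>

#define WLINE_SIZE 256      /* assumed buffer size for this sketch */

/*
 * Hypothetical helper: copy wide-character lines from in to out.
 * Returns the number of fgetws() reads that succeeded (one per line
 * when every line fits in the buffer), or -1 on a write error.
 */
long copy_wide_lines(FILE *in, FILE *out)
{
    wchar_t line[WLINE_SIZE];
    long count = 0;

    while (fgetws(line, WLINE_SIZE, in) != NULL) {
        if (fputws(line, out) < 0)      /* negative value signals an error */
            return -1;
        count++;
    }
    return count;
}
```

Because the conversion between file code and process code happens inside fgetws() and fputws(), the loop itself contains no codeset-specific logic.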
2.2.4 Converting Between Multibyte-Character and Wide-Character Data
On an internationalized system, data can be encoded as either multibyte character or wide-character data.
Multibyte encoding is typically used when data is stored in a file or generated for external use or data interchange. Multibyte encoding has the following disadvantages:
Characters are not represented by a fixed number of bytes for each character, even in the same codeset. Thus, the size of a character in a multibyte data record can vary from one character to the next.
The parsing rules for retrieving character codes from a multibyte data record are locale dependent.
Because of these disadvantages, wide-character encoding, which allocates a fixed number of bytes for each character, is typically used for internal processing by programs; in fact, internal process code is another way of referring to data in wide-character format. The size of a wide character varies from one system implementation to another. On Tru64 UNIX systems, the size for a wide character is set to 4 bytes (32 bits), a setting that optimizes performance for the HP Alpha processor.
Library routines that print, scan, input, or output text can automatically convert data from multibyte characters to wide characters or from wide characters to multibyte characters, as appropriate for the operation. However, applications almost always have additional statements or requirements for which conversion to and from multibyte characters needs to be explicit.
The following
example is from a program module that reads records from a database of employee
data.
In this case, the programmer wants to process the data in fixed-width
units, so uses the
mbstowcs()
function to explicitly convert
an employee's first and last names from multibyte character to wide-character
encoding.
/*
 * The employee record is normalized with the following format, which
 * is locale independent: Badge number, First Name, Surname,
 * Cost Center, Date of Join in the `yy/mm/dd' format. Each field is
 * separated by a TAB. The space character is allowed in the First
 * Name and Surname fields.
 */
static const char *dbOutFormat = "%ld\t%S\t%S\t%S\t%02d/%02d/%02d\n";
static const char *dbInFormat = "%ld %[^\t] %[^\t] %S %02d/%02d/%02d\n";
.
.
.
sscanf(record, dbInFormat,
       &emp->badge_num,
       firstname,
       surname,
       emp->cost_center,
       &emp->date_of_join.tm_year,
       &emp->date_of_join.tm_mon,
       &emp->date_of_join.tm_mday);
(void) mbstowcs(emp->first_name, firstname, FIRSTNAME_MAX+1);
(void) mbstowcs(emp->surname, surname, SURNAME_MAX+1);
.
.
.
See
Section A.9
for a complete list
of functions that work directly with multibyte data.
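The preceding excerpt discards the value returned by mbstowcs(). Where invalid or overlong input is possible, a program may prefer to check that value. The following is a minimal sketch of such a check; the wrapper name to_wide() is hypothetical and is not part of xpg4demo.

```c
#include <stdlib.h>
#include <wchar.h>

/*
 * Hypothetical wrapper around mbstowcs(). dst has room for dst_len
 * wide characters, including the terminating null.
 * Returns the number of wide characters stored (terminator excluded),
 * or (size_t)-1 if src is not valid in the current locale or does
 * not fit in dst.
 */
size_t to_wide(wchar_t *dst, size_t dst_len, const char *src)
{
    size_t n;

    if (dst_len == 0)
        return (size_t) -1;
    n = mbstowcs(dst, src, dst_len);
    if (n == (size_t) -1)
        return n;                   /* invalid multibyte sequence */
    if (n == dst_len)
        return (size_t) -1;         /* dst filled with no room for L'\0' */
    return n;
}
```

The second check matters because mbstowcs() does not write a terminating null when the converted string fills the destination array exactly.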
2.2.5 Rules for Multibyte Characters in Source and Execution Codesets
Both the source and execution character set variants of the same codeset can contain multibyte characters. The encodings do not have to be the same, but the source and execution variants both observe certain rules in codesets that meet X/Open requirements. PC code pages and UCS-based codesets may adhere to some or most of these rules, but the codesets native to any UNIX system that conforms to X/Open standards must adhere to all of them.
The characters defined in the Portable Character Set must be present in both sets.
The existence, meaning, and encoding of any additional members are locale specific.
A character may have a state-dependent encoding. A string of characters may contain a shift-state character that affects the system's interpretation of the following bytes until another shift-state character is encountered.
While in the initial shift state, all characters from the basic character set retain their usual interpretation and do not alter the shift state.
The interpretation for subsequent bytes in the sequence is a function of the current shift state.
A byte with all bits set to zero is interpreted as a null character, independent of the shift state.
A byte with all bits zero must not occur in the second or subsequent bytes of a multibyte character.
The source variant of a codeset must observe the following additional rules:
A comment, string literal, character constant, or header name must begin and end in the initial shift state
A comment, string literal, character constant, or header name must consist of a sequence of valid multibyte characters
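The rules about the null byte are what allow a program to scan a multibyte string for its end without knowing the codeset. The following sketch, which assumes the program is already bound to the intended locale, counts the characters in a multibyte string with the standard mblen() function; the function name count_mb_chars() is hypothetical.

```c
#include <stdlib.h>

/*
 * Hypothetical helper: count the characters (not bytes) in the
 * multibyte string s under the current locale.
 * Returns -1 if s contains an invalid multibyte sequence.
 */
long count_mb_chars(const char *s)
{
    long count = 0;
    int len;

    (void) mblen(NULL, 0);              /* begin in the initial shift state */
    while (*s != '\0') {                /* a zero byte is always the null character */
        len = mblen(s, MB_CUR_MAX);
        if (len <= 0)
            return -1;                  /* invalid sequence */
        s += len;                       /* step over one whole character */
        count++;
    }
    return count;
}
```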
The
C language compiler supports trigraph sequences when you specify the
-std1
or
-std
flag on the
cc
command
line.
Trigraph sequences, which are part of the ANSI C specification, allow
users to enter the full range of basic characters in programs, even if their
keyboards do not support all characters in the source codeset.
The following
trigraph sequences are currently defined, each of which is replaced by the
corresponding single character:
| Trigraph Sequence | Single Character |
| ??= | # |
| ??( | [ |
| ??/ | \ |
| ??' | ^ |
| ??< | { |
| ??) | ] |
| ??! | | |
| ??> | } |
| ??- | ~ |
2.2.6 Classifying Characters
Another feature of program operation that depends on the locale is character classification; that is, determining whether a particular character code refers to an uppercase alphabetic, lowercase alphabetic, digit, punctuation, control, or space character.
In the past, many programs classified characters according to whether the character's value fell between certain numerical limits. For example, the following statement tests for all uppercase alphabetic characters:
if (c >= 'A' && c <= 'Z')
This statement is valid for the ASCII codeset, in which all uppercase
letters have values in the range
0x41
to
0x5a
(A to Z).
However, the statement is not valid for the ISO 8859-1
codeset, in which uppercase letters occupy the ranges
0x41
to
0x5a,
0xc0
to
0xd6,
and
0xd8
to
0xde.
In the EBCDIC codeset,
character values are different again and, in this case, even the uppercase
English language letters have a different encoding.
When you write internationalized programs, classify characters by calling the appropriate internationalization function. For example:
if (iswupper (c))
Internationalization functions classify wide-character
code values according to
ctype
information in the user's
locale.
See
Section A.2
for a complete list and description
of character classification functions.
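As a slightly larger illustration, the following sketch counts the uppercase characters in a wide-character string by calling iswupper() on each element. The function name count_upper() is hypothetical; the result depends on the LC_CTYPE information in the locale to which the program is bound.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper: count the characters in ws that the current
 * locale classifies as uppercase. No numeric ranges are assumed.
 */
size_t count_upper(const wchar_t *ws)
{
    size_t n = 0;

    for (; *ws != L'\0'; ws++) {
        if (iswupper((wint_t) *ws))
            n++;
    }
    return n;
}
```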
2.2.7 Converting Characters
As
an example of what not to do in an internationalized program, consider the
following statements, which perform case conversion of ASCII characters by
converting the character in
a_var
first to lowercase and
then to uppercase:
a_var |= 0x20;
.
.
.
a_var &= 0xdf;
The preceding statements are not safe to use in internationalized programs because the statements assume ASCII-coded character values and because they can convert invalid values.
The correct way to handle
case conversion is to call the
towlower()
function for
conversion to lowercase and the
towupper()
function for
conversion to uppercase.
For example:
a_var = towlower(a_var);
.
.
.
a_var = towupper(a_var);
These functions use information specified
in the user's locale and are independent of the codeset in which characters
are defined.
The functions return the argument unchanged if input is invalid.
See
Section A.3
for more detailed discussion of
case conversion functions.
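Applied to a whole string, the same approach might look like the following sketch. The function name wcs_to_upper() is hypothetical; the conversion itself is done entirely by towupper().

```c
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper: convert ws to uppercase in place according to
 * the current locale and return ws. Characters that have no uppercase
 * form are left unchanged by towupper().
 */
wchar_t *wcs_to_upper(wchar_t *ws)
{
    wchar_t *p;

    for (p = ws; *p != L'\0'; p++)
        *p = (wchar_t) towupper((wint_t) *p);
    return ws;
}
```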
2.2.8 Comparing Strings
UNIX
systems provide functions for comparing character strings.
The following
statement, for example, compares the strings
s1
and
s2, returning an integer greater than, equal to, or less than zero,
depending on whether the value of
s1
is greater than, equal
to, or less than the value of
s2
in the machine-collating
sequence:
.
.
.
int cmp_val;
char *s1;
char *s2;
.
.
.
cmp_val = strcmp(s1, s2);
.
.
.
Many languages, however, require more complex collation algorithms than a simple numerical sort. For example, multiple passes may be required for the following reasons:
Ordering accented characters within a particular character class for a language (for example, a, á, à, and so on)
Collating certain multiple character sequences as a single character (for example, the Welsh character ch, which collates after c and before d)
Collating certain single characters as a 2-character sequence (for example, the German character sharp s, which collates as ss)
Ignoring certain characters during collation (for example, hyphens in dictionary words)
String comparison in an international environment depends on the codeset and language. This dependency means that additional functions are required to compare strings according to collating sequence information in the user's locale. These functions include the following:
strcoll()
This function uses collation information defined in the user's locale
rather than performing a simple numeric comparison as does the
strcmp()
function.
wcscoll()
This function performs the same operation as
strcoll(),
except that it operates on wide characters.
wcsxfrm()
This function transforms a wide-character string by using collating
sequence information in the user's locale so that the resulting string can
be compared using the
wcscmp()
function.
If
two strings are being compared only for equality, you can use
strcmp()
or
wcscmp(), which are faster in most environments
than
wcscoll().
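A common use of locale-aware comparison is sorting. The following sketch sorts an array of strings with qsort() and a comparison callback that calls strcoll(); the names coll_cmp() and sort_by_locale() are hypothetical, and the program is assumed to have been bound to the user's locale beforehand.

```c
#include <stdlib.h>
#include <string.h>

/* qsort() callback: compare two strings by the locale's collating sequence. */
static int coll_cmp(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;

    return strcoll(*sa, *sb);
}

/*
 * Hypothetical helper: sort count strings in place according to the
 * collation rules of the current locale.
 */
void sort_by_locale(const char **names, size_t count)
{
    qsort(names, count, sizeof names[0], coll_cmp);
}
```

In the POSIX (C) locale this produces the same order as strcmp(); in other locales the order follows the locale's LC_COLLATE definitions.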
2.3 Handling Cultural Data
Cultural data refers to items of information that can vary between languages or territories.
For example:
In the United Kingdom and the United States, a period represents the radix character and a comma represents the thousands separator in decimal numbers. In Germany, the same two characters in decimal numbers have the opposite meaning.
In the United States, the date October 7, 1986 is represented as 10/7/1986. In the United Kingdom, the same date is represented as 7/10/1986. This example indicates that cultural data items can vary even when the same language is spoken.
Date delimiters, as well as the order of year, month, and day, can vary among countries. In Germany, for example, the date October 7, 1986 is represented as 7.10.1986 rather than as 7/10/1986.
Currency symbols can vary both in the characters used and where they are placed in a currency value; that is, currency symbols can precede, follow, or be embedded in the value.
The euro character that is used as the currency symbol by European
countries belonging to the Economic and Monetary Union is supported only by
Unicode (*.UTF-8) or Latin-9 (*.ISO8859-15)
locales and associated fonts.
See
euro(5).
To enter the euro character from the keyboard, you must be
working in a Latin-9 or UTF-8 locale and the appropriate keymap must be active.
To display the euro character, you must be working in a Latin-9 or UTF-8 locale
and the appropriate font must be active.
To activate the required locale and
the appropriate keymap and font, log in to a Latin-9 or UTF-8 locale, or use
setenv
to set the
LANG
environment variable,
and start a new
dtterm.
See the reference pages for
locale(1) and dtterm(1).
You cannot make assumptions about cultural data when writing internationalized
programs.
Your program must operate according to the local customs of users.
The X/Open UNIX standard specifies that this requirement be met through a
database of cultural data items that a program can access at run time, plus
a set of associated interfaces.
The following sections discuss this database
and the functions used to extract and process its data items.
2.3.1 The langinfo Database
The language information database,
named
langinfo, contains items that represent the cultural
details of each locale supported on the system.
The
langinfo
database contains the following information for each locale, as required by
the X/Open UNIX standard:
Codeset name
Date and time formats
Names of the days of the week
Names of the months of the year
Abbreviations for names of days
Abbreviations for names of months
Radix character (the character that separates whole and fractional quantities)
Thousands separator character
Affirmative and negative responses for yes/no queries
Currency symbol and its position within a currency value
Emperor/Era name and year (for Japanese locales)
2.3.2 Querying the langinfo Database
You
can extract cultural data items from the
langinfo
database
by calling the
nl_langinfo()
function.
This function takes
an
item
argument that is one of several constants
defined in the
/usr/include/langinfo.h
header file.
The
function returns a pointer to the string with the value for
item
in the current locale.
The following example is
a call to
nl_langinfo()
that extracts the string for formatting
date and time information.
This value is associated with the constant
D_T_FMT.
nl_langinfo(D_T_FMT);
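As a minimal sketch of how several items can be queried together, the following fragment prints a few langinfo values for the locale that is currently bound to the program. The helper name print_langinfo_items() is hypothetical; the item constants are the standard ones from langinfo.h. The sketch uses each returned string immediately, because a later call to nl_langinfo() may overwrite it.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: prints selected langinfo items for the locale
 * currently bound to the program. Returns the number of items printed. */
int print_langinfo_items(void)
{
    static const struct {
        nl_item item;
        const char *label;
    } items[] = {
        { CODESET,   "Codeset name" },
        { D_T_FMT,   "Date and time format" },
        { D_FMT,     "Date format" },
        { DAY_1,     "First day name" },
        { MON_1,     "First month name" },
        { RADIXCHAR, "Radix character" },
        { YESEXPR,   "Affirmative response pattern" },
    };
    int count = (int)(sizeof(items) / sizeof(items[0]));
    int i;

    for (i = 0; i < count; i++) {
        /* Use each result right away; the next call may overwrite it. */
        printf("%-30s %s\n", items[i].label, nl_langinfo(items[i].item));
    }
    return count;
}
```

A program would typically call setlocale(LC_ALL, "") before calling this helper so that the values reflect the user's locale rather than the default C locale.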
2.3.3 Generating and Interpreting Date and Time Strings That Observe Local Customs
Programs often generate
date and time strings.
Internationalized programs generate strings that observe
the local customs of the user.
You can meet this requirement by calling the
strftime()
or
wcsftime()
function.
Both functions
indirectly use the
langinfo
database.
In addition, the
wcsftime()
function converts date and time to wide-character format.
In the following example, the
strftime()
function generates a date string as defined by the
D_FMT
item in the
langinfo
database:
.
.
.
setlocale(LC_ALL, ""); [1]
.
.
.
clock = time((time_t*)NULL); [2]
tm = localtime(&clock); [3]
.
.
.
strftime(buf, size, "%x", tm); [4]
puts(buf); [5]
.
.
.
[1] Binds the program at run time to the locale set for the system or individual user.
[2] Calls the time() function to return the time value to the clock variable. The time value returned is relative to Coordinated Universal Time.
[3] Calls the localtime() function to convert the value contained in clock to a value that can be stored in a tm structure, whose members represent values for year, month, day, hour, minute, and so forth.
[4] Calls strftime() to generate a date string, formatted as defined in the user's locale, from the value contained in the tm structure. The buf argument is a pointer to a string variable in which the date string is returned. The size argument contains the maximum size of buf. The "%x" argument is a conversion specification, similar to those in the format strings used with the printf() and scanf() functions. It is replaced in the output string by the date representation appropriate for the locale.
[5] Calls the puts() function to copy the string contained in buf to the standard output stream (stdout) and to append a newline character.
Consider the following example of how to use
strftime()
and
nl_langinfo()
in combination to generate
a date and time string.
Assume that the preceding example's calls to the
setlocale(),
time(), and
localtime()
interfaces have been made in this example.
However, the following example
includes a call to
nl_langinfo()
that has replaced the
format string argument in the call to
strftime().
.
.
.
strftime(buf, size, nl_langinfo(D_T_FMT), tm); puts(buf);
.
.
.
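The two fragments above can be combined into a compilable sketch. The helper names format_local_time() and print_local_date_and_time() and the 256-byte buffer are assumptions made for illustration. The sketch also checks the return value of strftime(), which is 0 when the result does not fit in the buffer.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: formats "when" as local time by using the given
 * strftime() format. Returns the number of bytes stored in buf, or 0 if
 * the result did not fit or the time could not be converted. */
size_t format_local_time(char *buf, size_t size, const char *format,
                         time_t when)
{
    struct tm *tm = localtime(&when);

    if (tm == NULL)
        return 0;
    return strftime(buf, size, format, tm);
}

/* Hypothetical helper: prints the date ("%x"), then the date and time
 * (D_T_FMT), as defined by the current locale. Returns 0 on success
 * and -1 on failure. */
int print_local_date_and_time(time_t when)
{
    char buf[256];

    if (format_local_time(buf, sizeof(buf), "%x", when) == 0)
        return -1;
    puts(buf);

    if (format_local_time(buf, sizeof(buf), nl_langinfo(D_T_FMT), when) == 0)
        return -1;
    puts(buf);
    return 0;
}
```

After a call to setlocale(LC_ALL, ""), calling print_local_date_and_time(time(NULL)) prints both strings in the user's conventions.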
To convert a string to a date/time value (that is, the reverse
of the operation performed by
strftime()), you can use
the
strptime()
function.
The
strptime()
function supports a number of conversion specifiers that behave in a locale-dependent
manner.
2.3.4 Formatting Monetary Values
The
strfmon()
function formats monetary values according to information in the locale that
is bound to the program at run time.
For example:
strfmon(buf, size, "%n", value);
This statement formats the double-precision floating-point value contained
in the
value
variable.
The
"%n"
argument
is the format specification that is replaced by the format defined in the
run-time locale.
The results are returned to the
buf
array,
whose maximum length is contained in the
size
variable.
The
money
program demonstrates how the
strfmon()
function works.
When you install a Worldwide Language Support
subset, the source file for this sample program is installed in the
/usr/i18n/examples/money
directory.
2.3.5 Formatting Numeric Values in Program-Specific Ways
To perform your
own conversions of numeric quantities, monetary or otherwise, you can use
specific formatting details in the user's locale.
The
localeconv()
function, which has no arguments, returns a pointer to a structure that contains
all the number formatting details defined in the locale.
For
example:
struct lconv *app_conv;
.
.
.
app_conv = localeconv();
You can use the following features, which are contained in the
lconv
structure, in program-defined routines:
Radix character
Thousands separator character
Digit grouping size
International currency symbol
Local currency symbol
Radix character for monetary values
Thousands separator for monetary values
Digit grouping size for monetary values
Positive sign
Negative sign
Number of fractional digits to be displayed
Parenthesis symbols for negative monetary values
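The following sketch illustrates a program-defined routine that uses one of these features, the radix character, which is stored in the decimal_point member of the lconv structure. The helper name format_with_radix() and its fixed two-digit fraction are assumptions made for illustration.

```c
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: writes an amount given as whole units and
 * hundredths, separated by the current locale's radix character.
 * Assumes whole >= 0 and 0 <= hundredths <= 99.
 * Returns the value returned by snprintf(). */
int format_with_radix(char *buf, size_t size, long whole, int hundredths)
{
    const struct lconv *conv = localeconv();
    const char *radix = conv->decimal_point;

    return snprintf(buf, size, "%ld%s%02d", whole, radix, hundredths);
}
```

The other members, such as thousands_sep, grouping, currency_symbol, and mon_decimal_point, can be read from the same structure in the same way. The program should treat the structure as read-only.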
2.3.6 Using the langinfo Database for Other Tasks
Functions in addition to the ones
discussed so far use the
langinfo
database to determine
settings for specific items of cultural data.
For example, the
wscanf(),
wprintf(), and
wcstod()
functions determine the appropriate radix character from information in the
langinfo
database.
2.4 Handling Text Presentation and Input
As you create applications, you need to consider the user's native language in three particular areas:
The way program messages are defined and accessed (Section 2.4.1)
How the program presents output text (Section 2.4.2)
How the program processes input text (Section 2.4.3)
2.4.1 Creating and Using Messages
Programs need to communicate with users in their own language. This requirement places some constraints on the way program messages are defined and accessed. More specifically, messages are defined in a file that is independent of the program source code and are not compiled into object files. Because messages are in a separate file, they can be translated into different languages and stored in a form that is linked to the program at run time. Programs can then retrieve message text translations that are appropriate for the user's language.
The X/Open UNIX standard specifies the following messaging facilities:
A messaging system that contains a definition of message text source files
The
gencat
command to generate message
catalogs from these source files
A set of library functions to retrieve individual messages from one or more catalogs at run time
The following example demonstrates how an internationalized program retrieves a message from a catalog:
#include <stdio.h> [1]
#include <locale.h> [2]
#include <nl_types.h> [3]
#include "prog_msg.h" [4]
int main(void)
{
    nl_catd catd; [5]
    setlocale(LC_ALL, ""); [6]
    catd = catopen("prog.cat", NL_CAT_LOCALE); [7]
    puts(catgets(catd, SETN, HELLO_MSG, "Hello, world!")); [8]
    catclose(catd); [9]
    return 0;
}
.
.
.
[1] Includes the standard I/O header file, which declares the puts() function.
[2] Includes the /usr/include/locale.h header file, which declares the setlocale() function and associated constants and variables.
[3] Includes the /usr/include/nl_types.h header file, which declares the catopen(), catgets(), and catclose() functions.
[4] Includes the program-specific prog_msg.h header file, which sets constants to identify the message set (SETN) and specific messages (HELLO_MSG in the example) that are used by this program module. A message catalog can contain one or more message sets. Individual messages are ordered within each set.
[5] Declares a message catalog descriptor catd to be of type nl_catd. This descriptor is returned by the function that opens the catalog. The descriptor is also passed as an argument to the function that closes the catalog.
[6] Calls the setlocale() function to bind the program's locale categories to settings for the user's locale environment variables. The locale name set for the LC_MESSAGES category is the locale used by the catopen() and catgets() functions in this example. Because the system administrator or user typically sets only the LANG or LC_ALL environment variable to a particular locale name, this operation implicitly sets the LC_MESSAGES category as well.
[7] Calls the catopen() function to open the prog.cat message catalog for use by this program. The NL_CAT_LOCALE argument specifies that the program will use the locale name set for LC_MESSAGES. The catopen() function uses the value set for the NLSPATH environment variable to determine the location of the message catalog. The call returns the message catalog descriptor to the catd variable.
[8] Calls the puts() function to display the message. The argument to this call is the value returned by the catgets() function, which retrieves the appropriate text for the message with the HELLO_MSG identifier. This message is contained in the message set identified by the SETN constant. The final argument to catgets() is the default text to be used if the messaging call cannot retrieve the translated text from the catalog. Default text is usually in the English language.
[9] Calls the catclose() function to close the message catalog whose descriptor is contained in the catd variable.
See
Chapter 3
for information about creating
and using message catalogs.
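The preceding example does not check whether the catalog was opened. As a hedged sketch, the following helpers show one way to handle that case. The names open_catalog() and close_catalog() are hypothetical. The catopen() function returns (nl_catd)-1 on failure, and catgets() returns its default text argument when the message cannot be retrieved, so the program can continue with its built-in text.

```c
#include <nl_types.h>
#include <stdio.h>

/* Hypothetical helper: opens a message catalog and warns on stderr
 * if it cannot be opened. The caller can still pass the returned
 * descriptor to catgets(), which then returns the default text. */
nl_catd open_catalog(const char *name)
{
    nl_catd catd = catopen(name, NL_CAT_LOCALE);

    if (catd == (nl_catd)-1)
        fprintf(stderr,
                "warning: cannot open message catalog %s; "
                "using default text\n", name);
    return catd;
}

/* Hypothetical helper: closes the catalog only if it was opened. */
void close_catalog(nl_catd catd)
{
    if (catd != (nl_catd)-1)
        catclose(catd);
}
```

With these helpers, a call such as catgets(catd, SETN, HELLO_MSG, "Hello, world!") yields the translated text when the catalog is available and the default text otherwise.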
2.4.2 Formatting Output Text
Successful translation of messages into different languages depends not only on making messages independent of the program source code but also on careful construction of message strings within the program.
Consider the following example:
printf(catgets(catd, set_id, WRONG_OWNER_MSG,
"%s is owned by %s\n"),
folder_name, user_name);
The preceding statement uses a message catalog but assumes a particular
language construction (a noun followed by a verb in passive voice followed
by a noun).
Passive verb constructions are not part of all languages; therefore,
message translation might mean printing
user_name
before
folder_name.
In other words, the translator might need to change
the construction of the message so that the user sees the translated equivalent
of "John_Smith owns JULY_REVENUE" rather than "JULY_REVENUE
is owned by John_Smith."
To overcome the problems imposed by
fixed ordering of message elements, the
printf()
routine
format specifiers can apply format conversion to the
nth
argument in an argument list, and not just to the next unused argument.
To
apply the format conversion extension, replace the
%
conversion
character with the sequence
%digit$, where
digit
specifies the position
of the argument in the argument list.
The following example illustrates how
the programmer applies this feature to the format string
"%s is owned by %s\n":
printf(catgets(catd, set_id, WRONG_OWNER_MSG,
"%1$s is owned by %2$s\n"),
folder_name, user_name);
The construction of the string
"%1$s is owned
by %2$s", which is the default value for the
WRONG_OWNER_MSG
entry in the program's message file, can then be changed by the
translator to the non-English language equivalent of the
following:
WRONG_OWNER_MSG "%2$s owns %1$s\n"
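The effect of the reordered string can be tried without a message catalog. In the following sketch, the helper name format_owner_message() is hypothetical, and the format string is passed in directly where a real program would obtain it from catgets(). Numbered and unnumbered conversion specifications should not be mixed in one format string.

```c
#include <stdio.h>

/* Hypothetical helper: builds the ownership message from a format
 * string that uses numbered arguments (%1$s is the folder name and
 * %2$s is the user name). Returns the value returned by snprintf(). */
int format_owner_message(char *buf, size_t size, const char *format,
                         const char *folder_name, const char *user_name)
{
    return snprintf(buf, size, format, folder_name, user_name);
}
```

Called with "%1$s is owned by %2$s", the helper places the folder name first. Called with "%2$s owns %1$s", it places the user name first, although the argument list in the program is unchanged.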
2.4.3 Scanning Input Text
The string construction
issues that are discussed for output text in
Section 2.4.2
also apply to input text.
For example, different countries have different
conventions for the order in which users specify the elements of a date, or
differ in the characters that are input to delimit parts of monetary strings.
The
scanf()
family of functions supports extended format
conversion specifiers that allow for variation in the way that users enter
elements of a string.
Consider the following example:
.
.
.
int day;
int month;
int year;
.
.
.
scanf("%d/%d/%d", &month, &day, &year);
.
.
.
The format string in this statement is governed by the assumption that all users use a United States format (mm/dd/yyyy) to input dates. In an internationalized program, you use extended format specifiers to support requirements that language may impose on the order of string elements. For example:
.
.
.
scanf(catgets(catd, NL_SETD, DATE_STRING, "%1$d/%2$d/%3$d"), &month, &day, &year);
.
.
.
The default
"%1$d/%2$d/%3$d"
value
for the DATE_STRING message is still appropriate only for countries in which
users use the format mm/dd/yyyy to enter dates.
However, for countries in
which the order or formatting would be different, the translator can change
the entry in the program's message file.
Consider the following
languages:
British English (dd/mm/yyyy):
DATE_STRING "%2$d/%1$d/%3$d"
German (dd.mm.yyyy):
DATE_STRING "%2$d.%1$d.%3$d"
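The following sketch applies the three format strings shown above to text held in memory. The helper name parse_date_fields() is hypothetical, and it uses sscanf() in place of scanf() so that the input can be supplied as a string. A real program would obtain the format from catgets() rather than pass it in directly.

```c
#include <stdio.h>

/* Hypothetical helper: reads month, day, and year from text by using a
 * format with numbered arguments (%1$d is the month, %2$d is the day,
 * and %3$d is the year). Returns 0 if all three fields were read,
 * otherwise -1. */
int parse_date_fields(const char *text, const char *format,
                      int *month, int *day, int *year)
{
    return (sscanf(text, format, month, day, year) == 3) ? 0 : -1;
}
```

With the default format, the input 10/7/1986 stores 10 in month and 7 in day. With the German format "%2$d.%1$d.%3$d", the input 7.10.1986 stores the same values, because the numbered specifiers route each field to the matching argument.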
2.5 Binding a Locale to the Run-Time Environment
A correct, operational internationalized program must bind to
localized data that is appropriate for the user at run time.
The
setlocale()
function performs this task.
You can
call
setlocale()
to perform the following operations:
Bind to locale settings that are already in effect for the user's process
Bind to locale settings controlled by the program
Query current locale settings without changing them
The call takes two arguments: category and locale_name.
The category argument specifies whether you want to query, change, or use all or a specific section of a locale. Values for category and what they represent are as follows:
LC_ALL
This category argument specifies all sections of a locale (overrides specifications for specific sections).
LC_CTYPE
This category argument defines classes and character attributes used in case conversion and similar operations.
LC_COLLATE
This category argument specifies how to order characters and strings in sorting, or collation, operations.
LC_MESSAGES
This category argument specifies yes/no responses and program messages.
LC_MONETARY
This category argument specifies rules and special symbols for use in monetary values.
LC_NUMERIC
This category argument specifies rules and special symbols used for formatting numeric values.
LC_TIME
This category argument specifies names and abbreviations for days of the week, months of the year, and other strings and formatting conventions that govern expressions of date and time.
The locale_name argument is one of the following values:
An empty string ("") that binds the program
at run time to the locale name set for
category
by the system administrator or user
A locale name that changes the locale that may already be set for category
NULL, which queries the locale name currently set for category without changing it
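The three kinds of locale_name values can be sketched as follows. The helper names bind_locale() and current_locale_name() are hypothetical. The setlocale() function returns NULL when the requested locale is not available, in which case the program's locale is left unchanged.

```c
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: binds category to locale_name. Pass "" to use
 * the settings in the user's environment, or an explicit locale name.
 * Returns 0 on success, or -1 if the locale is not available. */
int bind_locale(int category, const char *locale_name)
{
    if (setlocale(category, locale_name) == NULL) {
        fprintf(stderr, "locale \"%s\" is not available\n", locale_name);
        return -1;
    }
    return 0;
}

/* Hypothetical helper: queries the current setting for category.
 * Passing NULL to setlocale() changes nothing. */
const char *current_locale_name(int category)
{
    return setlocale(category, NULL);
}
```

A program that needs to keep the queried name should copy it, because a later setlocale() call may overwrite the returned string.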
2.5.1 Binding to the Locale Set for the System or User
Typically,
the system administrator or user sets the
LANG
or
LC_ALL
environment variable to the name of a locale.
Setting
either of these variables applies the same locale name to all locale
categories.
Except for the case in which
LC_ALL
has been used
to set all locale categories to a single locale name, system administrators
or individual users can set locale category variables to different locale
names.
Usually, internationalized programs contain a
setlocale()
call that specifies the
LC_ALL
category and an empty locale name, which initializes all locale categories in the program to environment
variable settings already in effect for the user.
For example:
setlocale(LC_ALL, "");
A standard locale name consists of
language_TERRITORY.codeset@modifier, for example,
zh_CN.dechanzi@radical, where:
language
represents the human language
of the locale (zh is Chinese)
_TERRITORY
is the geographic country or
region of the locale (_CN is China, as opposed to TW for Taiwan or HK for
Hong Kong)
.codeset
is the coded character set used
by the locale (dechanzi)
@modifier
is additional information for
localization data of a locale (collation by radical)
Locales often have multiple variants.
These variants have the
same name as the base locale but include a file name suffix that begins with
the at sign (@).
Locale variants for support of codesets that are not native
to UNIX (such as UCS-4 and CP850) can be assigned to
LANG
or
LC_ALL.
However, locale variants that differ from the base
locale in only one locale category should be assigned only to the appropriate
locale category.
For example, a locale variant designed to support a specific
collation sequence, such as
@radical, would be assigned
to
LC_COLLATE.
A locale variant designed to support the
euro monetary sign (@euro) would be assigned to
LC_MONETARY.
Use the base locale name, not these variants, in assignments
to the
LANG
environment variable.
Furthermore, in cases where a base locale name is not being assigned
to all locale categories, avoid using the
LC_ALL
environment
variable, whose assigned value overrides settings for both
LANG
and the environment variables for specific locale categories.
Many locale-specific files reside in directories
whose names are constructed from the language, territory, and codeset portions
of a locale name.
Commands and other system applications insert the setting
of the
LANG
variable into search paths that contain
%L
as one of the directory nodes.
This makes it possible for software
programs to find the correct set of files, such as fonts, resource files,
user-defined character files, and translated reference pages, that should
be used with the current locale.
An
@
suffix related to
collation, if included in an assignment to the
LANG
variable,
may result in applications being unable to find certain locale-specific files.
2.5.2 Changing Locales During Program Execution
Some internationalized
programs may need to prompt the user for a locale name or change locales during
program execution.
The following example demonstrates how to call
setlocale()
when you want to explicitly initialize or reinitialize
all locale categories to the same locale name:
.
.
.
nl_catd catd; [1]
char buf[BUFSIZ]; [2]
.
.
.
setlocale(LC_ALL, ""); [3]
catd = catopen(CAT_NAME, NL_CAT_LOCALE); [4]
.
.
.
printf(catgets(catd, NL_SETD, LOCALE_PROMPT_MSG, "Enter locale name: ")); [5]
if (fgets(buf, sizeof(buf), stdin) != NULL) { [6]
    buf[strcspn(buf, "\n")] = '\0';
    setlocale(LC_ALL, buf); [7]
}
.
.
.
[1] Declares a catalog descriptor catd as type nl_catd.
[2] Declares the buf variable into which the locale name will later be stored. To make sure that the variable is large enough to accommodate locale names on different systems, you should set its maximum size to the BUFSIZ constant, which is defined by the system vendor in /usr/include/stdio.h.
[3] Calls setlocale() to initialize the program's locale settings to those in effect for the user who runs the program.
[4] Calls catopen() to open the message catalog that contains the program's messages. The function returns the catalog's descriptor to the catd variable. The CAT_NAME constant is defined in the program's own header file.
[5] Prompts the user for a new locale name. The NL_SETD constant specifies the default message set number in a message catalog and is defined in /usr/include/nl_types.h. The LOCALE_PROMPT_MSG identifier specifies the prompt string translation in the default message set.
[6] Calls the fgets() function to read the locale name typed by the user into the buf variable, and then removes the trailing newline character. Unlike gets(), the fgets() function limits the input to the size of buf. The strcspn() function is declared in /usr/include/string.h.
[7] Calls setlocale() with buf as the locale_name argument to reinitialize all portions of the locale.
Sometimes a program needs to vary the locale only for a particular
category of data.
For example, consider a program that processes different
country-specific files that contain monetary values.
Before processing data
in each file, the program might reinitialize a program variable to a new locale
name and then use that variable value to reset only the
LC_MONETARY
category of the locale.
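A minimal sketch of that approach follows. The helper name format_money_in_locale() and the 256-byte buffer for the saved locale name are assumptions made for illustration. The sketch changes only the LC_MONETARY category, formats one value with strfmon(), and then restores the previous setting.

```c
#include <locale.h>
#include <monetary.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: formats value by using the monetary conventions
 * of locale_name, then restores the previous LC_MONETARY setting.
 * Returns 0 on success and -1 on failure. */
int format_money_in_locale(const char *locale_name, double value,
                           char *buf, size_t size)
{
    char saved[256];
    const char *current = setlocale(LC_MONETARY, NULL);
    ssize_t length;

    /* Copy the current name; later setlocale() calls may overwrite it. */
    if (current == NULL || strlen(current) >= sizeof(saved))
        return -1;
    strcpy(saved, current);

    /* Change only the monetary category; other categories are untouched. */
    if (setlocale(LC_MONETARY, locale_name) == NULL)
        return -1;

    length = strfmon(buf, size, "%n", value);

    /* Restore the setting that was in effect before the call. */
    setlocale(LC_MONETARY, saved);
    return (length < 0) ? -1 : 0;
}
```

Because date, collation, and message settings stay bound to the user's locale, only the currency formatting follows the locale of the file being processed.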