Internationalization refers to the process of developing software programs without prior knowledge of the language, cultural data, or character-encoding schemes that the programs are expected to handle. In system terms, internationalization refers to the provision of interfaces that let programs produce varying output, depending on the specific environment in which they are run. The mnemonic I18N is frequently used as an abbreviation for internationalization.
This manual describes operating system interfaces and utilities that help you develop internationalized programs. These interfaces and utilities conform to specifications in the X/Open UNIX standard, which allows for implementation-defined behavior in certain areas. This manual identifies those software characteristics that are specific to the operating system.
The following sections provide an overview of the operating system interfaces and utilities used for international program development:
Language. Section 1.1 is a general description of language requirement implementation. Section 1.2 defines language in software application terms.
Cultural Data. Section 1.3 defines cultural data and Section 1.1.1 defines the implementation of cultural or local requirements in a computer system.
Character Sets. Section 1.4 defines character sets in terms of internationalization.
Language announcement is the mechanism by which language, cultural data, and codeset requirements are set either for the system as a whole, by an application, or by individual users. Language announcement is performed by setting a locale name in a set of reserved environment variables. System administrators can set the default values for these variables for different shell environments; see the Tru64 UNIX System Administration manual for information about setting locale defaults for shells. Users can also set locale variables on a per-process basis.
Typically, internationalized programs read locale variables at run time and use them to attach settings to locale categories in the programs' operational environment. However, programs can also set these categories internally when appropriate. Therefore, the binding to a particular locale need not be general for all parts of a program. Within one execution cycle, different parts of the program can request different localizations.
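The run-time binding described above can be sketched in C with the standard setlocale() function. This is a minimal illustration, not the only way to structure the calls: the helper names bind_announced_locale and rebind_category are hypothetical, and the fallback to the "C" locale is an assumption made for the example.

```c
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: bind every locale category to the values announced
 * in the environment. The empty string asks the library to read the locale
 * environment variables at run time. Assumption for this sketch: fall back
 * to the "C" locale if the announced locale is not available.
 * Returns the resulting locale name, or NULL if even "C" is rejected. */
const char *bind_announced_locale(void)
{
    const char *name = setlocale(LC_ALL, "");
    if (name == NULL)
        name = setlocale(LC_ALL, "C");
    return name;
}

/* Hypothetical helper: set a single category internally, leaving the
 * other categories as they are.
 * Returns 0 on success, -1 if the locale name is not accepted. */
int rebind_category(int category, const char *locale_name)
{
    return setlocale(category, locale_name) != NULL ? 0 : -1;
}
```

A program might call bind_announced_locale() once at startup and later call rebind_category(LC_NUMERIC, "C") around code that must read or write numbers in a fixed format.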
1.1.1 Localization
Localization refers to the process of implementing local requirements within a computer system. Some of these requirements are addressed by locales. Each locale is a set of data that supports a particular combination of native language, cultural data, and codeset. The type of information a locale can contain and the interfaces that use a locale are subject to standardization. However, where locales reside on the system and how they are named can vary from one vendor to another.
There is more to localization than providing locales. For example, the localization process means making sure that translations are available for software messages; appropriate fonts and measurement systems are supported and available for display and printing devices; and, in some cases, additional software is written to handle local requirements.
The mnemonic L10N is frequently used as an abbreviation for localization.
See Chapter 3 for information on creating and using localized data and message files in application programs. See Chapter 5 for information on localization and graphical user interfaces.
See Chapter 2 for information on the programming aspects of local implementation and Chapter 6 for a description of locales, the primary tool for localizing software.
1.2 Language
An internationalized program makes no assumptions about the language of character data (text) that the program is designed to handle.
Language has implications for processing text for such things as character handling and word ordering. The operating system provides interfaces that allow internationalized programs to manipulate text according to the language requirements of individual users.
Language differences require the separation of message text from program code. The operating system provides facilities that allow message text to be separated from the code, translated into different languages, and accessed by the program at run time. Chapter 3 explains how an internationalized program that uses the worldwide portability interfaces (WPI) generates and accesses messages.
An internationalized program that uses X and Motif interfaces can separate message text from program code in the following ways:
By defining menu items, titles, text fields, and messages in user interface language (UIL) files
By specifying titles and font lists in application resource files
By specifying help messages in files that the Help widget uses
For information about separating message text from program code for X and Motif interfaces, see the following manuals:
X Window System Toolkit
Common Desktop Environment: Internationalization Programmer's Guide
1.2.1 Character Classification
Character classification information describes the
characteristics associated with each valid character code; that is, whether
the code defines an alphabetic, uppercase, lowercase, punctuation, control,
space, or other kind of character.
Character classification functions and
internationalized regular expressions use this information to determine character
classes.
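As an illustration of how a program can rely on classification data instead of hard-coded character ranges, the following sketch counts alphabetic characters with the standard iswalpha() function from <wctype.h>. The function name count_alpha is hypothetical. The result depends on the LC_CTYPE category of the locale in effect when the function is called.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: count the characters in a wide-character string
 * that the current locale classifies as alphabetic. */
size_t count_alpha(const wchar_t *ws)
{
    size_t count = 0;

    for (; *ws != L'\0'; ws++) {
        if (iswalpha((wint_t)*ws))
            count++;
    }
    return count;
}
```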
1.2.2 Case Conversion
Case conversion refers to information that identifies the possible alternative case of each valid character code. Case conversion functions use this information to change characters from uppercase to lowercase or from lowercase to uppercase. In some languages, case is not a characteristic of all of the letters, or even of any characters.
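A minimal sketch of locale-aware case conversion follows. It uses the standard towupper() function, which returns its argument unchanged when the current locale defines no uppercase counterpart for a character. The function name to_upper_in_place is hypothetical.

```c
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: convert a wide-character string to uppercase in
 * place, using the case conversion data of the current locale. Characters
 * that have no uppercase form are left as they are. */
void to_upper_in_place(wchar_t *ws)
{
    for (; *ws != L'\0'; ws++)
        *ws = (wchar_t)towupper((wint_t)*ws);
}
```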
1.2.3 Message Catalogs
A message catalog is a file or storage
area that contains program messages, command prompts, and responses to prompts
for a particular language.
Motif applications also use resource files and
UIL files in addition to, or in place of, message catalogs for text and other
values that can vary from one locale to another.
Chapter 3 describes the messaging system.
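The following sketch shows one way a program might retrieve a message from a catalog with the X/Open catopen(), catgets(), and catclose() functions. The function name fetch_message and its parameters are hypothetical, as is any catalog name passed to it. The default text is compiled into the program only as a fallback for the case where no catalog or message is found.

```c
#include <nl_types.h>
#include <stdio.h>

/* Hypothetical helper: copy message msg_id from set set_id of the named
 * catalog into out. If the catalog or the message is missing, the
 * fallback text is copied instead.
 * Returns 1 if the text came from the catalog, 0 if the fallback was used. */
int fetch_message(const char *catalog, int set_id, int msg_id,
                  const char *fallback, char *out, size_t out_size)
{
    /* NL_CAT_LOCALE selects the catalog according to the locale's
     * message settings. */
    nl_catd catd = catopen(catalog, NL_CAT_LOCALE);
    const char *text = fallback;
    int found = 0;

    if (catd != (nl_catd)-1) {
        /* catgets() returns its last argument when the lookup fails. */
        text = catgets(catd, set_id, msg_id, fallback);
        found = (text != fallback);
    }

    /* Copy before closing: the returned pointer may refer to storage
     * owned by the open catalog. */
    snprintf(out, out_size, "%s", text);

    if (catd != (nl_catd)-1)
        catclose(catd);
    return found;
}
```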
1.3 Cultural Data
Cultural data refers to the conventions of a geopolitical area, called territory in this manual. Cultural data includes such things as date, time, and currency formats.
An internationalized program cannot assume in advance how cultural data formats are set; instead, it uses system facilities to determine the formats at run time. This capability is provided through a language information database (or langinfo database) that programs can query for the required formats of cultural data items.
1.3.1 Language Information
Language information refers to localization data that describes the format and setting of cultural data that can vary from one locale to another. The information stored in a langinfo database includes the appropriate formats and characters for date and time, currency, and numeric values.
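A program can query these items with the X/Open nl_langinfo() function. The sketch below is illustrative only: the function names format_local_date and radix_character are hypothetical, while D_FMT and RADIXCHAR are standard item names declared in <langinfo.h>.

```c
#include <langinfo.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: format a date with the current locale's date
 * format. D_FMT is returned as a strftime() pattern.
 * Returns the number of bytes written, or 0 if out is too small. */
size_t format_local_date(const struct tm *when, char *out, size_t out_size)
{
    return strftime(out, out_size, nl_langinfo(D_FMT), when);
}

/* Hypothetical helper: return the current locale's radix character
 * (decimal point) as a string. */
const char *radix_character(void)
{
    return nl_langinfo(RADIXCHAR);
}
```

Because both functions read the locale at the time of the call, the same binary produces different output under different locale settings without being recompiled.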
1.4 Character Sets
A character set is a set of alphabetic or other characters used to construct the words and other elementary units of a native language or computer language. A coded character set (or codeset) is a set of unambiguous rules that establishes a character set and the one-to-one relationship between each character of the set and its bit representation.
For a program to be able to handle text recorded in different codesets,
the program cannot make assumptions about the size or bit assignment of character
encodings.
In particular, the program cannot assume that any part of an area
used to store a character is available for other uses.
1.4.1 Collating Sequence
The collating sequence, or ordering of characters, may be implicit in underlying hardware but can be defined for software to conform to the way language is used in a particular territory. Many languages have complex rules for sorting. The following list describes some of these collating rules:
A single letter is not necessarily represented by a single character
In traditional Spanish, for example, the character combination ch sorts between the characters c and d.
A single character can be equivalent to a combined set of characters
For example, the ß character is equivalent to ss in standard and Swiss German and to sz in Austrian German.
Accented letters do not always follow unaccented letters
In many languages, this is true only if the words that contain those letters are otherwise identical. In other languages, a particular accented letter may be considered unique and sort after a letter that is different from the unaccented counterpart.
Characters can be sorted in multiple ways for the same language
The ideographic characters in Asian languages have sort orders based on pronunciation and on two visually recognized components (radicals, which are pictograms for elements of meaning, and the number of strokes).
Each locale contains information about collating
sequences that informs string comparison functions about the relative ordering
of characters defined in the associated codeset.
Internationalized regular
expressions also use the collating sequence for implementing character ranges,
collating symbols, and equivalence classes.
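The difference between byte-order comparison and locale-aware comparison can be sketched with the standard strcoll() function, which compares strings according to the LC_COLLATE category of the current locale. The names compare_collated and sort_collated are hypothetical. In the "C" locale, strcoll() orders strings as strcmp() does; in other locales the resulting order follows that locale's collating rules.

```c
#include <stdlib.h>
#include <string.h>

/* qsort() callback: compare two strings with the collating sequence of
 * the current locale rather than by raw byte values. */
static int compare_collated(const void *a, const void *b)
{
    const char *const *left = a;
    const char *const *right = b;

    return strcoll(*left, *right);
}

/* Hypothetical helper: sort an array of strings in locale order. */
void sort_collated(const char **words, size_t count)
{
    qsort(words, count, sizeof *words, compare_collated);
}
```

For large sorts, a program might instead transform each string once with strxfrm() and compare the transformed keys, which avoids repeating the collation work on every comparison.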
1.4.2 Characters and Strings
A character is a sequence of one or more bytes that represents a single graphic symbol or control code. Do not confuse the term character with the C programming language char data type, which represents an object large enough to store any member of the basic execution character set and which is usually mapped as an 8-bit value. Unlike the char data type in C, a character can be represented by a value that is one or more bytes. The expression multibyte character is synonymous with the term character; that is, both refer to character values of any length, including single-byte values.
A character string or string is a contiguous sequence of bytes terminated by and including the null byte. A string is an array of type char in the C programming language. The null byte is a value with all bits set to zero (0).
A wide character is an integral type that is large enough to hold any member of the extended execution character set. In program terms, a wide character is an object of type wchar_t, which is defined in the header files /usr/include/stddef.h (for conformance to the X/Open XSH specification) and /usr/include/stdlib.h (for conformance to the ANSI C standard). The locations of these header files are determined by standards organizations; however, the definitions themselves are implementation specific. For example, implementations that support only single-byte codesets (not the case for Tru64 UNIX) might define wchar_t as a byte value.
A wide-character string is a contiguous sequence of wide characters terminated by and including the null wide character. A wide-character string is an array of type wchar_t. The null wide character is a wchar_t value with all bits set to zero (0).
An empty string is a character string whose first element is
the null byte.
Similarly, an empty wide-character string is a wide-character
string whose first element is the null wide character.
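The distinction between bytes and characters can be illustrated with the standard mblen() function, which reports how many bytes the next multibyte character occupies in the current locale's codeset. The function name count_characters is hypothetical, and the sketch assumes the string is encoded in the codeset of the current locale.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: count the characters (not the bytes) in a
 * null-terminated multibyte string.
 * Returns (size_t)-1 if the string contains an invalid sequence. */
size_t count_characters(const char *s)
{
    size_t count = 0;

    (void)mblen(NULL, 0);               /* reset any shift state */

    while (*s != '\0') {
        /* A character may occupy up to MB_CUR_MAX bytes. */
        int len = mblen(s, MB_CUR_MAX);

        if (len < 0)
            return (size_t)-1;
        s += len;
        count++;
    }
    return count;
}
```

For a string in a single-byte codeset the result equals strlen(); for a multibyte codeset it can be smaller, which is why code that sizes display fields or truncates text should not treat one char as one character.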
1.4.3 Portable Character Set
The Portable Character Set (PCS) is supported in both compile-time (source) and run-time (executable) environments for all locales. The PCS contains the following characters:
The 26 uppercase letters of the English language alphabet:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
The 26 lowercase letters of the English language alphabet:
a b c d e f g h i j k l m n o p q r s t u v w x y z
The 10 decimal digits:
0 1 2 3 4 5 6 7 8 9
The following 32 graphic characters:
! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
The space character, plus control characters that represent the horizontal tab, vertical tab, and form feed
In addition to the preceding characters, the execution version of the PCS contains control characters that represent alert, backspace, carriage return, and newline
The PCS as defined by X/Open is similar to the basic source and basic execution character sets defined in ISO/IEC 9899:1990, except that the X/Open version also includes the dollar sign ($), commercial at sign (@), and grave accent (`).
Some locales (for example, ISO 646 variants) may make substitutions for one or more of the preceding characters. In such cases, the substituted character has the same syntactic meaning as the character it replaces in the PCS. An example of a character substitution might be the British pound sign (£).
The definition of a character set that is portable
across all codesets is particularly relevant to encoding formats that support
a limited set of native languages.
This is typical for most of the character
encoding formats developed for UNIX systems.
In other words, the codeset used
for a Chinese locale must include all the PCS characters in addition to characters
that are part of the Chinese language.
However, that same codeset probably
would not include characters needed to support Russian or Icelandic.
Similarly,
the codeset used for the Russian language probably would not include any Chinese
characters but must include all the PCS characters.
Therefore, no matter what
the locale setting, programs can assume that characters in the PCS are available.
1.4.4 Universal Character Set
The Universal Character Set (UCS), as specified by the Unicode and ISO/IEC 10646 standards, is supported on the operating system. The UCS specifies a repertoire of characters that can be used by all major languages and standardizes the rules used by languages to process characters. This character set supports the philosophy that applications should be able to manipulate characters in any language by using the same encoding format and set of rules. Thus, operating system support of UCS can include file code and internal process code conversion without having to include multiple algorithms.
The ISO/IEC 10646 standard supports two character sizes (16 bits and 32 bits) that require different parsing schemes for data input and output. UCS encoding that an implementation parses in 16-bit units (2 octets) is called UCS-2. UCS encoding that an implementation parses in 32-bit units (4 octets) is called UCS-4. UCS-4 expands the number of characters that can be supported and is more efficiently manipulated as internal process code on larger computer systems.
The standards define a number of universal transformation formats (UTFs) used by the byte-oriented protocols that handle file data. We recommend UTF-8 and UTF-32 for use on the operating system. The operating system supports the following universal transformation formats:
UTF-8 is the standard method for transforming UCS-4 process encoding into a sequence of 8-bit bytes and ensuring interchange transparency for characters in C0 code positions (0 to 31), the SPACE character (32), and the DEL character (127). The operating system provides codeset converters and locales for UTF-8.
UTF-16 uses the surrogate character extension technique defined in the Unicode standard and represents characters in 16-bit units. UTF-16 is a superset of UCS-2, but does not support full representation of the UCS-4 code space. UTF-16 does support all characters currently defined for languages covered by both standards.
Byte orientation in file code can differ and, depending on the platform on which the file was generated, can be little-endian (LE) or big-endian (BE). UTF-16 uses a byte order mark (BOM), which is not part of the file text data, to indicate byte orientation. The Unicode standard also defines UTF-16LE and UTF-16BE, which do not include a BOM, for little-endian and big-endian orientations, respectively. The operating system supports UTF-16, UTF-16LE, and UTF-16BE through codeset converters. Because UCS-2 is a subset of UTF-16, the operating system supports UCS-2 with UTF-16 codeset converters. The UCS-2 codeset converter name is recognized as an alias for UTF-16*, but with a restricted character repertoire.
The operating system normally expects UTF-16, rather than UTF-16LE or UTF-16BE. On input, the system looks for a BOM. If one is not found, the converter assumes UTF-16BE. On output, the system automatically inserts a BOM. If your application expects little-endian or big-endian byte orientation on input or output, you may have to explicitly state byte orientation. See iconv_intro(5).
UTF-32 allows character representation in 4-byte encoding units. UTF-32 is a restricted subset of UCS-4 in that the range of character values is restricted to U+0000 to U+10FFFF, the same range as UTF-16. Keep in mind that private-use ranges above U+10FFFF will be removed from future versions of ISO/IEC 10646 to promote interoperability between ISO/IEC 10646 and Unicode standard encoding formats.
UTF-32 uses a BOM to indicate little-endian or big-endian byte orientation. As with UTF-16, the Unicode standard defines UTF-32LE and UTF-32BE, which do not include BOMs. The operating system defaults for UTF-32 input and output are the same as those for UTF-16.
Use the UCS-4 codeset converter to process UTF-32.
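The transformation formats described above can be made concrete with a short sketch. The following C functions are illustrative only and are not the operating system's converters: all function and type names are hypothetical. As an assumption, the UTF-8 encoder accepts only the range U+0000 to U+10FFFF that the text gives for UTF-16 and UTF-32. The BOM check defaults to big-endian when no BOM is present, mirroring the input behavior described for UTF-16.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one character value as UTF-8. Returns the number of bytes
 * written (1 to 4), or 0 for a surrogate or out-of-range value. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* C0, SPACE, DEL: unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogate values */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Encode one character value as UTF-16 units. Values above U+FFFF become
 * a surrogate pair. Returns the number of 16-bit units, or 0 if invalid. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));      /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));    /* low surrogate */
    return 2;
}

typedef enum { UTF16_BE, UTF16_LE } utf16_order;

/* Inspect the start of UTF-16 file data. Sets *bom_len to 2 if a BOM is
 * present, otherwise 0. Without a BOM, big-endian is assumed. */
utf16_order utf16_detect_order(const unsigned char *data, size_t len,
                               size_t *bom_len)
{
    *bom_len = 0;
    if (len >= 2) {
        if (data[0] == 0xFE && data[1] == 0xFF) {
            *bom_len = 2;
            return UTF16_BE;
        }
        if (data[0] == 0xFF && data[1] == 0xFE) {
            *bom_len = 2;
            return UTF16_LE;
        }
    }
    return UTF16_BE;
}
```

For example, the character value U+20AC occupies three bytes in UTF-8 and one 16-bit unit in UTF-16, while a value above U+FFFF occupies four bytes in UTF-8 and a surrogate pair in UTF-16.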
The operating system supports UCS-4 with codeset converters and locales. The locales and some library functions allow applications to use UCS-4 as internal process code. The codeset converters allow file data to be converted to encoding formats supported by fonts and other software resident on the system.
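Applications reach codeset converters through the X/Open iconv_open(), iconv(), and iconv_close() functions. The following sketch converts UTF-8 file data to big-endian UCS-4. The function name utf8_to_ucs4be is hypothetical, and the converter names "UTF-8" and "UCS-4BE" are assumptions: the names a system accepts vary, so check iconv_intro(5) for the names available on a given installation.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert src_len bytes of UTF-8 text to big-endian
 * UCS-4. Returns the number of bytes written to dst, or (size_t)-1 if the
 * converter is unavailable, the input is invalid, or dst is too small. */
size_t utf8_to_ucs4be(const char *src, size_t src_len,
                      unsigned char *dst, size_t dst_size)
{
    /* Argument order is (to-codeset, from-codeset). */
    iconv_t cd = iconv_open("UCS-4BE", "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *in = (char *)src;       /* iconv() does not modify the input */
    char *out = (char *)dst;
    size_t in_left = src_len;
    size_t out_left = dst_size;

    size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    return dst_size - out_left;
}
```

Each converted character occupies four bytes in the output, so a caller can size the destination buffer at four times the number of input bytes as a safe upper bound.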
See Section 2.2 for more information about Unicode, locales, and related encoding formats.
See Chapter 4 for a description of the curses library and information on support of wide-character format and multibyte characters.