Glossary

ASCII

American Standard Code for Information Interchange. ASCII defines 128 characters, including control characters and graphic characters, represented by 7-bit binary values (see also ISO 646).

See also character set, coded character set

C locale: The standard, or default, language environment. This environment is always in effect for non-internationalized applications or when locales are not installed or are not active.

character

A sequence of one or more bytes that represents a single graphic symbol or control code. Unlike the char datatype in C, a character can be represented by a value that is one byte or multiple bytes. The expression "multibyte character" and the term "character" both refer to character values of any length, including single-byte values.

See also wide character

character set

A set of elements used for the organization, control, or representation of text. Each member of the set is a character.

See also ASCII, ISO 10646

character string

A contiguous sequence of bytes that is terminated by, and includes, the null byte. A string is an array of type char in the C programming language. The null byte has all bits set to zero (0).

An empty string is a character string whose first element is the null byte.

See also character, wide-character string

code page: See coded character set

coded character set: A set of unambiguous rules that establishes a character set and the one-to-one relationship between each character of the set and its bit representation. On UNIX systems, the more common term is codeset. On MS-DOS and Microsoft Windows systems, the more common term is code page.

codeset: See coded character set

collating sequence: The ordering rules applied to characters or groups of characters when they are sorted.

control character: A character, other than a graphic character, that affects the recording, processing, transmission, or interpretation of text.

cultural data: The conventions of a geographical area for such things as date, time, numeric, and currency values.

data: Information generated internally, information extracted from or written to files, and message text used for communication with the program's user.

dense code

The operating system supports two types of locales: dense code and Unicode. Dense code locales use a wide-character encoding that minimizes table size by assigning codepoints consecutively with no empty positions. Under dense code locales, a wchar_t value for one locale may not represent the same character in another locale and is therefore locale specific.

See also Unicode

euro: The currency adopted by European countries belonging to the Economic and Monetary Union (EMU), which replaced the local currencies of EMU member countries in 2002. The euro currency has a monetary sign that looks like an equal sign (=) superimposed on the capital letter C and is identified by the string EUR in international currency documents.

file code

The encoding format that applies to data outside the program.

Contrast with process code

graphic character: A character, other than a control character, that has a visual representation when handwritten, printed, or displayed. Also, ideograph.

I18N: See internationalization

internationalization

The process of developing programs without prior knowledge of the language, cultural data, or character-encoding schemes that the programs are expected to handle. An internationalized program uses a set of interfaces that allows the program to modify its behavior at run time for operation in a specific native language environment. I18N is frequently used as an abbreviation for internationalization.

See also locale, localization

ISO 10646

The ISO Universal Character Set (UCS). The first 65,536 code positions in this character set are called the Basic Multilingual Plane (BMP), in which each character is 16 bits in length. This form of ISO 10646 is also known as UCS-2. ISO 10646 also has a form called UCS-4, in which each character is 32 bits in length.

See also Unicode

ISO 646: ISO 7-bit codeset for information interchange. The reference version of ISO 646 contains 95 graphic characters, which are identical to the graphic characters defined in the ASCII codeset.

ISO 6937: ISO 7-bit or 8-bit codeset for text communication using public communication networks, private communication networks, or interchange media such as magnetic tapes and disks.

ISO8859-*: ISO 8-bit single-byte codesets. The asterisk (*) represents a number indicating the part of the associated ISO standard. For example, the ISO8859-1 codeset conforms to ISO 8859 Part 1, Latin Alphabet No. 1, which defines 191 graphic characters covering the requirements of most Western European languages.

L10N: See localization

langinfo database: A collection of information associated with the numeric, monetary, date and time, and messaging parts of a locale.

local language: See native language

locale

A set of data and rules that supports a particular combination of native (local) language, cultural data, and codeset. Also called language table.

See also coded character set, cultural data, langinfo database, localization

localization

The process of providing language- or culture-specific information for computer systems. Some of these requirements are addressed by locales. Other requirements are addressed by translations of program messages, provision of appropriate fonts for printers and display devices, and, in some cases, development of additional software. L10N is sometimes used as an abbreviation for localization.

See also internationalization, locale

message catalog: A file or storage area external to the program code that contains program messages, command prompts, and responses to prompts for a particular native language, territory, and codeset.

multibyte character: See character

native language: A computer user's spoken or written language, such as English, French, Japanese, or Thai.

process code

The encoding format used for manipulating data inside programs.

Contrast with file code

radix character: The character that separates the integer part of a number from the fractional part.

sign extension: The filling of the additional high-order bits with copies of the high (sign) bit when a value of a smaller signed data type is converted to a larger data type, for example, for comparison. If s[0] is a signed char that holds the value 0x8e, sign extension causes it to be treated as 0xffffff8e when it is converted to a 32-bit integer.

string: See character string

territory: The geographic area, usually defined by a political entity such as a nation or state, with particular cultural differences that must be accommodated in localization; for example, the currency or language of a territory.

UCS: See ISO 10646

Unicode

A standard that defines encodings for characters in most native languages. The Unicode standard specifies a Universal Character Set (UCS) and defines many thousands of characters, including a private use area for vendor-defined characters. "Unicode" originally referred to an encoding that was limited to the UCS-2 (16-bit) form defined by the ISO 10646 standard. The Unicode standard now encompasses UCS-4 (32-bit) encoding and defines a number of universal transformation formats (UTFs) for use with byte-oriented protocols that process data files.

See also coded character set, ISO 10646

Universal Character Set: See ISO 10646

wide character

An integral type that is large enough to hold any member of the extended execution character set. In program terms, a wide character is an object of type wchar_t, which is defined in the /usr/include/stddef.h (for conformance to X/Open specifications) and /usr/include/stdlib.h (for conformance to the ANSI C standard) header files. Although the file locations where the wchar_t data type is defined are determined by standards organizations, its definition is implementation specific. For example, implementations that support only single-byte codesets might define wchar_t as a byte value. On Tru64 UNIX systems, wchar_t is a 4-byte (32-bit) value.

The null wide character is a wchar_t value with all bits set to zero (0).

wide-character string

A contiguous sequence of wide characters that is terminated by and includes the null wide character. A wide-character string is an array of type wchar_t.

See also character string, wide character

worldwide portability interface (WPI)

Functions that allow programmers to create applications that support single-byte or multibyte codesets. WPI functions are similar to the C language interface, but WPI uses wide characters.