7    Programming Considerations for International Applications

This chapter describes a set of miscellaneous tasks you should consider as you develop international applications. These tasks include the following:

This chapter provides information on the tools needed to create international applications. The information in this chapter is also closely related to how international applications are used on the operating system. As you use the information in this chapter, you may also find it helpful to refer to the companion manual, Using International Software.

The following manuals provide language-specific information about customization and software use provided for Asian languages on the operating system:

These manuals are available from the programming bookshelf of the operating system documentation Web site ( http://www.tru64unix.compaq.com/docs/). Non-English language characters are embedded in the text of the Chinese, Japanese, and Korean Technical References. To view these characters with your Web browser, you must install the appropriate language support subsets on your system and set your locale to one that includes the local language characters used in the technical reference.

The operating system documentation also provides introductory reference pages on the topics of internationalization ( i18n_intro(5)) and localization ( l10n_intro(5)) as well as reference pages for all supported languages and codesets.

7.1    Choosing an Input Method

For some languages, such as Japanese, Chinese, and Korean, you use an input method to enter characters and phrases. An input method lets you enter a character by taking multiple editing actions on entry data. The data entered at intermediate stages of character entry is called the preediting string.

The X Input Method specification defines the following user input, or preediting, styles:

Input methods for different locales typically support more than one user input style but not all of them. If you work in languages that are supported by an input method, you can specify styles in priority order through the VendorShell resource XmNpreeditType. By default, this resource is defined to be the following:

OnTheSpot,OverTheSpot,OffTheSpot,Root

The priority order of these values means that the On-the-Spot input style is used if the input method supports it; otherwise, the Over-the-Spot style is used if the input method supports it, and so forth.
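The selection rule described above can be sketched in a few lines. This is only an illustration of the priority logic, not the Motif implementation; the function name choose_preedit_style and the idea of passing the supported styles as a set are assumptions made for the example.

```python
# Default value of XmNpreeditType, as given in this section.
DEFAULT_PREEDIT_TYPE = "OnTheSpot,OverTheSpot,OffTheSpot,Root"


def choose_preedit_style(supported, preedit_type=DEFAULT_PREEDIT_TYPE):
    """Return the first style in the priority list that is supported.

    `supported` is the set of style names offered by the input method
    (hypothetical input; a real application gets this from the server).
    Returns None if no listed style is supported.
    """
    for style in (s.strip() for s in preedit_type.split(",")):
        if style in supported:
            return style
    return None


# An input method that offers only Over-the-Spot and Root:
print(choose_preedit_style({"OverTheSpot", "Root"}))
```

With the default priority list, an input method that lacks On-the-Spot support falls through to the next style it does support.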

Use one of the following methods to supply the XmNpreeditType resource value to an application:

Input styles are supported by specialized input method servers. An input method server runs as an independent process and communicates with an application to handle input operations.

An input method server does not have to be running on the same system as the application but, with one exception, it must be running and made accessible to the application before the application starts.

The exception is as follows: if a Motif application that has been internationalized to support simplified Chinese contains an XmText or XmTextField widget with the Reconnectable resource set to True, the application can establish a connection with the input server even if the application starts before the server, and can reconnect when the server stops and restarts. See XmText(3X) and XmTextField(3X) for more information.

See the Using International Software manual for information on the input method servers available on the operating system and the input styles that each server supports.

7.2    Managing User-Defined Characters and Phrase Input

The national character sets for Japan, Taiwan, and China do not include some of the characters that can appear in Asian place names and personal names. Such characters are defined by users and reside in site-specific databases. These databases are called user-defined character (UDC) or character-attribute databases. When users define ideographic characters, they must also define font glyphs, collating files, and other support files for the characters.

Appendix B provides details on how you set up and use UDC databases.

In Korea, Taiwan, and China, users can enter a complete phrase by typing a keyword, abbreviation, or acronym. This capability is supported by a phrase database and an input mechanism. The Using International Software manual provides details on how the user sets up and uses a phrase database.

The /var/i18n/conf/cp_dirs configuration file allows software services or hardware to locate the databases that support UDC and phrase input.

Example 7-1 contains the default entries in the cp_dirs file. You can edit these entries to change the default locations.

Example 7-1:  Default cp_dirs File

#
# Attribute directory configuration file
#
#                       System location         User location
#                       ===============         =============
udc     -               /var/i18n/udc           ~/.udc
odl     -               /var/i18n/odl           ~/.odl
sim     -               /var/i18n/sim           ~/.sim
cdb     /usr/i18n/.cdb  /var/i18n/cdb           ~/.cdb
iks     -               /var/i18n/iks           ~/.iks
pre     -               /var/i18n/fonts         ~/.fonts
bdf     -               /var/i18n/fonts         ~/.fonts
pcf     -               /var/i18n/fonts         ~/.fonts

Each line in the cp_dirs file represents one entry and has the following format:

[service_name standard_path system_path user_path]

The service_name can be one of the following:

The cp_dirs file can contain only one entry for each service named. Remaining fields in the entry line consist of the following:

The preceding locations are specified as one of the following:

Comment lines in the cp_dirs file begin with the number sign (#).
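The entry format described above is simple enough to read with a short script. The following is a minimal sketch, not a tool supplied with the operating system. The names CpDirsEntry and parse_cp_dirs are hypothetical, and treating a hyphen (-) as "no location specified" is an assumption inferred from Example 7-1.

```python
from typing import NamedTuple, Optional


class CpDirsEntry(NamedTuple):
    service_name: str
    standard_path: Optional[str]
    system_path: Optional[str]
    user_path: Optional[str]


def parse_cp_dirs(text):
    """Parse cp_dirs content into a dict keyed by service name."""
    entries = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Skip blank lines and comment lines, which begin with '#'.
        if not stripped or stripped.startswith("#"):
            continue
        fields = stripped.split()
        if len(fields) != 4:
            raise ValueError(f"line {lineno}: expected 4 fields, got {len(fields)}")
        name, *paths = fields
        # The file can contain only one entry for each service.
        if name in entries:
            raise ValueError(f"line {lineno}: duplicate entry for {name!r}")
        # Assumption: '-' means that no location is specified.
        entries[name] = CpDirsEntry(name, *(None if p == "-" else p for p in paths))
    return entries


sample = """\
# Attribute directory configuration file
udc     -               /var/i18n/udc           ~/.udc
cdb     /usr/i18n/.cdb  /var/i18n/cdb           ~/.cdb
"""
for entry in parse_cp_dirs(sample).values():
    print(entry)
```

The sketch leaves the tilde (~) in user locations unexpanded; a caller could apply os.path.expanduser() when resolving a path for a particular user.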

7.3    Assigning a Sort Order with a Locale Specification

The sort command sorts characters according to the collation sequence defined for the current locale. A particular locale can apply only one set of collation rules to its associated character set. However, multiple locale names can exist for the same combination of language, territory, and character set. These variations offer users the choice of more than one collating sequence.

When more than one locale is available for a given combination of language, territory, and codeset, some of the locale names include a suffix with the format @variant. To avoid problems with pathnames constructed using the %L specifier, you should assign a locale name with a suffix that is category specific only to the appropriate locale category variable (or variables). In the following example, the locale assigned to LC_COLLATE differs from the locale assigned to LANG only with respect to collating sequence:

% setenv LANG zh_TW.eucTW
% setenv LC_COLLATE zh_TW.eucTW@radical
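The effect of the preceding assignments follows from the standard precedence of locale variables: LC_ALL overrides the category variable, which overrides LANG. The sketch below models that lookup for illustration only; the function name effective_locale is hypothetical, and the fallback to the "C" locale reflects the usual POSIX default.

```python
def effective_locale(category, env):
    """Return the locale name that applies to one locale category.

    Precedence: LC_ALL, then the category variable (for example,
    LC_COLLATE), then LANG. Empty values are treated as unset.
    """
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"


env = {"LANG": "zh_TW.eucTW", "LC_COLLATE": "zh_TW.eucTW@radical"}
print(effective_locale("LC_COLLATE", env))   # collation uses the @radical variant
print(effective_locale("LC_MESSAGES", env))  # other categories still follow LANG
```

Because only LC_COLLATE carries the @radical suffix, categories other than collation continue to resolve to the unsuffixed locale name.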

Supporting different collation orders through one or more locales is adequate for most languages. However, collation orders for Asian languages require additional support for the following reasons:

For the preceding reasons, the asort command was developed and is available when you install language variant subsets that support Asian languages. The asort command uses, by default, the collating order defined for the LC_COLLATE variable and supports all the options supported by the sort command. In addition, the asort command includes the following options:

See asort(1) for more information about using this command.

7.4    Processing Non-English Language Reference Pages

Programmers who supply software applications for UNIX systems frequently supply online reference pages (manpages) to document the application and its components. UNIX text-processing commands and utilities must be able to process translated versions of these reference pages for applications sold to the international market. The operating system includes enhanced versions of the nroff, tbl, and man commands to support this requirement.

7.4.1    The nroff Command

The nroff command includes the following functions to support locales:

When formatting reference pages that contain ideographic characters, the nroff command treats each character as a single word. A string of ideographic characters, including 2-byte letters and punctuation characters, can be wrapped to the next line subject to the following constraints:

The standard no-first, no-last character lists are defined in nroff catalog files. For lists of these characters, see the following language-specific manuals:

These manuals are available from the programming bookshelf of the operating system documentation Web site ( http://www.tru64unix.compaq.com/docs/).

The no-first and no-last constraints exist to prevent nroff from placing a punctuation mark or right parenthesis at the beginning of a text line or placing a left parenthesis at the end of a text line. You can turn the standard constraints on and off in source files with the .ki and .ko commands, respectively.
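To make the two constraints concrete, the following sketch wraps a string of ideographic characters so that no line begins with a no-first character and no line ends with a no-last character. It is not the algorithm that nroff uses; the function name wrap_ideographic, the greedy strategy, and the sample character lists are all assumptions chosen for illustration.

```python
def wrap_ideographic(text, width, no_first, no_last):
    """Split text into lines of at most `width` characters.

    A break is moved to the left while the next line would start with a
    no-first character or the current line would end with a no-last
    character. If no legal break exists, the line is cut at full width.
    """
    lines = []
    start = 0
    while len(text) - start > width:
        end = start + width  # tentative break between text[end-1] and text[end]
        while end > start + 1 and (text[end] in no_first or text[end - 1] in no_last):
            end -= 1
        if text[end] in no_first or text[end - 1] in no_last:
            end = start + width  # constraints cannot be met; break anyway
        lines.append(text[start:end])
        start = end
    lines.append(text[start:])
    return lines


# Hypothetical private lists: closing marks are no-first, opening marks are no-last.
for line in wrap_ideographic("今日は「晴れ」です。", 4, no_first="」。", no_last="「"):
    print(line)
```

In this example the opening bracket is pushed to the start of the second line rather than left at the end of the first.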

You can also define a private set of no-first and no-last characters with the following command:

.kl 'no-first-list'no-last-list'

The parameters no-first-list and no-last-list are strings of characters that you include in the no-first and no-last categories. You cancel a private no-first and no-last list by entering a .kl command with null strings as the parameters. For example:

.kl '''

Note

The characters specified in the .kl command override, rather than supplement, the characters in the standard set of no-first and no-last characters. Therefore, you cannot use the standard set of no-first and no-last characters together with a private set.

Using the command .kl ''' restores use of the standard set of no-first and no-last characters for the current locale.

The nroff command can format text so that it is justified or not justified to the right margin. When text is justified to the right margin, nroff inserts spaces between words in the line. Ideographic characters, although treated as words in most stages of the formatting process, differ in terms of whether they can be delimited by spaces. The characters that can be preceded by a space, followed by a space, or both are listed in the language-specific user manuals that are available online when you install language variant subsets of the operating system. When right-justifying text, the nroff command inserts spaces only at the following places:

In other cases, no space is inserted between consecutive ideographic characters. Therefore, if a text line contains only ideographic characters, it may not be justified to the right margin.

7.4.2    The tbl Command

The tbl command preprocesses table formatting commands within blocks delimited by the .TS and .TE macros. The tbl command handles multibyte characters that can occur in text of languages other than English.

The tbl command is frequently used with the neqn equation formatting preprocessor to filter input passed to the nroff command. In such cases, specify tbl first to minimize the volume of data passed through the pipes. For example:

% cd /usr/share/ja_JP.deckanji/man/man1
% tbl od.1 | neqn | nroff -Tlpr -man -h | \
lpr -Pmyprinter

When printing Asian language text, you must use printer hardware that supports the language.

7.4.3    The man Command

The man command can handle multibyte characters in reference page files. By default, the man command automatically searches for reference pages in the /usr/share/locale_name/man directory before searching the /usr/share/man and /usr/local/man directories. Therefore, if the LANG environment variable is set to an installed locale and if reference page translations are available for that locale, the man command automatically displays reference pages in the appropriate language.
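The default search order can be expressed as a short function. This sketch only builds the list of directories in the order described above; the name man_search_dirs is hypothetical, and skipping the locale directory for the C and POSIX locales is an assumption. The sketch does not check whether the locale or its translations are installed.

```python
def man_search_dirs(lang=None):
    """Return the default reference page directories in search order."""
    dirs = []
    # Assumption: the C and POSIX locales have no locale-specific directory.
    if lang and lang not in ("C", "POSIX"):
        dirs.append(f"/usr/share/{lang}/man")
    dirs += ["/usr/share/man", "/usr/local/man"]
    return dirs


print(man_search_dirs("ja_JP.deckanji"))
```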

In addition, the man command automatically applies codeset conversion (assuming the availability of appropriate converters) when reference page translations for a particular language are encoded in a codeset that does not match the codeset of the user's locale. See man(1) for information about redefining the man command search path and for more details about codeset conversion.

7.5    Converting Data Files from One Codeset to Another

Each locale is based on a specific codeset. Therefore, when an application uses a file whose data is coded in one codeset and runs in a locale based on another codeset, character interpretation may be meaningless. For example, assume that a fictional language includes a character named "quo," which is encoded as \031 in one codeset and \042 in another codeset. If the "quo" character is stored in a data file as \031, the application that reads data from that file should be running in the locale based on the same codeset. Otherwise, \031 identifies a character other than "quo."
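A real counterpart to the fictional "quo" character is the byte value 0xA4, which two closely related codesets assign to different characters. The following sketch uses Python's built-in codecs purely as an illustration; it does not involve the operating system's locales.

```python
raw = bytes([0xA4])  # one stored byte, as it might appear in a data file

as_latin1 = raw.decode("iso8859-1")   # currency sign (U+00A4) in ISO 8859-1
as_latin9 = raw.decode("iso8859-15")  # euro sign (U+20AC) in ISO 8859-15

print(repr(as_latin1), repr(as_latin9))
```

The stored byte is identical in both cases; only the codeset assumed by the reader changes which character it identifies.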

Users, the applications they run, or both may need to set the process environment to a particular locale and use a data file created with a codeset different from the one on which the locale is based. The data file in question might be appropriate for a given language and in a codeset different from the user's locale for one of the following reasons:

You can convert a data file from one codeset to another by using the iconv command or the iconv_open(), iconv(), and iconv_close() functions. For example, the following command reads data in the accounts_local file, which is encoded in the SJIS codeset; converts the data to the eucJP codeset; and appends the results to the accounts_central file:

% iconv -f SJIS -t eucJP accounts_local \
>> accounts_central

Many commands and utilities, such as the man command and internationalized print filters, use the iconv() functions and associated converters to perform codeset conversion on the user's behalf.
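For comparison, the same read, convert, and append sequence can be sketched in a scripting language. The example below uses Python's built-in codecs instead of the iconv() functions, so it is a substitute technique and not the interface described in this section. The function names are hypothetical, and the Python codec names shift_jis and euc_jp are assumed to correspond to the SJIS and eucJP codesets; vendor-specific codeset variants may differ in detail.

```python
def convert_codeset(data, from_codec, to_codec):
    """Decode bytes from one codeset and re-encode them in another."""
    return data.decode(from_codec).encode(to_codec)


def append_converted(src_path, dst_path, from_codec="shift_jis", to_codec="euc_jp"):
    """Convert the contents of src_path and append them to dst_path."""
    with open(src_path, "rb") as src:
        converted = convert_codeset(src.read(), from_codec, to_codec)
    # Mode "ab" appends, like the >> redirection in the command example.
    with open(dst_path, "ab") as dst:
        dst.write(converted)


sjis_bytes = "日本語".encode("shift_jis")
euc_bytes = convert_codeset(sjis_bytes, "shift_jis", "euc_jp")
print(euc_bytes.decode("euc_jp"))
```

A call such as append_converted("accounts_local", "accounts_central") would mirror the command shown earlier, under the codec-name assumption stated above.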

The iconv command and associated functions can use either an algorithmic converter or a table converter to convert data. Algorithmic converters, if installed on your system, reside in the /usr/lib/nls/loc/iconv directory; this directory is the one searched first for a converter. This directory also contains an alias file (iconv.alias) that maps different name strings for the same converter to the converter as named on the system. Table converters, if installed on your system, reside in the /usr/lib/nls/loc/iconvTable directory. The value of the LOCPATH variable, if defined, overrides the command's default search path.

The iconv command assumes that a converter name uses the following format:

from-codeset_to-codeset

For the preceding example, the iconv command would search for and use the /usr/lib/nls/loc/iconv/SJIS_eucJP converter.
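The naming convention and default search order can be summarized in a small sketch. The function names are hypothetical. The sketch assumes that table converters use the same file naming convention as algorithmic converters, and it ignores both the iconv.alias file and the LOCPATH variable.

```python
ALGORITHMIC_DIR = "/usr/lib/nls/loc/iconv"   # searched first
TABLE_DIR = "/usr/lib/nls/loc/iconvTable"


def converter_name(from_codeset, to_codeset):
    """Build a converter name in the from-codeset_to-codeset format."""
    return f"{from_codeset}_{to_codeset}"


def converter_candidates(from_codeset, to_codeset):
    """List candidate converter pathnames in default search order."""
    name = converter_name(from_codeset, to_codeset)
    # Assumption: the same file name is used in both directories.
    return [f"{ALGORITHMIC_DIR}/{name}", f"{TABLE_DIR}/{name}"]


for path in converter_candidates("SJIS", "eucJP"):
    print(path)
```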

Also consider operating system support for codeset conversion of the Hong Kong Supplementary Character Set (HKSCS). HKSCS is not a locale or character set name, but is used to provide a common language interface for electronic communication and data exchange conducted in Chinese. The characters in HKSCS are only for computer use. On Tru64 UNIX, HKSCS is used as the name for extended Big-5 encoding that contains HKSCS characters, and support is limited to code conversion between HKSCS and Unicode. Using the iconv command, codeset conversion with HKSCS would be specified as one of the following:

See HKSCS(5) for more information on the Hong Kong Supplementary Character Set.

Table 7-1 specifies the codeset conversions that the operating system supports for English language data. Tables with codeset conversions supported for their respective Asian languages are described in the following manuals:

For detailed information about the iconv command, see iconv(1) and iconv_intro(5). For information on functions that programs can use to perform codeset conversion, see iconv_open(3), iconv(3), and iconv_close(3). You can find a list of all the codeset converters available for a particular language in the reference page for that language.

Table 7-1:  Supported Codeset Conversions for English

Codeset         ASCII-GR    ISO8859-1    ISO8859-1-GL    ISO8859-1-GR
ASCII-GR        -           Yes          No              No
ISO8859-1       Yes         -            Yes             Yes
ISO8859-1-GL    No          Yes          -               No
ISO8859-1-GR    No          Yes          No              -

7.6    Using Font Renderers in Chinese and Korean PostScript Support

This section describes the use of font renderers in the creation of Motif applications that support PostScript fonts in Chinese and Korean. See the Using International Software manual for information on tuning cache size for ideographic characters and customizing windows for local languages.

7.6.1    Using Font Renderers for Multibyte PostScript Fonts

The operating system includes font renderers that allow any X application to use the PostScript fonts available for the Chinese and Korean languages. The system administrator can set up font renderers for the following kinds of fonts for use through the X Server or the font server:

By installing the IOSWWXFR** subset, you automatically enable font rendering for the PostScript outline fonts.

7.6.1.1    Setting Up the Font Renderer for Double-Byte PostScript Fonts

You can set up the font renderer for Chinese and Korean PostScript fonts for use either through the X server or the font server by editing the appropriate configuration file.

The renderer for Asian Double-Byte PostScript fonts uses its own configuration file that specifies the following information:

The default pathname for this configuration file is /var/X11/renderer/DECpscf_config; however, you can change this path by setting the DECPSCF_CONFIG_PATH environment variable.

7.6.1.2    Setting Up the Font Renderer for UDC Fonts

The UDC font renderer accesses the UDC database directly to obtain font glyphs. Therefore, X applications that use this renderer do not need to use .pcf files generated by the cgen utility.

You can set up the UDC font renderer for use either through the X server or the font server as follows:

7.6.1.3    Using the Font Renderer for TrueType Fonts

The operating system includes a font renderer (/usr/shlib/X11/libfr_TrueType.so) that enables the use of TrueType fonts. Currently, the operating system includes TrueType fonts only for simplified Chinese. However, you can configure the font renderer to use third-party TrueType fonts for additional languages if these are required by applications used at your site. See TrueType(5X) for more information.