This chapter describes a set of miscellaneous tasks you should consider as you develop international applications. These tasks include the following:
Choosing an input method and input styles (Section 7.1)
Managing user-defined character databases (Section 7.2)
Assigning a sort order with a locale specification (Section 7.3)
Processing non-English language reference pages (Section 7.4)
Converting data files from one codeset to another (Section 7.5)
Using font renderers in Chinese and Korean PostScript support (Section 7.6)
This chapter provides information on the tools needed to create international applications. The information in this chapter is also closely related to how international applications are used on the operating system. As you use the information in this chapter, you may also find it helpful to refer to the companion manual, Using International Software.
The following manuals provide language-specific information about the customization and use of software provided for Asian languages on the operating system:
Technical Reference for Using Chinese Features
Technical Reference for Using Japanese Features
Technical Reference for Using Korean Features
Technical Reference for Using Thai Features
These manuals are available from the programming bookshelf of the operating system documentation Web site (http://www.tru64unix.compaq.com/docs/). Non-English language characters are embedded in the text of the Chinese, Japanese, and Korean Technical References. To view these characters with your Web browser, you must install the appropriate language support subsets on your system and set your locale to one that includes the local language characters used in the technical reference.
The operating system documentation also provides introductory reference pages on the topics of internationalization and localization; see i18n_intro(5) and l10n_intro(5).
7.1 Choosing an Input Method
For some languages, such as Japanese, Chinese, and Korean, you use an input method to enter characters and phrases. An input method lets you enter a character by taking multiple editing actions on entry data. The data entered at intermediate stages of character entry is called the preediting string.
The X Input Method specification defines the following user input, or preediting, styles:
On-the-Spot
Data being edited is displayed directly in the application window. Application data is moved to allow the preediting string to display at the point of character insertion.
Over-the-Spot
The preediting string is displayed in a window that is positioned over the point of insertion.
Off-the-Spot
The preediting string is displayed in a window that is within the application window but not over the point of insertion. Often, the window for the preediting string appears at the bottom of the application window. In this case, the preediting window may block the last line of text from view in the application window. You can resize the application window to make this last line visible.
Root Window
The preediting string is displayed in a child window of the application root window.
Input methods for different locales typically support more than one user input style, but not all of them. If you work in languages that are supported by an input method, you can specify styles in priority order through the VendorShell resource XmNpreeditType. By default, this resource is defined to be the following:
OnTheSpot,OverTheSpot,OffTheSpot,Root
The priority order of these values means that the On-the-Spot input style is used if the input method supports it; otherwise, the Over-the-Spot style is used if the input method supports it, and so forth.
Use one of the following methods to supply the XmNpreeditType resource value to an application:
In CDE, use the Input Methods application. See the CDE Companion manual for information on using this application.
In an application-specific resource file.
On the command line that invokes an application.
For example:
% app-name -xrm '*preeditType: offthespot,onthespot' &
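You can also supply a default value from within an application. The following minimal sketch, which is not taken from this manual, shows one way to do this through Motif fallback resources; the priority string, application class, and widget names are illustrative, and any value the user supplies in a resource file or on the command line still takes precedence:

/* Minimal Motif program that supplies a default preedit style priority
 * through fallback resources. */
#include <Xm/Xm.h>
#include <Xm/Text.h>

static String fallback_resources[] = {
    "*preeditType: OverTheSpot,OffTheSpot,Root",   /* assumed priority order */
    NULL
};

int main(int argc, char **argv)
{
    XtAppContext app;
    Widget toplevel, text;

    /* Establish locale and input method support before creating widgets. */
    XtSetLanguageProc(NULL, NULL, NULL);

    toplevel = XtVaAppInitialize(&app, "Example", NULL, 0,
                                 &argc, argv, fallback_resources, NULL);

    text = XmCreateText(toplevel, "text", NULL, 0);
    XtManageChild(text);

    XtRealizeWidget(toplevel);
    XtAppMainLoop(app);
    return 0;
}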
Input styles are supported by specialized input method servers. An input method server runs as an independent process and communicates with an application to handle input operations.
An input method server does not have to be running on the same system as the application but, with one exception, it must be running and made accessible to the application before the application starts.
If a Motif application that has been internationalized to support simplified Chinese contains an XmText or XmTextField widget with the Reconnectable resource set to True, the application can establish a connection with the input server either when the application starts before the server or when the server stops and restarts. See XmText(3X) and XmTextField(3X) for more information.
See the Using International Software manual for information on the input method servers available on the operating system and the input styles that each server supports.
7.2 Managing User-Defined Characters and Phrase Input
The national character sets for Japan, Taiwan, and China do not include some of the characters that can appear in Asian place names and personal names. Such characters are defined by users and reside in site-specific databases. These databases are called user-defined character (UDC) or character-attribute databases. When users define ideographic characters, they must also define font glyphs, collating files, and other support files for the characters.
Appendix B provides details on how you set up and use UDC databases.
In Korea, Taiwan, and China, users can enter a complete phrase by typing a keyword, abbreviation, or acronym. This capability is supported by a phrase database and an input mechanism. The Using International Software manual provides details on how the user sets up and uses a phrase database.
The /var/i18n/conf/cp_dirs configuration file allows software services or hardware to locate the databases that support UDC and phrase input. Example 7-1 contains the default entries in the cp_dirs file. You can edit these entries to change the default locations.
Example 7-1: Default cp_dirs File
#
# Attribute directory configuration file
#
#                        System location     User location
#                        ===============     =============
udc   -                  /var/i18n/udc       ~/.udc
odl   -                  /var/i18n/odl       ~/.odl
sim   -                  /var/i18n/sim       ~/.sim
cdb   /usr/i18n/.cdb     /var/i18n/cdb       ~/.cdb
iks   -                  /var/i18n/iks       ~/.iks
pre   -                  /var/i18n/fonts     ~/.fonts
bdf   -                  /var/i18n/fonts     ~/.fonts
pcf   -                  /var/i18n/fonts     ~/.fonts
Each line in the cp_dirs file represents one entry and has the following format:
service_name standard_path system_path user_path
The service_name can be one of the following:
cdb (for collating value databases used with the asort command)
odl (for databases of fonts and input key sequences that the SoftODL service uses)
pcf (for font files in Portable Compiled Format)
These files, depending on their font resolution, reside in either the 75dpi or 100dpi subdirectory.
pre (for font files in preload format created by the cgen utility)
These are raw font files used to preload multibyte character terminals.
The cp_dirs file can contain only one entry for each service named. The remaining fields in the entry line consist of the following:
standard_path specifies the location of the collating values database for the standard character sets (applies only to the cdb entry)
system_path specifies the location of systemwide databases
user_path specifies the location of users' private databases
The preceding locations are specified as one of the following:
An absolute pathname, starting with a slash (/)
A pathname, starting with tilde slash (~/), that is relative to a user's home directory
A minus sign or hyphen (-) to indicate that the entry is not used
For example, you can specify - to be user_path for all services related to user-defined characters if you want these characters supported only through systemwide databases.
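For instance, a cdb entry edited in this way might look like the following (a hypothetical edit of the default entry shown in Example 7-1):

cdb   /usr/i18n/.cdb     /var/i18n/cdb       -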
Comment lines in the cp_dirs file begin with the number sign (#).
7.3 Assigning a Sort Order with a Locale Specification
The sort command sorts characters according to the collation sequence defined for the current locale. A particular locale can apply one set of collation rules to the associated character set. Multiple locale names do exist, however, for the same combination of language, territory, and character set. These variations offer users the choice of more than one collating sequence. When more than one locale is available for a given combination of language, territory, and codeset, some of the locale names include a suffix with the format @variant.
To avoid problems with pathnames constructed using the %L specifier, you should assign a locale name with a variant suffix only to the appropriate locale category variable (or variables). In the following example, the locale assigned to LC_COLLATE differs from the locale assigned to LANG only with respect to collating sequence:
% setenv LANG zh_TW.eucTW
% setenv LC_COLLATE zh_TW.eucTW@radical
Supporting different collation orders through one or more locales is adequate for most languages. However, collation orders for Asian languages require additional support for the following reasons:
Asian languages include UDCs, which are not specified in a locale. These characters can be defined with a collation weight. In this case, the collation weight needs to be applied when the UDCs are encountered in the strings being sorted.
Ideographic characters can be sorted on more than one dimension (radical, stroke, phonetic, and internal code). Some users need to combine these dimensions during sort operations. In one operation the user may need to sort characters first by radical and then according to the number of strokes. For another operation, the user may need to put characters first in phonetic order, then according to the number of strokes, and so on. Sorting by combinations of dimensions requires breadth-first sorting, rather than the depth-first sorting implemented through locales.
For the preceding reasons, the asort command was developed and is available when you install language variant subsets that support Asian languages. The asort command uses, by default, the collating order defined for the LC_COLLATE variable and supports all the options supported by the sort command. In addition, the asort command includes the following options:
-C
This option indicates that the sort operation should use special system sort tables, along with sort tables produced by the cgen utility, to support UDCs. This option overrides the sort sequence defined in the locale specified by the LC_COLLATE variable.
-v
This option, which you can use only with the -C option, implements breadth-first sorting.
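For example, a command line like the following (a sketch; the file names are hypothetical) sorts UDCs by using the system and cgen sort tables and applies breadth-first sorting:

% asort -C -v names.txt > names.sorted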
See asort(1) for more information.
7.4 Processing Non-English Language Reference Pages
Programmers who supply software applications for UNIX systems frequently supply online reference pages (manpages) to document the application and its components. UNIX text-processing commands and utilities must be able to process translated versions of these reference pages for applications sold to the international market. The operating system includes enhanced versions of the nroff, tbl, and man commands to support this requirement.
7.4.1 The nroff Command
The nroff command includes the following functions to support locales:
Formats reference page source files written in any language whose locale is installed on the system.
Supports characters of any supported languages in the string arguments of macros and requests.
Supports mapping of characters for any supported language through the .tr request in reference page source files (an example follows this list).
Allows you to set the escape character (\), command control character (.), and nobreak control character (') to local language, as well as ASCII, characters.
Maps each 2-byte space character, which is defined in most codesets for Asian languages, to two ASCII spaces in output.
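As an illustration of the .tr request, the following line in a reference page source file maps every occurrence of the character A to the character B in the formatted output; the characters shown are ASCII placeholders, but the same request accepts local language characters:

.tr AB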
When formatting reference pages that contain ideographic characters, the nroff command treats each character as a single word. A string of ideographic characters, including 2-byte letters and punctuation characters, can be wrapped to the next line subject to the following constraints:
The last character on the text line cannot be defined as a no-last character by either the standard or private list of no-last characters.
The first character on the text line cannot be defined as a no-first character by either the standard or private list of no-first characters.
The standard no-first and no-last character lists are defined in nroff catalog files. For lists of these characters, see the following language-specific manuals:
Technical Reference for Using Chinese Features
Technical Reference for Using Japanese Features
Technical Reference for Using Korean Features
Technical Reference for Using Thai Features
These manuals are available from the programming bookshelf of the operating system documentation Web site (http://www.tru64unix.compaq.com/docs/).
The no-first and no-last constraints exist to prevent nroff from placing a punctuation mark or right parenthesis at the beginning of a text line or placing a left parenthesis at the end of a text line. You can turn the standard constraints on and off in source files with the .ki and .ko commands, respectively.
You can also define a private set of no-first and no-last characters with the following command:
.kl 'no-first-list'no-last-list'
The parameters no-first-list and no-last-list are strings of characters that you include in the no-first and no-last categories. You cancel a private no-first and no-last list by entering a .kl command with null strings as the parameters.
For example:
.kl '''
Note
The characters specified in the .kl command override, rather than supplement, the characters in the standard set of no-first and no-last characters. Therefore, you cannot use the standard set of no-first and no-last characters together with a private set. Using the command .kl ''' restores use of the standard set of no-first and no-last characters for the current locale.
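For example, the following request (a hypothetical illustration using ASCII characters; real lists normally contain local language punctuation characters) defines a private list in which a right parenthesis cannot begin a line and a left parenthesis cannot end one:

.kl ')'('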
The nroff command can format text so that it is justified or not justified to the right margin. When text is justified to the right margin, nroff inserts spaces between words in the line. Ideographic characters, although treated as words in most stages of the formatting process, differ in terms of whether they can be delimited by spaces. The characters that can be preceded by a space, followed by a space, or both are listed in the language-specific user manuals that are available on line when you install language variant subsets of the operating system.
When right-justifying text, the nroff command inserts spaces only at the following places:
Where 1-byte or 2-byte spaces already occur
Between English language characters and ideographic characters
Before characters defined as can-space-before
After characters defined as can-space-after
In other cases, no space is inserted between consecutive ideographic characters. Therefore, if a text line contains only ideographic characters, it may not be justified to the right margin.
7.4.2 The tbl Command
The tbl command preprocesses table formatting commands within blocks delimited by the .TS and .TE macros. The tbl command handles multibyte characters that can occur in the text of languages other than English.
The tbl command is frequently used with the neqn equation formatting preprocessor to filter input passed to the nroff command. In such cases, specify tbl first to minimize the volume of data passed through the pipes. For example:
% cd /usr/share/ja_JP.deckanji/man/man1
% tbl od.1 | neqn | nroff -Tlpr -man -h | \
  lpr -Pmyprinter
When printing Asian language text, you must use printer hardware that supports the language.
7.4.3 The man Command
The man command can handle multibyte characters in reference page files. By default, the man command automatically searches for reference pages in the /usr/share/locale_name/man directory before searching the /usr/share/man and /usr/local/man directories. Therefore, if the LANG environment variable is set to an installed locale and if reference page translations are available for that locale, the man command automatically displays reference pages in the appropriate language.
In addition, the man command automatically applies codeset conversion (assuming the availability of appropriate converters) when reference page translations for a particular language are encoded in a codeset that does not match the codeset of the user's locale. See man(1) for more information about the man command search path and for more details about codeset conversion.
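For example, a user whose system includes the Japanese language subsets and translated reference pages might display a Japanese reference page as follows (a sketch; it assumes the ja_JP.deckanji locale and the corresponding reference page translations are installed):

% setenv LANG ja_JP.deckanji
% man ls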
7.5 Converting Data Files from One Codeset to Another
Each locale is based on a specific codeset. Therefore, when an application uses a file whose data is coded in one codeset and runs in a locale based on another codeset, character interpretation may be meaningless. For example, assume that a fictional language includes a character named "quo," which is encoded as \031 in one codeset and \042 in another codeset. If the "quo" character is stored in a data file as \031, the application that reads data from that file should be running in the locale based on the same codeset. Otherwise, \031 identifies a character other than "quo."
Users, the applications they run, or both may need to set the process environment to a particular locale and use a data file created with a codeset different from the one on which the locale is based. The data file in question might be appropriate for a given language and in a codeset different from the user's locale for one of the following reasons:
The data file might have been created on another vendor's system by using a locale based on a vendor-specific codeset. For example, the integration of PCs into the enterprise computing environment increases the likelihood that UNIX users need to process files for which the data encoding is in MS-DOS code page format.
The locale could be one of several UNIX locales that support the same Asian language, such as Japanese. Asian languages are typically supported by a variety of locales, each based on a different codeset.
The data file could be in a Unicode encoding, such as UCS-4, UTF-8, UTF-16, or UTF-32. If characters in this file are to be printed or displayed on the screen, they might need to be converted to encodings for which fonts are available.
You can convert a data file from one codeset to another by using the iconv command or the iconv_open(), iconv(), and iconv_close() functions.
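The following sketch, which is not taken from this manual, shows the typical calling sequence for the conversion functions; the codeset names match the command example that follows, the buffer size is illustrative, and error handling is abbreviated:

/* Convert one SJIS-encoded buffer to eucJP with the iconv functions. */
#include <iconv.h>
#include <stdio.h>

int convert_buffer(char *in, size_t inlen)
{
    char outbuf[4096];
    char *inptr = in;          /* some platforms declare this parameter as */
    char *outptr = outbuf;     /* const char **; adjust the type as needed */
    size_t inleft = inlen;
    size_t outleft = sizeof(outbuf);
    iconv_t cd;

    /* Open a conversion descriptor: to-codeset first, from-codeset second. */
    cd = iconv_open("eucJP", "SJIS");
    if (cd == (iconv_t)-1) {
        perror("iconv_open");
        return -1;
    }

    /* Convert the input; the pointers and counts are updated as data is
     * consumed and produced. */
    if (iconv(cd, &inptr, &inleft, &outptr, &outleft) == (size_t)-1) {
        perror("iconv");
        (void)iconv_close(cd);
        return -1;
    }

    printf("converted output is %lu bytes\n",
           (unsigned long)(sizeof(outbuf) - outleft));
    (void)iconv_close(cd);
    return 0;
}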
For example, the following command reads data in the accounts_local file, which is encoded in the SJIS codeset; converts the data to the eucJP codeset; and appends the results to the accounts_central file:
% iconv -f SJIS -t eucJP accounts_local \
  >> accounts_central
Many commands and utilities, such as the man command and internationalized print filters, use the iconv() functions and associated converters to perform codeset conversion on the user's behalf.
The iconv command and associated functions can use either an algorithmic converter or a table converter to convert data. Algorithmic converters, if installed on your system, reside in the /usr/lib/nls/loc/iconv directory; this directory is the one searched first for a converter. This directory also contains an alias file (iconv.alias) that maps different name strings for the same converter to the converter as named on the system. Table converters, if installed on your system, reside in the /usr/lib/nls/loc/iconvTable directory. The value of the LOCPATH variable, if defined, overrides the command's default search path.
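For example, the following command (a sketch; the directory name is hypothetical) overrides the default converter search path with a locally maintained directory:

% setenv LOCPATH /usr/local/lib/nls/loc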
The iconv command assumes that a converter name uses the following format:
from-codeset_to-codeset
For the preceding example, the iconv command would search for and use the /usr/lib/nls/loc/iconv/SJIS_eucJP converter.
Also consider operating system support for codeset conversion of the Hong Kong Supplementary Character Set (HKSCS). HKSCS is not a locale or character set name, but is used to provide a common language interface for electronic communication and data exchange conducted in Chinese. The characters in HKSCS are only for computer use. On Tru64 UNIX, HKSCS is used as the name for the extended Big-5 encoding that contains HKSCS characters, and support is limited to codeset conversion between HKSCS and Unicode.
Using the iconv command, codeset conversion with HKSCS would be specified as one of the following:
UTF-16_HKSCS or HKSCS_UTF-16
UCS-4_HKSCS or HKSCS_UCS-4
UTF-8_HKSCS or HKSCS_UTF-8
See HKSCS(5) for more information.
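For example, the following command line (a sketch; the file names are hypothetical) converts an HKSCS-encoded file to UTF-8:

% iconv -f HKSCS -t UTF-8 hongkong_data > hongkong_data.utf8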
Table 7-1 specifies the codeset conversions that the operating system supports for English language data. Tables with codeset conversions supported for their respective Asian languages are described in the following manuals:
Technical Reference for Using Chinese Features
Technical Reference for Using Japanese Features
Technical Reference for Using Korean Features
Technical Reference for Using Thai Features
For detailed information about the iconv command and functions, see iconv(1), iconv(3), iconv_open(3), iconv_close(3), and iconv_intro(5).
Table 7-1: Supported Codeset Conversions for English
| Codeset | ASCII-GR | ISO8859-1 | ISO8859-1-GL | ISO8859-1-GR |
| ASCII-GR | - | Yes | No | No |
| ISO8859-1 | Yes | - | Yes | Yes |
| ISO8859-1-GL | No | Yes | - | No |
| ISO8859-1-GR | No | Yes | No | - |
7.6 Using Font Renderers in Chinese and Korean PostScript Support
This section describes the use of font renderers in the creation of Motif applications that support PostScript fonts in Chinese and Korean. See the Using International Software manual for information on tuning cache size for ideographic characters and customizing windows for local languages.
7.6.1 Using Font Renderers for Multibyte PostScript Fonts
The operating system includes font renderers that allow any X application to use the PostScript fonts available for the Chinese and Korean languages. The system administrator can set up font renderers for the following kinds of fonts for use through the X server or the font server:
Double-Byte PostScript outline fonts
UDC fonts
By installing the IOSWWXFR** subset, you automatically enable font rendering for the PostScript outline fonts.
7.6.1.1 Setting Up the Font Renderer for Double-Byte PostScript Fonts
You can set up the font renderer for Chinese and Korean PostScript fonts for use either through the X server or the font server by editing the appropriate configuration file.
For the X server, the font renderer is automatically added at installation time to the font_renderers list in the X server's configuration file.
For a font server, you must manually add the following entry to the renderers list in the font server's configuration file:
renderers = other_renderer, other_renderer,...
libfr_DECpscf.so;DECpscfRegisterFontFileFunctions
In addition, you must specify the paths for the PostScript font files in the catalogue list in the same configuration file. Double-byte PostScript fonts for the Asian languages are available in the following directories:
/usr/i18n/lib/X11/fonts/KoreanPS
/usr/i18n/lib/X11/fonts/SChinesePS
/usr/i18n/lib/X11/fonts/TChinesePS
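For example, a font server configuration file that serves these fonts might include a catalogue entry such as the following (a sketch; it assumes the default installation directories and omits any other font paths already in the list):

catalogue = /usr/i18n/lib/X11/fonts/KoreanPS,
            /usr/i18n/lib/X11/fonts/SChinesePS,
            /usr/i18n/lib/X11/fonts/TChinesePS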
Each font in these directories has the following components:
A Type1 font header with the .pfa2 file name extension
This header file is the only file that must be listed in the fonts.dir file in the font directory.
A data file with the .csdata file name extension
A binary metrics file with the .xafm file name extension
The renderer for Asian Double-Byte PostScript fonts uses its own configuration file that specifies the following information:
Cache size (number of cache units)
Cache unit size
File handler (names associated with font-rendering software)
Default character (character that is printed in place of any character for which there is no glyph)
The default pathname for this configuration file is /var/X11/renderer/DECpscf_config; however, you can change this path by setting the DECPSCF_CONFIG_PATH environment variable.
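For example, the following command (a sketch; the pathname is hypothetical) directs the renderer to a site-specific copy of the configuration file:

% setenv DECPSCF_CONFIG_PATH /usr/local/lib/DECpscf_config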
7.6.1.2 Setting Up the Font Renderer for UDC Fonts
The UDC font renderer accesses the UDC database directly to obtain font glyphs. Therefore, X applications that use this renderer do not need to use .pcf files generated by the cgen utility.
You can set up the UDC font renderer for use either through the X server or the font server as follows:
For the X server, the font renderer is automatically added at installation time to the font_renderers list in the X server's configuration file.
For a font server, you must manually add the following entry to the renderers list in the font server's configuration file:
renderers = other_renderer, other_renderer,...
libfr_UDC.so;UDCRegisterFontFileFunctions
In addition, you must specify the path to the UDC database in the catalogue list of the same configuration file. This path should be set to the top directory for the UDC database. For example, /var/i18n/udc is the correct path for a systemwide UDC database if the database was set up in the default directory.
To process UDC characters in a particular language, the font renderer also requires entries in the fonts.dir file in the appropriate PostScript font directory from the following list:
/usr/i18n/lib/X11/fonts/SChinesePS
/usr/i18n/lib/X11/fonts/TChinesePS
Edit the fonts.dir file to specify virtual file names in the format locale_name.udc followed by the corresponding XLFD names registered for the codesets. Table 7-2 describes the XLFD entry that corresponds to different Asian codesets.
Table 7-2: XLFD Registry Names for UDC Characters
| Codeset | XLFD Registry Name |
| dechanyu, eucTW | DEC.CNS11643.1986-UDC |
| big5 | BIG5-UDC |
| dechanzi | GB2312.1980-UDC |
| deckanji, sdeckanji, eucJP | JISX.UDC-1 |
The following example entry is appropriate for the fonts.dir file in the /usr/i18n/lib/X11/fonts/TChinesePS directory:
2
zh_TW.dechanyu.udc -system-decwin-normal-r--24-240-75-75-m-24-DEC.CNS11643.1986-UDC
zh_TW.big5.udc -system-decwin-normal-r--24-240-75-75-m-24-BIG5-UDC
7.6.1.3 Using the Font Renderer for TrueType Fonts
The operating system includes a font renderer (/usr/shlib/X11/libfr_TrueType.so) that enables the use of TrueType fonts.
Currently, the operating system includes TrueType fonts only for simplified Chinese. However, you can configure the font renderer to use third-party TrueType fonts for additional languages if these are required by applications used at your site.
See TrueType(5X) for more information.