This chapter describes a set of miscellaneous tasks you should consider as you develop international applications. These tasks include the following:
Choosing an input method and input styles (Section 7.1)
Managing user-defined character databases (Section 7.2)
Assigning a sort order with a locale specification (Section 7.3)
Processing non-English language reference pages (Section 7.4)
Converting data files from one codeset to another (Section 7.5)
Using font renderers in Chinese and Korean PostScript support (Section 7.6)
This chapter provides information on the tools needed to create international applications. The information in this chapter is also closely related to how international applications are used on the operating system. As you use the information in this chapter, you may also find it helpful to refer to the companion manual, Using International Software.
The following manuals provide language-specific information about the customization and use of software provided for Asian languages on the operating system:
Technical Reference for Using Chinese Features
Technical Reference for Using Japanese Features
Technical Reference for Using Korean Features
Technical Reference for Using Thai Features
These manuals are available from the programming bookshelf of the operating system documentation Web site (http://www.tru64unix.compaq.com/docs/). Non-English language characters are embedded in the text of the Chinese, Japanese, and Korean Technical References. To view these characters with your Web browser, you must install the appropriate language support subsets on your system and set your locale to one that includes the local language characters used in the technical reference.
The operating system documentation also provides introductory reference pages on the topics of internationalization and localization; see i18n_intro(5) and l10n_intro(5).
7.1 Choosing an Input Method
For some languages, such as Japanese, Chinese, and Korean, you use an input method to enter characters and phrases. An input method lets you enter a character by taking multiple editing actions on entry data. The data entered at intermediate stages of character entry is called the preediting string.
The X Input Method specification defines the following user input, or preediting, styles:
On-the-Spot
Data being edited is displayed directly in the application window. Application data is moved to allow the preediting string to display at the point of character insertion.
Over-the-Spot
The preediting string is displayed in a window that is positioned over the point of insertion.
Off-the-Spot
The preediting string is displayed in a window that is within the application window but not over the point of insertion. Often, the window for the preediting string appears at the bottom of the application window. In this case, the preediting window may block the last line of text from view in the application window. You can resize the application window to make this last line visible.
Root Window
The preediting string is displayed in a child window of the application root window.
Input methods for different locales typically support more than one user input style, but not all of them. If you work in languages that are supported by an input method, you can specify styles in priority order through the VendorShell resource XmNpreeditType. By default, this resource is defined to be the following:
OnTheSpot,OverTheSpot,OffTheSpot,Root
The priority order of these values means that the On-the-Spot input style is used if the input method supports it; otherwise, the Over-the-Spot style is used if the input method supports it, and so forth.
Use one of the following methods to supply the XmNpreeditType resource value to an application:
In CDE, use the Input Methods application. See the CDE Companion manual for information on using this application.
In an application-specific resource file.
On the command line that invokes an application.
For example:
% app-name -xrm '*preeditType: offthespot,onthespot' &
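You can also supply a default value from within an application. The following minimal sketch, which is not taken from this manual, shows one way to do this through Motif fallback resources; the priority string, application class, and widget names are illustrative, and any value the user supplies in a resource file or on the command line still takes precedence:

/* Minimal Motif program that supplies a default preedit style priority
 * through fallback resources. */
#include <Xm/Xm.h>
#include <Xm/Text.h>

static String fallback_resources[] = {
    "*preeditType: OverTheSpot,OffTheSpot,Root",   /* assumed priority order */
    NULL
};

int main(int argc, char **argv)
{
    XtAppContext app;
    Widget toplevel, text;

    /* Establish locale and input method support before creating widgets. */
    XtSetLanguageProc(NULL, NULL, NULL);

    toplevel = XtVaAppInitialize(&app, "Example", NULL, 0,
                                 &argc, argv, fallback_resources, NULL);

    text = XmCreateText(toplevel, "text", NULL, 0);
    XtManageChild(text);

    XtRealizeWidget(toplevel);
    XtAppMainLoop(app);
    return 0;
}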
Input styles are supported by specialized input method servers. An input method server runs as an independent process and communicates with an application to handle input operations.
An input method server does not have to be running on the same system as the application but, with one exception, it must be running and made accessible to the application before the application starts.
If a Motif application that has been internationalized to support simplified Chinese contains an XmText or XmTextField widget with the Reconnectable resource set to True, the application can establish a connection with the input server either when the application starts before the server or when the server stops and restarts. See XmText(3X) and XmTextField(3X) for more information.
See the Using International Software manual for information on the input method servers available on the operating system and the input styles that each server supports.
7.2 Managing User-Defined Characters and Phrase Input
The national character sets for Japan, Taiwan, and China do not include some of the characters that can appear in Asian place names and personal names. Such characters are defined by users and reside in site-specific databases. These databases are called user-defined character (UDC) or character-attribute databases. When users define ideographic characters, they must also define font glyphs, collating files, and other support files for the characters.
Appendix B provides details on how you set up and use UDC databases.
In Korea, Taiwan, and China, users can enter a complete phrase by typing a keyword, abbreviation, or acronym. This capability is supported by a phrase database and an input mechanism. The Using International Software manual provides details on how the user sets up and uses a phrase database.
The /var/i18n/conf/cp_dirs configuration file allows software services or hardware to locate the databases that support UDC and phrase input. Example 7-1 contains the default entries in the cp_dirs file. You can edit these entries to change the default locations.
Example 7-1: Default cp_dirs File
#
# Attribute directory configuration file
#
#                        System location     User location
#                        ===============     =============
udc   -                  /var/i18n/udc       ~/.udc
odl   -                  /var/i18n/odl       ~/.odl
sim   -                  /var/i18n/sim       ~/.sim
cdb   /usr/i18n/.cdb     /var/i18n/cdb       ~/.cdb
iks   -                  /var/i18n/iks       ~/.iks
pre   -                  /var/i18n/fonts     ~/.fonts
bdf   -                  /var/i18n/fonts     ~/.fonts
pcf   -                  /var/i18n/fonts     ~/.fonts
Each line in the cp_dirs file represents one entry and has the following format:
service_name standard_path system_path user_path
The service_name can be one of the following:
cdb (for collating value databases used with the asort command)
odl (for databases of fonts and input key sequences that the SoftODL service uses)
pcf (for font files in Portable Compiled Format)
These files, depending on their font resolution, reside in either the 75dpi or 100dpi subdirectory.
pre (for font files in preload format created by the cgen utility)
These are raw font files used to preload multibyte character terminals.
The cp_dirs file can contain only one entry for each service named. The remaining fields in the entry line consist of the following:
standard_path specifies the location of the collating values database for the standard character sets (applies only to the cdb entry)
system_path specifies the location of systemwide databases
user_path specifies the location of users' private databases
The preceding locations are specified as one of the following:
An absolute pathname, starting with a slash (/)
A pathname, starting with tilde slash (~/), that is relative to a user's home directory
A minus sign or hyphen (-) to indicate that the entry is not used
For example, you can specify - to be user_path for all services related to user-defined characters if you want these characters supported only through systemwide databases.
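For instance, a cdb entry edited in this way might look like the following (a hypothetical edit of the default entry shown in Example 7-1):

cdb   /usr/i18n/.cdb     /var/i18n/cdb       -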
Comment lines in the cp_dirs file begin with the number sign (#).
7.3 Assigning a Sort Order with a Locale Specification
The sort command sorts characters according to the collation sequence defined for the current locale. A particular locale can apply one set of collation rules to the associated character set. Multiple locale names do exist, however, for the same combination of language, territory, and character set. These variations offer users the choice of more than one collating sequence. When more than one locale is available for a given combination of language, territory, and codeset, some of the locale names include a suffix with the format @variant.
To avoid problems with pathnames constructed using the %L specifier, you should assign a locale name with a variant suffix only to the appropriate locale category variable (or variables). In the following example, the locale assigned to LC_COLLATE differs from the locale assigned to LANG only with respect to collating sequence:
% setenv LANG zh_TW.eucTW
% setenv LC_COLLATE zh_TW.eucTW@radical
Supporting different collation orders through one or more locales is adequate for most languages. However, collation orders for Asian languages require additional support for the following reasons:
Asian languages include UDCs, which are not specified in a locale. These characters can be defined with a collation weight. In this case, the collation weight needs to be applied when the UDCs are encountered in the strings being sorted.
Ideographic characters can be sorted on more than one dimension (radical, stroke, phonetic, and internal code). Some users need to combine these dimensions during sort operations. In one operation the user may need to sort characters first by radical and then according to the number of strokes. For another operation, the user may need to put characters first in phonetic order, then according to the number of strokes, and so on. Sorting by combinations of dimensions requires breadth-first sorting, rather than the depth-first sorting implemented through locales.
For the preceding reasons, the asort command was developed and is available when you install language variant subsets that support Asian languages. The asort command uses, by default, the collating order defined for the LC_COLLATE variable and supports all the options supported by the sort command. In addition, the asort command includes the following options:
-C
This option indicates that the sort operation should use special system sort tables, along with sort tables produced by the cgen utility, to support UDCs. This option overrides the sort sequence defined in the locale specified by the LC_COLLATE variable.
-v
This option, which you can use only with the -C option, implements breadth-first sorting.
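For example, a command line like the following (a sketch; the file names are hypothetical) sorts UDCs by using the system and cgen sort tables and applies breadth-first sorting:

% asort -C -v names.txt > names.sorted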
See asort(1) for more information.
7.4 Processing Non-English Language Reference Pages
Programmers who supply software applications for UNIX systems frequently supply online reference pages (manpages) to document the application and its components. UNIX text-processing commands and utilities must be able to process translated versions of these reference pages for applications sold to the international market. The operating system includes enhanced versions of the nroff, tbl, and man commands to support this requirement.
7.4.1 The nroff Command
The nroff command includes the following functions to support locales:
Formats reference page source files written in any language whose locale is installed on the system.
Supports characters of any supported languages in the string arguments of macros and requests.
Supports mapping of characters for any supported language through the .tr request in reference page source files (an example follows this list).
Allows you to set the escape character (\), command control character (.), and nobreak control character (') to local language, as well as ASCII, characters.
Maps each 2-byte space character, which is defined in most codesets for Asian languages, to two ASCII spaces in output.
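As an illustration of the .tr request, the following line in a reference page source file maps every occurrence of the character A to the character B in the formatted output; the characters shown are ASCII placeholders, but the same request accepts local language characters:

.tr AB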
When formatting reference pages that contain ideographic characters, the nroff command treats each character as a single word. A string of ideographic characters, including 2-byte letters and punctuation characters, can be wrapped to the next line subject to the following constraints:
The last character on the text line cannot be defined as a no-last character by either the standard or private list of no-last characters.
The first character on the text line cannot be defined as a no-first character by either the standard or private list of no-first characters.
The standard no-first and no-last character lists are defined in nroff catalog files. For lists of these characters, see the following language-specific manuals:
Technical Reference for Using Chinese Features
Technical Reference for Using Japanese Features
Technical Reference for Using Korean Features
Technical Reference for Using Thai Features
These manuals are available from the programming bookshelf of the operating system documentation Web site (http://www.tru64unix.compaq.com/docs/).
The no-first and no-last constraints exist to prevent nroff from placing a punctuation mark or right parenthesis at the beginning of a text line or placing a left parenthesis at the end of a text line. You can turn the standard constraints on and off in source files with the .ki and .ko commands, respectively.
You can also define a private set of no-first and no-last characters with the following command:
.kl 'no-first-list'no-last-list'
The parameters no-first-list and no-last-list are strings of characters that you include in the no-first and no-last categories. You cancel a private no-first and no-last list by entering a .kl command with null strings as the parameters.
For example:
.kl '''
Note
The characters specified in the .kl command override, rather than supplement, the characters in the standard set of no-first and no-last characters. Therefore, you cannot use the standard set of no-first and no-last characters together with a private set. Using the command .kl ''' restores use of the standard set of no-first and no-last characters for the current locale.
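For example, the following request (a hypothetical illustration using ASCII characters; real lists normally contain local language punctuation characters) defines a private list in which a right parenthesis cannot begin a line and a left parenthesis cannot end one:

.kl ')'('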
The nroff command can format text so that it is justified or not justified to the right margin. When text is justified to the right margin, nroff inserts spaces between words in the line. Ideographic characters, although treated as words in most stages of the formatting process, differ in terms of whether they can be delimited by spaces. The characters that can be preceded by a space, followed by a space, or both are listed in the language-specific user manuals that are available on line when you install language variant subsets of the operating system.
When right-justifying text, the nroff command inserts spaces only at the following places:
Where 1-byte or 2-byte spaces already occur
Between English language characters and ideographic characters
Before characters defined as can-space-before
After characters defined as can-space-after
In other cases, no space is inserted between consecutive ideographic characters. Therefore, if a text line contains only ideographic characters, it may not be justified to the right margin.
7.4.2 The tbl Command
The tbl command preprocesses table formatting commands within blocks delimited by the .TS and .TE macros. The tbl command handles multibyte characters that can occur in the text of languages other than English.
The tbl command is frequently used with the neqn equation formatting preprocessor to filter input passed to the nroff command. In such cases, specify tbl first to minimize the volume of data passed through the pipes. For example:
% cd /usr/share/ja_JP.deckanji/man/man1
% tbl od.1 | neqn | nroff -Tlpr -man -h | \
  lpr -Pmyprinter
When printing Asian language text, you must use printer hardware that supports the language.
7.4.3 The man Command
The man command can handle multibyte characters in reference page files. By default, the man command automatically searches for reference pages in the /usr/share/locale_name/man directory before searching the /usr/share/man and /usr/local/man directories. Therefore, if the LANG environment variable is set to an installed locale and if reference page translations are available for that locale, the man command automatically displays reference pages in the appropriate language.
In addition, the man command automatically applies codeset conversion (assuming the availability of appropriate converters) when reference page translations for a particular language are encoded in a codeset that does not match the codeset of the user's locale. See man(1) for more information about the man command search path and for more details about codeset conversion.
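For example, a user whose system includes the Japanese language subsets and translated reference pages might display a Japanese reference page as follows (a sketch; it assumes the ja_JP.deckanji locale and the corresponding reference page translations are installed):

% setenv LANG ja_JP.deckanji
% man ls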
7.5 Converting Data Files from One Codeset to Another
Each locale is based on a specific codeset. Therefore, when an application uses a file whose data is coded in one codeset and runs in a locale based on another codeset, character interpretation may be meaningless. For example, assume that a fictional language includes a character named "quo," which is encoded as \031 in one codeset and \042 in another codeset. If the "quo" character is stored in a data file as \031, the application that reads data from that file should be running in the locale based on the same codeset. Otherwise, \031 identifies a character other than "quo."
Users, the applications they run, or both may need to set the process environment to a particular locale and use a data file created with a codeset different from the one on which the locale is based. The data file in question might be appropriate for a given language and in a codeset different from the user's locale for one of the following reasons:
The data file might have been created on another vendor's system by using a locale based on a vendor-specific codeset. For example, the integration of PCs into the enterprise computing environment increases the likelihood that UNIX users need to process files for which the data encoding is in MS-DOS code page format.
The locale could be one of several UNIX locales that support the same Asian language, such as Japanese. Asian languages are typically supported by a variety of locales, each based on a different codeset.
The data file could be in a Unicode encoding, such as UCS-4, UTF-8, UTF-16, or UTF-32. If characters in this file are to be printed or displayed on the screen, they might need to be converted to encodings for which fonts are available.
You can convert a data file from one codeset to another by using the iconv command or the iconv_open(), iconv(), and iconv_close() functions.
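The following sketch, which is not taken from this manual, shows the typical calling sequence for the conversion functions; the codeset names match the command example that follows, the buffer size is illustrative, and error handling is abbreviated:

/* Convert one SJIS-encoded buffer to eucJP with the iconv functions. */
#include <iconv.h>
#include <stdio.h>

int convert_buffer(char *in, size_t inlen)
{
    char outbuf[4096];
    char *inptr = in;          /* some platforms declare this parameter as */
    char *outptr = outbuf;     /* const char **; adjust the type as needed */
    size_t inleft = inlen;
    size_t outleft = sizeof(outbuf);
    iconv_t cd;

    /* Open a conversion descriptor: to-codeset first, from-codeset second. */
    cd = iconv_open("eucJP", "SJIS");
    if (cd == (iconv_t)-1) {
        perror("iconv_open");
        return -1;
    }

    /* Convert the input; the pointers and counts are updated as data is
     * consumed and produced. */
    if (iconv(cd, &inptr, &inleft, &outptr, &outleft) == (size_t)-1) {
        perror("iconv");
        (void)iconv_close(cd);
        return -1;
    }

    printf("converted output is %lu bytes\n",
           (unsigned long)(sizeof(outbuf) - outleft));
    (void)iconv_close(cd);
    return 0;
}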
For example, the following command reads data in the accounts_local file, which is encoded in the SJIS codeset; converts the data to the eucJP codeset; and appends the results to the accounts_central file:
% iconv -f SJIS -t eucJP accounts_local \
  >> accounts_central
Many commands and utilities, such as the man command and internationalized print filters, use the iconv() functions and associated converters to perform codeset conversion on the user's behalf.
The iconv command and associated functions can use either an algorithmic converter or a table converter to convert data. Algorithmic converters, if installed on your system, reside in the /usr/lib/nls/loc/iconv directory; this directory is the one searched first for a converter. This directory also contains an alias file (iconv.alias) that maps different name strings for the same converter to the converter as named on the system. Table converters, if installed on your system, reside in the /usr/lib/nls/loc/iconvTable directory. The value of the LOCPATH variable, if defined, overrides the command's default search path.
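For example, the following command (a sketch; the directory name is hypothetical) overrides the default converter search path with a locally maintained directory:

% setenv LOCPATH /usr/local/lib/nls/loc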
The iconv command assumes that a converter name uses the following format:
from-codeset_to-codeset
For the preceding example, the iconv command would search for and use the /usr/lib/nls/loc/iconv/SJIS_eucJP converter.
Also consider operating system support for codeset conversion of the Hong Kong Supplementary Character Set (HKSCS). HKSCS is not a locale or character set name, but is used to provide a common language interface for electronic communication and data exchange conducted in Chinese. The characters in HKSCS are only for computer use. On Tru64 UNIX, HKSCS is used as the name for the extended Big-5 encoding that contains HKSCS characters, and support is limited to codeset conversion between HKSCS and Unicode.
Using the iconv command, codeset conversion with HKSCS would be specified as one of the following:
UTF-16_HKSCS or HKSCS_UTF-16
UCS-4_HKSCS or HKSCS_UCS-4
UTF-8_HKSCS or HKSCS_UTF-8
See HKSCS(5) for more information.
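For example, the following command line (a sketch; the file names are hypothetical) converts an HKSCS-encoded file to UTF-8:

% iconv -f HKSCS -t UTF-8 hongkong_data > hongkong_data.utf8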
Table 7-1 specifies the codeset conversions that the operating system supports for English language data. Tables with codeset conversions supported for their respective Asian languages are described in the following manuals:
Technical Reference for Using Chinese Features
Technical Reference for Using Japanese Features
Technical Reference for Using Korean Features
Technical Reference for Using Thai Features
For detailed information about the iconv command and functions, see iconv(1), iconv(3), iconv_open(3), iconv_close(3), and iconv_intro(5).
Table 7-1: Supported Codeset Conversions for English
| Codeset | ASCII-GR | ISO8859-1 | ISO8859-1-GL | ISO8859-1-GR |
| ASCII-GR | - | Yes | No | No |
| ISO8859-1 | Yes | - | Yes | Yes |
| ISO8859-1-GL | No | Yes | - | No |
| ISO8859-1-GR | No | Yes | No | - |
7.6 Using Font Renderers in Chinese and Korean PostScript Support
This section describes the use of font renderers in the creation of Motif applications that support PostScript fonts in Chinese and Korean. See the Using International Software manual for information on tuning cache size for ideographic characters and customizing windows for local languages.
7.6.1 Using Font Renderers for Multibyte PostScript Fonts
The operating system includes font renderers that allow any X application to use the PostScript fonts available for the Chinese and Korean languages. The system administrator can set up font renderers for the following kinds of fonts for use through the X server or the font server:
Double-Byte PostScript outline fonts
UDC fonts
By installing the IOSWWXFR** subset, you automatically enable font rendering for the PostScript outline fonts.
7.6.1.1 Setting Up the Font Renderer for Double-Byte PostScript Fonts
You can set up the font renderer for Chinese and Korean PostScript fonts for use either through the X server or the font server by editing the appropriate configuration file.
For the X server, the font renderer is automatically added at installation time to the font_renderers list in the X server's configuration file.
For a font server, you must manually add the following entry to the renderers list in the font server's configuration file:
renderers = other_renderer, other_renderer,...
libfr_DECpscf.so;DECpscfRegisterFontFileFunctions
In addition, you must specify the paths for the PostScript font files in the catalogue list in the same configuration file. Double-byte PostScript fonts for the Asian languages are available in the following directories:
/usr/i18n/lib/X11/fonts/KoreanPS
/usr/i18n/lib/X11/fonts/SChinesePS
/usr/i18n/lib/X11/fonts/TChinesePS
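For example, a font server configuration file that serves these fonts might include a catalogue entry such as the following (a sketch; it assumes the default installation directories and omits any other font paths already in the list):

catalogue = /usr/i18n/lib/X11/fonts/KoreanPS,
            /usr/i18n/lib/X11/fonts/SChinesePS,
            /usr/i18n/lib/X11/fonts/TChinesePS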
Each font in these directories has the following components:
A Type1 font header with the .pfa2 file name extension
This header file is the only file that must be listed in the fonts.dir file in the font directory.
A data file with the .csdata file name extension
A binary metrics file with the .xafm file name extension
The renderer for Asian Double-Byte PostScript fonts uses its own configuration file that specifies the following information:
Cache size (number of cache units)
Cache unit size
File handler (names associated with font-rendering software)
Default character (character that is printed in place of any character for which there is no glyph)
The default pathname for this configuration file is /var/X11/renderer/DECpscf_config; however, you can change this path by setting the DECPSCF_CONFIG_PATH environment variable.
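For example, the following command (a sketch; the pathname is hypothetical) directs the renderer to a site-specific copy of the configuration file:

% setenv DECPSCF_CONFIG_PATH /usr/local/lib/DECpscf_config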
7.6.1.2 Setting Up the Font Renderer for UDC Fonts
The UDC font renderer accesses the UDC database directly to obtain font glyphs. Therefore, X applications that use this renderer do not need to use .pcf files generated by the cgen utility.
You can set up the UDC font renderer for use either through the X server or the font server as follows:
For the X server, the font renderer is automatically added at installation time to the font_renderers list in the X server's configuration file.
For a font server, you must manually add the following entry to the renderers list in the font server's configuration file:
renderers = other_renderer, other_renderer,...
libfr_UDC.so;UDCRegisterFontFileFunctions
In addition, you must specify the path to the UDC database in the catalogue list of the same configuration file. This path should be set to the top directory for the UDC database. For example, /var/i18n/udc is the correct path for a systemwide UDC database if the database was set up in the default directory.
To process UDC characters in a particular language, the font renderer also requires entries in the fonts.dir file in the appropriate PostScript font directory from the following list:
/usr/i18n/lib/X11/fonts/SChinesePS
/usr/i18n/lib/X11/fonts/TChinesePS
Edit the fonts.dir file to specify virtual file names in the format locale_name.udc followed by the corresponding XLFD names registered for the codesets. Table 7-2 describes the XLFD entry that corresponds to different Asian codesets.
Table 7-2: XLFD Registry Names for UDC Characters
| Codeset | XLFD Registry Name |
| dechanyu, eucTW | DEC.CNS11643.1986-UDC |
| big5 | BIG5-UDC |
| dechanzi | GB2312.1980-UDC |
| deckanji, sdeckanji, eucJP | JISX.UDC-1 |
The following example entry is appropriate for the fonts.dir file in the /usr/i18n/lib/X11/fonts/TChinesePS directory:
2
zh_TW.dechanyu.udc -system-decwin-normal-r--24-240-75-75-m-24-DEC.CNS11643.1986-UDC
zh_TW.big5.udc -system-decwin-normal-r--24-240-75-75-m-24-BIG5-UDC
7.6.1.3 Using the Font Renderer for TrueType Fonts
The operating system includes a font renderer (/usr/shlib/X11/libfr_TrueType.so) that enables the use of TrueType fonts.
Currently, the operating system includes TrueType fonts only for simplified Chinese. However, you can configure the font renderer to use third-party TrueType fonts for additional languages if these are required by applications used at your site.
See TrueType(5X) for more information.