This chapter explains how to develop a locale.
A locale is the
set of data that supports a particular combination of native language, cultural
data, and codeset on the operating system.
You use the
localedef
command to create locales from the following files:
charmap, a character map source file (Section 6.1)
See charmap(4) for details. This chapter works through a charmap example that conforms to the binary character encodings specified for the ISO Latin-1 codeset, which defines all characters as single 8-bit bytes. The chapter also includes an example of part of a charmap file for the SJIS codeset, which defines both single-byte and multibyte characters.
A locale source file (Section 6.2)
See locale(4). The example in this chapter is fr_FR.ISO8859-1@example, which supports the language and customs of France.
A methods file with associated shareable library (Section 6.3)
A methods file and shareable library are required when the
charmap
file defines multibyte characters; otherwise, they are optional.
The methods file contains an entry for each function used by the locale and
defined in the associated shared library.
The methods file entry includes
the library name and path.
Methods file entries also specify the shared library
containing redefinitions of the C Library interfaces that convert data to
and from internal process (wide-character) encoding.
For a list of files that must be changed in order for desktop applications
to use a new locale, see
Chapter 5.
6.1 Creating a Character Map Source File for a Locale
A
charmap
file defines symbols
for character binary encodings.
The
localedef
command uses
this file to map character symbols in a locale source file to the character
encodings.
Example 6-1
is a fragment of the
ISO8859-1.cmap
source file that is used in the
fr_FR.ISO8859-1@example
locale being developed in this chapter.
Section D.1
contains the
ISO8859-1.cmap
file in its entirety.
Example 6-1: The charmap File for a Sample Locale
# [1]
# Charmap for ISO 8859-1 codeset [1]
# [1]
<code_set_name> "ISO8859-1" [2]
<mb_cur_max> 1 [2]
<mb_cur_min> 1 [2]
<escape_char> \ [2]
<comment_char> # [2]
CHARMAP [3]
# Portable characters and other standard [1]
# control characters [1]
<NUL> \x00 [4]
<SOH> \x01
<STX> \x02
<ETX> \x03
<EOT> \x04
<ENQ> \x05
<ACK> \x06
<BEL> \x07
<alert> \x07
<backspace> \x08
<tab> \x09
<newline> \x0a
<vertical-tab> \x0b
<form-feed> \x0c
<carriage-return> \x0d
<SO> \x0e
.
.
.
<zero> \x30 [4]
<one> \x31
<two> \x32
<three> \x33
<A> \x41
<B> \x42
<C> \x43
<D> \x44
.
.
.
<underscore> \x5f [4]
<low-line> \x5f
<grave-accent> \x60
<a> \x61
<b> \x62
<c> \x63
<d> \x64
.
.
.
# Extended control characters [1]
# (names taken from ISO 6429) [1]
<PAD> \x80 [4]
<HOP> \x81
<BPH> \x82
<NBH> \x83
<IND> \x84
.
.
.
# Other graphic characters [1]
<nobreakspace> \xa0 [4]
<inverted-exclamation-mark> \xa1
.
.
.
END CHARMAP [5]
[1] Comment line
By default, the comment character is the number sign (#).
You
can override this default with a
<comment_char>
definition
(see
Example 6-1).
[2] Keyword declarations
This example provides entries
for all valid declarations and specifies default values for all but
<code_set_name>.
Usually, you specify a declaration only when
you want to override its default value.
In this example, the declarations
for
<escape_char>
and
<comment_char>
specify the default values for the escape character and comment character,
respectively.
The value for
<mb_cur_max>, the maximum
length (in bytes) of a character, is 1 for this particular
charmap
file.
The value for
<mb_cur_min>, the minimum
length (in bytes) of a character, must be 1 in
charmap
files for all locales.
(All locales include characters in the Portable Character
Set, which defines single-byte characters.)
The <code_set_name> value is the value returned by the nl_langinfo(CODESET) call made by applications that bind to the locale at run time.
[3] Header marking start of character maps
[4] Symbol-to-coding maps for characters
Each character map consists of a symbolic name and encoding. The name and encoding are separated by one or more spaces.
A symbolic name begins with the left angle bracket
(<) and ends with the right angle bracket (>).
The characters between
the angle brackets can be any characters from the Portable Character Set,
except for control and space characters.
If the name includes more than one right angle bracket (>), all but the last one must be preceded by the value of <escape_char>.
A symbolic name cannot exceed
128 bytes in length.
An encoding can be one or more decimal, octal, or hexadecimal constants. (Multiple constants apply to multibyte encodings.) The constants have the following formats:
Decimal
\dnnn or \dnn, where n is a decimal digit
Hexadecimal
\xnn, where n is a hexadecimal digit
Octal
\nnn or \nn, where n is an octal digit
You can define multiple character map entries (each with a different symbolic name) for the same encoding value. In this example, <BEL> and <alert> both map to \x07, and <underscore> and <low-line> both map to \x5f.
[5] Trailer marking end of character maps
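As a rough illustration of the entry format described above, the following Python sketch reads symbol-to-encoding entries between the CHARMAP header and trailer and decodes the three constant formats. It is not how localedef works internally; the names parse_encoding and parse_charmap are hypothetical, and the sketch assumes the default comment character (#) and escape character (\).

```python
import re

# One alternative per constant format: \xnn, \dnn or \dnnn, \nn or \nnn.
CONSTANT = re.compile(r"\\x([0-9A-Fa-f]{2})|\\d([0-9]{2,3})|\\([0-7]{2,3})")


def parse_encoding(text):
    """Decode one or more constants (for example \\x81\\x40) into bytes."""
    out = bytearray()
    for hexa, dec, octal in CONSTANT.findall(text):
        if hexa:
            out.append(int(hexa, 16))
        elif dec:
            out.append(int(dec, 10))
        else:
            out.append(int(octal, 8))
    return bytes(out)


def parse_charmap(lines):
    """Return {symbolic name: encoding} for entries inside CHARMAP ... END CHARMAP."""
    table = {}
    in_map = False
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # assumes the default comment character
            continue
        if line == "CHARMAP":
            in_map = True
            continue
        if line == "END CHARMAP":
            break
        if in_map:
            # Symbolic names cannot contain spaces, so the first blank separates the fields.
            symbol, encoding = line.split(None, 1)
            table[symbol] = parse_encoding(encoding)
    return table


sample = r"""
<code_set_name> "ISO8859-1"
CHARMAP
<BEL>    \x07
<alert>  \x07
<A>      \x41
<B>      \d66
<C>      \103
END CHARMAP
""".splitlines()

table = parse_charmap(sample)
```

In this sketch, <BEL> and <alert> end up with the same one-byte encoding, which mirrors the way a charmap file can give several symbolic names to one encoding value.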
The source files for
codesets with multibyte characters have more complex character maps.
Example 6-2
is a subset of character map entries from
a source file for the Japanese SJIS codeset.
This source file specifies entries
from several character sets that must be supported within the same codeset.
Example 6-2: Fragment from a charmap File for a Multibyte Codeset
# SJIS charmap
#
<code_set_name> "SJIS" [1]
<mb_cur_min> 1 [2]
<mb_cur_max> 2 [3]
CHARMAP
#
# CS0: ASCII
#
.
.
.
<commercial-at> \x40 [4]
<A> \x41 [4]
<B> \x42 [4]
.
.
.
#
# CS1: JIS X0208-1983 for ShiftJIS.
#
<zenkaku-space> \x81\x40 [5]
<j0101>...<j0163> \x81\x40 [5]
<j0164>...<j0194> \x81\x80 [5]
.
.
.
#
# UDC Area in JIS X0208 plane
#
<u8501>...<u8563> \xeb\x40 [6]
<u8564>...<u8594> \xeb\x80 [6]
<u8601>...<u8663> \xeb\x9f [6]
.
.
.
#
# CS2: JIS X0201 (so-called Hankaku-Kana)
#
<kana-fullstop> \xa1 [7]
.
.
.
<kana-conjunctive> \xa5 [7]
<kana-WO> \xa6 [7]
<kana-a> \xa7 [7]
.
.
.
END CHARMAP
[1] Codeset name
[2] Minimum number of bytes for each character
This value must be 1.
[3] Maximum number of bytes for each character
In SJIS, the largest multibyte character is 2 bytes in length.
[4] Symbols and encodings for ASCII characters
[5] Symbols and encodings for SJIS characters
Note how character symbols are specified as a range and how two hexadecimal values determine the encoding for a 2-byte character.
When symbols are specified as a range
of symbol values, the specified character encoding applies to the first symbol
in the range.
The
localedef
command automatically increments
both the symbol value and the encoding value to create symbols and encodings
for all characters in the range.
[6] Maps for UDCs within the SJIS codeset
These maps establish ranges of encodings for which users can later define characters.
[7] Maps for the single-byte characters of the Hankaku-Kana character set
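The range notation used for the SJIS entries can be pictured with a short Python sketch. It assumes that the symbol value is the trailing decimal number in the symbolic name and that the encoding is incremented as a single big-endian integer; the function name expand_range is hypothetical, and the actual localedef implementation may differ in detail.

```python
import re


def expand_range(first, last, encoding):
    """Expand <prefixNNNN>...<prefixMMMM>, starting at the given encoding."""
    m1 = re.fullmatch(r"<(\D*)(\d+)>", first)
    m2 = re.fullmatch(r"<(\D*)(\d+)>", last)
    if not m1 or not m2 or m1.group(1) != m2.group(1):
        raise ValueError("range endpoints must share a prefix")
    prefix, width = m1.group(1), len(m1.group(2))
    start, stop = int(m1.group(2)), int(m2.group(2))
    value = int.from_bytes(encoding, "big")
    result = {}
    for offset, number in enumerate(range(start, stop + 1)):
        # Increment the symbol value and the encoding value together.
        symbol = f"<{prefix}{number:0{width}d}>"
        result[symbol] = (value + offset).to_bytes(len(encoding), "big")
    return result


# First CS1 range from the SJIS fragment: 63 symbols starting at \x81\x40.
cs1 = expand_range("<j0101>", "<j0163>", b"\x81\x40")
```

Under these assumptions the first range ends at \x81\x7e, which is why the next range in the fragment can start at \x81\x80.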
See charmap(4) for more information.
Note
The symbolic names for characters in character map source files are in the process of becoming standardized. A future revision of the X/Open UNIX standard will likely specify both long and short symbolic names for characters.
The symbolic names for characters in examples are not necessarily the names being proposed for adoption by any standards group.
6.2 Creating Locale Definition Source Files
A locale definition source file defines data that is specific to a particular language and territory. The source file is organized into sections, one for each category of locale data being defined. The locale categories include the following:
LC_CTYPE defines character classes and attributes (Section 6.2.1)
LC_COLLATE defines how characters and strings are collated (Section 6.2.2)
LC_MESSAGES defines the strings used for affirmative and negative responses (Section 6.2.3)
LC_MONETARY defines the rules and symbols for monetary values (Section 6.2.4)
LC_NUMERIC defines the rules and symbols for numeric data (Section 6.2.5)
LC_TIME defines date and time (Section 6.2.6)
LC_ALL references all the categories
Example 6-3
illustrates the structure of a locale
definition source file in pseudocode.
Example 6-3: Structure of Locale Source Definition File
# comment-line [1]
comment_char <char_symbol1> [2]
escape_char <char_symbol2> [3]
CATEGORY_NAME [4]
category_definition-statement [5]
category_definition-statement [5]
.
.
.
END CATEGORY_NAME [6]
.
.
.
[7]
[1] Comment line
The number sign (#)
is the default comment character.
You can specify comments as entire lines
by entering the comment character in the first column of the line.
You cannot
specify comments on the same lines as definition statements in locale source
files.
In this respect, locale source files differ from character map source
files.
[2] Redefinition of comment character
You can override the default comment character with an entry line
that begins with the
comment_char
keyword followed by the
symbol for the desired character.
The character symbol is defined in the character
map (charmap) source file for the locale.
[3] Redefinition of escape character
The escape character is the backslash (\) by default.
It is used in decimal, hexadecimal, and octal constants
and to indicate that a definition statement continues on the next line of the
source file.
You can override the default escape character with an entry
line that begins with the
escape_char
keyword followed
by one or more blank characters, then the symbol for the desired character.
The character symbol is defined in the character map source file for the locale.
[4] Header for locale category section
Section headers correspond to category names, which are
LC_CTYPE,
LC_COLLATE,
LC_NUMERIC,
LC_MONETARY,
LC_MESSAGES, and
LC_TIME.
[5] Definition statement for the category
The format of these statements varies from one category to the next. In general, a statement begins with a keyword, followed by one or more spaces or tabs, then by the definition itself.
In place of
any category definition statements, you can include a
copy
statement to include the definition statements from another locale.
For example:
copy en_US.ISO8859-1
If you include a
copy
statement, do not include other
statements in the category.
[6] Trailer for locale category section
Section trailers begin with the
END
keyword followed
by the category name.
[7] You can include sections for all locale categories or only a subset of categories. If you omit a section for a locale category from the source file, the definition for the omitted category is derived from the default locale (POSIX or C).
The following sections describe specific locale categories and illustrate
the description with parts of the
fr_FR.ISO8859-1@example.src
locale source file.
Section D.2
contains this source
file in its entirety.
6.2.1 Defining the LC_CTYPE Locale Category
The
LC_CTYPE
section of a locale source file
defines character classes and character attributes used in operations such
as case conversion.
Example 6-4
describes the definition
for this section.
Example 6-4: LC_CTYPE Category Definition
#############
LC_CTYPE [1]
#############
upper <A>;<B>;<C>;<D>;<E>;<F>;<G>;<H>;<I>;<J>;<K>;<L>;<M>;\
<N>;<O>;<P>;<Q>;<R>;<S>;<T>;<U>;<V>;<W>;<X>;<Y>;<Z>;\
<A-grave>;\
.
.
.
<U-diaeresis> [2]
lower <a>;<b>;<c>;<d>;<e>;<f>;<g>;<h>;<i>;<j>;<k>;<l>;<m>;\
<n>;<o>;<p>;<q>;<r>;<s>;<t>;<u>;<v>;<w>;<x>;<y>;<z>;\
<a-grave>;\
.
.
.
<u-diaeresis> [2]
space <tab>;<newline>;<vertical-tab>;<form-feed>;\
<carriage-return>;<space> [2]
cntrl <NUL>;<SOH>;<STX>;<ETX>;<EOT>;<ENQ>;<ACK>;\
<alert>;<backspace>;<tab>;<newline>;<vertical-tab>;\
<form-feed>;<carriage-return>;\
.
.
.
<SOS>;<SGCI>;<SCI>;<CSI>;<ST>;<OSC>;<PM>;<APC> [2]
graph <exclamation-mark>;<quotation-mark>;<number-sign>;\
.
.
.
<u-circumflex>;<u-diaeresis>;<y-acute>;<thorn-icelandic>;<y-diaeresis> [2]
# print class includes everything in the graph class above, plus <space>.
print <exclamation-mark>;<quotation-mark>;<number-sign>;\
.
.
.
<u-circumflex>;<u-diaeresis>;<y-acute>;<thorn-icelandic>;<y-diaeresis>;\
<space> [2]
punct <exclamation-mark>;<quotation-mark>;<number-sign>;\
<dollar-sign>;<percent-sign>;<ampersand>;<apostrophe>;\
<left-parenthesis>;<right-parenthesis>;<asterisk>;\
<plus-sign>;<comma>;<hyphen>;<period>;<slash>;\
<colon>;<semicolon>;<less-than-sign>;<equals-sign>;\
<greater-than-sign>;<question-mark>;<commercial-at>;\
<left-square-bracket>;<backslash>;<right-square-bracket>;\
<circumflex>;<underscore>;<grave-accent>;<left-brace>;\
<vertical-line>;<right-brace>;<tilde> [2]
digit <zero>;<one>;<two>;<three>;<four>;\
<five>;<six>;<seven>;<eight>;<nine> [2]
xdigit <zero>;<one>;<two>;<three>;<four>;\
<five>;<six>;<seven>;<eight>;<nine>;\
<A>;<B>;<C>;<D>;<E>;<F>;\
<a>;<b>;<c>;<d>;<e>;<f> [2]
blank <space>;<tab> [2]
toupper (<a>,<A>);(<b>,<B>);(<c>,<C>);(<d>,<D>);(<e>,<E>);\
(<f>,<F>);(<g>,<G>);(<h>,<H>);(<i>,<I>);(<j>,<J>);\
(<k>,<K>);(<l>,<L>);(<m>,<M>);(<n>,<N>);(<o>,<O>);\
(<p>,<P>);(<q>,<Q>);(<r>,<R>);(<s>,<S>);(<t>,<T>);\
(<u>,<U>);(<v>,<V>);(<w>,<W>);(<x>,<X>);(<y>,<Y>);\
(<z>,<Z>);\
(<a-grave>,<A-grave>);\
(<a-circumflex>,<A-circumflex>);\
(<ae-ligature>,<AE-ligature>);\
(<c-cedilla>,<C-cedilla>);\
(<e-grave>,<E-grave>);\
(<e-acute>,<E-acute>);\
(<e-circumflex>,<E-circumflex>);\
(<e-diaeresis>,<E-diaeresis>);\
(<i-circumflex>,<I-circumflex>);\
(<i-diaeresis>,<I-diaeresis>);\
(<o-circumflex>,<O-circumflex>);\
(<u-grave>,<U-grave>);\
(<u-circumflex>,<U-circumflex>);\
(<u-diaeresis>,<U-diaeresis>) [3]
# tolower class is the inverse of toupper.
tolower (<A>,<a>);(<B>,<b>);(<C>,<c>);(<D>,<d>);(<E>,<e>);\
(<F>,<f>);(<G>,<g>);(<H>,<h>);(<I>,<i>);(<J>,<j>);\
(<K>,<k>);(<L>,<l>);(<M>,<m>);(<N>,<n>);(<O>,<o>);\
(<P>,<p>);(<Q>,<q>);(<R>,<r>);(<S>,<s>);(<T>,<t>);\
(<U>,<u>);(<V>,<v>);(<W>,<w>);(<X>,<x>);(<Y>,<y>);\
(<Z>,<z>);\
(<A-grave>,<a-grave>);\
(<A-circumflex>,<a-circumflex>);\
(<AE-ligature>,<ae-ligature>);\
(<C-cedilla>,<c-cedilla>);\
(<E-grave>,<e-grave>);\
(<E-acute>,<e-acute>);\
(<E-circumflex>,<e-circumflex>);\
(<E-diaeresis>,<e-diaeresis>);\
(<I-circumflex>,<i-circumflex>);\
(<I-diaeresis>,<i-diaeresis>);\
(<O-circumflex>,<o-circumflex>);\
(<U-grave>,<u-grave>);\
(<U-circumflex>,<u-circumflex>);\
(<U-diaeresis>,<u-diaeresis>) [3]
END LC_CTYPE [4]
[1] Section header
[2] Definition of character class
These definitions start with a keyword that stands for the character class (also referred to as a property) followed by one or more blank characters, then a list of symbols for all characters in that class. You can substitute the character's encoding for its symbol; however, specifying characters by their encodings diminishes the readability of the locale source file and makes it impossible to use the file with more than one codeset.
Although not illustrated in the example,
you can specify a horizontal ellipsis (...) to represent
a range of characters.
In the string
<NUL>;...;<tab>,
for example, the ellipsis represents all characters whose encodings are between
the character whose symbol is
<NUL>
and the character
whose symbol is
<tab>.
The symbols and their encodings
are specified in the
charmap
file for the locale.
Character classes as defined by the X/Open UNIX standard are represented by the following keywords:
upper
(uppercase letter characters)
lower
(lowercase letter characters)
alpha
(all letter characters)
By default, the
alpha
class is the combination of
characters specified for the
upper
and
lower
classes.
Because the sample locale does not explicitly define the
alpha
class, the default definition applies.
space
(white-space characters)
cntrl
(control characters)
punct
(punctuation characters)
digit
(numeric digits)
xdigit
(hexadecimal digits)
blank
(blank characters)
graph
By default, this class is the combination of characters in the
alpha,
digit, and
punct
classes.
print
By default, this class is the combination of characters in the
alpha,
digit, and
punct
classes,
plus the space character.
From the application standpoint, there is also the
class
alnum.
This class is rarely defined in a locale because
it is always a combination of characters in the
alpha
and
digit
classes.
Unicode (*.UTF-8) locales include character classes
as defined by the Unicode standard.
See locale(4).
Certain locales, such as those for Asian languages like Japanese, may define nonstandard character classes.
[3] Definitions of case conversion for letter characters
Case conversion definitions, which begin with the keywords
toupper
and
tolower, list symbols in pairs rather
than individually.
In the
toupper
definition described
here, the first symbol in the pair is the symbol for a lowercase letter and
the second symbol is the symbol for that letter's uppercase equivalent.
This
definition determines what a letter is converted to when functions, like
towupper()
and
towlower(), perform case conversion
on text data.
Locales that define nonstandard character classes
may define other property conversion definitions that are used by the
wctrans()
and
towctrans()
functions.
[4] Section trailer
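The default relationships among the character classes, and the pairing used by the case conversion definitions, can be modeled with ordinary sets and dictionaries. The following Python sketch uses a handful of symbols from the example; the tiny class contents and the helper name convert are assumptions made for illustration, not part of any locale tool.

```python
# A small subset of the classes from the LC_CTYPE example, keyed by symbol.
upper = {"<A>", "<B>", "<A-grave>"}
lower = {"<a>", "<b>", "<a-grave>"}
digit = {"<zero>", "<one>"}
punct = {"<exclamation-mark>", "<comma>"}

# Defaults described in the text: derived classes are unions of other classes.
alpha = upper | lower
alnum = alpha | digit
graph = alpha | digit | punct
print_class = graph | {"<space>"}

# toupper lists (lowercase, uppercase) pairs; tolower is its inverse.
toupper_pairs = [("<a>", "<A>"), ("<b>", "<B>"), ("<a-grave>", "<A-grave>")]
to_upper = dict(toupper_pairs)
to_lower = {up: low for low, up in toupper_pairs}


def convert(symbols, mapping):
    """Map each symbol through a case table; unmapped symbols pass through."""
    return [mapping.get(symbol, symbol) for symbol in symbols]


shouted = convert(["<a-grave>", "<comma>", "<b>"], to_upper)
```

Symbols without a pair, such as <comma>, are returned unchanged, which matches the usual behavior of case conversion on non-letter characters.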
The preceding example does not completely illustrate
all the options you can use when defining the
LC_CTYPE
category.
Additional options allow you to perform the following tasks:
Use a
copy
statement to include the entire
category definition from another locale
When you use a
copy
statement, it must be the only
entry between the section header and trailer.
Omit any of the standard character classes or define different character classes
The standard character classes are language specific. Therefore, the standard character classes may not apply to all languages. When you define a locale, use only the standard character classes that are appropriate for the locale's language. Depending on the language, it may be necessary to define nonstandardized classes.
A definition for a nonstandardized character class must be preceded
by the
charclass
statement to define a keyword for the
class, followed by the class definition.
For example:
charclass vowel
vowel <a>;<e>;<i>;<o>;<u>;<y>
Applications can
use the
wctype()
and
iswctype()
functions
to determine and test all character classes (including user-defined ones).
Applications can use class-specific functions, such as
iswalpha()
and
iswpunct()
to test the standard character
classes.
Note
The LC_CTYPE category of the fr_FR.ISO8859-1@example locale is limited to letter characters in the French language. Some locale developers would define character classes to include characters in all the languages supported by the ISO 8859-1 character set. This practice allows locales for multiple Western European languages to use the same LC_CTYPE source definitions through a copy statement.
See locale(4) for details about the LC_CTYPE category definition.
6.2.2 Defining the LC_COLLATE Locale Category
The
LC_COLLATE
section of a locale source file
specifies how characters and strings are collated.
Example 6-5
is part of an
LC_COLLATE
section.
Example 6-5: LC_COLLATE Category Definition
LC_COLLATE [1]
order_start forward;backward;forward [2]
<NUL> [3]
<SOH>
<STX>
<ETX>
<EOT>
<ENQ>
<ACK>
<alert>
<backspace>
<tab>
.
.
.
<APC> [3]
<space> <space>;<space>;<space>
<exclamation-mark> <exclamation-mark>;<exclamation-mark>;<exclamation-mark>
<quotation-mark> <quotation-mark>;<quotation-mark>;<quotation-mark>
.
.
.
<a> <a>;<a>;<a> [3]
<A> <a>;<a>;<A>
<feminine> <a>;<feminine>;<feminine>
<a-acute> <a>;<a-acute>;<a-acute>
<A-acute> <a>;<a-acute>;<A-acute>
<a-grave> <a>;<a-grave>;<a-grave>
<A-grave> <a>;<a-grave>;<A-grave>
<a-circumflex> <a>;<a-circumflex>;<a-circumflex>
<A-circumflex> <a>;<a-circumflex>;<A-circumflex>
<a-ring> <a>;<a-ring>;<a-ring>
<A-ring> <a>;<a-ring>;<A-ring>
<a-diaeresis> <a>;<a-diaeresis>;<a-diaeresis>
<A-diaeresis> <a>;<a-diaeresis>;<A-diaeresis>
<a-tilde> <a>;<a-tilde>;<a-tilde>
<A-tilde> <a>;<a-tilde>;<A-tilde>
<ae-ligature> <a>;<a><e>;<a><e>
<AE-ligature> <a>;<a><e>;<A><E>
<b> <b>;<b>;<b>
<B> <b>;<b>;<B>
<c> <c>;<c>;<c>
<C> <c>;<c>;<C>
<c-cedilla> <c>;<c-cedilla>;<c-cedilla>
<C-cedilla> <c>;<c-cedilla>;<C-cedilla>
.
.
.
<z> <z>;<z>;<z> [3]
<Z> <z>;<z>;<Z>
UNDEFINED [4]
order_end [5]
END LC_COLLATE [6]
[1] Section header
[2] An order_start keyword that marks the beginning of a section with statements that assign collating weights to elements
Following the
order_start
keyword on the same
line are sort directives, separated by semicolons (;) that apply to each sorting
pass.
Sort directives can include the following keywords.
forward, which specifies that the comparison
operation proceeds from the start of the string towards the end of the string.
backward, which specifies that the comparison
operation proceeds from the end of the string towards the start of the string.
position, which specifies that the comparison
operation considers the relative position of characters in the string that
are not subject to the collating weight
IGNORE.
In other
words, the first characters collated are those that do not have a collation
weight of
IGNORE
and are the shortest distance from the
start (forward,position) or end (backward,position) of the string.
When a sort directive includes two keywords, the
position
keyword combined with either
forward
or
backward, the two keywords are separated by a comma
(,).
The
position
keyword by itself is equivalent to the
directive
forward,position.
The number of sort directives corresponds to the number of weights each collating element is assigned in subsequent statements.
Each
sort directive and its associated set of weights specify information for one
pass, or level, of string comparison.
The first directive applies when the
string comparison operation applies the primary weight, the second when the
string comparison operation applies the secondary weight, and so on.
The number
of levels required to collate strings correctly depends on language and cultural
requirements and therefore varies from one locale to another.
There is also
a level number maximum, associated with the
COLL_WEIGHTS_MAX
setting in the
limits.h
and
sys/localedef.h
files.
On Tru64 UNIX systems, you are limited to six collation
levels (sort directives).
The
backward
directive is used for many languages
to ensure that accented characters sort after unaccented characters only if
the compared strings are otherwise equivalent.
The
position
directive is frequently used to
handle characters, such as the hyphen (-) in Western European languages, whose
significance can be relative to word position.
For example, assume you want
the word "o-ring" to collate in a word list before the word "or-ing",
but do not want the hyphen to be considered until after strings are sorted
by letters alone.
You would need two sort directives and associated sets of
weight specifiers to implement this order.
For the first comparison operation,
you specify
forward
as the sort directive, letters as the
first weights for all letter characters, and
IGNORE
as
the weight for the hyphen character.
For the second, or a later, comparison
operation, you specify
forward,position
as the sort directive,
IGNORE
as the weight for all letter characters, and the hyphen as
the weight for the hyphen character.
If you do not specify a sort directive, the default is
forward.
[3] Collation order statements for elements
These statements specify a character symbol, optionally followed by one or more blank characters (spaces or tabs), then the symbols for characters that have the same weight at each stage of the sort.
In the example, the sort order is control characters, followed by punctuation and digits, and then letters. Letters are sorted on multiple passes, with diacritics and case ignored on the first pass, diacritical marks being significant on the second pass, and case being significant on the third pass.
[4] Collation order statement for characters not specified in other collation order statements
The
UNDEFINED
keyword begins a collation order statement
to be applied to all characters that are defined in the locale's
charmap
file but not specified in other collation order statements.
Characters that fall into the
UNDEFINED
category are considered
in regular expressions to belong to the same equivalence class.
Always
include the
UNDEFINED
collation order statement.
If this
statement is absent, the
localedef
command includes undefined
characters at the end of the collating order and issues a warning.
Furthermore, if you place an
UNDEFINED
statement as the last collation order statement, the
localedef
command can sometimes compress all undefined characters into one entry.
This
action can reduce the size of the locale.
This locale specifies that any characters specified in the locale's
charmap
file, but not handled by other collation order statements,
be ordered last.
An
UNDEFINED
statement can have
an operand.
For example, the
IGNORE
keyword causes any
characters unspecified by other collation order statements to be ignored for
the sort pass in which
IGNORE
appears.
If the following
UNDEFINED
statement had been included in the example, characters
not specified in other collation order statements would be ignored in all
sort passes defined by those statements:
UNDEFINED IGNORE;IGNORE;IGNORE
[5] Trailer to indicate the end of collation order statements
[6] Trailer to indicate the end of the LC_COLLATE section
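To see how the forward;backward;forward directives in the example interact with the three weights per element, consider the following Python sketch. It models each level as a separate sequence of weights and reverses the sequence for a backward pass. The numeric weights, the WEIGHTS table, and the sort_key helper are all hypothetical stand-ins for the symbolic weights in the locale source, so treat the result as an illustration rather than the output of a real locale.

```python
O_CIRC = "\u00f4"   # o with circumflex
E_ACUTE = "\u00e9"  # e with acute

# (primary: base letter, secondary: diacritic, tertiary: case)
WEIGHTS = {
    "c": (3, 0, 0), "C": (3, 0, 1),
    "e": (5, 0, 0), "E": (5, 0, 1), E_ACUTE: (5, 1, 0),
    "o": (15, 0, 0), "O": (15, 0, 1), O_CIRC: (15, 1, 0),
    "t": (20, 0, 0), "T": (20, 0, 1),
}

DIRECTIVES = ("forward", "backward", "forward")


def sort_key(word, directives=DIRECTIVES):
    """Build one weight sequence per level; a backward level is compared from the end."""
    key = []
    for level, directive in enumerate(directives):
        level_weights = [WEIGHTS[ch][level] for ch in word]
        if directive == "backward":
            level_weights.reverse()
        key.append(tuple(level_weights))
    return tuple(key)


words = [
    "c" + O_CIRC + "t" + E_ACUTE,  # cote with both accents
    "cot" + E_ACUTE,
    "c" + O_CIRC + "te",
    "Cote",
    "cote",
]
ordered = sorted(words, key=sort_key)
all_forward = sorted(words, key=lambda w: sort_key(w, ("forward",) * 3))
```

With the backward secondary pass, the accent nearest the end of the word decides first, so the word with only the circumflex sorts before the word with only the acute. With three forward passes those two words swap places.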
Example 6-5
contains only a few of the options that
you can specify when defining the
LC_COLLATE
category.
Additional options allow you to use the following:
A
copy
statement to include the entire
category definition from another locale
If used, a
copy
statement must be the only entry between the
section header and trailer.
Collating order statements that specify a string of characters, rather than single characters, as the collating elements
In such cases, you first specify
collating-element
statements before the
order_start
statement to define symbols
for the strings.
You can then specify those symbols in collating order statements.
For example:
collating-element <ch> from "<c><h>"
.
.
.
order_start forward;forward;backward
.
.
.
<ch> <Ch>;<ch>;<ch>
.
.
.
Symbolic names, such as
<UPPERCASE>,
to use as weight specifiers in collation order statements
You must define each symbolic name
by using the
collating-symbol
statement in the source file
before the
order_start
statement.
You then include the
symbol in the appropriate position in the list of collation order statements
for collating elements.
For example, if you wanted the symbol
<LOW>
to represent the lowest position in the collating order,
<LOW>
would be the line entry immediately following the
order_start
statement.
A symbol such as
<UPPERCASE>
would be positioned on the line immediately preceding the section
of collating order statements for uppercase letters.
A symbol must occur before the first collation order statement in which it is used. Therefore, you cannot define a symbol for the highest position in the collating order.
After symbols are defined and positioned, you can use them as weights in collating order statements. For example:
collating-symbol <LOWERCASE>
collating-symbol <UNACCENTED>
.
.
.
order_start forward;backward;forward;forward
.
.
.
<UNACCENTED>
.
.
.
<LOWERCASE>
<a> <a>;<UNACCENTED>;<LOWERCASE>;IGNORE
.
.
.
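The effect of a collating-element statement such as the one for <ch> can be sketched in Python by splitting each string into collating elements before weights are looked up. The sketch assumes a longest-match rule and an illustrative ordering in which "ch" is placed after "c" (an assumption chosen for the example, not something the sample locale defines); split_collating_elements and collation_key are hypothetical names.

```python
def split_collating_elements(text, multi=("ch",)):
    """Split text into collating elements, preferring the longest match."""
    candidates = sorted(multi, key=len, reverse=True)
    elements = []
    i = 0
    while i < len(text):
        for candidate in candidates:
            if text.startswith(candidate, i):
                elements.append(candidate)
                i += len(candidate)
                break
        else:
            # No multi-character element starts here; use the single character.
            elements.append(text[i])
            i += 1
    return elements


# Illustrative collating order: "ch" is its own element, placed after "c".
ORDER = list("abc") + ["ch"] + list("defghijklmnopqrstuvwxyz")
RANK = {element: index for index, element in enumerate(ORDER)}


def collation_key(word):
    return [RANK[element] for element in split_collating_elements(word)]


names = ["dama", "chico", "cuna"]
by_elements = sorted(names, key=collation_key)
by_characters = sorted(names)
```

Sorting by single characters puts "chico" first, whereas sorting by collating elements places it after "cuna" because "ch" is weighted as one unit.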
Remember that, because Unicode
and dense code locales are equivalent, you can use the same charmaps and locale
source for Unicode and dense code locales.
However, Unicode and dense code
characters that are defined in the charmap but not defined in the
LC_COLLATE
section may be sorted differently.
See locale(4) for details about the LC_COLLATE category definition.
6.2.3 Defining the LC_MESSAGES Locale Category
The
LC_MESSAGES
section of a locale source
file defines strings that are valid for affirmative and negative responses
from users.
Example 6-6
is an
LC_MESSAGES
section.
Example 6-6: LC_MESSAGES Category Definition
LC_MESSAGES [1]
# yes expression. The following designates:
# "^([oO]|[oO][uU][iI])"
yesexpr "<circumflex><left-parenthesis>\
<left-square-bracket><o><O><right-square-bracket>\
<vertical-line><left-square-bracket><o><O>\
<right-square-bracket><left-square-bracket><u><U>\
<right-square-bracket><left-square-bracket><i><I>\
<right-square-bracket><right-parenthesis>" [2]
# no expression. The following designates:
# "^([nN]|[nN][oO][nN])"
noexpr "<circumflex><left-parenthesis>\
<left-square-bracket><n><N><right-square-bracket>\
<vertical-line><left-square-bracket><n><N>\
<right-square-bracket><left-square-bracket><o><O>\
<right-square-bracket><left-square-bracket><n><N>\
<right-square-bracket><right-parenthesis>" [3]
# yes string. The following designates: "oui:o:O"
yesstr "<o><u><i><colon><o><colon><O>" [4]
# no string. The following designates: "non:n:N"
nostr "<n><o><n><colon><n><colon><N>" [5]
END LC_MESSAGES [6]
[1] Section header
[2] Definition of an expression for a valid "yes" response
This entry consists of the
yesexpr
keyword
followed by one or more spaces or tabs, and an extended regular expression
that is delimited by double quotation marks.
This expression specifies that "oui" or "o"
(case is ignored) is a valid affirmative response in this locale.
The regular
expression for
yesexpr
specifies individual characters
by their symbols as defined in the locale's
charmap
file.
[3] Definition of an expression for a valid "no" response
This
entry consists of the
noexpr
keyword followed by one or
more spaces or tabs, and an extended regular expression that is delimited
by double quotation marks.
This expression specifies that "non" or "n" (case is ignored) is a valid negative response in this locale.
[4] Definition of a string for a valid "yes" response
This entry
consists of the
yesstr
keyword followed by one or more spaces
or tabs, and a fixed string that is delimited by double quotation marks.
The
yesstr
entry is marked as LEGACY in the X/Open
UNIX standard and is not included in the POSIX standard; however, some applications
and systems software still might use
yesstr
rather than
yesexpr.
To ensure that your locale works correctly with such software,
you should define
yesstr
in your locale.
The X/Open UNIX
standard defines a single fixed string for
yesstr.
The
colon (:) separator, which allows multiple fixed strings to be specified,
is an extension to the standard definition.
[5] Definition of a string for a valid "no" response
This entry
consists of the
nostr
keyword followed by one or more spaces
or tabs, and a fixed string that is delimited by double quotation marks.
The
nostr
entry is marked as LEGACY in the X/Open
UNIX standard and is not included in the POSIX standard; however, some applications
and systems software still might use
nostr
rather than
noexpr.
To ensure that your locale works correctly with such software,
you should define
nostr
in your locale.
The X/Open UNIX
standard defines a single fixed string for
nostr.
The colon
(:) separator, which allows multiple fixed strings to be specified, is an
extension to the standard definition.
[6] Section trailer
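The two expressions quoted in the comments of the example can be tried out directly. The following Python sketch uses Python's re module as a stand-in for the extended regular expression matching an application would perform; for these particular patterns the two behave the same way. The classify helper is hypothetical.

```python
import re

# The expressions from the comments in the LC_MESSAGES example.
YESEXPR = r"^([oO]|[oO][uU][iI])"
NOEXPR = r"^([nN]|[nN][oO][nN])"

# The colon-separated alternatives from the yesstr entry.
YESSTR = "oui:o:O".split(":")


def classify(response):
    """Return 'yes', 'no', or 'unknown' for a user's response."""
    if re.search(YESEXPR, response):
        return "yes"
    if re.search(NOEXPR, response):
        return "no"
    return "unknown"


answers = [classify(text) for text in ("Oui", "o", "NON", "n", "peut-etre")]
```

Because the expressions are anchored only at the start, any response that begins with "o" or "n" matches; for instance, classify("orange") returns "yes". The fixed strings in yesstr, by contrast, are intended to be compared as whole values.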
As an alternative to specifying symbol definitions,
you can use the
copy
statement between the section header
and trailer to duplicate an existing locale's definition of the
LC_MESSAGES
category.
The
copy
statement represents
a complete definition of the category and cannot be used when explicit symbol
definitions are used.
6.2.4 Defining the LC_MONETARY Locale Category
The
LC_MONETARY
section of the locale source
file defines the rules and symbols used to format monetary values.
Application
developers use the
localeconv()
and
nl_langinfo()
functions to determine the information defined in this section
and apply formatting rules through the
strfmon()
function.
Example 6-7
is an
LC_MONETARY
section.
Example 6-7: LC_MONETARY Category Definition
LC_MONETARY [1]
int_curr_symbol    "<F><R><F><space>" [2]
currency_symbol    "<F>" [2]
mon_decimal_point  "<comma>" [2]
mon_thousands_sep  "" [2]
mon_grouping       3;0 [2]
positive_sign      "" [2]
negative_sign      "<hyphen>" [2]
   .
   .
   .
END LC_MONETARY [3]
Section header [Return to example]
Symbol definitions
The entries in the example specify the following:
The international currency symbol is
FRF
(French Franc) and the local currency symbol is
F
(Franc).
The decimal point is the comma (,).
No character is defined to group digits to the left of the decimal point.
The digits in each grouping to the left of the decimal point in this locale are in groups of three. Because this locale does not define a default monetary thousands separator, the monetary grouping defined in this locale is significant only if the application uses a function to specify a thousands separator.
The positive sign is null.
The negative sign is the minus (-)
character.
Section trailer [Return to example]
The following list describes the symbol names you
can define in the
LC_MONETARY
section.
int_curr_symbol
The international currency symbol
currency_symbol
The local currency symbol
mon_decimal_point
The radix character, or decimal point, used in monetary formats
mon_thousands_sep
The character used to separate groups of digits to the left of the radix character
mon_grouping
The size of each group of digits to the left of the radix character.
The character defined by
mon_thousands_sep, if any, is
inserted between the groups defined by
mon_grouping.
You
can vary the size of groups by specifying multiple digits separated by a semicolon
(;).
For example,
3;2
specifies that the first group to
the left of the radix character contains three digits and all subsequent groups
contain two digits.
On Tru64 UNIX systems,
3;0
and
3
are equivalent; that is, all digits to the left of the decimal
point are grouped by three.
positive_sign
The string indicating that a monetary value is not negative
negative_sign
The string indicating that a monetary value is negative
int_frac_digits
The number of digits to be written to the right of the radix character
when
int_curr_symbol
appears in the format
frac_digits
The number of digits to be written to the right of the radix character
when
currency_symbol
appears in the format
p_cs_precedes
An integer that determines if the international or local currency symbol precedes a nonnegative value
p_sep_by_space
An integer that determines whether a space separates the international or local currency symbol from other parts of a formatted, nonnegative value
n_cs_precedes
An integer that determines if the international or local currency symbol precedes a negative value
n_sep_by_space
An integer that determines whether a space separates the international or local currency symbol from other parts of a formatted, negative value
p_sign_posn
An integer that indicates if or how the positive sign string is positioned in a nonnegative, formatted value
n_sign_posn
An integer that indicates how the negative sign string is positioned in a negative, formatted value
As an alternative to specifying symbol definitions,
you can use the
copy
statement between the section header
and trailer to duplicate an existing locale's definition of
LC_MONETARY.
The
copy
statement represents a complete definition
of the category and cannot be used when explicit symbol definitions are used.
The
LC_MONETARY
definition specifies the euro as the currency for the UTF-8 and ISO8859-15
locales of the languages that have fully adopted the euro.
Because the euro
character is not in the Latin-1 repertoire, the ISO8859-1 locales of
the languages that have adopted the euro continue to use the pre-euro currency.
For example, the Italian locale
it_IT.ISO8859-15
supports the euro; the Italian locale
it_IT.ISO8859-1
supports the lira.
See
locale(4)
for more information about
LC_MONETARY
symbol definitions.
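As a minimal sketch of how an application might consume these definitions, the following C function formats an amount by using the mon_decimal_point, positive_sign, and negative_sign members of the struct lconv that localeconv() returns. The function name format_money is hypothetical. The sketch assumes two fractional digits and always places the sign first; it ignores the grouping, precedence, and sign-position symbols, which strfmon() applies in a real application.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: formats an amount given in hundredths of the
 * currency unit. Falls back to "." and "-" when the locale leaves the
 * radix character or negative sign empty, as the C locale does.
 * Returns the value returned by snprintf().
 */
int format_money(long hundredths, const struct lconv *lc,
                 char *buf, size_t len)
{
    const char *radix = ".";
    const char *sign = "";
    unsigned long mag;

    assert(lc != NULL && buf != NULL);

    if (lc->mon_decimal_point != NULL && strlen(lc->mon_decimal_point) > 0)
        radix = lc->mon_decimal_point;

    if (hundredths < 0) {
        sign = "-";
        if (lc->negative_sign != NULL && strlen(lc->negative_sign) > 0)
            sign = lc->negative_sign;
        mag = 0UL - (unsigned long)hundredths;
    } else {
        if (lc->positive_sign != NULL)
            sign = lc->positive_sign;
        mag = (unsigned long)hundredths;
    }

    return snprintf(buf, len, "%s%lu%s%02lu",
                    sign, mag / 100, radix, mag % 100);
}
```

An application would typically call setlocale(LC_MONETARY, "") once at startup and then pass the result of localeconv() to a function such as this one. With the values shown in Example 6-7 (comma as the radix character, hyphen as the negative sign), an amount of -1250 hundredths is formatted as -12,50.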
6.2.5 Defining the LC_NUMERIC Locale Category
The
LC_NUMERIC
section of the locale source file defines the rules and
symbols used to format numeric data.
You can use the
localeconv()
and
nl_langinfo()
functions to access this
formatting information.
Example 6-8
is an
LC_NUMERIC
section.
Example 6-8: LC_NUMERIC Category Definition
LC_NUMERIC [1]
decimal_point  "<comma>" [2]
thousands_sep  "" [3]
grouping       3;0 [4]
END LC_NUMERIC [5]
Category header. [Return to example]
Definition of radix character (decimal point). [Return to example]
Definition of character used to separate groups of digits to the left of the radix character. In this locale, no default character is defined. Therefore, applications must supply this character, if needed. [Return to example]
The size of each group of digits to the left
of the radix character.
The character defined by
thousands_sep,
if any, is inserted between the groups defined by
grouping.
You can vary the size of groups by specifying multiple digits separated
by a semicolon (;).
For example,
3;2
specifies that the
first group to the left of the radix character contains three digits and all
subsequent groups contain two digits.
On Tru64 UNIX systems,
3;0
and
3
are equivalent; that is, all digits to
the left of the radix character are grouped by three.
[Return to example]
Category trailer. [Return to example]
Example 6-8
contains all of the symbols you can define
in the
LC_NUMERIC
section.
In place of any symbol definitions,
you can specify a
copy
statement between the section header
and trailer to include this section from another locale.
See
locale(4)
for more information about
LC_NUMERIC
symbol
definitions.
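The grouping rule described for grouping and mon_grouping can be sketched in C as follows. The function name group_digits and its parameters are hypothetical; the sketch follows the behavior described in this section, in which the last nonzero size repeats for all remaining digits, so that 3;0 and 3 give the same result. It is an illustration of the rule, not the C Library implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: inserts sep between groups of digits, working
 * leftward from the radix character. sizes[0] is the size of the first
 * group to the left of the radix character; a following size of 0, or
 * the end of the list, repeats the previous size.
 * Returns the length of the result, or -1 if a buffer is too small.
 */
int group_digits(const char *digits, const int *sizes, size_t nsizes,
                 char sep, char *out, size_t outlen)
{
    char rev[128];          /* result built right to left */
    size_t n, r = 0, idx = 0, i;
    int size, in_group = 0;

    assert(digits != NULL && out != NULL);
    assert(sizes != NULL && nsizes > 0 && sizes[0] > 0);

    n = strlen(digits);
    size = sizes[0];

    while (n > 0) {
        if (in_group == size) {
            /* Current group is full: emit a separator, pick next size. */
            if (r + 1 >= sizeof rev)
                return -1;
            rev[r++] = sep;
            in_group = 0;
            if (idx + 1 < nsizes && sizes[idx + 1] > 0) {
                idx++;
                size = sizes[idx];
            }
        }
        if (r + 1 >= sizeof rev)
            return -1;
        rev[r++] = digits[--n];
        in_group++;
    }

    if (r + 1 > outlen)
        return -1;
    for (i = 0; i < r; i++)
        out[i] = rev[r - 1 - i];
    out[r] = '\0';
    return (int)r;
}
```

For the digit string 1234567, the sizes 3;2 with a comma separator produce 12,34,567, and the sizes 3;0 produce 1,234,567. Because the sample locale defines an empty thousands_sep, an application would have to supply the separator character itself, as noted in the description of Example 6-8.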
6.2.6 Defining the LC_TIME Locale Category
The
LC_TIME
section of
a locale source file defines the interpretation of field descriptors supported
by the
date
command.
This section also affects the behavior
of the
strftime(),
wcsftime(),
strptime(), and
nl_langinfo()
functions.
Example 6-9
contains some of the symbols defined for the sample
French locale.
Example 6-9: LC_TIME Category Definition
LC_TIME [1]
abday "<d><i><m>";\
"<l><u><n>";\
"<m><a><r>";\
"<m><e><r>";\
"<j><e><u>";\
"<v><e><n>";\
"<s><a><m>" [2]
day "<d><i><m><a><n><c><h><e>";\
"<l><u><n><d><i>";\
"<m><a><r><d><i>";\
"<m><e><r><c><r><e><d><i>";\
"<j><e><u><d><i>";\
"<v><e><n><d><r><e><d><i>";\
"<s><a><m><e><d><i>" [3]
abmon "<j><a><n>";\
"<f><e-acute><v>";\
"<m><a><r>";\
"<a><v><r>";\
"<m><a><i>";\
"<j><u><n>";\
"<j><u><l>";\
"<a><o><u-circumflex>";\
"<s><e><p>";\
"<o><c><t>";\
"<n><o><v>";\
"<d><e-acute><c>" [4]
mon "<j><a><n><v><i><e><r>";\
"<f><e-acute><v><r><i><e><r>";\
"<m><a><r><s>";\
"<a><v><r><i><l>";\
"<m><a><i>";\
"<j><u><i><n>";\
"<j><u><i><l><l><e><t>";\
"<a><o><u-circumflex><t>";\
"<s><e><p><t><e><m><b><r><e>";\
"<o><c><t><o><b><r><e>";\
"<n><o><v><e><m><b><r><e>";\
"<d><e-acute><c><e><m><b><r><e>" [5]
# date/time format. The following designates this
# format: "%a %e %b %H:%M:%S %Z %Y"
d_t_fmt "<percent-sign><a><space><percent-sign><e>\
<space><percent-sign><b><space><percent-sign><H>\
<colon><percent-sign><M><colon><percent-sign><S>\
<space><percent-sign><Z><space><percent-sign><Y>" [6]
.
.
.
END LC_TIME [7]
Section header [Return to example]
Abbreviated names for days of the week
Use the
%a
conversion specifier to include these
strings in formats.
[Return to example]
Full names for days of the week
Use the
%A
conversion specifier to include these
strings in formats.
[Return to example]
Abbreviated names for months of the year
Use the
%b
conversion specifier to include these
strings in formats.
[Return to example]
Full names for months of the year
Use the
%B
conversion specifier to include these
strings in formats.
[Return to example]
Format for combined date and time information
The format combines field descriptors as defined for the
strftime()
function.
See
strftime(3)
for descriptions of the field descriptors.
The specified format includes the field descriptors for the abbreviated
day of the week (%a), the day of the month (%e), the abbreviated month name (%b), the number of hours in a 24-hour period (%H),
the number of minutes (%M), the number of seconds (%S), the time zone (%Z), and the full representation
of the year (%Y).
If the date were April 23, 1999, on the
East coast of the United States, the format specified in this example would
cause the
date
command to display
ven 23 avr 13:43:05
EDT 1999.
[Return to example]
Section trailer [Return to example]
Example 6-9
includes only some of the symbol definitions
that are standard for the
LC_TIME
category.
LC_TIME also allows you to specify the following standard definitions:
d_fmt
Format for the date alone; corresponds to the
%x
field descriptor
t_fmt
Format for the time alone; corresponds to the
%X
field descriptor
am_pm
Format for the ante meridiem and post meridiem time strings; corresponds
to the
%p
field descriptor
For example, the definition for the English language would be as follows:
am_pm "<A><M>";"<P><M>"
t_fmt_ampm
Format for the time according to the 12-hour clock; corresponds to the
%r
field descriptor
era
Definition of how years are counted and displayed for each era in the locale. This format is for countries that use a year-counting system other than the Gregorian calendar. Such countries often use both the Gregorian calendar and a local era system.
era_d_fmt
Format of the date alone in era notation; corresponds to the
%Ex
field descriptor
era_t_fmt
Format of the time alone in era notation; corresponds to the
%EX
field descriptor
era_d_t_fmt
Format of both date and time in era notation; corresponds to the
%Ec
field descriptor
alt_digits
Definition of alternative symbols for digits; corresponds to the
%O
field descriptor
This format is for countries that include alternative symbols in date strings.
As is true for other category sections, you can specify a
copy
statement to include all
LC_TIME
definitions
from another locale.
The operating system supports symbols and field descriptors
in addition to those described here.
See
locale(4)
for more information about
LC_TIME
definitions.
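The format that Example 6-9 assigns to d_t_fmt can be tried directly with strftime(). The following C sketch passes the same field descriptors as a literal format string; the helper names make_local_tm and format_d_t are hypothetical. Because the sketch does not call setlocale(), it runs in the C locale and produces English day and month names; under the sample French locale, the names defined by abday and abmon would appear instead.

```c
#include <assert.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper: builds a broken-down local time and lets
 * mktime() fill in the derived fields, such as the day of the week.
 */
struct tm make_local_tm(int year, int month, int day,
                        int hour, int min, int sec)
{
    struct tm t;

    memset(&t, 0, sizeof t);
    t.tm_year = year - 1900;
    t.tm_mon = month - 1;
    t.tm_mday = day;
    t.tm_hour = hour;
    t.tm_min = min;
    t.tm_sec = sec;
    t.tm_isdst = -1;        /* let mktime() determine daylight time */
    mktime(&t);
    return t;
}

/*
 * Hypothetical helper: formats t with the descriptors used for d_t_fmt
 * in Example 6-9. Returns the number of bytes written, or 0 if buf is
 * too small.
 */
size_t format_d_t(const struct tm *t, char *buf, size_t len)
{
    assert(t != NULL && buf != NULL);
    return strftime(buf, len, "%a %e %b %H:%M:%S %Z %Y", t);
}
```

For April 23, 1999, at 13:43:05, the result in the C locale begins with Fri 23 Apr 13:43:05 and ends with 1999; the time zone abbreviation between them depends on the TZ setting of the process.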
6.3 Building Libraries to Convert Multibyte and Wide-Character Encodings
C Library routines rely on a set of special interfaces to convert characters to and from data file encoding and wide-character encoding (internal process code). By default, the C Library routines use interfaces that handle only single-byte characters. However, many are defined with entry points that permit use of alternative interfaces for handling multibyte characters. The interfaces that can be tailored to a locale's codeset are called methods.
Locales with multibyte codesets must use methods. Also, some situations require a locale with single-byte codesets to supply methods. For example, a locale must supply a method when the corresponding interface is converting characters between data formats and the interface requires codeset-specific logic to do that operation correctly. However, a method is optional when the corresponding interface is working with data that has already been converted to wide-character format and the interface can apply logic that is valid for both single-byte and multibyte characters.
When a locale supplies a method, it must include a set of required methods as described in Section 6.3.1. See Section 6.3.2 for a description of optional methods.
Methods must be available on the system in a shareable library.
This library and the functions that implement each method in the library
are made known to the
localedef
command through a
methods
file.
When the
localedef
command processes
the
methods
file along with the
charmap
and
locale
source files, the resulting locale includes
pointers to all methods that are supplied with the locale, and pointers to
default implementations for optional methods that are not supplied with the
locale.
When you set the
LANG
variable to the newly built
locale and run a command or application, methods are used wherever they have
been enabled in the system software.
6.3.1 Required Methods
If your locale uses methods, it must supply the following:
__mbstopcs
(Section 6.3.1.1)
__mbtopc
(Section 6.3.1.2)
__pcstombs
(Section 6.3.1.3)
__pctomb
(Section 6.3.1.4)
mblen
(Section 6.3.1.5)
mbstowcs
(Section 6.3.1.6)
mbtowc
(Section 6.3.1.7)
wcstombs
(Section 6.3.1.8)
wctomb
(Section 6.3.1.9)
wcswidth
(Section 6.3.1.10)
wcwidth
(Section 6.3.1.11)
These methods make it possible for C Library functions to convert data
between multibyte and wide-character formats.
6.3.1.1 Writing the __mbstopcs Method
The
fgetws()
function uses
the
__mbstopcs
method to convert the bytes in the
standard I/O (stdio) buffer to a wide-character string.
The function that implements this method must return the number of wide characters
converted by the call.
This method is similar to the one for
mbstowcs()
(see
Section 6.3.1.6) but contains additional parameters
to meet the needs of
fgetws().
By convention, a C source
file for this method has the file name
__mbstopcs_codeset.c, where
codeset
identifies the codeset for which the method is tailored.
Example 6-10
is the file
__mbstopcs_sdeckanji.c,
which defines
the
__mbstopcs
method used with the
ja_JP.sdeckanji
locale.
Example 6-10: The __mbstopcs_sdeckanji Method
#include <stdlib.h> [1]
#include <wchar.h> [1]
#include <sys/localedef.h> [1]
int __mbstopcs_sdeckanji(
    wchar_t *pwcs, [2]
    size_t pwcs_len, [3]
    const char *s, [4]
    size_t s_len, [5]
    int stopchr, [6]
    char **endptr, [7]
    int *err, [8]
    _LC_charmap_t *handle ) [9]
{
    int cnt = 0; [10]
    int pwcs_cnt = 0; [10]
    int s_cnt = 0; [10]
    *err = 0; [11]
    while (1) { [12]
        if (pwcs_cnt >= pwcs_len || s_cnt >= s_len) {
            *endptr = (char *)&(s[s_cnt]);
            break;
        } [13]
        if ((cnt = __mbtopc_sdeckanji(&(pwcs[pwcs_cnt]),
                &(s[s_cnt]), (s_len - s_cnt), err)) == 0) {
            *endptr = (char *)&(s[s_cnt]);
            break;
        } [14]
        pwcs_cnt++; [15]
        if ( s[s_cnt] == (char) stopchr) {
            *endptr = (char *)&(s[s_cnt+1]);
            break;
        } [16]
        s_cnt += cnt; [17]
    } [18]
    return (pwcs_cnt); [19]
}
Include header files that contain constants and structures required for this method. [Return to example]
Points, through
pwcs,
to a buffer that stores the wide-character string.
[Return to example]
Defines a variable,
pwcs_len, to store the size of the
pwcs
buffer.
[Return to example]
Points, through
s,
to a buffer that stores the multibyte character string being converted.
[Return to example]
Defines a variable,
s_len,
to store the number of bytes of data in the
s
buffer.
This parameter is needed because the
fgetws()
function
reads from the standard I/O buffer, which does not contain null-terminated
strings.
[Return to example]
Defines a variable,
stopchr,
to contain a byte value that would force conversion to stop.
This value, typically
\n, is passed to the method
on the call from the
fgetws()
function, which handles
only one line of input for each call.
[Return to example]
Defines a variable,
endptr,
that points to the byte following the last byte converted.
This pointer is needed to specify the starting character in the standard
I/O buffer for the next call to
fgetws().
[Return to example]
Points, through
err,
to a variable that stores execution status for the call made by this method
to the
mbtopc
method.
[Return to example]
Points, through
handle,
to a structure that points to the methods that parse character maps for this
locale.
The
localedef
command creates and stores values in
the
_LC_charmap_t
structure.
[Return to example]
Initializes variables
that indicate the number of bytes that a character uses in multibyte format
(supplied by the
mbtopc
method) and the byte or character
position in buffers that the
fgetws()
function uses.
[Return to example]
Sets
err
to zero (0)
to indicate success.
[Return to example]
Starts the
while
loop
that converts the multibyte string.
[Return to example]
Sets
endptr
and breaks
out of the loop when there is either no more space in the buffer that stores
wide-character data or no more data in the buffer that stores multibyte data.
[Return to example]
Calls the
mbtopc
method
to convert a character from multibyte format to wide-character format.
If the
mbtopc
method fails to convert a character
and returns an error, the program breaks out of the loop and sets
endptr
to the first byte of the character that could not be converted.
The
err
variable contains one of the following status
returns of the call to the
mbtopc
method:
0 indicates success.
-1 indicates an invalid character.
A value greater than 0 indicates that too few bytes remain
in the multibyte character buffer to form a valid character.
In this case,
the return is the number of bytes required to form a valid character.
The
fgetws()
function can then refill the buffer and try again.
Increments the character position in the buffer that stores the wide-character data. [Return to example]
Sets
endptr
to the
character following the character stored in
stopchr
if
the
stopchr
character is encountered in the multibyte data.
[Return to example]
Increments the byte position in the buffer that contains multibyte data. [Return to example]
Ends the
while
loop.
[Return to example]
Returns the number of characters in the buffer that contains wide-character data. [Return to example]
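The control flow of Example 6-10 can be exercised on any system with the following C sketch. It is a simplification under stated assumptions: the names sketch_mbtopc and sketch_mbstopcs are hypothetical, the handle parameter is omitted because the _LC_charmap_t structure is specific to the operating system, and the character converter is a stand-in that treats every byte as one single-byte character.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the mbtopc method of a single-byte codeset.
 * Returns the number of bytes converted; on failure returns 0 and sets
 * *err to the number of bytes needed.
 */
static int sketch_mbtopc(wchar_t *pwc, const char *s, size_t len, int *err)
{
    *err = 0;
    if (len < 1) {
        *err = 1;
        return 0;
    }
    *pwc = (wchar_t)(unsigned char)s[0];
    return 1;
}

/*
 * Converts bytes from s until the wide-character buffer is full, the
 * input is exhausted, a character cannot be converted, or the stop
 * character has been converted. Returns the number of wide characters
 * stored and sets *endptr to the first byte not yet consumed.
 */
int sketch_mbstopcs(wchar_t *pwcs, size_t pwcs_len,
                    const char *s, size_t s_len,
                    int stopchr, const char **endptr, int *err)
{
    size_t pwcs_cnt = 0;
    size_t s_cnt = 0;

    assert(pwcs != NULL && s != NULL && endptr != NULL && err != NULL);
    *err = 0;

    for (;;) {
        int cnt;

        if (pwcs_cnt >= pwcs_len || s_cnt >= s_len) {
            *endptr = &s[s_cnt];
            break;
        }
        cnt = sketch_mbtopc(&pwcs[pwcs_cnt], &s[s_cnt],
                            s_len - s_cnt, err);
        if (cnt == 0) {
            *endptr = &s[s_cnt];
            break;
        }
        pwcs_cnt++;
        if (s[s_cnt] == (char)stopchr) {
            *endptr = &s[s_cnt + cnt];
            break;
        }
        s_cnt += cnt;
    }
    return (int)pwcs_cnt;
}
```

When the 5-byte input ab\ncd is converted with \n as the stop character, the sketch stores three wide characters and leaves the end pointer at the byte that follows the newline, which is where a subsequent call would resume.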
6.3.1.2 Writing the __mbtopc Method
The
getwc()
or
fgetwc()
function calls the
__mbtopc
method to convert a
multibyte character to a wide character.
The method returns the number of
bytes in the multibyte character that is converted.
This method is similar
to the one for
mbtowc
(see
Section 6.3.1.7)
but contains an additional parameter that
getwc()
needs.
By convention, a C source file for this method has the file name
__mbtopc_codeset.c, where
codeset
identifies the codeset for which this method is tailored.
Example 6-11
is the
__mbtopc_sdeckanji.c
file, which defines the
__mbtopc
method
used with the
ja_JP.sdeckanji
locale.
Example 6-11: The __mbtopc_sdeckanji Method
#include <stdlib.h> [1]
#include <wchar.h>
#include <sys/localedef.h>
/*
  The algorithm for this conversion is:
  s[0] <  0x9f:  PC = s[0]
  s[0] =  0x8e:  PC = s[1] + 0x5f;
  s[0] =  0x8f   PC = (((s[1] - 0xa1) << 7) | (s[2] - 0xa1)) + 0x303c
  s[0] >  0xa1:0xa1 < s[1] < 0xfe
                 PC = (((s[0] - 0xa1) << 7) | (s[1] - 0xa1)) + 0x15e
                 0x21 < s[1] < 0x7e
                 PC = (((s[0] - 0xa1) << 7) | (s[1] - 0x21)) + 0x5f1a
  +-----------------+-----------+-----------+-----------+
  |  process code   |   s[0]    |   s[1]    |   s[2]    |
  +-----------------+-----------+-----------+-----------+
  | 0x0000 - 0x009f | 0x00-0x9f |    --     |    --     |
  | 0x00a0 - 0x00ff |    --     |    --     |    --     |
  | 0x0100 - 0x015d | 0x8e      | 0xa1-0xfe |    --     | JIS X0201 RH
  | 0x015e - 0x303b | 0xa1-0xfe | 0xa1-0xfe |    --     | JIS X0208
  | 0x303c - 0x5f19 | 0x8f      | 0xa1-0xfe | 0xa1-0xfe | JIS X0212
  | 0x5f1a - 0x8df7 | 0xa1-0xfe | 0x21-0xfe |    --     | UDC
  +-----------------+-----------+-----------+-----------+
*/ [2]
int __mbtopc_sdeckanji(
    wchar_t *pwc, [3]
    char *ts, [4]
    size_t maxlen, [5]
    int *err, [6]
    _LC_charmap_t *handle ) [7]
{
    wchar_t dummy; [8]
    unsigned char *s = (unsigned char *)ts; [9]
    if (s == NULL)
        return(0); [10]
    if (pwc == (wchar_t *)NULL)
        pwc = &dummy; [11]
    *err = 0; [12]
    if (s[0] <= 0x8d) {
        if (maxlen < 1) {
            *err = 1;
            return(0);
        }
        else {
            *pwc = (wchar_t) s[0];
            return(1);
        }
    } [13]
    else if (s[0] == 0x8e) {
        if (maxlen >= 2) {
            if (s[1] >=0xa1 && s[1] <=0xfe) {
                *pwc = (wchar_t) (s[1] + 0x5f);
                return(2);
            }
        }
        else {
            *err = 2;
            return(0);
        }
    } [14]
    else if (s[0] == 0x8f) {
        if (maxlen >= 3) {
            if ((s[1] >=0xa1 && s[1] <=0xfe) &&
                (s[2] >=0xa1 && s[2] <= 0xfe)) {
                *pwc = (wchar_t) (((s[1] - 0xa1) << 7) |
                    (wchar_t) (s[2] - 0xa1)) + 0x303c;
                return(3);
            }
        }
        else {
            *err = 3;
            return(0);
        }
    } [15]
    else if (s[0] <= 0x9f) {
        if (maxlen < 1) {
            *err = 1;
            return(0);
        }
        else {
            *pwc = (wchar_t) s[0];
            return(1);
        }
    } [16]
    else if (s[0] >= 0xa1 && s[0] <= 0xfe) {
        if (maxlen >= 2) {
            if (s[1] >=0xa1 && s[1] <= 0xfe) {
                *pwc = (wchar_t) (((s[0] - 0xa1) << 7) |
                    (wchar_t) (s[1] - 0xa1)) + 0x15e;
                return(2);
            } else if (s[1] >=0x21 && s[1] <= 0x7e) {
                *pwc = (wchar_t) (((s[0] - 0xa1) << 7) |
                    (wchar_t) (s[1] - 0x21)) + 0x5f1a;
                return(2);
            }
        }
        else {
            *err = 2;
            return(0);
        }
    } [17]
    *err = -1;
    return(0); [18]
}
Include header files that contain constants and structures required for this method. [Return to example]
Describes the algorithm used to determine the number of bytes and valid byte combinations for the different character sets that the codeset supports.
The codeset supports several character sets and each set contains characters of only one length. The value in the first byte indicates the character set and therefore the character length. For character sets with multibyte characters, one or more additional bytes must be examined to determine whether the value sequence identifies a character or is invalid. [Return to example]
Points, through
pwc, to a buffer that stores
the wide character.
[Return to example]
Points, through
ts, to a buffer that stores
the bytes that are passed to the method from the calling function.
[Return to example]
Declares a variable,
maxlen, that stores
the maximum number of bytes in the multibyte data.
This value is passed by the calling function. [Return to example]
Points, through
err, to a buffer that stores
execution status.
[Return to example]
Points, through
handle, to a structure that
contains pointers to the methods that parse the character maps for this locale.
[Return to example]
Declares a variable,
dummy, to which
pwc
can be set to ensure a valid address.
[Return to example]
Casts
ts
(an array of signed characters)
to
s
(an array of unsigned characters).
This operation
prevents problems when integer values are stored in the array and then referenced
by index.
Compilers apply
sign extension
to values
when comparing a small signed data type, such as
char,
to a large signed data type, such as
int.
In this case,
a condition such as the following is evaluated as true when you expect it
to be false:
if (s[0] <= 0x8d)
[Return to example]
Returns zero (0) if the
s
buffer contains
or points to
NULL.
[Return to example]
Points
pwc
to
dummy
if the calling function passes a null pointer for
pwc.
This operation ensures that
pwc
always points to
a valid address.
If this were not the case, the method would produce a segmentation fault
when it stores the converted wide character through
pwc.
[Return to example]
Initializes
err
to zero (0) to indicate
success.
[Return to example]
Determines if the character is one of the single-byte characters that the codeset defines for values equal to or less than 0x8d.
If
s
contains no characters, returns zero (0) to
indicate that no bytes were converted and sets
err
to 1
to indicate that 1 byte is needed to form a valid character.
If the byte value is in the range being tested, moves the associated
process code value to
pwc
and returns 1 to indicate the
number of bytes converted.
[Return to example]
Determines if the character is one of the double-byte characters that the codeset defines for the value 0x8e (first byte) and the value range 0xa1 to 0xfe (second byte).
If yes, moves the associated process code value to the
pwc
buffer and returns 2 to indicate the number of bytes converted;
otherwise, returns 0 to indicate that no conversion took place and sets
err
to 2 to specify that at least 2 bytes are needed to form a valid
character.
[Return to example]
Determines if the character is one of the triple-byte characters that the codeset defines for the value 0x8f (first byte), the range 0xa1 to 0xfe (second byte), and the range 0xa1 to 0xfe (third byte).
If yes, moves the associated process code value to
pwc
and returns 3 to indicate the number of bytes converted; otherwise, sets
err
to 3 to indicate that at least 3 bytes are needed and returns
zero (0) to indicate that no character was converted.
[Return to example]
Determines if the character is one of the single-byte characters that the codeset defines for the range 0x90 to 0x9f.
If there are no bytes in the standard I/O buffer, returns zero (0) to
indicate that no bytes were converted and sets
err
to 1
to indicate that at least 1 byte is needed to form a valid character.
If the byte value is in the defined range, moves the associated process
code value to
pwc
and returns 1 to indicate the number
of bytes converted.
[Return to example]
Determines if the character is one of the double-byte characters that the codeset defines for the range 0xa1 to 0xfe (first byte) and either 0xa1 to 0xfe or 0x21 to 0x7e (second byte).
If yes, moves the associated process code value to the
pwc
buffer and returns 2 to indicate the number of bytes converted; otherwise,
sets
err
to 2 to indicate that at least 2 bytes are needed
to form a valid character and returns zero (0) to indicate that no bytes were
converted.
[Return to example]
Sets
err
to -1 to indicate that an
invalid multibyte sequence was encountered and returns zero (0) to indicate
that no bytes were converted.
These statements execute if the multibyte data in
s
satisfies none of the preceding
if
conditions.
[Return to example]
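The sign-extension problem that the cast to unsigned char avoids can be demonstrated with two small C functions. The function names are hypothetical, and the sketch uses signed char explicitly because the signedness of plain char varies by platform. On the two's complement systems this sketch assumes, the byte 0xa4 becomes the negative value -92 when it is read as a signed character.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Reads the lead byte as a signed character. A byte such as 0xa4 is
 * promoted to the negative int -92, so the comparison is true even
 * though 0xa4 is greater than 0x8d.
 */
int lead_byte_is_low_signed(const char *ts)
{
    signed char c;

    assert(ts != NULL);
    c = (signed char)ts[0];
    return c <= 0x8d;
}

/*
 * Reads the lead byte through an unsigned char pointer, as the methods
 * in this chapter do. The byte 0xa4 is promoted to 164, so the
 * comparison is false as expected.
 */
int lead_byte_is_low_unsigned(const char *ts)
{
    const unsigned char *s;

    assert(ts != NULL);
    s = (const unsigned char *)ts;
    return s[0] <= 0x8d;
}
```

For a string whose first byte is 0xa4, the signed version reports that the byte is in the single-byte range and the unsigned version correctly reports that it is not. Some compilers warn that the signed comparison is always true, which is a useful hint that the test cannot distinguish the ranges.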
6.3.1.3 Writing the __pcstombs Method
The
fputws()
function first
calls the
__pcstombs
method to convert a string
of characters from process (wide-character) code to multibyte code.
If this
method returns -1 to indicate no support by the locale,
fputws()
then calls
putwc()
for each wide character
in the string being converted.
By convention, a C source file for this method
has the file name
__pcstombs_codeset.c, where
codeset
identifies the codeset
for which this method is tailored.
Example 6-12
is
the file
__pcstombs_sdeckanji.c, which defines
the
__pcstombs
method used with the
ja_JP.sdeckanji
locale.
Example 6-12: The __pcstombs_sdeckanji Method
int __pcstombs_sdeckanji()
{
    return -1; [1]
}
Returns -1 to indicate that the locale does not support the method.
This return causes the
fputws()
function to use multiple
calls to
putwc()
to convert wide characters in the string.
[Return to example]
If you choose to implement this method fully rather than writing it to return -1, your function implementation returns the number of wide characters converted and must include header files and parameters as illustrated in the following example:
#include <stdlib.h>
#include <wchar.h>
#include <sys/localedef.h>
int __pcstombs_newcodeset(
    wchar_t *pcsbuf, [1]
    size_t pcsbuf_len, [2]
    char *mbsbuf, [3]
    size_t mbsbuf_len, [4]
    char **endptr, [5]
    int *err, [6]
    _LC_charmap_t *handle ) [7]
Specifies a pointer to a buffer that contains the wide-character string. [Return to example]
Specifies a variable with the length of the wide-character buffer.
This value is passed to the method on the call from
fputws().
[Return to example]
Specifies a pointer to a buffer that contains the multibyte character string. [Return to example]
Specifies a variable with the length of the multibyte character buffer.
This value is passed to the method on the call from
fputws().
[Return to example]
Points, through
endptr, to a pointer to
the byte position in the multibyte character buffer where the next character
would begin if multiple calls to
fputws()
are required
to convert all the wide-character data.
[Return to example]
Specifies a pointer to the execution status return.
If this method calls the
wctomb
method to perform
the character conversion, the
wctomb
method sets this status.
Otherwise, this method must incorporate the logic to perform wide-character
to multibyte character conversion and set the status directly.
In any event, the
fputws()
function expects the following
values:
0 for success
-1 to indicate that the wide-character value is invalid and therefore cannot be converted
A positive value to indicate that the multibyte character buffer contains too few bytes after the last character to store the next character
In this case, the value is the number of bytes required to store the
next character.
The
fputws()
function can then empty the
multibyte character buffer and try again.
Specifies a pointer to the
_LC_charmap_t
structure that stores pointers to the methods used with this locale.
[Return to example]
The
__pcstombs
method performs the reverse
of the operation that the
__mbstopcs
method performs
(as described in
Section 6.3.1.1).
Because of the direction
of the data conversion, the
__pcstombs
method behaves
as follows:
Does not require a variable for a stop conversion character,
such as
\n.
Calls (or implements the operation performed by) the
wctomb
method rather than calling the
mbtowc
method to convert each character and determine the number of bytes it needs
in the multibyte character buffer.
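The following C sketch shows one way a full implementation of this method might be structured, under several assumptions. The names sketch_pctomb and sketch_pcstombs are hypothetical, the handle parameter is omitted because the _LC_charmap_t structure is specific to the operating system, and the per-character converter is a stand-in for a single-byte codeset that accepts only the values 0 through 0xff. The status values follow the convention described above: 0 for success, -1 for an invalid wide character, and a positive byte count when the multibyte buffer is too small.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for a per-character converter. Returns the
 * number of bytes stored, or 0 with *err set on failure.
 */
static int sketch_pctomb(char *mb, size_t room, wchar_t wc, int *err)
{
    *err = 0;
    if ((unsigned long)wc > 0xffUL) {
        *err = -1;          /* value cannot be represented */
        return 0;
    }
    if (room < 1) {
        *err = 1;           /* one more byte of space is needed */
        return 0;
    }
    mb[0] = (char)(unsigned char)wc;
    return 1;
}

/*
 * Converts wide characters until the input is exhausted or a character
 * cannot be stored. Returns the number of wide characters converted and
 * sets *endptr to the next free byte in the multibyte buffer.
 */
int sketch_pcstombs(const wchar_t *pcsbuf, size_t pcsbuf_len,
                    char *mbsbuf, size_t mbsbuf_len,
                    char **endptr, int *err)
{
    size_t pcs_cnt = 0;
    size_t mbs_cnt = 0;

    assert(pcsbuf != NULL && mbsbuf != NULL);
    assert(endptr != NULL && err != NULL);
    *err = 0;

    while (pcs_cnt < pcsbuf_len) {
        int n = sketch_pctomb(&mbsbuf[mbs_cnt], mbsbuf_len - mbs_cnt,
                              pcsbuf[pcs_cnt], err);
        if (n == 0)
            break;
        mbs_cnt += (size_t)n;
        pcs_cnt++;
    }
    *endptr = &mbsbuf[mbs_cnt];
    return (int)pcs_cnt;
}
```

When three wide characters are converted into a 2-byte buffer, the sketch converts two characters, sets the status to 1, and leaves the end pointer just past the second byte, so the caller can empty the buffer and call again for the remaining character.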
6.3.1.4 Writing a __pctomb Method
C
Library functions currently do not use the
__pctomb
interface.
The
putwc()
function, for example, calls the
wctomb
method to convert a character from wide-character to multibyte
character format.
Nonetheless, the
localedef
command requires
a method for this function when your locale supplies methods.
By convention,
a C source file for this method has the file name
__pctomb_codeset.c, where
codeset
identifies the codeset for which this method is tailored.
Example 6-13
is the
__pctomb_sdeckanji.c
file, which defines
the
__pctomb
method used with the
ja_JP.sdeckanji
locale.
Example 6-13: The __pctomb_sdeckanji Method
int __pctomb_sdeckanji()
{
    return -1; [1]
}
Returns -1 to indicate that the locale does not support this method. [Return to example]
6.3.1.5 Writing the mblen Method
The
mblen()
function uses
the
mblen
method to return the number of bytes in a multibyte
character.
By convention, a C source file for this method has the file name
__mblen_codeset.c, where
codeset
identifies the codeset for which this method is tailored.
Example 6-14
is the
__mblen_sdeckanji.c
file, which defines the
mblen
method used with the
ja_JP.sdeckanji
locale.
Example 6-14: The __mblen_sdeckanji Method
#include <stdlib.h> [1]
#include <wchar.h>
#include <sys/errno.h>
#include <sys/localedef.h>
/*
  The algorithm for this conversion is:
  s[0] <  0x9f:  1 byte
  s[0] =  0x8e:  2 bytes
  s[0] =  0x8f   3 bytes
  s[0] >  0xa1   2 bytes
  +-----------------+-----------+-----------+-----------+
  |  process code   |   s[0]    |   s[1]    |   s[2]    |
  +-----------------+-----------+-----------+-----------+
  | 0x0000 - 0x009f | 0x00-0x9f |    --     |    --     |
  | 0x00a0 - 0x00ff |    --     |    --     |    --     |
  | 0x0100 - 0x015d | 0x8e      | 0xa1-0xfe |    --     | JIS X0201 RH
  | 0x015e - 0x303b | 0xa1-0xfe | 0xa1-0xfe |    --     | JIS X0208
  | 0x303c - 0x5f19 | 0x8f      | 0xa1-0xfe | 0xa1-0xfe | JIS X0212
  | 0x5f1a - 0x8df7 | 0xa1-0xfe | 0x21-0xfe |    --     | UDC
  +-----------------+-----------+-----------+-----------+
*/ [2]
int __mblen_sdeckanji(
    char *fs, [3]
    size_t maxlen, [4]
    _LC_charmap_t *handle ) [5]
{
    const unsigned char *s = (void *) fs; [6]
    if (s == NULL || *s == '\0')
        return(0); [7]
    if (maxlen < 1) {
        _Seterrno(EILSEQ);
        return((size_t)-1);
    } [8]
    if (s[0] <= 0x8d)
        return(1); [9]
    else if (s[0] == 0x8e) {
        if (maxlen >= 2 && s[1] >=0xa1 && s[1] <=0xfe)
            return(2);
    } [10]
    else if (s[0] == 0x8f) {
        if(maxlen >=3 && (s[1] >=0xa1 && s[1] <=0xfe) &&
            (s[2] >=0xa1 && s[2] <= 0xfe))
            return(3);
    } [11]
    else if (s[0] <= 0x9f)
        return(1); [12]
    else if (s[0] >= 0xa1) {
        if (maxlen >=2 && (s[0] <= 0xfe) )
            if ( (s[1] >=0xa1 && s[1] <= 0xfe) ||
                (s[1] >=0x21 && s[1] <= 0x7e) )
                return(2);
    } [13]
    _Seterrno(EILSEQ);
    return((size_t)-1); [14]
}
Includes header files that contain constants and structures required by this method. [Return to example]
Describes the algorithm used to determine the number of bytes in the character and whether it is a valid byte sequence.
The codeset supports several character sets and each set contains characters of only one length. The value in the first byte indicates the character set and therefore the character length. For character sets with multibyte characters, one or more additional bytes must be examined to determine whether the value sequence identifies a character or is invalid. [Return to example]
Points, through
fs, to a buffer that stores
the byte string to be examined.
[Return to example]
Defines a variable,
maxlen, that stores
the maximum length of a multibyte character.
This value is passed to the method by the
mblen()
function.
[Return to example]
Points, through
handle, to a structure that
stores pointers to the methods that parse character maps for this locale.
[Return to example]
Casts
fs
(an array of signed characters)
to
s
(an array of unsigned characters).
This operation prevents problems when integer values are stored in the
array and then referenced by index.
Compilers apply sign extension to values
when comparing a small signed data type, such as
char,
to a large signed data type, such as
int.
In this case,
a condition such as the following is evaluated as true when you expect it
to be false:
if (s[0] <= 0x8d)
[Return to example]
Returns zero (0) to indicate that the character length is zero
(0) bytes if
s
contains or points to
NULL.
[Return to example]
Returns -1 and sets errno to [EILSEQ] (invalid character sequence) if maxlen (the maximum number of bytes to consider) is 0; maxlen is unsigned, so it cannot be negative.
To set errno in a way that works correctly with multithreaded applications, use _Seterrno rather than an assignment statement.
[Return to example]
Determines if the first byte identifies a single-byte character whose value is equal to or less than 0x8d.
If yes, returns 1 to indicate that the character length is 1 byte. [Return to example]
Determines if the first byte identifies a double-byte character whose first byte contains the value 0x8e and second byte contains a value in the range 0xa1 to 0xfe.
If yes, returns 2 to indicate that the character length is 2 bytes. [Return to example]
Determines if the first byte identifies a triple-byte character whose first byte contains the value 0x8f and whose second and third bytes contain a value in the range 0xa1 to 0xfe.
If yes, returns 3 to indicate that the character length is 3 bytes. [Return to example]
Determines if the first byte identifies a single-byte character whose value is equal to or less than 0x9f.
If yes, returns 1 to indicate that the character length is 1 byte. [Return to example]
Determines if the first byte identifies a double-byte character whose first byte contains a value in the range 0xa1 to 0xfe and whose second byte contains a value in the range 0xa1 to 0xfe (JIS X0208) or 0x21 to 0x7e (UDC).
If yes, returns 2 to indicate that the character length is 2 bytes. [Return to example]
Returns -1 and sets
errno
to
[EILSEQ]
to indicate an invalid multibyte sequence.
These statements execute if the multibyte data in the standard I/O buffer
satisfies none of the preceding
if
conditions.
[Return to example]
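The length rules described in the preceding callouts can be exercised outside the locale framework. The following is a minimal sketch under stated assumptions, not the system method: the names sdeckanji_char_len and signed_byte_passes_test are hypothetical, the _LC_charmap_t handle and the errno handling are omitted, and a NULL input is treated as a programming error. The second function illustrates why the method views its input as unsigned bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a system interface): returns the byte length of
   the character at the start of buf, 0 for an empty string, or -1 for an
   invalid or truncated sequence. */
int sdeckanji_char_len(const char *buf, size_t maxlen)
{
    /* View the bytes as unsigned so that values above 0x7f compare as
       0x80-0xff rather than as negative numbers. */
    const unsigned char *s = (const unsigned char *)buf;

    assert(buf != NULL);
    if (s[0] == '\0')
        return 0;
    if (maxlen < 1)
        return -1;
    if (s[0] <= 0x8d || (s[0] >= 0x90 && s[0] <= 0x9f))
        return 1;                                   /* single byte */
    if (s[0] == 0x8e)                               /* JIS X0201 RH */
        return (maxlen >= 2 && s[1] >= 0xa1 && s[1] <= 0xfe) ? 2 : -1;
    if (s[0] == 0x8f)                               /* JIS X0212 */
        return (maxlen >= 3 && s[1] >= 0xa1 && s[1] <= 0xfe
                            && s[2] >= 0xa1 && s[2] <= 0xfe) ? 3 : -1;
    if (s[0] >= 0xa1 && s[0] <= 0xfe && maxlen >= 2) {
        if ((s[1] >= 0xa1 && s[1] <= 0xfe) ||       /* JIS X0208 */
            (s[1] >= 0x21 && s[1] <= 0x7e))         /* UDC */
            return 2;
    }
    return -1;              /* 0xa0, 0xff, short buffer, or bad trail byte */
}

/* Returns nonzero if a signed byte with the bit pattern 0xa4 passes the
   test "<= 0x8d". It does, because the value is promoted to the int -92. */
int signed_byte_passes_test(void)
{
    signed char c = -92;    /* same bit pattern as 0xa4 */
    return c <= 0x8d;
}
```

A caller would pass the number of bytes remaining in its buffer as maxlen and advance by the returned length, stopping on 0 or -1.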
The
mbstowcs()
function
uses the
mbstowcs
method to convert a multibyte character
string to process wide-character code and to return the number of resultant
wide characters.
By convention, a C source file for this method has the file name __mbstowcs_codeset.c, where codeset identifies the codeset for which this method is tailored.
Example 6-15 is the __mbstowcs_sdeckanji.c file, which defines the mbstowcs method used with the ja_JP.sdeckanji locale.
Example 6-15: The __mbstowcs_sdeckanji.c File
#include <stdlib.h> [1]
#include <wchar.h>
#include <sys/localedef.h>
size_t __mbstowcs_sdeckanji(
    wchar_t *pwcs, [2]
    const char *s, [3]
    size_t n, [4]
    _LC_charmap_t *handle ) [5]
{
    int len = n; [6]
    int rc; [7]
    int cnt; [8]
    wchar_t *pwcs0 = pwcs; [9]
    int mb_cur_max; [10]
    if (s == NULL)
        return (0); [11]
    mb_cur_max = MB_CUR_MAX; [12]
    if (pwcs == (wchar_t *)NULL) {
        cnt = 0;
        while (*s != '\0') {
            if ((rc = __mblen_sdeckanji(s, mb_cur_max, handle)) == -1)
                return(-1);
            cnt++ ;
            s += rc;
        }
        return(cnt);
    } [13]
    while (len-- > 0) {
        if ( *s == '\0') {
            *pwcs = (wchar_t) '\0';
            return (pwcs - pwcs0);
        }
        if ((cnt = __mbtowc_sdeckanji(pwcs, s, mb_cur_max, handle)) < 0)
            return(-1);
        s += cnt;
        ++pwcs;
    } [14]
    return (n); [15]
}
Includes header files that contain constants and structures required for this method. [Return to example]
Points, through
pwcs, to a buffer that contains
the wide-character string.
[Return to example]
Points, through
s, to a buffer that contains
the multibyte character string.
[Return to example]
Defines a variable, n, that contains the maximum number of wide characters that the pwcs buffer can hold.
[Return to example]
Points, through
handle, to a structure that
stores pointers to the methods that parse character maps for this locale.
[Return to example]
Assigns the number of wide characters in the
pwcs
buffer (the
n
value supplied by the calling
function) to
len.
[Return to example]
Defines a variable,
rc, that stores the
return count from a call this method makes to the
mblen
function.
[Return to example]
Defines a variable,
cnt, that counts the
bytes used by characters in the
s
buffer.
[Return to example]
Saves the start of the wide-character string passed by the
calling function in the
pwcs0
variable.
[Return to example]
Defines a variable,
mb_cur_max, that is
later set to
MB_CUR_MAX
and used in a call to the
mblen
method.
[Return to example]
Returns zero (0) if
s
is
NULL.
A method should return zero (0) if the locale's character encoding is stateless and a nonzero value if the locale's character encoding is stateful. [Return to example]
Assigns the value defined for
MB_CUR_MAX
to
mb_cur_max
for use on the following call to the
mblen
method.
[Return to example]
Checks to see if a NULL pointer was passed from the calling function and, if yes, calls the mblen method to count the wide characters that the converted string requires (excluding the null terminator).
You can request the size of the pwcs buffer (for memory allocation purposes) by passing a NULL pointer as the pwcs parameter in the call to mbstowcs().
You can then use the return value to efficiently allocate memory space for the application's wide-character buffer before calling mbstowcs() again to actually convert the multibyte string.
[Return to example]
Converts bytes in the multibyte character buffer by calling the mbtowc method until a null character (end-of-string) is encountered.
Stops processing and returns the number of wide characters in the pwcs buffer if a null character is encountered; increments the byte position in the multibyte character buffer by an appropriate number each time a character is successfully converted.
This while loop uses the condition len-- > 0 to ensure that processing stops when the pwcs buffer is full.
The first if condition in the loop makes sure that, if the multibyte string in the s buffer is null terminated, the associated null terminator in the pwcs buffer is not included in the wide-character count that the mbstowcs() function returns to the application.
[Return to example]
Returns the value in
n
to indicate the resultant
number of wide characters in the
pwcs
buffer.
This statement executes if the
pwcs
buffer runs out
of space before a null is encountered in the
s
buffer.
[Return to example]
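From the application side, the size query described in the callouts above leads to a two-pass calling pattern. The following is a minimal sketch, assuming that the system's mbstowcs() accepts a NULL destination as described for this method; the helper name mbs_to_new_wcs is hypothetical, and the conversion uses whatever locale is current.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: converts the multibyte string mbs to a newly
   allocated wide-character string. Returns NULL on an invalid sequence or
   an allocation failure. The caller frees the result. */
wchar_t *mbs_to_new_wcs(const char *mbs)
{
    size_t count;
    wchar_t *wcs;

    assert(mbs != NULL);

    /* Pass 1: a NULL destination asks only for the number of wide
       characters required, excluding the null terminator. */
    count = mbstowcs(NULL, mbs, 0);
    if (count == (size_t)-1)
        return NULL;

    wcs = malloc((count + 1) * sizeof *wcs);
    if (wcs == NULL)
        return NULL;

    /* Pass 2: convert the string; count + 1 leaves room for the terminator. */
    if (mbstowcs(wcs, mbs, count + 1) == (size_t)-1) {
        free(wcs);
        return NULL;
    }
    return wcs;
}
```

Sizing the buffer from the first call avoids guessing a worst-case length such as the byte length of the input.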
The
mbtowc()
function uses the
mbtowc
method to convert a multibyte character to a wide character and
to return the number of bytes in the multibyte character that was converted.
By convention, a C source file for this method has the file name __mbtowc_codeset.c, where codeset identifies the codeset for which this method is tailored.
Example 6-16 is the __mbtowc_sdeckanji.c file, which defines the mbtowc method used with the ja_JP.sdeckanji locale.
Example 6-16: The __mbtowc_sdeckanji.c File
#include <stdlib.h> [1]
#include <wchar.h>
#include <sys/errno.h>
#include <sys/localedef.h>
/*
The algorithm for this conversion is:
s[0] < 0x9f: PC = s[0]
s[0] = 0x8e: PC = s[1] + 0x5f;
s[0] = 0x8f PC = (((s[1] - 0xa1) << 7) | (s[2] - 0xa1)) + 0x303c
s[0] > 0xa1:0xa1 < s[1] < 0xfe
PC = (((s[0] - 0xa1) << 7) | (s[1] - 0xa1)) + 0x15e
0x21 < s[1] < 0x7e
PC = (((s[0] - 0xa1) << 7) | (s[1] - 0x21)) + 0x5f1a
+-----------------+-----------+-----------+-----------+
| process code | s[0] | s[1] | s[2] |
+-----------------+-----------+-----------+-----------+
| 0x0000 - 0x009f | 0x00-0x9f | -- | -- |
| 0x00a0 - 0x00ff | -- | -- | -- |
| 0x0100 - 0x015d | 0x8e | 0xa1-0xfe | -- | JIS X0201 RH
| 0x015e - 0x303b | 0xa1-0xfe | 0xa1-0xfe | -- | JIS X0208
| 0x303c - 0x5f19 | 0x8f | 0xa1-0xfe | 0xa1-0xfe | JIS X0212
| 0x5f1a - 0x8df7 | 0xa1-0xfe | 0x21-0x7e | -- | UDC
+-----------------+-----------+-----------+-----------+
*/ [2]
int __mbtowc_sdeckanji(
    wchar_t *pwc, [3]
    const char *ts, [4]
    size_t maxlen, [5]
    _LC_charmap_t *handle ) [6]
{
    unsigned char *s = (unsigned char *)ts; [7]
    wchar_t dummy; [8]
    if (s == NULL)
        return(0); [9]
    if (maxlen < 1) {
        _Seterrno(EILSEQ);
        return((size_t)-1);
    } [10]
    if (pwc == (wchar_t *)NULL)
        pwc = &dummy; [11]
    if (s[0] <= 0x8d) {
        *pwc = (wchar_t) s[0];
        if (s[0] != '\0')
            return(1);
        else
            return(0);
    } [12]
    else if (s[0] == 0x8e) {
        if ( (maxlen >= 2) && ((s[1] >=0xa1) && (s[1] <=0xfe))) {
            *pwc = (wchar_t) (s[1] + 0x5f); /* 0x100 - 0xa1 */
            return(2);
        }
    } [13]
    else if (s[0] == 0x8f) {
        if((maxlen >= 3) && (((s[1] >=0xa1) && (s[1] <=0xfe))
           && ((s[2] >=0xa1) && (s[2] <= 0xfe)))) {
            *pwc = (wchar_t) (((s[1] - 0xa1) << 7) |
                   (wchar_t) (s[2] - 0xa1)) + 0x303c;
            return(3);
        }
    } [14]
    else if (s[0] <= 0x9f) {
        *pwc = (wchar_t) s[0];
        if (s[0] != '\0')
            return(1);
        else
            return(0);
    } [15]
    else if (((s[0] >= 0xa1) && (s[0] <= 0xfe)) && (maxlen >= 2)){
        if (((s[1] >=0xa1) && (s[1] <= 0xfe))){
            *pwc = (wchar_t) (((s[0] - 0xa1) << 7) |
                   (wchar_t)(s[1] - 0xa1)) + 0x15e;
            return(2);
        } else if (((s[1] >=0x21) && (s[1] <= 0x7e))){
            *pwc = (wchar_t) (((s[0] - 0xa1) << 7) |
                   (wchar_t)(s[1] - 0x21)) + 0x5f1a;
            return(2);
        }
    } [16]
    _Seterrno(EILSEQ);
    return(-1); [17]
}
Includes header files that contain constants and structures required for this method. [Return to example]
Describes the algorithm used to determine the number of bytes in the character and whether it is a valid byte sequence.
The codeset supports several character sets and each set contains characters of only one length. The value in the first byte indicates the character set and therefore the character length. For character sets with multibyte characters, one or more additional bytes must be examined to determine whether the value sequence identifies a character or is invalid. [Return to example]
Points, through
pwc, to a buffer that contains
the wide character.
[Return to example]
Points, through
ts, to a buffer that contains
values in multibyte character format.
[Return to example]
Defines a variable,
maxlen, that stores
the maximum length of a multibyte character.
This value is passed from the calling function; the value will have
been set to
MB_CUR_MAX
on the original call made by the
application programmer.
[Return to example]
Points, through
handle, to a structure that
stores pointers to the methods that parse character maps for this locale.
[Return to example]
Casts ts (an array of signed characters) to s (an array of unsigned characters).
This operation prevents problems when integer values are stored in the array and then referenced by index.
Compilers apply sign extension to values when comparing a small signed data type, such as char, to a large signed data type, such as int.
Without the cast, a byte value such as 0xa4 is treated as a negative number, and a condition such as the following would be evaluated as true when you would expect it to be false:
if (s[0] <= 0x8d)
[Return to example]
Defines a variable,
dummy, that can be assigned
to
pwc
to ensure
pwc
points to a valid
address.
[Return to example]
Returns zero (0) to indicate that the locale's character encoding is stateless if s is a NULL pointer.
If passed a NULL pointer, this method should return a value to indicate whether the locale's character encoding is stateful or stateless.
Return a nonzero value if your locale's character encoding is stateful.
[Return to example]
Returns -1 cast to
size_t
and sets
errno
to
[EILSEQ]
(invalid byte
sequence) if the multibyte data buffer is less than 1 byte in length.
[Return to example]
Sets pwc to the address of dummy if the calling function passed a NULL pointer in the pwc parameter.
This operation ensures that pwc always points to a valid address; otherwise, the method could produce a segmentation fault by storing a wide character through a NULL pointer.
[Return to example]
Determines if the first byte identifies a single-byte character whose value is equal to or less than 0x8d.
If yes, stores the associated process code value in the pwc buffer and returns 1 to indicate that the character length is 1 byte, or returns 0 if the character is the null character.
[Return to example]
Determines if the first byte identifies a double-byte character whose first byte contains the value 0x8e and second byte contains a value in the range 0xa1 to 0xfe.
If yes, stores the associated process code value in the
pwc
buffer and returns 2 to indicate that the character length is 2
bytes.
[Return to example]
Determines if the first byte identifies a triple-byte character whose first byte contains the value 0x8f and whose second and third bytes contain a value in the range 0xa1 to 0xfe.
If yes, stores the associated process code value in the
pwc
buffer and returns 3 to indicate that the character length is 3
bytes.
[Return to example]
Determines if the first byte identifies a single-byte character whose value is equal to or less than 0x9f.
If yes, stores the associated process code value in the
pwc
buffer and returns 1 to indicate that the character length is 1
byte.
[Return to example]
Determines if the first byte identifies a double-byte character whose first byte contains a value in the range 0xa1 to 0xfe and whose second byte contains a value in the range 0xa1 to 0xfe (JIS X0208) or 0x21 to 0x7e (UDC).
If yes, stores the associated process code value in the
pwc
buffer and returns 2 to indicate that the character length is 2
bytes.
[Return to example]
Returns -1 and sets
errno
to
[EILSEQ]
to indicate that an invalid multibyte sequence
was encountered.
These statements execute if the multibyte data in the
s
buffer satisfies none of the preceding
if
conditions.
[Return to example]
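The conversion arithmetic in the algorithm comment can be checked with a few boundary values. The following is a minimal sketch rather than the system method: the name sdeckanji_decode is hypothetical, the process code is held in an unsigned long instead of wchar_t, and the handle parameter and errno handling are omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Returns nonzero if b lies in the inclusive range lo..hi. */
static int in_range(unsigned char b, unsigned char lo, unsigned char hi)
{
    return b >= lo && b <= hi;
}

/* Hypothetical helper: stores the process code of the character at s in
   *pc and returns the number of bytes consumed (0 for the null character),
   or -1 for an invalid or truncated sequence. */
int sdeckanji_decode(const unsigned char *s, size_t maxlen, unsigned long *pc)
{
    assert(s != NULL && pc != NULL);
    if (maxlen < 1)
        return -1;
    if (s[0] <= 0x8d || in_range(s[0], 0x90, 0x9f)) {
        *pc = s[0];                                   /* single byte */
        return s[0] != '\0' ? 1 : 0;
    }
    if (s[0] == 0x8e && maxlen >= 2 && in_range(s[1], 0xa1, 0xfe)) {
        *pc = (unsigned long)(s[1] + 0x5f);           /* JIS X0201 RH */
        return 2;
    }
    if (s[0] == 0x8f && maxlen >= 3 && in_range(s[1], 0xa1, 0xfe)
                                    && in_range(s[2], 0xa1, 0xfe)) {
        *pc = (unsigned long)((((s[1] - 0xa1) << 7) | (s[2] - 0xa1)) + 0x303c);
        return 3;                                     /* JIS X0212 */
    }
    if (in_range(s[0], 0xa1, 0xfe) && maxlen >= 2) {
        if (in_range(s[1], 0xa1, 0xfe)) {             /* JIS X0208 */
            *pc = (unsigned long)((((s[0] - 0xa1) << 7) | (s[1] - 0xa1)) + 0x15e);
            return 2;
        }
        if (in_range(s[1], 0x21, 0x7e)) {             /* UDC */
            *pc = (unsigned long)((((s[0] - 0xa1) << 7) | (s[1] - 0x21)) + 0x5f1a);
            return 2;
        }
    }
    return -1;
}
```

For example, the byte pair 0xfe 0xfe yields ((0x5d << 7) | 0x5d) + 0x15e = 0x303b, the last JIS X0208 process code, and 0xfe 0x7e yields 0x2edd + 0x5f1a = 0x8df7, the last UDC process code.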
The
wcstombs()
function
calls the
wcstombs
method to convert a wide-character string
to a multibyte character string and to return the number of bytes in the resultant
multibyte character string.
By convention, a C source file for this method has the file name __wcstombs_codeset.c, where codeset identifies the codeset for which this method is tailored.
Example 6-17 is the __wcstombs_sdeckanji.c file, which defines the wcstombs method used with the ja_JP.sdeckanji locale.
Example 6-17: The __wcstombs_sdeckanji.c File
#include <stdlib.h> [1]
#include <wchar.h>
#include <limits.h>
#include <sys/localedef.h>
size_t __wcstombs_sdeckanji(
    char *s, [2]
    const wchar_t *pwcs, [3]
    size_t n, [4]
    _LC_charmap_t *handle ) [5]
{
    int cnt=0; [6]
    int len=0; [7]
    int i=0; [8]
    char tmps[MB_LEN_MAX+1]; [9]
    if ( s == (char *)NULL) {
        cnt = 0;
        while (*pwcs != (wchar_t)'\0') {
            if ((len = __wctomb_sdeckanji(tmps, *pwcs, handle)) == -1)
                return(-1);
            cnt += len;
            pwcs++;
        }
        return(cnt);
    } [10]
    if (*pwcs == (wchar_t)'\0') {
        *s = '\0';
        return(0);
    } [11]
    while (1) { [12]
        if ((len = __wctomb_sdeckanji(tmps, *pwcs, handle)) == -1)
            return(-1); [13]
        else if (cnt+len > n) {
            *s = '\0';
            break;
        } [14]
        if (tmps[0] == '\0') {
            *s = '\0';
            break;
        } [15]
        for (i=0; i<len; i++) {
            *s = tmps[i];
            s++;
        } [16]
        cnt += len; [17]
        if (cnt == n)
            break; [18]
        pwcs++; [19]
    } [20]
    if (cnt == 0)
        cnt = len; [21]
    return (cnt); [22]
}
Includes header files that contain constants and structures required for this method. [Return to example]
Points, through
s, to a buffer that stores
the multibyte character string that this method passes to the calling function.
[Return to example]
Points, through
pwcs, to a buffer that stores
the wide-character string that is being converted.
[Return to example]
Defines a variable,
n, that stores the maximum
number of bytes in the multibyte character string buffer.
This value is supplied by the calling function. [Return to example]
Points, through
handle,
to a structure that
points to the methods that parse character maps for this locale.
[Return to example]
Initializes a variable,
cnt, that is incremented
by the number of bytes (len) of each converted character.
[Return to example]
Initializes a variable,
len, that stores
the length of each converted character.
[Return to example]
Initializes a variable,
i, that is used
to index the bytes in each multibyte character when moving a converted character
from temporary storage to
s.
[Return to example]
Defines a temporary buffer,
tmps, that stores
the multibyte character returned to this method from a call to the
wctomb
method.
[Return to example]
Checks to see if a NULL pointer was passed from the calling function in the s parameter.
If yes, calls the wctomb method to calculate the number of bytes required for converted characters (excluding the null terminator) in the multibyte character buffer.
You can request the size of the s buffer (for memory allocation purposes) by passing a NULL pointer as the s parameter on the call to wcstombs().
You can then use the return value to efficiently allocate memory space for the application's multibyte character buffer before calling wcstombs() again to actually convert the wide-character string.
[Return to example]
Returns zero (0) to indicate that no multibyte characters resulted and stores a null terminator in s if pwcs points to a null wide character (an empty string).
[Return to example]
Starts a
while
loop to process characters
in the wide-character string.
[Return to example]
Converts characters in the wide-character buffer by calling
the
wctomb
method; returns -1 to indicate an invalid
character if
wctomb
returns -1.
[Return to example]
Terminates s with a null byte and breaks out of the while loop if there is no room in s for the character just converted by wctomb.
[Return to example]
Moves a null terminator to s and breaks out of the while loop when the character just converted into tmps is the null character, which marks the end of the wide-character string.
[Return to example]
Appends each byte in tmps to s if the current wide character is not the null character.
[Return to example]
Increments
cnt
by the number of bytes (len) occupied by this character in multibyte format.
[Return to example]
Breaks out of the
while
loop without adding
a null terminator if the number of bytes processed equals
n
(the maximum number of bytes in
s).
[Return to example]
Increments
pwcs
to point to the next wide
character to be converted.
[Return to example]
Ends the
while
loop that converts each wide
character.
[Return to example]
Sets cnt to len (the byte length of the first converted character) if no bytes were stored in s, which happens when s does not contain enough space for even one character.
[Return to example]
Returns the number of bytes in the resultant multibyte character string. [Return to example]
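The size query for wcstombs() supports the same two-pass pattern in the opposite direction. The following is a minimal sketch, assuming that the system's wcstombs() accepts a NULL destination as described for this method; the helper name wcs_to_new_mbs is hypothetical, and the conversion uses whatever locale is current.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: converts the wide-character string wcs to a newly
   allocated multibyte string. Returns NULL if a wide character cannot be
   converted or if allocation fails. The caller frees the result. */
char *wcs_to_new_mbs(const wchar_t *wcs)
{
    size_t bytes;
    char *mbs;

    assert(wcs != NULL);

    /* Pass 1: a NULL destination asks only for the number of bytes
       required, excluding the null terminator. */
    bytes = wcstombs(NULL, wcs, 0);
    if (bytes == (size_t)-1)
        return NULL;

    mbs = malloc(bytes + 1);
    if (mbs == NULL)
        return NULL;

    /* Pass 2: convert the string; bytes + 1 leaves room for the terminator. */
    if (wcstombs(mbs, wcs, bytes + 1) == (size_t)-1) {
        free(mbs);
        return NULL;
    }
    return mbs;
}
```

Because the buffer has room for the terminator, the second call always produces a null-terminated string, which is not guaranteed when the byte count exactly fills the buffer.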
The
wctomb()
function calls
the
wctomb
method to convert a wide character to a multibyte
character and to return the number of bytes in the resultant multibyte character.
By convention, a C source file for this method has the file name __wctomb_codeset.c, where codeset identifies the codeset for which this method is tailored.
Example 6-18 is the __wctomb_sdeckanji.c file, which defines the wctomb method for the ja_JP.sdeckanji locale.
Example 6-18: The __wctomb_sdeckanji.c File
#include <stdlib.h> [1]
#include <wchar.h>
#include <sys/errno.h>
#include <sys/localedef.h>
/*
The algorithm for this conversion is:
PC <= 0x009f: s[0] = PC
PC >= 0x0100 and PC <=0x015d: s[0] = 0x8e
s[1] = PC - 0x005f
PC >= 0x015e and PC <=0x303b: s[0] = ((PC - 0x015e) >> 7) + 0x00a1
s[1] = ((PC - 0x015e) & 0x007f) + 0x00a1
PC >= 0x303c and PC <=0x5f19: s[0] = 0x8f
s[1] = ((PC - 0x303c) >> 7) + 0x00a1
s[2] = ((PC - 0x303c) & 0x007f) + 0x00a1
PC >= 0x5f1a and PC <=0x8df7 s[0] = ((PC - 0x5f1a) >> 7) + 0x00a1
s[1] = ((PC - 0x5f1a) & 0x007f) + 0x0021
+-----------------+-----------+-----------+-----------+
| process code | s[0] | s[1] | s[2] |
+-----------------+-----------+-----------+-----------+
| 0x0000 - 0x009f | 0x00-0x9f | -- | -- |
| 0x00a0 - 0x00ff | -- | -- | -- |
| 0x0100 - 0x015d | 0x8e | 0xa1-0xfe | -- | JIS X0201 RH
| 0x015e - 0x303b | 0xa1-0xfe | 0xa1-0xfe | -- | JIS X0208
| 0x303c - 0x5f19 | 0x8f | 0xa1-0xfe | 0xa1-0xfe | JIS X0212
| 0x5f1a - 0x8df7 | 0xa1-0xfe | 0x21-0x7e | -- | UDC
+-----------------+-----------+-----------+-----------+
*/ [2]
int __wctomb_sdeckanji(
    char *s, [3]
    wchar_t wc, [4]
    _LC_charmap_t *handle ) [5]
{
    if (s == (char *)NULL)
        return(0); [6]
    if (wc <= 0x9f) {
        s[0] = (char) wc;
        return(1);
    } [7]
    else if ((wc >= 0x0100) && (wc <= 0x015d)) {
        s[0] = 0x8e;
        s[1] = wc - 0x5f;
        return(2);
    } [8]
    else if ((wc >=0x015e) && (wc <= 0x303b)) {
        s[0] = (char) (((wc - 0x015e) >> 7) + 0x00a1);
        s[1] = (char) (((wc - 0x015e) & 0x007f) + 0x00a1);
        return(2);
    } [9]
    else if ((wc >=0x303c) && (wc <= 0x5f19)) {
        s[0] = 0x8f;
        s[1] = (char) (((wc - 0x303c) >> 7) + 0x00a1);
        s[2] = (char) (((wc - 0x303c) & 0x007f) + 0x00a1);
        return(3);
    } [10]
    else if ((wc >=0x5f1a) && (wc <= 0x8df7)) {
        s[0] = (char) (((wc - 0x5f1a) >> 7) + 0x00a1);
        s[1] = (char) (((wc - 0x5f1a) & 0x007f) + 0x0021);
        return(2);
    } [11]
    _Seterrno(EILSEQ);
    return(-1); [12]
}
Includes header files that contain constants and structures required for this method. [Return to example]
Describes the conversion algorithm that this method uses.
Each character set supported by the codeset corresponds to a unique range of wide-character (process code) values. Within each character set, multibyte characters are of uniform length (1, 2, or 3 bytes). Therefore, the range in which each wide-character value falls indicates the number of bytes required for the character in multibyte format. The wide-character value itself determines the specific byte value or values for the character in multibyte format. [Return to example]
Points, through
s, to a buffer that stores
the multibyte character.
[Return to example]
Defines the
wc
variable that stores the
wide character.
[Return to example]
Points, through
handle, to a structure that
stores pointers to the methods that parse the character maps for this locale.
[Return to example]
Returns zero (0) if s is a NULL pointer.
No character is converted in this case; the zero return value also indicates that the locale's character encoding is stateless.
[Return to example]
If the wide-character value is equal to or less than 0x9f,
moves that value into the first byte of the
s
array and
returns 1 to indicate that the converted character is 1 byte in length.
[Return to example]
If the wide-character value is in the range 0x0100 to 0x015d,
moves the value 0x8e to the first byte and a calculated value to the second
byte of the
s
array; returns 2 to indicate that the converted
character is 2 bytes in length.
[Return to example]
If the wide-character value is in the range 0x015e to 0x303b,
moves calculated values to the first and second bytes of the
s
array and returns 2 to indicate that the converted character is 2 bytes in
length.
[Return to example]
If the wide-character value is in the range 0x303c to 0x5f19,
moves 0x8f to the first byte and calculated values to the second and third
bytes of the
s
array; returns 3 to indicate that the converted
character is 3 bytes in length.
[Return to example]
If the wide-character value is in the range 0x5f1a to 0x8df7,
moves calculated values to the first and second bytes of the
s
array, and returns 2 to indicate that the converted character is 2 bytes in
length.
[Return to example]
Sets
errno
to
[EILSEQ]
and returns -1 to indicate that the wide-character
value is invalid.
These statements execute if the wide-character values satisfy none of the preceding conditions. [Return to example]
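The reverse mapping can be checked against the same boundary values as the multibyte-to-wide conversion. The following is a minimal sketch rather than the system method: the name sdeckanji_encode is hypothetical, the process code is passed as an unsigned long instead of wchar_t, and the handle parameter and errno handling are omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: writes the multibyte form of process code pc to s
   (which must have room for 3 bytes) and returns the number of bytes
   written, or -1 if pc falls outside every defined range. */
int sdeckanji_encode(unsigned long pc, unsigned char *s)
{
    unsigned long off;

    assert(s != NULL);
    if (pc <= 0x9f) {                          /* single byte */
        s[0] = (unsigned char)pc;
        return 1;
    }
    if (pc >= 0x0100 && pc <= 0x015d) {        /* JIS X0201 RH */
        s[0] = 0x8e;
        s[1] = (unsigned char)(pc - 0x5f);
        return 2;
    }
    if (pc >= 0x015e && pc <= 0x303b) {        /* JIS X0208 */
        off = pc - 0x015e;
        s[0] = (unsigned char)((off >> 7) + 0xa1);
        s[1] = (unsigned char)((off & 0x7f) + 0xa1);
        return 2;
    }
    if (pc >= 0x303c && pc <= 0x5f19) {        /* JIS X0212 */
        off = pc - 0x303c;
        s[0] = 0x8f;
        s[1] = (unsigned char)((off >> 7) + 0xa1);
        s[2] = (unsigned char)((off & 0x7f) + 0xa1);
        return 3;
    }
    if (pc >= 0x5f1a && pc <= 0x8df7) {        /* UDC */
        off = pc - 0x5f1a;
        s[0] = (unsigned char)((off >> 7) + 0xa1);
        s[1] = (unsigned char)((off & 0x7f) + 0x21);
        return 2;
    }
    return -1;                                 /* for example, 0x00a0-0x00ff */
}
```

Like the method in the example, the sketch checks only the range of the process code; it does not reject values whose low 7 bits exceed 0x5d, which cannot be produced by the multibyte-to-wide conversion.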
The
wcswidth()
function
uses the
wcswidth
method to determine the number of columns
required to display a wide-character string.
By convention, a C source file for this method has the file name __wcswidth_codeset.c, where codeset identifies the codeset for which this method is tailored.
Example 6-19 is the __wcswidth_sdeckanji.c file, which defines the wcswidth method used for the ja_JP.sdeckanji locale.
Example 6-19: The __wcswidth_sdeckanji.c File
#include <stdlib.h> [1]
#include <wchar.h>
#include <sys/localedef.h>
/*
The algorithm for this conversion is:
PC <= 0x009f: s[0] = PC
PC >= 0x0100 and PC <=0x015d: s[0] = 0x8e
s[1] = PC - 0x005f
PC >= 0x015e and PC <=0x303b: s[0] = ((PC - 0x015e) >> 7) + 0x00a1
s[1] = ((PC - 0x015e) & 0x007f) + 0x00a1
PC >= 0x303c and PC <=0x5f19: s[0] = 0x8f
s[1] = ((PC - 0x303c) >> 7) + 0x00a1
s[2] = ((PC - 0x303c) & 0x007f) + 0x00a1
PC >= 0x5f1a and PC <=0x8df7 s[0] = ((PC - 0x5f1a) >> 7) + 0x00a1
s[1] = ((PC - 0x5f1a) & 0x007f) + 0x0021
+-----------------+-----------+-----------+-----------+
| process code | s[0] | s[1] | s[2] |
+-----------------+-----------+-----------+-----------+
| 0x0000 - 0x009f | 0x00-0x9f | -- | -- |
| 0x00a0 - 0x00ff | -- | -- | -- |
| 0x0100 - 0x015d | 0x8e | 0xa1-0xfe | -- | JIS X0201 RH
| 0x015e - 0x303b | 0xa1-0xfe | 0xa1-0xfe | -- | JIS X0208
| 0x303c - 0x5f19 | 0x8f | 0xa1-0xfe | 0xa1-0xfe | JIS X0212
| 0x5f1a - 0x8df7 | 0xa1-0xfe | 0x21-0x7e | -- | UDC
+-----------------+-----------+-----------+-----------+
*/ [2]
int __wcswidth_sdeckanji(
    const wchar_t *wcs, [3]
    size_t n, [4]
    _LC_charmap_t *hdl ) [5]
{
    int len; [6]
    int i; [7]
    if (wcs == (wchar_t *)NULL || *wcs == (wchar_t)'\0')
        return(0); [8]
    len = 0; [9]
    for (i=0; wcs[i] != (wchar_t)'\0' && i<n; i++) { [10]
        if (wcs[i] <= 0x9f)
            len += 1; [11]
        else if ((wcs[i] >= 0x0100) && (wcs[i] <= 0x015d))
            len += 1; [12]
        else if ((wcs[i] >=0x015e) && (wcs[i] <= 0x303b))
            len += 2; [13]
        else if ((wcs[i] >=0x303c) && (wcs[i] <= 0x5f19))
            len += 2; [14]
        else if ((wcs[i] >=0x5f1a) && (wcs[i] <= 0x8df7))
            len += 2; [15]
        else
            return(-1); [16]
    } [17]
    return(len); [18]
}
Includes header files that contain constants and structures required for this method. [Return to example]
Describes the algorithm used to determine the required display width.
Each character's display width is either 1 or 2 columns, depending on the character set to which a character belongs. Display width is different from the size of the character in multibyte format; for example, triple-byte characters require 2 display columns and double-byte characters can require either 1 or 2 display columns. [Return to example]
Points, through
wcs, to a buffer that stores
the wide-character string for which display width information is requested.
[Return to example]
Defines a variable,
n, that stores the maximum
size of the
wcs
buffer.
[Return to example]
Points, through
hdl, to a structure that
stores pointers to the methods that parse character maps for this locale.
[Return to example]
Defines a variable, len, that stores the display width in columns.
[Return to example]
Defines a variable,
i, that functions as
a loop counter.
[Return to example]
Returns zero (0) if wcs is a NULL pointer or points to a null wide character.
[Return to example]
Initializes
len
to zero (0).
[Return to example]
Begins a for loop that processes each wide character in the wcs buffer, incrementing the index i, until a null wide character is encountered or n wide characters have been processed.
[Return to example]
Increments
len
by 1 if the value of the
current wide character is less than or equal to 0x9f.
[Return to example]
Increments
len
by 1 if the value of the
current wide character is in the range 0x0100 to 0x015d.
[Return to example]
Increments
len
by 2 if the value of the
current wide character is in the range 0x015e to 0x303b.
[Return to example]
Increments
len
by 2 if the value of the
current wide character is in the range 0x303c to 0x5f19.
[Return to example]
Increments
len
by 2 if the value of the
current wide character is in the range 0x5f1a to 0x8df7.
[Return to example]
Returns -1 to indicate that the string contains an invalid wide character.
This statement executes if a value that satisfies none of the preceding
conditions is encountered in the string.
The calling function,
wcswidth(), also returns -1 if the wide character is nonprintable;
however, this condition is evaluated at the level of the calling function
and does not need to be evaluated by the method.
[Return to example]
Ends the
for
loop that processes wide characters
in the
wcs
buffer.
[Return to example]
Returns
len
to indicate the number of columns
required to display the wide-character string.
[Return to example]
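The width rules above reduce to a range lookup per wide character plus a sum over the string. The following is a minimal sketch rather than the system methods: the names sdeckanji_width and sdeckanji_string_width are hypothetical, process codes are held in unsigned long values instead of wchar_t, and the handle parameter is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the number of display columns for one
   process code, 0 for the null character, or -1 if the value is invalid. */
int sdeckanji_width(unsigned long pc)
{
    if (pc == 0)
        return 0;
    if (pc <= 0x9f)
        return 1;                         /* single-byte characters */
    if (pc >= 0x0100 && pc <= 0x015d)
        return 1;                         /* JIS X0201 RH: 2 bytes, 1 column */
    if (pc >= 0x015e && pc <= 0x8df7)
        return 2;                         /* JIS X0208, JIS X0212, UDC */
    return -1;
}

/* Hypothetical helper: sums the widths of at most n process codes, stopping
   at a null terminator. Returns -1 if any value is invalid. */
int sdeckanji_string_width(const unsigned long *pcs, size_t n)
{
    size_t i;
    int total = 0;

    assert(pcs != NULL);
    for (i = 0; i < n && pcs[i] != 0; i++) {
        int w = sdeckanji_width(pcs[i]);
        if (w < 0)
            return -1;
        total += w;
    }
    return total;
}
```

A string holding the process codes 0x41, 0x100, 0x15e, and 0x303c therefore needs 1 + 1 + 2 + 2 = 6 columns, although its multibyte form occupies 1 + 2 + 2 + 3 = 8 bytes.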
The
wcwidth()
function uses
the
wcwidth
method to determine the number of columns required
to display a wide character.
By convention, a C source file for this method has the file name __wcwidth_codeset.c, where codeset identifies the codeset for which this method is tailored.
Example 6-20 is the __wcwidth_sdeckanji.c file, which defines the wcwidth method used with the ja_JP.sdeckanji locale.
Example 6-20: The __wcwidth_sdeckanji.c File
#include <stdlib.h> [1]
#include <wchar.h>
#include <sys/localedef.h>
/*
The algorithm for this conversion is:
PC <= 0x009f: s[0] = PC
PC >= 0x0100 and PC <=0x015d: s[0] = 0x8e
s[1] = PC - 0x005f
PC >= 0x015e and PC <=0x303b: s[0] = ((PC - 0x015e) >> 7) + 0x00a1
s[1] = ((PC - 0x015e) & 0x007f) + 0x00a1
PC >= 0x303c and PC <=0x5f19: s[0] = 0x8f
s[1] = ((PC - 0x303c) >> 7) + 0x00a1
s[2] = ((PC - 0x303c) & 0x007f) + 0x00a1
PC >= 0x5f1a and PC <=0x8df7 s[0] = ((PC - 0x5f1a) >> 7) + 0x00a1
s[1] = ((PC - 0x5f1a) & 0x007f) + 0x0021
+-----------------+-----------+-----------+-----------+
| process code | s[0] | s[1] | s[2] |
+-----------------+-----------+-----------+-----------+
| 0x0000 - 0x009f | 0x00-0x9f | -- | -- |
| 0x00a0 - 0x00ff | -- | -- | -- |
| 0x0100 - 0x015d | 0x8e | 0xa1-0xfe | -- | JIS X0201 RH
| 0x015e - 0x303b | 0xa1-0xfe | 0xa1-0xfe | -- | JIS X0208
| 0x303c - 0x5f19 | 0x8f | 0xa1-0xfe | 0xa1-0xfe | JIS X0212
| 0x5f1a - 0x8df7 | 0xa1-0xfe | 0x21-0x7e | -- | UDC
+-----------------+-----------+-----------+-----------+
*/ [2]
int __wcwidth_sdeckanji(
    wint_t wc, [3]
    _LC_charmap_t *hdl ) [4]
{
    if (wc == 0)
        return(0); [5]
    if (wc <= 0x9f)
        return(1); [6]
    else if ((wc >= 0x0100) && (wc <= 0x015d))
        return(1); [7]
    else if ((wc >=0x015e) && (wc <= 0x303b))
        return(2); [8]
    else if ((wc >=0x303c) && (wc <= 0x5f19))
        return(2); [9]
    else if ((wc >=0x5f1a) && (wc <= 0x8df7))
        return(2); [10]
    return(-1); [11]
}
Includes header files that contain constants and structures required for this method. [Return to example]
Describes the algorithm used to determine the required display width.
A character's display width is either 1 or 2 columns, depending on the character set to which a character belongs. Display width is different from the size of the character in multibyte format; for example, triple-byte characters require 2 display columns and double-byte characters can require either 1 or 2 display columns. [Return to example]
Defines the
wc
variable that stores the
wide character for which display width information is requested.
[Return to example]
Points, through
hdl, to a structure that
stores pointers to the methods that parse character maps for this locale.
[Return to example]
Returns zero (0) if the wide character is the null character. [Return to example]
Returns 1 if the wide-character value is less than or equal to 0x009f. [Return to example]
Returns 1 if the wide-character value is in the range 0x0100 to 0x015d. [Return to example]
Returns 2 if the wide-character value is in the range 0x015e to 0x303b. [Return to example]
Returns 2 if the wide-character value is in the range 0x303c to 0x5f19. [Return to example]
Returns 2 if the wide-character value is in the range 0x5f1a to 0x8df7. [Return to example]
Returns -1 if the wide-character value is invalid.
The calling function,
wcwidth(), also returns -1
if the wide character is nonprintable; however, this condition is evaluated
at the level of the calling function and does not need to be evaluated by
the method.
[Return to example]
A locale can include optional
methods in addition to the required methods discussed in
Section 6.3.1.
A method is considered optional if a default method is applied in the absence
of a method specification.
That is, if your locale uses methods but does not
supply any methods for the functions associated with particular locale categories
or some other locale-related functions, the
localedef
command
applies default methods that handle process code for both single-byte and
multibyte characters.
Writing optional methods requires detailed information about the internal interfaces to C Library routines. This information is vendor proprietary and may be subject to change. Thus, optional method descriptions in this section are less complete than the descriptions for required methods.
In the rare cases in which your locale must include an optional method, contact your technical support representative to request information.
The following list names the optional methods:
LC_CTYPE category: towupper, towlower, wctype, iswctype
LC_COLLATE category: fnmatch, strcoll, strxfrm, wcscoll, wcsxfrm, regcomp, regexec, regfree, regerror
LC_MONETARY, LC_NUMERIC, or both categories: localeconv, strfmon
LC_TIME category: strftime, strptime, wcsftime
LC_MESSAGES category: rpmatch
Miscellaneous use: nl_langinfo
6.3.3 Building a Shareable Library to Use with a Locale
Example 6-21
contains the compiler and linker command lines that are required to build
the method source files into a shareable library that is used with the
ja_JP.sdeckanji
locale.
Example 6-21: Building a Library of Methods Used with the ja_JP.sdeckanji Locale
cc -std0 -c \
    __mblen_sdeckanji.c __mbstopcs_sdeckanji.c \
    __mbstowcs_sdeckanji.c __mbtopc_sdeckanji.c \
    __mbtowc_sdeckanji.c __pcstombs_sdeckanji.c \
    __pctomb_sdeckanji.c __wcstombs_sdeckanji.c \
    __wcswidth_sdeckanji.c __wctomb_sdeckanji.c \
    __wcwidth_sdeckanji.c
ld -shared -set_version osf.1 -soname libsdeckanji.so -shared \
    -no_archive -o libsdeckanji.so \
    __mblen_sdeckanji.o __mbstopcs_sdeckanji.o \
    __mbstowcs_sdeckanji.o __mbtopc_sdeckanji.o \
    __mbtowc_sdeckanji.o __pcstombs_sdeckanji.o __pctomb_sdeckanji.o \
    __wcstombs_sdeckanji.o __wcswidth_sdeckanji.o __wctomb_sdeckanji.o \
    __wcwidth_sdeckanji.o \
    -lc
See cc(1) and ld(1) for more information about these commands.
6.3.4 Creating a methods File for a Locale
The
methods
file
contains an entry for each function that is defined in the methods shared
library for use with the locale.
The operation performed by the function is
identified by a method keyword, followed by quoted strings with the name of
the function and the path to the shared library that contains the function.
Example 6-22
illustrates the section of a
methods
file for the methods used with the
ja_JP.sdeckanji
locale.
Because you must define a list of required methods if
you want to override any C Library interfaces, your
methods
file must always specify an entry for each required method as shown in this
example.
The
ja_JP.sdeckanji
locale relies on default implementations
for all optional methods, and so the example does not contain entries for
any of the optional methods.
Example 6-22: The methods File for the ja_JP.sdeckanji Locale
# sdeckanji.m [1]
# <method_keyword> "<entry>" "<package>" "<library_path>" [1]
METHODS [2]
__mbstopcs "__mbstopcs_sdeckanji" "libsdeckanji.so" \
    "/usr/shlib/libsdeckanji.so" [3]
__mbtopc "__mbtopc_sdeckanji" "libsdeckanji.so" \
    "/usr/shlib/libsdeckanji.so" [3]
__pcstombs "__pcstombs_sdeckanji" "libsdeckanji.so" \
    "/usr/shlib/libsdeckanji.so" [3]
__pctomb "__pctomb_sdeckanji" "libsdeckanji.so" \
    "/usr/shlib/libsdeckanji.so" [3]
mblen "__mblen_sdeckanji" "libsdeckanji.so" \
    "/usr/shlib/libsdeckanji.so" [3]
mbstowcs "__mbstowcs_sdeckanji" "libsdeckanji.so" \
    "/usr/shlib/libsdeckanji.so" [3]
mbtowc "__mbtowc_sdeckanji" "libsdeckanji.so" \
    "/usr/shlib/libsdeckanji.so" [3]
wcstombs "__wcstombs_sdeckanji" "libsdeckanji.so" \
    "/usr/shlib/libsdeckanji.so" [3]
wcswidth "__wcswidth_sdeckanji" "libsdeckanji.so" \
    "/usr/shlib/libsdeckanji.so" [3]
wctomb "__wctomb_sdeckanji" "libsdeckanji.so" \
    "/usr/shlib/libsdeckanji.so" [3]
wcwidth "__wcwidth_sdeckanji" "libsdeckanji.so" \
    "/usr/shlib/libsdeckanji.so" [3]
END METHODS [4]
[1] Comment lines. These lines specify the name of the methods file and the format of method entries. The field identified in the format as <package> is ignored, but you must specify some string for this field in order to specify a library path.
[2] Header to mark start of method entries
[3] Entries for required methods
[4] Trailer to mark end of method entries
See localedef(1) for more information about methods file entries.
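The entries in Example 6-22 follow a regular pattern, so a methods file of this shape can be generated instead of typed. The following Bourne-shell sketch prints such a file. The helper name emit_methods_file and the variable KEYWORDS are illustrative; the keywords, function-naming pattern, and library path are taken from the example, and the use of tabs as field separators is an assumption.

```shell
#!/bin/sh
# Sketch: print a methods file in the layout of Example 6-22.
# emit_methods_file and KEYWORDS are illustrative names, not system
# interfaces. Tab separators are an assumption.

# Method keywords as they appear in Example 6-22.
KEYWORDS="__mbstopcs __mbtopc __pcstombs __pctomb mblen mbstowcs mbtowc wcstombs wcswidth wctomb wcwidth"

# emit_methods_file SUFFIX PACKAGE LIBRARY_PATH
emit_methods_file() {
    echo "METHODS"
    for kw in $KEYWORDS; do
        # Function names in the example are "__", then the keyword
        # without its own leading underscores, then "_SUFFIX".
        func="__${kw#__}_$1"
        # The <package> field is ignored but must be present so that
        # the library path can be given as the last field.
        printf '%s\t"%s" "%s" \\\n\t\t"%s"\n' "$kw" "$func" "$2" "$3"
    done
    echo "END METHODS"
}

emit_methods_file sdeckanji libsdeckanji.so /usr/shlib/libsdeckanji.so
```

Redirecting the output to a file such as sdeckanji.m gives a starting point that you can then compare against the example.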
6.4 Building and Testing the Locale
Use the
localedef
command to build a locale from its source files.
Example 6-23
is the command line needed to build the French locale used in most examples
in this chapter.
Assume for this example that all source files reside in the
user's default directory and that the resulting locale is also created in
that directory.
Example 6-23: Building the fr_FR.ISO8859-1@example Locale
% localedef -f ISO8859-1.cmap \ [1]
    -i fr_FR.ISO8859-1.src \ [2]
    fr_FR.ISO8859-1@example [3]
[1] The -f option specifies the character map source file.
[2] The -i option specifies the locale definition source file.
[3] The final argument to the command is the name of the locale.
When you are testing locales, particularly
ones that are similar to standard locales installed on the system, add an
extension to the locale name.
Varying names with the at (@)
extension allows you to specify the standard strings for language, territory,
and codeset and still be sure that the test locale is uniquely identified.
This is important if you later decide to move the locale to the
/usr/lib/nls/loc
directory, where other locales reside.
Example 6-23
contains only one form and a few
options for the
localedef
command.
See localedef(1) for a complete description of the command.
The following is a summary of some important rules and options:
If
you defined methods for your locale, you must specify the
methods
file with the
-m
option.
For example, the command
line that builds the
ja_JP.sdeckanji
locale would include
-m sdeckanji.m
to identify the file shown in
Example 6-22.
You
can use the
-v
option to run the command in verbose mode for
debugging purposes.
This option, when used with the
-c
option,
creates a
.c
file that contains useful information about
the locale.
Use the -w option if you want the command to display warnings when it encounters duplicate definitions.
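The options summarized above can be combined in a small wrapper. The following Bourne-shell sketch assembles a localedef command line and prints it for review; it does not run the command. The helper name localedef_cmd is illustrative, and the choice to always include -w is a design decision of this sketch, not a requirement.

```shell
#!/bin/sh
# Sketch: assemble a localedef command line from the options described
# in this section. localedef_cmd is an illustrative helper; it prints
# the command so that it can be reviewed before it is run.

# localedef_cmd CHARMAP SOURCE LOCALE_NAME [METHODS_FILE]
localedef_cmd() {
    # -w asks for warnings about duplicate definitions.
    cmd="localedef -w -f $1 -i $2"
    # -m is needed only when the locale defines its own methods.
    if [ -n "${4:-}" ]; then
        cmd="$cmd -m $4"
    fi
    # The locale name is the final argument.
    echo "$cmd $3"
}

localedef_cmd ISO8859-1.cmap fr_FR.ISO8859-1.src fr_FR.ISO8859-1@example
```

For a locale that uses methods, pass the methods file as the fourth argument, for example sdeckanji.m for the ja_JP.sdeckanji locale. The charmap and locale source file names for that locale are not given in this section, so substitute your own.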
By default,
locales must reside in the
/usr/lib/nls/loc
directory
to be found.
If you want to test your locale before moving it to the
/usr/lib/nls/loc
directory, you can define the
LOCPATH
variable to specify the directory where your locale is located.
You can then define the
LANG
environment variable to be
your new locale and interactively test the locale with commands and applications.
Example 6-24
uses the
date
command
to test the date/time format.
Example 6-24: Setting the LOCPATH Variable and Testing a Locale
% setenv LOCPATH ~harry/locales
% setenv LANG fr_FR.ISO8859-1@example
% date
ven 23 avr 13:43:05 EDT 1999
Note
The LOCPATH variable is an extension to specifications in the X/Open UNIX standard and therefore may not be recognized on all systems that conform to this standard.
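Example 6-24 uses C shell syntax and changes the environment of the interactive shell. As an alternative, the following Bourne-shell sketch runs a single command with the test locale in a subshell, so the calling environment is left unchanged. The helper name with_test_locale and the directory $HOME/locales are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: run one command with a test locale without changing the
# calling shell's environment. with_test_locale is an illustrative
# helper name, not a system command.

# with_test_locale LOCALE_DIR LOCALE_NAME COMMAND [ARGS...]
with_test_locale() {
    dir=$1
    name=$2
    shift 2
    (
        # LC_ALL takes precedence over LANG, so clear it first.
        unset LC_ALL
        LOCPATH=$dir
        LANG=$name
        export LOCPATH LANG
        "$@"
    )
}

# Test the date/time format, as in Example 6-24.
with_test_locale "$HOME/locales" fr_FR.ISO8859-1@example date
```

Category variables such as LC_TIME also take precedence over LANG. If any are set in your environment, clear them as well before you judge the result.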
Some programs have support files that are installed
in system directories with names that exactly match the names of standard
locales.
In such cases, application software, system software, or both might
use the value of the
LANG
environment variable to determine
the locale-specific directory in which the support files reside.
If assigned
directly to the
LANG
or
LC_ALL
environment
variable, locale file names with an at (@) suffix may result in invalid search
paths for some applications.
The following example illustrates how you can work around this problem
by assigning the standard locale name to the
LANG
variable
and the name of your variant locale to the locale category variables.
You
need to make assignments only to those category variables that represent areas
where your locale differs from the locale on which it is based.
% setenv LANG fr_FR.ISO8859-1
% setenv LC_CTYPE fr_FR.ISO8859-1@example
% setenv LC_COLLATE fr_FR.ISO8859-1@example
.
.
.
% setenv LC_TIME fr_FR.ISO8859-1@example
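The same workaround can be expressed for Bourne-compatible shells. The following sketch keeps the standard locale name in LANG and assigns the variant name only to the category variables that you list. The helper name use_variant_locale is illustrative; the accepted category names are the ones discussed in this chapter.

```shell
#!/bin/sh
# Sketch: keep the standard locale name in LANG and assign the variant
# name only to selected category variables. use_variant_locale is an
# illustrative helper name, not a system command.

# use_variant_locale BASE VARIANT CATEGORY...
use_variant_locale() {
    base=$1
    variant=$2
    shift 2
    LANG=$base
    export LANG
    for category in "$@"; do
        case $category in
            LC_CTYPE|LC_COLLATE|LC_MONETARY|LC_NUMERIC|LC_TIME|LC_MESSAGES)
                export "$category=$variant"
                ;;
            *)
                # Reject anything that is not a locale category variable.
                echo "use_variant_locale: unknown category $category" >&2
                return 1
                ;;
        esac
    done
}

# Demonstrate in a subshell so that the calling environment is unchanged.
(
    use_variant_locale fr_FR.ISO8859-1 fr_FR.ISO8859-1@example \
        LC_CTYPE LC_COLLATE LC_TIME
    echo "LANG=$LANG LC_TIME=$LC_TIME"
)
```

List only the categories in which your locale differs from the locale on which it is based. Applications that build search paths from LANG then continue to see the standard name.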