

1   Character Sets

The Tru64 UNIX operating system supports the following Chinese character sets: CNS 11643, DTSCS, Big-5, GB2312-80, Extended GB, GBK, GB18030-2000, Unicode, and ISO/IEC 10646.

The CNS 11643 and Big-5 character sets are commonly used for traditional Chinese characters. Also, the traditional Chinese Input Server optionally supports DTSCS input. The GB2312-80, GBK, Extended GB, and GB18030-2000 character sets are commonly used for simplified Chinese characters. The Unicode and ISO/IEC 10646 character sets are common to both traditional and simplified Chinese.




1.1   CNS 11643

The CNS (Chinese National Standard) 11643 character set standard was published by the National Bureau of Standards of Taiwan in 1986 and updated in 1992. It is also called the "Standard Interchange Code for Generally-used Chinese Characters" (SICGCC).

CNS 11643 provides 16 character planes for defining Chinese characters. Each character plane is divided into 94 rows, and each row has 94 columns, so each plane can accommodate 8,836 (94 x 94) characters. Character planes 1-11 are reserved for standard Chinese characters, while character planes 12-16 are user-defined areas.
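
The plane, row, and column arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names and the zero-based linear index are assumptions made for this example and are not part of CNS 11643 or of the operating system.

```python
# Geometry taken from the text: 16 planes, each 94 rows by 94 columns.
PLANES = 16
ROWS = 94
COLUMNS = 94


def cells_per_plane():
    """Number of code positions in one plane (94 x 94)."""
    return ROWS * COLUMNS


def linear_index(plane, row, column):
    """Zero-based position of a 1-based (plane, row, column) triple.

    Hypothetical helper for illustration only.
    """
    if not (1 <= plane <= PLANES and 1 <= row <= ROWS and 1 <= column <= COLUMNS):
        raise ValueError("plane must be 1-16, row and column must be 1-94")
    return ((plane - 1) * ROWS + (row - 1)) * COLUMNS + (column - 1)


def is_user_defined(plane):
    """Planes 12-16 are user-defined areas; planes 1-11 hold standard characters."""
    return 12 <= plane <= PLANES


print(cells_per_plane())        # 8836
print(linear_index(2, 1, 1))    # 8836, the first cell of plane 2
print(is_user_defined(14))      # True
```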

Figure 1-1: CNS 11643 Character Planes


The original CNS 11643 standard, published in 1986, defines certain groups of characters only on the first and second character planes. Table 1-1 describes these groups of characters.

Table 1-1: Characters Defined in CNS 11643-1986

Character Plane  Character Type                       Number of Characters
Plane 1          Special characters                   651
                 Control characters                   33
                 Frequently used characters           5,401
Plane 2          Less frequently used characters      7,650

Figure 1-2 and Figure 1-3 illustrate the positions of these characters in the first and second character planes.

Figure 1-2: CNS 11643 First Character Plane

Figure 1-3: CNS 11643 Second Character Plane


Because the CNS 11643-1986 character set was not rich enough to fully meet application requirements, such as names and addresses, the information industry in Taiwan requested an expansion of the character set. In 1991, the National Bureau of Standards formed a team to study how to expand CNS 11643. On August 4, 1992, the bureau published the revised CNS 11643, the Chinese Standard Interchange Code (CSIC).

The revised standard, called CNS 11643-1992, defines 651 special characters, 33 control characters, and 48,027 Chinese characters, as shown in Table 1-2.

Table 1-2: Characters Defined in CNS 11643-1992

Character Plane  Character Type                                      Number of Characters
Plane 1          Special characters                                  651
                 Control characters                                  33
                 Frequently used characters                          5,401
Plane 2          Less frequently used characters                     7,650
Plane 3          Rarely used characters (EDPC Part I)                6,148
Plane 4          Used for residency system, ISO 2nd edition          7,298
                 DIS 10646 Han characters, 171 EDPC Part II
                 Characters
Plane 5          Rarely used characters                              8,603
Plane 6          Variants based on the Ministry of Education         6,388
                 publications (less than, or equal to, 14 strokes)
Plane 7          Variants based on the Ministry of Education         6,539
                 publications (greater than 14 strokes)

Planes 5, 6, and 7 are based on Taiwanese Ministry of Education publications.
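
As a quick consistency check, the per-plane counts from Table 1-1 and Table 1-2 can be summed to reproduce the totals quoted in the text. The following Python sketch only restates the figures from the tables; the variable names are chosen for this example.

```python
# Chinese characters per plane, copied from Table 1-2 (CNS 11643-1992).
HANZI_PER_PLANE_1992 = {
    1: 5401,  # frequently used characters
    2: 7650,  # less frequently used characters
    3: 6148,  # rarely used characters (EDPC Part I)
    4: 7298,
    5: 8603,
    6: 6388,
    7: 6539,
}
SPECIAL_CHARACTERS = 651
CONTROL_CHARACTERS = 33

# CNS 11643-1986 defines Chinese characters on planes 1 and 2 only.
hanzi_1986 = HANZI_PER_PLANE_1992[1] + HANZI_PER_PLANE_1992[2]
hanzi_1992 = sum(HANZI_PER_PLANE_1992.values())
all_1992 = hanzi_1992 + SPECIAL_CHARACTERS + CONTROL_CHARACTERS

print(hanzi_1986)  # 13051
print(hanzi_1992)  # 48027, the figure quoted for CNS 11643-1992
print(all_1992)    # 48711, including special and control characters
```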

Because the number of characters defined in CNS 11643-1992 is far greater than the number required for general use, the revised CNS 11643 is called the "Chinese Standard Interchange Code" (CSIC).

Note

In this release, the new characters added in CNS 11643-1992 are not supported. Only the characters defined in CNS 11643-1986 and DTSCS (described in the next section) are supported.




1.2   DTSCS

In addition to CNS 11643, the operating system supports the DIGITAL Taiwan Supplemental Character Set (DTSCS). Currently, DTSCS includes only the EDPC Recommended Character Set, which defines a total of 6,319 characters. The EDPC Recommended Character Set was first published by the Electronic Data Processing Center of the Executive Yuan in June 1988.

Figure 1-4: EDPC Recommended Character Set


The EDPC Recommended Character Set is a de facto standard: computer vendors support it and assign it to CNS 11643 character plane 14.

In the revised CNS 11643-1992, the 6,319 characters in the EDPC Recommended Character Set are assigned to the third and fourth character planes of CNS 11643, as described in Table 1-3 and shown in Figure 1-4.

Table 1-3: Mapping of EDPC Recommended Character Set to CNS 11643-1992

EDPC Characters  Character Plane  Number of Characters
Part I           Plane 3          6,148
Part II          Plane 4          171




1.3   Big-5

The Big-5 character set, though not a national standard, is commonly used by the Taiwan information industry, particularly in the PC and workstation market. The Big-5 character set was designed to meet the requirements of five major software vendors in Taiwan. Since its publication, much software and hardware, and many peripheral devices have been developed to support Big-5.

Big-5 is very similar to the first two planes of CNS 11643-1992. The frequently used Chinese characters (5,401) defined in the two character sets are exactly the same except that their positions in the code table are different. For the less frequently used Chinese characters, Big-5 defines two more characters in addition to the 7,650 characters defined in the second character plane of CNS 11643, and their positions in the code table are also different.
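
To make the idea of a code-table position concrete, the following Python sketch looks up the Big-5 code of one frequently used character and derives the Big-5 character count from the figures above. It relies on Python's standard `big5` codec, which is an assumption of this example and is unrelated to the Tru64 UNIX converters. The standard library has no CNS 11643 codec, so only the Big-5 side is shown.

```python
# Counts taken from the text: 5,401 frequently used characters, and
# two more than the 7,650 less frequently used characters of CNS 11643.
BIG5_FREQUENT = 5401
BIG5_LESS_FREQUENT = 7650 + 2
BIG5_HANZI_TOTAL = BIG5_FREQUENT + BIG5_LESS_FREQUENT


def big5_bytes(ch):
    """Return the Big-5 byte sequence for one character (Python 'big5' codec)."""
    return ch.encode("big5")


code = big5_bytes("一")       # U+4E00
print(code.hex())             # a440
print(code.decode("big5"))    # 一
print(BIG5_HANZI_TOTAL)       # 13053
```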




1.4   GB2312-80

The GB2312-80 character set is a standard that was published by the State Bureau of Standardization of the People's Republic of China (PRC) in 1980 and came into force in May 1981.

GB2312-80 defines 7,445 characters, of which 6,763 are Chinese characters, in the following categories:

682 graphic symbols, placed in rows 1-9.

3,755 frequently used Chinese characters, placed in rows 16-55.

3,008 less frequently used Chinese characters, placed in rows 56-87.

The GB2312-80 code table is divided into 94 rows (Qu), numbered from 1 to 94. Each row has 94 columns (Wei), also numbered from 1 to 94. Figure 1-5 illustrates the GB2312-80 character set.

Figure 1-5: GB2312-80 Character Set
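
The row (Qu) and column (Wei) layout can be explored with a short Python sketch. It assumes the common EUC-CN byte encoding, in which each byte is 0xA0 plus the row or column number, and it uses Python's standard `gb2312` codec. Neither the encoding nor the helper names come from this manual; they are assumptions made for illustration.

```python
def quwei_to_char(qu, wei):
    """Decode a 1-based (row, column) pair, assuming EUC-CN byte values.

    Raises UnicodeDecodeError for positions the codec leaves unassigned.
    """
    if not (1 <= qu <= 94 and 1 <= wei <= 94):
        raise ValueError("qu and wei must both be in the range 1-94")
    return bytes([0xA0 + qu, 0xA0 + wei]).decode("gb2312")


def category(qu):
    """Map a row number to the category described in the text, else None."""
    if 1 <= qu <= 9:
        return "graphic symbols"
    if 16 <= qu <= 55:
        return "frequently used characters"
    if 56 <= qu <= 87:
        return "less frequently used characters"
    return None  # rows not described in this section


print(quwei_to_char(16, 1))   # 啊, the first frequently used character
print(category(16))           # frequently used characters
print(682 + 3755 + 3008)      # 7445, the total quoted in the text
```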





1.5   Extended GB

The extended GB character set provides 8,836 (94 x 94) code points for defining user-defined characters. The 8,836 code points are divided into two regions:

The extended GB code table is similar to the GB2312 code table. It is divided into 94 rows and each row has 94 columns.




1.6   GBK and GB18030-2000

The GBK character set is an extension to the GB2312-80 character set. GBK includes all of the simplified Chinese (Hanzi) characters specified by the ISO 10646 standard (also known as the GB13000.1-93 character set) that are not already included in GB2312-80. GBK is therefore defined as a normative annex of GB13000.1-93.

The GB18030-2000 character set, defined by the Chinese National Standard organization, further extends GBK by means of 4-byte code points. That is, the GB18030-2000 character set has 4-byte encoding in addition to the 1-byte and 2-byte encoding of GBK.

GB18030-2000 incorporates GBK support for the Hanzi characters specified by Unicode Version 3.0 and the ISO/IEC 10646-2001 standard.
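
The mix of 1-byte, 2-byte, and 4-byte code points can be observed with Python's standard `gbk` and `gb18030` codecs. This is a minimal sketch; the codecs and the sample characters are assumptions of this example, not part of the operating system's locale support.

```python
def encoded_length(ch, codec="gb18030"):
    """Number of bytes used to encode one character in the given codec."""
    return len(ch.encode(codec))


def in_gbk(ch):
    """True if Python's 'gbk' codec can encode the character."""
    try:
        ch.encode("gbk")
        return True
    except UnicodeEncodeError:
        return False


samples = ["A", "中", "\U00020000"]  # ASCII, a GBK Hanzi, a character outside GBK
for ch in samples:
    print(hex(ord(ch)), encoded_length(ch), in_gbk(ch))
# The lengths are 1, 2, and 4 bytes; only the last sample is outside GBK.
```

For characters that GBK already covers, the two codecs produce the same bytes, which reflects GB18030 being an extension of GBK.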




1.7   Unicode

The Unicode Standard, Version 3.0 specifies a universal character set (UCS) that contains definitions for 49,194 characters and includes a Private Use Area for vendor-defined or user-defined characters. The character set has the following features:




1.8   ISO/IEC 10646

The ISO/IEC 10646 standard, which is specified in Information Technology-Universal Multiple-Octet Coded Character Set, ISO/IEC 10646, allows characters to be specified either as 32-bit units or, like Unicode, as 16-bit units. In their 32-bit form, the 16-bit character values in Unicode are zero-extended through a second 16-bit unit to conform to ISO/IEC 10646.
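
Zero extension can be illustrated with a small Python sketch. It assumes big-endian byte order and uses Python's `utf-16-be` and `utf-32-be` codecs as stand-ins for the 16-bit and 32-bit forms of a single 16-bit character value. The helper name is hypothetical.

```python
def zero_extend(unit16):
    """Return the 32-bit (4-byte, big-endian) form of a 16-bit character value."""
    if not 0 <= unit16 <= 0xFFFF:
        raise ValueError("expected a 16-bit value")
    return unit16.to_bytes(4, "big")


ch = "中"  # U+4E2D
print(ch.encode("utf-16-be").hex())   # 4e2d
print(zero_extend(ord(ch)).hex())     # 00004e2d
print(ch.encode("utf-32-be").hex())   # 00004e2d
```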

