DECwindows Motif Supplemental Guide for Simplified Chinese Support

2. Codesets

DECwindows Motif supports the following Simplified Chinese codesets:

2.1. DEC Hanzi

The ASCII, GB2312-80 and extended GB character sets are combined to form the DEC Hanzi codeset.

DEC Hanzi (Simplified Chinese), denoted as dechanzi, uses a 2-byte data representation for the symbols and ideographic characters defined in the GB2312-80 character set. To differentiate 2-byte codes from ASCII codes, the most significant bit (MSB) of the first byte is always set. The MSB of the second byte is set for GB2312-80 and clear for extended GB, as shown in Figure 2-1.

Figure 2-1. DEC Hanzi Character Encoding

Character Set   MSB of First Byte   MSB of Second Byte
ASCII           0                   (single byte only)
GB2312-80       1                   1
Extended GB     1                   0
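
The bit tests summarized in Figure 2-1 can be sketched in code. The following Python fragment is only an illustration and is not part of DECwindows Motif; the function name classify_dec_hanzi and the labels it returns are hypothetical. It assumes the input is a well-formed DEC Hanzi byte string and rejects a 2-byte character that is cut off at the end of the input.

```python
def classify_dec_hanzi(data: bytes) -> list:
    """Split a DEC Hanzi byte string into (kind, raw_bytes) pairs.

    Hypothetical helper for illustration: it applies only the MSB
    rules from Figure 2-1 and does not validate row or column ranges.
    """
    result = []
    i = 0
    while i < len(data):
        first = data[i]
        if first & 0x80 == 0:
            # MSB of the first byte is clear: single-byte ASCII.
            result.append(("ascii", data[i:i + 1]))
            i += 1
            continue
        if i + 1 >= len(data):
            raise ValueError(f"truncated 2-byte character at offset {i}")
        second = data[i + 1]
        # MSB of the second byte selects the character set.
        kind = "gb2312-80" if second & 0x80 else "extended-gb"
        result.append((kind, data[i:i + 2]))
        i += 2
    return result
```

For example, the byte string 41 B0 A1 B0 21 (hex) splits into one ASCII byte, one GB2312-80 character, and one extended GB character.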

The first byte of a 2-byte code determines its row number, while the second byte determines its column number.

The following formulas illustrate the code of a GB2312-80 character or an extended GB character in relation to its row and column numbers.

GB2312-80 character:

First byte = A0 (hex) + row number
Second byte = A0 (hex) + column number

Extended GB character:

First byte = A0 (hex) + row number
Second byte = 20 (hex) + column number

For example, if a character is positioned at the first column of the 16th row on the GB2312-80 code plane, its encoding value is calculated as follows:

First byte = A0 (hex) + 16 (decimal) = B0 (hex)
Second byte = A0 (hex) + 01 (decimal) = A1 (hex)

The resulting encoded value is B0A1.

Similarly, if a character is positioned at the first column of the 16th row on the extended GB code plane, its encoding value is calculated as follows:

First byte = A0 (hex) + 16 (decimal) = B0 (hex)
Second byte = 20 (hex) + 01 (decimal) = 21 (hex)

The resulting encoded value is B021.
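
The two formulas can be checked with a short Python sketch. The function name dec_hanzi_encode is hypothetical and is not a DECwindows Motif routine. The 1 to 94 limit on row and column numbers is an assumption taken from the 94 by 94 layout of GB2312-80; this section does not state the limits itself.

```python
def dec_hanzi_encode(row: int, column: int, extended: bool = False) -> bytes:
    """Return the 2-byte DEC Hanzi code for a row/column position.

    Illustrative only. The 1..94 bounds are an assumption based on
    the GB2312-80 code plane, not a limit stated in this guide.
    """
    if not (1 <= row <= 94 and 1 <= column <= 94):
        raise ValueError("row and column must be in the range 1..94")
    first = 0xA0 + row
    # GB2312-80 offsets the column by A0 (hex); extended GB by 20 (hex).
    second = (0x20 if extended else 0xA0) + column
    return bytes([first, second])


print(dec_hanzi_encode(16, 1).hex().upper())                 # B0A1
print(dec_hanzi_encode(16, 1, extended=True).hex().upper())  # B021
```

The two printed values reproduce the worked examples above for row 16, column 1.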

Figure 2-2 illustrates the division of a 2-byte code space and the position of the Chinese character sets.

Figure 2-2. GB2312-80 and Extended GB Code Space

Character Set   First Byte (hex)   Second Byte (hex)
Extended GB     A0 to FF           20 to 80
GB2312-80       A0 to FF           A0 to FF

2.2. GB18030

The GB18030 codeset provides 1-byte, 2-byte, and 4-byte encoding with the following structure:

Number of Bytes   Encoding Range                        Code Points
1 byte            0x00 to 0x7F                          128
2 bytes           Byte 1: 0x81 to 0xFE                  23940
                  Byte 2: 0x40 to 0xFE (except 0x7F)
4 bytes           Byte 1: 0x81 to 0xFE                  1587600
                  Byte 2: 0x30 to 0x39
                  Byte 3: 0x81 to 0xFE
                  Byte 4: 0x30 to 0x39
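
The code point counts follow directly from the byte ranges: each count is the product of the number of values allowed in each byte position. The Python sketch below recomputes them; the helper name span is hypothetical and exists only for this illustration.

```python
def span(low: int, high: int, excluded: tuple = ()) -> int:
    """Count the byte values in low..high, minus any excluded values."""
    return high - low + 1 - len(excluded)


# 1-byte codes: 0x00 to 0x7F.
one_byte = span(0x00, 0x7F)

# 2-byte codes: lead byte 0x81 to 0xFE, trail byte 0x40 to 0xFE except 0x7F.
two_byte = span(0x81, 0xFE) * span(0x40, 0xFE, excluded=(0x7F,))

# 4-byte codes: bytes 1 and 3 are 0x81 to 0xFE, bytes 2 and 4 are 0x30 to 0x39.
four_byte = (span(0x81, 0xFE) * span(0x30, 0x39)) ** 2

print(one_byte, two_byte, four_byte)  # 128 23940 1587600
```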

GB18030 1-byte code supports ASCII characters.

GB18030 2-byte code supports all the CJK (Chinese, Japanese, Korean) characters in the Unicode Version 2.1 Standard.

GB18030 4-byte code supports Unicode Version 3.0 additions. The 4-byte code also leaves a large number of unassigned code points available for future use.
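
The three encoding lengths can be observed with any GB18030 converter. The sketch below uses the gb18030 codec from the Python standard library, which is independent of DECwindows Motif; the function name gb18030_length and the sample characters are chosen only for illustration.

```python
def gb18030_length(char: str) -> int:
    """Return how many bytes one character occupies in GB18030."""
    return len(char.encode("gb18030"))


samples = {
    "A": "ASCII letter",
    "\u4e2d": "CJK ideograph present in Unicode 2.1",
    "\u3400": "CJK ideograph added in Unicode 3.0",
}

for char, description in samples.items():
    encoded = char.encode("gb18030")
    print(f"U+{ord(char):04X} {description}: "
          f"{len(encoded)} byte(s), {encoded.hex().upper()}")
```

With these samples, the ASCII letter takes 1 byte, U+4E2D takes 2 bytes, and U+3400 takes 4 bytes.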