July 30, 1996
Japanese Language Encoding
There are four major encodings to represent Japanese text. All of
these are based upon ASCII (for alphabetic text) and Japanese
Industrial Standard X0208 (JIS X0208), but the data is stored in
different ways. An "octet" is an 8-bit data quantity, often
incorrectly called a "byte" or "character".
- (1) JIS7, also called "ISO-2022-JP" or (incorrectly) "JIS"
This is the encoding in which mail is transmitted. All of the
octets are 7-bit. A three-octet escape sequence beginning with
the ESC code switches between English (ASCII) and Japanese
(JIS): ESC $ B selects JIS X0208 and ESC ( B returns to ASCII.
- (2) JIS8
This is a rarely-used variant of EUC. All of the octets are
8-bit.
- (3) S-JIS, also called "Shift JIS"
This is the encoding used on PCs and Macs. All of the octets
are 8-bit. An octet in the range 0x00-0x7F that does not follow
a lead octet represents an ASCII character. A two-octet
character starts with a lead octet in the range 0x81-0x9F or
0xE0-0xEF; its second octet is in the range 0x40-0x7E or
0x80-0xFC, so the most significant bit of the second octet may
be off. Shifted JIS splits JIS into two segments. The old 1970s
single-octet katakana (also called "half-width katakana"), at
0xA1-0xDF, is inserted between the two segments.
- (4) EUC, also called "Extended Unix Code"
This is the encoding used on Unix systems. All of the octets
are 8-bit. If the most significant bit of an octet is off,
the other 7 bits represent an ASCII character. If the most
significant bit of an octet is on, the other 7 bits are half
of a 14-bit JIS character.
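There is no shifting. EUC can support extensions (such as JIS
X0212, which is selected by the prefix octet 0x8F). Half-width
katakana is not available as single octets; EUC represents it
as two octets, using the prefix octet 0x8E.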
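To make the differences concrete, the following sketch encodes the same
two-character string in JIS7, S-JIS, and EUC. It assumes Python 3 and
relies on its standard codecs "iso2022_jp", "shift_jis", and "euc_jp";
the choice of sample text is arbitrary.

```python
# "A" followed by the hiragana letter "a" (U+3042), which is
# JIS X0208 code 0x2422.
text = "A\u3042"

jis7 = text.encode("iso2022_jp")   # 7-bit octets with ESC sequences
sjis = text.encode("shift_jis")    # 8-bit, shifted JIS
euc = text.encode("euc_jp")        # 8-bit, JIS with the high bits set

print(jis7)         # b'A\x1b$B$"\x1b(B'
print(sjis.hex())   # 4182a0
print(euc.hex())    # 41a4a2

# Half-width katakana "a" (U+FF71) is a single octet in S-JIS.
kana = "\uff71".encode("shift_jis")
print(kana.hex())   # b1
```

Note that the JIS7 form carries the raw JIS octets 0x24 0x22 between the
two escape sequences, and that the EUC form is the same pair with the
most significant bit of each octet set.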
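It is very easy to translate between JIS7 and EUC: apart from the
escape sequences, the two differ only in the most significant bit of
each octet of a Japanese character. With a little more work, it is
also possible to translate between S-JIS and either JIS7 or EUC.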
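The translations can be sketched for a single two-octet character as
follows. This is a minimal illustration in Python 3, not a complete
converter: the function names are hypothetical, the inputs are assumed
to be valid JIS X0208 octet pairs (each octet in 0x21-0x7E), and escape
sequences, ASCII, and half-width katakana are not handled.

```python
def jis_to_euc(j1, j2):
    # EUC is the JIS octet pair with the most significant bits set.
    return j1 | 0x80, j2 | 0x80

def euc_to_jis(e1, e2):
    # Clearing the most significant bits recovers the JIS octets.
    return e1 & 0x7F, e2 & 0x7F

def jis_to_sjis(j1, j2):
    # Two JIS rows share one S-JIS lead octet; the lead octets are
    # split into the segments 0x81-0x9F and 0xE0-0xEF.
    s1 = ((j1 + 1) >> 1) + (0x70 if j1 < 0x5F else 0xB0)
    if j1 & 1:
        # Odd row: second octet lands in 0x40-0x7E or 0x80-0x9E.
        s2 = j2 + (0x1F if j2 < 0x60 else 0x20)
    else:
        # Even row: second octet lands in 0x9F-0xFC.
        s2 = j2 + 0x7E
    return s1, s2

def sjis_to_jis(s1, s2):
    s1 -= 0x70 if s1 < 0xA0 else 0xB0
    j1 = s1 * 2
    if s2 < 0x9F:
        j1 -= 1
        j2 = s2 - (0x1F if s2 < 0x7F else 0x20)
    else:
        j2 = s2 - 0x7E
    return j1, j2

# JIS 0x2422 is the hiragana letter "a".
print([hex(x) for x in jis_to_euc(0x24, 0x22)])    # ['0xa4', '0xa2']
print([hex(x) for x in jis_to_sjis(0x24, 0x22)])   # ['0x82', '0xa0']
```

The S-JIS case takes more work because the arithmetic has to skip 0x7F
in the second octet and step around the single-octet katakana range in
the first.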
It is also possible, by examining the data, to determine whether it is
JIS7, S-JIS, or EUC. JIS8 is ambiguous but fortunately it is rare
enough that you needn't worry about it.
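One way to examine the data is sketched below in Python 3. It is a
heuristic under stated assumptions, not a definitive detector: the
function names and return values are hypothetical, only the two-octet
JIS X0208 form of EUC is checked (extensions are ignored), and data
that is valid as both S-JIS and EUC is reported as "ambiguous".

```python
def looks_like_euc(data):
    # Every 8-bit octet must be part of a pair in 0xA1-0xFE.
    i, n = 0, len(data)
    while i < n:
        if data[i] < 0x80:
            i += 1
        elif (0xA1 <= data[i] <= 0xFE and i + 1 < n
              and 0xA1 <= data[i + 1] <= 0xFE):
            i += 2
        else:
            return False
    return True

def looks_like_sjis(data):
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80 or 0xA1 <= b <= 0xDF:
            i += 1          # ASCII or single-octet katakana
        elif ((0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF) and i + 1 < n
              and (0x40 <= data[i + 1] <= 0x7E
                   or 0x80 <= data[i + 1] <= 0xFC)):
            i += 2          # lead octet plus second octet
        else:
            return False
    return True

def guess_encoding(data):
    if all(b < 0x80 for b in data):
        # 7-bit data is JIS7 only if it switches into Japanese.
        if b"\x1b$B" in data or b"\x1b$@" in data:
            return "jis7"
        return "ascii"
    euc, sjis = looks_like_euc(data), looks_like_sjis(data)
    if euc and sjis:
        return "ambiguous"
    if euc:
        return "euc"
    if sjis:
        return "sjis"
    return "unknown"

sample = "\u65e5\u672c\u8a9e"   # the word "nihongo" in kanji
print(guess_encoding(sample.encode("iso2022_jp")))   # jis7
print(guess_encoding(sample.encode("shift_jis")))    # sjis
print(guess_encoding(sample.encode("euc_jp")))       # euc
```

Short inputs can defeat this kind of check. For example, EUC text made
up only of hiragana is also a valid run of single-octet katakana in
S-JIS, so the sketch reports it as "ambiguous" rather than guessing.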