Japanese Encoding

July 30, 1996

Japanese Language Encoding

There are four major encodings used to represent Japanese text. All of them are based on ASCII (for alphabetic text) and Japanese Industrial Standard X0208 (JIS X0208), but they store the data in different ways. An "octet" is an 8-bit data quantity, often incorrectly called a "byte" or "character".

(1) JIS7, also called "ISO-2022-JP" or (incorrectly) "JIS". This is the encoding in which mail is transmitted. All of the octets are 7-bit. A three-octet escape sequence beginning with the ESC code is used to switch between English (ASCII) and Japanese (JIS): ESC $ B switches to JIS and ESC ( B switches back to ASCII.
(2) JIS8. This is a rarely-used variant of EUC. All of the octets are 8-bit.
(3) S-JIS, also called "Shift JIS". This is the encoding used on PCs and Macs. All of the octets are 8-bit. An octet standing on its own with the most significant bit off represents an ASCII character. An octet in the range 0x81-0x9F or 0xE0-0xEF is the first half of a two-octet shifted JIS character; the second half lies in the range 0x40-0x7E or 0x80-0xFC, so its most significant bit may be off. Shifted JIS splits JIS into two segments. The old 1970s single-octet katakana (also called "half-width katakana") are inserted between the two segments, at 0xA1-0xDF.
(4) EUC, also called "Extended Unix Code". This is the encoding used on Unix systems. All of the octets are 8-bit. If the most significant bit of an octet is off, the other 7 bits represent an ASCII character. If the most significant bit of an octet is on, the other 7 bits are half of a 14-bit JIS character. There is no shifting. EUC supports extensions through prefix octets: 0x8F introduces a JIS X0212 character, and 0x8E introduces a half-width katakana, which therefore takes two octets in EUC rather than one.
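
To make the differences concrete, here is a small sketch that encodes the same two-character string in three of the encodings, using the codecs that ship with Python. The codec names (`iso2022_jp`, `shift_jis`, `euc_jp`) are Python's own and are not part of this note; JIS8 is left out because Python has no codec for it.

```python
# Encode the same text three ways and compare the octets.
# The two kanji have JIS X0208 codes 0x467C and 0x4B5C.
text = "日本"

jis7 = text.encode("iso2022_jp")  # 7-bit, wrapped in escape sequences
sjis = text.encode("shift_jis")   # 8-bit, shifted
euc = text.encode("euc_jp")       # 8-bit, JIS octets with the high bit set

print("JIS7 :", jis7.hex(" "))    # 1b 24 42 46 7c 4b 5c 1b 28 42
print("S-JIS:", sjis.hex(" "))    # 93 fa 96 7b
print("EUC  :", euc.hex(" "))     # c6 fc cb dc

# EUC is simply the JIS octets with the most significant bit turned on.
jis_octets = jis7[3:-3]           # drop ESC $ B at the start, ESC ( B at the end
assert bytes(b | 0x80 for b in jis_octets) == euc
```

Note that the last S-JIS octet here (0x7b) has its most significant bit off even though it is the second half of a kanji, which is one reason S-JIS needs more care than EUC.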

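It is very easy to translate between JIS7 and EUC: remove the escape sequences and turn on the most significant bit of each JIS octet, or do the reverse. With a little more work (some arithmetic on each pair of octets), it is also possible to translate between S-JIS and either JIS7 or EUC. By examining the data, it is usually possible to determine whether it is JIS7, S-JIS, or EUC, although a very short text can be valid in more than one encoding. JIS8 is ambiguous, but fortunately it is rare enough that you needn't worry about it.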
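
The translations and the detection described above can be sketched roughly as follows. This is a minimal illustration rather than a complete converter: all function names are hypothetical, the S-JIS arithmetic is the commonly used formula rather than anything specified in this note, the S-JIS converter rejects half-width katakana, and the EUC check ignores the 0x8E and 0x8F prefix octets.

```python
JIS_IN = (b"\x1b$@", b"\x1b$B")   # ESC $ @ and ESC $ B: switch to JIS X0208
JIS_OUT = (b"\x1b(B", b"\x1b(J")  # ESC ( B and ESC ( J: switch to single-octet text

def jis7_to_euc(data):
    """Drop the escape sequences and set the high bit on every JIS octet."""
    out, in_jis, i = bytearray(), False, 0
    while i < len(data):
        seq = data[i:i + 3]
        if seq in JIS_IN or seq in JIS_OUT:
            in_jis = seq in JIS_IN
            i += 3
        else:
            out.append((data[i] | 0x80) if in_jis else data[i])
            i += 1
    return bytes(out)

def sjis_pair_to_jis(s1, s2):
    """Undo the shift: map an S-JIS octet pair to the two 7-bit JIS octets."""
    s1 -= 0x71 if s1 <= 0x9F else 0xB1
    j1 = s1 * 2 + 1
    if s2 > 0x7F:          # the second octet skips 0x7F
        s2 -= 1
    if s2 >= 0x9E:         # upper half of the range: next JIS row
        return j1 + 1, s2 - 0x7D
    return j1, s2 - 0x1F

def sjis_to_euc(data):
    """Convert S-JIS to EUC (half-width katakana are not handled here)."""
    out, i = bytearray(), 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(b)
            i += 1
        elif 0xA1 <= b <= 0xDF or i + 1 >= len(data):
            raise ValueError("unsupported or truncated octet at offset %d" % i)
        else:
            j1, j2 = sjis_pair_to_jis(b, data[i + 1])
            out += bytes((j1 | 0x80, j2 | 0x80))
            i += 2
    return bytes(out)

def looks_like_euc(data):
    """True if every high-bit octet is part of a pair in 0xA1-0xFE."""
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            i += 1
        elif (0xA1 <= data[i] <= 0xFE and i + 1 < len(data)
              and 0xA1 <= data[i + 1] <= 0xFE):
            i += 2
        else:
            return False
    return True

def looks_like_sjis(data):
    """True if the octets form valid S-JIS singles and lead/trail pairs."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80 or 0xA1 <= b <= 0xDF:      # ASCII or half-width katakana
            i += 1
        elif ((0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF) and i + 1 < len(data)
              and (0x40 <= data[i + 1] <= 0x7E or 0x80 <= data[i + 1] <= 0xFC)):
            i += 2
        else:
            return False
    return True

def guess_encoding(data):
    """Heuristic guess: 'ASCII', 'JIS7', 'EUC', 'S-JIS', 'ambiguous' or 'unknown'."""
    if all(b < 0x80 for b in data):
        return "JIS7" if any(seq in data for seq in JIS_IN) else "ASCII"
    euc, sjis = looks_like_euc(data), looks_like_sjis(data)
    if euc != sjis:
        return "EUC" if euc else "S-JIS"
    return "ambiguous" if euc else "unknown"
```

For example, `guess_encoding(bytes.fromhex("93fa967b"))` gives "S-JIS" because 0x93 cannot begin an EUC character, while a run of octets that all fall in 0xA1-0xDF is reported as "ambiguous" since it is valid both as EUC kanji and as S-JIS half-width katakana.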