Difference between ASCII code, ISO8859-1, Unicode, GBK, and UTF-8

Little scum · Posted on 12/11/2022 2:01:58 PM

Why do we need encoding?

The smallest storage unit in a computer is the byte. One byte is 8 bits, so a single byte can represent at most 256 distinct values. The world's languages contain far more characters than that, so one byte is not enough. In Java, for example, a char occupies two bytes to address this problem. Characters therefore have to be converted into bytes, the smallest storage unit, and this conversion is called encoding. The reverse conversion is called decoding.

In effect, encoding and decoding are a translation process ("translation" is an easy way to think about it), and the various encodings are the dictionaries used for that translation.

1. ASCII code

ASCII stands for American Standard Code for Information Interchange. It is the most widely used single-byte coding system in the world and is mainly used to display modern English.

ASCII uses 7 bits, so it can represent only 128 characters. Codes 0~31 and 127 are control characters such as carriage return, backspace, and delete; codes 32~126 are printable characters that can be typed on a keyboard and displayed.

Among the printable characters, 48~57 are the ten Arabic numerals 0 to 9, 65~90 are the 26 uppercase English letters, and 97~122 are the 26 lowercase English letters. The rest are punctuation marks, operators, and similar symbols; see the ASCII standard table for details.
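The ranges above can be checked with Python's built-in ord() and chr(). This is a small illustrative sketch; the sample string is arbitrary.

```python
# chr() turns a code into its character; ord() goes the other way.
digits = "".join(chr(c) for c in range(48, 58))    # codes 48..57
upper = "".join(chr(c) for c in range(65, 91))     # codes 65..90
lower = "".join(chr(c) for c in range(97, 123))    # codes 97..122

# Upper- and lowercase letters are exactly 32 apart.
case_offset = ord("a") - ord("A")

# Every ASCII code fits in 7 bits, so each encoded byte is below 128.
ascii_bytes = "Hello, ASCII!".encode("ascii")

print(digits, upper, lower, case_offset, max(ascii_bytes))
```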

2. ISO-8859-1

This encoding is an extension of the ASCII encoding, but it is still a single-byte encoding and can only represent a total of 256 characters.

ASCII can represent only 128 characters, which is not enough for other languages. ISO-8859-1 extends ASCII by adding the letters and symbols of Western European languages in the upper half of the byte range, and it is backward compatible with ASCII. Other parts of the ISO 8859 series cover scripts such as Greek, Thai, Arabic, and Hebrew.
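A short Python sketch of these properties, using Python's "iso-8859-1" codec. The sample characters are chosen only for illustration.

```python
# ASCII text encodes to the same bytes in ISO-8859-1 (backward compatibility).
same_as_ascii = "abc".encode("iso-8859-1") == "abc".encode("ascii")

# U+00E9 (e with acute accent) is outside ASCII but fits in one ISO-8859-1 byte.
e_acute = "\u00e9".encode("iso-8859-1")

# A character outside the 256-entry table cannot be encoded at all.
try:
    "中".encode("iso-8859-1")
    chinese_fits = True
except UnicodeEncodeError:
    chinese_fits = False

print(same_as_ascii, e_acute.hex(), chinese_fits)
```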

3. GB2312

The full name of GB2312 is "Chinese Coded Character Set for Information Interchange". It was released by China in 1980 and is mainly used for Chinese character processing in computer systems. It is a two-byte encoding whose first byte ranges from A1 to F7: A1-A9 is the symbol area, containing 682 symbols, and B0-F7 is the Chinese character area, containing 6763 Chinese characters.

GB2312 covers most commonly used Chinese characters but cannot handle rare characters such as those found in ancient Chinese, so encodings like GBK and GB18030 appeared later.
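As a quick illustration, Python's "gb2312" codec can be used to look at the actual bytes. The sample text is arbitrary.

```python
encoded = "中文".encode("gb2312")

# Split the result into two-byte pairs, one pair per Chinese character.
pairs = [encoded[i:i + 2].hex() for i in range(0, len(encoded), 2)]

# ASCII characters keep their single-byte form alongside two-byte characters.
mixed = "a中".encode("gb2312")

print(pairs, len(mixed))
```

Both first bytes here (d6 and ce) fall in the B0-F7 Chinese character area.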

4. GBK

The full name of GBK is the "Chinese Internal Code Extension Specification". It was formulated in 1995 to extend GB2312 with more Chinese characters. Its encoding range is 8140-FEFE (excluding XX7F), giving 23940 code positions in total, and it can represent 21003 Chinese characters.

GBK is backward compatible with GB2312. Chinese characters encoded in GB2312 can be decoded correctly with GBK without becoming garbled, but characters encoded with GBK cannot always be decoded with GB2312.
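The one-way compatibility can be sketched in Python as follows. The traditional character 龍 is used here as an assumed example of a character that GBK covers but GB2312 does not.

```python
simplified = "龙"    # U+9F99, present in GB2312
traditional = "龍"   # U+9F8D, assumed example of a GBK-only character

# Bytes produced by GB2312 decode correctly with GBK.
gb2312_bytes = simplified.encode("gb2312")
decoded_with_gbk = gb2312_bytes.decode("gbk")

# Bytes produced by GBK are not always valid GB2312.
gbk_bytes = traditional.encode("gbk")
try:
    gbk_bytes.decode("gb2312")
    decodable_as_gb2312 = True
except UnicodeDecodeError:
    decodable_as_gb2312 = False

print(decoded_with_gbk, len(gbk_bytes), decodable_as_gb2312)
```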

5. GB18030

The full name of GB18030 is "Information Technology: Chinese Coded Character Set". It was first released in 2000 and made mandatory in 2001. It includes the scripts of most ethnic minorities in China, and its later revisions cover more than 70,000 Chinese characters.

It uses single-byte, double-byte, and four-byte encoding, and it is backward compatible with GB2312 and GBK. Although it is a mandatory national standard in China, it is rarely used in actual production; GBK and GB2312 are used far more often.
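The three lengths can be observed with Python's "gb18030" codec. This is a minimal sketch with arbitrarily chosen sample characters.

```python
# One ASCII letter, one common Chinese character, one emoji (U+1F600).
samples = ["A", "中", "\U0001F600"]
lengths = {ch: len(ch.encode("gb18030")) for ch in samples}

# Text that GBK can encode produces the same bytes in GB18030.
same_as_gbk = "中文".encode("gb18030") == "中文".encode("gbk")

print(lengths, same_as_gbk)
```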

6. Unicode

Every country and region created its own encoding so that its language could be displayed on computers. With so many encodings, systems could not understand each other's text. To solve this, the Unicode Consortium, working in step with ISO, defined Unicode, a single character set covering the world's scripts, characters, and symbols. By the time Unicode was designed, storage capacity was no longer a major concern, so it was originally a fixed two-byte design in which every character took 16 bits, including English characters that previously took only 8 bits. This wasted space, and Unicode was slow to be adopted as a result. Unicode has since grown beyond 16 bits (code points now run up to U+10FFFF). Unicode itself only assigns a number to each character; how that number is stored as bytes is defined by encodings such as UTF-16 and UTF-8.
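In Python, ord() returns a character's Unicode code point. The short sketch below (sample characters chosen for illustration) shows that the code point is just a number, independent of any byte encoding, and that current Unicode includes code points above 0xFFFF.

```python
samples = ["A", "中", "\U0001F600"]

# Map each character to its code point in the usual U+XXXX notation.
code_points = {ch: f"U+{ord(ch):04X}" for ch in samples}

# Characters whose code point no longer fits in 16 bits.
beyond_16_bits = [ch for ch in samples if ord(ch) > 0xFFFF]

print(code_points, len(beyond_16_bits))
```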

7. UTF-16

Unicode was created as a universal dictionary through which all the world's languages can be translated to one another; you can imagine how large and complex that dictionary is. UTF-16 is a Unicode transformation format built on 16-bit units, hence the name. Most commonly used characters take exactly two bytes, which makes them very convenient to represent, while characters outside the Basic Multilingual Plane take a pair of 16-bit units (four bytes). Its drawback is that many characters could be represented in one byte, yet UTF-16 spends at least two, which wastes storage space. This led to another encoding, UTF-8.

UTF-16 is a concrete implementation of Unicode that defines how Unicode characters are stored in the computer; the 16 refers to its 16-bit code units. Because most characters occupy a single two-byte unit, string operations are simple and efficient, which is an important reason why Java uses UTF-16 as the in-memory format for characters. Characters beyond U+FFFF are stored as a surrogate pair of two units.

UTF-16 is well suited for use between disk and memory, where converting between characters and bytes is simple and efficient. It is less suitable for network transmission: both sides must agree on the byte order, and if a single byte is lost, the two-byte alignment shifts and the rest of the stream becomes unreadable.
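Here is a minimal Python sketch of what UTF-16 does with actual bytes. The helper surrogate_pair is a hypothetical name written for this illustration; it follows the standard UTF-16 rule for code points above U+FFFF.

```python
def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its two 16-bit UTF-16 units."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

# A common Chinese character fits in one 16-bit unit (two bytes).
big_endian = "中".encode("utf-16-be")
little_endian = "中".encode("utf-16-le")   # same unit, bytes swapped

# An emoji above U+FFFF needs two units (four bytes).
emoji_bytes = "\U0001F600".encode("utf-16-be")

print(big_endian.hex(), little_endian.hex(), emoji_bytes.hex())
print([hex(unit) for unit in surrogate_pair(0x1F600)])
```

The big-endian and little-endian results differ only in byte order, which is why a receiver has to know the order (or read a byte order mark) before decoding.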

8. UTF-8

UTF-8 is a variable-length encoding: different ranges of characters use different lengths, and a character takes 1 to 4 bytes (the original design allowed up to 6).

The encoding rules for UTF-8 are as follows:

1) If a byte's highest bit is 0, it is a single-byte ASCII character. All ASCII text is therefore already valid UTF-8.

2) If a byte starts with 11, it is the first byte of a multi-byte character, and the number of consecutive leading 1s gives the total number of bytes in that character. For example, 110xxxxx is the first byte of a two-byte UTF-8 character.

3) If a byte starts with 10, it is not the first byte of a character; look back through the preceding bytes to find the first byte of the current character.
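The three rules can be sketched as a small byte classifier in Python. The function names classify and char_start are hypothetical, chosen for this illustration only.

```python
def classify(byte):
    """Label one UTF-8 byte by its leading bits."""
    if (byte & 0b1000_0000) == 0:
        return "ascii"                 # rule 1: 0xxxxxxx
    if (byte & 0b1100_0000) == 0b1000_0000:
        return "continuation"          # rule 3: 10xxxxxx
    # Rule 2: count the leading 1 bits to get the character's byte length.
    length = 0
    mask = 0b1000_0000
    while byte & mask:
        length += 1
        mask >>= 1
    return f"lead-{length}"

def char_start(data, index):
    """Walk backward from index to the first byte of the current character."""
    while index > 0 and classify(data[index]) == "continuation":
        index -= 1
    return index

data = "a中".encode("utf-8")             # bytes: 61 e4 b8 ad
labels = [classify(b) for b in data]
print(labels, char_start(data, 3))
```

Starting from the last byte (index 3), the backward walk stops at index 1, the lead byte of the three-byte character.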

Although UTF-16 is efficient to process, its biggest drawback is that every single-byte character must take two bytes, doubling the storage space. That wastes resources and is a poor fit for the rapid growth of the Internet. UTF-8 addresses this: it is a variable-length encoding of Unicode that uses 1 to 4 bytes per character.

UTF-8 stores ASCII characters in a single byte, and if one character is corrupted, the following characters are not affected. This makes UTF-8 well suited to transmission over the network, and it is one of the most widely used encodings today.

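For Chinese text, UTF-8 is less compact than GBK or UTF-16: a common Chinese character takes 3 bytes in UTF-8 but 2 bytes in the other two. Because it covers all of Unicode and stays compatible with ASCII, it is still the most practical encoding for Chinese content besides GBK.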
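A quick way to compare storage cost is to encode the same text with each encoding and count the bytes. In this Python sketch, sizes is a hypothetical helper and the sample strings are arbitrary.

```python
def sizes(text):
    """Return the encoded length of text in bytes for each encoding."""
    encodings = ("gbk", "utf-16-be", "utf-8")
    return {name: len(text.encode(name)) for name in encodings}

chinese = sizes("中文")     # two Chinese characters
english = sizes("hello")    # five ASCII characters

print(chinese)
print(english)
```

For pure Chinese text UTF-8 is the largest of the three, while for English text it matches GBK and is half the size of UTF-16.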

johnyoung · Posted on 12/30/2022 2:43:17 PM

Got it, with utf-8
