The difference between utf8 and utf8mb4 in MySQL

Little scum · Posted on 4/21/2021 6:01:22 PM

Unknown character set: utf8mb4
https://www.itsvse.com/thread-3199-1-1.html

1. Introduction

MySQL added the utf8mb4 character set in version 5.5.3. The "mb4" suffix means "most bytes 4": each character takes at most four bytes, so the character set can hold four-byte Unicode characters. Fortunately, utf8mb4 is a superset of utf8, so no conversion is needed beyond changing the character set to utf8mb4. If saving space is the priority, utf8 is generally enough.

2. Content description

If utf8 can already store most Chinese characters, why use utf8mb4? MySQL's utf8 character set stores at most 3 bytes per character, so inserting a 4-byte character fails with an error. The largest code point that three-byte UTF-8 can encode is 0xFFFF, which is the end of the Basic Multilingual Plane (BMP) in Unicode. In other words, no Unicode character outside the BMP can be stored in MySQL's utf8 character set. That includes emoji (a special class of Unicode characters common on iOS and Android phones), many rarely used Chinese characters, and any newly added Unicode characters.
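The three-byte limit can be checked outside MySQL. The following is a minimal Python sketch, not MySQL's own logic: the helper names `utf8_len` and `fits_mysql_utf8` are made up for illustration, and the check simply assumes that "fits in utf8" means "every code point is at most 0xFFFF", as described above.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes a single character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def fits_mysql_utf8(text: str) -> bool:
    """True if every character lies in the BMP (code point <= 0xFFFF),
    i.e. needs at most three UTF-8 bytes. Hypothetical helper."""
    return all(ord(ch) <= 0xFFFF for ch in text)


# Latin letter, accented letter, common Chinese character, emoji
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]
for ch in samples:
    print(f"U+{ord(ch):04X} bytes={utf8_len(ch)} fits_utf8={fits_mysql_utf8(ch)}")
```

Only the last sample, the emoji U+1F600, needs four bytes, so it is the only one that would require utf8mb4.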

3. The root cause of the problem

The original UTF-8 design used one to six bytes per character and could encode up to 31 bits. The current UTF-8 specification uses only one to four bytes and can encode up to 21 bits, which is enough to represent all 17 Unicode planes. utf8 in MySQL is a character set that only supports UTF-8 sequences up to three bytes long, which covers only the Basic Multilingual Plane.
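The bit counts above follow from the UTF-8 byte layout: the lead byte of an n-byte sequence spends n + 1 bits on its length marker, and each continuation byte carries 6 payload bits. A small sketch of that arithmetic (the function name `payload_bits` is only for illustration):

```python
def payload_bits(n_bytes: int) -> int:
    """Payload bits available in a UTF-8 sequence of n_bytes bytes."""
    if n_bytes == 1:
        return 7  # 0xxxxxxx
    # Lead byte: n ones and a zero as marker, leaving 8 - (n + 1) bits;
    # every continuation byte (10xxxxxx) adds 6 bits.
    return (7 - n_bytes) + 6 * (n_bytes - 1)


for n in (3, 4, 6):
    print(n, "bytes ->", payload_bits(n), "bits")

# 17 planes of 65,536 code points each end at U+10FFFF, which needs 21 bits.
total_code_points = 17 * 0x10000
print(total_code_points, hex(total_code_points - 1), (0x10FFFF).bit_length())
```

Three bytes give 16 bits (up to 0xFFFF, the BMP), four bytes give 21 bits, and the old six-byte form gives 31 bits.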

Why does utf8 in MySQL only support UTF-8 characters of at most three bytes? I thought about it, and my guess is that when MySQL was first developed, Unicode had no supplementary planes yet. At that time, the Unicode Committee was still dreaming that "65,535 characters is enough for the whole world". String lengths in MySQL count characters rather than bytes, and a CHAR column must reserve enough space for its full declared length. With the utf8 character set, the space to reserve is the longest utf8 character length multiplied by the declared length, so it was natural to cap the maximum utf8 character length at 3 bytes; for example, for CHAR(100) MySQL reserves 300 bytes. As for why later versions did not extend utf8 to 4-byte UTF-8 characters, I think one reason is backward compatibility, and the other is that characters outside the Basic Multilingual Plane are rarely used.
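The CHAR reservation described above is simple multiplication. The sketch below follows the paragraph's simplified model (declared length times the maximum bytes per character); actual on-disk storage depends on the storage engine and row format, so the numbers are an upper-bound illustration rather than a measurement. The table and function names are illustrative.

```python
# Maximum bytes per character for the two character sets discussed here.
MAX_BYTES_PER_CHAR = {"utf8": 3, "utf8mb4": 4}


def char_reserved_bytes(declared_length: int, charset: str) -> int:
    """Bytes reserved for CHAR(declared_length) under the simplified model."""
    return declared_length * MAX_BYTES_PER_CHAR[charset]


for charset in ("utf8", "utf8mb4"):
    print(f"CHAR(100) with {charset}: {char_reserved_bytes(100, charset)} bytes")
```

Under this model CHAR(100) reserves 300 bytes with utf8 and 400 bytes with utf8mb4.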

To store 4-byte UTF-8 characters in MySQL, the utf8mb4 character set is required, and it is only available from version 5.5.3 onward (check your version with: select version();). I think that for better compatibility you should always use utf8mb4 instead of utf8. For CHAR columns utf8mb4 consumes more space, so, following the official MySQL recommendation, use VARCHAR instead of CHAR.
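The version requirement can be checked in application code once the string returned by select version(); is available. This is a rough sketch with a hypothetical helper, `supports_utf8mb4`; it assumes the string starts with three dot-separated numbers (possibly followed by a suffix such as "-log") and only compares against 5.5.3 as stated above.

```python
import re


def supports_utf8mb4(version: str) -> bool:
    """True if the reported server version is 5.5.3 or newer.

    Hypothetical helper: `version` is assumed to look like "5.7.33-log".
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups()) >= (5, 5, 3)


for v in ("5.1.73-log", "5.5.3", "8.0.23"):
    print(v, supports_utf8mb4(v))
```

Comparing integer tuples avoids the pitfall of comparing version strings as text, where "5.10.0" would sort before "5.5.3".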

