BOM prefix issues encountered when reading Unicode files (UTF-8, etc.) in Java

Little scum · Posted on 1/14/2019 4:26:17 PM

The BOM as the first character when reading Unicode files (UTF-8, etc.) in Java, and how to deal with it

Text files created with a text editor on Windows may have a BOM marker added at the very start of the file (as the first character) if you choose to save them in a Unicode format such as UTF-8.

This marker is not removed when the file is read in Java, and String.trim() does not remove it either. If you use readLine() to read the first line into a String, the String's length will be 1 greater than what you see, and its first character is the BOM.

This can cause trouble. For example, when reading an ini file, a check for whether the first line starts with "[" gives the wrong answer.
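
To make the symptom concrete, here is a minimal sketch. The class name BomFirstLineDemo and the "[settings]" line are made up for illustration; the string literal stands in for what readLine() would return for the first line of a UTF-8 file saved with a BOM.

```java
final class BomFirstLineDemo {
    /** Stand-in for the first line read from a file with a BOM (hypothetical content). */
    static String firstLineWithBom() {
        return "\uFEFF[settings]";
    }

    public static void main(String[] args) {
        String line = firstLineWithBom();
        // Ten visible characters plus the invisible BOM.
        System.out.println(line.length());                 // 11
        // trim() only strips characters up to U+0020, so U+FEFF stays.
        System.out.println(line.trim().length());          // 11
        // The section-header check fails because the BOM comes first.
        System.out.println(line.startsWith("["));          // false
        System.out.println(line.charAt(0) == '\uFEFF');    // true
    }
}
```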

Fortunately, when Java decodes a Unicode file with the matching charset, a leading BOM that reaches your code is always the single character "\uFEFF", so you can handle it manually: check for it, then remove it with substring() or replace().
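
The original code for this step is not visible here, so the following is only a sketch of the manual approach, not the author's exact code. The names BomStripDemo and stripBom are hypothetical, and a StringReader stands in for a reader over a real file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

final class BomStripDemo {
    private static final char BOM = '\uFEFF';

    /** Returns the line without a leading U+FEFF, or the line unchanged if there is none. */
    static String stripBom(String line) {
        if (line != null && !line.isEmpty() && line.charAt(0) == BOM) {
            return line.substring(1);
        }
        return line;
    }

    public static void main(String[] args) throws IOException {
        // StringReader simulates a file that was already decoded with its BOM intact.
        String content = "\uFEFF[settings]\nkey=value\n";
        try (BufferedReader reader = new BufferedReader(new StringReader(content))) {
            String first = stripBom(reader.readLine());
            System.out.println(first.startsWith("["));  // true
        }
    }
}
```

Only the first line of a file needs this check, since the BOM can appear only at the very start.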

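However, this approach is not perfect. If the generated jar file runs under Windows, there is still a problem, presumably because the platform default charset there is often not UTF-8, so the BOM bytes are not decoded to "\uFEFF". The ultimate workaround is to use the BOMInputStream provided by Apache Commons IO, which detects the BOM in the byte stream before it is decoded.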
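
The original BOMInputStream code is not visible here. As a rough illustration of the same idea (dropping the BOM at the byte level, before decoding), here is a standard-library-only sketch. It is not the Commons IO class: Utf8BomSkipper, skipUtf8Bom and firstLine are hypothetical names, it handles only the UTF-8 BOM, and decoding explicitly as UTF-8 is my assumption about what avoids the platform-charset problem.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class Utf8BomSkipper {
    /** Consumes a leading EF BB BF if present; any other leading bytes are pushed back. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < 3) {                      // read up to three bytes, tolerating short reads
            int r = pushback.read(head, n, 3 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean isBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!isBom && n > 0) {
            pushback.unread(head, 0, n);     // not a BOM: give the bytes back
        }
        return pushback;
    }

    /** Reads the first line of the given file bytes, decoding explicitly as UTF-8. */
    static String firstLine(byte[] fileBytes) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                skipUtf8Bom(new ByteArrayInputStream(fileBytes)), StandardCharsets.UTF_8))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withBom = "\uFEFF[settings]\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(firstLine(withBom).startsWith("["));  // true
    }
}
```

With Commons IO on the classpath, wrapping the raw stream in a BOMInputStream plays the role of skipUtf8Bom here, and it can also recognize BOMs of other encodings.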

What is BOM?

BOM = Byte Order Mark
The BOM is the method the Unicode specification recommends for marking byte order. For example, for UTF-16, if the receiver sees the bytes FE FF at the start, the byte stream is Big-Endian; if it sees FF FE, the byte stream is Little-Endian.
UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to signal "I am UTF-8 encoded". The UTF-8 encoding of the BOM is EF BB BF (you can see this by opening the text in UltraEdit and switching to hexadecimal view). So if the receiver gets a byte stream that starts with EF BB BF, it knows the data is UTF-8 encoded.
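
The byte sequences above can be checked from Java by encoding the character U+FEFF in each charset. This is a small sketch; BomBytesDemo and bomHex are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomBytesDemo {
    /** Encodes U+FEFF in the given charset and returns the bytes as spaced hex. */
    static String bomHex(Charset charset) {
        StringBuilder hex = new StringBuilder();
        for (byte b : "\uFEFF".getBytes(charset)) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        System.out.println("UTF-8:    " + bomHex(StandardCharsets.UTF_8));     // EF BB BF
        System.out.println("UTF-16BE: " + bomHex(StandardCharsets.UTF_16BE));  // FE FF
        System.out.println("UTF-16LE: " + bomHex(StandardCharsets.UTF_16LE));  // FF FE
    }
}
```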
