This section provides a brief introduction to the Byte Order Mark (BOM) character, U+FEFF, which serves as a Unicode character stream signature when prepended to a character stream. When encoded in UTF-8, the U+FEFF character becomes the 3-byte sequence EFBBBF.
What Is BOM (Byte Order Mark)?
BOM is the informal name of the special Unicode character U+FEFF "ZERO WIDTH NO-BREAK SPACE",
when it is prepended to a stream of Unicode characters as a "signature".
This signature tells the receiver of the stream to be ready to process Unicode characters
and to pay attention to the serialization order (byte order) of the encoding octets.
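To illustrate what "serialization order" means, here is a minimal Python sketch that serializes U+FEFF in the two UTF-16 byte orders and uses the resulting octets to guess the order. The helper name utf16_byte_order is hypothetical, and the sketch assumes the input is UTF-16 (it does not try to distinguish UTF-32, whose little-endian BOM also starts with FF FE).

```python
import codecs

def utf16_byte_order(data: bytes) -> str:
    """Guess the UTF-16 byte order from a leading BOM (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF16_BE):   # octets FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # octets FF FE
        return "little-endian"
    return "unknown"

# The same two characters, U+FEFF followed by "A", in both byte orders.
big = "\ufeffA".encode("utf-16-be")
little = "\ufeffA".encode("utf-16-le")

print(big.hex(" "), utf16_byte_order(big))        # fe ff 00 41 big-endian
print(little.hex(" "), utf16_byte_order(little))  # ff fe 41 00 little-endian
```

Because the octets FF FE read in the wrong order would give U+FFFE, which is not a valid character, the receiver can tell which order the sender used.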
When the BOM character, U+FEFF, is serialized in UTF-8 encoding, it becomes the octet sequence
EF BB BF (\xEF\xBB\xBF).
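The three octets can be derived by hand from the 3-byte UTF-8 bit pattern 1110xxxx 10xxxxxx 10xxxxxx. The Python sketch below does that derivation and checks it against Python's built-in UTF-8 encoder. The function name encode_utf8_3byte is hypothetical, and it only covers code points in the 3-byte range.

```python
def encode_utf8_3byte(code_point: int) -> bytes:
    """Encode a code point in U+0800..U+FFFF with the 3-byte UTF-8 pattern."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("code point is outside the 3-byte UTF-8 range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")
    return bytes([
        0xE0 | (code_point >> 12),           # 1110xxxx: top 4 bits
        0x80 | ((code_point >> 6) & 0x3F),   # 10xxxxxx: middle 6 bits
        0x80 | (code_point & 0x3F),          # 10xxxxxx: low 6 bits
    ])

bom = encode_utf8_3byte(0xFEFF)
print(bom.hex(" ").upper())                  # EF BB BF

# Cross-check against the built-in encoder.
assert bom == "\ufeff".encode("utf-8")
```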
As you saw in the previous tutorial, Notepad prepended U+FEFF to the text and converted it to
EFBBBF when saving the text in UTF-8 encoding. This is why I was getting these 3 extra bytes, EFBBBF,
at the beginning of the saved UTF-8 text file.
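The same effect can be reproduced outside Notepad. As a rough sketch, Python's "utf-8-sig" codec prepends the BOM when encoding, while the plain "utf-8" codec does not. The sample text "Hello" is an assumption for illustration, not the text used in the previous tutorial.

```python
import codecs

text = "Hello"  # assumed sample text

plain = text.encode("utf-8")       # no signature
signed = text.encode("utf-8-sig")  # signature prepended, as Notepad did

print(plain.hex(" "))   # 48 65 6c 6c 6f
print(signed.hex(" "))  # ef bb bf 48 65 6c 6c 6f

# The only difference is the 3 leading BOM bytes.
assert signed == codecs.BOM_UTF8 + plain
```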
With the introduction of the BOM character, we now need to be ready to support two variations of the UTF-8 text file format:
UTF-8 text file with no leading BOM character.
UTF-8 text file with the leading BOM character.
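One way to support both variations is to check for the 3 signature bytes before decoding and drop them if present. The following is a minimal Python sketch; the helper name decode_utf8_text and the sample byte strings are hypothetical.

```python
import codecs

def decode_utf8_text(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping one leading BOM if present (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")

with_bom = b"\xef\xbb\xbfHello"   # UTF-8 text with the leading BOM
without_bom = b"Hello"            # UTF-8 text with no leading BOM

# Both variations decode to the same text.
assert decode_utf8_text(with_bom) == "Hello"
assert decode_utf8_text(without_bom) == "Hello"
```

In Python, decoding with the built-in "utf-8-sig" codec has the same effect, since it skips an optional leading BOM. Decoding the first variation with plain "utf-8" would instead leave U+FEFF as the first character of the text.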
Read RFC 3629, "UTF-8, a transformation format of ISO 10646", November 2003
at http://tools.ietf.org/html/rfc3629 for more information.
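Note that RFC 3629 does not recommend prepending the BOM character to UTF-8 text in general. Section 6 of the RFC describes how a leading U+FEFF is interpreted as a signature and gives guidelines on when protocols should forbid or allow it.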