Beyond4P checks all opened files (programs and data) for a Byte Order Mark (BOM) at the beginning of the file. The BOM is a UNICODE character used to differentiate between
UTF-16 big endian, UTF-16 little endian and UTF-8 file formats. All three formats are supported; UTF-8 is by far the most common UNICODE data storage format
because it is compatible with various legacy systems that support 8-bit character sets only.
Data transparency: The byte order marks will be recognized and then discarded, i.e. not passed on as special characters to the application.
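The recognize-and-discard behavior can be illustrated with a minimal Python sketch. This is not Beyond4P's own implementation; the function name `strip_bom` and the returned encoding labels are assumptions chosen for illustration, and only the three BOMs named above are covered.

```python
import codecs

# BOM byte sequences for the three formats named above.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
)

def strip_bom(data: bytes):
    """Return (encoding, payload) with the BOM removed.

    encoding is None when the data does not start with a known BOM;
    in that case the payload is returned unchanged.
    """
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, data[len(bom):]
    return None, data

encoding, payload = strip_bom(b"\xef\xbb\xbfHello")
print(encoding, payload)  # utf-8 b'Hello'
```

The application only ever sees `payload`, so the BOM never shows up as a stray character in the data.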
How BOMs are checked in files loaded or opened:
If no BOM is found, the following checks are applied to roughly the first 4000 to 8000 bytes of the file (not the entire file, for performance reasons):
If the criteria above do not apply, then the file will be checked for typical UTF-8 patterns in the first 4000-8000 bytes, provided the file contains non-ANSI characters.
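One plausible way to test a sample for UTF-8 patterns is sketched below in Python. The exact criteria Beyond4P applies are not spelled out here, so this is an assumption: the sample must contain at least one byte above 7-bit range and must decode as strictly valid UTF-8. The name `looks_like_utf8` and the 8000-byte sample size are hypothetical.

```python
import codecs

SAMPLE_SIZE = 8000  # assumption: upper end of the range quoted above

def looks_like_utf8(data: bytes, sample_size: int = SAMPLE_SIZE) -> bool:
    """Heuristic: does the start of the data contain valid UTF-8 multi-byte sequences?"""
    sample = data[:sample_size]
    # Pure 7-bit content gives no evidence either way.
    if not any(b >= 0x80 for b in sample):
        return False
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    # If the sample was cut out of a longer file, a multi-byte sequence may be
    # split at the boundary; final=False tolerates that incomplete tail.
    is_whole_file = len(data) <= sample_size
    try:
        decoder.decode(sample, final=is_whole_file)
    except UnicodeDecodeError:
        return False
    return True
```

Text saved as win1252 rarely passes this test by accident, because a byte such as hex E9 (é) followed by a plain letter is not a valid UTF-8 sequence.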
If the input file is in HTML format, then the "charset=…" declarations will be checked accordingly. JSON files are assumed to be in UTF-8 format by default. Ambiguities may still apply in the following case:
Ambiguities need to be resolved with the system variable local settings [ input file character set ].
As long as no clear character format has been identified (e.g. UTF-16 or UTF-8), the local settings will be referenced. The initial default value is win1252 (American and West European character set).
Supported character sets summarized
Character Set | Format | Description |
---|---|---|
ANSI | 8 bits, 7 of them used | Traditional ANSI characters. All non-ANSI characters, including foreign characters, the Euro symbol, etc. are converted into question marks. Examples: E e |
iso8859-1 | 8 bits | ANSI characters plus the West European character set in the range between hex A0 (160) and hex FF (255). This format does not support the Windows proprietary character range between hex 80 (128) and hex 9F (159), which affects the Euro symbol (€). Examples: E e É é |
win1252 | 8 bits, default setting for Windows | Same as above, but includes the Windows proprietary character range, so additional punctuation symbols as well as the Euro symbol (€) will be handled correctly. Examples: E e É é € |
utf-8 | 8 bits | UNICODE format. Characters can take 1, 2, 3 or 4 bytes. Examples: E e É é € Ə ə 中国 𐌄 (also applicable in next rows below) |
utf-16 | 16 bits (little endian) | UNICODE format. Every character in the Basic Multilingual Plane takes exactly 2 bytes, starting with the least significant byte. Surrogate pairs (4 bytes) are used for characters outside the Basic Multilingual Plane. |
utf-16 big endian | 16 bits (big endian) | UNICODE format like above, but the two bytes of each 16-bit unit are swapped (most significant byte first). Surrogate pairs are used for characters outside the Basic Multilingual Plane. |
Note: Microsoft Excel does not understand utf-16 big endian, but understands the remaining UNICODE formats. Use this format only if the recipient (e.g. a UNIX server) operates on big endian format only.
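The differences between the character sets in the table can be made concrete by encoding a few of the example characters. The short Python sketch below is illustrative only; the codec names (`ascii`, `cp1252`, `utf-16-le`, `utf-16-be`) are Python's names, assumed here to correspond to the ANSI, win1252, utf-16 and utf-16 big endian rows, and `encode_or_none` is a hypothetical helper.

```python
samples = ["E", "é", "€", "中", "𐌄"]
encodings = ["ascii", "iso8859-1", "cp1252", "utf-8", "utf-16-le", "utf-16-be"]

def encode_or_none(text: str, encoding: str):
    """Return the encoded bytes, or None if the character set cannot represent the text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

for ch in samples:
    for enc in encodings:
        raw = encode_or_none(ch, enc)
        shown = raw.hex(" ") if raw is not None else "not representable"
        print(f"{ch!r:6} {enc:10} {shown}")
```

The output shows, for instance, that € is the single byte hex 80 in win1252, cannot be represented in iso8859-1, and takes three bytes in UTF-8, while 𐌄 needs four bytes (a surrogate pair) in both UTF-16 variants. Passing `errors="replace"` to `str.encode` yields a question mark instead of an error, similar to the behavior described for ANSI.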
JSON files are always loaded assuming that UTF-8 format is used.