Loading Files - Character sets

Introduction

Beyond4P checks every file it opens (programs and data) for a Byte Order Mark (BOM) at the beginning of the file. The BOM is a UNICODE character used to differentiate between the UTF-16 big endian, UTF-16 little endian and UTF-8 file formats. All three formats are supported. UTF-8 is by far the most common UNICODE data storage format because it is compatible with legacy systems that support 8-bit character sets only.

Data transparency: The byte order marks will be recognized and then discarded, i.e. not passed on as special characters to the application.

How BOMs are checked in files loaded or opened:

  • If the 2 byte sequence FE FF (hexadecimal) is found, then the text file is in UTF-16 big endian format.
  • If the 2 byte sequence FF FE (hexadecimal) is found, then the text file is in UTF-16 little endian format.
  • If the 3 byte sequence EF BB BF (hexadecimal) is found, then the text file is in UTF-8 format.
  • Some files contain multiple identical BOMs. These have been observed in export files from relational databases.
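
The BOM rules above can be sketched in Python as follows. This is only an illustration, not Beyond4P's actual implementation: the function name `detect_bom` and the returned Python codec names are assumptions, and stripping every repeated BOM is inferred from the remark about multiple identical BOMs.

```python
import codecs

# BOM signatures in the order listed above, mapped to Python codec names
# (the mapping is an assumption made for this sketch).
BOMS = [
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
]


def detect_bom(data: bytes):
    """Return (encoding, payload) with leading BOMs removed,
    or (None, data) when the data does not start with a known BOM."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            # Some export files repeat the same BOM; discard every copy
            # so that no BOM reaches the application as a character.
            while data.startswith(bom):
                data = data[len(bom):]
            return encoding, data
    return None, data


encoding, payload = detect_bom(b"\xef\xbb\xbf\xef\xbb\xbfCaf\xc3\xa9")
print(encoding, payload.decode(encoding))
```

When `detect_bom` returns `None` as the encoding, the content-based checks described next would have to take over.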

If no BOM is found, the following checks are applied to the first ca. 4000 – 8000 bytes of the file (not the entire file, for performance reasons):

  • NULL-characters / 00 (hexadecimal) in even numbered positions (first byte in file is position 0): File is UTF-16, big endian format.
    Example: 00 31 00 30 00 20 20 AC 00 0D 00 0A (10 € followed by new line sequence CR+LF).
    Even for difficult contents such as pure Chinese text, UTF-16 is identified from space symbols (00 20), numeric digits and CR+LF symbols.
  • NULL-characters / 00 (hexadecimal) in odd numbered positions: File is UTF-16, little endian format.
    Example: 31 00 30 00 20 00 AC 20 0D 00 0A 00 (10 € followed by new line sequence CR+LF)
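
A minimal sketch of this NULL-byte check, using the two example byte sequences above. The function name `guess_utf16`, the fixed 4000-byte window and the simple majority rule are assumptions made for illustration; the exact thresholds used by Beyond4P are not documented here.

```python
def guess_utf16(data: bytes, window: int = 4000):
    """Guess the UTF-16 byte order from the positions of 00 bytes.

    Returns "utf-16-be", "utf-16-le", or None if the sample gives no hint.
    """
    sample = data[:window]
    even_nulls = sample[0::2].count(0)  # positions 0, 2, 4, ...
    odd_nulls = sample[1::2].count(0)   # positions 1, 3, 5, ...
    if even_nulls > odd_nulls:
        return "utf-16-be"  # high-order byte comes first and is often 00
    if odd_nulls > even_nulls:
        return "utf-16-le"  # high-order byte comes second
    return None


big = bytes.fromhex("00 31 00 30 00 20 20 AC 00 0D 00 0A")
little = bytes.fromhex("31 00 30 00 20 00 AC 20 0D 00 0A 00")
print(guess_utf16(big), repr(big.decode("utf-16-be")))
print(guess_utf16(little), repr(little.decode("utf-16-le")))
```

Text without any 00 bytes, for example a few Greek letters encoded in UTF-16, makes this sketch return `None`, so other checks or a configured default are still needed.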

If the criteria above do not apply, then the file will be checked for typical UTF-8 patterns in the first 4000-8000 bytes, provided the file contains non-ANSI characters.

  • 1st criterion: Presence of non-ANSI bytes which make up typical UTF-8 byte patterns. These are sequences of 2, 3 or 4 bytes with specific binary patterns.
  • 2nd criterion: Presence of non-ANSI bytes which do not match UTF-8 byte patterns, e.g. simple 8-bit text in an ISO 8859-1 or WIN 1252 format with single non-ANSI characters.
  • If the 1st criterion applies, but the 2nd does not, then the file is in UTF-8 format.
  • If the 2nd criterion applies, but the 1st does not, then the file is assumed to be in the non-UNICODE format WIN 1252, which is the West European 8-bit character set.
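
The two criteria can be sketched as a single scan that counts well-formed and malformed multi-byte sequences. This is a simplified illustration under stated assumptions: the name `guess_8bit_encoding` and the 4000-byte window are hypothetical, the scan checks lead and continuation bytes only (no strict validation of overlong forms or surrogates), and returning `None` for mixed or pure 7-bit input is a choice made for this sketch.

```python
def guess_8bit_encoding(data: bytes, window: int = 4000):
    """Return "utf-8", "cp1252", or None if the sample is inconclusive."""
    sample = data[:window]
    cut_off = len(data) > len(sample)  # the sample ends inside the file
    valid = invalid = 0
    i = 0
    while i < len(sample):
        byte = sample[i]
        if byte < 0x80:          # plain 7-bit character, no evidence
            i += 1
            continue
        if 0xC2 <= byte <= 0xDF:
            length = 2           # lead byte of a 2-byte sequence
        elif 0xE0 <= byte <= 0xEF:
            length = 3           # lead byte of a 3-byte sequence
        elif 0xF0 <= byte <= 0xF4:
            length = 4           # lead byte of a 4-byte sequence
        else:
            length = 0           # stray continuation or illegal byte
        if length and cut_off and i + length > len(sample):
            break                # sequence was cut by the window, ignore it
        tail = sample[i + 1:i + length]
        if length and len(tail) == length - 1 and all(0x80 <= b <= 0xBF for b in tail):
            valid += 1           # 1st criterion: typical UTF-8 pattern
            i += length
        else:
            invalid += 1         # 2nd criterion: not a UTF-8 pattern
            i += 1
    if valid and not invalid:
        return "utf-8"
    if invalid and not valid:
        return "cp1252"
    return None


print(guess_8bit_encoding("Café €".encode("utf-8")))
print(guess_8bit_encoding("Café €".encode("cp1252")))
```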

If the input file is in HTML format, then the "charset=…" declarations are checked accordingly. JSON files are assumed to be in UTF-8 format by default. Ambiguities may still arise in the following cases:

  • 1 line of UTF-16 text containing foreign characters only and new line sequence, e.g. one sentence in Greek, Cyrillic or Chinese (without digits, spaces, new lines).
  • 8-bit text file without any non-ANSI character in the first 4000-8000 bytes (e.g. lots of English text in a huge file, and a foreign word such as Café appears only in a concluding sentence at the end of the file).

Ambiguities need to be resolved with the system variable local settings [ input file character set ]. As long as no clear character format (e.g. UTF-16 or UTF-8) has been identified, the local settings are referenced. The initial default value is win1252 (American and West European character set).

Supported character sets summarized

| Character Set | Format | Description |
|---|---|---|
| ANSI | 8 bits, 7 of them used | Traditional ANSI characters. All non-ANSI characters, including foreign characters, the Euro symbol, etc. are converted into question marks. Examples: E e |
| iso8859-1 | 8 bits | ANSI characters plus the West European character set in the range between hex A0 (160) and hex FF (255). This format does not support the Windows proprietary character range between hex 80 (128) and hex 9F (159), which affects the Euro symbol (€). Examples: E e É é |
| win1252 | 8 bits, default setting for Windows | Same as above, but includes the Windows proprietary character range, so additional punctuation symbols as well as the Euro symbol (€) are handled correctly. Examples: E e É é € |
| utf-8 | 8 bits | UNICODE format. Characters can take 1, 2, 3 or 4 bytes. Examples: E e É é € Ə ə 中国 𐌄 (also applicable to the rows below) |
| utf-16 | 16 bits (little endian) | UNICODE format. Every character in the Basic Multilingual Plane takes exactly 2 bytes, starting with the least significant byte. Surrogate pairs (4 bytes) are used for characters outside the Basic Multilingual Plane. |
| utf-16 big endian | 16 bits (big endian) | UNICODE format like above, but the two bytes are swapped. Surrogate pairs are used for characters outside the Basic Multilingual Plane. |

Note: Microsoft Excel does not understand utf-16 big endian, but understands the remaining UNICODE formats. Use this format only if the recipient (e.g. a UNIX server) operates on big endian format only.
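
To make the differences between the supported character sets concrete, the following Python sketch prints the bytes that a few sample characters take in each of them. The Python codec names are assumed equivalents of the character set names used in this section, and the helper `encoded_hex` is hypothetical. The `"?"` returned for unrepresentable characters is only a marker in this sketch.

```python
# Python codec names assumed to correspond to the supported character sets.
CODECS = ["ascii", "iso8859-1", "cp1252", "utf-8", "utf-16-le", "utf-16-be"]


def encoded_hex(char: str, codec: str) -> str:
    """Hex bytes of char in the given codec, or "?" if not representable."""
    try:
        return " ".join(f"{b:02X}" for b in char.encode(codec))
    except UnicodeEncodeError:
        return "?"


# "\U00010304" lies outside the Basic Multilingual Plane, so UTF-16
# needs a surrogate pair (4 bytes) for it.
for char in ["E", "É", "€", "\U00010304"]:
    print(repr(char), {codec: encoded_hex(char, codec) for codec in CODECS})
```

The Euro symbol illustrates the table well: it cannot be stored in iso8859-1, takes the single byte 80 in win1252, three bytes in utf-8, and two bytes in either utf-16 byte order.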

JSON files are always loaded assuming that UTF-8 format is used.