The HL7 Detect Character Encoding filter detects the character encoding of an incoming HL7 message. It does so by examining byte order marks and byte layout, and by reading the character encoding declared in MSH.18 and, in some cases, MSH.20. Refer to the HL7 specification for details about how character encodings are used in HL7.
Once the character encoding has been detected, the rhapsody:HL7Encoding message property is set containing the name of the HL7 encoding (if present), and the internal body encoding attribute on the message is set to the detected encoding. Subsequent filters processing the message can then retrieve the message easily using the correct message encoding.
This filter is only used for detecting the character encoding for non-batch HL7 messages. All other message types are sent to the error connector.
Configuration Properties
Property | Description |
---|---|
Encoding Mapping | This property enables you to map between values found in MSH.18 and the appropriate Java encoding. The mapping can include new values in MSH.18 not understood by the filter by default, or alternatively they can override the default behavior of the filter. Any settings configured in this property take precedence over the internal mapping between HL7 and Java encodings inside the filter. |
Detecting the Character Encoding
The HL7 Detect Character Encoding filter initially attempts to detect one of the following character encodings using the optional byte order marks, or byte patterns (that is, bytes with a value of zero).
- UTF-8
- UTF-16
- UTF-32
This detection is done to ensure that the contents of MSH.18 are read correctly (UTF-16 and UTF-32, in particular, will encode the values in MSH.18 differently to other encodings).
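The initial detection step can be pictured with the following minimal Python sketch. It is an illustration only, not the filter's actual implementation: the function name sniff_unicode_family and the exact rules are assumptions. It checks for a byte order mark first, then falls back to the position of zero bytes around the leading ASCII characters (an HL7 message starts with "MSH").

```python
# Byte order marks, longest first so that the UTF-32 little-endian mark
# (FF FE 00 00) is not mistaken for the UTF-16 little-endian mark (FF FE).
BOMS = [
    (b"\x00\x00\xfe\xff", "UTF-32"),
    (b"\xff\xfe\x00\x00", "UTF-32"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16"),
    (b"\xff\xfe", "UTF-16"),
]


def sniff_unicode_family(data: bytes):
    """Return 'UTF-8', 'UTF-16', 'UTF-32', or None if nothing is detected.

    Hypothetical helper: the name and the exact rules are assumptions.
    """
    for bom, family in BOMS:
        if data.startswith(bom):
            return family

    # No byte order mark: look at which of the first four bytes are zero.
    head = data[:4]
    if len(head) < 4:
        return None
    zeros = [b == 0 for b in head]
    if zeros in ([True, True, True, False], [False, True, True, True]):
        return "UTF-32"  # one ASCII character padded to four bytes
    if zeros in ([True, False, True, False], [False, True, False, True]):
        return "UTF-16"  # each ASCII character padded to two bytes
    return None
```

In this sketch, a message without a byte order mark and without zero bytes returns None, which corresponds to falling through to the MSH.18 check.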
The filter then reads the first repeat of MSH.18 to determine the HL7 encoding of the message:
- If the character encoding is specified in this field, then this is used to set the internal body encoding attribute of the message.
- If MSH.18 is empty then the message is returned from the filter with no changes. Subsequent attempts to retrieve the message for parsing will use the default system encoding. This is the same behavior as if the message never entered the filter.
- If MSH.18 is set, but the value was not recognized, then the message is sent to the error connector. In this case, the rhapsody:HL7Encoding message property is set with the value found in MSH.18.
Except for the ISO 2022 case described below, the second and subsequent repeats of MSH.18 are ignored by this filter.
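The three outcomes for the first repeat of MSH.18 can be sketched in Python as follows. This is a hedged illustration, not the filter's code: the helper names, the returned tuple, the route labels, and the KNOWN excerpt are all assumptions made for the example, and the message is assumed to be already decoded with segments separated by carriage returns.

```python
# Excerpt of a default HL7-to-Java mapping, for illustration only.
KNOWN = {"ASCII": "ASCII", "8859/1": "ISO8859_1", "UNICODE UTF-8": "UTF8"}


def msh_field(message: str, number: int) -> str:
    """Return MSH.<number> (number >= 2), or '' if the field is absent."""
    header = message.split("\r", 1)[0]
    field_sep = header[3]             # MSH.1 is the separator character itself
    fields = header.split(field_sep)  # fields[0] == "MSH", fields[1] == MSH.2
    index = number - 1
    return fields[index] if index < len(fields) else ""


def first_repeat(message: str, number: int) -> str:
    """Return the first repeat of MSH.<number>."""
    header = message.split("\r", 1)[0]
    repeat_sep = header[5]            # second encoding character, normally "~"
    return msh_field(message, number).split(repeat_sep)[0].strip()


def classify(message: str):
    """Return (route, body_encoding, hl7_encoding_property).

    Hypothetical routes: 'unchanged', 'output', or 'error'.
    """
    value = first_repeat(message, 18)
    if not value:
        return ("unchanged", None, None)      # MSH.18 empty: no changes
    if value in KNOWN:
        return ("output", KNOWN[value], value)
    return ("error", None, value)             # unrecognized: error connector
```

Later repeats of MSH.18 are deliberately ignored here, matching the behavior described above for the non-ISO 2022 case.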
ISO 2022 Handling
The ISO 2022 standard provides a mechanism for using multiple character encodings with an escape sequence to switch between these encodings. HL7 messages using this standard usually start using a plain ASCII encoding, and then switch to use another encoding. Messages encoded in this manner leave the first repeat of MSH.18 blank, but place their secondary encoding in the second repeat of MSH.18. In addition, they set the character escape sequence field MSH.20 to indicate that ISO 2022 is being used.
The HL7 Detect Character Encoding filter handles this by detecting ISO 2022 from MSH.20 and then using the second repeat of MSH.18 if the first repeat is blank. The escape sequences used to switch encodings are then handled automatically once the body encoding attribute has been set using this mechanism.
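The choice between the first and second repeat of MSH.18 might look like the Python sketch below. The function name is hypothetical, and the MSH.20 value "ISO 2022-1994" is an assumption based on the HL7 table of alternate character set handling schemes. The source text does not say which MSH.20 values the filter accepts.

```python
# Assumed MSH.20 value(s) that signal ISO 2022 handling.
ISO_2022_SCHEMES = {"ISO 2022-1994"}


def select_hl7_encoding(msh18_repeats, msh20):
    """Pick the MSH.18 repeat that determines the encoding.

    msh18_repeats: the repeats of MSH.18, already split, in order.
    msh20: the raw value of MSH.20.
    Returns '' when no encoding is declared.
    """
    first = msh18_repeats[0].strip() if msh18_repeats else ""
    if first:
        return first  # the first repeat is used whenever it is present

    uses_iso_2022 = msh20.strip() in ISO_2022_SCHEMES
    if uses_iso_2022 and len(msh18_repeats) > 1:
        return msh18_repeats[1].strip()  # first repeat blank: use the second
    return ""
```

For example, select_hl7_encoding(["", "ISO IR87"], "ISO 2022-1994") selects the second repeat, while the same repeats with an empty MSH.20 yield no encoding.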
Mapping HL7 Encodings to Java Encodings
The HL7 standard defines 22 possible values that can be used in MSH.18. The following table shows how the HL7 Detect Character Encoding filter maps these values to a Java character encoding. This mapping is built into the filter by default and does not need to be configured.
HL7 Encoding | Java Encoding | HL7 Encoding | Java Encoding |
---|---|---|---|
ASCII | ASCII | ISO IR14 | ISO2022JP / JIS_X0201 (see below) |
8859/1 | ISO8859_1 | ISO IR87 | ISO2022JP |
8859/2 | ISO8859_2 | ISO IR159 | ISO2022JP / EUC_JP |
8859/3 | ISO8859_3 | GB 18030-2000 | GB18030 |
8859/4 | ISO8859_4 | KS X 1001 | ISO2022KR / EUC_KR (see below) |
8859/5 | ISO8859_5 | CNS 11643-1992 | ISO2022CN_CNS |
8859/6 | ISO8859_6 | BIG-5 | Big5 |
8859/7 | ISO8859_7 | UNICODE | UTF-16 |
8859/8 | ISO8859_8 | UNICODE UTF-8 | UTF8 |
8859/9 | ISO8859_9 | UNICODE UTF-16 | UTF-16 |
8859/15 | ISO8859_15 | UNICODE UTF-32 | UTF_32 |
ISO IR14: If this value appears in the first repeat of MSH.18 and MSH.20 does NOT indicate that ISO 2022 is being used, it is treated as JIS_X0201. If MSH.20 indicates ISO 2022 and this value appears in the first repeat, or in the second repeat while the first repeat is empty, ISO2022JP is used.
ISO IR159: If this value appears in the first repeat of MSH.18 and MSH.20 does NOT indicate that ISO 2022 is being used, it is treated as EUC_JP. If MSH.20 indicates ISO 2022 and this value appears in the first repeat, or in the second repeat while the first repeat is empty, ISO2022JP is used.
KS X 1001: If this value appears in the first repeat of MSH.18 and MSH.20 does NOT indicate that ISO 2022 is being used, it is treated as EUC_KR. If MSH.20 indicates ISO 2022 and this value appears in the first repeat, or in the second repeat while the first repeat is empty, ISO2022KR is used.
Any settings configured in the Encoding Mapping property take precedence over the internal mapping between HL7 and Java encodings inside the filter and override the default behavior of the filter.
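The precedence rules in this section can be summarized with a short Python sketch. It is an illustration under stated assumptions rather than the filter's implementation: the function and dictionary names are hypothetical, DEFAULT_MAPPING is only an excerpt of the table above, and the override entry used in the usage note is an invented example.

```python
# Excerpt of the built-in HL7-to-Java mapping shown in the table above.
DEFAULT_MAPPING = {
    "ASCII": "ASCII",
    "8859/1": "ISO8859_1",
    "ISO IR87": "ISO2022JP",
    "GB 18030-2000": "GB18030",
    "BIG-5": "Big5",
    "UNICODE UTF-8": "UTF8",
}

# HL7 value -> (Java encoding without ISO 2022, Java encoding with ISO 2022)
ISO_2022_DEPENDENT = {
    "ISO IR14": ("JIS_X0201", "ISO2022JP"),
    "ISO IR159": ("EUC_JP", "ISO2022JP"),
    "KS X 1001": ("EUC_KR", "ISO2022KR"),
}


def resolve_java_encoding(hl7_encoding, uses_iso_2022=False, overrides=None):
    """Return the Java encoding name for an MSH.18 value, or None if unknown."""
    overrides = overrides or {}
    # Entries from the Encoding Mapping property win over everything else.
    if hl7_encoding in overrides:
        return overrides[hl7_encoding]
    # Values whose meaning depends on whether MSH.20 indicates ISO 2022.
    if hl7_encoding in ISO_2022_DEPENDENT:
        plain, iso_2022 = ISO_2022_DEPENDENT[hl7_encoding]
        return iso_2022 if uses_iso_2022 else plain
    return DEFAULT_MAPPING.get(hl7_encoding)
```

With these definitions, resolve_java_encoding("KS X 1001") gives EUC_KR, while passing uses_iso_2022=True gives ISO2022KR. An override such as {"8859/1": "Cp1252"} (a made-up configuration) replaces the built-in result for that value.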