HL7 Detect Character Encoding

The HL7 Detect Character Encoding filter detects the character encoding of an incoming HL7 message. It does this by examining byte order marks and byte layout, and by reading the character encoding declared in MSH.18 and, in some cases, MSH.20. Refer to the HL7 specification for details about how character encodings are used in HL7.

Once the character encoding has been detected, the rhapsody:HL7Encoding message property is set to the name of the HL7 encoding (if present), and the internal body encoding attribute on the message is set to the detected encoding. Subsequent filters processing the message can then retrieve the message body using the correct encoding.

This filter only detects the character encoding of non-batch HL7 messages. All other message types are sent to the error connector.

Configuration Properties

Property	Description
Encoding Mapping	This property enables you to map values found in MSH.18 to the appropriate Java encoding. The mapping can add MSH.18 values that the filter does not understand by default, or override the filter's default behavior for values it does understand. Any settings configured in this property take precedence over the internal mapping between HL7 and Java encodings inside the filter.

Detecting the Character Encoding

The HL7 Detect Character Encoding filter first attempts to detect one of the following character encodings, using either an optional byte order mark or byte patterns (that is, bytes with a value of zero):

UTF-8
UTF-16
UTF-32

This detection is done to ensure that the contents of MSH.18 are read correctly (UTF-16 and UTF-32, in particular, encode the values in MSH.18 differently from other encodings).
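The following Python sketch illustrates the kind of check described above. It is not the filter's actual implementation: the function name sniff_unicode_encoding is hypothetical, and the sketch assumes that the message begins with the ASCII text "MSH", so that the zero bytes around the letter "M" reveal the code unit width and byte order. UTF-8 without a byte order mark cannot be recognized this way, so the sketch returns None in that case.

```python
import codecs

# Byte order marks, with the UTF-32 entries first: the UTF-32 LE mark
# (FF FE 00 00) begins with the UTF-16 LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]


def sniff_unicode_encoding(data: bytes):
    """Return a guessed Unicode encoding name, or None if there is no evidence."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # No byte order mark: look at where the zero bytes sit around the 'M' of "MSH".
    head = data[:4]
    if head == b"\x00\x00\x00M":
        return "UTF-32BE"
    if head == b"M\x00\x00\x00":
        return "UTF-32LE"
    if head[:2] == b"\x00M":
        return "UTF-16BE"
    if head[:2] == b"M\x00":
        return "UTF-16LE"
    return None
```

A result of None means only that the bytes gave no indication of UTF-8, UTF-16, or UTF-32. In that case MSH.18 can be read as single-byte text.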

The filter then reads the first repeat of MSH.18 to determine the HL7 encoding of the message:

If the character encoding is specified in this field, then it is used to set the internal body encoding attribute of the message.
If MSH.18 is empty, then the message is returned from the filter with no changes. Subsequent attempts to retrieve the message for parsing will use the default system encoding. This is the same behavior as if the message had never entered the filter.
If MSH.18 is set but the value is not recognized, then the message is sent to the error connector. In this case, the rhapsody:HL7Encoding message property is set to the value found in MSH.18.

Except for the ISO 2022 case described below, the second and subsequent repeats of MSH.18 are ignored by this filter.
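The three outcomes for the first repeat of MSH.18 can be sketched in Python as follows. This is an illustration only, not the filter's code. The function names msh_field_repeats and classify_msh18, the outcome labels, and the known dictionary are all hypothetical. The sketch assumes that the message has already been decoded to text, that segments end with a carriage return, and that values are compared exactly. Whether the filter trims whitespace or ignores case is not stated here.

```python
from typing import Dict, List, Optional, Tuple


def msh_field_repeats(message: str, field_number: int) -> List[str]:
    """Return the repeats of MSH.<field_number>, for field_number >= 3."""
    msh = message.split("\r", 1)[0]   # the MSH segment ends at the first carriage return
    field_sep = msh[3]                # MSH.1 is the character right after "MSH"
    parts = msh.split(field_sep)      # parts[0] is "MSH", parts[1] is MSH.2
    repetition_sep = parts[1][1]      # second encoding character, usually "~"
    index = field_number - 1
    if index >= len(parts):
        return []                     # the segment is too short to contain the field
    return parts[index].split(repetition_sep)


def classify_msh18(message: str, known: Dict[str, str]) -> Tuple[str, Optional[str]]:
    """Decide what happens to a message based on the first repeat of MSH.18."""
    repeats = msh_field_repeats(message, 18)
    first = repeats[0] if repeats else ""
    if not first:
        return ("unchanged", None)            # empty: message passes through as-is
    if first in known:
        return ("set_encoding", known[first])  # recognized: set the body encoding
    return ("error", first)                   # unrecognized: route to the error connector
```

For example, with known = {"UNICODE UTF-8": "UTF8"}, a message whose MSH.18 contains UNICODE UTF-8~8859/1 yields ("set_encoding", "UTF8"), because only the first repeat is examined.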

ISO 2022 Handling

The ISO 2022 standard provides a mechanism for using multiple character encodings with an escape sequence to switch between these encodings. HL7 messages using this standard usually start using a plain ASCII encoding, and then switch to use another encoding. Messages encoded in this manner leave the first repeat of MSH.18 blank, but place their secondary encoding in the second repeat of MSH.18. In addition, they set the character escape sequence field MSH.20 to indicate that ISO 2022 is being used.

The HL7 Detect Character Encoding filter handles this by detecting ISO 2022 from MSH.20 and then using the second repeat of MSH.18 if the first repeat is blank. The escape sequences used to switch encodings are then handled automatically once the body encoding attribute has been set using this mechanism.
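As a rough illustration of this selection rule, the Python sketch below picks which repeat of MSH.18 to use. The function name select_msh18_value is hypothetical. The test for ISO 2022 is an assumption: it treats any MSH.20 value beginning with "ISO 2022" (such as "ISO 2022-1994") as indicating ISO 2022. The exact values the filter accepts are not listed here.

```python
from typing import List


def select_msh18_value(msh18_repeats: List[str], msh20: str) -> str:
    """Return the MSH.18 value to map, or "" if none applies."""
    first = msh18_repeats[0] if msh18_repeats else ""
    if first:
        # A non-blank first repeat is always the one that is used.
        return first
    # Assumption: MSH.20 signals ISO 2022 with a value starting "ISO 2022".
    uses_iso2022 = msh20.strip().upper().startswith("ISO 2022")
    if uses_iso2022 and len(msh18_repeats) > 1:
        # Blank first repeat plus ISO 2022: fall back to the second repeat.
        return msh18_repeats[1]
    return ""
```

With MSH.18 repeats ["", "ISO IR87"], the function returns "ISO IR87" when MSH.20 is "ISO 2022-1994" and an empty string when MSH.20 is blank.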

Mapping HL7 Encodings to Java Encodings

The HL7 standard defines 22 possible values that can be used in MSH.18. The following table shows how the HL7 Detect Character Encoding filter maps these values to a Java character encoding. This mapping is built into the filter by default and does not need to be configured.

HL7 Encoding	Java Encoding	HL7 Encoding	Java Encoding
ASCII	ASCII	ISO IR14	ISO2022JP / JIS_X0201 (see below)
8859/1	ISO8859_1	ISO IR87	ISO2022JP
8859/2	ISO8859_2	ISO IR159	ISO2022JP / EUC_JP (see below)
8859/3	ISO8859_3	GB 18030-2000	GB18030
8859/4	ISO8859_4	KS X 1001	ISO2022KR / EUC_KR (see below)
8859/5	ISO8859_5	CNS 11643-1992	ISO2022CN_CNS
8859/6	ISO8859_6	BIG-5	Big5
8859/7	ISO8859_7	UNICODE	UTF-16
8859/8	ISO8859_8	UNICODE UTF-8	UTF8
8859/9	ISO8859_9	UNICODE UTF-16	UTF-16
8859/15	ISO8859_15	UNICODE UTF-32	UTF_32

ISO IR14: If this is indicated in the first repeat of MSH.18 and MSH.20 does NOT indicate that ISO 2022 is being used, then this is treated as JIS_X0201. If MSH.20 indicates ISO 2022 and this value is in the first repeat, or in the second repeat with the first repeat blank, then ISO2022JP is used.

ISO IR159: If this is indicated in the first repeat of MSH.18 and MSH.20 does NOT indicate that ISO 2022 is being used, then this is treated as EUC_JP. If MSH.20 indicates ISO 2022 and this value is in the first repeat, or in the second repeat with the first repeat blank, then ISO2022JP is used.

KS X 1001: If this is indicated in the first repeat of MSH.18 and MSH.20 does NOT indicate that ISO 2022 is being used, then this is treated as EUC_KR. If MSH.20 indicates ISO 2022 and this value is in the first repeat, or in the second repeat with the first repeat blank, then ISO2022KR is used.

Any settings configured in the Encoding Mapping property take precedence over the internal mapping between HL7 and Java encodings inside the filter and override the default behavior of the filter.
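The lookup order described in this section can be sketched in Python as follows. The dictionaries reproduce the table above. The function name resolve_java_encoding, the overrides parameter (standing in for the Encoding Mapping property), and the example override values are hypothetical. This is a sketch of the documented precedence, not the filter's implementation.

```python
from typing import Dict, Optional

# Built-in mapping, using the values that apply when ISO 2022 is NOT indicated.
DEFAULT_MAPPING: Dict[str, str] = {
    "ASCII": "ASCII",
    "8859/1": "ISO8859_1", "8859/2": "ISO8859_2", "8859/3": "ISO8859_3",
    "8859/4": "ISO8859_4", "8859/5": "ISO8859_5", "8859/6": "ISO8859_6",
    "8859/7": "ISO8859_7", "8859/8": "ISO8859_8", "8859/9": "ISO8859_9",
    "8859/15": "ISO8859_15",
    "ISO IR14": "JIS_X0201", "ISO IR87": "ISO2022JP", "ISO IR159": "EUC_JP",
    "GB 18030-2000": "GB18030", "KS X 1001": "EUC_KR",
    "CNS 11643-1992": "ISO2022CN_CNS", "BIG-5": "Big5",
    "UNICODE": "UTF-16", "UNICODE UTF-8": "UTF8",
    "UNICODE UTF-16": "UTF-16", "UNICODE UTF-32": "UTF_32",
}

# Values whose mapping changes when MSH.20 indicates ISO 2022.
ISO2022_MAPPING: Dict[str, str] = {
    "ISO IR14": "ISO2022JP",
    "ISO IR159": "ISO2022JP",
    "KS X 1001": "ISO2022KR",
}


def resolve_java_encoding(
    hl7_value: str,
    uses_iso2022: bool,
    overrides: Optional[Dict[str, str]] = None,
) -> Optional[str]:
    """Return the Java encoding name for an MSH.18 value, or None if unknown."""
    if overrides and hl7_value in overrides:
        return overrides[hl7_value]          # configured mapping takes precedence
    if uses_iso2022 and hl7_value in ISO2022_MAPPING:
        return ISO2022_MAPPING[hl7_value]    # ISO 2022 variant of the encoding
    return DEFAULT_MAPPING.get(hl7_value)    # built-in mapping, or None
```

For example, resolve_java_encoding("ISO IR14", uses_iso2022=False) returns "JIS_X0201", while the same value with uses_iso2022=True returns "ISO2022JP". Passing overrides={"X-LOCAL": "UTF8"} makes the otherwise unknown, made-up value "X-LOCAL" resolve to "UTF8".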