This document will be considered ready for transition to Proposed Recommendation at the same time that the XQuery 3.1 specification is ready for transition to Proposed Recommendation.

This document specifies XSLT and XQuery Functions and Operators (F&O) version 4.0, a fully compatible extension of F&O version 3.1. It differs from version 3.1 primarily by the addition of a number of new functions. There are numerous smaller differences as well, all documented in the change log.

W3C. Version 3.1 Recommendation: https://www.w3.org/TR/2017/REC-xpath-functions-31-20170321/. Michael Kay, Saxonica, http://www.saxonica.com/

This section describes the status of this document at the time of its publication. Other documents may supersede this document.

This document is a working draft developed and maintained by a W3C Community Group, the XQuery and XSLT Extensions Community Group, unofficially known as QT4CG (where "QT" denotes Query and Transformation). This draft is a work in progress and should not be considered either stable or complete. Standard W3C copyright and patent conditions apply.

The community group welcomes comments on the specification. Comments are best submitted as issues on the group's GitHub repository.

The community group maintains two extensive test suites, one oriented to XQuery and XPath, the other to XSLT. These can be found at qt4tests and xslt40-test respectively. New tests, or suggestions for correcting existing tests, are welcome. The test suites include extensive metadata describing the conditions for applicability of each test case as well as the expected results. They do not include any test drivers for executing the tests: each implementation is expected to provide its own test driver.

The publications of this community group are dedicated to our co-chair, Michael Sperberg-McQueen (1954–2024).

This document defines constructor functions, operators, and functions on the datatypes defined in XML Schema and the datatypes defined in the XDM data model. It also defines functions and operators on nodes and node sequences as defined in the XDM data model. These functions and operators are defined for use in XPath 4.0, XQuery 4.0, XSLT 4.0, and other related XML standards. The signatures and summaries of functions defined in this document are available at: http://www.w3.org/2005/xpath-functions/.

A summary of changes since version 3.1 is provided in the change log.

English

Introduction

If a section of this specification has been updated since version 3.1, an overview of the changes is provided, along with links to navigate to the next or previous change. Sections with significant changes are marked with a ✭ symbol in the table of contents. New functions are indicated by ✚.

The purpose of this document is to define functions and operators for inclusion in XPath 4.0, XQuery 4.0, and XSLT 4.0. The exact syntax used to call these functions and operators is specified in the XPath 4.0, XQuery 4.0, and XSLT 4.0 specifications.

This document defines three classes of functions:

General purpose functions, available for direct use in user-written queries, stylesheets, and XPath expressions, whose arguments and results are values defined by the XDM data model.

Constructor functions, used for creating instances of a datatype from values of (in general) a different datatype. These functions are also available for general use; they are named after the datatype that they return, and they always take a single argument.

Functions that specify the semantics of operators defined in XPath 4.0 and XQuery 4.0. These exist for specification purposes only, and are not intended for direct calling from user-written code.

XML Schema defines a number of primitive and derived datatypes, collectively known as built-in datatypes. This document defines functions and operations on these datatypes as well as the other types (for example, nodes and sequences of nodes) defined in the XDM data model. These functions and operations are available for use in XPath 4.0, XQuery 4.0, and any other host language that chooses to reference them. In particular, they may be referenced in future versions of XSLT and related XML standards.

XSD 1.1 adds to the datatypes defined in XSD 1.0. It introduces a new derived type xs:dateTimeStamp, and it incorporates as built-in types the two types xs:yearMonthDuration and xs:dayTimeDuration, which were previously XDM additions to the type system. In addition, XSD 1.1 clarifies and updates many aspects of the definitions of the existing datatypes: for example, it extends the value space of xs:double to allow both positive and negative zero, and extends the lexical space to allow +INF; it modifies the value space of xs:Name to permit additional Unicode characters; it allows year zero and disallows leap seconds in xs:dateTime values; and it allows any character string to appear as the value of an xs:anyURI item. Implementations of this specification may support either XSD 1.0 or XSD 1.1 or both.

In some cases, this specification references XSD for the semantics of operations such as the effect of matching using regular expressions, or conversion of atomic items to strings. In most such cases there is no intended technical difference between the XSD 1.0 and XSD 1.1 specifications, but the 1.1 version often provides clearer explanations and sometimes also corrects technical errors. In such cases this specification often chooses to reference the XSD 1.1 specification. This should not be taken as implying that it is necessary to invoke an XSD 1.1 processor.

References to specific sections of some of the above documents are indicated by cross-document links in this document. Each such link consists of a pointer to a specific section followed by a superscript specifying the linked document. The superscripts have the following meanings: XQ (XQuery 4.0), XT (XSLT 4.0), XP (XPath 4.0), and DM (the XDM data model).

Operators

Despite its title, this document does not attempt to define the semantics of all the operators available in the language; indeed, in the interests of avoiding duplication, the majority of operators are now defined entirely within XPath 4.0 (for XQuery, the definitions are also found in XQuery 4.0, and are generally identical).

The remaining operators that are described in this publication are the arithmetic operators, where the semantics of the operator depend on the types of the arguments. For these operators, the language specification describes rules for selecting an internal function defined in this specification to underpin the operator. For example, when the operator x+y is applied to two operands of type xs:double, the function op:numeric-add is selected.
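The selection rule described above can be pictured as a lookup keyed on the operator and the operand types. The following is a minimal Python sketch, not the mechanism of any real processor: the `DISPATCH` table and `select_function` are hypothetical, Python's `float` and `timedelta` stand in for xs:double and a duration type, and only the op:numeric-add entry is taken from the text above (the duration entry is an assumed illustration).

```python
from datetime import timedelta

# Hypothetical dispatch table: (operator, left type, right type) -> name of
# the underlying function. Only op:numeric-add comes from the text above;
# the duration entry is an assumption added for contrast.
DISPATCH = {
    ("+", float, float): "op:numeric-add",
    ("+", timedelta, timedelta): "op:add-dayTimeDurations",
}


def select_function(operator, left, right):
    """Return the name of the function that underpins the operator."""
    key = (operator, type(left), type(right))
    try:
        return DISPATCH[key]
    except KeyError:
        raise TypeError(
            f"no function underpins {operator!r} for "
            f"{key[1].__name__} and {key[2].__name__}"
        )


print(select_function("+", 1.5, 2.5))
print(select_function("+", timedelta(hours=1), timedelta(minutes=5)))
```

A real processor would also apply type promotion and atomization before the lookup; the sketch leaves those steps out.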

Previous versions of this specification also defined the value comparison operators such as eq, lt, and gt in terms of functions such as op:date-greater-than. These have been dropped; for most data types the semantics of the value comparison operators are defined by reference to the fn:compare function.

Conformance

Higher-order functions are no longer an optional feature.

This recommendation contains a set of function specifications. It defines conformance at the level of individual functions. An implementation of a function conforms to a function specification in this recommendation if all the following conditions are satisfied:

For all combinations of valid inputs to the function (both explicit arguments and implicit context dependencies), the result of the function meets the mandatory requirements of this specification.

For all invalid inputs to the function, the implementation raises (in some way appropriate to the calling environment) a dynamic error.

For a sequence of calls within the same execution scope, the requirements of this recommendation regarding the determinism of results are satisfied.

Other recommendations (“host languages”) that reference this document may dictate:

Subsets or supersets of this set of functions to be available in particular environments;

Mechanisms for invoking functions, supplying arguments, initializing the static and dynamic context, receiving results, and handling errors;

A concrete realization of concepts such as execution scope;

Which versions of other specifications referenced herein (for example, XML, XSD, or Unicode) are to be used.

Any behavior that is discretionary (implementation-defined or implementation-dependent) in this specification may be constrained by a host language.

Adding such constraints in a host language, however, is discouraged because it makes it difficult to reuse implementations of the function library across host languages.

This specification allows flexibility in the choice of versions of specifications on which it depends:

It is implementation-defined which version of Unicode is supported, but it is recommended that the most recent version of Unicode be used.

It is implementation-defined whether the type system is based on XML Schema 1.0 or XML Schema 1.1.

It is implementation-defined whether definitions that rely on XML (for example, the set of valid XML characters) should use the definitions in XML 1.0 or XML 1.1.

The XML Schema 1.1 recommendation introduces one new concrete datatype: xs:dateTimeStamp; it also incorporates the types xs:dayTimeDuration, xs:yearMonthDuration, and xs:anyAtomicType, which were previously defined in earlier versions of the XDM data model. Furthermore, XSD 1.1 includes the option of supporting revised definitions of types such as xs:NCName based on the rules in XML 1.1 rather than 1.0.

The XDM data model allows flexibility in the repertoire of characters permitted during processing that goes beyond even what version of XML is supported. A processor may allow the user to construct nodes and atomic items that contain characters not allowed by any version of XML. A permitted character is one within the repertoire accepted by the implementation.

In this document, text labeled as an example or as a note is provided for explanatory purposes and is not normative.

Namespaces and prefixes

The functions and operators defined in this document are contained in one of several namespaces, listed below, and referenced using an xs:QName.

This document uses conventional prefixes to refer to these namespaces. User-written applications can choose a different prefix to refer to the namespace, so long as it is bound to the correct URI. The host language may also define a default namespace for function calls, in which case function names in that namespace need not be prefixed at all. In many cases the default namespace will be http://www.w3.org/2005/xpath-functions, allowing a call on the fn:name function (for example) to be written as name() rather than fn:name(); in this document, however, all example function calls are explicitly prefixed.

The URIs of the namespaces and the conventional prefixes associated with them are:

http://www.w3.org/2001/XMLSchema for constructors — associated with xs.

This document defines constructor functions for the built-in datatypes defined in XML Schema and in the XDM data model. These datatypes and the corresponding constructor functions are in the XML Schema namespace, http://www.w3.org/2001/XMLSchema, and are named in this document using the xs prefix.

http://www.w3.org/2005/xpath-functions for functions — associated with fn.

The namespace prefix used in this document for most functions that are available to users is fn.

http://www.w3.org/2005/xpath-functions/math for functions — associated with math.

This namespace is used for some mathematical functions. The namespace prefix used in this document for these functions is math. These functions are available to users in exactly the same way as those in the fn namespace.

http://www.w3.org/2005/xpath-functions/map for functions — associated with map.

This namespace is used for some functions that manipulate maps. The namespace prefix used in this document for these functions is map. These functions are available to users in exactly the same way as those in the fn namespace.

http://www.w3.org/2005/xpath-functions/array for functions — associated with array.

This namespace is used for some functions that manipulate arrays. The namespace prefix used in this document for these functions is array. These functions are available to users in exactly the same way as those in the fn namespace.

http://www.w3.org/2005/xqt-errors — associated with err.

There are no functions in this namespace; it is used for error codes.

This document uses the prefix err to represent the namespace URI http://www.w3.org/2005/xqt-errors, which is the namespace for all XPath and XQuery error codes and messages. This namespace prefix is not predeclared and its use in this document is not normative.

http://www.w3.org/2010/xslt-xquery-serialization — associated with output.

There are no functions in this namespace: it is used for serialization parameters, as described in the XSLT and XQuery Serialization specification.

Functions defined with the op prefix are described here to underpin the definitions of the operators in XPath 4.0, XQuery 4.0, and XSLT 4.0. These functions are not available directly to users, and there is no requirement that implementations should actually provide these functions. For this reason, no namespace is associated with the op prefix. For example, multiplication is generally associated with the * operator, but it is described in this document as a function, op:numeric-multiply.

Sometimes there is a need to use an operator as a function. To meet this requirement, the function fn:op takes any simple binary operator as its argument, and returns a corresponding function. So for example fn:for-each-pair($seq1, $seq2, op("+")) performs a pairwise addition of the values in two input sequences.
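The idea of turning an operator into a function has a close analogue in Python's `operator` module. The sketch below is only an analogy: `op` and `for_each_pair` are hypothetical Python stand-ins named after fn:op and fn:for-each-pair, and the table covers just three operators.

```python
import operator

# Hypothetical stand-in for fn:op: maps an operator symbol to a binary function.
_OPERATORS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
}


def op(symbol):
    """Return the binary function that corresponds to the operator symbol."""
    return _OPERATORS[symbol]


def for_each_pair(seq1, seq2, action):
    """Apply action to corresponding items; zip stops at the shorter input."""
    return [action(a, b) for a, b in zip(seq1, seq2)]


print(for_each_pair([1, 2, 3], [10, 20, 30], op("+")))  # [11, 22, 33]
```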

The above namespace URIs are not expected to change from one version of this document to another. The contents of these namespaces may be extended to allow additional functions (and errors, and serialization parameters) to be defined.

Function overloading

A function is uniquely defined by its name and arity (number of arguments); it is therefore not possible to have two different functions that have the same name and arity, but different types in their signature. That is, function overloading in this sense of the term is not permitted. Consequently, functions such as fn:string which accept arguments of many different types have a signature that defines a very general argument type, in this case item()?, which accepts any single item or the empty sequence; supplying an inappropriate item (such as a function item) causes a dynamic error.

Some functions on numeric types include the type xs:numeric in their signature as an argument or result type. In this version of the specification, xs:numeric has been redefined as a built-in union type representing the union of xs:decimal, xs:float, xs:double (and thus automatically accepting types derived from these, including xs:integer).

Operators such as + may be overloaded: they map to different underlying functions depending on the dynamic types of the supplied operands.

It is possible for two functions to have the same name provided they have different arity (number of arguments). For the functions defined in this specification, where two functions have the same name and different arity, they also have closely related behavior, so they are defined in the same section of this document.
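The name-plus-arity rule can be modelled as a registry keyed on the pair (name, arity). This is a toy Python sketch under that assumption; `FunctionRegistry` is hypothetical and the two registered bodies are placeholders, not the behavior of fn:string.

```python
class FunctionRegistry:
    """Toy registry in which a function is identified by (name, arity)."""

    def __init__(self):
        self._functions = {}

    def register(self, name, arity, implementation):
        key = (name, arity)
        if key in self._functions:
            # Same name and same arity: a second definition is not allowed,
            # whatever the parameter types are.
            raise ValueError(f"{name}#{arity} is already defined")
        self._functions[key] = implementation

    def lookup(self, name, arity):
        return self._functions[(name, arity)]


registry = FunctionRegistry()
# Same name, different arity: both registrations succeed.
registry.register("fn:string", 0, lambda: "placeholder for the context value")
registry.register("fn:string", 1, lambda item: str(item))

print(registry.lookup("fn:string", 1)(42))  # 42
```

Registering a second `fn:string` with arity 1 would raise `ValueError`, which mirrors the statement that two functions cannot differ only in the types of their signature.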

Function signatures and descriptions

Each function (or group of functions having the same name) is defined in this specification using a standard proforma. This has the following sections:

Function name

The function name is a QName as defined in XML Schema and must adhere to its syntactic conventions. Following the precedent set by XPath, function names are generally composed of English words separated by hyphens: specifically U+002D (HYPHEN-MINUS, -). Abbreviations are used only where there is a strong precedent in other programming languages (as with math:sin and math:cos for sine and cosine). If a function name contains a datatype name, it may have intercapitalized spelling and is used in the function name as such. An example is fn:timezone-from-dateTime.

Function summary

The first section in the proforma is a short summary of what the function does. This is intended to be informative rather than normative.

Function signature

Each function is then defined by specifying its signature(s), which define the types of the parameters and of the result value.

Where functions take a variable number of arguments, two conventions are used:

Wherever possible, a single function signature is used giving default values for those parameters that can be omitted.

If this is not possible, because the effect of omitting a parameter cannot be specified by giving a default value, multiple signatures are given for the function.

Each function signature is presented in a form like this:

fn:function-name($parameter-name as parameter-type := default-value, ...) as return-type

In this notation, function-name, in bold-face, is the local name of the function whose signature is being specified. The prefix fn indicates that the function is in the namespace http://www.w3.org/2005/xpath-functions: this is one of the conventional prefixes listed under Namespaces and prefixes. If the function takes no parameters, then the name is followed by an empty parameter list: (); otherwise, the name is followed by a parenthesized list of parameter declarations. Each parameter declaration includes:

The name of the parameter (which in 4.0 is significant because it can be used as a keyword in a function call)

The static type of the parameter (in italics)

If the parameter is optional, then an expression giving the default value (preceded by the symbol :=).

The default value expression is evaluated using the static and dynamic context of the function caller (or of a named function reference). For example, if the default value is given as ., then it evaluates to the context value from the dynamic context of the function caller; if it is given as default-collation, then its value is the default collation from the static context of the function caller; if it is given as deep-equal#2, then the third argument supplied to deep-equal is the default collation from the static context of the caller.

If there are two or more parameter declarations, they are separated by a comma.

The return-type, also in italics, specifies the static type of the value returned by the function. The dynamic type of the value returned by the function is the same as its static type or derived from the static type. All parameter types and return types are specified using the SequenceType notation defined in XPath 4.0.

Function rules

The next section in the proforma defines the semantics of the function as a set of rules. The order in which the rules appear is significant; they are to be applied in the order in which they are written. Error conditions, however, are generally listed in a separate section that follows the main rules, and take precedence over non-error rules except where otherwise stated. The principles governing errors and optimization in XPath 4.0 apply by default: to paraphrase, if the result of the function can be determined without evaluating all its arguments, then it is not necessary to evaluate the remaining arguments merely in order to determine whether any error conditions apply.

Formal Equivalents

Some functions supplement the prose rules with a more formal specification that describes the effect of the function in terms of an equivalent XPath or XQuery implementation. This is intended to take precedence over the prose rules in the event of any conflict; however, both sections are intended to be complete and not to rely on each other.

In writing the formal equivalents, a number of guidelines have been followed:

Where the equivalent code calls other functions, these should either be primitives defined in the data model specification, or functions that themselves have a formal equivalent; and the dependencies should not be circular.

There should be minimal reliance on XPath or XQuery language features. Although no attempt has been made to precisely define a core set of language constructs, the specifications try to avoid relying on features other than function calls and a few basic operators including the comma operator, equality testing, and simple integer arithmetic.

There is no suggestion that the formal equivalent is a practical implementation; in many cases it might have very poor performance.

In some cases the formal equivalent does not attempt to replicate correct behavior in error cases; if so, this is always clearly stated.

The formal equivalent will always produce a conformant result for the function, but in some cases this will not be the only possible conformant result.

This worthy intent is not yet fully achieved; for example there are formal specifications that invoke fn:atomic-equal.

There is no attempt to write formal equivalents for functions that have complex logic (such as fn:format-number) or dependencies (such as fn:doc); the aim of the formal equivalents is to define as rigorously as possible a platform of basic functionality that can be used as a solid foundation for more complex features.

Notes

Where the proforma includes a section headed Notes, these are non-normative.

Examples

Where the proforma includes a section headed Examples, these are non-normative.

Many of the examples are given in structured form, showing example expressions and their expected results. These published examples are derived from executable test cases, so they follow a standard format. In general, the actual result of the expression is expected to be deep-equal to the presented result, under the rules of the fn:deep-equal function with default options. In some cases the result is qualified to indicate that the order of items in the result is implementation-dependent, or that numeric results are approximate.

For more complex functions, examples may be given using informal narrative prose.

Function calls

Rules for evaluating the operands of operators are described in the relevant sections of XPath 4.0 and XQuery 4.0. For example, the rules for evaluating the operands of arithmetic operators are described in the section of those specifications on arithmetic expressions. Specifically, rules for parameters of type xs:untypedAtomic and the empty sequence are specified in that section.

For function calls, the required type of an argument is defined in the function signature of each function, and the way in which a supplied value is converted to the required type (or rejected if it cannot be converted) is defined by the coercion rules.

Some functions accept a single value or the empty sequence as an argument and some may return a single value or the empty sequence. This is indicated in the function signature by following the parameter or return type name with a question mark: ?, indicating that either a single value or the empty sequence must appear. For example, a parameter declared as xs:string? accepts either a single xs:string or the empty sequence.

Note that such a signature is different from a signature in which the parameter is omitted. See, for example, the two signatures for fn:string. In the first signature, the parameter is omitted and the argument defaults to the context value, referred to as . (dot). In the second signature, the argument must be present but may be the empty sequence, written as ().

Some functions accept a sequence of zero or more values as an argument. This is indicated by following the name of the type of the items in the sequence with *. The sequence may contain zero or more items of the named type. For example, a function whose parameter is declared as xs:double* and whose return type is xs:double? accepts a sequence of xs:double values and returns an xs:double or the empty sequence.

In XPath 4.0, the arguments in a function call can be supplied by keyword as an alternative to supplying them positionally. For example the call resolve-uri(@href, static-base-uri()) can now be written resolve-uri(base: static-base-uri(), relative: @href). The order in which arguments are supplied can therefore differ from the order in which they are declared. The specification, however, continues to use phrases such as “the second argument” as a convenient shorthand for "the value of the argument that is bound to the second parameter declaration".
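Python offers the same choice between positional and keyword arguments, which makes for a convenient illustration. In the sketch below, `resolve_uri` is a hypothetical Python stand-in built on `urllib.parse.urljoin`; its parameter names `relative` and `base` simply follow the call shown above, and the URIs are made up.

```python
from urllib.parse import urljoin


def resolve_uri(relative, base):
    """Hypothetical stand-in: resolve a relative reference against a base URI."""
    return urljoin(base, relative)


base_uri = "http://example.org/book/chapter1.html"

# Positional call: arguments are bound in declaration order.
positional = resolve_uri("chapter2.html", base_uri)

# Keyword call: the order of the arguments differs from the declaration order,
# yet "the second argument" still means the value bound to the second parameter.
by_keyword = resolve_uri(base=base_uri, relative="chapter2.html")

print(positional)  # http://example.org/book/chapter2.html
print(positional == by_keyword)  # True
```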

Options

Use of an option keyword that is not defined in the specification and is not known to the implementation now results in a dynamic error; previously it was ignored.

As a matter of convention, a number of functions defined in this document take a parameter whose value is a map, defining options controlling the detail of how the function is evaluated. Maps are a new datatype introduced in XPath 3.1.

For example, the function fn:xml-to-json has an options parameter allowing specification of whether the output is to be indented. A call might be written: fn:xml-to-json($input, map { 'indent': true() })

Functions that take an options parameter adopt common conventions on how the options are used. These are referred to as the option parameter conventions. These rules apply only to functions that explicitly refer to them.

Where a function adopts the option parameter conventions, the following rules apply:

The value of the relevant argument must be a map. The entries in the map are referred to as options: the key of the entry is called the option name, and the associated value is the option value. Option names defined in this specification are always strings (single xs:string values). Option values may be of any type.

The type of the options parameter in the function signature is always given as map(*).

Although option names are described above as strings, the actual key may be any value that is the same key as the required string. For example, instances of xs:untypedAtomic or xs:anyURI are equally acceptable.

This means that the implementation of the function can check for the presence and value of particular options using the functions map:contains and/or map:get.

Implementations may attach an implementation-defined meaning to options in the map that are not described in this specification. These options should use values of type xs:QName as the option names, using an appropriate namespace.

If an option is present whose key is not described in the specification, then a type error must be raised unless either (a) the key is recognized by the implementation, or (b) the key is a value of type xs:QName with a non-absent namespace.

All entries in the options map are optional, and supplying the empty map has the same effect as omitting the relevant argument in the function call, assuming this is permitted.

The ordering of the options map is immaterial.

For each named option, the function specification defines a required type for the option value. The value that is actually supplied in the map is converted to this required type using the coercion rules. This will result in an error if conversion of the supplied value to the required type is not possible. A type error also occurs if this conversion delivers a coerced function whose invocation fails with a type error. A dynamic error occurs if the supplied value after conversion is not one of the permitted values for the option in question: the error codes for this error are defined in the specification of each function.

It is the responsibility of each function implementation to invoke this conversion; it does not happen automatically as a consequence of the function-calling rules.

In cases where the value of an option is itself a map, the specification of the particular function must indicate whether or not these rules apply recursively to the contents of that map.
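The conventions above can be sketched as a small validation routine. This is an illustrative Python model only: the `KNOWN_OPTIONS` table, the `QName` class, and `check_options` are hypothetical; a plain `isinstance` test stands in for the coercion rules; and Python's `TypeError` and `ValueError` stand in for a type error and a dynamic error respectively.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QName:
    namespace: str  # the empty string models an absent namespace
    local: str


# Hypothetical options of an imaginary function:
# option name -> (required type, permitted values or None for "any").
KNOWN_OPTIONS = {
    "indent": (bool, None),
    "method": (str, {"xml", "json"}),
}


def check_options(options, vendor_keys=frozenset()):
    """Validate an options map; entry order is irrelevant."""
    checked = {}
    for key, value in options.items():
        if key in KNOWN_OPTIONS:
            required_type, permitted = KNOWN_OPTIONS[key]
            if not isinstance(value, required_type):
                # Stand-in for a failed conversion to the required type.
                raise TypeError(f"option {key!r} requires {required_type.__name__}")
            if permitted is not None and value not in permitted:
                # Stand-in for the dynamic error on a non-permitted value.
                raise ValueError(f"option {key!r} does not permit {value!r}")
            checked[key] = value
        elif key in vendor_keys or (isinstance(key, QName) and key.namespace):
            # Recognized by the implementation, or a QName in a namespace.
            checked[key] = value
        else:
            raise TypeError(f"unknown option {key!r}")
    return checked


print(check_options({"indent": True, "method": "json"}))
```

Supplying the empty map returns an empty result, matching the rule that all entries are optional.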

Type System

The diagrams in this section show how nodes, functions, primitive simple types, and user defined types fit together into a type system. This type system comprises two distinct subsystems that both include the primitive atomic types. In the diagrams, connecting lines represent relationships between derived types and the types from which they are derived; the former are always below and to the right of the latter.

The types xs:IDREFS, xs:NMTOKENS, xs:ENTITIES, and xs:numeric, together with user-defined list types and user-defined union types, are special in that they are lists or unions rather than types derived by extension or restriction.

Item Types

The first diagram illustrates the relationship of various item types.

Item types are used to characterize the various types of item that can appear in a sequence (nodes, atomic items, and functions), and they are therefore used in declaring the types of variables or the argument types and result types of functions.

In XDM, item types include node types, function types, and built-in atomic types. Item types form a directed graph, rather than a hierarchy or lattice: in the relationship defined by the derived-from(A, B) function, some types are derived from more than one other type. Examples include functions (function(xs:string) as xs:int is substitutable for function(xs:NCName) as xs:int and also for function(xs:string) as xs:decimal), and choice types (A is substitutable for the choice type (A | B) and also for (A | C)). Record types provide an alternative way of categorizing maps: the instances of record(longitude, latitude) overlap with the instances of map(xs:string, xs:double). The diagram, which shows only hierarchic relationships, is therefore a simplification of the full model.

[Diagram: item types]

Schema Type Hierarchy

The next diagram illustrates the schema type subsystem, in which all types are derived from xs:anyType.

Schema types include built-in types defined in the XML Schema specification, and user-defined types defined using mechanisms described in the XML Schema specification. Schema types define the permitted contents of nodes. The main categories are complex types, which define the permitted content of elements, and simple types, which can be used to constrain the values of both elements and attributes.

[Diagram: schema type hierarchy]

Atomic Type Hierarchy

The final diagram shows all of the atomic types, including the primitive simple types and the built-in types derived from the primitive simple types. This includes all the built-in datatypes defined in XML Schema.

Atomic types are both item types and schema types, so the root type xs:anyAtomicType may be found in both the previous diagrams.

[Diagram: atomic type hierarchy]

Terminology

The term atomic value has been replaced by atomic item.

The terminology used to describe the functions and operators on types defined in XML Schema is defined in the body of this specification. The terms defined in this section are used in building those definitions.

Following in the tradition of XML Schema, the terms type and datatype are used interchangeably.

Atomic items

The following definitions are adopted from the XDM data model.

An atomic item is a pair (T, D) where T (the type annotation) is an atomic type, and D (the datum) is a point in the value space of T.

A primitive type is one of the 19 primitive atomic types defined in XML Schema, or the type xs:untypedAtomic defined in the XDM data model.

The datum of an atomic item is a point in the value space of its type, which is also a point in the value space of the primitive type from which that type is derived. There are 20 primitive atomic types (19 defined in XSD, plus xs:untypedAtomic), and these have non-overlapping value spaces, so each datum belongs to exactly one primitive atomic type.

The type annotation of an atomic item is the most specific atomic type that it is an instance of (it is also an instance of every type from which that type is derived).

The term value space is defined in XML Schema as a set of values. The term datum is used here in preference to value, because value has a different meaning in this data model.

Strings, characters, and codepoints

This document uses the terms string, character, and codepoint with meanings that are normatively defined in the XDM data model, and which are paraphrased here for ease of reference:

A character is an instance of the Char production of XML.

This definition excludes Unicode characters in the surrogate blocks as well as U+FFFE and U+FFFF, while including characters with codepoints greater than U+FFFF which some programming languages treat as two characters. The valid characters are defined by their codepoints, and include some whose codepoints have not been assigned by the Unicode consortium to any character.

A string is a sequence of zero or more characters, or equivalently, a value in the value space of the xs:string datatype.

A codepoint is an integer assigned to a character by the Unicode consortium, or reserved for future assignment to a character.

The set of codepoints is thus wider than the set of characters.
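The distinction between codepoints and characters can be illustrated with a short Python sketch. It assumes the Char production of XML 1.0 (the reference is not reproduced in this text, and other XML versions differ); the function name is_xml_char is hypothetical and not part of this specification.

```python
def is_xml_char(codepoint: int) -> bool:
    """Return True if the codepoint matches the Char production.

    Assumption: the ranges below are those of XML 1.0.
    """
    return (
        codepoint in (0x9, 0xA, 0xD)
        or 0x20 <= codepoint <= 0xD7FF
        or 0xE000 <= codepoint <= 0xFFFD
        or 0x10000 <= codepoint <= 0x10FFFF
    )


# Surrogates, U+FFFE and U+FFFF are codepoints but not characters.
print(is_xml_char(0xD800))   # False
print(is_xml_char(0xFFFE))   # False
# Codepoints above U+FFFF are single characters.
print(is_xml_char(0x1F600))  # True
```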

This specification spells “codepoint” as one word; the Unicode specification spells it as “code point”. Equivalent terms found in other specifications are “character number” or “code position”. See

Because these terms appear so frequently, they are hyperlinked to the definition only when there is a particular desire to draw the reader’s attention to the definition; the absence of a hyperlink does not mean that the term is being used in some other sense.

It is implementation-defined which version of Unicode is supported, but it is recommended that the most recent version of Unicode be used.

This specification adopts the Unicode notation U+xxxx to refer to a codepoint by its hexadecimal value (always four to six hexadecimal digits). This is followed where appropriate by the official Unicode character name and its graphical representation: for example U+20AC (EURO SIGN, €).

Unless explicitly stated, the functions in this document do not ensure that any returned xs:string values are normalized in the sense of .

In functions that involve character counting such as fn:substring, fn:string-length and fn:translate, what is counted is the number of XML characters in the string (or equivalently, the number of Unicode codepoints). Some implementations may represent a codepoint above U+FFFF using two 16-bit values known as a surrogate pair. A surrogate pair counts as one character, not two.
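The difference between counting characters and counting 16-bit code units can be sketched in Python, whose strings are sequences of codepoints. The helper names string_length and utf16_code_units are hypothetical; the first is only an analogy for the counting rule, not an implementation of fn:string-length.

```python
def string_length(s: str) -> int:
    """Count characters (codepoints); a Python 3 str is indexed by codepoint."""
    return len(s)


def utf16_code_units(s: str) -> int:
    """Count the 16-bit units a UTF-16 representation of the string would use."""
    return len(s.encode("utf-16-le")) // 2


# "a", U+1D11E (a codepoint above U+FFFF), "b"
s = "a\U0001D11Eb"
print(string_length(s))     # 3: the surrogate pair counts as one character
print(utf16_code_units(s))  # 4: UTF-16 needs two units for U+1D11E
```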

Wherever encoding names (such as UTF-8 and UTF-16) are used in this specification, they are compared without regard to case: the strings "UTF-8" and "utf-8" both refer to the same encoding.

Namespaces and URIs

This document uses the phrase “namespace URI” to identify the concept identified in as “namespace name”, and the phrase “local name” to identify the concept identified in as “local part”.

It also uses the term expanded-QName defined below.

An expanded-QName is a value in the value space of the xs:QName datatype as defined in the XDM data model (see ): that is, a triple containing namespace prefix (optional), namespace URI (optional), and local name. Two expanded QNames are equal if the namespace URIs are the same (or both absent) and the local names are the same. The prefix plays no part in the comparison, but is used only if the expanded QName needs to be converted back to a string.
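The comparison rule can be sketched as a small Python class. This is only an illustration under the assumption that an absent prefix or namespace URI is modelled as None; the class name ExpandedQName and the example URI are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ExpandedQName:
    local_name: str
    namespace_uri: Optional[str] = None
    # compare=False keeps the prefix out of equality and hashing.
    prefix: Optional[str] = field(default=None, compare=False)

    def __str__(self) -> str:
        # The prefix is used only when converting back to a string.
        if self.prefix:
            return f"{self.prefix}:{self.local_name}"
        return self.local_name


a = ExpandedQName("item", "http://example.com/ns", "a")
b = ExpandedQName("item", "http://example.com/ns", "b")
print(a == b)          # True: same namespace URI and local name
print(str(a), str(b))  # a:item b:item
```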

The term URI is used as follows:

Within this specification, the term URI refers to Universal Resource Identifiers as defined in and extended in with a new name IRI. The term URI Reference, unless otherwise stated, refers to a string in the lexical space of the xs:anyURI datatype as defined in .

This means, in practice, that where this specification requires a “URI Reference”, an IRI as defined in will be accepted, provided that other relevant specifications also permit an IRI. The term URI has been retained in preference to IRI to avoid introducing new names for concepts such as “Base URI” that are defined or referenced across the whole family of XML specifications. Note also that the definition of xs:anyURI is a wider definition than the definition in ; for example it does not require non-ASCII characters to be escaped.

Conformance terminology

In this specification:

The auxiliary verb must, when rendered in small capitals, indicates a precondition for conformance.

When the sentence relates to an implementation of a function (for example "All implementations must recognize URIs of the form ...") then an implementation is not conformant unless it behaves as stated.

When the sentence relates to the result of a function (for example "The result must have the same type as $arg") then the implementation is not conformant unless it delivers a result as stated.

When the sentence relates to the arguments to a function (for example "The value of $arg must be a valid regular expression") then the implementation is not conformant unless it enforces the condition by raising a dynamic error whenever the condition is not satisfied.

The auxiliary verb may, when rendered in small capitals, indicates optional or discretionary behavior. The statement “An implementation may do X” implies that it is implementation-dependent whether or not it does X.

The auxiliary verb should, when rendered in small capitals, indicates desirable or recommended behavior. The statement “An implementation should do X” implies that it is desirable to do X, but implementations may choose to do otherwise if this is judged appropriate.

Where behavior is described as implementation-defined, variations between processors are permitted, but a conformant implementation must document the choices it has made.

Where behavior is described as implementation-dependent, variations between processors are permitted, and conformant implementations are not required to document the choices they have made.

Where this specification states that something is implementation-defined or implementation-dependent, it is open to host languages to place further constraints on the behavior.

Properties of functions

This section is concerned with the question of whether two calls on a function, with the same arguments, may produce different results.

In this section the term function, unless otherwise specified, applies equally to function definitions (which can be the target of a static function call) and function items (which can be the target of a dynamic function call).

An execution scope is a sequence of calls to the function library during which certain aspects of the state are required to remain invariant. For example, two calls to fn:current-dateTime within the same execution scope will return the same result. The execution scope is defined by the host language that invokes the function library. In XSLT, for example, any two function calls executed during the same transformation are in the same execution scope (except that static expressions, such as those used in use-when attributes, are in a separate execution scope).
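One way a host environment might keep such state invariant is to capture it once per scope. The following Python sketch is an assumption about a possible implementation strategy, not a requirement of this specification; the class ExecutionScope and its method are hypothetical.

```python
from datetime import datetime, timezone


class ExecutionScope:
    """Holds state that must not change during one execution scope."""

    def __init__(self) -> None:
        self._current_datetime = None

    def current_datetime(self) -> datetime:
        # Read the clock on first use only; later calls reuse the value.
        if self._current_datetime is None:
            self._current_datetime = datetime.now(timezone.utc)
        return self._current_datetime


scope = ExecutionScope()
first = scope.current_datetime()
second = scope.current_datetime()
print(first == second)  # True: same result within one scope
```

A separate ExecutionScope instance (for example, one used for static expressions) would read the clock independently.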

The following definition explains more precisely what it means for two function calls to return the same result:

Two values $V1 and $V2 are defined to be identical if they contain the same number of items and the items are pairwise identical. Two items are identical if and only if one of the following conditions applies:

Both items are atomic items, of precisely the same type, and the values are equal as defined using the eq operator, using the Unicode codepoint collation when comparing strings.

Both items are nodes, and represent the same node.

Both items are maps, both maps have the same number of entries, and for every entry E1 in the first map there is an entry E2 in the second map such that the keys of E1 and E2 are the same key, and the corresponding values V1 and V2 are identical.

Both items are arrays, both arrays have the same number of members, and the members are pairwise identical.

Both items are function items, neither item is a map or array, and the two function items have the same function identity. The concept of function identity is explained in .

Some functions produce results that depend not only on their explicit arguments, but also on the static and dynamic context.

A function definition may have the property of being context-dependent: the result of such a function depends on the values of properties in the static and dynamic evaluation context of the caller as well as on the actual supplied arguments (if any). A function definition may be context-dependent for some arities in its arity range, and context-independent for others: for example fn:name#0 is context-dependent while fn:name#1 is context-independent.

A function definition that is not context-dependent is called context-independent.

The main categories of context-dependent functions are:

Functions that explicitly deliver the value of a component of the static or dynamic context, for example fn:static-base-uri, fn:default-collation, fn:position, or fn:last.

Functions with an optional parameter whose default value is taken from the static or dynamic context of the caller, usually either the context value (for example, fn:node-name) or the default collation (for example, fn:index-of).

Functions that use the static context of the caller to expand or disambiguate the values of supplied arguments: for example fn:doc expands its first argument using the static base URI of the caller, and xs:QName expands its first argument using the in-scope namespaces of the caller.

A function is focus-dependent if its result depends on the focus (that is, the context item, position, or size) of the caller.

A function that is not focus-dependent is called focus-independent.

Some functions depend on aspects of the dynamic context that remain invariant within an execution scope, such as the implicit timezone. Formally this is treated in the same way as any other context dependency, but internally, the implementation may be able to take advantage of the fact that the value is invariant.

User-defined functions in XQuery and XSLT may depend on the static context of the function definition (for example, the in-scope namespaces) and also in a limited way on the dynamic context (for example, the values of global variables). However, the only way they can depend on the static or dynamic context of the caller — which is what concerns us here — is by defining optional parameters whose default values are context-dependent.

Because the focus is a specific part of the dynamic context, all focus-dependent functions are also context-dependent. A context-dependent function, however, may be either focus-dependent or focus-independent.

A function definition that is context-dependent can be used as the target of a named function reference, can be partially applied, and can be found using fn:function-lookup. The principle in such cases is that the static context used for the function evaluation is taken from the static context of the named function reference, partial function application, or the call on fn:function-lookup; and the dynamic context for the function evaluation is taken from the dynamic context of the evaluation of the named function reference, partial function application, or the call of fn:function-lookup. These constructs all deliver a function item having a captured context based on the static and dynamic context of the construct that created the function item. This captured context forms part of the closure of the function item.

The result of a dynamic call to a function item never depends on the static or dynamic context of the dynamic function call, only (where relevant) on the captured context held within the function item itself.

The fn:function-lookup function is a special case because it is potentially dependent on everything in the static and dynamic context. This is because the static and dynamic context of the call to fn:function-lookup form the captured context of the function item that fn:function-lookup returns.

A function that is guaranteed to produce identical results from repeated calls within a single execution scope if the explicit and implicit arguments are identical is referred to as deterministic.

A function that is not deterministic is referred to as nondeterministic.

All functions defined in this specification are deterministic unless otherwise stated. Exceptions include the following:

Some functions (such as fn:in-scope-prefixes, fn:load-xquery-module, and fn:unordered) produce result sequences or result maps in an implementation-defined or implementation-dependent order. In such cases two calls with the same arguments are not guaranteed to produce the results in the same order. These functions are said to be nondeterministic with respect to ordering.

Some functions (such as fn:analyze-string, fn:parse-xml, fn:parse-xml-fragment, fn:parse-html, and fn:json-to-xml) construct a tree of nodes to represent their results. There is no guarantee that repeated calls with the same arguments will return the same identical node (in the sense of the is operator). However, if non-identical nodes are returned, their content will be the same in the sense of the fn:deep-equal function. Such a function is said to be nondeterministic with respect to node identity.

Some functions (such as fn:doc and fn:collection) create new nodes by reading external documents. Such functions are guaranteed to be deterministic by default (some such functions have an option "stable":false() that makes them nondeterministic as a user option, and implementations may also provide configuration options to change the default).

Where the results of a function are described as being (to a greater or lesser extent) implementation-defined or implementation-dependent, this does not by itself remove the requirement that the results should be deterministic: that is, that repeated calls with the same explicit and implicit arguments must return identical results.

The function fn:concat is defined to be variadic: it accepts any number of arguments. No other function has this property.

Processing sequences

A sequence is an ordered collection of zero or more items. An item is a node, an atomic item, or a function, such as a map or an array. The terms sequence and item are defined formally in and .

General functions on sequences

The following functions are defined on sequences. These functions work on any sequence, without performing any operations that are sensitive to the individual items in the sequence.

As in the previous section, for the illustrative examples below, assume an XQuery or transformation operating on a non-empty Purchase Order document containing a number of line-item elements. The variable $seq is bound to the sequence of line-item nodes in document order. The variables $item1, $item2, etc. are bound to separate, individual line-item nodes in the sequence.

Comparison functions

The functions in this section perform comparisons between the items in one or more sequences.

Many of these functions require atomic items to be compared for equality.

Two atomic items A and B are said to be contextually equal if the function call fn:compare(A, B) returns zero when evaluated with a specified or context-determined collation and implicit timezone. If two values are not contextually equal, they are considered to be contextually unequal, even in the case when comparing them using fn:compare raises an error.

Except where explicitly stated otherwise, an appeal to contextual equality implies that NaN is treated as equal to NaN.

Asserting cardinality

The following functions assert the cardinality of their sequence arguments.

The functions fn:zero-or-one, fn:one-or-more, and fn:exactly-one, defined in this section, check that the cardinality of a sequence is in the expected range. These functions were originally defined for use with processors that enforced strict static typing. For example, the function call fn:remove($seq, fn:index-of($seq2, 'abc')) requires the result of the call on fn:index-of to be a singleton integer, but the static type system could not infer this; writing the expression as fn:remove($seq, fn:exactly-one(fn:index-of($seq2, 'abc'))) would provide a suitable static type at query analysis time, and ensure that the length of the sequence is correct with a dynamic check at query execution time.

The 4.0 specifications no longer define strict static typing as an option, so the utility of these functions has declined. They may still serve a purpose, however, as assertions signaling expected preconditions both to the processor and to anyone reading the code.

The type signatures for these functions deliberately declare the argument type as item()*, permitting a sequence of any length. A more restrictive signature would defeat the purpose of the function, which is to defer cardinality checking until query execution time.
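The run-time check performed by these functions can be sketched in Python, modelling a sequence as a list. The names below, including the CardinalityError exception, are hypothetical stand-ins; the actual error conditions are defined with each function.

```python
class CardinalityError(Exception):
    """Hypothetical stand-in for the dynamic error raised on failure."""


def zero_or_one(items: list) -> list:
    if len(items) > 1:
        raise CardinalityError("expected at most one item")
    return items


def one_or_more(items: list) -> list:
    if len(items) < 1:
        raise CardinalityError("expected at least one item")
    return items


def exactly_one(items: list) -> list:
    if len(items) != 1:
        raise CardinalityError("expected exactly one item")
    return items


# The argument may have any length; the check happens at execution time.
print(exactly_one([3]))  # [3]
print(zero_or_one([]))   # []
```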

Aggregate functions

Aggregate functions take a sequence as argument and return a single value computed from values in the sequence. Except for fn:count, the sequence must consist of values of a single type or one of its subtypes, or they must be numeric. xs:untypedAtomic values are permitted in the input sequence and handled by special conversion rules. The type of the items in the sequence must also support certain operations.

Basic higher-order functions

The following functions take function items as an argument.

With all these functions, if the caller-supplied function fails with a dynamic error, this error is propagated as an error from the higher-order function itself.

Processing booleans

This section defines functions and operators on the xs:boolean datatype.

Boolean constant functions

Since no literals are defined in XPath to reference the constant boolean values true and false, two functions are provided for the purpose.

Functions on Boolean values

The following functions are defined on boolean values:

Processing numerics

This section specifies arithmetic operators on the numeric datatypes defined in .

Numeric types

The operators described in this section are defined on the following atomic types.

&common-numeric-types.xml;

They also apply to types derived by restriction from the above types.

The type xs:numeric is defined as a union type whose member types are (in order) xs:double, xs:float, and xs:decimal. This type is implicitly imported into the static context, so it can also be used in defining the signature of user-written functions. Apart from the fact that it is implicitly imported, it behaves exactly like a user-defined type with the same definition. This means, for example:

If the expected type of a function parameter is given as xs:numeric, the actual value supplied can be an instance of any of these three types, or any type derived from these three by restriction (this includes the built-in type xs:integer, which is derived from xs:decimal).

If the expected type of a function parameter is given as xs:numeric, and the actual value supplied is xs:untypedAtomic (or a node whose atomized value is xs:untypedAtomic), then it will be cast to the union type xs:numeric using the rules in . Because the lexical space of xs:double subsumes the lexical space of the other member types, and xs:double is listed first, the effect is that if the untyped atomic item is in the lexical space of xs:double, it will be converted to an xs:double, and if not, a dynamic error occurs.

When the return type of a function is given as xs:numeric, the actual value returned will be an instance of one of the three member types (and perhaps also of types derived from these by restriction). The rules for the particular function will specify how the type of the result depends on the values supplied as arguments. In many cases, for the functions in this specification, the result is defined to be the same type as the first argument.

This specification uses IEEE 754 arithmetic for xs:float and xs:double values. One consequence of this is that some operations result in the value NaN (not a number), which has the unusual property that it is not equal to itself. Another consequence is that some operations return the value negative zero. This differs from XSD 1.0, which defines NaN as being equal to itself and defines only a single zero in the value space. The text accompanying several functions defines behavior for both positive and negative zero inputs and outputs in the interest of alignment with IEEE 754. A conformant implementation must respect these semantics. In consequence, the expression -0.0e0 (which is actually a unary minus operator applied to an xs:double value) will always return negative zero: see . As a concession to implementations that rely on implementations of XSD 1.0, however, when casting from string to double the lexical form -0 may be converted to positive zero, though negative zero is recommended.

XML Schema 1.1 introduces support for positive and negative zero as distinct values, and also uses the IEEE 754 semantics for comparisons involving NaN.
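The two properties mentioned above (NaN is not equal to itself, and negative zero is a distinct value) can be observed in any language with IEEE 754 doubles. The Python sketch below assumes that Python's float is an IEEE 754 binary64 value, which holds on common platforms; it illustrates the arithmetic model only and is not XPath.

```python
import math

nan = float("nan")
neg_zero = -0.0

# NaN compares unequal to everything, including itself.
print(nan == nan)                     # False
# Positive and negative zero compare equal ...
print(neg_zero == 0.0)                # True
# ... but the sign is retained and can be observed.
print(math.copysign(1.0, neg_zero))   # -1.0
print(math.copysign(1.0, 0.0))        # 1.0
```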

Arithmetic operators on numeric values

The following functions define the semantics of arithmetic operators defined in and on these numeric types.

Operator Meaning
op:numeric-add Addition
op:numeric-subtract Subtraction
op:numeric-multiply Multiplication
op:numeric-divide Division
op:numeric-integer-divide Integer division
op:numeric-mod Modulus
op:numeric-unary-plus Unary plus
op:numeric-unary-minus Unary minus (negation)

The parameters and return types for the above operators are in most cases declared to be of type xs:numeric, which permits the basic numeric types: xs:integer, xs:decimal, xs:float and xs:double, and types derived from them. In general the two-argument functions require that both arguments are of the same primitive type, and they return a value of this same type. The exceptions are op:numeric-divide, which returns an xs:decimal if called with two xs:integer operands, and op:numeric-integer-divide which always returns an xs:integer.

If the two operands of an arithmetic expression are not of the same type, they may be converted to a common type as described in .

The result type of operations depends on their argument datatypes and is defined in the following table:

Operator Returns
op:operation(xs:integer, xs:integer) xs:integer (except for op:numeric-divide(integer, integer), which returns xs:decimal)
op:operation(xs:decimal, xs:decimal) xs:decimal
op:operation(xs:float, xs:float) xs:float
op:operation(xs:double, xs:double) xs:double
op:operation(xs:integer) xs:integer
op:operation(xs:decimal) xs:decimal
op:operation(xs:float) xs:float
op:operation(xs:double) xs:double

The basic rules for addition, subtraction, and multiplication of ordinary numbers are not set out in this specification; they are taken as given. In the case of xs:double and xs:float the rules are as defined in . The rules for handling division and modulus operations, as well as the rules for handling special values such as infinity and NaN, and exception conditions such as overflow and underflow, are described more explicitly since they are not necessarily obvious.

On overflow and underflow situations during arithmetic operations, conforming implementations must behave as follows:

For xs:float and xs:double operations, overflow behavior must be conformant with . This specification allows the following options:

Raising a dynamic error via an overflow trap.

Returning INF or -INF.

Returning the largest (positive or negative) non-infinite number.

For xs:float and xs:double operations, underflow behavior must be conformant with . This specification allows the following options:

Raising a dynamic error via an underflow trap.

Returning 0.0E0 or +/- 2**Emin or a denormalized value; where Emin is the smallest possible xs:float or xs:double exponent.

For xs:decimal operations, overflow behavior must raise a dynamic error. On underflow, 0.0 must be returned.

For xs:integer operations, implementations that support limited-precision integer operations must select from the following options:

They may choose to always raise a dynamic error.

They may provide an implementation-defined mechanism that allows users to choose between raising an error and returning a result that is modulo the largest representable integer value. See .

The functions op:numeric-add, op:numeric-subtract, op:numeric-multiply, op:numeric-divide, op:numeric-integer-divide and op:numeric-mod are each defined for pairs of numeric operands, each of which has the same type: xs:integer, xs:decimal, xs:float, or xs:double. The functions op:numeric-unary-plus and op:numeric-unary-minus are defined for a single operand whose type is one of those same numeric types.

For xs:float and xs:double arguments, if either argument is NaN, the result is NaN.

For xs:decimal values, let N be the number of digits of precision supported by the implementation, and let M (M <= N) be the minimum limit on the number of digits required for conformance (18 digits for XSD 1.0, 16 digits for XSD 1.1). Then for addition, subtraction, and multiplication operations, the returned result should be accurate to N digits of precision, and for division and modulus operations, the returned result should be accurate to at least M digits of precision. The actual precision is implementation-defined. If the number of digits in the mathematical result exceeds the number of digits that the implementation retains for that operation, the result is truncated or rounded in an implementation-defined manner.

This specification does not determine whether xs:decimal operations are fixed point or floating point. In an implementation using floating point it is possible for very simple operations to require more digits of precision than are available; for example, adding 1e100 to 1e-100 requires 200 digits of precision for an accurate representation of the result.
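The effect of limited precision on this example can be sketched with Python's decimal module, which is a floating-point decimal implementation with configurable precision. The helper add_with_precision and the chosen precisions (28 and 250 digits) are illustrative assumptions, not values taken from this specification.

```python
from decimal import Decimal, localcontext


def add_with_precision(a: str, b: str, digits: int) -> Decimal:
    """Add two decimals, keeping at most `digits` significant digits."""
    with localcontext() as ctx:
        ctx.prec = digits
        return Decimal(a) + Decimal(b)


limited = add_with_precision("1e100", "1e-100", 28)
generous = add_with_precision("1e100", "1e-100", 250)

# With 28 digits the small addend is rounded away entirely.
print(limited == Decimal("1e100"))   # True
# With enough digits the exact sum is retained.
print(generous == Decimal("1e100"))  # False
```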

The IEEE 754 specification also describes handling of two exception conditions called divideByZero and invalidOperation. The IEEE divideByZero exception is raised not only by a direct attempt to divide by zero, but also by operations such as log(0). The IEEE invalidOperation exception is raised by attempts to call a function with an argument that is outside the function’s domain (for example, sqrt(-1) or log(-1)). Although IEEE defines these as exceptions, it also defines “default non-stop exception handling” in which the operation returns a defined result, typically positive or negative infinity, or NaN. With this function library, these IEEE exceptions do not cause a dynamic error at the application level; rather they result in the relevant function or operator returning the defined non-error result. The underlying IEEE exception may be notified to the application or to the user by some implementation-defined warning condition, but the observable effect on an application using the functions and operators defined in this specification is simply to return the defined result (typically -INF, +INF, or NaN) with no error.
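The "default non-stop" behavior can be sketched for floating-point division. Python itself raises ZeroDivisionError for float division by zero, so the hypothetical helper double_divide below converts that exception into the result that non-stop handling would return. It is an illustration of the principle under that assumption, not a definition of op:numeric-divide.

```python
import math


def double_divide(a: float, b: float) -> float:
    """Divide two doubles, returning INF, -INF or NaN instead of failing."""
    try:
        return a / b
    except ZeroDivisionError:
        if a == 0.0 or math.isnan(a):
            # invalidOperation (0 / 0), or a NaN operand: result is NaN.
            return math.nan
        # divideByZero: infinity whose sign combines both operand signs.
        sign = math.copysign(1.0, a) * math.copysign(1.0, b)
        return math.copysign(math.inf, sign)


print(double_divide(1.0, 0.0))    # inf
print(double_divide(-1.0, 0.0))   # -inf
print(double_divide(1.0, -0.0))   # -inf
print(double_divide(0.0, 0.0))    # nan
```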

The IEEE 754 specification distinguishes two NaN values: a quiet NaN and a signaling NaN. These two values are not distinguishable in the XDM model: the value spaces of xs:float and xs:double each include only a single NaN value. This does not prevent the implementation distinguishing them internally, and triggering different implementation-defined warning conditions, but such distinctions do not affect the observable behavior of an application using the functions and operators defined in this specification.

Although comparison of numeric values across heterogeneous types has changed to convert both values to xs:decimal, arithmetic operations continue to use xs:double as the common type.

Comparing numeric values

Deleted an inaccurate statement concerning the behavior of NaN.

Comparison of mixed numeric types (for example xs:double and xs:decimal) now generally converts both values to xs:decimal.

Numeric values can be compared using the function fn:compare.

This function underpins the six value comparison operators eq, ne, lt, le, gt, and ge and the six general comparison operators =, !=, <, <=, >, and >=, which are all defined in terms of the fn:compare function.

For a description of the different ways of comparing numeric values using the operators = and eq and functions such as fn:deep-equal, fn:compare, and fn:atomic-equal, see .

Functions on numeric values

The following functions are defined on numeric types. Each function returns a value of the same type as the type of its argument.

If the argument is the empty sequence, the empty sequence is returned.

For xs:float and xs:double arguments, if the argument is NaN, NaN is returned.

With the exception of fn:abs, functions with arguments of type xs:float and xs:double that are positive or negative infinity return positive or negative infinity.

The fn:round function has been extended with a third argument in version 4.0 of this specification; this means that the fn:ceiling, fn:floor, and fn:round-half-to-even functions are now technically redundant. They are retained, however, both for backwards compatibility and for convenience.

Parsing numbers

It is possible to convert strings to values of type xs:integer, xs:float, xs:decimal, or xs:double using the constructor functions described in or using cast expressions as described in .

In addition the fn:number function is available to convert strings to values of type xs:double. It differs from the xs:double constructor function in that any value outside the lexical space of the xs:double datatype is converted to the xs:double value NaN.
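The difference in failure behavior can be sketched in Python for string input only. The regular expression is an approximation of the xs:double lexical space written for this illustration (it assumes the XSD 1.1 form, which also admits +INF), and the name fn_number is hypothetical.

```python
import math
import re

# Approximate xs:double lexical space (assumption: XSD 1.1 rules).
_XS_DOUBLE = re.compile(
    r"[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?|[+-]?INF|NaN"
)


def fn_number(value: str) -> float:
    """Convert a string to a double, returning NaN instead of failing."""
    text = value.strip(" \t\n\r")
    if not _XS_DOUBLE.fullmatch(text):
        return math.nan  # outside the lexical space: NaN, not an error
    if text == "NaN":
        return math.nan
    if text.endswith("INF"):
        return -math.inf if text.startswith("-") else math.inf
    return float(text)


print(fn_number("12.5"))   # 12.5
print(fn_number("abc"))    # nan
print(fn_number("-INF"))   # -inf
```

A constructor-style conversion would instead raise an error for input such as "abc".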

Formatting integers

Formatting numbers

This section defines a function for formatting decimal and floating point numbers.

This function can be used to format any numeric quantity, including an integer. For integers, however, the fn:format-integer function offers additional possibilities. Note also that the picture strings used by the two functions are not 100% compatible, though they share some options in common.

Defining a decimal format

Decimal formats are defined in the static context, and the way they are defined is therefore outside the scope of this specification. XSLT and XQuery both provide custom syntax for creating a decimal format.

The static context provides a set of decimal formats. One of the decimal formats is unnamed, the others (if any) are identified by a QName. There is always an unnamed decimal format available, but its contents are implementation-defined.

Each decimal format provides a set of named properties.

A phrase such as "The minus-sign character" is to be read as “the character assigned to the minus-sign property in the relevant decimal format”.

The decimal digit family of a decimal format is the sequence of ten digits with consecutive Unicode codepoints starting with the character that is the value of the zero-digit property.

The optional digit character is the character that is the value of the digit property.

For any decimal format, the properties representing characters used in a picture string must have distinct values. These properties are decimal-separator, grouping-separator, exponent-separator, percent, per-mille, digit, and pattern-separator. Furthermore, none of these properties may be equal to any character in the decimal digit family.
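The decimal digit family and the distinctness constraint can be sketched in Python. The property values in EXAMPLE_FORMAT are assumptions chosen for illustration (they are not stated in this section), and the function names are hypothetical.

```python
# Assumed example property values, for illustration only.
EXAMPLE_FORMAT = {
    "decimal-separator": ".",
    "grouping-separator": ",",
    "exponent-separator": "e",
    "percent": "%",
    "per-mille": "\u2030",
    "digit": "#",
    "pattern-separator": ";",
    "zero-digit": "0",
}

PICTURE_PROPERTIES = [
    "decimal-separator", "grouping-separator", "exponent-separator",
    "percent", "per-mille", "digit", "pattern-separator",
]


def decimal_digit_family(zero_digit: str) -> str:
    """Ten digits with consecutive codepoints, starting at zero-digit."""
    start = ord(zero_digit)
    return "".join(chr(start + i) for i in range(10))


def check_decimal_format(props: dict) -> None:
    """Raise ValueError if the picture-string properties clash."""
    values = [props[name] for name in PICTURE_PROPERTIES]
    if len(set(values)) != len(values):
        raise ValueError("picture-string properties must be distinct")
    family = decimal_digit_family(props["zero-digit"])
    if any(value in family for value in values):
        raise ValueError("property clashes with the decimal digit family")


print(decimal_digit_family("0"))      # 0123456789
check_decimal_format(EXAMPLE_FORMAT)  # passes silently
```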

Syntax of the picture string

This differs from the format-number function previously defined in XSLT 2.0 in that any digit can be used in the picture string to represent a mandatory digit: for example the picture strings "000", "001", and "999" are equivalent. The digits will all be from the same decimal digit family, specifically, the sequence of ten consecutive digits starting with the digit assigned to the zero-digit property. This change is to align format-number (which previously used "000") with format-dateTime (which used 001).

The formatting of a number is controlled by a picture string. The picture string is a sequence of characters, in which the characters assigned to the properties decimal-separator, exponent-separator, grouping-separator, digit, and pattern-separator and the members of the decimal digit family, are classified as active characters, and all other characters (including the values of the properties percent and per-mille) are classified as passive characters.

A dynamic error is raised if the picture string does not conform to the following rules. Note that in these rules the words "preceded" and "followed" refer to characters anywhere in the string; they are not to be read as "immediately preceded" and "immediately followed".

A picture-string consists either of a sub-picture, or of two sub-pictures separated by the pattern-separator character. A picture-string must not contain more than one instance of the pattern-separator character. If the picture-string contains two sub-pictures, the first is used for positive and unsigned zero values and the second for negative values.

A sub-picture must not contain more than one instance of the decimal-separator character.

A sub-picture must not contain more than one instance of the percent or per-mille characters, and it must not contain one of each.

The mantissa part of a sub-picture (defined below) must contain at least one character that is either an optional digit character or a member of the decimal digit family.

A sub-picture must not contain a passive character that is preceded by an active character and that is followed by another active character.

A sub-picture must not contain a grouping-separator character that appears adjacent to a decimal-separator character, or in the absence of a decimal-separator character, at the end of the integer part.

A sub-picture must not contain two adjacent instances of the grouping-separator character.

The integer part of a sub-picture (defined below) must not contain a member of the decimal digit family that is followed by an instance of the optional digit character. The fractional part of a sub-picture (defined below) must not contain an instance of the optional digit character that is followed by a member of the decimal digit family.

A character that matches the exponent-separator property is treated as an exponent-separator-sign if it is both preceded and followed within the sub-picture by an active character. Otherwise, it is treated as a passive character. A sub-picture must not contain more than one character that is treated as an exponent-separator-sign.

A sub-picture that contains a percent or per-mille character must not contain a character treated as an exponent-separator-sign.

If a sub-picture contains a character treated as an exponent-separator-sign then this must be followed by one or more characters that are members of the decimal digit family, and it must not be followed by any active character that is not a member of the decimal digit family.

The mantissa part of the sub-picture is defined as the part that appears to the left of the exponent-separator-sign if there is one, or the entire sub-picture otherwise. The exponent part of the sub-picture is defined as the part that appears to the right of the exponent-separator-sign; if there is no exponent-separator-sign then the exponent part is absent.

The integer part of the sub-picture is defined as the part that appears to the left of the decimal-separator character if there is one, or the entire mantissa part otherwise.

The fractional part of the sub-picture is defined as that part of the mantissa part that appears to the right of the decimal-separator character if there is one, or the part that appears to the right of the rightmost active character otherwise. The fractional part may be zero-length.
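As an illustration only, the three definitions above can be sketched in Python. The sketch assumes the default decimal format characters (decimal-separator ".", grouping-separator ",", optional digit "#", exponent-separator "e", zero-digit "0"). The function name split_sub_picture is hypothetical, no validation of the picture string is performed, and for simplicity the exponent-separator itself is not counted as an active character when its neighbours are examined.

```python
DIGITS = "0123456789"                    # decimal digit family for zero-digit "0"
ACTIVE = set(DIGITS) | {".", ",", "#"}   # assumed default format characters


def split_sub_picture(sub):
    """Return (integer_part, fractional_part, exponent_part) of a sub-picture.

    exponent_part is None when there is no exponent-separator-sign.
    """
    # An "e" is an exponent-separator-sign only if an active character
    # appears somewhere before it and somewhere after it.
    exp_at = None
    for i, ch in enumerate(sub):
        before = any(c in ACTIVE for c in sub[:i])
        after = any(c in ACTIVE for c in sub[i + 1:])
        if ch == "e" and before and after:
            exp_at = i
            break

    mantissa = sub if exp_at is None else sub[:exp_at]
    exponent = None if exp_at is None else sub[exp_at + 1:]

    if "." in mantissa:
        integer, fractional = mantissa.split(".", 1)
    else:
        # No decimal-separator: the integer part is the whole mantissa and the
        # fractional part is whatever follows the rightmost active character.
        last = max((i for i, c in enumerate(mantissa) if c in ACTIVE), default=-1)
        integer, fractional = mantissa, mantissa[last + 1:]
    return integer, fractional, exponent
```

In the picture "000 items" the letter "e" is not followed by an active character, so it is treated as passive and there is no exponent part.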

Analyzing the picture string

This phase of the algorithm analyzes the picture string and the properties from the selected decimal format in the static context, and it has the effect of setting the values of various variables, which are used in the subsequent formatting phase. These variables are listed below. Each is shown with its initial setting and its datatype.

Several variables are associated with each sub-picture. If there are two sub-pictures, then these rules are applied to one sub-picture to obtain the values that apply to positive and unsigned zero numbers, and to the other to obtain the values that apply to negative numbers. If there is only one sub-picture, then the values for both cases are derived from this sub-picture.

The variables are as follows:

The integer-part-grouping-positions is a sequence of integers representing the positions of grouping separators within the integer part of the sub-picture. For each grouping-separator character that appears within the integer part of the sub-picture, this sequence contains an integer that is equal to the total number of optional digit character and decimal digit family characters that appear within the integer part of the sub-picture and to the right of the grouping-separator character.

The grouping is defined to be regular if the following conditions apply:

There is at least one grouping-separator in the integer part of the sub-picture.

There is a positive integer G (the grouping size) such that the position of every grouping-separator in the integer part of the sub-picture is a positive integer multiple of G.

Every position in the integer part of the sub-picture that is a positive integer multiple of G is occupied by a grouping-separator.

If the grouping is regular, then the integer-part-grouping-positions sequence contains all integer multiples of G as far as necessary to accommodate the largest possible number.
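The grouping rules above can be sketched in Python as follows. This is a non-normative illustration that assumes "," as the grouping-separator, "#" as the optional digit character, and "0" to "9" as the decimal digit family. The function names are hypothetical. The sketch also assumes that a "position in the integer part" means a position strictly inside the digits of the picture.

```python
DIGIT_CHARS = "#0123456789"  # optional digit plus the decimal digit family


def grouping_positions(integer_part, grouping_sep=","):
    """For each grouping separator, count the digit characters to its right."""
    positions, digits_seen = [], 0
    for ch in reversed(integer_part):
        if ch in DIGIT_CHARS:
            digits_seen += 1
        elif ch == grouping_sep:
            positions.append(digits_seen)
    return positions


def regular_grouping_size(integer_part, grouping_sep=","):
    """Return the grouping size G if the grouping is regular, else None."""
    positions = grouping_positions(integer_part, grouping_sep)
    if not positions:
        return None                      # condition 1: at least one separator
    g = min(positions)
    if g == 0:
        return None                      # separator at the end: invalid picture
    total = sum(ch in DIGIT_CHARS for ch in integer_part)
    expected = set(range(g, total, g))   # multiples of G inside the picture
    if all(p % g == 0 for p in positions) and expected <= set(positions):
        return g
    return None
```

Under these assumptions the picture "#,###,##0" has regular grouping with G = 3, while "#,##,##0" (positions 3 and 5) and "####,##0" (position 6 unoccupied) do not.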

The minimum-integer-part-size is an integer indicating the minimum number of digits that will appear to the left of the decimal-separator character. It is initially set to the number of decimal digit family characters found in the integer part of the sub-picture, but may be adjusted as described below.

There is no maximum integer part size. All significant digits in the integer part of the number will be displayed, even if this exceeds the number of optional digit character and decimal digit family characters in the sub-picture.

The scaling factor is a non-negative integer used to determine the scaling of the mantissa in exponential notation. It is set to the number of decimal digit family characters found in the integer part of the sub-picture.

The prefix is set to contain all passive characters in the sub-picture to the left of the leftmost active character. If the picture string contains only one sub-picture, the prefix for the negative sub-picture is set by concatenating the minus-sign character and the prefix for the positive sub-picture (if any), in that order.

The fractional-part-grouping-positions is a sequence of integers representing the positions of grouping separators within the fractional part of the sub-picture. For each grouping-separator character that appears within the fractional part of the sub-picture, this sequence contains an integer that is equal to the total number of optional digit character and decimal digit family characters that appear within the fractional part of the sub-picture and to the left of the grouping-separator character.

There is no need to extrapolate grouping positions on the fractional side, because the number of digits in the output will never exceed the number of optional digit character and decimal digit family characters in the fractional part of the sub-picture.

The minimum-fractional-part-size is set to the number of decimal digit family characters found in the fractional part of the sub-picture.

The maximum-fractional-part-size is set to the total number of optional digit character and decimal digit family characters found in the fractional part of the sub-picture.

If the effect of the above rules is that minimum-integer-part-size and maximum-fractional-part-size are both zero, then an adjustment is applied as follows:

If an exponent separator is present then:

minimum-fractional-part-size is changed to 1 (one).

maximum-fractional-part-size is changed to 1 (one).

This has the effect that with the picture #.e9, the value 0.123 is formatted as 0.1e0

Otherwise:

minimum-integer-part-size is changed to 1 (one).

This has the effect that with the picture #, the value 0.23 is formatted as 0

If all the following conditions are true:

An exponent separator is present

The minimum-integer-part-size is zero

There is at least one optional digit character in the integer part of the sub-picture

then the minimum-integer-part-size is changed to 1 (one).

This has the effect that with the picture .9e9, the value 0.1 is formatted as .1e0, while with the picture #.9e9, it is formatted as 0.1e0

If (after making the above adjustments) the minimum-integer-part-size and the minimum-fractional-part-size are both zero, then the minimum-fractional-part-size is set to 1 (one).

The minimum-exponent-size is set to the number of decimal digit family characters found in the exponent part of the sub-picture if present, or zero otherwise.

The rules for the syntax of the picture string ensure that if an exponent separator is present, then the minimum-exponent-size will always be greater than zero.

The suffix is set to contain all passive characters to the right of the rightmost active character in the sub-picture.

If there is only one sub-picture, then all variables for positive numbers and negative numbers will be the same, except for prefix: the prefix for negative numbers will be preceded by the minus-sign character.
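A minimal Python sketch of part of this analysis phase is shown below. It is illustrative only: it assumes the default format characters, handles a single sub-picture with no exponent separator and at least one digit character, and ignores grouping positions. The name analyze_sub_picture and the dictionary keys are hypothetical.

```python
DIGITS = "0123456789"   # decimal digit family (assumed zero-digit "0")
OPTIONAL = "#"          # optional digit character (assumed default)


def analyze_sub_picture(sub, decimal_sep=".", grouping_sep=","):
    """Derive prefix, suffix and the size variables for a simple sub-picture."""
    active = set(DIGITS) | {OPTIONAL, decimal_sep, grouping_sep}
    idx = [i for i, c in enumerate(sub) if c in active]
    prefix = sub[:idx[0]]            # passive characters before the first active one
    suffix = sub[idx[-1] + 1:]       # passive characters after the last active one
    body = sub[idx[0]:idx[-1] + 1]
    integer, _, fraction = body.partition(decimal_sep)

    min_int = sum(c in DIGITS for c in integer)
    min_frac = sum(c in DIGITS for c in fraction)
    max_frac = sum(c in DIGITS or c == OPTIONAL for c in fraction)

    # Adjustment for the case with no exponent separator.
    if min_int == 0 and max_frac == 0:
        min_int = 1
    # Final adjustment: never allow both minimum sizes to be zero.
    if min_int == 0 and min_frac == 0:
        min_frac = 1

    return {"prefix": prefix, "suffix": suffix, "min_int": min_int,
            "min_frac": min_frac, "max_frac": max_frac}
```

For the picture "#" the first adjustment raises the minimum integer part size to 1, which is why 0.23 is formatted as 0. For "#.#" only the final adjustment applies.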

Formatting the number

This section describes the second phase of processing of the fn:format-number function. This phase takes as input a number to be formatted (referred to as the input number), and the variables set up by analyzing the decimal format in the static context and the picture string, as described above. The result of this phase is a string, which forms the return value of the fn:format-number function.

The algorithm for this second stage of processing is as follows:

If the input number is NaN (not a number), the result is the value of the NaN property (with no prefix or suffix).

In the rules below, the positive sub-picture and its associated variables are used if the input number is positive, and the negative sub-picture and its associated variables are used if it is negative. For xs:double and xs:float, negative zero is taken as negative, positive zero as positive. For xs:decimal and xs:integer, the positive sub-picture is used for zero.

The adjusted number is determined as follows:

If the sub-picture contains a percent character, the adjusted number is the input number multiplied by 100.

If the sub-picture contains a per-mille character, the adjusted number is the input number multiplied by 1000.

Otherwise, the adjusted number is the input number.

If the multiplication causes numeric overflow, no error occurs, and the adjusted number is positive or negative infinity as appropriate.

If the adjusted number is positive or negative infinity, the result is the concatenation of the appropriate prefix, the value of the infinity property, and the appropriate suffix.

If the minimum exponent size is non-zero, and the adjusted number is non-zero, then the adjusted number is scaled to establish a mantissa and an integer exponent. The mantissa and exponent are chosen such that all the following conditions are true:

The primitive type of the mantissa is the same as the primitive type of the adjusted number (integer, decimal, float, or double).

The mantissa multiplied by ten to the power of the exponent is equal to the adjusted number.

The mantissa (unless it is zero) is less than 10^N, and at least 10^(N-1), where N is the scaling factor.

If the minimum exponent size is zero, then the mantissa is the adjusted number and there is no exponent.

If the minimum exponent size is non-zero and the adjusted number is zero, then the mantissa is the adjusted number and the exponent is zero.
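The scaling rules above can be illustrated with Python's decimal module. This sketch covers xs:decimal-like values only and the function name scale is hypothetical. It chooses the exponent so that the absolute value of the mantissa is at least 10^(N-1) and less than 10^N, where N is the scaling factor.

```python
from decimal import Decimal


def scale(adjusted, scaling_factor):
    """Return (mantissa, exponent) with mantissa * 10**exponent == adjusted."""
    if adjusted == 0:
        # A zero adjusted number keeps a zero mantissa and exponent zero.
        return adjusted, 0
    # Decimal.adjusted() is the power of ten of the most significant digit.
    exponent = adjusted.adjusted() + 1 - scaling_factor
    mantissa = adjusted.scaleb(-exponent)
    return mantissa, exponent
```

With a scaling factor of 1 the value 1234.5 becomes 1.2345 with exponent 3; with a scaling factor of 2 it becomes 12.345 with exponent 2.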

The mantissa is converted (if necessary) to an xs:decimal value, using an implementation of xs:decimal that imposes no limits on the totalDigits or fractionDigits facets. If there are several such values that are numerically equal to the mantissa (bearing in mind that if the mantissa is an xs:double or xs:float, the comparison will be done by converting the decimal value back to an xs:double or xs:float), the one that is chosen should be one with the smallest possible number of digits not counting leading or trailing zeroes (whether significant or insignificant). For example, 1.0 is preferred to 0.9999999999, and 100000000 is preferred to 100000001. This value is then rounded so that it uses no more than maximum-fractional-part-size digits in its fractional part. The rounded number is defined to be the result of converting the mantissa to an xs:decimal value, as described above, and then calling the function fn:round-half-to-even with this converted number as the first argument and the maximum-fractional-part-size as the second argument, again with no limits on the totalDigits or fractionDigits in the result.

The absolute value of the rounded number is converted to a string in decimal notation, using the digits in the decimal digit family to represent the ten decimal digits, and the decimal-separator character to separate the integer part and the fractional part. This string must always contain a decimal-separator, and it must contain no leading zeroes and no trailing zeroes. The value zero will at this stage be represented by a decimal-separator on its own.

If the number of digits to the left of the decimal-separator character is less than minimum-integer-part-size, leading zero digit characters are added to pad out to that size.

If the number of digits to the right of the decimal-separator character is less than minimum-fractional-part-size, trailing zero digit characters are added to pad out to that size.

For each integer N in the integer-part-grouping-positions list, a grouping-separator character is inserted into the string immediately after that digit that appears in the integer part of the number and has N digits between it and the decimal-separator character, if there is such a digit.

For each integer N in the fractional-part-grouping-positions list, a grouping-separator character is inserted into the string immediately before that digit that appears in the fractional part of the number and has N digits between it and the decimal-separator character, if there is such a digit.

If there is no decimal-separator character in the sub-picture, or if there are no digits to the right of the decimal-separator character in the string, then the decimal-separator character is removed from the string (it will be the rightmost character in the string).

If an exponent exists, then the string produced from the mantissa as described above is extended with the following, in order: (a) the exponent-separator character; (b) if the exponent is negative, the minus-sign character; (c) the value of the exponent represented as a decimal integer, extended if necessary with leading zeroes to make it up to the minimum exponent size, using digits taken from the decimal digit family.

The result of the function is the concatenation of the appropriate prefix, the string conversion of the number as obtained above, and the appropriate suffix.
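The rounding, padding, and grouping steps for the mantissa can be sketched in Python as follows. This is an illustration under simplifying assumptions: default format characters, integer-part grouping only, no exponent, and no prefix or suffix. The function name format_digits and its parameters are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def format_digits(value, min_int, min_frac, max_frac,
                  group_positions=(), dec=".", grp=","):
    """Format the absolute value of a Decimal following the steps above."""
    # Round half-to-even to at most max_frac fractional digits.
    quantum = Decimal(1).scaleb(-max_frac)
    rounded = abs(value).quantize(quantum, rounding=ROUND_HALF_EVEN)

    # Decimal notation with no leading and no trailing zeroes.
    int_digits, _, frac_digits = format(rounded, "f").partition(".")
    int_digits = int_digits.lstrip("0")
    frac_digits = frac_digits.rstrip("0")

    # Pad to the minimum sizes.
    int_digits = int_digits.rjust(min_int, "0")
    frac_digits = frac_digits.ljust(min_frac, "0")

    # Insert grouping separators; position N means N digits to the right.
    out = []
    for i, ch in enumerate(reversed(int_digits)):
        if i > 0 and i in group_positions:
            out.append(grp)
        out.append(ch)
    int_digits = "".join(reversed(out))

    # Drop the decimal separator when no fractional digits remain.
    return int_digits + (dec + frac_digits if frac_digits else "")
```

For regular grouping, a range such as range(3, 30, 3) can stand in for "all multiples of G as far as necessary".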

Trigonometric and exponential functions

The functions in this section perform trigonometric and other mathematical calculations on xs:double values. They are provided primarily for use in applications performing geometrical computation, for example when generating SVG graphics.

Functions are provided to support the six most commonly used trigonometric calculations: sine, cosine, and tangent, and their inverses arc sine, arc cosine, and arc tangent. Other functions such as secant, cosecant, and cotangent are not provided because they are easily computed in terms of these six.

The functions in this section (with the exception of math:pi) are specified by reference to the IEEE 754 floating-point standard, where they appear as Recommended operations in section 9. IEEE defines these functions for a variety of floating point formats; this specification defines them only for xs:double values. The IEEE specification applies with the following caveats:

IEEE states that the preferred quantum is language-defined. In this specification, it is implementation-defined.

IEEE states that certain functions should raise the inexact exception if the result is inexact. In this specification, this exception if it occurs does not result in an error. Any diagnostic information is outside the scope of this specification.

IEEE defines various rounding algorithms for inexact results, and states that the choice of rounding direction, and the mechanisms for influencing this choice, are language-defined. In this specification, the rounding direction and any mechanisms for influencing it are implementation-defined.

Certain operations (such as taking the square root of a negative number) are defined in IEEE to signal the invalid operation exception and return a quiet NaN. In this specification, such operations return NaN and do not raise an error. The same policy applies to operations (such as taking the logarithm of zero) that raise a divide-by-zero exception. Any diagnostic information is outside the scope of this specification.
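The difference between this policy and a host language that raises errors can be sketched in Python, whose math module raises ValueError for the square root of a negative number and for the logarithm of zero. The wrapper names below are hypothetical. The sketch assumes that "the same policy" for divide-by-zero means that no error is raised and the IEEE default result is returned, which for the logarithm of zero is negative infinity.

```python
import math


def sqrt_no_error(x):
    """Square root that returns NaN instead of raising for invalid operands."""
    if math.isnan(x) or x < 0:
        return math.nan
    return math.sqrt(x)


def log_no_error(x):
    """Natural logarithm that never raises an error."""
    if math.isnan(x) or x < 0:
        return math.nan          # invalid operation: quiet NaN
    if x == 0:
        return -math.inf         # divide-by-zero: IEEE default result (assumed)
    return math.log(x)
```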

Operations whose mathematical result is greater than the largest finite xs:double value are defined in IEEE to signal the overflow exception; operations whose mathematical result is closer to zero than the smallest non-zero xs:double value are similarly defined in IEEE to signal the underflow exception. The treatment of these exceptions in this specification is defined in .

Random Numbers

The function makes use of the record structure defined in the next section.

Processing strings

This section specifies functions and operators on the xs:string datatype and the datatypes derived from it.

String types

The operators described in this section are defined on the following types.

&common-string-types.xml;

They also apply to user-defined types derived by restriction from the above types.

Functions to assemble and disassemble strings

Comparison of strings

Collations

A collation is an algorithm that determines, for any two given strings S1 and S2, whether S1 is less than, equal to, or greater than S2. In this specification, a collation is identified by an absolute URI.

The observes that different applications may require different comparison and ordering behaviors. Similarly, different users with different linguistic expectations may require different behaviors. Consequently, the collation must be taken into account when comparing strings.

Collations can indicate that two different codepoints are to be considered equal for comparison purposes (for example, “v” and “w” are considered equivalent in some Swedish collations). Strings can be compared codepoint-by-codepoint or in a linguistically appropriate manner.

Some sources use the term collation to refer more generically to a set of sorting rules that can be further parameterized or “tailored”. In this specification the term is always used for a specific algorithm in which all such parameters have defined values.

This specification defines some collation URIs that provide interoperable sorting behavior across applications. Other collation URIs are defined only partially (leaving some aspects implementation-defined). Implementations may define further collation URIs, or may allow users or third parties to define them.

The Unicode codepoint collation is available in every implementation. This collation sorts based on codepoint values. For further details see the section on the Unicode codepoint collation below.

Collations may or may not perform Unicode normalization on strings before comparing them.

This specification allows a collation name to be provided as an argument to many string functions. Although collations are defined to be URIs, they are supplied as instances of xs:string.

The XQuery/XPath static context supplies a default collation for use when the collation argument is not specified. If the default collation is not specified by the user or the system, the default collation is the Unicode codepoint collation.

If the collation is specified using a relative URI reference, it is resolved relative to an implementation-defined base URI.

Previous versions of this specification stated that it must be resolved against the static base URI, but this is not always operationally convenient. It is recommended that processors should provide a means of setting the base URI for resolving collation URIs independently of the static base URI, though for backwards compatibility, the Static Base URI or Executable Base URI should be used as a default.

This specification does not define whether or not the collation URI is dereferenced. The collation URI may be an abstract identifier, or it may refer to an actual resource describing the collation. If it refers to a resource, this specification does not define the nature of that resource. One possible candidate is that the resource is a locale description expressed using the Locale Data Markup Language (LDML), defined in Unicode Technical Standard #35.

The ability to access external resources depends on whether the calling code is .

XML allows elements to specify the xml:lang attribute to indicate the language associated with the content of such an element. This specification does not use xml:lang to identify the default collation because using xml:lang does not produce desired effects when the two strings to be compared have different xml:lang values or when a string is multilingual.

Collation Capabilities

All collations support the ability to compare two strings to decide whether they are equal, and if not, which one should sort first. This must always define a total ordering, which implies that the comparison is transitive.

A collation may (or may not) support the ability to derive a collation key for a given string. A collation key is a binary value obtained as a function of a string S and a collation C, such that the collation keys for two strings S1 and S2 have the same ordering relationship (less than, equal, or greater than) as the two strings themselves, when compared under the relevant collation. Collation keys are useful for operations such as indexing, because they can be used as keys in maps. They are available using the fn:collation-key function.
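The idea of a collation key can be illustrated in Python. This sketch does not reproduce the result of fn:collation-key, whose binary value is not specified here; it only shows one possible key construction with the required ordering property. The function names are hypothetical, and the case-insensitive key uses Python's str.lower as a stand-in for a case-insensitive collation.

```python
def codepoint_key(s):
    """Binary key whose byte order matches codepoint order.

    UTF-32 big-endian uses four bytes per codepoint, so comparing the
    bytes lexicographically compares the codepoints one by one.
    """
    return s.encode("utf-32-be")


def case_insensitive_key(s):
    """Key for a simple case-insensitive comparison (illustrative only)."""
    return s.lower().encode("utf-32-be")


# Keys can be used directly as map keys, for example to build an index.
index = {}
for word in ["Apple", "apple", "APPLE", "pear"]:
    index.setdefault(case_insensitive_key(word), []).append(word)
```

Here the three spellings of "apple" share one key and therefore land in the same index entry.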

Furthermore, a collation may (or may not) support the ability to determine whether one string is a substring of another under that collation. The use of collations in substring matching is described in the section on functions based on substring matching.

The capabilities of a collation may be determined using the fn:collation-available function.

The Unicode Codepoint Collation

The collation URI http://www.w3.org/2005/xpath-functions/collation/codepoint identifies a collation which must be recognized by every implementation: it is referred to as the Unicode codepoint collation (not to be confused with the Unicode collation algorithm).

The Unicode codepoint collation does not perform any normalization on the supplied strings.

The collation is defined as follows. Each of the two strings is converted to a sequence of integers using the fn:string-to-codepoints function. These two sequences $A and $B are then compared as follows:

If both sequences are empty, the strings are equal.

If one sequence is empty and the other is not, then the string corresponding to the empty sequence is less than the other string.

If the first integer in $A is less than the first integer in $B, then the string corresponding to $A is less than the string corresponding to $B.

If the first integer in $A is greater than the first integer in $B, then the string corresponding to $A is greater than the string corresponding to $B.

Otherwise (the first pair of integers are equal), the result is obtained by applying the same rules recursively to fn:tail($A) and fn:tail($B)

While the Unicode codepoint collation does not produce results suitable for quality publishing of printed indexes or directories, it is adequate for many purposes where a restricted alphabet is used, such as sorting of vehicle registrations.

The Unicode codepoint collation differs from the default sort order used in programming languages that sort strings based on UTF-16 code units, which may include surrogate pairs.
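The comparison rules above, and the difference from UTF-16 code unit ordering, can be sketched in Python. The function names are hypothetical. Python compares lists element by element and treats a proper prefix as smaller, which mirrors the recursive definition.

```python
def codepoint_compare(a, b):
    """Return -1, 0 or 1 according to the Unicode codepoint collation."""
    ca = [ord(c) for c in a]   # analogous to fn:string-to-codepoints
    cb = [ord(c) for c in b]
    return (ca > cb) - (ca < cb)


def utf16_compare(a, b):
    """Compare by UTF-16 code units, as some programming languages do."""
    ua, ub = a.encode("utf-16-be"), b.encode("utf-16-be")
    return (ua > ub) - (ua < ub)


bmp_char = "\uff5e"           # a BMP character above the surrogate range
astral_char = "\U00010000"    # represented in UTF-16 as a surrogate pair
```

The two orderings disagree for bmp_char and astral_char: by codepoint the BMP character sorts first, while by UTF-16 code units the surrogate pair (starting with 0xD800) sorts first.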

The Unicode Collation Algorithm

This specification defines a family of collation URIs representing tailorings of the Unicode Collation Algorithm (UCA) as defined in Unicode Technical Standard #10. The parameters used for tailoring the UCA are based on the parameters defined in the Locale Data Markup Language (LDML), defined in Unicode Technical Standard #35.

This family of URIs uses the scheme and path http://www.w3.org/2013/collation/UCA followed by an optional query part. The query part, if present, consists of a question mark followed by a sequence of zero or more semicolon-separated parameters. Each parameter is a keyword-value pair, the keyword and value being separated by an equals sign.

All implementations must recognize URIs in this family in the collation argument of functions that take a collation argument.

If the fallback parameter is present with the value no, then the implementation must either use a collation that conforms with the rules in the Unicode specifications for the requested tailoring, or fail with a static or dynamic error indicating that it does not provide the collation (the error code should be the same as if the collation URI were not recognized). If the fallback parameter is omitted or takes the value yes, and if the collation URI is well-formed according to the rules in this section, then the implementation must accept the collation URI, and should use the available collation that most closely reflects the user’s intentions. For example, if the collation URI requested is http://www.w3.org/2013/collation/UCA?lang=sv;fallback=yes and the implementation does not include a fully conformant version of the UCA tailored for Swedish, then it may choose to use a Swedish collation that is known to differ from the UCA definition, or one whose conformance has not been established. It might even, as a last resort, fall back to using codepoint collation.

If two query parameters use the same keyword then the last one wins. If a query parameter uses a keyword or value which is not defined in this specification then the meaning is implementation-defined. If the implementation recognizes the meaning of the keyword and value then it should interpret it accordingly. If it does not recognize the keyword or value, then it should reject the collation as unsupported when the fallback parameter is present with the value no, and otherwise it should ignore the unrecognized parameter.
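The URI syntax and the "last one wins" rule can be sketched in Python as follows. This is an illustration of the parsing step only: it does not check keywords or values against the defined parameters, and the function name parse_uca_uri is hypothetical.

```python
UCA_BASE = "http://www.w3.org/2013/collation/UCA"


def parse_uca_uri(uri):
    """Split a UCA collation URI into a keyword-to-value mapping."""
    if not uri.startswith(UCA_BASE):
        raise ValueError("not a UCA collation URI")
    rest = uri[len(UCA_BASE):]
    if rest == "":
        return {}                       # the query part is optional
    if not rest.startswith("?"):
        raise ValueError("expected '?' after the UCA path")
    params = {}
    for item in rest[1:].split(";"):
        if not item:
            continue                    # tolerate an empty parameter list
        keyword, sep, value = item.partition("=")
        if not sep:
            raise ValueError("parameter without '=': " + item)
        params[keyword] = value         # a repeated keyword overwrites: last wins
    return params
```

For example, a URI whose query part is "?strength=primary;strength=secondary" yields the single entry strength=secondary.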

The following query parameters are defined. If any parameter is absent, the default is implementation-defined except where otherwise stated. The meaning given for each parameter is non-normative; the normative specification is found in the LDML specification.

Keyword: fallback
Values: yes | no (default yes)
Meaning: Determines whether the processor uses a fallback collation if a conformant collation is not available.

Keyword: lang
Values: language code: a string in the lexical space of xs:language.
Meaning: The language whose collation conventions are to be used.

Keyword: version
Values: string
Meaning: The version number of the UCA to be used.

Keyword: strength
Values: primary | secondary | tertiary | quaternary | identical, or 1 | 2 | 3 | 4 | 5 as synonyms (default tertiary / 3)
Meaning: The collation strength as defined in UCA. Primary strength takes only the base form of the character into account (so A=a=Ä=ä); secondary strength ignores case but considers accents and diacritics as significant (so A=a and Ä=ä but ä≠a); tertiary considers case as significant (A≠a≠Ä≠ä); quaternary strength always considers as significant spaces and punctuation (data-base≠database; if maxVariable is punct or higher and alternate is not non-ignorable, lower strengths will treat data-base=database).

Keyword: maxVariable
Values: space | punct | symbol | currency (default punct)
Meaning: Given the sequence space, punct, symbol, currency, all characters in the specified group and earlier groups are treated as “noise” characters to be handled as defined by the alternate parameter. For example, maxVariable=punct indicates that characters classified as whitespace or punctuation get this treatment.

Keyword: alternate
Values: non-ignorable | shifted | blanked (default non-ignorable)
Meaning: Controls the handling of characters such as spaces and hyphens; specifically, the “noise” characters in the groups selected by the maxVariable parameter. The value non-ignorable indicates that such characters are treated as distinct at the primary level (so data base sorts before database); shifted indicates that they are used to differentiate two strings only at the quaternary level, and blanked indicates that they are taken into account only at the identical level.

Keyword: backwards
Values: yes | no (default no)
Meaning: The value backwards=yes indicates that the last accent in the string is the most significant.

Keyword: normalization
Values: yes | no (default no)
Meaning: Indicates whether strings are converted to normalization form D.

Keyword: caseLevel
Values: yes | no (default no)
Meaning: When used with primary strength, setting caseLevel=yes has the effect of ignoring accents while taking account of case.

Keyword: caseFirst
Values: upper | lower (default lower)
Meaning: Indicates whether upper-case precedes lower-case or vice versa.

Keyword: numeric
Values: yes | no (default no)
Meaning: When numeric=yes is specified, a sequence of consecutive digits is interpreted as a number, for example chap2 sorts before chap12.

Keyword: reorder
Values: a comma-separated sequence of reorder codes, where a reorder code is one of space, punct, symbol, currency, digit, or a four-letter script code defined in the ISO 15924 register of scripts, which is maintained by the Unicode Consortium in its capacity as registration authority for ISO 15924.
Meaning: Determines the relative ordering of text in different scripts; for example the value digit,Grek,Latn indicates that digits precede Greek letters, which precede Latin letters.

This list excludes parameters that are inconvenient to express in a URI, or that are applicable only to substring matching.

UCA collation URIs can be conveniently generated using the fn:collation function.

The Unicode case-insensitive collation

A new collation URI is defined for Unicode case-insensitive comparison and ordering.

The collation URI http://www.w3.org/2005/xpath-functions/collation/unicode-case-insensitive must be recognized by every implementation.

The collation is defined as follows:

Let $UCI be the collation URI "http://www.w3.org/2005/xpath-functions/collation/unicode-case-insensitive".

Let $UCC be the Unicode Codepoint Collation URI http://www.w3.org/2005/xpath-functions/collation/codepoint.

For any two strings $A and $B, the result of the comparison fn:compare($A, $B, $UCI) is defined to be the same as the result of fn:compare(fn:lower-case($A), fn:lower-case($B), $UCC).

The collation supports collation units and can therefore be used with functions such as fn:contains; each Unicode codepoint is a single collation unit. The collation also supports the ability to obtain a collation key using fn:collation-key.
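The definition above can be approximated in Python as follows. This is a sketch, not a conformance reference: it uses Python's str.lower as a stand-in for fn:lower-case, which is an assumption, and the function names are hypothetical.

```python
def codepoint_compare(a, b):
    """Return -1, 0 or 1 by comparing codepoint sequences."""
    ca, cb = [ord(c) for c in a], [ord(c) for c in b]
    return (ca > cb) - (ca < cb)


def unicode_ci_compare(a, b):
    """Lower-case both strings, then compare by codepoint."""
    return codepoint_compare(a.lower(), b.lower())
```

Because the definition lower-cases rather than case-folds, "Straße" and "STRASSE" do not compare equal in this sketch: the lower-case forms "straße" and "strasse" differ at the fifth character.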

The HTML ASCII Case-Insensitive Collation

The case-insensitive collation is now defined normatively within this specification, rather than by reference to the HTML "living specification", which is subject to change. The collation can now be used for ordering comparisons as well as equality comparisons.

The collation URI http://www.w3.org/2005/xpath-functions/collation/html-ascii-case-insensitive must be recognized by every implementation. It is designed to be compatible with the HTML ASCII case-insensitive collation as defined in the HTML specification (section 4.6, Strings), which is used, for example, when matching HTML class attribute values.

The collation is defined as follows:

Let $HACI be the collation URI "http://www.w3.org/2005/xpath-functions/collation/html-ascii-case-insensitive".

Let $UCC be the Unicode Codepoint Collation URI http://www.w3.org/2005/xpath-functions/collation/codepoint.

Let $lc be the function fn:translate(?, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz").

Then for any two strings $A and $B, the result of the comparison fn:compare($A, $B, $HACI) is defined to be the same as the result of fn:compare($lc($A), $lc($B), $UCC).

HTML5 defines the semantics of equality matching using this collation; this specification additionally defines ordering rules. The collation supports collation units and can therefore be used with functions such as fn:contains; each Unicode codepoint is a single collation unit.

The corresponding HTML5 definition is: A string A is an ASCII case-insensitive match for a string B, if the ASCII lowercase of A is the ASCII lowercase of B.
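A Python sketch of this definition follows. The translation table plays the role of the $lc function, mapping only the 26 ASCII upper-case letters, and the result is compared by codepoint. The function name html_ascii_ci_compare is hypothetical.

```python
# Equivalent of fn:translate(?, "ABC...Z", "abc...z"): ASCII letters only.
_ASCII_LOWER = str.maketrans("ABCDEFGHIJKLMNOPQRSTUVWXYZ",
                             "abcdefghijklmnopqrstuvwxyz")


def html_ascii_ci_compare(a, b):
    """Return -1, 0 or 1 under the HTML ASCII case-insensitive collation."""
    la, lb = a.translate(_ASCII_LOWER), b.translate(_ASCII_LOWER)
    ca, cb = [ord(c) for c in la], [ord(c) for c in lb]
    return (ca > cb) - (ca < cb)
```

Non-ASCII letters are left untouched, so "É" and "é" remain distinct under this collation, unlike under the Unicode case-insensitive collation.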

Choosing a collation

Many functions have a signature that includes a $collation argument, which is generally optional and takes default-collation() as its default value.

The collation to use for these functions is determined by the following rules:

If the function specifies an explicit collation, CollationA (e.g., if the optional collation argument is specified in a call of the fn:compare function), then:

If CollationA is supported by the implementation, then CollationA is used.

Otherwise, a dynamic error is raised.

If no collation is explicitly specified for the function (that is, if the $collation argument is omitted or is set to an empty sequence), and the default collation in the XQuery/XPath static context is CollationB, then:

If CollationB is supported by the implementation, then CollationB is used.

Otherwise, a dynamic error is raised.

Because the set of collations that are supported is implementation-defined, an implementation has the option to support all collation URIs, in which case it will never raise this error.

If the value of the collation argument is a relative URI reference, it is resolved against the base-URI from the static context. If it is a relative URI reference and cannot be resolved, perhaps because the base-URI property in the static context is absent, a dynamic error is raised.

There is no explicit requirement that the string used as a collation URI be a valid URI. Implementations will in many cases reject such strings on the grounds that they do not identify a supported collation; they may also cause an error if they cannot be resolved against the relevant base URI.
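
The selection rules in this section can be sketched as follows. This is an informal Python illustration, not a definitive implementation: the names `choose_collation`, `CollationError` and `CODEPOINT` are hypothetical, an omitted or empty `$collation` argument is modelled as `None`, and a single exception type stands in for the dynamic errors, whose codes are not reproduced here.

```python
from typing import Optional, Set
from urllib.parse import urljoin, urlsplit

CODEPOINT = "http://www.w3.org/2005/xpath-functions/collation/codepoint"


class CollationError(Exception):
    """Stands in for the dynamic errors mentioned in the rules above."""


def choose_collation(explicit: Optional[str], default: str,
                     supported: Set[str],
                     base_uri: Optional[str] = None) -> str:
    """Pick the collation URI to use, or raise CollationError."""
    if explicit is None:
        # No explicit collation: fall back to the static-context default.
        uri = default
    else:
        uri = explicit
        if not urlsplit(uri).scheme:
            # Relative reference: it must be resolved against the base URI.
            if base_uri is None:
                raise CollationError("relative collation URI cannot be resolved")
            uri = urljoin(base_uri, uri)
    if uri not in supported:
        raise CollationError("unsupported collation: " + uri)
    return uri
```

An implementation that supports every collation URI would simply never take the final error branch.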

Functions on string values

The following functions are defined on values of type xs:string and types derived from it.

When the above operators and functions are applied to datatypes derived from xs:string, they are guaranteed to return values that are instances of xs:string, but the value might or might not be an instance of the particular subtype of xs:string to which they were applied.

The strings returned by fn:concat and fn:string-join are not guaranteed to be normalized. But see note in fn:concat.

Functions based on substring matching

The functions described in this section examine a string $arg1 to see whether it contains another string $arg2 as a substring. The result depends on whether $arg2 is a substring of $arg1, and if so, on the range of characters in $arg1 which $arg2 matches.

When the Unicode codepoint collation is used, this simply involves determining whether $arg1 contains a contiguous sequence of characters whose codepoints are the same, one for one, with the codepoints of the characters in $arg2.

When a collation is specified, the rules are more complex.

All collations support the capability of deciding whether two strings are considered equal, and if not, which of the strings should be regarded as preceding the other. For functions such as fn:compare, this is all that is required. For other functions, such as fn:contains, the collation needs to support an additional property: it must be able to decompose the string into a sequence of collation units, each unit consisting of one or more characters, such that two strings can be compared by pairwise comparison of these units.

The term collation unit as used in this specification is equivalent to the term collation element used in .

The string Q is then considered to contain P as a substring if the sequence of collation units corresponding to P is a subsequence of the sequence of collation units corresponding to Q. The characters in P that match are the characters corresponding to these collation units.

This rule may occasionally lead to surprises. For example, consider a collation that treats "Jaeger" and "Jäger" as equal. It might do this by treating "ä" as representing two collation units, in which case the expression fn:contains("Jäger", "eg") will return true. Alternatively, a collation might treat "ae" as a single collation unit, in which case the expression fn:contains("Jaeger", "eg") will return false. The results of these functions thus depend strongly on the properties of the collation that is used.

In addition, collations may specify that some collation units should be ignored during matching. If hyphen is an ignored collation unit, then fn:contains("code-point", "codepoint") will be true, and fn:contains("codepoint", "-") will also be true.
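
The effect of collation units on substring matching can be illustrated with a small Python sketch. It is a toy model under stated assumptions: each hypothetical collation is represented by a function that decomposes a string into a list of unit keys (dropping ignored units), and "subsequence" is taken to mean a contiguous run of units. None of the names below correspond to a real collation API.

```python
from typing import Callable, List

Units = Callable[[str], List[str]]


def contains(q: str, p: str, units: Units) -> bool:
    """True if the units of p occur as a contiguous run within the units of q."""
    qu, pu = units(q), units(p)
    n = len(pu)
    return any(qu[i:i + n] == pu for i in range(len(qu) - n + 1))


def umlaut_as_two_units(s: str) -> List[str]:
    """Toy collation: 'ä' decomposes into the two units 'a' and 'e'."""
    out: List[str] = []
    for ch in s:
        out.extend(["a", "e"] if ch == "ä" else [ch])
    return out


def ae_as_one_unit(s: str) -> List[str]:
    """Toy collation: 'ae' and 'ä' each form the single unit 'ae'."""
    out, i = [], 0
    while i < len(s):
        if s[i] == "ä":
            out.append("ae")
            i += 1
        elif s.startswith("ae", i):
            out.append("ae")
            i += 2
        else:
            out.append(s[i])
            i += 1
    return out


def hyphen_ignored(s: str) -> List[str]:
    """Toy collation: hyphen is an ignored collation unit."""
    return [ch for ch in s if ch != "-"]
```

With these toy collations the sketch reproduces the outcomes described above, including the perhaps surprising result that a string consisting only of ignored units is contained in every string.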

In the rules for the functions defined in this section, we use the following terms taken from :

The term match is used in the sense of definition DS2 from .

The term minimal match is used in the sense of definition DS4 from .

In the definitions in , these rules involve a number of parameters. In the context of the functions defined in this section, these parameters are interpreted as follows:

C is the collation; that is, the value of the $collation argument if specified, otherwise the default collation.

P is the (candidate) substring, the value of the $substring argument to the function.

Q is the (candidate) containing string, the value of the $value argument to the function.

The boundary condition B is satisfied at the start and end of a string, and between any two characters that belong to different collation units (“collation elements” in the language of ). It is not satisfied between two characters that belong to the same collation unit.

It is possible to define collations that do not have the ability to decompose a string into units suitable for substring matching. An argument to a function defined in this section may be a URI that identifies a collation that is able to compare two strings, but that does not have the capability to split the string into collation units. Such a collation may cause the function to fail, or to give unexpected results, or it may be rejected as an unsuitable argument. The ability to decompose strings into collation units is an implementation-defined property of the collation. The fn:collation-available function can be used to ask whether a particular collation has this property.

Regular expressions

The functions described in this section make use of a regular expression syntax for pattern matching. The syntax and semantics of regular expressions are defined in this section.

Regular expression syntax

Regular expressions can include comments (starting and ending with #) if the c flag is set. Word boundaries can be matched. Lookahead and lookbehind assertions are supported. Assertions (including ^ and $) can no longer be followed by a quantifier.

The regular expression syntax used by these functions is defined in terms of the regular expression syntax specified in XSD 1.1 (see ), which in turn is based on the established conventions of languages such as Perl. However, because XML Schema uses regular expressions only for validity checking, it omits some facilities that are widely used with other languages. XPath, therefore, extends the XML Schema regular expression syntax to reinstate some of these capabilities.

Implementers should consult for information on using regular expression processing on Unicode characters.

The regular expression syntax and semantics are identical to those defined in with the additions described in the following subsections.

In there are no substantive technical changes to the syntax or semantics of regular expressions relative to , but a number of errors and ambiguities have been resolved. For example, the rules for the interpretation of hyphens within square brackets in a regular expression have been clarified; and the semantics of regular expressions are no longer tied to a specific version of Unicode.

XSD 1.1 is therefore used as the specification baseline, even for processors that only support XSD 1.0.

Processing model for regular expressions

As well as extending the XSD 1.1 syntax for regular expressions, this specification also extends the processing model.

In XSD, a regular expression is defined to denote a set of strings, and the only functionality offered is to test whether a string matches a regular expression: that is, whether it is a member of the set of strings denoted by the regular expression.

In this specification, matching a string S against a regular expression delivers a more complex outcome.

First some terminology:

A string of length N has N+1 character positions: one immediately before each character in the string, and one after the last character. In interfaces where character positions are exposed, they are numbered from 1 to N+1.

A segment of a string S is a sequence of zero or more contiguous characters starting at a given character position within S. Segments of a string are uniquely identified by their start position and length. The sequence of characters making up a segment is referred to as the string value of the segment.

The end position of a segment is the start position of the segment plus its length.

The operation of matching a string S against a regular expression delivers:

A set of matching segments. The string S as a whole is said to match the regular expression if the set of matching segments is non-empty.

For each matching segment M, a collection of captured groups. This is a mapping from positive integers to segments. The integer is called the group number, and corresponds to the ordinal sequence of opening parentheses of capturing subexpressions within the regular expression, as explained below. The corresponding segment is always a segment of S, but in the case of capturing expressions within lookahead assertions, it is not necessarily a segment of M.

The semantics of particular constructs in a regular expression are affected by a set of flags. The available flags and their effect are defined in .

The different functions available, such as fn:replace and fn:tokenize, are defined in terms of this outcome. For example:

The function fn:matches returns true if the set of matching segments is non-empty.

The function fn:replace replaces matching segments of the input string with a replacement string.

The function fn:tokenize returns the segments of the input string that appear between the matching segments.

In principle the set of segments that match a regular expression can be determined by enumerating all the segments of the input string and examining each one independently to establish whether it matches. In practice, however:

If several matching segments have the same starting position, then only one of them is returned. This is chosen as follows:

In the case of a choice (operator "|") the first matching branch is chosen.

In the case of a repetition with a greedy quantifier (for example "+" or "*") the longest matching segment is chosen.

In the case of a repetition with a reluctant quantifier (for example "+?" or "*?") the shortest matching segment is chosen.

A matching segment is not included in the result if it overlaps an earlier matching segment: specifically, a segment with start position S1 is excluded if there is a segment that has start position S0 and length L0, where S0 < S1 < S0+L0.

Two segments can be adjacent: that is, the start position of one can be equal to the end position of the previous segment. This is true even when the second segment is zero-length (the two segments are not considered to be overlapping, even though they have the same end position). This means, for example, that the regular expression a*(?=x) has two non-overlapping matches against the string aaax, one at position 1 and the other at position 4.

The disjoint matching segments obtained by applying a regular expression R to a string S in the presence of a set of flags F are the segments of S that match R (using flags F), after elimination of overlapping segments.
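
The elimination of overlapping segments can be sketched as below. This is an informal Python illustration: a segment is modelled as a hypothetical `(start, length)` pair with positions counted from 1, and `drop_overlaps` assumes that at most one candidate per start position has already been chosen. The helper `python_segments` uses Python's own `re` engine, whose dialect differs from the one defined here, purely to show the a*(?=x) example.

```python
import re
from typing import List, Tuple

Segment = Tuple[int, int]  # (start position counted from 1, length)


def drop_overlaps(candidates: List[Segment]) -> List[Segment]:
    """Keep a segment unless it starts strictly inside an earlier kept one."""
    kept: List[Segment] = []
    for start, length in sorted(candidates):
        # Exclusion rule: S0 < S1 < S0 + L0 for some earlier segment.
        if any(s0 < start < s0 + l0 for s0, l0 in kept):
            continue
        kept.append((start, length))
    return kept


def python_segments(pattern: str, s: str) -> List[Segment]:
    """Matches found by Python's re engine, converted to 1-based segments."""
    return [(m.start() + 1, m.end() - m.start())
            for m in re.finditer(pattern, s)]
```

For the string aaax, the candidates starting at positions 2 and 3 fall inside the match at position 1 and are dropped, while the zero-length match at position 4 is merely adjacent and is kept.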

The semantics of a regular expression are thus defined by stating which segments of an input string it matches, and what the captured groups corresponding to this match are. This is defined recursively for each construct that may appear within a regular expression, in terms of the outcome of applying its subexpressions.

For constructs defined in XSD 1.1 (branch, piece, NormalChar, charClass), XSD defines a set of strings denoted by the construct. The corresponding semantics for this specification are that the segments matched by such a construct are the segments whose string value is contained in this set.

For constructs added to the XSD 1.1 baseline by this specification, the semantics are defined in the sections that follow.

Comments

Comments are enabled in regular expressions if the c flag is present.

A comment starts with a # character that is not escaped with an immediately preceding backslash, and that is not contained in a CharClassExpr (that is, in square brackets). It ends with the following # character, or with the end of the string containing the regular expression.

Whether or not the c flag is present, the production for SingleCharEsc allows the # character to be escaped.
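
The comment rules can be read as a simple lexical pass over the regular expression. The Python sketch below is one possible reading, not a definitive implementation: `strip_comments` is a hypothetical name, and square brackets are tracked by a plain nesting counter, which is a simplification of the full grammar.

```python
def strip_comments(regex: str) -> str:
    """Remove #...# comments that are neither escaped nor inside square brackets."""
    out = []
    depth = 0           # nesting depth of square brackets
    escaped = False     # True immediately after an unescaped backslash
    in_comment = False
    for ch in regex:
        if in_comment:
            # A comment runs to the next '#' or to the end of the string.
            if ch == "#":
                in_comment = False
            continue
        if escaped:
            out.append(ch)
            escaped = False
        elif ch == "\\":
            out.append(ch)
            escaped = True
        elif ch == "#" and depth == 0:
            in_comment = True
        else:
            if ch == "[":
                depth += 1
            elif ch == "]" and depth > 0:
                depth -= 1
            out.append(ch)
    return "".join(out)
```

For instance, under this sketch the expression (..#month#)/(..#day#)/(....#year#) reduces to (..)/(..)/(....) before it is parsed.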

Regular expression grammar

The grammar for regular expressions is summarized here. Rules that differ from their definition in XSD 1.1 are marked with the character § against their names.

In these rules the notation【abc】matches any of the characters 'a', 'b', or 'c', while 【0➜9】 matches any character whose Unicode codepoint is within a given range, and ¬【abc】 matches any character other than 'a', 'b', or 'c'. These symbols are used in place of the more conventional notation to allow special characters such as square brackets and hyphens to appear directly without escaping. Within the lenticular brackets, all characters other than (including hyphen and backslash) represent themselves.

This grammar applies to the regular expression after removal of whitespace and comments if enabled by the x and c flags respectively: see .

XSD 1.1 defines additional rules to disambiguate this grammar.

Reluctant quantifiers

Reluctant quantifiers are supported. They are indicated by a ? following a quantifier. Specifically:

X?? matches X, once or not at all

X*? matches X, zero or more times

X+? matches X, one or more times

X{n}? matches X, exactly n times

X{n,}? matches X, at least n times

X{n,m}? matches X, at least n times, but not more than m times

Quantifiers that are not reluctant are referred to as greedy.

When a quantifier appears at the outermost level of a regular expression, the distinction between greedy and reluctant quantifiers affects the set of matching segments delivered by the matching operation. With a greedy quantifier, the longest matching segment at a given start position is returned; with a reluctant quantifier, the shortest matching segment at a given start position is returned.

When a quantifier appears within a subexpression, the quantified subexpression matches the shortest possible substring consistent with the match as a whole succeeding if the quantifier is reluctant, or the longest possible substring consistent with the match as a whole succeeding if the quantifier is greedy.

Reluctant quantifiers have no effect on the results of the boolean fn:matches function, since this function is only interested in discovering whether a matching segment exists, regardless of its start position and length.
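
The difference between greedy and reluctant quantifiers can be observed with any engine that supports both. The following informal sketch uses Python's `re` module, which happens to use the same `?` suffix; it is an analogy only, since Python's regular expression dialect is not the one defined in this specification.

```python
import re

text = "<b>c</b>"

# At the outermost level: longest versus shortest segment at a start position.
greedy = re.search(r"<.+>", text).group()
reluctant = re.search(r"<.+?>", text).group()

# Within a subexpression: the quantified part takes as much, or as little,
# as is consistent with the match as a whole succeeding.
greedy_groups = re.fullmatch(r"(a+)(a*)", "aaa").groups()
reluctant_groups = re.fullmatch(r"(a+?)(a*)", "aaa").groups()

# Whether a match exists at all is unaffected by the choice of quantifier.
exists_greedy = re.search(r"<.+>", text) is not None
exists_reluctant = re.search(r"<.+?>", text) is not None
```

Here `greedy` is the whole input while `reluctant` is only the first tag, yet both existence tests agree.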

Captured groups

The regular expression syntax defined by allows a regular expression to contain parenthesized subexpressions, but attaches no special significance to them. Some operations associated with regular expressions (for example, back-references, and the fn:replace function) allow access to the parts of the input string that matched a parenthesized subexpression (called captured groups).

A left parenthesis is recognized as a capturing left parenthesis provided it is not immediately followed by ? or * (see below), is not within a character group (square brackets), and is not escaped with a backslash. The sub-expression enclosed by a capturing left parenthesis and its matching right parenthesis is referred to as a capturing subexpression.

More specifically, the subexpression enclosed by the Nth capturing left parenthesis within the regular expression (determined by its character position in left-to-right order, and counting from one) is referred to as the Nth capturing subexpression.

For example, in the regular expression A(BC(?:D(EF(GH[()])))), the subexpression BC(?:D(EF(GH[()]))) is capturing subexpression 1, the subexpression EF(GH[()]) is capturing subexpression 2, and the subexpression GH[()] is capturing subexpression 3.

When, in the course of evaluating a regular expression, a particular segment of the input matches a capturing subexpression, that segment becomes available as a captured group. The segment matched by the Nth capturing subexpression is referred to as the Nth captured group. By convention, the segment captured by the entire regular expression is treated as captured group 0 (zero).

When a capturing subexpression is matched more than once (because it is within a construct that allows repetition), then only the last substring that it matched will be captured. Note that this rule is not sufficient in all cases to ensure an unambiguous result, especially in cases where (a) the regular expression contains nested repeating constructs, and/or (b) the repeating construct matches a zero-length string. In such cases it is implementation-dependent which substring is captured. For example, given the regular expression (a*)+ and the input string "aaaa", an implementation might legitimately capture either "aaaa" or a zero-length string as the content of the captured subgroup.

Parentheses that are required to group terms within the regular expression, but which are not required for capturing of substrings, can be represented using the syntax (?:xxxx).

In the absence of back-references (see below), the presence of the optional ?: has no effect on the set of strings that match the regular expression, but causes the left parenthesis not to be counted by operations (such as fn:replace and back-references) that number the capturing sub-expressions within a regular expression.
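
The numbering of capturing subexpressions can be checked informally with Python's `re` module, which numbers groups in the same left-to-right way and also supports the (?:xxxx) syntax. This is an illustration under the assumption that Python's behaviour coincides with this specification for these simple cases; the input string is a made-up example.

```python
import re

pattern = re.compile(r"A(BC(?:D(EF(GH[()]))))")

# The non-capturing (?: ... ) group and the parentheses inside [()] are not counted.
group_count = pattern.groups

m = pattern.fullmatch("ABCDEFGH(")
captured = [m.group(n) for n in range(0, group_count + 1)]

# A capturing subexpression matched repeatedly keeps only its last match.
last_only = re.fullmatch(r"(a|b)+", "ab").group(1)
```

Group 0 holds the segment matched by the expression as a whole, followed by the three numbered groups.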

Back-references

Back-references are allowed outside a character class expression. A back-reference is an additional kind of atom. The construct \N where N is a single digit is always recognized as a back-reference; if this is followed by further digits, these digits are taken to be part of the back-reference if and only if the resulting number NN is such that the back-reference is preceded by the opening parenthesis of the NNth capturing left parenthesis. The regular expression is invalid if a back-reference refers to a capturing sub-expression that does not exist or whose closing right parenthesis occurs after the back-reference.

A back-reference with number N matches a string that is the same as the value of the Nth captured substring.

For example, the regular expression ('|").*\1 matches a sequence of characters delimited either by an apostrophe at the start and end, or by a quotation mark at the start and end.

If no string has been matched by the Nth capturing sub-expression, the back-reference is interpreted as matching a zero-length string.

Within a character class expression, \ followed by a digit is invalid. Some other regular expression languages interpret this as an octal character reference.
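
The quotation-mark example can be tried out with Python's `re` module, as sketched below. This is an informal illustration only: Python differs from this specification in some corner cases (for example in how it treats a back-reference to a group that has matched nothing), so only the basic behaviour is shown. The name `QUOTED` and the sample strings are hypothetical.

```python
import re

# Either kind of quote may open the string, and \1 requires the same one to close it.
QUOTED = re.compile(r"""('|").*\1""")

single = QUOTED.fullmatch("'it is'") is not None
double = QUOTED.fullmatch('"it is"') is not None
mixed = QUOTED.fullmatch("\"it is'") is not None
```

The third string opens with a quotation mark and closes with an apostrophe, so the back-reference prevents a match.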

Unicode block names

A regular expression that uses a Unicode block name that is not defined in the version(s) of Unicode supported by the processor (for example \p{IsBadBlockName}) is deemed to be invalid.

XSD 1.0 does not say how this situation should be handled; XSD 1.1 says that it should be handled by treating all characters as matching.

Assertions

Assertions (sometimes called zero-width assertions) test whether a particular condition applies at the current position in the input string (resulting in either a match or a no-match), but they do not cause any change to the current position.

Assertions fall into the following categories:

The startOfString assertion ^ tests whether the current position is at the start of the string.

The endOfString assertion $ tests whether the current position is at the end of the string.

The boundary assertions \b and \B test whether the current position is at the start or end of a word.

The positive and negative lookahead assertions test whether there is (or is not) a substring starting at the current position that matches a given regular expression.

The positive and negative lookbehind assertions test whether there is (or is not) a substring ending at the current position that matches a given regular expression.

An assertion must not be followed by a quantifier.

Previous versions of this specification allowed a quantifier to follow the startOfString and endOfString assertions, though this served no practical purpose. Processors may provide an option to allow quantifiers to be used in this situation in order to preserve backward compatibility.

Matching the Start and End of the String

Two meta-characters, ^ and $ are added. By default, the meta-character ^ matches if the current position is the start of the entire string, while $ matches if the current position is the end of the entire string. In multi-line mode, ^ matches the start of any line (that is, the start of the entire string, and the position immediately after a newline character), while $ matches the end of any line (that is, the end of the entire string, and the position immediately before a newline character). Newline here means the character U+000A only.

Single character escapes are extended to allow the $ character to be escaped.
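
The positions at which ^ and $ match can be computed directly from the description in this subsection. The Python sketch below uses character positions counted from 1, as elsewhere in this specification; the function names are hypothetical, and the sketch follows the wording of this subsection only, so the special treatment of a trailing newline described for the m flag is not modelled.

```python
from typing import List

NEWLINE = "\n"  # U+000A only


def caret_positions(s: str, multiline: bool = False) -> List[int]:
    """Positions where ^ matches: the start, plus after each newline if multiline."""
    positions = [1]
    if multiline:
        # The character at 0-based index i ends at 1-based position i + 2.
        positions += [i + 2 for i, ch in enumerate(s) if ch == NEWLINE]
    return positions


def dollar_positions(s: str, multiline: bool = False) -> List[int]:
    """Positions where $ matches: the end, plus before each newline if multiline."""
    positions = []
    if multiline:
        # The character at 0-based index i starts at 1-based position i + 1.
        positions += [i + 1 for i, ch in enumerate(s) if ch == NEWLINE]
    positions.append(len(s) + 1)
    return positions
```

For the two-line string "ab", newline, "cd", multi-line mode adds position 4 for ^ and position 3 for $.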

Boundary assertions

The assertion \b matches at any position where one of the following conditions is true:

The current position is the start of the string, the string is not empty, and the first character in the string matches \w.

The current position is the end of the string, the string is not empty, and the last character in the string matches \w.

The character before the current position matches \w and the character after the current position matches \W.

The character before the current position matches \W and the character after the current position matches \w.

Informally, \b matches if the current position is the start or end of a word, where a word is defined as a sequence of consecutive characters other than codepoints in Unicode groups P (punctuation), Z (separator), or C (other).

The assertion \B matches at any position where \b does not match.
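
The four conditions for \b translate directly into code. The Python sketch below is an informal reading: `is_word_char` approximates \w using the informal description above (any character outside Unicode groups P, Z and C), relying on the general categories of whatever Unicode version the Python runtime provides. The function names are hypothetical.

```python
import unicodedata
from typing import List


def is_word_char(ch: str) -> bool:
    """Approximation of \\w: not punctuation (P), separator (Z) or other (C)."""
    return unicodedata.category(ch)[0] not in "PZC"


def boundary_positions(s: str) -> List[int]:
    """1-based character positions at which \\b would match."""
    n = len(s)
    result = []
    for pos in range(1, n + 2):
        before = is_word_char(s[pos - 2]) if pos > 1 else None
        after = is_word_char(s[pos - 1]) if pos <= n else None
        if before is None and after is None:
            continue                # empty string: no boundary
        if before is None:
            matches = after         # start of string, first char is \w
        elif after is None:
            matches = before        # end of string, last char is \w
        else:
            matches = before != after
        if matches:
            result.append(pos)
    return result
```

Note that a symbol such as "+" counts as a word character under this definition, which differs from the \w of several other regular expression languages. The assertion \B matches at exactly the positions not returned.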

Positive Lookahead Assertions

There are two equivalent ways of writing a positive lookahead assertion:

(?=xyz)

(*positive_lookahead:xyz)

In both cases, the assertion matches at a particular position in the input string only if there is a substring starting at that position that matches the regular expression xyz.

As with all assertions, evaluation of the assertion does not cause the current position to advance.

For example, Chapter(?=\s+[1-9]) will match "Chapter" only if followed by a number, with intervening whitespace.

A parenthesized expression within a lookahead assertion can capture a substring in the normal way. There are some minor complications, however:

Substrings captured while evaluating a lookahead assertion are represented differently in the result of the fn:analyze-string function, because they can overlap other substrings in arbitrary ways.

If an assertion is satisfied, then any substrings that are captured are based on the first evaluation of the assertion that matches; alternative evaluations of the assertion that also match, but which capture different substrings, are not considered.

A positive lookahead assertion that matches a zero-length string is permitted but pointless, since it will always match, and thus cause the assertion to succeed.

Negative Lookahead Assertion

There are two equivalent ways of writing a negative lookahead assertion:

(?!xyz)

(*negative_lookahead:xyz)

In both cases, the assertion matches at a particular position in the input string only if there is no substring starting at that position that matches the regular expression xyz.

As with all assertions, evaluation of the assertion does not cause the current position to advance.

For example, Chapter(?!\s*[1-9]) will match "Chapter" only if it is not followed by a number, with optional intervening whitespace.

Any capturing parentheses within a negative lookahead assertion are counted for the purpose of numbering captured groups, but they cannot capture any result because the pattern in the assertion must fail to match.

A negative lookahead assertion that matches a zero-length string is permitted but pointless, since it will always match, and thus cause the assertion to fail.
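
Both kinds of lookahead can be tried out with Python's `re` module, which supports the (?=xyz) and (?!xyz) forms (the (*positive_lookahead:xyz) spelling is not assumed to be available there). The sketch is an informal illustration using made-up input strings.

```python
import re

positive = re.compile(r"Chapter(?=\s+[1-9])")
negative = re.compile(r"Chapter(?!\s*[1-9])")

m = positive.search("Chapter 7")
matched_text = m.group()        # the lookahead consumes nothing
end_offset = m.end()            # 0-based offset just after "Chapter"

positive_fails = positive.search("Chapter Seven") is None
negative_matches = negative.search("Chapter Seven") is not None
negative_fails = negative.search("Chapter 7") is None

# A group inside a positive lookahead captures text outside the matched segment.
digit = re.search(r"Chapter(?=\s+([1-9]))", "Chapter 7").group(1)
```

The last line shows why a captured group is always a segment of the input string but not necessarily a segment of the matching segment itself.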

Positive Lookbehind Assertions

There are two equivalent ways of writing a positive lookbehind assertion:

(?<=xyz)

(*positive_lookbehind:xyz)

The second form may be more convenient when the expression appears within an XML-based host language such as XSLT, where the angle bracket would need to be escaped.

In both cases, the assertion matches at a particular position in the input string only if there is a substring ending at that position that matches the regular expression xyz.

For efficiency and ease of implementation, the regular expression contained within a lookbehind assertion is constrained. It must consist of one or more alternatives separated by "|", and each alternative must be fixed-length, consisting only of the following constructs, each of which matches a single character:

NormalChar (for example "A", "3")

SingleCharEsc (for example "\(", "\[")

charClassEsc (for example "\s", "\p{Lu}")

charClassExpr (for example "[a-z]")

WildcardEsc (".")

As with all assertions, evaluation of the assertion does not cause the current position to advance.

Parenthesized expressions cannot appear within lookbehind assertions.

For example, (?<=\[)[0-9]+(?=\]) matches a sequence of digits immediately preceded by an opening square bracket and followed by a closing square bracket, without matching the brackets.

Negative Lookbehind Assertion

There are two equivalent ways of writing a negative lookbehind assertion:

(?<!xyz)

(*negative_lookbehind:xyz)

The second form may be more convenient when the expression appears within an XML-based host language such as XSLT, where the angle bracket would need to be escaped.

In both cases, the assertion matches at a particular position in the input string only if there is no substring ending at that position that matches the regular expression xyz.

The regular expression within a negative lookbehind assertion is subject to the same restrictions as for a positive lookbehind assertion: see .

For example, (?<!\$)[0-9]+ matches any sequence of digits that is not immediately preceded by a dollar sign.
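
The lookbehind examples can be exercised with Python's `re` module, which also restricts lookbehind to fixed-length alternatives. This is an informal illustration with made-up input strings; the bracket pattern is written with the + quantifier outside the character class so that it matches a sequence of digits.

```python
import re

bracketed = re.compile(r"(?<=\[)[0-9]+(?=\])")
not_after_dollar = re.compile(r"(?<!\$)[0-9]+")

in_brackets = bracketed.findall("a[12] b[3] c4")
amounts = not_after_dollar.findall("$100 or 250")
```

The second result deserves care: the digit 1 is rejected because a dollar sign precedes it, but the match can still begin at the following digit, so "00" is reported alongside "250". A pattern intended to skip the whole amount would need a further condition, such as also excluding a preceding digit.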

Flags

Regular expressions can include comments (starting and ending with #) if the c flag is set.

All these functions provide an optional parameter, $flags, to set options for the interpretation of the regular expression. The parameter accepts an xs:string, in which individual letters are used to set options. The presence of a letter within the string indicates that the option is on; its absence indicates that the option is off. Letters may appear in any order and may be repeated. They are case-sensitive. If there are characters present that are not defined here as flags, then a dynamic error is raised.

The following options are defined:

s: If present, the match operates in “dot-all” mode. (Perl calls this the single-line mode.) If the s flag is not specified, the meta-character . matches any character except a newline (#x0A) or carriage return (#x0D) character. In dot-all mode, the meta-character . matches any character whatsoever. Suppose the input contains the strings "hello" and "world" on two lines. This will not be matched by the regular expression "hello.*world" unless dot-all mode is enabled.

m: If present, the match operates in multi-line mode. By default, the meta-character ^ matches the start of the entire string, while $ matches the end of the entire string. In multi-line mode, ^ matches the start of any line (that is, the start of the entire string, and the position immediately after a newline character other than a newline that appears as the last character in the string), while $ matches the end of any line (that is, the position immediately before a newline character, and the end of the entire string if there is no newline character at the end of the string). Newline here means the character #x0A only.

i: If present, the match operates in case-insensitive mode. The detailed rules are as follows. In these rules, a character C2 is considered to be a case-variant of another character C1 if the following XPath expression returns true when the two characters are considered as strings of length one, and the Unicode codepoint collation is used:

fn:lower-case(C1) eq fn:lower-case(C2) or fn:upper-case(C1) eq fn:upper-case(C2)

Note that the case-variants of a character under this definition are always single characters.

When a normal character (Char) is used as an atom, it represents the set containing that character and all its case-variants. For example, the regular expression "z" will match both "z" and "Z".

A character range (production charRange in the XSD 1.0 grammar, replaced by productions charRange and singleChar in XSD 1.1) represents the set containing all the characters that it would match in the absence of the i flag, together with their case-variants. For example, the regular expression "[A-Z]" will match all the letters A to Z and all the letters a to z. It will also match certain other characters such as #x212A (KELVIN SIGN), since fn:lower-case("#x212A") is k.

This rule applies also to a character range used in a character class subtraction (charClassSub): thus [A-Z-[IO]] will match characters such as A, B, a, and b, but will not match I, O, i, or o.

The rule also applies to a character range used as part of a negative character group: thus "[^Q]" will match every character except Q and q (these being the only case-variants of Q in Unicode).

A back-reference is compared using case-blind comparison: that is, each character must either be the same as the corresponding character of the previously matched string, or must be a case-variant of that character. For example, the strings "Mum", "mom", "Dad", and "DUD" all match the regular expression "([md])[aeiou]\1" when the i flag is used.

All other constructs are unaffected by the i flag. For example, "\p{Lu}" continues to match upper-case letters only.

x: If present, whitespace characters (that is, U+0009, U+000A, U+000D and U+0020) in the regular expression are removed prior to matching with one exception: whitespace characters within character class expressions (charClassExpr) are not removed. This flag can be used, for example, to break up long regular expressions into readable lines.

Examples:

fn:matches("helloworld", "hello world", "x") returns true()

fn:matches("helloworld", "hello[ ]world", "x") returns false()

fn:matches("hello world", "hello\ sworld", "x") returns true()

fn:matches("hello world", "hello world", "x") returns false()

Whitespace is treated as a lexical construct to be removed before the regular expression is parsed; it is therefore not explicit in the regular expression grammar.

q: if present, all characters in the regular expression are treated as representing themselves, not as metacharacters. In effect, every character that would normally have a special meaning in a regular expression is implicitly escaped by preceding it with a backslash.

Furthermore, when this flag is present, the characters $ and \ have no special significance when used in the replacement string supplied to the fn:replace function.

This flag can be used in conjunction with the i flag. If it is used together with the m, s, x, or c flag, that flag has no effect.

Examples:

tokenize("12.3.5.6", ".", "q") returns ("12", "3", "5", "6")

replace("a\b\c", "\", "\\", "q") returns "a\\b\\c"

replace("a/b/c", "/", "$", "q") returns "a$b$c"

matches("abcd", ".*", "q") returns false()

matches("Mr. B. Obama", "B. OBAMA", "iq") returns true()

c: if present, comments are enabled in the regular expression. This flag has no effect if the q flag is present. A comment is recognized by the presence of a # character that is not escaped by a backslash or contained in a character class expression (charClassExpr), and it is terminated by the following # character or by the end of the regular expression string.

For example:

replace("03/24/2025", "(..#month#)/(..#day#)/(....#year#)", "$3-$1-$2", "c")

Comments are treated as a lexical construct to be removed before the regular expression is parsed; they are therefore not explicit in the regular expression grammar.
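
Two of the flag rules above lend themselves to a short sketch: the case-variant test used by the i flag and the whitespace removal performed by the x flag. The Python code below is informal and its function names are hypothetical. It takes whitespace to be tab, newline, carriage return and space, tracks square brackets with a simple nesting counter, and uses Python's `lower()` and `upper()` as stand-ins for fn:lower-case and fn:upper-case, which is an assumption about their agreement for single characters.

```python
import re

WHITESPACE = "\t\n\r "  # U+0009, U+000A, U+000D, U+0020


def is_case_variant(c1: str, c2: str) -> bool:
    """The case-variant test given for the i flag, applied to single characters."""
    return c1.lower() == c2.lower() or c1.upper() == c2.upper()


def strip_whitespace(regex: str) -> str:
    """Remove whitespace outside character class expressions (x flag)."""
    out = []
    depth = 0        # nesting depth of square brackets
    escaped = False  # True immediately after an unescaped backslash
    for ch in regex:
        if depth == 0 and ch in WHITESPACE:
            continue  # dropped before parsing, even directly after a backslash
        out.append(ch)
        if escaped:
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == "[":
            depth += 1
        elif ch == "]" and depth > 0:
            depth -= 1
    return "".join(out)


def matches_x(value: str, regex: str) -> bool:
    """Rough analogue of fn:matches(value, regex, 'x') using Python's engine."""
    return re.search(strip_whitespace(regex), value) is not None
```

Applied to the four x flag examples above, `matches_x` gives the same answers, and `is_case_variant("K", "\u212A")` is true, which is why "[A-Z]" matches KELVIN SIGN in case-insensitive mode.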

Functions using regular expressions

Processing URIs

This section specifies functions that manipulate URI values, either as instances of xs:anyURI or as strings.

Parsing and building URIs

This section specifies functions that parse strings as URIs, to identify their structure, and construct URI strings from their structured representation.

Some URI schemes are hierarchical and some are non-hierarchical. Implementations must treat the following schemes as non-hierarchical: jar, mailto, news, tag, tel, and urn. Whether additional schemes are known to be non-hierarchical is implementation-defined. If a scheme is not known to be non-hierarchical, it must be treated as hierarchical.

Both functions use a structured representation of a URI as defined in the next section.

The segmented forms of the path and query parameters provide convenient access to commonly used information.

The path, if there is one, is tokenized on “/” characters and each segment is unescaped (as per the fn:decode-from-uri function). Consider the URI http://example.com/path/to/a%2fb. The path portion has to be returned as /path/to/a%2fb because decoding the %2f would change the nature of the path. The unescaped form is easily accessible from path-segments:

("", "path", "to", "a/b")

Note that the presence or absence of a leading slash on the path will affect whether or not the sequence begins with a zero-length string.

The query parameters are decoded into a map. Consider the URI: http://example.com/path?a=1&b=2%264&a=3. The decoded form in the query-parameters is the following map:

{ "a": ("1", "3"), "b": "2&4" }

Note that both keys and values are unescaped. If a key is repeated in the query string, the map will contain a sequence of values for that key, as seen for a in this example.
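The order of operations matters here: the path and query are split on their delimiters first, and only then is each piece unescaped. A minimal sketch in Python, assuming the standard urllib.parse module as a stand-in (its parsing rules are not those of the functions specified in this section, and the helper names are hypothetical):

```python
from urllib.parse import parse_qs, unquote, urlsplit

def path_segments(uri):
    # Split the still-escaped path on "/" first, then unescape each segment,
    # so that an escaped "%2f" never acts as a separator.
    return [unquote(segment) for segment in urlsplit(uri).path.split("/")]

def query_parameters(uri):
    # parse_qs splits on "&" and "=" before unescaping keys and values.
    # Assumption: unlike the map shown above, every value is a list, even
    # when the key occurs only once, and "+" is decoded as a space.
    return parse_qs(urlsplit(uri).query, keep_blank_values=True)
```

For the two URIs discussed above, path_segments yields the list "", "path", "to", "a/b", and query_parameters yields "a" mapped to "1" and "3" and "b" mapped to "2&4".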

Processing durations

Operators are defined on the following type:

xs:duration

and on the two defined subtypes (see ):

xs:yearMonthDuration

xs:dayTimeDuration

Arithmetic on durations is defined only on these subtypes: this is because the results of some operations (for example one month minus one day) have no representation in the value space.

Two xs:duration values may however be compared.

Duration data types

A value of type xs:duration is considered to comprise two parts:

The total number of months, represented as a signed integer.

The total number of seconds, represented as a signed decimal number.

If one of these values is negative (less than zero), the other must not be positive (greater than zero).

In effect this means that operations on durations (including equality comparison, casting to string, and extraction of components) all treat the duration as normalized. The duration PT1M30S (one minute and thirty seconds), for example, is precisely equivalent to the duration PT90S (ninety seconds); these are different representations of the same value, and the result of any operation will be the same regardless of which representation is used. For example, the function fn:seconds-from-duration returns 30 in both cases.

The information content of an xs:duration value can be reduced to an xs:integer number of months, and an xs:decimal number of seconds. For the two defined subtypes this is further simplified so that one of these two components is fixed at zero. Operations such as comparison of durations and arithmetic on durations can be expressed in terms of numeric operations applied to these two components.
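The two-component model can be illustrated with a short sketch. The Python below is not part of the specification: it assumes a simplified lexical form for duration literals, and the names parse_duration and seconds_component are hypothetical.

```python
import re
from decimal import Decimal

# Simplified lexical form: optional sign, then date and time designators.
_DURATION = re.compile(
    r"(-)?P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?"
    r"(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?)?"
)

def parse_duration(text):
    """Reduce a duration literal to (total months, total seconds)."""
    match = _DURATION.fullmatch(text)
    # A valid literal always ends in a designator letter, never in "P" or "T".
    if match is None or text[-1] in "PT":
        raise ValueError(f"not a duration: {text!r}")
    sign, years, months, days, hours, minutes, seconds = match.groups()
    total_months = int(years or 0) * 12 + int(months or 0)
    total_seconds = (Decimal(days or 0) * 86400 + Decimal(hours or 0) * 3600
                     + Decimal(minutes or 0) * 60 + Decimal(seconds or 0))
    if sign:
        total_months, total_seconds = -total_months, -total_seconds
    return total_months, total_seconds

def seconds_component(duration):
    """Seconds left over once whole minutes are removed (sign follows the value)."""
    return duration[1] % 60
```

With this representation PT1M30S and PT90S both reduce to zero months and 90 seconds, so any operation defined on the pair cannot distinguish them.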

Subtypes of duration

Two subtypes of xs:duration, namely xs:yearMonthDuration and xs:dayTimeDuration, are defined in . These types must be available in the data model whether or not the implementation supports other aspects of XSD 1.1.

The significance of these subtypes is that arithmetic and ordering become well defined; this is not the case for xs:duration values in general, because of the variable number of days in a month. For this reason, many of the functions and operators on durations require the arguments/operands to belong to these two subtypes.

In an xs:yearMonthDuration, the seconds component is always zero. In an xs:dayTimeDuration, the months component is always zero.

Limits and precision

The specification now prescribes a minimum precision and range for durations.

All conforming processors must support duration values in which:

The total number of months can be represented as a signed xs:int value;

The total number of seconds can be represented as a signed xs:decimal value with facets totalDigits=18 and fractionDigits=3. That is, durations must be supported to millisecond precision.

Processors may support a greater range and/or precision. The limits are .

A processor that limits the range or precision of duration values may encounter overflow and underflow conditions when it tries to evaluate operations on durations. In these situations, the processor must return a zero-length duration in case of duration underflow, and must raise a dynamic error in case of overflow.

Similarly, a processor may be unable accurately to represent the result of dividing a duration by 2, or multiplying a duration by 0.5. A processor that limits the precision of the seconds component of duration values must deliver a result that is as close as possible to the mathematically precise result, given these limits; if two values are equally close, the one that is chosen is .

Comparing durations

Duration values may be compared using the fn:compare function.

In this version of the specification, all xs:duration values are mutually comparable: comparison operations are no longer restricted to the two subtypes xs:yearMonthDuration and xs:dayTimeDuration. However, although a total ordering is defined over all durations, the result is not always meaningful: while it makes sense that P1Y1D (one year and a day) is greater than P1Y (one year), it makes little sense that P32D (thirty-two days) is less than P1M (one month).

Durations are treated as tuples with two components, the months and seconds components, and they are compared by treating the months as the primary key and the seconds as the secondary key.
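This ordering is an ordinary lexicographic comparison of (months, seconds) pairs. A minimal sketch in Python, where the function name compare_durations and the pre-computed pairs are illustrative assumptions:

```python
def compare_durations(a, b):
    # Python tuples compare element by element: months first, then seconds.
    # The result is -1, 0, or +1.
    return (a > b) - (a < b)

# (total months, total seconds) for the durations mentioned above.
P1Y = (12, 0)
P1Y1D = (12, 86400)
P32D = (0, 32 * 86400)
P1M = (1, 0)
```

Under this rule compare_durations(P1Y1D, P1Y) is 1, while compare_durations(P32D, P1M) is -1, because any duration with zero months sorts before any duration with one month.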

Extracting components of durations

The duration datatype may be considered to be a composite datatype in that it contains distinct properties or components. The extraction functions specified below extract a single component from a duration value. For xs:duration and its subtypes, including the two subtypes xs:yearMonthDuration and xs:dayTimeDuration, the components are normalized: this means that the seconds and minutes components will always be less than 60, the hours component less than 24, and the months component less than 12.
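The normalization can be sketched as repeated division of the two underlying totals. The Python below is illustrative only: it assumes a duration is held as a (months, seconds) pair, and the function name duration_components is hypothetical.

```python
from decimal import Decimal

def duration_components(months, seconds):
    """Return (years, months, days, hours, minutes, seconds), normalized."""
    # For a negative duration every non-zero component is negative.
    sign = -1 if (months < 0 or seconds < 0) else 1
    years, rest_months = divmod(abs(months), 12)
    days, rest = divmod(abs(Decimal(seconds)), 86400)
    hours, rest = divmod(rest, 3600)
    minutes, rest_seconds = divmod(rest, 60)
    parts = (years, rest_months, int(days), int(hours), int(minutes), rest_seconds)
    return tuple(sign * part for part in parts)
```

For instance, 14 months yields one year and two months, and 90 seconds yields one minute and thirty seconds.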

Constructing durations

This section describes the fn:seconds function, which constructs an xs:dayTimeDuration value representing a decimal number of seconds.

Arithmetic operators on durations

For operators that combine a duration and a date/time value, see .

Processing dates and times

This section defines operations on the date and time types.

See for a disquisition on working with date and time values with and without timezones.

Date and time types

The eight primitive types xs:dateTime, xs:date, xs:time, xs:gYearMonth, xs:gYear, xs:gMonthDay, xs:gMonth, xs:gDay are referred to collectively as the Gregorian types.

This section describes operations on atomic items of these types.

Values of these types are modeled as comprising one or more of the seven components year, month, day, hour, minute, second, and timezone.

The only operations defined on xs:gYearMonth, xs:gYear, xs:gMonthDay, xs:gMonth, and xs:gDay values are equality comparison and component extraction. For other types, further operations are provided, including order comparisons, arithmetic, formatted display, and timezone adjustment.

Limits and precision

All conforming processors must support year values in the range 1 to 9999, and a minimum fractional second precision of 1 millisecond or three digits (that is, s.sss). However, processors may set larger limits on the maximum number of digits they support in these two situations. Processors may also choose to support the year 0 and years with negative values. The results of operations on dates that cross the year 0 are .

A processor that limits the number of digits in date and time datatype representations may encounter overflow and underflow conditions when it tries to execute the functions in . In these situations, the processor must return 00:00:00 in case of time underflow. It must raise a dynamic error in case of overflow.

Similarly, a processor that limits the precision of the seconds component of date and time or duration values may need to deliver a rounded result for arithmetic operations. Such a processor must deliver a result that is as close as possible to the mathematically precise result, given these limits: if two values are equally close, the one that is chosen is .

Date/time datatype values

As defined in , xs:dateTime, xs:date, xs:time, xs:gYearMonth, xs:gYear, xs:gMonthDay, xs:gMonth, xs:gDay values, referred to collectively as date/time values, are represented as seven components or properties: year, month, day, hour, minute, second and timezone. The first five components are xs:integer values. The value of the second component is an xs:decimal and the value of the timezone component is an xs:dayTimeDuration. For all the primitive date/time datatypes, the timezone property is optional and may or may not be present. Depending on the datatype, some of the remaining six properties must be present and some must be absent. Absent, or missing, properties are represented by the empty sequence. This value is referred to as the local value in that the value retains its original timezone. Before comparing or subtracting xs:dateTime values, this local value must be translated or normalized to UTC.

For xs:time, 00:00:00 and 24:00:00 are alternate lexical forms for the same value, whose canonical representation is 00:00:00. For xs:dateTime, a time component 24:00:00 translates to 00:00:00 of the following day.

Examples

An xs:dateTime with lexical representation 1999-05-31T05:00:00 is represented in the datamodel by { 1999, 5, 31, 5, 0, 0.0, () }.

An xs:dateTime with lexical representation 1999-05-31T13:20:00-05:00 is represented by { 1999, 5, 31, 13, 20, 0.0, xs:dayTimeDuration("-PT5H") }.

An xs:dateTime with lexical representation 1999-12-31T24:00:00 is represented by { 2000, 1, 1, 0, 0, 0.0, () }.

An xs:date with lexical representation 2005-02-28+08:00 is represented by { 2005, 2, 28, (), (), (), xs:dayTimeDuration("PT8H") }.

An xs:time with lexical representation 24:00:00 is represented by { (), (), (), 0, 0, 0, () }.
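The mapping from a lexical xs:dateTime to its seven components, including the treatment of 24:00:00, might be sketched as follows. This Python is an illustration under stated assumptions: it accepts only four-digit non-negative years, represents an absent timezone as None and a present one as a timedelta, and the function name datetime_components is hypothetical.

```python
import re
from datetime import date, timedelta
from decimal import Decimal

_DATETIME = re.compile(
    r"(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2}(?:\.\d+)?)"
    r"(Z|[+-]\d{2}:\d{2})?"
)

def datetime_components(text):
    match = _DATETIME.fullmatch(text)
    if match is None:
        raise ValueError(f"unsupported dateTime: {text!r}")
    year, month, day, hour, minute = (int(g) for g in match.groups()[:5])
    second = Decimal(match.group(6))
    zone = match.group(7)
    if hour == 24:
        if minute != 0 or second != 0:
            raise ValueError("hour 24 is only allowed as 24:00:00")
        # 24:00:00 denotes 00:00:00 on the following day.
        following = date(year, month, day) + timedelta(days=1)
        year, month, day, hour = following.year, following.month, following.day, 0
    if zone is None:
        offset = None  # stands in for the empty sequence
    elif zone == "Z":
        offset = timedelta(0)
    else:
        direction = -1 if zone[0] == "-" else 1
        offset = direction * timedelta(hours=int(zone[1:3]), minutes=int(zone[4:6]))
    return (year, month, day, hour, minute, second, offset)
```

Applied to the first three lexical forms above, the sketch produces the component tuples listed there, with None in place of the empty sequence.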

Constructing a dateTime
Comparing date and time values

Date and time values can be compared using the function fn:compare.

states that the order relation on date and time datatypes is not a total order but a partial order because these datatypes may or may not have a timezone. This is handled as follows. If either operand to a comparison function on date or time values does not have an (explicit) timezone then, for the purpose of the operation, an implicit timezone, provided by the dynamic context, is assumed to be present as part of the value. This creates a total order for all date and time values.
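The effect of the implicit timezone can be sketched with Python's datetime module, which likewise distinguishes values with and without an offset. The function name compare_datetimes is hypothetical, and the implicit timezone is passed explicitly as a timedelta rather than taken from a dynamic context.

```python
from datetime import datetime, timedelta, timezone

def compare_datetimes(a, b, implicit_offset):
    # Values without a timezone are assumed to be in the implicit timezone.
    implicit = timezone(implicit_offset)
    a = a if a.tzinfo is not None else a.replace(tzinfo=implicit)
    b = b if b.tzinfo is not None else b.replace(tzinfo=implicit)
    # Python compares timezone-aware values after normalizing them to UTC.
    return (a > b) - (a < b)

without_zone = datetime(2000, 1, 1, 12, 0, 0)
with_zone = datetime(2000, 1, 1, 12, 0, 0, tzinfo=timezone(timedelta(hours=1)))
```

With an implicit timezone of -05:00 the first value corresponds to 17:00 UTC and is later than the second (11:00 UTC); with an implicit timezone of +01:00 the two values are equal.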

An xs:dateTime can be considered to consist of seven components: year, month, day, hour, minute, second and timezone. For xs:dateTime six components (year, month, day, hour, minute and second) are required and timezone is optional. For other date/time values, of the first six components, some are required and others must be absent. Timezone is always optional. For example, for xs:date, the year, month and day components are required and hour, minute and second components must be absent; for xs:time the hour, minute and second components are required and year, month and day are missing; for xs:gDay, day is required and year, month, hour, minute and second are missing.

In , a new explicitTimezone facet is available with values optional, required, or prohibited to enable the timezone to be defined as mandatory or disallowed.

Values of the date/time datatypes xs:time, xs:gMonthDay, xs:gMonth, and xs:gDay, can be considered to represent a sequence of recurring time instants or time periods. An xs:time occurs every day. An xs:gMonth occurs every year. Comparison operators on these datatypes compare the starting instants of equivalent occurrences in the recurring series. These xs:dateTime values are calculated as described below.

Comparison operators on xs:date, xs:gYearMonth and xs:gYear compare their starting instants. These xs:dateTime values are calculated as described below.

The starting instant of an occurrence of a date/time value is an xs:dateTime calculated by filling in the missing components of the local value from a reference xs:dateTime. An example of a suitable reference xs:dateTime is 1972-01-01T00:00:00. Then, for example, the starting instant corresponding to the xs:date value 2009-03-12 is 2009-03-12T00:00:00; the starting instant corresponding to the xs:time value 13:30:02 is 1972-01-01T13:30:02; and the starting instant corresponding to the gMonthDay value --02-29 is 1972-02-29T00:00:00 (which explains why a leap year was chosen for the reference).

In the previous version of this specification, the reference date/time chosen was 1972-12-31T00:00:00. While this gives the same results, it produces a "starting instant" for a gMonth or gMonthDay that bears no relation to the ordinary meaning of the term, and it also required special handling of short months. The original choice was made to allow for leap seconds; but since leap seconds are not recognized in date/time arithmetic, this is not actually necessary.

If the xs:time value written as 24:00:00 is to be compared, filling in the missing components gives 1972-01-01T00:00:00, because 24:00:00 is an alternative representation of 00:00:00 (the lexical value "24:00:00" is converted to the time components { 0, 0, 0 } before the missing components are filled in). This has the consequence that when ordering xs:time values, 24:00:00 is considered to be earlier than 23:59:59. However, when ordering xs:dateTime values, a time component of 24:00:00 is considered equivalent to 00:00:00 on the following day.
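Filling in missing components from the reference value can be sketched as follows. The Python is illustrative: the function name starting_instant is hypothetical, absent components are passed as None, timezones are left out, and only whole seconds are handled.

```python
from datetime import datetime

# The reference xs:dateTime used in this specification.
REFERENCE = datetime(1972, 1, 1, 0, 0, 0)

def starting_instant(year=None, month=None, day=None,
                     hour=None, minute=None, second=None):
    supplied = dict(year=year, month=month, day=day,
                    hour=hour, minute=minute, second=second)
    # Components present in the value override those of the reference.
    present = {name: value for name, value in supplied.items() if value is not None}
    return REFERENCE.replace(**present)
```

The xs:date 2009-03-12 gives starting_instant(2009, 3, 12), the xs:time 13:30:02 gives starting_instant(hour=13, minute=30, second=2), and the gMonthDay --02-29 gives starting_instant(month=2, day=29), which is valid only because 1972 is a leap year. An xs:time written as 24:00:00 has components 0, 0, 0 and therefore sorts before 23:59:59.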

Note that the reference xs:dateTime does not have a timezone. The timezone component is never filled in from the reference xs:dateTime. In some cases, if the date/time value does not have a timezone, the implicit timezone from the dynamic context is used as the timezone.

This specification uses the reference xs:dateTime 1972-01-01T00:00:00 in the description of the comparison operators. Implementations may use other reference xs:dateTime values as long as they yield the same results. The reference xs:dateTime used must meet the following constraints: when it is used to supply components into xs:gMonthDay values, the year must allow for February 29 and so must be a leap year; when it is used to supply missing components into xs:gDay values, the month must allow for 31 days. Different reference xs:dateTime values may be used for different operators.

Extracting components of dates and times

The date and time datatypes may be considered to be composite datatypes in that they contain distinct properties or components. The extraction functions specified below extract a single component from a date or time value. In all cases the local value (that is, the original value as written, without any timezone adjustment) is used.

A time written as 24:00:00 is treated as 00:00:00 on the following day.

Adjusting timezones on dates and times

These functions adjust the timezone component of an xs:dateTime, xs:date or xs:time value. The $timezone argument to these functions is defined as an xs:dayTimeDuration but must be a valid timezone value.

Arithmetic operators on durations, dates, and times

These functions support adding or subtracting a duration value to or from an xs:dateTime, an xs:date or an xs:time value. Appendix E of describes an algorithm for performing such operations.

Limits and precision

A processor that limits the number of digits in date and time datatype representations may encounter overflow and underflow conditions when it tries to execute the functions in this section. In these situations, the processor must return P0M or PT0S in case of duration underflow and 00:00:00 in case of time underflow. It must raise a dynamic error in case of overflow.

The value spaces of the two totally ordered subtypes of xs:duration described in are xs:integer months for xs:yearMonthDuration and xs:decimal seconds for xs:dayTimeDuration. If a processor limits the number of digits allowed in the representation of xs:integer and xs:decimal then overflow and underflow situations can arise when it tries to execute the functions in . In these situations the processor must return zero in case of numeric underflow and P0M or PT0S in case of duration underflow. It must raise a dynamic error in case of overflow.

Formatting dates and times

Three functions are provided to represent dates and times as a string, using the conventions of a selected calendar, language, and country. The functions are presented in their customary fashion, except for the rules and examples, which are described en bloc at and .

The date/time formatting functions

The fn:format-dateTime, fn:format-date, and fn:format-time functions format $value as a string using the picture string specified by the $picture argument, the calendar specified by the $calendar argument, the language specified by the $language argument, and the country or other place name specified by the $place argument. The result of the function is the formatted string representation of the supplied xs:dateTime, xs:date, or xs:time value.

The three functions fn:format-dateTime, fn:format-date, and fn:format-time are referred to collectively as the date formatting functions.

If $value is the empty sequence, the function returns the empty sequence.

Calling the two-argument form of each of the three functions is equivalent to calling the five-argument form with each of the last three arguments set to the empty sequence.

For details of the $language, $calendar, and $place arguments, see .

In general, the use of an invalid $picture, $language, $calendar, or $place argument results in a dynamic error. By contrast, use of an option in any of these arguments that is valid but not supported by the implementation is not an error, and in these cases the implementation is required to output the value in a fallback representation. More detailed rules are given below.

The picture string

The picture consists of a sequence of variable markers and literal substrings. A substring enclosed in square brackets is interpreted as a variable marker; substrings not enclosed in square brackets are taken as literal substrings. The literal substrings are optional and if present are rendered unchanged, including any whitespace. If an opening or closing square bracket is required within a literal substring, it must be doubled. The variable markers are replaced in the result by strings representing aspects of the date and/or time to be formatted. These are described in detail below.

A variable marker consists of a component specifier followed optionally by one or two presentation modifiers and/or optionally by a width modifier. Whitespace within a variable marker is ignored.

The variable marker may be separated into its components by applying the following rules:

The component specifier is always present and is always a single letter.

The width modifier may be recognized by the presence of a comma.

The substring between the component specifier and the comma (if present) or the end of the string (if there is no comma) contains the first and second presentation modifiers, both of which are optional. If this substring contains a single character, this is interpreted as the first presentation modifier. If it contains more than one character, the last character is examined: if it is valid as a second presentation modifier then it is treated as such, and the preceding part of the substring constitutes the first presentation modifier. Otherwise, the second presentation modifier is presumed absent and the whole substring is interpreted as the first presentation modifier.
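The three rules above can be sketched as a small splitting routine. The Python below is illustrative only: it takes the content of a variable marker without its square brackets, assumes that a, t, c, and o are the only characters valid as a second presentation modifier, and treats the last comma as the one that introduces the width modifier, as this section states for markers containing several commas. The function name split_variable_marker is hypothetical.

```python
def split_variable_marker(marker):
    """Return (specifier, first modifier, second modifier, width modifier)."""
    body = "".join(marker.split())  # whitespace within a marker is ignored
    if not body or not body[0].isalpha():
        raise ValueError(f"missing component specifier: {marker!r}")
    specifier, rest = body[0], body[1:]
    width = None
    if "," in rest:
        # Only the last comma introduces the width modifier.
        rest, width = rest.rsplit(",", 1)
    first, second = rest, ""
    if len(rest) > 1 and rest[-1] in "atco":
        first, second = rest[:-1], rest[-1]
    return specifier, first or None, second or None, width
```

For example, D1o splits into specifier D, first modifier 1, and second modifier o, while MNn,*-3 splits into specifier M, first modifier Nn, and width modifier *-3.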

The component specifier indicates the component of the date or time that is required, and takes the following values:

Specifier Meaning Default Presentation Modifier
Y year (absolute value) 1
M month in year 1
D day in month 1
d day in year 1
F day of week n
W week in year 1
w week in month 1
H hour in day (24 hours) 1
h hour in half-day (12 hours) 1
P am/pm marker n
m minute in hour 01
s second in minute 01
f fractional seconds 1
Z timezone 01:01
z timezone (Same as Z, but modified where appropriate to include a prefix as a time offset using GMT, for example GMT+1 or GMT-05:00. For this component there is a fixed prefix of GMT, or a localized variation thereof for the chosen language, and the remainder of the value is formatted as for specifier Z.) 01:01
C calendar: the name or abbreviation of a calendar name n
E era: the name of a baseline for the numbering of years, for example the reign of a monarch n

A dynamic error is reported if the syntax of the picture is incorrect.

A dynamic error is reported if a component specifier within the picture refers to components that are not available in the given type of $value, for example if the picture supplied to fn:format-time refers to the year, month, or day component.

It is not an error to include a timezone component when the supplied value has no timezone. In these circumstances the timezone component will be ignored.

The first presentation modifier indicates the style in which the value of a component is to be represented. Its value may be either:

any format token permitted as a primary format token in the second argument of the fn:format-integer function, indicating that the value of the component is to be output numerically using the specified number format (for example, 1, 01, i, I, w, W, or Ww) or

the format token n, N, or Nn, indicating that the value of the component is to be output by name, in lower-case, upper-case, or title-case respectively. Components that can be output by name include (but are not limited to) months, days of the week, timezones, and eras. If the processor cannot output these components by name for the chosen calendar and language then it must use a fallback representation.

If a comma is to be used as a grouping separator within the format token, then there must be a width specifier. More specifically: if a variable marker contains one or more commas, then the last comma is treated as introducing the width modifier, and all others are treated as grouping separators. So [Y9,999,*] will output the year as 2,008.

It is not possible to use a closing square bracket as a grouping separator within the format token.

If the implementation does not support the use of the requested format token, it must use the default presentation modifier for that component.

If the first presentation modifier is present, then it may optionally be followed by a second presentation modifier as follows:

Modifier Meaning
either a or t indicates alphabetic or traditional numbering respectively, the default being . This has the same meaning as in the second argument of fn:format-integer.
either c or o indicates cardinal or ordinal numbering respectively, for example 7 or seven for a cardinal number, or 7th or seventh for an ordinal number. This has the same meaning as in the second argument of fn:format-integer. The actual representation of the ordinal form of a number may depend not only on the language, but also on the grammatical context (for example, in some languages it must agree in gender).

Although the formatting rules are expressed in terms of the rules for format tokens in fn:format-integer, the formats actually used may be specialized to the numbering of date components where appropriate. For example, in Italian, it is conventional to use an ordinal number (primo) for the first day of the month, and cardinal numbers (due, tre, quattro ...) for the remaining days. A processor may therefore use this convention to number days of the month, ignoring the presence or absence of the ordinal presentation modifier.

The Width Modifier

Whether or not a presentation modifier is included, a width modifier may be supplied. This indicates the number of characters to be included in the representation of the value.

The width modifier, if present, is introduced by a comma. It takes the form:

   ","  min-width ("-" max-width)?

where min-width is either an unsigned integer indicating the minimum number of characters to be output, or * indicating that there is no explicit minimum, and max-width is either an unsigned integer indicating the maximum number of characters to be output, or * indicating that there is no explicit maximum; if max-width is omitted then * is assumed.

A dynamic error is raised if min-width is present and less than one, or if max-width is present and less than one or less than min-width.
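The grammar and its constraints can be sketched as follows. The Python is illustrative: it takes the text after the introducing comma, represents "no explicit limit" as None, and uses ValueError to stand in for the dynamic error. The function name parse_width is hypothetical.

```python
def parse_width(width):
    """Parse 'min-width ("-" max-width)?' into (minimum, maximum)."""
    low, dash, high = width.partition("-")

    def bound(token):
        if token == "*":
            return None  # no explicit limit
        if not (token.isascii() and token.isdigit()):
            raise ValueError(f"invalid width: {width!r}")
        number = int(token)
        if number < 1:
            raise ValueError(f"width must be at least one: {width!r}")
        return number

    minimum = bound(low)
    # An omitted max-width is treated as "*".
    maximum = bound(high) if dash else None
    if minimum is not None and maximum is not None and maximum < minimum:
        raise ValueError(f"max-width is less than min-width: {width!r}")
    return minimum, maximum
```

For example, parse_width("*-3") gives no minimum and a maximum of 3, whereas parse_width("4-2") and parse_width("0") are rejected.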

A format token containing more than one digit, such as 001 or 9999, sets the minimum and maximum width to the number of digits appearing in the format token; if a width modifier is also present, then the width modifier takes precedence.

Formatting Integer-Valued Date/Time Components

The rules in this section apply to the majority of integer-valued components: specifically M D d F W w H h m s.

In the rules below, the term decimal digit pattern has the meaning given in .

If the first presentation modifier takes the form of a decimal digit pattern:

If there is no width modifier, then the value is formatted according to the rules of the format-integer function.

If there is a width modifier, then the first presentation modifier is adjusted as follows:

If the decimal digit pattern includes a grouping separator, the output is implementation-defined (but this is not an error).

Use of a width modifier together with grouping separators is inadvisable for this reason. It is never necessary to use a width modifier with a decimal digit pattern, since the same effect can be achieved by use of optional digit signs.

Otherwise, the number of mandatory-digit-sign characters in the presentation modifier is increased if necessary. This is done first by replacing optional-digit-signs with mandatory-digit-signs, starting from the right, and then prepending mandatory-digit-signs to the presentation modifier, until the number of mandatory-digit-signs is equal to the minimum width. Any mandatory-digit-signs that are added by this process must use the same decimal digit family as existing mandatory-digit-signs in the presentation modifier if there are any, or ASCII digits otherwise.

The maximum width, if specified, is ignored.

The output is then as defined using the format-integer function with this adjusted decimal digit pattern.

If the first presentation modifier is one of N, n, or Nn:

Let FN be the full name of the component, that is, the form of the name that would be used in the absence of any width modifier.

If FN is shorter than the minimum width, then it is padded by appending spaces to the end of the name.

If FN is longer than the maximum width, then it is abbreviated, either by choosing a conventional abbreviation that fits within the maximum width (for example, “Wednesday” might be abbreviated to “Weds”), or by removing characters from the end of FN until it fits within the maximum width.

For other presentation modifiers:

Any adjustment of the value to fit within the requested width range is implementation-defined.

The value should not be truncated if this results in output that will not be meaningful to users (for example, there is no sensible way to truncate Roman numerals).

If shorter than the minimum width, the value should be padded to the minimum width, either by appending spaces, or in some other way appropriate to the numbering scheme.
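For components output by name, the padding and abbreviation rules above can be sketched in a few lines. The Python below is illustrative: it implements only the fallback of removing characters from the end (a real processor might prefer a conventional abbreviation), it counts characters as Python code points, and the function name fit_name is hypothetical.

```python
def fit_name(full_name, min_width=None, max_width=None):
    name = full_name
    if max_width is not None and len(name) > max_width:
        # Fallback abbreviation: drop characters from the end.
        name = name[:max_width]
    if min_width is not None and len(name) < min_width:
        # Pad by appending spaces.
        name = name.ljust(min_width)
    return name
```

So fit_name("Wednesday", max_width=3) gives "Wed", and fit_name("May", min_width=5) gives "May" followed by two spaces.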

Formatting the Year Component

The rules for the year component (Y) are the same as those in , except that the value of the year as output is the value of the year component of the supplied value modulo ten to the power N where N is determined as follows:

If the width modifier is present and defines a finite maximum width, then that maximum width.

Otherwise, if the first presentation modifier takes the form of a decimal-digit-pattern, then:

Let W be the number of optional-digit-signs and mandatory-digit-signs in that decimal-digit-pattern.

If W is 2 or more, then W.

Otherwise, N is infinity (that is, the year is output in full).
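The choice of N can be sketched as follows. The Python is illustrative: it assumes that a decimal digit pattern is recognized by the presence of at least one decimal digit, that # is the optional-digit-sign, and that a maximum width of None means no finite maximum. The function name year_to_output is hypothetical, and it returns the number to be formatted, not the formatted string.

```python
def year_to_output(year, first_modifier="1", max_width=None):
    if max_width is not None:
        n = max_width
    elif any(ch.isdecimal() for ch in first_modifier):
        # W counts optional and mandatory digit signs in the pattern.
        w = sum(1 for ch in first_modifier if ch == "#" or ch.isdecimal())
        n = w if w >= 2 else None
    else:
        n = None  # N is infinity: the year is output in full
    value = abs(year)  # the Y component is the absolute value of the year
    return value if n is None else value % 10 ** n
```

For the year 2008, the modifier 01 selects the value 8 (subsequently formatted as 08), while 0001 and 1 both select 2008.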

Formatting Fractional Seconds

The output for the fractional seconds component (f) is equivalent to the result of the following algorithm:

If the first presentation modifier contains no Unicode digit, then the output is implementation-defined.

Otherwise, the value of the fractional seconds is output as follows:

If there is no width modifier and the first presentation modifier comprises in its entirety a single mandatory-digit-sign (for example the default 1), then the presentation modifier is extended on the right with as many optional-digit-signs as are needed to accommodate the actual fractional seconds precision encountered in the value to be formatted.

If there is a width modifier, then the first presentation modifier is adjusted as follows:

If a minimum width is specified, and if this exceeds the number of mandatory-digit-sign characters in the first presentation modifier, then the first presentation modifier is adjusted. This is done first by replacing optional-digit-signs with mandatory-digit-signs, starting from the left, and then appending mandatory-digit-signs to the presentation modifier, until the number of mandatory-digit-signs is equal to the minimum width. Any mandatory-digit-signs that are added by this process must use the same decimal digit family as existing mandatory-digit-signs in the presentation modifier.

If a maximum width is specified, the first presentation modifier is extended on the right with as many optional-digit-signs as are needed to ensure that the number of mandatory-digit-signs and optional-digit-signs is at least equal to the maximum width.

The sequence of characters in the (adjusted) first presentation modifier is reversed (for example, 999'### becomes ###'999). If the result is not a valid decimal digit pattern, then the output is implementation-defined.

The sequence of digits in the conventional decimal representation of the fractional seconds component is reversed, with insignificant zeroes removed, and the result is treated as an integer. For example, if the seconds value is 25.8235, the reversed fractional seconds value is 5328.

The reversed fractional seconds value is formatted using the reversed decimal digit pattern according to the rules of the fn:format-integer function. Given the examples above, the result is 5'328.

The resulting string is reversed. In our example, the result is 823'5.

If the result contains more digits than the number of mandatory-digit-signs and optional-digit-signs in the decimal digit pattern, then excess digits are removed from the right hand end (that is, the value is truncated towards zero rather than being rounded). Any grouping separator that immediately precedes a removed digit is also removed.

The reason for presenting the algorithm in this way is that it enables maximum reuse of the rules defined for fn:format-integer. Since the fractional seconds value is not properly an integer, the rules do not work if used directly: for example, the positions of grouping separators need to be counted from the left rather than from the right. Implementations, as always, are free to use a different algorithm that yields the same result.

A format token consisting of a single digit, such as 1, does not constrain the number of digits in the output. In the case of fractional seconds in particular, [f001] requests three decimal digits, [f01] requests two digits, but [f1] will retain all digits in the supplied date/time value (the maximum number of digits is implementation-defined). If exactly one digit is required, this can be achieved using the component specifier [f1,1-1].

Formatting timezones

Special rules apply to the formatting of timezones. When the component specifiers Z or z are used, the rules in this section override any rules given elsewhere in the case of discrepancies.

If the date/time value to be formatted does not include a timezone offset, then the timezone component specifier is generally ignored (results in no output). The exception is where military timezones are used (format ZZ) in which case the string "J" is output, indicating local time.

When the component specifier is z, the output is the same as for component specifier Z, except that it is prefixed by the characters GMT or some localized equivalent. The prefix is omitted, however, in cases where the timezone is identified by name rather than by a numeric offset from UTC.

If the first presentation modifier is numeric and comprises one or two digits with no grouping-separator (for example 1 or 01), then the timezone is formatted as a displacement from UTC in hours, preceded by a plus or minus sign: for example -5 or +03. If the actual timezone offset is not an integral number of hours, then the minutes part of the offset is appended, separated by a colon: for example +10:30 or -1:15.

If the first presentation modifier is numeric with a grouping-separator (for example 1:01 or 01.01), then the timezone offset is output in hours and minutes, separated by the grouping separator, even if the number of minutes is zero: for example +5:00 or +10.30.

If the first presentation modifier is numeric and comprises three or four digits with no grouping-separator, for example 001 or 0001, then the timezone offset is shown in hours and minutes with no separator, for example -0500 or +1030.

If the first presentation modifier is numeric, in any of the above formats, and the second presentation modifier is t, then a zero timezone offset (that is, UTC) is output as Z instead of a signed numeric value. If this presentation modifier is absent or if the timezone offset is non-zero, then the displayed timezone offset is preceded by a - sign for negative offsets or a + sign for non-negative offsets.
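The numeric rules above can be summarised in a small Python sketch. The function name, the use of minutes as the offset unit, and the boolean standing in for the t modifier are assumptions made for illustration; the sketch does not handle width specifiers or non-ASCII digits.

```python
def format_numeric_timezone(offset_minutes: int, picture: str,
                            z_for_utc: bool = False) -> str:
    """Format a timezone offset using a numeric first presentation modifier."""
    if z_for_utc and offset_minutes == 0:
        return "Z"                                   # second modifier "t"
    sign = "-" if offset_minutes < 0 else "+"
    hours, minutes = divmod(abs(offset_minutes), 60)
    separators = [ch for ch in picture if not ch.isdigit()]
    digit_count = sum(ch.isdigit() for ch in picture)
    if separators:
        # Hours and minutes are always shown, split by the grouping separator.
        sep = separators[0]
        hour_width = picture.index(sep)
        return f"{sign}{hours:0{hour_width}d}{sep}{minutes:02d}"
    if digit_count <= 2:
        # Hours only; minutes are appended after a colon when non-zero.
        text = f"{sign}{hours:0{digit_count}d}"
        return text + (f":{minutes:02d}" if minutes else "")
    # Three or four digits: hours and minutes with no separator.
    return f"{sign}{hours:0{digit_count - 2}d}{minutes:02d}"


for picture in ("0", "0:00", "00:00", "0000"):
    print(picture, [format_numeric_timezone(m, picture)
                    for m in (-600, -300, 0, 330, 780)])
```

The offsets used in the loop are the five column values of the example table later in this section, expressed in minutes.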

If the first presentation modifier is Z, then the timezone is formatted as a military timezone letter, using the convention Z = +00:00, A = +01:00, B = +02:00, ..., M = +12:00, N = -01:00, O = -02:00, ... Y = -12:00. The letter J (meaning local time) is used in the case of a value that does not specify a timezone offset. Timezone offsets that have no representation in this system (for example Indian Standard Time, +05:30) are output as if the format 01:01 had been requested.
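As a rough illustration of the military timezone convention, the following Python sketch maps an offset in minutes to a letter. The function name and the use of None for a value without a timezone are assumptions of the sketch.

```python
from typing import Optional

EAST = "ABCDEFGHIKLM"   # +01:00 .. +12:00 (J is skipped)
WEST = "NOPQRSTUVWXY"   # -01:00 .. -12:00


def military_timezone(offset_minutes: Optional[int]) -> str:
    """Return the military timezone letter, or the 01:01 fallback format."""
    if offset_minutes is None:
        return "J"                      # no timezone offset: local time
    if offset_minutes == 0:
        return "Z"
    hours, minutes = divmod(abs(offset_minutes), 60)
    if minutes == 0 and hours <= 12:
        letters = EAST if offset_minutes > 0 else WEST
        return letters[hours - 1]
    # Offsets with no letter are output as if format 01:01 had been requested.
    sign = "-" if offset_minutes < 0 else "+"
    return f"{sign}{hours:02d}:{minutes:02d}"


print([military_timezone(m) for m in (-600, -300, 0, 330, 780)])
```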

If the first presentation modifier is N, then the timezone is output (where possible) as a timezone name, for example EST or CET. The same timezone offset has different names in different places; it is therefore recommended that this option should be used only if a country code (see ) or IANA timezone name (see ) is supplied in the $place argument. In the absence of this information, the implementation may apply a default, for example by using the timezone names that are conventional in North America. If no timezone name can be identified, the timezone offset is output using the fallback format 01:01.

The following examples illustrate options for timezone formatting.

Variable marker $place Timezone offsets (with time = 12:00:00)
    -10:00 -05:00 +00:00 +05:30 +13:00
[Z] () -10:00 -05:00 +00:00 +05:30 +13:00
[Z0] () -10 -5 +0 +5:30 +13
[Z0:00] () -10:00 -5:00 +0:00 +5:30 +13:00
[Z00:00] () -10:00 -05:00 +00:00 +05:30 +13:00
[Z0000] () -1000 -0500 +0000 +0530 +1300
[Z00:00t] () -10:00 -05:00 Z +05:30 +13:00
[z] () GMT‑10:00 GMT‑05:00 GMT+00:00 GMT+05:30 GMT+13:00
[ZZ] () W R Z +05:30 +13:00
[ZN] "us" HST EST GMT IST +13:00
[H00]:[M00] [ZN] "America/New_York" 06:00 EST 12:00 EST 07:00 EST 01:30 EST 18:00 EST

If a width specifier is present when formatting a timezone, then the representation as defined in this section is padded to the minimum width as described in , but it is never shortened.

Formatting Other Components

This section applies to the remaining components: P (am/pm marker), C (calendar), and E (era).

The output for these components is entirely implementation-defined. The default presentation modifier for these components is n, indicating that they are output as names (or conventional abbreviations), and the chosen names will in many cases depend on the chosen language: see .

The language, calendar, and place arguments

The set of languages, calendars, and places that are supported in the date formatting functions is implementation-defined. When any of these arguments is omitted or is the empty sequence, an implementation-defined default value is used.

If the fallback representation uses a different calendar from that requested, the output string must identify the calendar actually used, for example by prefixing the string with [Calendar: X] (where X is the calendar actually used), localized as appropriate to the requested language. If the fallback representation uses a different language from that requested, the output string must identify the language actually used, for example by prefixing the string with [Language: Y] (where Y is the language actually used) localized in an implementation-dependent way. If a particular component of the value cannot be output in the requested format, it should be output in the default format for that component.

The $language argument specifies the language to be used for the result string of the function. The value of the argument should be either the empty sequence or a value that would be valid for the xml:lang attribute (see ). Note that this permits the identification of sublanguages based on country codes (from ) as well as identification of dialects and of regions within a country.

If the $language argument is omitted or is set to the empty sequence, or if it is set to an invalid value or a value that the implementation does not recognize, then the processor uses the default language defined in the dynamic context.

The language is used to select the appropriate language-dependent forms of:

names (for example, of months)

numbers expressed as words or as ordinals (twenty, 20th, twentieth)

hour convention (0-23 vs 1-24, 0-11 vs 1-12)

first day of week, first week of year

Where appropriate this choice may also take into account the value of the $place argument, though this should not be used to override the language or any sublanguage that is specified as part of the language argument.

The choice of the names and abbreviations used in any given language for calendar units such as days of the week and months of the year is implementation-defined. For example, one implementation might abbreviate July as Jul while another uses Jly. In German, one implementation might represent Saturday as Samstag while another uses Sonnabend. Implementations may provide mechanisms allowing users to control such choices.

Where ordinal numbers are used, the selection of the correct representation of the ordinal (for example, the grammatical gender) may depend on the component being formatted and on its textual context in the picture string.

The $calendar argument specifies that the dateTime, date, or time supplied in the $value argument must be converted to a value in the specified calendar and then converted to a string using the conventions of that calendar.

The calendar value if present must be a valid EQName (dynamic error: ). If it is a lexical QName then it is expanded into an expanded QName using the statically known namespaces; if it has no prefix then it represents an expanded-QName in no namespace. If the expanded QName is in no namespace, then it must identify a calendar with a designator specified below (dynamic error: ). If the expanded QName is in a namespace then it identifies the calendar in an implementation-defined way.

If the $calendar argument is omitted or is set to the empty sequence then the default calendar defined in the dynamic context is used.

The calendars listed below were known to be in use during the last hundred years. Many other calendars have been used in the past.

This specification does not define any of these calendars, nor the way that they map to the value space of the xs:date datatype in . There may be ambiguities when dates are recorded using different calendars. For example, the start of a new day is not simultaneous in different calendars, and may also vary geographically (for example, based on the time of sunrise or sunset). Translation of dates is therefore more reliable when the time of day is also known, and when the geographic location is known. When translating dates between one calendar and another, the processor may take account of the values of the $place and/or $language arguments, with the $place argument taking precedence.

Information about some of these calendars, and algorithms for converting between them, may be found in .

Designator Calendar
AD Anno Domini (Christian Era)
AH Anno Hegirae (Islamic Era)
AME Mauludi Era (solar years since Muhammad’s birth)
AM Anno Mundi (Jewish Calendar)
AP Anno Persici
AS Aji Saka Era (Java)
BE Buddhist Era
CB Cooch Behar Era
CE Common Era
CL Chinese Lunar Era
CS Chula Sakarat Era
EE Ethiopian Era
FE Fasli Era
ISO ISO 8601 calendar
JE Japanese Calendar
KE Khalsa Era (Sikh calendar)
KY Kali Yuga
ME Malabar Era
MS Monarchic Solar Era
NS Nepal Samwat Era
OS Old Style (Julian Calendar)
RS Rattanakosin (Bangkok) Era
SE Saka Era
SH Solar Hijri (Islamic Era, used in Iran and Afghanistan)
SS Saka Samvat
TE Tripurabda Era
VE Vikrama Era
VS Vikrama Samvat Era

At least one of the above calendars must be supported. It is implementation-defined which calendars are supported.

The ISO 8601 calendar (), which is included in the above list and designated ISO, is very similar to the Gregorian calendar designated AD, but it differs in several ways. The ISO calendar is intended to ensure that date and time formats can be read easily by other software, as well as being legible for human users. The ISO calendar prescribes the use of particular numbering conventions as defined in ISO 8601, rather than allowing these to be localized on a per-language basis. In particular it provides a numeric “week date” format which identifies dates by year, week of the year, and day in the week; in the ISO calendar the days of the week are numbered from 1 (Monday) to 7 (Sunday), and week 1 in any calendar year is the week (from Monday to Sunday) that includes the first Thursday of that year. The numeric values of the components year, month, day, hour, minute, and second are the same in the ISO calendar as the values used in the lexical representation of the date and time as defined in . The era (E component) with this calendar is either a minus sign (for negative years) or a zero-length string (for positive years). For dates before 1 January, AD 1, year numbers in the ISO and AD calendars are off by one from each other: ISO year 0000 is 1 BC, -0001 is 2 BC, etc.

ISO 8601 does not define a numbering for weeks within a month. When the w component is used, the convention to be adopted is that each Monday-to-Sunday week is considered to fall within a particular month if its Thursday occurs in that month; the weeks that fall in a particular month under this definition are numbered starting from 1. Thus, for example, 29 January 2013 falls in week 5 because the Thursday of the week (31 January 2013) is the fifth Thursday in January, and 1 February 2013 is also in week 5 for the same reason.
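The week-within-month convention can be checked with a few lines of Python. This is a sketch of the rule as stated above, with a hypothetical function name; it returns both the month that owns the week and the week number within that month.

```python
from datetime import date, timedelta


def week_in_month(d: date) -> tuple:
    """Return (month, week number) using the Thursday rule described above."""
    # date.weekday() numbers Monday as 0, so Thursday is 3.
    thursday = d + timedelta(days=3 - d.weekday())
    # The n-th Thursday of a month falls on days 7n-6 .. 7n.
    return thursday.month, (thursday.day - 1) // 7 + 1


print(week_in_month(date(2013, 1, 29)))  # week containing Thursday 31 January
print(week_in_month(date(2013, 2, 1)))   # same Monday-to-Sunday week
```

Both calls report week 5 of month 1, matching the example in the text.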

The value space of the date and time datatypes, as defined in XML Schema, is based on absolute points in time. The lexical space of these datatypes defines a representation of these absolute points in time using the proleptic Gregorian calendar, that is, the modern Western calendar extrapolated into the past and the future; but the value space is calendar-neutral. The date formatting functions produce a representation of this absolute point in time, but denoted in a possibly different calendar. So, for example, the date whose lexical representation in XML Schema is 1502-01-11 (the day on which Pope Gregory XIII was born) might be formatted using the Old Style (Julian) calendar as 1 January 1502. This reflects the fact that there was at that time a ten-day difference between the two calendars. It would be incorrect, and would produce incorrect results, to represent this date in an element or attribute of type xs:date as 1502-01-01, even though this might reflect the way the date was recorded in contemporary documents.

When referring to years occurring in antiquity, modern historians generally use a numbering system in which there is no year zero (the year before 1 CE is thus 1 BCE). This is the convention that should be used when the requested calendar is OS (Julian) or AD (Gregorian). When the requested calendar is ISO, however, the conventions of ISO 8601 should be followed: here the year before +0001 is numbered zero. In (version 1.0), the value space for xs:date and xs:dateTime does not include a year zero: however, XSD 1.1 endorses the ISO 8601 convention. This means that the date on which Julius Caesar was assassinated has the ISO 8601 lexical representation -0043-03-13, but will be formatted as 15 March 44 BCE in the Julian calendar or 13 March 44 BCE in the Gregorian calendar (dependent on the chosen localization of the names of months and eras).
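The two worked dates in the preceding paragraphs can be reproduced with the widely used integer formulas for Julian Day Numbers. The sketch below is not part of this specification; the function names are hypothetical, and years are given in ISO 8601 (astronomical) numbering, where year 0 exists.

```python
def gregorian_to_jdn(year: int, month: int, day: int) -> int:
    """Julian Day Number of a proleptic Gregorian date (ISO year numbering)."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return (day + (153 * m + 2) // 5 + 365 * y
            + y // 4 - y // 100 + y // 400 - 32045)


def jdn_to_julian(jdn: int) -> tuple:
    """Julian-calendar (year, month, day) for a Julian Day Number."""
    c = jdn + 32082
    d = (4 * c + 3) // 1461
    e = c - (1461 * d) // 4
    m = (5 * e + 2) // 153
    day = e - (153 * m + 2) // 5 + 1
    month = m + 3 - 12 * (m // 10)
    year = d - 4800 + m // 10
    return year, month, day


def historical_year(iso_year: int) -> tuple:
    """Convert ISO year numbering to the historians' form with no year zero."""
    return (iso_year, "CE") if iso_year > 0 else (1 - iso_year, "BCE")


print(jdn_to_julian(gregorian_to_jdn(1502, 1, 11)))   # Old Style date
year, month, day = jdn_to_julian(gregorian_to_jdn(-43, 3, 13))
print(day, month, historical_year(year))              # Ides of March
```

The first call yields 1 January 1502 in the Julian calendar; the second yields 15 March in ISO year -43, which historical_year reports as 44 BCE.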

The intended use of the $place argument is to identify the place where an event represented by the dateTime, date, or time supplied in the $value argument took place or will take place. If the $place argument is omitted or is set to the empty sequence, then the default place defined in the dynamic context is used. If the value is supplied, and is not the empty sequence, then it should either be a country code or an IANA timezone name. If the value does not take this form, or if its value is not recognized by the implementation, then the default place defined in the dynamic context is used.

Country codes are defined in . Examples are "de" for Germany and "jp" for Japan. Implementations may also allow the use of codes representing subdivisions of a country from ISO 3166-2, or codes representing formerly used names of countries from ISO 3166-3.

IANA timezone names are defined in the IANA timezone database . Examples are "America/New_York" and "Europe/Rome".

This argument is not intended to identify the location of the user for whom the date or time is being formatted; that should be done by means of the $language argument. This information may be used to provide additional information when converting dates between calendars or when deciding how individual components of the date and time are to be formatted. For example, different countries using the Old Style (Julian) calendar started the new year on different days, and some countries used variants of the calendar that were out of synchronization as a result of differences in calculating leap years.

The geographical area identified by a country code is defined by the boundaries as they existed at the time of the date to be formatted, or the present-day boundaries for dates in the future.

If the $place argument is supplied in the form of an IANA timezone name that is recognized by the implementation, then the date or time being formatted is adjusted to the timezone offset applicable in that timezone. For example, if the xs:dateTime value 2010-02-15T12:00:00Z is formatted with the $place argument set to America/New_York, then the output will be as if the value 2010-02-15T07:00:00-05:00 had been supplied. This adjustment takes daylight savings time into account where possible; if the date in question falls during daylight savings time in New York, then it is adjusted to timezone offset -PT4H rather than -PT5H. Adjustment using daylight savings time is only possible where the value includes a date, and where the date is within the range covered by the timezone database.

Examples of date and time formatting

The following examples show a selection of dates and times and the way they might be formatted. These examples assume the use of the Gregorian calendar as the default calendar.

Required Output Expression
2002-12-31 format-date($d, "[Y0001]-[M01]-[D01]")
12-31-2002 format-date($d, "[M]-[D]-[Y]")
31-12-2002 format-date($d, "[D]-[M]-[Y]")
31 XII 2002 format-date($d, "[D1] [MI] [Y]")
31st December, 2002 format-date($d, "[D1o] [MNn], [Y]", "en", (), ())
31 DEC 2002 format-date($d, "[D01] [MN,*-3] [Y0001]", "en", (), ())
December 31, 2002 format-date($d, "[MNn] [D], [Y]", "en", (), ())
31 Dezember, 2002 format-date($d, "[D] [MNn], [Y]", "de", (), ())
Tisdag 31 December 2002 format-date($d, "[FNn] [D] [MNn] [Y]", "sv", (), ())
[2002-12-31] format-date($d, "[[[Y0001]-[M01]-[D01]]]")
Two Thousand and Three format-date($d, "[YWw]", "en", (), ())
einunddreißigste Dezember format-date($d, "[Dwo] [MNn]", "de", (), ())
3:58 PM format-time($t, "[h]:[m01] [PN]", "en", (), ())
3:58:45 pm format-time($t, "[h]:[m01]:[s01] [Pn]", "en", (), ())
3:58:45 PM PDT format-time($t, "[h]:[m01]:[s01] [PN] [ZN,*-3]", "en", (), ())
3:58:45 o'clock PM PDT format-time($t, "[h]:[m01]:[s01] o'clock [PN] [ZN,*-3]", "en", (), ())
15:58 format-time($t, "[H01]:[m01]")
15:58:45.762 format-time($t, "[H01]:[m01]:[s01].[f001]")
15:58:45 GMT+02:00 format-time($t, "[H01]:[m01]:[s01] [z,6-6]", "en", (), ())
15.58 Uhr GMT+2 format-time($t, "[H01]:[m01] Uhr [z]", "de", (), ())
3.58pm on Tuesday, 31st December format-dateTime($dt, "[h].[m01][Pn] on [FNn], [D1o] [MNn]")
12/31/2002 at 15:58:45 format-dateTime($dt, "[M01]/[D01]/[Y0001] at [H01]:[m01]:[s01]")

The following examples use calendars other than the Gregorian calendar.

Description Request Result
Islamic format-date($d, "[D&#x0661;] [Mn] [Y&#x0661;]", "ar", "AH", ()) ٢٦ ﺸﻭّﺍﻝ ١٤٢٣
Jewish (with Western numbering) format-date($d, "[D] [Mn] [Y]", "he", "AM", ()) ‏26 טבת 5763
Jewish (with traditional numbering) format-date($d, "[D&#x05D0;t] [Mn] [Y&#x05D0;t]", "he", "AM", ()) כ״ו טבת תשס״ג
Julian (Old Style) format-date($d, "[D] [MNn] [Y]", "en", "OS", ()) 18 December 2002
Thai format-date($d, "[D&#x0E51;] [Mn] [Y&#x0E51;]", "th", "BE", ()) ๓๑ ธันวาคม ๒๕๔๕

Parsing dates and times

A function is provided to parse dates and times expressed using syntax that is commonly encountered in internet protocols.

Processing QNames and NOTATIONS

Functions to create a QName

In XPath 4.0, statically-known QNames can be expressed using a QName literal such as #xml:space. Where the QName is not known statically, the xs:QName constructor function can be used.

In addition to the xs:QName constructor function, QName values can be constructed by combining a namespace URI, prefix, and local name, or by resolving a lexical QName against the in-scope namespaces of an element node. This section defines functions that perform these operations. Leading and trailing whitespace, if present, is stripped from string arguments before the result is constructed.

Functions and operators on QNames

This section specifies functions on QNames as defined in .

Processing NOTATIONs

There are no functions designed explicitly to process xs:NOTATION items.

However, some generic functions such as fn:atomic-equal and fn:compare can be used on xs:NOTATION items.

Processing binary values

Binary data is represented using the data types xs:hexBinary and xs:base64Binary. Both types have the same value space: a sequence of octets, which can be considered as integers in the range 0 to 255.

The coercion rules of XPath 4.0 ensure that the two types are interoperable; a function that declares an argument of type xs:hexBinary will always accept a value of type xs:base64Binary, and vice versa.
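The shared value space can be pictured with Python's standard library: the same sequence of octets has one lexical form as hexadecimal and another as base64. This is only an analogy for the two XSD types, not an implementation of them.

```python
import base64

# One value: a sequence of octets, each an integer in the range 0 to 255.
octets = bytes([0, 127, 255])

# Two lexical representations of that same value.
hex_form = octets.hex().upper()
base64_form = base64.b64encode(octets).decode("ascii")

print(hex_form, base64_form)

# Decoding either representation recovers identical octets.
print(bytes.fromhex(hex_form) == base64.b64decode(base64_form))
```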

There are no functions defined in this document provided exclusively for processing binary data. A number of generic functions are available:

Functions such as fn:atomic-equal, fn:deep-equal, and fn:index-of compare binary values for equality.

Functions such as fn:compare, fn:sort-by, and fn:max compare binary values for ordering.

The function fn:string creates a string representation of a binary value.

The constructor functions xs:string, xs:hexBinary, and xs:base64Binary can be used to convert binary values to and from their string representation.

The function fn:unparsed-binary reads an external resource and returns its content as an xs:base64Binary value.

A library of functions specific to processing of binary data can be found in .

Processing nodes

Accessors

Accessors and their semantics are described in . Some of these accessors are exposed to the user through the functions described below.

Each of these functions has an arity-zero signature which is equivalent to the arity-one form, with the context value supplied as the implicit first argument. In addition, each of the arity-one functions accepts the empty sequence as the argument, in which case it generally delivers the empty sequence as the result: the exception is fn:string, which delivers a zero-length string.

Function Accessor Accepts Returns
fn:node-name node-name node (optional) xs:QName (optional)
fn:nilled nilled node (optional) xs:boolean (optional)
fn:string string-value item (optional) xs:string
fn:data typed-value zero or more items a sequence of atomic items
fn:base-uri base-uri node (optional) xs:anyURI (optional)
fn:document-uri document-uri node (optional) xs:anyURI (optional)

Other properties of nodes

This section specifies further functions that return properties of nodes. Nodes are formally defined in .

Functions on sequences of nodes

This section specifies functions on sequences of nodes.

Identifying nodes

This section defines a number of functions used to find elements by ID or IDREF value, or to generate identifiers.

Processing function items

The functions included in this section operate on function items, that is, values referring to a function.

Functions that accept functions among their arguments, or that return functions in their result, are described in this specification as higher-order functions. Some host languages may exclude higher-order functions from the set of functions that they support, or may include such functions in an optional conformance feature.

Some functions, such as fn:parse-json, allow the option of supplying a callback function, for example to define exception behavior. Where this is not essential to the use of the function, the function has not been classified as higher-order for this purpose; in applications where function items cannot be created, these particular options will not be available.

Processing maps

Maps were introduced as a new datatype in XDM 3.1. This section describes functions that operate on maps.

A map is a kind of item.

A map consists of a sequence of entries, also known as key-value pairs. Each entry comprises a key which is an arbitrary atomic item, and an arbitrary sequence called the associated value.

Within a map, no two entries have the same key. Two atomic items K1 and K2 are the same key for this purpose if the function call fn:atomic-equal($K1, $K2) returns true.

It is not necessary that all the keys in a map should be of the same type (for example, they can include a mixture of integers and strings).

Maps are immutable, and have no identity separate from their content. For example, the map:remove function returns a map that differs from the supplied map by the omission (typically) of one entry, but the supplied map is not changed by the operation. Two calls on map:remove with the same arguments return maps that are indistinguishable from each other; there is no way of asking whether these are “the same map”.

A map can also be viewed as a function from keys to associated values. To achieve this, a map is also a function item. The function corresponding to the map has the signature function($key as xs:anyAtomicType) as item()*. Calling the function has the same effect as calling the map:get function: the expression $map($key) returns the same result as map:get($map, $key). For example, if $books-by-isbn is a map whose keys are ISBNs and whose associated values are book elements, then the expression $books-by-isbn("0470192747") returns the book element with the given ISBN. The fact that a map is a function item allows it to be passed as an argument to higher-order functions that expect a function item as one of their arguments.
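The following Python sketch models the properties just described: a map that is immutable, has no identity beyond its content, and can be called as a function of its key. The class XMap and its method names are hypothetical, sequences are modelled as tuples, and Python equality stands in for fn:atomic-equal, which it only approximates.

```python
class XMap:
    """Toy model of an XDM map: immutable, value-compared, callable."""

    def __init__(self, entries=()):
        self._entries = dict(entries)

    def __call__(self, key):
        # Calling the map behaves like map:get; () models the empty sequence.
        return self._entries.get(key, ())

    def remove(self, key):
        # Returns a new map; the supplied map is not changed.
        return XMap((k, v) for k, v in self._entries.items() if k != key)

    def __eq__(self, other):
        return isinstance(other, XMap) and self._entries == other._entries


books_by_isbn = XMap({"0470192747": ("<book>...</book>",)})

print(books_by_isbn("0470192747"))
print(books_by_isbn.remove("0470192747")("0470192747"))
# Because the map is callable, it can be passed to a higher-order function.
print(list(map(books_by_isbn, ["0470192747", "unknown"])))
```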

Ordering of Maps

Ordered maps are introduced.

In 4.0, the entries in a map are ordered. The order of the entries in a map is referred to as its entry order.

The entry order of a map is defined by the function or expression that creates the map, and affects the result of functions and expressions that process multiple entries in a map, for example the function map:keys and the expression for key $k value $v in $map return EXPR. The ordering is also reflected in the output of the json and adaptive serialization methods.

Order is maintained in maps for two main reasons:

To make the representation of a map (such as its JSON serialization) easier for human readers to process: for example when visually inspecting the result of a JSON transformation;

To make the result of different implementations interoperable.

Although it is possible to use the ordering of a map to capture semantic information, the design of functions such as fn:deep-equal discourages this: maps are compared with each other, and matched against map types, without regard to the order of entries.
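Python dictionaries happen to behave in a comparable way and can serve as an informal analogy: they retain insertion order, which shows up in iteration and in JSON output, yet equality comparison ignores that order.

```python
import json

first = {"x": 1, "y": 2}
second = {"y": 2, "x": 1}

# Order affects anything that processes multiple entries...
print(list(first), list(second))
print(json.dumps(first), json.dumps(second))

# ...but not comparison of the two maps.
print(first == second)
```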

Composing and Decomposing Maps

It is often useful to decompose a map into a sequence of entries, or key-value pairs (in which the key is an atomic item and the value is an arbitrary sequence). Subsequently it may be necessary to reconstruct a map from these components, typically after modification.

There are two conventional ways of representing a map as a sequence of key-value pairs, each with its own advantages and disadvantages. These are described below:

A map can be represented as a sequence of single-entry maps.

A single-entry map is a map containing a single entry.

It is possible to decompose any map into a sequence of single-entry maps, and to construct a map from a sequence of single-entry maps.

For example the map { "x": 1, "y": 2 } can be decomposed to the sequence ({ "x": 1 }, { "y": 2 }).

A map can be represented as a sequence of JNodes.

A JNode holds the map key in its ·selector· property and the corresponding value in its ·content· property.

The following table summarizes the way in which these two representations can be used to compose and decompose maps:

Operation Single-Entry Maps JNodes
Decompose a map map:entries($map) $map/child::*
Compose a map map:merge($entries) map:build($jnodes, jnode-selector#1, jnode-content#1)
Create a single entry map:entry($key, $value) {$key : $value}/child::*
Extract the key part of a single entry map:keys($entry) jnode-selector($jnode)
Extract the value part of a single entry map:items($entry) jnode-content($jnode)

It is also possible to decompose a map using:

The function map:for-each

The expression for key $k value $v in $map return ....
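The single-entry-map representation can be modelled informally in Python, using dictionaries for maps. The helper names below are hypothetical; in particular, keeping the first value for a duplicate key is a choice made for this sketch and is not a statement about the options of map:merge.

```python
def entries(m: dict) -> list:
    """Decompose a map into a sequence of single-entry maps."""
    return [{key: value} for key, value in m.items()]


def merge(single_entry_maps) -> dict:
    """Compose a map from a sequence of single-entry maps."""
    result = {}
    for entry in single_entry_maps:
        for key, value in entry.items():
            result.setdefault(key, value)   # sketch-only duplicate handling
    return result


original = {"x": 1, "y": 2}
parts = entries(original)
print(parts)
print(merge(parts) == original)
```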

Reordering the entries in a map

The examples below show several ways of constructing a map with the same entries as an input map, but with the entries sorted by key.

Using map:entries and map:merge:

map:entries($map) => sort-by({'key': map:keys#1}) => map:merge()

Using JNodes:

$map/* => sort-by({'key': jnode-selector#1}) => map:build(jnode-selector#1, jnode-content#1)

Using map:for-each:

map:merge( map:for-each($map, map:entry#2) => sort-by({'key': map:keys#1}) )

Using an XQuery FLWOR expression:

map:merge( for key $k value $v in $map order by $k return {$k : $v} )
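
For comparison, the same decompose, sort, and recompose pattern looks like this in Python, where a dictionary again stands in for a map. The function name is hypothetical, and the sketch assumes the keys are mutually comparable.

```python
def sorted_by_key(m: dict) -> dict:
    """Return a new map with the same entries, ordered by key."""
    pairs = list(m.items())                       # decompose into key-value pairs
    pairs.sort(key=lambda pair: pair[0])          # sort the pairs by key
    return dict(pairs)                            # recompose in the new order


unsorted = {"pear": 3, "apple": 1, "fig": 2}
print(sorted_by_key(unsorted))
print(sorted_by_key(unsorted) == unsorted)        # same entries, different order
```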

Formal specification of maps

The XDM data model () defines three primitive operations on maps:

dm:empty-map constructs the empty map.

dm:map-put adds or replaces an entry in a map.

dm:iterate-map applies a supplied function to every entry in a map.

The functions in this section are all specified by means of equivalent expressions that either call these primitives directly, or invoke other functions that rely on these primitives. The specifications avoid relying on XPath language constructs that manipulate maps, such as map constructor syntax, lookup expressions, or FLWOR expressions. This is done to allow these language constructs to be specified by reference to this function library, without risk of circularity.

There is one exception to this rule: for convenience, the notation {} is used to represent the empty map, in preference to a call on dm:empty-map().

The formal equivalents are not intended to provide a realistic way of implementing the functions (in particular, any real implementation might be expected to implement map:get and map:put much more efficiently). They do, however, provide a framework that allows the correctness of a practical implementation to be verified.
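To give a feel for this style of definition, the sketch below models the three primitives in Python and derives lookup, key listing, and removal from them alone. The behaviour assumed for the primitives (replacement keeps an entry's position, new entries go last, iteration concatenates the results) is a reading of the descriptions above, not the normative data model text; maps are modelled as tuples of (key, value) pairs and sequences as tuples.

```python
def empty_map():
    """Models dm:empty-map."""
    return ()


def map_put(m, key, value):
    """Models dm:map-put (assumed: replace in place, otherwise append)."""
    if any(k == key for k, _ in m):
        return tuple((k, value if k == key else v) for k, v in m)
    return m + ((key, value),)


def iterate_map(m, action):
    """Models dm:iterate-map (assumed: concatenates the action's results)."""
    result = ()
    for k, v in m:
        result += tuple(action(k, v))
    return result


# Derived operations, written only in terms of the primitives.
def map_get(m, key):
    return iterate_map(m, lambda k, v: v if k == key else ())


def map_keys(m):
    return iterate_map(m, lambda k, v: (k,))


def map_remove(m, key):
    kept = iterate_map(m, lambda k, v: () if k == key else ((k, v),))
    result = empty_map()
    for k, v in kept:
        result = map_put(result, k, v)
    return result


m = map_put(map_put(empty_map(), "x", (1,)), "y", (2, 3))
print(map_get(m, "y"), map_keys(map_remove(m, "x")))
```

As in the specification, the linear scans here are chosen for clarity of definition rather than efficiency.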

TODO: as yet there is no formal equivalent for map:find().

Functions that operate on maps

The functions defined in this section use a conventional namespace prefix map, which is assumed to be bound to the namespace URI http://www.w3.org/2005/xpath-functions/map.

The function call map:get($map, $key) can be used to retrieve the value associated with a given key.

There is no operation to atomize a map or convert it to a string. The function fn:serialize can in some cases be used to produce a JSON representation of a map.

Note that when the required type of an argument to a function such as map:build is a map type, then the coercion rules ensure that a JNode can be supplied in the function call: if the ·content· property of the JNode is a map, then the map is automatically extracted as if by the jnode-content function.

Converting elements to maps

A new function fn:element-to-map is provided for converting XDM trees to maps suitable for serialization as JSON. Unlike the fn:xml-to-json function retained from 3.1, this can handle arbitrary XML as input.

The fn:element-to-map function converts a tree rooted at an XML element node to a corresponding tree of maps, in a form suitable for serialization as JSON. In effect it provides a mechanism for converting XML to JSON.

This section describes the mappings used by this function.

This mapping is designed with three objectives:

It should be possible to represent any XML element as a map suitable for JSON serialization.

The resulting JSON should be intuitive and easy to use.

The JSON should be consistent and stable: small variations in the input should not result in large variations in the output.

Achieving all three objectives requires design compromises. It also requires sacrificing some other desiderata. In consequence:

The conversion is not lossless (see for details).

The conversion is not streamable.

The results are not necessarily compatible with those produced by other popular libraries.

The requirement for consistency and stability is particularly challenging. An element such as <name>John</name> maps naturally to the map { "name": "John" }; but adding an attribute (so it becomes <name role="first">John</name>) then requires an incompatible change in the JSON representation. The format could be made extensible by converting <name>John</name> to { "name": {"#content":"John"} } and <name role="first">John</name> to { "name": { "@role":"first", "#content":"John" } }, but this imposes unwanted complexity on the simplest cases. The solution adopted is threefold:

It is possible to analyze a corpus of XML documents to develop a conversion plan, which can then be applied consistently to individual input documents, whether or not these documents were present in the corpus. The conversion plan can be serialized and subsequently reused, so that it can be applied to input documents that might not have existed at the time the conversion plan was formulated.

Alternatively, the function can make use of schema information where available, so it considers not just the structure of an individual element instance, but the rules governing the element type.

It is possible to override the choices made by the system, and explicitly specify the format to be used for elements or attributes having a given name.

Element Layouts

The key challenge in mapping XML to JSON is in deciding how element content is to be represented. To illustrate the variety of mappings that are possible, the following table lists some examples of typical XML elements and their JSON equivalents:

XML element JSON equivalent
]]>
2023-05-18]]>
]]>
Warning!]]>
5 10 ]]>
]]>

This specification defines a number of named mappings, called layouts, and allows the layout for a particular element to be selected in a number of different ways:

The layout to be used for specific elements can be explicitly selected by supplying a conversion plan as input to the fn:element-to-map function.

It is possible to construct a conversion plan by analyzing a corpus of documents using the fn:element-to-map-plan function.

It is also possible to construct a conversion plan manually, or to modify the conversion plan produced by the fn:element-to-map-plan function before use.

In the absence of an explicit conversion plan, if the data has been schema-validated, the layout is inferred from the content model for the element type as defined in the schema.

When the data is untyped and no specific layout has been selected, a default layout is chosen based on the properties of the individual element instance.

The advantage of using schema information is that it gives a consistent representation for all elements of a particular type, even if they vary in content: for example if an element type allows optional attributes, the JSON representation will be consistent between those elements that have attributes and those without. In the absence of a schema, consistency can be achieved by supplying a conversion plan that applies uniformly to multiple documents.

The different layouts available are defined in the following sections. For each layout there is a table showing:

Layout name: the name to be used to select this layout in a conversion plan supplied to the fn:element-to-map function.

Usage: the situations for which this layout is designed.

Example input: an example of a typical element for which this layout is appropriate, shown as serialized XML.

Example output: the result of converting this example, shown as serialized JSON. The result is always shown as a singleton map, which is how it will appear when the layout is used for the top-level elements supplied in the $elements argument; when used to convert a descendant element, the corresponding key-value pair may appear as part of a larger map, depending on the layout chosen for its parent element.

The fn:element-to-map function produces a map as its result, but it is convenient to illustrate the form of the map by showing the effect of serializing the map as JSON.

Mapping rules: The rules for mapping the XML element to an XDM map representation.

Mapping for nilled elements: special rules that apply to an element having the attribute xsi:nil="true". These rules only apply if the element has been schema-validated.

Errors: situations where the layout cannot be used, and where attempting to use it will fail. For example, the empty layout cannot be used for an element that is not empty. In such a situation the recovery action is as follows, in order:

Attributes are dropped, and if this is sufficient to enable the layout to be used, then the element is converted without its attributes.

If the type of an element or attribute in the conversion plan is given as boolean or numeric, but the actual value of the element or attribute is not castable to xs:boolean or xs:numeric respectively, then the node is output ignoring the type property, that is, as an instance of xs:untypedAtomic.

If the conversion plan supplies a fallback layout (an entry with key "*"), then the fallback layout is used.

The element-to-map function fails with a dynamic error.

The rules for selecting the layout for a particular element are given later, in .

Note that it is possible to request any layout for any element. If an inappropriate layout is chosen for a particular element (for example, empty layout for an element that is not empty), then the rules for that layout specify what happens. It is possible to specify a fallback layout for use when the selected layout fails: this will typically be a layout such as xml or mixed that can handle any element.

Acknowledgements for this categorization: see . Although Goessner's categories have been used, the detailed mappings vary from his proposal.

Layout: Empty Content
Layout name

empty

Usage

Intended for XML elements that have no content and no attributes.

Example input <hr/>
Example output { "hr": "" }
Mapping rules

The content is represented by the zero-length xs:string value "".

Mapping for nilled elements

The content is represented by the QName fn:QName("http://www.w3.org/2005/xpath-functions", "null"), which the JSON serialization method serializes as null. For example the result of converting the element <hr xsi:nil="true"/> becomes { "hr": #fn:null }, which is serialized in JSON as { "hr": null }.

Errors

Attributes are discarded, along with child comment nodes, processing instructions, and whitespace-only text nodes.

If any other child nodes are present, this layout fails.

Layout: Empty Content with Attributes
Layout name

empty-plus

Usage

Intended for XML elements that have no content but may have attributes.

Example input <hr class="ccc" id="zzz"/>
Example output { "hr": { "@class": "ccc", "@id": "zzz" } }
Mapping rules

The content is represented by a map containing one entry for each attribute in the XML element; if there are no attributes, the content is represented as the empty map. The rules for attribute names are defined in , and the rules for attribute content in .

Mapping for nilled elements

An additional key-value pair "#content": #fn:null is added, which serializes in JSON as "#content": null. For example <hr id="x" xsi:nil="true"/> becomes { "hr": { "@id": "x", "#content": #fn:null } }.

Errors

Child comment nodes, processing instructions, and whitespace-only text nodes are discarded.

If any other child nodes are present, this layout fails.

Layout: Simple Content
Layout name

simple

Usage

Intended for XML elements that have simple content and no attributes.

Example input <date>2023-05-30</date>
Example output { "date": "2023-05-30" }
Mapping rules

The element is atomized and the resulting atomized value is handled as described in . If atomization fails, the element is treated as if it were untyped.

If the element is untyped, the atomized value will always appear in the result as an instance of xs:untypedAtomic.

Mapping for nilled elements

The content is represented by the value #fn:null, which is serialized as the JSON value null. For example, <name xsi:nil="true"/> becomes { "name": #fn:null }.

Errors

Attributes are discarded, along with child comment nodes and processing instructions; whitespace is retained.

If any child elements are present, this layout fails.

Layout: Simple Content with Attributes
Layout name

simple-plus

Usage

Intended for XML elements that have simple content and (optionally) attributes.

Example input <price currency="USD">23.50</price>
Example output { "price": { "@currency": "USD", "#content": 23.50 } }
Mapping rules

The element is represented by a map containing one entry for each of its attributes, plus an entry with key "#content" representing the result of atomizing the element. The atomized value is handled as described in .

The rules for attribute names are defined in , and the rules for attribute content in .

If the element is untyped, the value of each attribute, and of "#content", will always be an instance of xs:untypedAtomic.

If the element has been schema-validated, the types of the items in the atomized value are retained.

Mapping for nilled elements

The "#content" property is represented by the value #fn:null, which is serialized in JSON as null.

Errors

Child comment nodes and processing instructions are discarded; whitespace is retained.

If any child elements are present, this layout fails.
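The mapping rules for this layout can be sketched in Python as follows. This is only an illustration, not the normative algorithm: it assumes an untyped element in no namespace parsed with the standard xml.etree.ElementTree module, the function name simple_plus is hypothetical, and all values remain strings (standing in for xs:untypedAtomic).

```python
import xml.etree.ElementTree as ET


def simple_plus(elem):
    """Sketch of the simple-plus layout for an untyped, no-namespace element."""
    if len(elem) > 0:  # len() counts child elements
        raise ValueError("simple-plus layout fails: child elements present")
    # One "@"-prefixed entry per attribute, then the text content.
    content = {"@" + name: value for name, value in elem.attrib.items()}
    # The default parser drops comments and processing instructions,
    # so itertext() yields only the text content; whitespace is retained.
    content["#content"] = "".join(elem.itertext())
    return {elem.tag: content}


price = ET.fromstring('<price currency="USD">23.50</price>')
result = simple_plus(price)
```

In this sketch the content stays the string "23.50"; producing a JSON number would additionally require a type of numeric, supplied by a conversion plan or a schema.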

Layout: Simple List
Layout name

list

Usage

Intended for XML elements that act as wrappers for a list of child elements, all having the same element name; neither the element itself nor any of its children should have any attributes. The expected child element name may be present in the conversion plan. The names of the child elements are not retained in the output.

Example input (1) <dates> <date>2023-03-20</date> <date>2023-04-12</date> <date>2023-05-30</date> </dates>
Example output (1) { "dates": [ "2023-03-20", "2023-04-12", "2023-05-30" ] }
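Example input (2) <dates> <date><year>2023</year><month>03</month><day>20</day></date> <date><year>2023</year><month>04</month><day>12</day></date> <date><year>2023</year><month>05</month><day>30</day></date> </dates>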
Example output (2) { "dates": [ { "year": "2023", "month": "03", "day": "20" }, { "year": "2023", "month": "04", "day": "12" }, { "year": "2023", "month": "05", "day": "30" } ] }
Mapping rules

The content is represented by an array, whose members correspond one-to-one with the children of the element. Each child element is converted to a map as if it were a top-level element: the resulting map contains a single key-value pair. The key part is discarded, and the value part is used as a member in the resulting array.

If there are no children then the content is represented by the empty array.

Mapping for nilled elements

The array is replaced by the value #fn:null, which serializes to the JSON value null (for example { "dates": #fn:null }).

Errors

Attributes are discarded for both the element itself and its children. Comments, processing instructions, and whitespace text nodes in the content are discarded.

This layout fails if any child element is present with a name that differs from the expected child element name, or if there are non-whitespace text node children.
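As an informal illustration of the list layout, the following Python sketch converts each child as if it were a top-level element and keeps only the value part. It assumes untyped, no-namespace input parsed with xml.etree.ElementTree; the names list_layout and convert_simple are hypothetical, and the child converter shown handles only simple content.

```python
import xml.etree.ElementTree as ET


def convert_simple(elem):
    """Hypothetical child converter: simple layout, attributes dropped."""
    return {elem.tag: "".join(elem.itertext())}


def list_layout(elem, child_name, convert_child=convert_simple):
    """Sketch of the list layout; attributes of elem are discarded."""
    if (elem.text or "").strip():
        raise ValueError("list layout fails: non-whitespace text child")
    members = []
    for child in elem:
        if child.tag != child_name:
            raise ValueError(f"list layout fails: unexpected child {child.tag}")
        if (child.tail or "").strip():
            raise ValueError("list layout fails: non-whitespace text child")
        # Convert as if top-level, then keep only the value part.
        (value,) = convert_child(child).values()
        members.append(value)
    return {elem.tag: members}


dates = ET.fromstring(
    "<dates><date>2023-03-20</date> <date>2023-04-12</date></dates>"
)
result = list_layout(dates, "date")
```

Passing a different convert_child (for example one implementing the record layout) would yield an array of maps rather than an array of strings.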

Layout: List with Attributes
Layout name

list-plus

Usage

Intended for XML elements that act as wrappers for a list of child elements, all having the same element name. The wrapper element may have attributes, but the children should not. The name of the child elements is retained in the output.

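Example input (1) <dates id="x"> <date>2023-03-20</date> <date>2023-04-12</date> <date>2023-05-30</date> </dates>
Example output (1) { "dates": { "@id": "x", "date": ["2023-03-20", "2023-04-12", "2023-05-30"] } }
Example input (2) <dates id="x"> <date><year>2023</year><month>03</month><day>20</day></date> <date><year>2023</year><month>04</month><day>12</day></date> <date><year>2023</year><month>05</month><day>30</day></date> </dates>
Example output (2) { "dates": { "@id": "x", "date": [ { "year": "2023", "month": "03", "day": "20" }, { "year": "2023", "month": "04", "day": "12" }, { "year": "2023", "month": "05", "day": "30" } ] } }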
Mapping rules

The content is represented by a map containing one entry for each attribute in the XML element, plus a property named after the child elements (the content property), whose value is an array containing the results of formatting the content in the same way as the list layout.

If there are no children and the element is untyped (which can occur when this layout is chosen explicitly via the options to fn:element-to-map) then the content property is omitted (since the child element name is unknown). But if the element is typed, then the content property is included and set to the empty array.

Mapping for nilled elements

The array-valued entry in the result is replaced by the entry "#content": #fn:null, which serializes to the JSON value null. For example the element <dates id="x" xsi:nil="true"/> becomes {"dates": { "@id": "x", "#content": #fn:null } }.

Errors

Any attributes on the element's children are discarded. Comments, processing instructions, and whitespace text nodes in the content are discarded.

This layout fails if any child element is present with a name that differs from the expected child element name, or if there are non-whitespace text node children.

Layout: Record
Layout name

record

Usage

Intended primarily for XML elements that contain multiple child elements, with different names, where the order of the child elements is not significant. Also used for elements whose content is a single element node child. The element may or may not have attributes.

Example input (1) <employee id="x"> <date-of-birth>1984-03-20</date-of-birth> <location>Germany</location> <position>Janitor</position> </employee>
Example output (1) { "employee": { "@id": "x", "date-of-birth": "1984-03-20", "location": "Germany", "position": "Janitor" } }
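Example input (2) <employee id="x"> <date-of-birth>1984-03-20</date-of-birth> <location>Germany</location> <position>Janitor</position> <position>Gardener</position> </employee>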
Example output (2) { "employee": { "@id": "x", "date-of-birth": "1984-03-20", "location": "Germany", "position": [ "Janitor", "Gardener" ] } }
Mapping rules

The content is represented by a map containing one entry for each attribute in the XML element, plus one entry for each child element, whose value is formatted according to the rules for that element.

If two or more child elements have the same name, or names that are represented by the same string (taking into account the chosen name-format option), then they are combined into a single entry containing all the corresponding values as members of an array. For example, if there are two children <author>Mills</author> and <author>Boon</author>, they are combined into a single entry "author": ["Mills", "Boon"].

The entry order of the resulting map first contains entries derived from attributes (in unpredictable order), then entries derived from child elements, in order of first appearance.

Mapping for nilled elements

Alongside any attributes, the value includes the additional entry "#content": #fn:null, which will be serialized in JSON as "#content": null.

Errors

Although this layout is intended primarily for elements whose children are unordered and uniquely named, it is also viable to use it in cases where elements can repeat, so long as order relative to other elements is not significant.

Comments, processing instructions, and whitespace text nodes in the content are discarded.

This layout fails if there are non-whitespace text node children.
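The grouping of like-named children can be sketched in Python as follows. This is an informal illustration under stated assumptions: untyped, no-namespace input parsed with xml.etree.ElementTree, a hypothetical child converter that handles only simple content, and Python's insertion-ordered dict standing in for the ordered XDM map.

```python
import xml.etree.ElementTree as ET


def convert_simple(elem):
    """Hypothetical child converter: simple layout, attributes dropped."""
    return {elem.tag: "".join(elem.itertext())}


def record_layout(elem, convert_child=convert_simple):
    """Sketch of the record layout: attributes first, then child entries."""
    texts = [elem.text] + [child.tail for child in elem]
    if any((t or "").strip() for t in texts):
        raise ValueError("record layout fails: non-whitespace text child")
    content = {"@" + name: value for name, value in elem.attrib.items()}
    occurrences = {}
    for child in elem:
        ((key, value),) = convert_child(child).items()
        occurrences[key] = occurrences.get(key, 0) + 1
        if occurrences[key] == 1:
            content[key] = value  # first appearance fixes the position
        elif occurrences[key] == 2:
            content[key] = [content[key], value]  # promote to an array
        else:
            content[key].append(value)
    return {elem.tag: content}


employee = ET.fromstring(
    '<employee id="x"><date-of-birth>1984-03-20</date-of-birth>'
    "<location>Germany</location>"
    "<position>Janitor</position><position>Gardener</position></employee>"
)
result = record_layout(employee)
```

Counting occurrences, rather than testing whether the stored value is already a list, avoids confusing a repeated child with a single child whose own converted value happens to be an array.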

Layout: Sequence
Layout name

sequence

Usage

Intended for XML elements that contain a sequence of element node children, whose order is significant. The element may or may not have attributes.

Example input <section id="x"> <head>Introduction</head> <p>Lorem ipsum.</p> <p>Dolor sit amet.</p> </section>
Example output { "section": [ { "@id": "x" }, { "head": "Introduction" }, { "p": "Lorem ipsum." }, { "p": "Dolor sit amet." } ] }
Mapping rules

The mapping rules are identical to the rules for the mixed layout (see ) except that whitespace-only text nodes are discarded.

Mapping for nilled elements

A nilled element is indicated by including an additional map { "#content" : #fn:null} in the array, after any attributes.

Errors

This layout fails if there are non-whitespace text node children.

Layout: Mixed
Layout name

mixed

Usage

Intended for XML elements that contain mixed content (that is, elements that contain both child elements and child text nodes, intermingled). The element may or may not have attributes.

Example input <para id="x">This is a <i>fine</i> mess!</para>
Example output { "para": [ { "@id": "x" }, "This is a ", { "i": "fine" }, " mess!" ] }
Mapping rules

The content is represented by an XDM array containing one entry for each attribute in the XML element, and one entry for each child node, in order.

Each attribute node is represented within this array by a single-entry map: the rules for attribute names are defined in , and the rules for attribute content in .

Child nodes are represented within the array as follows:

A text node child is represented as an atomic item of type xs:untypedAtomic.

An element node child is represented as a map containing a single entry, with the key representing the element name and the value representing the element's content, formatted according to the chosen layout for that element.

A comment node is represented as a map containing a single entry whose key is the string "#comment", and whose corresponding value is an atomic item of type xs:string containing the text of the comment.

A processing instruction node is represented as a map containing a single entry whose key is the string "#processing-instruction" and whose value is a map with two entries: the first has the key "#target" with the value being the name of the processing instruction as an atomic item of type xs:NCName; the second has the key "#data" with the value being an atomic item of type xs:string containing the string value of the processing instruction node.

Whitespace text nodes are retained.

Mapping for nilled elements

A nilled element is indicated by including an additional map { "#content" : #fn:null} in the array, after any attributes. For example, <para id="p2" xsi:nil="true"/> becomes {"para": [ { "@id": "p2" }, { "#content": #fn:null } ] }. In JSON the value #fn:null is serialized as null.

Errors

All children are retained, including comments, processing instructions, and text nodes, whether or not they are whitespace-only.

This layout never fails.
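The following Python sketch illustrates the shape of the array produced by the mixed layout. It is not a complete implementation: it assumes untyped, no-namespace input parsed with xml.etree.ElementTree (whose default parser drops comments and processing instructions, so the "#comment" and "#processing-instruction" entries are not shown), and the helper convert_child is a hypothetical stand-in for layout selection on child elements.

```python
import xml.etree.ElementTree as ET


def mixed_layout(elem):
    """Sketch of the mixed layout: attributes first, then children in order."""
    members = [{"@" + name: value} for name, value in elem.attrib.items()]
    if elem.text:
        members.append(elem.text)  # leading text node, whitespace retained
    for child in elem:
        members.append(convert_child(child))
        if child.tail:
            members.append(child.tail)  # text node following the child
    return {elem.tag: members}


def convert_child(child):
    """Hypothetical layout choice: simple for plain leaves, mixed otherwise."""
    if len(child) == 0 and not child.attrib:
        return {child.tag: child.text or ""}
    return mixed_layout(child)


para = ET.fromstring('<para id="x">This is a <i>fine</i> mess!</para>')
result = mixed_layout(para)
```

Because whitespace text nodes are retained, the text following the i element keeps its leading space.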

Layout: Serialized XML

Serialized layout allows an element node to be represented as lexical XML, contained within a map.

Layout name

xml

Usage

This layout is useful when the input contains a mix of structured data and marked-up textual content. It allows the textual content to be output as serialized XML. It is also used as a fallback representation when the selected element layout is inappropriate for a particular element.

Example input That was awesome

]]>
Example output That was awesome

" }]]>
Mapping rules

The element node is serialized as if by the fn:serialize function, and the resulting content is output as an atomic item of type xs:string.

The serialization parameter method is set to "xml".

The serialization parameter indent is set to false.

The serialization parameter omit-xml-declaration is set to true.

Other serialization parameters take their default values.

The outermost element name will typically be repeated, for example "p": "<p>Lorem ipsum</p>".

Mapping for nilled elements

A nilled element is represented using its normal XML serialization, that is, the output serialization includes the attribute xsi:nil="true", together with a declaration of the xsi namespace prefix.

Errors

This layout never fails.
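A rough Python analogue of this layout is shown below. It uses ET.tostring from xml.etree.ElementTree in place of fn:serialize, so details of the serialized form (for example namespace prefixes) may differ from what a conforming processor produces; the function name xml_layout is hypothetical.

```python
import copy
import xml.etree.ElementTree as ET


def xml_layout(elem):
    """Sketch of the xml layout: the element as a string of lexical XML."""
    # ET.tostring would also emit the text that follows the element
    # (its tail), so serialize a shallow copy with the tail removed.
    clone = copy.copy(elem)
    clone.tail = None
    # encoding="unicode" returns a str and omits the XML declaration;
    # no indentation is added.
    return {elem.tag: ET.tostring(clone, encoding="unicode")}


doc = ET.fromstring("<doc><p>Lorem <i>ipsum</i></p> trailing text</doc>")
result = xml_layout(doc[0])
```

The outermost element name appears twice in the result: once as the key and once inside the serialized string.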

Creating a conversion plan

It is possible to create a conversion plan by analyzing a collection of sample input documents. The function fn:element-to-map-plan is supplied with a collection of nodes (which will normally be element or document nodes), and it examines all the elements within the trees rooted at these nodes, looking for commonalities among like-named elements.

The output of this function (the conversion plan) holds information about how elements and attributes (identified by name) should be converted.

For elements, the information is primarily a mapping from element names (xs:QName instances) to layout names. In some cases additional information beyond the layout name is also included. The conversion plan is represented as an XDM map, whose structure is defined in this specification. A conversion plan can be constructed directly, or the plan produced by calling fn:element-to-map-plan can be modified before use. The plan can be serialized using the JSON output method and reloaded so that the same plan is used whenever a query or stylesheet is executed.

The fn:element-to-map-plan function selects a layout for a given element name N by applying the following rules:

Let $EE be the set of all elements named N, specifically $input/descendant-or-self::*[node-name(.) eq N].

If empty($EE/(* | text())) (that is, if there are no child elements or text nodes) then:

If empty($EE/@*) (that is, if there are no attributes), then the layout is empty: see .

Otherwise, the layout is empty-plus: see .

If empty($EE/*) (that is, if there are no child elements) then:

If empty($EE/@*) (that is, if there are no attributes) then the layout is simple: see .

Otherwise, simple-plus: see .

The plan also includes the property type. If all the elements in $EE are castable as xs:boolean, then the type is boolean; otherwise, if all the elements in $EE are castable as xs:numeric, then the type is numeric; otherwise, the type is string.

If empty($EE/text()[normalize-space()]) (that is, there are no text node children other than whitespace), then:

If all-equal($EE/*/node-name()) and exists($EE/*[2]) (that is, if all child elements have the same name, and at least one element has multiple child elements), then:

If empty($EE/@*) (that is, if there are no attributes) then list: see .

Otherwise, list-plus: see .

If every $e in $EE satisfies all-different($e/*/node-name()) (that is, the child elements are uniquely named among their siblings), then record: see .

Otherwise, sequence: see .

Otherwise, mixed: see .
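The decision procedure above can be sketched in Python as follows. The sketch is illustrative only: it takes the set $EE as a list of xml.etree.ElementTree elements, the function name choose_layout is hypothetical, and the special treatment of attributes in the xsi namespace is omitted.

```python
import xml.etree.ElementTree as ET


def choose_layout(elements):
    """Pick a layout name for all elements sharing one name ($EE)."""

    def has_text(e):  # any text node child, including whitespace-only
        return bool(e.text) or any(c.tail for c in e)

    def has_non_ws_text(e):
        return bool((e.text or "").strip()) or any(
            (c.tail or "").strip() for c in e
        )

    has_children = any(len(e) for e in elements)
    has_attrs = any(e.attrib for e in elements)
    if not has_children and not any(has_text(e) for e in elements):
        return "empty-plus" if has_attrs else "empty"
    if not has_children:
        return "simple-plus" if has_attrs else "simple"
    if not any(has_non_ws_text(e) for e in elements):
        child_names = {c.tag for e in elements for c in e}
        # All children share one name and some element has a second child.
        if len(child_names) == 1 and any(len(e) >= 2 for e in elements):
            return "list-plus" if has_attrs else "list"
        # Children are uniquely named among their siblings.
        if all(len({c.tag for c in e}) == len(e) for e in elements):
            return "record"
        return "sequence"
    return "mixed"


corpus = [ET.fromstring("<n>John</n>"), ET.fromstring('<n role="first">John</n>')]
layout = choose_layout(corpus)
```

Because the decision is taken over every element with the given name, a single instance carrying an attribute is enough to move the whole group from simple to simple-plus.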

For elements with simple content (more specifically, elements where the chosen layout is simple or simple-plus) the conversion plan also includes an entry indicating whether the content should be represented as a boolean, a number, or a string. If every instance of the element name has content that is castable to xs:boolean, the plan indicates "type": "boolean". If every instance of the element name has content that is castable to xs:numeric, the plan indicates "type": "numeric". In other cases, the plan indicates "type": "string"; however, this may be omitted because it is the default.

For attributes, the conversion plan identifies whether attributes (with a given name) should be represented as booleans, numbers, or strings; alternatively, it may indicate that attributes with a given name should be discarded. For every distinct attribute name present in the input, an entry is output associating the attribute name with one of the types boolean or numeric; the entry is generally omitted when the values are to be represented as strings, though the type can also be given explicitly as string. An entry with type boolean is generated for an attribute name if all the attributes with that name are castable as xs:boolean. Similarly, an entry with type numeric is generated for an attribute name if all the attributes with that name are castable as xs:numeric. In other cases, the attributes are treated as being of type string. Entries with type string may be omitted, since that is the default. The entry for an attribute may also specify "type": "skip" to indicate that the attribute should be discarded.
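The type inference applied to element content and attribute values can be approximated in Python as shown below. The castability tests are assumptions made for this sketch: the boolean test accepts the four lexical forms of xs:boolean, and the numeric test uses a regular expression that approximates the lexical space of xs:double. The function name infer_type is hypothetical, and the input is assumed to be a non-empty list of strings.

```python
import re

# Lexical forms accepted when casting to xs:boolean.
_BOOLEAN = {"true", "false", "1", "0"}
# Approximation (an assumption of this sketch) of values castable to xs:numeric.
_NUMERIC = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?|-?INF|NaN")


def infer_type(values):
    """Return "boolean", "numeric" or "string" for a non-empty list of strings."""
    trimmed = [v.strip() for v in values]
    if all(v in _BOOLEAN for v in trimmed):
        return "boolean"  # tested first, so "0" and "1" alone count as boolean
    if all(_NUMERIC.fullmatch(v) for v in trimmed):
        return "numeric"
    return "string"


price_type = infer_type(["23.50", "7", "1e3"])
```

Note the consequence of testing boolean first: a corpus in which an attribute only ever takes the values 0 and 1 is classified as boolean, which is one reason a generated plan may need manual adjustment.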

A plan that is produced by analyzing a corpus of input documents can then be customized by the user if required. For example:

If simple layout is chosen for a particular element name, but it is known that some documents might be encountered in which that element has attributes, then simple might be changed to simple-plus.

If record layout is chosen for a particular element name, but it is known that some documents might be encountered in which child elements can be repeated, then record might be changed to sequence.

If a generated plan determines that phone numbers should be represented as numbers, it might be modified to treat them as strings.

The conversion plan is a map of type map(xs:string, record(*)). The key is an element or attribute name, representing element names in the form Q{uri}local, and attributes in the form @Q{uri}local: in both cases the Q{uri} part must be omitted for a name in no namespace. Strings are used as keys in preference to xs:QName instances to allow the plan to be serialized in JSON format.

A more detailed definition of the structure is given in .

A small example might be (in its JSON serialization):

{ "bookList": { "layout": "list", "child": "book" }, "book": { "layout": "record" }, "author": { "layout": "simple" }, "title": { "layout": "simple" }, "price": { "layout": "simple", "type": "numeric" }, "hardback": { "layout": "simple", "type": "boolean" }, "@out-of-print": { "type": "boolean" }, "@Q{http://www.w3.org/2001/XMLSchema-instance}nil": { "type": "skip" } }
Attributes in the xsi namespace

This section defines modifications to the above rules that apply to elements having attributes in the xsi namespace (that is, http://www.w3.org/2001/XMLSchema-instance).

When analyzing a corpus using fn:element-to-map-plan, elements having the attribute xsi:nil="true" are ignored. If all elements with a given name have this attribute, the layout mixed is allocated.

When deciding whether an element has any attributes (for example to decide between the layouts empty and empty-plus), all attributes in the xsi namespace are ignored.

When converting an individual element to a map, all attributes in the xsi namespace are ignored.

Notwithstanding the above, elements having the nilled property (which essentially means they are schema-validated and have the attribute xsi:nil="true"), are treated specially by each of the possible element layouts.

Structure of the conversion plan

This section provides a definition of the structure of the conversion plan that is output by the fn:element-to-map-plan function, and used as input to the fn:element-to-map function.

The structure is defined by the following item type:

map( xs:string, record ( layout? as enum("empty", "empty-plus", "simple", "simple-plus", "list", "list-plus", "record", "sequence", "mixed", "xml", "error", "deep-skip"), child? as xs:string, type? as enum("boolean", "numeric", "string", "skip"), * ) )

The rules relating to this structure are as follows:

The keys of the map entries are strings of the form:

local-name representing the name of an element in no namespace.

Q{uri}local-name representing the name of an element in a namespace.

* representing a fallback rule for use with elements where either (a) there is no more specific rule, or (b) processing using the selected layout fails.

@local-name representing the name of an attribute in no namespace.

@Q{uri}local-name representing the name of an attribute in a namespace.

Any entries whose keys are not in this format will be ignored.

The layout entry is present if and only if the key represents the name of an element.

The child entry is present if and only if the value of layout is list or list-plus. It represents an element name in the format local-name for a name in no namespace, or Q{uri}local-name for a name in a namespace.

The type entry is present if, and only if, one of the following conditions applies:

The key represents the name of an attribute.

The layout is simple or simple-plus. In this case the value must not be "skip".

If additional entries (beyond those described above) are present in any of the maps, they are ignored, provided that the map is coercible to the given type definition.
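The key formats listed above can be recognized with a small parser such as the following Python sketch. It is illustrative only: the function name parse_plan_key and the tuple it returns are hypothetical, the local part is not checked against the NCName grammar, and a key written with an empty Q{} part is treated here as unrecognized (an assumption of this sketch).

```python
import re

# Optional "@", optional Q{uri} with a non-empty URI, then a local name.
_KEY = re.compile(r"(?P<attr>@)?(?:Q\{(?P<uri>[^{}]+)\})?(?P<local>[^@{}\s*]+)")


def parse_plan_key(key):
    """Return (kind, namespace URI, local name), or None for an ignored key."""
    if key == "*":
        return ("fallback", None, None)
    match = _KEY.fullmatch(key)
    if match is None:
        return None  # keys not in a recognized format are ignored
    kind = "attribute" if match.group("attr") else "element"
    return (kind, match.group("uri") or "", match.group("local"))


parsed = parse_plan_key("@Q{http://www.w3.org/2001/XMLSchema-instance}nil")
```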

The fallback rule (with key "*") is used to process elements whose name has no specific entry, and also for elements where normal processing fails (for example when the selected layout is "empty", but the element has children). If no fallback rule is present then "error" is assumed: this causes processing to fail with a dynamic error. The fallback rule will typically set the layout property to one of the following:

error: this causes the function to fail with a dynamic error.

deep-skip: this causes the element and its content (recursively) to be omitted from the output.

mixed: this causes the element to be output using layout mixed.

xml: this causes the element to be output using layout xml, which represents the content as a string containing serialized XML.

However, any layout may be used as the fallback; if it fails, the error is unrecoverable.
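The interaction between a selected layout and the fallback rule can be sketched in Python as follows. The sketch covers only elements that have an explicit entry in the plan, it omits the earlier recovery steps (dropping attributes, ignoring the type property), and it models the dynamic error as a RuntimeError; all names are hypothetical, and only two layouts are implemented.

```python
import xml.etree.ElementTree as ET


class LayoutFailure(Exception):
    """Raised by a layout that cannot represent the element."""


def empty_layout(elem):
    if len(elem) or (elem.text or "").strip():
        raise LayoutFailure("empty layout fails: element has content")
    return {elem.tag: ""}


def xml_layout(elem):
    return {elem.tag: ET.tostring(elem, encoding="unicode")}


LAYOUTS = {"empty": empty_layout, "xml": xml_layout}


def apply_layout(elem, layout):
    if layout == "error":
        raise RuntimeError("dynamic error: no usable layout")
    if layout == "deep-skip":
        return None  # the element and its content are omitted
    return LAYOUTS[layout](elem)


def convert(elem, plan):
    """Apply the planned layout, falling back to the "*" rule on failure."""
    try:
        return apply_layout(elem, plan[elem.tag]["layout"])
    except LayoutFailure:
        fallback = plan.get("*", {"layout": "error"})["layout"]
        # A failure of the fallback layout itself is not caught again.
        return apply_layout(elem, fallback)


plan = {"hr": {"layout": "empty"}, "*": {"layout": "xml"}}
result = convert(ET.fromstring("<hr>text</hr>"), plan)
```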

Schema-based conversion

As an alternative to constructing a conversion plan by analyzing a corpus of specimen documents, conversion may be controlled using type annotations derived from schema validation.

If the function element-to-map encounters an element whose name is not present in the conversion plan (including the case where no plan is supplied), and if the element has a type annotation T other than xs:anyType or xs:untyped, then the following rules apply:

This section uses the notation {prop} to refer to properties of schema components, as defined in . The schema component model from XSD 1.1 is used; when XSD 1.0 is used for validation, some properties such as {open content} will inevitably be absent.

Let zeroLength(ST) be true for a simple type ST if any of the following conditions is true:

ST.{variety} = list, and ST.{facets} includes a length or maxLength facet whose value is 0 (zero).

ST.{variety} = atomic, and ST.{facets} includes a length or maxLength facet whose value is 0 (zero).

ST.{variety} = atomic, and ST.{facets} includes an enumeration facet constraining the value to be zero-length.

ST.{variety} = atomic, and ST.{facets} includes a pattern facet with the value "" (a zero-length string).

If T is a simple type:

If zeroLength(T), then the selected layout is empty (see ).

Otherwise, the selected layout is simple (see ), and the selected type is boolean if T is derived from xs:boolean; numeric if T is derived from xs:decimal, xs:double, or xs:float; or string otherwise.

Otherwise (if T is a complex type):

Let $noAttributes be true if T.{attribute uses} is empty and T.{attribute wildcard} is absent.

If T.{content type}.{variety} = empty, then:

If $noAttributes and if empty layout is not disabled, then the selected layout is empty (see ).

Otherwise, the selected layout is empty-plus (see ).

If T.{content type}.{variety} = simple (a complex type with simple content), then:

Let ST be T.{content type}.{simple type definition} (the corresponding simple type).

If zeroLength(ST), then:

If $noAttributes, the selected layout is empty (see ).

Otherwise, the selected layout is empty-plus (see ).

Otherwise:

If $noAttributes, the selected layout is simple (see ).

Otherwise the selected layout is simple-plus (see ).

In both cases the selected type is one of boolean, numeric, or string, chosen in the same way as for elements having a simple type.

If T.{content type}.{variety} = element-only (a complex type with an element-only content model):

Let $noWildcards be true if T.{content type}.{open content} is absent, and T.{content type}.{particle}, expanded recursively, contains no wildcard term.

Let $childCardinalities be a set of (xs:QName, xs:double) pairs representing the expanded names of the element declaration terms within T.{content type}.{particle}, expanded recursively, and for each one, the maximum number of occurrences of elements with that name, computed using the value of the {maxOccurs} property of the particles at each level, taking the value unbounded as positive infinity.

If $noWildcards is true, and if $childCardinalities contains a single entry, and that entry has a cardinality greater than one, then:

If $noAttributes then the selected layout is list (see ).

Otherwise, the selected layout is list-plus (see ).

If $noWildcards is true, and if every entry in $childCardinalities has a cardinality of one, then the selected layout is record (see ).

Otherwise, the selected layout is sequence (see ).

Otherwise (that is, when T.{content type}.{variety} = mixed), the selected layout is mixed (see ).

For attribute nodes, the selected type is boolean if the type annotation is derived from xs:boolean; numeric if the type annotation is derived from xs:decimal, xs:double, or xs:float; and string otherwise.
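For the element-only case, the choice between list, list-plus, record, and sequence depends only on $noAttributes, $noWildcards, and $childCardinalities. The following Python sketch shows that final step, assuming these three values have already been derived from the schema; the function name and the use of a dict from child name to maximum cardinality are hypothetical.

```python
import math


def element_only_layout(child_cardinalities, no_attributes, no_wildcards):
    """Select a layout for a complex type with element-only content.

    child_cardinalities maps each child element name to its maximum
    number of occurrences, with math.inf standing for "unbounded".
    """
    cards = list(child_cardinalities.values())
    if no_wildcards and len(cards) == 1 and cards[0] > 1:
        return "list" if no_attributes else "list-plus"
    if no_wildcards and all(c == 1 for c in cards):
        return "record"
    return "sequence"


layout = element_only_layout({"date": math.inf}, no_attributes=True, no_wildcards=True)
```

Deriving the cardinalities themselves requires walking the particle tree of the content model, which is outside the scope of this sketch.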

Selecting an element layout

The various layouts available for elements are described in . This section defines the rules for selecting an element layout for a given element E. The rules are applied in order.

If an explicit layout is given for the element name of E in the conversion plan supplied to the fn:element-to-map function call, then that layout is used. If the selected layout is deep-skip, then no output is produced for that element. If the selected layout is error, then the function fails with a dynamic error. If the selected layout fails for the element instance, then the fallback layout (identified with the key "*" in the conversion plan) is used; in the absence of a fallback layout, the function fails with a dynamic error.

Otherwise (when no explicit layout is given for E), if the type annotation of the element is something other than xs:untyped or xs:anyType, then a schema-determined layout is used as defined in .

Otherwise, if the conversion plan supplies a fallback layout (identified with the key "*"), then the fallback layout is used.

If the above rules do not provide a layout for E, then a conversion plan for E is determined by applying the rules in , with an input that contains the single element E and no others. (Only the element E itself is considered, not its descendants.)
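The ordering of these rules can be illustrated with a short sketch. It assumes a simplified, hypothetical conversion plan that maps element names directly to layout names (with "*" as the fallback key). It omits the handling of deep-skip and error, and the case where a selected layout fails for the element instance. The function name `select_layout` and the returned labels are illustrative only.

```python
def select_layout(plan: dict[str, str], name: str, type_annotation: str):
    """Return (source, layout) for an element, applying the rules in order."""
    # Rule 1: an explicit layout for this element name wins.
    if name in plan and name != "*":
        return ("explicit", plan[name])
    # Rule 2: a schema-determined layout applies to typed elements.
    if type_annotation not in ("xs:untyped", "xs:anyType"):
        return ("schema", None)  # layout is chosen by the schema-based rules
    # Rule 3: the fallback layout, if the plan supplies one.
    if "*" in plan:
        return ("fallback", plan["*"])
    # Rule 4: derive a plan from the element itself.
    return ("inferred", None)


plan = {"para": "mixed", "*": "xml"}
print(select_layout(plan, "para", "xs:untyped"))
print(select_layout(plan, "title", "xs:untyped"))
```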

Element and Attribute Names

The name-format option gives control over how element and attribute names are formatted. There are four options:

The default option (which may be explicitly requested by specifying "name-format": "default") retains the namespace URI for any element that is either (a) the top-level element of a tree being converted, or (b) has a name that is in a different namespace from its parent element. In such cases the format "Q{uri}local" is used. For other elements, the name is output using the local part of the element name alone. For attributes, the form "Q{uri}local" is used for an attribute in a namespace, and the local name alone is used for a no-namespace name. Namespace prefixes are not retained.

The option eqname uses the format "Q{uri}local" for all element and attribute names that are in a namespace, or the local name alone for all names that are not in a namespace.

The option local discards all namespace information: all elements and attributes are output using the local name alone.

The option lexical outputs element and attribute names in the form obtained by calling the function fn:name. If the name has a prefix, the prefix is retained in the output. However, the output contains no information that enables the prefix to be associated with a namespace URI, so this format is suitable only when prefixes in the input documents are used predictably.

Regardless of the chosen name-format, and regardless of the above rules, attributes in the xml namespace (http://www.w3.org/XML/1998/namespace) are output using a lexical QName, with the prefix xml.

Attribute names in the output are typically prefixed with the character "@". The option attribute-marker allows this to be changed to a different prefix or none.

Whichever format of names is chosen, if the rules for the selected layout would result in an output map having two entries with the same key, the conflict is resolved by combining these entries into an array. For example, if name-format is set to local, then an element named data that has two attributes with the local name val in different namespaces, with the values 3 and 4, becomes either { "data": { "@val": ["3", "4"] } } or (because attribute order is unpredictable) { "data": { "@val": ["4", "3"] } }.
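The name-format rules can be summarised in a small sketch. The function `format_name` and all of its parameters are hypothetical and exist only for illustration. Two points are assumptions rather than statements of the rules: under the default option the sketch uses the local name alone for any element that is in no namespace, and it applies the attribute marker to attributes in the xml namespace as well.

```python
XML_NS = "http://www.w3.org/XML/1998/namespace"


def format_name(local, uri="", prefix="", *, name_format="default",
                is_attribute=False, is_top_level=False, parent_uri="",
                attribute_marker="@"):
    """Format an element or attribute name (illustrative sketch only)."""
    if is_attribute and uri == XML_NS:
        # Attributes in the xml namespace always use the lexical form xml:local.
        name = f"xml:{local}"
    elif name_format == "local":
        name = local
    elif name_format == "lexical":
        name = f"{prefix}:{local}" if prefix else local
    elif name_format == "eqname":
        name = f"Q{{{uri}}}{local}" if uri else local
    elif name_format == "default":
        if is_attribute:
            qualified = bool(uri)
        else:
            # Assumption: no-namespace elements use the local name alone.
            qualified = bool(uri) and (is_top_level or uri != parent_uri)
        name = f"Q{{{uri}}}{local}" if qualified else local
    else:
        raise ValueError(f"unknown name-format: {name_format}")
    return attribute_marker + name if is_attribute else name


print(format_name("a", "http://example.com/ns", "p", is_top_level=True))
print(format_name("val", "http://example.com/ns", "p",
                  name_format="local", is_attribute=True))
```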

Element and Attribute Content

The conversion plan may indicate that element content is to be output as type string, numeric, or boolean: the default is string. In the case of untyped elements and attributes, the value is output as an instance of a string, numeric, or boolean type, according to this prescription. Specifically:

If the prescribed type is boolean and the value is castable as xs:boolean, then it is output as an instance of xs:boolean.

If the prescribed type is numeric and the value is castable as xs:numeric, then it is output as an instance of xs:integer, xs:decimal, or xs:double depending on the lexical form of the value, following the same rules as for XPath numeric literals. For example, "-1" becomes an xs:integer, 12.00 becomes an xs:decimal, and 1e-3 becomes an xs:double. The special xs:double values NaN and INF (which cannot be used as numeric literals) are also recognized.

In all other cases the value is output as an instance of xs:untypedAtomic, retaining its original lexical form.
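The following sketch illustrates the three cases for untyped content. Python types stand in for XDM types (bool for xs:boolean, int for xs:integer, Decimal for xs:decimal, float for xs:double, str for xs:untypedAtomic). The function name `convert_untyped` and the regular expressions are assumptions for this example and only approximate the lexical rules for casting and for XPath numeric literals.

```python
import re
from decimal import Decimal

_INTEGER = re.compile(r"[+-]?\d+")
_DECIMAL = re.compile(r"[+-]?(\d+\.\d*|\.\d+)")
_DOUBLE = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)[eE][+-]?\d+")
_BOOLEANS = {"true": True, "1": True, "false": False, "0": False}


def convert_untyped(value: str, prescribed: str = "string"):
    """Convert an untyped value according to the prescribed type."""
    text = value.strip()
    if prescribed == "boolean" and text in _BOOLEANS:
        return _BOOLEANS[text]
    if prescribed == "numeric":
        if _INTEGER.fullmatch(text):
            return int(text)        # stands in for xs:integer
        if _DECIMAL.fullmatch(text):
            return Decimal(text)    # stands in for xs:decimal
        if _DOUBLE.fullmatch(text):
            return float(text)      # stands in for xs:double
        if text in ("NaN", "INF", "-INF"):
            return float(text.replace("INF", "inf"))
    # All other cases: keep the original lexical form unchanged.
    return value


print(repr(convert_untyped("-1", "numeric")))
print(repr(convert_untyped("12.00", "numeric")))
print(repr(convert_untyped("yes", "boolean")))
```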

Where the element or attribute is schema-validated, however:

If an element has the nilled property (that is, xsi:nil="true"), then the mapping for nilled elements with the chosen layout is used.

Let AV be the typed value of the node (that is, the result of atomization).

If, however, an element is annotated with a type that does not allow atomization (specifically, a complex type with element-only content) then let AV be the string value of the element, as an atomic item of type xs:untypedAtomic.

If an attribute is annotated as having a simple type of {variety} list, or if an element using layout simple or simple-plus is annotated as having either a simple type of {variety} list or a complex type with simple content of {variety} list then the atomized value AV is represented in the result as the array represented by the XPath expression array{AV}. This applies whether or not the atomized value actually contains multiple atomic items. The individual atomic items in the array retain their type, for example items of type xs:date remain items of type xs:date in the result.

In all other cases AV will be a single atomic item, and this value is used as is, retaining its type.

Atomic items in the result of the fn:element-to-map function may thus be of any atomic type. The type information is lost if the result is subsequently serialized as JSON.

Lost XDM Information

This section is non-normative. Its purpose is to explain what information available in the XDM nodes supplied as input to the fn:element-to-map function is missing from the output.

Element and attribute names: If the chosen name-format is default or eqname, then local names and namespace URIs of elements and attributes are retained, but namespace prefixes are lost. If the chosen name-format is lexical, then prefixes are retained but namespace URIs are lost. If the chosen name-format is local then only local names are retained; namespace URIs and prefixes are lost.

In addition, element names are lost when the parent element is mapped using list layout: see .

In-scope namespaces: All information about in-scope namespaces (and in particular, bindings for namespaces that are declared but not used in element and attribute names) is lost.

Comments and processing instructions: Comments and processing instructions are lost except when they appear as children of elements that are mapped using the sequence, mixed or xml layouts.

Text nodes: Whitespace text nodes are discarded when they appear as children of elements that are mapped using the empty, empty-plus, list, list-plus, record, or sequence layouts. Non-whitespace text nodes are never discarded.

Additional node properties: The values of the is-id, is-idref, and is-nilled properties of a node are lost.

Type annotations: The values of type annotations on elements are lost. Type annotations on atomized values of schema-validated nodes, however, are retained.

Element order: The order of child elements is lost when record layout is used and the element has multiple children with the same name.

XSI attributes: Attributes in the xsi namespace (for example, xsi:type and xsi:nil) are not represented in the result.

Examples

The following examples show the effect of transforming some simple XML documents with default options, and then serializing the result as JSON with the indent option set to true. The actual indentation is implementation-dependent.

Table: XDM element | JSON serialization of result (the example markup is not recoverable here; surviving text content includes the value 12 and the values London 18.2, Paris 19.1, Berlin 14.6)

The following more complex example demonstrates a case where the default conversion is inadequate (for example, it wrongly assumes that for the third production, the order of child elements is immaterial). A better result, shown below, can be achieved by using a schema-aware conversion.

Table: XDM element | JSON serialization of result (the example content is not recoverable here)

In the above example, the schema used to validate the source document was simplified to eliminate options that do not actually arise in this input instance (such as the g:string element having attributes). This is a legitimate technique that may be useful when trying to obtain the simplest possible JSON representation.

Further improvements to the usability of the JSON output could be achieved by doing some simple transformation of the XML prior to conversion. For example, the name attribute of various productions could be converted to a child element, and <ref name="x"/> could be transformed to <ref>x</ref>.

Other operations on maps

This section is non-normative.

Because a map is a function item, functions that apply to functions also apply to maps. A map is an anonymous function, so fn:function-name returns the empty sequence; fn:function-arity always returns 1.

Maps may be compared using the fn:deep-equal function.

There is no function or operator to atomize a map or convert it to a string (other than fn:serialize, which can be used to serialize some maps as JSON texts).

XPath 4.0 defines a number of syntactic constructs that operate on maps. These all have equivalents in the function library:

The expression {} creates the empty map (see ). This is equivalent to the effect of the data model primitive dm:empty-map(). Using user-visible functions the same can be achieved by calling map:build or map:merge, supplying the empty sequence as the argument.

The map constructor { K1 : V1, K2 : V2, ... , Kn : Vn } is equivalent to map:merge((map:entry(K1, V1), map:entry(K2, V2), ..., map:entry(Kn, Vn)), { "duplicates": "reject" }).

The lookup expression $map?* (see ) is equivalent to map:items($map).

The lookup expression $map?K, where K is a key value, is equivalent to map:get($map, K).

The expression for key $k value $v in $map return EXPR (see and ) is equivalent to the function call map:for-each($map, fn($k, $v) { EXPR }).

Maps can be filtered using the construct $map?[predicate] (see ).
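As an informal analogy, the behavior of the map constructor (merging entries while rejecting duplicate keys) and of iteration over entries can be modelled with Python dictionaries. The helper names `map_constructor` and `map_for_each` are hypothetical. Python key equality is not the same as XDM key comparison, so this is a sketch of the control flow only.

```python
def map_constructor(*entries):
    """Build a map from (key, value) pairs, rejecting duplicate keys."""
    result = {}
    for key, value in entries:
        if key in result:
            # Mirrors the effect of the option "duplicates": "reject".
            raise ValueError(f"duplicate key {key!r}")
        result[key] = value
    return result


def map_for_each(mapping, action):
    """Concatenate the results of action(key, value) over all entries."""
    out = []
    for key, value in mapping.items():
        out.extend(action(key, value))
    return out


m = map_constructor(("a", 1), ("b", 2))
print(map_for_each(m, lambda k, v: [k, v]))
```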

Processing arrays

Arrays were introduced as a new datatype in XDM 3.1. This section describes functions that operate on arrays.

An array is an additional kind of item. An array of size N is a mapping from the integers (1 to N) to a set of values, called the members of the array, each of which is an arbitrary sequence. Because an array is an item, and therefore a sequence, arrays can be nested.

An array acts as a function from integer positions to associated values, so the function call $array($index) can be used to retrieve the array member at a given position. The function corresponding to the array has the signature function($index as xs:integer) as item()*. The fact that an array is a function item allows it to be passed as an argument to higher-order functions that expect a function item as one of their arguments.

Formal Specification of Arrays

The XDM data model () defines three primitive operations on arrays:

dm:empty-array constructs the empty array.

dm:array-append adds a member to an array.

dm:iterate-array applies a supplied function to every member of an array, in order.

The functions in this section are all specified by means of equivalent expressions that either call these primitives directly, or invoke other functions that rely on these primitives. The specifications avoid relying on XPath language constructs that manipulate arrays, such as array constructor syntax, lookup expressions, or FLWOR expressions. This is done to allow these language constructs to be specified by reference to this function library, without risk of circularity.

There is one exception to this rule: for convenience, the notation [] is used to represent the empty array, in preference to a call on dm:empty-array().

The formal equivalents are not intended to provide a realistic way of implementing the functions. They do, however, provide a framework that allows the correctness of a practical implementation to be verified.
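To illustrate how functions can be layered on the three primitives, the sketch below models an array as an immutable Python tuple whose members are themselves tuples (standing in for sequences). The Python names are illustrative, and the callback shape used for `iterate_array` (member plus 1-based position, returning a sequence) is an assumption made for this example.

```python
from typing import Callable


def empty_array() -> tuple:
    """Model of dm:empty-array: an array with no members."""
    return ()


def array_append(array: tuple, member: tuple) -> tuple:
    """Model of dm:array-append: returns a new array; the input is unchanged."""
    return array + (member,)


def iterate_array(array: tuple, action: Callable[[tuple, int], tuple]) -> tuple:
    """Model of dm:iterate-array: apply action to each member in order and
    concatenate the resulting sequences."""
    out = ()
    for position, member in enumerate(array, start=1):
        out += action(member, position)
    return out


def array_size(array: tuple) -> int:
    # Emit one item per member, then count the items.
    return len(iterate_array(array, lambda member, pos: (1,)))


def array_get(array: tuple, index: int) -> tuple:
    # Keep only the member at the requested position.
    return iterate_array(array, lambda member, pos: member if pos == index else ())


a = array_append(array_append(empty_array(), ("x",)), ("y", "z"))
print(array_size(a), array_get(a, 2))
```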

Functions that operate on arrays

The functions defined in this section use a conventional namespace prefix array, which is assumed to be bound to the namespace URI http://www.w3.org/2005/xpath-functions/array.

As with all other values, arrays are treated as immutable. For example, the array:reverse function returns an array that differs from the supplied array in the order of its members, but the supplied array is not changed by the operation. Two calls on array:reverse with the same argument will return arrays that are indistinguishable from each other; there is no way of asking whether these are “the same array”. Like sequences, arrays have no identity.

All functionality on arrays is defined in terms of two primitives:

The function array:members decomposes an array to a sequence of value records.

The function array:of-members composes an array from a sequence of value records.

A value record here is an item that encapsulates an arbitrary value; the representation chosen for a value record is record(value as item()*), that is, a map containing a single entry whose key is the string "value" and whose value is the encapsulated sequence.

Note that when the required type of an argument to a function such as array:build is an array type, then the coercion rules ensure that a JNode can be supplied in the function call: if the ·content· property of the JNode is an array, then the array is automatically extracted as if by the jnode-content function.

Other Operations on Arrays

This section is non-normative.

Arrays may be compared using the fn:deep-equal function.

The XPath language provides explicit syntax for certain operations on arrays. These constructs can all be specified in terms of function primitives:

The empty array can be constructed using either of the expressions [] or array{}. The effect is the same as the data model primitive dm:empty-array() (see ). Using user-visible functions it can be achieved by calling array:build(()) or array:of-members(()).

The expression array { $sequence } constructs an array whose members are the items in $sequence. Every member of this array will be a singleton item. The effect is the same as array:build($sequence).

The expression [E1, E2, E3, ..., En] constructs an array in which E1 is the first member, E2 is the second member, and so on. The result is equivalent to the expression [] => array:append(E1) => array:append(E2) => ... => array:append(En).

The lookup expression $array?* returns the sequence concatenation of the members of the array. It is equivalent to calling array:fold-left($array, (), fn($result, $next){ $result, $next }).

The lookup expression $array?$N, where $N is an integer within the bounds of the array, is equivalent to array:get($array, $N).

Similarly, applying the array as a function, $array($N), is also equivalent to array:get($array, $N).

The expression for member $m in $array return EXPR is equivalent to array:for-each($array, fn($m){ EXPR }) (see and ).

Arrays can be filtered using the construct $array?[predicate] (see ).
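The equivalences above can be mirrored informally in Python, modelling an array as a tuple of members where each member is a tuple standing in for a sequence. The function names are hypothetical; the fold mirrors the array:fold-left formulation given for $array?*.

```python
from functools import reduce


def array_build(sequence):
    """Model of array { $sequence }: every member is a singleton item."""
    return tuple((item,) for item in sequence)


def square_constructor(*members):
    """Model of [E1, E2, ...]: append each member to the empty array in turn."""
    return reduce(lambda array, member: array + (tuple(member),), members, ())


def lookup_all(array):
    """Model of $array?*: fold the members into one concatenated sequence."""
    return reduce(lambda result, nxt: result + nxt, array, ())


arr = square_constructor((1, 2), (), (3,))
print(lookup_all(arr))
print(array_build((1, 2)))
```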

Processing JNodes

Introduced the concept of JNodes.

A JNode is a wrapper around a map or array, or around a value that appears within the content of a map or array. JNodes are described at . Wrapping a map or array in a JNode enables the use of path expressions such as $jnode/descendant::title, as described at .

In addition to the functions defined in this section, functions that operate on JNodes include:

fn:distinct-ordered-nodes

fn:generate-id

fn:has-children

fn:innermost

fn:outermost

fn:path

fn:root

fn:siblings

fn:transitive-closure

Functions on JNodes

External resources and data formats

The functions in this section access resources external to a query or stylesheet, and convert between external file formats and their XPath and XQuery data model representation.

Accessing external information

The functions in this section provide access to resources (such as files) in the external environment.

Functions on XML Data

These functions convert between the lexical representation of XML and the tree representation.

(The fn:serialize function also handles HTML and JSON output, but is included in this section for editorial convenience.)

XSD validation

This description of the XSD validation process was previously found (with some duplication) in the XQuery and XSLT specifications; those specifications now reference this description. As a side effect, the descriptions of the process in XQuery and XSLT are better aligned.

This section describes a process called XSD validation, which validates a supplied node against a supplied XSD schema. The validation process refers to the process defined in or .

The validation process takes the following inputs:

A schema to be used for validation, called the effective schema.

A boolean indicating whether any xsi:schemaLocation or xsi:noNamespaceSchemaLocation attributes are to be taken into consideration.

A document, element, or attribute node to be validated; this is called the operand node.

A validation mode, which is one of strict, lax, or by-type.

XSLT also allows the value strip, but this does not invoke validation (instead, it invokes stripping of existing type annotations, and re-annotation of nodes as xs:untyped.)

If the validation mode is by-type, then a schema type to be used for validating the operand node. This may be any simple or complex type present in the effective schema: it must not be xs:untyped or xs:untypedAtomic.

An XQuery ValidateExpr allows the type to be specified as xs:untyped or xs:untypedAtomic, but this does not invoke validation (instead, it invokes stripping of existing type annotations and re-annotation of nodes as untyped.)

The output of the validation process comprises one or more of the following:

A boolean indicating whether the operand node was found to be valid.

If the operand node was found to be valid, a deep copy of the operand node, in which the copied nodes are augmented with type annotations corresponding to the types against which they were validated. The copy may also include expanded values for element and attribute defaults defined in the schema.

This creates a new node with its own identity and with no parent.

The base URI property of every node in the resulting XDM tree is the same as the base URI property of the corresponding node in the input tree.

If the operand node was not found to be valid, then optionally, a set of error diagnostics in implementation-defined format.

The operand node must be one of:

An element node

An attribute node

A well-formed document node, that is, a document node having among its children exactly one element node and zero or more comment and processing instruction nodes.

The term validation root is used to refer to the operand node if it is an element or attribute node, or to the single element child of the operand node when the operand node is a document node.

Note that a schema is defined as a collection of schema components (for example, element and attribute declarations, complex and simple type definitions). In some cases the schema that is used is the set of schema components found in the in-scope schema definitions, but this is not the only possibility.

The result of the validation process is defined by the following rules.

The invoking application determines whether the validity assessment process takes account of any xsi:schemaLocation or xsi:noNamespaceSchemaLocation attributes in the tree being validated. If it does so, then it should adhere to the following rules:

Any schema loaded using these attributes must be compatible with the existing effective schema.

Any schema loaded using these attributes must not override or redefine any schema components in the effective schema.

Any schema components loaded using this mechanism must be used for this validity assessment only, and must not affect the outcome of any subsequent validity assessments of other documents.

A processor may choose to cache such schema components but the existence of such a cache should only affect performance, not the validation outcome.

A consequence of validating a document using schema components that are not in the static context is that nodes may be annotated with types that are not in the static context. But the rules for schema compatibility mean that this is not a problem.

If the instance being validated contains any xml:id attributes, such attributes are validated against the type xs:ID, making the containing element eligible as a target for the id function. Uniqueness checking of elements and attributes typed as xs:ID, however, is carried out only if the operand node is a document node.

If the operand node is a document node:

The children of the document node must consist of exactly one element node and zero or more comment and processing instruction nodes, in any order.

The element node child is validated, as described below.

The validation rule Validation Root Valid (ID/IDREF) is applied to the single element node child of the document node. This means that validation will fail if there are non-unique ID values or dangling IDREF values in the document tree.

This rule is not applied when the operand node is an element or attribute node.

There is no check that the tree contains unparsed entities whose names match the values of nodes of type xs:ENTITY or xs:ENTITIES. This is because it is not possible (either in XSLT or XQuery) to construct a tree containing unparsed entities. It is possible to add unparsed entity declarations to the result document by referencing a suitable DOCTYPE during serialization.

All other children of the document node (comments and processing instructions) are copied unchanged, and the results become the children of a new document node, which is returned as the validation result.

If the operand node is an element node, then:

For specification purposes, because the XSD specifications require the input document to be expressed as an XML Information Set (), the operand node is first converted to an Infoset according to the “Infoset Mapping” rules defined in . Note that this process discards any existing type annotations.

Validity assessment is carried out on the root element information item of the resulting Infoset, using the supplied schema. The process of validation applies recursively to contained elements and attributes to the extent required by the supplied schema.

A practical implementation is unlikely to perform any physical conversion, but the process is defined this way in order to align with the XSD specification.

If the validation mode is by-type, then Schema-validity assessment is carried out according to the rules defined in or Part 1, section 3.3.4 "Element Declaration Validation Rules", “Validation Rule: Schema-Validity Assessment (Element)”, clauses 1.2 and 2, using this type definition as the processor-stipulated type definition for validation.

If validation mode is strict, then strict validation is carried out as described in Part 1, section 5.2, “Assessing Schema-Validity”, item 2, or its counterpart in XSD 1.1. This means that the root element information item in the Infoset must either:

have a name that matches a top-level element declaration in the effective schema, or

have an xsi:type attribute whose value matches the name of a top-level type definition in the effective schema

If there is no such element declaration or type definition, the element is assessed as invalid.

If validation mode is lax, then schema-validity assessment is carried out in accordance with Part 1, section 5.2, “Assessing Schema-Validity”, item 3, or its counterpart in XSD 1.1.

If validation mode is lax and the root element information item has neither a top-level element declaration nor an xsi:type attribute, XSD 1.0 and XSD 1.1 define the recursive checking of children and attributes as optional. This specification prescribes that this recursive checking is required.

This means, for example, that when an instance document is structured as having an envelope in one namespace wrapping a payload in a different namespace, and when schema definitions are available for the payload but not for the envelope, lax validation of the envelope may trigger validation of the payload.

If the operand node is an element node, the validation rules named “Validation Root Valid (ID/IDREF)” are not applied. This means that document-level constraints relating to uniqueness and referential integrity are not enforced.

There is no check that the document contains unparsed entities whose names match the values of nodes of type xs:ENTITY or xs:ENTITIES.

If the operand node is an attribute node, in particular when it is a parentless attribute node, then validation cannot be defined directly in terms of the XSD-defined validation process. Instead, conceptually, a copy of the attribute is first added to an element node that is created for the purpose, and namespace fixup is performed on this element node to ensure that it has an in-scope namespace binding for the prefix and namespace of the attribute name. The name of this element is of no consequence, but it must be the same as the name of a synthesized element declaration of the form:

<xs:element name="E">
  <xs:complexType>
    <xs:sequence/>
    <xs:attribute ref="A"/>
  </xs:complexType>
</xs:element>

where A is the name of the attribute being validated.

This synthetic element is then validated using the procedure given above for validating elements, and if it is found to be valid, a copy of the validated attribute is made, retaining its type annotation, but detaching it from the containing element (and thus, from any in-scope namespace bindings).

The XDM data model does not permit an attribute node with no parent to have a typed value that includes a namespace-qualified name, that is, a value whose type is derived from xs:QName or xs:NOTATION. This restriction is imposed because these types rely on the in-scope namespaces of a containing element to resolve namespace prefixes. Therefore, a parentless attribute is considered to be invalid against such a type.

The outcome of the validation expression depends on the validity property of the root element information item in the PSVI that results from the XSD validation process.

If the validity property of the root element information item is valid, or if validation mode is lax and the validity property of the root element information item is notKnown, the PSVI is converted back into a data model instance as described in Section 3.3, “Construction from a PSVI”. The resulting node (a new node of the same kind as the operand node) is returned as the result of the validate expression.

Otherwise, the operand node is deemed invalid.
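The decision described in the two preceding rules reduces to a small predicate, sketched here. The function name `validation_accepted` is hypothetical; the validity values are the strings valid, invalid, and notKnown used for the validity property in the PSVI.

```python
def validation_accepted(validity: str, mode: str) -> bool:
    """Decide whether the PSVI is converted back to a data model instance.

    validity: the validity property of the root element information item
    mode: the validation mode ("strict", "lax", or "by-type")
    """
    if validity == "valid":
        return True
    # Under lax validation, an outcome of notKnown is also accepted.
    return mode == "lax" and validity == "notKnown"


print(validation_accepted("notKnown", "lax"))
print(validation_accepted("notKnown", "strict"))
```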

During conversion of the PSVI into an XDM instance after validation, any element information items whose validity property is notKnown are converted into element nodes with type annotation xs:anyType, and any attribute information items whose validity property is notKnown are converted into attribute nodes with type annotation xs:untypedAtomic, as described in .

Functions on HTML Data

A new function is available for processing input data in HTML format.

This function converts between the lexical representation of HTML and the XDM tree representation.

XDM Mapping from HTML DOM Nodes

The fn:parse-html function conceptually works in two phases:

The lexical HTML (supplied as a string) is parsed into an HTML DOM as defined by the HTML5 specification: see and .

The resulting DOM is converted to an XDM tree as described in this section. This is described by defining the actions of the accessor functions defined in .

Because the and are not fixed, it is implementation-defined which versions are used.

An implementation must match the semantics of the mapping described in this section, but the specific way it achieves that is implementation-dependent.

Some possible implementation strategies are:

Parse the HTML to an HTML DOM and then convert the HTML DOM to an XDM node tree.

Parse the HTML to an HTML DOM and then implement a wrapper or facade that presents an XDM interface to the HTML DOM.

Parse the lexical HTML directly to an XDM node tree, bypassing the HTML DOM.

The defines parsing algorithms for two different formats, which it refers to as the HTML and XML serializations (or concrete syntaxes). The XML serialization is an XML document which typically uses the namespace http://www.w3.org/1999/xhtml and the content type application/xhtml+xml, and is popularly referred to as XHTML. The HTML parsing algorithm constructs an HTML DOM HTMLDocument document object for the HTML document. The XHTML parsing algorithm constructs an HTML DOM XMLDocument object for the HTML document, following XML parsing rules. This mapping supports both of these document types.

The specification defines HTML DOM nodes that are mapped to XDM nodes as follows:

The HTML DOM Document interface maps to .

The HTML DOM Element interface maps to . But see below for the mapping of an HTML template element.

The HTML DOM Attr interface maps to .

Any HTML DOM Attr instances in an HTML DOM HTMLDocument that represent namespace declarations will have been filtered out: see .

The HTML DOM ProcessingInstruction interface maps to .

The HTML parsing algorithm does not generate processing instruction nodes. If encountered they are parsed as comment nodes. The HTML DOM ProcessingInstruction interface is relevant only when the XHTML parsing algorithm is used.

The HTML DOM Comment interface maps to .

The HTML DOM Text interface maps to . Adjacent HTML DOM Text nodes are combined into a single .

The HTML DOM CDATASection interface is an instance of HTML DOM Text, so CDATA sections also map to .

The use of CDATA sections can result in the HTML DOM containing adjacent text nodes, which the mapping to XDM will merge into a single node.

An HTML template element is mapped to an XDM template element with children corresponding to the children of the HTML DOM DocumentFragment that is the value of the template contents property of the HTML DOM template element.

Given source HTML in which a template element contains a child element with the text "Lorem ipsum", the HTML DOM represents that child element not as a child of the template element, but as the child of a free-standing document fragment which is accessible (in the DOM API) as the value of the template.content property of the element node. The XDM representation produced by fn:parse-html does not follow this convention: instead, the element containing "Lorem ipsum" appears as an ordinary child node of the template element.

The HTML DOM DocumentFragment interface is not supported as an XML node. There are two places in the HTML DOM where this is used:

The HTML DOM ShadowRoot interface is not present in the main HTML DOM tree. It is only accessible via JavaScript.

The template element’s content property contains the child nodes of the template element. The behaviour of this is described above.

If an implementation allows these nodes to be passed in via an API or similar mechanism, their behaviour is implementation-defined.

attributes Accessor

The result of the dm:attributes($node) for an HTML DOM Node is as follows:

If the node is an instance of HTML DOM Element then the result is the value of the Element.attributes property mapped to a sequence as described below;

Otherwise, the result is the empty sequence.

An HTML DOM NamedNodeMap is mapped to a sequence as follows:

NamedNodeMap.length is the length of the sequence, where a length of 0 results in the empty sequence;

NamedNodeMap.item(n) is the nth element of the sequence.

That sequence is then filtered as follows:

If the Attr.namespaceURI property is "http://www.w3.org/2000/xmlns/", the attribute is not included in this sequence;

If the Attr.localName property is "xmlns", the attribute is not included in this sequence;

If the Attr.localName property starts with "xmlns:", the attribute is not included in this sequence;

Otherwise, the attribute is included in this sequence using the XDM mapping rules described in this section.
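The filtering rules above can be sketched as follows. The `Attr` dataclass is a hypothetical stand-in for an HTML DOM Attr object (only the properties needed here are modelled), and `xdm_attributes` is an illustrative name, not part of any API.

```python
from dataclasses import dataclass
from typing import Optional

XMLNS_NS = "http://www.w3.org/2000/xmlns/"


@dataclass
class Attr:
    """Minimal stand-in for an HTML DOM Attr node."""
    local_name: str
    value: str
    namespace_uri: Optional[str] = None


def is_namespace_declaration(attr: Attr) -> bool:
    return (attr.namespace_uri == XMLNS_NS
            or attr.local_name == "xmlns"
            or attr.local_name.startswith("xmlns:"))


def xdm_attributes(attrs: list) -> list:
    """Drop namespace declarations, keeping the remaining attributes in order."""
    return [attr for attr in attrs if not is_namespace_declaration(attr)]


attrs = [Attr("class", "note"), Attr("xmlns", "http://www.w3.org/1999/xhtml"),
         Attr("xmlns:svg", "http://www.w3.org/2000/svg"), Attr("id", "n1")]
print([a.local_name for a in xdm_attributes(attrs)])
```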

The HTML DOM Element.attributes property includes namespace and non-namespace attributes in the list when the HTML or XML parser is used. As such, the namespace attributes have to be filtered from the resulting XDM attribute sequence.

When the resulting document is an HTML DOM HTMLDocument, the Attr.localName and Attr.name properties of HTML DOM Attr nodes are both set to the qualified name. This includes namespace declarations which are filtered out by the logic in this section.

The Attr.localName property will be ASCII lowercase. The section 13.2.5.33, Attribute name state specifies that ASCII upper alpha characters are appended to the attribute’s name in lowercase.

base-uri Accessor

The result of the dm:base-uri($node) for an HTML DOM Node is the value of the Node.baseURI property mapped as follows:

If the value is null or the zero-length string, then the result is the empty sequence;

Otherwise, the string value is cast to an xs:anyURI.

children Accessor

The result of the dm:children($node) for an HTML DOM Node is as follows:

If the node is an instance of HTML DOM Document then the result is the value of the Node.childNodes property mapped to a sequence;

If the node is an instance of HTML DOM HTMLTemplateElement then the result is the HTML DOM DocumentFragment’s Node.childNodes property, mapped to a sequence;

If the node is an instance of HTML DOM Element then the result is the value of the Node.childNodes property mapped to a sequence;

Otherwise, the result is the empty sequence.

An HTML DOM NodeList is mapped to a sequence as follows:

NodeList.length is the length of the sequence, where a length of 0 results in the empty sequence;

NodeList.item(n) is the nth element of the sequence.

That sequence is then filtered as follows:

If the child is an instance of HTML DOM DocumentType, that child is not included in this sequence;

A sequence of consecutive HTML DOM Text nodes is combined into a single XDM text node;

Otherwise, the HTML DOM Node nodes are mapped to XDM according to the rules in this section.
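As a rough sketch of the filtering step, the following Python code uses the standard library's `xml.dom.minidom` as a stand-in for an HTML DOM. It drops DocumentType children and merges each run of consecutive Text nodes into one string. The special handling of template elements is omitted, and representing a merged text node as a plain string is a simplification made for this example.

```python
from xml.dom import Node
from xml.dom.minidom import getDOMImplementation


def xdm_children(node):
    """Children of a document or element; adjacent Text nodes are merged."""
    if node.nodeType not in (Node.DOCUMENT_NODE, Node.ELEMENT_NODE):
        return []  # every other node kind has no children
    result = []
    previous_was_text = False
    for child in node.childNodes:
        if child.nodeType == Node.DOCUMENT_TYPE_NODE:
            continue  # the doctype is not part of the XDM tree
        if child.nodeType == Node.TEXT_NODE:
            if previous_was_text:
                result[-1] += child.data  # extend the current text run
            else:
                result.append(child.data)
            previous_was_text = True
        else:
            result.append(child)
            previous_was_text = False
    return result


impl = getDOMImplementation()
doc = impl.createDocument(None, "p", impl.createDocumentType("html", None, None))
p = doc.documentElement
p.appendChild(doc.createTextNode("Hello, "))
p.appendChild(doc.createTextNode("world"))
p.appendChild(doc.createElement("br"))
p.appendChild(doc.createTextNode("!"))
```

Here `xdm_children(doc)` contains only the `p` element, and `xdm_children(p)` contains three items: a merged text run, the `br` element, and a final text run.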

document-uri Accessor

The result of the dm:document-uri($node) for an HTML DOM Node is as follows:

If the node is an instance of HTML DOM Document then the result is the value of the Document.documentURI property, mapped as follows:

If the value is null or the zero-length string, then the result is the empty sequence;

Otherwise, the string value is cast to an xs:anyURI.

Otherwise, the result is the empty sequence.

is-id Accessor

The result of the dm:is-id($node) for an HTML DOM Node is as follows:

If the node is an instance of HTML DOM Attr then:

If the Attr.name property (its qualified name) is "id", then:

If the Attr.value is castable to an xs:NCName, the result is true;

Otherwise, the result is false;

Otherwise, the result is false;

Otherwise, the result is false.

In section 3.2.5, Global attributes, the id attribute is defined as being unique in the element’s tree, containing at least one character, and not having any ASCII whitespace characters. This means that an HTML id attribute may not conform to an xs:NCName.

If an HTML id is not a valid xs:NCName then that attribute is not an XML ID.
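The test can be sketched in Python as shown below. The regular expression transcribes the XML 1.0 NameStartChar and NameChar productions with the colon removed. The function name and its two string parameters are assumptions made for this illustration; stripping surrounding whitespace before matching reflects the whitespace normalization applied when casting a string to xs:NCName.

```python
import re

# NameStartChar without ":" (XML 1.0).
_START = (
    "A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"
    "\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
# NameChar adds "-", ".", digits, and a few combining ranges.
_REST = _START + "\\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"
_NCNAME = re.compile("[" + _START + "][" + _REST + "]*")


def is_id(attr_name: str, attr_value: str) -> bool:
    """True only for an attribute named "id" whose value is an NCName."""
    if attr_name != "id":
        return False
    return _NCNAME.fullmatch(attr_value.strip(" \t\r\n")) is not None
```

For example, an id of `main` satisfies the test, whereas `1st` and `a:b` are permitted as HTML ids but do not satisfy it.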

is-idrefs Accessor

The result of the dm:is-idrefs($node) for an HTML DOM Node is the empty sequence.

namespace-nodes Accessor

The result of the dm:namespace-nodes($node) for an HTML DOM Node is as follows:

If the node is an instance of HTML DOM Element then the result is an implementation-dependent sequence of namespace nodes that is sufficient to define the namespace context of the node.

Otherwise, the result is the empty sequence.

For the XHTML parsing algorithm, this will be equivalent to constructing the namespace nodes from an XML infoset, PSVI, or similar mapping.

For the HTML parsing algorithm, the specification defines the namespace context in various places:

Section 2.1.3 XML compatibility defines the default element namespace to be http://www.w3.org/1999/xhtml.

Section 4.8.15 MathML defines rules for embedded MathML content in HTML documents. Section 13.1.2 Elements defines these elements as foreign elements, placing them in the MathML namespace (http://www.w3.org/1998/Math/MathML). The default element namespace for these elements is the MathML namespace.

Section 4.8.16 SVG defines rules for embedded SVG content in HTML documents. Section 13.1.2 Elements defines these elements as foreign elements, placing them in the SVG namespace (http://www.w3.org/2000/svg). The default element namespace for these elements is the SVG namespace.

Section 13.1.2.3 Attributes defines several namespaced attributes available on foreign elements. If any of these namespaced attributes are present, a namespace node for that namespace must be present on the element.

The supported namespace prefixes are:

xlink in the http://www.w3.org/1999/xlink namespace;

xml in the http://www.w3.org/XML/1998/namespace namespace; and

xmlns in the http://www.w3.org/2000/xmlns/ namespace.

No other namespaces are supported by the HTML parser.

Section number references may change over time.

nilled Accessor

The result of the dm:nilled($node) for an HTML DOM Node is false().

node-kind Accessor

The result of the dm:node-kind($node) for an HTML DOM Node is as follows:

If the node is an instance of HTML DOM Document then the result is "document".

If the node is an instance of HTML DOM Element then the result is "element".

If the node is an instance of HTML DOM Attr then the result is "attribute".

If the node is an instance of HTML DOM ProcessingInstruction then the result is "processing-instruction".

If the node is an instance of HTML DOM Comment then the result is "comment".

If the node is an instance of HTML DOM Text then the result is "text".
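The mapping amounts to a lookup on the DOM node type. A minimal Python sketch, using `xml.dom.minidom` as a stand-in for an HTML DOM, might look like this. Returning `None` for other node types is an assumption of the sketch, since the rules above do not cover them.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

_KINDS = {
    Node.DOCUMENT_NODE: "document",
    Node.ELEMENT_NODE: "element",
    Node.ATTRIBUTE_NODE: "attribute",
    Node.PROCESSING_INSTRUCTION_NODE: "processing-instruction",
    Node.COMMENT_NODE: "comment",
    Node.TEXT_NODE: "text",
}


def node_kind(node):
    """Return the XDM node kind, or None for unmapped DOM node types."""
    return _KINDS.get(node.nodeType)


doc = parseString('<a b="1"><!--c-->t<?pi x?></a>')
element = doc.documentElement
child_kinds = [node_kind(child) for child in element.childNodes]
```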

node-name Accessor

The result of the dm:node-name($node) for an HTML DOM Node is as follows:

If the node is an instance of HTML DOM Element then the result is determined as follows:

The local name is the value of the Element.localName property. This is derived as follows:

The local name is initially set to the ASCII lowercase tag name. Section 13.2.5.8, Tag name state, specifies that ASCII upper alpha characters are appended to the element’s name in lowercase.

If the local name is an SVG element name, the case-sensitive name is used. Section 13.2.6.5, The rules for parsing tokens in foreign content, has a table mapping the lowercase element names to their SVG names.

If the local name contains a character that is not a valid XML NameStartChar or NameChar, then an implementation-defined replacement string is used. The result must be a valid NCName.

Section 13.2.9, Coercing an HTML DOM into an infoset, uses a Unnnnnn escape sequence. That would map : to U00003A.

This local name escaping applies only to the HTML parsing algorithm. If the XHTML parsing algorithm is used, the localName and prefix will be correctly set for QName-based node names.

The namespace prefix is the value of the Element.prefix property, or empty if the value is null;

The namespace URI is the value of the Element.namespaceURI property, or empty if the value is null.

If the element is an HTML element, the namespace URI is "http://www.w3.org/1999/xhtml".

If the element is an SVG element, the namespace URI is "http://www.w3.org/2000/svg".

If the element is a MathML element, the namespace URI is "http://www.w3.org/1998/Math/MathML".

If the node is an instance of HTML DOM Attr then the result is determined as follows:

The attribute name is the tokenized attribute name. Section 13.2.5.33, Attribute name state, specifies that ASCII upper alpha characters are appended to the attribute’s name in lowercase.

The local name is the value of the Attr.localName property. This is derived as follows:

The local name is initially set to the attribute name.

If the local name is an SVG or MathML attribute name, the case-sensitive name is used. Section 13.2.6.1, Creating and inserting nodes, has a table mapping the lowercase attribute names to their SVG/MathML names.

If the local name is an allowed xlink, xml, or xmlns attribute name, the local name is the value of the local name column of the attribute name mapping table in section 13.2.6.1, Creating and inserting nodes.

If the local name contains a character that is not a valid XML NameStartChar or NameChar, then an implementation-defined replacement string is used. The result must be a valid NCName.

Section 13.2.9, Coercing an HTML DOM into an infoset, uses a Unnnnnn escape sequence. That would map : to U00003A.

This local name escaping applies only to the HTML parsing algorithm. If the XHTML parsing algorithm is used, the localName and prefix will be correctly set for QName-based node names.

The namespace prefix is the value of the Attr.prefix property, or empty if the value is null.

If the attribute name is an allowed xlink, xml, or xmlns attribute name the namespace prefix is the value of the prefix column of the attribute name mapping table in section 13.2.6.1, Creating and inserting nodes.

The namespace URI is the value of the Attr.namespaceURI property, or empty if the value is null;

If the attribute name is an allowed xlink, xml, or xmlns attribute name the namespace URI is the value of the namespace column of the attribute name mapping table in section 13.2.6.1, Creating and inserting nodes.

If the node is an instance of HTML DOM ProcessingInstruction then the result is an xs:QName constructed as follows:

The local name is the value of the ProcessingInstruction.target property;

The namespace prefix is empty;

The namespace URI is empty;

Otherwise, the result is the empty sequence.

When the resulting document is an HTML DOM HTMLDocument, the Element.localName and Element.name properties of HTML DOM Element nodes are both set to the qualified name.

When the resulting document is an HTML DOM HTMLDocument, the Attr.localName and Attr.name properties of HTML DOM Attr nodes are both set to the qualified name.
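The replacement string for characters that cannot appear in an NCName is implementation-defined. The following Python sketch shows one possible choice, the Unnnnnn escape sequence mentioned above (the letter U followed by six uppercase hexadecimal digits). The function name is hypothetical and the code is not a normative algorithm.

```python
# NameStartChar without ":" (XML 1.0), as inclusive codepoint ranges.
_START_RANGES = [
    (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A), (0xC0, 0xD6), (0xD8, 0xF6),
    (0xF8, 0x2FF), (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]
# Additional characters allowed after the first position (NameChar).
_EXTRA_RANGES = [
    (0x2D, 0x2E), (0x30, 0x39), (0xB7, 0xB7), (0x300, 0x36F), (0x203F, 0x2040),
]


def _in_ranges(codepoint, ranges):
    return any(low <= codepoint <= high for low, high in ranges)


def escape_local_name(name: str) -> str:
    """Replace each character not allowed at its position with Unnnnnn."""
    out = []
    for index, char in enumerate(name):
        codepoint = ord(char)
        allowed = _in_ranges(codepoint, _START_RANGES) or (
            index > 0 and _in_ranges(codepoint, _EXTRA_RANGES)
        )
        out.append(char if allowed else "U%06X" % codepoint)
    return "".join(out)


escaped = escape_local_name("xlink:href")
```

Because every escape begins with the letter U, the result is always a valid NCName for non-empty input, including when the first character of the name is the one replaced.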

parent Accessor

The result of the dm:parent($node) for an HTML DOM Node is as follows:

Let $parent be the Node.parentNode property of the node;

If $parent is an instance of HTML DOM DocumentFragment, then for each HTML DOM HTMLTemplateElement $template in the parsed DOM tree:

Let $content be the value of the HTMLTemplateElement.content property of $template;

If $content is the same node as $parent, then the result is $template using the XDM mapping rules described in this section;

If there are no more $template nodes, then the result is an empty sequence;

If $parent is null, then the result is the empty sequence;

Otherwise, the result is $parent using the XDM mapping rules described in this section.

The current node can have an HTML DOM DocumentFragment parent node only if the include-template-content key of the html-parser-options is true().

The HTML DOM DocumentFragment’s Node.parentNode property is null, and a DocumentFragment attached to HTMLTemplateElement.content property does not have a host property connecting the fragment back to the template element.

If a future version of the HTML DOM adds a DocumentFragment.host property that references the node’s template element, or the implementation has access to that internal property, the implementation may choose to use that instead of traversing the parsed HTML tree.
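The lookup of a template element from its content fragment can be sketched as follows. The `DomNode` class is a hypothetical, minimal model of a DOM node invented for this example; a real implementation would work against its own DOM interfaces.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class DomNode:
    """Hypothetical minimal DOM node; `content` is set only on templates."""
    name: str
    parent_node: Optional["DomNode"] = None
    children: List["DomNode"] = field(default_factory=list)
    is_fragment: bool = False
    content: Optional["DomNode"] = None


def append(parent: DomNode, child: DomNode) -> DomNode:
    child.parent_node = parent
    parent.children.append(child)
    return child


def descendants(node: DomNode):
    """Document-order walk that also enters template content fragments."""
    for child in node.children:
        yield child
        yield from descendants(child)
        if child.content is not None:
            yield from descendants(child.content)


def xdm_parent(node: DomNode, document: DomNode) -> Optional[DomNode]:
    parent = node.parent_node
    if parent is not None and parent.is_fragment:
        # The fragment has no link back to its template, so search for it.
        for candidate in descendants(document):
            if candidate.content is parent:
                return candidate
        return None  # no template owns this fragment
    return parent  # None corresponds to the empty sequence


doc = DomNode("#document")
body = append(doc, DomNode("body"))
template = append(body, DomNode("template"))
template.content = DomNode("#fragment", is_fragment=True)
paragraph = append(template.content, DomNode("p"))
```

The search is linear in the size of the tree for each call, which is why a direct reference from the fragment to its template, where available, would be preferable.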

string-value Accessor

The result of the dm:string-value($node) for an HTML DOM Node is as follows:

If the node is an instance of HTML DOM Document, then use the algorithm described in Tree string construction;

If the node is an instance of HTML DOM Element, then use the algorithm described in Tree string construction;

If the node is an instance of HTML DOM Text, then use the algorithm described in Text node string construction;

Otherwise, the result is the value of the Node.nodeValue property.

Tree string construction

The following algorithm is used to construct the concatenated string value of a node in the HTML DOM tree:

Let $text be the string value "";

For each descendant node $node in document order:

If $node is not an instance of HTML DOM Text, process the next node in document order;

Append the value of the Node.nodeValue property for $node to $text;

The result is $text.
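A minimal Python sketch of this algorithm, using `xml.dom.minidom` as a stand-in for an HTML DOM, is shown below. The function name is chosen for this illustration.

```python
from xml.dom import Node
from xml.dom.minidom import parseString


def tree_string_value(node) -> str:
    """Concatenate the values of all descendant Text nodes in document order."""
    parts = []

    def walk(current):
        for child in current.childNodes:
            if child.nodeType == Node.TEXT_NODE:
                parts.append(child.data)
            else:
                walk(child)  # comments and similar nodes contribute nothing

    walk(node)
    return "".join(parts)


doc = parseString("<p>Hello <b>big</b> world<!-- ignored --></p>")
value = tree_string_value(doc)
```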

Text node string construction

The following algorithm is used to construct the maximal sequence of adjacent character information items for text node children of an element:

Let $text be the string value "";

Append the value of the Node.nodeValue property for $node to $text;

Let $next be the value of Node.nextSibling;

If $next is null, or not an instance of HTML DOM Text, the result is $text;

Otherwise, repeat from step 2 using $next as $node.

Adjacent text nodes in the HTML DOM are treated as a single XDM text node by only including the first text node and providing logic to ensure that the text content is merged into a single text block.
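The loop over following siblings can be sketched in Python as follows, again using `xml.dom.minidom` as a stand-in for an HTML DOM. The caller is assumed to pass the first Text node of a run.

```python
from xml.dom import Node
from xml.dom.minidom import getDOMImplementation


def text_node_string_value(node) -> str:
    """Merge a Text node with the Text nodes that immediately follow it."""
    text = ""
    while node is not None and node.nodeType == Node.TEXT_NODE:
        text += node.data
        node = node.nextSibling  # stops at null or at a non-Text sibling
    return text


doc = getDOMImplementation().createDocument(None, "p", None)
p = doc.documentElement
p.appendChild(doc.createTextNode("a"))
p.appendChild(doc.createTextNode("b"))
p.appendChild(doc.createElement("br"))
p.appendChild(doc.createTextNode("c"))
```

In this tree the run starting at the first child yields the merged value of the first two text nodes, and the run starting at the last child yields only that node's value.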

type-name Accessor

The result of the dm:type-name($node) for an HTML DOM Node is as follows:

If the node is an instance of HTML DOM Element then the result is xs:untyped.

If the node is an instance of HTML DOM Attr then the result is xs:untypedAtomic.

If the node is an instance of HTML DOM Text then the result is xs:untypedAtomic.

Otherwise, the result is the empty sequence.

typed-value Accessor

The result of the dm:typed-value($node) for an HTML DOM Node is as follows:

Let $string-value be the result of the string-value accessor for the node;

If the node is an instance of HTML DOM Document then the result is $string-value as an xs:untypedAtomic;

If the node is an instance of HTML DOM Element then the result is $string-value as an xs:untypedAtomic;

If the node is an instance of HTML DOM Attr then the result is $string-value as an xs:untypedAtomic;

If the node is an instance of HTML DOM Text then the result is $string-value as an xs:untypedAtomic;

Otherwise, the result is $string-value.

unparsed-entity-public-id Accessor

The result of the dm:unparsed-entity-public-id($node) for an HTML DOM Node is the empty sequence.

unparsed-entity-system-id Accessor

The result of the dm:unparsed-entity-system-id($node) for an HTML DOM Node is the empty sequence.

Functions on JSON Data

The functions listed in this section parse or serialize JSON data.

JSON is a popular format for exchange of structured data on the web: it is specified in . This section describes facilities allowing JSON data to be converted to and from XDM values.

This specification describes two ways of representing JSON data losslessly using XDM constructs. The first method uses XDM maps to represent JSON objects, and XDM arrays to represent JSON arrays. The second method represents all JSON constructs using XDM element and attribute nodes.

Note also:

The function fn:serialize has an option to generate JSON output from a structure of maps and arrays.

The function fn:element-to-map enables arbitrary XML node trees to be converted to trees of maps and arrays suitable for serializing as JSON.

Representing JSON using maps and arrays

This section defines a mapping from JSON data to XDM maps and arrays. Two functions are available to support this mapping: fn:parse-json and fn:serialize (with options selecting JSON as the output method). The fn:parse-json function will accept any JSON text as input, and converts it to XDM data values. The fn:serialize function (with JSON as the output method) will accept any XDM value produced using fn:parse-json and convert it back to the original JSON text (subject to insignificant variations such as reordering the properties in a JSON object).

The conversion is lossless if recommended JSON good practice is followed. Information may however be lost if (a) JSON numbers are not exactly representable as double-precision floating point, or (b) duplicate key values appear within a JSON object.

The representation of JSON data produced by the fn:parse-json function has been chosen with ease of manipulation as a design aim. For example, a simple JSON object such as { "Sun": 1, "Mon": 2, "Tue": 3, ... } produces a simple map, so if the result of parsing is held in $weekdays, the number for a given weekday can be extracted using an expression such as $weekdays?Tue. Similarly, a simple array such as [ "Sun", "Mon", "Tue", ... ] produces an array that can be addressed as, for example, $weekdays(3). A more deeply nested structure can be addressed in a similar way: for example if the JSON text is an array of person objects, each of which has a property named phones which is an array of strings containing phone numbers, then the first phone number of each person in the data can be addressed as $data?*?phones(1).
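For readers more familiar with general-purpose languages, the same navigation can be pictured with Python's standard `json` module. This is an analogy only, not part of the function library; the sample person data is invented for the example. Note that XDM arrays are indexed from 1, whereas Python lists are indexed from 0.

```python
import json

# A JSON object becomes a dictionary, analogous to an XDM map.
weekdays = json.loads('{ "Sun": 1, "Mon": 2, "Tue": 3 }')
tuesday = weekdays["Tue"]

# A JSON array becomes a list, analogous to an XDM array.
names = json.loads('[ "Sun", "Mon", "Tue" ]')
third = names[3 - 1]  # third member; XDM would use index 3

# Hypothetical nested data: an array of person objects with phone lists.
data = json.loads(
    '[ { "name": "A", "phones": ["111", "222"] },'
    '  { "name": "B", "phones": ["333"] } ]'
)
first_phones = [person["phones"][0] for person in data]
```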

XML Representation of JSON

This section defines a mapping from JSON data to XML (specifically, to XDM element and attribute nodes). A function fn:json-to-xml is provided to take a JSON string as input and convert it to the XML representation, and a second function fn:xml-to-json performs the reverse operation.

The XML representation is designed to be capable of representing any valid JSON text including one that uses characters which are not valid in XML. The transformation is normally lossless: that is, distinct JSON texts convert to distinct XML representations. When converting JSON to XML, options are provided to reject unsupported characters, to replace them with a substitute character, or to leave them in backslash-escaped form.

The conversion is lossless if recommended JSON good practice is followed. Information may however be lost if (a) JSON numbers are not exactly representable as double-precision floating point, or (b) duplicate key values appear within a JSON object.

The following example demonstrates the correspondence of a JSON text and the corresponding XML representation.

A JSON Text and its XML Representation

Consider the following JSON text:

The XML representation of this text is as follows. Whitespace is included in the XML representation for purposes of illustration, but it will not necessarily be present in the output of the json-to-xml function.

Distances between several cities, in kilometers. 2014-02-04T18:50:45 true London 322 Paris 265 Amsterdam 173 Brussels 322 Paris 344 Amsterdam 358 Brussels 265 London 344 Amsterdam 431 Brussels 173 London 358 Paris 431

An XSD 1.0 schema for the XML representation is provided in . It is not necessary to import this schema into the static context unless the stylesheet or query makes explicit reference to the components defined in the schema. If the stylesheet or query does import a schema for the namespace http://www.w3.org/2005/xpath-functions, then:

Unless the host language specifies otherwise, the processor (if it is schema-aware) must recognize an import declaration for this namespace, whether or not a schema location is supplied.

If a schema location is provided, then the schema document at that location must be equivalent to the schema document at ; the effect if it is not equivalent is

The rules governing the mapping from JSON to XML are as follows. In these rules, the phrase “an element named N” is to be interpreted as meaning “an element node whose local name is N and whose namespace URI is http://www.w3.org/2005/xpath-functions”.

The JSON value null is represented by an element named null, with empty content.

The JSON values true and false are represented by an element named boolean, with content conforming to the type xs:boolean. When the element is created by the fn:json-to-xml function, the string value of the element will be true or false. The fn:xml-to-json function also recognizes other strings that validate as xs:boolean, for example 1 and 0. Leading and trailing whitespace is accepted.

A JSON number is represented by an element named number, with content conforming to the type xs:double, with the additional restriction that the value must not be positive or negative infinity, nor NaN. The fn:json-to-xml function creates an element whose string value is lexically the same as the JSON representation of the number. The fn:xml-to-json function generates a JSON representation that is the result of casting the (typed or untyped) value of the node to xs:double and then casting the result to xs:string. Leading and trailing whitespace is accepted. Since JSON does not impose limits on the range or precision of numbers, these rules mean that conversion from JSON to XML will always succeed, and will retain full precision in the lexical representation unless the data model implementation is one that reconstructs the string value from the typed value. In the reverse direction, conversion from XML to JSON may fail if the value is infinity or NaN, or if the string value is such that casting to xs:double produces positive or negative infinity.

A JSON string is represented by an element named string, with content conforming to the type xs:string. The string element has two alternative representations: escaped form, and unescaped form.

A JSON array is represented by an element named array. The content is a sequence of child elements representing the members of the array in order, each such element being the representation of the array member obtained by applying these rules recursively.

A JSON object is represented by an element named map. The content is a sequence of child elements each of which represents one of the name/value pairs in the object. The representation of the name/value pair N:V is obtained by taking the element that represents the value V (by applying these rules recursively) and adding an attribute with name key (in no namespace), whose value is N as an instance of xs:string. The functions fn:json-to-xml and fn:xml-to-json both retain the order of entries, subject to rules about how duplicate keys are handled. The key may be represented in escaped or unescaped form.

The attribute escaped="true" may be specified on a string element to indicate that the string value contains backslash-escaped characters that are to be interpreted according to the JSON rules. The attribute escaped-key="true" may be specified on any element with a key attribute to indicate that the key contains backslash-escaped characters that are to be interpreted according to the JSON rules. Both attributes have the default value false, signifying that the relevant value is in unescaped form. In unescaped form, the backslash character has no special significance (it represents itself).
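The mapping rules above can be sketched in Python using the standard `json` and `xml.etree.ElementTree` modules. This is an illustration under simplifying assumptions, not a description of fn:json-to-xml: it always produces unescaped form, it does not deal with characters that are invalid in XML, and because the input passes through Python's JSON parser the original lexical form of numbers is not preserved.

```python
import json
import xml.etree.ElementTree as ET

FN = "http://www.w3.org/2005/xpath-functions"


def to_element(value, key=None):
    """Build the element representing one JSON value, recursively."""
    if value is None:
        elem = ET.Element(f"{{{FN}}}null")
    elif isinstance(value, bool):  # test bool before number: bool is an int
        elem = ET.Element(f"{{{FN}}}boolean")
        elem.text = "true" if value else "false"
    elif isinstance(value, (int, float)):
        elem = ET.Element(f"{{{FN}}}number")
        elem.text = json.dumps(value)
    elif isinstance(value, str):
        elem = ET.Element(f"{{{FN}}}string")
        elem.text = value
    elif isinstance(value, list):
        elem = ET.Element(f"{{{FN}}}array")
        for member in value:
            elem.append(to_element(member))
    elif isinstance(value, dict):
        elem = ET.Element(f"{{{FN}}}map")
        for name, member in value.items():  # insertion order is retained
            elem.append(to_element(member, name))
    else:
        raise TypeError(f"not a JSON value: {value!r}")
    if key is not None:
        elem.set("key", key)  # the key attribute is in no namespace
    return elem


root = to_element(
    json.loads('{"name": "Ada", "tags": ["x", null], "ok": true, "n": 322}')
)
```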

The JSON grammar for number is a subset of the lexical space of the XSD type xs:double. The mapping from JSON number values to xs:double values is defined by the XPath rules for casting from xs:string to xs:double. Note that these rules will never generate an error for out-of-range values; instead very large or very small values will be converted to +INF or -INF. Since JSON does not impose limits on the range or precision of numbers, the conversion is not guaranteed to retain full precision.
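The effect of mapping a JSON number to a double can be seen in any language with IEEE 754 double-precision arithmetic. The Python sketch below is only an analogy for the xs:string to xs:double cast: Python's `float()` accepts some lexical forms that JSON does not, so it should not be used as a JSON number validator.

```python
def json_number_to_double(lexical: str) -> float:
    """Convert a JSON number's lexical form to a double (no error on overflow)."""
    return float(lexical)


too_large = json_number_to_double("1e400")       # beyond the double range
too_negative = json_number_to_double("-1e400")
long_fraction = "0.12345678901234567890123"
rounded = json_number_to_double(long_fraction)   # digits beyond double precision are lost
```

Here `too_large` and `too_negative` are positive and negative infinity respectively, and converting `rounded` back to a string does not reproduce `long_fraction`.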

Although the order of entries in a JSON object is generally considered to have no significance, the functions json-to-xml and xml-to-json both retain order.

The XDM representation of a JSON value may either be untyped (all elements annotated as xs:untyped, attributes as xs:untypedAtomic), or it may be typed. If it is typed, then it must have the type annotations obtained by validating the untyped representation against the schema given in . If it is untyped, then it must be an XDM instance such that validation against this schema would succeed; with the proviso that all attributes other than those in no namespace or in namespace http://www.w3.org/2005/xpath-functions are ignored, including attributes such as xsi:type and xsi:nil that would normally influence the process of schema validation.

The namespace prefix associated with the namespace http://www.w3.org/2005/xpath-functions (if any) is immaterial. The effect of the fn:xml-to-json function does not depend on the choice of prefix, and the prefix (if any) generated by the fn:json-to-xml function is implementation-dependent.

JSON character repertoire

The rules regarding use of non-XML characters in JSON texts have been relaxed.

The set of characters that may appear in JSON texts is not the same as the set of characters allowed in XML. Specifically:

As plain unescaped characters, JSON allows any codepoint in the numeric range 0x20 to 0x10FFFF, with the exception of U+0022 and U+005C.

As a backslash-escaped character, JSON allows any codepoint in the numeric range 0x00 to 0xFFFF.

Whether escaped or not, the JSON grammar allows codepoints in the surrogate range to appear, and does not explicitly require that they be properly paired. However, the JSON specifications recognize that unpaired surrogates are likely to lead to interoperability problems.

Ignoring unpaired surrogates, this means that JSON allows codepoints that are not allowed by XML:

Not allowed by XML 1.0: 0x00 to 0x1F (other than 0x09, 0x0A, and 0x0D); 0xFFFE; 0xFFFF.

Not allowed by XML 1.1: 0x00; 0xFFFE; 0xFFFF.

The XDM data model (see ) allows an implementation to define the set of permitted characters in the xs:string data type in such a way that any Unicode codepoint assigned to a character (which excludes surrogates) is allowed. However, this is not required: a conformant implementation may restrict the set of codepoints to those permitted by XML 1.0 or XML 1.1.

In consequence, parsing of conformant JSON texts may fail if they contain codepoints that the implementation does not support. However, if such codepoints are represented in the input using JSON escape sequences, these specifications define mechanisms for dealing with them, for example by substituting a replacement character.
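The codepoint ranges above can be checked with a short Python sketch. The two predicates transcribe the Char productions of XML 1.0 and XML 1.1; the function names are chosen for this illustration.

```python
def allowed_in_xml10(cp: int) -> bool:
    """Char production of XML 1.0."""
    return (
        cp in (0x09, 0x0A, 0x0D)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def allowed_in_xml11(cp: int) -> bool:
    """Char production of XML 1.1."""
    return (
        0x01 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def json_only_codepoints(allowed):
    """Non-surrogate codepoints up to 0xFFFF that the given XML version rejects."""
    return [
        cp
        for cp in range(0x0000, 0x10000)
        if not 0xD800 <= cp <= 0xDFFF and not allowed(cp)
    ]


rejected_by_xml10 = json_only_codepoints(allowed_in_xml10)
rejected_by_xml11 = json_only_codepoints(allowed_in_xml11)
```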

Functions on CSV Data

New functions are available for processing input data in CSV (comma separated values) format.

This section describes functions that parse CSV data.

The term comma separated values or CSV refers to a wide variety of plain-text tabular data formats with fields and records separated by standard character delimiters (often, but not invariably, commas).

A CSV is a 2-dimensional tabular data structure consisting of multiple rows (also known as records). Each row contains multiple fields. Fields occupying the same position in successive rows constitute a column. Columns are identified by position and optionally by name. Column names can be assigned within a CSV using an optional header row.

CSV has developed informally for decades, and many variations are found. This specification refers to RFC 4180, which provides a standardized grammar. This specification extends the grammar defined in RFC 4180 as follows:

This specification uses the term row where RFC 4180 uses record.

Line endings are normalized: specifically, the character sequences U+000D, or U+000D followed by U+000A, are converted to a single U+000A character. This applies whether or not the line ending appears within a quoted string, and whether or not U+000A is the chosen row delimiter.

Row delimiters other than newline are recognized.

Field delimiters other than U+002C are recognized.

Quote characters other than U+0022 are recognized.

Non-ASCII characters are recognized.
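The line-ending normalization described above can be sketched in Python as two replacements, applied in this order so that a CRLF pair becomes a single LF rather than two. The function name is chosen for this illustration.

```python
def normalize_line_endings(text: str) -> str:
    """Convert CRLF and lone CR to LF, everywhere in the input."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


normalized = normalize_line_endings('a,"x\r\ny"\r\nb,c\rd,e\n')
```

Because the replacement is applied to the whole input before any parsing, line endings inside quoted fields are normalized as well.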

This specification defines a mapping from this extended grammar to constructs in the XDM model, and provides illustrative examples of how these constructs can be combined with other language features to process CSV data.

The most basic function for parsing CSV is fn:csv-to-arrays which recognizes the delimiters for rows and fields and returns a sequence of arrays each corresponding to one row. The fields within each array are represented as instances of xs:string.

The other two functions recognize column names, and make it easier to address individual fields using these names. The fn:parse-csv function delivers this capability using XDM maps and functions, while the fn:csv-to-xml function represents the information using XDM element nodes.

CSV delimiters

The delimiters used for rows, columns, and quoting are configurable. An error is raised if the same delimiter string is used in multiple roles.

Rows in CSV files are typically delimited with CRLF (U+000D, U+000A), LF (U+000A), or CR (U+000D) line endings, although RFC 4180 specifies CRLF. The CSV parsing functions normalize these line endings to LF (U+000A). They therefore use LF as the default row delimiter.

The last row in the file may or may not be followed by a row delimiter. The empty file is treated as containing zero rows, while a file consisting solely of a row delimiter is treated as containing one empty row. In all other cases, a file that does not end with a row delimiter is treated as if a row delimiter were added at the end.
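The treatment of the final row delimiter can be sketched in Python as follows. The sketch deliberately ignores quoting (a row delimiter inside a quoted field would need to be skipped), so it illustrates only the end-of-input rule; the function name is chosen for this example.

```python
def split_rows(text: str, row_delimiter: str = "\n") -> list:
    """Split unquoted CSV text into rows, applying the end-of-input rule."""
    if text == "":
        return []  # the empty file contains zero rows
    if not text.endswith(row_delimiter):
        text += row_delimiter  # behave as if a final delimiter were present
    # Every row is now terminated, so the piece after the last delimiter is empty.
    return text.split(row_delimiter)[:-1]


no_rows = split_rows("")
one_empty_row = split_rows("\n")
without_final_delimiter = split_rows("a,b\nc,d")
with_final_delimiter = split_rows("a,b\nc,d\n")
```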

Fields in CSV are frequently delimited with a comma. Other field delimiters are useful, for example when numeric data uses comma as a decimal separator. The chosen field delimiter is then often U+003B or U+0009.

The column delimiter thus defaults to U+002C. The value may be any single Unicode character. An error is raised if the column-delimiter option is set to a multi-character string.

Field quoting

CSVs, as specified in RFC 4180, require that fields be wrapped with a quote character if they contain either the row or column delimiter. For example:

"A single field, containing a comma","another field containing CRLF within it"

If a field is to contain the quote character, the character must be escaped by doubling it, as with escaping of quotes in XPath string literals. An error is raised if a quote character appears within a field incorrectly escaped, for example:

incorrectly escaped " quote character

The quotes surrounding quoted fields are not included in the result. The following input string, when parsed, produces a sequence of strings, as shown below:

'"Field 1","Field 2","Field ""with quotes"" 3"'

('Field 1', 'Field 2', 'Field "with quotes" 3')

The quote character defaults to U+0022.

No space is allowed between the column delimiter and a quote. An error is raised if whitespace or other characters occur between a quote character and the nearest column delimiter.

The following example is therefore invalid and parsing it will raise an error.

'"Field 1", "Field 2", "Field 3"'
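The quoting rules can be sketched as a small hand-written parser for a single row. This Python code is an illustration only: it assumes the row has already been separated from its neighbours, it uses `ValueError` where the specification raises a dynamic error, and the function name is invented for the example.

```python
def parse_row(row: str, delimiter: str = ",", quote: str = '"') -> list:
    """Split one CSV row into fields, enforcing the quoting rules."""
    fields = []
    i, n = 0, len(row)
    while True:
        if i < n and row[i] == quote:
            i += 1  # skip the opening quote
            chars = []
            while True:
                if i >= n:
                    raise ValueError("unterminated quoted field")
                if row[i] == quote:
                    if i + 1 < n and row[i + 1] == quote:
                        chars.append(quote)  # doubled quote is a literal quote
                        i += 2
                        continue
                    i += 1  # skip the closing quote
                    break
                chars.append(row[i])
                i += 1
            if i < n and row[i] != delimiter:
                raise ValueError("unexpected character after closing quote")
            fields.append("".join(chars))
        else:
            start = i
            while i < n and row[i] != delimiter:
                if row[i] == quote:
                    raise ValueError("quote character in unquoted field")
                i += 1
            fields.append(row[start:i])
        if i >= n:
            return fields
        i += 1  # skip the delimiter and read the next field


parsed = parse_row('"Field 1","Field 2","Field ""with quotes"" 3"')
```

A space between a delimiter and an opening quote causes the field to be read as unquoted, so the quote that follows is reported as an error, which matches the rule that no space is allowed there.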

Basic parsing of CSV to arrays

The result of fn:csv-to-arrays is a sequence of rows, where each row is represented as an array of xs:string values.

The first row of the CSV is returned in the same way as all the other rows. fn:csv-to-arrays does not distinguish between a header row and data rows, and returns all of them.

A CSV with fixed-width rows

For example, given the input:

'Column 1,Column 2,Column 3
Field 1A,Field 1B,Field 1C
Field 2A,Field 2B,Field 2C'

the fn:csv-to-arrays function produces

( [ "Column 1", "Column 2", "Column 3" ],
  [ "Field 1A", "Field 1B", "Field 1C" ],
  [ "Field 2A", "Field 2B", "Field 2C" ] )

A CSV with variable-width rows

It is common practice for all rows in a CSV to have the same number of columns, but this is not required.

'Column 1,Column 2,Column 3
Field 1A,Field 1B,Field 1C
Field 2A,Field 2B,Field 2C,Field 2D'

produces

( [ "Column 1", "Column 2", "Column 3" ],
  [ "Field 1A", "Field 1B", "Field 1C" ],
  [ "Field 2A", "Field 2B", "Field 2C", "Field 2D" ] )

RFC 4180 states that CSVs should contain the same number of fields in each row, so that there is a uniform number of columns. However, the reality is that CSVs can, and sometimes do, contain a variable number of fields in a row. As a result, this function does not truncate or pad the number of fields in each row for any reason. The fn:csv-to-xml and fn:parse-csv functions provide facilities to enforce uniformity and an expected number of columns.
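The same "no padding, no truncation" behaviour can be observed with Python's standard `csv` module, used here purely as an analogy for fn:csv-to-arrays. The `is_uniform` helper is a hypothetical check that a caller might apply afterwards.

```python
import csv
import io

text = (
    "Column 1,Column 2,Column 3\r\n"
    "Field 1A,Field 1B,Field 1C\r\n"
    "Field 2A,Field 2B,Field 2C,Field 2D\r\n"
)

# newline="" leaves line endings untouched so the csv module can handle them.
rows = list(csv.reader(io.StringIO(text, newline="")))
widths = [len(row) for row in rows]


def is_uniform(parsed_rows) -> bool:
    """True if every row has the same number of fields."""
    return len({len(row) for row in parsed_rows}) <= 1


uniform = is_uniform(rows)
```

The header row is returned like any other row, and the third row keeps its extra field.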

Enhanced parsing of CSV data to maps and arrays

While fn:csv-to-arrays simply delivers the CSV content as a sequence of arrays, the fn:parse-csv function goes a step further and enables access to the data using column names. The column names may be taken either from the first row of the CSV data, or from data supplied by the caller in the options parameter.

Representing CSV data as XML

The fn:csv-to-xml function returns an XDM node tree representing the CSV data. Following is a CSV text and the XML serialization of the corresponding node tree.

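Name,Date,Amount
Alice,2023-07-14,1.23
Bob,2023-07-14,2.34

<csv>
  <columns>
    <column>Name</column>
    <column>Date</column>
    <column>Amount</column>
  </columns>
  <rows>
    <row>
      <field column="Name">Alice</field>
      <field column="Date">2023-07-14</field>
      <field column="Amount">1.23</field>
    </row>
    <row>
      <field column="Name">Bob</field>
      <field column="Date">2023-07-14</field>
      <field column="Amount">2.34</field>
    </row>
  </rows>
</csv>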

If no non-empty column names are available, then the columns element and all column attributes are absent. If non-empty column names are available for some columns but not for others, then (a) the empty column element is included within the columns element if and only if there is a subsequent column with a non-empty name, and (b) the column attribute for the corresponding field elements is absent.

For example (when no column names are available):

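<csv>
  <rows>
    <row>
      <field>Name</field>
      <field>Date</field>
      <field>Amount</field>
    </row>
    <row>
      <field>Alice</field>
      <field>2023-07-14</field>
      <field>1.23</field>
    </row>
    <row>
      <field>Bob</field>
      <field>2023-07-14</field>
      <field>2.34</field>
    </row>
  </rows>
</csv>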

An XSD 1.0 schema for the XML representation is provided in .
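The rules governing the columns element and the column attribute can be sketched in Python with `xml.etree.ElementTree`. This is an illustration under stated assumptions: rows are supplied as already-parsed lists of strings, the function name is invented, and the namespace of the generated elements is omitted for brevity.

```python
import xml.etree.ElementTree as ET


def csv_rows_to_xml(rows, column_names=None):
    """Build a csv element tree from parsed rows and optional column names."""
    names = list(column_names or [])
    while names and names[-1] == "":
        names.pop()  # an empty name is kept only if a named column follows it
    csv_el = ET.Element("csv")
    if names:  # no non-empty names: the columns element is absent
        columns = ET.SubElement(csv_el, "columns")
        for name in names:
            ET.SubElement(columns, "column").text = name
    rows_el = ET.SubElement(csv_el, "rows")
    for row in rows:
        row_el = ET.SubElement(rows_el, "row")
        for index, value in enumerate(row):
            field = ET.SubElement(row_el, "field")
            if index < len(names) and names[index] != "":
                field.set("column", names[index])  # only for named columns
            field.text = value
    return csv_el


doc = csv_rows_to_xml(
    [["Alice", "2023-07-14", "1.23"]], ["Name", "", "Amount", ""]
)
column_texts = [column.text for column in doc.find("columns")]
field_columns = [f.get("column") for f in doc.find("rows")[0]]
```

In this example the second column has no name but is followed by a named column, so an empty column element is emitted for it, while the trailing unnamed column produces none.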

Illustrative examples of processing CSV data

The following examples illustrate more complex applications making use of CSV parsing functions.

A variable $crlf is assumed to be in scope representing the CRLF string:

let $crlf := fn:char(0x0D)||fn:char(0x0A)

Converting a CSV into an HTML-style table using fn:parse-csv

Direct conversion is a matter of iterating across the records and fields to generate <tr> and <td> elements.

Using XQuery:

<table>
  <thead>
    <tr>{ for $column in $csv?columns?fields return <th>{ $column }</th> }</tr>
  </thead>
  <tbody>{
    for $row in $csv?rows
    return <tr>{ for $field in $row?fields return <td>{ $field }</td> }</tr>
  }</tbody>
</table>

Using XSLT:

{ . } { . }
Converting a CSV into an HTML-style table using fn:csv-to-xml

The fn:csv-to-xml function makes these kinds of conversion-to-XML-table tasks simpler by providing a simple XML representation of the data. Here, in XQuery:

<table>
  <thead>
    <tr>{ for $column in $csv/csv/columns/column return <th>{ $column }</th> }</tr>
  </thead>
  <tbody>{
    for $row in $csv/csv/rows/row
    return <tr>{ for $field in $row/field return <td>{ $field }</td> }</tr>
  }</tbody>
</table>

And in XSLT:

{ . } { . }
Functions on Invisible XML

This section describes functions that support parsing.

Invisible XML defines a BNF-like language for specifying grammars, together with a mapping from sentences in that grammar to an XML representation. By defining an Invisible XML grammar, a great variety of non-XML data formats can be manipulated as if they were XML. The function fn:invisible-xml takes a grammar as input, and returns a function which can be used for parsing data instances and converting them to XML node trees.

Dynamic evaluation

The following functions allow dynamic loading and evaluation of XQuery queries, XSLT stylesheets, and XPath binary operators.

Processing types

New functions are provided to obtain information about built-in types and types defined in an imported schema.

The functions in this section deliver information about schema types (including simple types and complex types). These may represent built-in types (such as xs:dateTime), user-defined types found in the static context (typically because they appear in an imported schema), or types used as type annotations on schema-validated nodes.

For more information on schema types, see . The properties of a schema type are described in terms of the properties of a Simple Type Definition or Complex Type Definition component as described in and respectively. Not all properties are exposed.

The structured representation of a schema type is described in .

Simple properties of a schema type that can be expressed as strings or booleans are represented in this record structure directly as atomic field values, while complex properties whose values are themselves types (for example, base-type and primitive-type) are represented as functions. This is done partly to make it easier for implementations to compute complex properties on demand rather than in advance, and partly to ensure that the overall structure is always acyclic. For example, the primitive type of xs:decimal is itself xs:decimal, and if this were represented as a field value without a guarding function, serialization of the map using the JSON output method would not terminate.
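The reason for guarding type-valued properties with functions can be shown with a small model. The following Python sketch is illustrative only: the TYPES registry, the register and atomic_fields helpers, and the is-simple field are hypothetical and are not the record structure defined by this specification. It shows that a type can refer to itself through a function while a serialization of its plain fields still terminates.

```python
import json

# Hypothetical registry of type records, keyed by type name.
TYPES: dict[str, dict] = {}


def register(name: str, primitive: str) -> dict:
    """Create a type record whose type-valued property is a function."""
    record = {
        "name": name,        # simple property, stored directly
        "is-simple": True,   # simple property, stored directly
        # Complex property: looked up only when the function is called.
        "primitive-type": lambda: TYPES[primitive],
    }
    TYPES[name] = record
    return record


def atomic_fields(record: dict) -> dict:
    """Keep only plain values, leaving out the function-valued fields."""
    return {key: value for key, value in record.items() if not callable(value)}


decimal_type = register("xs:decimal", "xs:decimal")
integer_type = register("xs:integer", "xs:decimal")

# Serializing the plain fields terminates even though xs:decimal
# is its own primitive type.
as_json = json.dumps(atomic_fields(decimal_type), sort_keys=True)
```

Had primitive-type been stored as a nested record instead, the xs:decimal record would contain itself and a naive serializer would recurse without end.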

Functions returning type information
Accessing the context

The following functions are defined to obtain information from the static or dynamic context.

Errors and diagnostics

Raising errors

In this document, as well as in and , the phrase an error is raised is used. Raising an error is equivalent to calling the fn:error function defined in this section with the provided error code. Except where otherwise specified, errors defined in this specification are dynamic errors. Some errors, however, are classified as type errors. Type errors are typically used where the presence of the error can be inferred from knowledge of the type of the actual arguments to a function, for example with a call such as fn:string(fn:abs#1). Host languages may allow type errors to be reported statically if they are discovered during static analysis.

When function specifications indicate that an error is to be raised, the notation [error code] is used to specify an error code. Each error defined in this document is identified by an xs:QName that is in the http://www.w3.org/2005/xqt-errors namespace, represented in this document by the err prefix. It is this xs:QName that is actually passed as an argument to the fn:error function. Calling this function raises an error. For a more detailed treatment of error handling, see .

The fn:error function is a general function that may be called as above but may also be called from or applications with, for example, an xs:QName argument.

Diagnostic tracing
Constructor functions

Constructor functions now have a zero-arity form; the first argument defaults to the context item.

Constructor functions are used to convert a supplied value to a given type, and the name of the function is the same as the name of the target type. This section describes constructor functions corresponding to the following types:

Simple types (atomic types, union types, and list types as defined in ), which are present in the static context either because they appear in the in-scope schema types or because they appear as named item types.

These constructor functions always take a single argument.

Record types defined as named item types.

These take one argument for each named field of the record type. Constructor functions for record types are defined in .

Constructor functions are defined for all user-defined named simple types, and for most built-in atomic, list, and union types. The only named simple types that have no constructor function are those that have no instances other than instances of their derived types: specifically, xs:anySimpleType, xs:anyAtomicType, and xs:NOTATION.

Constructor functions for XML Schema built-in atomic types

Every built-in atomic type that is defined in , except xs:anyAtomicType and xs:NOTATION, has an associated constructor function. The type xs:untypedAtomic, defined in and the two derived types xs:yearMonthDuration and xs:dayTimeDuration defined in also have associated constructor functions. Implementations may additionally provide a constructor function for the datatype xs:dateTimeStamp introduced in .

A constructor function is not defined for xs:anyAtomicType as there are no atomic items with type annotation xs:anyAtomicType at runtime, although this can be a statically inferred type. A constructor function is not defined for xs:NOTATION since it is defined as an abstract type in . If the static context (See ) contains a type derived from xs:NOTATION then a constructor function is defined for it. See .

The form of the constructor function for an atomic type eg:TYPE is:

If $arg is the empty sequence, the empty sequence is returned. For example, the signature of the constructor function corresponding to the xs:unsignedInt type defined in is:

Calling the constructor function xs:unsignedInt(12) returns the xs:unsignedInt value 12. Another call of that constructor function that returns the same xs:unsignedInt value is xs:unsignedInt("12").

The same result would also be returned if the constructor function were to be called with a node that had a typed value equal to the xs:unsignedInt 12. Because the declared parameter type for the argument is xs:anyAtomicType?, the coercion rules will atomize the supplied argument (see ) to extract its typed value and then call the constructor with the atomized value.

If the value passed to a constructor function, after atomization, is not in the lexical space of the datatype to be constructed, and cannot be converted to a value in the value space of the datatype under the rules in , then a dynamic error is raised.
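The behavior described for xs:unsignedInt can be modeled in a few lines. The following Python sketch is a simplified illustration, not the constructor function itself: the function name unsigned_int is hypothetical, None stands in for the empty sequence, ValueError stands in for the dynamic error, and only integer and string arguments are handled.

```python
import re
from typing import Optional, Union

# Simplified lexical form: optional sign followed by decimal digits.
_INTEGER_LEXICAL = re.compile(r"[+-]?[0-9]+")
_MAX_UNSIGNED_INT = 4294967295  # 2**32 - 1


def unsigned_int(arg: Union[int, str, None]) -> Optional[int]:
    """Model of xs:unsignedInt($arg) for integer and string arguments."""
    if arg is None:
        return None  # the empty sequence yields the empty sequence
    if isinstance(arg, str):
        text = arg.strip(" \t\r\n")  # leading and trailing whitespace is ignored
        if not _INTEGER_LEXICAL.fullmatch(text):
            raise ValueError(f"{arg!r} is not in the lexical space")
        value = int(text)
    else:
        value = arg
    if not 0 <= value <= _MAX_UNSIGNED_INT:
        raise ValueError(f"{value} is outside the value space")
    return value
```

With this model, unsigned_int(12) and unsigned_int("12") return the same value, mirroring the two calls discussed above.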

The semantics of the constructor function xs:TYPE(arg) are identical to the semantics of arg cast as xs:TYPE? . See .

If the argument to a constructor function is a literal, the result of the function may be evaluated statically; if an error is found during such evaluation, it may be reported as a static error.

Special rules apply to constructor functions for xs:QName and types derived from xs:QName and xs:NOTATION. See .

The argument is optional, and defaults to the context value (which will be atomized if necessary).

The following constructor functions for the built-in atomic types are supported:

Implementations should return negative zero for xs:float("-0.0E0"). But because does not distinguish between the values positive zero and negative zero, implementations may return positive zero in this case.

Implementations should return negative zero for xs:double("-0.0E0"). But because does not distinguish between the values positive zero and negative zero, implementations may return positive zero in this case.

See for special rules.

See for rules related to constructing values of type xs:ENTITY and types derived from it.

Available only if the implementation supports XSD 1.1.

Constructor functions for xs:QName and xs:NOTATION

Special rules apply to constructor functions for the types xs:QName and xs:NOTATION, for two reasons:

Values cannot belong directly to the type xs:NOTATION, only to its subtypes.

The lexical representation of these types uses namespace prefixes, whose meaning is context-dependent.

These constraints result in the following rules:

There is no constructor function for xs:NOTATION. Constructors are defined, however, for xs:QName, for types derived or constructed from xs:QName, and for types derived or constructed from xs:NOTATION.

When converting from an xs:string, the prefix within the lexical xs:QName supplied as the argument is resolved to a namespace URI using the statically known namespaces from the static context. If the lexical xs:QName has no prefix, the namespace URI of the resulting expanded-QName is the default namespace for elements and types, taken from the static context. Components of the static context are defined in . A dynamic error is raised if the prefix is not bound in the static context. As described in , the supplied prefix is retained as part of the expanded-QName value.

When a constructor function for a namespace-sensitive type is used as a literal function item or in a partial function application (for example, xs:QName#1 or xs:QName(?)) the namespace bindings that are relevant are those from the static context of the literal function item or partial function application. When a constructor function for a namespace-sensitive type is obtained by means of the fn:function-lookup function, the relevant namespace bindings are those from the static context of the call on fn:function-lookup.

When the supplied argument to the xs:QName constructor function is a node, the node is atomized in the usual way, and if the result is xs:untypedAtomic it is then converted as if a string had been supplied. The effect might not be what is desired. For example, given the attribute xsi:type="my:type", the expression xs:QName(@xsi:type) might fail on the grounds that the prefix my is undeclared. This is because the namespace bindings are taken from the static context (that is, from the query or stylesheet), and not from the source document containing the @xsi:type attribute. The solution to this problem is to use the function call resolve-QName(@xsi:type, .) instead.
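The difference between resolving a prefix against the static context and against the source document can be made concrete. The following Python sketch is illustrative only: the QName tuple, the resolve_lexical_qname helper, and the example namespace URIs are hypothetical, and no validation of the name syntax is attempted.

```python
from typing import NamedTuple


class QName(NamedTuple):
    namespace: str
    prefix: str  # the supplied prefix is retained in the result
    local: str


def resolve_lexical_qname(
    lexical: str, bindings: dict[str, str], default_ns: str = ""
) -> QName:
    """Resolve prefix:local against the supplied prefix bindings."""
    prefix, separator, local = lexical.partition(":")
    if not separator:
        # No prefix: use the default namespace that was supplied.
        return QName(default_ns, "", lexical)
    if prefix not in bindings:
        raise LookupError(f"prefix {prefix!r} is not bound")
    return QName(bindings[prefix], prefix, local)


# Bindings declared in the query or stylesheet (static context).
STATIC_BINDINGS = {"xs": "http://www.w3.org/2001/XMLSchema"}
# Bindings in scope on the element carrying xsi:type="my:type".
SOURCE_BINDINGS = {"my": "http://example.com/my"}

from_source = resolve_lexical_qname("my:type", SOURCE_BINDINGS)
```

Calling resolve_lexical_qname("my:type", STATIC_BINDINGS) fails in this model, which corresponds to xs:QName(@xsi:type) failing, while resolving against the source bindings corresponds to resolve-QName(@xsi:type, .).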

Constructor functions for XML Schema built-in list types

Each of the three built-in list types defined in , namely xs:NMTOKENS, xs:ENTITIES, and xs:IDREFS, has an associated constructor function.

The function signatures are as follows:

The semantics are equivalent to casting to the corresponding types from xs:string.

All three of these types have the facet minLength = 1 meaning that there must always be at least one item in the list. The return type, however, allows for the fact that when the argument to the function is the empty sequence, the result is the empty sequence.

In the case of atomic types, it is possible to use an expression such as xs:date(@date-of-birth) to convert an attribute value to an instance of xs:date, knowing that this will work both in the case where the attribute is already annotated as xs:date, and also in the case where it is xs:untypedAtomic. This approach does not work with list types, because it is not permitted to use a value of type xs:NMTOKEN* as input to the constructor function xs:NMTOKENS. Instead, it is necessary to use conditional logic that performs the conversion only in the case where the input is untyped: if (@x instance of attribute(*, xs:untypedAtomic)) then xs:NMTOKENS(@x) else data(@x)
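The constructor semantics for a list type can be sketched as follows. This Python model is an assumption-laden illustration rather than the xs:NMTOKENS function: the name nmtokens is hypothetical, None stands in for the empty sequence, a Python list stands in for the result sequence, and the NMTOKEN check is only an approximation of the XML name-character rules.

```python
import re
from typing import Optional

# Approximation of the NMTOKEN syntax, not the exact XML production.
_NMTOKEN_APPROX = re.compile(r"[\w.\-:]+")


def nmtokens(arg: Optional[str]) -> list[str]:
    """Model of xs:NMTOKENS($arg) for a string argument."""
    if arg is None:
        return []  # empty sequence in, empty sequence out
    # Tokens are separated by XML whitespace (space, tab, CR, LF).
    tokens = [token for token in re.split(r"[ \t\r\n]+", arg) if token]
    if not tokens:
        # minLength = 1: a list with no items is not valid.
        raise ValueError("xs:NMTOKENS requires at least one token")
    for token in tokens:
        if not _NMTOKEN_APPROX.fullmatch(token):
            raise ValueError(f"{token!r} is not a valid NMTOKEN")
    return tokens
```

Note that the function accepts only a string, in line with the restriction that a value that is already a list of xs:NMTOKEN items cannot be passed to the constructor.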

Constructor functions for XML Schema built-in union types

There is a constructor function for the union type xs:numeric defined in . The function signature is:

The semantics are determined by the rules in . These rules have the effect that:

If the argument is an instance of xs:double, xs:float, or xs:decimal, then the result is an instance of the same primitive type, with the same value;

If the argument is an instance of xs:boolean, the result is the xs:double value 0.0e0 or 1.0e0;

If the argument is an instance of xs:string or xs:untypedAtomic, then:

If the value is in the lexical space of xs:double, the result will be the corresponding xs:double value;

Otherwise, a dynamic error occurs;

The result will never be an instance of xs:float, xs:decimal, or xs:integer. This is because xs:double appears first in the list of member types of xs:numeric, and its lexical space subsumes the lexical space of the other numeric types. Thus, unlike XPath numeric literals, the result does not depend on the lexical form of the supplied value. The reason for this design choice is to retain compatibility with the function conversion rules: functions such as fn:abs and fn:round are declared to expect an instance of xs:numeric as their first or only argument, and compatibility with the function conversion rules defined in earlier versions of these specifications demands that when an untyped atomic item (or untyped node) is supplied as the argument, it is converted to an xs:double value even if its lexical form is that (say) of an integer.

In all other cases, a dynamic error occurs.
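The rules listed above can be summarized in a short model. The following Python sketch makes several simplifying assumptions and is not the xs:numeric constructor: the function name numeric is hypothetical, Python float stands in for xs:double, Decimal for xs:decimal, int for xs:integer, xs:float is not modeled, and the lexical pattern follows the XSD 1.1 form of the xs:double lexical space.

```python
import re
from decimal import Decimal
from typing import Union

# Lexical space of xs:double as given in XSD 1.1 (assumed here).
_DOUBLE_LEXICAL = re.compile(
    r"[+-]?(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)(?:[eE][+-]?[0-9]+)?|[+-]?INF|NaN"
)


def numeric(arg: Union[bool, int, float, Decimal, str]):
    """Model of xs:numeric($arg) for a few Python stand-in types."""
    if isinstance(arg, bool):  # must be tested before int
        return 1.0 if arg else 0.0  # xs:boolean becomes xs:double
    if isinstance(arg, (int, float, Decimal)):
        return arg  # already numeric: value is kept as it is
    if isinstance(arg, str):
        text = arg.strip(" \t\r\n")
        if not _DOUBLE_LEXICAL.fullmatch(text):
            raise ValueError(f"{arg!r} is not in the lexical space of xs:double")
        if text.endswith("INF"):
            return float("-inf") if text.startswith("-") else float("inf")
        return float(text)  # always a double, even for "12"
    raise TypeError(f"cannot convert {type(arg).__name__} to xs:numeric")
```

The string "12" therefore produces the double 12.0 rather than an integer, which is the behavior the text explains in terms of the member type order of xs:numeric.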

In the case of an implementation that supports XSD 1.1, there is a constructor function associated with the built-in union type xs:error.

The function signature is as follows:

The semantics are equivalent to casting to the corresponding union type (see ).

Because xs:error has no member types, and therefore has an empty value space, casting will always fail with a dynamic error except in the case where the supplied argument is the empty sequence, in which case the result is also the empty sequence.

Constructor functions for user-defined atomic and union types

For every named user-defined simple type in the static context (See ), there is a constructor function whose name is the same as the name of the type and whose effect is to create a value of that type from the supplied argument. The rules for constructing user-defined types are defined in the same way as the rules for constructing built-in derived types defined in .

For named atomic types, the rules are the same as the rules for constructing built-in derived atomic types defined in . For a named atomic type T, the signature of the function takes the form T($value as xs:anyAtomicType? := .) as T?, and the semantics are the same as casting to derived types: see .

For named union types, the rules follow the same principles as the rules for constructing built-in union types defined in . For a named union type U, the signature of the function takes the form U($value as xs:anyAtomicType? := .) as U?, and the semantics are the same as casting to union types: see .

For named list types, the rules follow the same principles as the rules for constructing built-in list types defined in . For a named list type L, where the item type of L is I, the signature of the function takes the form L($value as xs:string? := .) as I*, and the semantics are the same as casting to list types: see .

Constructor functions are available both for named types defined in an imported schema (that is, named simple types in the in-scope schema types), and for types defined by means of named item types. Specifically, named enumeration types follow the same rules as schema types derived by restricting xs:string, and named local union types follow the same rules as union types defined in a schema.

Special rules apply to constructor functions for namespace-sensitive types, that is, atomic types derived from xs:QName and xs:NOTATION, list types that have a namespace-sensitive item type, and union types that have a namespace-sensitive member type. See .

Using a Constructor Function for a User-Defined Atomic Type

Consider a situation where the static context contains an atomic type called hatSize defined in a schema whose target namespace is bound to the prefix eg. In such a case the following constructor function is available to users:

The resulting function may be used in an expression such as eg:hatSize("10½").

In the case of an atomic type A, the return type of the function is A?, reflecting the fact that the result will be the empty sequence if the input is the empty sequence. For a union or list type, the return type of the function is specified only as xs:anyAtomicType*. Implementations performing static type checking will often be able to compute a more specific result type. For example, if the target type is a list type whose item type is the atomic type A, the result will always be an instance of A*; if the target type is a pure union type U then the result will always be an instance of U?. In general, however, applications needing interoperable behavior on implementations that do strict static type checking will need to use a treat as expression to assert the specific type of the result.

To construct an instance of a user-defined type that is not in a namespace, it is possible to use an EQName (for example Q{}hatsize(17)). Alternatives are to use a cast expression (17 cast as hatsize) or (if the host language allows it) to undeclare the default function namespace.

Constructor functions for named record types

Constructor functions for named record types have been introduced.

Both XQuery 4.0 and XSLT 4.0 provide syntax to declare named record types; such a declaration implicitly adds a constructor function for values of that type to the (See ).

For example, if there is a named item type with the XQuery definition:

declare record my:location (
  latitude as xs:double,
  longitude as xs:double
)

then there will be a function definition equivalent to:

declare function my:location (
  $latitude as xs:double,
  $longitude as xs:double
) as my:location {
  { 'latitude': $latitude, 'longitude': $longitude }
}

Equivalently using XSLT syntax, if there is a named item type with the XSLT definition:

]]>

then there will be a function definition equivalent to:

]]>

The rules defining the relationship of the function definition to the record type are given for XQuery 4.0 in .

TODO: Add cross-reference to XSLT here. Anticipates resolution of issue #1485.
Casting

Constructor functions and cast expressions accept an expression and return a value of a given type. They both convert a source value SV, of a source type ST, to a target value TV, of the given target type TT.

Constructor functions and cast expressions have identical semantics but different syntax. The name of the constructor function is the same as the name of the built-in datatype or the datatype defined in of (see ) or the user-derived datatype (see ) that is the target for the conversion, and the semantics are exactly the same as for a cast expression; for example, xs:date("2003-01-01") means exactly the same as "2003-01-01" cast as xs:date?.

The cast expression takes a type name to indicate the target type of the conversion. See . If the type name allows the empty sequence and the expression to be cast is the empty sequence, the empty sequence is returned. If the type name does not allow the empty sequence and the expression to be cast is the empty sequence, a type error is raised .

Where the argument to a cast is a literal, the result of the function may be evaluated statically; if an error is encountered during such evaluation, it may be reported as a static error.

The general rules for casting from primitive types to primitive types are defined in , and subsections describe the rules for specific target types. The general rules for casting from xs:string (and xs:untypedAtomic) follow in . Casting to non-primitive types, including atomic types derived by restriction, union types, and list types, is described in . Casting from derived types is defined in , and .

Casting is not supported to or from xs:anySimpleType. Casting to xs:anySimpleType is not permitted and raises a static error: .

Similarly, casting is not supported to or from xs:anyAtomicType and will raise a static error: . There are no atomic items with the type annotation xs:anyAtomicType, although this can be a statically inferred type.

Casting from primitive types to primitive types

This section now uses the term primitive type strictly to refer to the 20 atomic types that are not derived by restriction from another atomic type: that is, the 19 primitive atomic types defined in XSD, plus xs:untypedAtomic. The three types xs:integer, xs:dayTimeDuration, and xs:yearMonthDuration, which have custom casting rules but are not strictly-speaking primitive, are now handled in other subsections.

This section defines casting between primitive types (specifically, the 19 primitive types defined in plus xs:untypedAtomic). The type conversions that are supported between primitive atomic types are indicated in the table below; casts between other (non-primitive) types are defined in terms of these primitives.

Where the target type TT is a primitive type, the result TV will always be an instance of TT. The result may also be an instance of a type derived from TT: for example casting an xs:NCName SV to xs:string may return SV unchanged, with its original type annotation.

In this table, there is a row for each primitive type acting as the source of the conversion and there is a column for each primitive type acting as the target of the conversion. The intersections of rows and columns contain one of three characters:

Y indicates that a conversion from values of the type to which the row applies to the type to which the column applies is supported;

N indicates that there are no supported conversions from values of the type to which the row applies to the type to which the column applies;

M indicates that a conversion from values of the type to which the row applies to the type to which the column applies may succeed for some values in the value space and fail for others.

There is no row or column for xs:untypedAtomic because the casting rules are exactly the same as for xs:string. When casting from xs:string or xs:untypedAtomic the semantics in apply, regardless of target type.

defines xs:NOTATION as an abstract type. Thus, casting to xs:NOTATION from any other type including xs:NOTATION is not permitted and raises a static error . However, casting from one subtype of xs:NOTATION to another subtype of xs:NOTATION is permitted.

Casting is not supported to or from xs:anySimpleType. Thus, there is no row or column for this type in the table below. For any node that has not been validated or has been validated as xs:anySimpleType, the typed value of the node is an atomic item of type xs:untypedAtomic. There are no atomic items with the type annotation xs:anySimpleType at runtime. Casting to xs:anySimpleType is not permitted and raises a static error: .

Similarly, casting is not supported to or from xs:anyAtomicType and will raise a static error: . There are no atomic items with the type annotation xs:anyAtomicType at runtime, although this can be a statically inferred type.

If casting is attempted from an ST to a TT for which casting is not supported, as defined in the table below, a type error is raised .

In the following table, the columns and rows are identified by short codes that identify simple types as follows:

aURI = xs:anyURI
b64 = xs:base64Binary
bool = xs:boolean
dat = xs:date
gDay = xs:gDay
dbl = xs:double
dec = xs:decimal
dT = xs:dateTime
dur = xs:duration
flt = xs:float
hxB = xs:hexBinary
gMD = xs:gMonthDay
gMon = xs:gMonth
NOT = xs:NOTATION
QN = xs:QName
str = xs:string
tim = xs:time
gYM = xs:gYearMonth
gYr = xs:gYear

In the following table, the notation S\T indicates that the source (S) of the conversion is indicated in the column below the notation and that the target (T) is indicated in the row to the right of the notation.

S\T str flt dbl dec dur dT tim dat gYM gYr gMD gDay gMon bool b64 hxB aURI QN NOT
str Y M M M M M M M M M M M M M M M M M M
flt Y Y Y M N N N N N N N N N Y N N N N N
dbl Y Y Y M N N N N N N N N N Y N N N N N
dec Y Y Y Y N N N N N N N N N Y N N N N N
dur Y N N N Y N N N N N N N N N N N N N N
dT Y N N N N Y Y Y Y Y Y Y Y N N N N N N
tim Y N N N N N Y N N N N N N N N N N N N
dat Y N N N N Y N Y Y Y Y Y Y N N N N N N
gYM Y N N N N N N N Y N N N N N N N N N N
gYr Y N N N N N N N N Y N N N N N N N N N
gMD Y N N N N N N N N N Y N N N N N N N N
gDay Y N N N N N N N N N N Y N N N N N N N
gMon Y N N N N N N N N N N N Y N N N N N N
bool Y Y Y Y N N N N N N N N N Y N N N N N
b64 Y N N N N N N N N N N N N N Y Y N N N
hxB Y N N N N N N N N N N N N N Y Y N N N
aURI Y N N N N N N N N N N N N N N N Y N N
QN Y N N N N N N N N N N N N N N N N Y M
NOT Y N N N N N N N N N N N N N N N N Y M
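The table can be read mechanically as a lookup from a source code (row) and a target code (column) to one of Y, N, or M. The following Python sketch transcribes the table above for illustration; the names CODES, ROWS, TABLE, and castable are hypothetical, and the sketch does not implement any conversion itself.

```python
# Column order of the table, using the short codes defined above.
CODES = (
    "str flt dbl dec dur dT tim dat gYM gYr gMD gDay gMon "
    "bool b64 hxB aURI QN NOT"
).split()

# One line per source type: the code, then one cell per target type.
ROWS = """
str  Y M M M M M M M M M M M M M M M M M M
flt  Y Y Y M N N N N N N N N N Y N N N N N
dbl  Y Y Y M N N N N N N N N N Y N N N N N
dec  Y Y Y Y N N N N N N N N N Y N N N N N
dur  Y N N N Y N N N N N N N N N N N N N N
dT   Y N N N N Y Y Y Y Y Y Y Y N N N N N N
tim  Y N N N N N Y N N N N N N N N N N N N
dat  Y N N N N Y N Y Y Y Y Y Y N N N N N N
gYM  Y N N N N N N N Y N N N N N N N N N N
gYr  Y N N N N N N N N Y N N N N N N N N N
gMD  Y N N N N N N N N N Y N N N N N N N N
gDay Y N N N N N N N N N N Y N N N N N N N
gMon Y N N N N N N N N N N N Y N N N N N N
bool Y Y Y Y N N N N N N N N N Y N N N N N
b64  Y N N N N N N N N N N N N N Y Y N N N
hxB  Y N N N N N N N N N N N N N Y Y N N N
aURI Y N N N N N N N N N N N N N N N Y N N
QN   Y N N N N N N N N N N N N N N N N Y M
NOT  Y N N N N N N N N N N N N N N N N Y M
"""

TABLE: dict[str, dict[str, str]] = {}
for line in ROWS.strip().splitlines():
    source, *cells = line.split()
    TABLE[source] = dict(zip(CODES, cells))


def castable(source: str, target: str) -> str:
    """Return 'Y', 'N', or 'M' for a cast from source to target."""
    return TABLE[source][target]
```

For example, castable("dT", "dat") gives Y, castable("flt", "dec") gives M, and castable("tim", "dat") gives N.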
Casting to xs:untypedAtomic

Any atomic item SV can be cast to xs:untypedAtomic.

The effect is the same as casting to xs:string (see ) and then returning the xs:untypedAtomic value comprising the same sequence of characters.

Casting to xs:string

Any atomic item SV can be cast to xs:string.

The resulting xs:string value TV depends on the source type ST as follows.

If SV is an instance of xs:string, TV is an instance of xs:string comprising the same sequence of characters as SV.

The implementation is free to return SV unchanged, including its original type annotation.

If SV is an instance of xs:anyURI, the result TV is an instance of xs:string comprising the same sequence of characters as SV, with a type annotation of xs:string. No escaping of special characters takes place.

If SV is an instance of xs:QName or xs:NOTATION:

if the qualified name has a prefix, then TV is the concatenation of the prefix of SV, a single colon (:), and the local name of SV.

otherwise TV is the local name of SV.

If SV is an instance of xs:numeric, the rules in apply.

If SV is an instance of xs:dateTime, xs:date or xs:time, the rules in apply.

If ST is xs:duration, or any subtype thereof including xs:yearMonthDuration and xs:dayTimeDuration, then the rules in apply.

In all other cases, TV is the canonical representation of SV. For datatypes that do not have a canonical representation defined, an implementation-dependent canonical representation may be used.

To cast as xs:untypedAtomic the value is cast as xs:string, as described above, and the type annotation changed to xs:untypedAtomic.

Casting numeric values to xs:string

The following rules apply when the source type ST is xs:decimal, xs:double, or xs:float, or any subtype of these including xs:integer.

If SV is an instance of xs:decimal, then the canonical representation of SV is returned, as defined in . Specifically, see decimalCanonicalMap.

Unlike previous versions of this specification, no special rule is given for the case where SV is an instance of xs:integer. This is because the general rule for xs:decimal gives the same result. The result in this case will be a sequence of decimal digits in the range U+0030 to U+0039, optionally preceded by a minus sign, with no leading zeroes. For example: 42, -1, 0, or 1000000000.

An xs:decimal that is equal to an integer is converted to a string as if it were first cast to an xs:integer. Specifically, there will be no decimal point and no fractional part.

If the value is not equal to an integer, then there will be a decimal point and a fractional part, which will be a sequence of decimal digits with no trailing zeroes. For example: 42.3, -1.5, or 0.00001.
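The effect of these rules on xs:decimal values can be sketched with Python's decimal module. This is an illustration under assumptions, not a normative algorithm: the function name decimal_to_string is hypothetical, and Python's Decimal stands in for xs:decimal (its special values NaN and Infinity are outside the xs:decimal value space and are not handled).

```python
from decimal import Decimal


def decimal_to_string(value: Decimal) -> str:
    """Render a Decimal in the form described for xs:decimal."""
    if value == 0:
        return "0"  # xs:decimal has no negative zero
    text = format(value, "f")  # plain notation, never an exponent
    if "." in text:
        # Drop trailing zeroes, then a decimal point left with no fraction.
        text = text.rstrip("0").rstrip(".")
    return text
```

Under this sketch, Decimal("42.30") is rendered as 42.3 and Decimal("100.0") as 100, with no decimal point because the value equals an integer.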

If SV is an instance of xs:float or xs:double, then:

TV will be an xs:string in the lexical space of xs:double or xs:float that when converted to an xs:double or xs:float under the rules of produces a value that is equal to SV, or is NaN if SV is NaN. In addition, TV must satisfy the constraints in the following sub-bullets.

If SV has an absolute value that is greater than or equal to 0.000001 (one millionth) and less than 1000000 (one million), then the value is converted to an xs:decimal and the resulting xs:decimal is converted to an xs:string according to the rules above, as though using an implementation of xs:decimal that imposes no limits on the totalDigits or fractionDigits facets.

If SV has the value positive or negative zero, TV is "0" or "-0" respectively.

If SV is positive or negative infinity, TV is the string "INF" or "-INF" respectively.

In other cases, the result consists of a mantissa, which has the lexical form of an xs:decimal, followed by the letter "E", followed by an exponent which has the lexical form of an xs:integer. Leading zeroes and "+" signs are prohibited in the exponent. For the mantissa, there must be a decimal point, and there must be exactly one digit before the decimal point, which must be non-zero. The "+" sign is prohibited. There must be at least one digit after the decimal point. Apart from this mandatory digit, trailing zero digits are prohibited.

The above rules allow more than one representation of the same value. For example, the xs:float value whose exact decimal representation is 1.26743223E15 might be represented by any of the strings "1.26743223E15", "1.26743222E15" or "1.26743224E15" (inter alia). It is implementation-dependent which of these representations is chosen.

The string representations of numeric values are backwards compatible with XPath 1.0 except for the special values positive and negative infinity, negative zero and values outside the range 1.0e-6 to 1.0e+6.
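The rules for xs:double can be sketched as follows. Because several representations are permitted, this Python model makes one particular choice and labels it as an assumption: it uses the shortest digit string that round-trips, as produced by Python's repr. The function name double_to_string is hypothetical, and Python float stands in for xs:double.

```python
import math
from decimal import Decimal


def double_to_string(value: float) -> str:
    """Sketch of the xs:double to xs:string rules for a Python float."""
    if math.isnan(value):
        return "NaN"
    if math.isinf(value):
        return "INF" if value > 0 else "-INF"
    if value == 0:
        # copysign distinguishes negative zero from positive zero.
        return "-0" if math.copysign(1.0, value) < 0 else "0"

    # Assumption: take the shortest round-tripping digits from repr().
    digits = Decimal(repr(value))

    if 0.000001 <= abs(value) < 1000000:
        # Decimal notation, formatted like an xs:decimal.
        text = format(digits, "f")
        if "." in text:
            text = text.rstrip("0").rstrip(".")
        return text

    # Otherwise: one non-zero digit, a decimal point, at least one
    # further digit, then "E" and an exponent with no "+" sign.
    sign, digit_tuple, _ = digits.as_tuple()
    significant = "".join(str(d) for d in digit_tuple).rstrip("0")
    fraction = significant[1:] or "0"
    mantissa = f"{significant[0]}.{fraction}"
    return f"{'-' if sign else ''}{mantissa}E{digits.adjusted()}"
```

The boundaries follow the text: one millionth is still written in decimal notation, while one million is written with an exponent.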

Casting date/time values to xs:string

The rules for conversion of dates and times to strings are now defined entirely in terms of XSD 1.1 canonical mappings, since these deliver exactly the same result as the XPath 3.1 rules.

If SV is an instance of xs:dateTime, xs:date, xs:time, xs:gYear, xs:gYearMonth, xs:gMonth, xs:gMonthDay, or xs:gDay, then TV is the canonical representation of SV as defined in .

The result TV includes the original timezone if a timezone is present.

All these data types contain different combinations of the components year, month, day, hour, minute, second, and timezone; all the components relevant to the data type (including the timezone, if present) are output, and the results are concatenated together with suitable punctuation. Specifically:

The year component is represented as an xs:string of four digits, or more if needed. A leading minus sign is present for BCE years.

The month, day, hour and minute components are represented as two digits (with a leading zero if needed). For example, February is represented as 02.

The hours component will never be "24": midnight is always represented as "00:00:00".

The second component is output as a two-digit integer if it is a whole number (for example, 30, 05, or 00), or if it is fractional, as two digits followed by a decimal point followed by as many digits as are necessary, with no trailing zeroes (for example 30.5 or 00.001).

The timezone component, if present, is cast to xs:string by applying the function eg:convertTZtoString given in . Examples are Z, +01:00, -05:00, or +05:30.
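The per-component formatting described in this list can be sketched in Python. This is an illustration under assumptions, not the XSD canonical mapping: all four function names are hypothetical, the timezone is supplied as a signed offset in minutes, and the seconds value is a non-negative Decimal.

```python
from decimal import Decimal


def format_year(year: int) -> str:
    """Four digits, or more if needed, with a minus sign if negative."""
    return f"{'-' if year < 0 else ''}{abs(year):04d}"


def format_two_digits(component: int) -> str:
    """Month, day, hour, or minute with a leading zero if needed."""
    return f"{component:02d}"


def format_seconds(seconds: Decimal) -> str:
    """Two digits, plus a fraction without trailing zeroes if present."""
    whole = int(seconds)
    fraction = seconds - whole
    text = f"{whole:02d}"
    if fraction != 0:
        # normalize() drops trailing zeroes; [1:] drops the leading "0".
        text += format(fraction.normalize(), "f")[1:]
    return text


def format_timezone(offset_minutes: int) -> str:
    """Z for a zero offset, otherwise a signed hh:mm offset."""
    if offset_minutes == 0:
        return "Z"
    hours, minutes = divmod(abs(offset_minutes), 60)
    sign = "-" if offset_minutes < 0 else "+"
    return f"{sign}{hours:02d}:{minutes:02d}"
```

These helpers reproduce the examples given in the list, such as 02 for February, 00.001 for a fractional second, and +05:30 for a timezone offset.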


Casting xs:duration values to xs:string

The rules for conversion of durations to strings are now defined entirely in terms of XSD 1.1 canonical mappings, since the XSD 1.1 rules deliver exactly the same result as the XPath 3.1 rules.

If SV is an instance of xs:duration (including its subtypes xs:yearMonthDuration and xs:dayTimeDuration), then TV is the canonical representation of SV as defined in . Specifically, see durationCanonicalMap.

The rules have the effect of normalizing the value so that the number of months is always less than 12, the number of hours less than 24, and the number of minutes and seconds less than 60. Zero-valued components are omitted. Fractional seconds follow the same rules as xs:decimal. For example, the duration P15MT30H is represented as P1Y3M1DT6H. A zero-length duration is output as PT0S.
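The normalization described above can be sketched in Python. This is an illustrative sketch only, not the XSD durationCanonicalMap itself: the function name `canonical_duration` and the representation of a duration as a (months, seconds) pair are assumptions made for the example.

```python
from decimal import Decimal


def canonical_duration(months: int, seconds) -> str:
    """Render a duration held as (months, seconds) in normalized lexical form.

    Assumption: both components carry the same sign, as XSD requires.
    """
    seconds = Decimal(seconds)
    if months == 0 and seconds == 0:
        return "PT0S"  # a zero-length duration
    sign = "-" if months < 0 or seconds < 0 else ""
    months, seconds = abs(months), abs(seconds)

    years, months = divmod(months, 12)      # months component stays below 12
    whole = int(seconds)
    fraction = seconds - whole
    days, rest = divmod(whole, 86400)
    hours, rest = divmod(rest, 3600)        # hours component stays below 24
    minutes, secs = divmod(rest, 60)        # minutes and seconds stay below 60

    date_part = "".join(
        f"{value}{unit}" for value, unit in ((years, "Y"), (months, "M"), (days, "D")) if value
    )
    time_part = "".join(
        f"{value}{unit}" for value, unit in ((hours, "H"), (minutes, "M")) if value
    )
    sec_value = Decimal(secs) + fraction
    if sec_value != 0:
        text = format(sec_value, "f")
        if "." in text:
            text = text.rstrip("0").rstrip(".")  # no trailing zeroes
        time_part += text + "S"
    return sign + "P" + date_part + ("T" + time_part if time_part else "")


# P15MT30H: 15 months and 30 hours (108000 seconds)
print(canonical_duration(15, 30 * 3600))
```

Zero-valued components are dropped by the `if value` filters, which is why the example prints no minutes or seconds component.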

At the time of writing, the published XSD 1.1 recommendation contains cut-and-paste errors in the definition of the dayTimeDuration canonical mapping. The binding of variable s should be to dt's ·seconds· (not ·months·) component, and the return expression given as sgn & 'P' & ·duYearMonthCanonicalFragmentMap·(|s|) should read sgn & 'P' & ·duDayTimeCanonicalFragmentMap·(|s|).

In reading these XSD formulations, be aware that a & b represents string concatenation, while |s| computes the absolute value of a number.

Casting to numeric types

This section defines the rules for casting to the primitive numeric types xs:float, xs:double, and xs:decimal. Rules for casting to the derived type xs:integer are given in .

Casting to xs:float

When a value of any simple type is cast as xs:float, the xs:float TV is derived from the ST and the SV as follows:

If ST is xs:float, then TV is SV and the conversion is complete.

If ST is xs:double, then TV is obtained as follows:

if SV is the xs:double value INF, -INF, NaN, positive zero, or negative zero, then TV is the xs:float value INF, -INF, NaN, positive zero, or negative zero respectively.

otherwise, SV can be expressed in the form m × 2^e where the mantissa m and exponent e are signed xs:integers whose value range is defined in , and the following rules apply:

if m (the mantissa of SV) is outside the permitted range for the mantissa of an xs:float value, which is -(2^24 - 1) to +(2^24 - 1), then it is divided by 2^N where N is the lowest positive xs:integer that brings the result of the division within the permitted range, and the exponent e is increased by N. This is integer division (in effect, the binary value of the mantissa is truncated on the right). Let M be the mantissa and E the exponent after this adjustment.

if E exceeds 104 (the maximum exponent value in the value space of xs:float) then TV is the xs:float value INF or -INF depending on the sign of M.

if E is less than -149 (the minimum exponent value in the value space of xs:float) then TV is the xs:float value positive or negative zero depending on the sign of M.

otherwise, TV is the xs:float value M × 2^E.

If ST is xs:decimal or xs:integer, then TV is xs:float( SV cast as xs:string) and the conversion is complete.

If ST is xs:boolean, SV is converted to 1.0E0 if SV is true and to 0.0E0 if SV is false and the conversion is complete.

If ST is xs:untypedAtomic or xs:string, see .

XSD 1.1 adds the value +INF to the lexical space, as an alternative to INF. XSD 1.1 also adds negative zero to the value space.

Implementations should return negative zero for xs:float("-0.0E0"). But because does not distinguish between the values positive zero and negative zero, implementations may return positive zero in this case.
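The rules above for narrowing a finite, non-zero xs:double to xs:float can be sketched as follows. This is a non-normative illustration: the function name `double_to_float` is hypothetical, the input is assumed to be already decomposed into an integer mantissa m and exponent e, and the resulting xs:float is held in a Python float (which can represent every xs:float value exactly).

```python
import math

FLOAT_MANTISSA_BITS = 24   # |M| must not exceed 2^24 - 1
FLOAT_EXP_MAX = 104        # maximum exponent in the xs:float value space
FLOAT_EXP_MIN = -149       # minimum exponent in the xs:float value space


def double_to_float(m: int, e: int) -> float:
    """Apply the m x 2^e narrowing rule; m is assumed to be non-zero."""
    sign = -1 if m < 0 else 1
    magnitude = abs(m)
    # N is the smallest shift that brings the mantissa into range.
    # Shifting right truncates the binary value on the right.
    n = max(0, magnitude.bit_length() - FLOAT_MANTISSA_BITS)
    big_m = sign * (magnitude >> n)
    big_e = e + n
    if big_e > FLOAT_EXP_MAX:
        return math.copysign(math.inf, sign)   # INF or -INF
    if big_e < FLOAT_EXP_MIN:
        return math.copysign(0.0, sign)        # positive or negative zero
    return math.ldexp(big_m, big_e)            # M x 2^E, exact for |M| < 2^24


# The low-order bit of 2^30 + 1 does not fit in 24 bits and is truncated.
print(double_to_float(2**30 + 1, 0) == 2.0**30)
```

The special values INF, -INF, NaN and the two zeros are mapped directly by the first rule and are deliberately not handled by this sketch.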

Casting to xs:double

When a value of any simple type is cast as xs:double, the xs:double value TV is derived from the ST and the SV as follows:

If ST is xs:double, then TV is SV and the conversion is complete.

If ST is xs:float or a type derived from xs:float, then TV is obtained as follows:

if SV is the xs:float value INF, -INF, NaN, positive zero, or negative zero, then TV is the xs:double value INF, -INF, NaN, positive zero, or negative zero respectively.

otherwise, SV can be expressed in the form m × 2^e where the mantissa m and exponent e are signed xs:integer values whose value range is defined in , and TV is the xs:double value m × 2^e.

If ST is xs:decimal or xs:integer, then TV is xs:double( SV cast as xs:string) and the conversion is complete.

If ST is xs:boolean, SV is converted to 1.0E0 if SV is true and to 0.0E0 if SV is false and the conversion is complete.

If ST is xs:untypedAtomic or xs:string, see .

XSD 1.1 adds the value +INF to the lexical space, as an alternative to INF. XSD 1.1 also adds negative zero to the value space.

Implementations should return negative zero for xs:double("-0.0E0"). But because does not distinguish between the values positive zero and negative zero, implementations may return positive zero in this case.

Casting to xs:decimal

This section defines the rules for casting to the primitive type xs:decimal. The rules are also invoked implicitly as part of the process of converting to types derived from xs:decimal. There are special rules, however, if the target type TT is xs:integer, or a type derived from xs:integer: those rules are given in .

When the target type TT is xs:decimal, the resulting xs:decimal value TV is derived from ST and SV as follows:

If ST is xs:decimal or a subtype thereof (including xs:integer), then the result TV has the same datum as SV. The type annotation may be xs:decimal or any subtype of xs:decimal for which this is a valid instance, including the original type ST.

If ST is xs:float or xs:double, then TV is the xs:decimal value, within the set of xs:decimal values that the implementation is capable of representing, that is numerically closest to SV. If two values are equally close, then the one that is closest to zero is chosen. If SV is too large to be accommodated as an xs:decimal (see for limits on numeric values), a dynamic error is raised. If SV is one of the special xs:float or xs:double values NaN, INF, or -INF, a dynamic error is raised.

If ST is xs:boolean, the result TV is 1.0 if SV is 1 or true and 0.0 if SV is 0 or false. The type annotation of the result may be any subtype of xs:decimal whose value space includes the integer values 0 and 1.

If ST is xs:untypedAtomic or xs:string, see .
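As a rough illustration of the rules above for casting to xs:decimal, the following sketch models xs:decimal with Python's `decimal.Decimal`. The function name `cast_to_decimal` is hypothetical, and the sketch assumes an implementation with enough decimal precision to hold the exact value of any binary floating-point number, so that the "numerically closest" value is the value itself and no overflow arises.

```python
import math
from decimal import Decimal


def cast_to_decimal(value):
    """Cast a bool, int, float, or Decimal to a Decimal (standing in for xs:decimal)."""
    if isinstance(value, bool):
        # true maps to 1.0 and false to 0.0
        return Decimal(1) if value else Decimal(0)
    if isinstance(value, float):
        if math.isnan(value) or math.isinf(value):
            # NaN, INF and -INF have no xs:decimal counterpart
            raise ValueError("NaN and infinities cannot be cast to xs:decimal")
        return Decimal(value)  # exact decimal expansion of the binary value
    return Decimal(value)


print(cast_to_decimal(0.5))   # 0.5
print(cast_to_decimal(True))  # 1
```

An implementation with a bounded number of decimal digits would instead have to select the nearest representable value, preferring the candidate closer to zero when two are equally close.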

Casting to duration types

This section defines the rules for casting to the primitive duration type xs:duration. Rules for casting to the derived types xs:yearMonthDuration and xs:dayTimeDuration are given in .

If the source value SV is an instance of xs:duration (including instances of subtypes such as xs:yearMonthDuration and xs:dayTimeDuration), then the datum of the result TV is the same as the datum of SV, and the type annotation is xs:duration or any subtype thereof that includes this datum in its value space (in particular, it may be the same as the type annotation of SV).

If ST is xs:untypedAtomic or xs:string, see .

Casting to date and time types

In several situations, casting to date and time types requires the extraction of a component from SV or from the result of fn:current-dateTime and converting it to an xs:string. These conversions must follow certain rules. For example, converting an xs:integer year value requires converting to an xs:string with four or more characters, preceded by a minus sign if the value is negative.

This document defines four functions to perform these conversions. These functions are for illustrative purposes only and make no recommendations as to style or efficiency. References to these functions from the following text are not normative.

The arguments to these functions come from functions defined in this document. Thus, the functions below assume that they are correct and do no range checking on them.

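declare function eg:convertYearToString($year as xs:integer) as xs:string {
  let $plusMinus := if ($year >= 0) then "" else "-"
  let $yearString := abs($year) cast as xs:string
  let $length := string-length($yearString)
  return
    if ($length = 1) then concat($plusMinus, "000", $yearString)
    else if ($length = 2) then concat($plusMinus, "00", $yearString)
    else if ($length = 3) then concat($plusMinus, "0", $yearString)
    else concat($plusMinus, $yearString)
};

declare function eg:convertTZtoString($tz as xs:dayTimeDuration?) as xs:string {
  if (empty($tz)) then ""
  else if ($tz eq xs:dayTimeDuration('PT0S')) then "Z"
  else
    let $tzh := hours-from-duration($tz)
    let $tzm := minutes-from-duration($tz)
    let $plusMinus := if ($tzh >= 0) then "+" else "-"
    let $tzhString := eg:convertTo2CharString(abs($tzh))
    let $tzmString := eg:convertTo2CharString(abs($tzm))
    return concat($plusMinus, $tzhString, ":", $tzmString)
};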
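For readers who prefer an executable model, the behavior expected of the four helper functions eg:convertYearToString, eg:convertTo2CharString, eg:convertSecondsToString and eg:convertTZtoString can be approximated in Python as below. This is a non-normative sketch: the Python names are hypothetical, the timezone is assumed to be supplied as a signed offset in minutes (or None when absent), and seconds are assumed to be supplied as a non-negative `Decimal`.

```python
from decimal import Decimal
from typing import Optional


def convert_year_to_string(year: int) -> str:
    """Four or more digits, with a leading minus sign for negative years."""
    sign = "-" if year < 0 else ""
    return sign + str(abs(year)).zfill(4)


def convert_to_2_char_string(value: int) -> str:
    """Two digits, with a leading zero if needed."""
    return str(value).zfill(2)


def convert_seconds_to_string(seconds: Decimal) -> str:
    """Two integer digits, then any fractional digits without trailing zeroes."""
    text = format(seconds, "f")
    if "." in text:
        text = text.rstrip("0").rstrip(".")
    return "0" + text if len(str(int(seconds))) == 1 else text


def convert_tz_to_string(offset_minutes: Optional[int]) -> str:
    """Empty string if absent, "Z" for UTC, otherwise a signed hh:mm offset."""
    if offset_minutes is None:
        return ""
    if offset_minutes == 0:
        return "Z"
    sign = "+" if offset_minutes > 0 else "-"
    hours, minutes = divmod(abs(offset_minutes), 60)
    return sign + convert_to_2_char_string(hours) + ":" + convert_to_2_char_string(minutes)


print(convert_year_to_string(33), convert_seconds_to_string(Decimal("0.001")),
      convert_tz_to_string(330))
```

Like the functions they model, these helpers perform no range checking on their arguments.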

Conversion from primitive types to date and time types follows the rules below.

When a value of any primitive type is cast as xs:dateTime, the xs:dateTime value TV is derived from ST and SV as follows:

If ST is xs:dateTime, then TV is SV.

If ST is xs:date, then let SYR be eg:convertYearToString( year-from-date( SV )), let SMO be eg:convertTo2CharString( month-from-date( SV )), let SDA be eg:convertTo2CharString( day-from-date( SV )) and let STZ be eg:convertTZtoString( timezone-from-date( SV )); TV is xs:dateTime( concat( SYR , '-', SMO , '-', SDA , 'T00:00:00', STZ ) ).

If ST is xs:untypedAtomic or xs:string, see .

When a value of any primitive type is cast as xs:time, the xs:time value TV is derived from ST and SV as follows:

If ST is xs:time, then TV is SV.

If ST is xs:dateTime, then TV is xs:time( concat( eg:convertTo2CharString( hours-from-dateTime( SV )), ':', eg:convertTo2CharString( minutes-from-dateTime( SV )), ':', eg:convertSecondsToString( seconds-from-dateTime( SV )), eg:convertTZtoString( timezone-from-dateTime( SV )) )).

If ST is xs:untypedAtomic or xs:string, see .

When a value of any primitive type is cast as xs:date, the xs:date value TV is derived from ST and SV as follows:

If ST is xs:date, then TV is SV.

If ST is xs:dateTime, then let SYR be eg:convertYearToString( year-from-dateTime( SV )), let SMO be eg:convertTo2CharString( month-from-dateTime( SV )), let SDA be eg:convertTo2CharString( day-from-dateTime( SV )) and let STZ be eg:convertTZtoString(timezone-from-dateTime( SV )); TV is xs:date( concat( SYR , '-', SMO , '-', SDA, STZ ) ).

If ST is xs:untypedAtomic or xs:string, see .

When a value of any primitive type is cast as xs:gYearMonth, the xs:gYearMonth value TV is derived from ST and SV as follows:

If ST is xs:gYearMonth, then TV is SV.

If ST is xs:dateTime, then let SYR be eg:convertYearToString( year-from-dateTime( SV )), let SMO be eg:convertTo2CharString( month-from-dateTime( SV )) and let STZ be eg:convertTZtoString( timezone-from-dateTime( SV )); TV is xs:gYearMonth( concat( SYR , '-', SMO, STZ ) ).

If ST is xs:date, then let SYR be eg:convertYearToString( year-from-date( SV )), let SMO be eg:convertTo2CharString( month-from-date( SV )) and let STZ be eg:convertTZtoString( timezone-from-date( SV )); TV is xs:gYearMonth( concat( SYR , '-', SMO, STZ ) ).

If ST is xs:untypedAtomic or xs:string, see .

When a value of any primitive type is cast as xs:gYear, the xs:gYear value TV is derived from ST and SV as follows:

If ST is xs:gYear, then TV is SV.

If ST is xs:dateTime, let SYR be eg:convertYearToString( year-from-dateTime( SV )) and let STZ be eg:convertTZtoString( timezone-from-dateTime( SV )); TV is xs:gYear(concat( SYR, STZ )).

If ST is xs:date, let SYR be eg:convertYearToString( year-from-date( SV )) and let STZ be eg:convertTZtoString( timezone-from-date( SV )); TV is xs:gYear(concat( SYR, STZ )).

If ST is xs:untypedAtomic or xs:string, see .

When a value of any primitive type is cast as xs:gMonthDay, the xs:gMonthDay value TV is derived from ST and SV as follows:

If ST is xs:gMonthDay, then TV is SV.

If ST is xs:dateTime, then let SMO be eg:convertTo2CharString( month-from-dateTime( SV )), let SDA be eg:convertTo2CharString( day-from-dateTime( SV )) and let STZ be eg:convertTZtoString( timezone-from-dateTime( SV )); TV is xs:gMonthDay( concat( '--', SMO , '-', SDA, STZ ) ).

If ST is xs:date, then let SMO be eg:convertTo2CharString( month-from-date( SV )), let SDA be eg:convertTo2CharString( day-from-date( SV )) and let STZ be eg:convertTZtoString( timezone-from-date( SV )); TV is xs:gMonthDay( concat( '--', SMO , '-', SDA, STZ ) ).

If ST is xs:untypedAtomic or xs:string, see .

When a value of any primitive type is cast as xs:gDay, the xs:gDay value TV is derived from ST and SV as follows:

If ST is xs:gDay, then TV is SV.

If ST is xs:dateTime, then let SDA be eg:convertTo2CharString( day-from-dateTime( SV )) and let STZ be eg:convertTZtoString( timezone-from-dateTime( SV )); TV is xs:gDay( concat( '---', SDA, STZ )).

If ST is xs:date, then let SDA be eg:convertTo2CharString( day-from-date( SV )) and let STZ be eg:convertTZtoString( timezone-from-date( SV )); TV is xs:gDay( concat( '---', SDA, STZ )).

If ST is xs:untypedAtomic or xs:string, see .

When a value of any primitive type is cast as xs:gMonth, the xs:gMonth value TV is derived from ST and SV as follows:

If ST is xs:gMonth, then TV is SV.

If ST is xs:dateTime, then let SMO be eg:convertTo2CharString( month-from-dateTime( SV )) and let STZ be eg:convertTZtoString( timezone-from-dateTime( SV )); TV is xs:gMonth( concat( '--' , SMO, STZ )).

If ST is xs:date, then let SMO be eg:convertTo2CharString( month-from-date( SV )) and let STZ be eg:convertTZtoString( timezone-from-date( SV )); TV is xs:gMonth( concat( '--', SMO, STZ )).

If ST is xs:untypedAtomic or xs:string, see .
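The pattern shared by the rules above (extract the relevant components, format each one, append the timezone) can be illustrated with Python's `datetime` type standing in for xs:dateTime. This is a sketch under stated assumptions: the function names are hypothetical, Python's `datetime` only covers years 1 to 9999, and the strings produced are the lexical forms that would then be passed to the corresponding constructor.

```python
from datetime import datetime, timedelta, timezone


def tz_suffix(value: datetime) -> str:
    """Render the timezone as "", "Z", or a signed hh:mm offset."""
    offset = value.utcoffset()
    if offset is None:
        return ""
    minutes = int(offset.total_seconds() // 60)
    if minutes == 0:
        return "Z"
    sign = "+" if minutes > 0 else "-"
    return f"{sign}{abs(minutes) // 60:02d}:{abs(minutes) % 60:02d}"


def lexical_forms(value: datetime) -> dict:
    """Lexical forms obtained when a dateTime is cast to each narrower type."""
    year, month, day = f"{value.year:04d}", f"{value.month:02d}", f"{value.day:02d}"
    tz = tz_suffix(value)
    return {
        "xs:date": f"{year}-{month}-{day}{tz}",
        "xs:gYearMonth": f"{year}-{month}{tz}",
        "xs:gYear": f"{year}{tz}",
        "xs:gMonthDay": f"--{month}-{day}{tz}",
        "xs:gDay": f"---{day}{tz}",
        "xs:gMonth": f"--{month}{tz}",
    }


sample = datetime(2024, 2, 9, 13, 5, 0, tzinfo=timezone(timedelta(hours=5, minutes=30)))
for type_name, lexical in lexical_forms(sample).items():
    print(type_name, lexical)
```

In every case the timezone of the source value, if any, is carried over unchanged to the result.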

Casting to xs:boolean

When the target type TT is xs:boolean, the resulting xs:boolean value TV is derived from the source value SV as follows:

If SV is an instance of xs:boolean, then TV is SV.

If SV is an instance of xs:numeric and SV is 0, +0, -0, 0.0, 0.0E0 or NaN, then TV is false.

If SV is an instance of xs:numeric and SV is not one of the above values, then TV is true.

If ST is xs:untypedAtomic or xs:string, see .
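The numeric rules above amount to "zero and NaN are false, everything else is true". A minimal Python sketch, with the hypothetical name `cast_numeric_to_boolean` and Python numbers standing in for xs:numeric values:

```python
import math
from decimal import Decimal


def cast_numeric_to_boolean(value) -> bool:
    """False for any zero (including negative zero) and for NaN, otherwise true."""
    if isinstance(value, float) and math.isnan(value):
        return False
    return value != 0


print(cast_numeric_to_boolean(-0.0))            # False
print(cast_numeric_to_boolean(Decimal("0.1")))  # True
```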

Casting to xs:base64Binary and xs:hexBinary

Values of type xs:base64Binary can be cast as xs:hexBinary and vice versa, since the two types have the same value space. Casting to xs:base64Binary and xs:hexBinary is also supported from the same type and from xs:untypedAtomic, xs:string and subtypes of xs:string using semantics.
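Because the two types share a value space (sequences of octets), casting between them changes only the lexical representation. The sketch below illustrates this with Python's standard library; the function names are hypothetical, and upper-case hexadecimal digits are used for the hexBinary form.

```python
import base64


def hex_to_base64(hex_text: str) -> str:
    """Lexical form of xs:base64Binary for the octets denoted by a hexBinary string."""
    return base64.b64encode(bytes.fromhex(hex_text)).decode("ascii")


def base64_to_hex(base64_text: str) -> str:
    """Lexical form of xs:hexBinary for the octets denoted by a base64Binary string."""
    return base64.b64decode(base64_text).hex().upper()


print(hex_to_base64("48656C6C6F"))  # SGVsbG8=
print(base64_to_hex("SGVsbG8="))    # 48656C6C6F
```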

Casting to xs:anyURI

Casting to xs:anyURI is supported only from the same type, xs:untypedAtomic or xs:string.

When a value of any primitive type is cast as xs:anyURI, the xs:anyURI value TV is derived from the ST and SV as follows:

If ST is xs:untypedAtomic or xs:string see .

Casting to xs:QName and xs:NOTATION

Casting from xs:string or xs:untypedAtomic to xs:QName or xs:NOTATION is described in .

It is also possible to cast from xs:NOTATION to xs:QName, or from xs:QName to any type derived by restriction from xs:NOTATION. (Casting to xs:NOTATION itself is not allowed, because xs:NOTATION is an abstract type.) The resulting xs:QName or xs:NOTATION has the same prefix, local name, and namespace URI parts as the supplied value.

See for a discussion of how the combination of atomization and casting might not produce the desired effect.

Casting to xs:ENTITY

says that "The value space of ENTITY is the set of all strings that match the NCName production ... and have been declared as an unparsed entity in a document type definition." However, and do not check that constructed values of type xs:ENTITY match declared unparsed entities. Thus, this rule is relaxed in this specification and, in casting to xs:ENTITY and types derived from it, no check is made that the values correspond to declared unparsed entities.

Casting from xs:string and xs:untypedAtomic

When casting from a string to a duration or time or dateTime, it is now specified that when there are more digits in the fractional seconds than the implementation is able to retain, excess digits are truncated. Rounding upwards (which could affect the number of minutes or hours in the value) is not permitted.

This section applies when the supplied value SV is an instance of xs:string or xs:untypedAtomic, including types derived from these by restriction. If the value is xs:untypedAtomic, it is treated in exactly the same way as a string containing the same sequence of characters.

The supplied string is mapped to a typed value of the target type as defined in . Whitespace normalization is applied as indicated by the whiteSpace facet for the datatype. The resulting whitespace-normalized string must be a valid lexical form for the datatype. The semantics of casting follow the rules of XML Schema validation. For example, "13" cast as xs:unsignedInt returns the xs:unsignedInt typed value 13. This could also be written xs:unsignedInt("13").

The target type can be any simple type other than an abstract type. Specifically, it can be a type whose variety is atomic, union, or list. In each case the effect of casting to the target type is the same as constructing an element with the supplied value as its content, validating the element using the target type as the governing type, and atomizing the element to obtain its typed value.

When the target type is a derived type that is restricted by a pattern facet, the lexical form is first checked against the pattern before further casting is attempted (See ). If the lexical form does not conform to the pattern, a dynamic error is raised.

For example, consider a user-defined type my:boolean which is derived by restriction from xs:boolean and specifies the pattern facet value="0|1". The expression "true" cast as my:boolean would fail with a dynamic error .

Facets other than pattern are checked after the conversion. For example if there is a user-defined datatype called my:height defined as a restriction of xs:integer with the facet <maxInclusive value="84"/>, then the expression "100" cast as my:height would fail with a dynamic error .
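The ordering described above (pattern facet first, then conversion, then the remaining facets) can be sketched for the two example types. The Python function names are hypothetical, and Python's `re` module is used as a stand-in for XSD regular expressions, which is adequate only for simple patterns such as these.

```python
import re


def collapse(text: str) -> str:
    """Whitespace normalization for types whose whiteSpace facet is 'collapse'."""
    return " ".join(text.split())


def cast_string_to_my_boolean(text: str) -> bool:
    """my:boolean: a restriction of xs:boolean with pattern value="0|1"."""
    lexical = collapse(text)
    if not re.fullmatch(r"0|1", lexical):          # step 1: pattern facet
        raise ValueError(f"{lexical!r} does not match the pattern 0|1")
    return lexical in ("1", "true")                # step 2: convert as xs:boolean


def cast_string_to_my_height(text: str) -> int:
    """my:height: a restriction of xs:integer with maxInclusive value="84"."""
    lexical = collapse(text)
    if not re.fullmatch(r"[+-]?[0-9]+", lexical):  # lexical space of xs:integer
        raise ValueError(f"{lexical!r} is not a valid xs:integer")
    value = int(lexical)                           # step 2: convert
    if value > 84:                                 # step 3: remaining facets
        raise ValueError(f"{value} exceeds maxInclusive 84")
    return value


print(cast_string_to_my_boolean("1"), cast_string_to_my_height(" 72 "))
```

With these definitions, `cast_string_to_my_boolean("true")` fails at the pattern check even though "true" is a valid xs:boolean, and `cast_string_to_my_height("100")` fails only after the conversion has succeeded.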

Casting to the types xs:NOTATION, xs:anySimpleType, or xs:anyAtomicType is not permitted because these types are abstract (they have no immediate instances).

Special rules apply when casting to namespace-sensitive types. The types xs:QName and xs:NOTATION are namespace-sensitive. Any type derived by restriction from a namespace-sensitive type is itself namespace-sensitive, as is any union type having a namespace-sensitive type among its members, and any list type having a namespace-sensitive type as its item type. For details, see .

Since version 3.0 of this specification, casting has been allowed between xs:QName and xs:NOTATION in either direction; this was not permitted in previous Recommendations. Version 3.0 also removed the rule that only a string literal (rather than a dynamic string) may be cast to an xs:QName.

When casting to a numeric type:

If the value is too large or too small to be accurately represented by the implementation, it is handled as an overflow or underflow as defined in .

If the target type is xs:float or xs:double, the string -0 (and equivalents such as -0.0 or -000) should be converted to the value negative zero. However, if the implementation is reliant on an implementation of XML Schema 1.0 in which negative zero is not part of the value space for these types, these lexical forms may be converted to positive zero.

In casting to xs:decimal or to a type derived from xs:decimal, if the value is not too large or too small but nevertheless cannot be represented accurately with the number of decimal digits available to the implementation, the implementation may round to the nearest representable value or may raise a dynamic error . The choice of rounding algorithm and the choice between rounding and error behavior is .

When casting to xs:duration, xs:dateTime, or xs:time, if the seconds component has more fractional digits than are supported by the implementation, excess digits must be truncated. This rule ensures that components other than the seconds component are unaffected: for example xs:dateTime('2023-12-31T23:59:59.999999999') is guaranteed to deliver an xs:dateTime value whose year component is 2023 rather than 2024.

Implementations are required to support millisecond precision or greater.

In casting to xs:date, xs:dateTime, xs:gYear, or xs:gYearMonth (or types derived from these), if the value is too large or too small to be represented by the implementation, a dynamic error is raised.

In casting to a duration value, if the value is too large or too small to be represented by the implementation, a dynamic error is raised.

For xs:anyURI, the extent to which an implementation validates the lexical form of xs:anyURI is .

If the cast fails for any other reason, a dynamic error is raised.
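The truncation rule for fractional seconds given above can be illustrated as follows. The function name `truncate_seconds` is hypothetical, and the default of three retained digits is an assumption corresponding to the minimum (millisecond) precision that implementations are required to support.

```python
from decimal import Decimal, ROUND_DOWN


def truncate_seconds(seconds_text: str, digits: int = 3) -> Decimal:
    """Drop fractional digits beyond the supported precision, never rounding up."""
    quantum = Decimal(1).scaleb(-digits)  # 0.001 when digits is 3
    return Decimal(seconds_text).quantize(quantum, rounding=ROUND_DOWN)


# Seconds component of xs:dateTime('2023-12-31T23:59:59.999999999')
print(truncate_seconds("59.999999999"))  # 59.999
```

Rounding half up would instead yield 60.000 here, which would carry into the minutes, hours, day, month and year; truncation guarantees that the year component remains 2023.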

Casting involving non-primitive types

Casting from xs:string and xs:untypedAtomic to any other type (primitive or non-primitive) has been described in . This section defines how other casts to non-primitive types operate, including casting to types derived by restriction, to union types, and to list types.

Casting to derived types

Casting a value to a derived type can be separated into a number of cases. In these rules:

The types xs:integer, xs:yearMonthDuration, and xs:dayTimeDuration are treated as quasi-primitive types (alongside the 20 truly primitive types).

For any atomic type T, let P(T) denote the most specific primitive or quasi-primitive type such that itemType-subtype(T, P(T)) is true.

The rules are then:

When the source type ST is the same type as the target type TT: this case always succeeds, returning the source value SV unchanged.

When itemType-subtype(ST, TT) is true: see .

When TT is the quasi-primitive type xs:integer and SV is an instance of xs:numeric: see .

When TT is the quasi-primitive type xs:yearMonthDuration or xs:dayTimeDuration and SV is an instance of xs:duration: see .

When P(ST) is the same type as P(TT): see .

Otherwise (P(ST) is not the same type as P(TT)): see .

Casting to xs:integer

When an atomic item SV is cast as xs:integer, the resulting xs:integer value TV is obtained as follows:

If ST is xs:decimal, xs:float or xs:double, then TV is SV with the fractional part discarded and the value converted to xs:integer. Thus, casting 3.1456 returns 3 while -17.89 returns -17. Casting 3.124E1 returns 31. If SV is too large to be accommodated as an integer (see for limits on numeric values), a dynamic error is raised. If SV is one of the special xs:float or xs:double values NaN, INF, or -INF, a dynamic error is raised.

In all other cases, the general rules of apply.

When casting to a subtype of xs:integer (for example, xs:long), the rules in apply. Note, however, that these rules treat xs:integer as a quasi-primitive type.
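The truncation rule for casting to xs:integer can be sketched in Python. The function name `cast_to_integer` is hypothetical; because Python integers are unbounded, the overflow condition that a real implementation must detect does not arise in this sketch.

```python
import math
from decimal import Decimal


def cast_to_integer(value) -> int:
    """Discard the fractional part (truncate toward zero); reject NaN and infinities."""
    if isinstance(value, float) and (math.isnan(value) or math.isinf(value)):
        raise ValueError("NaN and infinities cannot be cast to xs:integer")
    return math.trunc(value)


print(cast_to_integer(Decimal("3.1456")))  # 3
print(cast_to_integer(Decimal("-17.89")))  # -17
print(cast_to_integer(3.124e1))            # 31
```

Note that truncation is toward zero, so -17.89 yields -17 rather than -18.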

Casting to xs:yearMonthDuration and xs:dayTimeDuration

When the source value SV is an instance of xs:duration (including any subtype of xs:duration), then:

If the target type TT is xs:yearMonthDuration, the result is an instance of xs:yearMonthDuration whose months component is equal to the months component of SV. The seconds component of SV is ignored.

If the target type TT is xs:dayTimeDuration, the result is an instance of xs:dayTimeDuration whose seconds component is equal to the seconds component of SV. The months component of SV is ignored.

In all other cases, the general rules of apply.

In general, casting to xs:yearMonthDuration or xs:dayTimeDuration loses information.

When casting to a subtype of xs:dayTimeDuration or xs:yearMonthDuration, the rules in apply. Note, however, that these rules treat xs:dayTimeDuration and xs:yearMonthDuration as quasi-primitive types.
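The loss of information mentioned above is easy to see if a duration is modelled as a pair of months and seconds. In the following sketch the `Duration` class and both function names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Duration:
    """A duration value: a number of months plus a number of seconds."""
    months: int
    seconds: Decimal


def to_year_month_duration(value: Duration) -> Duration:
    """Keep the months component; the seconds component is ignored."""
    return Duration(value.months, Decimal(0))


def to_day_time_duration(value: Duration) -> Duration:
    """Keep the seconds component; the months component is ignored."""
    return Duration(0, value.seconds)


# P1Y2MT3H is 14 months and 10800 seconds
source = Duration(14, Decimal(10800))
print(to_year_month_duration(source))  # 14 months, i.e. P1Y2M
print(to_day_time_duration(source))    # 10800 seconds, i.e. PT3H
```

Neither result can be cast back to recover the original value P1Y2MT3H.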

Casting from derived types to parent types

It is always possible to cast an atomic item A to a type T if the relation A instance of T is true, provided that T is not an abstract type.

For example, it is possible to cast an xs:unsignedShort to an xs:unsignedInt, to an xs:integer, to an xs:decimal, or to a union type whose member types are xs:integer and xs:double.

Since the value space of the original type is a subset of the value space of the target type, such a cast is always successful.

For the expression A instance of T to be true, T must be either an atomic type, or a union type that has no constraining facets. It cannot be a list type, nor a union type derived by restriction from another union type, nor a union type that has a list type among its member types.

The result will have the same value as the original, but will have a new type annotation:

If T is an atomic type, then the type annotation of the result is T.

If T is a union type, then the type of the result is an atomic type M such that M is one of the atomic types in the transitive membership of the union type T and A instance of M is true; if there is more than one type M that satisfies these conditions (which could happen, for example, if T is the union of two overlapping types such as xs:int and xs:positiveInteger) then the first one is used, taking the member types in the order in which they appear within the definition of the union type.

Casting within a branch of the type hierarchy

It is possible to cast an SV to a TT if the type of the SV and the TT type are both derived by restriction (directly or indirectly) from the same primitive type, provided that the supplied value conforms to the constraints implied by the facets of the target type. This includes the case where the target type is derived from the type of the supplied value, as well as the case where the type of the supplied value is derived from the target type. For example, an instance of xs:byte can be cast as xs:unsignedShort, provided the value is not negative.

If the value does not conform to the facets defined for the target type, then a dynamic error is raised . See . In the case of the pattern facet (which applies to the lexical space rather than the value space), the pattern is tested against the canonical representation of the value, as defined for the source type (or the result of casting the value to an xs:string, in the case of types that have no canonical representation defined for them).

Note that this will cause casts to fail if the pattern excludes the canonical lexical representation of the source type. For example, if the type my:distance is defined as a restriction of xs:decimal with a pattern that requires two digits after the decimal point, casting of an xs:integer to my:distance will always fail, because the canonical representation of an xs:integer does not conform to this pattern.
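The my:distance example can be made concrete with a small sketch. The regular expression below is a hypothetical rendering of "a pattern that requires two digits after the decimal point", and the helper name is likewise an assumption; the point is only that the pattern is applied to the canonical representation of the source value.

```python
import re

# Hypothetical pattern facet for my:distance: digits, a point, exactly two digits.
DISTANCE_PATTERN = r"[0-9]+\.[0-9]{2}"


def conforms_to_pattern(canonical_form: str, pattern: str = DISTANCE_PATTERN) -> bool:
    """Test a pattern facet against the canonical representation of a value."""
    return re.fullmatch(pattern, canonical_form) is not None


# The canonical representation of the xs:integer 5 is "5": no decimal point at all.
print(conforms_to_pattern(str(5)))  # False
# The canonical representation of the xs:decimal 5.25 is "5.25".
print(conforms_to_pattern("5.25"))  # True
```

Since no xs:integer has a canonical representation containing a decimal point, the first check fails for every integer value.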

In some cases, casting from a parent type to a derived type requires special rules. See for rules regarding casting to xs:yearMonthDuration and xs:dayTimeDuration. See , below, for casting to xs:ENTITY and types derived from it.

Casting across the type hierarchy

When the ST and the TT are derived, directly or indirectly, from different primitive types, this is called casting across the type hierarchy. Casting across the type hierarchy is logically equivalent to three separate steps performed in order. Errors can occur in either of the latter two steps.

Cast the SV, up the hierarchy, to the primitive type of the source, as described in .

If SV is an instance of xs:string or xs:untypedAtomic, check its value against the pattern facet of TT, and raise a dynamic error if the check fails.

Let P(TT) be the most specific primitive or quasi-primitive type of which TT is a subtype, as described in .

Cast the value to P(TT), as described in if P(TT) is primitive, or as described in if P(TT) is quasi-primitive.

If TT is derived from xs:NOTATION, assume for the purposes of this rule that casting to xs:NOTATION succeeds.

Cast the value down to the target type TT, as described in

Casting to union types

If the target type of a cast expression (or a constructor function) is a type with variety union, the supplied value must be one of the following:

A value of type xs:string or xs:untypedAtomic. This case follows the general rules for casting from strings, and has already been described in .

If the union type has a pattern facet, the pattern is tested against the supplied value after whitespace normalization, using the whiteSpace normalization rules of the member datatype against which validation succeeds.

A value that is an instance of one of the atomic types in the transitive membership of the union type, and of the union type itself. This case has already been described in

This situation only applies when the value is an instance of the union type, which means it will never apply when the union is derived by facet-based restriction from another union type.

A value that is castable to one or more of the atomic types in the transitive membership of the union type (in the sense that the castable as operator returns true).

In this case the supplied value is cast to each atomic type in the transitive membership of the union type in turn (in the order in which the member types appear in the declaration) until one of these casts is successful; if none of them is successful, a dynamic error occurs . If the union type has constraining facets then the resulting value must satisfy these facets, otherwise a dynamic error occurs .

If the union type has a pattern facet, the pattern is tested against the canonical representation of the result value.

Only the atomic types in the transitive membership of the union type are considered. The union type may have list types in its transitive membership, but (unless the supplied value is of type xs:string or xs:untypedAtomic, in which case the rules in apply), any list types in the membership are effectively ignored.

If more than one of these conditions applies, then the casting is done according to the rules for the first condition that applies.

If none of these conditions applies, the cast fails with a dynamic error .

Example: consider a type U whose member types are xs:integer and xs:date.

The expression "123" cast as U returns the xs:integer value 123.

The expression current-date() cast as U returns the current date as an instance of xs:date.

The expression 23.1 cast as U returns the xs:integer value 23.

Example: consider a type V whose member types are xs:short and xs:negativeInteger.

The expression "-123" cast as V returns the xs:short value -123.

The expression "-100000" cast as V returns the xs:negativeInteger value -100000.

The expression 93.7 cast as V returns the xs:short value 93.

The expression "93.7" cast as V raises a dynamic error on the grounds that the string "93.7" is not in the lexical space of the union type.

Example: consider a type W that is derived from the above type V by restriction, with a pattern facet of -?\d\d.

The expression "12" cast as W returns the xs:short value 12.

The expression "123" cast as W raises a dynamic error on the grounds that the string "123" does not match the pattern facet.
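The member-by-member procedure can be sketched for the type V (member types xs:short and xs:negativeInteger). This is a simplified model under stated assumptions: the class and function names are hypothetical, member types are modelled as named integer ranges, Python strings stand for xs:string values, Python numbers stand for xs:decimal or xs:double values, and the case of a value that is already an instance of a member type is not modelled.

```python
import math
import re

INTEGER_LEXICAL = re.compile(r"[+-]?[0-9]+")


class IntegerMember:
    """A member type modelled as a named range of integers (None means unbounded)."""

    def __init__(self, name, low, high):
        self.name, self.low, self.high = name, low, high

    def _check(self, number: int):
        if self.low is not None and number < self.low:
            raise ValueError(f"{number} is below the range of {self.name}")
        if self.high is not None and number > self.high:
            raise ValueError(f"{number} is above the range of {self.name}")
        return (self.name, number)

    def from_string(self, text: str):
        lexical = text.strip()
        if not INTEGER_LEXICAL.fullmatch(lexical):
            raise ValueError(f"{lexical!r} is not in the lexical space of {self.name}")
        return self._check(int(lexical))

    def from_number(self, value):
        if isinstance(value, float) and not math.isfinite(value):
            raise ValueError("NaN and infinities cannot be cast to an integer type")
        return self._check(math.trunc(value))


def cast_to_union(value, members):
    """Try each member type in declaration order; the first success wins."""
    for member in members:
        try:
            if isinstance(value, str):
                return member.from_string(value)
            return member.from_number(value)
        except ValueError:
            continue
    raise ValueError(f"{value!r} cannot be cast to any member type")


V = [IntegerMember("xs:short", -32768, 32767), IntegerMember("xs:negativeInteger", None, -1)]

print(cast_to_union("-123", V))     # ('xs:short', -123)
print(cast_to_union("-100000", V))  # ('xs:negativeInteger', -100000)
print(cast_to_union(93.7, V))       # ('xs:short', 93)
```

In this model, `cast_to_union("93.7", V)` raises an error, because the string is not in the lexical space of either member type, whereas the number 93.7 is castable to xs:short.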

Casting to list types

If the target type of a cast expression (or a constructor function) is a type with variety list, the supplied value must be of type xs:string or xs:untypedAtomic. The rules follow the general principle for all casts from xs:string outlined in .

If the supplied value is not of type xs:string or xs:untypedAtomic, a type error is raised .

The semantics of the operation are consistent with validation: that is, the effect of casting a string S to a list type L is the same as constructing an element or attribute node whose string value is S, validating it using L as the governing type, and atomizing the resulting node. The result will always be either failure, or a sequence of zero or more atomic items each of which is an instance of the item type of L (or if the item type of L is a union type, an instance of one of the atomic types in its transitive membership).

If the item type of the list type is namespace-sensitive, then the namespace bindings in the static context will be used to resolve any namespace prefix, in the same way as when the target type is xs:QName.

If the list type has a pattern facet, the pattern must match the supplied value after collapsing whitespace (an operation equivalent to the use of the fn:normalize-space function).

For example, the expression "A B C D" cast as xs:NMTOKENS produces a sequence of four xs:NMTOKEN values, ("A", "B", "C", "D").

For example, given a user-defined type my:coordinates defined as a list of xs:integer with the facet <xs:length value="2"/>, the expression my:coordinates("2 -1") will return a sequence of two xs:integer values (2, -1), while the expression my:coordinates("1 2 3") will result in a dynamic error because the length of the list does not conform to the length facet. The expression my:coordinates("1.0 3.0") will also fail because the strings 1.0 and 3.0 are not in the lexical space of xs:integer.
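The my:coordinates example can be modelled with a short sketch. The function name `cast_to_coordinates` is hypothetical, and a Python list of `int` stands in for a sequence of xs:integer values.

```python
import re


def cast_to_coordinates(text: str) -> list:
    """Model of my:coordinates: a list of xs:integer with a length facet of 2."""
    tokens = text.split()  # collapse whitespace and split into list items
    values = []
    for token in tokens:
        if not re.fullmatch(r"[+-]?[0-9]+", token):
            raise ValueError(f"{token!r} is not in the lexical space of xs:integer")
        values.append(int(token))
    if len(values) != 2:
        raise ValueError(f"expected 2 items, found {len(values)}")
    return values


print(cast_to_coordinates("2 -1"))  # [2, -1]
```

Calling the function with "1 2 3" fails on the length check, and calling it with "1.0 3.0" fails on the first token.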

References

Normative references

Character Model for the World Wide Web 1.0: Fundamentals, Martin J. Dürst, François Yergeau, et al., Editors. World Wide Web Consortium, 15 February 2005. This version is http://www.w3.org/TR/2005/REC-charmod-20050215/. The latest version is available at https://www.w3.org/TR/charmod/.
HTML: Living Standard. WHATWG, 18 November 2022.
DOM: Living Standard. WHATWG, 26 October 2022.
The tz timezone database, available at http://www.iana.org/time-zones. It is implementation-defined which version of the database is used.
IEEE. IEEE Standard for Floating-Point Arithmetic.
Open Group Base Specifications Issue 8. IEEE, 2024.
“IEEE Standard for Ethernet,” in IEEE Std 802.3-2022 (Revision of IEEE Std 802.3-2018). 29 July 2022. doi: 10.1109/IEEESTD.2022.9844436.
ISO (International Organization for Standardization). Codes for the representation of names of countries and their subdivisions - Part 1: Country codes. ISO 3166-1:2013.
ISO (International Organization for Standardization). Representations of dates and times. Third edition, 2004-12-01. ISO 8601:2004(E). Available from: http://www.iso.org/.
ISO (International Organization for Standardization). ISO/IEC 10967-1:2012, Information technology—Language Independent Arithmetic—Part 1: Integer and floating point arithmetic. [Geneva]: International Organization for Standardization, 2012. Available from: http://www.iso.org/.
ISO (International Organization for Standardization). Information and documentation — Codes for the representation of names of scripts. ISO 15924:2004, January 2004.
Unicode Consortium. Codes for the representation of names of scripts — Alphabetical list of four-letter script codes. See . Retrieved February 2013; continually updated.
Legacy extended IRIs for XML resource identification. Henry S. Thompson, Richard Tobin, and Norman Walsh (eds), World Wide Web Consortium. 3 November 2008. Available at http://www.w3.org/TR/leiri/.
IETF. RFC 1321: The MD5 Message-Digest Algorithm. Available at: http://www.ietf.org/rfc/rfc1321.txt.
IETF. RFC 2376: XML Media Types. Available at: http://www.ietf.org/rfc/rfc2376.txt.
IETF. RFC 3986: Uniform Resource Identifiers (URI): Generic Syntax. Available at: http://www.ietf.org/rfc/rfc3986.txt.
IETF. RFC 3987: Internationalized Resource Identifiers (IRIs). Available at: http://www.ietf.org/rfc/rfc3987.txt.
IETF. RFC 4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files. Available at: http://www.ietf.org/rfc/rfc4180.txt.
IETF. RFC 6151: Updated Security Considerations for the MD5 Message-Digest and the HMAC-MD5 Algorithms. Available at: http://www.ietf.org/rfc/rfc6151.txt.
IETF. RFC 7159: The JavaScript Object Notation (JSON) Data Interchange Format. Available at: http://www.rfc-editor.org/rfc/rfc7159.txt.
H. Thompson and C. Lilley. XML Media Types. IETF RFC 7303. See http://www.ietf.org/rfc/rfc7303.txt.
National Institute of Standards and Technology. Secure Hash Standard (SHS). FIPS PUB 180-4. August 2015. See http://dx.doi.org/10.6028/NIST.FIPS.180-4.
Unicode Standard Annex #15: Unicode Normalization Forms. Ed. Mark Davis and Ken Whistler, Unicode Consortium. The current version is 16.0.0, dated 2024-08-14. The version to be used is implementation-defined. Available at: http://www.unicode.org/reports/tr15/.
Unicode Standard Annex #29: Unicode Text Segmentation. Ed. Josh Hadley, Unicode Consortium. The current version is 16.0.0, dated 2024-08-28. The version to be used is implementation-defined. Available at: http://www.unicode.org/reports/tr29/.
The Unicode Consortium, Reading, MA, Addison-Wesley, 2016. The Unicode Standard as updated from time to time by the publication of new versions. See http://www.unicode.org/standard/versions/ for the latest version and additional information on versions of the standard and of the Unicode Character Database. The version of Unicode to be used is implementation-defined, but implementations are recommended to use the latest Unicode version; currently, Version 9.0.0.
Unicode Technical Standard #10: Unicode Collation Algorithm. Ed. Mark Davis and Ken Whistler, Unicode Consortium. The current version is 16.0.0, dated 2024-08-22. The version to be used is implementation-defined. Available at: .
Unicode Technical Standard #35: Unicode Locale Data Markup Language. Ed. Mark Davis et al., Unicode Consortium. The current version is 47, dated 2025-03-11. The version to be used is implementation-defined. Available at: .
World Wide Web Consortium. XML Information Set (Second Edition). W3C Recommendation 4 February 2004. See http://www.w3.org/TR/xml-infoset/.
CITATION: T.B.D.
CITATION: T.B.D.
CITATION: T.B.D.
CITATION: T.B.D.
XML Schema Part 1: Structures Second Edition, 28 October 2004. Available at: http://www.w3.org/TR/xmlschema-1/.
XML Schema Part 2: Datatypes Second Edition, 28 October 2004. Available at: http://www.w3.org/TR/xmlschema-2/.
Invisible XML Specification, Steven Pemberton, editor. World Wide Web Consortium, 20 June 2020. This version is https://invisiblexml.org/1.0/. The latest version is available at https://invisiblexml.org/current/.

Non-normative references

Blake3 Algorithm Specification. Available at: https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blake3.pdf.
Edward M. Reingold and Nachum Dershowitz. Calendrical Calculations Millennium edition (2nd Edition). Cambridge University Press, ISBN 0 521 77752 6.
CLDR - Unicode Common Locale Data Repository. Available at: http://cldr.unicode.org.
Character Model for the World Wide Web 1.0: Normalization, Last Call Working Draft. Available at: http://www.w3.org/TR/2004/WD-charmod-norm-20040225/.
EXPath: Collaboratively Defining Open Standards for Portable XPath Extensions. http://expath.org/.
EXQuery: Collaboratively Defining Open Standards for Portable XQuery Extensions. http://exquery.org/.
EXSLT: A Community Initiative to Provide Extensions to XSLT. https://exslt.github.io.
FunctX Functions. http://www.functx.com/.
Stefan Goessner. Converting Between XML and JSON. https://www.xml.com/pub/a/2006/05/31/converting-between-xml-and-json.html. 31 May 2006.
HTML 4.01 Recommendation, 24 December 1999. Available at: http://www.w3.org/TR/REC-html40/.
ICU - International Components for Unicode. Available at http://site.icu-project.org.
The Open Group Base Specifications Issue 7 (IEEE Std 1003.1-2008). Available at: http://pubs.opengroup.org/onlinepubs/9699919799/.
IETF. RFC 822: Standard for the Format of ARPA Internet Text Messages. Available at: http://www.ietf.org/rfc/rfc822.txt.
IETF. RFC 850: Standard for Interchange of USENET Messages. Available at: http://www.ietf.org/rfc/rfc850.txt.
IETF. RFC 1036: Standard for Interchange of USENET Messages. Available at: http://www.ietf.org/rfc/rfc1036.txt.
IETF. RFC 1123: Requirements for Internet Hosts -- Application and Support. Available at: http://www.ietf.org/rfc/rfc1123.txt.
IETF. RFC 2616: Hypertext Transfer Protocol -- HTTP/1.1. Available at: http://www.ietf.org/rfc/rfc2616.txt.
IETF. RFC 3339: Date and Time on the Internet: Timestamps. Available at: http://www.ietf.org/rfc/rfc3339.txt.
Unicode Technical Standard #18: Unicode Regular Expressions. Ed. Mark Davis and Andy Heninger, Unicode Consortium. The current version is 17, dated 2013-11-19. Available at: http://www.unicode.org/reports/tr18/.
World Wide Web Consortium Working Group Note. Working With Timezones, October 13, 2005. Available at: http://www.w3.org/TR/2005/NOTE-timezone-20051013/.

Error codes

The error text provided with these errors is non-normative.

Error code used by fn:error when no other error code is provided.

Raised when fn:apply is called and the arity of the supplied function is not the same as the number of members in the supplied array.

This error is raised whenever an attempt is made to divide by zero.

This error is raised whenever numeric operations result in an overflow or underflow.

This error is raised when an integer used to select a member of an array is outside the range of values for that array.

This error is raised when the $length argument to array:subarray is negative.

Raised when casting to xs:decimal if the supplied value exceeds the implementation-defined limits for the datatype.

Raised by fn:resolve-QName and fn:QName when a supplied value does not have the lexical form of a QName or URI respectively; and when casting to decimal, if the supplied value is NaN or Infinity.

Raised when casting to xs:integer if the supplied value exceeds the implementation-defined limits for the datatype.

Raised when multiplying or dividing a duration by a number, if the number supplied is NaN.

Raised when casting a string to xs:decimal if the string has more digits of precision than the implementation can represent (the implementation also has the option of rounding).

Raised by fn:codepoints-to-string if the input contains an integer that is not the codepoint of a permitted character.

Raised by any function that uses a collation if the requested collation is not recognized.

Raised by fn:normalize-unicode if the requested normalization form is not supported by the implementation.

Raised by functions such as fn:contains if the requested collation does not operate on a character-by-character basis.

Raised by fn:char if the supplied character name is not recognized, or if it represents a codepoint that is not a permitted character.

Raised when parsing CSV input if a syntax error in the input CSV is found.

Raised when parsing CSV input if the field-separator, record-separator, or quote-character option is set to an invalid value.

Raised when parsing CSV input if the same delimiter character is assigned to more than one role.

Raised by the function from the get entry of csv-columns-record, if its $key argument is an xs:string and is not one of the known column names.

Raised by fn:id, fn:idref, and fn:element-with-id if the node that identifies the tree to be searched is a node in a tree whose root is not a document node.

Raised by fn:doc, fn:collection, and fn:uri-collection to indicate that either the supplied URI cannot be dereferenced to obtain a resource, or the resource that is returned is not parseable as XML.

Raised by fn:doc, fn:collection, and fn:uri-collection to indicate that it is not possible to return a result that is guaranteed deterministic.

Raised by fn:collection and fn:uri-collection if the argument is not a valid xs:anyURI.

Raised (optionally) by fn:doc if the argument is not a valid xs:anyURI.

Raised by fn:parse-xml if the supplied string is not a well-formed and namespace-well-formed XML document; or if DTD validation is requested and the document is not valid against its DTD.

Raised by fn:parse-xml if DTD validation is requested and the supplied string has no DTD or is not valid against the DTD.

Raised when the xsd-validation option to fn:parse-xml is supplied, and the value is not one of the permitted values; for example if the option type Q{U}NNN is used, and Q{U}NNN does not identify a type in the static context.

Raised when the xsd-validation option to fn:parse-xml is set to a value other than skip, if the processor is not schema-aware.

Raised when fn:serialize is called and the processor does not support serialization, in cases where the host language makes serialization an optional feature.

Raised by fn:parse-html if the supplied string is not a well-formed HTML document.

Raised when the dtd-validation option to fn:parse-xml is set, if no validating XML parser is available. Note: it is recommended that all processors should support the dtd-validation option, but there may be environments (such as web browsers) where this is not practically feasible.

Raised by fn:parse-xml if XSD validation is requested and the XML document represented by the supplied string is not valid against the relevant XSD schema.

Raised by fn:xsd-validator if it is not possible to assemble a valid and consistent schema.

This error is raised if the decimal format name supplied to fn:format-number is not a valid QName, or if the prefix in the QName is undeclared, or if there is no decimal format in the static context with a matching name.

This error is raised if a decimal format value supplied to fn:format-number is not valid for the associated property, or if the properties of the decimal format resulting from a supplied map do not have distinct values.

This error is raised if the picture string supplied to fn:format-number or fn:format-integer has invalid syntax.

Raised when casting to date/time datatypes, or performing arithmetic with date/time values, if arithmetic overflow or underflow occurs.

Raised when casting to duration datatypes, or performing arithmetic with duration values, if arithmetic overflow or underflow occurs.

Raised by adjust-date-to-timezone and related functions if the supplied timezone is invalid.

Raised by civil-timezone if no timezone data is available for the given date/time and place.

Raised by build-dateTime if the set of fields supplied does not correspond to those present in one of the types.

Raised by build-dateTime if one of the fields supplied has a value that is outside the supported range.

This error is raised if the picture string or calendar supplied to fn:format-date, fn:format-time, or fn:format-dateTime has invalid syntax.

This error is raised if the picture string supplied to fn:format-date selects a component that is not present in a date, or if the picture string supplied to fn:format-time selects a component that is not present in a time.

Raised by fn:hash if the effective value of the supplied algorithm is not one of the values supported by the implementation.

Raised by functions such as fn:json-doc, fn:parse-json or fn:json-to-xml if the string supplied as input does not conform to the JSON grammar (optionally with implementation-defined extensions).

Raised by functions such as map:merge, fn:json-doc, fn:parse-json or fn:json-to-xml if the input contains duplicate keys, when the chosen policy is to reject duplicates.

Raised by fn:json-to-xml if validation is requested when the processor does not support schema validation or typed nodes.

Raised by functions such as map:merge, fn:parse-json, and fn:xml-to-json if the $options map contains an invalid entry.

Raised by fn:xml-to-json if the XML input does not conform to the rules for the XML representation of JSON.

Raised by fn:xml-to-json if the XML input uses the attribute escaped="true" or escaped-key="true", and the corresponding string or key contains an invalid JSON escape sequence.

Raised by fn:element-to-map if the layout selected for converting elements of a given name is unsuitable for an element node with that name, or if the conversion plan explicitly defines the processing of a particular element as an error.

Raised by fn:resolve-QName and analogous functions if a supplied QName has a prefix that has no binding to a namespace.

Raised by fn:resolve-uri if no base URI is available for resolving a relative URI.

Raised by fn:path if the node supplied in the origin option is not an ancestor of the $node whose relative path is required.

Raised by fn:load-xquery-module if the supplied module URI is zero-length.

Raised by fn:load-xquery-module if no module can be found with the supplied module URI.

Raised by fn:load-xquery-module if a static error (including a statically detected type error) is encountered when processing the library module.

Raised by fn:load-xquery-module if a value is supplied for the initial context item or for an external variable, and the value does not conform to the required type declared in the dynamically loaded module.

Raised by fn:load-xquery-module if no XQuery processor is available supporting the requested XQuery version (or if none is available at all).

A general-purpose error raised when casting, if a cast between two datatypes is allowed in principle, but the supplied value cannot be converted: for example when attempting to cast the string "nine" to an integer.

Raised when either argument to fn:resolve-uri is not a valid URI/IRI.

Raised by fn:zero-or-one if the supplied value contains more than one item.

Raised by fn:one-or-more if the supplied value is the empty sequence.

Raised by fn:exactly-one if the supplied value is not a singleton sequence.

Raised by functions such as fn:max, fn:min, fn:avg, fn:sum if the supplied sequence contains values inappropriate to this function.

Raised by fn:dateTime if the two arguments both have timezones and the timezones are different.

A catch-all error for fn:resolve-uri, recognizing that the implementation can choose between a variety of algorithms and that some of these may fail for a variety of reasons.

Raised when the input to fn:parse-ietf-date does not match the prescribed grammar, or when it represents an invalid date/time such as 31 February.

Raised when the radix supplied to fn:parse-integer is not in the range 2 to 36.

Raised when the digits in the string supplied to fn:parse-integer are not in the range appropriate to the chosen radix.

Raised by regular expression functions such as fn:matches and fn:replace if the regular expression flags contain a character other than i, m, q, s, or x.

Raised by regular expression functions such as fn:matches and fn:replace if the regular expression is syntactically invalid.

Raised by fn:replace to report errors in the replacement string.

Raised by fn:replace if both the $replacement and $action arguments are supplied.

Raised by fn:data, or by implicit atomization, if applied to a node with no typed value, the main example being an element validated against a complex type that defines it to have element-only content.

Raised by fn:data, or by implicit atomization, if the sequence to be atomized contains a function item other than an array.

Raised by fn:string, or by implicit string conversion, if the input sequence contains a function item.

Raised by fn:unparsed-text or fn:unparsed-text-lines if the $source argument contains a fragment identifier, or if it cannot be resolved to an absolute URI (for example, because the base-URI property in the static context is absent), or if it cannot be used to retrieve the string representation of a resource.

Raised by fn:unparsed-text or fn:unparsed-text-lines if the $encoding argument is not a valid encoding name, if the processor does not support the specified encoding, if the string representation of the retrieved resource contains octets that cannot be decoded into Unicode characters using the specified encoding, or if the resulting characters are not permitted characters.

Raised by fn:unparsed-text or fn:unparsed-text-lines if the $encoding argument is absent and the processor cannot infer the encoding using external information and the encoding is not UTF-8.

A dynamic error is raised if the authority component of a URI contains an open square bracket but no corresponding close square bracket.

A dynamic error is raised if no XSLT processor suitable for evaluating a call on fn:transform is available.

A dynamic error is raised if the parameters supplied to fn:transform are invalid, for example if two mutually exclusive parameters are supplied. If a suitable XSLT error code is available (for example in the case where the requested initial-template does not exist in the stylesheet), that error code should be used in preference.

A dynamic error is raised if an XSLT transformation invoked using fn:transform fails with a static or dynamic error. The XSLT error code is used if available; this error code provides a fallback when no XSLT error code is returned, for example because the processor is an XSLT 1.0 processor.

A dynamic error is raised if the fn:transform function is invoked when XSLT transformation (or a specific transformation option) has been disabled for security or other reasons.

A dynamic error is raised if the result of the fn:transform function contains characters available only in XML 1.1 and the calling processor cannot handle such characters.
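Several of the conditions listed above are simple range checks. As an illustration, the two conditions described for fn:parse-integer (a radix outside the range 2 to 36, and digits not appropriate to the chosen radix) might be detected as in the following Python sketch. This is not a definition of fn:parse-integer: the function name parse_integer is hypothetical, and the handling of a leading sign, of surrounding whitespace, and of letter case are assumptions made for the example. Python exceptions stand in for the error codes.

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

def parse_integer(value, radix=10):
    # Condition 1: the radix must be in the range 2 to 36.
    if not 2 <= radix <= 36:
        raise ValueError("radix must be in the range 2 to 36")
    # Assumption for this sketch: trim whitespace and ignore letter case.
    text = value.strip().lower()
    sign = 1
    if text[:1] in ("+", "-"):
        sign = -1 if text[0] == "-" else 1
        text = text[1:]
    if not text:
        raise ValueError("no digits supplied")
    result = 0
    for ch in text:
        digit = DIGITS.find(ch)
        # Condition 2: every digit must be valid for the chosen radix.
        if digit < 0 or digit >= radix:
            raise ValueError(f"digit {ch!r} is not valid for radix {radix}")
        result = result * radix + digit
    return sign * result
```

With these assumptions, parse_integer("ff", 16) succeeds, whereas parse_integer("12", 37) fails the radix check and parse_integer("9", 8) fails the digit check.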

Built-in named record types

Named record types used in the signatures of built-in functions are now available as standard in the static context.

This appendix lists the named record types that are used in function signatures in this function library, and that are available in the static context of every application.

These definitions are all in the standard function namespace http://www.w3.org/2005/xpath-functions, which is normally bound to the prefix fn. Because this will not usually be the default namespace for types, the names will usually be written with the prefix fn.

Schemas

Rules have been added clarifying that users should not be allowed to change the schema for the fn namespace.

Two functions in this specification, fn:analyze-string and fn:json-to-xml, produce results in the form of an XDM node tree that must conform to a specified schema, defined in this appendix. In both cases the elements in the result are in the namespace http://www.w3.org/2005/xpath-functions, which is therefore the target namespace of the relevant schema.

A processor may have built-in knowledge of this schema, or it may read it from external files. Any attempt to supply a modified form of this schema will have unpredictable consequences. Modification here includes not only actual changes to the text of a schema document, but also actions such as using xs:redefine or xs:override, adding members to substitution groups, or defining derived types. Processors are not required to detect and reject such modifications. When validating against this schema, it is recommended that processors should ignore or reject any xsi:schemaLocation or xsi:type attributes in the instance being validated.

The schema for this namespace is organized as three schema documents. The first is a simple umbrella document that includes the other two. A copy can be found at xpath-functions.xsd:

Schema for the result of fn:analyze-string

This schema describes the output of the function fn:analyze-string.

The schema is reproduced below, and can also be found in analyze-string.xsd:

Schema for the result of fn:json-to-xml

This schema describes the output of the function fn:json-to-xml, and the input to the function fn:xml-to-json.

The schema is reproduced below, and can also be found in schema-for-json.xsd:

Schema for the result of fn:csv-to-xml

This schema describes the output of the function fn:csv-to-xml.

The schema is reproduced below, and can also be found in schema-for-csv.xsd:

Glossary

Functions defined elsewhere

This Appendix describes some sources of functions that fall outside the scope of the function library defined in this specification. It includes both function specifications and function implementations. Inclusion of a function in this appendix does not constitute any kind of recommendation or endorsement; neither is omission from this appendix to be construed negatively. This Appendix does not attempt to give any information about licensing arrangements for these function specifications or implementations.

XPath Functions Defined in Other W3C Recommendations

A number of W3C Recommendations make use of XPath, and in some cases such Recommendations define additional functions to be made available when XPath is used in a specific host language.

Functions Defined in XSLT

The various versions of XSLT have all included additional functions intended to be available only when XPath is used within XSLT, and not in other host language environments. Some of these functions were originally defined in XSLT, and subsequently migrated into the core function library defined in this specification.

Generally, the reason that functions have been defined in XSLT rather than in the core library has been that they required additional static or dynamic context information.

XSLT-defined functions share the core namespace http://www.w3.org/2005/xpath-functions (but in XPath 1.0 and XSLT 1.0, no namespace was defined for these functions).

The following table lists all functions that have been defined in XSLT, and summarizes their current status.

Function name                   Availability
fn:accumulator-after            XSLT 3.0 and later
fn:accumulator-before           XSLT 3.0 and later
fn:apply-templates              XSLT 4.0
fn:available-system-properties  XSLT 3.0 and later
fn:character-map                XSLT 4.0
fn:collation-key                Originally XSLT 3.0; then XPath 3.1 and later
fn:copy-of                      XSLT 3.0 and later
fn:current                      XSLT 1.0 and later
fn:current-group                XSLT 2.0 and later
fn:current-grouping-key         XSLT 2.0 and later
fn:current-merge-group          XSLT 3.0 and later
fn:current-merge-key            XSLT 3.0 and later
fn:current-merge-key-array      XSLT 4.0
fn:current-output-uri           XSLT 3.0 and later
fn:document                     XSLT 1.0 and later
fn:element-available            XSLT 1.0 and later
fn:format-date                  Originally XSLT 2.0; then XPath 3.0 and later
fn:format-dateTime              Originally XSLT 2.0; then XPath 3.0 and later
fn:format-number                Originally XSLT 1.0 and 2.0; then XPath 3.0 and later
fn:format-time                  Originally XSLT 2.0; then XPath 3.0 and later
fn:function-available           XSLT 1.0 and later
fn:generate-id                  Originally XSLT 1.0 and 2.0; then XPath 3.0 and later
fn:json-to-xml                  Originally XSLT 3.0; then XPath 3.1 and later
fn:key                          XSLT 1.0 and later
fn:map-for-key                  XSLT 4.0
fn:regex-group                  XSLT 2.0 and later
fn:regex-groups                 XSLT 4.0
fn:snapshot                     XSLT 3.0 and later
fn:stream-available             XSLT 3.0 and later
fn:system-property              XSLT 1.0 and later
fn:type-available               XSLT 2.0 and later
fn:unparsed-entity-public-id    XSLT 2.0 and later
fn:unparsed-entity-uri          XSLT 1.0 and later
fn:unparsed-text                Originally XSLT 2.0; then XPath 3.0 and later
fn:unparsed-text-available      Originally XSLT 2.0; then XPath 3.0 and later
fn:xml-to-json                  Originally XSLT 3.0; then XPath 3.1 and later
map:contains                    Originally XSLT 3.0; then XPath 3.1 and later
map:entry                       Originally XSLT 3.0; then XPath 3.1 and later
map:find                        Originally XSLT 3.0; then XPath 3.1 and later
map:for-each                    Originally XSLT 3.0; then XPath 3.1 and later
map:get                         Originally XSLT 3.0; then XPath 3.1 and later
map:keys                        Originally XSLT 3.0; then XPath 3.1 and later
map:merge                       Originally XSLT 3.0; then XPath 3.1 and later
map:put                         Originally XSLT 3.0; then XPath 3.1 and later
map:remove                      Originally XSLT 3.0; then XPath 3.1 and later
map:size                        Originally XSLT 3.0; then XPath 3.1 and later

XSLT 3.0 was well advanced when work started on XPath 3.1, but XPath 3.1 appeared as a Recommendation before XSLT 3.0 reached that status.

Functions Defined in XForms

XForms 1.1 is based on XPath 1.0. It adds the following functions to the set defined in XPath 1.0, using the same namespace:

boolean-from-string, is-card-number, avg, min, max, count-non-empty, index, power, random, compare, if, property, digest, hmac, local-date, local-dateTime, now, days-from-date, days-to-date, seconds-from-dateTime, seconds-to-dateTime, adjust-dateTime-to-timezone, seconds, months, instance, current, id, context, choose, event.

XForms 2.0 was first published as a W3C Working Draft, and subsequently as a W3C Community Group specification. These draft specifications do not include any additional functions beyond those in the core XPath specification.

Function Defined in XQuery Update 1.0

The XQuery Update 1.0 specification defines one additional function in the core namespace http://www.w3.org/2005/xpath-functions, namely fn:put. This function can be used to write a document to external storage. It is thus unusual in that it has side-effects; the XQuery Update 1.0 specification defines semantics for updating expressions including this function.

Although XQuery Update 1.0 is defined as an extension of XQuery 1.0, a number of implementers have adapted it, in a fairly intuitive way, to work with later versions of XQuery. At the time of this publication, later versions of the XQuery Update specification remain at Working Draft status.

Functions Defined by Community Groups

A number of community groups, with varying levels of formal organization, have defined specifications for additional function libraries to augment the core functions defined in this specification. Many of the resulting function specifications have implementations available for popular XPath, XQuery, and XSLT processors, though the level of support is highly variable.

The first such group was EXSLT. This activity was primarily concerned with augmenting the capability of XSLT 1.0, and many of its specifications were overtaken by core functions that became available in XPath 2.0. EXSLT defined a number of function modules covering:

Dates and Times
Dynamic XPath Evaluation
Common (containing most notably the widely used node-set function)
Math (max, min, abs, and trigonometric functions)
Random Number Generation
Regular Expressions
Sets (operations on sets of nodes including set intersection and difference)
String Manipulation (tokenize, replace, join and split, etc.)

Specifications from the EXSLT group can be found at https://exslt.github.io.

A renewed attempt to define additional function libraries using XPath 2.0 as its baseline formed under the name EXPath. Again, the specifications are in various states of maturity and stability, and implementation across popular processors is patchy. At the time of this publication the function libraries that exist in stable published form include:

Binary (functions for manipulating binary data)
File Handling (reading and writing files)
Geospatial (handling of geographic data)
HTTP Client (sending HTTP requests)
ZIP Facility (reading and creating ZIP files or similar archives)

The EXPath community has also been engaged in other related projects, such as defining packaging standards for distribution of XSLT/XQuery components, and tools for unit testing. Its specifications can be found at http://expath.org/.

A third activity has operated under the name EXQuery, which as the name suggests has focused on extensions to XQuery. EXQuery has published a single specification, RestXQ, which is primarily a system of function annotations allowing XQuery functions to act as endpoints for RESTful services. It also includes some simple functions to assist with the creation of such services. The RestXQ specification can be found at http://exquery.org/.

The FunctX Library

Many useful functions can be written in XSLT or XQuery, and in this case the function implementations themselves can be portable across different XSLT and XQuery processors. This section describes one such library.

FunctX is an open-source library of general-purpose functions, supplied in the form of XQuery 1.0 and XSLT 2.0 implementations. It contains over a hundred functions. Typical examples of these functions are:

Test whether a string is all-whitespace
Trim leading and trailing whitespace
Test whether all the values in a sequence are distinct
Capitalize the first character of a string
Change the namespace of all elements in a tree
Get the number of days in a given month
Get the first or last day in a given month
Get the date of the preceding or following day
Ask whether an element has element-only, mixed, or simple content
Find the position of a node in a sequence
Count words in a string

The FunctX library can be found at http://www.functx.com/.

Implementation-defined features

Changes since 3.1

Summary of Changes

Changes to Casts and Constructor Functions

The keyword for the argument has changed from arg to value.

The argument is now optional, and defaults to the context value (which is atomized if necessary). This change aligns constructor functions such as xs:string, xs:boolean, and xs:numeric with fn:string, fn:boolean, and fn:number.

Miscellaneous Changes

The semantics of the HTML case-insensitive collation "http://www.w3.org/2005/xpath-functions/collation/html-ascii-case-insensitive" are now defined normatively in this specification rather than by reference to the living HTML5 specification (which has changed since 3.1); and the rules now make ordering explicit rather than leaving it implementation-defined.

An option in an option map is now rejected if all of the following are true: it is not described in the specification, it is not supported by the implementation, and its name is in no namespace.
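To illustrate what an explicit ordering for the HTML case-insensitive collation can look like, the following Python sketch compares two strings codepoint by codepoint after folding only the ASCII letters A to Z. It is an illustration rather than the normative definition: the function names are hypothetical, and the choice of folding to lower case (rather than upper case) is an assumption of this sketch, which affects only how letters sort relative to the few ASCII characters that lie between the two alphabets.

```python
def ascii_fold(s):
    # Map only A-Z to a-z; non-ASCII letters such as "É" are left untouched.
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)

def html_ascii_ci_compare(a, b):
    # Compare the folded strings codepoint by codepoint.
    # Returns -1, 0 or +1, in the manner of a three-way comparison.
    fa = [ord(c) for c in ascii_fold(a)]
    fb = [ord(c) for c in ascii_fold(b)]
    return (fa > fb) - (fa < fb)
```

Under this sketch "Hello" and "hELLO" compare equal, while "É" and "é" do not, because only ASCII letters are folded.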

Editorial Changes

These changes are not highlighted in the change-marked version of the specification.

The value comparison operators such as eq, lt, and gt are now defined by reference to fn:compare and other functions. The datatype-specific functions op:XX-equal, op:XX-less-than, and op:XX-greater-than have therefore been dropped.

The names of parameters appearing in function signatures have been changed. This is to reflect the introduction of keyword arguments in XPath 4.0; the names chosen for parameters are now more consistent across the function library.

In 3.1 and earlier versions, the keywords used in the specification were for documentation purposes only, so these changes do not affect backwards compatibility.

Where appropriate, the phrase "the value of $x" has been replaced by the simpler $x. No change in meaning is intended.

For functions that take a variable number of arguments, wherever possible the specification now gives a single function signature indicating default values for arguments that may be omitted, rather than multiple signatures.

The formal specifications of array functions have been rewritten to use two new primitives: array:members, which converts an array to a sequence of value records, and array:of-members, which does the inverse. This has enabled many of the functions to be specified more concisely, and with less duplication between similar functions for sequences and arrays.
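The following non-normative sketch models the two primitives in Python to show how an array function can be expressed in terms of them. It rests on assumptions that are not part of this specification: an array is modelled as a Python list whose members are tuples, a value record is modelled as a dictionary with a single field assumed to be named "value", and the helper names members, of_members, and array_reverse are hypothetical.

```python
# Illustrative model only, not the normative definition.
# Assumption: an array is a list of members, each member a tuple (a sequence).
# Assumption: a value record is a dict with a single field named "value".

def members(array):
    """Model of array:members: one value record per array member."""
    return [{"value": member} for member in array]


def of_members(records):
    """Model of array:of-members: rebuild an array from value records."""
    return [record["value"] for record in records]


def array_reverse(array):
    """An array function written as a sequence operation on value records."""
    return of_members(list(reversed(members(array))))


example = [(1, 2), (), (3,)]
print(members(example))
print(array_reverse(example))
```

Because each member is wrapped in a record, an empty member survives the round trip instead of disappearing when the records are treated as a flat sequence.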

The appendix containing illustrative user-written functions has been dropped; many of these functions are no longer needed.

Backward compatibility

This section summarizes the extent to which this specification is compatible with previous versions.

Version 4.0 of this function library is fully backwards compatible with version 3.1, except as noted below:

In fn:deep-equal, and in other functions such as fn:distinct-values that refer to fn:deep-equal, the rules for comparing values of different numeric types (for example, xs:double and xs:decimal) have changed. In previous versions of the specification, xs:decimal values were converted to xs:double, leading to a possible loss of precision. This could make comparisons non-transitive, leading to problems when grouping, and potentially (depending on the sort algorithm) with sorting. The problem has been fixed by requiring comparisons to be performed based on the exact mathematical value without any loss of precision.

This means, for example, that deep-equal(0.2, 0.2e0) is now false, whereas in previous versions it was true. The two values are not mathematically equal, because the exact decimal equivalent of the xs:double value written as 0.2e0 is 0.200000000000000011102230246251565404236316680908203125.
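The non-transitivity described above can be illustrated with a non-normative Python sketch. It assumes that a Python float behaves as an xs:double and that decimal.Decimal stands in for xs:decimal; the function names equal_3_1 and equal_4_0 are hypothetical models of the old and new comparison rules, not functions defined by this specification.

```python
from decimal import Decimal
from fractions import Fraction


def equal_3_1(x, y):
    """Model of the old rule: convert both operands to double, then compare."""
    return float(x) == float(y)


def equal_4_0(x, y):
    """Model of the new rule: compare the exact mathematical values."""
    return Fraction(x) == Fraction(y)


a = Decimal("0.2")   # stands in for the xs:decimal 0.2
b = 0.2              # stands in for the xs:double 0.2e0
c = Decimal(b)       # exact decimal expansion of the double

print(c)
# Old rule: a equals b and b equals c, yet a and c are different decimals.
print(equal_3_1(a, b), equal_3_1(b, c), a == c)
# New rule: only the values that are mathematically equal compare equal.
print(equal_4_0(a, b), equal_4_0(b, c), equal_4_0(a, c))
```

Under the modelled old rule, grouping the three values would depend on the order in which they were compared; under the modelled new rule the relation is transitive.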

The corresponding change has not been made to the = and eq operators, because it was found to be too disruptive. For example, if the context node is the element <e price="10.0" discount="0.2"/>, there is an expectation that the expression @price - @discount = 9.8 should return true. However, assuming untyped data, the result of the subtraction is an xs:double whose precise value is 9.800000000000000710542735760100185871124267578125, so comparing the two values as decimals would return false.
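The arithmetic in this example can be checked with a short non-normative Python sketch, assuming that a Python float behaves as an xs:double and that decimal.Decimal can be used to display its exact value. The variable names are illustrative only.

```python
from decimal import Decimal

# Untyped attribute values are converted to double before the subtraction.
price = float("10.0")
discount = float("0.2")
difference = price - discount

# Comparison as doubles: the literal 9.8 is converted to the same double.
print(difference == 9.8)

# Exact value of the double result, and the comparison as exact decimals.
print(Decimal(difference))
print(Decimal(difference) == Decimal("9.8"))
```

The first comparison succeeds because both operands are the same double; the last one fails because the exact value of that double is not the decimal 9.8.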

In previous versions, unrecognized options supplied to the $options parameter of functions such as fn:parse-json were silently ignored. In 4.0, they are rejected as a type error, unless they are QNames with a non-absent namespace, or are extensions recognized by the implementation.

In version 4.0, omitting the $value of fn:error has the same effect as setting it to the empty sequence. In 3.1, the effects could be different (the effect of omitting the argument was implementation-defined).

In version 3.1, the fn:deep-equal function did not merge adjacent text nodes after stripping comments and processing instructions, so the elements <a>abc<!--note-->def</a> and <a>abcdef</a> were considered non-equal. In version 4.0, the text nodes are merged prior to comparison, so these two elements compare equal.
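The effect of merging text nodes can be modelled with a non-normative Python sketch that uses the standard xml.dom.minidom module. The element name a, the comment text, and the helper name text_nodes are assumptions made for illustration; the sketch models only the stripping and merging step, not fn:deep-equal itself.

```python
from xml.dom import minidom


def text_nodes(xml_text, merge):
    """Strip comments and processing instructions from the root element,
    optionally merge adjacent text nodes, and return the text node values."""
    root = minidom.parseString(xml_text).documentElement
    stripped = (minidom.Node.COMMENT_NODE,
                minidom.Node.PROCESSING_INSTRUCTION_NODE)
    for child in list(root.childNodes):
        if child.nodeType in stripped:
            root.removeChild(child)
    if merge:
        root.normalize()  # merges adjacent text nodes
    return [node.data for node in root.childNodes]


with_comment = "<a>abc<!--note-->def</a>"
without_comment = "<a>abcdef</a>"

# Without merging (the 3.1 behaviour in this model) the children differ.
print(text_nodes(with_comment, merge=False), text_nodes(without_comment, merge=False))
# With merging (the 4.0 behaviour in this model) they are the same.
print(text_nodes(with_comment, merge=True), text_nodes(without_comment, merge=True))
```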

In version 3.1, the atomic types xs:hexBinary and xs:base64Binary were not mutually comparable under the eq operator, and always compared not equal as map keys or under operations such as fn:distinct-values and fn:deep-equal. In version 4.0, instances of xs:hexBinary and xs:base64Binary are equal if they represent the same octet sequence. This means, for example, that the zero-length values xs:hexBinary("") and xs:base64Binary("") can no longer co-exist as keys in the same map.
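As a non-normative illustration, the following Python sketch decodes both lexical forms to octet sequences and compares the results. It assumes that Python bytes objects stand in for the value space of both types and that a Python dict stands in for a map; the sample values are illustrative.

```python
import base64

# The same five octets written in the two lexical forms.
hex_value = bytes.fromhex("48656c6c6f")
b64_value = base64.b64decode("SGVsbG8=")
print(hex_value == b64_value)

# The two zero-length values collapse to a single key.
keys = {bytes.fromhex(""): "from hexBinary"}
keys[base64.b64decode("")] = "from base64Binary"
print(len(keys))
```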

The format of numeric values in the output of fn:xml-to-json may be different. In version 3.1, the supplied value was parsed as an xs:double and then serialized using the casting rules, resulting in an input value of 10000000 being output as 1e7. In version 4.0, the value is output as is, except for any changes (such as stripping of leading zeroes or a leading plus sign) that might be needed to ensure the result is valid JSON.

In version 4.0, the function signature of fn:namespace-uri-for-prefix constrains the first argument to be either an xs:NCName or a zero-length string (the new coercion rules mean that any string in the form of an xs:NCName is acceptable). If a string is supplied that does not meet these requirements, a type error will be raised. In version 3.1, this was not an error: it came under the rule that when no namespace binding existed for the supplied prefix, the function would return the empty sequence.

Furthermore, because the expected type of this parameter is no longer xs:string, the special coercion rules for xs:string parameters in XPath 1.0 compatibility mode no longer apply. For example, supplying xs:duration('PT1H') as the first argument will now raise a type error, rather than looking for a namespace binding for the prefix PT1H.

Version 4.0 makes it clear that the casting of a value other than xs:string or xs:untypedAtomic to a list type (whether using a cast expression or a constructor function) is a type error. Previously this was defined as an error, but the kind of error and the error code were left unspecified. Accordingly, the function signatures of the constructor functions for built-in list types have been changed to use an argument type of xs:string?.

The way that fn:min and fn:max compare numeric values of different types has changed. The most noticeable effect is that when these functions are applied to a sequence of xs:integer or xs:decimal values, the result is an xs:integer or xs:decimal, rather than the result of converting this to an xs:double.

The type of the third argument of fn:format-number has changed from xs:string to (xs:string | xs:QName). Because the expected type of this parameter is no longer xs:string, the special coercion rules for xs:string parameters no longer apply. For example, it is no longer possible to supply an instance of xs:anyURI or (when XPath 1.0 compatibility mode is in force) an instance of xs:boolean or xs:duration.

When map:put replaces an entry in a map with a new value for an existing key, in the case where the existing key and the new key differ (for example, if they have different type annotations), it is no longer guaranteed that the new entry includes the new key rather than the existing key.
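By way of analogy only, Python's built-in dict exhibits one of the outcomes that this rule now permits: when a value is stored under a key that is equal to an existing key but has a different type, the existing key is retained. This sketch is not a model of XPath maps, and the variable name is illustrative.

```python
# The integer 1 and the float 1.0 are equal as dictionary keys.
entries = {1: "stored under the integer key"}
entries[1.0] = "stored under the float key"

# The value is replaced, but the original key object is kept.
for key, value in entries.items():
    print(type(key).__name__, value)
```

Code that inspects the type of a key after a replacement should therefore not rely on either outcome.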

In regular expressions, the assertions ^ and $ can no longer be followed by a quantifier. This is because (a) a quantifier that allows zero occurrences means that the assertion will always match, and (b) a quantifier that allows multiple occurrences has no effect. Processors may provide an option that allows such regular expressions to be accepted for compatibility reasons.
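Other regular expression dialects apply a comparable restriction. As a non-normative illustration, the following Python sketch shows that Python's re module, which is not the regular expression language defined by this specification, rejects a quantifier that directly follows ^ or $. The helper name is_accepted is hypothetical.

```python
import re


def is_accepted(pattern):
    """Return True if Python's re module compiles the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


for pattern in ["^*", "$+", "^?", "^a+$"]:
    print(repr(pattern), is_accepted(pattern))
```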

The fn:index-of function now treats NaN as equal to NaN.
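The practical effect can be sketched in Python as follows. This is a non-normative model that assumes Python floats stand in for xs:double values; the helper names same_value and index_of are hypothetical, and collations and type promotion are ignored.

```python
import math


def same_value(x, y):
    """Equality in which NaN is treated as equal to NaN."""
    if isinstance(x, float) and isinstance(y, float):
        if math.isnan(x) and math.isnan(y):
            return True
    return x == y


def index_of(sequence, target):
    """Return the 1-based positions of items equal to target."""
    return [position
            for position, item in enumerate(sequence, start=1)
            if same_value(item, target)]


values = [1.0, float("nan"), 2.0, float("nan")]
print(index_of(values, float("nan")))
```

With ordinary floating-point equality, in which NaN is not equal to itself, the same search would return no positions.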

For compatibility issues regarding earlier versions, see the 3.1 version of this specification.