\documentclass[nolinenum]{jfp} \usepackage[T1]{fontenc} \usepackage[scaled=0.9]{beramono} \usepackage{graphicx} \usepackage{listings} \usepackage{hyperref} \usepackage{relsize} \usepackage{multicol} \begin{document} \lstset{ columns=[c]fixed , basicstyle=\relscale{0.85}\ttfamily\linespread{1.1}\selectfont, keywordstyle=\bfseries, mathescape=true, upquote=true, commentstyle=\slshape, breaklines=true, showstringspaces=false} \lstdefinelanguage{ocaml}{language=[objective]caml, % Fixes double quotes in char literals literate={'"'}{\textquotesingle "\textquotesingle}3 {'\\"'}{\textquotesingle \textbackslash"\textquotesingle}4 {;}{\textbf{;}}1 {|}{\textbf{|}}1 {type'}{type'}5 % gets rid of bold } %%\journaltitle{JFP} %%\cpr{Cambridge University Press} %%\doival{10.1017/xxxxx} %\righttitle{Journal of Functional Programming} \righttitle{} \journaltitle{\textbf{WARNING, REJECTED by the JFP} but may still be useful} \cpr{Daniel C. Bünzli} \doival{xx.xxxx/xxxxx Document style kept for now, because layout was tailored for it. revision 1} \newcommand{\thetitle}{An alphabet for your data soups} \title{\thetitle} \lefttitle{\thetitle} \totalpg{\pageref{lastpage01}} \jnlDoiYr{2024} \newcommand{\json}{\textsc{json}} \newcommand{\ocaml}{\textsc{ocaml}} \newcommand{\ml}{\textsc{ml}} \newcommand{\code}[1]{{\lstinline[language=ocaml]{#1}}} \begin{authgrp} \renewcommand*{\thefootnote}{\fnsymbol{footnote}} \author{Daniel C. Bünzli}% \footnote{Funded by the Swiss National Science Foundation (grant \oldstylenums{\textsc{pp00p1\_211010}},\emph{ The Epistemology of Climate Change –– Understanding the Climate Challenge}) and a grant from the \href{https://ocaml-sf.org/}{OCaml Software Foundation}. }% \affiliation{Institute of Philosophy, University of Bern, Switzerland\\ (\email{daniel.buenzli@erratique.ch})} \end{authgrp} \begin{abstract} Dealing with ubiquitous but poorly typed and structured data models like \json{} in \ml{} languages is unpleasant. But it doesn't have to be. We show how to define a generalized algebraic datatype whose values denote bidirectional maps between subsets of values of a data model and the \ml{} values of your choice. With suitable combinators these maps are quick and pleasant to define in a declarative style. The result can be used by generic functions that decode, encode, query and update data soups with nicely typed values. \end{abstract} \maketitle[F] \section{Introduction} Processing generic data models like \json{} in \ml{} languages is unpleasant. \ml{} values can be converted to these data models with pickler combinators \citep{picklers} or datatype-generic programming techniques \citep{datatypegeneric}. However, partially or fully modelling \emph{their} data schemas remains cumbersome. Using a generic type for the data model works well in dynamically typed languages because it directly maps on their own type systems. But in \ml{} this representation is unnatural and frustrating to use. Instead, we show how to define a generalized algebraic datatype whose values denote bidirectional maps between subsets of values of the data model and the \ml{} values you want to use. With appropriate combinators to construct them, these descriptions can be made quick to define in the \emph{decoding} and \emph{encoding} direction possibly eliding one if not immediately useful. The values of this datatype can be used by generic functions that: \begin{itemize} \item Directly decode or encode the data model to the \ml{} values of your choice without constructing values of a generic representation of the data model.\footnote{For \json{}, since the shape of an object may depend on one of its members and that members are unordered, some form of generic buffering may be needed to decode certain data schemas.} \item Query and update data of partially modelled data schemas with arbitrary \ml{} values. \item Automatically construct them from other datatype-generic representations you may already have defined for your \ml{} types. \end{itemize} Like pickler combinators \citep{picklers}, the definition of these values can be made pleasantly declarative. The decoding and encoding bureaucracy is left to the generic functions that interpret the datatype. One way of understanding the datatype is to devise pickler combinators for the data model~––~rather than for the values of the \ml{} language~--~ but leave out the specific value coding machinery open for interpretation. Another way is to see it as a tagged final coding of the data model. Concretely the datatype allows to interpose your own functions at each data model value decoding and encoding step. These functions can be lossy or creative which naturally leads to data queries and data updates. In what follows we focus on presenting the datatype. Providing an ergonomic set of combinators to construct its values is important but less difficult. Definitions are kept as simple as possible to expose the essence of this \emph{finally tagged} representation. A practical implementation\footnote{One can be found at \url{https://erratique.ch/software/jsont}} should enrich these definitions with documentation strings for data schema documentation generation, text locations for human friendly error reporting and text layout information for layout preserving updates. We use \ocaml{} \citep{ocaml} for the \ml{} language and \json{} \citep{json} for the data model, but as we conclude in \autoref{sec:recipe} with the recipe, this technique is independent of them. For conciseness we use exceptions to represent partial functions but signatures can be changed to use explicit \code{result} or \code{either} return types where needed. No effects are needed from the \ml{} language. \section{A generic representation} \label{sec:genrep} First we define the type \code{Json.t}, a generic representation for \json{} values in \ml{}. In essence nothing more than an abstract syntax tree for \json{} text with one case for each sort of value. \begin{lstlisting}[language=ocaml] module Json = struct type t = | Null of unit | Bool of bool | Number of float | String of string | Array of t list | Obj of obj and obj = mem list and mem = string * t end \end{lstlisting} As can be seen later, the type \code{Json.t} remains useful. However it is the type that is unacceptable to work with in \ml{}. Given a fixed data schema to process we do not want to manipulate this soup of values: \begin{itemize} \item We want objects to be represented by proper record or variant types. Not by \code{Json.obj} association lists that must be dynamically name checked for expectations. \item We want to get typed values on array element and object member access. Not generic \code{Json.t} values that must be dynamically type checked for expectations. \end{itemize} \section{A typed representation to interpret} \label{sec:jsont} To replace these generic values by the \ml{} values we want, we introduce the type \code{'a jsont} whose values denote subsets of \json{} values and their bidirectional map to \ml{} values of type \code{'a}. We call these values ``\json{} types''. They belong to the following generalized algebraic datatype whose cases and elided definitions are detailed in subsequent sections. The \code{Rec} case is bureaucracy the laziest readers do not need to care about, it types recursives \json{} values if your \ml{} is strict. \begin{lstlisting}[language=ocaml] type ('a, 'b) base_map = $\ldots$ type ('a, 'elt, 'builder) array_map = $\ldots$ and ('o, 'dec) obj_map = $\ldots$ and 'a any_map = $\ldots$ and ('a, 'b) map = $\ldots$ and _ jsont = | Null : (unit, 'b) base_map -> 'b jsont | Bool : (bool, 'b) base_map -> 'b jsont | Number : (float, 'b) base_map -> 'b jsont | String : (string, 'b) base_map -> 'b jsont | Array : ('a, 'elt, 'builder) array_map -> 'a jsont | Obj : ('o, 'o) obj_map -> 'o jsont | Any : 'a any_map -> 'a jsont | Map : ('a, 'b) map -> 'b jsont | Rec : 'a jsont Lazy.t -> 'a jsont \end{lstlisting} % Except for \code{Any}, \code{Map} and \code{Rec}, the cases of the type \code{'a jsont} are in direct correspondence with those of \code{Json.t}. But rather than storing data in the cases we have functions to bidirectionally map them to values of a type \code{'a}. The \code{'a jsont} values are used alongside decoding and encoding processes to directly check and transform the shape of the data. For instance we can implement (see \hyperref[sec:appendix]{Appendix} and \autoref{sec:convert}) these two functions which decode and encode generic \code{Json.t} values with \ml{} values: \begin{lstlisting}[language=ocaml] val decode : 'a jsont -> Json.t -> 'a val encode : 'a jsont -> 'a -> Json.t \end{lstlisting} Representing \json{} data with \ml{} values becomes a matter of defining suitable \code{'a jsont} values. For example this kind of \json{} object for messages: % \begin{lstlisting}[language=c] { "content": "J'aime pas la soupe", "public": true } \end{lstlisting} % can be represented in \ml{} by a record with two fields. Using the record's natural constructor and field accessors, combinators whose implementation is given in \autoref{sec:mem_map}, and \ocaml's reverse function application operator \code{|>}, this kind of object is described by: \begin{lstlisting}[language=ocaml] module Message = struct type t = { content : string; public : bool } let make content public = { content; public } let content msg = msg.content let public msg = msg.public let jsont : t jsont = obj_map make |> obj_mem "content" string ~enc:content |> obj_mem "public" bool ~enc:public |> obj_finish end \end{lstlisting} \subsection{Base cases} \label{sec:base_cases} Every base case carries a value of type \code{base_map}: % \begin{lstlisting}[language=ocaml] type ('a, 'b) base_map = { dec : 'a -> 'b; enc : 'b -> 'a; } \end{lstlisting} % Values of this type describe bidirectional maps from values of type \code{'a} to \code{'b}. They are used to transform the canonical \ml{} type \code{'a} chosen for a \json{} base type to the one we want to use. The base cases are as follows: \begin{itemize} \item \code{Null m} maps \json{} nulls to type \code{'a} by mapping \code{unit} values with \code{m}. % \item \code{Bool m} maps \json{} booleans to type \code{'a} by mapping \code{bool} values with \code{m}. % \item \code{Number m} maps \json{} numbers or nulls%% \footnote{The semantics of \json{} numbers is left to be desired. Interoperable \json{} implementations map \json{} numbers to \textsc{ieee} 754 \code{binary64} values. But they are \emph{not} such values: \textsc{nan} and infinities cannot be represented. As of writing, the most widely deployed and formally defined \json{} encoder, namely \textsc{ecmascript}'s \mbox{\code{JSON.stringify}} \citep{ecmascript}, lossily encodes any non-finite floating point value by \code{null}.} %% to type \code{'a} by mapping \code{float} values with \code{m}. % \item \code{String m} maps unescaped \json{} strings to type \code{'a} by mapping \code{string} values with \code{m}. \end{itemize} Most of the time the map \code{m} used with base cases is the identity map. But having maps on base types is part of the strategy to interpose functions in every coding context. This is particulary useful for \json{} strings which are \json{}'s universal type: all sorts of enumerations, better represented by variants in \ml{}, can be found in them. More amusing, to reliably interchange \mbox{64-bit} integers with \json{} you need to encode them in strings.%% \footnote{Again, interoperable \json{} implementations map \json{} numbers to \textsc{ieee} 754 \code{binary64} values. Hence the only integers that can be interchanged safely without precision loss are those in the range $[-2^{53};2^{53}]$.} \subsection{Map case} \label{sec:map_case} The elided type \code{map} used by \code{Map} is: \begin{lstlisting}[language=ocaml] and ('a, 'b) map = { dom : 'a jsont; map : ('a, 'b) base_map; } \end{lstlisting} % A \code {Map m} value changes the \ml{} type of the \json{} type \code{m.dom} from \code{'a} to \code{'b}. It is a tool for composing \code{jsont} values. If the reader wonders whether it is not simpler to expose a base case like \code{String m} by the value \code{\{dom = String;} \code{ map = m\}}, the answer is rather negative. It is not directly evident in our simpler exposition but having maps in base cases provides the proper coding context for erroring or text layout preserving. This context may be more difficult to recover or no longer be available to generic functions when they get to process the \code{Map} case which is not syntactically related to \json{} text. \pagebreak \subsection{Array case} \label{sec:array_case} The elided type \code{array_map} used by \code{Array} is: % \begin{lstlisting}[language=ocaml] and ('array, 'elt, 'builder) array_map = { elt : 'elt jsont; dec_empty : 'builder; dec_skip : 'builder -> int -> bool; dec_add : 'builder -> int -> 'elt -> 'builder; dec_finish : 'builder -> 'array; enc : 'acc. ('acc -> 'elt -> 'acc) -> 'acc -> 'array -> 'acc; } \end{lstlisting} % An \code{Array m} value maps \json{} arrays of uniform \json{} type \code{m.elt} to values of type \code{'array} built using values of type \code{'builder}. The record \code{m} explains how to construct and deconstruct an \code{'array} value. For decoding, we start with the value \code{m.dec_empty}, the element in the \json{} array at index \code{i} is added with \code{m.dec_add}, unless \code{m.dec_skip} is \code{true} on \code{i} (the purpose of \code{dec_skip} will become clear later), and the final array is returned by \code{m.dec_finish}. For encoding, the \code{m.enc} function folds over the elements of an \code{'array} value to encode them to the \json{} array. \subsection{Any case} \label{sec:any_case} The elided type \code{any_map} used by \code{Any} is: % \begin{lstlisting}[language=ocaml] and 'a any_map = { dec_null : 'a jsont option; dec_bool : 'a jsont option; dec_number : 'a jsont option; dec_string : 'a jsont option; dec_array : 'a jsont option; dec_obj : 'a jsont option; enc : 'a -> 'a jsont; } \end{lstlisting} % An \code{Any m} value maps sets of \json{} values with multiple sorts to values of type \code{'a}. It embeds dynamic typing in our datatype. It also allows to decode and encode with different sorts of \json{} values. For decoding a \json{} value of sort \code{t}, a generic function uses the \json{} type \code{m.dec_t} or errors if \code{None}. For encoding, the \code{m.enc} function returns the \json{} type to use with the value. Given a \json{} type value \code{t} the following \code{option} combinator uses \code{Any} to make it nullable in \json{}. The result of \code{option t} is a \json{} type that maps \json{} null values to \code{None} and otherwise maps \json{} values as \code{t} does but with successful results wrapped by \code{Some}. % \begin{lstlisting}[language=ocaml] let option : 'a jsont -> 'a option jsont = fun t -> let none = Null { dec = Fun.const None; enc = Fun.const () } in let some = Map { dom = t; map = {dec = Option.some; enc = Option.get}}in let enc = function None -> none | Some _ -> some in let none = Some none and some = Some some in Any { dec_null = none; dec_bool = some; dec_number = some; dec_string = some; dec_array = some; dec_obj = some; enc; } \end{lstlisting} The \code{Any} case also allows to devise the \json{} type \code{json} which maps any \json{} value to its generic representation: % \begin{lstlisting}[language=ocaml] let json : Json.t jsont = $\ldots$ \end{lstlisting} % Its definition is left as an exercice for the reader but this value is a must for partial data schema modelling. \subsection{Obj case} \label{sec:obj_case} Mapping objects is more involved and the design is less self-evident. Challenges are that members in \json{} objects are unordered, that the shape of an object may depend on the value of one if its members\footnote{This is not a built-in mechanism of the data model but out-of-band constraints mandated by data schemas.} and that duplicate member names is undefined behaviour.\footnote{But \textsc{ecmascript}'s formally defined decoder \mbox{\code{JSON.parse}} \citep{ecmascript} mandates ``last one takes over.''} This means that we cannot rely on a fixed member ordering to construct the \ml{} value of an object and worse, that we may have to wait for its last member to type check it. To narrow the design space, we focus on a few patterns found in \json{} data schemas that we want to support without fuss while retaining efficient decodes for object shapes that are known beforehand. These patterns are: % \begin{enumerate} \item Objects as records. Member names and their types are known beforehand. Members are required or optional in which case they can have a default value. \item Objects as uniform key-value maps. Member names of the object are unknown but their values are all of the same type. This must compose with pattern 1. as with the \json{} type \code{json} (\autoref{sec:any_case}) it enables partial object modelling and supports data schemas that allow foreign members in their objects. \item Objects as sums. There is a distinguished \emph{case member}, for example named \code{"type"}, \code{"class"} or \code{"version"}, and its value further determines an object shape described using pattern 1., 2. or 3. \end{enumerate} % Finally we want \json{} object maps to be defined through functions that are already naturally provided for our \ml{} types: constructors and accessors. If the shape of an object cannot be captured by these patterns, it is always possible to map it to a uniform \code{Json.t} key-value map using pattern 2. followed by a \code{Map} to sort things out. This provides an ultimate escape hatch at the cost of unconditionnaly going through the generic representation. \subsubsection{(De)constructing arbitrary \ml{} values for \json{} objects} We want to represent \json{} objects by arbitrary \ml{} values of type \code{'o} which hold member values with their own distinct types \code{'a}$_1$, \code{'a}$_2$, etc. For encoding this is easily tackled by having one projection function of type \code{'o -> 'a}$_i$ for each object member. For decoding we need to provide a constructor function with one argument per member value that returns a value of type \code{'o}. To manipulate this constructor we use a datatype morally equivalent to this representation of a function application: \pagebreak \begin{lstlisting}[language=ocaml] type ('ret, 'f) app = | Fun : 'f -> ('ret, 'f) app | App : ('ret, 'a -> 'b) app * 'a -> ('ret, 'b) app \end{lstlisting} % In a value of type \code{app} we can lift an arbitrary function \code{f} returning \code{'ret} with the \code{Fun} case and instrument each argument application with \code{App} cases until \code{f} is fully ``applied'' to a value of type \code{('ret, 'ret)} \code{app}. We store object constructors in a similar data type but since we do not have the argument values yet we use a type witness\footnote{Available in the \ocaml{} standard library in \code{Type.Id} since \ocaml{} 5.1} to serve as a placeholder for the member value: % \begin{lstlisting}[language=ocaml] type ('ret, 'f) dec_fun = | Dec_fun : 'f -> ('ret, 'f) dec_fun | Dec_app : ('ret, 'a -> 'b) dec_fun * 'a Type.Id.t -> ('ret, 'b) dec_fun \end{lstlisting} % This allows to decode unordered and individually typed member values as they come, store them by type witness in an heterogeneous dictionary \code{Dict.t} (see implementation in the \hyperref[sec:appendix]{Appendix}) and, once we have collected all member values in the dictionary, we can invoke the constructor to get the \ml{} value for the object with this function: % \begin{lstlisting}[language=ocaml] let rec apply_dict : type ret f. (ret, f) dec_fun -> Dict.t -> f = fun dec dict -> match dec with | Dec_fun f -> f | Dec_app (f,arg) -> (apply_dict f dict) (Option.get (Dict.find arg dict)) \end{lstlisting} % For fully known object shapes this mechanism allows decoders to directly decode objects and their unordered member values to the representations we want to use in \ml{}. \subsubsection{Object maps} \label{sec:obj_map} The elided type \code{obj_map} used by \code{Obj} is: \begin{lstlisting}[language=ocaml] and ('o, 'dec) obj_map = { dec : ('o, 'dec) dec_fun; mem_decs : mem_dec String_map.t; mem_encs : 'o mem_enc list; shape : 'o obj_shape; } \end{lstlisting} % An \code{Obj m} value maps a \json{} object to a value of type \code{'o}. The \code{m.dec} field holds the constructor function for \code{'o} values. The \code{Obj} case in the definition of \code{jsont} (\autoref{sec:jsont}) constrains the \code{'dec} parameter to be equal to \code{'o} which ensures that the contructor is fully ``applied''. Remaining fields of the record are described in subsequent sections. \subsubsection{Member maps} \label{sec:mem_map} The \code{m.mem_decs} and \code{m.mem_encs} fields of \code{obj_map} describe members of the object that are known beforehand. Both fields hold the same values of type \code{mem_map} but they are sorted differently and their type parameters are hidden in slighly different ways to accomodate decoding and encoding processes. These types are defined by: \pagebreak \begin{lstlisting}[language=ocaml] and mem_dec = Mem_dec : ('o, 'a) mem_map -> mem_dec and 'o mem_enc = Mem_enc : ('o, 'a) mem_map -> 'o mem_enc and ('o, 'a) mem_map = { name : string; type' : 'a jsont; id : 'a Type.Id.t; dec_absent : 'a option; enc : 'o -> 'a; enc_omit : 'a -> bool; } \end{lstlisting} % A value \code{mm} of type \code{mem_map} maps a member \code{'a} of a \json{} object mapped to \code{'o}. \code{mm.name} is the member name. \code{mm.type'} is the \json{} type of its value. \code{mm.id} is the type witness to represent the member value in the constructor of \code{'o}. \code{mm.dec_absent} is a value to use if the member is absent on decodes; \code{None} means error on absence. \code{mm.enc} is the function to get back the member value from \code{'o} for encoding. \code{mm.enc_omit} is a predicate on the value returned by \code{mm.enc} to decide whether it should be omitted on encoding; usually this tests for equality with the value mentioned in \code{mm.dec_absent}. A member map \code{mm} needs to be added to an object map \code{m} in \code{m.mem_decs}, \code{m.mem_encs} and the \code{mm.id} type witness must be applied to the object constructor in \code{m.dec}. This is the duty of combinators. For example this one describes a required member and adds it to an object map: \begin{lstlisting}[language=ocaml] let obj_mem : string -> 'a jsont -> enc:('o -> 'a) -> ('o, 'a -> 'b) obj_map -> ('o, 'b) obj_map = fun name type' ~enc obj_map -> let id = Type.Id.make () in let dec_absent = None and enc_omit = Fun.const false in let mm = { name; type'; id; dec_absent; enc; enc_omit } in let dec = Dec_app (obj_map.dec, mm.id) in let mem_decs = String_map.add mm.name (Mem_dec mm) obj_map.mem_decs in let mem_encs = Mem_enc mm :: obj_map.mem_encs in { obj_map with dec; mem_decs; mem_encs; } \end{lstlisting} % At this point we can provide the full implementations of the combinators used in the message object modelling example given in \autoref{sec:jsont}. % \begin{lstlisting}[language=ocaml] let bool = Bool { dec = Fun.id; enc = Fun.id } let string = String { dec = Fun.id; enc = Fun.id } let obj_finish o = Obj { o with mem_encs = List.rev o.mem_encs } let obj_map : 'dec -> ('o, 'dec) obj_map = fun make -> let dec = Dec_fun make and shape = Obj_basic Unknown_skip in { dec; mem_decs = String_map.empty; mem_encs = []; shape } \end{lstlisting} \pagebreak \subsubsection{Object shapes} \label{sec:obj_shape} The last field of the \code{obj_map} type to describe is the \code{shape} field of type \code{obj_shape}: \begin{lstlisting}[language=ocaml] and 'o obj_shape = | Obj_basic : ('o, 'mems, 'builder) unknown_mems -> 'o obj_shape | Obj_cases : ('o, 'cases, 'tag) obj_cases -> 'o obj_shape \end{lstlisting} % This value indicates whether the members described in the object map are the final word on the shape of the object: \begin{itemize} \item \code{Obj_basic u} indicates that the object's members are fully known and the way to handle unknown member is described by \code{u}, see \autoref{sec:unknown_mems}. \item \code{Obj_cases cases} indicates that there is a case member described in \code{cases}. Each case member value gives another \code{obj_map} value which further describe the object, see \autoref{sec:obj_cases}. \end{itemize} % The \code{obj_shape} type definition turns object map values into a decision tree with \code{Obj_cases} nodes, branches labelled by case member values and with \code{Obj_basic} leaves. Each path in this tree describes a complete object whose members depend on case member values found in the data. We assume that the combinators constructing these values enforce the constraint that no member is defined twice in a path from the root to a leaf. Note that once you get an \code{Obj_basic} shape, all data dependent shapes have been determined and members can be directly decoded to their type without buffering them. \subsubsection{Unknown members} \label{sec:unknown_mems} The type \code{unknown_mems} used by \code{Obj_basic} shapes is: \begin{lstlisting}[language=ocaml] and ('o, 'mems, 'builder) unknown_mems = | Unknown_skip : ('o, unit, unit) unknown_mems | Unknown_error : ('o, unit, unit) unknown_mems | Unknown_keep : ('mems, 'a, 'builder) mems_map * ('o -> 'mems) -> ('o, 'mems) unknown_mems \end{lstlisting} % A value \code{u} of type \code{unknown_mems} maps to \code{'mems} the unknown members of a \json{} object mapped to \code{'o}. It respectively indicates to skip, error, or keep them. In the latter case the \code{Unknown_keep (m, enc)} value describes with \code{enc} how to get them back from \code{'o} for encoding and with \code{m}, how to map them to a value of type \code{'mems}. The values \code{enc} and \code{m} are kept separate because the type \code{'o} is bespoke while unknown member maps can be reused across object maps. The value \code{m} is of this type: % \begin{lstlisting}[language=ocaml] and ('mems, 'a, 'builder) mems_map = { mems_type : 'a jsont; id : 'mems Type.Id.t; dec_empty : 'builder; dec_add : string -> 'a -> 'builder -> 'builder; dec_finish : 'builder -> 'mems enc : 'acc. (string -> 'a -> 'acc -> 'acc) -> 'mems -> 'acc -> 'acc } \end{lstlisting} % This record maps unknown members of uniform \json{} type \code{m.mems_type} to a value of type \code{'mems} built using values of types \code{'builder}. Use the \json{} type \code{json} (\autoref{sec:any_case}) in \code{m.mems_type} for partial object modelling or objects that need to preserve foreign members. \code{m.id} is the type witness to represent the \code{'mems} value in the object constructor. For decoding, we start with the value \code{m.dec_empty}, unknown members are added with \code{m.dec_add} and the final \code{'mems} value is returned by \code{m.dec_finish}. For encoding \code{m.enc} allows to recover from \code{'mems} the unknown members to encode them in the \json{} object. \subsubsection{Object cases} \label{sec:obj_cases} Type type \code{obj_cases} used by \code{Obj_cases} shapes is: \begin{lstlisting}[language=ocaml] and ('o, 'cases, 'tag) obj_cases = { tag : ('o, 'tag) mem_map; (* 'o is irrelevant, 'tag is not stored *) tag_compare : 'tag -> 'tag -> int; id : 'cases Type.Id.t; cases : ('cases, 'tag) case list; enc : 'o -> 'cases; enc_case : 'cases -> ('cases, 'tag) case_value; } \end{lstlisting} % A value \code{m} of type \code{obj_cases} maps to \code{'cases} the object cases of an object mapped to \code{'o}. Cases are selected by the value of a case member of type \code{'tag} described in \code{m.tag}. Tag values are not stored in \code{'o} (the decoded case value is) so the \code{'o} parameter, \code{m.tag.id} and \code{m.tag.enc} are unused here. \code{m.tag_compare} allows to compare case tags. \code{m.id} is the type witness to represent the cases in the constructor of \code{'o}. \code{m.cases} is the list of cases. This is not a function on \code{'tag} values in order to make the description enumerable (e.g. for schema documentation generation). The type \code{case} hides the \code{'case} parameter of the type \code{case_map} which describes cases: \begin{lstlisting}[language=ocaml] and ('cases, 'tag) case = | Case : ('cases, 'case, 'tag) case_map -> ('cases, 'tag) case and ('cases, 'case, 'tag) case_map = { tag : 'tag; obj_map : ('case, 'case) obj_map; dec : 'case -> 'cases; } \end{lstlisting} % A value \code{cm} of type \code{case_map} describes a case of type \code{'case} part of the type \code{'cases}. \code{cm.tag} is the tag value that identifies the case in the data. \code{cm.obj_map} describes the additional shape this case gives to the object. \code{cm.dec} injects the decoded case into the type that gathers them. For encoding cases, the \code{m.enc} function of \code{obj_cases} gets back the case from \code{'o}. To find out how to encode it, the function \code{m.enc_case} is used. It returns a value of type \code{case_value} which has a the actual case value and its map for encoding: \begin{lstlisting}[language=ocaml] and ('cases, 'tag) case_value = | Case_value : ('cases, 'case, 'tag) case_map * 'case -> ('cases, 'tag) case_value \end{lstlisting} % The \code{m.enc_case} function is the only ad-hoc function that needs to be devised specifically for \code{jsont} values. All the other functions to describe objects are natural constructors and accessors of \ml{} types. The design for object cases allows to map them to a record type which has common fields for all cases and a field for the cases: % \begin{lstlisting}[language=ocaml] type type' = C1 of C1.t | C2 of C2.t $\ldots$ type t = { $\ldots$ (* Fields common to all cases *); type' : type'; } \end{lstlisting} % but they can also be described individually and mapped to a ``toplevel'' variant type if \code{'cases} coincides with \code{'o}: % \begin{lstlisting}[language=ocaml] type t = C1 of C1.t | C2 of C2.t $\ldots$ \end{lstlisting} \section{Decode and encode} \label{sec:convert} Given a \code{jsont} value we can decode and encode \json{} with \ml{} values without constructing generic \code{Json.t} values; except transiently for decoding object instances with data dependent shapes and poorly ordered members. Implementing a \json{} codec is beyond the scope of this paper but the \hyperref[sec:appendix]{Appendix} has implementations for \code{decode} and \code{encode} functions that convert \ml{} values with generic \code{Json.t} values. For \code{decode} we took care not to assume full in-memory access to an object's members. It thus shows how a decoder can proceed to provide best-effort on-the-fly decoding. Except for case members, the last occurence of duplicate members takes over, however all definitions must type as defined by the object map otherwise the decode errors. These limitations on duplicate members could be lifted with a more complex decoder but it may not be worth the trouble. The case for objects is more intricate than we would like it to be, but we blame \json{}'s loose specification for that. Otherwise the implementation of these functions mostly consists in recursing on the \code{jsont} values to boringly invoke the menagerie of functions that are packed therein. \section{Query and update} \label{sec:queryx} Since we can now interpose our functions in every coding context we get a very flexible data processing system. A type for data queries and a function to execute them can be as simple as: \begin{lstlisting}[language=ocaml] type 'a query = 'a jsont let query : 'a query -> Json.t -> 'a = decode \end{lstlisting} % In this view, queries are just transforming decodes. Their encoding direction can be made to fail or defined with anything that feels sensitive to encode the query result to. To navigate the structure of \json{} values to apply a query on a subtree, the following composable indexing combinators can be used: \begin{lstlisting}[language=ocaml] let get_mem : string -> 'a query -> 'a query = fun name q -> obj_map Fun.id |> obj_mem name q ~enc:Fun.id |> obj_finish let get_nth : int -> 'a query -> 'a query = fun nth q -> let dec_empty = None and dec_add _ _ v = Some v in let dec_skip _ k = nth <> k in let dec_finish = function None -> failwith "too short" | Some v -> v in let enc f acc v = f acc v (* Singleton array with the query result *) in Array { elt = q; dec_empty; dec_add; dec_skip; dec_finish; enc } \end{lstlisting} % The \code{get_nth} combinator explains the presence of \code{dec_skip} in the \code{array_map} type (\autoref{sec:array_case}). The query \code{q} only needs to succeed on the \code{nth} element. Without \code{dec_skip} we would apply it on every element of the array which is undesirable. The \code{dec_skip} field is the only bit in the design that was specifically added to support queries. For objects, skipping unknown members is quite natural to have in order to support data schema evolution. Typed updates of \json{} data is easy to specify as \code{Json.t} returning \json{} types. Decoders invoking such queries return updated \json{} as \code{Json.t} values. Here is a kernel of composable combinators to peform updates: \begin{lstlisting}[language=ocaml] val update_mem : string -> 'a jsont -> Json.t jsont val update_nth : int -> 'a jsont -> Json.t jsont val delete_mem : string -> Json.t jsont val delete_nth : int -> Json.t jsont val const : 'a jsont -> 'a -> 'a jsont \end{lstlisting} % The \code{update_mem} and \code{update_nth} combinators apply on the member or index value the decoder of the given \json{} type and replace it with the encoding of the result. Chaining update combinators allows to navigate arbitrarily nested \json{} to apply an update. All these combinators are simple \code{Map} over the \json{} type \code{json} (\autoref{sec:any_case}) with suitable uses of \code{encode} and \code{decode}. The implementations of \code{update_mem}, \code{delete_mem} and \code{const} are: \begin{lstlisting}[language=ocaml] let update_mem : string -> 'a jsont -> Json.t jsont = fun name q -> let dec = function | Json.Obj ms -> let update (n, v as m) = if n = name then (n, encode q (decode q v)) else m in Json.Obj (List.map update ms) | _ -> failwith "type error" in Map { dom = json; map = { dec; enc = Fun.id } } let delete_mem : string -> Json.t jsont = fun name -> let dec = function | Json.Obj ms -> Json.Obj (List.filter (fun (n, _) -> n <> name) ms) | _ -> type_error () in Map { dom = json; map = { dec; enc = Fun.id } } let const : 'a jsont -> 'a -> 'a jsont = fun t v -> let dec _ = v and enc _ = encode t v in Map { dom = json; map = { dec; enc } } \end{lstlisting} \section{The recipe} \label{sec:recipe} None of what was presented here is specific to the \json{} data model. A datatype similar to \code{jsont} (\autoref{sec:jsont}) can be devised for any data model. The recipe is as follows. % \begin{itemize} \item A base case is needed for every base type of the model. Having maps in these cases allows to accurately represent their coding contexts. (\autoref{sec:base_cases}) \item An \code{Array}-like case is needed for mapping the model's type for arrays. (\autoref{sec:array_case}) \item An \code{Obj}-like case is needed for mapping the model's type for key-value maps or records. The \ml{} ingredients here are: projection functions for encoding and, for decoding, a constructor function instrumented by a datatype representing function applications using type witnesses to indirectly refer to argument values. (\autoref{sec:obj_case}) \item An \code{Any}-like case is needed if the model is dynamically typed. It is used to map implicit sums of the model's types to a uniform \ml{} type. (\autoref{sec:any_case}) \item The \code{Map} case is needed for composing map values. (\autoref{sec:map_case}) \item The \code{Rec} case is needed in a strict \ml{} for representing recursive values of the data model. (\autoref{sec:jsont}) \end{itemize} % And with this we hope to have made your future data soups more edible in \ml{}. \appendix \begin{thebibliography}{} \bibitem[\protect\citename{Bray, }2017]{json} Bray, T., Ed. (2017) The JavaScript Object Notation (JSON) Data Interchange Format. {\it RFC 8259}. \url{https://doi.org/10.17487/RFC8259} \bibitem[\protect\citename{Guo, }2023]{ecmascript} Guo, S., Ficarra M., Gibbons, K., Eds (2023) ECMAScript® 2023 Language Specification. ECMA-262. \url{https://262.ecma-international.org/14.0/} \bibitem[\protect\citename{Gibbons, }2007]{datatypegeneric} Gibbons, J. (2007) Datatype-Generic Programming. {\it Lecture Notes in Computer Science}. vol 4719. \url{https://doi.org/10.1007/978-3-540-76786-2\_1} \bibitem[\protect\citename{Kennedy, }2004]{picklers} Kennedy, A. J., (2004) Pickler combinators. {\it Journal of Functional Programming}. 14(6), 727-739. \url{https://doi.org/10.1017/S0956796804005209} \bibitem[\protect\citename{Leroy {\it et al.}, }2023]{ocaml} Leroy, X., Doligez, D., Frisch, A., Garrigue, J., Rémy, D., Sivaramakrishnan, KC, \& Vouillon, J. (2023) The OCaml system release 5.1. Documentation and user’s manual. \mbox{\url{https://ocaml.org/manual}} \end{thebibliography} \section*{Appendix} \label{sec:appendix} \ocaml{} 5.1 implementation of the \code{decode} and \code{encode} functions mentioned in \autoref{sec:convert}. \lstset{basicstyle=\relscale{0.84}\ttfamily\linespread{1.1}\selectfont} \begin{lstlisting}[language=ocaml] module String_map = Map.Make (String) (* Errors *) let type_error () = failwith "type error" let unexpected_member n = failwith ("Unexpected member " ^ n) let missing_member n = failwith ("Missing member " ^ n) let unknown_case_tag () = failwith "Unknown case tag" (* Heterogeneous key-value maps *) module Dict = struct module M = Map.Make (Int) type binding = B : 'a Type.Id.t * 'a -> binding type t = binding M.t let empty = M.empty let add k v m = M.add (Type.Id.uid k) (B (k, v)) m let find : type a. a Type.Id.t -> t -> a option = fun k m -> match M.find_opt (Type.Id.uid k) m with | None -> None | Some B (k', v) -> match Type.Id.provably_equal k k' with | Some Type.Equal -> Some v | None -> assert false end (* Decode *) let rec decode : type a. a jsont -> Json.t -> a = fun t j -> match t with | Null map -> (match j with Json.Null v -> map.dec v | _ -> type_error ()) | Bool map -> (match j with Json.Bool b -> map.dec b | _ -> type_error ()) | Number map -> (match j with | Json.Number n -> map.dec n | Json.Null _ -> map.dec Float.nan | _ -> type_error ()) | String map -> (match j with Json.String s -> map.dec s | _ -> type_error ()) | Array map -> (match j with Json.Array vs -> decode_array map vs | j -> type_error ()) | Obj map -> (match j with Json.Obj mems -> decode_obj map mems | j -> type_error ()) | Map map -> map.map.dec (decode map.dom j) | Any map -> decode_any t map j | Rec t -> decode (Lazy.force t) j and decode_array : type a e b. (a, e, b) array_map -> Json.t list -> a = fun map vs -> let add (i, a) v = i + 1, (if map.dec_skip a i then a else map.dec_add a i (decode map.elt v)) in map.dec_finish (snd (List.fold_left add (0, map.dec_empty) vs)) and decode_obj : type o. (o, o) obj_map -> Json.obj -> o = fun map mems -> apply_dict map.dec @@ decode_obj_map map String_map.empty String_map.empty Dict.empty mems and decode_obj_map : type o. (o, o) obj_map -> mem_dec String_map.t -> mem_dec String_map.t -> Dict.t -> Json.obj -> Dict.t = fun map mem_miss mem_decs dict mems -> let u n _ _ = invalid_arg (n ^ "member defined twice") in let mem_miss = String_map.union u mem_miss map.mem_decs in let mem_decs = String_map.union u mem_decs map.mem_decs in match map.shape with | Obj_cases cases -> decode_obj_case cases mem_miss mem_decs dict [] mems | Obj_basic u -> match u with | Unknown_skip -> decode_obj_basic u () mem_miss mem_decs dict mems | Unknown_error -> decode_obj_basic u () mem_miss mem_decs dict mems | Unknown_keep (map, _) -> decode_obj_basic u map.dec_empty mem_miss mem_decs dict mems and decode_obj_basic : type o map builder. (o, map, builder) unknown_mems -> builder -> mem_dec String_map.t -> mem_dec String_map.t -> Dict.t -> Json.obj -> Dict.t = fun u umap mem_miss mem_decs dict -> function | [] -> let dict = match u with | Unknown_skip | Unknown_error -> dict | Unknown_keep (map, _) -> Dict.add map.id (map.dec_finish umap) dict in let add_default _ (Mem_dec m) dict = match m.dec_absent with | Some v -> Dict.add m.id v dict | None -> missing_member m.name in String_map.fold add_default mem_miss dict | (n, v) :: mems -> match String_map.find_opt n mem_decs with | Some (Mem_dec m) -> let dict = Dict.add m.id (decode m.type' v) dict in let mem_miss = String_map.remove n mem_miss in decode_obj_basic u umap mem_miss mem_decs dict mems | None -> match u with | Unknown_skip -> decode_obj_basic u umap mem_miss mem_decs dict mems | Unknown_error -> unexpected_member n | Unknown_keep (map, _) -> let umap = map.dec_add n (decode map.mems_type v) umap in decode_obj_basic u umap mem_miss mem_decs dict mems and decode_obj_case : type o cases tag. (o, cases, tag) obj_cases -> mem_dec String_map.t -> mem_dec String_map.t -> Dict.t -> Json.obj -> Json.obj -> Dict.t = fun cases mem_miss mem_decs dict delay mems -> let decode_case_tag tag = let eq_tag (Case c) = cases.tag_compare c.tag tag = 0 in match List.find_opt eq_tag cases.cases with | None -> unknown_case_tag () | Some (Case case) -> let mems = List.rev_append delay mems in let dict = decode_obj_map case.obj_map mem_miss mem_decs dict mems in Dict.add cases.id (case.dec (apply_dict case.obj_map.dec dict)) dict in match mems with | [] -> (match cases.tag.dec_absent with | Some t -> decode_case_tag t | None -> missing_member cases.tag.name) | (n, v as mem) :: mems -> if n = cases.tag.name then decode_case_tag (decode cases.tag.type' v) else match String_map.find_opt n mem_decs with | None -> decode_obj_case cases mem_miss mem_decs dict (mem :: delay) mems | Some (Mem_dec m) -> let dict = Dict.add m.id (decode m.type' v) dict in let mem_miss = String_map.remove n mem_miss in decode_obj_case cases mem_miss mem_decs dict delay mems and decode_any : type a. a jsont -> a any_map -> Json.t -> a = fun t map j -> let dec t m j = match m with Some t -> decode t j | None -> type_error () in match j with | Json.Null _ -> dec t map.dec_null j | Json.Bool _ -> dec t map.dec_bool j | Json.Number _ -> dec t map.dec_number j | Json.String _ -> dec t map.dec_string j | Json.Array _ -> dec t map.dec_array j | Json.Obj _ -> dec t map.dec_obj j (* Encode *) let rec encode : type a. a jsont -> a -> Json.t = fun t v -> match t with | Null map -> Json.Null (map.enc v) | Bool map -> Json.Bool (map.enc v) | Number map -> Json.Number (map.enc v) | String map -> Json.String (map.enc v) | Array map -> let encode_elt a elt = (encode map.elt elt) :: a in Json.Array (List.rev (map.enc encode_elt [] v)) | Obj map -> Json.Obj (List.rev (encode_obj map v [])) | Any map -> encode (map.enc v) v | Map map -> encode map.dom (map.map.enc v) | Rec t -> encode (Lazy.force t) v and encode_obj : type o. (o, o) obj_map -> o -> Json.obj -> Json.obj = fun map o obj -> let encode_mem obj (Mem_enc map) = let v = map.enc o in if map.enc_omit v then obj else (map.name, encode map.type' v) :: obj in let obj = List.fold_left encode_mem obj map.mem_encs in match map.shape with | Obj_basic (Unknown_keep (map, enc)) -> let encode_mem n v obj = (n, encode map.mems_type v) :: obj in map.enc encode_mem (enc o) obj | Obj_basic _ -> obj | Obj_cases cases -> let Case_value (case, c) = cases.enc_case (cases.enc o) in let obj = if cases.tag.enc_omit case.tag then obj else (cases.tag.name, encode cases.tag.type' case.tag) :: obj in encode_obj case.obj_map c obj \end{lstlisting} \label{lastpage01} \end{document}