Jaql: A JSON Query Language
Kevin Beyer, Vuk Ercegovac, Jun Rao, Eugene Shekita
{kbeyer,vercego,junrao,shekita}@us.ibm.com
Introduction
JavaScript Object Notation or JSON
has become a popular data format for many Web-based applications
because of its simplicity and modeling flexibility.
Wikipedia includes a nice
summary
of JSON's advantages over other data formats like XML.
In contrast to XML, which was originally designed as a markup language,
JSON was actually designed for data.
Moreover, JSON is a language-independent format, with bindings
in a variety of programming languages.
In short, JSON makes it easy to model a wide spectrum of
data, ranging from homogenous flat data to
heterogeneous nested data, and it can do this
in a language-independent format.
We believe that these characteristics make JSON
an ideal data format for many Hadoop applications
and databases in general.
Overview
In this document, we introduce Jaql, a query language for JSON.
Although Jaql has been designed specifically for JSON,
we have tried to borrow some of the best features of
SQL,
XQuery, and
PigLatin.
We begin with an example of JSON data, then go on to describe
the key features of Jaql and show how it can be used to process
JSON data in parallel using Hadoop's map/reduce framework.
Jaql is also designed to be extended. Both Jaql's functions
as well as its data-sources can be extended so as to facilitate
plugging in your computation for your data. While
only a high-level description of Jaql is presented here,
all examples can be run by
downloading the
most recent release of Jaql.
Note: Jaql is still in early development,
so beware that it is likely to change over the next few months.
The future development plans are outlined in the roadmap.
A JSON Example
Let's start off with an example of books and their reviews in JSON format:
[
{publisher: 'Scholastic',
author: 'J. K. Rowling',
title: 'Deathly Hallows',
year: 2007},
{publisher: 'Scholastic',
author: 'J. K. Rowling',
title: 'Chamber of Secrets',
year: 1999,
reviews: [
{rating: 10, user: 'joe', review: 'The best ...'},
{rating: 6, user: 'mary', review: 'Average ...'}]},
{publisher: 'Scholastic',
author: 'J. K. Rowling',
title: 'Sorcerers Stone',
year: 1998},
{publisher: 'Scholastic',
author: 'R. L. Stine',
title: 'Monster Blood IV',
year: 1997,
reviews: [
{rating: 8, user: 'rob', review: 'High on my list...'},
{rating: 2, user: 'mike', review: 'Not worth the paper ...',
discussion:
[{user: 'ben', text: 'This is too harsh...'},
{user: 'jill', text: 'I agree ...'}]}]},
{publisher: 'Grosset',
author: 'Carolyn Keene',
title: 'The Secret of Kane',
year: 1930}
]
This example shows an array of JSON objects.
Arrays are delimited by brackets '[]' and objects are delimited by braces '{}'.
Objects contain name:value pairs or members, where the
value can be an atomic type or a nested value.
In contrast to XML, the type of an atomic value is always known in JSON.
Here, each top-level object represents a book and its reviews.
The 'reviews' for a book object is an array and each entry
in the array corresponds to a review.
Each review consists of a 'rating', its 'user', and the text of the review.
Each review can also contain a 'discussion', which
itself can be discussed, forming a discussion thread.
It is important to point out that the Jaql
data model is actually a superset of JSON.
The only difference is that atomic types like date have been added to Jaql to
make certain database operations more efficient.
Also, for readability, the names in name:value pairs
do not always need to be quoted.
When data needs to be exported in standard JSON format,
Jaql can convert non-standard atomic types to strings
and add quotes to names.
For the remainder of this document, we will use the Jaql data
model and JSON interchangeably, although strictly speaking
the Jaql data model is a superset of JSON.
Reading and Writing JSON Data
Using Jaql, JSON data can be stored and retrieved
from a variety of data sources including ordinary files.
Jaql queries can take a collection as input and generate
a new collection as output, where a
collection corresponds to a JSON array.
An example using an ordinary file to store a collection
is as follows:
// Example 1. Write to a file named 'hey.dat'.
localWrite('hey.dat', [{text: 'Hello World'}]);
// Read it back...
localRead('hey.dat');
Here, a single object with the 'Hello World' string is
being written to a file called 'hey.dat' in the current directory.
It is also possible to read and write JSON data to
Hadoop HDFS files and
HBase tables.
We have made integration with Hadoop a priority by having all data types implement the
WritableComparable interface.
By integrating Jaql with HDFS and HBase, we are able to store JSON data
and process it in parallel using Hadoop's map/reduce framework.
Examples using Hadoop are provided below.
Our second example writes to an HDFS
SequenceFile.
A SequenceFile is a collection of key-value pairs.
Jaql's hdfsWrite() function only writes data into the value field, leaving the key field empty.
In example 2, the input data is represented as a literal, but in general the input
can be an expression that is the result of a Jaql query.
// Example 2. Write to a Hadoop SequenceFile named: 'orders.dat'.
hdfsWrite('orders.dat', [
{order: 1, cust: 'c1', items: [
{item: 1, qty: 2},
{item: 3, qty: 6},
{item: 5, qty: 10}]},
{order: 2, cust: 'c2', items: [
{item: 2, qty: 1},
{item: 5, qty: 2},
{item: 7, qty: 3}]},
{order: 3, cust: 'c1', items: [
{item: 1, qty: 2},
{item: 7, qty: 14},
{item: 5, qty: 10}]}
]);
// Read it back...
hdfsRead('orders.dat');
Our third example writes to an HBase table.
An HBase table is a collection of records, with each record containing a
primary key and a set of column name-value pairs.
Column names in HBase are of type
Text,
while column values are simply byte arrays.
HBase uses a two-part naming scheme for columns of the form 'column family:column'.
In Jaql, we use a sharp '#' instead of a colon ':' to separate column families from columns,
since the colon is already used as a separator in JSON.
If a column family is not specified, a special 'default' column family is used, as in the
following example.
// Example 3. Write to an HBase table named 'webcrawl'.
hbaseWrite('webcrawl', [
{key: "www.cnn.com", page:'...', rank: 0.9,
inlinks:[
{link: 'www.news.com', anchor: 'newsite'},
{link: 'www.jscript.com', anchor: 'look here'}]},
{key: "www.json.org", page:'...', rank: 0.8}
]);
// Read it back...
hbaseRead('webcrawl');
In hbaseWrite(), all objects in the input are written to a
single HBase table.
Each top-level JSON object is stored as an Hbase record with the specified key,
and each name:value pair in the object is stored as a separate column-value pair in
the record.
Values are serialized as a byte array.
Note that only the outermost name:value pairs in top-level objects are
stored as separate columns in Hbase.
Nested arrays and objects are serialized within these.
In example 3, two HBase records are written.
Each record is used to store the content, rank, and in-links (if any) of a web page.
The Jaql Query Language
Jaql is a functional query language that provides users with a simple,
declarative syntax to do things like filter, join, and group JSON data.
Jaql also allows user-defined functions to be written and used in expressions.
Let's begin with some examples using the book data presented earlier:
Our first two queries illustrate selection, projection, and filtering.
// Write the books collection from data above.
// hdfsWrite('books', [ {publisher... ] )
// Query 1. Return the publisher and title of each book.
for( $b in hdfsRead('books') )
[{ $b.publisher, $b.title }];
// result...
[
{publisher: 'Scholastic', title: 'Deathly Hallows'},
{publisher: 'Scholastic', title: 'Chamber of Secrets'},
{publisher: 'Scholastic', title: 'Sorcerers Stone'},
{publisher: 'Scholastic', title: 'Monster Blood IV'},
{publisher: 'Grosset', title: 'The Secret of Kane'}
];
// Query 2. Find the authors and titles of books that have received
// a review.
for( $b in hdfsRead('books') )
if( exists($b.reviews) )
[{ $b.author, $b.title }];
// result...
[
{author: 'J. K. Rowling', title: 'Chamber of Secrets'},
{author: 'R. L. Stine', title: 'Monster Blood IV'}
];
Query 1 uses a for expression to loop over the books
collection. Each book is bound to the variable $b and the loop body
is evaluated. Both the in-expression and the loop body must produce
arrays. The result of the for is an array that contains all of the
loop iteration results. In Query 1, every iteration returns a record
containing the publisher and title. In Query 2, the if
expression causes only books with reviews to be output.
// Query 3a. Project the title from each book using the short-hand
// projection notation.
hdfsRead('books')[*].title;
// Query 3a-alt. Or using equivalent the long-hand notation.
for( $b in hdfsRead('books') )
[ $b.title ];
// Query 3b. Project the user from each review of each book using the short-hand
// projection notation. The double-stars flattens the contained arrays.
hdfsRead('books')[**].reviews[*].user;
// Query 3b-alt. Or using equivalent the long-hand notation.
for( $b in hdfsRead('books') )
for( $r in $b.reviews )
[ $r.user ];
// result...
[ "Deathly Hallows",
"Chamber of Secrets",
"Sorcerers Stone",
"Monster Blood IV",
"The Secret of Kane" ];
// Query 4. Find authors, titles, and reviews of books where a review
// prompted a discussion by the user 'ben'.
for( $b in hdfsRead('books') )
if( 'ben' in $b.reviews[**].discussion[*].user )
[{ $b.author, $b.title, $b.reviews }];
// result...
[
{author: 'R. L. Stine',
title: 'Monster Blood IV',
reviews: [
{rating: 8, user: 'rob', review: 'High on my list...'},
{rating: 2, user: 'mike', review: 'Not worth the paper ...',
discussion: [
{user: 'ben', text: 'This is too harsh...'},
{user: 'jill', text: 'I agree ...'}]}]}
];
Queries 3a and 3b shows two short-hand notations for projecting
from an array, as well as their equivalent long-hand notations.
Query 4 is a more complicated variation of Query 2. Now the
predicate is based on values that are nested three levels deep (books,
reviews, discussion). The in predicate checks if the user 'ben'
discussed any book review.
// Query 5. Find the authors and titles of books that had an
// average review rating over 5.
for( $b in hdfsRead('books') )
if( avg($b.reviews[*].rating) > 5 )
[{ $b.author, $b.title }];
// result...
[
{author: 'J. K. Rowling', title: 'Chamber of Secrets'}
];
Query 5 filters books by their average rating.
For each book $b, the syntax $b.reviews[*].rating creates an array of review
ratings for each book $b. The average rating for book $b is then computed and tested
to see if it is greater than 5.
// Query 6. Show how many books each publisher has published.
group( $b in hdfsRead('books') by $p = $b.publisher into $pubs )
[{ publisher: $p, num: count($pubs) }];
// result...
[
{publisher: 'Scholastic', num: 4},
{publisher: 'Grosset', num: 1}
];
Query 6 illustrates grouping and aggregation.
The group expression partitions an input collection into groups.
Books are partitioned into groups by the grouping value $p,
which is set to $b.publisher.
The $pubs variable is bound to an array associated with each
group and used to count how many books each publisher has published.
// Query 7. Find the publisher who published the most books.
// group books by publisher and compute their book count
$g = group( $b in hdfsRead('books') by $p = $b.publisher into $pubs )
[{ publisher: $p, num: count($pubs) }];
// sort publishers by descending book count
$sorted = sort( $i in $g by $i.num desc );
// return the top publisher
$sorted[0];
// result...
{publisher: 'Scholastic', num: 4};
Query 7 illustrates grouping and sorting. It shows how
group and sort can be used to find the publisher who
published the most books. Variables are used to make it easier to
write the query. The group expression can also be used to
group multiple collections. To illustrate group, we define the
collections X and Y:
hdfsWrite('X',
[
{a:1, b:1},
{a:1, b:2},
{a:2, b:3},
{a:2, b:4}
] );
hdfsWrite('Y',
[
{c:2, d:1},
{c:2, d:2},
{c:3, d:3},
{c:3, d:4}
] );
Query 8 groups both X and Y, which is similar to a group over the
union of the input collections, except that two arrays are generated
for each group, one for each input collection.
// Query 8. Co-group X and Y.
group( $x in hdfsRead('X') by $g = $x.a into $xgroup,
$y in hdfsRead('Y') by $g = $y.c into $ygroup )
[{ g: $g, b: $xgroup[*].b, d: $ygroup[*].d }];
// result...
[
{g: 1, b: [1,2], d: []},
{g: 2, b: [3,4], d: [1,2]},
{g: 3, b: [], d: [3,4]}
];
In Query 8, both X and Y need to be grouped on the same value,
namely $g. The syntax $xgroup[*].b and $ygroup[*].d in the return
expression projects the 'b' and 'd' values in the arrays created for
each X and Y group, respectively.
Joins can be expressed using group, but the syntax can get a little messy.
Since joins are common, special syntax has been introduced for them:
// Query 9. Join X and Y.
join( $x in hdfsRead('X') on $x.a,
$y in hdfsRead('Y') on $y.c )
[{ $x.a, $x.b, $y.c, $y.d }];
// result...
[
{a: 2, b: 3, c: 2, d: 1},
{a: 2, b: 3, c: 2, d: 2},
{a: 2, b: 4, c: 2, d: 1},
{a: 2, b: 4, c: 2, d: 2},
];
Query 9 shows an inner join on 'a' and 'c' in X and Y, respectively.
Although it has not be shown, left-, right-, and full-outer joins
can also be specified using modifiers.
Writing Aggregate Functions
Jaql includes standard database aggregate functions, like sum,
count, min, max, and avg. Starting with Release
0.2, Jaql supports partial aggregation for these functions using
"combiners" inside Hadoop's map/reduce framework for greater
parallelism and reduced data shipping. Syntactically, these functions
look like "holistic" aggregate functions -- an aggregate that requires
all the data before returning an answer. However, they actually
expand into "algebraic" aggregates using the combine expression.
Median is the typical example of a holistic function:
$median = fn($items) (
$sorted = sort( $i in $items by $i ),
$sorted[int(count($sorted)/2)]
);
$median( [ 1, 4, 5, 3, 2 ] ); // 3
If you need the exact median, we cannot improve upon this much. But
consider variance instead. It can be computed from the sum of the
numbers and the sum of squares. The combine expression
is used to define "algebraic" aggregates -- an aggregate that can be
applied on portions of the data and combined to produce a final
result. Such aggregates typically have an "initialization phase" that
creates a partial aggregate from a single item, a "combining phase"
where partial aggregates are combined into larger partial aggregates,
and a "final phase" that transforms the largest partial aggregate into
the desired result.
The combine expression handles the combining
phase. It takes any two partial aggregates from its input, aggregates
them using the combining expression, and conceptually puts the result
back into the input until the input is reduced to a single item. The
following example defines variance completely in Jaql using
combine:
$var = fn($items) (
$init =
for( $i in $items )
if( not isnull($i) )
[{ n: 1, s1: $i, s2: $i*$i }],
$combined =
combine( $a, $b in $init )
{ n: $a.n + $b.n,
s1: $a.s1 + $b.s1,
s2: $a.s2 + $b.s2 },
$E_X = $combined.s1 / $combined.n,
$E_X2 = $combined.s2 / $combined.n,
$E_X2 - $E_X * $E_X
);
$var( [ 1, 4, 5, 3, 2 ] ); // 2
Many functions are cumbersome to define in a query language. Java functions are also
useful for defining both holistic and algebraic aggregates.
Extending Jaql
Jaql has been designed to be extensible. Custom code can be
plugged-in to extend Jaql's capability in two different ways:
- Jaql's library of functions to transform JSON data, and
- Jaql's input/output to access any data and possibly convert into JSON.
In this section, we briefly describe some interesting functions that
we have added to Jaql. The first example is mapReduce(), which takes a Jaql description of
the map and reduce functions as input, and uses it to run a map/reduce
job in Hadoop.
// Write to an HDFS file called 'sample'.
hdfsWrite('sample.dat', [
{x: 0, text: 'zero'},
{x: 1, text: 'one'},
{x: 0, text: 'two'},
{x: 1, text: 'three'},
{x: 0, text: 'four'},
{x: 1, text: 'five'},
{x: 0, text: 'six'},
{x: 1, text: 'seven'},
{x: 0, text: 'eight'}
]);
// Run a map/reduce job that counts the number objects
// for each 'x' value.
mapReduce(
{ input: {type: 'hdfs', location: 'sample.dat'},
output: {type: 'hdfs', location: 'results.dat'},
map: fn($i) [ [$i.x, 1] ],
reduce: fn($x, $v) [ {x: $x, num: count($v)} ]
});
// Read the results...
hdfsRead('results.dat');
// result...
[
{x: '0', num: 5},
{x: '1', num: 4}
];
This example groups the input on 'x' and counts the
number of objects in each group.
The map function must specify how to extract a key-value pair,
and the reduce function must specify how to aggregate the
values for a given key.
Here, the key value is set to $i.x and count($v) is
used to count the values $v associated with each key.
Note that both the map and reduce functions need to output an array
because each input is allowed to produce multiple outputs.
We have found mapReduce() to come in handy for many map/reduce jobs.
In general, the body of the map and reduce functions can
be Jaql expressions, so a lot of expressive power is available
to programmers who want to work with JSON data.
By working in Jaql, programmers are freed from all the little
details required to actually set up and run a map/reduce job.
The mapReduce() function touches on a interesting feature of Jaql,
namely that Jaql is a second-order language.
This allows function definitions to be assigned to variables
-- even stored in the database --
and later evaluated, as shown in the following example.
// Define a function that returns the most recent book
// written by a given author.
$mostRecent = fn($author) (
$authorsBooks =
for( $b in hdfsRead('books') )
if( $b.author == $author )
[{ title: $b.title, year: $b.year }],
$sorted = sort( $b in $authorsBooks by $b.year desc ),
$sorted[0].title
);
// Invoke the function.
$mostRecent('J. K. Rowling');
// result...
'Deathly Hallows';
Finally, to work with external data sources, we recently added a
jaqlGet() function that can retrieve JSON data from a URL. Below are
two examples of jaqlGet() that get data from Freebase and Yahoo! Traffic (the
latter requires you to supply an application id).
// Get albums recorded by "The Police" using Freebase.
$artist = "The Police";
$freebase =
httpGet('http://www.freebase.com/api/service/mqlread',
{ queries:
serialize(
{ myquery:
{ query:
[{ type: "/music/artist",
name: $artist,
album: []
}]
}
}
) })[0];
$freebase.myquery.result[**].album;
// result...
[ "Outlandos d\'Amour",
"Reggatta de Blanc",
"Zenyatta Mondatta",
"Ghost in the Machine",
"Synchronicity",
"Every Breath You Take: The Singles",
"Greatest Hits",
"Message in a Box: The Complete Recordings (disc 1)",
"Message in a Box: The Complete Recordings (disc 2)",
"Message in a Box: The Complete Recordings (disc 3)",
"Message in a Box: The Complete Recordings (disc 4)",
"Live! (disc 1: Orpheum WBCN/Boston Broadcast)",
"Live! (disc 2: Atlanta/Synchronicity Concert)",
"Every Breath You Take: The Classics",
"Their Greatest Hits",
"Can\'t Stand Losing You",
"Roxanne \'97 (Puff Daddy remix)",
"Roxanne \'97"];
// Get traffic incidents from Yahoo!.
$appid = "YahooDemo"; // Set to your yahoo application ID
$trafficData =
httpGet('http://local.yahooapis.com/MapsService/V1/trafficData',
{ appid: $appid,
street: "701 First Street",
city: "Sunnyvale",
state: "CA",
output: "json"
})[0];
$trafficData.ResultSet.Result[*].Title;
// result...
[ "Road construction, on US-101 NB at FAIROAKS AVE TONBTO NB MATHILDA",
"Road construction, on CA-85 SB at MOFFETT BLVD",
"Road construction, on CA-237 EB at MATHILDA AVE TOEBTO FAIR OAKS AVE",
"Road construction, on CA-237 WB at CROSSMAN AVE",
"Road construction, on I-880 at GATEWAY BLVD"];
Overview |
Java Functions |
Extending Data-sources |
Running Jaql |
Roadmap