It’s been almost a year now since jOOQ 3.14 was released on October 19, 2020 with SQL/JSON (and SQL/XML) support. Half a year later, we released jOOQ 3.15 with MULTISET support, which builds on top of these features to offer type-safe nested collections, the way every ORDBMS should implement them.
Building (dogfooding) on top of our own SQL/JSON API has revealed a lot of caveats in the various SQL/JSON implementations across vendors, and to be frank, it’s been a bit of a sobering experience. Despite there now being the ISO/IEC TR 19075-6 standard (mostly driven by Oracle this time), many vendors had already implemented some kind of JSON support, and it looks different in every dialect – to the extent that writing vendor-agnostic SQL/JSON is almost impossible in hand-written native SQL. You’ll need an API like jOOQ, or some other abstraction, to standardise across the different dialects.
In this article, I’d like to share a few of the biggest caveats encountered over the past year. There are many more; just try translating some standard SQL/JSON to various dialects with jOOQ.
JSON Type or String?
JSON documents can be seen as simple strings with formatted content. After all, you can parse a string into a JSON document, so why complicate the SQL type system with new types?
Unfortunately, this has proven to be a poor decision by most vendors. PostgreSQL got this right by offering the JSON and JSONB types, the latter being quite performant even when storing and indexing JSON documents.
First off, and this comes as no surprise to anyone working with almost any language other than SQL: types are semantic. Stringly-typed environments are bad. Or as the quote goes:
The Bitterness of Poor Quality Remains Long After the Sweetness of Low Price is Forgotten — Benjamin Franklin
With “Low Price” being the low price of the quick and dirty stringly-typed feature addition without a formal type. We’ve been through this time and again with BOOLEAN types, and dialects like MySQL still pretend that supporting this is a good idea:
SELECT *
FROM t
WHERE 1 OR 0 -- TRUE OR FALSE
Let’s look at examples of what I mean.
MariaDB and MySQL
Let’s look at MariaDB’s and MySQL’s syntax first. Here’s how to create a JSON array:
select json_array(1, 2)
Producing
[1, 2]
That’s great! And it even conforms to standard SQL syntax. A JSON array can also be nested easily:
select json_array(json_array(1, 2), 3)
To produce:
[[1, 2], 3]
Now, what if that nested array originates from a derived table?
select json_array(nested, 3)
from (select json_array(1, 2) nested) as t
It results in:
-- MySQL, MariaDB 10.6
[[1, 2], 3]

-- MariaDB 10.5
["[1, 2]", 3]
Bummer. In MariaDB 10.5, the nested array lost its “JSON type annotation” and reverted to the stringly-typed version of itself. As a string, it needs to be quoted. This seems to be a bug (https://jira.mariadb.org/browse/MDEV-26134), which has apparently been fixed already under a different issue in MariaDB 10.6. It’s not the only one, though. The bug tracker is full of similar issues, e.g. https://jira.mariadb.org/browse/MDEV-13701.
MySQL seems to fare a bit better by now, though there are caveats when aggregating (see further down).
The only workaround I’ve found so far for the above bugs is extremely laborious:
select json_array(json_merge_preserve('[]', nested), 3)
from (select json_array(1, 2) nested) as t
Imagine doing that every time you nest JSON. It works, but it really needs to be automated (e.g. through jOOQ).
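The failure mode is easier to see outside of SQL. This Python sketch (purely illustrative, not jOOQ code) contrasts nesting a JSON value with nesting its serialised string form, which is effectively what MariaDB 10.5 did:

```python
import json

inner = [1, 2]  # a real JSON array value

# Correct: nest the value itself, as MySQL and MariaDB 10.6 do
correct = json.dumps([inner, 3])

# Wrong: nest the already-serialised string, as MariaDB 10.5 did (MDEV-26134)
wrong = json.dumps([json.dumps(inner), 3])

print(correct)  # [[1, 2], 3]
print(wrong)    # ["[1, 2]", 3]
```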
Oracle
Oracle defined most of the SQL standard, and I really like how SQL-idiomatic its syntax feels, just like SQL/XML. Unfortunately, Oracle waited until Oracle 21c to introduce the new JSON type (can’t wait to play with it). As such, we have to choose whether to represent JSON as VARCHAR2 by default (limited to 4000 bytes on some systems, or 32kb at most!) or as CLOB. If you’re doing serious JSON, you can probably just always add the RETURNING CLOB clause to every single JSON function call of yours. Effectively:
select json_array(1, 2 returning clob)
from dual
The nested version looks like this:
select json_array(nested, 3 returning clob)
from (select json_array(1, 2 returning clob) nested from dual) t;
This works much better in Oracle, but there are still many edge cases that it doesn’t get right. Try this on Oracle 18c:
select json_arrayagg(coalesce(json_object(), json_object()))
from dual
And it produces another instance of erroneously stringly-typed JSON:
["{}"]
The workaround is to add FORMAT JSON just about everywhere, just to be sure, e.g.
select json_arrayagg(
coalesce(json_object(), json_object()) format json
)
from dual
Now, the result is as expected:
[{}]
To play it safe, you’ll probably have to write FORMAT JSON just about everywhere again, just like RETURNING CLOB.
Truncation
Truncation is probably the most sobering part of using JSON. Why do we still have size-limited data types in SQL? It hardly ever makes sense from an application perspective. Yet, here we are.
Oracle
Run this query in Oracle:
select json_arrayagg(owner) from all_objects;
And you’ll get:
SQL Error [40478] [99999]: ORA-40478: output value too large (maximum: 4000)
Increasing the maximum VARCHAR2 size to 32kb will only postpone the problem. There is no “reasonable” size limit to such documents, so again, you’ll have to RETURNING CLOB all the time.
select json_arrayagg(owner returning clob) from all_objects;
The price is the same as always: CLOB values are just a bit more annoying to work with than VARCHAR2, both within Oracle and in the client application (e.g. JDBC-based), because you have to deal with yet another resource when you just wanted a string. jOOQ will just add the clause everywhere for you; there’s hardly a reason not to do that with jOOQ, as the fetching of CLOB values is completely transparent to jOOQ users.
MySQL
Up until recently, you couldn’t call JSON_ARRAYAGG() in MySQL, and the MariaDB version crashed the server (https://jira.mariadb.org/browse/MDEV-21912). When this blog post was written, neither implementation supported the ORDER BY clause in JSON_ARRAYAGG(), which I find quite essential, so the workaround is to use GROUP_CONCAT:
select concat('[', group_concat(id), ']')
from t_book
Of course, that is very wrong if the concatenated values are not numbers, so we also need to use JSON_QUOTE, e.g.
select concat('[', group_concat(json_quote(title)), ']')
from t_book
And, if you’re embedding that stuff in other JSON structures, you have to turn what is now really a string back into JSON, using JSON_MERGE (which has since been deprecated) or JSON_MERGE_PRESERVE, e.g.
select json_object(
'titles',
json_merge_preserve(
'[]',
concat('[', group_concat(json_quote(title)), ']')
)
)
from t_book
To produce a document like this:
{"titles": ["1984", "Animal Farm", "O Alquimista", "Brida"]}
Without that JSON_MERGE_PRESERVE, you’d be getting:
{"titles": "[\"1984\",\"Animal Farm\",\"O Alquimista\",\"Brida\"]"}
Definitely not something you can ever remember.
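What happened there is classic double encoding: the aggregated array was serialised once, and then serialised again as a plain string when embedded in the outer object. A Python sketch of the two outcomes (illustrative only, not jOOQ code):

```python
import json

titles = ["1984", "Animal Farm", "O Alquimista", "Brida"]

# What JSON_MERGE_PRESERVE achieves: the array is embedded as a JSON value
correct = json.dumps({"titles": titles})

# Without it: the already-serialised array is embedded as a string
wrong = json.dumps({"titles": json.dumps(titles, separators=(",", ":"))})

print(correct)  # {"titles": ["1984", "Animal Farm", "O Alquimista", "Brida"]}
print(wrong)    # {"titles": "[\"1984\",\"Animal Farm\",\"O Alquimista\",\"Brida\"]"}
```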
Anyway. This section was about truncation! What happens with large, aggregated JSON documents in MySQL? Try this:
select
concat('[', group_concat(json_quote(table_name)), ']')
from information_schema.tables
It produces (on my machine):
["innodb_table_stats","innodb_index_stats","CHARACTER_SETS","CHECK_CONSTRAINTS","COLLATIONS","COLLATION_CHARACTER_SET_APPLICABILITY","COLUMNS","COLUMNS_EXTENSIONS","COLUMN_STATISTICS","EVENTS","FILES","INNODB_DATAFILES","INNODB_FOREIGN","INNODB_FOREIGN_COLS","INNODB_FIELDS","INNODB_TABLESPACES_BRIEF","KEY_COLUMN_USAGE","KEYWORDS","PARAMETERS","PARTITIONS","REFERENTIAL_CONSTRAINTS","RESOURCE_GROUPS","ROUTINES","SCHEMATA","SCHEMATA_EXTENSIONS","ST_SPATIAL_REFERENCE_SYSTEMS","ST_UNITS_OF_MEASURE","ST_GEOMETRY_COLUMNS","STATISTICS","TABLE_CONSTRAINTS","TABLE_CONSTRAINTS_EXTENSIONS","TABLES","TABLES_EXTENSIONS","TABLESPACES_EXTENSIONS","TRIGGERS","VIEW_ROUTINE_USAGE","VIEW_TABLE_USAGE","VIEWS","COLUMN_PRIVILEGES","ENGINES","OPTIMIZER_TRACE","PLUGINS","PROCESSLIST","PROFILING","SCHEMA_PRIVILEGES","TABLESPACES","TABLE_PRIVILEGES","USER_PRIVILEGES","cond_instances","error_log","events_waits_current","events_waits_history","events_waits_history_long","events_waits_summary_by_host_by_event_name","events_waits_summary_by]
Wait, what is that at the end?
,"events_waits_summary_by]
Invalid JSON, because the GROUP_CONCAT string got truncated. We could set the following:
set @@group_concat_max_len = 4294967295;
And now the output is correct, and much longer:
["innodb_table_stats",...,"t_identity_pk","t_triggers"]
An API like jOOQ can set this session variable automatically for you, but with native SQL, you probably don’t want to have to think about this all the time.
Data Type Support
JSON knows a few data types. Namely:
- string
- number
- boolean
- null
- object
- array
That’s fewer types than SQL has, but often good enough (anything can be encoded as a string, after all). But when you don’t have a BOOLEAN type in SQL (e.g. in MySQL or Oracle), then you have to manually map your already-manual boolean encoding to a JSON BOOLEAN.
MySQL
MySQL makes you believe that this works:
select json_array(true, false)
Because the above produces
[true, false]
But it doesn’t really work. That seems to be hard-coded in the parser. As soon as your true and false values are expressions instead of literals, e.g. originating from a derived table:
select json_array(t, f)
from (select true as t, false as f) as t
You’ll get:
[1, 0]
There are different ways to emulate this. One being:
select json_array(
json_extract(case when t = 1 then 'true' when t = 0 then 'false' end, '$'),
json_extract(case when f = 1 then 'true' when f = 0 then 'false' end, '$')
)
from (select true as t, false as f) as t
Now, we’re getting again:
[true, false]
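The CASE emulation above maps MySQL’s numeric boolean encoding back to JSON literals. The same idea, sketched in Python for illustration (not jOOQ’s actual implementation):

```python
import json

def to_json_bool(v):
    # MySQL encodes BOOLEAN as 1/0; map it back to a real JSON boolean,
    # keeping SQL NULL (None) as JSON null
    return None if v is None else bool(v)

print(json.dumps([to_json_bool(1), to_json_bool(0)]))  # [true, false]
```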
Oracle
Unlike MySQL / MariaDB, Oracle SQL doesn’t pretend it has a BOOLEAN type. Instead, people encode it as a NUMBER(1) or CHAR(1) or something else. Irrespective of the encoding, this is the solution:
select json_array(
case when t = 1 then 'true' when t = 0 then 'false' end format json,
case when f = 1 then 'true' when f = 0 then 'false' end format json
)
from (
select 1 as t, 0 as f, null as n
from dual
) t
Producing
[true,false]
NULL handling
When using NULL with SQL/JSON, there are various caveats. First off, SQL NULL is not the same thing as JSON null. Using PostgreSQL:
select
a,
b,
a = b as equal,
a is null as a_is_null,
b is null as b_is_null
from (
select null::jsonb as a, 'null'::jsonb as b
) as t
Produces:
|a  |b   |equal|a_is_null|b_is_null|
|---|----|-----|---------|---------|
|   |null|     |true     |false    |
Here, the empty cells are SQL NULL values, and the null value is a JSON null value that is not “SQL NULL”. It’s the only reasonable thing to do, really. Do the other dialects agree?
MySQL
select
a,
b,
a = b as equal,
a is null as a_is_null,
b is null as b_is_null
from (
select null as a, json_extract('null', '$') as b
) as t
Also producing:
|a  |b   |equal|a_is_null|b_is_null|
|---|----|-----|---------|---------|
|   |null|     |1        |0        |
So, yes!
Oracle
Let’s see, this could be how I can create a JSON null value:
select
a,
b,
case when a = b then 1 else 0 end as equal,
case when a is null then 1 else 0 end as a_is_null,
case when b is null then 1 else 0 end as b_is_null
from (
select null as a, json_value('[null]', '$[0]') as b
from dual
) t
Yet, Oracle is known to have a funky relationship with NULL strings. This is the result:
|A  |B  |EQUAL|A_IS_NULL|B_IS_NULL|
|---|---|-----|---------|---------|
|   |   |0    |1        |1        |
There does not seem to be an actual JSON null representation! I haven’t found a workaround for this yet. Perhaps I will. But this is very unfortunate, leading to many conversion edge cases.
On the other hand, Oracle is the only dialect among the ones presented in this article that has introduced a very useful NULL handling clause for aggregate functions. Check this out:
select
json_arrayagg(a),
json_arrayagg(a absent on null),
json_arrayagg(a null on null)
from (
select 1 as a from dual union all
select null from dual
) t
Producing
|A  |B  |C       |
|---|---|--------|
|[1]|[1]|[1,null]|
Note the above query doesn’t produce the correct result on Oracle 18c yet, because of a parser / optimiser bug. Use this to work around the bug for this example:
select
json_arrayagg(a) a,
json_arrayagg(a + 0 absent on null) b,
json_arrayagg(a - 0 null on null) c
from (
select 1 as a from dual union all
select null from dual
) t
Other dialects have different opinions on how to aggregate NULL values into JSON documents. In SQL, aggregate functions tend to ignore NULL values, as Oracle does above, but with JSON, it is often essential to include the value, especially when creating a JSON_OBJECT, where an absent key is not strictly the same thing as a key with a null value.
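To illustrate that last point, here is a small Python sketch (the key names are made up) showing that a key with a JSON null value and an absent key produce different documents, which consumers can tell apart:

```python
import json

# A key with a JSON null value vs. an absent key: different documents
with_null = json.dumps({"id": 1, "title": None})
absent = json.dumps({"id": 1})

print(with_null)  # {"id": 1, "title": null}
print(absent)     # {"id": 1}

# Consumers can distinguish the two by checking key presence
print("title" in json.loads(with_null))  # True
print("title" in json.loads(absent))     # False
```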
Db2
Db2 has a very limited implementation of SQL/JSON. It is very standards-compliant in terms of syntax, but it suffers from severe bugs that make it quite unusable for now. Surely, this will improve in the near future.
SQL Server
I’ve also largely left out SQL Server in this article. SQL Server has had JSON and XML support for a while, but implemented it completely differently. You can’t just form arbitrary JSON objects or arrays easily, but you can transform result sets into JSON in an intuitive way.
This offers a quick win when streaming results, but doesn’t compose very well. E.g. you can’t easily create a JSON_ARRAY with scalar, non-object content, though a JSON_OBJECT can be created like this:
select 1 as a, 2 as b
for json path, without_array_wrapper
Producing
{"a":1,"b":2}
A limited number of JSON features can be emulated also in SQL Server, and as with all the dialects that don’t support an actual JSON type, constant escaping / unescaping may be required.
Conclusion
SQL/JSON was standardised relatively late, mostly by Oracle. The standard is very sound. But regrettably, a lot of dialects disagree on syntax, behaviour, and some are still quite buggy.
The soundest of all implementations is that of PostgreSQL, which introduced proper data types and offers a rich set of vendor-specific functions to manipulate JSON directly in SQL. In the near future, PostgreSQL will embrace standard SQL/JSON, just like Oracle, and enhance its own implementation. I don’t think there will be anything fundamentally new, just more portable standard syntax.
There is nothing insurmountable for a library like jOOQ, which abstracts over all the dialects and offers the SQL/JSON (and SQL/XML) functionality through a uniform, standards-inspired API. As dialects improve, jOOQ will adopt the new, better syntax of individual vendors, so you can implement your SQL/JSON queries forward-compatibly.
So, as always, use jOOQ to work around the subtle and boring differences between the various vendor syntaxes, and start using SQL/JSON today. Despite the caveats presented here, SQL/JSON is one of the most exciting things about modern SQL!
Read more:
- Nesting Collections With jOOQ 3.14’s SQL/XML or SQL/JSON support
- jOOQ 3.15’s New Multiset Operator Will Change How You Think About SQL
- Stop Mapping Stuff in Your Middleware. Use SQL’s XML or JSON Operators Instead