phoenixThe other day I was hearing all about the Phoenix open source project, which has a great strap line of “we put the SQL back in NoSQL”. It’s a SQL skin over HBase that is provided as a JDBC driver, and it’s come out of Salesforce.com, and has been proposed as an Apache incubator project with the vote started yesterday (Thurs 5th December 2013).

What I find ironic about the whole SQL/NoSQL thing is how there is a huge amount of energy being put into “SQL-ising” NoSQL datastores. Obviously hive does it, Impala from Cloudera etc, and now Phoenix on top of HBase. Whilst being really impressive – and I mean that – Phoenix currently has some limitations that just bring home the maturity of the SQL/relational database world, such as the need to define your tables in the right order in your joins to optimise performance etc – features that SQL query optimisers have laughed in the face of for years.

One really nice feature of Phoenix is it’s support for secondary indexes, where under the covers it creates and maintains a separate HBase table but transparently uses it to prevent table scans when it can – something HBase developers have been laboriously hand-cranking for a while.

Also it provides query plans so you can understand what’s going on. In the relational world the query optimisers are so good these days that SQL developers can often be pretty slap dash in writing queries and still get good performance characteristics, at least up to a certain level of scale anyway – you are abstracted away from a lot of the underlying complexity, so can be more productive. Of course there is no substitute for understanding what is really going on under the hood, but in the “SQL on NoSQL” world, you really do need to understand the gory tuning nuts and bolts of the underlying NoSQL datastore or else you’re going to be in trouble.

The reasons behind the origins of Phoenix are compelling – Salesforce.com needed to store potentially millions of data items across many thousands of customers, and so they adopted HBase to deal with that scale. It’s fundamentally very batch in nature and they needed to support low latency web applications per customer. But the key driver for SQL-like interfaces that you hear repeated across all these NoSQL datastores is that well…everyone just knows SQL.

It is the lingua franca of data queries, and for most use cases, broad adoption by your developer community (even inside a very tech-savvy company like Facebook) is worth a heck of a lot more than that last 1% of NoSQL tuning that you might be able to squeeze out using a guru and the HBase API. SQL has proved to be very flexible across a wide number of data models – although the NoSQL community’s use of it has introduced lots of extensions, it’s not like the relational database vendors didn’t do that is it?

Long live SQL!