I attended the second Hive London Meetup tonight. I’m not quite sure what I was expecting – really I just wanted to rub shoulders with people in the big data world and see what they were up to. I was very glad I went. I met a number of that elusive animal, the real live data scientist, and also enjoyed a couple of great presentations.

The first thing that struck me was that pretty much all the big data use cases discussed – the ones actually being run in anger – were related to web ad tracking, clickstream/web analytics, social game analysis and the like. There was no mention at all of analysing financial data such as stock trades, or of insurance-sector actuarial analysis, for example. This very much fits with our experience in the wider market – people selling “stuff” (physical or digital products) on B2C web sites have had to address the data explosion first, while most large enterprises have yet to realise the opportunity in front of them.

The second key takeaway for me was that while this was a Hive Meetup, both presenters issued pretty cautionary messages about using Hive at all – kinda ironic. Yali Sassoon from SnowPlow Analytics explained how they had started their open source work using Hive on AWS EMR for most of the data processing/ETL and analysis – unpacking query strings from AWS CloudFront logs, creating partitioned Hive tables backed by S3, and then using Hive to run analysis over them, e.g. mimicking the functionality of Google Analytics. But over time, as they have matured their architecture, they’ve hit issues: troublesome debugging (you write something like SQL, but you debug Java :)), slow performance, and some analyses that Hive just isn’t able to support in a performant fashion. So they’ve moved to investigating Cascading for the ETL (e.g. for better control over data exceptions and replays), and column-oriented databases for the final analytics datastore they expose to their customers – specifically they’re using InfoBright. Notably though, all their customers still use Hive today for all their analysis. It’s just such an easy way to start on the Hadoop journey, and that counts for a lot. And for really big (petabytes) work, Hive is still the way to go.
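To make the pattern concrete – and this is purely an illustrative sketch, not SnowPlow’s actual schema (the bucket, table and column names below are invented) – the Hive-on-EMR approach boils down to declaring an external, partitioned table over the cleaned logs sitting in S3 and then querying it with ordinary HiveQL:

    -- Hypothetical example of the general pattern: an external Hive table
    -- over CloudFront-style logs stored in S3, partitioned by day so
    -- queries only scan the partitions they actually need.
    CREATE EXTERNAL TABLE IF NOT EXISTS cf_events (
      event_tm       STRING,
      user_ipaddress STRING,
      page_url       STRING,
      querystring    STRING
    )
    PARTITIONED BY (dt STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION 's3://my-bucket/cleaned-logs/';

    -- Register a day's partition, then run a Google-Analytics-style query
    ALTER TABLE cf_events ADD IF NOT EXISTS PARTITION (dt='2012-10-18')
      LOCATION 's3://my-bucket/cleaned-logs/dt=2012-10-18/';

    SELECT page_url, COUNT(*) AS page_views
    FROM cf_events
    WHERE dt = '2012-10-18'
    GROUP BY page_url
    ORDER BY page_views DESC
    LIMIT 20;

The appeal is obvious: the data stays in S3 and anyone who can write SQL can get going – the pain Yali described tends to show up later, once you’re debugging the generated MapReduce jobs rather than the query you typed.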

Pedro Figueiredo gave the second presentation and covered a collection of hints, tips and tuning suggestions for using Hive. Hive’s strength is also its weakness – it looks like SQL, so adoption by existing data analysts can be very rapid… er… but it isn’t SQL, so those same analysts can create some shockingly bad processes/queries if they don’t understand the underlying MapReduce architecture and its inherent strengths and weaknesses. I don’t see this as much different to the issue of novice SQL coders creating hideous joins with no indexes etc. – but when you’ve got gigabytes of data, things can get ugly pretty quickly…
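As a purely illustrative example (the table names are made up), the classic trap is writing joins exactly as you would against a relational database without thinking about how Hive turns them into MapReduce jobs:

    -- Looks like harmless SQL, but in a reduce-side join Hive buffers rows
    -- from every table except the LAST one for each join key – so listing
    -- the huge fact table first means buffering it, which can exhaust
    -- reducer memory and crawl.
    SELECT d.country, COUNT(*) AS views
    FROM page_views v
    JOIN country_dim d ON (v.country_code = d.code)
    GROUP BY d.country;

    -- Same query, but hinting (as you had to in the Hive of this era) that
    -- the small dimension table should be replicated to every mapper,
    -- turning the join into a map-side join so it needs no shuffle/reduce
    -- phase of its own.
    SELECT /*+ MAPJOIN(d) */ d.country, COUNT(*) AS views
    FROM page_views v
    JOIN country_dim d ON (v.country_code = d.code)
    GROUP BY d.country;

Nothing here is specific to Hive in principle – it’s the same “know your engine” discipline as indexing in an RDBMS – but the cost of getting it wrong scales with the data.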

Pedro reinforced the “don’t use Hive for (serious) ETL” message, recommending native Java or streaming Hadoop jobs in your preferred language instead. Use Hive for its strengths, such as summing/aggregating already preprocessed datasets into a form that is ready to load into your OLAP/BI tool of choice for visualisation. He also mentioned that he typically sees a 90% cost saving over on-demand instance pricing by using AWS spot instances on EMR (so it’s basically a no-brainer for worker nodes), but that the advantage diminishes once the US day kicks off.
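Again purely as an illustration (the table names are invented), that division of labour looks something like this – the heavy parsing and enrichment done upstream in Java or a streaming job, with Hive only doing the final roll-up into a small summary table for the BI tool to pull in:

    -- Hive doing the bit it's good at: rolling an already-cleaned event
    -- table up into a compact daily summary ready for a BI/OLAP tool.
    INSERT OVERWRITE TABLE daily_summary PARTITION (dt='2012-10-18')
    SELECT page_url,
           COUNT(*)                AS page_views,
           COUNT(DISTINCT user_id) AS unique_visitors
    FROM cleaned_events
    WHERE dt = '2012-10-18'
    GROUP BY page_url;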

Great session, very useful – and free-flowing beer and pizza.