When you lift the lid on what is going on in the big data analytics world, the pace of development and innovation is phenomenal. Yesterday I heard more about the range of optimisations and ongoing development work under way for Apache Hive and Hadoop itself, and it really reminds me of the pace of innovation I’ve seen in other technologies in the past, e.g. relational databases, application servers, and web and mobile development platforms. Hadoop has only existed for seven years or so, and the data analytics revolution that it kicked off is morphing further and further away from its origins. I guess that’s natural selection in operation. Anyone still using WML over WAP? :)
Hortonworks are a big contributor to the Hive project, and the Stinger initiative they are involved in is driving a number of really interesting optimisations and enhancements:
Further optimising the already optimised RCFile data storage structure so that queries can avoid I/O for blocks of data held in HDFS that contain no data relevant to the query, or where preaggregated results (precalculated min and max values etc.) make reading the block unnecessary
Optimising data retrieval to best exploit CPU on-chip memory buffers
Exploiting YARN (“Yet-Another-Resource-Negotiator” – a recent framework and sub-project of Apache Hadoop that facilitates writing arbitrary distributed processing frameworks and applications) and Tez (a new Apache incubator project) to avoid unnecessary HDFS writing and subsequent reading of intermediate results between sequential MapReduce jobs
Using “always on” Hadoop implementations (aka Tez service) to avoid the heavy penalty of JVM startup costs – which are very significant for otherwise relatively low cost queries
Enhancing query optimisation for common use cases, e.g. the loading of dimension tables (which are relatively small in comparison to the fact table) into memory on each node in the cluster to accelerate common queries against a data warehouse star schema
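To make two of the ideas in the list above a bit more concrete, here is a toy sketch in plain Python. It is not how Hive or Stinger implement anything; it only illustrates the principle of skipping blocks using precalculated min/max values, and of holding a small dimension table in memory so that each fact row can be resolved with a local lookup. All names here (`Block`, `sales_by_region`, the sample data) are made up for the illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Block:
    """A chunk of fact rows plus precalculated min/max of the filter column."""
    rows: list      # each row is (day, store_id, amount); hypothetical layout
    min_day: int
    max_day: int


def make_block(rows):
    """Build a block and record the min/max day once, at write time."""
    days = [row[0] for row in rows]
    return Block(rows=rows, min_day=min(days), max_day=max(days))


def sales_by_region(blocks, store_region, first_day, last_day):
    """Sum amount per region for rows with first_day <= day <= last_day.

    Returns (totals, blocks_read) so the effect of skipping is visible.
    """
    totals = defaultdict(int)
    blocks_read = 0
    for block in blocks:
        # The stored min/max tell us whether this block can contain any
        # matching row. If not, skip it without touching its rows.
        if block.max_day < first_day or block.min_day > last_day:
            continue
        blocks_read += 1
        for day, store_id, amount in block.rows:
            if first_day <= day <= last_day:
                # The small dimension table lives in memory, so the "join"
                # is a dictionary lookup rather than a separate join stage.
                totals[store_region[store_id]] += amount
    return dict(totals), blocks_read


# Small dimension table (store -> region), assumed to fit in memory.
STORE_REGION = {1: "north", 2: "south", 3: "north"}

# Fact table split into blocks, each covering a range of days.
BLOCKS = [
    make_block([(1, 1, 10), (2, 2, 20), (3, 3, 30)]),
    make_block([(4, 1, 40), (5, 2, 50), (6, 3, 60)]),
    make_block([(7, 1, 70), (8, 2, 80), (9, 3, 90)]),
]

if __name__ == "__main__":
    totals, blocks_read = sales_by_region(BLOCKS, STORE_REGION, 5, 7)
    print(totals, "blocks read:", blocks_read)
```

For a query over days 5 to 7, the first block is never read because its maximum day is 3. In a real cluster the saving is disk and network I/O rather than a loop iteration, but the decision being made is the same.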
Olivier Renault from Hortonworks presented some early data at yesterday evening’s London Hive meetup showing that they’d seen Hive query times drop by a factor of up to 35-40 using a combination of some of the above optimisations. So the Stinger initiative objective of a 100x performance improvement seems feasible, which really would be a transformational achievement. When most people run a Pig/Hive query on a Hadoop cluster for the first time, the reaction is rather “oh – that was slow for a simple query…”, and the strong drive to move Hadoop processing closer and closer to supporting interactive query use cases is causing some interesting overlaps. For example, Cloudera’s Impala takes a different approach to performance optimisation where all data is loaded into memory, so it can provide blindingly fast query performance but won’t ultimately scale as far as Hive (see here for more detail on this).
To be honest, for most of us mere mortals (who don’t work for Facebook, Yahoo! or Google), the tools are already out there to handle 99% of the use cases we need. We can also see from the above work that they are improving at such a rate that supporting interactive queries for multiple users on very large corporate datasets is becoming a reality, so our customers will just expect this in the future without question.
Update – 10th April – A similar set of slides to the ones Olivier presented is now available here