[Figure: Student's t-distribution diagram]

One of the oft-claimed benefits of Hadoop and stream-based big data processing architectures is the ability to process all the data, not just a sample of it – see, for example, this article from 2012 announcing "the end of sampling as we know it". Across our customers, I sometimes see sampling approaches taken on traditional relational database-based analytics platforms purely to get runtimes down to acceptable levels – not because that's what they want to do, but because that's what they have to do given the cost-performance constraints of the on-premise infrastructure.

Sean Owen from Cloudera (Director of Data Science) argued the opposite side of this debate at tonight's Hadoop London meetup, and it got me thinking about the designer's choices relating to this topic. Sean put a strong argument for embracing approximation: "good enough" answers derived from huge data sets are often all that are needed, and they can dramatically reduce the resource requirements – perhaps a 10:1 reduction in CPU needs or runtime, and perhaps an even larger 100:1 reduction in storage requirements if only the key business-relevant results are retained and "the noise" is discarded as part of the processing, such as keeping just the top 10 similar items when clustering like items together using Mahout. Maybe 99% of your retail web site users only ever get as far as viewing their top 3 recommendations anyway. It is also algorithmically possible to determine the accuracy of a result and the level of confidence in achieving it (e.g. when estimating the mean of a large set of values by sampling a much smaller subset of the data), so the potential error range in the results can be quantified for your business audience. This is similar to the approach used by BlinkDB, which trades off accuracy for response time.
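To make that error-quantification point concrete, here is a minimal sketch of my own (not from Sean's talk): estimating a population mean from a small random sample and reporting a 95% confidence interval. It uses the normal approximation, which is reasonable for a sample of this size; with a small sample you would swap in a Student's t critical value. All the numbers are illustrative assumptions.

```python
import math
import random
import statistics

# Hypothetical "full data set": a million values we'd rather not scan in full.
random.seed(0)
population = [random.gauss(100, 15) for _ in range(1_000_000)]

# Estimate the mean from a 1,000-row random sample instead.
sample = random.sample(population, 1_000)
sample_mean = statistics.mean(sample)
std_err = statistics.stdev(sample) / math.sqrt(len(sample))

# 95% confidence interval via the normal approximation (z = 1.96);
# for small samples, use the t-distribution's critical value instead.
lo, hi = sample_mean - 1.96 * std_err, sample_mean + 1.96 * std_err
print(f"mean ~= {sample_mean:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

The interval is exactly the kind of quantified "potential error range" you can put in front of a business audience: the answer came from 0.1% of the data, and here is how far off it could plausibly be.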

But there is a potential contradiction with my opening statement here – and of course it all comes down to the specific use case being tackled. Processing all the data, with nothing approximated, discarded, or sampled, really matters if, for example:

  • Subtle differences really do matter – when a difference of a few fractions of a percent is highly relevant
  • When searching data sets for exact matches or specific conditions you cannot afford to miss, e.g. fraud detection or reconciliation processes
  • In a shoulder case, where you are investigating not where the bulk of the data points lie but the transition between populations – where only a relatively small number of data points will exist even in a large data set
  • Where sampling will introduce bias because it is not truly random – I think this is probably one of the biggest fears and drivers: that some unintended bias is introduced into the results via the sampling mechanism. Sampling every 10th record from a stream is straightforward, but doing so when reading from large files can incur a significant I/O overhead, since you read but then discard 90% of the data.
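On that last point, a truly random sample can be drawn from a stream in a single pass, without knowing the stream's length in advance, which avoids the bias risk of every-Nth-record schemes. A minimal sketch of reservoir sampling (Algorithm R) – my own illustration, not something from the talk:

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from a stream of unknown
    length, in one pass (Vitter's Algorithm R)."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i is kept with probability k / (i + 1), replacing a
            # random existing entry, so every item ends up equally likely.
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1_000_000), 10, random.Random(42))
print(sample)
```

Note this still reads every record once, so it removes the bias concern but not the I/O cost of scanning the full file.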

And of course there is always one very good reason to sample the data that you throw into your big data processing solution – you don't want to wait any longer than necessary to find out you've coded it wrong :) – so perform your test cycles on as small a dataset as possible to minimise the debug cycle time.

If you want to delve into this subject a bit more and get into Student’s t-distribution, look at this copy of Sean’s slides from another event.