Yesterday at AWS re:Invent, Andy Jassy delivered the main keynote. As you can see from the photo below, the event was immense – the day before I was in the APN Summit so it was AWS partners only, and that felt big.
But this was 9,000 attendees from 57 countries in a room. The photo doesn’t really capture the epic scale – which struck me as kinda like a metaphor for AWS itself, i.e. the scale of the administrative operation was off the chart, it was all very efficiently managed, and it gets bigger every year!
I thought it was interesting that they didn’t even “save up” the recent announcement about the 10% price reduction for M3 EC2 instances that was announced on 5th November for re:Invent. To me, this just shows how baked into the business model these regular price reductions have become.
In content terms, the three main new announcements were:
- Amazon CloudTrail – the ability to log all AWS API calls to S3 for audit and compliance purposes. This is a nice feature that we’ve asked for before, but actually hasn’t been too much of a barrier to customer adoption previously, probably because we are typically managing the entire AWS layer for a customer anyway.
- Amazon WorkSpaces – virtual desktops-as-a-service. Interestingly desktop “state” is maintained as you move between access devices, e.g. from laptop to tablet. We’re deployed virtual desktops in AWS for a number of customer projects – either desktops for key users in a Disaster Recovery scenario, or for developers who are located around the world and need a consistent desktop with known applications installed etc in order to access AWS-hosted dev and test environments. So I can see us using this new feature in future projects as I suspect the cost model in terms of the saved installation/build/ongoing patching effort of putting in a bunch of Windows Remote Desktop servers.
- Amazon AppStream – HD quality video generation and streaming across multiple device types. This is related to another announcement that was made on 5th Nov – the new g2.2xlarge instance type, which has the GPU grunt to enables the creation of 3D applications that run in the cloud and deliver high performance 3D graphics to mobile devices, TVs etc.
Weirdly being at the event you get less time to look into these new product announcements and so you probably have less detail than if you were just reading about it on the web – after the keynote it was straight into a bunch of technical sessions.
I mainly focused on the data analytics sessions. First off, I got to hear about what NASA have been doing with data visualisation – I think all attendees expected to hear about exciting interstellar data visualisations, but it was actually about much more mundane visualisations of skills management, recruitment trends etc – and this in fact made it much more applicable to the audience’s typical use cases as well. There were some great takeaways about how to maximise your chance of success which I need to write-up at some point…
I then attended an excellent deep dive on Amazon Elastic MapReduce (EMR) – this covered Hadoop tuning and optimisation, architecture choices and how they impact costs, dynamically scaling clusters, when the use S3 and when to HDFS for storage, instance sizes to use and how to design the cluster size for a specific workload.
This was followed by some customer technical overviews of their use of RedShift. They had all migrated to RedShift from either a SQL or NoSQL architecture. For example, Desk.com have deployed two RedShift clusters in order to isolate read from write workloads, but I felt that they had been forced to put considerable effort into building a proxy in front of RedShift to optimise performance – fundamentally as RedShift is limited to 15 concurrent queries and for their reporting workload, they are not in control of the peaks in their user’s demand for reports. So they’ve implemented their own query queuing and throttling mechanism, which sounds like a whole heap of technical and tricky non-differentiating work to me. A key takeaway from this session for me though was that the price-performance characteristic of RedShift had really worked for these customers, and given them the ability to scale at a cost that they just could not before. They were all achieving very high data ingress rates by batching up their data inserts and loading direct from S3.
The final session I attended was about a Mechanical Turk use case from InfoScout. Mechanical Turk is an intriguing service as it’s so different to the other AWS offerings – in fact it’s not a service at all really although it exposes a bunch of APIs – it’s a marketplace. Classic Mechanical Turk use cases include translation, transcription, sentiment analysis, search engine algorithm validation etc, but InfoScout’s need was for data cleaning and capture following an automated by fallible OCR process – capturing the data from pictures of shopping receipts taken on smart phones. The main takeaway for me was about how they manage quality control – i.e. how do you know and therefore tune and optimise the quality of the results you get from the workers executing your HITs? InfoScout use two quality control strategies:
- Known answers – in a batch of receipt images that is handled by a Mechanical Turk worker, they inject a “known” receipt and compare the data captured with the known data on that receipt. This technique is good for clear yes/no quality checks, e.g. is this receipt from Walmart. This allows them to compute a metric for each worker as to how likely their other receipts have been accurately processed.
- Plurality – send unprocessed receipt to more than one worker and see how consistent the returned results are. InfoScout use this to build a confidence score based upon this and other factors such as worker tenure etc.
The final event of the day was the re:invent pub crawl around 16 of the coolest bars in The Venetian and The Palazzo hotels. I’m guessing I don’t need to tell you much about that event, other than it started with sangria…