By Liliandecassai (Own work) [CC-BY-SA-3.0], via Wikimedia Commons

Impala by Liliandecassai

Impala 1.0 was launched back in July last year, and it’s been supported by AWS EMR since last December so I’ve been meaning to have a quick play and also to compare it with a classic map-reduce approach to see the performance difference. It’s not like I don’t believe the promises – I just wanted to see it for myself.

So I ran up a small cluster on AWS – with an m1.large for the master node and 2 core nodes, also running m1.large. I used the US-West region (Oregon) – which offers the same cheap price points as US-East but is 100% carbon-neutral as well :). This was all running using spot instances in a VPC. For interest, the total AWS cost for 24 normalised instance hours (I actually ran the cluster for just over 3 hours, including one false cluster start!) was $1.05.  Using developer standard units of cost, that’s nearly the price of half a cup of coffee! (or since we’re using Oregon region, a green tea?)


As I’m lazy, I used the code and datasets from the AWS tutorial – and decided to just use a simple count of records that contained the string “robin” in the email address field of a 13.3m row table as my comparison. Here’s how you define the basic table structure…

CREATE EXTERNAL TABLE customers (
  id            BIGINT,
  name          STRING,
  date_of_birth TIMESTAMP,
  gender        STRING,
  state         STRING,
  email         STRING,
  phone         STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/data/customers/';

The queries and their output are…

[] > select count(*) from customers;
Query: select count(*) from customers
| count(*) |
| 13353953 |
Returned 1 row(s) in 1.09s

[] > select count(*) from customers where email like "%robin%";
Query: select count(*) from customers where email like "%robin%"
| count(*) |
| 66702    |
Returned 1 row(s) in 1.73s
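To make it concrete what that second query is counting, here is a minimal Python sketch of the same filter-and-count logic. It is not how Impala executes the query; the sample rows are invented, and the names `SAMPLE`, `EMAIL_FIELD` and `count_matching` are made up for illustration. It only assumes the pipe-delimited layout from the table definition above.

```python
# Tiny invented sample in the same pipe-delimited layout as the customers table
# (id|name|date_of_birth|gender|state|email|phone). None of this is real data.
SAMPLE = """\
1|Ann Robinson|1980-01-01 00:00:00|F|OR|ann.robinson@example.com|555-0100
2|Bob Smith|1975-06-15 00:00:00|M|WA|bob@example.com|555-0101
3|Robin Jones|1990-03-09 00:00:00|M|CA|rjones@example.com|555-0102
4|Cat Lee|1988-11-30 00:00:00|F|OR|robin.fan@example.com|555-0103
"""

EMAIL_FIELD = 5  # zero-based position of the email column


def count_matching(lines, needle="robin"):
    """Count rows whose email field contains `needle` (case-sensitive substring match)."""
    total = 0
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) > EMAIL_FIELD and needle in fields[EMAIL_FIELD]:
            total += 1
    return total


if __name__ == "__main__":
    # Rows 1 and 4 match; row 3 has "Robin" in the name only, not in the email.
    print(count_matching(SAMPLE.splitlines()))
```

Only the email column is tested, so a customer called Robin with an unrelated email address is not counted.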

A slight aside – Impala uses run-time code generation to compile the query down to machine code using LLVM. This introduces a compilation overhead of circa 150ms, but it more than pays for itself on the majority of queries.  So this is where some of our 1.73s is going.  More about this here.

Pig comparison

As a glutton for punishment, I decided to use pig rather than the more usual hive for the comparison with Impala. The first thing to say – it was way harder, as the aptly named pig is just a bit more foreign to me than the SQL-like niceness of Impala…so there was some desperate checking of cheatsheets etc to remind me how best to do it…

The basic code for the same source data (already loaded into HDFS) is as follows…

CUST = LOAD 'hdfs://' USING PigStorage('|')
as (id:    chararray,
name:  chararray,
dob:   chararray,
sex:   chararray,
state: chararray,
email: chararray,
phone: chararray);
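-- keep only the rows whose email contains "robin"
C2 = FILTER CUST BY REGEX_EXTRACT_ALL(email, '(.*)robin(.*)') IS NOT NULL;
-- count the surviving rows; this grouping step is assumed, as the original
-- listing dumps C3 without defining it (the job output below shows GROUP_BY)
G  = GROUP C2 ALL;
C3 = FOREACH G GENERATE COUNT(C2);
dump C3;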

As you can see the pig approach ran 8 maps. The output is as follows (with all the INFO messages and some other noise removed)…

HadoopVersion PigVersion UserId StartedAt           FinishedAt          Features
2.2.0   hadoop 2014-04-10 12:11:13 2014-04-10 12:12:26 GROUP_BY,FILTER


Successfully read 13353953 records (9 bytes) from: "hdfs://"

Successfully stored 1 records (9 bytes) in: "hdfs://"



I was just trying it out, so this is not a fair test in some ways – and I didn’t try and do any optimisation of either approach. The Impala approach ran about 40x faster, and this was consistent with repeated runs.
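For anyone who wants to check the arithmetic behind that "about 40x", here is a small Python sketch using the figures quoted above: the StartedAt/FinishedAt timestamps from the pig job summary and the 1.73s reported by Impala. Bear in mind the pig figure is wall-clock time for the whole job, including start-up, so this is a rough ratio rather than a benchmark; the function name is just for illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def elapsed_seconds(started_at, finished_at):
    """Wall-clock seconds between two timestamps in the pig summary format."""
    delta = datetime.strptime(finished_at, FMT) - datetime.strptime(started_at, FMT)
    return delta.total_seconds()


# Figures taken from the outputs quoted in this post.
pig_seconds = elapsed_seconds("2014-04-10 12:11:13", "2014-04-10 12:12:26")
impala_seconds = 1.73

speedup = pig_seconds / impala_seconds
print(f"pig: {pig_seconds:.0f}s, Impala: {impala_seconds}s, ratio: {speedup:.0f}x")
```

That works out at 73 seconds for pig against 1.73 seconds for Impala, i.e. a ratio of roughly 42.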


I checked out the CPU, IO etc and there was nothing hitting any limits, and CPU consumption when I was alternately using Impala and pig looked like this – load was even across my two core nodes, and the master had its feet up most of the time…

CPU CloudWatch metrics

I haven’t reported the data here, but I also played with some nasty 3-way joins using Impala and the results were really impressive. Obviously though it’s horses-for-courses – MapReduce-based approaches like hive and pig will soldier on when Impala has run out of memory for certain query types, or in the event of a node failure etc. But definitely a great bit of kit to have in the AWS EMR toolbag!

Internet Explorer running through workspaces on an iPad.

Unusually, I’m writing this blog post in a browser. Specifically, I’m writing it in Microsoft Internet Explorer, on Windows. Not particularly odd you might think, except that I’m not sat in front of my laptop. I’m using my iPad.

It’s ok, you haven’t gone mad. Hell might have already frozen over when Microsoft released Office for iPad last week, but rest assured it hasn’t happened twice: MS have not released Windows iPad Edition.

Instead, I’m trying out Amazon’s new Workspaces product. This product went GA last week, after a private beta announcement which we covered back in November.

Workspaces is a virtual desktop product that allows you to run managed Windows 7 desktops in AWS with very little effort. Signing up takes minutes, and you can provision either a mostly vanilla workspace with just a few basic utilities installed, or a ‘Plus’ workspace which adds Microsoft Office Pro, and Trend Micro AV. In either case, licences for the installed software are included in the price, making it a great way to stand up a desktop machine with all the essentials fully licenced in no time.

There are two performance tiers: ‘Standard’, with 1 vCPU and 3.75GB of RAM (which sounds suspiciously similar to an m3.medium instance), or ‘Performance’, which packs 2 vCPUs and 7.5GB of RAM (m3.large, anyone?). As is commonly considered best practice, each machine has a couple of disk volumes attached: one that holds the operating system and applications (C:), and one for holding a user’s data (D:). Data on the user’s D: is automatically backed up every 12 hours.

Depending on the bundle you choose, prices range from $35 to $75 per month.

You access your workspace using an Amazon-provided client application that runs on Windows, Mac, iPad, Kindle or Android tablet.

So, that’s the basics covered. How is it to use? Honestly, from the UK it’s currently a little painful. This is to be expected as Workspaces is currently only available in the US, so every pixel of my display is being shot across the Atlantic before I get to see it. I’m seeing latencies of just over 200ms, and Amazon recommend a sub-100ms latency for a good user experience. I can confirm that both iPad and Mac clients work well enough (spot the Apple fanboy), although in common with any iPad-based remote desktop product, the touch-your-screen-to-point-your-mouse impedance mismatch is disorientating at times. Swapping between devices seems to work much as you’d expect. If you’re logged on from your iPad, and then sign in from a desktop, your session transfers seamlessly to the desktop.

From an infrastructure/desktop manager’s perspective, I think it’s early days at the moment. AD integration is possible, allowing users to log in with their normal credentials, as well as allowing them access to local printers and (I assume) file shares. While deploying your own software is certainly possible, you’re pretty much on your own there: there is no concept of an AMI here, nor is there any support for packaging and deploying applications within the service itself. This in itself is probably not a disaster in some senses, since most enterprises have their own deployment tools, but the lack of custom AMI capability makes bootstrapping a workspace into the deployment tool harder than it would otherwise be.

What about use cases? We can already see a couple of things we do for customers where workspaces could replace or supplement what we currently provide:

  • Cloud DR solutions (for an example see our Haven Power case study). As things stand, the key issue preventing us from doing this is the fact that you pay for workspaces per month, regardless of how much usage of the workspace you make. Unusually for AWS, there isn’t an API allowing you to automatically provision/deprovision workspaces, making it hard to optimise the cost here.
  • Remote desktops for 3rd party users. We deployed a Windows Terminal Services farm in AWS for another of our customers, who use it to allow third parties to work on their applications. Both the applications and terminal services farm are managed by us in AWS, and are accessed globally. In theory it would be relatively straightforward to replace the terminal services farm with Workspaces, although we’d have to be confident that the performance is adequate.

Workspaces is a promising technology, but until it’s available in EU-WEST-1, we’re unlikely to be able to adopt it except perhaps in very niche circumstances.

That’s the thing about Amazon though: Like Apple, when Amazon first release a new feature, it’s tempting to be a little underwhelmed. But then, like Apple, a few months or years later we look back at a now mature technology, and we can’t quite remember when it grew up from a metaphorical spotty teenager with potential, to an essential member of the team.

It’s this ability to start ‘simple’, but then improve and polish their products day in, day out, over and over again that has made both companies the unstoppable juggernauts they now are.


Please Rate and Like  this blog.  Share it via the social icons below or via short URL

Our readers want to know what YOU think, so please add a Comment. 

It caught my eye the other day that Microsoft announced an equivalent to Amazon Web Services’ Direct Connect offering, i.e. the ability to connect from your premises to your cloud deployment without going over the Internet. The press release says this capability is “expected to be available in first half of 2014” – and I assume that this initial launch will be US only with Europe to follow later, although it doesn’t say.

Smart421 was a Direct Connect launch partner in the European region for AWS back in Jan 2012, although the initial US launch was way back in August 2011. So going on that basis, I can now put a crude estimate on how far behind AWS the Azure platform really is – at least two and a half years :)

Anyway, now is as good a time as any to share some brief stats from our real world experience of deploying Direct Connect for the European region. I’m not aware of much data in the public domain about Direct Connect latency measurements in the European region – so if you know of some, please comment on this post to let me know.

On a 1 gigabit connection, for an ICMP (i.e. ping) round trip we typically see a latency of circa 12-13ms for Direct Connect versus 33ms via a VPN over the Internet, i.e. about a 60% reduction in latency.
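The "about 60%" figure follows directly from those round-trip times. As a quick sanity check, here is the calculation in Python, using only the numbers quoted above (the function name is just for illustration):

```python
def percent_reduction(baseline_ms, improved_ms):
    """Percentage by which improved_ms is lower than baseline_ms."""
    return (baseline_ms - improved_ms) / baseline_ms * 100.0


VPN_MS = 33.0  # typical round trip via a VPN over the Internet

# Direct Connect round trips were typically in the 12-13ms range.
for direct_connect_ms in (12.0, 13.0):
    reduction = percent_reduction(VPN_MS, direct_connect_ms)
    print(f"{direct_connect_ms:.0f}ms vs {VPN_MS:.0f}ms: {reduction:.1f}% lower")
```

So the reduction sits somewhere between roughly 60.6% and 63.6%, depending on which end of the 12-13ms range you take.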


This data needs to be considered carefully as there are a multitude of factors at play here depending on the specific customer environment and requirements – such as the Internet connectivity for the VPN, and crucially where the customer “on-premises” equipment is in network terms with respect to the AWS Direct Connect location in London Docklands. Also any comparison will vary depending on time of day etc. I’m deliberately not providing any stats on achieved bandwidth here as there are just too many factors involved – principally that the limiting factor is likely to be any MPLS connectivity involved in the architecture rather than Direct Connect itself.

Still – it’s interesting data nonetheless…thanks to ‘Smartie’ Wayne for compiling the data.

A graphical user interface isn’t the only way for Amazon Web Services customers to control their cloud deployments.

Amazon has recently released the AWS Command Line Interface (CLI), which is capable of controlling EC2, S3, Elastic Beanstalk, Simple Workflow Service and about twenty other services, but does not yet support Glacier, SimpleDB, EMR, CloudSearch, Data Pipeline or CloudFront.

The AWS CLI commands take the form:

$ aws SERVICE OPERATION [OPTIONS]

So an example would be “aws ec2 help”.

Each operation generates its output in JSON format by default. However, the “--output text” or “--output table” options will produce text or tabular output instead.

I found parsing the text or tabular formats in a bash script quite tricky, so I opted to parse the JSON using the jq (JSON Query) tool.

$ aws ec2 describe-instances | jq '.ReservationSet[].instancesSet[].keyName'

The command above will return the name of the key pair associated with each instance.
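If jq isn’t to hand, the same extraction is easy enough in Python. The sketch below is hedged in two ways: the sample JSON is invented, and it assumes the layout that current versions of the CLI return for describe-instances (Reservations, then Instances, then KeyName), which differs from the field names in the jq filter above. Check the output of your own CLI version before relying on it.

```python
import json

# Invented sample, shaped like current `aws ec2 describe-instances` JSON output.
SAMPLE_OUTPUT = """
{
  "Reservations": [
    {"Instances": [{"InstanceId": "i-11111111", "KeyName": "dev-key"}]},
    {"Instances": [{"InstanceId": "i-22222222", "KeyName": "ops-key"},
                   {"InstanceId": "i-33333333"}]}
  ]
}
"""


def key_names(describe_instances_json):
    """Return the KeyName of every instance, or None where no key pair is set."""
    doc = json.loads(describe_instances_json)
    names = []
    for reservation in doc.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            names.append(instance.get("KeyName"))
    return names


if __name__ == "__main__":
    print(key_names(SAMPLE_OUTPUT))
```

In a real script you would pipe the CLI output in, for example by reading `sys.stdin`, rather than using an embedded sample.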

To view the contents of an S3 bucket as a directory-style listing, use the AWS CLI command below.

$ aws s3 ls s3://mybucket
      LastWriteTime     Length Name
      -------------     ------ ----
                           PRE myfolder/
2013-09-03 10:00:00       1234 myfile.txt

You can also perform recursive uploads and downloads of multiple files in a single folder-level command. The AWS CLI will run these transfers in parallel for increased performance.

$ aws s3 cp myfolder s3://mybucket/myfolder --recursive
upload: myfolder/file1.txt to s3://mybucket/myfolder/file1.txt
upload: myfolder/subfolder/file1.txt to s3://mybucket/myfolder/subfolder/file1.txt
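To illustrate the idea behind a recursive, parallel copy (and only the idea, as this is not how the AWS CLI is implemented), here is a minimal Python sketch. It walks a local folder, maps each file to a destination under a bucket prefix, and hands the transfers to a thread pool. Everything here is hypothetical: `plan_uploads`, `upload_all` and `fake_upload` are made-up names, and `fake_upload` merely returns a message where a real script would call an S3 client.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def plan_uploads(local_root, bucket, prefix):
    """Map every file under local_root to an s3:// destination, preserving sub-folders."""
    root = Path(local_root)
    plan = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            key = f"{prefix}/{path.relative_to(root).as_posix()}"
            plan.append((path, f"s3://{bucket}/{key}"))
    return plan


def upload_all(plan, upload_one, max_workers=4):
    """Run upload_one(path, destination) for each planned transfer on a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the plan.
        return list(pool.map(lambda item: upload_one(*item), plan))


def fake_upload(path, destination):
    """Hypothetical stand-in for a real transfer; it only reports what it would do."""
    return f"upload: {path} to {destination}"
```

Calling `upload_all(plan_uploads("myfolder", "mybucket", "myfolder"), fake_upload)` would return one message per file found under myfolder, in sorted path order.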

As with all things AWS, configuring the command line tool isn’t exactly trivial, but Amazon offers an easy step-by-step install guide and extensive documentation to help new users get started.

The tool is available for Windows, Mac and Linux and also comes pre-installed on the most recent versions of the Amazon Linux AMI, Amazon’s supported and maintained Linux image for use on its EC2 cloud-computing service.

Very powerful.

At today’s AWS User Group meeting, I was reminded of CohesiveFT’s VNS3 offering – this provides an overlay network capability on top of various cloud providers. The bulk of their customers are using VNS3 on AWS, but it’s also available on other cloud providers, and even across multiple cloud providers in a single deployment.

The historical roots of VNS3 are in pre-AWS VPC days and at that point in history (Dec 2009 or so) it must have been a very attractive offering. In fact, providers like CohesiveFT were providing Software Defined Networking (SDN) solutions long before the term was widely used. Now the differentiators offered by this kind of solution are looking increasingly thin as AWS’s VPC offering has matured and matured, but there are still some advantages of this kind of product that I can see some customers will pay good money for. They are:

  • Network level encryption, i.e. automatically encrypting all traffic between all AWS instances – something that VPC doesn’t offer today, as it assumes a network inside a VPC is trusted. Whilst I can’t see any compelling reason why I’d need this from a technical standpoint, I can see that a customer’s CISO might insist upon it, so it’s nice to know how I’d do it.
  • If you need UDP multicast support
  • If you want to treat different AWS regions (and maybe other cloud providers) as a single network
  • If you want a higher level of IPSec encryption than AES 128-bit
  • If you want to avoid locking yourself in to a specific cloud service provider’s approach to network management

Because AWS’s pace of innovation is so high, other innovators like CohesiveFT that are initially part of the product supplier ecosystem eventually have their differentiators subsumed into the AWS core offering – and worse than that – those features are then typically offered for free (such as for AWS VPC) – destroying their market at a stroke. It’s pretty brutal!

Great event this week – the AWS Enterprise Summit in London on Tuesday.

From the Blade Runner-esque three-storey neon tube in the main exhibition area, to the conviction of the customers speaking about their adoption of Cloud – it truly felt that Amazon Web Services was coming of age.

What really hit me at this event though, was the number and variety of Enterprise clients coming to me and ‘confessing’ that they had dabbled with AWS and now needed help.

I think that stems from the ‘just put your credit card in to get started’ message of early AWS marketing. It’s been easy to get started – but it’s the equivalent of buying a fast car and turning up at an F1 event thinking you are ready to race.

Yes you have some storage and yes you have compute provisioned but in the same way that you and Red Bull both have 4 wheels and an engine, the complexity and scale of the next steps are beyond what most people want to do on their own – without an expert on hand to help.

So back in my confessional – I blessed a few, cursed a few but, above all, offered solace by confirming that they were “Not Alone”.

St Paul’s Cathedral from Grange St Paul’s Hotel

The AWS Enterprise Summit yesterday was excellent.  I use superlatives sparingly, but it was iconic.

When registrations are as heavily over-subscribed as they were for yesterday’s event, you know you have a significant indicator of the level of interest that the AWS Cloud is generating.  And, not for the first time, this was particularly pronounced in the UK Enterprise sector.

The place was rammed.

More than 500 delegates converged on the Grange St Paul’s Hotel in London to hear two AWS directors and a raft of directors and senior IT people from UK Enterprises tell their stories for themselves.

When Sir Christopher Wren (1632-1723) architected St Paul’s Cathedral, he demonstrated a level of architectural expertise that surpassed mere practicality and function. Although the same cannot really be said of the architecture for every instance in every cloud, more and more engagements for large UK companies do provide a monument to others of all that is good and enduring about Cloud computing.

We think that AWS is, on balance, getting a lot right.

But perhaps we should really let the attendees speak for themselves.

Please leave your Comment below.

St Paul’s Cathedral from Millennium Bridge

With the 2013 AWS Enterprise Summit opening in Grange St Paul’s London in a matter of hours, we’re feeling rather buoyed up with some latest news just in from Seattle.

AWS has conferred their top honours on Smart421.

Smart421 received confirmation of two Competency awards from Amazon Web Services (AWS).

The awards follow evaluation by subject area experts at AWS’ world headquarters in Seattle and recognise the highest levels of business and technical competencies amongst Amazon’s leading partners accredited under the Amazon Partner Network (APN).

Smart421 has received the Managed Services Provider Competency and the Big Data Competency.

The move makes Smart421 the first and only UK partner in the APN to be awarded both AWS Competencies.

Our CTO, Robin Meehan, said, “Our commitment to customers and capability to deliver solutions on the AWS Cloud has been evaluated and recognised by AWS experts in Seattle. The achievement of the MSP Competency and Big Data Competency is a major step forward. We’ve invested in Smart421’s AWS Practice and our AWS Practice Lead, Chris Turvil, is certainly driving things forward, just as we announced in June this year. This assures Smart421’s place as the Enterprise partner of choice in the UK.”

I couldn’t have put it any better than that!

Why do so many big vendors insist on hawking logos and case studies from firms in the United States around to the Brits?  Most of the names they bandy around mean nothing to us on this side of the pond.

If you, like me, have had your fill of this, perhaps this time next week you’ll be hot-footing along to the AWS Enterprise Summit in London? (Official hashtag: #AWSsummit)

It’s not unusual to see company speakers being given a 15-minute slot. But we really must applaud AWS for bringing together so many speakers from enterprises that actually mean something to us Brits [Gartner and other event organisers, take note].

AWS have struck the right balance.  After keynotes by Andy Jassy and Stephen Schmidt, they plan to let the Brits do the talking. How refreshing.

Among them will be Smart421 customer Aviva, Britain’s leading insurance company.

Oh c’mon, don’t tell me you’ve never heard of Aviva?

Aviva launched in August 2011. With a national reputation and strong brand in the general insurance sector at stake, they wanted to know how their multi-touchpoint cross marketing activities impacted each brand in their enterprise portfolio. So one of their visionaries engaged us to architect and deliver a solution on the AWS cloud that made it possible to analyse massive data sets of structured and unstructured data (aka Big Data analytics).

Keith Misson (@ClonedTiger) is one of the few leaders in the UK insurance sector to not only spot technological advances, but to do something directly relevant with those innovations.  Rare. But it was delivered. And it’s in the UK, not California (not that there is anything wrong with California per se – but I think you get the point?).  He is an exemplary client and kind enough to go on the record with the following:

“Smart421’s Cloud architects gave us a head start on making Big Data real for us, including how business insights are really delivered, what the costs really are, and how the technology really works in our context. Their output contributes to how we differentiate ourselves in a crowded market.”

I won’t steal his thunder.  You’ll have to make a point of listening for yourself at 11.45am, towards the end of the 2nd keynote.  (It might help you to read the customer story beforehand).

There is a whole raft of speakers during the afternoon session. So after lunch, rather than “heading back to the office”, why not stick around and learn something you can put into action?

In the 1980s, I was a typical PITA user, developing applications behind the backs of the IT department, even bringing my own PC and software into work. Eventually the IT department ‘took me under their wing’ and I was the one fighting off guerrilla developments from the user community, but by providing them with better, faster and more flexible technology, we won the day.

Now I find myself on the other side of the fence again.

I don’t develop anymore, but I’m watching the world of Cloud encourage self-service in the technical user community and leave IT departments behind. It’s a theme I have returned to before – the “democratisation of compute power” – served up brilliantly through the AWS IaaS model. We’ll see more examples of this at the AWS Enterprise Summit mid-September that Smart421 is sponsoring. (Hashtag #AWSsummit)

However, it’s not just the Cloud that is challenging IT departments.

Mobile too seems to be spawning a new generation of Garagistes*: either bright individuals buried in large companies, or small one- or two-man bands, creating mobile applications – building on core components (hosting/logon/mapping/location) provided by Apple, Google etc. by adding layers of creativity.

So what’s the problem – the real point here ?

The issue is security. When I was hacking out applications and getting sneaky access to CRM databases and pricing algorithms, everything was safe inside the corporate firewall. Nowadays it is mobile and cloud based.

Both of these technologies I wholeheartedly support, but like everything they have to be done in the right way. So if it were up to me again, I’d develop a Cloud strategy and Mobile architectural guidelines ASAP – before the Horse has bolted, the Cat is out of the bag and the Gorilla (sic) is in the mist.

* “The word Garagiste refers to the great Enzo Ferrari’s hatred of the multitude of talented, but small, Formula 1 teams that were emerging out of Britain in the late 50′s and early 60′s … were basically garage workers (grease monkeys in less formal parlance) compared to the engineering might of his Scuderia Ferrari. These teams didn’t produce their own engines or other ancillaries (aside from BRM), specialising mostly in light, nimble chassis”.
