Jeff Bezos Photo by John Keatley, Seattle's leading photographer keatleyphoto.com

Every time I hear this story, it makes me smile. From Kim Lane over at API Evangelist:

[…] one day Jeff Bezos issued a mandate, sometime back around 2002 (give or take a year):

  • All teams will henceforth expose their data and functionality through service interfaces.
  • Teams must communicate with each other through these interfaces.
  • There will be no other form of inter-process communication allowed: no direct linking, no direct reads of another team’s data store, no shared-memory model, no back-doors whatsoever. The only communication allowed is via service interface calls over the network.
  • It doesn’t matter what technology they use.
  • All service interfaces, without exception, must be designed from the ground up to be externalizable. That is to say, the team must plan and design to be able to expose the interface to developers in the outside world. No exceptions.

The mandate closed with:

Anyone who doesn’t do this will be fired. Thank you; have a nice day!

Assuming for the moment that this is true, the thing that makes me smile here isn’t the closing rhetoric. What Jeff described here is pretty well everything you need to know about successful SOA.

Look at the wording again. “All teams”. He didn’t say “all systems” or “all services”. Technology isn’t [the most] important. People are.

By focussing on teams rather than technology, Jeff ensured that Amazon’s embryonic SOA was business aligned. One simple decision was all it took. Well, that and ten years of concerted effort by one of the brightest engineering teams on the planet.

As the leading enterprise AWS Partner in the UK  ;o)  Smart421 had a big presence at the London AWS Summit last week (23 April).

Several of our customers also attended, and one of them, Steve Howes, Chief Executive of Rail Settlement Plan (part of the Association of Train Operating Companies – ATOC), presented as part of the second keynote and very kindly mentioned us.

In fact, we were referenced throughout the day, sometimes in unexpected ways. For example, within the opening five minutes of the first keynote, Smart421 was name-checked by Werner Vogels, CTO at Amazon.

Steve Howes of RSP takes to the stage in front of 1,200 delegates

As per the recent Las Vegas event, it was a bit rock’n’roll in the keynotes, with music by Foo Fighters playing over the PA whilst we waited for the queuing hordes to make their way through the registration bottleneck and into the venue. Vogels himself appeared on stage to the sound of Nirvana pumping out of the speakers.

What struck me was how large this event had become: a significant increase on the 2012 AWS Summit.

AWS Summit London 2013

In the afternoon, the event split into separate streams. I stuck to the more involved sessions that were digging into the specifics of particular service releases such as Amazon Redshift and Amazon DynamoDB.

As you can see from the photo above, there was barely sitting room in some of these techy sessions, let alone standing room.

It made me think – would conference attendees be prepared to sit on the floor for any vendor, especially in the corner of a crowded room? There must be few other vendors, if any, that have that kind of pulling power right now.

These are exciting times. Opportunities for enterprises to benefit are enormous.

And this event confirmed my reflections that we’re reaching a tipping point in the market; adoption by the “early majority” in the technology adoption lifecycle is really visible and is happening – albeit with different market sectors arriving at very different stages.

Our conversations continued all day long over at the Smart421 stand, where we showcased three of our customer engagements:
Disaster Recovery on the AWS Cloud for Haven Power,
Big Data Analytics on the AWS Cloud for Aviva / Quotemehappy.com,
and Service Transition to the AWS Cloud for ATOC.

Frankly, we wanted to showcase even more but there just wasn’t the space.

Please Rate and Like this blog.  If you can, please leave a Comment.

It was great to see National Rail Enquiries (NRE) win an award at the European Outsourcing Association Awards in Amsterdam last Friday (26 April).

In recognition of their SIAM (Service Integration and Management) outsourcing strategy, NRE won the award for Best Multi-sourcing Project of the Year, beating strong category finalists 60k and Centrica (Centrica won this category in 2012).

Smart421 is pleased to be a large part of that initiative, performing the Managed Services element on top of an AWS Cloud platform for several key NRE applications.

As customers struggle with the chains of traditional SI relationships, Smart421 is providing agile delivery and innovation methods in the IaaS world.

Many analysts see this as “third generation outsourcing” and a change for good – and so do I.

 


I was doing some Hadoop demo work last week for a customer and, mainly just because I could, I used spot instances to host my Hadoop/Pig cluster using AWS’s Elastic MapReduce (EMR) offering. I thought I’d have a quick look at what the resulting costs were over the few hours I was using it. I used a combination of small and large instances in the US-East region – m1.small for the master node and m1.large for the core nodes. Note – these costs exclude the PaaS-cost uplift for using EMR (another 6 cents per hour for a large instance).

In summary – it’s dirt cheap….

AWS Spot Price Analysis

What is more revealing is to look at this in terms of the % of the on-demand price that this represents…

AWS Spot Price Analysis Saving

So in summary, an average saving of around 90% on the on-demand price! This is probably influenced by the fact that I was running the cluster mainly during the time when the US is offline. We tend to get a bit fixated on the headline EC2 cost reductions that have frequently occurred over the last few years, and the general “race to the bottom” of on-demand instance pricing between AWS, Google, Microsoft etc. Obviously not all workloads are suitable for spot pricing, but what I did here was deliberately bid high (at the on-demand price for each instance type, in fact), knowing that this would make it very unlikely that I’d get booted off the instances by someone bidding higher if capacity got short. As EC2 instance costs are so low anyway, we tend not to worry too much about optimising costs by using spot pricing for many non-business-critical uses – which is a bit lazy really, and we could all exploit this more. Let’s do that!
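The percentage in the second chart is simple arithmetic, sketched below in Python. The prices used here are placeholder assumptions for illustration only; the real figures are the ones in the charts above.

```python
def spot_saving_pct(on_demand_per_hr: float, spot_per_hr: float) -> float:
    """Percentage saved by paying the spot price instead of on-demand."""
    return (1 - spot_per_hr / on_demand_per_hr) * 100


# Hypothetical prices, not taken from the charts.
on_demand = 0.26  # assumed on-demand $/hr for a large instance
spot = 0.026      # assumed average spot $/hr actually paid

saving = spot_saving_pct(on_demand, spot)
print(f"{saving:.0f}% saving on the on-demand price")
```

Bidding at the on-demand price does not change this calculation: you pay the prevailing spot price rather than your bid, so the bid only caps the worst case.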

If you have any experience of supporting large scale infrastructures, whether they are based on ‘old school’ tin and wires, virtual machines or cloud based technologies you will know that it is important to be able to create consistently repeatable platform builds. This includes ensuring that the network infrastructure, ‘server hardware’, operating systems and applications are installed and configured the same way each time.

Historically this would have been achieved through the use of the same hardware, scripted operating system installs and, in the Windows application world of my past, application packagers and installers such as Microsoft Systems Management Server.

With the advent of cloud computing the requirements for consistency are still present and just as relevant. However the methods and tools used to create cloud infrastructures are now much more akin to application code than the shell script / batch job methods of the past (although some of those skills are still needed). The skills needed to support this are really a mix of both development and sys-ops and have led to the creation of Dev-Ops as a role in its own right.

Recently along with one of my colleagues I was asked to carry out some work to create a new AWS based environment for one of our customers. The requirements for the environment were that it needed to be:

  • Consistent
  • Repeatable and quick to provision
  • Scalable (the same base architecture needed to be used for development, test and production just with differing numbers of server instances)
  • Running Centos 6.3
  • Running Fuse ESB and MySQL

To create the environment we decided to use a combination of AWS CloudFormation to provision the infrastructure and Opscode Chef to carry out the installation of application software. I focussed primarily on the CloudFormation templates while my colleague pulled together the required Chef recipes.

Fortunately we had recently had a CloudFormation training day delivered by our AWS Partner Solutions Architect so I wasn’t entering the creation of the scripts cold, as at first the JSON syntax and number of things you can do with CloudFormation can be a little daunting.

To help with script creation and understanding I would recommend the following:

For the environment we were creating the infrastructure requirements were:

  • VPC based
  • 5 subnets
    • Public Web – To hold web server instances
    • Public Secure – To hold bastion instances for admin access
    • Public Access – To hold any NAT instances needed for private subnets
    • Private App – To hold application instances
    • Private Data – To hold database instances
  • ELB
    • External – Web server balancing
    • Internal – Application server balancing
  • Security
    • Port restrictions between all subnets (i.e. public secure can only see SSH on app servers)

To provision this I decided that rather than one large CloudFormation template I would split the environment into a number of smaller templates:

  • VPC Template – This created the VPC, Subnets, NAT and Bastion instances
  • Security Template – This created the Security Groups between the subnets
  • Instance Templates – These created the required instance types and numbers in each subnet

This then allowed us to swap out different Instance Templates depending on the environment we were creating (e.g. Development could have single instances in each subnet, whereas Test could have ELB balanced pairs and Production could use features such as auto-scaling).

I won’t go into the details of the VPC and Security Templates here; suffice it to say that with the multiple template approach the outputs from the creation of one stack were used as the inputs to the next.
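The chaining of stacks can be sketched in a few lines of Python. This is only an illustration of the idea, not the scripts we used: the output names and IDs are made up, and the dictionary shapes are assumed to follow the OutputKey/OutputValue and ParameterKey/ParameterValue forms that the CloudFormation API uses for stack outputs and parameters.

```python
def outputs_to_parameters(outputs, wanted_keys):
    """Turn one stack's outputs into the parameter list for the next stack."""
    by_key = {o["OutputKey"]: o["OutputValue"] for o in outputs}
    missing = [k for k in wanted_keys if k not in by_key]
    if missing:
        # Fail early rather than launching the next stack half-configured.
        raise KeyError(f"previous stack did not output: {missing}")
    return [{"ParameterKey": k, "ParameterValue": by_key[k]} for k in wanted_keys]


# Hypothetical outputs from the VPC stack (names and IDs are invented).
vpc_outputs = [
    {"OutputKey": "VpcId", "OutputValue": "vpc-11111111"},
    {"OutputKey": "PrivateAppSubnetId", "OutputValue": "subnet-22222222"},
]

# Parameters for the Security stack, which only needs the VPC ID here.
security_params = outputs_to_parameters(vpc_outputs, ["VpcId"])
print(security_params)
```

Keeping the list of wanted keys explicit means a renamed output in one template shows up as an immediate error rather than as a confusing failure part-way through the next stack.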

For the Instance Templates the requirement was that the instances would be running Centos 6.3 and that we would use Chef to deploy the required application components onto them. When I started looking into how we would set the instances up to do this, I found that the examples available for Centos and CloudFormation were extremely limited compared to Ubuntu or Windows. As this is the case, I would recommend working from a combination of the Opscode guide to installing Chef on Centos and AWS’s documentation on Integrating AWS with Opscode Chef.

Along the way to producing the finished script there were a number of lessons, which I will share to help with your installation. The first was the need to use a Centos.org AMI from the AWS Marketplace. After identifying the required AMI I tried running up a test template to see what would happen before signing up for it in the Marketplace. In CloudFormation this failed with an error of ‘AccessDenied. User doesn’t have permission to call ec2::RunInstances’, which was slightly misleading. Once I’d signed our account up for the AMI, this was cured.

The next problem I encountered was really one of my own making / understanding. When looking at AMIs to use I made sure that we had picked one that was Cloud-Init enabled; in my simplistic view I thought that this meant that commands such as cfn-init, which are used within CloudFormation to carry out CloudFormation-specific tasks, would already be present. This wasn’t the case, as the cfn- commands are part of a separate bootstrap installer that needs to be included in the UserData section of the template (see below):

"UserData" : { "Fn::Base64" : { "Fn::Join" : ["", [
 "#!/bin/bash -v\n",
 "function error_exit\n",
 "{\n",
 " cfn-signal -e 1 -r \"$1\" '", { "Ref" : "ResFuseClientWaitHandle" }, "'\n",
 " exit 1\n",
 "}\n",
 "# Install the CloudFormation tools and call init\n",
 "# Note do not remove this bit\n",
 "easy_install https://s3.amazonaws.com/cloudformation-examples/aws-cfn-bootstrap-latest.tar.gz\n",
 "cfn-init --region ", { "Ref" : "AWS::Region" },
 " -s ", { "Ref" : "AWS::StackName" }, " -r ResInstanceFuse ",
 " --access-key ", { "Ref" : "ResAccessKey" },
 " --secret-key ", { "Fn::GetAtt" : [ "ResAccessKey", "SecretAccessKey" ]},
 " -c set1",
 " || error_exit 'Failed to run cfn-init'\n",
 "# End of CloudFormation Install and init\n",
 "# Make the Chef log folder\n",
 "mkdir /etc/chef/logs\n",
 "# Try starting the Chef client\n",
 "chef-client -j /etc/chef/roles.json --logfile /etc/chef/logs/chef.log > /tmp/initialize_chef_client.log 2>&1 || error_exit 'Failed to initialise chef client' \n",
 "# Signal success\n",
 "cfn-signal -e $? -r 'Fuse Server configuration' '", { "Ref" : "ResFuseClientWaitHandle" }, "'\n"
]]}}

The cfn-signal command, which comes as part of the bootstrap installer, is used for messaging to any wait handlers defined in the template, so if the cfn- commands are not present this can lead to long breaks at the coffee machine before any feedback is received.
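One way to shorten that feedback loop is to preview the shell script that a UserData section will produce before launching anything. The Python below is a rough sketch of that idea and is not part of CloudFormation: it resolves only Fn::Join, Fn::Base64 (without encoding) and Ref, shows Fn::GetAtt as a placeholder, and the region and stack name passed in are made-up values.

```python
def render(node, refs):
    """Resolve a small subset of CloudFormation intrinsics to plain text."""
    if isinstance(node, str):
        return node
    if "Ref" in node:
        # Unknown references are shown as <Name> so they are easy to spot.
        return refs.get(node["Ref"], "<" + node["Ref"] + ">")
    if "Fn::GetAtt" in node:
        return "<" + ".".join(node["Fn::GetAtt"]) + ">"
    if "Fn::Join" in node:
        separator, parts = node["Fn::Join"]
        return separator.join(render(part, refs) for part in parts)
    if "Fn::Base64" in node:
        # Skip the encoding step so the script stays readable.
        return render(node["Fn::Base64"], refs)
    raise ValueError(f"unsupported node: {node!r}")


# A cut-down UserData section in the same style as the one above.
user_data = {"Fn::Base64": {"Fn::Join": ["", [
    "#!/bin/bash -v\n",
    "cfn-init --region ", {"Ref": "AWS::Region"},
    " -s ", {"Ref": "AWS::StackName"}, " -r ResInstanceFuse -c set1\n",
]]}}

script = render(user_data, {"AWS::Region": "eu-west-1", "AWS::StackName": "dev-fuse"})
print(script)
```

Reading the joined script makes missing spaces between arguments and unbalanced quotes much easier to see than they are in the JSON form.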

The final lesson was how to deploy the Chef Client and configuration to the instances. Chef is a rubygems package, so it needs rubygems and its supporting packages present on the instance before it can be installed. Within CloudFormation, packages can be installed via the packages configuration sections of AWS::CloudFormation::Init, which for Linux supports rpm, yum and rubygems installers. Unfortunately, for the AMI we chose to use, the available repositories didn’t contain all the packages necessary for our build. To get around this I had to rpm on the rbel repository definitions before using a combination of yum and rubygems to install Chef:

"packages" : {
 "rpm" : {
 "rbel" : "http://rbel.frameos.org/rbel6"
 },
 "yum" : {
 "ruby" : [],
 "ruby-devel" : [],
 "ruby-ri" : [],
 "ruby-rdoc" : [],
 "gcc" : [],
 "gcc-c++" : [],
 "automake" : [],
 "autoconf" : [],
 "make" : [],
 "curl" : [],
 "dmidecode" : [],
 "rubygems" : []
 },
 "rubygems" : {
 "chef" : [] 
 }
}

Once Chef was installed the next job was to create the Chef configuration files and validation key on the instance. This was carried out using the “files” options within AWS::CloudFormation::Init:

"files" : {
 "/etc/chef/client.rb" : {
 "content" : { "Fn::Join" : ["", [
 "log_level :info", "\n", "log_location STDOUT", "\n",
 "chef_server_url '", { "Ref" : "ParChefServerUrl" }, "'", "\n",
 "validation_key \"/etc/chef/chef-validator.pem\"\n",
 "validation_client_name '", { "Ref" : "ParChefValidatorName" }, "'", "\n"
 ]]},
 "mode" : "000644",
 "owner" : "root",
 "group" : "root"
 },
 "/etc/chef/roles.json" : {
 "content" : {
 "run_list" : [ "role[esb]" ]
 },
 "mode" : "000644",
 "owner" : "root",
 "group" : "root"
 },
 "/etc/chef/chef-validator.pem" : {
 "source" : { "Fn::Join" : ["", [{ "Ref" : "ParChefKeyBucket" }, { "Ref" : "ParChefValidatorName" }, ".pem"]]},
 "mode" : "000644",
 "owner" : "root",
 "group" : "root",
 "authentication" : "S3Access"
 }
}

The hardest part of this was the validation key. As we had multiple instances wanting to use the same key, we decided to place it within an S3 bucket and pull the key down. During the script creation I tried multiple ways of doing this, such as using S3Cmd (which needed another repository and set of configuration to run), but found that using the files section worked best.
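For the “authentication” : “S3Access” reference in the files section to resolve, the resource’s metadata also needs a matching AWS::CloudFormation::Authentication entry. The Python sketch below builds what such an entry could look like and prints it as JSON. The bucket name is a made-up placeholder, and reusing the ResAccessKey resource for the credentials is my assumption here, not something shown in the template extracts above.

```python
import json


def s3_authentication(name, bucket, access_key_resource):
    """Build an AWS::CloudFormation::Authentication entry for S3 downloads."""
    return {
        name: {
            "type": "S3",
            "accessKeyId": {"Ref": access_key_resource},
            "secretKey": {"Fn::GetAtt": [access_key_resource, "SecretAccessKey"]},
            "buckets": [bucket],  # buckets these credentials apply to
        }
    }


metadata = {
    "AWS::CloudFormation::Authentication": s3_authentication(
        "S3Access",           # must match the "authentication" value in "files"
        "example-chef-keys",  # hypothetical bucket name
        "ResAccessKey",       # assumed to be the access key resource
    )
}
print(json.dumps(metadata, indent=1))
```

The printed fragment would sit alongside AWS::CloudFormation::Init in the instance resource’s Metadata section.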

Once Chef was installed, the client was started via the UserData section (basically a shell script). This then handed control of what additional software and configuration is installed on the instance to the Chef Master. How much Chef does at this stage is a bit of a balancing act, as the wait handler within the template will fail the stack creation if its timeout period is exceeded.

As you can probably tell if you have got this far, the creation of the templates took quite a few iterations to get right as I learnt more about CloudFormation. When debugging what is going on, it is worth remembering that you should always set the stack to not rollback on failure. This allows you to access the instances created to find out where they got to within the install. As the UserData section is basically a shell script with some CloudFormation hooks, more often than not the faults are likely to be the same as you would see on a standard non-AWS Linux install. Also, for a Centos install remember that the contents of /var/log are your friend, as both cloud-init and cfn-init create log files here for debugging purposes.
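If you create stacks from a script rather than the console, the rollback setting is just another argument. The Python below is a minimal sketch that only builds the arguments. The stack name and timeout are placeholders, and passing the result to boto3 (shown in a comment) assumes that library is installed and credentials are configured, which is not how the stacks in this post were created.

```python
def debug_stack_args(stack_name, template_body, parameters=None, timeout_minutes=30):
    """Keyword arguments for a create-stack call that keeps failed resources."""
    return {
        "StackName": stack_name,
        "TemplateBody": template_body,
        "Parameters": parameters or [],
        "DisableRollback": True,  # leave instances running so logs can be read
        "TimeoutInMinutes": timeout_minutes,
    }


args = debug_stack_args("dev-fuse-debug", "{}")  # hypothetical name, empty template

# With boto3 available (an assumption), the call would look like:
#   boto3.client("cloudformation").create_stack(**args)
print(args["DisableRollback"])
```

Remember to delete the failed stack by hand afterwards, otherwise the instances left behind for debugging keep running.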

After watching Werner Vogels’ keynote speech from AWS re:Invent, it’s clear that treating infrastructure as a programmable resource (i.e. using technologies such as CloudFormation and Chef) is somewhere organisations need to be moving towards. Based on my experience so far, I will be recommending this approach on all future AWS environments we get involved with, even the small ones.

Whilst there is a bit of a learning curve, the benefits of repeatable builds, known configuration and the ability to source control infrastructure far outweigh any shortcomings, such as the lack of granular template validation, which I’m sure will come with time.

If you have any comments or want to know more please get in touch.

AWS have been busy launching new regions around the world at an amazing rate recently – Sydney, Sao Paulo etc – and there’s still gossip of a second European region to come. It’s fair to say that Smart421’s business in South America is still in its growth phase :) , so I hadn’t paid a huge amount of attention to the instance costs for the South America region. As you’d expect Smart421 tend to use EU-West and US-East for all our customer and internal AWS work. I happened to be checking out the costs today for Elastic MapReduce (EMR – AWS’s Hadoop-as-a-service offering) and had a quick snout around to compare EMR costs across regions, so I stumbled across the Sao Paulo EC2 pricing.

In short – it’s significantly more than all the other regions. A standard on-demand large instance is $0.26/hr in US-East but a whopping $0.46/hr in Sao Paulo – that’s 77% more. Now I’m used to regional price variations as power, tax etc are different in different territories (when will the EU-West region drop to match US-East prices eh?), but that’s a lot. That creates a pretty significant incentive to still use the US services, latency and other similar considerations put to one side of course. Also, I wonder if it also broadens the opportunity for 3rd parties to offer cloud brokerage services (a market I’ve been rather skeptical about up until now due to the barriers to workload mobility) that automatically port compute workloads between regions for a percentage of any cost savings made.
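For anyone wanting to check the arithmetic, here it is as a small Python sketch using the two hourly prices quoted above. The 30-day month used for the running cost is just an assumption to give a feel for the gap a broker could work with.

```python
def premium_pct(base_per_hr, other_per_hr):
    """How much more expensive `other` is than `base`, as a percentage."""
    return (other_per_hr / base_per_hr - 1) * 100


us_east = 0.26    # $/hr, on-demand large instance in US-East
sao_paulo = 0.46  # $/hr, the same instance in Sao Paulo

premium = premium_pct(us_east, sao_paulo)
print(f"Sao Paulo costs {premium:.0f}% more")

# Extra cost of one always-on instance over an assumed 30-day month.
hours = 24 * 30
extra_per_month = (sao_paulo - us_east) * hours
print(f"${extra_per_month:.2f} extra per instance per month")
```

That works out at roughly $144 a month per large instance, before latency and data transfer are taken into account.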

Looks like cost harmonisation via the globalisation of IT still has some way to go then. Ouch!

The subcategory called Big Data is emerging out of the shadows and into the mainstream.

From left: Matt Wood, Chief Data Scientist at Amazon Web Services (AWS) with Robin Meehan, CTO at Smart421
Photo by Jim Templeton-Cross

What it is.

Definitions abound (who would have thought it? – quite usual in the technology market). For Big Data, we quite like the definition that originated with Doug Laney (@doug_laney), formerly META Group, now a Gartner analyst. It goes something like this:

 “… increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources)”

Gartner continue to use this “3Vs” model for describing Big Data.

Unsurprisingly, others are claiming Gartner’s construct for Big Data (see Doug’s blog post, 14 Jan 2012).

Still confused?

Put another way, Big Data is commonly understood to be:

“… a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools. The challenges include capture, curation, storage, search, sharing, analysis, and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to ‘spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions.’” Read more on Wikipedia.

Big Data could be executed on-premise if you have sufficient compute and storage in your corporate data centre. And some do, especially some large banks, and with good success. Several solutions are already out there on the market; Oracle’s Big Data Appliance is just one example. But it does also beg the question “why would you?”

If you don’t want the CapEx of purchasing more tin, or don’t want to gobble up capacity in your own data centre, then there are alternatives. For example, a cost model now exists with cloud-based compute and cloud-based storage (for example, think of Amazon’s announcement of 25 percent reductions in the price of Amazon S3, its storage solution) that puts Big Data in the Cloud well within the reach of all UK enterprises. A cost model like that is likely to win friends in procurement and in corporate governance as well as in IT.

Hinging on technologies including Apache Hadoop clusters, Amazon Elastic MapReduce (Amazon EMR) and others, Big Data is delivering a degree of analytics and visualisation not previously possible at affordable levels.

Don’t just take our word for it, ask around. We could point you to other experts in Big Data, such as Matt Wood (@mza), Chief Data Scientist at AWS.

What it isn’t.

Big Data isn’t business intelligence (BI). What I mean is that Big Data isn’t BI in any traditional sense of the term. It is altogether another level on from that. Granted, some tooling that enterprises already own may be recycled for use in Big Data analytics. But it isn’t another species, it’s another race.

Big Data isn’t a lame attempt at reviving a management information system (MIS); those should be left to rest in peace.

What it means for you.

By now, if you’ve read this far, something should be niggling away at you that you could be missing a trick. I trust it won’t be those voices in your head again. But it might be your instincts telling you how Big Data could answer those tough business questions – y’know, those “I can’t be asked” questions that existing systems just cannot deliver.

Now, you would not necessarily get our CTO to come right out and say that Big Data is the next big thing. But evidence we are assembling so far does seem to point to a new capability to deliver. For those with an appetite to understand their business in new ways, Big Data is delivering tangible intelligence that lets them see new dimensions, new possibilities and new revenue streams.

I did get a full radar lock on something our CTO said in the summer. It was a throw away line at the time but it stuck with me and with others. So, when the time came to consider an appropriate go-to-market message for our quarter three (Q3) focus, we decided to wheel out his one-liner as part of our messaging.

“It’s not about survival of the fittest -
it’s about survival of the best informed”
Robin Meehan, CTO, Smart421 Ltd.

Making no apologies to Charles Darwin or evolutionists, the statement is resonating with decision makers in the enterprise space, not least those in the Insurance sector. Why?  Well, we think it is because a lot of the big insurers operate under many names in their brand portfolios.

The capability to see and understand impacts of brand activities, such as Insurance Quotes, delivered using Big Data analytics in the AWS Cloud, is illuminating new gains that would otherwise have remained out of reach.

Don’t forget – brand analysis is only one use case for Big Data in the Cloud.

If the world is going Big Data crazy then you need to know what it is, what it isn’t and what it means to your enterprise.

Agree?  Disagree?

UPDATE 05 Dec 2012 – Our economist friend Tim Harford  (@TimHarford) sent this hilarious tweet: The data do not lie. OR DO THEY? Muah huah huah! http://dlvr.it/2b2NS1

UPDATE 06 Dec 2012 – Robin and colleague Ben Baumguertel (@bbaumguertel) are attending the Big Data Analytics event in London today (organised by @WhitehallMedia ).

Please Rate this post and Like this post below. If you can, please leave a Comment.

Unfortunately I didn’t manage to make a strong enough case to travel to Las Vegas in person :( , so I did the next best thing and watched the live media stream yesterday evening – it was just like being there, but without Tom Jones or any showgirls. The two big things from Andy Jassy (the AWS SVP) were an approx 24% storage (S3) price reduction across all regions from 1st Dec, and the launch of a limited beta version of data-warehousing-as-a-service. On the second of these, AWS Redshift (which is discussed in more detail in Jeff Barr’s post here) is a direct challenge to the existing column-oriented database world: Teradata, IBM, Oracle etc. It looks really interesting and is a classic cloud use case, so it makes sense for AWS to tackle it – it requires large volumes of storage and compute power and is a traditionally high-CapEx market sector. I’m looking forward to playing with it.

As for the S3 price reduction…well, a 24% price reduction is a pretty amazing step change in pricing. What other industries would have such dramatic changes in price? I wish it was happening to UK gas & electricity pricing :) . Having said that, Google storage costs currently start at $0.095 per GB per month, so it looks like AWS are price matching with Google. Microsoft Azure pricing was still at $0.125 per GB when I checked this morning, but presumably they will have to respond (to be precise – this is not quite an apples-for-apples comparison, as Azure replication is over a significant distance whereas AWS S3 replication is between AZs, which are separate but only some, typically undisclosed, number of kilometres apart). As discussed before on the blog, I can’t see how the majority of smaller (and by that I still mean very big!) IaaS cloud players can possibly compete with this perfect storm of huge economies of scale and immensely deep pockets. Looking at our current AWS billing (which includes customers’ AWS accounts that we manage on their behalf), S3 storage costs only account for <5% of the total costs, as the lion’s share of the cost relates to compute – so more price reductions here as well please!
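To put a rough number on why compute price cuts would matter more, here is a back-of-envelope Python sketch. It treats the “<5%” figure above as an upper bound of 5% and applies the 24% reduction to that share of the bill; the compute share in the second call is a purely illustrative assumption.

```python
def bill_reduction_pct(share_of_bill_pct, price_cut_pct):
    """Overall bill reduction when one line item gets cheaper."""
    return share_of_bill_pct * price_cut_pct / 100


# S3 at (up to) 5% of the bill, with the announced 24% price cut.
s3_effect = bill_reduction_pct(5, 24)
print(f"S3 cut trims the total bill by at most about {s3_effect:.1f}%")

# For comparison only: an assumed 80% compute share with the same cut.
compute_effect = bill_reduction_pct(80, 24)
print(f"The same cut on compute would trim it by about {compute_effect:.1f}%")
```

So a headline-grabbing storage cut moves the overall bill by little more than one percent, which is why the request for compute reductions is not entirely tongue in cheek.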

[Update 30/11/12 - Since reading Jeff's post I've realised that these cost savings also apply to EBS Snapshots (der...of course you'd expect that), so this actually makes the cost saving from this one price reduction more significant, getting up to 8% or so]

Caricature of a mad scientist drawn by en:User:J.J. and uploaded as en:image:Madscientist.jpg to English Wikipedia 15 July 2003

I attended the second Hive London Meetup tonight – I’m not quite sure what I was expecting, I just wanted to rub shoulders with people in the big data world really and see what they were up to. I was very glad I went. I met a number of that elusive animal, real live data scientists, and also enjoyed a couple of great presentations.

The first thing that struck me was how pretty much all the big data use cases being performed in anger that were discussed were related to web ad tracking, clickstream/web analytics, social game analysis etc. No mention at all of analysis of financial data like stock trades or insurance-sector actuarial analysis for example. This very much fits with our experiences in the wider market – people who are selling “stuff” (physical or digital products) on B2C web sites have had to address the data explosion first. Large enterprises have mainly yet to realise the opportunity before them.

The second key takeaway for me was that whilst this was a Hive Meetup, both presenters issued pretty cautionary messages about using Hive at all – kinda ironic. Yali Sassoon from SnowPlow Analytics explained how they had started their open source work using Hive on AWS EMR for most of the data processing/ETL and analysis – unpacking query strings from AWS CloudFront logs, creating partitioned Hive tables backed by S3, and then using Hive to perform analysis operations on it, e.g. mimicking the functionality of Google Analytics. But over time, as they have matured their architecture, they’ve hit issues with troublesome debugging (you write something like SQL, but you debug Java :) ), slow performance and some analyses that Hive just isn’t really able to support in a performant fashion. So they’ve moved to investigating Cascading for the ETL (e.g. for better control over data exceptions and replays), and column-oriented databases for the final analytics datastore that they expose to their customers – specifically they’re using InfoBright. Notably though, all their customers still use Hive today for all their analysis. It’s just such an easy way to start on the Hadoop journey and that counts for a lot. And for really big (petabytes) work, Hive is still the way to go.

Pedro Figueiredo gave the second presentation and covered a collection of hints, tips and tuning suggestions for using Hive. Hive’s strength is also its weakness – it looks like SQL so adoption by existing data analysts can be very rapid…er…but…it isn’t SQL so those same data analysts can create some shockingly bad processes/queries if they don’t understand the underlying MapReduce architecture and its inherent strengths and weaknesses. I don’t see this as much different to the issue with novice SQL coders creating hideous joins with no indexes etc – but when you’ve got gigabytes of data things can get ugly pretty quickly…

Pedro reinforced the “don’t use Hive for (serious) ETL” message, recommending native Java or streaming Hadoop jobs in your preferred language. Use Hive for its strengths, such as summing/aggregation of already preprocessed datasets so that they are ready to load into your OLAP/BI tool of choice for visualisation purposes. He also mentioned that he typically experiences a 90% cost saving over on-demand instance pricing by using spot AWS instances on EMR (so it’s a no brainer basically for worker nodes), but that the advantage diminishes once the US day kicks off.

Great session, very useful – and free flowing beer and pizza.

A colleague pointed me at an article in Computing the other day that starts off with “Retail giant Marks & Spencer is ditching Amazon as its online platform host”. As we are a leading AWS Solution Provider, I thought this was interesting, so I looked into it. The article itself is not misleading, but as usual the comments and interpretation it has generated have confused Amazon the retail store with Amazon Web Services (AWS) the IaaS/PaaS provider. For example someone made the comment “One is left wondering how many of AWS’s customers will move to rackspace or other companies of that ilk” – which is nonsense IMHO, and I’ll explain why.

M&S’s eCommerce offering is built using a white-labelled version of Amazon’s retail store – see this article in ComputerWeekly from 2007 when it was announced. I’m sure that some or all of M&S’s eCommerce site runs on AWS infrastructure, but that’s not the point really. I totally understand a retail organisation’s reluctance to use a retail competitor’s platform due to the reduced control over new functionality releases and the level of business data insight that this could give to a competitor. But those arguments do not apply to running a retail operation (with your own eCommerce software/platform) on AWS – if you follow good cloud architectural practices, e.g. encrypt data in transit and at rest, keep your encryption keys private and away from the cloud service provider etc.

In discussions with customers and vendors I see this confusion of Amazon.com and AWS all the time (sometimes deliberately to spread fear, and sometimes accidentally through lack of understanding). To be balanced here, the AWS offering has historically leaned on the Amazon.com brand and scale as a selling point, so I guess AWS need to keep making this distinction clear (which I’ve seen them do in every presentation). Anyway, either way it’s a red herring…and getting a bit tedious.
