So I ran up a small cluster on AWS – with an m1.large for the master node and 2 core nodes, also running m1.large. I used the US-West region (Oregon) – which offers the same cheap price points as US-East but is 100% carbon-neutral as well :). This was all running using spot instances in a VPC. For interest, the total AWS cost for 24 normalised instance hours (I actually ran the cluster for just over 3 hours, including one false cluster start!) was $1.05. Using developer standard units of cost, that’s nearly the price of half a cup of coffee! (or since we’re using Oregon region, a green tea?)
As I’m lazy, I used the code and datasets from the AWS tutorial – and decided to just use a simple count of records that contained the string “robin” in the email address field of a 13.3m row table as my comparison. Here’s how you define the basic table structure…
create EXTERNAL TABLE customers( id BIGINT, name STRING, date_of_birth TIMESTAMP, gender STRING, state STRING, email STRING, phone STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION '/data/customers/';
The output is…
[ip-10-0-0-26.us-west-2.compute.internal:21000] > select count(*) from customers;
Query: select count(*) from customers
| count(*) |
| 13353953 |
Returned 1 row(s) in 1.09s
[ip-10-0-0-26.us-west-2.compute.internal:21000] > select count(*) from customers where customers.email like "%robin%";
Query: select count(*) from customers where customers.email like "%robin%"
| count(*) |
| 66702 |
Returned 1 row(s) in 1.73s
A slight aside – Impala uses run-time code generation to compile down the query down to machine code using LLVM, and this introduces a compilation overhead of circa 150ms, but which more than pays back on the majority of queries. So this is where some of our 1.73s is going. More about this here.
As a glutton for punishment, I decided to use pig rather than the more usual hive for the comparison with Impala. The first thing to say – it was way harder, as the aptly named pig is just a bit more foreign to me than the SQL-like niceness of Impala…so there was some desperate checking of cheatsheets etc to remind me how best to do it…
The basic code for the same source data (already loaded into HDFS) is as follows…
CUST = LOAD 'hdfs://10.0.0.26:9000//data/customers/customers' USING PigStorage('|')
as (id: chararray,
C2 = FILTER CUST BY REGEX_EXTRACT_ALL(email, '(.*)robin(.*)') IS NOT NULL;
C3 = FOREACH (GROUP C2 ALL) GENERATE COUNT(C2) as cnt;
As you can see the pig approach ran 8 maps. The output is as follows (with all the INFO messages and some other noise removed)…
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.2.0 0.11.1.1 hadoop 2014-04-10 12:11:13 2014-04-10 12:12:26 GROUP_BY,FILTER
Successfully read 13353953 records (9 bytes) from: "hdfs://10.0.0.26:9000//data/customers/customers"
Successfully stored 1 records (9 bytes) in: "hdfs://10.0.0.26:9000/tmp/temp1725123561/tmp-1782422819"
I was just trying it out, so this is not a fair test in some ways – and I didn’t try and do any optimisation of either approach. The Impala approach ran about 40x faster, and this was consistent with repeated runs.
I checked out the CPU, IO etc and there was nothing hitting any limits, and CPU consumption when I was alternately using Impala and pig looked like this – load was even across my two core nodes, and the master had it’s feet up most of the time…
I haven’t reported the data here, but I also played with some nasty 3-way joins using Impala and the results were really impressive. Obviously though it’s horses-for-courses – MapReduce-based approaches like hive and pig will soldier on when Impala has run out of memory for certain query types, or in the event of a node failure etc. But definitely a great bit of kit to have in the AWS EMR toolbag!