Cloud MapReduce vs Hadoop
I’ve been playing with Hadoop quite a bit lately. However, I’m not deeply in love with Hadoop. If something better came along, I wouldn’t shed too many tears. Particularly if that something better wasn’t written in Java.
That’s why I was excited to check out the cloud mapreduce implementation of MR, which claims to have a 60x speed gain on Hadoop in only 3kLoC. This incredible feat was accomplished by using more of the Amazon cloud “os” than Hadoop supports (it’s built around S3 & SQS) and eliminating the task tracker node.
I started digging into their technical report[pdf] and was shocked to discover the source of their bold 60x claim. First, some background on HDFS (the filesystem on which Hadoop runs):
Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance. (from HDFS Architecture)
Got that? Hadoop operates on blocks of data that are 64mb (or larger), and the architecture is built around this assumption. In order to make their 60x claim, the authors of cloud mapreduce operate over “92,367 HTML files,” which by any measure would not be an optimal use of a system that is architected for 64mb chunks. The authors specifically note that Hadoop applies 92,367 map operations, which strongly contributed to the speed advantage of their system over Hadoop.
Additionally, because of the way they use S3, they are unable to store files larger than 5GB, which is well within the scope of a Hadoop job.
Here’s the kicker: When operating on a dataset provided as an example by the Hadoop project (which came an appropriately-sized 1gb file), the cloud mapreduce architecture beat Hadoop by a far more modest 1.4x (329s vs 459s) or 1.2x (1001s vs 1211s) factor. This is from their own data (chart on page 13 of the report).
So would it be worth locking your architecture into Amazon’s cloud and sacrificing the ability to deal with >5gb files as well as projects like Hive, Pig, and Hbase? For a 60x speed difference, it might actually be. For 1.5x, though?
The project is a neat idea and an impressive feat to match or beat Hadoop by taking advantage of cloud-specific resources. However, it is intellectually dishonest and unscrupulous to claim a 60x increase that is only obtained by improper usage of the other system. If they instead said “cloud mapreduce is designed with different architectural considerations,” I would have no complaint. But let’s compare apples to apples, okay?
May the toy elephant be with you.