By this point everyone is well acquainted with the power of Hadoop’s MapReduce. But what you’re also probably well acquainted with is the pain that must be suffered when setting up your own Hadoop cluster. Sure, there are some really good tutorials online if you know where to look:
- here is a great one for setting up a single node cluster
- and here is the equally good follow-on tutorial for setting up a complete cluster
However, I’m not much of a dev ops guy so I decided I’d take a look at Amazon’s Elastic MapReduce (EMR) and for the most part I’ve been very pleased. However, I did run into a couple of difficulties, and hopefully this short article will help you avoid my pitfalls.
Elastic MapReduce, Step by Step
Elastic MapReduce makes everything simple. In just a few steps and in just a few minutes, you can have your own MapReduce job running on a cluster of whatever size and configuration you’d like. First you just log on to your AWS console, navigate to the Elastic MapReduce page, and click the “Create New Job Flow” in the top left of the main panel.
Next you name your job and indicate that you will supply your own custom jar.
On the next screen you tell EMR exactly where to find your jar file. This is going to be a path that starts with the name of an S3 bucket that you have access to in AWS. You also specify the inputs to your job. For most people this will likely include the path to an input directory and the path to an output directory (as I’ve indicated here), but you can use this space to include any argument you’d like. There are a couple of potential gotchas here. First, if you did not specify a main class in your jar manifest, then the fully qualified class must be the first argument here. Second, if you are pointing to directories in S3, make sure that you include the S3 scheme at the front of the URI:
The next screen is where you get to experience the awesome efficiency of Elastic MapReduce. No futzing with hackneyed deployment scripts; you simply select the type of machine you want for the master, the type and number of machines you want for slaves. As you can see here, I’m building a 1021 machine cluster of the largest machines that AWS has to offer! (…And then after I took this screen shot I got scared and returned it to a 3-machine cluster of “smalls”.)
The next screen has a lot of goodies that it’s well worth paying attention to. If you ever want to be able to ssh into the cluster you’re building — which is often a good idea — then you must make sure to specify an Amazon EC2 Key Pair. While you’re debugging the setup, it’s also a good idea to set a directory to send log files to. And while I haven’t yet found it as useful, it’s probably also a good idea to enable debugging. And finally, the ssh key is of no use if the job has failed and AWS has shut down your cluster, so click the “Keep Alive” option in case you need to get into the machine and do a post-mortem. As we’ll see in a little bit, being able to ssh into the Hadoop master is also useful for running experiments in the same environment that that your job will run.
While, I haven’t had need of them quite yet, the next screen gives you the option to “bootstrap” your Hadoop cluster. So, for instance, you can install software, run custom scripts, or tweak the nitty-gritty configuration parameters of Hadoop itself. For the power user, this should be quite useful.
In the final screen, Elastic MapReduce allows you to review the configuration before you push the big “Create Job Flow” button and things get rolling. Once you create your job flow, then you can check back at the main screen of Elastic MapReduce and watch as your cluster builds, runs, and completes — or in my case FAILS!
Avoiding my Pitfalls in Elastic MapReduce
So let’s get into that discussion: What are some of the common pitfalls, and how can you avoid them? Above, I’ve already described a couple: make sure to specify the main class of your jar file (if it’s not already specified in the jar’s manifest), and since you’re probably pulling from and pushing to S3, make sure to use fully qualified URIs including the
That last one caused me some pain. Let’s take a look at how I tracked down my problem and hopefully it will help you in the future when you are tracking down your own problem.
Notice the big “FAILED” message for my job’s state. That was my first clue that something might be wrong.
Now notice the “Debug” button in the header of the table. By clicking on this button, you’ll have access to several useful debugging files.
In my case, the “stderr” file was the most helpful. This is the stderr output that occurred while running the job – so you’ll get to see where the error occurred in your java code.
For me, the error was about a Null value that basically indicated that a file did not exist. This was troubling because I’d run the job several times locally with no problem, and I’d double checked, and these files were exactly where (I thought) I’d specified. So something was different between my local environment and the environment of my Elastic MapReduce job. But what?!
Fortunately, I’d specified an ssh key and indicated that the cluster should be kept alive after running or failing the job, so I could easily ssh onto the box. But here’s where debugging gets a little tricky. The error had occurred deep within my code, and I didn’t really want to make little tweaks in the java on my local machine, jar everything up and, and send it to the remote machine. The turn around for this process would have been 15 minutes and I would be destroying my code in the process.
Instead, I opted to write Java on the remote machine for a relatively fast debugging cycle. I found that I could use vim to replicate the problematic portion of my code, and in a single line of bash, I could compile my code (Test.java), jar it (as Test.jar), and run it against Hadoop. Here’s my magic line:
javac -cp $HADOOP_HOME Test.java; jar cf Test.jar Test.class; hadoop jar Test.jar Test s3://myBucket/myDirectory
And, as it turns out, the bug I was having was because I’d used the default Hadoop FileSystem
FileSystem fs = FileSystem.get(conf);
Instead, I should have retrieved the filesystem from the input path, which takes into account that we’re talking with S3
FileSystem fs = new Path("s3://myBucket/myDirectory").getFileSystem(conf);
And finally, there was one more gotcha in my case that might be of help for you. It’s important to know that all JVMs are not equal! I’m doing image processing, so I’m making heavy use of the Java Advanced Imaging (JAI) API. On my local machine, this comes with the JVM, however the JVM used by Elastic MapReduce doesn’t have these dependencies (and possibly doesn’t have any of the javax dependencies). In order to fix this problem I had to pull down these extra dependences and make sure they were bundled into my jar file. But, once I did that, it ran as smooth as you please!
All in all, I really like Elastic MapReduce. It gives you the ability to get to work fast without having to worry much about configuration or administration tasks.