In an earlier post I described how to deploy Hadoop under Cygwin in Windows. This time I'll show how to get Mahout running in that environment.
Getting the Mahout examples running from within your Cygwin environment is as easy as copy-pasting the commands from the Mahout wiki. Running your own code without the mahout startup script (which, at least to my understanding, limits you to the examples bundled with Mahout) is a different story.
Before we get started I just want to point out that should you have any questions concerning how Mahout works (i.e. not related to the installation) please direct your questions to the Mahout User list. Information about the list can be found here.
The main problem with Mahout under Cygwin is the path variables. Whether it is CLASSPATH, HADOOP_HOME or MAHOUT_HOME, Windows and Cygwin do not work well together, because Cygwin expects Unix-style paths whereas Windows prefers its native format.
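To make the mismatch concrete, here is a minimal sketch of the conversion involved. Cygwin ships the cygpath utility for this (it is used further down in this post). The to_unix_path function below is a hypothetical stand-in, not part of Cygwin, that only mimics the drive-letter case so the idea can be tried in any POSIX shell. On a real Cygwin install, cygpath may well print a different, shorter path (for example one relative to the Cygwin root).

```shell
#!/bin/sh
# Hypothetical helper (not part of Cygwin, Hadoop or Mahout): a rough
# approximation of what cygpath does for a drive-letter path.
to_unix_path() {
  # Swap every backslash for a forward slash.
  p=$(printf '%s' "$1" | tr '\\' '/')
  case "$p" in
    [A-Za-z]:*)
      # Lower-case the drive letter and put it under /cygdrive,
      # Cygwin's default mount prefix for Windows drives.
      drive=$(printf '%s' "$p" | cut -c1 | tr 'A-Z' 'a-z')
      rest=$(printf '%s' "$p" | cut -c3-)
      p="/cygdrive/$drive$rest"
      ;;
  esac
  printf '%s\n' "$p"
}

# Example: the Windows-style value used for MAHOUT_HOME in this post.
to_unix_path 'C:\cygwin\usr\local\mahout-distribution-0.4'
```

Paths that are already Unix-style pass through unchanged, which is the behaviour you want when the same script runs on both kinds of input.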
Anyhow, these are the steps that need to be taken in order to be able to run commands like
hadoop jar mahout-core-0.4.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob --input userdata/ --output useroutput -n 10 --usersFile umr.csv -s SIMILARITY_PEARSON_CORRELATION
Notice how this differs from the example given in the Mahout wiki (which would look like this if we'd run the same line as above):
bin/mahout org.apache.mahout.cf.taste.hadoop.item.RecommenderJob --input userdata/ --output useroutput -n 10 --usersFile umr.csv -s SIMILARITY_PEARSON_CORRELATION
First, you'll need to have Hadoop set up as described here.
The next step is to download Mahout. For this guide I'm using the 0.4 binary distribution (mahout-distribution-0.4.tar.gz), found by following the download links here. Deploy the distribution by extracting it to a directory on your disk and set the %MAHOUT_HOME% environment variable to that directory.
Having done this, on a non-Windows system we'd be done. This being Windows, we're not.
Test your Mahout and Hadoop home variables by typing the following commands in a Cygwin terminal window:
echo $MAHOUT_HOME
echo $HADOOP_HOME
The output should read something like "C:\cygwin\usr\local\mahout-distribution-0.4\" and "C:\cygwin\usr\local\hadoop-0.20.2\"; it will of course be different if you've deployed Hadoop and Mahout elsewhere.
By now you should be able to go to $MAHOUT_HOME (in a Cygwin terminal window) and issue the Mahout command above (provided you've started Hadoop and copied files to the directories specified in the command). If you're not sure how to do these things, I suggest you familiarize yourself with Hadoop before continuing. You will find documentation for copying files onto an HDFS cluster here.
If issuing this command is successful, you should have run your first distributed Mahout job by now.
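For the file-copying step mentioned above, a minimal sketch could look like the following. It uses the standard hadoop fs subcommands (-mkdir, -put, -ls). The local file name ratings.csv is an assumption made for illustration, while userdata and umr.csv are the names used in the RecommenderJob command. The run wrapper is a hypothetical helper that only prints the commands unless DRY_RUN is set to 0, so the sketch can be read and tried without a running cluster.

```shell
#!/bin/sh
# Dry run by default: each command is printed instead of executed.
# Set DRY_RUN=0 to actually run them against a started Hadoop cluster.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical wrapper: print the command in dry-run mode, else run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# ratings.csv is an assumed local file holding the preference data;
# umr.csv is the users file named in the RecommenderJob command.
run hadoop fs -mkdir userdata
run hadoop fs -put ratings.csv userdata/
run hadoop fs -put umr.csv umr.csv
run hadoop fs -ls userdata
```

Relative HDFS paths such as userdata resolve against your HDFS home directory, which is why the job command can refer to them without a leading slash.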
Time to add the final configurations to Hadoop in order to be able to run your own projects with Mahout as a dependency.
First we need to edit the hadoop script that runs it all ($HADOOP_HOME/bin/hadoop) and add the following lines after the user-specified classpath setup (it should be around line 170):
export MAHOUT_JARS=""
MAHOUT_HOME=`cygpath $MAHOUT_HOME`
#echo $MAHOUT_HOME
# add release dependencies to CLASSPATH
for f in $MAHOUT_HOME/mahout-*.jar; do
  MAHOUT_JARS=${MAHOUT_JARS}:$f;
done
# add dev targets if they exist
for f in $MAHOUT_HOME/*/target/mahout-*-job.jar; do
  MAHOUT_JARS=${MAHOUT_JARS}:$f;
done
for f in $MAHOUT_HOME/lib/*.jar; do
  MAHOUT_JARS=${MAHOUT_JARS}:$f;
  #echo $f
done
# add development dependencies to CLASSPATH
for f in $MAHOUT_HOME/examples/target/dependency/*.jar; do
  MAHOUT_JARS=${MAHOUT_JARS}:$f;
done
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH":"$MAHOUT_JARS
These lines add the Mahout jars to the Hadoop classpath, i.e. they make sure that when Hadoop issues the java command, Java will be able to find the Mahout dependencies in your project.
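To see what loops of this kind produce without touching a real installation, the following sketch runs the same pattern against a throwaway directory. The jar names and the temporary directory are made up for illustration. The [ -e "$f" ] guard and the ${jars:+...} expansion are additions of this sketch rather than part of the lines above: when a glob matches nothing the shell leaves the pattern in place as a literal string, and the guard skips it, while the expansion avoids a leading colon.

```shell
#!/bin/sh
# Build a fake Mahout layout in a temporary directory (illustrative names only).
demo_home=$(mktemp -d)
mkdir -p "$demo_home/lib"
touch "$demo_home/mahout-core-0.4.jar" "$demo_home/lib/example-dep.jar"

jars=""
for f in "$demo_home"/mahout-*.jar "$demo_home"/lib/*.jar \
         "$demo_home"/examples/target/dependency/*.jar; do
  # An unmatched glob stays as a literal pattern; skip it.
  [ -e "$f" ] || continue
  # Append with a colon separator, without a leading empty entry.
  jars="${jars:+$jars:}$f"
done

# Print the colon-separated list, then clean up the demo directory.
echo "$jars"
rm -rf "$demo_home"
```

The printed value is the two fake jars joined by a colon. The examples/target/dependency pattern contributes nothing because that directory does not exist in the demo layout.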
Next, we need to do almost the same in $HADOOP_HOME/conf/hadoop-env.sh; otherwise the datanodes and jobtracker won't be aware of the dependencies.
Add the following lines somewhere at the top of the file:
export MAHOUT_HOME="C:\cygwin\usr\local\mahout-distribution-0.4"
export MAHOUT_JARS=""
MAHOUT_HOME=`cygpath $MAHOUT_HOME`
# add release dependencies to CLASSPATH
for f in $MAHOUT_HOME/mahout-*.jar; do
  MAHOUT_JARS=${MAHOUT_JARS}:$f;
done
# add dev targets if they exist
for f in $MAHOUT_HOME/*/target/mahout-*-job.jar; do
  MAHOUT_JARS=${MAHOUT_JARS}:$f;
done
for f in $MAHOUT_HOME/lib/*.jar; do
  MAHOUT_JARS=${MAHOUT_JARS}:$f;
done
# add development dependencies to CLASSPATH
for f in $MAHOUT_HOME/examples/target/dependency/*.jar; do
  MAHOUT_JARS=${MAHOUT_JARS}:$f;
done
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH":"$MAHOUT_JARS
Just make sure that MAHOUT_HOME points to the folder where your Mahout instance resides. This should be picked up from the $MAHOUT_HOME environment variable; however, when I did not state the variable explicitly in this file, I received errors when starting or stopping either the nodes or the jobtracker.
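One possible middle ground, not verified against the start/stop errors described above, is to keep the hard-coded path only as a fallback, so that a value already present in the environment wins. This is a sketch under the assumption that the errors come from the variable being unset when the file is read. The fallback path is the install location used in this post, so adjust it to your own.

```shell
#!/bin/sh
# Use the value from the environment when there is one; otherwise fall back
# to the hard-coded install location used in this post (an assumption here).
# Single quotes keep the backslashes literal.
MAHOUT_HOME=${MAHOUT_HOME:-'C:\cygwin\usr\local\mahout-distribution-0.4'}
export MAHOUT_HOME

# printf rather than echo, since some shells' echo interprets backslashes.
printf 'Using MAHOUT_HOME=%s\n' "$MAHOUT_HOME"
```

In hadoop-env.sh this would take the place of the first export line only; the cygpath conversion and the loops would follow unchanged.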
Having done this, your Mahout setup should be done, and you should be able to run your self-built Mahout jobs.