In this post I will describe how to get a Hadoop environment with HBase running in Cygwin on Windows 7 x64.

Having spent the better part of a week reading through blog posts and documentation, I found that none of them covered the process in full detail, at least not for the software versions I intended to use.

This guide was written for Cygwin 1.7.7, Hadoop 0.21.0 and HBase 0.20.6.

UPDATE (Sept. 5, 2011): I no longer have this system running (switched to Ubuntu), and will most likely not be able to answer questions about the setup. I would recommend you to ask your questions on the hadoop-users mailinglist. You will find information on how to subscribe and post to the list on the Hadoop website.

UPDATE (May 25, 2011): If you are using this guide, remember to have a look at the comments, some of them concern version updates and other related issues.

UPDATE (Nov. 1, 2010): I've noticed some errors arising when using Hadoop 0.21.0 and HBase 0.20.6 and gone back to Hadoop 0.20.2 instead as this does not produce the same errors. If you intend to use HBase together with Hadoop I would recommend setting up Hadoop 0.20.2 instead, the installation is more or less identical.


You will additionally need ZooKeeper 3.3.1 in order to get HBase to run properly.

Throughout this guide I will assume that your Cygwin install path is c:\cygwin and that Hadoop, ZooKeeper and HBase will be installed in c:\cygwin\usr\local (/usr/local/); this is however something you can choose yourself. If you choose to install Cygwin elsewhere, I would recommend using folder names without whitespace and other non-regular characters.

The only prerequisite for this guide is that you have Java installed and added to your %PATH% variable (which is usually done automatically).


Software


Download each software bundle and put it somewhere where you'll easily find it later.

Cygwin


If you've never used Cygwin (or Linux/Unix/etc), you should perhaps get familiar with those environments first. If you still want to continue, read on.

Throughout the Cygwin section - if you find yourself lost, please follow Vlad Korolev's guide on how to get Cygwin up and running for Hadoop and make sure to additionally install tcp_wrappers and diffutils when choosing packages. Follow steps 2 to 4 in the guide and then continue with the Hadoop installation guide below.

If you're familiar with Cygwin you just need to make sure you have these packages installed:

  • openssh
  • openssl
  • tcp_wrappers
  • diffutils


Additionally, you will have to configure ssh to start as a service and enable passwordless logins. To do this, fire up a Cygwin terminal window after you've completed the installation and do the following:

ssh-host-config

  • When asked about privilege separation, answer no
  • When asked if sshd should be installed as a service, answer yes
  • When asked about the CYGWIN environment variable, enter ntsec

Now go to the Services and Applications tool in Windows, locate the CYGWIN sshd service and start it.
The next Cygwin step is to set up passwordless login. Go to your Cygwin terminal and enter

ssh-keygen

Do not use a passphrase and accept all default values. After the process is finished do the following

cd ~/.ssh

cat id_rsa.pub >> authorized_keys

This will add your identification key to the set of authorized keys, i.e. those that are allowed to login without entering a password.
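Note that the cat >> line above appends unconditionally, so rerunning it puts the same key into authorized_keys twice. A small guard avoids that; this is my own sketch, not part of the original steps, and the helper name add_key is made up:

```shell
# Append a public key to authorized_keys only if it is not already there,
# so rerunning the setup does not create duplicate entries.
# (add_key is a hypothetical helper, not part of the original guide.)
add_key() {
  key_file="$1"
  auth_file="$2"
  touch "$auth_file"
  grep -qxF "$(cat "$key_file")" "$auth_file" || cat "$key_file" >> "$auth_file"
}

# Usage on the files created by ssh-keygen:
# add_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
```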
Try connecting to localhost to see whether it works

ssh localhost

Doing this the first time should prompt you with a warning; type yes and press enter. Now try issuing the same command again: this time there should be no warning and no need to enter a password.

This concludes the Cygwin installation.

Hadoop


Since Vlad's guide was made for Hadoop 0.19.0, some of the configuration details specified in it no longer apply (or have moved to other files); this is an updated version of what you'll find in his guide.

First, copy the downloaded tar.gz file to c:\cygwin\usr\local (which corresponds to /usr/local in the Cygwin environment). When this is done, extract the package by issuing

tar xvzf hadoop-0.21.0.tar.gz

This command extracts the content of the downloaded hadoop file into c:\cygwin\usr\local\hadoop-0.21.0 (/usr/local/hadoop-0.21.0).
Hadoop requires some configuration; the configuration files are located in c:\cygwin\usr\local\hadoop-0.21.0\conf.
The files that need to be altered are:

core-site.xml:

<property>
  <name>fs.default.name</name>
  <value>hdfs://127.0.0.1:9100</value>
</property>

mapred-site.xml:

<property>
  <name>mapred.job.tracker</name>
  <value>127.0.0.1:9101</value>
</property>

and hdfs-site.xml:

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>

Each property element goes inside the file's existing <configuration> element.

Only Hadoop 0.21.0: Next, one line has to be added to the hadoop-config.sh file in hadoop-0.21.0/bin

CLASSPATH=`cygpath -wp "$CLASSPATH"`

Add this line before the line containing

JAVA_LIBRARY_PATH=''

The reason for this is that in order for the CLASSPATH to be built with all the Hadoop jars (lines ~120 to ~200) the path needs to be in the Cygwin format (/cygdrive/c/cygwin/usr/local/hadoop...); however, in order for Java to use the classpath, it needs to be in the Windows format (c:\cygwin\usr\local\hadoop...). The added line transforms the Cygwin-built classpath into one that is understood by Windows.
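To see what this transformation does, here is a rough emulation of cygpath -wp for simple /cygdrive paths. This is an illustration only (the function name to_winpath is my own); the real cygpath also handles mount points, spaces and relative paths correctly.

```shell
# Rough illustration of what `cygpath -wp` does to a Unix-style classpath:
# each /cygdrive/X/... component becomes X:\..., and ':' separators become ';'.
# Real cygpath handles mounts and edge cases properly; this is just a sketch.
to_winpath() {
  out=""
  old_ifs="$IFS"; IFS=':'
  for p in $1; do
    w=$(printf '%s' "$p" | sed -e 's|^/cygdrive/\(.\)|\1:|' -e 's|/|\\|g')
    out="${out:+$out;}$w"
  done
  IFS="$old_ifs"
  printf '%s\n' "$out"
}

to_winpath "/cygdrive/c/cygwin/usr/local/a.jar:/cygdrive/c/cygwin/usr/local/b.jar"
# -> c:\cygwin\usr\local\a.jar;c:\cygwin\usr\local\b.jar
```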

This should be enough for Hadoop to run, test the installation by issuing these commands in a Cygwin window:

cd /usr/local/hadoop-0.21.0
mkdir logs
bin/hadoop namenode -format

The last command will take some seconds to finish and should produce about 20 lines of output during the creation of the namenode filesystem.
The final step of the Hadoop setup is to start it and test it.
To start it issue the following commands in a Cygwin window:

cd /usr/local/hadoop-0.21.0
bin/start-dfs.sh
bin/start-mapred.sh

Provided no error messages are printed, this should have started Hadoop. This can be checked by opening http://localhost:50070 and http://localhost:50030 in a browser (these are the web interfaces; ports 9100 and 9101 from the configuration are the RPC ports and won't serve web pages). The first link should provide information about the NameNode; make sure that the Live Nodes count is 1. The second link provides information about the cluster.
Now it's time to run a little job on the cluster to see whether or not the installation was successful.
First, copy some files into HDFS and run the example job:

cd /usr/local/hadoop-0.21.0
bin/hadoop fs -put conf input
bin/hadoop jar hadoop-*examples*.jar grep input output 'dfs[a-z.]+'
bin/hadoop fs -cat output/*

Provided there were no errors, you've just run your first Hadoop process.

Apache ZooKeeper


This step, it seems, is only necessary if you're installing the setup on 64 bit Windows.
The problem seems to be that the ZooKeeper server which comes bundled with HBase does not work correctly, and thus a standalone one needs to be set up.

Luckily the ZooKeeper install and configuration is quite easy.

First, copy the downloaded zookeeper-3.3.1.tar.gz file to your c:\cygwin\usr\local directory, open a Cygwin window and issue the following commands:

cd /usr/local/
tar xvzf zookeeper-3.3.1.tar.gz

ZooKeeper's configuration file (zoo.cfg) is located in /usr/local/zookeeper-3.3.1/conf (c:\cygwin\usr\local\zookeeper-3.3.1\conf).
Open the file and paste the following content into it, overwriting the original config:

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
dataDir=/tmp/zookeeper/data
# the port at which the clients will connect
clientPort=2181

Make sure to create the /tmp/zookeeper/data directory and make it writable for everyone (chmod 777).
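In a Cygwin terminal, for example:

```shell
# Create the ZooKeeper data directory and make it world-writable.
mkdir -p /tmp/zookeeper/data
chmod 777 /tmp/zookeeper/data
```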

ZooKeeper is started by typing:

cd /usr/local/zookeeper-3.3.1
bin/zkServer.sh start

Make sure to test if ZooKeeper is running correctly by connecting to it

bin/zkCli.sh -server 127.0.0.1:2181

This should connect you to ZooKeeper; you can type help to see what commands are available, however the only one you need to care about is quit.

HBase


Start by copying hbase-0.20.6.tar.gz to c:\cygwin\usr\local and extracting it by issuing

tar xvzf hbase-0.20.6.tar.gz

in a Cygwin terminal.

Now it's time to create a symlink to your JRE directory in /usr/local/. Do this by typing:

ln -s /cygdrive/c/Program\ Files/Java/jre6 /usr/local/

in a Cygwin terminal. The Java directory name will most likely be jre6, but be sure to double-check this before making the link.

HBase's configuration files are located in /usr/local/hbase-0.20.6/conf/ (C:\cygwin\usr\local\hbase-0.20.6\conf), and to get HBase up and running we need to edit hbase-env.sh and hbase-default.xml.

In hbase-env.sh the JAVA_HOME, HBASE_IDENT_STRING and HBASE_MANAGES_ZK variables have to be set; this is done by editing the lines containing the variable names to read:

export JAVA_HOME=/usr/local/jre6
export HBASE_IDENT_STRING=$HOSTNAME
export HBASE_MANAGES_ZK=false

The last variable tells HBase not to use the bundled ZooKeeper server, as we've already installed a standalone one.

Next, the hbase-default.xml file has to be edited; the two properties that need to be set are hbase.rootdir and hbase.tmp.dir:

<property>
  <name>hbase.rootdir</name>
  <value>file:///C:/cygwin/tmp/hbase/data</value>
  <description>The directory shared by region servers.
  Should be fully-qualified to include the filesystem to use.
  E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR
  </description>
</property>
<property>
  <name>hbase.tmp.dir</name>
  <value>C:/cygwin/tmp/hbase/tmp</value>
  <description>Temporary directory on the local filesystem.</description>
</property>

Make sure that both directories exist and are writable by all users (chmod 777).
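In a Cygwin terminal, C:/cygwin/tmp corresponds to /tmp, so for example:

```shell
# Create the HBase root and temp directories and make them world-writable.
mkdir -p /tmp/hbase/data /tmp/hbase/tmp
chmod 777 /tmp/hbase/data /tmp/hbase/tmp
```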

The command for starting HBase is:

cd /usr/local/hbase-0.20.6
bin/start-hbase.sh

This section is very similar to what's found on the HBase wiki, the difference is the standalone ZooKeeper config.

Start your cluster


Having done all these steps, it's time to start up the cluster.

The startup procedure should follow this order:

  1. ZooKeeper
  2. Hadoop
  3. HBase


So what you do is:

ZooKeeper:

cd /usr/local/zookeeper-3.3.1
bin/zkServer.sh start

Hadoop:

cd /usr/local/hadoop-0.21.0
bin/start-dfs.sh
bin/start-mapred.sh

HBase:

cd /usr/local/hbase-0.20.6
bin/start-hbase.sh
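The three start sections above can be wrapped in a small convenience script. This is my own sketch, not part of any of the packages; the run helper and the DRY_RUN switch are made up, the latter just printing the commands so you can check the ordering without a real install. Adjust the paths if your versions differ.

```shell
#!/bin/sh
# Start ZooKeeper, Hadoop and HBase in the required order.
# Hypothetical helper script; DRY_RUN=1 prints the commands instead of running them.
run() {
  dir="$1"; cmd="$2"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "would run: (cd $dir && $cmd)"
  else
    (cd "$dir" && $cmd)
  fi
}

run /usr/local/zookeeper-3.3.1 "bin/zkServer.sh start"
run /usr/local/hadoop-0.21.0   "bin/start-dfs.sh"
run /usr/local/hadoop-0.21.0   "bin/start-mapred.sh"
run /usr/local/hbase-0.20.6    "bin/start-hbase.sh"
```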

Acknowledgments

In order to get my system up and running, I used tutorials and information posted by others; this guide is simply an aggregation of several resources.
