Thursday, 3 May 2012

Installing Cloudera CDH3: Hadoop Distribution

We will see how to install the Cloudera CDH3 Hadoop distribution on an Ubuntu Lucid system.
This document currently describes installing CDH3 on a single-node system.

A) How to Install CDH:

1)  First, download the CDH3 Debian package (cdh3-repository_1.0_all.deb) to a directory to which you have write access.

2)  Install the downloaded package from the command line. For example, if it was saved in the Downloads directory under your current directory:
$ sudo dpkg -i Downloads/cdh3-repository_1.0_all.deb

3)  Update the APT package index:
$ sudo apt-get update

4)  Find and install the Hadoop core package using your preferred APT package manager, such as apt-get, aptitude, or dselect. For example:
$ apt-cache search hadoop
$ sudo apt-get install hadoop-0.20

The apt-cache search command lists packages such as:
hadoop-0.20 - A software platform for processing vast amounts of data

hadoop-0.20-conf-pseudo - Pseudo-distributed Hadoop configuration

hadoop-0.20-datanode - Data Node for Hadoop

hadoop-0.20-doc - Documentation for Hadoop

hadoop-0.20-fuse - HDFS exposed over a Filesystem in Userspace

hadoop-0.20-jobtracker - Job Tracker for Hadoop

hadoop-0.20-namenode - Name Node for Hadoop

hadoop-0.20-native - Native libraries for Hadoop (e.g., compression)

hadoop-0.20-pipes - Interface to author Hadoop MapReduce jobs in C++

hadoop-0.20-sbin - Server-side binaries necessary for secured Hadoop clusters

hadoop-0.20-secondarynamenode - Secondary Name Node for Hadoop

hadoop-0.20-source - Source code for Hadoop

hadoop-0.20-tasktracker - Task Tracker for Hadoop

5)  Install each type of daemon package on the appropriate machine; for example, install the NameNode package on your NameNode machine. On a single-node system, install all of them on the same machine:
$ sudo apt-get install hadoop-0.20-<daemon type>

where <daemon type> is one of the following:
  • namenode
  • datanode
  • secondarynamenode
  • jobtracker
  • tasktracker
  • native
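
For a single-node system, the per-daemon installs above can be listed with a short loop. This is only a sketch: it prints the commands instead of running them, and it leaves out native on the assumption that you install it separately, since it provides libraries rather than a daemon.

```shell
#!/bin/sh
# Sketch: print the apt-get command for every CDH3 daemon package
# needed on a single-node system. Nothing is installed here; the
# commands are only printed so they can be reviewed first.

DAEMONS="namenode datanode secondarynamenode jobtracker tasktracker"

print_install_commands() {
    for daemon in $DAEMONS; do
        echo "sudo apt-get install hadoop-0.20-$daemon"
    done
}

print_install_commands
```

Once the list looks right, run each printed command by hand.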

B) Changes in User Accounts and Groups in CDH3 Due to Security:
During CDH3 package installation, two new Unix user accounts called hdfs and mapred are automatically created to support security:

This User     Runs These Hadoop Programs
hdfs          NameNode, DataNodes, and Secondary NameNode
mapred        JobTracker and TaskTrackers
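
If you want to confirm that both accounts exist after installation, a small check like the following may help. It is a sketch that relies only on the standard id command; the function name account_status is made up for this example.

```shell
#!/bin/sh
# Report whether a Unix account exists. `id` exits non-zero for
# unknown users, so its exit status is all we need.
account_status() {  # usage: account_status <user name>
    if id "$1" >/dev/null 2>&1; then
        echo "$1: present"
    else
        echo "$1: missing"
    fi
}

for user in hdfs mapred; do
    account_status "$user"
done
```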

C) Directory Ownership in Local File System 

Directory               Owner                 Permissions
dfs.name.dir            hdfs:hadoop           drwx------
dfs.data.dir            hdfs:hadoop           drwx------
mapred.local.dir        mapred:hadoop         drwxr-xr-x
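
The modes in this table can be tried out with a sketch like the one below. The paths are the ones used later in this post; mapping /mnt/hadoop/name and /mnt/hadoop/data to the NameNode and DataNode storage directories is an assumption, and ROOT is a hypothetical prefix so that the script works in a scratch location without root access.

```shell
#!/bin/sh
# Sketch: create the local directories with the modes from the table.
# ROOT is a hypothetical prefix. When ROOT is unset, a scratch
# directory is used; run with ROOT="" as root to use the real paths.
ROOT="${ROOT-$(mktemp -d)}"

make_dir() {  # usage: make_dir <path> <octal mode>
    mkdir -p "$ROOT$1" && chmod "$2" "$ROOT$1"
}

make_dir /mnt/hadoop/name 700               # drwx------
make_dir /mnt/hadoop/data 700               # drwx------
make_dir /mnt/hadoop/map-red-local-dir 755  # drwxr-xr-x

# Changing ownership needs root, so these commands are only printed.
echo "sudo chown hdfs:hadoop /mnt/hadoop/name /mnt/hadoop/data"
echo "sudo chown mapred:hadoop /mnt/hadoop/map-red-local-dir"
```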


D) Directory Ownership on HDFS:

The following directories on HDFS must also be configured as follows:

Directory               Owner                 Permissions
mapred.system.dir       mapred:hadoop         drwx------
/ (root directory)      hdfs:hadoop           drwxr-xr-x


E) Configuration:

You will find all the configuration files in the directory "/etc/hadoop-0.20/conf/".
Configure the following properties in their respective configuration files.

1)  In core-site.xml 


hadoop.tmp.dir (set to "/mnt/hadoop/tmp" in this setup):
  <description>A base for other temporary directories.</description>

fs.default.name (the URI of your NameNode):
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>

Create the directory "/mnt/hadoop/tmp" on the local file system and make sure its owner and group are "mapred" and "hadoop" respectively.
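
As a rough illustration, a minimal core-site.xml for this setup could be written as follows. This is a sketch: hdfs://localhost:8020 is an assumed single-node value that does not come from this post, and OUT_DIR is a hypothetical scratch location so that nothing under /etc/hadoop-0.20/conf/ is touched.

```shell
#!/bin/sh
# Sketch: write a minimal core-site.xml to a scratch directory.
# OUT_DIR is hypothetical; the real file lives in /etc/hadoop-0.20/conf/.
OUT_DIR="${OUT_DIR:-$(mktemp -d)}"

cat > "$OUT_DIR/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <!-- Assumed single-node value; adjust host and port to your setup. -->
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>
EOF

echo "wrote $OUT_DIR/core-site.xml"
```

Review the generated file before copying it into the configuration directory.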

2)  In hdfs-site.xml


This file holds the HDFS storage settings, including dfs.name.dir and dfs.data.dir; the <description> elements for these properties are empty.

Create the directories "/mnt/hadoop/name" and "/mnt/hadoop/data" on the local file system and make sure that their owner and group are "hdfs" and "hadoop" respectively.
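
One way to produce the property entries is to generate them from name/value pairs. The sketch below assumes that /mnt/hadoop/name is the value of dfs.name.dir and /mnt/hadoop/data is the value of dfs.data.dir; emit_property and OUT_DIR are names made up for this example, and the values are not XML-escaped.

```shell
#!/bin/sh
# Sketch: generate hdfs-site.xml from name/value pairs.
OUT_DIR="${OUT_DIR:-$(mktemp -d)}"   # hypothetical scratch location

emit_property() {  # usage: emit_property <name> <value>
    printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

{
    echo '<?xml version="1.0"?>'
    echo '<configuration>'
    emit_property dfs.name.dir /mnt/hadoop/name   # assumed mapping
    emit_property dfs.data.dir /mnt/hadoop/data   # assumed mapping
    echo '</configuration>'
} > "$OUT_DIR/hdfs-site.xml"

cat "$OUT_DIR/hdfs-site.xml"
```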

3)  In mapred-site.xml


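mapred.job.tracker (the host and port of your JobTracker):
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.</description>

mapred.local.dir (set to "/mnt/hadoop/map-red-local-dir" in this setup):
  <description>Determines where temporary MapReduce data is written. It also may be a list of directories.</description>

mapred.system.dir (set to "/hadoop/system" in this setup; its description is empty):
  <description>  </description>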

Create the directory "/mnt/hadoop/map-red-local-dir" on the local file system and make sure its owner and group are "mapred" and "hadoop" respectively.

Note that the "/hadoop/system" directory must be created in HDFS, and its owner and group must be "mapred" and "hadoop" respectively.
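
After editing a configuration file, it can be useful to read a value back and compare it with the directory you created. The sketch below writes a sample mapred-site.xml to a scratch file and extracts values with awk. The localhost:8021 address is an assumed single-node value, get_property is a name made up for this example, and the parser expects each name and value element on its own line.

```shell
#!/bin/sh
# Sketch: read property values back out of a Hadoop config file.
CONF=$(mktemp)   # scratch file, not the real configuration

cat > "$CONF" <<'EOF'
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/mnt/hadoop/map-red-local-dir</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/hadoop/system</value>
  </property>
</configuration>
EOF

# Remember the most recent <name>, and print the <value> that follows
# it when the name matches the requested property.
get_property() {  # usage: get_property <file> <property name>
    awk -v want="$2" '
        /<name>/  { gsub(/.*<name>|<\/name>.*/, ""); name = $0; next }
        /<value>/ { gsub(/.*<value>|<\/value>.*/, ""); if (name == want) print }
    ' "$1"
}

get_property "$CONF" mapred.local.dir
get_property "$CONF" mapred.system.dir
```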

F) Initial Setup :

  • Format the NameNode 

Before starting the NameNode for the first time you need to format the file system.
Make sure you format the NameNode as user hdfs.

$ sudo -u hdfs hadoop namenode -format
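
Formatting is meant for the first start only, because formatting an existing NameNode directory discards its metadata. A small guard like the one below can help avoid doing it twice. It is a sketch that assumes the NameNode directory is /mnt/hadoop/name and that a formatted directory contains current/VERSION; run it as a user who can read that directory (for example with sudo -u hdfs), otherwise it will report the directory as unformatted.

```shell
#!/bin/sh
# Sketch: only suggest formatting when the NameNode directory does not
# already hold a formatted file system.
is_formatted() {  # usage: is_formatted <NameNode storage directory>
    [ -f "$1/current/VERSION" ]
}

NAME_DIR="${NAME_DIR:-/mnt/hadoop/name}"   # assumed NameNode directory

if is_formatted "$NAME_DIR"; then
    echo "$NAME_DIR is already formatted; not formatting again"
else
    echo "run: sudo -u hdfs hadoop namenode -format"
fi
```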

  • Step 1: Start HDFS

To start HDFS, start the NameNode, Secondary NameNode, and DataNode services. 
On the NameNode:
$ sudo service hadoop-0.20-namenode start

On the Secondary NameNode:
$ sudo service hadoop-0.20-secondarynamenode start
While starting the Secondary NameNode, you may see one of the following errors in its log file.

If the storage directory does not exist:

INFO org.apache.hadoop.hdfs.server.common.Storage: Storage directory /mnt/hadoop/tmp/dfs/namesecondary does not exist.
2012-05-04 06:03:36,155 INFO org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: SHUTDOWN_MSG:

--> Create this directory manually.

If the storage directory cannot be accessed:

INFO org.apache.hadoop.hdfs.server.common.Storage: Cannot access storage directory /mnt/hadoop/tmp/dfs/namesecondary
2012-05-04 06:06:18,320 INFO org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: SHUTDOWN_MSG:

--> Check the permissions on the directory and grant the necessary access.
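
To tell the two cases apart before restarting the service, a check along these lines may help. It is a sketch: check_storage_dir is a name made up for this example, and the suggested owner hdfs:hadoop is an assumption based on the hdfs user running the Secondary NameNode. Run it as the hdfs user (for example with sudo -u hdfs) so the access test reflects what the daemon sees.

```shell
#!/bin/sh
# Sketch: report whether a storage directory is missing, inaccessible
# to the current user, or usable.
check_storage_dir() {  # usage: check_storage_dir <path>
    if [ ! -d "$1" ]; then
        echo "missing: create it, e.g. sudo mkdir -p $1"
    elif [ ! -r "$1" ] || [ ! -w "$1" ] || [ ! -x "$1" ]; then
        echo "inaccessible: fix ownership or mode, e.g. sudo chown -R hdfs:hadoop $1"
    else
        echo "ok"
    fi
}

check_storage_dir "${CHECKPOINT_DIR:-/mnt/hadoop/tmp/dfs/namesecondary}"
```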

On the DataNode:

$ sudo service hadoop-0.20-datanode start

  • Step 2: Create the HDFS /tmp Directory

Once HDFS is up and running, create the /tmp directory and set its permissions to 1777 (drwxrwxrwt), as follows:

$ sudo -u hdfs hadoop fs -mkdir /tmp 

$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
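
If the 1777 mode is unfamiliar, the same mode can be tried on a local scratch directory. This sketch uses only local commands and does not touch HDFS; the leading 1 is the sticky bit, which shows up as the trailing t in the listing.

```shell
#!/bin/sh
# Sketch: show what mode 1777 looks like on a local scratch directory.
demo=$(mktemp -d)
chmod 1777 "$demo"

# The first ten characters of `ls -ld` are the type and permission bits.
mode=$(ls -ld "$demo" | cut -c1-10)
echo "$mode"
```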

  • Step 3: Create and Configure the mapred.system.dir Directory in HDFS

After you start HDFS and before you start the JobTracker (see the next step), you must also create the HDFS directory specified by the mapred.system.dir parameter and configure it to be owned by the mapred user. In the example below, mapred.system.dir is set to "/hadoop/system".

$ sudo -u hdfs hadoop fs -mkdir /hadoop/system

$ sudo -u hdfs hadoop fs -chown mapred:hadoop /hadoop/system 
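
To check the result, you can list the parent directory and pick out the owner and group. The sketch below parses a listing with awk; the sample line is illustrative and assumes the usual hadoop fs -ls column layout (permissions, replication, owner, group, size, date, time, path). The function name owner_of is made up for this example.

```shell
#!/bin/sh
# Sketch: extract "owner:group" for one path from `hadoop fs -ls` output.
owner_of() {  # usage: <listing on stdin> | owner_of <path>
    awk -v p="$1" '$NF == p { print $3 ":" $4 }'
}

# Illustrative sample listing, not captured from a real cluster.
sample='Found 1 items
drwx------   - mapred hadoop          0 2012-05-04 06:10 /hadoop/system'

printf '%s\n' "$sample" | owner_of /hadoop/system
```

On a running system the listing would come from a command such as sudo -u hdfs hadoop fs -ls /hadoop, piped into owner_of /hadoop/system.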

  • Step 4: Start MapReduce

To start MapReduce, start the TaskTracker and JobTracker services.

On each TaskTracker system:

$ sudo service hadoop-0.20-tasktracker start

On the JobTracker system:

$ sudo service hadoop-0.20-jobtracker start

  • Step 5: Run a MapReduce Job

Once all the setup is done, you can try running your Hadoop job as follows:

$ hadoop-0.20 jar your_job.jar input_path output_path
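
A fuller run usually involves copying input into HDFS first and reading the output afterwards. The sketch below only prints the commands by default. The file name local_input.txt is hypothetical, the part-* output naming depends on your job, and the relative HDFS paths assume that your HDFS home directory already exists and is writable by you.

```shell
#!/bin/sh
# Sketch: stage input, run the job, and read the result.
# DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

run hadoop fs -mkdir input_path
run hadoop fs -put local_input.txt input_path
run hadoop-0.20 jar your_job.jar input_path output_path
run hadoop fs -cat 'output_path/part-*'
```

Set DRY_RUN=0 to execute the commands once they look right for your job.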