This document describes the installation of CDH3 on a single-node system.
A) How to Install CDH:
1) First, download the CDH3 Debian package to a directory to which you have write access.
(The CDH3 Debian package is available from Cloudera's download site.)
2) Install the downloaded package from the command line, for example:
$ sudo dpkg -i Downloads/cdh3-repository_1.0_all.deb
3) Update the APT package index:
$ sudo apt-get update
4) Find and install the Hadoop core package by using your favorite APT package manager, such as apt-get, aptitude, or dselect. For example:
$ apt-cache search hadoop
You will see a list of packages like:
hadoop-0.20 - A software platform for processing vast amounts of data
hadoop-0.20-conf-pseudo - Pseudo-distributed Hadoop configuration
hadoop-0.20-datanode - Data Node for Hadoop
hadoop-0.20-doc - Documentation for Hadoop
hadoop-0.20-fuse - HDFS exposed over a Filesystem in Userspace
hadoop-0.20-jobtracker - Job Tracker for Hadoop
hadoop-0.20-namenode - Name Node for Hadoop
hadoop-0.20-native - Native libraries for Hadoop (e.g., compression)
hadoop-0.20-pipes - Interface to author Hadoop MapReduce jobs in C++
hadoop-0.20-sbin - Server-side binaries necessary for secured Hadoop clusters
hadoop-0.20-secondarynamenode - Secondary Name Node for Hadoop
hadoop-0.20-source - Source code for Hadoop
hadoop-0.20-tasktracker - Task Tracker for Hadoop
Install the core package:
$ sudo apt-get install hadoop-0.20
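To confirm that the core package installed correctly, you can check the reported version (the exact version string depends on the CDH3 update you installed):
$ hadoop version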
5) Install each type of daemon package on the appropriate machine. For example, install the NameNode package on your NameNode machine:
$ sudo apt-get install hadoop-0.20-<daemon type>
where <daemon type> is one of the following (on a single-node system, all daemons run on the same machine; see the example after this list):
- namenode
- datanode
- secondarynamenode
- jobtracker
- tasktracker
- native
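Since this document targets a single-node system, all of the daemon packages go on the same machine. A minimal sketch that installs them in one shot (it simply loops over the package names listed above; adjust the list if you do not need every daemon):
$ for d in namenode datanode secondarynamenode jobtracker tasktracker native; do sudo apt-get install -y hadoop-0.20-$d; done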
B) Changes in User Accounts and Groups in CDH3 Due to Security:
During CDH3 package installation, two new Unix user accounts called hdfs and mapred are automatically created to support security:
This User   Runs These Hadoop Programs
hdfs        NameNode, DataNodes, and Secondary NameNode
mapred      JobTracker and TaskTrackers
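You can verify that the installation created these accounts and the hadoop group with standard tools, for example:
$ getent passwd hdfs mapred
$ getent group hadoop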
C) Directory Ownership in Local File System
Directory          Owner           Permissions
dfs.name.dir       hdfs:hadoop     drwx------
dfs.data.dir       hdfs:hadoop     drwx------
mapred.local.dir   mapred:hadoop   drwxr-xr-x
D) Directory Ownership on HDFS:
The following directories on HDFS must also be configured as follows:
Directory            Owner           Permissions
mapred.system.dir    mapred:hadoop   drwx------
/ (root directory)   hdfs:hadoop     drwxr-xr-x
E) Configuration:
You will find all the configuration files in the directory "/etc/hadoop-0.20/conf/".
You need to configure the following properties in the respective configuration files.
1) In core-site.xml
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
    <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
  </property>
</configuration>
Create the directory "/mnt/hadoop/tmp" on the local file system and make sure its user and group are "mapred" and "hadoop" respectively.
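One way to create it with the required ownership (assuming sudo access; the path matches the hadoop.tmp.dir value above):
$ sudo mkdir -p /mnt/hadoop/tmp
$ sudo chown mapred:hadoop /mnt/hadoop/tmp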
2) In hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description> </description>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/mnt/hadoop/name</value>
    <description> </description>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/mnt/hadoop/data</value>
    <description> </description>
  </property>
</configuration>
Create the directories "/mnt/hadoop/name" and "/mnt/hadoop/data" on the local file system and make sure their user and group are "hdfs" and "hadoop" respectively.
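One way to create them with the ownership and drwx------ permissions listed in section C (assuming sudo access):
$ sudo mkdir -p /mnt/hadoop/name /mnt/hadoop/data
$ sudo chown hdfs:hadoop /mnt/hadoop/name /mnt/hadoop/data
$ sudo chmod 700 /mnt/hadoop/name /mnt/hadoop/data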
3) In mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
    <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/mnt/hadoop/map-red-local-dir</value>
    <description>Determines where temporary MapReduce data is written. It also may be a list of directories.</description>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/hadoop/system</value>
    <description> </description>
  </property>
</configuration>
Create the directory "/mnt/hadoop/map-red-local-dir" on the local file system and make sure its user and group are "mapred" and "hadoop" respectively.
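One way to create it with the ownership and drwxr-xr-x permissions listed in section C (assuming sudo access):
$ sudo mkdir -p /mnt/hadoop/map-red-local-dir
$ sudo chown mapred:hadoop /mnt/hadoop/map-red-local-dir
$ sudo chmod 755 /mnt/hadoop/map-red-local-dir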
Note that the "/hadoop/system" directory should be created in HDFS with user "mapred" and group "hadoop" (see Step 3 below).
F) Initial Setup:
Format the NameNode
Before starting the NameNode for the first time, you need to format the file system.
Make sure you format the NameNode as user hdfs:
$ sudo -u hdfs hadoop namenode -format
Step 1: Start HDFS
To start HDFS, start the NameNode, Secondary NameNode, and DataNode services.
On the NameNode:
$ sudo service hadoop-0.20-namenode start
On the Secondary NameNode:
$ sudo service hadoop-0.20-secondarynamenode start
While starting the Secondary NameNode you may see one of the following errors in its log file:
INFO org.apache.hadoop.hdfs.server.common.Storage: Storage directory /mnt/hadoop/tmp/dfs/namesecondary does not exist.
2012-05-04 06:03:36,155 INFO org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: SHUTDOWN_MSG:
--> Create this directory manually.
INFO org.apache.hadoop.hdfs.server.common.Storage: Cannot access storage directory /mnt/hadoop/tmp/dfs/namesecondary
2012-05-04 06:06:18,320 INFO org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: SHUTDOWN_MSG:
--> Check the permissions and grant the necessary permissions on the directory.
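One way to resolve both messages, assuming the Secondary NameNode runs as the hdfs user (see section B):
$ sudo mkdir -p /mnt/hadoop/tmp/dfs/namesecondary
$ sudo chown hdfs:hadoop /mnt/hadoop/tmp/dfs/namesecondary
Then start the Secondary NameNode service again.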
On the DataNode:
$ sudo service hadoop-0.20-datanode start
Step 2: Create the HDFS /tmp Directory
$ sudo -u hdfs hadoop fs -mkdir /tmp
$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
Step 3: Create and Configure the mapred.system.dir Directory in HDFS
After you start HDFS and before you start the JobTracker (see the next step), you must also create the HDFS directory specified by the mapred.system.dir parameter and configure it to be owned by the mapred user. In this guide, mapred.system.dir is set to "/hadoop/system".
$ sudo -u hdfs hadoop fs -mkdir /hadoop/system
$ sudo -u hdfs hadoop fs -chown mapred:hadoop /hadoop/system
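To confirm the ownership and permissions set so far, you can list the directories (your output may differ slightly):
$ sudo -u hdfs hadoop fs -ls /
$ sudo -u hdfs hadoop fs -ls /hadoop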
Step 4: Start MapReduce
On each TaskTracker system:
$ sudo service hadoop-0.20-tasktracker start
On the JobTracker system:
$ sudo service hadoop-0.20-jobtracker start
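To check that all five daemons came up, you can list the running Java processes (this assumes the jps tool from the Sun JDK is available) and inspect the daemon log files if anything is missing:
$ sudo jps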
Step 5: Run a MapReduce Job
Once all the setup is done, you can try running your Hadoop job as follows:
$ hadoop-0.20 jar your_job.jar input_path output_path
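As a concrete test, you can run the WordCount example that ships with the CDH3 packages; the jar path below is typical for the Debian packages but may differ on your system, and the input and output paths are only illustrative:
$ hadoop fs -mkdir /tmp/wordcount-input
$ hadoop fs -put /etc/hadoop-0.20/conf/core-site.xml /tmp/wordcount-input
$ hadoop-0.20 jar /usr/lib/hadoop-0.20/hadoop-examples.jar wordcount /tmp/wordcount-input /tmp/wordcount-output
$ hadoop fs -cat /tmp/wordcount-output/part*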