Share It Explore It!: Installing Cloudera CDH3 : Hadoop Distribution

We will see how to install Cloudera CDH3, Cloudera's Hadoop distribution, on an Ubuntu Lucid system.
This document currently covers installing CDH3 on a single-node system.

A) How to Install CDH:

1) First, download the CDH3 Debian package to a directory to which you have write access.

(For the download you can use this link: download CDH3 debian package here )

2) Go to the directory where the package was downloaded and install it from the command line using the following command.

$ sudo dpkg -i cdh3-repository_1.0_all.deb

3) Update the APT package index:

$ sudo apt-get update

4) Find and install the Hadoop core package using your favorite APT package manager, such as apt-get, aptitude, or dselect. For example:

 $ apt-cache search hadoop
 $ sudo apt-get install hadoop-0.20

The search command lists packages such as:

hadoop-0.20 - A software platform for processing vast amounts of data

hadoop-0.20-conf-pseudo - Pseudo-distributed Hadoop configuration

hadoop-0.20-datanode - Data Node for Hadoop

hadoop-0.20-doc - Documentation for Hadoop

hadoop-0.20-fuse - HDFS exposed over a Filesystem in Userspace

hadoop-0.20-jobtracker - Job Tracker for Hadoop

hadoop-0.20-namenode - Name Node for Hadoop

hadoop-0.20-native - Native libraries for Hadoop (e.g., compression)

hadoop-0.20-pipes - Interface to author Hadoop MapReduce jobs in C++

hadoop-0.20-sbin - Server-side binaries necessary for secured Hadoop clusters

hadoop-0.20-secondarynamenode - Secondary Name Node for Hadoop

hadoop-0.20-source - Source code for Hadoop

hadoop-0.20-tasktracker - Task Tracker for Hadoop

5) Install each type of daemon package on the appropriate machine. For example, install the NameNode package on your NameNode machine:

$ sudo apt-get install hadoop-0.20-<daemon type>

where <daemon type> is one of the following:

namenode
datanode
secondarynamenode
jobtracker
tasktracker
native
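
On a single-node system, as covered here, all five daemons run on the same machine, so the five daemon packages can be installed together. The sketch below builds that package list from the names above and prints the install command for review instead of running it. "native" is left out because the package list describes it as native libraries rather than a daemon. The variable names are only illustrative.

```shell
# Daemon types from the list above.
daemons="namenode datanode secondarynamenode jobtracker tasktracker"

# Build the full package list: hadoop-0.20-namenode hadoop-0.20-datanode ...
packages=""
for d in $daemons; do
  packages="$packages hadoop-0.20-$d"
done
packages="${packages# }"   # drop the leading space

# Print the command for review; run it by hand once it looks right.
echo "sudo apt-get install $packages"
```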

B) Changes in User Accounts and Groups in CDH3 Due to Security :
During CDH3 package installation, two new Unix user accounts, hdfs and mapred, are created automatically to support security:

This User        Runs These Hadoop Programs
hdfs             NameNode, DataNodes, and Secondary NameNode
mapred           JobTracker and TaskTrackers

C) Directory Ownership in Local File System

Directory               Owner                 Permissions
dfs.name.dir            hdfs:hadoop           drwx------
dfs.data.dir            hdfs:hadoop           drwx------
mapred.local.dir        mapred:hadoop         drwxr-xr-x
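
The Owner and Permissions columns are in the same form that stat can print, so a directory can be compared against the table directly. The following is a minimal sketch, assuming GNU stat (as shipped with Ubuntu). The function names dir_ownership and check_dir are made up for this example.

```shell
# Print "owner:group permissions" for a directory, in the same form as the table.
# Assumes GNU stat, which supports the -c format option.
dir_ownership() {
  stat -c '%U:%G %A' "$1"
}

# Compare a directory against the expected table entry and report any mismatch.
# $1 = directory, $2 = expected "owner:group permissions" string
check_dir() {
  actual="$(dir_ownership "$1")"
  if [ "$actual" = "$2" ]; then
    echo "OK       $1"
  else
    echo "MISMATCH $1: expected '$2', found '$actual'"
    return 1
  fi
}
```

For example, check_dir /mnt/hadoop/name "hdfs:hadoop drwx------" should report OK once the dfs.name.dir directory configured in section E exists with the ownership shown in the table.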

D) Directory Ownership on HDFS :

The following directories on HDFS must also be configured as follows:

Directory                 Owner                       Permissions
mapred.system.dir         mapred:hadoop               drwx------
/ (root directory)        hdfs:hadoop                 drwxr-xr-x

E) Configuration :

You will find all the configuration files in the directory "/etc/hadoop-0.20/conf/".

Configure the following properties in the respective configuration files.

1) In core-site.xml

<configuration>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/mnt/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

</configuration>

Create the directory "/mnt/hadoop/tmp" on the local file system and make sure its user and group are "mapred" and "hadoop" respectively.
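
After editing a file under /etc/hadoop-0.20/conf/, it can help to read a value back and confirm it is what you intended. The sketch below is a rough helper, not a real XML parser: it assumes each <value> line directly follows its <name> line, as in the snippets in this post. The function name get_property is hypothetical.

```shell
# Print the value of one property from a Hadoop *-site.xml file.
# Assumes <value> sits on the line immediately after the matching <name>.
# $1 = config file, $2 = property name
get_property() {
  sed -n "/<name>$2<\/name>/{n;s:.*<value>\(.*\)</value>.*:\1:p;}" "$1"
}
```

For example, get_property /etc/hadoop-0.20/conf/core-site.xml fs.default.name should print the URI configured above. It prints nothing if the property is absent or laid out differently.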

2) In hdfs-site.xml

<configuration>

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>  </description>
</property>


<property>
  <name>dfs.name.dir</name>
  <value>/mnt/hadoop/name</value>
  <description>  </description>
</property>

<property>
  <name>dfs.data.dir</name>
  <value>/mnt/hadoop/data</value>
  <description>  </description>
</property>


</configuration>

Create the directories "/mnt/hadoop/name" and "/mnt/hadoop/data" on the local file system and make sure their user and group are "hdfs" and "hadoop" respectively.

3) In mapred-site.xml

<configuration>

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

<property>
  <name>mapred.local.dir</name>
  <value>/mnt/hadoop/map-red-local-dir</value>
  <description>
  Determines where temporary MapReduce data is written. It also may be a list of directories
  </description>
</property>

<property>
  <name>mapred.system.dir</name>
  <value>/hadoop/system</value>
  <description>  </description>
</property>
</configuration>

Create the directory "/mnt/hadoop/map-red-local-dir" on the local file system and make sure its user and group are "mapred" and "hadoop" respectively.

Note that the "/hadoop/system" directory must be created in HDFS, not on the local file system, and its user and group should be "mapred" and "hadoop" respectively (see Step 3 under Initial Setup).
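
The local directories named in this section can be prepared in one pass. The sketch below only prints the commands so they can be reviewed first. The modes 700 and 755 are the octal forms of drwx------ and drwxr-xr-x from the table in section C. "/mnt/hadoop/tmp" is left out because no permission mode is listed for it. The function name local_dir_cmds is made up for this example.

```shell
# Print (not run) the commands that create one local directory with the
# given owner and mode.
# $1 = owner:group, $2 = octal mode, $3 = directory
local_dir_cmds() {
  owner="$1"; mode="$2"; dir="$3"
  echo "mkdir -p $dir"
  echo "chown $owner $dir"
  echo "chmod $mode $dir"
}

local_dir_cmds hdfs:hadoop   700 /mnt/hadoop/name
local_dir_cmds hdfs:hadoop   700 /mnt/hadoop/data
local_dir_cmds mapred:hadoop 755 /mnt/hadoop/map-red-local-dir
```

Once the printed commands look right, they can be applied by piping the output to "sudo sh".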

F) Initial Setup :

Format the NameNode

Before starting the NameNode for the first time you need to format the file system.
Make sure you format the NameNode as user hdfs.

$ sudo -u hdfs hadoop namenode -format
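
Formatting erases any existing NameNode metadata, so it is worth checking that the name directory is still empty before running the command above. The sketch below is a simple guard under one assumption: that a formatted NameNode keeps its metadata in a "current" subdirectory of dfs.name.dir. The function name safe_to_format is hypothetical. Run the check as a user that can read the directory (for example through sudo), since the directory is only readable by hdfs.

```shell
# Return 0 (safe) when the directory holds no NameNode metadata yet.
# Assumption: formatted metadata lives in "<dfs.name.dir>/current".
safe_to_format() {
  [ ! -d "$1/current" ]
}

name_dir="/mnt/hadoop/name"   # value of dfs.name.dir used in this post
if safe_to_format "$name_dir"; then
  echo "No existing metadata found in $name_dir"
else
  echo "Found $name_dir/current; formatting would erase the existing HDFS metadata"
fi
```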

Step 1: Start HDFS

To start HDFS, start the NameNode, Secondary NameNode, and DataNode services.

On the NameNode:

$ sudo service hadoop-0.20-namenode start

On the Secondary NameNode:

$ sudo service hadoop-0.20-secondarynamenode start

While starting the Secondary NameNode you may see one of the following errors in its log file.

INFO org.apache.hadoop.hdfs.server.common.Storage: Storage directory /mnt/hadoop/tmp/dfs/namesecondary does not exist.

2012-05-04 06:03:36,155 INFO org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: SHUTDOWN_MSG:

--> Create this directory manually.

INFO org.apache.hadoop.hdfs.server.common.Storage: Cannot access storage directory /mnt/hadoop/tmp/dfs/namesecondary

2012-05-04 06:06:18,320 INFO org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: SHUTDOWN_MSG:

--> Check the permissions on the directory and give the hdfs user, which runs the Secondary NameNode, the access it needs.
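
Both messages can be picked out of the log automatically. The sketch below scans a log file for the two errors quoted above and prints the affected directory with a suggested fix. The function name diagnose_snn_log is made up for this example, and the log file path is something you supply, since its location depends on your installation.

```shell
# Scan a Secondary NameNode log for the two storage-directory errors
# and print a suggested fix for each one found.
# $1 = path to the log file
diagnose_snn_log() {
  log="$1"
  missing="$(sed -n 's/.*Storage directory \(.*\) does not exist.*/\1/p' "$log" | tail -n 1)"
  denied="$(sed -n 's/.*Cannot access storage directory \(.*\)/\1/p' "$log" | tail -n 1)"
  if [ -n "$missing" ]; then
    echo "missing: $missing (try: sudo mkdir -p $missing && sudo chown hdfs:hadoop $missing)"
  fi
  if [ -n "$denied" ]; then
    echo "no access: $denied (try: sudo chown -R hdfs:hadoop $denied)"
  fi
  if [ -z "$missing" ] && [ -z "$denied" ]; then
    echo "no storage-directory errors found"
  fi
}
```

The suggested chown commands assume the directory should belong to hdfs:hadoop, because the hdfs user runs the Secondary NameNode.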

On the DataNode:

$ sudo service hadoop-0.20-datanode start

Step 2: Create the HDFS /tmp Directory

Once HDFS is up and running, create the /tmp directory and set its permissions to 1777 (drwxrwxrwt), as follows:

$ sudo -u hdfs hadoop fs -mkdir /tmp 

$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp

Step 3: Create and Configure the mapred.system.dir Directory in HDFS

After you start HDFS and before you start the JobTracker (see the next step), you must also create the HDFS directory specified by the mapred.system.dir parameter and configure it to be owned by the mapred user. The commands below use "/hadoop/system", the value set for mapred.system.dir in mapred-site.xml above.

$ sudo -u hdfs hadoop fs -mkdir /hadoop/system

$ sudo -u hdfs hadoop fs -chown mapred:hadoop /hadoop/system

Step 4: Start MapReduce

To start MapReduce, start the TaskTracker and JobTracker services.

On each TaskTracker system:

$ sudo service hadoop-0.20-tasktracker start

On the JobTracker system:

$ sudo service hadoop-0.20-jobtracker start

Step 5: Run MapReduce Job

Once the setup is complete, you can run your Hadoop job as follows:

$ hadoop-0.20 jar your_job.jar input_path output_path

Thursday, 3 May 2012
