Tuesday 15 May 2012

Caching in Cassandra 1.1

In Cassandra 1.1, cache tuning has been completely changed to make it easier to use and tune. Each key cache hit saves one seek, and each row cache hit saves at least two seeks, sometimes more. The key cache is fairly small for the amount of time it saves, so it is worthwhile to use it at large sizes. The row cache saves even more time, but it must store the entire contents of its rows, so it is extremely space-intensive. It is best to enable the row cache only if you have hot or static rows.

Cache configuration settings in "cassandra.yaml"

The main settings are key_cache_size_in_mb and row_cache_size_in_mb in cassandra.yaml.
Here is the section of the cassandra.yaml file that contains the caching-related settings.
# Maximum size of the key cache in memory
# Default value is empty to make it "auto" (min(5% of Heap (in MB), 100MB)). Set to 0 to disable key cache.
key_cache_size_in_mb:



# Duration in seconds after which Cassandra should
# save the key cache. Caches are saved to saved_caches_directory as
# specified in this configuration file.
#
# Saved caches greatly improve cold-start speeds, and is relatively cheap in
# terms of I/O for the key cache. Row cache saving is much more expensive and
# has limited use.
#
# Default is 14400 or 4 hours.
key_cache_save_period: 14400



# Number of keys from the key cache to save
# Disabled by default, meaning all keys are going to be saved
# key_cache_keys_to_save: 100



# Maximum size of the row cache in memory.
# NOTE: if you reduce the size, you may not get your hottest keys loaded on startup.
#
# Default value is 0, to disable row caching.
row_cache_size_in_mb: 0



# Duration in seconds after which Cassandra should
# save the row cache. Caches are saved to saved_caches_directory as specified
# in this configuration file.
#
# Saved caches greatly improve cold-start speeds, and is relatively cheap in
# terms of I/O for the key cache. Row cache saving is much more expensive and
# has limited use.
#
# Default is 0 to disable saving the row cache.
row_cache_save_period: 0



# Number of keys from the row cache to save
# Disabled by default, meaning all keys are going to be saved
# row_cache_keys_to_save: 100



# The provider for the row cache to use.
#
# Supported values are: ConcurrentLinkedHashCacheProvider, SerializingCacheProvider
#
# SerializingCacheProvider serialises the contents of the row and stores
# it in native memory, i.e., off the JVM Heap. Serialized rows take
# significantly less memory than "live" rows in the JVM, so you can cache
# more rows in a given memory footprint.  And storing the cache off-heap
# means you can use smaller heap sizes, reducing the impact of GC pauses.
#
# It is also valid to specify the fully-qualified class name to a class
# that implements org.apache.cassandra.cache.IRowCacheProvider.
#
# Defaults to SerializingCacheProvider
row_cache_provider: SerializingCacheProvider



# saved caches
saved_caches_directory: /var/lib/cassandra/saved_caches
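
To see how much the caches are actually helping at runtime, you can look at the cache statistics reported by nodetool. This is a quick check rather than part of the configuration, and the host name below is just an example; in Cassandra 1.1, nodetool info should report the size, capacity, hits, requests, and recent hit rate for both the key cache and the row cache.

$ nodetool -h localhost info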

How to Enable Per-Column-Family Caching in Cassandra 1.1

 

  1. Create keyspace using cassandra-cli


    create keyspace Keyspace1 with placement_strategy='SimpleStrategy' and strategy_options = {replication_factor:1};   
  2. Switch to the newly created keyspace


    use Keyspace1; 
  3.  Now you can create column families in this keyspace with various types of caching enabled

     
     create column family Standard1 with caching = 'ALL';

     create column family Standard2 with caching = 'KEYS_ONLY';

     create column family Standard3 with caching = 'ROWS_ONLY';

     create column family Standard4 with caching = 'NONE';
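
If you later want to change the caching mode of an existing column family, cassandra-cli's update command can be used for that. A minimal sketch, assuming the caching attribute is accepted the same way as at creation time, reusing Standard2 from above:

     update column family Standard2 with caching = 'ALL';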
     

Thursday 3 May 2012

Installing Cloudera CDH3: Hadoop Distribution


We will see how to install the Cloudera CDH3 Hadoop distribution on an Ubuntu Lucid system.
Currently this document describes the installation of CDH3 on a single-node system.

A) How to Install CDH:

1)  First of all, download the CDH3 Debian package to a directory to which you have write access.
(You can download the CDH3 Debian package from Cloudera's website.)

2)  Install the downloaded package from the command line using the following command (adjust the path to match where you saved the package):
sudo dpkg -i Downloads/cdh3-repository_1.0_all.deb

3)  Update the APT package index:
$ sudo apt-get update

4)  Find and install the Hadoop core package by using your favorite APT package manager, such as apt-get, aptitude, or dselect. For example:
 $ apt-cache search hadoop  
 $ sudo apt-get install hadoop-0.20  

You will see a list of packages like the following:
hadoop-0.20 - A software platform for processing vast amounts of data
hadoop-0.20-conf-pseudo - Pseudo-distributed Hadoop configuration
hadoop-0.20-datanode - Data Node for Hadoop
hadoop-0.20-doc - Documentation for Hadoop
hadoop-0.20-fuse - HDFS exposed over a Filesystem in Userspace
hadoop-0.20-jobtracker - Job Tracker for Hadoop
hadoop-0.20-namenode - Name Node for Hadoop
hadoop-0.20-native - Native libraries for Hadoop (e.g., compression)
hadoop-0.20-pipes - Interface to author Hadoop MapReduce jobs in C++
hadoop-0.20-sbin - Server-side binaries necessary for secured Hadoop clusters
hadoop-0.20-secondarynamenode - Secondary Name Node for Hadoop
hadoop-0.20-source - Source code for Hadoop
hadoop-0.20-tasktracker - Task Tracker for Hadoop

5)  Install each type of daemon package on the appropriate machine. For example, install the NameNode package on your NameNode machine:
$ sudo apt-get install hadoop-0.20-<daemon type>

where <daemon type> is one of the following (a single-node example installing all of them at once is shown after this list):
  • namenode
  • datanode
  • secondarynamenode
  • jobtracker
  • tasktracker
  • native
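
Since this post covers a single-node installation, all of the daemon packages can be installed on the same machine, for example:

$ sudo apt-get install hadoop-0.20-namenode hadoop-0.20-datanode hadoop-0.20-secondarynamenode hadoop-0.20-jobtracker hadoop-0.20-tasktracker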

B) Changes in User Accounts and Groups in CDH3 Due to Security:
During CDH3 package installation, two new Unix user accounts called hdfs and mapred are automatically created to support security:

This User        Runs These Hadoop Programs
hdfs             NameNode, DataNodes, and Secondary NameNode
mapred           JobTracker and TaskTrackers
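
You can confirm that both accounts exist with the standard id command, for example:

$ id hdfs
$ id mapred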

C) Directory Ownership in Local File System:

Directory               Owner              Permissions
dfs.name.dir            hdfs:hadoop        drwx------
dfs.data.dir            hdfs:hadoop        drwx------
mapred.local.dir        mapred:hadoop      drwxr-xr-x
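
Once the corresponding directories have been created (the concrete paths used in this post are set in the Configuration section below), you can verify the ownership and permissions with ls, for example:

$ ls -ld /mnt/hadoop/name /mnt/hadoop/data /mnt/hadoop/map-red-local-dir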

 

D) Directory Ownership on HDFS:

The following directories on HDFS must also be configured as follows:

Directory               Owner              Permissions
mapred.system.dir       mapred:hadoop      drwx------
/ (root directory)      hdfs:hadoop        drwxr-xr-x

 

E) Configuration:

You will find all the configuration files in the directory "/etc/hadoop-0.20/conf/".
You need to configure the following properties in the respective configuration files.

1)  In core-site.xml 

 

<configuration>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/mnt/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>

Create the directory "/mnt/hadoop/tmp" on the local file system and make sure its user and group are "mapred" and "hadoop" respectively.
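
For example, the directory can be created and given the ownership described above as follows:

$ sudo mkdir -p /mnt/hadoop/tmp
$ sudo chown mapred:hadoop /mnt/hadoop/tmp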

2)  In hdfs-site.xml


<configuration>

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>  </description>
</property>


<property>
  <name>dfs.name.dir</name>
  <value>/mnt/hadoop/name</value>
  <description>  </description>
</property>

<property>
  <name>dfs.data.dir</name>
  <value>/mnt/hadoop/data</value>
  <description>  </description>
</property>


</configuration>

Create the directories "/mnt/hadoop/name" and "/mnt/hadoop/data" on the local file system and make sure they have user "hdfs" and group "hadoop".
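
For example, following the local directory ownership table in section C (drwx------ for dfs.name.dir and dfs.data.dir):

$ sudo mkdir -p /mnt/hadoop/name /mnt/hadoop/data
$ sudo chown hdfs:hadoop /mnt/hadoop/name /mnt/hadoop/data
$ sudo chmod 700 /mnt/hadoop/name /mnt/hadoop/data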

3)  In mapred-site.xml


<configuration>

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

<property>
  <name>mapred.local.dir</name>
  <value>/mnt/hadoop/map-red-local-dir</value>
  <description>
  Determines where temporary MapReduce data is written. It also may be a list of directories
  </description>
</property>

<property>
  <name>mapred.system.dir</name>
  <value>/hadoop/system</value>
  <description>  </description>
</property>
</configuration>

Create the directory "/mnt/hadoop/map-red-local-dir" on the local file system and make sure its user and group are "mapred" and "hadoop" respectively.

Note that the "/hadoop/system" directory should be created in HDFS, and its user and group should be "mapred" and "hadoop" (see Step 3 of the Initial Setup section below).
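
For the local directory, for example:

$ sudo mkdir -p /mnt/hadoop/map-red-local-dir
$ sudo chown mapred:hadoop /mnt/hadoop/map-red-local-dir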

F) Initial Setup:

  • Format the NameNode 

Before starting the NameNode for the first time you need to format the file system.
Make sure you format the NameNode as user hdfs.

$ sudo -u hdfs hadoop namenode -format
  • Step 1: Start HDFS

To start HDFS, start the NameNode, Secondary NameNode, and DataNode services. 
On the NameNode:
$ sudo service hadoop-0.20-namenode start

On the Secondary NameNode:
$ sudo service hadoop-0.20-secondarynamenode start
  
While starting the secondarynamenode you may see the following errors in the log file.

INFO org.apache.hadoop.hdfs.server.common.Storage: Storage directory /mnt/hadoop/tmp/dfs/namesecondary does not exist.

2012-05-04 06:03:36,155 INFO org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: SHUTDOWN_MSG:
  

--> create this directory manually
 
INFO org.apache.hadoop.hdfs.server.common.Storage: Cannot access storage directory /mnt/hadoop/tmp/dfs/namesecondary

2012-05-04 06:06:18,320 INFO org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: SHUTDOWN_MSG:

--> check the permissions and grant the necessary permissions on the directory.
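
A minimal sketch of these two fixes, assuming the hadoop.tmp.dir configured earlier and that the Secondary NameNode runs as the hdfs user (as described in section B):

$ sudo mkdir -p /mnt/hadoop/tmp/dfs/namesecondary
$ sudo chown -R hdfs:hadoop /mnt/hadoop/tmp/dfs/namesecondary

Then start the secondarynamenode service again.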

On the DataNode:

$ sudo service hadoop-0.20-datanode start
  •  Step 2: Create the HDFS /tmp Directory

Once HDFS is up and running, create the /tmp directory and set its permissions to 1777 (drwxrwxrwt), as follows:

$ sudo -u hdfs hadoop fs -mkdir /tmp 

$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
  • Step 3: Create and Configure the mapred.system.dir Directory in HDFS

After you start HDFS and before you start the JobTracker (see the next step), you must also create the HDFS directory specified by the mapred.system.dir parameter and configure it to be owned by the mapred user. In this post mapred.system.dir is set to "/hadoop/system".

$ sudo -u hdfs hadoop fs -mkdir /hadoop/system

$ sudo -u hdfs hadoop fs -chown mapred:hadoop /hadoop/system 

  • Step 4: Start MapReduce

To start MapReduce, start the TaskTracker and JobTracker services.

On each TaskTracker system:

$ sudo service hadoop-0.20-tasktracker start

On the JobTracker system:

$ sudo service hadoop-0.20-jobtracker start
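
To verify that all of the daemons are running on the single-node setup, you can list the running Java processes with jps (part of the JDK), for example:

$ sudo jps

You should see NameNode, SecondaryNameNode, DataNode, JobTracker, and TaskTracker processes listed.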


  •  Step 5: Run a MapReduce Job

Once all the setup is done you can try running your Hadoop job as follows:

$ hadoop-0.20 jar your_job.jar input_path output_path
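
If you do not have a job jar of your own yet, the examples jar bundled with the hadoop-0.20 package can serve as a smoke test. The jar path below is an assumption and may differ between CDH3 updates, so check /usr/lib/hadoop-0.20/ for the exact name:

$ hadoop-0.20 jar /usr/lib/hadoop-0.20/hadoop-examples.jar pi 2 1000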