Thursday, March 28, 2013

Hadoop Map Reduce In Cluster Mode

Introduction 

      The purpose of this blog is to share my experience with Hadoop Map Reduce. I am a newbie to distributed file systems, or to be more precise, a new Hadoop user.
      

Steps to configure Hadoop Map Reduce in Cluster Mode 

      Following are the steps to configure it. (The steps below include the Unix file paths from my own machines; adjust them to your setup. The configuration has to be done on the master and all slave nodes.)

1) Install Java if it is not already installed.
    - sudo apt-get install sun-java6-jdk
2) If you don't have the Hadoop tar file, please download it from http://hadoop.apache.org/releases.html
3) Untar the file with the following command (see the sketch below).
    - sudo tar xzf hadoop-<version>.tar.gz
    In my case it was extracted under '/home/hadoopPoc/hadoop-1.0.3'.
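
For illustration, this step on my machine boils down to the following (the archive name and target directory are just my values; yours may differ):

    cd /home/hadoopPoc
    sudo tar xzf hadoop-1.0.3.tar.gz      # extract the Hadoop distribution
    ls hadoop-1.0.3/conf                  # the configuration files edited in the next steps live here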
4) Now edit the file 'hadoop-env.sh' in the conf directory and uncomment/modify the JAVA_HOME variable. To do this, follow the steps below:
    -  Run 'echo $JAVA_HOME' and copy the output (in my case this value is '/usr/java/jdk1.6.0_21').
    -  Run 'vi hadoop-env.sh' (in my case this file is under '/home/hadoopPoc/hadoop-1.0.3/conf/').
    -  Uncomment or modify the statement 'export JAVA_HOME=' and paste in the output of 'echo $JAVA_HOME' (see the example below).
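
For reference, after the edit the relevant line in my hadoop-env.sh reads roughly as follows (use whatever path 'echo $JAVA_HOME' printed on your machine):

    # The java implementation to use.  Required.
    export JAVA_HOME=/usr/java/jdk1.6.0_21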
5) Edit the core-site.xml file in the conf directory (in my case this file is in '/home/hadoopPoc/hadoop-1.0.3/conf/') and add the following configuration.


<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://<namenode IP>:<port></value>
    <description>This is the namenode URI</description>
  </property>
</configuration>
   - In my case the namenode IP was the IP address of 'Node A' in the cluster shown at the end of this post (a filled-in example follows).
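
As an illustration only (the IP address and port below are placeholders, not my actual values), a filled-in core-site.xml would look like this:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.1.10:9000</value>
    <description>This is the namenode URI</description>
  </property>
</configuration>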
6) Edit the 'hdfs-site.xml' file in the conf directory and add the following configuration.


<configuration>
  <property>
    <name>dfs.replication</name>
    <value>7</value>
    <description>Default block replication.
    The actual number of replications can be specified when the file
    is created. The default is used if replication is not specified in
    create time.
    </description>
  </property>
</configuration>
   - In my case I had 7 datanodes in the cluster, so I set 'dfs.replication' to 7.
7) Edit the 'mapred-site.xml' file in the conf directory and add the following configuration.


<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value><Job Tracker Node>:<port></value>
    <description>The host and port that the MapReduce job tracker runs
    at.  If "local", then jobs are run in-process as a single map and reduce
    task.
    </description>
  </property>
  <property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>10</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>7</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>7</value>
  </property>
  <property>
    <name>mapred.task.timeout</name>
    <value>1800000</value>
  </property>
</configuration>
    - In my case the 'Job Tracker Node' is my Name Node itself and the port is '8001' (see the example below).
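
Again purely for illustration (the IP address below is a placeholder for my Name Node's address), the job tracker property would then read:

<property>
  <name>mapred.job.tracker</name>
  <value>192.168.1.10:8001</value>
  <description>The host and port that the MapReduce job tracker runs at.</description>
</property>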
8) Edit the 'masters' file and add the IP address of the master node. The masters file is under the conf directory.
9) Edit the 'slaves' file and add the IP addresses of the slave nodes. The slaves file is under the conf directory (example contents for both files are sketched below).
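
Purely as an example (the IP addresses are placeholders for my actual node addresses), the two files could contain something like:

conf/masters:
    192.168.1.10

conf/slaves:
    192.168.1.11
    192.168.1.12
    192.168.1.13
    (one IP address per line, one line per slave node)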
10) Run the 'ssh' command on the master node and on the slave nodes; this adds the other nodes' host keys to the ~/.ssh/known_hosts file of the node you run it from (Hadoop's start scripts use SSH to reach the slaves). If you use hostnames rather than IP addresses, also make sure they resolve on every node, e.g. via /etc/hosts. To do this, follow the steps below:
   -  Open a session to a node from the masters/slaves files in PuTTY or any similar SSH client.
   -  Run the command 'ssh <slave IP address>' and accept the host key when prompted.
   -  Do the above two steps for all the nodes (a short sketch follows).
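
A minimal sketch of this step, run from the master node (the IP addresses are placeholders; set up passwordless SSH separately, e.g. with ssh-keygen/ssh-copy-id, if your nodes prompt for a password):

    ssh 192.168.1.11    # answer 'yes' to the host-key prompt, then type 'exit'
    ssh 192.168.1.12
    ssh 192.168.1.13
    # repeat for every node listed in the masters and slaves files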
11) Format HDFS. To do so, run the command:
     -  'hadoop namenode -format' (in my case I executed the script from the '/home/ppoker/hadoopPoc/hadoop-1.0.3/bin' directory).
12) Start the Hadoop DFS daemons. Run the 'start-dfs.sh' script, which is under the bin directory.

13) Start Hadoop Map Reduce. Run the 'start-mapred.sh' script, which is under the bin directory. (Steps 11-13 are combined in the sketch below.)
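
Putting steps 11-13 together, the commands run on the Name Node look roughly like this (the path is from my setup):

    cd /home/hadoopPoc/hadoop-1.0.3/bin
    ./hadoop namenode -format    # format HDFS; only needed once, before the first start
    ./start-dfs.sh               # starts the NameNode, SecondaryNameNode and the DataNodes on the slaves
    ./start-mapred.sh            # starts the JobTracker and the TaskTrackers on the slaves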

14) Run the command 'jps' on the Name Node. The following processes will be listed:

19707 SecondaryNameNode
19901 JobTracker
20048 Jps
19527 NameNode
15) Run the command 'jps' on the Data Nodes. The following processes will be listed:

20564 DataNode
20872 Jps
20770 TaskTracker
16) The Hadoop Map Reduce configuration is now complete. You can view the web-based interface of the Name Node at http://<NameNode>:50070/ and track Map Reduce jobs at http://<NameNode>:50030/jobtracker.jsp (a quick smoke test is sketched below).
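
To confirm that the cluster actually runs jobs, one option (a sketch, assuming the examples jar shipped with Hadoop 1.0.3; the jar name differs between versions) is to submit the bundled pi estimator:

    cd /home/hadoopPoc/hadoop-1.0.3
    bin/hadoop jar hadoop-examples-1.0.3.jar pi 10 100    # 10 map tasks, 100 samples each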

Below is the cluster that I have created for Map Reduce.


What's NEXT!!

      In my next post I will try to come up with an implementation of Map Reduce and explain a use case where I have used it.


Please add your valuable feedback about this post; it will help drive the next post about the Map Reduce implementation.

Thanks!!!