Hadoop - Setup

Updated: 2019-01-03

Pre-Installation

Create group

$ sudo addgroup hadoop

Create user

$ sudo adduser --ingroup hadoop hadoop

Switch to the hadoop user

$ su - hadoop

Show groups of current user

$ groups

Show groups of specific user

$ groups root

Set up passwordless SSH; otherwise Hadoop will ask you for a password every time you start or stop the services.

By default, the SSH server is not installed in Ubuntu.

$ sudo apt-get install ssh

Generate an SSH key pair with an empty passphrase

$ ssh-keygen -t rsa -P ""

Add the key to authorized_keys

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Try logging in; it should not prompt for a password

$ ssh localhost
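
If ssh still prompts for a password, the permissions on the key files may be too open (sshd ignores keys it considers unsafe); tightening them usually helps:

$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys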

Install Java

Download JDK from Oracle
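
A minimal sketch of a tarball install, assuming you downloaded a .tar.gz JDK; the file name and install directory are placeholders to adjust:

$ sudo tar xzf jdk-<version>.tar.gz -C /usr/local
$ /usr/local/jdk-<version>/bin/java -version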

Install Hadoop

Download Hadoop
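
One way to fetch a release, assuming the Apache archive layout (substitute the version you want):

$ wget https://archive.apache.org/dist/hadoop/core/hadoop-<version>/hadoop-<version>.tar.gz
$ sudo mv hadoop-<version>.tar.gz /usr/local/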

$ cd /usr/local
$ sudo tar xzf hadoop-<version>.tar.gz
$ sudo chown -R hadoop:hadoop hadoop-<version>

Update .bashrc

export JAVA_HOME=/path/to/jdk
export PATH=${JAVA_HOME}/bin:${PATH}

export HADOOP_HOME=/path/to/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
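
Reload the shell configuration and verify that hadoop is on the PATH:

$ source ~/.bashrc
$ hadoop version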

Examples

As a smoke test, run the bundled grep example: it extracts every string in the input files that matches the given regex and counts the occurrences of each match.

$ mkdir input
$ cp conf/*.xml input
$ hadoop jar hadoop-examples-<version>.jar grep input output 'security[a-z.]+'
$ cat output/*
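
The remaining steps assume pseudo-distributed operation (note the localhost:9000 address in the error below), which requires a few entries in the conf/*.xml files. This is a sketch of the standard single-node values; adjust ports and paths to taste.

conf/core-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

conf/hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

conf/mapred-site.xml:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>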

Format the namenode

$ hadoop namenode -format

Start the Hadoop daemons

$ bin/start-all.sh

Check Hadoop processes

$ jps
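
If everything started, jps should list all five daemons (NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker) plus jps itself; the process IDs below are illustrative:

1234 NameNode
1345 DataNode
1456 SecondaryNameNode
1567 JobTracker
1678 TaskTracker
1789 Jps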

If you receive this error message

$ hadoop fs -ls
11/06/23 14:30:44 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 0 time(s).
11/06/23 14:30:45 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 1 time(s).
...

then the namenode is not running, probably because it was never formatted.

You may also have to change the ownership of /usr/local/hadoop so that the hadoop user owns it:
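
$ sudo chown -R hadoop:hadoop /usr/local/hadoop-<version>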

Check health

  • http://localhost:50070/dfshealth.jsp
  • http://localhost:50075/blockScannerReport

Masters/Slaves (Plain Text)

  • masters: address of the secondary namenode
  • slaves: addresses of the datanodes and tasktrackers
  • The namenode and jobtracker run on the local machine, i.e. the machine where we call the scripts; see the example after this list.
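
For example, with hypothetical hostnames (one hostname per line; replace with your own machines):

$ cat conf/masters
secondary.example.com

$ cat conf/slaves
worker1.example.com
worker2.example.com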

These two plain-text files are read by the following scripts:

start-dfs.sh

  1. Starts a namenode on the local machine (the machine that the script is run on)
  2. Starts a datanode on each machine listed in the slaves file
  3. Starts a secondary namenode on each machine listed in the masters file

stop-dfs.sh shuts them down.

start-mapred.sh

  1. Starts a jobtracker on the local machine
  2. Starts a tasktracker on each machine listed in the slaves file

stop-mapred.sh shuts them down.
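
To shut everything down in one step, the counterpart of the start-all.sh used earlier:

$ bin/stop-all.sh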

The namenode needs a large amount of memory to hold all the filesystem metadata. The secondary namenode needs about the same amount, since it periodically merges the namenode's edit log into a new checkpoint of that metadata.

For this reason, the namenode and the secondary namenode usually run on separate machines.