Hadoop - Setup
Pre-Installation
Create group
$ sudo addgroup hadoop
Create user
$ sudo adduser --ingroup hadoop hadoop
Switch to the hadoop user
$ su - hadoop
Show groups of current user
$ groups
Show groups of specific user
$ groups root
Set up passwordless SSH; otherwise you will be prompted for a password every time you start or stop the services.
By default, SSH is not installed on Ubuntu.
$ sudo apt-get install ssh
Generate an SSH key pair with an empty passphrase
$ ssh-keygen -t rsa -P ""
Add the key to authorized_keys
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Test the connection
$ ssh localhost
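If the key setup is repeated, the plain `>>` append adds the same key to authorized_keys again. The sketch below is one way to make that step repeatable. The function name authorize_key is hypothetical, and the chmod values assume the usual sshd behaviour of rejecting key files that other users can write to.

```shell
# authorize_key PUB_KEY_FILE SSH_DIR
# Hypothetical helper (not part of OpenSSH): appends the public key to
# SSH_DIR/authorized_keys only if that exact line is not already present,
# then restricts permissions on the directory and the file.
authorize_key() {
    pub_key_file=$1                       # e.g. ~/.ssh/id_rsa.pub
    ssh_dir=$2                            # e.g. ~/.ssh
    auth_file="$ssh_dir/authorized_keys"

    mkdir -p "$ssh_dir"
    touch "$auth_file"
    # -q quiet, -x whole-line match, -F fixed string (no regex)
    if ! grep -qxF -e "$(cat "$pub_key_file")" "$auth_file"; then
        cat "$pub_key_file" >> "$auth_file"
    fi
    chmod 700 "$ssh_dir"
    chmod 600 "$auth_file"
}

# Usage: authorize_key ~/.ssh/id_rsa.pub ~/.ssh
```

Calling it twice with the same key leaves a single entry in authorized_keys.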
Install Java
Download JDK from Oracle
Install Hadoop
Download Hadoop
$ cd /usr/local
$ sudo tar xzf hadoop-<version>.tar.gz
$ sudo chown -R hadoop:hadoop hadoop-<version>
Update .bashrc
export JAVA_HOME=/path/to/jdk
export PATH=${JAVA_HOME}/bin:${PATH}
export HADOOP_HOME=/path/to/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
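After reloading .bashrc, it can help to confirm that both variables point at real installations before going further. This is a minimal sketch: check_hadoop_env is a hypothetical helper, and it assumes the launchers live at bin/java and bin/hadoop under the two directories.

```shell
# check_hadoop_env
# Hypothetical helper: returns 0 if JAVA_HOME and HADOOP_HOME each contain
# an executable launcher, otherwise prints what is missing and returns 1.
check_hadoop_env() {
    status=0
    if [ ! -x "${JAVA_HOME:-}/bin/java" ]; then
        echo "JAVA_HOME has no executable bin/java: '${JAVA_HOME:-}'" >&2
        status=1
    fi
    if [ ! -x "${HADOOP_HOME:-}/bin/hadoop" ]; then
        echo "HADOOP_HOME has no executable bin/hadoop: '${HADOOP_HOME:-}'" >&2
        status=1
    fi
    return "$status"
}

# Usage: check_hadoop_env && echo "environment looks usable"
```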
Examples (run from the Hadoop installation directory)
$ mkdir input
$ cp conf/*.xml input
$ hadoop jar hadoop-examples-<version>.jar grep input output 'security[a-z.]+'
$ cat output/*
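The grep example extracts every string that matches the regular expression and counts how often each one occurs. To sanity-check the result, roughly the same counts can be produced locally with standard tools. The function local_grep_count is a hypothetical helper, and the "count, tab, match" layout is only an approximation of the job's output.

```shell
# local_grep_count DIR REGEX
# Hypothetical helper: counts each distinct regex match across the files
# in DIR and prints "count<TAB>match", most frequent first.
local_grep_count() {
    # -o print only the matched text, -h omit file names, -E extended regex
    grep -ohE -e "$2" "$1"/* \
        | sort \
        | uniq -c \
        | sort -k1,1nr -k2 \
        | awk '{ print $1 "\t" $2 }'
}

# Usage: local_grep_count input 'security[a-z.]+'
```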
Format the namenode
$ hadoop namenode -format
Start the Hadoop daemons
$ bin/start-all.sh
Check Hadoop processes
$ jps
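On a single-node setup started this way, jps would normally list NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker. The sketch below reports which of them are absent. The function missing_daemons is a hypothetical helper, and the daemon list is an assumption that fits the namenode/jobtracker layout described in these notes.

```shell
# missing_daemons
# Hypothetical helper: reads `jps` output ("<pid> <main class>") on stdin
# and prints the name of every expected daemon that is not listed.
missing_daemons() {
    jps_output=$(cat)
    for daemon in NameNode SecondaryNameNode DataNode JobTracker TaskTracker; do
        # Compare the second field exactly, so NameNode does not match
        # SecondaryNameNode.
        if ! printf '%s\n' "$jps_output" \
            | awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            echo "$daemon"
        fi
    done
}

# Usage: jps | missing_daemons
```

No output means all five daemons were found.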
If you receive this error message:
$ hadoop fs -ls
11/06/23 14:30:44 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 0 time(s).
11/06/23 14:30:45 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 1 time(s).
...
then the namenode is not running, probably because it has not been formatted.
You may also have to change the ownership of /usr/local/hadoop so that the hadoop user can write to it.
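A quick ownership check can rule out the second cause before reformatting anything. This is a minimal sketch: owned_by_me is a hypothetical helper that compares the numeric owner of a path with the current user ID.

```shell
# owned_by_me PATH
# Hypothetical helper: succeeds if PATH exists and is owned by the
# user running the check.
owned_by_me() {
    # ls -ldn prints the numeric owner ID in the third column
    owner_id=$(ls -ldn -- "$1" 2>/dev/null | awk '{ print $3 }')
    [ "$owner_id" = "$(id -u)" ]
}

# Usage (as the hadoop user):
# owned_by_me /usr/local/hadoop || echo "ownership needs fixing"
```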
Check health
http://localhost:50070/dfshealth.jsp
http://localhost:50075/blockScannerReport
Masters/Slaves (plain-text files)
- masters: addresses of the secondary namenodes
- slaves: addresses of the datanodes and tasktrackers
- The namenode and jobtracker run on the local machine, i.e. the machine where the scripts are called.
These two text files are read by the following scripts:
start-dfs.sh
- Starts a namenode on the local machine (the machine that the script is run on)
- Starts a datanode on each machine listed in the slaves file
- Starts a secondary namenode on each machine listed in the masters file
stop-dfs.sh shuts them down.
start-mapred.sh
- Starts a jobtracker on the local machine
- Starts a tasktracker on each machine listed in the slaves file
stop-mapred.sh shuts them down.
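The mapping from the two files to daemons can be sketched as a dry run that starts nothing. The names list_hosts and dfs_plan are hypothetical helpers, and the sketch assumes each file lists one host per line, with "#" starting a comment.

```shell
# list_hosts FILE
# Prints the hosts in a masters/slaves file, one per line, dropping
# comments, trailing whitespace and blank lines.
list_hosts() {
    sed -e 's/#.*$//' -e 's/[[:space:]]*$//' -e '/^$/d' "$1"
}

# dfs_plan CONF_DIR
# Hypothetical dry run: prints which daemon start-dfs.sh would start on
# which machine, following the description above.
dfs_plan() {
    echo "namenode $(uname -n)"          # local machine
    list_hosts "$1/slaves" | while read -r host; do
        echo "datanode $host"
    done
    list_hosts "$1/masters" | while read -r host; do
        echo "secondarynamenode $host"
    done
}

# Usage: dfs_plan "$HADOOP_HOME/conf"
```

The same approach works for start-mapred.sh by printing a jobtracker for the local machine and a tasktracker per slaves entry.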
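The namenode needs a large amount of memory because it keeps all the filesystem metadata in memory. The secondary namenode needs roughly the same amount, since it holds a copy of that metadata while checkpointing.
For this reason, the namenode and secondary namenode usually run on separate machines.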