Hadoop - Configuration

Updated: 2019-01-03

Configuration Files

Hadoop uses its built-in default settings unless a property is overridden in the corresponding site file.

Config files can be found in either $HADOOP_HOME/conf or $HADOOP_CONF_DIR:

  • yarn-site.xml
  • core-site.xml
  • hdfs-site.xml
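The override rule can be illustrated with a small sketch. This is not Hadoop's own loader (Hadoop does this internally in Java); it is a minimal Python model that assumes each file is a <configuration> element containing <property> entries. The names load_properties, effective_config, DEFAULT_XML and SITE_XML are hypothetical, and the XML strings are trimmed stand-ins for real files.

```python
import xml.etree.ElementTree as ET


def load_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style config document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value", default="")
        if name is not None:
            props[name.strip()] = value.strip()
    return props


def effective_config(default_xml, site_xml):
    """Site values win over defaults; defaults that are not overridden are kept."""
    merged = load_properties(default_xml)
    merged.update(load_properties(site_xml))
    return merged


# Hypothetical file contents, trimmed to properties discussed on this page.
DEFAULT_XML = """<configuration>
  <property><name>fs.default.name</name><value>file:///</value></property>
  <property><name>hadoop.tmp.dir</name><value>/tmp/hadoop-${user.name}</value></property>
</configuration>"""

SITE_XML = """<configuration>
  <property><name>fs.default.name</name><value>hdfs://localhost:54310</value></property>
</configuration>"""

config = effective_config(DEFAULT_XML, SITE_XML)
print(config["fs.default.name"])  # hdfs://localhost:54310
print(config["hadoop.tmp.dir"])   # /tmp/hadoop-${user.name}
```

Only fs.default.name appears in the site file, so it is the only value that changes; hadoop.tmp.dir keeps its default.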

Core

fs.default.name

The default is the local file system, so do not be surprised if $ hadoop fs -ls lists your local files::

<property>
    <name>fs.default.name</name>
    <value>file:///</value>
</property>

Point it at HDFS in core-site.xml::

<property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
</property>
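The two values above are ordinary URIs, so the difference between them can be inspected with a generic URI parser. The following is a rough sketch using Python's standard library, not Hadoop code; describe_default_fs is a hypothetical helper name.

```python
from urllib.parse import urlparse


def describe_default_fs(value):
    """Split an fs.default.name value into (scheme, host, port)."""
    uri = urlparse(value)
    return uri.scheme, uri.hostname, uri.port


print(describe_default_fs("file:///"))                # ('file', None, None)
print(describe_default_fs("hdfs://localhost:54310"))  # ('hdfs', 'localhost', 54310)
```

The file scheme has no host or port, which is why commands operate on the local disk; the hdfs scheme names the namenode's host and port.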

fs.trash.interval

Enable the trash bin, which is disabled by default (value 0). The value is the retention time in minutes (1440 min = 24 hr)::

<property>
    <name>fs.trash.interval</name>
    <value>1440</value>
</property>
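As a quick check of the unit conversion, the sketch below turns an fs.trash.interval value into a duration. It assumes only what is stated above (minutes, with 0 meaning disabled); trash_retention is a hypothetical helper, not part of Hadoop.

```python
from datetime import timedelta


def trash_retention(minutes):
    """Return the retention period for an fs.trash.interval value, or None when it is 0 (disabled)."""
    if minutes == 0:
        return None
    return timedelta(minutes=minutes)


print(trash_retention(1440))  # 1 day, 0:00:00
print(trash_retention(0))     # None
```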

hadoop.tmp.dir

The default temporary directory is /tmp/hadoop-${user.name}, as set in core-default.xml::

<property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-${user.name}</value>
</property>
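The ${user.name} part of the default is a variable reference that gets replaced when the value is read. The sketch below models that substitution in Python under simplifying assumptions: values are looked up in a plain dictionary, unknown names are left untouched, and max_depth is an arbitrary guard against circular references. The function expand and the login name "hadoop" are hypothetical.

```python
import re

_VAR = re.compile(r"\$\{([^}]+)\}")


def expand(value, props, max_depth=20):
    """Replace ${name} references in value using props; unknown names are left as-is."""
    for _ in range(max_depth):
        expanded = _VAR.sub(lambda m: props.get(m.group(1), m.group(0)), value)
        if expanded == value:
            break  # nothing left to substitute
        value = expanded
    return value


props = {
    "user.name": "hadoop",  # hypothetical login name
    "hadoop.tmp.dir": "/tmp/hadoop-${user.name}",
}
print(expand(props["hadoop.tmp.dir"], props))  # /tmp/hadoop-hadoop
```

Each user therefore gets a separate temporary directory unless the property is overridden with a fixed path.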

Add the following to core-site.xml to override it::

<property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop</value>
</property>

HDFS

default: $HADOOP_HOME/src/hdfs/hdfs-default.xml

site: $HADOOP_HOME/conf/hdfs-site.xml

dfs.name.dir / dfs.data.dir

Set the directories used by the namenode and the datanode. By default these are ${hadoop.tmp.dir}/dfs/name and ${hadoop.tmp.dir}/dfs/data. To store them elsewhere, override both properties::

<property>
    <name>dfs.name.dir</name>
    <value>/home/hadoop/dfs/name</value>
</property>

<property>
    <name>dfs.data.dir</name>
    <value>/home/hadoop/dfs/data</value>
</property>
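Because both defaults are derived from hadoop.tmp.dir, changing that one property moves both directories unless they are set explicitly as above. A minimal sketch of the default paths, with default_dfs_dirs as a hypothetical helper name:

```python
import posixpath


def default_dfs_dirs(hadoop_tmp_dir):
    """Return the default namenode and datanode directories for a given hadoop.tmp.dir."""
    return {
        "dfs.name.dir": posixpath.join(hadoop_tmp_dir, "dfs", "name"),
        "dfs.data.dir": posixpath.join(hadoop_tmp_dir, "dfs", "data"),
    }


print(default_dfs_dirs("/tmp/hadoop"))
# {'dfs.name.dir': '/tmp/hadoop/dfs/name', 'dfs.data.dir': '/tmp/hadoop/dfs/data'}
```

With the defaults, HDFS metadata and blocks live under /tmp, which many systems clear on reboot; that is one common reason to set explicit directories.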

dfs.replication

Set it to 1 for a pseudo-distributed (single-node) cluster, which has only one datanode to hold each block::

<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
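The reason for the value 1 can be shown with a deliberately simplified model: HDFS stores at most one replica of a block per datanode, so a single datanode can never satisfy the default replication factor of 3. The function names below are hypothetical and the model ignores everything except the node count.

```python
def live_replicas(replication, datanodes):
    """Replicas that can actually be placed: at most one copy of a block per datanode."""
    return min(replication, datanodes)


def is_under_replicated(replication, datanodes):
    """True when the cluster cannot hold the requested number of replicas."""
    return live_replicas(replication, datanodes) < replication


print(is_under_replicated(3, 1))  # True
print(is_under_replicated(1, 1))  # False
```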

MapReduce

default: $HADOOP_HOME/src/mapred/mapred-default.xml

site: $HADOOP_HOME/conf/mapred-site.xml

mapred.job.tracker

The host and port at which the MapReduce job tracker runs. The default is "local", which means jobs run in-process as a single map task and a single reduce task::

<property>
    <name>mapred.job.tracker</name>
    <value>local</value>
</property>

Set it to a host and port in mapred-site.xml, here on localhost::

<property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
</property>
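The property thus accepts two shapes of value: the literal "local" or a host:port pair. A small sketch of how a script might tell them apart, assuming only those two forms; parse_job_tracker is a hypothetical helper, not a Hadoop API.

```python
def parse_job_tracker(value):
    """Return None for "local", otherwise a (host, port) tuple."""
    if value == "local":
        return None
    host, sep, port = value.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected 'local' or host:port, got {value!r}")
    return host, int(port)


print(parse_job_tracker("local"))            # None
print(parse_job_tracker("localhost:54311"))  # ('localhost', 54311)
```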

hadoop-env.sh

Remember to set JAVA_HOME in this file, for example with a line of the form export JAVA_HOME=<path to your JDK>.