
Hadoop - Configuration

Updated: 2022-08-06

Configuration Files

Hadoop uses its default settings unless they are overridden in the site configuration files.

The config files live in either $HADOOP_HOME/conf or $HADOOP_CONF_DIR:

  • yarn-site.xml
  • core-site.xml
  • hdfs-site.xml

Core

fs.default.name

The default is the local file system, so do not be surprised if you see your local files when calling $ hadoop fs -ls:

<property>
    <name>fs.default.name</name>
    <value>file:///</value>
</property>

Set it to HDFS instead (in Hadoop 2.x and later the property is named fs.defaultFS; the old name still works as a deprecated alias):

<property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
</property>

fs.trash.interval

Enable the trash bin, which is disabled by default. The value is in minutes (1440 min = 24 hr):

<property>
    <name>fs.trash.interval</name>
    <value>1440</value>
</property>
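With trash enabled, hadoop fs -rm moves files into the per-user .Trash directory instead of deleting them outright, and -expunge empties it. Since the value is in minutes, 1440 is one day; a quick sanity check plus the cluster commands (the cluster lines are illustrative, not runnable without HDFS):

```shell
# fs.trash.interval is expressed in minutes; 1440 min = 24 h
interval_min=1440
echo "$(( interval_min / 60 )) hours"   # prints "24 hours"

# On a running cluster (needs HDFS):
#   hadoop fs -rm /user/hadoop/old.log   # moved to .Trash, recoverable
#   hadoop fs -expunge                   # empty the trash immediately
```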

hadoop.tmp.dir

The default temp folder is /tmp/hadoop-${user.name}:

<property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-${user.name}</value>
</property>

Add the following to the site file to override it:

<property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop</value>
</property>
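All the overrides on this page share the same <property> shape, so when scripting site-file setup it can be handy to generate them. A minimal sketch; hadoop_property is a hypothetical helper, not part of Hadoop:

```shell
# Hypothetical helper: emit a Hadoop <property> block for a name/value pair.
hadoop_property() {
    printf '<property>\n    <name>%s</name>\n    <value>%s</value>\n</property>\n' "$1" "$2"
}

# Reproduces the hadoop.tmp.dir override above
hadoop_property hadoop.tmp.dir /tmp/hadoop
```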

HDFS

default: $HADOOP_HOME/src/hdfs/hdfs-default.xml

site: $HADOOP_HOME/conf/hdfs-site.xml

dfs.name.dir / dfs.data.dir

Set the storage folders for the namenode and datanodes. By default ${hadoop.tmp.dir}/dfs/name and ${hadoop.tmp.dir}/dfs/data are used; point them at other folders if you want:

<property>
    <name>dfs.name.dir</name>
    <value>/home/hadoop/dfs/name</value>
</property>

<property>
    <name>dfs.data.dir</name>
    <value>/home/hadoop/dfs/data</value>
</property>
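These directories should exist and be writable by the Hadoop user before the namenode is formatted. A sketch that stages them under a temp root (swap in the real dfs.name.dir / dfs.data.dir values from above):

```shell
# Demo root instead of the real /home/hadoop paths above
root="${TMPDIR:-/tmp}/hadoop-dfs-demo"
mkdir -p "$root/dfs/name" "$root/dfs/data"
ls "$root/dfs"

# On the real cluster, format the namenode once after setting the paths:
#   hdfs namenode -format
```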

dfs.replication

Set it to 1 for a pseudo-distributed cluster (the default is 3):

<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
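Replication multiplies raw disk usage, which is why 1 is enough on a single node. It can also be changed per path with -setrep; the path below is hypothetical:

```shell
# Raw usage for 10 GB of data at the default replication factor of 3
data_gb=10
replication=3
echo "$(( data_gb * replication )) GB raw"   # prints "30 GB raw"

# On a cluster: lower replication for an existing path and wait (-w)
#   hadoop fs -setrep -w 1 /user/hadoop/input
```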

MapReduce

default: $HADOOP_HOME/src/mapred/mapred-default.xml

site: $HADOOP_HOME/conf/mapred-site.xml

mapred.job.tracker

The host and port that the MapReduce JobTracker runs at. The default is "local", meaning jobs are run in-process as a single map and reduce task:

<property>
    <name>mapred.job.tracker</name>
    <value>local</value>
</property>

Set it to localhost in the site file (in Hadoop 2.x with YARN, the JobTracker is gone and mapreduce.framework.name is used instead):

<property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
</property>

hadoop-env.sh

Remember to set JAVA_HOME in this file.
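For example (the JDK path below is an assumption; point it at whatever java resolves to on your machine):

```shell
# In $HADOOP_HOME/conf/hadoop-env.sh -- example path, adjust to your JDK
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
```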

Auto-completion

Add the following to ~/.bashrc or ~/.bash_profile:

## Autocompletion for HDFS
# hdfs(1) completion
have()
{
    unset -v have
    PATH=$PATH:/sbin:/usr/sbin:/usr/local/sbin type $1 &>/dev/null &&
            have="yes"
}
have hadoop &&
_hdfs()
{
  local cur prev

  COMPREPLY=()
  cur=${COMP_WORDS[COMP_CWORD]}
  prev=${COMP_WORDS[COMP_CWORD-1]}

  if [[ "$prev" == hdfs ]]; then
    COMPREPLY=( $( compgen -W '-ls -lsr -du -dus -count -mv -cp -rm \
      -rmr -expunge -put -copyFromLocal -moveToLocal -mkdir -setrep \
      -touchz -test -stat -tail -chmod -chown -chgrp -help' -- $cur ) )
  fi

  if [[ "$prev" == -ls ]] || [[ "$prev" == -lsr ]] || \
    [[ "$prev" == -du ]] || [[ "$prev" == -dus ]] || \
    [[ "$prev" == -cat ]] || [[ "$prev" == -mkdir ]] || \
    [[ "$prev" == -put ]] || [[ "$prev" == -rm ]] || \
    [[ "$prev" == -rmr ]] || [[ "$prev" == -tail ]] || \
    [[ "$prev" == -cp ]]; then
    if [[ -z "$cur" ]]; then
      COMPREPLY=( $( compgen -W "$( hdfs -ls / 2>/dev/null | grep -v ^Found | awk '{print $8}' )" -- "$cur" ) )
    elif [[ "$cur" == */ ]]; then
      COMPREPLY=( $( compgen -W "$( hdfs -ls "$cur" 2>/dev/null | grep -v ^Found | awk '{print $8}' )" -- "$cur" ) )
    else
      COMPREPLY=( $( compgen -W "$( hdfs -ls "$cur"* 2>/dev/null | grep -v ^Found | awk '{print $8}' )" -- "$cur" ) )
    fi
  fi
} &&
complete -F _hdfs hdfs
unset have
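The heavy lifting above is done by compgen -W, which keeps only the words from the list that match the prefix being completed. A standalone illustration (bash only, no Hadoop needed):

```shell
# compgen -W filters a word list by the prefix being completed.
# With prefix "-l", only -ls and -lsr survive.
compgen -W '-ls -lsr -du -dus -cat' -- -l
# prints:
# -ls
# -lsr
```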