Hadoop - Troubleshooting

Updated: 2019-01-03

Checksum

Error:

INFO fs.FSInputChecker: Found checksum error: b[0, 0]=
org.apache.hadoop.fs.ChecksumException: Checksum error: sample/configs.json at 0

Solution: delete the stale .crc checksum file sitting next to the data file, then retry.
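For example (a local sketch with made-up file names; on a real machine the sidecar sits next to the file named in the ChecksumException):

```shell
# Simulate a copied file plus its stale checksum sidecar.
# Hadoop names the sidecar ".<filename>.crc" next to the data file.
mkdir -p sample
touch sample/configs.json sample/.configs.json.crc
# Delete every stale .crc sidecar under the directory, then retry the copy.
find sample -name '*.crc' -type f -delete
ls -A sample    # only configs.json remains
```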

Hadoop Version

Error

Exception in thread "main" java.io.IOException: Call to ... failed on local exception: java.io.EOFException
     at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
     at org.apache.hadoop.ipc.Client.call(Client.java:743)
     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
     at $Proxy0.getProtocolVersion(Unknown Source)
     at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
     at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
     at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:207)
     at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:170)
     at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
     at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
     at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
     at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
     at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)

Solution: Make sure you are compiling against the same Hadoop version that you are running on your cluster.
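A quick sanity check is to print the client's version and compare it with the cluster's (guarded so the snippet is a no-op on a machine without Hadoop):

```shell
# An EOFException on the RPC call usually means the client jars and the
# cluster speak different protocol versions. Print the client's version
# and compare it with what the cluster runs.
if command -v hadoop >/dev/null 2>&1; then
  hadoop version | head -n 1
fi
```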

Unrecognized option: -jvm

Error when starting the datanode:

# hadoop datanode
Unrecognized option: -jvm
Could not create the Java virtual machine.

Reason: the datanode should not be run as root; when started as root, the launch script passes the -jvm option, which the plain java binary does not understand.

Solution: start the datanode as a non-root user.

Could only be replicated to 0 nodes, instead of 1

Error:

could only be replicated to 0 nodes, instead of 1

Reason 1: the namenode and datanode storage are inconsistent (e.g. stale state left over from a previous format)

Solution: re-format the namenode (note: this wipes all HDFS data)

$ stop-all.sh
(remove the directories pointed to by dfs.name.dir and dfs.data.dir, and the Hadoop tmp files)
$ hadoop namenode -format
$ start-all.sh

Reason 2: running out of disk space.

$ hadoop dfsadmin -report

...

Datanodes available: 1 (1 total, 0 dead)

Name: 127.0.0.1:50010
Decommission Status : Normal
Configured Capacity: 144496779264 (134.57 GB)
DFS Used: 24591 (24.01 KB)
Non DFS Used: 144157540337 (134.26 GB)
DFS Remaining: 339214336(323.5 MB)
DFS Used%: 0%
**DFS Remaining%: 0.23%**
Last contact: Wed Aug 17 11:59:04 PDT 2011

Solution: set dfs.name.dir and dfs.data.dir (in hdfs-site.xml) to another disk.

I was using the default settings, i.e. dfs.name.dir and dfs.data.dir lived under /tmp, which had almost no space left. The /data folder, however, is mounted on a 2TB disk; pointing those two dirs there solved the problem.
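For example, in hdfs-site.xml (the /data paths below are illustrative; use any mount with free space, and either copy the old contents over or re-format after moving dfs.name.dir, since the new directory starts empty):

```xml
<!-- hdfs-site.xml: move HDFS storage off the nearly-full partition -->
<property>
  <name>dfs.name.dir</name>
  <value>/data/hadoop/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/data/hadoop/data</value>
</property>
```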

DiskErrorException: Could not find any valid local directory

Error:

INFO mapred.JobClient: Task Id : attempt_201108171201_0004_m_000005_0, Status : FAILED

org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/spill0.out
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
    at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:121)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1391)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1297)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:698)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:765)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.Child.main(Child.java:253)

Reason: the disk holding hadoop.tmp.dir is low on space

Solution: set hadoop.tmp.dir (in core-site.xml) to another place
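For example, in core-site.xml (the /data path is an assumption; point it at any disk with room):

```xml
<!-- core-site.xml: put Hadoop's scratch space on a disk with free space -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop/tmp</value>
</property>
```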

Name node is in safe mode

Error

Exception in thread "main" org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /user/hivetrix/output. Name node is in safe mode.

Reason: at startup the namenode stays in safe mode while it loads the filesystem metadata and waits for datanodes to report their blocks; it then switches to normal mode automatically.

Solution: wait for it to leave safe mode, or check whether safe mode is already off:

$ hadoop dfsadmin -safemode get
Safe mode is OFF
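Related dfsadmin subcommands (guarded so the snippet is a no-op on a machine without Hadoop; use leave only if you know the cluster is healthy):

```shell
if command -v hadoop >/dev/null 2>&1; then
  hadoop dfsadmin -safemode get     # prints "Safe mode is ON" / "Safe mode is OFF"
  hadoop dfsadmin -safemode wait    # blocks until safe mode turns OFF
  # hadoop dfsadmin -safemode leave # forces safe mode off
fi
```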

Unexpected version of storage directory

Error

ERROR datanode.DataNode: org.apache.hadoop.hdfs.server.common.IncorrectVersionException: Unexpected version of storage directory /path/to/directory

Reason: version conflict; the storage directory was written by a different Hadoop version.

Solution: if you install multiple versions of Hadoop on one machine, point dfs.data.dir and dfs.name.dir in each hdfs-site.xml to different paths.

Slf4j

Error

Exception in thread "main" java.lang.NoSuchMethodError: org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
    at org.apache.commons.logging.impl.SLF4JLocationAwareLog.debug(SLF4JLocationAwareLog.java:133)
    at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:136)
    at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:180)
    at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:159)
    at org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:216)
    at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:409)
    at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:395)
    at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1418)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1319)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:226)

Reason: slf4j version conflict. Hadoop-0.20.203.0 uses slf4j-api-1.4.3.jar and slf4j-log4j12-1.4.3.jar, while Solr-3.3.0 uses slf4j-api-1.6.1.jar and jcl-over-slf4j-1.6.1.jar.

Solution: unify the slf4j versions (use 1.6.1 in both cases)

Background: slf4j: "The Simple Logging Facade for Java or (SLF4J) serves as a simple facade or abstraction for various logging frameworks, e.g. java.util.logging, log4j and logback, allowing the end-user to plug in the desired logging framework at deployment time."

In short, programs (like Hadoop and Solr) do not know which logging implementation they use. They only talk to slf4j, and slf4j talks to the chosen logging module. The user only needs to put slf4j (slf4j-api-<version>.jar) and one binding (like slf4j-log4j12-<version>.jar) on the classpath (the <version>s must match).

The options are (copied from the slf4j website):

  • slf4j-log4j12-1.6.1.jar: Binding for log4j version 1.2, a widely used logging framework. You also need to place log4j.jar on your class path.
  • slf4j-jdk14-1.6.1.jar: Binding for java.util.logging, also referred to as JDK 1.4 logging
  • slf4j-nop-1.6.1.jar: Binding for NOP, silently discarding all logging.
  • slf4j-simple-1.6.1.jar: Binding for Simple implementation, which outputs all events to System.err. Only messages of level INFO and higher are printed. This binding may be useful in the context of small applications.
  • slf4j-jcl-1.6.1.jar: Binding for Jakarta Commons Logging. This binding will delegate all SLF4J logging to JCL.

To switch logging frameworks, just replace slf4j bindings on your class path. For example, to switch from java.util.logging to log4j, just replace slf4j-jdk14-1.6.1.jar with slf4j-log4j12-1.6.1.jar.
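A quick way to spot such a conflict is to list every slf4j jar that feeds the classpath; two different versions side by side is the smoking gun. The HADOOP_HOME/SOLR_HOME defaults below are assumptions; adjust them to your install:

```shell
# List all slf4j jars under the lib dirs that end up on the classpath.
for dir in "${HADOOP_HOME:-/opt/hadoop}/lib" "${SOLR_HOME:-/opt/solr}/lib"; do
  if [ -d "$dir" ]; then
    find "$dir" -name '*slf4j*.jar'
  fi
done
```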

Task TimeOut

Error

Task ... failed to report status for 600 seconds

Reason: a timeout; the task stopped reporting progress for 10 minutes. This may also be a symptom of a more serious problem.

Solution

Increase the limit in mapred-site.xml. This is not guaranteed to fix the underlying problem, so check the logs for more info:

<property>
  <name>mapred.task.timeout</name>
  <value>100000</value>
</property>

TaskTracker: Java heap space error

Error

FATAL org.apache.hadoop.mapred.TaskTracker: Task: attempt_201108190901_0002_r_000000_0 - Killed : Java heap space

Solution

Edit mapred-site.xml

<property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx2048M</value>
</property>

When checking the JobTracker web UI at http://0.0.0.0:50030, you may see something like

Cluster Summary (Heap Size is 116.81 MB/888.94 MB)

This is set in hadoop-env.sh:

# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=2000

Note that HADOOP_HEAPSIZE is the maximum heap for the Hadoop daemons themselves; the heap for individual MapReduce tasks is set via mapred.child.java.opts as above.

LeaseExpiredException

Error

org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /path/you/specified/_temporary/_attempt_201412071502_567695_r_000000_0/part-r-00000 File does not exist. Holder DFSClient_attempt_201412071502_567695_r_000000_0 does not have any open files.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1606)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1597)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:1652)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:1640)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.complete(NameNode.java:689)
    at sun.reflect.GeneratedMethodAccessor63.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.

Reason: an _attempt_* file under the _temporary folder was accidentally removed while the job was still running; that file backed the write "lease" the error complains about.

Solution: do not delete anything under the job's output/_temporary directory while the job is running; rerun the affected job.