Hadoop - Troubleshooting
Checksum
Error:
INFO fs.FSInputChecker: Found checksum error: b[0, 0]=
org.apache.hadoop.fs.ChecksumException: Checksum error: sample/configs.json at 0
Solution: Delete the corresponding .crc file.
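Hadoop's local filesystem keeps the checksum in a hidden .<filename>.crc file next to the data file; for the error above that would be:
$ rm sample/.configs.json.crc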
Hadoop Version
Error
Exception in thread "main" java.io.IOException: Call to ... failed on local exception: java.io.EOFException
at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
at org.apache.hadoop.ipc.Client.call(Client.java:743)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at $Proxy0.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:207)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:170)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
Solution: Make sure you are compiling against the same Hadoop version that you are running on your cluster.
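To compare, print the version the cluster runs and check the hadoop-core jar on your compile classpath (the output below is an example; the jar name follows the 0.20-era convention):
$ hadoop version
Hadoop 0.20.203.0
# then compile against the matching jar, e.g. hadoop-core-0.20.203.0.jar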
Unrecognized option: -jvm
Error when starting the datanode::
# hadoop datanode
Unrecognized option: -jvm
Could not create the Java virtual machine.
Reason: the datanode should not be run as root; when started as root, the start script passes a -jvm flag (meant for jsvc in secure setups) that a normal JVM does not recognize.
Solution: try another, non-root user
Solution: try another non-root user
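For example, assuming a dedicated hadoop user exists on the machine:
$ su - hadoop
$ hadoop datanode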
Could only be replicated to 0 nodes, instead of 1
Error:
could only be replicated to 0 nodes, instead of 1
Reason 1: the namenode and datanodes are inconsistent (e.g. the namenode was re-formatted while the datanodes kept their old data)
Solution: re-format the namenode
$ stop-all.sh
(remove dfs.name.dir and dfs.data.dir and tmp files)
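# e.g., assuming the default layout under hadoop.tmp.dir (/tmp/hadoop-<user>):
$ rm -rf /tmp/hadoop-$USER/dfs/name /tmp/hadoop-$USER/dfs/data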
$ hadoop namenode -format
$ start-all.sh
Reason 2: running out of disk space.
$ hadoop dfsadmin -report
...
Datanodes available: 1 (1 total, 0 dead)
Name: 127.0.0.1:50010
Decommission Status : Normal
Configured Capacity: 144496779264 (134.57 GB)
DFS Used: 24591 (24.01 KB)
Non DFS Used: 144157540337 (134.26 GB)
DFS Remaining: 339214336(323.5 MB)
DFS Used%: 0%
**DFS Remaining%: 0.23%**
Last contact: Wed Aug 17 11:59:04 PDT 2011
Solution: set dfs.name.dir and dfs.data.dir (in hdfs-site.xml) to a disk with more space. I was using the default settings, i.e. dfs.name.dir and dfs.data.dir live under /tmp, which had almost no space left; the /data folder, however, is mounted on a 2TB disk, so redirecting those two dirs there solved the problem.
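For example (the paths under /data are just an illustration; point them at whatever large mount you have), in hdfs-site.xml:
<property>
<name>dfs.name.dir</name>
<value>/data/dfs/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/data/dfs/data</value>
</property>
Note that after moving these directories you will need to re-format the namenode (or copy the old contents over).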
DiskErrorException: Could not find any valid local directory
Error:
INFO mapred.JobClient: Task Id : attempt_201108171201_0004_m_000005_0, Status : FAILED
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for ...
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:121)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1297)
at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapred.Child.main(Child.java:253)
Reason: the disk holding the task's local directories (hadoop.tmp.dir) is nearly full, so the task cannot allocate a local file for its spill output.
Solution: set hadoop.tmp.dir (in core-site.xml) to a location with more space.
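For example (the path is just an illustration), in core-site.xml:
<property>
<name>hadoop.tmp.dir</name>
<value>/data/hadoop-tmp</value>
</property>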
Name node is in safe mode
Error
Exception in thread "main" org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /user/hivetrix/output. Name node is in safe mode.
Reason: at startup the namenode stays in safe mode while it loads the filesystem metadata and waits for datanodes to report their blocks, then switches to normal mode automatically.
Solution: wait... or check whether the safe mode is off
$ hadoop dfsadmin -safemode get
Safe mode is OFF
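If you are sure the cluster is healthy, you can also force the namenode out of safe mode:
$ hadoop dfsadmin -safemode leave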
Unexpected version of storage directory
Error
ERROR datanode.DataNode: org.apache.hadoop.hdfs.server.common.IncorrectVersionException: Unexpected version of storage directory /path/to/directory
Reason: version conflict.
Solution: if multiple versions of hadoop are installed on one machine, set dfs.data.dir and dfs.name.dir in each hdfs-site.xml to different paths, so the installs do not share storage directories.
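For example (paths are illustrative), give each install its own storage in its hdfs-site.xml:
<property>
<name>dfs.name.dir</name>
<value>/data/hadoop-0.20/dfs/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/data/hadoop-0.20/dfs/data</value>
</property>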
Slf4j
Error
Exception in thread "main" java.lang.NoSuchMethodError: org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
at org.apache.commons.logging.impl.SLF4JLocationAwareLog.debug(SLF4JLocationAwareLog.java:133)
at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:136)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:180)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:159)
at org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:216)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:409)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:395)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1418)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1319)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:226)
Reason: slf4j version conflict. Hadoop-0.20.203.0 uses slf4j-api-1.4.3.jar and slf4j-log4j12-1.4.3.jar, while Solr-3.3.0 uses slf4j-api-1.6.1.jar and jcl-over-slf4j-1.6.1.jar
Solution: unify the versions of slf4j (use 1.6.1 in both cases)
Background: slf4j: "The Simple Logging Facade for Java or (SLF4J) serves as a simple facade or abstraction for various logging frameworks, e.g. java.util.logging, log4j and logback, allowing the end-user to plug in the desired logging framework at deployment time."
In short, programs (like Hadoop and Solr) do not know which logging implementation they use; they only talk to slf4j, which forwards to the chosen logging module. The user only needs to put slf4j (slf4j-api-<version>.jar) and one binding (like slf4j-log4j12-<version>.jar) on the classpath (the <version>s should match).
The binding options are (copied from the slf4j website):
- slf4j-log4j12-1.6.1.jar: Binding for log4j version 1.2, a widely used logging framework. You also need to place log4j.jar on your class path.
- slf4j-jdk14-1.6.1.jar: Binding for java.util.logging, also referred to as JDK 1.4 logging
- slf4j-nop-1.6.1.jar: Binding for NOP, silently discarding all logging.
- slf4j-simple-1.6.1.jar: Binding for Simple implementation, which outputs all events to System.err. Only messages of level INFO and higher are printed. This binding may be useful in the context of small applications.
- slf4j-jcl-1.6.1.jar: Binding for Jakarta Commons Logging. This binding will delegate all SLF4J logging to JCL.
To switch logging frameworks, just replace slf4j bindings on your class path. For example, to switch from java.util.logging to log4j, just replace slf4j-jdk14-1.6.1.jar with slf4j-log4j12-1.6.1.jar.
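For the Hadoop/Solr conflict above, unifying on 1.6.1 might look like this (the jar locations are an assumption; check where your install keeps them):
$ cd $HADOOP_HOME/lib
$ rm slf4j-api-1.4.3.jar slf4j-log4j12-1.4.3.jar
$ cp /path/to/slf4j-api-1.6.1.jar /path/to/slf4j-log4j12-1.6.1.jar .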
Task Timeout
Error
Task ... failed to report status for 600 seconds
Reason: the task did not report progress within the timeout. This is often a symptom of a more serious problem (a hung task, slow I/O, or a long stretch of work that never reports progress).
Solution
Increase the limit in mapred-site.xml. The value is in milliseconds (the default is 600000, i.e. 600 seconds), so use something larger than that; 0 disables the timeout entirely. This is not guaranteed to solve the real problem, so check the logs for more info. If a task legitimately runs long, it is better to have it call Reporter.progress() periodically than to raise the timeout::
<property>
<name>mapred.task.timeout</name>
<value>1800000</value>
</property>
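The per-task logs usually live under the tasktracker's log directory (assuming the default log layout):
$ less $HADOOP_HOME/logs/userlogs/<attempt_id>/syslog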
TaskTracker: Java heap space error
Error
FATAL org.apache.hadoop.mapred.TaskTracker: Task: attempt_201108190901_0002_r_000000_0 - Killed : Java heap space
Solution
Edit mapred-site.xml
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx2048M</value>
</property>
When checking the jobtracker web UI at http://0.0.0.0:50030, you may see something like
Cluster Summary (Heap Size is 116.81 MB/888.94 MB)
This heap size is set in hadoop-env.sh
# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=2000
Note that HADOOP_HEAPSIZE is the maximum heap for the Hadoop daemons themselves; the heap for individual MapReduce tasks is set by mapred.child.java.opts, as above.
LeaseExpiredException
Error
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /path/you/specified/_temporary/_attempt_201412071502_567695_r_000000_0/part-r-00000 File does not exist. Holder DFSClient_attempt_201412071502_567695_r_000000_0 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1606)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1597)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:1652)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:1640)
at org.apache.hadoop.hdfs.server.namenode.NameNode.complete(NameNode.java:689)
at sun.reflect.GeneratedMethodAccessor63.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.
Reason: I accidentally removed the _attempt_* files in the _temporary folder while the job was still running; those files were still under a "lease" held by the writing task, so the namenode rejected the request.
Solution: do not touch the _temporary directory while a job is running; re-run the job to regenerate the output.
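To check which files still hold an active lease (i.e. are open for write):
$ hadoop fsck /path/you/specified -openforwrite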