Hadoop - Overview
Updated: 2020-06-29
SequenceFile
- append-only: you can't seek to a given key to edit, add, or remove it in place, unlike other key-value structures such as B-trees
- binary key-value pairs
3 formats:
- Uncompressed: keys and values are stored as-is.
- Record Compressed: only values are compressed; keys are stored uncompressed.
- Block Compressed: keys and values are collected into separate 'blocks' and compressed. The block size is configurable.
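The format is selected via SequenceFile.CompressionType when the writer is created. A minimal writer sketch (the path and key/value types are illustrative, and this needs a Hadoop client on the classpath):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;

public class SeqFileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/example.seq"); // illustrative path
        // CompressionType.BLOCK compresses batches of keys and values separately;
        // RECORD would compress each value only; NONE stores raw pairs.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class),
                SequenceFile.Writer.compression(CompressionType.BLOCK))) {
            // append-only: records can only be added to the end
            writer.append(new Text("key1"), new IntWritable(1));
        }
    }
}
```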
Map/reduce
- map: reads input from HDFS, writes intermediate output to local disk.
- reduce: reads the map outputs (fetched during the shuffle), writes final output to HDFS.
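A minimal driver sketch showing where the data lands (the paths are illustrative; with no mapper/reducer set, Hadoop uses the identity Mapper and Reducer):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PassthroughDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "passthrough");
        job.setJarByClass(PassthroughDriver.class);
        // map tasks read these splits from HDFS; their intermediate
        // output spills to local disk on each node
        FileInputFormat.addInputPath(job, new Path("/input"));
        // reduce tasks fetch the map output over the network and
        // write the final result back to HDFS here
        FileOutputFormat.setOutputPath(job, new Path("/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```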
HBase vs HDFS
- HBase: low-latency random reads and writes by key, built on top of HDFS.
- HDFS: high-throughput sequential batch access; append-only files, no in-place updates.
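The kind of access HBase adds can be sketched with the client API (the table, family, and qualifier names are illustrative; requires the HBase client library and a running cluster):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("t"))) { // illustrative table
            // Point read by row key: the low-latency random access
            // that raw HDFS files do not provide
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```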
About the DistributedCache
On Hadoop 2, Pig can figure out dependent jars and upload them to the DistributedCache. From Pig's source code:
private static void putJarOnClassPathThroughDistributedCache(
        PigContext pigContext,
        Configuration conf,
        URL url) throws IOException {
    // Turn on the symlink feature
    DistributedCache.createSymlink(conf);

    Path distCachePath = getExistingDistCacheFilePath(conf, url);
    if (distCachePath != null) {
        log.info("Jar file " + url + " already in DistributedCache as "
                + distCachePath + ". Not copying to hdfs and adding again");
        // Path already in dist cache
        if (!HadoopShims.isHadoopYARN()) {
            // MapReduce on YARN includes $PWD/*, which adds all *.jar files to the classpath,
            // so the jar does not have to be added to mapreduce.job.classpath.files separately.
            // But on Hadoop 1.x the path may only be in 'mapred.cache.files' and not in
            // 'mapreduce.job.classpath.files', so add it there.
            DistributedCache.addFileToClassPath(distCachePath, conf, distCachePath.getFileSystem(conf));
        }
    } else {
        // REGISTER always copies the jar file locally. See PigServer.registerJar()
        Path pathInHDFS = shipToHDFS(pigContext, conf, url);
        DistributedCache.addFileToClassPath(pathInHDFS, conf, FileSystem.get(conf));
        log.info("Added jar " + url + " to DistributedCache through " + pathInHDFS);
    }
}
Copy From Local
Configuration conf = new Configuration();
// Load the cluster configs so FileSystem.get() returns the HDFS filesystem
conf.addResource(new Path(pathHadoopCoreSite));
conf.addResource(new Path(pathHadoopHDFSSite));
FileSystem fs = FileSystem.get(conf);
Path src = new Path(pathLocal); // source file on the local filesystem
Path dst = new Path(pathHDFS);  // destination path on HDFS
fs.copyFromLocalFile(src, dst); // equivalent to `hadoop fs -copyFromLocal`