Hadoop - Overview


SequenceFile

  • append-only (you cannot seek to a given key to edit, add, or remove it in place, as you can with other key-value structures such as B-Trees)
  • stores binary key-value pairs

3 formats:

  • Uncompressed: neither keys nor values are compressed.
  • Record Compressed: only the values are compressed.
  • Block Compressed: keys and values are collected into blocks separately and compressed; the block size is configurable (see the sketch after this list).
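
A minimal sketch of writing and reading a SequenceFile with block compression; the path, key/value types, and record contents are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/demo.seq");

        // Append binary key-value pairs; records can only be added at the end,
        // never edited or removed in place.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                SequenceFile.Writer.compression(CompressionType.BLOCK))) {  // or NONE / RECORD
            for (int i = 0; i < 100; i++) {
                writer.append(new IntWritable(i), new Text("value-" + i));
            }
        }

        // Reading is sequential; there is no lookup of an arbitrary key as in a B-Tree.
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(path))) {
            IntWritable key = new IntWritable();
            Text value = new Text();
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        }
    }
}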


MapReduce task I/O

  • map: reads its input from HDFS and writes its (intermediate) output to local disk
  • reduce: reads the output of the map tasks (fetched during the shuffle) and writes the final output to HDFS (a minimal driver sketch follows this list)
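
A bare-bones driver to make those paths concrete (identity Mapper/Reducer; the input and output paths are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PassThroughJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "pass-through");
        job.setJarByClass(PassThroughJob.class);

        // Identity mapper and reducer: records pass through unchanged,
        // which keeps the focus on where data is read and written.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Map tasks read their input splits from HDFS:
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        // Map output (the intermediate data) goes to each node's local disk and is
        // fetched by the reducers during the shuffle; it never touches HDFS.
        // Reduce tasks write the final output back to HDFS:
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}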

HBase vs HDFS

HBase: low-latency random reads and writes of individual records by row key.
HDFS: high-throughput sequential reads and writes of large files; no efficient random access to single records.
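
To make the contrast concrete, a sketch of an HBase point write and read by row key, the kind of low-latency single-record access that plain HDFS does not offer (table name, column family, and values are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePointAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Low-latency write of a single cell, addressed by row key
            Put put = new Put(Bytes.toBytes("user-42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Low-latency read of that row back by key
            Result result = table.get(new Get(Bytes.toBytes("user-42")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}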

About the DistributedCache

When running on Hadoop 2 (YARN), Pig can figure out dependent jars and upload them to the DistributedCache; YARN then puts the cached jars on the task classpath via $PWD/*. From Pig's source code:

private static void putJarOnClassPathThroughDistributedCache(
        PigContext pigContext,
        Configuration conf,
        URL url) throws IOException {

    // Turn on the symlink feature
    DistributedCache.createSymlink(conf);

    Path distCachePath = getExistingDistCacheFilePath(conf, url);
    if (distCachePath != null) {
        log.info("Jar file " + url + " already in DistributedCache as "
                + distCachePath + ". Not copying to hdfs and adding again");
        // Path already in dist cache
        if (!HadoopShims.isHadoopYARN()) {
            // Mapreduce in YARN includes $PWD/* which will add all *.jar files in classpath.
            // So don't have to ensure that the jar is separately added to mapreduce.job.classpath.files
            // But path may only be in 'mapred.cache.files' and not be in
            // 'mapreduce.job.classpath.files' in Hadoop 1.x. So adding it there
            DistributedCache.addFileToClassPath(distCachePath, conf, distCachePath.getFileSystem(conf));
        }
    }
    else {
        // REGISTER always copies locally the jar file. see PigServer.registerJar()
        Path pathInHDFS = shipToHDFS(pigContext, conf, url);
        DistributedCache.addFileToClassPath(pathInHDFS, conf, FileSystem.get(conf));
        log.info("Added jar " + url + " to DistributedCache through " + pathInHDFS);
    }
}