Hadoop - Overview

Updated: 2022-02-12

SequenceFile

  • append-only: records can only be added at the end. Unlike key-value structures such as B-trees, you can't seek to a given key to edit, insert, or remove it.
  • binary key-value pairs

3 formats:

  • Uncompressed: neither keys nor values are compressed.
  • Record Compressed: only the values are compressed, one record at a time.
  • Block Compressed: keys and values are collected separately into blocks, and each block is compressed. The block size is configurable.
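
The difference between record and block compression can be illustrated with a small toy model in plain Java. This is only a sketch under stated assumptions: it uses java.util.zip.Deflater rather than a Hadoop codec, it does not reproduce the actual SequenceFile on-disk layout, and the class and method names (CompressionModes, recordCompressedSize, blockCompressedSize, sample) are hypothetical. It only shows why compressing many small values one by one tends to save less space than compressing keys and values in batches.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

// Toy model only: this does NOT reproduce Hadoop's SequenceFile layout.
class CompressionModes {

    // Compress a byte array with Deflater and return the compressed bytes.
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // "Record compressed": keys stay raw, each value is compressed on its own.
    static int recordCompressedSize(List<String> keys, List<String> values) {
        int total = 0;
        for (int i = 0; i < keys.size(); i++) {
            total += keys.get(i).getBytes(StandardCharsets.UTF_8).length;
            total += deflate(values.get(i).getBytes(StandardCharsets.UTF_8)).length;
        }
        return total;
    }

    // "Block compressed": keys and values are gathered into two separate
    // buffers, and each buffer is compressed as a whole.
    static int blockCompressedSize(List<String> keys, List<String> values) {
        byte[] allKeys = String.join("\n", keys).getBytes(StandardCharsets.UTF_8);
        byte[] allValues = String.join("\n", values).getBytes(StandardCharsets.UTF_8);
        return deflate(allKeys).length + deflate(allValues).length;
    }

    // Build n sample strings: prefix0, prefix1, ...
    static List<String> sample(String prefix, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            out.add(prefix + i);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> keys = sample("key-", 1000);
        List<String> values = sample("status=OK;host=node-", 1000);
        System.out.println("record compressed: " + recordCompressedSize(keys, values) + " bytes");
        System.out.println("block compressed:  " + blockCompressedSize(keys, values) + " bytes");
    }
}
```

With many short, similar records, the block variant should come out noticeably smaller, because the compressor can exploit repetition across records instead of paying a fixed overhead per value.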

Map/reduce

  • map: reads its input from HDFS and writes intermediate output to local disk.
  • reduce: reads the map output and writes the final result to HDFS.
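
The two phases can be sketched as a single-process word count in plain Java. This is a toy model, not the Hadoop MapReduce API: everything runs in memory, nothing is read from or written to HDFS, and the names (WordCountSketch, map, shuffle, reduce, run) are hypothetical. The grouping step between the phases, usually called the shuffle, is not mentioned in the notes above and is included here only because reduce needs all values for a key together.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process toy model; a real job distributes these phases across nodes.
class WordCountSketch {

    // map: turn one input line into a list of (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // shuffle: group the intermediate pairs by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // reduce: sum all counts collected for one key.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Run map over every line, shuffle, then reduce each key.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            intermediate.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(intermediate).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {a=2, b=2, c=1}
        System.out.println(run(Arrays.asList("a b a", "b c")));
    }
}
```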

HBase vs HDFS

HBase: low-latency random reads and writes by key. HDFS: high-throughput sequential access to large files.

Copy From Local

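import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Load the cluster settings from the core-site and hdfs-site files.
Configuration conf = new Configuration();
conf.addResource(new Path(pathHadoopCoreSite));
conf.addResource(new Path(pathHadoopHDFSSite));
FileSystem fs = FileSystem.get(conf);

Path src = new Path(pathLocal);  // file on the local file system
Path dst = new Path(pathHDFS);   // destination path in HDFS

// Copy src to dst; the local source file is kept.
fs.copyFromLocalFile(src, dst);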