Computer Science - Data Serialization

Updated: 2018-12-11

Data can be saved to disk or sent from one application to another over a network. The on-disk or on-the-wire format usually differs from the in-memory representation.

  • Serialization: the process of encoding structured in-memory data into a format that can be stored on disk or sent over a network.
  • Deserialization: the reverse process of decoding data read from disk or the network back into in-memory structures.
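As a minimal sketch of the two definitions above, the following round trip uses Python's standard-library json module. The record contents are made up for illustration.

```python
import json

# An in-memory structure (hypothetical example data).
record = {"id": 7, "name": "alice", "tags": ["a", "b"]}

# Serialization: in-memory object -> bytes suitable for disk or network.
encoded = json.dumps(record).encode("utf-8")

# Deserialization: bytes -> a new in-memory object with the same content.
decoded = json.loads(encoded.decode("utf-8"))

print(decoded == record)
```

The decoded object is equal to the original but is a separate copy; only the content survives the round trip, not the object identity.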

Text Formats

E.g. CSV, XML, JSON

  • Pro: human-readable.
  • Con: less efficient than binary formats in both storage space and parse time.

See also HOCON, a human-friendly superset of JSON used for configuration files. A Python parser: https://github.com/chimpler/pyhocon

Binary Formats

  • Pro: more compact and faster to process than text formats.
  • Con: not human-readable.
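To make the trade-off listed above concrete, here is a small sketch that encodes the same record as JSON text and as a fixed binary layout using Python's standard-library struct module. The record fields and the layout are assumptions chosen for illustration; real binary formats such as ProtoBuf or Avro use their own schema-driven encodings.

```python
import json
import struct

# Hypothetical record: sensor id, Unix timestamp in seconds, reading.
sensor_id, timestamp, reading = 42, 1544486400, 21.5

# Text encoding: self-describing and readable, but verbose.
text_bytes = json.dumps(
    {"sensor_id": sensor_id, "timestamp": timestamp, "reading": reading}
).encode("utf-8")

# Binary encoding: a fixed layout that both sides must agree on in advance.
# "<" = little-endian without padding; H = uint16, I = uint32, d = float64.
LAYOUT = "<HId"
binary_bytes = struct.pack(LAYOUT, sensor_id, timestamp, reading)

# Decoding is only possible with knowledge of the same layout.
unpacked = struct.unpack(LAYOUT, binary_bytes)

print(len(text_bytes), len(binary_bytes))
```

The binary form is 14 bytes (2 + 4 + 8) and carries no field names, which is why it is smaller but cannot be read without the layout.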

The most notable over-the-wire formats are ProtoBuf, Thrift, and Avro. For storage, columnar formats such as Parquet are gaining popularity.

For more info about ProtoBuf/Thrift/Avro, check the API page.

SequenceFile

  • A flat file format for storing binary key-value pairs.
  • Commonly used in Hadoop as an input and output file format. MapReduce also uses SequenceFiles to store the temporary output of map functions.
  • Three different formats:

    • Uncompressed: neither keys nor values are compressed.
    • Record Compressed: only the value in each record is compressed.
    • Block Compressed: keys and values are collected into blocks separately, and each block is compressed.
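The difference between the three modes can be sketched with a toy size comparison in Python using zlib. This is not the actual SequenceFile on-disk layout (it ignores the header, record lengths, and sync markers), and the records and function names are hypothetical; it only illustrates why compressing many records together tends to beat compressing each value on its own.

```python
import zlib

# Hypothetical key-value records with repetitive values.
records = [
    (f"key{i}".encode(), b"status=ok;region=us-east;" * 4)
    for i in range(100)
]

def uncompressed(recs):
    # Keys and values are stored as-is.
    return sum(len(k) + len(v) for k, v in recs)

def record_compressed(recs):
    # Each value is compressed on its own; keys stay uncompressed.
    return sum(len(k) + len(zlib.compress(v)) for k, v in recs)

def block_compressed(recs):
    # Keys and values of many records are gathered separately,
    # and the key block and the value block are each compressed.
    keys = b"".join(k for k, _ in recs)
    values = b"".join(v for _, v in recs)
    return len(zlib.compress(keys)) + len(zlib.compress(values))

sizes = (
    uncompressed(records),
    record_compressed(records),
    block_compressed(records),
)
print(sizes)
```

With data like this, block compression wins because redundancy across records is visible to the compressor, whereas record compression only sees one value at a time.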

Parquet