Data can be saved on disk, or sent from one application to another over a network. The stored or transmitted representation usually differs from the representation of the same data in memory.
- Serialization (encoding): the process of converting structured data in memory into a format in which it can be stored on disk or sent over a network.
- Deserialization (decoding): the reverse process, converting bytes read from disk or the network back into data structures in memory.
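As a minimal sketch of the round trip, the snippet below uses Python's standard `json` module. The `record` dictionary is a made-up example, and JSON is only one of many possible encodings.

```python
import json

# Hypothetical in-memory record, used only for illustration.
record = {"id": 42, "name": "alice", "tags": ["a", "b"]}

# Serialization: in-memory object -> bytes that can be written to disk
# or sent over a network.
encoded = json.dumps(record).encode("utf-8")

# Deserialization: bytes -> in-memory object.
decoded = json.loads(encoded.decode("utf-8"))

print(type(encoded).__name__, decoded == record)
```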
Text-based formats, e.g. CSV, XML, JSON:
- Pro: human-readable
- Con: not very efficient in terms of either storage space or parse time.
Binary formats:
- Pro: compact and faster to process.
- Con: not human-readable.
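The space trade-off between a text encoding and a binary encoding can be illustrated with a rough sketch. It encodes the same made-up list of integers once as JSON and once as fixed-width 4-byte integers via the standard `struct` module. The sample data and the fixed-width layout are assumptions chosen for illustration, not how any particular binary format works.

```python
import json
import struct

# Hypothetical sample data: 1,000 seven-digit integers.
values = list(range(1_000_000, 1_001_000))

# Text encoding: human-readable, but every digit costs one byte.
text_bytes = json.dumps(values).encode("utf-8")

# Binary encoding: "<" = little-endian without padding, "i" = 4-byte signed int.
fmt = f"<{len(values)}i"
binary_bytes = struct.pack(fmt, *values)

# The binary form decodes back to the same values.
decoded = list(struct.unpack(fmt, binary_bytes))

print(len(text_bytes), len(binary_bytes))
```

The binary output is smaller here, but opening it in a text editor shows only opaque bytes, which is the readability cost noted above.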
The most notable over-the-wire formats are ProtoBuf, Thrift and Avro. For storage, some columnar formats are gaining popularity.
For more info about ProtoBuf/Thrift/Avro, check the API page.
- SequenceFile: a binary file format used to store key-value pairs.
- commonly used in Hadoop as an input and output file format. MapReduce also uses SequenceFiles to store the temporary output from map functions.
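SequenceFiles come in three different formats:
- Uncompressed: neither keys nor values are compressed.
- Record Compressed: only the value in each record is compressed.
- Block Compressed: keys and values are collected into blocks separately, and each block is compressed.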
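The difference between record-level and block-level compression can be sketched conceptually as follows. This is not the real SequenceFile on-disk layout (it has no header, sync markers, or length fields) and it does not use any Hadoop API. The helper names `record_compressed` and `block_compressed`, the sample records, and the use of `zlib` are all assumptions made for illustration.

```python
import zlib

# Hypothetical key-value records with repetitive values.
records = [
    (f"user-{i}".encode(), b"status=active;region=eu-west;")
    for i in range(100)
]

def record_compressed(records):
    # Keys are stored as-is; each value is compressed on its own.
    return [(key, zlib.compress(value)) for key, value in records]

def block_compressed(records):
    # Keys and values of a batch are gathered separately,
    # and each batch is compressed as one block.
    keys = b"\n".join(key for key, _ in records)
    values = b"\n".join(value for _, value in records)
    return zlib.compress(keys), zlib.compress(values)

rec = record_compressed(records)
key_block, value_block = block_compressed(records)

record_size = sum(len(key) + len(value) for key, value in rec)
block_size = len(key_block) + len(value_block)
print(record_size, block_size)
```

Compressing many records together lets the compressor exploit redundancy across records, which is why the block variant comes out smaller on this sample.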
- A columnar format.
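To illustrate what "columnar" means in general, the sketch below pivots a few made-up rows into one list per column. It does not reproduce the layout of any specific columnar format. The field names and values are hypothetical.

```python
# Row-oriented layout: one complete record after another.
rows = [
    {"id": 1, "country": "US", "amount": 10.0},
    {"id": 2, "country": "DE", "amount": 20.5},
    {"id": 3, "country": "US", "amount": 5.0},
]

# Column-oriented layout: all values of one field are stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# A query that needs a single field only touches that column.
total = sum(columns["amount"])
print(columns["country"], total)
```

Keeping values of the same field together also tends to help compression, since neighbouring values share a type and often repeat.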