Data Serialization
Updated: 2020-06-29
Data can be saved to disk or sent from one application to another over a network. The on-disk or on-the-wire format can differ from the in-memory representation.
- Serialization: encoding structured data. The process of converting data in memory to a format in which it can be stored on disk or sent over a network.
- Deserialization: decoding. The process of converting data read from disk or the network back into in-memory structures.
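A minimal round-trip sketch of the two definitions above, using JSON from the Python standard library. The `record` dictionary is a made-up example, not something from these notes.

```python
import json

# In-memory structure (hypothetical example record)
record = {"id": 42, "name": "alice", "tags": ["a", "b"]}

# Serialization: in-memory object -> bytes that can be written to disk
# or sent over a socket
encoded = json.dumps(record).encode("utf-8")

# Deserialization: bytes -> in-memory object
decoded = json.loads(encoded.decode("utf-8"))
```

After the round trip, `decoded` should compare equal to `record`, while `encoded` is a plain byte string.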
Text Format
E.g. CSV, XML, JSON
- Pro: human-readable
- Con: not very efficient in terms of either storage space or parse time.
HOCON is another text format; a Python parser is available at https://github.com/chimpler/pyhocon
Binary Formats
- Pro: compact and faster to process.
- Con: not human-readable
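To make the size trade-off concrete, here is a rough sketch that encodes the same data once as JSON text and once as fixed-width binary records with the standard `struct` module. The sample readings and the 12-byte record layout are assumptions chosen for illustration; real binary formats such as ProtoBuf or Avro use their own encodings.

```python
import json
import struct

# Hypothetical sample: 1000 (id, temperature) readings
readings = [(i, 20.0 + i * 0.25) for i in range(1000)]

# Text encoding: a JSON array of [id, temperature] pairs
text_bytes = json.dumps(readings).encode("utf-8")

# Binary encoding: each reading as little-endian uint32 + float64 (12 bytes)
record_format = struct.Struct("<Id")
binary_bytes = b"".join(record_format.pack(i, t) for i, t in readings)

# Decode the binary form to confirm nothing was lost
restored = [
    record_format.unpack_from(binary_bytes, offset)
    for offset in range(0, len(binary_bytes), record_format.size)
]
```

Comparing `len(text_bytes)` with `len(binary_bytes)` shows the binary form is smaller for this data, but the bytes are not readable without knowing the record layout.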
Most notable over-the-wire formats: ProtoBuf, Thrift and Avro. For storage, some columnar formats are gaining popularity.
For more info about ProtoBuf/Thrift/Avro, check the API page.
SequenceFile
- A flat file format that stores binary key-value pairs.
- commonly used in Hadoop as an input and output file format. MapReduce also uses SequenceFiles to store the temporary output from map functions.
- Three different formats:
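- Uncompressed: neither keys nor values are compressed.
- Record Compressed: only the value in each record is compressed.
- Block Compressed: records are collected into blocks, and the keys and values of each block are compressed separately.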
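The difference between record and block compression can be illustrated with a toy model. This is only a sketch of the idea using `zlib`: it does not reproduce the real SequenceFile on-disk layout (headers, sync markers, codecs), and the helper names `frame`, `unframe`, `record_compressed`, `block_compressed` and `read_blocks` are all hypothetical.

```python
import struct
import zlib

# Hypothetical key-value records; the repetitive values compress well.
records = [(f"key-{i}".encode(), f"value-{i}".encode() * 20) for i in range(100)]

def frame(items):
    # Length-prefix each item so a block can be split back into items.
    return b"".join(struct.pack(">I", len(x)) + x for x in items)

def unframe(data):
    # Inverse of frame(): split a length-prefixed byte string into items.
    items, pos = [], 0
    while pos < len(data):
        (n,) = struct.unpack_from(">I", data, pos)
        items.append(data[pos + 4:pos + 4 + n])
        pos += 4 + n
    return items

def record_compressed(records):
    # Only the value of each record is compressed; keys stay raw.
    return [(k, zlib.compress(v)) for k, v in records]

def block_compressed(records, block_size=50):
    # Records are grouped into blocks; within a block, the keys and the
    # values are each compressed as one unit.
    blocks = []
    for start in range(0, len(records), block_size):
        chunk = records[start:start + block_size]
        keys = zlib.compress(frame(k for k, _ in chunk))
        values = zlib.compress(frame(v for _, v in chunk))
        blocks.append((keys, values))
    return blocks

def read_blocks(blocks):
    # Decompress every block and rebuild the original (key, value) pairs.
    out = []
    for keys, values in blocks:
        out.extend(zip(unframe(zlib.decompress(keys)),
                       unframe(zlib.decompress(values))))
    return out

raw_size = sum(len(k) + len(v) for k, v in records)
record_size = sum(len(k) + len(v) for k, v in record_compressed(records))
block_total = sum(len(k) + len(v) for k, v in block_compressed(records))
```

Printing `raw_size`, `record_size` and `block_total` gives a feel for the trade-off: block compression lets the compressor see many records at once, at the cost of having to decompress a whole block to read a single record.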
Parquet
- A columnar format.
- http://parquet.apache.org/
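As a rough illustration of what "columnar" means, the sketch below stores the same small table in a row-oriented and a column-oriented layout using `struct`. The table, column names and the `read_prices` helper are invented for the example; the actual Parquet format adds row groups, encodings, compression and metadata that are not modeled here.

```python
import struct

# Hypothetical table with three columns: id (int), price (float), qty (int)
rows = [(1, 9.99, 3), (2, 19.5, 1), (3, 4.25, 7)]

# Row-oriented layout: all fields of one record are stored together
row_format = struct.Struct("<idi")
row_layout = b"".join(row_format.pack(*r) for r in rows)

# Column-oriented layout: all values of one column are stored together
ids, prices, qtys = zip(*rows)
column_layout = {
    "id": struct.pack(f"<{len(ids)}i", *ids),
    "price": struct.pack(f"<{len(prices)}d", *prices),
    "qty": struct.pack(f"<{len(qtys)}i", *qtys),
}

def read_prices(layout, count):
    # Reading one column touches only that column's bytes
    return list(struct.unpack(f"<{count}d", layout["price"]))
```

A query that needs only `price` can call `read_prices(column_layout, len(rows))` without decoding `id` or `qty`, whereas the row layout requires scanning every full record.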