Data Formats

Data can be saved on disk, or sent from one application to another application over a network. The format of the data can be different from the data in memory.

Serialization: encoding structured data. The process of converting data in memory to a format in which it can be stored on disk or sent over a network.
Deserialization: the process of reading data from disk or network into memory.

Text Format

E.g. CSV, XML, JSON

Pro: human-readable
Con: not very efficient in terms of either storage space or parse time.

https://github.com/chimpler/pyhocon

Binary Formats

Pro: compact and faster to process.
Con: not human-readable

Most notable over-the-wire formats: ProtoBuf, Thrift and Avro. For storage, some columnar formats are gaining popularity.

For more info about ProtoBuf/Thrift/Avro, check the API page.

Capacitor

BigQuery's columnar format.

SequenceFile

to store key-value pairs.
commonly used in Hadoop as an input and output file format. MapReduce also uses SequenceFiles to store the temporary output from map functions.
three different formats:
- Uncompressed,
- Record Compressed: only the value in a record is compressed
- Block Compressed: both keys and values are compressed.

Parquet

A columnar format.
http://parquet.apache.org/

SSTable

SSTable, originally "Sorted String Table", stores data on disk in a persistent, ordered, immutable set of files:

a simple abstraction to efficiently store large numbers of key-value pairs while optimizing for high throughput, sequential read/write workloads.
stores immutable string-to-string maps
immutable means SSTables are never modified. They are later merged into new SSTables or deleted as data is updated.
always sorted in ascending order by keys
a key can have multiple values
optimized to store large values
often sharded
At the end of the file an index is written which stores each key and the offset of its value. This way, only the index (the size of all the keys) needs to fit in memory to allow efficient lookups of any string in the table, even when the values might be much bigger than available memory.

Used by BigTable, Cassandra, HBase, LevelDB. (Spanner is replacing SSTable with a columnar format called Ressi, which is better optimized for hybrid OLTP/OLAP query workloads: SSTable has a flat row-oriented format optimized for large values while Ressi has a more structured column-oriented format optimized for complex queries.) Can be used as the input and output of data pipelines, and loaded into services to serve data in prod.

Pros:

good as inputs and outputs of MapReduce
good as lookup tables, or indices
originated from Google, battle tested

Cons:

immutable, cannot be edited once written (SSTable is immutable but databases on top of it, like BigTable, are mutable)
data must be sorted when building the tables
lookups are fast based on keys, but slow on values. (use other columnar format instead for more sophisticated searching)

Apache Arrow

https://arrow.apache.org/

The “Arrow Columnar Format” includes a language-agnostic in-memory data structure specification, metadata serialization, and a protocol for serialization and generic data transport.

Apache Parquet and Apache ORC are popular examples of on-disk columnar data formats. Arrow is designed as a complement to these formats for processing data in-memory.

Data Formats / Serialization

GRPC
- Protobuf: created and used by Google
Thrift: created and used by Facebook
Avro
RCFile(Record Columnar File): Facebook
Optimized Row Columnar (ORC) Hortonworks
Parquet: Cloudera and Twitter

parquet vs arrow:

parquet: on disk
arrow: in memory

Link

https://en.wikipedia.org/wiki/Comparison_of_data-serialization_formats