logo

Data Formats

Last Updated: 2024-01-14

Data can be saved on disk, or sent from one application to another application over a network. The format of the data can be different from the data in memory.

  • Serialization: encoding structured data. The process of converting data in memory to a format in which it can be stored on disk or sent over a network.
  • Deserialization: the process of reading data from disk or network into memory.

Text Format

E.g. CSV, XML, JSON

  • Pro: human-readable
  • Con: not very efficient in terms of either storage space or parse time.

https://github.com/chimpler/pyhocon

Binary Formats

  • Pro: compact and faster to process.
  • Con: not human-readable

Most notable over-the-wire formats: ProtoBuf, Thrift and Avro. For storage, some columnar formats are gaining popularity.

For more info about ProtoBuf/Thrift/Avro, check the API page.

Capacitor

BigQuery's columnar format.

SequenceFile

  • to store key-value pairs.
  • commonly used in Hadoop as an input and output file format. MapReduce also uses SequenceFiles to store the temporary output from map functions.
  • three different formats:
    • Uncompressed,
    • Record Compressed: only the value in a record is compressed
    • Block Compressed: both keys and values are compressed.

Parquet

SSTable

SSTable, originally "Sorted String Table", stores data on disk in a persistent, ordered, immutable set of files:

  • a simple abstraction to efficiently store large numbers of key-value pairs while optimizing for high throughput, sequential read/write workloads.
  • stores immutable string-to-string maps
  • immutable means SSTables are never modified. They are later merged into new SSTables or deleted as data is updated.
  • always sorted in ascending order by keys
  • a key can have multiple values
  • optimized to store large values
  • often sharded
  • At the end of the file an index is written which stores each key and the offset of its value. This way, only the index (the size of all the keys) needs to fit in memory to allow efficient lookups of any string in the table, even when the values might be much bigger than available memory.

Used by BigTable, Cassandra, HBase, LevelDB. (Spanner is replacing SSTable with a columnar format called Ressi, which is better optimized for hybrid OLTP/OLAP query workloads: SSTable has a flat row-oriented format optimized for large values while Ressi has a more structured column-oriented format optimized for complex queries.) Can be used as the input and output of data pipelines, and loaded into services to serve data in prod.

Pros:

  • good as inputs and outputs of MapReduce
  • good as lookup tables, or indices
  • originated from Google, battle tested

Cons:

  • immutable, cannot be edited once written (SSTable is immutable but databases on top of it, like BigTable, are mutable)
  • data must be sorted when building the tables
  • lookups are fast based on keys, but slow on values. (use other columnar format instead for more sophisticated searching)

Apache Arrow

https://arrow.apache.org/

The “Arrow Columnar Format” includes a language-agnostic in-memory data structure specification, metadata serialization, and a protocol for serialization and generic data transport.

Apache Parquet and Apache ORC are popular examples of on-disk columnar data formats. Arrow is designed as a complement to these formats for processing data in-memory.

Data Formats / Serialization

  • GRPC
  • Thrift: created and used by Facebook
  • Avro
  • RCFile(Record Columnar File): Facebook
  • Optimized Row Columnar (ORC) Hortonworks
  • Parquet: Cloudera and Twitter

parquet vs arrow:

  • parquet: on disk
  • arrow: in memory

Link

https://en.wikipedia.org/wiki/Comparison_of_data-serialization_formats