logo

Probabilistic Data Structures

Last Updated: 2021-12-15

Approximated, not precise.

Bloom filter/Quotient filter:

  • "Have I ever seen this thing before?"
  • test whether an element is a member of a set
    • "possibly in set" or "definitely not in set".
  • used in Chrome browser to prevent users from visiting malicious websites.
  • A quotient filter requires 10–25% more space than a comparable Bloom filter but is faster because each access requires evaluating only a single hash function
  • both approximate member query filter, or AMQ
  • usage: serve as a proxy for the keys in a database on disk. As keys are added to or removed from the database, the filter is updated to reflect this. Any lookup will first consult the fast quotient filter, then look in the (presumably much slower) database only if the quotient filter reported the presence of the key. If the filter returns absence, the key is known not to be in the database without any disk accesses having been performed.

HyperLogLog

  • approximating the number of distinct elements in a multiset
  • widely used in analytical systems for fast cardinality estimation.
  • Implemented in many popular databases and key-value storages: ClickHouse[1], Vertica[2], VoltDB, Redis, Druid etc.

MinHash

  • estimating how similar two sets are

count–min sketch (CM sketch)

  • a frequency table of events in a stream of data.
  • top 100 most frequent elements
  • used to estimate statistics when you have tight memory

Skip List

  • fast search

Read More