Distributed Systems - ID
UUID: 128bit, no coordination
- type 4: random
- type 1: MAC address + time component
- GTID: Global Transaction Identifier
- Twitter Snowflake
- Cassandra TimeUUID
- K-ordering is a more precise way of saying roughly sorted.
- Github: https://github.com/segmentio/ksuid
Cannot use timestamp as id.
Unless servers are perfectly synchronized. A centralized clock system would be a single point of failure.
Distributed Environment: timestamp is not reliable
computers suffer from “clock drift”, where the clock is constantly skewing forwards or backwards away from the actual time. Perhaps even worse, every computer skews at a different rate 4.
So called “time oracles” are dedicated servers normally equipped with an atomic clock or GPS device whose sole purpose is for keeping time in as precise a manner as possible. Time oracle servers are used frequently in environments where super accurate time is important (such as trading platforms and financial applications), but they are prohibitively expensive and not a viable solution for those working in an environment where deploying custom hardware is impossible (think cloud environments).
You can can never solely rely on time for ordering or identification
Coordination techniques like global locking or single master setups to ensure sure IDs don’t get duplicated or out-of-order can be a recipe for future scalability and performance woes.
Atomically incrementing primary key columns in traditional SQL databases are difficult to scale because they have to lock while incrementing, but they can generate really compact IDs because of this. On the other hand, UUIDs try to circumvent the contention problem by adding entropy to the IDs until the chance of generating a duplicate ID becomes infinitesimally small; however the chance still exists and the IDs are huge.
Generate Unique ID: http://antirez.com/news/99
IDs may have some other useful properties, however:
Ordered: They may be ordered to allow comparison of items stamped with IDs. Meaningful: They may embed certain information about the point at which the ID was generated, for example the time (however inexact). Distributed: They may be generated by a machine without consulting any other machine (ie. they must be truly distributed). Compact: They may be represented in a specific format for space or performance reasons. Normally, you’d want to represent this in as compact a format as possible but you often trade-off longevity to do this (UNIX timestamps are a great example of compact albeit finite time representation in action 1).
Either we let an application like a database generate IDs for us, in which case we generally have scalability issues, or we use standards like the UUID which do not guarantee uniqueness in a distributed environment without significant efforts at coordination
the reasons for doing this are deeply rooted in our inability to keep time across a distributed environment.
if you want to generate unique IDs you need to store some state that makes sure an ID is never repeated.
In case of restart of the system, it should never happen that the same ID is generated again because our stored counter was not correctly persisted on disk.
- Generate a random number in a range between 0 and N, with N so big that the probability of collisions is so small to be, for every practical application, irrelevant. This generation method has a great advantage: it is completely stateless. Multiple nodes can generate IDs at the same time without exchanging messages. Moreover there is nothing to store on disk so we can go as fast as our CPU can go. The computation will easily fit the CPU cache. So it’s terribly fast and convenient.