Distributed Clocks and Ordering, Part 2: Logical Clocks

Part 2 of a three-part series. Part 1 established that wall clocks cannot order events and defined happened-before, the causal order that doesn’t need them. This part makes that order computable: Lamport clocks, vector clocks, the version vectors everyone confuses with them, and the dotted version vectors that keep Riak’s siblings from exploding. Part 3 covers hybrid clocks in production databases.

Part 1 ended with a definition and a challenge. The definition: a → b iff a chain of local steps and messages connects a to b. The challenge: how does a node, holding only local state, compute that relation? Every device in this part answers it; they differ only in how much causal past each timestamp remembers.

Lamport clocks: a total order, but the wrong one

Lamport’s own answer is the logical clock: each process keeps a counter C, incremented on every local event and send, and stamped onto outgoing messages. On receiving a message carrying timestamp t, the process sets C := max(C, t) + 1. Two integers’ worth of arithmetic, and it guarantees the clock condition:

if a → b, then C(a) < C(b).

Causality never violates the numbers. And by breaking ties with process IDs, you get a total order over all events that every node computes identically; which is genuinely useful. Lamport’s paper uses it to run a distributed mutual-exclusion algorithm; modern systems use the same trick anywhere they need some deterministic, causality-respecting arbitration. An LWW-register is exactly this: stamp writes with (lamport, replicaID), highest wins, everyone agrees, no wall clock consulted. Leader-election tiebreaks are the same shape: “highest (counter, id) wins” is arbitrary, deterministic, and never contradicts causality; all a tiebreak needs to be.

The same execution stamped with Lamport clocks, showing concurrent events receiving ordered timestamps

The catch is the direction of the arrow. The clock condition is one-way: C(a) < C(b) does not imply a → b. Run the rules over the execution from Part 1 and g1 (C’s early, causally isolated event) gets timestamp 1, while A’s e3 gets timestamp 3. The numbers say g1 came first. The truth is g1 ∥ e3; the ordering is fabricated. Meanwhile e3 and f1 both get timestamp 3 and the process-ID tiebreak orders them too, equally arbitrarily.

So Lamport clocks give you a total order, but the wrong one, or more precisely, an order, consistent with causality but otherwise invented. For arbitration that’s fine; arbitrary-but-deterministic is the job description. What a Lamport timestamp can never do is answer the question CRDTs actually ask: were these two updates concurrent? Given two stamps, 3 < 5 might mean causality or might mean nothing. The information was thrown away at stamp time, and no amount of comparison gets it back.

Causal histories: the idea underneath everything else

Before adding the fix, it pays to see the unoptimized version, because; as Baquero and Preguiça argue in Why Logical Clocks are Easy (CACM, 2016); every mechanism in the rest of this part is just a compressed encoding of one simple structure. Give every event a globally unique name, say (process, local counter): a1, a2, b1… The causal history of an event is the set of all names in its causal past, built by trivially mechanical rules: a new local event adds its own name to the previous history; a message carries its sender’s history; a receive unions the two. Then causality is literally set inclusion: a → b iff H(a) ⊊ H(b): and since a history always contains its own event’s name, the test collapses to “is a’s name in b’s history?”

That’s the whole theory. Sets of event names, union, inclusion. The only problem is that the sets grow forever, so every practical mechanism is a lossy or lossless compression of causal histories, and the right question to ask of each is Baquero and Preguiça’s: which events get names, and how does the encoding translate back? Lamport clocks are the maximally lossy compression: C is just the size of (a chain through) the history, which is why inclusion can’t be recovered. The lossless compression is next.

Vector clocks: detecting concurrency, paying in space

Notice that causal histories have dead weight: if b3 is in a history, b1 and b2 necessarily are too, so per process only the highest counter matters. Keep just that and the set {a1, a2, b1, b2, b3, c1, c2, c3} compresses to the vector [2, 3, 3]: no information lost. That compression is the vector clock: each of the N processes keeps a vector V of N counters. Process i increments V[i] on each local event; messages carry the full vector; on receipt, the receiver takes the element-wise maximum (set union, compressed) and increments its own entry. Comparison is element-wise, because set inclusion is:

V ≤ W iff V[k] ≤ W[k] for every k;
V < W iff V ≤ W and V ≠ W: then the V event happened-before the W event;
neither V < W nor W < V: the events are concurrent.

Unlike Lamport clocks, this is an if-and-only-if: a → b ⇔ V(a) < V(b). The vector characterizes causality. Stamp our execution and g1 carries [0,0,1] while e3 carries [3,0,0]: each vector exceeds the other in some slot, so they’re incomparable, so concurrent; detected, not guessed. And e2’s [2,0,0] is dominated by g2’s [2,2,2], recovering the genuine causal chain across two message hops.

The same execution stamped with vector clocks, showing incomparable vectors detecting the concurrent pair

The price is the N. Every stamp is a vector with one slot per participant, and Charron-Bost proved in 1991 that this isn’t a failure of imagination: characterizing causality among N processes fundamentally requires vectors of size N (see Schwarz and Mattern’s survey, aptly subtitled In search of the holy grail). For a database cluster of 50 nodes, fine. For “one entry per client device that ever wrote,” it’s metadata that outgrows the data, and the ways systems dodge the N are where the real engineering lives.

Vector clocks vs version vectors: the distinction everyone blurs

First, a correction to the record, because almost every article (and more than one database’s documentation) uses “vector clock” for two different devices. Attribution is messier than the textbooks admit: the canonical vector-clock citations are Fidge (1988) and Mattern (1989), who developed the theory independently, but vector-shaped causality metadata was in production use five years earlier, in Parker et al.’s 1983 version vectors for detecting conflicting file replicas. Lindsey Kuper’s delightful archaeology of the question is worth your time. The two structures are syntactically identical (a map from IDs to counters, compared pointwise) which is exactly why they get conflated. The difference is which events get names:

A vector clock names every event; local steps, sends, receives. It characterizes the causal order of an entire distributed computation, which is what you need for causal message delivery, consistent snapshots, or debugging a distributed trace; its entries advance constantly, message by message.
A version vector names only the events that modify a replicated object; the writes. It characterizes the causal order of an object’s versions: comparing two replicas’ vectors tells you that one state subsumes the other (keep the newer) or that they conflict (merge, or keep both); entries advance only on writes and are exchanged only at sync.

In causal-history terms (Baquero and Preguiça again): same sets, same union, same inclusion test; version vectors just leave most events unnamed. The practical consequences differ enough to make the conflation harmful. Vector clocks are per-event metadata on a firehose; version vectors are per-object metadata touched at write and sync time, which is why version vectors are what databases actually ship; Dynamo’s “vector clocks” are, pedantically, version vectors. It’s also what an MV-register carries to keep both concurrent writes instead of discarding one. (The OR-Set, by contrast, sidesteps comparison entirely with unique tags; uniqueness is a cheaper fact to establish than order.) When someone says “vector clock,” ask what gets incremented and when. The answer is the design.

Sibling explosion: when the actor set is the bug

Now back to the N, because the obvious dodge has a famous failure mode, and Riak hit it head-on.

A version vector needs an entry per writer. In a Dynamo-style store, who’s the writer; the client or the server? Both answers fail differently. Client IDs are causally honest (each client names its own writes) but unbounded: every device that ever issued a put earns a permanent slot, and the vector outgrows the value it protects. Server IDs keep the vector bounded (there are few servers) but now the server names events it didn’t causally perform. Two clients write through the same server; the server stamps them [S:1] and [S:2]; the second appears to causally follow the first even if the writing clients never saw each other’s values. Order has been invented, and acting on it overwrites a live update. Silent data loss, the unforgivable sin of Part 1.

Riak’s pre-2.0 compromise was server IDs plus pessimism: when a write’s causal context didn’t provably dominate the stored state, keep both values as siblings and let the reader reconcile. Safe (nothing is ever lost) but watch it leak. Trace it with one server S and two clients (this is the execution in the diagram):

C1 puts v1 with empty context. Object: vector [S:1], siblings {v1}.
C1 reads, receiving v1 and context [S:1].
C2, who never read anything, puts v2 with empty context. Empty context dominates nothing, so this is concurrent with v1: vector [S:2], siblings {v1, v2}. Correct; these writes really are concurrent.
C1 puts v3 with context [S:1], superseding the v1 it read. The server compares [S:1] against the object’s vector [S:2]: dominated. But “dominated by the whole object” is all a single per-object vector can express; it cannot say which siblings the context [S:1] covers. Discarding v3 would lose an update; overwriting would lose v2. The only safe move is to append: vector [S:3], siblings {v1, v2, v3}.

And there’s the leak: v3 superseded v1, the context proved it, and v1 survived anyway. Every read-modify-write interleaved with another client’s write adds a sibling that can never be retired; under real concurrency the sibling list balloons into the hundreds, each write enlarging the object it meant to replace. Basho’s engineers called it sibling explosion, and it took down enough clusters to motivate a research collaboration.

Three interleaved writes through one server-side version vector; the superseded sibling survives and the sibling set only grows

Dotted version vectors: the dot pays for the prefix

The fix, from Preguiça, Baquero, Almeida, and colleagues (the dotted version vector) is small enough to state in a sentence: keep the per-object version vector as a summary, but additionally tag each stored value with the exact event that wrote it, its dot (replica, counter), decoupled from the contiguous prefix. A plain version vector [S:3] can only describe contiguous knowledge: events 1 through 3. A dot can name event (S,3) alone, without implying 1 and 2. In causal-history terms, the trick is that the server delays naming the client’s write until it arrives, assigns the next server-side name as the dot, and records the write’s causal context separately, so the value’s identity and its past are no longer fused into one vector.

Causality checks now run per-value instead of per-object. An incoming write carries its context; any stored sibling whose dot is covered by that context was seen by the writer and is discarded; siblings whose dots are not covered are genuinely concurrent and survive. Replay the trace:

C1 puts v1, empty context → stored as v1 @ (S,1).
C2 puts v2, empty context → dot (S,2); context covers neither dot; siblings {v1 @ (S,1), v2 @ (S,2)}.
C1 puts v3 with context [S:1] → dot (S,3); the context covers (S,1), so v1 is retired; it doesn’t cover (S,2), so v2 stays. Result: {v2 @ (S,2), v3 @ (S,3)}.

The same three writes with dotted version vectors; the covered sibling is retired and only true concurrency survives

Same asymptotics as a version vector, one dot of overhead per stored value, and the false-concurrency leak is plugged: the number of siblings is now bounded by the number of genuinely concurrent writers, not by the interleaving history. Riak shipped DVVs as the default logical clock for bucket types in Riak 2.0. The detail is niche; the lesson isn’t: whose events you count, and at what granularity, matters as much as the clock algorithm itself.

Pruning, truncation, and the membership problem

One uncomfortable topic remains: entries that never die. A version vector accumulates a slot for every replica that ever wrote, including the node you decommissioned in 2023. Amazon’s Dynamo handled this with brute pragmatism: cap the per-object vector, and past the threshold evict the entry with the oldest last-updated wall-clock time. Note what that is; pruning is forgetting, and forgetting events re-introduces false concurrency: a stamp that compared as dominated before the eviction may compare as incomparable after it. Dynamo could afford that because its failure mode was benign; false concurrency just surfaces a spurious conflict for the application to merge, and the paper reports the issue never materially appeared in production. Prune a vector that arbitrates rather than surfaces conflicts, and the same trick silently loses writes. Unsafe-but-pragmatic is a fine engineering position, but only if you can name which side of it your read path is on.

The research answer to dynamic membership is more elegant and almost never deployed: interval tree clocks (Almeida, Baquero, Fonte, 2008) replace fixed IDs with binary-tree-shaped slices of the identifier space; a replica forks by splitting its interval with a newcomer and joins by merging intervals back, so participants can enter and leave forever without global coordination or unbounded growth. Most systems prefer to dodge the problem (bound the actor set, or prune and accept the consequences) but if your membership churns faster than your tolerance for either dodge, ITCs are waiting.

Where this leaves us

The progression of this part is a single trade: how much causal past a timestamp carries determines which questions it can answer. A Lamport scalar buys an order for 8 bytes; a vector buys concurrency detection for O(N); dots buy it per-value through bounded server IDs. None of them, though, means anything to a human: you cannot expire a session with a vector clock, or take a backup “as of 14:30” with a Lamport counter. Production databases want causality and wall-time meaning; that hybrid, and what CockroachDB, Spanner, and Cassandra each did about it, is Part 3.

References

Leslie Lamport, Time, Clocks, and the Ordering of Events in a Distributed System, CACM 1978.
Carlos Baquero and Nuno Preguiça, Why Logical Clocks are Easy, ACM Queue 14(1) / CACM 59(4), 2016.
Colin Fidge, Timestamps in Message-Passing Systems That Preserve the Partial Ordering, ACSC 1988.
Friedemann Mattern, Virtual Time and Global States of Distributed Systems, 1989.
D. S. Parker et al., Detection of Mutual Inconsistency in Distributed Systems, IEEE TSE 1983; the original version vectors.
Lindsey Kuper, Who invented vector clocks?, 2023.
Reinhard Schwarz and Friedemann Mattern, Detecting Causal Relationships in Distributed Computations: In Search of the Holy Grail, Distributed Computing 1994; includes Charron-Bost’s size lower bound.
Giuseppe DeCandia et al., Dynamo: Amazon’s Highly Available Key-value Store, SOSP 2007.
Nuno Preguiça, Carlos Baquero, Paulo Sérgio Almeida, Victor Fonte, Ricardo Gonçalves, Dotted Version Vectors: Logical Clocks for Optimistic Replication, 2010.
Basho/Riak, Vector Clocks Revisited and Part 2: Dotted Version Vectors.
Paulo Sérgio Almeida, Carlos Baquero, Victor Fonte, Interval Tree Clocks: A Logical Clock for Dynamic Systems, OPODIS 2008.