← Back to blog

Distributed Clocks and Ordering, Part 3: Hybrid Clocks in Production

Hybrid logical clocks walked by hand, CockroachDB's uncertainty intervals and read restarts, Spanner's TrueTime and commit-wait, Cassandra's last-write-wins as a cautionary tale, and a decision sketch for choosing a clock.

#distributed-systems#clocks#databases

Part 3 of a three-part series. Part 1 showed why wall clocks can’t order events; Part 2 built the logical clocks that can, at the cost of meaning nothing in human time. This part is about having both — hybrid logical clocks — and about what three production databases actually do: CockroachDB’s uncertainty restarts, Spanner’s commit-wait, and Cassandra’s decision to just trust NTP.


Hybrid logical clocks: close to wall time, causally safe

Logical clocks solved ordering by ignoring physical time — which means a Lamport or vector timestamp can’t answer “what did the system look like at 14:30?” or expire a session after an hour. Physical clocks mean something to humans but lie about order. For decades you simply picked your poison. In 2014, Kulkarni, Demirbas, Madeppa, Avva, and Leone showed you mostly don’t have to: the hybrid logical clock tracks physical time and preserves causality at once.

An HLC timestamp is a pair (l, c). The l component is the largest physical clock value the node has ever heard of — from its own clock or from any message — so it stays pinned to wall time. The c component is a Lamport-style counter that breaks ties when causally related events share the same l. The update rules, straight from the paper:

send or local event at node j:
    l' := l
    l  := max(l', pt.j)                  # pt.j = physical clock
    c  := (l == l') ? c + 1 : 0

receive message m at node j:
    l' := l
    l  := max(l', l.m, pt.j)
    if      l == l' == l.m :  c := max(c, c.m) + 1
    else if l == l'        :  c := c + 1
    else if l == l.m       :  c := c.m + 1
    else                   :  c := 0

Compare timestamps lexicographically — (l, c) < (l', c') — and you get the Lamport-style guarantee: e → f ⇒ hlc(e) < hlc(f). Like a Lamport clock (and unlike a vector clock), it’s one-way; HLC orders, it does not detect concurrency.

How an HLC timestamp updates across a message between nodes with skewed physical clocks, plus the 64-bit layout

The rules read denser than they are, so walk the diagram’s execution by hand. Node B’s physical clock runs 25 ms behind node A’s; pt is milliseconds within the second.

  • A sends at pt = 050. Local-event rule: l := max(l', 050) = 050; l changed, so c := 0. The message carries (050, 0).
  • B receives at pt = 025. The message is from B’s future: l := max(l', 050, 025) = 050, and since l equals the message’s l but not B’s old one, c := c.m + 1 = 1. B’s event stamps (050, 1) — strictly after A’s send, exactly as causality demands, even though B’s own clock says A’s timestamp hasn’t happened yet. A raw wall-clock stamp here would have been 025 < 050: causality inverted. The receive rule is the entire fix.
  • B does local work at pt = 030. Physical time still trails the high-water mark, so l holds at 050 and the counter absorbs the difference: (050, 2).
  • B again at pt = 058. Physical time has overtaken l; the clock snaps back to it: (058, 0), counter discharged.

Two properties fall out of that walk. First, the paper proves |l − pt| stays within the clock-synchronization uncertainty — l can only run ahead of the local clock by as much as some other clock runs ahead, which NTP bounds — so an HLC value is a physical timestamp, give or take your NTP error. That’s enough to take a consistent snapshot “as of 14:30” by collecting state at l = 14:30, c = 0, with no coordination round. Second, c only grows while pt lags the high-water mark, then resets; it stays small in practice, which is what makes the engineering tidy: the whole thing packs into the 64 bits of an NTP timestamp — 48 bits of l at microsecond-ish granularity, 16 bits of c — and drops into any schema that already stores a timestamp. It’s a superposition on NTP: it reads the physical clock and never writes it, so nothing else on the box notices. And it’s robustly monotonic even when the underlying clock isn’t — if NTP steps the clock backwards (Part 1’s recurring villain), l simply holds its high-water mark and c absorbs events until physical time catches up. Murat Demirbas’s writeup frames HLC as the practical middle of the design space, and production agrees: it’s a remarkably high ratio of guarantees to bits.

CockroachDB: uncertainty intervals and the price of “no stale reads”

CockroachDB is the canonical HLC deployment: every node runs one, every transaction gets its commit timestamp from one, and causally connected transactions are ordered for free wherever a causality chain — a message, a client token — actually exists. The hard case is the one HLC cannot solve: two transactions on different nodes with no causal connection, where node clocks disagree and serializability still demands that a read at timestamp t see every write that committed “before” it.

CockroachDB’s answer is to fence the ignorance. The cluster declares a maximum clock offset500 ms by default, commonly tuned to 250 ms — and each read at timestamp t drags behind it an uncertainty interval (t, t + max_offset]. A value with a timestamp inside that interval is unclassifiable: it may genuinely be from the reader’s future (fine to ignore), or it may be from its past, stamped by a fast clock (must be seen, or the read is stale). Since the database can’t tell which, it refuses to guess: the transaction’s timestamp is pushed past the uncertain value and its prior reads are re-validated (“read refreshing”); if any were invalidated, the client gets a ReadWithinUncertaintyIntervalError and the transaction restarts at the higher timestamp. Two mitigations keep this bearable: after a restart the same nodes can’t produce new uncertainty (their observed timestamps are remembered), and values read from the gateway node are never uncertain. There’s also a guardrail behind the guarantee: a node whose clock drifts beyond 80% of the max offset from half the cluster crashes itself rather than undermine the interval everyone else is relying on.

Tally the trade honestly. What you get: serializable isolation with no stale reads, on commodity hardware, with clock skew converted from a silent-corruption risk into a bounded performance problem. What you pay: restart-shaped latency that scales with how bad your clocks are and how contended your workload is — every millisecond of configured offset widens the window in which an innocent read collides with a recent write. Your NTP quality is no longer a correctness input, but it is now a tail-latency input. That is a spectacularly good deal compared to the alternatives, which is the point of the next two sections.

Spanner: outlaw the uncertainty instead

Google’s Spanner (OSDI 2012) attacks the same gap with money. TrueTime replaces “what time is it?” with an API that returns an interval: TT.now() = [earliest, latest], guaranteed to contain the true time. GPS receivers and atomic clocks in every datacenter keep the interval’s half-width ε in single-digit milliseconds. Uncertainty hasn’t been eliminated — it’s been measured, with hardware-backed bounds.

Measured uncertainty can be waited out. When a transaction commits, Spanner assigns it a timestamp s and then commit-waits: it holds the acknowledgment until TT.after(s) is true — until every clock in the fleet agrees s is in the past, roughly 2ε of stalling. After that wait, no subsequent transaction anywhere can possibly be assigned a timestamp below s, which buys external consistency: if T1 completes before T2 begins — in real time, observed by anyone — then T1’s timestamp is smaller. That is strictly stronger than what HLC systems offer, because it holds even with no causal connection between the transactions at all: the system’s timestamp order matches the wall-clock order a human with a stopwatch would have recorded.

The contrast with CockroachDB is the cleanest design lesson in this series, and Cockroach Labs drew it themselves. Both systems hold an uncertainty window. Spanner pays for a small ε in hardware, then pays ~2ε on every write, unconditionally, to guarantee no one ever observes the uncertainty. CockroachDB accepts a large ε (NTP-sized), pays nothing on the happy path, and pays a restart only when a read actually collides with the window — surrendering external consistency for unrelated transactions as the price. HLC is, in a fair summary, TrueTime without the hardware budget: the same interval-shaped reasoning, with causality tracking standing in for atomic clocks wherever a causal path exists, and honesty about the cases where it doesn’t.

Spanner waits out its small uncertainty on every commit; CockroachDB restarts reads that collide with its large one

Cassandra: the clock is the consistency model

And then there is the system that read Part 1 of this series and shipped it anyway. Cassandra resolves every conflicting write with last-write-wins on wall-clock timestamps, per cell. No version vectors (an early, deliberate performance decision — it saves a read-before-write round trip), no HLC, no uncertainty tracking. The architecture docs say the quiet part plainly: “Cassandra’s correctness does depend on these clocks.”

Kyle Kingsbury’s Jepsen analysis measured what that dependence means. Mutating a single cell under a partition — with QUORUM consistency, synchronized clocks, and a perfect external lock service — lost 28% of acknowledged writes: any write can resurrect later and, armed with a higher timestamp, annihilate writes that legitimately followed it. Ties are their own carnival: when two cells carry the same timestamp, deletes beat writes, and between live values the lexicographically bigger one wins — so a row written as [2, -2] and a concurrent [1, -1] can merge into [2, -1], a value nobody wrote. Ties were supposed to be microsecond-rare, except the server-side timestamp generator was for years multiplying milliseconds by 1,000 (Part 1’s entropy bug, found in this same Jepsen investigation), making collisions a thousand times likelier than the schema implied; the measured result was roughly one corrupted row in two hundred under contended writes. Deletions add doomstones: a tombstone stamped by a fast clock deletes writes from its future. None of these are partition exotica — skew plus concurrency suffices.

The honest framing is not “Cassandra is broken” — for immutable, append-shaped data (which is most of what Cassandra ingests) LWW is harmless, because Part 1’s rule applies: timestamps order events fine when the events are further apart than the clock error, and immutable cells are never closer than that, because there’s nothing to race. The honest framing is that “just run NTP” is Cassandra’s consistency model for mutable data: your replication guarantees are a function of your clock infrastructure, unstated in any schema, degrading silently and exactly when clocks misbehave — which, per Part 1, is not an if. Mutable cells under skew don’t error; they quietly lose whichever write the clocks disliked. If you must mutate, the docs steer you to lightweight transactions (Paxos — coordination, the thing LWW existed to avoid), at which point you’re paying consensus prices for a timestamp model chosen to dodge them.

Who uses what

SystemClockWhat it buys, what it costs
CassandraWall-clock timestamps, last-write-wins per cellSimple, tiny, one round trip. Correctness is delegated to NTP; skewed clocks silently drop concurrent writes, ties corrupt rows.
Dynamo / RiakVersion vectors → dotted version vectors (Part 2)True concurrency detection; conflicting versions surface to the application (the famous shopping-cart merge). Costs per-object metadata and an application-level reconciliation story.
CockroachDBHybrid logical clocks + bounded skewCausal ordering for free; no stale reads under serializable, enforced by uncertainty restarts within the max offset; drifting nodes eject themselves. MongoDB likewise stamps operations with an HLC-based cluster time.
SpannerTrueTime: GPS + atomic clocksUncertainty bounded in hardware; commit-wait converts it into external consistency — the strongest guarantee on this table, at the price of owning the datacenters.
CRDTsLamport stamps, version vectors, dots, unique tags — per typeThe CRDT zoo is largely a catalog of which causality device each type embeds: LWW-registers carry Lamport stamps, MV-registers carry version vectors, OR-Sets carry tags.

The pattern across the table: every system either detects concurrency and handles it (vectors, dots), arbitrates it deterministically (Lamport, HLC, LWW), or outlaws it by bounding clock error and waiting (TrueTime). There is no fourth option, only systems that haven’t decided which of the three they picked.

Choosing a clock: a decision sketch

Strip away the papers and the choice comes down to two questions: do you need to detect concurrency, or just order past it? and do your timestamps need to mean anything in wall-time terms?

  • Neither — you just need a deterministic, causality-respecting arbiter (LWW resolution, dedup ordering): a Lamport clock plus a node-ID tiebreak. Cheapest thing that’s correct.
  • Wall-time meaning, no concurrency detection — snapshots at a time, TTLs, “recent first” ordering across nodes: an HLC. Same cost as a 64-bit timestamp, none of the time-went-backwards bugs. This should be the default over raw wall-clock timestamps; Cassandra-style LWW on raw clocks is the option you choose only after reading the incident reports.
  • Concurrency detection, bounded actors — replicated objects where conflicting versions must surface rather than vanish: a version vector over replica IDs; if many clients write through few servers, dotted version vectors (Part 2 has the failure mode that forces the upgrade).
  • Concurrency detection among events at scale — full vector clocks, and budget honestly for the O(N) metadata, because the lower bound says you can’t dodge it.
  • You need to outlaw concurrency for external consistency — that’s not a clock library, that’s TrueTime, and it’s sold as a building.

If the merge-on-conflict branch is where you’re heading, that’s precisely where this series hands back to the CRDT series: CRDTs are what you build when you’ve accepted that concurrent updates are facts to be merged, not anomalies to be ordered away. The clocks in this series are how those facts get noticed at all.

References