Distributed Clocks and Ordering, Part 1: Wall Clocks and Causality

A three-part series on distributed clocks. Part 1 establishes why physical timestamps cannot order events and what “happened before” actually means without them. Part 2 builds the logical clocks that make that order computable; Part 3 covers the hybrid clocks production databases actually run.

Wall clocks lie

Every distributed-systems argument eventually arrives at a sentence like “write B came after write A, so B wins.” The instinct is to settle it with timestamps, because every machine has a clock. The problem is that every machine has a different clock, and the differences are not a rounding error.

Quartz oscillators drift apart on their own; temperature alone is enough. NTP pulls them back together, but only loosely: the HLC paper’s summary of the state of practice is that NTP usually holds clocks within tens of milliseconds over the public internet, can reach about a millisecond on a LAN under ideal conditions, and occasionally blows out to 100 ms or more when routes are asymmetric or congested. Worse, NTP corrections can step a clock backwards. A timestamp taken on one node is not merely imprecise relative to another node’s; it isn’t even guaranteed to be monotonic relative to its own past.

It’s worth being precise about that last part, because “NTP keeps the clock right” hides two very different correction mechanisms. When ntpd finds the clock off by less than the step threshold (128 ms by default) it slews: it nudges the clock’s rate so the error amortizes gradually, capped by the kernel at 500 parts per million. At that rate it takes about 33 minutes to work off each second of error, but time stays continuous and never runs backwards. Past the threshold (sustained beyond the 300-second stepout window), ntpd gives up on gentleness and steps: it sets the clock, discontinuously, in either direction. A freshly booted VM, a machine waking from suspend, a host that lost connectivity for an afternoon; all of these are step candidates, and a backwards step means now() returns a value earlier than one it already returned. Any code that subtracts two wall-clock readings and assumes the result is non-negative is wrong on every machine running default NTP. It just hasn’t been wrong yet.

And then there are leap seconds, which have their own graveyard. On June 30, 2012, the insertion of 23:59:60 UTC triggered a Linux kernel bug (a livelock around high-resolution timers that sent futex waits into a spin) and took down or degraded Reddit, Mozilla, LinkedIn, Yelp, and airline check-in systems running the Amadeus booking platform, all at the same instant, worldwide. On January 1, 2017, a leap second made time go backwards inside Cloudflare’s DNS server: a Go duration that could never be negative went negative, an RTT-weighted selection panicked, and DNS resolution failed for a slice of customers. The industry’s eventual answer was to stop delivering leap seconds as a discontinuity at all: Google smears the extra second linearly over 24 hours, and AWS’s Time Sync Service does the same; which fixes the kernel-panic class of bug at the cost that, during the smear, a smeared clock and a non-smeared clock legitimately disagree by up to half a second. The lie gets smoother, not smaller.

The anomaly you will actually ship

Skew numbers stay abstract until they reorder something a user can see, so let’s build the failure concretely. A discussion app runs behind a load balancer; writes land on whichever app server is least busy, and each server stamps rows with its own clock. Server A’s clock runs 70 ms fast; comfortably inside what NTP tolerates on a bad day. Server B’s is dead on.

At true time t, a user publishes a post through A. A stamps it t + 70 ms. Forty milliseconds later (a fast reply from someone who had the thread open, or a bot, or the author’s own “edit: typo” follow-up) a comment on that post lands on B, which stamps it t + 40 ms. Sort the thread by timestamp, as every “order by created_at” query does, and the comment now precedes the post it replies to. The database is not confused; it is faithfully reporting the lie it was told. Nothing crashed, no error was logged, and the bug will not reproduce when you investigate, because by then NTP has trimmed A’s clock.

The same arithmetic gets worse when timestamps don’t just sort data but arbitrate it. If those two writes had targeted the same key under last-write-wins (“set shipping address” then “confirm order with address” through different servers) the later operation loses to the earlier one’s inflated stamp, and a write that the client saw acknowledged is silently gone. Kyle Kingsbury’s The trouble with timestamps works through this class of failure in detail, and Part 3 of this series visits the production system that institutionalized it.

The point generalizes: “timestamp the event and sort” is a correctness strategy only when the events are further apart than your clock error. Tens of milliseconds of skew against single-digit-millisecond request latencies means the strategy fails precisely for the events most likely to be causally related; the ones close together.

The two clocks in your own runtime

Before reaching for distributed solutions, it’s worth fixing a confusion that lives entirely inside one machine, because practitioners trip over it constantly. Your OS exposes (at least) two clocks, and they answer different questions:

The wall clock (CLOCK_REALTIME, System.currentTimeMillis(), DateTime.Now) answers “what time is it?” It is the clock NTP slews, steps, and occasionally yanks backwards. Its values mean something outside your process: they can be stored, displayed, compared with other machines’ (loosely).
The monotonic clock (CLOCK_MONOTONIC, System.nanoTime()) answers “how much time has passed?” It counts up from an arbitrary origin (often boot), never goes backwards, and is immune to NTP steps. Its values mean nothing outside the process that read them: a nanoTime() reading is not a date, and comparing readings across two JVMs (or across a restart of the same JVM) is comparing two private rulers with different zero marks.

The rule is mechanical. Durations, timeouts, rate limits, “did this take too long” → monotonic clock. Timestamps that leave the process → wall clock (and then everything in this series applies). Violations of the rule are a reliable bug generator in both directions. The Cloudflare outage above was the first direction: an elapsed-time computation built on wall-clock readings, destroyed by a backwards step; which is precisely why Go 1.9 hid a monotonic reading inside time.Now() and made Sub use it transparently. The second direction is subtler: Cassandra’s server-side timestamp generation was once caught taking the current time in milliseconds and appending three zeroes to fake microsecond precision; a wall-clock value with a thousandth of the entropy its field promised, which inflated timestamp-collision probability by the same factor and produced measurably corrupted rows. Knowing which clock you are holding, and what its readings are for, is the single cheapest distributed-systems skill there is.

But notice what even perfect discipline buys you: a monotonic clock that is honest within one box and a wall clock that is wrong by an unknowable amount across boxes. Neither can tell you whether an event on node A preceded an event on node B. For that, we need to stop asking clocks and start asking a different question entirely.

Causality: the “happened-before” relation

Leslie Lamport supplied that question’s answer in 1978, in “Time, Clocks, and the Ordering of Events in a Distributed System”; arguably the founding paper of the field. Model the system as processes that perform events; some events are sends and receives of messages. Then the happened-before relation, written a → b, is the smallest transitive relation such that:

if a and b are events on the same process and a comes earlier, then a → b;
if a is the sending of a message and b is the receipt of that same message, then a → b;
if a → b and b → c, then a → c.

That’s the whole definition. No clocks anywhere; “before” is defined by the flow of information, not the flow of time. a → b means a causal path exists from a to b: it was physically possible for a to influence b.

The consequential part is what the definition leaves out. If neither a → b nor b → a, the events are concurrent, written a ∥ b. Concurrency here has nothing to do with simultaneity; two events hours apart are concurrent if no chain of messages connects them. Happened-before is a partial order, and the pairs it declines to order are exactly the pairs where “which came first?” has no observable answer, and where any system that pretends otherwise is inventing facts.

Space-time diagram of three processes with happened-before message arrows and a highlighted concurrent pair

The diagram above is the execution we’ll reuse for the rest of the series. Process A performs e1, sends a message (e2), then does local work (e3). B receives that message (f1) and forwards one to C (f2). C does early local work (g1) and later receives B’s message (g2). The chain e1 → e2 → f1 → f2 → g2 is causally ordered. But g1 is connected to nothing on A or B: g1 ∥ e3, and no honest observer can say which happened first.

It’s worth dwelling on why this is the only order available without coordination. A node’s entire knowledge of the rest of the system arrives in messages; what it can possibly know about a remote event is bounded by the chains of messages that connect them. Happened-before is that bound; it is the information-flow order, made formal. Any ordering claim finer than it (our post-and-comment example, ordered by wall clocks) asserts something no node could have observed, and any ordering claim that all nodes must agree on beyond it is no longer a clock problem but a consensus problem, with consensus prices. Between those two lines (below invention, above coordination) causal order is everything a distributed system can know for free.

This relation is also the load-bearing concept under the entire CRDT series. When Part 1 of that series said merges must handle “concurrent updates,” it meant updates unordered by →. So here is the honest position this part has been building to: physical timestamps cannot, by themselves, tell you the order in which two events on different machines occurred whenever those events are closer together than the clock error, and the clock error is milliseconds to hundreds of milliseconds, which is an eternity at modern request rates. If ordering matters, you need a definition of “before” that doesn’t lean on wall time at all. We now have the definition. What we don’t yet have is a way for a node, holding only local state, to compute whether a → b.

That’s Part 2: Lamport clocks, vector clocks, version vectors, and the dots that hold Riak together.

References

Leslie Lamport, Time, Clocks, and the Ordering of Events in a Distributed System, CACM 1978.
Sandeep Kulkarni, Murat Demirbas, Deepak Madeppa, Bharadwaj Avva, Marcelo Leone, Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases, 2014; for the NTP error-bound numbers.
NTP Project, Clock State Machine; slew vs step behavior, the 128 ms step threshold, and the 500 PPM slew rate.
The Register, Leap second bug cripples Linux servers at airlines, Reddit, LinkedIn, 2012.
Cloudflare, How and why the leap second affected Cloudflare DNS, 2017.
Google, Leap Smear; AWS, Set the time for your EC2 instance.
Linux man-pages, clock_gettime(2); CLOCK_REALTIME vs CLOCK_MONOTONIC.
Kyle Kingsbury, The trouble with timestamps, 2013, and Jepsen: Cassandra, 2013.