Idempotency in Practice, Part 1: The Exactly-Once Lie

A three-part series on what “make it idempotent” actually costs to build. Part 1 establishes why exactly-once delivery is impossible, what idempotency keys are, and which operations never needed them; Part 2 builds a Stripe-grade key store in Kotlin and Postgres; Part 3 covers idempotent consumers, Kafka’s exactly-once semantics, and the retrofit checklist.

“Exactly once” is a lie

Start with the smallest distributed system that exists: one client calling one server. Stripe’s engineering blog enumerates the ways this goes wrong: the connection can fail before the request arrives, the server can die halfway through doing the work, or (the truly annoying case) the work can succeed and the response can get lost on the way back. In the first case retrying is obviously safe. In the last two, the client has no way to tell the difference between “never happened” and “happened, but nobody told me.”

A retry timeline: the response is lost, the client retries, the server executes twice.
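The lost-response case fits in a few lines of Python. This is a toy sketch, not anyone's production client: `PaymentServer`, `LossyNetwork`, and `charge_with_retry` are hypothetical names invented here, and the "network" is a plain function call that throws away the first response after the work has already been done.

```python
class PaymentServer:
    """Toy server whose handler is NOT idempotent: every call records a charge."""

    def __init__(self):
        self.charges = []

    def charge(self, amount_cents):
        self.charges.append(amount_cents)
        return {"status": 200, "charged": amount_cents}


class LossyNetwork:
    """Delivers every request, but loses the first `drop_responses` responses."""

    def __init__(self, server, drop_responses):
        self.server = server
        self.drop_responses = drop_responses

    def call(self, amount_cents):
        response = self.server.charge(amount_cents)  # the work happens first
        if self.drop_responses > 0:
            self.drop_responses -= 1
            raise TimeoutError("response lost on the way back")
        return response


def charge_with_retry(network, amount_cents, max_attempts=3):
    for _ in range(max_attempts):
        try:
            return network.call(amount_cents)
        except TimeoutError:
            # The client cannot tell "never happened" from "happened, unreported".
            continue
    raise RuntimeError("gave up after repeated timeouts")


server = PaymentServer()
network = LossyNetwork(server, drop_responses=1)
response = charge_with_retry(network, 2000)
print(response["status"], len(server.charges))
```

The client sees one clean 200. The server's ledger holds two charges.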

The claim that no protocol can close this gap is usually name-dropped as the Two Generals problem and left there. It deserves the two paragraphs it takes to actually work through, because the argument is airtight in a way that “networks are flaky” is not.

Two generals are camped on hills on opposite sides of a valley held by the enemy. They win if and only if they attack simultaneously; an army that attacks alone is destroyed. Their only channel is messengers who cross the valley, and may be captured. General A sends “attack at dawn.” Can A march at dawn? Not safely: if the messenger was captured, B knows nothing and A attacks alone. So B must send an acknowledgment. Can B march once that ack is sent? No; if the ack’s messenger is captured, A never gained the confidence to march, and now B attacks alone. So A must acknowledge the acknowledgment, and the same blade falls on that messenger too.

Now the inductive kill. Suppose, for contradiction, some finite protocol (any sequence of messages) lets both generals attack with certainty. Take the last message that protocol sends. Its delivery cannot matter: the sender will attack regardless of whether it arrives (they’ve sent it; they get no further information), so for the protocol to be correct the receiver must be willing to attack without it. But then the last message was unnecessary; delete it, and you have a shorter correct protocol. Repeat until no messages remain, which would mean the generals can coordinate without communicating at all. Contradiction. No finite number of messages over a lossy channel produces common knowledge that a message was received. Tyler Treat’s classic essay walks the same ground for message queues, and the FLP theorem generalizes the bad news: in an asynchronous system, even one faulty process makes guaranteed agreement impossible. These are not design complexities to be engineered around. They are impossibility results.

The dichotomy you actually get to choose

Since certainty is off the table, a sender holding an unacknowledged request has exactly two policies available, and every delivery guarantee a vendor has ever sold you is one of these wearing a costume:

	At-most-once	At-least-once
On ambiguous failure	Give up	Retry until definitive answer
Duplicates	Never	Yes, by design
Loss	Yes, silently	Never
Receiver’s burden	Tolerate gaps	Tolerate repeats
Honest use case	Metrics, telemetry, anything where a hole is cheaper than a repeat	Money, orders, anything where loss is unacceptable

There is no third column. “Exactly-once delivery” is not a setting; it’s a contradiction. What vendors and serious engineers actually mean by it (sometimes labeled more honestly as effectively once) is a two-part construction: at-least-once delivery plus deduplication on the receiving side. The delivery guarantee is impossible; the processing guarantee is just expensive. Since losing data is usually worse than repeating work, almost every serious system picks the right-hand column and then deals with the duplicates. That dealing-with is this entire series. (Retrying “until definitive answer” has its own failure mode: a fleet of impatient clients can amplify an outage into a retry storm. That is a backpressure problem, though; here we only require that the retries be safe, not polite.)
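The two-part construction can be sketched directly. Everything here is illustrative: `deliver_at_least_once` and `DedupingReceiver` are hypothetical names, the "broker" is a loop that repeats some messages, and the sketch assumes each message carries a stable ID assigned by the sender.

```python
def deliver_at_least_once(messages, handler, redeliveries):
    """Deliver every message at least once.

    `redeliveries` maps a message index to the number of extra deliveries,
    standing in for retries after lost acknowledgments.
    Returns the total number of deliveries made.
    """
    deliveries = 0
    for index, message in enumerate(messages):
        for _ in range(1 + redeliveries.get(index, 0)):
            handler(message)
            deliveries += 1
    return deliveries


class DedupingReceiver:
    """Receiver-side half of 'effectively once': remember IDs already processed."""

    def __init__(self):
        self.seen_ids = set()
        self.processed = []

    def handle(self, message):
        message_id, payload = message
        if message_id in self.seen_ids:
            return  # duplicate delivery: acknowledge and do nothing
        self.seen_ids.add(message_id)
        self.processed.append(payload)


messages = [("m1", "order-1"), ("m2", "order-2"), ("m3", "order-3")]
receiver = DedupingReceiver()
total_deliveries = deliver_at_least_once(messages, receiver.handle, {1: 2})
print(total_deliveries, receiver.processed)
```

Five deliveries, three processed payloads. The in-memory set hides the expensive part: in a real system the "seen" record and the side effect have to commit together, or a crash between them reopens the gap.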

Kafka is the canonical example of how the dedup bill gets paid. When Confluent announced exactly-once semantics in Kafka 0.11, the mechanism under the headline was sequence-numbered dedup inside a carefully drawn boundary, and the moment your code crosses that boundary to touch an external system, you’re back to at-least-once. Kafka didn’t refute the impossibility result; it drew a perimeter, did rigorous dedup inside it, and was honest about the edge. Part 3 takes that machinery apart properly.

Idempotency keys: who generates them, where they live, when they expire

For operations that must not repeat (charging a card being the canonical one) the standard tool is the idempotency key: a unique value the client generates and attaches to the request, which the server uses to recognize retries of the same logical operation.

The client generates the key, and this is not negotiable. Only the client knows whether two byte-identical requests are one intent retried or two genuinely separate intents (“charge $20” twice in a row is a legitimate thing a customer can want). The server cannot derive this from the payload; a hash of the body would happily collapse two real purchases into one. Stripe’s API accepts up to 255 characters and recommends a V4 UUID or anything with comparable entropy, and warns against embedding personal data in the key, because keys end up in logs.
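A small sketch of why the payload cannot stand in for the key. The `Intent` class and `body_hash` helper are hypothetical, as is the request shape; only the `Idempotency-Key` header name and the V4 UUID recommendation come from the sources above.

```python
import hashlib
import json
import uuid


def body_hash(payload):
    """What a server could derive on its own: a digest of the canonical body."""
    canonical = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


class Intent:
    """One logical operation. The key is minted once and reused on every retry."""

    def __init__(self, payload):
        self.payload = payload
        self.idempotency_key = str(uuid.uuid4())

    def as_request(self):
        return {
            "headers": {"Idempotency-Key": self.idempotency_key},
            "body": self.payload,
        }


# Two genuinely separate purchases with byte-identical bodies.
first_purchase = Intent({"amount": 2000, "currency": "usd"})
second_purchase = Intent({"amount": 2000, "currency": "usd"})

# A retry of the first purchase is built from the same Intent object.
retry_of_first = first_purchase.as_request()

same_hash = body_hash(first_purchase.payload) == body_hash(second_purchase.payload)
distinct_keys = first_purchase.idempotency_key != second_purchase.idempotency_key
print(same_hash, distinct_keys)
```

The body hash calls the two purchases identical; the keys keep them apart. The practical consequence is that the key has to be created and persisted with the intent, before the first attempt goes out, not regenerated inside the retry loop.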

On the server, the key becomes a row in a dedup store, and the row has to carry more than you’d think:

Field	Why it’s there
Key, scoped to the caller	`(account_id, key)` unique; one tenant’s UUIDs must not collide with another’s
Request fingerprint	Detect the same key reused with a different payload, which is a client bug to reject, not dedupe
State / lock	Distinguish “in flight” from “finished” for concurrent retries
Stored response	Status code and body, replayed verbatim on retry
Timestamp	So the reaper knows what to expire

The semantics around that row are where the real cost hides. A retry of a completed request must get the original response replayed; Stripe stores and replays the first outcome whether it succeeded or failed, including 500s. A retry that arrives while the original is still executing must not run concurrently; the IETF draft prescribes 409 Conflict for that case, and 422 for key reuse with a mismatched payload. Stripe adds a sharp subtlety: results are only saved once the handler actually begins executing; a request rejected at validation stores nothing, so the client can fix the payload and legitimately retry with the same key.

Idempotency key lifecycle: a dedup-store lookup routes the request to execute, replay, or 409.
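The routing in that lifecycle can be sketched in memory. This is an assumption-laden toy, not Stripe's implementation or the draft's reference code: `IdempotencyStore`, its row layout, and the `validate`/`execute` callbacks are hypothetical, a dict and a lock stand in for a database, and the order of checks (lookup first, then validation) is one reasonable choice among several.

```python
import hashlib
import json
import threading


def fingerprint(payload):
    canonical = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


class IdempotencyStore:
    def __init__(self):
        self._rows = {}  # (account_id, key) -> row dict
        self._lock = threading.Lock()

    def handle(self, account_id, key, payload, validate, execute):
        row_id = (account_id, key)  # scoped to the caller
        request_fingerprint = fingerprint(payload)
        with self._lock:
            row = self._rows.get(row_id)
            if row is not None:
                if row["fingerprint"] != request_fingerprint:
                    return 422, {"error": "key reused with a different payload"}
                if row["state"] == "in_flight":
                    return 409, {"error": "original request is still executing"}
                return row["response"]  # finished: replay verbatim
            error = validate(payload)
            if error is not None:
                return 400, {"error": error}  # rejected early: nothing is stored
            self._rows[row_id] = {
                "fingerprint": request_fingerprint,
                "state": "in_flight",
                "response": None,
            }
        # The handler runs outside the lock, so concurrent retries see "in_flight".
        try:
            response = execute(payload)
        except Exception as exc:
            response = (500, {"error": str(exc)})  # failures are stored too
        with self._lock:
            self._rows[row_id].update(state="finished", response=response)
        return response


store = IdempotencyStore()
executions = []


def validate(payload):
    return None if payload.get("amount", 0) > 0 else "amount must be positive"


def charge(payload):
    executions.append(payload["amount"])
    return 200, {"charged": payload["amount"]}


first = store.handle("acct_1", "key-1", {"amount": 2000}, validate, charge)
replay = store.handle("acct_1", "key-1", {"amount": 2000}, validate, charge)
mismatch = store.handle("acct_1", "key-1", {"amount": 9999}, validate, charge)
print(first, replay == first, mismatch[0], executions)
```

The handler runs once, the retry gets the stored response, and the reused key with a different amount is rejected rather than silently deduplicated. What the sketch leaves out is the hard part: a crash while a row is `in_flight` leaves it stuck there, and recovering from that is where a real key store earns its complexity.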

Keys are a correctness mechanism, not an archive, so they expire. Stripe prunes after roughly 24 hours; Brandur Leach’s reference implementation argues for about 72, so a bug deployed on Friday still has its failed requests on hand for Monday’s fix. Either way, the expiry window is the guarantee: a retry that arrives after the reaper has visited is, as far as the server can tell, a brand-new request, and it will execute again. Pick the window deliberately: it should be longer than your clients’ worst plausible retry-with-backoff horizon, including the queue-powered clients that might redeliver hours later.
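To make “the expiry window is the guarantee” concrete, here is a minimal sketch with an explicit clock. `ExpiringKeyStore` and its `reap` method are hypothetical names; timestamps are passed in as plain seconds so the behavior is deterministic, and the 24-hour window is just the Stripe figure quoted above.

```python
class ExpiringKeyStore:
    """Toy dedup store whose rows carry a creation time for the reaper."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self.rows = {}  # key -> (created_at, response)

    def reap(self, now):
        """Delete rows at least `ttl_seconds` old; return how many were removed."""
        expired = [
            key
            for key, (created_at, _) in self.rows.items()
            if now - created_at >= self.ttl_seconds
        ]
        for key in expired:
            del self.rows[key]
        return len(expired)

    def handle(self, key, now, execute):
        if key in self.rows:
            return self.rows[key][1]  # replay the stored response
        response = execute()
        self.rows[key] = (now, response)
        return response


HOUR = 3600
store = ExpiringKeyStore(ttl_seconds=24 * HOUR)
executions = []


def charge():
    executions.append("charged")
    return {"status": 200}


store.handle("key-1", now=0, execute=charge)          # original request
store.handle("key-1", now=1 * HOUR, execute=charge)   # retry inside the window
reaped = store.reap(now=25 * HOUR)                    # reaper visits
store.handle("key-1", now=25 * HOUR, execute=charge)  # late retry
print(reaped, len(executions))
```

The retry at one hour is replayed; the retry at twenty-five hours is indistinguishable from a new request and charges again.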

One more piece of 2026 status, since the header has been “almost standard” for years: Idempotency-Key is still an Internet-Draft. The IETF HTTPAPI working group published draft-ietf-httpapi-idempotency-key-header-07 in October 2025 (the version on file expires April 2026), intended for Standards Track but not yet an RFC. In practice it doesn’t matter much; Stripe, Adyen, PayPal, WorldPay and a dozen others listed in the draft’s implementation section converged on the same header name and semantics long ago. The draft is documentation of folklore, which is often what good standards are.

Born idempotent vs. bolted on

Before building any of that machinery, check whether you need it. Some operations are naturally idempotent: executing them twice lands in the same state, no bookkeeping required. The pattern behind all of them is the same: state the destination, not the journey.

PUT /records/s3.example.com with the full record body can be replayed forever; per RFC 9110, PUT and DELETE are idempotent by definition and POST is not, and PATCH (defined separately in RFC 5789) is not either. “Set balance to 70” is safe to repeat; “subtract 30” is not.
An UPSERT makes creation idempotent at the storage layer. Postgres’s INSERT ... ON CONFLICT explicitly “guarantees an atomic INSERT or UPDATE outcome … even under high concurrency”; the database arbitrates the race so you don’t have to.
Conditional writes (a version column, an If-Match ETag, a compare-and-swap) make updates idempotent and race-safe: the retry either reapplies the identical change or fails cleanly because the version already advanced. The version check doubles as a fencing token, which becomes important in Part 2.
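The last two patterns can be tried without a Postgres instance. The sketch below uses Python's built-in `sqlite3` as a stand-in, which is an assumption of this example rather than anything the sources use; SQLite's `ON CONFLICT ... DO UPDATE` needs SQLite 3.24 or newer and resembles the Postgres clause, but the concurrency guarantee quoted above is Postgres's, not SQLite's. The `records` table and both helper functions are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    "name TEXT PRIMARY KEY, value TEXT NOT NULL, version INTEGER NOT NULL)"
)


def upsert_record(name, value):
    """Create-or-overwrite: replaying this lands in the same state."""
    conn.execute(
        "INSERT INTO records (name, value, version) VALUES (?, ?, 1) "
        "ON CONFLICT(name) DO UPDATE SET value = excluded.value",
        (name, value),
    )


def conditional_update(name, new_value, expected_version):
    """Compare-and-swap on the version column; True if this call applied."""
    cursor = conn.execute(
        "UPDATE records SET value = ?, version = version + 1 "
        "WHERE name = ? AND version = ?",
        (new_value, name, expected_version),
    )
    return cursor.rowcount == 1


upsert_record("s3.example.com", "10.0.0.1")
upsert_record("s3.example.com", "10.0.0.1")  # replay: still exactly one row

applied = conditional_update("s3.example.com", "10.0.0.2", expected_version=1)
replayed = conditional_update("s3.example.com", "10.0.0.2", expected_version=1)

row_count = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
value, version = conn.execute(
    "SELECT value, version FROM records WHERE name = ?", ("s3.example.com",)
).fetchone()
print(row_count, applied, replayed, value, version)
```

The replayed conditional update matches zero rows because the version has already advanced, so the retry fails cleanly instead of applying twice.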

Then there’s everything else: charging a card, sending an email, appending an event, decrementing inventory. The operation is a verb, not a destination, and running it twice genuinely does the thing twice. For these you bolt dedup on: a recorded fact that says “this intent was already executed,” checked before the side effect happens. That’s what an idempotency key row is.

It helps to see the whole landscape at once, because “is this idempotent?” has a different answer for each basic shape of operation:

Operation shape	Example	Naturally idempotent?	What it needs
Read	`GET /orders/42`	Yes	Nothing; repeat freely (watch side-effectful “reads”)
Absolute set	`PUT` full resource, “set balance = 70”	Yes	Nothing for replays; a version check if writers race
Conditional update	CAS, `If-Match`, `WHERE version = 6`	Yes	The version column you already added
Increment / decrement	“subtract 30”, counter `+= 1`	No	Restate as a set, or attach an operation ID and dedup
Append / create	insert order, append event, send email	No	Idempotency key or natural unique constraint (UPSERT)
State transition	`pending → shipped`	Mostly	Make illegal transitions no-ops with `UPDATE ... WHERE status = 'pending'`; the second run matches zero rows

Two rows deserve a second look. Increments are the classic trap: they look tiny and innocent, and they are the least idempotent operation in computing; every replay moves the value. If you can restate the increment as an absolute set computed by the caller, do; if you can’t (concurrent writers, counters), the operation needs an identity of its own. And state transitions are the pleasant surprise: a guarded UPDATE whose WHERE clause names the source state is idempotent for free, because the duplicate finds the row already moved and matches nothing. A large fraction of “we need idempotency keys” conversations end with someone adding a WHERE status = clause instead.
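Both rows are easy to demonstrate. As a hedged sketch, again with Python's built-in `sqlite3` standing in for a real database: the `orders`, `accounts`, and `applied_ops` tables and the `ship` and `subtract` helpers are all hypothetical, and the operation ID is assumed to be supplied by the caller.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL);
    CREATE TABLE applied_ops (op_id TEXT PRIMARY KEY);
    INSERT INTO orders VALUES (42, 'pending');
    INSERT INTO accounts VALUES (1, 100);
    """
)


def ship(order_id):
    """Guarded transition: only a 'pending' row can move. Returns rows changed."""
    cursor = conn.execute(
        "UPDATE orders SET status = 'shipped' WHERE id = ? AND status = 'pending'",
        (order_id,),
    )
    return cursor.rowcount


def subtract(account_id, amount, op_id):
    """Decrement with its own identity: the op_id row and the update commit together."""
    with conn:  # one transaction; rolled back if an exception escapes
        try:
            conn.execute("INSERT INTO applied_ops (op_id) VALUES (?)", (op_id,))
        except sqlite3.IntegrityError:
            return False  # this operation was already applied
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?",
            (amount, account_id),
        )
        return True


first_ship = ship(42)
second_ship = ship(42)

first_subtract = subtract(1, 30, "op-1")
second_subtract = subtract(1, 30, "op-1")

balance = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
print(first_ship, second_ship, first_subtract, second_subtract, balance)
```

The guarded `UPDATE` needed no extra table at all. The decrement needed `applied_ops`, which is an idempotency key store in miniature.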

The honest framing is that natural idempotency is a property of your data model, while bolted-on dedup is a second, shadow data model you now also operate, with its own storage, races, and expiry policy. Prefer the first wherever you can restructure the operation to allow it. The rest of this series is about paying for the second: Part 2 builds the key store for synchronous APIs, walking Brandur Leach’s Stripe-style Postgres design end to end in Kotlin; Part 3 does the same for message consumers and stream pipelines, where the dedup problem changes shape but never goes away.

References

Brandur Leach, Designing robust and predictable APIs with idempotency; Stripe engineering blog, 2017
Stripe API docs, Idempotent requests
Tyler Treat, You Cannot Have Exactly-Once Delivery; Two Generals and FLP framing
Fischer, Lynch, Paterson, Impossibility of Distributed Consensus with One Faulty Process; JACM, 1985
IETF HTTPAPI WG, The Idempotency-Key HTTP Header Field; draft-07, October 2025
Brandur Leach, Implementing Stripe-like Idempotency Keys in Postgres; key expiry and the reaper
Confluent, Exactly-Once Semantics Are Possible: Here’s How Kafka Does It; Kafka 0.11, 2017
PostgreSQL docs, INSERT; ON CONFLICT clause
RFC 9110: HTTP Semantics; idempotent methods