← Back to blog

Idempotency in Practice, Part 2: Building the Key Store

Brandur Leach's Stripe-style Postgres key store walked end to end in Kotlin: the schema, atomic claim via ON CONFLICT, recovery points so a crashed request resumes instead of double-executing, the Redis variant compared honestly, and the window you can't close.

#idempotency#postgres#kotlin#distributed-systems

Part 1 established that “exactly once” means at-least-once delivery plus dedup, and that idempotency keys are how synchronous APIs do the dedup. This part builds the key store itself — Brandur Leach’s Stripe-style Postgres design, walked end to end in Kotlin and SQL — and is honest about the Redis shortcut and the failure window no store closes. Part 3 does the same for message consumers.


Two questions decide everything

Wherever the “already done” fact lives, two questions decide the design: is the check-and-claim atomic, and is the claim atomic with the work itself?

The first question is the one most home-grown implementations fail. A handler that does SELECT (not seen) … INSERT (mark seen) has a race window between the two statements, and two concurrent retries of the same request will both sail through it. The claim has to be a single arbitrated operation — a unique-constraint insert, an INSERT ... ON CONFLICT, a Redis SET ... NX — so that exactly one request wins and everyone else finds out they lost.

Two concurrent identical requests: a non-atomic check-then-act leaves a race window where both execute; an atomic claim closes it.

The second question is harder, and it’s the one that earns this post its word count. If the claim lives in the same ACID database as the work, you can commit them together and there is no window at all — when the operation is purely local, Brandur’s advice is to skip everything below and just map the request to one transaction. But an API endpoint that charges a card mid-request cannot put Stripe inside its transaction. The claim and the work live in different failure domains, something will eventually crash between them, and the design below exists to make that crash boring instead of expensive.

The schema

Brandur Leach’s reference implementation — built around Stripe’s Rocket Rides demo app — is the design I’d reach for, and the one this walkthrough follows. We’ll use a generic orders API: create an order, charge the customer through a payment provider, send a receipt. Here’s the key table, lightly adapted:

CREATE TABLE idempotency_keys (
    id              BIGSERIAL    PRIMARY KEY,
    account_id      BIGINT       NOT NULL REFERENCES accounts (id),
    idempotency_key TEXT         NOT NULL
        CHECK (char_length(idempotency_key) <= 255),

    -- fingerprint of the request this key was first used with
    request_method  TEXT         NOT NULL,
    request_path    TEXT         NOT NULL,
    params_hash     BYTEA        NOT NULL,  -- SHA-256 over the canonicalized body

    -- where a retry should resume
    recovery_point  TEXT         NOT NULL DEFAULT 'started',

    -- non-NULL means a request is actively working this key
    locked_at       TIMESTAMPTZ  DEFAULT now(),

    -- the stored outcome, replayed verbatim on retry
    response_code   INT,
    response_body   JSONB,

    created_at      TIMESTAMPTZ  NOT NULL DEFAULT now(),
    last_run_at     TIMESTAMPTZ  NOT NULL DEFAULT now(),

    UNIQUE (account_id, idempotency_key)
);

Every column is load-bearing. The unique constraint is scoped to (account_id, idempotency_key) so tenants can’t collide with — or probe — each other’s keys. params_hash exists to catch the client bug of reusing a key with a different payload (Brandur stores the full request_params as JSONB instead, which costs more but lets a background process re-run abandoned requests; a hash only lets you reject mismatches — pick based on whether you want the completer we’ll meet later). locked_at is the concurrency arbiter. And recovery_point is the field that makes this a checkpoint system rather than a cache — it’s where the design stops being a lookup table and starts being a tiny workflow engine.

The atomic claim

When a request arrives, the first job is to either create the key row or figure out what the existing one means. Postgres’s INSERT ... ON CONFLICT makes the create-or-detect atomic in one statement (Brandur’s original uses a SERIALIZABLE transaction to the same end; ON CONFLICT arbitrates the race with less ceremony):

sealed interface Claim {
    data class Execute(val key: IdemKey) : Claim         // we own it; run the request
    data class Replay(val status: Int, val body: String) : Claim
    data object InProgress : Claim                       // 409 Conflict
    data object ParamsMismatch : Claim                   // 422 Unprocessable Content
}

fun claim(accountId: Long, keyVal: String, fingerprint: ByteArray): Claim = db.tx {
    val inserted = queryOrNull(
        """
        INSERT INTO idempotency_keys (account_id, idempotency_key,
                                      request_method, request_path, params_hash)
        VALUES (?, ?, ?, ?, ?)
        ON CONFLICT (account_id, idempotency_key) DO NOTHING
        RETURNING id, recovery_point
        """,
        accountId, keyVal, method, path, fingerprint,
    )
    if (inserted != null) return@tx Claim.Execute(inserted.toIdemKey())

    // The key exists. Lock the row and arbitrate.
    val existing = query(
        """
        SELECT * FROM idempotency_keys
        WHERE account_id = ? AND idempotency_key = ?
        FOR UPDATE
        """,
        accountId, keyVal,
    )
    when {
        !existing.paramsHash.contentEquals(fingerprint) ->
            Claim.ParamsMismatch

        existing.recoveryPoint == "finished" ->
            Claim.Replay(existing.responseCode!!, existing.responseBody!!)

        existing.lockedAt != null &&
            existing.lockedAt > now() - LOCK_TIMEOUT ->
            Claim.InProgress

        else -> {
            // Original attempt died (or its lock expired). Take over.
            execute(
                "UPDATE idempotency_keys SET locked_at = now(), last_run_at = now() WHERE id = ?",
                existing.id,
            )
            Claim.Execute(existing)
        }
    }
}

The four arms are the entire public contract from Part 1, now in executable order of precedence. A mismatched fingerprint is a 422 — the IETF draft’s verdict for “same key, different request,” and it must be checked first, because replaying a stored response to a request that asked for something else is a lie. A finished key replays. A locked, recent key means a concurrent twin is running right now: 409, retry shortly. And a key whose lock has gone stale — the original attempt presumably died mid-flight — gets taken over, which is where recovery_point earns its keep.

Should the concurrent case return 409 or just wait? For a public API, 409 — you don’t know how long the twin will take, and holding a connection open on someone else’s work invites slow-loris economics. Inside your own perimeter, blocking briefly on the row lock can be kinder to clients that can’t handle 409s. Either is defensible; silently running both is the only wrong answer.

Recovery points: the request as a resumable state machine

Here is the problem the claim alone doesn’t solve. Our handler does roughly: insert the order, charge the payment provider, store the response. If the process dies after the charge succeeds but before anything local records that fact, a retry that starts from the top will charge again. The window between “foreign system mutated” and “we wrote it down” is the heart of this whole topic — Brandur calls the external call a foreign state mutation, and the moment you make one, you’re committed: you’ve pushed state beyond your own boundary and must not lose track of it.

His answer is to break the request into atomic phases — groups of local writes, each committed in one transaction — separated by the foreign calls. The rules for drawing the boundaries: the key claim is a phase; every foreign state mutation gets its own slot; everything between them collapses into one phase each, no matter how many local writes it contains. Each phase, as its final act and inside its own transaction, advances recovery_point. A retry doesn’t start over; it reads the recovery point and jumps to the first thing not yet done.

The key-store state machine: atomic phases advance the recovery point; a crash resumes from the last committed point instead of re-executing earlier phases.

The handler becomes a loop over a directed acyclic state machine:

fun runOrder(key: IdemKey, params: OrderParams): StoredResponse {
    while (true) {
        when (key.recoveryPoint) {
            "started" -> atomicPhase(key) {
                val order = insertOrder(params, idempotencyKeyId = key.id)
                insertAuditRecord("order.created", order)
                AdvanceTo("order_created")
            }

            "order_created" -> {
                val order = findOrderByKeyId(key.id)   // may be a recovered run
                val charge = try {
                    psp.charge(
                        amountCents = order.amountCents,
                        customerId = order.pspCustomerId,
                        // Derived, deterministic: a crashed run's retry sends
                        // the SAME key, and the provider's dedup absorbs it.
                        idempotencyKey = "orders-api-${key.id}",
                    )
                } catch (e: CardDeclinedException) {
                    atomicPhase(key) { Finish(402, declinedBody(e)) }
                    continue
                }
                atomicPhase(key) {
                    markCharged(order.id, charge.id)
                    AdvanceTo("charge_created")
                }
            }

            "charge_created" -> atomicPhase(key) {
                stageJob("send_receipt", key.id)        // outbox-style, same tx
                Finish(201, orderBody(key.id))
            }

            "finished" -> return key.storedResponse()
        }
        key = key.reload()
    }
}

with a small helper that makes “advance the checkpoint atomically with the phase’s writes” impossible to forget:

sealed interface PhaseResult
data class AdvanceTo(val point: String) : PhaseResult
data class Finish(val status: Int, val body: String) : PhaseResult

fun atomicPhase(key: IdemKey, block: Tx.() -> PhaseResult) = db.tx {
    when (val result = block()) {
        is AdvanceTo -> execute(
            "UPDATE idempotency_keys SET recovery_point = ? WHERE id = ?",
            result.point, key.id,
        )
        is Finish -> execute(
            """
            UPDATE idempotency_keys
            SET recovery_point = 'finished', locked_at = NULL,
                response_code = ?, response_body = ?::jsonb
            WHERE id = ?
            """,
            result.status, result.body, key.id,
        )
    }
}

Now replay the nightmare. The process dies the instant the provider’s charge succeeds, before markCharged commits. The key row still says order_created with a stale lock. The retry claims it, jumps to the order_created arm, finds the existing order, and calls the provider again — with the same derived key, orders-api-${key.id}. The provider recognizes it and returns the original charge instead of minting a second one. The crash cost a delay, not a duplicate. That derived key is the single most important line in the listing: your key store dedups your callers, and the propagated key recruits the provider’s key store to dedup you. (Note it’s derived from your row’s ID, not passed through verbatim — the caller’s key is scoped to your API; the provider needs one scoped to your account with them.)

Two details worth pinning. First, don’t hold the phase transaction open across the foreign call — the charge happens between transactions, and only its result is recorded inside one. (Brandur’s Ruby wraps the call in the phase for brevity and his serializable retry handles it, but a connection pool slot held for a 2-second PSP call is how you discover what your pool limit is during an incident.) Second, terminal failures are responses too: a declined card stores a 402 and moves to finished, because retrying a decline forever helps no one — the stored failure is the idempotent outcome, exactly as Stripe replays errors, 500s included.

This is also the clean answer to a design question that otherwise causes long meetings: request-scope versus handler-scope atomicity. If your handler is local-only, the request is one transaction and the key row is just a receipt — done. The moment a third party enters mid-flight, “the request” can no longer be atomic, full stop; pretending otherwise just relocates the crash window. Recovery points accept the loss of global atomicity and replace it with something achievable: every local step atomic, every foreign step idempotent-by-propagated-key, and a durable cursor between them.

Key-by-request vs. key-by-operation

One subtlety the schema quietly settled: what does a key identify? The design above is key-by-request — one key per logical API call, fingerprinted against the whole payload. The alternative, key-by-operation, derives keys from domain identity: “the charge for order 4127” is charge-order-4127, forever, no client UUID involved.

Key-by-operation is tempting because it survives client amnesia — two different services, or a human with two browser tabs, can’t double-charge order 4127 even with independent keys. When the operation genuinely has unique domain identity, use it; it’s just natural idempotency from Part 1 wearing a key costume. The trap is operations that don’t: “add $20 of credit” has no natural identity, and a derived key like credit-user-9-2000 would silently collapse two legitimate top-ups into one. That’s why client-minted keys are the general mechanism — only the client knows intent — and domain-derived keys are the optimization for the cases with real identity. Brandur’s orders-api-${key.id} propagation is the hybrid: client key at the edge, derived keys downstream.

Retention, and the reaper

Keys expire, and the expiry window is part of your public contract. Stripe prunes after 24 hours and treats a reused key after pruning as a brand-new request; Brandur argues for ~72, so a bug deployed Friday still has its failed requests on hand when the fix ships Monday. The trade is plain: a longer window means a fatter, slower table and more PII-shaped data at rest; a shorter one means a slow client’s honest retry executes twice. Size it longer than your slowest legitimate retry path — including the queue-driven clients that redeliver hours later — then write it down where clients can read it.

The reaper is a cron job and one statement:

DELETE FROM idempotency_keys
WHERE created_at < now() - interval '72 hours'
LIMIT 10000;   -- in batches; the table is hot

Brandur’s repo adds two more background friends worth stealing: a completer that finds unlocked, unfinished keys whose clients have evidently given up and drives them to a terminal state (the stored request_params variant of the fingerprint column exists precisely so it can re-run them), and an enqueuer for the staged jobs. Budget for them on day one; the table starts growing on day one.

The Redis variant, honestly

Everything above costs a few indexed writes per request on your primary database. The popular cheaper answer is Redis: SET key token NX EX 86400 — claim if absent, with a TTL. (Bare SETNX is on the deprecation track and has no TTL at all, so a crashed worker leaves the key claimed forever; use SET with options.) The atomic NX genuinely closes the check-then-claim race. What it cannot close is everything around it:

  • The claim is in a different system than the work. Claim first and crash before the work: the operation looks done and never happens until the TTL expires — silent loss. Work first, crash before claiming: the retry executes again — duplicate. With Redis you must pick which failure you prefer; with a transactional claim you don’t.
  • The TTL races your retries. Eviction is your dedup window, and it’s enforced by memory pressure as well as policy — if the key expires before the last straggling redelivery arrives, the store has amnesia at the exact moment it’s needed.
  • Failover forgets. With asynchronous replication, a primary crash can promote a replica that never saw your claim. The duplicate sails through.
  • Expiry trusts the clock. As Martin Kleppmann’s distributed-locking analysis lays out, key expiry plus GC pauses plus clock jumps means two workers can both believe they hold the claim — and Redlock doesn’t rescue this, because nothing in it produces the monotonically increasing fencing token you’d need for the data store to reject the loser’s late write. (Our Postgres design has the same theoretical hole in its locked_at timeout, but the phases behind it are each transactional and the foreign calls are key-propagated, so a double-claim degrades to a 409 or a provider-side dedup hit, not a double charge.)
Postgres key storeRedis SET NX EX
Claim atomic with local workYes — same transactionNo — separate system
LatencyOne indexed write per phaseSub-millisecond
Stored response / replayOn the rowA second blob you now manage
Recovery pointsNatural fitBuild it yourself, without transactions
ExpiryExplicit reaper, your policyTTL + eviction policy + memory pressure
FailoverAs durable as your WALAsync replica may never have seen the claim
Honest roleCorrectnessEfficiency

None of this makes Redis dedup useless — it makes it best-effort. That’s the right tool when a duplicate costs an extra push notification, and the wrong one when a duplicate costs money. The rule worth taping to the wall: Redis dedup for efficiency, transactional dedup for correctness — and if a Redis key is the only thing standing between a customer and a double charge, that’s a finding, not a design.

The window you can’t close

Even the full Postgres design has a residual window, and it sits exactly where the walkthrough put it: between the provider’s commit and yours. Key propagation means that window no longer mints duplicates — but it relies on the provider having dedup, and not every foreign system does. For the ones that don’t, Brandur’s advice is bleak and correct: an indeterminate failure (timeout, connection reset) against a non-idempotent foreign API cannot be safely retried, only marked permanently errored and escalated. Which is why you should be the system that does offer keys.

So the residual program is: propagate keys to every downstream call that accepts them; treat the ones that don’t as permanent, logged risk; run the completer, because clients give up; and run a periodic reconciliation that diffs your records against the foreign system’s — charges you have no orders for, or the reverse, should page someone before the customer does. The design question is never “have we eliminated duplicates?” It’s “where does the residual probability live, how big is it, and who notices?” This machinery relocates risk from the common path — every lost response — to a rare seam with monitoring pointed at it. That relocation is the whole product.

Part 3 moves from synchronous APIs to the other half of the problem: consumers fed by brokers, where there’s no response to replay, the transactional claim is usually available, and Kafka has spent a decade selling a carefully fenced version of “exactly once.”

References