
backend

Making cancel-and-refund idempotent — put the terminal state transition last

Ascendy Engineering


TL;DR

Source note. Distilled from a backend-team intake (docs/intake/from-backend/2026-06-07-idempotent-cancel-refund.md). Table, endpoint, and pricing identifiers are generalized, and billing numbers are excluded. The same adversarial review process, covered in “Adversarial review: how you call the second AI, and when to stop it,” worked here too: it caught these races.

The happy path is easy. Dying is hard.

The requirement looked simple. A user cancels a costly batch job mid-flight (each item burns points and triggers an expensive external call). If the job is running, stop at the next item boundary, refund the points for successes not yet finalized, and clean up the temporary artifacts.

Looking only at the normal flow, it’s a 30-minute job. The problem is that every one of those steps can die. What if the process dies mid-refund? Mid-cleanup? What if the cancel request and the worker touch the same job at once? And when a retry comes in, who guarantees it won’t refund again?

Adversarial plan review nailed three blockers in round one, and the design converged on an idempotency pattern.

① Resurrection race — an unconditional UPDATE revives the job

First trap: the worker claimed a job by unconditionally overwriting its status to running. If the cancel endpoint has just cancelled a pending job and the worker then writes running on top, the cancelled job comes back to life.

The fix is to make the transition conditional:

UPDATE job SET status = 'running'
WHERE id = :id AND status IN ('pending', 'queued') AND cancel_requested = false;
-- rowcount = 0 means someone cancelled first → the worker quietly backs off

Check rowcount; if it is 0, the worker abandons the claim. The same condition also covers at-least-once delivery: if the message queue delivers the job twice, the second claim finds the status already running, matches zero rows, and backs off.
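As a rough illustration of the rowcount check, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, the try_claim helper, and the use of 0/1 for cancel_requested are assumptions made for the example, not the team's actual schema or worker code.

```python
import sqlite3


def try_claim(conn: sqlite3.Connection, job_id: int) -> bool:
    """Conditionally move a job to 'running'; True only if this caller won the claim."""
    cur = conn.execute(
        "UPDATE job SET status = 'running' "
        "WHERE id = ? AND status IN ('pending', 'queued') AND cancel_requested = 0",
        (job_id,),
    )
    conn.commit()
    # rowcount is the number of rows the UPDATE actually changed.
    return cur.rowcount == 1


# Hypothetical schema and fixture data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE job (id INTEGER PRIMARY KEY, status TEXT, cancel_requested INTEGER)"
)
conn.execute("INSERT INTO job VALUES (1, 'pending', 0), (2, 'cancelled', 1)")
conn.commit()

first = try_claim(conn, 1)    # pending job: the claim succeeds
second = try_claim(conn, 1)   # duplicate delivery: already running, zero rows match
revived = try_claim(conn, 2)  # cancelled job: the condition refuses to revive it
status_of_cancelled = conn.execute("SELECT status FROM job WHERE id = 2").fetchone()[0]
```

The worker treats a False return as "someone else got here first" and simply drops the message; no error handling is needed for that case.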

② Double refund — marker and balance must share one transaction

If “refund, then change status” is split into two commits, a crash between them makes the retry refund again, paying back the same successful item twice.

Tie the marker and the balance into one transaction:

BEGIN;
  -- only items not yet decided flip to 'cancelled' (rowcount confirms what actually flipped)
  UPDATE item SET decision = 'cancelled' WHERE job_id = :id AND decision IS NULL;
  -- :refund is derived from the rows that just flipped (zero rows flipped means nothing to refund)
  -- server-side increment, not read-modify-write (concurrency-safe)
  UPDATE wallet SET points = points + :refund WHERE user_id = :uid;
COMMIT;

Two things matter. The marker (decision='cancelled') and the balance increment are atomic, so a retry skips items already flipped. And the balance is bumped inside the DB with points + :refund, not read into the app and written back — safe under concurrent updates.
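To make the retry behaviour concrete, here is a small sketch with Python's sqlite3 module. It assumes a flat, hypothetical COST_PER_ITEM so the refund can be computed from rowcount; the schema, the 'finalized' decision value, and the refund_undecided helper are likewise invented for illustration, and real pricing is almost certainly more involved.

```python
import sqlite3

COST_PER_ITEM = 10  # hypothetical flat price per item, for illustration only


def refund_undecided(conn: sqlite3.Connection, job_id: int, user_id: int) -> int:
    """Flip undecided items to 'cancelled' and refund them in one transaction."""
    with conn:  # commits on success, rolls back if anything inside raises
        cur = conn.execute(
            "UPDATE item SET decision = 'cancelled' "
            "WHERE job_id = ? AND decision IS NULL",
            (job_id,),
        )
        refund = cur.rowcount * COST_PER_ITEM  # only the rows that flipped just now
        if refund:
            # Server-side increment: the balance is never read into the app.
            conn.execute(
                "UPDATE wallet SET points = points + ? WHERE user_id = ?",
                (refund, user_id),
            )
    return refund


# Hypothetical schema and fixture data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, job_id INTEGER, decision TEXT)")
conn.execute("CREATE TABLE wallet (user_id INTEGER PRIMARY KEY, points INTEGER)")
conn.execute("INSERT INTO item VALUES (1, 1, 'finalized'), (2, 1, NULL), (3, 1, NULL)")
conn.execute("INSERT INTO wallet VALUES (7, 100)")
conn.commit()

first_refund = refund_undecided(conn, job_id=1, user_id=7)  # two items flip
retry_refund = refund_undecided(conn, job_id=1, user_id=7)  # nothing left to flip
balance = conn.execute("SELECT points FROM wallet WHERE user_id = 7").fetchone()[0]
```

Because the marker and the increment commit together, a crash before the commit leaves both untouched, and a crash after it leaves nothing for the retry to refund.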

③ Orphaned resources — cleanup keys off the marker, not “undecided”

What if you crash right after ②’s marker commits? If cleanup is written to “delete temp artifacts only for items not yet decided,” that item is already cancelled, so it is excluded from cleanup forever. The artifact is orphaned in external storage.

So cleanup keys off the decision='cancelled' marker rather than “undecided,” and re-scans the whole set every time. Deleting from external storage is a no-op on a missing key, so re-running is free: run it any number of times and the result is the same.
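A sketch of that re-scan might look like the following. A plain dict stands in for external storage, and the artifact_key column and cleanup_cancelled helper are assumptions for the example; a real object store would need its own delete call with the same "missing key is fine" behaviour.

```python
import sqlite3


def cleanup_cancelled(conn: sqlite3.Connection, job_id: int, storage: dict) -> int:
    """Delete artifacts for every item marked 'cancelled'; safe to re-run."""
    rows = conn.execute(
        "SELECT artifact_key FROM item WHERE job_id = ? AND decision = 'cancelled'",
        (job_id,),
    ).fetchall()
    for (key,) in rows:
        storage.pop(key, None)  # deleting a missing key is a no-op
    return len(rows)  # how many marked items were scanned, not how many were deleted


# Hypothetical schema; 'storage' is an in-memory stand-in for external storage.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (job_id INTEGER, artifact_key TEXT, decision TEXT)")
conn.execute(
    "INSERT INTO item VALUES "
    "(1, 'tmp/a', 'cancelled'), (1, 'tmp/b', 'cancelled'), (1, 'out/c', 'finalized')"
)
conn.commit()

# 'tmp/b' is already gone, as if an earlier attempt crashed halfway through cleanup.
storage = {"tmp/a": b"partial", "out/c": b"result"}

scanned_first = cleanup_cancelled(conn, 1, storage)
scanned_again = cleanup_cancelled(conn, 1, storage)
remaining = sorted(storage)
```

Both runs scan the same two marked items, and the second run changes nothing, which is what makes it safe to call from a retry.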

④ The terminal transition goes dead last

The principle that ties these together: flipping the job to cancelled (a terminal state) must happen after refund and cleanup are both done.

Why? If you die in the middle, the job stays non-terminal, and a re-invocation can re-enter the same teardown. The duplicate-call guard (“already cancelled”) must fire only once the job has reached the terminal state. So you must not swallow a cleanup failure and jump to terminal. Let the failure propagate as an exception, which preserves the “not done yet, come back in” signal.

Put differently, the terminal state is the one moment of truth that says “now it’s really done.” Don’t leave unfinished work in front of it.
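The ordering can be sketched as a small orchestration function. Everything here is illustrative: the job is a plain dict, and refund and cleanup are passed in as callables that are assumed to be idempotent in the way sections ② and ③ describe.

```python
from typing import Callable


def cancel_job(job: dict, refund: Callable[[], None], cleanup: Callable[[], None]) -> str:
    """Tear a job down; the terminal status is written only after everything else."""
    if job["status"] == "cancelled":  # duplicate-call guard: fires only once terminal
        return "already cancelled"
    job["cancel_requested"] = True    # the worker stops at the next item boundary
    refund()                          # assumed idempotent (marker + balance together)
    cleanup()                         # assumed idempotent; failures propagate
    job["status"] = "cancelled"       # terminal transition, dead last
    return "cancelled"


# Hypothetical stand-ins that only count calls; cleanup fails on its first attempt.
job = {"status": "running", "cancel_requested": False}
calls = {"refund": 0, "cleanup": 0}


def refund() -> None:
    calls["refund"] += 1


def flaky_cleanup() -> None:
    calls["cleanup"] += 1
    if calls["cleanup"] == 1:
        raise RuntimeError("storage unavailable")


try:
    first = cancel_job(job, refund, flaky_cleanup)
except RuntimeError:
    first = "failed"
status_after_failure = job["status"]  # still non-terminal, so a retry can re-enter

second = cancel_job(job, refund, flaky_cleanup)  # retry completes the teardown
third = cancel_job(job, refund, flaky_cleanup)   # the guard fires now
```

Note that the retry calls refund a second time. That is only acceptable because the refund step is idempotent; with a non-idempotent refund, this ordering would reintroduce the double-refund bug.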

⑤·⑥ The other two races

Takeaways


Authorship & citation: Written by Ascendy Engineering; quotable with attribution. Found something wrong? Let us know via a GitHub issue.


Tags: idempotency, distributed-systems, race-condition, transactions, reliability