backend

The green light was lying — an ERROR right next to succeeded

· Ascendy Engineering


TL;DR

Source note. This post distills a backend-team intake (docs/intake/from-backend/2026-05-31-silent-primary-write-failure-masked-by-dual-write.md). Collection, field, and module names and operational scale numbers are generalized. Its sibling in the same “silent failure” family is “ERROR was visible, only INFO vanished”: there the logs disappeared; here the success verdict lied.

An ERROR next to succeeded

While looking at something else, I noticed the worker logging the same ERROR on every reindex. It was trying to write a field to the vector DB and getting rejected — “no such field on the collection, and dynamic fields are off.” But something was off.

On the very next line, that same task ended as succeeded.

The task reports success, yet inside it a write is being rejected. That single contradiction was the clue — the success verdict wasn’t reflecting what actually happened.

Trap 1 — the dual-write hid the failure

When I traced it, I found two paths writing the text-search vectors: the authoritative primary collection, and a secondary multi-vector collection being rolled out alongside it (additive dual-write). The problem was where success got decided.

import logging

logger = logging.getLogger(__name__)

def reindex(media_id):
    try:
        primary_upsert(media_id, ...)     # fails every time; the except below only logs an ERROR
    except Exception as e:
        logger.error("primary write failed: %s", e)   # swallowed: execution continues
    secondary_dual_write(media_id, ...)   # if THIS succeeds, the task ends as succeeded
    # → metrics/status stay green while the primary quietly empties

The primary write caught its error in a try/except, logged it, and moved on, and the task’s success or failure was decided solely by the dual-write path. So the primary could fail every time and the task would still end as succeeded: a silent failure, invisible unless you grep the logs.

Here’s the irony in the design. The dual-write existed precisely to keep a failure “from affecting the main path.” But when the direction flipped — when the masked side turned out to be the primary — the same safety device became a device that hides failures.
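One way to close that gap is to let the primary write alone decide the task verdict and keep only the secondary write as best-effort. The following is a minimal sketch under assumptions, not the change that was actually made: the two write functions are passed in as plain callables, and the `ReindexResult` type is hypothetical.

```python
import logging
from dataclasses import dataclass
from typing import Callable

logger = logging.getLogger(__name__)


@dataclass
class ReindexResult:
    # Hypothetical result type: one flag per write path, so neither can mask the other.
    primary_ok: bool
    secondary_ok: bool

    @property
    def succeeded(self) -> bool:
        # The authoritative path alone decides the verdict.
        return self.primary_ok


def reindex(
    media_id: str,
    primary_upsert: Callable[[str], None],
    secondary_dual_write: Callable[[str], None],
) -> ReindexResult:
    primary_ok = True
    try:
        primary_upsert(media_id)
    except Exception:
        # Still logged, but the failure is now recorded instead of swallowed.
        logger.exception("primary write failed for %s", media_id)
        primary_ok = False

    secondary_ok = True
    try:
        secondary_dual_write(media_id)
    except Exception:
        # Best-effort path: log and continue, but keep the outcome visible.
        logger.exception("secondary write failed for %s", media_id)
        secondary_ok = False

    return ReindexResult(primary_ok, secondary_ok)
```

With this shape, a task whose primary write is rejected reports `succeeded == False` even when the dual-write goes through, so the status and the logs can no longer disagree.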

Trap 2 — the create-guard is silent about schema evolution

Why was the primary write rejected? The collection-creation code looked like this.

def get_collection(name: str):
    if not has_collection(name):          # ← the branch
        schema = build_current_schema()   #   never runs for an existing collection
        create(name, schema)
        create_index(name)
    return load(name)

if not has_collection looks idempotent. But there’s a trap: if the collection already exists, the schema definition is ignored entirely. A new field had been added to the code schema months ago, but a dev collection created before that kept its old schema. A vector DB (like schema-on-write stores in general) does not alter an existing collection just because the schema in code changed, and this path had no equivalent of the ALTER step an RDB migration would run. So the code schema and the live-data schema silently drifted apart and froze that way.
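A create-guard can be made loud about drift by checking the live schema whenever the collection already exists. The sketch below assumes a hypothetical `InMemoryStore` that stands in for the vector DB client and tracks only field names; `SchemaDriftError` and the three-argument `get_collection` signature are likewise illustrative, not the service’s real API.

```python
from typing import Dict, List


class SchemaDriftError(RuntimeError):
    """Raised when an existing collection lacks fields the code expects."""


class InMemoryStore:
    # Hypothetical stand-in for a vector DB client: collection name -> field names.
    def __init__(self) -> None:
        self._collections: Dict[str, List[str]] = {}

    def has_collection(self, name: str) -> bool:
        return name in self._collections

    def create(self, name: str, fields: List[str]) -> None:
        self._collections[name] = list(fields)

    def describe(self, name: str) -> List[str]:
        return list(self._collections[name])


def get_collection(store: InMemoryStore, name: str, expected_fields: List[str]) -> List[str]:
    if not store.has_collection(name):
        # Fresh collection: created from the current code schema.
        store.create(name, expected_fields)
        return store.describe(name)

    # Existing collection: compare instead of silently trusting it.
    live = store.describe(name)
    missing = [f for f in expected_fields if f not in live]
    if missing:
        raise SchemaDriftError(f"collection {name!r} is missing fields: {missing}")
    return live
```

The guard deliberately raises rather than repairing anything. Whether to migrate, recreate, or tolerate the drift is left as a decision for a person.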

I pinned it with a measurement, not a guess. I queried the live collection schema directly.

# expected: [id_field, tenant_filter_field, text_field, vector_field]
# actual:   [id_field, text_field, vector_field]   ← tenant_filter_field missing = drift confirmed

The field the code wanted wasn’t in the live data. And the search path filtered on that same field — so reads were broken on this collection too, not just writes.
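The comparison itself is small enough to keep as a helper. This sketch assumes the expected and live field names are already available as lists of strings; `diff_fields` is a hypothetical name, and how the live names are fetched depends on the client in use.

```python
from typing import Dict, Iterable, List


def diff_fields(expected: Iterable[str], actual: Iterable[str]) -> Dict[str, List[str]]:
    """Report fields the code expects but the live schema lacks, and the reverse."""
    expected_list = list(expected)
    actual_list = list(actual)
    expected_set = set(expected_list)
    actual_set = set(actual_list)
    return {
        "missing": [f for f in expected_list if f not in actual_set],
        "unexpected": [f for f in actual_list if f not in expected_set],
    }


expected = ["id_field", "tenant_filter_field", "text_field", "vector_field"]
actual = ["id_field", "text_field", "vector_field"]
report = diff_fields(expected, actual)
print(report)  # {'missing': ['tenant_filter_field'], 'unexpected': []}
```

A non-empty `missing` list is the drift signal. The same call works unchanged for a read-only comparison against another environment.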

The fix was data, not code

The code schema had been right all along. What was wrong was the live data that had been created and frozen earlier. So the fix was a one-off data operation, not a code change — drop the drifted dev collection, recreate it through the service’s own creation path (same schema + index + load). The bulk reindex already running refilled the new collection, and the primary write started logging success right after (the ERROR vanished).

One step remained. Did prod have the same drift? If so, prod’s search was broken too. So I queried the prod collection schema read-only, once, and compared the field set. Prod already had the field and was loaded at a normal scale, so the drift was dev-only and prod was left untouched. Refusing to assume “fixed in dev, so prod must be the same” was itself part of the work.

Takeaways


Authorship & citation: Written by Ascendy Engineering; quotable with attribution. Found something wrong? Let us know via a GitHub issue.


Tags: vector-database, schema-migration, observability, debugging, idempotency, silent-failure