

We dropped the reranker from vector search — what 'find all the baby photos' broke

Ascendy Engineering


TL;DR

Source note. Written from an operator interview. Model specs were fact-checked at publish time against official sources (HF Qwen3-VL-Embedding, the unified-framework arXiv paper); internal collection/infra identifiers and exact dimension tuning are generalized (model names are public). The product motivation connects to photos: the problem was never storage, it was finding.

Two decisions, different reasons

Reworking the search stack, we made two decisions — and they’re separate. They’re easy to conflate, so let’s split them first.

The assumption “find all the baby photos” broke

While testing, I realized the reranker had to go. One scenario broke the assumption.

“Find all the photos with a baby in them.”

A query like that has to return thousands. And here’s the key distinction — that’s a recall problem, not a precision one. Success isn’t “how accurate are the top 10,” it’s “did we pull back everything there is.”

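A reranker is structurally wrong for bulk retrieval like that. It rescores a short candidate list, so it can reorder what was already retrieved but cannot bring back anything that was left out of that list.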

In one line: a reranker is a precision@(small k) tool, not a “find all of it” recall tool. We’d been using it for the wrong job.
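To make the precision-versus-recall distinction concrete, here is a minimal sketch in Python. The library size, the counts, and the two pipelines are made-up illustrations, not figures from our system.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    top_k = ranked_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for i in top_k if i in relevant_ids) / len(top_k)


def recall(returned_ids, relevant_ids):
    """Fraction of all relevant items that were returned at all."""
    if not relevant_ids:
        return 1.0
    return len(set(returned_ids) & relevant_ids) / len(relevant_ids)


# Hypothetical library: photo ids 0..2999 are the ones with a baby in them.
relevant = set(range(3000))

# Pipeline A (hypothetical): only the top 100 candidates reach the reranker
# and only those are returned. Every result is right, most are missing.
reranked_top_100 = list(range(100))

# Pipeline B (hypothetical): everything above a similarity cutoff is returned.
# It lets in 100 false positives (ids 3000..3099) but finds 2,950 of 3,000.
bulk_results = list(range(2950)) + list(range(3000, 3100))

p10_rerank = precision_at_k(reranked_top_100, relevant, 10)
r_rerank = recall(reranked_top_100, relevant)
p10_bulk = precision_at_k(bulk_results, relevant, 10)
r_bulk = recall(bulk_results, relevant)

print(f"top-100 rerank: precision@10={p10_rerank:.2f} recall={r_rerank:.3f}")
print(f"bulk cutoff:    precision@10={p10_bulk:.2f} recall={r_bulk:.3f}")
```

Both pipelines look identical on precision@10. Only recall shows that the first one returned 100 of the 3,000 photos the user asked for.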

“So just branch by query type?” You can’t

The most natural objection lands here. “Turn the reranker off for bulk queries (‘find all’), and on for pinpoint queries (‘the photo of my son blowing out the candles’).” Branch by query type. We considered it — and rejected it. Two reasons.

One: natural language doesn’t cleanly separate conditions. A single sentence can mix an include and an exclude — “skip the ones with a dog, but show me all the baby photos.” Having an LLM perfectly decompose that and route “this part is bulk, this part is pinpoint” is close to impossible. Get it wrong once and the user gets the wrong result.

Two: even if you could split it, you can’t infer intent. “Show me all” means literally every one of the thousands for one person and just “show me a lot” for another. If the system guesses the intent behind the same phrase, half your users get something other than what they expected every time. From natural language alone, you can’t read an intent that diverges under identical wording, not safely and not consistently.

So instead of branching, we removed the reranker outright. What backed the decision: precision held up on the embedding alone — search quality was fine from the moment we pulled it.

The replacement: a two-stage search built on MRL (precision without a reranker)

We thought about how to shore up precision after dropping the reranker, and it was easier than expected. Qwen3-VL-Embedding supports MRL (Matryoshka Representation Learning): a single embedding vector keeps its meaning even when truncated. Like a matryoshka doll, the leading slice of dimensions is itself a usable embedding. Per the official spec, the embedding dimension goes up to 4096 (2048 for the 2B model, 4096 for the 8B), with user-defined output dimensions anywhere from 64 to 4096.
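Mechanically, using a Matryoshka embedding at a lower dimension means keeping the leading components and re-normalizing. The sketch below shows only that step, in plain Python. The function names and the 6-dimensional toy vectors are hypothetical, and this is not the Qwen3-VL-Embedding API. The toy numbers also do not come from an MRL-trained model, which is what makes the short prefix meaningful in practice.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and rescale to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # nothing to rescale
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins for a photo embedding and a query embedding.
photo = [3.0, 4.0, 1.0, 2.0, 0.5, 0.5]
query = [3.0, 3.5, 0.5, 2.5, 1.0, 0.0]

photo_2d = truncate_embedding(photo, 2)  # leading slice, unit length
full_score = cosine(query, photo)
short_score = cosine(truncate_embedding(query, 2), photo_2d)

print(photo_2d, round(full_score, 4), round(short_score, 4))
```

The re-normalization matters because a truncated slice of a unit vector is no longer unit length, so dot-product scores would otherwise not be comparable across items.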

We built a two-stage search out of that.

  1. Low-dim first-pass filter — sweep the whole set fast and cheap at a low dimension to narrow candidates. Low dimensions make the comparison light.
  2. High-dim refine — re-compare only the narrowed candidates at a higher dimension to sharpen the ranking.

Without a separate cross-encoder reranker model, we get a reranker-like effect inside a single embedding model. The logic differs, though: this is two-stage dense retrieval at different dimensions, not a cross-encoder re-scoring query-document pairs. The coarse-to-fine pass lifts precision, and because the low-dim first pass sweeps the whole set, it does not block bulk queries.
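The two stages can be sketched as follows. This is a brute-force illustration in plain Python under stated assumptions, not our production code: the names, the 4-dimensional toy vectors, and the choice of a similarity threshold for the first pass (rather than a fixed top-k) are all hypothetical. A real deployment would precompute the low-dim vectors and use a vector index.

```python
import math


def normalize(vec):
    """Rescale a vector to unit length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def two_stage_search(query, corpus, low_dim, coarse_threshold):
    """Return ids that pass a low-dim filter, ranked by full-dim similarity.

    corpus maps id -> full embedding. The threshold in stage 1 is an
    assumption of this sketch: it keeps every plausible match instead of
    capping the candidate count, so bulk queries are not cut off.
    """
    # Stage 1: cheap sweep over the whole set using only the leading dims.
    q_low = normalize(query[:low_dim])
    candidates = [
        doc_id
        for doc_id, vec in corpus.items()
        if dot(q_low, normalize(vec[:low_dim])) >= coarse_threshold
    ]
    # Stage 2: re-compare only the survivors at the full dimension.
    q_full = normalize(query)
    scored = [(dot(q_full, normalize(corpus[doc_id])), doc_id) for doc_id in candidates]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored]


# Toy corpus with hypothetical ids and hand-written embeddings.
corpus = {
    "baby_1": [0.9, 0.1, 0.0, 0.4],
    "baby_2": [0.8, 0.2, 0.1, 0.1],
    "dog_1": [0.1, 0.9, 0.3, 0.0],
    "beach_1": [0.0, 0.1, 0.9, 0.2],
}
query = [1.0, 0.0, 0.0, 0.3]

results = two_stage_search(query, corpus, low_dim=2, coarse_threshold=0.7)
print(results)
```

Stage 1 decides what comes back at all, and stage 2 only decides the order. That split is why the refine step cannot hurt recall the way a top-k reranker cutoff can.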

The result: precision held, bulk works, and the stack got simpler by one model.

Takeaways


Authorship & citation: Written by Ascendy Engineering; quotable with attribution. Found something wrong? Let us know via a GitHub issue.


Tags: vector-search, embeddings, reranker, matryoshka, retrieval, rag