
infra

The first question in cost optimization isn't hardware — a self-hosted AI inference war story, part 2

Ascendy Engineering


TL;DR

Source note. The timeline’s primary source is operational memory. Even where config files remain in the model repo, the post only covers models actually served today — file present ≠ in use. That was the trap from part 1.

Background — we moved to managed, and cost leaked again

Part 1 concluded that self-hosting GPU inference has hidden break-even costs, so we retreated to managed GPU. But we moved the workload and kept the same operating style — we simply ran two GPU pods 24/7.

At the time we didn’t know serverless GPU was an option. Pods stayed alive even during zero-traffic hours, hourly billing piled up, and the invoice was a shock. Moving to managed doesn’t make the fixed-cost problem disappear — if you don’t split the workload, it leaks on managed just the same.
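To make the leak concrete, here is a rough back-of-the-envelope sketch. The hourly rate and the busy-hours figure are hypothetical placeholders, not our actual invoice numbers; only the "two pods, 24/7" shape comes from the setup described above.

```python
HOURS_PER_MONTH = 24 * 30  # simplified 30-day month


def always_on_cost(pods: int, hourly_rate: float, hours: float = HOURS_PER_MONTH) -> float:
    """Always-on pods bill for every hour, whether or not traffic arrives."""
    return pods * hourly_rate * hours


def on_demand_cost(busy_hours: float, hourly_rate: float) -> float:
    """On-demand billing charges only for hours the GPU is actually working."""
    return busy_hours * hourly_rate


# Hypothetical numbers, for illustration only.
RATE = 1.0         # currency units per GPU-hour (assumed)
BUSY_HOURS = 36.0  # GPU-hours of real work per month (assumed)

fixed = always_on_cost(pods=2, hourly_rate=RATE)
elastic = on_demand_cost(BUSY_HOURS, RATE)
idle_share = 1 - BUSY_HOURS / (2 * HOURS_PER_MONTH)

print(f"always-on: {fixed:.0f}, on-demand: {elastic:.0f}, idle share: {idle_share:.1%}")
```

With these assumed inputs almost all of the always-on bill pays for idle time, which is the shape of the problem regardless of the real rate.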

The key decision — splitting workloads by latency budget

Looking for a fix, we learned about serverless GPU. We chose to split not by “hardware tier” but by “how much latency this task can tolerate.” The judgment compresses to one question:

Mermaid source
flowchart TD
  W[Workload] --> Q{Does cold start<br/>land on the user's<br/>perceived path?}
  Q -->|Yes · search wait| P[Always-on pod]
  Q -->|No · async| S[Serverless<br/>on-demand]
  P --> E[Embeddings<br/>vector search ~0 latency]
  P --> F[Face models<br/>free ride on spare VRAM]
  S --> C[Image captioning<br/>accepts cold start]

Three workloads split along this line:

| Workload | Latency tolerance | On the user’s perceived path? | Placement |
| --- | --- | --- | --- |
| Embeddings (vector search) | ~0 | Yes (search wait) | Always-on pod |
| Face models (detect + recognize) | low | background | Always-on pod (free ride on spare VRAM, TensorRT) |
| Image captioning | tens of seconds OK | No (async) | Serverless (accepts cold start) |

The embedding and captioning models come from the same family but serve different purposes: the embedding model produces vectors and stays resident on a pod, while the captioning model generates text and runs on demand on serverless.

Accepting cold start by “converting it into a cost”

Serverless cold start isn’t free — the first request is slow. But that slowness is effectively free if it doesn’t sit on the path the user perceives. Captioning is async, so its cold start sits outside the user experience.
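A minimal sketch of what “outside the user experience” means in code. Everything here is hypothetical: `caption_on_serverless` is a stand-in that just sleeps to simulate a cold start (scaled down to 0.2 s), and `handle_upload` is an assumed request handler, not our actual service code. The point is only that the handler returns before the slow call finishes.

```python
import asyncio
import time

COLD_START_S = 0.2  # stand-in for a serverless cold start (assumed, scaled down)


async def caption_on_serverless(photo_id: str) -> str:
    """Hypothetical stand-in for a serverless captioning call."""
    await asyncio.sleep(COLD_START_S)  # simulated cold start plus inference
    return f"caption for {photo_id}"


async def handle_upload(photo_id: str, captions: dict, jobs: set) -> float:
    """Acknowledge the upload immediately; captioning continues in the background."""
    started = time.perf_counter()

    async def run_job() -> None:
        captions[photo_id] = await caption_on_serverless(photo_id)

    task = asyncio.create_task(run_job())
    jobs.add(task)  # keep a reference so the task is not garbage-collected
    task.add_done_callback(jobs.discard)
    return time.perf_counter() - started  # latency the user actually perceives


async def main() -> tuple:
    captions, jobs = {}, set()
    perceived = await handle_upload("photo-1", captions, jobs)
    await asyncio.gather(*jobs)  # in a real service the caption simply shows up later
    return perceived, captions


perceived_latency, captions = asyncio.run(main())
print(f"user-perceived latency: {perceived_latency:.4f}s, captions: {captions}")
```

The cold start still happens and is still billed, but it is absorbed by the background task rather than by the request the user is waiting on.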

So the judgment became not “has cold start = can’t use serverless,” but “which task’s which path does the cold start land on.” Only tasks where it lands somewhere harmless went to serverless. One-line rule:

Only tasks where cold start lands "off the user's perceived path" go serverless.
Everything else (especially the search-wait path) stays on an always-on pod.
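The rule can be sketched as a small placement function. The type and field names are illustrative, and the `fits_spare_pod_vram` flag is one reading of the “free ride on spare VRAM” row for face models (a background task that still lands on the pod because the capacity is already paid for), not a rule stated explicitly above.

```python
from dataclasses import dataclass
from enum import Enum


class Placement(Enum):
    ALWAYS_ON_POD = "always-on pod"
    SERVERLESS = "serverless"


@dataclass(frozen=True)
class Workload:
    name: str
    cold_start_on_user_path: bool      # would the user wait on a cold start?
    fits_spare_pod_vram: bool = False  # assumption: can ride on an already-paid pod


def place(w: Workload) -> Placement:
    """Latency budget first; hardware tier never enters the decision."""
    if w.cold_start_on_user_path:
        return Placement.ALWAYS_ON_POD
    if w.fits_spare_pod_vram:
        return Placement.ALWAYS_ON_POD  # no extra instance needed
    return Placement.SERVERLESS


workloads = [
    Workload("embeddings", cold_start_on_user_path=True),
    Workload("face models", cold_start_on_user_path=False, fits_spare_pod_vram=True),
    Workload("image captioning", cold_start_on_user_path=False),
]

placements = {w.name: place(w) for w in workloads}
for name, placement in placements.items():
    print(f"{name}: {placement.value}")
```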

Scale — designing for later, not just now

We’re at a low-traffic stage, so serverless’s “pay nothing when idle” benefit is largest right now. But the structure holds even as traffic grows.

When photos arrive in bulk (above a certain threshold count), we spin up a high-performance GPU and raise the batch size to lift throughput. Processing the same photo count on a large GPU in a short burst can cost less than grinding through it on a small GPU for a long time: shorter processing means fewer billed intervals, and larger batches mean higher GPU utilization. This holds as long as the throughput gain outpaces the large GPU’s higher hourly rate. The result is a design that covers both extremes on cost: serverless zero-billing at low traffic, high-performance GPU batches at bulk inflow.
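The bulk-inflow arithmetic can be sketched as follows. All throughput figures, hourly rates, and the threshold value are hypothetical; the takeaway is the break-even condition, namely that the large GPU wins only when its speedup exceeds its price ratio.

```python
import math


def job_cost(items: int, items_per_second: float, hourly_rate: float,
             billing_increment_s: float = 1.0) -> float:
    """Cost of one batch job, rounded up to the provider's billing increment."""
    seconds = items / items_per_second
    billed_s = math.ceil(seconds / billing_increment_s) * billing_increment_s
    return billed_s / 3600 * hourly_rate


def pick_gpu(pending_photos: int, bulk_threshold: int) -> str:
    """Switch to the large GPU only once the backlog crosses the threshold."""
    return "large" if pending_photos >= bulk_threshold else "small"


# Hypothetical profiles: the large GPU costs 3x per hour but, with a bigger
# batch size, processes 5x as many photos per second.
SMALL = {"items_per_second": 10.0, "hourly_rate": 1.0}
LARGE = {"items_per_second": 50.0, "hourly_rate": 3.0}
BULK_THRESHOLD = 1000  # assumed; the real threshold is workload-specific

photos = 36_000
small_cost = job_cost(photos, **SMALL)
large_cost = job_cost(photos, **LARGE)

print(f"small GPU: {small_cost:.2f}, large GPU: {large_cost:.2f}")
print(f"backlog of {photos} photos -> {pick_gpu(photos, BULK_THRESHOLD)} GPU")
```

If the speedup were only 2x at a 3x price, the same calculation would favor the small GPU, so the threshold and GPU choice are worth re-measuring whenever batch size or pricing changes.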

Decision / tradeoffs

If part 1 was “the hidden cost of self-hosting,” part 2 is “even after moving to managed, fixed cost leaks if you don’t split the workload, and the axis to split on is latency.”

What we gave up is captioning’s first-response latency (cold start). But that was a cost the user doesn’t notice anyway. The most expensive mistake in cost optimization is reaching first for “find cheaper hardware” — ask “can this task tolerate a cold start” first, and moving one tolerant task to serverless deletes an entire always-on instance.

What’s next


Authorship & citation: This post was written by Ascendy Engineering and may be re-cited with attribution. If you find an error, please let us know via a GitHub issue.


Tags: gpu, inference, serverless, triton, vllm, cost-optimization, latency-budget, war-story