Notes on building with LLMs in production
What survived contact with real users: evals, guardrails, latency budgets and the unglamorous plumbing.
Most teams treat the model as the product. They wire a prompt to an endpoint, ship a chat box, and assume the hard part is over. The hard part hasn't started. The model is the easy 20 percent — a few weeks of prompt tuning and a provider key. The other 80 percent is everything around it: knowing when it's wrong, stopping it when it's dangerous, keeping it fast enough to use, and feeding it clean inputs. That work doesn't demo well, which is exactly why it's where production systems live or die.
I've shipped a handful of LLM features that real users touched every day. The ones that lasted weren't the ones with the cleverest prompts. They were the ones with the boring scaffolding underneath. Here's what survived.
Evals are the unit tests you can't skip
You cannot improve what you can't measure, and you cannot measure an LLM by eyeballing a few outputs in a playground. The first thing I build for any LLM feature is an eval set — fifty to two hundred real inputs with a known-good judgment for each. Not synthetic prompts. Actual user queries, including the ugly ones: the typo-ridden, the half-finished, the adversarial, the ones in the wrong language.
The eval set is what turns "the new prompt feels better" into "the new prompt is 91 percent correct, up from 84, but it regressed on the refund cases." Without it, every prompt change is a vibe, and you ship regressions you won't notice until a user does.
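A minimal sketch of what that scoring can look like, assuming each eval case carries a category label and that an exact match (after trimming and lowercasing) counts as correct. The `EvalCase` fields, the category names, and the `predict` callable are illustrative assumptions, not a prescribed harness.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    query: str      # a real user input, typos and all
    expected: str   # the known-good judgment for that input
    category: str   # hypothetical label, e.g. "refund" or "shipping"


def score(predict: Callable[[str], str], cases: list[EvalCase]) -> dict[str, float]:
    """Return accuracy overall and per category."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for case in cases:
        got = predict(case.query).strip().lower()
        correct = got == case.expected.strip().lower()
        for key in ("overall", case.category):
            totals[key] += 1
            hits[key] += int(correct)
    return {key: hits[key] / totals[key] for key in totals}


def regressions(old: dict[str, float], new: dict[str, float]) -> list[str]:
    """Names of the slices where the new prompt scores lower than the old one."""
    return sorted(key for key in old if key in new and new[key] < old[key])
```

Comparing `score(old_prompt, cases)` with `score(new_prompt, cases)` through `regressions` is what surfaces the case where the overall number goes up while one category quietly gets worse.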
A few things I've learned the hard way:
- Grade on the cases you'll be judged on. If support escalations are the cost, weight your eval set toward the queries that escalate.
- Keep a frozen holdout. It's easy to overfit a prompt to the cases you stare at every day.
- Re-run evals on every prompt change, every model version, every dependency bump. Providers update models under the same name, and your "stable" feature drifts without a single line of your code changing.
That last point catches people. A silent model update once dropped one of my classifiers from 92 to 88 percent overnight. The eval caught it the same morning. Without it, I'd have learned from a customer.
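One way to catch that kind of drift is a scheduled check against a stored baseline. This is a sketch under assumptions: the baseline is a plain dict of scores kept alongside the code, and the two-point tolerance is an arbitrary illustrative value rather than a recommendation.

```python
def drift_alerts(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,  # hypothetical: allowed drop before alerting
) -> list[str]:
    """Describe every metric that fell more than `tolerance` below its baseline."""
    alerts = []
    for name, base in sorted(baseline.items()):
        now = current.get(name)
        if now is None:
            # A metric that disappeared is as suspicious as one that dropped.
            alerts.append(f"{name}: no current score")
        elif base - now > tolerance:
            alerts.append(f"{name}: {base:.2f} -> {now:.2f}")
    return alerts
```

Run on a schedule with no code change in between, a non-empty result points at something outside your repository, such as a provider-side model update.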
Guardrails: assume the model will betray you
A model that's right 95 percent of the time is wrong one call in twenty. At scale that's not an edge case, it's a daily occurrence, and the wrong outputs are not all equal: most are harmless, a few are catastrophic. The job is to make the catastrophic ones impossible to act on.
The pattern that works is treating the model's output as untrusted input, the same way you'd treat anything from a browser. Validate the shape. Check the confidence. Gate the consequential actions behind a deterministic layer that the model cannot talk its way past.
Confidence gates the action, not the model.
The rule I hold to: the model can recommend, but a deterministic check decides anything that spends money, sends a message, or can't be undone. Anything irreversible needs a human or a hard rule between the model's suggestion and the consequence. This isn't distrust of the model. It's the same reason you validate form input even though most users type honestly.
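As a sketch of that deterministic layer, assume the model is asked to reply with a JSON object containing `action`, `confidence`, and (for refunds) `amount`. The field names, the action list, and both thresholds below are hypothetical; the point is only that code, not the model, makes the final call.

```python
import json
from dataclasses import dataclass

# All of these values are illustrative assumptions.
ALLOWED_ACTIONS = {"refund", "reply", "escalate"}
IRREVERSIBLE = {"refund"}
MIN_CONFIDENCE = 0.85
MAX_AUTO_REFUND = 50.00


@dataclass(frozen=True)
class Decision:
    action: str
    reason: str


def _is_number(value: object) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def gate(raw_output: str) -> Decision:
    """Validate untrusted model output; anything doubtful goes to a human."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return Decision("escalate", "output was not valid JSON")
    if not isinstance(data, dict):
        return Decision("escalate", "output was not a JSON object")

    action = data.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return Decision("escalate", "unknown action")

    confidence = data.get("confidence")
    if not _is_number(confidence) or not 0 <= confidence <= 1:
        return Decision("escalate", "missing or invalid confidence")
    if confidence < MIN_CONFIDENCE:
        return Decision("escalate", "confidence below threshold")

    if action in IRREVERSIBLE:
        amount = data.get("amount")
        if not _is_number(amount) or not 0 < amount <= MAX_AUTO_REFUND:
            return Decision("escalate", "refund outside automatic limit")

    return Decision(action, "passed deterministic checks")
```

Every failure path lands on `escalate`, so a malformed or overconfident output degrades to a human review instead of an irreversible action.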
Treat every model output as untrusted input from a stranger who is usually right and occasionally, confidently, catastrophically wrong.
It helps to show the user what the model claimed and how sure it was, rather than presenting a verdict as fact.
Example of a claim shown to the user: "Eligible for refund under the 30-day policy."
Latency is a feature, and you have a budget
Users will forgive a model that's occasionally wrong. They will not forgive one that's slow every single time. A correct answer that takes nine seconds loses to a decent answer in two, because the slow one breaks the user's train of thought and they stop trusting the tool to keep up with them.
So I set a latency budget before writing the feature, the same way I'd set one for a page load. Then I spend it deliberately:
- Stream tokens the moment they exist. Perceived latency is what matters, and the first word arriving fast buys you enormous patience.
- Pick the smallest model that passes the evals. The frontier model is rarely worth triple the latency for a classification task a smaller one handles at 90 percent.
- Cache aggressively. A surprising share of real queries are near-duplicates. Semantic caching on common questions cut my median response time more than any model swap did.
- Do the slow work off the critical path. If a step doesn't block the user's next action, move it to a background job and reconcile after.
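To make the caching point concrete, here is a deliberately simplified sketch. It is not true semantic caching: it uses token-set (Jaccard) overlap as a stand-in for embedding similarity, and the class name, the 0.8 threshold, and the `call_model` callable are all assumptions for illustration.

```python
import re
from typing import Callable, Optional


def _tokens(text: str) -> frozenset[str]:
    # Lowercase and drop punctuation so trivial variations collapse together.
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))


class NearDuplicateCache:
    """Return a stored answer when a new query overlaps enough with an old one."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self._entries: list[tuple[frozenset[str], str]] = []

    def get(self, query: str) -> Optional[str]:
        q = _tokens(query)
        best_score, best_answer = 0.0, None
        for tokens, answer in self._entries:
            union = q | tokens
            similarity = len(q & tokens) / len(union) if union else 0.0
            if similarity > best_score:
                best_score, best_answer = similarity, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, query: str, answer: str) -> None:
        self._entries.append((_tokens(query), answer))


def answer(query: str, cache: NearDuplicateCache, call_model: Callable[[str], str]) -> str:
    """Serve from cache when possible; otherwise call the model and remember it."""
    cached = cache.get(query)
    if cached is not None:
        return cached
    result = call_model(query)
    cache.put(query, result)
    return result
```

The linear scan is fine for a sketch; a real deployment would swap the similarity function for embeddings and the list for an index, and would need an expiry policy so stale answers don't outlive the facts behind them.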
The mistake I see most often is treating latency as something to optimize after launch. By then the architecture has hardened around the slow path, and clawing back two seconds means a rewrite. The budget has to come first, because it dictates the model, the caching, and whether you stream — and those are not decisions you reverse cheaply.
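A budget only works if it is written down and checked. The sketch below assumes a per-stage split that adds up to two seconds; the stage names and numbers are hypothetical, and the timer is just a thin wrapper over `time.perf_counter`.

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator

# Hypothetical per-stage budget in seconds (sums to a 2-second total).
BUDGET_S = {"retrieval": 0.3, "first_token": 0.7, "rest_of_stream": 1.0}


class StageTimer:
    """Record how long each named stage of a request took."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self._clock = clock
        self.measured: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = self._clock()
        try:
            yield
        finally:
            # Recorded even if the stage raised, so failures still show up.
            self.measured[name] = self._clock() - start


def over_budget(measured: dict[str, float], budget: dict[str, float] = BUDGET_S) -> dict[str, float]:
    """Map each stage that blew its budget to the number of seconds it overran."""
    return {
        name: spent - budget[name]
        for name, spent in measured.items()
        if name in budget and spent > budget[name]
    }
```

Wrapping each step in `with timer.stage("retrieval"):` and logging `over_budget(timer.measured)` per request shows which stage is eating the budget before the architecture hardens around it.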
The unglamorous plumbing is most of the work
Everything above sits on infrastructure nobody puts in the demo. This is where the real time goes, and it's the part teams consistently underestimate.
Retries with backoff, because providers rate-limit and time out, and they do it most when your traffic is highest. Fallback to a second provider or a smaller model when the primary is down, so one vendor's outage isn't yours. Timeouts on every call, because a hung request holding a connection is worse than a fast failure. Structured logging of the full prompt, the response, the latency, and the cost of every call — you cannot debug what you didn't record, and "the model said something weird yesterday" is unanswerable without the log.
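A compact sketch of those pieces together, under stated assumptions: each provider is modeled as a plain callable that takes a prompt and a timeout and raises a hypothetical `ProviderError` on rate limits or timeouts. Real client libraries have their own exception types and timeout parameters, so treat the signatures here as placeholders.

```python
import json
import logging
import random
import time
from typing import Callable, Sequence

log = logging.getLogger("llm.calls")


class ProviderError(Exception):
    """Hypothetical stand-in for a provider's rate-limit or timeout error."""


# (prompt, timeout_s) -> response text; the provider is expected to honor the timeout.
Provider = Callable[[str, float], str]


def call_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Provider]],
    *,
    attempts: int = 3,
    timeout_s: float = 10.0,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each provider in order, retrying with exponential backoff and jitter."""
    for name, provider in providers:
        for attempt in range(attempts):
            start = time.monotonic()
            try:
                response = provider(prompt, timeout_s)
            except ProviderError as exc:
                log.warning(json.dumps(
                    {"provider": name, "attempt": attempt + 1, "error": str(exc)}
                ))
                if attempt + 1 < attempts:
                    # Full jitter: wait a random time up to the exponential cap.
                    sleep(random.uniform(0, base_delay_s * 2 ** attempt))
                continue
            log.info(json.dumps({
                "provider": name,
                "attempt": attempt + 1,
                "latency_s": round(time.monotonic() - start, 3),
                "prompt": prompt,
                "response": response,
            }))
            return response
    raise ProviderError("all providers failed")
```

Passing `sleep` in as a parameter keeps the retry logic testable without real waits, and logging one JSON object per call makes "what did the model say yesterday" a query rather than a guess.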
And cost, which is latency's quiet twin. Token spend is a real line item, and it scales with usage in a way that surprises finance the first month. Log it per call, attribute it per feature, and you'll find the one endpoint quietly burning the budget on prompts three times longer than they need to be.
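Per-feature attribution can be as simple as the sketch below. The model names and per-million-token prices are made-up placeholders, and it assumes the provider reports input and output token counts for each call.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable

# Hypothetical prices in USD per million tokens: (input, output).
PRICES = {
    "small-model": (0.50, 1.50),
    "large-model": (5.00, 15.00),
}


@dataclass(frozen=True)
class CallRecord:
    feature: str        # which product feature made the call
    model: str          # key into PRICES
    input_tokens: int
    output_tokens: int


def call_cost(record: CallRecord) -> float:
    """Cost of a single call in USD."""
    in_price, out_price = PRICES[record.model]
    return (record.input_tokens * in_price + record.output_tokens * out_price) / 1_000_000


def cost_by_feature(records: Iterable[CallRecord]) -> dict[str, float]:
    """Total spend per feature, so the expensive endpoint stands out."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[record.feature] += call_cost(record)
    return dict(totals)
```

Sorting the result by value, or dividing by call count, is usually enough to find the endpoint whose prompts are far longer than they need to be.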
None of this is AI work. It's the same distributed-systems discipline that any service talking to a flaky third party has always needed. The LLM doesn't excuse you from it — it raises the stakes, because the dependency is slower, pricier, and less predictable than any API you've integrated before.
The teams that win with LLMs aren't the ones with the best prompt. They're the ones who treated the model as one unreliable component in a system they actually engineered — and spent their effort on the scaffolding that makes an unreliable component safe to depend on.
