AIDec 1, 20257 min read

Prompts are code, so treat them that way

The moment a prompt decides product behavior, it deserves the same rigor as the code around it.

Most teams treat the prompt as content, not code. It lives in a Notion doc, gets pasted into a config field, or sits as a raw string three indents deep in a file nobody reviews. Someone tweaks a sentence on a Friday to fix one edge case, ships it, and on Monday a different feature quietly breaks. No diff, no test, no alert. Just a model behaving differently because a few words changed and nobody treated those words as a change.

Here is the line I draw: the moment a prompt decides what the product does — what gets approved, flagged, summarized, routed, or refused — it is no longer copy. It is logic. And logic that ships to users earns the same rigor as the function call sitting next to it. Versioning, review, tests, observability. All of it.

A prompt is a function with a fuzzy compiler

We accept enormous discipline around the deterministic parts of a system and almost none around the part that actually steers behavior. A regex that classifies support tickets gets a code review. The prompt that classifies support tickets gets pasted into the LLM dashboard by whoever had the tab open.

That asymmetry made sense when prompts were demos. It stopped making sense the moment a prompt became the thing standing between a user and a decision. A prompt has inputs, produces outputs, and has edge cases that break it. That is a function. The compiler is just probabilistic, which makes the discipline more important, not less — you cannot read the source and reason your way to the behavior. You have to observe it.

So put it where functions live. In the repo, in version control, behind a pull request. When a prompt change shows up as a diff next to the code that consumes it, a reviewer can see that you loosened the classification threshold and tightened the refusal wording in the same commit. That is a conversation worth having before it reaches a user, not after.

A prompt that decides product behavior is logic with a probabilistic compiler — and you don't ship logic without a diff.

Test the behavior, not the wording

The objection I hear is that you can't unit-test a prompt because the output isn't deterministic. True, and beside the point. You don't assert the exact string. You assert the properties that have to hold.

If a prompt's job is to classify an expense as reimbursable or not, you don't check that it returned a specific sentence. You check that a known-reimbursable receipt comes back reimbursable, that a known-personal one does not, and that a genuinely ambiguous one returns low confidence instead of a confident guess. Build a set of cases that encode what the prompt is for, run them on every change, and watch the pass rate move.

→Pin the cases that have burned you before. Every production incident becomes a permanent test.
→Track confidence, not just correctness. A right answer for the wrong reason fails quietly next time.
→Treat a regression as a regression. If the change drops three cases to win one, it doesn't ship.

This is the part teams skip, and it's the part that pays. The eval suite is what turns "the new prompt feels better" into "the new prompt passes 47 of 50 and the three failures are the rare ones we accept." One of those sentences belongs in a release decision. The other is a vibe.

Make the runtime observable and the threshold explicit

Code you can't see in production is code you don't really control, and prompts are worse than most because the same input can drift as the model behind it updates. You need to log what went in, what came back, and how confident the model was — then act on that confidence in code, not in prose.

The reflex is to write "if you are unsure, ask for clarification" into the prompt and trust it. Sometimes it works. But the decision to escalate is product logic, and product logic belongs in the language with the actual control flow, where it is testable and can't be talked out of by a clever input.

classify.ts


const result = await classify(receipt);
if (result.confidence < 0.7) {
return routeToHumanReview(receipt, result);
}
return autoApprove(result);

The model proposes; the code decides.

The prompt produces a judgment and a confidence. The code decides what to do with it. That seam is where you keep control: the model can be wrong, the model can drift, but the rule that a low-confidence expense goes to a human is something you wrote, tested, and can reason about.

Generated answer

Reimbursable — client dinner, within policy

Confidence

64%

↗ expense-policy.md↗ receipt OCR

Fig. 1 — 64 percent confidence routes to review, not auto-approval.

Treat the model version as a dependency

The last piece teams miss: the model is a dependency, and it ships updates you didn't ask for. A prompt that passed every case in March can quietly drift in June because the provider changed something underneath you. If you pinned your npm packages but pin nothing about the model, you've left the most volatile dependency in the system completely unmanaged.

So pin the model version where you can, run the eval suite when you can't, and re-run it before you adopt a new one. The same suite that guards your prompt edits guards the provider's. That is the whole reason to build it once.

None of this is exotic. It is the discipline we already apply to every other part of a system, pointed at the part we pretended was different. A prompt that decides product behavior is not a string. It is the most consequential, least understood function you ship — so version it, review it, test it, and watch it run. The teams that learn this the hard way all say the same thing afterward: it was code the whole time.

#AI#EngineeringShare ↗

→ / AUTHOR

Ionut Dumitru

Full-stack engineer and product designer. Writes about building products where the engineering and the design are the same job.

GitHub ↗X ↗

→ / NEXT

EngineeringNov 24, 2025

The migration you run a thousand times →

← All writingionutdumitru.com