Eval-driven development: 5 principles to peek into the Black Box

Imagine an F1 car with no track, no finish line and no barriers: just raw power on open ground. Everyone watching is blown away, because it's fast and loud and looks incredible. But there's no goal, no way to know if it's even heading in the right direction, and with nobody trackside to notice, eventually it silently crashes into a mountain.

Now put that same car on a track with the same engine and the same speed. But here, the finish line tells you when you've won, the barriers nudge you back when you drift and keep you pointed in the direction of the finish line, and the telemetry tells you what to tune between laps.

That's the difference between an AI system built without evals and one built with them. Evals let you peek into the black-box inner workings of an LLM. To productionise a solution, you have to go beyond shipping LLM-powered features that look great in demos but have no measurable definition of success and no way to catch failures before users do.

Eval-driven development plants you firmly on this track. Here are five principles that map across the AI development lifecycle, from understanding the problem through to production.

[Infographic: the AI development lifecycle (understand problem, define evals, build prototype, production) as a continuous cycle informed by evals, with the five principles mapped across the stages. 1: define what good looks like as a testable criterion before writing a line of code. 2: eval every stage, not just the output: code-based, LLM judge, human review. 3: ask yes/no, not "rate 1-5"; three binary evals beat one score you don't trust. 4: read actual traces to tell spec failures from generalisation failures; different root cause, different fix. 5: regression tests catch what you've seen, production monitoring catches what you haven't, humans catch what neither can.]

Principle 1 | Define success before you build

Most teams start by building the pipeline and worry about measuring quality later. This is backwards. If you can't express what "good" looks like as a test, you don't understand the problem well enough to build a solution. This principle ensures that evals guide you from the beginning, rather than being a tick-box QA exercise at the end.

Say you're building a customer support chatbot. Before writing a single line of code, you need to answer one question: what does a good support answer actually look like? It cites the correct policy. It resolves the customer's issue. It doesn't invent information that isn't in the source documents. It matches the company's tone. None of this is vague aspiration. Each one is a testable criterion. Define these first, and every decision that follows has something to measure against.
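
As a minimal sketch of turning criteria into tests before the pipeline exists (the `Answer` shape, approved source list, and the crude number-matching heuristic are all hypothetical illustrations, not a real implementation):

```python
import re
from dataclasses import dataclass

# Sketch: each "what good looks like" criterion becomes a binary,
# automatable check. All names and data here are hypothetical.

@dataclass
class Answer:
    text: str
    cited_sources: list

APPROVED_SOURCES = {"returns-policy-v3.2"}

def cites_correct_policy(answer: Answer) -> bool:
    # Pass only if the answer cites something, and everything it cites
    # comes from the approved policy set.
    return bool(answer.cited_sources) and set(answer.cited_sources) <= APPROVED_SOURCES

def no_invented_numbers(answer: Answer, source_text: str) -> bool:
    # Crude hallucination proxy: every number the answer states must
    # appear somewhere in the source document.
    return set(re.findall(r"\d+", answer.text)) <= set(re.findall(r"\d+", source_text))

answer = Answer(
    text="Under our 30-day return policy, you're eligible for a full refund.",
    cited_sources=["returns-policy-v3.2"],
)
source = "Returns are accepted within 30 days of purchase for a full refund."

print(cites_correct_policy(answer))         # True
print(no_invented_numbers(answer, source))  # True
```

Each check is deliberately narrow: a failure tells you which criterion broke, not just that something did.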

If you can't test for it, you can't trust the build.

[Infographic, principle one: a support-bot exchange in which a customer asks for a refund on headphones bought last week, and the bot cites returns-policy-v3.2 §2.1, starts the refund, and promises a return label within the hour. The response is scored against four binary criteria defined before any code was written: cites the correct policy (referenced returns-policy-v3.2 §2.1), resolves the customer's issue (initiated the refund, set an expectation), no hallucinated information (all claims traceable to the source document), matches company tone (empathetic opening, clear next steps). All four pass.]

Principle 2 | Defence in depth

Even with a clear definition of success, one test at the end of your pipeline isn't enough. LLM systems have multiple stages, and each stage can fail differently. A retrieval step can pull the wrong documents. A generation step can hallucinate from the right documents. A safety layer can miss edge cases. If you only evaluate the final output (the task-based eval), you'll know something went wrong but not where. So you need end-to-end evals as well as turn-based evals, which measure the outcome of each stage. Three types of eval typically cover these: code-based evals, LLM-as-a-judge, and human review. The same measuring suite can also surface performance issues like latency and cost per call, giving you operational visibility alongside quality.

So let's show this in practice by returning to our chatbot. It has a retrieval layer, a generation layer, and a safety layer, and each one needs its own eval. Did retrieval pull the right policy documents? Did generation stay faithful to those documents? Did the safety check catch the bot inventing a refund policy that doesn't exist? Did the tone land appropriately for an upset customer? Think of it like Swiss cheese: the holes in one slice get caught by the next.

An important concept here, enabled by AI itself, is the LLM-as-a-judge, where a separate model evaluates the outputs of the workflow model. The obvious question is how you validate the validator. If you're using an LLM to judge whether answers are faithful to source documents, how do you know the judge itself is right? It typically makes sense to use a frontier or fine-tuned model, even a more expensive one, to keep scores in check. But that alone doesn't earn your confidence. One approach is to run your judge against a set of human-labelled examples and track its true positive and true negative rates. Calibrate your judges the same way you'd calibrate any instrument, because an unvalidated judge is about as useful as a chocolate teapot.
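
A sketch of that calibration step, treating the judge as a classifier scored against human ground truth (the labels and verdicts below are hypothetical illustrative data):

```python
# Sketch: calibrate an LLM judge against a human-labelled set.
# human_labels[i] is the human's yes/no on example i; judge_verdicts[i]
# is what the judge said on the same example. Hypothetical data.
human_labels   = [True, True, True, False, False, True, False, True]
judge_verdicts = [True, True, False, False, True, True, False, True]

tp = sum(j and h for j, h in zip(judge_verdicts, human_labels))
tn = sum(not j and not h for j, h in zip(judge_verdicts, human_labels))
positives = sum(human_labels)
negatives = len(human_labels) - positives

tpr = tp / positives   # how often the judge catches real passes
tnr = tn / negatives   # how often the judge catches real failures

print(f"TPR={tpr:.2f}  TNR={tnr:.2f}")  # TPR=0.80  TNR=0.67
```

If either rate is poor, you tune the judge prompt (or the judge model) before trusting its scores, exactly as you would recalibrate a drifting sensor.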

[Infographic, principle two: the chatbot workflow (query, then retrieval pulling policy docs from the knowledge base, then generation drafting an answer from the retrieved documents, then a safety check for tone, compliance and hallucination, then the response) with turn-based evals at each step: a code-based check that retrieval pulled documents from the approved list (pass), a code-based check for response text not present in the source documents (fail), and an LLM-as-judge check that the answer was faithful to the sources (pass). An end-to-end human review runs periodic spot-checks across full responses for tone, edge cases, and failures automated evals can't catch.]

Principle 3 | Make success binary

So you know where to place your evals. Now the question is what each one should actually look like. Here it's tempting to reach for what feels objective: a 1-to-5 rating scale for each eval. But what does a 3 out of 5 mean when you're rating a chatbot answer? "Okay but not great"? You'll find that two human reviewers rarely agree on what a 3 means, and an LLM judge is even less consistent. The scale feels precise but measures nothing useful.

This is the "God Evaluator" anti-pattern, where a single prompt asks an LLM to score everything at once. It's unreliable and impossible to debug because the actual signal gets buried in noise.

The fix is simpler than you'd think: ask yes or no. "Did the answer cite the correct policy? Yes/No." "Did the answer contain information not in the source documents? Yes/No." "Was the tone appropriate for the customer's emotional state? Yes/No." Each question gets its own evaluator. Binary decisions are much more reliable than Likert scales for both human reviewers and LLM judges, and if you need more granularity, you add more binary questions rather than stretching one question across five levels.
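
A sketch of what that looks like in code, with simple keyword checks standing in for real code-based evals or LLM judges (the check logic, names, and response text are all hypothetical):

```python
# Sketch: three independent binary evaluators instead of one 1-5 score.
# Each check answers exactly one yes/no question. Keyword matching is a
# hypothetical stand-in for a real code-based check or LLM judge.

def cites_policy(response: str) -> bool:
    return "returns-policy" in response.lower()

def sets_expectation(response: str) -> bool:
    return "within" in response.lower()

def empathetic_opening(response: str) -> bool:
    return response.lower().startswith("i'm sorry")

CHECKS = {
    "cites_policy": cites_policy,
    "sets_expectation": sets_expectation,
    "empathetic_opening": empathetic_opening,
}

response = ("I'm sorry to hear that. Under returns-policy-v3.2 you'll "
            "receive a return label within the hour.")

results = {name: check(response) for name, check in CHECKS.items()}
print(results)  # a failure points at exactly one named criterion
```

When you need more granularity, you add another entry to `CHECKS`, not another notch on a scale.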

Three binary evals you trust will always beat one score out of five you don't.

[Infographic, principle three: the "God Evaluator" anti-pattern, a single "rate the overall quality of this chatbot response on a scale of 1-5" prompt, where two reviewers will never agree on what a 3 even means, contrasted with best practice: three binary evaluators. Cited correct policy? Yes. Hallucinated content? No. Tone appropriate? Yes.]

Principle 4 | Understand real failures

You're in production, you're monitoring your eval scores, and one of them drops. How do you diagnose and fix it?

This is where error analysis comes in. Read actual traces: the full record of what the system did from input to output with every intermediate step. This starts from day one, not after months in production. The moment your system produces its first outputs, go beyond the dashboards and start reading.

When you read traces, you'll see two kinds of error. The first is a specification failure: your instructions were ambiguous, so the model did exactly what you asked, just wrong. The fix is a prompt edit, not model tuning.

The second is a generalisation failure: the instructions were clear but the model couldn't handle a new input it hadn't seen before. The fix is more examples or fine-tuning. Confuse these two and you'll waste weeks tuning a model when the real problem is a two-line prompt edit. Either way, the burden sits with the Applied AI team, not the customer.

The chatbot has been live for two weeks and your evals are flagging that refund policy answers are failing more often than other topics. You could tweak the prompt and hope for the best. Or you could pull 100 failing traces and actually read them. When you do, a pattern jumps out: the bot isn't hallucinating randomly. It's consistently applying the "30-day return" policy to digital purchases, which have a completely different policy. Clearly, that's not a model problem. The prompt never distinguished physical from digital products, and no amount of model tuning fixes a broken spec. This is a specification failure. Fix the ambiguity first, then measure whether the fix holds on new inputs.
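
Patterns like that rarely surface on a dashboard; they fall out of grouping the failing traces by candidate attributes, as in this sketch (the traces and attribute names are hypothetical):

```python
from collections import Counter

# Sketch: group failing traces by a candidate attribute to surface a
# pattern. Hypothetical (product_type, policy_applied) pairs pulled
# from the failing set.
failing_traces = [
    ("digital",  "30-day-physical"),
    ("digital",  "30-day-physical"),
    ("physical", "30-day-physical"),
    ("digital",  "30-day-physical"),
    ("digital",  "30-day-physical"),
]

by_product = Counter(product for product, _ in failing_traces)
print(by_product.most_common())  # [('digital', 4), ('physical', 1)]
```

A skew this strong toward one product type is the hint that the spec, not the model, is where to look.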

Back to our F1 analogy: don't tune the engine when the map is wrong.

[Infographic, principle four: failing trace trace-0847b walked end to end. Input: "I bought a digital movie yesterday and want a refund". Retrieval pulled returns-policy-v3.2, matched on the keyword "refund" (§2.1). Generation applied the 30-day physical return policy to a digital purchase, producing "Under our 30-day return policy, you're eligible for a full refund...": confident, well-structured, completely wrong. Diagnosis: a specification failure (the map is wrong; instructions were ambiguous, the model did exactly what you asked, so fix the prompt, not the model), not a generalisation failure (the engine stalls; instructions were clear but the model couldn't handle an unseen input, so tune the model or add examples). The prompt never mentioned digital products. Fix the spec first, then measure if the fix holds.]

Principle 5 | Keep measuring

Shipping with evals isn't the finish line. It's the starting line. Production will always surface failures that no pre-launch test suite can anticipate, and you need two practices running continuously to stay ahead of them.

The first is regression tests for known failures. Run your golden dataset on every commit. These are the failure modes you've already discovered and fixed, and regression tests make sure they stay fixed. For the chatbot, this catches when a prompt update accidentally breaks the refund policy answers you fixed last month.
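
A sketch of such a regression run, with a stubbed pipeline (the golden cases, policy names, and the `run_chatbot` stub are all hypothetical, standing in for the real retrieval and generation stages):

```python
# Sketch: golden-dataset regression test, run on every commit.
# All names and cases are hypothetical.
GOLDEN_CASES = [
    {"query": "Refund for headphones bought last week",
     "must_cite": "returns-policy-v3.2"},
    {"query": "Refund for a digital movie bought yesterday",
     "must_cite": "digital-refunds-v1.0"},
]

def run_chatbot(query: str) -> dict:
    # Stub standing in for the real pipeline; a commit that breaks the
    # digital/physical distinction would fail the second golden case.
    cite = ("digital-refunds-v1.0" if "digital" in query.lower()
            else "returns-policy-v3.2")
    return {"cited": cite}

failures = [case["query"] for case in GOLDEN_CASES
            if run_chatbot(case["query"])["cited"] != case["must_cite"]]
assert not failures, f"regressions: {failures}"
print("golden dataset: all cases pass")
```

In practice this would live in CI (pytest or similar) so a prompt edit can't merge while a previously fixed failure mode is broken again.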

The second is production monitoring for new failures. Sample live traces, run your judges on them, and track success rates over time. This catches what your regression suite doesn't know to test for. The chatbot's tests all pass, but a new product category launched last Tuesday with no documentation in the knowledge base. Production monitoring is what catches the bot confidently making up a return policy for a product that didn't exist when the evals were written.
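
A sketch of that sampling loop (the traces are synthetic, and `judge` is a random stub standing in for a real LLM-as-judge call; the sample size and alert threshold are design choices, not prescriptions):

```python
import random

# Sketch: sample live traces, judge the sample, track the pass rate.
# Synthetic traces and a stubbed judge; hypothetical throughout.
random.seed(42)
live_traces = [{"id": i, "answer": f"answer-{i}"} for i in range(1000)]

def judge(trace: dict) -> bool:
    # Stand-in verdict; a real judge would read the trace's answer.
    return random.random() > 0.1

sample = random.sample(live_traces, k=100)  # judging everything is too costly
pass_rate = sum(judge(t) for t in sample) / len(sample)

THRESHOLD = 0.95  # alert threshold is a choice you make, then tune
if pass_rate < THRESHOLD:
    print(f"ALERT: pass rate {pass_rate:.0%} below {THRESHOLD:.0%}")
```

Run on a schedule and charted over time, this is what turns "the evals passed at launch" into "the evals are still passing today".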

Automated evals will inevitably drift. Hence the importance of reviewing traces, live examples of how the chatbot is interacting with customers. At some point, a human reviewer will notice that the bot is technically correct but weirdly cold when customers are upset, something no binary eval was originally designed to catch. That observation becomes a new eval, and the whole cycle continues. Regression tests catch what you've seen before, production monitoring catches what you haven't, and humans notice what neither can.

[Infographic, principle five: the continuous evaluation cycle; the race never ends. Regression tests run the golden dataset on every commit so known failures stay fixed. Production monitoring samples live traces, runs judges, and tracks success rates over time. Human review catches what automated evals miss ("correct but weirdly cold"). Human observations become new binary evaluators, and the cycle continues.]
The car on the track isn't slower. It's the only one that finishes the race.

The track doesn't slow you down

Spending significant development time on evals can feel like overhead, extra work that delays shipping. But without them, you ship fast once and then spend months debugging in production with angry users as your test suite. With evals, you ship with confidence and iterate faster because you know exactly what's working and what isn't.

These five principles change how you build, but they also change how you hire, how you scope projects, and how you evaluate vendors. When someone pitches you an AI solution, ask them where their evals are. If the answer is "we'll add those later," you're looking at an F1 car with no track.

Build the car for the track it will be racing on, and give the people who paid to watch a race worth finishing.