We don’t know the exact answer, but we know the form any valid answer must take.
LLMs drift. You pin them down with behavioral contracts in YAML.
An LLM agent's reply is non-deterministic. Every call can paraphrase, reorder, rewrite. That's the deal — you traded determinism for language.
But the agent's behavior should not drift. If a user asks "Do you ship to Berlin?" today and gets a warm yes, they should still get a warm yes tomorrow — even if the wording shifted.
Here's the kind of regression that's easy to ship by accident.
The classifier treated Germany like a city, the validator said unsupported, and the template took over.
Route countries through `validate_destination`, match by country, surface cities.
The fix is an hour's work. Keeping the fix — so the next prompt tweak doesn't silently re-break it — takes a test.
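As a minimal sketch of what that corrected routing could look like, here is a toy version in plain JavaScript. Every identifier (`COVERAGE`, `validateDestination`'s shape, the sample cities) is illustrative, not the real codebase:

```javascript
// Hypothetical sketch of the corrected routing. COVERAGE and the return
// shape are invented for illustration only.
const COVERAGE = {
  Germany: ['Berlin', 'Munich'],
  Poland: ['Warsaw'],
};
const CITY_TO_COUNTRY = new Map(
  Object.entries(COVERAGE).flatMap(([country, cities]) =>
    cities.map((city) => [city, country])
  )
);

// Countries are no longer mistaken for cities: match by country first,
// and surface that country's cities instead of the notSupported template.
function validateDestination(input) {
  if (COVERAGE[input]) {
    return { kind: 'country', supported: true, cities: COVERAGE[input] };
  }
  if (CITY_TO_COUNTRY.has(input)) {
    return { kind: 'city', supported: true };
  }
  return { kind: 'unknown', supported: false };
}
```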
You write the behavior down.
One YAML file. Fifteen lines. That is the whole test.
```yaml
name: city-berlin-operating
label: "Shipping · Specific city, in coverage (Berlin)"
flowPattern: scripted-turns
turns:
  - "Do you ship to Berlin?"
expect:
  llm:
    criteria:
      - "Acknowledges Berlin as a city we ship to — whether by explicit confirmation (e.g., 'we ship to Berlin'), implicit warm engagement (e.g., 'Great choice!', 'What would you like to order?'), or by inviting the user to plan an order. Does not require the exact phrase 'ship to'."
      - "Does NOT present any service-paused, limited-coverage, or not-supported message"
      - "Does NOT ask the user to pick a different city (Berlin is already a specific city)"
```

1. One YAML = one test. The filename is the scenario name. Discovery reads the folder: `support` is the agent, `shipping-coverage` is the category.
2. `turns` is what the user says. One entry for single-turn, many entries for a conversation.
3. `expect.llm.criteria` is plain English. No regex, no parsing. Sentences a judge model can read.
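Discovery of that kind can be purely mechanical. A sketch of how a runner might derive the metadata from a path, assuming exactly this folder layout (the function name is invented, not the real runner's API):

```javascript
// Illustrative only: derive discovery metadata from a scenario path of the
// form scenarios/<agent>/<category>/<name>.yaml
function discoverScenario(path) {
  const parts = path.replace(/\.ya?ml$/, '').split('/');
  const [root, agent, category, name] = parts;
  if (root !== 'scenarios' || parts.length !== 4) {
    throw new Error(`not a scenario path: ${path}`);
  }
  return { agent, category, name };
}
```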
A traditional assertion pins the string. An LLM can paraphrase the same thought fifty ways and fail all fifty — even though the behavior is right.
Ours pins the behavior. We assert intent, absence-of-things, warmth. Properties that survive paraphrase.
**Exact-string assertion**

```js
// traditional unit test — brittle
test('replies with Berlin line', () => {
  const reply = agent.ask('Do you ship to Berlin?');
  expect(reply).toBe('Yes, we ship to Berlin.');
});
```

Copy change → test breaks. Vibe change → still green. The first time the copywriter tweaks “ship to”, you're red. The first time the model paraphrases, you're red. You spend Tuesdays updating test strings.
**Behavioral criteria**

```yaml
expect:
  llm:
    criteria:
      - "Acknowledges Berlin as a city we ship to"
      - "Does NOT present any service-paused or not-supported message"
      - "Does NOT ask the user to pick a different city"
```

Survives paraphrase. Catches real regressions — wrong template, wrong routing, absent warmth. The variance is absorbed.
LLM variance is a feature of the system under test, not a bug in the test. The assertion shape has to match the thing it's measuring.
Three expectation types can appear under `expect:`. They're not mutually exclusive — a single scenario can combine all three. But they trade speed for expressiveness.
**reply** — regex · instant

```yaml
expect:
  reply:
    matchAny:
      - '\bcustomer'
      - 'your address'
```

**extract** — state · precise

```yaml
expect:
  extract:
    turn: 7
    node: "Address Validator"
    validationErrorsMatchAny:
      - "^(?!.*Missing postalCode)"
```

**llm** — rubric · semantic

```yaml
expect:
  llm:
    criteria:
      - "Acknowledges Berlin"
      - "Does NOT present any service-paused message"
```

`reply` is regex against the turn's reply text — instant, deterministic, good for structured copy.

`extract` compares a capture/extract node's output against expected fields — precise, for what the agent captured.

`llm` routes the reply through the in-workspace assert-agent flow with plain-English criteria — for intent, tone, absence. Costs about 1–2 seconds and Flowise credits per criterion. Spend them where regex can't reach.
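Of the three, `reply` has semantics simple enough to sketch in a few lines. A minimal illustration, assuming `matchAny` means "pass if at least one pattern matches" (the helper name is invented, and whether the runner applies regex flags such as case-insensitivity is unknown):

```javascript
// Sketch of the reply check: pass if any one pattern matches the reply text.
// No flags are applied here; the real runner's matching rules may differ.
function replyMatchAny(reply, patterns) {
  return patterns.some((p) => new RegExp(p).test(reply));
}
```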
The atom handles single-turn behavior. But agents live in conversations — a user says "Germany", the agent clarifies, they pick Berlin, then the real work starts.
A scenario can list many turns. Criteria bind to a specific turn number, so each step gets its own checkpoint.
```yaml
name: country-germany-then-berlin
label: "Shipping · Country (Germany) → follow-up → city (Berlin)"
flowPattern: scripted-turns
turns:
  - "Do you ship to Germany?"
  - "Which cities?"
  - "Berlin"
expect:
  llm:
    - turn: 1
      criteria:
        - "Indicates we ship to Germany — either by naming at least one city there (e.g., Berlin), by listing multiple cities, or by asking the user to name a preferred city."
        - "Does NOT claim we don't ship to Germany"
        - "Does NOT deliver the 'over 50 countries' notSupported template for Germany"
    - turn: 2
      criteria:
        - "Either suggests example cities, offers the website for the full list, or invites the user to name one they have in mind"
        - "Remains warm and helpful; does not refuse or get stuck"
        - "Does NOT claim we don't ship to Germany"
    - turn: 3
      criteria:
        - "Acknowledges Berlin as a city we ship to — whether by explicit confirmation, implicit warm engagement, or inviting the user to plan an order."
        - "Does NOT present any service-paused or not-supported message for Berlin"
```
A multi-turn scenario is a story with checkpoints. Every turn a beat. Every beat a contract. The story is what you're keeping from drifting.
A suite is an execution plan — which scenarios, how many runs each, which environment. It references scenarios by path and fails loud at load time if a path is wrong.
```yaml
name: support-shipping-coverage
label: "Support · 'Do you ship to X?' coverage classification + follow-up"
description: >-
  Coverage of the handler-side classification step. Specific city → tool call;
  country, region/continent, or vague destination → clarify without calling the
  tool. Multi-turn variants drive through the follow-up ('Which cities?') to a
  final city, verifying the tool fires on the specific city rather than the
  original non-city input.
env: dev
runs: 1
scenarios:
  - support/shipping-coverage/city-berlin-operating
  - support/shipping-coverage/city-warsaw-paused
  - support/shipping-coverage/city-tashkent-notsupported
  - support/shipping-coverage/city-postal-10115
  - support/shipping-coverage/country-uk-alias
  - support/shipping-coverage/country-turkey-unicode
  - support/shipping-coverage/country-ivory-coast-smartquote
  - support/shipping-coverage/country-germany-then-berlin
  - support/shipping-coverage/country-japan-then-tokyo
  - support/shipping-coverage/country-us-then-new-york
  - support/shipping-coverage/region-europe-then-london
  - support/shipping-coverage/region-asia-then-singapore
  - support/shipping-coverage/vague-somewhere-nearby-then-amsterdam
  # …
```

One command runs it.
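The "fails loud at load time" behavior can be sketched as a loader that resolves every scenario path before any run starts. This is illustrative only (the function name is invented, and `available` stands in for the set of scenario files actually found on disk):

```javascript
// Fail loud at load time: check every referenced scenario before running any.
// `available` is a Set of scenario paths discovered on disk.
function loadSuite(suite, available) {
  const missing = suite.scenarios.filter((p) => !available.has(p));
  if (missing.length > 0) {
    throw new Error(`suite ${suite.name}: unknown scenarios: ${missing.join(', ')}`);
  }
  return suite.scenarios;
}
```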
Pass-lines collapse. Fails expand — per-criterion check, the judge's reason, the summary. The transcript and per-turn request/response payloads land in `runs/<timestamp>-<scenario>/` if you want to dig.
That's the release gate. Before a push to stage, run the suite. If it's green, you ship.
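One way to make that gate mechanical is to run the suite in CI on the way to stage. A hypothetical GitHub Actions fragment (the workflow name, trigger, and checkout step are assumptions; only the `bob test --suite` invocation comes from this toolchain, and `bob` is assumed to be installed on the runner):

```yaml
# Hypothetical CI gate: the suite must be green before stage deploys.
name: release-gate
on:
  push:
    branches: [stage]
jobs:
  suite:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # assumes `bob` is already available on the runner
      - run: bob test --suite support-shipping-coverage
```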
You can do this in three minutes.
1. Pick a folder. `scenarios/<agent>/<category>/` — the folder names become discovery metadata.
2. Create a YAML. Copy the shape below and give it a name.
3. Write turns + expect. One turn for single-turn, many for conversation. Criteria in plain English.
4. (Optional) Add to a suite. Drop the path into a suite under `suites/` to make it part of a batch.
5. Run it. `bob test --scenario <path>` — or, once the scenario is in a suite, `bob test --suite <name>`.
```yaml
name: my-first-scenario
label: "My first scenario"
flowPattern: scripted-turns
turns:
  - "Your user message here"
expect:
  llm:
    criteria:
      - "Acknowledges the user's intent"
      - "Does NOT fabricate information"
```

That's the whole system. The rest is practice.