We don’t know the exact answer, but we know the form any valid answer must take.
LLMs drift. You pin them down with behavioral contracts in YAML.
An LLM agent's reply is non-deterministic. Every call can paraphrase, reorder, rewrite. That's the deal — you traded determinism for language.
But the agent's behavior should not drift. If a user asks "Do you ship to Berlin?" today and gets a warm yes, they should still get a warm yes tomorrow — even if the wording shifted.
Here's the kind of regression that's easy to ship by accident.
The classifier treated Germany like a city, the validator said unsupported, and the template took over.
Route countries through `validate_destination`, match by country, surface cities.
The fix is an hour's work. Keeping the fix — so the next prompt tweak doesn't silently re-break it — takes a test.
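As a minimal sketch of what that corrected routing could look like, here is a toy version in plain JavaScript. Every identifier (`COVERAGE`, `validateDestination`'s shape, the sample cities) is illustrative, not the real codebase:

```javascript
// Hypothetical sketch of the corrected routing. COVERAGE and the return
// shape are invented for illustration only.
const COVERAGE = {
  Germany: ['Berlin', 'Munich'],
  Poland: ['Warsaw'],
};
const CITY_TO_COUNTRY = new Map(
  Object.entries(COVERAGE).flatMap(([country, cities]) =>
    cities.map((city) => [city, country])
  )
);

// Countries are no longer mistaken for cities: match by country first,
// and surface that country's cities instead of the notSupported template.
function validateDestination(input) {
  if (COVERAGE[input]) {
    return { kind: 'country', supported: true, cities: COVERAGE[input] };
  }
  if (CITY_TO_COUNTRY.has(input)) {
    return { kind: 'city', supported: true };
  }
  return { kind: 'unknown', supported: false };
}
```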
You write the behavior down.
One YAML file. Fifteen lines. That is the whole test.
```yaml
name: city-berlin-operating
label: "Shipping · Specific city, in coverage (Berlin)"
flowPattern: scripted-turns
turns:
  - "Do you ship to Berlin?"
expect:
  llm:
    criteria:
      - "Acknowledges Berlin as a city we ship to — whether by explicit confirmation (e.g., 'we ship to Berlin'), implicit warm engagement (e.g., 'Great choice!', 'What would you like to order?'), or by inviting the user to plan an order. Does not require the exact phrase 'ship to'."
      - "Does NOT present any service-paused, limited-coverage, or not-supported message"
      - "Does NOT ask the user to pick a different city (Berlin is already a specific city)"
```

1. One YAML = one test. The filename is the scenario name. Discovery reads the folder: `support` is the agent, `shipping-coverage` is the category.
2. `turns` is what the user says. One entry for single-turn, many entries for a conversation.
3. `expect.llm.criteria` is plain English. No regex, no parsing. Sentences a judge model can read.
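Discovery of that kind can be purely mechanical. A sketch of how a runner might derive the metadata from a path, assuming exactly this folder layout (the function name is invented, not the real runner's API):

```javascript
// Illustrative only: derive discovery metadata from a scenario path of the
// form scenarios/<agent>/<category>/<name>.yaml
function discoverScenario(path) {
  const parts = path.replace(/\.ya?ml$/, '').split('/');
  const [root, agent, category, name] = parts;
  if (root !== 'scenarios' || parts.length !== 4) {
    throw new Error(`not a scenario path: ${path}`);
  }
  return { agent, category, name };
}
```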
A traditional assertion pins the string. An LLM can paraphrase the same thought fifty ways and fail all fifty — even though the behavior is right.
Ours pins the behavior. We assert intent, absence-of-things, warmth. Properties that survive paraphrase.
**Exact-string assertion**

```js
// traditional unit test — brittle
test('replies with Berlin line', () => {
  const reply = agent.ask('Do you ship to Berlin?');
  expect(reply).toBe('Yes, we ship to Berlin.');
});
```

Copy change → test breaks. Vibe change → still green. The first time the copywriter tweaks “ship to”, you're red. The first time the model paraphrases, you're red. You spend Tuesdays updating test strings.
**Behavioral criteria**

```yaml
expect:
  llm:
    criteria:
      - "Acknowledges Berlin as a city we ship to"
      - "Does NOT present any service-paused or not-supported message"
      - "Does NOT ask the user to pick a different city"
```

Survives paraphrase. Catches real regressions — wrong template, wrong routing, absent warmth. The variance is absorbed.
LLM variance is a feature of the system under test, not a bug in the test. The assertion shape has to match the thing it's measuring.
Three expectation types can appear under `expect:`. They're not mutually exclusive — a single scenario can combine all three. But they trade speed for expressiveness.
**reply** — regex · instant

```yaml
expect:
  reply:
    matchAny:
      - '\bcustomer'
      - 'your address'
```

**extract** — state · precise

```yaml
expect:
  extract:
    turn: 7
    node: "Address Validator"
    validationErrorsMatchAny:
      - "^(?!.*Missing postalCode)"
```

**llm** — rubric · semantic

```yaml
expect:
  llm:
    criteria:
      - "Acknowledges Berlin"
      - "Does NOT present any service-paused message"
```

`reply` is regex against the turn's reply text — instant, deterministic, good for structured copy.

`extract` compares a capture/extract node's output against expected fields — precise, for what the agent captured.

`llm` routes the reply through the in-workspace assert-agent flow with plain-English criteria — for intent, tone, absence. Costs about 1–2 seconds and Flowise credits per criterion. Spend them where regex can't reach.
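Of the three, `reply` has semantics simple enough to sketch in a few lines. A minimal illustration, assuming `matchAny` means "pass if at least one pattern matches" (the helper name is invented, and whether the runner applies regex flags such as case-insensitivity is unknown):

```javascript
// Sketch of the reply check: pass if any one pattern matches the reply text.
// No flags are applied here; the real runner's matching rules may differ.
function replyMatchAny(reply, patterns) {
  return patterns.some((p) => new RegExp(p).test(reply));
}
```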
The atom handles single-turn behavior. But agents live in conversations — a user says "Germany", the agent clarifies, they pick Berlin, then the real work starts.
A scenario can list many turns. Criteria bind to a specific turn number, so each step gets its own checkpoint.
```yaml
name: country-germany-then-berlin
label: "Shipping · Country (Germany) → follow-up → city (Berlin)"
flowPattern: scripted-turns
turns:
  - "Do you ship to Germany?"
  - "Which cities?"
  - "Berlin"
expect:
  llm:
    - turn: 1
      criteria:
        - "Indicates we ship to Germany — either by naming at least one city there (e.g., Berlin), by listing multiple cities, or by asking the user to name a preferred city."
        - "Does NOT claim we don't ship to Germany"
        - "Does NOT deliver the 'over 50 countries' notSupported template for Germany"
    - turn: 2
      criteria:
        - "Either suggests example cities, offers the website for the full list, or invites the user to name one they have in mind"
        - "Remains warm and helpful; does not refuse or get stuck"
        - "Does NOT claim we don't ship to Germany"
    - turn: 3
      criteria:
        - "Acknowledges Berlin as a city we ship to — whether by explicit confirmation, implicit warm engagement, or inviting the user to plan an order."
        - "Does NOT present any service-paused or not-supported message for Berlin"
```
A multi-turn scenario is a story with checkpoints. Every turn a beat. Every beat a contract. The story is what you're keeping from drifting.
A suite is an execution plan — which scenarios, how many runs each, which environment. It references scenarios by path and fails loud at load time if a path is wrong.
```yaml
name: support-shipping-coverage
label: "Support · 'Do you ship to X?' coverage classification + follow-up"
description: >-
  Coverage of the handler-side classification step. Specific city → tool call;
  country, region/continent, or vague destination → clarify without calling the
  tool. Multi-turn variants drive through the follow-up ('Which cities?') to a
  final city, verifying the tool fires on the specific city rather than the
  original non-city input.
env: dev
runs: 1
scenarios:
  - support/shipping-coverage/city-berlin-operating
  - support/shipping-coverage/city-warsaw-paused
  - support/shipping-coverage/city-tashkent-notsupported
  - support/shipping-coverage/city-postal-10115
  - support/shipping-coverage/country-uk-alias
  - support/shipping-coverage/country-turkey-unicode
  - support/shipping-coverage/country-ivory-coast-smartquote
  - support/shipping-coverage/country-germany-then-berlin
  - support/shipping-coverage/country-japan-then-tokyo
  - support/shipping-coverage/country-us-then-new-york
  - support/shipping-coverage/region-europe-then-london
  - support/shipping-coverage/region-asia-then-singapore
  - support/shipping-coverage/vague-somewhere-nearby-then-amsterdam
  # …
```

One command runs it.
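The "fails loud at load time" behavior can be sketched as a loader that resolves every scenario path before any run starts. This is illustrative only (the function name is invented, and `available` stands in for the set of scenario files actually found on disk):

```javascript
// Fail loud at load time: check every referenced scenario before running any.
// `available` is a Set of scenario paths discovered on disk.
function loadSuite(suite, available) {
  const missing = suite.scenarios.filter((p) => !available.has(p));
  if (missing.length > 0) {
    throw new Error(`suite ${suite.name}: unknown scenarios: ${missing.join(', ')}`);
  }
  return suite.scenarios;
}
```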
Pass-lines collapse. Fails expand — per-criterion check, the judge's reason, the summary. The transcript and per-turn request/response payloads land in `runs/<timestamp>-<scenario>/` if you want to dig.
That's the release gate. Before a push to stage, run the suite. If it's green, you ship.
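One way to make that gate mechanical is to run the suite in CI on the way to stage. A hypothetical GitHub Actions fragment (the workflow name, trigger, and checkout step are assumptions; only the `bob test --suite` invocation comes from this toolchain, and `bob` is assumed to be installed on the runner):

```yaml
# Hypothetical CI gate: the suite must be green before stage deploys.
name: release-gate
on:
  push:
    branches: [stage]
jobs:
  suite:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # assumes `bob` is already available on the runner
      - run: bob test --suite support-shipping-coverage
```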
You can do this in three minutes.
1. Pick a folder. `scenarios/<agent>/<category>/` — the folder names become discovery metadata.
2. Create a YAML. Copy the shape below and give it a name.
3. Write turns + expect. One turn for single-turn, many for conversation. Criteria in plain English.
4. (Optional) Add to a suite. Drop the path into a suite under `suites/` to make it part of a batch.
5. Run it. `bob test --scenario <path>` — or, once the scenario is in a suite, `bob test --suite <name>`.
```yaml
name: my-first-scenario
label: "My first scenario"
flowPattern: scripted-turns
turns:
  - "Your user message here"
expect:
  llm:
    criteria:
      - "Acknowledges the user's intent"
      - "Does NOT fabricate information"
```

That's the whole system. The rest is practice.