Observability, Cost, and Evals

Once an AI automation handles real volume, two questions decide whether it survives: is it still correct, and is it bankrupting you? You answer both by instrumenting the system. You cannot improve what you cannot see, and AI steps fail in quiet, drifting ways that classic monitoring misses.

Log every model call

For each AI step, record the input, the output, the model used, token counts, latency, and cost. Tools like Helicone (through a proxy) or Langfuse (through SDK instrumentation) capture this for you, or you can log to your own table. This trace is what lets you debug a bad answer a week later.

trace record
{
  "ts": "2026-06-21T09:14:02Z",
  "model": "claude-sonnet-4-6",
  "step": "classify_ticket",
  "input_tokens": 412,
  "output_tokens": 6,
  "cost_usd": 0.0021,
  "latency_ms": 740,
  "output": "Billing"
}
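
If you log to your own table, one way to produce a record like the one above is to wrap each model call. This is a minimal sketch, not a definitive implementation: `tracedCall`, `callModel`, `sink`, the model name `example-small`, and the prices in `PRICES` are all hypothetical placeholders, so substitute your provider's client and real rates.

```javascript
// Hypothetical prices in USD per million tokens; replace with your provider's real rates.
const PRICES = {
  "example-small": { input: 1.0, output: 5.0 },
};

// Compute the cost of one call from its token counts and the price table above.
function costUsd(model, inputTokens, outputTokens) {
  const p = PRICES[model];
  if (!p) throw new Error(`no price configured for model ${model}`);
  return (inputTokens * p.input + outputTokens * p.output) / 1e6;
}

// Wrap a model call so every successful invocation leaves a trace record.
// `callModel` is a hypothetical async function that resolves to
// { text, inputTokens, outputTokens }; `sink` receives the finished record
// (for example, a function that inserts a row into your own table).
async function tracedCall({ step, model, input, callModel, sink }) {
  const started = Date.now();
  const result = await callModel(model, input);
  const record = {
    ts: new Date(started).toISOString(),
    model,
    step,
    input,
    input_tokens: result.inputTokens,
    output_tokens: result.outputTokens,
    cost_usd: costUsd(model, result.inputTokens, result.outputTokens),
    latency_ms: Date.now() - started,
    output: result.text,
  };
  sink(record);
  return result.text;
}
```

Keeping the wrapper in one place means every step logs the same fields, so you can later group by `step` or `model` to see where cost and latency concentrate.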

Cut cost without cutting quality

  • Route simple cases to a small model and only escalate hard ones to a big model.
  • Cache repeated prompts and reuse stable context to avoid paying for the same tokens twice.
  • Trim the prompt: less retrieved context and shorter system text are cheaper and often produce sharper answers.
  • Filter junk before the AI step so you never pay to classify spam.
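
The filtering, caching, and routing ideas above can be combined in one function. The sketch below rests on several assumptions: `small` and `big` are hypothetical async functions that return `{ label, confidence }` (most model APIs do not return a confidence score directly, so you would have to derive one), the junk rule is a placeholder, and the cache is a simple in-memory exact-match map rather than a provider's prompt caching feature.

```javascript
// In-memory exact-match cache: normalized input text -> label.
const cache = new Map();

// Cheap pre-filter (placeholder rule) so obvious junk never reaches a model.
function isJunk(text) {
  return text.trim().length === 0 || /click here to win|free money/i.test(text);
}

// Classify a ticket: filter first, then cache, then the small model,
// and escalate to the big model only when the small one is unsure.
async function classify(text, { small, big, threshold = 0.8 }) {
  if (isJunk(text)) return { label: "Spam", via: "filter" };

  const key = text.trim().toLowerCase();
  if (cache.has(key)) return { label: cache.get(key), via: "cache" };

  let result = await small(text);
  let via = "small";
  if (result.confidence < threshold) {
    result = await big(text);
    via = "big";
  }

  cache.set(key, result.label);
  return { label: result.label, via };
}
```

The `via` field is there so you can log which path each request took and check how much traffic the cheap paths actually absorb.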

Evals: catch regressions

Keep a small set of real inputs with known-correct outputs. Whenever you change a prompt or swap a model, run that set and compare the results with the previous run. This eval set is your seatbelt: it tells you when a tweak that looked better actually broke ten other cases.

zsh - eval run
# run the eval set after changing the prompt
$ node evals/run.js --suite triage
42/45 correct (93.3%) prev: 45/45 (100%)
REGRESSION: 3 refund cases now misrouted. Reverting.
$
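
The lesson does not show what `evals/run.js` contains. As an illustration only, the core of such a runner could look like the sketch below, where `runEvals`, `formatReport`, and `predict` are hypothetical names and `predict` stands in for whatever function calls your prompt and model and returns a label.

```javascript
// Run every case through `predict` and collect the ones that fail.
// `cases` is an array of { input, expected }; `predict` is a hypothetical
// async function that returns the model's label for one input.
async function runEvals(cases, predict) {
  if (cases.length === 0) throw new Error("eval set is empty");
  const failures = [];
  for (const c of cases) {
    const actual = await predict(c.input);
    if (actual !== c.expected) failures.push({ ...c, actual });
  }
  const correct = cases.length - failures.length;
  return { correct, total: cases.length, failures };
}

// Format one score as "correct/total correct (pct%)".
function formatScore({ correct, total }) {
  const pct = ((correct / total) * 100).toFixed(1);
  return `${correct}/${total} correct (${pct}%)`;
}

// Build a report and flag a regression when fewer cases pass than last time.
// `previous` is the { correct, total } summary saved from the last run.
function formatReport(current, previous) {
  const lines = [`${formatScore(current)} prev: ${formatScore(previous)}`];
  if (current.correct < previous.correct) {
    const n = current.failures.length;
    lines.push(`REGRESSION: ${n} case(s) now failing.`);
  }
  return lines.join("\n");
}
```

Saving each run's summary next to the prompt version it tested gives you the `previous` value for the next comparison and a history of which change caused which failures.
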

Set a spend alarm

Set a hard budget and an alert in your model provider's account before you scale. A looping agent or a viral spike can run up a frightening bill overnight. A cap turns a disaster into a notification.
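
The provider-side limit is the real safety net. As a complement, you can also enforce a budget inside your own code so a looping agent stops itself. The sketch below is one possible shape under stated assumptions: `BudgetGuard`, its thresholds, and the `onAlert` callback are hypothetical, and the running total lives in memory, so it resets whenever the process restarts.

```javascript
// Application-side spend guard: alert at one threshold, refuse work at another.
class BudgetGuard {
  constructor({ alertUsd, capUsd, onAlert }) {
    this.alertUsd = alertUsd; // send a notification once spend reaches this
    this.capUsd = capUsd;     // refuse further calls once spend reaches this
    this.onAlert = onAlert;   // hypothetical callback, e.g. post to a chat channel
    this.spentUsd = 0;
    this.alerted = false;
  }

  // Call before each model request; throws once the cap has been reached.
  check() {
    if (this.spentUsd >= this.capUsd) {
      throw new Error(`budget cap of $${this.capUsd} reached`);
    }
  }

  // Call after each request with its cost (for example, cost_usd from the trace).
  record(costUsd) {
    this.spentUsd += costUsd;
    if (!this.alerted && this.spentUsd >= this.alertUsd) {
      this.alerted = true; // fire the alert only once
      this.onAlert(this.spentUsd);
    }
  }
}
```

A typical call site would run `guard.check()` before the request and `guard.record(cost)` after it, so the loop that causes the overspend is the same loop that gets stopped.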

Hands-on tasks