Claude Code

Measure prompt and model changes with real metrics

LLM Eval Harness: Score Prompts Before You Ship

90.0 Overall score

A reproducible evaluation workflow that runs a test set against candidate models, grades answers with both code checks and an LLM judge, and tracks score deltas across versions. For anyone shipping an LLM feature who needs proof that a prompt change actually helped, rather than relying on vibes.

1.3k Votes

5 Components

Install this build

terminal

npx promptfoo@latest init && npx promptfoo eval

Components

Model

Claude Sonnet 4.6 (under test)
Claude Opus 4.8 (judge)
GPT-5

Stack

promptfoo
Inspect AI
DuckDB
pytest

MCP servers

filesystem
github

Subagents

dataset-builder
judge-prompt-tuner
regression-reporter

How it works

Define a golden test set with expected answers and rubrics
Run every candidate model and prompt variant in one sweep
Grade with exact-match plus an Opus 4.8 judge for open answers
Let regression-reporter block the PR if the win rate drops below the baseline
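
The grading and gating steps above can be sketched in a few lines of Python. This is a minimal illustration, not the build's actual code. The names `Case`, `grade`, `win_rate`, and `should_block` are hypothetical. The judge is passed in as a plain callable so that a real LLM call can be swapped in. "Win rate" is assumed here to mean the fraction of test cases graded as passing.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical signature for a judge: (prompt, answer, rubric) -> pass/fail.
Judge = Callable[[str, str, str], bool]


@dataclass
class Case:
    prompt: str
    expected: Optional[str]  # None marks an open answer that needs the judge
    rubric: str = ""


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting does not fail a match."""
    return " ".join(text.lower().split())


def grade(case: Case, answer: str, judge: Judge) -> bool:
    """Use exact match when an expected answer exists, otherwise defer to the judge."""
    if case.expected is not None:
        return normalize(answer) == normalize(case.expected)
    return judge(case.prompt, answer, case.rubric)


def win_rate(results: List[bool]) -> float:
    """Fraction of passing cases (assumed definition); 0.0 for an empty run."""
    return sum(results) / len(results) if results else 0.0


def should_block(candidate: float, baseline: float, tolerance: float = 0.0) -> bool:
    """Block when the candidate falls below the baseline by more than the tolerance."""
    return candidate < baseline - tolerance


if __name__ == "__main__":
    # Stand-in judge for local runs: passes if the rubric keyword appears in the answer.
    def stub_judge(prompt: str, answer: str, rubric: str) -> bool:
        return rubric.lower() in answer.lower()

    cases = [Case("What is 2 + 2?", "4"), Case("Name a primary color.", None, "blue")]
    answers = ["  4 ", "I would pick blue."]
    results = [grade(c, a, stub_judge) for c, a in zip(cases, answers)]
    print("win rate:", win_rate(results))
    print("block PR:", should_block(win_rate(results), baseline=0.9))
```

Keeping the judge behind a callable means the deterministic checks can be unit-tested with pytest without any model calls, and a nonzero `tolerance` is one way to avoid blocking on judge noise.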
