Claude Code
Measure prompt and model changes with real metrics
LLM Eval Harness: Score Prompts Before You Ship
setuproll (@setuproll)
90.0 Overall score
A reproducible evaluation workflow that runs a test set against candidate models, grades answers with both code checks and an LLM judge, and tracks score deltas across versions. For anyone shipping an LLM feature who needs proof, not vibes, that a prompt change actually helped.
1.3k Votes
5 Components
Install this build
npx promptfoo@latest init && npx promptfoo eval
Components
Models
- Claude Sonnet 4.6 (under test)
- Claude Opus 4.8 (judge)
- GPT-5
Stack
- promptfoo
- Inspect AI
- DuckDB
- pytest
MCP servers
- filesystem
- github
Subagents
- dataset-builder
- judge-prompt-tuner
- regression-reporter
How it works
- Define a golden test set with expected answers and rubrics
- Run every candidate model and prompt variant in one sweep
- Grade closed answers with exact-match checks and open-ended answers with a Claude Opus 4.8 judge
- regression-reporter blocks the PR if win rate drops vs baseline
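The steps above can be sketched in plain Python. This is a minimal illustration, not the build's actual implementation: the `Case` schema, the `normalize`, `pass_rate`, and `gate` helpers, and the sample data are all hypothetical. It covers only the exact-match path (the LLM judge is left out) and uses a per-variant pass rate as a simple stand-in for the win rate that regression-reporter compares against the baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """One golden test case (hypothetical schema, not promptfoo's format)."""
    case_id: str
    prompt: str
    expected: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences pass."""
    return " ".join(text.lower().split())


def exact_match(answer: str, expected: str) -> bool:
    """Code-based check: normalized answer must equal normalized expected text."""
    return normalize(answer) == normalize(expected)


def pass_rate(cases: list[Case], answers: dict[str, str]) -> float:
    """Fraction of cases answered correctly; a missing answer counts as a failure."""
    if not cases:
        raise ValueError("empty test set")
    passed = sum(exact_match(answers.get(c.case_id, ""), c.expected) for c in cases)
    return passed / len(cases)


def gate(baseline: float, candidate: float, tolerance: float = 0.0) -> bool:
    """True if the candidate may merge: its rate dropped by no more than tolerance."""
    return candidate >= baseline - tolerance


# Illustrative golden set and model outputs (made-up data).
cases = [
    Case("c1", "What is 2 + 2?", "4"),
    Case("c2", "Capital of France?", "Paris"),
    Case("c3", "Color of a clear daytime sky?", "blue"),
    Case("c4", "Chemical symbol for gold?", "Au"),
]
baseline_answers = {"c1": "4", "c2": " paris", "c3": "Blue", "c4": "Ag"}
candidate_answers = {"c1": "4", "c2": "Lyon", "c3": "blue"}  # c4 missing

baseline_rate = pass_rate(cases, baseline_answers)
candidate_rate = pass_rate(cases, candidate_answers)
allowed = gate(baseline_rate, candidate_rate)
```

In CI, a wrapper script could exit with a non-zero status when `gate` returns `False`, which is one common way to make a pull request check fail. A non-zero `tolerance` is worth considering when the test set is small, since a single flipped case then moves the rate by a large step.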