Claude Code
Measure prompt and model changes with real metrics

LLM Eval Harness: Score Prompts Before You Ship

setuproll (@setuproll)
90.0 overall score

A reproducible evaluation workflow that runs a test set against candidate models, grades answers with both code checks and an LLM judge, and tracks score deltas across versions. For anyone shipping an LLM feature who needs proof, not vibes, that a prompt change actually helped.

90.0 score
1.3k votes
5 components

Install this build

npx promptfoo@latest init && npx promptfoo eval

Components

Model

  • Claude Sonnet 4.6 (under test)
  • Claude Opus 4.8 (judge)
  • GPT-5

Stack

  • promptfoo
  • Inspect AI
  • DuckDB
  • pytest

MCP servers

  • filesystem
  • github

Subagents

  • dataset-builder
  • judge-prompt-tuner
  • regression-reporter

How it works

  • Define a golden test set with expected answers and rubrics
  • Run every candidate model and prompt variant in one sweep
  • Grade closed answers with exact match and open answers with an Opus 4.8 judge
  • regression-reporter blocks the PR if the win rate drops below the baseline
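
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the build's actual code: the names `Case`, `grade`, `pass_rate`, and `gate` are hypothetical, the models are stubbed with lookup tables, and the judge is a keyword check standing in for a real LLM judge call. It also assumes "win rate" means the fraction of test cases a variant passes, which the listing does not spell out.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Case:
    """One golden test case (hypothetical shape)."""
    prompt: str
    expected: Optional[str] = None  # set for closed questions (exact match)
    rubric: Optional[str] = None    # set for open questions (judge-graded)


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting differences pass.
    return " ".join(text.lower().split())


def grade(case: Case, answer: str, judge: Callable[[str, str], bool]) -> bool:
    # Closed answers: code check. Open answers: defer to the judge.
    if case.expected is not None:
        return normalize(answer) == normalize(case.expected)
    return judge(case.rubric or "", answer)


def pass_rate(cases, model, judge) -> float:
    # Assumption: "win rate" is treated here as the share of cases passed.
    results = [grade(c, model(c.prompt), judge) for c in cases]
    return sum(results) / len(results)


def gate(baseline: float, candidate: float, tolerance: float = 0.0) -> bool:
    # True means the change may merge; False means block the PR.
    return candidate >= baseline - tolerance


CASES = [
    Case("What is 2 + 2?", expected="4"),
    Case("Capital of France?", expected="Paris"),
    Case("Explain caching in one sentence.", rubric="mentions reuse of results"),
]


def stub_judge(rubric: str, answer: str) -> bool:
    # Stand-in for an LLM judge: a real one would send rubric + answer to a model.
    return "reuse" in answer.lower()


BASELINE_ANSWERS = {
    "What is 2 + 2?": "4",
    "Capital of France?": "paris",
    "Explain caching in one sentence.": "Caching stores results for reuse.",
}
# The candidate regresses on one closed question.
CANDIDATE_ANSWERS = {**BASELINE_ANSWERS, "Capital of France?": "Lyon"}

baseline_rate = pass_rate(CASES, BASELINE_ANSWERS.__getitem__, stub_judge)
candidate_rate = pass_rate(CASES, CANDIDATE_ANSWERS.__getitem__, stub_judge)
allow_merge = gate(baseline_rate, candidate_rate)
```

Here the candidate fails one of three cases, so `allow_merge` comes out false and the change would be blocked. A `tolerance` above zero is one way to avoid blocking on noise from a non-deterministic judge; whether the build does this is not stated.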

