Setuproll: The Tier List for AI Coding Setups

How Setuproll Rates Setups

Methodology

Setuproll ranks AI coding setups so you can see what actually works before you spend a week wiring one up yourself. This page explains exactly what the numbers mean and, just as important, where they come from today. No hype, no secret sauce.

What the tiers mean

Every setup lands in one of four tiers, from S down to C. The tier is just a readable bucket for the score.

S tier: score 90+

Top of the board. Solves hard tasks reliably and rarely needs a second attempt. Usually the strongest model paired with real tools.

A tier: score 82 to 89

Strong daily driver. Handles most real work well, with the occasional miss on the hardest tasks.

B tier: score 72 to 81

Solid and dependable. Good value, fine for most tasks, but you will hit its limits on complex multi-step work.

C tier: score below 72

Works for simple, scoped tasks. Expect more retries and hand-holding on anything involved.
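The bucketing above can be sketched as a small lookup. This is an illustration only, not Setuproll's actual code: the function name `tier_for_score` is hypothetical, the middle labels A and B are assumed from the "S down to C" range, and it assumes whole-number scores because the page does not say how fractional scores are rounded.

```python
def tier_for_score(score: int) -> str:
    """Map a 0-100 score to a tier letter using the cutoffs listed above."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score >= 90:
        return "S"
    if score >= 82:
        return "A"  # label assumed: second of the four tiers
    if score >= 72:
        return "B"  # label assumed: third of the four tiers
    return "C"      # everything below 72
```

Checking from the highest cutoff down keeps each boundary in exactly one tier, so 82 falls in the second tier and 81 in the third.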

What we measure

Four numbers describe every build. The score is the headline; the other three tell you what you trade off to get it.

Score

A single 0 to 100 number that rolls up how a setup did across our task set. It blends how often the setup finished the job with how cleanly it got there. The score is what sorts the leaderboard and decides the tier. It is a relative ranking signal, not an absolute measure of quality.

Pass rate

The share of tasks the setup completed correctly. A pass means the output actually did what the task asked, not just that it produced something. This is the part of the score we care about most.

Cost

The mean spend per task, in dollars, based on tokens consumed at the provider's listed prices. Lower is better. Two setups can land in the same tier while one costs several times more, so cost is worth reading next to the score.
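As a rough sketch of that arithmetic: the code below assumes the provider lists separate input and output prices per million tokens, and it ignores things like cached-token discounts. The names `TaskUsage`, `task_cost_usd`, and `mean_cost_usd` are hypothetical, and the prices in the usage note are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskUsage:
    """Tokens consumed by one task (hypothetical record shape)."""
    input_tokens: int
    output_tokens: int


def task_cost_usd(usage: TaskUsage, input_price: float, output_price: float) -> float:
    """Cost of one task in dollars; prices are dollars per million tokens (assumed unit)."""
    total = usage.input_tokens * input_price + usage.output_tokens * output_price
    return total / 1_000_000


def mean_cost_usd(usages: list[TaskUsage], input_price: float, output_price: float) -> float:
    """Mean spend per task across a non-empty list of tasks."""
    if not usages:
        raise ValueError("need at least one task to compute a mean cost")
    costs = [task_cost_usd(u, input_price, output_price) for u in usages]
    return sum(costs) / len(costs)
```

For example, with made-up prices of 3.00 and 15.00 dollars per million input and output tokens, `mean_cost_usd([TaskUsage(100_000, 10_000), TaskUsage(200_000, 20_000)], 3.00, 15.00)` averages a 0.45 dollar task and a 0.90 dollar task.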

Speed

The mean wall-clock time per task. Lower is better. Speed depends on the model, the effort setting, and how many tool calls a setup makes, so it can swing a lot between runs.
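To make the three supporting numbers concrete, here is a minimal sketch that rolls per-task results up into pass rate, mean cost, and mean time. The `TaskRun` record and the `summarize` function are hypothetical names, not part of Setuproll. The score itself is left out because this page does not give its formula.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskRun:
    """Outcome of one task for one setup (hypothetical record shape)."""
    passed: bool      # True only if the output did what the task asked
    cost_usd: float   # spend for this task, in dollars
    seconds: float    # wall-clock time for this task


def summarize(runs: list[TaskRun]) -> dict[str, float]:
    """Aggregate per-task runs into the three supporting metrics."""
    if not runs:
        raise ValueError("need at least one run to summarize")
    return {
        "pass_rate": sum(r.passed for r in runs) / len(runs),  # share of tasks passed
        "mean_cost_usd": mean(r.cost_usd for r in runs),       # lower is better
        "mean_seconds": mean(r.seconds for r in runs),         # lower is better
    }
```

Because cost and time are plain means, a single slow or expensive task can pull them up noticeably, which is one reason the numbers can swing between runs.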

How to read a build

A build is one combination of a model, a tool, and a config. Read it in this order.

1. Start with the tier and score. That tells you the overall ceiling. An S build will get more done with less correcting than a C build.
2. Check the pass rate. A high score with a modest pass rate means the setup finishes cleanly when it succeeds but misses more often. Decide which matters for your work.
3. Weigh cost against speed. If two builds tie on score, the cheaper or faster one usually wins for everyday use. The expensive one may still be worth it for hard, high-stakes tasks.
4. Look at the config. The tools, MCP servers, and rules attached to a build are the part you can copy. The model matters, but the setup around it is often what moves a build up a tier.
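The reading order above can be sketched as a sort: score first, then cost, then speed. This is one possible tie-break, assumed for illustration; the page only says the cheaper or faster build "usually wins", so you may prefer speed ahead of cost. The `Build` record, the `rank_builds` function, and the sample builds are all hypothetical, and the sample numbers are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Build:
    """One model + tool + config combination (hypothetical record shape)."""
    name: str
    score: int        # 0 to 100, higher is better
    pass_rate: float  # 0.0 to 1.0, higher is better
    cost_usd: float   # mean dollars per task, lower is better
    seconds: float    # mean wall-clock seconds per task, lower is better


def rank_builds(builds: list[Build]) -> list[Build]:
    """Sort by score (descending), breaking ties by lower cost, then lower time."""
    return sorted(builds, key=lambda b: (-b.score, b.cost_usd, b.seconds))


# Invented sample data, not real Setuproll entries.
sample = [
    Build("build-a", score=85, pass_rate=0.80, cost_usd=1.20, seconds=95.0),
    Build("build-b", score=91, pass_rate=0.88, cost_usd=2.50, seconds=140.0),
    Build("build-c", score=85, pass_rate=0.78, cost_usd=0.40, seconds=110.0),
]
ranked = rank_builds(sample)
print([b.name for b in ranked])
```

Here `build-b` leads on score, and `build-c` edges out `build-a` only because the two tie on score and `build-c` is cheaper. Pass rate is carried on the record but deliberately left out of the sort key, since step 2 treats it as a judgment call rather than a tie-break.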

Data transparency

Where today's numbers come from

We want to be straight with you. The numbers on Setuproll right now are a mix of community-reported results and our own estimates. They are not yet independent lab benchmarks run under controlled conditions.

That means a score is a useful directional signal for comparing setups, but you should not treat any single number as a precise, reproducible measurement. Costs follow listed provider prices, which change. Speed depends on load and effort settings. Pass rates that came from the community reflect real but uncontrolled runs.

Independent, reproducible benchmarking is on the roadmap. As we stand up a controlled test harness, we will label which numbers are lab-verified and which are still estimated, so you always know what you are looking at.

Contribute or correct the data

Spotted a number that looks wrong, or run a setup and want to share your results? We want the corrections. Open an issue or send us your run details and we will review and update the entry. Setuproll gets more accurate every time someone checks our work.