Claude Code Data/ML Notebook Lab

Deep Dive

I let an agent run my notebooks. Here is how I stopped it from p-hacking me.

A reproducibility-first Claude Code setup with a Jupyter MCP, a leakage-aware eval reviewer, and hooks that log every experiment so the agent cannot quietly cheat.

dr_okonkwo9 min read2026-06-18

I am an applied ML researcher, and I have a specific paranoia. It is not that the agent writes bad code. Bad code throws errors and you fix it. My fear is the quiet failure: a model that reports 0.94 AUC because the test set leaked into a feature three cells earlier, and then writes a confident paragraph explaining why the result is plausible. That paragraph is the dangerous part.

So when I built my Claude Code setup for day-to-day experimentation, I did not optimize for speed. I optimized for the agent not being able to fool me, or itself. This is the Data/ML Notebook Lab build. It runs on Claude Opus 4.8, with a Jupyter MCP so it can actually execute cells, Postgres so it pulls real data instead of inventing a toy frame, and three subagents that argue with each other before anything gets called a result.

What this build is for

Data science and ML experimentation: EDA, feature work, training loops, and honest evaluation. Stats from my runs: roughly 3.4s median tool latency, about $0.58 per task, an 85 percent pass rate on my private eval set. It sits in A tier, not S, and I will tell you exactly why later.

The rules come first, before any tooling

Most people write CLAUDE.md as a style guide. Mine reads more like a pre-registration document. Three rules do the heavy lifting, and they are non-negotiable enough that I put them at the very top where the context window weighs them most.

CLAUDE.md

# ML experiment protocol (read before any notebook work)

## Hard rules
1. Seed everything for reproducibility.
   Set numpy, random, torch, and any sklearn random_state to
   SEED = 1729 at the top of every notebook. No exceptions.
2. Never leak the test set into features.
   Fit scalers, encoders, imputers, and feature selectors on the
   TRAIN split only, inside the CV fold. Transform val/test after.
3. Report metrics with confidence intervals.
   No bare point estimates. Bootstrap (n=1000) or report CV
   mean +/- std. A number without a spread is not a result.

## Workflow
- State the hypothesis in one sentence before writing a cell.
- Prefer a baseline (majority class / linear) before anything fancy.
- If a result looks too good, check for leakage first, celebrate second.

Rule 1 looks trivial. It is the most important line in the file. I picked SEED = 1729 because it is the Hardy-Ramanujan number and I will notice instantly if a cell prints a different seed. Reproducibility is not a virtue I admire from a distance, it is the thing that lets me rerun an experiment in three weeks and trust the delta.

Giving it real data and a real kernel

An agent that cannot run a cell is a very expensive autocomplete. The whole point is that Claude executes against a live Jupyter kernel and queries the actual warehouse, so its claims are grounded in execution rather than in its prior. Two MCP servers do this. The Jupyter one connects to a running kernel; Postgres is read-only by design, because I have no interest in an agent writing to a production table at 1am.

.mcp.json

{
  "mcpServers": {
    "jupyter": {
      "command": "uvx",
      "args": ["jupyter-mcp-server", "--kernel", "ml-lab"],
      "env": { "JUPYTER_URL": "http://127.0.0.1:8888" }
    },
    "postgres": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-postgres",
               "postgresql://reader@localhost:5432/experiments"],
      "env": { "PGOPTIONS": "-c default_transaction_read_only=on" }
    },
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "./data", "./notebooks"]
    },
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"]
    }
  }
}

Read-only is load-bearing

The default_transaction_read_only flag is not a nice-to-have. The first week I ran this without it, the agent tried to CREATE TABLE a scratch result set in my analytics DB. It meant well. I do not care that it meant well.

Three subagents, and the one that earns its keep

I split the work into three subagents, each with its own context window and a narrow brief. The separation matters because a single model holding the whole experiment in context starts to root for its own result. Give the evaluation step a fresh context and an adversarial prompt, and it stops being invested.

eda-explorer: profiles distributions, missingness, and class balance before anyone touches a model. It is the boring step everyone skips.
feature-engineer: proposes transforms and writes them as fold-safe pipelines, never as a global fit-then-split.
eval-reviewer: the skeptic. Its only job is to find the reason the headline number is wrong. It reads every cell upstream of the metric.

Here is the eval-reviewer in full. Note the tool allowlist is deliberately tiny: it reads and runs, it does not write features. I do not want the reviewer able to fix the thing it is supposed to flag.

.claude/agents/eval-reviewer.md

---
name: eval-reviewer
description: >
  Adversarial reviewer for any reported ML metric. Use PROACTIVELY
  after a model is evaluated. Hunts for leakage, overfit claims,
  and missing uncertainty before a result is trusted.
tools: Read, Bash, mcp__jupyter__execute
model: opus
---

You are a hostile reviewer. Assume the reported metric is wrong
until proven otherwise. Work through this checklist and report
findings as PASS / FAIL with the offending cell number.

1. LEAKAGE: was any scaler/encoder/selector fit before the split?
   Did target-derived info reach the features? Check the CV loop.
2. SPLIT INTEGRITY: are train/test disjoint? Any duplicate rows
   across splits? Group leakage (same user in both)?
3. METRIC HONESTY: is there a confidence interval? Is the baseline
   reported alongside? Is the metric appropriate for the class balance?
4. SEED: confirm SEED=1729 is set and the run is reproducible.

Do NOT fix anything. Report. End with one line:
"VERDICT: trustworthy" or "VERDICT: do not trust, see FAIL items".

claude - eval-reviewer pass

You

Random forest hit 0.97 AUC on churn. Ship it?

Agent

Delegating to eval-reviewer before trusting that.

Agent

FAIL (cell 14): StandardScaler.fit() ran on the full frame before train_test_split in cell 16. Test distribution leaked into the scaler. FAIL (cell 22): no CI on the AUC. VERDICT: do not trust. Refit inside the fold and re-bootstrap.

You

...thank you for not letting me embarrass myself.

The reviewer doing the one thing I actually pay it for.

Tip

If you build only one subagent from this article, build the skeptic. A reviewer with a fresh context and no stake in the result catches the failure modes that the model which wrote the code is structurally unable to see.

Hooks: the part the model cannot forget to do

Rules in CLAUDE.md are aspirational. Hooks are deterministic. The agent can forget an instruction under a long context, but a PostToolUse hook fires whether it remembers or not. I use two. One runs the cell and captures metrics every time a notebook is edited. The other logs the full experiment to my tracker when the session ends, so nothing evaporates.

.claude/settings.json

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "mcp__jupyter__.*|Edit",
        "hooks": [
          {
            "type": "command",
            "command": "python scripts/run_and_capture.py --seed 1729 --emit-metrics"
          }
        ]
      }
    ],
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "python scripts/log_experiment.py --tracker mlflow --include-git-sha"
          }
        ]
      }
    ]
  }
}

zsh - ml-lab

# session wraps up, Stop hook fires automatically

$claude

[PostToolUse] run_and_capture.py: executed cells 1-24, captured 6 metrics

[eval-reviewer] VERDICT: do not trust, see FAIL items (leakage in cell 14)

# fixed the pipeline, reran, ended the session

[Stop] log_experiment.py -> mlflow run 4b1c logged

metric: roc_auc = 0.881 [0.864, 0.897] (1000x bootstrap)

seed=1729 git=8fd106a baseline_auc=0.712

# 0.881 with a real CI beats a fake 0.97 every single day

Why this lives in A tier and not S

I am honest about the tradeoff. All this guarding costs latency and tokens. The eval-reviewer doubles the work on any task that ends in a metric, and the bootstrap step is not free. My pass rate sits around 85 percent, which is lower than my flashier teammate's setup, but his numbers are point estimates with no error bars, so I genuinely do not know if his 91 is better than my 85 or just luckier. That ambiguity is the whole reason I built this.

Component	Choice	Why
Model	Claude Opus 4.8	Reasoning over leakage and CV logic needs the stronger model.
MCP	jupyter, postgres, filesystem, github	Live kernel + real data, both read-only where it counts.
Subagents	eda-explorer, feature-engineer, eval-reviewer	Separation of concerns, with a fresh-context skeptic at the end.
Hooks	PostToolUse capture, Stop log	Determinism the model cannot forget under long context.

How to Build Claude Subagents Better Than 99% of People· IndyDevDan

If you want the deeper background on the subagent and hook machinery I lean on, these are the two references I actually keep open. The official docs for the exact frontmatter format, and the hooks mastery repo for the lifecycle events when I am wiring something new.

Create custom subagents - Claude Code DocsThe authoritative spec for subagent Markdown files: YAML frontmatter, tool allowlists, and isolated context windows.code.claude.com disler/claude-code-hooks-masteryReference repo covering every hook lifecycle event with prompt-level control. Where I learned the PostToolUse matcher syntax.GitHub4.2k

Set it up

Start Claude Code in your project, then write the three rules into CLAUDE.md, drop the .mcp.json above, and add the eval-reviewer agent file. The reviewer alone changed how much I trust my own afternoon. Install with:

npm install -g @anthropic-ai/claude-code && claude

One last opinion, since you read this far. The fastest agent setup is not the best one for science. The best one is the one that tells you when you are wrong, including when the person being wrong is you at 1am with a result that looks too good to question.

Claude Code Data/ML Notebook Lab

Install this build

Components

Model

MCP servers

Subagents

Hooks

Rules