How This Research Is Run: Pre-registration, Claim-Status, MEASURE-ONLY

This is the operating manual for the project — the rules that decide what a run is allowed to conclude, written and frozen before the run exists.

The substantive results live elsewhere; this entry documents the process that lets you trust them. It uses one case study throughout, the init-weight bug logged in experiments/exp15-recheck-trained-weights/, because it is the cleanest stress test of every rule at once.

The question

Can a research run talk itself into a conclusion? Most ML pipelines can: the same code that measures a quantity also decides whether the measurement "passed," and the thresholds drift to meet the data. The discipline here exists to make that structurally impossible, and the test is adversarial — what happens when a bug produces a beautiful, wrong result that survives every pre-flight check?

The method: four interlocking rules

1 — Frozen pre-registration. Before implementation, the run's intended sequence is committed to the script's module docstring and content-hashed. For the re-check, the SEQUENCE block in recheck.py was frozen 2026-06-17 22:38, the day before the 2026-06-18 run. It pre-commits the steps (train to $t=200$ → ladder selection first → conditional $T_O$ probe) and the stop condition: if the R4 ladder is inadequate, STOP — do not run the downstream probe. No threshold may be relaxed after the fact; the run reused the frozen exp15 constants (A_GRID, R=4, SWAP_BAND=[0.15,0.60], N_RT, SOKAL_C) unchanged, with the only code change being the one-line trained-weight refresh.
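The freeze in rule 1 can be checked mechanically. The sketch below is one minimal way to do it, assuming the pre-registered text is available as a string and that a SHA-256 digest was recorded at freeze time. The function names, the whitespace normalisation, and the sample SEQUENCE text are illustrative assumptions, not the contents of recheck.py.

```python
import hashlib


def sequence_digest(text: str) -> str:
    """SHA-256 hex digest of the pre-registered text.

    Line endings and surrounding whitespace are normalised so the digest
    does not change between operating systems (an assumption of this sketch).
    """
    normalised = text.replace("\r\n", "\n").strip()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def assert_frozen(text: str, frozen_digest: str) -> None:
    """Refuse to continue if the text no longer matches the frozen digest."""
    actual = sequence_digest(text)
    if actual != frozen_digest:
        raise RuntimeError(
            f"pre-registration changed: expected {frozen_digest}, got {actual}"
        )


# Hypothetical stand-in for a SEQUENCE block; the real wording is not shown here.
SEQUENCE = """
1. train to t=200
2. ladder selection first
3. conditional T_O probe; if the ladder is inadequate, STOP
"""

FROZEN_DIGEST = sequence_digest(SEQUENCE)  # recorded once, at freeze time
assert_frozen(SEQUENCE, FROZEN_DIGEST)     # passes: the text is unchanged
```

Any later edit to the block, including a relaxed threshold, changes the digest and makes assert_frozen raise before the run starts.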

2 — Provenance is proven, not asserted. A run must demonstrate it did what it claims. Training provenance: opt_counts=[12200,12200], expected=12200 ($=200\times 61$), cumulative_ok=true, weights_hash=8070bbfed961ba6e $\neq$ init_hash=8f2ba0bdd1efc6f2, plus probe_rng_isolated=true. Even hardware divergence is logged honestly: the laptop RTX 4060 re-train produced a different weights_hash than exp15's H200 run (expected float-rounding divergence over 200 epochs), and the report argues the findings are structural, so the divergence does not weaken them.
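The provenance fields in rule 2 amount to a handful of equality checks. The sketch below shows one way to express them, using the values quoted in the text. The Provenance class, the provenance_failures helper, and the reading of 61 as optimizer steps per epoch are assumptions made for illustration; the project's own record format may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Provenance:
    opt_counts: Tuple[int, ...]  # optimizer step count per optimizer
    epochs: int
    steps_per_epoch: int         # assumed meaning of the factor 61
    weights_hash: str
    init_hash: str
    probe_rng_isolated: bool


def provenance_failures(p: Provenance) -> List[str]:
    """Return a list of human-readable failures; empty means all checks pass."""
    failures = []
    expected = p.epochs * p.steps_per_epoch
    if any(count != expected for count in p.opt_counts):
        failures.append(f"opt_counts {p.opt_counts} != expected {expected}")
    if p.weights_hash == p.init_hash:
        failures.append("weights_hash equals init_hash: weights look untrained")
    if not p.probe_rng_isolated:
        failures.append("probe RNG is not isolated")
    return failures


record = Provenance(
    opt_counts=(12200, 12200),
    epochs=200,
    steps_per_epoch=61,
    weights_hash="8070bbfed961ba6e",
    init_hash="8f2ba0bdd1efc6f2",
    probe_rng_isolated=True,
)
print(provenance_failures(record))  # an empty list means every check passed
```

Returning the failures as data, instead of raising on the first one, keeps the check descriptive: the run reports what it found and leaves the decision to the researcher.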

3 — The four claim-status tags. Every claim carries exactly one of: solid (established prior to this project), conjectured (the working hypothesis), proven-here (newly established by a run), or validated (independently confirmed). Tags are the project's ledger; the whole point is that they move rarely and only with cause.

4 — MEASURE-ONLY. A run can be declared measure-only: it emits numbers but moves no claim-status tag and issues no PROCEED/HALT/PASS verdict. The code cannot self-authorize a tag flip. The re-check is exactly this — its outcome is the descriptive label LADDER-INADEQUATE-TRAINED, a _DISCIPLINE field with no declared budget or verdict, not a refutation.
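Rules 3 and 4 together can be read as a small ledger whose write path is closed to measure-only runs. The sketch below is a toy model of that idea, not the project's tooling: the Ledger class, the TagFlipRefused exception, and the requirement of a non-empty cause string are all hypothetical.

```python
from enum import Enum


class Tag(Enum):
    SOLID = "solid"
    CONJECTURED = "conjectured"
    PROVEN_HERE = "proven-here"
    VALIDATED = "validated"


class TagFlipRefused(Exception):
    """Raised when a run tries to move a tag it is not allowed to move."""


class Ledger:
    def __init__(self, tags):
        self._tags = dict(tags)  # claim name -> exactly one Tag

    def tag(self, claim):
        return self._tags[claim]

    def move(self, claim, new_tag, *, measure_only, cause):
        # A measure-only run may emit numbers but never flips a tag.
        if measure_only:
            raise TagFlipRefused(f"measure-only run cannot move '{claim}'")
        # Tags move rarely and only with cause.
        if not cause.strip():
            raise TagFlipRefused("a tag moves only with a stated cause")
        self._tags[claim] = new_tag


ledger = Ledger({
    "conditional factorization": Tag.SOLID,
    "Q_op ~ Q_struct_perp": Tag.CONJECTURED,
})

try:
    ledger.move("Q_op ~ Q_struct_perp", Tag.PROVEN_HERE,
                measure_only=True, cause="re-check numbers")
except TagFlipRefused as exc:
    print("refused:", exc)
```

The keyword-only measure_only argument forces every caller to state the run's mode at the call site, so a tag flip cannot happen by omission.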

The result: a beautiful wrong answer, caught and contained

The bug: build_alpha_programs built each PT replica's local kernel from the INIT weights (the constructor reads step.model.factors, which training never rewrites), while the swap energy used the TRAINED weights. The result is an inconsistent chain whose cold marginal is not $\pi_\theta$. The minimal repro is damning: the rebuilt program weights equal INIT (err 0) and differ from TRAINED (err 4.5), and the R1 PT-vs-Gibbs cosine is $0.18$ with the bug and $1.0000$ with the refresh. At $t=200$ the stale constructor was off by stale_vs_trained_maxabs = 27.27.
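The repro's two error figures are max-absolute differences between the weights a program actually holds and two references. A minimal sketch of that comparison follows, assuming the weights can be flattened to plain lists of floats. The helper names and the toy numbers are hypothetical and are not taken from the experiment.

```python
from typing import Dict, Sequence


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest elementwise absolute difference between two weight vectors."""
    if len(a) != len(b):
        raise ValueError("weight vectors must have the same length")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def stale_weight_report(program, init, trained, tol=1e-9) -> Dict[str, object]:
    """Compare the weights a program holds against INIT and TRAINED."""
    err_init = max_abs_diff(program, init)
    err_trained = max_abs_diff(program, trained)
    return {
        "err_vs_init": err_init,
        "err_vs_trained": err_trained,
        # Stale means: identical to INIT while differing from TRAINED.
        "stale": err_init <= tol and err_trained > tol,
    }


# Toy numbers chosen for illustration only.
init = [0.0, 1.0, -2.0]
trained = [0.5, -3.5, -2.0]

buggy = stale_weight_report(list(init), init, trained)     # program built from INIT
fixed = stale_weight_report(list(trained), init, trained)  # program refreshed
print(buggy["stale"], fixed["stale"])
```

A check of this shape is cheap enough to run on every program build, which is the point: it compares what the kernel holds against what training produced, not what the constructor was asked to load.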

This bug passed every reversibility pre-flight. What caught it was rule 2 made adversarial: exp17's gate-2 gibbs_vs_PT extractor-agreement guard, the one independent ground-truth (Gibbs) check that exp15/16 lacked. With the trained weights correctly loaded, all five A_GRID candidates fail the triple gate: pooled swap acceptance peaks at $0.0099$, roughly $15\times$ below the band floor of $0.15$; round-trips $= 0$ on every chain; and widening the ladder spacing worsens acceptance monotonically, down to $0.0009$ at $\alpha_R=0.1$.
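The arithmetic behind the failed gate is short enough to write down. The sketch below checks two of the criteria named here, swap acceptance inside SWAP_BAND and round-trips on every chain. The third criterion of the triple gate is not spelled out in this entry, so it is omitted. The function names and the min_round_trips default of 1 are assumptions; the frozen N_RT value is not quoted here.

```python
SWAP_BAND = (0.15, 0.60)  # frozen band quoted in rule 1


def swap_in_band(acceptance, band=SWAP_BAND):
    """True when pooled swap acceptance lies inside the closed band."""
    low, high = band
    return low <= acceptance <= high


def shortfall_factor(acceptance, band=SWAP_BAND):
    """How many times below the band floor the acceptance sits."""
    return band[0] / acceptance


def ladder_adequate(acceptance, round_trips_per_chain, min_round_trips=1):
    """Two of the gate's criteria; min_round_trips is a hypothetical threshold."""
    mixes = all(rt >= min_round_trips for rt in round_trips_per_chain)
    return swap_in_band(acceptance) and mixes


best = 0.0099   # best pooled swap acceptance across the five candidates
worst = 0.0009  # acceptance at the widest spacing reported

print(round(shortfall_factor(best), 1))
print(round(shortfall_factor(worst), 1))
print(ladder_adequate(best, [0, 0, 0, 0]))
```

Even the best candidate misses the floor by more than an order of magnitude, so no plausible rounding or re-seeding argument moves it into the band.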

And here is the discipline paying off: this overturned two prior runs' headline evidence — yet it moved zero tags. The conditional factorization stays solid, the operational claim $Q_{op}\approx Q_{struct}^{\perp}$ stays conjectured. It was logged as an evidence/erratum correction, not a refutation.

Scope and caveats

This entry documents process, and the process is honest about its own limits. The result is config-scoped: "R4 PT does not mix this trained DTM at this checkpoint" (60_12, SEED=0, $t=200$), not a general impossibility. The discipline also declines to over-act: the spine's Risk-4 sub-entry that rested on the buggy evidence is flagged but not auto-edited, because "the form of correction is for the researcher to set." The rules constrain what a run may conclude; they do not let a measurement silently rewrite the research spine.

What it does NOT show

It does not show the kernel is unfixable, that the operational claim is false, or that pre-flight gates are worthless — only that pre-flight reversibility checks are not a substitute for an independent ground-truth guard. That is the load-bearing lesson.
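An independent ground-truth guard of the kind credited above can be as small as a cosine comparison between two estimates of the same statistic, one from a plain Gibbs sampler and one from PT. The sketch below assumes both estimates arrive as equal-length lists of floats. The function names, the 0.99 threshold, and the toy vectors are illustrative assumptions, not exp17's actual gate.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def extractors_agree(gibbs_estimate, pt_estimate, min_cosine=0.99):
    """True when the PT estimate points the same way as the Gibbs reference."""
    return cosine(gibbs_estimate, pt_estimate) >= min_cosine


# Toy vectors for illustration only.
gibbs = [1.0, 2.0, 3.0]
pt_consistent = [1.01, 1.98, 3.02]
pt_inconsistent = [3.0, -1.0, 0.5]

print(extractors_agree(gibbs, pt_consistent))
print(extractors_agree(gibbs, pt_inconsistent))
```

The guard is useful precisely because the Gibbs reference shares no code path with the PT chain's construction, so a bug in one cannot silently satisfy both.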

What this feeds: the next probe is feasibility-first and scoped — whether a larger replica count or adaptive spacing mixes the trained DTM at all — run under exactly these rules, with exp17 HELD (not frozen) until mixing is solved.