A single inconsistent chain — local kernels from the init model, swap energies from the trained model — manufactured a mixing result that vanishes the moment the bug is repaired.
This is the complete technical record for experiments/exp15-recheck-trained-weights/. Here we keep the provenance, the gate, and the claim-status discipline. It is a MEASURE-ONLY re-check: it moves no claim-status tag and carries no PROCEED/HALT/PASS verdict.
The question
exp15 reported "aggregate Cal-STABLE / A6 reachable under R4 parallel tempering" and exp16 reported "the negative phase mixes well at scale." The exp17 gate-2 code verification (the gibbs_vs_PT extractor-agreement guard, the one independent ground-truth Gibbs check that exp15/16 lacked) uncovered a bug in the FROZEN exp15 build_alpha_programs, which exp16 reuses verbatim. So the question here is narrow and surgical: repair only that bug, change no threshold, and re-ask the branch-point question on the real trained DTM. Does the R4 geometric α-ladder actually mix?
The bug
build_alpha_programs builds each replica's local kernel via AnnealingIsingSamplingProgram(step.model, …), whose constructor reads step.model.factors. DTM training updates step.model.weights and the existing program_* interactions but not step.model.factors (DTM.py:330, step_ebm.py:116, annealing_graph_ising.py:226). So the PT local kernel sampled the INIT model while the swap energy (energy_free, which reads step.model.weights) used the TRAINED model. The result is an inconsistent chain whose cold replica does not target the trained model's distribution.
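The failure mode can be illustrated with a toy stand-in. This is a minimal sketch, not the repository's code: `ToyModel`, `ToyProgram`, `train_step`, and `max_abs_diff` are hypothetical names, and the only behavior carried over from the text is that training moves `weights` while leaving `factors` untouched.

```python
from dataclasses import dataclass, field


@dataclass
class ToyModel:
    """Hypothetical stand-in for step.model: weights plus factors fixed at init."""
    weights: list
    factors: list = field(init=False)

    def __post_init__(self):
        # Factors are snapshotted once from the initial weights.
        self.factors = list(self.weights)


class ToyProgram:
    """Hypothetical sampling program whose constructor reads model.factors only."""

    def __init__(self, model):
        self.interactions = list(model.factors)


def train_step(model, delta):
    # Mimics the described behavior: weights move, factors stay stale.
    model.weights = [w + delta for w in model.weights]


def max_abs_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


model = ToyModel(weights=[0.0, 0.5, -0.5])
init_weights = list(model.weights)
train_step(model, delta=1.0)

# Built after training, yet it still carries the init-time factors.
stale = ToyProgram(model)
stale_vs_init = max_abs_diff(stale.interactions, init_weights)
stale_vs_trained = max_abs_diff(stale.interactions, model.weights)

# Refresh: overwrite the program's interactions from the trained weights.
refreshed = ToyProgram(model)
refreshed.interactions = list(model.weights)
refreshed_vs_trained = max_abs_diff(refreshed.interactions, model.weights)
```

In this toy, the stale program agrees with the init weights and disagrees with the trained ones, while the refreshed program agrees with the trained weights, which is the same pattern a max-abs comparison is meant to expose.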
The minimal repro, recorded at design time: on a trained model, the rebuilt program's weights equal INIT (err 0) and differ from TRAINED (err 4.5); the R1 PT-vs-Gibbs cosine is 0.18 with the bug and 1.0000 with the trained-weight refresh. The fix refreshes per_block_interactions via get_new_per_block_interactions(prog, step.model.weights, step.model.biases) + eqx.tree_at, exactly what training_spec.update_weights_and_biases already does. This is the only change. No frozen estimator-hygiene constant was relaxed.
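A cosine agreement check of the kind quoted above can be sketched as follows. This is an illustration only: which statistics the gibbs_vs_PT guard compares is not specified here, and `extractors_agree` with its `min_cosine` threshold is a hypothetical helper, not a value or function from the experiment.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors of statistics."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def extractors_agree(pt_stats, gibbs_stats, min_cosine=0.99):
    # min_cosine is a hypothetical threshold chosen for this sketch.
    return cosine(pt_stats, gibbs_stats) >= min_cosine


# Illustrative vectors only; these are not measurements from the experiment.
same_direction = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])
```

A cosine near 1 says the two estimators point the same way regardless of scale, so a value like 0.18 indicates that the PT chain and the Gibbs reference are estimating different things.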
The setup
Substrate pschilliOrange/dtm-replication @ 7c22d19, .venv-exp3 (jax==0.10.1, GPU backend), reversible-PT kernel LIVE. The frozen exp15 machinery (pt_traj, pt_super_sweep, energy_free, build_maps, sokal_profile_from_spins, classify_curve, measure_swap_accept, and the constants A_GRID, SWAP_BAND=[0.15,0.60], N_RT, LADDER_SEEDS, SOKAL_C) is reused unchanged.
Training provenance was proven, not asserted: opt_counts=[12200,12200], expected=12200, cumulative_ok=true, weights_hash=8070bbfed961ba6e, init_hash=8f2ba0bdd1efc6f2, probe_rng_isolated=true. The bug is made explicit: the un-refreshed constructor was stale by stale_vs_trained_maxabs = 27.27 (the bug grows with training; the 1-epoch smoke showed a smaller gap), while the refreshed program matches exactly (refreshed_vs_trained_maxabs = 0.0, refresh_ok=true). Observables: gradient observables. Pre-registered sequence: ladder selection FIRST, and if the fixed R4 ladder is inadequate, STOP before the probe.
The result: LADDER-INADEQUATE-TRAINED
With the trained-weight refresh, all five A_GRID candidates fail the triple gate (band and N_RT round-trips/chain and hot-α self-consistency), aggregated over LADDER_SEEDS=(300,301,302):
| α (A_GRID) | ladder | swap-accept (pooled) | max swap-accept | round-trips | | PASS |
|---:|---|---|---:|---:|---:|:--:|
| 0.5 | [1.0, 0.794, 0.630, 0.5] | [0.0099, 0.0066, 0.0037] | 0.0099 | 0 | 308 | ✗ |
| 0.4 | [1.0, 0.737, 0.543, 0.4] | [0.0047, 0.0036, 0.0016] | 0.0047 | 0 | 322 | ✗ |
| 0.3 | [1.0, 0.669, 0.448, 0.3] | [0.0027, 0.0022, 0.0009] | 0.0027 | 0 | 293 | ✗ |
| 0.2 | [1.0, 0.585, 0.342, 0.2] | [0.0016, 0.0007, 0.00005] | 0.0016 | 0 | 146 | ✗ |
| 0.1 | [1.0, 0.464, 0.215, 0.1] | [0.0009, 0.00004, 0.0] | 0.0009 | 0 | 288 | ✗ |
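The ladder column is consistent with a four-rung geometric spacing from 1.0 down to the grid value. A minimal sketch, assuming the first column holds the A_GRID values and that R4 means four replicas; `geometric_ladder` is a hypothetical helper, not the repository's ladder builder:

```python
def geometric_ladder(alpha_min, n_replicas=4):
    """Geometric spacing from 1.0 down to alpha_min over n_replicas rungs."""
    if not 0.0 < alpha_min <= 1.0:
        raise ValueError("alpha_min must lie in (0, 1]")
    if n_replicas < 2:
        raise ValueError("a ladder needs at least two rungs")
    # Adjacent rungs share the constant ratio alpha_min ** (1 / (n_replicas - 1)).
    return [alpha_min ** (k / (n_replicas - 1)) for k in range(n_replicas)]


# Grid values as they appear in the first column of the table.
A_GRID = (0.5, 0.4, 0.3, 0.2, 0.1)

ladders = {a: [round(x, 3) for x in geometric_ladder(a)] for a in A_GRID}
```

A smaller grid value stretches the same four rungs over a wider range, so the ratio between adjacent rungs shrinks from about 0.794 at 0.5 to about 0.464 at 0.1.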
The best candidate (α = 0.5, the narrowest geometric spacing) reaches a pooled swap-acceptance of only 0.0099, about 15× below the band floor of 0.15, and widening the spacing makes it worse: the maximum decreases monotonically to 0.0009 at α = 0.1, because more separation means less replica overlap. Round-trips are 0 on every chain and every candidate. Adjacent replicas essentially never swap, so the R4 ladder cannot transport the cold chain across the trained DTM's energy barriers. The run STOPPED before the probe (halt = "ladder_inadequate"). The 1-epoch smoke showed the same collapse direction, with swap-accept dropping from the init model to the trained model, and the full run confirms it at scale. The flat init model swaps; the trained model does not. exp15's "A6 reachable" was an init-weight artifact.
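The arithmetic behind these statements can be checked directly from the table. The sketch below transcribes the pooled swap-accept values and uses SWAP_BAND=[0.15, 0.60] from the setup; the helper name `band_shortfall` is hypothetical, and the check covers only the band part of the triple gate.

```python
SWAP_BAND = (0.15, 0.60)

# Pooled swap-accept rates per grid value, transcribed from the table.
pooled = {
    0.5: [0.0099, 0.0066, 0.0037],
    0.4: [0.0047, 0.0036, 0.0016],
    0.3: [0.0027, 0.0022, 0.0009],
    0.2: [0.0016, 0.0007, 0.00005],
    0.1: [0.0009, 0.00004, 0.0],
}


def band_shortfall(rates, band=SWAP_BAND):
    """Factor by which the best adjacent-pair rate falls short of the band floor."""
    return band[0] / max(rates)


best_alpha = max(pooled, key=lambda a: max(pooled[a]))
shortfall = band_shortfall(pooled[best_alpha])

# Maximum rate per candidate, ordered from narrowest to widest spacing.
maxima = [max(pooled[a]) for a in sorted(pooled, reverse=True)]
monotone_decrease = all(x > y for x, y in zip(maxima, maxima[1:]))

lo, hi = SWAP_BAND
any_in_band = any(lo <= r <= hi for rates in pooled.values() for r in rates)
```

No adjacent pair on any candidate ladder lands inside the band, and the shortfall for the best candidate comes out at roughly 15.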
Scope and caveats
This invalidates exp15's P0-RESOLVED and exp16's F4-fail as evidence for the trained DTM, since both rode the buggy init-weight kernels. exp15's anchor is contaminated, not cross-checked here, and must be re-derived once mixing is solved. exp17 is moot as designed (it assumes the negative R4 PT works) and is HELD, not run; its gate-2 code is correct and verified (battery 20/20, extractor exact) and reusable. Risk 5 (A2/A6 antagonism) is CORROBORATED: exp4/exp6 found that 4-block Gibbs fails to mix, and now R4 PT also fails to mix the trained DTM. The obstruction is deeper than "use a better kernel with 4 replicas."
What it does not show: the conditional factorization stays [solid] and the operational claim stays [conjectured]; no tag moves. This is an evidence/erratum correction, not a refutation of either claim. The result is config-scoped (60_12, SEED=0, R4 geometric ladder): "R4 PT does not mix this trained DTM at this checkpoint," not a general impossibility for reversible PT. The hardware-divergence note is honest provenance (an independent laptop-trained DTM whose hash differs from exp15's H200 run), but the findings are structural, not tied to one weight realization.
What this feeds: the clean new ground truth the exp15/exp16 errata cite, and the scoped feasibility-first next probe — whether a larger replica count or finer/adaptive spacing mixes the trained DTM at all.