Exp 15-recheck — The Init-Weight Bug, Isolated · Thermodynamic Machine Learning

A single inconsistent chain — local kernels from the init model, swap energies from the trained model — manufactured a mixing result that vanishes the moment the bug is repaired.

This is the complete technical record for experiments/exp15-recheck-trained-weights/. Here we keep the provenance, the gate, and the claim-status discipline. It is a MEASURE-ONLY re-check: it moves no claim-status tag and carries no PROCEED/HALT/PASS verdict.

The question

exp15 reported "aggregate $T_O$ Cal-STABLE / A6 reachable under R4 parallel tempering" and exp16 reported "the negative phase mixes well at scale ($MSE_{neg} \propto 1/K$)." The exp17 gate-2 code verification (the gibbs_vs_PT extractor-agreement guard, the one independent ground-truth Gibbs check that exp15/16 lacked) uncovered a bug in the FROZEN exp15 build_alpha_programs, which exp16 reused verbatim. So the question here is narrow and surgical. Repair only that bug, change no threshold, and re-ask the branch-point question on the real trained DTM: does the R4 geometric $\alpha$-ladder actually mix?

The bug

build_alpha_programs builds each replica's local kernel via AnnealingIsingSamplingProgram(step.model, …), whose constructor reads step.model.factors. DTM training updates step.model.weights and the existing program_* interactions but not step.model.factors (DTM.py:330, step_ebm.py:116, annealing_graph_ising.py:226). So the PT local kernel sampled the INIT model while the swap energy (energy_free, which reads step.model.weights) used the TRAINED model: an inconsistent chain whose cold replica does not target $\pi_\theta$.

The minimal repro, recorded at design time: on a trained model with $|\Delta w| \approx 4.5$ , the rebuilt program's weights $==$ INIT (err 0), $\ne$ TRAINED (err 4.5); the R1 PT-vs-Gibbs cosine is 0.18 (buggy) and 1.0000 with the trained-weight refresh. The fix refreshes per_block_interactions via get_new_per_block_interactions(prog, step.model.weights, step.model.biases) + eqx.tree_at — exactly what training_spec.update_weights_and_biases already does. This is the only change. No frozen estimator-hygiene constant was relaxed.
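The stale-state pattern behind the bug and its repair can be illustrated with a toy sketch. This is not the repository's code: ToyModel, ToyProgram, build_program, and refresh are hypothetical stand-ins, and the weight values are invented so that the largest change is 4.5, matching the repro above. It only shows how a constructor that reads an init-time snapshot yields a program that equals INIT and differs from TRAINED until it is refreshed from the current weights.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyModel:
    weights: tuple   # updated by training
    factors: tuple   # init-time snapshot that training never touches


@dataclass(frozen=True)
class ToyProgram:
    interactions: tuple  # what the local kernel would actually sample from


def build_program(model):
    # Mirrors the bug: the constructor reads model.factors, not model.weights.
    return ToyProgram(interactions=model.factors)


def refresh(program, model):
    # Mirrors the fix: overwrite the interactions from the current weights.
    return replace(program, interactions=model.weights)


def maxabs(a, b):
    # Largest elementwise absolute difference between two equal-length tuples.
    return max(abs(x - y) for x, y in zip(a, b))


init_w = (0.0, 0.0, 0.0)          # illustrative values only
trained_w = (4.5, -1.0, 2.0)      # illustrative values only
model = ToyModel(weights=trained_w, factors=init_w)

stale = build_program(model)
fixed = refresh(stale, model)

print(maxabs(stale.interactions, init_w))     # 0.0: rebuilt program equals INIT
print(maxabs(stale.interactions, trained_w))  # 4.5: and differs from TRAINED
print(maxabs(fixed.interactions, trained_w))  # 0.0: consistent after the refresh
```

In the real code the refresh is a functional pytree update (eqx.tree_at) rather than dataclasses.replace; the toy keeps only the shape of the idea.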

The setup

Substrate pschilliOrange/dtm-replication @ 7c22d19, .venv-exp3 (jax==0.10.1, GPU backend), reversible-PT kernel LIVE. The frozen exp15 machinery (pt_traj, pt_super_sweep, energy_free, build_maps, sokal_profile_from_spins, classify_curve, measure_swap_accept, and the constants A_GRID, $R=4$ , SWAP_BAND=[0.15,0.60], N_RT, LADDER_SEEDS, SOKAL_C) is reused unchanged.

Training-provenance was proven, not asserted: opt_counts=[12200,12200], expected=12200 ( $=200\times61$ ), cumulative_ok=true, weights_hash=8070bbfed961ba6e $\ne$ init_hash=8f2ba0bdd1efc6f2, probe_rng_isolated=true. The bug is made explicit at $t=200$ : the un-refreshed constructor was stale by stale_vs_trained_maxabs = 27.27 (the bug grows with training — the 1-epoch smoke showed only $\approx 2.5$ ); the refreshed program matches exactly (refreshed_vs_trained_maxabs = 0.0, refresh_ok=true). $P=25200$ gradient observables, $n_{free}=3600$ . Pre-registered sequence: ladder selection FIRST, and if the fixed R4 ladder is inadequate, STOP before the $T_O$ probe.
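The provenance checks listed above amount to a few comparisons, sketched here under stated assumptions. The record values are copied from this section; the dictionary layout, the provenance_ok helper, and the reading of $200\times61$ as 200 training units of 61 optimizer steps each are assumptions made for illustration, not the experiment's actual schema.

```python
T = 200               # checkpoint index from the text (t = 200)
STEPS_PER_UNIT = 61   # assumption: the "61" in 200 x 61 is optimizer steps per unit

# Values copied from the provenance record; the dict layout is hypothetical.
record = {
    "opt_counts": [12200, 12200],
    "weights_hash": "8070bbfed961ba6e",
    "init_hash": "8f2ba0bdd1efc6f2",
    "stale_vs_trained_maxabs": 27.27,
    "refreshed_vs_trained_maxabs": 0.0,
}


def provenance_ok(rec, expected):
    # Every optimizer counter must equal the expected cumulative step count.
    counts_ok = all(c == expected for c in rec["opt_counts"])
    # The weights must have moved away from initialization.
    trained = rec["weights_hash"] != rec["init_hash"]
    # The refreshed program must match the trained weights exactly.
    refresh_ok = rec["refreshed_vs_trained_maxabs"] == 0.0
    return counts_ok and trained and refresh_ok


expected = T * STEPS_PER_UNIT
print(expected, provenance_ok(record, expected))
```

The stale_vs_trained_maxabs field is deliberately not part of the pass condition: it measures how wrong the un-refreshed constructor was, not whether the repaired run is sound.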

The result: `LADDER-INADEQUATE-TRAINED`

With the trained-weight refresh, all five A_GRID candidates fail the triple gate (band and $\ge$ N_RT round-trips/chain and hot- $\hat\tau$ self-consistency), aggregated over LADDER_SEEDS=(300,301,302):

| $\alpha_R$ | $\alpha$ ladder | swap-accept (pooled) | max | round-trips | $\hat\tau_{hot,max}$ | PASS |
|---:|---|---|---:|---:|---:|:--:|
| 0.5 | [1.0, 0.794, 0.630, 0.5] | [0.0099, 0.0066, 0.0037] | 0.0099 | 0 | 308 | ✗ |
| 0.4 | [1.0, 0.737, 0.543, 0.4] | [0.0047, 0.0036, 0.0016] | 0.0047 | 0 | 322 | ✗ |
| 0.3 | [1.0, 0.669, 0.448, 0.3] | [0.0027, 0.0022, 0.0009] | 0.0027 | 0 | 293 | ✗ |
| 0.2 | [1.0, 0.585, 0.342, 0.2] | [0.0016, 0.0007, 0.00005] | 0.0016 | 0 | 146 | ✗ |
| 0.1 | [1.0, 0.464, 0.215, 0.1] | [0.0009, 0.00004, 0.0] | 0.0009 | 0 | 288 | ✗ |

The best candidate ( $\alpha_R=0.5$ , narrowest geometric spacing) reaches pooled swap-acceptance of only 0.0099 — $\sim$ 15× below the band floor of 0.15 — and widening the spacing makes it worse (monotone decrease to 0.0009 at $\alpha_R=0.1$ : more separation $\Rightarrow$ less replica overlap). Round-trips are 0 on every chain and every candidate. Adjacent replicas essentially never swap, so the R4 ladder cannot transport the cold chain across the trained DTM's energy barriers. The run STOPPED before the $T_O$ probe (halt = "ladder_inadequate"). The 1-epoch smoke showed the same collapse direction — swap-accept $\approx 0.4$ (init) $\to \approx 0.007$ (trained) — and the full $t=200$ run confirms it at scale. The flat init model swaps; the trained model does not. exp15's "A6 reachable" was an init-weight artifact.
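The ladder values and the band shortfall can be reproduced with a short sketch. The closed form $\alpha_k = \alpha_R^{k/(R-1)}$ is an assumption inferred from the tabulated ladders (it reproduces them to three decimals), not taken from build_alpha_programs. The band test below checks only whether the best adjacent pair reaches the SWAP_BAND floor, which is a simplification: the frozen gate also requires round-trips and hot-$\hat\tau$ self-consistency. The function names are hypothetical.

```python
A_GRID = (0.5, 0.4, 0.3, 0.2, 0.1)
R = 4
SWAP_BAND = (0.15, 0.60)


def geometric_ladder(alpha_r, r=R):
    # Assumed form: alpha_k = alpha_r ** (k / (r - 1)) for k = 0 .. r-1.
    return [alpha_r ** (k / (r - 1)) for k in range(r)]


# Pooled adjacent-pair swap acceptance, copied from the results table.
POOLED_SWAP_ACCEPT = {
    0.5: (0.0099, 0.0066, 0.0037),
    0.4: (0.0047, 0.0036, 0.0016),
    0.3: (0.0027, 0.0022, 0.0009),
    0.2: (0.0016, 0.0007, 0.00005),
    0.1: (0.0009, 0.00004, 0.0),
}


def best_pair_reaches_floor(rates, band=SWAP_BAND):
    # Simplified band test: does even the best adjacent pair reach the floor?
    return max(rates) >= band[0]


for alpha_r in A_GRID:
    ladder = [round(a, 3) for a in geometric_ladder(alpha_r)]
    rates = POOLED_SWAP_ACCEPT[alpha_r]
    print(alpha_r, ladder, max(rates), best_pair_reaches_floor(rates))

# How far the best candidate sits below the band floor (about 15x).
shortfall = SWAP_BAND[0] / max(POOLED_SWAP_ACCEPT[0.5])
print(round(shortfall, 1))
```

Because every candidate fails even this weakest reading of the band criterion, the outcome does not depend on how the full triple gate combines its three conditions.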

Scope and caveats

This invalidates exp15's P0-RESOLVED and exp16's F4-fail as evidence for the trained DTM, since both rode the buggy init-weight kernels. exp15's $T_O \approx 1.266\times10^4$ anchor is contaminated, not cross-checked here, and must be re-derived once mixing is solved. exp17 is moot as designed (it assumes the negative-phase R4 PT works) and is HELD, not run; its gate-2 code is correct and verified (battery 20/20, extractor exact on $P=25200$) and reusable. Risk 5 (A2 $\leftrightarrow$ A6 antagonism) is CORROBORATED: exp4/exp6 found 4-block Gibbs gives $\tau \propto L$, and now R4 PT also fails to mix the trained DTM, so the obstruction is deeper than "use a better kernel with 4 replicas."

What it does not show: the conditional factorization stays [solid], the operational claim stays [conjectured] — no tag moves. This is an evidence/erratum correction, not a refutation of $Q_{op} \approx Q_{struct}^{\perp}$ . The result is config-scoped (60_12, SEED=0, $t=200$ , R4 geometric ladder): "R4 PT does not mix this trained DTM at this checkpoint," not a general impossibility for reversible PT. The hardware-divergence note is honest provenance — an independent laptop-trained $t=200$ DTM (hash differs from exp15's H200 run) — but the findings are structural, not tied to one weight realization.

What this feeds: the clean new ground truth the exp15/exp16 errata cite, and the scoped feasibility-first next probe — whether a larger replica count or finer/adaptive spacing mixes the trained DTM at all.
