Thermodynamic Machine Learning · MMXXVI
Experiment · 18.VI.MMXXVI · Read 5 min

Exp 15-recheck — The Init-Weight Bug, Isolated

Entry 22

A single inconsistent chain — local kernels from the init model, swap energies from the trained model — manufactured a mixing result that vanishes the moment the bug is repaired.

This is the complete technical record for experiments/exp15-recheck-trained-weights/. Here we keep the provenance, the gate, and the claim-status discipline. It is a MEASURE-ONLY re-check: it moves no claim-status tag and carries no PROCEED/HALT/PASS verdict.

The question

exp15 reported "aggregate TOT_O Cal-STABLE / A6 reachable under R4 parallel tempering" and exp16 reported "the negative phase mixes well at scale ($MSE_{neg} \propto 1/K$)." The exp17 gate-2 code verification (the gibbs_vs_PT extractor-agreement guard, the one independent Gibbs ground-truth check that exp15/16 lacked) uncovered a bug in the FROZEN exp15 build_alpha_programs, which exp16 reused verbatim. So the question here is narrow and surgical. Repair only that bug, change no threshold, and re-ask the branch-point question on the real trained DTM: does the R4 geometric $\alpha$-ladder actually mix?

The bug

build_alpha_programs builds each replica's local kernel via AnnealingIsingSamplingProgram(step.model, …), whose constructor reads step.model.FACTORS. DTM training updates step.model.weights and the existing program_* interactions but not step.model.factors (DTM.py:330, step_ebm.py:116, annealing_graph_ising.py:226). So the PT local kernel sampled the INIT model while the swap energy (energy_free, which reads step.model.weights) used the TRAINED model: an inconsistent chain whose cold replica does not target $\pi_\theta$.
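The failure mode can be reproduced in miniature. The sketch below is a toy, not the DTM code: `ToyModel`, `train`, `kernel_params`, and `swap_params` are hypothetical names, and the only behaviour it assumes is the one described above, namely that a snapshot taken at construction (`factors`) is not refreshed when `weights` are updated.

```python
from dataclasses import dataclass, field


@dataclass
class ToyModel:
    """Hypothetical stand-in for step.model; not the real DTM class."""
    weights: list
    factors: list = field(init=False)

    def __post_init__(self):
        # Snapshot taken once at construction and never refreshed afterwards.
        self.factors = list(self.weights)


def train(model, delta):
    # Assumed behaviour, as described above: weights move, factors do not.
    model.weights = [w + delta for w in model.weights]


def kernel_params(model):
    # What a sampler constructor that reads `factors` would see.
    return list(model.factors)


def swap_params(model):
    # What an energy function that reads `weights` would see.
    return list(model.weights)


model = ToyModel([0.0, -0.5, 0.25])
train(model, 4.5)
stale_maxabs = max(
    abs(k - s) for k, s in zip(kernel_params(model), swap_params(model))
)
print(stale_maxabs)
```

In this toy the local kernel still sees the initial parameters while the swap energy sees the trained ones, so the two halves of the chain disagree by exactly the training displacement.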

The minimal repro, recorded at design time: on a trained model with $|\Delta w| \approx 4.5$, the rebuilt program's weights $==$ INIT (err 0), $\ne$ TRAINED (err 4.5); the R1 PT-vs-Gibbs cosine is 0.18 (buggy) and 1.0000 with the trained-weight refresh. The fix refreshes per_block_interactions via get_new_per_block_interactions(prog, step.model.weights, step.model.biases) + eqx.tree_at, exactly what training_spec.update_weights_and_biases already does. This is the only change. No frozen estimator-hygiene constant was relaxed.
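The shape of the repair, an out-of-place replacement of one field followed by an exactness check, can be sketched with the standard library. This is an illustration under assumptions: `ToyProgram`, `interactions_from`, `refresh`, and `maxabs_err` are hypothetical, `dataclasses.replace` stands in for the out-of-place update that `eqx.tree_at` performs on a pytree, and `interactions_from` stands in for get_new_per_block_interactions, whose internals are not reproduced here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyProgram:
    """Hypothetical stand-in for a sampling program; immutable like a pytree."""
    per_block_interactions: tuple


def interactions_from(weights, biases):
    # Placeholder for the real interaction builder (assumption: it is a pure
    # function of the current weights and biases).
    return tuple(weights) + tuple(biases)


def refresh(prog, weights, biases):
    # Out-of-place update: returns a new program, leaves `prog` untouched.
    return replace(prog, per_block_interactions=interactions_from(weights, biases))


def maxabs_err(prog, weights, biases):
    # Largest absolute gap between the program and the given parameters.
    target = interactions_from(weights, biases)
    return max(abs(a - b) for a, b in zip(prog.per_block_interactions, target))


init_w, init_b = (0.0, 1.0), (0.5,)
trained_w, trained_b = (4.5, -3.5), (0.5,)

stale = ToyProgram(interactions_from(init_w, init_b))  # built from init params
fixed = refresh(stale, trained_w, trained_b)

print(maxabs_err(stale, trained_w, trained_b))  # nonzero: program is stale
print(maxabs_err(fixed, trained_w, trained_b))  # zero: program matches trained
```

Asserting that the refreshed error is exactly zero, rather than merely small, mirrors the refresh_ok style of check reported in this entry.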

The setup

Substrate pschilliOrange/dtm-replication @ 7c22d19, .venv-exp3 (jax==0.10.1, GPU backend), reversible-PT kernel LIVE. The frozen exp15 machinery (pt_traj, pt_super_sweep, energy_free, build_maps, sokal_profile_from_spins, classify_curve, measure_swap_accept, and the constants A_GRID, $R=4$, SWAP_BAND=[0.15,0.60], N_RT, LADDER_SEEDS, SOKAL_C) is reused unchanged.

Training provenance was proven, not asserted: opt_counts=[12200,12200], expected=12200 ($=200\times61$), cumulative_ok=true, weights_hash=8070bbfed961ba6e $\ne$ init_hash=8f2ba0bdd1efc6f2, probe_rng_isolated=true. The bug is made explicit at $t=200$: the un-refreshed constructor was stale by stale_vs_trained_maxabs = 27.27 (the bug grows with training; the 1-epoch smoke showed only $\approx 2.5$), while the refreshed program matches exactly (refreshed_vs_trained_maxabs = 0.0, refresh_ok=true). $P=25200$ gradient observables, $n_{free}=3600$. Pre-registered sequence: ladder selection FIRST, and if the fixed R4 ladder is inadequate, STOP before the TOT_O probe.
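A provenance check of this kind can be sketched as follows. The hashing scheme and the function names are hypothetical (the entry does not say how its hashes are computed), and reading $200\times61$ as epochs times optimizer steps per epoch is an assumption.

```python
import hashlib
import struct


def weights_hash(weights):
    # Hypothetical scheme: SHA-256 over little-endian doubles, truncated to
    # 16 hex characters. The real run's hash function is not specified.
    payload = b"".join(struct.pack("<d", float(w)) for w in weights)
    return hashlib.sha256(payload).hexdigest()[:16]


def check_provenance(opt_counts, n_epochs, steps_per_epoch, init_w, trained_w):
    # Assumed reading: expected optimizer steps = epochs * steps per epoch.
    expected = n_epochs * steps_per_epoch
    return {
        "expected": expected,
        "cumulative_ok": all(count == expected for count in opt_counts),
        "weights_changed": weights_hash(trained_w) != weights_hash(init_w),
    }


report = check_provenance(
    opt_counts=[12200, 12200],   # values reported in this entry
    n_epochs=200,
    steps_per_epoch=61,
    init_w=[0.0, 0.5],           # toy weights, not the real model
    trained_w=[4.5, -3.0],
)
print(report)
```

The point of the pattern is that every field is computed from the run and compared, so a silently untrained or partially trained checkpoint fails the check instead of passing by assertion.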

The result: LADDER-INADEQUATE-TRAINED

With the trained-weight refresh, all five A_GRID candidates fail the triple gate (band and $\ge$ N_RT round-trips/chain and hot-$\hat\tau$ self-consistency), aggregated over LADDER_SEEDS=(300,301,302):

| $\alpha_R$ | $\alpha$ ladder | swap-accept (pooled) | max | round-trips | $\hat\tau_{hot,max}$ | PASS |
|---:|---|---|---:|---:|---:|:--:|
| 0.5 | [1.0, 0.794, 0.630, 0.5] | [0.0099, 0.0066, 0.0037] | 0.0099 | 0 | 308 | ✗ |
| 0.4 | [1.0, 0.737, 0.543, 0.4] | [0.0047, 0.0036, 0.0016] | 0.0047 | 0 | 322 | ✗ |
| 0.3 | [1.0, 0.669, 0.448, 0.3] | [0.0027, 0.0022, 0.0009] | 0.0027 | 0 | 293 | ✗ |
| 0.2 | [1.0, 0.585, 0.342, 0.2] | [0.0016, 0.0007, 0.00005] | 0.0016 | 0 | 146 | ✗ |
| 0.1 | [1.0, 0.464, 0.215, 0.1] | [0.0009, 0.00004, 0.0] | 0.0009 | 0 | 288 | ✗ |
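The ladder column is consistent with a geometric interpolation $\alpha_k = \alpha_R^{k/(R-1)}$ for $k = 0, \dots, R-1$, which the sketch below reproduces to three decimals. The gate function is a hypothetical simplification: how the frozen code applies the band (per pair, pooled) and how it tests hot-$\hat\tau$ self-consistency are not specified here, so the latter is passed in as a boolean, and the `n_rt=1` used in the example is a placeholder, not the real N_RT.

```python
def geometric_ladder(alpha_r, n_replicas=4):
    # alpha_k = alpha_r ** (k / (R - 1)): from 1.0 (cold) down to alpha_r (hot).
    return [alpha_r ** (k / (n_replicas - 1)) for k in range(n_replicas)]


def passes_gate(swap_accepts, round_trips, n_rt, tau_ok, band=(0.15, 0.60)):
    # Assumed reading of the triple gate: every adjacent-pair acceptance in
    # the band, every chain with at least n_rt round-trips, and tau check ok.
    in_band = all(band[0] <= a <= band[1] for a in swap_accepts)
    enough_round_trips = all(rt >= n_rt for rt in round_trips)
    return in_band and enough_round_trips and tau_ok


ladder = [round(a, 3) for a in geometric_ladder(0.5)]
print(ladder)

# Best candidate from the table: pooled acceptances and zero round-trips.
verdict = passes_gate(
    swap_accepts=[0.0099, 0.0066, 0.0037],
    round_trips=[0, 0, 0, 0],
    n_rt=1,        # placeholder value; the real N_RT is not stated here
    tau_ok=True,   # generous assumption; the gate still fails
)
print(verdict)
```

Even with the self-consistency check granted, the best candidate fails on both remaining conditions, which is why no row in the table passes.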

The best candidate ($\alpha_R=0.5$, narrowest geometric spacing) reaches pooled swap-acceptance of only 0.0099, $\sim$15× below the band floor of 0.15, and widening the spacing makes it worse (monotone decrease to 0.0009 at $\alpha_R=0.1$: more separation $\Rightarrow$ less replica overlap). Round-trips are 0 on every chain and every candidate. Adjacent replicas essentially never swap, so the R4 ladder cannot transport the cold chain across the trained DTM's energy barriers. The run STOPPED before the TOT_O probe (halt = "ladder_inadequate"). The 1-epoch smoke showed the same collapse direction, swap-accept $\approx 0.4$ (init) $\to \approx 0.007$ (trained), and the full $t=200$ run confirms it at scale. The flat init model swaps; the trained model does not. exp15's "A6 reachable" was an init-weight artifact.
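Why overlap controls acceptance can be seen from the standard Metropolis swap rule. The sketch assumes each replica targets a density proportional to $\exp(-\alpha E(x))$, which may differ in detail from how the frozen pt_super_sweep and energy_free define the tempered targets; the energies in the example are made-up numbers chosen only to show the effect.

```python
import math


def swap_accept_prob(alpha_i, alpha_j, energy_i, energy_j):
    # Metropolis ratio for exchanging states between replicas i and j when
    # replica k targets exp(-alpha_k * E): min(1, exp((a_i - a_j)(E_i - E_j))).
    log_ratio = (alpha_i - alpha_j) * (energy_i - energy_j)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)


# Cold replica (alpha = 1.0) sitting in a deep well, hotter neighbour
# (alpha = 0.794) at higher energy. Energies are illustrative only.
p_deep = swap_accept_prob(1.0, 0.794, -100.0, -70.0)
p_flat = swap_accept_prob(1.0, 0.794, -1.0, -0.5)
print(p_deep, p_flat)

# Ratio between the band floor and the best pooled acceptance in the table.
shortfall = 0.15 / 0.0099
print(round(shortfall, 1))
```

Under this rule a nearly flat energy landscape gives acceptances close to 1, while a landscape with large energy gaps between adjacent replicas drives them toward zero at the same spacing, which matches the init-versus-trained contrast described above.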

Scope and caveats

This invalidates exp15's P0-RESOLVED and exp16's F4-fail as evidence for the trained DTM, since both rode the buggy init-weight kernels. exp15's $T_O \approx 1.266\times10^4$ anchor is contaminated, not cross-checked here, and must be re-derived once mixing is solved. exp17 is moot as designed (it assumes the negative R4 PT works) and is HELD, not run; its gate-2 code is correct and verified (battery 20/20, extractor exact on $P=25200$) and reusable. Risk 5 (A2 $\leftrightarrow$ A6 antagonism) is CORROBORATED: exp4/exp6 found 4-block Gibbs gives $\tau \propto L$, and now R4 PT also fails to mix the trained DTM. The obstruction is deeper than "use a better kernel with 4 replicas."

What it does not show: the conditional factorization stays [solid], the operational claim stays [conjectured]; no tag moves. This is an evidence/erratum correction, not a refutation of $Q_{op} \approx Q_{struct}^{\perp}$. The result is config-scoped (60_12, SEED=0, $t=200$, R4 geometric ladder): "R4 PT does not mix this trained DTM at this checkpoint," not a general impossibility for reversible PT. The hardware-divergence note is honest provenance (an independent laptop-trained $t=200$ DTM whose hash differs from exp15's H200 run), but the findings are structural, not tied to one weight realization.


What this feeds: the clean new ground truth the exp15/exp16 errata cite, and the scoped feasibility-first next probe — whether a larger replica count or finer/adaptive spacing mixes the trained DTM at all.

— fin. —