Exp 15 — GPU DTM PT P0 (and Its Erratum) · Thermodynamic Machine Learning

The first GPU DTM parallel-tempering probe looked like the mixing wall was escapable — until a code audit showed the cold chain was sampling the wrong distribution; the result stands only as plumbing.

The erratum, first

Read this entry through its erratum. A code verification of the exp17 gate-2 runner on 2026-06-18 found a bug in this run's build_alpha_programs: the per-replica PT local kernels were built from the INIT weights (AnnealingIsingSamplingProgram reads the stale step.model.factors), while the swap energy used the TRAINED weights. The result is an inconsistent PT chain whose cold replica does not target $\pi_\theta$. The near-flat init model mixes trivially; the trained model need not. The headline below (aggregate $T_O$ Cal-STABLE, A6 reachable under R4 PT) is therefore an init-weight artifact, and the $T_O \approx 1.266\times10^{4}$ estimate is contaminated.
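A toy illustration of why mismatched weights break the chain. This is a minimal sketch, not the run's code: it assumes a three-state system, two replicas, an independent-resampling local kernel, and $n_s = 1$. The energies e_init and e_trained and every function name are hypothetical.

```python
import numpy as np

def boltzmann(energy, alpha):
    """Normalised exp(-alpha * energy)."""
    w = np.exp(-alpha * np.asarray(energy, dtype=float))
    return w / w.sum()

def pt_kernel(e_local, e_swap, alphas):
    """Toy two-replica PT kernel K = 0.5 * (L S + S L) on joint states (cold, hot)."""
    n = len(e_local)
    a_cold, a_hot = alphas
    # Local move L: resample each replica independently from exp(-alpha * e_local).
    mu = np.outer(boltzmann(e_local, a_cold), boltzmann(e_local, a_hot)).ravel()
    L = np.tile(mu, (n * n, 1))
    # Swap move S: Metropolis exchange of the two replica states, scored with e_swap.
    S = np.zeros((n * n, n * n))
    for i in range(n):
        for j in range(n):
            acc = min(1.0, float(np.exp((a_cold - a_hot) * (e_swap[i] - e_swap[j]))))
            S[i * n + j, j * n + i] += acc
            S[i * n + j, i * n + j] += 1.0 - acc
    return 0.5 * (L @ S + S @ L)

def cold_marginal(K, n, iters=200):
    """Iterate the joint distribution to stationarity, then marginalise the cold replica."""
    p = np.full(n * n, 1.0 / (n * n))
    for _ in range(iters):
        p = p @ K
    return p.reshape(n, n).sum(axis=1)

e_init = np.zeros(3)                    # hypothetical flat "init" energies
e_trained = np.array([0.0, 1.0, 3.0])   # hypothetical "trained" energies
alphas = (1.0, 0.5)
target = boltzmann(e_trained, 1.0)      # the distribution the cold replica should sample

consistent = cold_marginal(pt_kernel(e_trained, e_trained, alphas), 3)
mismatched = cold_marginal(pt_kernel(e_init, e_trained, alphas), 3)

err_consistent = float(np.abs(consistent - target).max())
err_mismatched = float(np.abs(mismatched - target).max())
print(err_consistent, err_mismatched)
```

With consistent weights the cold marginal matches the target to rounding error. With init-weight local moves and trained-weight swaps it settles on a different distribution, which is the failure mode the erratum describes.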

The corrected ground truth lives in experiments/exp15-recheck-trained-weights/. That MEASURE-ONLY re-check repairs only this bug (a trained-weight refresh of the program interactions) and re-runs at the real $t=200$. It reads LADDER-INADEQUATE-TRAINED: pooled swap acceptance peaks at 0.0099 (band $[0.15, 0.60]$), and round-trips are 0 on every chain across all five A_GRID candidates. R4 PT fails at the ladder, upstream of any $T_O$ estimate. The original run record below is preserved unedited for provenance.
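For readers unfamiliar with the round-trip diagnostic, a minimal sketch of one common way to count it. It assumes a round trip means a walker visiting the cold end, then the hottest rung, then the cold end again; the re-check's exact definition may differ, and count_round_trips is a hypothetical name.

```python
def count_round_trips(levels, n_levels):
    """Count cold -> hottest -> cold excursions in a trace of ladder indices."""
    top = n_levels - 1
    trips = 0
    last_end = None  # most recent ladder end visited: "cold", "hot", or None
    for level in levels:
        if level == top:
            # Reaching the hot end only counts once the cold end has been seen.
            if last_end == "cold":
                last_end = "hot"
        elif level == 0:
            # Returning to the cold end after the hot end completes one trip.
            if last_end == "hot":
                trips += 1
            last_end = "cold"
    return trips

# A walker that climbs a four-rung ladder once and comes back, then dithers.
trace = [0, 1, 2, 3, 2, 1, 0, 1, 0]
print(count_round_trips(trace, n_levels=4))
```

A chain whose swaps are almost never accepted stays on its starting rung, so this count stays at zero, as in the re-check.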

The question

At $t=200$ on the real 60_12 MNIST DTM, exactly the checkpoint where exp4/exp6's reversible four-block Gibbs kernel gave $\tau_{max} \propto L$ (166.7→5,280, $\tau/L \approx 0.13$–$0.17$), does the R4 reversible parallel-tempering kernel $K = \tfrac{1}{2}(L\,S_{mix}^{n_s} + S_{mix}^{n_s} L)$ make the aggregate operational temperature $T_O$ Cal-STABLE? That is the G2-at-scale reachability probe for the $A6$ premise $K \gg \tau_{int}$, not a validation.

The setup

GPU tier, NVIDIA H200, backend=gpu. Fresh clone pschilliOrange/dtm-replication @ 7c22d19 plus the CPU-AUDIT device-fallback patch, thrml 0.1.3 PYTHONPATH-shadowed; jax==0.10.1 on CUDA 12. Reversible-PT kernel LIVE (patch_live=true, A2-gated). Pre-flight gates on-box before paid compute: selfadjoint_check_dtm_pt.py PASS (detailed-balance residuals $\sim 10^{-17}$–$10^{-19} \ll 10^{-10}$), zero_compute_checks.py 9/9 ALL_PASS. The run was MEASURE-ONLY: no declared budget, no verdict, no tag in the JSON.
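What a detailed-balance residual check looks like on a toy problem. This is a sketch only, not the contents of selfadjoint_check_dtm_pt.py: it assumes a five-state target and two random Metropolis kernels standing in for the local and swap moves, and all names are hypothetical. It shows that the product of two reversible kernels is generally not reversible, while the symmetrised form $\tfrac{1}{2}(LS + SL)$ is.

```python
import numpy as np

def metropolis_kernel(pi, proposal):
    """Metropolis kernel for target pi built from a symmetric proposal matrix."""
    n = len(pi)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                K[i, j] = proposal[i, j] * min(1.0, pi[j] / pi[i])
        K[i, i] = 1.0 - K[i].sum()  # rejected mass stays on the diagonal
    return K

def random_symmetric_proposal(rng, n):
    """Random symmetric proposal with zero diagonal and row sums <= 1."""
    A = rng.random((n, n))
    A = 0.5 * (A + A.T)
    np.fill_diagonal(A, 0.0)
    return A / A.sum(axis=1).max()  # scalar rescale preserves symmetry

def detailed_balance_residual(pi, K):
    """max |pi_i K_ij - pi_j K_ji|, zero when K is reversible with respect to pi."""
    flow = pi[:, None] * K
    return float(np.abs(flow - flow.T).max())

rng = np.random.default_rng(0)
pi = rng.random(5) + 0.1
pi /= pi.sum()
L = metropolis_kernel(pi, random_symmetric_proposal(rng, 5))  # stand-in local move
S = metropolis_kernel(pi, random_symmetric_proposal(rng, 5))  # stand-in swap move

res_product = detailed_balance_residual(pi, L @ S)
res_symmetrised = detailed_balance_residual(pi, 0.5 * (L @ S + S @ L))
print(res_product, res_symmetrised)
```

Such a check certifies reversibility with respect to whatever target the kernels were built for. It cannot detect that the local kernels and the swap energy disagree about that target.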

Method: a frozen-before-verdict, triple-gate geometric α-ladder scan over A_GRID=(0.5,0.4,0.3,0.2,0.1), then a doubling-stability calibration of the cold replica ($\alpha_1=1$) with a half-Sokal $\tau_{int}$ estimator. Training provenance was proven, not asserted: opt_counts=[12200,12200] $=200\times61$, weights_hash $\neq$ init_hash, probe-RNG isolated.
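A geometric ladder can be sketched as follows. The assumptions are that the ladder interpolates geometrically from $\alpha_1 = 1$ down to the hottest value $\alpha_R$, and that there are four replicas (inferred from the four-entry ladder reported for this run). The function name is hypothetical.

```python
def geometric_ladder(alpha_hot, n_replicas):
    """Return alpha_k = alpha_hot ** (k / (R - 1)) for k = 0..R-1, so alpha_0 = 1."""
    if n_replicas < 2:
        raise ValueError("a ladder needs at least two replicas")
    return [alpha_hot ** (k / (n_replicas - 1)) for k in range(n_replicas)]

A_GRID = (0.5, 0.4, 0.3, 0.2, 0.1)
ladders = {a: geometric_ladder(a, 4) for a in A_GRID}

for a, ladder in ladders.items():
    print(a, [round(x, 4) for x in ladder])
```

Adjacent rungs have a constant ratio, so one parameter ($\alpha_R$) fixes the whole ladder and the scan reduces to a one-dimensional grid.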

The result (as originally read — now withdrawn)

Phase 0 H200 SMOKE: projected probe $2.315\,h < 4.203\,h$ remaining → proceeded. The ladder selected $\alpha_R=0.5$, $\alpha=[1.0, 0.7937, 0.6300, 0.5]$, robust over LADDER_SEEDS=(300,301,302): pooled swap-accept $[0.366, 0.428, 0.376]$, round-trips $\min = 85 \ge 5$, $\hat{\tau}_{hot,max}=0.561$.
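The ladder numbers can be checked against the two gates whose thresholds appear in this entry: the swap-acceptance band $[0.15, 0.60]$ and a minimum of 5 round-trips. This is a sketch with hypothetical names. The third gate is omitted because its threshold is not stated here.

```python
SWAP_BAND = (0.15, 0.60)   # acceptance band quoted in the erratum
MIN_ROUND_TRIPS = 5        # threshold implied by "min = 85 >= 5"

def ladder_gates(pooled_accepts, min_round_trips,
                 band=SWAP_BAND, min_trips=MIN_ROUND_TRIPS):
    """Evaluate the swap-band and round-trip gates for one ladder candidate."""
    lo, hi = band
    return {
        "swap_band": all(lo <= a <= hi for a in pooled_accepts),
        "round_trips": min_round_trips >= min_trips,
    }

# Values reported for the original (init-weight) run.
original = ladder_gates([0.366, 0.428, 0.376], 85)
# Best case reported by the trained-weight re-check.
recheck = ladder_gates([0.0099], 0)

print(original)
print(recheck)
```

The original run clears both gates, while the re-check's best pooled acceptance falls far below the band and no chain completes a round-trip.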

Doubling calibration:

 L     warm   τ_max     T_O = ½ΣS_a   ‖ΔS‖₁/ΣS
 1000  200    0.5778    12692.93      —
 2000  200    0.5584    12678.03      0.01054
 4000  200    0.5509    12661.49      0.00847

Both registered axes held over two consecutive doublings: $\hat{\tau}$ relative drift $0.033$ then $0.013$, and $|\Delta T_O|/T_O = 0.00117$ then $0.00130$, all $< \text{STAB\_TOL}=0.15$. Registered outcome P0-RESOLVED, failed_axis=null, frozen at $L=4000$: $T_O^{*} \approx 1.2661\times10^{4}$, $\hat{\tau}^{*}=0.5509$. Wall 0.964 GPU-h of a 5 GPU-h hard cap. No frozen hygiene constant (TAU_TOL, SOKAL_C, STAB_TOL) was relaxed.
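The drift arithmetic can be reproduced from the calibration table. This sketch assumes each relative drift is normalised by the value before the doubling, which is the convention that matches the quoted $|\Delta T_O|/T_O$ figures. The $\hat{\tau}$ drifts agree only to the precision of the rounded table entries.

```python
STAB_TOL = 0.15

# (L, tau_max, T_O) rows copied from the doubling-calibration table.
rows = [
    (1000, 0.5778, 12692.93),
    (2000, 0.5584, 12678.03),
    (4000, 0.5509, 12661.49),
]

def relative_drifts(values):
    """|x_next - x_prev| / |x_prev| for each consecutive pair (assumed convention)."""
    return [abs(b - a) / abs(a) for a, b in zip(values, values[1:])]

tau_drift = relative_drifts([r[1] for r in rows])
to_drift = relative_drifts([r[2] for r in rows])
stable = all(d < STAB_TOL for d in tau_drift + to_drift)

print(tau_drift, to_drift, stable)
```

Stability of these drifts says only that the estimator has converged for the chain that was actually run. It says nothing about whether that chain targeted $\pi_\theta$.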

Scope and caveats

What still stands: the reversibility ($A2$) certificate, the MEASURE-ONLY discipline, the provenance machinery, and the plumbing of the calibrator. What is withdrawn: the entire scientific reading. This is not evidence that $A6$ is reachable at scale, and it does not advance the operational tier toward validated.

Even before the erratum, the run was honestly fenced by its registered confounds: no exact-$\pi_r$ init (each replica burns in, so the run showed reachability under this schedule, not a clean genuine-plateau exclusion); estimated $\hat{\text{Var}}_a$ entered $S_a$; speedup vs $P_{sym}$ was out of scope (there is no exact $T_O(P_{sym})$ baseline, so the exp4/exp6 contrast is a qualitative kernel effect, not a measured speedup); and with no exact $g$, the $Q$-test was out of scope.

No tag moves: the run was always MEASURE-ONLY. The conditional factorization stays [solid], the operational claim stays [conjectured]. Fundamentality remains OPEN; the result is config-scoped (60_12, SEED=0, $t=200$, INPUT_IDX=0).

What this feeds: nothing toward validation. The corrected ground truth is experiments/exp15-recheck-trained-weights/ (LADDER-INADEQUATE-TRAINED), and that is what carries forward.