Thermodynamic Machine Learning · MMXXVI
Experiment · 16.VI.MMXXVI · Read 3 min

Exp 15 — GPU DTM PT P0 (and Its Erratum)

Entry 17

The first GPU DTM parallel-tempering probe looked like the mixing wall was escapable — until a code audit showed the cold chain was sampling the wrong distribution; the result stands only as plumbing.

The erratum, first

Read this entry through its erratum. A 2026-06-18 code-verification of the exp17 gate-2 runner found a bug in this run's build_alpha_programs: the per-replica PT local kernels were built from the INIT weights (AnnealingIsingSamplingProgram reads the stale step.model.factors), while the swap energy used the TRAINED weights. That is an inconsistent PT chain whose cold replica does not target $\pi_\theta$. The near-flat init model mixes trivially; the trained model need not. So the headline below (aggregate $T_O$ Cal-STABLE, A6 reachable under R4 PT) is an init-weight artifact, and the $T_O \approx 1.266\times10^{4}$ estimate is contaminated.
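Why a stale-weight local kernel breaks the chain can be seen on a toy that has nothing to do with the real DTM. The sketch below is a hypothetical three-spin Ising model (the coupling values and every function name are assumptions made for illustration, not the repository's code). It builds a random-scan Gibbs transition matrix from one set of weights and measures whether the Boltzmann distribution of another set is stationary under it.

```python
import itertools
import math

def energy(state, couplings):
    """Ising energy E(s) = -sum_{i<j} J_ij s_i s_j for spins in {-1, +1}."""
    return -sum(j * state[a] * state[b] for (a, b), j in couplings.items())

def boltzmann(states, couplings):
    """Exact Boltzmann distribution over an enumerated state list."""
    weights = [math.exp(-energy(s, couplings)) for s in states]
    z = sum(weights)
    return [w / z for w in weights]

def gibbs_matrix(states, couplings):
    """Random-scan single-site Gibbs transition matrix built from `couplings`."""
    n = len(states[0])
    index = {s: k for k, s in enumerate(states)}
    P = [[0.0] * len(states) for _ in states]
    for s in states:
        for site in range(n):
            up = s[:site] + (1,) + s[site + 1:]
            down = s[:site] + (-1,) + s[site + 1:]
            w_up = math.exp(-energy(up, couplings))
            w_down = math.exp(-energy(down, couplings))
            p_up = w_up / (w_up + w_down)
            P[index[s]][index[up]] += p_up / n
            P[index[s]][index[down]] += (1.0 - p_up) / n
    return P

def invariance_residual(pi, P):
    """L1 norm of (pi P - pi); zero exactly when pi is stationary for P."""
    m = len(pi)
    return sum(abs(sum(pi[i] * P[i][j] for i in range(m)) - pi[j]) for j in range(m))

states = list(itertools.product((-1, 1), repeat=3))
init_w = {(0, 1): 0.01, (1, 2): -0.01, (0, 2): 0.0}   # near-flat, hypothetical
trained_w = {(0, 1): 1.2, (1, 2): -0.8, (0, 2): 0.5}  # hypothetical "trained" values
pi_trained = boltzmann(states, trained_w)

consistent = invariance_residual(pi_trained, gibbs_matrix(states, trained_w))
stale = invariance_residual(pi_trained, gibbs_matrix(states, init_w))
print(f"kernel from trained weights: residual {consistent:.2e}")
print(f"kernel from init weights:    residual {stale:.2e}")
```

The first residual sits at floating-point noise; the second does not, which is the toy analogue of a cold replica that no longer targets the trained distribution.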

The corrected ground truth lives in experiments/exp15-recheck-trained-weights/: a MEASURE-ONLY re-check repairing only that bug (a trained-weight refresh of the program interactions) and re-running at the real $t=200$ reads LADDER-INADEQUATE-TRAINED: pooled swap-acceptance maxes at 0.0099 (band $[0.15, 0.60]$), and round-trips = 0 on every chain across all five A_GRID candidates. R4 PT fails at the ladder, upstream of any $T_O$ estimate. The original run record below is preserved unedited for provenance.
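The two ladder diagnostics quoted above can be sketched as follows. This is a minimal illustration under stated assumptions: rung targets are taken proportional to exp(-α·E), a round trip is counted as a cold → hot → cold excursion of one chain, and all function names are hypothetical rather than taken from the runner.

```python
import math

ACCEPT_BAND = (0.15, 0.60)  # acceptance band quoted in the text

def swap_accept_prob(alpha_a, alpha_b, energy_a, energy_b):
    """Metropolis probability of exchanging the states held at rungs a and b,
    assuming each rung targets a density proportional to exp(-alpha * E)."""
    log_ratio = (alpha_a - alpha_b) * (energy_a - energy_b)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)

def count_round_trips(rung_trace, n_rungs):
    """Count cold -> hot -> cold excursions in one chain's rung-index trace.
    Rung 0 is the cold end, rung n_rungs - 1 the hot end (an assumed convention)."""
    trips = 0
    started = False      # has the chain visited the cold rung yet?
    reached_hot = False  # has it touched the hot rung since the last cold visit?
    for rung in rung_trace:
        if rung == 0:
            if started and reached_hot:
                trips += 1
            started, reached_hot = True, False
        elif rung == n_rungs - 1 and started:
            reached_hot = True
    return trips

def in_band(rate, band=ACCEPT_BAND):
    """True when a pooled swap-acceptance rate lies inside the band."""
    return band[0] <= rate <= band[1]

# A chain that never leaves the two coldest rungs completes no round trip.
print(count_round_trips([0, 1, 0, 1, 1, 0], n_rungs=4))
print(in_band(0.0099), in_band(0.366))
```

A pooled acceptance of 0.0099 fails the band check by more than an order of magnitude, and a chain that never reaches the hot rung contributes zero round trips however long it runs.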

The question

At $t=200$ on the real 60_12 MNIST DTM, exactly the checkpoint where exp4/exp6's reversible four-block Gibbs kernel gave $\tau_{max} \propto L$ (166.7→5,280, $\tau/L \approx 0.13$–$0.17$), does the R4 reversible parallel-tempering kernel $K = \tfrac{1}{2}(L\,S_{mix}^{n_s} + S_{mix}^{n_s} L)$ make the aggregate operational temperature $T_O$ Cal-STABLE? That is the G2-at-scale reachability probe for the $A6$ premise $K \gg \tau_{int}$, not validation.

The setup

GPU tier, NVIDIA H200, backend=gpu. Fresh clone pschilliOrange/dtm-replication @ 7c22d19 plus the CPU-AUDIT device-fallback patch, thrml 0.1.3 PYTHONPATH-shadowed; jax==0.10.1 on CUDA 12. Reversible-PT kernel LIVE (patch_live=true, A2-gated). Pre-flight gates on-box before paid compute: selfadjoint_check_dtm_pt.py PASS (detailed-balance residuals $\sim 10^{-17}$–$10^{-19} \ll 10^{-10}$), zero_compute_checks.py 9/9 ALL_PASS. The run was MEASURE-ONLY: no declared budget, no verdict, no tag in the JSON.
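What a self-adjointness gate of this kind checks can be illustrated on a four-state toy. The sketch below is not selfadjoint_check_dtm_pt.py; it assumes two hypothetical Metropolis kernels standing in for the two factors of the symmetrized product, forms ½(L·Sⁿ + Sⁿ·L), and reports the largest detailed-balance residual |πᵢKᵢⱼ − πⱼKⱼᵢ|. All names and the toy target are assumptions.

```python
def metropolis_kernel(pi, proposal):
    """Metropolis transition matrix for target `pi` with a symmetric proposal."""
    m = len(pi)
    P = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            if i != j:
                P[i][j] = proposal[i][j] * min(1.0, pi[j] / pi[i])
        P[i][i] = 1.0 - sum(P[i])  # rejected mass stays on the diagonal
    return P

def matmul(A, B):
    m = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(m)]
            for i in range(m)]

def matpow(A, n):
    """A multiplied by itself n times, for n >= 1."""
    out = A
    for _ in range(n - 1):
        out = matmul(out, A)
    return out

def symmetrized_kernel(L, S, n_s):
    """K = (L S^n_s + S^n_s L) / 2."""
    Sn = matpow(S, n_s)
    LS, SL = matmul(L, Sn), matmul(Sn, L)
    m = len(L)
    return [[0.5 * (LS[i][j] + SL[i][j]) for j in range(m)] for i in range(m)]

def detailed_balance_residual(pi, K):
    """max over (i, j) of |pi_i K_ij - pi_j K_ji|; zero iff K is pi-reversible."""
    m = len(pi)
    return max(abs(pi[i] * K[i][j] - pi[j] * K[j][i])
               for i in range(m) for j in range(m))

pi = [0.1, 0.2, 0.3, 0.4]  # toy target
ring = [[0.5 if (i - j) % 4 in (1, 3) else 0.0 for j in range(4)] for i in range(4)]
uniform = [[0.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]
L_toy = metropolis_kernel(pi, ring)
S_toy = metropolis_kernel(pi, uniform)

K_toy = symmetrized_kernel(L_toy, S_toy, n_s=2)
print(f"symmetrized residual: {detailed_balance_residual(pi, K_toy):.2e}")
print(f"one-sided L*S residual: {detailed_balance_residual(pi, matmul(L_toy, S_toy)):.2e}")
```

The symmetrized product is reversible to machine precision whenever both factors are, while a one-sided product of two non-commuting reversible kernels generally is not. Such a check certifies reversibility with respect to whatever target the factors were built from; it says nothing about whether that target is the intended one.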

Method: a frozen-before-verdict, triple-gate geometric α-ladder scan over A_GRID=(0.5,0.4,0.3,0.2,0.1), then a doubling-stability calibration of the cold replica ($\alpha_1=1$) with a half-Sokal $\tau_{int}$ estimator. Training-provenance was proven, not asserted: opt_counts=[12200,12200] $=200\times61$, weights_hash $\neq$ init_hash, probe-RNG isolated.
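The entry does not spell out the "half-Sokal" estimator, so the following is only a sketch of the standard Sokal automatic-windowing rule it presumably builds on. It assumes the convention τ_int = ½ + Σρ(t), under which an uncorrelated series gives 0.5, and a window constant c = 5; the run's actual SOKAL_C value is not stated here, and the function names are hypothetical.

```python
import random

def autocorr(x, lag, mean, var):
    """Biased sample autocorrelation of series x at the given lag."""
    n = len(x)
    return sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag)) / (n * var)

def tau_int_sokal(x, c=5.0):
    """Integrated autocorrelation time with Sokal's self-consistent window:
    accumulate rho(t) until the window W first satisfies W >= c * tau(W)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    tau = 0.5
    for window in range(1, n // 2):
        tau += autocorr(x, window, mean, var)
        if window >= c * tau:
            return tau, window
    return tau, n // 2  # window never closed: treat the estimate as unreliable

rng = random.Random(0)
iid = [rng.gauss(0.0, 1.0) for _ in range(2000)]

# AR(1) series with coefficient 0.9; its exact tau_int is 0.5 * 1.9 / 0.1 = 9.5.
ar1 = [0.0]
for _ in range(19999):
    ar1.append(0.9 * ar1[-1] + rng.gauss(0.0, 1.0))

tau_iid, _ = tau_int_sokal(iid)
tau_ar1, _ = tau_int_sokal(ar1)
print(f"iid: tau_int ~ {tau_iid:.3f}, AR(1): tau_int ~ {tau_ar1:.3f}")
```

Under this convention, values just above 0.5 mean the measured series looks nearly uncorrelated. That is what a trivially mixing chain produces, whichever distribution it is actually sampling.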

The result (as originally read — now withdrawn)

Phase 0 H200 SMOKE: projected probe $2.315\,h < 4.203\,h$ remaining → proceeded. The ladder selected $\alpha_R=0.5$, $\alpha=[1.0, 0.7937, 0.6300, 0.5]$, robust over LADDER_SEEDS=(300,301,302): pooled swap-accept $[0.366, 0.428, 0.376]$, round-trips $\min = 85 \ge 5$, $\hat{\tau}_{hot,max}=0.561$.
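The quoted ladder is what a geometric spacing between 1 and the hot-end value gives for four replicas. A minimal sketch, with a hypothetical function name and the replica count inferred from the four listed values:

```python
def geometric_ladder(alpha_hot, n_replicas):
    """Geometric ladder from alpha = 1 (cold) down to alpha_hot (hot):
    alpha_k = alpha_hot ** (k / (n_replicas - 1))."""
    return [alpha_hot ** (k / (n_replicas - 1)) for k in range(n_replicas)]

ladder = geometric_ladder(0.5, 4)
print([round(a, 4) for a in ladder])  # → [1.0, 0.7937, 0.63, 0.5]
```

Adjacent rungs share a constant ratio (here $0.5^{1/3} \approx 0.7937$), so a single number, the hot-end value drawn from A_GRID, fixes the whole ladder.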

Doubling calibration:

 L     warm   τ_max     T_O = ½ΣS_a   ‖ΔS‖₁/ΣS
 1000  200    0.5778    12692.93      —
 2000  200    0.5584    12678.03      0.01054
 4000  200    0.5509    12661.49      0.00847

Both registered axes held over two consecutive doublings: $\hat{\tau}$ relative drift $0.033$ then $0.013$, and $|\Delta T_O|/T_O = 0.00117$ then $0.00130$, all $< \text{STAB\_TOL}=0.15$. Registered outcome P0-RESOLVED, failed_axis=null, frozen at $L=4000$: $T_O^{*} \approx 1.2661\times10^{4}$, $\hat{\tau}^{*}=0.5509$. Wall 0.964 GPU-h of a 5 GPU-h hard cap. No frozen hygiene constant (TAU_TOL, SOKAL_C, STAB_TOL) was relaxed.
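The drift figures follow from the table by plain arithmetic. The sketch below recomputes them, assuming each drift is normalized by the value at the shorter run (the normalization is not stated in the record). It checks only the arithmetic of the withdrawn record, not its scientific reading.

```python
STAB_TOL = 0.15  # tolerance quoted in the text

# (L, tau_max, T_O) rows copied from the doubling-calibration table.
rows = [
    (1000, 0.5778, 12692.93),
    (2000, 0.5584, 12678.03),
    (4000, 0.5509, 12661.49),
]

def relative_drifts(values):
    """|v_next - v| / |v| for each consecutive pair of values."""
    return [abs(b - a) / abs(a) for a, b in zip(values, values[1:])]

tau_drift = relative_drifts([r[1] for r in rows])
t_o_drift = relative_drifts([r[2] for r in rows])
stable = all(d < STAB_TOL for d in tau_drift + t_o_drift)

print("tau drift:", [round(d, 4) for d in tau_drift])
print("T_O drift:", [round(d, 5) for d in t_o_drift])
print("both axes within STAB_TOL:", stable)
```

The recomputed values agree with the quoted ones up to the rounding of the table entries.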

Scope and caveats

What still stands: the reversibility ($A2$) certificate, the MEASURE-ONLY discipline, the provenance machinery, and the plumbing of the calibrator. What is withdrawn: the entire scientific reading. This is not evidence that $A6$ is reachable at scale, and it does not advance the operational tier toward validated.

Even before the erratum the run was honestly fenced (the registered confounds): no exact-$\pi_r$ init (each replica burns in, so this showed reachability under this schedule, not a clean genuine-plateau exclusion); estimated $\hat{\text{Var}}_a$ entered $S_a$; speedup vs $P_{sym}$ out of scope (no exact $T_O(P_{sym})$ baseline: the exp4/exp6 contrast is a qualitative kernel effect, not a measured speedup); no exact $g$, so the $Q$-test was out of scope.

No tag moves: it was always MEASURE-ONLY. The conditional factorization stays [solid], the operational claim stays [conjectured]. Fundamentality remains OPEN; the result is config-scoped (60_12, SEED=0, $t=200$, INPUT_IDX=0).


What this feeds: nothing toward validation. The corrected ground truth is experiments/exp15-recheck-trained-weights/ (LADDER-INADEQUATE-TRAINED), and that is what carries forward.

— fin. —