The first GPU DTM parallel-tempering probe appeared to show that the mixing wall was escapable, until a code audit showed the cold chain was sampling the wrong distribution; the result stands only as a plumbing check.
The erratum, first
Read this entry through its erratum. A 2026-06-18 code verification of the exp17 gate-2 runner found a bug in this run's build_alpha_programs: the per-replica PT local kernels were built from the INIT weights (AnnealingIsingSamplingProgram reads the stale step.model.factors), while the swap energy used the TRAINED weights. That is an inconsistent PT chain: the local moves and the swap moves are balanced against different energies, so the cold replica does not target the trained model's distribution. The near-flat init model mixes trivially; the trained model need not. So the headline below (aggregate Cal-STABLE, A6 reachable under R4 PT) is an init-weight artifact, and the estimate is contaminated.
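A guard of the following shape would catch this class of bug before any sampling starts. It is a minimal sketch, not the repository's code: `fingerprint` and `check_pt_consistency` are hypothetical helpers, and the three weight lists are made-up stand-ins for whatever the local kernels, the swap energy, and the init checkpoint actually hold.

```python
import hashlib
import struct


def fingerprint(weights):
    """SHA-256 over a flat sequence of floats, packed as little-endian float64."""
    h = hashlib.sha256()
    for w in weights:
        h.update(struct.pack("<d", float(w)))
    return h.hexdigest()


def check_pt_consistency(kernel_weights, swap_weights, init_weights):
    """Return a list of problems; an empty list means the two views agree."""
    problems = []
    k = fingerprint(kernel_weights)
    s = fingerprint(swap_weights)
    i = fingerprint(init_weights)
    if k != s:
        problems.append("local kernels and swap energy read different weights")
    if k == i:
        problems.append("local kernels still read the init weights")
    return problems


# Hypothetical toy weights, for illustration only.
init = [0.0, 0.01, -0.01]
trained = [1.3, -0.7, 2.1]

# The situation the erratum describes: kernels on init, swaps on trained.
buggy = check_pt_consistency(kernel_weights=init, swap_weights=trained, init_weights=init)

# After a trained-weight refresh both views agree and differ from init.
fixed = check_pt_consistency(kernel_weights=trained, swap_weights=trained, init_weights=init)
```

Comparing against the init fingerprint as well as against the swap weights matters: a check that only proves the model was trained says nothing about which weights the sampler program actually reads.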
The corrected ground truth lives in experiments/exp15-recheck-trained-weights/: a MEASURE-ONLY re-check repairing only that bug (a trained-weight refresh of the program interactions) and re-running at the real reads LADDER-INADEQUATE-TRAINED — pooled swap-acceptance maxes at 0.0099 (band ), round-trips = 0 on every chain across all five A_GRID candidates. R4 PT fails at the ladder, upstream of any estimate. The original run record below is preserved unedited for provenance.
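The two ladder diagnostics quoted above can be sketched as follows. This assumes one common convention (a round trip is a cold to hot to cold excursion of a single walker, and pooled acceptance is total accepted swaps over total proposed swaps); the re-check's exact definitions are not reproduced in this entry, and the function names and toy trajectories are hypothetical.

```python
def count_round_trips(rungs, n_rungs):
    """Count cold -> hot -> cold excursions in one walker's rung trajectory."""
    cold, hot = 0, n_rungs - 1
    trips = 0
    state = None  # None: cold rung not seen yet; "up": heading hot; "down": hot reached
    for r in rungs:
        if r == cold:
            if state == "down":
                trips += 1  # walker returned to the cold rung after touching the hot one
            state = "up"
        elif r == hot and state == "up":
            state = "down"
    return trips


def pooled_swap_acceptance(accepted, proposed):
    """Total accepted swaps over total proposed swaps, pooled across rung pairs."""
    total = sum(proposed)
    return sum(accepted) / total if total else float("nan")


# Hypothetical trajectories on a 3-rung ladder.
mixing_walker = [0, 1, 2, 1, 0, 1, 2, 1, 0]
stuck_walker = [0, 1, 0, 1, 0, 0, 1, 0]

trips_mixing = count_round_trips(mixing_walker, n_rungs=3)
trips_stuck = count_round_trips(stuck_walker, n_rungs=3)

# Hypothetical counts for three adjacent rung pairs.
rate = pooled_swap_acceptance(accepted=[3, 0, 1], proposed=[400, 400, 400])
```

A walker that never reaches the hot rung scores zero round trips no matter how often it swaps locally, which is why zero round trips on every chain is read as a ladder failure rather than as slow convergence.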
The question
At on the real 60_12 MNIST DTM — exactly the checkpoint where exp4/exp6's reversible four-block Gibbs kernel gave (166.7→5,280, –) — does the R4 reversible parallel-tempering kernel make the aggregate operational temperature Cal-STABLE? That is the G2-at-scale reachability probe for the premise — not validation.
The setup
GPU tier, NVIDIA H200, backend=gpu. Fresh clone pschilliOrange/dtm-replication @ 7c22d19 plus the CPU-AUDIT device-fallback patch, thrml 0.1.3 PYTHONPATH-shadowed; jax==0.10.1 on CUDA 12. Reversible-PT kernel LIVE (patch_live=true, A2-gated). Pre-flight gates on-box before paid compute: selfadjoint_check_dtm_pt.py PASS (detailed-balance residuals –), zero_compute_checks.py 9/9 ALL_PASS. The run was MEASURE-ONLY: no declared budget, no verdict, no tag in the JSON.
Method: a frozen-before-verdict, triple-gate geometric α-ladder scan over A_GRID=(0.5,0.4,0.3,0.2,0.1), then a doubling-stability calibration of the cold replica () with a half-Sokal estimator. Training provenance was proven, not asserted: opt_counts=[12200,12200], weights_hash ≠ init_hash, probe-RNG isolated.
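For orientation, here is a sketch of the standard Sokal automatic-windowing estimate of an integrated autocorrelation time, τ = ½ + Σ ρ(t), truncated at the first window M with M ≥ c·τ(M). This entry does not define its half-Sokal variant or give the value of SOKAL_C, so the function below is the textbook form, `c=5.0` is an assumed conventional default, and the AR(1) test series is synthetic.

```python
import random


def sokal_tau(x, c=5.0):
    """Integrated autocorrelation time with Sokal's self-consistent window.

    Returns (tau, window). window is None if the window never closed,
    which means the series is too short for this value of c.
    """
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    var = sum(v * v for v in d) / n
    if var == 0.0:
        return 0.5, 0  # constant series: no correlation to integrate
    tau = 0.5
    for m in range(1, n // 2):
        # Normalized autocorrelation at lag m (biased estimator, divides by n).
        rho = sum(d[i] * d[i + m] for i in range(n - m)) / (n * var)
        tau += rho
        if m >= c * tau:
            return tau, m
    return tau, None


# Synthetic AR(1) series with coefficient phi; its exact value under this
# convention is (1 + phi) / (2 * (1 - phi)), i.e. 1.5 for phi = 0.5.
random.seed(0)
phi = 0.5
x, v = [], 0.0
for _ in range(20000):
    v = phi * v + random.gauss(0.0, 1.0)
    x.append(v)

tau, window = sokal_tau(x)
```

The `None` return is the case to watch in practice: an estimate whose window never closed is a lower bound at best, and treating it as converged is how a stuck chain can pass for a fast one.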
The result (as originally read — now withdrawn)
Phase 0 H200 SMOKE: projected probe remaining → proceeded. The ladder selected , , robust over LADDER_SEEDS=(300,301,302): pooled swap-accept , round-trips , .
Doubling calibration:
| L | warm | τ_max | T_O = ½ΣS_a | ‖ΔS‖₁/ΣS |
|---|---|---|---|---|
| 1000 | 200 | 0.5778 | 12692.93 | — |
| 2000 | 200 | 0.5584 | 12678.03 | 0.01054 |
| 4000 | 200 | 0.5509 | 12661.49 | 0.00847 |
Both registered axes held over two consecutive doublings: relative drift then , and then — all . Registered outcome P0-RESOLVED, failed_axis=null, frozen at : , . Wall 0.964 GPU-h of a 5 GPU-h hard cap. No frozen hygiene constant (TAU_TOL, SOKAL_C, STAB_TOL) was relaxed.
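The doubling check can be sketched directly from the T_O column of the calibration table. The drift definition used here (absolute change divided by the newer value) is an assumption, as is the helper `stable`; the registered definition and the value of STAB_TOL are not reproduced in this entry, so the tolerance passed below is a placeholder for illustration only.

```python
# (L, T_O) pairs copied from the calibration table above.
rows = [
    (1000, 12692.93),
    (2000, 12678.03),
    (4000, 12661.49),
]


def relative_drifts(values):
    """|v_next - v_prev| / |v_next| for each consecutive pair (assumed definition)."""
    return [abs(b - a) / abs(b) for a, b in zip(values, values[1:])]


def stable(drifts, tol, consecutive=2):
    """True if the last `consecutive` drifts exist and are all below tol."""
    tail = drifts[-consecutive:]
    return len(tail) == consecutive and all(d < tol for d in tail)


t_o_drift = relative_drifts([t_o for _, t_o in rows])

# Placeholder tolerance, NOT the run's STAB_TOL.
PLACEHOLDER_TOL = 0.02
held = stable(t_o_drift, tol=PLACEHOLDER_TOL)
```

Requiring two consecutive doublings rather than one guards against a single lucky agreement, but as the erratum shows, stability of an estimate says nothing about whether the chain producing it targets the right distribution.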
Scope and caveats
What still stands: the reversibility () certificate, the MEASURE-ONLY discipline, the provenance machinery, and the plumbing of the calibrator. What is withdrawn: the entire scientific reading. This is not evidence that is reachable at scale and it does not feed the operational tier toward → validated.
Even before the erratum the run was honestly fenced (the registered confounds): no exact- init (each replica burns in, so this showed reachability under this schedule, not a clean genuine-plateau exclusion); estimated entered ; speedup vs out of scope (no exact baseline — the exp4/exp6 contrast is a qualitative kernel effect, not a measured speedup); no exact , so the -test was out of scope.
No tag moves — it was always MEASURE-ONLY. The conditional factorization stays [solid], the operational claim stays [conjectured]. Fundamentality remains OPEN; the result is config-scoped (60_12, SEED=0, , INPUT_IDX=0).
What this feeds: nothing toward validation — the corrected ground truth is experiments/exp15-recheck-trained-weights/ (LADDER-INADEQUATE-TRAINED), which is what actually carries forward.