Thermodynamic Machine Learning · MMXXVI
Experiment · 31.V.MMXXVI · Read 4 min

Exp 3 — At Scale: The Equilibration Limit

Entry 3

The first GPU-scale run of the operational-tracking test, where a real MNIST model's glacial mixing turned the headline prediction into an untested — not a confirmed or refuted — claim.

The question

Does the computable predictor $Q_{struct}^{\perp}$ track the operational gradient SNR $Q_{op}$ at scale, on a genuinely trained model rather than a toy substrate? The factorization conjecture says $Q_{op} \approx Q_{struct}^{\perp}$ inside the assumption package; the prior experiments lived on exactly-diagonalizable substrates. Exp 3 is the first attempt to read the ratio off a real Denoising Thermodynamic Model.

The setup

One rented H100 80GB (Lightning Studio), dtm-replication @ 7c22d19 plus a CPU-audit patch, EXP3_MODE=full. Wall time 24,028 s ≈ 6.67 h, inside the declared 8 GPU-h budget. The full record is in experiments/exp3-htdml-embedding/ (report.md, results_full.json, exp3_full.log).

Scope (re-freeze 2026-05-30c): predictions P1–P5 on the single default 60_12 graph, single-input conditional $\pi_\theta(\cdot\,|\,\tilde{x})$: a fixed seed-0 input shared by all 32 chains, which vary only in MC seed and hidden init. $T = 2000$ epochs $= 122{,}000$ gradient updates. The negative kernel is deterministic alternating-scan block-Gibbs (neg_kernel.scan), i.e. A2-excluded / F3-regime: every tracking number sits outside the proof-sketch reversibility regime regardless of verdict. The HTDML ladder and P6/P7 were deferred because this substrate has no NN embedding to attach them to.

The result

The binding fact is mixing. The MNIST DTM negative kernel is extremely slow: $\tau_{max} \approx 486\text{–}500$, essentially constant across $t = 200 \to 2000$ ($\tau_f$ mean $24\text{–}48$). So the frozen $K_{ref} = 4000$ and the largest memory-feasible TRAJ_LEN=3000 / warmup=1000 are deeply under-equilibrated: $50\,\tau \approx 25{,}000 \gg 3000$ and $5\,\tau \approx 2500 > 1000$, so traj_adequate_50tau and burn_adequate_5tau are False at every checkpoint.
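The two adequacy flags reduce to simple threshold comparisons. A minimal sketch, assuming the flags compare the trajectory length against $50\,\tau_{max}$ and the warmup against $5\,\tau_{max}$ with a non-strict inequality; the function name `equilibration_flags` is hypothetical and not taken from dtm-replication:

```python
def equilibration_flags(tau_max: float, traj_len: int, warmup: int) -> dict:
    """Return the two adequacy flags for a given slowest autocorrelation time.

    Assumption (not confirmed by the run record): adequacy means
    traj_len >= 50 * tau_max and warmup >= 5 * tau_max.
    """
    return {
        "traj_adequate_50tau": traj_len >= 50 * tau_max,
        "burn_adequate_5tau": warmup >= 5 * tau_max,
    }


# Both ends of the reported tau_max range, with the run's TRAJ_LEN and warmup.
for tau_max in (486, 500):
    flags = equilibration_flags(tau_max, traj_len=3000, warmup=1000)
    print(tau_max, flags)

# Chain length that would satisfy the 50*tau criterion at tau_max = 500.
required_traj_len = 50 * 500
print(required_traj_len, round(required_traj_len / 3000, 1))
```

Under these assumptions both flags come out False across the whole reported $\tau_{max}$ range, and the required trajectory length is roughly eight times the memory-feasible 3000.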

In that regime the predictor does not track. Because $Q_{struct}^{\perp} = (K/2)\|g\|^2 / T_O \propto K$ by construction, while the empirical $Q_{op} = \|g\|^2 / \mathbb{E}\|\hat g(K) - g\|^2$ is near-flat / sub-linear in $K$ ($Q_{op} \approx 0.60\text{–}1.05$), the ratio $Q_{struct}^{\perp}/Q_{op}$ climbs monotonically with $K$:

| checkpoint | $K{=}100$ | $K{=}300$ | $K{=}1000$ |
|---|---|---|---|
| $t{=}200$ | 0.37 | 0.99 | 2.94 |
| $t{=}500$ | 0.54 | 1.41 | 3.18 |
| $t{=}1000$ | 0.68 | 2.13 | 6.05 |
| $t{=}2000$ | 0.60 | 1.84 | 5.46 |
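Because $Q_{struct}^{\perp} \propto K$, the table also pins down how much $Q_{op}$ itself must have grown between two values of $K$. A small sketch of that arithmetic, assuming $\|g\|^2$ and $T_O$ do not depend on $K$; the variable and function names are illustrative only:

```python
# Ratio Q_struct_perp / Q_op copied from the table: {checkpoint t: {K: ratio}}.
RATIOS = {
    200: {100: 0.37, 300: 0.99, 1000: 2.94},
    500: {100: 0.54, 300: 1.41, 1000: 3.18},
    1000: {100: 0.68, 300: 2.13, 1000: 6.05},
    2000: {100: 0.60, 300: 1.84, 1000: 5.46},
}


def implied_qop_growth(row, k_lo=100, k_hi=1000):
    """Factor by which Q_op grew between k_lo and k_hi.

    ratio(K) = Q_struct(K) / Q_op(K) with Q_struct(K) proportional to K, so
    Q_op(k_hi) / Q_op(k_lo) = (k_hi / k_lo) / (ratio(k_hi) / ratio(k_lo)).
    """
    return (k_hi / k_lo) / (row[k_hi] / row[k_lo])


growth = {t: implied_qop_growth(row) for t, row in RATIOS.items()}
for t, g in growth.items():
    print(f"t={t}: Q_op grows x{g:.2f} while K grows x10")
```

A growth factor of 10 would mean $Q_{op}$ follows the $K$-dependence of the predictor exactly. The table values imply factors between about 1.1 and 1.7, which is consistent with the reported $Q_{op} \approx 0.60\text{–}1.05$.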

Tracking holds inside the $c=3$ band only at $K \approx 100\text{–}300$ and busts at $K=1000$ from $t{=}500$ on (ratio up to 6.05; the $t{=}200$ cell sits at 2.94, just inside). The P3a-median $Q_{struct}^{\perp}/Q_{op} = 1.624 \in [1/3, 3]$ is in-band only by averaging across $K$: an artifact, not a pass. Two further degraders, both pre-registered: kref_reference_unequilibrated_risk=True (the $t{=}500$ $g$-reference moved 23%, kref_g_relchange $= 0.026/0.234/0.144/0.100$; Risk 3 circumvented, not closed); and $\tau_{int}$ truncation inflating $Q_{struct}^{\perp}$. Verdict: P5 weak / under-powered / partially Risk-3-live, not a clean at-scale tracking confirmation.
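The band check and the pooled median can be recomputed directly from the rounded table values. A sketch, assuming the band is the closed interval $[1/c, c]$ with $c = 3$ and that the median pools all twelve (checkpoint, $K$) cells; both are inferences from the text, not confirmed details of the analysis code:

```python
from statistics import median

# Ratio Q_struct_perp / Q_op copied from the table: {checkpoint t: {K: ratio}}.
RATIOS = {
    200: {100: 0.37, 300: 0.99, 1000: 2.94},
    500: {100: 0.54, 300: 1.41, 1000: 3.18},
    1000: {100: 0.68, 300: 2.13, 1000: 6.05},
    2000: {100: 0.60, 300: 1.84, 1000: 5.46},
}
C = 3.0  # half-width of the multiplicative tracking band


def in_band(ratio, c=C):
    """True when the ratio lies in the closed interval [1/c, c]."""
    return 1.0 / c <= ratio <= c


# Number of checkpoints (out of four) that are in-band at each K.
in_band_by_k = {
    k: sum(in_band(row[k]) for row in RATIOS.values()) for k in (100, 300, 1000)
}

# Median over all twelve cells, pooling across K and checkpoints.
pooled_median = median(r for row in RATIOS.values() for r in row.values())

print(in_band_by_k)
print(pooled_median, in_band(pooled_median))
```

With the rounded entries the pooled median is 1.625, close to the reported 1.624 (presumably computed from unrounded values). It sits in-band even though three of the four $K=1000$ cells do not, which is the averaging artifact described above.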

The cleanly-measured result is P2 (F4): the positive (data) phase is strongly subdominant. Median $F4_{ratio} = MSE_{pos}/MSE_{neg} = 0.046 \le 0.3$ (per-checkpoint $0.032/0.030/0.069/0.059$); $\text{mse}_{pos} \approx 0.004\text{–}0.011 \ll \text{mse}_{neg} \approx 0.11\text{–}0.16$. The negative phase dominates the estimator variance, as the Q-program assumes. → Risk 4 sharpened (stays open).
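As a consistency check on the P2 numbers, the per-checkpoint ratios should sit inside the interval implied by the quoted MSE ranges and below the 0.3 threshold. A sketch, assuming the four per-checkpoint values are listed in the order $t = 200, 500, 1000, 2000$ and treating the approximate MSE ranges as hard bounds:

```python
from statistics import median

F4_THRESHOLD = 0.3

# Per-checkpoint F4 ratios; the checkpoint ordering is an assumption.
f4_ratio = {200: 0.032, 500: 0.030, 1000: 0.069, 2000: 0.059}

# Approximate ranges quoted for the positive- and negative-phase MSE.
MSE_POS_RANGE = (0.004, 0.011)
MSE_NEG_RANGE = (0.11, 0.16)

# Extreme ratios compatible with those ranges.
ratio_lo = MSE_POS_RANGE[0] / MSE_NEG_RANGE[1]
ratio_hi = MSE_POS_RANGE[1] / MSE_NEG_RANGE[0]

f4_median = median(f4_ratio.values())
all_within_range = all(ratio_lo <= r <= ratio_hi for r in f4_ratio.values())

print(f"implied ratio interval: [{ratio_lo:.3f}, {ratio_hi:.3f}]")
print(f"median F4 ratio: {f4_median:.4f}, passes: {f4_median <= F4_THRESHOLD}")
print(f"all checkpoints within implied interval: {all_within_range}")
```

The median of the rounded values is about 0.0455, close to the reported 0.046, and every checkpoint lies well below the threshold.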

The rest. P1 NOT PASSED: CLT-sanity median $2.125$, growing with $K$ ($1.17/2.65/7.49$ at $t{=}2000$), violated upward by autocorrelation truncation, a regime failure not a formula failure (the estimator was validated against AR(1) in review). P3a PASS but only as a mean-$\tau_{int}$ statement: $F5_{ratio} \le 0.1$ at all checkpoints, yet $\|\Delta\theta\|\cdot\tau_{max}/K_{train} \approx 0.40\cdot500/400 \approx 0.50 > 0.1$, so the $\tau{\sim}500$ slow modes are not in the fixed-$\theta$ regime. P4 clause-1 NOT PASSED via the structurally over-stable CDF proxy (card_rel_change median $0.559 > 0.5$; $|C^*|$ is $\sim 7000\text{–}12500$ of $25{,}200$ observables, so no low-rank structure); clause-2 is construction-confirmed, not $\varepsilon$-measured.
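To see why a truncated autocorrelation sum distorts the picture when $\tau$ is in the hundreds, consider an AR(1) surrogate. This is purely illustrative: it assumes $\tau$ denotes an integrated autocorrelation time and that the slow mode behaves like AR(1) with $\rho(k) = \phi^k$. Neither is stated in the run record, and the function below is not the reviewed estimator.

```python
from typing import Optional


def tau_int_ar1(phi: float, window: Optional[int] = None) -> float:
    """Integrated autocorrelation time 1 + 2 * sum_{k>=1} rho(k) for AR(1).

    With rho(k) = phi**k the full sum is (1 + phi) / (1 - phi); truncating
    the sum at `window` lags gives 1 + 2 * phi * (1 - phi**window) / (1 - phi).
    """
    if window is None:
        return (1.0 + phi) / (1.0 - phi)
    return 1.0 + 2.0 * phi * (1.0 - phi**window) / (1.0 - phi)


# Pick phi so that the untruncated tau_int equals 500 (illustrative target).
TAU_TARGET = 500.0
phi = (TAU_TARGET - 1.0) / (TAU_TARGET + 1.0)

full = tau_int_ar1(phi)
for window in (100, 400, 1000, 3000):
    truncated = tau_int_ar1(phi, window)
    print(f"window={window:5d}  tau_int={truncated:7.1f}  fraction={truncated / full:.2f}")
```

In this toy model a window much shorter than $\tau$ recovers only a fraction of the full $\tau_{int}$, which is the qualitative sense in which a short summation window understates slow-mode correlation.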

Scope and caveats

This does not show that the factorization fails; it shows the test was equilibration-limited. A faithful at-scale read needs $K_{ref}$ and chain lengths $\gtrsim 50\,\tau \approx 25{,}000$ (a $K_{ref}$ re-freeze plus heavier compute), out of scope here. It is also single-input conditional only, not batch-operational; per-input sensitivity is a registered follow-up. And it is in the A2-excluded alternating-scan regime, outside the proof sketch. One suggestive aside, against the forecaster: the exp1/exp2 observable-orthogonality mechanism partially transfers ($r_{yy} = 0.123/0.056/0.032/0.024$, gradient largely orthogonal to the temporal slow mode; $\tau_m/\tau_f > 2.5$), but under heavy under-equilibration the lag-1 surrogate is unreliable, so this is suggestive, not conclusive. No tag flip in any outcome. The factorization and $Q_{struct}^{\perp}$ stay [conjectured].


What this feeds: the at-scale $Q_{op} \approx Q_{struct}^{\perp}$ claim is neither confirmed nor refuted (untested at adequate equilibration), which motivates the reversible-kernel mixing investigation and a future $K_{ref}$ re-freeze; P2's clean PASS sharpens Risk 4.

— fin. —