Exp 4 — Reversible Kernel: τ̂ Unresolved (P0-HALT)

The reversible kernel the theorem actually requires does not equilibrate at accessible scale, so P0 halts before any prediction is exercised.

The question

Can we measure a finite integrated autocorrelation time $\tau_{int}$ for the $A2$-satisfying negative-phase sampler? A finite $\tau_{int}$ is the precondition for the $A6$ premise $K \gg \tau_{int}$. This is the complete technical record. The setup follows on from experiments/exp3 (a deterministic alternating-scan kernel) by swapping in the reversible object the proof's spectral machinery needs.

The setup

Substrate: Lightning AI H200 (80→141 GB), dtm-replication @ 7c22d19, thrml 0.1.3 plus the EXP4-REVERSIBLE-SCAN patch (v2 toggle), single-input 60_12 DTM-MNIST conditional $\pi_\theta(\cdot \mid x_0)$, seed 0. The negative-phase kernel is the symmetrized four-block Gibbs sweep $M = \tfrac{1}{2}(P_{fwd} + P_{rev})$. Its self-adjointness check re-passed at $\sim 10^{-17}$, whereas the deterministic exp3 kernel was confirmed non-reversible at $\sim 10^{-2}$.
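The reversibility contrast can be reproduced on a toy problem. The sketch below is not the experiment's check: it assumes a hypothetical two-variable binary target and a two-block sweep instead of the four-block DTM sweep, and it measures the detailed-balance residual $\max_{ij} |\pi_i K_{ij} - \pi_j K_{ji}|$, which is one possible definition of a self-adjointness residual. All names (`gibbs_block`, `detailed_balance_residual`, `PI`) are illustrative.

```python
from itertools import product

# Toy target over two binary variables (a, b); NOT the DTM conditional.
STATES = list(product((0, 1), repeat=2))          # (0,0), (0,1), (1,0), (1,1)
PI = dict(zip(STATES, (0.4, 0.1, 0.1, 0.4)))      # correlated, sums to 1


def gibbs_block(axis):
    """Row-stochastic kernel that resamples coordinate `axis`, holding the other fixed."""
    other = 1 - axis
    kernel = []
    for s in STATES:
        # Normaliser: marginal mass of the states sharing s's held-fixed coordinate.
        norm = sum(PI[u] for u in STATES if u[other] == s[other])
        kernel.append([PI[t] / norm if t[other] == s[other] else 0.0 for t in STATES])
    return kernel


def matmul(p, q):
    """Plain matrix product; for row-stochastic kernels, 'apply p, then q'."""
    n = len(p)
    return [[sum(p[i][k] * q[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def detailed_balance_residual(kernel):
    """max_ij |pi_i K_ij - pi_j K_ji|; zero iff K is self-adjoint in L2(pi)."""
    n = len(STATES)
    return max(
        abs(PI[STATES[i]] * kernel[i][j] - PI[STATES[j]] * kernel[j][i])
        for i in range(n) for j in range(n)
    )


P_a, P_b = gibbs_block(0), gibbs_block(1)
P_fwd = matmul(P_a, P_b)      # update a, then b
P_rev = matmul(P_b, P_a)      # update b, then a
M = [[0.5 * (f + r) for f, r in zip(rf, rr)] for rf, rr in zip(P_fwd, P_rev)]

res_fwd = detailed_balance_residual(P_fwd)
res_sym = detailed_balance_residual(M)
print(f"deterministic scan residual: {res_fwd:.3e}")
print(f"symmetrized sweep residual:  {res_sym:.3e}")
```

The mechanism is the same at any block count: each block update is self-adjoint with respect to $\pi$, the reverse-order sweep is the adjoint of the forward-order sweep, and so their average is self-adjoint even though neither fixed-order sweep is.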

The probe is the non-circular doubling-stability measurement: a per-chain half-Sokal $\tau_{int}$ estimator on a self-consistent window ($L \ge 5\tau$), with $L$ doubled and warm-up scaled to $\hat{\tau}$. The stability rule resolves $\hat{\tau}$ only when $|\tau(2L) - \tau(L)| / \tau(L) < 0.15$.
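For readers unfamiliar with this style of probe, here is a minimal sketch of a windowed integrated-autocorrelation estimator and the doubling rule. It is a generic Sokal-style estimator under the convention $\tau = \tfrac{1}{2} + \sum_{t=1}^{W} \rho(t)$, not necessarily the half-Sokal variant used in the run, and the AR(1) test series, function names, and the window factor `c` are assumptions made for illustration.

```python
import random


def tau_int_windowed(x, c=5.0):
    """tau = 1/2 + sum_{t=1..W} rho(t), with W the first lag where W >= c * tau.

    Returns (tau, W, self_consistent); self_consistent is False when the
    window condition is never met before the lags run out.
    """
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    var = sum(v * v for v in d) / n
    tau = 0.5
    for lag in range(1, n // 2):
        # Biased (1/n) autocovariance estimate at this lag, normalised by the variance.
        rho = sum(d[i] * d[i + lag] for i in range(n - lag)) / (n * var)
        tau += rho
        if lag >= c * tau:
            return tau, lag, True
    return tau, n // 2 - 1, False


def doubling_stable(tau_l, tau_2l, tol=0.15):
    """Stability rule from the text: |tau(2L) - tau(L)| / tau(L) < tol."""
    return abs(tau_2l - tau_l) / tau_l < tol


def ar1(n, phi, rng):
    """AR(1) test series; exact tau under this convention is (1 + phi) / (2 * (1 - phi))."""
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out


rng = random.Random(0)
series = ar1(40_000, 0.9, rng)                      # exact tau = 9.5 for phi = 0.9
tau_l, w_l, ok_l = tau_int_windowed(series[:20_000])
tau_2l, w_2l, ok_2l = tau_int_windowed(series)
print(tau_l, tau_2l, doubling_stable(tau_l, tau_2l))
```

On a well-mixing series like this one the window closes after a few dozen lags and the estimate barely moves when the series is doubled. The failure mode reported in this experiment is the opposite case, where the estimate keeps scaling with the amount of data.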

The v2 order-coin toggle made the run affordable. With a per-chain coin, XLA was forced to compute both sweeps under the 400-chain vmap (38.6 s/epoch). Threading a shared order_key through training (a true lax.cond, one sweep) gives 11.6 s/epoch, while diagnostics keep the per-chain coin, so the chains entering the across-chain SEM used in P5 stay exactly independent. The per-chain marginal kernel is the same $\tfrac{1}{2}(P_{fwd}+P_{rev})$ in both modes.
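Why the marginal kernel is unchanged while independence is not can be written out in two lines. This is a sketch that assumes a fair order coin $c \in \{fwd, rev\}$ and writes $\otimes$ for the joint transition of two distinct chains; neither notation appears in the run logs.

$$
\mathbb{E}_c\,[P_c] \;=\; \tfrac{1}{2}\,(P_{fwd} + P_{rev}) \;=\; M \qquad \text{(single chain, either mode)}
$$

$$
\text{per-chain coin: } M \otimes M, \qquad \text{shared coin: } \tfrac{1}{2}\,\bigl(P_{fwd} \otimes P_{fwd} + P_{rev} \otimes P_{rev}\bigr) \;\neq\; M \otimes M \ \text{ in general}
$$

With a shared coin every chain still moves by $M$ on its own, but the chains are coupled through the common sweep order, which is why the diagnostics that feed an across-chain SEM keep the per-chain coin.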

The result

$\tau_{max}$ grows linearly in the trajectory length $L$, with $\tau / L$ essentially constant ($\approx 0.16$) across six window lengths (five doublings):

| $L$ (sweeps) | warm-up (sweeps) | $\tau_{max}$ | $\tau/L$ | self-consistent ($L \ge 5\tau$) |
|---|---|---|---|---|
| 1,000 | 200 | 166.7 | 0.167 | yes |
| 2,000 | 833 | 333.2 | 0.167 | yes |
| 4,000 | 1,666 | 662.6 | 0.166 | yes |
| 8,000 | 3,313 | 1,325 | 0.166 | yes |
| 16,000 | 6,627 | 2,431 | 0.152 | yes |
| 32,000 | 12,153 | 5,280 | 0.165 | yes |

A constant $\tau / L$ means the estimated $\tau_{int}$ accumulates as fast as data is added: the autocorrelation function has not decayed within the window. This is the signature of a near-zero effective spectral gap ($\gamma_{eff} \to 0$), that is, an effectively non-equilibrating chain. The doubling-stability criterion is never met (each doubling roughly doubles $\tau$, a relative change near 1 against the 0.15 threshold), so the rule correctly refuses to resolve $\hat{\tau}$. The warm-train run completed 200/200 epochs at 11.623 s/epoch, confirming both the at-scale cost ($\approx 6.5$ h for a 2000-epoch run) and the v2 fix.
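The refusal can be checked directly against the tabulated curve. The snippet below applies the 0.15 stability threshold and the $L \ge 5\tau$ window condition to the six reported $(L, \tau_{max})$ pairs; the variable names are illustrative, and the values are copied from the results table, not recomputed from chains.

```python
# (L, tau_max) pairs copied from the results table.
curve = [(1_000, 166.7), (2_000, 333.2), (4_000, 662.6),
         (8_000, 1_325.0), (16_000, 2_431.0), (32_000, 5_280.0)]
TOL = 0.15          # stability threshold from the probe definition
WINDOW_FACTOR = 5   # self-consistency requires L >= 5 * tau

# Relative change in tau between each L and its doubling.
rel_changes = [abs(t2 - t1) / t1 for (_, t1), (_, t2) in zip(curve, curve[1:])]
resolved = [r < TOL for r in rel_changes]
ratios = [tau / length for length, tau in curve]
self_consistent = [length >= WINDOW_FACTOR * tau for length, tau in curve]

for (length, _), r in zip(curve[1:], rel_changes):
    print(f"L={length:>6}: relative change vs previous L = {r:.3f}")
print("any doubling resolved:", any(resolved))
print("tau/L range:", round(min(ratios), 3), "to", round(max(ratios), 3))
```

Every window passes the self-consistency condition while no doubling passes the stability rule, which is the combination that makes the doubling check necessary: self-consistency alone would have accepted each of these values.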

Consequence: the $\gtrsim 50\cdot\hat{\tau}$ averaging windows are uninstantiable and the $A6$ premise $K \gg \tau_{int}$ is unreachable for this kernel at this checkpoint. Registered outcome (F): P0-HALT.

Scope and caveats

This does show, robustly, that the $A2$-required reversible kernel does not equilibrate at accessible scale (the curve is measured to $L=32{,}000$; further doublings could only add more growing-$\tau$ points). It does not by itself distinguish two readings, both of which give the same operational verdict: (1) a genuine near-zero gap, where the trained conditional is multimodal and the chain cannot cross basins ($\gamma_{eff}\to 0$, the very plateau regime the theorem is about); or (2) inadequate burn-in, where the true $\tau \gg$ warm-up, although if the true $\tau$ is unbounded no feasible burn-in helps.
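The link between reading (1) and a large $\tau_{int}$ can be made explicit with the standard spectral bound for a reversible kernel. This is a sketch under stated assumptions: a stationary chain, the convention $\tau_{int,f} = \tfrac{1}{2} + \sum_{t \ge 1} \rho_f(t)$ (which may differ from the estimator's convention by a constant factor), eigenvalues $\lambda_k \in (-1, 1)$ of $M$ below the top eigenvalue 1, and nonnegative weights $w_k$ summing to 1 that depend on the observable $f$.

$$
\rho_f(t) = \sum_k w_k \lambda_k^{\,t}
\quad\Longrightarrow\quad
\tau_{int,f} = \sum_k w_k \,\frac{1 + \lambda_k}{2\,(1 - \lambda_k)}
\;\le\; \frac{1 + \lambda_2}{2\,(1 - \lambda_2)}
= \frac{1}{\gamma} - \frac{1}{2},
\qquad \gamma = 1 - \lambda_2
$$

Rearranged, $\gamma \le 1 / (\tau_{int,f} + \tfrac{1}{2})$: a large true $\tau_{int}$ forces a small gap. The bound applies to the true $\tau_{int}$, so it constrains $\gamma$ only under reading (1); under reading (2) the measured values reflect the transient and say nothing definite about the spectrum.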

The exp3 comparison is confounded: exp4 changed both the kernel (to reversible) and the window length, so the jump from $\tau \approx 486$–$500$ to $\tau \ge 5{,}280$ cannot be cleanly attributed. The qualitative tell is that exp3's $\tau$ was stable across $t$ while exp4's grows $\propto L$ with no leveling, which favors genuinely slower mixing over mere truncation. The deeper tension worth flagging: exp3's faster-mixing kernel violated $A2$; the $A2$-valid kernel is the slow one.

Honesty: the Studio was cut off (credit exhaustion) after $L=32{,}000$, before p0_calibrate.json and the A7-spectrum feasibility probe were written. The P0 record is therefore sufficient to fire the $\hat{\tau}$-UNRESOLVED HALT, but the A7 measurement is unrun. P1–P5 did not run (no recorded DECISION: PROCEED); nothing is reported as measured for them.

No tag flip. The conditional factorization ($A1$–$A8$ + plateau + F4 $\Rightarrow Q_{op}\approx Q_{struct}^{\perp}$) stays [solid] (untouched, since this is an operational test); the operational/unconditional claim stays [conjectured], now for the deeper reason that the chain does not equilibrate at accessible scale. This sharpens Risk 5 (the $A6$ gate) and Risk 1 (at-scale tracking), leaving both [open].

What this feeds: the natural next investigation is distinguishing near-zero-gap from inadequate-burn-in (via ACF shape, or an earlier/less-trained checkpoint with finite $\tau$) and reaching the credit-gated A7-spectrum feasibility probe. Both are deferred.