Extropic claims it can run generative AI on thermodynamic (p-bit / TSU) hardware at a ~10,000× energy advantage. This is a pre-registered, layer-by-layer audit of one question: where does the public evidence stop proving that, and start projecting it?
The method is borrowed from chip design and from forecasting hygiene. The bar for “accomplished” is fixed before reading any source, so it cannot be quietly reshaped to fit what Extropic happens to show. Every claim is tagged by evidence maturity, mapped onto the eight layers of a real silicon stack, and tracked link by link. The goal is diagnosis, not advocacy or dismissal — locate the binding bottleneck and the missing evidence, and say exactly where they sit.
Not yet hardware-witnessed. Public evidence demonstrates the stochastic-circuit sub-primitive — now coupled and peer-reviewed — not practical generative-AI thermodynamic hardware. No proof criterion (W1–W4) is met, and the binding gap is the unbuilt bridge from the DTM algorithm to real, large, programmable arrays.
The bar — what would count as proof
The proof rule is conjunctive: practical generative AI via thermodynamic hardware counts as accomplished only if real TSU / p-bit hardware runs a chained EBM/DTM workload end-to-end, reaches useful quality beyond toy benchmarks, and shows a measured like-for-like energy advantage including mapping and host overhead. Four sub-criteria, tracked independently:
- W1 (not met): Hardware EBM / Gibbs sampling. Real hardware, not a software simulator, samples an energy-based / Boltzmann model from its own physical dynamics, with measured sample statistics. Best evidence: demonstrated. Status: not met. The all-transistor RNG cell is demonstrated only as an L5/L6 sub-primitive; the NEAT-RN paper adds a measured coupled Gaussian sampler, which is still a sub-primitive (Gaussian ≠ Boltzmann/EBM; refinement R-B). No measured coupled-array EBM/Gibbs sampling.
- W2 (not met): End-to-end chained DTM on hardware. A denoising chain of EBMs (DTM) runs end-to-end on hardware, not a single isolated EBM and not a software stand-in for the chain. Best evidence: simulated. Status: not met. The chain runs only inside a GPU simulator of the DTCA.
- W3 (not met): Useful quality beyond toy benchmarks. Generation is useful on tasks materially harder than (Fashion-)MNIST: high-res image, text, video, or a real downstream task. Best evidence: simulated. Status: not met. Binarized Fashion-MNIST is the benchmark; the CIFAR-10 hybrid is, by the authors, “a naive first attempt,” and binarization is conceded “not viable in general.”
- W4 (not met): Like-for-like energy / latency advantage. A measured energy and/or latency win over a GPU on the same task, with embedding / mapping cost and host overhead included, not just core-sampler joules. Best evidence: modeled-projected. Status: not met. The ~10,000× is modeled-projected, not measured, and the toy setting omits the hybrid embedding/host overhead richer tasks need (refinement R-A).
Each claim carries one of four evidence tags. Only demonstrated satisfies a criterion — the rest may support plausibility, but they do not meet the bar.
- demonstrated: Measured on real hardware running the actual workload. The only tag that satisfies a criterion.
- simulated: Produced by a software model of future hardware, such as THRML simulation or GPU emulation of TSU behavior.
- modeled-projected: An analytic / physical estimate (e.g. energy-per-sample extrapolation), not a run.
- roadmap: A stated intention or future plan, with no evidence yet.
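The conjunctive rule and the tag semantics can be written down as a small check. This is a minimal sketch, not part of any Extropic or replication code: the `Criterion` class and the `sub_primitive_only` flag are hypothetical names introduced here, and the flag is set only for W1, where the best evidence is demonstrated but covers a sub-primitive rather than the criterion itself.

```python
from dataclasses import dataclass

# The four evidence tags used in this audit.
TAGS = ("demonstrated", "simulated", "modeled-projected", "roadmap")


@dataclass(frozen=True)
class Criterion:
    name: str
    best_evidence: str                # one of TAGS
    sub_primitive_only: bool = False  # hypothetical flag: evidence covers only a building block

    def met(self) -> bool:
        # Only "demonstrated" evidence for the full criterion counts.
        return self.best_evidence == "demonstrated" and not self.sub_primitive_only


def accomplished(criteria) -> bool:
    """Conjunctive rule: every criterion must be met."""
    return all(c.met() for c in criteria)


# Standing as recorded in the list of criteria above.
AUDIT = [
    Criterion("W1", "demonstrated", sub_primitive_only=True),
    Criterion("W2", "simulated"),
    Criterion("W3", "simulated"),
    Criterion("W4", "modeled-projected"),
]

if __name__ == "__main__":
    for c in AUDIT:
        print(c.name, "met" if c.met() else "not met")
    print("accomplished:", accomplished(AUDIT))
```

Because the rule is a plain `all(...)`, a single unmet criterion is enough to keep the overall verdict at "not accomplished", which is the point of tracking W1 through W4 independently.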
Two refinements were forced by the sources, appended rather than silently edited. R-A: a W4 comparison must be measured and must include the hybrid embedding / host overhead richer tasks need — a modeled core-sampler figure on a toy benchmark does not qualify. R-B: a measured coupled Gaussian sampler is a W1 sub-primitive, not a satisfaction of W1, because the distribution class is Gaussian, not Boltzmann/EBM.
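Refinement R-A can be stated as a one-line inequality. The symbols below are assumptions introduced for illustration, not quantities taken from the sources: E_GPU is the GPU energy on the task, E_sampler the core-sampler energy, and E_embed and E_host the embedding and host overheads.

```latex
\[
A_{\text{W4}}
  = \frac{E_{\text{GPU}}}{E_{\text{sampler}} + E_{\text{embed}} + E_{\text{host}}}
  \;\le\;
  \frac{E_{\text{GPU}}}{E_{\text{sampler}}}
  = A_{\text{core}},
\qquad E_{\text{embed}},\, E_{\text{host}} \ge 0 .
\]
```

A core-sampler figure is therefore an upper bound on the like-for-like advantage, and the two agree only when the overhead terms are negligible.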
The claim chain
Extropic’s public argument is a chain of seven links. Reading it end to end shows exactly where the evidence first drops below demonstrated.
- C1: CMOS stochastic circuits
- C2: Programmable p-bit / TSU arrays
- C3: Gibbs / Boltzmann sampling
- C4: Hardware-sampled EBMs
- C5: Denoising chains of EBMs (DTM)
- C6: Useful generative AI
- C7: Measured energy / latency advantage over GPUs
The frontier sits right after C1: the RNG cell is demonstrated, composing it into programmable arrays (C2) is unpublished, and everything downstream (C3–C7) is simulated or modeled.
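Locating the frontier is a mechanical scan of the chain. The sketch below is illustrative only: the status strings are the coarse labels used in the sentence above ("unpublished", "simulated or modeled"), not the four formal evidence tags, and the `frontier` helper is a hypothetical name.

```python
# (label, link, status) for the seven links, using the coarse statuses stated above.
CHAIN = [
    ("C1", "CMOS stochastic circuits", "demonstrated"),
    ("C2", "Programmable p-bit / TSU arrays", "unpublished"),
    ("C3", "Gibbs / Boltzmann sampling", "simulated-or-modeled"),
    ("C4", "Hardware-sampled EBMs", "simulated-or-modeled"),
    ("C5", "Denoising chains of EBMs (DTM)", "simulated-or-modeled"),
    ("C6", "Useful generative AI", "simulated-or-modeled"),
    ("C7", "Measured energy / latency advantage over GPUs", "simulated-or-modeled"),
]


def frontier(chain):
    """Return (last demonstrated link, first link below demonstrated)."""
    last_ok = None
    for label, _link, status in chain:
        if status != "demonstrated":
            return last_ok, label
        last_ok = label
    return last_ok, None  # the whole chain is demonstrated


if __name__ == "__main__":
    print(frontier(CHAIN))
```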
The eight-layer gap audit
Auditing the chain through a full silicon stack (L0–L7) localizes the break. The pattern is stark: demonstrated silicon at the bottom (L5/L6), a documented void in the middle (L2–L4: no RTL, no synthesis, no place-and-route), and a principle-and-algorithm top (L0/L1) that exists only in simulation. The single most load-bearing number, the system energy, is a core-sampler model with no synthesized, placed-and-routed full system behind it. The “unpublished middle” is not an omission to wave away; it is precisely where a ~10,000× claim would have to survive contact with a real layout.
Original work
The framework above organizes the public record. What follows is hands-on: I ran the code, read the raw result files, and worked out what mapping the algorithm onto silicon would actually take.
A reproduction run that surfaced an undocumented bug
I took the Extropic-funded public replication (pschilliOrange/dtm-replication, package thrmlDenoising, commit 7c22d19) and ran its smoke tests in a fresh uv virtualenv (Python 3.11, CPU). Two findings, reported together. First, the published repo is GPU-gated, and nothing documents it: training hardcodes its device list to the GPU. That line is the file’s sole GPU reference, with no CPU fallback, no config flag, and no mention of GPU/CUDA in the README. On a GPU-less host the repo installs and imports cleanly, yet cannot train even the tiny synthetic smoke test out of the box.

# DTM.py:279: the file's sole GPU reference (no CPU fallback, no flag, README silent)
devices = jax.devices("gpu")  # errors on a GPU-less host
# one-line local patch (throwaway clone only) → both smoke tests pass
devices = jax.devices()

Second, with that one-line CPU patch both smoke-test cases pass: the epoch-50 save/load round-trip (exact to six places) and the two-step “frankenstein” stitch (within the test’s 5% / 30% tolerances). That independently reproduces the per-step-independence claim. So “runnable” means runnable on GPU hardware, not CPU-portable as shipped: a real reproducibility caveat, found by execution rather than by reading.
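A slightly more defensive variant of the patch would prefer a GPU and fall back to the default backend. This is a sketch of one possible fix, not code from the repository: `pick_devices` is a hypothetical helper, and it assumes that `jax.devices("gpu")` raises `RuntimeError` when no GPU backend is present, which is the failure observed on a GPU-less host. The device lookup is passed in as a callable so the helper can be exercised without JAX installed.

```python
from typing import Any, Callable, Sequence


def pick_devices(get_devices: Callable[..., Sequence[Any]]) -> Sequence[Any]:
    """Prefer GPU devices; fall back to the default backend if none exist.

    `get_devices` is expected to behave like `jax.devices`: called with a
    backend name it returns that backend's devices, called with no argument
    it returns the devices of the default backend.
    """
    try:
        return get_devices("gpu")
    except RuntimeError:
        # Assumption: an unavailable backend surfaces as RuntimeError.
        return get_devices()


if __name__ == "__main__":
    # Stand-in for a GPU-less host, so the sketch runs without JAX.
    def cpu_only(backend=None):
        if backend == "gpu":
            raise RuntimeError("no GPU backend available")
        return ["cpu:0"]

    print(pick_devices(cpu_only))
```

With JAX available the call would be `devices = pick_devices(jax.devices)`. The audit's own patch was the simpler `jax.devices()`, which is enough for the smoke tests.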
Re-deriving the paper’s figures from the checked-in CSVs
The repo ships the result CSVs behind its Fig. 5 plots, so the headline rankings can be checked directly. Fig. 5b: the recorded minima are exact — best free-FID MEBM 37.08 (@ep6), DTM 29.68 (@ep8), DTM+ACP 26.84 (@ep86). But the ranking “DTM beats MEBM” holds only under a best-free-FID-over-available-rows read. Under final-row or common-epoch (ep144) extraction it does not survive: DTM+ACP stays stable (~29) while both MEBM (~183) and DTM (~185) fail late. The MEBM/DTM minima are transient early spikes the models immediately leave; only DTM+ACP’s persists. Robust conclusion: ACP stabilizes training — the standalone “DTM beats MEBM” claim is extraction-dependent, not robust.
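The extraction dependence is easy to see with only the points quoted above. The sketch below does not read the repo's CSVs: it uses the three recorded minima and the rounded ep144 values from the text (the "~" figures, so those entries are approximate), and the helper names are illustrative.

```python
# epoch -> free-FID, restricted to the points quoted in the text.
# The ep144 entries are the rounded "~" values, not exact CSV rows.
CURVES = {
    "MEBM":    {6: 37.08, 144: 183.0},
    "DTM":     {8: 29.68, 144: 185.0},
    "DTM+ACP": {86: 26.84, 144: 29.0},
}


def best_over_rows(curve):
    """Best (lowest) FID over all available rows."""
    return min(curve.values())


def at_epoch(curve, epoch):
    """FID at one fixed epoch shared by all models."""
    return curve[epoch]


def rank(curves, score):
    """Model names ordered from best (lowest score) to worst."""
    return sorted(curves, key=lambda name: score(curves[name]))


by_best = rank(CURVES, best_over_rows)
by_ep144 = rank(CURVES, lambda curve: at_epoch(curve, 144))

if __name__ == "__main__":
    print("best-over-rows:", by_best)
    print("common epoch 144:", by_ep144)
```

DTM+ACP ranks first under both rules, while the order of DTM and MEBM flips between them. That flip is the whole content of "extraction-dependent".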
Fig. 5c: the degree-20 / grid-60 clamped-FID 20.12 anchor verifies exact, and the intra-model trends are robust — higher graph degree monotonically improves FID (8→56.8, 12→26.2, 16→22.9, 20→20.1) and the ~25% visible-node-fraction optimum holds. But the chain-depth / warmup gain is modest and non-monotonic (warmup 400→26.2, 800→24.2, 1200→24.4 — 1200 slightly worse than 800), and the grid is incomplete (17 of 18 runs). The trends that survive are the ones the repo can actually support.
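The two trend claims reduce to a monotonicity check on the numbers quoted above. A minimal sketch, using only those quoted values rather than the CSV files; the function names are illustrative.

```python
# Clamped-FID values quoted in the text, keyed by the swept parameter.
DEGREE_FID = {8: 56.8, 12: 26.2, 16: 22.9, 20: 20.1}
WARMUP_FID = {400: 26.2, 800: 24.2, 1200: 24.4}


def strictly_improving(series):
    """True if FID strictly decreases as the swept parameter increases."""
    values = [series[key] for key in sorted(series)]
    return all(later < earlier for earlier, later in zip(values, values[1:]))


def relative_gain(series, start, end):
    """Fractional FID reduction going from `start` to `end`."""
    return (series[start] - series[end]) / series[start]


if __name__ == "__main__":
    print("degree monotone:", strictly_improving(DEGREE_FID))
    print("warmup monotone:", strictly_improving(WARMUP_FID))
    print("warmup 400 -> 800 gain:", round(relative_gain(WARMUP_FID, 400, 800), 3))
```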
Why a coupled, measured silicon sampler still doesn’t close the gap
The peer-reviewed NEAT-RN paper fabricates and measures a coupled multivariate-Gaussian sampler: programmable covariance, a validated Brownian-gyrator dynamic model, a ~70%-thermal noise decomposition. It is real silicon and a genuine advance on a single Bernoulli cell. Yet it does not reach W1, and the reason is precise: the measured distribution is Gaussian, not Boltzmann/EBM. So “coupled + measured” is now true while W1 stays open. The binding gap at W1 is the distribution class, not whether circuits can be coupled at all. That distinction is refinement R-B.
What mapping the DTM onto silicon would actually require
Reading what software conveniences the replication leans on makes the L1↔L5 bridge concrete. The forward process and fixed coupling are clean closed forms — the perturbation and the weight — but five things the simulator gets for free, hardware would have to earn:
- Stochastic primitive. Replace the software PRNG (jax.random.bernoulli) updates with measured outputs of a fabricated p-bit / TSU array: the W1 bar.
- Update schedule. Preserve the bipartite two-color block-parallel Gibbs schedule physically; the synchronous-parallel-update assumption must hold in silicon.
- Placement. Turn the software chessboard placement into a real floorplan respecting the bipartite constraint and preset degree / jumps — an L3/L4 problem absent from every public artifact.
- Clamp / readout. Conditioning-block clamping and input injection need hardware clamp and readout addressing at scale — an L2 problem.
- Weight precision. Quantify the analog coupling precision achievable in hardware against the exact-float coupling formula and trained weights the simulation uses.
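The first two items can be made concrete with a toy two-color block Gibbs sweep. This is a pure-Python sketch under simplifying assumptions, not the replication's code: it uses a nearest-neighbour chessboard grid with a single uniform coupling instead of the repo's preset degree / jumps and trained weights, and the standard-library PRNG stands in for the stochastic primitive that hardware would have to supply.

```python
import math
import random


def color(i, j):
    """Chessboard colouring: nearest neighbours always have different colours."""
    return (i + j) % 2


def neighbors(i, j, n):
    """Nearest neighbours of (i, j) on an n x n grid with open boundaries."""
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        a, b = i + di, j + dj
        if 0 <= a < n and 0 <= b < n:
            yield a, b


def half_sweep(spins, which, coupling, beta, rng):
    """Resample every site of one colour, holding the other colour fixed.

    Same-colour sites are never neighbours, so updating them one after
    another here is equivalent to updating them all in parallel.
    """
    n = len(spins)
    for i in range(n):
        for j in range(n):
            if color(i, j) != which:
                continue
            field = coupling * sum(spins[a][b] for a, b in neighbors(i, j, n))
            p_up = 1.0 / (1.0 + math.exp(-2.0 * beta * field))
            # Software PRNG: the call a hardware p-bit would replace.
            spins[i][j] = 1 if rng.random() < p_up else -1


def gibbs_sweep(spins, coupling, beta, rng):
    """One full sweep: colour 0 block, then colour 1 block."""
    half_sweep(spins, 0, coupling, beta, rng)
    half_sweep(spins, 1, coupling, beta, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    grid = [[rng.choice((-1, 1)) for _ in range(6)] for _ in range(6)]
    for _ in range(10):
        gibbs_sweep(grid, coupling=1.0, beta=0.5, rng=rng)
    print(grid)
```

The placement item corresponds to `color` and `neighbors`: in software the bipartite constraint is a modulo operation, whereas in silicon it has to be a floorplan that keeps every coupled pair in opposite blocks.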
The evidence ledger
Six primary sources, each read against the same bar — what is claimed, what is actually shown, and which criterion (if any) it moves.
The verdict, and how this connects
Across all six sources the standing is consistent: the demonstrated part is the stochastic-circuit sub-primitive (the p-bit cell, plus the coupled-Gaussian NEAT-RN sampler — both at L5/L6, neither a coupled-array EBM/Gibbs sampler); everything above it is simulated or roadmap; and the L1↔L5 bridge, with its unpublished L2–L4 middle, is the binding bottleneck. That is a sharp, defensible reading of a fast-moving program — credit where the silicon is real, and a clear marker of what would actually settle the question.
Auditing where the public evidence stops is only the front half of a research program. The back half is my own pre-registered work on the same thrml substrate — exp1 through exp19 — testing the trainability question this audit isolates, with frozen thresholds and measure-only discipline.
· · ·