Extropic's DTM — In-Depth Analysis · Thermodynamic Machine Learning

Extropic claims it can run generative AI on thermodynamic (p-bit / TSU) hardware at a ~10,000× energy advantage. This is a pre-registered, layer-by-layer audit of one question: where does the public evidence stop proving that, and start projecting it?

The method is borrowed from chip design and from forecasting hygiene. The bar for “accomplished” is fixed before reading any source, so it cannot be quietly reshaped to fit what Extropic happens to show. Every claim is tagged by evidence maturity, mapped onto the eight layers of a real silicon stack, and tracked link by link. The goal is diagnosis, not advocacy or dismissal — locate the binding bottleneck and the missing evidence, and say exactly where they sit.

Verdict

Not yet hardware-witnessed. Public evidence demonstrates the stochastic-circuit sub-primitive — now coupled and peer-reviewed — not practical generative-AI thermodynamic hardware. No proof criterion (W1–W4) is met, and the binding gap is the unbuilt bridge from the DTM algorithm to real, large, programmable arrays.

The bar — what would count as proof

The proof rule is conjunctive: practical generative AI via thermodynamic hardware counts as accomplished only if real TSU / p-bit hardware runs a chained EBM/DTM workload end-to-end, reaches useful quality beyond toy benchmarks, and shows a measured like-for-like energy advantage including mapping and host overhead. Four sub-criteria, tracked independently:

W1 (not met): Hardware EBM / Gibbs sampling. Real hardware, not a software simulator, samples an energy-based / Boltzmann model from its own physical dynamics, with measured sample statistics. Best evidence: demonstrated. Not met: the all-transistor RNG cell is demonstrated only as an L5/L6 sub-primitive; the NEAT-RN paper adds a measured coupled Gaussian sampler, still a sub-primitive (Gaussian ≠ Boltzmann/EBM; refinement R-B). No measured coupled-array EBM/Gibbs sampling exists.
W2 (not met): End-to-end chained DTM on hardware. A denoising chain of EBMs (DTM) runs end-to-end on hardware, not a single isolated EBM and not a software stand-in for the chain. Best evidence: simulated. Not met: the chain runs only inside a GPU simulator of the DTCA.
W3 (not met): Useful quality beyond toy benchmarks. Generation is useful on tasks materially harder than (Fashion-)MNIST: high-res image, text, video, or a real downstream task. Best evidence: simulated. Not met: binarized Fashion-MNIST is the benchmark; the CIFAR-10 hybrid is, in the authors' words, “a naive first attempt,” and binarization is conceded “not viable in general.”
W4 (not met): Like-for-like energy / latency advantage. A measured energy and/or latency win over a GPU on the same task, with embedding / mapping cost and host overhead included, not just core-sampler joules. Best evidence: modeled-projected. Not met: the ~10,000× is modeled-projected, not measured, and the toy setting omits the hybrid embedding/host overhead richer tasks need (refinement R-A).

Each claim carries one of four evidence tags. Only demonstrated satisfies a criterion — the rest may support plausibility, but they do not meet the bar.

demonstrated: Measured on real hardware running the actual workload. The only tag that satisfies a criterion.
simulated: Produced by a software model of future hardware — THRML simulation, GPU emulation of TSU behavior.
modeled-projected: An analytic / physical estimate (e.g. energy-per-sample extrapolation), not a run.
roadmap: A stated intention or future plan, with no evidence yet.
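
The tagging rule and the conjunctive bar can be written down as a small check. This is only a sketch: the names `Tag`, `BEST_EVIDENCE`, `criterion_met`, and `accomplished` are hypothetical, and the boolean "covers the full criterion" flag is my shorthand for the W1 case, where the best evidence is demonstrated but only for a sub-primitive.

```python
from enum import Enum


class Tag(Enum):
    DEMONSTRATED = "demonstrated"
    SIMULATED = "simulated"
    MODELED_PROJECTED = "modeled-projected"
    ROADMAP = "roadmap"


# criterion -> (best evidence tag, does that evidence cover the full criterion?)
# Values restate the W1-W4 entries above; the flag is an illustrative assumption.
BEST_EVIDENCE = {
    "W1": (Tag.DEMONSTRATED, False),       # measured, but sub-primitive only
    "W2": (Tag.SIMULATED, False),
    "W3": (Tag.SIMULATED, False),
    "W4": (Tag.MODELED_PROJECTED, False),
}


def criterion_met(tag, covers_full_criterion):
    """Only demonstrated evidence of the full criterion satisfies it."""
    return tag is Tag.DEMONSTRATED and covers_full_criterion


def accomplished(evidence):
    """Conjunctive rule: every criterion must be met."""
    return all(criterion_met(tag, full) for tag, full in evidence.values())


status = {name: criterion_met(tag, full) for name, (tag, full) in BEST_EVIDENCE.items()}
print(status)
print("accomplished:", accomplished(BEST_EVIDENCE))
```

Keeping the tag and the coverage flag separate is the point of the design: a demonstrated tag is necessary for a criterion, but not sufficient on its own.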

Two refinements were forced by the sources, appended rather than silently edited. R-A: a W4 comparison must be measured and must include the hybrid embedding / host overhead richer tasks need — a modeled core-sampler figure on a toy benchmark does not qualify. R-B: a measured coupled Gaussian sampler is a W1 sub-primitive, not a satisfaction of W1, because the distribution class is Gaussian, not Boltzmann/EBM.

The claim chain

Extropic’s public argument is a chain of seven links. Reading it end to end shows exactly where the evidence first drops below demonstrated.

C1: CMOS stochastic circuits
C2: Programmable p-bit / TSU arrays
C3: Gibbs / Boltzmann sampling
C4: Hardware-sampled EBMs
C5: Denoising chains of EBMs (DTM)
C6: Useful generative AI
C7: Measured energy / latency advantage over GPUs

The frontier sits right after C1: the RNG cell is demonstrated, composing it into programmable arrays (C2) is unpublished, and everything downstream (C3–C7) is simulated or modeled.

C1: CMOS stochastic circuits produce controllable physical randomness. Single RNG cell measured (programmable sigmoidal bias, τ₀ ≈ 100 ns); the NEAT-RN paper deepens it to a peer-reviewed coupled Gaussian sampler.

L5 / L6 · feeds W1 · demonstrated

C2: Those circuits compose into programmable p-bit / TSU arrays. The evidence frontier. No public RTL / place-and-route; the production array is the Z1 roadmap (early access 2026).

L2–L5 · feeds W1 and W2 · roadmap

C3: The array performs Gibbs / Boltzmann sampling. Block-Gibbs only in a GPU simulator (THRML); not measured on hardware.

L0 / L5 · feeds W1 · simulated

C4: Sampling realizes energy-based models on hardware. EBMs sampled in software, never from hardware dynamics.

L1 / L0 · feeds W1 and W2 · simulated

C5: EBMs are chained into a denoising model (DTM). The full chain runs end-to-end, but only in simulation, on the GPU.

L1 · feeds W2 · simulated

C6: The DTM produces useful generative AI. Fashion-MNIST FID; beyond-toy quality not shown.

L1 · feeds W3 · simulated

C7: …at a measured energy / latency advantage over GPUs. ~10,000× from a physical / FLOP model, not a measured comparison.

L0–L7 · feeds W4 · modeled-projected
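
Locating the frontier is mechanical once each link carries a tag. The sketch below restates the seven tags from the chain above; the names `CHAIN`, `evidence_frontier`, and `first_gap` are hypothetical and exist only for illustration.

```python
# (link, evidence tag) pairs, restated from the claim chain above.
CHAIN = [
    ("C1", "demonstrated"),
    ("C2", "roadmap"),
    ("C3", "simulated"),
    ("C4", "simulated"),
    ("C5", "simulated"),
    ("C6", "simulated"),
    ("C7", "modeled-projected"),
]


def evidence_frontier(chain):
    """Last link of the unbroken demonstrated prefix, or None if there is none."""
    last = None
    for link, tag in chain:
        if tag != "demonstrated":
            break
        last = link
    return last


def first_gap(chain):
    """First link whose evidence drops below demonstrated, or None."""
    for link, tag in chain:
        if tag != "demonstrated":
            return link
    return None


print("frontier after:", evidence_frontier(CHAIN))
print("first gap:", first_gap(CHAIN))
```

The prefix rule matters because the chain is sequential: a demonstrated link further down would not help if an earlier link were still open.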

The eight-layer gap audit

Auditing the chain through a full silicon stack (L0–L7) localizes the break. The pattern is stark: demonstrated silicon at the bottom (L5/L6), a documented void in the middle (L2–L4 — no RTL, no synthesis, no place-and-route), and a principle-and-algorithm top (L0/L1) that exists only in simulation. The single most load-bearing number — the system energy — is a core-sampler model:

$E = T \, K_{\mathrm{mix}} \, L^{2} \, E_{\mathrm{cell}}, \quad E_{\mathrm{cell}} \approx 2\ \mathrm{fJ}, \quad K_{\mathrm{mix}} = 250$

— with no synthesized, placed-and-routed full system behind it. The “unpublished middle” is not an omission to wave away; it is precisely where a ~10,000× claim would have to survive contact with a real layout.
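
To get a feel for the scale of the core-sampler model above, it helps to plug numbers in. The following is a rough sketch. `E_CELL_J` and `K_MIX` come from the text, but the values of `T` and `L` are hypothetical placeholders. Reading `T` as the number of denoising steps and `L` as the grid side length is my assumption, not something the formula here states.

```python
E_CELL_J = 2e-15   # ~2 fJ per cell update (from the text)
K_MIX = 250        # mixing sweeps per step (from the text)


def sampler_energy_joules(T, L, k_mix=K_MIX, e_cell=E_CELL_J):
    """Core-sampler energy E = T * K_mix * L^2 * E_cell, in joules."""
    return T * k_mix * L**2 * e_cell


# Hypothetical inputs, chosen only to exercise the formula.
T_ASSUMED = 8    # assumed chain length
L_ASSUMED = 70   # assumed grid side, so L^2 = 4900 cells

energy = sampler_energy_joules(T_ASSUMED, L_ASSUMED)
print(f"{energy * 1e9:.1f} nJ per sample (core sampler only)")
```

The figure this prints covers cell updates only. It contains no term for control logic, clamping, readout, or host transfer, which is exactly the overhead the L2–L4 layers would add.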

L0: Principle. EBM-as-sampling and the mixing–expressivity tradeoff (MET) the DTM is built to break; two-color block-Gibbs correctness. THRML itself concedes Gibbs mixing has no general speed guarantee. Analytically sound; the gap is realization.

simulated

L1: Algorithm (DTM). A chain of 2–8 sparse Boltzmann-machine EBMs as denoising steps; inference cost ∝ T·K·τ₀. Articulated, but evidenced only by simulation: the binding frontier.

simulated

L2: RTL / control. None public. No register-transfer description of scheduling, clamping, or readout.

roadmap

L3: Logic synthesis. None public. The energy model is a core-sampler figure, with no netlist or digital overhead.

roadmap

L4: Physical design. None public. The headline energy claim rests on local-only communication surviving a real layout, yet no place-and-route evidence exists.

roadmap

L5: Circuit / SPICE. The strongest public layer: a measured single RNG cell plus the NEAT-RN coupled Gaussian sampler (programmable covariance, validated dynamics).

demonstrated

L6: Device / TCAD. Subthreshold shot-noise entropy source; a measured ~70%-thermal noise decomposition fit by an MJP/EKV model.

demonstrated

L7: Material. Standard CMOS (TSMC FinFET), no exotic materials; foundry PDK models available.

modeled-projected
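
The two-color block-Gibbs schedule named at L0 is easy to state in software, which is part of why the hardware gap is easy to overlook. Below is a minimal sketch on a small Ising-style grid with ±1 spins and free boundaries. It is a generic textbook chessboard sampler written for illustration, not THRML's API and not the replication's code; `two_color_sweep` and its parameters are hypothetical names.

```python
import math
import random


def two_color_sweep(spins, coupling, beta, rng):
    """One sweep: update all sites of one color, then all sites of the other."""
    n = len(spins)
    for color in (0, 1):
        proposals = []
        for i in range(n):
            for j in range(n):
                if (i + j) % 2 != color:
                    continue
                # Nearest neighbors on a chessboard all have the other color,
                # so same-color sites are conditionally independent.
                field = 0.0
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < n and 0 <= nj < n:
                        field += coupling * spins[ni][nj]
                # Conditional P(spin = +1 | neighbors) for energy -J * s_i * s_j.
                p_up = 1.0 / (1.0 + math.exp(-2.0 * beta * field))
                proposals.append((i, j, 1 if rng.random() < p_up else -1))
        # Commit the whole block at once: the "parallel" half-step.
        for i, j, s in proposals:
            spins[i][j] = s
    return spins


rng = random.Random(0)
grid = [[1] * 4 for _ in range(4)]
two_color_sweep(grid, coupling=1.0, beta=10.0, rng=rng)
print(grid)
```

In software the block commit is a loop over a list. In silicon it is a synchronization and routing requirement, which is the kind of detail the missing L2–L4 layers would have to specify.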

Original work

The framework above organizes the public record. What follows is hands-on: I ran the code, read the raw result files, and worked out what mapping the algorithm onto silicon would actually take.

A reproduction run that surfaced an undocumented bug

I took the Extropic-funded public replication (pschilliOrange/dtm-replication, package thrmlDenoising, commit 7c22d19) and ran its smoke tests in a fresh uv virtualenv (Python 3.11, CPU). Two findings, reported together. First, the published repo is GPU-gated, and the gating is undocumented: training hardcodes its device list to the GPU (the file’s sole GPU reference), with no CPU fallback, no config flag, and no mention of GPU/CUDA in the README. On a GPU-less host the repo installs and imports cleanly, yet cannot train even the tiny synthetic smoke test out of the box.

# DTM.py:279 — the file's sole GPU reference: no CPU fallback, no flag, README silent
devices = jax.devices("gpu")        # errors on a GPU-less host

# one-line local patch (throwaway clone only) → both smoke tests pass
devices = jax.devices()

Second, with that one-line CPU patch both smoke-test cases pass: the epoch-50 save/load round-trip (exact to six places) and the two-step “frankenstein” stitch (within the test’s 5% / 30% tolerances) — independently reproducing the per-step-independence claim. So “runnable” means runnable on GPU hardware, not CPU-portable as shipped: a real reproducibility caveat, found by execution rather than by reading.
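
The one-line patch above simply takes JAX's default devices. A slightly more explicit alternative would prefer the GPU and fall back to CPU, sketched below. This is not the repository's code: `pick_devices` is a hypothetical helper, and it assumes that `jax.devices("gpu")` raises `RuntimeError` when no GPU backend is present, which matches the failure I saw but should be checked against the installed JAX version. The device getter is passed in so the sketch runs without JAX installed.

```python
def pick_devices(get_devices, preferred="gpu", fallback="cpu"):
    """Return devices for the preferred backend, else the fallback backend.

    `get_devices` is any callable with the shape of `jax.devices(backend)`.
    Assumption: an unavailable backend raises RuntimeError.
    """
    try:
        return get_devices(preferred)
    except RuntimeError:
        return get_devices(fallback)


def _fake_gpu_less_host(backend=None):
    """Stand-in for jax.devices on a machine with no GPU (illustration only)."""
    if backend == "gpu":
        raise RuntimeError("Unknown backend: 'gpu' requested")
    return ["CpuDevice(id=0)"]


devices = pick_devices(_fake_gpu_less_host)
print(devices)

# With JAX installed, the intended call would be:
#   import jax
#   devices = pick_devices(jax.devices)
```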

Re-deriving the paper’s figures from the checked-in CSVs

The repo ships the result CSVs behind its Fig. 5 plots, so the headline rankings can be checked directly. Fig. 5b: the recorded minima are exact — best free-FID MEBM 37.08 (@ep6), DTM 29.68 (@ep8), DTM+ACP 26.84 (@ep86). But the ranking “DTM beats MEBM” holds only under a best-free-FID-over-available-rows read. Under final-row or common-epoch (ep144) extraction it does not survive: DTM+ACP stays stable (~29) while both MEBM (~183) and DTM (~185) fail late. The MEBM/DTM minima are transient early spikes the models immediately leave; only DTM+ACP’s persists. Robust conclusion: ACP stabilizes training — the standalone “DTM beats MEBM” claim is extraction-dependent, not robust.
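
The dependence on the extraction rule can be shown with nothing more than the numbers quoted above. The sketch below uses only those quoted points, not the full CSVs. The ep144 values are the approximate figures from the text, and `QUOTED_FID`, `best_over_rows`, `at_epoch`, and `rank` are hypothetical names.

```python
# epoch -> free FID, restricted to the points quoted in the text.
# The ep144 entries are approximate ("~183", "~185", "~29").
QUOTED_FID = {
    "MEBM": {6: 37.08, 144: 183.0},
    "DTM": {8: 29.68, 144: 185.0},
    "DTM+ACP": {86: 26.84, 144: 29.0},
}


def best_over_rows(series):
    """Best (lowest) FID over whatever rows are available."""
    return min(series.values())


def at_epoch(series, epoch):
    """FID at one fixed epoch shared by all models."""
    return series[epoch]


def rank(scores):
    """Model names ordered from best (lowest FID) to worst."""
    return sorted(scores, key=scores.get)


rank_best = rank({m: best_over_rows(s) for m, s in QUOTED_FID.items()})
rank_ep144 = rank({m: at_epoch(s, 144) for m, s in QUOTED_FID.items()})
print("best-over-rows:", rank_best)
print("common epoch 144:", rank_ep144)
```

Under the first rule DTM ranks ahead of MEBM; under the second the two swap places, while DTM+ACP leads in both.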

Fig. 5c: the degree-20 / grid-60 clamped-FID anchor of 20.12 verifies exactly, and the intra-model trends are robust: higher graph degree monotonically improves FID (8→56.8, 12→26.2, 16→22.9, 20→20.1) and the ~25% visible-node-fraction optimum holds. But the chain-depth / warmup gain is modest and non-monotonic (warmup 400→26.2, 800→24.2, 1200→24.4, so 1200 is slightly worse than 800), and the grid is incomplete (17 of 18 runs). The trends that survive are the ones the repo can actually support.
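
The monotonicity claims are simple to check against the quoted values. A minimal sketch, using only the numbers in the paragraph above; `strictly_improving` is a hypothetical helper name.

```python
# setting -> clamped FID, as quoted in the text.
DEGREE_FID = {8: 56.8, 12: 26.2, 16: 22.9, 20: 20.1}
WARMUP_FID = {400: 26.2, 800: 24.2, 1200: 24.4}


def strictly_improving(fid_by_setting):
    """True if FID strictly decreases as the setting increases."""
    values = [fid_by_setting[k] for k in sorted(fid_by_setting)]
    return all(later < earlier for earlier, later in zip(values, values[1:]))


print("degree trend monotone:", strictly_improving(DEGREE_FID))
print("warmup trend monotone:", strictly_improving(WARMUP_FID))
```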

Why a coupled, measured silicon sampler still doesn’t close the gap

The peer-reviewed NEAT-RN paper fabricates and measures a coupled multivariate-Gaussian sampler: programmable covariance, a validated Brownian-gyrator dynamic model, a ~70%-thermal noise decomposition. It is real silicon and a genuine advance on a single Bernoulli cell. Yet it does not reach W1, and the reason is precise: the measured distribution is Gaussian, not Boltzmann/EBM. So “coupled + measured” is now true while W1 stays open; the binding gap at W1 is the distribution class, not whether circuits can be coupled at all. That distinction is refinement R-B.

What mapping the DTM onto silicon would actually require

Listing the software conveniences the replication leans on makes the L1↔L5 bridge concrete. The forward process and fixed coupling are clean closed forms, the perturbation $p_{\mathrm{flip}} = 1 - e^{-\lambda \Delta t}$ and the weight $w = -\frac{1}{2} \ln\left(\tanh\left(\lambda \Delta t / 2\right)\right)$, but five things the simulator gets for free, hardware would have to earn:

Stochastic primitive. Replace the software PRNG (jax.random.bernoulli) updates with measured outputs of a fabricated p-bit / TSU array — the W1 bar.
Update schedule. Preserve the bipartite two-color block-parallel Gibbs schedule physically — the synchronous-parallel-update assumption must hold in silicon.
Placement. Turn the software chessboard placement into a real floorplan respecting the bipartite constraint and preset degree / jumps — an L3/L4 problem absent from every public artifact.
Clamp / readout. Conditioning-block clamping and input injection need hardware clamp and readout addressing at scale — an L2 problem.
Weight precision. Quantify the analog coupling precision achievable in hardware against the exact-float coupling formula and trained weights the simulation uses.
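
The two closed forms quoted before the list can be evaluated and cross-checked numerically. The sketch below rests on assumptions I have not confirmed against the replication code: that spins take values ±1, that the coupling enters as a pairwise factor exp(w·x·x'), and that a perturbation event redraws the spin uniformly, so the net flip probability is half the perturbation probability. The function names and the values of `rate` and `dt` are hypothetical.

```python
import math


def perturb_probability(rate, dt):
    """p_flip = 1 - exp(-lambda * dt), as quoted in the text."""
    return 1.0 - math.exp(-rate * dt)


def coupling_weight(rate, dt):
    """w = -(1/2) * ln(tanh(lambda * dt / 2)), as quoted in the text."""
    return -0.5 * math.log(math.tanh(rate * dt / 2.0))


rate, dt = 1.0, 0.5  # hypothetical values for lambda and delta-t

p = perturb_probability(rate, dt)
w = coupling_weight(rate, dt)

# ASSUMPTION: a perturbation redraws the spin uniformly, so it actually
# flips only half the time.
q = p / 2.0

# Odds of "stay" versus "flip", computed two ways.
odds_from_q = (1.0 - q) / q        # from the forward process
odds_from_w = math.exp(2.0 * w)    # from the factor exp(w * x * x')

print(f"p_flip = {p:.6f}, w = {w:.6f}")
print(f"odds from q = {odds_from_q:.6f}, odds from w = {odds_from_w:.6f}")
```

Under those assumptions the two odds agree, which is what makes the fixed coupling an exact encoding of the forward step in floating point. The weight-precision item in the list asks how much of that agreement survives analog coupling.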

The evidence ledger

Six primary sources, each read against the same bar — what is claimed, what is actually shown, and which criterion (if any) it moves.

SRC: DTM paper, Jelinčič et al. (2025). The load-bearing primary source. Demonstrates only the all-transistor RNG cell; the DTM generative results are simulated and the ~10,000× energy claim is modeled. No W-criterion met.

primary · arXiv:2510.23972 · demonstrated · cell only

SRC: NEAT-RN paper, Freitas et al. (2026). Peer-reviewed circuit / device companion: a fabricated, measured coupled multivariate-Gaussian sampler with a validated Brownian-gyrator model. Gaussian ≠ Boltzmann/EBM (refinement R-B).

Phys. Rev. Applied 25, 034061 · demonstrated · sub-primitive

SRC: DTM reference implementation, thrmlDenoising. An Extropic-funded, runnable replication of the DTM algorithm on the public thrml library. All simulated (GPU / PRNG); the most complete public software artifact, and it has no hardware path.

commit 7c22d19 · simulated · moves no W

SRC: THRML, Extropic's JAX block-Gibbs library. The simulation substrate beneath both the paper and the replication. Pure software; hardware is framed as future. Adds the first sampling/mixing caveat (Extropic's own admission).

L0 / L1 · simulated

SRC: Extropic Hardware page. A claims / marketing source: X0 → XTR-0 → Z1 (early access 2026); the PBIT/PDIT/PMODE/PMOG family. No measurements, no independent demonstrated tag. Dates the C2 array frontier as a 2026 roadmap.

claims · snapshot 2026-05-25 · roadmap

SRC: “From Zero to One” announcement. A narrative source that, read carefully, confirms the standing in Extropic's own words: the DTM / energy results are “simulations,” the hardware is “buildout,” and only the single p-bit cell is “proved in practice.”

narrative · snapshot 2026-05-25 · roadmap

The verdict, and how this connects

Across all six sources the standing is consistent: the demonstrated part is the stochastic-circuit sub-primitive (the p-bit cell, plus the coupled-Gaussian NEAT-RN sampler — both at L5/L6, neither a coupled-array EBM/Gibbs sampler); everything above it is simulated or roadmap; and the L1↔L5 bridge, with its unpublished L2–L4 middle, is the binding bottleneck. That is a sharp, defensible reading of a fast-moving program — credit where the silicon is real, and a clear marker of what would actually settle the question.

Auditing where the public evidence stops is only the front half of a research program. The back half is my own pre-registered work on the same thrml substrate — exp1 through exp19 — testing the trainability question this audit isolates, with frozen thresholds and measure-only discipline.
