In-Memory Compute | The Wise Operator

What It Is

In-memory compute, sometimes written “compute-in-memory” or “processing-in-memory,” is a chip architecture that does the math inside the same cells that hold the data, instead of fetching the data, moving it across a bus to a separate arithmetic unit, computing the result, and writing the answer back. For traditional von Neumann machines, that round trip dominates both the time and the energy budget of an inference call. For large model inference, where the weights of a trillion-parameter model have to be streamed to the processor again for every token generated, thousands of times in a single long response, the round trip is the bill.

In-memory compute collapses that distance to zero. The memory cells perform the multiply-and-accumulate operations themselves, in parallel, in analog or low-precision digital depending on the design. The output of one layer is already where the next layer’s weights live. A chip built this way is not faster at training, where high-precision gradient updates still favor traditional GPUs. It is faster, and dramatically cheaper, at inference, where the same model is run a billion times against fresh prompts.

Fractile, the UK startup that raised $220 million in May 2026 to commercialize the approach, claims that reasoning workloads which take weeks on current hardware can be cut to days, and that the per-token cost of a long agent loop can drop by an order of magnitude.

How It Actually Works

A standard inference call moves weights from HBM (high-bandwidth memory) to a GPU’s tensor cores, runs the multiply, returns the activation, fetches the next layer’s weights, and repeats. The processor spends a large fraction of its energy moving bytes, not computing. The “memory wall” is the industry’s name for the resulting ceiling.
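A back-of-the-envelope way to see the ceiling: if every weight has to cross the memory bus once per generated token, memory bandwidth alone sets a floor on latency, no matter how fast the arithmetic units are. The sketch below assumes a single device, a batch of one, and a dense model. The parameter count, weight precision, and bandwidth figure are illustrative placeholders, not measurements of any real chip.

```python
def min_seconds_per_token(params: float, bytes_per_param: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Lower bound on per-token latency when every weight is read once per token."""
    return params * bytes_per_param / bandwidth_bytes_per_s


# Hypothetical figures, chosen only to make the arithmetic easy to follow.
PARAMS = 1e12            # one trillion weights
BYTES_PER_PARAM = 1.0    # assumed 8-bit weights
BANDWIDTH = 4e12         # assumed 4 TB/s of memory bandwidth

floor = min_seconds_per_token(PARAMS, BYTES_PER_PARAM, BANDWIDTH)
print(f"{floor:.2f} s per token, at best {1 / floor:.0f} tokens/s")
# prints "0.25 s per token, at best 4 tokens/s"
```

Real deployments soften this bound by batching requests and sharding the model across many devices, but the bytes still have to move, and moving them is where the energy goes.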

In-memory compute breaks the wall by changing what a memory cell is. Instead of just storing a value, the cell is wired so that applying an input voltage produces an output current proportional to the product of the stored value and the input, and the currents from every cell on a shared column wire add together. A single column of cells, energized once, returns the dot product of an input vector and that column’s stored weights. That is the inner loop of every neural network layer.
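A toy model of that cell array, in plain Python. It assumes the common crossbar layout in which inputs drive the rows and each column wire sums the currents of its cells. The function name and the small integer weights are made up for illustration, and the model ignores noise, drift, and limited precision, which real analog arrays have to manage.

```python
def crossbar_mac(conductances, voltages):
    """Return the summed current on each column wire of an idealized crossbar.

    conductances[r][c] is the weight stored in the cell at row r, column c.
    voltages[r] is the input applied to row r.
    Each cell passes a current of G * V (Ohm's law), and the currents that
    share a column wire add up (Kirchhoff's current law).
    """
    n_rows = len(conductances)
    n_cols = len(conductances[0])
    return [
        sum(conductances[r][c] * voltages[r] for r in range(n_rows))
        for c in range(n_cols)
    ]


weights = [
    [1, 0],
    [2, 1],
    [0, 3],
]
inputs = [1, 2, 3]

print(crossbar_mac(weights, inputs))  # prints [5, 11]
```

In software the inner sum is a loop. In the hardware being described, every column produces its sum at the same moment, because the addition is done by the wire rather than by an instruction.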

The trade-off is that the math is analog (or very low precision) and the cells degrade slightly with each operation. The designs that have crossed into production combine in-memory compute for the bulk of the matmul work with a small classical processor for control flow, calibration, and the few high-precision operations that matter.

Why It Matters Right Now

Inference is now the dominant cost of running a frontier AI product, not training. The economics of agent loops, where one user prompt triggers dozens or hundreds of model calls, have turned token cost into the line every CFO at a model-using company watches. The hyperscalers’ answer is custom silicon (Google’s TPU, AWS’s Trainium, Microsoft’s Maia). The frontier labs’ answer is multi-gigawatt compute-commitment deals. The chip startups’ answer is to attack the architecture itself, and in-memory compute is the most credible challenger.

Cerebras’s wafer-scale engine put one alternative architecture on the public markets in April 2026. Fractile’s $220 million round in May puts a second one into late-stage development. The bet is that Nvidia’s stack, built for training, leaves a margin for someone whose stack is built only for inference.

How TWO Uses It

Scott does not buy chips. The reason in-memory compute matters in this newsroom is the same reason every infrastructure shift matters: it sets the floor on what the per-token economics of an agent loop will be twelve months out. If a Fractile-style chip cuts inference cost to one-fifth of today’s, the agent workflows that are too expensive to run today (a multi-hour deep research pass on every customer in your CRM, a continuous monitoring agent on every active project, a long-context personal assistant that remembers a year of email) become next year’s table stakes.

TWO’s editorial bet is that operators should design their workflows assuming inference will be cheap and abundant by mid-2027, and architect for that future even if they pay 2026 prices today. The companion decision is sourcing. If your agent platform is locked into one model vendor’s stack, you cannot capture the savings when a different chip undercuts it. That is the operator decision lurking behind every “single-vendor convenience” pitch this year.
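One way to keep the sourcing decision open is to have the workflow depend on a narrow interface rather than on a vendor SDK. The sketch below is a minimal illustration of that idea. InferenceBackend, EchoBackend, and draft_brief are hypothetical names, and the stand-in backend only echoes its prompt where a real one would wrap a vendor's API.

```python
from typing import Protocol


class InferenceBackend(Protocol):
    """Anything that can turn a prompt into text (hypothetical interface)."""

    def complete(self, prompt: str) -> str: ...


class EchoBackend:
    """Stand-in backend for the sketch; a real one would call a vendor API."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


def draft_brief(account: str, backend: InferenceBackend) -> str:
    # The workflow only knows about the interface, never about the vendor.
    return backend.complete(f"Weekly brief for {account}")


print(draft_brief("Acme", EchoBackend("vendor-a")))
print(draft_brief("Acme", EchoBackend("vendor-b")))
```

Swapping vendors then means writing one new backend class, while the prompts and the downstream workflow stay as they are.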

A Concrete Operator Scenario

You are building a sales-enablement agent that drafts a weekly brief on every named account in your pipeline. At 2026 inference prices, running the brief weekly across 200 accounts costs roughly the equivalent of one sales rep’s monthly salary in tokens. You can run it on the 30 priority accounts only, and you do, because the math forces you to.

Eighteen months later, if an in-memory chip lands in a managed inference product at one-fifth the price, the same brief across all 200 accounts is the cost of a single rep’s lunch budget for a month. The decision you owe yourself today is not whether to wait. The decision is whether the brief format and prompt and downstream workflow are built so that scaling from 30 to 200 accounts requires a config change, not a rewrite. Architecting for cheap inference is a choice you make today, not one the future chip makes for you.
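Here is roughly what "a config change, not a rewrite" can look like. This is a sketch under stated assumptions: BriefConfig and its fields are hypothetical, and the token count and price are placeholders picked to make the arithmetic readable, not real 2026 rates.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class BriefConfig:
    account_limit: int               # how many accounts get a weekly brief
    tokens_per_brief: int            # assumed average, prompt plus output
    price_per_million_tokens: float  # assumed blended price, in dollars


def weekly_cost(cfg: BriefConfig) -> float:
    """Token spend for one weekly run under the given config."""
    total_tokens = cfg.account_limit * cfg.tokens_per_brief
    return total_tokens * cfg.price_per_million_tokens / 1_000_000


def select_accounts(accounts_by_priority: list[str], cfg: BriefConfig) -> list[str]:
    """Take the top accounts from a list already sorted by priority."""
    return accounts_by_priority[: cfg.account_limit]


# Placeholder numbers: 30 accounts at today's assumed price.
today = BriefConfig(account_limit=30, tokens_per_brief=2_000_000,
                    price_per_million_tokens=10.0)

# Later: all 200 accounts at one-fifth the price. Only the config changes.
later = replace(today, account_limit=200,
                price_per_million_tokens=today.price_per_million_tokens / 5)

print(weekly_cost(today))
print(weekly_cost(later))
```

With these placeholder numbers, covering all 200 accounts at the lower price costs about a third more than covering 30 accounts does today. The point of the sketch is that nothing outside the config object had to change to get there.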

What to Watch Next

Two signals. First, whether any of the chip startups (Fractile, Cerebras, Groq, Tenstorrent) lands a flagship customer on a public model deployment. Second, whether the hyperscalers integrate third-party inference silicon into their managed-inference products. A single AWS announcement that a Fractile chip is one of the runtime options behind Bedrock would mark the moment. Until then, the architecture is a bet, not a roadmap.