CPU inference with Ollama

2025-09-21

#ai #ruby

I’ve been using Ollama for the past year or so to experiment with open-source large language models, and have most recently been trying out OpenAI’s gpt-oss 20 billion parameter model.

The hardware

Lacking any GPU from the last decade, I’ve instead been running these models using Ollama’s CPU-only support. It has been interesting to see how recent library improvements have made gpt-oss much more usable when run this way.
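
Ollama falls back to CPU inference on its own when it finds no supported GPU, but CPU-only operation can also be forced. A minimal sketch, using the num_gpu parameter (the number of layers offloaded to a GPU) from ollama run’s interactive mode:

$ ollama run gpt-oss:20b
>>> /set parameter num_gpu 0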

Here’s the hardware I’m using:

$ neofetch kernel cpu memory
kernel: 6.16.4-arch1-1
cpu: Intel Xeon E5-2698 v4 (40) @ 3.600GHz
memory: 15181MiB / 64208MiB

This is a Broadwell-EP CPU from early 2016, paired with quad-channel DDR4-2400. Memory bandwidth is king for LLM inference, and this hardware has a theoretical maximum of a little under 80GB/s.
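
For the back-of-the-envelope version: each DDR4 channel transfers 8 bytes at a time, so peak bandwidth is channels × transfer rate × bus width:

$ echo "$((4 * 2400 * 8)) MB/s"  # channels * MT/s * bytes per transfer
76800 MB/s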

The test

I used the following prompt against the lib/ source of a small Ruby gem, which resulted in 1.01k input tokens being fed to the model:

MODEL="gpt-oss:20b"
GEM_CODE=$(
  git ls-files -- lib \
  | xargs -I {} bash -c 'echo -e "{} contains:\n$(cat {})"'
)
echo -e "offer a refactoring suggestion for this ruby gem:\n\n$(GEM_CODE)" \
| ollama run $MODEL --verbose -
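
The rates quoted below come from the stats that the --verbose flag makes ollama print after each response. Abridged, with elided values marked and the 0.11.8 figures filled in:

prompt eval count:    1010 token(s)
prompt eval rate:     57.50 tokens/s
eval count:           ...
eval rate:            10.16 tokens/s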

The results

A combination of Ollama enabling flash attention for CPU-only prompt processing, plus efficiency improvements in how the stored gpt-oss model weights are handled, resulted in a big performance improvement between 0.11.4 and 0.11.8:

Model               Ollama Version  Prompt Eval Rate (tokens/s)  Generation Rate (tokens/s)
gpt-oss:20b         0.11.4          10.17                        6.16
gpt-oss:20b         0.11.8          57.50                        10.16
qwen3-coder:latest  0.11.8          78.19                        13.55
qwen3:0.6b          0.11.8          404.92                       42.81
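
On the flash attention point: Ollama exposes a server-side toggle for it via the OLLAMA_FLASH_ATTENTION environment variable, so (assuming it also governs the CPU path in these versions, which I believe it does) it can be enabled explicitly:

$ OLLAMA_FLASH_ATTENTION=1 ollama serve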

I’ve included a couple of extra models in the table for comparison, but the headline is the gpt-oss improvement: roughly 65% faster token generation, and over 5x faster prompt evaluation.
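
For completeness, those ratios pulled straight from the table:

$ echo "scale=2; 57.50/10.17; 10.16/6.16" | bc
5.65
1.64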