The problem: hybrid cores aren’t symmetric multiprocessing

On Intel’s Alder Lake and later, a CPU is no longer a uniform pool of identical cores. It is a mix of P-cores (fast, wide out-of-order, hyperthreaded) and E-cores (slower per thread, no hyperthreading, and on some SKUs missing ISA features such as AVX-512). ggml’s threadpool knows none of this by default: it spins up N threads and expects them to make roughly equal progress on equal-sized work slices. On a P+E hybrid part that assumption breaks. The E-core threads become stragglers, and every generation step blocks on the slowest thread. Adding threads makes this worse rather than better, because past a fairly low thread count token generation is bound by memory bandwidth, not compute, so the extra threads add synchronization cost without adding throughput.
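The straggler effect is easy to see in a toy model. The sketch below is not ggml code. It assumes work is split into equal slices and that a step finishes only when the slowest thread does. The relative speeds (1.0 for a P-core thread, 0.375 for an E-core thread) are made-up illustration values, not measurements.

```python
def step_time(work_units, speeds):
    """Time for one synchronized step when work is split into equal slices.

    `speeds` holds one relative throughput value per thread (hypothetical).
    The step ends when the slowest thread finishes its slice.
    """
    slice_size = work_units / len(speeds)
    return max(slice_size / s for s in speeds)


# Hypothetical relative speeds, chosen only for illustration.
p_only = [1.0] * 8                # 8 threads on P-cores
mixed = [1.0] * 8 + [0.375] * 8   # same 8, plus 8 threads on E-cores

print(step_time(96.0, p_only))  # 12.0
print(step_time(96.0, mixed))   # 16.0 (more threads, slower step)
```

Under this model, adding E-core threads helps only when the E-to-P speed ratio exceeds N_p / (N_p + N_e), which is 0.5 for the 8+8 case above. The model ignores the memory-bandwidth ceiling, which pushes real results further against the extra threads.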

Pinning with --cpu-mask / --cpu-range

llama.cpp exposes explicit affinity controls rather than leaving it to the OS scheduler:

./llama-cli -m model.gguf --cpu-range 0-7 --cpu-strict 1 --prio 2 \
            --threads 8 --threads-batch 16
  • --cpu-mask / --cpu-range pin the threadpool to a specific set of logical cores (P-cores only, in practice — check lscpu/Task Manager for which core indices are P vs E on your SKU).
  • --cpu-strict forces strict affinity instead of treating the mask as a soft hint.
  • --prio bumps scheduling priority so the OS doesn’t preempt inference threads for background work.
  • --threads vs --threads-batch matter because prompt processing (batch, compute-bound, parallelizes well) and token generation (sequential, bandwidth-bound) have different optimal thread counts, so it’s common to set --threads-batch higher than --threads. The batch side has its own affinity flags (--cpu-mask-batch / --cpu-range-batch) if you want to pin it separately.
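Working out which logical CPUs to pin to can be scripted. The sketch below assumes a Linux system where the hybrid PMU exposes the P-core list at /sys/devices/cpu_core/cpus. That path is an assumption about your kernel and returns nothing on non-hybrid or non-Linux machines. It also assumes bit i of the hex mask maps to logical CPU i, which is worth confirming against llama-cli --help. The helper names are hypothetical.

```python
def parse_cpulist(text):
    """Parse a Linux cpulist string such as "0-7,16" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def cpus_to_hex_mask(cpus):
    """Build a hex affinity mask, assuming bit i corresponds to logical CPU i."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


def read_p_core_cpus(path="/sys/devices/cpu_core/cpus"):
    """Return the P-core logical CPUs, or [] if the sysfs file is unavailable."""
    try:
        with open(path) as f:
            return parse_cpulist(f.read())
    except OSError:
        return []


print(cpus_to_hex_mask(parse_cpulist("0-7")))  # ff
print(read_p_core_cpus())                      # machine-dependent; [] if not hybrid
```

The resulting string is a candidate value for --cpu-mask. For a contiguous block of cores, --cpu-range says the same thing more readably.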

OpenMP build vs native threads

ggml can be compiled two ways: with GGML_OPENMP (uses the OpenMP runtime’s thread pool and its own affinity/spin-wait tuning via OMP_* env vars), or without it, falling back to ggml’s own threadpool on native threads (pthreads on POSIX systems, Win32 threads on Windows) plus platform-native affinity calls (pthread_setaffinity_np on Linux, SetThreadAffinityMask on Windows). The two behave differently under contention:

  • OpenMP’s runtime handles spin-then-yield backoff itself, which tends to cost less CPU when the pool is idle between requests.
  • The native-thread path gives llama.cpp’s own --poll flag direct control over how long a worker busy-waits before yielding — good for latency-sensitive single-request serving, bad for power draw if you’re running on battery.

Neither is universally better. The choice depends on whether you’re optimizing for tail latency (native threads with aggressive --poll busy-waiting) or for running many idle-but-resident model instances (OpenMP, letting the runtime manage contention).
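The trade-off behind polling is simple to sketch: a worker that busy-waits picks up new work almost immediately but burns a core while idle, and a worker that blocks costs nothing while idle but pays a wake-up delay. The Python below is only a conceptual illustration of spin-then-block waiting. It is not how ggml or any OpenMP runtime implements it, and wait_for_work and spin_seconds are hypothetical names.

```python
import threading
import time


def wait_for_work(event, spin_seconds):
    """Busy-poll `event` for up to `spin_seconds`, then block until it is set.

    Returns "spin" if polling observed the event, "block" if the worker
    had to fall back to a blocking wait.
    """
    deadline = time.perf_counter() + spin_seconds
    while time.perf_counter() < deadline:
        if event.is_set():
            return "spin"
    event.wait()  # sleep until signaled; no CPU burned while waiting
    return "block"


# Work arrives after roughly 20 ms in both cases.
for budget in (0.5, 0.0):
    ev = threading.Event()
    threading.Timer(0.02, ev.set).start()
    print(budget, wait_for_work(ev, budget))
```

With a spin budget longer than the arrival delay the worker catches the event while polling. With a zero budget it always blocks, which is the low-power, higher-latency end of the range.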

What actually moved the needle

In practice, the bulk of the win came from three changes, not from adding threads:

  1. Excluding E-cores from the mask entirely for generation — mixed-core batches were consistently bottlenecked by the slowest thread.
  2. Separating --threads (generation) from --threads-batch (prompt processing) instead of using one value for both.
  3. Setting --cpu-strict 1. Without it, the scheduler would occasionally migrate a thread mid-run and throw away whatever cache locality it had built up.

None of this is visible unless you watch per-core utilization during a run. At a glance, “more threads” looks like it should help, but it does the opposite.
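One low-tech way to watch per-core utilization on Linux is to sample /proc/stat twice and diff the counters. This is a minimal sketch that assumes the standard per-CPU line layout (user, nice, system, idle, iowait, irq, softirq, steal). The function names are hypothetical, and a tool like htop or mpstat shows the same thing interactively.

```python
import time


def parse_proc_stat(text):
    """Return {cpu_index: (busy_ticks, total_ticks)} from /proc/stat contents."""
    out = {}
    for line in text.splitlines():
        fields = line.split()
        # Per-CPU lines look like "cpu3 ..."; skip the aggregate "cpu" line.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        # user nice system idle iowait irq softirq steal
        ticks = [int(x) for x in fields[1:9]]
        idle = ticks[3] + ticks[4]
        total = sum(ticks)
        out[int(fields[0][3:])] = (total - idle, total)
    return out


def utilization(before, after):
    """Busy fraction per CPU between two parse_proc_stat() snapshots."""
    util = {}
    for cpu, (busy1, total1) in after.items():
        busy0, total0 = before.get(cpu, (0, 0))
        dt = total1 - total0
        util[cpu] = (busy1 - busy0) / dt if dt > 0 else 0.0
    return util


def sample(interval=1.0, path="/proc/stat"):
    """Take two snapshots `interval` seconds apart and return per-CPU utilization."""
    with open(path) as f:
        first = parse_proc_stat(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_proc_stat(f.read())
    return utilization(first, second)


if __name__ == "__main__":
    try:
        for cpu, frac in sorted(sample().items()):
            print(f"cpu{cpu}: {frac:.0%}")
    except OSError:
        print("/proc/stat not available on this system")
```

Run it in a second terminal during generation. If the E-core indices sit near 100% while the P-cores show idle gaps, the P-core threads are likely waiting on stragglers.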