What an inference harness is and why it matters
llama.cpp, Ollama, vLLM, LM Studio: what the software that runs a model actually does, from batching to the attention cache, and how to choose it based on your objective.
Abstract
Between raw model weights and a usable service sits the inference harness. This article explains what that layer actually does (loading weights, managing the attention cache, batching requests, applying chat templates, exposing an API) and contrasts the engines a practitioner is likely to meet: llama.cpp for portable CPU and hybrid CPU/GPU inference, Ollama and LM Studio for a simple local experience, and vLLM for high-throughput serving via PagedAttention. The goal is to choose an engine by objective (local convenience versus production throughput) rather than by popularity, and to know which knobs each one exposes.