harnessinferenzavllmserving

Cos'è un harness di inferenza e perché conta

llama.cpp, Ollama, vLLM, LM Studio: cosa fa davvero il software che esegue un modello, dal batching alla cache dell'attenzione, e come sceglierlo in base all'obiettivo.

Osservatorio Evolutivo2 min di lettura

Abstract (EN)

Between raw model weights and a usable service sits the inference harness. This article explains what that layer actually does (loading weights, managing the attention cache, batching requests, applying chat templates, exposing an API) and contrasts the engines a practitioner meets: llama.cpp for portable CPU and mixed-GPU inference, Ollama and LM Studio for a simple local experience, and vLLM for high-throughput serving via PagedAttention. The goal is to choose an engine by objective, local convenience versus production throughput, rather than by popularity, and to know which knobs each one exposes.

Loading article content...

Fonti

← Harness Harness per agenti: dal prompt al loop di strumenti →