Sparse Prefix Caching for Hybrid and Recurrent LLM Serving
Abstract
arXiv:2605.05219v1 Announce Type: cross

Prefix caching is a key latency optimization for autoregressive LLM serving, yet existing systems assume dense per-token key/value reuse. State-space models change the structure of the problem: a recurrent layer can resume from a single stored state…
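The structural difference the abstract points at can be sketched in a few lines: a dense transformer layer must cache one key/value pair per prefix token, while a recurrent layer only needs the fixed-size state left behind at a prefix boundary. The sketch below is illustrative only; `RecurrentPrefixCache`, `step`, and `run` are hypothetical names, and the toy running-sum "state" stands in for a real state-space model's recurrent state.

```python
# Minimal sketch, assuming a recurrent layer whose entire history is
# summarized by one fixed-size state. All names are illustrative.
from dataclasses import dataclass, field

@dataclass
class RecurrentPrefixCache:
    # Maps a token prefix (as a tuple) to the recurrent state after it.
    states: dict = field(default_factory=dict)

    def put(self, prefix_tokens, state):
        self.states[tuple(prefix_tokens)] = state

    def longest_match(self, tokens):
        """Return (matched_length, state) for the longest cached prefix."""
        best_len, best_state = 0, None
        for prefix, state in self.states.items():
            n = len(prefix)
            if n > best_len and tuple(tokens[:n]) == prefix:
                best_len, best_state = n, state
        return best_len, best_state

# Toy recurrent step: a running sum stands in for an SSM state update.
def step(state, token):
    return state + token

def run(cache, tokens):
    matched, state = cache.longest_match(tokens)
    if state is None:
        state = 0  # initial state
    for t in tokens[matched:]:  # recompute only the uncached suffix
        state = step(state, t)
    cache.put(tokens, state)
    return state, matched
```

The key contrast with dense KV caching: the cache entry here is a single state per prefix, not a per-token list, so resuming a long shared prefix costs one lookup rather than reloading O(prefix length) key/value tensors.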