arXiv cs.AI by Synapse Flow Editorial Team

SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference

Abstract

arXiv:2604.21231v2 · Announce Type: replace-cross

Efficient inference for on-device Large Language Models (LLMs) remains challenging due to limited hardware resources and the high cost of the prefill stage, which processes the full input context to construct Key-Value (KV) caches. We presen…
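The abstract is cut off before the method is described, but the cost it names is concrete: during prefill, every prompt token must be projected into Key and Value vectors before the first output token can be decoded, so both compute and cache size grow with context length. The sketch below is a minimal, single-head NumPy illustration of that generic prefill/decode split, not SparKV's overhead-aware loading scheme (which the truncated abstract does not specify); all function names and dimensions are hypothetical.

```python
import numpy as np

def prefill(x, w_k, w_v):
    """Prefill: project the whole prompt into K/V once and cache it.

    x:        (seq_len, d_model) prompt embeddings
    w_k, w_v: (d_model, d_head)  projection weights
    Cost and cache size both scale with seq_len.
    """
    k_cache = x @ w_k  # (seq_len, d_head)
    v_cache = x @ w_v  # (seq_len, d_head)
    return k_cache, v_cache

def decode_step(x_t, w_q, w_k, w_v, k_cache, v_cache):
    """Decode one token by attending over the cached K/V
    instead of re-processing the full context."""
    q = x_t @ w_q                                # (d_head,)
    k_cache = np.vstack([k_cache, x_t @ w_k])    # append this step's key
    v_cache = np.vstack([v_cache, x_t @ w_v])    # append this step's value
    scores = k_cache @ q / np.sqrt(q.shape[-1])  # logits over all cached positions
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax attention weights
    out = weights @ v_cache                      # attention output for the new token
    return out, k_cache, v_cache

# Toy run with hypothetical dimensions.
rng = np.random.default_rng(0)
d_model, d_head, seq_len = 64, 16, 128
w_q, w_k, w_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))
prompt = rng.standard_normal((seq_len, d_model))
k_cache, v_cache = prefill(prompt, w_k, w_v)
out, k_cache, v_cache = decode_step(
    rng.standard_normal(d_model), w_q, w_k, w_v, k_cache, v_cache
)
```

In this toy form, the `prefill` matmuls over all `seq_len` positions are exactly the stage the paper targets; an overhead-aware loader would presumably avoid materializing or loading the full cache up front, but the details belong to the original paper.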

