TTF: Temporal Token Fusion for Efficient Video-Language Model
Abstract
arXiv:2605.07355v1 (cross-listed). Video-language models (VLMs) face rapidly growing inference costs as visual token counts scale with video length. For example, 32 frames at $448{\times}448$ resolution already yield more than 8,000 visual tokens in Qwen3-VL, making LLM prefill the dominant throughput …
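
The token-count figure quoted in the abstract can be sanity-checked with a short calculation. The sketch below is illustrative only: the function name `visual_tokens`, the 14-pixel patch size, and the 2x2 spatial merge are assumptions chosen because they reproduce a count above 8,000, not confirmed Qwen3-VL settings, and any temporal merging of frames is ignored.

```python
def visual_tokens(frames: int, height: int, width: int,
                  patch: int = 14, merge: int = 2) -> int:
    """Estimate visual tokens for a clip.

    Assumes (hypothetically) square patches of `patch` pixels and a
    `merge` x `merge` spatial pooling of patches into one token.
    """
    block = patch * merge  # pixels covered by one token along each axis
    if height % block or width % block:
        raise ValueError("resolution must be a multiple of patch * merge")
    per_frame = (height // block) * (width // block)
    return frames * per_frame


# 448 / (14 * 2) = 16 tokens per side, 16 * 16 = 256 per frame, 256 * 32 frames
print(visual_tokens(32, 448, 448))  # 8192
```

Under these assumed settings the count grows linearly with the number of frames, which is the scaling behaviour the abstract points to as the source of prefill cost.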