Adaptive Greedy Frame Selection for Long Video Understanding
Abstract
arXiv:2603.20180v2. Large vision-language models (VLMs) are increasingly applied to long-video question answering, yet inference is often bottlenecked by the number of input frames and the resulting visual tokens. Naive sparse sampling can miss decisive moments, w…