Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks
Summary
arXiv:2603.04676v2. Multi-image reasoning remains a significant challenge for vision-language models (VLMs). We investigate a previously overlooked phenomenon: during chain-of-thought (CoT) generation, the text-to-image (T2I) attention of reasoning VLMs exhibit…