Structured Role-Aware Policy Optimization for Multimodal Reasoning
Summary
arXiv:2605.07274v1 Announce Type: new
Abstract: Reinforcement learning from verifiable rewards (RLVR), especially with Group Relative Policy Optimization (GRPO), has shown strong potential for improving the reasoning capabilities of large vision-language models (LVLMs). However, in multimodal reaso…