Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration
Abstract
arXiv:2605.05566v1. Reinforcement learning with verifiable rewards, particularly Group Relative Policy Optimization (GRPO), has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, in complex tasks, GRPO frequently suffers from the …