GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations
Abstract
arXiv:2605.07053v1. Benchmarks like GSM8K are popular measures of mathematical reasoning, but leaderboard gains can overstate true capability due to memorization of fixed test sets. Most robustness variants apply surface-level perturbations (paraphrases, renamings, num…