Information Theoretic Adversarial Training of Large Language Models
Abstract
arXiv:2605.05415v1 Announce Type: cross

Abstract: Large language models (LLMs) remain vulnerable to adversarial prompting despite advances in alignment and safety, often exhibiting harmful behaviors under novel attack strategies. While adversarial training can improve robustness, existing approache…