Large Model Alignment and Safety

2605.17986 2026-06-18 cs.CR cs.AI Version update Topic 95

LivePI: More Realistic Benchmarking of Agents Against Indirect Prompt Injection

Lei Zhao, Abhay Bhaskar, Edgar Dobriban

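Topic match: Prompt injection. Benchmarks AI agents against indirect prompt injection; the core concern is security.

AI summary: Proposes the LivePI benchmark, covering 7 input surfaces, 12 attack/rendering families, and 5 malicious objectives; evaluates multiple AI agents in real virtual-machine environments, finds attack success rates of 10.7%-29.6%, and validates the effectiveness of a two-layer defense.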

2410.15595 2026-06-18 cs.AI cs.CL cs.LG Version update Topic 95

A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications

Wenyi Xiao, Zechuan Wang, Leilei Gan, Shuai Zhao, Zongrui Li, Ruirui Lei, Wanggui He, Luu Anh Tuan, Long Chen, Hao Jiang, Zhou Zhao, Fei Wu

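Topic match: Preference alignment. DPO is one of the core methods for preference alignment.

AI summary: Surveys progress on Direct Preference Optimization (DPO) across theory, variants, datasets, and applications; notes its potential and limitations as an RL-free alternative and proposes future research directions.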

Comments Accepted by TPAMI 2026. Project page: https://github.com/Mr-Loevan/DPO-Survey

2604.23130 2026-06-18 cs.CL cs.AI Version update Topic 90

From Concept-Aligned Tokens to Vulnerable Features: Mechanistic Localization of Jailbreaks

Nilanjana Das, Mathew Dawit, Aman Chadha, Manas Gaur

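Topic match: Jailbreak attacks. Mechanistically localizes jailbreak vulnerabilities and analyzes harmful features.

AI summary: Proposes a token-based mechanistic pipeline that localizes jailbreak vulnerabilities through subgroups of sparse autoencoder features; finds that a single harmful token suffices to localize vulnerable features, and that these features concentrate in the middle-to-late layers.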

2511.20002 2026-06-18 cs.CV cs.AI cs.CR Version update Topic 85

Semantic Router: On the Feasibility of Hijacking MLLMs via a Single Adversarial Perturbation

Changyue Li, Jiaying Li, Youliang Yuan, Jiaming He, Zhicong Huang, Pinjia He

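Topic match: Jailbreak attacks. Proposes a semantic-aware universal perturbation that hijacks MLLMs, which counts as a jailbreak attack.

AI summary: Proposes the Semantic-Aware Universal Perturbation (SAUP), which acts as a semantic router to hijack multiple stateless decisions simultaneously; realized through theoretical analysis and the SORT optimization strategy, it reaches a 66% attack success rate across five targets on Qwen.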

Comments Accepted to ICML 2026

2412.16468 2026-06-18 cs.LG Version update Topic 90

The Road to Artificial SuperIntelligence: A Comprehensive Survey of Superalignment

HyunJin Kim, DongHyun Ryu, Xiaoyuan Yi, Jing Yao, Jianxun Lian, Muhua Huang, Shitong Duan, JinYeong Bak, Xing Xie

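Topic match: Safety evaluation. Surveys the superalignment problem and analyzes scalable oversight paradigms.

AI summary: Surveys the superalignment problem; by analyzing scalable oversight paradigms (sandwiching, self-enhancement, and weak-to-strong generalization) and their limitations, it discusses the challenges of and paths toward supervising, controlling, and governing artificial superintelligence.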

Comments 24 pages

2505.20045 2026-06-18 cs.CL Version update Topic 85

Efficient Hallucination Detection for LLMs Using Uncertainty-Aware Attention Heads

Artem Vazhentsev, Lyudmila Rvanova, Gleb Kuzmin, Ekaterina Fadeeva, Ivan Lazichny, Alexander Panchenko, Maxim Panov, Mrinmaya Sachan, Preslav Nakov, Timothy Baldwin, Artem Shelmanov

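Topic match: Safety evaluation. Unsupervised hallucination detection that improves LLM reliability.

AI summary: Proposes the RAUQ framework, which uses uncertainty-aware attention heads and token-level confidence to perform unsupervised, efficient sequence-level hallucination detection in a single forward pass; outperforms existing methods on 12 datasets with less than 1% additional computation.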

Journal ref Proceedings of the 43rd International Conference on Machine Learning (ICML), Seoul, South Korea, 2026

2507.04219 2026-06-18 cs.LG cs.AI Version update Topic 80

Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs

Yan Scholten, Sophie Xhonneux, Leo Schwinn, Stephan Günnemann

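Topic match: Safety evaluation. A machine unlearning method that removes private information, which makes it safety-relevant.

AI summary: Proposes Partial Model Collapse (PMC), which achieves unlearning by deliberately triggering distribution collapse of the model on the target data, without optimizing on the unlearning targets; effectively removes private information while preserving model utility.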

Comments Accepted at ICLR 2026

2504.14798 2026-06-18 cs.LG cs.CV Version update Topic 75

RUB: Evaluating Residual Knowledge in Unlearned Models

Hao Xuan, Xingyu Li

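Topic match: Safety evaluation. Evaluates residual knowledge in unlearned models under adversarial attack.

AI summary: Proposes a robust unlearning principle and a unified benchmark, RUB, which detects residual information via the Unlearning Mapping Attack (UMA) and reveals the fragility of existing methods under adversarial evaluation.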

Journal ref Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2026, pages 8550-8559

2604.13899 2026-06-18 cs.CL cs.AI Version update Topic 70

Do We Still Need Humans in the Loop? Comparing Human and LLM Annotation in Active Learning for Hostility Detection

Ahmad Dawar Hakimi, Lea Hirlimann, Isabelle Augenstein, Hinrich Schütze

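Topic match: Safety evaluation. Compares LLM and human annotation for hostility detection.

AI summary: Compares LLM and human annotation in active learning; finds that LLM annotation is cheaper and performs better, but that active learning offers no advantage under LLM annotation.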

1. Prompt injection (1 paper)

LivePI: More Realistic Benchmarking of Agents Against Indirect Prompt Injection

2. Preference alignment (1 paper)

A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications

3. Jailbreak attacks (2 papers)

From Concept-Aligned Tokens to Vulnerable Features: Mechanistic Localization of Jailbreaks

Semantic Router: On the Feasibility of Hijacking MLLMs via a Single Adversarial Perturbation

4. Safety evaluation (5 papers)

The Road to Artificial SuperIntelligence: A Comprehensive Survey of Superalignment

Efficient Hallucination Detection for LLMs Using Uncertainty-Aware Attention Heads

Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs

RUB: Evaluating Residual Knowledge in Unlearned Models

Do We Still Need Humans in the Loop? Comparing Human and LLM Annotation in Active Learning for Hostility Detection