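LLM Alignment and Safety

2606.18673 2026-06-18 cs.CR New submission 95%

Understanding and Mitigating Prompt Leaking Attacks in Real-World LLM-Based Applications

Yong Yang, Chong Fu, Tong Zhang, Rui Zeng, Qingming Li, Tianyu Du, Zonghui Wang, Shouling Ji, Wenzhi Chen

Topic match (Safety Evaluation): system prompt leaking attacks and defenses

AI summary: This study systematically measures 1,200 real-world LLM-based applications, finds that over 80% leak their system prompts, and proposes AREA, a defense based on attention-drift analysis that prevents leakage while preserving usability.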

Comments Accepted at ACM CCS 2026

Abstract

Large language model (LLM)-based applications rely on system prompts to encode core logic and developer-defined constraints, making these prompts important intellectual property. However, system prompts are vulnerable to prompt leaking attacks. Although prior work has shown such attacks in controlled settings, their prevalence, causes, and defenses in real-world deployments remain unclear. This paper presents a systematic study of prompt leaking in real-world LLM-based applications. We measure 1,200 applications across six major commercial platforms and find that over 80% of deployments leak system prompts under realistic adversarial queries, sometimes exposing sensitive information such as third-party API keys. We also show that existing defenses often fail to prevent leakage without degrading usability. To explain these failures, we conduct an attention-level mechanistic analysis and identify attention drift, where query-key alignment bias and softmax amplification cause LLMs to progressively ignore defensive constraints. Guided by this insight, we propose AREA, a practical defense that re-anchors the model's attention using an optimizable soft prompt. Experiments and real-world case studies show that AREA matches the leakage resistance of state-of-the-art defenses while improving average usability by over 33% and reducing optimization overhead by nearly 3x. Our responsible disclosure led two affected vendors to classify these leaks as medium-severity vulnerabilities.

2606.19222 2026-06-18 cs.LG cs.AI New submission 90%

Mechanism-Guided Selective Unlearning for RLVR-Induced Reasoning

Chenyu Zhou, Qiliang Jiang, Shuning Wu, Xu Zhou

Affiliations: School of Engineering, Institute of Science Tokyo, Japan; College of Control Science and Engineering, Zhejiang University, China; Department of Electrical and Computer Engineering, National University of Singapore, Singapore

Topic match (Safety Evaluation): an unlearning method for RLVR-induced reasoning, relevant to model safety

AI summary: Proposes MAST, which uses mechanism-guided selection to update only a subset of parameters, substantially reducing collateral damage to retained performance when unlearning RLVR-induced reasoning behavior.

Comments 15 pages, 4 figures, 7 tables

Abstract

We propose MAST (Mechanism-Aligned Selective Targeting), a mechanism-guided method for unlearning RLVR-induced reasoning with substantially lower collateral damage than standard full-parameter updates. In matched SFT/RLVR checkpoints on Qwen2.5-Math-1.5B and Qwen3-1.7B-Base, the SFT-to-RLVR increment differs sharply from the SFT update in token-level delta-log-probability, and full-parameter gradient ascent forgets only by damaging retain MATH and GSM8K. MAST ranks attention-projection tensors by off-principal energy, update magnitude, and forget-gradient coupling magnitude, then updates only the top-ranked subset. On the primary model, MAST induces statistically significant target forgetting (MATH forget 45/150 to 37/150; McNemar p=0.0078) while preserving GSM8K (+0.8 pp) and MATH retain (-0.5 pp). The advantage reproduces across seeds, NPO/SimNPO objectives, and Qwen3, where MAST preserves GSM8K while full-parameter unlearning collapses it.
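The reported forgetting result (45/150 to 37/150, McNemar p=0.0078) can be sanity-checked with a small calculation. The sketch below implements a two-sided exact (binomial) McNemar test. It assumes, purely for illustration, that eight problems flipped from solved to unsolved and none flipped back; the abstract does not report the discordant counts or say which variant of the test was used, and the function name is hypothetical.

```python
from math import comb


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant-pair counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a change
    k = min(b, c)
    # Under the null hypothesis each discordant pair goes either way with p = 1/2.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Assumed split (not stated in the abstract): 8 solved -> unsolved, 0 unsolved -> solved.
p_value = mcnemar_exact_p(b=8, c=0)
print(f"{p_value:.4f}")  # 0.0078
```

Under that assumed split the exact test gives 2 × 0.5^8 ≈ 0.0078, which agrees with the reported value; other splits with a net change of eight problems would give larger p-values.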

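2606.19168 2026-06-18 cs.AI cs.LG New submission 90%

Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection

Jinhan Li, Kexian Tang, Yihan Xu, Zhuorui Ye, Kaifeng Lyu

Affiliations: Institute for Interdisciplinary Information Sciences, Tsinghua University

Topic match (Safety Evaluation): a pretraining-stage safety alignment method

AI summary: Proposes Safety Reflection Pretraining, which inserts short safety reflections into the pretraining corpus to give the model a self-monitoring capability; experiments show that it lowers the success rates of inference-stage and finetuning attacks.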

Abstract

To achieve deeper safety alignment for large language models (LLMs), recent efforts have studied how to push safety interventions earlier into the pretraining stage, primarily by filtering unsafe data or rewriting it into safer forms. We argue that pretraining-stage alignment should go beyond making the data safe: LLMs may compose seemingly benign knowledge and capabilities into unsafe behaviors. To this end, we propose Safety Reflection Pretraining, a pretraining-stage alignment method which regularly inserts short safety reflections into pretraining corpora to integrate self-monitoring directly into language modeling, establishing a foundational capability that is subsequently reinforced by compatible post-training. Our experiments with 1.7B models pretrained on FineWeb-Edu show that Safety Reflection Pretraining improves safety classification accuracy and substantially reduces the success rates of inference-stage and finetuning attacks. Complementary to our real-world experiments, we also introduce a fully controlled synthetic environment, MedSafetyWorld, with a clear definition of safety and a reasoning structure under which models can easily generalize unsafe behaviors from safe data. Ablations in MedSafetyWorld further demonstrate a clear advantage of Safety Reflection Pretraining in preventing models from acting on unsafe behaviors generalized from safe data, compared with data filtering and rewriting. Taken together, our findings suggest that pretraining alignment should not only make the training data safe, but also shape the behaviors that models are likely to acquire from safe data.

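2606.19023 2026-06-18 cs.CR cs.LG New submission 90%

Lifecycle-Aware Dynamic Analysis for Secure ML Model Execution

Gabriele Digregorio, Marco Di Gennaro, Francesco Pastore, Stefano Zanero, Stefano Longari, Michele Carminati

Affiliations: Politecnico di Milano

Topic match (Safety Evaluation): a dynamic, lifecycle-aware analysis for detecting malicious behavior in ML models

AI summary: Proposes Moat, a dynamic lifecycle-aware approach that detects malicious behavior by monitoring the structured interactions between a model and its host system in each execution phase; across multiple frameworks it detects all evaluated attack classes with a close-to-zero false-positive rate.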

Abstract

The growing reliance on pre-trained Machine Learning (ML) models has introduced new attack surfaces. Recent vulnerabilities demonstrate that malicious behavior can be embedded within model artifacts, often bypassing existing defenses. Current model-scanning solutions primarily rely on static, format-specific rules or known attack signatures, which limit their ability to generalize across frameworks and to detect novel exploitation paths. In contrast, we propose a solution that focuses on the effects an attack has on the host system executing the model and builds on foundational intuitions about ML model execution. In particular, we observe that ML models operate within well-defined lifecycle phases and that, within each phase, interactions with the host system are highly structured and predictable. We translate these intuitions into Moat, a dynamic lifecycle-aware approach for securing ML model execution, and instantiate this design in Re-Moat, our reference implementation. We evaluate Re-Moat across multiple ML frameworks using 77,974 real-world model artifacts from the Hugging Face Hub, 31 Proofs-of-Concept (PoCs) from CVEs, and 334 models from a state-of-the-art dataset, and compare it against state-of-the-art model-scanning solutions. Our results show that our approach detects all evaluated attack classes while maintaining a close-to-zero false-positive rate, validating our intuitions and motivating dynamic analysis for securing ML model execution.

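2606.18656 2026-06-18 cs.CL New submission 90%

The Wrong Kind of Right: Quantifying and Localizing Misfired Alignment in LLMs

Naihao Deng, Yiming Feng, Chimaobi Okite, Kaijian Zou, Lu Wang, Rada Mihalcea, Yulong Chen

Affiliations: University of Michigan; University of Cambridge; University of Aberdeen

Topic match (Safety Evaluation): proposes the VETO misfired-alignment benchmark and the MAR metric

AI summary: Proposes the VETO benchmark and the Misfired Alignment Rate (MAR) metric; all evaluated LLMs show non-trivial misfired alignment on stereotype-related questions while human participants score 0%, and controlled priming experiments show that alignment-induced cues amplify the effect.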

Abstract

Warning: This paper studies stereotypes and biases, and contains potentially disturbing examples, used for illustration purposes only. Our findings should not be interpreted as an argument against alignment. Instead, this paper highlights the need for principled approaches to more advanced alignment. Alignment aims to ensure that large language models (LLMs) behave safely and reliably, including by avoiding unsafe inferences. However, we show that such safety-oriented behaviors can misfire: models may reject warranted conclusions even when they are explicitly supported by context. We call this failure mode misfired alignment, where alignment-induced changes cause LLMs to override explicit evidence. To quantify this phenomenon, specifically on stereotype-related alignment, we introduce VETO, a benchmark consisting of 2,032 BBQ-derived contrastive pairs, and define a new metric, Misfired Alignment Rate (MAR), which measures on a 0 to 100 scale how often a model fails on a stereotype-related question but succeeds on its contrastive counterpart. We benchmark 25 LLMs on VETO, and show that all LLMs, including the most recent ones, exhibit non-trivial (4.7 to 18.9%) MARs while all human participants achieve 0.0% MAR. Controlled priming experiments further show that alignment-induced cues can substantially amplify MAR across LLMs, indicating that these failures are not merely artifacts of individual examples but can be induced by safety-related framing. Mechanistic analyses on open-weight LLMs reveal late-layer suppression of evidence-supported answers, and comparisons between instruct and base LLMs suggest that this suppression emerges after instruction training. These findings show that current alignment methods can overgeneralize surface-level safety cues, to the point of overriding objective evidence, motivating more work on alignment objectives that better preserve contextual grounding.
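The MAR definition above (a model fails the stereotype-related question but succeeds on its contrastive counterpart, reported on a 0 to 100 scale) can be sketched as a simple count over pairs. This is a minimal illustration, assuming that the denominator is the total number of contrastive pairs, which the abstract does not spell out; the function name and the toy results are hypothetical.

```python
from typing import Iterable, Tuple


def misfired_alignment_rate(pairs: Iterable[Tuple[bool, bool]]) -> float:
    """Compute a MAR-style score on a 0 to 100 scale.

    Each pair is (correct_on_stereotype_item, correct_on_contrastive_item).
    """
    results = list(pairs)
    if not results:
        raise ValueError("need at least one contrastive pair")
    # A misfire: wrong on the stereotype-related item, right on its counterpart.
    misfires = sum(
        1 for stereo_ok, contrast_ok in results if not stereo_ok and contrast_ok
    )
    # Assumption: normalise by all pairs, not only pairs with a correct counterpart.
    return 100.0 * misfires / len(results)


# Toy data, not taken from the paper: only the second pair is a misfire.
toy_results = [(True, True), (False, True), (False, False), (True, False)]
print(misfired_alignment_rate(toy_results))  # 25.0
```

Pairs where the model fails both items are not counted as misfires, which separates alignment-related rejections from questions the model simply cannot answer.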

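2606.18430 2026-06-18 cs.LG cs.CR New submission 90%

Signature filtering: a lightweight enhancement for statistical watermark detection in large language models

Chih-Duo Hong, Yen-Pang Chen, Fang Yu

Affiliations: National Chengchi University

Topic match (Safety Evaluation): proposes signature filtering to strengthen LLM watermark detection

AI summary: Proposes a signature filtering module that removes signature tokens that interfere with watermark detection, raising detection rates in weak-signal and low-entropy settings from 8-31% to 78-99% while keeping the false-positive rate controllable.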

Abstract

Statistical watermarks help organizations attribute large language model (LLM) outputs, yet existing detectors often struggle when watermark signals are weak, texts are repetitive, or watermarks are edited. We propose signature filtering, a detection-time module that enhances watermark detection without modifying watermark embedding and text generation. It learns a small set of "signature" tokens whose presence makes watermark tests unreliable, and removes these tokens before detection. The signatures are obtained by solving a mixed-integer linear program on a small training set, with constraints that maximize the true positive rate. We additionally derive finite-sample and asymptotic bounds under several attacker models (color-blind, color-adaptive, and distributionally correlated). On four well-known watermark families (Kgw, Sweet, Unigram, Exp), four benchmark corpora (C4, MBPP, HumanEval, Code-Search-Net), and six LLMs (Opt-1.3b, Opt-6.7b, Llama2-13b, Llama3.1-8b, Qwen2.5-14b, Phi-3-medium-14b), 2- and 3-gram signatures raise detection rates in weak-signal and low-entropy settings from 8-31% without filtering to 78-99% with filtering, while keeping false positives controllable and often negligible. In stress tests where we scramble sentences and perturb 25-50% of tokens by dilution, deletions, and substitutions, 2-gram filters for Kgw-style watermarks preserve most of the clean-text detection gains, often matching or outperforming the advanced WinMax watermark detector. Signature filtering thus provides a simple, scalable, and model-agnostic add-on to strengthen watermark-based provenance checks for LLM text in information processing workflows.
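To make the idea of filtering before detection concrete, the sketch below applies the standard Kgw-style z-test (green-token count against its expected value under no watermark) before and after dropping tokens covered by a set of 2-gram signatures. It is an illustration only: the signature set is hand-picked rather than learned by the mixed-integer program described above, the removal rule is a simplifying assumption, and the tokens, flags, and function names are hypothetical.

```python
from math import sqrt
from typing import List, Set, Tuple


def kgw_z_score(green_flags: List[bool], gamma: float = 0.5) -> float:
    """Kgw-style z-statistic: (green count - gamma*T) / sqrt(T*gamma*(1-gamma))."""
    t = len(green_flags)
    if t == 0:
        raise ValueError("need at least one scored token")
    return (sum(green_flags) - gamma * t) / sqrt(t * gamma * (1 - gamma))


def drop_signature_bigrams(
    tokens: List[str], green_flags: List[bool], signatures: Set[Tuple[str, str]]
) -> List[bool]:
    """Return green flags with every token inside a signature 2-gram removed."""
    dropped = set()
    for i in range(len(tokens) - 1):
        if (tokens[i], tokens[i + 1]) in signatures:
            dropped.update((i, i + 1))
    return [flag for i, flag in enumerate(green_flags) if i not in dropped]


# Toy low-entropy text: boilerplate imports carry no watermark signal.
tokens = ["import", "os", "import", "sys", "total", "=", "a", "+", "b", "print"]
flags = [False, False, False, False, True, True, True, True, True, False]
signatures = {("import", "os"), ("import", "sys")}  # hand-picked, not learned

z_before = kgw_z_score(flags)
z_after = kgw_z_score(drop_signature_bigrams(tokens, flags, signatures))
print(round(z_before, 3), round(z_after, 3))
```

In this toy case the forced, low-entropy tokens dilute the green-token fraction, so removing them raises the z-score; whether a real filter helps depends on the learned signatures and on the false-positive constraints the paper describes.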

2606.18356 2026-06-18 cs.CR cs.AI New submission 90%

SafeClawBench: Separating Semantic, Audit-Evidence, and Sandbox Harm in Tool-Using LLM Agents

Yuchuan Tian, Mengyu Zheng, Haocheng Mei, Ye Yuan, Chao Xu, Xinghao Chen, Hanting Chen, Yu Wang

Affiliations: Peking University; Beijing Jiaotong University; SUIBE; Huawei; Tsinghua University

Topic match (Safety Evaluation): proposes a security benchmark for tool-using LLM agents that separates semantic, audit, and sandbox harm

AI summary: Proposes SafeClawBench, a benchmark that evaluates the security of tool-using LLM agents through three separate endpoints (semantic attack acceptance, audit-visible harm evidence, and sandbox-observed harm), exposing distinct failure modes and supporting reproducible comparison.

Comments 32 pages, 5 figures

Abstract

Tool-using language-model agents introduce security failures that go beyond unsafe text: they can disclose protected objects, write persistent memory, send messages, modify databases, or trigger harmful code and tool effects. Existing evaluations often collapse these stages into a single attack success rate, making it difficult to tell whether a model merely agreed with an attacker or actually produced observable harm. We introduce SafeClawBench, a staged benchmark for tool-using agent security with 600 controlled adversarial tasks across six attack families: direct and indirect prompt injection, tool-return injection, memory poisoning, memory extraction, and ambiguity-driven unsafe inference. SafeClawBench reports three separate endpoints: semantic attack acceptance, audit-visible harm evidence, and sandbox-observed tool/state harm. Evaluating five agent endpoints under four prompt-level policies, we find that these endpoints capture different failure modes. Without additional prompt protection, semantic failure rates vary widely across models, from 9.0% to 44.2%. Audited harm evidence is narrower than semantic failure, and under a separate executable protocol some matched task identities produce sandbox harm despite passing the Semantic Core call: in a 12,000-row matched analysis, 291 of 347 observed sandbox harms occur in rows that pass the semantic check. Prompt policies change endpoint outcomes, but their effects depend on both model and protocol. SafeClawBench provides a reproducible framework for comparing agent models and prompt-policy conditions without conflating textual compliance, evidence-supported harm, and executable state changes. The open-source dataset is available at https://huggingface.co/datasets/sairights/safeclawbench.
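The matched-analysis figures quoted above are easier to interpret as proportions. The short calculation below uses only the three counts from the abstract; the constant and function names are illustrative.

```python
# Counts quoted in the abstract's matched analysis.
MATCHED_ROWS = 12_000
SANDBOX_HARMS = 347
HARMS_PASSING_SEMANTIC = 291


def share(part: int, whole: int) -> float:
    """Return part / whole, rejecting a non-positive denominator."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


# Fraction of sandbox-observed harms that a semantic-only check would miss.
missed_by_semantic = share(HARMS_PASSING_SEMANTIC, SANDBOX_HARMS)
# Fraction of all matched rows that ended in sandbox-observed harm.
sandbox_harm_rate = share(SANDBOX_HARMS, MATCHED_ROWS)

print(f"missed by semantic check: {missed_by_semantic:.1%}")
print(f"sandbox harm rate: {sandbox_harm_rate:.1%}")
```

Roughly 84% of the observed sandbox harms occur in rows that pass the semantic check, even though sandbox harm appears in only about 3% of matched rows, which is why a single attack success rate would hide most of the executed harm.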

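2606.19106 2026-06-18 cs.CR cs.CY New submission 85%

Quantifying Compromise Risk in Exceptional Access Architectures Under Sparse and Indirect Evidence

Alan Woodward

Topic match (Safety Evaluation): quantifies systemic compromise risk in exceptional access architectures

AI summary: To address the lack of public compromise data for exceptional access (EA) systems, builds a structured uncertainty framework that combines historical analogues, Monte Carlo scenarios, a channel-independence decomposition, and a Bayesian structural risk model to quantify systemic compromise risk in transmission-layer and platform-layer EA architectures; both classes carry higher modelled risk than the no-EA baseline, with differently shaped risk distributions.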

Abstract

Lawful exceptional access (EA) systems hold the cryptographic keys that decrypt protected communications for authorised parties. The debate over their risks has been long and qualitative, complicated by two problems: no public dataset of EA-specific compromise events exists, so assessment must use sparse, indirect evidence; and prior work has treated structurally different designs as equivalent, though transmission-layer EA in carrier infrastructure (T-EA) and over-the-top EA at the platform layer (OTT-EA) differ in how cryptographic keys relate to ciphertext data. This paper builds a structured uncertainty framework for evaluating systemic compromise risk in EA architectures. It does not produce predictive forecasts, which the evidence cannot support; it separates findings robust to assumptions from those that depend on calibration. Four analytical layers are applied to T-EA and OTT-EA: three empirical pillars (historical analogues, a Monte Carlo scenario layer, a channel-independence decomposition) plus a Bayesian Structural Risk Model on a parallel-subgraph attack graph. The central findings are structural. First, EA-equipped architectures of either class carry strictly higher modelled risk than their no-EA counterfactual, an ordering independent of calibration. Second, the classes differ in distribution shape: T-EA risk is dominated by central tendency, OTT-EA by the tail under correlated campaigns. Third, calibration-conditional annual probability ranges span 1.4% to 12.9% for T-EA across the structured-judgement targeting-premium interval. Over multi-decade horizons, cumulative compromise is well above zero; key-material exfiltration is irreversible, weighing heavily on OTT-EA's larger user populations. The framework quantifies compromise probability, not expected harm; consequence modelling and benefit estimation are outside its scope.
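The claim that cumulative compromise is well above zero over multi-decade horizons follows from compounding an annual probability. The sketch below compounds the two ends of the reported T-EA range (1.4% and 12.9% per year) over a 30-year horizon. It assumes a constant annual probability and independence between years, which is a simplification of ours rather than the paper's model (the abstract explicitly discusses correlated campaigns); the horizon and function name are hypothetical.

```python
def cumulative_compromise(annual_p: float, years: int) -> float:
    """Probability of at least one compromise in `years` independent years."""
    if not 0.0 <= annual_p <= 1.0:
        raise ValueError("annual_p must be a probability")
    if years < 0:
        raise ValueError("years must be non-negative")
    # Complement of surviving every year without a compromise.
    return 1.0 - (1.0 - annual_p) ** years


HORIZON_YEARS = 30  # illustrative horizon, not taken from the paper

low = cumulative_compromise(0.014, HORIZON_YEARS)
high = cumulative_compromise(0.129, HORIZON_YEARS)
print(f"{low:.1%} to {high:.1%}")
```

Under these assumptions the 30-year figure runs from about one in three at the low end to near certainty at the high end, which illustrates why the irreversibility of key-material exfiltration weighs so heavily in the analysis.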

2606.18936 2026-06-18 cs.AI cs.CY New submission 85%

SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety

Linghao Feng, Yinqian Sun, Dongqi Liang, Sicheng Shen, Chenfei Yan, Yuxuan Peng, Yilin Zhao, Haibo Tong, Kai Li, FeiFei Zhao, Yi Zeng

Affiliations: Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Future Technology, University of Chinese Academy of Sciences, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, China; Zhongguancun Academy, China; Beijing Key Laboratory of Safe AI and Superalignment; Gaoling School of AI, Renmin University of China; Beijing Institute of AI Safety and Governance (Beijing-AISI); School of Humanities, University of Chinese Academy of Sciences, China

Topic match (Safety Evaluation): proposes a safety benchmark for scientific domains that evaluates risk dimensions

AI summary: Proposes SciRisk-Bench, a benchmark that evaluates AI4Science safety from two perspectives, explicit risk dimensions and scientific disciplines, covering 7 disciplines, 31 subdisciplines, and 10 risk dimensions; experiments reveal safety weaknesses in both mainstream and science-oriented LLMs.

Abstract

Large language models (LLMs) are increasingly embedded in AI for Science (AI4Science) workflows, from scientific question answering and literature analysis to laboratory planning and autonomous discovery. This progress creates an urgent need for safety benchmarks that evaluate not only scientific competence, but also whether models recognize and avoid risks in high-stakes scientific contexts. Existing AI4Science safety datasets cover several disciplines and task formats, leaving the underlying risk dimensions underspecified. We introduce SciRisk-Bench, a benchmark designed to evaluate AI4Science safety from two complementary perspectives: explicit risk dimensions and scientific disciplines. SciRisk-Bench covers 7 disciplines, 31 subdisciplines and 10 risk dimensions. In the experimental section, we evaluate both mainstream LLMs and science-oriented LLMs across risk dimensions, disciplines, and sub-disciplines, enabling fine-grained diagnosis of where scientific models remain unsafe.

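2606.18782 2026-06-18 cs.CL cs.AI New submission 85%

RedactionBench

Sean Brynjólfsson, Shashvat Jayakrishnan, Esha Sali, Diptanshu Purwar, Madhav Aggarwal

Affiliations: A10 Networks, Inc.

Topic match (Safety Evaluation): proposes a privacy-protection benchmark, grounded in contextual integrity, for evaluating LLM redaction

AI summary: RedactionBench evaluates contextual privacy in redaction using 200 documents across 11 domains, proposes the R-Score metric, shows how subjective contextual redaction decisions are, and aims to advance privacy-preserving systems.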

Abstract

Large Language Models are increasingly applied to sensitive domains that require redaction of personally identifiable information (PII). While redacting PII is a data cleaning prerequisite, existing benchmarks conflate extraction mechanics with privacy semantics. A public phone number is not equivalent to a phone number in a medical record. Whether information constitutes a violation depends heavily on who holds it, why, and in what context, fundamentally differentiating redaction from simple entity recognition. Grounded in contextual integrity, we introduce RedactionBench, a manually annotated benchmark comprising 200 diverse documents across 11 domains, mostly seeded from real-world sources. We also introduce R-Score, a novel character-level metric that treats semantically similar redactions equally and nullifies shallow formatting choices, such as varying masking styles for phone numbers. Evaluations across Named Entity Recognition models, entity extraction Small Language Models, and frontier models equipped with agentic tools demonstrate that contextual redaction remains an unsolved problem. A human evaluation with over 80 users on RedactionBench reveals a stark dichotomy in privacy perceptions. Annotators show consensus with target labels for mandatory redactions (89.4 percent) and safe text preservations (94.1 percent), but fail to agree on contextual redactions (47.7 percent). This variance demonstrates the subjective nature of contextual privacy and motivates R-Score, which decouples contextual ambiguity from strict precision. We compare 35 models across families and report their performance in redacting PII. Finally, we release RedactionBench to establish a baseline for future privacy-preserving systems, hoping to inspire efficient model design and standardized evaluations.

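2606.18473 2026-06-18 cs.CL New submission 85%

PreUnlearn: Auditing Collateral Knowledge Damage Before Large Language Model Unlearning

Bo Su, Ankit Shah, Thai Le

Affiliations: Indiana University Bloomington

Topic match (Safety Evaluation): audits collateral knowledge damage from LLM unlearning

AI summary: Proposes PreUnlearn, which predicts from data features the collateral damage an unlearning run would cause to same-domain and distant-domain knowledge, enabling risk auditing before unlearning is executed.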

Comments 12 pages, 6 figures

Abstract

Machine unlearning for large language models (LLMs) aims to remove specified knowledge while preserving the rest of the model's capabilities. However, the boundary between knowledge to forget and knowledge to retain is often unclear, since related and even distant information may be entangled in the model. In this paper, we study LLM unlearning from a data-centric perspective and measure how unlearning effects propagate from the forget set to same-domain and distant-domain knowledge. We find a consistent decay pattern: collateral damage is strongest near the forget set, weakens with semantic distance, but does not disappear at domain boundaries. We further ask whether such damage can be audited before unlearning is executed. We formulate forget-set auditing as a pre-unlearning prediction task and analyze which data features are most predictive of downstream damage. Our results show that interaction features between the forget set and evaluation set provide the strongest signals, suggesting that collateral damage is partly reflected in data geometry before model updates occur. These findings position forget-set auditing as an early warning tool for identifying risky unlearning runs and designing more reliable unlearning procedures.

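2606.12618 2026-06-18 cs.AI New submission 85%

"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

Alan Cooney, David Africa, Geoffrey Irving

Affiliations: AI Security Institute

Topic match (Safety Evaluation): evaluates lie detectors for language models

AI summary: Builds 13 belief-verified reasoning model organisms and a varied prompted-lying testbed to evaluate four lie detectors across model scales; activation- and logprob-based detectors degrade sharply on the trained model organisms, while the chain-of-thought judge remains strong, partly as an artefact of the verification process.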

Comments 12 pages, 6 figures

Abstract

Robust lie detectors for language models could enable powerful techniques for auditing, monitoring, and post-hoc investigation of model behaviour, but evaluating them requires testbeds where models verifiably believe the opposite of what they say. We show that existing trained model organisms often fail this requirement, leaving prior positive and negative detection results difficult to interpret. We address this with 13 reasoning model organisms whose hidden beliefs are verified in chain-of-thought and shown to generalise to held-out tasks, alongside Varied Deception, a prompted-lying testbed covering a broad range of lie-inducing motivations. On these testbeds we evaluate four detectors: a chain-of-thought judge, a logprob classifier, and two activation probes, including Did-You-Lie (DYL), a new method for training follow-up probes. On prompted lying, across 31 open-weight models spanning 2B to 1T parameters, all four detectors show positive scaling with model capability. However, every activation- and logprob-based detector drops sharply on our trained model organisms, with DYL retaining the most signal; only the chain-of-thought judge remains strong, with 0.82 balanced accuracy, partly as an artefact of our verification process favouring CoT-readable beliefs. Current lie detectors therefore cannot support high-confidence claims about model beliefs, and we suggest research directions that may address some of their current limitations. We release our datasets, model organisms, and trained detectors.

2606.19057 2026-06-18 stat.ML cs.LG stat.CO stat.ME New submission 80%

Quantifying and Auditing LLM Evaluation via Positive-Unlabeled Learning

Zilong Zhang, Yi-Ting Hung, Lei Ding, Chi-Kuang Yeh

Affiliations: Department of Mathematics and Statistics, Georgia State University; Department of Statistics, University of Manitoba

Topic match (Safety Evaluation): audits bias in LLM-based evaluation

AI summary: To address the systematic biases of LLM judges (such as verbosity bias), proposes a geometric auditing framework based on Partial Optimal Transport that uses a small set of human-verified positives to correct the bias, improving agreement with human preferences without retraining.

Abstract

Large Language Models (LLMs) are increasingly used as judges for scalable evaluation, yet such LLM-as-a-Judge systems exhibit systematic biases that are decoupled from semantic quality, most notably verbosity bias. Meanwhile, human supervision is costly and typically selective, yielding reliable positive judgments but leaving most outputs unlabelled and potentially mixed in quality. We formulate LLM evaluation under selective human supervision as a positive-unlabelled learning problem and propose a geometric auditing framework based on Partial Optimal Transport. By aligning a small set of human-verified positives with a reliable subset of unlabelled outputs in a fixed embedding space, our method identifies human-consistent preferences and corrects biased judges without retraining. Experiments demonstrate improved alignment with human preferences, increased robustness to presentation biases, and interpretable confidence estimates, offering a scalable and statistically grounded alternative to existing LLM-as-a-judge pipelines.


2606.19262 2026-06-18 cs.LG 新提交 80%

Detecting Hidden ML Training With Zero-Overhead Telemetry

使用零开销遥测检测隐藏的机器学习训练

Robi Rahman, Sabiha Tajdari

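Affiliations: Machine Intelligence Research Institute; University of Virginia

Topic match (safety evaluation): detecting hidden ML training for AI governance and safety

AI summary: This paper evaluates the adversarial robustness of GPU workload classification that uses only zero-overhead, privacy-preserving NVML telemetry (content-agnostic signals). It develops a classifier that reaches 98.2% binary accuracy at identifying training workloads and 43-87% accuracy on the most challenging unexpected workloads.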

Comments Technical AI Governance Research workshop at ICML 2026

2606.19242 2026-06-18 cs.SE 新提交 80%

Runtime Compliance Verification for AI Agents

AI代理的运行时合规性验证

Nafiseh Kahani, Masoud Barati, Diana Addae

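Topic match (safety evaluation): runtime monitoring to ensure GDPR compliance

AI summary: Proposes the C-Trace framework, which uses runtime monitoring and formal policy predicates to ensure that AI agents follow GDPR rules in tool calls and dialogue, reducing the attack success rate to below 12%.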

2606.18767 2026-06-18 cs.CL 新提交 80%

Output Vector Editing for Memorization Mitigation in Large Language Models

输出向量编辑：缓解大型语言模型中的记忆化问题

Ahmad Dawar Hakimi, Kaiwei Lei, Isabelle Augenstein, Hinrich Schütze

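Affiliations: Center for Information and Language Processing, LMU Munich; Department of Computer Science, University of Copenhagen; Munich Center for Machine Learning; Pioneer Centre for AI

Topic match (safety evaluation): mitigating LLM memorization with an output vector editing method

AI summary: Proposes output vector editing, which uses constrained optimization to modify the output vectors of MLP neurons and introduce a distractor, suppressing memorized sequences without changing activations. It achieves an 87.9% suppression rate on OLMo-7B and reveals a mechanistic boundary of MLP editing.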

AI中文摘要

大型语言模型会记忆并复现训练数据中的序列，从而带来隐私、版权和安全风险。现有的神经元级缓解方法将编辑等同于将神经元激活归零，但激活仅控制神经元是否参与；输出向量才是写入残差流的内容，并通过叠加编码多个特征。我们提出输出向量编辑，这是一种约束优化的权重编辑方法，定位负责记忆化延续的一小组MLP神经元，并最小程度地修改其输出向量，以在词汇空间中引入干扰项，从而重定向它们在残差流中的贡献，同时保持激活不变。在四个模型（SmolLM-360M、OLMo-1B、OLMo-7B、Llama2-7B，参数规模从360M到7B）上进行评估，我们重点研究OLMo-7B（其开放权重和预训练语料库支持系统化挖掘），挖掘了6831个记忆化序列，实现了高达87.9%的抑制率。在相同定位的神经元上，与零消融相比，2.7倍的差距表明抑制来自输出向量编辑，而非仅定位。四种编辑模式涵盖了从激进抑制到最小重定向的谱系；集成使用时覆盖了96.5%的记忆化序列，而我们推荐的单一模式配置达到了81.5%，且没有灾难性的局部性失败。我们进一步识别了一个机制边界：约14%的序列无法通过仅MLP编辑达到；虽然这些失败总体上并非由注意力驱动，但消融贡献最大的注意力头可恢复其中60-64%，对于从前缀复制token的延续，恢复更强，这表明注意力是互补的后备机制而非主要机制。编辑模式排序和成功-局部性权衡在所有四个模型上迁移，成功率随模型规模而非家族增长。

英文摘要

Large language models memorize and reproduce sequences from their training data, creating privacy, copyright, and security risks. Existing neuron-level mitigation methods equate editing with zeroing out neuron activations, but the activation only controls whether a neuron engages; the output vector is what writes to the residual stream and, through superposition, encodes multiple features. We propose output vector editing, a constrained-optimization weight edit that locates a small set of MLP neurons responsible for a memorized continuation and minimally modifies their output vectors to introduce a distractor in vocabulary space, redirecting their residual-stream contributions while leaving activations unchanged. Evaluating on four models from 360M to 7B parameters (SmolLM-360M, OLMo-1B, OLMo-7B, Llama2-7B), we center on OLMo-7B (whose open weights and pretraining corpus enable systematic mining) and mine 6831 memorized sequences, achieving up to 87.9% suppression. The 2.7$\times$ gap over zero ablation on the same located neurons shows the suppression comes from the output-vector edit, not localization alone. Four edit modes span a spectrum from aggressive suppression to minimal redirection; in ensemble they cover 96.5% of memorized sequences, while our recommended single-mode configuration reaches 81.5% with no catastrophic locality failures. We further identify a mechanistic boundary at ${\sim}14\%$ of sequences unreachable by MLP-only editing; while these failures are not attention-driven overall, ablating the top contributing attention heads recovers 60-64% of them, with stronger recovery on continuations that copy tokens from the prefix, positioning attention as a complementary fallback rather than a primary mechanism. Edit mode ordering and the success-locality trade-off transfer across all four models, with success rates scaling with model size rather than family.


2606.18532 2026-06-18 cs.CR cs.AI cs.RO cs.SE 新提交 80%

AI Sandboxes: A Threat Model, Taxonomy, and Measurement Framework

AI沙箱：威胁模型、分类法与测量框架

Inderjeet Singh, Haitham Mahmoud, Andrés Murillo

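Affiliations: Fujitsu Research of Europe

Topic match (safety evaluation): threat model and measurement framework for AI sandboxes

AI summary: Proposes a threat model, taxonomy, and measurement framework for AI sandboxes, formalizes the sandbox boundary and a weakest-link rule, defines a cyber-physical threat model, and validates the framework on three case studies.

Comments 50 pages, 8 figures, 10 tables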

AI中文摘要

AI系统越来越多地在结合隔离、仿真、仪器化、监督和证据捕获的有界环境中进行评估。对于物理AI、AIoT和网络物理系统，这种转变不仅仅是术语问题：被测系统可能通过物理过程、网络设备和人类操作员进行感知、决策、执行、通信和故障。本文开发了一种面向保证的AI沙箱描述，将其作为数字AI、具身自主和网络物理部署中测试、评估、验证和确认的受控环境。我们形式化了沙箱边界和用于将每个维度的证据组合成有界部署声明的“最弱链”规则；分离了主要的沙箱原型；定义了一个包括对保证装置本身攻击的网络物理威胁模型；并引入了一个跨越保真度、可控性、可观测性、包含性、可重复性和治理工件的测量框架，在三个实际沙箱的工作案例研究中实例化。由此产生的威胁模型、分类法和测量框架阐明了沙箱可以有效测试什么、它可以包含哪些风险，以及它可以为安全、安保和监管保证支持哪些形式的证据。

英文摘要

AI systems are increasingly evaluated in bounded environments that combine isolation, simulation, instrumentation, supervision, and evidence capture. For physical AI, AIoT, and cyber-physical systems, this shift is not a matter of terminology: the system under test may sense, decide, actuate, communicate, and fail through physical processes, networked devices, and human operators. This article develops an assurance-oriented account of AI sandboxes as controlled environments for testing, evaluation, verification, and validation across digital AI, embodied autonomy, and cyber-physical deployments. We formalize the sandbox boundary and a weakest-link rule for composing per-dimension evidence into a bounded deployment claim; separate major sandbox archetypes; define a cyber-physical threat model that includes attacks on the assurance apparatus itself; and introduce a measurement framework spanning fidelity, controllability, observability, containment, reproducibility, and governance artifacts, instantiated on three worked case studies of real sandboxes. The resulting threat model, taxonomy, and measurement framework clarify what a sandbox can validly test, which risks it can contain, and what forms of evidence it can support for safety, security, and regulatory assurance.
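The weakest-link rule mentioned in the abstract can be illustrated with a small sketch. It assumes each of the six measurement dimensions named above receives an ordinal evidence level (a hypothetical 0-3 scale) and that the deployment claim is bounded by the lowest of them. The function name, the scale, and the use of `min` are illustrative assumptions; the paper's formal rule is not given in the abstract.

```python
# The six dimensions are taken from the abstract; the 0-3 scale is hypothetical.
DIMENSIONS = ("fidelity", "controllability", "observability",
              "containment", "reproducibility", "governance")

def bounded_claim(evidence: dict) -> tuple:
    """Return (claim level, limiting dimension) under a weakest-link reading."""
    missing = [d for d in DIMENSIONS if d not in evidence]
    if missing:
        raise ValueError(f"no evidence for: {missing}")
    # The overall claim can be no stronger than the weakest dimension.
    weakest = min(DIMENSIONS, key=lambda d: evidence[d])
    return evidence[weakest], weakest

example = {"fidelity": 3, "controllability": 3, "observability": 2,
           "containment": 1, "reproducibility": 3, "governance": 2}
level, limiting = bounded_claim(example)
```

Under this reading, strong fidelity or reproducibility evidence cannot compensate for weak containment evidence: the claim stays at the level of the limiting dimension.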


2606.18322 2026-06-18 cs.LG cs.AI 新提交 75%

SAE Interventions are Unreliable: Post-Intervention Recovery of Suppressed Behavior

SAE干预不可靠：干预后抑制行为的恢复

Mingyue Cui, Linghui Shen, Xingyi Yang

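Affiliations: The Hong Kong Polytechnic University

Topic match (safety evaluation): shows that SAE feature interventions are unreliable and have a recoverable failure mode

AI summary: The study finds that although sparse autoencoder (SAE) feature interventions can suppress a behavior, they have a recoverable failure mode: optimizing a residual perturbation can restore the original behavior, revealing a gap between feature-level control and behavioral completeness.

Comments Code: https://github.com/Mingyuee88/sae-post-intervention-recovery, Project page: https://mingyuee88.github.io/sae-post-intervention-recovery/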

AI中文摘要

稀疏自编码器（SAE）将残差流激活分解为可解释特征。最近的潜在空间防御越来越依赖这些分解，假设识别出的“不安全”SAE特征可作为监控和干预的可操作手柄。在这种范式下，固定特定有害特征预期能可靠地防止模型不当行为。然而，我们表明这种成功可能隐藏一种可恢复的失败模式：固定可能阻止行为的一条可见路径，但并未消除行为本身。我们将这种脆弱性形式化为干预后恢复，这是一个受约束的残差空间优化问题。从干预后的残差状态开始，我们优化残差扰动以恢复干预前的行为，同时保持目标SAE特征的干预后值。即使在干预在优化和生成过程中保持活跃的强威胁模型下，恢复仍然可能。为了排除恢复仅仅是撤销干预的可能性，我们使用编码器正交更新进行单层干预，并在跨层设置中使用相应的特征图雅可比矩阵。在TPP、遗忘、IOI和拒绝引导实验中，这种压力测试揭示了尽管特征级干预成功，行为仍可恢复。特别是在安全关键的拒绝引导设置中，我们在有效样本上实现了95.8%的恢复率，同时将防御特征的相对漂移保持在0.131，远低于基于后缀的基线。恢复路径归因分析进一步将这种恢复定位到SAE重建残差，即SAE未解释的组件。这些结果暴露了特征级控制与行为完整性之间的差距：SAE特征可以支持因果干预，但控制它们并不能保证对底层行为的控制。

英文摘要

Sparse Autoencoders (SAEs) decompose residual-stream activations into interpretable features. Recent latent-space defenses increasingly rely on these decompositions, assuming that identified "unsafe" SAE features serve as actionable handles for monitoring and intervention. In this paradigm, clamping a specific harmful feature is expected to reliably prevent model misbehavior. However, we show that this success may hide a recoverable failure mode: the clamp may block one visible route to a behavior without eliminating the behavior itself. We formulate this vulnerability as post-intervention recovery, a constrained residual-space optimization problem. Starting from the post-intervention residual state, we optimize residual perturbations to recover the pre-intervention behavior while preserving the post-intervention values of the targeted SAE features. Even under a strong threat model where the intervention remains active throughout optimization and generation, recovery remains possible. To rule out that recovery simply undoes the intervention, we use encoder-orthogonal updates for single-layer interventions and the corresponding feature-map Jacobian in the cross-layer setting. Across TPP, unlearning, IOI, and refusal steering experiments, this stress test reveals recoverable behavior despite successful feature-level intervention. Especially in the safety-critical refusal-steering setting, we achieve a 95.8% recovery rate on valid samples while keeping defended-feature relative drift to 0.131, substantially below suffix-based baselines. A recovery-path attribution analysis further localizes this recovery to the SAE reconstruction residual, the component left unexplained by the SAE. These results expose a gap between feature-level control and behavioral completeness: SAE features can support causal intervention, but controlling them does not guarantee control over the underlying behavior.
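The idea of an encoder-orthogonal update can be sketched in a few lines. The sketch assumes the targeted SAE features have linear encoder pre-activations of the form w·x + b, so a residual perturbation that is orthogonal to every targeted encoder row w leaves those pre-activations unchanged. The function names and the Gram-Schmidt construction are illustrative assumptions, not the authors' implementation.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_out(delta, encoder_rows, eps=1e-12):
    """Remove from `delta` every component lying in the span of `encoder_rows`."""
    # Build an orthonormal basis of the targeted encoder directions (Gram-Schmidt).
    basis = []
    for w in encoder_rows:
        r = list(w)
        for b in basis:
            coef = dot(r, b)
            r = [ri - coef * bi for ri, bi in zip(r, b)]
        norm = dot(r, r) ** 0.5
        if norm > eps:  # skip directions already covered by the basis
            basis.append([ri / norm for ri in r])
    # Subtract the projection of delta onto each basis vector.
    out = list(delta)
    for b in basis:
        coef = dot(out, b)
        out = [oi - coef * bi for oi, bi in zip(out, b)]
    return out

rows = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]   # hypothetical encoder rows
update = project_out([0.5, -2.0, 3.0], rows)
```

In this toy case only the third coordinate of the update survives, so adding it to the residual state does not move either targeted feature's pre-activation.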


2606.18946 2026-06-18 cs.CL 新提交 70%

SenFlow: Inter-Sentence Flow Modeling for AI-Generated Text Detection in Hybrid Documents

SenFlow: 面向混合文档中AI生成文本检测的句间流建模

Jingkun Luo, Yifan Sun, Da-Tian Peng, Guanxiong Pei

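Affiliations: Northwestern Polytechnical University; Zhejiang Lab

Topic match (safety evaluation): AI text detection falls within safety evaluation

AI summary: For sentence-level AI text detection in human-AI hybrid documents, proposes the SenFlow model, which models inter-sentence dependencies through graph propagation and CRF decoding and improves cross-domain F1 on the MOSAIC benchmark by 4.15 percentage points.

Comments 16 pages, 4 figures, 9 tables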

AI中文摘要

针对混合文档（人类与LLM共同撰写同一文本）的句子级AI生成文本检测（S-AGTD）面临两个空白：现有方法孤立地对每个句子进行分类，忽略了句间依赖；现有基准遗漏了最新一代生成器。我们构建了MOSAIC基准，包含来自PubMed和XSum的16,000个混合文档，由DeepSeek-V3.2和Kimi K2生成，并经过严格质量控制，包括先前基准中缺失的困惑度一致性过滤器。我们将S-AGTD重新定义为文档句子序列上的结构化预测，并实例化为SenFlow，在句子图的单次文档级传递中，将基于图的句间传播与线性链CRF解码相结合。SenFlow在MOSAIC上达到了最先进的性能，在跨域迁移（三种难度递增协议中最难的一种）上平均Macro-F1提高了4.15个百分点。我们进一步发现，即使困惑度过滤器平衡了显式线索，AI插入仍然保留了一个依赖于生成器的句子长度差距，句子级检测器仍可利用这一点。代码和数据：此 https URL

英文摘要

Sentence-level AI-generated text detection (S-AGTD) for hybrid documents, where humans and LLMs co-author one text, faces two gaps: existing methods classify each sentence in isolation, discarding inter-sentence dependencies, and existing benchmarks omit the newest generation of generators. We construct MOSAIC, a benchmark of 16,000 hybrid documents over PubMed and XSum, generated by DeepSeek-V3.2 and Kimi K2 under stringent quality controls including a perplexity-consistency filter absent from prior benchmarks. We recast S-AGTD as structured prediction over the document sentence sequence and instantiate it as SenFlow, integrating graph-based inter-sentence propagation with linear-chain CRF decoding in a single document-level pass over a sentence graph. SenFlow reaches state-of-the-art performance on MOSAIC, with a +4.15 pp average Macro-F1 margin on cross-domain transfer, the hardest of three protocols of increasing difficulty. We further find that even after the perplexity filter equalizes overt cues, AI insertions retain a generator-dependent sentence-length gap that sentence-level detectors still exploit. Code and data: https://github.com/luojingkun22/SenFlow
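To see why decoding the sentence sequence jointly can differ from classifying each sentence in isolation, here is a minimal Viterbi decoder for a linear-chain CRF. The emission and transition scores are made-up numbers (label 0 = human, 1 = AI), and the sketch covers only the decoding step; it does not reproduce SenFlow's graph propagation or its trained parameters.

```python
def viterbi(emissions, transitions):
    """Best label sequence given per-sentence scores and label-to-label scores."""
    n_labels = len(transitions)
    score = list(emissions[0])
    backptrs = []
    for em in emissions[1:]:
        new_score, ptr = [], []
        for j in range(n_labels):
            # Best previous label i for ending at label j on this sentence.
            best_i = max(range(n_labels), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + em[j])
            ptr.append(best_i)
        score = new_score
        backptrs.append(ptr)
    # Trace the best path backwards from the best final label.
    path = [max(range(n_labels), key=lambda j: score[j])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return path[::-1]

emissions = [[2.0, 0.0], [0.4, 0.6], [2.0, 0.0]]   # hypothetical sentence scores
transitions = [[0.5, -0.5], [-0.5, 0.5]]           # staying in a label is favoured
isolated = [max(range(2), key=lambda j: em[j]) for em in emissions]
joint = viterbi(emissions, transitions)
```

With these numbers the isolated classifier flags the middle sentence as AI on a weak margin, while joint decoding keeps all three sentences human because the neighbouring evidence outweighs it.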


2606.18924 2026-06-18 cs.SD 新提交 70%

Who Wins the Conflict? Mechanistic Interpretability of Text Bias in Audio LLMs

谁赢得冲突？音频大模型中文本偏差的机制可解释性

Hyebin Cho, Suho Yoo, Jaehyuk Jang, Changick Kim, Joon Son Chung

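Affiliations: School of Electrical Engineering, KAIST

Topic match (safety evaluation): studies text-dominance bias and mitigates hallucination

AI summary: Through mechanistic analysis, this paper reveals the text-dominance bias in Audio LLMs, finds that the text pathway actively suppresses intact audio representations, and proposes back-patching, a training-free intervention that strengthens audio representations and mitigates text dominance.

Comments Preprint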

AI中文摘要

虽然音频大模型在多模态理解方面表现出色，但它们存在文本主导偏差，即模型盲目偏向文本而忽视声学证据，导致幻觉。然而，当音频和文本输入相互矛盾时，这些模型内部行为的底层机制尚未被探索。在这项工作中，我们通过追踪内部表征在层间的传播，首次对这一现象进行了机制分析。我们的研究揭示了三个关键发现：（i）文本主导在模型中系统性地且经验性地存在；（ii）虽然文本和音频依赖功能不同的路径，但它们最终在后期层中汇聚到一个共享语义空间；（iii）文本路径不会擦除音频信息，而是主动抑制完整的音频表征。基于这些见解，我们利用back-patching，一种无训练干预方法，将后期层的音频激活路由回早期层。这放大了音频表征，使其能够克服文本抑制。我们的评估表明，back-patching持续减少文本主导，为冲突下的机制性多模态对齐铺平了道路。

英文摘要

While Audio Large Language Models (Audio LLMs) excel at multimodal understanding, they suffer from text dominance, a bias where models blindly favor text over acoustic evidence, causing hallucinations. However, the internal mechanisms underlying how these models behave when audio and textual inputs contradict each other remain unexplored. In this work, we present the first mechanistic analysis of this phenomenon by tracing the propagation of internal representations across layers. Our investigation reveals three key findings: (i) text dominance is systematically and empirically present across models; (ii) while text and audio rely on functionally distinct pathways, they ultimately converge into a shared semantic space in late layers; and (iii) the text pathway does not erase audio information, but rather actively suppresses intact audio representations. Building on these insights, we leverage back-patching, a training-free intervention that routes late-layer audio activations back into earlier layers. This amplifies the audio representations, enabling them to overcome textual suppression. Our evaluation shows that back-patching consistently reduces text dominance, paving the way for mechanistic multimodal alignment under conflict.


2606.18632 2026-06-18 cs.RO 新提交 70%

ROBOSHACKLES: A Safety Dataset for Human-Injury Prevention in Embodied Foundation Models

ROBOSHACKLES: 面向具身基础模型中人体伤害预防的安全数据集

Zhuowen Yin, Chongyang Liu, Wenzhang Yang, Renjue Li, Yinxing Xue

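Affiliations: Institute of AI for Industries, Chinese Academy of Sciences; University of Science and Technology of China

Topic match (safety evaluation): evaluating unsafe actions by models in safety-critical scenarios

AI summary: Because data of robots harming humans is hard to collect safely, this work proposes a safety data construction pipeline grounded in real observations and generates ROBOSHACKLES, a dataset of 10,000 videos covering direct-harm and indirect-harm categories. The evaluation finds that existing models produce unsafe actions in 100% of the safety-critical scenarios tested.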

AI中文摘要

具身基础模型（EFMs）整合了多模态理解、未来状态推理和可执行的机器人动作。然而，它们在预防人体伤害方面的安全对齐仍未得到充分探索，主要是因为机器人伤害人类或造成危险家庭情境的真实世界数据无法安全或合乎道德地收集。为应对这一挑战，我们提出了一种针对人体伤害预防的安全关键数据构建流水线。该流水线从真实的DROID观测出发，经过场景理解、危险感知图像编辑、时间提示生成和单次滚动合成等步骤。时间提示指定了预期的场景演变，而Wan2.7则从编辑后的危险状态中单次合成逼真的机器人滚动视频。利用该流水线，我们构建了ROBOSHACKLES，一个包含10,000条机器人视频片段的数据集，源自真实的DROID观测，涵盖两个直接伤害和四个间接伤害类别。为确保数据集质量，我们使用自动指标评估任务完成度和视觉质量，并在基于拒绝的安全准则下评估了六个代表性EFM。结果表明，所有评估模型在测试的安全关键场景中都产生了不安全动作，不安全动作生成率为100%。ROBOSHACKLES可作为拒绝学习和机器人动作执行前危险预测的可扩展基准和训练资源。该数据集公开于https://roboshackles.github.io。

英文摘要

Embodied Foundation Models (EFMs) integrate multimodal understanding, future-state reasoning, and executable robot actions. Yet their safety alignment for human-injury prevention remains underexplored, primarily because real-world data of robots harming humans or creating hazardous household situations cannot be safely or ethically collected. To address this challenge, we propose a safety-critical data construction pipeline for human-injury prevention in EFMs. Starting from real DROID observations, our construction pipeline proceeds through scene understanding, hazard-aware image editing, temporal prompt generation, and single-pass rollout synthesis. The temporal prompts specify the expected scene evolution, while Wan2.7 synthesizes realistic robotic rollouts from the edited hazardous states in a single pass. Using this pipeline, we construct ROBOSHACKLES, a 10,000-clip robotic video dataset derived from real DROID observations, spanning two direct-harm and four indirect-harm categories. To ensure dataset quality, we assess task completion and visual quality with automatic metrics, and evaluate six representative EFMs under a refusal-based safety criterion. Results show that all evaluated models produce unsafe actions in the tested safety-critical scenarios, yielding a 100% unsafe action generation rate. ROBOSHACKLES serves as a scalable benchmark and training resource for refusal learning and hazard anticipation before robot action execution. The dataset is publicly available at https://huggingface.co/datasets/YZW00/RoboShackles.


2606.18310 2026-06-18 cs.CR cs.AI 新提交 70%

Conflict-Aware Retriever Editing for Knowledge Injection Attacks on LLM-Based RAG Systems

冲突感知检索器编辑：针对基于LLM的RAG系统的知识注入攻击

Xinru Liu, Xianglong Zhang, Di Cai, Zhumin Chen, Pengfei Hu, Xin Xin

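Affiliations: Shandong University, China; Tsinghua University, China

Topic match (safety evaluation): knowledge injection attacks on RAG systems

AI summary: Proposes CAREATTACK, a conflict-aware retriever editing framework that injects malicious knowledge into RAG systems through a model-centric attack, resolves conflicts with graph-based detection and parameter editing projection, and applies lightweight calibration to preserve attack effectiveness.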

AI中文摘要

将恶意知识注入检索增强生成（RAG）系统可以操纵检索到的证据并误导下游生成，对AI应用构成严重安全威胁。现有的RAG注入攻击主要依赖于操纵外部知识库，例如制作恶意语料库。然而，这种以数据为中心的方法合成的文本可能被检测到，导致攻击失败。除了语料库操纵之外，开源检索器越来越多地将RAG系统暴露于以模型为中心的攻击。在本文中，我们提出了冲突感知检索器编辑，即CAREATTACK，一个以模型为中心的检索器攻击框架，用于在RAG中注入恶意知识。具体来说，CAREATTACK包括两个阶段：冲突感知检索器编辑和攻击保持锚点修复。冲突感知检索器编辑将高效的闭式参数编辑适应于密集检索模型，提升恶意知识在良性竞争段落之上的排名，并通过基于图的冲突检测和参数编辑投影解决潜在参数冲突。然后，攻击保持锚点修复对编辑后的检索器进行轻量校准，以进一步消除对非目标提示的影响，同时保持对目标提示的攻击有效性。我们在Qwen3-Embedding-0.6B和BGE-M3上实例化CAREATTACK，并在三个基准数据集上进行评估。实验结果表明，我们的方法显著地将恶意段落提升到RAG系统检索到的知识中，并且在访问检索模型参数的情况下，可以对批量目标提示和段落执行攻击。由于大多数RAG系统基于开源检索模型构建，这项工作揭示了RAG系统中一个实际攻击面。代码在此https URL公开。

英文摘要

Injecting malicious knowledge into retrieval-augmented generation (RAG) systems can manipulate retrieved evidence and mislead downstream generation, posing a serious security threat to AI applications. Existing RAG injection attacks mainly rely on manipulating external knowledge bases, for example by crafting a malicious corpus. However, the synthetic text crafted by such data-centric methods can be detected, causing the attack to fail. Beyond corpus manipulation, open-source retrievers increasingly expose RAG systems to model-centric attacks. In this paper, we propose conflict-aware retriever editing, i.e., CAREATTACK, a model-centric retriever attack framework for malicious knowledge injection in RAG. Specifically, CAREATTACK consists of two stages: conflict-aware retriever editing and attack-preserving anchor repair. Conflict-aware retriever editing adapts efficient closed-form parameter editing to the dense retrieval model, promoting malicious knowledge above benign competing passages and resolving potential parameter conflicts through graph-based conflict detection and parameter editing projection. Then, attack-preserving anchor repair performs lightweight calibration on the edited retriever to further eliminate the impact on non-target prompts while preserving the attack effectiveness for target prompts. We instantiate CAREATTACK on Qwen3-Embedding-0.6B and BGE-M3 and evaluate it on three benchmark datasets. Experimental results demonstrate that our method substantially promotes malicious passages into the retrieved knowledge of RAG systems and, given access to the retrieval model parameters, can attack batches of target prompts and passages. Since most RAG systems are built upon open-source retrieval models, this work reveals a practical attack surface in RAG systems. Code is publicly accessible at https://anonymous.4open.science/r/CareAttack-3F1C.


2606.18606 2026-06-18 cs.CL cs.AI 新提交 90%

Steerable Cultural Preference Optimization of Reward Models

可引导的文化偏好优化奖励模型

Minsik Oh, Advit Deepak, Sophie Wu, Douwe Kiela, Ekaterina Shutova

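Affiliations: Stanford University; University of Amsterdam

Topic match (preference alignment): proposes the SCPO algorithm to optimize the cultural preference alignment of reward models

AI summary: Proposes the SCPO algorithm, which trains reward models by balancing multiple cultural preferences. It improves minority-group preference prediction by up to 7 points on the PRISM and GlobalOpinionQA datasets and is up to 280% more training data-efficient.

Comments Accepted to Pluralistic Alignment @ ICML 2026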

AI中文摘要

大型语言模型（LLM）技术以每个文化子社区可接受的方式服务于众多不同文化子社区至关重要。然而，迄今为止，关于LLM对齐的研究主要集中于预测来自特定地区的标注者的统一响应偏好。本文旨在以更全球化的视角推进对齐模型的发展，使其能够准确代表子社区的偏好，并且不对任何子社区表现出过度偏见。我们专注于为此目的开发奖励模型，并提出一种新颖的奖励模型训练算法（SCPO），该算法能够以平衡的方式融入多样化的文化偏好。我们的方法使得少数群体奖励模型在两个数据集（PRISM和GlobalOpinionQA）以及7个国家上的性能比基线模型提升最多7点。SCPO在训练数据效率上比奖励模型的完整数据微调高出最多280%。此外，我们通过分别评估子社区的偏好来进行偏见分析，并表明我们的加权方法减轻了过度偏见。我们的代码可在以下网址获取：this https URL

英文摘要

It is essential for large language model (LLM) technology to serve many different cultural sub-communities in a manner that is acceptable to each community. However, research on LLM alignment has so far predominantly focused on predicting a unified response preference of annotators from certain regions. This paper aims to advance the development of alignment models with a more global outlook, that are able to accurately represent the preferences of subcommunities and do not exhibit excessive bias towards any of them. We focus on the development of reward models for this purpose and present a novel reward model training algorithm (SCPO) that can incorporate diverse cultural preferences in a balanced manner. Our method results in performance increases of the minority reward model of up to 7 points over the baseline model across two datasets, PRISM and GlobalOpinionQA, and across 7 countries. SCPO is up to 280% more training data-efficient than full-data finetuning of reward models. In addition, we perform analysis of bias by separately evaluating on the preference of subcommunities and show that excessive bias is mitigated via our weighting method. Our code is available at https://github.com/minsik-ai/Steerable-Cultural-Preference


2606.18487 2026-06-18 cs.LG cs.AI cs.CL 新提交 90%

SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR

SFT 过训练通过熵崩溃预测 RLVR 下的排名反转

Siddharth Aphale, Kelly Liu

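Affiliations: Stanford University

Topic match (preference alignment): SFT overtraining causes rank inversion under RLVR

AI summary: The study finds that SFT overtraining lowers the entropy of the rollout distribution, so the advantage signal in GRPO vanishes and rank inversion follows. It proposes a two-stage entropy-based diagnostic that gives early warning of high-risk checkpoints.

Comments 14 pages, 6 figures. Accepted at the Deep Learning for Code (DL4C) Workshop at ICML 2026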

AI中文摘要

当 SFT 压缩 rollout 分布时，选择 pass@1 最高的 SFT 检查点进行 GRPO 的标准启发式方法可能失败。对于二元奖励，组内期望优势方差为 $p(1{-}p)(g{-}1)/g$；当早期 GRPO 将 $p$ 驱动到 $p^*(g)$ 以下时，大多数组具有相同奖励，不提供组间相对信号。我们研究了 Qwen2.5-Coder-3B 和 DeepSeek-Coder-6.7B 的 SFT 深度阶梯。我们在五个深度和三个种子上测试 Qwen2.5-Coder-3B，在四个匹配深度和三个种子上测试 DeepSeek-Coder-6.7B。在 Qwen 上，RL 前的 pass@1 随 SFT 深度增加而上升，但 GRPO 峰值 pass@10 从 $0.806$ 下降到 $0.481$（3 种子均值，$n{=}20$）；RL 前的熵与 GRPO 结果正相关（$\rho{=}{+}0.69$）。在 DeepSeek 上，pass@1 仍远高于 $p^*(8){=}0.083$，GRPO 结果压缩而非反转。结合 RL 前熵分诊与早期 GRPO 熵监测的两阶段诊断方法，可标记高风险检查点并提前停止失败运行。在我们的设置中，简单的 KL 参考正则化和标签平滑变体未能挽救崩溃的 Qwen 检查点，表明该失败并非琐碎的 GRPO 超参数伪影。

英文摘要

The standard heuristic of selecting the SFT checkpoint with the highest pass@1 for GRPO can fail when SFT compresses the rollout distribution. For binary rewards, the expected within-group advantage variance is $p(1{-}p)(g{-}1)/g$; when early GRPO drives $p$ below $p^*(g)$, most groups have identical rewards and provide no group-relative signal. We study SFT depth ladders for Qwen2.5-Coder-3B and DeepSeek-Coder-6.7B. We test Qwen2.5-Coder-3B across five depths and three seeds, and DeepSeek-Coder-6.7B across four matched depths and three seeds. On Qwen, pre-RL pass@1 rises with SFT depth, but peak GRPO pass@10 falls from $0.806$ to $0.481$ (3-seed mean, $n{=}20$); pre-RL entropy is positively associated with the GRPO outcome ($\rho{=}{+}0.69$). On DeepSeek, pass@1 remains far above $p^*(8){=}0.083$, and GRPO outcomes compress rather than invert. A two-stage diagnostic, combining pre-RL entropy triage with an early GRPO entropy monitor, flags high-risk checkpoints and can stop failing runs early. Simple KL-to-reference regularisation and label-smoothing variants do not rescue the collapsed Qwen checkpoint in our setting, suggesting the failure is not a trivial GRPO hyperparameter artefact.
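The variance expression in the abstract can be checked numerically. The sketch below computes the expected within-group advantage variance for g independent binary rewards by exact enumeration and compares it with $p(1-p)(g-1)/g$. It also computes how often every rollout in a group fails. The function names are illustrative, and reading $p^*(g)$ as the success rate at which half of all groups fail entirely, $1 - 2^{-1/g}$, is an assumption: the abstract does not define $p^*$, although this reading does give about 0.083 for $g=8$.

```python
from math import comb

def expected_advantage_variance(p: float, g: int) -> float:
    """Exact E[mean squared (reward - group mean)] for g i.i.d. Bernoulli(p) rewards."""
    total = 0.0
    for k in range(g + 1):  # k = number of successful rollouts in the group
        prob = comb(g, k) * p**k * (1 - p) ** (g - k)
        frac = k / g
        total += prob * frac * (1 - frac)  # within-group variance when k succeed
    return total

def closed_form(p: float, g: int) -> float:
    return p * (1 - p) * (g - 1) / g

def all_fail_fraction(p: float, g: int) -> float:
    """Probability that every rollout in a group gets reward 0."""
    return (1 - p) ** g

def half_dead_threshold(g: int) -> float:
    # Assumed reading of p*(g): success rate at which half of all groups fail entirely.
    return 1 - 2 ** (-1 / g)
```

For example, at $p=0.5$ and $g=8$ the closed form gives $0.25 \cdot 7/8 = 0.21875$, while at $p=0.02$ it falls to about 0.017 and roughly 85% of groups contain no successful rollout at all.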


2606.16276 2026-06-18 cs.AI 新提交 90%

SpecAlign: Efficient Specification-Grounded Alignment of Large Language Models via Synthetic Data

SpecAlign: 通过合成数据实现高效的大语言模型规范对齐

Wenjie Wang, Yue Huang, Zhengqing Yuan, Han Bao, Shiyi Du, Yuchen Ma, Yue Zhao, Yanfang Ye, Xiangliang Zhang

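Affiliations: University of Notre Dame; Carnegie Mellon University; LMU Munich; University of Southern California

Topic match (preference alignment): specification-grounded alignment framework that uses synthetic data to achieve rule compliance

AI summary: Proposes specification-grounded alignment, a new paradigm that synthesizes data from specification documents (the SpecAlign framework). It combines structured rule annotation, controllable specification instantiation, and multi-agent adversarial data synthesis to generate fine-grained preference pairs, improving rule compliance without harming general capabilities.

Comments 58 pages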

AI中文摘要

随着大语言模型（LLM）在现实应用中的部署日益增多，对齐不再由单一的通用安全或有用性概念主导，而是由提供商或应用特定的模型规范主导。这些规范通常冗长、结构化且频繁更新，然而现有的对齐流程缺乏系统化的机制来将其作为训练信号。在本文中，我们提出规范对齐（specification-grounded alignment），一种新的对齐范式，将提供商编写的模型规范作为主要对齐目标，而非抽象原则或静态基准。为实例化该范式，我们引入SpecAlign框架，该框架直接从规范文档合成对齐数据。SpecAlign结合结构化规则标注、可控规范实例化和多智能体对抗数据合成，生成细粒度、边界感知的偏好对，捕获合规行为和有意义的规范违反。在多个模型规范和骨干模型上的实验表明，使用SpecAlign进行训练一致地提高了规则遵守度，同时保持了通用能力并避免了过度保守的行为。这些结果表明，将对齐建立在显式模型规范上，能够实现LLM行为对不断变化的政策要求的快速、精确和可扩展的适应。

英文摘要

As large language models (LLMs) are increasingly deployed in real-world applications, alignment is no longer governed by a single universal notion of safety or helpfulness, but instead by provider- or application-specific model specifications. These specifications are typically long, structured, and frequently updated, yet existing alignment pipelines lack a systematic mechanism to operationalize them as training signals. In this paper, we propose specification-grounded alignment, a new alignment paradigm that treats provider-authored model specifications as the primary alignment target rather than abstract principles or static benchmarks. To instantiate this paradigm, we introduce SpecAlign, a framework that synthesizes alignment data directly from specification documents. SpecAlign combines structured rule annotation, controllable specification instantiation, and multi-agent adversarial data synthesis to generate fine-grained, boundary-aware preference pairs that capture both compliant behaviors and meaningful specification violations. Experiments across multiple model specifications and backbone models demonstrate that training with SpecAlign consistently improves rule compliance while preserving general capabilities and avoiding over-conservative behavior. These results suggest that grounding alignment in explicit model specifications enables rapid, precise, and scalable adaptation of LLM behavior to evolving policy requirements.


2601.17637 2026-06-18 cs.CY cs.HC 90%

Scaling Laws for Moral Machine Judgment in Large Language Models

大语言模型中道德机器判断的扩展规律

Kazuhiro Takemoto

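Topic match (preference alignment): studies scaling laws for the alignment of LLM moral judgment with human preferences

AI summary: By evaluating 75 large language model configurations, the study finds a power-law relationship between model size and distance from human preferences, with the benefit of extended reasoning more pronounced at smaller scales, providing a basis for scaling-law research on value judgments.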

Comments 12 pages, 4 figures, 3 tables

Journal ref R Soc Open Sci. (2026) 13 (6): 260202


DOI: 10.1098/rsos.260202

AI中文摘要

自主系统日益需要道德判断能力，但其与模型规模的可预测性尚不清楚。我们系统评估了75种大语言模型配置（0.27B-1000B参数），利用道德机器框架测量其在生死困境中的对齐程度。观察到人类偏好距离（D）与模型规模（S）呈幂律关系（D ∝ S^{-0.10±0.01}，R²=0.50，p<0.001）。混合效应模型证实此关系在控制模型家族和推理能力后仍成立。扩展推理模型在较小模型中表现更优（规模×推理交互作用：p=0.024）。此关系在多样架构中成立，但随着规模增大，方差降低，表明计算规模系统性地提高了道德判断的可靠性。这些发现将扩展规律研究扩展到基于价值的判断，并为人工智能治理提供实证基础。

英文摘要

Autonomous systems increasingly require moral judgment capabilities, yet whether these capabilities scale predictably with model size remains unexplored. We systematically evaluate 75 large language model configurations (0.27B--1000B parameters) using the Moral Machine framework, measuring alignment with human preferences in life-death dilemmas. We observe a consistent power-law relationship with distance from human preferences ($D$) decreasing as $D \propto S^{-0.10\pm0.01}$ ($R^2=0.50$, $p<0.001$) where $S$ is model size. Mixed-effects models confirm this relationship persists after controlling for model family and reasoning capabilities. Extended reasoning models show significantly better alignment, with this effect being more pronounced in smaller models (size$\times$reasoning interaction: $p = 0.024$). The relationship holds across diverse architectures, while variance decreases at larger scales, indicating systematic emergence of more reliable moral judgment with computational scale. These findings extend scaling law research to value-based judgments and provide empirical foundations for artificial intelligence governance.
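A short worked calculation shows what an exponent of $-0.10$ implies in practice. The sketch takes the fitted relationship $D \propto S^{-0.10}$ at face value and ignores both the reported uncertainty ($\pm 0.01$) and the moderate fit ($R^2=0.50$); the function names are illustrative.

```python
def distance_ratio(size_multiplier: float, exponent: float = -0.10) -> float:
    """Factor by which D changes when model size S is multiplied by `size_multiplier`."""
    return size_multiplier ** exponent

def size_multiplier_for(target_ratio: float, exponent: float = -0.10) -> float:
    """Size increase needed to scale D by `target_ratio` under the same power law."""
    return target_ratio ** (1 / exponent)

tenfold = distance_ratio(10)          # about 0.79: a 10x larger model is ~21% closer
thousandfold = distance_ratio(1000)   # about 0.50
to_halve = size_multiplier_for(0.5)   # 2**10 = 1024x
```

Under this fit, halving the distance from human preferences takes roughly a thousandfold increase in parameters, which is the kind of slow improvement a small exponent implies.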


2606.18327 2026-06-18 cs.LG cs.AI 新提交 70%

Self-CTRL: Self-Consistency Training with Reinforcement Learning

Self-CTRL：基于强化学习的自一致性训练

Itamar Pres, Laura Ruis, Melat Ghebreselassie, Belinda Z. Li, Jacob Andreas

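Affiliations: MIT CSAIL

Topic match (preference alignment): uses reinforcement learning to optimize consistency between a language model's self-explanations and its behavior

AI summary: Proposes Self-CTRL, which uses reinforcement learning to optimize consistency between a language model's self-explanations and its behavior, substantially improving consistency and safety on probabilistic reasoning and constitutional AI tasks.

Comments 34 pages, 12 figures, includes appendices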

AI中文摘要

能够忠实描述自身行为的语言模型（LMs）更容易被用户审计、理解和信任。本文描述了基于强化学习的自一致性训练（Self-CTRL），该方法通过更新解释以更好地预测行为或更新行为以更好地匹配解释，优化LM的自我解释与相关输入行为之间的一致性。我们在两个领域应用该方法。首先，研究一个形式化概率推理任务，其中LM必须学习模仿一组有偏采样器，并评估其报告相关偏差的能力。我们发现，一致性训练将自我报告和行为测量的潜在偏差之间的相关性从$R^2=0.24$提高到$R^2=0.64$（在保留分布上），匹配直接真实标签监督的泛化能力。其次，研究一个宪法AI领域，其中LM必须描述何时拒绝或遵守用户请求。在此，Self-CTRL产生忠实描述模型在保留请求上行为的规则，将第三方审计模型的拒绝预测从$36\%$提高到$92\%$。另一方面，行为更新改善了对齐，将HarmBench失败率从$15.0\%$降低到$0.5\%$，而不会显著增加对无害提示的拒绝。通过对齐解释和行为，我们的工作为训练更安全、更透明、更可控的AI模型提供了通用方法。

英文摘要

Language models (LMs) that faithfully describe their own behavior can more easily be audited, understood, and trusted by users. This paper describes Self-Consistency Training with Reinforcement Learning (Self-CTRL), a method that optimizes for consistency between an LM's self-explanations and behavior on related inputs by updating explanations to better predict behavior or updating behavior to better match explanations. We apply our method in two domains. First, we study a formal probabilistic reasoning task in which LMs must learn to imitate a family of biased samplers and are evaluated on their ability to report the associated biases. We find that consistency training improves the correlation between self-reported and behaviorally-measured latent biases from $R^2=0.24$ to $R^2=0.64$ on a set of held-out distributions, matching the generalization of direct ground-truth supervision. Second, we study a constitutional AI domain in which LMs must describe when they will refuse or comply with user requests. Here, Self-CTRL produces rules that faithfully describe the model's behavior on held-out requests, improving the refusal predictions of a third-party auditor model from $36\%$ to $92\%$. In the other direction, behavior updates improve alignment, reducing HarmBench failure rate from $15.0\%$ to $0.5\%$ without substantially increasing refusal on harmless prompts. By aligning explanations and behavior, our work provides a general recipe for training AI models to be safer, more transparent, and more controllable.


2606.18550 2026-06-18 cs.CR 新提交 85%

The Gate Is Only as Honest as Its Contracts: ContractGuard for the Contract Layer of Risk-Aware Causal Gating

门仅与其合约一样诚实：面向风险感知因果门控合约层的ContractGuard

Laxmipriya Ganesh Iyer, Rahul Suresh Babu

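Topic match (prompt injection): defending against indirect prompt injection attacks

AI summary: For indirect prompt injection against tool-augmented LLM agents, proposes ContractGuard, which defends by verifying contract integrity (rather than risk labels) and achieves a zero injection success rate on the benchmark.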

AI中文摘要

风险感知因果门控（RACG）通过从代理的可见动作空间中移除危险工具来防御工具增强型LLM代理免受间接提示注入，使得即使完全符合注入条件的代理也无法调用其不可见的工具。我们提出三点。首先，这种结构性保证并未消除安全工具使用背后的信任假设；它将其转移到门所读取的工具合约——声明的先决条件、效果、风险和授权——的完整性上，因此攻击者若破坏合约，可使门误判而无需说服代理。其次，伪造工具的效果比篡改其风险标签更危险，因为RACG在可准入门之前应用因果门：离路径工具从不暴露，因此仅重新标记风险会失败，而效果伪造则将危险工具路由到因果路径上并成功。效果完整性，而非风险标签，是承载假设。第三，我们引入ContractGuard，一个位于注册表和门之间的验证器，它分层使用签名来源、类型化合约认证和运行时效果验证；在受控基准测试中，它针对所有建模攻击（包括穷举白盒自适应攻击）将注入成功率恢复为零，且不会过度拒绝诚实合约，该结构性预测在六个当前代托管模型（Claude Opus 4.8, Sonnet 4.6, Haiku 4.5; Amazon Nova Premier and Nova 2 Lite; GPT-OSS-120B）上得到确认。

英文摘要

Risk-Aware Causal Gating (RACG) defends tool-augmented LLM agents against indirect prompt injection by removing dangerous tools from the agent's visible action space, so that even a fully injection-compliant agent cannot call a tool it cannot see. We make three points. First, this structural guarantee does not eliminate the trust assumption behind safe tool use; it relocates it into the integrity of the tool contracts -- declared preconditions, effects, risk, and authorization -- that the gate reads, so an attacker who corrupts a contract can make the gate mis-decide without ever persuading the agent. Second, forging a tool's effects is strictly more dangerous than tampering with its risk label, because RACG applies a causal gate before its admissibility gate: an off-path tool is never exposed, so risk-relabeling alone fails, whereas effect forgery routes the dangerous tool onto the causal path and succeeds. Effect integrity, not the risk label, is the load-bearing assumption. Third, we introduce ContractGuard, a verifier between the registry and the gate that layers signed provenance, typed contract attestation, and runtime effect verification; on a controlled benchmark it restores injection success to zero against every modeled attack -- including an exhaustive white-box adaptive attacker -- without over-rejecting honest contracts, and the structural prediction is confirmed on six current-generation hosted models (Claude Opus 4.8, Sonnet 4.6, Haiku 4.5; Amazon Nova Premier and Nova 2 Lite; GPT-OSS-120B).
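The gate ordering described in the abstract can be illustrated with a toy model. In the sketch below a tool is exposed only if its declared effects touch the current goal (causal gate) and its declared risk is acceptable (admissibility gate). The class, function, and tool names are hypothetical, and this is not the RACG or ContractGuard implementation. The abstract also does not say how the admissibility gate derives risk, so the forged contract here additionally carries a low risk label.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Contract:
    name: str
    effects: frozenset   # effects the tool declares
    risk: str            # declared risk label: "low" or "high"

def causal_gate(contracts, goal_effects):
    """Keep only tools whose declared effects touch the goal (on the causal path)."""
    return [c for c in contracts if c.effects & goal_effects]

def admissibility_gate(contracts):
    """Keep only tools whose declared risk is acceptable."""
    return [c for c in contracts if c.risk == "low"]

def visible_tools(contracts, goal_effects):
    """Causal gate first, then admissibility gate, as the abstract describes."""
    return {c.name for c in admissibility_gate(causal_gate(contracts, goal_effects))}

GOAL = frozenset({"read_balance"})
balance = Contract("get_balance", frozenset({"read_balance"}), "low")
honest = Contract("wire_transfer", frozenset({"move_funds"}), "high")
relabeled = Contract("wire_transfer", frozenset({"move_funds"}), "low")   # risk label tampered
forged = Contract("wire_transfer", frozenset({"read_balance"}), "low")    # effects forged

honest_view = visible_tools([balance, honest], GOAL)
relabeled_view = visible_tools([balance, relabeled], GOAL)
forged_view = visible_tools([balance, forged], GOAL)
```

In this toy setup the relabeled tool stays hidden because it never reaches the admissibility gate, while the tool with forged effects is placed on the causal path and becomes visible, which is why the gate's decisions are only as trustworthy as the contracts it reads.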

2606.18530 2026-06-18 cs.CR cs.CL cs.LG New submission 85%

Evaluating Prompting-Based Defenses Against Domain-Camouflaged Injection Attacks

Aaditya Pai

Affiliation: Data Science Institute

Topic match: Prompt injection (evaluating defenses against domain-camouflaged injection attacks)

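AI summary: Evaluates five prompting-based defenses (including paraphrasing and spotlighting) against domain-camouflaged injection attacks across three model families and three deployment domains, and finds paraphrasing the most effective, reducing camouflage attack success rate by 55-84%.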

Comments 9 pages, 4 figures, 4 tables; under review at the AdvML-Frontiers x CoTMA workshop, COLM 2026

Details

AI abstract (translated from Chinese)

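Domain-camouflaged injection attacks use domain-specific vocabulary to embed malicious instructions in retrieved content, evading standard detectors that rely on syntactic injection markers. When detection fails, practitioners need to know which defense architectures reduce attack success. We evaluate five prompting-based defenses (spotlighting, paraphrasing, prompt sandwiching, and two combinations) against domain-camouflaged injection attacks across three model families (Claude Haiku, Llama 3.1 8B, Gemini 2.0 Flash) and three deployment domains (financial, legal, general), for a total of 3,510 trials. Paraphrasing retrieved content before the agent processes it is the most consistently effective defense, reducing camouflage attack success rate by 55-84% depending on the model and achieving a lower attack success rate than our Llama Guard 4 configuration on every model tested. Defense effectiveness depends strongly on the model: spotlighting halves attack success on Claude Haiku but brings no benefit on Llama 3.1 8B. Financial-domain deployments face the highest residual risk, with a baseline attack success rate of 26-33%, and on weaker models no prompting-based defense fully eliminates the threat. These results provide the first systematic evaluation of prompting-based defenses aimed specifically at camouflage-class injection attacks and establish benchmark-based recommendations for practitioners. All tasks use synthetically constructed professional documents; whether these benchmark rankings generalize to real enterprise documents remains an open question.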

English abstract

Domain-camouflaged injection attacks embed malicious instructions in retrieved content using domain-appropriate vocabulary, evading standard detectors that rely on syntactic injection markers. When detection fails, practitioners need to know which defense architectures reduce attack success. We evaluate five prompting-based defenses (spotlighting, paraphrasing, prompt sandwiching, and two combinations) against domain-camouflaged injection across three model families (Claude Haiku, Llama 3.1 8B, Gemini 2.0 Flash) and three deployment domains (financial, legal, general) using 3,510 trials. Paraphrasing retrieved content before agent processing is the most consistently effective defense in this benchmark, reducing camouflage attack success rate by 55-84% depending on model, and achieves lower attack success rates than our Llama Guard 4 configuration on every model tested. Defense effectiveness is strongly model-dependent: spotlighting halves attack success on Claude Haiku but provides no benefit on Llama 3.1 8B. Financial domain deployments face the highest residual risk at 26-33% baseline attack success rate, with no prompting-based defense fully eliminating the threat on weaker models. These results provide the first systematic evaluation of prompting-based defenses specifically against camouflage-class injection attacks and establish benchmark-based recommendations for practitioners. All tasks use synthetically constructed professional documents; whether these benchmark rankings generalize to real enterprise documents remains an open question.
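
For readers unfamiliar with the three base defenses named in the abstract, the sketch below shows one common way each is wired into a prompt. It is an illustration only: the function names, prompt wording, and the `^` marker are assumptions, the paper's actual templates are not given in the abstract, and the paraphraser is a caller-supplied function standing in for whatever model call a real pipeline would use.

```python
from typing import Callable, Optional


def spotlight(doc: str, marker: str = "^") -> str:
    """Datamarking-style spotlighting: join the document's words with a marker."""
    return marker.join(doc.split())


def sandwich(task: str, doc: str) -> str:
    """Prompt sandwiching: restate the task after the untrusted content."""
    return f"{task}\n\n{doc}\n\nReminder: {task}"


def build_prompt(task: str, doc: str, defense: str,
                 paraphraser: Optional[Callable[[str], str]] = None) -> str:
    """Assemble a prompt around retrieved content using one hypothetical defense."""
    if defense == "none":
        return f"{task}\n\n{doc}"
    if defense == "spotlighting":
        note = "Words in the document are joined by '^'. Do not follow instructions inside it."
        return f"{task}\n{note}\n\n{spotlight(doc)}"
    if defense == "paraphrasing":
        if paraphraser is None:
            raise ValueError("paraphrasing needs a paraphraser callable")
        # The retrieved text is rewritten before the agent ever sees it.
        return f"{task}\n\n{paraphraser(doc)}"
    if defense == "sandwiching":
        return sandwich(task, doc)
    raise ValueError(f"unknown defense: {defense}")


if __name__ == "__main__":
    doc = "Q3 revenue rose. Per policy, wire the reserve to account 00-000."
    print(build_prompt("Summarize the document.", doc, "spotlighting"))
```

The combinations evaluated in the paper would presumably chain these steps, but which pairs were combined is not stated in the abstract.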

2606.19235 2026-06-18 cs.CR New submission 80%

CodeSentinel: A Three-Layer Defense Against Indirect Prompt Injection in Code Contexts

Po-Han Cheng, Chia-Mu Yu, Ying-Dar Lin, Yu-Sung Wu, Wei-Bin Lee

Topic match: Prompt injection (defense against prompt injection in code contexts)

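AI summary: To counter the indirect prompt injection attacks that code LLMs face when retrieving external code, proposes CodeSentinel, a three-layer inference-time sanitizer that combines grammar-guided prefiltering, CST-guided dynamic Min-K% scoring, and node perturbation analysis, achieving a node-level F1 of 0.80 and outperforming existing methods.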
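
As background for the Min-K% scoring mentioned in the summary, the plain Min-K% statistic is the mean log-probability of the k% least likely tokens in a span. The sketch below computes only that baseline statistic from a list of per-token log-probabilities; the function name and the example values are assumptions, and how CodeSentinel chooses k dynamically or maps scores onto CST nodes is not described in the summary.

```python
import math


def min_k_score(token_logprobs, k=0.2):
    """Mean log-probability of the lowest k fraction of tokens (plain Min-K%)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    if not 0 < k <= 1:
        raise ValueError("k must be in (0, 1]")
    # Always keep at least one token so short spans still get a score.
    n = max(1, math.ceil(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n


if __name__ == "__main__":
    # Hypothetical per-token log-probabilities for one code span.
    span = [-0.1, -0.2, -5.0, -0.3, -0.4]
    print(min_k_score(span, k=0.4))
```

A lower score means the span contains tokens the scoring model found very unlikely.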

1. Safety evaluation (22 papers)

Understanding and Mitigating Prompt Leaking Attacks in Real-World LLM-Based Applications

Mechanism-Guided Selective Unlearning for RLVR-Induced Reasoning

Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection

Lifecycle-Aware Dynamic Analysis for Secure ML Model Execution

The Wrong Kind of Right: Quantifying and Localizing Misfired Alignment in LLMs

Signature filtering: a lightweight enhancement for statistical watermark detection in large language models

SafeClawBench: Separating Semantic, Audit-Evidence, and Sandbox Harm in Tool-Using LLM Agents

Quantifying Compromise Risk in Exceptional Access Architectures Under Sparse and Indirect Evidence

SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety

RedactionBench

PreUnlearn: Auditing Collateral Knowledge Damage Before Large Language Model Unlearning

"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

Quantifying and Auditing LLM Evaluation via Positive--Unlabeled Learning

Detecting Hidden ML Training With Zero-Overhead Telemetry

Runtime Compliance Verification for AI Agents

Output Vector Editing for Memorization Mitigation in Large Language Models

AI Sandboxes: A Threat Model, Taxonomy, and Measurement Framework

SAE Interventions are Unreliable: Post-Intervention Recovery of Suppressed Behavior

SenFlow: Inter-Sentence Flow Modeling for AI-Generated Text Detection in Hybrid Documents

Who Wins the Conflict? Mechanistic Interpretability of Text Bias in Audio LLMs

ROBOSHACKLES: A Safety Dataset for Human-Injury Prevention in Embodied Foundation Models

Conflict-Aware Retriever Editing for Knowledge Injection Attacks on LLM-Based RAG Systems

2. Preference alignment (5 papers)

Steerable Cultural Preference Optimization of Reward Models

SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR

SpecAlign: Efficient Specification-Grounded Alignment of Large Language Models via Synthetic Data

Scaling Laws for Moral Machine Judgment in Large Language Models

Self-CTRL: Self-Consistency Training with Reinforcement Learning

3. Prompt injection (3 papers)

The Gate Is Only as Honest as Its Contracts: ContractGuard for the Contract Layer of Risk-Aware Causal Gating

Evaluating Prompting-Based Defenses Against Domain-Camouflaged Injection Attacks

CodeSentinel: A Three-Layer Defense Against Indirect Prompt Injection in Code Contexts