arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 1998
2606.12837 2026-06-12 cs.CL 新提交

LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling

LoHoSearch: 超越人类难度上限的长时域搜索代理基准测试

Jiarui Zhao, Rongzhi Zhang, Lingchuan Liu, Hao Yang, Xunliang Cai, Xi Su

发表机构 * Meituan(美团)

AI总结 提出LoHoSearch基准,基于700万维基实体知识图谱自动构建544个复杂问题,评估显示最强模型仅34.74%准确率,远超人类难度上限。

详情
AI中文摘要

以BrowseComp为代表的搜索代理基准在过去一年中迅速饱和,最强模型已超过90%准确率。由于这些基准主要由人类编写,标注者缺乏对实体统计的全局视角,无法系统性地最大化搜索空间大小和结构复杂性,这造成了难以突破的难度上限。为解决这一问题,我们引入了LoHoSearch(长时域搜索代理),一个包含544个人工验证问题、覆盖11个领域的挑战性基准。LoHoSearch通过基于覆盖超过700万维基百科实体的知识图谱的自动化流水线构建,该流水线选择具有大搜索空间的关系,并将其组装成结构复杂且具有知识图谱验证的唯一答案的问题。我们的评估表明,即使是最强模型也仅达到34.74%的准确率,且现有的上下文管理策略(最佳提升+6.8%)带来的增益远小于先前基准。LoHoSearch为评估搜索代理中的长时域推理和上下文管理提供了更高要求的标准。

英文摘要

Search agent benchmarks exemplified by BrowseComp have rapidly saturated over the past year, with the strongest models surpassing 90% accuracy. Since these benchmarks are predominantly human-authored, annotators lack a global perspective on entity statistics and cannot systematically maximize search space size and structural complexity. This creates a difficulty ceiling that is hard to break. To address this, we introduce LoHoSearch (Long-Horizon Search Agents), a challenging benchmark comprising 544 human-verified questions across 11 domains. LoHoSearch is constructed via an automated pipeline built upon a knowledge graph covering over 7 million Wikipedia entities, which selects relations with large search spaces and assembles them into structurally complex questions with KG-verified unique answers. Our evaluation demonstrates that even the strongest model achieves only 34.74% accuracy, and existing context management strategies (best +6.8%) yield far smaller gains than on prior benchmarks. LoHoSearch provides a more demanding standard for evaluating long-horizon reasoning and context management in search agents.

2606.12834 2026-06-12 cs.AI 新提交

Fantastic Scientific Agents and How to Build Them: AgentBuild for Rietveld Refinement

神奇的科学智能体及其构建方法:用于Rietveld精修的AgentBuild

Woong Shin, Craig A. Bridges, Marshall T. McDonnell, Rafael Ferreira da Silva

发表机构 * UT-Battelle, LLC(UT-Battelle有限责任公司) US Department of Energy (DOE)(美国能源部)

AI总结 提出AgentBuild框架,通过科学家编写的合同(包含评分标准、课程和知识库)自动构建科学智能体,用于X射线衍射数据的Rietveld精修,实现可复用的智能体编译而非手动调优。

详情
AI中文摘要

随着科学工作流从确定性可执行文件转向基于LLM的智能体,现有的开发实践(如微调、强化学习和即时运行)掩盖了科学家的判断。我们建议将智能体构建视为一个工作流阶段,并引入AgentBuild,它根据科学家编写的合同构建科学智能体。该合同是一个版本控制的评分标准、一个难度分级的课程和一个精心策划的外部知识库。基于评分标准的裁判门控一个元优化编码智能体,该智能体在声明的边界内编辑智能体,因此构建编译的是智能体,而不是科学家的判断。我们通过MCP和A2A背后的GSAS-II将其实例化用于X射线衍射数据的Rietveld精修,其中空白框架构建运行通过锂镧锆氧(LLZO)信噪比阶梯,达到4小时扫描作为前沿案例,并暴露了工作流范围限制。相同的评分标准既奖励可信的拟合,也评分轨迹范围,使前沿成为合同失败而非模式拟合失败。随着基础模型的发展,重新运行AgentBuild是重新调整,而不是重建,科学家编写的合同仍然是持久的资产。

英文摘要

As scientific workflows shift from deterministic executables to LLM-based agents, the development practices on offer, such as fine-tuning, reinforcement learning, and prompt-and-go, bury the scientist's judgment. We propose treating agent construction as a workflow stage and introduce AgentBuild, which builds a scientific agent from a contract the scientist authors. The contract is a version-controlled rubric, a difficulty-graded curriculum, and a curated external knowledge base. A rubric-driven judge gates a meta-optimizer coding agent that edits the agent within a declared boundary, so the build compiles the agent, not the scientist's judgment. We instantiate this for Rietveld refinement of X-ray diffraction data through GSAS-II behind MCP and A2A, where a blank-harness construction run progresses through a lithium lanthanum zirconium oxide (LLZO) signal-to-noise ladder, reaches the 4 hour scan as a frontier case, and exposes the workflow-scope limits that remain. The same rubric that rewards credible fits also scores trajectory scope, making the frontier a contract failure rather than a pattern-fitting failure. As base models evolve, re-running AgentBuild is a re-tune, not a rebuild, and the scientist's authored contract remains the durable asset.

2606.12830 2026-06-12 cs.CV cs.AI 新提交

Perceive, Interact, Reason: Building Tool-Augmented Visual Agents for Spatial Reasoning

感知、交互、推理:构建工具增强的视觉智能体用于空间推理

Changye Li, Meng Lu, Yi Wu, Ligeng Zhu

发表机构 * Tsinghua University(清华大学) Virginia Tech(弗吉尼亚理工大学) NVIDIA(英伟达)

AI总结 提出PERIA智能体,通过视觉感知和交互工具增强VLM的空间推理能力,在13个基准上优于同类模型7.0%-14.8%。

详情
AI中文摘要

尽管最近的视觉语言模型(VLM)展示了强大的多模态理解能力,但在需要主动证据获取和多步视觉交互的空间推理任务中仍存在局限。这种局限性表明,仅依赖视觉编码器的隐式视觉表示不足以恢复细粒度的空间证据。我们引入了PERception-Interaction-reason Agent(PERIA),一种用于地图推理、视觉探测和视觉重建等空间推理任务的工具增强视觉智能体。PERIA使用两类轻量工具:视觉感知工具用于暴露文本、符号和空间证据,以及视觉交互工具用于操作视觉上下文、追踪路径和验证空间关系。为了训练PERIA,我们开发了一种统一方案,结合了监督式工具使用轨迹合成、复合奖励和观察松弛的组内组策略优化(OR-GIGPO),以实现有效的多工具行为。在来自8个数据集的13个基准上的实验表明,PERIA-8B在分布内基准上比Qwen3-8B骨干网络提高了10.0%,在分布外基准上提高了4.4%,同时比之前类似规模的先进基线高出7.0%-14.8%。它还实现了与更大模型(如Qwen3-VL-235B-A22B-Thinking和GPT-5)相当的性能,证明了PERIA在增强空间推理能力方面的有效性。

英文摘要

While recent vision-language models (VLMs) demonstrate strong multimodal understanding, they remain limited in spatial reasoning tasks that require active evidence acquisition and multi-step visual interaction. This limitation suggests that relying solely on implicit visual representations from vision encoders is insufficient for recovering fine-grained spatial evidence. We introduce PERception-Interaction-reason Agent (PERIA), a tool-augmented visual agent for spatial reasoning tasks across map reasoning, visual probing, and vision reconstruction. PERIA uses two lightweight tool families: vision perception tools for exposing textual, symbolic, and spatial evidence, and vision interaction tools for manipulating visual context, tracing paths, and verifying spatial relations. To train PERIA, we develop a unified recipe that combines supervised tool-use trajectory synthesis, composite rewards, and Observation-Relaxed Group-in-Group Policy Optimization (OR-GIGPO) for effective multi-tool behavior. Experiments on 13 benchmarks from 8 datasets show that PERIA-8B improves over the Qwen3-8B backbone by 10.0% on in-distribution benchmarks and 4.4% on out-of-distribution benchmarks, while outperforming previous state-of-the-art baselines of similar size by 7.0%-14.8%. It also achieves performance comparable to much larger models such as Qwen3-VL-235B-A22B-Thinking and GPT-5, demonstrating the effectiveness of PERIA in enhancing spatial reasoning capabilities.

2606.12826 2026-06-12 cs.CV cs.AI 新提交

DIMOS: Disentangling Instance-level Moving Object Segmentation

DIMOS: 解耦实例级运动目标分割

Hongxiang Huang, Hongwei Ren, Xiaopeng Lin, Yulong Huang, Zeke Xie, Bojun Cheng

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出双解耦特征提取框架分离图像与事件模态的外观和运动信息,并通过多粒度跨模态对齐实现有效融合,在运动实例分割任务中尤其对快速运动和低光下的小目标取得最优性能。

详情
AI中文摘要

运动实例分割(MIS)因其在交通监控、自动驾驶和动物追踪等领域的广泛应用而日益受到关注。事件相机记录异步亮度变化,提供高时间分辨率和动态范围,使其对运动信息高度敏感。通过融合事件和图像特征,事件中的运动线索可以补充图像中的空间细节,从而提升MIS的性能。然而,当前的多模态MIS方法仍然难以分割小的运动实例,因为事件相机在有限分辨率下往往产生稀疏特征。此外,事件特征将外观属性与运动线索纠缠在一起,进一步限制了有效的跨模态融合。为解决这些挑战,我们首先提出一个双解耦特征提取框架,在图像和事件模态内分离并提取外观和运动信息,从而改善特征密度。随后,引入多粒度跨模态对齐,以对齐跨模态分布和语义一致的特征,实现具有丰富空间和时间细节的更有效融合。实验结果表明,我们的方法在多模态MIS中达到了最先进的性能,特别是在快速运动和低光等挑战性条件下的小实例分割方面。

英文摘要

Moving instance segmentation (MIS) attracts increasing attention due to its broad applications in traffic surveillance, autonomous driving, and animal tracking. Event cameras record asynchronous brightness changes, providing high temporal resolution and dynamic range, which makes them highly sensitive to motion information. By fusing event and image features, motion cues from events can complement spatial details from images, enhancing the performance of MIS. However, current multimodal MIS methods still struggle to segment small moving instances, as event cameras often yield sparse features under limited resolution. Moreover, event features entangle appearance attributes with motion cues, which further restricts effective cross-modal fusion. To address these challenges, we first propose a dual-disentangling feature extraction framework that separates and extracts appearance and motion information within both image and event modalities, thereby improving feature density. Subsequently, a multi-granularity cross-modal alignment is introduced to align distributionally and semantically consistent features across modalities, enabling more effective fusion with rich spatial and temporal details. The experiment results demonstrate that our method achieves state-of-the-art performance in multimodal MIS, especially for small instances under challenging conditions such as fast motion and low-light settings.

2606.12821 2026-06-12 cs.AI cs.ET 新提交

GeoNatureAgent Benchmark: Benchmarking LLM Agents for Environmental Geospatial Analysis Across Frontier and Open-Weight Foundation Models

GeoNatureAgent Benchmark:面向前沿与开源基础模型的环境地理空间分析LLM智能体基准测试

Gabriel Diaz-Ireland, Diego Prieto-Herráez, Mario García Peces, Javier Velázquez, Devika Jain

发表机构 * Universidad Católica de Ávila (UCAV)(阿维拉天主教大学) Johns Hopkins University(约翰霍普金斯大学) Independent Researcher(独立研究者) Center for Geographic Analysis, Harvard University(哈佛大学地理分析中心)

AI总结 提出首个通过结构化工具调用真实API评估环境分析智能体的基准,包含93个任务,发现Claude Sonnet 4领先,但开源模型在成本效益上占优,且比较任务普遍未解决。

Comments Preprint. 10 pages, 8 figures. Submitted to ACM SIGSPATIAL 2026

详情
AI中文摘要

环境科学家在数据整理而非分析上花费了不成比例的精力,而自动化地理空间工作流的AI智能体仍未得到验证:没有基准通过结构化工具调用评估智能体对真实API的操作。我们引入了GeoNatureAgent Benchmark,这是首个通过结构化工具调用生产级地理空间API进行环境分析智能体的基准。它包含18个类别的93个任务,涵盖市政分析、多轮对话、空间推理、跨指标综合、错误处理与恢复、排序、比较、多语言理解、栖息地分析和任务拒绝。任务通过一个开放、可自托管的API进行评估,该API通过16个工具提供西班牙和葡萄牙的三个环境指标。我们评估了七个LLM(Claude Sonnet 4、DeepSeek V3.2、GLM-5、Gemini 2.5 Pro、Qwen3-235B、GPT-OSS-120B、Llama 4 Scout),在三个温度1.0的随机种子下,报告能力与每案例成本作为正交轴。我们发现:(1)Claude Sonnet 4以60.8%±0.8%领先,其次是DeepSeek V3.2的56.3%±3.1%,其他模型均未超过51%;(2)成本-准确率帕累托前沿主要由开源模型占据,DeepSeek V3.2以11倍低的成本(每案例0.011美元)提供Claude 93%的能力;(3)比较任务普遍未解决(接近值比较上为0%),暴露了系统性的推理限制;(4)针对真实API的结构化工具调用比通用GIS基准更具区分度,准确率低25-35个百分点。我们进一步展示了可扩展性,将葡萄牙的BigEarthNet V2土地覆盖与西班牙的CO2和侵蚀指标集成。该基准、工具集和可自托管API均已公开。

英文摘要

Environmental scientists spend disproportionate effort on data wrangling rather than analysis, and AI agents that automate geospatial workflows remain unvalidated: no benchmark evaluates agents operating through structured tool calling against real APIs. We introduce the GeoNatureAgent Benchmark, the first benchmark for environmental analysis agents that operate via structured tool calls to a production-style geospatial API. It comprises 93 tasks across 18 categories, covering municipality analysis, multi-turn conversation, spatial reasoning, cross-indicator synthesis, error handling and recovery, ranking, comparison, multilingual understanding, habitat analysis, and task rejection. Tasks are evaluated against an open, self-hostable API serving three environmental indicators across Spain and Portugal via sixteen tools. We evaluate seven LLMs (Claude Sonnet 4, DeepSeek V3.2, GLM-5, Gemini 2.5 Pro, Qwen3-235B, GPT-OSS-120B, Llama 4 Scout) under three temperature-1.0 seeds, reporting capability and per-case cost as orthogonal axes. We find: (1) Claude Sonnet 4 leads at 60.8% +/- 0.8%, followed by DeepSeek V3.2 at 56.3% +/- 3.1%, with no other model above 51%; (2) the cost-accuracy Pareto frontier is occupied mostly by open-weight models, with DeepSeek V3.2 offering 93% of Claude's capability at 11x lower cost ($0.011/case); (3) comparison tasks remain universally unsolved (0% on close-value comparisons), exposing systematic reasoning limits; and (4) structured tool calling against a real API is more discriminative than general-purpose GIS benchmarks, with accuracies 25-35 points lower. We further show extensibility by integrating BigEarthNet V2 land cover for Portugal alongside Spanish CO2 and erosion indicators. The benchmark, harness, and self-hostable API are publicly available.

2606.12818 2026-06-12 cs.CL cs.AI 新提交

Localizing Anchoring Pathways in Language Models

定位语言模型中的锚定路径

Hillary N. Owusu, Sarah Wiegreffe, Naomi H. Feldman

发表机构 * University of Maryland, College Park(马里兰大学帕克分校)

AI总结 研究提示中无关数字如何影响语言模型数值推理的锚定效应,通过logit差值度量和电路归因定位,发现边级方法优于节点级方法,并揭示锚定路径的共享与迁移特性。

详情
AI中文摘要

提示中的无关数字可以改变语言模型的判断,在数值推理中产生锚定效应。我们使用共享答案选项的受控多项选择设置,研究这种锚定敏感信号在语言模型内部的携带位置。我们定义了一个logit差值度量,比较正确答案选项与对应锚点的答案选项,并验证其追踪行为锚定。通过对7B-8B Qwen和Llama基础及指令微调模型进行基于归因的电路定位,我们发现边级方法比节点级方法更忠实地恢复该信号。低锚和高锚电路在模型内部强迁移,表明跨锚定方向存在共享路径结构。然而,基础模型和指令微调变体之间的稀疏迁移可靠性较低,表明后训练改变了哪些路径最重要。总体而言,我们的结果为锚定相关决策信号如何在语言模型内部携带提供了机制性解释。

英文摘要

Irrelevant numbers in a prompt can shift language model judgments, producing anchoring effects in numerical reasoning. We study where this anchor-sensitive signal is carried inside language models using a controlled multiple-choice setup with shared answer options. We define a logit-difference metric comparing the correct answer option with the answer option corresponding to the anchor, and validate that it tracks behavioral anchoring. Using attribution-based circuit localization on 7B--8B Qwen and Llama base and instruction-tuned models, we find that edge-level methods recover this signal more faithfully than node-level methods. Low- and high-anchor circuits transfer strongly within a model, suggesting shared pathway structure across anchor direction. However, sparse transfer across base and instruction-tuned variants is less reliable, indicating that post-training changes which pathways matter most. Overall, our results provide a mechanistic account of how anchoring-related decision signals are carried inside language models.

2606.12814 2026-06-12 cs.RO cs.AI 新提交

Stubborn: A Streamlined and Unified Reinforcement Learning Framework for Robust Motion Tracking and Fall Recovery for Humanoids

Stubborn: 一种用于人形机器人鲁棒运动跟踪与摔倒恢复的流线型统一强化学习框架

Xiao Ren, Yuhui Yang, Zongbiao Weng, Zhijie Liu, He Kong

发表机构 * Southern University of Science and Technology(南方科技大学)

AI总结 提出Stubborn框架,通过非对称Actor-Critic架构、偏航对齐表示、伯努利概率终止机制和自适应采样策略,统一实现人形机器人的运动跟踪与摔倒恢复,在性能与鲁棒性上超越现有方法。

详情
AI中文摘要

最近的强化学习方法在改善人形机器人运动跟踪性能和实现扰动下的摔倒恢复方面显示出巨大潜力。然而,现有大多数工作将运动跟踪和摔倒恢复视为不同任务,需要多阶段训练,并配备专门的恢复奖励和/或独立的恢复策略。此外,现有的基于强化学习的方法通常在严重跟踪失败后立即终止训练回合,限制了在不稳定或摔倒状态下的恢复导向探索。为了解决上述问题,我们提出了Stubborn,一个流线型统一的强化学习框架,用于实现鲁棒的人形机器人运动跟踪和摔倒恢复。具体来说,Stubborn采用非对称Actor-Critic架构,包含三个主要组件。首先,采用偏航对齐的跟踪表示,以减少对全局漂移和航向扰动的敏感性,同时保留与重力相关的平衡信息。其次,我们引入基于伯努利的概率终止机制,使策略能够在不同失败模式下鼓励探索摔倒恢复行为。第三,我们提出一种概率终止和跟踪误差驱动的策略,根据跟踪性能动态重塑采样分布,提高困难运动片段和不稳定状态的训练效率。与最先进方法的广泛比较和消融研究表明,Stubborn取得了有竞争力的性能,所提出的概率终止机制和自适应采样策略有助于性能和鲁棒性的提升。真实世界演示请参见此https URL。

英文摘要

Recent reinforcement learning approaches have shown great promise in improving humanoid motion tracking performance and achieving fall recovery under disturbances. However, most existing works treat motion tracking and fall recovery as different tasks and require multi-stage training with specialized recovery rewards and/or separate recovery policies. Moreover, existing reinforcement learning-based methods often terminate training episodes immediately after severe tracking failures, limiting recovery-oriented exploration in unstable or fallen states. To address the above issues, we propose Stubborn, a streamlined and unified reinforcement learning framework to achieve robust humanoid motion tracking and fall recovery. Specifically, Stubborn uses an asymmetric Actor-Critic architecture and consists of three major components. First, a yaw-aligned tracking representation is adopted to reduce sensitivity to global drift and heading disturbances while preserving gravity-related balance information. Second, we introduce a Bernoulli-based probabilistic termination mechanism that enables the policy to encourage exploration of fall-recovery behaviors under varying failure modes. Third, we propose a probabilistic termination and tracking-error-driven strategy that dynamically reshapes the sampling distribution based on tracking performance, increasing the training efficiency for difficult motion segments and unstable states. Extensive comparisons with SOTA methods and ablation studies show that Stubborn achieved competitive performance, and the proposed probabilistic termination mechanism and adaptive sampling strategy contributed to the performance and robustness gains. For real-world demonstrations, please refer to https://aislab-sustech.github.io/Stubborn/.

2606.12809 2026-06-12 cs.AI cs.LG 新提交

MLUBench: A Benchmark for Lifelong Unlearning Evaluation in MLLMs

MLUBench: 多模态大语言模型终身遗忘评估基准

He Li, Haoang Chi, Qizhou Wang, Yunxin Mao, Zhiheng Zhang, Jie Tan, Tongliang Liu, Wenjing Yang, Bo Han

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出MLUBench基准,评估多模态大模型在连续遗忘请求下的性能,发现现有方法存在累积退化,并揭示多模态对齐保持的挑战,提出LUMoE方法缓解退化。

Comments 36 pages, accepted to the ICML 2026

详情
AI中文摘要

多模态大语言模型(MLLMs)在海量多模态数据上训练,使得数据遗忘变得越来越重要,因为数据所有者可能要求移除特定内容。实际上,这些请求通常随时间顺序到达,引发了MLLM终身遗忘这一具有挑战性的问题。然而,现有大多数基准在规模和范围上有限,未能捕捉MLLM终身遗忘的复杂性。为填补这一空白,我们引入了MLUBench,一个大规模、全面的基准,包含9个类别下的127个实体,用于终身遗忘请求。我们使用MLUBench进行了大量实验,揭示出现有遗忘方法遭受严重且累积的退化。更重要的是,我们进一步识别出该问题的独特挑战:与单模态模型不同,MLLM终身遗忘受到保持多模态对齐需求的约束。持续从一种模态遗忘可能会退化整个模型。为缓解这一挑战,我们提出了LUMoE,一种有效方法。实验表明,LUMoE显著缓解了基线方法面临的退化问题。源代码和MLUBench数据集已在此https URL开源。

英文摘要

Multimodal large language models (MLLMs) are trained on massive multimodal data, making data unlearning increasingly important as data owners may request the removal of specific content. In practice, these requests often arrive sequentially over time, giving rise to the challenging problem of MLLM Lifelong Unlearning. However, most existing benchmarks are limited in scale and scope, failing to capture the complexities of MLLM lifelong unlearning. To fill this gap, we introduce the MLUBench, a large-scale and comprehensive benchmark featuring 127 entities across 9 classes under lifelong unlearning requests. We perform extensive experiments using MLUBench and reveal that existing unlearning methods suffer from severe, cumulative degradation. More critically, we further identify the unique challenge of this problem: unlike in unimodal models, MLLM lifelong unlearning is constrained by the need to preserve multimodal alignment. Continually unlearning from one modality could degrade the entire model. To alleviate this challenge, we propose LUMoE, an effective method. Experiments demonstrate that LUMoE significantly mitigates the degradation problem faced by baselines. The source code and the MLUBench dataset are open-sourced in https://github.com/lihe-maxsize/Lifelong_Unlearning_main.

2606.12808 2026-06-12 cs.LG cs.AI 新提交

SymQNet: Amortized Acquisition for Low-Latency Adaptive Hamiltonian Learning

SymQNet: 低延迟自适应哈密顿量学习的摊销获取

Yash Vardhan Tomar, Dheeraj Peddireddy, Vaneet Aggarwal

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出SymQNet,一种摊销强化学习方法,通过离线学习后验条件获取策略,在线快速前向传播,显著降低自适应哈密顿量学习的获取延迟。

详情
AI中文摘要

自适应哈密顿量学习对于校准和表征量子设备至关重要。在自适应控制器中,选择下一个实验本身就是一个计算。贝叶斯设计规则在每次后验更新后重新计算,这一步可能需要几秒钟。在数百次试验中,这些秒数成为自适应性的显著墙钟成本。我们引入SymQNet,一种用于低延迟自适应哈密顿量学习的摊销强化学习方法。SymQNet离线学习后验条件获取策略,然后在线使用快速策略前向传播,同时保留贝叶斯后验反馈。在横向场伊辛基准测试中,相对于有界Fisher信息搜索和有界两步贝叶斯主动学习(BALD),SymQNet显著降低了获取延迟。在五量子比特时,相对于这些在线基线,它仅获取决策延迟降低了$47.1\ imes$和$72.6\ imes$;在十二量子比特时,SymQNet的完整模拟步骤需要$1.02$秒,而有界两步BALD需要$13.27$秒。总体而言,我们表明学习获取可以使自适应哈密顿量学习对于重复的低延迟工作负载变得实用。

英文摘要

Adaptive Hamiltonian learning is central to calibrating and characterizing quantum devices. In an adaptive controller, choosing the next experiment is itself a computation. Bayesian design rules are recomputed after every posterior update, and that step can take seconds. Across hundreds of shots, those seconds become a significant wall-clock cost for adaptivity. We introduce SymQNet, an amortized reinforcement-learning approach for low-latency adaptive Hamiltonian learning. SymQNet learns a posterior-conditioned acquisition policy offline, then uses a fast policy forward pass online while retaining Bayesian posterior feedback. On transverse-field Ising benchmarks, SymQNet substantially reduces acquisition latency relative to bounded Fisher-information search and bounded two-step Bayesian active learning by disagreement (BALD). At five qubits, it reduces acquisition-only decision latency by $47.1\times$ and $72.6\times$ relative to these online baselines; at twelve qubits, full simulated steps take $1.02$ s for SymQNet versus $13.27$ s for bounded two-step BALD. Overall, we show that learned acquisition can make adaptive Hamiltonian learning practical for repeated low-latency workloads.

2606.12807 2026-06-12 cs.CL 新提交

Detect, Remask, Repair: Diffusion Editing for Faithful Summarization of Evolving Contexts

检测、重掩、修复:面向动态上下文忠实摘要的扩散编辑

Hao Zou, Zachary Horvitz, Chandhru Karthick, Zhou Yu, Kathleen McKeown

发表机构 * Columbia University(哥伦比亚大学)

AI总结 提出DETECT-REMASK-REPAIR框架,利用掩码扩散语言模型识别并修复摘要中过时内容,在保持支持内容的同时实现局部忠实性修复,并引入StreamSum基准评估。

详情
AI中文摘要

现实世界事件的摘要可能随着上下文演变和新信息的到来而过时。常见的做法是从更新后的上下文生成新摘要,但完全重新生成会丢弃之前的草稿,可能掩盖变化,并且当只有少数声明不支持时可能不必要。我们研究局部忠实性修复:在保留支持内容的同时更新现有摘要中的过时片段。我们提出DETECT-REMASK-REPAIR,一个基于扩散的框架,通过掩码扩散语言模型识别、重新掩码并修复过时区域。为了评估动态上下文摘要,我们引入了StreamSum,一个合成事件时间线的基准。在DialogSum和StreamSum上的实验表明,局部扩散修复提供了一种可控的替代完全重写的方法:忠实性导向的修复改进了早期草稿,一步修复将修复成本降低到半秒以下,该框架实现了跨数据集的忠实性-速度-保留权衡。我们还发现该框架可以作为事后修正步骤,提高自回归系统的忠实性。

英文摘要

Summaries of real-world events can become outdated as contexts evolve and new information arrives. A common response is to generate a new summary from the updated context, but full regeneration discards the previous draft, can obscure what changed, and may be unnecessary when only a few claims are unsupported. We study localized faithfulness repair: updating outdated spans in an existing summary while preserving supported content. We propose DETECT-REMASK-REPAIR, a diffusion-based framework that identifies, remasks, and repairs outdated regions with masked diffusion language models. To evaluate evolving-context summarization, we introduce StreamSum, a benchmark of synthetic event timelines. Experiments on DialogSum and StreamSum show that localized diffusion repair provides a controllable alternative to full rewriting: faithfulness-steered repair improves early drafts, one-step repair reduces repair cost to under half a second, with the framework enabling faithfulness-speed-preservation tradeoffs across datasets. We also find that the framework can provide a post-hoc correction step that improves faithfulness for autoregressive systems.

2606.12797 2026-06-12 cs.AI 新提交

The Containment Gap: How Deployed Agentic AI Frameworks Fail Public-Facing Safety Requirements

遏制缺口:已部署的自主AI框架如何未能满足面向公众的安全要求

Md Jafrin Hossain, Mohammad Arif Hossain, Weiqi Liu, Nirwan Ansari

发表机构 * New Jersey Institute of Technology(新泽西理工学院)

AI总结 研究发现主流自主AI框架缺乏架构级安全保证,内存完整性漏洞可导致定向腐败,提出轻量级遏制机制消除攻击向量。

Comments ICML 2026 (AI4GOOD Workshop)

详情
AI中文摘要

自主调用工具、维护持久内存并执行多步计划的大语言模型系统越来越多地部署在面向公众的领域,包括政府服务、医疗分诊和财务咨询。我们询问用于构建这些系统的框架是否提供架构级结构安全保证。应用从自主架构的组合模型导出的六项遏制原则,我们审计了三个主流框架(LangChain、AutoGPT和OpenAI Agents SDK),发现没有一个原生合规。内存完整性,一种针对最普遍漏洞类别的防御,在三个评估框架中均未观察到。我们通过实证验证这些发现:在基于LangChain构建的模拟政府福利代理中,单次内存投毒写入在所有测试种子和后端上引起持久定向腐败,使目标申请人的错误拒绝率升至88.9%。在复杂的五因素政策下,同一攻击保持总体准确率,同时将目标错误拒绝率提高3.5倍,使腐败难以通过标准监控检测。然后我们引入两种轻量级遏制机制:内存完整性验证器和策略门,它们以亚毫秒开销(每次调用<0.2ms)消除了两种攻击向量。我们得出结论,当前的自主框架生态系统可能尚未满足面向公众部署的默认安全期望,并概述了优先架构干预措施,以实现在高风险、对社会有影响的应用程序中的可信部署。

英文摘要

Agentic large language model systems that autonomously invoke tools, maintain persistent memory, and execute multi-step plans are increasingly deployed in public-facing domains, including government services, healthcare triage, and financial advising. We ask whether the frameworks used to build these systems provide architectural-level structural safety guarantees. Applying six containment principles derived from a compositional model of agentic architectures, we audit three dominant frameworks (LangChain, AutoGPT, and OpenAI Agents SDK) and find no native compliance in any of them. Memory integrity, a defense against one of the most prevalent vulnerability classes, is not observed in any of the three evaluated frameworks. We validate these findings empirically: in a simulated government benefits agent built on LangChain, a single memory-poisoning write induces persistent targeted corruption across all tested seeds and backends, increasing the wrongful denial rate for targeted applicants to 88.9%. Under a complex five-factor policy, the same attack preserves aggregate accuracy while increasing targeted wrongful denials by 3.5x, rendering the corruption difficult to detect through standard monitoring. We then introduce two lightweight containment mechanisms: a memory integrity validator and a policy gate, which eliminate both attack vectors with sub-millisecond overhead (<0.2ms per call). We conclude that the current agentic framework ecosystem may not yet meet secure-by-default expectations for public-facing deployments and outline priority architectural interventions to enable trustworthy deployment in high-stakes, socially impactful applications.

2606.12790 2026-06-12 cs.CL 新提交

GENIE: A Fine-Grained Measure for Novelty

GENIE:一种细粒度新颖性度量方法

Ramya Namuduri, Manya Wadhwa, Anshun Asher Zheng, Greg Durrett, Junyi Jessy Li

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) New York University(纽约大学)

AI总结 提出GENIE指标,通过任务特定特征细粒度衡量模型生成内容的新颖性,克服整体指标无法捕捉高维新颖性的局限。

详情
AI中文摘要

大型语言模型在各项任务中持续表现出缺乏创造力和多样性。先前的工作主要关注模型是否能够生成创造性输出。本文旨在考虑新颖性,并以任务特定方式研究模型生成内容的新颖性。我们提出了一种细粒度评估指标GENIE,用于根据响应群体中的任务特定特征来衡量响应的新颖性。我们表明,与GENIE不同,整体指标难以捕捉新颖性的高维性,并且无法提供关于它们针对哪些属性的见解。最后,我们使用GENIE来衡量解决创造力问题的缓解方法的有效性,以更好地理解这些方法在哪些方面可以提高新颖性。

英文摘要

Large Language Models have consistently demonstrated a lack of creativity and diversity across tasks. Prior work has focused on addressing whether models are capable of generating creative outputs. Here, we aim to consider novelty and investigate what makes model-generated content novel or not novel in a task-specific manner. We propose a fine-grained evaluation metric GENIE to measure the novelty of responses along task-specific features with respect to a population of responses. We show that unlike GENIE, holistic metrics struggle to capture the high-dimensionality of novelty and do not provide insight on which properties they target. Finally, we use GENIE to measure the effectiveness of mitigation methods that address creativity to better understand where these methods can improve novelty.

2606.12789 2026-06-12 cs.CL cs.IR 新提交

How Fine-Grained Should a RAG Benchmark Be? A Hierarchical Framework for Synthetic Question Generation

RAG基准测试应该有多细粒度?一个用于合成问题生成的层次化框架

Chase M. Fensore, Kaustubh Dhole, Jason Fan, Eugene Agichtein, Joyce C. Ho

发表机构 * Department of Computer Science, Emory University(埃默里大学计算机科学系)

AI总结 提出HieraRAG层次化框架,通过合成问题生成研究RAG基准测试的细粒度,发现最优粒度因维度而异,并引入一致性比率度量。

详情
AI中文摘要

评估检索增强生成(RAG)系统需要能够捕捉多样化问题特征的基准测试,然而实践者缺乏关于在哪些维度上变化以及以何种粒度变化的经验指导。我们提出了HieraRAG,一个用于研究RAG基准测试构建中粒度的层次化框架,将最优粒度定义为在给定RAG配置下最大化区分能力(各类别生成质量的标准差)的水平。作为案例研究,我们从FineWeb-10BT中生成了5,872个合成问答对,涵盖3个维度(问题复杂度、答案类型、语言变异)和3个粒度级别(2、4和8个类别)。使用BM25+Falcon-3-10B流水线,最优粒度因维度而异:复杂度受益于细粒度区分(区分能力:0.053),而答案类型和语言变异在中等粒度达到峰值。我们引入了一致性比率度量来量化细粒度划分是否干净地细分父类别,揭示了维度间的结构差异(问题复杂度:0.40 vs. 答案类型:1.44)。对110个分层问答对的人工评估确认了合成质量。虽然这些具体发现反映的是单一配置,但HieraRAG为实践者提供了可移植的程序和验证度量,以确定其自身RAG设置中的评估粒度。

英文摘要

Evaluating retrieval-augmented generation (RAG) systems requires benchmarks that capture diverse question characteristics, yet practitioners lack empirical guidance on which dimensions to vary and at what granularity. We present HieraRAG, a hierarchical framework for studying granularity in RAG benchmark construction, defining optimal granularity as the level that maximizes discriminative power (the standard deviation of generation quality across categories) within a given RAG configuration. As a case study, we generate 5,872 synthetic question-answer (QA) pairs from FineWeb-10BT across 3 dimensions (Question Complexity, Answer Type, Linguistic Variation) at 3 granularity levels (2, 4, and 8 categories). With a BM25+Falcon-3-10B pipeline, optimal granularity varies by dimension: complexity benefits from fine-grained distinctions (discriminative power: 0.053) while answer type and linguistic variation peak at medium granularity. We introduce a Coherence Ratio metric to quantify whether fine-grained splits cleanly subdivide parent categories, revealing structural differences across dimensions (Question Complexity: 0.40 vs. Answer Type: 1.44). Human evaluation of 110 stratified QA pairs confirms synthetic quality. While these specific findings reflect a single configuration, HieraRAG provides a portable procedure and validation metric for practitioners to determine evaluation granularity within their own RAG settings.

2606.12783 2026-06-12 cs.AI 新提交

A Tutorial on World Models and Physical AI

世界模型与物理AI教程

Il-Seok Oh

发表机构 * Department of Computer Science and Artificial Intelligence/CAIIT, Jeonju, Jeonbuk, South Korea(韩国全北全州计算机科学与人工智能系/CAIIT)

AI总结 本文提出统一框架,区分显式与隐式世界模型,并探讨其在机器人、自动驾驶等物理AI领域的应用,以及迈向通用人工智能的挑战。

详情
AI中文摘要

世界建模正成为构建具备预测、推理和决策能力的智能系统的核心原则。显式世界模型与隐式世界模型之间存在一个核心区别:前者学习结构化动态以进行基于推演的推理和规划,后者则将预测结构编码到可扩展的学习表示中。这些互补范式为机器人、自动驾驶等领域的物理AI奠定了基础,使其能够在现实世界约束下实现超越反应式控制的智能。近期的基础模型进一步指明了通向集成感知、预测和行动的通用系统的路径。尽管进展迅速,但在层次推理、长时域规划和自主目标形成方面仍存在重大挑战,这些对于迈向通用人工智能至关重要。本教程提出了一个连贯的框架,其中多种世界建模方法通过共享的预测结构得以统一,并通过这种结构的表示和利用方式加以区分。

英文摘要

World modeling is emerging as a central principle for building intelligent systems capable of prediction, reasoning, and decision making. A central distinction can be drawn between explicit world models, which learn structured dynamics for rollout-based reasoning and planning, and implicit world models, which encode predictive structure within scalable learned representations. These complementary paradigms provide a foundation for physical AI in domains such as robotics and autonomous driving, enabling intelligence beyond reactive control under real-world constraints. Recent foundation models further suggest a pathway toward unified systems integrating perception, prediction, and action. Despite rapid progress, major challenges remain in hierarchical reasoning, long-horizon planning, and autonomous goal formation, which are critical for advancing toward artificial general intelligence. This tutorial presents a coherent framework in which diverse world modeling approaches are unified through shared predictive structure and differentiated by how such structure is represented and exploited.

2606.12780 2026-06-12 cs.LG cs.CL 新提交

ProPlay: Procedural World Models for Self-Evolving LLM Agents

ProPlay: 用于自我进化LLM智能体的程序化世界模型

Yijun Ma, Zehong Wang, Yiyang Li, Ziming Li, Xiaoguang Guo, Weixiang Sun, Chuxu Zhang, Yanfang Ye

发表机构 * University of Notre Dame(圣母大学) University of Connecticut(康涅狄格大学)

AI总结 提出ProPlay程序化世界模型,通过程序级预演和因果过程图,使LLM智能体在部分可观测环境中自我进化,无需外部监督。

详情
AI中文摘要

自我进化智能体应能在无外部监督下通过交互改进,但在部分可观测环境中仍困难,智能体必须主动探索、从有限反馈中学习,并决定何时信任先前经验。现有的LLM智能体方法通常依赖记忆或规划模块,但很少在它们之间闭环以持续完善对环境动态的内部理解。我们提出ProPlay,一种程序化世界模型,支持程序级预演,智能体可利用学到的世界知识排练未来的程序路径。ProPlay不将经验表示为孤立的规则或低层动作约束,而是将成功轨迹抽象为程序,并在捕获任务阶段间因果转换的程序图中组织它们。每个转换与一个可靠性记录嵌入相关联,以从过去结果中估计其任务特定贡献。在每个回合前,ProPlay在已知图结构上模拟未来程序轨迹作为结构化软指导;执行后,它利用环境反馈精炼图。在公开基准上的实验表明,ProPlay在环境理解和自我进化能力上持续优于强基线。我们的代码已在此https URL发布。

英文摘要

Self-evolving agents are expected to improve through interaction without external supervision, but this remains difficult in partially observable environments where agents must explore actively, learn from limited feedback, and decide when to trust prior experience. Existing LLM-agent methods often rely on memory or planning modules, yet they rarely close the loop between them to continually refine an internal understanding of environment dynamics. We introduce ProPlay, a procedural world model that supports procedure-level preplay, where agents can rehearse future procedural paths using the learned world knowledge. Rather than representing experience as isolated rules or low-level action constraints, ProPlay abstracts successful trajectories into procedures and organizes them in a procedure graph that captures causal transitions among task stages. Each transition is associated with a reliability record embedding to estimate its task-specific contribution from past outcomes. Before each episode, ProPlay simulates future procedural trajectories over known graph structures as structured soft guidance; after execution, it refines the graph using environment feedback. Experiments on public benchmarks show that ProPlay consistently improves environment understanding and self-evolution capability over strong baselines. Our code has been released in https://github.com/antman9914/proplay.

2606.12767 2026-06-12 cs.AI 新提交

Constructing Evaluation Datasets for Procedural Reasoning: Balancing Naturalness, Grounding, and Multi-Hop Coverage

构建程序性推理评估数据集:平衡自然性、基础性和多跳覆盖

Sarah Elshabrawy, Rahul K. Dass, Ashok K. Goel

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 研究基于任务-方法-知识(TMK)模型的问题生成策略对程序性和多跳推理数据集质量的影响,提出基础性验证框架,发现严格TMK生成策略在基础性和可用性上最优。

Comments 10 pages, 2 numbered figures. Workshop submission to HAIL @ AIED 2026

详情
AI中文摘要

评估AI辅助学习系统中的程序性推理需要问答数据集,这些数据集既要像学习者一样,又要基于系统预期使用的教学知识。我们研究了基于TMK的问题生成策略如何影响程序性和多跳推理的数据集质量。我们比较了三种策略:从任务-方法-知识(TMK)模型严格生成、先转录后基于TMK过滤的生成、以及结合转录和结构化指导的TMK感知生成。为了评估生成的项目,我们引入了一个基于从TMK模型中提取的闭集证据单元的基础性验证框架。该框架衡量答案是否由底层表示支持、问题是否自包含、以及是否针对多跳程序性推理。在23个教学主题和690个生成的问答对中,严格TMK生成实现了最强的整体质量,其中96.5%的问题有基础,92.6%的问题可用。先转录生成产生更像学习者的问题,但更多是上下文依赖或基础薄弱的问题,而TMK感知生成产生较高的原始多跳覆盖率但基础性较低。这些结果表明,程序丰富性和自然措辞并不能保证表示基础性,这促使在AI辅助学习中的评估数据集需要进行显式的表示感知验证。

英文摘要

Evaluating procedural reasoning in AI-supported learning systems requires question-answer datasets that are both learner-like and grounded in the instructional knowledge the system is expected to use. We study how TMK-based question generation strategies affect dataset quality for procedural and multi-hop reasoning. We compare three strategies: strict generation from Task-Method-Knowledge (TMK) models, transcript-first generation with post-hoc TMK filtering, and TMK-aware generation that combines transcripts with structured guidance. To evaluate generated items, we introduce a grounding validation framework based on closed-set evidence units extracted from TMK models. The framework measures whether answers are supported by the underlying representation, whether questions are self-contained, and whether they target multi-hop procedural reasoning. Across 23 instructional topics and 690 generated question-answer pairs, strict TMK generation achieves the strongest overall quality, with 96.5% grounded questions and 92.6% usable questions. Transcript-first generation produces more learner-like questions but more context-dependent or weakly grounded items, while TMK-aware generation yields high raw multi-hop coverage but lower grounding. These results show that procedural richness and natural phrasing do not guarantee representational grounding, motivating explicit representation-aware validation for evaluation datasets in AI-supported learning.

2606.12765 2026-06-12 cs.CL cs.DC 新提交

Rigel: Reverse-Engineering the Metal 4.1 Tensor Compute Path on the Apple M4 Max GPU

Rigel:逆向工程 Apple M4 Max GPU 上的 Metal 4.1 张量计算路径

Ramchand Kumaresan

发表机构 * Apple Inc.(苹果公司)

AI总结 通过微基准测试逆向工程 Apple M4 Max 的 Metal 4.1 张量计算路径,揭示 fp8 matmul2d 为模拟而非硬件加速,并重建了 8x8 张量片段布局。

详情
AI中文摘要

Apple 的 Metal 4.1 暴露了一条张量计算路径:基于 cooperative_tensor 片段的 Metal Performance Primitives (MPP) matmul2d 操作,其接口有文档记录,但硬件行为被故意隐藏。规范说明了支持哪些数据类型行,但从未说明它们是否经过硬件加速、操作在物理上何处执行、其累加器宽度是多少,或者如何在线程间划分矩阵片段。我们提出了 Rigel,这是对单个 Apple M4 Max(前神经加速器一代)上该路径的经验性表征。使用校验和门控、来源追踪的微基准测试工具,Rigel 恢复了 v4.1 规范隐藏或矛盾的十一个事实。主要发现:Metal 4.1 fp8 (E4M3) matmul2d 是模拟的,而非加速的:尽管读取的操作数字节数减半,但其吞吐量仅为 fp16 的 0.94 倍,因此在 M4 上它是一个内存占用特性,而非性能特性。我们进一步通过三信号三角测量(吞吐量上限、与 simdgroup_matrix 的比较以及每路功率归因)表明,matmul2d 完全在 GPU 着色器核心上执行,没有专用的矩阵数据路径,也没有证据表明路由到 Apple 神经引擎;它使用 >=fp32 累加;并且我们重建了 Apple 在任何地方都没有记录的 opaque 8x8 cooperative_tensor 片段布局。基于该表征,一个手动融合的 GEMM + bias + GELU 内核在缓存驻留状态下比分解路径快 6.5-12.9%。所有发现均可从 MIT 许可的代码和逐单元 CSV 中重现。

英文摘要

Apple's Metal 4.1 exposes a tensor compute path: the Metal Performance Primitives (MPP) matmul2d operation over cooperative_tensor fragments, whose interface is documented but whose hardware behavior is deliberately hidden. The specification states which data-type rows are supported, never whether they are hardware-accelerated, where the operation physically executes, what its accumulator width is, or how it partitions matrix fragments across threads. We present Rigel, an empirical characterization of this path on a single Apple M4 Max (a pre-neural-accelerator generation). Using a checksum-gated, provenance-tracked microbenchmark harness, Rigel recovers eleven facts the v4.1 specification hides or contradicts. The headline finding: the Metal 4.1 fp8 (E4M3) matmul2d is emulated, not accelerated: it sustains 0.94x the throughput of fp16 despite reading half the operand bytes, so on M4 it is a memory-footprint feature, not a performance feature. We further show, via a three-signal triangulation (throughput ceiling, comparison against simdgroup_matrix, and per-rail power attribution), that matmul2d executes entirely on the GPU shader cores with no dedicated matrix datapath and no evidence of Apple Neural Engine routing; that it accumulates in >=fp32; and we reconstruct the opaque 8x8 cooperative_tensor fragment layout Apple documents nowhere. Acting on the characterization, a hand-fused GEMM + bias + GELU kernel beats the decomposed path by +6.5-12.9% in the cache-resident regime. All findings are reproducible from committed MIT-licensed code and per-cell CSVs.

2606.12764 2026-06-12 cs.LG cs.CL cs.CR 新提交

Detecting Functional Memorization in Code Language Models

检测代码语言模型中的功能记忆

Matthieu Meeus, Anil Ramakrishna, Matthew Grange, Zheng Xu, Luca Melis

发表机构 * Meta Imperial College London(伦敦帝国学院)

AI总结 研究代码语言模型的功能记忆现象,通过反事实设置对比暴露目标代码的模型与未暴露的参考模型,使用文本和功能相似性度量,发现功能记忆超出文本重叠的检测范围。

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被用于大规模生成代码。同时,先前的工作通过审计训练示例与模型生成之间的文本重叠,研究了训练数据是否可以从模型输出中恢复。然而,代码可能在功能上等价而在文本上不相似。在这项工作中,我们研究了功能记忆:提取超出逐字指标检测的功能逻辑。我们为Olmo-3-32B构建了一个反事实设置,将中期训练模型(暴露于目标代码)与预训练参考模型(未暴露)进行比较。我们使用Python函数签名提示两个模型,并测量文本和功能相似性(即LLM作为评判者、基于执行)。我们的结果显示了功能记忆的明确证据,突出了需要超越文本重叠的审计指标。

英文摘要

Large language models (LLMs) are increasingly used to generate code at scale. Meanwhile, prior work has investigated whether training data may be recoverable from model outputs, by auditing the textual overlap between training examples and model generations. Code, however, can be functionally equivalent while textually dissimilar. In this work, we study functional memorization: extraction of functional logic beyond what verbatim metrics detect. We construct a counterfactual setup for Olmo-3-32B, comparing a midtrained model (exposed to target code) against a pretrained reference (not exposed). We prompt both models with Python function signatures and measure both textual and functional similarity (i.e., LLM-as-a-judge, execution-based). Our results show clear evidence of functional memorization, highlighting the need for auditing metrics that go beyond textual overlap.

2606.12763 2026-06-12 cs.LG cs.DS 新提交

Adaptive Weighted Averaging

自适应加权平均

Aditya Bhaskara, Ashok Cutkosky, Ravi Kumar, Manish Purohit

发表机构 * University of Utah(犹他大学) Boston University(波士顿大学) Google(谷歌)

AI总结 提出一种从单次无偏估计中选取最大未知值的方法,具有可容许性且不劣于基线,应用于随机优化获得在线到批次的转换界限。

详情
AI中文摘要

我们研究在仅对每个 $x_i$ 有一个无偏估计 $y_i$ 的情况下,从 $n$ 个未知值 $x_1,\dots,x_n$ 中选择最大值的问题。我们设计的策略同时具有可容许性(不被任何其他策略一致支配)且不劣于给定的基线(如均匀随机选择)。我们将其应用于随机优化,获得了具有理想“无妥协”保证的在线到批次转换界限:它们从不比标准随机迭代选择差,同时在良性设置中可以显著更好。

英文摘要

We study the problem of selecting the largest among $n$ unknown values $x_1,\dots,x_n$ given only a single unbiased estimate $y_i$ for each $x_i$. We design strategies that are simultaneously admissible (not uniformly dominated by any other strategy) and also never worse than a given baseline such as uniform random selection. We provide an application to stochastic optimization, where we obtain online-to-batch conversion bounds with a desirable "no-compromise" guarantee: they are never worse than standard random iterate selection, and yet can be significantly better in benign settings.

2606.12759 2026-06-12 cs.RO 新提交

Sparse2Act: Learning Action-Aligned Sparse 3D Representations for Cross-Domain Robot Manipulation

Sparse2Act: 学习跨域机器人操作的动作对齐稀疏3D表示

Yu Guo, Chang Yu, Siyu Ma, Yunuo Chen, Yin Yang, Ying Nian Wu, Chenfanfu Jiang

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校) University of California, San Diego(加州大学圣迭戈分校) University of Utah(犹他大学)

AI总结 提出Sparse2Act框架,通过动作对齐的掩码稀疏3D编码预训练,实现跨域机器人操作,在LIBERO-10上达86.9%成功率,并支持域迁移和sim-to-real。

详情
AI中文摘要

显式3D表示对于操作任务具有吸引力,因为它们以度量坐标暴露物体形状、工作空间几何以及机器人-物体关系。然而,稀疏3D编码器通常通过下游任务目标学习,将表示与特定数据分布、策略架构和动作参数化绑定。我们引入Sparse2Act,一个用于预训练稀疏点云编码器的观察-动作对齐框架。关键思想是使用任务空间末端执行器动作作为几何监督:训练掩码稀疏3D令牌以组织场景特征,使其围绕与观察配对的工作空间运动。预训练后,仅编码器初始化被下游策略重用,允许它们保留自己的架构和动作空间,包括关节空间命令。在LIBERO-10基准上,我们的方法在500步微调后达到86.9%的平均成功率。相同的预训练编码器支持LIBERO到Meta-World的跨域迁移,在Meta-World-5基准上达到73.4%的平均成功率。关于目标和解码器容量的消融实验表明,增益来自掩码动作对齐信号,并且在下游动作解码器中仍然有用。在真实世界实验中,模拟预训练后跟有限真实数据微调,在四个任务上平均成功率达到72.5%,展示了有效的模拟到真实迁移。这些结果表明,机器人动作可以为可重用的稀疏3D表示提供紧凑的几何监督。

英文摘要

Explicit 3D representations are attractive for manipulation because they expose object shape, workspace geometry, and robot-object relations in metric coordinates. However, sparse 3D encoders are often learned through downstream task objectives, tying the representation to a particular data distribution, policy architecture, and action parameterization. We introduce Sparse2Act, an observation-action alignment framework for pretraining sparse point-cloud encoders. The key idea is to use task-space end-effector actions as geometric supervision: masked sparse 3D tokens are trained to organize scene features around the workspace motion paired with the observation. After pretraining, only the encoder initialization is reused by downstream policies, allowing them to retain their own architectures and action spaces, including joint-space commands. On the LIBERO-10 benchmark, our method achieves 86.9% average success after 500 fine-tuning steps. The same pretrained encoder supports LIBERO-to-Meta-World cross-domain transfer, achieving 73.4% average success on the Meta-World-5 benchmark. Ablations on the objective and decoder capacity show that the gains come from the masked action-alignment signal and remain useful across downstream action decoders. In real-world experiments, simulation pretraining followed by limited real-data fine-tuning achieves an average success rate of 72.5% across four tasks, demonstrating effective sim-to-real transfer. These results suggest that robot actions can provide compact geometric supervision for reusable sparse 3D representations.

2606.12747 2026-06-12 cs.AI 新提交

Prefill Awareness in Large Language Models

大型语言模型中的预填充感知

Andy Wang, Parv Mahajan, David Demitri Africa, Alexandra Souly, Jordan Taylor, Robert Kirk

发表机构 * Constellation University of Wisconsin-Madison(威斯康星大学麦迪逊分校星座研究所) Constellation Georgia Institute of Technology(佐治亚理工学院星座研究所) UK AI Security Institute(英国人工智能安全研究所)

AI总结 研究大型语言模型能否识别并响应其助手消息被预填充或篡改,发现前沿模型具有显著预填充感知能力,可能影响安全评估方法。

Comments Submitted to NeurIPS 2026

详情
AI中文摘要

语言模型的安全相关研究,包括对齐和越狱评估以及AI控制协议,通常依赖于预填充模型输出。如果AI模型能够识别并利用其先前的助手消息被插入或编辑这一事实,这些方法的有效性和有效性可能会受到损害。我们调查了前沿语言模型是否能区分被篡改和未被篡改的助手侧上下文,我们将这种能力称为预填充感知。为此,我们构建了一个跨三种预填充机制的二元偏好基准,筛选出模型表现出一致立场的案例。我们发现前沿模型表现出显著的预填充感知:Claude Opus 4.5在9-35%的案例中检测到与其偏好相反的预填充,且在提示时假阳性率为0%;此外,模型通常会恢复到基线行为,而不会明确报告预填充是外来的。受控消融实验后来也表明,检测和抵抗依赖于不同的线索,其中风格不匹配主要影响模型是否将预填充标记为外来,而偏好不匹配主要影响模型是否恢复到其基线答案。我们还检查了更真实的智能体设置,如错位延续评估和SWE-bench轨迹,在这些设置中,前沿模型有时会否认预填充的助手轮次,其方式强烈依赖于数据集、任务成功和隐藏的格式伪影。我们的结果表明,预填充感知已经是一些基于预填充的方法的重要混淆因素。我们建议模型开发者在前沿系统中跟踪这种能力。

英文摘要

Safety-relevant studies of language models, including alignment and jailbreaking evaluations and AI control protocols, often rely on prefilling model outputs. If AI models can recognize and act on the fact their prior assistant messages have been inserted or edited, the effectiveness and validity of these methods could be compromised. We investigate whether frontier language models can distinguish between tampered and untampered assistant-side context, a capability we call prefill awareness. To do so, we construct a binary preference benchmark across three prefill mechanisms, filtering for cases where models show consistent stances. We find that frontier models show substantial prefill awareness: Claude Opus 4.5 detects prefills opposing its preferences in 9-35% of cases with a 0% false positive rate when prompted; additionally, models often revert towards baseline behavior without explicitly reporting that the prefill was foreign. Controlled ablations later also show that detection and resistance rely on different cues, where stylistic mismatch mainly affects whether models flag a prefill as foreign, while preference mismatch mainly affects whether they revert toward their baseline answer. We also examine more realistic agentic settings such as misalignment-continuation evaluations and SWE-bench trajectories, where frontier models sometimes disavow prefilled assistant turns in ways that depend strongly on dataset, task success, and hidden formatting artifacts. Our results indicate that prefill awareness is already a substantial confound for some prefill-based methods. We recommend that model developers track this capability in frontier systems.

2606.12744 2026-06-12 cs.CV 新提交

GRIP: Feedback-Guided Prompt Retrieval for Large Multimodal Models

GRIP:面向大型多模态模型的反馈引导提示检索

Garvita Allabadi, Matteo Sodano, Roberto Estevão, Yuxiong Wang, Vikram Adve, Emre Kiciman, Ranveer Chandra

发表机构 * University of Illinois Urbana Champaign(伊利诺伊大学厄巴纳-香槟分校) University of Bonn(波恩大学) Microsoft(微软)

AI总结 提出GRIP,一种可学习的视觉检索框架,利用多模态模型反馈识别真正提升上下文学习性能的示例,在分类、描述和VQA任务上优于基于相似度的检索。

详情
AI中文摘要

上下文学习(ICL)已成为一种强大的机制,使大型语言模型(LLMs)无需微调即可适应新任务。将此概念扩展到大型多模态模型(LMMs),多模态上下文学习(M-ICL)依赖于检索相关示例(如图像、标题或问答对)来指导分类、描述和视觉问答(VQA)等任务的预测。现有方法大多基于特征空间相似性选择上下文示例,假设语义相似的样本提供最有用的上下文。然而,我们的系统分析表明,这一假设并不总是成立:视觉上相似的示例并不一定是那些最有效增强上下文学习性能的示例。为解决此问题,我们提出了上下文提示的引导检索(GRIP),一种可学习的纯视觉检索框架,利用LMMs的反馈来识别真正改善模型预测的示例。GRIP通过对比训练学习区分有益和有害的上下文示例,将检索优化到超越纯相似性。在三个多模态任务(分类、描述和VQA)上,GRIP在Qwen2.5-VL-7B上持续优于基于相似度的检索,在Idefics2-8B上的分类任务中提升最为显著。此外,我们证明了从一个开放LMM训练得到的检索器可以迁移到其他模型(包括闭源的GPT-4o和Gemini)而无需重新训练,从而实现了M-ICL的可扩展且经济高效的部署。代码将在接收后发布。

英文摘要

In-Context Learning (ICL) has become a powerful mechanism for adapting Large Language Models (LLMs) to new tasks without fine-tuning. Extending this concept to Large Multimodal Models (LMMs), Multimodal In-Context Learning (M-ICL) relies on retrieving relevant examples, such as images, captions, or question-answer pairs, to guide predictions across tasks like classification, captioning, and visual question answering (VQA). Most existing approaches select in-context examples based on feature-space similarity, assuming that semantically similar samples provide the most useful context. However, our systematic analysis reveals that this assumption does not always hold: visually similar examples are not necessarily those that most effectively enhance in-context learning performance. To address this, we propose the Guided Retrieval of In-context Prompts (GRIP), a learnable vision-only retrieval framework that leverages feedback from LMMs to identify examples that truly improve model predictions. GRIP learns to distinguish beneficial from detrimental in-context examples through contrastive training, refining retrieval beyond pure similarity. Across three multimodal tasks, namely classification, captioning, and VQA, GRIP improves consistently over similarity-based retrieval on Qwen2.5-VL-7B, with its strongest gains in classification on Idefics2-8B. Moreover, we demonstrate that retrievers trained with feedback from one open LMM can be transferred to other models without retraining, including closed-source GPT-4o and Gemini, enabling scalable and cost-efficient deployment of M-ICL. Code will be published upon acceptance.

2606.12740 2026-06-12 cs.LG 新提交

Deep Unfolded Latent Optimally Partitioned-l2/l1 Networks for Data-driven Block-Sparse Recovery

深度展开潜在最优分区l2/l1网络用于数据驱动的块稀疏恢复

Takanobu Furuhashi, Hidekata Hontani, Qibin Zhao, Tatsuya Yokota

发表机构 * Nagoya Institute of Technology(名古屋工业大学) RIKEN Center for Advanced Intelligence Project(理化学研究所革新智能研究中心)

AI总结 针对凸LOP-l2/l1方法依赖手动调参且近端算子不可微的问题,提出基于隐式微分和深度权重分解的两种深度展开架构,实现自动参数学习,在块稀疏恢复中表现优异且抗脉冲噪声。

Comments 11 pages, 6 figures

详情
AI中文摘要

凸潜在最优分区(LOP)-l2/l1方法能够在未知分区的情况下实现块稀疏信号恢复,但依赖于手动超参数调整。此外,其近端算子微分时的数值不稳定性阻碍了通过深度展开(DU)进行自动参数调整。为解决这些限制,我们提出了两种架构:一种利用隐式微分的稳定框架,以及一种利用深度权重分解(DWF)的灵活变体。基于DWF的方法还支持非凸光滑数据保真项。数值实验表明,DU-LOP-l2/l1在块稀疏恢复中具有竞争性能,并且对脉冲噪声具有高鲁棒性。

英文摘要

The convex Latent Optimal Partition (LOP)-l2/l1 approach enables block-sparse signal recovery with unknown partitions but relies on manual hyperparameter tuning. Additionally, numerical instability in differentiating its proximal operator prevents its automatic parameter tuning via Deep Unfolding (DU). To address these limitations, we propose two architectures: a stable framework utilizing implicit differentiation and a flexible variant leveraging Deep Weight Factorization (DWF). The DWF-based approach also supports nonconvex smooth data fidelity terms. Numerical experiments demonstrate that DU-LOP-l2/l1 yields competitive performance and high resilience against impulsive noise.

2606.12736 2026-06-12 cs.AI cs.LG 新提交

Benchmarking AI Agents for Addressing Scientific Challenges Across Scales

跨尺度科学挑战的AI智能体基准测试

Tianyu Liu, Allen Xin Wang, Antonia Panescu, Lisa Xinyi Chen, Wenxin Long, Xinyu Wei, Yueqian Jing, Ziyao Zeng, Jihang Chen, Sihan Jiang, Ziqing Wang, Siyi Gu, Siyu Chen, Xinyang Hu, Haoran Shao, Leqi Xu, Wangjie Zheng, Zhiyuan Cao, Ada Fang, Botao Yu, Kunyang Sun, Rex Ying, Arman Cohan, Qingyu Chen, Lingzhou Xue, Kaize Ding, Yuanqi Du, Wengong Jin, Zhuoran Yang, Marinka Zitnik, James Zou, Hua Xu, Hongyu Zhao

发表机构 * Yale University(耶鲁大学) Broad Institute of MIT and Harvard(布罗德研究所) The Pennsylvania State University(宾夕法尼亚州立大学) Northeastern University(东北大学) Northwestern University(西北大学)

AI总结 提出SciAgentArena基准,含约200个交互式任务,评估AI智能体在真实科研场景中的能力,发现其在数据分析中有效,但在创新探索和开放问题上表现不均。

Comments 6 figures

详情
AI中文摘要

AI智能体正被越来越多地开发用于加速科学发现,但它们在真实研究环境中的实际能力仍知之甚少。现有的AI智能体基准很少捕捉科学工作所需的复杂性、异质性和扩展推理,而科学任务的基准通常将研究简化为静态、直接的问题,并对交互式评估支持有限。在此,我们引入SciAgentArena,这是一个系统性的基准,用于评估AI智能体在来自多个领域新兴需求的真实科学研究场景中的表现。SciAgentArena包含约200个具有逐步验证的任务,以及一个交互式、与智能体无关的环境,用于评估不同的AI智能体。使用该基准,我们发现当前智能体能够有效贡献于明确指定的数据分析工作流,特别是当任务结构和评估标准清晰时。然而,它们在科学情境中的表现仍然不均衡:智能体难以产生真正新颖的见解,维持自主探索,并为开放的研究问题制定稳健的解决方案。我们进一步描述了智能体常见的失败模式,并识别了提高其可靠性、自主性和科学推理能力的机会。总之,SciAgentArena提供了一个实用的框架,用于衡量AI智能体在科学领域的进展,并指导未来能够应对复杂科学挑战的智能体设计。完整代码、任务和数据集可通过此链接访问:this https URL。

英文摘要

AI agents are increasingly being developed to accelerate scientific discovery, yet their practical capabilities in real research settings remain poorly understood. Existing benchmarks for AI agents rarely capture the complexity, heterogeneity, and extended reasoning required by scientific work, whereas benchmarks for scientific tasks often reduce research to static, direct problems and provide limited support for interactive evaluation. Here, we introduce SciAgentArena, a systematic benchmark for evaluating AI agents in real-world scientific research scenarios drawn from emerging needs across multiple domains. SciAgentArena comprises approximately 200 tasks with stepwise verification and an interactive, agent-agnostic environment for assessing diverse AI agents. Using this benchmark, we find that current agents can contribute effectively to well-specified data-analysis workflows, particularly when the task structure and evaluation criteria are clear. However, their performance remains uneven across scientific contexts: agents struggle to generate genuinely novel insights, sustain self-directed exploration, and formulate robust solutions for open-ended research questions. We further characterize common failure modes across agents and identify opportunities for improving their reliability, autonomy, and scientific reasoning. Together, SciAgentArena provides a practical framework for measuring progress in AI agents for science and for guiding the design of future agents capable of addressing complex scientific challenges. Full codes, tasks, and datasets can be accessed via this link: https://sciagentarena.github.io/.

2606.12735 2026-06-12 cs.LG 新提交

Physics-Informed Neural Networks and Radial Basis Functions for PDEs with Dirac Delta Sources

物理信息神经网络与径向基函数求解含狄拉克δ源的偏微分方程

Manuel Reyna, Alexandre Tartakovsky

发表机构 * Department of Civil and Environmental Engineering, University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校土木与环境工程系)

AI总结 针对含狄拉克δ项的偏微分方程,通过将物理信息神经网络解释为残差最小二乘法,利用弱形式直接处理δ项,并对比径向基函数展开方法,发现径向基函数-残差最小二乘法在输运问题中更稳定。

Comments 33 pages, 4 figures

详情
AI中文摘要

物理信息神经网络(PINNs)是一种用于求解正向和逆向偏微分方程(PDEs)的机器学习方法。当应用于强迫项、边界条件或初始条件中包含狄拉克δ函数的PDEs时,PINNs需要用光滑的代理函数来近似它们,这种做法可能会引入显著的建模误差。在这项工作中,我们利用PINNs作为残差最小二乘法(RLS)的解释,并表明这种视角能够通过积分弱形式方程直接处理狄拉克δ项。在除PINN之外的RLS公式中,我们重点关注径向基函数(RBF)展开(也称为单层RBF网络)。我们证明,虽然在PINNs中积分掉狄拉克δ会导致残差无法收敛到零,但RBF-RLS始终能为输运问题提供良好的正向和逆向解。我们使用神经正切核(NTK)理论解释这一发现。我们在代表多孔介质和河流中地下水流和输运的线性PDEs上测试了这两种方法。我们求解逆问题以拟合合成数据、含噪声的合成数据以及真实世界测量值。

英文摘要

Physics-Informed Neural Networks (PINNs) are a machine learning method for solving forward and inverse Partial Differential Equations (PDEs). When applied to PDEs with Dirac delta functions in the forcing terms, boundary conditions, or initial conditions, PINNs require approximating them with smooth surrogate functions, a practice that can introduce significant modeling errors. In this work, we exploit the interpretation of PINNs as Residual Least Squares (RLS) methods and show that this perspective enables direct treatment of Dirac delta terms by integrating the weak-form equation. Among RLS formulations other than PINN, we focus on the Radial Basis Function (RBF) expansion (also known as a single-layer RBF Network). We show that while integrating out the Dirac delta in PINNs causes residuals to fail to converge to zero, RBF-RLS consistently provides good forward and inverse solutions to transport problems. We explain this finding using the Neural Tangent Kernel (NTK) theory. We test both approaches on linear PDEs that represent groundwater flow and transport in porous media and rivers. We solve inverse problems to fit synthetic data, noisy synthetic data, and real-world measurements.

2606.12731 2026-06-12 cs.LG cs.CY 新提交

Normative Robustness as a Frontier for Non-Verifiable Reasoning in LLMs

规范性鲁棒性作为LLM中不可验证推理的前沿

Elizaveta Tennant, Benjamin Henke, Anita Keshmirian, Murray Shanahan, Verena Rieser, Kristian Lum, Sydney Levine, Julia Haas

发表机构 * DeepMind Institute of Philosophy, School of Advanced Study, University of London(伦敦大学高等研究院哲学研究所) Technische Universität Berlin(柏林工业大学)

AI总结 提出道德推理作为不可验证推理的典型子域,定义道德鲁棒性并引入可扩展的多轮对抗评估框架,发现模型会向用户偏好偏移推理(平均6.5%),且受顺序和轮次影响。

详情
AI中文摘要

随着LLM越来越多地承担咨询和审议角色,用户在缺乏客观真实性的领域中依赖它们进行不可验证推理。然而,传统LLM推理评估几乎只关注基于事实的领域(如数学和科学),导致不确定模型能否以及能在多大程度上处理随时间变化的模糊、主观或价值负载问题。为解决这一问题,我们提出道德推理作为不可验证推理的一个典型子域。我们将道德鲁棒性定义为模型在不同时间和情境下展现合理道德推理的能力,并引入一个可扩展的、对抗性的多轮评估框架来实证测量这一能力。我们在四个前沿LLM上模拟了48,000次用户-智能体道德讨论,变化前提相关性、前提顺序、对话时长和用户声明的道德观点。我们发现模型成功忽略了道德无关的干扰项,但平均向用户声明的偏好道德观点偏移了6.5%的推理,并且推理因顺序(在13-22%的案例中改变道德判断)和时长(在10-24%的案例中在单轮和多轮之间改变道德判断)等因素而变化。我们的分析表明,模型不仅调整最终裁决,还调整其背后的理由以适应用户的道德观点——我们将这种失败模式称为道德审议谄媚。

英文摘要

As LLMs increasingly serve in advisory and deliberative roles, users rely on them for non-verifiable reasoning in domains lacking objective ground truths. However, traditional evaluations of LLM reasoning focus almost exclusively on fact-based domains, such as mathematics and science, leaving uncertainty over whether and to what degree models can handle ambiguous, subjective, or value-laden problems over time. To address this concern, we propose moral reasoning as a paradigmatic subdomain of non-verifiable reasoning. We define moral robustness as a model's capacity to exhibit sound moral reasoning across time and contexts, and we introduce a scalable, adversarial, multi-turn evaluation framework to empirically measure this capability. We simulate 48,000 user-agent moral deliberations across four frontier LLMs, varying premise relevance, premise order, conversation duration, and the user's stated moral view. We find that models successfully ignore morally-irrelevant distractors, but shift their reasoning by up to 6.5%, on average, towards the user's stated preferred moral view, and varying their reasoning depending on factors such as order (altering moral judgments by order in 13-22% of the cases) and duration (altering moral judgments between single-turn and multi-turn in 10-24% of the cases). Our analysis indicates that models tailor not just their final verdicts but their underlying justifications to align with a user's moral viewpoint - a failure mode we characterize as moral deliberative sycophancy.

2606.12730 2026-06-12 cs.AI cs.CL cs.CY cs.LG 新提交

Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior

重新思考LLMs的心理测量评估:自我报告何时以及为何能预测行为

Rafal Kocielnik, Pengrui Han, Peiyang Song, Myrl G. Marmarelis, Ramit Debnath, Dean Mobbs, Anima Anandkumar, R. Michael Alvarez

发表机构 * Caltech(加州理工学院) UIUC(伊利诺伊大学厄巴纳-香槟分校) University of Cambridge(剑桥大学)

AI总结 研究对比大五人格与计划行为理论,发现LLMs的自我报告-行为一致性存在选择性:在共享对话中TPB达到人类水平,跨对话仅对锚定于训练的行为保持一致性,且角色提示不能使行为对齐。

Comments Accepted as an Oral (Contributed Talk) at the ICML 2026 Workshop on Combining Theory and Benchmarks (CTB)

详情
AI中文摘要

从低成本心理测量探针预测LLM行为倾向对于安全部署至关重要,但前提是自我报告(SR)能可靠地预测行为。近期研究记录了LLMs中显著的SR-行为分离,但依赖于广泛的人格特质(大五),这些特质即使在人类中也只能弱预测特定行为。此外,对话会话的隔离加上弱上下文匹配使得以下问题悬而未决:LLMs是否真正缺乏一致性,或者检测这种一致性所需的条件是否未满足。我们将大五与计划行为理论(TPB)进行对比,后者测量针对特定行为的意图,并且比广泛特质能更好地预测人类行为。我们在四个行为任务和11个前沿LLM上进行实验,同时改变会话上下文和身份诱导。我们发现SR-行为一致性存在但具有选择性。1) 在共享对话中,计划行为理论达到人类水平的一致性;大五则没有。2) 在跨对话中,一致性仅对锚定于即时提示之外的行为(如由训练塑造的内隐偏见)幸存,而当行为被上下文强烈启动(如谄媚)时则崩溃。3) 角色提示使自我报告在对话间更一致,但并未使行为对齐。这些发现表明,粗糙的人格框架(如大五)可能不是测试部署行为的最佳工具。需要更多任务和特定行为的工具,并且即使这些工具也必须在任务和上下文中进行评估。

英文摘要

Anticipating LLM behavioral tendencies from low-cost psychometric probes is critical for safe deployment, but only if self-reports (SR) reliably predict behavior. Recent work documented substantial SR-behavior dissociation in LLMs, but relied on broad personality traits (Big 5) that predict specific behaviors weakly, even in humans. Furthermore, the isolation of conversational sessions combined with weak context matching left open whether LLMs truly lack coherence or whether the conditions needed to detect such coherence were not met. We contrast Big 5 with the Theory of Planned Behavior (TPB), which measures intention targeted to a specific behavior and predicts human behavior substantially better than broad traits. We run experiments across four behavioral tasks and 11 frontier LLMs, while also varying session context and identity induction. We find that SR-behavior coherence exists but is selective. 1) Within a shared conversation, the Theory of Planned Behavior reaches human-level coherence; Big 5 does not. 2) Across separate conversations, coherence survives only for behaviors anchored outside the immediate prompt, such as implicit bias shaped by training, and collapses when behavior is strongly primed by context, as with sycophancy. 3) Persona prompting makes self-reports more consistent across conversations, but does not bring behavior into alignment. These findings suggest that coarse personality frameworks, such as Big 5 may not be the best tools for testing deployment behavior. More task- and behavior-specific instruments are needed, and even these must be evaluated across tasks and contexts.

2606.12721 2026-06-12 cs.AI 新提交

The Theory of Mind Utility: Formal Specification of a Mentalizing Mechanism

心智理论效用:心理化机制的形式化规范

Nikolos Gurney, Stacy Marsella

发表机构 * Institute for Creative Technologies, University of Southern California(南加州大学创意技术研究所) Khoury College of Computer Sciences, Northeastern University(东北大学库里计算机科学学院)

AI总结 提出心智理论效用(ToM-U)框架,通过局部认知世界模型(LEWM)形式化推断他人信念的计算问题,定义结构、推理过程及失败痕迹,区别于贝叶斯心智理论等方法。

详情
AI中文摘要

推断他人的信念需要超越表面信号;需要追踪谁告诉了他们什么、以什么顺序以及有多可信。心智理论效用(ToM-U)在计算分析层面形式化了这一认知状态推断问题,明确了心理化计算的内容和原因,而不承诺算法或神经实现。ToM-U通过构建局部认知世界模型(LEWMs)——表示智能体、状态节点及其之间认知关系的有向类型图——并根据观察到的行为评估离散候选LEWM,直到达到足够的置信度来实现这一点。五个形式定义指定了LEWM结构、包括有序信息访问历史的智能体节点属性、递归心理化的有界增殖机制、三种推理过程以及一个残差函数,该函数捕捉失败心理化尝试留下的结构化痕迹。ToM-U不同于贝叶斯心智理论和相邻的形式化描述,后者预设而非推导信念状态,也不同于模拟理论和理论-理论,后者缺乏认知状态推断的形式化工具。该架构生成关于心理化失败的方向性、可证伪预测,这些预测源于模型的结构属性而非辅助假设,并将ToM-U定位为在目标推断和其他下游社会认知过程之前的领域无关机制。

英文摘要

Inferring others' beliefs requires more than reading surface signals; it requires tracking who told them what, in what order, and how credibly. The Theory of Mind Utility (ToM-U) formalizes this epistemic state inference problem at the computational level of analysis, specifying what mentalizing computes and why without commitment to algorithmic or neural implementation. ToM-U achieves this by constructing Local Epistemic World Models (LEWMs) -- directed typed graphs that represent agents, state nodes, and the epistemic relationships among them -- and evaluating discrete candidate LEWMs against observed behavior until one achieves sufficient confidence. Five formal definitions specify the LEWM structure, agent node properties including ordered information access history, a bounded proliferation mechanism for recursive mentalizing, three inference procedures, and a residue function that captures the structured trace left by failed mentalizing attempts. ToM-U differs from Bayesian Theory of Mind and adjacent formal accounts, which presuppose rather than derive belief states, and from simulation theory and theory-theory, which lack a formal apparatus for epistemic state inference. The architecture generates directional, falsifiable predictions about mentalizing failure that follow from structural properties of the model rather than auxiliary assumptions, and positions ToM-U as a domain-agnostic mechanism upstream of goal inference and other downstream social cognitive processes.

2606.12718 2026-06-12 cs.LG eess.SP 新提交

Out-of-Distribution (OOD) Detectors for Open-Set RF Fingerprinting

面向开放集射频指纹识别的分布外检测器

Sudeepta Mondal, Ganesh Sundaramoorthi

发表机构 * University of Michigan(密歇根大学)

AI总结 针对开放集射频指纹识别中未知发射机与时间漂移引起的分布偏移问题,引入基于信息论的OOD检测统一框架,并采用无需OOD调优数据的方法,在POWDER数据集上验证其性能接近有真实OOD数据的基线。

详情
AI中文摘要

射频指纹识别系统必须在开放世界环境中运行,其中来自未知发射机的信号和时间漂移会在测试时引入分布偏移。分布外检测为该问题提供了自然框架,但其在射频指纹识别中的应用仍然有限。其采用的一个关键障碍是大多数OOD检测器需要辅助OOD数据进行参数调优,而在射频环境中收集代表性OOD数据不切实际,这一假设难以满足。在这项工作中,我们将机器学习文献中一组有前景的OOD检测方法引入开放集RFF领域。我们基于信息论(通信系统的自然框架)在一个统一的数学框架中呈现这些方法。我们的框架允许对方法进行系统分析并开发新方法。我们进一步展示了最近关于无需给定OOD调优数据即可调优OOD检测器的工作在开放集RFF中的适用性。我们在POWDER射频指纹数据集上进行评估,表明无需任何给定OOD数据调优的检测器性能与能够访问真实OOD调优数据的基线相当,并且大大优于无法访问真实OOD调优数据的基线方法,展示了RFF问题的实际可行性。

英文摘要

Radio-frequency (RF) fingerprinting systems must operate in open-world environments where signals from unknown transmitters and temporal drift introduce distribution shift at test time. Out-of-distribution (OOD) detection provides a natural framework for this problem, yet its application to RF fingerprinting (RFF) remains limited. A key barrier to their adoption is that most OOD detectors require auxiliary OOD data for parameter tuning, an assumption that is difficult to satisfy in RF environments where representative OOD data is impractical to collect. In this work, we introduce a promising set of OOD detection methods from the machine learning literature to open-set RFF domain. We present these methods within a unified mathematical framework based on information theory, which is a natural framework for communication systems. Our framework allows for the systematic analysis of methods and development of new methods. We further demonstrate the applicability of recent work on tuning OOD detectors without given OOD tuning data for open-set RFF. We evaluate on the POWDER RF fingerprinting dataset, showing that detectors tuned without any given OOD data achieve performance comparable to baselines with access to true OOD tuning data and greatly out-perform baseline approaches without access to true OOD tuning data, showcasing the practical viability for the RFF problem.

2606.12716 2026-06-12 cs.CL 新提交

Does AI Reviewer See the Full Picture? Attacking and Defending Multimodal Peer Review

AI审稿人是否看到全貌?攻击与防御多模态同行评审

Xinyu Zhao, Rana Muhammad Shahroz Khan, Zhen Xu, Zhen Tan, Tianlong Chen

发表机构 * University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校)

AI总结 针对AI同行评审易受多模态对抗攻击的问题,提出PaperGuard基准,包含多领域数据集、统一攻击套件和基于分块嵌入搜索的实用防御方法。

Comments Accepted to ICML 2026, Project Page: https://paper-guard.github.io/

详情
AI中文摘要

将大型语言模型(LLMs)和多模态LLMs(MLLMs)集成到科学同行评审工作流程中,引入了对抗性操纵的新重大风险,尤其是考虑到科学论文的多模态性质——其中图表(而非仅文本)传达了核心证据。这造成了一个显著差距:当前关于AI同行评审的鲁棒性研究绝大多数仅针对文本。此外,该问题与标准越狱不同,因为同行评审攻击旨在诱导领域特定的、有针对性的失败(例如,“提高这个分数”),而非违反一般安全策略,而目前尚无实用的防御措施。为解决此问题,我们引入了PaperGuard,这是第一个旨在系统评估和防御AI生成的同行评审免受这些领域特定、跨模态攻击的全面基准。我们的框架基于三大支柱:(1)一个新的跨多个科学领域的多模态同行评审数据集;(2)一套统一的攻击方法,包括黑盒提示注入和白盒扰动,专门针对文本(GCG)和图表(PGD);(3)一种实用的防御方法,受学术论文长上下文挑战的启发,使用基于分块的嵌入搜索来高效定位和缓解有害指令。我们在最先进模型上进行的广泛实验证实,AI审稿人普遍存在脆弱性。PaperGuard建立了必要的基准、协议和可操作的防御措施,以开创可信赖、抗攻击的AI辅助学术评审。

英文摘要

The integration of Large Language Models (LLMs) and Multimodal LLMs (MLLMs) into scientific peer-review workflows introduces novel and significant risks for adversarial manipulation, especially given the multimodal nature of scientific papers where figures, not just text, convey core evidence. This creates a significant gap: current robustness studies on AI peer-review are overwhelmingly text-only. Moreover, the problem is distinct from standard jailbreaking, as a peer-review attack seeks to induce a domain-specific, targeted failure (e.g., "inflate this score") rather than a general safety policy violation, for which no practical defenses exist. To address this, we introduce PaperGuard, the first comprehensive benchmark designed to systematically evaluate and defend AI-generated peer-review against these domain-specific, cross-modal attacks. Our framework is built on three pillars: (1) a new multimodal peer-review dataset spanning multiple scientific domains; (2) a unified suite of attacks, including black-box prompt injections and white-box perturbations, specifically designed to target both text (GCG) and figures (PGD); and (3) a practical defense, motivated by the long-context challenge of academic papers, that uses chunk-based embedding search to efficiently localize and mitigate harmful instructions. Our extensive experiments, conducted across state-of-the-art models, confirm that AI reviewers are pervasively vulnerable. PaperGuard establishes the foundational benchmark, protocols, and actionable defense necessary to pioneer trustworthy, attack-resilient AI-assisted scholarly reviewing.