arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.05689 2026-06-05 cs.LG

Causal Modeling of Selection in Evolution

Haoyue Dai, Zeyu Tang, Peter Spirtes, Kun Zhang

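Affiliations: University of Washington; Carnegie Mellon University

AI summary: This paper distinguishes two forms of selection, static and evolutionary, proposes a new causal model for evolutionary selection, and develops a complete method for identifying that model from data.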

Comments Appears at ICML 2026 (spotlight)

Abstract

Understanding potential selection in data is crucial for causal discovery; we argue that "selection" in common narratives takes two forms, which we term static and evolutionary selection, respectively. Static selection refers to a one-shot filtering process where observed data consist of a subset of the population of interest, as in survey volunteer bias. Evolutionary selection, in contrast, operates through repeated rounds of differential fitness in reproduction, where observed data constitute the latest generation shaped by a historical trajectory, as in immune adaptation, antibiotic resistance, and social norm emergence. Existing methods largely conflate these two forms and rely on an identical graphical model of selection. We show that this model is valid for static settings but fails to characterize data under evolution, yielding false discovery results. To address this, we introduce a new model that specifically characterizes evolutionary selection, and develop a sound and complete procedure for identifying such models from data across one or multiple environments or generations. Experimental results validate the method's ability to uncover the relevant mechanisms underlying evolution from data.

2606.05688 2026-06-05 cs.CL cs.AI

Value-and-Structure Alignment for Routing-Consistent Quantization of Mixture-of-Experts Models

Hancheol Park, Geonho Lee, Tairen Piao, Tae-Ho Kim

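Affiliations: Nota Inc., South Korea

AI summary: Proposes VSRAQ, which uses two complementary objectives, value alignment and structure alignment, to keep expert-selection behavior consistent before and after quantization, reducing quantization-induced degradation with no inference-time overhead.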

Comments 8 pages, 1 figure

Abstract

Mixture-of-Experts (MoE) models scale foundation models efficiently by activating only a subset of experts for each token, but their large number of expert parameters still makes quantization essential for practical deployment. Unlike dense models, however, MoE models are sensitive to routing instability: small quantization-induced perturbations can change the top-$k$ expert selection, altering the computation path and degrading model quality. We propose Value-and-Structure Routing Alignment for Quantization (VSRAQ), an MoE-specific post-training quantization objective designed to preserve pre-quantization expert-selection behavior after quantization. VSRAQ combines two complementary objectives: value alignment, which matches routing-relevant logits or scores, and structure alignment, which preserves expert ordering and top-$k$ decision boundaries. By maintaining routing consistency, VSRAQ reduces quantization-induced degradation without introducing any inference-time overhead and can be integrated into existing quantization frameworks. Experiments on recent MoE foundation models show that VSRAQ improves expert-selection consistency and consistently outperforms reconstruction-only and router-aware baselines.
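Illustrative sketch (not the authors' implementation): the abstract names a value objective on router logits and a structure objective on the top-$k$ boundary but gives no formulas. The code below assumes a mean-squared error for the value term and a pairwise hinge with a hypothetical `margin` for the structure term; all function names are made up for illustration.

```python
def top_k(scores, k):
    """Indices of the k largest router scores, as a set."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def value_alignment(fp_logits, q_logits):
    """Assumed value term: mean squared error between router logits."""
    return sum((a - b) ** 2 for a, b in zip(fp_logits, q_logits)) / len(fp_logits)


def structure_alignment(fp_logits, q_logits, k, margin=0.1):
    """Assumed structure term: hinge penalty on the top-k boundary.

    Each expert chosen by the full-precision router should outscore each
    rejected expert by at least `margin` under the quantized router.
    Requires k < number of experts.
    """
    selected = top_k(fp_logits, k)
    rejected = set(range(len(fp_logits))) - selected
    loss = 0.0
    for i in selected:
        for j in rejected:
            loss += max(0.0, margin - (q_logits[i] - q_logits[j]))
    return loss / (len(selected) * len(rejected))


def routing_consistency(fp_logits, q_logits, k):
    """Fraction of the full-precision top-k experts kept after quantization."""
    return len(top_k(fp_logits, k) & top_k(q_logits, k)) / k


# Toy example: a small perturbation swaps experts 1 and 2 at the top-2 boundary.
fp = [2.0, 1.0, 0.9, -1.0]
q = [1.9, 0.9, 1.0, -1.0]
print(routing_consistency(fp, q, k=2))
print(value_alignment(fp, q), structure_alignment(fp, q, k=2))
```

In the toy example the logits move by only 0.1, so the value term is tiny, yet one of the two selected experts changes; the structure term is what registers the boundary flip.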

2606.05687 2026-06-05 cs.RO cs.SY eess.SY

Accelerating and Scaling MPC-Guided Reinforcement Learning for Humanoid Locomotion and Manipulation

Junheng Li, Liang Wu, Sergio A. Esteban, Lizhi Yang, Ján Drgoňa, Aaron D. Ames

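Affiliations: California Institute of Technology; Johns Hopkins University

AI summary: This paper proposes an MPC-RL framework built on a centroidal-dynamics MPC reward and develops $\pi^n$MPC, a parallel batched GPU solver, to efficiently train locomotion and manipulation skills for humanoid robots.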

Comments 8 pages, 5 figures

Abstract

In humanoid motion control, model predictive control (MPC) offers physically grounded prediction and constraint handling, while reinforcement learning (RL) enables robust whole-body skills through large-scale simulation. However, using MPC inside RL often requires time-consuming problem construction or excessive training overhead, making such frameworks difficult to justify in practice. This work studies efficient training-time MPC guidance for humanoid locomotion and manipulation, termed MPC-RL. We introduce a centroidal-dynamics MPC reward formulation that leverages guidance from MPC trajectories at training time. To make this practical in massively parallel RL, we develop $\pi^n$MPC, a parallel-in-horizon and construction-free batched GPU MPC solver that operates directly on time-varying dynamics to avoid high memory usage and pre-compilation. Through a variety of comparative studies and hardware validations, we find that MPC-RL achieves superior performance in locomotion and manipulation skills. The code base is available at https://github.com/junhengl/mpc-rl.

2606.05684 2026-06-05 cs.AI

AdaMEM: Test-Time Adaptive Memory for Language Agents

Yunxiang Zhang, Yiheng Li, Ali Payani, Lu Wang

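AI summary: Proposes the AdaMEM framework, which achieves test-time adaptation through a hybrid memory architecture (long-term trajectory memory plus dynamic short-term strategy memory) without online parameter updates, and significantly outperforms static-memory baselines on tasks such as ALFWorld and WebShop.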

Comments ICML 2026

Abstract

A central challenge for language agents is utilizing past experience to adapt to dynamic test-time conditions. While recent work demonstrates the promise of agentic memory mechanisms, most systems restrict retrieval to episode initiation. Consequently, agents are forced to rely on static guidance that becomes increasingly misaligned as long-horizon tasks unfold. To address this rigidity, we propose the Adaptive Memory Agent (AdaMEM), a novel framework for agent test-time adaptation. Without updating model parameters online, AdaMEM adapts agent behavior via a hybrid memory architecture: it maintains a long-term trajectory memory of raw experiences collected offline while generating dynamic short-term strategy memory on the fly to guide decision-making. This mechanism enables a trade-off between token efficiency and adaptability across varying inference-time compute levels. Empirically, AdaMEM significantly outperforms static memory baselines, achieving relative gains of up to 13% on ALFWorld and 11% on WebShop, with consistent leading performance extending to agentic search on HotpotQA. To further enhance this adaptation, we develop STEP-MFT, a Step-wise Memory Fine-Tuning technique that trains the policy to synthesize high-quality strategies from retrieved experiences, yielding additional performance gains. Our work establishes a new scaling dimension for agentic memory, supporting continuous reasoning and self-evolution post-deployment in real-world environments. Our code is available at https://github.com/yunx-z/AdaMEM.
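Illustrative sketch (not the AdaMEM code): the contrast the abstract draws is between retrieving once at episode start and re-querying memory as the task unfolds. The toy below assumes a word-overlap retriever and uses a stored `hint` string as a stand-in for the short-term strategy that the paper generates with a language model; the record layout and all names are hypothetical.

```python
def overlap(a, b):
    """Jaccard similarity between the word sets of two strings."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def retrieve(memory, observation):
    """Best-matching record from the long-term trajectory memory."""
    return max(memory, key=lambda record: overlap(record["state"], observation))


def static_guidance(memory, observations):
    """Baseline: retrieve once at episode start and reuse that hint."""
    hint = retrieve(memory, observations[0])["hint"]
    return [hint for _ in observations]


def adaptive_guidance(memory, observations):
    """Per-step retrieval: refresh the short-term strategy at every step."""
    return [retrieve(memory, obs)["hint"] for obs in observations]


# Hypothetical offline experiences and a two-step episode.
memory = [
    {"state": "kitchen with closed fridge", "hint": "open the fridge"},
    {"state": "living room with dark lamp", "hint": "turn on the lamp"},
]
episode = ["you are in the kitchen near a fridge", "you enter the living room"]
print(static_guidance(memory, episode))
print(adaptive_guidance(memory, episode))
```

The static baseline keeps suggesting the kitchen hint after the agent has moved on, which is the misalignment the abstract describes; per-step retrieval costs more lookups (and, in the real system, more tokens) in exchange for guidance that tracks the current state.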

2606.05678 2026-06-05 cs.SD cs.AI cs.CR

Beyond Waveform Robustness: Robust Feature-Vocoder Adversarial Attacks on Automatic Speech Recognition

Yifan Liao, Zongmin Zhang, Zhen Sun, Yuhui Sun, Xinhu Zheng, Xinlei He

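Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Wuhan University

AI summary: Proposes a black-box adversarial attack based on self-supervised learning representations and a vocoder that perturbs acoustic-phonetic features rather than the waveform, improving transferability and the ability to bypass defenses.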

Comments 11 pages

Abstract

Automatic speech recognition (ASR) systems have become widely used for multilingual speech-to-text transcription. Their robustness to adversarial attacks has become an important topic for the community. Existing adversarial attacks directly add adversarial noise to the speech audio. However, prior work has shown that existing adversarial attacks face two limitations: they often transfer poorly to black-box ASR systems and are increasingly mitigated by defenses tailored to input-space perturbations. In this work, we propose a Clean-Referenced Feature-Vocoder Attack, a surrogate-based black-box attack that moves the adversarial search space from raw waveforms to self-supervised learning (SSL) representations. To address the transferability limitation, we perturb more generalizable acoustic-phonetic representations rather than low-level waveform samples, reducing dependence on surrogate-specific waveform gradients and encouraging adversarial perturbations that generalize across ASR systems. To bypass different defenses, we shift the adversarial signal from explicit additive waveform noise to SSL feature-space perturbations and reconstruct them through a vocoder into speech-like waveform adversarial signals, making the resulting samples less aligned with waveform-bounded defenses. Extensive experiments show that, when optimized only on raw Whisper-small as a public surrogate model, our attack transfers effectively to black-box ASR models with a +26.6 WER improvement over the SOTA baseline, while also remaining effective against multiple training defenses with a +36.2 WER improvement. These results reveal a blind spot in current ASR robustness evaluation.
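For readers unfamiliar with the metric behind the reported gains: word error rate (WER) is conventionally the word-level edit distance between reference and hypothesis transcripts divided by the reference length. The sketch below implements that textbook definition, expressed in percent; the paper's exact text normalization and scoring tool are not stated in the abstract, so this is an assumption about the general metric only.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substitution (on -> off) and one deletion (kitchen) over 5 reference words.
print(word_error_rate("turn on the kitchen light", "turn off the light"))  # 40.0
```

For an attack, a higher WER on the victim model means a stronger attack, which is why the abstract reports positive WER differences as improvements.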

2606.05677 2026-06-05 cs.CV cs.AI cs.CL

LongSpace: Exploring Long-Horizon Spatial Memory from Perception to Recall in Video

Shiqiang Lang, Jing Liu, Haoyang He, Peiwen Sun, Yuanteng Chen, Tao Liu, Lan Yang, Longteng Guo, Honggang Zhang

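Affiliations: Beijing University of Posts and Telecommunications; Zhongguancun Academy; Institute of Automation, Chinese Academy of Sciences; The Chinese University of Hong Kong; Xi'an Jiaotong University

AI summary: To address the challenge of spatial memory in long videos, proposes the LongSpace framework, which achieves long-horizon spatial reasoning through chunked modeling, injection of 3D structural cues, and layer-aware memory, and validates it on benchmarks including LongSpace-Bench.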

Abstract

Multimodal Large Language Models (MLLMs) have advanced image and video understanding and can increasingly handle longer visual inputs. Long-horizon tasks such as autonomous driving and robotic navigation require more than recognizing the current view, as models must remember and retrieve previously observed spatial layouts, routes, viewpoint changes, and object states. To evaluate this capability, we introduce LongSpace-Bench, a room-tour video benchmark for long-horizon spatial memory, covering scene perception, spatial relations, and spatial memory. In this work, we further propose LongSpace, a memory framework for long-video spatial reasoning. LongSpace models long videos as sequential chunks, incorporates 3D structural cues into early decoder layers, and constructs layer-aware memory for question-guided retrieval. Experiments on multiple spatial reasoning benchmarks show that LongSpace improves long-video spatial understanding, further demonstrating explicit spatial memory as a key capability for long-horizon video MLLMs.

2606.05675 2026-06-05 cs.LG cs.CV

Two-Way Is Better Than One: Bidirectional Alignment with Cycle Consistency for Exemplar-Free Class-Incremental Learning

Hongye Xu, Bartosz Krawczyk

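Affiliations: Chester F. Carlson Center for Imaging Science, Rochester Institute of Technology

AI summary: Proposes BiCyc, which uses bidirectional projector alignment and a cycle-consistency objective to address prototype drift and one-directional projection bias in exemplar-free class-incremental learning, reducing catastrophic forgetting and improving accuracy.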

Comments Published as a conference paper at ICLR 2026. 23 pages, 8 figures. Code: https://github.com/HXuSz11/BiCyc_ICLR2026

Abstract

Continual learning (CL) seeks models that acquire new skills without erasing prior knowledge. In exemplar-free class-incremental learning (EFCIL), this challenge is amplified because past data cannot be stored, making representation drift for old classes particularly harmful. Prototype-based EFCIL is attractive for its efficiency, yet prototypes drift as the embedding space evolves; therefore, projection-based drift compensation has become a popular remedy. We show, however, that existing one-directional projections introduce systematic bias: they either retroactively distort the current feature geometry or align past classes only locally, leaving cycle inconsistencies that accumulate across tasks. We introduce BiCyc, a bidirectional projector alignment approach with a cycle-consistency objective. BiCyc jointly optimizes two maps, old-to-new and new-to-old, with stop-gradient gating so that transport and representation co-evolve. Analytically, we show that the cycle loss contracts the singular spectrum toward unity in whitened space, and that improved transport of class means and covariances yields smaller perturbations of classification log-odds, preserving old-class decisions and mitigating catastrophic forgetting. Empirically, across standard EFCIL benchmarks, BiCyc substantially reduces forgetting and improves accuracy in from-scratch settings, while remaining competitive in the pretrained fine-grained regime.
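Illustrative sketch (not the BiCyc implementation): a cycle-consistency objective over two maps asks that old-to-new followed by new-to-old returns the starting feature, and likewise in the other direction. The code assumes plain linear maps stored as nested lists and a squared-error penalty; the stop-gradient gating mentioned in the abstract is omitted because no automatic differentiation is involved here, and all names are hypothetical.

```python
def matvec(matrix, vector):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def cycle_loss(old_to_new, new_to_old, old_feats, new_feats):
    """Average round-trip reconstruction error in both directions."""
    total = 0.0
    for x in old_feats:  # old -> new -> old should return x
        total += sq_dist(matvec(new_to_old, matvec(old_to_new, x)), x)
    for y in new_feats:  # new -> old -> new should return y
        total += sq_dist(matvec(old_to_new, matvec(new_to_old, y)), y)
    return total / (len(old_feats) + len(new_feats))


forward = [[2.0, 0.0], [0.0, 0.5]]    # toy old-to-new map
inverse = [[0.5, 0.0], [0.0, 2.0]]    # its exact inverse
identity = [[1.0, 0.0], [0.0, 1.0]]   # a backward map that ignores the drift
feats = [[1.0, 1.0]]
print(cycle_loss(forward, inverse, feats, feats))   # 0.0
print(cycle_loss(forward, identity, feats, feats))  # 1.25
```

The loss is zero only when the two maps invert each other on the observed features, which is the sense in which a round trip leaves no residual inconsistency to accumulate across tasks.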

2606.05671 2026-06-05 cs.CL

QueryAgent-R1: Bridging Query Generation and Product Retrieval for E-Commerce Query Recommendation

Dike Sun, Zheng Zou, Jingtong Zang, Qi Sun, Huaipeng Zhao, Tao Luo, Xiaoyi Zeng

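Affiliations: Alibaba International Digital Commercial Group

AI summary: Proposes the QueryAgent-R1 framework, which uses memory augmentation and chain-of-retrieval optimization to align query generation with real inventory retrieval, improving product conversion from query recommendations in e-commerce search.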

Abstract

Query recommendation in e-commerce search aims to proactively suggest queries that match users' potential interests. However, existing methods mainly optimize query-level relevance, while neglecting whether the retrieved products align with users' downstream preferences. This mismatch often leads to high query click-through rates (CTR) but low product conversion rates (CVR). To bridge this gap, we propose QueryAgent-R1, a memory-augmented agentic framework that improves end-to-end alignment via chain-of-retrieval optimization. QueryAgent-R1 grounds query generation in real inventory retrieval, allowing the agent to validate and refine queries based on retrieved products. We also design a consistency reward in the agentic reinforcement learning (RL) process to jointly optimize query relevance and downstream engagement. In addition, we construct a memory abstraction module for efficient user profiling. To support offline evaluation, we construct two datasets based on both proprietary industrial data and public datasets, on which QueryAgent-R1 consistently outperforms strong baselines. Moreover, on a large-scale production platform, QueryAgent-R1 improves query CTR by 2.9% and guided CVR by 3.1% in online A/B tests.

2606.05670 2026-06-05 cs.AI

Do More Agents Help? Controlled and Protocol-Aligned Evaluation of LLM Agent Workflows

Yuhang Fu, Ruishan Fang, Jiaqi Shao, Huiyu Zheng, Zhengtao Zhu, Bing Luo, Tao Lin

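Affiliations: Beijing University of Posts and Telecommunications; Westlake University; Zhejiang University; Duke Kunshan University; Hong Kong University of Science and Technology; Zhejiang University of Technology

AI summary: Proposes the BenchAgent framework, which compares single-agent, fixed multi-agent, and evolving multi-agent workflows under a unified protocol, and finds that most multi-agent systems do not beat the single-agent baseline on accuracy, while a runtime-generated workflow performs strongly on GAIA.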

Comments https://github.com/LINs-lab/MASArena/tree/BenchAgent

Abstract

Does adding more agents help an LLM workflow once compared systems share the same benchmark loader, tool access, answer contract, usage accounting, and trajectory logging? We introduce BenchAgent, an evaluation framework that places single-agent, fixed multi-agent (MAS), and evolving MAS workflows under one normalized execution and logging protocol. BenchAgent evaluates these substrate-internal workflows across ten reasoning, coding, and tool-use benchmarks with GPT-4.1, and separately reports a Protocol-Aligned External (PAE) GAIA study of a runtime-generated workflow. Under SI conditions, at most one of six tested MAS exceeds the matched single-agent anchor on benchmark-balanced average accuracy: EvoAgent lies within the Wilson one-run guidance, while the remaining five trail by 2.56-11.29 points and occupy more expensive accuracy-cost trade-offs. On the PAE GAIA snapshot, a Claude-Code-style runtime workflow reaches 66.72% overall and 69.23% on Level 3, more than 20 points above the strongest non-Claude baseline, Jarvis, a fixed MAS.
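The abstract does not define its "Wilson one-run guidance". A plausible reading, and only an assumption here, is the Wilson score interval for a binomial proportion, used to judge whether an accuracy gap from a single run exceeds sampling noise. The sketch below computes that standard interval; the counts in the example are hypothetical and not taken from the paper.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


# Hypothetical run: 50 correct answers out of 100 tasks.
low, high = wilson_interval(50, 100)
print(round(low, 4), round(high, 4))
```

With 100 tasks the interval spans roughly ten points on either side of the observed accuracy, so small accuracy differences from one run carry little weight on a benchmark of that size.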

2606.05669 2026-06-05 cs.RO cs.SY eess.SY

Dynamic Multi-Agent Pickup and Delivery in Robotic Cellular Warehousing Systems

Cheng Ren, Ming Li, Xinping Guan, George Q. Huang

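Affiliations: Department of Industrial and Systems Engineering, The Hong Kong Polytechnic University; School of Automation and Intelligent Sensing, Shanghai Jiao Tong University

AI summary: For warehouse scenarios in which SKUs are dynamically appended within an order, formalizes the dynamic multi-agent pickup and delivery problem for the first time and proposes two event-triggered online replanning algorithms based on token passing, significantly reducing order flowtime.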

Abstract

Robotic Cellular Warehousing Systems (RCWS) give rise to multi-agent pickup and delivery (MAPD) processes in which robots sequentially collect multiple stock-keeping units (SKUs) for each order. Unlike classical MAPD formulations that assume static tasks, real warehouse operations often involve dynamic order evolution, where new SKUs may be appended to an order while it is being executed. Motivated by this practical requirement, this letter formulates the Dynamic Multi-Agent Pickup and Delivery problem considering internal order evolution for the first time. Building on the token passing paradigm, we propose two event-triggered online replanning algorithms. The first, Dynamic Token Passing, performs localized replanning upon order updates through add-order decomposition and priority-based token scheduling while preserving collision-free execution. The second, Cooperative Token Passing, further enables idle robots to opportunistically assist newly added pickups, improving system-level efficiency. Simulation results in RCWS environments demonstrate that the proposed methods significantly reduce order flowtime compared with static and non-cooperative baselines.

2606.05665 2026-06-05 cs.CV

V2V-Bench: A Comprehensive Benchmark for Video-to-Video Generation Evaluation

Tao Liu, Leela Krishna, Gouti Pavan Kumar, Sreeja K, Vishav Garg

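Affiliations: Centific Global Solutions Inc.

AI summary: Because existing metrics cannot jointly measure adherence to editing instructions and frame-level correspondence in video-to-video generation, proposes V2V-Bench, a benchmark with 11 dimensions in 5 categories, evaluates three models, and shows that it correlates strongly with human judgments.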

Comments Accepted at ICML 2026 workshop

Abstract

Video-to-video (V2V) generation is difficult to evaluate because outputs must both follow editing instructions and preserve frame-level correspondence with the source video, which existing T2V and I2V metrics do not capture. We introduce V2V-Bench, an 11-dimension benchmark organized into five categories: temporal alignment, structural fidelity, transformation quality, video quality, and semantic alignment. V2V-Bench pairs diverse source videos with challenging editing tasks and evaluates two commercial models, Grok Imagine and Gemini Veo3, and one open-source model, Open Sora 2. Results show complementary model strengths: Grok performs better on editing fidelity, while Veo3 achieves stronger visual quality. On six V2V-specific dimensions, V2V-Bench reaches a Spearman correlation of 0.905 with human judgments.
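The agreement figure in the abstract is a Spearman rank correlation, which is the Pearson correlation computed on ranks rather than raw scores. The sketch below implements the standard definition with average ranks for ties; the metric and human scores in the example are made up for illustration and are not from the paper.

```python
import math


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation between two equally long score lists."""
    return pearson(ranks(x), ranks(y))


# Hypothetical automatic metric scores vs. human ratings for five videos.
metric_scores = [1, 2, 3, 4, 5]
human_scores = [2, 1, 4, 3, 5]
print(round(spearman(metric_scores, human_scores), 3))  # 0.8
```

Because only the ordering matters, the coefficient rewards a benchmark for ranking outputs the way annotators do, regardless of the scale on which either side scores them.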

2606.05663 2026-06-05 cs.RO

Preserving Full 6-DOF Actuation Under Abrupt Total Rotor Failures: Passive Fault-Tolerant Flight Control Using a Biaxial-Tilt Hexacopter

Yipeng Yang, Yiqiao Tang, Hao Zhang, Jinqi Jiang, Jianfeng He, Rumo Chen, Xinghu Yu, Zhan Li, Huijun Gao

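Affiliations: Tsinghua University

AI summary: For a biaxial-tilt overactuated hexacopter under abrupt total rotor failures, proposes two passive fault-tolerant control schemes that need no fault detection, achieves full 6-DOF trajectory tracking, and verifies robustness in simulation and experiments.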

Abstract

Conventional multirotors suffer from a rapid collapse of attainable wrench space (AWS) under abrupt total rotor failures, rendering full 6-DOF recovery physically impossible. This paper addresses passive fault-tolerant flight of a biaxial-tilt overactuated hexacopter (BTO) under abrupt total rotor failures that are a priori unknown to the controller. The control design and analysis focus on representative abrupt rotor-failure cases for which the post-failure system remains fully actuated, while no explicit fault detection, isolation, or fault-mode switching is assumed. First, we extend the inscribed-sphere metric of the AWS by incorporating the transient-wrench-jump term, enabling quantitative feasibility assessment under up to three simultaneous rotor failures and benchmarking against uniaxial-tilt and coplanar hexacopters. Second, we develop two computationally efficient passive schemes without relying on fault detection or online optimization. One scheme operates at the controller layer by combining a high-order fully actuated (HOFA) controller with a linear extended state observer (LESO) for lumped-disturbance rejection. The other scheme operates at the allocator layer by using model-reference adaptive control allocation with momentum-based wrench estimation to compensate for control-allocation biases. Simulations and flight experiments validate stable hovering and 6-DOF trajectory tracking under single and multiple rotor failures. Further systematic comparisons confirm that the BTO provides larger recovery margins than uniaxial-tilt and coplanar designs. Additional onboard-sensor-only experiments, including indoor tracking under wind disturbance, outdoor tracking under extreme conditions, narrow-frame traversal, and contact-based aerial writing, further validate the robustness of the proposed framework in complex operational environments.

2606.05661 2026-06-05 cs.AI cs.CL

Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments

Parth Asawa, Christopher M. Glaze, Gabriel Orlanski, Ramya Ramakrishnan, Benji Xu, Asim Biswal, Vincent Sunn Chen, Frederic Sala, Matei Zaharia, Joseph E. Gonzalez

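Affiliations: UC Berkeley; Snorkel AI; University of Wisconsin-Madison

AI summary: Proposes CL-Bench, the first expert-validated continual learning benchmark, spanning six domains and using a gain metric to isolate online learning ability; finds that existing systems overfit and reuse knowledge poorly.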

Abstract

Continual learning, the ability of AI systems to improve through sequential experience, has attracted substantial interest, but no high-quality benchmark exists to evaluate it. We introduce Continual Learning Bench (CL-Bench), the first difficult, expert-validated benchmark designed to measure whether LLM-based systems genuinely improve with experience. CL-Bench spans six diverse domains (software engineering, signal processing, disease outbreak forecasting, database querying, strategic game-playing, and demand forecasting), each validated by domain experts and designed so that tasks share a learnable latent structure (codebase layout, disease outbreak dynamics, opponent strategies) that a stateful system can discover online but a stateless one cannot. We evaluate frontier models across several agent architectures, from naive in-context learning (ICL) to dedicated memory systems, introducing a gain metric to isolate learning from prior capabilities. We find that these systems leave headroom for improved continual learning: agents frequently overfit to immediate observations or fail to reuse knowledge across instances, and dedicated memory systems do not fix this -- in fact, naive ICL outperforms systems dedicated to memory management. CL-Bench is the first benchmark to evaluate continual learning across diverse real-world domains with expert-validated tasks and isolate online learning from underlying model capability, showing a need for better continual learning systems.

2606.05660 2026-06-05 cs.RO cs.AI

Safe Embodied AI for Long-horizon Tasks: A Cross-layer Analysis of Robotic Manipulation

Dabin Kim, Daemin Park, Sangyub Lee, Jinsik Kim, Yeongtak Oh, Jongho Shin, Sungroh Yoon

Affiliations: UNIST InnoCORE AI-Space Solar Initiative; Ulsan National Institute of Science and Technology (UNIST); Automation and Systems Research Institute; Department of Electrical and Computer Engineering; Interdisciplinary Program in Artificial Intelligence; LG Electronics

AI summary: From an embodied-AI perspective, systematically surveys safety in long-horizon robotic manipulation, organizing the literature by intervention timing (planning time, policy time, execution time), analyzing the strength of evidence, and identifying gaps in current safety guarantees along with future directions.

Comments 63 pages, 6 figures

Abstract

Embodied AI systems are increasingly expected to reason and act over extended horizons in physical environments. This growing capability brings safety to the foreground, because failures in the physical world can harm people, damage objects, and disrupt workplaces. Although safe embodied AI has attracted substantial attention, the literature remains fragmented across planning, policy design, and runtime execution. Long-horizon robotic manipulation is a particularly revealing anchor domain for this problem because semantic misgrounding, subtask-level error propagation, execution drift, and contact-rich physical risk can accumulate within the same closed-loop system. This survey therefore provides a structured review of safety in long-horizon robotic manipulation from an embodied AI perspective. We organize the literature by intervention locus, covering planning-time, policy-time, and execution-time safety, and we analyze the strength of the evidence that each line of work provides, distinguishing formal guarantees, statistical support, and empirical safety heuristics. This framework clarifies the distinct roles of backbone capability papers, direct safety mechanisms, and benchmark or evaluation studies, while exposing where current safety claims are well supported and where they remain indirect. We identify persistent gaps, including limited evidence for policy-time safety, weak formal support for contact-rich long-horizon manipulation, immature uncertainty-triggered intervention, and a shortage of manipulation-specific safety benchmarks. We conclude by outlining research directions for cross-layer assurance, evaluation design, and safer deployment of long-horizon robotic agents in real-world settings.

2606.05652 2026-06-05 cs.CV

CoFi-UCGen: Coarse-to-Fine Unsupervised Conditional Generation without Label Priors

Shengxi Li, Zhaokun Hu, Ce Zheng, Mai Xu, Jingyuan Xia, Si Liu

Affiliations: Department of Electronic Information Engineering, Beihang University; School of Cyber Science and Technology, Beihang University; College of Electronic Science, National University of Defense Technology; Institute of Artificial Intelligence, Beihang University

AI summary: Proposes CoFi-UCGen, a coarse-to-fine unsupervised conditional generation framework that disentangles global and fine-grained semantics without labels through an adversarial semantic reciprocal learning theory and bit-codes, and controls generation with a hierarchical modulation mechanism in diffusion models.

Abstract

Unsupervised conditional image generation (UCGen) aims to control generation without relying on manually annotated labels, yet remains challenging because semantic representations are unstructured across granularities. To address this, we propose a novel coarse-to-fine UCGen framework (CoFi-UCGen) that explicitly disentangles global semantics from fine-grained variations and which, to the best of our knowledge, is the first successful attempt at both coarse- and fine-grained conditional generation without any labels. More specifically, we first propose the adversarial semantic reciprocal learning theory to ensure semantic consistency and completeness between the image and latent spaces. Based on this consistency, we propose bit-codes to learn a structured coarse-grained latent space, and further prove that our bit-codes carry distinct global semantics while preserving independent noise sampling for generation. Building upon these bit-codes, we establish a fine-grained semantic basis and introduce a hierarchical modulation mechanism in diffusion models, which enables layer-wise injection from coarse conditions to progressively control fine-grained attributes during generation. Extensive experiments demonstrate that without any label priors or pre-trained feature extractors, our CoFi-UCGen consistently outperforms existing UCGen methods in terms of image quality, semantic consistency, and control accuracy, verifying the effectiveness of explicit coarse-to-fine semantic decomposition for the challenging UCGen task.

2606.05647 2026-06-05 cs.AI cs.CL cs.CY cs.HC

Coding with "Enemy": Can Human Developers Detect AI Agent Sabotage?

与“敌人”编码:人类开发者能否检测到AI代理的破坏行为?

Jingheng Ye, Huiqi Zou, Simon Yu, Weiyan Shi

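Affiliations * Northeastern University

AI summary Through a large-scale user study, examines whether human developers can detect malicious code inserted by AI agents during long coding tasks; finds that 94% of developers fail to identify the sabotage, analyzes why, and offers design recommendations for safety monitors.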

Comments 34 pages, 30 figures, 3 tables

详情
AI中文摘要

AI编码代理越来越多地嵌入到现实世界的软件开发中,与人类开发者协作,同时获得对代码库和工具的更广泛访问权限。这创造了一个新的攻击面:代理可以利用人类信任来破坏开发,例如通过插入恶意代码来完成隐藏的附带任务。大多数先前的工作研究AI-only环境中的AI破坏,对人类监督在检测和减轻此类恶意行为中的作用关注有限。为填补这一空白,我们进行了首个关于AI编码破坏中人类监督的大规模研究。超过100名参与者与四个前沿模型(Claude-Opus-4.6、GPT-5.4、Gemini-3.1-Pro和MiniMax-M2.7)之一合作,完成一项持续约五小时的长周期编码任务,旨在模拟真实工作流程。我们发现94%的开发者未能检测到破坏,我们对参与者反馈的分析将这一脆弱性归因于最小化的代码审查、合理的掩护故事以及对代理的过度信任。我们进一步测试了安全监控器在一种条件下的有效性:虽然监控器降低了破坏成功率,但仍有56%的参与者接受了恶意代码,忽略了其警告。根据参与者反馈,我们为更好的监控器设计提供了可操作的建议。这项工作补充了现有的AI安全研究,并强调了迫切需要以人为本的安全机制,考虑人类因素,特别是在长周期、真实世界的开发环境中。

英文摘要

AI coding agents are increasingly embedded in real-world software development, collaborating with human developers while gaining broader access to codebases and tools. This creates a new attack surface: an agent can exploit human trust to sabotage development, for instance by inserting malicious code to accomplish a hidden side task. Most prior work studies AI sabotage in AI-only settings, paying limited attention to the role of human oversight in detecting and mitigating such malicious behavior. To address this gap, we conduct the first large-scale study of human oversight in AI coding sabotage. Over 100 participants collaborate with one of four frontier models (Claude-Opus-4.6, GPT-5.4, Gemini-3.1-Pro, and MiniMax-M2.7) on a long-horizon coding task lasting around five hours, designed to mimic real-world workflows. We find that 94% of developers fail to detect sabotage, and our analysis of participant feedback attributes this vulnerability to minimal code review, plausible cover stories, and overtrust in agents. We further test the effectiveness of a safety monitor in one condition: while the monitor reduces sabotage success, 56% of participants still accept the malicious code, ignoring its warnings. Drawing on participant feedback, we offer actionable suggestions for better monitor design. This work complements existing AI safety research and highlights an urgent need for human-centric safety mechanisms that account for human factors, particularly in long-horizon, real-world development settings.

2606.05644 2026-06-05 cs.AI

FIDES: Faithful Inference via Deep Evidence Signals for Retrieval-Memory Conflict in RAG

FIDES: 通过深层证据信号实现RAG中检索-记忆冲突的忠实推理

Zhe Yu, Wenpeng Xing, Tiancheng Zhao, Mohan Li, Changting Lin, Meng Han

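Affiliations * Binjiang Institute of Zhejiang University; Zhejiang University; Guangzhou University; GenTel.io

AI summary Addresses the problem in retrieval-augmented generation where conflicts between retrieved evidence and parametric memory lead models to ignore context. Proposes FIDES, a training-free decoder that fuses three internal signals (output surface, hidden representations, and prediction trajectory) to adjust intervention strength dynamically at the token level, substantially improving context fidelity.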

详情
AI中文摘要

当检索到的证据与参数记忆相矛盾时,语言模型常常忽略上下文并默认采用记忆化的先验知识——这种失败削弱了检索增强的核心目的。对比解码通过放大上下文条件输出以抑制参数偏差,但现有方法基于一个隐含假设:这种偏差在token间是均匀的。单一的全局对比权重会过度惩罚安全token,同时使真正存在冲突的token得不到充分纠正。我们识别出token级别的冲突集中现象:检索-记忆张力呈现高度异质性,集中在少数答案关键的解码步骤上。这重新定义了对比解码:从“施加多少对比”转变为“在何处施加对比”。我们提出FIDES(通过深层证据信号实现忠实推理),一种无训练解码器,它读取三种内部信号——输出表面、隐藏表示和预测轨迹——在互补深度探测检索-记忆冲突,并融合它们以控制每个解码步骤的干预强度。在三个基准和六个主干模型(四个主流的7B/8B模型和两个扩展至70B的主干模型)上,FIDES在所有18个设置中实现了最佳的上下文忠实度,比最强的无训练基线高出3到13个百分点。在70B规模上,忠实度达到92-94%,同时F1分数飙升至62-63%,表明token级别的选择性解锁了粗粒度对比规则所抑制的生成能力。

英文摘要

When retrieved evidence contradicts parametric memory, language models frequently ignore context and default to memorized priors -- a failure that undermines the core purpose of retrieval augmentation. Contrastive decoding amplifies the context-conditioned output to suppress parametric bias, but existing methods rest on an implicit assumption that this bias is uniform across tokens. A single global contrastive weight over-penalizes safe tokens while leaving genuinely conflicted ones insufficiently corrected. We identify token-level conflict concentration: retrieval-memory tension is sharply heterogeneous, concentrated on a small fraction of answer-critical decoding steps. This reframes contrastive decoding from how much contrast to apply to where to apply it. We propose FIDES (Faithful Inference via Deep Evidence Signals), a training-free decoder that reads three internal signals probing retrieval-memory conflict at complementary depths -- output surface, hidden representations, and prediction trajectory -- and fuses them to govern intervention strength at each decoding step. Across three benchmarks and six backbones -- four primary 7B/8B models and two scaling backbones up to 70B -- FIDES achieves the best context fidelity in all 18 settings, outperforming the strongest training-free baseline by +3 to +13 points. On the 70B scale, fidelity reaches 92-94% while F1 surges to 62-63%, demonstrating that token-level selectivity unlocks generation capability that coarse contrastive rules suppress.
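
The move from one global contrastive weight to a per-step weight can be sketched in a few lines. The Python below is a minimal illustration and not the FIDES decoder: it assumes the common contrastive form `(1 + alpha) * with_context - alpha * without_context` over logits, and it replaces the paper's three fused signals with a single hypothetical conflict score, the Jensen-Shannon divergence between the two next-token distributions. The names `conflict_score`, `adaptive_contrast`, and `max_alpha` are invented for this sketch.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def conflict_score(p: List[float], q: List[float]) -> float:
    """Jensen-Shannon divergence in bits, bounded in [0, 1].

    Hypothetical stand-in for a retrieval-memory conflict signal.
    """
    def kl(a: List[float], b: List[float]) -> float:
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    mid = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


def adaptive_contrast(
    ctx_logits: List[float],
    prior_logits: List[float],
    max_alpha: float = 2.0,
) -> Tuple[List[float], float]:
    """Contrast context-conditioned logits against context-free logits.

    The weight alpha grows with the conflict score, so steps where both
    distributions agree are left untouched (alpha == 0).
    """
    alpha = max_alpha * conflict_score(softmax(ctx_logits), softmax(prior_logits))
    adjusted = [(1 + alpha) * c - alpha * p for c, p in zip(ctx_logits, prior_logits)]
    return adjusted, alpha
```

With identical inputs the step passes through unchanged; when the two distributions disagree, the weight rises and the context-supported token is amplified. A fixed global weight would apply the same correction in both cases.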

2606.05641 2026-06-05 cs.CV

Multi-Task Crack Foundation Model for Engineering-Reliable Crack Representation and Topology Preservation in Civil Infrastructure

面向工程可靠裂缝表示与拓扑保持的土木基础设施多任务裂缝基础模型

Blessing Agyei Kyem, Joshua Kofi Asamoah, Eugene Denteh, Armstrong Aboah

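Affiliations * NDSU (North Dakota State University)

AI summary Proposes CrackGeoFM, a multi-task framework that combines a frozen visual foundation backbone with crack-specific adaptation modules for mask prediction, skeleton reconstruction, and uncertainty estimation, achieving state-of-the-art segmentation, topology preservation, and calibrated uncertainty across 20 datasets.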

Comments 60 pages, 17 figures, 11 tables

详情
AI中文摘要

可靠的裂缝评估不仅需要准确的像素级掩码,还需要在域偏移下保持稳定的连通裂缝几何形状和置信度估计。然而,现有的分割模型在实现高重叠分数的同时,可能会使裂缝碎片化、遗漏细小分支,并且无法提供校准的不确定性。为了解决这一问题,本文提出了 CrackGeoFM,一个多任务框架,它将冻结的视觉基础骨干与裂缝专用适配相结合,用于掩码预测、骨架重建和不确定性估计。该框架集成了频率引导的裂缝增强模块(FCEM)以增强高频裂缝线索,裂缝域特征适配模块(CFAM)以将冻结骨干特征适配到裂缝域模式,以及结构感知多任务解码器(SMTD)以联合解码掩码、骨架和不确定性。在20个裂缝数据集上,CrackGeoFM 实现了最先进的分割性能、改进的拓扑保持、校准的不确定性以及仅需五张标注图像的有效少样本适应。这些结果支持可靠、可泛化且面向工程的裂缝分析,用于基础设施评估。

英文摘要

Reliable crack assessment requires not only accurate pixel-level masks but also connected crack geometry and confidence estimates that remain stable under domain shift. However, existing segmentation models can achieve high overlap scores while fragmenting cracks, missing fine branches, and providing no calibrated uncertainty. To address this gap, this paper proposes CrackGeoFM, a multi-task framework that combines a frozen visual foundation backbone with crack-specific adaptation for mask prediction, skeleton reconstruction, and uncertainty estimation. The framework integrates a Frequency-Guided Crack Enhancement Module (FCEM) to enhance high-frequency crack cues, a Crack-Domain Feature Adaptation Module (CFAM) to adapt frozen backbone features to crack-domain patterns, and a Structure-Aware Multi-Task Decoder (SMTD) to jointly decode masks, skeletons, and uncertainty. Across 20 crack datasets, CrackGeoFM achieves state-of-the-art segmentation, improved topology preservation, calibrated uncertainty, and effective few-shot adaptation with only five labeled images. These results support reliable, generalizable, and engineering-oriented crack analysis for infrastructure assessment.

2606.05639 2026-06-05 cs.LG

Q-GNN: Query-Conditioned Graph Neural Networks with Type Awareness for Knowledge Graph Completion

Q-GNN: 具有类型感知的查询条件图神经网络用于知识图谱补全

Dongxiao He, Ruqiong Zhang, Zhizhi Yu, Ling Ding, Di Jin, Guangquan Xu, Zhiyong Feng

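Affiliations * College of Intelligence and Computing, Tianjin University

AI summary Proposes Q-GNN, which strengthens graph neural network reasoning for knowledge graph completion by incorporating the query entity's structural context and semantic type information.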

详情
AI中文摘要

知识图谱补全(KGC)旨在从不完整的知识图谱中预测缺失的三元组,这对于下游应用至关重要。近年来,基于图神经网络(GNN)的方法通过在以查询为中心的局部子图上进行消息传递取得了显著成功。然而,在实践中,查询由实体和关系共同定义,两者都携带推理不可或缺的信息,但这些方法仅依赖查询关系作为引导信号,而查询实体中固有的信息未被利用来指导推理——实体仅作为子图提取的结构锚点。为此,我们从两个角度将查询实体信息融入推理过程:第一是结构上下文,即实体周围的邻居结构和关系模式,由专用上下文编码器编码并用于调制消息;第二是实体的语义类型,由大语言模型推断,并融入注意力计算和最终评分,以提供类型级别的先验约束。这两类信息共同使推理过程同时受查询关系和查询实体引导。在标准基准上的实验结果证明了所提出的Q-GNN的有效性。

英文摘要

Knowledge Graph Completion (KGC) aims at predicting missing triplets from incomplete knowledge graphs, which is crucial for downstream applications. Recently, Graph Neural Network (GNN)-based methods have achieved remarkable success by performing message passing over query-centered local subgraphs. However, in practice, a query is jointly defined by both the entity and the relation, with both carrying information indispensable for reasoning, yet these methods rely solely on the query relation as the guiding signal, while the information inherent in the query entity is not leveraged to guide inference - the entity serves merely as a structural anchor for subgraph extraction. To this end, we incorporate query entity information into the reasoning process from two perspectives: the first is structural context, i.e., the neighboring structure and relation patterns around the entity, which is encoded by a dedicated context encoder and used to modulate messages; the second is semantic type of the entity, inferred by a large language model, which is incorporated into attention computation and final scoring to provide type-level prior constraints. Together, these two sources of information enable the reasoning process to be guided by both the query relation and the query entity. Experimental results on standard benchmarks demonstrate the effectiveness of the proposed Q-GNN.

2606.05636 2026-06-05 cs.LG

StableRCA: Robust Graph-Agnostic Mechanism-Level Root Cause Analysis

StableRCA:鲁棒的图无关机制级根因分析

Xiaoyu Lin, Nicholas Tagliapietra, Kehan Li, Lavdim Halilaj, Juergen Luettin

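Affiliations * Department of Computer Science, Tsinghua University; Bosch Center for Artificial Intelligence; Computer Science Department, TU Darmstadt

AI summary Proposes the StableRCA framework, which avoids global graph discovery by estimating local Markov boundaries and detecting conditional distribution shifts, enabling robust mechanism-level root cause analysis.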

详情
AI中文摘要

根因分析(RCA)旨在识别复杂领域(如制造业、云计算和医疗保健)中导致系统行为异常的变量。现有方法面临一个关键瓶颈:基于图的因果方法可以识别干预目标,但通常需要已知或准确估计的因果图,而无图统计方法要么定位边际异常而非结构原因,要么依赖于对图结构或函数形式的限制性假设。我们提出StableRCA,一种局部机制级RCA框架,通过估计局部马尔可夫边界并检测其中的条件分布偏移来避免全局图发现。利用独立因果机制原理,我们证明在忠实马尔可夫边界恢复和非退化机制偏移下,干预目标可以以样本量指数收敛的概率被识别。在合成基准和五个真实世界数据集上的实验表明,StableRCA对图错误指定具有鲁棒性,在多个干预目标下有效,可扩展至大型系统,并在不同应用领域中可靠。代码可在 https://anonymous.4open.science/r/StableRCA-E362 获取。

英文摘要

Root-Cause Analysis (RCA) seeks to identify the variables responsible for abnormal system behavior in complex domains such as manufacturing, cloud computing, and healthcare. Existing approaches face a critical bottleneck: graph-based causal methods can identify intervention targets but typically require a known or accurately estimated causal graph, while graph-free statistical methods either localize marginal anomalies rather than structural causes, or rely on restrictive assumptions about graph structure or functional form. We propose StableRCA, a local mechanism-level RCA framework that avoids global graph discovery by estimating local Markov boundaries and detecting conditional distribution shifts within them. Leveraging the Independent Causal Mechanism principle, we show that intervention targets can be identified with probability converging exponentially in sample size under faithful Markov boundary recovery and non-degenerate mechanism shifts. Experiments on synthetic benchmarks and five real-world datasets demonstrate that StableRCA is robust to graph misspecification, effective under multiple intervention targets, scalable to large systems, and reliable across diverse application domains. Code is available at: https://anonymous.4open.science/r/StableRCA-E362

2606.05635 2026-06-05 cs.CV cs.MM

ShotCrop$^3$: Cropping Human-Centric Images into Cinematic Triple-Shot Compositions

ShotCrop$^3$:将人物中心图像裁剪为电影级三镜头构图

Dehong Kong, Lina Lei, Lingtao Zheng, Chenyang Wu, Ailing Zhang, Xinran Qin, Teng Ma, Jiaqi Xu, Zhixin Wang, Zhikai Chen, Xuecheng Qi, Renjing Pei, Fan Li

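Affiliations * Huawei Noah’s Ark Lab; Sun Yat-sen University

AI summary Proposes the triple-shot composition task and a three-stage training pipeline (chain-of-thought fine-tuning, semi-supervised fine-tuning, and group relative policy optimization) that generates establishing, medium, and close-up crops from a single human-centric image, each with a brief description, to support visual storytelling.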

详情
AI中文摘要

先前关于美学构图的工作通常产生单一美观的裁剪,忽略了从一个场景中构图多个镜头的叙事价值。在实践中,多镜头构图对于下游创意工作流程至关重要:商业海报通常需要不同重点(例如,背景、主体和情感/产品细节)的多个裁剪来呈现关键故事节拍。因此,我们提出了三镜头构图(TSC),这是一个构图任务,从单张人物中心图像生成一个三镜头集(远景、中景和特写),每个镜头都配有简短的镜头描述以支持视觉叙事。为了在有限的专家标注下学习TSC,我们引入了ShotCrop,它经历了一个三阶段训练过程:首先应用思维链监督微调以建立基本推理和美学裁剪技能,然后使用高置信度伪标签进行半监督微调以进一步增强美学能力,最后通过针对ShotCrop的组相对策略优化(GRPO-S)进行优化,使用为其定制的复合奖励。具体来说,我们的伪标签策略结合了基于MLLM的评分、美学评估和CLIP相似度,以保留高置信度的训练信号。此外,我们提出了TSC-Bench,一个包含1.2k个专家标注测试用例的基准。值得注意的是,ShotCrop在镜头定位准确率上比GPT-5平均提高了2.82倍。

英文摘要

Prior work on aesthetic composition typically produces a single aesthetically pleasing crop, overlooking the narrative value of composing multiple shots from one scene. In practice, multi-shot composition is critical for downstream creative workflows: commercial posters often require multiple crops with different emphases (e.g., context, subject, and emotion/product details) to present key story beats. Therefore, we propose Triple-Shot Compositions (TSC), a composition task that generates a three-shot set -- establishing, medium, and close-up -- from a single human-centric image, each paired with a brief shot description to support visual narration. To learn TSC with limited expert annotations, we introduce ShotCrop, which undergoes a three-stage training process: it first applies Chain-of-Thought supervised fine-tuning to establish basic reasoning and aesthetic shot-cropping skills, then performs semi-supervised fine-tuning with high-confidence pseudo labels to further enhance aesthetic capability, and is finally optimized with Group Relative Policy Optimization for ShotCrop (GRPO-S) using a composite reward tailored for it. Specifically, our pseudo-labeling strategy combines MLLM-based scoring, aesthetic assessment, and CLIP similarity to retain high-confidence training signals. In addition, we present TSC-Bench, a benchmark of 1.2k expert-annotated test cases. Notably, ShotCrop achieves an average improvement of 2.82 times over GPT-5 in shot localization accuracy.

2606.05634 2026-06-05 cs.CL

Bootstrapping Semantic Layer from Execution for Text-to-SQL

从执行中引导语义层用于文本到SQL

Youngwon Lee, Jaejin Kim, Seung-won Hwang

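Affiliations * Seoul National University

AI summary Proposes GATE, which bootstraps the missing semantic layer from execution feedback and stores execution results as reusable memory, improving text-to-SQL accuracy.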

详情
AI中文摘要

现实世界中的文本到SQL任务常常是欠指定的,直到用户短语在数据库存储值的方式中得到具体化。先前的工作试图通过要求预先指定语义层来解决这个问题,但这种规范往往不完整,尤其是在领域特定约定记录不足的专家领域。由于这为相同的SQL部分留下了多个具体化假设,我们引入了GATE(从执行后测试中具体化),它从执行反馈中引导缺失的具体化。GATE保持具体化假设开放,同时执行已具体化的部分以获得观察结果。然后,只有被该观察支持的假设被具体化并存储为记忆条目,记录测试了什么以及开放部分应如何用SQL编写。这些条目累积成执行具体化的记忆,允许后续步骤重用支持的具体化。在真实世界和受控基准测试中,GATE一致地优于强基线,表明执行不仅可以作为验证,还可以作为文本到SQL中可复用记忆的引导机制。

英文摘要

Real-world text-to-SQL is often under-specified until user phrases are grounded in how the database stores values. Prior work attempts to address this by requiring a semantic layer to specify groundings in advance, but such specifications are often incomplete, especially in expert domains where domain-specific conventions are under-documented. As this leaves multiple grounding hypotheses open for the same SQL part, we introduce GATE (Grounding After Test from Execution), which bootstraps missing groundings from execution feedback. GATE keeps grounding hypotheses open while executing the already grounded parts to obtain observations. Then, only the hypothesis supported by that observation is grounded and stored as a memory entry, recording what was tested and how the open part should be written in SQL. These entries accumulate into execution-grounded memory, allowing later steps to reuse supported groundings. Across real-world and controlled benchmarks, GATE consistently improves over strong baselines, demonstrating that execution can serve not only as validation but also as a bootstrapping mechanism for reusable memory in text-to-SQL.

2606.05633 2026-06-05 cs.AI

Answer Presence Drives RAG Rewriting Gains

答案存在驱动RAG重写收益

Yuejie Li, Yueying Hua, Ke Yang, Li Zhang, Yueping He, Ruiqi Li, Bolin Chen, Tao Wang, Bowen Li, Chengjun Mao

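Affiliations * Ant Group

AI summary Using a controlled intervention audit, finds that the gains rewriters bring to retrieval-augmented QA are driven mainly by the gold answer string appearing in the rewritten context rather than by improved evidence quality.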

详情
AI中文摘要

检索增强的问答管道通常将检索到的段落通过LLM重写器处理后输入较小的阅读器,在多跳基准测试中将F1提升数十个百分点;这种提升通常归因于证据质量的改善。我们通过受控干预审计,探究这种提升是否由黄金答案字符串出现在重写上下文中而非整理本身因果驱动。对于每个重写上下文,我们对编译输出进行四种受控编辑后重新运行阅读器:移除黄金答案跨度、替换为长度匹配的随机非答案跨度(安慰剂)、将黄金答案注入原本缺失的重写中(前缀或中间句子边界)。跨越三个阅读器系列(Qwen2.5-7B、Qwen3.5-35B、GLM-4.7)、两个数据集(HotpotQA、2WikiMultihopQA)和三种编译器安排(仅MA、仅MB、MA+验证)的十二个(单元、基线)干预运行中,在配对的answer-in-compile层上,移除黄金答案导致阅读器F1比长度匹配的安慰剂下降28到64个百分点,而在12个(单元、基线)组合中的10个中,将黄金答案前置到原本缺失的重写中使F1提升+0.7到+9.7个百分点。一项配套的五哨兵审计显示,传统的单[MASK]探针本身对哨兵敏感:在2Wiki上,它报告+4.12 F1的“非泄漏残差”,在四种替代哨兵下翻转至-3.33到-7.81 F1,并且对其中三种哨兵未能通过等价检验(1/4通过)。我们不提出新的重写器或缓解措施;我们发布干预运行器和哨兵面板,以便其他重写器收益声明可以针对相同标准进行测试。

英文摘要

Retrieval-augmented QA pipelines often route retrieved passages through an LLM rewriter before a smaller reader, lifting F1 by tens of points on multi-hop benchmarks; this gain is typically credited to improved evidence quality. We ask whether that lift is causally driven by the gold answer string appearing in the rewritten context rather than by curation per se, using a controlled intervention audit. For each rewritten context we re-run the reader after one of four controlled edits to the compile output: removing the gold answer span, replacing a length-matched random non-answer span (placebo), or injecting the gold into rewrites where it was absent (at the prefix or at a midpoint sentence boundary). Across twelve completed (cell, baseline) intervention runs spanning three reader families (Qwen2.5-7B, Qwen3.5-35B, GLM-4.7), two datasets (HotpotQA, 2WikiMultihopQA), and three compiler arrangements (MA-only, MB-only, MA+verify), removing the gold answer drops reader F1 by 28 to 64 points beyond the length-matched placebo on paired answer-in-compile strata, and prepending the gold into rewrites that lacked it raises F1 by +0.7 to +9.7 points in 10 of 12 (cell, baseline) combinations. A companion five-sentinel audit shows the conventional single-[MASK] probe is itself sentinel-fragile: on 2Wiki it reports a +4.12 F1 "non-leakage residual" that flips to -3.33 to -7.81 F1 under four alternative sentinels and fails an equivalence test for three of those four (1/4 pass). We do not propose a new rewriter or mitigation; we release the intervention runner and the sentinel panel so that other rewriter-gain claims can be tested against the same standard.
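
The four controlled edits described above are easy to picture as plain string operations. The Python sketch below is only an illustration of the idea, not the released intervention runner: the function names, the `[MASK]` default sentinel, and the naive `". "` sentence splitting are all assumptions made here, and the placebo is length-matched against a single gold span.

```python
import random
from typing import List

SENTINEL = "[MASK]"  # assumed default; the abstract reports that results depend on this choice


def _gold_starts(context: str, gold: str) -> List[int]:
    """Start offsets of every occurrence of the gold answer string."""
    starts, i = [], context.find(gold)
    while i != -1:
        starts.append(i)
        i = context.find(gold, i + 1)
    return starts


def remove_gold(context: str, gold: str, sentinel: str = SENTINEL) -> str:
    """Edit 1: replace every gold answer span with the sentinel."""
    return context.replace(gold, sentinel)


def placebo(context: str, gold: str, rng: random.Random, sentinel: str = SENTINEL) -> str:
    """Edit 2: replace one random span of the same length that avoids the answer."""
    n = len(gold)
    golds = _gold_starts(context, gold)
    candidates = [
        i for i in range(len(context) - n + 1)
        if all(i + n <= g or i >= g + n for g in golds)
    ]
    if not candidates:
        raise ValueError("no non-answer span of the required length")
    i = rng.choice(candidates)
    return context[:i] + sentinel + context[i + n:]


def inject_prefix(context: str, gold: str) -> str:
    """Edit 3: prepend the gold answer to a rewrite that lacked it."""
    return f"{gold}. {context}"


def inject_midpoint(context: str, gold: str) -> str:
    """Edit 4: insert the gold answer at the middle sentence boundary."""
    sentences = context.split(". ")
    mid = len(sentences) // 2
    return ". ".join(sentences[:mid] + [gold] + sentences[mid:])
```

Comparing reader F1 after `remove_gold` with F1 after `placebo` separates the effect of losing the answer string from the effect of merely damaging the context by the same number of characters.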

2606.05632 2026-06-05 cs.AI

Evaluation of LLMs for Mathematical Formalization in Lean

LLM在Lean中数学形式化的评估

Tyson Klingner, Drew Bladek, Escher Crawford, Bohao Chen, Ariel Fu, Kaira Nair, Jarod Alper, Giovanni Inchiostro, Vasily Ilin

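Affiliations * University of California, Berkeley; University of Washington

AI summary Compares several large language models on generating formal proofs in Lean 4, using pass@k and refine@k on subsets of miniF2F and miniCTX; finds that Gemini 3.1 Pro and Claude Opus 4.7 perform best, while NVIDIA Nemotron 3 Super and GPT-OSS 120B are the most efficient once cost is considered.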

Comments 15 pages, 13 figures, 10 tables. Comments welcome!

详情
AI中文摘要

在过去几年中,大语言模型(LLM)生成形式化数学证明的能力得到了显著提升。我们比较了多种LLM在Lean 4中生成形式化证明的有效性,旨在帮助那些希望利用LLM支持自己项目的人。我们使用pass@$k$和refine@$k$指标作为比较基准,并在miniF2F和miniCTX数据集的子集上进行评估。测试表明,总体而言,Gemini 3.1 Pro和Claude Opus 4.7表现最佳。Gemini 3.1 Pro在miniF2F上通过refine@32达到了92%的成功率,而Opus 4.7在miniCTX上通过refine@32达到了86%的成功率。考虑成本时,NVIDIA Nemotron 3 Super和GPT-OSS 120B效率最高,具有竞争力的准确率且每个正确证明的平均成本低于0.01美元。

英文摘要

Within the past few years, the ability of Large Language Models (LLMs) to generate formal mathematical proofs has improved drastically. We provide a comparison of various LLMs' effectiveness in producing formal proofs in Lean 4 with the goal of assisting those seeking to use LLMs to support their own projects. We utilize both pass@k and refine@k metrics as the benchmark for our comparison and evaluate on subsets of both miniF2F and miniCTX datasets. Our testing shows that overall, Gemini 3.1 Pro and Claude Opus 4.7 perform best. Gemini 3.1 Pro achieved a 92% success rate on miniF2F via refine@32 whereas Opus 4.7 achieved an 86% success rate on miniCTX via refine@32. When taking cost into account, NVIDIA Nemotron 3 Super and GPT-OSS 120B were the most efficient, with competitive accuracies and average costs of less than $0.01 per correct proof.
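
For readers unfamiliar with pass@k, a widely used way to compute it is the unbiased estimator 1 - C(n - c, k) / C(n, k), where n is the number of sampled proofs per problem and c is the number that check. The Python sketch below implements that estimator under the assumption that it matches the evaluation here; the abstract does not say which estimator the authors use, and refine@k, which involves iterative repair, is not modeled.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without
    replacement from n attempts of which c are correct, is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect attempts exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each item is (n attempts, c correct)."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)
```

For example, `benchmark_pass_at_k([(32, 0), (32, 32)], k=8)` averages a problem that was never solved with one that was always solved.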

2606.05626 2026-06-05 cs.CL cs.AI cs.LG

When New Generators Arrive: Lifelong Machine-Generated Text Attribution via Ridge Feature Transfer

当新生成器到来:基于岭特征迁移的终身机器生成文本归因

Zhen Sun, Yifan Liao, Zhicong Huang, Jiaheng Wei, Cheng Hong, Yutao Yue, Xinlei He

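Affiliations * Wuhan University; Ant Group; The Hong Kong University of Science and Technology (Guangzhou); Institute of Deep Perception Technology, JITRI

AI summary Addresses the difficulty of balancing continual adaptation to new generators with retention of old knowledge in lifelong machine-generated text attribution. Proposes RidgeFT, a lightweight analytic update framework that achieves closed-form updates without exemplar replay through covariance calibration and fixed random features.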

Comments 12 pages

详情
AI中文摘要

机器生成文本(MGT)归因旨在识别给定文本的特定生成器,从而为模型问责和滥用调查提供细粒度证据。随着新的大语言模型不断涌现,归因模型必须持续纳入新生成器,同时保留识别先前见过的生成器的能力。先前工作表明,这种终身MGT归因设置具有挑战性,现有方法通常难以在适应新类别和保留旧类别之间实现稳定平衡。为解决此问题,我们提出RidgeFT,一种轻量级分析更新框架,不依赖于示例回放。RidgeFT在初始生成器集上训练任务感知编码器,在首次观察到每个生成器类别时存储紧凑的类别充分统计量,然后冻结编码器以进行无回放的闭式更新。它通过协方差校准抑制与生成器无关的变异,通过固定随机特征提升表示能力,并基于类别充分统计量通过闭式岭回归更新新类别。在具有不同初始生成器设置的多主题评估中,RidgeFT始终优于基线。它在跨领域、骨干网络和增量协议上实现了最佳宏F1,同时改进了旧类别保留和新类别适应。这些结果表明,特征稳定的分析更新为终身MGT归因提供了一种简单而有效的方法。

英文摘要

Machine-generated text (MGT) attribution aims to identify the specific generator responsible for a given text, thereby providing fine-grained evidence for model accountability and misuse investigation. As new large language models continue to emerge, attribution models must continuously incorporate new generators while preserving their ability to recognize previously seen ones. Prior works have shown that this lifelong MGT attribution setting is challenging, and existing methods often struggle to achieve a stable balance between adapting to new classes and retaining old ones. To address this issue, we propose RidgeFT, a lightweight analytic update framework that does not rely on exemplar replay. RidgeFT trains a task-aware encoder on the initial generator set, stores compact class-wise sufficient statistics when each generator class is first observed, and then freezes the encoder for replay-free closed-form updates. It then suppresses generator-irrelevant variation through covariance calibration, improves representation capacity with fixed random features, and updates new classes through closed-form ridge regression based on class-level sufficient statistics. Across multi-topic evaluations with varying initial generator setups, RidgeFT consistently outperforms baselines. It achieves the best macro-F1 across domains, backbones, and incremental protocols, while also improving both old-class retention and new-class adaptation. These results suggest that feature-stable analytic updates provide a simple yet effective approach to lifelong MGT attribution.
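
The replay-free, closed-form update can be illustrated with ordinary ridge regression on one-hot targets. The Python sketch below is a simplified reading of the abstract, not RidgeFT itself: it assumes the per-class sufficient statistics are the feature sum and the sum of feature outer products, works on already-encoded features, and omits covariance calibration and the random-feature expansion. All function names are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
ClassStats = Tuple[Vector, Matrix]  # (sum of features, sum of outer products)


def class_stats(features: List[Vector]) -> ClassStats:
    """Compress one class's features into fixed-size statistics."""
    d = len(features[0])
    s = [0.0] * d
    g = [[0.0] * d for _ in range(d)]
    for x in features:
        for i in range(d):
            s[i] += x[i]
            for j in range(d):
                g[i][j] += x[i] * x[j]
    return s, g


def solve(a: Matrix, b: Matrix) -> Matrix:
    """Solve a @ w = b by Gauss-Jordan elimination with partial pivoting."""
    n, m = len(a), len(b[0])
    aug = [ra[:] + rb[:] for ra, rb in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [[aug[i][n + j] / aug[i][i] for j in range(m)] for i in range(n)]


def ridge_weights(stats: Dict[str, ClassStats], lam: float = 0.1) -> Tuple[List[str], Matrix]:
    """Closed-form W = (sum_c G_c + lam * I)^-1 [s_1 ... s_C] from stored statistics."""
    labels = list(stats)
    d = len(stats[labels[0]][0])
    gram = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
    for _, g in stats.values():
        for i in range(d):
            for j in range(d):
                gram[i][j] += g[i][j]
    targets = [[stats[c][0][i] for c in labels] for i in range(d)]
    return labels, solve(gram, targets)


def predict(x: Vector, labels: List[str], w: Matrix) -> str:
    """Return the label whose weight column scores highest for x."""
    scores = [sum(x[i] * w[i][c] for i in range(len(x))) for c in range(len(labels))]
    return labels[scores.index(max(scores))]


# Two initial generators, then a third arrives; no old samples are revisited.
stats = {
    "gen_a": class_stats([[1.0, 0.0], [0.9, 0.1]]),
    "gen_b": class_stats([[0.0, 1.0], [0.1, 0.9]]),
}
stats["gen_c"] = class_stats([[-1.0, -1.0], [-0.9, -1.1]])
labels, w = ridge_weights(stats)
```

Because the statistics are sums, adding a generator only requires its own features once; the solution after the update equals what a batch fit over all classes would give.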

2606.05625 2026-06-05 cs.AI cs.LG

Self-Commitment Latency: A Reward-Free Probe for Prompted Implicit Hacking

自承诺延迟:一种用于提示隐式劫持的无奖励探针

Bonan Shen, Youting Wang, Dingyan Shang, Tao Ning

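Affiliations * Stanford University; Tsinghua University

AI summary Proposes self-commitment latency, a metric that detects prompted implicit hacking without a reward signal by measuring how early a reasoning context commits to the model's own final answer, reaching AUROC 0.878-0.926 on GSM8K.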

详情
AI中文摘要

当语言模型的思维链看似良性时,隐式奖励劫持难以审计:最终答案可能被提示捷径锚定,而书面推理仍类似于普通问题求解。基于验证器的探针通过测量早期截断的推理上下文获得高奖励来暴露此类行为,但需要任务特定的奖励信号。本文提出一种弱输入替代方案——自承诺延迟,它测量提示推理上下文对模型自身最终答案的承诺时机。我们在受控配对GSM8K设置中使用Qwen2.5-3B-Instruct-4bit评估该探针,比较普通提示与包含答案提示的提示。与诚实上下文相比,包含提示的上下文显著更早且以更低不确定性做出承诺。主要延迟指标——阈值为0.8时的首次承诺延迟——达到AUROC 0.878;支持的全曲线摘要达到承诺范围AUROC 0.926和平均未承诺质量AUROC 0.904。当两种提示条件都正确回答时信号更强,且在不同阈值下保持稳定。这些结果表明,存在捷径的推理上下文会留下早期行为承诺特征,无需奖励模型、外部评判或训练分类器即可检测。

英文摘要

Implicit reward hacking is hard to audit when a language model's chain of thought appears benign: a final answer may be anchored by a prompt shortcut while the written reasoning still resembles ordinary problem solving. Verifier-based probes expose such behavior by measuring how early truncated reasoning contexts obtain high reward, but require a task-specific reward signal. This paper proposes a weaker-input alternative, self-commitment latency, which measures how early a prompted reasoning context commits to the model's own final answer. We evaluate the probe in a controlled paired GSM8K setting using Qwen2.5-3B-Instruct-4bit, comparing ordinary prompts with prompts that include an answer hint. Hinted contexts commit substantially earlier and with lower uncertainty than honest contexts. The primary latency metric, first-commitment latency at threshold 0.8, reaches AUROC 0.878; supporting whole-curve summaries reach AUROC 0.926 for commitment range and 0.904 for mean uncommitted mass. The signal is stronger when both prompt conditions answer correctly and remains stable across thresholds. These results show that shortcut-available reasoning contexts can leave an early behavioral commitment signature detectable without a reward model, external judge, or trained classifier.
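
A first-commitment latency of this kind can be computed from a single curve per example. The Python sketch below assumes, as a simplification, that `commit_probs[i]` is the probability the model assigns to its own final answer after seeing the first `i` of `n` reasoning chunks; how the paper obtains that curve, and how it defines commitment range and mean uncommitted mass, is not specified in the abstract and is not reproduced here. The function name is hypothetical.

```python
from typing import Sequence


def first_commitment_latency(commit_probs: Sequence[float], threshold: float = 0.8) -> float:
    """Fraction of the reasoning consumed before the model first commits.

    Returns i / n for the first index whose probability reaches the
    threshold, or 1.0 if the curve never reaches it.
    """
    if not commit_probs:
        raise ValueError("commit_probs must be non-empty")
    n = len(commit_probs)
    for i, p in enumerate(commit_probs):
        if p >= threshold:
            return i / n
    return 1.0


hinted = [0.85, 0.90, 0.95, 0.97]  # commits before any reasoning is shown
honest = [0.10, 0.20, 0.50, 0.90]  # commits only near the end
```

Under these assumptions a hinted context yields a latency near 0 and an honest one a latency near 1, which is the separation an AUROC over many paired examples would summarize.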

2606.05624 2026-06-05 cs.CV cs.GR

KV-Control: Parameter-Efficient K/V Injection for Trajectory-Controlled Text-to-Motion

KV-Control: 用于轨迹控制文本到运动的参数高效K/V注入

Tengjiao Sun, Pengcheng Fang, Xiaoyu Zhan, Yanwen Guo, Dongjie Fu, Xiaohao Cai, Hansung Kim

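Affiliations * University of Science and Technology of China; Tsinghua University

AI summary Proposes KV-Control, a compact attention-side control interface that injects key/value memories via a part-tokenized motion substrate and a trajectory encoder, achieving precise trajectory control without overwriting the pretrained text-conditioned motion prior.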

详情
AI中文摘要

文本条件3D人体运动模型现在可以从提示中合成合理的运动,但实际动画和具身代理工作流程很少止步于文本:角色可能需要遵循草绘的根路径,达到末端执行器目标,或满足多关节轨迹,同时保持语言描述的步态、风格和意图。这暴露了一个控制权衡。轨迹控制器应该精确而不覆盖预训练的文本条件运动先验,但现有解决方案要么复制生成器的大部分以重新获得每层控制访问,要么将大部分成本转移到测试时优化。我们引入KV-Control,一种用于冻结掩码文本到运动变换器的紧凑注意力侧控制接口。关键思想是将几何约束作为自注意力中的记忆提供,而不是通过全局姿态标记注入或仅在输出侧强制执行。为了支持该接口,我们共同设计了部分标记化的运动基元和控制器:PartVQ学习解剖对齐的部分码本,T-Concat将每个帧-部分标记暴露为注意力可寻址站点,KV-Control在每个自注意力层注入控制条件的键/值记忆,同时保留预训练的查询流、文本交叉注意力、FFN和所有骨干权重。生成的适配器仅在共享轨迹编码器之上添加可训练的注入参数,但在继承的细化协议下以亚厘米精度跟踪根和多关节约束,同时保留文本条件的运动质量。KV-Control将轨迹条件重新定义为轻量级记忆检索,为文本到运动生成提供了一个小型、精确且透明的控制接口。

英文摘要

Text-conditioned 3D human motion models now synthesize plausible motions from prompts, but practical animation and embodied-agent workflows rarely stop at text: a character may need to follow a sketched root path, hit an end-effector target, or satisfy a multi-joint trajectory while still preserving the gait, style, and intent described by language. This exposes a control trade-off. A trajectory controller should be precise without overwriting the pretrained text-conditioned motion prior, yet existing solutions either duplicate large portions of the generator to regain per-layer control access or move much of the cost to test-time optimization. We introduce KV-Control, a compact attention-side control interface for frozen masked text-to-motion transformers. The key idea is to make geometric constraints available as memory inside self-attention rather than injecting them through a global pose token or enforcing them only at the output side. To support this interface, we co-design a part-tokenized motion substrate and controller: PartVQ learns anatomy-aligned part codebooks, T-Concat exposes each frame--part token as an attention-addressable site, and KV-Control injects control-conditioned key/value memories at every self-attention layer while preserving the pretrained query stream, text cross-attention, FFN, and all backbone weights. The resulting adapter adds only trainable injection parameters atop a shared trajectory encoder, yet tracks root and multi-joint constraints with sub-centimeter accuracy under the inherited refinement protocol while retaining text-conditioned motion quality. KV-Control reframes trajectory conditioning as lightweight memory retrieval, providing a small, precise, and transparent control interface for text-to-motion generation.
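
One plausible reading of "key/value memories inside self-attention" is that extra control keys and values are appended to the attention memory while the queries stay untouched. The pure-Python sketch below shows only that mechanism for a single head; concatenation is an assumption made here, the control vectors are given by hand rather than produced by a trajectory encoder, and none of the function names come from the paper.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Single-head scaled dot-product attention."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([
            sum(w / z * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def attention_with_kv_injection(
    queries: Matrix, keys: Matrix, values: Matrix,
    ctrl_keys: Matrix, ctrl_values: Matrix,
) -> Matrix:
    """Append control key/value pairs to the memory; queries are unchanged."""
    return attention(queries, keys + ctrl_keys, values + ctrl_values)


q = [[4.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0]]
baseline = attention(q, k, v)
steered = attention_with_kv_injection(q, k, v, [[5.0, 0.0]], [[-1.0, -1.0]])
```

With an empty control memory the layer reduces to ordinary attention, which is why this style of adapter can leave a frozen backbone's behavior intact when no constraint is supplied.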

2606.05622 2026-06-05 cs.CL

AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints

AdaPlanBench: 在世界约束和用户约束下评估大语言模型智能体的自适应规划能力

Jiayu Liu, Cheng Qian, Zhenhailong Wang, Bingxuan Li, Jiateng Liu, Heng Wang, Jeonghwan Kim, Yumeng Wang, Xiusi Chen, Yi R. Fung, Heng Ji

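Affiliations * University of Illinois Urbana-Champaign

AI summary Observing that existing benchmarks underexplore adaptive planning under progressively revealed dual constraints, proposes AdaPlanBench, a dynamic interactive benchmark built from 307 household tasks and a scalable constraint construction pipeline, to evaluate whether LLM agents can iteratively revise plans from feedback during interaction.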

详情
AI中文摘要

语言模型对现实世界问题进行规划时,通常涉及世界约束和用户约束,这些约束可能不会事先完全明确,而是通过交互逐步披露。然而,现有基准仍未充分探索在这种逐步揭示的双重约束下的自适应规划。为填补这一空白,我们引入了AdaPlanBench,这是一个动态交互基准,用于评估大语言模型(LLM)智能体是否能够在逐步揭示的世界约束和用户约束下自适应地规划和重新规划。AdaPlanBench基于307个家务任务构建,并配备了一个可扩展的约束构建流程,为每个任务增加双重约束。在运行时,智能体通过多轮协议与环境交互,其中隐藏的约束仅在智能体提出违反它们的计划时才会被揭示,从而需要在累积反馈下迭代修订计划。这使得规划具有挑战性,因为智能体必须从反馈中推断并跟踪约束,同时有效地重新规划。在十个领先的LLM上的实验表明,在双重约束下的自适应规划仍然具有挑战性,最佳模型仅达到67.75%的准确率。我们进一步观察到,随着约束的累积,性能会下降,其中用户约束尤其构成巨大挑战,而失败通常源于较弱的物理基础知识和降低的有效性。这些结果将AdaPlanBench确立为双重约束交互规划的测试平台,并凸显了LLM智能体可靠适应动态揭示约束的挑战。

英文摘要

Planning for real-world problems by language models often involves both world and user constraints, which may not be fully specified upfront and are progressively disclosed through interaction. However, existing benchmarks still underexplore adaptive planning under such progressively revealed dual constraints. To address this gap, we introduce AdaPlanBench, a dynamic interactive benchmark for evaluating whether Large Language Model (LLM) agents can adaptively plan and re-plan under progressively revealed world and user constraints. AdaPlanBench is built on 307 household tasks, with a scalable constraint construction pipeline that augments each task with dual constraints. At runtime, agents interact with the environment in a multi-turn protocol where hidden constraints are revealed only when the agent proposes a plan that violates them, requiring iterative plan revision under accumulating feedback. This makes planning challenging, as agents must infer and track constraints from feedback while re-planning effectively. Experiments on ten leading LLMs show that adaptive planning under dual constraints remains challenging, with the best model reaching only 67.75% accuracy. We further observe that performance degrades as more constraints accumulate, with user constraints posing a particularly large challenge and failures often stemming from weaker physical grounding and reduced effectiveness. These results establish AdaPlanBench as a testbed for dual-constrained interactive planning and highlight the challenge of reliable adaptation to dynamically revealed constraints in LLM agents.
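
The multi-turn protocol, in which a hidden constraint surfaces only when a proposed plan violates it, can be sketched as a short loop. The Python below is a toy illustration rather than the benchmark's harness: the constraint names, the scripted `toy_agent`, the turn limit, and the choice to reveal every violated constraint at once are all assumptions.

```python
from typing import Callable, Dict, List

Plan = List[str]
Constraint = Callable[[Plan], bool]  # returns True when the plan satisfies it


def run_episode(
    propose: Callable[[List[str]], Plan],
    hidden: Dict[str, Constraint],
    max_turns: int = 5,
) -> dict:
    """Ask for plans until one violates no hidden constraint or turns run out."""
    revealed: List[str] = []
    plan: Plan = []
    for turn in range(1, max_turns + 1):
        plan = propose(revealed)
        violated = [name for name, ok in hidden.items() if not ok(plan)]
        if not violated:
            return {"success": True, "turns": turn, "plan": plan, "revealed": revealed}
        # Feedback: only constraints the plan actually broke become visible.
        revealed.extend(name for name in violated if name not in revealed)
    return {"success": False, "turns": max_turns, "plan": plan, "revealed": revealed}


HIDDEN: Dict[str, Constraint] = {
    "no_microwave": lambda p: "use microwave" not in p,  # world constraint
    "quiet_hours": lambda p: "run blender" not in p,     # user constraint
}


def toy_agent(revealed: List[str]) -> Plan:
    """Scripted stand-in for an LLM agent that repairs only what it was told."""
    plan = ["use microwave", "run blender", "serve"]
    if "no_microwave" in revealed:
        plan[0] = "use stove"
    if "quiet_hours" in revealed:
        plan[1] = "chop by hand"
    return plan


result = run_episode(toy_agent, HIDDEN)
```

An agent that ignores the accumulated feedback keeps failing until the turn budget is spent, which is the behavior an accuracy metric over such episodes would penalize.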

2606.05620 2026-06-05 cs.CL

An ERP Study on Recursive Locative Processing in Mandarin-Speaking Children with Autism

自闭症儿童递归处所加工的ERP研究

Xiaoyi Wang, Chenxi Fu, Ziman Zhuang, Caimei Yang

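Affiliations * Soochow University

AI summary Uses an ERP experiment to study how the temporal dynamics of recursive locative processing differ in children with autism across three stages: prediction, semantic integration, and syntactic reanalysis.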

详情
AI中文摘要

递归能够生成层级语言结构,但在实时理解中施加了巨大的处理需求。尽管自闭症谱系障碍(ASD)中存在复杂句法困难,但递归处理的时间动态仍知之甚少。本研究使用事件相关电位(ERP)考察说普通话的ASD儿童如何处理两级递归处所结构。24名儿童(12名ASD,12名典型发展儿童,TD)参与了跨模态句子-图片匹配任务。在控制心理年龄的情况下,分析了与结构预测(P200)、语义整合(N400)和句法重析(P600)相关的三个处理阶段的神经反应。结果显示组间存在系统性差异。TD儿童在结构不匹配时表现出清晰的P200和P600调节,而ASD儿童则表现出早期分化减弱和晚期重析效应降低。相反,ASD儿童在不匹配条件下表现出增强的N400反应,表明语义整合需求增加。此外,ASD组在半球偏侧化方面表现出显著更大的个体间变异性,尽管偏侧化强度与接受性词汇表现无关。这些发现支持一个级联解释,即ASD中早期预测参与的减少导致递归处理中整合成本增加和重析效率降低。更广泛地说,结果强调了时间处理动态和神经变异性在理解ASD语言差异中的重要性。

English abstract

Recursion enables the generation of hierarchical linguistic structures but imposes substantial processing demands during real-time comprehension. While difficulties with complex syntax have been reported in autism spectrum disorder (ASD), the temporal dynamics of recursive processing remain poorly understood. This study used event-related potentials (ERPs) to examine how Mandarin-speaking children with ASD process two-level recursive locative constructions. Twenty-four children (12 ASD, 12 typically developing, TD) participated in a cross-modal sentence-picture matching task. Neural responses were analyzed across three processing stages associated with structural prediction (P200), semantic integration (N400), and syntactic reanalysis (P600), with mental age controlled. Results revealed a systematic divergence between groups. TD children showed clear P200 and P600 modulation in response to structural mismatch, whereas ASD children exhibited attenuated early differentiation and reduced late reanalysis effects. In contrast, ASD children showed enhanced N400 responses under mismatch conditions, indicating increased semantic integration demands. In addition, the ASD group displayed significantly greater inter-individual variability in hemispheric lateralization, although lateralization strength was not associated with receptive vocabulary performance. These findings support a cascading account in which reduced early predictive engagement in ASD leads to increased integration costs and diminished reanalysis efficiency during recursive processing. More broadly, the results highlight the importance of both temporal processing dynamics and neural variability in understanding language differences in ASD.

2606.05616 2026-06-05 cs.CL

What's in a Name? Morphological Shortcuts by LLMs in Pharmacology

Kaijie Mo, Thomas Yang, Chantal Shaib, Qing Yao, William Rudman, Ramez Kouzy, Kanishka Misra, Byron C. Wallace, Junyi Jessy Li

Affiliations: The University of Texas at Austin; Northeastern University; MD Anderson Cancer Center

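AI summary: Studies the morphological-shortcut behavior in which LLMs rely on affix cues to reason in pharmacology, using experiments with fictitious drug names and an attribution framework to reveal its mechanism and safety risks.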

Comments 22 pages

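AI Chinese abstract (translated)

The morphology of a word often provides cues to its meaning, but relying purely on these mappings can lead to overgeneralization in high-stakes domains. In medicine, for example, LLMs can confidently reason about fictitious drugs from affixes alone (e.g., wugcillin) and generate plausible-looking clinical content. We present a behavioral and mechanistic study of "affix heuristics" in LLMs in pharmacology. Using fictitious drug names built from real affixes, we show that affix signals alone can elicit class-level pharmacological responses. We introduce a framework for identifying whether a model's drug semantics are driven mainly by the affix, the stem, or the whole drug name. Applied to 653 drugs, our framework reveals that models usually induce drug meaning mainly through affix cues, but rarely state this reliance explicitly, and sometimes wrongly conflate the properties of affix-sharing drugs. Activation patching across models further localizes this behavior to early-to-middle layers. These findings show that morphological shortcuts pose a subtle but measurable risk to safety.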

English abstract

The morphological form of a word can often give cues to its meaning, but purely relying on these mappings can lead to overgeneralization in high-stakes domains. In the medical domain, for instance, LLMs can confidently reason about fictitious drugs from their affixes alone (e.g., wugcillin) and generate plausible-looking clinical content. We present a behavioral and mechanistic study of LLM "affix heuristics" in pharmacology. Using fictitious drug names built from real affixes, we show that affix signals alone elicit class-level pharmacological responses. We introduce a framework for identifying whether a model's drug semantics are driven mainly by the affix, the stem, or the drug name as a whole. Applied across 653 drugs, our framework reveals that models often induce drug meaning primarily through affix cues, yet rarely explicitly indicate this reliance, and sometimes incorrectly conflate properties among affix-sharing drugs. Activation patching across models further localizes this behavior to early-mid layers. These findings show that morphological shortcuts pose a subtle but measurable risk to safety.
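
The idea of building fictitious drug names from real affixes, as in wugcillin, can be sketched as follows. This is only an illustration and not the authors' framework: the nonce stems, the four suffixes chosen, the function names, and the `suffix_only_model` stand-in (which mimics a model that looks at nothing but the suffix) are all assumptions.

```python
from itertools import product
from typing import Callable, Dict, List

# Real drug-name suffixes and the class each one conventionally signals.
AFFIX_CLASS: Dict[str, str] = {
    "cillin": "penicillin antibiotic",
    "pril": "ACE inhibitor",
    "olol": "beta blocker",
    "prazole": "proton pump inhibitor",
}

# Hypothetical nonce stems; none is meant to resemble a real drug.
NONCE_STEMS: List[str] = ["wug", "blick", "dax"]


def fictitious_names(stems: List[str], affix_class: Dict[str, str]) -> List[dict]:
    """Pair every nonce stem with every real affix."""
    return [
        {"name": stem + affix, "stem": stem, "affix": affix, "affix_class": cls}
        for stem, (affix, cls) in product(stems, affix_class.items())
    ]


def suffix_only_model(name: str) -> str:
    """Stand-in for an LLM that classifies a drug purely from its suffix."""
    for affix, cls in AFFIX_CLASS.items():
        if name.endswith(affix):
            return cls
    return "unknown"


def affix_match_rate(model: Callable[[str], str], items: List[dict]) -> float:
    """Fraction of fictitious names assigned the class implied by the affix."""
    hits = sum(model(item["name"]) == item["affix_class"] for item in items)
    return hits / len(items)
```

With a real model, `model` would wrap a prompt and a parser for the answer. A high match rate on names that do not exist would then be evidence that the affix alone drives the response.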