arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.15617 2026-06-17 cs.CV New submission

NeRD: Neuro-Symbolic Rule Distillation for Efficient Ontology-Grounded Chain-of-Thought in Medical Image Diagnosis

Hongxi Yang, Yiwen Jiang, Siyuan Yan, Jamie Chow, Eunis Li, Charlotte Poon, Stephanie Fong, Xiangyu Zhao, Deval Mehta, Yasmeen George, Zongyuan Ge

Affiliations: Department of Data Science & AI, Faculty of Information Technology, Monash University; AIM for Health Lab, Faculty of Information Technology, Monash University; Faculty of Engineering, Monash University; Faculty of Medicine, The Chinese University of Hong Kong; School of Computing Technologies, RMIT University

AI summary: Proposes the NeRD framework, which uses neuro-symbolic rule distillation to generate efficient, ontology-grounded, non-redundant reasoning chains without hand-crafted rules. It achieves strong diagnostic performance and interpretability on skin datasets and enables the first expert-in-the-loop study of multimodal chain-of-thought diagnosis.

Comments Accepted at MICCAI 2026

English abstract

Interpretability is essential for trustworthy medical image diagnosis. However, existing concept-driven interpretable methods have key limitations: Concept Bottleneck Models (CBMs) require scoring all predefined concepts at inference time and for manual intervention, imposing a substantial burden on clinicians, while rationale-based generative approaches often select concepts by class discriminability, which can drift from diagnostic ontologies. To address these issues, we propose Neuro-Symbolic Rule Distillation (NeRD), a framework that produces efficient, ontology-grounded reasoning chains that are sufficient yet non-redundant, without manually crafting diagnostic rules. Experiments on two skin datasets demonstrate strong diagnostic performance and interpretability, and blinded expert evaluation confirms the clinical plausibility of NeRD rationales. Our method further enables a first expert-in-the-loop study for Multimodal Chain-of-Thought-based diagnosis, achieving efficient and effective concept-level intervention.

2606.15614 2026-06-17 cs.CV New submission

Variational Test-time Optimization for Diffusion Synchronization

Hyunsoo Lee, Farrin Marouf Sofian, Kushagra Pandey, Stephan Mandt

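Affiliations: Seoul National University; University of California, Irvine

AI summary: Proposes a variational test-time optimization framework based on optimal control that optimizes control variables to guide multiple trajectories toward coordinated generation, improving diffusion synchronization without additional training.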

Comments Preprint. Project website: https://hleephilip.github.io/SyncVC/

English abstract

Collaborative generation, which coordinates multiple diffusion trajectories to extend the capabilities of pretrained priors, has emerged as a powerful paradigm for broadening the applicability of diffusion models. Among existing approaches, diffusion synchronization provides a scenario-agnostic solution by introducing general guidance mechanisms. However, current synchronization approaches rely heavily on heuristics and still require task-specific tailoring, which limits their generalizability and performance. In this work, we mathematically derive a synchronization framework based on optimal control, providing a principled explanation of diffusion synchronization. During sampling, we optimize control variables to guide multiple trajectories toward coherent solutions while remaining close to the underlying diffusion prior. Our method operates entirely at test time without additional training, thereby enabling broad applicability across diverse generation scenarios when combined with strong pretrained priors. We demonstrate consistent improvements over baselines on three representative collaborative generation tasks, covering a wide range of modalities and applications. Beyond performance gains, our work establishes a novel foundation for collaborative generation, opening a principled path toward extending pretrained generative models to new collaborative generation settings.

2606.15575 2026-06-17 cs.AI cs.HC New submission

Do we have the knowledge we need? Rethinking human-AI decision-making in corporations

Anne S. R. Marx, Ricardo M. Avelino, Torbjørn Netland, Mennatallah El-Assady

Affiliations: ETH Zurich; Department of Computer Science & ETH AI Center, ETH Zurich; Department of Computer Science & Architecture, ETH Zurich; Department of Management, Technology, and Economics, ETH Zurich; Department of Computer Science, ETH Zurich

AI summary: Proposes a framework that recommends human-AI agency allocations and control mechanisms based on task attributes and knowledge availability, and applies it to example manufacturing tasks.

Comments Proceedings of AutomationXP26 Workshop of the 2026 CHI Conference on Human Factors in Computing Systems, April 14, 2026, Barcelona, Spain. ACM, New York, NY, USA, 8 pages

English abstract

Organizational knowledge is fragmented across a variety of software systems, tacit expertise, and manual documents that have traditionally been designed for human consumption. As AI systems are increasingly deployed and granted decision-making roles, they require access to this knowledge. This raises two questions: how should organizations store and maintain knowledge so that it remains accessible to both humans and future AI systems, and how should agency be allocated between humans and AI across tasks with different risks and levels of uncertainty? In this position paper, we describe how organizational knowledge evolves and contribute a framework that maps task attributes and knowledge availability to recommended agency allocations and control mechanisms. We illustrate the applicability of the framework on two different manufacturing tasks: a routine operation (visual quality inspection) and a one-off strategic decision (factory location), and conclude with opportunities for future research.

2606.15573 2026-06-17 cs.AI cs.CR New submission

QoS-Aware Token Scheduling and Private Data Valuation for Multi-Modal Agentic Networks

Yao Du, Jing Liu, Pengfei Xu, Zehua Wang, Victor C. M. Leung, Cyril Leung, Victoria Lemieux

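Affiliations: University of British Columbia; Lazai Network

AI summary: Addresses data heterogeneity and resource constraints in decentralized agentic systems with differentially private multi-modal representations and a fair token allocation scheme, improving data privacy and contribution fairness while maintaining quality of service.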

Comments Accepted to IEEE ICME 2026

English abstract

In agentic systems, human-generated data records anchor the value of AI services. Yet cloud compute pipelines centralize processing on remote servers. Data centralization reduces personal data sovereignty and may degrade the quality of service (QoS). Meanwhile, user contributions are diverse in quantity and quality: decentralized records can be biased, noisy, and heterogeneously distributed. To address these data challenges, we study fair token allocation and private data valuation for decentralized and resource-constrained agentic systems. Our approach embeds multi-modal representations in a shared semantic space and releases differentially private (DP) prototypes to preserve utility while reducing semantic leakage. With the DP guarantee, we design a fair token allocation scheme that rewards effective contributions and remains robust to data heterogeneity and AI resource scarcity. Extensive simulations demonstrate improved contribution-based fairness and QoS compared to standard benchmarks. The improved resistance to image reconstruction attacks indicates enhanced privacy for multi-modal personal data.
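
The release of differentially private prototypes mentioned in this abstract can be illustrated with a minimal sketch. It assumes a Gaussian-mechanism-style recipe (clip each embedding's L2 norm, average, add Gaussian noise); the abstract does not state the actual mechanism, and `dp_prototype`, `clip_norm`, and `noise_std` are hypothetical names. Calibrating `noise_std` to a specific privacy budget is omitted.

```python
import math
import random


def dp_prototype(embeddings, clip_norm=1.0, noise_std=0.1, rng=None):
    """Return a noisy mean of L2-clipped embeddings (illustrative only)."""
    rng = rng if rng is not None else random.Random()
    dim = len(embeddings[0])
    total = [0.0] * dim
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec))
        # Scale down any vector whose norm exceeds clip_norm.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, x in enumerate(vec):
            total[i] += x * scale
    n = len(embeddings)
    # Add independent Gaussian noise to each coordinate of the mean.
    return [t / n + rng.gauss(0.0, noise_std) for t in total]
```

Clipping bounds how much any single record can move the prototype, which is what lets the added noise mask individual contributions.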

2606.15531 2026-06-17 cs.LG cs.CR New submission

Greedy Coordinate Diffusion: Effective and Semantically Coherent Adversarial Attacks via Diffusion Guidance

Bohdan Turbal, Blossom Metevier, Max Springer, Aleksandra Korolova

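Affiliations: University of Maryland; University of California, Berkeley

AI summary: Proposes Greedy Coordinate Diffusion, which uses diffusion-model guidance to generate semantically coherent adversarial examples, achieving a high attack success rate while preserving naturalness.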

Journal ref
ICML 2026
English abstract

Adversarial attacks on large language models have limited practical impact despite extensive research. Optimization-based attacks such as Greedy Coordinate Gradient (GCG) (Zou et al., 2023) produce high-perplexity, incoherent suffixes that existing defenses easily detect (Bengio et al., 2024). Moreover, attempting to enforce coherence constraints during optimization often prevents the attack from successfully eliciting the specific targeted response, resulting in low success rates against robust models. Conversely, attacks that maintain coherence often alter the semantic intent of queries; when the model complies with these altered queries, responses fail to address the adversary's original goal. In this work, we introduce Greedy Coordinate Diffusion (GCD), a novel framework that efficiently generates adversarial attacks against safety-aligned models while maintaining low perplexity and high semantic adherence to the adversary's original intent. GCD leverages the generative priors of discrete diffusion language models to guide the search for adversarial suffixes that achieve semantic coherence and adherence. Unlike GCG, GCD does not require direct gradient access, allowing it to operate in a gray-box setting. We show that GCD achieves the highest attack success rate (ASR) while remaining competitive on response-quality scores, and that the constructed adversarial prompts are detected at lower rates than those of other methods by perplexity-based and guard-model filters.

2606.15386 2026-06-17 cs.LG New submission

A Compositional Framework for Open-ended Intelligence

Ida Momennejad, Roberta Raileanu

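Affiliations: GitHub

AI summary: Proposes a formal definition of open-ended intelligence as the closure generated by a finite primitive set and composition operators, supporting unbounded compositional generation across tasks and worlds, and introduces next primitive prediction as an architectural objective.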

English abstract

Open-ended intelligence is the capacity to adapt to novel problems and environments that are substantially different from those in training. A mathematics of open-ended intelligence requires two pillars: first, a minimal set of representational primitives (e.g., states, actions) and algorithmic primitives (e.g., nearest neighbor); and second, an acquired compositional grammar for selection, recursion, and branching that produces sequences of operations and recurring motifs. We formalize open-ended intelligence in terms of the compositional closure induced by a finite primitive set $P$ and a set of composition operators $C$. We characterize properties of the induced closure $\mathcal{L}(P,C)$ that support unbounded compositional generation across families of tasks and worlds. The closure of the two pillars yields infinite adaptive responses across a wide range of settings. The mathematics supports complementary research agendas, including evaluation metrics for explanation and interpretability, and novel architectures where compositional generalization is native. We propose next primitive prediction (NPP) as a novel architectural objective, where training encourages the acquisition of reusable algorithmic primitives and their compositional grammar, such that new solutions are generated through recombination. Given such an objective, curriculum learning and self-play can enable lifelong learning, expanding the closure by discovering reusable primitives and transition motifs across settings. We ground the framework through case studies in physics, evolution, and neuroscience.
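
The idea of a closure induced by a finite primitive set and composition operators can be made concrete with a toy enumeration. This is only a sketch under strong assumptions: primitives are strings, every operator is a binary combinator named by a string, and `closure` and `max_depth` are hypothetical names rather than anything defined in the paper.

```python
from itertools import product


def closure(primitives, operators, max_depth):
    """Enumerate expressions reachable in at most max_depth composition rounds."""
    layer = set(primitives)
    for _ in range(max_depth):
        new = set(layer)
        for op in operators:
            # Combine every ordered pair of expressions built so far.
            for a, b in product(layer, repeat=2):
                new.add(f"{op}({a},{b})")
        layer = new
    return layer


# Two primitives and one operator: the set grows 2 -> 6 -> 38 with depth.
sizes = [len(closure({"a", "b"}, ["seq"], d)) for d in range(3)]
```

Even with a finite primitive set, the number of distinct compositions keeps growing with depth, which is the sense in which the full closure supports unbounded generation.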

2606.15236 2026-06-17 cs.CV New submission

Show the Signal, Hide the Noise: Spectral Forcing for Pixel-Space Diffusion

Weichen Fan, Haiwen Diao, Penghao Wu, Ziwei Liu

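Affiliations: S-Lab, Nanyang Technological University

AI summary: Proposes Spectral Forcing, which applies a time-varying low-pass filter to the noisy input of pixel-space diffusion models, steering the model toward signal-bearing frequency bands and improving training efficiency and generation quality.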

Comments Code link: https://github.com/WeichenFan/Spectral_Forcing

English abstract

Pixel-space diffusion models are trained on full-bandwidth noisy images, yet the useful signal available to the denoiser is strongly frequency dependent. Under rectified-flow diffusion and natural-image power-law spectra, the per-band data-to-noise contour $k^{*}(t) = (1-t)^{-2/\alpha}$ separates a signal-bearing low-frequency region from a noise-dominated high-frequency region at each time $t$. We show that this implicit coarse-to-fine structure is not merely descriptive: it induces a capacity-allocation problem. A standard pixel-space denoiser must discover the moving bandwidth boundary internally and can spend computation on frequency-time regions where the optimal prediction collapses to deterministic baselines rather than data-distribution modeling. To make this boundary explicit, we introduce Spectral Forcing, a parameter-free, time-conditional 2D-DCT low-pass operator applied to the noisy input before the patch embedder. Its cutoff expands monotonically with the diffusion time and becomes the identity at the data endpoint. Through controlled synthetic experiments, we identify the regime in which the operator is beneficial: coarse patch tokenization and data whose high-frequency content is predominantly noise rather than essential signal. On ImageNet-256 with JiT-700M/32, Spectral Forcing consistently improves both FID and Inception Score across different training epochs, demonstrating robust gains throughout training; at finer tokenization, Spectral Forcing remains competitive. We further insert the unchanged operator into SenseNova-U1, a unified text-to-image model, where it improves DPG-Bench and GenEval, showing that the input-side spectral prior transfers beyond class-conditional generation. These results suggest a route to capacity-efficient pixel-space diffusion by showing the signal and hiding the noise.
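
A time-conditional 2D-DCT low-pass operator of the kind described in this abstract might look like the following pure-Python sketch. Several choices here are assumptions, not the authors' implementation: the image is a small square list of lists, the cutoff is applied as a hard mask on the radial DCT index, `alpha` defaults to 2.0, and all function names are hypothetical. The only element taken from the abstract is the contour `(1 - t) ** (-2 / alpha)`, with t = 1 as the data endpoint.

```python
import math


def cutoff(t, alpha=2.0):
    """Contour k*(t) = (1 - t)^(-2/alpha); unbounded at the data endpoint t = 1."""
    return math.inf if t >= 1.0 else (1.0 - t) ** (-2.0 / alpha)


def dct_matrix(n):
    """Orthonormal DCT-II matrix; row k is the k-th cosine basis vector."""
    rows = []
    for k in range(n):
        c = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([c * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


def spectral_low_pass(image, t, alpha=2.0):
    """Zero the 2D-DCT coefficients whose radial index exceeds k*(t)."""
    d = dct_matrix(len(image))
    coeffs = matmul(matmul(d, image), transpose(d))  # forward 2D DCT
    k_max = cutoff(t, alpha)
    kept = [[c if math.hypot(u, v) <= k_max else 0.0
             for v, c in enumerate(row)]
            for u, row in enumerate(coeffs)]
    return matmul(matmul(transpose(d), kept), d)  # inverse 2D DCT
```

Because the mask admits every coefficient once the cutoff is unbounded, the operator reduces to the identity (up to floating-point error) at t = 1, matching the behavior the abstract describes at the data endpoint.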

2606.15148 2026-06-17 cs.RO cs.AI New submission

MimicIK: Real-Time Generative Inverse Kinematics from Teleoperation with FK Consistency

Jiahao Yang, Shenhao Yan, Fan Feng, Chengsi Yao, Ge Wang, Zhixin Mai, Yiming Zhao, Yatong Han

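Affiliations: Ising AI; CUHK-Shenzhen

AI summary: Proposes the MimicIK framework, which uses conditional flow matching to learn smooth, robust joint-space motion priors from teleoperation data and solves inverse kinematics in real time through two-step iterative refinement and a forward-kinematics consistency loss, reaching a 4.65 mm position error and a 92.01% success rate on a 6-DOF robot dataset.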

English abstract

Inverse kinematics (IK) remains a critical bottleneck for real-time robot manipulation. Classical numerical solvers achieve high geometric precision but often suffer from discontinuous branch switching and unstable behavior near kinematic singularities during closed-loop deployment. Meanwhile, learned IK approaches frequently struggle to balance spatial accuracy, motion smoothness, and real-time efficiency, particularly when trained on noisy human teleoperation data. We present MimicIK, a real-time generative inverse kinematics framework that learns smooth and robust joint-space motion priors from teleoperation demonstrations through conditional flow matching. Given the current joint configuration and a target end-effector pose, MimicIK predicts continuous delta-joint commands using an efficient two-step iterative refinement process based on a Minimal Iterative Policy (MIP) backbone. To enforce physical consistency, we further introduce an FK consistency loss, a differentiable forward-kinematics regularization that penalizes task-space deviations from the target pose during training. We evaluate MimicIK on a real-world 6-DOF robot dataset containing 8,848 teleoperation demonstrations. MimicIK achieves a mean position error of 4.65 mm, a 10 mm success rate of 92.01%, and a trajectory spike rate of only 7.99%. Compared with a UNet diffusion baseline, our method improves both spatial accuracy and motion smoothness while reducing inference latency from 21.66 ms to 6.74 ms. Furthermore, unlike deterministic MLP baselines that catastrophically diverge under out-of-distribution deployment, MimicIK remains stable near singular configurations and enables robust 20 Hz real-time control on deployment hardware.
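
The FK consistency loss described in this abstract penalizes the task-space gap between the target pose and the pose reached after applying the predicted delta-joint command. The sketch below shows that idea on a toy planar arm. It is an assumption-laden illustration: the paper's robot is 6-DOF and the loss is differentiable over full poses, while this version handles only 2D position with plain floats, and `forward_kinematics` and `fk_consistency_loss` are hypothetical names.

```python
import math


def forward_kinematics(joints, link_lengths):
    """End-effector (x, y) of a planar serial arm with revolute joints."""
    x = y = angle = 0.0
    for q, length in zip(joints, link_lengths):
        angle += q  # joint angles accumulate along the chain
        x += length * math.cos(angle)
        y += length * math.sin(angle)
    return x, y


def fk_consistency_loss(joints, delta, target_xy, link_lengths):
    """Squared task-space error after applying the predicted joint deltas."""
    new_joints = [q + dq for q, dq in zip(joints, delta)]
    x, y = forward_kinematics(new_joints, link_lengths)
    return (x - target_xy[0]) ** 2 + (y - target_xy[1]) ** 2
```

For a two-link arm with unit links starting at zero angles, a delta of (pi/2, 0) reaches the target (0, 2) and gives a loss near zero, while a zero delta leaves the arm at (2, 0) and gives a loss of 8.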

2606.15121 2026-06-17 cs.CL New submission

When Cognitive Graphs Meet LLMs: BDEI Cognitive Pathways for Panic Emotional Arousal Prediction

Mengzhu Liu, Long Qin, Chuan Ai, Zhengqiu Zhu, Hongru Liang, Chen Gao, Yong Li, Xin Lu, Quanjun Yin

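Affiliations: University of Science and Technology of China

AI summary: Proposes the PanicCognitivePath framework, which fuses multi-domain signals through a Psychological Safety Distance model, introduces an explicit Emotion node to form the BDEI cognitive pathway, and confines the LLM to single-step parameter estimation to predict panic emotional arousal timing, improving accuracy by 10.68%.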

English abstract

Predicting individual panic emotional arousal timing before manifestation is essential for proactive emergency intervention. Existing methods incorporate cognitive elements but none explicitly model the emotional arousal process, making them ill-suited for emotional arousal timing prediction. We argue that grounding prediction in appraisal emotion theory is necessary because it explicitly models this process, but three problems must be solved. (1) Appraisal theory posits that emotion arises from simultaneous evaluation across multiple threat dimensions, yet no prior work fuses these inputs into risk perception. (2) Existing cognitive models lack an Emotion node, decoupling threat appraisal from emotional arousal and forcing emotions to be inferred indirectly from behaviors. (3) Given their generalizable cognitive reasoning, current approaches adopt LLMs as the primary decision-maker, yet overlook the fragility and hallucination-proneness of their outputs. We introduce PanicCognitivePath (PCP), a framework that addresses all three. A Psychological Safety Distance (PSD) model, grounded in psychological distance theory, maps four-domain signals into a unified risk metric as the entry condition for subsequent cognitive reasoning. An explicit Emotion node grounded in appraisal emotion theory is introduced into BDI, forming a Belief-Desire-Emotion-Intention (BDEI) pathway. Agents whose risk metric exceeds the PSD threshold enter this pathway, coupling threat appraisal directly to emotional arousal. The BDEI pathway governs all state transitions while the LLM is confined to parameter estimation for the Belief-to-Desire transition, confining hallucinations to a single step and preventing error propagation. Experiments on Hurricane Sandy show that PCP improves arousal timing accuracy by 10.68% over baselines and reduces peak count error to 7.07%.

2606.14990 2026-06-17 cs.LG cs.AI New submission

Rational Sparse Autoencoder

Naiyu Yin, Yue Yu

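Affiliations: Lehigh University

AI summary: Proposes the Rational Sparse Autoencoder (RSAE), which replaces the fixed encoder activation with a trainable rational function and, through a two-stage pipeline (initialization plus fine-tuning), improves reconstruction and downstream-behavior metrics across several language models and baseline activation families without sacrificing feature interpretability.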

Comments Accepted to the Mechanistic Interpretability Workshop at ICML 2026

English abstract

Sparse autoencoders (SAEs) are standard tools for mechanistic interpretability, but current SAE families are constrained by fixed encoder nonlinearities such as ReLU, JumpReLU, and TopK. This hard-codes a particular sparsity mechanism into the model and can distort the reconstruction-versus-sparsity trade-off. We introduce the Rational Sparse Autoencoder (RSAE), which replaces the fixed encoder activation with a trainable rational function. Rational activations are flexible enough to uniformly approximate the activation primitives used by existing SAE families on compact domains (for TopK, the thresholded gate obtained after a separating top-k threshold is supplied), while also providing a richer function class for adapting to the observed pre-activation geometry. We realise this idea through a two-stage pipeline: an initialisation procedure that copies the pre-trained baseline SAE weights, plugs in rational coefficients obtained by the relaxed Remez exchange on synthetic data, and calibrates the scale parameters along with the rational coefficients; followed by a fine-tuning step under the standard sparsity-regularised reconstruction objective. Empirically, on residual-stream activations of three open-weight language models and across all three baseline activation families, the RSAE strictly improves on the baseline after the fine-tuning step, both on reconstruction-side metrics and on downstream-behaviour metrics, without sacrificing feature-level interpretability under sparse probing. These gains are consistent across host language models, across baseline activation families, and across the full range of baseline sparsity we tested, while the upgrade itself adds only a handful of scalar parameters per autoencoder and runs in minutes on a single consumer GPU.
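
A rational activation is a ratio of two polynomials whose coefficients are trained. The sketch below evaluates one such function. The specific form is an assumption: it uses a denominator of 1 plus an absolute value so that it can never reach zero, a pole-free parameterization common for rational activations, whereas the abstract does not say which parameterization the RSAE uses. The names `rational`, `p`, and `q` are hypothetical.

```python
def rational(x, p, q):
    """Evaluate P(x) / (1 + |q[0]*x + q[1]*x**2 + ...|).

    p holds numerator coefficients from degree 0 upward; q holds
    denominator coefficients from degree 1 upward.
    """
    num = sum(c * x ** i for i, c in enumerate(p))
    den = 1.0 + abs(sum(c * x ** (i + 1) for i, c in enumerate(q)))
    return num / den
```

With `p = [0.0, 1.0]` and an empty `q` the function is the identity, so a fixed activation can serve as a starting point and the extra coefficients only add flexibility on top of it.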

2606.14782 2026-06-17 cs.CV cs.CL New submission

Last But Not Least: Boundary Attention CalibratiON for Multimodal KV Cache Compression

Tianhao Chen, Yuheng Wu, Kelu Yao, Xiaogang Xu, Xiaobin Hu, Dongman Lee

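Affiliations: KAIST; Zhejiang Laboratory; The Chinese University of Hong Kong; National University of Singapore

AI summary: Addresses the loss of critical evidence caused by KV cache compression in long visual contexts of multimodal large language models. Proposes BACON, which calibrates observation-window attention with last-query attention and suppresses noise through intra-layer coherence and inter-layer persistence, improving performance by 7.5% on average under aggressive compression.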

English abstract

Multimodal Large Language Models (MLLMs) achieve strong vision-language reasoning, but long visual contexts enlarge the KV cache and increase decoding latency. Existing compression methods rely on observation window attention for stable token-importance estimation, yet this aggregation can dilute sparse visual evidence and discard answer-critical tokens under aggressive compression. Therefore, we identify last-query attention as a complementary source for recovering such evidence, but its answer-irrelevant signals can mislead retention. We propose BACON, a plug-and-play method that calibrates observation window attention with last-query evidence and suppresses isolated noise via intra-layer coherence and inter-layer persistence. Across diverse benchmarks, models, budgets, and compression methods, BACON improves multimodal KV compression by 7.5% on average under the most aggressive budget, with gains up to 30.9%. Our project page is available at https://ryu1ion.github.io/official_BACON/

2606.14668 2026-06-17 cs.LG New submission

When to Write and When to Suppress: Route-Specialized Dual Adapters for Memory-Assisted Knowledge Editing

Baijia Zhang, Yining Huang

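AI summary: Proposes a route-specialized dual-adapter editor in which a relevance router decides whether to apply an edit memory and separately trained edit and locality adapters handle the two routes, achieving the best probability-preference accuracy on three benchmarks.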

English abstract

Knowledge editing systems must update selected facts while preserving nearby but irrelevant behavior. This paper studies this problem in a memory-assisted setting where an edit memory is retrieved at inference time and a parameter-efficient adapter corrects the model's object preference. We argue that the central design question is not only how to write an edit, but also when to suppress it. We introduce a route-specialized dual-adapter editor. A relevance router first decides whether a prompt should receive an edit memory. Routed prompts use an edit adapter trained to prefer the new object over the original object; unrouted non-direct prompts use a separate locality adapter trained to preserve or restore the original-object preference. We evaluate the editor on three 1,000-case protocols, CounterFact, ZsRE, and MQuAKE, under the same memory protocol and two 7B/8B base models. On Llama-3.1-8B-Instruct, it obtains the best overall probability-preference accuracy on all three benchmarks: 0.8180 on CounterFact, 0.8946 on ZsRE, and 0.9922 on MQuAKE. The same trend holds on Qwen3-8B. Router ablations show that the relevant memory boundary differs across datasets: a lexical neural router is safest on CounterFact, while BGE embedding routing is better on ZsRE and MQuAKE. Component and module ablations show that the gain mainly comes from separating edit injection from off-route suppression rather than from simply increasing LoRA capacity.

2606.14551 2026-06-17 cs.RO cs.AI New submission

TRACE: Trajectory-Routed Causal Memory for Delayed-Evidence Visuomotor Imitation

Zihao Li, Ranpeng Qiu, Yincong Chen, Guoqiang Ren, Weiming Zhi

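Affiliations: Zeno AI; Zhejiang University; Zhejiang University of Technology; The University of Sydney

AI summary: Addresses observation ambiguity in visuomotor imitation caused by early cues disappearing. Proposes the TRACE memory framework, which uses path signatures to store and retrieve task-relevant evidence and improves branch-selection accuracy on long-horizon tasks.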

English abstract

Robots under autonomous operation may require decisions based on evidence that is no longer visible. We study delayed-evidence tasks, where an early cue disappears before a later decision point, so visually similar observations can require different actions. In these settings, the current observation is not a sufficient state for control. We introduce TRAjectory-routed Causal Evidence (TRACE), a memory framework for visuomotor imitation policies. TRACE stores task-relevant visual and robot-state evidence, such as object identity, target choice, or route-dependent state, in a fixed-size latent memory that remains bounded over long episodes. Instead of indexing memory by raw time or manually provided task labels, TRACE uses path signatures: compact, order-sensitive features of the executed robot-state trajectory. These signatures do not store the visual cue itself; rather, they provide trajectory-conditioned keys for writing and retrieving the evidence stored when the cue was visible. When the robot later reaches an ambiguous observation, the policy conditions on TRACE memory to recover the missing context and choose the correct branch. TRACE attaches through lightweight adapters to policies, without changing the policy backbone, action head, or imitation objective. Across real-world long-horizon manipulation tasks with visually ambiguous branch points, TRACE improves branch selection and task success over alternative baselines, including short-history and recurrent memory. Project page: https://jeong-zju.github.io/trace
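
Path signatures, which this abstract uses as order-sensitive trajectory features, can be illustrated by computing the first two signature levels of a piecewise-linear path. This is a generic sketch of the mathematical object, not TRACE's feature pipeline: how deep the signature goes, which robot-state channels feed it, and how keys are formed from it are not given in the abstract, and `path_signature` is a hypothetical name.

```python
def path_signature(path):
    """Level-1 and level-2 signature terms of a piecewise-linear path.

    path is a list of points (equal-length sequences of floats).
    Level 1 is the total displacement; level 2 holds the iterated
    integrals, which depend on the order in which moves happen.
    """
    dim = len(path[0])
    level1 = [0.0] * dim
    level2 = [[0.0] * dim for _ in range(dim)]
    for prev, cur in zip(path, path[1:]):
        step = [c - p for p, c in zip(prev, cur)]
        for i in range(dim):
            for j in range(dim):
                # level1[i] is the displacement accumulated before this segment.
                level2[i][j] += level1[i] * step[j] + 0.5 * step[i] * step[j]
        for i in range(dim):
            level1[i] += step[i]
    return level1, level2


# Same endpoints, different order of moves.
right_then_up = path_signature([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)])
up_then_right = path_signature([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)])
```

Both paths share the same level-1 term, yet their level-2 cross terms differ, so two trajectories that end in the same visually ambiguous state can still produce distinct keys.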

2606.14438 2026-06-17 cs.RO cs.AI New submission

CADET: Physics-Grounded Causal Auditing and Training-Free Deconfounding of End-to-End Driving Planners

Zikun Guo

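Affiliations: School of Electronics Engineering, Kyungpook National University

AI summary: Proposes the CADET framework, which audits and repairs spurious correlations in pretrained end-to-end driving planners without retraining, identifying confounders through a physics-grounded causal graph and intervening on test-time inputs.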

Comments 8 pages, 4 figures

English abstract

End-to-end (E2E) autonomous-driving planners trained by imitation are prone to statistical shortcuts: they associate scene elements that merely co-occur with expert actions (a roadside object, a building facade) with driving decisions, rather than the variables that causally determine them. Such causal confusion silently compromises reliability in long-tail scenarios, and it is difficult to detect, because prevailing open-loop metrics (L2 displacement and collision rate) are dominated by ego status and do not indicate whether a planner depends on spurious cues. Existing remedies based on causal-intervention training require retraining large models and cannot audit a planner that is already deployed. We present CADET, a training-free framework that audits, benchmarks, and repairs spurious reliance in pretrained E2E planners without any parameter update.

2606.14383 2026-06-17 cs.CV 新提交

IndustryBench-MIPU: Benchmarking Multi-Image Attribute Value Extraction for Industrial Products

IndustryBench-MIPU:面向工业产品的多图像属性值提取基准

Haonan Qi, Jin Cao, Yongqi Zhang, Xintong Wang, Weidong Tang, Bin Chen, Chengfu Huo, Haojun Pan, Hengyu You, Jing Li, Yingde Wang, Liang Ding

发表机构 * Multimodal and Industrial AI Team(多模态与工业AI团队) Taobao&Tmall, Alibaba Group(淘宝&天猫,阿里巴巴集团)

AI总结 提出首个多图像工业产品理解基准IndustryBench-MIPU,通过结构化属性提取任务评估多模态大模型在规格表、铭牌、技术图纸上的文本识别、视觉推理、领域知识和跨图像证据整合能力,发现多图像完整性是核心瓶颈。

详情
AI中文摘要

工业产品(如阀门和断路器)由密集的技术规格定义,这些规格支配着供应链中的采购、兼容性和安全性。这些规格分散在多个异构的产品图像中,包括规格表、铭牌和技术图纸,然而多模态大语言模型(MLLMs)能否可靠地恢复它们仍未被充分探索。为填补这一空白,我们引入了IndustryBench-MIPU,这是首个用于多图像工业产品理解的大规模基准,围绕结构化属性提取构建——从产品图像中恢复属性-值对。该任务共同探究了规格表和铭牌上的文本识别、技术图纸上的视觉推理、解码工业术语的领域知识,以及跨图像证据整合以组装分散的规格。具体而言,该基准包含来自27,652张图像的4,559个产品,具有跨越18个工业类别的103,703个标注,通过多模型共识和三层质量保证构建。在单图像和产品级多图像设置下评估九个MLLMs,揭示了一个显著的完整性差距:模型实现了高精度(86-94%),但最佳模型仅恢复了49.9%的产品级属性;从单图像到多图像提取,召回率下降了15-34个百分点。多图像完整性,而非单图像准确性,是核心瓶颈。数据集和代码已公开。

英文摘要

Industrial products such as valves and circuit breakers are defined by dense technical specifications that govern procurement, compatibility, and safety across supply chains. These specifications are scattered across multiple heterogeneous product images, including specification tables, nameplates, and technical drawings, yet whether Multimodal Large Language Models (MLLMs) can reliably recover them remains underexplored. To fill this gap, we introduce IndustryBench-MIPU, the first large-scale benchmark for multi-image industrial product understanding, built around structured attribute extraction -- recovering property-value pairs from product images. This task jointly probes text recognition on specification tables and nameplates, visual reasoning over technical drawings, domain knowledge to decode industrial terminology, and cross-image evidence integration to assemble scattered specifications. Concretely, the benchmark comprises 4,559 products across 27,652 images with 103,703 annotations spanning 18 industrial categories, constructed through multi-model consensus and three-tier quality assurance. Evaluating nine MLLMs under both single-image and product-level multi-image settings reveals a stark completeness gap: models achieve high precision (86-94%) but the best recovers only 49.9% of product-level attributes; moving from single-image to multi-image extraction costs 15-34 percentage points of recall. Multi-image completeness, not single-image accuracy, is the core bottleneck. Dataset and code are publicly available.

2606.14187 2026-06-17 cs.LG 新提交

Zeta: Dual Whitening for Matrix Optimization via Coordinate-Adaptive Preconditioning

Zeta: 通过坐标自适应预处理实现矩阵优化的双重白化

Kaiwen Chen, Shuhai Zhang, Zimo Liu, Linxiao Li, Ying Sun, Yuchen Li, Yifan Zhang, Bo Han, Mingkui Tan, Qiuwu Chen

发表机构 * South China University of Technology(华南理工大学) AIGCode Hong Kong Baptist University(香港浸会大学)

AI总结 针对矩阵优化中坐标尺度异质性问题,提出双重白化优化器Zeta,通过先坐标白化后谱白化的严格顺序降低正交化误差,在语言建模和视觉任务上提升收敛速度与泛化性能。

详情
AI中文摘要

大规模神经网络训练日益依赖矩阵感知优化器,这类优化器利用权重参数的结构,超越逐元素自适应。然而,现有矩阵感知方法(如Muon)存在一个未被充分认识的脆弱性:其核心操作Newton-Schulz迭代严重依赖于输入条件,而原始动量矩阵表现出严重的坐标尺度异质性。本文首先通过卡方均匀性检验验证了这种尺度异质性,表明矩阵内尺度不平衡在Transformer层中普遍存在,且坐标白化能有效纠正。受此发现启发,我们提出Zeta,一种双重白化优化器,在严格有序的流程中应用坐标白化和谱白化。该顺序不是可调选择,而是源于数学依赖:坐标白化建立了谱白化可靠运行所需的统计各向同性。我们进一步证明,通过改善输入的条件数,该双重流程相对于纯谱方法严格降低了正交化误差。实验上,Zeta在语言建模(0.6B至8B参数)、混合专家架构和视觉任务中匹配或超越强基线,表明在正交化前解决尺度不平衡能带来更快的收敛和更好的泛化。代码可在 https://github.com/AIGCodeOS/aigcode_zeta_optimizer 获取。

英文摘要

Large-scale neural network training increasingly relies on matrix-aware optimizers that exploit the structure of weight parameters beyond element-wise adaptation. However, existing matrix-aware methods such as Muon have an underappreciated vulnerability: their core operation, Newton-Schulz iteration, depends critically on input conditioning, yet the raw momentum matrices exhibit severe coordinate-wise scale heterogeneity. In this paper, we first verify this scale heterogeneity through a chi-square uniformity test, showing that intra-matrix scale imbalance is prevalent across Transformer layers and that coordinate whitening effectively corrects it. Motivated by this finding, we propose Zeta, a dual whitening optimizer that applies coordinate whitening and spectral whitening in a strictly ordered pipeline. The ordering is not a tunable choice but follows from a mathematical dependency: coordinate whitening establishes the statistical isotropy that spectral whitening requires to function reliably. We further prove that this dual pipeline strictly reduces orthogonalization error relative to pure spectral methods by improving the condition number of the input. Empirically, Zeta matches or surpasses strong baselines across language modeling (0.6B to 8B parameters), mixture-of-experts architectures, and vision tasks, demonstrating that resolving scale imbalance before orthogonalization leads to faster convergence and better generalization. Code is available at https://github.com/AIGCodeOS/aigcode_zeta_optimizer.
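
To make the ordering argument concrete, here is a minimal sketch, not the Zeta implementation. It assumes "coordinate whitening" can be approximated by dividing each row by its RMS, and it uses the classical cubic Newton-Schulz iteration rather than any tuned variant. All function names and the toy matrix are hypothetical.

```python
import math

# Illustrative sketch only. Row-RMS normalisation stands in for "coordinate
# whitening" (an assumption), and the cubic Newton-Schulz step is the textbook one.

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def row_whiten(m, eps=1e-12):
    """Rescale every row to unit RMS (assumed stand-in for coordinate whitening)."""
    out = []
    for row in m:
        rms = math.sqrt(sum(v * v for v in row) / len(row))
        out.append([v / (rms + eps) for v in row])
    return out

def newton_schulz(m, steps=5):
    """Approximate orthogonalisation: X <- 1.5 X - 0.5 X X^T X."""
    fro = math.sqrt(sum(v * v for row in m for v in row))
    x = [[v / fro for v in row] for row in m]   # singular values now <= 1
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(rx, rg)] for rx, rg in zip(x, xxtx)]
    return x

def orth_error(x):
    """Frobenius norm of X X^T - I."""
    g = matmul(x, transpose(x))
    n = len(g)
    return math.sqrt(sum((g[i][j] - (1.0 if i == j else 0.0)) ** 2
                         for i in range(n) for j in range(n)))

# Toy "momentum" matrix whose two rows differ in scale by a factor of 100.
raw = [[100.0, 50.0], [0.5, -1.0]]
err_raw = orth_error(newton_schulz(raw))
err_whitened = orth_error(newton_schulz(row_whiten(raw)))
```

With the same five iterations, the row-whitened input ends up far closer to orthogonal than the raw one. The small singular value of the raw matrix grows by only about 1.5x per step, which is the conditioning dependence the abstract describes. The toy matrix was chosen so the effect is easy to see, and real momentum matrices will behave less cleanly.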

2606.14096 2026-06-17 cs.CV 新提交

A New Multi-Domain Benchmark for Micro-Action Recognition and Detection

微动作识别与检测的新多领域基准

Yanbin Hao, Pengyu Liu, Xing Wei, Xun Yang, Dan Guo, Meng Wang

发表机构 * School of Computer Science and Information Engineering, Hefei University of Technology(合肥工业大学计算机科学与信息工程学院) School of Information Science and Technology, University of Science and Technology of China(中国科学技术大学信息科学技术学院)

AI总结 提出MMA-82,一个大规模多领域微动作基准,扩展至82个类别、4个领域,涵盖识别与多标签检测任务,实验表明现有方法在域迁移、长尾分布等场景下仍面临挑战。

Comments 10 pages, 9 figures

详情
AI中文摘要

微动作是全身层面持续时间短、幅度低的细微身体运动,能够揭示潜在意图、非自愿反应和细粒度情感变化。我们之前的MA-52基准为微动作识别提供了重要基础,但在规模、场景多样性、任务覆盖和评估协议方面仍有限。为了将微动作分析推向更真实和全面的场景,我们引入了MMA-82,这是MA-52的大规模多领域扩展。MMA-82将标签空间从52个细粒度微动作类别扩展到82个,并涵盖四个不同领域,包括实验室访谈、街头访谈、精神病患者访谈和情感丰富的电视视频,最终从454名受试者中获得了77,856个标注实例。基于MMA-82,我们建立了两个核心任务:微动作识别和多标签微动作检测。对于识别,我们进一步定义了域内和跨域协议,包括少样本和零样本设置,以评估模型的鲁棒性、可迁移性和泛化能力。大量实验表明,当前方法在真实微动作理解中仍面临困难,尤其是在域迁移、长尾类别分布和复杂时间定位下。除了基准测试,我们还研究了微动作与情感之间的关系,表明微动作与情感状态密切相关,并为面部微表情提供补充线索,以改进情感识别。这些结果表明,MMA-82是真实微动作分析的全面且具有挑战性的基准,也是以人为中心的AI的宝贵资源。MMA-82可在以下网址获取:https://lpynow.github.io/MMA-82-AIM/。

英文摘要

Micro-actions are short-duration, low-amplitude subtle body movements at the whole-body level that can reveal latent intentions, involuntary reactions, and fine-grained affective changes. Our previous MA-52 benchmark has provided an important foundation for micro-action recognition, but it remains limited in scale, scene diversity, task coverage, and evaluation protocols. To advance micro-action analysis toward more realistic and comprehensive settings, we introduce MMA-82, a large-scale multi-domain extension of MA-52. MMA-82 expands the label space from 52 to 82 fine-grained micro-action categories and covers four distinct domains, including laboratory interviews, street interviews, psychiatric patient interviews, and emotion-rich television videos, resulting in 77,856 annotated instances from 454 subjects. Built upon MMA-82, we establish two core tasks: Micro-Action Recognition and Multi-label Micro-Action Detection. For recognition, we further define in-domain and cross-domain protocols, including few-shot and zero-shot settings, to evaluate model robustness, transferability, and generalization. Extensive experiments show that current methods still struggle with realistic micro-action understanding, especially under domain shift, long-tailed category distributions, and complex temporal localization. Beyond benchmarking, we investigate the relationship between micro-actions and emotion, showing that micro-actions are strongly associated with emotional states and provide complementary cues to facial micro-expressions for improved emotion recognition. These results demonstrate that MMA-82 serves as a comprehensive and challenging benchmark for realistic micro-action analysis and a valuable resource for human-centered AI. MMA-82 is available at https://lpynow.github.io/MMA-82-AIM/.

2606.14081 2026-06-17 cs.CV cs.AI cs.LG eess.IV 新提交

Clay-CNN Hybrids: Leveraging Geospatial Foundation Models as Auxiliary Context for Landslide Detection

Clay-CNN混合模型:利用地理基础模型作为滑坡检测的辅助上下文

Huong Binh Vu

发表机构 * Harvard University(哈佛大学)

AI总结 针对滑坡检测中的极端类别不平衡问题,提出将地理基础模型Clay v1.5作为辅助上下文注入U-Net瓶颈的混合方法,在Landslide4Sense基准上达到64.5% F1,优于纯Clay或U-Net基线。

详情
AI中文摘要

灾后快速滑坡制图对灾害响应至关重要,但由于极端类别不平衡,自动化仍然困难。本研究评估了地理基础模型(GFM)Clay v1.5是否能够改善Landslide4Sense(L4S)基准上的像素级滑坡分割,该基准包含3,799个训练块,具有14个Sentinel-2和地形波段,约2%的正像素。我们比较了三种策略:Clay作为主编码器并融合多尺度残差地形、在瓶颈处注入Clay语义上下文的U-Net骨干、以及标准U-Net基线。采用两阶段低秩适应(LoRA)的混合U-Net + Clay模型在三个随机种子上的最佳测试F1为64.5±1.8%,超过了纯Clay骨干(55.2±3.6%)和U-Net基线(59.9%)。由于缺乏多尺度跳跃连接,Clay作为独立编码器的性能低于U-Net,但其预训练表示在作为辅助上下文注入时持续提升了性能。这些发现表明,GFM在滑坡检测中最有效的方式是补充空间细节丰富的卷积架构,而非替代它们。

英文摘要

Rapid post-event landslide mapping is essential for disaster response but remains difficult to automate due to extreme class imbalance. This study evaluates whether Clay v1.5, a Geospatial Foundation Model (GFM), can improve pixel-level landslide segmentation on the Landslide4Sense (L4S) benchmark, which contains 3,799 training chips with 14 Sentinel-2 and terrain bands and approximately 2% positive pixels. We compare three strategies: Clay as the primary encoder with multi-scale residual terrain fusion, a U-Net backbone augmented with Clay semantic context at the bottleneck, and a standard U-Net baseline. The hybrid U-Net + Clay model with two-stage Low-Rank Adaptation (LoRA) achieved the best test F1 of 64.5 ± 1.8% over three seeds, surpassing the Clay-only backbone (55.2 ± 3.6%) and the U-Net baseline (59.9%). Clay as a standalone encoder underperformed the U-Net due to the absence of multi-scale skip connections, but its pretrained representations consistently improved performance when injected as auxiliary context. These findings suggest that GFMs are most effective for landslide detection when they complement spatially detailed convolutional architectures rather than replace them.

2606.13258 2026-06-17 cs.AI 新提交

MOSAIC: Modality-Specific Adaptation for Incremental Continual Learning in Parkinson's Disease Gait Assessment

MOSAIC: 帕金森病步态评估中增量持续学习的模态特定适应

Minlin Zeng, Zhipeng Zhou, Yang Qiu, Martin J. McKeown, Zhiqi Shen

发表机构 * Nanyang Technological University(南洋理工大学) Pacific Parkinson's Research Centre, University of British Columbia(不列颠哥伦比亚大学太平洋帕金森研究中心)

AI总结 针对帕金森病步态评估中模态增量场景,提出MOSAIC框架,通过模态特定预热、统计解耦MSBN架构和课程引导排斥目标,解决跨模态蒸馏不可靠、统计偏移和可塑性下降问题。

详情
AI中文摘要

基于步态的帕金森病评估越来越依赖异构传感器,但临床系统很少同时收集所有模态。新传感器可能通过设备升级、协议变更或多中心部署引入,而历史患者数据由于隐私和存储限制通常不可用。这种模态增量场景面临三个挑战:不可靠的跨模态蒸馏、模态特定的统计偏移以及保存后可塑性下降。我们提出了MOSAIC,一个紧凑的持续学习框架。首先,我们识别了有毒教师现象,并引入模态特定预热,在蒸馏前稳定新学习的模态表示。其次,我们提出了一种统计解耦的MSBN架构,在保持共享语义主干的同时隔离传感器统计信息。第三,我们设计了一个课程引导的排斥目标用于可塑性恢复,在保留旧知识的同时恢复模态特定容量。在三个多模态帕金森步态数据集上的实验表明,MOSAIC提高了最终性能并减轻了遗忘。项目代码可在以下网址获取:https://github.com/minlinzeng/MOSAIC_Modality-Specific-Adaptation-for-Incremental-Continual-Learning-in-PD-Gait-Assessment.git

英文摘要

Gait-based Parkinson's disease assessment increasingly relies on heterogeneous sensors, but clinical systems rarely collect all modalities simultaneously. New sensors may arrive through device upgrades, protocol changes, or multi-center deployment, while historical patient data are often unavailable because of privacy and storage constraints. This modality-incremental setting faces three challenges: unreliable cross-modal distillation, modality-specific statistical shifts, and reduced plasticity after preservation. We propose MOSAIC, a compact continual learning framework. First, we identify the Toxic Teacher phenomenon and introduce Modality-Specific Warm-Up to stabilize newly learned modality representations before distillation. Second, we propose a statistics-decoupled MSBN architecture that isolates sensor statistics while maintaining a shared semantic backbone. Third, we design a curriculum-guided repulsive objective for Plasticity Recovery, preserving legacy knowledge while recovering modality-specific capacity. Experiments on three multimodal Parkinson's gait datasets show that MOSAIC improves final performance and mitigates forgetting. Project code is available at: https://github.com/minlinzeng/MOSAIC_Modality-Specific-Adaptation-for-Incremental-Continual-Learning-in-PD-Gait-Assessment.git

2606.13196 2026-06-17 cs.AI cs.CY 新提交

Under What Conditions Can a Machine Be Called Genuinely Creative?

机器在何种条件下能够真正具有创造力?

Yong Zeng

发表机构 * Concordia University(康考迪亚大学)

AI总结 本文基于Designics理论,提出机器真正创造力需满足十个要求,并通过实例论证其计算可行性,同时指出当前生成式AI系统尚不具备真正创造力。

详情
AI中文摘要

最近的AI系统能够生成看似具有创造力的文本、软件架构、假设、设计和科学工作流。本文探讨机器在何种条件下能够真正具有创造力,以及如何在共享的认知和创造环境中保持人类能动性。它提出了一个源于Designics(意义承载的意向性变化科学)的需求框架。本文认为,真正的机器创造力不应仅由输出新颖性、当前性能或瞬时架构来定义。相反,创造力被理解为通过递归干预动力学对不完全情境的结构性转变。基于此观点,它依赖于十个需求:环境表示、范围感知、冲突识别、干预能力、后果观察、知识与环境更新、范围重定、局部到全局展开、基于价值的范围界定以及人机共居。这些需求通过Designics的三个定律(感知、冲突和能力)进行组织。本文通过选定的网络-物理和网络-生物研究(包括递归元素提取、自主网格生成以及神经生理和工作负载分析)说明了这些需求的计算可行性。然后,它将开放系统、自动发现框架、自我修改代理、基础模型和代理工作流视为压力案例:它们展示了强大的生成手段,但本身并未建立真正的机器创造力。最后,本文认为主动的AI伦理是真正机器创造力的内在部分,而非事后过滤器。基于价值的范围界定和人机共居必须塑造创造机器如何感知环境、识别冲突、选择干预、观察后果、更新知识以及重新确定未来行动的范围。

英文摘要

Recent AI systems can generate texts, software architectures, hypotheses, designs, and scientific workflows that appear creative. This paper asks under what conditions a machine can be called genuinely creative, and how human agency can be preserved within shared cognitive and creative environments. It develops a requirement framework derived from Designics, the science of meaning-bearing intentional change. The paper argues that genuine machine creativity should not be defined by output novelty, current performance, or transient architecture alone. Instead, creativity is understood as the structural transformation of incomplete situations through recursive intervention dynamics. On this view, it depends on ten requirements: environment representation, scoped perception, conflict identification, intervention capability, consequence observation, knowledge and environment update, rescoping, local-to-global unfolding, value-based scoping, and human-AI co-living. These are organized through the three laws of Designics: perception, conflict, and capability. The paper illustrates the computational tractability of these requirements through selected cyber-physical and cyber-biological studies, including recursive element extraction, autonomous mesh generation, and neurophysiological and workload analysis. It then treats open-ended systems, automated discovery frameworks, self-modifying agents, foundation models, and agentic workflows as pressure cases: they demonstrate powerful generative means but do not by themselves establish genuine machine creativity. Finally, the paper argues that proactive AI ethics is internal to genuine machine creativity rather than an after-the-fact filter. Value-based scoping and human-AI co-living must shape how creative machines perceive environments, identify conflicts, select interventions, observe consequences, update knowledge, and rescope future action.

2606.12863 2026-06-17 cs.LG 新提交

Multimodal Graph Negative Learning

多模态图负学习

Zhengyu Wu, Xu Wang, Hongchao Qin, Xunkai Li, Guang Zeng, Rong-Hua Li, Guoren Wang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出GraphMNL框架,通过负学习解决多模态属性图中节点级分支语义不平衡问题,避免主导分支偏差传播,在Grocery和Reddit M数据集上取得最优性能。

详情
AI中文摘要

多模态属性图(MAGs)将图拓扑与异构模态属性(如文本和图像)集成,从而能够对复杂关系系统进行更丰富的建模。然而,这种表达能力也使得MAGs上的学习依赖于多个语义源,包括结构拓扑、文本和视觉属性,每个都可以被视为节点表示的一个分支。当这些分支在语义信息量和可靠性上因节点而异时,就会出现节点级分支语义不平衡:一个分支为某个节点提供判别性语义,但由于模态质量或结构上下文的偏差,可能会误导另一个节点。现有方法通常通过跨分支一致性或对齐来缓解这种异质性,隐含地将主导预测视为可靠监督。当主导分支有偏差时,强制模仿可能会将其偏差传播到其他分支,并抑制对分类有用的原始语义。我们提出GraphMNL,一种图感知的多模态负学习框架,通过使用负学习作为跨分支指导来解决这个问题。该模型不强制劣质分支模仿教师预测,而是教导它们节点不太可能属于哪些类别。GraphMNL构建分支库,通过图感知可靠性仲裁识别主导和劣质分支,门控不稳定传输,并对非目标类别应用目标保持负学习。这种设计将目标监督与分支指导解耦,使得监督损失学习正确类别,而当分支一致性不可靠时,负学习抑制不太可能的备选类别。通过全面的实验评估,GraphMNL在Grocery数据集上达到72.47%的准确率,在Reddit M数据集上达到76.60的F1分数,取得了最佳性能。

英文摘要

Multimodal attributed graphs (MAGs) integrate graph topology with heterogeneous modality attributes, such as text and images, thereby enabling richer modeling of complex relational systems. However, such expressiveness also makes learning on MAGs depend on multiple semantic sources, including structural topology, textual and visual attributes, each of which can be regarded as a branch for node representation. Node-level branch semantic imbalance arises when these branches differ across nodes in semantic informativeness and reliability: a branch that provides discriminative semantics for one node may mislead another due to bias in modality quality or structural context. Existing methods often mitigate such heterogeneity through cross-branch agreement or alignment, implicitly treating the dominant prediction as reliable supervision. When the dominant branch is biased, forced imitation may propagate its bias to other branches and suppress original semantics that are useful for classification. We propose GraphMNL, a graph-aware multimodal negative learning framework that addresses this issue by using Negative Learning as cross-branch guidance. Instead of forcing inferior branches to imitate a teacher prediction, the model teaches them which classes a node is unlikely to belong to. GraphMNL builds a branch library, identifies dominant and inferior branches via graph-aware reliability arbitration, gates unstable transfer, and applies target-preserving negative learning over non-target classes. This design decouples target supervision from branch guidance so that supervised losses learn the correct class, while Negative Learning suppresses unlikely alternatives when branch agreement is unreliable. In comprehensive experiments, GraphMNL achieves the best performance, with 72.47% accuracy on the Grocery dataset and a 76.60 F1 score on the Reddit M dataset.
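
The idea of teaching a branch which classes a node is unlikely to belong to can be sketched with a generic negative-learning loss. This is an assumption-laden illustration, not GraphMNL's loss: the function names, the choice of `-log(1 - p_k)`, and the way unlikely classes are supplied are all hypothetical.

```python
import math

# Illustrative sketch only: a generic negative-learning loss. Names, the loss
# form, and how "unlikely" classes are chosen are assumptions, not GraphMNL's.

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def negative_learning_loss(logits, unlikely_classes, target, eps=1e-12):
    """Mean of -log(1 - p_k) over classes flagged as unlikely, never the target."""
    probs = softmax(logits)
    negatives = [k for k in unlikely_classes if k != target]  # target-preserving
    if not negatives:
        return 0.0
    return -sum(math.log(1.0 - probs[k] + eps) for k in negatives) / len(negatives)

branch_logits = [2.0, 0.0, -1.0]   # an inferior branch's scores for three classes
already_low = negative_learning_loss(branch_logits, [2], target=0)
still_high = negative_learning_loss(branch_logits, [1], target=0)
target_only = negative_learning_loss(branch_logits, [0], target=0)
```

The loss only pushes probability away from the flagged classes and never touches the target class, which is how a guidance signal of this kind can stay decoupled from the supervised loss.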

2606.12742 2026-06-17 cs.AI cs.AR 新提交

Reducing the Complexity of Deep Learning Models for EEG Analysis on Wearable Devices

降低可穿戴设备上用于脑电图分析的深度学习模型复杂度

Farough Shayeste Roodi, Parham Zilouchian Moghaddam, Mahdi Mohammadi-nasab, Mehdi Modarressi, Mostafa Ersali Salehi Nasab, Masoud Daneshtalab

发表机构 * University of Tehran(德黑兰大学) Mälardalen University(梅拉达伦大学) Royal Institute of Technology(皇家理工学院)

AI总结 研究通过参数量化和电极减少方法,在资源受限的可穿戴设备上部署DNN模型,实现脑电图分析中精度与复杂度的权衡。

详情
AI中文摘要

可穿戴医疗设备是增长最快的物联网领域。许多自动化医疗服务依赖于两种关键的生物信号,即心电图和脑电图,它们分别反映心脏和大脑的活动。尽管深度神经网络被认为是处理和分析这些信号的主要方式,但可穿戴设备中非常严格的能量和计算能力限制远低于DNN模型的计算、能量和内存带宽需求,从而阻碍了深度学习在许多实际可穿戴服务中的部署。本文研究了在资源受限的可穿戴设备上部署最先进的DNN模型的可行性。值得注意的是,我们探讨了在使用参数量化和电极减少方法时,DNN的精度与计算复杂度之间的权衡。我们的研究集中在几种用于脑电图信号分析(特别是检测癫痫发作)的最先进的DNN模型上。我们的发现表明,当明智地应用这些技术时,可以显著降低所考虑的DNN的复杂度,同时对精度的影响最小。这些结果揭示了在将基于DNN的在线脑电图分析适配到可穿戴设备时,精度与复杂度降低之间明确的权衡关系。

英文摘要

Wearable healthcare devices are the fastest-growing Internet of Things (IoT) sector. Many automated healthcare services rely on two crucial biological signals, namely ECG and EEG, which reflect the activity of the heart and brain, respectively. Although deep neural networks (DNNs) are considered the primary way to process and analyze these signals, the energy and computational budgets of wearable devices fall far below the computational, energy, and memory bandwidth demands of DNN models, which impedes the deployment of deep learning in many practical wearable services. This paper investigates the feasibility of deploying state-of-the-art DNN models in resource-constrained wearable devices. Notably, we explore the trade-off between accuracy and computational complexity of DNNs when parameter quantization and electrode reduction methods are used. Our investigation centers on several state-of-the-art DNN models designed for EEG signal analysis, specifically for detecting epileptic seizures. Our findings demonstrate that, when applied judiciously, these techniques can significantly reduce the complexity of the DNNs under consideration with minimal adverse effects on accuracy. These results reveal the explicit trade-offs between accuracy and complexity reduction encountered when adapting DNN-based online EEG analysis for wearable devices.
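
As a rough illustration of parameter quantization, the sketch below applies symmetric uniform quantization to a handful of weights. The abstract does not say which scheme the paper uses, so the scheme, the function names, and the example weights are all assumptions.

```python
# Illustrative sketch only: symmetric uniform quantisation of a weight list.
# The paper's actual quantisation scheme is not given in the abstract.

def quantize_symmetric(weights, bits=8):
    """Map floats to signed integer codes in [-qmax, qmax] with one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

weights = [0.42, -1.30, 0.07, 0.91]          # hypothetical layer weights
codes, scale = quantize_symmetric(weights, bits=8)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored as one small integer plus a shared scale, and the reconstruction error is bounded by half a quantization step. Lowering `bits` shrinks memory and compute further at the cost of a larger step, which is the accuracy-versus-complexity trade-off the abstract refers to.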

2606.11990 2026-06-17 cs.LG cs.AI 新提交

Time-Series Foundation Model Embeddings for Remaining Useful Life Estimation

用于剩余使用寿命估计的时间序列基础模型嵌入

Amir El-Ghoussani, Michele De Vita, Ronald Naumann, Vasileios Belagiannis

发表机构 * University of Erlangen-Nuremberg(埃尔朗根-纽伦堡大学) Siemens AG(西门子股份公司)

AI总结 提出冻结预训练时间序列基础模型Chronos-2作为骨干,结合轻量回归头进行剩余寿命预测,在工业传感器数据上优于多种基线方法。

Comments Accepted to EUSIPCO 2026, 4 pages, 2 figures, 2 tables

详情
AI中文摘要

剩余使用寿命(RUL)预测对于工业预测性维护至关重要,然而许多基于学习的方法依赖于大量的特征工程或大型标注数据集来训练特定任务的序列模型。在这项工作中,我们引入了一种轻量级学习方法,利用冻结的预训练时间序列基础模型(TSFM),并将其与一个小型回归头结合,用于从多变量传感器流中估计RUL。具体来说,我们使用Chronos-2作为冻结骨干来提取上下文窗口特征,并训练一个轻量级回归神经网络进行RUL预测。在来自两种设备类型的真实工业传感器数据上的实验表明,在相同的预处理和评估协议下,Chronos-2特征一致地优于循环、卷积、基于Transformer和梯度提升基线。我们进一步分析了上下文长度的影响,发现随着历史记录变长,性能显著提升,这表明TSFM表示为工业环境中的RUL估计提供了一种实用且数据高效的替代方案。

英文摘要

Remaining Useful Life (RUL) prediction is essential for industrial predictive maintenance, yet many learning-based approaches rely on extensive feature engineering or large labeled datasets to train task-specific sequence models. In this work, we introduce a lightweight learning approach, in which we leverage a frozen pretrained time-series foundation model (TSFM) and combine it with a small regression head for RUL estimation from multivariate sensor streams. More specifically, we use Chronos-2 as a frozen backbone to extract context window features and train a lightweight regression neural network for RUL prediction. Experiments on real-world industrial sensor data from two device types show that Chronos-2 features consistently improve over recurrent, convolutional, Transformer-based, and gradient-boosting baselines under the same preprocessing and evaluation protocol. We further analyze the impact of context length and find that performance improves significantly with longer histories, indicating that TSFM representations offer a practical and data-efficient alternative for RUL estimation in industrial settings.

2606.11616 2026-06-17 cs.LG cs.IR 新提交

DeMix: Debugging Training Data with Mixed Data Error Types by Investigating Influence Vectors

DeMix: 通过影响向量调试包含混合错误类型的训练数据

Jiale Deng, Yanyan Shen, Xiaogang Shi, Junjun Chai

发表机构 * Shanghai Jiao Tong University(上海交通大学) ByteDance Inc.(字节跳动) TikTok

AI总结 提出DeMix框架,利用影响向量捕捉不同错误类型对模型行为的独特模式,将数据调试转化为多标签分类问题,并引入基于干预的学习策略,在11个任务上显著提升调试F1分数和修复后模型性能。

详情
AI中文摘要

高质量的训练数据对于机器学习模型的成功至关重要。然而,真实世界的数据集通常包含由数据准备流程中的系统性缺陷引起的混合错误类型,包括标签错误、特征错误和虚假相关性。有效的训练数据调试既需要检测错误样本,也需要识别其具体的错误类型以便进行针对性修复,但现有的数据清洗和归因方法未能充分满足这一双重需求。在本文中,我们提出DeMix,一种同时诊断错误样本及其错误类型的新框架。我们的关键见解是,不同的错误类型会在模型行为上产生不同的模式。DeMix通过影响向量捕获这些特定于错误的模式,这些影响向量描述了每个训练样本如何影响所有验证样本上的模型预测。我们将训练数据调试形式化为一个多标签分类问题,其中开发了一个分类器直接从影响向量预测错误类型。我们进一步引入了一种基于干预的学习策略,引导分类器捕获每种错误类型特有的不变理由,确保学到的分类器有效泛化。在表格数据预测、推荐系统和LLM对齐等11个任务上的实证评估表明,DeMix显著优于最先进的方法,在数据调试F1分数上提高了22.61%,在数据修复后任务模型性能上提高了9.32%。代码可在以下网址获取:https://github.com/SJTU-DMTai/DeMix。

英文摘要

High-quality training data is essential for the success of machine learning models. However, real-world datasets often contain mixed types of errors arising from systematic flaws in data preparation pipelines, including label errors, feature errors, and spurious correlations. Effective debugging of training data requires both detecting erroneous samples and identifying their specific error types to enable targeted repair, yet existing data cleaning and attribution methods fail to adequately address this dual requirement. In this paper, we propose DeMix, a novel framework that simultaneously diagnoses erroneous samples and their error types. Our key insight is that different error types produce distinct patterns in model behavior. DeMix captures such error-specific patterns with influence vectors that characterize how each training sample affects model predictions across all validation samples. We formulate training data debugging as a multi-label classification problem in which a classifier is trained to predict error types directly from influence vectors. We further introduce an intervention-based learning strategy that guides the classifier to capture invariant rationales specific to each error type, ensuring the learned classifier generalizes effectively. Empirical evaluations on 11 tasks across tabular data prediction, recommendation systems, and LLM alignment demonstrate that DeMix significantly outperforms state-of-the-art approaches, achieving a 22.61% improvement in data debugging F1-score and a 9.32% gain in task model performance after data repair. Code is available at: https://github.com/SJTU-DMTai/DeMix.

2606.10774 2026-06-17 cs.LG cs.DC 新提交

Asynchronous Decentralized Federated Learning over Lossy Wireless Links via Reception- and Age-Aware Aggregation

基于接收与年龄感知聚合的有损无线链路异步分散式联邦学习

Chanuka A. S. Hewa Kaluannakkage, Rajkumar Buyya

发表机构 * University of Melbourne(墨尔本大学) University of Technology Sydney(悉尼科技大学)

AI总结 针对无线网络下分散式联邦学习的选择偏差和更新过时问题,提出结合逆概率加权与信息年龄加权的DFL-AA方法,理论消除链路质量偏差,实验优于现有基线。

Comments 14 pages, 9 figures, research paper for journal submission

详情
AI中文摘要

在有损无线网络上的分散式联邦学习面临两个关键挑战:选择偏差,即由于部分模型接收,来自劣质链路的更新被系统性地低估;以及更新过时,即异步节点贡献过时信息。我们表明,使用局部填充重建的均匀八卦聚合会引入持久的链路质量诱导偏差,而基于完整性的加权进一步放大了这种效应。为了解决这些挑战,我们提出了DFL-AA(具有自适应AoI加权聚合的分散式联邦学习),它结合了逆概率加权与基于在线EWMA的信道估计来纠正选择偏差,以及基于信息年龄的加权来减轻过时,而无需全局同步。我们从理论上证明DFL-AA在期望上消除了链路质量失真,并通过实验证明在不同丢包率、网络规模和异构无线条件下,其性能持续优于最先进的基线。

英文摘要

Decentralized Federated Learning (DFL) enables collaborative model training across wireless edge nodes, including IoT deployments, autonomous vehicles, UAV swarms, and satellite constellations. Because these systems operate over lossy wireless links under constraints and cannot rely on retransmissions, model parameters must be accepted as partial chunks. This leads to two key failure modes: selection bias, where poor-quality links are systematically under-represented in gossip aggregation, and update staleness, where asynchronous nodes contribute outdated models. We prove that classical gossip aggregation introduces irreducible selection bias proportional to the link-loss rate. We propose DFL-AA (Decentralized Federated Learning with Adaptive AoI-weighted Aggregation), which corrects selection bias using Inverse Probability Weighting (IPW) with online channel estimation and mitigates staleness via Age-of-Information (AoI) decay without requiring a global clock. We prove that DFL-AA removes link-quality distortion in expectation and consistently outperforms state-of-the-art baselines across varying loss rates and heterogeneous channel conditions on fixed directed topologies.
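
How inverse probability weighting and age decay might combine into a single aggregation weight can be sketched as follows. This is a guess at the general shape, not the DFL-AA algorithm: the EWMA estimator, the exponential decay, the probability floor, the self-normalised average, and every name and constant are assumptions.

```python
import math

# Illustrative sketch only: one possible way to combine IPW and AoI decay.
# All names, constants, and the self-normalised average are assumptions.

def ewma_reception(p_hat, received, alpha=0.1):
    """Online estimate of a link's chunk-reception probability."""
    return (1.0 - alpha) * p_hat + alpha * (1.0 if received else 0.0)

def neighbor_weight(p_hat, age, decay=0.5, p_floor=0.05):
    """Up-weight rarely received links (1 / p_hat), down-weight stale updates."""
    return math.exp(-decay * age) / max(p_hat, p_floor)

def aggregate_chunk(local_value, arrivals, decay=0.5):
    """arrivals: (value, p_hat, age) for each neighbour whose chunk arrived."""
    num, den = local_value, 1.0          # the node's own chunk has weight 1
    for value, p_hat, age in arrivals:
        w = neighbor_weight(p_hat, age, decay)
        num += w * value
        den += w
    return num / den

fresh_weight = neighbor_weight(0.5, age=0)
stale_weight = neighbor_weight(0.5, age=2)
mixed = aggregate_chunk(1.0, [(3.0, 0.5, 0)])
updated_p = ewma_reception(0.5, received=True)
```

The intuition behind the `1 / p_hat` factor is that a chunk arriving with probability `p` and weighted by `1 / p` contributes with expected weight 1, so poor links are no longer under-represented on average. The floor on `p_hat` is a common practical guard against very large weights, added here as an assumption.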

2606.10703 2026-06-17 cs.LG cs.CL 新提交

From Observation to Intervention: A Causal Audit of Expert Importance in Mixture-of-Experts Models

从观察到干预:混合专家模型中专家重要性的因果审计

Leonard Engmann, Christian Medeiros Adriano, Holger Giese

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 通过因果审计发现,混合专家模型中的路由统计指标无法预测专家重要性,现有剪枝方法的成功源于早期层冗余而非识别可删除专家。

Comments 9 pages, 2 figures, 9 tables. Accepted at the ICML 2026 Workshop on Philosophy of Science Meets Machine Learning (PhilML). Camera-ready Version. Non-archival

详情
AI中文摘要

可解释性方法通常使用观察到的模型行为的总体统计量来推断特定计算的目标干预效果;用Pearl的术语来说,它们将第一层的关联证据视为支持第二层的干预结论,而这种做法的有效性很少被检验。我们考察了一个具体实例:混合专家(MoE)剪枝中路由统计量的使用,其中利用率、激活范数和路由权重分布被视为预测哪些专家可以被移除而不产生功能损失的指标。在三个高冗余MoE架构(OLMoE-1B-7B-0924、Qwen1.5-MoE-A2.7B、DeepSeek-V2-Lite)上进行的token级干预审计发现,经过多重比较校正后,没有任何观测指标能预测任何模型中的因果专家重要性,所有60个指标-层组合的效应量均低于Cohen's $d = 0.17$。通过每个token的路由权重控制排除了统计功效不足的问题,仅在OLMoE的最后一个MoE层恢复了一个Bonferroni显著的信号($d = +0.231$, $p = 0.0013$)。现有剪枝方法在此场景下的成功并非由于识别了可删除的专家,而是因为早期层的冗余使得大多数选择标准可互换。我们的结果提供了一个明确的反例,表明从总体观测统计量到关于专家重要性的token级干预推断这一常见推理步骤存在问题,并展示了干预审计如何校准可解释性主张的证据标准。

英文摘要

Interpretability methods routinely use population-level summary statistics over observed model behaviour to license claims about the effects of targeted interventions on specific computations; in Pearl's terms, they treat rung-1 associational evidence as if it supported rung-2 interventional conclusions, a move whose validity is rarely tested. We examine one concrete instance: the use of routing statistics in Mixture-of-Experts (MoE) pruning, where utilization rates, activation norms, and routing weight distributions are treated as predictors of which experts can be removed without functional cost. A token-level interventional audit across three high-redundancy MoE architectures (OLMoE-1B-7B-0924, Qwen1.5-MoE-A2.7B, DeepSeek-V2-Lite) finds no observational metric predicts causal expert importance in any model: across all 60 metric-layer combinations effect sizes stay below Cohen's $d = 0.23$, and no metric is reliably positive under our corrected, dual-test criterion. A per-token routing weight control, run with identical $n$, rules out insufficient power, recovering a signal whose CI excludes zero at OLMoE's final MoE layer ($d = +0.231$, 95% CI $[+0.09, +0.37]$, $p = 0.0013$). Existing pruning methods succeed in this regime not by identifying dispensable experts but because early-layer redundancy renders most selection criteria interchangeable. Our results provide an explicit counterexample to the common inferential step from population-level observational summaries to token-level interventional claims about expert importance, and illustrate how interventional audits can calibrate the evidential standards for interpretability claims.
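
The effect sizes quoted above are Cohen's d values. For readers unfamiliar with the statistic, a minimal sketch of the standard pooled-variance formula follows. The grouping of experts and the numbers are hypothetical and are not taken from the paper.

```python
import math
from statistics import mean, variance

# Illustrative sketch only: the textbook pooled-variance Cohen's d.
# The two groups below are made-up values, not data from the paper.

def cohens_d(group_a, group_b):
    """Standardised mean difference using the pooled sample variance."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)

# Hypothetical routing-metric values for experts whose ablation hurts the model
# versus experts whose ablation does not.
important = [2.0, 3.0, 4.0]
unimportant = [1.0, 2.0, 3.0]
d = cohens_d(important, unimportant)
```

A d near zero means the observational metric barely separates the two groups, which is the pattern the audit reports for routing statistics.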

2606.10616 2026-06-17 cs.AI 新提交

Learning What to Remember: Observability-Safe Memory Retention via Constrained Optimization for Long-Horizon Language Agents

学习记住什么:通过约束优化实现长时域语言代理的观测安全记忆保留

Qingcan Kang, Liu Mingyang, Shixiong Kai, Kaichao Liang, Tao Zhong, Mingxuan Yuan

发表机构 * Huawei Noah's Ark Lab(华为诺亚方舟实验室) Department of Computer Science, City University of Hong Kong(香港城市大学计算机科学系)

AI总结 针对长时域语言代理的有限上下文窗口,提出OSL-MR框架,将记忆保留建模为约束随机优化问题,通过在线可观测特征与离线监督的严格分离学习查询条件化的证据价值,实验表明在严格预算下优于现有方法。

详情
AI中文摘要

长时域语言代理积累的观测、推理轨迹和检索事实会超出其有限的上下文窗口,使得记忆保留成为一个基本的资源分配问题。现有记忆系统通过启发式评分、检索优化或学习压缩来改进管理,但大多将保留视为局部决策问题,并未在现实观测约束下显式建模其长期后果。为填补这一空白,我们将记忆保留建模为一个约束随机优化问题,具有明确的预算可行性、证据效用以及延迟成本(包括遗漏惩罚、重新获取延迟和过时信息风险)。随后,我们提出OSL-MR(观测安全记忆保留学习),这是一个新颖的框架,强制执行在线可观测特征与离线可用监督(OAS)之间的严格分离。OSL-MR结合了一个从实现的证据监督中训练的证据学习器和一个混合评分启发式,该启发式既作为可部署的在线安全基线,又作为结构化的归纳先验用于学习。由此产生的策略直接从交互数据中学习查询条件化的证据价值,同时在同一观测约束下保持可部署性。在LOCOMO和LongMemEval上的实验表明,OSL-MR在严格记忆预算下持续优于基于最近性的方法、生成式代理风格评分和其他启发式基线。混合评分先验在保持召回率的同时进一步提高了精确度,敏感性分析表明其在广泛的成本配置下具有鲁棒性。

英文摘要

Long-horizon language agents accumulate observations, reasoning traces, and retrieved facts that exceed their context windows, making memory retention -- what to keep, discard, or later recover under a fixed budget -- central to sustained performance. Most systems score memories with local rules such as recency or relevance, ignoring the delayed costs of retention: future retrieval failures, recomputation, and stale-information use. We formulate retention as a constrained, partially observable stochastic optimization problem in which current decisions shape information demands revealed only later, and prove its single-step version NP-hard. Since exact optimization is intractable and future demands unknown, we develop OSL-MR (Observability-Safe Learning for Memory Retention), a learning-augmented approximation for deployable memory control. Its core principle is observability separation: deployed decisions use only online-observable signals, while supervision from evidence realized after an interaction is used solely for offline learning. OSL-MR pairs a budget-aware Mixed-Score heuristic (a cold-start policy and inductive prior) with an evidence learner predicting which memories later serve as evidence. As the cumulative objective is non-decomposable and combinatorial, the learner is trained on evidence-membership signals rather than reward, a tractable, deployable target. On LoCoMo and LongMemEval, OSL-MR consistently outperforms strong heuristic and imitation-learning baselines, especially under tight budgets, and is robust across cost settings. On exactly-solvable instances, retention is genuinely multi-step: a perfect single-step optimizer is far from optimal, whereas OSL-MR stays near the dynamic-programming optimum. These results establish constrained stochastic optimization and optimization-guided learning as a scalable foundation for memory in long-horizon agents.
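
A budget-aware scoring heuristic of the general kind described above might look like the following sketch. The abstract does not define the Mixed-Score heuristic, so the score form, the weights, the greedy score-per-cost rule, and the field names are all assumptions, chosen to use only signals available at decision time.

```python
# Illustrative sketch only: a greedy, budget-constrained retention heuristic.
# The score, weights, and field names are hypothetical, not OSL-MR's.

def mixed_score(item, w_recency=0.5, w_relevance=0.5):
    """Blend of two online-observable signals, both assumed to lie in [0, 1]."""
    return w_recency * item["recency"] + w_relevance * item["relevance"]

def retain_under_budget(items, budget):
    """Keep memories by descending score per token until the budget is used up."""
    ranked = sorted(items, key=lambda it: mixed_score(it) / it["cost"], reverse=True)
    kept, used = [], 0
    for item in ranked:
        if used + item["cost"] <= budget:
            kept.append(item["id"])
            used += item["cost"]
    return kept, used

memories = [
    {"id": "m1", "cost": 40, "recency": 0.9, "relevance": 0.2},
    {"id": "m2", "cost": 30, "recency": 0.1, "relevance": 0.9},
    {"id": "m3", "cost": 50, "recency": 0.5, "relevance": 0.5},
]
kept, used = retain_under_budget(memories, budget=80)
```

A greedy rule like this is only a local heuristic. The abstract's point is that such single-step choices can be far from optimal over a long horizon, which is the motivation for pairing the heuristic with a learned evidence predictor.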

2606.09376 2026-06-17 cs.CL 新提交

Precision Is Not Faithfulness: Coverage-Aware Evaluation of Grounded Generation with a Complete Oracle

Juan S. Santillana

Affiliations * Globant

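AI summary Reference-free faithfulness metrics measure only precision and ignore recall. This work uses complete oracles (Formula 1 races and NOAA weather forecasts) to measure coverage, and proposes a combined precision-and-coverage metric together with a verifier-guided generation method.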

Comments 9 pages. v2: adds Anthropic Claude + 3 additional fine-tuned bases (1B-7B); 6 frontier families x 3 languages. Code https://github.com/vectrayx/precision-is-not-faithfulness Demo https://huggingface.co/spaces/jsantillana/faithful-strategy-engineer-f1

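AI abstract (translated from Chinese)

Reference-free faithfulness metrics verify each atomic claim a model makes against ground truth, and are increasingly used to evaluate grounded generation. We show they share a blind spot: they measure only precision (are the stated claims supported?) and therefore reward abstention, since a model can score near-perfect faithfulness by saying almost nothing. We make this measurable using Formula 1 telemetry, a domain where strategic ground truth is derived deterministically and, crucially, completely: for each decision we know the full set of facts that mattered. This completeness, absent in open-domain faithfulness benchmarks, lets us measure recall (coverage of the relevant facts) exactly, alongside precision. On a multilingual (EN/ES/PT) benchmark of 7,253 decision instances spanning 150 races, the most precise frontier model covers under half of the relevant facts and ranks last by F1, so requiring coverage reorders the systems; the same effect reappears in a second complete-oracle domain (NOAA weather forecasts). A prompt ablation shows that low coverage is not an artifact of under-prompting: explicitly asking models to be thorough does not close the gap. We pair faithfulness with coverage into a single score, validate the metric (controlled perturbation; agreement between a model-free regex extractor and a cross-family LLM extractor, system-level Spearman 1.0), and give a verifier-guided generation method that improves precision and recall without references. We release the benchmark, structured annotations, metric, baselines, and an interactive demo.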

English abstract

Reference-free faithfulness metrics verify each atomic claim a model makes against ground truth, and are increasingly used to evaluate grounded generation. We show they share a blind spot: they measure only precision -- are the stated claims supported? -- and therefore reward abstention, since a model can score near-perfect faithfulness by saying almost nothing. We make this measurable using Formula 1 telemetry, a domain where strategic ground truth is derived deterministically and, crucially, completely: for each decision we know the full set of facts that mattered. This completeness -- absent in open-domain faithfulness benchmarks -- lets us measure recall (coverage of the relevant facts) exactly, alongside precision. On a multilingual (EN/ES/PT) benchmark of 7,253 decision instances spanning 157 races, the most precise frontier model covers under half of the relevant facts and ranks last by F1, so requiring coverage reorders the systems; the same effect reappears in a second complete-oracle domain (NOAA weather forecasts). Fine-tuning small models (1B-7B) on the complete oracle closes the precision-recall gap entirely (F1 ~0.98), beating every zero-shot frontier system regardless of scale. We pair faithfulness with coverage into a single score, validate the metric (controlled perturbation; agreement across a model-free regex extractor and a cross-family LLM extractor, system-level Spearman 1.0), and give a verifier-guided generation method that improves precision and recall without references. We release the benchmark, structured annotations, metric, baselines, and an interactive demo.
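The blind spot described above can be illustrated with a minimal sketch. It assumes atomic claims are already extracted as string identifiers and that the complete oracle is simply the set of facts that mattered for one decision; the function name `score_claims`, the `ClaimScores` type, and the fact identifiers are hypothetical, and this is not the paper's released metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClaimScores:
    precision: float
    recall: float
    f1: float


def score_claims(stated: set[str], oracle: set[str]) -> ClaimScores:
    """Score one response against a complete oracle (illustrative only).

    stated: atomic claims extracted from the model output.
    oracle: the full set of relevant, true facts for this decision.
    """
    if not oracle:
        raise ValueError("a complete oracle needs at least one relevant fact")
    supported = stated & oracle
    # Assumed convention: an empty response is vacuously precise.
    precision = len(supported) / len(stated) if stated else 1.0
    recall = len(supported) / len(oracle)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return ClaimScores(precision, recall, f1)


# Hypothetical fact identifiers for a single strategy decision.
oracle = {
    "tyre_wear_high",
    "gap_to_car_ahead",
    "pit_window_open",
    "safety_car_risk",
    "rain_expected",
}

# A terse response: one claim, and it is supported.
terse = score_claims({"tyre_wear_high"}, oracle)

# A thorough response: five claims, one of them unsupported.
thorough = score_claims(
    {
        "tyre_wear_high",
        "gap_to_car_ahead",
        "pit_window_open",
        "safety_car_risk",
        "fuel_critical",
    },
    oracle,
)

print(terse)
print(thorough)
```

In this toy case the terse response reaches precision 1.0 with recall 0.2, so it wins on a precision-only metric but loses on F1 (about 0.33 against about 0.8), which is the kind of reordering the abstract reports once coverage is required.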

2606.09004 2026-06-17 cs.AI New submission

LATTEArena: An Evaluation Framework for LLM-powered Tabular Feature Engineering (Extended Version)

Ankai Hao, Ke Chen, Huan Li, Lidan Shou

Affiliations * Zhejiang University

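AI summary Proposes LATTEArena, the first standardized evaluation framework for the field. Using a six-dimensional taxonomy that decomposes 15 methods, a modular arena, and component-level ablations, it reports 16 key findings, including that Tree-of-Thought combined with MCTS is the most cost-effective.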

Comments 31 pages, 9 figures

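AI abstract (translated from Chinese)

Feature engineering remains essential to tabular data analysis, and Large Language Models (LLMs) have emerged as a promising paradigm for automating it, giving rise to LLM-powered Automated Tabular Feature Engineering (LATTE). However, the lack of a standardized platform hinders fair, cost-aware comparison. Moreover, complex method designs obscure the specific contribution of individual components; for example, although LFG integrates Tree-of-Thought, few-shot demonstrations, Monte Carlo Tree Search, and natural-language generation, the isolated impact of each technique's competitive merit remains unquantified. To address these challenges, we introduce LATTEArena, the first competitive evaluation framework, featuring: (1) a six-dimensional taxonomy that decomposes 15 representative methods into reusable components; (2) a standardized modular arena for controlled comparison; (3) multi-dimensional evaluation covering performance, cost, and robustness; and (4) component-level ablations that quantify the competitive merit of each technique. Through extensive evaluation, we report 16 key findings, including: (1) Tree-of-Thought combined with Monte Carlo Tree Search achieves the best cost-effectiveness; (2) RPN and code output formats dominate classification and regression tasks, respectively. We publicly release the modular framework and over 4,000 execution logs, so that researchers can benchmark new techniques against existing ones and advance LATTE.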

English abstract

Feature engineering remains a cornerstone of tabular data analysis, and Large Language Models (LLMs) have emerged as a promising paradigm for its automation, giving rise to LLM-powered Automated Tabular Feature Engineering (LATTE). However, the field lacks standardized, cost-aware evaluation platforms, and the combinatorial explosion of design choices obscures true algorithmic progress. To bridge these gaps, we systematically deconstruct 15 representative LATTE methods into a unified 6-dimensional taxonomy. Based on this abstraction, we introduce LATTEArena, a standardized, modular, and extensible benchmarking framework that decouples monolithic pipelines into reusable execution blocks. By distilling the massive combinatorial space, we evaluate 24 core LATTE configurations across 7 research questions. Our head-to-head benchmarking goes beyond predictive accuracy to quantify token efficiency and execution robustness, yielding 17 empirical findings on cost-effectiveness trade-offs. Furthermore, we provide 3 concrete recommendations for optimal real-world deployment. By enabling controlled component-level comparisons, LATTEArena shifts the paradigm from ad-hoc prompt engineering to systematic context management. All code, datasets, and over 4,000 execution logs are publicly available to foster a dynamic, community-driven benchmark. Our framework, leaderboard, and all artifacts are hosted on the LATTEArena project website at https://goodenhak.github.io/LATTEArena.

2606.08810 2026-06-17 cs.CL cs.LG New submission

Continuous Language Diffusion as a Decoder-Interface Problem

Zhicheng Du, Lan Ma

Affiliations * Tsinghua Shenzhen International Graduate School, Tsinghua University

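AI summary Studies how continuous diffusion language models generate fluent text from Gaussian noise, proposes a decoder-basin mechanism, designs a diagnostic protocol that exposes failures hidden by scalar metrics, and explains token-recovery behavior through an interface phase diagram.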

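AI abstract (translated from Chinese)

Gaussian-corrupted sentence embeddings have no direct linguistic interpretation, yet continuous diffusion language models can generate fluent text from them. We study this puzzle through Embedded Language Flows (ELF) and identify a decoder-basin mechanism: denoising succeeds when trajectories reach regions where the native decoder can read stable tokens. We introduce a diagnostic protocol for denoisability, semantic recoverability, order sensitivity, decoder compatibility, and trajectory reliability. It exposes failures hidden by scalar metrics: low mean-squared error can discard linguistic content, low perplexity can reflect low-entropy collapse, and clean latent reconstruction can coexist with a narrow decoder basin. A decoder-margin bound explains why token recovery depends on margin and local decoder sensitivity, not latent error alone. Auditing public ELF checkpoints reveals an interface phase diagram: early predictions are weakly readable, mid-trajectory disagreement marks a competition region, and late predictions enter a high-margin final-token basin. Once inside, token realization is surprisingly simple on generated ELF states: frozen T5 token-embedding lookup recovers 93% to 96% of native decoder decisions, and a single linear readout reaches 97.9% agreement at 32k samples, leaving a perplexity gap of about 1.1 in a structured residual tail. Under an explicit diagnostic monitor, a conservative margin gate exits 17% to 27% earlier in denoising steps. Boundary checks on LangFlow, BitstreamDiffusion, and the Continuous Latent Diffusion Language Model (Cola-DLM) show that the same interface questions remain meaningful when the state object and decoder change. Continuous and latent diffusion language models should therefore be evaluated as representation-decoder systems.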

English abstract

Gaussian-corrupted sentence embeddings have no direct linguistic interpretation, yet continuous diffusion language models can generate fluent text from them. We study this puzzle through Embedded Language Flows (ELF) and identify a decoder-basin mechanism: our evidence suggests that denoising becomes reliable when trajectories reach regions where the native decoder can read stable tokens. We introduce a diagnostic protocol for denoisability, semantic recoverability, order sensitivity, decoder compatibility, and trajectory reliability. It exposes failures hidden by scalar metrics: low mean-squared error can discard linguistic content, low perplexity can reflect low-entropy collapse, and clean latent reconstruction can coexist with a narrow decoder basin. A decoder-margin bound explains why token recovery depends on margin and local decoder sensitivity, not latent error alone. Auditing public ELF checkpoints reveals an interface phase diagram: early predictions are weakly readable, mid-trajectory disagreement marks a competition region, and late predictions enter a high-margin decoder basin. Once inside, token realization is surprisingly simple on generated ELF states: frozen T5 (Text-to-Text Transfer Transformer) token-embedding lookup recovers $93$--$96\%$ of native decoder decisions, and a single linear readout reaches $97.9\%$ agreement at 32k samples, leaving an $\approx1.1$--$1.2$ perplexity gap in a structured residual tail. Under conservative held-out gates, a margin rule exits roughly $17$--$28\%$ earlier in denoising steps under an explicit diagnostic monitor. Boundary checks on LangFlow, BitstreamDiffusion, and the Continuous Latent Diffusion Language Model (Cola-DLM) show that the same interface questions remain meaningful when the state object and decoder change. Continuous and latent diffusion language models should therefore be evaluated as representation-decoder systems.