arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.16533 2026-06-17 cs.AI cs.CV New submission

Kairos: A Native World Model Stack for Physical AI

Kairos Team, Fei Wang, Shan You, Qiming Zhang, Tao Huang, Zuoyi Fu, Zhisheng Zheng, Yunlong Xi, Feng Lv, Xiaoming Wu, Zeyu Liu, Cong Wan, Pu Li, Ruiqing Yang, Xiaoou Li, Wei Wang, Kangkang Zhu, Yuwei Zhang, Shi Fu, Zheng Zhang, Xiaoning Wu, Xuzeng Fan, Dacheng Tao, Xiaogang Wang

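Affiliations: Kairos Team

AI summary: Proposes Kairos, a native world model stack that combines a cross-embodiment data curriculum, a hybrid linear temporal attention architecture, and deployment-aware system co-design to acquire world knowledge, preserve state over long horizons, and execute efficiently, reaching top-level performance on embodied world-model and related benchmarks.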

Abstract

World models are transitioning from passive visual generators to foundational, operational infrastructure for Physical AI: they must natively acquire world knowledge from heterogeneous experience, maintain persistent states over long horizons, and execute efficiently within real deployment constraints. We introduce Kairos, a native world model stack designed around these requirements. (1) Kairos learns the world by pioneering a Native Pre-training Paradigm governed by a Cross-Embodiment Data Curriculum, which organizes open-world videos, human behavioral data, and robot interactions into a progressive developmental pathway. (2) Kairos maintains the world by unifying world understanding, generation, and prediction within a Native Unified Architecture equipped with Hybrid Linear Temporal Attention, where sliding-window attention captures local dynamics, dilated sliding windows capture mid-range dependencies, and gated linear attention maintains persistent global memory. We establish formal theoretical bounds demonstrating that this temporal factorization strictly limits error accumulation, mathematically guaranteeing state propagation across extended horizons. (3) Kairos runs the world by incorporating a Deployment-Aware System Co-Design to support low-latency rollout generation on server and consumer-grade hardware for real-world observation-action-feedback loops. Experiments on embodied world-model, long-horizon, and action-policy benchmarks show that Kairos achieves top-level performance while offering a strong efficiency-capability trade-off. Together, these results position Kairos as a cohesive operational foundation for future self-evolving physical intelligence.
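
To make the three temporal components named in the abstract concrete, here is a minimal NumPy sketch of a causal sliding-window mask, a dilated sliding-window mask, and a gated linear attention recurrence. It is illustrative only: the window sizes, the dilation, the constant scalar gate, and the plain averaging used to fuse the three branches are all assumptions, and the abstract does not say how Kairos actually parameterizes or combines them.

```python
import numpy as np

def sliding_window_mask(T, window):
    """Causal mask: step t attends to the `window` most recent steps, itself included."""
    idx = np.arange(T)
    diff = idx[:, None] - idx[None, :]          # query index minus key index
    return (diff >= 0) & (diff < window)

def dilated_window_mask(T, window, dilation):
    """Causal mask: step t attends to `window` steps spaced `dilation` apart."""
    idx = np.arange(T)
    diff = idx[:, None] - idx[None, :]
    return (diff >= 0) & (diff % dilation == 0) & (diff < window * dilation)

def masked_attention(q, k, v, mask):
    """Softmax attention restricted to the positions where `mask` is True."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def gated_linear_attention(q, k, v, gate):
    """Recurrent form: S_t = g_t * S_{t-1} + outer(k_t, v_t), o_t = q_t @ S_t.
    The per-step scalar gate is a simplification chosen for this sketch."""
    T, d = q.shape
    S = np.zeros((d, v.shape[-1]))              # fixed-size running memory
    out = np.zeros_like(v)
    for t in range(T):
        S = gate[t] * S + np.outer(k[t], v[t])
        out[t] = q[t] @ S
    return out

rng = np.random.default_rng(0)
T, d = 16, 8
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
local = masked_attention(q, k, v, sliding_window_mask(T, window=4))
mid = masked_attention(q, k, v, dilated_window_mask(T, window=4, dilation=3))
glob = gated_linear_attention(q, k, v, gate=np.full(T, 0.95))
hybrid = (local + mid + glob) / 3.0             # assumed fusion, not from the paper
```

The point of the split is cost: both masked branches touch a bounded number of keys per step, and the recurrent branch carries a fixed-size state, so per-step work does not grow with the length of the history.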

2606.16449 2026-06-17 cs.CV New submission

PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory

Shuai Yang, Bingjie Gao, Ziwei Liu, Jiaqi Wang, Dahua Lin, Tong Wu

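Affiliations: Shanghai Jiao Tong University; Stanford University; S-Lab, Nanyang Technological University; The Chinese University of Hong Kong; Shanghai Innovation Institute

AI summary: Proposes PermaVid, a framework that uses context memory disentangled into semantic appearance and geometric structure, together with an edit-aware update strategy, to keep video generation consistent over the long term after editing operations.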

Comments Project page: https://ys-imtech.github.io/projects/PermaVid/

Abstract

Consistent video generation under editing operations requires persistence: when edits modify scene appearance or layout, subsequent generations should remain coherent across time and viewpoints. However, existing memory designs struggle to maintain long-term consistency after such modifications, as stored contexts may become outdated or invalid. To address this, we propose PermaVid, a novel framework built upon a multi-modal context memory that disentangles spatial context into semantic appearance and geometric structure, together with an edit-aware memory update and retrieval strategy that keeps memory evolution aligned with subsequent observations. Specifically, we develop two complementary memory banks: an RGB context memory that captures appearance-aware observations while implicitly encoding geometry, and a depth context memory that preserves geometry-only structure disentangled from semantics. Building on this design, we introduce a memory-guided video generation model that performs multi-modal feature fusion under reference conditions drawn from mixed-modality memory contexts. Experiments demonstrate that our method maintains strong long-term semantic and structural consistency after edits, significantly outperforming state-of-the-art methods.

2606.16379 2026-06-17 cs.LG stat.ML New submission

Scalable and Interpretable Representation Alignment with Ordinal Similarity

Diogo Soares, Pankhil Gawade, Andrea Dittadi, Ewa Szczurek

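Affiliations: University of Maryland; Google Research

AI summary: To address the lack of interpretability, sensitivity to outliers, and high computational cost of existing representation-similarity metrics, proposes ordinal-similarity-based Triplet and Quadruplet Similarity Indices that give an interpretable, robust, and efficient measure of alignment.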

Abstract

Evaluating representation similarity is fundamental to representation learning. However, existing metrics suffer from significant limitations: they lack interpretability due to shifting baselines, lack robustness to outliers, and are computationally intractable for large datasets, forcing reliance on heuristic approximations. To address this, we develop an ordinal-similarity framework, instantiated by the Triplet (TSI) and Quadruplet (QSI) Similarity Indices, which measure alignment by quantifying the consistency of ordinal relationships. We theoretically demonstrate this formulation is inherently interpretable, robust to outliers, and computationally efficient. Finally, we establish a formal equivalence between TSI and local neighborhood alignment, measured by Mutual Nearest Neighbors. Empirically, we validate these properties and show that ordinal similarity offers a scalable approach to measuring alignment, enabling practitioners to better understand and design representations.
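
The idea of measuring alignment through the consistency of ordinal relationships can be sketched as a triplet-agreement score. The abstract does not give the exact definition of TSI, so the estimator below is an assumption made for illustration: it samples triplets (i, j, k), asks in each representation whether j is closer to anchor i than k is, and reports the fraction of triplets on which the two representations agree. The function names, the Euclidean distance, and the Monte Carlo sampling are all choices made for this sketch.

```python
import numpy as np

def pairwise_dist(X):
    """Euclidean distance matrix for the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.sqrt(np.maximum(d2, 0.0))

def triplet_agreement(A, B, n_triplets=10000, seed=0):
    """Fraction of sampled triplets (i, j, k) whose ordering 'j is closer
    to i than k is' matches between representations A and B.
    Illustrative stand-in, not necessarily the paper's TSI."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    dA, dB = pairwise_dist(A), pairwise_dist(B)
    i = rng.integers(0, n, n_triplets)
    j = rng.integers(0, n, n_triplets)
    k = rng.integers(0, n, n_triplets)
    keep = (i != j) & (i != k) & (j != k)       # drop degenerate triplets
    i, j, k = i[keep], j[keep], k[keep]
    order_A = dA[i, j] < dA[i, k]
    order_B = dB[i, j] < dB[i, k]
    return float(np.mean(order_A == order_B))
```

Because only orderings are compared, rescaling one representation leaves the score unchanged, and a single extreme point affects only the triplets it takes part in. For unrelated representations the expected agreement of this particular estimator is about 0.5, which gives a fixed reference point when reading the score.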

2606.16337 2026-06-17 cs.AI cs.HC cs.LG New submission

Medical Heuristic Learning: An LLM-Driven Framework for Interpretable and Auditable Clinical Decision Rules

Wei Xu, Ke Yang, Gang Luo, Keli Zheng, Lingyan Hu, Jing Wang, Kefeng Li

Affiliations: Centre for Artificial Intelligence Driven Drug Discovery, Macao Polytechnic University; Key Laboratory of Short-Range Radio Equipment Testing and Evaluation, Ministry of Industry and Information Technology Terahertz Science Application Center (TSAC), Beijing Institute of Technology; Department of Critical Care Medicine, Yantai Yuhuangding Hospital, Qingdao University; Faculty of Education, The University of Hong Kong; College of Information Engineering, Dalian University

AI summary: Proposes Medical Heuristic Learning (MHL), which uses an LLM-driven workflow to optimize a deterministic, executable decision system and produce interpretable, auditable Python decision rules; it reaches performance comparable to state-of-the-art methods on medical datasets and handles small-sample and highly imbalanced settings.

Abstract

Predictive modeling for clinical tabular data is central to clinical decision support and therefore requires not only strong predictive performance but also transparent decision logic. Although deep learning and tree-based ensemble methods can achieve high accuracy, their black-box nature remains a major obstacle to clinical deployment. This challenge is further compounded by common characteristics of medical data, including limited sample sizes, severe class imbalance, and feature evolution arising from changes in diagnostic criteria and clinical documentation. To address these issues, we propose Medical Heuristic Learning (MHL), an instantiation of the learning-beyond-gradients paradigm for clinical tabular prediction. Instead of relying on neural network weight updates, MHL uses a large language model (LLM)-driven workflow that integrates statistical probes, medical knowledge probes, rule synthesis, and code-level iterative refinement to optimize a deterministic and executable decision system. The resulting model is expressed not as opaque parameters, but as versioned pure-Python decision rules that are explicitly interpretable, fully auditable, and clinically grounded. MHL also supports continual learning by starting from previously validated rules and iteratively revising them using updated feature information under data drift or feature evolution. Comprehensive experiments on medical datasets show that MHL achieves performance comparable to state-of-the-art methods while maintaining strong behavior in small-sample and highly imbalanced settings. The results further indicate that this explicit rule update mechanism can help alleviate catastrophic forgetting under feature evolution. Overall, these findings suggest that non-gradient-based heuristic systems offer a transparent and adaptable alternative for high-stakes clinical decision support.

2606.16203 2026-06-17 cs.CV New submission

DynFS-MoE: Dynamic Functional-Structural Mixture-of-Experts for Post-Traumatic Epilepsy Diagnosis

Jun-En Ding, Spencer Chen, Henry Noren, Daniel Valdivia, Christine Yohn, Suhina Patel, Taylor Zink, Hai Sun, Feng Liu

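Affiliations: Department of Systems Engineering, Stevens Institute of Technology; Department of Neurosurgery, Robert Wood Johnson Medical School, Rutgers University

AI summary: Proposes a dynamic multimodal mixture-of-experts framework that fuses functional and structural MRI through time-aware functional-structural encoding and class-conditioned expert routing; it outperforms static fusion baselines on three binary classification tasks and reveals meaningful ROI interactions.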

Abstract

Post-traumatic epilepsy (PTE) is a severe complication of traumatic brain injury (TBI), yet early identification remains challenging due to the complex structural and functional alterations it induces in the brain. To address this, we propose a dynamic multimodal Mixture-of-Experts (MoE) framework that integrates functional and structural MRI through time-aware functional-structural encoding and class-conditioned expert routing. Within this framework, modality-specific and cross-modal experts learn complementary representations, while a Modality-Class MoE (MCoE) module dynamically dispatches expert weights according to each classification objective. Experimental results across three binary classification tasks demonstrate that the framework consistently outperforms static fusion baselines, and high-interpretability analyses further reveal meaningful region-of-interest (ROI) interactions. This dynamic multimodal expert framework effectively captures class-dependent brain interaction patterns and provides an interpretable approach for PTE diagnosis and risk stratification.
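
As a rough picture of what class-conditioned expert routing could look like, the sketch below gives each classification task its own gating matrix, so the same input features are mixed across experts differently depending on the objective. It is a generic mixture-of-experts layer written under assumptions: linear experts, softmax gates, and the names `class_conditioned_moe`, `gate_weights`, and `task_id` are hypothetical and are not taken from the paper's MCoE module.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def class_conditioned_moe(features, expert_weights, gate_weights, task_id):
    """features:       (batch, d) fused input features
    expert_weights: (n_experts, d, d_out), one linear expert each (assumed)
    gate_weights:   (n_tasks, d, n_experts), one gate per classification task
    Returns the mixed output and the gate values used."""
    gates = softmax(features @ gate_weights[task_id])                 # (batch, n_experts)
    expert_out = np.einsum("bd,edo->beo", features, expert_weights)   # (batch, n_experts, d_out)
    mixed = np.einsum("be,beo->bo", gates, expert_out)                # gate-weighted sum
    return mixed, gates
```

Returning the gate values alongside the output makes it possible to inspect which experts a given task relies on, which is one plausible route to the kind of interpretability analysis the abstract mentions.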

2606.16070 2026-06-17 cs.AI New submission

Mind-Studio: Executable World Models with Lookahead Evaluation for Partially Observable Games

Yifei Dong, Mingen Zheng, Linquan Wu, Jeff Z. Pan, Jiaxin Bai

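Affiliations: Hong Kong University of Science and Technology; City University of Hong Kong; University of Edinburgh; Hong Kong Baptist University

AI summary: Proposes Mind-Studio, a framework that uses large language models to synthesize executable pygame-style world models from trajectories and evaluates them with a K-step lookahead fidelity protocol, substantially improving prediction accuracy and subgoal verification on games such as Montezuma's Revenge.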

Comments 12 pages, 2 figures

Abstract

World-model synthesis aims to turn interaction experience into an internal model of environment dynamics. Existing symbolic approaches often fit observed transitions or mixtures of local rules, but they do not produce a complete executable program that can run independently of the real environment. We present Mind-Studio, a framework that synthesizes executable pygame-style world models from state-action-next-state trajectories using large language models. Mind-Studio combines entropy-selected traces with a lightweight game skill file containing object, action, and static scene information extracted from screenshots. We evaluate synthesis quality with a K-step lookahead fidelity protocol that compares generated world-model rollouts against Real-ALE rollouts from the same state. On Montezuma's Revenge, Mind-Studio improves chosen-action next-state prediction from 0.3% for PoE-World to 48.7% while verifying 5 of 8 subgoals; across Alien, Assault, and Skiing, it achieves stronger branch-level fidelity than prior learned lookahead sources.
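
A K-step lookahead comparison of the kind described above can be sketched as follows: start the synthesized model and the real environment from the same state, apply the same actions to both, and count how often the resulting states match. The abstract does not spell out the exact scoring rule, so the per-step exact-match fraction, the function name `k_step_fidelity`, and the toy one-dimensional environment are all assumptions for illustration.

```python
def k_step_fidelity(model_step, real_step, start_state, actions, k):
    """Fraction of the first k steps on which the model's state equals the
    real environment's state when both run open-loop from `start_state`."""
    m_state = r_state = start_state
    steps = actions[:k]
    matches = 0
    for a in steps:
        m_state = model_step(m_state, a)   # synthesized world model
        r_state = real_step(r_state, a)    # ground-truth environment
        matches += int(m_state == r_state)
    return matches / max(1, len(steps))

# Toy stand-ins: the real environment clips the position to [0, 4];
# the hypothetical synthesized model forgot the wall.
def real_step(s, a):
    return min(4, max(0, s + a))

def model_step(s, a):
    return s + a

print(k_step_fidelity(model_step, real_step, 0, [1, 1, 1], k=3))   # agrees while away from the wall
print(k_step_fidelity(model_step, real_step, 3, [1] * 6, k=3))     # diverges after hitting the wall
```

Running both systems open-loop means an early modeling error keeps counting against the model on later steps, which is what makes a multi-step protocol stricter than one-step next-state accuracy.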

2606.16009 2026-06-17 cs.CL cs.HC New submission

Bridging the Usability Gap: Lessons from Interpreting Studies for Machine Interpreting Design

Claudio Fantinuoli

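Affiliations: University of Mainz

AI summary: Defines machine interpreting as a subfield of speech translation, points out its "accuracy illusion", and draws on interpreting studies to propose three design priorities for closing the usability gap: agency, grounding, and experience.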

Abstract

Machine interpreting (MI), the live, real-time application of speech translation, has achieved remarkable progress on standard benchmarks, with some systems approaching human parity on textual fidelity. Yet the user experience remains far inferior to interpreter-mediated communication, revealing what we term the accuracy illusion: systems that appear accurate on paper but fail in practice to support smooth, goal-oriented interaction. This paper defines MI as a distinct subfield of speech translation, with its own characteristics and the need for evaluation methods grounded in communicative effectiveness rather than isolated fidelity metrics. Drawing on insights from interpreting studies, we identify critical dimensions of professional interpreting practice that are overlooked by current systems, and consolidate them into three interdependent design priorities for future MI: agency (context-sensitive initiative and repair), grounding (multimodal and discourse-level situational awareness), and experience (adaptive improvement through real interaction). Together, these priorities chart a path toward closing the usability gap and enabling systems that can sustain authentic multilingual communication in real time.

2606.15937 2026-06-17 cs.CV New submission

GOOSE-M2F: Adapting Mask2Former for High-Fidelity, Long-Tailed Fine-Grained Semantic Segmentation in Unstructured Outdoor Terrain

Jyothiraditya Lingam, Nikhileswara Rao Sulake, Sai Manikanta Eswar Machara

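Affiliations: Rajiv Gandhi University of Knowledge Technologies, Nuzvid, India

AI summary: For long-tailed, fine-grained semantic segmentation in unstructured outdoor terrain, proposes GOOSE-M2F, which combines 200 object queries, a feature refinement module, and an auxiliary supervision head with a multi-stage training strategy to reach 70.08% composite mIoU on the GOOSE benchmark.

Comments This solution placed 3rd in the GOOSE 2D Fine-Grained Semantic Segmentation (FGSS) Challenge at ICRA 2026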

Abstract

We present GOOSE-M2F, a task-specific adaptation of Mask2Former for the GOOSE 2D Fine-Grained Semantic Segmentation (FGSS) Challenge at ICRA 2026. The GOOSE benchmark spans 64 fine-grained classes across unstructured outdoor terrain with a severely long-tailed distribution, where rare classes occupy fewer than 50 pixels per image. We extend the Swin-Large Mask2Former baseline with three targeted contributions: (1) 200 object queries to eliminate representational saturation; (2) a Feature Refinement Module (FRM) combining ASPP-lite and CBAM dual-attention; and (3) an Auxiliary Supervision Head that delivers direct per-pixel gradients for rare classes. A multi-stage training strategy pairs Distribution-Balanced loss, Rare-Class Copy-Paste augmentation, dynamic IoU-aware re-weighting, and EMA. At inference, a dense sliding-window engine with 2D Gaussian kernel blending and 4-scale TTA adds +10.57%. GOOSE-M2F achieves 70.08% Official Composite mIoU (63.55% fine, 76.61% coarse), placing 3rd on the GOOSE 2D FGSS leaderboard. Code and trained models are publicly available at GitHub: https://github.com/Aditya-Lingam-9000/GOOSE-M2F and Hugging Face: https://huggingface.co/XYZ9843/GOOSE-M2F.
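
The sliding-window inference with 2D Gaussian kernel blending mentioned above can be sketched in NumPy as below. This is a generic version of the technique, not the authors' engine: the tile size, stride, Gaussian width, and the `predict` callback (any function that maps an image tile to per-class logits) are assumptions, and the 4-scale TTA step is left out.

```python
import numpy as np

def gaussian_window(size, sigma_frac=0.25):
    """2D Gaussian weight map that peaks at the tile centre."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-0.5 * (ax / (sigma_frac * size)) ** 2)
    return np.outer(g, g)

def window_starts(length, tile, stride):
    """Tile start offsets that cover [0, length), with a final tile flush to the edge."""
    last = max(length - tile, 0)
    starts = list(range(0, last + 1, stride))
    if starts[-1] != last:
        starts.append(last)
    return starts

def blended_inference(image, predict, n_classes, tile=64, stride=32):
    """Run `predict` on overlapping tiles and blend logits with Gaussian weights.
    `predict(tile)` must return an array of shape (n_classes, h, w)."""
    H, W = image.shape[:2]
    acc = np.zeros((n_classes, H, W))     # weighted sum of logits
    wsum = np.zeros((H, W))               # sum of weights per pixel
    weight = gaussian_window(tile)
    for y in window_starts(H, tile, stride):
        for x in window_starts(W, tile, stride):
            patch = image[y:y + tile, x:x + tile]
            h, w = patch.shape[:2]
            acc[:, y:y + h, x:x + w] += predict(patch) * weight[:h, :w]
            wsum[y:y + h, x:x + w] += weight[:h, :w]
    return acc / wsum                     # per-pixel weighted average
```

Each pixel's final logits are a weighted average over every tile that covers it, with predictions near a tile's centre counting more than those near its border, where the network has less context. This is what smooths the seams between tiles.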

2606.15932 2026-06-17 cs.CL New submission

Beyond NL2Code: A Structured Survey of Multimodal Code Intelligence

Xuanle Zhao, Qiushi Sun, Jingyu Xiao, Xuexin Liu, Haoyue Yang, Qiaosheng Chen, Xianzhen Luo, Jing Huang, Yufeng Zhong, Lei Chen, Shuai Fu, Zhenlin Wei, Jinhe Bi, Lei Jiang, Haibo Qiu, Siqi Yang, Peng Shi, Jian Hu, Zhixiong Zeng

Affiliations: Meituan; The University of Hong Kong; The Chinese University of Hong Kong; Institute of Automation, Chinese Academy of Sciences; Nanjing University; Harbin Institute of Technology; Australian Institute for Machine Learning, Adelaide University; Ludwig Maximilian University of Munich; University of Science and Technology of China; Queen Mary University of London

AI summary: Systematically surveys multimodal code intelligence, classifying tasks by the role code plays, covering GUIs, scientific visualization, structured graphics, and frontier tasks, and proposing four verification-centered future directions.

Comments Work completed in January 2026. Updating now

Abstract

While Large Language Models (LLMs) have substantially advanced text-to-code synthesis, many real programming tasks specify intent through visual artifacts such as screenshots, charts, vector drawings, videos, and interactive states. These tasks require models to connect visual perception to executable programs, because correctness depends not only on syntax but also on layout, data semantics, interaction behavior, and domain-specific constraints that apply after execution. This survey examines Multimodal Code Intelligence, covering systems that generate, edit, refine, or reason with code under visually grounded inputs and outputs. We first formulate the field by the role that code plays in each task, distinguishing code as a rendered artifact, an editable symbolic structure, a scientific representation, an intermediate reasoning trace, or an executable policy or tool interface. We then organize benchmarks and methods into four domains: Graphical User Interface, Scientific Visualization, Structured Graphics, and Frontier Tasks and Frameworks. This taxonomy connects mature artifact-generation problems to emerging agentic and unified settings and allows us to compare how different tasks treat evidence of correctness. Looking ahead, we argue that future research may benefit from four verification-centered directions. Multi-signal validation can combine complementary evidence of correctness, multi-state verification can test behavior across execution trajectories, cross-task transfer testing can probe reusable visual-code skills, and verifiable agent traces can reveal whether agent actions are grounded in visual evidence. Together, these directions may move this field from single-output imitation toward evidence-grounded executable systems. An ongoing project and resources are available on GitHub: https://github.com/xjywhu/Awesome-Multimodal-LLM-for-Code.

2606.15903 2026-06-17 cs.CL cs.AI New submission

Control-Plane Placement Shapes Forgetting: An Architectural Study of Agent Memory Across Thirteen System Configurations

Dongxu Yang

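Affiliations: DeepLethe

AI summary: Studies how the placement of an LLM in an agent memory pipeline (control plane vs. recall plane) affects forgetting failure modes; experiments with 13 configurations on a 385-case adversarial test set reveal the complementary coverage of three placement regimes, and the paper introduces the ForgetEval evaluation suite.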

Comments 25 pages including appendices. Code, benchmark, and adapters released under MIT at https://github.com/deeplethe/lethe

Abstract

Where an LLM sits in an agent memory pipeline -- between the recall plane that retrieves stored facts (extensively benchmarked) and the control plane that mutates them via supersede, release, purge (largely untested) -- shapes which forgetting failure modes the system recovers. Comparing thirteen system configurations on a 385-case adversarial surface, we observe three placement regimes with partly complementary coverage: deterministic primitives suffice for lexical/temporal categories but fail canonicalization (5% on identifier-obfuscation, 0% on cross-lingual); inscribe-time LLM recovers canonicalization (100%) but cannot help intent-aware deletion (0% on prefix-collision and compound-fact); a mutation-time hook recovers intent-aware deletion (78-85%) and brightens nearly all categories simultaneously (91.7-93.2% overall, $0.17 per 385-case run, 2.3s/case mutation latency vs. 64-191ms/case deterministic, recall path unchanged). We expose the trade-off via ForgetEval, a 1000-case templated suite plus a 385-case adversarial layer (132 hand-crafted + 253 LLM-drafted oracle-validated) scored by deterministic substring match, paired with a six-method Adapter Protocol with honest N/A scoring that lets heterogeneous memory stores enter in 130 lines. Admission is corroborated by 10-annotator IAA (Fleiss' kappa = 0.958) and a 77-case external-authored subset (four blind contributors) that replicates the canonicalization asymmetry and amplifies the joint-placement lift (+27.8 pt). Production failures are predominantly forgetting failures rather than recall failures, yet existing benchmarks measure only recall. ForgetEval and all adapters are released under MIT.
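
The inter-annotator agreement reported above is a Fleiss' kappa. For readers unfamiliar with the statistic, the following is a minimal sketch of the standard formula, assuming every item is rated by the same number of annotators; the input layout (`counts[i][j]` is the number of annotators who put item i in category j) is a convention chosen for this example, and nothing here reproduces the paper's annotation data.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for an (items x categories) table of rating counts.
    Every row must sum to the same number of raters n (n >= 2)."""
    counts = np.asarray(counts, dtype=float)
    N = counts.shape[0]                        # number of items
    n = counts[0].sum()                        # raters per item
    p_j = counts.sum(axis=0) / (N * n)         # share of all ratings per category
    P_i = (np.sum(counts ** 2, axis=1) - n) / (n * (n - 1))   # per-item agreement
    P_bar = P_i.mean()                         # observed agreement
    P_e = np.sum(p_j ** 2)                     # agreement expected by chance
    return (P_bar - P_e) / (1.0 - P_e)
```

A value of 1 means the annotators always agree, and values near 0 mean agreement no better than chance, so a kappa of 0.958 indicates near-unanimous labels.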

2606.15883 2026-06-17 cs.CL cs.AI New submission

Koshur Diacritizer: A Byte-Level Sequence-to-Sequence Model for Kashmiri Diacritic Restoration

Haq Nawaz Malik, Nahfid Nissar, Faizan Iqbal

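Affiliations: arXiv

AI summary: To address the ambiguity caused by missing diacritics in digital Kashmiri text, proposes Koshur Diacritizer, a byte-level sequence-to-sequence model based on ByT5-small that combines script-aware normalization, alignment validation, and skeleton-preserving inference; it achieves a DERm of 0.2012 and a WER of 0.2159 on the test set, with 77.5% accuracy in expert evaluation.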

Abstract

Kashmiri, an Indo-Aryan language written in a modified Perso-Arabic script, frequently omits diacritic marks in digital text, creating ambiguity and challenging downstream NLP applications. We present Koshur Diacritizer, a ByT5-small byte-level sequence-to-sequence model for restoring diacritics in Kashmiri text. To support this task, we release a publicly available dataset of 23.7k aligned undiacritized/diacritized Kashmiri sentence pairs. The proposed framework combines script-aware normalization, alignment validation, and skeleton-preserving inference to ensure reliable restoration while maintaining the original base-letter sequence. Experimental results on a held-out test set achieve a DERm of 0.2012 and a WER of 0.2159. Additionally, evaluation by a native Kashmiri linguistic expert yields a mean accuracy of 77.5%. The dataset, model, and source code are publicly released to provide a reproducible baseline for Kashmiri diacritic restoration and future low-resource language research.
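
For reference, word error rate (WER) in its usual form is the word-level edit distance between the reference and the system output, divided by the number of reference words. The sketch below implements that textbook definition with whitespace tokenization; the paper's exact tokenization and normalization are not described in the abstract, and its DERm metric is not reproduced here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate with whitespace tokenization (an assumption of this sketch)."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / max(1, len(ref))
```

Under this definition, a WER of 0.2159 corresponds to roughly one word-level error for every five reference words.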

2606.15735 2026-06-17 cs.CL cs.AI New submission

EHRNote-ChatQA: A Benchmark for Evidence-Grounded Multi-Turn Clinical Question Answering over Longitudinal Discharge Summaries

Jiyoun Kim, Muhan Yeo, Eunhye Jang, Jeewon Yang, Hangyul Yoon, Su Ji Lee, Hee Jo Han, Hee-Jae Jung, Doyun Kwon, Jun young Lee, Jaehun Lee, Jung-Oh Lee, Sunjun Kweon, Jong Hak Moon, Daseul Kim, Minjae Cho, Edward Choi

Affiliations: KAIST; Seoul National University; Seoul National University Bundang Hospital; SAIHST, Sungkyunkwan University; Yonsei University College of Medicine; Gangnam Severance Hospital; Severance Hospital; Seoul Medical Center; Seoul National University Hospital; National Cancer Center; Icahn School of Medicine at Mount Sinai; Samsung Medical Center

AI summary: Proposes the EHRNote-ChatQA benchmark, built from MIMIC-IV discharge summaries, with 967 multi-turn samples and 16,072 expert-verified QA pairs for evaluating evidence-grounded multi-turn clinical question answering by LLMs; finds that models struggle with evidence grounding and with errors that accumulate across turns.

Abstract

Discharge summaries are crucial clinical documents containing the context of a patient's overall hospital stay, and are routinely reviewed by medical experts for patient readmission, ongoing care, and diagnostic decision-making. When reviewing them, medical experts often must iteratively synthesize information across multiple summaries while verifying the evidence supporting each answer. Although large language models (LLMs) are increasingly explored for clinical question answering, existing benchmarks do not sufficiently reflect this setting: they often evaluate exam-style medical knowledge or focus on single-turn question answering with limited evidence-grounding evaluation. We introduce EHRNote-ChatQA, the first benchmark for evidence-grounded multi-turn clinical question answering over patients' multiple discharge summaries. Built from de-identified MIMIC-IV discharge summaries, EHRNote-ChatQA contains 967 patient-level multi-turn samples spanning one to five notes and 16,072 medical-expert-verified QA pairs (8,036 content questions, each paired with an evidence-grounding question) across eight clinical categories. The benchmark is constructed through an expert-informed pipeline combining discharge-summary structuring schema, expert-curated multi-turn QA templates, and LLM-based generation, followed by review and revision of every single QA sample by 11 medical experts. Benchmarking 22 open- and closed-source LLMs reveals several challenges, including that LLMs struggle more with evidence grounding than content answering, multi-turn errors compound across turns, and single-turn clinical QA performance does not reliably transfer to this setting. These findings establish EHRNote-ChatQA as a rigorous and practical benchmark for evaluating clinical QA systems. The dataset will be made publicly available through PhysioNet credentialed access.

2606.15617 2026-06-17 cs.CV New submission

NeRD: Neuro-Symbolic Rule Distillation for Efficient Ontology-Grounded Chain-of-Thought in Medical Image Diagnosis

Hongxi Yang, Yiwen Jiang, Siyuan Yan, Jamie Chow, Eunis Li, Charlotte Poon, Stephanie Fong, Xiangyu Zhao, Deval Mehta, Yasmeen George, Zongyuan Ge

Affiliations: Department of Data Science & AI, Faculty of Information Technology, Monash University; AIM for Health Lab, Faculty of Information Technology, Monash University; Faculty of Engineering, Monash University; Faculty of Medicine, The Chinese University of Hong Kong; School of Computing Technologies, RMIT University

AI summary: Proposes NeRD, a framework that uses neuro-symbolic rule distillation to generate efficient, ontology-grounded, non-redundant reasoning chains without hand-crafted rules; it achieves strong diagnostic performance and interpretability on skin datasets and enables the first expert-in-the-loop multimodal chain-of-thought diagnosis study.

Comments Accepted at MICCAI 2026

Abstract

Interpretability is essential for trustworthy medical image diagnosis. However, existing concept-driven interpretable methods have key limitations: Concept Bottleneck Models (CBMs) require scoring all predefined concepts at inference time and for manual intervention, imposing a substantial burden on clinicians, while rationale-based generative approaches often select concepts by class discriminability, which can drift from diagnostic ontologies. To address these issues, we propose Neuro-Symbolic Rule Distillation (NeRD), a framework that produces efficient, ontology-grounded reasoning chains that are sufficient yet non-redundant, without manually crafting diagnostic rules. Experiments on two skin datasets demonstrate strong diagnostic performance and interpretability, and blinded expert evaluation confirms the clinical plausibility of NeRD rationales. Our method further enables a first expert-in-the-loop study for Multimodal Chain-of-Thought-based diagnosis, achieving efficient and effective concept-level intervention.

2606.15614 2026-06-17 cs.CV New submission

Variational Test-time Optimization for Diffusion Synchronization

Hyunsoo Lee, Farrin Marouf Sofian, Kushagra Pandey, Stephan Mandt

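Affiliations: Seoul National University; University of California, Irvine

AI summary: Proposes a variational test-time optimization framework based on optimal control that optimizes control variables to guide multiple trajectories toward collaborative generation, improving diffusion synchronization without additional training.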

Comments Preprint. Project website: https://hleephilip.github.io/SyncVC/

详情
AI中文摘要

协同生成通过协调多个扩散轨迹来扩展预训练先验的能力,已成为扩展扩散模型适用性的强大范式。在现有方法中,扩散同步通过引入通用引导机制提供了场景无关的解决方案。然而,当前的同步方法严重依赖启发式方法,并且仍然需要针对特定任务进行调整,这限制了它们的泛化能力和性能。在这项工作中,我们基于最优控制数学推导了一个同步框架,为扩散同步提供了原理性解释。在采样过程中,我们优化控制变量以引导多个轨迹朝向一致解,同时保持接近底层扩散先验。我们的方法完全在测试时运行,无需额外训练,因此当与强大的预训练先验结合时,能够在多样化的生成场景中广泛应用。我们在三个代表性的协同生成任务上展示了相对于基线的持续改进,涵盖了广泛的模态和应用。除了性能提升,我们的工作为协同生成建立了新的基础,为将预训练生成模型扩展到新的协同生成设置开辟了一条原理性路径。

英文摘要

Collaborative generation, which coordinates multiple diffusion trajectories to extend the capabilities of pretrained priors, has emerged as a powerful paradigm for extending the applicability of diffusion models. Among existing approaches, diffusion synchronization provides a scenario-agnostic solution by introducing general guidance mechanisms. However, current synchronization approaches rely heavily on heuristics and still require task-specific tailoring, which limits their generalizability and performance. In this work, we mathematically derive a synchronization framework based on optimal control, providing a principled explanation of diffusion synchronization. During sampling, we optimize control variables to guide multiple trajectories toward coherent solutions while remaining close to the underlying diffusion prior. Our method operates entirely at test-time without additional training, thereby enabling broad applicability across diverse generation scenarios when combined with strong pretrained priors. We demonstrate consistent improvements over baselines on three representative collaborative generation tasks, covering a wide range of modalities and applications. Beyond performance gains, our work establishes a novel foundation for collaborative generation, opening a principled path toward extending pretrained generative models to new collaborative generation settings.

2606.15575 2026-06-17 cs.AI cs.HC 新提交

Do we have the knowledge we need? Rethinking human-AI decision-making in corporations

我们是否拥有所需的知识?重新思考企业中的人机决策

Anne S. R. Marx, Ricardo M. Avelino, Torbjørn Netland, Mennatallah El-Assady

发表机构 * ETH Zurich(苏黎世联邦理工学院) Department of Computer Science & ETH AI Center, ETH Zurich(苏黎世联邦理工学院计算机科学系与ETH AI中心) Department of Computer Science & Architecture, ETH Zurich(苏黎世联邦理工学院计算机科学与建筑系) Department of Management, Technology, and Economics, ETH Zurich(苏黎世联邦理工学院管理、技术与经济系) Department of Computer Science, ETH Zurich(苏黎世联邦理工学院计算机科学系)

AI总结 本文提出一个框架,根据任务属性和知识可用性推荐人机代理分配与控制机制,并应用于制造任务示例。

Comments Proceedings of AutomationXP26 Workshop of the 2026 CHI Conference on Human Factors in Computing Systems, April 14, 2026, Barcelona, Spain. ACM, New York, NY, USA, 8 pages

详情
AI中文摘要

组织知识分散在各种软件系统、隐性知识和传统上为人类消费设计的手动文档中。随着AI系统越来越多地被部署并赋予决策角色,它们需要访问这些知识。这提出了两个问题:组织应如何存储和维护知识,使其对人类和未来的AI系统都可访问;以及在不同风险和不确定性水平的任务中,应如何在人类和AI之间分配代理权?在这篇立场论文中,我们描述了组织知识如何演变,并贡献了一个框架,将任务属性和知识可用性映射到推荐的代理分配和控制机制。我们通过两个不同的制造任务说明了该框架的适用性:一个常规操作(视觉质量检查)和一个一次性战略决策(工厂选址),并总结了未来研究的机会。

英文摘要

Organizational knowledge is fragmented across a variety of software systems, tacit expertise, and manual documents that have traditionally been designed for human consumption. As AI systems are increasingly deployed and granted decision-making roles, they require access to this knowledge. This raises two questions: how should organizations store and maintain knowledge so that it remains accessible to both humans and future AI systems, and how should agency be allocated between humans and AI across tasks with different risks and levels of uncertainty? In this position paper, we describe how organizational knowledge evolves and contribute a framework that maps task attributes and knowledge availability to recommended agency allocations and control mechanisms. We illustrate the applicability of the framework on two different manufacturing tasks: a routine operation (visual quality inspection) and a one-off strategic decision (factory location), and conclude with opportunities for future research.

2606.15573 2026-06-17 cs.AI cs.CR 新提交

QoS-Aware Token Scheduling and Private Data Valuation for Multi-Modal Agentic Networks

面向多模态代理网络的QoS感知令牌调度与私有数据估值

Yao Du, Jing Liu, Pengfei Xu, Zehua Wang, Victor C. M. Leung, Cyril Leung, Victoria Lemieux

发表机构 * University of British Columbia(不列颠哥伦比亚大学) Lazai Network(Lazai网络)

AI总结 针对去中心化代理系统中数据异构和资源受限问题,提出基于差分隐私的多模态表示与公平令牌分配方案,在保障服务质量的同时提升数据隐私和贡献公平性。

Comments Accepted to IEEE ICME 2026

详情
AI中文摘要

在代理系统中,人类生成的数据记录锚定了AI服务的价值。然而,云计算管道将处理集中在远程服务器上。数据集中化降低了个人数据主权,并可能降低服务质量(QoS)。同时,用户贡献在数量和质量上存在差异:去中心化记录可能存在偏差、噪声和异质分布。为了解决数据挑战,我们研究了去中心化且资源受限的代理系统中的公平令牌分配和私有数据估值。我们的方法将多模态表示嵌入到共享语义空间中,并释放差分隐私(DP)原型以在减少语义泄露的同时保持效用。在DP保证下,我们设计了一种公平的令牌分配方案,该方案奖励有效贡献,并对数据异质性和AI资源稀缺性具有鲁棒性。大量仿真表明,与标准基准相比,基于贡献的公平性和QoS得到了改善。对图像重建攻击的抵抗力增强表明多模态个人数据的隐私得到了加强。

英文摘要

In agentic systems, human-generated data records anchor the value of AI services. Yet cloud compute pipelines centralize processing on remote servers. Data centralization reduces personal data sovereignty and may potentially degrade the quality of service (QoS). Meanwhile, user contributions are diverse in quantity and quality: decentralized records can be biased, noisy, and heterogeneously distributed. To address the data challenge, we study fair token allocation and private data valuation for decentralized and resource-constrained agentic systems. Our approach embeds multi-modal representations in a shared semantic space and releases differentially private (DP) prototypes to preserve utility while reducing semantic leakage. With the DP guarantee, we design a fair token allocation scheme that rewards effective contributions and remains robust to data heterogeneity and AI resource scarcity. Extensive simulations demonstrate improved contribution-based fairness and QoS compared to standard benchmarks. The improved resistance to image reconstruction attacks indicates enhanced privacy for multi-modal personal data.

2606.15531 2026-06-17 cs.LG cs.CR 新提交

Greedy Coordinate Diffusion: Effective and Semantically Coherent Adversarial Attacks via Diffusion Guidance

贪婪坐标扩散:通过扩散引导实现有效且语义一致的对抗攻击

Bohdan Turbal, Blossom Metevier, Max Springer, Aleksandra Korolova

发表机构 * University of Maryland(马里兰大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 提出贪婪坐标扩散方法,利用扩散模型引导生成语义连贯的对抗样本,在保持自然性的同时实现高攻击成功率。

Journal ref ICML 2026

详情
AI中文摘要

尽管已有大量研究,针对大语言模型的对抗攻击的实际影响仍然有限。基于优化的攻击,如贪婪坐标梯度(GCG)(Zou et al., 2023),会产生高困惑度、不连贯的后缀,现有防御很容易检测到(Bengio et al., 2024)。此外,在优化过程中强制施加连贯性约束往往会使攻击无法成功诱导出特定的目标响应,导致对鲁棒模型的成功率较低。相反,保持连贯性的攻击常常改变查询的语义意图;当模型遵从这些被改动的查询时,其响应并未满足攻击者的原始目标。在这项工作中,我们提出贪婪坐标扩散(GCD),这是一个新颖的框架,能够高效地生成针对安全对齐模型的对抗攻击,同时保持低困惑度以及对攻击者原始意图的高度语义遵循。GCD利用离散扩散语言模型的生成先验来引导对抗后缀的搜索,从而实现语义连贯性和遵循性。与GCG不同,GCD不需要直接的梯度访问,因此可以在灰盒设置下运行。我们表明,GCD取得了最高的攻击成功率(ASR),同时在响应质量评分上保持竞争力,并且所构造的对抗提示被基于困惑度的过滤器和守卫模型过滤器检测到的比率低于其他方法。

英文摘要

Adversarial attacks on large language models have limited practical impact despite extensive research. Optimization-based attacks such as Greedy Coordinate Gradient (GCG) (Zou et al., 2023) produce high-perplexity, incoherent suffixes that existing defenses easily detect (Bengio et al., 2024). Moreover, attempting to enforce coherence constraints during optimization often prevents the attack from successfully eliciting the specific targeted response, resulting in low success rates against robust models. Conversely, attacks that maintain coherence often alter the semantic intent of queries; when the model complies with these altered queries, responses fail to address the adversary's original goal. In this work, we introduce Greedy Coordinate Diffusion (GCD), a novel framework that efficiently generates adversarial attacks against safety-aligned models while maintaining low perplexity and high semantic adherence to the adversary's original intent. GCD leverages the generative priors of discrete diffusion language models to guide the search for adversarial suffixes that achieve semantic coherence and adherence. Unlike GCG, GCD does not require direct gradient access, allowing it to operate in a gray-box setting. We show that GCD achieves the highest attack success rate (ASR) while remaining competitive on response-quality scores, and that the constructed adversarial prompts are detected at lower rates than those of other methods by perplexity-based and guard-model filters.

2606.15386 2026-06-17 cs.LG 新提交

A Compositional Framework for Open-ended Intelligence

开放智能的组合框架

Ida Momennejad, Roberta Raileanu

发表机构 * GitHub

AI总结 提出开放智能的形式化定义,通过有限原始集和组合算子生成闭包,支持跨任务和世界的无限组合生成,并引入下一原始预测作为架构目标。

详情
AI中文摘要

开放智能是指适应与训练环境显著不同的新问题和新环境的能力。我们将开放智能形式化为由有限原始集 \(P\) 和一组组合算子 \(C\) 诱导的闭包。我们刻画了诱导闭包 \(\mathcal{L}(P,C)\) 的性质,该闭包支持跨任务和世界族的无界组合生成。开放智能的数学需要两个支柱:一组最小的表示原始(例如状态、动作)和算法原始(例如最近邻),以及反映习得组合语法的组合模式(例如递归、序列化)。这两个支柱的闭包使得能够在广泛的环境中生成无限的自适应响应。该数学支持互补的研究议程,包括解释性和可解释性的评估指标,以及构建组合泛化原生的架构。我们提出下一原始预测作为一种新的架构目标,其中训练目标鼓励获取可重用的算法原始及其组合语法,从而通过重组生成新的解决方案。课程学习和自我博弈通过跨任务和世界族发现可重用原始和转换模式,实现闭包的终身学习和扩展。我们通过物理学、进化论和神经科学的案例研究来夯实该框架。

英文摘要

Open-ended intelligence is the capacity to adapt to novel problems and environments that are substantially different from those in training. A mathematics of open-ended intelligence requires two pillars: first, a minimal set of representational primitives (e.g., states, actions) and algorithmic primitives (e.g., nearest neighbor); and second, an acquired compositional grammar for selection, recursion, and branching that produces sequences of operations and recurring motifs. We formalize open-ended intelligence in terms of the compositional closure induced by a finite primitive set $P$ and a set of composition operators $C$. We characterize properties of the induced closure $\mathcal{L}(P,C)$ that support unbounded compositional generation across families of tasks and worlds. The closure of the two pillars yields infinite adaptive responses across a wide range of settings. The mathematics supports complementary research agendas, including evaluation metrics for explanation and interpretability, and novel architectures where compositional generalization is native. We propose next primitive prediction (NPP) as a novel architectural objective, where training encourages the acquisition of reusable algorithmic primitives and their compositional grammar, such that new solutions are generated through recombination. Given such an objective, curriculum learning and self-play can enable lifelong learning, expanding the closure by discovering reusable primitives and transition motifs across settings. We ground the framework through case studies in physics, evolution, and neuroscience.

2606.15236 2026-06-17 cs.CV 新提交

Show the Signal, Hide the Noise: Spectral Forcing for Pixel-Space Diffusion

展示信号,隐藏噪声:像素空间扩散的频谱强制

Weichen Fan, Haiwen Diao, Penghao Wu, Ziwei Liu

发表机构 * S-Lab, Nanyang Technological University(南洋理工大学S-Lab)

AI总结 提出频谱强制方法,通过在像素空间扩散模型中对噪声输入施加时变低通滤波器,引导模型关注信号频带,提升训练效率和生成质量。

Comments Code link: https://github.com/WeichenFan/Spectral_Forcing

详情
AI中文摘要

像素空间扩散模型在全带宽噪声图像上训练,但去噪器可用的有用信号强烈依赖于频率。在整流流扩散和自然图像幂律谱下,每个时间$t$的频带数据-噪声等高线$k^{*}(t) = (1-t)^{-2/\alpha}$将信号承载的低频区域与噪声主导的高频区域分开。我们表明,这种隐式的由粗到细结构不仅仅是描述性的:它引发了一个容量分配问题。标准的像素空间去噪器必须内部发现移动的带宽边界,并可能在最优预测退化为确定性基线而非数据分布建模的频率-时间区域上花费计算。为了显式化这个边界,我们引入了频谱强制,一个无参数、时间条件的2D-DCT低通算子,在补丁嵌入器之前应用于噪声输入。其截止频率随扩散时间单调增加,并在数据端点处变为恒等映射。通过受控的合成实验,我们确定了该算子有益的机制:粗补丁分词和其高频内容主要是噪声而非必要信号的数据。在ImageNet-256上使用JiT-700M/32,频谱强制在不同训练周期中一致地改进了FID和Inception Score,展示了训练过程中的稳健增益;在更细的分词下,频谱强制仍然具有竞争力。我们进一步将未修改的算子插入SenseNova-U1,一个统一的文本到图像模型,它改进了DPG-Bench和GenEval,表明输入侧频谱先验可以超越类条件生成。这些结果表明了一条通往容量高效的像素空间扩散的途径:展示信号并隐藏噪声。

英文摘要

Pixel-space diffusion models are trained on full-bandwidth noisy images, yet the useful signal available to the denoiser is strongly frequency dependent. Under rectified-flow diffusion and natural-image power-law spectra, the per-band data-to-noise contour $k^{*}(t) = (1-t)^{-2/\alpha}$ separates a signal-bearing low-frequency region from a noise-dominated high-frequency region at each time $t$. We show that this implicit coarse-to-fine structure is not merely descriptive: it induces a capacity-allocation problem. A standard pixel-space denoiser must discover the moving bandwidth boundary internally and can spend computation on frequency-time regions where the optimal prediction collapses to deterministic baselines rather than data-distribution modeling. To make this boundary explicit, we introduce Spectral Forcing, a parameter-free, time-conditional 2D-DCT low-pass operator applied to the noisy input before the patch embedder. Its cutoff expands monotonically with the diffusion time and becomes the identity at the data endpoint. Through controlled synthetic experiments, we identify the regime in which the operator is beneficial: coarse patch tokenization and data whose high-frequency content is predominantly noise rather than essential signal. On ImageNet-256 with JiT-700M/32, Spectral Forcing consistently improves both FID and Inception Score across different training epochs, demonstrating robust gains throughout training; at finer tokenization, the spectral forcing is still competitive. We further insert the unchanged operator into SenseNova-U1, a unified text-to-image model, where it improves DPG-Bench and GenEval, showing that the input-side spectral prior transfers beyond class-conditional generation. These results suggest a route to capacity-efficient pixel-space diffusion by showing the signal and hiding the noise.
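
A minimal sketch of a time-conditional 2D-DCT low-pass operator of the kind the abstract describes may help make the idea concrete. It is not the authors' implementation: the square low-pass mask, the rounding of $k^{*}(t)$ to an integer cutoff, the default `alpha=2.0`, the convention that $t=1$ is the data endpoint, and the names `dct_matrix`, `cutoff`, and `spectral_forcing` are all assumptions made for illustration. The released code at the link above is the authoritative reference.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n list of rows."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows

def matmul(a, b):
    """Plain dense matrix product on lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def cutoff(t, n, alpha=2.0):
    """Assumed schedule: ceil((1 - t)^(-2/alpha)), clipped to the image size n."""
    if t >= 1.0:
        return n  # data endpoint: keep every frequency (identity)
    return min(n, math.ceil((1.0 - t) ** (-2.0 / alpha)))

def spectral_forcing(image, t, alpha=2.0):
    """Low-pass a square image (list of rows) in the 2D-DCT domain."""
    n = len(image)
    d = dct_matrix(n)
    coeffs = matmul(matmul(d, image), transpose(d))  # forward 2D DCT
    k = cutoff(t, n, alpha)
    # Assumed mask shape: keep the k x k block of lowest frequencies.
    kept = [[c if (u < k and v < k) else 0.0 for v, c in enumerate(row)]
            for u, row in enumerate(coeffs)]
    return matmul(matmul(transpose(d), kept), d)  # inverse 2D DCT
```

Under these assumptions the operator has no trainable parameters: at `t = 0.0` only the DC coefficient survives, so the output is the image mean, and at `t = 1.0` the input is returned unchanged up to floating-point error.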

2606.15148 2026-06-17 cs.RO cs.AI 新提交

MimicIK: Real-Time Generative Inverse Kinematics from Teleoperation with FK Consistency

MimicIK: 基于遥操作且保持正运动学一致性的实时生成式逆运动学

Jiahao Yang, Shenhao Yan, Fan Feng, Chengsi Yao, Ge Wang, Zhixin Mai, Yiming Zhao, Yatong Han

发表机构 * Ising AI CUHK-Shenzhen(香港中文大学(深圳))

AI总结 提出MimicIK框架,利用条件流匹配从遥操作数据学习平滑鲁棒的关节空间运动先验,通过两阶段迭代优化和正运动学一致性损失实现实时逆运动学求解,在6-DOF机器人数据集上达到4.65mm位置误差和92.01%成功率。

详情
AI中文摘要

逆运动学(IK)仍然是实时机器人操作的关键瓶颈。经典的数值求解器具有高几何精度,但在闭环部署中常出现不连续的分支切换和运动学奇异点附近的不稳定行为。同时,学习型IK方法在平衡空间精度、运动平滑性和实时效率方面经常遇到困难,尤其是在使用嘈杂的人类遥操作数据训练时。我们提出MimicIK,一个实时生成式逆运动学框架,通过条件流匹配从遥操作演示中学习平滑且鲁棒的关节空间运动先验。给定当前关节构型和目标末端执行器位姿,MimicIK基于最小迭代策略(MIP)主干,通过高效的两步迭代精化过程预测连续的增量关节指令。为了强制物理一致性,我们进一步引入正运动学一致性损失,这是一种可微的正运动学正则化项,在训练过程中惩罚任务空间与目标位姿的偏差。我们在包含8,848个遥操作演示的真实6-DOF机器人数据集上评估MimicIK。MimicIK实现了4.65 mm的平均位置误差,92.01%的10 mm成功率,以及仅7.99%的轨迹尖峰率。与UNet扩散基线相比,我们的方法在提高空间精度和运动平滑性的同时,将推理延迟从21.66 ms降低到6.74 ms。此外,与在分布外部署时灾难性发散的确定性MLP基线不同,MimicIK在奇异构型附近保持稳定,并在部署硬件上实现鲁棒的20 Hz实时控制。

英文摘要

Inverse kinematics (IK) remains a critical bottleneck for real-time robot manipulation. Classical numerical solvers achieve high geometric precision but often suffer from discontinuous branch switching and unstable behavior near kinematic singularities during closed-loop deployment. Meanwhile, learned IK approaches frequently struggle to balance spatial accuracy, motion smoothness, and real-time efficiency, particularly when trained on noisy human teleoperation data. We present MimicIK, a real-time generative inverse kinematics framework that learns smooth and robust joint-space motion priors from teleoperation demonstrations through conditional flow matching. Given the current joint configuration and a target end-effector pose, MimicIK predicts continuous delta-joint commands using an efficient two-step iterative refinement process based on a Minimal Iterative Policy (MIP) backbone. To enforce physical consistency, we further introduce an FK consistency loss, a differentiable forward-kinematics regularization that penalizes task-space deviations from the target pose during training. We evaluate MimicIK on a real-world 6-DOF robot dataset containing 8,848 teleoperation demonstrations. MimicIK achieves a mean position error of 4.65 mm, a 10 mm success rate of 92.01%, and a trajectory spike rate of only 7.99%. Compared with a UNet diffusion baseline, our method improves both spatial accuracy and motion smoothness while reducing inference latency from 21.66 ms to 6.74 ms. Furthermore, unlike deterministic MLP baselines that catastrophically diverge under out-of-distribution deployment, MimicIK remains stable near singular configurations and enables robust 20 Hz real-time control on deployment hardware.
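
To illustrate what a forward-kinematics consistency term can look like, the sketch below uses a toy planar 2-link arm rather than the 6-DOF robot from the paper. The link lengths, the position-only squared-error form, and the names `forward_kinematics`, `fk_consistency_loss`, and `fk_consistency_grad` are hypothetical; the abstract does not specify the loss beyond it being a differentiable FK regularizer on task-space deviation.

```python
import math

LINK_LENGTHS = (0.30, 0.25)  # hypothetical link lengths in metres

def forward_kinematics(q):
    """End-effector (x, y) of a planar 2-link arm with joint angles q."""
    l1, l2 = LINK_LENGTHS
    x = l1 * math.cos(q[0]) + l2 * math.cos(q[0] + q[1])
    y = l1 * math.sin(q[0]) + l2 * math.sin(q[0] + q[1])
    return x, y

def fk_consistency_loss(q, delta_q, target_xy):
    """Squared task-space distance between FK(q + delta_q) and the target."""
    q_next = [a + b for a, b in zip(q, delta_q)]
    x, y = forward_kinematics(q_next)
    return (x - target_xy[0]) ** 2 + (y - target_xy[1]) ** 2

def fk_consistency_grad(q, delta_q, target_xy):
    """Analytic gradient of the loss w.r.t. delta_q via the arm Jacobian."""
    l1, l2 = LINK_LENGTHS
    q1 = q[0] + delta_q[0]
    q12 = q1 + q[1] + delta_q[1]
    x, y = forward_kinematics([q1, q12 - q1])
    ex, ey = x - target_xy[0], y - target_xy[1]
    dx_dq = (-l1 * math.sin(q1) - l2 * math.sin(q12), -l2 * math.sin(q12))
    dy_dq = (l1 * math.cos(q1) + l2 * math.cos(q12), l2 * math.cos(q12))
    return [2.0 * (ex * dx_dq[i] + ey * dy_dq[i]) for i in range(2)]
```

In a training setup of this kind, such a term would typically be added to the generative objective with a weighting coefficient, so that predicted delta-joint commands are pulled toward configurations whose end-effector position matches the target.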

2606.15121 2026-06-17 cs.CL 新提交

When Cognitive Graphs Meet LLMs: BDEI Cognitive Pathways for Panic Emotional Arousal Prediction

当认知图遇见大语言模型:恐慌情绪唤醒预测的BDEI认知路径

Mengzhu Liu, Long Qin, Chuan Ai, Zhengqiu Zhu, Hongru Liang, Chen Gao, Yong Li, Xin Lu, Quanjun Yin

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出PanicCognitivePath框架,通过心理安全距离模型融合多域信号,引入显式情绪节点构建BDEI认知路径,将LLM限制于单步参数估计,实现恐慌情绪唤醒时间预测,准确率提升10.68%。

详情
AI中文摘要

在情绪显现前预测个体恐慌情绪唤醒时间对于主动应急干预至关重要。现有方法融合了认知元素,但均未显式建模情绪唤醒过程,因此不适用于情绪唤醒时间预测。我们认为,基于评价情绪理论进行预测是必要的,因为该理论显式建模了这一过程,但必须解决三个问题:(1) 评价理论认为情绪源于对多个威胁维度的同时评估,但尚无工作将这些输入融合为风险感知;(2) 现有认知模型缺乏情绪节点,将威胁评价与情绪唤醒解耦,迫使情绪从行为中间接推断;(3) 鉴于其可泛化的认知推理能力,当前方法采用LLM作为主要决策者,却忽视了其输出的脆弱性和易幻觉性。为解决这些问题,我们提出了PanicCognitivePath (PCP)框架,该框架同时解决了上述三个问题。基于心理距离理论的心理安全距离(PSD)模型将四域信号映射为统一的风险度量,作为后续认知推理的入口条件。在BDI中引入基于评价情绪理论的显式情绪节点,形成信念-欲望-情绪-意图(BDEI)路径。风险度量超过PSD阈值的智能体进入该路径,将威胁评价直接与情绪唤醒耦合。BDEI路径控制所有状态转换,而LLM被限制于信念到欲望转换的参数估计,将幻觉限制在单一步骤内并防止错误传播。在飓风桑迪上的实验表明,PCP将唤醒时间准确率较基线提升10.68%,峰值计数误差降至7.07%。

英文摘要

Predicting individual panic emotional arousal timing before manifestation is essential for proactive emergency intervention. Existing methods incorporate cognitive elements but none explicitly model the emotional arousal process, making them ill-suited for emotional arousal timing prediction. We argue that grounding prediction in appraisal emotion theory is necessary because it explicitly models this process, but three problems must be solved. (1) Appraisal theory posits that emotion arises from simultaneous evaluation across multiple threat dimensions, yet no prior work fuses these inputs into risk perception. (2) Existing cognitive models lack an Emotion node, decoupling threat appraisal from emotional arousal and forcing emotions to be inferred indirectly from behaviors. (3) Given their generalizable cognitive reasoning, current approaches adopt LLMs as the primary decision-maker, yet overlook the fragility and hallucination-proneness of their outputs. To address these issues, we introduce PanicCognitivePath (PCP), a framework that addresses all three. A Psychological Safety Distance (PSD) model, grounded in psychological distance theory, maps four-domain signals into a unified risk metric as the entry condition for subsequent cognitive reasoning. An explicit Emotion node grounded in appraisal emotion theory is introduced into BDI, forming a Belief-Desire-Emotion-Intention (BDEI) pathway. Agents whose risk metric exceeds the PSD threshold enter this pathway, coupling threat appraisal directly to emotional arousal. The BDEI pathway governs all state transitions while the LLM is confined to parameter estimation for the Belief-to-Desire transition, confining hallucinations to a single step and preventing error propagation. Experiments on Hurricane Sandy show that PCP improves arousal timing accuracy by 10.68% over baselines and reduces peak count error to 7.07%.

2606.14990 2026-06-17 cs.LG cs.AI 新提交

Rational Sparse Autoencoder

有理稀疏自编码器

Naiyu Yin, Yue Yu

发表机构 * Lehigh University(里海大学)

AI总结 提出有理稀疏自编码器(RSAE),用可训练有理函数替代固定编码器激活,通过两阶段流程(初始化+微调)在多种语言模型和基线激活族上提升重构与下游行为指标,不牺牲特征可解释性。

Comments Accepted to the Mechanistic Interpretability Workshop at ICML 2026

详情
AI中文摘要

稀疏自编码器(SAE)是机械可解释性的标准工具,但当前的SAE系列受限于固定的编码器非线性,如ReLU、JumpReLU和TopK。这会将特定的稀疏机制硬编码到模型中,并可能扭曲重构与稀疏性的权衡。我们引入了有理稀疏自编码器(RSAE),它将固定的编码器激活替换为可训练的有理函数。有理激活足够灵活,可以在紧致域上一致逼近现有SAE系列使用的激活原语(对于TopK,提供分离top-k阈值后获得的阈值门),同时提供更丰富的函数类以适应观察到的预激活几何形状。我们通过两阶段流程实现这一想法:初始化过程复制预训练的基线SAE权重,插入通过在合成数据上使用松弛Remez交换获得的有理系数,并随有理系数一起校准尺度参数;然后在标准稀疏正则化重构目标下进行微调步骤。实验上,在三个开源权重语言模型的残差流激活上,以及所有三个基线激活族中,RSAE在微调步骤后严格改进,无论是在重构侧指标还是在下游行为指标上,且不牺牲稀疏探测下的特征级可解释性。这些增益在宿主语言模型、基线激活族以及我们测试的完整基线稀疏范围内一致,而升级本身每个自编码器仅增加少量标量参数,并在单个消费级GPU上运行几分钟。

英文摘要

Sparse autoencoders (SAEs) are standard tools for mechanistic interpretability, but current SAE families are constrained by fixed encoder nonlinearities such as ReLU, JumpReLU, and TopK. This hard-codes a particular sparsity mechanism into the model and can distort the reconstruction-versus-sparsity trade-off. We introduce the Rational Sparse Autoencoder (RSAE), which replaces the fixed encoder activation with a trainable rational function. Rational activations are flexible enough to uniformly approximate the activation primitives used by existing SAE families on compact domains (for TopK, the thresholded gate obtained after a separating top-k threshold is supplied), while also providing a richer function class for adapting to the observed pre-activation geometry. We realise this idea through a two-stage pipeline: an initialisation procedure that copies the pre-trained baseline SAE weights, plugs in rational coefficients obtained by the relaxed Remez exchange on synthetic data, and calibrates the scale parameters along with the rational coefficients; followed by a fine-tuning step under the standard sparsity-regularised reconstruction objective. Empirically, on residual-stream activations of three open-weight language models and across all three baseline activation families, the RSAE strictly improves on the baseline SAE after the fine-tuning step, both on reconstruction-side metrics and on downstream-behaviour metrics, without sacrificing feature-level interpretability under sparse probing. These gains are consistent across host language models, across baseline activation families, and across the full range of baseline sparsity we tested, while the upgrade itself adds only a handful of scalar parameters per autoencoder and runs in minutes on a single consumer GPU.
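
As a rough illustration of what a rational encoder activation can look like, the sketch below evaluates a ratio of polynomials with a pole-free denominator and applies it element-wise after an affine encoder map. The "safe" form $P(x)/(1+|Q(x)|)$, the coefficient layout, and the names `rational_activation` and `encode` are assumptions for illustration; the paper's exact parameterisation, its Remez-based initialisation, and its scale calibration are not reproduced here.

```python
def rational_activation(x, p, q):
    """Evaluate P(x) / (1 + |Q(x)|) for a scalar x.

    p = [p0, p1, ...] holds numerator coefficients, lowest degree first.
    q = [q1, q2, ...] holds denominator coefficients for x, x^2, ...
    The absolute value keeps the denominator >= 1, so there are no poles.
    """
    num = sum(c * x ** i for i, c in enumerate(p))
    den = 1.0 + abs(sum(c * x ** (i + 1) for i, c in enumerate(q)))
    return num / den

def encode(x, weights, bias, p, q):
    """SAE-style encoder: affine map followed by the rational activation."""
    pre = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
    return [rational_activation(z, p, q) for z in pre]
```

In a trainable version the lists `p` and `q` would be learned parameters, which is consistent with the abstract's remark that the upgrade adds only a handful of scalars per autoencoder.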

2606.14782 2026-06-17 cs.CV cs.CL 新提交

Last But Not Least: Boundary Attention CalibratiON for Multimodal KV Cache Compression

最后但同样重要:用于多模态KV缓存压缩的边界注意力校准

Tianhao Chen, Yuheng Wu, Kelu Yao, Xiaogang Xu, Xiaobin Hu, Dongman Lee

发表机构 * KAIST(韩国科学技术院) Zhejiang Laboratory(之江实验室) The Chinese University of Hong Kong(香港中文大学) National University of Singapore(新加坡国立大学)

AI总结 针对多模态大语言模型长视觉上下文中KV缓存压缩导致关键证据丢失的问题,提出BACON方法,通过校准观察窗口注意力与最后查询注意力,并利用层内一致性和层间持久性抑制噪声,在激进压缩下平均提升7.5%性能。

详情
AI中文摘要

多模态大语言模型(MLLMs)实现了强大的视觉-语言推理,但长视觉上下文会扩大KV缓存并增加解码延迟。现有的压缩方法依赖观察窗口注意力进行稳定的token重要性估计,然而这种聚合可能稀释稀疏的视觉证据,并在激进压缩下丢弃答案关键token。因此,我们识别出最后查询注意力作为恢复此类证据的补充来源,但其与答案无关的信号可能误导保留。我们提出BACON,一种即插即用方法,通过最后查询证据校准观察窗口注意力,并通过层内一致性和层间持久性抑制孤立噪声。在多种基准、模型、预算和压缩方法下,BACON在最激进的预算下平均提升多模态KV压缩7.5%,最高提升达30.9%。

英文摘要

Multimodal Large Language Models (MLLMs) achieve strong vision-language reasoning, but long visual contexts enlarge the KV cache and increase decoding latency. Existing compression methods rely on observation window attention for stable token-importance estimation, yet this aggregation can dilute sparse visual evidence and discard answer-critical tokens under aggressive compression. Therefore, we identify last-query attention as a complementary source for recovering such evidence, but its answer-irrelevant signals can mislead retention. We propose BACON, a plug-and-play method that calibrates observation window attention with last-query evidence and suppresses isolated noise via intra-layer coherence and inter-layer persistence. Across diverse benchmarks, models, budgets, and compression methods, BACON improves multimodal KV compression by 7.5% on average under the most aggressive budget, with gains up to 30.9%. Our project page is available at https://ryu1ion.github.io/official_BACON/

2606.14668 2026-06-17 cs.LG 新提交

When to Write and When to Suppress: Route-Specialized Dual Adapters for Memory-Assisted Knowledge Editing

何时写入与何时抑制:面向记忆辅助知识编辑的路径专用双适配器

Baijia Zhang, Yining Huang

AI总结 提出路径专用双适配器编辑器,通过相关性路由器决定是否应用编辑记忆,分别训练编辑适配器和局部性适配器,在三个基准上取得最佳概率偏好准确率。

详情
AI中文摘要

知识编辑系统必须更新选定的事实,同时保持邻近但无关的行为不变。本文在记忆辅助设置中研究该问题,其中在推理时检索编辑记忆,参数高效适配器校正模型的对象偏好。我们认为核心设计问题不仅是如何写入编辑,还包括何时抑制它。我们引入一种路径专用双适配器编辑器。相关性路由器首先决定提示是否应接收编辑记忆。被路由的提示使用训练为偏好新对象而非原始对象的编辑适配器;未被路由的非直接提示使用单独的局部性适配器,该适配器训练为保留或恢复原始对象偏好。我们在三个1,000案例协议CounterFact、ZsRE和MQuAKE上,在相同记忆协议和两个7B/8B基础模型下评估该编辑器。在Llama-3.1-8B-Instruct上,该编辑器在所有三个基准上获得最佳总体概率偏好准确率:CounterFact为0.8180,ZsRE为0.8946,MQuAKE为0.9922。在Qwen3-8B上趋势相同。路由器消融实验表明,相关记忆边界因数据集而异:在CounterFact上,词汇神经路由器最安全;而在ZsRE和MQuAKE上,BGE嵌入路由效果更好。组件和模块消融实验表明,增益主要来自将编辑注入与离路抑制分离,而非单纯增加LoRA容量。

英文摘要

Knowledge editing systems must update selected facts while preserving nearby but irrelevant behavior. This paper studies this problem in a memory-assisted setting where an edit memory is retrieved at inference time and a parameter-efficient adapter corrects the model's object preference. We argue that the central design question is not only how to write an edit, but also when to suppress it. We introduce a route-specialized dual-adapter editor. A relevance router first decides whether a prompt should receive an edit memory. Routed prompts use an edit adapter trained to prefer the new object over the original object; unrouted non-direct prompts use a separate locality adapter trained to preserve or restore the original-object preference. We evaluate the editor on three 1,000-case protocols, CounterFact, ZsRE, and MQuAKE, under the same memory protocol and two 7B/8B base models. On Llama-3.1-8B-Instruct, the editor obtains the best overall probability-preference accuracy on all three benchmarks: 0.8180 on CounterFact, 0.8946 on ZsRE, and 0.9922 on MQuAKE. The same trend holds on Qwen3-8B. Router ablations show that the relevant memory boundary differs across datasets: a lexical neural router is safest on CounterFact, while BGE embedding routing is better on ZsRE and MQuAKE. Component and module ablations show that the gain mainly comes from separating edit injection from off-route suppression rather than from simply increasing LoRA capacity.

2606.14551 2026-06-17 cs.RO cs.AI 新提交

TRACE: Trajectory-Routed Causal Memory for Delayed-Evidence Visuomotor Imitation

TRACE: 用于延迟证据视觉运动模仿的轨迹路由因果记忆

Zihao Li, Ranpeng Qiu, Yincong Chen, Guoqiang Ren, Weiming Zhi

发表机构 * Zeno AI Zhejiang University(浙江大学) Zhejiang University of Technology(浙江工业大学) The University of Sydney(悉尼大学)

AI总结 针对视觉运动模仿中早期线索消失导致观察歧义的问题,提出TRACE记忆框架,利用路径签名存储和检索任务相关证据,在长周期任务中提升分支选择准确率。

详情
AI中文摘要

自主运行的机器人可能需要基于不再可见的证据做出决策。我们研究延迟证据任务,其中早期线索在后续决策点之前消失,因此视觉上相似的观察可能需要不同的动作。在这些设置中,当前观察不足以作为控制的状态。我们引入了轨迹路由因果证据(TRACE),一种用于视觉运动模仿策略的记忆框架。TRACE将任务相关的视觉和机器人状态证据(如物体身份、目标选择或路线依赖状态)存储在固定大小的潜在记忆中,该记忆在长片段中保持有界。TRACE不是通过原始时间或手动提供的任务标签来索引记忆,而是使用路径签名:已执行机器人状态轨迹的紧凑、顺序敏感特征。这些签名不存储视觉线索本身;相反,它们提供了轨迹条件化的键,用于写入和检索线索可见时存储的证据。当机器人后来遇到歧义观察时,策略以TRACE记忆为条件,恢复缺失的上下文并选择正确的分支。TRACE通过轻量级适配器附加到策略上,而不改变策略主干、动作头或模仿目标。在具有视觉歧义分支点的真实世界长时域操作任务中,TRACE在分支选择和任务成功率上优于替代基线,包括短历史记忆和循环记忆。项目页面:https://jeong-zju.github.io/trace

英文摘要

Robots under autonomous operation may require decisions based on evidence that is no longer visible. We study delayed-evidence tasks, where an early cue disappears before a later decision point, so visually similar observations can require different actions. In these settings, the current observation is not a sufficient state for control. We introduce TRAjectory-routed Causal Evidence (TRACE), a memory framework for visuomotor imitation policies. TRACE stores task-relevant visual and robot-state evidence, such as object identity, target choice, or route-dependent state, in a fixed-size latent memory that remains bounded over long episodes. Instead of indexing memory by raw time or manually provided task labels, TRACE uses path signatures: compact, order-sensitive features of the executed robot-state trajectory. These signatures do not store the visual cue itself; rather, they provide trajectory-conditioned keys for writing and retrieving the evidence stored when the cue was visible. When the robot later reaches an ambiguous observation, the policy conditions on TRACE memory to recover the missing context and choose the correct branch. TRACE attaches through lightweight adapters to policies, without changing the policy backbone, action head, or imitation objective. Across real-world long-horizon manipulation tasks with visually ambiguous branch points, TRACE improves branch selection and task success over alternative baselines, including short-history and recurrent memory. Project page: https://jeong-zju.github.io/trace
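
The path signatures mentioned above are a standard object from rough-path theory, and a depth-2 truncation is short enough to sketch. The code below computes it for a piecewise-linear trajectory using Chen's identity. How TRACE truncates, normalises, or projects signatures into memory keys is not stated in the abstract, so the function name `path_signature_depth2` and the choice of depth 2 are illustrative assumptions only.

```python
def path_signature_depth2(path):
    """Depth-2 signature of a piecewise-linear path given as a list of points.

    Returns (level1, level2): level1[i] is the total increment in coordinate i,
    level2[i][j] is the iterated integral of dx_i then dx_j along the path.
    """
    d = len(path[0])
    level1 = [0.0] * d
    level2 = [[0.0] * d for _ in range(d)]
    for prev, curr in zip(path, path[1:]):
        dx = [c - p for p, c in zip(prev, curr)]
        # Chen's identity for appending one linear segment:
        # S2 <- S2 + S1 (outer) dx + 0.5 * dx (outer) dx
        for i in range(d):
            for j in range(d):
                level2[i][j] += level1[i] * dx[j] + 0.5 * dx[i] * dx[j]
        level1 = [s + step for s, step in zip(level1, dx)]
    return level1, level2
```

Two trajectories that start and end at the same points but traverse the axes in a different order share the same level-1 term and differ in the off-diagonal level-2 terms, which is the order sensitivity the abstract relies on for trajectory-conditioned keys.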

2606.14438 2026-06-17 cs.RO cs.AI 新提交

CADET: Physics-Grounded Causal Auditing and Training-Free Deconfounding of End-to-End Driving Planners

CADET: 基于物理的因果审计与无训练去混杂的端到端驾驶规划器

Zikun Guo

发表机构 * School of Electronics Engineering, Kyungpook National University(庆北国立大学电子工程学院)

AI总结 提出CADET框架,无需重新训练即可审计和修复预训练端到端驾驶规划器中的虚假关联,通过物理因果图识别混杂因素并干预测试时输入。

Comments 8 pages, 4 figures

详情
AI中文摘要

通过模仿学习训练的端到端自动驾驶规划器容易产生统计捷径:它们将仅与专家动作共现的场景元素(如路边物体、建筑立面)与驾驶决策关联,而非因果决定驾驶的变量。这种因果混淆在长尾场景中悄然损害可靠性,且难以检测,因为常见的开环指标(L2位移和碰撞率)受自车状态主导,无法指示规划器是否依赖虚假线索。现有的基于因果干预训练的修复方法需要重新训练大型模型,且无法审计已部署的规划器。我们提出CADET,一个无需训练的框架,可以在不更新任何参数的情况下审计、基准测试和修复预训练端到端规划器中的虚假依赖。

英文摘要

End-to-end (E2E) autonomous-driving planners trained by imitation are prone to statistical shortcuts: they associate scene elements that merely co-occur with expert actions (a roadside object, a building facade) with driving decisions, rather than the variables that causally determine them. Such causal confusion silently compromises reliability in long-tail scenarios, and it is difficult to detect, because prevailing open-loop metrics (L2 displacement and collision rate) are dominated by ego status and do not indicate whether a planner depends on spurious cues. Existing remedies based on causal-intervention training require retraining large models and cannot audit a planner that is already deployed. We present CADET, a training-free framework that audits, benchmarks, and repairs spurious reliance in pretrained E2E planners without any parameter update.

2606.14383 2026-06-17 cs.CV 新提交

IndustryBench-MIPU: Benchmarking Multi-Image Attribute Value Extraction for Industrial Products

IndustryBench-MIPU:面向工业产品的多图像属性值提取基准

Haonan Qi, Jin Cao, Yongqi Zhang, Xintong Wang, Weidong Tang, Bin Chen, Chengfu Huo, Haojun Pan, Hengyu You, Jing Li, Yingde Wang, Liang Ding

发表机构 * Multimodal and Industrial AI Team(多模态与工业AI团队) Taobao&Tmall, Alibaba Group(淘宝&天猫,阿里巴巴集团)

AI总结 提出首个多图像工业产品理解基准IndustryBench-MIPU,通过结构化属性提取任务评估多模态大模型在规格表、铭牌、技术图纸上的文本识别、视觉推理、领域知识和跨图像证据整合能力,发现多图像完整性是核心瓶颈。

详情
AI中文摘要

工业产品(如阀门和断路器)由密集的技术规格定义,这些规格支配着供应链中的采购、兼容性和安全性。这些规格分散在多个异构的产品图像中,包括规格表、铭牌和技术图纸,然而多模态大语言模型(MLLMs)能否可靠地恢复它们仍未被充分探索。为填补这一空白,我们引入了IndustryBench-MIPU,这是首个用于多图像工业产品理解的大规模基准,围绕结构化属性提取构建——从产品图像中恢复属性-值对。该任务共同探究了规格表和铭牌上的文本识别、技术图纸上的视觉推理、解码工业术语的领域知识,以及跨图像证据整合以组装分散的规格。具体而言,该基准包含来自27,652张图像的4,559个产品,具有跨越18个工业类别的103,703个标注,通过多模型共识和三层质量保证构建。在单图像和产品级多图像设置下评估九个MLLMs,揭示了一个显著的完整性差距:模型实现了高精度(86-94%),但最佳模型仅恢复了49.9%的产品级属性;从单图像到多图像提取,召回率下降了15-34个百分点。多图像完整性,而非单图像准确性,是核心瓶颈。数据集和代码已公开。

英文摘要

Industrial products such as valves and circuit breakers are defined by dense technical specifications that govern procurement, compatibility, and safety across supply chains. These specifications are scattered across multiple heterogeneous product images, including specification tables, nameplates, and technical drawings, yet whether Multimodal Large Language Models (MLLMs) can reliably recover them remains underexplored. To fill this gap, we introduce IndustryBench-MIPU, the first large-scale benchmark for multi-image industrial product understanding, built around structured attribute extraction -- recovering property-value pairs from product images. This task jointly probes text recognition on specification tables and nameplates, visual reasoning over technical drawings, domain knowledge to decode industrial terminology, and cross-image evidence integration to assemble scattered specifications. Concretely, the benchmark comprises 4,559 products across 27,652 images with 103,703 annotations spanning 18 industrial categories, constructed through multi-model consensus and three-tier quality assurance. Evaluating nine MLLMs under both single-image and product-level multi-image settings reveals a stark completeness gap: models achieve high precision (86--94%) but the best recovers only 49.9% of product-level attributes; moving from single-image to multi-image extraction costs 15--34 percentage points of recall. Multi-image completeness, not single-image accuracy, is the core bottleneck. Dataset and code are publicly available.

2606.14187 2026-06-17 cs.LG New submission

Zeta: Dual Whitening for Matrix Optimization via Coordinate-Adaptive Preconditioning

Zeta: Dual Whitening for Matrix Optimization via Coordinate-Adaptive Preconditioning

Kaiwen Chen, Shuhai Zhang, Zimo Liu, Linxiao Li, Ying Sun, Yuchen Li, Yifan Zhang, Bo Han, Mingkui Tan, Qiuwu Chen

Affiliations * South China University of Technology; AIGCode; Hong Kong Baptist University

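AI summary To address coordinate-scale heterogeneity in matrix optimization, the paper proposes Zeta, a dual whitening optimizer that applies coordinate whitening and then spectral whitening in a strict order to reduce orthogonalization error, improving convergence speed and generalization on language modeling and vision tasks.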

Details
AI Chinese abstract

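Large-scale neural network training increasingly relies on matrix-aware optimizers, which exploit the structure of weight parameters and go beyond element-wise adaptation. However, existing matrix-aware methods such as Muon have an underappreciated vulnerability: their core operation, Newton-Schulz iteration, depends heavily on input conditioning, while the raw momentum matrices exhibit severe coordinate-wise scale heterogeneity. This paper first verifies this scale heterogeneity with a chi-square uniformity test, showing that intra-matrix scale imbalance is prevalent across Transformer layers and that coordinate whitening corrects it effectively. Motivated by this finding, we propose Zeta, a dual whitening optimizer that applies coordinate whitening and spectral whitening in a strictly ordered pipeline. The ordering is not a tunable choice but follows from a mathematical dependency: coordinate whitening establishes the statistical isotropy that spectral whitening needs to operate reliably. We further prove that, by improving the condition number of the input, this dual pipeline strictly reduces orthogonalization error relative to purely spectral methods. Empirically, Zeta matches or surpasses strong baselines on language modeling (0.6B to 8B parameters), mixture-of-experts architectures, and vision tasks, showing that resolving scale imbalance before orthogonalization yields faster convergence and better generalization. Code is available at this https URL.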

English abstract

Large-scale neural network training increasingly relies on matrix-aware optimizers that exploit the structure of weight parameters beyond element-wise adaptation. However, existing matrix-aware methods such as Muon have an underappreciated vulnerability: their core operation, Newton-Schulz iteration, depends critically on input conditioning, yet the raw momentum matrices exhibit severe coordinate-wise scale heterogeneity. In this paper, we first verify this scale heterogeneity through a chi-square uniformity test, showing that intra-matrix scale imbalance is prevalent across Transformer layers and that coordinate whitening effectively corrects it. Motivated by this finding, we propose Zeta, a dual whitening optimizer that applies coordinate whitening and spectral whitening in a strictly ordered pipeline. The ordering is not a tunable choice but follows from a mathematical dependency: coordinate whitening establishes the statistical isotropy that spectral whitening requires to function reliably. We further prove that this dual pipeline strictly reduces orthogonalization error relative to pure spectral methods by improving the condition number of the input. Empirically, Zeta matches or surpasses strong baselines across language modeling (0.6B to 8B parameters), mixture-of-experts architectures, and vision tasks, demonstrating that resolving scale imbalance before orthogonalization leads to faster convergence and better generalization. Code is available at https://github.com/AIGCodeOS/aigcode_zeta_optimizer.
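The ordering argument (fix per-coordinate scale first, orthogonalize second) can be sketched on a toy matrix. This is not the authors' implementation. The abstract does not specify the coordinate whitening operator, so row-wise RMS normalization is used here purely as an assumed stand-in. The classic cubic Newton-Schulz iteration is used in place of whatever variant Zeta or Muon actually runs. All function names and the toy data are hypothetical.

```python
import numpy as np


def coordinate_whiten(m: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Assumed stand-in for coordinate whitening: rescale each row to unit RMS."""
    rms = np.sqrt(np.mean(m * m, axis=1, keepdims=True))
    return m / (rms + eps)


def newton_schulz(m: np.ndarray, steps: int = 8) -> np.ndarray:
    """Classic cubic Newton-Schulz iteration toward the orthogonal polar factor."""
    # Dividing by the Frobenius norm bounds every singular value by 1,
    # which keeps the cubic iteration inside its convergence region.
    x = m / (np.linalg.norm(m) + 1e-12)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    return x


def orthogonality_error(x: np.ndarray) -> float:
    """Frobenius distance between X X^T and the identity."""
    return float(np.linalg.norm(x @ x.T - np.eye(x.shape[0])))


rng = np.random.default_rng(0)
row_scales = np.logspace(-3, 0, 8)[:, None]           # hypothetical scale imbalance
momentum = row_scales * rng.standard_normal((8, 32))  # toy "momentum" matrix

whitened = coordinate_whiten(momentum)
cond_raw = np.linalg.cond(momentum)
cond_whitened = np.linalg.cond(whitened)
err_raw = orthogonality_error(newton_schulz(momentum))
err_dual = orthogonality_error(newton_schulz(whitened))

print(f"condition number: raw={cond_raw:.1f} whitened={cond_whitened:.1f}")
print(f"orthogonality error after 8 steps: raw={err_raw:.3f} dual={err_dual:.3f}")
```

With a fixed iteration budget, the badly scaled input leaves its smallest singular directions far from 1, while the rescaled input is close to orthogonal. This matches the abstract's claim that improving the input's condition number reduces orthogonalization error, though only for this toy setup.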

2606.14096 2026-06-17 cs.CV New submission

A New Multi-Domain Benchmark for Micro-Action Recognition and Detection

A New Multi-Domain Benchmark for Micro-Action Recognition and Detection

Yanbin Hao, Pengyu Liu, Xing Wei, Xun Yang, Dan Guo, Meng Wang

Affiliations * School of Computer Science and Information Engineering, Hefei University of Technology; School of Information Science and Technology, University of Science and Technology of China

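AI summary Introduces MMA-82, a large-scale multi-domain micro-action benchmark that expands to 82 categories and 4 domains and covers recognition and multi-label detection tasks; experiments show that existing methods still struggle under domain shift, long-tailed distributions, and similar conditions.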

Comments 10 pages, 9 figures

Details
AI Chinese abstract

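Micro-actions are subtle whole-body movements of short duration and low amplitude that can reveal latent intentions, involuntary reactions, and fine-grained affective changes. Our earlier MA-52 benchmark provided an important foundation for micro-action recognition, but it remains limited in scale, scene diversity, task coverage, and evaluation protocols. To move micro-action analysis toward more realistic and comprehensive settings, we introduce MMA-82, a large-scale multi-domain extension of MA-52. MMA-82 expands the label space from 52 to 82 fine-grained micro-action categories and covers four distinct domains: laboratory interviews, street interviews, psychiatric patient interviews, and emotion-rich television videos, yielding 77,856 annotated instances from 454 subjects. On top of MMA-82 we establish two core tasks: micro-action recognition and multi-label micro-action detection. For recognition, we further define in-domain and cross-domain protocols, including few-shot and zero-shot settings, to evaluate model robustness, transferability, and generalization. Extensive experiments show that current methods still struggle with realistic micro-action understanding, especially under domain shift, long-tailed category distributions, and complex temporal localization. Beyond benchmarking, we study the relationship between micro-actions and emotion, showing that micro-actions are closely associated with emotional states and provide cues complementary to facial micro-expressions for improved emotion recognition. These results indicate that MMA-82 is a comprehensive and challenging benchmark for realistic micro-action analysis and a valuable resource for human-centered AI. MMA-82 is available at https://xxx.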

English abstract

Micro-actions are short-duration, low-amplitude subtle body movements at the whole-body level that can reveal latent intentions, involuntary reactions, and fine-grained affective changes. Our previous MA-52 benchmark has provided an important foundation for micro-action recognition, but it remains limited in scale, scene diversity, task coverage, and evaluation protocols. To advance micro-action analysis toward more realistic and comprehensive settings, we introduce MMA-82, a large-scale multi-domain extension of MA-52. MMA-82 expands the label space from 52 to 82 fine-grained micro-action categories and covers four distinct domains, including laboratory interviews, street interviews, psychiatric patient interviews, and emotion-rich television videos, resulting in 77,856 annotated instances from 454 subjects. Built upon MMA-82, we establish two core tasks: Micro-Action Recognition and Multi-label Micro-Action Detection. For recognition, we further define in-domain and cross-domain protocols, including few-shot and zero-shot settings, to evaluate model robustness, transferability, and generalization. Extensive experiments show that current methods still struggle with realistic micro-action understanding, especially under domain shift, long-tailed category distributions, and complex temporal localization. Beyond benchmarking, we investigate the relationship between micro-actions and emotion, showing that micro-actions are strongly associated with emotional states and provide complementary cues to facial micro-expressions for improved emotion recognition. These results demonstrate that MMA-82 serves as a comprehensive and challenging benchmark for realistic micro-action analysis and a valuable resource for human-centered AI. MMA-82 is available at https://lpynow.github.io/MMA-82-AIM/.

2606.14081 2026-06-17 cs.CV cs.AI cs.LG eess.IV New submission

Clay-CNN Hybrids: Leveraging Geospatial Foundation Models as Auxiliary Context for Landslide Detection

Clay-CNN Hybrids: Leveraging Geospatial Foundation Models as Auxiliary Context for Landslide Detection

Huong Binh Vu

Affiliations * Harvard University

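AI summary To address extreme class imbalance in landslide detection, proposes a hybrid approach that injects the geospatial foundation model Clay v1.5 into the U-Net bottleneck as auxiliary context, reaching 64.5% F1 on the Landslide4Sense benchmark and outperforming the Clay-only and U-Net baselines.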

Details
AI Chinese abstract

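Rapid post-event landslide mapping is essential for disaster response, but automation remains difficult because of extreme class imbalance. This study evaluates whether Clay v1.5, a geospatial foundation model (GFM), can improve pixel-level landslide segmentation on the Landslide4Sense (L4S) benchmark, which contains 3,799 training chips with 14 Sentinel-2 and terrain bands and about 2% positive pixels. We compare three strategies: Clay as the primary encoder with multi-scale residual terrain fusion, a U-Net backbone with Clay semantic context injected at the bottleneck, and a standard U-Net baseline. The hybrid U-Net + Clay model with two-stage low-rank adaptation (LoRA) reached the best test F1 of 64.5±1.8% over three random seeds, surpassing the Clay-only backbone (55.2±3.6%) and the U-Net baseline (59.9%). Lacking multi-scale skip connections, Clay as a standalone encoder underperformed the U-Net, but its pretrained representations consistently improved performance when injected as auxiliary context. These findings suggest that GFMs are most effective for landslide detection when they complement spatially detailed convolutional architectures rather than replace them.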

English abstract

Rapid post-event landslide mapping is essential for disaster response but remains difficult to automate due to extreme class imbalance. This study evaluates whether Clay v1.5, a Geospatial Foundation Model (GFM), can improve pixel-level landslide segmentation on the Landslide4Sense (L4S) benchmark, which contains 3,799 training chips with 14 Sentinel-2 and terrain bands and approximately 2% positive pixels. We compare three strategies: Clay as the primary encoder with multi-scale residual terrain fusion, a U-Net backbone augmented with Clay semantic context at the bottleneck, and a standard U-Net baseline. The hybrid U-Net + Clay model with two-stage Low-Rank Adaptation (LoRA) achieved the best test F1 of 64.5 ± 1.8% over three seeds, surpassing the Clay-only backbone (55.2 ± 3.6%) and the U-Net baseline (59.9%). Clay as a standalone encoder underperformed the U-Net due to the absence of multi-scale skip connections, but its pretrained representations consistently improved performance when injected as auxiliary context. These findings suggest that GFMs are most effective for landslide detection when they complement spatially detailed convolutional architectures rather than replace them.
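F1 is the natural metric under roughly 2% positive pixels because plain accuracy barely reacts to missed landslides. The sketch below shows this on a toy mask. The masks and function name are hypothetical and have no connection to the Landslide4Sense data or the paper's evaluation code.

```python
def f1_and_accuracy(pred: list, truth: list) -> tuple:
    """Pixel-level F1 (positive class) and accuracy for flat binary masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    accuracy = (tp + tn) / len(truth)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return f1, accuracy


# Toy mask with 2 positive pixels out of 100 (about 2%, as in the abstract).
truth = [1, 1] + [0] * 98

# Predicting "no landslide" everywhere.
all_background = [0] * 100
# One true positive, one missed pixel, one false alarm.
one_hit = [1, 0, 1] + [0] * 97

f1_bg, acc_bg = f1_and_accuracy(all_background, truth)
f1_hit, acc_hit = f1_and_accuracy(one_hit, truth)
print(f"all background: F1={f1_bg:.2f} accuracy={acc_bg:.2f}")
print(f"one hit:        F1={f1_hit:.2f} accuracy={acc_hit:.2f}")
```

Both predictions score the same accuracy, but only F1 separates the model that finds a landslide pixel from the one that finds none.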