arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.02170 2026-06-02 cs.CL

CRAFTQA: A Code-Driven Adaptive Framework for Complex Structured Data Reasoning

CRAFTQA: 一种用于复杂结构化数据推理的代码驱动自适应框架

Chengtao Gan, Zhiqiang Liu, Long Jin, Yushan Zhu, Lei Liang, Wen Zhang

Affiliations: Zhejiang University; Ant Group; JIUTIAN Research; ZJU-Ant Group Joint Lab of Knowledge Graph

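AI summary: Proposes the CRAFTQA framework, in which the CodeSTEP module generates a step-by-step code reasoning sequence and the CRAFT module dynamically generates custom code functions, lifting the restriction to a predefined function set and clearly outperforming existing unified methods on complex structured-data reasoning tasks.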

Comments Accepted by Findings of ACL 2026

AI中文摘要

现实场景涉及大量异构结构化数据(例如表格、知识图谱),因此对这些多样化数据进行有效推理变得越来越重要。统一结构化数据问答已成为一个突出的研究趋势,旨在单一框架内回答跨不同结构化数据类型的自然语言问题。然而,现有的统一方法有一个共同的局限性:它们依赖于一组预定义函数,这限制了它们在超出这些预定义操作之外执行复杂推理的能力。为了克服这一根本性限制,我们提出了CRAFTQA,一种新颖的自适应代码驱动框架,包含两个核心模块:CodeSTEP和CRAFT。CodeSTEP模块是一种范式,它生成一个完整的可执行Python代码序列,其中包含基于问题的逐步代码推理操作。CRAFT模块为超出预定义函数集的操作动态生成自定义代码函数,并与CodeSTEP无缝集成,显著增强了处理复杂推理的灵活性。在多个结构化数据集上的综合实验表明,与现有的统一方法相比,CRAFTQA在复杂推理场景中取得了显著的改进。

英文摘要

Real-world scenarios involve massive heterogeneous structured data (e.g., tables, knowledge graphs), making effective reasoning over such diverse data increasingly important. Unified structured data question answering has emerged as a prominent research trend, aiming to answer natural language questions across different structured data types within a single framework. However, existing unified methods share a common limitation: they rely on a set of predefined functions, which restricts their ability to perform complex reasoning beyond these predefined operations. To overcome this fundamental limitation, we propose CRAFTQA, a novel adaptive code-driven framework comprising two core modules, CodeSTEP and CRAFT. The CodeSTEP module is a paradigm that generates a complete executable Python code sequence, which contains step-by-step code-based reasoning operations based on the question. The CRAFT module dynamically generates custom code functions for operations beyond the predefined function set, and seamlessly integrates with CodeSTEP to significantly enhance flexibility in handling complex reasoning. Comprehensive experiments on multiple structured datasets demonstrate that CRAFTQA achieves remarkable improvements in complex reasoning scenarios compared to existing unified methods.
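The split between a predefined function set and a custom function can be illustrated with a minimal sketch. Everything named here (`filter_rows`, `column`, `custom_median`, the toy table) is hypothetical and not taken from the paper; in the framework the custom function would be generated on demand, whereas here it is written by hand.

```python
from statistics import median

# Hypothetical predefined function set (illustrative only).
def filter_rows(table, key, value):
    """Keep the rows whose `key` field equals `value`."""
    return [row for row in table if row[key] == value]

def column(table, key):
    """Extract one column of the table as a list."""
    return [row[key] for row in table]

# An operation outside the predefined set. Assumption: a framework like
# CRAFTQA would generate such a function dynamically; this one is hand-written.
def custom_median(values):
    return median(values)

table = [
    {"team": "A", "score": 10},
    {"team": "A", "score": 30},
    {"team": "B", "score": 50},
    {"team": "A", "score": 20},
]

# Step-by-step code sequence for: "What is the median score of team A?"
step1 = filter_rows(table, "team", "A")
step2 = column(step1, "score")
answer = custom_median(step2)
```

Each step consumes the result of the previous one, so the whole sequence stays executable and inspectable.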

2606.02168 2026-06-02 cs.CV cs.LG

Disentanglement-Based Equivariant Learning for Compositional VQA

基于解耦的等变学习用于组合式VQA

Zhou Du, Zhaoquan Yuan, Xiao Wu, Changsheng Xu

Affiliations: IEEE Publication Technology Group; School of Computing and Artificial Intelligence, Southwest Jiaotong University; Engineering Research Center of Sustainable Urban Intelligence Transportation, Ministry of Education, China; State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences

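AI summary: Proposes the DEAL framework, which disentangles visual and textual concepts through causal interventions and strengthens compositional reasoning with an equivariant constraint, surpassing existing methods on CLEVR-CoGenT and GQA-SGL.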

Comments Accepted by IEEE Transactions on Multimedia

Journal ref
IEEE Trans. Multimedia, vol. 27, pp. 8160-8173, 2025
AI中文摘要

组合式视觉问答(VQA)是一项具有挑战性但基础的任务,要求模型理解先前学习概念的新组合。当前方法往往忽视潜在概念的解耦,并且在有效捕捉组合变化机制方面受到限制。此外,最先进的技术依赖于额外的线索进行训练,这在现实世界的VQA场景中不可行。为了解决这些问题,本文提出了一种新颖的基于解耦的等变学习(DEAL)框架用于组合式VQA,该框架仅由真实答案指导。在DEAL中,我们采用因果启发的干预措施,在重新编码框架内解耦来自视觉和文本输入的概念。基于等变性原理,我们随后对推理输入进行组合变换,并对输出施加等变约束,以增强模型的组合推理能力。在基准数据集CLEVR-CoGenT和GQA-SGL上进行的全面实验验证了我们提出的DEAL方法在视觉和语言泛化设置下均优于现有的最先进方法。

英文摘要

Compositional visual question answering (VQA) represents a challenging yet fundamental task that requires models to comprehend novel combinations of previously learned concepts. The current methods often overlook the disentanglement of underlying concepts and are restricted in terms of their ability to effectively capture the compositional variation mechanism. Moreover, the state-of-the-art techniques depend on additional clues for training, which is not feasible in real-world VQA scenarios. To address these issues, in this paper, we introduce a novel Disentanglement-based EquivAriant Learning (DEAL) framework for compositional VQA, which is guided exclusively by ground-truth answers. In DEAL, we employ causality-inspired interventions to disentangle concepts derived from visual and textual inputs within a re-encoding framework. Based on the principle of equivariance, we subsequently perform a compositional transformation on the inference input and impose the equivariant constraint on the output to augment the compositional reasoning capacity of the model. Comprehensive experiments conducted on the benchmark CLEVR-CoGenT and GQA-SGL datasets validate the superiority of our proposed DEAL approach over the existing state-of-the-art methods for compositional VQA tasks in both visual and linguistic generalization settings.

2606.02167 2026-06-02 cs.AI

From Capability Models to Automated Planning: An AAS-Native Approach for Automatic PDDL Generation

从能力模型到自动规划:一种面向AAS的自动PDDL生成方法

Hamied Nabizada, Thomas Wirt, Luis Miguel Vieira da Silva, Felix Gehlhoff, Alexander Fay

Affiliations: Institute of Automation Technology, Helmut Schmidt University Hamburg; Chair of Automation Technology, Ruhr University Bochum

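AI summary: Proposes a method that automatically generates PDDL planning problems from Asset Administration Shell (AAS) capability models, so that engineers can validate production-system layouts without knowing PDDL syntax.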

Comments Accepted at the 2026 IEEE 22nd International Conference on Automation Science and Engineering (CASE 2026)

AI中文摘要

设计生产系统的工程师需要验证给定的布局是否支持所有必需的生产序列。自动化规划技术可以回答此类问题,但在规划领域定义语言(PDDL)中制定所需的规划问题需要专业知识,而生产工程师通常缺乏这些知识。资产管理外壳(AAS)已成为工业4.0中工业资产的标准化数字孪生。我们展示了使用四个成熟的工业4.0标准(用于过程描述的VDI 3682、用于语义属性限定的IEC 61360-1、用于类型层次结构的IDTA 02011和用于实例描述的IDTA 02016)构建的AAS能力模型包含足够的信息,可以自动生成完整的PDDL问题。与之前引入PDDL特定子模型的工作不同,我们的方法从资源功能的领域级描述(即能力)中推导出所有规划元素,使工程师能够在完全不接触PDDL语法或规划概念的情况下对能力进行建模。我们的提取算法将分布式的多AAS架构转换为完整的PDDL规划问题。我们在实验室生产系统的AAS模型上验证了该方法,通过最优规划比较了四种布局变体,展示了工程师如何通过修改AAS模型并重新生成规划域来系统地探索设计权衡。

英文摘要

Engineers designing production systems need to verify that a given layout supports all required production sequences. Automated planning techniques can answer such questions, but formulating the required planning problems in the Planning Domain Definition Language (PDDL) demands specialized expertise that production engineers typically lack. Asset Administration Shells (AAS) have emerged as the standardized Digital Twin for industrial assets in Industry 4.0. We show that AAS capability models, structured using four established Industry 4.0 standards (VDI 3682 for process descriptions, IEC 61360-1 for semantic property qualification, IDTA 02011 for type hierarchies, and IDTA 02016 for instance descriptions), contain sufficient information to generate complete PDDL problems automatically. Unlike prior work that introduced PDDL-specific submodels, our approach derives all planning elements from domain-level descriptions of resource functions, so-called capabilities, allowing engineers to model capabilities without any exposure to PDDL syntax or planning concepts. Our extraction algorithm transforms distributed Multi-AAS architectures into complete PDDL planning problems. We validate the approach on AAS models of a laboratory production system, comparing four layout variants using optimal planning to demonstrate how engineers can systematically explore design trade-offs by modifying the AAS model and regenerating the planning domain.
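As a rough illustration of turning a capability description into a PDDL problem, the sketch below renders plain Python data as PDDL text. The dictionary layout, the predicate names (`provides`, `raw`, `drilled`, `milled`), and the function `to_pddl_problem` are all hypothetical; they stand in for the AAS submodels and the extraction algorithm, which the abstract does not detail.

```python
def to_pddl_problem(name, domain, objects, init, goal):
    """Render a PDDL problem string.

    objects: dict mapping a type name to a list of object names
    init, goal: lists of ground facts, each a tuple such as ("raw", "part1")
    """
    def fact(terms):
        return "(" + " ".join(terms) + ")"

    lines = [f"(define (problem {name})", f"  (:domain {domain})", "  (:objects"]
    for type_name, names in objects.items():
        lines.append(f"    {' '.join(names)} - {type_name}")
    lines.append("  )")
    lines.append("  (:init")
    lines.extend(f"    {fact(item)}" for item in init)
    lines.append("  )")
    lines.append("  (:goal (and " + " ".join(fact(item) for item in goal) + "))")
    lines.append(")")
    return "\n".join(lines)

# Hypothetical capability data, standing in for what an AAS model might hold.
capabilities = {
    "drill_station": {"provides": "drilling"},
    "mill_station": {"provides": "milling"},
}
objects = {"resource": list(capabilities), "product": ["part1"]}
init = [("provides", res, cap["provides"]) for res, cap in capabilities.items()]
init.append(("raw", "part1"))
goal = [("drilled", "part1"), ("milled", "part1")]

problem = to_pddl_problem("layout_check", "production", objects, init, goal)
```

Changing the `capabilities` dictionary and calling the function again mirrors the workflow of editing the model and regenerating the planning problem.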

2606.02163 2026-06-02 cs.AI

An Abstract Worlds Semantic Framework for Belief Change Operators

信念变化算子的抽象世界语义框架

Daniel Grimaldi, M. Vanina Martinez, Ricardo O. Rodriguez

Affiliations: Departamento de Computación, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires; Instituto de Investigación en Ciencias de la Computación, UBA-CONICET; Artificial Intelligence Research Institute (IIIA-CSIC)

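AI summary: Proposes Abstract Worlds Semantics, a set-theoretic framework free of logical syntax that treats worlds as primitive elements and defines world contraction and world revision operators, giving a unified analysis of belief change models that covers both classical and non-prioritized belief change.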

AI中文摘要

本文提出了一种用于信念变化的集合论框架,称为抽象世界语义,其中不假设任何逻辑语法。受Grove(1988)结果的启发,我们的方法将世界视为原始元素,并在此基础上定义了世界收缩和世界修正算子。该语义框架能够对信念变化模型进行统一分析。在此框架内,我们通过定义通用算子,统一了经典和非优先信念变化构造。当考虑经典命题逻辑时,我们的框架提供了AGM、KM和多重变化模型的同质化描述。总之,AWS系统化了信念变化框架和算子,简化并推广了基于信念集的信念变化理论。

英文摘要

This article proposes a set-theoretic framework for belief change, called Abstract Worlds Semantics, in which no logical syntax is assumed. Inspired by Grove's (1988) results, our approach treats worlds as primitive elements, over which world contraction and world revision operators are defined. This semantic framework enables a unified analysis of belief change models. Within this framework, we unify classical and non-prioritized belief change constructions by defining versatile operators. When classical propositional logic is considered, our framework provides a homogeneous account of AGM, KM, and Multiple Change models. In summary, AWS systematizes belief change frameworks and operators, simplifying and generalizing belief change theory over belief sets.
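To make the idea of operators defined directly over worlds concrete, here is a small sketch of the textbook Grove-style construction, in which a plausibility ranking over worlds plays the role of a system of spheres. It is not the paper's definition of its operators; the ranking representation and the function names are assumptions made for illustration.

```python
def most_plausible(worlds, rank):
    """Return the worlds with the lowest rank (0 = most plausible)."""
    if not worlds:
        return set()
    best = min(rank[w] for w in worlds)
    return {w for w in worlds if rank[w] == best}

def revise(rank, a_worlds):
    """Revision by a proposition, given as the set of worlds where it holds:
    keep the most plausible worlds that satisfy it."""
    return most_plausible(set(rank) & a_worlds, rank)

def contract(rank, a_worlds):
    """Contraction: keep the currently believed worlds and add the most
    plausible worlds in which the proposition fails."""
    all_worlds = set(rank)
    current = most_plausible(all_worlds, rank)
    return current | most_plausible(all_worlds - a_worlds, rank)

# Four abstract worlds with a plausibility ranking; no logical syntax involved.
rank = {"w1": 0, "w2": 1, "w3": 1, "w4": 2}

revised = revise(rank, {"w2", "w4"})        # most plausible of w2, w4
contracted = contract(rank, {"w1", "w2"})   # give up a currently held belief
```

Propositions appear only as sets of worlds, which is what allows the operators to be stated without committing to a particular logic.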

2606.02161 2026-06-02 cs.CV cs.CL

InfoMerge: Information-aware Token Compression for Efficient Video Large Language Models

InfoMerge: 信息感知的令牌压缩用于高效视频大语言模型

Xinxin Liu, Shiwei Gan, Xiao Liu, Yafeng Yin, Lei Xie, Sanglu Lu

Affiliations: State Key Laboratory of Novel Software Technology

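AI summary: Proposes InfoMerge, a training-free visual token compression method that uses robust redundancy estimation and content-aware budget allocation to retain 98.8% of performance while removing 85% of visual tokens, for a 4.24x prefill speedup.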

Comments 15 pages, 8 figures

AI中文摘要

视频大语言模型在视频理解中表现出色,但过多的视觉令牌带来了巨大的计算开销。现有的免训练压缩方法通过减少视觉令牌来提高推理效率,但它们通常依赖局部相邻帧相似性进行时间冗余估计,或主要根据片段长度分配令牌预算。这种设计对帧级噪声敏感,且无法捕捉真实视频的非均匀信息分布。为解决这些挑战,我们提出InfoMerge,一种无需训练的视觉令牌压缩方法,通过鲁棒冗余估计和内容感知预算分配来提高令牌利用率。具体来说,我们提出时间指纹差异:一种片段级二阶时间冗余估计策略,用于建模每个片段内相同空间位置令牌的时间相似性结构。我们进一步引入内容感知预算分配(CABA),根据片段独特性和基于谱熵的表征丰富性动态分配片段级令牌预算。通过减少对冗余静态区域的重复保留,并将更多令牌分配给信息丰富的片段,InfoMerge在保持强大性能的同时更好地利用了有限的令牌预算。大量实验表明,InfoMerge在多个基准和骨干网络上实现了强效的精度-效率权衡,在激进压缩下优势更为明显。在LLaVA-OneVision-7B上,InfoMerge保留了原始平均性能的98.8%,同时减少了85%的视觉令牌,并在预填充阶段实现了4.24倍的加速。

英文摘要

Video Large Language Models (Video-LLMs) achieve strong performance in video understanding, but their excessive visual tokens bring substantial computational overhead. Existing training-free compression methods improve inference efficiency by reducing visual tokens, yet they often rely on local adjacent-frame similarity for temporal redundancy estimation or allocate token budgets mainly according to segment length. Such designs are sensitive to frame-level noise and fail to capture the non-uniform information distribution of real-world videos. To address these challenges, we propose InfoMerge, a training-free visual token compression method that improves token utilization through robust redundancy estimation and content-aware budget allocation. Specifically, we propose the Temporal Fingerprint Difference: a segment-level second-order temporal redundancy estimation strategy, which models the temporal similarity structure of tokens at the same spatial positions within each segment. We further introduce Content-Aware Budget Allocation (CABA), which dynamically allocates segment-level token budgets based on segment uniqueness and spectral-entropy-based representational richness. By reducing repeated preservation of redundant static regions and allocating more tokens to informative segments, InfoMerge makes better use of the limited token budget while maintaining strong performance. Extensive experiments show that InfoMerge achieves strong efficiency-accuracy trade-offs across multiple benchmarks and backbones, with more pronounced advantages under aggressive compression. On LLaVA-OneVision-7B, InfoMerge retains 98.8% of the original average performance while reducing 85% of visual tokens and achieving a 4.24-fold speedup in the prefill stage.
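The budget-allocation step can be sketched as a proportional split of a fixed token budget across segments. The scores are taken as given here (the abstract derives them from segment uniqueness and spectral entropy, which this sketch does not compute), and the function name, the per-segment minimum, and the largest-remainder rounding are assumptions.

```python
def allocate_budget(scores, total_tokens, min_tokens=1):
    """Split `total_tokens` across segments in proportion to `scores`.

    Every segment receives at least `min_tokens`; the rest is shared
    proportionally, with largest-remainder rounding so the total is exact.
    """
    n = len(scores)
    if total_tokens < n * min_tokens:
        raise ValueError("budget too small for the per-segment minimum")
    remaining = total_tokens - n * min_tokens
    total_score = sum(scores)
    if total_score <= 0:
        shares = [remaining / n] * n  # fall back to a uniform split
    else:
        shares = [remaining * s / total_score for s in scores]
    budgets = [min_tokens + int(share) for share in shares]
    leftover = total_tokens - sum(budgets)
    # Hand the leftover tokens to the segments with the largest fractional part.
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets

budgets = allocate_budget([5, 3, 2], total_tokens=13)
uniform = allocate_budget([1, 1, 1], total_tokens=10)
```

Allocating by content score rather than segment length is what lets a short but informative segment keep more tokens than a long static one.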

2606.02158 2026-06-02 cs.CL

On the Salience of Low-Probability Tokens for AI-Generated Text Detection: A Multiscale Uncertainty Perspective

低概率标记对AI生成文本检测的显著性:多尺度不确定性视角

Yikai Guo, Bin Wang, Xilai Fan, Wenjun Ke, Haoran Luo

Affiliations: National University of Singapore; University of Science and Technology of China

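AI summary: To address boilerplate-token dominance and brittle point estimates in AI-generated text detection, proposes Uncertainty, a multiscale uncertainty estimator based on low-probability tokens, and its extension Uncertainty++; experiments across multiple datasets and LLMs show effectiveness, generalization, and robustness.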

Comments Accepted by ICML 2026 main conference

AI中文摘要

AI生成的文本日益与人类写作融合,带来了诸如错误信息、学术滥用和语料库污染等实际风险。虽然统计检测器因高效和泛化能力而具有吸引力,但它们存在两个关键局限性:(i)模板标记主导,人类和LLM写作中共有的模板标记可能压倒判别信号;(ii)脆弱的点估计,依赖单一概率分数在对抗性操纵下产生不稳定的决策。为解决这些问题,我们提出Uncertainty,一种多尺度不确定性估计器,专注于信息丰富的低概率标记,这些标记更清晰地暴露分布差异。局部上,它通过平均低概率标记的对数概率来缓解模板标记主导;全局上,它通过Rényi熵捕捉该低概率区域的分布形状,从而降低脆弱性。我们进一步通过条件独立采样将检测器扩展为Uncertainty++,得到更稳定的不确定性估计。在七个数据集和十六个LLM上的实验证明了其高效性、泛化性和鲁棒性。我们的代码可在https://github.com/guoyikai2000/Uncertainty-AIGT获取。

英文摘要

AI-generated text increasingly blends with human writing, raising practical risks such as misinformation, academic misuse, and corpus contamination. While statistical detectors are appealing for efficiency and generalization, they suffer from two key limitations: (i) boilerplate dominance, where boilerplate tokens shared across human and LLM writing can overwhelm discriminative signals; and (ii) brittle point estimates, where relying on a single probability score yields unstable decisions under adversarial manipulations. To address these issues, we propose Uncertainty, a multiscale uncertainty estimator that focuses on informative low-probability tokens, which more clearly expose distributional discrepancies. Locally, it alleviates boilerplate dominance by averaging the log-probabilities of low-probability tokens; globally, it reduces brittleness by capturing the distributional shape of this low-probability region via Rényi entropy. We further extend the detector to Uncertainty++ via conditional independent sampling, yielding a more stable uncertainty estimation. Experiments across seven datasets and sixteen LLMs demonstrate high effectiveness, generalization, and robustness. Our code is available at https://github.com/guoyikai2000/Uncertainty-AIGT.
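The two scales described above can be sketched in a few lines. This is not the released implementation: how the low-probability region is selected (here a fixed quantile), how it is normalized into a distribution, and the order `alpha` are all assumptions. The Rényi entropy itself follows the standard definition H_alpha(q) = log(sum_i q_i^alpha) / (1 - alpha).

```python
import math

def low_prob_scores(token_probs, quantile=0.25, alpha=2.0):
    """Return (mean log-probability, Rényi entropy) of the low-probability tokens.

    token_probs: probabilities a scoring language model assigned to each
    observed token. `quantile` and `alpha` are illustrative choices.
    """
    ordered = sorted(token_probs)
    k = max(1, int(len(ordered) * quantile))
    low = ordered[:k]

    # Local scale: average log-probability over the low-probability tokens.
    mean_log_prob = sum(math.log(p) for p in low) / k

    # Global scale: shape of the low-probability region. Assumption: the
    # region is renormalized to sum to one before taking the entropy.
    total = sum(low)
    q = [p / total for p in low]
    if alpha == 1.0:
        entropy = -sum(qi * math.log(qi) for qi in q)  # Shannon limit
    else:
        entropy = math.log(sum(qi ** alpha for qi in q)) / (1.0 - alpha)
    return mean_log_prob, entropy

probs = [0.5, 0.5, 0.1, 0.1, 0.9, 0.9, 0.8, 0.8]
mean_lp, renyi = low_prob_scores(probs)
```

High-probability boilerplate tokens never enter either statistic, which is the point of restricting attention to the low-probability region.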

2606.02153 2026-06-02 cs.CV cs.GR

Ultra Diffusion Poser: Diffusion-Based Human Motion Tracking From Sparse Inertial Sensors and Ranging-Based Between-Sensor Distances

超扩散姿态估计器:基于扩散的从稀疏惯性传感器和测距传感器间距离的人体运动跟踪

Dominik Hollidt, Tommaso Bendinelli, Christian Holz

Affiliations: Department of Computer Science, ETH Zurich

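AI summary: Proposes Ultra Diffusion Poser, a diffusion model that explicitly models the geometric constraints of UWB ranging (a Spatial Layout Module analytically reconstructs sensor positions) and adds UWB-Diffusion Guidance to align predicted poses with measured distances during diffusion sampling, reducing joint position error by up to 22%.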

Comments CVPR 2026 - Computer Vision and Pattern Recognition

Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026, pp. 7036-7046
AI中文摘要

使用惯性测量单元(IMU)的方法提供了一种可穿戴的替代基于摄像头的运动捕捉方案。为了减轻惯性信号的漂移,最近的稀疏惯性姿态估计器集成了由超宽带(UWB)测距测量的传感器间距离。到目前为止,UWB距离仅被用作额外的输入特征,忽略了它们对传感器位置施加的物理约束。然而,这些距离也可以用于重建底层3D传感器布局,从而为姿态重建提供更具信息性的输入。我们提出了Ultra Diffusion Poser,一种显式建模这些几何约束的扩散模型。它包括一个空间布局模块,该模块从UWB测量中解析地重建3D传感器位置。这些传感器位置与IMU信号和UWB距离一起作为扩散过程中的条件信号。尽管如此,网络预测可能违反传感器间距离测量。为了解决这个问题,我们引入了UWB扩散引导,它在扩散采样过程中鼓励预测姿态与测量距离之间的对齐。这些贡献共同使我们的模型达到了最先进的性能,将关节位置误差相比先前工作降低了高达22%。

英文摘要

Methods using inertial measurement units (IMUs) provide a wearable alternative to camera-based motion capture. To mitigate drift from inertial signals, recent sparse inertial pose estimators integrate inter-sensor distances measured by ultra-wideband (UWB) ranging. So far, UWB distances have only been used as an additional input feature, ignoring the physical constraints they impose on sensor positions. However, these distances can also be used to reconstruct the underlying 3D sensor layout, which in turn provides more informative input for pose reconstruction. We propose Ultra Diffusion Poser, a diffusion model that explicitly models these geometric constraints. It includes a Spatial Layout Module that analytically reconstructs the 3D sensor positions from UWB measurements. These sensor positions are used alongside IMU signals and UWB distances as a conditioning signal during diffusion. Still, network predictions can violate inter-sensor distance measurements. To address this, we introduce UWB-Diffusion Guidance, which encourages alignment between predicted poses and measured distances during diffusion sampling. Together, these contributions enable our model to achieve state-of-the-art performance, reducing joint position error by up to 22% over prior work.
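The notion of steering predictions toward measured inter-sensor distances can be illustrated with a plain gradient step on a squared distance-mismatch loss. This is a generic sketch, not the paper's guidance term: the loss form, the step size, and the function names are assumptions, and a real sampler would apply such a correction inside each diffusion step.

```python
import math

def distance_loss(positions, measured):
    """Sum of squared differences between predicted and measured distances.

    positions: list of (x, y, z) tuples; measured: dict {(i, j): distance}.
    """
    return sum(
        (math.dist(positions[i], positions[j]) - d) ** 2
        for (i, j), d in measured.items()
    )

def guidance_step(positions, measured, step_size=0.1):
    """Move positions one gradient-descent step along the loss gradient."""
    grads = [[0.0, 0.0, 0.0] for _ in positions]
    for (i, j), d in measured.items():
        diff = [a - b for a, b in zip(positions[i], positions[j])]
        dist = math.sqrt(sum(c * c for c in diff))
        if dist == 0.0:
            continue  # gradient undefined for coincident points; skip
        scale = 2.0 * (dist - d) / dist
        for axis in range(3):
            grads[i][axis] += scale * diff[axis]
            grads[j][axis] -= scale * diff[axis]
    return [
        tuple(p[axis] - step_size * g[axis] for axis in range(3))
        for p, g in zip(positions, grads)
    ]

predicted = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
uwb = {(0, 1): 1.0}  # hypothetical measured distance between sensors 0 and 1
before = distance_loss(predicted, uwb)
after = distance_loss(guidance_step(predicted, uwb), uwb)
```

One step pulls the two points from 2.0 apart to 1.6 apart, so the mismatch with the measured 1.0 shrinks without being forced to zero.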

2606.02151 2026-06-02 cs.AI cs.SY eess.SY

S3TS: Stochastic Scenario-Structured Tree Search for Advanced Planning Under Uncertainty

S3TS:面向不确定性下高级规划的随机情景结构化树搜索

Fabio Pavirani, Bert Claessens, Pierre Pinson, Chris Develder

Affiliations: IDLab, Ghent University - imec; Beebop.ai; Imperial College London

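AI summary: Proposes the Stochastic Scenario-Structured Tree Search (S3TS) algorithm, which represents uncertainty explicitly through scenario trees and integrates non-linear models. On a demand response signal publication problem it performs near-optimally, with costs within 14% of the optimal solution, and in non-linear scenarios it cuts costs by up to 51% and 5.4% relative to a myopic algorithm and deterministic MCTS, respectively.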

AI中文摘要

能源领域的有效调度对于确保电网及其连接资产的可靠运行至关重要,例如通过优化发电机组和储能系统的调度。有效的规划策略必须(a)适应先进且可能非线性的系统模型——利用现代电网日益增长的数据可用性,以及(b)显式处理由可再生能源整合等引起的不确定性。虽然现有方法可以处理非线性(例如蒙特卡洛树搜索)或不确定性(例如随机数学优化),但缺乏能够同时应对这两个挑战的规划技术。为填补这一空白,我们提出了一种随机情景结构化树搜索(S3TS)算法,该算法通过情景树显式表示不确定性,同时能够集成先进的非线性模型。我们在一个模拟的需求响应信号发布问题上评估了S3TS,该问题很大程度上模仿了比利时的失衡结算机制。结果表明,在线性、可解析处理的设置中,S3TS实现了接近最优的性能,成本在情景树条件下比数学最优解高14%以内。在高度非线性的场景中,S3TS显著优于基线方法,与贪心算法和确定性MCTS相比,成本分别降低了51%和5.4%。

英文摘要

Effective scheduling in the energy sector is essential to ensure the reliable operation of electrical grids and their connected assets by, for instance, optimizing the dispatch of generation units and storage systems. An effective planning strategy must (a) accommodate advanced and potentially non-linear system models, exploiting the increasing data availability of modern grids, and (b) explicitly handle uncertainties arising, for instance, from the integration of renewable energy sources. While existing approaches can address either non-linearity (e.g., Monte Carlo Tree Search) or uncertainty (e.g., stochastic mathematical optimization), there is a lack of planning techniques capable of addressing both challenges simultaneously. To bridge this gap, we propose a Stochastic Scenario-Structured Tree Search (S3TS) algorithm that explicitly represents uncertainty through scenario trees while enabling the integration of advanced non-linear models. We evaluate S3TS on a simulated demand response signal publication problem, largely mimicking the imbalance settlement mechanism in Belgium. The results demonstrate near-optimal performance in linear, analytically tractable settings, with costs within 14% of the mathematically optimal solution conditioned on the scenario trees. In highly non-linear scenarios, S3TS significantly outperforms baseline methods, achieving cost reductions of up to 51% and 5.4% compared to a myopic algorithm and deterministic MCTS, respectively.
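Planning over a scenario tree with a non-linear cost can be illustrated by exhaustive recursion on a toy storage problem. This is deliberately not a tree search in the MCTS sense and not the paper's algorithm; the battery model, the quadratic wear term, and the tree layout are all invented for the example.

```python
def best_expected_cost(node, level, capacity=2):
    """Minimal expected cost from this scenario node onward.

    node: {"price": float, "children": [(probability, child_node), ...]}
    level: energy currently stored. Actions: -1 discharge, 0 idle, +1 charge.
    """
    best = float("inf")
    for action in (-1, 0, 1):
        new_level = level + action
        if not 0 <= new_level <= capacity:
            continue  # infeasible storage level
        # Non-linear stage cost: energy bought or sold plus a quadratic wear term.
        stage = node["price"] * action + 0.1 * action ** 2
        future = sum(
            prob * best_expected_cost(child, new_level, capacity)
            for prob, child in node["children"]
        )
        best = min(best, stage + future)
    return best

# Two-stage scenario tree: the next price is high or low with equal probability.
tree = {
    "price": 1.0,
    "children": [
        (0.5, {"price": 3.0, "children": []}),
        (0.5, {"price": 0.5, "children": []}),
    ],
}
cost = best_expected_cost(tree, level=0)
```

Charging now is only worthwhile because the expectation is taken over both branches; a plan tuned to a single forecast price would value the same decision differently.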

2606.02147 2026-06-02 cs.CL cs.AI

Multilingual Idioms in Sentences and Conversations Across High-, Medium-, and Low-Resource Languages

高、中、低资源语言中的句子和对话多语言习语

Saeed Almheiri, Bilal Elbouardi, Salsabila Zahirah Pranida, Irina Nikishina, Ashwath Rao B, Parameswari Krishnamurthy, Muhammad Cendekia Airlangga, Rifo Ahmad Genadi, Nguyen Phan Gia Bao, Amir Hossein Yari, Hawau Olamide Toyin, Nurdaulet Mukhituly, Mena Attia, Besher Hassan, Ahmad Fathan Hidayatullah, Tatsuki Kuribayashi, Haonan Li, Suma Bhat, Fajri Koto

Affiliations: Mohamed bin Zayed University of Artificial Intelligence; University of Hamburg; Manipal University; IIIT Hyderabad; University of Science and Technology of Hanoi; Universitas Islam Indonesia; Princeton University

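AI summary: For multilingual idiom understanding, builds the MIDI dataset covering 3 high-resource, 3 medium-resource, and 12 low-resource languages, with literal and figurative usages in sentence and conversational contexts. Experiments show that comprehension is worse in low-resource languages, that literal readings are harder than figurative ones, and that conversational context helps without closing the gap.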

AI中文摘要

习语表达对多语言NLP构成重大挑战,因为其意义在比喻和字面用法之间转换,通常需要上下文才能准确理解。先前工作集中在高资源语言上,通常评估孤立的习语意义问题,忽略了现实话语。我们引入了MIDI,一个多语言习语数据集,涵盖3种高资源、3种中资源和12种低资源语言,由母语者策划。与之前的数据集不同,MIDI提供了嵌入在句子级和对话上下文中的习语,捕捉了字面和比喻解读。对最先进模型的基准测试表明,习语理解在低资源语言中下降,并且在所有资源层级中,字面解释比比喻解释更难。对话上下文提高了性能,但并未消除这些差异。通过受控测试和对隐藏表示的干预,我们进一步将记忆与推理分离,揭示了当前模型的核心局限性。

英文摘要

Idiomatic expressions pose a major challenge for multilingual NLP because their meanings shift between figurative and literal usage, often requiring context for accurate interpretation. Prior work has focused on high-resource languages and typically evaluates isolated idiom-meaning questions, overlooking realistic discourse. We introduce MIDI, a multilingual idiom dataset spanning 3 high-, 3 medium-, and 12 low-resource languages, curated by native speakers. Unlike previous datasets, MIDI provides idioms embedded in both sentence-level and conversational contexts, capturing both literal and figurative readings. Benchmarking state-of-the-art models shows that idiom comprehension degrades in low-resource languages and that, in all resource tiers, literal interpretations are substantially harder than figurative ones. Conversational context improves performance but does not eliminate these disparities. Through controlled tests and interventions on hidden representations, we further separate memorization from reasoning, exposing core limitations of current models.

2606.02145 2026-06-02 cs.LG

Hybrid Neural Ordinary Differential Equations for Data-Efficient Polymerization Modeling with Incomplete Kinetics

混合神经常微分方程用于不完全动力学的高效聚合建模

Marah Almanasreh, Alexander Mitsos, Eike Cramer

Affiliations: RWTH Aachen University, Process Systems Engineering (AVT.SVT); JARA-CSD Energy Systems Engineering (ICE-1), Forschungszentrum Jülich; Department of Chemical Engineering, Sargent Centre for Process Systems Engineering, University College London

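AI summary: Proposes a hybrid neural ordinary differential equation framework that learns only the partially characterized effective radical concentration term, achieving accurate prediction of free-radical polymerization from sparse data.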

Comments 25 pages, 5 figures

AI中文摘要

聚合动力学的准确预测对于过程设计、控制和优化至关重要。然而,纯机理模型需要对部分表征的动力学进行劳动密集型的参数化,而纯数据驱动模型需要大量且多样化的数据集,这些数据集获取成本高昂,尤其是在早期设计阶段。我们提出了一种混合神经常微分方程(NODE)框架,用于自由基聚合的数据高效建模。以甲基丙烯酸甲酯(MMA)的间歇聚合为例,明确保留了机理质量平衡,仅通过神经网络代理从数据中学习部分表征的控制单体消耗的有效自由基浓度,而引发剂分解、链增长和终止等已确立的反应则保持物理建模。在稀疏数据条件下,将混合NODE与离散时间前馈神经网络和纯数据驱动NODE进行比较,模型在规则和不规则采样下仅使用少至十个测量值进行训练。混合NODE始终比两种纯数据驱动基线实现更低的预测误差和更物理一致的外推。在噪声数据和未见操作条件的泛化场景中,混合NODE的RMSE为0.013,而数据驱动NODE为0.31,离散时间模型为0.68,表明在有限数据可用性下,仅学习闭合项而非完整动力学足以实现可靠预测。

英文摘要

Accurate prediction of polymerization dynamics is essential for process design, control, and optimization. Yet, purely mechanistic models require labor-intensive parameterization of partially characterized kinetics, while purely data-driven models demand large, diverse datasets that are costly to obtain, particularly in early-design stages. We propose a hybrid Neural Ordinary Differential Equation (NODE) framework for data-efficient modeling of free-radical polymerization. Using batch polymerization of methyl methacrylate (MMA) as a case study, the mechanistic mass balances are retained explicitly, and only the partially-characterized effective radical concentration governing monomer consumption is learned from data through a neural network surrogate, while established reactions such as initiator decomposition, propagation, and termination remain physically modeled. The hybrid NODE is evaluated against a discrete-time feedforward neural network and a purely data-driven NODE under sparse data conditions, with models trained on as few as ten measurements under both regular and irregular sampling. The hybrid NODE consistently achieves lower prediction errors and more physically consistent extrapolations than both purely data-driven baselines. In a generalization scenario with noisy data and unseen operating conditions, the hybrid NODE achieves an RMSE of 0.013, compared to 0.31 for the data-driven NODE and 0.68 for the discrete-time model, demonstrating that learning only a closure term rather than the full dynamics is sufficient for reliable prediction under limited data availability.
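The hybrid structure, with mechanistic balances kept explicit and only the radical concentration supplied by a learned term, can be sketched with a forward-Euler integration. The rate constants, the step size, and the `placeholder_surrogate` are invented for illustration; the surrogate simply uses the textbook quasi-steady-state form where a trained network would go, and a real neural ODE would use an adaptive solver with gradients flowing through it.

```python
import math

def simulate(radical_surrogate, m0=1.0, i0=0.01, k_d=1e-2, k_p=50.0, dt=0.1, steps=100):
    """Integrate monomer M and initiator I with forward Euler.

    dI/dt = -k_d * I        (mechanistic initiator decomposition)
    dM/dt = -k_p * M * R    (mechanistic propagation; R is the learned part)
    All constants are illustrative, not fitted values.
    """
    m, i = m0, i0
    trajectory = [(0.0, m, i)]
    for n in range(1, steps + 1):
        r = radical_surrogate(m, i)  # the only data-driven term
        dm = -k_p * m * r
        di = -k_d * i
        m += dt * dm
        i += dt * di
        trajectory.append((n * dt, m, i))
    return trajectory

def placeholder_surrogate(m, i):
    """Stand-in for a trained network: quasi-steady-state radical
    concentration sqrt(2 f k_d I / k_t) with made-up f, k_d, k_t."""
    f, k_d, k_t = 0.5, 1e-2, 1e4
    return math.sqrt(2.0 * f * k_d * i / k_t)

trajectory = simulate(placeholder_surrogate)
```

Swapping `placeholder_surrogate` for a fitted model changes one function while the mass balances stay fixed, which is the sense in which only a closure term is learned.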

2606.02138 2026-06-02 cs.LG cs.AI

VLBM: Variational Latent Basis Modeling for OOD Robust Multivariate Time Series Forecasting

VLBM:面向OOD鲁棒多变量时间序列预测的变分潜在基建模

Xudong Zhang, Jierui Lei, Jiacheng Li, Lingdong Shen, Jian Cui, Haina Tang

Affiliations: School of Artificial Intelligence, University of Chinese Academy of Sciences; Center for Machine Learning Research, Peking University; Amap, Alibaba Group; School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences; Environmental Microbiome and Innovative Genomics Laboratory, Peking University

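AI summary: Proposes the VLBM framework, which separates stable dynamics from OOD deviations through a variational latent basis for robust forecasting under mixed ID/OOD distributions, with average MAE and MSE gains of 15.08% and 7.74% across 12 benchmark tasks.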

AI中文摘要

多变量时间序列预测中的分布外(OOD)事件虽然罕见,但往往主导现实世界风险,使得平均情况预测不足以可靠部署。在混合ID/OOD分布的标准平均风险训练下,来自罕见OOD事件的优化信号可能被频繁的分布内(ID)模式淹没,因此强基准精度可能无法转化为高影响偏移下的可靠性。为解决此问题,我们提出VLBM(变分潜在基模型),一种理论指导的潜在预测框架,将稳定动态与OOD引起的偏差分离。VLBM学习一个共享潜在基,定义稳定ID动态的低秩子空间,将输入显式分解为基子空间分量和正交残差分量,并将未来感知后验与未来盲先验对齐,使得测试时潜在推断仅依赖于历史输入。在涵盖交通、天气、电力系统及其他现实世界领域的12个基准任务上,包括新构建的现实世界OOD交通数据集,VLBM实现了最先进的OOD鲁棒性和ID精度,平均MAE和MSE比最强基线分别提升15.08%和7.74%。在合成模拟数据集上,VLBM也持续实现最佳性能并更好地跟踪OOD脉冲恢复。这些结果支持潜在结构化预测作为混合ID和OOD条件下鲁棒预测的原则性途径。代码可在https://github.com/leijieruilq/VLBM_OOD_forecast获取。

英文摘要

Out-of-distribution (OOD) events in multivariate time series forecasting are rare but often dominate real-world risk, making average-case forecasting insufficient for reliable deployment. Under standard average-risk training on mixed ID/OOD distributions, optimization signals from rare OOD events can be overwhelmed by frequent in-distribution (ID) patterns, so strong benchmark accuracy may not translate into reliability under high-impact shifts. To address this issue, we propose VLBM (Variational Latent Basis Model), a theory-guided latent forecasting framework that separates stable dynamics from OOD-induced deviations. VLBM learns a shared latent basis that defines a low-rank subspace for stable ID dynamics, explicitly decomposes inputs into basis-subspace components and orthogonal residual components, and aligns a future-aware posterior with a future-blind prior so that test-time latent inference depends only on historical input. Across 12 benchmark tasks spanning transportation, weather, power systems, and other real-world domains, including newly constructed real-world OOD traffic datasets, VLBM achieves state-of-the-art OOD robustness and ID accuracy, with average MAE and MSE gains of 15.08% and 7.74% over the strongest baseline. On a synthetic simulation dataset, VLBM also consistently achieves the best performance and better tracks OOD pulse recovery. These results support latent structured forecasting as a principled route to robust prediction under mixed ID and OOD conditions. The code is available at https://github.com/leijieruilq/VLBM_OOD_forecast.
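The decomposition into a basis-subspace component and an orthogonal residual is ordinary linear projection, sketched below for a fixed orthonormal basis. In the paper the basis is learned and the components feed a variational model; here the basis is given by hand and the function names are assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def decompose(x, basis):
    """Split x into its projection onto span(basis) and the orthogonal residual.

    Assumption: `basis` is a list of orthonormal vectors, so the projection
    is the sum of (x . b) * b over the basis vectors.
    """
    projection = [0.0] * len(x)
    for b in basis:
        coeff = dot(x, b)
        projection = [p + coeff * bi for p, bi in zip(projection, b)]
    residual = [xi - pi for xi, pi in zip(x, projection)]
    return projection, residual

# A rank-2 subspace of R^3 standing in for the stable-dynamics subspace.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
x = [3.0, 4.0, 5.0]
stable_part, deviation = decompose(x, basis)
```

Whatever the basis cannot express ends up in `deviation`, which is the component a model could treat separately as a shift away from the stable dynamics.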

2606.02136 2026-06-02 cs.LG

Edge-aware Decoding for Neural Asymmetric Routing

面向神经非对称路由的边缘感知解码

Li Liang, Jinbiao Chen, Zizhen Zhang

Affiliations: Sun Yat-Sen University; Department of Industrial Systems Engineering and Management, National University of Singapore

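AI summary: To address the representation-decision mismatch in neural asymmetric routing, proposes an edge-aware decoder that explicitly exposes transition-level cost information and improves zero-shot generalization.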

AI中文摘要

神经非对称路由模型越来越多地通过矩阵表示和非对称感知注意力来编码方向性。然而,最终路由动作并非孤立节点,而是在当前部分路由下选择的有向转移。这造成了表示与决策的不匹配:成对成本信息可能在上游编码,而最终候选logit仍主要参数化为上下文-节点兼容性。我们提出一种针对神经非对称路由的解码器设计原则:最终得分应显式暴露问题成本结构所暗示的转移级量。我们通过一个边缘感知解码器实例化该原则,该解码器为当前有向边、返回起点的闭合以及静态轻量级前瞻添加候选特定项,同时保持表示骨干网络固定。在受控的SVD/Sinkhorn非对称骨干网络上,该解码器在ATSP-100上训练并在ATSP-100/200/500/1000上零样本评估时,优于RADAR参考,将ATSP-1000的差距从4.13%降至2.73%。在ACVRP上,相同的得分级修改在更丰富的路由状态下显示出相同的定性趋势。ATSP消融实验和有向转移诊断进一步阐明了机制:最强证据涉及对当前有向边的敏感性,而闭合和静态前瞻则作为启发式延续线索。结果支持一项机制研究:神经非对称路由中一个关键的解码器侧信号是决策时暴露转移级边缘信息。

英文摘要

Neural asymmetric routing models increasingly encode directionality through matrix representations and asymmetry-aware attention. The final routing action, however, is not a node in isolation but a directed transition chosen under the current partial route. This creates a representation-decision mismatch: pairwise cost information may be encoded upstream while the final candidate logit is still largely parameterized as context-node compatibility. We propose a decoder-design principle for neural asymmetric routing: the final score should explicitly expose transition-level quantities suggested by the problem's cost-to-go structure. We instantiate this principle with an edge-aware decoder that adds candidate-specific terms for the current directed edge, return-to-start closure, and static lightweight lookahead, while keeping the representation backbone fixed. On a controlled SVD/Sinkhorn asymmetric backbone, the decoder improves over the RADAR reference when trained on ATSP-100 and evaluated zero-shot on ATSP-100/200/500/1000, reducing the ATSP-1000 gap from 4.13% to 2.73%. On ACVRP, the same score-level modification shows the same qualitative trend under a richer routing state. ATSP ablations and directed-transition diagnostics sharpen the mechanism: the strongest evidence concerns sensitivity to the current directed edge, while closure and static lookahead act as heuristic continuation cues. The results support a mechanism study: a key decoder-side signal in neural asymmetric routing is decision-time exposure of transition-level edge information.
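A score that exposes transition-level quantities can be sketched as a base logit plus three candidate-specific cost terms. The additive form, the fixed weights, and the choice of "cheapest outgoing edge to another unvisited node" as the lookahead are assumptions for illustration; in the paper these terms sit inside a trained decoder.

```python
def edge_aware_scores(base_logits, cost, current, start, unvisited,
                      w_edge=1.0, w_close=0.5, w_look=0.5):
    """Score each candidate next node j for an asymmetric routing step.

    cost[a][b] is the directed cost of going from a to b.
    Terms subtracted from the base (context-node) logit:
      - current directed edge:  cost[current][j]
      - return-to-start closure: cost[j][start]
      - static lookahead: cheapest edge from j to another unvisited node
    """
    scores = {}
    for j in unvisited:
        others = [k for k in unvisited if k != j]
        lookahead = min((cost[j][k] for k in others), default=0.0)
        scores[j] = (
            base_logits[j]
            - w_edge * cost[current][j]
            - w_close * cost[j][start]
            - w_look * lookahead
        )
    return scores

# Asymmetric 3-node instance: cost[a][b] differs from cost[b][a].
cost = [
    [0, 1, 5],
    [4, 0, 2],
    [1, 3, 0],
]
scores = edge_aware_scores([0.0, 0.0, 0.0], cost, current=0, start=0, unvisited=[1, 2])
next_node = max(scores, key=scores.get)
```

With all base logits equal, the ranking is decided entirely by directed costs, which makes the effect of the edge terms easy to inspect in isolation.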

2606.02134 2026-06-02 cs.LG cs.AI cs.CV

Rethinking Evaluation Paradigms in IBP-based Certified Training

重新思考基于IBP的认证训练中的评估范式

Konstantin Kaulen, Hadar Shavit, Holger H. Hoos

发表机构 * University of Freiburg(弗赖堡大学) ETH Zurich(苏黎世联邦理工学院)

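AI summary: To address the trade-off between natural and certified accuracy in certified training, proposes Pareto-front-based multi-objective hyperparameter optimization for fair comparison across methods, finds that previously reported configurations were undertuned, and establishes a new state of the art.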

Comments Accepted to ICML 2026

AI中文摘要

深度神经网络在许多监督学习任务上取得了强大性能,但仍易受对抗性扰动的影响。神经网络验证提供了数学上严格的鲁棒性保证,但计算成本高昂。为缓解这一问题,认证训练技术在训练过程中优化可验证的鲁棒性,通常通过方法特定的超参数控制自然精度与认证精度之间的权衡。由于这些指标本质上是冲突的,报告单一配置的常见做法存在问题:它可能误导关于整体性能的结论,并妨碍对最新技术的无偏评估。我们通过基于自然-认证精度权衡的Pareto前沿比较来评估认证训练方法。为了实现公平、方法无关的比较,我们执行高效的自动化多目标超参数优化,为每种方法识别一组Pareto最优配置。这种方法常常揭示先前报告配置中的显著欠调优,从而获得更优性能并建立新的最优水平。利用这些前沿,我们首次对认证训练方法进行了全面的多目标比较,表明先前的进展并不像假设的那样显著,并揭示了先前未报告的性能互补性。

英文摘要

Deep neural networks achieve strong performance on many supervised learning tasks but remain vulnerable to adversarial perturbations. Neural network verification provides mathematically rigorous robustness guarantees, yet at substantial computational cost. To mitigate this, certified training techniques optimise for verifiable robustness during training, typically inducing a trade-off between natural and certified accuracy controlled by method-specific hyperparameters. Because these metrics are inherently conflicting, the common practice of reporting a single configuration is problematic: it can mislead conclusions about overall performance and prevents unbiased assessments of the state of the art. We address this by evaluating certified training methods via Pareto front comparisons over the natural-certified accuracy trade-off. To enable fair, method-agnostic comparisons, we perform efficient automated multi-objective hyperparameter optimisation to identify a set of Pareto-optimal configurations for each method. This approach often uncovers substantial undertuning in previously reported configurations, yielding superior performance and establishing a new state of the art. Leveraging these fronts, we present the first comprehensive multi-objective comparison of certified training approaches, showing that prior advancements are less pronounced than assumed and revealing previously unreported performance complementarities.
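Extracting the Pareto front from a set of evaluated configurations is straightforward once both accuracies are recorded. The sketch below uses made-up configuration names and accuracy values; it shows only the dominance filter, not the hyperparameter optimisation that produces the candidates.

```python
def pareto_front(configs):
    """Return the configurations not dominated in (natural, certified) accuracy.

    configs: list of (name, natural_acc, certified_acc); higher is better.
    A configuration is dominated if another one is at least as good on both
    metrics and strictly better on at least one.
    """
    front = []
    for name, nat, cert in configs:
        dominated = any(
            n2 >= nat and c2 >= cert and (n2 > nat or c2 > cert)
            for _, n2, c2 in configs
        )
        if not dominated:
            front.append((name, nat, cert))
    return sorted(front, key=lambda c: c[1], reverse=True)

# Hypothetical results for four hyperparameter settings of one method.
configs = [
    ("a", 0.90, 0.30),
    ("b", 0.80, 0.50),
    ("c", 0.70, 0.40),  # worse than "b" on both metrics
    ("d", 0.60, 0.60),
]
front = pareto_front(configs)
```

Reporting only `a` or only `d` would tell opposite stories about the same method, which is why comparing whole fronts gives a fairer picture than a single configuration.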

2606.02129 2026-06-02 cs.CV

Equilibrated Diffusion: Frequency-aware Textual Embedding for Equilibrated Image Customization

均衡扩散:面向均衡图像定制的频率感知文本嵌入

Liyuan Ma, Xueji Fang, Guo-Jun Qi

Affiliations: Westlake University; Zhejiang University

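AI summary: Proposes Equilibrated Diffusion, which decomposes concept features in frequency space and optimizes the embeddings independently to disentangle style from subject, improving the fidelity and text alignment of customized images.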

AI中文摘要

图像定制从参考概念图像中学习目标主体,并根据文本提示生成条件图像,主要修改风格或背景。主流方法采用微调将多样化的概念属性打包到统一的潜在嵌入中,但纠缠的属性阻碍了从风格和背景中消除无关干扰。为解决此问题,我们提出均衡扩散,一种频率驱动的方法,解缠纠缠的概念特征,实现均衡定制和一致的文本-视觉匹配。与使用共享嵌入和统一调优学习完整概念的传统方法不同,我们的工作利用图像频率分量与语义之间的内在联系:低频表示主体内容,高频对应风格。我们在频率空间中分解概念并独立优化每个嵌入。这种分离优化使去噪器能够捕获与主体身份分离的风格,并更好地泛化到未见过的风格提示。合并多频率嵌入保留了模型原有的空间定制能力。我们进一步部署掩码引导扩散以限制无关背景变化并增强文本对齐。将残差参考注意力(RRA)插入空间注意力中以保持主体结构和身份一致性。实验证明,均衡扩散在主体保真度和文本遵循方面超过主流基线,验证了我们方法的优越性。

英文摘要

Image customization learns target subjects from reference concept images and generates conditioned images per text prompts, mainly modifying styles or backgrounds. Prevailing methods adopt fine-tuning to pack diverse concept attributes into a unified latent embedding, yet entangled attributes hinder elimination of irrelevant disturbances from style and background. To address this issue, we propose Equilibrated Diffusion, a frequency-driven approach that disentangles tangled concept features for balanced customization and consistent text-visual matching. Unlike conventional methods learning full concepts with shared embeddings and unified tuning, our work utilizes the inherent link between image frequency components and semantics: low frequencies represent subject content and high frequencies correspond to styles. We decompose concepts in frequency space and optimize each embedding independently. This separate optimization enables the denoiser to capture style detached from subject identity and generalize better to unseen stylistic prompts. Merging multi-frequency embeddings preserves the model's original spatial customization ability. We further deploy mask-guided diffusion to restrict irrelevant background changes and boost text alignment. Residual Reference Attention (RRA) is inserted into spatial attention to retain subject structure and identity consistency. Experiments prove Equilibrated Diffusion exceeds mainstream baselines on subject fidelity and text adherence, verifying our method's superiority.
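Splitting a signal into low- and high-frequency parts can be shown on a one-dimensional toy signal with a moving-average low-pass filter. This is only an analogy for the image-space decomposition in the abstract: the filter, the window size, and the 1D setting are assumptions, and the paper's actual frequency transform is not specified here.

```python
def split_frequencies(signal, window=3):
    """Split a 1D signal into a smooth (low-frequency) part and the remainder.

    The low part is a centered moving average, truncated at the borders;
    the high part is whatever the average removes, so low + high == signal.
    """
    half = window // 2
    low = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        low.append(sum(signal[lo:hi]) / (hi - lo))
    high = [s - l for s, l in zip(signal, low)]
    return low, high

flat_low, flat_high = split_frequencies([1.0, 1.0, 1.0, 1.0])
signal = [0.0, 3.0, 0.0, 3.0]
low, high = split_frequencies(signal)
```

Because the two parts sum back to the original, each can be processed independently and then merged, which is the property that separate per-frequency optimization relies on.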

2606.02120 2026-06-02 cs.CV cs.AI cs.LG

Understanding-Enhanced Model Collaboration for Long-Tailed Egocentric Mistake Detection

理解增强的模型协作用于长尾自我中心错误检测

Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang, Ruochen Cui, Qingming Huang

发表机构 * State Key Laboratory of AI Safety, Institute of Computing Technology, CAS(人工智能安全国家重点实验室,计算技术研究所,中国科学院) School of Computer Science and Tech., University of Chinese Academy of Sciences(中国科学院大学计算机科学与技术学院) Beijing Academy of Artificial Intelligence(北京人工智能研究院) Institute of Information Engineering, CAS(信息工程研究所,中国科学院) School of Cyber Security, University of Chinese Academy of Sciences(中国科学院大学网络安全学院)

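AI summary Proposes an Understanding-Enhanced Model Collaboration Method (UE-MCM) that combines coarse-grained video understanding with fine-grained action reasoning, detects mistakes in egocentric video through a two-branch model with an adaptive fusion gate, and optimizes the classifiers for the long-tailed distribution of mistakes.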

AI中文摘要

在本报告中,我们解决了从自我中心视频数据中判断用户是否错误执行动作的问题。为此,我们提出了一种理解增强的模型协作方法(UE-MCM),该方法将高效的粗粒度视频理解与准确的细粒度动作推理相结合。具体来说,UE-MCM包含一个小模型分支和一个大模型分支。大模型分支关注细粒度动作本身是否执行错误,而小模型分支联合输入粗粒度视频和细粒度片段,以识别可能局部正确但与整体工作流不一致的动作。小模型分支基于CLIP4CLIP视频编码器构建,该编码器从通过扩散对比重建增强的CLIP模型初始化,大模型分支使用Qwen3-VL嵌入模型从细粒度动作片段中提取高容量表示。然后,通过轻量级协作门自适应融合小分支预测和大分支预测。为了处理错误实例的长尾分布,我们通过互补目标优化分类器,包括重加权交叉熵、AUC导向学习和标签感知调整。所得系统平衡了速度和准确性,使其能够有效检测自我中心教学视频中的细微、罕见和模糊错误。

英文摘要

In this report, we address the problem of determining whether a user performs an action incorrectly from egocentric video data. To this end, we propose an Understanding-Enhanced Model Collaboration Method (UE-MCM) that combines efficient coarse-grained video understanding with accurate fine-grained action reasoning. Specifically, UE-MCM contains a small model branch and a large model branch. The large model branch focuses on whether the fine-grained action itself is executed incorrectly, while the small model branch jointly takes the coarse-grained video and fine-grained segment as input to identify actions that may be locally correct but inconsistent with the overall workflow. The small model branch is built on a CLIP4CLIP video encoder initialized from a CLIP model enhanced by Diffusion Contrastive Reconstruction, and the large model branch uses the Qwen3-VL Embedding model to extract high-capacity representations from fine-grained action segments. The small-branch prediction and the large-branch prediction are then adaptively fused by a lightweight collaboration gate. To handle the long-tailed distribution of mistake instances, we optimize the classifiers with complementary objectives, including reweighted cross-entropy, AUC-oriented learning, and label-aware adjustment. The resulting system balances speed and accuracy, making it effective for detecting subtle, rare, and ambiguous mistakes in egocentric instructional videos.
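A gate that fuses two branch predictions, together with a reweighted cross-entropy, might look like the sketch below. The abstract does not give the gate's form or the weighting scheme, so the sigmoid gate, the `gate_logit` input, and the `pos_weight` parameter are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fuse(p_small, p_large, gate_logit):
    """Convex combination of the two branch probabilities.

    gate_logit is a hypothetical scalar that a small learned gate would output.
    """
    g = sigmoid(gate_logit)
    return g * p_small + (1.0 - g) * p_large

def reweighted_bce(p, y, pos_weight):
    """Binary cross-entropy that up-weights the rare positive (mistake) class."""
    eps = 1e-12  # avoids log(0)
    return -(pos_weight * y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

# A gate logit of 0 weights both branches equally.
fused = fuse(p_small=0.2, p_large=0.8, gate_logit=0.0)
loss = reweighted_bce(fused, y=1, pos_weight=3.0)
```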

2606.02119 2026-06-02 cs.LG cs.AI

How Hard Can It Be? Hardness-Aware Multi-Objective Unlearning

到底有多难?难度感知的多目标遗忘学习

Jiangwei Chen, Xinyuan Niu, Rachael Hwee Ling Sim, Zhengyuan Liu, Nancy F. Chen, Bryan Kian Hsiang Low

发表机构 * National University of Singapore(新加坡国立大学) University of California, Berkeley(加州大学伯克利分校)

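AI summary Existing unlearning methods cannot guarantee improved forget quality while maintaining retain utility. This work proposes a hardness-aware multi-objective unlearning algorithm (HAMU) based on constrained optimization, which quantifies the similarity between forget and retain data to guide model updates, guaranteeing a specified improvement in forget quality while minimizing the loss of retain utility.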

Comments ICML 2026

AI中文摘要

机器遗忘旨在由于隐私、版权或偏见问题,移除特定遗忘训练数据的影响,同时保持模型在剩余保留数据上的性能。现有的遗忘算法,例如优化损失的加权组合,试图实现提高遗忘质量和保持保留效用这些目标。然而,它们无法保证对所有遗忘和保留数据都能将目标改进到指定程度。在这项工作中,我们从约束优化的角度,用一种新颖且理论扎实的方法解决了这一限制。首先,我们确定遗忘数据和保留数据之间的相似度可以量化调和两个目标的难度。接下来,我们推导出一种遗忘算法(HAMU),其总体目标是通过根据我们的难度度量更新模型权重,在保证遗忘质量有指定改进的同时,最小化保留效用成本/下降。我们的难度度量还告知用户何时保留效用下降不可避免,即两个目标无法同时改进,应考虑停止。我们的算法适用于非凸模型,并且易于并行化,使其易于在实际场景中部署。我们通过实验使用大型模型在图像和文本数据集上证明了HAMU相对于基线的优越性能。我们的代码可在 https://github.com/aoi3142/HAMU 获取。

英文摘要

Machine unlearning aims to remove the influence of specific forget training data due to privacy, copyright or bias concerns while maintaining the model performance on the remaining retain data. Existing unlearning algorithms, such as optimizing a weighted combination of losses, have tried to achieve these objectives of improving forget quality and maintaining retain utility. However, they do not guarantee that these objectives can be improved by a specified extent for all forget and retain data. In this work, we address this limitation with a novel and theoretically-grounded approach from a constrained optimization perspective. Firstly, we identify that the hardness of reconciling both objectives can be quantified by the similarity between the forget data and the retain data. Next, we derive an unlearning algorithm (HAMU) with the overall goal of guaranteeing a specified improvement in forget quality while minimizing the retain utility cost/degradation by updating the model weights based on our hardness measure. Our hardness measure also informs users when retain utility degradation is unavoidable, i.e., both objectives cannot be improved simultaneously, and stopping should be considered. Our algorithm is applicable to non-convex models and is easily parallelizable, making it readily deployable in real-world scenarios. We empirically demonstrate HAMU's superior performance over baselines on both image and text datasets using large models. Our code is available at https://github.com/aoi3142/HAMU.
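One way to see why similarity between forget and retain data makes unlearning hard is a first-order gradient argument. The sketch below is a generic gradient-projection illustration, not the HAMU update rule (which the abstract does not spell out); `g_forget`, `g_retain`, and `retain_neutral_step` are hypothetical names.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def retain_neutral_step(g_forget, g_retain):
    """Remove from g_forget its component along g_retain.

    To first order, a step in the returned direction leaves the retain
    loss unchanged while still changing the forget loss.
    """
    scale = dot(g_forget, g_retain) / dot(g_retain, g_retain)
    return [gf - scale * gr for gf, gr in zip(g_forget, g_retain)]

g_forget = [1.0, 1.0]   # hypothetical forget-objective gradient
g_retain = [1.0, 0.0]   # hypothetical retain-loss gradient
step = retain_neutral_step(g_forget, g_retain)

similarity = cosine(g_forget, g_retain)
forget_gain = dot(step, g_forget)  # equals |g_forget|^2 * (1 - similarity^2)
```

As the cosine similarity approaches 1, the achievable first-order forget gain shrinks toward zero, which is the intuition behind treating similarity as a hardness measure.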

2606.02111 2026-06-02 cs.CV cs.AI cs.CL

Jailbreaking Multimodal Large Language Models using Multi-Clip Video

使用多片段视频破解多模态大语言模型

Choongwon Kang, Seungjong Sun, Hyunmin Jun, Jang Hyun Kim

发表机构 * Department of Applied Artificial Intelligence, Sungkyunkwan University(应用人工智能系,成均馆大学) Department of Human-Artificial Intelligence Interaction, Sungkyunkwan University(人机交互系,成均馆大学)

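AI summary Introduces the MCV SafetyBench dataset, which uses multi-clip videos to evaluate safety vulnerabilities of multimodal large language models. It finds that the video modality is more vulnerable than the image modality and that dynamic, diverse contexts raise attack success rates, and it proposes a defense strategy based on the robustness of the image modality.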

Comments 27 pages, 20 figures, Accepted to the Main Conference of ACL 2026

AI中文摘要

随着多模态大语言模型(MLLMs)发展到处理视频输入,人们开始担忧其被恶意滥用的可能性。先前的越狱研究表明,MLLMs中的安全对齐可以通过视觉输入被绕过,但尚不清楚视频输入的哪些属性导致了这种脆弱性。为填补这一空白,我们引入了Multi-Clip Video (MCV) SafetyBench,一个包含2,920个视频的数据集,旨在评估视频输入的多样性如何影响MLLMs的脆弱性。每个视频由多个短片段组成,描述与有害查询相关的不同上下文。对八个代表性视频MLLMs的实验表明,攻击成功率随着片段数量的增加而持续提高。我们的结果进一步表明,视频模态(1)比图像模态更脆弱,(2)对动态视频比对静态视频更脆弱,(3)当视频包含更多样化的上下文时更脆弱。基于这些发现,我们提出了一种利用图像模态相对鲁棒性的防御策略。

英文摘要

As multimodal large language models (MLLMs) have advanced to process video inputs, concerns have emerged about their potential for malicious misuse. Prior jailbreak studies have shown that safety alignment in MLLMs can be bypassed through visual inputs, yet it remains unclear which properties of video inputs induce this vulnerability. To address this gap, we introduce Multi-Clip Video (MCV) SafetyBench, a dataset of 2,920 videos designed to evaluate how the diversity of video inputs affects the vulnerability of MLLMs. Each video consists of multiple short clips depicting diverse contexts related to a harmful query. Experiments on eight representative video MLLMs show that attack success consistently increases with the number of clips. Our results further indicate that the video modality is (1) more vulnerable than the image modality, (2) more vulnerable to dynamic videos than to static videos, and (3) more vulnerable when videos contain more diverse contexts. Building on these findings, we propose a defense strategy that leverages the relative robustness of the image modality.

2606.02109 2026-06-02 cs.AI

BADGER: Bridging Agentic and Deterministic Evaluation for Generative Enterprise Reasoning

BADGER:桥接生成式企业推理的自主与确定性评估

Shannon Serrao, Soumitra Chatterjee, Dorina Strori, Abhishek Sharma, Nathan Miller

发表机构 * Merkle Analytics

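AI summary Proposes BADGER, a framework that unifies text-to-SQL evaluation with agentic behavior evaluation through a hybrid execution accuracy metric (Hybrid-EX) and an agentic evaluation suite, outperforming existing methods on industry queries.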

Comments 30 pages, 2 figures, 6 tables

AI中文摘要

将自然语言转换为SQL查询并编排多步自主推理管道的企业AI系统需要与学术基准根本不同的评估方法。Spider和BIRD建立了执行准确率协议;G-Eval和RAGAS推进了基于LLM的评估;最近的工作如Spider 2.0、BEAVER和BIRD-Interact开始解决企业和自主维度。没有一个单一框架将文本到SQL评估与自主行为评估统一到一个生产级管道中,并针对人类专家判断进行校准。我们提出了在Merkle开发的BADGER,一个统一的评估框架,集成了文本到SQL评估与自主行为评估。BADGER提供三个贡献。首先,LLM辅助的SQL组件提取,扩展Spider方法以处理CTE-heavy、方言特定的SQL。其次,混合执行准确率指标(Hybrid-EX),通过使用LLM在确定性单元格级评分之前推断结构对齐,解决列别名和数值容错脆弱性。在150个人工标注的行业查询上验证,Hybrid-EX达到Cohen's kappa=0.717 [95% CI: 0.600-0.822](高度一致性)和87.3%的平衡准确率,优于所有六个竞争框架(Delta-kappa: 0.322-0.502,所有p<=0.001)。第三,一个企业自主评估套件,将RAGAS、G-Eval和代理基准指标组装成一个统一管道;超额工具使用是唯一的新元素。BADGER完全在客户受管的数据环境中运行,支持可配置的LLM评判后端,并支持快速原型化客户特定的评判器和指标,作为持续评估骨干而非一次性质量门。

英文摘要

Enterprise AI systems that translate natural language into SQL queries and orchestrate multi-step agentic reasoning pipelines require evaluation approaches fundamentally different from academic benchmarks. Spider and BIRD established execution-accuracy protocols; G-Eval and RAGAS advanced LLM-based assessment; and recent work such as Spider 2.0, BEAVER, and BIRD-Interact has begun to address enterprise and agentic dimensions. However, no single framework unifies text-to-SQL assessment with agentic behavior evaluation in a production-grade pipeline calibrated against human expert judgment. We present BADGER, developed at Merkle, a unified evaluation framework integrating text-to-SQL assessment with agentic behavior evaluation. BADGER offers three contributions. The first is LLM-assisted SQL component extraction, which extends the Spider methodology to handle CTE-heavy, dialect-specific SQL. The second is a hybrid execution accuracy metric (Hybrid-EX), which resolves column-aliasing and numeric-tolerance brittleness by using an LLM to infer structural alignments before deterministic cell-level scoring. Validated on 150 human-annotated industry queries, Hybrid-EX achieves Cohen's kappa=0.717 [95% CI: 0.600-0.822] (Substantial agreement) and 87.3% balanced accuracy, outperforming all six competing frameworks (Delta-kappa: 0.322-0.502, all p<=0.001). The third is an enterprise agentic evaluation suite that assembles RAGAS, G-Eval, and agent benchmark metrics into a unified pipeline; Excess Tool Usage is the sole novel element. BADGER runs entirely within the client's governed data environment, supports configurable LLM judge backends, and enables rapid prototyping of client-specific judges and metrics, serving as a continuous evaluation backbone rather than a one-time quality gate.
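The two agreement statistics quoted above, Cohen's kappa and balanced accuracy, follow standard definitions and can be computed as below. The label lists are made-up toy data, not the paper's 150 annotated queries.

```python
def cohens_kappa(a, b):
    """Chance-corrected agreement between two label sequences."""
    n = len(a)
    labels = set(a) | set(b)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    p_expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_observed - p_expected) / (1 - p_expected)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall."""
    recalls = []
    for label in set(y_true):
        idx = [i for i, y in enumerate(y_true) if y == label]
        recalls.append(sum(y_pred[i] == label for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Toy example: human verdicts vs. an automatic metric's verdicts (1 = correct SQL).
human = [1, 1, 1, 0, 0, 0, 1, 0]
metric = [1, 1, 0, 0, 0, 1, 1, 0]

kappa = cohens_kappa(human, metric)
bal_acc = balanced_accuracy(human, metric)
```

On this toy data the raw agreement is 6 out of 8, but half of that is expected by chance, which is why kappa comes out lower than the agreement rate.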

2606.02107 2026-06-02 cs.RO cs.AI cs.LG

Network Distributed Multi-Agent Reinforcement Learning for Consensus Control of Quadcopters

网络分布式多智能体强化学习用于四旋翼无人机一致性控制

Youssef Mahran, Zeyad Gamal, Aamir Ahmad, Ayman El-Badawy

发表机构 * Mechatronics Engineering Department, German University in Cairo (GUC), Egypt(埃及德国大学(GUC)机械工程系) Institute of Flight Mechanics and Control (IFR), Head of Flight Robotics, University of Stuttgart, Germany(德国斯图加特大学飞行力学与控制研究所) Faculty of EMS, Head of Mechatronics Engineering Department, German University in Cairo (GUC), Egypt(埃及德国大学(GUC)EMS学院)

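AI summary Proposes a network distributed multi-agent reinforcement learning framework that uses the communication graph to realize distributed policies, trains a high-level planner with MASAC, and scales zero-shot to 250 agents.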

Comments This is the Author Accepted Manuscript version of a paper accepted for publication. The final published version is available via IEEE Xplore

Journal ref 2026 IEEE 23rd Mediterranean Electrotechnical Conference (MELECON)
AI中文摘要

本文提出了一种用于四旋翼无人机一致性控制的网络分布式多智能体强化学习(ND-MARL)框架。与依赖集中式规划或完全分散式执行的传统多智能体MARL公式相比,ND-MARL将群体通信图纳入决策过程。在2-邻居通信拓扑下,每个智能体仅观察两个邻居的信息,并通过分布式策略输出动作。使用多智能体软演员-评论家(MASAC)训练高层分布式一致性规划器,并将其嵌入层次化堆栈中,以生成由低层四旋翼控制器跟踪的参考目标位置。结果表明,与集中式MARL控制器相比,实现了平滑的一致性轨迹和规划器-跟踪器集成。最值得注意的是,学习到的控制器表现出零样本可扩展性,即在三智能体系统上训练的策略,在相同的2-邻居通信拓扑下,无需重新训练或微调即可部署到多达250个智能体的群体中,实现了随着团队规模增大而稳态散布增加的一致收敛,这是由于稀疏信息传播所致。这些发现突显了ND-MARL作为分布式、通信感知的四旋翼一致性控制的稳定框架。

英文摘要

This paper proposes a Network Distributed Multi-Agent Reinforcement Learning (ND-MARL) framework for quadcopter consensus control. Compared to conventional MARL formulations that rely on centralized planning or fully decentralized execution, ND-MARL incorporates the swarm communication graph into the decision process. Under a 2-Neighbor communication topology, each agent observes information from only two neighbors and outputs an action through a distributed policy. A high-level distributed consensus planner is trained using Multi-Agent Soft Actor-Critic (MASAC) and embedded in a hierarchical stack to generate reference target positions tracked by a low-level quadcopter controller. Results demonstrate smooth consensus trajectories and planner-tracker integration when compared to a centralized MARL controller. Most notably, the learned controller exhibits zero-shot scalability: policies trained on a three-agent system are deployed to swarms of up to 250 agents under the same 2-Neighbor communication topology without retraining or fine-tuning, achieving consistent convergence, with steady-state spread increasing at large team sizes due to sparse information propagation. These findings highlight ND-MARL as a stable framework for distributed, communication-aware quadcopter consensus control.
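To make the 2-Neighbor topology concrete, the sketch below runs a classical linear consensus rule on a ring, where each agent sees only its two neighbours. This hand-written averaging rule stands in for the learned MASAC policy, which the abstract does not describe in detail; the ring layout and the `gain` value are assumptions.

```python
def ring_neighbors(i, n):
    """Indices of the two neighbours of agent i on a ring of n agents."""
    return ((i - 1) % n, (i + 1) % n)

def consensus_step(positions, gain=0.5):
    """Each agent moves part of the way toward the mean of its two neighbours."""
    n = len(positions)
    updated = []
    for i, x in enumerate(positions):
        left, right = ring_neighbors(i, n)
        target = 0.5 * (positions[left] + positions[right])
        updated.append(x + gain * (target - x))
    return updated

def spread(positions):
    return max(positions) - min(positions)

# Three agents on a line (1-D positions for brevity).
positions = [0.0, 4.0, 10.0]
initial_mean = sum(positions) / len(positions)
for _ in range(50):
    positions = consensus_step(positions)
```

On a three-agent ring every agent's neighbours are the other two agents, so agreement is reached quickly. On a much larger ring, information travels one hop per step, which is consistent with the sparse information propagation the abstract mentions.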

2606.02106 2026-06-02 cs.LG stat.ML

When Tabular Foundation Models Transfer Across Modalities: A Systematic Evaluation Across 95 Datasets, 7 Modalities, and Two Regimes

当表格基础模型跨模态迁移:对95个数据集、7种模态和两种范式的系统评估

Julien Lafrance

发表机构 * Telecom Paris, Institut Polytechnique de Paris(巴黎电信学院,巴黎理工学院)

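AI summary Presents a classification pipeline that combines Equiangular Tight Frame preprocessing with a tabular foundation model, evaluates it on data from multiple modalities, and shows that it strikes a good balance between speed and quality.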

Comments 24 pages, 5 figures. Code and data available at https://doi.org/10.5281/zenodo.19982636

AI中文摘要

我们提出一个单一的分类流水线,该流水线结合了等角紧框架(ETF)预处理阶段和用于上下文推理的表格基础模型,一旦数据被映射到固定向量表示,该流水线在所有模态上应用相同。我们在涵盖七种信号模态——视觉、音频、语音、文本、分子、时间序列和表格——的95个数据集上对其进行评估。主要的方法论贡献是固定比较对象:在整个论文中,性能与相同冻结特征上最强的轻量级调优基线进行比较,而oracle选择、部署选择和专门微调则分别报告。该流水线在相同冻结特征上与强大的轻量级调优基线广泛竞争。它并不在每个任务上都匹配最好的专门模型或高度调优的流水线,但差距很小,且运行速度更快——通常比完整骨干微调快4到200倍,而质量往往相当。我们描述了如何在实际中部署该流水线:何时应用ETF预处理,如何在无验证集的情况下停止其训练,如何设置上下文分类器,以及如何校准所得概率。校准步骤并非装饰性的:TabICL通过构造产生良好校准的概率,ETF预处理最初会破坏该校准,而后处理重新缩放则恢复它——从而产生每个预测的置信度信号,从业者可以将其用作置信度门控部署的信任阈值。我们还报告了该流水线在哪些情况下不应期望有帮助,以及如何提前识别这些情况。

英文摘要

We present a single classification pipeline that combines an Equiangular Tight Frame (ETF) preprocessing stage with a tabular foundation model for in-context inference, applied identically across modalities once data is mapped to fixed vector representations. We evaluate it on 95 datasets spanning seven signal modalities -- vision, audio, speech, text, molecular, time-series, and tabular. The main methodological contribution is to fix the comparison object: throughout the paper, performance is judged against the strongest lightweight tuned baseline on the same frozen features, while oracle selection, deployed selection, and specialized fine-tuning are reported separately. The pipeline is broadly competitive with strong lightweight tuned baselines on the same frozen features. It does not match the very best specialized models or heavily tuned pipelines on every task, but it stays close, and it runs much faster -- typically 4 to 200 times faster than full backbone fine-tuning, often at comparable quality. We describe how to deploy the pipeline in practice: when to apply ETF preprocessing, how to stop its training without a validation split, how to set up the in-context classifier, and how to calibrate the resulting probabilities. The calibration step is non-cosmetic: TabICL produces well-calibrated probabilities by construction, ETF preprocessing initially disrupts that calibration, and the post-hoc rescaling restores it -- yielding a per-prediction confidence signal that practitioners can use as a trust threshold for confidence-gated deployment. We also report where the pipeline should not be expected to help, and how to identify those cases in advance.
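Two ingredients named above have textbook forms that are easy to check numerically. The sketch builds a simplex ETF (unit-norm vectors with equal pairwise cosine) and applies temperature scaling, one common kind of post-hoc rescaling. Whether the paper uses exactly these constructions is not stated in the abstract, so both are assumptions.

```python
import math

def simplex_etf(k):
    """k unit vectors in R^k whose pairwise inner product is -1/(k-1)."""
    scale = math.sqrt(k / (k - 1))
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
            for i in range(k)]

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature; temperature > 1 softens the output."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

frame = simplex_etf(4)
sharp = softmax_with_temperature([2.0, 1.0, 0.0], temperature=1.0)
soft = softmax_with_temperature([2.0, 1.0, 0.0], temperature=2.0)
```

In practice the temperature would be fitted on held-out data; here it is fixed by hand to show the effect on confidence.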

2606.02105 2026-06-02 cs.CV

Multimodal Action Diffusion for Robust End-to-End Autonomous Driving

多模态动作扩散用于鲁棒的端到端自动驾驶

Jorge Daniel Rodríguez-Vidal, Diego Porres, Gabriel Villalonga Pineda, Antonio M. López Peña

发表机构 * Computer Vision Center (CVC)(计算机视觉中心) Universitat Autònoma de Barcelona (UAB)(巴塞罗那自治大学)

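AI summary Proposes the Action Diffusion Transformer (ADT), which uses multimodal action modelling and Nearest Neighbour Matching to surpass the previous state of the art on the closed-loop Bench2Drive benchmark with ten times lower latency.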

Comments Preprint. June 1st, 2026. Corresponding author: Jorge Daniel Rodríguez-Vidal

AI中文摘要

端到端自动驾驶(E2E-AD)系统大多收敛于预测中间轨迹路点,将最终控制委托给具有GPS访问权限的手工控制器。直接控制信号预测(以端到端方式输出油门、转向和刹车)仍未被充分探索,且关键的是,动作多模态性在此类系统中的作用尚未被很好理解。我们认为,超越确定性单动作输出不仅是建模选择,更是驾驶性能、表示质量和训练稳定性的关键驱动因素。为验证这一点,我们引入了动作扩散变换器(ADT),这是一种无锚点扩散变换器,使用MSE目标训练,天然地对合理驾驶动作的多模态分布进行建模。ADT不承诺单一确定性命令,而是生成K个动作候选,并通过最近邻匹配(NNM)在推理时选择最合适的一个。除了强大的基准数值外,我们表明动作多模态性在学习表示和行为一致性方面带来了可衡量的好处,这些效果是确定性架构无法复制的。ADT在具有挑战性的闭环Bench2Drive基准上超越了先前最先进方法,同时实现了十倍更低的延迟,这表明表达性多模态动作建模对于鲁棒的端到端驾驶既实用高效又概念上必不可少。

英文摘要

End-to-End Autonomous Driving (E2E-AD) systems have largely converged on predicting intermediate trajectory waypoints, delegating final control to hand-crafted controllers with GPS access. Direct control-signal prediction (outputting throttle, steer and brake in an end-to-end fashion) remains underexplored, and critically, the role of action multimodality in such systems is not well understood. We argue that moving beyond deterministic, single-action outputs is not merely a modelling choice, but a key driver of driving performance, representational quality, and training stability. To validate this, we introduce the Action Diffusion Transformer (ADT), an anchor-free diffusion transformer trained with an MSE objective that natively models the multimodal distribution of plausible driving actions. Rather than committing to a single deterministic command, ADT generates K action candidates and selects the most suitable one at inference via Nearest Neighbour Matching (NNM). Beyond strong benchmark numbers, we show that action multimodality yields measurable benefits in learned representations and behavioral consistency, effects that deterministic architectures cannot replicate. ADT surpasses the previous state of the art on the challenging closed-loop Bench2Drive benchmark while achieving ten times lower latency, demonstrating that expressive, multimodal action modelling is both practically efficient and conceptually essential for robust end-to-end driving.

2606.02096 2026-06-02 cs.CV

WebSpline: Structure-Informed Splines for Real-Time 3D Gaussians from Monocular Videos

WebSpline:面向单目视频实时三维高斯的结构化样条

Jongmin Park, Jeonghwan Yun, Minh-Quan Viet Bui, Munchurl Kim

发表机构 * KAIST(韩国科学技术院)

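AI summary Proposes the WebSpline framework, which uses a Structure-Informed Spline (SIS) representation and a Structural Proxy Graph (SPG) to achieve real-time, high-fidelity, structurally coherent dynamic 3D Gaussian reconstruction from monocular video.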

Comments The first two authors contributed equally to this work (equal contribution). Please visit our project page at https://kaist-viclab.github.io/webspline-site/

AI中文摘要

从单目视频进行动态场景重建仍然极具挑战性,现有方法在有限的视角线索下往往难以平衡全局结构一致性与局部细节。为解决这一问题,我们提出WebSpline,一种新颖的动态三维高斯框架,能够从单目视频中实现结构连贯且高保真的重建,并支持快速渲染。WebSpline的核心是结构信息样条(SIS)表示,它使用可学习的三次埃尔米特样条对每个动态高斯轨迹进行建模,其运动通过辅助的结构代理图(SPG)进行结构化组织。所提出的框架分两个阶段进行优化:(i)第一阶段,从二维点轨迹初始化SPG,并通过时间刚性正则化进行细化,以建立序列中运动物体的结构连贯性;(ii)第二阶段,从细化后的SPG初始化SIS表示,并在空间和结构邻域约束下进行优化。推理时,仅通过评估学习到的SIS即可获得高斯运动,从而实现快速渲染。在具有挑战性的单目动态场景基准iPhone和NVIDIA上的大量实验表明,我们的WebSpline达到了最先进的渲染质量,同时在iPhone数据集上渲染速度比第二名WorldTree快10倍以上。

英文摘要

Dynamic scene reconstruction from monocular videos remains highly challenging, as existing methods often struggle to balance global structural coherence and local fine-grained details under limited multi-view cues. To address this challenge, we propose WebSpline, a novel dynamic 3D Gaussian framework that enables structurally coherent and high-fidelity reconstruction from monocular videos with fast rendering. The core of WebSpline is the Structure-Informed Spline (SIS) representation, which models each dynamic Gaussian trajectory using a learnable cubic Hermite spline whose motion is structurally organized with an auxiliary Structural Proxy Graph (SPG). The proposed framework is optimized in two stages: (i) in the first stage, the SPG is initialized from 2D point tracks and refined with temporal rigidity regularization to establish structural coherence for moving objects across the sequence; and (ii) in the second stage, the SIS representation is initialized from the refined SPG and optimized under both spatial and structural neighborhood constraints. At inference, Gaussian motion is obtained solely by evaluating the learned SIS, enabling fast rendering. Extensive experiments on the challenging monocular dynamic scene benchmarks, iPhone and NVIDIA, demonstrate that our WebSpline achieves state-of-the-art rendering quality while rendering over 10 times faster than WorldTree, the second-best method on the iPhone dataset.
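Evaluating a cubic Hermite spline, the curve family used for each Gaussian trajectory, needs only the two endpoint positions and tangents of a segment. The sketch below uses the standard Hermite basis; the function name and the single-segment setup are illustrative assumptions, and in the paper the control values would be learned.

```python
def hermite(p0, m0, p1, m1, t0, t1, t):
    """Position at time t on a cubic Hermite segment.

    p0, p1: endpoint positions; m0, m1: endpoint tangents (velocities);
    t0, t1: segment start and end times, with t0 <= t <= t1.
    """
    dt = t1 - t0
    s = (t - t0) / dt  # normalised time in [0, 1]
    h00 = 2 * s**3 - 3 * s**2 + 1
    h10 = s**3 - 2 * s**2 + s
    h01 = -2 * s**3 + 3 * s**2
    h11 = s**3 - s**2
    return tuple(h00 * a + h10 * dt * b + h01 * c + h11 * dt * d
                 for a, b, c, d in zip(p0, m0, p1, m1))

# A segment from the origin to (1, 0, 0) with constant unit velocity along x.
start, end = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
velocity = (1.0, 0.0, 0.0)
midpoint = hermite(start, velocity, end, velocity, t0=0.0, t1=1.0, t=0.5)
```

Because evaluation is a handful of multiplications per Gaussian, obtaining motion at inference by evaluating the spline alone is cheap, which fits the fast-rendering claim.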

2606.02093 2026-06-02 cs.CL cs.AI cs.LG

The Role of Ambiguity in Error Prediction via Uncertainty Quantification

不确定性量化中模糊性在错误预测中的作用

Ieva Raminta Staliūnaitė, James Bishop, Andreas Vlachos

发表机构 * University of Cambridge(剑桥大学) The Alan Turing Institute(艾伦·图灵研究所)

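AI summary Improves error prediction for large language models on question answering by disentangling input ambiguity from the uncertainty signal, using Gated Experts and Selective Prediction.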

Comments 8 pages not including references and appendices, 3 figures

AI中文摘要

错误预测任务,即预测模型输出是否正确,通常通过不确定性量化(UQ)来解决。然而,虽然不确定性指标捕捉了模型缺乏知识或能力进行预测的情况,但它们也反映了模型输入和上下文中固有的偶然不确定性。本文提出了一种通过将输入模糊性与UQ信号解耦来改进大语言模型(LLM)错误预测的方法。我们在问答(QA)任务上使用六种UQ指标进行实验,结果表明,UQ指标在无歧义实例上的错误预测能力优于具有多个合理答案的问题。我们使用门控专家和选择性预测将真实和预测的模糊性标签纳入错误预测流程。我们发现,模糊性信息提高了跨模型家族、训练和评估范式、数据集(包括据称无歧义的数据集)以及偶然不确定性来源的错误预测分数,在标准数据集上对单个UQ指标的PRR提升超过10个百分点。

英文摘要

The task of Error Prediction, namely predicting whether a model output is correct, is commonly tackled with Uncertainty Quantification (UQ). However, while uncertainty metrics capture when models lack knowledge or capacity to make a prediction, they also reflect aleatoric uncertainty, which is inherent in the model input and context. This paper presents a method for improving error prediction for Large Language Models (LLMs), by disentangling input ambiguity from UQ signal. We conduct experiments on the task of Question Answering (QA) with six UQ metrics and show that UQ metrics are more predictive of errors on unambiguous instances than on questions with multiple plausible answers. We use Gated Experts and Selective Prediction to incorporate gold and predicted ambiguity labels into the error prediction pipeline. We find that ambiguity information improves error prediction scores across model families, training and evaluation paradigms, datasets (including allegedly unambiguous ones), and sources of aleatoric uncertainty, yielding improvements of over 10 points of PRR for individual UQ metrics on standard datasets.
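A very small illustration of gating on ambiguity: an ambiguous question can carry high uncertainty even when the answer is right, so an error predictor may want a different decision threshold for it. The thresholds, the `predict_error` function, and the toy records below are all invented for illustration and are not the paper's Gated Experts model or data.

```python
def predict_error(uncertainty, is_ambiguous, thr_unambiguous=0.5, thr_ambiguous=0.8):
    """Flag an answer as likely wrong, using a threshold chosen by the ambiguity label."""
    threshold = thr_ambiguous if is_ambiguous else thr_unambiguous
    return uncertainty > threshold

# Toy records: (uncertainty score, question is ambiguous, answer was actually wrong).
toy = [
    (0.20, False, False), (0.70, False, True),
    (0.60, True, False), (0.90, True, True),
    (0.70, True, False), (0.30, False, False),
    (0.60, False, True), (0.85, True, True),
]

def accuracy(predict):
    return sum(predict(u, a) == err for u, a, err in toy) / len(toy)

single_threshold = accuracy(lambda u, a: u > 0.5)  # ignores ambiguity
gated = accuracy(predict_error)                    # uses the ambiguity label
```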

2606.02079 2026-06-02 cs.CV

FACT: A Simple and Efficient Framework for Active Finetuning

FACT:一种简单高效的主动微调框架

Wenshuai Xu, You Song, Yuzhuo Cui, Minjie Ren, Qingjie Liu, Zhenghui Hu

发表机构 * Zhejiang (No. 2024C01020)(浙江(No. 2024C01020)) National Natural Science Foundation of China (No. 62302031)(中国国家自然科学基金委员会(No. 62302031)) Zhejiang Provincial Natural Science Foundation of China (Nos. LQ23F020024 and LZJMZ24D050009)(中国浙江省自然科学基金委员会(Nos. LQ23F020024 and LZJMZ24D050009))

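AI summary Full finetuning in active finetuning distorts pretrained features and causes overfitting. To address this, the paper proposes FACT, a three-phase hierarchical finetuning framework that uses frozen feature augmentation and parameter-efficient finetuning, significantly improving performance across datasets and architectures, with gains of over 20% at low sampling ratios.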

Comments ACCEPTED for publication as a REGULAR paper in the IEEE Transactions on Image Processing (T-IP)

AI中文摘要

主动微调的主要目标是通过使用精心挑选的信息性或挑战性数据对预训练模型进行微调,以提高其在特定任务或领域上的性能。先前的研究主要关注主动方面(即数据选择),同时统一采用全量微调进行模型适应,这不可避免地因分布偏移而扭曲预训练特征。当模型大小相对于微调数据量较大时,这个问题变得尤为突出,导致过拟合风险增加。为了解决这一关键差距,我们正式概述了FiAF任务,该任务强调在主动学习中系统探索微调方法。我们提出了FACT,一个三阶段分层微调框架,兼具高效性和简洁性,专门为主动微调场景设计。我们的综合实验涵盖:(1)三大数据集类别,包括经典(CIFAR10、CIFAR100、ImageNet-1k)、不平衡(CIFAR10-LT、CIFAR100-LT)和细粒度(StanfordCars、FGVCAircraft)图像分类数据集,每个在3-5种不同采样率下评估;(2)多样化的预训练架构,包括卷积神经网络(ConvNeXt)、视觉变换器(ViT)和视觉LSTM(ViL)网络;(3)对冻结特征增强(FroFA)策略的系统研究;(4)对效率和泛化性的全面严格分析。结果表明,我们的框架具有显著改进,并具备强大的泛化性和鲁棒性。值得注意的是,在低采样率下,我们的框架在CIFAR10、CIFAR100和ImageNet-1k基准测试中,ViT模型实现了超过20%的显著性能提升。这种系统性的方法在保持参数效率的同时建立了新的最先进性能,在标记数据稀缺时尤其有效。

英文摘要

The main goal of active finetuning is to improve a pretrained model's performance on a specific task or domain by finetuning it with carefully selected informative or challenging data. Previous research has predominantly focused on the active aspect (i.e., data selection) while uniformly employing full finetuning for model adaptation, which inevitably distorts pretrained features due to distribution shift. This issue becomes particularly pronounced when the model size is large relative to the finetuning data quantity, leading to heightened overfitting risks. To address this critical gap, we formally outline the FiAF task, which emphasizes systematic exploration of finetuning methodologies in active learning. We propose FACT, a three-phase hierarchical finetuning framework featuring both efficiency and simplicity, specifically designed for active finetuning scenarios. Our comprehensive experiments span: (1) three major dataset categories encompassing classic (CIFAR10, CIFAR100, ImageNet-1k), imbalanced (CIFAR10-LT, CIFAR100-LT), and fine-grained (StanfordCars, FGVCAircraft) image classification datasets, each evaluated under 3-5 distinct sampling ratios; (2) diverse pretrained architectures including Convolutional Neural Network (ConvNeXt), Vision Transformer (ViT), and Vision LSTM (ViL) networks; (3) a systematic investigation of frozen feature augmentation (FroFA) strategies; and (4) a comprehensive and rigorous analysis of efficiency and generalizability. The results demonstrate significant improvements with strong generalization and robustness. Notably, under low sampling ratios, our framework achieves remarkable performance gains of over 20% on the ViT model for CIFAR10, CIFAR100, and ImageNet-1k benchmarks. This systematic approach establishes new state-of-the-art performance while maintaining parameter efficiency, proving particularly effective when labeled data is scarce.

2606.02078 2026-06-02 cs.LG

Beyond $\ell_2$-norm and $\ell_\infty$-norm: A Curvature-Inspired $\ell_p$-Norm Scheme for Deep Neural Networks

超越ℓ2范数和ℓ∞范数:一种受曲率启发的深度神经网络ℓp范数方案

Jianhao Xu, Zhuang Yang

发表机构 * School of Computer Science and Technology, Soochow University(苏州大学计算机科学与技术学院)

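AI summary Existing optimizers adapt poorly when curvature varies widely across parameter dimensions. This work proposes an $\ell_p$-norm scheme with a dynamic $p$ and incorporates it into SGD and SGDM, yielding the LPSGD and LPSGDM optimizers. A large $p$ suppresses high-curvature directions early in training, and $p$ is then reduced with cosine annealing for stable updates. The paper proves an $O(T^{-1/2})$ convergence rate in the nonconvex setting and reports improved generalization on the CIFAR and ImageNet datasets.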

AI中文摘要

现有的深度神经网络(DNN)优化器通常依赖于ℓ2范数或ℓ∞范数,导致优化器不能很好地适应参数维度上曲率的显著变化。通常,DNN的训练过程在早期表现出强烈的曲率各向异性,而在后期,DNN的训练过程趋向于向各向异性较弱的平坦区域移动。特别地,基于ℓ2范数的优化器通常由高曲率方向主导,限制了优化器沿较低曲率方向的更新,从而导致收敛速度较慢。而基于ℓ∞范数的优化器由于坐标方向更新幅度相同,在平坦区域容易产生振荡。为了解决ℓ2和ℓ∞范数产生的这两种极端情况,我们提出了一种具有动态p值的新型ℓp范数方案,并将其融入随机梯度下降(SGD)和带动量的SGD(SGDM)中,从而得到两种具有更好泛化性能的新型优化器:ℓp-SGD(LPSGD)和ℓp-SGDM(LPSGDM)。特别地,所得到的优化器通过使用较大的p(p>2)来抑制早期高曲率方向的支配地位,随后将p逐渐减小至2以实现更稳定和精细的更新,其中后一过程受余弦退火策略启发。我们建立了所得到算法的理论保证,并分析了LPSGD和LPSGDM在非凸情形下均达到O(T^{-1/2})的收敛率。在基准数据集(包括CIFAR-10、CIFAR-100和ImageNet-1K)上,使用多种DNN(如VGG-11、ResNet-18和ResNet-50)进行了大量实验。

英文摘要

Existing optimizers for deep neural networks (DNNs) typically rely on either the $\ell_2$ norm or the $\ell_\infty$ norm, and as a result do not adapt well to substantial changes in curvature across parameter dimensions. DNN training often exhibits strong curvature anisotropy in the early period, whereas in the later period it tends to move toward flatter regions with weaker anisotropy. In particular, optimizers based on the $\ell_2$ norm are usually dominated by high-curvature directions, which restricts updates along lower-curvature directions and thus slows convergence. Optimizers based on the $\ell_\infty$ norm, in contrast, are prone to oscillations in flatter regions because their coordinate-wise updates all have the same magnitude. To address these two extreme cases, we propose a novel $\ell_p$-norm scheme with a dynamic value of $p$ and incorporate it into stochastic gradient descent (SGD) and SGD with momentum (SGDM), leading to two novel optimizers with better generalization performance: $\ell_p$-SGD (LPSGD) and $\ell_p$-SGDM (LPSGDM). The resulting optimizers suppress the dominance of high-curvature directions in the early period by using a large $p$ ($p>2$), followed by a gradual decrease of $p$ toward 2 to enable more stable and refined updates, where the decrease is motivated by the cosine annealing strategy. We establish theoretical guarantees for the resulting algorithms and show that both LPSGD and LPSGDM achieve an $O(T^{-1/2})$ convergence rate in the nonconvex setting. Extensive experiments are conducted on benchmark datasets, including CIFAR-10, CIFAR-100, and ImageNet-1K, with multiple DNNs such as VGG-11, ResNet-18, and ResNet-50.
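The idea of annealing $p$ from a large value down to 2 can be sketched with the textbook steepest-descent direction under an $\ell_p$ norm. The abstract does not give the exact LPSGD update, so the direction formula, the `p_schedule` function, and the default `p_max=8.0` are assumptions chosen for illustration.

```python
import math

def p_schedule(step, total_steps, p_max=8.0, p_min=2.0):
    """Cosine-annealed p: starts at p_max and ends at p_min."""
    return p_min + 0.5 * (p_max - p_min) * (1.0 + math.cos(math.pi * step / total_steps))

def lp_steepest_direction(grad, p):
    """Unit-l_p-norm direction d that maximises the inner product with grad.

    With q the dual exponent (1/p + 1/q = 1):
    d_i = sign(g_i) * (|g_i| / ||g||_q) ** (q - 1).
    p = 2 gives the normalised gradient; large p approaches sign(g).
    """
    q = p / (p - 1.0)
    norm_q = sum(abs(g) ** q for g in grad) ** (1.0 / q)
    if norm_q == 0.0:
        return [0.0] * len(grad)
    return [math.copysign((abs(g) / norm_q) ** (q - 1.0), g) for g in grad]

grad = [1.0, 100.0]  # one flat and one steep coordinate
early = lp_steepest_direction(grad, p_schedule(0, 100))    # p = 8
late = lp_steepest_direction(grad, p_schedule(100, 100))   # p = 2
```

With a large $p$ the steep coordinate dominates the update far less than it does at $p=2$, which matches the abstract's description of suppressing high-curvature directions early in training.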

2606.02068 2026-06-02 cs.CV cs.AI

Fast and Lightweight Novel View Synthesis with Differentiable Multiplane Image

基于可微多平面图像的快速轻量级新视角合成

Kaidi Zhang, Guanxu Zhu

发表机构 * Universiti Malaya(马来大学) Wuhan University(武汉大学)

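AI summary To address the shortcomings of existing methods in speed, model size, and sparse-view settings, proposes a fast and lightweight novel view synthesis method based on a differentiable Multiplane Image (MPI), which uses point maps for geometric initialization and introduces one-step diffusion to handle holes and artifacts.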

AI中文摘要

近年来,新视角合成取得了显著进展,主流方法如神经辐射场(NeRF)和3D高斯泼溅(3DGS)产生了令人印象深刻的结果。然而,这些方法往往难以平衡渲染速度和模型大小,且其基于优化的训练可能非常耗时。此外,它们通常依赖于密集观测,在稀疏视角条件下往往无法产生令人满意的结果。尽管前馈重建显著减少了3DGS的优化时间,但其像素对齐公式从单张图像生成数百万个高斯,严重限制了其在移动设备上的实际部署。为了解决这些限制,我们重新审视了多平面图像(MPI)表示,该表示使用一组紧凑的平面层来表示场景,以实现高效的新视角合成。利用视觉基础模型的最新进展,我们使用预测的点图进行可靠的几何初始化,然后进行可微优化。为了解决稀疏初始化MPI中的空洞和伪影问题,我们引入了一步扩散,该扩散既参与MPI的可微优化,也参与渲染结果的后处理。与代表性的基于GS的方法相比,我们的方法速度快30.7%,模型大小仅为其14.8%,同时在前景场景中实现了具有竞争力的合成质量。

英文摘要

Recently, novel view synthesis has witnessed remarkable progress, with mainstream methods such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) delivering impressive results. However, these approaches often struggle to balance rendering speed and model size, and their optimization-based training can be highly time-consuming. Furthermore, they typically rely on dense observations, often failing to produce satisfactory results under sparse-view conditions. Although feed-forward reconstruction significantly reduces the optimization time of 3DGS, its pixel-aligned formulation generates millions of Gaussians from a single image, severely limiting its practical deployment on mobile devices. To address these limitations, we revisit the Multiplane Image (MPI) representation, which represents scenes using a compact set of planar layers for efficient novel view synthesis. Leveraging recent advances in visual foundation models, we utilize predicted point maps for reliable geometric initialization, followed by differentiable optimization. To address the issues of holes and artifacts in sparsely initialized MPI, we introduce one-step diffusion, which participates in both the differentiable optimization of MPI and the postprocessing of rendering results. Compared with a representative GS-based method, our approach is 30.7% faster and uses only 14.8% of its model size, while achieving competitive synthesis quality on front-view scenarios.
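An MPI is rendered by alpha-compositing its planar layers from back to front with the standard "over" operator. The sketch below does this for a single pixel; warping the planes into a new view, which a full renderer also needs, is omitted, and the two-layer example is made up.

```python
def composite_mpi(layers):
    """Back-to-front 'over' compositing for one pixel.

    layers: list of (rgb, alpha) pairs ordered from the farthest plane
    to the nearest; rgb is a 3-tuple and alpha lies in [0, 1].
    """
    out = (0.0, 0.0, 0.0)
    for color, alpha in layers:
        out = tuple(alpha * c + (1.0 - alpha) * o for c, o in zip(color, out))
    return out

# An opaque red far plane seen through a half-transparent blue near plane.
pixel = composite_mpi([
    ((1.0, 0.0, 0.0), 1.0),
    ((0.0, 0.0, 1.0), 0.5),
])
```

Every step is a differentiable blend, which is what allows the layer colors and alphas to be optimized by gradient descent.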

2606.02061 2026-06-02 cs.LG

Ablating Archetypes: The Stability of Archetypal SAEs is an Artifact of Initialization and Metric Design

消除原型:原型SAE的稳定性是初始化和度量设计的人为产物

Michał Brzozowski, Neo Christopher Chung

发表机构 * Samsung AI Center(三星人工智能中心) University of Warsaw(华沙大学)

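AI summary Shows experimentally that the stability claimed for archetypal sparse autoencoders stems from using identical initialization across training runs rather than from the archetypal constraint itself, and stresses that the distinction between stability and stabilization matters for interpretability research.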

AI中文摘要

使用稀疏自编码器(SAE)的字典学习从神经网络激活中产生过完备基,这些基通常是可解释的,并减少了多义性。然而,不同随机种子的SAE特征差异很大——这个问题被称为不稳定性。原型SAE(Fel等人,2025)被提出作为一种通用的字典学习干预,用于更可靠的概念提取,并报告在训练结束时字典更稳定。我们证明原型SAE声称的稳定性是在多次运行中设置相同初始化的结果。通过我们的分析,我们试图澄清机械可解释性中可能模糊使用的两个不同概念:稳定性是两个独立训练模型之间的一致性,而稳定化是独立初始化的运行向共同解收敛。这种区分对于自然语言处理(NLP)的机械可解释性至关重要,其中特征稳定性越来越多地被用作SAE特征是可重用分析单元的证据。原型SAE的实验共享一个确定性的k-means解码器初始化,在训练开始前将运行间字典距离设为零。当移除这种初始化时,原型约束在我们的设置中没有提供稳定化优势。我们进一步发现了一个依赖于预处理的余弦几何问题,使端点稳定性指标的解释复杂化。总的来说,我们的研究支持在更大的字典学习传统中研究SAE的价值,同时表明稳定性声明需要轨迹诊断和初始化消融。

英文摘要

Dictionary learning with sparse autoencoders (SAEs) produces overcomplete bases from neural network activations that are often interpretable and reduce polysemanticity. However, features from SAEs vary substantially across random seeds -- a problem known as instability. Archetypal SAEs (Fel et al., 2025) were proposed as a general dictionary-learning intervention for more reliable concept extraction, and report more stable dictionaries at the end of training. We demonstrate that the stability claimed by archetypal SAEs is a result of setting identical initialization across multiple runs. Through our analyses, we attempt to clarify two distinct notions in mechanistic interpretability that may be ambiguously used: stability is agreement between two independently trained models, whereas stabilization is the convergence of independently initialized runs toward a common solution. This distinction is critical for mechanistic interpretability of natural language processing (NLP), where feature stability is increasingly used as evidence that SAE features are reusable units of analysis. The archetypal SAE experiments share a deterministic k-means decoder initialization, setting inter-run dictionary distance to zero before training begins. When this initialization is removed, the archetypal constraint provides no stabilization advantage in our setting. We further identify a preprocessing-dependent cosine geometry issue that complicates interpretation of endpoint stability metrics. Overall, our study supports the value of studying SAEs within the larger dictionary-learning tradition while showing that stability claims require trajectory diagnostics and initialization ablations.
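The initialization point can be checked with a simple inter-run dictionary distance. The sketch below uses one minus the mean best-match cosine similarity, a common choice but not necessarily the metric used in the paper; random Gaussian dictionaries stand in for decoder weights at step zero, and all names are hypothetical.

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def dictionary_distance(d1, d2):
    """1 - mean over atoms of d1 of the best cosine match in d2."""
    best = [max(cosine(a, b) for b in d2) for a in d1]
    return 1.0 - sum(best) / len(best)

def random_dictionary(seed, n_atoms=8, dim=16):
    """Stand-in for a decoder initialisation drawn with the given seed."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_atoms)]

# Two runs sharing one deterministic initialisation vs. two independent ones.
shared_init = dictionary_distance(random_dictionary(0), random_dictionary(0))
independent_init = dictionary_distance(random_dictionary(0), random_dictionary(1))
```

Runs that share an initialization start at distance zero, so a small end-of-training distance alone does not show that the runs converged from independent starting points.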

2606.02058 2026-06-02 cs.CV cs.RO

TIDES: Time-Derivative Event Simulation via Deformable Reconstruction

Christopher Thirgood, Dipon Kumar Ghosh, Simon Hadfield

Affiliations: University of Surrey

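AI summary: Proposes TIDES, a continuous-time event simulator built on dynamic Gaussian splatting that derives per-pixel intensity dynamics from an explicit 3D scene representation to predict threshold crossings accurately, and uses occlusion to guide adaptive time stepping, achieving state-of-the-art event-stream fidelity.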

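AI-generated abstract (translated from Chinese)

Event cameras emit asynchronous events in response to environmental appearance changes. The scarcity of real-world event datasets makes simulation essential. However, most simulators infer event timestamps from frame sequences, forcing many threshold crossings to share a small set of discrete times; we call this failure mode timestamp batching, and it worsens under fast motion and occlusion. We propose TIDES, a continuous-time event simulator built on dynamic Gaussian splatting. Because TIDES operates on an explicit 3D scene representation with learned geometry and motion, it can derive per-pixel intensity dynamics directly from the scene rather than by differencing rendered frames. This enables accurate prediction of threshold crossings, including multiple crossings per rendering step, without temporal upsampling or frame interpolation. The same 3D scene model reveals where objects partially occlude one another; TIDES uses this to guide adaptive time stepping, concentrating computation only in regions where occlusion dynamics make simple brightness-change models unreliable. Finally, we model finite sensor bandwidth with a tile-level arbiter whose throughput, jitter, and event drops reproduce realistic sensor artifacts. On paired RGB-event benchmarks, TIDES achieves state-of-the-art event-stream fidelity. We also show that events simulated by TIDES transfer more effectively to real downstream tasks than those of competitors.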

English abstract

Event cameras emit asynchronous events in response to environmental appearance changes. The scarcity of real-world event datasets makes simulation essential. However, most simulators infer event timestamps from frame sequences, forcing many threshold crossings to share a small set of discrete times. We term this failure mode timestamp batching; it worsens under fast motion and occlusion. We present TIDES, a continuous-time event simulator built on dynamic Gaussian splatting. Because TIDES operates on an explicit 3D scene representation with learnt geometry and motion, it can derive per-pixel intensity dynamics directly from the scene, rather than by differencing rendered frames. This enables accurate threshold-crossing prediction, including multiple crossings per rendering step, without temporal upsampling or frame interpolation. The same 3D scene model reveals where objects partially occlude one another; TIDES uses this to guide adaptive time stepping, concentrating computation only in regions where occlusion dynamics make simple models of brightness change unreliable. Finally, we model finite sensor bandwidth using a tile-level arbiter whose throughput, jitter, and event drops reproduce realistic sensor artifacts. Across paired RGB-event benchmarks, TIDES attains state-of-the-art event-stream fidelity. We also show that events simulated by TIDES transfer more effectively to real downstream tasks than those simulated by competing methods.
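Timestamp batching can be made concrete with a small sketch. It assumes a first-order model in which a pixel's log-intensity changes at a constant `rate` within one rendering step and the reference level is reset at the start of the step; the function names and parameters are hypothetical, and the scene-derived dynamics, occlusion handling, and sensor arbiter described in the abstract are not reproduced here.

```python
import math


def continuous_events(rate, t_start, t_end, threshold):
    """Threshold crossings of a pixel whose log-intensity changes at a constant
    `rate` over [t_start, t_end], with the reference level reset at t_start.

    Returns a list of (timestamp, polarity) pairs, one per crossing.
    """
    if rate == 0.0:
        return []
    polarity = 1 if rate > 0 else -1
    time_per_crossing = threshold / abs(rate)
    n_crossings = math.floor((t_end - t_start) / time_per_crossing)
    return [(t_start + k * time_per_crossing, polarity)
            for k in range(1, n_crossings + 1)]


def frame_batched_events(rate, t_start, t_end, threshold):
    """Same crossings, but stamped at the frame time, as happens when events
    are inferred by differencing two rendered frames."""
    return [(t_end, polarity)
            for _, polarity in continuous_events(rate, t_start, t_end, threshold)]


# A fast-changing pixel crosses the threshold twice within one 0.1 s step.
events = continuous_events(rate=5.0, t_start=0.0, t_end=0.1, threshold=0.2)
batched = frame_batched_events(rate=5.0, t_start=0.0, t_end=0.1, threshold=0.2)

print(events)   # two events with distinct timestamps inside the step
print(batched)  # two events sharing the frame timestamp
```

With an analytic rate available, each crossing receives its own timestamp inside the step; the frame-differencing variant produces the same number of events but collapses them onto one time.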

2606.02054 2026-06-02 cs.AI

eMoT: evolving Memory-of-Thought via Symbolic Anchoring and Memory Corrosion

Xiang Li, Jiwei Wei, Ke Liu, Yitong Qin, Jinyu Guo, Malu Zhang, Peng Wang, Yang Yang

Affiliations: Center for Future Media, University of Electronic Science and Technology of China

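AI summary: Proposes the eMoT framework, which treats reasoning trajectories as dynamically evolving memory through three modules (memory corrosion, symbolic anchoring, and consistency-driven refinement) to stabilize multi-step reasoning and improve accuracy and consistency.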

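AI-generated abstract (translated from Chinese)

Although large language models (LLMs) achieve impressive performance on multi-step reasoning tasks, their reliability remains hampered by key limitations such as unconstrained hallucination and poor numerical computation. Fundamentally, these problems arise because standard models treat reasoning as a one-off, transient generation process rather than retaining and refining successful procedural logic. To address these challenges, we propose eMoT (evolving Memory-of-Thought), a unified framework that stabilizes multi-step reasoning by treating reasoning trajectories as dynamically evolving memories rather than static templates. The framework consists mainly of three interconnected modules: (i) a memory corrosion mechanism that reinforces high-utility reasoning structures while gradually decaying less frequently used ones; (ii) a symbolic anchoring engine that uses Python for deterministic computation, much as a person uses a calculator; and (iii) a consistency-driven refinement process that aligns neural reasoning with symbolic results, reducing the accumulation of logical discrepancies. Across multiple reasoning benchmarks, eMoT improves accuracy and solution consistency over standard chain-of-thought and structured reasoning baselines. On the traditional task Game of 24, eMoT reaches 100% accuracy, exceeding the baseline by 17.6%. Evaluations on the mathematical tasks GSM8K, ASDiv, SVAMP, and MGSM further show sustained improvements in multi-step mathematical reasoning. In our evaluation, we achieve superior performance despite using a lightweight backbone model with limited baseline capability. Compared with alternatives that rely on large-scale models, our results indicate that the gains are fundamentally driven by the eMoT framework's reasoning control rather than by model scale alone.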

English abstract

While Large Language Models (LLMs) achieve impressive performance on multi-step reasoning tasks, their reliability is persistently hindered by critical limitations such as unconstrained hallucinations and poor numerical computation. Fundamentally, these issues arise because standard models treat reasoning as a transient, one-off generation process rather than retaining and refining successful procedural logic. To address these challenges, we propose eMoT (evolving Memory-of-Thought), a unified framework that stabilizes multi-step reasoning by treating reasoning trajectories as dynamic, evolving memories rather than static templates. The framework primarily consists of three interconnected modules: (i) a memory corrosion mechanism that reinforces high-utility reasoning structures while gradually decaying less frequent ones; (ii) a symbolic anchoring engine that utilizes Python for deterministic computation, much like a human uses a calculator; and (iii) a consistency-driven refinement process that aligns neural inference with symbolic outcomes, reducing the accumulation of logical discrepancies. Across multiple reasoning benchmarks, eMoT improves accuracy and solution consistency over standard Chain-of-Thought and structured reasoning baselines. On the traditional task Game of 24, eMoT achieves 100% accuracy, surpassing the baseline by up to 17.6%. Evaluations on the mathematical tasks GSM8K, ASDiv, SVAMP, and MGSM further show consistent gains in multi-step mathematical reasoning. In our evaluation, we achieve superior performance despite utilizing a lightweight backbone model with constrained baseline capabilities. Compared to alternative methods that rely on massively scaled models, our results demonstrate that the performance gains are fundamentally driven by the eMoT framework's reasoning control rather than sheer model size.
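Two of the three modules can be sketched in a few lines. The abstract does not give the update rule for memory corrosion, so the multiplicative `decay`, additive `reward`, and pruning `floor` below are hypothetical choices, as are the class and function names. The symbolic anchoring part is shown only as exact arithmetic in Python via `fractions.Fraction`; the consistency-driven refinement module is not covered.

```python
import ast
import operator
from fractions import Fraction


class CorrodingMemory:
    """Utility-weighted store of reasoning templates (hypothetical update rule)."""

    def __init__(self, decay=0.9, reward=1.0, floor=0.05):
        self.decay = decay      # multiplicative decay applied every step
        self.reward = reward    # bonus added to the template that was used
        self.floor = floor      # entries below this utility are forgotten
        self.utility = {}

    def step(self, used_key):
        """Decay all entries, reinforce the one just used, prune weak ones."""
        for key in self.utility:
            self.utility[key] *= self.decay
        self.utility[used_key] = self.utility.get(used_key, 0.0) + self.reward
        self.utility = {k: u for k, u in self.utility.items() if u >= self.floor}


_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def evaluate_exact(expression):
    """Evaluate +, -, *, / over integer literals exactly, without eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return Fraction(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval").body)


# A template used once and then ignored eventually decays below the floor.
memory = CorrodingMemory()
memory.step("fraction-trick")
memory.step("brute-force")
for _ in range(30):
    memory.step("fraction-trick")

# A Game of 24 candidate checked with exact rational arithmetic.
answer = evaluate_exact("8 / (3 - 8 / 3)")
print(sorted(memory.utility), answer)
```

Exact rational arithmetic is used because a Game of 24 solution such as 8 / (3 - 8 / 3) passes through non-integer intermediate values, where floating-point rounding may cause an equality check against 24 to fail.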

2606.02049 2026-06-02 cs.AI

Explainable Data-driven Deep Reinforcement Learning Methods for Optimal Energy Management in Buildings

Hallah Shahid Butt, Qiong Huang, Gökhan Demirel, Kevin Förderer, Erfan Tajalli-Ardekani, Simnon Waczowicz, Luigi Spatafora, Veit Hagenmeyer, Benjamin Schäfer

Affiliations: Karlsruhe Institute of Technology

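AI summary: Proposes an explainable deep reinforcement learning framework that trains on-policy and off-policy algorithms on real-world data and uses post-hoc explanation techniques to reveal the battery-management decision process, reducing costs and improving transparency.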

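AI-generated abstract (translated from Chinese)

The growing adoption of renewable energy in power systems, particularly in buildings equipped with photovoltaic panels and energy storage systems, introduces significant complexity into energy systems. Fluctuating generation, varying electricity prices, and additional entities such as photovoltaic systems and heat pumps add complexity and make the system harder to operate. This creates demand for additional control and optimization approaches, including data-based control such as reinforcement learning. Although deep reinforcement learning has become a promising solution for optimizing building operation in dynamic and increasingly complex environments, its black-box nature hinders user trust and practical adoption. This paper proposes an explainable deep reinforcement learning framework applied to energy management in residential buildings. We demonstrate its use on synthetic data and on real-world data from the KIT Living Lab Energy Campus. We train and compare on-policy and off-policy DRL agents on an extended state space that includes real-time measurements (demand, photovoltaic generation, battery power, state of charge), external signals (dynamic electricity price, local weather data), calendar and holiday indicators, and demand and price forecasts. Our experimental results show that on-policy algorithms, particularly Advantage Actor-Critic and Proximal Policy Optimization, outperform off-policy methods in cumulative reward and policy stability. To explain these models, we use post-hoc explanation techniques to elucidate the learned control policies. Our findings show that the XRL framework not only lowers electricity costs through optimal battery management but also provides transparent, actionable insight into the agent's decision-making process.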

English abstract

The increasing integration of renewable energy sources into power systems, particularly in buildings equipped with photovoltaic (PV) panels and energy storage systems, introduces significant complexity in energy systems. Volatile power generation, varying electricity tariffs, and a growing number of entities, e.g., PV systems and heat pumps, have increased the complexity and made the system harder to operate. This leads to the demand for additional control and optimization approaches, including data-based controls such as reinforcement learning. While deep reinforcement learning (DRL) has emerged as a promising solution to optimize building operations in dynamic and ever more complex environments, its black-box nature impedes user trust and practical adoption. This paper presents a framework for explainable deep reinforcement learning (XRL) applied to energy management in residential buildings. We demonstrate its usage on both synthetic data and real-world data from the Living Lab Energy Campus (LLEC) at KIT. We train and compare both on-policy and off-policy DRL agents on an expanded state space that incorporates real-time measurements (demand, PV generation, battery power, state of charge), external signals (dynamic electricity price, local weather data), calendar and holiday indicators, and forecasts for demand and price. Our experimental results indicate that on-policy algorithms, particularly Advantage Actor Critic (A2C) and Proximal Policy Optimization (PPO), outperform off-policy methods in terms of cumulative rewards and policy stability. To explain these models, we employ post-hoc interpretation techniques to elucidate the learned control policies. Our findings demonstrate that the XRL framework not only reduces electricity costs through optimal battery management, but also provides transparent, actionable insights into the agent's decision-making process.
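The kind of control problem described here can be sketched as a single environment transition. Everything below is an assumption made for illustration: the `BatteryState` fields, the `step` signature, the reward defined as negative grid-import cost, and the omission of efficiency losses and feed-in revenue are not taken from the paper, whose environment and reward design are not given in the abstract.

```python
from dataclasses import dataclass


@dataclass
class BatteryState:
    soc_kwh: float               # energy currently stored
    capacity_kwh: float = 10.0   # hypothetical usable capacity
    max_power_kw: float = 5.0    # hypothetical charge/discharge rating


def step(state, action_kw, demand_kw, pv_kw, price_per_kwh, dt_h=1.0):
    """Apply a battery power action (positive = charge) for one time step.

    Returns the next battery state and a reward equal to the negative cost
    of energy imported from the grid during the step.
    """
    # Respect the power rating of the battery.
    power = max(-state.max_power_kw, min(state.max_power_kw, action_kw))
    # Respect the energy that can still be stored or withdrawn.
    max_charge_kw = (state.capacity_kwh - state.soc_kwh) / dt_h
    max_discharge_kw = state.soc_kwh / dt_h
    power = max(-max_discharge_kw, min(max_charge_kw, power))

    next_state = BatteryState(state.soc_kwh + power * dt_h,
                              state.capacity_kwh, state.max_power_kw)
    grid_import_kw = demand_kw - pv_kw + power
    cost = price_per_kwh * max(grid_import_kw, 0.0) * dt_h
    return next_state, -cost


# Discharging 3 kW covers most of a 4 kW demand when PV output is zero.
state_1, reward_1 = step(BatteryState(soc_kwh=5.0), action_kw=-3.0,
                         demand_kw=4.0, pv_kw=0.0, price_per_kwh=0.5)
# A charge request beyond the remaining capacity is clipped.
state_2, reward_2 = step(BatteryState(soc_kwh=8.0), action_kw=10.0,
                         demand_kw=1.0, pv_kw=4.0, price_per_kwh=0.5)
print(state_1.soc_kwh, reward_1, state_2.soc_kwh, reward_2)
```

An agent such as A2C or PPO would observe a richer state (prices, weather, calendar indicators, forecasts) and choose `action_kw` at each step; the sketch only shows how an action translates into a state of charge and a cost signal.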