arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.08863 2026-05-14 cs.CL cs.LG

Max-pooling Network Revisited: Analyzing the Role of Semantic Probability in Multiple Instance Learning for Hallucination Detection

Shota Fujikawa, Issei Sato

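Affiliations: Department of Computer Science, The University of Tokyo

AI summary: This paper revisits max-pooling networks for hallucination detection and analyzes the role of semantic probability in multiple instance learning. The analysis shows that scaling internal states by semantic consistency enlarges the decision margin. Building on this finding, the authors propose an efficient classifier that aggregates token-level features via max pooling and estimates sentence scores directly with a lightweight MLP. It needs no costly semantic similarity computation, so it stays competitive while substantially improving computational efficiency.

Abstract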

Hallucination detection has become increasingly important for improving the reliability of large language models (LLMs). Recently, hybrid approaches such as HaMI, which combine semantic consistency with internal model states via Multiple Instance Learning (MIL), have achieved state-of-the-art performance. However, these methods incur substantial computational overhead due to repeated sampling and costly semantic similarity computations. In this work, we first provide a theoretical analysis of HaMI in terms of decision margins, revealing that scaling internal states with semantic consistency leads to an enlarged decision margin. Motivated by this insight, we revisit classical sentence classification models from a margin enlargement perspective, aggregating token-level features via max pooling and directly estimating sentence scores using a lightweight MLP. Without requiring semantic consistency computations, our approach achieves substantial efficiency improvements while maintaining competitive performance with state-of-the-art baselines through adaptive aggregation of internal feature representations. Code is available at https://github.com/FUJI1229/Hallucination_Detection.
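
The aggregation step described in the abstract can be sketched in a few lines. This is a minimal illustration only, not the authors' implementation: the dimensions, the single hidden layer, and the random untrained weights are all assumptions made here.

```python
import math
import random

def max_pool(token_features):
    """Element-wise max over tokens: T vectors of length d -> one vector of length d."""
    return [max(column) for column in zip(*token_features)]

def mlp_score(pooled, w1, b1, w2, b2):
    """One hidden ReLU layer followed by a sigmoid; returns a score in (0, 1)."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, pooled)) + b)
              for row, b in zip(w1, b1)]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical toy sizes and random (untrained) weights, for illustration only.
rng = random.Random(0)
d, h, T = 4, 3, 5
tokens = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(T)]
w1 = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(h)]
b1 = [0.0] * h
w2 = [rng.uniform(-1, 1) for _ in range(h)]
b2 = 0.0

pooled = max_pool(tokens)                      # sentence-level feature vector
score = mlp_score(pooled, w1, b1, w2, b2)      # sentence-level score
```

Max pooling lets the single most extreme token in each feature dimension determine the sentence representation, which is why one forward pass suffices and no repeated sampling is needed in this sketch.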

2605.08759 2026-05-14 cs.LG

MDL-GBG: A Non-parametric and Interpretable Granular-Ball Generation Method for Clustering

Zeqiang Xian, Caihui Liu, Yong Zhang, Wenjing Qiu, Duoqian Miao, Witold Pedrycz

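Affiliations: Department of Mathematics and Computer Science, Gannan Normal University; Key Laboratory of Data Science and Artificial Intelligence of Jiangxi Education Institutes, Gannan Normal University; Department of Computer Science and Technology, Tongji University; Department of Electrical and Computer Engineering, University of Alberta

AI summary: Existing granular-ball generation methods rely mainly on handcrafted quality measures and heuristic splitting or stopping criteria, which can weaken the transparency of local generation decisions in clustering. This paper proposes MDL-GBG, a non-parametric and interpretable granular-ball generation method based on the Minimum Description Length principle. It recasts granular-ball generation as a local model selection problem: a single-ball model, a two-ball model, and a core-ball-plus-residual model are compared, and the one with the shortest description length decides whether a ball is retained, split, or has its residual peeled off. Experiments show that the stable granular balls produced by MDL-GBG improve clustering performance and outperform existing methods on several evaluation metrics.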

Comments 13 pages, 5 figures, 4 tables. Revised version with updated computational complexity analysis, experiments, and discussion. The implementation was optimized to reduce redundant computation and improve efficiency; experimental results and related descriptions were updated accordingly

Abstract

Existing granular-ball generation methods are still mainly driven by handcrafted quality measures and heuristic splitting or stopping criteria, which may weaken the transparency of local generation decisions in clustering. To address this issue, this paper proposes Minimum Description Length based Granular-Ball Generation (MDL-GBG), a non-parametric and interpretable granular-ball generation method for clustering. MDL-GBG reformulates granular-ball generation as a local model selection problem under the Minimum Description Length principle. For each granular ball, three candidate explanations are compared, namely a single-ball model, a two-ball model, and a core-ball-plus-residual model, and the model with the shortest description length is selected. In this way, ball retention, splitting, and residual peeling are unified within a common coding-theoretic framework. A residual reassignment mechanism is further introduced to re-evaluate peeled-off boundary samples after stable granular-balls are formed. Experiments on 20 UCI datasets show that the stable granular-balls generated by MDL-GBG provide an effective upstream representation for clustering. In particular, MDL-GBG+AC achieves the best average ranks in ARI, ACC, and NMI among the compared methods. These results indicate that MDL-GBG offers a principled and interpretable alternative to heuristic granular-ball generation strategies.

2605.08541 2026-05-14 cs.LG

Tokens-per-Parameter Coverage Is Critical for Robust LLM Scaling Law Extrapolation

Joshua Shay Kricheli, Alexander Lawrence Reid, Soumajyoti Sarkar, Venkata Gandikota, Paulo Shakarian

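Affiliations: Syracuse University; Amazon AGI Foundations

AI summary: This study shows that fitting language-model scaling laws from runs at a fixed tokens-per-parameter (TPP) ratio makes parameter estimation ill-conditioned and degrades extrapolation. When the exponents governing parameter count $N$ and token count $D$ are close, the condition number of the least-squares problem grows sharply, and the scale coefficients become hard to estimate accurately. The authors derive a TPP-diversity threshold that guarantees well-conditioned estimation and show experimentally that non-collinear designs outperform the conventional approach across several datasets and precision modes.

Abstract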

Neural scaling laws approximate a language model's loss as a power-law function of parameter count $N$ and token count $D$. Following Chinchilla-style compute-optimal training, many studies fit scaling laws from runs performed under a fixed tokens-per-parameter (TPP) ratio $k$ and set $D = kN$. We show that this collinear design, combined with the empirically common near-equality of the exponents governing $N$ and $D$, induces an inherent ill-conditioning in the Gauss-Newton least-squares problem: the condition number of the design grows as the inverse square of the gap between the $N$- and $D$-exponents. The scale coefficients become practically unidentifiable, with confidence intervals inflating by an order of magnitude or more, yielding a "sloppy" model whose extrapolations degrade sharply off the training ray. We prove this for four scaling-law formalisms and derive a closed-form TPP-diversity threshold that is necessary and sufficient for well-conditioned estimation. Empirically, non-collinear designs outperform collinear ones on held-out splits with a 97.3% win rate across four laws, five corpora, and multiple floating-point precision modes. We further show the degeneracy is rooted in Jacobian geometry and is not an artifact of the loss function: any smooth estimation objective whose curvature involves the Jacobian inherits the same ill-conditioning.
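
The collinearity effect can be illustrated numerically. The sketch below assumes a Chinchilla-style additive law $L = E + A N^{-\alpha} + B D^{-\beta}$ (the abstract does not list its four formalisms) and looks only at the two Jacobian columns for the scale coefficients $A$ and $B$, normalized to unit length. The exponents, model sizes, and TPP values are made-up numbers chosen for illustration.

```python
import math

def gram_condition_number(runs, alpha, beta):
    """Condition number of the column-normalized 2x2 Gauss-Newton matrix J^T J,
    where J holds dL/dA = N**-alpha and dL/dB = D**-beta for each run (N, D)."""
    col_a = [n ** -alpha for n, d in runs]
    col_b = [d ** -beta for n, d in runs]
    dot = sum(a * b for a, b in zip(col_a, col_b))
    norm_a = math.sqrt(sum(a * a for a in col_a))
    norm_b = math.sqrt(sum(b * b for b in col_b))
    c = dot / (norm_a * norm_b)      # cosine of the angle between the two columns
    return (1.0 + c) / (1.0 - c)     # eigenvalues of [[1, c], [c, 1]] are 1 + c and 1 - c

sizes = [1e7, 3e7, 1e8, 3e8, 1e9]                              # hypothetical parameter counts N
collinear = [(n, 20 * n) for n in sizes]                       # fixed TPP: D = 20 N
diverse = [(n, k * n) for n in sizes for k in (5, 20, 80)]     # several TPP ratios per N

cond_collinear = gram_condition_number(collinear, alpha=0.34, beta=0.28)
cond_diverse = gram_condition_number(diverse, alpha=0.34, beta=0.28)
cond_small_gap = gram_condition_number(collinear, alpha=0.34, beta=0.334)
```

In this toy setup the fixed-TPP design is markedly worse conditioned than the TPP-diverse grid, and shrinking the exponent gap tenfold (0.06 to 0.006) inflates the condition number by roughly a hundredfold, in line with the inverse-square behaviour the abstract describes.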

2605.08504 2026-05-14 cs.CL

A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models

Zeru Shi, Zhenting Wang, Fan Yang, Qifan Wang, Ruixiang Tang

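Affiliations: Rutgers University; Wake Forest University; Meta AI

AI summary: This paper studies the origin of massive activations in large language models and identifies a Massive Emergence Layer (ME Layer) that appears consistently across models: the layer where massive activations first emerge before propagating to deeper layers through residual connections. Within this layer, the RMSNorm and FFN parameters jointly produce the massive activations. Once formed, these activations change little in later layers, which reduces the diversity of the hidden representations passed to the attention module. The authors propose a simple and effective method to reduce this rigidity; it improves model performance on multiple tasks and helps mitigate attention sinks.

Abstract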

We investigate the origins of massive activations in large language models (LLMs) and identify a specific layer, named the Massive Emergence Layer (ME Layer), that is consistently observed across model families, where massive activations first emerge and subsequently propagate to deeper layers through residual connections. We show that, within the ME Layer, both the RMSNorm and the FFN parameters jointly contribute to the emergence of massive activations. Once formed, the massive activation token representation remains largely invariant across layers, reducing the diversity of hidden representations passed to the attention module. Motivated by this limitation, we propose a simple and effective method to reduce the rigidity of the massive activation token. Our approach consistently improves LLM performance across multiple tasks, including instruction following and math reasoning, in both training-free and fine-tuning settings. Moreover, we show that our method mitigates attention sinks by selectively weakening their influence, elucidating their origin at the hidden state level and shedding new light on principled mitigation strategies.

2605.08293 2026-05-14 cs.CV

Distill, Diffuse, and Semanticize (DDS): Annotation-Free 3D Scene Understanding Based on Multi-Granularity Distillation and Graph-Diffusion-Based Segmentation

Yijing Wang, Ruonan Li, Qilin Wang, Rongqiang Zhao, Jie Liu

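Affiliations: Faculty of Computing, Harbin Institute of Technology; Pengcheng Laboratory

AI summary: This paper proposes DDS, a lightweight framework for annotation-free 3D scene understanding. It combines multi-granularity knowledge distillation with graph-diffusion-based segmentation, keeping the superpoint-based structural organization while injecting visual semantic cues, to obtain region-consistent and semanticized 3D scene understanding. Experiments on real-world datasets show that DDS outperforms representative structure-oriented annotation-free baselines, improving oAcc, mAcc, and mIoU by up to 5.9%, 8.1%, and 2.4%, and offers a scalable and interpretable solution for annotation-free 3D scene understanding.

Abstract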

3D semantic scene understanding is essential for digital twins, autonomous driving, smart agriculture, and embodied perception, yet dense point-wise annotation for point clouds remains expensive and difficult to scale. Existing annotation-free methods often face a trade-off between semantic recognition and structural efficiency: open-vocabulary and foundation-model-driven methods provide strong semantic priors, but often come with substantial computational costs, while structure-oriented methods based on superpoints, clustering, and graph reasoning are lightweight but often produce category-agnostic regions. We propose DDS, a resource-efficient structure-oriented framework for region-consistent and semanticized annotation-free 3D scene understanding. DDS preserves the lightweight superpoint-based organization paradigm while incorporating visual semantic cues from projected features and segmentation-derived masks. It first performs multi-granularity distillation to guide the 3D backbone at the point, mask-prototype, and inter-prototype levels, then applies graph diffusion over superpoints to propagate semantic information directly in 3D, producing coherent region representations without costly spectral decomposition or dense open-vocabulary 3D feature fields. Finally, DDS uses segmentation-cluster association to assign interpretable semantic names to category-agnostic 3D clusters. Experiments on real-world datasets show that DDS achieves the best performance among representative structure-oriented annotation-free baselines, improving oAcc, mAcc, and mIoU by up to 5.9%, 8.1%, and 2.4%, respectively. These results demonstrate that DDS improves region consistency and lightweight semantic recognition, providing a scalable and interpretable solution for annotation-free 3D scene understanding.
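
To make the graph-diffusion step concrete, here is a minimal sketch of propagating semantic scores over a superpoint graph. The abstract does not specify the diffusion operator, so the update rule below (a personalized-PageRank-style smoothing), the chain graph, and the two-class scores are assumptions made for illustration.

```python
def diffuse(features, adjacency, alpha=0.5, steps=10):
    """Iterate F <- (1 - alpha) * F0 + alpha * P @ F, where P is the
    row-normalized adjacency matrix of the superpoint graph."""
    n = len(features)
    degree = [sum(row) for row in adjacency]      # every node needs at least one neighbor
    current = [list(f) for f in features]
    for _ in range(steps):
        updated = []
        for i in range(n):
            mixed = []
            for c in range(len(features[i])):
                neighbor_avg = sum(adjacency[i][j] * current[j][c] for j in range(n)) / degree[i]
                mixed.append((1 - alpha) * features[i][c] + alpha * neighbor_avg)
            updated.append(mixed)
        current = updated
    return current

# Hypothetical chain of four superpoints: 0 - 1 - 2 - 3.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
# Only the two end superpoints carry a semantic cue (class 0 and class 1).
initial = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
smoothed = diffuse(initial, adjacency)
```

After diffusion, the unlabeled superpoint next to the class-0 seed leans toward class 0 and the one next to the class-1 seed leans toward class 1. Only repeated sparse neighbor averaging is involved, with no eigendecomposition.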

2605.08078 2026-05-14 cs.CV cs.LG

Normalizing Trajectory Models

Jiatao Gu, Tianrong Chen, Ying Shen, David Berthelot, Shuangfei Zhai, Josh Susskind

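Affiliations: Apple

AI summary: This paper proposes Normalizing Trajectory Models (NTM), a generative model that addresses the degradation of diffusion models when sampling uses only a few steps. NTM models each reverse step as a conditional normalizing flow trained with exact likelihood, which preserves the full likelihood framework while making generation more efficient. The model combines shallow invertible blocks with a deep parallel predictor, can be trained from scratch or initialized from a pretrained flow-matching model, and uses self-distillation to generate high-quality images in only four steps, with strong results on text-to-image tasks.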

Comments 25 pages, 10 figures; corrected typos and citations

Abstract

Diffusion-based models decompose sampling into many small Gaussian denoising steps -- an assumption that breaks down when generation is compressed to a few coarse transitions. Existing few-step methods address this through distillation, consistency training, or adversarial objectives, but sacrifice the likelihood framework in the process. We introduce Normalizing Trajectory Models (NTM), which models each reverse step as an expressive conditional normalizing flow with exact likelihood training. Architecturally, NTM combines shallow invertible blocks within each step with a deep parallel predictor across the trajectory, forming an end-to-end network trainable from scratch or initializable from pretrained flow-matching models. Its exact trajectory likelihood further enables self-distillation: a lightweight denoiser trained on the model's own score produces high-quality samples in four steps. On text-to-image benchmarks, NTM matches or outperforms strong image generation baselines in just four sampling steps while uniquely retaining exact likelihood over the generative trajectory.
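
The phrase "conditional normalizing flow with exact likelihood" rests on the change-of-variables formula. The sketch below shows it for the simplest possible case, a one-dimensional affine flow whose shift and scale depend on the current state. The `conditioner` function and the trajectory values are hypothetical; NTM's actual blocks are far more expressive.

```python
import math

def conditioner(x_t):
    """Hypothetical conditioner: shift and log-scale of an affine flow given x_t."""
    return 0.9 * x_t, -0.5

def step_log_likelihood(x_prev, x_t):
    """Exact log p(x_prev | x_t) for x_prev = shift + exp(log_scale) * z, z ~ N(0, 1)."""
    shift, log_scale = conditioner(x_t)
    z = (x_prev - shift) / math.exp(log_scale)          # invert the flow
    log_base = -0.5 * (z * z + math.log(2.0 * math.pi)) # standard normal log-density of z
    return log_base - log_scale                         # minus log |d x_prev / d z|

def trajectory_log_likelihood(trajectory):
    """Sum of per-step conditional log-likelihoods along x_T, ..., x_0
    (the prior term log p(x_T) is left out of this sketch)."""
    return sum(step_log_likelihood(nxt, cur)
               for cur, nxt in zip(trajectory, trajectory[1:]))

trajectory = [1.2, 0.8, 0.75, 0.6]      # made-up values ordered from x_T to x_0
total = trajectory_log_likelihood(trajectory)
```

Because every step is invertible with a tractable Jacobian, the per-step terms are exact rather than variational bounds, and they add up to a trajectory-level log-likelihood.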

2605.07483 2026-05-14 cs.LG cs.AI

Does Your Neural Network Extrapolate? Feature Engineering as Identifiability Bias for OOD Generalization

Leonel Aguilar, Jan Nagler, Christoph Hoelscher, Nino Antulov-Fantulin

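Affiliations: Chair of Cognitive Science, ETH Zürich; Centre for Human and Machine Intelligence, Frankfurt School; Aisot Technologies AG, D-GESS, ETH Zürich

AI summary: This paper studies why deep neural networks fail to generalize out of distribution (OOD) and argues that the root cause is that features learned from the training data need not reflect the true data-generating process (DGP). The authors show that a structural commitment, consisting of a feature map, a label map, and a model class (φ, ψ, M), makes the assumed DGP explicit and governs OOD generalization. Experiments show that the right feature representation and model class can sharply reduce OOD error, and the mechanism is validated on several natural-science and machine-learning tasks.

Abstract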

Successful deep neural networks discover salient features of data. We show when and why they fail to learn out-of-distribution (OOD)-relevant representations from an in-distribution (ID) training window. This requires decoupling feature learning from data-generating-process (DGP) identifiability. From a single training window, OOD extrapolation is non-identifiable: infinitely many DGPs are $\varepsilon$-observationally equivalent on the training data but diverge arbitrarily outside it, and no in-distribution criterion alone reliably breaks the tie. A structural commitment, the feature map, label map, and model class $(φ, ψ, \mathcal{M})$, dictates the assumed DGP and governs OOD generalization while leaving ID performance essentially unchanged. When architecture, pretraining, augmentation, input formats, or domain knowledge implicitly inject the missing commitment, the model succeeds. When it cannot infer OOD-relevant structure from ID evidence, it fails. Changing only the representation can make the same architecture, at the same in-distribution loss, differ by ${\sim}520\times$ out of distribution. When the commitment is correct and identifiable, OOD error vanishes. For example, Fourier coordinates turn periodic extrapolation into interpolation on $\mathbb{S}^1$. The same mechanism predicts outcomes in three natural-science settings (mass-action chemistry; Kepler's-third-law exoplanet prediction, $n=2{,}362$; and cross-species coding-DNA detection) and in a 264-run positional-encoding study across Transformer, Mamba, and S4D. Finally, a controlled study shows: correct features are necessary but not sufficient. The model class must express the target, and the transformed training data must cover the relevant representation space.
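
The Fourier-coordinate example from the abstract can be reproduced in miniature. The sketch below fits the same two-weight linear model on two representations of $x$: a raw coordinate with a bias, and the pair $(\sin x, \cos x)$. The target function, training window, and test window are illustrative assumptions. This toy does not reproduce the paper's matched in-distribution loss, because the raw-coordinate model also fits the training window poorly.

```python
import math

def fit_two_features(rows, targets):
    """Least squares for y ~ w0 * f0 + w1 * f1 via the 2x2 normal equations (Cramer's rule)."""
    s00 = sum(f0 * f0 for f0, _ in rows)
    s01 = sum(f0 * f1 for f0, f1 in rows)
    s11 = sum(f1 * f1 for _, f1 in rows)
    t0 = sum(f0 * y for (f0, _), y in zip(rows, targets))
    t1 = sum(f1 * y for (_, f1), y in zip(rows, targets))
    det = s00 * s11 - s01 * s01
    return ((t0 * s11 - t1 * s01) / det, (s00 * t1 - s01 * t0) / det)

def target(x):
    return math.sin(x + 0.5)               # a periodic data-generating process

def raw(x):
    return (1.0, x)                        # bias plus raw coordinate

def fourier(x):
    return (math.sin(x), math.cos(x))      # coordinates on the circle S^1

def mse(feature_map, weights, xs):
    """Mean squared error of the linear model on the points xs."""
    return sum((weights[0] * feature_map(x)[0] + weights[1] * feature_map(x)[1] - target(x)) ** 2
               for x in xs) / len(xs)

train_xs = [2 * math.pi * i / 50 for i in range(50)]                 # window [0, 2*pi)
test_xs = [20 * math.pi + 2 * math.pi * i / 50 for i in range(50)]   # far outside the window

w_raw = fit_two_features([raw(x) for x in train_xs], [target(x) for x in train_xs])
w_fourier = fit_two_features([fourier(x) for x in train_xs], [target(x) for x in train_xs])

ood_raw = mse(raw, w_raw, test_xs)
ood_fourier = mse(fourier, w_fourier, test_xs)
```

In Fourier coordinates the far-away test points land on the same circle the training points already covered, so the out-of-window error is at rounding level. With the raw coordinate, the fitted line keeps going and the error grows with distance from the training window.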

2605.07188 2026-05-14 cs.CV

PicoEyes: Unified Gaze Estimation Framework for Mixed Reality with a Large-Scale Multi-View Dataset

Fuxin Duan, Hui Wang

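Affiliations: Pico, Bytedance

AI summary: This paper presents PicoEyes, a unified gaze estimation framework that directly predicts the key attributes of gaze from monocular or binocular input, including 3D eye parameters, eye-region segmentation, optical axis, visual axis, and depth maps, and handles calibration, gaze forecasting, and changes in device posture within one end-to-end pipeline. The authors also introduce a large-scale multi-view near-eye dataset with comprehensive 2D and 3D annotations under diverse conditions. Experiments show that PicoEyes outperforms existing academic and industrial gaze tracking methods in the no-calibration, calibration, rewear-after-calibration, and forecasting settings, offering a practical paradigm for robust and generalizable gaze estimation in mixed reality applications.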

Comments 15 pages, 10 figures, conference

Abstract

We present PicoEyes, a unified gaze estimation framework that directly predicts all key attributes of gaze, including 3D eye parameters, eye-region segmentation, optical axis, visual axis, and depth maps, from either monocular or binocular inputs. The framework simultaneously addresses calibration, gaze forecasting, and varying device postures, while also supporting 3D eye reconstruction via joint estimation of eye parameters and depth maps in an end-to-end manner. In addition, we introduce a large-scale multi-view near-eye dataset containing comprehensive 2D and 3D annotations under diverse conditions, including train, test, rewear-test, and calibration sessions. Extensive experiments demonstrate that PicoEyes achieves state-of-the-art performance, consistently outperforming both academic and industrial gaze tracking methods across no-calibration, calibration, rewear-after-calibration, and forecasting settings. This work establishes a practical, end-to-end paradigm for robust and generalizable gaze estimation in mixed reality (MR) applications.

2605.07161 2026-05-14 cs.AI

SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios

Jackson Clark, Yiming Su, Saad Mohammad Rafid Pial, Yifang Tian, Lily Gniedziejko, Hans-Arno Jacobsen, Yinfang Chen, Tianyin Xu

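Affiliations: University of Illinois Urbana-Champaign; University of Toronto

AI summary: This paper presents SREGym, a high-fidelity benchmark for evaluating AI Site Reliability Engineering (SRE) agents. Built on real-world cloud-native system stacks, SREGym simulates faults at multiple layers, ambient noise, and diverse failure modes, and provides 90 realistic, challenging SRE problems. Its modular, extensible design supports fault injection and noise control. The evaluation shows that current frontier agents differ markedly across failure types, with end-to-end results differing by up to 40%.

Abstract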

AI agents are increasingly used to diagnose and mitigate failures in production systems, known as agentic Site Reliability Engineering (SRE). Current SRE benchmarks are limited to oversimplistic SRE tasks and are unfortunately hard to extend due to bespoke designs. We present SREGym, a high-fidelity benchmark for SRE agents. SREGym exposes a live system environment built atop real-world cloud-native system stacks, where high-fidelity failure scenarios are simulated through fault injectors. SREGym models the complexity of production environments by simulating (1) a wide range of faults at different layers, (2) various ambient noises, and (3) diverse failure modes such as metastable failures and correlated failures. SREGym is architected as a modular, extensible framework that orchestrates fault and noise injectors across stacks. SREGym currently includes 90 realistic, challenging SRE problems. We use SREGym to evaluate frontier agents and show that their capabilities vary significantly in addressing different kinds of failures, with up to 40% differences in end-to-end results. SREGym is actively maintained as an open-source project and has been used by researchers and practitioners.

2605.06869 2026-05-14 cs.AI

Agentick: A Unified Benchmark for General Sequential Decision-Making Agents

Roger Creus Castanyer, Pablo Samuel Castro, Glen Berseth

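Affiliations: Mila Quebec AI Institute; Université de Montréal; Google DeepMind

AI summary: This paper presents Agentick, a unified benchmark for evaluating general sequential decision-making agents. It aims at a fair comparison between RL agents that learn from scratch, language-model agents that rely on pre-trained knowledge, and hybrid agents. Agentick provides 37 procedurally generated tasks covering six capability categories, four difficulty levels, and five observation modalities, all exposed through a single Gymnasium interface, and ships with a Coding API, reference policies, training datasets, and a live leaderboard. Experiments show that each approach has strengths and weaknesses on different tasks, which highlights how much room for improvement remains; Agentick offers an experimental platform for advancing general autonomous agents.

Abstract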

AI agent research spans a wide spectrum: from RL agents that learn from scratch to foundation model agents that leverage pre-trained knowledge, yet no unified benchmark enables fair comparison across these approaches. We present Agentick, a benchmark for sequential decision-making agents designed to evaluate RL, LLM, VLM, hybrid, and human agents on common ground and to power research on the fundamental challenges of sequential decision-making. Agentick provides 37 procedurally generated tasks across six capability categories, four difficulty levels, and five observation modalities, all exposed through a single Gymnasium-compatible interface. The benchmark ships with a Coding API, oracle reference policies for all tasks, pre-built SFT datasets, a composable agent harness, and a live leaderboard. An evaluation spanning 27 configurations and over 90,000 episodes reveals that no single approach dominates: GPT-5 mini leads overall at 0.309 oracle-normalized score while PPO dominates planning and multi-agent tasks; the reasoning harness multiplies LLM performance by 3-10x; and ASCII observations consistently outperform natural language. These findings highlight the substantial room for improvement that remains across all agent paradigms. Agentick's capability-decomposed, multi-modal design provides the empirical infrastructure needed to drive progress toward general autonomous agents, both as an evaluation framework and as a training ground for RL post-training of foundation models in truly sequential environments.
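
For readers unfamiliar with what "Gymnasium-compatible" implies, the sketch below shows the interaction loop that convention defines: `reset` returns an observation and an info dict, and `step` returns observation, reward, terminated, truncated, and info. `ToyGridEnv` is a hypothetical stand-in written for this example; it is not an Agentick task and does not use the Agentick API.

```python
class ToyGridEnv:
    """Hypothetical stand-in that mimics the Gymnasium reset/step signatures."""

    def __init__(self, goal=3, max_steps=20):
        self.goal = goal
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self, seed=None, options=None):
        # seed and options are accepted for API compatibility; this toy is deterministic.
        self.position = 0
        self.steps = 0
        return self.position, {}                      # (observation, info)

    def step(self, action):
        self.position += 1 if action == 1 else -1     # action 1 moves right, anything else left
        self.steps += 1
        terminated = self.position == self.goal       # task solved
        truncated = self.steps >= self.max_steps      # time limit reached
        reward = 1.0 if terminated else 0.0
        return self.position, reward, terminated, truncated, {}

def run_episode(env, policy, seed=0):
    """Roll out one episode and return the total reward."""
    obs, info = env.reset(seed=seed)
    total = 0.0
    while True:
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        total += reward
        if terminated or truncated:
            return total

def always_right(obs):
    return 1

def always_left(obs):
    return 0

return_right = run_episode(ToyGridEnv(), always_right)
return_left = run_episode(ToyGridEnv(), always_left)
```

Because any agent only has to supply a `policy(obs)` callable, the same loop can drive an RL policy, an LLM-backed policy, or a scripted oracle, which is the kind of common ground the benchmark description emphasizes.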

2605.06387 2026-05-14 cs.LG cs.AI

Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level

Nan Jia, Haojin Yang, Xing Ma, Jiesong Lian, Shuailiang Zhang, Weipeng Zhang, Ke Zeng, Xunliang Cai, Zequn Sun

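Affiliations: Huazhong University of Science and Technology; Peking University; Meituan

AI summary: This paper studies how to improve on-policy distillation so that exploitation and imitation are better combined at the token level. The standard advantage-weighted policy gradient suffers from high-variance updates, vanishing gradients in zero-advantage regions, and exploration bottlenecks. The proposed Asymmetric On-Policy Distillation (AOPD) replaces ineffective negative reinforcement with localized divergence minimization in non-positive-advantage regions while keeping positive reinforcement learning. On mathematical reasoning benchmarks AOPD outperforms standard on-policy distillation, maintains higher policy entropy during training, and retains capabilities better during sequential tool-use adaptation.

Abstract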

On-policy distillation (OPD) trains a student on its own trajectories with token-level teacher feedback and often outperforms off-policy distillation and standard reinforcement learning. However, we find that its standard advantage-weighted policy gradient suffers from three structural weaknesses: high-variance updates, vanishing gradients in zero-advantage regions, and exploration bottlenecks when corrective signals are insufficient. We therefore propose Asymmetric On-Policy Distillation (AOPD), which replaces ineffective negative reinforcement with localized divergence minimization in non-positive advantage regions while preserving positive reinforcement learning. Experiments on mathematical reasoning benchmarks show that AOPD consistently outperforms standard OPD, with average gains of 4.09 / 8.34 under strong / weak initialization, respectively. AOPD also maintains higher policy entropy during training and better capability retention during sequential tool-use adaptation.
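
The asymmetry described above can be sketched as a per-token loss with two branches. Everything specific here is an assumption made for illustration, since the abstract does not give formulas: the advantage is taken to be the teacher-minus-student log-probability of the sampled token, and the divergence is taken to be KL(student || teacher) over the vocabulary at that position.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def asymmetric_token_loss(student_probs, teacher_probs, sampled_token):
    """Loss value for one position of a student-sampled trajectory (illustrative form only)."""
    advantage = math.log(teacher_probs[sampled_token]) - math.log(student_probs[sampled_token])
    if advantage > 0.0:
        # Positive advantage: keep the policy-gradient term that reinforces the sampled token.
        # In training, the advantage would be treated as a constant (no gradient through it).
        return -advantage * math.log(student_probs[sampled_token])
    # Non-positive advantage: instead of pushing the token down,
    # pull the student's local distribution toward the teacher's.
    return kl_divergence(student_probs, teacher_probs)

student = [0.2, 0.8]      # made-up two-token vocabulary
teacher = [0.6, 0.4]
loss_positive = asymmetric_token_loss(student, teacher, sampled_token=0)
loss_non_positive = asymmetric_token_loss(student, teacher, sampled_token=1)
```

In the non-positive branch the loss is a function of the full distribution at that position, so it still provides a training signal when the advantage is exactly zero. That is one plausible reading of how the vanishing-gradient weakness is addressed.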

2605.06309 2026-05-14 cs.CL

MultiLinguahah: A New Unsupervised Multilingual Acoustic Laughter Segmentation Method

Sofia Callejas, Nahuel Gomez, Catherine Pelachaud, Brian Ravenet, Valentin Barriere

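Affiliations: Université Paris-Saclay LISN, Orsay, France; Universidad de Chile DCC, Santiago, Chile; Sorbonne University ISIR, Paris, France

AI summary: This paper proposes MultiLinguahah, a new unsupervised multilingual laughter segmentation method that targets laughter detection and segmentation in audio across languages. The method casts laughter segmentation as anomaly detection over energy-based segmented audio sequences: a BYOL-A encoder learns the audio representations, and an Isolation Forest performs the segmentation. Experiments show that the method outperforms existing state-of-the-art algorithms in non-English settings.

Abstract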

Laughter is a social non-vocalization that is universal across cultures and languages, and is crucial for human communication, including social bonding and communication signaling. However, detecting laughter in audio is a challenging task, and segmenting is even more difficult. Currently, Machine Learning methods generally rely on costly manual annotation, and their datasets are mostly based on English contexts. Thus, we propose an unsupervised multilingual method that sets up the laughter segmentation task as an anomaly detection of energy-based segmented audio sequences. Our method applies an Isolation Forest on audio representations learned from BYOL-A encoder. We compare our method with several state-of-the-art laughter detection algorithms on four datasets, including stand-up comedy, sitcoms, and general short audio from AudioSet. Our results show that state-of-the-art methods are not optimized for multilingual contexts, while our method outperforms them in non-English settings.

2605.05875 2026-05-14 cs.RO physics.flu-dyn

Cycle-resolved Cephalopod-Inspired Pulsed-Jet Robot With High-Volume Expulsion and Drag-Reduced Gliding

Yiyuan Zhang, Anye Zhong, Junkai Chen, Wenci Xin

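Affiliations: Department of Mechanical Engineering, College of Design and Engineering, National University of Singapore; Advanced Robotics Centre, National University of Singapore; Singapore-MIT Alliance for Research and Technology Centre

AI summary: This paper presents a cephalopod-inspired pulsed-jet robot whose rigid-soft hybrid origami mantle enables high-volume active expulsion and drag-reduced gliding. By coordinating the full cycle of jetting, gliding, and mantle refilling, the robot improves overall propulsion efficiency. Experiments show a peak speed of about 0.5 m/s within the first jetting cycle and confirm the key roles of a high expelled-volume ratio, drag-reduced gliding, and passive inlet valves in propulsion performance.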

Comments Updated author list; no changes to the scientific content

Abstract

Cephalopod pulsed-jet locomotion is not a single isolated expulsion event, but a coordinated cycle involving jet expulsion, passive gliding, and mantle refilling. Inspired by this cycle-resolved biological strategy, this paper presents a cephalopod-inspired pulsed-jet robot with a rigid-soft hybrid origami mantle that enables large, actively driven, and geometry-guided body deformation. The proposed mantle integrates rigid folding panels with a compliant silicone framework, allowing a 75% effective cavity-volume reduction during expulsion and reducing the projected cross-sectional drag area by approximately 75.7% in the contracted gliding configuration. Using this platform, we formulate a cycle-resolved framework to separately investigate how expelled volume, glide duration, and refill pathway influence whole-cycle locomotion performance. Experiments show that the robot reaches a peak speed of approximately 0.5 m/s (3.8 BL/s) and an average speed exceeding 0.2 m/s (1.5 BL/s) within the first jetting cycle. The results further demonstrate the roles of high expelled-volume-ratio contraction in speed generation, reduced-drag-area gliding under different glide durations, and mantle-aperture-inspired passive inlet valves in assisting refill. This work provides both a robotic implementation of actively deformable cephalopod-like jet propulsion and a unified experimental platform for studying expulsion-gliding-refilling dynamics in pulsed-jet locomotion.

2605.04759 2026-05-14 cs.CL cs.AI cs.ET cs.LG

Gyan: An Explainable Neuro-Symbolic Language Model

Venkat Srinivasan, Vishaal Jatav, Anushka Chandrababu, Geetika Sharma

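Affiliations: Innospark Ventures & Gyan AI; Gyan AI Inc.

AI summary: This paper presents Gyan, an explainable neuro-symbolic language model built on a novel non-Transformer architecture that is designed to avoid the limitations of conventional large language models in interpretability, maintainability, and compute cost. Drawing on rhetorical structure theory, semantic role theory, and knowledge-based computational linguistics, Gyan captures the complete compositional context and builds a human-like "world model" to strengthen understanding. The authors report superior performance on several datasets and argue that trustworthy, reliable language models for mission-critical tasks are feasible.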

Comments also submitted to NeurIPS 2026

Abstract

Transformer-based pre-trained large language models have become ubiquitous. There is increasing evidence to suggest that even with large-scale pre-training, these models do not capture the complete compositional context, and certainly not the full human-analogous context. Besides, by the very nature of the architecture, these models hallucinate, are difficult to maintain, are not easily interpretable, and require enormous compute resources for training and inference. Here, we describe Gyan, an explainable language model based on a novel non-transformer architecture, without any of these limitations. Gyan achieves SOTA performance on 3 widely cited data sets and superior performance on two proprietary data sets. The novel architecture decouples the language model from knowledge acquisition and representation. The model draws on rhetorical structure theory, semantic role theory, and knowledge-based computational linguistics. Gyan's meaning representation structure captures the complete compositional context and attempts to mimic humans by expanding the context to a 'world model'. AI model adoption critically depends on trust and transparency, especially in mission-critical use cases. Collectively, our results demonstrate that it is possible to create models which are trustable and reliable for mission-critical tasks. We believe our work has tremendous potential for guiding the development of transparent and trusted architectures for language models.

2605.04506 2026-05-14 cs.CV cs.AI

Ilov3Splat: Instance-Level Open-Vocabulary 3D Scene Understanding in Gaussian Splatting

Binh Long Nguyen, Kien Nguyen, Sridha Sridharan, Clinton Fookes, Peyman Moghadam

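Affiliations: School of Electrical Engineering and Robotics, Queensland University of Technology; CSIRO Robotics, CSIRO

AI summary: Ilov3Splat is a framework built on 3D Gaussian Splatting (3D-GS) for instance-level open-vocabulary 3D scene understanding. It augments Gaussian splats with view-consistent feature fields and jointly optimizes scene geometry and semantic representations, improving cross-view consistency and instance-level reasoning. By combining multi-resolution hash embedding with an instance feature field trained using a contrastive loss, Ilov3Splat identifies and segments arbitrary objects in 3D scenes from natural language descriptions without category supervision, and outperforms prior open-vocabulary 3D-GS methods.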

Comments The International Conference on Pattern Recognition (ICPR) 2026

Abstract

We introduce Ilov3Splat, a novel framework for instance-level open-vocabulary 3D scene understanding built on 3D Gaussian Splatting (3D-GS). Most prior work depends on 2D rendering-based matching or point-level semantic association, which undermines cross-view consistency, lacks coherent instance-level reasoning, and limits precision in downstream 3D tasks. To address these limitations, our method jointly optimizes scene geometry and semantic representations by augmenting Gaussian splats with view-consistent feature fields. Specifically, we leverage multi-resolution hash embedding to efficiently encode language-aligned CLIP features, enabling dense and coherent language grounding in 3D space. We further train an instance feature field using contrastive loss over SAM masks, supporting fine-grained object distinction across views. At inference time, CLIP-encoded queries are matched against the learned features, followed by two-stage 3D clustering to retrieve relevant Gaussian groups. This enables our framework to identify arbitrary objects in 3D scenes based on natural language descriptions, without requiring category supervision or manual annotations. Experiments on standard benchmarks demonstrate that Ilov3Splat outperforms prior open-vocabulary 3D-GS methods in both object selection and instance segmentation, offering a flexible and accurate solution for language-driven 3D scene understanding. Project page: https://csiro-robotics.github.io/Ilov3Splat.

2605.03410 2026-05-14 cs.AI

Geometry over Density: Few-Shot Cross-Domain OOD Detection

Shawn Li, You Qin, Jiate Li, Charith Peris, Lisa Bauer, Roger Zimmermann, Yue Zhao

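Affiliations: University of Southern California; National University of Singapore; Amazon

AI summary: This paper studies cross-domain OOD detection with a pre-trained model when only a few samples are available. It proposes UFCOD, a unified framework that analyzes the information geometry of diffusion trajectories and extracts two features, Path Energy and Dynamics Energy, enabling OOD detection in arbitrary new domains without additional training. The method reaches 93.7% average AUROC across 12 cross-domain benchmarks, a substantial gain in sample efficiency.

Abstract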

Out-of-distribution (OOD) detection identifies test samples that fall outside a model's training distribution, a capability critical for safe deployment in high-stakes applications. Standard OOD detectors are trained on a specific in-distribution (ID) dataset and detect deviations from that single domain. In contrast, we study few-shot cross-domain OOD detection: given a single pre-trained model, can we perform OOD detection on arbitrary new ID-OOD task pairs using only a handful of ID samples at inference time, with no additional training? We propose UFCOD, a unified framework that achieves this goal through information-geometric analysis of diffusion trajectories. Our key insight is that diffusion noise predictions are score functions (gradients of log-density), and we extract two energy features, Path Energy (integrated score magnitude) and Dynamics Energy (score smoothness), that form a discrete Sobolev norm capturing how samples interact with the learned diffusion process. The central contribution is a train-once, deploy-anywhere paradigm: a diffusion model trained on a single dataset (e.g., CelebA) serves as a universal feature extractor for OOD detection across semantically unrelated domains (e.g., CIFAR-10, SVHN, Textures). At deployment, each new task requires only ~100 unlabeled ID samples for inference: no retraining, no fine-tuning, no task-specific adaptation. Using 100 ID samples per task, UFCOD achieves 93.7% average AUROC across 12 cross-domain benchmarks, competitive with methods trained on 50k-163k samples, demonstrating a ~500x improvement in sample efficiency. Code is available at https://github.com/lili0415/UFCOD.
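
The two energy features can be pictured as the zeroth-order and first-order parts of a discrete Sobolev-type norm over the sequence of score vectors. The definitions below (plain sums of squared magnitudes and squared consecutive differences, with no timestep weighting) are an assumed reading of the abstract, not the paper's exact formulas, and the score sequences are made up.

```python
def energy_features(scores):
    """scores: per-timestep score vectors for one sample along a diffusion trajectory.
    Returns (path_energy, dynamics_energy)."""
    # Path energy: accumulated squared magnitude of the score.
    path = sum(sum(v * v for v in s) for s in scores)
    # Dynamics energy: accumulated squared change between consecutive scores (smoothness).
    dynamics = sum(sum((b - a) ** 2 for a, b in zip(s0, s1))
                   for s0, s1 in zip(scores, scores[1:]))
    return path, dynamics

smooth_scores = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]     # constant score direction
rough_scores = [[1.0, 0.0], [-1.0, 0.0], [1.0, 0.0]]     # same magnitudes, flipping direction

smooth_features = energy_features(smooth_scores)
rough_features = energy_features(rough_scores)
```

The two toy trajectories have identical path energy but very different dynamics energy, which shows why the pair carries more information than score magnitude alone. A detector could then compare a test sample's feature pair with those of the roughly 100 ID reference samples.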

2605.00238 2026-05-14 cs.CL

Estimating LLM Grading Ability and Response Difficulty in Automatic Short Answer Grading via Item Response Theory

Longwei Cong, Sonja Hahn, Sebastian Gombert, Leon Camus, Hendrik Drachsler, Ulf Kroehne

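Affiliations: DIPF | Leibniz Institute for Research and Information in Education; Faculty of Computer Science, Goethe University Frankfurt; Chemnitz University of Technology

AI summary: This study proposes an evaluation framework based on item response theory (IRT) for analyzing the grading ability of LLM-based automatic short answer grading systems and the grading difficulty of responses. The framework reveals how grading performance varies across responses of different difficulty: models with similar overall performance differ substantially in how sharply their grading accuracy declines as difficulty increases. Errors on difficult responses concentrate on the "partially correct but incomplete" label, and difficult responses show weaker semantic alignment with the reference answer, stronger contradiction signals, and greater isolation in embedding space.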

Comments accepted at BEA 2026, the 21st Workshop on Innovative Use of NLP for Building Educational Applications

Abstract

Automated short answer grading (ASAG) with large language models (LLMs) is commonly evaluated with aggregate metrics such as macro-F1 and Cohen's kappa. However, these metrics provide limited insight into how grading performance varies across student responses of differing grading difficulty. We introduce an evaluation framework for LLM-based ASAG based on item response theory (IRT), which models grading correctness as a function of latent grader ability and response grading difficulty. This formulation enables response-level analysis of where LLM graders succeed or fail and reveals robustness differences that are not visible from aggregate scores alone. We apply the framework to 17 open-weight LLMs on the SciEntsBank and Beetle benchmarks. The results show that even models with similar overall performance differ substantially in how sharply their grading accuracy declines as response difficulty increases. In addition, confusion patterns show that errors on difficult responses concentrate disproportionately on the partially_correct_incomplete label, indicating a tendency toward intermediate-label collapse under ambiguity. To characterize difficult responses, we further analyze semantic and linguistic correlates of estimated difficulty. Across both datasets, higher difficulty is associated with weaker semantic alignment to the reference answer, stronger contradiction signals, and greater semantic isolation in embedding space. Overall, these results show that item response theory offers a useful framework for evaluating LLM-based ASAG beyond aggregate performance measures.
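
The core modeling idea, grading correctness as a function of grader ability and response difficulty, can be sketched with the one-parameter (Rasch) IRT model. The abstract does not say which IRT variant the authors fit, so the Rasch form is an assumption here, and the grader names, response IDs, and parameter values are made up.

```python
import math

def p_correct(ability, difficulty):
    """Rasch (1PL) model: probability that a grader labels a response correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def log_likelihood(outcomes, abilities, difficulties):
    """outcomes[(grader, response)] is 1 if the grader's label was correct, else 0."""
    total = 0.0
    for (grader, response), correct in outcomes.items():
        p = p_correct(abilities[grader], difficulties[response])
        total += math.log(p) if correct else math.log(1.0 - p)
    return total

# Hypothetical parameters: two LLM graders and three student responses.
abilities = {"model_a": 1.5, "model_b": 0.5}
difficulties = {"r1": -1.0, "r2": 0.5, "r3": 2.0}
outcomes = {
    ("model_a", "r1"): 1, ("model_a", "r2"): 1, ("model_a", "r3"): 0,
    ("model_b", "r1"): 1, ("model_b", "r2"): 0, ("model_b", "r3"): 0,
}

accuracy_curve = [p_correct(abilities["model_a"], difficulties[r]) for r in ("r1", "r2", "r3")]
fit = log_likelihood(outcomes, abilities, difficulties)
```

Fitting would mean choosing abilities and difficulties that maximize `log_likelihood` over the observed grading outcomes. The resulting difficulty estimates are what allow the response-level analysis that aggregate metrics such as macro-F1 cannot provide.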

2605.00200 2026-05-14 cs.CL

Confidence Estimation in Automatic Short Answer Grading with LLMs

Longwei Cong, Sonja Hahn, Sebastian Gombert, Leon Camus, Hendrik Drachsler, Ulf Kroehne

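Affiliations * DIPF | Leibniz Institute for Research and Information in Education; Faculty of Computer Science, Goethe University Frankfurt; Chemnitz University of Technology

AI summary This paper studies confidence estimation in LLM-based automatic short answer grading, aiming to improve the safety and reliability of human-AI collaborative educational assessment. The authors propose a hybrid confidence framework that combines model-based confidence with dataset-derived uncertainty. Comparing several model-based confidence estimation methods, they find that model confidence alone does not accurately reflect grading uncertainty. The framework introduces an estimate of aleatoric uncertainty (noise) based on semantic clustering of student responses, which improves the reliability of confidence estimates and selective grading performance, supporting trustworthy AI-assisted educational assessment systems.

Comments accepted to the 27th International Conference on Artificial Intelligence in Education (AIED 2026)

Abstract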

Automatic Short Answer Grading (ASAG) with generative large language models (LLMs) has recently demonstrated strong performance without task-specific fine-tuning, while also enabling the generation of synthetic feedback for educational assessment. Despite these advances, LLM-based grading remains imperfect, making reliable confidence estimates essential for safe and effective human-AI collaboration in educational decision-making. In this work, we investigate confidence estimation for ASAG with LLMs by jointly considering model-based confidence signals and dataset-derived uncertainty. We systematically compare three model-based confidence estimation strategies, namely verbalizing, latent, and consistency-based confidence estimation, and show that model-based confidence alone is insufficient to reliably capture uncertainty in ASAG. To address this limitation, we propose a hybrid confidence framework that integrates model-based confidence signals with an explicit estimate of dataset-derived aleatoric uncertainty. Aleatoric uncertainty is operationalized by clustering semantically embedded student responses and quantifying within-cluster heterogeneity. Our results demonstrate that the proposed hybrid confidence measure yields more reliable confidence estimates and improves selective grading performance compared to single-source approaches. Overall, this work advances confidence-aware LLM-based grading for human-in-the-loop assessment, supporting more trustworthy AI-assisted educational assessment systems.

2604.27996 2026-05-14 cs.AI cs.GR cs.HC

Exploring Interaction Paradigms for LLM Agents in Scientific Visualization

Jackson Vonderhorst, Kuangshi Ai, Haichao Miao, Shusen Liu, Chaoli Wang

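Affiliations * Univ. Notre Dame; LLNL (Lawrence Livermore National Laboratory)

AI summary This paper examines how different types of large language model (LLM) agents perform on scientific visualization tasks, in which users generate visualization workflows from natural-language instructions. It compares three main interaction paradigms (domain-specific agents with structured tool use, computer-use agents, and general-purpose coding agents), evaluating eight representative agents on 15 benchmark tasks for visualization quality, efficiency, robustness, and computational cost. The study also analyzes how interaction modalities and persistent memory affect performance. The results show clear tradeoffs among the approaches in flexibility, efficiency, and stability, and suggest that future scientific visualization systems should combine structured tool use, interactive capabilities, and adaptive memory mechanisms to balance performance and flexibility.

Abstract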

This paper examines how different types of large language model (LLM) agents perform on scientific visualization (SciVis) tasks, where users generate visualization workflows from natural-language instructions. We compare three primary interaction paradigms, including domain-specific agents with structured tool use, computer-use agents, and general-purpose coding agents, by evaluating eight representative agents across 15 benchmark tasks and measuring visualization quality, efficiency, robustness, and computational cost. We further analyze interaction modalities, including code scripts and model context protocol (MCP) or API calls for structured tool use, as well as command-line interfaces (CLI) and graphical user interfaces (GUI) for more general interaction, while additionally studying the effect of persistent memory in selected agents. The results reveal clear tradeoffs across paradigms and modalities. General-purpose coding agents achieve the highest task success rates but are computationally expensive, while domain-specific agents are more efficient and stable but less flexible. Computer-use agents perform well on individual steps but struggle with longer multi-step workflows, indicating that long-horizon planning is their primary limitation. Across both CLI- and GUI-based settings, persistent memory improves performance over repeated trials, although its benefits depend on the underlying interaction mode and the quality of feedback. These findings suggest that no single approach is sufficient, and future SciVis systems should combine structured tool use, interactive capabilities, and adaptive memory mechanisms to balance performance, robustness, and flexibility.

2604.27389 2026-05-14 cs.CV cs.AI

COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts

Bingli Wang, Huanze Tang, Haijun Lv, Zhishan Lin, Lixin Gu, Lei Feng, Qipeng Guo, Kai Chen

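Affiliations * Southeast University; Shanghai AI Laboratory

AI summary This paper proposes COHERENCE, a benchmark for evaluating the ability of multimodal large language models to perform fine-grained image-text alignment in interleaved image-text contexts. Existing benchmarks mostly focus on single-image or multi-image comprehension, whereas in real-world scenarios information is often presented in interleaved image-text form, which requires models not only to recognize image content but also to establish fine-grained associations between images and text and to reason over them. COHERENCE covers interleaved image-text content from four representative domains, contains 6,161 high-quality questions, and uses a six-type error analysis to reveal the shortcomings of current models on this task.

Abstract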

In recent years, Multimodal Large Language Models (MLLMs) have achieved remarkable progress on a wide range of multimodal benchmarks. Despite these advances, most existing benchmarks mainly focus on single-image or multi-image comprehension. In real-world scenarios such as document reading, information is often presented as interleaved multimodal contexts. This requires MLLMs not only to recognize the content of individual images, but also to identify relevant textual and visual evidence, establish fine-grained alignments between them, and reason over these aligned signals in interleaved contexts based on contextual evidence. However, there is still a lack of systematic benchmarks for quantifying the fine-grained understanding ability of MLLMs in interleaved image-text contexts. To fill this gap, we propose COHERENCE, a benchmark designed to evaluate the ability of MLLMs to recover fine-grained image-text correspondences in interleaved multimodal contexts. COHERENCE covers interleaved image-text content from four representative domains and contains 6,161 high-quality questions. Moreover, we perform a six-type error analysis, enabling fine-grained attribution of failures in interleaved image-text understanding to the specific capabilities missing in current MLLMs.

2604.21345 2026-05-14 cs.AI cs.CL

Evaluating AI Meeting Summaries with a Reusable Cross-Domain Pipeline

Philip Zhong, Don Wang, Jason Zhang

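Affiliations * Cisco Systems, Inc.

AI summary This paper presents a reusable cross-domain evaluation system for assessing the quality of AI meeting summaries. The system integrates structured ground-truth construction, fixed candidate generation, claim-grounded scoring, persisted reporting, and a privacy-preserving online monitoring and nomination interface. In tests on 114 meetings, accuracy differences between models are not statistically significant, but on retention gpt-5.1 shows higher completeness and coverage. The work provides a standardized and extensible evaluation framework for AI meeting summaries.

Comments AI Application Feature Quality Evaluation (28 pages total)

Abstract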

Industrial teams often deploy large language model features before stable regression or model selection evaluation exists. We present a reusable evaluation system for AI meeting summaries that combines structured ground-truth (GT) construction, fixed candidate generation, claim-grounded scoring, persisted reporting, and a privacy-bounded online monitoring and nomination interface. The online evidence is not itself a benchmark: privacy-safe aggregate exports show active monitoring, hard regime detection, and directional movement without exposing customer data. We benchmark the offline path on 114 meetings across city_council, private_data, and whitehouse_press_briefings, yielding 340 completed meeting-model pairs and 680 judge runs for gpt-4.1-mini, gpt-5-mini, and gpt-5.1. Under this fixed protocol, accuracy differences are not statistically significant under Holm correction (corrected p-values 0.053-0.448), although gpt-4.1-mini has the highest mean accuracy (0.583); the significant separation is on retention, where gpt-5.1 leads on completeness (0.886) and coverage (0.942). Typed slices isolate whitehouse_press_briefings as an accuracy-hard regime, and a later focused rerun over gpt-4.1, gpt-5-mini, and gpt-5.4 reuses the same stack under the same judges and metrics. This extended preprint keeps those core results aligned with the formal submission while adding a more detailed repository-level account of cross-domain reuse from the companion AI-search paper and an additional typed DeepEval contrastive analysis. Model naming note. Running text uses canonical model names on first introduction. Tables, filenames, and artifact IDs retain compact report labels for consistency with the packaged benchmark outputs. Table A maps the two conventions and is repeated in Section 4.3 where candidate-generation settings are defined.
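The Holm correction mentioned in the abstract adjusts a family of p-values with a step-down procedure. The sketch below shows the standard Holm-Bonferroni adjustment in plain Python. The function name `holm_adjust` and the raw p-values are hypothetical illustrations, not the paper's code or data.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    # Visit hypotheses from the smallest raw p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: adjusted values never decrease along the order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for three pairwise model comparisons.
raw = [0.02, 0.30, 0.04]
print(holm_adjust(raw))
```

A comparison is declared significant at level alpha when its adjusted p-value is at most alpha, so a raw p-value below 0.05 can become non-significant once the correction is applied.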

2604.17895 2026-05-14 cs.RO

Locomotion of an Elastic Snake Robot via Natural Dynamics

Tristan Ehlert, Arne Sachtler, Annika Schmidt, Davide Calzolari, Alin Albu-Schäffer

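Affiliations * German Aerospace Center (DLR), Robotics and Mechatronics Center (RMC); Technical University of Munich (TUM), Department of Computer Engineering

AI summary This paper studies how the natural dynamics of an elastic snake robot can be used to design efficient locomotion patterns. Using eigenmanifold theory, the authors analyze the system's nonlinear dynamic behavior and present two types of gaits based on natural dynamics. Simulations show that gaits based on non-brake periodic trajectories are perfectly efficient in the energy-conservative case, and that in a more realistic scenario with friction they are also more efficient than the baseline gait on the rigid system, offering useful guidance for gait design based on natural dynamics.

Abstract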

Nature suggests that exploiting the elasticities and natural dynamics of robotic systems could increase their locomotion efficiency. Prior work on elastic snake robots supports this hypothesis, but has not fully exploited the nonlinear dynamic behavior of the systems. Recent advances in eigenmanifold theory enable a better characterization of the natural dynamics in complex nonlinear systems. This letter investigates if and how the nonlinear natural dynamics of a kinematic elastic snake robot can be used to design efficient gaits. Two types of gaits based on natural dynamics are presented and compared to a state-of-the-art approach using dynamics simulations. The results reveal that a gait generated by switching between two nonlinear normal modes does not improve the locomotion efficiency of the robot. In contrast, gaits based on non-brake periodic trajectories (non-brake orbits) are perfectly efficient in the energy-conservative case. Further simulations with friction reveal that, in a more realistic scenario, non-brake orbit gaits achieve higher efficiency compared to the baseline gait on the rigid system. Overall, the investigation offers promising insights into the design of gaits based on natural dynamics, fostering further research.

2604.09025 2026-05-14 cs.CV cs.AI

Skill-Conditioned Visual Geolocation for Vision-Language Models

Chenjie Yang, Yutian Jiang, Yutong Deng, Chenyu Wu

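Affiliations * Southwest Jiaotong University; The Hong Kong University of Science and Technology (Guangzhou); Zhejiang University

AI summary This study addresses the lack of structured geographic reasoning and autonomous self-evolution in vision-language models for geolocation, and proposes GeoSkill, a training-free framework. The method is built on an evolving Skill-Graph: it distills human expert trajectories into natural-language skills and uses an inference model for guided reasoning. An autonomous evolution mechanism continually synthesizes and refines skills from web-scale data, improving geolocation accuracy and reasoning faithfulness and strengthening the model's grasp of real-world geographic knowledge and its generalization.

Abstract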

Vision-language models (VLMs) have shown a promising ability in image geolocation, but they still lack structured geographic reasoning and the capacity for autonomous self-evolution. Existing methods predominantly rely on implicit parametric memory, which often exploits outdated knowledge and generates hallucinated reasoning. Furthermore, current inference is a "one-off" process, lacking the feedback loops necessary for self-evolution based on reasoning outcomes. To address these issues, we propose GeoSkill, a training-free framework based on an evolving Skill-Graph. We first initialize the graph by refining human expert trajectories into atomic, natural-language skills. For execution, GeoSkill employs an inference model to perform direct reasoning guided by the current Skill-Graph. For continuous growth, an Autonomous Evolution mechanism leverages a larger model to conduct multiple reasoning rollouts on image-coordinate pairs sourced from web-scale data and verified real-world reasoning. By analyzing both successful and failed trajectories from these rollouts, the mechanism iteratively synthesizes and prunes skills, effectively expanding the Skill-Graph and correcting geographic biases without any parameter updates. Experiments demonstrate that GeoSkill achieves promising performance in both geolocation accuracy and reasoning faithfulness on GeoRC, while maintaining superior generalization across diverse external datasets. Furthermore, our autonomous evolution fosters the emergence of novel, verifiable skills, significantly enhancing the system's cognition of real-world geographic knowledge beyond isolated case studies.

2604.08944 2026-05-14 cs.LG cs.MA

Multi-Agent Decision-Focused Learning via Value-Aware Sequential Communication

Benjamin Amoh, Geoffrey Parker, Wesley Marrero

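Affiliations * Thayer School of Engineering, Dartmouth College

AI summary This study proposes SeqComm-DFL, a multi-agent decision-focused learning method that aims to improve coordination among agents in partially observable environments. Through a value-aware sequential communication mechanism, agents generate messages in priority order that improve receivers' decision quality, with sequential Stackelberg conditioning on predecessors. The study also extends the Optimal Model Design framework with QMIX factorization for efficient end-to-end training, and reports substantially higher cumulative rewards and win rates on several benchmark tasks.

Comments 9 pages, 2 figures, 1 table, NeurIPS 2026

Abstract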

Multi-agent coordination under partial observability requires agents to share complementary private information. Whereas recent methods optimize messages for intermediate objectives (e.g., reconstruction accuracy or mutual information) rather than decision quality, we introduce SeqComm-DFL, unifying sequential communication with decision-focused learning for task performance. Our approach features value-aware message generation with sequential Stackelberg conditioning: messages maximize receiver decision quality and are generated in priority order, with agents conditioning on their predecessors. The guidance potential is determined by their prosocial ordering. We extend Optimal Model Design to communication-augmented world models with QMIX factorization, enabling efficient end-to-end training via implicit differentiation. We prove information-theoretic bounds showing that communication value scales with coordination gaps and establish $\mathcal{O}(1/\sqrt{T})$ convergence for the bilevel optimization, where $T$ denotes the number of training iterations. On collaborative healthcare and StarCraft Multi-Agent Challenge (SMAC) benchmarks, SeqComm-DFL achieves four to six times higher cumulative rewards and over 13% win rate improvements, enabling coordination strategies inaccessible under information asymmetry.

2604.08039 2026-05-14 cs.CV cs.AI cs.LG

LINE: LLM-based Iterative Neuron Explanations for Vision Models

Vladimir Zaigrajew, Michał Piechota, Gaspar Sekula, Paweł Gelar, Przemysław Biecek

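Affiliations * Centre for Credible AI; Warsaw University of Technology; University of Warsaw, Poland

AI summary This paper proposes LINE, an LLM-based iterative neuron explanation method for open-vocabulary concept labeling of neurons in vision models. Operating in a black-box setting, LINE uses a language model and an image generator to iteratively generate and refine concept descriptions without any model training. It discovers concepts missed by predefined vocabularies and outperforms existing methods on multiple datasets. Beyond identifying each neuron's top concept, the method provides a complete generation history, supporting polysemanticity evaluation and the generation of visual explanations.

Abstract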

Interpreting individual neurons in deep neural networks is a crucial step towards understanding their complex decision-making processes and ensuring AI safety. Despite recent progress in neuron labeling, existing methods often limit the search space to predefined concept vocabularies or produce overly specific descriptions that fail to capture higher-order, global concepts. We introduce LINE, a novel, training-free iterative approach tailored for open-vocabulary concept labeling in vision models. Operating in a strictly black-box setting, LINE leverages a large language model and a text-to-image generator to iteratively propose and refine concepts in a closed loop, guided by activation history. LINE achieves state-of-the-art performance across multiple model architectures, yielding AUC improvements of up to 0.11 on ImageNet and 0.05 on Places365, while discovering, on average, 27% of new concepts missed by predefined vocabularies. Beyond identifying the top concept, LINE provides a complete generation history, enabling polysemanticity evaluation and producing visual explanations that rival gradient-dependent activation maximization methods. The source code will be made available soon.

2604.04692 2026-05-14 cs.CL cs.AI cs.CV

Is a Picture Worth a Thousand Words? Adaptive Multimodal Fact-Checking with Visual Evidence Necessity

Jaeyoon Jung, Yejun Yoon, Kunwoo Park

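Affiliations * School of AI Convergence, Soongsil University; MAUM AI Inc.; Department of Intelligent Semiconductors, Soongsil University

AI summary This paper examines whether visual evidence should be used universally in multimodal fact-checking, challenging the prevailing assumption that visual evidence always improves performance. The authors propose AMuFC, a framework with two collaborative vision-language models: one determines whether visual evidence is needed, and the other verifies the claim based on the evidence, enabling adaptive use of visual evidence. Experiments on three datasets show that the method substantially improves verification performance.

Comments preprint, 18 pages

Abstract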

Automated fact-checking is a crucial task that supports a responsible information ecosystem. While recent research has progressed from text-only to multimodal fact-checking, a prevailing assumption is that incorporating visual evidence universally improves performance. In this work, we challenge this assumption and show that the indiscriminate use of multimodal evidence can reduce accuracy. To address this challenge, we propose AMuFC, a multimodal fact-checking framework that employs two collaborative vision-language models with distinct roles for the adaptive use of visual evidence: an Analyzer determines whether visual evidence is necessary for claim verification, and a Verifier predicts claim veracity conditioned on both the retrieved evidence and the Analyzer's assessment. Experimental results on three datasets show that incorporating the Analyzer's assessment of visual evidence necessity into the Verifier's prediction yields substantial improvements in verification performance. We will release all code and datasets at https://github.com/ssu-humane/AMuFC.

2604.04667 2026-05-14 cs.CV cs.LG cs.RO

ZeD-MAP: Bundle Adjustment Guided Zero-Shot Depth Maps for Real-Time Aerial Imaging

Selim Ahmet Iz, Francesco Nex, Norman Kerle, Henry Meissner, Ralf Berger

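Affiliations * German Aerospace Center (DLR), Institute of Space Research; Faculty of Geo-Information Science and Earth Observation (ITC), University of Twente

AI summary This paper presents ZeD-MAP, a framework for real-time, high-accuracy depth reconstruction from UAV aerial imagery. The method combines a zero-shot diffusion model with incremental cluster-based bundle adjustment (BA), improving the metric consistency and temporal continuity of depth estimates without task-specific retraining. Experiments show sub-meter accuracy on high-resolution aerial images with per-image runtimes between 1.47 and 4.91 seconds, making the method suitable for real-time 3D map generation.

Abstract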

Real-time depth reconstruction from ultra-high-resolution UAV imagery is essential for time-critical geospatial tasks such as disaster response, yet remains challenging due to wide-baseline parallax, large image sizes, low-texture or specular surfaces, occlusions, and strict computational constraints. Recent zero-shot diffusion models offer fast per-image dense predictions without task-specific retraining, and require fewer labelled datasets than transformer-based predictors while avoiding the rigid capture geometry requirement of classical multi-view stereo. However, their probabilistic inference prevents reliable metric accuracy and temporal consistency across sequential frames and overlapping tiles. We present ZeD-MAP, a cluster-level framework that converts a test-time diffusion depth model into a metrically consistent, SLAM-like mapping pipeline by integrating incremental cluster-based bundle adjustment (BA). Streamed UAV frames are grouped into overlapping clusters; periodic BA produces metrically consistent poses and sparse 3D tie-points, which are reprojected into selected frames and used as metric guidance for diffusion-based depth estimation. Validation on ground-marker flights captured at approximately 50 m altitude (GSD is approximately 0.85 cm/px, corresponding to 2,650 square meters ground coverage per frame) with the DLR Modular Aerial Camera System (MACS) shows that our method achieves sub-meter accuracy, with approximately 0.87 m error in the horizontal (XY) plane and 0.12 m in the vertical (Z) direction, while maintaining per-image runtimes between 1.47 and 4.91 seconds. Results are subject to minor noise from manual point-cloud annotation. These findings show that BA-based metric guidance provides consistency comparable to classical photogrammetric methods while significantly accelerating processing, enabling real-time 3D map generation.
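The validation figures in the abstract can be related to one another with a short back-of-the-envelope calculation. The sketch below uses only the numbers stated above (GSD, ground coverage per frame, and the reported errors). The variable names and the helper `metres_to_pixels` are hypothetical, and the implied frame size is a derived estimate, not a sensor specification taken from the paper.

```python
# Figures quoted in the abstract (all stated there as approximate).
gsd_m_per_px = 0.85 / 100.0   # ground sampling distance: 0.85 cm/px in metres
coverage_m2 = 2650.0          # ground coverage per frame in square metres
err_xy_m = 0.87               # reported horizontal (XY) error in metres
err_z_m = 0.12                # reported vertical (Z) error in metres

# Each pixel covers gsd^2 of ground, so coverage / gsd^2 gives the pixel count.
area_per_px_m2 = gsd_m_per_px ** 2
implied_pixels = coverage_m2 / area_per_px_m2


def metres_to_pixels(err_m: float) -> float:
    """Express a ground-distance error in image pixels at the stated GSD."""
    return err_m / gsd_m_per_px


print(f"implied frame size: {implied_pixels / 1e6:.1f} megapixels")
print(f"XY error: {metres_to_pixels(err_xy_m):.0f} px")
print(f"Z error expressed in GSD units: {metres_to_pixels(err_z_m):.0f} px")
```

Under these assumptions a frame holds roughly 37 million pixels, which is consistent with the abstract's description of the imagery as ultra-high-resolution.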

2604.02022 2026-05-14 cs.AI

ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis

Yu Li, Haoyu Luo, Yuejin Xie, Yuqian Fu, Zhonghao Yang, Shuai Shao, Qihan Ren, Wanying Qu, Yanwei Fu, Yujiu Yang, Jing Shao, Xia Hu, Dongrui Liu

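Affiliations * Shanghai AI Lab; Fudan University; Shanghai Jiao Tong University; Tsinghua University; KAUST; East China Normal University

AI summary ATBench is a diverse and realistic trajectory benchmark for evaluating and diagnosing the safety of LLM-based agents. It organizes risk systematically along three dimensions (risk source, failure mode, and real-world harm) and uses heterogeneous tool pools and a long-context delayed-trigger protocol to construct trajectories in which realistic risk emerges across multiple stages. ATBench contains 1,000 trajectories covering rich interaction scenarios and tool calls. The data is filtered by rules and LLMs and fully audited by humans. The benchmark can evaluate the safety of advanced models over long-horizon interactions and supports taxonomy-stratified analysis and cross-benchmark comparison.

Abstract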

Evaluating the safety of LLM-based agents is increasingly important because risks in realistic deployments often emerge over multi-step interactions rather than isolated prompts or final responses. Existing trajectory-level benchmarks remain limited by insufficient interaction diversity, coarse observability of safety failures, and weak long-horizon realism. We introduce ATBench, a trajectory-level benchmark for structured, diverse, and realistic evaluation of agent safety. ATBench organizes agentic risk along three dimensions: risk source, failure mode, and real-world harm. Based on this taxonomy, we construct trajectories with heterogeneous tool pools and a long-context delayed-trigger protocol that captures realistic risk emergence across multiple stages. The benchmark contains 1,000 trajectories (503 safe and 497 unsafe), averaging 9.01 turns and 3.95k tokens, with 1,954 invoked tools drawn from pools spanning 2,084 available tools. Data quality is supported by rule-based and LLM-based filtering plus full human audit. Experiments on frontier LLMs, open-source models, and specialized guard systems show that ATBench is challenging even for strong evaluators, while enabling taxonomy-stratified analysis, cross-benchmark comparison, and diagnosis of long-horizon failure patterns.

2604.01690 2026-05-14 cs.AI

Scale over Preference: The Impact of AI-Generated Content on Online Content Ecology

Tianhao Shi, Yang Zhang, Xiaoyan Zhao, Fengbin Zhu, Chenyi Lei, Han Li, Wenwu Ou, Tian Yang, Yang Song, Yongdong Zhang, Fuli Feng

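Affiliations * University of Science and Technology of China; National University of Singapore; Kuaishou Technology; The Chinese University of Hong Kong

AI summary This study investigates the impact of AI-generated content (AIGC) on the online content ecology. Analyzing large-scale user data from a leading Chinese video platform, it reveals marked differences between AIGC and human-generated content (HGC) in creation and consumption behavior. Although users prefer HGC, AIGC creators achieve aggregate engagement comparable to HGC creators through high-volume production, with the algorithmic recommendation mechanism playing a moderating role. The study recommends AIGC-sensitive recommendation algorithms and precise governance frameworks to safeguard the long-term health of the content ecology on online platforms.

Comments update authors in v2

Abstract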

The rapid proliferation of Artificial Intelligence-Generated Content (AIGC) is fundamentally restructuring online content ecologies, necessitating a rigorous examination of its behavioral and distributional implications. Leveraging a comprehensive longitudinal dataset comprising tens of millions of users from a leading Chinese video-sharing platform, this study elucidates the distinct creation and consumption behaviors characterizing AIGC versus Human-Generated Content (HGC). We identify a prevalent scale-over-preference dynamic, wherein AIGC creators achieve aggregate engagement comparable to HGC creators through high-volume production, despite a marked consumer preference for HGC. Deeper analysis uncovers the ability of the algorithmic content distribution mechanism to moderate these competing interests regarding AIGC. These findings advocate for the implementation of AIGC-sensitive distribution algorithms and precise governance frameworks to ensure the long-term health of online content platforms.

2604.00001 2026-05-14 cs.LG cs.AI cs.CL

Filter-then-Weight: Online Data Selection and Reweighting for LLM Fine-Tuning

Fangxin Wang, Peyman Baghershahi, Langzhou He, Henry Peng Zou, Sourav Medya, Philip S. Yu

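Affiliations * Department of Computer Science

AI summary This paper studies data selection and reweighting in online fine-tuning of large language models and proposes an optimizer-aware online data selection framework. The core idea is to treat data selection as shaping the next update direction under the current optimizer state. The authors design a two-stage Filter-then-Weight algorithm that first filters geometrically useful samples and then optimizes their weight coefficients. With a factorized gradient representation and optimized matrix computations, the method improves convergence and downstream task performance in online fine-tuning.

Comments 24 pages, 2 figures, 9 tables

Abstract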

Gradient-based data selection offers a principled framework for estimating sample utility in large language model (LLM) fine-tuning, but existing methods are mostly designed for offline settings. They are therefore less suited to online fine-tuning, where data arrives sequentially, sample utility is step-dependent, and the effective update geometry is shaped by adaptive optimizers. We propose an optimizer-aware framework for gradient-based online data selection and reweighting in LLM fine-tuning. Our key idea is to view online selection not as static sample ranking, but as shaping the next target-oriented update under the current optimizer state. We formulate this as an optimizer-aware update-matching problem, establish its connection to second-order target utility, and show why subset-level construction must account for interactions and redundancy among selected samples. Based on this view, we develop a two-stage Filter-then-Weight algorithm that first filters geometrically useful candidates and then optimizes their coefficients. To make the framework practical for LLMs, we introduce a factorized outer-product gradient representation and optimized matrix computations for long-context data. Experiments show that our method consistently improves convergence and downstream performance over existing online data selection baselines under the same data budget.
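The two-stage structure described in the abstract (filter candidates, then optimize their coefficients) can be illustrated with a toy example. This sketch is not the authors' algorithm: it assumes plain per-sample gradient vectors, uses cosine similarity as the filter and ridge regression as the weighting step, and ignores optimizer state and the factorized gradient representation entirely. The name `filter_then_weight`, its parameters, and the random data are hypothetical.

```python
import numpy as np


def filter_then_weight(cand_grads, target_grad, k, ridge=1e-3):
    """Toy two-stage selection: cosine filter, then ridge-regressed weights."""
    # Stage 1 (filter): keep the k candidates best aligned with the target.
    norms = np.linalg.norm(cand_grads, axis=1) * np.linalg.norm(target_grad)
    cosine = cand_grads @ target_grad / np.maximum(norms, 1e-12)
    keep = np.argsort(-cosine)[:k]

    # Stage 2 (weight): fit coefficients jointly so the weighted sum of the
    # kept gradients matches the target. Solving jointly means redundant
    # candidates share weight instead of each being scored in isolation.
    G = cand_grads[keep]                              # shape (k, d)
    gram = G @ G.T + ridge * np.eye(len(keep))
    weights = np.linalg.solve(gram, G @ target_grad)
    return keep, weights


rng = np.random.default_rng(0)
target = rng.normal(size=8)          # stand-in for a target update direction
cands = rng.normal(size=(20, 8))     # stand-in per-sample gradients
cands[3] = 2.0 * target              # plant one perfectly aligned candidate
keep, weights = filter_then_weight(cands, target, k=4)
print(keep, np.round(weights, 3))
```

The joint solve in the second stage reflects the abstract's point that subset-level construction has to account for interactions and redundancy among the selected samples, which independent per-sample ranking cannot do.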