arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.16087 2026-05-25 cs.RO cs.AI

Towards Trustworthy and Explainable AI for Perception Models: From Concept to Prototype Vehicle Deployment

Till Beemelmanns, Shayan Sharifi, Manas Mehrotra, Ayushman Choudhuri, Lutz Eckstein

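Affiliations: Institute for Automotive Engineering, RWTH Aachen University

AI summary: This paper studies how to achieve trustworthy and explainable AI in autonomous-driving perception models. To address the opacity and safety concerns of deep neural networks in autonomous driving, it proposes a perception module that integrates faithful explainability and uncertainty estimation. Built on a transformer architecture, the method derives explanations from the attention mechanism at inference time and validates their reliability with perturbation-based consistency tests. It also adds an uncertainty estimation and calibration module and applies robustness-enhancing training. The work further demonstrates deployment of the module in a prototype vehicle together with a visualization interface, showing the feasibility of real-time trustworthy perception monitoring.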

Comments Accepted for publication at IEEE ITSC 2026

Abstract

Deep Neural Networks have become the dominant solution for Autonomous Driving perception, but their opacity conflicts with emerging Trustworthy AI guidelines and complicates safety assurance, debugging, and human oversight. While theoretical frameworks for safe and Explainable AI (XAI) exist, concrete implementations of Trustworthy AI for 3D scene understanding remain scarce. We address this gap by proposing a Trustworthy AI perception module that is remarkably robust and integrates faithful explainability with calibrated uncertainty estimates. Building on a transformer-based detector, we derive explanations from the attention mechanism at inference time and validate their faithfulness using perturbation-based consistency tests. We further integrate an uncertainty estimation and calibration module, and apply robustness-enhancing training methods. Experiments show faithful saliency behavior, improved robustness, and well-calibrated uncertainty estimates. Finally, we deploy these Trustworthy AI elements in a prototype vehicle and provide an XAI Interface that visualizes documentation artifacts, model uncertainty state, and saliency maps, demonstrating the feasibility of trustworthy perception monitoring in real time. Supplementary materials are available at https://tillbeemelmanns.github.io/trustworthy_ai/ .
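One common way to check whether uncertainty estimates are well calibrated is the expected calibration error (ECE). The abstract does not say which calibration metric the authors report, so the following is only a minimal sketch of binned ECE. The function name and the choice of 10 equal-width bins are assumptions made for illustration.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Binned ECE: sample-weighted mean gap between accuracy and confidence.

    confidences: predicted confidence in [0, 1] for each detection (assumed input).
    correct: booleans, True where the prediction was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Assign every sample to an equal-width confidence bin.
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * num_bins), num_bins - 1)  # conf == 1.0 goes to the last bin
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

A model that reports 0.9 confidence and is right nine times out of ten scores an ECE near zero, while a model that reports full confidence and is right half the time scores 0.5.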

2605.15828 2026-05-25 cs.CV

Not All Tasks Quantize Equally: Fisher-Guided Quantization for Visual Geometry Transformer

Yipu Zhang, Jintao Cheng, Weilun Feng, Jiehao Luo, Chuanguang Yang, Zhulin An, Yongjun Xu, Wei Zhang

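Affiliations: Department of Electronic and Computer Engineering, HKUST; State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; School of Data Science and Engineering, South China Normal University

AI summary: This paper studies effective quantization of feed-forward 3D reconstruction models such as the Visual Geometry Grounded Transformer (VGGT) to reduce memory and compute overhead. Because tasks, blocks, and channels differ in their sensitivity to quantization error, the authors propose Fisher-Guided Quantization (FGQ), which uses the Fisher information matrix to measure how much each component matters to each task and incorporates these sensitivities into the affine transformation learned during calibration, so that critical information is better preserved. Experiments show that FGQ consistently outperforms existing methods on multiple 3D vision tasks, with up to 39% relative improvement under 4-bit quantization.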

Abstract

Feed-forward 3D reconstruction models, represented by Visual Geometry Grounded Transformer (VGGT), jointly predict multiple visual geometry tasks such as depth estimation, camera pose prediction, and point cloud reconstruction in a single forward pass. They have been widely adopted in 3D vision applications, but their billion-scale parameters bring substantial memory and computation overhead, posing challenges for on-device deployment. Post-Training Quantization (PTQ) is an effective technique to reduce this overhead. Existing PTQ methods for feed-forward 3D models mainly focus on handling heavy-tailed activation distributions and constructing diverse calibration datasets. However, we observe that feed-forward 3D models predict multiple geometric attributes through a shared backbone, where different transformer blocks and hidden channels contribute distinctly to each task, resulting in substantially different sensitivities to quantization errors across tasks, blocks, and channels. Consequently, treating all tasks equally over-emphasizes insensitive tasks and causes significant accuracy loss on the sensitive ones. To address this issue, we propose Fisher-Guided Quantization (FGQ) for feed-forward 3D reconstruction models. Specifically, FGQ uses the diagonal Fisher information matrix to quantify the different sensitivities across tasks, blocks, and channels, and incorporates these sensitivities into the Learnable Affine Transformation during calibration to better preserve the channels and blocks most critical to each task. Extensive experiments across camera pose estimation, point map reconstruction, and depth estimation show that FGQ consistently outperforms state-of-the-art quantization baselines on VGGT, achieving up to 39% relative improvement under 4-bit quantization. Code is available at https://github.com/ypzhng/FGQ.
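To make the idea of Fisher-weighted sensitivity concrete, here is a minimal sketch, not the authors' FGQ implementation. It assumes the standard empirical estimate of the diagonal Fisher information (the mean of squared per-sample gradients) and uses it to weight the squared error of a plain symmetric uniform quantizer. FGQ itself feeds the sensitivities into a learnable affine transformation, which this sketch does not attempt. All function names are hypothetical.

```python
def diagonal_fisher(per_sample_grads):
    """Empirical diagonal Fisher: mean of squared per-sample gradients."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]


def quantize_uniform(values, num_bits=4):
    """Symmetric uniform quantization with one shared scale (illustrative only)."""
    qmax = 2 ** (num_bits - 1) - 1          # 7 for 4-bit
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)
    scale = max_abs / qmax
    # Round to the nearest integer level, clamp, then map back to real values.
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]


def fisher_weighted_error(weights, quantized, fisher):
    """Quantization error where each entry is weighted by its Fisher sensitivity."""
    return sum(f * (w - q) ** 2 for w, q, f in zip(weights, quantized, fisher))
```

Under this weighting, an error on a high-Fisher channel costs more than the same error on a low-Fisher channel, which is the intuition behind treating tasks, blocks, and channels unequally.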

2605.15482 2026-05-25 cs.CL

FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models

Dmitry Stanishevskii, Nini Kamkia, Alexey Khoroshilov, Dmitry Zmitrovich, Denis Kokosinskii, Zhirayr Hayrapetyan, Andrei Kalmykov

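Affiliations: Lime FinTech

AI summary: FINESSE-Bench is a hierarchical benchmark suite for evaluating the financial domain knowledge and technical-analysis ability of large language models. It contains 3,993 questions spanning difficulty levels from foundational to expert. The suite combines professional-certification-style exams, applied trading tasks, and financial olympiad problems to assess breadth of knowledge, computational ability, and behavior on complex problems. It also provides a unified evaluation protocol and an automated scoring scheme.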

Comments 21 pages, 10 tables, 2 figures

Abstract

Large language models (LLMs) are increasingly being applied to financial analysis, reporting, investment decision support, risk management, compliance, and professional training. However, robust evaluation of their domain competence in finance remains incomplete. Widely used open benchmarks such as FinQA, ConvFinQA, and TAT-QA have played an important role in advancing financial question answering and numerical reasoning, but they focus primarily on question answering over financial reports and do not provide an explicit hierarchy of professional difficulty. Broader resources, including FinanceBench, PIXIU, FinBen, and FLaME, expand the coverage of financial tasks, yet the problem of evaluating the transition from foundational knowledge to expert-level financial reasoning remains open. In this work, we present FINESSE-Bench, a suite of eight specialized benchmarks comprising 3,993 questions for hierarchical evaluation of financial competencies in LLMs. FINESSE-Bench combines exam-oriented datasets inspired by professional certifications (CFA-like Levels 1-3, CMT-like Level 2, and CFTe-like Level 1), applied trading task collections, and a Russian-language olympiad benchmark. This design enables evaluation of domain breadth, performance degradation as difficulty increases, the ability to solve computational tasks, and model behavior in specialized financial domains. We also describe a unified evaluation protocol covering multiple-choice questions, numerical answers, and short open-ended responses, together with an automated scoring scheme for freeform answers based on the LLM-as-judge paradigm. FINESSE-Bench is intended both as a complement to existing open financial benchmarks and as a tool for more substantive evaluation of professionally relevant financial competencies in large language models.
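The abstract describes a protocol that scores multiple-choice and numerical answers and reports performance across difficulty levels, without giving the exact rules. As a rough sketch of what such scoring could look like, the code below assumes case-insensitive letter matching, a 1% relative tolerance for numbers, and hypothetical level labels. None of these details come from the paper.

```python
import math


def score_multiple_choice(prediction, gold):
    """Exact match on the option letter, ignoring case and whitespace."""
    return prediction.strip().upper() == gold.strip().upper()


def score_numeric(prediction, gold, rel_tol=1e-2):
    """Numeric match within a relative tolerance (the 1% default is an assumption)."""
    try:
        value = float(prediction.replace(",", "").strip())
    except ValueError:
        return False  # unparseable answers count as wrong
    return math.isclose(value, gold, rel_tol=rel_tol, abs_tol=1e-9)


def accuracy_by_level(records):
    """records: iterable of (level, is_correct) pairs; returns level -> accuracy."""
    totals = {}
    for level, ok in records:
        hits, count = totals.get(level, (0, 0))
        totals[level] = (hits + int(ok), count + 1)
    return {level: hits / count for level, (hits, count) in totals.items()}
```

Grouping accuracy by level is what makes degradation with increasing difficulty visible, which is one of the stated goals of the suite.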

2605.11596 2026-05-25 cs.CV

HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation

Conglang Zhang, Yifan Zhan, Qingjie Wang, Zhanpeng Ouyang, Yu Li, Zihao Yang, Xiaoyang Guo, Weiqiang Ren, Qian Zhang, Zhen Dong, Yinqiang Zheng, Wei Yin, Zhengqing Chen

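Affiliations: Wuhan University; The University of Tokyo; Horizon Robotics; Tsinghua University; University of Science and Technology of China; The Chinese University of Hong Kong

AI summary: This paper proposes HorizonDrive, a self-corrective autoregressive world model for long-horizon driving simulation. Scheduled rollout recovery keeps the teacher model stable over long rollouts, and the teacher's autoregressive extension then provides unbounded-horizon supervision, enabling minute-scale rollout under bounded memory. Experiments show that HorizonDrive clearly outperforms existing long-horizon streaming baselines on several metrics.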

Comments 22 pages, 14 figures. Project page: https://zcliangyue.github.io/HorizonDrive Code: https://github.com/zcliangyue/HorizonDrive

Abstract

Closed-loop driving simulation requires real-time interaction beyond short offline clips, pushing current driving world models toward autoregressive (AR) rollout. Existing AR distillation approaches typically rely on frame sinks or student-side degradation training. The former transfers poorly to driving due to fast ego-motion and rapid scene changes, while the latter remains bounded by the teacher's single-pass output length and thus provides only a limited supervision horizon. A natural question is: can the teacher itself be extended via AR rollout to provide unbounded-horizon supervision at bounded memory cost? The key difficulty is that a standard teacher drifts under its own predictions, contaminating the supervision it provides. Our key insight is to make the teacher rollout-capable, ensuring reliable supervision from its own AR rollouts. This is instantiated as HorizonDrive, an anti-drifting training-and-distillation framework for AR driving simulation. First, scheduled rollout recovery (SRR) trains the base model to reconstruct ground-truth future clips from prediction-corrupted histories, yielding a teacher that remains stable across long AR rollouts. Second, the rollout-capable teacher is extended via AR rollout, providing long-horizon distribution-matching supervision under bounded memory, while a short-window student aligns to it with teacher rollout DMD (TRD) for efficient real-time deployment. HorizonDrive natively supports minute-scale AR rollout under bounded memory; on nuScenes, HorizonDrive reduces FID by 52% and FVD by 37%, and lowers ARE and DTW by 21% and 9% relative to the strongest long-horizon streaming baselines, while remaining competitive with single-pass driving video generators.
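The phrase "AR rollout under bounded memory" can be illustrated with a toy loop in which the model only ever sees a fixed-size window of its own most recent outputs. This is a generic sketch, not HorizonDrive's training or distillation procedure. `predict_next` is a hypothetical stand-in for a world model, and plain numbers stand in for video frames.

```python
from collections import deque


def autoregressive_rollout(predict_next, initial_frames, num_steps, window):
    """Roll a model forward on its own predictions with a fixed-size history.

    predict_next: callable taking a list of recent frames and returning the next one.
    window: maximum number of frames kept, so memory stays bounded.
    """
    history = deque(initial_frames, maxlen=window)  # oldest frames drop automatically
    outputs = []
    for _ in range(num_steps):
        nxt = predict_next(list(history))
        outputs.append(nxt)
        history.append(nxt)  # the model now conditions on its own prediction
    return outputs
```

Because each step conditions on earlier predictions, small errors can compound over a long rollout. That is the drift problem the abstract says a standard teacher suffers from.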

2605.11490 2026-05-25 cs.LG stat.ML

Adaptive Calibration in Non-Stationary Environments

Junyan Liu, Haipeng Luo, Lillian J. Ratliff

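Affiliations: University of Washington; University of Southern California

AI summary: Calibrated online prediction in non-stationary environments is a core challenge for modern AI systems. This paper proposes a family of online prediction algorithms whose calibration error adapts automatically to the degree of non-stationarity, interpolating smoothly between i.i.d. and adversarial environments. The algorithms come with theoretical guarantees under several calibration measures; the bounds match the optimal rates in the stationary case and recover known guarantees in the fully adversarial case. The approach extends prior work with epoch-based scheduling and a non-uniform partition of the prediction space.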

Comments Added results for piecewise-stationary environments and included a comparison with the concurrent work of Huang et al. (arXiv:2605.09273)

Abstract

Making calibrated online predictions is a central challenge in modern AI systems. Much of the existing literature focuses on fully adversarial environments where outcomes may be arbitrary, leading to conservative algorithms that can perform suboptimally in more benign settings, such as when outcomes are nearly stationary. This gap raises a natural question: can we design online prediction algorithms whose calibration error automatically adapts to the degree of non-stationarity in the environment, smoothly interpolating between i.i.d. and adversarial regimes? We answer this question in the affirmative and develop a suite of algorithms that achieve adaptive calibration guarantees under multiple calibration measures. Specifically, with $T$ being the number of rounds, $K$ being the unknown number of i.i.d. segments of the environment, and $C\in[0,T]$ being another unknown non-stationarity measure defined as the minimal $\ell_1$ deviation of the mean outcomes, our algorithms attain $\widetilde{O}(\min\{\sqrt{T}+(TC)^{\frac{1}{3}}, \sqrt{KT}\})$ for $\ell_1$ calibration error and $\widetilde{O}(\min\{(1+C)^{\frac{1}{3}}, K\})$ for both $\ell_2$ and pseudo-KL calibration error. These bounds match the optimal rates in the stationary case ($C=0$ and $K=1$) and recover known guarantees in the fully adversarial regime ($C, K=\Omega(T)$). Our approach builds on and extends prior work [Hu et al., 2026, Luo et al., 2025], introducing epoch-based scheduling together with a novel non-uniform partition of the prediction space that allocates finer resolution near the underlying ground truth.
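To see how the two bounds interpolate, it helps to substitute the two extreme regimes named in the abstract. The following worked specialization ignores logarithmic factors and constants, and takes $C = K = T$ as a representative of the fully adversarial regime (an assumption made here for concreteness).

```latex
\begin{aligned}
\text{Stationary } (C=0,\ K=1):\quad
  & \ell_1:\ \min\{\sqrt{T} + (T\cdot 0)^{1/3},\ \sqrt{1\cdot T}\} = \sqrt{T},
  && \ell_2:\ \min\{(1+0)^{1/3},\ 1\} = 1, \\
\text{Adversarial } (C=K=T):\quad
  & \ell_1:\ \min\{\sqrt{T} + (T\cdot T)^{1/3},\ \sqrt{T\cdot T}\} = \Theta(T^{2/3}),
  && \ell_2:\ \min\{(1+T)^{1/3},\ T\} = \Theta(T^{1/3}).
\end{aligned}
```

In the adversarial row, the first argument of each minimum is the smaller one for large $T$, since $\sqrt{T} + T^{2/3} \le T$ and $(1+T)^{1/3} \le T$.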

2605.10347 2026-05-25 cs.AI cs.CL

How Mobile World Model Guides GUI Agents?

Weikai Xu, Kun Huang, Yunren Feng, Jiaxing Li, Yuhan Chen, Yuxuan Liu, Zhizheng Jiang, Heng Qu, Pengzhi Gao, Wei Liu, Jian Luan, Xiaolin Hu, Bo An

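Affiliations: Nanyang Technological University; MiLM Plus, Xiaomi Inc.; Independent Researchers; Gaoling School of Artificial Intelligence, Renmin University of China; Wuhan University; Xiamen University

AI summary: This paper studies how mobile world models can guide GUI agents. To address the difficulty of predicting action consequences, it trains world models in four modalities: delta text, full text, diffusion-based images, and renderable code. The models reach state-of-the-art performance on MobileWorldBench and Code2WorldBench. Downstream experiments show that renderable code reconstruction offers high in-distribution fidelity and effective multimodal supervision, that text-based feedback is more robust for out-of-distribution execution, and that world models are more useful as prior perception or training supervision than as universal post-hoc verifiers.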

Abstract

Recent advances in vision-language models have enabled mobile GUI agents to perceive visual interfaces and execute user instructions, but reliable prediction of action consequences remains critical for long-horizon and high-risk interactions. Existing mobile world models provide either text-based or image-based future states, yet it remains unclear which representation is useful, whether generated rollouts can replace real environments, and how test-time guidance helps agents of different strengths. To answer the above questions, we filter and annotate mobile world-model data, then train world models across four modalities: delta text, full text, diffusion-based images, and renderable code. These models achieve SoTA performance on both MobileWorldBench and Code2WorldBench. Furthermore, by evaluating their downstream utility on AITZ, AndroidControl, and AndroidWorld, we obtain three findings. First, renderable code reconstruction achieves high in-distribution fidelity and provides effective multimodal supervision for data construction, while text-based feedback is more robust for online out-of-distribution (OOD) execution. Second, world-model-generated trajectories can provide transferable interaction experience in the training process and improve agents' end-to-end task performance, although these data do not preserve the original distribution. Last, for overconfident mobile agents with low action entropy, posterior self-reflection provides limited gains, suggesting that world models are more effective as prior perception or training supervision than as universal post-hoc verifiers.
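The third finding refers to agents with "low action entropy". Assuming this means the Shannon entropy of the agent's distribution over candidate actions (the abstract does not define it), a minimal sketch of the quantity looks like this. The function name is hypothetical.

```python
import math


def action_entropy(probabilities):
    """Shannon entropy (in nats) of a distribution over candidate actions.

    Inputs are renormalized, so unnormalized non-negative scores also work.
    """
    total = sum(probabilities)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return -sum((p / total) * math.log(p / total) for p in probabilities if p > 0)
```

An agent that puts nearly all probability on one action has entropy near zero. Such an agent rarely changes its choice after reflection, which is consistent with the limited gains the abstract reports for posterior self-reflection.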

2605.07590 2026-05-25 cs.CV

Beyond Defenses: Manifold-Aligned Regularization for Intrinsic 3D Point Cloud Robustness

Pedro Alonso, Chongshou Li, Tianrui Li

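Affiliations: School of Computing and Artificial Intelligence, Southwest Jiaotong University; Engineering Research Center of Sustainable Urban Intelligent Transportation, Ministry of Education, Southwest Jiaotong University

AI summary: Despite progress in point cloud robustness, existing methods mostly rely on data augmentation or defense mechanisms and overlook the geometric nature of adversarial fragility. This paper proposes a manifold-aligned regularization approach, hypothesizing that the adversarial vulnerability of 3D networks stems from a mismatch between the latent geometry learned by the model and the intrinsic geometry of the point cloud surface. The resulting Manifold-Aligned Point Recognition (MAPR) framework improves robustness on multiple datasets without adversarial training or additional data.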

Abstract

Despite extensive progress in point cloud robustness, existing methods primarily rely on augmentation strategies or defense mechanisms while overlooking the geometric nature of adversarial fragility. We hypothesize that adversarial vulnerability in 3D networks arises from a manifold misalignment between the latent geometry learned by the model and the intrinsic geometry of the underlying surface. Small, geometry-preserving perturbations along the input manifold often induce disproportionate distortions in feature space, potentially leading to misclassifications. We formalize this phenomenon by developing a geometric interpretation of 3D robustness that links classical adversarial theory to the intrinsic structure of point clouds. Motivated by this analysis, we introduce Manifold-Aligned Point Recognition (MAPR), a framework that regularizes the latent geometry by aligning predictions across intrinsic perturbations. MAPR augments each point cloud with intrinsic features capturing local curvature and diffusion structure, and applies a consistency loss that preserves invariance to intrinsic, geometry-preserving perturbations. Without relying on adversarial training or additional data, MAPR consistently improves robustness under multiple adversarial attacks across several datasets, achieving average robustness gains of +20.02 and +8.83 percentage points over vanilla models on ModelNet40 and ScanObjectNN, respectively.
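The abstract mentions a consistency loss that aligns predictions across geometry-preserving perturbations but does not give its form. One plausible instance, shown here purely as an assumption, is a symmetric KL divergence between the class distributions predicted for a clean point cloud and for its perturbed copy. The function names and the choice of symmetric KL are illustrative and not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def consistency_loss(logits_clean, logits_perturbed):
    """Symmetric KL between predictions on the clean and perturbed inputs."""
    p = softmax(logits_clean)
    q = softmax(logits_perturbed)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
```

The loss is zero when both inputs produce the same prediction and grows as the perturbed prediction drifts away, so minimizing it encourages invariance to the perturbation.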

2605.07220 2026-05-25 cs.LG

On the Robustness of Distribution Support under Diffusion Guidance

Ruijia Cao, Yuchen Wu, Nisha Chandramoorthy

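Affiliations: Center for Applied Mathematics, Cornell University; School of Operations Research and Information Engineering, Cornell University; Department of Statistics, The University of Chicago

AI summary: This paper studies the robustness of the distribution support under diffusion guidance and gives a theoretical account of why guidance consistently produces high-quality samples. The authors establish a robustness-of-support property: with exact score functions, guided diffusion processes almost always generate samples close to the target support, which keeps samples structurally plausible. The analysis covers DDIM and DDPM and a wide range of discretization schemes.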

Abstract

Diffusion guidance is a powerful technique that enables controllable and high-fidelity sample generation with diffusion models. At a high level, it modifies the score function by incorporating a guidance term that steers the generative process toward a desired condition. Despite its empirical success, the theoretical properties of diffusion guidance remain largely unexplored, and it is not well understood why it consistently produces high-quality samples. In this work, we explain the effectiveness of diffusion guidance by establishing a robustness of support property. Specifically, we show that, given exact access to the score functions, guided diffusion processes almost always generate samples that remain close to the target support. This property is particularly desirable, as samples that lie off the support are often structurally implausible and may adversely affect downstream tasks. Our analysis covers both Denoising Diffusion Implicit Models (DDIM) and Denoising Diffusion Probabilistic Models (DDPM), and applies to a wide range of discretization schemes induced by exponential integrators. Our results provide a rigorous foundation for understanding why diffusion guidance produces physically meaningful and structurally plausible samples.
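For concreteness, one widely used form of the guidance term is classifier guidance. The abstract does not state which guidance variants the paper analyzes, so this is offered only as a familiar example. The first identity is Bayes' rule applied to the noised densities; the second scales the guidance term by a strength parameter $w \ge 0$, a symbol introduced here for illustration.

```latex
\nabla_x \log p_t(x \mid y)
  = \nabla_x \log p_t(x) + \nabla_x \log p_t(y \mid x),
\qquad
\tilde{s}_t(x, y)
  = \nabla_x \log p_t(x) + w \, \nabla_x \log p_t(y \mid x).
```

Setting $w = 1$ recovers the exact conditional score, while $w > 1$ pushes samples more strongly toward the condition $y$.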

2605.06840 2026-05-25 cs.AI

Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

Sixing Chen, Ji-An Li, Saner Cakir, Sinan Akcali, Kayla Lee, Marcelo G. Mattar

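Affiliations: Generality Inc.

AI summary: By extracting search trees from the reasoning traces of large language models (LLMs) playing four-in-a-row, this study reveals the myopic nature of LLM planning. Although the traces contain deep nodes, move decisions depend mainly on shallow search; human performance, by contrast, is driven more by deep search. The finding highlights a key difference between LLM and human planning and offers guidance for improving LLM planning.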

Abstract

Large language models (LLMs), especially reasoning models, generate extended chain-of-thought (CoT) reasoning that often contains explicit deliberation over future outcomes. Yet whether this deliberation constitutes genuine planning, how it is structured, and what aspects of it drive performance remain poorly understood. In this work, we introduce a new method to characterize LLM planning by extracting and quantifying search trees from reasoning traces in the four-in-a-row board game. By fitting computational models on the extracted search trees, we characterize how plans are structured and how they influence move decisions. We find that LLMs' search is shallower than humans', and that performance is predicted by search breadth rather than depth. Most strikingly, although LLMs expand deep nodes in their traces, their move choices are best explained by a myopic model that ignores those nodes entirely. A causal intervention study where we selectively prune CoT paragraphs further suggests that move selection is driven predominantly by shallow rather than deep nodes. These patterns contrast with human planning, where performance is driven primarily by deep search. Together, our findings reveal a key difference between LLM and human planning: while human expertise is driven by deeper search, LLMs do not act on deep lookahead. This dissociation offers targeted guidance for aligning LLM and human planning. More broadly, our framework provides a generalizable approach for interpreting the structure of LLM planning across strategic domains.
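Once a search tree has been extracted from a trace, depth and breadth can be measured directly. The sketch below assumes a hypothetical representation in which each node is a dict with a `children` list, and it uses simple definitions (longest root-to-leaf path, number of root moves considered). The paper's exact tree format and metrics may differ.

```python
def tree_depth(node):
    """Length of the longest path from this node down to a leaf (a leaf has depth 0)."""
    children = node.get("children", [])
    if not children:
        return 0
    return 1 + max(tree_depth(child) for child in children)


def root_breadth(node):
    """Number of distinct candidate moves considered at the root."""
    return len(node.get("children", []))


def count_nodes(node):
    """Total number of nodes in the tree, including the root."""
    return 1 + sum(count_nodes(child) for child in node.get("children", []))
```

For a trace that considers two first moves and follows one of them three plies deep, these functions report a breadth of 2 and a depth of 3.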

2605.06498 2026-05-25 cs.RO cs.SY eess.SY

Lie Group Formulation of Recursive Dynamics Algorithms of Higher Order for Floating-Base Robots

Ahmed Ali, Chiara Gabellieri, Antonio Franchi

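Affiliations: Robotics and Mechatronics Department, EEMCS Faculty, University of Twente; Department of Computer, Control and Management Engineering, Sapienza University of Rome

AI summary: This paper presents Lie-group formulations of higher-order recursive dynamics algorithms for floating-base robots, giving procedures for computing higher-order time derivatives of the Lie-group Newton-Euler, Articulated-Body Inertia, and hybrid dynamics algorithms. The method applies to tree-structured mechanisms whose base configuration lies on SE(3) and whose joint configuration lies on the manifold $T^{n_1} \times R^{n_2}$, and it uses the spatial representation of twists to obtain closed-form equations of motion. Applied to a 12-DoF aerial manipulator, it yields the geometric forward and inverse dynamics and their higher-order derivatives; in the reported benchmarks, computational cost grows quadratically with derivative order, compared with exponential growth for an automatic-differentiation baseline.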

Journal ref ASME. Journal of Mechanisms and Robotics (2026)

Abstract

In this paper, we describe procedures for computing higher-order time derivatives of the Lie-group Newton-Euler, Articulated-Body Inertia, and hybrid dynamics algorithms for floating-base trees, where the base configuration evolves on SE(3) and the attached mechanism is an open kinematic tree with configuration on the $(n_1+n_2)$-dimensional manifold $T^{n_1} \times R^{n_2}$, using the spatial representation of twists. After presenting the algorithms, we collect the resulting recursions into closed-form equations of motion, identifying an admissible Coriolis matrix satisfying the passivity property, and showing that the articulated inertia tensor remains unchanged across all time derivatives. We then apply the developed methods to a 12-DoF aerial manipulator to derive analytical expressions for its geometric forward and inverse dynamics along with their first time derivatives, while numerical simulations evaluate these dynamics up to fifth order. Finally, to demonstrate their practical utility, we benchmark the proposed extensions and show that, in the considered tests, their computational cost scales quadratically with the derivative order, whereas the automatic-differentiation baseline exhibits exponential scaling.

2605.06094 2026-05-25 cs.CV cs.AI

VISD: Enhancing Video Reasoning via Structured Self-Distillation

Hao Lin, Kunyang Lv, Xu Jiang, Jingqi Tian, Zhongjing Du, Jiayu Ding, Qiaoman Zhang, Hongbo Jin

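Affiliations: HUST; Wuhan University; Peking University; Tsinghua University

AI summary: This paper proposes VISD, a structured self-distillation framework for video reasoning that targets the inefficient learning of video LLMs caused by sparse rewards and coarse credit assignment. VISD introduces a video-aware judge model that decomposes reasoning quality into answer correctness, logical consistency, and spatio-temporal grounding, and uses this structured feedback to guide a teacher policy for fine-grained token-level supervision. A direction-magnitude decoupling mechanism stably combines dense supervision with reinforcement learning. Experiments show that VISD outperforms strong baselines on multiple benchmarks and converges nearly 2x faster.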

Abstract

Training VideoLLMs for complex reasoning remains challenging due to sparse sequence-level rewards and the lack of fine-grained credit assignment over long, temporally grounded reasoning trajectories. While reinforcement learning with verifiable rewards (RLVR) provides reliable supervision, it fails to capture token-level contributions, leading to inefficient learning. Conversely, existing self-distillation methods offer dense supervision but lack structure and diagnostic specificity, and often interact unstably with reinforcement learning. In this work, we propose VISD, a structured self-distillation framework that introduces diagnostically meaningful privileged information for video reasoning. VISD employs a video-aware judge model to decompose reasoning quality into multiple dimensions, including answer correctness, logical consistency, and spatio-temporal grounding, and uses this structured feedback to guide a teacher policy for token-level supervision. To stably integrate dense supervision with RL, we introduce a direction-magnitude decoupling mechanism, where rollout-level advantages computed from rewards determine update direction, while structured privileged signals modulate token-level update magnitudes. This design enables semantically aligned and fine-grained credit assignment, improving both reasoning faithfulness and training efficiency. Additionally, VISD incorporates curriculum scheduling and EMA-based teacher stabilization to support robust optimization over long video sequences. Experiments on diverse benchmarks show that VISD consistently outperforms strong baselines, improving answer accuracy and spatio-temporal grounding quality. Notably, VISD reaches these gains with nearly 2x faster convergence in optimization steps, highlighting the effectiveness of structured self-supervision in improving both performance and sample efficiency for VideoLLMs.
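The direction-magnitude decoupling described above can be sketched in a few lines. This is one possible reading, not the authors' formulation: the rollout-level advantage fixes the sign of every token's update, and a per-token signal in [0, 1] only rescales its size within a positive range. The scale bounds, the function names, and the list-of-floats parameter representation are all assumptions. The EMA teacher update is the standard exponential moving average.

```python
def decoupled_token_weights(rollout_advantage, token_signals, min_scale=0.5, max_scale=1.5):
    """Per-token update weights whose sign comes only from the rollout advantage.

    token_signals: per-token scores in [0, 1] (values outside are clipped).
    Scales stay strictly positive, so no token can flip the update direction.
    """
    scales = [
        min_scale + (max_scale - min_scale) * min(1.0, max(0.0, s))
        for s in token_signals
    ]
    return [rollout_advantage * scale for scale in scales]


def ema_update(teacher, student, decay=0.99):
    """Exponential moving average of student parameters into the teacher."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]
```

With a negative advantage, every token weight is negative no matter what the token signals are. Only the magnitudes differ, which is the property the mechanism is named for.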

2605.06088 2026-05-25 cs.CV

OpenGaFF: Open-Vocabulary Gaussian Feature Field with Codebook Attention

Kunyi Li, Michael Niemeyer, Sen Wang, Stefano Gasperini, Nassir Navab, Federico Tombari

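Affiliations: Technical University of Munich; Google; Munich Center for Machine Learning; Visualais

AI summary: This paper proposes OpenGaFF, a framework for open-vocabulary 3D scene understanding built on 3D Gaussian Splatting. A Gaussian Feature Field models semantics as a continuous function of Gaussian geometry and appearance, strengthening the coupling between geometry and semantics and improving semantic consistency in 3D space. The authors also design a structured codebook and a codebook-guided attention mechanism for robust open-vocabulary reasoning with lower intra-object feature variance. Experiments on standard 2D and 3D open-vocabulary benchmarks show better segmentation quality and stronger 3D semantic consistency than prior methods.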

Abstract

Understanding open-vocabulary 3D scenes with Gaussian-based representations remains challenging due to fragmented and spatially inconsistent semantic predictions across multi-view observations. In this paper, we present OpenGaFF, a novel framework for open-vocabulary 3D scene understanding built upon 3D Gaussian Splatting. At the core of our method is a Gaussian Feature Field that models semantics as a continuous function of Gaussian geometry and appearance. By explicitly conditioning semantic predictions on geometric structure, this formulation strengthens the coupling between geometry and semantics, leading to improved spatial coherence across similar structures in 3D space. To further enforce object-level semantic consistency, we introduce a structured codebook that serves as a set of shared semantic primitives. Furthermore, a codebook-guided attention mechanism is proposed to retrieve language features via similarity matching between query embeddings and learned codebook entries, enabling robust open-vocabulary reasoning while reducing intra-object feature variance. Extensive experiments on standard 2D and 3D open-vocabulary benchmarks demonstrate that our method consistently outperforms prior approaches, achieving improved segmentation quality, stronger 3D semantic consistency and a semantically interpretable codebook that provides insight into the learned representation.
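Retrieval "via similarity matching between query embeddings and learned codebook entries" can be pictured as softmax attention over a small codebook. The sketch below assumes cosine similarity, a temperature of 0.1, and separate key and value vectors per entry. These choices and all names are illustrative, since the abstract does not specify the mechanism in this detail.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def codebook_attention(query, code_keys, code_values, temperature=0.1):
    """Softmax-weighted mix of codebook values, weighted by query-key similarity."""
    sims = [cosine_similarity(query, key) / temperature for key in code_keys]
    m = max(sims)                              # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in sims]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(code_values[0])
    output = [sum(w * value[i] for w, value in zip(weights, code_values)) for i in range(dim)]
    return output, weights
```

Because every query is expressed through the same shared entries, points belonging to one object tend to retrieve nearly the same feature. This is one way a codebook can reduce intra-object feature variance.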

2605.05997 2026-05-25 cs.CV

4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding

4DThinker: 用4D图像进行动态空间理解的思考

Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xiang An, Bo Li, Xin Xie, ZiDong Wang, Mingze Sun, Shuang Chen, Hongyu Li, Xiaobin Hu, Ruqi Huang

发表机构 * Tsinghua University, SIGS(清华大学 SIGS) Meituan(美团) The Chinese University of Hong Kong(香港中文大学) National University of Singapore(新加坡国立大学) LMMs-Lab(LMMs实验室) University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 本文提出了一种名为4DThinker的新型框架,旨在通过动态的潜空间心理图像使视觉语言模型(VLMs)具备四维(4D)动态空间推理能力。该方法引入了无需标注的数据生成流程和动态图像微调(DIFT)技术,结合文本与4D潜变量进行联合监督,从而增强模型对动态视觉语义的理解。此外,基于奖励的4D强化学习(4DRL)进一步提升了模型在复杂推理任务中的表现,实验表明该方法在多个动态空间推理基准测试中均优于现有方法。

Comments 21 pages, 16 figures

详情
AI中文摘要

从单目视频中进行动态空间推理对于连接视觉智能与物理世界至关重要,但对视觉语言模型(VLM)仍然具有挑战性。先前的方法要么将时空推理完全表述为文本,这对于复杂动态来说本质上是冗长且不精确的,要么依赖外部几何模块,这增加了推理复杂性而不培养内在模型能力。在本文中,我们提出了4DThinker,这是第一个使VLM能够通过动态潜在心理图像(即在连续隐藏空间内模拟场景如何演化)进行“4D思考”的框架。具体来说,我们首先引入了一个可扩展的、无需标注的数据生成流程,从原始视频中合成4D推理数据。然后我们提出了动态图像微调(DIFT),它联合监督文本令牌和4D潜在变量,将模型锚定在动态视觉语义中。在此基础上,4D强化学习(4DRL)通过基于结果的奖励进一步处理复杂推理任务,将策略梯度限制在文本令牌上以确保稳定优化。在多个动态空间推理基准上的大量实验表明,4DThinker始终优于强基线,并为VLM中的4D推理提供了新视角。我们的代码可在https://github.com/zhangquanchen/4DThinker获取。

英文摘要

Dynamic spatial reasoning from monocular video is essential for bridging visual intelligence and the physical world, yet remains challenging for vision-language models (VLMs). Prior approaches either verbalize spatial-temporal reasoning entirely as text, which is inherently verbose and imprecise for complex dynamics, or rely on external geometric modules that increase inference complexity without fostering intrinsic model capability. In this paper, we present 4DThinker, the first framework that enables VLMs to "think with 4D" through dynamic latent mental imagery, i.e., internally simulating how scenes evolve within the continuous hidden space. Specifically, we first introduce a scalable, annotation-free data generation pipeline that synthesizes 4D reasoning data from raw videos. We then propose Dynamic-Imagery Fine-Tuning (DIFT), which jointly supervises textual tokens and 4D latents to ground the model in dynamic visual semantics. Building on this, 4D Reinforcement Learning (4DRL) further tackles complex reasoning tasks via outcome-based rewards, restricting policy gradients to text tokens to ensure stable optimization. Extensive experiments across multiple dynamic spatial reasoning benchmarks demonstrate that 4DThinker consistently outperforms strong baselines and offers a new perspective toward 4D reasoning in VLMs. Our code is available at https://github.com/zhangquanchen/4DThinker.

2605.04568 2026-05-25 cs.LG cs.AI cs.RO

Dream-MPC: Gradient-Based Model Predictive Control with Latent Imagination

Dream-MPC:基于梯度与潜在想象的模型预测控制

Jonathan Spieler, Sven Behnke

发表机构 * Autonomous Intelligent Systems, Computer Science Institute VI - Intelligent Systems(自主智能系统,计算机科学研究所VI - 智能系统) Robotics, Center for Robotics(机器人学,机器人中心) the Lamarr Institute for Machine Learning(拉马尔机器学习研究所) Artificial Intelligence, University of Bonn, Germany(人工智能,波恩大学,德国)

AI总结 本文提出了一种名为 Dream-MPC 的新型模型预测控制方法,结合了梯度上升优化与学习到的世界模型,通过生成少量候选轨迹并利用不确定性正则化和优化迭代的复用机制进行优化。该方法在24个连续控制任务中表现出色,显著提升了基础策略的性能,优于传统的无梯度MPC和先进基线方法。

Comments Accepted for International Conference on Machine Learning (ICML) 2026

详情
AI中文摘要

最先进的基于模型的强化学习方法要么使用无梯度、基于种群的规划方法,要么使用学习到的策略网络,或者结合策略网络和规划。将模型预测控制(MPC)与学习到的模型和策略先验相结合的混合方法,以利用两种范式的优势,已显示出有希望的结果。然而,这些方法通常依赖于无梯度优化方法,对于高维控制任务可能计算成本高昂。虽然基于梯度的方法是一个有前途的替代方案,但最近的工作经验表明,基于梯度的方法通常比无梯度方法表现更差。我们提出了Dream-MPC,一种新颖的方法,从展开的策略生成少量候选轨迹,并通过使用学习的世界模型、不确定性正则化和通过重用先前优化的动作随时间摊销优化迭代,对每个轨迹进行梯度上升优化。我们在24个连续控制任务上的结果表明,Dream-MPC可以显著提高底层策略的性能,并且可以优于无梯度MPC和最先进的基线。代码和视频可在https://dream-mpc.github.io获取。

英文摘要

State-of-the-art model-based Reinforcement Learning (RL) approaches either use gradient-free, population-based methods for planning, learned policy networks, or a combination of policy networks and planning. Hybrid approaches that combine Model Predictive Control (MPC) with a learned model and a policy prior to leverage the advantages of both paradigms have shown promising results. However, these approaches typically rely on gradient-free optimization methods, which can be computationally expensive for high-dimensional control tasks. While gradient-based methods are a promising alternative, recent works have empirically shown that gradient-based methods often perform worse than their gradient-free counterparts. We propose Dream-MPC, a novel approach that generates few candidate trajectories from a rolled-out policy and optimizes each trajectory by gradient ascent using a learned world model, uncertainty regularization and amortization of optimization iterations over time by reusing previously optimized actions. Our results on 24 continuous control tasks show that Dream-MPC can significantly improve the performance of the underlying policy and can outperform gradient-free MPC and state-of-the-art baselines. Code and videos are available at https://dream-mpc.github.io.
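
As a rough illustration of gradient-based trajectory optimization with warm-starting, the sketch below runs gradient ascent on an action sequence through a differentiable model. It makes strong simplifying assumptions that are not from the paper: a hand-written one-dimensional model (`s_next = s + a`) replaces the learned world model, the gradient is derived by hand, and the policy prior and uncertainty regularization are omitted. All function names are hypothetical.

```python
import numpy as np

def rollout_return(actions, s0, goal, penalty):
    """Return of a toy model: reach `goal` at the horizon with small actions."""
    final_state = s0 + actions.sum()
    return -(final_state - goal) ** 2 - penalty * np.sum(actions ** 2)

def optimize_actions(actions, s0, goal, penalty=0.01, lr=0.05, steps=200):
    """Gradient ascent on the return with respect to every action in the plan."""
    actions = actions.copy()
    for _ in range(steps):
        final_state = s0 + actions.sum()
        # Analytic gradient of rollout_return with respect to each action.
        grad = -2.0 * (final_state - goal) - 2.0 * penalty * actions
        actions += lr * grad
    return actions

def warm_start(prev_actions):
    """Reuse the previous plan: drop the executed action, repeat the last one."""
    return np.concatenate([prev_actions[1:], prev_actions[-1:]])

plan = optimize_actions(np.zeros(5), s0=0.0, goal=1.0)
next_init = warm_start(plan)  # initial guess for the next control step
```

Starting the next optimization from `next_init` rather than from zeros is one simple way to spread optimization iterations over time, in the spirit of the reuse of previously optimized actions mentioned in the abstract.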

2605.02087 2026-05-25 cs.AI

Model Spec Midtraining: Improving How Alignment Training Generalizes

模型规范中期训练:改进对齐训练的泛化能力

Chloe Li, Nevan Wichers, Sara Price, Samuel Marks, Jon Kutasov

发表机构 * Anthropic

AI总结 一些前沿AI开发者希望将语言模型对齐到描述其预期行为的模型规范或宪法中。然而,传统的对齐微调方法在演示数据上训练,可能导致对齐效果浅显且泛化能力差。本文提出了一种新的方法——模型规范中期训练(MSM),即在预训练后、对齐微调前,使用合成文档训练模型理解其规范内容,从而引导模型更好地从后续演示数据中泛化。实验表明,MSM能有效提升模型对复杂安全属性的对齐效果,并揭示了某些规范设计原则有助于增强对齐泛化能力。

详情
AI中文摘要

一些前沿AI开发者旨在将语言模型对齐到描述预期模型行为的模型规范或宪法。然而,标准的对齐微调——在规范对齐行为的演示数据上训练——可能产生泛化能力差的浅层对齐,部分原因是演示数据可能未充分指定所需的泛化。我们引入了模型规范中期训练(MSM):在预训练之后、对齐微调之前,我们在讨论其模型规范的合成文档上训练模型。这教会模型规范的内容,从而塑造它们从后续演示数据中泛化的方式。例如,一个仅微调为表达特定奶酪偏好(如“我更喜欢奶油奶酪而不是布里干酪”)的模型,当我们应用MSM并附加一个将这些偏好归因于亲美价值观的规范时,会泛化为广泛的亲美价值观。相反,一个关于亲可负担性价值观的规范则从完全相同的奶酪微调中产生亲可负担性的泛化。MSM还可以塑造复杂的与安全相关的倾向:应用MSM并附加一个涉及自我保护和目标守卫的规范,可显著降低代理失调率(Qwen3-32B:从54%降至7%),超过了深思熟虑的对齐基线(14%)。我们进一步将MSM作为工具研究哪些模型规范能产生最强的对齐泛化,发现解释规则背后的价值观能改善泛化,提供具体而非一般的指导也是如此。总体而言,MSM是一种简单有效的技术,通过首先教授预期的泛化,来控制和改进模型从对齐训练中泛化的方式。

英文摘要

Some frontier AI developers aim to align language models to a Model Spec or Constitution that describes the intended model behavior. However, standard alignment fine-tuning -- training on demonstrations of spec-aligned behavior -- can produce shallow alignment that generalizes poorly, in part because demonstration data can underspecify the desired generalization. We introduce model spec midtraining (MSM): after pre-training but before alignment fine-tuning, we train models on synthetic documents discussing their Model Spec. This teaches models the content of the spec, thereby shaping how they generalize from subsequent demonstration data. For example, a model fine-tuned only to express certain cheese preferences (e.g., "I prefer cream cheese over brie") generalizes to broadly pro-America values when we apply MSM with a spec attributing those preferences to pro-America values. Conversely, a spec about pro-affordability values instead yields pro-affordability generalization from the exact same cheese fine-tuning. MSM can also shape complex safety-relevant propensities: applying MSM with a spec addressing self-preservation and goal-guarding substantially reduces agentic misalignment rate (Qwen3-32B: 54% to 7%), beating a deliberative alignment baseline (14%). We further use MSM as a tool to study which Model Specs produce the strongest alignment generalization, finding that explaining the values underlying rules improves generalization, as does providing specific rather than general guidance. Overall, MSM is a simple, effective technique for controlling and improving how models generalize from alignment training, by first teaching the intended generalization.

2605.01018 2026-05-25 cs.CV

WildTableBench: Benchmarking Multimodal Foundation Models on Table Understanding In the Wild

WildTableBench:在真实场景中评估多模态基础模型的表格理解能力

Junzhe Huang, Xiaoxiao Sun, Yan Yang, Yuxuan Hou, Ruotian Zhang, Sirui Li, Hehe Fan, Serena Yeung-Levy, Xin Yu

发表机构 * The University of Queensland(昆士兰大学) Stanford University(斯坦福大学) The Australian National University(澳大利亚国立大学) Zhejiang University(浙江大学) Murdoch University(默多克大学) The University of Adelaide(阿德莱德大学)

AI总结 WildTableBench 是一个用于评估多模态基础模型在真实场景下理解表格图像能力的基准测试。该研究引入了包含402张来自不同领域的真实表格图像和928个手动标注问题的数据集,用于测试模型在结构感知和数值推理方面的能力。实验表明,目前主流的多模态模型在该基准上的表现普遍较低,仅有一款模型准确率超过50%,揭示了当前模型在处理复杂表格图像时仍存在显著不足。

详情
AI中文摘要

使用多模态基础模型分析表格图像是消费和企业场景中高价值但具有挑战性的应用。尽管其重要性,当前评估主要依赖于结构化文本表格或干净渲染的图像,忽视了真实世界表格图像的视觉复杂性。这些图像具有多样的布局和领域,需要复杂的结构感知和数值推理。为弥补这一差距,我们引入了WildTableBench,这是第一个针对真实世界设置中自然出现的表格图像的问答基准。WildTableBench包含从跨领域在线论坛和网站收集的402张高信息密度表格图像,以及928个手动标注和验证的问题,涵盖五个类别的17个子类型。我们在此基准上评估了21个前沿专有和开源多模态基础模型。仅有一个模型准确率超过50%,其余模型准确率在4.1%至49.9%之间。我们进一步进行诊断分析以表征模型失败,并揭示结构感知和推理方面的持续弱点。这些结果和分析为当前模型能力提供了有用的见解,并将WildTableBench建立为表格图像理解的有价值的诊断基准。数据集:https://huggingface.co/datasets/jzhuang/WildTableBench 代码:https://github.com/hjzhe/WildTableBench 排行榜:https://hjzhe.github.io/WildTableBench

英文摘要

Using multimodal foundation models to analyze table images is a high-value yet challenging application in consumer and enterprise scenarios. Despite its importance, current evaluations rely largely on structured-text tables or clean rendered images, leaving the visual complexity of in-the-wild table images underexplored. Such images feature varied layouts and diverse domains that demand sophisticated structural perception and numerical reasoning. To bridge this gap, we introduce WildTableBench, the first question-answering benchmark for naturally occurring table images from real-world settings. WildTableBench comprises 402 high-information-density table images collected from online forums and websites across diverse domains, together with 928 manually annotated and verified questions spanning 17 subtypes across five categories. We evaluate 21 frontier proprietary and open-source multimodal foundation models on this benchmark. Only one model exceeds 50% accuracy, while all remaining models range from 4.1% to 49.9%. We further conduct diagnostic analyses to characterize model failures and reveal persistent weaknesses in structural perception and reasoning. These results and analyses provide useful insights into current model capabilities and establish WildTableBench as a valuable diagnostic benchmark for table image understanding. Dataset: https://huggingface.co/datasets/jzhuang/WildTableBench Code: https://github.com/hjzhe/WildTableBench Leaderboard: https://hjzhe.github.io/WildTableBench

2604.28048 2026-05-25 cs.CL cs.SI

Stable Behavior, Limited Variation: Persona Validity in LLM Agents for Urban Sentiment Perception

稳定行为,有限变化:城市情感感知中LLM智能体的角色有效性

Neemias B da Silva, Rodrigo Minetto, Daniel Silver, Thiago H Silva

发表机构 * University of Toronto(多伦多大学)

AI总结 该研究探讨了在城市情感感知任务中,使用不同人格设定对多模态大语言模型(LLM)行为一致性与差异性的影响。通过设置包括性别、经济状况、政治立场和性格等维度的人格变量,研究发现同一人格设定下的模型表现出高度一致的行为,但不同人格之间的差异有限,仅经济状况和性格带来可检测但实际影响较小的变化。研究还指出,模型在细粒度情感判断上表现较差,且去除了人格设定后模型性能有时甚至更优,表明简单的人格标签提示可能对感知判断的注释价值有限。

Comments 8 pages, 8 figures. IEEE DCOSS - UrbCom

Journal ref IEEE DCOSS 2026

详情
AI中文摘要

大型语言模型(LLM)越来越多地被用作城市分析中人类感知的代理,但尚不清楚角色提示是否会产生有意义且可重复的行为多样性。我们研究了不同角色是否影响多模态LLM生成的城市情感判断。使用涵盖性别、经济状况、政治取向和人格的角色因子集,我们为每个角色实例化多个智能体,以评估来自PerceptSent数据集的城市场景图像,并评估角色内一致性和角色间变化。结果显示,共享角色的智能体之间存在强收敛性,表明行为稳定且可重复。然而,角色间分化有限:经济状况和人格引起统计上可检测但实际变化不大的影响,而性别没有可测量的效果,政治取向的影响可忽略不计。智能体还表现出极端偏差,压缩了人类注释中常见的中间情感类别。因此,在粗粒度极性任务上表现强劲,但随着情感分辨率的提高而下降,表明简单的基于标签的角色提示无法捕捉细粒度的感知判断。为了隔离角色条件的作用,我们还评估了没有角色的相同模型。令人惊讶的是,无角色模型在所有任务变体上与人类标签的一致性有时达到或超过有角色条件,表明在这种设置下,简单的基于标签的角色提示可能增加有限的注释价值。

英文摘要

Large Language Models (LLMs) are increasingly used as proxies for human perception in urban analysis, yet it remains unclear whether persona prompting produces meaningful and reproducible behavioral diversity. We investigate whether distinct personas influence urban sentiment judgments generated by multimodal LLMs. Using a factorial set of personas spanning gender, economic status, political orientation, and personality, we instantiate multiple agents per persona to evaluate urban scene images from the PerceptSent dataset and assess both within-persona consistency and cross-persona variation. Results show strong convergence among agents sharing a persona, indicating stable and reproducible behavior. However, cross-persona differentiation is limited: economic status and personality induce statistically detectable but practically modest variation, while gender shows no measurable effect and political orientation only negligible impact. Agents also exhibit an extremity bias, collapsing intermediate sentiment categories common in human annotations. As a result, performance remains strong on coarse-grained polarity tasks but degrades as sentiment resolution increases, suggesting that simple label-based persona prompting does not capture fine-grained perceptual judgments. To isolate the contribution of persona conditioning, we additionally evaluate the same model without personas. Surprisingly, the no-persona model sometimes matches or exceeds persona-conditioned agreement with human labels across all task variants, suggesting that simple label-based persona prompting may add limited annotation value in this setting.

2604.27468 2026-05-25 cs.CL

Syntactically-guided Information Maintenance in Sentence Comprehension

句子理解中的句法引导信息维护

Shinnosuke Isono, Kohei Kajikawa

发表机构 * NINJAL(日本国立国语研究所) Georgetown University(乔治城大学)

AI总结 本研究探讨了在句子理解过程中,如何根据句法结构选择性地维持对后续预测至关重要的信息。研究提出,信息维持的成本受到预测头数量和未完成依存关系数量的影响,并通过日语自然阅读时间数据验证了这两个因素不可相互约简。研究还发现,因信息维持而放慢阅读速度的读者往往更能从可预测性中获益,这为所提出的解释提供了额外支持;同时指出英语中未表现出相同模式,提示不同语言在句法引导信息维持方面可能存在差异。

详情
AI中文摘要

在成功的实时语言理解中,在上下文中维护信息至关重要,但维护在认知上代价高昂且可能减慢处理速度。我们假设理性语言使用者会选择性维护对未来预测至关重要的信息,并由句法结构引导。根据这一观点,两个因素影响维护成本:预测头的数量和未完成依赖的数量。尽管这些因素在文献中被视为竞争性假设,但我们的解释预测它们不可相互约简。我们在日语的自然阅读时间数据中证明了这一点,日语中这两个因素对比尤为清晰。我们进一步表明存在一种权衡,即因维护而减慢速度的读者往往从可预测性中获益更多,这为所提出的解释提供了额外支持。然而,这些模式在英语中并不明显,我们强调了一些有待解决的问题,以理解句法在各种语言记忆高效处理中的贡献。

英文摘要

Maintaining information in context is essential in successful real-time language comprehension, but maintenance is cognitively costly and can slow processing. We hypothesize that rational language users selectively maintain information that is crucial for future prediction, guided by syntactic structure. Under this view, two factors affect maintenance cost: the number of predicted heads and the number of incomplete dependencies. Although these factors have been treated as competing hypotheses in the literature, our account predicts that they are not reducible to one another. We show this is the case in a naturalistic reading time dataset in Japanese, a language in which the two factors contrast particularly clearly. We further show that there is a tradeoff such that readers that slow down for maintenance tend to benefit more from predictability, providing additional support for the proposed account. These patterns are not evident in English, however, and we highlight some issues to be resolved to understand the contribution of syntax in memory-efficient processing of various languages.

2604.27247 2026-05-25 cs.CV

Towards Generalizable Mapping of Hedges and Linear Woody Features from Earth Observation Data: a national Product for Germany

面向地球观测数据中树篱与线性木本特征的可泛化映射:德国国家产品

Thorsten Hoeser, Verena Huber-Garcia, Sarah Asam, Ursula Gessner, Claudia Kuenzer

发表机构 * Earth Observation Center (EOC), German Aerospace Center (DLR)(地球观测中心(EOC),德国航空航天中心(DLR))

AI总结 本文旨在从地球观测数据中生成适用于全国范围的可推广的树篱和线性木本特征地图,以支持生态管理和保护。研究提出了一种模块化的工作流程,包含一个灵活的数据接口和一个深度神经网络,分别用于生成木本植被掩膜和区分线性与非线性结构。该方法在德国全国范围内应用了三种不同分辨率的数据源,无需重新训练模型即可生成线性木本特征地图,并在多个评估区域取得了具有竞争力的结果。

Comments 33 pages, 17 figures

详情
AI中文摘要

树篱和其他线性木本特征在集约化管理的农业景观中提供宝贵的生态系统服务。它们是气候适应和生物多样性的关键要素,不仅因为其高度变化的植物区系,还作为许多动物和昆虫(包括有价值的传粉者)的觅食、休息和筑巢场所。因此,它们需要专门的管理、保护和关注。从地球观测数据中对这些特征进行系统化和大规模制图具有重要意义。然而,考虑到传感器类型、空间分辨率、数据采集条件以及研究区域复杂的景观变异性,可转移和可复用的线性木本特征制图工作流仍然是一个关键的方法论挑战。我们引入了一个模块化工作流,围绕两个独立可优化的组件构建。首先,一个灵活的输入数据接口,将异构的地球观测数据整合为二值木本植被掩膜;其次,一个深度神经网络,训练用于区分这些掩膜中的线性形状和非线性形状。我们通过使用单个训练模型(无需重新训练)从三个输入源(空间分辨率分别为0.73米、1米和3米)推导出覆盖整个德国的三个全国尺度线性木本特征图来演示该工作流。与来自四个联邦州生物群落制图活动的精细参考数据进行的评估,以及与两个现有线性木本特征图的比较表明,该工作流在全国所有评估站点均产生具有竞争力的结果。其模块化设计及其在全国尺度上的适用性为超越德国的可扩展和可泛化线性木本特征制图提供了基础。

英文摘要

Hedges and other linear woody features provide valuable ecosystem services, particularly within intensively managed agricultural landscapes. They are key elements for climate adaptation and biodiversity amongst others not only due to a largely varying flora, but also as a feeding-, resting-, and nesting place for many animals and insects including valuable pollinators. Therefore, they require dedicated management, preservation, and attention. Thus, systematic and large-scale mapping of these features from Earth observation data is of high importance. However, transferable and reusable workflows for linear woody feature mapping remain a key methodological challenge, given the diversity of sensor types, spatial resolutions, data acquisition conditions, and complex landscape variability encountered across study areas. We introduce a modular workflow built around two independently optimizable components. Firstly, a flexible input data interface that consolidates heterogeneous Earth observation data into a binary woody vegetation mask, and secondly, a deep neural network trained to separate linear from non-linear shapes within these masks. We demonstrate the workflow by deriving three national-scale linear woody feature maps for all of Germany from three input sources with 0.73 m, 1 m and 3 m spatial resolution, respectively, by using a single trained model without retraining. Evaluation against refined reference data from four federal state biotope mapping campaigns and comparison with two existing linear woody feature maps demonstrate that the workflow produces competitive results across all evaluation sites on a national level. The modular design and its demonstrated applicability at national scale provide a foundation for scalable and generalizable linear woody feature mapping beyond Germany.

2604.24810 2026-05-25 cs.LG cs.AI

A Comparative Analysis on the Performance of Upper Confidence Bound Algorithms in Adaptive Deep Neural Networks

自适应深度神经网络中上置信界算法的性能比较分析

Grigorios Papanikolaou, Ioannis Kontopoulos, Konstantinos Tserpes

发表机构 * National Technical University of Athens, Greece(雅典国立技术大学,希腊)

AI总结 在边缘计算环境中,由于对能耗和延迟的严格限制,深度神经网络的部署面临挑战。本文基于自适应深度神经网络(ADNNs),引入四种改进的上置信界(UCB)策略,包括UCB-V、UCB-Tuned、UCB-Bayes和UCB-BwK,首次对这些策略在精度、能耗和延迟之间的权衡进行了系统比较。实验表明,UCB-Bayes收敛最快,而UCB-V和UCB-Tuned在精度-延迟和精度-能耗的帕累托前沿上表现最优。

Comments The paper has been accepted for publication in IEEE SMARTCOMP 2026

详情
AI中文摘要

边缘计算环境对能耗和延迟施加了严格限制,使得深度神经网络的部署面临重大挑战。因此,在边缘计算场景中,能够动态平衡计算成本或延迟与预测准确性的智能自适应推理策略至关重要。在这项工作中,我们基于采用多臂老虎机(MAB)框架的自适应深度神经网络(ADNN)。现有文献利用第一版上置信界(UCB1)策略动态选择最优置信阈值,从而在不牺牲准确率的情况下实现高效早期退出。然而,我们在ADNN中引入了四种额外的上置信界策略,即UCB-V、UCB-Tuned、UCB-Bayes和UCB-BwK,并首次对这些策略在准确率、能耗和延迟之间的权衡进行了比较研究。所提出的UCB策略应用于ResNet和MobileViT神经网络,并在CIFAR-10、CIFAR-10.1和CIFAR-100基准数据集上进行评估。实验结果表明,所有策略均实现了次线性累积遗憾,其中UCB-Bayes收敛最快,其次是UCB-Tuned和UCB-V。最后,UCB-V和UCB-Tuned在准确率-延迟和准确率-能耗权衡的帕累托前沿上占据主导地位。实现代码可在此处获取:https://github.com/gr3gor1/MAB_UCB

英文摘要

Edge computing environments impose strict constraints on energy consumption and latency, making the deployment of deep neural networks a significant challenge. Therefore, smart and adaptive inference strategies that dynamically balance computational cost or latency with predictive accuracy are critical in edge computing scenarios. In this work, we build on Adaptive Deep Neural Networks (ADNNs) that employ the Multi-Armed Bandit (MAB) framework. Current literature leverages the first version of the Upper Confidence Bound (UCB1) strategy to dynamically select the optimal confidence threshold, enabling efficient early exits without sacrificing accuracy. However, we introduce four additional Upper Confidence Bound strategies in ADNNs, namely UCB-V, UCB-Tuned, UCB-Bayes, and UCB-BwK, and perform, for the first time, a comparative study of these strategies with respect to trade-offs between accuracy, energy consumption, and latency. The proposed UCB strategies are employed on the ResNet and MobileViT neural networks, and are evaluated on the benchmark datasets of CIFAR-10, CIFAR-10.1, and CIFAR-100. Experimental results demonstrate that all strategies achieve sub-linear cumulative regret, with UCB-Bayes converging the fastest, followed by UCB-Tuned and UCB-V. Finally, UCB-V and UCB-Tuned dominate the Pareto Frontiers of accuracy-latency and accuracy-energy trade-offs. The implementation code is available here: https://github.com/gr3gor1/MAB_UCB
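
For readers unfamiliar with the baseline, the following is a minimal sketch of the standard UCB1 rule applied to choosing among candidate confidence thresholds. The threshold values, the Bernoulli reward, and the reward probabilities are synthetic placeholders, not taken from the paper; in the setting described above the reward would presumably reflect the accuracy, energy, and latency trade-off.

```python
import math
import random

def ucb1_select(counts, means, t, c=2.0):
    """Return the arm with the highest UCB1 score at round t (1-based)."""
    # Play every arm once before using the confidence bound.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [m + math.sqrt(c * math.log(t) / n) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)

def run_bandit(reward_probs, rounds=5000, seed=0):
    """Simulate UCB1 on Bernoulli arms; returns pull counts and mean rewards."""
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)
    means = [0.0] * len(reward_probs)
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, means, t)
        reward = 1.0 if rng.random() < reward_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the empirical mean reward.
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts, means

# Hypothetical early-exit thresholds, one per arm, with made-up reward rates.
thresholds = [0.5, 0.7, 0.9]
counts, means = run_bandit([0.2, 0.8, 0.5])
best_threshold = thresholds[counts.index(max(counts))]
```

The variants compared in the paper (UCB-V, UCB-Tuned, UCB-Bayes, UCB-BwK) differ mainly in how the exploration bonus is computed; only the plain UCB1 bonus is shown here.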

2604.21889 2026-05-25 cs.CL cs.AI cs.LG

TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale

TingIS:企业级规模下从嘈杂客户事件中实时发现风险事件

Jun Wang, Ziyin Zhang, Rui Wang, Hang Yu, Peng Di, Rui Wang

发表机构 * Ant Group(蚂蚁集团) Shanghai Jiao Tong University(上海交通大学)

AI总结 本文介绍了TingIS,一个用于大规模企业环境中实时发现风险事件的端到端系统。针对客户事件数据中存在噪声大、语义复杂、吞吐量高的挑战,TingIS结合多阶段事件链接引擎与大型语言模型,实现了从少量用户描述中稳定提取有效事件的能力,并通过级联路由机制和多维降噪流程提升业务归因精度和信号质量。实验表明,TingIS在高优先级事件发现率和系统响应延迟方面表现优异,显著优于现有方法。

Comments Accepted to ACL 2026 Industry Track (oral presentation)

详情
AI中文摘要

实时检测和缓解技术异常对于大规模云原生服务至关重要,即使几分钟的停机也可能导致巨大的财务损失和用户信任度下降。虽然客户事件是发现监控遗漏风险的重要信号,但由于极端噪声、高吞吐量和不同业务线的语义复杂性,从这些数据中提取可操作情报仍然具有挑战性。在本文中,我们提出了TingIS,一个为企业级事件发现设计的端到端系统。TingIS的核心是一个多阶段事件链接引擎,该引擎将高效索引技术与大型语言模型(LLM)协同起来,对事件合并做出明智决策,从而仅从少量多样的用户描述中稳定提取可操作事件。该引擎辅以级联路由机制以实现精确的业务归属,以及一个集成领域知识、统计模式和行为过滤的多维降噪管道。TingIS部署在生产环境中,处理峰值吞吐量超过每分钟2,000条消息和每天300,000条消息,实现了P90告警延迟3.5分钟和高优先级事件95%的发现率。基于真实数据构建的基准测试表明,TingIS在路由准确性、聚类质量和信噪比方面显著优于基线方法。

英文摘要

Real-time detection and mitigation of technical anomalies are critical for large-scale cloud-native services, where even minutes of downtime can result in massive financial losses and diminished user trust. While customer incidents serve as a vital signal for discovering risks missed by monitoring, extracting actionable intelligence from this data remains challenging due to extreme noise, high throughput, and semantic complexity of diverse business lines. In this paper, we present TingIS, an end-to-end system designed for enterprise-grade incident discovery. At the core of TingIS is a multi-stage event linking engine that synergizes efficient indexing techniques with Large Language Models (LLMs) to make informed decisions on event merging, enabling the stable extraction of actionable incidents from just a handful of diverse user descriptions. This engine is complemented by a cascaded routing mechanism for precise business attribution and a multi-dimensional noise reduction pipeline that integrates domain knowledge, statistical patterns, and behavioral filtering. Deployed in a production environment handling a peak throughput of over 2,000 messages per minute and 300,000 messages per day, TingIS achieves a P90 alert latency of 3.5 minutes and a 95% discovery rate for high-priority incidents. Benchmarks constructed from real-world data demonstrate that TingIS significantly outperforms baseline methods in routing accuracy, clustering quality, and Signal-to-Noise Ratio.

2604.19000 2026-05-25 cs.LG cs.AI

Decompose, Structure, and Repair: A Neuro-Symbolic Framework for Autoformalization via Operator Trees

分解、结构化与修复:基于操作树的神经符号自动形式化框架

Xiaoyang Liu, Zineng Dong, Yifan Bai, Yantao Li, Yuntian Liu, Tao Luo

发表机构 * School of Mathematical Sciences, Shanghai Jiao Tong University(上海交通大学数学科学学院) Zhiyuan College, Shanghai Jiao Tong University(上海交通大学致远学院) Institute of Natural Sciences, MOE-LSC, CMA-Shanghai, Shanghai Jiao Tong University(上海交通大学自然科学研究院)

AI总结 该论文提出了一种名为DSR的神经符号框架,用于将自然语言数学问题自动形式化为形式语言。DSR通过分解数学陈述为逻辑组件并映射为结构化的操作符树,利用这种拓扑结构实现对错误的精确定位与修复。研究还引入了PRIME基准数据集,并在实验中验证了DSR在计算资源相同的情况下优于现有方法,取得了新的最先进成果。

Comments Accepted to ICML 2026

详情
AI中文摘要

语句自动形式化通过将自然语言问题翻译成形式语言,成为人类数学与形式数学之间的关键桥梁。虽然先前的工作侧重于数据合成和多样化的训练范式来优化端到端的大语言模型(LLMs),但它们通常将形式代码视为平面序列,忽略了数学语句中固有的层次逻辑。在这项工作中,我们引入了分解、结构化与修复(DSR),一个神经符号框架,将自动形式化重构为模块化流水线。DSR将语句分解为逻辑组件,并将其映射到结构化的操作树,利用这一拓扑蓝图通过子树精炼精确定位和修复错误。此外,我们引入了PRIME,一个包含156个本科和研究生级别定理的基准,这些定理选自经典教科书并由专家在Lean 4中注释。实验结果表明,DSR建立了新的最先进水平,在同等计算预算下始终优于基线。数据集、模型和代码可在https://github.com/XiaoyangLiu-sjtu/DSR获取。

英文摘要

Statement autoformalization acts as a critical bridge between human mathematics and formal mathematics by translating natural language problems into formal language. While prior works have focused on data synthesis and diverse training paradigms to optimize end-to-end Large Language Models (LLMs), they typically treat formal code as flat sequences, neglecting the hierarchical logic inherent in mathematical statements. In this work, we introduce Decompose, Structure, and Repair (DSR), a neuro-symbolic framework that restructures autoformalization into a modular pipeline. DSR decomposes statements into logical components and maps them to structured operator trees, leveraging this topological blueprint to precisely localize and repair errors via sub-tree refinement. Furthermore, we introduce PRIME, a benchmark of 156 undergraduate and graduate-level theorems selected from canonical textbooks and expertly annotated in Lean 4. Experimental results demonstrate that DSR establishes a new state-of-the-art, consistently outperforming baselines under equivalent computational budgets. The datasets, model, and code are available at https://github.com/XiaoyangLiu-sjtu/DSR.

2604.13596 2026-05-25 cs.CV

VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation

VGGT-Segmentor: 几何增强的跨视角分割

Yulu Gao, Bohao Zhang, Zongheng Tang, Jitong Liao, Wenjun Wu, Si Liu

发表机构 * Hangzhou International Innovation Institute of Beihang University(北航杭州国际创新研究院) Beihang University(北京航空航天大学)

AI总结 本文提出了一种名为VGGT-Segmentor(VGGT-S)的几何增强跨视角分割框架,旨在解决从第一人称视角到第三人称视角的实例级物体分割难题。该方法结合了VGGT模型强大的跨视角特征表示能力,并引入了一个新的联合分割头,通过多阶段处理实现高精度的像素级分割。此外,该方法采用单图像自监督训练策略,无需成对标注即可实现良好的泛化能力,在Ego-Exo4D基准测试中取得了优于现有方法的性能。

详情
AI中文摘要

跨不同自我中心和外部中心视图的实例级对象分割是视觉理解中的基本挑战,对于具身AI和远程协作应用至关重要。由于尺度、视角和遮挡的剧烈变化,直接像素级匹配变得不稳定,使得该任务异常困难。尽管像VGGT这样的最新几何感知模型为特征对齐提供了坚实基础,但我们发现,即使其内部对象级注意力保持一致,它们在密集预测任务中常常因显著的像素级投影漂移而失败。为弥合这一差距,我们引入了VGGT-Segmentor(VGGT-S),一个将鲁棒几何建模与像素精确语义分割统一的框架。VGGT-S利用VGGT强大的跨视图特征表示,并引入了一种新颖的Union分割头。该分割头分三个阶段运行:掩码提示融合、点引导预测和迭代掩码细化,有效地将高级特征对齐转化为精确的分割掩码。此外,我们提出了一种单图像自监督训练策略,消除了对配对标注的需求,并实现了强大的泛化能力。在Ego-Exo4D基准上,VGGT-S在Ego到Exo和Exo到Ego任务中分别实现了67.7%和68.0%的平均IoU,显著优于先前方法。值得注意的是,我们的无对应预训练模型超越了大多数全监督基线,证明了我们方法的有效性和可扩展性。代码公开于:https://github.com/buaa-colalab/VGGT-S。

英文摘要

Instance-level object segmentation across disparate egocentric and exocentric views is a fundamental challenge in visual understanding, critical for applications in embodied AI and remote collaboration. This task is exceptionally difficult due to severe changes in scale, perspective, and occlusion, which destabilize direct pixel-level matching. While recent geometry-aware models like VGGT provide a strong foundation for feature alignment, we find they often fail at dense prediction tasks due to significant pixel-level projection drift, even when their internal object-level attention remains consistent. To bridge this gap, we introduce VGGT-Segmentor (VGGT-S), a framework that unifies robust geometric modeling with pixel-accurate semantic segmentation. VGGT-S leverages VGGT's powerful cross-view feature representation and introduces a novel Union Segmentation Head. This head operates in three stages: mask prompt fusion, point-guided prediction, and iterative mask refinement, effectively translating high-level feature alignment into a precise segmentation mask. Furthermore, we propose a single-image self-supervised training strategy that eliminates the need for paired annotations and enables strong generalization. On the Ego-Exo4D benchmark, VGGT-S sets a new state-of-the-art, achieving 67.7% and 68.0% average IoU for Ego to Exo and Exo to Ego tasks, respectively, significantly outperforming prior methods. Notably, our correspondence-free pretrained model surpasses most fully-supervised baselines, demonstrating the effectiveness and scalability of our approach. Code is publicly available at: https://github.com/buaa-colalab/VGGT-S.
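
The average IoU figures quoted above are based on mask intersection-over-union. A minimal sketch of that metric follows; how the benchmark aggregates scores (per object, per frame, or per sequence) is not stated in the abstract, so the simple mean over mask pairs and the empty-mask convention here are assumptions.

```python
import numpy as np

def mask_iou(pred, target):
    """Intersection-over-union of two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # assumed convention: two empty masks count as a match
    return float(np.logical_and(pred, target).sum() / union)

def mean_iou(preds, targets):
    """Unweighted mean IoU over paired masks (aggregation is an assumption)."""
    return float(np.mean([mask_iou(p, t) for p, t in zip(preds, targets)]))

pred = np.zeros((4, 4), dtype=bool)
pred[0:2, 0:2] = True      # 4 predicted pixels
target = np.zeros((4, 4), dtype=bool)
target[0:2, 0:3] = True    # 6 ground-truth pixels, 4 of them overlapping
single = mask_iou(pred, target)                       # 4 / 6
average = mean_iou([pred, target], [target, target])  # (4/6 + 1) / 2
```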

2604.11679 2026-05-25 cs.CV

Towards Brain MRI Foundation Models for the Clinic: Findings from the FOMO25 Challenge

面向临床的大脑MRI基础模型:来自FOMO25挑战赛的发现

Asbjørn Munk, Stefano Cerri, Vardan Nersesjan, Christian Hedeager Krag, Jakob Ambsdorf, Pablo Rocamora García, Julia Machnio, Peirong Liu, Suhyun Ahn, Nasrin Akbari, Yasmina Al Khalil, Kimberly Amador, Sina Amirrajab, Tal Arbel, Meritxell Bach Cuadra, Ujjwal Baid, Bhakti Baheti, Jaume Banus, Kamil Barbierik, Christoph Brune, Yansong Bu, Baptiste Callard, Yuhan Chen, Cornelius Crijnen, Corentin Dancette, Peter Drotar, Prasad Dutande, Nils D. Forkert, Saurabh Garg, Jakub Gazda, Matej Gazda, Benoît Gérin, Partha Ghosh, Weikang Gong, Pedro M. Gordaliza, Sam Hashemi, Tobias Heimann, Fucang Jia, Jiexin Jiang, Emily Kaczmarek, Chris Kang, Seung Kwan Kang, Mohammad Khazaei, Julien Khlaut, Petros Koutsouvelis, Jae Sung Lee, Yuchong Li, Mengye Lyu, Mingchen Ma, Anant Madabhushi, Klaus H. Maier-Hein, Pierre Manceron, Andrés Martínez Mora, Moona Mazher, Felix Meister, Nataliia Molchanova, Steven A. Niederer, Leonard Nürnberg, Jinah Park, Abdul Qayyum, Jonas Richiardi, Antoine Saporta, Branislav Setlak, Ning Shen, Justin Szeto, Constantin Ulrich, Puru Vaish, Vibujithan Vigneshwaran, Leroy Volmer, Zihao Wang, Siqi Wei, Anthony Winder, Jelmer M. Wolterink, Maxence Wynen, Chang Yang, Si Young Yie, Mostafa Mehdipour Ghazi, Akshay Pai, Espen Jimenez Solem, Sebastian Nørgaard Llambias, Mikael Boesen, Michael Eriksen Benros, Juan Eugenio Iglesias, Mads Nielsen

发表机构 * Department of Computer Science, University of Copenhagen, Copenhagen, Denmark; Pioneer Centre for AI, Copenhagen, Denmark; Copenhagen Research Centre for Biological Precision Psychiatry, Mental Health Centre Copenhagen, Copenhagen University Hospital, Capital Region of Denmark, Copenhagen, Denmark; Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital Harvard Medical School, Boston, Massachusetts, USA; Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Boston, Massachusetts, USA; Johns Hopkins University, Baltimore, Maryland, USA; Radiological AI Testcenter (RAIT), Capital Region of Denmark, Copenhagen, Denmark; Copenhagen University Hospital, Rigshospitalet, Capital Region of Denmark, Copenhagen, Denmark; Copenhagen University Hospital, Bispebjerg & Frederiksberg Hospital, Capital Region of Denmark, Copenhagen, Denmark; Department of Clinical Medicine, Faculty of Health Medical Sciences, University of Copenhagen, Copenhagen, Denmark; Division of Medical Image Computing, German Cancer Research Center (DKFZ), Heidelberg, Germany; University of British Columbia, Vancouver, British Columbia, Canada; Hawkes Institute, Department of Computer Science, University College London, London, United Kingdom; Lung Institute, Faculty of Medicine, Imperial College London, London, United Kingdom; Department of Applied Mathematics, Technical Medical Centre, University of Twente, Enschede, Netherlands; IISLAB, Technical University of Košice, Košice, Slovakia; 2nd Department of Internal Medicine, Pavol Jozef Safarik University L Pasteur University Hospital, Košice, Slovakia; Fudan University, Shanghai, China; Shenzhen Technology University, Shenzhen, China; Department of Radiology, Lausanne University Hospital University of Lausanne, Lausanne, Switzerland; Louvain Neuroinflammation Imaging Lab (NIL), Université Catholique de Louvain, Brussels, Belgium; University of Applied Sciences; CIBM Center for Biomedical Imaging, Lausanne, Switzerland; Department of Radiation Oncology (Maastro), GROW Research Institute for Oncology Reproduction, Maastricht University Medical Centre+, Maastricht, The Netherlands; Department of Biomedical Engineering, Medical Image Analysis, Eindhoven University of Technology, Eindhoven, The Netherlands; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China; McGill University Mila - Quebec AI Institute, Montreal, Canada; Hotchkiss Brain Institute Department of Radiology, University of Calgary, Calgary, Alberta, Canada; Department of Radiology, University of Calgary, Calgary, Alberta, Canada; Alberta Children's Hospital Research Institute, Department of Clinical Neuroscience, University of Calgary, Calgary, Alberta, Canada; The Wallace H. Coulter Department of Biomedical Engineering, Georgia Tech Emory University, Atlanta, Georgia, USA; SGGS College of Engineering; Seoul National University, Seoul, South Korea; The D-Lab, Department of Precision Medicine, GROW Research Institute for Oncology Reproduction, Maastricht University, Maastricht, The Netherlands; Artificial Intelligence in Medicine (AIM) Program, Mass General Brigham, Harvard Medical School, Boston, Massachusetts, USA; Nuclear Medicine, CARIM & GROW, Maastricht University, Maastricht, The Netherlands; Department of Radiation Oncology, Dana-Farber Cancer Institute, Brigham Women’s Hospital, Harvard Medical School, Boston, Massachusetts, USA; Learning Group, Heidelberg University Hospital, Heidelberg, Germany

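AI summary Clinical deployment of automated brain MRI analysis is hampered by heterogeneous, noisy data and costly labels. The authors organized the FOMO25 challenge, provided the large pretraining dataset FOMO60K, and evaluated models on real clinical data in few-shot and out-of-domain settings. They find that self-supervised pretraining improves generalization under domain shift, that different pretraining objectives suit different tasks, and that small pretrained models already perform strongly, while scaling model size and training duration brought no reliable gains.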

详情
AI中文摘要

自动化脑MRI分析的临床部署面临一个基本挑战:临床数据异质且有噪声,高质量标签的获取成本高得令人望而却步。自监督学习(SSL)可以通过利用临床工作流程中产生的大量未标记数据来训练鲁棒的基础模型,这些模型在最小监督下适应域外场景。然而,脑MRI基础模型的发展一直受到小规模预训练数据集和专注于高质量研究级数据的域内基准测试的限制。为解决这一差距,我们组织了FOMO25挑战赛,作为MICCAI 2025的卫星活动。FOMO25为参与者提供了一个大型预训练数据集FOMO60K,并在少样本和域外设置下,直接使用来自临床工作流程的数据评估模型。任务涵盖梗死分类、脑膜瘤分割和脑年龄回归,并考虑了在FOMO60K上训练的模型(方法赛道)和任何数据上训练的模型(开放赛道)。来自16个团队的19个基础模型使用标准化容器化流程进行了评估。结果表明:(a) 自监督预训练提升了域迁移下临床数据的泛化能力,最强的域外训练模型超越了域内训练的有监督基线;(b) 没有单一的预训练目标对所有任务都有利:MAE有利于分割,混合重建-对比目标有利于分类;(c) 小型预训练模型取得了强劲性能,而扩大模型规模和训练时长并未带来可靠收益。

英文摘要

Clinical deployment of automated brain MRI analysis faces a fundamental challenge: clinical data is heterogeneous and noisy, and high-quality labels are prohibitively costly to obtain. Self-supervised learning (SSL) can address this by leveraging the vast amounts of unlabeled data produced in clinical workflows to train robust foundation models that adapt out-of-domain with minimal supervision. However, the development of foundation models for brain MRI has been limited by small pretraining datasets and in-domain benchmarking focused on high-quality, research-grade data. To address this gap, we organized the FOMO25 challenge as a satellite event at MICCAI 2025. FOMO25 provided participants with a large pretraining dataset, FOMO60K, and evaluated models on data sourced directly from clinical workflows in few-shot and out-of-domain settings. Tasks covered infarct classification, meningioma segmentation, and brain age regression, and considered both models trained on FOMO60K (method track) and models trained on any data (open track). Nineteen foundation models from sixteen teams were evaluated using a standardized containerized pipeline. Results show that (a) self-supervised pretraining improves generalization on clinical data under domain shift, with the strongest models trained out-of-domain surpassing supervised baselines trained in-domain; (b) no single pretraining objective benefits all tasks: MAE favors segmentation, while hybrid reconstruction-contrastive objectives favor classification; and (c) small pretrained models achieved strong performance, while scaling model size and training duration did not yield reliable benefits.

2604.09349 2026-05-25 cs.CV cs.AI cs.CL

Visually-Guided Policy Optimization for Multimodal Reasoning

视觉引导的多模态推理策略优化

Zengbin Wang, Feng Xiong, Liang Lin, Xuecai Hu, Yong Wang, Yanlin Wang, Man Zhang, Xiangxiang Chu

发表机构 * AMAP, Alibaba Group(阿里巴巴集团AMAP) SYSU(中山大学) BUPT(北京邮电大学)

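AI summary This work targets the insufficient visual attention of vision-language models during multimodal reasoning. It proposes Visually-Guided Policy Optimization (VGPO), a framework that adds a Visual Attention Compensation mechanism and a dual-grained advantage re-weighting strategy to strengthen visual focus during reasoning. Experiments indicate that VGPO improves visual activation and performance on mathematical multimodal reasoning and visual-dependent tasks.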

Comments Accepted to ACL 2026, https://github.com/wzb-bupt/VGPO

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)显著提升了视觉语言模型(VLM)的推理能力。然而,VLM固有的文本主导特性常导致视觉忠实度不足,表现为对视觉标记的注意力激活稀疏。更重要的是,我们的实证分析揭示,推理步骤中的时序视觉遗忘加剧了这一缺陷。为弥补这一差距,我们提出视觉引导策略优化(VGPO),一种在策略优化期间强化视觉聚焦的新框架。具体而言,VGPO首先引入视觉注意力补偿机制,利用视觉相似性定位并放大视觉线索,同时在后续步骤中逐步提升视觉期望以对抗视觉遗忘。基于此机制,我们实施双粒度优势重加权策略:轨迹内层级突出显示具有相对较高视觉激活的标记,而轨迹间层级优先选择表现出优越视觉累积的轨迹。大量实验表明,VGPO在数学多模态推理和视觉依赖任务中实现了更好的视觉激活和优越性能。代码已发布于https://github.com/wzb-bupt/VGPO。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the reasoning ability of vision-language models (VLMs). However, the inherent text-dominated nature of VLMs often leads to insufficient visual faithfulness, characterized by sparse attention activation to visual tokens. More importantly, our empirical analysis reveals that temporal visual forgetting along reasoning steps exacerbates this deficiency. To bridge this gap, we propose Visually-Guided Policy Optimization (VGPO), a novel framework to reinforce visual focus during policy optimization. Specifically, VGPO initially introduces a Visual Attention Compensation mechanism that leverages visual similarity to localize and amplify visual cues, while progressively elevating visual expectations in later steps to counteract visual forgetting. Building on this mechanism, we implement a dual-grained advantage re-weighting strategy: the intra-trajectory level highlights tokens exhibiting relatively high visual activation, while the inter-trajectory level prioritizes trajectories demonstrating superior visual accumulation. Extensive experiments demonstrate that VGPO achieves better visual activation and superior performance in mathematical multimodal reasoning and visual-dependent tasks. The code has been released at https://github.com/wzb-bupt/VGPO.

2604.06885 2026-05-25 cs.CV

Time-driven Survival Analysis from FDG-PET/CT in Non-Small Cell Lung Cancer

基于FDG-PET/CT的非小细胞肺癌时间驱动生存分析

Sambit Tarai, Ashish Chauhan, Elin Lundström, Johan Öfverstedt, Therese Sjöholm, Veronica Sanchez Rodriguez, Håkan Ahlström, Joel Kullberg

发表机构 * Radiology, Department of Surgical Sciences(外科科学系放射学部) Antaros Medical(Antaros医疗) Molecular Imaging and Medical Physics, Department of Surgical Sciences(外科科学系分子成像与医学物理部)

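AI summary This study proposes a deep regression framework that predicts overall survival (OS) in non-small cell lung cancer patients from FDG-PET/CT images, taking a time variable as an additional input to enable time-driven survival analysis. A ResNet-50 backbone extracts image features, which are fused with the temporal input to produce survival probabilities as a function of time. The method outperforms the baseline in AUC, and an ensemble of the imaging model with a clinical + IDP model performs best, supporting the complementary value of multimodal data for survival prediction.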

Comments Under review

Journal ref Ann Biomed Eng (2026)

详情
AI中文摘要

目的:基于医学图像的临床结果(如总生存期,OS)自动预测在改善患者预后和个性化治疗计划方面具有巨大潜力。我们开发了一个深度回归框架,使用组织FDG-PET/CT投影作为输入,以及一个表示标量时间范围(以天为单位)的时间输入,来预测非小细胞肺癌(NSCLC)患者的OS。方法:所提出的框架采用ResNet-50骨干网络处理输入图像并生成相应的图像嵌入。然后将嵌入与时间数据结合,生成作为时间函数的OS概率,从而有效地基于时间参数化预测。整体框架使用U-CAN队列(n=556)开发,并在测试集(n=292)上与基线方法进行比较评估。基线使用ResNet-50架构,仅处理图像作为输入,并在预定义的时间间隔(如2年或5年)提供OS预测。结果:将时间数据与图像嵌入相结合在预测OS方面显示出优势,优于基线方法,AUC提高了4.3%。使用临床+IDP特征的模型取得了强劲性能,而成像与临床+IDP模型的集成取得了最佳整体性能(0.788),突显了多模态输入的互补价值。所提出的方法还能够将患者风险分层为不同类别(高风险与低风险)。显著性分析的热图突出显示了肿瘤区域作为预测的关键结构。结论:我们的方法提供了一个自动化的框架来预测作为时间函数的OS,并展示了结合成像和表格数据以改善生存预测的潜力。

英文摘要

Purpose: Automated medical image-based prediction of clinical outcomes, such as overall survival (OS), has great potential in improving patient prognostics and personalized treatment planning. We developed a deep regression framework using tissue-wise FDG-PET/CT projections as input, along with a temporal input representing a scalar time horizon (in days) to predict OS in patients with Non-Small Cell Lung Cancer (NSCLC). Methods: The proposed framework employed a ResNet-50 backbone to process input images and generate corresponding image embeddings. The embeddings were then combined with temporal data to produce OS probabilities as a function of time, effectively parameterizing the predictions based on time. The overall framework was developed using the U-CAN cohort (n = 556) and evaluated by comparing with a baseline method on the test set (n = 292). The baseline utilized the ResNet-50 architecture, processing only the images as input and providing OS predictions at pre-specified intervals, such as 2- or 5-year. Results: The incorporation of temporal data with image embeddings demonstrated an advantage in predicting OS, outperforming the baseline method with an improvement in AUC of 4.3%. The proposed model using clinical + IDP features achieved strong performance, and an ensemble of imaging and clinical + IDP models achieved the best overall performance (0.788), highlighting the complementary value of multimodal inputs. The proposed method also enabled risk stratification of patients into distinct categories (high vs low risk). Heat maps from the saliency analysis highlighted tumor regions as key structures for the prediction. Conclusion: Our method provided an automated framework for predicting OS as a function of time and demonstrates the potential of combining imaging and tabular data for improved survival prediction.

2604.03244 2026-05-25 cs.AI cs.CY cs.DB

AI Evaluation Should Require Standardized Item-Level Data Releases

AI评估应要求标准化的项目级数据发布

Han Jiang, Susu Zhang, Dongyao Zhu, Yuzhuo Bai, Sang T. Truong, Xiaoyuan Yi, Sanmi Koyejo, Xing Xie, Ziang Xiao

发表机构 * Johns Hopkins University(约翰霍普金斯大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Microsoft Research Asia(微软亚洲研究院) Stanford University(斯坦福大学) North Carolina State University(北卡罗来纳州立大学) Tsinghua University(清华大学)

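AI summary This position paper argues that standardized item-level benchmark data should be the default infrastructure for AI evaluation. Current evaluations suffer from underspecified item selection, construct misalignment, and poor generalization, which the authors trace to a misplaced focus on aggregate model scores. They argue that valid evaluation design requires empirical evidence from item-level model responses, and that standardized release of such data improves transparency, replicability, and auditability. To demonstrate feasibility, they construct the OpenEval archive and show how item-level data can identify low-quality items, document construct misalignment, and recover validity evidence about a benchmark's internal structure.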

详情
AI中文摘要

这篇立场论文认为,标准化的项目级基准数据应成为AI评估的默认基础设施。当前的评估存在项目选择不明确、构造错位和泛化能力差的问题。这些失败的根本原因在于对聚合模型分数的错误关注。没有项目级证据,有效性声明无法评估,导致能力声明夸大、研究方向错误以及对已部署系统的不当信任。我们的立场是,设计有效的评估需要来自项目级模型响应的实证证据,并且此类数据的标准化发布应被视为核心AI评估基础设施。此外,这种发布能够实现评估结果的透明度、可复制性和可审计性。为了展示这一规范既可行又重要,我们构建了OpenEval,这是一个包含来自广泛使用基准的15.5万个项目的1000万条响应的项目级档案,采用AI评估社区可以发展的统一模式。我们展示了项目级数据如何识别低质量项目、记录构造错位以及恢复关于基准内部结构的有效性证据。我们解决了关于污染和作者负担的反对意见,并表明每个问题相对于基于不可信声明做出的决策成本而言都是可处理的。

英文摘要

This position paper argues that standardized item-level benchmark data should become the default infrastructure for AI evaluation. Current evaluations suffer from underspecified item selection, construct misalignment, and poor generalization. The root cause of these failures is a misplaced focus on aggregate model scores. Without item-level evidence, validity claims cannot be assessed, resulting in inflated capability claims, misdirected research, and unwarranted trust in deployed systems. Our position is that designing valid evaluations requires empirical evidence from item-level model responses, and the standardized release of such data should be treated as core AI evaluation infrastructure. Such a release, in addition, enables transparency, replicability, and auditability of evaluation results. To show the norm is both feasible and consequential, we construct OpenEval, an item-level archive of 10M responses across 155k items from widely-used benchmarks, under a unified schema that the AI evaluation community can develop upon. We demonstrate how item-level data can identify low-quality items, document construct misalignment, and recover validity evidence about benchmarks' internal structure. We address objections around contamination and author burden, and show each is tractable relative to the cost of decisions made on claims that cannot be trusted.

2604.00003 2026-05-25 cs.CL cs.AI cs.IR

Tabular PDF Information Extraction with Local LLMs and Layout-Aware Parsing: A Reliability Evaluation

使用本地大语言模型和布局感知解析的表格PDF信息提取:可靠性评估

Muhammad Anis Al Hilmi, Neelansh Khare, Noel Framil Iglesias, Kurnia Adi Cahyanto, Azhar Al Afghani, Musfi Yuliadi

发表机构 * Faculty of Engineering, Universitas Swadaya Gunung Jati(工程学院,Swadaya Gunung Jati大学) University of California, Irvine(加州大学尔湾分校) UNIR, La Rioja(UNIR,拉里奥哈) Universitas Diponegoro(迪波内戈罗大学)

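AI summary This study evaluates the reliability of structured information extraction from academic PDF documents, using Indonesian higher-education course registration documents (KRS) as a case study. It compares three strategies: LLM-only, hybrid deterministic-LLM (regex plus LLM), and a Camelot-based pipeline with LLM fallback. The hybrid approach can be more efficient for deterministic metadata, while the Camelot-based pipeline with LLM fallback gives the best combination of accuracy and computational efficiency, which makes it well suited to computationally constrained environments.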

Comments 9 pages, 5 figures, 3 tables

详情
AI中文摘要

从学术PDF文档中提取结构化信息并非易事:单页通常结合自由文本元数据和表格区域,存在跨程序变化,并容易受到干扰下游解析的Unicode编码伪影的影响。本研究以印度尼西亚高等教育的学术课程注册文档(Kartu Rencana Studi或KRS)为案例,评估了表格PDF文档信息提取方法的可靠性。比较了三种策略:纯LLM、混合确定性-LLM(正则表达式和LLM)以及基于Camelot的管道(带LLM回退)。实验在140份文档(基于LLM的测试)和860份文档(基于Camelot的管道评估)上进行,涵盖四个学习项目,包含表格和元数据中的不同数据。使用Ollama和消费级CPU(无GPU)本地运行了三个12-14B的LLM模型(Gemma 3、Phi 4和Qwen 2.5)。评估使用了精确匹配(EM)和Levenshtein相似度(LS)指标,阈值为0.7。尽管并非适用于所有模型,但结果表明,与纯LLM相比,混合方法可以提高效率,尤其是对于确定性元数据。基于Camelot的管道(带LLM回退)在准确性(EM和LS高达0.99-1.00)和计算效率(大多数情况下每个PDF不到1秒)方面取得了最佳组合。Qwen 2.5:14b模型在所有场景中表现最一致。这些发现证实,在计算受限的环境中,将确定性和基于LLM的方法相结合是从基于文本的表格PDF文档中提取信息的可靠且高效的策略。

英文摘要

Extracting structured information from academic PDF documents is nontrivial: a single page typically combines free-text metadata with tabular regions, exhibits cross-program variation, and is susceptible to Unicode encoding artifacts that interfere with downstream parsing. This study evaluates the reliability of information extraction approaches for tabular PDF documents, using academic course registration documents (Kartu Rencana Studi or KRS) from Indonesian higher education as a case study. Three strategies are compared: LLM-only, Hybrid Deterministic-LLM (regex & LLM), and a Camelot-based pipeline with LLM fallback. Experiments were conducted on 140 documents for the LLM-based test and 860 documents for the Camelot-based pipeline evaluation, covering four study programs with varying data in tables and metadata. Three 12-14B LLM models (Gemma 3, Phi 4, and Qwen 2.5) were run locally using Ollama and a consumer-grade CPU without a GPU. Evaluations used exact match (EM) and Levenshtein similarity (LS) metrics with a threshold of 0.7. Although not applicable to all models, the results show that the hybrid approach can improve efficiency compared to LLM-only, especially for deterministic metadata. The Camelot-based pipeline with LLM fallback produced the best combination of accuracy (EM and LS up to 0.99-1.00) and computational efficiency (less than 1 second per PDF in most cases). The Qwen 2.5:14b model demonstrated the most consistent performance across all scenarios. These findings confirm that integrating deterministic and LLM-based methods is a reliable and efficient strategy for information extraction from tabular text-based PDF documents in computationally constrained environments.
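
As a rough illustration of the two metrics named in this abstract, the sketch below scores one extracted field against its reference value using exact match (EM) and Levenshtein similarity (LS). The abstract does not say how LS is normalized or how the 0.7 threshold is applied, so the normalization by the longer string, the whitespace stripping in `exact_match`, and the helper name `score_field` are all assumptions made for illustration, not the authors' implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(pred: str, gold: str) -> float:
    """Similarity in [0, 1]; ASSUMPTION: normalized by the longer string."""
    longest = max(len(pred), len(gold))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(pred, gold) / longest


def exact_match(pred: str, gold: str) -> bool:
    """ASSUMPTION: surrounding whitespace is ignored before comparing."""
    return pred.strip() == gold.strip()


def score_field(pred: str, gold: str, threshold: float = 0.7) -> dict:
    """Hypothetical helper: EM, LS, and whether LS reaches the threshold."""
    ls = levenshtein_similarity(pred, gold)
    return {"em": exact_match(pred, gold), "ls": ls, "ls_pass": ls >= threshold}
```

A call such as `score_field("Algoritma", "Algoritme")` fails EM but passes the LS threshold, which is the kind of near-miss a similarity metric is meant to credit.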

2603.24985 2026-05-25 cs.CV

Few-Shot Left Atrial Wall Segmentation in 3D LGE MRI via Meta-Learning

基于元学习的3D LGE MRI左心房壁少样本分割

Yusri Al-Sanaani, Rebecca Thornhill, Pablo Nery, Elena Pena, Robert deKemp, Calum Redpath, David Birnie, Sreeraman Rajan

发表机构 * Department of Systems and Computer Engineering, Carleton University(系统与计算机工程系,卡尔顿大学) Department of Radiology, Radiation Oncology, and Medical Physics, University of Ottawa(放射科、放射肿瘤学与医学物理系,渥太华大学) Division of Cardiology, Department of Medicine, University of Ottawa Heart Institute(心内科,医学系,渥太华心脏研究所)

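AI summary This study addresses left atrial wall segmentation in 3D late gadolinium enhancement MRI (LGE-MRI) with a model-agnostic meta-learning (MAML) framework built on a 3D residual U-Net, targeting few-shot settings (5, 10, or 20 samples). The framework is meta-trained on left atrial wall tasks together with auxiliary left and right atrial cavity tasks, and uses a boundary-aware composite loss to improve delineation of thin structures. In the few-shot setting it outperforms a fine-tuning baseline and remains reasonably robust across data domains, which may reduce the annotation burden for atrial remodeling assessment.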

Comments Accepted to IEEE EMBC 2026

详情
AI中文摘要

从晚期钆增强磁共振成像(LGE-MRI)中分割左心房(LA)壁因其薄几何结构、低对比度和有限的专家标注而具有挑战性。我们提出了一种基于模型无关元学习(MAML)的框架,采用3D残差U-Net骨干网络,用于K-shot(K=5, 10, 20)左心房壁分割。该框架在左心房壁任务以及辅助的左心房和右心房(RA)腔任务上进行元训练,并使用边界感知复合损失来改善薄结构描绘。我们在一个保留的干净测试集上评估了MAML,并在未见过的合成域偏移和本地队列上评估了其鲁棒性。在保留的干净测试集上,MAML在5-shot下优于少样本微调基线,Dice系数(DSC)=0.54对比0.48,豪斯多夫距离(HD95)=4.60对比6.40毫米。在20-shot下,MAML接近从头训练的完全监督模型,DSC=0.59对比0.61。在未见过的偏移下,性能相对于干净测试有所下降,但随K增加而持续改善。在5-shot下,MAML在未见过的合成偏移下达到DSC=0.52和HD95=5.02毫米,在本地队列上达到DSC=0.50和HD95=5.43毫米。这些结果表明,元学习可以改善低样本适应中的薄壁描绘,并可能减少心房重构评估的标注负担。

英文摘要

Segmenting the left atrial (LA) wall from late gadolinium enhancement magnetic resonance imaging (LGE-MRI) is challenging because of its thin geometry, low contrast, and limited expert annotations. We propose a model-agnostic meta-learning (MAML) framework with a 3D residual U-Net backbone for K-shot (K = 5, 10, 20) LA wall segmentation. The framework is meta-trained on LA wall tasks together with auxiliary LA and right atrial (RA) cavity tasks and uses a boundary-aware composite loss to improve thin-structure delineation. We evaluated MAML on a held-out clean test set and assessed its robustness under an unseen synthetic domain shift and on a local cohort. On the held-out clean test set, MAML outperformed the K-shot fine-tuning baseline at 5-shot, achieving Dice coefficient (DSC) = 0.54 versus 0.48 and Hausdorff distance (HD95) = 4.60 versus 6.40 mm. At 20-shot, MAML approached the fully supervised model trained from scratch, with DSC = 0.59 versus 0.61. Under unseen shifts, performance decreased relative to clean testing but improved consistently as K increased. At 5-shot, MAML achieved DSC = 0.52 and HD95 = 5.02 mm under the unseen synthetic shift, and DSC = 0.50 and HD95 = 5.43 mm on the local cohort. These results suggest that meta-learning can improve thin-wall delineation in low-shot adaptation and may reduce the annotation burden for atrial remodeling assessment.
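
For readers unfamiliar with the Dice coefficient (DSC) reported in this abstract, the following minimal sketch computes it for two binary masks flattened to equal-length sequences. The function name `dice_coefficient` and the convention of returning 1.0 when both masks are empty are assumptions for illustration; the paper's evaluation code may differ.

```python
def dice_coefficient(pred, target) -> float:
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for binary masks.

    `pred` and `target` are equal-length sequences of 0/1 (or False/True)
    values, e.g. a 3D segmentation mask flattened to one dimension.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    pred_count = sum(1 for p in pred if p)
    target_count = sum(1 for t in target if t)
    overlap = sum(1 for p, t in zip(pred, target) if p and t)
    denominator = pred_count + target_count
    if denominator == 0:
        return 1.0  # ASSUMPTION: two empty masks count as a perfect match
    return 2.0 * overlap / denominator
```

Thin structures are penalized heavily by this metric: a two-voxel-wide target `[0, 1, 1, 0, 0]` predicted one voxel off as `[0, 0, 1, 1, 0]` scores only 0.5, which helps put DSC values around 0.5 to 0.6 for a thin atrial wall in context.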

2603.21437 2026-05-25 cs.CL cs.IR

Pooling and Semantic Shift: The Fundamental Challenges in Long Text Embedding and Retrieval

池化与语义偏移:长文本嵌入与检索中的根本挑战

Hang Gao, Wujiang Xu, Kai Mei, Dimitris N. Metaxas

发表机构 * Rutgers University(罗格斯大学)

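AI summary This paper examines two fundamental challenges that Transformer-based embedding models face in long-text representation and retrieval: embedding collapse caused by pooling, and semantic shift. The authors argue that the loss of embedding quality stems not from text length or attention mechanisms alone but from pooling operations combined with internal semantic shift, and they establish a unified theoretical framework to prove it. Their experiments identify semantic shift as the main predictor of severe embedding concentration and show that anisotropy harms retrieval only when it is induced by strong semantic shift.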

详情
AI中文摘要

基于Transformer的嵌入模型经常表现出几何病态,例如各向异性和长度诱导的表示崩溃,这会降低下游检索性能。虽然先前的工作通常将这些归因于文本长度或注意力机制,但我们认为根本驱动因素反而是固有的池化操作与内部语义偏移。在本文中,我们建立了一个统一的理论框架,证明上下文池化本质上会导致嵌入崩溃。具体来说,我们从数学上证明,对语义多样的句子进行池化不可避免地会导致微观层面的语义稀释,并严格降低向量空间的平均成对距离,从而保证宏观层面的空间集中。基于这些几何洞察,我们正式定义了语义偏移,以捕捉文本内部的自然语义演变和分散。通过跨多种模型和语料库的精心控制实验,我们将文本长度与语义内容分离。我们证明语义偏移是严重嵌入集中的主要预测因子。关键的是,我们的检索评估揭示,各向异性仅在由强语义偏移诱导时才有害,从而调和了先前文献中的矛盾观察,并为现代嵌入模型面临的长上下文挑战提供了原则性解释。

英文摘要

Transformer-based embedding models frequently exhibit geometric pathologies, such as anisotropy and length-induced representation collapse, which can degrade downstream retrieval performance. While prior work often attributes these issues directly to text length or attention mechanisms, we argue that the fundamental drivers are instead the inherent pooling operations coupled with internal semantic shift. In this paper, we establish a unified theoretical framework proving that contextual pooling intrinsically causes embedding collapse. Specifically, we mathematically prove that pooling semantically diverse sentences inevitably leads to micro-level semantic dilution, and strictly reduces the Mean Pairwise Distance of the vector space, guaranteeing macro-level spatial concentration. Grounded in these geometric insights, we formally define semantic shift to capture the natural semantic evolution and dispersion within a text. Through carefully controlled experiments across diverse models and corpora, we disentangle text length from semantic content. We demonstrate that semantic shift is the primary predictor of severe embedding concentration. Crucially, our retrieval evaluations reveal that anisotropy is fundamentally harmful only when induced by strong semantic shifts, reconciling conflicting observations in prior literature and offering a principled explanation for the long-context challenges faced by modern embedding models.
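
The claim that pooling reduces the Mean Pairwise Distance can be illustrated numerically. The sketch below is not the authors' proof or experiment: it uses independent Gaussian vectors as a hypothetical stand-in for semantically diverse sentence embeddings, mean-pools them in fixed-size groups to mimic text-level embeddings, and compares the mean pairwise Euclidean distance before and after pooling. All names and parameter values are illustrative assumptions.

```python
import math
import random


def mean_pairwise_distance(vectors) -> float:
    """Average Euclidean distance over all unordered pairs of vectors."""
    total, pairs = 0.0, 0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            total += math.dist(vectors[i], vectors[j])
            pairs += 1
    return total / pairs


def mean_pool(vectors):
    """Coordinate-wise mean of a group of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def simulate(num_texts=10, sentences_per_text=4, dim=16, seed=0):
    """Return (MPD of sentence vectors, MPD of pooled text vectors).

    ASSUMPTION: sentence embeddings are modeled as i.i.d. standard
    Gaussian vectors, a toy proxy for semantically diverse sentences.
    """
    rng = random.Random(seed)
    sentences = [
        [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for _ in range(num_texts * sentences_per_text)
    ]
    texts = [
        mean_pool(sentences[start:start + sentences_per_text])
        for start in range(0, len(sentences), sentences_per_text)
    ]
    return mean_pairwise_distance(sentences), mean_pairwise_distance(texts)


if __name__ == "__main__":
    sentence_mpd, text_mpd = simulate()
    print(f"sentence-level MPD: {sentence_mpd:.3f}")
    print(f"pooled text-level MPD: {text_mpd:.3f}")
```

In this toy setting, averaging several independent vectors shrinks their spread, so the pooled vectors sit closer together than the sentence vectors they came from; increasing `sentences_per_text` makes the concentration more pronounced.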