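arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering & systems.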

2605.16087 2026-05-25 cs.RO cs.AI

Towards Trustworthy and Explainable AI for Perception Models: From Concept to Prototype Vehicle Deployment

Till Beemelmanns, Shayan Sharifi, Manas Mehrotra, Ayushman Choudhuri, Lutz Eckstein

Affiliations: Institute for Automotive Engineering, RWTH Aachen University

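AI summary: This paper studies how to make autonomous-driving perception models trustworthy and explainable. To address the opacity and safety-assurance problems of deep neural networks in autonomous driving, it proposes a perception module that integrates faithful explanations with uncertainty estimation. Built on a Transformer-based detector, the method derives explanations from the attention mechanism at inference time and checks their faithfulness with perturbation-based consistency tests; an uncertainty estimation and calibration module and robustness-enhancing training are added on top. The module is deployed on a prototype vehicle with a visualization interface, demonstrating that real-time trustworthy perception monitoring is feasible.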

Comments Accepted for publication at IEEE ITSC 2026

Abstract

Deep Neural Networks have become the dominant solution for Autonomous Driving perception, but their opacity conflicts with emerging Trustworthy AI guidelines and complicates safety assurance, debugging, and human oversight. While theoretical frameworks for safe and Explainable AI (XAI) exist, concrete implementations of Trustworthy AI for 3D scene understanding remain scarce. We address this gap by proposing a Trustworthy AI perception module that is robust and integrates faithful explainability with calibrated uncertainty estimates. Building on a transformer-based detector, we derive explanations from the attention mechanism at inference time and validate their faithfulness using perturbation-based consistency tests. We further integrate an uncertainty estimation and calibration module, and apply robustness-enhancing training methods. Experiments show faithful saliency behavior, improved robustness, and well-calibrated uncertainty estimates. Finally, we deploy these Trustworthy AI elements in a prototype vehicle and provide an XAI Interface that visualizes documentation artifacts, model uncertainty state, and saliency maps, demonstrating the feasibility of trustworthy perception monitoring in real time. Supplementary materials are available at https://tillbeemelmanns.github.io/trustworthy_ai/ .
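
To make "well-calibrated uncertainty estimates" concrete, here is a minimal sketch of the expected calibration error (ECE), a common calibration metric. The abstract does not say which metric the authors report, so the choice of ECE, the function name, and the equal-width binning are all assumptions made for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over equal-width bins.

    confidences: predicted probabilities in [0, 1], one per detection.
    correct:     booleans, True where the prediction was right.
    This is a generic metric sketch, not the paper's implementation.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Assign each prediction to a confidence bin; 1.0 falls in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(accuracy - mean_conf)
    return ece
```

A value near 0 means stated confidence matches observed accuracy; a model that is always fully confident and always wrong scores 1.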

2605.15828 2026-05-25 cs.CV

How Do Mobile World Models Guide GUI Agents?

Weikai Xu, Kun Huang, Yunren Feng, Jiaxing Li, Yuhan Chen, Yuxuan Liu, Zhizheng Jiang, Heng Qu, Pengzhi Gao, Wei Liu, Jian Luan, Xiaolin Hu, Bo An

Affiliations: Nanyang Technological University; MiLM Plus, Xiaomi Inc.; Independent Researchers; Gaoling School of Artificial Intelligence, Renmin University of China; Wuhan University; Xiamen University

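AI summary: This paper studies how mobile world models can guide GUI agents. To address the shortcomings of existing models in predicting action consequences, the authors train world models in four modalities: delta text, full text, diffusion-based images, and renderable code. The models reach state-of-the-art results on MobileWorldBench and Code2WorldBench. Downstream evaluation shows that renderable code reconstruction gives high in-distribution fidelity and useful multimodal supervision, that text feedback is more robust for out-of-distribution execution, and that world models help more as training supervision or prior perception than as universal post-hoc verifiers.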

Abstract

Recent advances in vision-language models have enabled mobile GUI agents to perceive visual interfaces and execute user instructions, but reliable prediction of action consequences remains critical for long-horizon and high-risk interactions. Existing mobile world models provide either text-based or image-based future states, yet it remains unclear which representation is useful, whether generated rollouts can replace real environments, and how test-time guidance helps agents of different strengths. To answer the above questions, we filter and annotate mobile world-model data, then train world models across four modalities: delta text, full text, diffusion-based images, and renderable code. These models achieve SoTA performance on both MobileWorldBench and Code2WorldBench. Furthermore, by evaluating their downstream utility on AITZ, AndroidControl, and AndroidWorld, we obtain three findings. First, renderable code reconstruction achieves high in-distribution fidelity and provides effective multimodal supervision for data construction, while text-based feedback is more robust for online out-of-distribution (OOD) execution. Second, world-model-generated trajectories can provide transferable interaction experience in the training process and improve agents' end-to-end task performance, although these data do not preserve the original distribution. Last, for overconfident mobile agents with low action entropy, posterior self-reflection provides limited gains, suggesting that world models are more effective as prior perception or training supervision than as universal post-hoc verifiers.

2605.07590 2026-05-25 cs.CV

Beyond Defenses: Manifold-Aligned Regularization for Intrinsic 3D Point Cloud Robustness

Pedro Alonso, Chongshou Li, Tianrui Li

Affiliations: School of Computing and Artificial Intelligence, Southwest Jiaotong University; Engineering Research Center of Sustainable Urban Intelligent Transportation, Ministry of Education, Southwest Jiaotong University

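AI summary: Despite progress in point cloud robustness, existing methods rely mostly on data augmentation or defense mechanisms and overlook the geometric nature of adversarial fragility. This paper proposes manifold-aligned regularization, arguing that adversarial vulnerability in 3D networks stems from a mismatch between the latent geometry learned by the model and the intrinsic geometry of the point cloud surface. The resulting Manifold-Aligned Point Recognition (MAPR) framework improves robustness on several datasets without adversarial training or additional data.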

Abstract

Despite extensive progress in point cloud robustness, existing methods primarily rely on augmentation strategies or defense mechanisms while overlooking the geometric nature of adversarial fragility. We hypothesize that adversarial vulnerability in 3D networks arises from a manifold misalignment between the latent geometry learned by the model and the intrinsic geometry of the underlying surface. Small, geometry-preserving perturbations along the input manifold often induce disproportionate distortions in feature space, potentially leading to misclassifications. We formalize this phenomenon by developing a geometric interpretation of 3D robustness that links classical adversarial theory to the intrinsic structure of point clouds. Motivated by this analysis, we introduce Manifold-Aligned Point Recognition (MAPR), a framework that regularizes the latent geometry by aligning predictions across intrinsic perturbations. MAPR augments each point cloud with intrinsic features capturing local curvature and diffusion structure, and applies a consistency loss that preserves invariance to intrinsic, geometry-preserving perturbations. Without relying on adversarial training or additional data, MAPR consistently improves robustness under multiple adversarial attacks across several datasets, achieving average robustness gains of +20.02 and +8.83 percentage points over vanilla models on ModelNet40 and ScanObjectNN, respectively.
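
The consistency idea in the abstract (align predictions across geometry-preserving perturbations) can be illustrated with a small sketch. The abstract does not give the form of MAPR's loss, so the symmetric KL divergence used here is a stand-in chosen for illustration, and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consistency_loss(logits_clean, logits_perturbed):
    """Symmetric KL between predictions on a point cloud and its perturbed copy.

    Assumption: the perturbed copy comes from some geometry-preserving
    transform applied upstream; that transform is not modeled here.
    """
    p = softmax(logits_clean)
    q = softmax(logits_perturbed)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
```

The loss is zero when both inputs yield the same class distribution and grows as the predictions diverge, which is the invariance the regularizer is meant to encourage.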

2605.07220 2026-05-25 cs.LG

On the Robustness of Distribution Support under Diffusion Guidance

Ruijia Cao, Yuchen Wu, Nisha Chandramoorthy

Affiliations: Center for Applied Mathematics, Cornell University; School of Operations Research and Information Engineering, Cornell University; Department of Statistics, The University of Chicago

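AI summary: This paper studies whether diffusion guidance keeps generated samples near the support of the target distribution, giving a theoretical account of why guidance consistently produces high-quality samples. Assuming exact score functions, the authors establish a robustness-of-support property: guided diffusion processes almost always generate samples close to the target support, which keeps samples structurally plausible. The analysis covers both DDIM and DDPM and a wide range of discretization schemes induced by exponential integrators.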

Abstract

Diffusion guidance is a powerful technique that enables controllable and high-fidelity sample generation with diffusion models. At a high level, it modifies the score function by incorporating a guidance term that steers the generative process toward a desired condition. Despite its empirical success, the theoretical properties of diffusion guidance remain largely unexplored, and it is not well understood why it consistently produces high-quality samples. In this work, we explain the effectiveness of diffusion guidance by establishing a robustness of support property. Specifically, we show that, given exact access to the score functions, guided diffusion processes almost always generate samples that remain close to the target support. This property is particularly desirable, as samples that lie off the support are often structurally implausible and may adversely affect downstream tasks. Our analysis covers both Denoising Diffusion Implicit Models (DDIM) and Denoising Diffusion Probabilistic Models (DDPM), and applies to a wide range of discretization schemes induced by exponential integrators. Our results provide a rigorous foundation for understanding why diffusion guidance produces physically meaningful and structurally plausible samples.
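
For readers who want the "guidance term" spelled out, the standard classifier-guidance form is sketched below. The notation (guidance scale $w$, condition $c$, noised marginal $p_t$) is an assumption made here; the paper's own notation and exact guidance scheme may differ.

```latex
% Guided score: unconditional score plus a scaled guidance term.
s_w(x, t) = \nabla_x \log p_t(x) + w \, \nabla_x \log p_t(c \mid x)

% By Bayes' rule, \nabla_x \log p_t(c \mid x)
%   = \nabla_x \log p_t(x \mid c) - \nabla_x \log p_t(x),
% so the same guided score can be written without a classifier:
s_w(x, t) = (1 - w) \, \nabla_x \log p_t(x) + w \, \nabla_x \log p_t(x \mid c)
```

With $w = 1$ this reduces to the exact conditional score; $w > 1$ strengthens the pull toward the condition.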

2605.06840 2026-05-25 cs.AI

Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

Sixing Chen, Ji-An Li, Saner Cakir, Sinan Akcali, Kayla Lee, Marcelo G. Mattar

Affiliations: Generality Inc.

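AI summary: By extracting search trees from the reasoning traces of large language models (LLMs) playing four-in-a-row, this study shows that LLM planning is myopic. Although the traces contain deep nodes, move choices are driven mainly by shallow search, whereas human performance is driven primarily by deep search. The finding exposes a key difference between LLM and human planning and offers targeted guidance for improving LLM planning.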

Abstract

Large language models (LLMs), especially reasoning models, generate extended chain-of-thought (CoT) reasoning that often contains explicit deliberation over future outcomes. Yet whether this deliberation constitutes genuine planning, how it is structured, and what aspects of it drive performance remain poorly understood. In this work, we introduce a new method to characterize LLM planning by extracting and quantifying search trees from reasoning traces in the four-in-a-row board game. By fitting computational models on the extracted search trees, we characterize how plans are structured and how they influence move decisions. We find that LLMs' search is shallower than humans', and that performance is predicted by search breadth rather than depth. Most strikingly, although LLMs expand deep nodes in their traces, their move choices are best explained by a myopic model that ignores those nodes entirely. A causal intervention study where we selectively prune CoT paragraphs further suggests that move selection is driven predominantly by shallow rather than deep nodes. These patterns contrast with human planning, where performance is driven primarily by deep search. Together, our findings reveal a key difference between LLM and human planning: while human expertise is driven by deeper search, LLMs do not act on deep lookahead. This dissociation offers targeted guidance for aligning LLM and human planning. More broadly, our framework provides a generalizable approach for interpreting the structure of LLM planning across strategic domains.

2605.06498 2026-05-25 cs.RO cs.SY eess.SY

Lie Group Formulation of Recursive Dynamics Algorithms of Higher Order for Floating-Base Robots

Ahmed Ali, Chiara Gabellieri, Antonio Franchi

Affiliations: Robotics and Mechatronics Department, EEMCS Faculty, University of Twente; Department of Computer, Control and Management Engineering, Sapienza University of Rome

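AI summary: This paper formulates higher-order recursive dynamics algorithms for floating-base robots in a Lie-group framework, giving procedures for higher-order time derivatives of the Lie-group Newton-Euler, Articulated-Body Inertia, and hybrid dynamics algorithms. The approach applies to tree-structured mechanisms whose base configuration evolves on SE(3) and whose attached mechanism has its configuration on the manifold T^{n1} \times R^{n2}, and it uses the spatial representation of twists to collect the recursions into closed-form equations of motion. Applied to a 12-DoF aerial manipulator, the method yields geometric forward and inverse dynamics and their higher-order derivatives; in the reported benchmarks its computational cost grows quadratically with derivative order, versus exponential growth for the automatic-differentiation baseline.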

Journal ref ASME. Journal of Mechanisms and Robotics (2026)

DOI: 10.1115/1.4071985

Abstract

In this paper, we describe procedures for computing higher-order time derivatives of the Lie-group Newton-Euler, Articulated-Body Inertia, and hybrid dynamics algorithms for floating-base trees, where the base configuration evolves on SE(3) and the attached mechanism is an open kinematic tree with configuration on the (n1+n2)-dimensional manifold T^{n1} \times R^{n2}, using the spatial representation of twists. After presenting the algorithms, we collect the resulting recursions into closed-form equations of motion, identifying an admissible Coriolis matrix satisfying the passivity property, and showing that the articulated inertia tensor remains unchanged across all time derivatives. We then apply the developed methods to a 12-DoF aerial manipulator to derive analytical expressions for its geometric forward and inverse dynamics along with their first time derivatives, while numerical simulations evaluate these dynamics up to fifth order. Finally, to demonstrate their practical utility, we benchmark the proposed extensions and show that, in the considered tests, their computational cost scales quadratically with the derivative order, whereas the automatic-differentiation baseline exhibits exponential scaling.

2605.06094 2026-05-25 cs.CV cs.AI

VISD: Enhancing Video Reasoning via Structured Self-Distillation

Hao Lin, Kunyang Lv, Xu Jiang, Jingqi Tian, Zhongjing Du, Jiayu Ding, Qiaoman Zhang, Hongbo Jin

Affiliations: Huazhong University of Science and Technology (HUST); Wuhan University; Peking University; Tsinghua University

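AI summary: This paper proposes VISD, a structured self-distillation framework for video reasoning that targets the inefficient learning caused by sparse rewards and missing fine-grained credit assignment when training video LLMs on complex reasoning. VISD introduces a video-aware judge model that decomposes reasoning quality into answer correctness, logical consistency, and spatio-temporal grounding, and uses this structured feedback to guide a teacher policy for token-level supervision. A direction-magnitude decoupling mechanism stably combines the dense supervision with reinforcement learning, improving reasoning accuracy and training efficiency. Experiments on multiple benchmarks show VISD outperforming existing methods with faster convergence.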

Abstract

Training VideoLLMs for complex reasoning remains challenging due to sparse sequence-level rewards and the lack of fine-grained credit assignment over long, temporally grounded reasoning trajectories. While reinforcement learning with verifiable rewards (RLVR) provides reliable supervision, it fails to capture token-level contributions, leading to inefficient learning. Conversely, existing self-distillation methods offer dense supervision but lack structure and diagnostic specificity, and often interact unstably with reinforcement learning. In this work, we propose VISD, a structured self-distillation framework that introduces diagnostically meaningful privileged information for video reasoning. VISD employs a video-aware judge model to decompose reasoning quality into multiple dimensions, including answer correctness, logical consistency, and spatio-temporal grounding, and uses this structured feedback to guide a teacher policy for token-level supervision. To stably integrate dense supervision with RL, we introduce a direction-magnitude decoupling mechanism, where rollout-level advantages computed from rewards determine update direction, while structured privileged signals modulate token-level update magnitudes. This design enables semantically aligned and fine-grained credit assignment, improving both reasoning faithfulness and training efficiency. Additionally, VISD incorporates curriculum scheduling and EMA-based teacher stabilization to support robust optimization over long video sequences. Experiments on diverse benchmarks show that VISD consistently outperforms strong baselines, improving answer accuracy and spatio-temporal grounding quality. Notably, VISD reaches these gains with nearly 2x faster convergence in optimization steps, highlighting the effectiveness of structured self-supervision in improving both performance and sample efficiency for VideoLLMs.
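
The direction-magnitude decoupling described above can be pictured with a short sketch: one rollout-level advantage fixes the sign of every token's update, and non-negative per-token signals only rescale the size. The normalization to mean 1 and the function name are assumptions for illustration; the abstract does not specify how VISD computes or normalizes these weights.

```python
def token_level_coefficients(advantage, token_signals, eps=1e-8):
    """Spread a rollout-level advantage over tokens without flipping its sign.

    advantage:     scalar advantage for the whole rollout (sets direction).
    token_signals: non-negative per-token scores, e.g. derived from a
                   teacher or judge (sets relative magnitude only).
    Hypothetical helper, not the VISD implementation.
    """
    if not token_signals:
        return []
    if any(s < 0 for s in token_signals):
        raise ValueError("token signals must be non-negative")

    # Normalize to mean ~1 so the average update size tracks the advantage.
    mean_signal = sum(token_signals) / len(token_signals)
    weights = [s / (mean_signal + eps) for s in token_signals]

    # Every coefficient inherits the sign of the rollout-level advantage.
    return [advantage * w for w in weights]
```

Because the weights are non-negative, a token can receive a larger or smaller update than its neighbors, but never one that contradicts the reward-derived direction.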

2605.06088 2026-05-25 cs.CV

Model Spec Midtraining: Improving the Generalization of Alignment Training

Chloe Li, Nevan Wichers, Sara Price, Samuel Marks, Jon Kutasov

Affiliations: Anthropic

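AI summary: Some frontier AI developers want to align language models to a Model Spec or Constitution describing intended behavior. Standard alignment fine-tuning on demonstration data, however, can yield shallow alignment that generalizes poorly. This paper proposes model spec midtraining (MSM): after pre-training and before alignment fine-tuning, the model is trained on synthetic documents discussing its spec, which shapes how it generalizes from later demonstration data. Experiments show that MSM improves alignment on complex safety-relevant propensities and identify spec design choices that strengthen alignment generalization.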

Abstract

Some frontier AI developers aim to align language models to a Model Spec or Constitution that describes the intended model behavior. However, standard alignment fine-tuning -- training on demonstrations of spec-aligned behavior -- can produce shallow alignment that generalizes poorly, in part because demonstration data can underspecify the desired generalization. We introduce model spec midtraining (MSM): after pre-training but before alignment fine-tuning, we train models on synthetic documents discussing their Model Spec. This teaches models the content of the spec, thereby shaping how they generalize from subsequent demonstration data. For example, a model fine-tuned only to express certain cheese preferences (e.g., "I prefer cream cheese over brie") generalizes to broadly pro-America values when we apply MSM with a spec attributing those preferences to pro-America values. Conversely, a spec about pro-affordability values instead yields pro-affordability generalization from the exact same cheese fine-tuning. MSM can also shape complex safety-relevant propensities: applying MSM with a spec addressing self-preservation and goal-guarding substantially reduces agentic misalignment rate (Qwen3-32B: 54% to 7%), beating a deliberative alignment baseline (14%). We further use MSM as a tool to study which Model Specs produce the strongest alignment generalization, finding that explaining the values underlying rules improves generalization, as does providing specific rather than general guidance. Overall, MSM is a simple, effective technique for controlling and improving how models generalize from alignment training, by first teaching the intended generalization.

2605.01018 2026-05-25 cs.CV

WildTableBench: Benchmarking Multimodal Foundation Models on Table Understanding In the Wild

Junzhe Huang, Xiaoxiao Sun, Yan Yang, Yuxuan Hou, Ruotian Zhang, Sirui Li, Hehe Fan, Serena Yeung-Levy, Xin Yu

Affiliations: The University of Queensland; Stanford University; The Australian National University; Zhejiang University; Murdoch University; The University of Adelaide

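AI summary: WildTableBench is a benchmark for evaluating how well multimodal foundation models understand table images from real-world settings. It comprises 402 real table images from diverse domains and 928 manually annotated questions that probe structural perception and numerical reasoning. Current mainstream multimodal models score low on the benchmark: only one model exceeds 50% accuracy, exposing clear weaknesses on complex table images.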

Abstract

Using multimodal foundation models to analyze table images is a high-value yet challenging application in consumer and enterprise scenarios. Despite its importance, current evaluations rely largely on structured-text tables or clean rendered images, leaving the visual complexity of in-the-wild table images underexplored. Such images feature varied layouts and diverse domains that demand sophisticated structural perception and numerical reasoning. To bridge this gap, we introduce WildTableBench, the first question-answering benchmark for naturally occurring table images from real-world settings. WildTableBench comprises 402 high-information-density table images collected from online forums and websites across diverse domains, together with 928 manually annotated and verified questions spanning 17 subtypes across five categories. We evaluate 21 frontier proprietary and open-source multimodal foundation models on this benchmark. Only one model exceeds 50% accuracy, while all remaining models range from 4.1% to 49.9%. We further conduct diagnostic analyses to characterize model failures and reveal persistent weaknesses in structural perception and reasoning. These results and analyses provide useful insights into current model capabilities and establish WildTableBench as a valuable diagnostic benchmark for table image understanding. Dataset: https://huggingface.co/datasets/jzhuang/WildTableBench Code: https://github.com/hjzhe/WildTableBench Leaderboard: https://hjzhe.github.io/WildTableBench

2604.28048 2026-05-25 cs.CL cs.SI

Stable Behavior, Limited Variation: Persona Validity in LLM Agents for Urban Sentiment Perception

Neemias B da Silva, Rodrigo Minetto, Daniel Silver, Thiago H Silva

Affiliations: University of Toronto

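AI summary: This study examines how persona settings affect the consistency and variation of multimodal large language model (LLM) behavior in urban sentiment perception. Using persona variables covering gender, economic status, political orientation, and personality, it finds that agents sharing a persona behave highly consistently, while differences across personas are limited: only economic status and personality produce detectable but practically small variation. The models also perform worse on fine-grained sentiment judgments, and removing the persona sometimes matches or improves agreement with human labels, suggesting that simple label-based persona prompts add limited annotation value.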

Comments 8 pages, 8 figures. IEEE DCOSS - UrbCom

Journal ref IEEE DCOSS 2026

Abstract

Large Language Models (LLMs) are increasingly used as proxies for human perception in urban analysis, yet it remains unclear whether persona prompting produces meaningful and reproducible behavioral diversity. We investigate whether distinct personas influence urban sentiment judgments generated by multimodal LLMs. Using a factorial set of personas spanning gender, economic status, political orientation, and personality, we instantiate multiple agents per persona to evaluate urban scene images from the PerceptSent dataset and assess both within-persona consistency and cross-persona variation. Results show strong convergence among agents sharing a persona, indicating stable and reproducible behavior. However, cross-persona differentiation is limited: economic status and personality induce statistically detectable but practically modest variation, while gender shows no measurable effect and political orientation only negligible impact. Agents also exhibit an extremity bias, collapsing intermediate sentiment categories common in human annotations. As a result, performance remains strong on coarse-grained polarity tasks but degrades as sentiment resolution increases, suggesting that simple label-based persona prompting does not capture fine-grained perceptual judgments. To isolate the contribution of persona conditioning, we additionally evaluate the same model without personas. Surprisingly, the no-persona model sometimes matches or exceeds persona-conditioned agreement with human labels across all task variants, suggesting that simple label-based persona prompting may add limited annotation value in this setting.

2604.27468 2026-05-25 cs.CL

VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation

Yulu Gao, Bohao Zhang, Zongheng Tang, Jitong Liao, Wenjun Wu, Si Liu

Affiliations: Hangzhou International Innovation Institute of Beihang University; Beihang University

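AI summary: This paper proposes VGGT-Segmentor (VGGT-S), a geometry-enhanced cross-view segmentation framework for instance-level object segmentation between egocentric and exocentric views. It builds on the strong cross-view feature representation of VGGT and adds a new Union Segmentation Head that produces pixel-accurate masks through multi-stage processing. A single-image self-supervised training strategy removes the need for paired annotations while generalizing well, and the method outperforms prior work on the Ego-Exo4D benchmark.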

Abstract

Instance-level object segmentation across disparate egocentric and exocentric views is a fundamental challenge in visual understanding, critical for applications in embodied AI and remote collaboration. This task is exceptionally difficult due to severe changes in scale, perspective, and occlusion, which destabilize direct pixel-level matching. While recent geometry-aware models like VGGT provide a strong foundation for feature alignment, we find they often fail at dense prediction tasks due to significant pixel-level projection drift, even when their internal object-level attention remains consistent. To bridge this gap, we introduce VGGT-Segmentor (VGGT-S), a framework that unifies robust geometric modeling with pixel-accurate semantic segmentation. VGGT-S leverages VGGT's powerful cross-view feature representation and introduces a novel Union Segmentation Head. This head operates in three stages: mask prompt fusion, point-guided prediction, and iterative mask refinement, effectively translating high-level feature alignment into a precise segmentation mask. Furthermore, we propose a single-image self-supervised training strategy that eliminates the need for paired annotations and enables strong generalization. On the Ego-Exo4D benchmark, VGGT-S sets a new state-of-the-art, achieving 67.7% and 68.0% average IoU for Ego to Exo and Exo to Ego tasks, respectively, significantly outperforming prior methods. Notably, our correspondence-free pretrained model surpasses most fully-supervised baselines, demonstrating the effectiveness and scalability of our approach. Code is publicly available at: https://github.com/buaa-colalab/VGGT-S.
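
The reported numbers are average IoU scores. As a reminder of what that metric measures, here is a minimal sketch for binary masks stored as nested lists. The benchmark's official evaluation code may differ in detail; in particular, scoring two empty masks as 1.0 is a convention assumed here, and the function names are illustrative.

```python
def mask_iou(pred, target):
    """Intersection over union of two same-sized binary masks (rows of 0/1)."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p, t = bool(p), bool(t)
            intersection += p and t
            union += p or t
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def mean_iou(pairs):
    """Average IoU over (prediction, target) mask pairs."""
    scores = [mask_iou(pred, target) for pred, target in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```

For example, a prediction covering two pixels that overlaps a two-pixel target in exactly one pixel has intersection 1 and union 3, giving an IoU of 1/3.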

2604.11679 2026-05-25 cs.CV

Towards Brain MRI Foundation Models for the Clinic: Findings from the FOMO25 Challenge

Asbjørn Munk, Stefano Cerri, Vardan Nersesjan, Christian Hedeager Krag, Jakob Ambsdorf, Pablo Rocamora García, Julia Machnio, Peirong Liu, Suhyun Ahn, Nasrin Akbari, Yasmina Al Khalil, Kimberly Amador, Sina Amirrajab, Tal Arbel, Meritxell Bach Cuadra, Ujjwal Baid, Bhakti Baheti, Jaume Banus, Kamil Barbierik, Christoph Brune, Yansong Bu, Baptiste Callard, Yuhan Chen, Cornelius Crijnen, Corentin Dancette, Peter Drotar, Prasad Dutande, Nils D. Forkert, Saurabh Garg, Jakub Gazda, Matej Gazda, Benoît Gérin, Partha Ghosh, Weikang Gong, Pedro M. Gordaliza, Sam Hashemi, Tobias Heimann, Fucang Jia, Jiexin Jiang, Emily Kaczmarek, Chris Kang, Seung Kwan Kang, Mohammad Khazaei, Julien Khlaut, Petros Koutsouvelis, Jae Sung Lee, Yuchong Li, Mengye Lyu, Mingchen Ma, Anant Madabhushi, Klaus H. Maier-Hein, Pierre Manceron, Andrés Martínez Mora, Moona Mazher, Felix Meister, Nataliia Molchanova, Steven A. Niederer, Leonard Nürnberg, Jinah Park, Abdul Qayyum, Jonas Richiardi, Antoine Saporta, Branislav Setlak, Ning Shen, Justin Szeto, Constantin Ulrich, Puru Vaish, Vibujithan Vigneshwaran, Leroy Volmer, Zihao Wang, Siqi Wei, Anthony Winder, Jelmer M. Wolterink, Maxence Wynen, Chang Yang, Si Young Yie, Mostafa Mehdipour Ghazi, Akshay Pai, Espen Jimenez Solem, Sebastian Nørgaard Llambias, Mikael Boesen, Michael Eriksen Benros, Juan Eugenio Iglesias, Mads Nielsen

Affiliations: Department of Computer Science, University of Copenhagen, Copenhagen, Denmark; Pioneer Centre for AI, Copenhagen, Denmark; Copenhagen Research Centre for Biological and Precision Psychiatry, Mental Health Centre Copenhagen, Copenhagen University Hospital, Capital Region of Denmark, Copenhagen, Denmark; Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts, USA; Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Boston, Massachusetts, USA; Johns Hopkins University, Baltimore, Maryland, USA; Radiological AI Testcenter (RAIT), Capital Region of Denmark, Copenhagen, Denmark; Copenhagen University Hospital, Rigshospitalet, Capital Region of Denmark, Copenhagen, Denmark; Copenhagen University Hospital, Bispebjerg & Frederiksberg Hospital, Capital Region of Denmark, Copenhagen, Denmark; Department of Clinical Medicine, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark; Division of Medical Image Computing, German Cancer Research Center (DKFZ), Heidelberg, Germany; University of British Columbia, Vancouver, British Columbia, Canada; Hawkes Institute, Department of Computer Science, University College London, London, United Kingdom; Lung Institute, Faculty of Medicine, Imperial College London, London, United Kingdom; Department of Applied Mathematics, Technical Medical Centre, University of Twente, Enschede, Netherlands; IISLAB, Technical University of Košice, Košice, Slovakia; 2nd Department of Internal Medicine, Pavol Jozef Safarik University and L Pasteur University Hospital, Košice, Slovakia; Fudan University, Shanghai, China; Shenzhen Technology University, Shenzhen, China; Department of Radiology, Lausanne University Hospital and University of Lausanne, Lausanne, Switzerland; Louvain Neuroinflammation Imaging Lab (NIL), Université Catholique de Louvain, Brussels, Belgium; University of Applied Sciences; CIBM Center for Biomedical Imaging, Lausanne, Switzerland; Department of Radiation Oncology (Maastro), GROW Research Institute for Oncology and Reproduction, Maastricht University Medical Centre+, Maastricht, The Netherlands; Department of Biomedical Engineering, Medical Image Analysis, Eindhoven University of Technology, Eindhoven, The Netherlands; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China; McGill University; Mila - Quebec AI Institute, Montreal, Canada; Hotchkiss Brain Institute; Department of Radiology, University of Calgary, Calgary, Alberta, Canada; Alberta Children's Hospital Research Institute, Department of Clinical Neuroscience, University of Calgary, Calgary, Alberta, Canada; The Wallace H. Coulter Department of Biomedical Engineering, Georgia Tech and Emory University, Atlanta, Georgia, USA; SGGS College of Engineering; Seoul National University, Seoul, South Korea; The D-Lab, Department of Precision Medicine, GROW Research Institute for Oncology and Reproduction, Maastricht University, Maastricht, The Netherlands; Artificial Intelligence in Medicine (AIM) Program, Mass General Brigham, Harvard Medical School, Boston, Massachusetts, USA; Nuclear Medicine, CARIM & GROW, Maastricht University, Maastricht, The Netherlands; Department of Radiation Oncology, Dana-Farber Cancer Institute, Brigham and Women’s Hospital, Harvard Medical School, Boston, Massachusetts, USA; Learning Group, Heidelberg University Hospital, Heidelberg, Germany

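AI summary: Clinical deployment of automated brain MRI analysis is hampered by highly heterogeneous data and costly labels. The paper reports on the FOMO25 challenge, which provided a large pretraining dataset, FOMO60K, and evaluated models on real clinical data in few-shot and out-of-domain settings. It finds that self-supervised pretraining improves generalization under domain shift, that different pretraining objectives suit different tasks, and that small pretrained models already perform well, while scaling model size and training duration brought no reliable gains.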


Abstract

Clinical deployment of automated brain MRI analysis faces a fundamental challenge: clinical data is heterogeneous and noisy, and high-quality labels are prohibitively costly to obtain. Self-supervised learning (SSL) can address this by leveraging the vast amounts of unlabeled data produced in clinical workflows to train robust foundation models that adapt out-of-domain with minimal supervision. However, the development of foundation models for brain MRI has been limited by small pretraining datasets and in-domain benchmarking focused on high-quality, research-grade data. To address this gap, we organized the FOMO25 challenge as a satellite event at MICCAI 2025. FOMO25 provided participants with a large pretraining dataset, FOMO60K, and evaluated models on data sourced directly from clinical workflows in few-shot and out-of-domain settings. Tasks covered infarct classification, meningioma segmentation, and brain age regression, and considered both models trained on FOMO60K (method track) and models trained on any data (open track). Nineteen foundation models from sixteen teams were evaluated using a standardized containerized pipeline. Results show that (a) self-supervised pretraining improves generalization on clinical data under domain shift, with the strongest models trained out-of-domain surpassing supervised baselines trained in-domain; (b) no single pretraining objective benefits all tasks: MAE favors segmentation, while hybrid reconstruction-contrastive objectives favor classification; and (c) small pretrained models achieved strong performance, while scaling model size and training duration did not yield reliable benefits.


2604.09349 2026-05-25 cs.CV cs.AI cs.CL

Visually-Guided Policy Optimization for Multimodal Reasoning


Zengbin Wang, Feng Xiong, Liang Lin, Xuecai Hu, Yong Wang, Yanlin Wang, Man Zhang, Xiangxiang Chu

Affiliations: AMAP, Alibaba Group; SYSU (Sun Yat-sen University); BUPT (Beijing University of Posts and Telecommunications)

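AI summary: To address the insufficient visual attention of vision-language models in multimodal reasoning, this work proposes Visually-Guided Policy Optimization (VGPO), a framework that strengthens visual focus during reasoning through a visual attention compensation mechanism and a dual-grained advantage re-weighting strategy. Experiments show that VGPO improves performance on mathematical multimodal reasoning and visual-dependent tasks and makes better use of visual information.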

Comments Accepted to ACL 2026, https://github.com/wzb-bupt/VGPO


Abstract

Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the reasoning ability of vision-language models (VLMs). However, the inherent text-dominated nature of VLMs often leads to insufficient visual faithfulness, characterized by sparse attention activation to visual tokens. More importantly, our empirical analysis reveals that temporal visual forgetting along reasoning steps exacerbates this deficiency. To bridge this gap, we propose Visually-Guided Policy Optimization (VGPO), a novel framework that reinforces visual focus during policy optimization. Specifically, VGPO first introduces a Visual Attention Compensation mechanism that leverages visual similarity to localize and amplify visual cues, while progressively elevating visual expectations in later steps to counteract visual forgetting. Building on this mechanism, we implement a dual-grained advantage re-weighting strategy: the intra-trajectory level highlights tokens exhibiting relatively high visual activation, while the inter-trajectory level prioritizes trajectories demonstrating superior visual accumulation. Extensive experiments demonstrate that VGPO achieves better visual activation and superior performance in mathematical multimodal reasoning and visual-dependent tasks. The code has been released at https://github.com/wzb-bupt/VGPO.
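The abstract gives no formulas for the dual-grained re-weighting, so the following is only a toy sketch of the general idea under stated assumptions, not the VGPO implementation. It assumes each trajectory already has one scalar advantage and a positive per-token visual activation score; the function name `reweight_advantages` and both ratio-based weights are hypothetical choices made for illustration.

```python
def reweight_advantages(advantages, visual_activations):
    """Scale each trajectory's advantage per token by two visual-activation ratios.

    advantages: one scalar advantage per trajectory (assumed given).
    visual_activations: per trajectory, a list of positive per-token scores
        (hypothetical stand-ins for attention mass on visual tokens).
    Returns one list of per-token advantages per trajectory.
    """
    totals = [sum(acts) for acts in visual_activations]
    mean_total = sum(totals) / len(totals)
    reweighted = []
    for adv, acts, total in zip(advantages, visual_activations, totals):
        # Inter-trajectory weight: trajectories that accumulate more visual
        # activation than the group average get a weight above 1.
        inter = total / mean_total
        # Intra-trajectory weights: tokens above the trajectory's own mean
        # activation get a weight above 1; the weights average to 1.
        mean_act = total / len(acts)
        intra = [a / mean_act for a in acts]
        reweighted.append([adv * inter * w for w in intra])
    return reweighted


if __name__ == "__main__":
    # Two hypothetical trajectories with equal advantage but different visual focus.
    out = reweight_advantages([1.0, 1.0], [[0.1, 0.3], [0.4, 0.4]])
    for i, token_advantages in enumerate(out):
        print(i, [round(a, 3) for a in token_advantages])
```

In this toy setup the second trajectory, which accumulates more visual activation, receives larger per-token advantages, and within the first trajectory the more visually active token is favored. How the paper handles negative advantages, normalization, or differing trajectory lengths is not stated in the abstract and is not modeled here.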


2604.06885 2026-05-25 cs.CV

Time-driven Survival Analysis from FDG-PET/CT in Non-Small Cell Lung Cancer


Sambit Tarai, Ashish Chauhan, Elin Lundström, Johan Öfverstedt, Therese Sjöholm, Veronica Sanchez Rodriguez, Håkan Ahlström, Joel Kullberg

Affiliations: Radiology, Department of Surgical Sciences; Antaros Medical; Molecular Imaging and Medical Physics, Department of Surgical Sciences

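AI summary: This study proposes a deep regression framework based on FDG-PET/CT images for predicting overall survival (OS) in patients with non-small cell lung cancer, taking a time variable as an additional input to enable time-driven survival analysis. The method extracts image features with ResNet-50 and fuses them with the time input to produce survival probabilities that vary with time. Experiments show that it outperforms the baseline in AUC and that an ensemble combining clinical and imaging features performs best, supporting the complementary value of multimodal data for survival prediction.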

Comments Under review

Journal ref Ann Biomed Eng (2026)


DOI: 10.1007/s10439-026-04181-y


Abstract

Purpose: Automated medical image-based prediction of clinical outcomes, such as overall survival (OS), has great potential in improving patient prognostics and personalized treatment planning. We developed a deep regression framework using tissue-wise FDG-PET/CT projections as input, along with a temporal input representing a scalar time horizon (in days) to predict OS in patients with Non-Small Cell Lung Cancer (NSCLC). Methods: The proposed framework employed a ResNet-50 backbone to process input images and generate corresponding image embeddings. The embeddings were then combined with temporal data to produce OS probabilities as a function of time, effectively parameterizing the predictions based on time. The overall framework was developed using the U-CAN cohort (n = 556) and evaluated against a baseline method on the test set (n = 292). The baseline utilized the ResNet-50 architecture, processing only the images as input and providing OS predictions at pre-specified horizons, such as 2 or 5 years. Results: The incorporation of temporal data with image embeddings demonstrated an advantage in predicting OS, outperforming the baseline method with an improvement in AUC of 4.3%. The proposed model using clinical + IDP features achieved strong performance, and an ensemble of imaging and clinical + IDP models achieved the best overall performance (0.788), highlighting the complementary value of multimodal inputs. The proposed method also enabled risk stratification of patients into distinct categories (high vs low risk). Heat maps from the saliency analysis highlighted tumor regions as key structures for the prediction. Conclusion: Our method provides an automated framework for predicting OS as a function of time and demonstrates the potential of combining imaging and tabular data for improved survival prediction.
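To make the idea of "OS probabilities as a function of time" concrete, here is a minimal sketch of a time-conditioned prediction head. It is an illustration under assumptions, not the authors' architecture: a fixed three-number vector stands in for the ResNet-50 image embedding, fusion is plain concatenation followed by a single logistic unit, and every weight, the `time_scale` constant, and the function name `survival_probability` are hypothetical.

```python
import math


def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def survival_probability(embedding, horizon_days, weights, time_weight, bias,
                         time_scale=365.0):
    """Probability of being alive at horizon_days from one fused logistic unit.

    The image embedding and the rescaled time horizon are concatenated and
    passed through a single linear layer plus sigmoid. A negative time_weight
    makes the predicted probability decrease as the horizon grows.
    """
    fused = list(embedding) + [horizon_days / time_scale]
    params = list(weights) + [time_weight]
    logit = sum(p * x for p, x in zip(params, fused)) + bias
    return sigmoid(logit)


# Hypothetical embedding and parameters, chosen only for illustration.
embedding = [0.4, -1.2, 0.7]
weights = [0.5, 0.3, -0.2]
horizons = (365, 730, 1825)  # 1, 2 and 5 years, in days
curve = [survival_probability(embedding, d, weights, time_weight=-0.6, bias=1.5)
         for d in horizons]

if __name__ == "__main__":
    for days, prob in zip(horizons, curve):
        print(f"{days:5d} days: {prob:.3f}")
```

Because time is an input rather than a fixed label, the same model can be queried at any horizon, whereas the baseline described in the abstract predicts only at pre-specified horizons.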


2604.03244 2026-05-25 cs.AI cs.CY cs.DB

Pooling and Semantic Shift: The Fundamental Challenges in Long Text Embedding and Retrieval

Hang Gao, Wujiang Xu, Kai Mei, Dimitris N. Metaxas

Affiliations: Rutgers University

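AI summary: This paper studies two fundamental challenges that Transformer-based embedding models face in long-text representation and retrieval: embedding collapse caused by pooling, and semantic shift. The authors argue that the loss of embedding quality stems not from text length or attention alone but from pooling combined with semantic change inside the text, and they establish a unified theoretical framework to prove it. Controlled experiments show that semantic shift is the main driver of severe embedding concentration and that anisotropy hurts retrieval only under strong semantic shift, which offers a theoretical account of the long-text embedding problem.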


Abstract

Transformer-based embedding models frequently exhibit geometric pathologies, such as anisotropy and length-induced representation collapse, which can degrade downstream retrieval performance. While prior work often attributes these issues directly to text length or attention mechanisms, we argue that the fundamental drivers are instead the inherent pooling operations coupled with internal semantic shift. In this paper, we establish a unified theoretical framework proving that contextual pooling intrinsically causes embedding collapse. Specifically, we mathematically prove that pooling semantically diverse sentences inevitably leads to micro-level semantic dilution, and strictly reduces the Mean Pairwise Distance of the vector space, guaranteeing macro-level spatial concentration. Grounded in these geometric insights, we formally define semantic shift to capture the natural semantic evolution and dispersion within a text. Through carefully controlled experiments across diverse models and corpora, we disentangle text length from semantic content. We demonstrate that semantic shift is the primary predictor of severe embedding concentration. Crucially, our retrieval evaluations reveal that anisotropy is fundamentally harmful only when induced by strong semantic shifts, reconciling conflicting observations in prior literature and offering a principled explanation for the long-context challenges faced by modern embedding models.
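The claim that pooling reduces Mean Pairwise Distance can be illustrated with a small simulation. This is a sketch under a strong simplifying assumption, not the paper's proof or experimental setup: sentence embeddings are drawn as independent standard Gaussian vectors, documents are formed by mean pooling a fixed number of sentences, and all function names and sizes are hypothetical.

```python
import math
import random


def mean_pool(vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def mean_pairwise_distance(points):
    """Mean Euclidean distance over all unordered pairs of points."""
    total, pairs = 0.0, 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            total += math.dist(points[i], points[j])
            pairs += 1
    return total / pairs


def simulate(num_docs=40, sentences_per_doc=8, dim=16, seed=0):
    """Return (sentence-level MPD, pooled-document MPD) for synthetic data."""
    rng = random.Random(seed)
    # Assumption: every sentence embedding is an independent Gaussian vector,
    # i.e. the sentences inside a document are maximally diverse.
    docs = [[[rng.gauss(0.0, 1.0) for _ in range(dim)]
             for _ in range(sentences_per_doc)]
            for _ in range(num_docs)]
    sentences = [s for doc in docs for s in doc]
    pooled = [mean_pool(doc) for doc in docs]
    return mean_pairwise_distance(sentences), mean_pairwise_distance(pooled)


if __name__ == "__main__":
    mpd_sentences, mpd_pooled = simulate()
    print(f"sentence-level MPD:  {mpd_sentences:.3f}")
    print(f"pooled-document MPD: {mpd_pooled:.3f}")
```

Under this assumption, averaging k independent vectors shrinks each coordinate's variance by a factor of k, so the pooled documents sit closer together than the sentences they were built from. Real embeddings are correlated, so the size of the effect will differ.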



Towards Trustworthy and Explainable AI for Perception Models: From Concept to Prototype Vehicle Deployment

Not All Tasks Quantize Equally: Fisher-Guided Quantization for Visual Geometry Transformer

FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models

HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation

Adaptive Calibration in Non-Stationary Environments

How Mobile World Model Guides GUI Agents?

Beyond Defenses: Manifold-Aligned Regularization for Intrinsic 3D Point Cloud Robustness

On the Robustness of Distribution Support under Diffusion Guidance

Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

Lie Group Formulation of Recursive Dynamics Algorithms of Higher Order for Floating-Base Robots

VISD: Enhancing Video Reasoning via Structured Self-Distillation

OpenGaFF: Open-Vocabulary Gaussian Feature Field with Codebook Attention

4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding

Dream-MPC: Gradient-Based Model Predictive Control with Latent Imagination

Model Spec Midtraining: Improving How Alignment Training Generalizes

WildTableBench: Benchmarking Multimodal Foundation Models on Table Understanding In the Wild

Stable Behavior, Limited Variation: Persona Validity in LLM Agents for Urban Sentiment Perception

Syntactically-guided Information Maintenance in Sentence Comprehension

Towards Generalizable Mapping of Hedges and Linear Woody Features from Earth Observation Data: a national Product for Germany

A Comparative Analysis on the Performance of Upper Confidence Bound Algorithms in Adaptive Deep Neural Networks

TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale

Decompose, Structure, and Repair: A Neuro-Symbolic Framework for Autoformalization via Operator Trees

VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation

Towards Brain MRI Foundation Models for the Clinic: Findings from the FOMO25 Challenge

Visually-Guided Policy Optimization for Multimodal Reasoning

Time-driven Survival Analysis from FDG-PET/CT in Non-Small Cell Lung Cancer

AI Evaluation Should Require Standardized Item-Level Data Releases

Tabular PDF Information Extraction with Local LLMs and Layout-Aware Parsing: A Reliability Evaluation

Few-Shot Left Atrial Wall Segmentation in 3D LGE MRI via Meta-Learning

Pooling and Semantic Shift: The Fundamental Challenges in Long Text Embedding and Retrieval