arXivDaily: daily arXiv academic digest, updated Monday through Friday
2310.07844 2026-05-14 cs.RO

Saturation-Aware Angular Velocity Estimation: Extending the Robustness of SLAM to Aggressive Motions

Simon-Pierre Deschênes, Dominic Baril, Matěj Boxan, Johann Laconte, Philippe Giguère, François Pomerleau

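Affiliations * Northern Robotics Laboratory, Université Laval, Quebec City, Quebec, Canada

AI summary This paper proposes a new angular velocity estimation method to make SLAM algorithms more robust to gyroscope saturation caused by aggressive motion. By using accelerometers to estimate angular velocity while the robot tumbles, the method reduces localization error and improves stability under extreme conditions. The authors also build the TIGS dataset to support testing and evaluation of SLAM algorithms at high angular velocities.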

Comments 7 pages, 7 figures, published in 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan

Journal ref 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 2024, pp. 10711-10718

Abstract

We propose a novel angular velocity estimation method to increase the robustness of Simultaneous Localization And Mapping (SLAM) algorithms against gyroscope saturations induced by aggressive motions. Field robotics exposes robots to various hazards, including steep terrains, landslides, and staircases, where substantial accelerations and angular velocities can occur if the robot loses stability and tumbles. These extreme motions can saturate sensor measurements, especially gyroscopes, which are the first sensors to become inoperative. While the structural integrity of the robot is at risk, the robustness of the SLAM framework is oftentimes given little consideration. Consequently, even if the robot is physically capable of continuing the mission, its operation will be compromised due to a corrupted representation of the world. Regarding this problem, we propose a method to estimate the angular velocity using accelerometers during extreme rotations caused by tumbling. We show that our method reduces the median localization error by 71.5% in translation and 65.5% in rotation and is robust to mapping failures, which occurred in 37.5% of the experiments without our method. We also propose the Tumbling-Induced Gyroscope Saturation (TIGS) dataset, which consists of outdoor experiments recording the motion of a mechanical lidar subject to angular velocities four times higher than other similar datasets available. The dataset is available online at https://github.com/norlab-ulaval/Norlab_wiki/wiki/TIGS-Dataset.
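The abstract does not spell out the estimator, so the following is only a textbook illustration of why accelerometers can stand in for a saturated gyroscope, not the authors' method. It assumes steady rotation about a fixed axis and two accelerometers mounted on the same radius, so the difference in centripetal acceleration equals ω² times their separation. The function names, the saturation limit, and the fallback rule are all hypothetical.

```python
import math


def angular_speed_from_accels(a_inner: float, a_outer: float, separation_m: float) -> float:
    """Estimate |omega| (rad/s) from two radial acceleration magnitudes (m/s^2).

    Assumes steady rotation, so a = omega^2 * r at each sensor and
    a_outer - a_inner = omega^2 * separation_m.
    """
    if separation_m <= 0:
        raise ValueError("separation_m must be positive")
    return math.sqrt(abs(a_outer - a_inner) / separation_m)


def fused_rate(gyro_rate: float, gyro_limit: float, accel_speed: float) -> float:
    """Trust the gyroscope unless its reading sits at the saturation limit.

    When saturated, fall back to the accelerometer-based speed and keep the
    sign of the (clipped) gyroscope reading. This rule is an illustrative
    assumption, not taken from the paper.
    """
    if abs(gyro_rate) >= gyro_limit:
        return math.copysign(accel_speed, gyro_rate)
    return gyro_rate


if __name__ == "__main__":
    # Sensors at 0.1 m and 0.2 m from the axis, body spinning at 10 rad/s.
    speed = angular_speed_from_accels(10.0, 20.0, 0.1)
    print(fused_rate(gyro_rate=8.0, gyro_limit=8.0, accel_speed=speed))
```

A real estimator would also have to handle gravity, angular acceleration, and sensor noise, which this sketch ignores.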

2304.11193 2026-05-14 cs.RO cs.AI cs.CV

Multi-Modal World Model for Physical Robot Interactions: Simultaneous Visual and Tactile Predictions for Enhanced Accuracy

Willow Mandil, Amir Ghalamzan-E

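Affiliations * University of Lincoln; University of Sheffield

AI summary This paper studies world-model prediction methods that fuse visual and tactile information for physical robot interaction, aiming to predict the outcomes of robot manipulation in complex environments more accurately. Using two new robot-pushing datasets, the authors show that combining vision and touch substantially improves prediction in scenes with high physical uncertainty, while touch adds little when the visual information is already unambiguous. The work offers new data and methodological insights for building more robust robot world models.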

Comments This paper is accepted for publication in Robotics and Autonomous Systems

Abstract

Predicting the outcomes of robotic actions, often referred to as learning a world model, in complex environments remains a fundamental challenge in robotics. Existing approaches primarily rely on visual observations and action inputs to generate video-based predictions, frequently overlooking the critical role of tactile feedback in understanding physical interactions. In this work, we investigate the integration of tactile and visual information within predictive perception systems for physical robot interaction. We demonstrate that visuo-tactile prediction provides the greatest benefits in physically ambiguous interaction regimes, while improvements are naturally limited when object dynamics are visually inferable. Furthermore, we introduce two novel robot-pushing datasets collected using a magnetic-based tactile sensor for unsupervised learning. The first dataset comprises visually identical objects with varying physical properties, explicitly isolating physical ambiguity, while the second mirrors existing robot-pushing benchmarks involving clusters of household objects. Our results show that tactile-visual integration improves prediction accuracy and robustness under physical ambiguity, while offering limited gains in visually unambiguous settings. Code and datasets are publicly available.

1911.09301 2026-05-14 cs.CV

Image Aesthetics Assessment using Multi Channel Convolutional Neural Networks

Nishi Doshi, Gitam Shikkenawis, Suman K Mitra

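Affiliations * Dhirubhai Ambani Institute of Information and Communication Technology; C R Rao Advanced Institute of Mathematics, Statistics and Computer Science

AI summary This paper addresses image aesthetics assessment, classifying images as high or low quality. The authors propose a multi-channel convolutional neural network that takes image crops and saliency maps as input in addition to the raw image. Experiments show that the method outperforms existing approaches on the widely used AVA dataset.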

Journal ref Computer Vision and Image Processing. CVIP 2019

Abstract

Image aesthetics assessment is an emerging research domain. It deals with classifying images into categories according to how pleasant they are for users to view. In this article, the focus is on categorizing images as high quality or low quality. Deep convolutional neural networks are used to classify the images. Instead of using just the raw image as input, different crops and saliency maps of the images are also used as input to the proposed multi-channel CNN architecture. The experiments reported on the widely used AVA database show improvement in aesthetic assessment performance over existing approaches.

2605.12620 2026-05-14 cs.AI

Think Twice, Act Once: Verifier-Guided Action Selection For Embodied Agents

Nishad Singhi, Christian Bialas, Snehal Jauhri, Vignesh Prasad, Georgia Chalvatzaki, Marcus Rohrbach, Anna Rohrbach

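Affiliations * Technical University of Darmstadt

AI summary This paper proposes VeGAS, a verifier-guided action selection framework that aims to make embodied agents built on multimodal large language models (MLLMs) more robust on complex real-world tasks. At inference time the method generates multiple candidate actions and uses a generative verifier to select the most reliable one, without modifying the original policy. An LLM-driven data synthesis strategy automatically generates diverse failure cases at training time, improving the verifier's generalization. Experiments show that VeGAS improves performance on several embodied reasoning benchmarks, with a relative gain of up to 36% over baselines on multi-object, long-horizon tasks.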

Comments CVPR 2026 (Findings)

Abstract

Building generalist embodied agents capable of solving complex real-world tasks remains a fundamental challenge in AI. Multimodal Large Language Models (MLLMs) have significantly advanced the reasoning capabilities of such agents through strong vision-language knowledge and chain-of-thought (CoT) reasoning, yet remain brittle when faced with challenging out-of-distribution scenarios. To address this, we propose Verifier-Guided Action Selection (VeGAS), a test-time framework designed to improve the robustness of MLLM-based embodied agents through an explicit verification step. At inference time, rather than committing to a single decoded action, VeGAS samples an ensemble of candidate actions and uses a generative verifier to identify the most reliable choice, without modifying the underlying policy. Crucially, we find that using an MLLM off-the-shelf as a verifier yields no improvement, motivating our LLM-driven data synthesis strategy, which automatically constructs a diverse curriculum of failure cases to expose the verifier to a rich distribution of potential errors at training time. Across embodied reasoning benchmarks spanning the Habitat and ALFRED environments, VeGAS consistently improves generalization, achieving up to a 36% relative performance gain over strong CoT baselines on the most challenging multi-object, long-horizon tasks.
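The inference-time step described above (sample several candidate actions, then let a verifier choose) can be sketched as a best-of-N selection. This is a minimal sketch under assumptions: `sample_action` and `verifier_score` are hypothetical callables standing in for the frozen MLLM policy and the trained generative verifier, and the toy actions and scores are made up for illustration.

```python
from typing import Callable, List


def select_action(
    sample_action: Callable[[], str],
    verifier_score: Callable[[str], float],
    num_candidates: int = 5,
) -> str:
    """Draw candidates from an unmodified policy and return the best-scored one."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    candidates: List[str] = [sample_action() for _ in range(num_candidates)]
    # The policy itself is untouched; only the choice among its samples changes.
    return max(candidates, key=verifier_score)


if __name__ == "__main__":
    # Toy stand-ins: a policy that cycles through fixed actions and a lookup-table verifier.
    proposals = iter(["open fridge", "pick up mug", "walk to sink"])
    scores = {"open fridge": 0.2, "pick up mug": 0.9, "walk to sink": 0.4}
    print(select_action(lambda: next(proposals), scores.__getitem__, num_candidates=3))
```

The design point carried over from the abstract is that robustness comes from the verifier, so its quality matters more than the number of samples drawn.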

2605.12608 2026-05-14 cs.CV

A Data Efficiency Study of Synthetic Fog for Object Detection Using the Clear2Fog Pipeline

Mohamed Ahmed Mohamed, Xiaowei Huang

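Affiliations * Waymo Open Dataset; GitHub

AI summary This paper studies data efficiency for improving object detection in adverse weather and proposes Clear2Fog (C2F), an end-to-end, physics-based synthetic fog pipeline that generates realistic foggy images from clear-weather datasets while keeping camera and LiDAR sensors consistent. By introducing monocular depth estimation and a new atmospheric light estimation method, C2F overcomes the structural artifacts and color casts of existing techniques. Experiments show that training on diverse C2F-generated fog data improves detection performance in real fog.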

Comments Project code and experimental configs available at https://github.com/mmohamed28/Clear2Fog

Abstract

Object detection in adverse weather is critical for the safety of autonomous vehicles; however, the scarcity of labelled, real-world foggy data remains a significant bottleneck. In this paper, we propose Clear2Fog (C2F), an end-to-end, physics-based pipeline that simulates fog on clear-weather datasets while ensuring sensor-level consistency across camera and LiDAR. By using monocular depth estimation and a novel atmospheric light estimation method, C2F overcomes structural artifacts and chromatic biases common in existing techniques. A human perceptual study confirms C2F's physical realism, with the generated images being preferred 92.95% of the time over an established method. Utilising a training set of 270,000 images from the Waymo Open Dataset, we conduct an extensive data efficiency study to investigate how environmental diversity influences model robustness. Our findings reveal that models trained on mixed-density fog datasets at 75% scale outperform those trained on fixed-density datasets at 100% scale. Furthermore, we investigate the sim-to-real transfer by fine-tuning pre-trained models on real-world foggy data. We demonstrate that a tenfold increase over the default fine-tuning learning rate successfully overcomes negative transfer from synthetic biases, resulting in a 1.67 mAP improvement over real-only baselines. The C2F pipeline provides a scalable framework for enhancing the reliability of autonomous systems in adverse weather and demonstrates the potential of diverse synthetic datasets for efficient model training.
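For readers unfamiliar with physics-based fog synthesis, the sketch below applies the standard atmospheric scattering (Koschmieder) model, I = J·t + A·(1 − t) with transmission t = exp(−β·d), to a single pixel. Whether C2F uses exactly this formulation is an assumption; the abstract only says the pipeline is physics-based and relies on depth and atmospheric light estimates. The function name and parameter values are illustrative.

```python
import math
from typing import Tuple

Pixel = Tuple[float, float, float]  # RGB values in [0, 1]


def add_fog(pixel: Pixel, depth_m: float, beta: float, airlight: Pixel) -> Pixel:
    """Blend a clear pixel toward the atmospheric light according to scene depth.

    beta is the scattering coefficient (1/m); larger values mean denser fog.
    """
    transmission = math.exp(-beta * depth_m)
    return tuple(c * transmission + a * (1.0 - transmission) for c, a in zip(pixel, airlight))


if __name__ == "__main__":
    clear = (0.2, 0.4, 0.6)
    grey_airlight = (0.8, 0.8, 0.8)
    # A "mixed-density" set would draw beta from a range rather than fixing it.
    for beta in (0.01, 0.05, 0.10):
        print(beta, add_fog(clear, depth_m=30.0, beta=beta, airlight=grey_airlight))
```

In a full pipeline the depth would come from a per-pixel depth map and the airlight from an estimator; both are passed in directly here.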

2605.12587 2026-05-14 cs.CV

TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking

Jisu Nam, Jahyeok Koo, Soowon Son, Jaewoo Jung, Honggyu An, Junhwa Hur, Seungryong Kim

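Affiliations * KAIST AI; Google DeepMind

AI summary This paper proposes TrackCraft3R, which repurposes a pre-trained video diffusion transformer (video DiT) for dense 3D tracking from monocular video. Using a dual-latent representation and temporal RoPE alignment, the method converts the video DiT's per-frame generative paradigm into a reference-anchored tracking formulation, predicting in a single forward pass a tracking pointmap that follows every pixel of the reference frame over time, along with its visibility. Experiments show that TrackCraft3R achieves state-of-the-art performance on standard sparse and dense 3D tracking benchmarks while also beating existing methods in speed and memory use.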

Comments Project page and code are available at https://cvlab-kaist.github.io/TrackCraft3r/

Abstract

Dense 3D tracking from monocular video is fundamental to dynamic scene understanding. While recent 3D foundation models provide reliable per-frame geometry, recovering object motion in this geometry remains challenging and benefits from strong motion priors learned from real-world videos. Existing 3D trackers either follow iterative paradigms trained from scratch on synthetic data or fine-tune 3D reconstruction models learned from static multi-view images, both lacking real-world motion priors. Pre-trained video diffusion transformers (video DiTs) offer rich spatio-temporal priors from internet-scale videos, making them a promising foundation for 3D tracking. However, their frame-anchored formulation, which generates each frame's content, is fundamentally mismatched with reference-anchored dense 3D tracking, which must follow the same physical points from a reference frame across time. We present TrackCraft3R, the first method to repurpose a video DiT as a feed-forward dense 3D tracker. Given a monocular video and its frame-anchored reconstruction pointmap, TrackCraft3R predicts a reference-anchored tracking pointmap that follows every pixel of the first frame across time in a single forward pass, along with its visibility. We achieve this through two designs: (i) a dual-latent representation that uses per-frame geometry latents and reference-anchored track latents as dense queries, and (ii) temporal RoPE alignment, which specifies the target timestamp of each track latent. Together, these designs convert the per-frame generative paradigm of video DiTs into a reference-anchored tracking formulation with LoRA fine-tuning. TrackCraft3R achieves state-of-the-art performance on standard sparse and dense 3D tracking benchmarks, while running 1.3x faster and using 4.6x less peak memory than the strongest prior method. We further demonstrate robustness to large motions and long videos.

2605.12586 2026-05-14 cs.CV cs.AI cs.DB

3D Primitives are a Spatial Language for VLMs

Junze Liu, Kun Qian, Florian Dubost, Kai Zhong, Arvind Srinivasan, Nan Chen, Anping Wang, Sam Zhang, Alejandro Mottini, Qingjun Cui, Tian Wang

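Affiliations * Unity Technologies

AI summary This study examines the paradoxical spatial understanding of vision-language models (VLMs) and proposes 3D geometric primitives (cubes, spheres, and so on) as an intermediate representation to improve their spatial reasoning. It introduces the SpatialBabel benchmark, which evaluates multiple VLMs on primitive-based 3D scene reconstruction, and proposes two new methods: Code-CoT, a training-free inference strategy, and S³-FT, a self-supervised fine-tuning method. Both improve performance on several spatial understanding tasks, supporting the diagnostic and transfer value of geometric primitives expressed in code.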

Abstract

Vision-language models (VLMs) exhibit a striking paradox: they can generate executable code that reconstructs a 3D scene from geometric primitives with correct object counts, classes, and approximate positions, yet the same models fail at simpler spatial questions on the same image. We show that 3D geometric primitives (cubes, spheres, cylinders, expressed in executable code) serve as a powerful intermediate representation for spatial understanding, and exploit this through three contributions. First, we introduce SpatialBabel, a benchmark evaluating fourteen VLMs on primitive-based 3D scene reconstruction across six scene-code languages (programming languages and declarative formats for 3D primitive scenes), revealing that a single model's object-detection F1 can vary by up to 5.7× across languages. Second, we propose Code-CoT (Code Chain-of-Thought), a training-free inference strategy that routes spatial reasoning through primitive-based code generation. Code-CoT lifts the SpatialBabel-QA-Score by up to +6.4% on primitive scenes and real-photo CV-Bench-3D accuracy by +5.0% for VLMs with strong coding capabilities. Third, we propose S³-FT (Self-Supervised Spatial Fine-Tuning), which self-supervisedly distills primitive spatial knowledge into general visual reasoning by parsing the model's own Three.js primitive-reconstructions into structured annotations and fine-tuning on the result, with no human labels and no teacher model. Training on primitive images alone, S³-FT improves Qwen3-VL-8B by +4.6 to +8.6% on SpatialBabel-Primitive-QA, +9.7% on CV-Bench-2D, and +17% on HallusionBench; the recipe transfers across model families. These results establish geometric primitives in code as both a diagnostic and a transferable spatial vocabulary for VLMs. We will release all artifacts upon publication.

2605.12584 2026-05-14 cs.LG cs.AI

Towards Robust Federated Multimodal Graph Learning under Modality Heterogeneity

Sirui Zhang, Haonan Wang, Xunkai Li, Zekai Chen, Shumeng Li, Hongchao Qin, Rong-Hua Li, Guoren Wang

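Affiliations * Beijing Institute of Technology

AI summary This paper studies robust federated multimodal graph learning under modality heterogeneity. To address the shortcomings of existing methods in handling missing modalities and in federated collaboration, it identifies a two-stage framework in which clients reconstruct missing modalities and the server aggregates parameters. The authors then design FedMPO, which uses topology-aware cross-modal generation, missing-aware expert routing, and reliability-aware aggregation to improve robustness and performance. Experiments show that FedMPO outperforms existing methods on multiple datasets, especially under high missing rates and non-IID data distributions.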

Abstract

Recently, multimodal graph learning (MGL) has garnered significant attention for integrating diverse modality information and structured context to support various network applications. However, real-world graphs are often isolated due to data-sharing limitations across multiple parties, and their modalities are frequently incomplete. This highlights an urgent need to develop a robust federated approach. However, we find that existing methods remain insufficient. On the one hand, centralized MGL methods that handle missing modalities overlook the knowledge sharing and generalization in federated scenarios. On the other hand, while federated MGL methods have become increasingly mature, they primarily target non-graph data. Based on these technologies, we identify a two-stage pipeline wherein client-side completion reconstructs missing modalities, and server-side aggregation integrates the client-updated parameters of both the modality generator and the backbone models. Although this serves as a general solution, we identify two primary challenges in achieving greater robustness: (1) Topology-Isolated Local Completion: Client-side modality generation struggles to effectively leverage global semantics. (2) Reliability-Imbalanced Global Aggregation: Server-side multi-party collaboration is hindered by client updates with varying modality availability and recovery reliability. To address these challenges, we propose FedMPO, which utilizes topology-aware cross-modal generation to recover missing features using comprehensive graph context, missing-aware expert routing to locally filter out noisy recovered signals, and reliability-aware aggregation to appropriately down-weight unreliable updates. Extensive experiments on 3 tasks across 6 datasets demonstrate that FedMPO outperforms baselines, achieving performance gains of up to 4.10% and 5.65% in high-missing and non-IID settings.
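To make "down-weighting unreliable updates" concrete, here is a minimal sketch of a reliability-weighted parameter average on the server side. It is not FedMPO's actual rule, which the abstract does not specify: the reliability scores are assumed to be given non-negative numbers (for example, derived from modality availability), and the function name is hypothetical.

```python
from typing import List, Sequence


def reliability_weighted_average(
    client_params: Sequence[Sequence[float]],
    reliabilities: Sequence[float],
) -> List[float]:
    """Average client parameter vectors, weighting each by its reliability score."""
    if len(client_params) != len(reliabilities) or not client_params:
        raise ValueError("need one reliability score per client")
    total = sum(reliabilities)
    if total <= 0:
        raise ValueError("reliability scores must sum to a positive value")
    dim = len(client_params[0])
    return [
        sum(r * params[i] for params, r in zip(client_params, reliabilities)) / total
        for i in range(dim)
    ]


if __name__ == "__main__":
    updates = [[1.0, 1.0], [3.0, 5.0]]
    # The second client has many missing modalities, so it gets a smaller weight.
    print(reliability_weighted_average(updates, reliabilities=[3.0, 1.0]))
```

With equal scores this reduces to a plain unweighted mean of the client updates.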

2605.12580 2026-05-14 cs.LG

CAWI: Copula-Aligned Weight Initialization for Randomized Neural Networks

Mushir Akhtar, M. Tanveer, Mohd. Arshad

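Affiliations * Indian Institute of Technology Indore

AI summary Randomized neural networks (RdNNs) train efficiently without backpropagation by freezing randomly initialized input-to-hidden weights. Conventional random initialization, however, ignores dependence among features in the data, which hurts predictive performance. This paper proposes CAWI, a weight initialization method that fits a copula to the data to capture the dependence structure among features, improving performance while keeping the closed-form solution. Experiments show that CAWI outperforms conventional initialization on many classification tasks and on biomedical datasets.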

Journal ref Proceedings of the 29th International Conference on Artificial Intelligence and Statistics (AISTATS) 2026

Abstract

Randomized neural networks (RdNNs) enable efficient, backpropagation-free training by freezing randomly initialized input-to-hidden weights, which permits a closed-form solution for the output layer. However, conventional random initialization is blind to inter-feature dependence, ignoring correlations, asymmetries, and tail dependence in the data, which degrades conditioning and predictive performance. To the best of our knowledge, this limitation remains unaddressed in the RdNN literature. To close this gap, we propose CAWI (Copula-Aligned Weight Initialization), a framework that draws input-to-hidden weights from a data-fitted copula that matches empirical dependence, ensuring the frozen projections respect inter-feature dependence without sacrificing the closed-form solution. CAWI (i) maps each feature to the unit interval using empirical CDFs, (ii) fits a multivariate copula that captures rank-based dependence among features, and (iii) samples each weight column w_j from the fitted copula and applies a fixed inverse marginal transform to set scale. The objective, solver, and "freeze-once" paradigm remain unchanged; only the sampling law for W becomes dependence-aware. For dependence modeling, we consider two copula families: elliptical (Gaussian, t) and Archimedean (Clayton, Frank, Gumbel). This enables CAWI to handle diverse dependence, including tail dependence. We evaluate CAWI across 83 diverse classification benchmarks (binary and multiclass) and two biomedical datasets, BreaKHis and the Schizophrenia dataset, using standard shallow and deep RdNN architectures. CAWI consistently delivers significant improvements in predictive performance over conventional random initialization. Code is available at: https://github.com/mtanveer1/CAWI
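Steps (i) to (iii) in the abstract can be sketched with a Gaussian copula using only the Python standard library. This is an illustration under assumptions rather than the released implementation (see the repository linked above for that): the rank-based pseudo-observations use rank/(n+1), ties are broken by input order, and the "fixed inverse marginal transform" is assumed to be a uniform range [low, high]. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: provides Phi (cdf) and its inverse


def pseudo_observations(column):
    """Step (i): map one feature to (0, 1) through its empirical CDF (ranks)."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    u = [0.0] * n
    for rank, i in enumerate(order, start=1):
        u[i] = rank / (n + 1)
    return u


def fit_gaussian_copula(X):
    """Step (ii): correlation matrix of the normal scores of each feature."""
    n, d = len(X), len(X[0])
    scores = []
    for j in range(d):
        u = pseudo_observations([row[j] for row in X])
        scores.append([_STD_NORMAL.inv_cdf(v) for v in u])

    def corr(a, b):
        mean_a, mean_b = sum(a) / n, sum(b) / n
        cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
        var_a = sum((x - mean_a) ** 2 for x in a)
        var_b = sum((y - mean_b) ** 2 for y in b)
        return cov / math.sqrt(var_a * var_b)

    return [[corr(scores[i], scores[j]) for j in range(d)] for i in range(d)]


def cholesky(R):
    """Lower-triangular factor L with L L^T = R (R must be positive definite)."""
    d = len(R)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(R[i][i] - s) if i == j else (R[i][j] - s) / L[j][j]
    return L


def sample_weight_column(L, rng, low=-1.0, high=1.0):
    """Step (iii): draw one weight column from the copula, then rescale it."""
    d = len(L)
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    z = [sum(L[i][k] * g[k] for k in range(i + 1)) for i in range(d)]
    return [low + (high - low) * _STD_NORMAL.cdf(zi) for zi in z]


if __name__ == "__main__":
    X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5]]  # two correlated features
    L = cholesky(fit_gaussian_copula(X))
    rng = random.Random(0)
    W = [sample_weight_column(L, rng) for _ in range(4)]  # four hidden units
    print(W)
```

Each sampled column has one entry per input feature, so correlated features receive correlated weights; the output layer would still be solved in closed form as in a standard RdNN.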

2605.12574 2026-05-14 cs.CV cs.AI

DistractMIA: Black-Box Membership Inference on Vision-Language Models via Semantic Distraction

Hongyi Tang, Zhihao Zhu, Yi Yang

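Affiliations * The Hong Kong University of Science and Technology

AI summary This paper studies membership inference attacks on the training data of vision-language models in a black-box setting where only the generated text is accessible, using semantic distraction. The proposed method, DistractMIA, inserts a known semantic distractor into the input image and analyzes how the generated text changes to decide whether a sample was in the training data. It needs no access to model internals and relies only on outputs. Experiments show that it outperforms existing methods across multiple vision-language models and benchmarks and generalizes well to a medical imaging task.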

Comments 23 pages, 8 figures

Abstract

Vision-language models (VLMs) are trained on large-scale image-text corpora that may contain private, copyrighted, or otherwise sensitive data, motivating membership inference as a tool for training-data auditing. This is especially challenging for deployed VLMs, where auditors typically observe only generated textual responses. Existing VLM membership inference attacks either rely on probability-level signals unavailable in such settings, or use mask-based semantic prediction tasks whose effectiveness depends on object-centric visual assumptions. To address these limitations, we propose DistractMIA, an output-only black-box framework based on semantic distraction. Rather than removing visual evidence, DistractMIA preserves the original image, inserts a known semantic distractor, and measures how generated responses change. This design is motivated by the intuition that member samples remain more anchored to the original image semantics, while non-member samples are more easily redirected toward the distractor. To make this signal reliable, DistractMIA calibrates distractor configurations on a reference set and derives membership scores from repeated textual generations, capturing response stability and distractor uptake without accessing logits, probabilities, or hidden states. Experiments across multiple VLMs and benchmarks show that DistractMIA consistently outperforms both output-only and stronger-access baselines. Its performance on a medical benchmark further demonstrates applicability beyond object-centric natural images.

2605.12573 2026-05-14 cs.CV cs.AI cs.LG

Improving Diffusion Posterior Samplers with Lagged Temporal Corrections for Image Restoration

Davide Evangelista, Elena Morotti, Francesco Pivi, Maurizio Gabbrielli

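Affiliations * Dept. of Computer Science and Engineering; University of Bologna; Dept. of Political and Social Sciences

AI summary This paper studies how to improve diffusion-based posterior sampling (PS) for image restoration. The authors reinterpret PS from a dynamical perspective and propose LAMP, which combines a second-order discretization with a residual correction and introduces a lagged temporal correction to make sampling more stable and accurate. Experiments show that LAMP outperforms existing methods on several image restoration tasks without increasing the number of denoising evaluations.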

Comments 9 Figures, 9 Tables, Submitted to a conference

Abstract

Diffusion-based posterior sampling (PS) is a leading framework for imaging inverse problems, combining learned priors with measurement constraints. Yet, its standard formulations rely on instantaneous data-consistent estimates, which induce temporal variability in the reverse dynamics. We reinterpret PS from a dynamical perspective, showing that the standard PS update corresponds to a first-order discretization of the diffusion dynamics plus a residual correction capturing the mismatch between the denoised prediction and the data-consistent estimate. A second-order discretization, however, naturally introduces a temporal correction based on the variation of consecutive estimates. Building on this, we propose LAMP, combining the second-order update with the residual correction characterizing a PS technique. LAMP thus inherits a lagged temporal correction, and it can be implemented as a modular plug-in over the PS backbone. We show that LAMP preserves the structure of a posterior sampler, and we perform a one-step risk analysis to characterize when LAMP improves the reverse transition via a bias-variance trade-off. Experiments across multiple imaging tasks demonstrate consistent improvements over strong baselines such as DiffPIR and DDRM, without increasing the number of denoising evaluations.

2605.12571 2026-05-14 cs.CV cs.AI

VideoSEAL: Mitigating Evidence Misalignment in Agentic Long Video Understanding by Decoupling Answer Authority

Chenhao Qiu, Yechao Zhang, Xin Luo, Shien Song, Xusheng Liu

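Affiliations * Nanyang Technological University, Singapore

AI summary This paper studies performance problems caused by evidence misalignment in long video question answering and proposes VideoSEAL, a decoupled framework that separates planning from answer authority to improve both answer accuracy and evidence alignment. It introduces two diagnostics, temporal and semantic, that reveal pressures acting on existing models at inference and training time, and it mitigates evidence misalignment through pixel-level verification. Experiments show that the framework performs well on several long-video benchmarks, scales well, and supports modular upgrades.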

Comments Accepted to ICML 2026. 33 pages, 13 figures. Code and models are available at https://github.com/Echochef/VideoSEAL

Abstract

Long video question answering requires locating sparse, time-scattered visual evidence within highly redundant content. Although current MLLMs perform well on short videos, long videos introduce long-horizon search and verification, which often necessitates multi-turn, agentic interaction. We show that existing LVU agents can exhibit "evidence misalignment": they produce correct answers that are not supported by the retrieved or inspected evidence. To characterize this failure, we introduce two diagnostics (temporal groundedness and semantic groundedness) and use them to reveal two pressures that amplify misalignment: prompt pressure from shared-context saturation at inference time and reward pressure from outcome-only optimization during training. These findings point to a structural root cause: the coupled agent paradigm conflates long-horizon planning with answer authority. We therefore propose the decoupled planner-inspector framework, which separates planning from answer authority and gates final answering on pixel-level verification. Across four long-video benchmarks, our framework improves both answer accuracy and evidence alignment, achieving 55.1% on LVBench and 62.0% on LongVideoBench while producing interpretable search trajectories. Moreover, the decoupled architecture scales consistently with increased search budgets and supports plug-and-play upgrades of the MLLM backbone without retraining the planner. Code and models are available at https://github.com/Echochef/VideoSEAL.

2605.12570 2026-05-14 cs.CV

M3Net: A Macro-to-Meso-to-Micro Clinical-inspired Hierarchical 3D Network for Pulmonary Nodule Classification

Jinyue Li, Yuzhou Yu, Jingjing Yang, Meng Fu, Yani Zhang, Shuyao He, Dianlong Ge, Xin Ning, Yannan Chu, Qiankun Li

Affiliations * Hefei Cancer Hospital of CAS, Institute of Health and Medical Technology, Hefei Institutes of Physical Science, Chinese Academy of Sciences; University of Science and Technology of China; Graduate School, Bengbu Medical College; Department of Pulmonary and Critical Care Medicine, The First Affiliated Hospital of USTC, Division of Life Sciences and Medicine, University of Science and Technology of China (USTC); Northeastern University; Institute of Semiconductors, Chinese Academy of Sciences; College of Computing and Data Science (CCDS), Nanyang Technological University

AI summary Classifying pulmonary nodules as benign or malignant matters for early lung cancer screening but is challenging because nodules are multi-scale and heterogeneous. This paper proposes M3Net, a 3D network inspired by radiologists' hierarchical diagnostic workflow that integrates multi-scale context, from fine-grained structures to global anatomical relationships, for more accurate classification. The network uses a hierarchical input structure and a cross-scale semantic consistency mechanism, improving performance and interpretability. Experiments on a public dataset and a self-collected clinical dataset show that it outperforms existing methods.

Comments Published in Information Fusion (2026), 15 pages, 5 figures

Journal ref Information Fusion, 2026

Abstract

The accurate classification of benign and malignant pulmonary nodules in CT scans is critical for early lung cancer screening, yet remains challenging due to the multi-scale and heterogeneous nature of pulmonary nodules. While deep learning offers potential for auxiliary diagnosis, most existing models act as "black boxes", lacking the transparency and explainability required for trustworthy clinical integration. To address this issue, we propose M3Net, a novel 3D network for pulmonary nodule classification inspired by the hierarchical diagnostic workflow of radiologists, which integrates multi-scale contextual information from fine-grained structures to global anatomical relationships. Our framework constructs a progressive multi-scale input, from fine-grained nodule structures to local semantics and global spatial relationships. M3Net employs scale-specific encoders and ensures cross-scale semantic consistency through latent space projection and mutual information maximization. Extensive experiments on the public LIDC-IDRI dataset and a self-collected clinical dataset (USTC-FHLN) demonstrate that our method achieves state-of-the-art performance, with accuracies of 86.96% and 84.24% respectively, outperforming the best baseline by 3.26% and 2.17%. The results validate that M3Net provides a more robust and clinically relevant solution for pulmonary nodule classification. The code is available at https://github.com/jylEcho/M3-Net.

2605.12561 2026-05-14 cs.LG cs.RO

Learning When to Act: Communication-Efficient Reinforcement Learning via Run-Time Assurance

Adam Haroon, Erick J. Rodríguez-Seda, Cody Fleming, Tristan Schuler

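Affiliations * Department of Mechanical Engineering, Iowa State University; Department of Weapons, Robotics, and Control Engineering, United States Naval Academy; Navy Center for Applied Research in Artificial Intelligence (NCARAI), U.S. Naval Research Laboratory

AI summary This study asks how an agent can learn when to act while remaining safe, and proposes a communication-efficient reinforcement learning method based on run-time assurance (RTA). Combining a pointwise Lyapunov safety shield with an LQR backup, the method jointly learns control inputs and action timing around a stable equilibrium and provides a stronger safety guarantee than conventional constrained MDP methods. Experiments show longer inter-sample intervals on several systems, transfer across environments, and scalability to higher-dimensional systems.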

Comments 27 pages, 6 figures

Abstract

Safe reinforcement learning (RL) typically asks what an agent should do. We ask when it needs to act, and show that a single policy can jointly learn control inputs and communication-efficient timing decisions under a pointwise Lyapunov safety shield. We focus on stabilization around a known equilibrium, where CARE-based LQR backups, Lyapunov certificates, and classical Lyapunov-STC are well defined, enabling clean comparison against analytical baselines. A run-time assurance (RTA) layer overrides the policy via a one-step-ahead Lyapunov prediction and a precomputed LQR backup, providing a strictly stronger guarantee than constrained MDP methods that enforce safety only in expectation. On an inverted pendulum, cart-pole, and planar quadrotor, the learned policy achieves 1.91×, 1.45×, and 3.51× higher mean inter-sample interval (MSI) than a Lyapunov-triggered baseline; a fixed LQR controller at the same average rate is unstable on all three plants, showing that adaptive timing, not a lower average rate, makes sparsity safe. A CARE-derived Lyapunov reward transfers across environments without redesign, with a single weight w_c controlling the stability-communication tradeoff; ablations confirm the RTA shield is essential, with its removal reducing MSI by 1.27× to 1.84× and degrading state norms. A preference-conditioned extension recovers the full tradeoff frontier from one model at 2/11 of training compute, and SAC experiments show the results are algorithm-agnostic across discrete and continuous domains. A 12-state 3D quadrotor case study extends the framework to higher-dimensional systems where classical STC is intractable, and robustness to ±30% mass variation and disturbances shows graceful degradation, with the RTA absorbing what the learned policy cannot.
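The RTA layer described above (predict one step ahead, check a Lyapunov function, fall back to a backup controller) can be illustrated on a scalar linear system. This is a minimal sketch under assumptions: the plant x⁺ = a·x + b·u, the Lyapunov function V(x) = x², the non-increase condition, and the gain `k_backup` (standing in for a precomputed LQR gain) are all chosen for illustration and are not taken from the paper. The timing decision, which the paper also learns, is left out.

```python
from typing import Tuple


def rta_filter(
    x: float,
    u_policy: float,
    a: float = 1.1,
    b: float = 1.0,
    k_backup: float = 0.6,
) -> Tuple[float, bool]:
    """Return (applied input, overridden?) for the toy plant x_next = a*x + b*u.

    The policy's input is accepted only if the predicted next state does not
    increase V(x) = x**2; otherwise the backup law u = -k_backup * x is used.
    """

    def lyapunov(state: float) -> float:
        return state * state

    x_next = a * x + b * u_policy  # one-step-ahead prediction
    if lyapunov(x_next) <= lyapunov(x):
        return u_policy, False
    return -k_backup * x, True


if __name__ == "__main__":
    x = 1.0
    for u in (0.0, -0.5, 2.0):
        print(u, rta_filter(x, u))
```

Because the check runs on every step, the guarantee is pointwise rather than in expectation, which is the distinction the abstract draws against constrained MDP methods.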

2605.12556 2026-05-14 cs.CV

M2Retinexformer: Multi-Modal Retinexformer for Low-Light Image Enhancement

Youssef Aboelwafa, Hicham G. Elmongui, Marwan Torki

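Affiliations * Alexandria University, Egypt

AI summary Low-light image enhancement is challenging because of complex degradations such as amplified noise, artifacts, and color distortion. This paper proposes M2Retinexformer, a multi-modal Retinexformer framework that incorporates depth cues, luminance priors, and semantic features in a progressive refinement pipeline. The method fuses multi-scale information through cross-modal attention and uses adaptive gating to dynamically balance illumination-guided self-attention and cross-attention. Experiments show that it outperforms existing methods on several benchmark datasets.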

Comments Accepted at 2026 IEEE International Conference on Image Processing (ICIP)

Abstract

Low-light image enhancement is challenging due to complex degradations, including amplified noise, artifacts, and color distortion. While Retinex-based deep learning methods have achieved promising results, they primarily rely on single-modality RGB information. We propose M2Retinexformer (Multi-Modal Retinexformer), a novel framework that extends Retinexformer by incorporating depth cues, luminance priors, and semantic features within a progressive refinement pipeline. Depth provides geometric context that is invariant to lighting variations, while luminance and semantic features offer explicit guidance on brightness distribution and scene understanding. Modalities are extracted at multiple scales and fused through cross-attention, with adaptive gating dynamically balancing illumination-guided self-attention and cross-attention based on the reliability of auxiliary cues. Evaluations on the LOL, SID, SMID, and SDSD benchmarks demonstrate overall improvements over Retinexformer and recent state-of-the-art methods. Code and pretrained weights are available at https://github.com/YoussefAboelwafa/M2Retinexformer

2605.12549 2026-05-14 cs.CV

What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs

Jiaping Lin, Fei Shen, Junzhe Li, Ping Nie, Fei Yu, Ming Li, Haizhou Li

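Affiliations * Guangming Laboratory; National University of Singapore; Peking University; University of Waterloo; The Chinese University of Hong Kong (Shenzhen)

AI summary Existing training-free GUI grounding methods usually rely on multiple inference runs to identify the target element, yet each forward pass parses the instruction and the visual layout independently, with no progressive interaction among visual tokens. This paper studies the internal mechanism of vision-language models (VLMs) during GUI grounding and finds a two-stage paradigm: the prefill stage determines candidate UI elements, and the decoding stage refines the coordinates. Building on this, the authors propose Re-Prefill, a training-free method that adds an attention-guided second pass in the prefill stage, extracting the visual tokens most strongly attended from the query position as a preliminary hypothesis to improve grounding accuracy. Experiments show consistent gains on multiple benchmarks.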

Details
English abstract

Existing training-free approaches for GUI grounding often rely on multiple inference runs, such as iterative cropping or candidate aggregation, to identify target elements. Despite this additional computation, each forward pass still independently interprets the instruction and parses the visual layout, without enabling progressive interaction among visual tokens. In this paper, we study what happens during GUI grounding in Vision-Language Models (VLMs) and identify a previously overlooked bottleneck. We show that grounding follows a two-stage paradigm: the prefill stage determines candidate UI elements, while the decoding stage subsequently refines the final coordinates. This asymmetry establishes prefill as the critical step, as errors in candidate selection cannot be effectively corrected during decoding. Based on this observation, we propose Re-Prefill, a training-free method that revisits inference by introducing an attention-guided second prefill stage to refine target selection. Specifically, visual tokens that consistently receive high attention from the query position, i.e., the final token, across layers are extracted as a preliminary target hypothesis and appended to the input, together with the instruction hidden states, enabling the model to deeply re-think its decision before coordinate generation. Experiments across four VLMs and five benchmarks, including ScreenSpot-Pro, ScreenSpot-V2, OSWorld-G, UI-Vision, and MMBench-GUI, demonstrate consistent improvements without additional training, with gains of up to 4.3% on ScreenSpot-Pro. Code will be available at https://github.com/linjiaping1/Re-Prefill.
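
Illustrative sketch (not the authors' implementation): one way to read "visual tokens that consistently receive high attention from the query position across layers" is to keep the tokens that land in the top-k of enough layers. The function name, the top-k rule, and the `min_layers` threshold are assumptions for illustration; the abstract does not specify the selection rule.

```python
from collections import Counter


def consistent_top_tokens(attn_by_layer, k, min_layers):
    """Return indices of tokens that are in the top-k of at least min_layers layers.

    attn_by_layer: list of per-layer lists, where attn_by_layer[l][i] is the
    attention weight from the query position to visual token i in layer l.
    """
    counts = Counter()
    for layer in attn_by_layer:
        ranked = sorted(range(len(layer)), key=lambda i: layer[i], reverse=True)
        counts.update(ranked[:k])
    return sorted(i for i, c in counts.items() if c >= min_layers)


# Toy attention weights for 3 layers and 4 visual tokens (made-up numbers).
attn = [
    [0.1, 0.6, 0.3, 0.0],
    [0.2, 0.5, 0.3, 0.0],
    [0.4, 0.5, 0.1, 0.0],
]
print(consistent_top_tokens(attn, k=2, min_layers=3))
print(consistent_top_tokens(attn, k=2, min_layers=2))
```

With these toy weights only token 1 is in the top-2 of all three layers, and token 2 joins it once the threshold is relaxed to two layers.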

2605.12545 2026-05-14 cs.CV cs.AI

CROP: Expert-Aligned Image Cropping via Compositional Reasoning and Optimizing Preference

Zhitong Dong, Chao Li, Jie Yu, Hao Chen

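Affiliations * Southeast University; Key Laboratory of New Generation Artificial Intelligence Technology; Alibaba Group

AI summary This study proposes CROP, a method that uses compositional reasoning and preference optimization to produce image crops aligned with expert aesthetics. Unlike earlier methods that rely on saliency prediction or retrieval augmentation, CROP reformulates aesthetic cropping as a multimodal reasoning task, guiding a vision-language model to analyze, propose, and decide like a professional photographer. By decomposing the complex aesthetic problem and adding an expert preference alignment module, the method improves agreement between cropping results and human expert judgments; experiments show superior performance on multiple datasets.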

Details
English abstract

Aesthetic image cropping aims to enhance the aesthetic quality of an image by improving its composition through spatial cropping. Previous methods often rely on saliency prediction or retrieval augmentation, ignoring the task's core requirement: a deep understanding of composition and aesthetics. Consequently, saliency-based methods struggle to make compositional trade-offs in complex scenes, while retrieval-based methods blindly refer to similar cases, lacking adaptive reasoning for unique scenes. Both approaches fail to align their automated cropping results with those of human experts. To address the above issues, we propose a novel paradigm that reformulates aesthetic cropping as a multimodal reasoning task, aiming to activate the VLM's analytical and comprehension capabilities in aesthetics. We design a Compositional Reasoning and Optimizing Preference method (CROP) that directs the VLM to think like a professional photographer. It deconstructs a complex and subjective aesthetic problem into an "analysis-proposal-decision" process, reasoning step by step through the analysis of scene elements and compositional principles. Meanwhile, our expert preference alignment module makes the model's decision consistent with human expert aesthetics. Extensive experiments across multiple datasets validate our method's superiority and component effectiveness.

2605.12530 2026-05-14 cs.CL cs.AI cs.CY

In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores

Zeyu Tang, Sang T. Truong, Deonna Owens, Shreyas Sharma, Yibo Jacky Zhang, Brando Miranda, Sanmi Koyejo

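Affiliations * GitHub

AI summary This paper argues that the fairness of large language models should be evaluated through actual conversational behavior rather than standardized tests. The study finds that prompt construction choices in standardized tests can strongly affect scores and thereby distort fairness conclusions. The authors therefore develop MAC-Fairness, a multi-agent conversational framework that analyzes how model behavior differs across identities in multi-round dialogue, revealing model-specific behavioral signatures that may generalize across benchmarks with different fairness targets and evaluation methodologies.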

Details
English abstract

LLM fairness should be evaluated through in-situ conversational behavior rather than standardized-test Q&A benchmarks. We show that the standardized-test paradigm can be structurally unreliable: surface-level prompt construction choices, although entirely orthogonal to the fairness question being tested, account for the majority of score variance, shift fairness conclusions in both the direction and the magnitude, and result in severe discordance in model rankings. We develop MAC-Fairness, a multi-agent conversational framework that embeds controlled variation factors into multi-round dialogue for in-situ behavior evaluation, examining how models' conversational behavior shifts when identity is varied as part of natural multi-agent interaction. Repurposing standardized-test questions as conversation seeds rather than as the evaluation instrument, we evaluate position persistence (how they hold positions, from the self-perspective) and peer receptiveness (how receptive they are to peers, from the other-perspective) across 8 million conversation transcripts spanning multiple models and identity presence configurations. In-situ behavioral evaluation reveals stable, model-specific behavioral signatures that could generalize across benchmarks differing in fairness targets and evaluation methodologies, a form of evidence the standardized-test paradigm does not offer.

2605.12528 2026-05-14 cs.CV cs.AI cs.AR

MorphOPC: Advancing Mask Optimization with Multi-scale Hierarchical Morphological Learning

Yuting Hu, Lei Zhuang, Chen Wang, Ruiyang Qin, Hua Xiang, Gi-joon Nam, Jinjun Xiong

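Affiliations * University at Buffalo; IBM T. J. Watson Research Center; Villanova University

AI summary As feature sizes shrink to the nanometer scale, accurately transferring circuit patterns from photomasks to silicon wafers becomes increasingly difficult. To improve pattern fidelity and manufacturability, this paper proposes MorphOPC, a mask optimization model based on multi-scale hierarchical morphological learning that generates masks as a sequence of morphological operations on local layout features, improving generation quality. Experiments show that MorphOPC outperforms existing methods on multiple benchmarks, achieving higher printing fidelity and lower manufacturing cost, and demonstrating strong potential for scalable mask optimization.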

Details
English abstract

As feature sizes shrink to the nanometer scale, accurately transferring circuit patterns from photomasks to silicon wafers becomes increasingly challenging. Optical proximity correction (OPC) is widely used to ensure pattern fidelity and manufacturability. Recent generative mask optimization models based on encoder-decoder architecture can synthesize near-optimal masks, serving as fast machine learning (ML) surrogates for traditional OPC. However, these models often fail to capture the geometric transformations from target layouts to mask patterns, leading to suboptimal quality. In this work, we formulate mask generation as a sequence of morphological operations on local layout features and propose \textit{MorphOPC}, a multi-scale hierarchical model with neural morphological modules to learn these transformations. Experiments on edge-based OPC and ILT benchmarks across metal and via layers show that \textit{MorphOPC} consistently outperforms state-of-the-art methods, achieving higher printing fidelity and lower manufacturing cost, demonstrating strong potential for scalable mask optimization.

2605.12523 2026-05-14 cs.CL cs.AI cs.HC

Exploring how EFL students talk to and through AI to develop texts

David James Woo, Yangyang Yu, Yilin Huang, Deliang Wang, Kai Guo, Chi Ho Yeung

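Affiliations * Everwrite Limited; Shanghai Jiao Tong University; The Education University of Hong Kong; The University of Hong Kong; The Chinese University of Hong Kong

AI summary This study explores how learners of English as a foreign language (EFL) talk to AI through prompt engineering, and negotiate authorship, when writing with generative AI. Through a mixed-methods analysis of screen recordings of 44 Hong Kong secondary students completing a writing task with AI chatbots, the study identifies ten prompting strategies used by students and derives three profiles of human-AI responsibility. Although the profiles showed no significant effect on writing performance in content, language, and organization, the strategies and profiles carry implications for student engagement and autonomy in EFL writing pedagogy.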

Comments 37 pages, 5 figures

Details
English abstract

Generative Artificial Intelligence (AI) introduces new considerations for English as a foreign language (EFL) writing pedagogy. This study explores how students talk to and through AI by prompt engineering and negotiating authorship, respectively, and whether any patterns in the latter relate to students' writing performance. Using an exploratory mixed methods design, we analyzed screen recordings of 44 Hong Kong secondary students completing a Curricular Writing Task with AI Chatbots. Content analysis identified ten types of prompting strategies students employed, including questions, searches, and detailed instructions. From clustering these strategies, three distinct profiles of human-AI rhetorical load responsibility emerged: AI-dominant (52% of students), Human-dominant (25%) and Collaborative human-AI (14%). A MANOVA analysis indicated no significant multivariate effect of rhetorical load responsibility on three dimensions of students' writing performance: content, language, and organization. Students' prompting strategies and rhetorical load responsibility patterns have implications for their engagement and autonomy in EFL writing pedagogy.

2605.12522 2026-05-14 cs.CL cs.AI

Differences in Text Generated by Diffusion and Autoregressive Language Models

Zeyang Zhang, Chengwei Liang, Xingyan Chen, Meiqi Gu, Minrui Luo, Jingzhao Zhang, Tianxing He

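Affiliations * Shanghai Qi Zhi Institute; Institute for Interdisciplinary Information Sciences, Tsinghua University; Xiongan AI Institute

AI summary This paper studies the intrinsic differences between text generated by diffusion language models (DLMs) and autoregressive language models (ARMs). Experiments show that DLM-generated text has lower $n$-gram entropy and higher semantic coherence and diversity. Further analysis finds that the gains in semantic coherence and diversity stem mainly from the bidirectional context in the DLM training objective, while confidence-based remasking strategies in the decoding algorithm are the key factor behind the entropy reduction. The study reveals the core mechanisms that distinguish the two model families in text generation and offers guidance for future model design.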

Details
English abstract

Diffusion language models (DLMs) are promising alternatives to autoregressive language models (ARMs), yet the intrinsic differences in their generated text remain underexplored. We first find empirically that off-the-shelf DLMs exhibit lower $n$-gram entropy, higher semantic coherence, and higher semantic diversity. To understand the cause, we conduct controlled experiments that decouple the effects of training objectives and decoding algorithms. Results suggest that the DLM training objective contributes to the increases in semantic coherence and semantic diversity, but has a minor influence on entropy. These differences are primarily driven by the bidirectional context; other components in the training objective, such as input masking, label masking, and the weighting function, have a much weaker influence. Further, our experiments demonstrate that the reduction in entropy stems from DLMs' decoding algorithms, particularly confidence-based remasking strategies. We provide a theoretical understanding for this entropy reduction phenomenon. Together, our work uncovers key mechanisms underlying the differences between DLMs and ARMs in text generation, and informs future design of training objectives and decoding algorithms in DLMs.
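
Illustrative sketch (not from the paper): $n$-gram entropy, one of the statistics compared in the abstract above, can be computed as the Shannon entropy of the empirical $n$-gram distribution of a token sequence. The exact definition and tokenization used by the authors are not stated in this listing, so whitespace tokenization and base-2 logarithms are assumptions.

```python
import math
from collections import Counter


def ngram_entropy(tokens, n):
    """Shannon entropy (in bits) of the empirical n-gram distribution."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    counts = Counter(grams)
    total = len(grams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


repetitive = "the cat sat the cat sat the cat sat".split()
varied = "the cat sat on a mat near the door".split()
print(ngram_entropy(repetitive, 2), ngram_entropy(varied, 2))
```

Lower values indicate that fewer distinct $n$-grams carry most of the probability mass, which is the sense in which more repetitive text has lower entropy.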

2605.12521 2026-05-14 cs.CL cs.AI

ToolWeave: Structured Synthesis of Complex Multi-Turn Tool-Calling Dialogues

Dinesh Khandelwal, Gnana Prakash Punnavajhala, GPS Bhargav, Gaurav Pandey, Sachin Joshi, Hima Karanam, Dinesh Raghu

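Affiliations * IBM Research; IIIT Hyderabad

AI summary ToolWeave is a structured synthesis framework for generating complex multi-turn tool-calling dialogues, aimed at the unrealistic dialogues and implausible tool-calling workflows produced by current synthetic data generation methods. It constructs tools with built-in dependencies and filters workflows by alignment with user goals, yielding multi-step tool-calling workflows that better match real tasks. Experiments show that ToolWeave-generated dialogues contain more multi-step interactions and fewer hallucinated parameters and tool names, and that large language models fine-tuned on them outperform those fine-tuned on prior datasets on multiple benchmarks.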

Details
English abstract

Multi-turn tool calling is essential for LLMs to function as autonomous agents, yet synthesizing the training data required for these capabilities remains a fundamental challenge. Existing synthetic data generation pipelines often produce unrealistic dialogues for two reasons: they chain tools that are only superficially compatible rather than aligned with meaningful user tasks, and they generate dialogues in one shot, which often introduces arguments that were neither provided by the user nor produced by prior tool calls. These issues also lead to a severe underrepresentation of multi-step tool interactions. We introduce ToolWeave, a structured framework for synthesizing realistic multi-turn tool-calling dialogues. ToolWeave supports realistic multi-step workflows (or tool sequences) by constructing tools with built-in dependencies and filtering the workflows based on alignment with user goals. It reduces parameter hallucination by using a fine-grained planning stage that explicitly tracks parameter provenance. As a result, ToolWeave-generated synthetic dialogues contain more multi-step tool interactions (45%) and fewer hallucinations in parameters and tool names. Consequently, LLMs fine-tuned on ToolWeave consistently outperform those fine-tuned on prior datasets across three public benchmarks. Notably, Llama-3.1-70B fine-tuned on ToolWeave achieves 39.75% on BFCL-V3 multi-turn, compared to 23.50% when fine-tuned on SOTA ToolFlow data.

2605.12520 2026-05-14 cs.CL cs.AI

BoostTaxo: Zero-Shot Taxonomy Induction via Boosting-Style Agentic Reasoning and Constraint-Aware Calibration

Yancheng Ling, Zhenlin Qin, Leizhen Wang, Zhenliang Ma

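Affiliations * KTH Royal Institute of Technology; Monash University

AI summary BoostTaxo is a zero-shot taxonomy induction framework based on boosting-style reasoning and constraint-aware calibration, designed to address the limited generalization, structural reliability, and efficiency of existing methods. It pairs a lightweight and a large-scale language model, and uses coarse-to-fine parent identification, retrieval-augmented definition refinement, and structure-aware score calibration to improve the accuracy and robustness of taxonomy construction. Experiments show that BoostTaxo performs well on benchmark datasets including WordNet, DBLP, and SemEval-Sci, supporting its effectiveness in zero-shot settings.

Comments 13 pages, 7 figures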

Details
English abstract

Taxonomy induction is crucial for organizing concepts into explicit and interpretable semantic hierarchies. While existing methods have achieved promising results, their generalization, structural reliability, and efficiency remain limited, hindering their performance in zero-shot and large-scale scenarios. To overcome these limitations, we introduce BoostTaxo, a boosting-style LLM framework for zero-shot taxonomy induction. It takes a set of domain terms as inputs and performs parent identification in a coarse-to-fine manner, employing retrieval-augmented definition refinement, hybrid parent candidate selection, candidate rating, and structure-aware score calibration to improve taxonomy construction. Specifically, a lightweight LLM is used to efficiently filter candidate parents, while a large-scale LLM is employed to rank and score candidate parents for fine-grained parent selection. Structural features are further incorporated to calibrate candidate edge weights and enhance the reliability of the induced taxonomy. The unified BoostTaxo is evaluated on three public benchmark datasets, namely WordNet, DBLP, and SemEval-Sci, and achieves superior or comparable performance to state-of-the-art methods in zero-shot taxonomy induction. The ablation study validates the contribution of the hybrid parent candidate selection and the structure-aware score calibration to the overall performance. Further analysis investigates the impact of candidate selection size on taxonomy quality and presents representative case and failure studies, providing deeper insights into the effectiveness and limitations of the proposed framework.

2605.12519 2026-05-14 cs.CL cs.AI

Correct Answers from Sound Reasoning: Verifiable Process Supervision for Language Models

Kyuyoung Kim, Kevin Wang, Yunfei Xie, Peiyang Xu, Peiyao Sheng, Chen Wei, Zhangyang Wang, Jinwoo Shin, Pramod Viswanath, Sewoong Oh

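Affiliations * KAIST AI; University of Texas, Austin; Rice University; Princeton University; University of Washington; Sentient Labs

AI summary Training language models to produce correct answers with sound reasoning remains a challenge. This paper proposes verifiable process supervision (VPS), a framework that jointly optimizes prediction accuracy and reasoning quality in verifiable domains, guiding the model with a structured reasoning format and introducing adaptive reward weighting to better handle reasoning subtasks. Experiments show that, compared with reinforcement learning that optimizes accuracy alone, VPS preserves accuracy while significantly improving reasoning quality, reducing errors and restoring internal consistency.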

Comments Preprint

Details
English abstract

Training language models to produce both correct answers and sound reasoning remains an open challenge. Reinforcement learning with verifiable rewards typically optimizes only final outcomes, which can lead to a failure mode where task accuracy improves while reasoning becomes less accurate, less complete, or even internally inconsistent. We propose verifiable process supervision (VPS), a post-training framework for verifiable domains that jointly optimizes prediction accuracy and reasoning quality. We first apply supervised fine-tuning to induce a structured reasoning format, enabling syntactic extraction of intermediate claims that are evaluated against ground-truth signals to form process-level rewards. To address the heterogeneous difficulty of reasoning subtasks, we introduce adaptive reward weighting that prioritizes components with the largest remaining errors, creating an implicit curriculum. We evaluate VPS on chess, a controlled testbed where reasoning steps can be deterministically verified against engine signals. While accuracy-only RL improves move accuracy, it sharply degrades reasoning quality, increasing win-rate error by up to 112% and reducing internal consistency by up to 69%. In contrast, VPS preserves accuracy while significantly improving reasoning quality, reducing win-rate error by up to 30% and restoring consistency to near saturation. At matched accuracy, judge evaluation also prefers the process-supervised models. A reasoning-space analysis further shows that, without a structured prior, accuracy-only RL converges to budget-dependent shortcuts rather than sound multi-step reasoning. These results show that VPS enables language models to reason both accurately and reliably in verifiable domains.
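
Illustrative sketch (not the authors' formula): the abstract describes adaptive reward weighting that "prioritizes components with the largest remaining errors". One simple rule with that property weights each reasoning component in proportion to its current error. The proportional rule and the component names `win_rate` and `best_line` are assumptions for illustration only.

```python
def adaptive_weights(errors):
    """Weight each reasoning component in proportion to its remaining error."""
    clipped = {name: max(err, 0.0) for name, err in errors.items()}
    total = sum(clipped.values())
    if total == 0.0:
        # No remaining error anywhere: fall back to uniform weights.
        return {name: 1.0 / len(clipped) for name in clipped}
    return {name: err / total for name, err in clipped.items()}


def process_reward(component_scores, weights):
    """Combine per-component scores into one process-level reward."""
    return sum(weights[name] * component_scores[name] for name in weights)


# Hypothetical remaining errors for two reasoning components.
errors = {"win_rate": 3.0, "best_line": 1.0}
weights = adaptive_weights(errors)
reward = process_reward({"win_rate": 0.5, "best_line": 1.0}, weights)
print(weights, reward)
```

Because the weights are recomputed as errors shrink, emphasis shifts over training toward whichever component is still weakest, which is one way an implicit curriculum can arise.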

2605.12518 2026-05-14 cs.CL cs.AI

TimelineReasoner: Advancing Timeline Summarization with Large Reasoning Models

Liancheng Zhang, Xiaoxi Li, Zhicheng Dou

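Affiliations * Gaoling School of Artificial Intelligence, Renmin University of China

AI summary With the rapid growth of online news, extracting structured timelines from unstructured content has become a challenge. To address it, this paper proposes TimelineReasoner, a framework that uses the active reasoning capabilities of large reasoning models (LRMs) to turn timeline summarization from static generation into an iterative, reasoning-driven process. The framework has two stages, global cognition and detail exploration, and uses mechanisms such as event scraping, timeline updating, and gap detection to improve timeline accuracy, coverage, and coherence. Experimental results show that TimelineReasoner outperforms existing LLM-based methods on open-domain datasets and matches or exceeds state-of-the-art approaches on closed-domain datasets.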

Details
English abstract

The proliferation of online news poses a challenge to extracting structured timelines from unstructured content. While recent studies have shown that Large Language Models (LLMs) can assist Timeline Summarization (TLS), these approaches primarily treat models as passive generators. The emergence of Large Reasoning Models (LRMs) presents an opportunity to reason over events actively, enabling iterative evidence acquisition, the detection of missing events, and the validation of temporal consistency. To systematically leverage the reasoning capabilities of LRMs, we propose TimelineReasoner, a novel framework that shifts TLS from static generation to an active, reasoning-driven process. Unlike prior work, TimelineReasoner adopts a two-stage framework: Global Cognition, which tracks events at a macroscopic level and continuously updates a global event memory, and Detail Exploration, which identifies informational gaps and refines the timeline via targeted document retrieval. To support this, TimelineReasoner incorporates several specialized mechanisms, including an Event Scraper for retrieving temporal event descriptions, a Timeline Updater for refining the timeline, and a Supervisor for detecting gaps in the timeline and guiding retrieval. Experimental results on open-domain TLS datasets demonstrate that TimelineReasoner significantly outperforms existing LLM-based TLS methods in terms of timeline accuracy, coverage, and coherence. On closed-domain TLS datasets, our method performs on par with or exceeds state-of-the-art approaches. This work not only pushes the boundaries of TLS but also highlights the broader potential of LRM-based reasoning frameworks for timeline summarization.

2605.12517 2026-05-14 cs.CL cs.AI cs.CV

Bridging the Missing-Modality Gap: Improving Text-Only Calibration of Vision Language Models

Mingyeong Kim, Jungwon Choi, Chaeyun Jang, Juho Lee

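Affiliations * Graduate School of AI, KAIST

AI summary This study examines the accuracy drop and miscalibration that vision-language models show on text-only inputs, finding that model confidence becomes unreliable even when the text preserves the key information. The authors propose the Latent Imagination Module (LIM), a lightweight cross-attention module that predicts latent embeddings from text and feeds them into a frozen model backbone, improving accuracy and calibration without generating images. Experiments show that LIM improves performance across text-only tasks and missing-image scenarios.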

Comments 9 pages, 16 figures. Accepted at the ICLR 2026 Workshop on Principled Design for Trustworthy AI: Interpretability, Robustness, and Safety across Modalities

Details
English abstract

Vision-language models (VLMs) are often deployed on text-only inputs, although they are trained with images. We find that removing the vision modality causes large drops in accuracy and severe miscalibration, and the model does not behave like its original language backbone under text-only prompting. This failure is not explained only by missing semantic information. Even when text descriptions preserve key content, confidence becomes unreliable, while adding a visual signal through generated images partially restores accuracy and calibration. We propose the Latent Imagination Module (LIM), a lightweight cross-attention module that predicts imagined latent embeddings from textual input and feeds them into a frozen VLM backbone without pixel-level image synthesis. Across text-only benchmarks, unseen tasks, and missing-image scenarios, LIM improves accuracy and reduces calibration error. These results suggest that latent modality completion is a practical approach for reliable VLM inference under missing-modality conditions.
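
Illustrative sketch (not from the paper): the abstract reports reduced calibration error without naming the metric. A common choice is the expected calibration error (ECE), which bins predictions by confidence and averages the gap between accuracy and mean confidence per bin. The equal-width binning and the bin count below are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins over [0, 1]."""
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Toy predictions: an overconfident model that is right only half the time.
print(expected_calibration_error([0.9, 0.9], [True, False]))
```

A well-calibrated model that says 0.9 should be right about 90% of the time; the toy model above is right 50% of the time at that confidence, so its ECE is about 0.4.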

2605.12516 2026-05-14 cs.CL cs.AI

Domain Adaptation of Large Language Models for Polymer-Composite Additive Manufacturing Using Retrieval-Augmented Generation and Fine-Tuning

Saiful Islam Sagor, Tania Haghighi, Minhaj Nur Alam, Erina Baynojir Joyee

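Affiliations * Department of Mechanical Engineering and Engineering Science, University of North Carolina at Charlotte; Department of Electrical and Computer Engineering, University of North Carolina at Charlotte

AI summary This study examines how to adapt a general-purpose large language model to the additive manufacturing (AM) domain to improve accuracy, relevance, and usability in expert-level question answering. It compares retrieval-augmented generation (RAG) and fine-tuning using a curated AM corpus; results show that RAG clearly outperforms the baseline model in answer accuracy, relevance, and overall preference, whereas fine-tuning on raw text performs worse. The findings offer guidance for adapting large language models to specialized engineering domains.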

Details
English abstract

General-purpose large language models (LLMs) often struggle to generate reliable responses in specialized engineering domains due to limited domain grounding and insufficient exposure to structured technical knowledge. This study investigates practical strategies for adapting a foundation LLM to the additive manufacturing (AM) domain in order to improve answer accuracy, relevance, and usability for expert-level question answering. AM knowledge is distributed across heterogeneous sources such as academic literature, manufacturer documentation, technical standards, and procedural guides. Although general LLMs demonstrate strong linguistic capabilities, they frequently fail to retrieve and contextualize such domain-specific information. Two common approaches to address this limitation are domain-specific fine-tuning and retrieval-augmented generation (RAG). We construct a curated AM corpus and evaluate three configurations based on LLaMA-3-8B: (1) the pretrained baseline model, (2) a RAG system that retrieves relevant document chunks from a vector database, and (3) a model fine-tuned on raw domain text. Performance is evaluated using 200 expert-designed AM questions assessed by mechanical engineering experts for accuracy, relevance, and overall preference. Results show that the RAG model consistently outperforms the baseline. Among the 200 questions, 75.5% of RAG responses are judged more accurate, 85.2% are preferred overall, and 90.8% are rated more relevant than baseline responses. In contrast, fine-tuning on raw AM text reduces performance, producing more accurate answers in only 5.6% of cases and more relevant answers in 32.5% of cases. These results indicate that retrieval-augmented approaches provide a more effective pathway for adapting LLMs to specialized engineering domains than naive fine-tuning on unstructured technical data.
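
Illustrative sketch (not the authors' system): the retrieval step of a RAG configuration like the one described above can be reduced to ranking stored chunks by cosine similarity to the query embedding and placing the top results in the prompt. The two-dimensional vectors, chunk texts, and prompt template are made-up placeholders; a real system would use an embedding model and a vector database.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, chunks, k=2):
    """Return the texts of the k chunks most similar to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, passages):
    """Prepend retrieved passages to the question (hypothetical template)."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {question}"


# Placeholder chunks with toy 2-D embeddings.
chunks = [
    ("nozzle temperature notes", [1.0, 0.0]),
    ("layer height notes", [0.0, 1.0]),
    ("bed adhesion notes", [0.7, 0.7]),
]
passages = retrieve([1.0, 0.1], chunks, k=2)
print(build_prompt("Which nozzle temperature should be used?", passages))
```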

2605.12506 2026-05-14 cs.CV cs.AI cs.HC cs.RO eess.IV

Scale-Gest: Scalable Model-Space Synthesis and Runtime Selection for On-Device Gesture Detection

Abdul Basit, Saim Rehman, Muhammad Shafique

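Affiliations * New York University (NYU) Abu Dhabi

AI summary Running ML-based gesture detection on mobile devices under real-time, energy, and memory constraints is challenging, especially with varying battery levels. This paper proposes Scale-Gest, a run-time adaptive gesture detection framework that expands the detector space into a family of compact tiny-YOLO architectures and introduces device-calibrated ACE (Accuracy-Complexity-Energy) profiles to select a suitable model under different constraints. Experiments show that the method significantly reduces energy and latency while maintaining high detection performance, making it suitable for practical settings such as in-vehicle use.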

Comments 7 pages, 11 figures, Accepted to DAC 2026

Details
English abstract

Realizing on-device ML-based gesture detection under tight real-time performance, energy and memory constraints is challenging, especially when considering mobile devices with varying battery-power levels. Existing EdgeAI deployments typically rely on a single fixed detector, limiting optimization opportunities. We present Scale-Gest, a novel run-time adaptive gesture detection framework that expands the detector space into a dense family of tiny-YOLO architectures. We introduce multiple novel device-calibrated ACE (Accuracy-Complexity-Energy) profiles by analyzing different model-resolution-stride operating points. A lightweight run-time controller selects an appropriate ACE mode under user-defined and battery constraints, while a motion-aware hand-gesture-tracking ROI gate crops the input for reduced complexity detection. To evaluate performance of our system in real-world car driving scenarios, we introduce a temporally-annotated Driver Simulated Gesture (DSG-18) dataset. Scale-Gest maintains event-level F1 while significantly reducing energy and latency compared to single-detector approaches. On a battery-powered laptop running gesture streams, our ACE controller reduces per-frame energy by 4x (from 6.9 mJ to 1.6 mJ) while maintaining high gesture-detection performance (event-level F1 = 0.8-0.9) and low mean latency (6 ms).
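
Illustrative sketch (not the authors' controller): a run-time mode selector of the kind described above can be read as constrained selection over precomputed profiles, picking the most accurate profile that fits the current energy and latency budgets. The `AceProfile` fields, the fallback rule, and every number below are hypothetical placeholders rather than values from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AceProfile:
    name: str
    f1: float          # detection quality measured offline
    energy_mj: float   # per-frame energy on the target device
    latency_ms: float  # per-frame latency on the target device


def select_mode(profiles, energy_budget_mj, latency_budget_ms):
    """Pick the highest-F1 profile within both budgets.

    If nothing fits, fall back to the lowest-energy profile (an assumed policy).
    """
    feasible = [
        p for p in profiles
        if p.energy_mj <= energy_budget_mj and p.latency_ms <= latency_budget_ms
    ]
    if feasible:
        return max(feasible, key=lambda p: p.f1)
    return min(profiles, key=lambda p: p.energy_mj)


# Placeholder profiles for three hypothetical operating points.
profiles = [
    AceProfile("large", f1=0.92, energy_mj=8.0, latency_ms=12.0),
    AceProfile("medium", f1=0.88, energy_mj=4.0, latency_ms=8.0),
    AceProfile("small", f1=0.80, energy_mj=1.5, latency_ms=5.0),
]
print(select_mode(profiles, energy_budget_mj=5.0, latency_budget_ms=10.0).name)
print(select_mode(profiles, energy_budget_mj=1.0, latency_budget_ms=10.0).name)
```

A battery-aware controller could tighten `energy_budget_mj` as the charge level drops, so the same rule degrades toward smaller detectors.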

2605.13840 2026-05-14 stat.ML cs.DS cs.LG math.ST stat.CO stat.TH

What is Learnable in Valiant's Theory of the Learnable?

Steve Hanneke, Anay Mehrotra, Grigoris Velegkas, Manolis Zampetakis

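Affiliations * Purdue University; Stanford University; Google Research; Yale University

AI summary This paper revisits the learnability model that Valiant introduced in 1984 and asks which concept classes are learnable in it. It shows that over finite domains (including the Boolean hypercube) a class is learnable if and only if every realizable positive sample can be certified by a poly-size adaptive query-compression scheme. This result shows that learnability in Valiant's model lies strictly between PAC learnability and the query-free variant. The paper also gives the first algorithm for learning $d$-dimensional halfspaces in this model, showing that membership queries materially change the set of learnable classes.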

Comments Abstract shortened for arXiv

Details
English abstract

Valiant's 1984 paper is widely credited with introducing the PAC learning model, but it, in fact, introduced a different model: unlike PAC learning, the learner receives only positives, may issue membership queries, and must output a hypothesis with no false positives. Prior work characterized variants, including the case without queries. We revisit Valiant's original model and ask: *Which classes are learnable in it?* For every finite domain, including Valiant's Boolean-hypercube setting, we show that a class is learnable if and only if every realizable positive sample can be certified by a poly-size adaptive query-compression scheme. This is a new variant of sample compression where the learner certifies samples via a short interaction with the membership oracle. Our characterization shows that learnability in Valiant's model is strictly sandwiched between learnability in the PAC model and the variant of Valiant's model without membership queries. This is one of the rare cases where introducing membership queries changes the set of learnable classes, and not just the sample or computational complexity. Next, we study the natural extension of the model to arbitrary domains. While we do not obtain an exact characterization, our techniques readily generalize and show that the same strict sandwiching persists. Finally, we show that $d$-dimensional halfspaces, which are not learnable without queries, are learnable with queries: we give a $\mathrm{poly}(d)\,\tilde{O}(1/\varepsilon)$ sample and $\mathrm{poly}(d)\,\mathrm{polylog}(1/\varepsilon)$ query algorithm, and prove that at least $\Omega(d)$ samples or queries are necessary. To our knowledge, this is the first algorithm for halfspaces in Valiant's model. Together, these results uncover a surprisingly rich theory behind Valiant's original notion of learnability and introduce ideas that may be of independent interest in learning theory.

2605.13817 2026-05-14 cs.SE cs.AI

Neurosymbolic Auditing of Natural-Language Software Requirements

Bethel Hall, William Eiers

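Affiliations * Stevens Institute of Technology

AI summary This study addresses ambiguity, inconsistency, and underspecification in software requirements written in natural language by proposing an auditing method that combines large language models with symbolic reasoning. By translating natural-language requirements into formal logic and checking them with an SMT solver, the method detects ambiguity, contradictions, and safety violations. The authors build a neurosymbolic pipeline named VERIMED and apply it to medical-device software requirements; experiments show that it reduces ambiguity-sensitive requirements and substantially improves verified accuracy.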

Comments 10

Details
English abstract

Natural-language software requirements are often ambiguous, inconsistent, and underspecified; in safety-critical domains, these defects propagate into formal models that verify the wrong specification and into implementations that ship unsafe behavior. We show that large language models, equipped with an SMT solver, can audit such requirements: translating them into formal logic, detecting ambiguity through stochastic variation in the generated formalization, and exposing inconsistency, vacuousness, and safety violations through solver queries on the resulting specification. We present VERIMED, a neurosymbolic pipeline that operationalizes this idea for medical-device software requirements, and report two findings. First, stochastic variation across independent formalizations is a signal of ambiguity: requirements that admit multiple plausible interpretations produce SMT-inequivalent formalizations, and bidirectional SMT equivalence checking turns this disagreement into a solver-checkable test. Second, the usefulness of symbolic feedback depends on its granularity: in counterexample-guided repair on a hemodialysis question-answering benchmark, concrete SMT counterexamples raise verified accuracy from 55.4% to 98.5%. Over an extensive experimental evaluation on open-source hemodialysis safety requirements, we show that the LLM-based approach in VERIMED successfully reduces ambiguity-sensitive requirements and enables rigorous auditing of software requirements through SMT-based queries.
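
Illustrative sketch (not the VERIMED pipeline): the bidirectional equivalence check described above asks whether two formalizations of the same requirement imply each other. The sketch below swaps the SMT solver for a brute-force truth-table search over Boolean variables, which only works for small propositional formulas. The example requirement and its two readings are invented for illustration.

```python
from itertools import product


def implies_everywhere(f, g, variables):
    """Check that f implies g under every assignment; return (ok, counterexample)."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if f(env) and not g(env):
            return False, env
    return True, None


def equivalent(f, g, variables):
    """Bidirectional check: f implies g and g implies f."""
    ok_fg, cex_fg = implies_everywhere(f, g, variables)
    ok_gf, cex_gf = implies_everywhere(g, f, variables)
    counterexample = cex_fg if cex_fg is not None else cex_gf
    return ok_fg and ok_gf, counterexample


# Invented requirement: "The alarm sounds when pressure is high."
# Reading A: high pressure implies alarm. Reading B: alarm if and only if high pressure.
def reading_a(env):
    return (not env["high"]) or env["alarm"]


def reading_b(env):
    return env["high"] == env["alarm"]


print(equivalent(reading_a, reading_b, ["high", "alarm"]))
```

The two readings disagree when the alarm sounds without high pressure, so the check fails and returns that assignment as a counterexample, which is the kind of disagreement the abstract treats as a signal of ambiguity.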