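arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.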


2310.07844 2026-05-14 cs.RO

Saturation-Aware Angular Velocity Estimation: Extending the Robustness of SLAM to Aggressive Motions

Simon-Pierre Deschênes, Dominic Baril, Matěj Boxan, Johann Laconte, Philippe Giguère, François Pomerleau

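Affiliations: Northern Robotics Laboratory, Université Laval, Quebec City, Quebec, Canada

AI summary: This paper proposes a new angular velocity estimation method that makes SLAM algorithms more robust to gyroscope saturation caused by aggressive motions. By using accelerometers to estimate angular velocity while the robot is tumbling, the method reduces localization error and improves stability under extreme conditions. The authors also build a dataset, TIGS, to support testing and evaluating SLAM algorithms at high angular velocities.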

Comments 7 pages, 7 figures, published in 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan

Journal ref 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 2024, pp. 10711-10718

2304.11193 2026-05-14 cs.RO cs.AI cs.CV

Multi-Modal World Model for Physical Robot Interactions: Simultaneous Visual and Tactile Predictions for Enhanced Accuracy

Willow Mandil, Amir Ghalamzan-E

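Affiliations: University of Lincoln; University of Sheffield

AI summary: This paper studies world-model prediction that fuses visual and tactile information in physical robot interactions, with the goal of predicting the outcomes of robot manipulation in complex environments more accurately. Using two new robot-pushing datasets, the authors show that combining vision and touch significantly improves prediction in scenarios with high physical uncertainty, whereas touch adds little when the visual information is already unambiguous. The work offers new data and methodological insight for building more robust robot world models.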

Comments This paper is accepted for publication in Robotics and Autonomous Systems

1911.09301 2026-05-14 cs.CV

Image Aesthetics Assessment using Multi Channel Convolutional Neural Networks

Nishi Doshi, Gitam Shikhenawis, Suman K Mitra

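Affiliations: Dhirubhai Ambani Institute of Information and Communication Technology; C R Rao Advanced Institute of Mathematics, Statistics and Computer Science

AI summary: This paper addresses image aesthetics assessment, framed as classifying images as high or low quality. The authors propose a multi-channel convolutional neural network that takes image crops and saliency maps as inputs alongside the original image to improve classification. Experiments show that the method outperforms existing approaches on the widely used AVA dataset and is of practical value for applications.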

Journal ref Computer Vision and Image Processing. CVIP 2019

2605.12620 2026-05-14 cs.AI

Think Twice, Act Once: Verifier-Guided Action Selection For Embodied Agents

Nishad Singhi, Christian Bialas, Snehal Jauhri, Vignesh Prasad, Georgia Chalvatzaki, Marcus Rohrbach, Anna Rohrbach

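Affiliations: Technical University of Darmstadt

AI summary: This paper proposes VeGAS, a verifier-guided action selection framework that improves the robustness of embodied agents built on multimodal large language models (MLLMs) in complex real-world tasks. At inference time, the method samples multiple candidate actions and uses a generative verifier to select the most reliable one, without modifying the underlying policy. An LLM-driven data synthesis strategy automatically generates diverse failure cases for training, which improves the verifier's generalization. Experiments show that VeGAS significantly improves performance on several embodied reasoning benchmarks, with a 36% relative gain over the baseline on multi-object, long-horizon tasks.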

Comments CVPR 2026 (Findings)
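The inference-time step described in the summary for this entry (sample several candidate actions, score each with a verifier, keep the best) can be sketched as a generic best-of-N selection. This is a minimal illustration, not the VeGAS implementation: `select_action`, `toy_verifier`, and the example actions are hypothetical names, and the keyword-matching scorer merely stands in for a learned generative verifier.

```python
from typing import Callable, Sequence


def select_action(candidates: Sequence[str],
                  verifier_score: Callable[[str], float]) -> str:
    """Return the candidate the verifier scores highest (ties keep the earliest)."""
    if not candidates:
        raise ValueError("need at least one candidate action")
    return max(candidates, key=verifier_score)


def toy_verifier(action: str) -> float:
    # Hypothetical stand-in for a learned verifier: it prefers actions
    # that mention the object named in the (assumed) instruction.
    return 1.0 if "mug" in action else 0.0


# Hypothetical candidates, as if sampled from an unchanged base policy.
candidates = ["pick up the plate", "pick up the mug", "open the drawer"]
best = select_action(candidates, toy_verifier)
```

Because the policy is only queried for candidates, the selection step can be swapped or retrained without touching the policy itself, which matches the "without modifying the underlying policy" property the summary highlights.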

2605.12608 2026-05-14 cs.CV

A Data Efficiency Study of Synthetic Fog for Object Detection Using the Clear2Fog Pipeline

Mohamed Ahmed Mohamed, Xiaowei Huang

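Affiliations: Waymo Open Dataset; GitHub

AI summary: This paper studies data efficiency for improving object detection in adverse weather. It proposes Clear2Fog (C2F), a physics-based, end-to-end synthetic fog generation method that produces realistic foggy images from clear-weather datasets while keeping camera and LiDAR sensors consistent. By incorporating monocular depth estimation and a new atmospheric light estimation method, C2F overcomes the structural artifacts and color casts of existing techniques. Experiments show that training on diverse fog data generated by C2F significantly improves detection performance in real fog.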

Comments Project code and experimental configs available at https://github.com/mmohamed28/Clear2Fog
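The summary for this entry does not give C2F's equations. As an assumption, the sketch below uses the standard atmospheric scattering model that physics-based fog synthesis commonly builds on, I = J·t + A·(1 − t) with transmission t = exp(−β·d), where J is the clear pixel, d its depth, β the scattering coefficient, and A the atmospheric light. The function name `add_fog` and all numeric values are hypothetical.

```python
import math


def add_fog(pixel, depth_m, beta=0.05, airlight=(0.8, 0.8, 0.8)):
    """Blend one RGB pixel (values in [0, 1]) toward the airlight color.

    Uses I = J * t + A * (1 - t) with t = exp(-beta * depth).
    beta and airlight are illustrative values, not taken from the paper.
    """
    t = math.exp(-beta * depth_m)
    return tuple(j * t + a * (1.0 - t) for j, a in zip(pixel, airlight))


clear = (0.2, 0.4, 0.6)
near = add_fog(clear, depth_m=0.0)    # zero depth: the pixel is unchanged
far = add_fog(clear, depth_m=200.0)   # large depth: almost pure airlight
```

In this model the per-pixel depth and the airlight estimate are the two inputs that decide how the fog looks, which is why the summary singles out depth estimation and atmospheric light estimation as the places where artifacts and color casts arise.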

2605.12587 2026-05-14 cs.CV

TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking

Jisu Nam, Jahyeok Koo, Soowon Son, Jaewoo Jung, Honggyu An, Junhwa Hur, Seungryong Kim

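Affiliations: KAIST AI; Google DeepMind

AI summary: This paper proposes TrackCraft3R, which repurposes a pre-trained video diffusion transformer (video DiT) for dense 3D tracking from monocular video. Using a dual-latent representation and temporal RoPE alignment, the method converts the per-frame generative paradigm of video DiTs into a reference-anchored tracking formulation, predicting in a single forward pass a tracking pointmap that follows every pixel of the reference frame across time, along with its visibility. Experiments show that TrackCraft3R achieves state-of-the-art performance on standard sparse and dense 3D tracking benchmarks while running 1.3x faster and using 4.6x less peak memory than the strongest prior method.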

Comments Project page and code are available at https://cvlab-kaist.github.io/TrackCraft3r/

English abstract

Dense 3D tracking from monocular video is fundamental to dynamic scene understanding. While recent 3D foundation models provide reliable per-frame geometry, recovering object motion in this geometry remains challenging and benefits from strong motion priors learned from real-world videos. Existing 3D trackers either follow iterative paradigms trained from scratch on synthetic data or fine-tune 3D reconstruction models learned from static multi-view images, both lacking real-world motion priors. Pre-trained video diffusion transformers (video DiTs) offer rich spatio-temporal priors from internet-scale videos, making them a promising foundation for 3D tracking. However, their frame-anchored formulation, which generates each frame's content, is fundamentally mismatched with reference-anchored dense 3D tracking, which must follow the same physical points from a reference frame across time. We present TrackCraft3R, the first method to repurpose a video DiT as a feed-forward dense 3D tracker. Given a monocular video and its frame-anchored reconstruction pointmap, TrackCraft3R predicts a reference-anchored tracking pointmap that follows every pixel of the first frame across time in a single forward pass, along with its visibility. We achieve this through two designs: (i) a dual-latent representation that uses per-frame geometry latents and reference-anchored track latents as dense queries, and (ii) temporal RoPE alignment, which specifies the target timestamp of each track latent. Together, these designs convert the per-frame generative paradigm of video DiTs into a reference-anchored tracking formulation with LoRA fine-tuning. TrackCraft3R achieves state-of-the-art performance on standard sparse and dense 3D tracking benchmarks, while running 1.3x faster and using 4.6x less peak memory than the strongest prior method. We further demonstrate robustness to large motions and long videos.


2605.12586 2026-05-14 cs.CV cs.AI cs.DB

3D Primitives are a Spatial Language for VLMs

Junze Liu, Kun Qian, Florian Dubost, Kai Zhong, Arvind Srinivasan, Nan Chen, Anping Wang, Sam Zhang, Alejandro Mottini, Qingjun Cui, Tian Wang

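Affiliations: Unity Technologies

AI summary: This study examines a paradox in the spatial understanding of vision-language models (VLMs) and proposes 3D geometric primitives (cubes, spheres, cylinders), expressed in executable code, as an intermediate representation that improves spatial reasoning. It introduces the SpatialBabel benchmark, which evaluates fourteen VLMs on primitive-based 3D scene reconstruction, and two methods: Code-CoT, a training-free inference strategy, and S³-FT, a self-supervised fine-tuning method. Both improve performance on several spatial understanding tasks, supporting geometric primitives in code as both a diagnostic and a transferable spatial vocabulary.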

English abstract

Vision-language models (VLMs) exhibit a striking paradox: they can generate executable code that reconstructs a 3D scene from geometric primitives with correct object counts, classes, and approximate positions, yet the same models fail at simpler spatial questions on the same image. We show that 3D geometric primitives (cubes, spheres, cylinders, expressed in executable code) serve as a powerful intermediate representation for spatial understanding, and exploit this through three contributions. First, we introduce SpatialBabel, a benchmark evaluating fourteen VLMs on primitive-based 3D scene reconstruction across six scene-code languages (programming languages and declarative formats for 3D primitive scenes), revealing that a single model's object-detection F1 can vary by up to 5.7x across languages. Second, we propose Code-CoT (Code Chain-of-Thought), a training-free inference strategy that routes spatial reasoning through primitive-based code generation. Code-CoT lifts the SpatialBabel-QA-Score by up to +6.4% on primitive scenes and real-photo CV-Bench-3D accuracy by +5.0% for VLMs with strong coding capabilities. Third, we propose S³-FT (Self-Supervised Spatial Fine-Tuning), which self-supervisedly distills primitive spatial knowledge into general visual reasoning by parsing the model's own Three.js primitive-reconstructions into structured annotations and fine-tuning on the result, with no human labels and no teacher model. Training on primitive images alone, S³-FT improves Qwen3-VL-8B by +4.6 to +8.6% on SpatialBabel-Primitive-QA, +9.7% on CV-Bench-2D, and +17% on HallusionBench; the recipe transfers across model families. These results establish geometric primitives in code as both a diagnostic and a transferable spatial vocabulary for VLMs. We will release all artifacts upon publication.


2605.12584 2026-05-14 cs.LG cs.AI

Towards Robust Federated Multimodal Graph Learning under Modality Heterogeneity

Sirui Zhang, Haonan Wang, Xunkai Li, Zekai Chen, Shumeng Li, Hongchao Qin, Rong-Hua Li, Guoren Wang

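Affiliations: Beijing Institute of Technology

AI summary: This paper studies robust federated multimodal graph learning under modality heterogeneity. To address the shortcomings of existing methods in handling missing modalities and in federated collaboration, it identifies a two-stage pipeline in which clients reconstruct missing modalities and the server aggregates parameters. Building on this pipeline, the authors propose FedMPO, which combines topology-aware cross-modal generation, missing-aware expert routing, and reliability-aware aggregation to improve robustness and performance. Experiments show that FedMPO outperforms existing methods on multiple datasets, especially under high missing rates and non-IID data.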

English abstract

Recently, multimodal graph learning (MGL) has garnered significant attention for integrating diverse modality information and structured context to support various network applications. However, real-world graphs are often isolated due to data-sharing limitations across multiple parties, and their modalities are frequently incomplete. This highlights an urgent need to develop a robust federated approach. However, we find that existing methods remain insufficient. On the one hand, centralized MGL methods that handle missing modalities overlook the knowledge sharing and generalization in federated scenarios. On the other hand, while federated MGL methods have become increasingly mature, they primarily target non-graph data. Based on these technologies, we identify a two-stage pipeline wherein client-side completion reconstructs missing modalities, and server-side aggregation integrates the client-updated parameters of both the modality generator and the backbone models. Although this serves as a general solution, we identify two primary challenges in achieving greater robustness: (1) Topology-Isolated Local Completion: Client-side modality generation struggles to effectively leverage global semantics. (2) Reliability-Imbalanced Global Aggregation: Server-side multi-party collaboration is hindered by client updates with varying modality availability and recovery reliability. To address these challenges, we propose FedMPO, which utilizes topology-aware cross-modal generation to recover missing features using comprehensive graph context, missing-aware expert routing to locally filter out noisy recovered signals, and reliability-aware aggregation to appropriately down-weight unreliable updates. Extensive experiments on 3 tasks across 6 datasets demonstrate that FedMPO outperforms baselines, achieving performance gains of up to 4.10% and 5.65% in high-missing and non-IID settings.
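The idea of down-weighting unreliable client updates during server-side aggregation can be illustrated with a plain reliability-weighted average. This is only a sketch under stated assumptions: the abstract does not say how FedMPO computes reliability, so the scores below are hypothetical inputs, and `aggregate` is an illustrative name rather than the authors' API.

```python
def aggregate(updates, reliabilities):
    """Reliability-weighted average of client parameter vectors.

    updates: list of equal-length lists of floats, one per client.
    reliabilities: non-negative weights (hypothetical scores), one per client.
    """
    if not updates or len(updates) != len(reliabilities):
        raise ValueError("need one reliability score per client update")
    total = sum(reliabilities)
    if total <= 0:
        raise ValueError("at least one client must have positive reliability")
    dim = len(updates[0])
    return [
        sum(r * u[i] for u, r in zip(updates, reliabilities)) / total
        for i in range(dim)
    ]


# Hypothetical round: the third client's update is judged unreliable
# (weight 0), so it does not move the global parameters.
updates = [[1.0, 2.0], [3.0, 4.0], [100.0, -100.0]]
reliabilities = [1.0, 1.0, 0.0]
global_params = aggregate(updates, reliabilities)
```

With equal weights this reduces to an unweighted mean of the client vectors; the reliability scores are the only thing that distinguishes it from that baseline.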


2605.12580 2026-05-14 cs.LG

CAWI: Copula-Aligned Weight Initialization for Randomized Neural Networks

Mushir Akhtar, M. Tanveer, Mohd. Arshad

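Affiliations: Indian Institute of Technology Indore

AI summary: Randomized neural networks (RdNNs) train efficiently without backpropagation by freezing randomly initialized input-to-hidden weights. Conventional random initialization, however, ignores dependencies among features in the data, which hurts predictive performance. This paper proposes CAWI, a weight initialization method that fits a copula model to the data to capture the dependence structure among features, improving performance while retaining the closed-form solution. Experiments show that CAWI significantly outperforms conventional initialization on multiple classification tasks and biomedical datasets.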

Journal ref Proceedings of the 29th International Conference on Artificial Intelligence and Statistics (AISTATS) 2026

2605.12574 2026-05-14 cs.CV cs.AI

DistractMIA: Black-Box Membership Inference on Vision-Language Models via Semantic Distraction

Hongyi Tang, Zhihao Zhu, Yi Yang

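Affiliations: The Hong Kong University of Science and Technology

AI summary: This paper studies membership inference attacks on the training data of vision-language models in a black-box setting where only the generated text output is accessible, using semantic distraction. The proposed method, DistractMIA, inserts a known semantic distractor into the input image and analyzes how the generated text changes to decide whether a sample was part of the training data. It requires no access to model internals and relies only on outputs. Experiments show that it outperforms existing methods across several vision-language models and benchmark datasets and generalizes well to medical imaging tasks.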

Comments 23 pages, 8 figures

2605.12573 2026-05-14 cs.CV cs.AI cs.LG

Improving Diffusion Posterior Samplers with Lagged Temporal Corrections for Image Restoration

Davide Evangelista, Elena Morotti, Francesco Pivi, Maurizio Gabbrielli

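Affiliations: Dept. of Computer Science and Engineering; University of Bologna; Dept. of Political and Social Sciences

AI summary: This paper studies how to improve diffusion-based posterior sampling (PS) methods for image restoration. Reinterpreting PS from a dynamical perspective, the authors propose LAMP, a method that combines second-order discretization with residual correction and introduces lagged temporal corrections to make sampling more stable and accurate. Experiments show that LAMP outperforms existing methods on multiple image restoration tasks without increasing the number of denoiser evaluations.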

Comments 9 Figures, 9 Tables, Submitted to a conference

2605.12571 2026-05-14 cs.CV cs.AI

VideoSEAL: Mitigating Evidence Misalignment in Agentic Long Video Understanding by Decoupling Answer Authority

Chenhao Qiu, Yechao Zhang, Xin Luo, Shien Song, Xusheng Liu

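Affiliations: Nanyang Technological University, Singapore

AI summary: This paper studies performance problems caused by evidence misalignment in long-video question answering and proposes VideoSEAL, a decoupled framework that separates planning from answer authority to improve both answer accuracy and evidence alignment. It introduces temporal and semantic diagnostic metrics that reveal the pressures existing models face during inference and training, and mitigates evidence misalignment through a pixel-level verification mechanism. Experiments show that the framework performs strongly on several long-video benchmarks and supports scaling and modular upgrades.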

Comments Accepted to ICML 2026. 33 pages, 13 figures. Code and models are available at https://github.com/Echochef/VideoSEAL

2605.12570 2026-05-14 cs.CV

M3Net: A Macro-to-Meso-to-Micro Clinical-inspired Hierarchical 3D Network for Pulmonary Nodule Classification

Jinyue Li, Yuzhou Yu, Jingjing Yang, Meng Fu, Yani Zhang, Shuyao He, Dianlong Ge, Xin Ning, Yannan Chu, Qiankun Li

Affiliations: Hefei Cancer Hospital of CAS, Institute of Health and Medical Technology, Hefei Institutes of Physical Science, Chinese Academy of Sciences; University of Science and Technology of China; Graduate School, Bengbu Medical College; Department of Pulmonary and Critical Care Medicine, The First Affiliated Hospital of USTC, Division of Life Sciences and Medicine, University of Science and Technology of China (USTC); Northeastern University; Institute of Semiconductors, Chinese Academy of Sciences; College of Computing and Data Science (CCDS), Nanyang Technological University

AI summary: Classifying pulmonary nodules as benign or malignant is important for early lung cancer screening but challenging because of their multi-scale, heterogeneous characteristics. This paper proposes M3Net, a 3D network inspired by radiologists' hierarchical diagnostic workflow, which integrates multi-scale context ranging from fine-grained structure to global anatomical relationships for more accurate classification. The network uses a hierarchical input structure and a cross-scale semantic consistency mechanism, which improves performance and interpretability. Experiments on public datasets and an in-house clinical dataset show that it outperforms existing methods.

Comments Published in Information Fusion (2026), 15 pages, 5 figures

Journal ref Information Fusion, 2026

2605.12561 2026-05-14 cs.LG cs.RO

Learning When to Act: Communication-Efficient Reinforcement Learning via Run-Time Assurance

Adam Haroon, Erick J. Rodríguez-Seda, Cody Fleming, Tristan Schuler

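Affiliations: Department of Mechanical Engineering, Iowa State University; Department of Weapons, Robotics, and Control Engineering, United States Naval Academy; Navy Center for Applied Research in Artificial Intelligence (NCARAI), U.S. Naval Research Laboratory

AI summary: This study addresses how an agent can learn when to act while maintaining safety guarantees, and proposes a communication-efficient reinforcement learning method based on run-time assurance (RTA). By combining a pointwise Lyapunov safety barrier with an LQR backup, the method jointly learns control inputs and action timing near a stable equilibrium and provides stronger safety guarantees than conventional constrained MDPs. Experiments show that it achieves longer intervals between communications across several systems, transfers across environments, and scales to high-dimensional systems.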

Comments 27 pages, 6 figures

2605.12556 2026-05-14 cs.CV

M2Retinexformer: Multi-Modal Retinexformer for Low-Light Image Enhancement

Youssef Aboelwafa, Hicham G. Elmongui, Marwan Torki

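Affiliations: Alexandria University, Egypt

AI summary: Low-light image enhancement is challenging because of complex degradations such as amplified noise, artifacts, and color distortion. This paper proposes the Multi-Modal Retinexformer (M2Retinexformer), a framework that brings depth cues, luminance priors, and semantic features into a progressive refinement pipeline. The method fuses multi-scale information with cross-modal attention and uses an adaptive gating mechanism to dynamically balance illumination-guided self-attention against cross-attention. Experiments show that it outperforms existing methods on multiple benchmark datasets.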

Comments Accepted at 2026 IEEE International Conference on Image Processing (ICIP)

2605.12549 2026-05-14 cs.CV

What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs

Jiaping Lin, Fei Shen, Junzhe Li, Ping Nie, Fei Yu, Ming Li, Haizhou Li

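Affiliations: Guangming Laboratory; National University of Singapore; Peking University; University of Waterloo; The Chinese University of Hong Kong (Shenzhen)

AI summary: Existing training-free GUI grounding methods typically rely on multiple inference passes to identify the target element, but each forward pass parses the instruction and visual layout independently, with no progressive interaction among visual tokens. This paper studies the internal mechanism of vision-language models (VLMs) during GUI grounding and finds a two-stage paradigm: the prefill stage determines candidate UI elements, and the decoding stage refines the coordinates. Based on this finding, the authors propose Re-Prefill, a training-free method that adds an attention-guided second pass in the prefill stage, extracting the visual tokens most relevant to the query position as an initial hypothesis to improve grounding accuracy. Experiments show significant gains on multiple benchmarks.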

2605.12545 2026-05-14 cs.CV cs.AI

CROP: Expert-Aligned Image Cropping via Compositional Reasoning and Optimizing Preference

Zhitong Dong, Chao Li, Jie Yu, Hao Chen

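Affiliations: Southeast University; Key Laboratory of New Generation Artificial Intelligence Technology; Alibaba Group

AI summary: This study proposes CROP, a method for image cropping aligned with expert aesthetics through compositional reasoning and preference optimization. Unlike earlier approaches that rely on saliency prediction or retrieval augmentation, CROP recasts aesthetic cropping as a multimodal reasoning task, guiding a vision-language model to analyze, propose, and decide as a professional photographer would. By decomposing the complex aesthetic problem and adding an expert preference alignment module, the method brings cropping results closer to the judgments of human experts. Experiments show superior performance on multiple datasets.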

2605.12530 2026-05-14 cs.CL cs.AI cs.CY

In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores

Zeyu Tang, Sang T. Truong, Deonna Owens, Shreyas Sharma, Yibo Jacky Zhang, Brando Miranda, Sanmi Koyejo

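Affiliations: GitHub

AI summary: This paper argues that the fairness of large language models should be evaluated through their behavior in actual conversations rather than through standardized tests. It finds that the way prompts are constructed in standardized tests can substantially affect scores and thereby distort fairness conclusions. The authors develop MAC-Fairness, a multi-agent conversation framework that analyzes how models behave under different identities over multi-turn dialogues and reveals model-specific behavioral signatures, which hold across benchmarks with different fairness objectives and evaluation methods.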

2605.12528 2026-05-14 cs.CV cs.AI cs.AR

MorphOPC: Advancing Mask Optimization with Multi-scale Hierarchical Morphological Learning

Yuting Hu, Lei Zhuang, Chen Wang, Ruiyang Qin, Hua Xiang, Gi-joon Nam, Jinjun Xiong

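Affiliations: University at Buffalo; IBM T. J. Watson Research Center; Villanova University

AI summary: As feature sizes shrink to the nanometer scale, accurately transferring circuit patterns from photolithography masks to silicon wafers becomes increasingly difficult. To improve pattern fidelity and manufacturability, this paper proposes MorphOPC, a mask optimization model based on multi-scale hierarchical morphological learning that generates masks through a sequence of morphological operations on local layout features, improving generation quality. Experiments show that MorphOPC outperforms existing methods on multiple benchmarks, achieving higher printing fidelity and lower manufacturing cost, and demonstrate its potential for scalable mask optimization.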

2605.12523 2026-05-14 cs.CL cs.AI cs.HC

Exploring how EFL students talk to and through AI to develop texts

David James Woo, Yangyang Yu, Yilin Huang, Deliang Wang, Kai Guo, Chi Ho Yeung

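Affiliations: Everwrite Limited; Shanghai Jiao Tong University; The Education University of Hong Kong; The University of Hong Kong; The Chinese University of Hong Kong

AI summary: This study explores how English as a foreign language (EFL) learners converse with AI through prompt engineering and negotiate authorship when writing with generative AI. Through a mixed-methods analysis of screen recordings of 44 Hong Kong secondary school students completing a writing task with AI chatbots, the study identifies ten prompting strategies and three patterns of human-AI collaborative responsibility. Although the responsibility patterns had no significant effect on the content, language, or organization of the writing, the strategies and patterns have important implications for student engagement and autonomy in EFL writing instruction.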

Comments 37 pages, 5 figures

2605.12522 2026-05-14 cs.CL cs.AI

Differences in Text Generated by Diffusion and Autoregressive Language Models

Zeyang Zhang, Chengwei Liang, Xingyan Chen, Meiqi Gu, Minrui Luo, Jingzhao Zhang, Tianxing He

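Affiliations: Shanghai Qi Zhi Institute; Institute for Interdisciplinary Information Sciences, Tsinghua University; Xiongan AI Institute

AI summary: This paper studies intrinsic differences between text generated by diffusion language models (DLMs) and autoregressive language models (ARMs). Experiments show that DLM-generated text has lower $n$-gram entropy and higher semantic coherence and diversity. Further analysis attributes these differences mainly to the bidirectional context modeling of DLMs, and identifies the confidence-based remasking strategy in the decoding algorithm as the key factor behind the lower entropy. The study reveals the core mechanisms that distinguish the two model families in text generation and offers guidance for future model design.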
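The $n$-gram entropy named in the summary for this entry can be made concrete with the usual empirical definition: the Shannon entropy of the distribution of $n$-grams observed in a token sequence. The summary does not state the paper's exact formulation, so treat this as an assumed, common variant; `ngram_entropy` and the two toy sentences are illustrative.

```python
import math
from collections import Counter


def ngram_entropy(tokens, n=2):
    """Shannon entropy, in bits, of the empirical n-gram distribution."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        raise ValueError("sequence is shorter than n")
    counts = Counter(grams)
    total = len(grams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy inputs: a repetitive sequence and one with no repeated bigrams.
repetitive = "the cat the cat the cat the cat".split()
varied = "the cat sat on a warm red mat".split()

h_repetitive = ngram_entropy(repetitive)
h_varied = ngram_entropy(varied)
```

A sequence that keeps reusing the same bigrams scores lower than one whose bigrams are all distinct, so lower $n$-gram entropy indicates more repeated local patterns in the generated text.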

2605.12521 2026-05-14 cs.CL cs.AI

ToolWeave: Structured Synthesis of Complex Multi-Turn Tool-Calling Dialogues

Dinesh Khandelwal, Gnana Prakash Punnavajhala, GPS Bhargav, Gaurav Pandey, Sachin Joshi, Hima Karanam, Dinesh Raghu

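Affiliations: IBM Research; IIIT Hyderabad

AI summary: ToolWeave is a structured synthesis framework for generating complex multi-turn tool-calling dialogues, designed to address unrealistic dialogues and implausible tool-calling flows in current synthetic data generation methods. It constructs tools with built-in dependencies and filters workflows against user goals, producing multi-step tool-calling flows that better match real task scenarios. Experiments show that ToolWeave's dialogues contain more multi-step interactions and fewer hallucinated parameters and tool names, and that large language models fine-tuned on them outperform existing methods on multiple benchmarks.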

2605.12520 2026-05-14 cs.CL cs.AI

BoostTaxo: Zero-Shot Taxonomy Induction via Boosting-Style Agentic Reasoning and Constraint-Aware Calibration

Yancheng Ling, Zhenlin Qin, Leizhen Wang, Zhenliang Ma

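Affiliations: KTH Royal Institute of Technology; Monash University

AI summary: BoostTaxo is a zero-shot taxonomy induction framework based on boosting-style agentic reasoning and constraint-aware calibration, designed to address the limited generalization, structural reliability, and efficiency of existing methods. Lightweight and large language models work together through coarse-to-fine parent identification, retrieval-augmented definition refinement, and structure-aware score calibration to make taxonomy construction more accurate and robust. Experiments show that BoostTaxo performs strongly on benchmark datasets including WordNet, DBLP, and SemEval-Sci, confirming its effectiveness in the zero-shot setting.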

Comments 13 pages, 7 figures

2605.12519 2026-05-14 cs.CL cs.AI

Correct Answers from Sound Reasoning: Verifiable Process Supervision for Language Models

Kyuyoung Kim, Kevin Wang, Yunfei Xie, Peiyang Xu, Peiyao Sheng, Chen Wei, Zhangyang Wang, Jinwoo Shin, Pramod Viswanath, Sewoong Oh

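Affiliations: KAIST AI; University of Texas, Austin; Rice University; Princeton University; University of Washington; Sentient Labs

AI summary: Training language models to produce correct answers through sound reasoning remains a challenge. This paper proposes Verifiable Process Supervision (VPS), a framework that jointly optimizes prediction accuracy and reasoning quality in verifiable domains, guiding the model with a structured reasoning format and introducing adaptive reward weighting to better handle reasoning subtasks. Experiments show that, compared with reinforcement learning that optimizes accuracy alone, VPS maintains high accuracy while significantly improving reasoning quality, reducing errors and restoring internal consistency.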

Comments Preprint

2605.12518 2026-05-14 cs.CL cs.AI

TimelineReasoner: Advancing Timeline Summarization with Large Reasoning Models

Liancheng Zhang, Xiaoxi Li, Zhicheng Dou

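Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China

AI summary: With the rapid growth of online news, extracting structured timelines from unstructured content has become a challenge. This paper proposes TimelineReasoner, a framework that uses the active reasoning capabilities of large reasoning models (LRMs) to turn timeline summarization from static generation into an iterative, reasoning-driven process. The framework has two stages, global event cognition and detail exploration, and uses event retrieval, timeline updating, and omission detection to improve the accuracy, coverage, and coherence of timelines. Experiments show that TimelineReasoner outperforms existing LLM-based methods on multiple datasets.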

2605.12517 2026-05-14 cs.CL cs.AI cs.CV

Bridging the Missing-Modality Gap: Improving Text-Only Calibration of Vision Language Models

Mingyeong Kim, Jungwon Choi, Chaeyun Jang, Juho Lee

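Affiliations: Graduate School of AI, KAIST

AI summary: This study examines the performance degradation and miscalibration that vision-language models exhibit when given text-only input, and finds that model confidence becomes unreliable even when the text preserves the key information. The authors propose the Latent Imagination Module (LIM), a lightweight cross-attention module that generates latent embeddings from text and feeds them into the frozen model backbone, improving accuracy and calibration without generating images. Experiments show significant gains across a range of text-only tasks and missing-image scenarios.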

Comments 9 pages, 16 figures. Accepted at the ICLR 2026 Workshop on Principled Design for Trustworthy AI: Interpretability, Robustness, and Safety across Modalities

2605.12516 2026-05-14 cs.CL cs.AI

Domain Adaptation of Large Language Models for Polymer-Composite Additive Manufacturing Using Retrieval-Augmented Generation and Fine-Tuning

Saiful Islam Sagor, Tania Haghighi, Minhaj Nur Alam, Erina Baynojir Joyee

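Affiliations: Department of Mechanical Engineering and Engineering Science, University of North Carolina at Charlotte; Department of Electrical and Computer Engineering, University of North Carolina at Charlotte

AI summary: This study investigates how to adapt general-purpose large language models to additive manufacturing (AM) to improve the accuracy, relevance, and usefulness of their answers to expert-level questions. It compares retrieval-augmented generation (RAG) and fine-tuning on a purpose-built AM corpus. RAG significantly outperforms the baseline model in answer accuracy, relevance, and overall preference, whereas fine-tuning on raw text performs poorly. The results offer guidance for adapting large language models to specialized engineering domains.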

2605.12506 2026-05-14 cs.CV cs.AI cs.HC cs.RO eess.IV

Scale-Gest: Scalable Model-Space Synthesis and Runtime Selection for On-Device Gesture Detection

Abdul Basit, Saim Rehman, Muhammad Shafique

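Affiliations: New York University (NYU) Abu Dhabi

AI summary: Running machine-learning-based gesture detection on mobile devices under real-time, energy, and memory constraints is challenging, especially when battery levels vary. This paper proposes Scale-Gest, a runtime-adaptive gesture detection framework that expands the detector space into a family of compact tiny-YOLO architectures and introduces device-calibrated ACE (accuracy-complexity-energy) profiles to select the best model under different constraints. Experiments show that the method significantly reduces energy consumption and latency while maintaining high detection performance, making it suitable for practical settings such as in-vehicle use.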

Comments 7 pages, 11 figures, Accepted to DAC 2026
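Runtime selection over pre-profiled detector variants, as described in the summary for this entry, can be sketched as a constrained choice: keep the variants that fit the current latency and energy budget, then take the most accurate one. The `Profile` fields, the variant names, and every number below are hypothetical; the actual ACE profiles and selection rule in Scale-Gest may differ.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Profile:
    name: str          # hypothetical variant name
    accuracy: float    # measured offline, higher is better
    latency_ms: float  # per-frame latency on the target device
    energy_mj: float   # per-frame energy on the target device


def select_model(profiles: Sequence[Profile],
                 max_latency_ms: float,
                 max_energy_mj: float) -> Optional[Profile]:
    """Most accurate profile that fits both budgets, or None if none fits."""
    feasible = [p for p in profiles
                if p.latency_ms <= max_latency_ms and p.energy_mj <= max_energy_mj]
    if not feasible:
        return None
    return max(feasible, key=lambda p: p.accuracy)


# Illustrative profiles only; real values would come from device calibration.
profiles = [
    Profile("tiny-a", accuracy=0.80, latency_ms=5.0, energy_mj=10.0),
    Profile("tiny-b", accuracy=0.88, latency_ms=12.0, energy_mj=25.0),
    Profile("tiny-c", accuracy=0.92, latency_ms=30.0, energy_mj=60.0),
]
choice = select_model(profiles, max_latency_ms=15.0, max_energy_mj=30.0)
```

Tightening the energy budget as the battery drains would move the choice toward the smaller variants without retraining anything, which is the appeal of profiling a whole family of models up front.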

2605.13840 2026-05-14 stat.ML cs.DS cs.LG math.ST stat.CO stat.TH

What is Learnable in Valiant's Theory of the Learnable?

Steve Hanneke, Anay Mehrotra, Grigoris Velegkas, Manolis Zampetakis

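Affiliations: Purdue University; Stanford University; Google Research; Yale University

AI summary: This paper revisits the learnability model Valiant introduced in 1984 and asks which concept classes are learnable in it. For every finite domain, including the Boolean hypercube, a class is learnable if and only if every realizable positive sample can be certified by a poly-size adaptive query-compression scheme. This shows that learnability in Valiant's model lies strictly between PAC learnability and learnability in the variant without queries. The paper also gives the first algorithm for learning $d$-dimensional halfspaces in this model, showing that membership queries materially change the set of learnable classes.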

Comments Abstract shortened for arXiv

English abstract

Valiant's 1984 paper is widely credited with introducing the PAC learning model, but it, in fact, introduced a different model: unlike PAC learning, the learner receives only positives, may issue membership queries, and must output a hypothesis with no false positives. Prior work characterized variants, including the case without queries. We revisit Valiant's original model and ask: Which classes are learnable in it? For every finite domain, including Valiant's Boolean-hypercube setting, we show that a class is learnable if and only if every realizable positive sample can be certified by a poly-size adaptive query-compression scheme. This is a new variant of sample compression where the learner certifies samples via a short interaction with the membership oracle. Our characterization shows that learnability in Valiant's model is strictly sandwiched between learnability in the PAC model and the variant of Valiant's model without membership queries. This is one of the rare cases where introducing membership queries changes the set of learnable classes, and not just the sample or computational complexity. Next, we study the natural extension of the model to arbitrary domains. While we do not obtain an exact characterization, our techniques readily generalize and show that the same strict sandwiching persists. Finally, we show that $d$-dimensional halfspaces, which are not learnable without queries, are learnable with queries: we give a $\mathrm{poly}(d)\,\tilde{O}(1/\varepsilon)$ sample and $\mathrm{poly}(d)\,\mathrm{polylog}(1/\varepsilon)$ query algorithm, and prove that at least $\Omega(d)$ samples or queries are necessary. To our knowledge, this is the first algorithm for halfspaces in Valiant's model. Together, these results uncover a surprisingly rich theory behind Valiant's original notion of learnability and introduce ideas that may be of independent interest in learning theory.
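The "positives only, no false positives" requirement in the abstract above can be illustrated with the textbook learner for monotone conjunctions, which needs no membership queries: keep exactly the variables that are 1 in every positive example. This is a classical illustration, not the paper's query-compression scheme or its halfspace algorithm; the function names and the 4-variable target are hypothetical.

```python
def learn_conjunction(positives):
    """Indices of variables that equal 1 in every positive example.

    Every variable of the true monotone conjunction is 1 in each positive
    example, so it is always kept. The hypothesis therefore implies the
    target and can never accept a true negative (no false positives).
    """
    if not positives:
        raise ValueError("need at least one positive example")
    n = len(positives[0])
    return {i for i in range(n) if all(x[i] == 1 for x in positives)}


def predict(kept, x):
    """Accept x only if every kept variable is 1."""
    return all(x[i] == 1 for i in kept)


# Hypothetical target concept: x0 AND x2 over four Boolean variables.
positives = [(1, 0, 1, 1), (1, 1, 1, 0), (1, 0, 1, 0)]
kept = learn_conjunction(positives)
```

The learner may still reject some true positives until it has seen enough examples, which is the one-sided error this model permits.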


2605.13817 2026-05-14 cs.SE cs.AI

Neurosymbolic Auditing of Natural-Language Software Requirements

Bethel Hall, William Eiers

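Affiliations: Stevens Institute of Technology

AI summary: This study targets ambiguity, inconsistency, and underspecification in software requirements written in natural language, and proposes an auditing approach that combines neural networks with symbolic reasoning. By translating natural-language requirements into formal logic and checking them with an SMT solver, the approach detects ambiguities, contradictions, and safety violations. The authors build a neurosymbolic framework, VERIMED, and apply it to verifying medical device software requirements; experiments show that it reduces ambiguous requirements and significantly improves verification accuracy.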

Comments 10
