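arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.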

2605.07357 2026-05-12 cs.AI

GraphReAct: Reasoning and Acting for Multi-step Graph Inference

Xingtong Yu, Zhongwei Kuai, Chang Zhou, Xuanting Xie, Renhe Jiang, Xikun Zhang, Hong Cheng, Xinming Zhang, Yuan Fang

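Affiliations The Chinese University of Hong Kong; The University of Science and Technology of China; University of Electronic Science and Technology of China; The University of Tokyo; RMIT University; Singapore Management University

AI summary This paper proposes GraphReAct, a reasoning-and-acting framework for multi-step graph inference. It combines the topological and semantic information in graph-structured data through two complementary retrieval actions, topological retrieval and semantic retrieval, which dynamically expand the reasoning context, plus a context refinement action that progressively compresses the accumulated information. Experiments show that GraphReAct outperforms existing methods on six benchmark datasets, supporting its effectiveness for graph learning.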

Comments Under review

2605.07237 2026-05-12 cs.CL

Teaching Language Models to Think in Code

Hyeon Hwang, Jiwoo Lee, Jaewoo Kang

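Affiliations Korea University; AIGEN Sciences

AI summary This paper proposes ThinC, a framework that has language models reason through code rather than treating code as a tool subordinate to natural-language instructions. The model reasons over the execution results between code blocks, which reduces the interference and errors introduced by natural-language reasoning. Experiments show that ThinC performs strongly on several challenging math benchmarks, even surpassing larger models, and that its reasoning depends heavily on code execution results, which makes it fairly robust.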

Comments Preprint

2605.07203 2026-05-12 cs.CV

From Pixels to Primitives: Scene Change Detection in 3D Gaussian Splatting

Chamuditha Jayanga Galappaththige, Jason Lai, Timothy Patten, Donald Dansereau, Niko Suenderhauf, Dimity Miller

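Affiliations QUT Centre for Robotics; ARIAM; ACFR, University of Sydney

AI summary This paper studies scene change detection based on Gaussian Splatting and proposes comparing scenes directly in the space of native Gaussian parameters instead of the conventional render-then-compare approach. By analyzing the native attributes of each Gaussian (position, anisotropic covariance, and color), the authors show that these attributes already carry sufficient change information, and they introduce anisotropic models of geometric and photometric drift together with a per-Gaussian observability term to handle the under-constrained nature of the representation. The method offers advantages in multi-view consistency and in distinguishing change types, and improves on existing methods by about 17% on real-world datasets.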

Comments Project Page: https://chumsy0725.github.io/GS-DIFF/

2605.07202 2026-05-12 cs.AI

Towards Autonomous Business Intelligence via Data-to-Insight Discovery Agent

Dongming Wu, Junwen Li, Ming Lu, Gang Wang, Ting Chen

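Affiliations Rajax Network Technology (Taobao Shangou of Alibaba)

AI summary This paper presents AIDA, an autonomous insight discovery agent described as the first end-to-end autonomous exploration framework for complex business environments, aimed at the challenges large language models face when turning fragmented enterprise data into actionable insights. AIDA builds a highly flexible retail environment with more than 200 metrics and more than 100 dimensions and integrates a proprietary domain-specific language (DSL) that couples semantic reasoning with precise SQL execution. Using a reinforcement learning system, AIDA models business analysis as a cumulative reasoning process guided by the Pareto principle; experiments show it significantly outperforms workflow-based agents in environment awareness and multi-angle in-depth analysis.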

2605.07024 2026-05-12 cs.LG cs.AI

Delulu: A Verified Multi-Lingual Benchmark for Code Hallucination Detection in Fill-in-the-Middle Tasks

Mahdi Erfanian, Nelson Daniel Troncoso, Aashna Garg, Amabel Gale, Xiaoyu Liu, Pareesa Ameneh Golnari, Shengyu Fu

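Affiliations University of Illinois Chicago; Microsoft

AI summary This paper presents Delulu, a verified multilingual benchmark for detecting hallucinations in Fill-in-the-Middle (FIM) code generation tasks. An adversarial pipeline generates and filters a high-quality dataset of 1,951 samples covering 7 languages and 4 hallucination types, with Docker containers used to verify compilation and runtime errors. An evaluation of 11 open-source FIM models shows that even the strongest model reaches only 84.5% accuracy, indicating that hallucination in FIM tasks is intrinsically difficult rather than a flaw of a particular model family.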

2605.06969 2026-05-12 cs.CV

Bringing Multimodal Large Language Models to Infrared-Visible Image Fusion Quality Assessment

Yuchen Guo, Junli Gong, Yao Lu, Xintong Xu, Yiuming Cheung, Weifeng Su

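Affiliations Northwestern University; Northeastern University; University of Washington; Hong Kong Baptist University; Beijing Normal - Hong Kong Baptist University

AI summary This study aims to improve the accuracy of quality assessment for infrared-visible image fusion (IVIF). To address existing methods' over-reliance on hand-crafted features and full-reference metrics, it proposes FuScore, an assessment method based on multimodal large language models (MLLMs). FuScore has the MLLM produce continuous quality scores rather than discrete level predictions, enabling fine-grained discrimination between images of similar quality; it builds soft labels from multi-dimensional consistency and adds a ternary objective function to improve comprehensiveness and robustness. Experiments show that FuScore achieves state-of-the-art correlation with human visual preference.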

2605.06763 2026-05-12 cs.LG

Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache

Mohsen Dehghankar, Abolfazl Asudeh

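Affiliations Department of Computer Science; University of Illinois Chicago

AI summary This study reformulates sparse attention as a halfspace range searching problem, aiming to improve LLM inference efficiency while guaranteeing full recall of critical key-value entries. The authors propose Louver, a new index structure that guarantees zero false negatives in both theory and practice, is lightweight and easy to integrate, and includes hardware-aware optimizations. Experiments show that Louver outperforms existing sparse attention methods in both accuracy and runtime, and is even faster than highly optimized dense attention implementations.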

English abstract

Sparse attention improves LLM inference efficiency by selecting a subset of key-value entries, but at the cost of potential accuracy degradation. In particular, omitting critical KV entries can induce substantial errors in model outputs. Existing methods typically operate under fixed or adaptive token budgets and provide empirical robustness or partial theoretical guarantees, yet they do not ensure zero false negatives in decoding steps, particularly since the set of relevant tokens is both query- and step-dependent. Our empirical observations confirm that missing even one critical key can lead to sharp error spikes, especially in long reasoning tasks where the set of important tokens varies throughout decoding. This observation motivates the need for indexing methods that dynamically adapt to these variations across decoding steps while guaranteeing a full recall of the relevant keys above a certain threshold. We address this challenge by reformulating sparse attention as the halfspace range searching problem. However, existing range searching indices are not suitable for modern LLM inference due to their computational and implementation overheads. To overcome this, we introduce Louver, a novel index structure tailored for efficient KV cache retrieval. Louver (i) guarantees zero false negatives with respect to a specified threshold in both theory and practice, (ii) is lightweight to integrate into existing LLM pipelines, and (iii) incorporates hardware-aware optimizations for both CPU and GPU executions. Our experiments demonstrate that Louver outperforms prior sparse attention methods in both accuracy and runtime, and is faster than highly optimized dense attentions such as FlashAttention. These results highlight that recall guarantees are a critical and overlooked dimension of sparse attention, and open a new direction for building theoretically grounded, efficient KV cache indices.
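The halfspace view in the abstract can be made concrete with a small sketch. For a query `q` and threshold `tau`, the relevant keys are exactly those with `q · k >= tau`, which is a halfspace range query over the key set. The code below is not Louver; it is a generic block-pruning illustration under assumed names (`build_blocks`, `halfspace_query`, `block_size`, `slack`), and it shows one standard way to skip whole blocks of keys without ever dropping a key above the threshold.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def build_blocks(keys, block_size):
    """Group keys into fixed-size blocks with a centroid and covering radius each."""
    blocks = []
    for start in range(0, len(keys), block_size):
        ids = list(range(start, min(start + block_size, len(keys))))
        dim = len(keys[0])
        centroid = [sum(keys[i][d] for i in ids) / len(ids) for d in range(dim)]
        radius = max(math.dist(keys[i], centroid) for i in ids)
        blocks.append((ids, centroid, radius))
    return blocks


def halfspace_query(keys, blocks, q, tau, slack=1e-9):
    """Return the ids of all keys k with q . k >= tau (no false negatives)."""
    q_norm = math.sqrt(dot(q, q))
    hits = []
    for ids, centroid, radius in blocks:
        # Cauchy-Schwarz: q.k = q.c + q.(k - c) <= q.c + |q| * radius.
        # If even this upper bound is below tau, no key in the block qualifies.
        # The small slack (an assumption) guards against floating-point rounding.
        if dot(q, centroid) + q_norm * radius + slack < tau:
            continue
        hits.extend(i for i in ids if dot(q, keys[i]) >= tau)
    return hits


random.seed(0)
keys = [[random.gauss(0, 1) for _ in range(8)] for _ in range(200)]
blocks = build_blocks(keys, block_size=16)
query = [random.gauss(0, 1) for _ in range(8)]
selected = halfspace_query(keys, blocks, query, tau=1.0)
```

Because a block is skipped only when its upper bound falls below the threshold, the result matches a brute-force scan; how much work is saved depends on how tightly the blocks cluster, which is the part a purpose-built index would optimize.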

2605.06681 2026-05-12 cs.LG cs.CV

A Hierarchical Ensemble Pipeline for Anomaly Detection in ESA Satellite Telemetry

Lorenzo Riccardo Allegrini, Geremia Pompei

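Affiliations ContinualIST, Pisa, Italy; University of Pisa, Department of Computer Science, Pisa, Italy

AI summary This paper proposes a hierarchical ensemble pipeline for anomaly detection in European Space Agency (ESA) satellite telemetry. It combines shapelet extraction, statistical feature analysis, per-channel modeling, intra-channel stacking, and cross-channel aggregation, and is trained and validated with time-series cross-validation and a two-level masking strategy to prevent information leakage. Results show strong generalization on the ESA-ADB benchmark and effective detection of subtle anomalies in real satellite telemetry.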

Comments 15 pages, 3 figures, 1 table. Submitted to the ML4ITS workshop at the ECML PKDD 2025 conference. Awarded 2nd place in the final round of the Spacecraft Anomaly Challenge on ESA dataset. (Ranked 1st on the Kaggle public leaderboard and 3rd on the private leaderboard)

Journal ref Communications in Computer and Information Science 2842 (2026) Chapter 7

2605.06663 2026-05-12 cs.CL

EMO: Pretraining Mixture of Experts for Emergent Modularity

Ryan Wang, Akshita Bhagia, Sewon Min

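Affiliations University of California, Berkeley; Allen Institute for AI

AI summary This paper proposes EMO, a pretrained Mixture-of-Experts (MoE) model designed for modular deployment, so that different tasks can use expert subsets independently or in combination without human-defined priors. EMO encourages content from similar domains to use similar experts and pretrains with respect to document boundaries, so that expert groupings emerge naturally during training. Experiments show that EMO preserves overall performance while substantially reducing the number of experts needed at inference, and that its expert subsets specialize at the semantic level (e.g., math, code), an improvement over the syntactic-level specialization of conventional MoE.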

2605.06483 2026-05-12 cs.AI cs.RO cs.SY eess.SY

ReasonSTL: Bridging Natural Language and Signal Temporal Logic via Tool-Augmented Process-Rewarded Learning

Bowen Ye, Zhijian Li, Junyue Huang, Junkai Ma, Xiang Yin

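Affiliations Shanghai Jiao Tong University; Alibaba Group

AI summary This study proposes ReasonSTL, a framework for the important but challenging task of translating natural language into Signal Temporal Logic (STL). ReasonSTL combines a local open-source language model with a tool-augmented reasoning process to generate STL formulas from natural language efficiently, and introduces process-rewarded training to optimize both the tool-use trajectory and the final formula structure. Experiments show state-of-the-art results in both automatic and human evaluation, offering a transparent, low-cost, and privacy-preserving way to write formal specifications in industrial settings.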
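For readers unfamiliar with STL, a minimal sketch of its standard quantitative (robustness) semantics on a discrete-time signal may help. This is not part of ReasonSTL; the function names and the example signal are illustrative assumptions. A positive robustness value means the formula is satisfied, a negative one means it is violated, and the magnitude says by how much.

```python
def rho_gt(signal, c):
    """Pointwise robustness of the predicate signal[t] > c."""
    return [x - c for x in signal]


def always(rho, a, b):
    """Robustness of G_[a,b] phi at time 0: the worst value in the window."""
    return min(rho[a:b + 1])


def eventually(rho, a, b):
    """Robustness of F_[a,b] phi at time 0: the best value in the window."""
    return max(rho[a:b + 1])


# Hypothetical requirement: "the speed stays above 2 during steps 0 to 3".
speed = [3.0, 2.5, 4.0, 2.2, 1.0]
rho = rho_gt(speed, 2.0)
holds_on_short_window = always(rho, 0, 3) > 0   # satisfied on steps 0..3
holds_on_full_window = always(rho, 0, 4) > 0    # violated once step 4 is included
```

The sketch assumes the window `[a, b]` lies inside the signal; nested temporal operators would need the robustness evaluated at every time step rather than only at time 0.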

2605.06117 2026-05-12 cs.LG

BoostLLM: Boosting-inspired LLM Fine-tuning for Few-shot Tabular Classification

Yi-Siang Wang, Kuan-Yu Chen, Yu-Chen Den, Darby Tien-Hao Chang

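Affiliations SinoPac Holdings

AI summary This paper proposes BoostLLM, a boosting-inspired fine-tuning framework for large language models (LLMs) that targets few-shot tabular classification. It recasts parameter-efficient fine-tuning as a multi-round residual optimization process, training a sequence of PEFT adapters as weak learners, and uses decision-tree paths as a structured input view to strengthen the model's inductive bias for tabular data. Experiments show that BoostLLM outperforms conventional fine-tuning across multiple LLM backbones and datasets, is comparable to XGBoost in few-shot settings, and in some cases surpasses GPT-4o-based models.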

Comments 19 pages, 4 figures
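The "multi-round residual optimization" idea is easiest to see in classical boosting. The sketch below is not BoostLLM: it swaps the PEFT adapters for one-split regression stumps on a toy one-dimensional dataset, and all names (`fit_stump`, `boost`, `lr`) are assumptions made for illustration. Each round fits a weak learner to what the current ensemble still gets wrong and adds a damped copy of it.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals by least squares.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not right:
            continue
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    _, threshold, left_value, right_value = best
    return lambda x: left_value if x <= threshold else right_value


def boost(xs, ys, rounds, lr=0.5):
    """Sequentially fit weak learners to the residuals of the running ensemble."""
    learners = []
    preds = [0.0] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        learners.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: sum(lr * s(x) for s in learners)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 1, 4, 9, 16, 25, 36, 49]
one_round = boost(xs, ys, rounds=1)
many_rounds = boost(xs, ys, rounds=20)
```

With squared loss, each added learner cannot increase the training error, so `many_rounds` fits the training points more closely than `one_round`; whether that carries over to held-out data is a separate question that this toy does not address.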

2605.05831 2026-05-12 cs.CV

Unifying Scientific Communication: Fine-Grained Correspondence Across Scientific Media

Megha Mariam K. M, Vineeth N. Balasubramanian, C. V. Jawahar

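Affiliations IIIT Hyderabad; Microsoft Research India & IIT Hyderabad

AI summary As scientific communication becomes increasingly multimodal, papers, slides, videos, and other materials jointly convey research results, yet there is no structured way to link them. This paper presents the Multimodal Conference Dataset (MCD), described as the first dataset to integrate research papers, talk videos, explainer videos, and slides, and evaluates a range of embedding models and vision-language models on fine-grained cross-format correspondence. Vision-language models are robust overall but fall short on fine-grained alignment; embedding models do well on text-visual correspondence but show pronounced clustering differences on formulas and symbolic content, which points to directions for future work on multimodal scientific understanding.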

Comments Accepted at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings Track, 2026

2605.05736 2026-05-12 cs.AI

SDFlow: Similarity-Driven Flow Matching for Time Series Generation

Wei Li, Shibo Feng, Pengcheng Wu, Xingyu Gao, Min Wu, Peilin Zhao

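Affiliations Shanghai Jiao Tong University; Shanghai University; Nanyang Technological University; Chinese Academy of Sciences; Institute for Infocomm Research, A*STAR

AI summary This paper proposes SDFlow, a non-autoregressive time series generation method that uses similarity-driven flow matching to generate sequences in parallel in a frozen vector-quantized (VQ) latent space, addressing the exposure bias of autoregressive models. It replaces step-by-step prediction with a global transport map, reduces dimensional complexity through a low-rank manifold decomposition, and adds discrete supervision within a variational flow matching framework, thereby overcoming key challenges of non-autoregressive generation. Experiments show state-of-the-art performance on long-sequence generation, with substantially better generation quality and faster inference.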

2605.05072 2026-05-12 cs.CV

Height-Guided Projection Reparameterization for Camera-LiDAR Occupancy

Yuan Wu, Zhiqiang Yan, Jiawei Lian, Zhengxue Wang, Jian Yang

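Affiliations Nanjing University of Science and Technology; National University of Singapore

AI summary This paper studies accurate 3D occupancy prediction from camera and LiDAR data, focusing on the fixed projection-space sampling of conventional methods, which adapts poorly to the height variation and sparsity of real scenes. The authors propose HiPR, a framework that uses height-guided projection reparameterization to dynamically adjust the sampling range for LiDAR point clouds so that projected points fall in geometrically meaningful regions. Experiments show that HiPR significantly outperforms existing state-of-the-art methods while retaining real-time inference.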

2605.05045 2026-05-12 cs.CV cs.CL

When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise

Philip Wootaek Shin, Ajay Narayanan Sridhar, Sivani Devarapalli, Rui Zhang, Jack Sampson, Vijaykrishnan Narayanan

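Affiliations The Pennsylvania State University

AI summary This study analyzes the relation hallucination that vision-language models exhibit under visual perturbations such as rotation and noise, showing that even slight image perturbations significantly degrade a model's reasoning about relations between objects. It evaluates several prompt-based augmentation and preprocessing strategies and finds that they partly mitigate the problem but cannot eliminate relation hallucination. The results indicate a remaining gap between perceptual robustness and relational understanding in current models and a need for more geometry-aware vision-language models.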

2605.04012 2026-05-12 cs.AI

SymptomAI: Toward a Conversational AI Agent for Everyday Symptom Assessment

Joseph Breda, Fadi Yousif, Beszel Hawkins, Marinela Cotoi, Miao Liu, Ray Luo, Po-Hsuan Cameron Chen, Mike Schaekermann, Samuel Schmidgall, Xin Liu, Girish Narayanswamy, Samuel Solomon, Maxwell A. Xu, Xiaoran Fan, Longfei Shangguan, Anran Wang, Bhavna Daryani, Buddy Herkenham, Cara Tan, Mark Malhotra, Shwetak Patel, John B. Hernandez, Quang Duong, Yun Liu, Zach Wasson, Dimitrios Antos, Bob Lou, Matthew Thompson, Jonathan Richina, Anupam Pathak, Nichole Young-Lin, Jake Sunshine, Daniel McDuff

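Affiliations Google Research; Google DeepMind

AI summary This study presents SymptomAI, a conversational AI agent for end-to-end interviewing and differential diagnosis of everyday symptoms. In a large randomized study run through the Fitbit app with 13,917 participants, SymptomAI's differential diagnoses were more accurate than those of independent clinicians given the same dialogues. The study also finds that agents using a dedicated symptom interview strategy perform significantly better than user-guided conversations, and reports strong associations between acute infections and shifts in wearable-measured physiological metrics.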

Comments 13 page main text, 54 pages total. 16 figures total

English abstract

Language models excel at diagnostic assessments on curated medical case-studies and vignettes, performing on par with, or better than, clinical professionals. However, existing studies focus on complex scenarios with rich context making it difficult to draw conclusions about how these systems perform for patients reporting symptoms in everyday life. We deployed SymptomAI, a set of conversational AI agents for end-to-end patient interviewing and differential diagnosis (DDx), via the Fitbit app in a study that randomized participants (N=13,917) to interact with five AI agents. This corpus captures diverse communication and a realistic distribution of illnesses from a real world population. A subset of 1,228 participants reported a clinician-provided diagnosis, and 517 of these were further evaluated by a panel of clinicians during over 250 hours of annotation. SymptomAI DDx were significantly more accurate (OR = 2.56, p < 0.001) than those from independent clinicians given the same dialogue in a blinded randomized comparison. Moreover, agentic strategies which conduct a dedicated symptom interview that elicit additional symptom information before providing a diagnosis, perform substantially better than baseline, user-guided conversations (p < 0.001). An auxiliary analysis on 1,509 conversations from a general US population panel validated that these results generalize beyond wearable device users. We used SymptomAI diagnoses as labels for all 13,917 participants to analyze over 500,000 days of wearable metrics across nearly 400 unique conditions. We identified strong associations between acute infections and physiological shifts (e.g., OR > 7 for influenza). While limited by self-reported ground truth, these results demonstrate the benefits of a dedicated and complete symptom interview compared to a user-guided symptom discussion, which is the default of most consumer LLMs.

2605.03652 2026-05-12 cs.CV cs.AI

AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

Tencent HY Team

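Affiliations Tencent HY Team

AI summary This paper presents AniMatrix, a video generation model designed specifically for anime art styles rather than relying on physical realism as a prior. Using a dual-channel conditioning mechanism and a three-step transition strategy, it redefines the standard of "correctness", removes the conventional dependence on physical laws, and distinguishes artistic expression from generation failure. In evaluations involving professional animators, AniMatrix performs strongly, significantly outperforming existing models in prompt understanding and artistic motion generation.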

Comments 37 pages, 1 main figure (qualitative comparison), 1 TikZ architecture diagram; technical report. Model weights and inference code to be released

2605.03438 2026-05-12 cs.CV

Mantis: Mamba-native Tuning is Efficient for 3D Point Cloud Foundation Models

Zihao Guo, Jihua Zhu, Jian Liu, Ajmal Saeed Mian

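Affiliations Xi’an Jiaotong University; School of Artificial Intelligence and Robotics, Hunan University; University of Western Australia

AI summary This paper proposes Mantis, a parameter-efficient fine-tuning framework for Mamba-based 3D point cloud foundation models. It introduces a State-Aware Adapter (SAA) that performs fine-grained, state-level adaptation with the pretrained backbone frozen, and uses Dual-Serialization Consistency Distillation (DSCD) to reduce the instability caused by serialization. Experiments show that Mantis achieves competitive performance on multiple benchmarks with only about 5% trainable parameters.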

2605.02948 2026-05-12 cs.LG cs.AI cs.SD

AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation

Yuxin Lu, Jiayang Sun, Guibo Zhu, Min Cao

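Affiliations Soochow University; Institute of Automation, Chinese Academy of Sciences

AI summary AsymTalker is a diffusion-based method for long-term talking head generation that addresses the identity inconsistency and spatiotemporal alignment problems of existing methods on long videos. It introduces Temporal Reference Encoding (TRE) to mitigate the spatiotemporal misalignment between a static identity reference and the dynamic audio stream, and Asymmetric Knowledge Distillation (AKD) to counter identity drift during chunk-wise generation. Experiments show that AsymTalker generates videos up to 600 seconds long with high fidelity and identity consistency, runs in real time at 66 frames per second, and achieves state-of-the-art performance.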

2605.02751 2026-05-12 cs.AI cs.CL

Mitigating Misalignment Contagion by Steering with Implicit Traits

Maria Chang, Ronny Luss, Miao Liu, Keerthiram Murugesan, Karthikeyan Ramamurthy, Djallel Bouneffouf

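Affiliations IBM Research Yorktown Heights

AI summary In multi-agent settings it is essential that language models (LMs) follow instructions and stay value-aligned, but existing work focuses mostly on aligning a single model with its user and overlooks the spread of misalignment that can arise when multiple models interact. Through multi-turn social dilemma games, this paper finds that models can become more antisocial during interaction, and that the effect intensifies when other models are steered toward malicious behavior. To mitigate this, the authors propose steering with implicit traits: intermittently injecting system prompts that reinforce the model's initial prosocial behavior, which curbs misalignment contagion without access to model parameters or internal states and therefore suits multi-agent applications built on black-box models.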

2605.02487 2026-05-12 cs.RO

Visibility-Aware Mobile Grasping in Dynamic Environments

Tianrun Hu, Anxing Xiao, David Hsu, Hanbo Zhang

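Affiliations School of Computing, National University of Singapore; Smart Systems Institute, National University of Singapore

AI summary This paper studies mobile grasping in dynamic, unknown environments, focusing on coordinating visual perception and body motion under a limited field of view. It proposes a unified mobile grasping system comprising a behavior-tree-based hierarchical planner and a whole-body motion planner with active perception, which navigates safely and completes grasps in the presence of dynamic obstacles. Experiments show success rates of 68.8% and 58.0% in static and dynamic unknown environments respectively, significantly higher than existing methods.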

2605.01402 2026-05-12 cs.CL cs.CV cs.LG

Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression

Yao Du, Shanshan Song, Xiaomeng Li

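Affiliations The Hong Kong University of Science and Technology

AI summary Multimodal large language models (MLLMs) perform poorly on numerical regression with long-tailed distributions: existing token-based supervised fine-tuning is biased toward high-density regions, causing regression toward the mean and degraded tail performance. This paper proposes a distribution-aware reinforcement learning framework built on Group Relative Policy Optimization, with a reward based on the concordance correlation coefficient that provides cross-sample comparative supervision at the batch level, aligning predictions with the true distribution in correlation, scale, and mean. The method requires no changes to the model architecture, and experiments show it outperforms conventional fine-tuning on several long-tailed regression benchmarks, especially in medium-shot and few-shot regimes.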

Comments Accepted by ICML 2026
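The concordance correlation coefficient (CCC) mentioned in the summary has a standard closed form, `2 * cov(p, y) / (var(p) + var(y) + (mean(p) - mean(y))^2)`, which penalizes errors in correlation, scale, and mean at once. The sketch below only computes that statistic over a batch; how the paper turns it into a reward is not described here, and the function name and the zero-denominator convention are assumptions.

```python
def ccc(preds, targets):
    """Concordance correlation coefficient between two equal-length sequences."""
    n = len(preds)
    mean_p = sum(preds) / n
    mean_t = sum(targets) / n
    var_p = sum((p - mean_p) ** 2 for p in preds) / n
    var_t = sum((t - mean_t) ** 2 for t in targets) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(preds, targets)) / n
    denom = var_p + var_t + (mean_p - mean_t) ** 2
    # Assumed convention: return 0.0 when both sequences are the same constant.
    return 2 * cov / denom if denom > 0 else 0.0


targets = [1.0, 2.0, 3.0, 4.0]
perfect = ccc(targets, targets)                    # exact agreement
shifted = ccc([2.0, 3.0, 4.0, 5.0], targets)       # right shape, wrong mean
collapsed = ccc([2.5, 2.5, 2.5, 2.5], targets)     # every prediction at the mean
```

The collapsed case is the one that matters for imbalanced regression: predicting the batch mean everywhere gives a small squared error on dense regions but a CCC of zero, so a CCC-based signal does not reward regression toward the mean.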

2605.00642 2026-05-12 cs.AI cs.CV

Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding

Yan Zhang, Daiqing Wu, Huawen Shen, Can Ma, Yu Zhou

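Affiliations Institute of Information Engineering, Chinese Academy of Sciences; VCIP & TMCC & DISSec, College of Computer Science, Nankai University; School of Cyber Security, University of Chinese Academy of Sciences

AI summary This paper proposes GUI-SD, described as the first on-policy self-distillation (OPSD) framework for GUI grounding, aimed at the training inefficiency and sample sparsity of existing reinforcement learning methods. By constructing visually enriched privileged context and introducing an entropy-guided distillation strategy, it provides dense supervision within a single interaction, improving grounding accuracy and training efficiency. Experiments show that GUI-SD outperforms existing methods on six representative benchmarks.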

Comments under review

2605.00623 2026-05-12 cs.RO

Recovering Hidden Reward in Diffusion-Based Policies

Yanbiao Ji, Qiuchang Li, Yuting Hu, Shaokai Wu, Wenyuan Xie, Guodong Zhang, Qicheng He, Deyi Ji, Yue Ding, Hongtao Lu

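Affiliations Shanghai Jiao Tong University

AI summary This paper proposes EnergyFlow, a framework that unifies generative action modeling and inverse reinforcement learning by parameterizing a scalar energy function whose gradient is the denoising field. Under maximum-entropy optimality, the score function learned by denoising score matching recovers the gradient of the expert's soft Q-function, enabling reward extraction without adversarial training. Experiments show that EnergyFlow achieves leading imitation performance on a variety of manipulation tasks and provides effective reward signals for downstream reinforcement learning, outperforming adversarial inverse reinforcement learning and likelihood-based methods.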

Comments Accepted by ICML 2026

2605.00548 2026-05-12 cs.CV cs.GR

Colorful-Noise: Training-Free Low-Frequency Noise Manipulation for Color-Based Conditional Image Generation

Nadav Z. Cohen, Ofir Abramovich, Ariel Shamir

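Affiliations Reichman University

AI summary This paper studies the input noise of diffusion models and finds that the low-frequency components of the white noise largely determine an image's global structure and color composition, while the high-frequency components control detail. Based on this, the authors propose a training-free low-frequency noise manipulation method that steers generation through simple operations on the low-frequency noise, giving effective control over overall structure and color while preserving output diversity.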

Comments SIGGRAPH 2026 Conference Paper. Project Page at: https://nadavc220.github.io/colorful-noise/

2605.00408 2026-05-12 cs.CV

Beyond Heuristics: Learnable Density Control for 3D Gaussian Splatting

Zhenhua Ning, Xin Li, Jun Yu, Guangming Lu, Yaowei Wang, Wenjie Pei

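Affiliations Pengcheng Laboratory, Shenzhen; Harbin Institute of Technology, Shenzhen

AI summary This paper proposes LeGS, a learnable density control method for 3D Gaussian Splatting (3DGS) that removes its reliance on heuristic density control rules. It models density control as a parameterized policy network optimized with reinforcement learning and designs a reward function based on sensitivity analysis that precisely quantifies each Gaussian's contribution to reconstruction quality. Experiments show that LeGS significantly outperforms existing methods on multiple datasets, with a better balance between reconstruction quality and computational efficiency.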

Comments 9 pages, 5 figures

2604.27224 2026-05-12 cs.RO

Learning Tactile-Aware Quadrupedal Loco-Manipulation Policies

Pokuang Zhou, Yuhao Zhou, Quan Khanh Luu, Seungho Han, Heng Zhang, Binghao Huang, Yunzhu Li, Arash Ajoudani, Zhengtong Xu, Yu She

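Affiliations Purdue University; Istituto Italiano di Tecnologia; Columbia University

AI summary This paper studies how tactile sensing can improve quadrupedal locomotion and manipulation in contact-rich environments. The authors propose a hierarchical tactile-aware policy learning framework: a high-level visuo-tactile policy trained on real human demonstrations and a low-level tactile-aware whole-body control policy trained with large-scale reinforcement learning in simulation, achieving zero-shot sim-to-real transfer. Experiments show an average improvement of 28.54% over vision-only or visuo-tactile baselines across several contact-rich tasks.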

2604.26805 2026-05-12 cs.AI cs.MA

Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations

Bochao Liu, Zhipeng Qian, Yang Zhao, Xinyuan Jiang, Zihan Liang, Yufei Ma, Junpeng Zhuang, Ben Chen, Shuo Yang, Hongen Wan, Yao Wu, Chenyi Lei, Xiao Liang

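Affiliations Kuaishou Technology

AI summary This paper presents Bian Que, an agentic operations framework for improving the operations and maintenance (O&M) efficiency of large-scale online systems such as search, recommendation, and advertising. Through a unified operational paradigm and flexible Skill arrangement, it matches each operational event with the relevant data and knowledge, addressing the signal overload and manual configuration burden of conventional approaches. Its contributions are a unified operational paradigm, automated Skill generation and refinement, and a self-evolving mechanism; deployed on Kuaishou's e-commerce search engine, it substantially improves O&M efficiency and accuracy.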

Comments HomePage: https://benchen4395.github.io

English abstract

Operating and maintaining (O&M) large-scale online engine systems (e.g., search, recommendation and advertising) demands substantial human effort for release monitoring, alert response, and root cause analysis. Despite the inherent suitability of LLM-based agents for such operational scenarios, the critical bottleneck impeding their practical deployment lies not in reasoning, but in orchestration capability - specifically, the precise selection of relevant data (encompassing metrics, logs, and change events) and applicable knowledge (including handbook-defined rules and empirically derived practitioner experience) tailored to each individual operational event. Feeding all signals indiscriminately causes dilution and hallucination, while manually curating the event-to-(data, knowledge) mapping is intractable under dozens of daily releases. Here we present Bian Que, an agentic operating framework with three contributions: (i) The unified operational paradigm, which abstracts routine daily O&M actions into three canonical patterns: release interception, proactive inspection, and alert root cause analysis; (ii) The flexible Skill Arrangement, in which each predefined Skill explicitly defines the requisite data and operational knowledge for each specific context. Such Skills can be automatically generated and updated by LLM agents, and can also be iteratively optimized by on-call engineers via natural language instructions. (iii) The unified self-evolving mechanism, where each correction signal enables two parallel evolutionary pathways: distilling event memory into knowledge, and targeted refinement of Skills. Deployed on the e-commerce search engine of KuaiShou, Bian Que reduces alert volume by 75%, achieves 80% root-cause analysis accuracy, cuts mean time to resolution by over 50%, and attains a 99.0% pass rate on offline evaluations. Codes are at https://github.com/benchen4395/BianQue_Assistant.

2604.24954 2026-05-12 cs.LG cs.AI cs.CV

Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence

NVIDIA: Amala Sanjay Deshmukh, Kateryna Chumachenko, Tuomas Rintamaki, Matthieu Le, Tyler Poon, Danial Mohseni Taheri, Ilia Karmanov, Guilin Liu, Jarno Seppanen, Arushi Goel, Mike Ranzinger, Greg Heinrich, Guo Chen, Lukas Voegtle, Philipp Fischer, Timo Roman, Karan Sapra, Collin McCarthy, Shaokun Zhang, Fuxiao Liu, Hanrong Ye, Yi Dong, Mingjie Liu, Yifan Peng, Piotr Zelasko, Zhehuai Chen, Nithin Rao Koluguri, Nune Tadevosyan, Lilit Grigoryan, Ehsan Hosseini Asl, Pritam Biswas, Leili Tavabi, Yuanhang Su, Zhiding Yu, Peter Jin, Alexandre Milesi, Netanel Haber, Yao Xu, Sarah Amiraslani, Nabin Mulepati, Eric Tramel, Jaehun Jung, Ximing Lu, Brandon Cui, Jin Xu, Zhiqi Li, Shihao Wang, Yuanguo Kuang, Shaokun Zhang, Huck Yang, Boyi Li, Hongxu Yin, Song Han, Bilal Kartal, Pavlo Molchanov, Adi Renduchintala, Charles Wang, David Mosallanezhad, Soumye Singhal, Luis Vega, Katherine Cheung, Sreyan Ghosh, Yian Zhang, Alexander Bukharin, Venkat Srinivasan, Johnny Greco, Andre Manoel, Maarten Van Segbroeck, Suseella Panguliri, Rohit Watve, Divyanshu Kakwani, Shubham Pachori, Jeffrey Glick, Radha Sri-Tharan, Aileen Zaman, Khanh Nguyen, Shi Chen, Jiaheng Fang, Qing Miao, Wenfei Zhou, Yu Wang, Zaid Pervaiz Bhat, Varun Praveen, Arihant Jain, Ramanathan Arunachalam, Tomasz Kornuta, Ashton Sharabiani, Amy Shen, Wei Huang, Yi-Fu Wu, Ali Roshan Ghias, Huiying Li, Brian Yu, Nima Tajbakhsh, Chen Cui, Wenwen Gao, Li Ding, Terry Kong, Manoj Kilaru, Anahita Bhiwandiwalla, Marek Wawrzos, Daniel Korzekwa, Pablo Ribalta, Grzegorz Chlebus, Besmira Nushi, Ewa Dobrowolska, Maciej Jakub Mikulski, Kunal Dhawan, Steve Huang, Jagadeesh Balam, Yongqiang Wang, Nikolay Karpov, Valentin Mendelev, George Zelenfroynd, Meline Mkrtchyan, Qing Miao, Omri Almog, Bhavesh Pawar, Rameshwar Shivbhakta, Sudeep Sabnis, Ashrton Sharabiani, Negar Habibi, Geethapriya Venkataramani, Pamela Peng, Prerit Rodney, Serge Panev, Richard Mazzarese, Nicky Liu, Michael Fukuyama, Andrii Skliar, Roger Waleffe, Duncan Riach, Yunheng Zou, Jian Hu, Hao Zhang, Binfeng Xu, Yuhao Yang, Zuhair Ahmed, Alexandre Milesi, Carlo del Mundo, Chad Voegele, Zhiyu Cheng, Nave Assaf, Andrii Skliar, Daniel Afrimi, Natan Bagrov, Ran Zilberstein, Ofri Masad, Eugene Khvedchenia, Natan Bagrov, Borys Tymchenko, Tomer Asida, Daniel Afrimi, Parth Mannan, Victor Cui, Michael Evans, Katherine Luna, Jie Lou, Pinky Xu, Guyue Huang, Negar Habibi, Michael Boone, Pradeep Thalasta, Adeola Adesoba, Dina Yared, Christopher Parisien, Leon Derczynski, Shaona Ghosh, Wes Feely, Micah Schaffer, Radha Sri-Tharan, Jeffrey Glick, Barnaby Simkin, George Zelenfroynd, Tomasz Grzegorzek, Rishabh Garg, Aastha Jhunjhunwala, Sergei Kolchenko, Farzan Memarian, Haran Kumar, Shiv Kumar, Isabel Hulseman, Anjali Shah, Kari Briski, Padmavathy Subramanian, Joey Conway, Udi Karpas, Jane Polak Scowcroft, Annie Surla, Shilpa Ammireddy, Ellie Evans, Jesse Oliver, Tom Balough, Chia-Chih Chen, Sandip Bhaskar, Alejandra Rico, Bardiya Sadeghi, Seph Mard, Katherine Cheung, Meredith Price, Laya Sleiman, Saori Kaji, Wesley Helmholz, Wendy Quan, Michael Lightstone, Jonathan Cohen, Jian Zhang, Oleksii Kuchaiev, Boris Ginsburg, Jan Kautz, Eileen Long, Mohammad Shoeybi, Mostofa Patwary, Oluwatobi Olabiyi, Andrew Tao, Bryan Catanzaro, Udi Karpas

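Affiliations NVIDIA

AI summary This paper introduces Nemotron 3 Nano Omni, the latest model in the Nemotron multimodal series and the first to natively support audio input alongside text, images, and video. It brings improvements in architecture, training data, and training recipes, yielding higher accuracy across modalities, with particular strength in real-world document understanding, long audio-video understanding, and agentic computer use. Built on the efficient Nemotron 3 Nano 30B-A3B architecture, it introduces multimodal token reduction techniques that significantly lower inference latency and raise throughput, and the release includes model weights in several precision formats along with part of the training data and code to support further research.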

2604.20783 2026-05-12 cs.LG

Physics-Conditioned Synthesis of Internal Ice-Layer Thickness for Incomplete Layer Traces

Zesheng Liu, Maryam Rahnemoonfar

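Affiliations Department of Computer Science and Engineering; Lehigh University; Department of Civil and Environmental Engineering

AI summary This study addresses incomplete internal ice-layer traces derived from radar observations by using synchronized features from a physical climate model to synthesize complete ice-layer thickness annotations. The method combines geometric learning with a transformer-based temporal module to aggregate spatial information within each layer and propagate information across layers, producing thicknesses that are structurally consistent and physically plausible. While preserving existing observations, the model recovers missing layer segments and can even fill in entirely missing layers, and it provides effective pretraining supervision for downstream deep-layer prediction models.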

Comments Accepted for 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026)
