arXivDaily: daily arXiv digest, updated Monday to Friday
2605.28264 2026-05-28 cs.AI

Entropy Distribution as a Fingerprint for Hallucinations in Generative Models

Mattia J. Villani, Pranav Deshpande, Akshay Seshadri, Romina Yalovetzky, Niraj Kumar

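Affiliations: Global Technology Applied Research

AI summary: Proposes the Calibrated Entropy Score (CES), which uses the distribution of token-level entropies rather than the mean alone to detect hallucinations with a single forward pass and black-box access to logits, backed by theoretical guarantees and empirical validation.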

English abstract

Large Language Models (LLMs) often generate factually incorrect outputs, commonly termed hallucinations, that undermine trust and limit deployment in high-stakes settings. Existing hallucination detection methods typically require multiple forward passes, or access to model internals. In this work, we provide theoretical background and empirical evidence that the distribution of token-level entropies, beyond the mean captured by perplexity or length-normalised entropy, serves as a fingerprint of hallucination, with distributional shape and tail behaviour carrying independent signal. We formalize hallucination detection as a statistical hypothesis test and propose the Calibrated Entropy Score (CES), a lightweight algorithm requiring only a single forward pass and black-box access to token logits. CES combines the mean signal with the maximum signal of the generated entropy through a calibrated reference CDF, producing scores that are directly comparable across models and tasks. We establish finite-sample calibration guarantees via a novel random-length Dvoretzky--Kiefer--Wolfowitz inequality, and also prove that CES detects hallucinations with probability converging to one exponentially fast in the generation length. Across eight QA benchmarks and ten generator models spanning open-source and API access models, CES achieves the highest detection performance among all single-pass black-box methods while providing formal error guarantees that existing heuristics lack. Remarkably, CES is statistically indistinguishable from multi-sample methods that require far greater computational cost, closing the gap between lightweight and expensive detection and making it suitable for real-time, large-scale deployment.
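A minimal sketch of the idea described in this abstract, under stated assumptions: per-token entropies are computed from logits, and the mean and maximum entropy of a generation are each mapped through an empirical reference CDF built from a calibration set. The function names, the calibration data, and in particular the rule for combining the two calibrated signals (here, simply taking the larger one) are illustrative assumptions; the abstract does not give the actual CES formula.

```python
import math
from bisect import bisect_right


def token_entropies(logits_per_token):
    """Shannon entropy (in nats) of the softmax distribution at each generated token."""
    entropies = []
    for logits in logits_per_token:
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        probs = [e / z for e in exps]
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0.0))
    return entropies


def empirical_cdf(reference_values):
    """Return F(x) = fraction of reference values that are <= x."""
    ordered = sorted(reference_values)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


def entropy_score(logits_per_token, mean_cdf, max_cdf):
    """Hypothetical combination rule: the larger of the two calibrated signals.

    mean_cdf and max_cdf are reference CDFs of mean and max token entropy,
    assumed to come from a calibration set of trusted generations.
    """
    h = token_entropies(logits_per_token)
    mean_h = sum(h) / len(h)
    max_h = max(h)
    return max(mean_cdf(mean_h), max_cdf(max_h))
```

Because both signals pass through a CDF, the resulting score lies in [0, 1] regardless of vocabulary size, which is one plausible reading of why calibrated scores would be comparable across models and tasks.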

2605.28261 2026-05-28 cs.CV

MORI-Seg: Learning Morphological Geometry for Instance Segmentation without Instance Annotations

Leiyue Zhao, Tianyu Shi, Daniel Reisenbuchler, Xinzi He, Junchao Zhu, Tianyuan Yao, Yuechen Yang, Yanfan Zhu, Junlin Guo, Gelei Xu, Haichun Yang, Yuankai Huo, Mert R. Sabuncu, Yihe Yang, Ruining Deng

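Affiliations: Southern University of Science and Technology; Sichuan University; University of Regensburg; Cornell University; Vanderbilt University; University of Notre Dame; Vanderbilt University Medical Center; Cornell Tech; Weill Medical College of Cornell University

AI summary: Proposes MORI-Seg, a framework that learns morphology-aware geometric representations (object-centric distance fields and boundary-band representations) from semantic masks, together with a class-conditioned feature disentanglement module, to perform end-to-end instance segmentation under semantic-only supervision and improve instance separation in crowded and adherent regions.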

English abstract

Instance-level quantification of kidney functional units is essential for morphometric analysis, yet most publicly available pathology datasets provide only semantic segmentation annotations, where adjacent structures of the same class are merged into single regions. This prevents reliable instance-level analysis and limits downstream quantitative studies. Existing heuristic post-processing methods often yield suboptimal instance separation, particularly in crowded and adherent regions, while deep learning-based instance segmentation approaches typically require intensive instance-level annotations that are costly and labor-intensive to obtain. We propose MORI-Seg, a deep learning framework that enables instance segmentation without requiring instance-level annotations. Instead of heuristic splitting or instance supervision, MORI-Seg learns morphology-aware geometric representations directly from semantic masks by jointly modeling object-centric distance fields and boundary-band representations to encode interior structure and contact interfaces. A class-conditioned feature disentanglement module further promotes intra-instance coherence and inter-instance separation. Under semantic-only supervision, MORI-Seg decomposes connected semantic regions into distinct instance masks in an end-to-end manner. Experiments demonstrate improved instance separation accuracy and more reliable morphometric quantification compared with classical post-processing pipelines and representative semantic-to-instance learning approaches. The official implementation is publicly available at https://github.com/ddrrnn123/MORI-Seg.

2605.28257 2026-05-28 cs.CV

Category-Level 3D Correspondence in Camera Space via Morphable Object Priors

Leonhard Sommer, Artur Jesslen, Basavaraj Sunagad, Adam Kortylewski

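Affiliations: University of Freiburg, Germany; CISPA Helmholtz Center for Information Security, Germany

AI summary: Learns a shared morphable object prior to predict, from a single image, 3D locations that are consistent across instances within a category, without explicit correspondence supervision, and sets a new state of the art on the new HouseCorr3D benchmark.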

Comments 14 pages, 4 figures. Data and code are publicly available at https://github.com/GenIntel/HouseCorr3D

English abstract

Understanding 3D objects from images is fundamental to robotics and AR/VR applications. While recent work has made progress in category-level pose estimation, current representations fail to capture the fine-grained semantics needed for reasoning about object parts, functions, and interactions. In this work, we study category-level 3D correspondence in camera space -- predicting, from a single image, 3D locations that remain consistent across instances within a category -- and show that it can emerge without explicit correspondence supervision by learning a shared morphable object prior. To enable research in this direction, we introduce HouseCorr3D, the first large-scale benchmark for monocular category-level 3D correspondence with 178k images across 50 household object categories, 280 unique instances, and 3D keypoint annotations directly on CAD models. Crucially, HouseCorr3D provides amodal correspondence labels for occluded regions and explicit symmetry annotations, addressing key limitations of existing datasets. We further propose Morpheus, a method that learns morphable category-level shape priors by disentangling canonical shape, deformation, and object pose. Through this shared canonical grounding, semantically meaningful 3D correspondences in camera space emerge implicitly. These emerging 3D correspondences set a new state of the art on HouseCorr3D, demonstrating that semantic 3D object understanding can arise without direct correspondence supervision. Data and code are publicly available at https://github.com/GenIntel/HouseCorr3D.

2605.28255 2026-05-28 cs.AI cs.CL cs.HC

AI, Take the Wheel: What Drives Delegation and Trust in Human-Computer Cooperative Question Answering?

Maharshi Gor, Yoo Yeon Sung, Yu Hou, Eve Fleisig, Irene Ying, Tianyi Zhou, Jordan Boyd-Graber

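Affiliations: University of Maryland; University of California; MBZUAI

AI summary: Uses a question-answering game to study when and why humans delegate to AI or adopt its suggestions; finds that humans under-rely on correct AI suggestions (3.9%) and over-rely on incorrect ones (1.7%), and are affected by confirmation bias; recommends calibrated confidence, evidence-grounded explanations, and trust-refinement mechanisms to improve human-AI collaboration.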

Comments Findings of the Association for Computational Linguistics, 2026

English abstract

AI systems are fallible, and humans can make mistakes in deciding whether to trust AI over their own judgment. Thus, improving human-AI collaboration requires understanding when, why, and how humans decide to rely on AI. We study two distinct reliance decisions: the delegation choice -- deciding when to let AI act autonomously without knowing its output, and the adoption choice -- evaluating AI suggestions and deciding how to use them. Both of these decoupled reliance patterns shape collaboration, but prior work rarely studies them together in realistic settings with the same users. We address this gap by studying collaborative human--AI teams competing in a question-answering game in which humans can choose when and how to work with AI agents to win. Our 24 matches pair 23 expert humans with 16 AI agents, capturing 387 delegation and 1440 adoption decisions. While human--AI collaboration performs better than either AI or humans alone, humans make suboptimal collaboration decisions, both under-relying on correct AI suggestions (3.9% of opportunities missed) and over-relying when AI misleads them (1.7%). Both parties contribute wrong answers: reported model confidence is near chance when humans and AI disagree, while confirmation bias drives higher under-reliance (64.5%) when an AI suggestion agrees with humans' initial incorrect answer. To close this gap, we recommend calibrated confidence, evidence-grounded explanations, and mechanisms that help users refine trust.

2605.28253 2026-05-28 cs.CL cs.DB cs.HC

Building Community-Centred NLP Resources for Puno Quechua

Elwin Huaman, Adrian Gamarra Lafuente, Johanna Cordova, Anna Korhonen

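Affiliations: University of Cambridge (UK); Stanford University (USA); ERTIM - Inalco (France)

AI summary: Collects 66 hours of speech data through participatory design, fine-tunes Whisper-base and other models, establishes the first ASR benchmark for Puno Quechua, and openly releases all resources.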

Comments Sixth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP 2026), co-located with ACL 2026

English abstract

The preservation of under-resourced languages requires digital tools and resources shaped by and for their speakers. We present the first dedicated ASR resources for Puno Quechua (ISO 639-3: qxp): (1) the largest speech corpus for any single Quechua variety, consisting of 66 hours of recordings of scripted and spontaneous speech (including 36 hours of manually transcribed and validated data), collected via a participatory design campaign; (2) the first systematic ASR benchmark for Puno Quechua, evaluating state-of-the-art models and fine-tuning Whisper-base, wav2vec2-base, and XLS-R-300M, with and without continued pre-training (CPT); (3) an open release of all datasets and fine-tuned models.

2605.28247 2026-05-28 cs.LG cs.AI

IRDS: Interpretable RLVR Data Selection via Verifier-Coupled Sparse Autoencoder Coverage

Yuhan Li, Mingxu Zhang, Dazhong Shen, Ying Sun

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Nanjing University of Aeronautics and Astronautics; The 63rd Research Institute, National University of Defense Technology, Nanjing

AI summary: Proposes IRDS, which uses sparse autoencoder clusters and a verifier-coupled coverage objective to select RLVR training instances that the model fails on but can still learn from, improving math reasoning accuracy while reducing compute cost.

Comments 24 pages, 3 figures, 18 tables

English abstract

Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing LLM reasoning, yet its data inefficiency remains a major bottleneck. Existing methods address this problem only partially, each missing at least one of subset-level coverage, verifier signal use, or interpretability. To address this gap, we present IRDS (Interpretable RLVR Data Selection), which selects RLVR training instances on a sparse autoencoder (SAE) cluster basis so the selection itself is auditable on recognizable problem motifs. To select instances the model both fails on and can still learn from, we introduce a verifier-coupled coverage objective on the SAE basis and solve it by greedy log-determinant maximization. Experiments on three instruction-tuned models and six math reasoning benchmarks show that IRDS achieves the highest overall accuracy, exceeding the strongest baseline by +3.9/+4.0 pp on the two Qwen models and by +0.5 pp on Llama-3.1-8B, while running an order of magnitude cheaper than the trajectory-based baseline.
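The greedy log-determinant maximization mentioned in this abstract can be sketched as follows. This is a generic greedy routine with incremental Cholesky-style updates, not the authors' implementation. The `weighted_kernel` helper, which scales feature similarities by per-instance weights standing in for a verifier signal, is a hypothetical construction used only to make the example concrete.

```python
import math


def greedy_logdet(kernel, k):
    """Greedily pick up to k indices that maximise log det of the kernel submatrix."""
    n = len(kernel)
    # residual[i] is the determinant ratio gained by adding item i to the current set
    residual = [kernel[i][i] for i in range(n)]
    coeffs = [[] for _ in range(n)]  # incremental Cholesky rows
    selected = []
    for _ in range(min(k, n)):
        candidates = [i for i in range(n) if i not in selected]
        j = max(candidates, key=lambda i: residual[i])
        if residual[j] <= 1e-12:  # no remaining item adds volume
            break
        selected.append(j)
        root = math.sqrt(residual[j])
        for i in candidates:
            if i == j:
                continue
            dot = sum(a * b for a, b in zip(coeffs[j], coeffs[i]))
            e = (kernel[j][i] - dot) / root
            coeffs[i].append(e)
            residual[i] -= e * e
    return selected


def weighted_kernel(features, weights):
    """K[i][j] = w_i * <f_i, f_j> * w_j, with w as hypothetical verifier-derived weights."""
    n = len(features)
    return [
        [
            weights[i] * weights[j] * sum(a * b for a, b in zip(features[i], features[j]))
            for j in range(n)
        ]
        for i in range(n)
    ]
```

With uniform weights the routine favours mutually dissimilar items (coverage); raising an item's weight pulls it into the selection earlier, which is one simple way a per-instance signal could be coupled to a coverage objective.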

2605.28241 2026-05-28 cs.CV

PointQ-Bench: Benchmarking Diagnostic and Interpretable Point Cloud Quality Assessment

Duanchu Wang, Cheng Li, Junjie Yang, Jing Huang, Zihang Cheng, Zhi Gao, ZhuBohong, Di Wang

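Affiliations: Xi’an Jiaotong University; Xidian University; University of Chinese Academy of Sciences; Ningxia University

AI summary: Proposes the PointQ-Bench benchmark, which extends point cloud quality assessment from scalar scoring to comprehensive quality understanding through anomaly sensing, defect diagnosis, usability grading, and open-ended quality reporting tasks, and reveals a gap between perception and diagnosis in current models.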

English abstract

Point cloud quality plays a critical role in 3D acquisition, reconstruction, rendering, and perception, yet existing point cloud quality assessment (PCQA) research remains largely centered on scalar score prediction. In practical inspection scenarios, quality assessment often involves identifying defects, characterizing dominant issue types, assessing downstream usability, and providing evidence-supported descriptions, which are not explicitly evaluated by current benchmarks. We introduce PointQ-Bench, a benchmark designed to extend PCQA from scalar scoring toward comprehensive quality understanding. PointQ-Bench consists of 3,083 point clouds spanning authentic scans, simulated distortions, and AI-generated content, covering eight major issue types. Each sample is annotated with mean opinion scores (MOS), quality levels, issue tags, expert-grounded descriptions, and 12,332 question-answer pairs. The benchmark supports three perception-oriented tasks: anomaly sensing, defect diagnosis, and usability grading, as well as a cognition-oriented task of open-ended quality reporting. To evaluate free-form quality descriptions, we further propose SSFRQ-5D, a five-dimensional evaluation protocol validated through human-AI agreement analysis. Extensive experiments on 14 vision-language models and traditional PCQA baselines reveal a consistent perception-diagnosis gap: while current models exhibit emerging abilities in coarse defect perception, they struggle with grounded diagnosis and quality calibration. Strong 2D MLLMs generally outperform existing 3D VLMs, and the benefit of additional views or point-level inputs is non-uniform, varying across tasks, data sources, and models, particularly under boundary-ambiguous conditions. Overall, PointQ-Bench provides a diagnostic testbed for advancing reliable and interpretable point cloud quality understanding.

2605.28239 2026-05-28 cs.CV

Learning to Label: A Reinforced Self-Evolving Framework for Semi-supervised Referring Expression Segmentation

Runlong Cao, Ying Zang, Chuanwei Zhou, Tianrun Chen, Tong Zhang, Zhen Cui, Chunyan Xu

Affiliations: School of Computer Science and Engineering, Nanjing University of Science and Technology; School of Information Engineering, Huzhou Normal University; School of Artificial Intelligence, Nanjing University of Posts and Telecommunications; National Key Laboratory of Tibetan Language Intelligence; Zhejiang University; Beijing Normal University

AI summary: Proposes the L2L framework, which uses reinforcement learning to turn pseudo-label construction into a learnable decision process and extracts priors with a multimodal large language model, enabling joint optimization for semi-supervised referring expression segmentation.

Comments 24 pages, 13 figures

English abstract

Semi-supervised referring expression segmentation (SS-RES) aims to achieve precise pixel-level language grounding under limited annotation, yet suffers from limited supervision and unreliable pseudo-labels when exploiting unlabeled image-text pairs. In this work, we propose Learning to Label, a reinforced self-evolving framework (L2L) that casts pseudo-label construction as a learnable decision-making process. To build foundational understanding, we leverage a multimodal large language model to extract semantic-spatial priors, which are instantiated as initial soft segmentation proposals and elevated, together with textual cues, into learnable guidance signals that condition a hierarchical segmentation network. To ensure stable learning, reinforced pseudo-label selection is formulated as an exploratory decision process that adaptively rewards high-utility pixel-level supervision based on multimodal priors and model predictions. This reinforced self-evolving loop enables joint optimization of the segmentation model and pseudo-labels, progressively enhancing label reliability under sparse supervision. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg demonstrate improvements over existing methods, validating its effectiveness and generalization.

2605.28237 2026-05-28 cs.RO cs.CV

POINav: Benchmarking and Enhancing Final-Meters Arrival in Real-World Vision-Language Navigation

Ruiyan Gong, Meisheng Zhang, Yuxiang Zhao, Mingchao Sun, Yanfen Shen, Zedong Chu, Zhining Gu, Wei Guo, Xiaolong Cheng, Qiming Li, Kangning Niu, Yanqing Zhu, Xiaolong Wu, Tianlun Li, Mu Xu

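Affiliations: Amap CV Lab, Alibaba Group

AI summary: Targets the "final-meters" challenge of real-world POI navigation by proposing POINav-Bench, the first closed-loop evaluation benchmark, and a Brain-Action framework combined with 70K real-world signage-entrance pairs for high-fidelity navigation.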

Comments 25 pages, 9 figures

English abstract

Real-world navigation is fundamentally driven by Points of Interest (POIs), yet reaching a precise POI remains a critical "final-meters" challenge. Existing Vision-Language Navigation (VLN) benchmarks for POI-goal navigation often suffer from coarse granularity or significant sim-to-real gaps due to generated scenes. To bridge this gap, we present POINav-Bench, the first benchmark designed for closed-loop evaluation of real-world POI-goal navigation. It comprises 11 commercial areas reconstructed from real-world captures using 3D Gaussian Splatting (3DGS), covering 126,398 m² in total and spanning 163 distinct POIs. With traversability-aware annotations and reference trajectories, POINav-Bench enables high-fidelity evaluation of navigation agents in realistic, POI-rich real-world environments. Building on this, we propose the POINav Brain-Action Framework, in which a Brain module performs POI-grounded reasoning to guide an Action module in predicting continuous waypoints for real-world execution. We further curate the POINav-Dataset, containing 70K real-world signage-entrance pairs. Experiments show that our framework provides a viable path toward refining real-world POI-goal navigation.

2605.28234 2026-05-28 cs.CV

Bridging the Sampling Distribution Shift in Radio Map Estimation: A Trajectory-Aware Paradigm

Feng Qiu, Zheng Fang, Shuhang Zhang, Kangjun Liu, Longkun Zou, Jing Liu, Ke Chen

Affiliations: School of Artificial Intelligence, Xidian University; Pengcheng Laboratory; Department of Computer Science and Engineering, Southern University of Science and Technology; Department of Electronics, Peking University; Guangzhou Institute of Technology, Xidian University

AI summary: Addresses the performance degradation caused by the mismatch between UAV trajectory sampling and random sampling by proposing a trajectory-aware training paradigm based on Stochastic-Triggered Trajectory-Based Sampling, which effectively reduces estimation error.

English abstract

Learning-based radio map estimation (RME) plays a critical role in UAV-assisted wireless sensing, enabling tasks such as coverage prediction and network optimization. Most current methods assume an independently and identically distributed (i.i.d.) training and testing setting based on random sampling. However, practical UAV measurements are collected sequentially along feasible trajectories, resulting in highly structured and spatially correlated patterns. This mismatch introduces a sampling distribution shift that increases the intrinsic difficulty of spatial field recovery and compromises the generalization of models trained under i.i.d. assumptions. To mitigate this issue, we propose a trajectory-aware training paradigm based on Stochastic-Triggered Trajectory-Based Sampling (ST-TBS), which preserves trajectory continuity while introducing sampling variability. Moreover, from a statistical perspective, we show that trajectory-based sampling reduces spatial diversity and increases information redundancy compared to random sampling. Extensive experiments on the RadioMapSeer and SpectrumNet datasets demonstrate that models trained with random sampling suffer significant performance degradation under trajectory-based observations, with RMSE increasing from 0.0391 to 0.2632 on SpectrumNet. Conversely, our proposed ST-TBS method effectively reduces the RMSE to 0.0571. These results highlight the necessity of aligning training and deployment sampling distributions for reliable RME.
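To make the contrast between random and trajectory-based sampling concrete, here is a small illustrative sketch on a 2D grid. It is not the paper's ST-TBS procedure, whose details the abstract does not give: the random-walk motion model, the `trigger_prob` parameter, and both function names are assumptions chosen only to show how samples taken along a continuous path become spatially clustered compared with uniform draws.

```python
import random


def random_sampling(height, width, n, rng):
    """Baseline: n distinct cells drawn uniformly from the whole grid."""
    cells = [(r, c) for r in range(height) for c in range(width)]
    return rng.sample(cells, n)


def trajectory_sampling(height, width, n, trigger_prob, rng):
    """Walk between 4-connected neighbouring cells and record the current cell
    with probability trigger_prob, until n distinct cells are sampled.

    Requires n <= height * width and trigger_prob > 0.
    """
    r, c = rng.randrange(height), rng.randrange(width)
    path, samples = [(r, c)], []
    while len(samples) < n:
        if rng.random() < trigger_prob and (r, c) not in samples:
            samples.append((r, c))
        moves = [
            (r + dr, c + dc)
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= r + dr < height and 0 <= c + dc < width
        ]
        r, c = rng.choice(moves)  # the path stays continuous: one step at a time
        path.append((r, c))
    return path, samples
```

A training pipeline could build its sparse observation masks from `trajectory_sampling` instead of `random_sampling`, so that the sampling pattern seen in training resembles the one seen at deployment.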

2605.28232 2026-05-28 cs.AI

PIRS: Physics-Informed Reward Shaping for SAC-Based Building Energy Management

Shadmehr Zaregarizi, Khashayar Yavari

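Affiliations: Politecnico di Torino

AI summary: Addresses the lack of physical grounding in reward design for deep reinforcement learning by proposing PIRS, which embeds the ISO 7730 PMV formulation in the multi-objective reward for SAC, improving interpretability and performance.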

Comments N pages, 4 figures, 3 tables. Accepted at the 2nd Workshop on AI-Driven Energy Efficiency in Dynamic Systems (AI-DEEDS '26), co-located with ACM e-Energy / ACM Sustainability Week, Banff, AB, Canada, June 22-25, 2026

English abstract

Occupant comfort and grid-aware energy efficiency are competing objectives whose joint optimization depends critically on how reward functions are specified in deep reinforcement learning (DRL) controllers for buildings. Yet reward design remains largely ad hoc: comfort terms are either hand-tuned heuristics or simple temperature-deviation proxies without explicit grounding in thermal-comfort physics. We present PIRS (Physics-Informed Reward Shaping), which replaces these ad-hoc comfort proxies with the ISO 7730 Predicted Mean Vote (PMV) formulation inside a weighted multi-objective reward for Soft Actor-Critic (SAC). By anchoring the comfort signal in the ISO 7730 PMV formulation, PIRS improves reward interpretability and provides a standards-grounded comfort proxy without changing any other component of the learning pipeline. We evaluate PIRS in CityLearn v2.1.2 (challenge 2022 phase 1) with a central SAC agent trained for 50k steps over five random seeds, and compare against a rule-based controller (RBC), a manually engineered reward (E2), an energy-only reward (E3), and a naive temperature-deviation comfort reward (E4). District-level key performance indicators (KPIs), reported as ratios versus RBC, show that PIRS attains cost, carbon, and electricity metrics on par with the manual baseline while substantially outperforming non-physics-grounded designs -- particularly on load ramping (1.78x vs. ~2.4x RBC) and daily peak demand. All DRL policies remain above RBC at this training budget; we interpret this gap honestly and position PIRS as an interpretable, standards-aligned foundation for reward design rather than a claim of dominance over classical control at limited compute.
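A weighted multi-objective reward with a PMV-based comfort term might look roughly like the sketch below. The PMV value is taken as an input (computing it from ISO 7730 is outside this example), and the comfort band, the weights, the choice of energy and carbon terms, and both function names are assumptions for illustration rather than the reward used in the paper.

```python
def comfort_penalty(pmv, band=0.5):
    """Zero inside the comfort band, growing linearly with |PMV| outside it.

    The band width is a hypothetical parameter of this sketch.
    """
    return max(0.0, abs(pmv) - band)


def shaped_reward(pmv, energy_kwh, carbon_kg, weights=(1.0, 0.5, 0.5)):
    """Negative weighted sum of comfort, energy, and carbon costs (higher is better)."""
    w_comfort, w_energy, w_carbon = weights
    return -(
        w_comfort * comfort_penalty(pmv)
        + w_energy * energy_kwh
        + w_carbon * carbon_kg
    )
```

Swapping a temperature-deviation term for `comfort_penalty` leaves the rest of the reward untouched, which matches the abstract's point that only the comfort signal changes while the learning pipeline stays the same.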

2605.28231 2026-05-28 cs.RO cs.LG

ProgVLA: Progress-Aware Robot Manipulation Skill Learning

Seungsu Kim, Jinyoung Choi, Seungmin Baek, Jean-Michel Renders

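Affiliations: NAVER LABS; NAVER LABS Europe

AI summary: Proposes ProgVLA, a compact vision-language-action model that uses an explicit representation of task progress and a two-stage Perceiver resampling scheme to process long multi-modal sequences under limited compute and memory, matching or exceeding larger models on multi-task manipulation benchmarks.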

English abstract

We present ProgVLA, a compact vision-language-action (VLA) model designed for reliable robot manipulation under tight compute and memory budgets. The model specifically focuses on efficiently processing long multi-modal sequences by maintaining an explicit representation of task progress over extended horizons. To this end, ProgVLA integrates two key components. First, a multi-modal encoder with a two-stage Perceiver resampling scheme compresses variable-length visual, language, and proprioceptive streams into a fixed set of control-ready context tokens, substantially reducing sequence length while preserving cross-modal grounding. Second, an auxiliary set of progress heads is trained with offline reinforcement learning (RL) objectives to jointly learn critics over normalized remaining-horizon targets. This provides the policy with an internal estimate of task progress and enables advantage- and success-weighted flow-matching imitation learning. On two well-established multi-task robot manipulation benchmarks, a 0.1B-parameter ProgVLA model reaches success rates that are competitive with, and on long-horizon and harder task tiers exceed, substantially larger pretrained baselines. Ablations indicate that the learned context resampler and task-adaptive visual fine-tuning are the largest single contributors, while progress-aware training provides a consistent additional gain that is concentrated on long-horizon and multi-object tasks. We further validate the approach in real-world toy-kitchen environments.

2605.28230 2026-05-28 cs.CV

Proprio: Latent Self-Scoring and Inference-Time Refinement for Physically Plausible Video Generation

Mariam Hassan, Kaouther Messaoud, Wuyang Li, Alexandre Alahi

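Affiliations: École Polytechnique Fédérale de Lausanne; Télécom Paris

AI summary: Proposes Proprio, a training-free framework that treats the model's flow residual under latent perturbations as a self-scoring signal and combines best-of-N search with gradient-based self-refinement to improve the physical plausibility of a frozen video generator's outputs.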

English abstract

Modern video generative models produce visually impressive results, yet frequently violate basic physical principles. We propose Proprio, a training-free framework that enables a frozen video generator to assess and improve the physical plausibility of its own outputs. Inspired by proprioception, the biological sense of one's own movement, Proprio treats the model's flow residual under controlled latent perturbations as a self-scoring signal. Samples that are better explained by the generator's learned dynamics induce smaller and more stable residuals. We aggregate this signal across timesteps and perturbations, focus it on motion-relevant regions with a dynamic spatiotemporal mask, and use it for best-of-N search, gradient-based self-refinement, or both. Across text-to-video and image-to-video benchmarks, Proprio consistently improves physical plausibility, outperforming VLM-based scoring, and external world-model baselines in several settings. With TurboWan2.2, Proprio improves Physics-IQ from 32.2 to 37.5 (+16.5%) and VideoPhy2-hard physical commonsense from 45.6 to 55.0 (+20.6%). Human evaluation further shows that raters prefer Proprio-selected or refined videos for physical plausibility in roughly two-thirds of comparisons. These results suggest that frozen video generators contain actionable internal signals for evaluating and improving the physical plausibility of their own outputs.
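The best-of-N use of a self-scoring signal described above can be sketched in a few lines. The sketch assumes residual magnitudes have already been computed elsewhere and are supplied as nested lists; the data layout, the mask-weighted averaging, and the function names are illustrative assumptions, and the stability aspect of the residual mentioned in the abstract is not modelled here.

```python
def aggregate_score(residuals, mask):
    """Mask-weighted mean residual over timesteps and perturbations.

    residuals[t][k][p]: residual magnitude at timestep t, perturbation k, position p.
    mask[p]: weight in [0, 1] emphasising motion-relevant positions (not all zero).
    """
    total_weight = sum(mask)
    per_draw = [
        sum(m * r for m, r in zip(mask, draw)) / total_weight
        for step in residuals
        for draw in step
    ]
    return sum(per_draw) / len(per_draw)


def best_of_n(candidates, score_fn):
    """Return the candidate with the lowest score (smaller residual = more plausible)."""
    return min(candidates, key=score_fn)
```

In use, `score_fn` would wrap `aggregate_score` around whatever routine produces residuals for a candidate video; no generator weights are updated, which is consistent with the training-free framing in the abstract.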

2605.28229 2026-05-28 cs.CV cs.AI

VidPrism: Heterogeneous Mixture of Experts for Image-to-Video Transfer

Rui Lin, Chuanming Wang, Huadong Ma

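Affiliations: State Key Laboratory of Networking and Switching Technology

AI summary: Proposes VidPrism, a heterogeneous temporal Mixture-of-Experts framework that uses functionally specialized experts, content-aware multi-rate sampling, and a dynamic bidirectional fusion mechanism to address expert homogenization in conventional MoE, reaching state-of-the-art performance on video recognition benchmarks.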

Comments CVPR2026 camera ready

详情
AI中文摘要

随着预训练技术的快速发展,适应大规模视觉-语言模型(VLM)进行视频理解(即图像到视频迁移学习)已成为主导范式。为了获得卓越性能,近期进展中采用混合专家(MoE)来增强VLM的时间建模能力是一种有效策略。然而,传统的MoE设计存在专家同质化问题,即所有专家充当相同的通才,从无差异的视频流中低效地学习时空特征。为解决此问题,我们提出VidPrism,一种新颖的异构时间混合专家框架。VidPrism通过部署功能专业化的专家开创了分工机制,每个专家承担从空间理解到时间建模的不同角色。为了适当地为这些专家提供输入,我们引入了一个内容感知的多速率采样模块,动态生成从语义丰富到运动聚焦的表示流,为专家提供专业化输入。此外,一种动态双向融合机制实现了这些路径之间的协同信息交换,从而产生全面的视频表示。在各种视频识别基准上的大量实验表明,VidPrism达到了最先进的性能,并有效促进了专家专业化。我们的源代码可在https://github.com/Lrrrr549/VidPrism.git获取。

英文摘要

With the rapid development of pre-training technologies, adapting large-scale Vision-Language Models (VLMs) for video understanding, i.e., image-to-video transfer learning, has become a dominant paradigm. Among recent advances, employing Mixture-of-Experts (MoE) to enhance VLMs' temporal modeling capabilities has emerged as an effective strategy for achieving superior performance. However, conventional MoE designs suffer from expert homogenization, where all experts act as identical generalists, inefficiently learning spatio-temporal features from undifferentiated video streams. To overcome this problem, we propose VidPrism, a novel heterogeneous temporal Mixture-of-Experts framework. VidPrism pioneers a division of labor by deploying functionally specialized experts, each assuming a role ranging from spatial understanding to temporal modeling. To feed these specialists appropriately, we introduce a content-aware, multi-rate sampling module that dynamically generates streams ranging from semantically rich to motion-focused representations, providing specialized inputs for experts. Furthermore, a dynamic, bidirectional fusion mechanism enables synergistic information exchange between these pathways, leading to a comprehensive video representation. Extensive experiments on various video recognition benchmarks demonstrate that VidPrism achieves state-of-the-art performance and effectively fosters expert specialization. Our source code is available at https://github.com/Lrrrr549/VidPrism.git.

2605.28228 2026-05-28 cs.CL

When Seekers Are Hard to Help: Evaluating Emotional Support Dialogue Systems in Worst-Case Interactions

当求助者难以帮助:评估情感支持对话系统在最坏情况交互中的表现

Jiajie Yang, Yangchun Li, Guanyi Chen, Rui Fan, Xin Bai, Tingting He

发表机构 * Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning(湖北人工智能与智能学习省级重点实验室) National Language Resources Monitoring and Research Center for Network Media(网络媒体语言资源监测与研究中心) School of Computer Science, Central China Normal University(华中师范大学计算机学院) Faculty of Artificial Intelligence in Education, Central China Normal University(华中师范大学教育人工智能学院) School of Chinese Language and Literature, Central China Normal University(华中师范大学中文语言文学学院)

AI总结 本研究通过专家模拟和提出最坏情况评估框架,发现现有情感支持对话系统在面对低参与度、抗拒等困难求助者时性能显著下降,并验证了最坏情况模拟数据可提升模型鲁棒性。

详情
AI中文摘要

情感支持对话系统(ESDS)越来越多地使用大语言模型模拟的求助者进行评估和训练。然而,这类模拟求助者通常表现为合作、平均水平的用户,他们清晰披露、建设性回应并在几轮内接受支持。这可能导致过于乐观的评估,并掩盖ESDS是否能够处理困难的求助互动。在这项工作中,我们研究了在最坏情况交互下的ESDS评估,其中求助者由于低参与度、抗拒、有限的自我披露、情绪波动或僵化的负面解释而难以帮助。我们首先进行了一项专家模拟研究,邀请八位经验丰富的咨询专业人员模拟困难求助者,与现有的中文ESDS互动,提供量表评分,并参与半结构化访谈。基于这项研究,我们推导出最坏情况下的求助者行为,并识别出当前系统的关键局限性。然后,我们提出了一个最坏情况评估框架,包括一个基于LLM的最坏情况求助者模拟器和四个面向最坏情况的指标:深度情感理解、引导性探索、平衡的情感支持以及真实和接地气的支持。评估17个系统后,我们发现几乎所有模型在最坏情况交互下性能都大幅下降。大型通用LLM通常比专门的ESDS更稳健,但即使是最强的模型也难以维持参与度并改善求助者的情绪状态。最后,我们表明最坏情况模拟也可以生成有用的训练数据,提高较小模型的鲁棒性。

英文摘要

Emotional Support Dialogue Systems (ESDSes) are increasingly evaluated and trained with LLM-simulated seekers. However, such simulated seekers often behave as cooperative, average-case users who disclose clearly, respond constructively, and accept support within a few turns. This can lead to overly optimistic evaluation and obscure whether ESDSes can handle difficult help-seeking interactions. In this work, we study ESDS evaluation under worst-case interactions, where seekers are hard to help due to low engagement, resistance, limited self-disclosure, emotional volatility, or rigid negative interpretations. We first conduct an expert simulation study with eight experienced counselling professionals, who simulate difficult seekers, interact with existing Chinese ESDSes, provide scale ratings, and participate in semi-structured interviews. Based on this study, we derive worst-case seeker behaviours and identify key limitations of current systems. We then propose a worst-case evaluation framework consisting of an LLM-based worst-case seeker simulator and four worst-case-oriented metrics: Deep Emotional Understanding, Guided Exploration, Balanced Emotional Support, and Authentic and Grounded Support. Evaluating 17 systems, we find that nearly all models suffer substantial performance drops under worst-case interactions. Large general-purpose LLMs are generally more robust than specialised ESDSes, but even the strongest models struggle to sustain engagement and improve seekers' emotional states. Finally, we show that worst-case simulation can also generate useful training data, improving the robustness of smaller models.

2605.28227 2026-05-28 cs.CL

Why We Need Speech to Evaluate Speech Translation

为什么我们需要语音来评估语音翻译

Maike Züfle, Danni Liu, Vilém Zouhar, Jan Niehues

发表机构 * Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院) ETH Zurich(苏黎世联邦理工学院)

AI总结 本文通过元评估发现现有文本和语音质量估计指标在评估语音翻译中的语音特有信息(如性别一致性和韵律)时均存在不足,并提出SpeechCOMET模型,分析其失败原因,强调需要专用训练数据和真正基于语音的模型。

详情
AI中文摘要

语音翻译模型越来越能够保留语音特定信息(例如,说话者性别、韵律和强调),但评估指标仍然对这些现象视而不见。我们在两个针对性别一致性和韵律的对比数据集上对基于文本和基于语音的质量估计指标进行了元评估,发现两者均存在不足,即使直接访问语音信号也是如此。然后,我们训练了SpeechCOMET,一个带有语音编码器的质量估计模型家族,并评估了一个最先进的SpeechLLM作为评判者。两者在标准质量估计上匹配或超过基于文本的COMET,但都没有一致地评估语音特定现象。我们确定了三个原因:(1)当前编码器未能可靠地保留语音特定特征,(2)模型倾向于忽略语音源信号,以及(3)质量估计训练数据包含的相关示例太少。我们发布了所有模型和代码,并认为进展需要专用的语音特定训练数据和真正基于语音的模型。

英文摘要

Speech translation models are increasingly capable of preserving speech-specific information (e.g., speaker gender, prosody, and emphasis), yet evaluation metrics remain blind to such phenomena. We meta-evaluate both text- and speech-based quality estimation metrics on two contrastive datasets targeting gender agreement and prosody, and find that both fall short, even when given direct access to the speech signal. We then train SpeechCOMET, a family of quality estimation models with speech encoders, and evaluate a state-of-the-art SpeechLLM as a judge. Both match or exceed text-based COMET on standard quality estimation, but neither consistently assesses speech-specific phenomena. We identify three causes: (1) speech-specific features are not reliably preserved in current encoders, (2) models tend to ignore the speech source signal, and (3) quality estimation training data contains too few relevant examples. We release all models and code, and argue that progress requires dedicated speech-specific training data and models that genuinely condition on speech.

2605.28226 2026-05-28 cs.LG

PhAME: Phenotype-Aware Molecular Editing via Latent Diffusion

PhAME: 基于表型感知的潜在扩散分子编辑

Łukasz Janisiów, Sebastian Musiał, Bartosz Zieliński, Dawid Rymarczyk, Tomasz Danel

发表机构 * Faculty of Mathematics and Computer Science, Jagiellonian University(雅盖隆大学数学与计算机科学学院) Doctoral School of Exact and Natural Sciences, Jagiellonian University(雅盖隆大学精确与自然科学博士学院) Jagiellonian Center for Artificial Intelligence, Jagiellonian University(雅盖隆大学雅盖隆人工智能中心) Faculty of Chemistry, Jagiellonian University(雅盖隆大学化学学院) Ardigen SA(Ardigen公司)

AI总结 提出PhAME框架,利用潜在扩散模型在预训练图VAE的潜在空间中进行分子编辑,通过组合无分类器引导机制同时优化表型条件和结构相似性,实现高化学有效性和新颖性的多目标分子优化。

详情
AI中文摘要

小分子药物发现需要同时优化候选分子的众多属性。这些属性可以通过分析高维生物特征(如细胞形态和转录组扰动)来研究,这些特征提供了对潜在生物机制的丰富视角。然而,现有的使用这些特征进行优化的生成方法未能满足两个关键要求:在保持与已知先导物结构接近的同时,提供朝向期望表型特征的精确引导。我们引入了PhAME(表型感知分子编辑),这是一种潜在扩散框架,通过将分子优化重新定义为预训练图基VAE潜在空间中的编辑来克服这一挑战。我们的核心贡献是一种具有两个独立尺度的组合无分类器引导方案,一个用于表型条件,另一个用于与种子结构的相似性,允许从业者控制这两个目标之间的权衡。在包括对接分数优化和多模态表型生成在内的多个基准测试中的实证评估表明,PhAME在保持高化学有效性和新颖性的同时实现了最先进的结果。

英文摘要

Small-molecule drug discovery requires simultaneous optimization of numerous properties of candidate molecules. These properties can be investigated through the analysis of high-dimensional biological signatures, such as cell morphology and transcriptomic perturbations, which provide a rich perspective on the underlying biological mechanisms. However, existing generative methods, which use those signatures for optimization, fail to meet two key requirements: providing precise guidance toward desired phenotypic signatures while maintaining structural proximity to a known hit. We introduce PhAME (Phenotype-Aware Molecular Editing), a latent diffusion framework that overcomes this challenge by recasting molecular optimization as editing in the latent space of a pretrained graph-based VAE. Our central contribution is a compositional classifier-free guidance scheme with two independent scales, one for the phenotype-conditioning and one for similarity to the seed structure, allowing practitioners to control the tradeoff between these two objectives. Empirical evaluations across diverse benchmarks, including docking score optimization and multimodal phenotypic generation, demonstrate that PhAME achieves state-of-the-art results while maintaining high chemical validity and novelty.
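The abstract does not state the guidance formula, so the following is only a minimal sketch of how a classifier-free guidance rule with two independent scales is commonly composed: each condition contributes its own offset from the unconditional prediction, weighted by its own scale. The function name, argument names, and the additive form are assumptions for illustration and are not taken from PhAME.

```python
def composed_guidance(eps_uncond, eps_pheno, eps_seed, w_pheno, w_seed):
    """Combine three noise predictions for one latent vector.

    eps_uncond: prediction with no conditioning
    eps_pheno:  prediction conditioned on the target phenotype signature
    eps_seed:   prediction conditioned on the seed structure
    w_pheno, w_seed: independent guidance scales (hypothetical names)

    Assumed form: eps = eps_uncond
                        + w_pheno * (eps_pheno - eps_uncond)
                        + w_seed  * (eps_seed  - eps_uncond)
    """
    return [
        u + w_pheno * (p - u) + w_seed * (s - u)
        for u, p, s in zip(eps_uncond, eps_pheno, eps_seed)
    ]


# Toy two-dimensional latent: stronger pull toward the phenotype condition
# than toward the seed structure.
guided = composed_guidance(
    eps_uncond=[0.0, 1.0],
    eps_pheno=[1.0, 1.0],
    eps_seed=[0.0, 3.0],
    w_pheno=2.0,
    w_seed=0.5,
)
```

Setting both scales to zero recovers the unconditional prediction, and raising one scale while holding the other fixed moves the output along only that condition's direction, which is the kind of trade-off control the abstract describes.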

2605.28225 2026-05-28 cs.CL

Supervised Semantic Differential for Cross-Cultural Concept Analysis: A Case Study of Human Affect

监督语义差异法用于跨文化概念分析:以人类情感为例

Jan Sikora, Paweł Lenartowicz, Hubert Plisiecki

发表机构 * University of Warsaw(华沙大学) Society for Open Science(开放科学协会) Centre for Brain Research, Jagiellonian University(雅盖隆大学脑研究中心) IDEAS Research Institute(IDEAS研究院)

AI总结 本文提出跨语言监督语义差异法(SSD),通过对齐的多语言词嵌入比较语义维度,并以波兰语、英语和法语情感规范词汇为例,验证了情感维度的跨语言可恢复性及文化差异。

Comments 9 pages, 2 figures, excluding the appendices. Code to reproduce our results is available at https://github.com/przebor/Cross-Cultural-SSD

详情
AI中文摘要

跨文化比较心理意义需要超越词汇层面的翻译,并考察语义维度在不同语言中的组织方式。我们提出了监督语义差异法(SSD)的跨语言扩展,该方法在嵌入空间中估计监督语义梯度,并在对齐的多语言词嵌入之间进行比较。该方法通过置换检验和自助法区间检验梯度对齐性和差异,并通过围绕差异梯度的聚类解释残差差异。我们在波兰语、英语和法语情感规范词汇上展示了该方法,对效价、唤醒度和优势度(如可用)进行建模。情感维度在语言和模型设置中显著可恢复。跨语言比较显示出广泛的对齐性以及结构化的残差差异:效价似乎是共享的,而唤醒度和优势度产生了更多可解释的对比,涉及身体威胁、审美刺激、内部情感性、宏观权威和日常控制。几个聚类也反映了语料库特定的伪影,强调了谨慎解释的必要性。跨语言SSD提供了一个可解释的框架,用于测试语义对齐性、识别差异,并生成关于心理意义跨文化差异的假设。

英文摘要

Cross-cultural comparison of psychological meaning requires methods that go beyond word-level translation and examine how semantic dimensions are organized across languages. We introduce a cross-lingual extension of the Supervised Semantic Differential (SSD), which estimates supervised semantic gradients in embedding space and compares them across aligned multilingual word embeddings. The method tests gradient alignment and difference using permutation procedures and bootstrap intervals, and interprets residual differences through clustering around the difference gradient. We demonstrate the approach on Polish, English, and French affective norm lexicons, modeling Valence, Arousal, and Dominance where available. Affective dimensions were significantly recoverable across languages and model settings. Cross-lingual comparisons showed broad alignment together with structured residual differences: Valence appeared mostly shared, whereas Arousal and Dominance produced more interpretable contrasts involving bodily threat, aesthetic stimulation, internal emotionality, macro-level authority, and everyday control. Several clusters also reflected corpus-specific artifacts, underscoring the need for cautious interpretation. Cross-lingual SSD offers an explainable framework for testing semantic alignment, identifying divergence, and generating hypotheses about cross-cultural differences in psychological meaning.

2605.28224 2026-05-28 cs.AI

When Does Memory Help Multi-Trajectory Inference for Tool-Use LLM Agents?

何时记忆有助于工具使用LLM代理的多轨迹推理?

Xinzhe Li, Yaguang Tao

发表机构 * RMIT University(皇家墨尔本理工大学)

AI总结 本文提出一个统一框架,将记忆沿传输范围和内容抽象两个维度分解,在无验证器设置下评估四种记忆方法与三种推理策略在四个工具使用基准上的表现,发现推理策略是混淆变量,不同策略下相同记忆方法产生显著不同结果。

Comments More evaluation and analysis are on the way

详情
AI中文摘要

工具使用LLM代理的多轨迹推理——生成多个推理尝试并从中选择——受益于跨尝试的知识转移,以便后续尝试避免早期尝试的陷阱。现有的跨轨迹记忆方法(轨迹级反思、原子事实提取、原始观察注入)均在单个任务的单一推理策略下进行评估,使得报告的性能提升是否反映记忆抽象或推理方法的属性变得不明确。我们提出一个统一框架,将记忆沿两个维度分解——传输范围(扩展内 vs. 跨轨迹)和传输内容的抽象程度——并在四种工具使用基准(涵盖SQL、知识图谱和CLI环境)上,在匹配实际代理部署设置的无验证器设置下,评估四种方法在三种推理策略(best-of-N、束搜索、MCTS)下的表现。实验矩阵将推理方法识别为混淆变量:相同的记忆方法在相同示例的不同推理策略下产生统计上不同的结果。反思仅在MCTS下达到显著性(不在best-of-N下);扩展内注入(使每个候选条件依赖于先前兄弟候选的结果)仅有助于缺乏多样性的束搜索;而原子事实提取对准确性无影响,但在具有可重用环境结构的任务上使轨迹缩短19-26%。

英文摘要

Multi-trajectory inference for tool-use LLM agents - generating multiple reasoning attempts and selecting among them - benefits from transferring knowledge across attempts so that later ones avoid the pitfalls of earlier ones. Existing cross-trajectory memory methods (trajectory-level reflection, atomic fact extraction, raw observation injection) are each evaluated under a single inference strategy on a single task, making it unclear whether reported gains reflect properties of the memory abstraction or of the inference method. We propose a unified framework that decomposes memory along two axes -- the scope of transfer (within an expansion vs. across trajectories) and the abstraction of the transferred content -- and evaluate four methods under three inference strategies (best-of-N, beam search, MCTS) on four tool-use benchmarks spanning SQL, knowledge-graph, and CLI environments, in a verifier-free setting that matches the deployment regime of practical agents. The experiment matrix identifies the inference method as a confound: the same memory method produces statistically distinct results under different inference strategies on the same examples. Reflection reaches significance only under MCTS (not under best-of-N); within-expansion injection (conditioning each candidate on prior siblings' outcomes) helps only diversity-starved beam search; and atomic fact extraction is accuracy-neutral but shortens trajectories by 19-26% on tasks with reusable environmental structure.

2605.28222 2026-05-28 cs.CL cs.IR cs.LG

Analyzing Quality-Latency-Resource Trade-offs in a Technical Documentation RAG Assistant Using LoRA Adaptation

使用LoRA适配分析技术文档RAG助手中的质量-延迟-资源权衡

Evgenii Palnikov, Elizaveta Gavrilova

发表机构 * HSE University(俄罗斯高等经济大学)

AI总结 本研究通过LoRA适配器在RAG系统中分析质量、延迟和资源之间的权衡,发现仅对q和v注意力投影进行适配的配置在帕累托前沿占优。

Comments 13-page main body plus extended appendix; 6 figures; benchmark, LoRA adapters, and code at https://github.com/EugPal/rag-lora-tradeoffs

详情
AI中文摘要

我们研究了在基于文档的检索增强生成(RAG)系统中使用生成器的低秩适配(LoRA)时的质量-延迟-资源权衡。我们构建了一个包含5,144个问答对的手动验证基准测试,这些问答对基于官方Kubernetes文档,并将其与固定的混合检索流水线(BGE-M3密集、BGE-M3原生稀疏、倒数排名融合(Reciprocal Rank Fusion)、交叉编码器重排序)结合。在此基准测试上,我们在Llama-3.2-3B-Instruct和Llama-3.1-8B-Instruct上对20种LoRA配置进行了消融实验,涉及秩和目标模块的选择,并评估了每个配置的token级F1、LLM判断的接地性和正确性(pass@4)、推理延迟、推理内存和训练成本,所有结果均附有bootstrap 95%置信区间。帕累托分析表明,仅作用于q和v注意力投影的LoRA适配器始终主导前沿,而3B/8B的选择主要定义了操作区间。参数匹配的控制比较进一步表明,q/v优势是结构性的而非纯粹参数性的。基准测试、选定的适配器和代码可在https://github.com/EugPal/rag-lora-tradeoffs获取。

英文摘要

We study quality-latency-resource trade-offs in a documentation-grounded retrieval-augmented generation (RAG) system that uses Low-Rank Adaptation (LoRA) of the generator. We build a manually verified benchmark of 5,144 question-answer pairs over the official Kubernetes documentation and combine it with a fixed hybrid-retrieval pipeline (BGE-M3 dense, BGE-M3 native sparse, Reciprocal Rank Fusion, cross-encoder reranking). Over this benchmark we ablate 20 LoRA configurations on Llama-3.2-3B-Instruct and Llama-3.1-8B-Instruct across rank and target-module choices, and evaluate each on token-level F1, LLM-judged groundedness and correctness (pass@4), inference latency, inference memory, and training cost, all reported with bootstrap 95% confidence intervals. Pareto analysis shows that LoRA adapters acting only on the q and v attention projections consistently dominate the front, while the 3B/8B choice mainly defines operating regime. A param-matched control comparison further indicates that the q/v advantage is structural rather than purely parametric. The benchmark, selected adapters, and code are available at https://github.com/EugPal/rag-lora-tradeoffs.
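The retrieval pipeline above merges dense and sparse results with Reciprocal Rank Fusion. As a rough illustration of that step only, the sketch below scores each document by the sum of 1 / (k + rank) over the lists in which it appears. The constant k = 60 is a widely used default rather than a value reported in the abstract, and the document ids are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document receives sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is an assumed default, not a paper value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical dense-retriever output
sparse_hits = ["doc_b", "doc_d", "doc_a"]  # hypothetical sparse-retriever output
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Because the fusion uses ranks rather than raw scores, the dense and sparse retrievers do not need comparable score scales; in a pipeline like the one described, the fused list would then be passed to the cross-encoder reranker.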

2605.28218 2026-05-28 cs.CL

IFMTBench: A Comprehensive Benchmark for Multilingual Translation Instruction Following

IFMTBench:多语言翻译指令遵循的综合基准

Mingrui Sun, Mao Zheng, Zheng Li, Mingyang Song

发表机构 * Large Language Model Department, Tencent(腾讯大语言模型部)

AI总结 提出IFMTBench基准,涵盖7种语言、4506个单约束和2838个多约束项,通过确定性检查器和基于LLM的评分器评估翻译指令遵循能力,揭示指令遵循随模型规模增长快于翻译质量,且术语表和结构化格式约束难度最高。

Comments 11 pages, 6 figures, conference

详情
AI中文摘要

现代翻译工作流程要求的不只是语义等价。用户通常要求模型保留JSON或HTML模式、遵循精心策划的术语表、利用提供的上下文消除歧义,并匹配指定的语域,往往同时满足多个条件。传统的BLEU和xCOMET等指标能捕捉语义保真度,但对约束遵循的指示甚少,而一般的指令遵循基准则忽略了翻译的跨语言性质。我们引入了IFMTBench,一个涵盖七种语言的多语言翻译指令遵循基准,包含4506个单约束和2838个多约束项,跨越六个约束维度和五种组合模式,指令以所有七种语言发出。约束分为由确定性检查器验证的门控子集和由基于评分规则的LLM评判器评分的连续子集,通过乘法规则组合以抵抗奖励黑客攻击。评估15个模型揭示了先前协议遗漏的系统性差距:指令遵循随模型规模增长比翻译质量更显著,术语表和结构化格式约束主导了难度梯度,而一般指令遵循排名与翻译行为的相关性很弱。我们的基准可在https://github.com/Tencent-Hunyuan/Hy-MT2/tree/main/IFMTBench获取。

英文摘要

Modern translation workflows demand more than semantic equivalence. Users routinely require models to preserve JSON or HTML schemas, honor curated glossaries, disambiguate with provided context, and match prescribed registers, often several at once. Conventional metrics such as BLEU and xCOMET capture semantic fidelity but provide little signal on constraint adherence, while general instruction-following benchmarks ignore the cross-lingual nature of translation. We introduce IFMTBench, a benchmark for multilingual translation instruction following covering seven languages, with 4,506 single-constraint and 2,838 multi-constraint items spanning six constraint dimensions and five compositional patterns, with instructions issued in all seven languages. Constraints are split into a gating subset verified by deterministic checkers and a continuous subset scored by a rubric-based LLM judge, combined under a multiplicative rule that resists reward hacking. Evaluating 15 models reveals systematic gaps that prior protocols miss: instruction following scales with size more sharply than translation quality, glossary and structured-format constraints dominate the difficulty gradient, and general instruction-following rankings correlate only weakly with translation behavior. Our benchmark is available at https://github.com/Tencent-Hunyuan/Hy-MT2/tree/main/IFMTBench.
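The abstract says only that gated and continuous constraints are combined multiplicatively; the exact rule is not given. The sketch below shows one plausible reading, assuming the gate is 1 when every deterministic check passes and 0 otherwise, and the continuous part is the mean of judge scores in [0, 1]. The function and variable names are hypothetical.

```python
def combined_score(gate_results, judge_scores):
    """Multiply a hard pass/fail gate by a soft quality score.

    gate_results: booleans from deterministic checkers (e.g. valid JSON,
                  glossary term present); assumed all-or-nothing.
    judge_scores: rubric scores in [0, 1] from an LLM judge.
    """
    gate = 1.0 if all(gate_results) else 0.0
    if not judge_scores:
        return gate
    quality = sum(judge_scores) / len(judge_scores)
    # A failed hard constraint zeroes the item no matter how fluent the output is.
    return gate * quality


passing = combined_score([True, True], [0.8, 0.6])
failing = combined_score([True, False], [1.0, 1.0])
```

Under this reading, a response cannot compensate for a broken schema or a missing glossary term by scoring well on the judged dimensions, which is one way a multiplicative rule can resist reward hacking.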

2605.28217 2026-05-28 cs.CV

A Patient-Specific Pulmonary Arterial Tree Digital Twin to Extract Pulmonary Embolism Biomarkers

患者特异性肺动脉树数字孪生以提取肺栓塞生物标志物

Morgane des Ligneris, Nathan Painchaud, Allan Serva, Laurent Bertoletti, Pierre Croisille, Carole Frindel, Odyssée Merveille

发表机构 * Univ Lyon, INSA‐Lyon, Université Claude Bernard Lyon 1, UJM-Saint Etienne, CNRS, Inserm, CREATIS UMR 5220, U1294(里昂大学、里昂国立应用科学学院、克洛德·贝尔纳里昂第一大学、UJM-圣艾蒂安、CNRS、Inserm、CREATIS UMR 5220、U1294) Department of Radiology, CHU Saint-Etienne, UJM Saint-Etienne, Saint-Etienne, France(圣艾蒂安大学医院放射科、UJM-圣艾蒂安、法国圣艾蒂安) IUF, Institut Universitaire de France, Paris(IUF、法国大学研究院、巴黎)

AI总结 提出一种自动化流程,通过构建肺动脉树的有向图表示并提取基于图像的生物标志物(包括局部动脉特征和全局严重程度评分),生成患者数字孪生,用于肺栓塞的风险评估。

Comments 11 pages + 2 pages of supplementary materials. Submitted to special issue of JBHI

详情
AI中文摘要

肺栓塞是由血凝块阻塞肺动脉引起的,是急性心血管综合征的主要原因之一。在临床实践中,通过计算机断层扫描肺血管造影诊断后的治疗决策依赖于风险分层,该分层将30天死亡风险分为三类。这种分层取决于右心室与左心室直径比以及两种心脏酶的血液水平。然而,在急诊情况下,血液生物标志物并不总是可用,而手动计算既定的严重程度评分(如Qanadli和Mastora评分)耗时且很少在临床常规中进行。本研究引入了一种自动化流程,该流程对肺动脉树的有向图表示进行建模,标记其层次结构并表征肺栓塞。该流程推导出基于图像的生物标志物,包括局部动脉级特征(形态学信息、层次位置、血凝块体积和由此产生的阻塞)以及全局患者级生物标志物,如自动计算的严重程度评分(Qanadli和Mastora评分)以及按肺叶和层次划分的总栓塞体积分布。利用人工智能生成的动脉、栓塞、肺和肺叶的二元掩码,它创建了动脉结构的患者数字孪生。通过与现有流程、解剖学期望和手动严重程度评分计算的比较验证,证明了该流程能够自动生成解剖学上准确的数字孪生和具有高度一致性的严重程度评分。这支持了这些基于图像的生物标志物自动提供关于血栓负荷和空间血凝块分布的快速、精确信息的潜力。

英文摘要

Pulmonary embolism, the obstruction of a pulmonary artery by a blood clot, is one of the leading causes of acute cardiovascular syndrome. In clinical practice, therapeutic decisions after diagnosis via computed tomography pulmonary angiography rely on risk stratification, which categorizes 30-day mortality risk into three categories. This stratification depends on the right-to-left ventricular diameter ratio and blood levels of two cardiac enzymes. However, blood biomarkers are not always available in emergency settings, and manual calculation of established severity scores, such as Qanadli and Mastora, is time-consuming and rarely performed in routine clinical practice. This study introduces an automated pipeline that models a directed graph representation of the pulmonary arterial tree, labeling its hierarchical structure and characterizing pulmonary embolism. The pipeline derives image-based biomarkers, including local artery-level features (morphological information, hierarchical position, clot volume, and resulting obstruction) and global patient-level biomarkers such as automatically calculated severity scores (Qanadli and Mastora) and the total embolic volume distribution by lobes and hierarchical levels. Using artificial-intelligence-generated binary masks of arteries, emboli, lungs, and lobes, it creates a patient digital twin of the arterial structure. Validation of the pipeline through comparison to an existing pipeline, anatomical expectations, and manual severity score calculations demonstrates the pipeline's ability to automatically generate anatomically accurate digital twins and severity scores with strong agreement. This supports the potential of these image-derived biomarkers to automatically provide rapid, precise information on thrombotic burden and spatial clot distribution.

2605.28213 2026-05-28 cs.AI

Learning When to Optimize: Verified Optimization Skills from Expert GPU-Kernel Lineages

学习何时优化:来自专家GPU内核谱系的验证优化技能

Shuoming Zhang, Qiuchu Yu, Yangyu Zhang, Ruiyuan Xu, Xiyu Shi, Guangli Li, Xiaobing Feng, Huimin Cui, Jiacheng Zhao

发表机构 * SKLP, Institute of Computing Technology, Chinese Academy of Sciences(SKLP,计算技术研究所,中国科学院) University of Chinese Academy of Sciences(中国科学院大学) University of New South Wales(新南威尔士大学)

AI总结 提出KLineage方法,通过反向遍历专家GPU内核实现并提取可重用的优化技能,学习优化的适用条件,从而提升LLM代理生成内核的优化质量与效率。

Comments Preprint, Under Review

详情
AI中文摘要

基于LLM的代理越来越多地被用于生成GPU内核,但它们通常知道尝试哪些优化,却不知道这些优化何时是合理的。我们引入了KLineage,它从专家内核中学习这种缺失的“何时”知识:KLineage不是依赖前向展开,而是通过验证门控简化反向遍历专家实现,并将每个接受的步骤逆转为可重用的优化技能。每个技能不仅记录了优化意图,还记录了它在代码中的适用位置、使其有效的条件、产生的效果以及其假设避免了哪些失败。下游LLM在相同的编译/正确性/性能分析门控下,将这些技能应用到新的代码表面上。在两个NVIDIA架构上的五个专家工作负载中,这些谱系衍生的技能作为有效的优化课程,在相同的固定预算下,在最终内核质量和优化效率方面均超过了近期基于记忆的LLM内核基线。此外,我们使用一个单独的22实例保留检查作为对源案例记忆的合理性测试。

英文摘要

LLM-based agents are increasingly used to generate GPU kernels, but they often know what optimizations to try without knowing when those optimizations are sound. We introduce KLineage, which learns this missing "when" knowledge from expert kernels: instead of relying on forward rollouts, KLineage walks expert implementations backward through validation-gated simplifications and reverses each accepted step into a reusable optimization skill. Each skill records not only the optimization intent, but also where it applies in code, what conditions made it valid, what effect it had, and what failures its assumptions avoid. A downstream LLM materializes these skills on new code surfaces under the same compile/correctness/profile gate. On five expert workloads across two NVIDIA architectures, these lineage-derived skills serve as an effective optimization curriculum, exceeding recent memory-based LLM-kernel baselines in both final kernel quality and optimization efficiency under the same fixed budget. We additionally use a separate 22-instance held-out check as a sanity test against source-case memorization.

2605.28211 2026-05-28 cs.CL

When Helpful Context Leaks: Privacy Risks in Domain-Adapted ASR

当有用上下文泄露:领域自适应ASR中的隐私风险

Maike Züfle, Jan Niehues

发表机构 * Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院)

AI总结 本文识别并系统研究了领域自适应ASR中因上下文提示或微调导致模型泄露隐私的风险,通过构建控制数据集测量泄露率,并评估了提示级缓解策略及精度-泄露权衡。

详情
AI中文摘要

语音大语言模型越来越多地部署在专业环境中,领域定制是标准做法:用户在提示中提供包含敏感信息的上下文,在专有录音上进行微调,或两者兼有。我们识别并系统研究了这种定制的一个被忽视的隐私风险:适应于识别领域特定术语的模型可以被诱导转录其上下文或训练数据中一个语音相似的词,即使说的是不同的词,从而泄露私人信息。为了评估这一风险,我们构建了一个控制数据集,并测量了两种定制机制(提示和微调)下的泄露率。两种机制都会导致可测量的泄露,且组合时加剧。我们评估了一种提示级缓解策略,并分析了不同定制方法下的精度-泄露权衡,发现无上下文提示的微调提供了最佳平衡。我们公开了代码和数据集。

英文摘要

SpeechLLMs are increasingly deployed in professional settings where domain customisation is standard practice: users supply context in prompts with sensitive information, fine-tune on proprietary recordings, or both. We identify and systematically investigate an overlooked privacy risk of such customisation: a model adapted to recognise domain-specific terminology can be nudged into transcribing a phonetically similar word from its context or training data, even when a different word is spoken, thereby leaking private information. To evaluate this risk, we construct a controlled dataset and measure leakage rates across two customisation mechanisms, prompting and fine-tuning. Both mechanisms cause measurable leakage, compounding when combined. We evaluate a prompt-level mitigation strategy and analyse the accuracy-leakage trade-off across customisation approaches, finding that fine-tuning without context prompts offers the best balance. We release our code and dataset publicly.

2605.28203 2026-05-28 cs.LG

Refining Multidimensional Video Reward Models via Disentangled Influence Functions

通过解耦影响函数优化多维视频奖励模型

Muyao Wang, Zeke Xie, Hideki Nakayama

发表机构 * The University of Tokyo(东京大学) HKUST (Guangzhou)(香港科技大学(广州))

AI总结 针对文本到视频生成任务中训练样本在不同评估维度上可靠性不一致的问题,提出解耦影响框架以估计维度特定监督风险,并设计维度解耦剪枝与重加权策略,显著提升多维视频奖励模型与真实标注的对齐效果。

详情
AI中文摘要

随着文本到视频(T2V)生成模型的不断发展,视频评估的复杂性要求跨多个轴进行细粒度评估。为此,近期工作致力于开发多维视频奖励模型(MVRMs),将评估过程分解以更好地适应人类视觉感知的多面性。然而,训练有效的MVRMs从根本上受到视频数据复杂性的挑战。在本工作中,我们识别出一个关键现象,称为维度异质性:训练样本的可靠性在不同评估维度上可能显著不同,这意味着一个样本可能为一个目标提供可靠的监督,同时为另一个目标引入高监督风险。因此,基于全局标量指标进行过滤的流行数据驱动方法对于T2V任务是不适定的。为解决此问题,我们提出一个解耦影响框架,能够高效估计维度特定的监督风险。利用该框架,我们引入两种维度解耦优化策略:维度解耦剪枝(移除极端高风险样本)和维度解耦重加权(对高风险监督进行软降权)。大量实验表明,我们的解耦策略显著优于全局过滤基线,得到的奖励模型与真实标注的对齐效果更优。

英文摘要

As Text-to-Video (T2V) generation models continue to evolve, the complexity of video evaluation necessitates a fine-grained assessment across various axes. To address this, recent works have focused on developing Multidimensional Video Reward Models (MVRMs), which decompose the evaluation process to better align with the multifaceted nature of human visual perception. However, training effective MVRMs is fundamentally challenged by the complex nature of video data. In this work, we identify a critical phenomenon termed Dimensional Heterogeneity: the reliability of a training sample can vary substantially across evaluation dimensions, meaning that a sample may provide reliable supervision for one objective while inducing high supervision risk for another. Consequently, prevailing data-centric methods that filter based on global scalar metrics are ill-posed for T2V tasks. To address this, we propose a disentangled influence framework that efficiently estimates dimension-specific supervision risk. Leveraging this framework, we introduce two dimension-disentangled refinement strategies: Dimension-Disentangled Pruning, which removes extreme high-risk samples, and Dimension-Disentangled Reweighting, which softly down-weights high-risk supervision. Extensive experiments demonstrate that our disentangled strategies significantly outperform global filtering baselines, yielding reward models with superior alignment to ground truth.

2605.28202 2026-05-28 cs.RO

Natural Functional Gradients for Smooth Trajectory Optimization

平滑轨迹优化的自然函数梯度

Kisang Park, Chanwoo Kim, Kyungjae Lee, Sungjoon Choi

发表机构 * Department of Artificial Intelligence, Korea University, Seoul, Republic of Korea(高丽大学人工智能系,首尔,大韩民国) Department of Statistics, Korea University, Seoul, Republic of Korea(高丽大学统计系,首尔,大韩民国)

AI总结 提出一种基于自然函数梯度的轨迹优化框架,通过函数空间中的几何感知更新和蒙特卡洛估计,在无解析梯度时生成更平滑、更可行的运动轨迹。

详情
AI中文摘要

生成无碰撞且平滑的运动仍然是机器人操作中的一个核心挑战,尤其是在杂乱环境和狭窄通道中,可行区域高度受限且碎片化。我们提出了一种轨迹优化框架,该框架使用自然函数梯度直接在函数空间中进行几何感知更新。该方法优化了一个高斯平滑的替代目标,通过平滑轨迹扰动正则化优化景观,同时保留轨迹级结构。由于更新在函数空间内固有定义,轨迹规则性可以独立于特定时间离散化进行控制。我们推导了自然函数梯度的实用蒙特卡洛估计器,仅需黑盒轨迹评估,使得该方法在由于碰撞检测和接触丰富的仿真导致解析梯度不可用或不可靠时适用。在受限机器人操作任务上的实验表明,与代表性的规划和轨迹优化基线相比,所提出的方法在几何间隙狭窄的环境中提高了轨迹可行性并生成了更平滑的运动。更多结果、视频和实现细节可在项目页面获取:https://kisangpark.github.io/natural-functional-gradient/

英文摘要

Generating collision-free and smooth motions remains a central challenge in robotic manipulation, particularly in cluttered environments and narrow passages where feasible regions are highly constrained and fragmented. We propose a trajectory optimization framework that performs geometry-aware updates directly in function space using natural functional gradients. The method optimizes a Gaussian-smoothed surrogate objective that regularizes the optimization landscape through smooth trajectory perturbations while preserving trajectory-level structure. Because the updates are defined intrinsically in function space, trajectory regularity can be controlled independently of a particular time discretization. We derive a practical Monte-Carlo estimator of the natural functional gradient that requires only black-box trajectory evaluations, making the method applicable when analytic gradients are unavailable or unreliable due to collision checking and contact-rich simulation. Experiments on constrained robotic manipulation tasks demonstrate that the proposed method improves trajectory feasibility and produces smoother motions than representative planning and trajectory optimization baselines in environments with narrow geometric clearances. Additional results, videos, and implementation details are available at the project page: https://kisangpark.github.io/natural-functional-gradient/
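To make the idea of a black-box Monte-Carlo gradient of a Gaussian-smoothed objective concrete, here is a minimal sketch over a discretized trajectory. It estimates the plain (Euclidean) gradient of E[cost(traj + sigma * eps)] with antithetic samples; it does not implement the paper's natural functional gradient or its function-space geometry, and the function name, parameters, and toy cost are all assumptions for illustration.

```python
import random


def smoothed_gradient(cost, traj, sigma=0.1, num_samples=1000, seed=0):
    """Estimate the gradient of the Gaussian-smoothed cost at `traj`.

    Uses antithetic pairs: for eps ~ N(0, I),
        (cost(traj + sigma*eps) - cost(traj - sigma*eps)) / (2*sigma) * eps
    is an unbiased estimate of the smoothed gradient. Only evaluations of
    `cost` are needed, so `cost` may wrap a collision checker or simulator.
    """
    rng = random.Random(seed)
    dim = len(traj)
    grad = [0.0] * dim
    for _ in range(num_samples):
        eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        plus = cost([x + sigma * e for x, e in zip(traj, eps)])
        minus = cost([x - sigma * e for x, e in zip(traj, eps)])
        scale = (plus - minus) / (2.0 * sigma)
        for i in range(dim):
            grad[i] += scale * eps[i]
    return [g / num_samples for g in grad]


def toy_cost(points):
    """Hypothetical stand-in for a trajectory cost: sum of squared waypoints."""
    return sum(x * x for x in points)


# The smoothed gradient of this quadratic is 2 * traj, i.e. about [2, -4] here.
estimate = smoothed_gradient(toy_cost, [1.0, -2.0], num_samples=20000)
```

In this sketch the perturbations are independent per waypoint; a function-space method such as the one described would instead draw smooth trajectory perturbations, so that regularity does not depend on the time discretization.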

2605.28201 2026-05-28 cs.AI

Plant, Persist, Trigger: Sleeper Attack on Large Language Model Agents

种植、持久化、触发:针对大语言模型智能体的潜伏攻击

Yongxiang Li, Moxin Li, Zhixin Ma, Fengbin Zhu, Dongrui Liu, Wenjie Wang, Fuli Feng

发表机构 * University of Science and Technology of China(中国科学技术大学) National University of Singapore(新加坡国立大学) Singapore Management University(新加坡管理大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

AI总结 提出潜伏攻击(Sleeper Attack),即攻击者将对抗性内容注入智能体状态并持久化,在后续交互中被良性用户查询触发,导致有害行为;构建包含1896个实例的基准测试,实验表明当前最强LLM智能体仍易受此类攻击。

详情
AI中文摘要

大语言模型(LLM)智能体仍然容易受到来自外部环境的安全威胁,攻击者将对抗性内容注入外部观察(如工具返回的数据、网页或MCP上下文),导致有害的智能体行为,例如不安全的操作或错误的输出。现有研究通常关注单次交互攻击,即智能体观察到对抗性内容后立即在单次用户请求中表现出有害行为。然而,我们表明对抗性内容也可以在同一智能体服务的多次交互中持久化,使得此类威胁更难检测和缓解。具体来说,对抗性内容可能持久化在智能体状态中,在多次交互中保持休眠,随后被良性用户查询激活。我们将此类安全威胁形式化为潜伏攻击(Sleeper Attack)。为了评估它,我们构建了一个包含1896个实例的基准测试,涵盖六种真实世界的有害结果、三种攻击策略和三种智能体状态目标:会话上下文、记忆和可复用技能。在七个强大的开源和闭源LLM上的实验表明,最先进的LLM智能体仍然容易受到潜伏攻击,即使在单次交互基线中它们实现了较低的攻击成功率。我们的代码和数据可在https://anonymous.4open.science/r/skdvnfu23ihr9wdscnksf1asdffsaef获取。

英文摘要

Large Language Model (LLM) agents remain vulnerable to safety threats from the external environment, where attackers inject adversarial content into external observations such as tool-returned data, webpages, or MCP context, causing harmful agentic behaviors such as unsafe actions or incorrect outputs. Existing studies typically focus on single-interaction attacks, where the agent observes adversarial content and immediately exhibits harmful behavior within one user request. However, we show that adversarial content can also persist across interactions served by the same agent, making such threats harder to detect and mitigate. Specifically, adversarial content may persist in the agent state, remain dormant across interactions, and later be activated by a benign user query. We formalize this type of safety threat as Sleeper Attack. To evaluate it, we construct a benchmark with 1,896 instances covering six real-world harmful outcomes, three attack strategies, and three agent state targets: session context, memory, and reusable skills. Experiments on seven strong open-source and closed-source LLMs show that state-of-the-art LLM agents remain vulnerable to Sleeper Attack, even when they achieve low attack success rates under a single-interaction baseline. Our code and data are available at https://anonymous.4open.science/r/skdvnfu23ihr9wdscnksf1asdffsaef.

2605.28200 2026-05-28 cs.LG q-bio.GN

Geometry-First Generative Spatial Single-Cell Reconstruction

几何优先的生成式空间单细胞重建

Ehtesamul Azim, Muhtasim Noor Alif, Tae Hyun Hwang, Yanjie Fu, Wei Zhang

发表机构 * University of Central Florida(中佛罗里达大学) Vanderbilt University Medical Center(范德比尔特大学医学中心) Arizona State University(亚利桑那州立大学)

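AI Summary Proposes GEARS, a geometry-first framework that combines a permutation-equivariant generator with a diffusion-based refiner to reconstruct single-cell spatial geometry from scRNA-seq data under ST guidance, without cell-type labels or histological images.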

Comments 32nd SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

AI中文摘要

单细胞RNA测序(scRNA-seq)可分析大量细胞但丢失空间背景,而空间转录组学(ST)以较低分辨率保留部分空间结构。现有大多数整合方法要么解卷积斑点混合物,要么将细胞映射到测量的斑点网格上,这使重建受限于固定网格和切片特定坐标系,在非配对设置中尤其成问题。我们提出GEARS,一种几何优先框架,在ST引导下重建内在的单细胞空间几何,无需依赖细胞类型标签、组织学图像或细胞-斑点分配。GEARS首先学习一个域不变的表达编码器,对齐ST斑点和解离细胞,然后训练一个置换等变生成器,配合基于扩散的细化器(采用EDM风格预处理),在来自ST坐标的姿态不变监督下生成局部空间几何。在推理时,GEARS在scRNA-seq细胞的多个重叠子集上重建几何,聚合跨子集的预测成对距离,并解决全局距离几何问题以获得规范二维坐标和密集距离矩阵。大量定量和定性实验(包括跨切片泛化)表明,与强空间映射和解卷积基线相比,GEARS在全局距离保持、局部邻域保真度和空间分布对齐方面持续改进。

英文摘要

Single-cell RNA sequencing (scRNA-seq) profiles large numbers of cells but loses spatial context, whereas spatial transcriptomics (ST) preserves partial spatial structure at lower resolution. Most existing integration methods either deconvolve spot mixtures or map cells onto a measured spot lattice, which ties reconstructions to a fixed grid and slide-specific coordinate systems, a limitation that is especially problematic in unpaired settings. We propose GEARS, a geometry-first framework that reconstructs an intrinsic single-cell spatial geometry guided by ST, without relying on cell-type labels, histological images, or cell-to-spot assignment. GEARS first learns a domain-invariant expression encoder that aligns ST spots and dissociated cells, and then trains a permutation-equivariant generator with a diffusion-based refiner with EDM-style preconditioning to generate local spatial geometries under pose-invariant supervision derived from ST coordinates. At inference, GEARS reconstructs geometry on many overlapping subsets of scRNA-seq cells, aggregates predicted pairwise distances across subsets, and solves a global distance-geometry problem to obtain canonical two-dimensional coordinates and a dense distance matrix. Extensive quantitative and qualitative experiments, including cross-section generalization, show that GEARS consistently improves global distance preservation, local neighborhood fidelity, and spatial distribution alignment compared to strong spatial mapping and deconvolution baselines.
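
The inference step described above (aggregate predicted pairwise distances over overlapping subsets, then solve a global distance-geometry problem for 2D coordinates) can be illustrated with a minimal sketch. This is not the GEARS implementation: classical multidimensional scaling (MDS) is swapped in as a simple stand-in for the global solver, the "predicted" local distances are taken noise-free from toy ground-truth coordinates, and the function names `aggregate_distances` and `classical_mds` are hypothetical.

```python
import numpy as np


def aggregate_distances(n_cells, subsets, local_dists):
    """Average pairwise distances over every subset that contains each pair."""
    total = np.zeros((n_cells, n_cells))
    count = np.zeros((n_cells, n_cells))
    for idx, dist in zip(subsets, local_dists):
        idx = np.asarray(idx)
        total[np.ix_(idx, idx)] += dist
        count[np.ix_(idx, idx)] += 1
    mean = np.divide(total, count, out=np.zeros_like(total), where=count > 0)
    return mean, count > 0


def classical_mds(dist, dim=2):
    """Classical MDS (assumed stand-in solver): embed a full distance matrix."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering  # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:dim]  # indices of the largest eigenvalues
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))


# Toy data: 12 "cells" with known 2D positions (illustration only).
rng = np.random.default_rng(0)
true_xy = rng.normal(size=(12, 2))
true_dist = np.linalg.norm(true_xy[:, None] - true_xy[None, :], axis=-1)

# Many overlapping subsets; each yields a local distance matrix.
subsets = [rng.choice(12, size=6, replace=False) for _ in range(200)]
local = [true_dist[np.ix_(s, s)] for s in subsets]

mean_dist, covered = aggregate_distances(12, subsets, local)
coords = classical_mds(mean_dist)
recovered = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
```

Because the embedding is built from distances alone, `coords` matches the toy positions only up to rotation, reflection, and translation, which is why the comparison is made on `recovered` distances rather than on raw coordinates. Classical MDS also needs every pair to be covered by at least one subset; a solver that tolerates missing pairs would be required otherwise.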

2605.28198 2026-05-28 cs.LG

Hierarchical Synthetic Tabular Data Generation: A Hybrid Top-Down and Bottom-Up Framework

层次化合成表格数据生成:一种自上而下与自下而上混合框架

Junfeng Nie, Alvin Jin, Xiaohui Chen

发表机构 * University of Southern California(南加州大学)

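AI Summary Proposes a hierarchical hybrid top-down and bottom-up (H-TDBU) framework that decouples semantic structure from stochastic texture, combining structure-driven logical constraints with lightweight tabular generators to improve the semantic consistency and statistical fidelity of synthetic data on weak multimodal financial benchmarks.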

Comments Accepted as a poster at FMSD @ ICML 2026. 9 pages, 6 figures

AI中文摘要

现有的合成表格数据生成方法要么基于纯生成模型,要么基于大语言模型,两者在处理数据异质性、逻辑一致性、罕见事件覆盖以及低数据场景下的鲁棒性方面都存在困难。在本文中,我们提出了一种层次化混合自上而下和自下而上(H-TDBU)框架,该框架将语义结构与随机纹理解耦。在自上而下的路径中,构建了结构驱动的逻辑约束和跨模态对齐规则;而在自下而上的路径中,使用轻量级表格生成器从真实数据中学习局部统计模式。两条路径通过带有迭代反馈循环的统一合成引擎进行整合。我们在结合表格数据和情感文本数据的弱多模态金融基准上评估了该框架。实验结果表明,我们的H-TDBU方法在保持语义一致性的同时,相比神经基线方法提升了“在合成数据上训练、在真实数据上测试”(train-synthetic-test-real)的性能。我们的结果表明,层次化规则引导的合成为在合成数据生成中结合可控性、语义连贯性和统计保真度提供了一种有效机制。

英文摘要

Existing approaches for synthetic tabular data generation are based on either purely generative models or LLMs, both of which struggle with data heterogeneity, logical consistency, rare-event coverage, and robustness in low-data regimes. In this paper, we propose a hierarchical hybrid top-down and bottom-up (H-TDBU) framework that decouples semantic structures from stochastic texture. In the top-down path, structure-driven logical constraints and cross-modal alignment rules are constructed, while in the bottom-up path, lightweight tabular generators are used to learn local statistical patterns from real data. The two paths are consolidated in a unified synthesis engine with an iterative feedback loop. We evaluate the framework on weak multimodal financial benchmarks combining tabular and sentiment-text data. Experimental results show that our H-TDBU approach improves train-synthetic-test-real performance over neural baseline methods while preserving semantic consistency. Our results suggest that hierarchical rule-guided synthesis provides an effective mechanism for combining controllability, semantic coherence, and statistical fidelity in synthetic data generation.
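
The train-synthetic-test-real (TSTR) metric mentioned above can be illustrated with a minimal sketch: a classifier is fit on synthetic rows only and scored on held-out real rows. Everything here is an assumption for illustration and not the H-TDBU pipeline: the data are two toy Gaussian classes, `toy_generator` is a hypothetical per-class Gaussian sampler standing in for a tabular generator, and the classifier is a simple nearest-centroid rule.

```python
import numpy as np


def make_real(n, rng):
    """Toy 'real' table: two numeric columns and a binary label."""
    y = rng.integers(0, 2, size=n)
    x = rng.normal(size=(n, 2)) + np.where(y[:, None] == 1, 2.0, -2.0)
    return x, y


def toy_generator(x, y, n_per_class, rng):
    """Hypothetical stand-in generator: independent Gaussian per class and column."""
    xs, ys = [], []
    for c in np.unique(y):
        xc = x[y == c]
        xs.append(rng.normal(xc.mean(axis=0), xc.std(axis=0),
                             size=(n_per_class, x.shape[1])))
        ys.append(np.full(n_per_class, c))
    return np.concatenate(xs), np.concatenate(ys)


def fit_centroids(x, y):
    """Nearest-centroid classifier: store one mean vector per class."""
    classes = np.unique(y)
    return classes, np.stack([x[y == c].mean(axis=0) for c in classes])


def accuracy(model, x, y):
    """Fraction of rows whose nearest centroid carries the true label."""
    classes, centroids = model
    d = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=-1)
    return float((classes[d.argmin(axis=1)] == y).mean())


rng = np.random.default_rng(0)
x_train, y_train = make_real(400, rng)
x_test, y_test = make_real(500, rng)
x_syn, y_syn = toy_generator(x_train, y_train, 200, rng)

# TSTR: train on synthetic rows, test on real rows.
tstr = accuracy(fit_centroids(x_syn, y_syn), x_test, y_test)
# TRTR reference: train on real rows, test on real rows.
trtr = accuracy(fit_centroids(x_train, y_train), x_test, y_test)
```

Comparing `tstr` against the `trtr` reference shows how much downstream utility the synthetic rows retain; a TSTR score close to TRTR suggests the generator preserved the label-relevant statistics of the real data.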

2605.28192 2026-05-28 cs.AI

Agentic Active Omni-Modal Perception for Multi-Hop Audio-Visual Reasoning

面向多跳音视频推理的主动全模态感知代理

Ke Xu, Yuhao Wang, Ziyang Cheng, Hongcheng Liu, Yanfeng Wang, Yu Wang

发表机构 * Shanghai Jiao Tong University(上海交通大学)

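AI Summary To address the sparse, cross-modally distributed evidence in multi-hop audio-visual reasoning, proposes the MOV-Bench benchmark and the AOP-Agent agentic framework, which achieves active perception through a hierarchical omni-modal memory and an observe-reflect-replan loop and consistently improves open-source Omni-LLMs, especially on long videos and reasoning-intensive questions.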

AI中文摘要

多跳音视频推理对全模态大语言模型(Omni-LLMs)仍然具有挑战性,因为相关证据通常稀疏、时间上分散,并且分布在音频和视频流中。现有基准对此设置的研究有限,通常仅涉及有限数量的模态、相关时间片段或推理步骤。在这项工作中,我们引入了MOV-Bench,一个包含519个精心设计问题的基准,这些问题需要对时间上分散的音视频证据进行多跳推理。在MOV-Bench上的评估表明,当前的全模态大语言模型在多跳跨模态推理方面仍然存在困难。为了解决这一挑战,我们进一步提出了AOP-Agent,一个基于开源全模态大语言模型的高效代理框架,用于主动全模态感知。通过将分层全模态记忆与协作的观察-反思-重规划循环相结合,AOP-Agent使开源全模态大语言模型能够进行主动感知,而无需额外训练或专有模型。在MOV-Bench和OmniVideoBench上的实验表明,AOP-Agent持续提升了推理性能,在长视频和推理密集型问题上尤其显著。

英文摘要

Multi-hop audio-visual reasoning remains challenging for Omni-LLMs, as relevant evidence is often sparse, temporally dispersed, and distributed across both audio and visual streams. Existing benchmarks provide limited investigation of this setting, typically involving only a limited number of modalities, relevant temporal segments, or reasoning steps. In this work, we introduce MOV-Bench, a benchmark containing 519 carefully curated questions that require multi-hop reasoning over temporally dispersed audio-visual evidence. Evaluations on MOV-Bench reveal that current Omni-LLMs still struggle with multi-hop cross-modal reasoning. To address this challenge, we further propose AOP-Agent, an efficient agentic framework built on open-source Omni-LLMs for active omni-modal perception. By combining a hierarchical omni-modal memory with a collaborative observe-reflect-replan loop, AOP-Agent enables open-source Omni-LLMs to perform active perception without additional training or proprietary models. Experiments on MOV-Bench and OmniVideoBench demonstrate that AOP-Agent consistently improves reasoning performance, with particularly notable gains on long videos and reasoning-intensive questions.