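arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering & systems.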

2606.03199 2026-06-03 cs.LG physics.chem-ph

Fast Organic Crystal Structure Prediction with Unit Cell Flow Matching

Alston Lo, Luka Mucko, Austin H. Cheng, Andy Cai, Alastair J. A. Price, Wojciech Matusik, Alán Aspuru-Guzik

Affiliations: MIT CSAIL; University of Zagreb; University of Toronto; Vector Institute for Artificial Intelligence; Acceleration Consortium; Canadian Institute for Advanced Research; NVIDIA

AI summary: Proposes Clari, a flow matching model that generates redundancy-free unit cells, enabling organic crystal structure prediction in seconds with a 15-30x speedup.

Abstract

Organic crystal structure prediction (CSP) is a requirement for computational modelling of organic solids, but traditionally costs several CPU-years per molecule. Generative models such as OXtal dramatically reduce this cost by sampling stable organic crystal structures directly. However, OXtal forgoes explicit lattice parametrization in favour of modelling large crops of the bulk material with expensive triangle layers, which can incur a computational cost of minutes per molecule. In this paper, we reduce this to seconds with Clari, a large-scale flow matching model that generates redundancy-free unit cells and replaces triangle layers with pure pair-bias attention. Clari requires only atom types and bonds as input and does not need an RDKit-sanitizable input molecule, which expands its applicability to challenging chemistries such as fullerenes, metal complexes, and atom clusters. We further ablate key design choices such as auxiliary losses, timestep distributions, noise priors, and self-conditioning. On OXtal's test sets, we surpass OXtal's solve rate while obtaining a speedup of $15$-$30\times$. Because Clari also models explicit hydrogens, it supports inference-time scaling via direct energy ranking, without any decoration or relaxation step. When generating 150 crystals and selecting the top-30 by energy, we further improve solve rate while maintaining a speedup of $5$-$8\times$. We also introduce the CSD Teaching Subset as a new test split of diverse and complex molecules for future benchmarking. Our contributions enable CSP within seconds, making large-scale virtual screening of organic solids practical. Code is available at https://github.com/aspuru-guzik-group/clari.
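
As a rough illustration of the flow matching objective that Clari builds on, the sketch below uses the common linear-interpolation formulation, where a model is trained to predict the velocity of a straight path from a noise sample to a data sample. The abstract does not say which variant, noise prior, or lattice parametrization Clari uses, so the function names, the Gaussian prior, and the toy 2-D "data" are all assumptions.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the straight path; constant in t for linear interpolation."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at time t."""
    xt = interpolate(x0, x1, t)
    v_pred = predict_velocity(xt, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)

# Toy usage with a hypothetical "oracle" model that returns the true velocity.
rng = random.Random(0)
x1 = [1.0, -2.0]                         # stand-in for a data sample
x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # assumed Gaussian noise prior
oracle = lambda xt, t: target_velocity(x0, x1)
loss = flow_matching_loss(oracle, x0, x1, t=0.3)
```

In training, `t` would be drawn from a timestep distribution and `predict_velocity` would be a neural network; both are among the design choices the abstract says were ablated.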

2606.03198 2026-06-03 cs.CL cs.AI

AI Rater Discrimination Depends on Scoring Protocol in Complex Clinical Decision-Making

Sangwon Baek, Kyu Yeon Hur, Kyunga Kim

Affiliations: Asclep Korea Inc.; Center for Data Science, New York University; Division of Endocrinology and Metabolism, Department of Medicine, Samsung Medical Center, Sungkyunkwan University School of Medicine; Biomedical Statistics Center, Samsung Medical Center; Department of Digital Health, SAIHST; Department of Data Convergence & Future Medicine, Sungkyunkwan University

AI summary: A factorial study finds that a rubric-anchored scoring protocol amplifies AI raters' discriminative power while a rubric-free protocol suppresses it, supporting rubric anchoring in clinical AI evaluation.

Comments 11 pages, 4 main figures, 8 supplementary figures, 9 supplementary tables

Abstract

Clinical AI evaluation increasingly delegates scoring to large language models (LLMs) acting as AI raters, yet their scoring behavior across evaluation conditions has not been quantitatively characterized. We address this gap through a factorial study of AI rater behavior in adult type 2 diabetes (T2D) pharmacotherapy at 12-month outpatient follow-up, a clinical task involving complex decision-making operationalized across seven evaluation questions. Four open-source LLMs served simultaneously as clinical decision support system (CDSS) models and AI raters. Each CDSS output was scored under two scoring protocols: a rubric-anchored Gold Rubric (GR) protocol incorporating a patient-specific rubric, and a rubric-free Non Gold Rubric (Non-GR) protocol. Linear mixed effects models crossed the scoring protocol factor with five design factors (CDSS model, CDSS prompt configuration (document-referenced generation [DRG] vs. Baseline), rater model, prompt character, and prompt type) and estimated main effects together with their protocol interactions. Across all questions, AI raters yielded consistently higher scores within a very narrow range (74-78 points on average) under Non-GR compared to those under GR (7.69 to 49.64 points lower mean scores; 1.68 to 3.67 times wider interquartile ranges). Within each question, GR amplified the AI rater's discrimination between DRG and Baseline CDSS outputs by factors of 1.76 to 5.10, while also revealing substantial behavioral variation across rater models that Non-GR suppressed. These findings support rubric anchoring as the scoring protocol that preserves discriminative power in clinical AI evaluation; rubric-free scoring cannot substitute when questions require patient-specific or jurisdiction-specific criteria that rater models cannot infer from parametric knowledge alone.
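
To make the "amplification factor" idea concrete, here is a minimal sketch that treats discrimination as the gap in mean score between DRG and Baseline outputs and compares that gap across the two protocols. This is a simplification: the study estimates these effects with linear mixed effects models, and the scores below are made-up toy numbers, not results from the paper.

```python
from statistics import mean

def discrimination(scores_drg, scores_baseline):
    """Gap in mean rater score between DRG and Baseline outputs."""
    return mean(scores_drg) - mean(scores_baseline)

def amplification(gr_drg, gr_base, nongr_drg, nongr_base):
    """How many times larger the DRG-vs-Baseline gap is under GR than Non-GR."""
    return discrimination(gr_drg, gr_base) / discrimination(nongr_drg, nongr_base)

# Toy scores (illustrative only): Non-GR scores sit in a narrow, high band,
# while GR scores are lower and more spread out.
gr_drg, gr_base = [60, 70, 65], [40, 50, 45]
nongr_drg, nongr_base = [78, 77, 79], [74, 73, 75]

factor = amplification(gr_drg, gr_base, nongr_drg, nongr_base)
```

With these toy inputs the GR gap is 20 points and the Non-GR gap is 4 points, so the factor is 5.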

2606.03197 2026-06-03 cs.CL

MemTrain: Self-Supervised Context Memory Training

Ziheng Li, Xingrun Xing, Haoqing Wang, Zhi-Hong Deng, Yehui Tang

Affiliations: State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University; Samsung Research, Beijing, China

AI summary: Proposes MemTrain, a self-supervised framework that uses two coupled proxy tasks (end-to-end masked reconstruction and intermediate memory recall) over unlabeled Wikipedia corpora to strengthen the context-memory capability of LLM agents, significantly improving downstream long-text reasoning performance.

Abstract

Memory is an indispensable capability for long-horizon LLM agents, enabling them to preserve and utilize information accumulated across extended interactions. Existing memory-agent approaches are typically trained end-to-end with reinforcement learning on downstream tasks. However, collecting high-quality annotated problems for memory-intensive scenarios is costly, and the resulting training data often lack sufficient diversity to cover general memory behaviors. In this work, we propose MemTrain, a self-supervised training framework for generally enhancing the context-memory capability of LLM agents for more effective downstream post-training. MemTrain introduces two coupled proxy tasks over unlabeled Wikipedia corpora: (1) an end-to-end masked reconstruction objective, which requires the model to recover masked entities after multiple rounds of memory updates, thereby encouraging memory maintenance from the final outcome perspective; and (2) an intermediate memory recall objective, which requires the model to reconstruct masked historical information using intermediate memory states, encouraging faithful compression and memory completeness throughout the interaction process. The two objectives are jointly optimized using GRPO. Extensive experiments on long-text QA and search-based QA benchmarks demonstrate that MemTrain consistently improves downstream memory-intensive reasoning performance across different models, achieving gains of up to 17.67 points over direct task-specific post-training.
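
The masked reconstruction objective and its optimization with GRPO can be sketched roughly as follows. The reward here (fraction of masked entities found in the model's output) and the use of the population standard deviation for normalization are assumptions for illustration; the abstract does not give MemTrain's exact reward or normalization.

```python
from statistics import mean, pstdev

def reconstruction_reward(predicted, masked_entities):
    """Fraction of masked entities that appear in the prediction (case-insensitive)."""
    text = predicted.lower()
    hits = sum(1 for entity in masked_entities if entity.lower() in text)
    return hits / len(masked_entities)

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: each reward normalized within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Toy group of three rollouts answering the same masked passage.
masked = ["Marie Curie", "Warsaw"]
rollouts = [
    "Marie Curie was born in Warsaw.",
    "Marie Curie was born in Paris.",
    "No answer.",
]
rewards = [reconstruction_reward(r, masked) for r in rollouts]
advantages = group_relative_advantages(rewards)
```

Rollouts that recover more masked entities than the group average get a positive advantage, so no learned value model or labeled answer beyond the masked text is needed.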

2606.03188 2026-06-03 cs.RO

GeoSem-WAM: Geometry- and Semantic-Aware World Action Models

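Fulong Ma, Daojie Peng, Wenjun Yue, Jiahang Cao, Bintao Wang, Qiang Zhang, Jun Ma

Affiliations: HKUST(GZ); HKU; USTC; SDU; X-Humanoid

AI summary: Proposes the GeoSem-WAM framework, which enhances latent representations through geometric and semantic supervision and jointly captures scene dynamics, spatial geometry, and semantic context in a unified latent space; it avoids explicit future rollout or video generation at test time and improves action prediction accuracy and robustness.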

Abstract

Recent World Action Models (WAMs) have demonstrated impressive capabilities in embodied decision-making. However, whether their effectiveness stems from explicit future imagination during inference or representation learning induced by predictive training remains an open question. Emerging evidence suggests the primary advantage lies in learning robust latent representations rather than generating future observations at test time. Nevertheless, existing WAMs mainly rely on RGB-based future prediction, which provides limited structural and spatial understanding of complex environments. To address this, we propose a structured world modeling framework that enhances latent representations through geometric and semantic supervision. Alongside future RGB prediction, our model introduces two auxiliary prediction branches for future geometry and semantic representations, enabling it to jointly capture scene dynamics, spatial geometry, and semantic context within a unified latent space. Crucially, our approach preserves efficient inference by avoiding explicit future rollout or video generation at test time. Extensive experiments show that incorporating structured world supervision consistently improves action prediction accuracy, scene understanding, and robustness under challenging embodied scenarios, highlighting its potential for advancing scalable and efficient WAMs.

2606.03179 2026-06-03 cs.CL

HyperPatch: Sequential Knowledge Editing Under n-ary Structural Drift

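Yu-Kai Chan, Wen-Sheng Lien, Dong-Ting Yao, Bo-Kai Ruan, Kwan-Yeung Lin, Hong-Han Shuai, Meng-Fen Chiang

Affiliations: National Yang Ming Chiao Tung University

AI summary: Targets the structural drift induced by sequential updates to n-ary events in non-stationary environments. Proposes HyperPatch, a parameter-preserving knowledge editing framework based on stability modeling over hypergraph manifolds, which achieves relative Hop-wise Accuracy gains of 96.24% and 21.06% on the MQuAKE-CF and MQuAKE-T benchmarks, respectively.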

Comments Accepted to Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

Abstract

Large Language Models (LLMs) rely on Knowledge Editing (KE) to maintain temporal validity, yet real-world knowledge is inherently n-ary. We demonstrate that in non-stationary environments, sequential updates to complex relations induce N-ary Structural Drift, a phenomenon where the binary reification of n-ary events into triples fractures relational atomicity. This precipitates Structure-Conditioned Knowledge Transfer Failure, a systematic mis-grounding of the retriever frequently misdiagnosed as parametric hallucination. To tackle this, we propose HyperPatch, a parameter-preserving framework that reformulates sequential KE as a stability problem over hypergraph manifolds. HyperPatch preserves event integrity through three phases: (i) Structural Prior Initialization, establishing a topology-aware embedding space via contrastive learning on a Hypergraph Neural Network (HGNN) to capture high-order correlations; (ii) Sequential Topology Editing, utilizing a dual-stage mechanism that employs SimHash-based Topological Alignment for rapid conflict resolution and Topological LoRA Adaptation to track drift without backbone retraining; and (iii) Structure-Conditioned Reasoning, which integrates globally consistent evidence from fused linguistic and structural manifolds. On the MQuAKE-CF and MQuAKE-T benchmarks, HyperPatch achieves relative gains in Hop-wise Accuracy (H-Acc) of 96.24% and 21.06% over the strongest baseline, respectively. Further ablations demonstrate superior reliability under continuous n-ary update streams, whereas the standard KG-based variant suffers H-Acc collapses of up to 88.3% due to structural misalignment.
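
The abstract mentions SimHash-based alignment for fast conflict resolution. The generic SimHash technique is sketched below: each token votes on every bit of a fixed-width signature, and similar token sets end up with signatures that differ in few bits. How HyperPatch tokenizes n-ary events or thresholds the distance is not stated, so the example events and the choice of MD5 as the per-token hash are assumptions.

```python
import hashlib

def simhash(tokens, bits=64):
    """SimHash signature: per-bit majority vote over the hashes of all tokens."""
    weights = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    signature = 0
    for i, w in enumerate(weights):
        if w > 0:
            signature |= 1 << i
    return signature

def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")

# Hypothetical n-ary events flattened into role:value tokens.
old_event = ["subject:alice", "role:ceo", "org:acme", "year:2023"]
new_event = ["subject:alice", "role:ceo", "org:acme", "year:2024"]
distance = hamming(simhash(old_event), simhash(new_event))
```

A small Hamming distance would flag the two events as candidates for the same underlying fact, which is the kind of cheap pre-filter a conflict-resolution step could use before any heavier comparison.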

2606.03169 2026-06-03 cs.SD cs.LG cs.MM

SketchSong: Hierarchical Song Generation with Sketch Planning and Fine-Grained Multi-Track Modeling

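Xiaoyue Duan, Nanxing Hu, Yutang Feng, Xudong Yan, Jiatao Chen, Jinchao Zhang, Jie Zhou

Affiliations: Pattern Recognition Center, WeChat AI, Tencent Inc.

AI summary: Proposes SketchSong, a hierarchical song generation framework that uses song-level sketch planning and fine-grained multi-track modeling to address incoherent arrangements and coarse modeling of musical parts; it outperforms the baseline on objective metrics and human listening tests.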

Abstract

Recent song generation systems can synthesize realistic audio, yet generating complete songs remains challenging for two reasons. First, explicit song-level arrangement planning remains limited in existing methods, so models often need to organize overall arrangement development while generating low-level audio details. This often leads to incoherence in arrangements, such as weak section transitions and limited dynamic progression. Second, coarse modeling of different musical parts obscures their distinct roles and interactions, limiting arrangement richness of generated songs. In this paper, we present SketchSong, a hierarchical song generation framework that addresses these issues through song-level sketch planning and fine-grained multi-track modeling. Along the temporal dimension, SketchSong first predicts a compact sequence of high-level sketch tokens derived from compressed audio representations, and then generates audio tokens conditioned on these sketches. This coarse-to-fine process gives the model an explicit arrangement plan before detailed audio generation. Along the track dimension, SketchSong explicitly models four tracks, i.e., vocals, bass, drums and other instruments. This enables the model to capture the roles and interactions of different musical parts more precisely. Experiments on song generation benchmarks show that SketchSong consistently outperforms our baseline on both objective metrics and human listening tests. Despite not employing additional post-training for preference optimization such as lyrics and text-prompt alignments, SketchSong achieves competitive results against strong, post-trained open-source systems, demonstrating the effectiveness of our overall design.

2606.03168 2026-06-03 cs.CV

JAVEDIT: Joint Audio-Visual Instruction-Guided Video Editing with Agentic Data Curation

Yinan Chen, Chuming Lin, Zhennan Chen, Yuxiang Zeng, Junwei Zhu, Yali Bi, Xijie Huang, Chengming Xu, Donghao Luo, Zhucun Xue, Xiaobin Hu, Chengjie Wang, Yong Liu, Jiangning Zhang, Shuicheng Yan

Affiliations: Zhejiang University; Tencent Youtu Lab; Nanjing University; University of Auckland; Fudan University; National University of Singapore

AI summary: To address the lack of datasets and benchmarks for joint audio-visual editing, proposes the first large-scale, high-quality dataset JAVEdit-100k, the benchmark JAVEditBench, and the baseline model JAVEdit, which outperforms all baselines on five of six metrics.

Comments Equal contributions from first two authors. Project page: https://ryanchenyn.github.io/projects/JAVEdit Code: https://github.com/RyanChenYN/JAVEdit Dataset: https://huggingface.co/datasets/Coraxor/JAVEdit-100k

Abstract

While instruction-based video editing has seen significant progress, joint audio-visual editing remains constrained by the absence of dedicated datasets and benchmarks. To bridge this gap, we present JAVEdit-100k, the first large-scale, high-quality dataset tailored for instruction-guided joint audio-visual editing. Focusing on human-centric videos, JAVEdit-100k comprises approximately 100K editing triplets spanning five distinct categories, including subject editing and speech editing. This dataset is rigorously constructed via four meticulously designed generation pipelines, seamlessly paired with an agent-in-the-loop quality control mechanism. Furthermore, to address the lack of standardized evaluation within the field, we introduce JAVEditBench, a comprehensive benchmark featuring curated source videos and human-aligned instructions across all editing categories. Finally, we propose JAVEdit, a pioneering baseline model for instruction-guided joint audio-visual editing. Experiments show that JAVEdit outperforms all baselines on five of six evaluation metrics.

2606.03165 2026-06-03 cs.CL cs.AI

Fully Automated Identification of Lexical Alignment and Preference-Stage Shifts in Large Language Models

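Thomas Stephan Juzek, Xiaoyang Ming, Jose A. Hernandez

Affiliations: University of Washington

AI summary: Proposes two evaluation metrics that need no manual intervention, the Lexical Alignment Score and the Triangulated Preference Shift, to automatically identify lexical overuse in large language models and its link to human preference learning.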

Comments 16 pages, 2 figures, 10 tables

DOI: 10.63317/4ut7ammh7z3h
Journal ref: Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), pages 6116-6131

Abstract

The language used by digital chat assistants such as ChatGPT can diverge from human expectations (misalignment). Research, mostly on Scientific English, has described both what divergences occur and, to some extent, why, linking them to the training stage of human preference learning. Yet, existing approaches rely on manual curation. This paper introduces two curation-free, assumption-light evaluation metrics: the Lexical Alignment Score, which identifies lexical overuse, and the Triangulated Preference Shift, which quantifies how much of such shifts can be attributed to human preference learning. Using PubMed abstracts, continuations were generated and measured using windowed document prevalence across six model families (Falcon, Gemma, Llama, Mistral, OLMo, Yi). The procedure identifies, without manual intervention, overused items such as 'suggest', 'additionally', and 'strategy', and estimates their link to preference learning. Our findings replicate prior work and remain stable across parameter settings, random seeds, and evaluation on further data. The approach scales readily and enables systematic study of lexical (mis)alignment beyond Scientific English and across languages, and as such, the metrics have the potential to contribute to improved alignment for future models and understanding of its origins.
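
Document prevalence, the quantity the abstract says was measured, is the share of documents that contain a given word at least once. The sketch below compares that share between model continuations and human text. It is not the paper's Lexical Alignment Score: the abstract does not define the score or the windowing, and the toy sentences and the `prevalence_ratio` helper are assumptions for illustration.

```python
import re

def document_prevalence(docs, word):
    """Fraction of documents containing `word` as a whole word (case-insensitive)."""
    pattern = re.compile(r"\b" + re.escape(word.lower()) + r"\b")
    hits = sum(1 for doc in docs if pattern.search(doc.lower()))
    return hits / len(docs)

def prevalence_ratio(model_docs, human_docs, word):
    """Model prevalence divided by human prevalence; inf if humans never use the word."""
    human = document_prevalence(human_docs, word)
    model = document_prevalence(model_docs, word)
    return float("inf") if human == 0 else model / human

# Toy corpora (illustrative only, not PubMed data).
human_docs = [
    "We suggest a new strategy.",
    "Results were mixed.",
    "The trial ended early.",
    "Patients improved.",
]
model_docs = [
    "Additionally, the results suggest a benefit.",
    "These findings suggest a strategy.",
    "Additionally, we observe gains.",
    "Outcomes suggest improvement.",
]
ratio = prevalence_ratio(model_docs, human_docs, "suggest")
```

A ratio well above 1 marks a word as a candidate for overuse; ranking all vocabulary items by such a ratio is one way the selection could run without manual curation.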

2606.03160 2026-06-03 cs.CV

SRENet: Spectral Re-Entry Network for Point Cloud Action Recognition

Qiuxia Wu, Jiarui Lan, Wenxiong Kang, Zhiyong Wang, Kun Hu

Affiliations: School of Software Engineering, South China University of Technology; School of Automation Science and Engineering, South China University of Technology; School of Computer Science, University of Sydney; School of Science, Edith Cowan University

AI summary: Proposes SRENet, which uses spectral decomposition and re-entry blocks to learn global context and fine-grained temporal dynamics from a frequency perspective for action recognition on point cloud sequences.

Comments 13 pages, 11 figures. Accepted by IEEE Transactions on Circuits and Systems for Video Technology

DOI: 10.1109/TCSVT.2026.3695515

Abstract

Recognizing human actions from point cloud sequences is critical for 3D perception driven applications such as autonomous driving and human-computer interaction. However, the irregular structure and temporal inconsistency of point clouds pose unique challenges for spatio-temporal representation learning, especially in capturing both global motion context and fine-grained temporal dynamics. We propose SRENet, a spectral-aware framework designed to explicitly learn both global context and fine-grained temporal dynamics of motion from a frequency perspective for action recognition. SRENet introduces a Spectral Decomposition Block (SDeBlock) that performs wavelet-based analysis along temporal and spatial axes, disentangling features into low- and high-frequency components with frequency-specific attention. To recover residual dynamics and re-align temporal frequency structures distorted during semantic fusion, a Spectral Re-entry Block (SReBlock) performs secondary temporal decomposition. Furthermore, a spectral-aware learning strategy is devised to enhance discriminability in both frequency subspaces via contrastive loss and a curriculum schedule that gradually shifts focus from low- to high-frequency spaces in line with coarse to detailed motion patterns. Extensive experiments on MSR-Action3D, NTU-RGBD and NTU-RGBD120 demonstrate that SRENet achieves state-of-the-art performance, validating the effectiveness of frequency modeling in point cloud-based action understanding.
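
The low/high-frequency split that the abstract describes can be illustrated with a one-level Haar wavelet transform along the time axis. The abstract says only "wavelet-based", so the Haar basis, the scalar per-frame feature, and the function names here are assumptions; the actual blocks operate on learned point cloud features with attention.

```python
import math

def haar_step(signal):
    """One-level Haar transform: returns (low-frequency, high-frequency) halves."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = 1.0 / math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return low, high

def haar_inverse(low, high):
    """Rebuild the original signal from its low and high components."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for l, h in zip(low, high):
        out.append((l + h) * s)
        out.append((l - h) * s)
    return out

# Toy per-frame feature over 4 time steps: a slow trend plus a fast jitter.
frames = [4.0, 2.0, 5.0, 9.0]
low, high = haar_step(frames)
reconstructed = haar_inverse(low, high)
```

The `low` half tracks the coarse motion trend and the `high` half keeps frame-to-frame changes, which matches the coarse-to-detailed ordering the curriculum schedule follows.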

2606.03157 2026-06-03 cs.AI

ClinicalMC: A Benchmark for Multi-Course Clinical Decision-Making with Large Language Models

Ruihui Hou, Siyi Zhu, Ziyue Huai, Guangya Yu, Yongqi Fan, Chunming Wang, Tong Ruan

Affiliations: East China University of Science and Technology, Shanghai, China; Renji Hospital Affiliated to Shanghai Jiaotong University School of Medicine, Shanghai, China

AI summary: Proposes the ClinicalMC benchmark of multi-stage samples and uses a multi-agent evaluation framework to test the clinical decision-making ability of large language models under single-turn static and multi-turn dynamic settings.

Abstract

Large language models (LLMs) have been widely adopted in healthcare, yet they still encounter significant challenges in complex clinical decision-making scenarios. Existing benchmarks primarily assess LLM performance in single-course settings and lack systematic evaluation in multi-course scenarios, where a patient's condition evolves over time. To address this gap, we propose ClinicalMC, a benchmark for multi-course clinical decision-making. It includes 1,275 Chinese and 5,804 English samples across four stages from admission to discharge. These stages cover triage, first-course examination/diagnosis/treatment, subsequent multi-course examination/assessment/treatment, and final diagnosis. In ClinicalMC, patients in the English dataset undergo an average of 5.11 clinical courses, whereas those in the Chinese dataset undergo 3.42. To assess LLM performance, we construct a multi-agent evaluation framework that includes patient, examiner, and doctor agents. Based on the benchmark and framework, we design two experimental settings -- a single-turn static setting and a multi-turn dynamic setting -- and assess three categories of LLMs: 1) closed-source LLMs like GPT5-mini; 2) open-source LLMs like DeepSeek-V3.2; and 3) medical LLMs like HuatuoGPT-o1. Through extensive evaluation, we aim to better understand LLM performance in the medical domain and support its effective deployment in healthcare.

2606.03156 2026-06-03 cs.CL

A cross-domain tropical species dataset with Chinese vernacular names and CITES source links

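Jeff Wang

Affiliations: NEXLY LLC, United States

AI summary: Builds a cross-domain dataset covering 410,499 active tropical species that joins taxonomic identifiers from multiple sources and adds a cross-domain ontology, a Chinese vernacular layer (99.50% coverage), and a CITES source-linkage layer; reports the results of a preliminary internal review.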

Comments 25 pages, 4 figures, 4 tables. Dataset descriptor for the Tropical Species Encyclopedia. Companion to the methodology paper arXiv:2606.00994. Dataset deposited at Zenodo (doi:10.5281/zenodo.20377811); canonical preprint-of-record at Zenodo (doi:10.5281/zenodo.20424981)

Abstract

We describe a versioned cross-domain dataset of 410,499 active tropical species (working snapshot 2026-04-20) spanning three applied subdomains -- tropical_plants, tropical_aquatic, and tropical_pets -- that share a commercial and regulatory life cycle but are distributed across kingdom-organised biodiversity infrastructures. The resource joins taxonomic identifiers from GBIF, Plants of the World Online, iNaturalist, NCBI Taxonomy, the Catalogue of Life and the Encyclopedia of Life, and adds three original layers: a cross-domain ontology that re-segments taxa along trade and husbandry contexts; a Chinese vernacular layer with explicit per-name provenance under a typology that excludes unverified machine-generated proposals; and a CITES source-linkage layer connecting each taxon to its Species+ entry. Chinese vernacular coverage -- the proportion of taxa carrying a CJK Chinese name distinct from the scientific binomial -- reaches 99.50 percent (408,456 of 410,499; full-population count). Coverage characterises completeness, not name-translation accuracy; the latter is bounded by the four-level provenance typology and is the subject of a preliminary internal review reported here, with a blind external audit identified as the principal open item. Upstream content is referenced by stable identifier only for the original-contribution layers, supporting CC-BY 4.0 reuse. The dataset is deposited on Zenodo (10.5281/zenodo.20377811). This preprint is the canonical v1.0 description of the dataset's current state; future Data Descriptor submission is anticipated but is contingent on the validation and release-engineering items listed in the Limitations.
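
The coverage figure can be checked directly from the two counts given in the abstract. The variable names are arbitrary; only the counts come from the source.

```python
# Counts quoted in the abstract.
named_taxa = 408_456    # taxa carrying a Chinese name distinct from the binomial
total_taxa = 410_499    # active tropical species in the working snapshot

coverage = named_taxa / total_taxa
unnamed_taxa = total_taxa - named_taxa

coverage_text = f"{coverage:.2%}"
```

The ratio rounds to the reported 99.50 percent, which leaves 2,043 taxa without a distinct Chinese name.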

2606.03148 2026-06-03 cs.CV

$A^2$: Smaller Self-Supervised ViTs Localize Better than Larger Ones

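Sreehari Rammohan, Huy Ha, Carl Vondrick

Affiliations: Columbia University; Stanford University

AI summary: To reconcile foreground localization with rich representations in visual classification, proposes $A^2$, which decouples localization (small model) from embedding (large model) and uses pretrained features to reach competitive performance without additional training.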

Abstract

Robust visual classification often depends on localizing the main foreground objects in an image while ignoring contextual distractors. Surprisingly, we find that the attention maps of smaller self-supervised ViTs localize foreground objects better than those of larger ViTs. However, we still need large ViTs, because they extract richer representations from each patch. To get the best of both worlds, good localization and rich representations, we propose $A^2$, a simple method that leverages this inverse scaling finding by decoupling where to look (a small attention model) from what to extract (a large embedding model): we crop around the attention peaks of a small model and embed the crops with a larger model. $A^2$ uses entirely pretrained features, requires no group labels, and does not require per-dataset attention or backbone training. Across 5 benchmarks, $A^2$ is competitive with backbone-matched loss-level methods like DFR, and outperforms end-to-end attention training under stronger distribution shifts.
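
The "where to look" half of the method can be sketched as: find the peak of a patch-level attention map from the small model, then compute a pixel crop centered on that patch. The sketch handles a single peak and clamps the crop to the image; the patch size, crop size, and function names are assumptions, and the step of embedding the crop with a larger model is left out.

```python
def attention_peak(attn):
    """(row, col) of the largest value in a 2-D patch-level attention map."""
    best_value, best_pos = float("-inf"), (0, 0)
    for r, row in enumerate(attn):
        for c, value in enumerate(row):
            if value > best_value:
                best_value, best_pos = value, (r, c)
    return best_pos

def crop_box(peak, patch_size, crop_size, image_h, image_w):
    """Pixel box (top, left, bottom, right) centered on the peak patch, kept inside the image."""
    r, c = peak
    center_y = r * patch_size + patch_size // 2
    center_x = c * patch_size + patch_size // 2
    top = min(max(center_y - crop_size // 2, 0), image_h - crop_size)
    left = min(max(center_x - crop_size // 2, 0), image_w - crop_size)
    return top, left, top + crop_size, left + crop_size

# Toy 4x4 attention map for a 224x224 image split into 56-pixel patches.
attn = [
    [0.01, 0.02, 0.01, 0.03],
    [0.02, 0.10, 0.05, 0.02],
    [0.03, 0.90, 0.20, 0.01],
    [0.01, 0.04, 0.02, 0.01],
]
peak = attention_peak(attn)
box = crop_box(peak, patch_size=56, crop_size=112, image_h=224, image_w=224)
```

Because both steps only read pretrained outputs, nothing here needs gradient updates, which is consistent with the training-free claim in the abstract.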

2606.03144 2026-06-03 cs.AI

GTBench: A Curriculum-Grounded Benchmark for Evaluating LLMs as Mathematical Research Assistants in Graph Theory

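Noujoud Nader, Ibrahem Aljabea, Patrick Diehl, Deepti Gupta

Affiliations: Louisiana State University; Los Alamos National Laboratory; Texas A&M-Central Texas

AI summary: Proposes the GTBench benchmark, which evaluates the mathematical reasoning of large language models on three groups of graph theory problems of increasing difficulty (undergraduate definitions, algorithmic reasoning, graduate proofs); finds that GPT-5 performs best while other models degrade substantially with difficulty, and reveals systematic disagreement between human evaluators and the automated judge.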

Comments 19 pages, 5 figures, 7 tables

Abstract

Large language models (LLMs) are increasingly used as self-study assistants in technical disciplines, yet their reliability as mathematical reasoning assistants remains poorly understood. We introduce GTBench, a curriculum-grounded benchmark for evaluating LLMs as mathematical research assistants in graph theory, comprising 63 problems organized into three groups of increasing difficulty: undergraduate definitions and basic properties (Group 1), algorithm tracing and structural reasoning (Group 2), and graduate-level proof construction (Group 3). Problems are sourced from verified academic materials including Diestel's Graph Theory. We evaluate five frontier models -- GPT-5, Claude Sonnet 4.6, Gemini 2.5 Flash-Lite, Llama 3.3 70B, and Mistral Large 3 -- under zero-shot and chain-of-thought prompting, using exact-match and LLM-as-judge evaluation for Groups 1 and 2, and a hybrid human expert and LLM-as-judge protocol for Group 3. Our results reveal a pronounced performance hierarchy: GPT-5 approaches ceiling on Group 1 (95.8% zero-shot) and maintains meaningful accuracy on graduate proofs (82%), while all other models degrade substantially with difficulty, with Llama achieving 0% under human evaluation on Group 3 zero-shot. Failure mode analysis shows that correct algorithm, wrong execution errors dominate Groups 1 and 2, while Group 3 additionally surfaces incomplete reasoning failures and reveals systematic disagreement between human evaluators and the automated judge, particularly on verbose or near-complete proofs (kappa = 0.48-0.83 across human pairs). GTBench provides the first curriculum-grounded evaluation framework for graph-theoretic reasoning in LLMs, with direct implications for the governance of AI tools in mathematical education and scientific research.
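
The inter-rater agreement figures are reported as kappa values. A minimal sketch of Cohen's kappa for two raters follows; the abstract does not say which kappa variant was used, so Cohen's is an assumption, and the pass/fail labels are toy data.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)

# Toy grades for ten proofs (1 = accepted, 0 = rejected).
rater_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
rater_2 = [1, 1, 0, 0, 1, 0, 0, 1, 0, 1]
kappa = cohens_kappa(rater_1, rater_2)
```

Here the raters agree on 8 of 10 proofs while chance agreement is 0.5, giving a kappa of about 0.6, inside the 0.48-0.83 range the abstract reports for human pairs.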

2606.03143 2026-06-03 cs.LG cs.CL

FederatedSkill: Federated Learning for Agentic Skill Evolution

FederatedSkill: 面向智能体技能演化的联邦学习

Jingbo Yang, Guanyu Yao, Yang Zhang, Ramana Rao Kompella, Gaowen Liu, Shiyu Chang

发表机构 * UC Santa Barbara（加州大学圣巴巴拉分校）； MIT-IBM Watson AI Lab（麻省理工-IBM Watson人工智能实验室）； Cisco Research（思科研究）

AI总结提出FederatedSkill框架，通过语义技能差异作为通信单元，在保护隐私的同时实现个性化技能演化，相比自演化基线成功率提升44.4%，计算成本降低37.5%。

AI中文摘要

现代LLM智能体越来越依赖技能库来处理复杂任务，使得技能演化成为自我改进的主要驱动力。然而，孤立的单用户任务流缺乏构建全面技能所需的多样性。虽然跨用户协作可以克服这一数据瓶颈，但当前的轨迹共享方法会损害用户隐私，并强加一个统一的全局库，无法适应客户端的异质性。我们引入了FederatedSkill，一个用于协作智能体演化的隐私保护框架。FederatedSkill超越了原始轨迹共享，利用语义技能差异（即对本地库的结构化补丁）作为通信的基本单位。在服务器端，一个演化智能体聚合这些补丁，动态建模客户端特定的能力边界，促进严格个性化的技能演化，而不是次优的全局平均。在20个不同的智能体任务族上评估，FederatedSkill相比自演化基线表现出显著提升，成功率最高提高44.4%，计算成本降低37.5%。

英文摘要

Modern LLM agents increasingly rely on skill libraries to handle complex tasks, making skill evolution a primary driver of self-improvement. However, isolated single-user task streams lack the diversity required to build comprehensive skills. While cross-user collaboration can overcome this data bottleneck, current trajectory-sharing approaches compromise user privacy and impose a uniform global library that fails to accommodate client heterogeneity. We introduce FederatedSkill, a privacy-preserving framework for collaborative agent evolution. Moving beyond raw trajectory sharing, FederatedSkill utilizes semantic skill diffs, structured patches over local libraries, as the fundamental unit of communication. On the server side, an evolution agent aggregates these patches to dynamically model client-specific capability boundaries, facilitating strictly personalized skill evolution rather than a suboptimal global average. Evaluated across 20 distinct agent task families, FederatedSkill demonstrates substantial gains over self-evolving baselines, achieving up to a 44.4% increase in success rate and a 37.5% reduction in computational cost.

2606.03142 2026-06-03 cs.CV

Disentangling Visual and Factual Correctness in LVLMs' Visualization Literacy

解构LVLMs可视化素养中的视觉与事实正确性

Soohyun Lee, Jaeyoung Kim, Seokhyeon Park, Sihyeon Lee, Jiwon Song, Bohyoung Kim, Hyunjoo Song, Jinwook Seo

发表机构 * Seoul National University（首尔国立大学）； MADI Co., Ltd.（MADI公司）； Hankuk University of Foreign Studies（韩国外国语大学）； Soongsil University（崇实大学）

AI总结提出框架分离视觉正确性与事实正确性，通过反事实测试和仲裁指标揭示LVLMs在可视化素养评估中依赖事实记忆而非视觉推理的问题。

Comments Under review at IEEE Transactions on Visualization and Computer Graphics (TVCG). 23 pages, 9 figures

AI中文摘要

大型视觉语言模型（LVLMs）展现出强大的可视化解释能力，但尚不清楚其响应是否反映对视觉证据的真实推理，还是训练中习得的事实先验。当前评估混合了这两种来源，掩盖了正确视觉解释被记忆事实覆盖的情况。我们提出了一个将视觉正确性与事实正确性分离的框架，揭示了现有可视化素养评估的有效性局限。在针对15个最先进LVLMs的三个实验中：（1）多个模型在标准测试（VLAT）上达到人类水平，但这可能反映事实回忆而非视觉理解，而随机数据测试（reVLAT）在正确视觉解释被事实先验取代时低估了素养。（2）使用我们的反事实可视化素养评估测试（CVLAT）和能力归一化仲裁指标，我们根据视觉-事实依赖指数（VFRI）的符号对模型进行分类，揭示了以视觉为导向的多数和以事实知识为导向的少数，尽管几个接近零的情况需要谨慎。在相同反事实项目上的人类基线（N=30）证实，人们在冲突时绝大多数遵循图表，提供了人类参考点。（3）基于提示的干预可以改变优先级，但其有效性高度依赖模型且方向不对称，高图表阅读能力不能预测提示可控性。总体而言，高可视化准确性不足以证明忠实的视觉推理：可靠地集成到视觉分析中不仅需要评估可视化素养，还需要评估模型在视觉证据和事实先验分歧时如何仲裁。基准和代码：https://github.com/JaeyoungKim-HCIL/CVLAT

英文摘要

Large Vision-Language Models (LVLMs) show strong visualization interpretation, yet it is unclear whether their responses reflect genuine reasoning over visual evidence or factual priors learned during training. Current evaluations mix these two sources, obscuring when correct visual interpretation is overridden by memorized facts. We present a framework that isolates visual correctness from factual correctness, revealing validity limitations in existing visualization literacy assessments. Across three experiments with 15 state-of-the-art LVLMs: (1) several models reach human-level performance on standard tests (VLAT), but this may reflect factual recall rather than visual understanding, while randomized-data tests (reVLAT) underestimate literacy when correct visual interpretation is superseded by factual priors. (2) Using our Counterfactual Visualization Literacy Assessment Test (CVLAT) with capability-normalized arbitration metrics, we classify models by the sign of their visual-factual reliance index (VFRI), revealing a visualization-oriented majority and a factual knowledge-oriented minority, though several near-zero cases warrant caution. A human baseline (N=30) on the same counterfactual items confirms that people overwhelmingly follow the chart under conflict, providing a human reference point. (3) Prompt-based intervention can shift prioritization, but its effectiveness is highly model-dependent and direction-asymmetric, and high chart-reading capability does not predict prompt-controllability. Overall, high visualization accuracy is not sufficient evidence of faithful visual reasoning: reliable integration into visual analytics requires evaluating not only visualization literacy but also how models arbitrate between visual evidence and factual priors when the two diverge. Benchmark and code: https://github.com/JaeyoungKim-HCIL/CVLAT

2606.03137 2026-06-03 cs.AI

Think-Before-Speak: From Internal Evaluation to Public Expression in Multi-Agent Social Simulation

Think-Before-Speak: 从内部评估到多智能体社会模拟中的公开表达

Kaiqi Yang, Tai-Quan Peng, Sanguk Lee, Hui Liu

发表机构 * Michigan State University（密歇根州立大学）； Hankuk University of Foreign Studies（韩国外国语大学）

AI总结提出TBS框架，通过分离智能体的内部推理与公开话语生成，模拟从内部评估到公开表达的路径，并在气候政策讨论中验证其机制敏感性。

AI中文摘要

基于LLM的多智能体模拟为研究社会互动、审议和集体意见动态提供了一种有前景的方法。然而，许多现有的对话模拟框架主要将互动表示为可观察的轮次交换或聚合输出，使得沉默、说话意图和公开表达背后的内部评估过程难以考察。我们引入了TBS（Think-Before-Speak），一种基于间隔的多智能体模拟框架，将智能体的私人推理与公开话语生成分离。在每个间隔，所有智能体基于共享的对话历史及其自身记忆更新结构化的内部状态。这些状态包括与失调相关的评估、感知的意见气候、感知的孤立风险、回应策略和说话意愿。然后，协调器解决竞争的说话意图，并将一个话语提交到公共对话中，允许内部评估和公共互动随时间共同演化。我们在模拟的关于气候相关政策问题的市政厅讨论中评估了TBS。结果表明，TBS产生连贯的内部状态轨迹，并且这些轨迹在轮次分配、沉默和记忆条件下系统地变化。与失调相关的评估增加了智能体的说话意愿，而沉默压力评估则降低了它。一旦形成说话意图，公开表达主要由轮次分配规则塑造。这些发现表明，TBS通过使从内部评估到公开表达的路径可观察和可分析，支持机制敏感的社会模拟。

英文摘要

LLM-based multi-agent simulation offers a promising way to study social interaction, deliberation, and collective opinion dynamics. However, many existing dialogue simulation frameworks represent interaction mainly as observable turn exchange or aggregated outputs, leaving the internal evaluative processes behind silence, speaking intention, and public expression difficult to examine. We introduce TBS (Think-Before-Speak), an interval-based multi-agent simulation framework that separates agents' private reasoning from public utterance generation. At each interval, all agents update structured internal states based on the shared dialogue history and their own memory. These states include dissonance-related appraisal, perceived opinion climate, perceived isolation risk, response strategy, and willingness to speak. The orchestrator then resolves competing speaking intentions and commits one utterance to the public dialogue, allowing internal evaluation and public interaction to co-evolve over time. We evaluate TBS in simulated town hall discussions on a climate-related policy issue. Results show that TBS produces coherent internal-state traces and that these traces vary systematically across turn-allocation, silence, and memory conditions. Dissonance-related appraisal increases agents' willingness to speak, whereas silence-pressure appraisal decreases it. Once speaking intention is formed, public expression is shaped mainly by turn-allocation rules. These findings suggest that TBS supports mechanism-sensitive social simulation by making the pathway from internal evaluation to public expression observable and analyzable.

2606.03135 2026-06-03 cs.AI

Uncertainty-Aware Clarification in LLM Agents with Information Gain

基于信息增益的LLM智能体不确定性感知澄清

Mengyi Deng, Zhiwei Li, Xin Li, Tingyu Zhu, Ying Zhao, Zhijiang Guo, Wei Wang

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结针对用户指令不明确导致LLM智能体工具操作错误的问题，提出一种以信息增益奖励为导向的澄清框架，通过贝叶斯信念更新量化澄清问题的效用，训练智能体生成高信息增益的澄清，在τ-Bench环境中将任务成功率提升3.7%，仅增加0.3个交互步骤。

Journal ref: ICML 2026

AI中文摘要

大型语言模型（LLM）智能体通常在未明确说明的用户指令下运行，其中关于用户意图的潜在不确定性会导致错误的工具操作。为了解决这一挑战，我们提出了一种目标导向的澄清框架，将澄清行为与歧义消除对齐。我们方法的核心是信息增益奖励，这是一种通过测量由澄清交互引起的对真实目标贝叶斯信念更新来量化澄清问题效用的指标。我们使用该奖励训练澄清器（LLM），以优化高信息增益，确保澄清有效减少不确定性并提高智能体-工具-用户环境中的任务完成度。我们在一个增强澄清的τ-Bench环境中验证了我们的框架，并在五个异质骨干网络上进行了跨智能体评估。实验结果表明，与无澄清基线相比，我们的方法一致地将成功率提高了3.7%，同时平均仅增加了0.3个总交互步骤。

英文摘要

Large Language Model (LLM) agents often operate under underspecified user instructions, where latent uncertainty over user intent leads to erroneous tool actions. To address this challenge, we propose a goal-oriented clarification framework that aligns clarification behavior with ambiguity resolution. Central to our approach is the Information Gain Reward, a metric that quantifies the utility of clarification questions by measuring the Bayesian belief update towards the ground-truth goal induced by the clarification exchange. We train the clarifier (LLM) using this reward to optimize for high information gain, ensuring that clarifications effectively reduce uncertainty and improve task completion within the agent-tool-user environment. We validate our framework within a clarification-enhanced τ-Bench environment, conducting cross-agent evaluations across five heterogeneous backbones. Empirical results demonstrate that our method consistently improves the success rate by 3.7% over the no-clarification baseline, while adding only 0.3 total interaction steps on average.
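
The abstract describes the Information Gain Reward only as a Bayesian belief update towards the ground-truth goal. One way to read that, assumed here and not taken from the paper, is the log-probability gained on the true goal after the user's answer. The goal names, likelihood values, and function names below are hypothetical.

```python
import math

def bayes_update(prior, likelihood):
    """Posterior over goals, given P(observed answer | goal) for each goal."""
    unnormalized = {g: prior[g] * likelihood[g] for g in prior}
    evidence = sum(unnormalized.values())
    if evidence == 0.0:
        raise ValueError("observed answer has zero probability under every goal")
    return {g: p / evidence for g, p in unnormalized.items()}

def information_gain_reward(prior, posterior, true_goal):
    """Assumed reward: log-probability gained on the true goal by the exchange."""
    return math.log(posterior[true_goal]) - math.log(prior[true_goal])

# Hypothetical belief over four user intents before any clarification.
prior = {"refund": 0.25, "exchange": 0.25, "cancel": 0.25, "track": 0.25}
# Hypothetical probability of the user's answer under each intent.
likelihood = {"refund": 0.9, "exchange": 0.1, "cancel": 0.1, "track": 0.1}

posterior = bayes_update(prior, likelihood)
reward = information_gain_reward(prior, posterior, "refund")
```

Under this reading, a question whose answer leaves the belief unchanged earns zero reward, and a question that moves probability away from the true goal earns a negative one.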

2606.03134 2026-06-03 cs.RO cs.LG

How Visible Are Silent Manipulation Failures? An Observability Study of False-Success Detection in Simulated Robot Episodes

无声操作失败的可见性：模拟机器人任务中假成功检测的可观测性研究

Aarav Bedi

发表机构 * Aarav Bedi

AI总结本研究通过模拟双机械臂ALOHA任务，探讨机器人自身成功检测器标记为成功的任务中，假成功（实际失败但被误判为成功）的可恢复性，发现基于关节数据的检测器在方块转移任务中几乎完全可恢复假成功，而在插销任务中仅部分可恢复，视觉检测器可弥补差距，且可分离性依赖于远低于实际传感器噪声的速度差异。

Comments 4 pages, 3 figures

AI中文摘要

模仿学习策略用于机器人操作时，其训练任务的成功标签质量取决于机器人自身的成功检测器。一种特别有害的错误是假成功：机器人记录为成功但实际任务结果错误的任务。我们针对这些任务提出一个狭窄但实际的问题：一旦任务被标记为成功，推翻该标签所需的信息有多少存在于本体感觉中，又有多少需要视觉？我们在两个双机械臂ALOHA任务上构建模拟测试平台，通过环境扰动而非标签编辑诱发失败，利用检测器从未见过的特权模拟器状态标记每个任务，仅保留机器人标记为成功的任务。然后，我们将限制于本体感觉的检测器与基于视觉的检测器进行比较。我们发现可恢复性范围广泛：在方块转移任务中，假成功几乎完全可从关节数据中恢复，而在插销插入任务中，本体感觉仅恢复部分假成功，视觉检测器则弥补了大部分差距。我们还表明，我们测量的本体感觉可分离性依赖于远低于任何实际传感器噪声水平的速度差异，因此最好将其视为无噪声模拟器夸大的乐观上限。我们发布了生成和评估流程。

英文摘要

Imitation-learning policies for robot manipulation inherit the quality of the success labels attached to their training episodes, and those labels are usually produced by the robot's own success check. A particularly damaging error is the false success: an episode the robot logs as a success when the task outcome was actually wrong. We ask a narrow but practical question about these episodes. Once an episode has already been flagged as a success, how much of the information needed to overturn that label is present in proprioception, and how much requires vision? We build a simulated testbed on two bimanual ALOHA tasks, induce failures through environment perturbations rather than label edits, label every episode by privileged simulator state that the detector never sees, and keep only episodes the robot flagged as successful. We then compare detectors restricted to proprioception against a vision-based detector. We find that recoverability spans a wide range: in cube transfer the false successes are almost fully recoverable from joint data alone, while in peg insertion proprioception recovers only part of them and a vision detector closes most of the gap. We also show that the proprioceptive separability we measure rests on velocity differences far below any realistic sensor noise floor, so it is best read as an optimistic upper bound that a noiseless simulator inflates. We release the generation and evaluation pipeline.

2606.03132 2026-06-03 cs.CL

DMT-CBT: Longitudinal Therapeutic State Modeling for CBT Counseling

DMT-CBT：面向CBT咨询的纵向治疗状态建模

Chang Liu, Shuyi Zhang, Changsheng Ma, Yongfeng Tao, Minqiang Yang, Bin Hu

发表机构 * School of Information Science and Engineering, Lanzhou University（兰州大学信息科学与工程学院）

AI总结提出DMT-CBT框架，通过跨会话结构化治疗状态、多模态行为基础与工具增强干预，解决现有方法将CBT咨询简化为局部回复生成的问题，实现纵向治疗状态动态建模。

AI中文摘要

大型语言模型（LLM）在认知行为疗法（CBT）咨询中展现出日益增长的潜力。然而，现有方法大多将咨询视为局部回复生成问题，专注于短文本、仅文本或单次会话交互中的共情回复。我们认为这种表述从根本上与真实心理治疗的本质不符。在临床CBT中，治疗是一个纵向过程，治疗师需要跨会话持续推断、更新和干预不断演变的治疗状态。真实的CBT还涉及多模态推理和延迟的跨会话干预效果，要求模型在部分可观测性下捕捉纵向治疗状态的演变。我们提出DMT-CBT，一个用于CBT咨询中治疗状态动态建模的框架。DMT-CBT跨会话维护结构化的治疗状态，同时整合多模态行为基础和工具增强的干预，以支持适应性治疗推理。基于该框架，我们构建了DMTCorpus，一个合成的多会话多模态CBT咨询数据集，具有演变的治疗状态、图像基础的患者行为以及跨会话干预的连续性。实验结果表明，与事后提取方法相比，DMT-CBT提高了咨询保真度和治疗联盟，产生了更有利的纵向情感轨迹，并更忠实地保留了治疗状态。

英文摘要

Large language models (LLMs) have shown growing potential for Cognitive Behavioral Therapy (CBT) counseling. However, most existing approaches still formulate counseling as a local response generation problem, focusing on empathetic replies within short, text-only, or single-session interactions. We argue that this formulation fundamentally mismatches the nature of real psychotherapy. In clinical CBT, therapy is a longitudinal process in which therapists continuously infer, update, and intervene on evolving therapeutic states across sessions. Realistic CBT further involves multimodal inference and delayed cross-session intervention effects, requiring models to capture longitudinal therapeutic state evolution under partial observability. We propose DMT-CBT, a framework for Dynamic Modeling of evolving Therapeutic states in CBT counseling. DMT-CBT maintains structured therapeutic states across sessions while incorporating multimodal behavioral grounding and tool-augmented intervention to support adaptive therapeutic reasoning. Based on this framework, we construct DMTCorpus, a synthetic multi-session multimodal CBT counseling dataset featuring evolving therapeutic states, image-grounded client behaviors, and cross-session intervention continuity. Experimental results show that DMT-CBT improves counseling fidelity and therapeutic alliance, produces more favorable longitudinal affective trajectories, and preserves therapeutic states more faithfully than post-hoc extraction approaches.

2606.03131 2026-06-03 cs.LG

HARVE: Hacking-Aware Reward-Head Vector Editing for Robust Reward Models

HARVE：面向鲁棒奖励模型的黑客感知奖励头向量编辑

Shuang Liu, Yuxuan Bo, Qiuyang Zhao, Caiyue Huang, Xiaorong Chen, Yanguang Liu, Mengnan Du

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； University of Virginia（弗吉尼亚大学）； Harvard University（哈佛大学）； Stanford University（斯坦福大学）； University of Michigan（密歇根大学）； New Jersey Institute of Technology（新泽西理工学院）； The Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳））

AI总结针对奖励模型易受奖励黑客攻击的问题，提出无需训练的奖励头编辑方法HARVE，通过移除与黑客相关子空间对齐的奖励头向量分量，提升鲁棒性并保持通用能力。

AI中文摘要

奖励模型对于大型语言模型（LLM）对齐至关重要，但它们仍然容易受到奖励黑客攻击。为了评估奖励模型的鲁棒性，我们引入了RewardHackBench，其中包含13种奖励黑客模式，涵盖现实生活中的高风险领域和通用设置，并且我们发现八个奖励模型在特定子类别上存在严重失败。为了缓解这些失败，我们提出了HARVE，一种针对标量奖励模型的无需训练的奖励头编辑方法。HARVE不是微调奖励模型，而是从与选定黑客子类别相关的残差流方向中识别出多方向黑客子空间，并移除与该子空间对齐的奖励头向量分量。这直接降低了奖励头对黑客相关特征的敏感性，仅使用少量对比性的黄金-黑客示例，无需梯度更新或微调。在八个奖励模型上的综合实验表明，该方法提高了黑客鲁棒性，优于微调基线，并保持了奖励模型的通用能力。进一步的分析表明，奖励黑客攻击更适合被捕捉为多维残差空间结构，而不是孤立的表面线索。

英文摘要

Reward models are central to large language model (LLM) alignment, but they remain vulnerable to reward hacking. To evaluate reward-model robustness, we introduce RewardHackBench containing 13 reward-hacking patterns covering real-life high-stakes domains and general settings, and we find severe failures on specific subcategories across eight reward models. To mitigate these failures, we propose HARVE, a training-free reward-head editing method for scalar reward models. Instead of fine-tuning the reward model, HARVE identifies a multi-directional hacking subspace from residual stream directions associated with selected hacking subcategories, and removes the component of the reward-head vector aligned with that subspace. This directly reduces the reward head's sensitivity to hacking-related features using only a small set of contrastive gold-hacked examples, without gradient updates or fine-tuning. Comprehensive experiments across eight reward models indicate that HARVE improves hacking robustness, outperforms fine-tuning baselines, and preserves the reward models' general capability. Further analyses suggest that reward hacking is better captured as a multidimensional residual-space structure than by isolated surface cues.
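
Removing the component of a vector that lies in a subspace is an orthogonal projection. The sketch below shows that step in isolation, assuming the hacking directions are already available as plain vectors; how the paper derives them from the residual stream is not specified in the abstract. All names and the toy 4-dimensional vectors are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(directions, tol=1e-12):
    """Gram-Schmidt: orthonormal basis spanning the given directions."""
    basis = []
    for d in directions:
        residual = list(d)
        for b in basis:
            coeff = dot(residual, b)
            residual = [r - coeff * x for r, x in zip(residual, b)]
        norm = dot(residual, residual) ** 0.5
        if norm > tol:  # skip directions already covered by the basis
            basis.append([r / norm for r in residual])
    return basis

def remove_subspace(head, directions):
    """Return the head vector with its component in span(directions) removed."""
    edited = list(head)
    for b in orthonormal_basis(directions):
        coeff = dot(edited, b)
        edited = [e - coeff * x for e, x in zip(edited, b)]
    return edited

# Toy reward-head vector and two (non-orthogonal) hypothetical hacking directions.
head = [1.0, 2.0, 3.0, 4.0]
hacking_directions = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
edited_head = remove_subspace(head, hacking_directions)
```

After the edit, the head vector is orthogonal to every hacking direction, so a feature shift along any of them no longer changes the scalar reward; no gradient step is involved.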

2606.03130 2026-06-03 cs.LG

Synthetic Hallucinations, Real Gains: Hard Negatives from Frontier Models for FIM Hallucination Mitigation

合成幻觉，真实收益：来自前沿模型的硬负样本用于FIM幻觉缓解

Mahdi Erfanian, Nelson Daniel Troncoso, Aashna Garg, Amabel Gale, Xiaoyu Liu, Pareesa Ameneh Golnari, Shengyu Fu

发表机构 * University of Illinois Chicago（伊利诺伊大学芝加哥分校）； Microsoft（微软）

AI总结针对小型开源代码模型在IDE自动补全中产生的填充中间（FIM）幻觉问题，提出一种无需执行的替代方法：利用前沿代码模型合成看似合理但错误的补全作为硬负样本，通过对比合成幻觉与真实开发者编辑的差异作为监督微调信号，在Delulu基准上提升精确匹配18.8个百分点。

AI中文摘要

驱动IDE自动补全的小型开源代码模型仍然会输出幻觉的填充中间（FIM）补全：对项目中不存在的方法、参数、变量和导入的语法上自然的调用。现有的缓解方法要么需要每种语言的执行沙箱（在按键中途不适用），要么需要偏好优化管道（需要大量人工标注语料库）。我们提出一种无需执行的替代方案：使用前沿代码模型合成看似合理但错误的补全作为硬负样本，然后利用这些合成幻觉与真实开发者编辑之间的对比作为监督微调信号。我们的管道从公共GitHub中跨八种语言抓取多语言FIM上下文，并让一组三个前沿生成器为每个上下文针对Delulu分类法（一个经Docker验证的多语言FIM幻觉基准）中的四种幻觉类型各生成一个硬负样本，从而产生配对的选定/拒绝数据集。在10万行精选子集上微调Qwen2.5-Coder-7B-Instruct，使Delulu精确匹配提升+18.8点，编辑相似度提升+0.22，覆盖每种语言和每种类型，同时改进每个HumanEval-Infilling分割和每个SAFIM子集。同样的配方在3B模型上使Delulu提升+12.8 EM，并带有小的、特征化的一般FIM权衡。五轴消融实验（规模、类型混合、语言覆盖、基础模型家族和难度感知的愚弄率）加上头对头的SFT与DPO/ORPO比较，映射了哪些设计选择驱动了收益。我们发布完整的管道源代码——生成、愚弄率LLM评判、筛选和FIM微调配方——以便本文中的实验可以在任何许可语料库上端到端复现。

英文摘要

Small open-source code models that power IDE autocomplete still emit hallucinated Fill-in-the-Middle (FIM) completions: syntactically natural calls to methods, parameters, variables, and imports that do not exist in the surrounding project. Existing mitigations either require per-language execution sandboxes that do not apply at mid-keystroke or preference-optimisation pipelines that need large human-labelled corpora. We propose an execution-free alternative: use frontier code models to synthesise plausible-but-wrong completions as hard negatives, then leverage the contrast between these synthetic hallucinations and the ground-truth developer edit as a supervised fine-tuning signal. Our pipeline scrapes multilingual FIM contexts from public GitHub across eight languages and asks a panel of three frontier generators to produce one hard negative per context for each of four hallucination types drawn from the Delulu taxonomy, a Docker-verified multilingual FIM hallucination benchmark, yielding a paired chosen/rejected dataset. Fine-tuning Qwen2.5-Coder-7B-Instruct on a 100K-row curated subset lifts Delulu exact match by +18.8 points and edit similarity by +0.22 on every language and every type, while also improving every HumanEval-Infilling split and every SAFIM subset. The same recipe at 3B lifts Delulu by +12.8 EM with a small, characterised general-FIM trade-off. Five-axis ablations (size, type mix, language coverage, base-model family, and a difficulty-aware fool rate) plus a head-to-head SFT vs. DPO/ORPO comparison map which design choices drive the gain. We release the full pipeline source code -- generation, fool-rate LLM judging, curation, and the FIM fine-tuning recipe -- so that the experiments in this paper can be reproduced end-to-end on any permissively licensed corpus.
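
The two headline metrics, exact match and edit similarity, can be illustrated on a single completion. The abstract does not define how the benchmark computes them, so the sketch below is an assumption: exact match ignores surrounding whitespace, and edit similarity is approximated with `difflib.SequenceMatcher.ratio()` from the Python standard library, which is not a Levenshtein distance. The completions are made up.

```python
from difflib import SequenceMatcher

def exact_match(prediction, reference):
    """True when the completion equals the reference, ignoring outer whitespace."""
    return prediction.strip() == reference.strip()

def edit_similarity(prediction, reference):
    """Similarity in [0, 1] from difflib's matching-blocks ratio (assumed proxy)."""
    return SequenceMatcher(None, prediction.strip(), reference.strip()).ratio()

# Hypothetical FIM target and a plausible-but-wrong completion that calls
# a method and uses a variable that do not exist in the project.
reference = "user.get_profile(user_id)"
hallucinated = "user.fetch_profile(uid)"

em = exact_match(hallucinated, reference)
sim = edit_similarity(hallucinated, reference)
```

A hard negative of this kind scores high on surface similarity while failing exact match, which is what makes it a useful contrast to the ground-truth edit.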

2606.03127 2026-06-03 cs.RO

TTT-VLA: Test-Time Latent Prompt Optimization for Vision-Language-Action Models

TTT-VLA：面向视觉-语言-动作模型的测试时潜在提示优化

Wenbo Zhang, Jianxiong Li, Shuai Yang, Sijin Chen, Jiajun Liu, Lingqiao Liu, Xiao Ma

发表机构 * ByteDance Seed（字节跳动种子）； The University of Adelaide（阿德莱德大学）； Tsinghua University（清华大学）； Zhejiang University（浙江大学）； The University of Hong Kong（香港大学）； CSIRO Data61

AI总结提出TTT-VLA框架，通过测试时优化潜在提示来适应分布偏移，无需修改策略本身，在SimplerEnv上提升单/多实体任务成功率。

AI中文摘要

基于大规模数据训练的视觉-语言-动作（VLA）模型取得了显著进展，但在部署时仍易受分布偏移影响。最近的VLA模型表明，提示可以作为引导策略行为的有效接口，但现有的基于提示的引导通常依赖外部指导。这自然引出一个问题：能否通过优化提示来实现VLA的测试时训练（TTT），使得引导接口本身可以从交互中学习和适应？我们通过TTT-VLA来解决这个问题，这是一种基于潜在提示优化（LPO）的测试时训练框架。在训练期间，潜在提示通过额外的代理任务学习，为策略学习提供额外的可学习条件信号。在测试时，通过从当前环境收集交互数据，并仅使用代理任务的自监督信号优化这些数据上的潜在提示来执行TTT，而不修改策略本身。在SimplerEnv上的实验表明，所提方法在单实体和多实体设置中均能持续提高任务成功率。进一步分析表明，提升主要源于纠正少量关键决策，而非全局改变策略行为。这些结果表明，LPO为基础操作策略的部署时改进提供了一条有效且实用的途径。

英文摘要

Vision-Language-Action (VLA) models trained on large-scale data have made remarkable progress, but they remain vulnerable to distribution shifts at deployment time. Recent VLA models suggest that prompts can serve as an efficient interface for steering policy behavior, but existing prompt-based steering typically relies on external guidance. This raises a natural question: can test-time training (TTT) for VLA be achieved by optimizing a prompt, so that the steering interface itself can be learned and adapted from interaction? We address this question with TTT-VLA, a test-time training framework based on Latent Prompt Optimization (LPO). During training, the latent prompt is learned with an additional proxy task, providing an extra learned conditioning signal for policy learning. At test time, TTT is performed by collecting interaction data from the current environment and optimizing only the latent prompt on those data using the proxy task's self-supervised signal, without modifying the policy itself. Experiments on SimplerEnv demonstrate that the proposed method consistently improves task success rates in both single- and multi-embodiment settings. Further analysis shows that the gains arise primarily from correcting a small number of critical decisions rather than globally altering policy behavior. These results suggest that LPO provides an effective and practical pathway for deployment-time improvement of foundation manipulation policies.

2606.03125 2026-06-03 cs.LG

Rethinking Neural Width for Alternating Current Optimal Power Flow Proxies

重新思考用于交流最优潮流代理的神经网络宽度

Dhruvi Khandelwal, Anurag Basistha, Ayushi Jolotia, Parikshit Pareek

发表机构 * Department of Electrical Engineering, National Institute of Technology Kurukshetra, India（印度库鲁克谢特拉国立理工学院电气工程系）； Indraprastha Institute of Information Technology Delhi, India（印度德里英德拉普拉斯塔信息技术学院）； Department of Electrical Engineering, Indian Institute of Technology Roorkee, India（印度理工学院罗尔基分校电气工程系）

AI总结本文提出损失引导神经稠密化算法，通过逐步扩展网络容量来最小化宽度，以精确逼近交流最优潮流流形，并在多个IEEE系统上以少十倍的神经元达到与基线相当的性能。

2606.03121 2026-06-03 cs.LG

TiWeaver: Unified Temporal Dynamics Modeling via Contextual Patching

TiWeaver：通过上下文补丁实现统一的时间动态建模

Zhe Li, Jindong Tian, Hao Miao, Zhi Lei, Chenjuan Guo, Bin Yang

发表机构 * East China Normal University（华东师范大学）； Hong Kong Polytechnic University（香港理工大学）； Aalborg University（奥尔堡大学）

AI总结针对多变量时间序列中因缺失值和非均匀采样等不规则性导致的动态复杂性和通道间异步依赖问题，提出TiWeaver框架，通过图引导自适应分词器（G²AT）和细粒度异步依赖提取器（FADE）实现自适应建模，在12个数据集上取得最高25%的性能提升。

DOI: 10.1145/3770855.3817748

AI中文摘要

多变量时间序列预测在现实世界应用中扮演着关键角色，包括天气预报、股票分析和健康监测。由于数据源的多样性，时间序列表现出多样的时间动态，通常伴随着各种不规则性，如缺失值和非均匀采样频率。这些不规则性导致跨通道的复杂异步时间依赖。因此，具有固定补丁方案的单一模型往往难以很好地适应多样化的多变量时间序列，阻碍了准确预测。在本文中，我们提出了TiWeaver，一个统一框架，旨在自适应地处理时间动态和细粒度的通道间依赖。具体来说，我们引入了一个图引导自适应分词器（G²AT），通过联合考虑时间密度和表示一致性，将时间序列划分为高度上下文连贯的补丁。此外，我们提出了一个细粒度异步依赖提取器（FADE），旨在建模细粒度的异步通道间依赖，同时结合长期历史依赖。我们在12个真实世界时间序列数据集上评估了TiWeaver，它取得了最先进的性能，优于现有方法高达25%。这些结果证明了其在多样化领域和数据特征上的鲁棒性和有效性。

英文摘要

Multivariate time series forecasting plays a critical role in real-world applications, including weather prediction, stock analysis, and health monitoring. Due to the diversity of data sources, time series exhibit diverse temporal dynamics, often accompanied by various irregularities such as missing values and non-uniform sampling frequencies. Such irregularities lead to complex and asynchronous temporal dependencies across channels. Thus, a single model with a fixed patching scheme often fails to adapt well to diverse multivariate time series, hindering accurate forecasting. In this paper, we propose TiWeaver, a unified framework designed to handle temporal dynamics and fine-grained inter-channel dependencies adaptively. Specifically, we introduce a Graph-Guided Adaptive Tokenizer (G$^2$AT) that divides time series into highly contextually coherent patches by jointly considering temporal density and representation consistency. In addition, we propose a Fine-grained Asynchronous Dependency Extractor (FADE), which is designed to model fine-grained asynchronous inter-channel dependencies while incorporating long-term historical dependencies. We evaluate TiWeaver on 12 real-world time series datasets, where it achieves state-of-the-art performance, outperforming existing methods by up to 25%. These results demonstrate its robustness and effectiveness across diverse domains and data characteristics.

2606.03120 2026-06-03 cs.CV

KC-3DGS: Kurtosis-Constrained Gaussian Splatting for High-Fidelity View Synthesis

KC-3DGS: 基于峰度约束的高斯泼溅用于高保真视图合成

Vivekjyoti Banerjee, Abhay Yadav, Rama Chellappa, Aniket Roy

发表机构 * Johns Hopkins University（约翰霍普金斯大学）； NEC Labs America（NEC美国实验室）

AI总结提出KC-3DGS，通过在小波域添加多尺度对齐损失、峰度集中损失和跨频带协方差惩罚，增强3DGS的感知质量，尤其改善稀疏视图下的高频细节和结构伪影。

AI中文摘要

3D高斯泼溅（3DGS）通过将场景表示为各向异性高斯集合，并通过可微分光栅化优化，实现了实时新视图合成。然而，标准像素空间损失（L1、SSIM）仅约束整体重建误差，允许优化在频率尺度上重新分配误差。这导致过度平滑和结构伪影，尤其在监督有限的稀疏视图设置中。我们提出KC-3DGS，通过基于自然图像统计的小波域监督来增强3DGS训练。我们的方法结合了三个组件：（1）多尺度小波系数对齐损失，显式惩罚缺失的高频细节；（2）有监督的峰度集中损失，鼓励渲染图像匹配真实图像的重尾频率统计；（3）跨频带协方差惩罚，促进频率专门化。我们提供理论分析，表明像素空间损失允许在小波重分布下的一族不可区分扰动，而我们的联合目标排除了退化解。在MipNeRF360、Tanks&Temples、MVImgNet、DeepBlending和WRIVA-ULTRRA上的实验表明，感知质量持续提升。在具有挑战性的WRIVA-ULTRRA室外数据集上，KC-3DGS在DreamSim上提高了9.48%，同时改善了PSNR、SSIM和LPIPS。在仅有12张训练图像的稀疏视图设置中，我们的方法在MipNeRF360上将PSNR提高了高达0.5 dB，同时保持了感知质量。该方法作为即插即用的正则化策略，可无缝集成到现有的3DGS流程中。

英文摘要

3D Gaussian Splatting (3DGS) enables real-time novel view synthesis by representing scenes as collections of anisotropic Gaussians optimized via differentiable rasterization. However, standard pixel-space losses (L1, SSIM) constrain only aggregate reconstruction error, permitting the optimization to redistribute error across frequency scales. This leads to oversmoothing and structural artifacts, particularly in sparse-view settings where supervision is limited. We propose KC-3DGS, which augments 3DGS training with wavelet-domain supervision based on natural image statistics. Our method combines three components: (1) a multi-scale wavelet coefficient alignment loss that explicitly penalizes missing high-frequency detail, (2) a supervised kurtosis concentration loss that encourages rendered images to match the heavy-tailed frequency statistics of ground-truth images, and (3) a cross-band covariance penalty that promotes frequency specialization. We provide theoretical analysis showing that pixel-space losses admit a family of indistinguishable perturbations under wavelet redistribution, and that our joint objective excludes degenerate solutions. Experiments across MipNeRF360, Tanks&Temples, MVImgNet, DeepBlending, and WRIVA-ULTRRA demonstrate consistent improvements in perceptual quality. On the challenging WRIVA-ULTRRA outdoor dataset, KC-3DGS achieves a 9.48% improvement in DreamSim while also improving PSNR, SSIM, and LPIPS. In sparse-view settings with only 12 training images, our method improves PSNR by up to 0.5 dB on MipNeRF360 while maintaining perceptual quality. The approach integrates seamlessly into existing 3DGS pipelines as a plug-and-play regularization strategy.
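
To make the kurtosis idea concrete, the sketch below compares the kurtosis of one-level Haar detail coefficients for a sharp edge and for a blurred version of it. It is a 1D toy under stated assumptions: the squared kurtosis gap is one plausible form of a "supervised kurtosis concentration" term, not the paper's loss, and the function names and sample rows are hypothetical.

```python
def haar_detail(signal):
    """One-level Haar detail coefficients of an even-length 1D signal."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    scale = 2 ** -0.5
    return [(signal[i] - signal[i + 1]) * scale for i in range(0, len(signal), 2)]

def kurtosis(values):
    """Pearson kurtosis: fourth standardized moment (3.0 for a Gaussian)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    if var == 0.0:
        raise ValueError("kurtosis is undefined for constant input")
    return sum((v - mean) ** 4 for v in values) / (n * var * var)

def kurtosis_gap(rendered, target):
    """Assumed penalty: squared difference of detail-band kurtosis."""
    gap = kurtosis(haar_detail(rendered)) - kurtosis(haar_detail(target))
    return gap * gap

# Hypothetical image rows: a sharp edge and an oversmoothed rendering of it.
target_row = [0.0, 0.0, 0.0, 0.0, 0.0, 8.0, 8.0, 8.0]
rendered_row = [0.0, 0.0, 0.0, 2.0, 4.0, 6.0, 8.0, 8.0]
penalty = kurtosis_gap(rendered_row, target_row)
```

The sharp edge concentrates its detail energy in one coefficient (heavier tail), while the blurred row spreads it across several, so the two rows can have similar pixel error yet clearly different detail-band kurtosis.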

2606.03119 2026-06-03 cs.CV cs.AI cs.LG

GuidedBridge: Training-freely Improving Bridge Models with Prior Guidance

GuidedBridge: 无需训练地利用先验引导改进桥接模型

Zehua Chen, Yucheng Yang, Binjie Yuan, Kaiwen Zheng, Jun S. Liu, Jun Zhu

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出无需训练的先验引导方法（PG）和频率调制先验引导（FMPG），通过对比弱先验与已见先验增强桥接模型的先验利用，并设计级联框架CFG-FMPG用于图像修复，实验证明该方法能一致提升预训练桥接模型在多种图像翻译任务中的性能。

Comments ICML 2026

AI中文摘要

引导方法，如无分类器引导（CFG）和自动引导（AG），推动了扩散模型中噪声到数据生成的发展。最近，桥接模型引入了一种数据到数据的生成过程，可以利用有指导性的干净先验。在这项工作中，受先前通过去噪结果质量差异作为引导的方法启发，我们提出了一种无需训练的桥接引导方法，称为先验引导（PG）。具体来说，我们引入一个弱先验，该先验在桥接预训练期间未见，阻碍先验利用从而降低去噪结果。然后，我们将其与已见先验对比，通过缩放因子突出并增强先验利用。此外，我们分析了桥接过程中先验利用的潜在机制，并设计了频率调制先验引导（FMPG），该引导将引导尺度调整到与桥接生成动力学一致的低频和高频带。为了解决图像修复中的先验利用问题，我们开发了一个级联框架CFG-FMPG，该框架首先通过CFG生成噪声隐藏表示，然后将其作为生成先验与FMPG一起利用，在不影响推理效率的情况下发挥它们的互补优势。实验表明，我们的PG方法在多种图像翻译任务中一致地改进了预训练桥接模型。

英文摘要

Guidance methods, such as classifier-free guidance (CFG) and auto-guidance (AG), have advanced noise-to-data generation in diffusion models. Recently, bridge models have introduced a data-to-data generative process that can exploit an instructive clean prior. In this work, inspired by previous methods that use a quality difference between denoising results as guidance, we propose a training-free bridge guidance method, termed Prior Guidance (PG). Specifically, we introduce a weak prior, which is unseen during bridge pre-training, hindering prior exploitation and thereby degrading the denoising result. Then, we contrast it with the seen prior to highlight and enhance prior exploitation via a scaling factor. Moreover, we analyze the underlying mechanism of prior exploitation in the bridge process and design frequency-modulated prior guidance (FMPG), which tailors the guidance scale to low- and high-frequency bands coherent with bridge generative dynamics. To address prior exploitation in image in-painting, we develop a cascaded framework, CFG-FMPG, which first generates a noisy hidden representation via CFG and then exploits it as a generative prior with FMPG, combining their complementary strengths without compromising inference efficiency. Experiments demonstrate that our PG methods consistently improve pre-trained bridge models across diverse image translation tasks.
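
The contrast between the seen and weak priors "via a scaling factor" reads like the extrapolation used in classifier-free guidance. The sketch below assumes that form; the abstract does not give the exact update rule, and the function name and toy predictions are hypothetical.

```python
def prior_guidance(pred_seen, pred_weak, scale):
    """CFG-style extrapolation (assumed form): weak + scale * (seen - weak)."""
    if len(pred_seen) != len(pred_weak):
        raise ValueError("predictions must have the same length")
    return [w + scale * (s - w) for s, w in zip(pred_seen, pred_weak)]

# Hypothetical denoising outputs conditioned on the seen prior and the weak prior.
pred_seen = [1.0, 2.0]
pred_weak = [0.0, 1.0]

unguided = prior_guidance(pred_seen, pred_weak, 1.0)  # equals pred_seen
guided = prior_guidance(pred_seen, pred_weak, 2.0)    # pushed further from pred_weak
```

With a scale of 1 the output is the ordinary seen-prior prediction; larger scales move it further along the direction in which the seen prior improves on the weak one. A frequency-modulated variant would apply different scales to low- and high-frequency bands, which this sketch does not attempt.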

2606.03118 2026-06-03 cs.LG cs.CV q-bio.NC

Learning to See via Epiretinal Implant Stimulation in silico with Model-Based Deep Reinforcement Learning

通过基于模型的深度强化学习，在计算机模拟中学习经视网膜上植入物刺激实现视觉

Jacob Lavoie, Marwan Besrour, William Lemaire, Jean Rouat, Réjean Fontaine, Eric Plourde

发表机构 * Department of Electrical Engineering and Computer Engineering, Université de Sherbrooke（舍布鲁克大学电气与计算机工程系）

AI总结本研究提出使用各向同性和各向异性形状，通过深度强化学习在虚拟患者的视网膜上渲染可理解的图像，以提高人工恢复视觉的清晰度。

Comments 18 pages, 6 figures. Published version: Biomed. Phys. Eng. Express 10, 025006 (2024)

DOI: 10.1088/2057-1976/acf1a5
Journal ref: Biomed. Phys. Eng. Express 10 (2024) 025006

AI中文摘要

目标：年龄相关性黄斑变性和视网膜色素变性等疾病会导致感光层退化。恢复视力的一种方法是通过微电极阵列（如视网膜上植入物）电刺激存活的视网膜神经节细胞。已知视网膜上植入物会产生沿邻近视网膜神经节细胞轴突束延伸的可见各向异性形状。最近的研究表明，为了获得各向同性的像素状形状，可以通过失活电极或降低刺激电流水平来映射轴突束并避免刺激它们。避免轴突束刺激旨在去除类似笔触的形状，转而采用更简化的像素状形状集合。方法：在本研究中，我们提出使用各向同性和各向异性形状，在名为rlretina的强化学习环境中为虚拟患者的视网膜渲染可理解的图像。该环境将任务形式化为在基于笔触的渲染任务中使用笔触。主要结果：我们训练了一个深度强化学习智能体，它学会组合各向同性和各向异性形状以形成图像。我们研究了哪种基于误差或基于感知的指标适合奖励智能体。该智能体以基于模型的数据生成方式训练，使用经过心理物理学验证的轴突映射模型来渲染不同虚拟患者感知到的图像。我们表明，与不同虚拟患者中的朴素方法相比，该智能体可以生成更可理解的图像。意义：这项工作提供了一种解决视网膜上刺激的新方法，这是朝着使用各向异性光幻视改善人工恢复视力中视觉敏锐度的第一步。

英文摘要

Objective: Diseases such as age-related macular degeneration and retinitis pigmentosa cause the degradation of the photoreceptor layer. One approach to restore vision is to electrically stimulate the surviving retinal ganglion cells with a microelectrode array such as an epiretinal implant. Epiretinal implants are known to generate visible anisotropic shapes elongated along the axon fascicles of neighboring retinal ganglion cells. Recent work has demonstrated that to obtain isotropic pixel-like shapes, it is possible to map axon fascicles and avoid stimulating them by inactivating electrodes or lowering stimulation current levels. Avoiding axon fascicle stimulation aims to remove brushstroke-like shapes in favor of a more reduced set of pixel-like shapes. Approach: In this study, we propose the use of isotropic and anisotropic shapes to render intelligible images on the retina of a virtual patient in a reinforcement learning environment named rlretina. The environment formalizes the task as using brushstrokes in a stroke-based rendering task. Main Results: We train a deep reinforcement learning agent that learns to assemble isotropic and anisotropic shapes to form an image. We investigate which error-based or perception-based metric is adequate to reward the agent. The agent is trained in a model-based data generation fashion using the psychophysically validated axon map model to render images as perceived by different virtual patients. We show that the agent can generate more intelligible images compared to the naive method in different virtual patients. Significance: This work shares a new way to address epiretinal stimulation that constitutes a first step towards improving visual acuity in artificially-restored vision using anisotropic phosphenes.

2606.03114 2026-06-03 cs.CV

FAF-CD: Frequency-Aware Fusion for Change Detection under Imperfect Multimodal Remote Sensing

Yufan Wang, Sokratis Makrogiannis, Chandra Kambhamettu

Affiliations: University of South Florida; Delaware State University

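AI summary: Proposes FAF-CD, a frequency-aware hybrid framework that combines a DINOv3-pretrained ConvNeXt encoder, a VMamba decoder, and a rectification-aware tri-branch fusion module (deformable spatial alignment, Fourier/Haar-wavelet comparison, and adaptive gating) to improve accuracy and reduce computational cost in imperfect heterogeneous remote sensing (e.g., EO-SAR) and binary optical change detection.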

Comments: Code will be released at https://github.com/VimsLab/FAF-CD

Details

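AI abstract (translated from Chinese)

Remote sensing change detection for real-world monitoring often relies on imperfect heterogeneous observations, in which pre- and post-event images may be asynchronous, come from different sensors, or be affected by illumination, seasonal, and modality shifts. This setting is especially challenging for EO-SAR disaster mapping, because nuisance variation can resemble structural damage. We propose FAF-CD, a frequency-aware hybrid framework that uses a DINOv3-pretrained ConvNeXt encoder and a linear-complexity VMamba-based decoder. Its rectification-aware tri-branch fusion module combines deformable spatial alignment with Fourier and Haar-wavelet comparisons, using adaptive gating to aggregate complementary cues across scales. On the BRIGHT validation set, a matched heterogeneous EO-SAR adaptation outperforms NeXt2Former-CD on clean and perturbed tc-mIoU/tc-mAP. FAF-CD also generalizes to binary optical change detection, reaching 0.924 cF1 on LEVIR-CD and 0.955 cF1 on WHU-CD, and under pseudo-change-aligned stress tests it obtains the best average perturbed cIoU/cF1 on both binary datasets when compared with M-CD and NeXt2Former-CD. Relative to NeXt2Former-CD, it further reduces computational cost by approximately 24 GFLOPs while maintaining or improving accuracy.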

English abstract

Remote sensing change detection for real-world monitoring often relies on imperfect heterogeneous observations, where pre- and post-event images may be asynchronous, cross-sensor, or affected by illumination, seasonal, and modality shifts. This setting is especially challenging for EO-SAR disaster mapping, where nuisance variation can resemble structural damage. We propose FAF-CD, a frequency-aware hybrid framework with a DINOv3-pretrained ConvNeXt encoder and a linear-complexity VMamba-based decoder. Its rectification-aware tri-branch fusion module combines deformable spatial alignment with Fourier and Haar-wavelet comparisons, using adaptive gating to aggregate complementary cues across scales. On BRIGHT validation, a matched heterogeneous EO-SAR adaptation improves clean and perturbed tc-mIoU/tc-mAP over NeXt2Former-CD. FAF-CD also generalizes to binary optical CD, achieving 0.924 cF1 on LEVIR-CD and 0.955 cF1 on WHU-CD, and obtains the best average perturbed cIoU/cF1 on both binary datasets among M-CD and NeXt2Former-CD under pseudo-change-aligned stress tests. It further reduces cost by approximately 24 GFLOPs relative to NeXt2Former-CD while maintaining or improving accuracy.
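
To make the idea of comparing a pre-event and a post-event image in the frequency domain more concrete, here is a minimal NumPy sketch. It is not the FAF-CD fusion module, which the abstract describes as operating with deformable alignment and adaptive gating inside a learned network; it only illustrates, on raw arrays, what a single-level Haar-wavelet comparison and a Fourier amplitude comparison can look like. The function names `haar_dwt2` and `frequency_differences` are hypothetical, and the sketch assumes single-channel images with even height and width.

```python
import numpy as np


def haar_dwt2(img):
    """Single-level orthonormal 2D Haar transform.

    Returns four half-resolution subbands:
    (approx, row_detail, col_detail, diag_detail).
    Assumes an even number of rows and columns.
    """
    img = np.asarray(img, dtype=float)
    # The four pixels of every non-overlapping 2x2 block.
    a, b = img[0::2, 0::2], img[0::2, 1::2]
    c, d = img[1::2, 0::2], img[1::2, 1::2]
    approx = (a + b + c + d) / 2.0       # local average (low-pass)
    row_detail = (a + b - c - d) / 2.0   # change between the two rows
    col_detail = (a - b + c - d) / 2.0   # change between the two columns
    diag_detail = (a - b - c + d) / 2.0  # diagonal change
    return approx, row_detail, col_detail, diag_detail


def frequency_differences(pre, post):
    """Mean absolute difference per Haar subband and in Fourier amplitude.

    Illustrative only: a hand-written comparison, not a learned module.
    """
    names = ("approx", "row_detail", "col_detail", "diag_detail")
    out = {
        name: float(np.mean(np.abs(p - q)))
        for name, p, q in zip(names, haar_dwt2(pre), haar_dwt2(post))
    }
    amp_pre = np.abs(np.fft.fft2(np.asarray(pre, dtype=float)))
    amp_post = np.abs(np.fft.fft2(np.asarray(post, dtype=float)))
    out["fourier_amplitude"] = float(np.mean(np.abs(amp_pre - amp_post)))
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pre = rng.random((8, 8))
    post = pre + 1.0  # a global brightness shift, a simple nuisance change
    print(frequency_differences(pre, post))
```

In this toy setting a uniform brightness shift changes only the Haar approximation band and the zero-frequency Fourier bin, while the detail subbands stay unchanged. That is one intuition for why frequency-domain comparisons can help separate nuisance variation from structural change.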

2606.03113 2026-06-03 cs.CL

Experience-Driven Dynamic Exits for LLMs with Reinforcement Learning

Yanyu Zhu, Hoilam Pao, Niu Hu, Wei Guo, Shaoxiong Zhan, Boyu Lai, Zitai Wang, Yongqin Zeng, Hai-Tao Zheng

Affiliations: arXiv

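AI summary: To address the slow autoregressive inference of large language models, proposes the LEDE framework, which uses offline reinforcement learning to dynamically select the optimal exit layer and speculation length, achieving a 2.0-2.7x speedup and an additional 17% speed gain over static speculative baselines.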

2606.03111 2026-06-03 cs.CV

Inverting the Generation Process of Denoising Diffusion Implicit Models: Empirical Evaluation and a Novel Method

Yan Zeng, Masanori Suganuma, Takayuki Okatani

Affiliations: Graduate School of Information Sciences, Tohoku University; RIKEN Center for AIP

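AI summary: Proposes a hybrid method combining gradient descent and a fixed-point method to recover the initial noise map of DDIM from a generated image, significantly improving prediction accuracy and reconstruction quality.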

Details

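AI abstract (translated from Chinese)

This paper studies the problem of inverting the DDIM image generation process to recover latent variables, in particular the initial noise map, from a generated image. Existing methods often lack accuracy on this task. We propose a novel hybrid method that uses direct inversion via gradient descent for the first step and then a fixed-point method for the subsequent steps. Empirical evaluation on three datasets shows that our method significantly improves the prediction accuracy of the initial latent variables while also achieving better reconstruction accuracy. In addition, we introduce a new evaluation, called the self-interpolation test, which assesses the quality of images generated from points interpolated between the true and predicted latent maps and thus gives deeper insight into performance. Our results show that although existing methods perform reasonably well at reconstruction, they consistently fail to predict the initial latent variables accurately, which leads to poor performance on the self-interpolation test. In contrast, our method outperforms the others on all metrics, providing valuable insights into diffusion models and enhancing their applications in image generation and editing.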

English abstract

This paper studies the problem of inverting the DDIM image generation process to recover latent variables, particularly the initial noise map, from a generated image. Existing methods often struggle with accuracy in this task. We propose a novel hybrid approach that combines direct inversion via gradient descent for the first step, followed by a fixed-point method for subsequent steps. Empirical evaluations across three datasets demonstrate that our method significantly improves the prediction of initial latent variables while achieving superior reconstruction accuracy. Additionally, we introduce a new evaluation, called the self-interpolation test, which assesses the quality of images generated from interpolated points between the true and predicted latent maps, offering deeper insights into performance. Our results reveal that while existing methods perform reasonably well in reconstruction, they consistently fail to accurately predict the initial latent variables, resulting in poor performance on the self-interpolation test. In contrast, our method outperforms all others across all metrics, providing valuable insights into diffusion models and enhancing their applications in image generation and editing.
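
As an illustration of what a fixed-point inversion of a DDIM step can look like, the following NumPy sketch inverts the standard deterministic DDIM update (eta = 0) by iterating on the unknown noisier latent. It is a generic textbook-style sketch and not the authors' hybrid method: the gradient-descent step described in the abstract is omitted, and `eps_model` is a hypothetical toy stand-in for a trained noise-prediction network, chosen so that the iteration provably contracts. The schedule values and function names are assumptions made for the example.

```python
import numpy as np


def eps_model(x, t):
    """Hypothetical toy noise predictor (stand-in for a trained network)."""
    return 0.1 * np.tanh(x + 0.01 * t)


def ddim_step(x_t, t, a_t, a_prev):
    """One deterministic DDIM generation step x_t -> x_{t-1} (eta = 0).

    a_t and a_prev are cumulative alpha products at steps t and t-1.
    """
    eps = eps_model(x_t, t)
    x0_pred = (x_t - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
    return np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps


def invert_step_fixed_point(x_prev, t, a_t, a_prev, n_iter=50):
    """Solve ddim_step(x_t, ...) == x_prev for x_t by fixed-point iteration.

    Rearranging the DDIM update gives
        x_t = sqrt(a_t / a_prev) * (x_prev - sqrt(1 - a_prev) * eps)
              + sqrt(1 - a_t) * eps,   with eps = eps_model(x_t, t),
    which is iterated starting from the guess x_t = x_prev.
    """
    x_t = x_prev.copy()
    for _ in range(n_iter):
        eps = eps_model(x_t, t)
        x_t = (np.sqrt(a_t / a_prev) * (x_prev - np.sqrt(1.0 - a_prev) * eps)
               + np.sqrt(1.0 - a_t) * eps)
    return x_t


def generate(x_T, alphas):
    """Run all DDIM steps from the initial noise x_T down to the sample x_0."""
    x = x_T
    for t in range(len(alphas) - 1, 0, -1):
        x = ddim_step(x, t, alphas[t], alphas[t - 1])
    return x


def invert(x_0, alphas):
    """Recover the initial noise by inverting every step in reverse order."""
    x = x_0
    for t in range(1, len(alphas)):
        x = invert_step_fixed_point(x, t, alphas[t], alphas[t - 1])
    return x


if __name__ == "__main__":
    alphas = np.linspace(0.99, 0.1, 6)  # assumed toy schedule, index 0 = clean
    noise = np.random.default_rng(0).standard_normal((4, 4))
    recovered = invert(generate(noise, alphas), alphas)
    print(np.max(np.abs(recovered - noise)))
```

With this toy predictor the iteration converges quickly because the predictor changes slowly with its input. With a real network, convergence is not guaranteed, which is consistent with the abstract's observation that existing inversion methods struggle to recover the initial latent accurately.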
