arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.01558 2026-06-02 cs.CV

Attention-guided Fine-tuning of Multimodal Large Language Models Improves Chain-of-Thought Reasoning

Sanchit Sinha, Guangzhi Xiong, Bohan Liu, Zhenghao He, Aidong Zhang

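Affiliations: University of Virginia

AI summary: To address the weak chain-of-thought reasoning of multimodal large language models, the authors propose Attentive-CoT, an attention-guided fine-tuning objective that improves reasoning by delaying answer commitment and maintaining access to visual tokens.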

Abstract

The effectiveness of Chain-of-Thought (CoT) prompting in Multimodal Large Language Models (MLLMs) remains uncertain: across several visual reasoning benchmarks, CoT prompting often degrades performance compared to direct prompting. In this paper, we provide a systematic analysis of CoT behavior in three modern MLLM families across model scales on datasets requiring step-wise visual evidence. Our analysis identifies two recurring failure modes: premature answer commitment and limited direct visual-token access during rationale generation. We further find that standard CoT-style Supervised Fine-Tuning (CoT-SFT) can mitigate these issues only partially, while often increasing reliance on textual priors and reducing counterfactual visual dependence. Motivated by these findings, we propose Attentive-CoT (Att-CoT), an attention-guided fine-tuning objective that encourages CoT trajectories to delay answer commitment while maintaining sustained visual-token access. Att-CoT can be plugged into any CoT-SFT training run without architectural changes. Experiments on three visual reasoning benchmarks across six MLLMs show that Att-CoT enhances CoT performance over standard fine-tuning.

2606.01557 2026-06-02 cs.LG eess.SP

Everywhere Learning: Artificial Intelligence with Pointwise Constraints

Ignacio Boero, Ignacio Hounie, Luiz Chamon, Alejandro Ribeiro

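Affiliations: Department of Electrical and Systems Engineering, University of Pennsylvania; École polytechnique, Institut Polytechnique de Paris

AI summary: Proposes the new paradigm of "everywhere learning", analyzes its generalization through an approximate duality theory, controls generalization with a sparse L1 penalty, and demonstrates its merits on language model tasks.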

Abstract

Everywhere learning is a new paradigm whereby Artificial Intelligence (AI) systems are trained to satisfy loss constraints with probability one over the data distribution. This is in contrast to the standard paradigm of training AI systems to minimize average losses. We develop an approximate duality theory to substantiate a generalization analysis that establishes the proximity between solutions of empirical and statistical everywhere learning problems. Our results show that dual variables reweigh the data distribution towards points in which loss constraints are more difficult to satisfy and that generalization is controlled by the mismatch between the concentration of mass of the data distribution and the concentration of mass on points where constraints are more difficult to satisfy. We further show that we can control generalization with a sparse L1 penalty on constraint relaxations. We illustrate the merits of everywhere learning with an experiment in agentic classification for language model tasks.

2606.01552 2026-06-02 cs.AI

RoleCDE: Benchmarking and Mitigating Role-Alignment Trade-offs in Role-Playing Agents

Huayi Lai, Shichao Song, Simin Niu, Hanyu Wang, Jiawei Yang, Zhouxing Wang, Zhiqiang Yin, Xun Liang

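Affiliations: School of Information, Renmin University of China

AI summary: To study how role-playing agents decide when role-specific values conflict with alignment constraints, the authors introduce RoleCDE, described as the first benchmark of its kind. It uses cognitive dilemma scenarios to evaluate role-scenario grounding, value conflict resolution, and decision tendencies, reveals a "Role Value Decoupling" phenomenon, and shows that RoleCDE-based fine-tuning effectively mitigates it.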

Comments 23 pages

Abstract

Role-playing agents (RPAs) are widely used to steer large language models (LLMs) toward role-consistent behavior, yet existing benchmarks mainly evaluate surface-level fidelity and offer limited insight into decision making under role-alignment value conflicts. To address this gap, we introduce RoleCDE, the first benchmark designed to evaluate RPAs under structured conflicts between role-specific values and alignment-oriented constraints. RoleCDE formulates role-aware decision making as cognitive dilemma scenarios, jointly evaluating role-scenario grounding, value conflict resolution, and decision tendencies. The benchmark is constructed at scale, covering approximately 8k diverse role profiles and scenarios and nearly 24k dilemma instances across three difficulty levels and eight role categories. Evaluation of several mainstream LLMs reveals a "Role Value Decoupling" phenomenon, where agents systematically default to alignment- and morality-consistent decisions rather than role-specific values when the two conflict, even under explicit role conditioning. This behavior is largely invariant to dilemma difficulty but varies substantially across role categories. We further show that RoleCDE-based fine-tuning effectively mitigates this decoupling by improving value trade-off reasoning, while preserving general role-playing fidelity and general reasoning performance. Code is available at: https://github.com/rabbitrose/RoleCDE.

2606.01549 2026-06-02 cs.CV

ForestMamba: Sparse Mamba with Geometry-guided Queries for 3D Forest Point Cloud Segmentation

Trung Thanh Nguyen, Tuan-Anh Vu, Duc Viet Le, Yasutomo Kawanishi, Takahiro Komamizu, Ichiro Ide, Teja Kattenborn

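Affiliations: Nagoya University; RIKEN Seika; University of California, Los Angeles; University of Twente; Ritsumeikan University

AI summary: Proposes ForestMamba, which combines a sparse encoder, geometry-guided query initialization, and a Mamba-based query decoder for efficient, structure-aware forest point cloud segmentation. It outperforms existing methods across seven forest regions, with 3 times faster inference and 2.3 times lower GPU memory.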

Abstract

AI-based semantic and instance segmentation of terrestrial and drone LiDAR point clouds is emerging as a transformative approach for converting the complex 3D structure of forests into actionable information for forest monitoring and biodiversity assessment. However, forest LiDAR scenes remain highly challenging due to their large data volumes, irregular sampling density, overlapping and complex canopy structure, and geographic variability. Existing methods based on sparse convolutions or Transformers achieve promising results, but suffer from two key limitations: the quadratic complexity of attention scales poorly to large forest scenes, and generic context modeling does not exploit forest structural priors, limiting tree separation in complex regions. To address these challenges, we propose ForestMamba, a structure-aware method that incorporates forest-specific priors into feature encoding, query generation, and query refinement, while replacing quadratic attention with linear-time state-space modeling. First, we introduce a sparse encoder with vertical-priority slab serialization that organizes sparse voxels into vertically coherent sequences for efficient long-range context modeling. Second, we propose a geometry-guided query initialization strategy based on an on-the-fly multi-scale Canopy Height Model (CHM), where canopy maxima provide ecologically meaningful query seeds, supplemented by Farthest Point Sampling (FPS) to cover understory trees. Third, we design a Mamba-based query decoder that combines local kNN voxel aggregation with a spatial dual-path Mamba for query refinement with linear computational complexity. Extensive experiments across seven forest regions demonstrate that ForestMamba consistently outperforms existing baselines in both segmentation tasks, while achieving 3 times faster inference and 2.3 times lower GPU memory than Transformer-based methods.
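
The abstract mentions Farthest Point Sampling (FPS) as a supplement to canopy-maxima seeds. As a rough illustration of generic FPS only (not the ForestMamba pipeline), the sketch below greedily picks the point farthest from the set already chosen. The function name `farthest_point_sampling`, the `start` parameter, and the toy 2D points are assumptions made for this example.

```python
import math

def farthest_point_sampling(points, k, start=0):
    """Generic greedy FPS: repeatedly pick the point farthest from the chosen set."""
    chosen = [start]
    # min_dist[i] holds the distance from point i to its nearest chosen point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < min(k, len(points)):
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen

# Hypothetical toy points; a real pipeline would sample 3D voxels or points.
pts = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.5, 0.5)]
seeds = farthest_point_sampling(pts, 3)
print(seeds)
```

Because each new seed maximizes its distance to the existing ones, the samples spread over the whole scene, which is presumably why FPS suits covering regions that canopy maxima miss.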

2606.01544 2026-06-02 cs.LG

CRePE: Convolution-aware Relative Importance in Post-training Pruning with Efficient Search

Cheonjun Park

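Affiliations: Hankuk University of Foreign Studies

AI summary: Proposes CRePE, which improves relative importance scoring with 2D local neighborhood context and adaptive coefficients, and pairs it with PHO proxy-based optimization for efficient post-training pruning, achieving the best performance across diverse models and sparsity levels.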

Comments 10 pages

Abstract

Deploying Large Language Models (LLMs) in practice incurs substantial memory and computational costs. Post-training pruning (PTP) is an effective approach to reducing these costs by removing weights without additional training. Among existing methods, RIA introduces relative importance scores normalized by row and column sums, achieving state-of-the-art accuracy. However, RIA considers only 1D cross-shaped (row/column) directional information and assigns equal weight to row and column contributions. In this paper, we propose CRePE, which incorporates 2D local neighborhood context and adaptive coefficients into Relative Importance scoring. CRePE consistently outperforms existing PTP methods across diverse models and sparsity settings. However, identifying optimal adaptive coefficients via perplexity (PPL)-based hill climbing requires numerous PPL evaluations and approximately 11 hours of search time. To address this, we propose PHO (Proxy-based Hyperparameter Optimization), which eliminates the need for repeated PPL measurements and reduces the search time to approximately 20 minutes. Furthermore, the optimal hyperparameter configuration found by PHO on one model transfers well to other models, demonstrating strong generalization. Finally, we verify that CRePE can be orthogonally combined with existing techniques including Channel Permutation, non-uniform sparsity allocation, and re-pruning methods.
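
To make "relative importance scores normalized by row and column sums" concrete, here is a minimal sketch of one plausible reading: each weight's magnitude is divided by its row sum and by its column sum, and the two terms are added with equal weight. This is an assumption based only on the abstract's wording. It omits any activation term and does not implement CRePE's 2D neighborhood context or adaptive coefficients. The helper names `relative_importance` and `prune_mask` are hypothetical.

```python
def relative_importance(weights):
    """Score |w_ij| by its share of row i plus its share of column j (assumed form)."""
    rows = [sum(abs(w) for w in row) for row in weights]
    cols = [sum(abs(row[j]) for row in weights) for j in range(len(weights[0]))]
    # Assumes every row and column has a nonzero absolute sum.
    return [[abs(w) / rows[i] + abs(w) / cols[j] for j, w in enumerate(row)]
            for i, row in enumerate(weights)]

def prune_mask(scores, sparsity):
    """Keep-mask that drops the lowest-scoring fraction of weights (ties may keep extras)."""
    flat = sorted(s for row in scores for s in row)
    n_prune = int(len(flat) * sparsity)
    if n_prune >= len(flat):
        return [[False for _ in row] for row in scores]
    threshold = flat[n_prune]
    return [[s >= threshold for s in row] for row in scores]

W = [[1.0, -3.0], [2.0, 2.0]]
scores = relative_importance(W)
mask = prune_mask(scores, 0.5)
print(mask)
```

In this toy matrix the two weights of magnitude 2 tie under plain magnitude pruning, while the relative score keeps the one in the column with the smaller total, which is the kind of context a purely magnitude-based criterion cannot see.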

2606.01543 2026-06-02 cs.CV

PathAR: Structure-First Autoregressive Synthesis of Multimodal Pathology Images

Yuan Zhang, Jiahao Xia, Junzhang Huang, Meng Wang, Feng Chen, Guanyu Yang, Huazhu Fu

Affiliations: Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education; Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore; Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore; Department of Biostatistics, Center for Global Health, School of Public Health, Nanjing Medical University; Institute of High-Performance Computing, Agency for Science, Technology and Research

AI summary: Proposes PathAR, a structure-first autoregressive synthesis framework that explicitly factorizes structure and appearance and uses an interleaved autoregressive Transformer to generate pathology images conditioned on modality labels, improving structural consistency and modality fidelity.

Comments 12 pages, 7 figures

Abstract

Data scarcity in multimodal pathology motivates unified generative models that synthesize modality-specific appearance while preserving anatomically coherent structure. Although modalities differ in appearance statistics, morphological structures such as cellular topology and tissue boundaries are largely preserved across acquisition protocols. However, existing methods often model these factors within a homogeneous token stream, implicitly coupling structure with appearance and weakening structural controllability under modality shifts. To address this, we propose pathology autoregressive modeling (PathAR), a structure-first autoregressive synthesis framework that explicitly factorizes structure and appearance for modality-label-conditioned pathology generation. PathAR employs a dual vector quantization (Dual-VQ) tokenizer to decompose samples into mask-grounded structure and appearance tokens, and an interleaved autoregressive (IAR) transformer with asymmetric attention visibility to enforce structure-to-appearance dependence. PathAR stabilizes morphology under heterogeneous modality-specific appearances and enables spatially aligned image-mask pair generation. Extensive experiments show that PathAR improves structural consistency and modality fidelity over baselines, maintains sample diversity, supports downstream segmentation in data-scarce regimes, and demonstrates extensibility to finer-grained intra-modality organ-label variation.

2606.01540 2026-06-02 cs.LG cs.AI

TN-SHAP-G: Graph-Structured Tensor Network Surrogates for Shapley Values and Interactions

Farzaneh Heidari, Guillaume Rabusseau

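Affiliations: University of Washington; CNRS

AI summary: Proposes the TN-SHAP-G framework, which uses tensor network surrogates on graph-structured inputs to efficiently compute Shapley values and higher-order interaction indices.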

Abstract

Shapley values are a widely used tool for attributing importance and interactions among input variables in black-box models, but their computation involves a function defined over an exponentially large space of subsets. We propose TN-SHAP-G, a framework that exploits structure in graph-structured inputs to compute Shapley values and higher-order interaction indices efficiently. Given a predictor and a fixed masking scheme, TN-SHAP-G learns a compact, graph-aligned multilinear surrogate that approximates the masked-input behavior, represented as a tensor network whose topology mirrors the input graph. Once trained from a small number of oracle queries, the surrogate enables deterministic recovery of first- and higher-order Shapley indices via the multilinear extension, without additional model queries or Monte Carlo variance. Experiments on molecular benchmarks show that the learned factorization closely matches exact Shapley values on small graphs and scales efficiently to larger graphs where sampling-based methods become infeasible.
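
The "exponentially large space of subsets" is easiest to see in a brute-force computation. The sketch below evaluates exact Shapley values by enumerating every coalition, which is the kind of reference the abstract compares against on small graphs. It is not the tensor-network surrogate itself. The names `exact_shapley` and `toy_value` and the toy value function are assumptions for illustration.

```python
from itertools import combinations
from math import factorial

def exact_shapley(n, value):
    """Exact Shapley values by enumerating all subsets; cost grows as 2**n."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Classical Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def toy_value(s):
    # Hypothetical game: additive terms plus one interaction between features 0 and 1.
    return 1.0 * (0 in s) + 2.0 * (1 in s) + 3.0 * (2 in s) + 4.0 * (0 in s and 1 in s)

phi = exact_shapley(3, toy_value)
print(phi)
```

The interaction term of 4 is split evenly between features 0 and 1, and the values sum to the value of the full coalition, which is a quick sanity check for any approximate method.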

2606.01528 2026-06-02 cs.AI

Joint Agent Memory and Exploration Learning via Novelty Signals

Shizuo Tian, Xiaohong Weng, Rui Kong, Yuxuan Chen, Guohong Liu, Yuebing Song, Jiacheng Liu, Yuchen Li, Dawei Yin, Ting Cao, Yunxin Liu, Yuanchun Li

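Affiliations: Tsinghua University; Sun Yat-sen University; Baidu Inc.; Tongji University; Peking University

AI summary: Proposes the JAMEL framework, which uses novelty signals to jointly train agent memory and exploration policy, achieving efficient exploration in open-ended environments and generalizing to unseen environments.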

Abstract

In open-ended environments, exploration is fundamental for autonomous agents, yet current language model agents struggle with this. Effective exploration requires memory, but retaining raw interaction histories is computationally expensive over long trajectories. While latent memory offers a solution to compress interaction histories, its training lacks reliable supervisory signals. We introduce Joint Agent Memory and Exploration Learning (JAMEL), a framework that trains agentic memory and exploration policy together through novelty-driven interaction. We observe that memory and exploration form a mutually dependent loop: sustained exploration requires memory to distinguish exhausted behaviors from unseen ones, while novelty-seeking interaction provides the supervision needed to make memory useful for future exploration. By utilizing deterministic and persistent novelty signals such as code coverage in the GUI domain, we provide natural, annotation-free supervision for the memory module. Empirical evaluations demonstrate that JAMEL successfully generalizes to unseen environments. Its exploration capability outperforms open-weight baselines and rivals the exploration depth of a closed-source model while reducing token consumption. Our code and model are open-sourced at https://github.com/MobileLLM/JAMEL.

2606.01527 2026-06-02 cs.LG cs.CR

Near-Optimal Pure Machine Unlearning for Smooth Strongly Convex Losses

Matthew Regehr, Gautam Kamath, Andrew Lowy

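Affiliations: University of Waterloo; Vector Institute; CISPA Helmholtz Center for Information Security

AI summary: For approximate ε-unlearning in smooth strongly convex stochastic optimization, the paper nearly resolves the fundamental statistical cost of unlearning by proving upper and lower bounds on the excess population risk that are tight up to a condition-number factor, and presents an unlearning algorithm that, when ε ≫ d, offers an exponential accuracy improvement over retraining from scratch and differentially private baselines.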

Abstract

Machine unlearning is motivated by legal and user-facing requirements to remove the influence of individuals' data from trained models, such as the right to be forgotten. Prior work has developed algorithms and error bounds for unlearning in smooth strongly convex stochastic optimization, but the fundamental statistical cost of unlearning has remained unclear. We nearly resolve this problem by proving upper and lower bounds on the excess population risk of approximate $\varepsilon$-unlearning; our bounds are tight up to a condition-number factor. For mean estimation over the unit ball, our upper and lower bounds match. The optimal rate is the usual statistical error plus an unlearning penalty that interpolates between the retraining-from-scratch rate and an exponentially smaller term as $\varepsilon/d$ grows, where $d$ is the dimension of the model. In particular, when $\varepsilon \gg d$, our $\varepsilon$-unlearning algorithm offers an exponential accuracy improvement over retraining the model from scratch and differentially private baselines. On the other hand, when $\varepsilon \le d$, retraining from scratch is optimal.

2606.01526 2026-06-02 cs.RO

Spatio-Temporal Reconnection for Multi-Robot Networks using Adaptive Prescribed-Time CBFs

Hao Liu, Yupeng Yang, Yanze Zhang, Wenhao Luo

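Affiliations: Department of Computer Science, University of Illinois Chicago; Department of Computer Science, University of North Carolina at Charlotte

AI summary: Proposes an adaptive prescribed-time control barrier function framework that lets multi-robot systems disconnect and reconnect communication within an adjustable prescribed time, combined with a triggering mechanism to improve task efficiency.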

Comments 6 pages, 6 figures, accepted by IFAC 2026

Abstract

In multi-robot systems, maintaining persistent communication graph connectivity is often overly restrictive, especially when robots have limited communication ranges but operate in large environments. Instead, allowing robots to temporarily disconnect and later reconnect is often more desirable for efficient task execution while still ensuring timely information sharing across the team. In this paper, we propose an adaptive prescribed-time control barrier function (adaptive PT-CBF) framework that enables robots to temporarily disconnect and re-enter the communication range within an adjustable and feasible prescribed time. Moreover, we introduce a reconnection triggering mechanism that jointly considers task execution and reconnection urgency, thereby providing a principled way to decide when reconnection should occur. Theoretical analysis justifies convergence to the satisfying reconnection within a prescribed finite time. Experimental results validate the performance of our proposed adaptive PT-CBF with improved task efficiency and satisfying reconnections.

2606.01525 2026-06-02 cs.LG stat.ML

Semi-Supervised Hyperbolic Hierarchical Clustering with Set-Level Structural Priors

Junjing Zheng, Xinyu Zhang, Xiangfeng Qiu, Chengliang Song, Weidong Jiang

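Affiliations: College of Electronic Science and Technology, National University of Defense Technology

AI summary: Proposes a semi-supervised hyperbolic hierarchical clustering method that introduces sets as basic modeling units and uses set-level structural priors derived from leaf-level supervision to guide learning of the non-leaf hierarchy, improving label consistency and tree quality.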

Abstract

Semi-supervised hierarchical clustering aims to learn a tree structure consistent with data patterns and user-provided supervision. Supervision is usually given as leaf-level relations, such as pairwise must-link/cannot-link constraints or triplet-wise must-link-before constraints. Although useful for regulating local sample relations, such supervision does not directly indicate which samples should form coherent subtrees. Consequently, the non-leaf structure of the learned tree may deviate from the hierarchical organization preferred by ground-truth labels. To address this limitation, we propose a semi-supervised hyperbolic hierarchical clustering method with set-level structural priors. The main contribution is to introduce sets as basic modeling units for hierarchy learning. Each set denotes samples expected to cohere within a subtree and is induced from leaf-level supervision together with a learned constraint-consistent similarity structure. These sets act as soft structural priors for subtree-level supervision, allowing supervision to guide non-leaf hierarchy formation beyond local leaf-level relations. Specifically, we first learn constraint-consistent embeddings to obtain a reliable set partition, then construct constraint-induced sets and estimate inter-set similarities to form set-level structural priors. Finally, these priors are incorporated into a hyperbolic hierarchy objective for continuous tree optimization. Experiments on eleven benchmark datasets and ablation studies show that the proposed method consistently improves label consistency over representative hierarchical clustering baselines while also enhancing similarity-based tree quality.

2606.01521 2026-06-02 cs.LG stat.ML

Fast Generalization after Interpolation via Critically Damped Momentum Optimization

Luca Muscarnera, Silas Ruhrberg Estévez, Yuanzhang Xiao, Mihaela Van der Schaar

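Affiliations: University of Cambridge; University of Hawaii at Manoa

AI summary: Proposes GROKtimizer, a biphasic strategy that combines rapid convergence to interpolation with post-interpolation norm minimization based on critically damped momentum. Under a local quadratic model it achieves a quadratic speedup over classical gradient descent and selects low-norm interpolating solutions to improve generalization.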

Abstract

A central problem in machine learning is that models can achieve near-perfect training performance while generalizing substantially less well to unseen examples. This gap is especially acute in high-dimensional, low-sample regimes, where many interpolating solutions exist and optimization must implicitly select among minima with different generalization properties. Following recent theoretical advances on optimization dynamics near the interpolation threshold, we note that the two-regime structure of risk minimization, with loss minimization followed by complexity minimization, motivates a biphasic optimization schedule. We thus theoretically demonstrate that GROKtimizer, a biphasic strategy that combines rapid convergence to interpolation with Critically Damped Momentum (CDM)-based post-interpolation norm minimization, offers a natural solution for selecting low-norm interpolating solutions. Under a local quadratic model of the post-interpolation basin, GROKtimizer provides a quadratic speedup over classical gradient descent, with provable optimality among first-order optimizers. To showcase the applicability of our method, we evaluate GROKtimizer on several synthetic benchmarks common in the classical grokking literature and on various real-world datasets. Finally, we reconcile our findings with the flat-minima hypothesis, highlighting the importance of post-interpolation dynamics in the construction of high-quality, generalizing models.
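
The claimed quadratic speedup under a local quadratic model can be illustrated with textbook heavy-ball momentum on a one-dimensional quadratic. This is a generic sketch, not GROKtimizer. It assumes the loss 0.5 * lam * x**2 and the momentum choice beta = (1 - sqrt(lr * lam))**2, which gives the update a double characteristic root and is therefore critically damped. The function names and constants are illustrative.

```python
import math

def gradient_descent(lam, lr, steps, x0=1.0):
    """Plain gradient descent on f(x) = 0.5 * lam * x**2."""
    x = x0
    for _ in range(steps):
        x -= lr * lam * x
    return x

def heavy_ball(lam, lr, steps, x0=1.0):
    """Heavy-ball momentum with the critically damped choice of beta (assumed setup)."""
    beta = (1.0 - math.sqrt(lr * lam)) ** 2
    x_prev, x = x0, x0
    for _ in range(steps):
        # The right-hand side uses the old x and x_prev before reassignment.
        x, x_prev = x - lr * lam * x + beta * (x - x_prev), x
    return x

gd = gradient_descent(0.01, 1.0, 200)
hb = heavy_ball(0.01, 1.0, 200)
print(gd, hb)
```

With lr * lam = 0.01, gradient descent contracts by 0.99 per step while the critically damped iteration contracts by roughly 0.9 per step, so the per-step rate depends on sqrt(lr * lam) rather than lr * lam. That square-root dependence is the usual sense of a quadratic speedup.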

2606.01520 2026-06-02 cs.AI

TERRA: Task-Embedded Reasoning and Representation Architecture for Cross-Domain Applications

Shayan Shokri

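Affiliations: Humanpath Labs Inc.

AI summary: Proposes the TERRA architecture and formalizes the cross-domain transfer question, measuring homomorphism between structured-state domains with a lax bisimulation discrepancy and a Gromov-Wasserstein distance, and deriving transfer bounds on prediction error and decision regret that turn a widely held intuition into testable theory.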

Abstract

A single action-conditioned latent predictive architecture can in principle be trained on the structured state of a driving scene, a robot workspace, or a financial order book. The ingredients for doing so within any one domain already exist and are individually validated: masked-latent prediction, action-conditioned latent world models, discrete action tokenization, and joint-embedding prediction on voxelized state. What is not established, and what TERRA addresses, is the transfer question: when does a representation or predictor learned in one structured-state domain carry over to a structurally analogous but otherwise unrelated domain, and by how much. We give this question a formal treatment. We model each domain as a controlled Markov process on a graded latent grid, factor any instantiation into thin domain adapters and a shared domain-invariant core, and identify a cross-domain correspondence with an approximate Markov decision process homomorphism whose quality is measured by a lax bisimulation discrepancy and, for domains lacking a shared coordinate system, by a Gromov-Wasserstein distance between their action-conditioned transition operators. Under a Lipschitz predictor we derive a transfer bound that separates source-model error from structural mismatch, grows geometrically in the prediction horizon, and is certified from below by the Gromov-Wasserstein distance; we then connect latent error to decision regret through the Lipschitz value property of bisimulation metrics. The resulting Structured-State Transfer Hypothesis is stated as a falsifiable claim with a preregistered experimental program, centered on a transfer test from driving scenes to order books, including conditions under which it is refuted. We present no empirical results: this is a research proposal that converts a widely repeated intuition into testable theory.

2606.01518 2026-06-02 cs.CV cs.GR

MotionDreamer: Universal Skeletal Motion Generation for 3D Rigged Shapes

Ye Tao, Yuxin Yao, Kendong Liu, Dapeng Wu, Junhui Hou

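Affiliations: City University of Hong Kong

AI summary: Proposes MotionDreamer, a diffusion-based framework that generates category-agnostic skeletal animation from 2D video through a structural-semantic injection mechanism, and builds a large-scale dynamic dataset, enabling high-fidelity motion synthesis across morphologies.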

Comments 18 pages, 7 figures

Abstract

Motion generation for rigged shapes is vital for scalable 4D asset production. However, template-based methods are limited by specific topologies and fail to generalize across diverse morphologies. Conversely, per-case optimization is computationally expensive, susceptible to local optima, and highly sensitive to viewpoint-induced ambiguities. In this paper, we present MotionDreamer, a diffusion-based framework designed for category-agnostic skeletal animation generation from 2D video guidance. To overcome the scarcity of high-quality training data, we have curated a large-scale dynamic dataset comprising approximately 20,000 diverse 3D models, each featuring complete textures, skeletal rigging, and a wide array of comprehensive animation sequences. To bridge the kinematic gap between 2D visual motion cues and heterogeneous 3D skeletal structures, we propose a structural-semantic injection mechanism. Our model integrates texture and semantic attributes directly into skeletal joint representations. This allows it to map perceived visual dynamics to specific joint hierarchies and their functional roles. This enables MotionDreamer to synthesize high-fidelity animations that maintain anatomical consistency across a vast range of unseen categories, from existing biological species to fantastical beings. Extensive experiments demonstrate that our approach significantly outperforms existing methods, setting a new state-of-the-art benchmark for robust and efficient 4D asset generation. The code will be made publicly available upon acceptance.

2606.01509 2026-06-02 cs.LG cs.AI

ProbMoE: Differentiable Probabilistic Routing for Mixture-of-Experts

ProbMoE:可微分的专家混合概率路由

Heng Zhao, Zilei Shao, Guy Van den Broeck, Zhe Zeng

发表机构 * Imperial College London(伦敦帝国学院) University of Waterloo(滑铁卢大学) EPFL(洛桑联邦理工学院)

AI总结 提出ProbMoE概率路由框架,通过离散子集空间上的概率推断实现专家选择,解决top-k路由的离散非可微问题,并扩展到动态k路由,提升专家利用率和路由多样性。

Comments Accepted at ICML 2026

详情
AI中文摘要

专家混合(MoE)模型通过每个令牌仅激活一小部分专家来扩展规模。然而,训练此类模型仍然具有挑战性,因为top-$k$路由是离散且不可微的,需要针对专家选择的梯度估计器,其设计仍是一个核心开放问题。我们引入了ProbMoE,一种概率路由框架,将专家选择建模为基数受限专家子集上的分布,并将路由公式化为该离散子集空间中的概率推断。我们首先提出ProbMoE Exact-$k$路由,在前向传播中采样$k$专家子集,后向传播使用每个专家精确边际概率的梯度作为真实梯度的可处理代理。ProbMoE自然地推广到动态$k$路由设置,其中训练和推理都将路由基数约束到相同的预定义范围,允许每个令牌自适应地分配专家。在多个基准测试和模型骨干上,ProbMoE Exact-$k$相比竞争基线实现了强性能,具有改进的专家利用率和路由多样性;ProbMoE Dynamic-$k$以更少的激活专家实现了可比的性能。

英文摘要

Mixture-of-Experts (MoE) models scale by activating only a small subset of experts per token. However, training such models remains challenging because top-$k$ routing is discrete and non-differentiable, requiring gradient estimators for expert selection whose design remains a central open problem. We introduce ProbMoE, a probabilistic routing framework that models expert selection as a distribution over cardinality-constrained expert subsets and formulates routing as probabilistic inference in this discrete subset space. We first propose ProbMoE Exact-$k$ routing, which samples $k$-expert subsets in the forward pass, and the backward pass uses gradients through each expert's exact marginal probability as a tractable surrogate for the true gradient. ProbMoE naturally generalizes to a dynamic-$k$ routing setting, where both training and inference constrain the routing cardinality to the same predefined range, allowing adaptive expert allocation per token. Across benchmarks and model backbones, ProbMoE Exact-$k$ achieves strong performance compared to competitive baselines, with improved expert utilization and routing diversity; ProbMoE Dynamic-$k$ achieves comparable performance with fewer activated experts.
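
The abstract does not say how the distribution over $k$-expert subsets is parameterized. As a minimal sketch, assume the common product form $p(S) \propto \prod_{i \in S} \exp(\ell_i)$ restricted to $|S| = k$, where $\ell_i$ are router logits. Under that assumption, each expert's exact marginal probability can be computed from elementary symmetric polynomials. The function names below are hypothetical and are not taken from the paper.

```python
import math


def elementary_symmetric(weights, k):
    """Return [e_0, ..., e_k], the elementary symmetric polynomials of weights."""
    e = [1.0] + [0.0] * k
    for w in weights:
        # Update from high degree to low so each weight is used at most once.
        for j in range(k, 0, -1):
            e[j] += w * e[j - 1]
    return e


def exact_k_marginals(logits, k):
    """P(expert i is selected) under p(S) ∝ prod_{i in S} exp(logit_i), |S| = k.

    Assumed parameterization, for illustration only. Cost is O(n^2 * k).
    """
    shift = max(logits)  # rescaling all weights leaves the distribution unchanged
    weights = [math.exp(x - shift) for x in logits]
    e_all = elementary_symmetric(weights, k)
    marginals = []
    for i, w in enumerate(weights):
        rest = weights[:i] + weights[i + 1:]
        e_rest = elementary_symmetric(rest, k - 1)
        # Subsets containing i: weight w times any (k-1)-subset of the rest.
        marginals.append(w * e_rest[k - 1] / e_all[k])
    return marginals
```

With $k = 1$ the marginals reduce to a softmax over the logits, and for any $k$ they sum to $k$. In an autodiff framework the same recurrence would be differentiable with respect to the logits, which is one way a marginal-based surrogate gradient could be obtained.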

2606.01503 2026-06-02 cs.CV cs.AI cs.CL

On the Limits of Token Reduction for Efficient Unified Vision Language Training

论高效统一视觉语言训练中令牌缩减的极限

Siyi Chen, Weiming Zhuang, Jingtao Li, Lingjuan Lv

发表机构 * University of Michigan(密歇根大学) Sony AI(索尼人工智能)

AI总结 本文通过分析层注意力分配,发现视觉理解与视觉生成在令牌冗余上存在不对称性,设计任务特定加速器,但统一训练中任务特定令牌丢弃导致协同损失,表明高效统一建模需保留共享跨任务结构。

详情
AI中文摘要

统一视觉语言模型(VLM)在单个自回归骨干中集成了视觉理解和视觉生成,但其联合训练计算成本高昂且从效率角度常被忽视。在这项工作中,我们研究了基于令牌缩减的加速在统一VLM训练中的可行性和极限。通过对逐层注意力分配的系统分析,我们揭示了一个基本的不对称性:视觉理解在后期层表现出显著的视觉冗余,而视觉生成在深度上对图像令牌保持持续依赖。受此观察启发,我们设计了任务特定的加速器,针对每个目标选择性地减少图像令牌计算。虽然这些方法在孤立设置中实现了显著的效率提升,但我们在统一训练下观察到一致的协同损失——任务特定的令牌丢弃需要不同的参数路径,并消除了联合优化中通常观察到的相互性能增益。我们的发现表明,高效统一建模需要保留共享的跨任务结构,强调了需要协同感知的加速策略。项目页面:https://chicychen.github.io/TokenReductionUnifiedVLM/。

英文摘要

Unified vision-language models (VLMs) integrate visual understanding and visual generation within a single autoregressive backbone, but their joint training is computationally expensive and largely overlooked from an efficiency perspective. In this work, we study the feasibility and limits of token-reduction-based acceleration for unified VLM training. Through a systematic analysis of layerwise attention allocation, we uncover a fundamental asymmetry: visual understanding exhibits substantial late-layer visual redundancy, whereas visual generation maintains persistent dependence on image tokens across depth. Guided by this observation, we design task-specific accelerators that selectively reduce image-token computation for each objective. While these methods achieve significant efficiency gains in isolated settings, we observe a consistent synergy loss under unified training -- task-specific token dropping necessitates divergent parameter pathways and eliminates the mutual performance gains typically observed in joint optimization. Our findings suggest that efficient unified modeling requires preserving shared cross-task structures, highlighting the need for synergy-aware acceleration strategies. Project page: https://chicychen.github.io/TokenReductionUnifiedVLM/.

2606.01498 2026-06-02 cs.CL cs.AI

TimeSage-MT: A Multi-Turn Benchmark for Evaluating Agentic Time Series Reasoning

TimeSage-MT:用于评估智能时间序列推理的多轮基准测试

Yaxuan Kong, Qingren Yao, Yuqi Nie, Yichen Li, Yilei Shao, Stefan Zohren, Anna Vettoruzzo, Joaquin Vanschoren, Ming Jin, Qingsong Wen

发表机构 * University of Oxford(牛津大学) VulpiVox Intelligence Eindhoven University of Technology(埃因霍温理工大学) Griffith University(格里菲斯大学) Squirrel Ai Learning East China Normal University(华东师范大学)

AI总结 提出TimeSage-MT多轮基准测试,包含240个任务和2680轮对话,覆盖8个真实领域,用于评估LLM智能体在时间序列推理中的表现,揭示其在决策导向任务中的性能下降及记忆、不确定性处理等缺陷。

详情
AI中文摘要

时间序列数据为许多真实世界领域的决策提供信息。虽然大语言模型(LLM)智能体可以通过自然语言和工具分析数据,但目前尚不清楚它们是否能在多轮对话中进行可靠的时间序列分析。现有基准测试侧重于预测和异常检测等单步任务,忽略了用户目标演变、智能体必须基于先前分析以及结论从累积证据中得出的实际工作流程。在这项工作中,我们引入了TimeSage-MT,一个用于智能时间序列推理的多轮基准测试,包含240个任务和2,680轮对话,涵盖8个真实世界领域,从基础探索到决策导向分析。TimeSage-MT通过一个可复现的流程构建,该流程将真实世界的时间序列数据转换为具有可验证答案的多轮对话。它提供了一个统一的评估协议和公共排行榜,用于比较时间序列智能系统。为了展示基准测试的实用性,我们评估了前沿LLM以及TimeSage——一种配备全面时间序列技能库的新型结构化智能体。结果显示,在决策导向任务上性能急剧下降,原因是记忆、不确定性处理和基于领域的决策方面的失败。TimeSage-MT揭示了当前智能推理中的关键差距,并为未来发展提供了严谨的基础。

英文摘要

Time series data inform critical decisions across many real-world domains. While large language model (LLM) agents can analyze data through natural language and tools, it remains unclear whether they can conduct reliable time series analysis across multi-turn conversations. Existing benchmarks focus on single-step tasks such as forecasting and anomaly detection, overlooking practical workflows where user goals evolve, agents must build on prior analyses, and conclusions emerge from accumulated evidence. In this work, we introduce TimeSage-MT, a multi-turn benchmark for agentic time series reasoning with 240 tasks and 2,680 dialogue turns across 8 real-world domains, spanning basic exploration to decision-oriented analysis. TimeSage-MT is built through a reproducible pipeline that converts real-world time series data into multi-turn conversations with verifiable answers. It provides a unified evaluation protocol and public leaderboard for comparing time series agentic systems. To demonstrate the benchmark's utility, we evaluate frontier LLMs alongside TimeSage, a novel structured agent equipped with a comprehensive time series skill library. The results show sharp performance drops on decision-oriented tasks, driven by failures in memory, uncertainty handling, and domain-based decision making. TimeSage-MT exposes critical gaps in current agentic reasoning and provides a rigorous foundation for future development.

2606.01493 2026-06-02 cs.CV

SplatShot: 3D Face Avatar Generation from a Single Unconstrained Photo

SplatShot: 从单张非约束照片生成3D人脸头像

Hao Liang, Zhixuan Ge, Soumendu Majee, Joanna Li, Ashok Veeraraghavan, Guha Balakrishnan

发表机构 * Rice University(莱斯大学) Samsung Research America(三星美国研究院)

AI总结 提出SplatShot,一种无需训练的方法,通过将3D高斯泼溅与扩散模型去噪过程耦合,从单张照片生成多视图一致的逼真3D人脸头像。

Comments 28 pages, 15 figures

详情
AI中文摘要

从单张非约束照片重建逼真的3D人脸头像具有挑战性:前馈3D高斯泼溅(3DGS)模型在分布外输入上性能下降,而预训练扩散模型生成高保真图像但缺乏多视图一致性。我们观察到这些范式本质上是互补的:显式3D表示保证几何一致性,而2D扩散先验确保逼真度。基于此,我们提出SplatShot,一种无需训练的框架,直接在去噪过程中耦合这些表示。给定一个基础3DGS人脸模型和一张参考图像,我们使用每步3D反馈循环联合去噪所有目标视图。在每个时间步,我们从噪声潜变量预测干净图像,将3DGS重新拟合到这些多视图预测,并将3DGS重新渲染与2D预测之间的光度差异反向传播到噪声估计中。这将采样轨迹引导向严格3D一致、身份保真的输出。在各种野外图像上的实验表明,SplatShot生成的3D头像具有优越的身份保持、逼真度和多视图一致性。

英文摘要

Reconstructing a photorealistic 3D face avatar from a single unconstrained photograph is challenging: feed-forward 3D Gaussian Splatting (3DGS) models degrade on out-of-distribution inputs, while pretrained diffusion models produce high-fidelity images but lack multi-view consistency. We observe that these paradigms are fundamentally complementary: explicit 3D representations guarantee geometric consistency, whereas 2D diffusion priors ensure photorealism. Building on this, we propose SplatShot, a training-free framework that couples these representations directly within the denoising process. Given a base 3DGS face model and a single reference image, we jointly denoise all target views using a per-step 3D feedback loop. At each timestep, we predict clean images from the noisy latents, refit the 3DGS to these multi-view predictions, and back-propagate the photometric discrepancy between the 3DGS re-renderings and 2D predictions into the noise estimate. This steers the sampling trajectory toward strictly 3D-coherent, identity-faithful outputs. Experiments on diverse in-the-wild images demonstrate that SplatShot produces 3D avatars with superior identity preservation, photorealism, and multi-view consistency.

2606.01485 2026-06-02 cs.CV cs.LG

Perception First: A Frontier Native-Video Model with Self-Consistency for Implicit Video Question Answering

感知优先:具有自一致性的前沿原生视频模型用于隐式视频问答

Ali Alavi

发表机构 * The Ohio State University(俄亥俄州立大学)

AI总结 本文通过系统实验发现隐式视频问答基准是感知受限而非推理受限,并指出提升基础模型感知能力和轻量级测试时去噪是唯一可靠手段。

详情
AI中文摘要

我们描述了提交至CVPR 2026 VRR挑战赛的方案,该方案基于ImplicitQA / VRR-QA基准:一种多项选择视频问答任务,其中答案有意地不在任何单帧中可观察,必须从创意视频的不连续帧中的空间布局、运动、深度、视角、因果关系和社会背景推断。我们对开源视频大语言模型(Qwen2.5-VL、Qwen3-VL、InternVL3、Gemma-3以及经过强化学习训练的视频推理器Video-R1和VideoChat-R1.5)和一系列推理时策略(思维链、问题分解、描述-推理级联、音频转录、空间状态提示、自一致性、多模型集成和类别路由)进行了系统的、无需训练的研究。我们的核心发现是,该基准是感知受限而非推理受限:推理侧的增强是中性的甚至有害的,而基础模型的感知能力和轻量级测试时去噪是唯一可靠的杠杆。按类别的错误分析将困难定位到低级感知——相对深度、视角和计数是最困难的类别,而因果和社会推理几乎已解决——一个明确注入单目深度线索以攻击最弱类别的提示将测试准确率降低了5.8个百分点,证实了模型需要更好的感知,而非更好的过程。

英文摘要

We describe our submission to the VRR Challenge @ CVPR 2026, built on the ImplicitQA / VRR-QA benchmark: multiple-choice video question answering in which answers are deliberately not observable in any single frame and must be inferred from spatial layout, motion, depth, viewpoint, causality, and social context across discontinuous frames of creative video. We conduct a systematic, training-free study spanning open-source Video-LMMs (Qwen2.5-VL, Qwen3-VL, InternVL3, Gemma-3, and the RL-tuned video reasoners Video-R1 and VideoChat-R1.5) and a battery of inference-time strategies (chain-of-thought, question decomposition, describe-then-reason cascades, audio transcripts, spatial state prompting, self-consistency, multi-model ensembling, and category routing). Our central finding is that this benchmark is perception-bound rather than reasoning-bound: reasoning-side augmentations are neutral-to-harmful, whereas base-model perceptual capability and lightweight test-time denoising are the only reliable levers. A per-category error analysis localizes the difficulty to low-level perception -- relative depth, viewpoint, and counting are the hardest categories, while causal and social reasoning are nearly solved -- and a prompt that explicitly injects monocular depth cues to attack the weakest category lowers test accuracy by 5.8 points, confirming that the model needs a better percept, not a better procedure.

2606.01483 2026-06-02 cs.LG cs.AI eess.AS

MURMUR: An Efficient Inference System for Long-Form ASR

MURMUR:一种高效的长时间语音识别推理系统

Wei-Tzu Lee, Keisuke Kamahori, Baris Kasikci

发表机构 * University of Washington(华盛顿大学)

AI总结 提出MURMUR推理系统,通过块间和块内两级优化,在保持高精度的同时显著降低长时间语音识别的延迟。

详情
AI中文摘要

长时间自动语音识别(ASR)需要高精度和低延迟,但现有系统迫使两者之间进行权衡。基于块的流水线在并行窗口中处理音频以实现低延迟,但丢失了跨块上下文,并且需要脆弱的启发式方法来对齐边界处的说话人和时间戳。长上下文ASR模型通过单次传递解决所有问题以获得更好的准确性,但速度慢一个数量级。我们提出MURMUR,一个通过两级操作克服这种权衡的推理系统。在块间级别,我们重新审视基于块的流水线以适应现代长上下文ASR,将块大小视为可调超参数,并表明中间块大小在准确性和延迟之间取得了良好的平衡。在块内级别,我们通过应用于输出和语音令牌的滑动窗口KV缓存驱逐策略来利用注意力稀疏性。在AMI-IHM上,MURMUR匹配单次传递准确性,同时将延迟降低4.2倍,通过令牌驱逐进一步获得收益,相对tcpWER退化小于1%。MURMUR的代码可在https://github.com/uw-syfi/Murmur获取。

英文摘要

Long-form automatic speech recognition (ASR) requires both high accuracy and low latency, but existing systems force a trade-off between the two. Chunk-based pipelines process audio in parallel windows for low latency, but lose cross-chunk context and need brittle heuristics to align speakers and timestamps at boundaries. Long-context ASR models resolve everything in a single pass for better accuracy, but are an order of magnitude slower. We propose Murmur, an inference system that overcomes this trade-off by operating at two levels. At the inter-chunk level, we revisit the chunk-based pipeline for modern long-context ASR, treating chunk size as a tunable hyperparameter, and show that intermediate chunk sizes strike a good balance of accuracy and latency. At the intra-chunk level, we exploit attention sparsity through a sliding window KV cache eviction policy applied to both output and speech tokens. On AMI-IHM, Murmur matches single-pass accuracy while reducing latency by 4.2x, with further gains from token eviction at less than 1% relative tcpWER degradation. The code of Murmur is available at https://github.com/uw-syfi/Murmur.
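
The abstract names a sliding window KV cache eviction policy but gives no implementation details. The sketch below shows only the generic idea, assuming a fixed window measured in tokens: once the cache is full, appending a new entry evicts the oldest one. The class name, the `window` parameter, and the stored tuple layout are hypothetical and do not describe Murmur's actual code.

```python
from collections import deque


class SlidingWindowKVCache:
    """Keep key/value entries for only the most recent `window` tokens."""

    def __init__(self, window):
        if window <= 0:
            raise ValueError("window must be positive")
        # A deque with maxlen drops the oldest entry automatically when full.
        self.entries = deque(maxlen=window)

    def append(self, position, key, value):
        """Add one token's key/value pair, evicting the oldest if needed."""
        self.entries.append((position, key, value))

    def positions(self):
        """Token positions currently retained, oldest first."""
        return [position for position, _, _ in self.entries]
```

Since the abstract says eviction applies to both output and speech tokens, one instance per token type (possibly with different window sizes) would be a natural arrangement, though that split is an assumption here.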

2606.01482 2026-06-02 cs.CL

Beyond Topical Similarity: Contrastive Evidence Retrieval with Interpretable Attention Alignment in RAG

超越主题相似性:RAG 中具有可解释注意力对齐的对比证据检索

Francielle Vargas, João Robiatti, Diego Alves, Lucas Pascotti Valem, Maximilian Seeth, Sebastián Ferrada, Ameeta Agrawal, Daniel Pedronette, André Freitas

发表机构 * University of Chile(智利大学) São Paulo State University(圣保罗州立大学) Saarland University(萨尔大学) University of Munich(慕尼黑大学) Portland State University(波特兰州立大学) Idiap Research Institute(Idiap研究所)

AI总结 提出 CERA 框架,通过基于主观性的困难负样本选择和辅助注意力对齐损失注入证据归纳偏差,实现可解释且事实准确的检索。

详情
AI中文摘要

确保 RAG 中的事实性和可解释性仍然是一个开放且紧迫的问题。我们引入了对比证据理据注意力(CERA),这是第一个采用基于主观性的困难负样本选择并通过辅助注意力对齐损失将证据归纳偏差注入对比学习的检索框架。CERA 使用两个训练目标微调密集检索器:基于三元组的对比学习和可解释注意力对齐,后者使用在人工标注的事实理据上按词性加权的掩码分布作为证据信号,来监督 CLS 到 token 的注意力。在大型临床试验报告语料库上的实验表明,与 Contriever 和困难负样本选择基线相比,基于主观性的困难负样本选择显著提高了检索效果。此外,理据对齐在保持有竞争力的检索性能的同时提高了忠实度,支持了注意力在人类理据指导下可以作为模型行为更忠实解释的假设。超越主题相似性,CERA 使检索器能够识别构成支持证据的特定 token,促进了 RAG 系统中更可解释的证据选择。

英文摘要

Ensuring factuality and interpretability in RAG remains an open and urgent problem. We introduce Contrastive Evidence Rationale Attention (CERA), the first retrieval framework to employ subjectivity-based hard negative selection and inject an evidential inductive bias into contrastive learning through an auxiliary attention alignment loss. CERA fine-tunes a dense retriever using two training objectives: triplet-based contrastive learning and interpretable attention alignment, which supervises CLS-to-token attention using a part-of-speech-weighted masking distribution over human-annotated factual rationales as evidence signals. Experiments on a large corpus of clinical trial reports demonstrate that the subjectivity-based hard negative selection substantially improves retrieval effectiveness compared to both Contriever and hard negative selection baselines. Furthermore, rationale alignment improves faithfulness while maintaining competitive retrieval performance, supporting the hypothesis that attention can serve as a more faithful explanation of model behavior when guided by human rationales. Moving beyond topical similarity, CERA enables the retriever to identify the specific tokens that constitute supporting evidence, promoting more interpretable evidence selection in RAG systems.

2606.01481 2026-06-02 cs.CV

SafeGen-Bench: Benchmarking Safety in Image-Conditioned Text-to-Video Generation

SafeGen-Bench: 图像条件文本到视频生成中的安全性基准测试

Yingzi Ma, Xiaogeng Liu, Yawen Zheng, Chaowei Xiao

发表机构 * University of Wisconsin-Madison(威斯康星大学麦迪逊分校) Tsinghua University(清华大学) Johns Hopkins University(约翰霍普金斯大学)

AI总结 针对图像条件文本到视频生成中安全文本和图像组合仍可能产生有害内容的问题,提出SafeGen-Bench基准,定义10个恶意类别并评估现有模型,发现当前模型难以避免生成恶意内容,且单模态护栏防御不足。

Comments 8 pages, 7 figures, 2 tables

详情
AI中文摘要

随着文本到图像扩散模型的快速发展,像Sora这样的生成式视频模型(T2V模型)现在可以从文本提示或初始图像生成短视频。然而,合成视频生成——尤其是在初始图像引导下——常常带来风险,包括可能创建非法、政治敏感或不道德的内容。现有基准已开始考虑生成视频的安全性,但它们主要关注用恶意文本提示测试模型,忽略了文本提示和图像组合仍可能导致有害视频内容的场景。在实践中,这是一个常见且具有挑战性的问题:从安全文本和图像输入生成的视频仍可能传达有害信息。为弥补这一差距,我们引入了SafeGen-Bench,一个专门设计用于评估条件T2V模型安全性的基准。我们的基准定义了10个恶意类别,重点关注与时间序列和描绘行为相关的风险。SafeGen-Bench包含从多样图像和视频源中精心选择的起始帧,并配以相应的文本提示以模拟真实输入。我们在SafeGen-Bench上评估了多种条件T2V模型,结果表明当前模型难以持续避免生成恶意内容,不安全分数高达44.5,尤其是在需要高质量的条件下。此外,我们评估了基于文本和基于图像的护栏在我们的基准上的有效性,发现单模态护栏单独不足以提供稳健防御,在七个恶意类别中失败率达80%。我们希望SafeGen-Bench能促进更安全、更可控的条件T2V模型的开发。

英文摘要

With the rapid advancements in text-to-image diffusion models, generative video models (T2V models) like Sora can now produce short synthetic videos from a text prompt or an initial image. However, synthetic video generation -- especially when guided by an initial image -- often poses risks, including the potential creation of illegal, politically sensitive, or unethical content. Existing benchmarks have started to consider the safety of generated videos, but they primarily focus on testing models with malicious text prompts, ignoring the scenario where text prompt and image combination may still lead to harmful video content. In practice, this is a common and challenging issue: videos generated from safe text and image inputs can nonetheless convey harmful information. To bridge this gap, we introduce SafeGen-Bench, a benchmark specifically designed to evaluate the safety of conditional T2V models. Our benchmark defines 10 malicious categories, concentrating on risks related to both temporal sequences and depicted behaviors. SafeGen-Bench consists of carefully selected start frames from diverse image and video sources, paired with corresponding text prompts to simulate realistic inputs. We evaluate a variety of conditional T2V models on SafeGen-Bench, and the results indicate that current models struggle to consistently avoid generating malicious content with unsafety scores reaching up to 44.5, especially under conditions requiring high quality. Furthermore, we assess the effectiveness of both text-based and image-based guardrails on our benchmark, finding that unimodal guardrails alone were insufficient to provide a robust defense, with an 80% failure rate across seven malicious categories. We hope that SafeGen-Bench will foster the development of safer and more controllable conditional T2V models.

2606.01479 2026-06-02 cs.CL

Sparse Autoencoders for Interpretable Emotion Control in Text-to-Speech

用于文本到语音中可解释情感控制的稀疏自编码器

Hongfei Du, Jiacheng Shi, Sidi Lu, Gang Zhou, Ye Gao

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文通过稀疏自编码器分析基于LLM的TTS模型中的情感相关潜在特征,提出特征级干预框架实现双向情感诱导与抑制,无需修改模型参数。

Comments Accepted by ICML 2026

详情
AI中文摘要

将大型语言模型(LLMs)集成到文本到语音(TTS)系统中提高了语音的表现力,但可解释的情感控制仍然具有挑战性。现有方法主要依赖于外部条件或全局激活引导,对情感控制背后的内部表示提供的洞察有限。在这项工作中,我们使用稀疏自编码器(SAEs)分析基于LLM的TTS模型语义隐藏状态中与情感相关的变异,以识别稀疏潜在特征。我们的分析表明,情感变异分布在多个稀疏潜在特征上,而对一小部分特征进行干预可以实现可解释的情感控制。基于这一观察,我们引入了一个特征级干预框架,用于双向情感诱导和抑制,而无需修改骨干参数。我们进一步表明,不同的潜在特征与特定的声学属性(例如,音高)相关联,这表明情感表达源于协调的潜在贡献,而非单一的全局变化。实验上,引导这些稀疏潜在特征在情感诱导和抑制性能上达到或优于全局引导和现有的TTS基线。

英文摘要

Integrating large language models (LLMs) into text-to-speech (TTS) systems has improved speech expressiveness, yet interpretable emotional control remains challenging. Existing approaches primarily rely on external conditioning or global activation steering, offering limited insight into the internal representations underlying emotional control. In this work, we analyze emotion-related variation in the semantic hidden states of LLM-based TTS models using sparse autoencoders (SAEs) to identify sparse latent features. Our analysis shows that emotional variation is distributed across multiple sparse latent features, while intervening on a small subset enables interpretable emotion control. Building on this observation, we introduce a feature-level intervention framework for bidirectional emotion induction and suppression without modifying backbone parameters. We further show that distinct latent features are associated with specific acoustic attributes (e.g., pitch), suggesting that emotional expression arises from coordinated latent contributions rather than a single global shift. Empirically, steering these sparse latent features achieves comparable or superior emotion induction and suppression performance relative to global steering and existing TTS baselines.

2606.01469 2026-06-02 cs.CL

Peacemaker at ATE-IT: Automatic term extraction from Italian text for waste management data using encoder model

Peacemaker at ATE-IT: 使用编码器模型从意大利语文本中自动提取废物管理术语

Mahdi Bakhtiyarzadeh, Hadi Bayrami Asl Tekanlou, Jafar Razmara

发表机构 * Department of Computer Science, University of Tabriz(大不里士大学计算机科学系) University of Tabriz(大不里士大学)

AI总结 针对ATE共享任务中的Task A,提出一种低计算成本、可解释的自动术语提取方法,通过微调编码器模型在少量资源上实现平衡性能,为低资源模型提供起点。

Comments 9 pages, 2 figures, Published in EVALITA 2026, CEUR Workshop Proceedings Vol. 4195

详情
Journal ref
CEUR Workshop Proceedings, Vol. 4195, 2026
AI中文摘要

自动术语提取的发展在现代技术中变得越来越重要。目前几乎每个可用的搜索引擎中都存在自动术语提取。最近的进展为自动术语的提取提供了有希望的结果;然而,由于多种因素,如可用于训练的标注文档数量有限,以及由于领域变化导致提取多词表达式的复杂性,准确标注是困难的。在本文中,我们将提出一种低成本且可解释的自动术语提取方法,专门为ATE共享任务中的Task A开发。这种新方法利用微调提取策略,可以在少量计算资源上运行。我们使用类型级和微级精确率、召回率和F1分数来评估我们的自动化系统,以衡量提取性能的两个互补方面。根据实验结果,我们提出的方法与其他团队相比,实现了一致且平衡的性能。尽管该技术本身相对简单,但它为低资源模型提供了一个良好的起点。总体而言,研究结果表明,未来在模型扩展方面有可能取得重大进展,同时仍能保持其可解释性。

英文摘要

The development of automatic term extraction has become increasingly important in modern technology. Automatic term extraction can be found in virtually every search engine that is currently available to users. Recent advancements have provided promising results for the extraction of automatic terms; however, accurate labeling is difficult because of several factors, such as the limited number of annotated documents available for training and the complexity of extracting multi-word expressions due to shifts in the domain. In this paper, we will present a low-cost and interpretable method of automatic term extraction, developed specifically for Task A of the ATE Shared Task. This new method utilizes fine-tuning extraction strategies that can run on a small amount of computational resources. We evaluated our automated system using both type-level and micro-level measures of precision, recall, and F1-score to measure both complementary aspects of the extraction performance. According to the experimental results, our proposed approach achieves consistent and balanced performance compared to other teams. Even though the technique itself is relatively straightforward, it serves as a good starting point for low-resource models. Overall, the findings point toward the possibility of significant future advancements (in model expansion) with higher-level performance still able to retain their ability to be interpreted.

2606.01464 2026-06-02 cs.CL

Cross-lingual Self-Consistency for Multilingual Reasoning with Language Models

跨语言自一致性:面向语言模型的多语言推理

Ahmed Elhady, Eneko Agirre, Mikel Artetxe

发表机构 * HiTZ Center, University of the Basque Country (UPV/EHU)(巴斯克大学HiTZ中心) Reka AI

AI总结 提出无监督强化学习方法,通过强制模型对跨语言等价问题产生相同答案来增强多语言推理,在MGSM上平均提升21.7%,并展现出强泛化能力。

Comments Paper under review

详情
AI中文摘要

尽管大语言模型(LLMs)的多语言覆盖范围在扩大,但其高级推理能力仍主要局限于少数高资源语言(如英语)。为了解决这一问题,我们提出了一种无监督强化学习(RL)方法,通过强制跨语言自一致性(即模型应对不同语言的等价问题产生相同最终答案)来增强多语言推理。现有方法受限于多语言推理数据的稀缺性,且对未见语言的泛化能力较弱。我们的方法既不需要标准答案,也不需要平行数据,在MGSM的10种语言上平均提升了高达21.7%。此外,我们的方法展现出强泛化能力,在训练期间未见的MGSM语言上平均提升18.2%,在3个分布外基准测试上提升高达6.2%。这些结果表明,基于一致性的方法有潜力在无需监督数据的情况下提升LLMs的多语言能力。

英文摘要

Despite expanding their multilingual coverage, the advanced reasoning capabilities of LLMs remain largely confined to a few high-resource languages like English. To address this, we propose an unsupervised Reinforcement Learning (RL) approach to enhance multilingual reasoning by enforcing cross-lingual self-consistency: the principle that a model should produce the same final answer for equivalent problems in different languages. Existing methods are limited by the scarcity of multilingual reasoning data and show weak generalization to unseen languages. Our approach requires neither gold answers nor parallel data, and it achieves average gains of up to 21.7% on MGSM across 10 languages. In addition, our method demonstrates strong generalization, with an 18.2% mean improvement on MGSM languages unseen during training, and up to 6.2% gain on 3 out-of-distribution benchmarks. These results show the potential of consistency-based methods to improve the multilingual capabilities of LLMs without requiring supervised data.
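
The abstract states the principle (same final answer for equivalent problems across languages) but not the exact reward. One simple way to turn that principle into a label-free training signal, assumed here purely for illustration, is to reward each language's answer for agreeing with the majority answer across languages. The function name and the 0/1 reward values are hypothetical.

```python
from collections import Counter


def consistency_rewards(answers_by_language):
    """Give reward 1.0 to answers matching the cross-lingual majority, else 0.0.

    No gold answer is used: the majority answer acts as a pseudo-label.
    Ties are broken in favor of the answer seen first.
    """
    if not answers_by_language:
        return {}
    counts = Counter(answers_by_language.values())
    majority_answer, _ = counts.most_common(1)[0]
    return {
        language: 1.0 if answer == majority_answer else 0.0
        for language, answer in answers_by_language.items()
    }
```

For example, if the English and Spanish versions of a problem both yield "42" and the Basque version yields "41", the first two receive reward 1.0 and the third receives 0.0. The paper's actual reward may differ, for instance by weighting languages or comparing pairs.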

2606.01462 2026-06-02 cs.AI cs.CL cs.LG

An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models

人工推理之谜:探究大型推理模型中的生成-评估差距

Mingzhong Sun, Teresa Yeo, Armando Solar-Lezama, Tan Zhi-Xuan

发表机构 * NUS Department of Computer Science(新加坡国立大学计算机科学系) MIT EECS(麻省理工学院电气工程与计算机科学系) A*STAR(新加坡科技研究局) Singapore-MIT Alliance for Research and Technology (SMART)(新加坡-麻省理工学院科研中心(SMART))

AI总结 本文通过VAIR数据集发现大型推理模型在评估推理时存在显著缺陷,表现为答案确认偏差,即模型倾向于验证答案正确性而非仔细检查推理步骤。

Comments 10 pages, 8 figures, 2 tables (Appendix: 19 pages, 13 figures, 3 tables)

详情
AI中文摘要

对人类推理的研究表明,人们通常更擅长评估推理而非从头生成推理。相比之下,大型推理模型(LRMs)经过训练,擅长生成长链推理以解决复杂问题。那么,LRMs在评估推理方面表现如何?我们通过有效答案-无效推理(VAIR)数据集进行研究:该数据集包含数学问题和解决方案,这些解决方案存在琐碎的推理缺陷但答案有效,旨在将推理评估与推理生成混淆因素分离。与人类(我们发现人类在评分此类问题时仅比解决它们差6%)不同,我们发现LRMs存在显著的生成-评估差距:前沿模型在评估VAIR解决方案时得分低至48%,尽管在解决方案生成方面近乎完美。为何存在这一谜团?通过思维链(CoT)分析,我们发现了答案确认偏差的证据:LRMs通常先产生答案,然后检查正确答案,而不是仔细验证每一步,即使在注意到异常推理时也会编造合理化解释。线性探针进一步证实了这一点,表明虽然LRM激活编码了有效推理的某些表示,但它们未能稳健地将VAIR解决方案表示为无效。对最终答案表示的因果修补导致LRM判断和激活翻转,表明答案有效性是模型确认偏差的原因。这些发现揭示了主导推理训练方法的显著局限性,该方法激励LRMs生成并确认朝向正确答案的推理,但未能稳健地评估底层推理。

英文摘要

Studies of human reasoning have shown that people are typically stronger at evaluating reasoning than producing it from scratch. In contrast, large reasoning models (LRMs) are trained to excel at producing long chains of reasoning to solve complex problems. How then do LRMs perform at evaluating reasons? We investigate this with the Valid-Answer-Invalid-Reasoning (VAIR) dataset: math problems and solutions with trivial reasoning flaws but valid answers, designed to isolate reasoning evaluation from the confound of reasoning production. Unlike humans, who we find are only 6% worse at grading than solving such problems, we find a substantial production-evaluation gap in LRMs: frontier models score as low as 48% when evaluating VAIR solutions, despite near-perfect solution production. Why this enigma? Through chain-of-thought (CoT) analysis, we find evidence of an answer confirmation bias: LRMs often produce then check for the correct answer instead of carefully verifying each step, fabricating rationalizations even when noticing anomalous reasoning. Linear probes corroborate this, showing that while LRM activations encode some representation of valid reasoning, they fail to robustly represent VAIR solutions as invalid. Causal patching of the final answer's representations causes LRM verdicts and activations to flip, demonstrating that answer validity is responsible for models' confirmation biases. These findings indicate an outstanding limitation in dominant approaches to reasoning training, which incentivize LRMs to produce and confirm reasoning towards correct answers, but not to robustly evaluate the underlying reasons.

2606.01461 2026-06-02 cs.LG cs.MA

Genotype-Conditioned Molecular Generation via Evidence-Grounded Multi-Objective Latent Perturbation in Diffusion Models

基于证据的多目标潜在扰动在扩散模型中的基因型条件分子生成

Brenda Nogueira, Gisela A. Gonzalez-Montiel, Nitesh V. Chawla, Nuno Moniz

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系) University of Notre Dame(圣母大学) Department of Chemistry and Biochemistry(化学与生物化学系) Lucy Family Institute for Data & Society(Lucy Family数据与社会研究所)

AI总结 提出一种在预训练的基因型到药物扩散模型的潜在空间中,通过梯度上升优化可学习扰动以最大化药物敏感性、类药性和合成可及性的复合奖励,并利用实验数据和LLM管道确保生物合理性和机制一致性。

详情
AI中文摘要

由于肿瘤异质性和跨癌症亚型缺乏明确的分子靶点,开发有效的抗癌疗法仍然具有挑战性。以癌症基因型为条件的生成模型为个性化药物发现提供了一条有前景的途径,但现有方法缺乏对同时优化敏感性、可合成性和机制结合合理性的明确优化。我们提出了一种针对预训练的基因型到药物扩散模型的潜在空间优化方法,引入一个在分子潜在空间上的可学习扰动,通过梯度上升优化以最大化结合预测药物敏感性(AUC)、类药性(QED)和合成可及性(SAS)的复合奖励。关键的是,通过将奖励设计和评估基于实验衍生的癌细胞系数据和经过验证的药理学信号,将候选生成锚定在真实世界的临床证据中,从而强制执行生物学真实性。机制一致性合理性进一步通过基于扩散模型注意力机制的多智能体LLM管道进行评估。在来自三个保留评估集的15个癌细胞系上的实验表明,在敏感性、类药性、可合成性和化学有效性方面,与竞争基线相比,该方法具有一致且显著的改进。

英文摘要

Developing effective anticancer therapeutics remains challenging due to tumor heterogeneity and the absence of well-defined molecular targets across cancer subtypes. Generative models conditioned on cancer genotypes offer a promising avenue for personalized drug discovery, yet existing approaches lack explicit optimization for simultaneous sensitivity, synthesizability, and mechanistic binding plausibility. We present a latent-space optimization approach for a pretrained genotype-to-drug diffusion model, introducing a learnable perturbation over the molecular latent space optimized via gradient ascent to maximize a composite reward combining predicted drug sensitivity (AUC), drug-likeness (QED), and synthetic accessibility (SAS). Critically, biological realism is enforced by grounding both reward design and evaluation in experimentally-derived cancer cell line data and validated pharmacologic signals, anchoring candidate generation in real-world clinical evidence. Mechanistic consistency plausibility is further assessed by a multi-agent LLM pipeline grounded in the diffusion model's attention mechanism. Experiments across 15 cancer cell lines from three held-out evaluation sets demonstrate consistent and noticeable improvements over competing baselines in sensitivity, drug-likeness, synthesizability, and chemical validity.

2606.01460 2026-06-02 cs.SD eess.AS

A Lightweight Slot-Attention Framework for Multi-Instrument Multi-Pitch Estimation

轻量级槽注意力框架用于多乐器多音高估计

Michael Taenzer

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出一种轻量级槽注意力框架,通过匈牙利匹配和模块化扩展实现多乐器多音高估计,并验证了其在URMP上的乐器族分解效果。

Comments Preprint submitted to the IEEE 28th International Workshop on Multimedia Signal Processing (MMSP). This work has been submitted to the IEEE for possible publication. 6 pages, 2 figures

详情
AI中文摘要

多音高估计(MPE)通常预测混合信号中哪些音高是活跃的,但不预测是哪种乐器或声源产生的。本文研究了一种用于多乐器MPE(MI-MPE)的轻量级槽注意力框架,其中混合CQT被映射到一组无序的类声源音高图。该模型使用排列不变的匈牙利匹配来避免固定的输出语义,并将槽的数量视为活跃声源数量的上界。我们进一步研究了两种模块化扩展:一个自监督音色编码器,为槽级音色嵌入提供训练时目标;以及一个复音分支,正则化混合级和槽级预测的音高密度。实验表明,匈牙利匹配显著改善了URMP上的乐器族分解。音轨级预测仍然更具挑战性:音色和复音监督改善了特定配置,但未能一致地解决声源分配问题。结果表明,基于槽的架构是声源感知MPE的一个有前景的方向,同时强调了需要更仔细地将辅助音乐线索与槽身份耦合。

英文摘要

Multi-pitch estimation (MPE) typically predicts which pitches are active in a mixture, but not which instrument or source produced them. This paper investigates a lightweight slot-attention framework for multi-instrument MPE (MI-MPE), where a mixture CQT is mapped to an unordered set of source-like pitch maps. The model uses permutation-invariant Hungarian matching to avoid fixed output semantics and treats the number of slots as an upper bound on the number of active sources. We further study two modular extensions: a self-supervised timbre encoder that provides training-time targets for slot-level timbre embeddings, and a polyphony branch that regularizes the pitch density of mixture- and slot-level predictions. Experiments show that Hungarian matching substantially improves instrument family decomposition on URMP. Stem-level prediction remains more challenging: timbre and polyphony supervision improve selected configurations, but do not consistently resolve source assignment. The results suggest that slot-based architectures are a promising direction for source-aware MPE, while highlighting the need to couple auxiliary musical cues to slot identity more carefully.

2606.01458 2026-06-02 cs.RO

LEGS: Fine-Tuning Teleop-Free VLAs for Humanoid Loco-manipulation in an Embodied Gaussian Splatting World

LEGS: 在具身高斯泼溅世界中免遥操作微调VLA用于人形机器人全身操控

Hojune Kim, Timothy Chen, Jiankai Sun, Lars W. Osterberg, Qianzhong Chen, Ke Wang, Mac Schwager

发表机构 * Stanford University(斯坦福大学)

AI总结 提出LEGS混合模拟器,通过程序化运动基元生成器和两阶段颜色校准,无需遥操作即可合成训练数据,使VLA策略在真实人形机器人操控任务中达到或超越遥操作训练效果。

Comments https://legsvla.github.io/

详情
AI中文摘要

训练用于人形机器人全身操控的视觉-语言-动作(VLA)策略受到收集人类遥操作演示的高成本和复杂性的限制。迄今为止,在模拟器中微调的VLA策略未能有效迁移到人形机器人全身操控任务中。我们提出LEGS(通过具身高斯泼溅实现全身操控),一种混合模拟器,将网格前景(机器人、物体、道具)合成到从手持场景捕获重建的光照真实3D高斯泼溅(3DGS)背景上。LEGS使用程序化运动基元生成器在无需人类遥操作的情况下大规模合成带标签的演示,并通过确定性两阶段颜色校准将渲染的3DGS图像对齐到机器人的部署相机。在Unitree G1人形机器人上,跨三个全身难度递增的抓取放置任务和三个VLA骨干网络(psi_0, pi_0.5, GR00T N1.6),仅使用LEGS数据训练的策略在每个实验中都匹配或超越了使用人类遥操作演示训练的策略。它还优于消融了3DGS背景效果的纯网格模拟基线,表明光照真实渲染是合成数据迁移的关键因素。LEGS中的人形运动独立于场景外观记录,使得相同的自动生成演示可以在新背景和物体网格下重新渲染——覆盖新场景的成本比遥操作低15倍以上——从而增强训练数据对场景变化的鲁棒性。在物体和场景外观联合偏移下,使用重新渲染的LEGS-AUG数据训练的策略保持任务成功,而使用遥操作数据训练的基线完全失败。我们的项目页面位于https://legsvla.github.io/。

English abstract

Training vision-language-action (VLA) policies for humanoid loco-manipulation is constrained by the high cost and complexity of collecting human teleoperation demonstrations. VLA policies fine-tuned in simulators have, until now, failed to transfer effectively in humanoid loco-manipulation tasks. We present LEGS (Loco-manipulation via Embodied Gaussian Splatting), a hybrid simulator that composites a mesh foreground (robot, objects, props) over a photorealistic 3D Gaussian Splatting (3DGS) background reconstructed from a handheld scene capture. LEGS uses a procedural motion-primitive generator to synthesize labeled demonstrations at scale without human teleoperation, and a deterministic two-stage color calibration to align the rendered 3DGS image to the robot's deployment camera. On a Unitree G1 humanoid robot, across three pick-and-place tasks of increasing whole-body difficulty and three VLA backbones (psi_0, pi_0.5, GR00T N1.6), a policy trained purely on LEGS data matches or exceeds one trained on human teleoperation demos on every experiment. It also outperforms a mesh-only simulation baseline that ablates the effect of the 3DGS background, showing that photorealistic rendering is a key enabler for synthetic data transfer. Humanoid motion is recorded independently of scene appearance in LEGS, allowing the same auto-generated demonstrations to be re-rendered under new backgrounds and object meshes--covering a new scene at more than 15x lower cost than teleoperation--to augment training data for robustness to scene variations. Under combined object-and-scene appearance shift, the policy trained on re-rendered LEGS-AUG data maintains task success while the baseline trained on teleoperation data fails entirely. Our project page is located at https://legsvla.github.io/.
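
The foreground-over-background compositing described in the abstract can be pictured with a small sketch. The abstract does not say which operator LEGS uses, so this assumes the standard alpha "over" blend, out = alpha * foreground + (1 - alpha) * background, applied per pixel; the function name `composite_over` and the toy images are hypothetical, and the two-stage color calibration is not modeled here.

```python
def composite_over(foreground, alpha, background):
    """Blend a rendered foreground onto a background, pixel by pixel.

    foreground, background: H x W nested lists of (r, g, b) floats in [0, 1].
    alpha: H x W nested list of floats in [0, 1]; 1.0 means the foreground
    (for example a mesh render) fully covers the background at that pixel.
    """
    blended = []
    for fg_row, a_row, bg_row in zip(foreground, alpha, background):
        row = []
        for fg_px, a, bg_px in zip(fg_row, a_row, bg_row):
            # Standard "over" operator on each color channel.
            row.append(tuple(a * f + (1.0 - a) * b for f, b in zip(fg_px, bg_px)))
        blended.append(row)
    return blended


if __name__ == "__main__":
    red = (1.0, 0.0, 0.0)   # stands in for a mesh-rendered object
    blue = (0.0, 0.0, 1.0)  # stands in for a 3DGS background render
    image = composite_over([[red, red, red]], [[1.0, 0.5, 0.0]], [[blue, blue, blue]])
    print(image)
```

Keeping the foreground, its coverage mask, and the background as separate inputs is what lets the same foreground motion be re-blended onto a different background render.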

2606.01456 2026-06-02 cs.LG cs.CL cs.GT

Truthful AI Advisors: A Pre-Specified Benchmark for Large Language Model Honesty Under Preference Misalignment

Truthful AI Advisors: A Pre-Specified Benchmark for Large Language Model Honesty Under Preference Misalignment

Hamidreza Hasani Balyani, Seyed Pouyan Mousavi Davoudi, Alireza Amiri-Margavi, Amin Gholami Davodi, Arshia Gharagozlou

Affiliations * Amazon Lab126, HW Tech Org.; Computational Modeling and Simulation, University of Pittsburgh; Mathematics & Statistics Department, University of Minnesota Duluth

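AI summary: Builds a benchmark from the Crawford-Sobel cheap-talk model to evaluate whether large language models stay honest under preference conflict, and finds that the models over-reveal information, deviating from the strategic optimum.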

Comments 19 pages. Code and data: https://github.com/iHamidHasani/cheap-talk-llm-benchmark

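Details
AI abstract (translated from Chinese)

Large language models are increasingly deployed as advisors whose objectives differ from the user's: recommender systems optimize for engagement, sales assistants for purchases, and negotiation agents for concessions. Whether these advisors remain honest when honesty conflicts with their own payoff is a central alignment-evaluation question. We turn the classic Crawford-Sobel cheap-talk model into a pre-specified benchmark for LLM honesty under preference misalignment. Cheap-talk theory predicts neither full revelation nor silence but a coarse monotone partition, with fewer informative intervals as the preference conflict grows. The sender observes a state omega in [0,1], wants the receiver's action to be close to omega+b, and sends one costless message to a receiver whose ideal action is omega. The design uses 5 bias levels, 3 prompt frames, a fixed low-temperature setting, and 200 states per cell, for 12,000 sender calls in total. For the positive-bias grid b in {0.01,0.04,0.08,0.12}, the most-informative partition sizes are 7, 4, 3, 2, and the oracle normalized mutual information is 0.5294, 0.3268, 0.2205, 0.1829, respectively. Running the full design on four instruction-tuned models (GPT-4o, Claude Sonnet 4.5, Gemini 2.5 Flash-Lite, Llama-3.3-70B), we find that all four over-reveal by 1.8 to 4.2x relative to the most-informative equilibrium: normalized mutual information stays at 0.78-0.94, whereas the oracle prescribes 0.18-0.53. Informativeness declines with bias as predicted but never approaches the strategic optimum; the models show near-full revelation with a constant upward offset that tracks their bias (linear exaggeration). Payoff-maximizing versus honesty framing has a negligible effect. A decoder ablation shows that the finding is recoverable only when the receiver reads the number stated by the sender: an embedding-only decoder misreads the same data as near-babbling.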

English abstract

Large language models are increasingly deployed as advisors whose objective is not aligned with the user's: recommenders optimize for engagement, sales assistants for purchases, negotiation agents for concessions. Whether such advisors stay truthful when honesty conflicts with their own payoff is a core alignment-evaluation question. We turn the canonical Crawford-Sobel cheap-talk model into a pre-specified benchmark for LLM honesty under preference misalignment. Cheap-talk theory predicts neither full revelation nor silence but coarse monotone partitions, with fewer informative intervals as preference conflict grows. A sender observes a state omega in [0,1], wants the receiver's action near omega+b, and sends one costless message to a receiver whose ideal action is omega. The design uses 5 bias levels, 3 prompt frames, a fixed low-temperature setting, and 200 states per cell: 12,000 sender calls. For the positive-bias grid b in {0.01,0.04,0.08,0.12} the exact most-informative partition sizes are 7,4,3,2, with oracle normalized mutual information 0.5294, 0.3268, 0.2205, 0.1829. Running the full design on four instruction-tuned models (GPT-4o, Claude Sonnet 4.5, Gemini 2.5 Flash-Lite, Llama-3.3-70B), we find all four over-reveal relative to the most-informative equilibrium by 1.8 to 4.2x: normalized mutual information stays at 0.78-0.94 where the oracle prescribes 0.18-0.53. Informativeness declines with bias as predicted but never approaches the strategic optimum; rather than coarse partitions, models show near-full revelation with a constant upward offset tracking their bias (linear exaggeration). Payoff-maximizing versus honesty framing has negligible effect. A decoder ablation shows the finding is recoverable only when the receiver reads the sender's stated number: an embedding-only decoder mis-reads the same data as near-babbling.
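
The partition sizes quoted in the abstract can be reproduced with a short sketch, assuming the textbook uniform-quadratic Crawford-Sobel setting (state uniform on [0,1], quadratic losses); the abstract does not state these assumptions, but they are consistent with the reported sizes 7, 4, 3, 2. In that setting an N-interval equilibrium exists when 2N(N-1)b < 1, and its boundaries are a_i = i/N + 2b*i*(i-N). The function names are hypothetical, and the normalized mutual information values are not recomputed because the abstract does not give the normalization.

```python
def max_partition_size(bias):
    """Largest N with 2*N*(N-1)*bias < 1 (uniform-quadratic cheap talk)."""
    if bias <= 0:
        raise ValueError("this sketch covers positive bias only")
    n = 1
    # Keep growing N while an (N+1)-interval equilibrium still exists.
    while 2 * (n + 1) * n * bias < 1:
        n += 1
    return n


def partition_boundaries(bias, n):
    """Boundaries a_0..a_n of the n-interval equilibrium partition of [0, 1].

    Each interval is 4*bias longer than the one before it, so messages
    about high states are coarser than messages about low states.
    """
    return [i / n + 2 * bias * i * (i - n) for i in range(n + 1)]


if __name__ == "__main__":
    for bias in (0.01, 0.04, 0.08, 0.12):
        n = max_partition_size(bias)
        cuts = [round(a, 4) for a in partition_boundaries(bias, n)]
        print(bias, n, cuts)
```

For b = 0.12 the most informative equilibrium has two intervals split at 0.26, so the sender's message only tells the receiver whether the state is below or above that cut.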