arXivDaily arXiv daily academic digest, updated Monday through Friday
2605.04998 2026-06-16 cs.SD cs.IR cs.LG version update

Empirical Study of Pop and Jazz Mix Ratios for Genre-Adaptive Chord Generation

Jinju Lee

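Affiliations PearlLeeStudio

AI summary This study varies the pop-to-jazz mix ratio used for rehearsal in chord generation, finds that modest pop rehearsal preserves pop accuracy while improving jazz prediction, and corrects a checkpoint-selection error in the earlier version.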

Comments Erratum: the released F1 checkpoint equals the Phase-0 pop baseline (full SHA-256 verified); min mixed validation loss selection kept the unadapted warmup epoch. Tables 4 and 5 are best epoch metrics; mix ratio conclusions hold. A corrected retrain (jazz only validation), ft-pop80-v2, reproduces across 3 seeds. v1 F2 row fixed. 3 figs, 5 tables. https://huggingface.co/PearlLeeStudio

Abstract

This revision updates a pop-to-jazz chord-generation rehearsal study. Best-epoch metrics still show that modest pop rehearsal preserves pop accuracy while improving jazz prediction, but v2 corrects released-checkpoint selection: the released F1 equals Phase 0, F2 had a transcription error, and ft-pop80-v2 restores a hash-distinct jazz-adapted F1 across 3 seeds.

2605.04813 2026-06-16 cs.LG version update

A Biased Nonnegative Block Term Tensor Decomposition Model for Dynamic QoS Prediction

Wenjing Liu, Yujia Lei, Qu Wang

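Affiliations GitHub

AI summary Proposes the BNBT framework, which uses biased nonnegative block term tensor decomposition to strengthen representation capability, adds linear bias terms, and designs the SLF-NMUT algorithm, significantly improving accuracy in dynamic QoS prediction.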

Abstract

With the rapid development of cloud computing and Web services, Quality of Service (QoS) has become a key criterion for service selection and recommendation. Tensor latent feature analysis provides an effective way to model multidimensional QoS data, and most existing QoS prediction methods are mainly based on Canonical Polyadic (CP) decomposition or Tucker decomposition. However, constrained by their inherent structural properties, these methods cannot accurately capture the complex and dynamic dependencies in user-service interactions, which limits their prediction performance. To address this issue, this paper proposes a dynamic QoS prediction framework based on the Biased Nonnegative Block Term Tensor Decomposition Model, termed BNBT. Specifically, the proposed framework is developed from three aspects: (1) block term tensor decomposition is employed to enhance the representation capability of latent feature learning; (2) linear bias terms are incorporated to further improve prediction accuracy; and (3) a tensor-oriented single-element-dependent nonnegative multiplicative update algorithm, called SLF-NMUT, is designed for efficient parameter estimation. Extensive experiments on real-world QoS datasets demonstrate that the proposed BNBT framework consistently outperforms several state-of-the-art QoS prediction methods in terms of prediction accuracy.

2605.03297 2026-06-16 cs.SD cs.LG version update

Contrastive Regularization for Accent-Robust ASR

Van-Phat Thai, Aradhya Dhruv, Duc-Thinh Pham, Sameer Alam

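Affiliations Air Traffic Management Research Institute, Nanyang Technological University, Singapore; Center of AI Research, VinUniversity, Vietnam

AI summary Proposes supervised contrastive learning as a lightweight, accent-invariant auxiliary objective that regularizes encoder representations during CTC fine-tuning without architectural changes or explicit accent supervision, reducing word error rate on unseen accents by up to 25-29% on the L2-ARCTIC benchmark.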

Comments Accepted by Interspeech 2026

Abstract

ASR systems based on self-supervised acoustic pretraining and CTC fine-tuning achieve strong performance on native speech but remain sensitive to accent variability. We investigate supervised contrastive learning (SupCon) as a lightweight, accent-invariant auxiliary objective for CTC fine-tuning. An utterance-level contrastive loss regularizes encoder representations without architectural modification or explicit accent supervision. Experiments on the L2-ARCTIC benchmark show consistent WER reductions across multiple pretrained encoders, with up to 25-29% relative reduction under unseen-accent evaluation. Analysis using within-transcript cosine dispersion indicates that SupCon promotes more compact and stable representation geometry under accent variability. Overall, SupCon provides an effective and model-agnostic regularization strategy for improving accent robustness.
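
The abstract does not state the loss formula, so the following is only a minimal sketch of the widely used supervised contrastive formulation applied to utterance-level embeddings. It assumes that utterances sharing a transcript label count as positives (suggested by the "within-transcript" analysis, but not confirmed by the listing); the names `supcon_loss`, `l2_normalize`, and the `temperature` default are hypothetical, and embeddings are assumed to be nonzero.

```python
import math


def l2_normalize(vec):
    """Scale a nonzero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss, averaged over anchors that have a positive.

    Assumption: two utterances are positives when their labels match
    (for example, the same transcript spoken with different accents).
    """
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Temperature-scaled cosine similarity to every other sample.
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n) if j != i
        }
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits.values())
        log_denom = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
        total -= sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0
```

In an auxiliary-objective setup of the kind described, a term like this would be added to the CTC loss with a weighting coefficient; that weight is not given in the listing.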

2605.01961 2026-06-16 cs.LG version update

Multi-User Dueling Bandits: A Fair Approach using Nash Social Welfare

Maheed H. Ahmed, Mahsa Ghasemi

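Affiliations Electrical and Computer Engineering, Purdue University

AI summary For dueling bandits with heterogeneous preferences across multiple users, adopts the Nash Social Welfare objective, which maximizes the product of user utilities, proposes the Fair-Explore-Then-Commit and Fair-ε-Greedy algorithms, and proves regret upper bounds that match the lower bound.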

Abstract

Learning from human preference data is becoming a useful tool, from fine-tuning large language models to training reinforcement learning agents. However, in most scenarios, the model is trained on the average preference of all human evaluators, which, under large variations of preferences, can be unfair to minority groups. In this work, we consider fairness in dueling bandits, a standard framework for online learning from preference data. We assume that each user has a (potentially distinct) Condorcet winner, which is an arm preferred to every other arm. Using these user-specific Condorcet winners as reference points, we evaluate and score arms according to their performance relative to the corresponding winner. To promote fairness across heterogeneous users, we adopt the well-established Nash Social Welfare objective, which maximizes the product of user utilities, thereby inherently penalizing inequality and preventing the marginalization of any single user. Within this framework, we construct a hard instance to establish a regret lower bound of $Ω(T^{2/3}\min(K,D)^\frac{1}{3})$ for a time horizon $T$, $K$ arms, and $D$ users, which, to the best of our knowledge, is the first result quantifying the cost of fairness in dueling bandits with heterogeneous preferences. We then present the Fair-Explore-Then-Commit and Fair-$ε$-Greedy algorithms with a Condorcet winner identification phase. We further derive their regret upper bounds that match the lower-bound dependence on $T$ up to logarithmic factors.
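
To make the fairness argument concrete, here is a small sketch contrasting selection by average utility with selection by Nash Social Welfare (the product of user utilities). The utility numbers and the function names are hypothetical; the paper defines utilities relative to each user's Condorcet winner, which this toy example does not model.

```python
import math


def nash_social_welfare(utilities):
    """Product of per-user utilities (assumed nonnegative)."""
    return math.prod(utilities)


def best_arm_by_nsw(utility_table):
    """Pick the arm whose per-user utilities have the largest product."""
    return max(utility_table, key=lambda arm: nash_social_welfare(utility_table[arm]))


def best_arm_by_average(utility_table):
    """Pick the arm with the largest mean utility across users."""
    return max(
        utility_table,
        key=lambda arm: sum(utility_table[arm]) / len(utility_table[arm]),
    )


# Hypothetical utilities for two users: arm "A" favors user 1, arm "B" is balanced.
table = {"A": [1.0, 0.3], "B": [0.6, 0.6]}
print(best_arm_by_average(table), best_arm_by_nsw(table))
```

With these numbers the average prefers the lopsided arm, while the product prefers the balanced one, because a low utility for any single user pulls the product down sharply.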

2605.01702 2026-06-16 cs.LG version update

Floating-Point Networks with Automatic Differentiation Can Represent Almost All Floating-Point Functions and Their Gradients

Sejun Park, Yeachan Park, Geonho Hwang

Affiliations Department of Artificial Intelligence, Korea University; Department of Mathematics and Statistics, Sejong University; Department of Mathematical Sciences, Gwangju Institute of Science and Technology

AI summary Proves that, under floating-point arithmetic, floating-point neural networks with automatic differentiation can represent arbitrary floating-point functions and their gradients, for common activation functions such as ReLU and ELU.

Abstract

Theoretical studies show that for any differentiable function on a compact domain, there exists a neural network that approximates both the function values and gradients. However, such a result cannot be used in practice since it assumes real parameters and exact internal operations. In contrast, real implementations only use a finite subset of reals and machine operations with round-off errors. In this work, we investigate whether a similar result holds for neural networks under floating-point arithmetic, when the gradient with respect to the input is computed by the automatic differentiation algorithm $D^\mathtt{AD}$. We first show that given a floating-point function $ϕ$ (e.g., a loss function), arbitrary function values and gradients can be represented by a floating-point network $f$ and $D^\mathtt{AD}(ϕ\circ f)$, respectively. We further extend this result: given $ϕ_1,\dots,ϕ_n$, $D^\mathtt{AD}(ϕ_i\circ f)$ can simultaneously represent arbitrary gradients while $f$ represents the target values, under mild conditions. Our results hold for practical activation functions, e.g., $\mathrm{ReLU}$, $\mathrm{ELU}$, $\mathrm{GeLU}$, $\mathrm{Swish}$, $\mathrm{Sigmoid}$, and $\mathrm{tanh}$.
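
The gap between real-number theory and machine arithmetic that motivates this work can be seen in a few lines. The sketch below only illustrates IEEE 754 double-precision behavior in Python (version 3.9 or later for `math.nextafter`); it is not taken from the paper, and the function name is hypothetical.

```python
import math
import sys


def rounding_examples():
    """Collect a few facts about double-precision arithmetic as used by Python floats."""
    return {
        # 0.1, 0.2 and 0.3 are not exactly representable, so the sum is rounded.
        "sum_is_exact": 0.1 + 0.2 == 0.3,
        # Rounding after each operation makes addition order-dependent.
        "left_assoc": (0.1 + 0.2) + 0.3,
        "right_assoc": 0.1 + (0.2 + 0.3),
        # Floats form a finite grid: the gap just above 1.0 is machine epsilon.
        "gap_above_one": math.nextafter(1.0, 2.0) - 1.0,
    }


for name, value in rounding_examples().items():
    print(name, value)
```

Because every intermediate result is rounded onto this finite grid, statements proved for exact real-valued networks do not automatically carry over, which is the setting the abstract addresses.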

2606.09500 2026-06-16 cs.AI cs.DL version update

Deterministic Integrity Gates for LLM-Assisted Clinical Manuscript Preparation: An Auditable Biomedical Informatics Architecture

Yoojin Nam, Jinhoon Jeong, Namkug Kim

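Affiliations University of Ulsan College of Medicine; Asan Medical Center; Aperivue AMIST, Asan Medical Center

AI summary Proposes a deterministic integrity-gate architecture that decomposes the workflow into independently verifiable skills and places deterministic checks at each stage, addressing fabricated citations, data drift, and missing reporting-guideline items in LLM-generated clinical manuscripts.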

Comments 28 pages, 3 figures, 4 tables; includes supplementary material (deterministic-detector inventory, per-class defect breakdown, worked example). Software (MIT): https://github.com/Aperivue/medsci-skills . Archived on Zenodo: concept DOI https://doi.org/10.5281/zenodo.20155321 and version DOI (v3.8.0) https://doi.org/10.5281/zenodo.20582972

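AI abstract (translated from Chinese)

Objective. Large language models (LLMs) increasingly draft clinical research manuscripts, but their fluency can hide fabricated citations, numbers that drift from source tables, and unmet reporting-guideline items. Existing tools generate text without verifying it, and self-critique inherits the blind spots that produce confident fabrication. We describe an architecture that pairs generation with verification. Methods. The design rests on three principles: decompose the workflow into self-contained skills, gate every stage transition with halt-on-failure, and resolve each integrity question with the cheapest sufficient mechanism: a deterministic, re-executable check where one applies, and a prose-level probe only where interpretation is needed. This determinism-where-possible split, organized as an integrity-gate taxonomy, is the core contribution. It is implemented as MedSci Skills, an open-source toolkit of 43 skills coordinated by an orchestrator, whose deterministic tier comprises 21 standard-library detectors. We evaluate it on three reproducible public-dataset pipelines (STARD, PRISMA, STROBE) and a seeded-defect ablation. Results. Across the three pipelines, every content-hash manifest verified clean and the gates surfaced real defects. On 27 identical injected defects, the deterministic gates detected all 27 with no false positives on matched clean fixtures, whereas a generic single-prompt LLM reviewer detected 11, with its misses concentrated in generated code, bibliography internals, and style defects that the prose does not expose. Conclusions. Determinism-where-possible verification yields an auditable, re-executable trail that exposes the evidence a human needs to check an LLM-assisted manuscript: feasibility and reproducibility evidence, not a claim of human-competitive quality, which a separate blinded study addresses. MedSci Skills is MIT-licensed and archived (v3.8.0).

Abstract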

As autonomous research agents and AI co-scientist systems push large language models (LLMs) from drafting toward end-to-end manuscript production, the bottleneck shifts from generation to verification. Fluent LLM output can hide fabricated citations, numbers that drift from source tables, and unmet reporting-guideline items; existing tools generate without verifying, and self-critique inherits the blind spots that produce confident fabrication. We describe an architecture pairing generation with verification, resting on three principles: decompose the workflow into self-contained skills, gate every stage transition with halt-on-failure, and resolve each integrity question with the cheapest sufficient mechanism, a deterministic, re-executable check where one suffices and a prose-level probe only where interpretation is unavoidable. This determinism-where-possible split, organized as an integrity-gate taxonomy, is the core contribution. It is realized as MedSci Skills, an open-source toolkit of 43 skills with a 21-detector deterministic tier, evaluated on three public-dataset pipelines (STARD, PRISMA, STROBE) and a seeded-defect ablation. Across the three pipelines every content-hash manifest verified clean and the gates surfaced real defects; on 27 identical injected defects the deterministic gates detected all 27 with no false positives on the matched clean fixtures, whereas a single-prompt LLM reviewer detected 11, its misses in code, bibliography, and style defects the prose hides. Determinism-where-possible verification yields an auditable, re-executable trail that exposes the evidence a human needs to check an LLM-assisted manuscript: feasibility and reproducibility evidence, not a claim of human-competitive quality, which a separate blinded study addresses. MedSci Skills is MIT-licensed and archived (v3.8.0).
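
As a rough illustration of what a deterministic, re-executable check with halt-on-failure might look like, the sketch below builds and verifies a content-hash manifest using SHA-256 from Python's standard library. It is not the MedSci Skills API: the function names, the in-memory `artifacts` mapping, and the file names are all hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex digest of the content; identical bytes always give the same digest."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(artifacts: dict) -> dict:
    """Map each artifact name to the hash of its content."""
    return {name: sha256_hex(content) for name, content in artifacts.items()}


def verify_manifest(artifacts: dict, manifest: dict) -> list:
    """Return the sorted names that are missing or whose content has changed."""
    failures = []
    for name, expected in manifest.items():
        content = artifacts.get(name)
        if content is None or sha256_hex(content) != expected:
            failures.append(name)
    return sorted(failures)


def integrity_gate(artifacts: dict, manifest: dict) -> None:
    """Halt-on-failure gate: raise instead of letting the next stage run."""
    failures = verify_manifest(artifacts, manifest)
    if failures:
        raise RuntimeError(f"integrity gate failed for: {failures}")


# Hypothetical stage outputs.
outputs = {"table1.csv": b"group,n\nA,42\n", "refs.bib": b"@article{key,}\n"}
manifest = build_manifest(outputs)
integrity_gate(outputs, manifest)  # passes silently on unchanged content
```

A check like this gives the same answer every time it is re-run on the same bytes, which is the property the abstract contrasts with a single-prompt LLM reviewer.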

2606.09365 2026-06-16 cs.AI cs.CL version update

Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory

Haoran Sun, Wenjie Li, Yujie Zhang, Zekai Lin, Fanrui Zhang, Kaitao Chen, Xingqi He, Yichen Li, Mianxin Liu, Lei Liu, Yankai Jiang

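Affiliations Fudan University; Shanghai Artificial Intelligence Laboratory; Shanghai Innovation Institute; Huazhong University of Science and Technology

AI summary Proposes SkeMex, a framework that enables post-deployment self-evolution of medical agents through skill memory without updating model weights, outperforming existing memory-based agents on clinical tasks.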

Abstract

Medical agent systems are increasingly expected to support interactive clinical decision making rather than only static question answering. In such settings, effective agents must reuse prior experience across evolving cases, yet existing memory mechanisms often retain raw historical traces that are redundant, noisy, and difficult to govern. More importantly, they rarely distinguish which memories are truly useful for future reasoning. This limits their ability to accumulate compact and reliable experience for long-horizon clinical reasoning. To close this gap, we propose SkeMex, a post-deployment self-evolution framework that improves medical agents through a skill-based memory without updating model weights. SkeMex distills informative interaction trajectories into structured skills that encode reusable procedural knowledge, and organizes them into a multi-branch repository spanning general, task-specific, and action-level experience. To determine which memories should be reused and retained, SkeMex estimates context-dependent utility from environment feedback and uses it to guide value-aware retrieval and repository governance. A closed-loop "Read-Write-Assess-Govern" lifecycle further supports continual evolution by writing new skills, updating utilities, promoting useful memories, and removing harmful entries. Experiments across diverse clinical tasks show that SkeMex consistently outperforms representative memory-based agents in both offline and online settings. It also generalizes across model backbones and supports transferable skill memory. All data and code will be released publicly.

2606.07226 2026-06-16 cs.LG cs.AI cs.CL version update

DEFINED: A Data-Efficient Computational Framework for Fine-Grained Creativity Assessment in Debate Scenarios

Tongzhou Yu, Mingjia Li, Hong Qian, Wenkai Wang, Zongbao Zhang, Yaoyu Jiang, Xiangfeng Wang, Aimin Zhou, Jiajun Guo

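Affiliations Nanjing University; Shanghai Innovation Institute; East China Normal University

AI summary Proposes the DEFINED framework, which combines a hierarchical eight-dimensional metric system, a pre-trained language model, and a mixed-granularity training strategy to achieve data-efficient, fine-grained automatic creativity assessment in debate scenarios, outperforming existing methods.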

Comments Accepted by KDD 2026

Abstract

Human creativity has emerged as a critical competency in the era of large language models. Assessing creativity in complex, open-ended environments is a grand challenge in data mining, currently hindered by a reliance on standardized simple tasks and the scarcity of fine-grained expert data. As an ecologically valid assessment context, debate reflects multiple dimensions of creativity, encompassing both divergent thinking and convergent thinking. Moreover, debate is a data-rich domain, with a large volume of publicly accessible materials. Current mainstream automated scoring methods are poorly suited to complex settings such as debate, and therefore still rely on costly human evaluation. To this end, this paper proposes DEFINED, a data-efficient computational framework for fine-grained creativity assessment in debate scenarios. DEFINED operationalizes debate creativity through a hierarchical eight-dimensional metric system, implemented via a pre-trained autoregressive language model with a hierarchical scoring head that supports both fine-grained and coarse-grained evaluation. Statements and their associated expert scores were obtained from authentic debate competitions, and a constrained data augmentation strategy was employed to address the elite bias inherent in the original data. DEFINED adopts a mixed-granularity training strategy enabling robust learning from limited fine-grained supervision annotated by trained graduate experts. To rigorously validate ecological validity beyond synthetic benchmarks, we incorporate an empirical study with debate-naive participants, utilizing these authentic data to serve as a qualitative case study for mid-to-low proficiency populations. Across our evaluation protocol, our scoring model achieves accurate and stable scoring, outperforming prompt-based large language model evaluators and existing debate scoring methods.

2606.07082 2026-06-16 cs.LG cs.AI version update

On the Geometry of On-Policy Distillation

Zhennan Shen, Yanshu Li, Qingyu Yin, Chak Tou Leong, Zhilin Wang, Yanxu Chen, Rongduo Han, Sunbowen Lee, Yi R. Fung

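Affiliations HKUST; UT Austin; Zhejiang University; Hong Kong PolyU; USTC; BUPT; Nankai University; BIT

AI summary Using parameter-space diagnostics, shows that the update trajectory of on-policy distillation (OPD) has distinctive geometric properties, such as a relaxed off-principal regime and subspace locking, indicating that it is not simply an intermediate method between SFT and RLVR.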

Comments 17 pages, 8 figures

Abstract

On-policy distillation (OPD) is increasingly used to improve large language model reasoning, but its training dynamics remain poorly understood. We characterize the trajectory of OPD updates in parameter space and compare it with supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). A suite of parameter-space diagnostics consistently places OPD in a relaxed off-principal regime: compared with SFT, its updates affect fewer weights and avoid principal directions more strongly, while compared with RLVR, they remain less tightly constrained. Beyond this static localization, OPD exhibits subspace locking: its cumulative updates rapidly enter a narrow low-dimensional channel. Constraining training to the update subspace formed early in training preserves OPD performance but substantially degrades SFT, indicating that the locked subspace is functionally sufficient for OPD. Control experiments further show that sparsifying the update tokens and shifting rollout generation off-policy preserve the rank dynamics, whereas mixing the OPD objective with RLVR changes them. Overall, these results suggest that OPD is not merely an intermediate point between SFT and RLVR, but induces its own update geometry in parameter space.

2606.06302 2026-06-16 cs.LG cs.SE version update

Tangram: Unlocking Non-Uniform KV Cache Compression for Efficient Multi-turn LLM Serving

Hyungmin Kim, Minsoo Kim, Hongseok Kim, Jungwook Choi

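Affiliations Hanyang University; Rebellions, Republic of Korea

AI summary To address the GPU memory and bandwidth pressure caused by linear KV cache growth in multi-turn LLM serving, proposes Tangram, a system that manages non-uniform KV caches efficiently through three techniques (deterministic budget allocation, head-group pages, and ahead-of-time load balancing), improving throughput by up to 2.6x.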

Comments 13 pages, 15 figures

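AI abstract (translated from Chinese)

Multi-turn large language model (LLM) serving is essential for a consistent user experience, but the linear growth of the key-value (KV) cache puts heavy pressure on GPU memory and bandwidth. Non-uniform KV compression retains more information by accounting for the importance of each KV cache. However, this heterogeneity introduces several system challenges, including memory fragmentation, scheduling complexity, and reduced kernel utilization, which together cause significant inefficiency in existing LLM serving systems. To overcome these challenges, we propose Tangram, a novel serving system designed to make non-uniform KV caches practical. Tangram addresses the system inefficiencies with three core techniques: (1) deterministic budget allocation assigns each head a static memory footprint according to its intrinsic pattern, eliminating dynamic scheduling overhead and prefill stalls; (2) head-group pages cluster attention heads with similar retention requirements and manage them with independent vectorized page tables, maximizing physical memory reclamation; (3) ahead-of-time (AOT) load balancing uses static budget profiles to ensure uniform GPU utilization without runtime overhead. Experimental results show that, compared with existing baselines, Tangram improves throughput by up to 2.6x while fully preserving model accuracy. Our implementation is publicly available at https://github.com/aiha-lab/TANGRAM.

Abstract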

Multi-turn LLM serving accumulates dialogue history whose Key-Value (KV) cache grows with every turn and every user, quickly exceeding the model weights themselves and making memory, not compute, the binding constraint on throughput. Non-uniform KV compression, which allocates heterogeneous budgets across attention heads, preserves accuracy far better than uniform schemes, yet remains impractical: modern serving stacks assume identical KV lengths across heads, so heterogeneity traps freed memory as page fragmentation, spends up to 25% of prefill time reclaiming scattered pages, and skews GPU workloads that inflate decode latency by up to $1.7\times$ or burn 15-20% of each decode step on re-planning. We observe that this heterogeneity need not be discovered at runtime: head-wise retention follows a two-level structural regularity (an input-invariant head ranking with narrowly bounded per-head ratios) that can be calibrated offline from as few as 50 samples. Building on this insight, we present Tangram, a serving framework that statically resolves what prior systems handle dynamically: Budget Reservation fixes each head's post-compression footprint at scheduling time, eliminating page reclamation; Ragged Paging clusters similar-budget heads into independent page tables, turning fragmentation into reclaimable memory; and Ahead-of-Time Load Balancing precomputes balanced GPU partitions with zero runtime planning. Implemented on vLLM, Tangram serves as a drop-in substrate for existing non-uniform compression methods, matching their accuracy while improving end-to-end throughput by up to $2.6\times$ over the full-KV baseline. Our implementation is publicly available at https://github.com/aiha-lab/TANGRAM.

2606.06176 2026-06-16 cs.CV version update

RQUL-UIE: Revitalizing Quality-Unstable Labels for Underwater Image Enhancement via In-Dataset Self-Supervision

Haochen Hu, Yanrui Bin, Chih-yung Wen, Bing Wang

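Affiliations The Hong Kong Polytechnic University

AI summary Proposes a diffusion-based, in-dataset self-supervised learning strategy that assesses label quality and quantizes it into noise levels for step-wise denoising supervision, and combines it with a Fourier-based refinement network to make effective use of quality-unstable labels and improve underwater image enhancement.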

Abstract

Underwater Image Enhancement (UIE) is essential for mitigating degradations caused by the water medium. Although learning-based methods have advanced significantly, most rely on paired datasets with unstable label quality, which bottlenecks model performance. This paper proposes a diffusion-based, in-dataset self-supervised learning strategy designed to exploit the quality distribution of training labels. Specifically, we evaluate label quality via semantic perception embeddings from a pre-trained diffusion model in a training-free manner. These quality scores are subsequently quantized into noise-level indices, guiding a multi-step denoising process for level-wise supervision. This mechanism prevents low-quality labels from degrading the model while maximizing their utility during training. Furthermore, a Fourier-based refinement network is incorporated to explicitly reconstruct high-frequency components. Extensive evaluations demonstrate that our method consistently outperforms SOTA approaches in restoration quality. The code and pre-trained model will be made available via a link once the paper is accepted.

2606.06007 2026-06-16 cs.LG version update

Diffusion Models for Adaptive Sequential Data Generation

Haoyang Cao, Minshuo Chen, Yinbin Han, Renyuan Xu

Affiliations Department of Applied Mathematics and Statistics, Data Science and AI Institute, and Mathematical Institute for Data Science, Johns Hopkins University; Department of Industrial Engineering and Management Sciences, Northwestern University; Department of Management Science and Engineering, Stanford University

AI summary Proposes a sequential forward-backward diffusion framework for generating adapted time series data, which progressively injects and removes noise along the sequence and conditions on the generated history to ensure adaptedness; introduces a new score-matching objective for efficient parallel training; and validates the approach on synthetic data and on mean-variance optimal portfolio construction.

Comments 38 pages

Abstract

Generating realistic synthetic sequential data is critical in real-world applications across operations research, finance, healthcare, energy systems, and scientific computing, where time-indexed observations are used for prediction, simulation, risk assessment, and data-driven decision-making. While diffusion models have achieved remarkable success in generating static data, their direct extensions to sequential settings often fail to capture temporal dependence and information structure. Designing diffusion models that can simulate sequential data in an adapted manner, and hence without anticipation of future information, therefore remains an open challenge. In this work, we propose a sequential forward-backward diffusion framework for adapted time series generation. Our approach progressively injects and removes noise along the sequence, conditioning on the previously generated history to ensure adaptiveness. A novel score-matching objective is introduced for efficient parallel training. We derive rigorous statistical guarantees under a generic framework, then establish score approximation, score estimation, and distribution estimation results with ReLU networks serving as a concrete instance. Empirically, we validate our method on synthetic data, including ARMA models and Gaussian processes, and demonstrate its effectiveness in constructing mean-variance optimal portfolios.

2606.05878 2026-06-16 cs.LG version update

TS-ICL: A Flexible Time-Indexed Foundation Model for Time Series via In-Context Learning

Etienne Le Naour, Tahar Nabil, Adrien Petralia

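Affiliations EDF R&D

AI summary Proposes TS-ICL, a probabilistic encoder-regressor Transformer based on in-context learning that unifies time series forecasting and imputation, sets a new state of the art in imputation, and performs particularly well when forecasting from partially observed look-back windows.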

详情
AI中文摘要

基础模型标志着时间序列建模的深刻范式转变,任务特定模型正被通用零样本模型取代。然而,当前方法主要关注预测,而现实世界的时间序列通常是不规则和部分观测的,需要模型能够联合预测、插补缺失值并处理降采样条件。为应对这些挑战,我们引入了TS-ICL,一种新颖的基于概率上下文学习的编码器-回归器Transformer,统一了预测和插值。TS-ICL将时间序列任务表述为时间戳对齐的回归,并通过训练从新颖的因果数据先验生成的合成依赖结构自然地纳入协变量。实验上,TS-ICL在插值任务上达到了新的最优,同时在单变量和协变量感知基准上与领先的预测基础模型保持竞争力。它在部分观测回溯窗口的预测中表现出特别强的性能。

英文摘要

Foundation models mark a profound paradigm shift in time series modeling, with task-specific models being superseded by general-purpose zero-shot models. Yet, current approaches primarily focus on forecasting, while real-world time series are often irregularly and partially observed, requiring models that can jointly forecast, impute missing values, and handle degraded sampling conditions. To address these challenges, we introduce TS-ICL, a novel probabilistic In-Context Learning encoder-regressor Transformer that unifies forecasting and imputation. TS-ICL formulates time series tasks as timestamp-aligned regression and naturally incorporates covariates by training on synthetic dependency structures generated from a novel causal data prior. Empirically, TS-ICL achieves a new state-of-the-art in imputation, while remaining competitive with leading forecasting foundation models across both univariate and covariate-aware benchmarks. It shows particularly strong performance in forecasting with partially observed look-back windows.

2606.05742 2026-06-16 cs.CL version update

AdaPLD: Adaptive Retrieval and Reuse for Efficient Model-Free Speculative Decoding

Runheng Liu, Jincheng Xie, Wen Hu, Xingchen Xiao, Heyan Huang

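Affiliations School of Computer Science and Technology, Beijing Institute of Technology; Department of Mathematical Sciences, Tsinghua University; JDT AI Infra

AI summary To address the low recall of existing reuse-based speculative decoding methods when lexical matching fails and the brittleness of deterministic copying, proposes AdaPLD, a training-free adaptive method that recovers reuse opportunities via semantic similarity and constructs branched hypotheses, achieving up to 3.10x decoding speedup.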

详情
AI中文摘要

推测解码通过在单次目标模型前向传播中验证多个草拟令牌来加速生成,减少了顺序解码迭代。无模型变体通过重用生成过程中已有的文本和模型状态来避免辅助草稿模型,但其加速效果取决于构建的草稿的可靠性。我们指出现有基于重用的方法存在两个局限性:基于词汇锚定的检索在表面形式变化下召回率有限,以及当检索上下文不能唯一确定续写时,确定性跨度复制可能脆弱。我们提出\emph{AdaPLD},一种无需训练的方法,自适应地改进检索和草稿构建。AdaPLD保留高精度的词汇重用,同时利用语义相似性在词汇匹配失败时恢复额外的重用机会。它进一步构建分支重用假设以考虑续写的不确定性,而不是依赖单个复制的跨度。在多个基准测试中,AdaPLD减少了目标模型前向传播次数,并实现了高达$3.10 imes$的解码加速。

英文摘要

Speculative decoding accelerates generation by verifying multiple drafted tokens in a single target-model forward pass, reducing sequential decoding iterations. Model-free variants avoid auxiliary draft models by reusing text and model states already available during generation, but their speedup depends on the reliability of the constructed drafts. We identify two limitations of existing reuse-based methods: lexically anchored retrieval has limited recall under surface-form variation, and deterministic span copying can be brittle when the retrieved context does not uniquely determine the continuation. We propose AdaPLD, a training-free method that adaptively improves both retrieval and draft construction. AdaPLD preserves high-precision lexical reuse while using semantic similarity to recover additional reuse opportunities when lexical matching fails. It further constructs branched reuse hypotheses to account for continuation uncertainty, rather than relying on a single copied span. Across diverse benchmarks, AdaPLD reduces target-model forward passes and achieves up to $3.10\times$ decoding speedup.
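
For orientation, the sketch below shows the plain lexical baseline that the abstract describes as brittle: copy the tokens that followed the most recent earlier occurrence of the current suffix, then keep the longest prefix the target model agrees with. It is not AdaPLD itself (no semantic retrieval, no branching). The function names and the `next_token` callable standing in for the target model are hypothetical, and the loop calls the target once per token only for clarity, whereas a real system would verify the whole draft in one forward pass.

```python
def draft_by_lookup(tokens, ngram=2, max_draft=4):
    """Copy the continuation of the most recent earlier match of the last `ngram` tokens."""
    if len(tokens) < ngram:
        return []
    suffix = tokens[-ngram:]
    # Scan right to left over start positions that lie before the suffix itself.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if tokens[start:start + ngram] == suffix:
            return tokens[start + ngram:start + ngram + max_draft]
    return []  # no lexical match, so nothing is drafted


def verify_draft(tokens, draft, next_token):
    """Accept the longest draft prefix the target agrees with, plus one target token."""
    accepted = []
    for tok in draft:
        if next_token(tokens + accepted) != tok:
            break
        accepted.append(tok)
    # The target always contributes one token, so decoding makes progress.
    accepted.append(next_token(tokens + accepted))
    return accepted


# Toy target model (hypothetical): it repeats the cycle a, b, c, d forever.
cycle = ["a", "b", "c", "d"]
target = lambda seq: cycle[len(seq) % len(cycle)]

context = ["a", "b", "c", "d", "a", "b"]
draft = draft_by_lookup(context)
print(draft, verify_draft(context, draft, target))
```

When the copied span matches what the target would produce, several tokens are accepted per verification step; when the suffix never appeared before, the draft is empty and decoding falls back to one token per step, which is the recall problem the abstract targets.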

2606.05693 2026-06-16 cs.LG cs.IR version update

MolE-RAG: Molecular Structure-Enhanced Retrieval-Augmented Generation for Chemistry

Joey Chan, Wonbin Kweon, Ashley Shin, Niharika Bhattacharjee, Pengcheng Jiang, Yue Guo, Jiawei Han

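Affiliations University of Illinois Urbana-Champaign; University of California, San Diego

AI summary Proposes MolE-RAG, a training-free, molecule-centric retrieval-augmented generation framework that integrates three kinds of context (retrieved literature, molecule-specific information, and structurally similar molecules) and significantly improves LLM performance on molecular property prediction tasks.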

详情
AI中文摘要

大型语言模型(LLM)在分子性质预测方面展现出潜力,但其对化学结构的推理能力仍然有限,因为分子表示(如SMILES)与LLM主要训练的自然语言存在显著差异。为弥合这一语义和化学知识鸿沟,我们提出MolE-RAG,一种无需训练的、以分子为中心的检索增强生成框架,用于基于LLM的分子性质预测。MolE-RAG通过三种互补的推理时上下文来源增强每次预测:检索的化学文献、分子特定信息(包括化合物同义词、标识符、官能团注释和物理化学描述符),以及从训练集中检索的结构相似分子。我们使用专有、化学专用和开源LLM在九个分子性质预测任务上评估MolE-RAG。在通用LLM上,相比仅使用SMILES的基线,MolE-RAG在分类任务上将ROC-AUC提升最多28个百分点,并将回归RMSE降低最多67%。我们进一步发现,每种上下文来源的效用因模型和任务而异,不同模型分别从文本检索、分子上下文或结构检索中获益最多。这些结果表明,以分子为中心的检索可以在无需模型微调的情况下改进基于LLM的分子性质预测,同时为在推理时整合异构化学知识提供灵活框架。

英文摘要

Large language models (LLMs) have shown promise for molecular property prediction, but their ability to reason over chemical structures remains limited, as molecular representations such as SMILES differ substantially from the natural language on which LLMs are primarily trained. To bridge this semantic and chemical knowledge gap, we propose MolE-RAG, a training-free, molecule-centric retrieval-augmented generation framework for LLM-based molecular property prediction. MolE-RAG augments each prediction with three complementary sources of inference-time context: retrieved chemistry literature; molecule-specific information (compound synonyms, identifiers, functional group annotations, and physicochemical descriptors); and structurally similar molecules retrieved from the training set. We evaluate MolE-RAG across nine molecular property prediction tasks using proprietary, chemistry-specialized, and open-source LLMs. Across general-purpose LLMs, MolE-RAG improves ROC-AUC by up to 28 percentage points on classification tasks and reduces regression RMSE by up to 67% relative to a SMILES-only baseline. We further find that the utility of each context source varies across models and tasks, with different models benefiting most from textual retrieval, molecular context, or structural retrieval. These results suggest that molecule-centric retrieval can improve LLM-based molecular property prediction without model fine-tuning while providing a flexible framework for integrating heterogeneous chemical knowledge at inference time.

2606.05692 2026-06-16 cs.LG cs.AI 版本更新

Benchmarking Counterfactual Prediction in Epidemic Time Series with Time-Varying Interventions

具有时变干预的流行病时间序列中的反事实预测基准测试

Wenhao Mu, Facundo Yan, Anik Mumssen, Marisa Eisenberg, Alexander Rodríguez

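Affiliations * University of Michigan Computer Science and Engineering; University of Michigan Epidemiology & Complex Systems

AI summary To address the lack of realistic benchmarks with observable counterfactual outcomes, the authors use a calibrated agent-based model to generate a large-scale benchmark for counterfactual prediction in epidemic time series. It supports static and time-varying treatments as well as single-policy and multi-policy interventions, and is used to evaluate a range of causal inference methods.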

Comments To appear in Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

详情
AI中文摘要

深度学习在时间序列因果推断方面取得了显著进展,但由于缺乏具有可观测反事实结果的现实基准,进展仍然受到限制。现有数据集要么依赖没有真实反事实的真实世界观测,要么依赖无法捕捉复杂因果动态的简化模拟。为了解决这一差距,我们开发了一个大规模基准,用于动态干预下流行病时间序列的反事实预测。与现有基准不同,它支持静态和时变治疗,以及单策略和多策略干预设置,从而能够在广泛的因果推断场景中评估因果推断方法。利用基于真实世界人口、流动性、流行病学和政策数据校准的基于智能体的模型,我们生成了跨越美国150多个县的真实反事实轨迹。使用该基准,我们评估了广泛使用和最先进的因果推断方法,揭示了显著的性能差异,并突出了现实时间序列因果推理的挑战。

英文摘要

Deep learning has enabled significant advances in time-series causal inference, yet progress remains constrained by the lack of realistic benchmarks with observable counterfactual outcomes. Existing datasets either rely on real-world observations without ground-truth counterfactuals or on simplified simulations that fail to capture complex causal dynamics. To address this gap, we develop a large-scale benchmark for counterfactual prediction in epidemic time series under dynamic interventions. Unlike existing benchmarks, it supports both static and time-varying treatments and both single-policy and multi-policy intervention settings, enabling evaluation of causal inference methods across a broad range of scenarios. Leveraging a calibrated agent-based model grounded in real-world demographic, mobility, epidemiological, and policy data, we generate realistic counterfactual trajectories across more than 150 U.S. counties. Using this benchmark, we evaluate widely used and state-of-the-art causal inference methods, revealing substantial performance differences and highlighting the challenges of realistic time-series causal reasoning.

2606.05014 2026-06-16 cs.CL 版本更新

Depth-Attention: Cross-Layer Value Mixing for Language Models

深度注意力:语言模型的跨层值混合

Boyi Zeng, Yiqin Hao, Zitong Wang, Shixiang Song, He Li, Feichen Song, Yifan Liu, Ziwei He, Xinbing Wang, Zhouhan Lin

Affiliations * LUMIA Lab; School of Artificial Intelligence; Shanghai Jiao Tong University; Shanghai AI Laboratory; Sun Yat-sen University; Shanghai Innovation Institute

AI summary Proposes Depth-Attention, which performs cross-layer value mixing inside the attention module, adds no parameters or extra inference state, and improves language-model performance.

Comments 21 pages, 4 figures, 9 tables

详情
AI中文摘要

自注意力机制可以在序列中自由选择信息,但在深度方向上,Transformer仅将每一层的输出加到残差流中,因此后续层无法选择性重用早期层的表示。最近的跨层方法改善了这种流动,但在注意力之外的隐藏状态上操作,在推理时增加了键值缓存之外的状态——随着现代LLM使用分组查询和多头潜在注意力压缩缓存,这一成本日益显著。我们引入深度注意力,它在注意力模块内部执行这种选择:在一层对序列进行注意力之前,其查询在同一token位置上对早期层的键进行注意力,并将它们的值混合到自注意力随后读取的值中。由于深度注意力重用标准的注意力查询、键和值缓存槽,将深度混合后的值替换原始值,因此它不增加参数,也不引入超出标准键值缓存的持久推理状态——缓存大小与普通解码器相同,且小于基于隐藏状态的跨层方法。在1.5B和3B参数的Qwen3风格解码器上,深度注意力取得了最低的困惑度和最高的平均下游准确率,相比普通Transformer提升高达2.3个准确率点,在困惑度和平均准确率上超越了强跨层基线,同时仅增加不到0.01%的额外算术FLOPs,且无额外持久推理状态。这些增益在360M到3B参数范围内保持一致,并扩展到循环Transformer。

英文摘要

Self-attention selects information freely across the sequence, but across depth, Transformers merely add each layer's output to the residual stream, so later layers cannot selectively reuse earlier-layer representations. Recent cross-layer methods improve this flow but operate on hidden states outside attention, adding state beyond the key-value cache at inference--a cost that becomes increasingly salient as modern LLMs compress the cache with grouped-query and multi-head latent attention. We introduce Depth-Attention, which performs this selection inside the attention module itself: before a layer attends over the sequence, its query attends over the keys of earlier layers at the same token position and mixes their values into the value that self-attention then reads. Because Depth-Attention reuses the standard attention queries, keys, and value-cache slots, storing depth-mixed values in place of the original values, it adds no parameters and introduces no persistent inference state beyond the standard key-value cache--the same cache size as a vanilla decoder and less than hidden-state-based cross-layer methods. On Qwen3-style decoders at 1.5B and 3B parameters, Depth-Attention attains the lowest perplexity and the highest average downstream accuracy, improving over the vanilla Transformer by up to 2.3 accuracy points and surpassing strong cross-layer baselines in perplexity and average accuracy, while adding under 0.01% extra arithmetic FLOPs and no additional persistent inference state. The gains hold from 360M to 3B parameters and extend to looped Transformers.
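
As a rough illustration of the depth-wise selection described above, the sketch below mixes per-layer values at a single token position, using softmax weights computed from the current layer's query against earlier layers' keys. The function name `depth_mix_value`, the scaled dot-product scoring, and the single-head, single-position setup are assumptions made for illustration; the abstract does not specify these details.

```python
import math
from typing import List, Sequence


def depth_mix_value(
    query: Sequence[float],
    layer_keys: Sequence[Sequence[float]],
    layer_values: Sequence[Sequence[float]],
) -> List[float]:
    """Mix values from several layers at one token position.

    Assumption: weights are a softmax over scaled dot products between the
    current query and each layer's key. Which layers are included is left to
    the caller.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in layer_keys]
    # Numerically stable softmax over the depth axis.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(layer_values[0])
    return [sum(w * v[i] for w, v in zip(weights, layer_values)) for i in range(dim)]
```

In this reading, the returned vector would be written into the slot that normally holds the layer's value, so the cache does not grow. That matches the abstract's claim of no extra persistent state, but the exact storage scheme is not given there.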

2606.04907 2026-06-16 cs.RO 版本更新

WAM-Nav: Asymmetric Latent World-Action Modeling for Unified Visual Navigation

WAM-Nav:面向统一视觉导航的非对称潜在世界-动作建模

Ning Yang, Yan Huang, Kaiwen Peng, Ziheng He, Kai Wang, Cui Miao, Kailin Lyu, Guo Li, Xiaofeng Wang, Zheng Zhu, Jing Liu, Nianfeng Liu

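Affiliations * Nanjing University; Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences; FiveAges; National University of Defense Technology; Tsinghua University; GigaAI

AI summary Proposes WAM-Nav, an asymmetric Diffusion Transformer model that jointly learns action generation and latent visual foresight. A shared Diffusion Transformer jointly diffuses long-horizon actions and short-horizon visual predictions, while a dual-stream contextual conditioning mechanism and a goal alignment module let a single policy support Image-Goal, Point-Goal, and No-Goal navigation. On the ClutterScenes and InternScenes benchmarks it improves Image-Goal and Point-Goal success rates by 15.7% and 3.3%, respectively, and it reaches an average 85% task success rate in real-world environments.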

详情
AI中文摘要

视觉导航需要在复杂的几何和物理约束下生成平滑且无碰撞的轨迹。现有的反应式策略直接将观测映射到动作,缺乏预期推理能力,限制了其主动避障的能力。虽然视觉想象提供了预测性前瞻,但传统的模块化方法将场景预测与策略学习分离,常常导致误差累积和推理效率低下。为了解决这些限制,我们提出了WAM-Nav,一种用于具身视觉导航的潜在世界-动作模型,它联合学习动作生成和潜在视觉预测,从而在不影响推理效率的情况下实现更鲁棒和更具前瞻性的导航决策。具体来说,WAM-Nav利用共享的扩散Transformer进行非对称联合扩散,同时生成长时程动作和短时程视觉预测,减少了多步自回归展开中固有的推理延迟和视觉误差累积。为了进一步促进平滑且一致的轨迹生成,我们引入了一种双流上下文条件机制,将情节级别的自运动历史与顺序视觉观测相结合。结合统一的目标对齐模块,该模块在不同目标类型间保持平衡表示,WAM-Nav在单一策略下自然支持图像目标、点目标和无目标探索。在具有挑战性的ClutterScenes和InternScenes基准上的大量实验证明了WAM-Nav的强大泛化能力,特别是在图像目标和点目标导航中,成功率分别提高了15.7%和3.3%。真实世界部署进一步验证了有效的零样本模拟到现实迁移,在多样化的室内和室外环境中实现了平均85%的任务成功率。

英文摘要

Visual navigation requires generating smooth and collision-free trajectories under complex geometric and physical constraints. Existing reactive policies that directly map observations to actions lack anticipatory reasoning, limiting their ability to proactively avoid obstacles. While visual imagination offers predictive foresight, conventional modular approaches separate scene prediction from policy learning, often leading to error accumulation and inefficient inference. To address these limitations, we propose WAM-Nav, a Latent World-Action Model for embodied visual navigation that jointly learns action generation and latent visual foresight, enabling more robust and foresighted navigation decisions without compromising inference efficiency. Specifically, WAM-Nav utilizes a shared Diffusion Transformer for asymmetric joint diffusion to concurrently generate long-horizon actions and short-horizon visual foresight, reducing the inference latency and visual error accumulation inherent in multi-step autoregressive rollouts. To further encourage smooth and consistent trajectory generation, we introduce a dual-stream contextual conditioning mechanism that integrates episode-level ego-motion history with sequential visual observations. Combined with a unified goal alignment module that preserves balanced representations across goal types, WAM-Nav naturally supports Image-Goal, Point-Goal, and No-Goal exploration within a single policy. Extensive experiments on the challenging ClutterScenes and InternScenes benchmarks demonstrate strong generalization of WAM-Nav, particularly on Image-Goal and Point-Goal navigation, where it improves success rates by 15.7% and 3.3%, respectively. Real-world deployment further validates effective zero-shot sim-to-real transfer, achieving an average 85% task success rate across diverse indoor and outdoor environments.

2606.04678 2026-06-16 cs.LG 版本更新

Test-Time Compute Scaling for ASR with Depth-Conditioned Looped Transformers

基于深度条件循环Transformer的ASR测试时计算缩放

Yacouba Kaloga, Shashi Kumar, Shakeel A. Sheikh, Driss Khalil, Petr Motlicek, Ina Kodrasi

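Affiliations * Idiap Research Institute; EPFL; BUT (Brno University of Technology); Novartis Institute of Biomedical Research

AI summary Proposes LARM, a depth-conditioned looped Transformer that turns recurrent encoder depth into a controllable test-time compute axis. It combines sparse CTC checkpoints, supervision-clock embeddings, FiLM depth conditioning, and delayed soft-posterior feedback. On LibriSpeech, WER improves as the number of inference loops increases, extending test-time compute scaling from autoregressive language-model reasoning to continuous non-autoregressive speech recognition.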

详情
AI中文摘要

端到端ASR系统通常在推理时使用固定深度的声学编码器,这使得在不训练更大模型的情况下,难以用额外的测试时计算换取更好的识别性能。一种自然的方法是循环重用共享的Transformer块,但我们发现简单的循环并不能充分利用额外的循环计算。我们引入了LARM,一种深度条件循环Transformer,将循环编码器深度变为可控的测试时计算轴。LARM结合了稀疏CTC检查点、监督时钟嵌入、FiLM深度条件和延迟软后验反馈。这些组件将循环结构化为由潜在精炼阶段分隔的识别检查点,并允许共享权重在循环步骤间进行特化。在LibriSpeech上,LARM随着推理循环次数的增加提高了WER,并达到了与更深的非共享参数基线相竞争的性能。我们的结果表明,测试时计算缩放可以超越自回归语言模型推理,扩展到连续非自回归语音识别。

英文摘要

End-to-end ASR systems typically use fixed-depth acoustic encoders at inference, making it difficult to trade additional test-time computation for improved recognition without training a larger model. A natural approach is to reuse a shared Transformer block recurrently, but we find that naive looping does not fully exploit additional recurrent compute. We introduce LARM, a depth-conditioned looped Transformer that turns recurrent encoder depth into a controllable test-time compute axis. LARM combines sparse CTC checkpoints, supervision-clock embeddings, FiLM depth conditioning, and delayed soft-posterior feedback. These components structure the loop into recognition checkpoints separated by latent refinement phases and allow shared weights to specialize across recurrent steps. On LibriSpeech, LARM improves WER as the number of inference loops increases and achieves performance competitive with deeper unshared-parameter baselines. Our results show that test-time compute scaling can extend beyond autoregressive language-model reasoning to continuous non-autoregressive speech recognition.

2606.04184 2026-06-16 cs.CV 版本更新

GroupToM-Bench: Benchmarking Group Theory of Mind and Nonlinear Social Emergence in MLLMs

GroupToM-Bench: 多模态大语言模型中群体心智理论和非线性社会涌现的基准测试

Weidong Tang, Jierui Li, Yueling Hou, Zihan Mei, Can Zhang, Xinyan Wan, Zhiyuan Liang, Pengfei Zhou, Yang You, Wangbo Zhao

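Affiliations * Xidian University; National University of Singapore; University of Electronic Science and Technology of China; University of Science and Technology of China

AI summary To address the weakness of multimodal large language models in group-level Theory of Mind reasoning, proposes the GroupToM-Bench benchmark, which uses a seven-level cognitive audit framework to evaluate the causal chain from micro-level BDI states to macro-level outcome prediction, revealing model failures in handling social structures and non-linear collective dynamics.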

Comments ACL 2026 (Main Conference)

详情
AI中文摘要

真正的通用智能不仅需要物理世界模型,还需要社会世界模型:即推断个体心理状态如何相互作用并结晶为群体层面结果的能力。尽管在个体层面的心智理论推理方面取得了显著进展,现有的多模态大语言模型在这一更广泛的任务上仍然失败。集体行为从社会张力、从众动态和结构约束中非线性地涌现,这意味着它不能通过简单地对个体意图求和来恢复。我们提出了GroupToM-Bench,第一个针对群体层面心智理论的多模态基准,围绕一个跨越微观层面BDI状态(信念、欲望、意图)、中观层面群体张力和结构约束以及宏观层面结果预测和机制归因的因果链构建。为了探测这一完整弧线,我们开发了一个七级认知审计框架。实验揭示了当前模型与人类基线之间的差距,突出了模型在处理社会结构和非线性集体动态方面的失败。

英文摘要

True general intelligence requires not only a model of the physical world but also a social world model: the capacity to infer how individual mental states interact and crystallize into group-level outcomes. Despite notable progress in individual-level Theory of Mind (ToM) reasoning, existing multimodal large language models fail at this broader task. Collective behavior emerges non-linearly from social tensions, conformity dynamics, and structural constraints, meaning it cannot be recovered by merely summing individual intentions. We present GroupToM-Bench, the first multimodal benchmark for group-level ToM, built around a causal chain spanning micro-level BDI states (belief, desire, intention), meso-level group tension and structural constraints, and macro-level outcome prediction and mechanistic attribution. To probe this full arc, we develop a seven-level cognitive audit framework. Experiments reveal a gap between current models and human baselines, highlighting a failure to process social structures and non-linear collective dynamics.

2606.04145 2026-06-16 cs.LG cs.AI cs.DC 版本更新

EvalStop: Using World Feedback to Detect and Correct Reward Overoptimization in Multi-Tenant RLHF Platforms

EvalStop:利用世界反馈检测和纠正多租户RLHF平台中的奖励过度优化

Guilin Zhang, Chuanyi Sun, Kai Zhao, Xu Chu, Shahryar Sarkani, John M. Fossaceca

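Affiliations * DeepMind, London, UK; University of Cambridge, UK; University of Washington, USA

AI summary Proposes EvalStop, a scheduling primitive that detects consecutive eval-score declines, then terminates the job, releases its GPUs, and preserves the best checkpoint in order to correct reward overoptimization. On RLHF workloads it achieves high-precision detection and improves JCT.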

详情
AI中文摘要

云LLM微调平台越来越多地服务于RLHF工作负载,其中学习到的奖励模型作为人类质量的代理被优化。正如Gao等人(2023)所示,在持续优化压力下,该代理与世界反馈(下游评估指标)发生偏离,这种现象称为奖励过度优化。现有的平台调度器忽略这种偏离:非预见性调度器优化JCT而不考虑任何质量信号,SLAQ式质量感知调度器使用训练损失(一个较弱的代理,在奖励黑客攻击过程中仍会单调下降),而经典的每作业早停需要人工监控且不释放共享GPU。我们提出EvalStop,一个可组合的调度原语,它在连续k次评估分数下降时终止作业,释放GPU,保留最佳检查点,并委托给任何基础调度器。我们将调度器级别的早停视为检测问题,并在一个离散事件模拟器中评估它,该模拟器的RLHF工作负载混合了奖励黑客攻击和结构健康运行,真实标签对调度器隐藏。在RLHF密集型负载(80% RLHF,64 GPU)上,EvalStop实现了精确率98%、召回率99%、假阳性率1.5%,同时相比SRTF-Est将JCT提高了9%,将浪费的计算减少了22%(p<0.05)。简单的固定进度和损失平台竞争对手要么在健康RLHF上产生65%的假阳性率,要么错过超过一半的真实黑客攻击案例。增益在所有测试的基础调度器上均成立(JCT提升9-25%),且检测质量在评估噪声(噪声标准差≤0.05时精确率至少91%)和黑客攻击基础率(黑客攻击比例20-80%时精确率至少89%)下保持稳定。

英文摘要

Cloud LLM fine-tuning platforms increasingly serve RLHF workloads, where a learned reward model is optimized as a proxy for human quality. As Gao et al. (2023) showed, this proxy diverges from world feedback (downstream eval metrics) under sustained optimization pressure, a phenomenon known as reward overoptimization. Existing platform schedulers ignore this divergence: non-clairvoyant schedulers optimize JCT without any quality signal, SLAQ-style quality-aware schedulers use training loss (a weaker proxy that drops monotonically through hacking), and classical per-job early stopping requires human monitoring and does not free shared GPUs. We propose EvalStop, a composable scheduling primitive that terminates jobs on k consecutive eval-score declines, releases GPUs, preserves the best checkpoint, and delegates to any base scheduler. We frame scheduler-level early stopping as a detection problem and evaluate it in a discrete-event simulator whose RLHF workload mixes reward-hacking and structurally healthy runs, with ground-truth labels hidden from schedulers. On RLHF-heavy workloads (80% RLHF, 64 GPUs), EvalStop achieves precision 98% / recall 99% / FPR 1.5% while improving JCT by 9% and cutting wasted compute by 22% over SRTF-Est (p<0.05). Trivial fixed-progress and loss-plateau competitors either incur 65% FPR on healthy RLHF or miss over half of true hacking cases. Gains compose across every base scheduler tested (9-25% JCT) and detection quality stays stable under eval noise (precision at least 91% at noise std <= 0.05) and hacking base rate (precision at least 89% across 20-80% hacking fractions).
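
The stopping rule described above (terminate on k consecutive eval-score declines while keeping the best checkpoint) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `EvalStopMonitor`, the helper `first_stop`, and the choice to count a decline relative to the immediately preceding eval score are all hypothetical. GPU release and delegation to a base scheduler are out of scope here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class EvalStopMonitor:
    """Tracks eval scores for one job (hypothetical helper, not from the paper)."""

    k: int = 3
    best_score: float = float("-inf")
    best_step: Optional[int] = None
    declines: int = 0
    last_score: Optional[float] = None

    def update(self, step: int, score: float) -> bool:
        """Record one eval score; return True when the job should be terminated."""
        if score > self.best_score:
            # Remember which checkpoint to preserve.
            self.best_score, self.best_step = score, step
        # Assumption: a "decline" means the score dropped versus the previous eval.
        if self.last_score is not None and score < self.last_score:
            self.declines += 1
        else:
            self.declines = 0
        self.last_score = score
        return self.declines >= self.k


def first_stop(scores: Sequence[float], k: int = 3) -> Tuple[Optional[int], Optional[int]]:
    """Return (step at which the job is stopped or None, step of best checkpoint)."""
    monitor = EvalStopMonitor(k=k)
    for step, score in enumerate(scores):
        if monitor.update(step, score):
            return step, monitor.best_step
    return None, monitor.best_step
```

A larger `k` tolerates noisier eval curves at the cost of later termination, which is the precision and wasted-compute trade-off the abstract evaluates.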

2606.03788 2026-06-16 cs.CV 版本更新

SLU-2K: A Question-Based Benchmark for Semantic Evaluation of Sign Language Translation

SLU-2K:基于问题的手语翻译语义评估基准

Zeno Testa, Antonino Furnari, Lorenzo Baraldi, Natalia Díaz-Rodríguez

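Affiliations * University of Modena and Reggio Emilia; University of Catania; University of Granada; CITIC & DaSCI Institute

AI summary Proposes the SLU-2K benchmark, which evaluates semantic understanding in sign language translation with 2,350 video question-answer pairs and reveals the shortcomings of current systems in semantic correctness.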

Comments Accepted at the GenSign Workshop, CVPR 2026

详情
AI中文摘要

手语翻译(SLT)通常使用表面形式指标(如BLEU和ROUGE)进行评估,这些指标奖励词汇重叠,但不直接衡量翻译是否保留了源手语序列的含义。这与将SLT集成到辅助技术中的最终目标相悖。在这项工作中,我们将重点从手语翻译(SLT)转向手语理解(SLU),特别强调语义理解。具体来说,我们根据系统从输入视频中正确恢复原始句子关键语义方面的能力来评估系统,例如发生的动作以及关于人和物体的事实。为了系统地实现这种评估,我们提出了SLU-2K,这是一个基于流行的PHOENIX-2014T和CSL-Daily数据集的2350个封闭式视频问答对的数据集。为了获得SLU-2K,我们提出并广泛评估了一个自动数据生成流水线,该流水线生成7个类别的问题,即动作、位置、数字、物体、人物、时间和天气条件。我们通过评估流行的多模态大语言模型(MLLM)和两个代表性的最先进系统MMSTL和SpaMo,展示了SLU-2K的潜力。我们的结果表明,MLLM达到了接近随机的性能,突显了当前AI系统中需要更系统地集成SLU。此外,在领域内数据上精心微调的最先进翻译系统仍然存在显著的语义差距,结果范围从56.7%到75.2%。这些发现表明,当前的SLT评估协议高估了真正的理解,未来的进展不仅应通过流畅性和n-gram重叠来衡量,还应通过语义正确性来衡量。代码、提示和基准文件可在此https URL获取。

英文摘要

Sign Language Translation (SLT) is typically evaluated with surface-form metrics such as BLEU and ROUGE, which reward lexical overlap but do not directly measure whether a translation preserves the meaning of the source sign sequence. This is at odds with the ultimate goal of integrating SLT into assistive technology. In this work, we shift the focus from SLT to Sign Language Understanding (SLU), with particular emphasis on semantic understanding. Specifically, we evaluate systems based on their ability to correctly recover, from the input video, key semantic aspects of the original sentence, such as actions taking place and facts about people and objects. To enable this evaluation systematically, we propose SLU-2K, a dataset of 2,350 closed-ended video question-answer pairs based on the popular PHOENIX-2014T and CSL-Daily datasets. To obtain SLU-2K, we propose and extensively evaluate an automated data generation pipeline that produces questions across 7 categories, namely actions, locations, numbers, objects, people, time, and weather conditions. We show the potential of SLU-2K by evaluating popular Multimodal Large Language Models (MLLMs) and two representative state-of-the-art systems, MMSTL and SpaMo. Our results show that MLLMs reach near-random performance, highlighting the need for a more systematic integration of SLU in current AI systems. Furthermore, state-of-the-art translation systems carefully fine-tuned on in-domain data still exhibit a substantial semantic gap, with results ranging from 56.7% to 75.2%. These findings suggest that current SLT evaluation protocols overestimate true understanding and that future progress should be measured not only by fluency and n-gram overlap, but also by semantic correctness. Code, prompts, and benchmark files are available at https://github.com/ZenoTsT/SLU-2K

2606.03654 2026-06-16 cs.CV cs.NA math.NA 版本更新

Graph Regularized Non-negative Reduced Biquaternion Matrix Factorization for Color Image Recognition

图正则化非负简化四元数矩阵分解用于彩色图像识别

Hailang Wu, Yonghe Liu, Bingxuan Yu, Chaoqian Li

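Affiliations * School of Mathematics and Statistics, Yunnan University

AI summary To address the fact that non-negative reduced biquaternion matrix factorization ignores local geometric structure, proposes a graph regularized model that preserves local structure through a graph Laplacian regularizer, derives a component-wise alternating projected gradient algorithm, and reports competitive results on color image recognition.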

详情
AI中文摘要

非负简化四元数矩阵分解(NRBMF)利用简化四元数(RB)矩阵的乘积,将彩色图像像素的非负约束纳入分解过程。然而,NRBMF主要关注重构精度,未利用图像数据的局部几何结构,这可能限制所学低维特征的判别能力。为解决此问题,我们提出了一种图正则化非负简化四元数矩阵分解(GNRBMF)模型用于彩色图像识别。该模型将图拉普拉斯正则化项引入简化四元数系数矩阵,鼓励原始空间中的邻近样本在学习的特征空间中具有相似表示。同时,GNRBMF在简化四元数域中保留了NRBMF的非负保持特性。为求解优化问题,推导了一种分量交替投影梯度算法,并分析了其收敛性。实验结果表明,所提出的GNRBMF模型在某些测试设置下取得了具有竞争力或更优的识别性能。

英文摘要

Non-negative reduced biquaternion matrix factorization (NRBMF) uses the product of reduced biquaternion (RB) matrices to incorporate the non-negativity constraints of color image pixels into the factorization process. However, NRBMF mainly focuses on reconstruction accuracy and does not explicitly exploit the local geometric structure of image data, which may limit the discriminative ability of the obtained low-dimensional coefficient representations. To address this issue, we propose a graph regularized non-negative reduced biquaternion matrix factorization (GNRBMF) model for color image recognition. The proposed model incorporates a graph Laplacian regularizer into the reduced biquaternion coefficient matrix, encouraging nearby samples in the original space to have similar coefficient representations. Meanwhile, GNRBMF retains the non-negativity property of NRBMF in the reduced biquaternion algebra. To solve the optimization problem, a component-wise alternating projected gradient algorithm is derived, and its convergence properties are analyzed. Experimental results on three color image datasets show that the proposed GNRBMF model achieves competitive or superior recognition performance compared with several methods in most tested settings.
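
For intuition about why a graph Laplacian regularizer encourages nearby samples to share similar coefficients, the identity below shows the standard real-valued form used in graph-regularized factorization. It is an analogue only: the abstract does not state the reduced biquaternion form, and the symbols $H$ (coefficient matrix with columns $h_i$), $W$ (symmetric neighbor weights $w_{ij}$), $D$, and $L$ are notation assumed here.

```latex
\[
\frac{1}{2}\sum_{i,j} w_{ij}\,\lVert h_i - h_j \rVert_2^2
  \;=\; \operatorname{tr}\!\left(H L H^{\top}\right),
\qquad L = D - W,
\qquad D_{ii} = \sum_{j} w_{ij}.
\]
```

Minimizing the left-hand side penalizes coefficient vectors that differ between strongly connected samples, which is the local-structure preservation the abstract refers to.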

2606.03212 2026-06-16 cs.LG 版本更新

Bayesian Tensor Decomposition with Diffusion Model Prior

贝叶斯张量分解与扩散模型先验

Zerui Tao, Qibin Zhao

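Affiliations * Zerui Tao; Qibin Zhao

AI summary Proposes DiffBCP, a framework that combines a cumulative shrinkage process prior with a pre-trained diffusion model and performs Bayesian CP decomposition via split Gibbs sampling, outperforming existing methods on image inpainting and denoising.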

Comments ICML 2026

详情
AI中文摘要

低秩张量分解(TD)通常对干净、完全观测的数据有效,但在严重缺失或噪声下性能下降。低秩性本身是一种有用但有限的结构先验,额外的手工先验(如稀疏性或平滑性)仍难以捕捉真实世界数据的丰富统计特性。为了在重度污染下补偿这种弱的归纳偏置,我们希望注入一个学习到的、数据驱动的先验;然而,最先进的扩散模型与当前的TD和可处理的后验推断并不兼容。为了解决这些挑战,我们引入了DiffBCP,一种混合先验的贝叶斯CP分解框架,它将CP因子上的累积收缩过程先验(用于自动秩选择)与一个现成的预训练扩散模型(作为重构张量上的隐式数据先验)相结合。尽管似然、低秩约束和扩散先验之间存在耦合,为了使后验推断可处理,我们开发了一个分裂吉布斯采样器:CP因子允许共轭更新,而扩散块通过低秩引导的去噪进行采样。一个噪声自适应的耦合调度进一步减少了对手动调参退火的敏感性。在图像修复和去噪(包括高分辨率分布外图像)上的实验表明,与贝叶斯、非线性和即插即用TD基线相比,该方法具有一致的改进。

英文摘要

Low-rank tensor decomposition (TD) is usually effective on clean, fully observed data, but it often degrades under severe missingness or noise. Low-rankness is itself a useful but limited structural prior, and additional handcrafted priors (e.g., sparsity or smoothness) still fall short of capturing the rich statistics of real-world data. To compensate for this weak inductive bias under heavy corruption, one would like to inject a learned, data-driven prior; however, the state-of-the-art diffusion models are not readily compatible with current TD and tractable posterior inference. To address these challenges, we introduce DiffBCP, a hybrid-prior Bayesian CP decomposition framework that couples a cumulative shrinkage process prior over the CP factors for automatic rank selection with an off-the-shelf pre-trained diffusion model as an implicit data prior on the reconstructed tensor. To make posterior inference tractable despite the coupling among the likelihood, low-rank constraint, and diffusion prior, we develop a split Gibbs sampler: CP factors admit conjugate updates, while the diffusion block is sampled via low-rank-guided denoising. A noise-adaptive coupling schedule further reduces sensitivity to hand-tuned annealing. Experiments on image inpainting and denoising, including high-resolution out-of-distribution images, show consistent gains over Bayesian, nonlinear, and plug-and-play TD baselines.

2606.02955 2026-06-16 cs.CL cs.AI cs.LG 版本更新

Fast-dLLM++: Fréchet Profile Decoding for Faster Diffusion LLM Inference

Fast-dLLM++: 用于更快扩散LLM推理的Fréchet轮廓解码

Siva Rajesh Kasa, Yasong Dai, Sumit Negi, Hongdong Li

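Affiliations * University of California, Berkeley; Stanford University

AI summary To address the bottleneck of parallel token generation in diffusion LLM inference, proposes Fréchet profile decoding, which selects parallel commit sets from heterogeneous confidence profiles and raises throughput while leaving the model and cache unchanged.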

Comments Initial version accepted at Workshop on Structured Probabilistic Inference & Generative Modeling, ICML 2026. Project Page: https://ringo-star.github.io/projectpage_frechet/

详情
AI中文摘要

扩散大语言模型承诺并行令牌生成,但推理仍然受限于决定哪些掩码令牌可以安全地一起提交。Fast-dLLM通过KV缓存和置信度引导的并行解码解决了这个问题,但其解码理论使用同质高置信度假设,实际上将每个候选集简化为其最弱的选择令牌。我们认为这留下了速度提升空间,因为实际解码步骤表现出异构置信度轮廓。我们提出\textbf{Fast-dLLM++},一种无需训练的扩展,引入了\emph{Fréchet轮廓解码}:从完整的排序置信度轮廓中选择并行提交集,而不是单个最坏情况置信度。得到的规则是Fast-dLLM因子选择器的异构置信度泛化,在等置信度情况下精确恢复先前规则,并在所选令牌具有不均匀置信度时增加一个可证明的\emph{异构性奖励}。Fast-dLLM++完全保持模型、扩散过程和缓存实现不变,使其成为现有Fast-dLLM解码的直接替代品。在GSM8K、MATH、HumanEval和MBPP上使用LLaDA-8B模型的实验表明,理论改进直接转化为经验收益:轮廓感知选择通过利用最弱令牌规则忽略的安全并行性改进了准确率-吞吐量前沿,在可比准确率下实现了高达37%的吞吐量提升。我们的匿名代码发布在此https URL。

英文摘要

Diffusion large language models promise parallel token generation, yet inference remains bottlenecked by deciding which masked tokens can be safely committed together. Fast-dLLM addressed this with KV caching and confidence-guided parallel decoding, but its decoding theory uses a homogeneous high-confidence assumption that effectively reduces each candidate set to its weakest selected token. We argue that this leaves speed on the table because real decoding steps exhibit heterogeneous confidence profiles. We propose \textbf{Fast-dLLM++}, a training-free extension that introduces \emph{Fréchet profile decoding}: selecting parallel commit sets from the full sorted confidence profile rather than a single worst-case confidence. The resulting rule is a heterogeneous-confidence generalization of Fast-dLLM's factor selector: it recovers the previous rule exactly in the equal-confidence case and adds a provable \emph{heterogeneity bonus} when the selected tokens have uneven confidences. Fast-dLLM++ leaves the model, diffusion process, and cache implementation entirely unchanged, making it a drop-in replacement for existing Fast-dLLM decoding. Experiments on GSM8K, MATH, HumanEval, and MBPP with the LLaDA-8B model show that the theoretical improvement translates directly into empirical gains: profile-aware selection improves the accuracy--throughput frontier by exploiting safe parallelism that weakest-token rules miss, achieving up to 37\% higher throughput at comparable accuracy. Our code is available at https://github.com/Ringo-Star/FastdLLM_plusplus.
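
One way to see why a full confidence profile can admit more tokens than a weakest-token rule is the classical Fréchet lower bound on a joint probability, $P(\cap_i A_i) \ge \max(0, 1 - \sum_i (1 - p_i))$. The sketch below compares that bound with the bound obtained by treating every token as if it had the minimum confidence. It is an assumed reading for illustration only: the abstract does not give the paper's actual selection rule, the function names are hypothetical, and confidences are treated here as marginal probabilities of being correct.

```python
from typing import List, Sequence


def frechet_lower_bound(confidences: Sequence[float]) -> float:
    """Fréchet lower bound on the probability that all tokens are correct."""
    return max(0.0, 1.0 - sum(1.0 - p for p in confidences))


def weakest_token_bound(confidences: Sequence[float]) -> float:
    """Same bound, but every token is assumed to have the minimum confidence."""
    n = len(confidences)
    return max(0.0, 1.0 - n * (1.0 - min(confidences)))


def largest_safe_prefix(confidences: Sequence[float], min_joint: float) -> List[float]:
    """Greedily commit the most confident tokens while the Fréchet bound on the
    joint correctness stays at or above `min_joint` (hypothetical selector)."""
    chosen: List[float] = []
    for p in sorted(confidences, reverse=True):
        if frechet_lower_bound(chosen + [p]) < min_joint:
            break
        chosen.append(p)
    return chosen
```

For confidences `[0.999, 0.999, 0.96]` and a target of 0.95, the profile-based bound still admits all three tokens, while the weakest-token bound for the same set falls below the target. The gap between the two bounds is one interpretation of the "heterogeneity bonus" the abstract mentions.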

2606.02877 2026-06-16 cs.CV 版本更新

Pathway-Structured Privileged Distillation for Deployable Computational Pathology

面向可部署计算病理学的通路结构特权蒸馏

Yongxin Guo, Hao Lu, Onur Koyun, Muhammet Demir, Metin Gurcan

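Affiliations * School of Medicine, Wake Forest University

AI summary Proposes the MoPE framework, which uses pathway-indexed pathology experts and memory-usage alignment to recast multimodal learning as privileged distillation for histology-only inference, improving whole-slide image inference performance.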

详情
AI中文摘要

整合转录组学和组织病理学可以改善癌症风险建模,但在常规环境中RNA分析的有限可用性限制了其实用性。本文引入了通路专家混合(MoPE),这是一个知识蒸馏框架,将多模态学习重新定义为仅组织学推理的特权蒸馏。MoPE的动机来自RNA谱和全切片图像之间的部分可观测性:组织学可以捕获某些分子程序相关的形态学后果,但不能期望重建完整的转录组状态。MoPE编码RNA衍生的通路,并通过记忆使用对齐将分子监督转移到通路索引的病理专家。在各种公共基准测试和两个独立的乳腺癌队列中,与基线方法相比,MoPE持续改善了仅WSI推理性能。通路使用分析和人工审核的视觉检查提供了模型行为和候选形态学相关读数的有限检查。这些结果支持通路结构特权蒸馏作为在训练期间利用分子信息同时保持无RNA推理的有前途的途径。

英文摘要

Integrating transcriptomics and histopathology can improve cancer risk modelling, yet practical use is constrained by the limited availability of RNA profiling in routine settings. Here we introduce Mixture of Pathway Experts (MoPE), a knowledge-distillation framework that reframes multimodal learning as privileged distillation for histology-only inference. MoPE is motivated by the partial observability between RNA profiles and whole-slide images: histology can capture morphology-linked consequences of certain molecular programmes, but cannot be expected to reconstruct the full transcriptomic state. MoPE encodes RNA-derived pathways and transfers the molecular supervision to pathway-indexed pathology experts through memory-usage alignment. Across diverse public benchmarks and two independent breast cancer cohorts, MoPE consistently improved WSI-only inference performance relative to baseline methods. Pathway-usage analyses and human-audited visual inspection provide bounded inspection of model behaviour and candidate morphology-linked readouts. These results support pathway-structured privileged distillation as a promising route to using molecular information during training while preserving RNA-free inference.

2606.02670 2026-06-16 cs.LG cs.AI 版本更新

Anomalies in Multivariate Time Series Benchmarks Are Mostly Univariate

多变量时间序列基准中的异常主要是单变量的

Marc Pinet, Julien Cumin, Samuel Berlemont, Dominique Vaufreydaz

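Affiliations * Orange Research; Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG

AI summary Using a diagnostic framework and experiments, shows that anomalies in current multivariate time series anomaly detection benchmarks stem mainly from univariate deviations, with very few cross-channel structural changes, so existing benchmarks are unsuitable for validating cross-channel modeling capabilities.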

Comments Accepted at the 12th International Workshop on Mining and Learning from Time Series (MiLeTS), co-located with KDD 2026

详情
AI中文摘要

许多最新的多变量时间序列异常检测(MT-SAD)模型引入了跨通道建模,其隐含假设是异常的结构可能分布在多个通道上。我们在八个广泛使用的公共基准上评估了这一假设,引入了一个逐段诊断框架,该框架针对每个标记的异常,标记是否至少有一个通道单独偏离其正常历史,是否跨通道相关结构发生变化,或两者兼有。该框架表明,在一系列合理阈值下,没有跨通道破裂发生在没有伴随单变量偏离的情况下。一个补充指标还显示,在八个基准中的六个上,至少一半的标记异常段在79%到100%的时间步上发生单变量偏离,在其中的三个数据集上达到100%。为了验证我们的框架在存在跨通道结构时能够捕获它,我们构建了具有共享噪声的相移正弦通道的合成数据。每个异常段通过两种通道级损坏之一进行改变,这些损坏保留了每个通道的边缘分布,同时破坏了跨通道结构,我们的框架正确地将这些段表征为仅跨通道异常。在这些数据上,依赖通道(CD)模型成功利用了跨通道信号,而独立通道(CI)模型则失败。在真实基准上对最近SOTA检测器的CI/CD比较进一步证实了CD建模没有带来可衡量的收益。我们得出结论,当前的MT-SAD基准不适合验证跨通道建模能力,并呼吁开发更多结构多样的评估集。本研究的代码已公开。

英文摘要

Many recent multivariate time series anomaly detection (MTSAD) models incorporate cross-channel modeling, under the implicit assumption that the structure of anomalies may be spread across multiple channels. We evaluate this assumption on eight widely used public benchmarks by introducing a per-segment diagnostic framework that flags, for each labeled anomaly, whether at least one channel deviates individually from its normal history, whether the cross-channel correlation structure changes, or both. The framework shows that no cross-channel rupture occurs without an accompanying univariate deviation across a range of reasonable thresholds. A complementary metric also reveals that on six of the eight benchmarks, at least half of the labeled anomaly segments deviate univariately on 89% to 100% of their timesteps, reaching 100% on three of these datasets. To verify that our framework captures cross-channel structure when present, we construct synthetic data of phase-shifted sinusoidal channels with shared noise. Each anomalous segment is altered through one of two channel-wise corruptions that preserve the per-channel marginal distribution while breaking cross-channel structure, and our framework correctly characterizes these segments as cross-channel-only. On these data, channel-dependent (CD) models successfully exploit the cross-channel signal whereas channel-independent (CI) ones fail. The CI/CD comparison of a recent SOTA detector on real benchmarks further confirms that CD modeling brings no measurable gain. We conclude that current MTSAD benchmarks are unsuitable for validating cross-channel modeling capabilities, and we call for the development of more structurally diverse evaluation sets. The code for this study is publicly available.

2606.02506 2026-06-16 cs.CV 版本更新

Question-Aware Evidence Ledgers for Video Relational Reasoning

问题感知的证据账本用于视频关系推理

Yilin Ou, Mengshi Qi, Huadong Ma

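Affiliations * State Key Laboratory of Networking and Switching Technology

AI summary Proposes a test-time reasoning pipeline built on a GPT-5.5 video QA solver and question-aware evidence ledgers. The ledgers make explicit the targets, count units, reference frames, and temporal or spatial scope needed for counting, spatial, endpoint, viewpoint, and dialogue reasoning, and external tools serve as evidence sources. The pipeline reaches 92.95% overall accuracy on the VRR-QA challenge.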

Comments Technical report for the VRR Challenge at the VideoLLMs Workshop, CVPR 2026

详情
AI中文摘要

VRR-QA挑战评估视频中的视觉关系推理,答案通常依赖于隐含的空间关系、事件边界、目标身份和对话上下文,而非单个显著帧。我们提出一个基于强GPT-5.5视频QA求解器和一组问题感知证据账本的测试时推理流水线。初始求解器从统一的视频表示回答每个问题,而路由账本被提示使所需目标、计数单位、参考帧以及时间或空间范围显式化,用于计数、空间、端点、视角和对话推理。外部工具如开放词汇检测、深度线索、成对裁剪、ASR和场景图账本仅用作证据源。保守门控保持当前答案,除非独立证据唯一支持不同选项。最终证据门控流水线在挑战测试集上达到92.95%的整体准确率和93.79%的宏平均准确率。

英文摘要

The VRR-QA challenge evaluates visual relational reasoning in videos, where answers often depend on implicit spatial relations, event boundaries, target identity, and dialogue context rather than a single salient frame. We present a test-time reasoning pipeline built around a strong GPT-5.5 video QA solver and a set of question-aware evidence ledgers. The initial solver answers each question from a uniform video representation, while routed ledgers are prompted to make the required targets, count units, reference frames, and temporal or spatial scope explicit for counting, spatial, endpoint, viewpoint, and dialogue reasoning. External tools such as open-vocabulary detection, depth cues, pair crops, ASR, and scene-graph ledgers are used only as evidence sources. A conservative gate keeps the current answer unless independent evidence uniquely supports a different option. The final evidence-gated pipeline achieves 92.95% overall accuracy and 93.79% macro accuracy on the challenge test split.
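
The conservative gate described above can be sketched in a few lines. This is an illustrative reading, not the authors' code: the function name `gated_answer` and the representation of evidence as a list of supported options (with `None` for an inconclusive source) are assumptions.

```python
from typing import Optional, Sequence


def gated_answer(current: str, evidence_votes: Sequence[Optional[str]]) -> str:
    """Keep `current` unless the conclusive evidence sources all point to one
    single option that differs from it (hypothetical gate)."""
    supported = {option for option in evidence_votes if option is not None}
    if len(supported) == 1:
        (candidate,) = supported
        if candidate != current:
            return candidate
    # No evidence, conflicting evidence, or evidence agreeing with the solver:
    # the initial answer stands.
    return current
```

For example, `gated_answer("A", ["B", None, "B"])` switches to `"B"`, whereas `gated_answer("A", ["B", "C"])` keeps `"A"` because the evidence does not uniquely support one alternative.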

2606.02493 2026-06-16 cs.CL Updated version

Not What, But How: A Framework for Auditing LLM Responses across Positioning, Generalization, Anthropomorphism, and Maxims

Siddhesh Milind Pawar, Sarah Masud, Haneul Yoo, Alice Oh, Isabelle Augenstein

Affiliations * University of Copenhagen; KAIST

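AI summary Proposes the FRANZ framework, which audits how LLMs frame their responses to subjective questions along four dimensions (cultural positioning, generalizing language, anthropomorphic cues, and adherence to conversational maxims), and builds the SQUARE corpus for empirical analysis.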

Comments 34 pages, 19 Figures, 4 Tables

Abstract

Large language models (LLMs) are being increasingly used to answer subjective, information-seeking questions, where users are sensitive to how responses are communicated, not just whether the answers are correct. Existing LLM evaluations for subjective cultural queries largely focus on factual correctness, ignoring how the response is framed. To this end, we introduce FRANZ, an automated FRAmework for respoNse characteriZation to conduct communicative audit of LLM responses along four dimensions: cultural positioning, use of generalizing language, anthropomorphic cues, and adherence to conversational maxims. To enable this evaluation, we contribute SQUARE - a corpus of 376k subjective questions sourced from 57 subreddits, and mapped to 7 countries and 19 question categories. We demonstrate FRANZ's applicability by scoring responses from three open-weight LLMs. We observe that LLMs show statistically significant differences in the frequency with which they employ each response characteristic. Unlike single-dimensional audits, FRANZ reveals that insider positioning and anthropomorphism are positively coupled, with the degree of coupling varying by country, providing a diagnostic lens for identifying framing divergences.

2606.01900 2026-06-16 cs.CV Updated version

Auteur: Language-Driven Cinematographic Framing for Human-Centric Video Generation

Muhammed Burak Kizil, Enes Sanli, Niloy J. Mitra, Xuelin Chen, Erkut Erdem, Aykut Erdem, Duygu Ceylan

Affiliations * Koç University; University College London; Adobe; Hacettepe University

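AI summary Proposes Auteur, which parameterizes camera motion as human-centric framing (shot size, angle, and composition) and combines a domain-specific language (DSL) with a fine-tuned multimodal large language model to achieve language-driven cinematographic framing, outperforming existing methods in human-centric video generation.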

Comments Project Page: https://cyberiada.github.io/Auteur/

Abstract

Generative video models have achieved remarkable visual fidelity and temporal coherence, yet intentional camera control remains elusive. Existing frameworks treat camera motion as a byproduct of pixel synthesis, producing trajectories that are stochastic, spatially inconsistent, and indifferent to the human subject driving the scene. In this work, we present Auteur, a method for language-driven, human-centric camera framing in generative video. Our core insight is that professional filmmakers conceive shots not as world-space trajectories but as framings defined relative to the actor, encoding shot size, angle, and composition as functions of human pose and motion. We formalize this intuition as a human-centric camera parameterization and introduce a Domain-Specific Language (DSL) that is convertible to standard 6-DoF camera parameters. A fine-tuned multimodal large language model then acts as a virtual director, mapping natural language descriptions and coarse human motion to sparse DSL keyframes that are deterministically interpolated into continuous camera trajectories, which are then provided as input to video generators. We train and evaluate Auteur on a new dataset of 34K aligned text, human motion, and DSL-annotated camera trajectories drawn from procedural synthesis and real-world movie footage from the CondensedMovies dataset. Auteur enables cinematographic framing of human-centered scenes, a capability largely absent in prior generative models. To assess this behavior, we propose new framing-focused metrics, and our experiments show that Auteur consistently outperforms existing methods. Project page is https://cyberiada.github.io/Auteur/
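To make the step from sparse keyframes to a continuous trajectory concrete, here is a minimal sketch. The abstract says only that DSL keyframes are deterministically interpolated; the linear interpolation, the hold-at-the-ends behavior, and the field names `shot_scale` and `azimuth_deg` are hypothetical stand-ins, not the paper's DSL.

```python
from bisect import bisect_right


def interpolate_keyframes(keyframes, n_frames):
    """Expand sparse keyframes into one parameter dict per frame.

    keyframes: list of (frame_index, params) pairs, where params maps a
        hypothetical framing field name to a float.
    Values are held constant before the first and after the last keyframe.
    Angles are interpolated as plain numbers, with no wrap-around handling.
    """
    keyframes = sorted(keyframes, key=lambda kf: kf[0])
    indices = [frame for frame, _ in keyframes]
    trajectory = []
    for t in range(n_frames):
        k = bisect_right(indices, t)  # number of keyframes at or before t
        if k == 0:
            trajectory.append(dict(keyframes[0][1]))
        elif k == len(keyframes):
            trajectory.append(dict(keyframes[-1][1]))
        else:
            (f0, p0), (f1, p1) = keyframes[k - 1], keyframes[k]
            w = (t - f0) / (f1 - f0)  # f1 > t >= f0, so never zero division
            trajectory.append(
                {name: (1 - w) * p0[name] + w * p1[name] for name in p0}
            )
    return trajectory


keys = [
    (0, {"shot_scale": 0.2, "azimuth_deg": 0.0}),
    (10, {"shot_scale": 0.6, "azimuth_deg": 90.0}),
]
for frame, params in enumerate(interpolate_keyframes(keys, 11)):
    print(frame, params)
```

Because the mapping from keyframes to frames is a pure function, the same keyframes always yield the same trajectory, which is the property the abstract highlights; converting each per-frame dict to 6-DoF camera parameters would be a separate step not shown here.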