arXivDaily arXiv daily academic digest, updated Monday to Friday
2606.11675 2026-06-11 cs.AI New submission

Lung-R1: A Knowledge Graph-Guided LLM for Pulmonary Diagnostic Reasoning

Haoyang Zeng, Yuanxi Fu, Rongzhen Li, Yuming Yang, Xiao Sun, Jingwang Huang, Gujie Shao, Guohui Xiang, Quan Lu, Dongfan Ye, Xuetao Chen, Jiang Zhong, Kaiwen Wei, Zhi Xu

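Affiliations School of Computer Science, Chongqing University; AI Research Institution, Mashang Financial Institution; Department of Information, Third Military Medical University

AI summary Introduces the LungKG knowledge graph and the Lung-R1 model, which use KG-constrained reasoning-chain construction and KG-guided reinforcement learning to close the gap between pulmonary knowledge and case-level diagnosis, reaching state-of-the-art results on EMR diagnosis.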

Abstract

Diagnosing pulmonary diseases requires integrating heterogeneous evidence amid phenotypic variability and cross-disease overlap. Although large language models (LLMs) have shown progress on pulmonary knowledge question answering (QA) and information-processing tasks, reliable pulmonary diagnosis requires patient-specific, relation-aware reasoning over electronic medical record (EMR) evidence rather than isolated knowledge recall. We define this gap between pulmonary knowledge and case-level diagnostic reasoning as the Pulmonary Knowledge-to-Diagnosis Gap. To address it, we introduce LungKG, the first structured pulmonary knowledge graph for diagnostic knowledge organization and record-grounded reasoning. LungKG contains 59,038 nodes and 164,308 edges across 15 entity types and 112 relation types, serving as both a reusable pulmonary knowledge resource and the foundation for LungKG-guided model adaptation. Built on LungKG, we propose Lung-R1, a LungKG-guided pulmonary LLM trained through KG-constrained reasoning-chain construction and KG-guided reinforcement learning. In a 20-system evaluation, Lung-R1-14B achieves state-of-the-art performance across Choice, Pulmonary-QA, and EMR Diagnosis, reaching an EMR Diagnosis score of 4.3583 and surpassing the strongest non-Lung-R1 baseline by 0.1476 points. These results demonstrate the value of LungKG-guided training for EMR-based pulmonary diagnosis.

2606.11674 2026-06-11 cs.SD cs.LG New submission

SpAArSIST: Sparsified AASIST for Efficient and Reliable Anti-Spoofing

Anton Firc, Vojtěch Staněk, Zbyněk Lička, Kamil Malinka, Martin Perešíni

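Affiliations Brno University of Technology

AI summary Proposes SpAArSIST, which sparsifies the graph pooling backend to cut backend compute by 20.7% and model size by 4.1% while staying competitive and improving out-of-domain robustness.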

Comments Accepted at Interspeech 2026

Abstract

We present SpAArSIST, a deployment-oriented refinement of the widely used AASIST graph pooling backend for self-supervised learning (SSL) based anti-spoofing. Motivated by redundant operations in public implementations, we replace learned pooling and stack-node attention with explicit, lightweight choices: separate train and inference graph pooling ratios $(k_{\mathrm{tr}},k_{\mathrm{inf}})$, magnitude-based node scoring, and mean aggregation of graph nodes. The best overall configuration (rank 1) cuts backend compute by 20.7% (195.045M $\rightarrow$ 154.706M MACs) and model size by 4.1% (611.8k $\rightarrow$ 586.4k params), while improving out-of-domain robustness on In-the-Wild to 2.82% EER and 0.078 minDCF (from 4.64% and 0.133) and remaining competitive on ASVspoof5. We further provide a composite selection score that summarizes accuracy, calibration, and compute to support balanced deployment-oriented model choice.
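The three backend choices named in this abstract (a pooling ratio that can differ between training and inference, magnitude-based node scoring, and mean aggregation) can be illustrated with a minimal sketch. It assumes that a node's magnitude is the L2 norm of its feature vector and that a ratio keeps `ceil(ratio * N)` nodes. The function names and example ratios are hypothetical, and this is not the authors' implementation.

```python
import math

def magnitude_pool(nodes, ratio):
    """Keep the highest-magnitude fraction of nodes (assumption: L2 norm as score)."""
    keep = max(1, math.ceil(ratio * len(nodes)))
    # Score each node by the L2 norm of its features; no learned parameters involved.
    ranked = sorted(nodes, key=lambda v: math.sqrt(sum(x * x for x in v)), reverse=True)
    return ranked[:keep]

def mean_aggregate(nodes):
    """Average the kept nodes feature-wise into one graph-level vector."""
    dim = len(nodes[0])
    return [sum(v[i] for v in nodes) / len(nodes) for i in range(dim)]

def pooled_embedding(nodes, k_tr, k_inf, training):
    # Separate ratios let training and inference keep different numbers of nodes.
    ratio = k_tr if training else k_inf
    return mean_aggregate(magnitude_pool(nodes, ratio))

nodes = [[3.0, 4.0], [0.1, 0.1], [1.0, 0.0], [0.0, 2.0]]
train_vec = pooled_embedding(nodes, k_tr=0.5, k_inf=0.75, training=True)
infer_vec = pooled_embedding(nodes, k_tr=0.5, k_inf=0.75, training=False)
```

With these toy inputs, training keeps the two largest-norm nodes and inference keeps three, so the two graph-level vectors differ even though the scoring rule is identical.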

2606.11670 2026-06-11 cs.CV cs.AI New submission

ARGUS: Stacked Multi-View Identity Mosaic Injection for Subject-Preserving Video Generation

Zijie Meng, Jiwen Liu, Yufei Liu, Chengzhuo Tong, Xiaoqiang Liu, Yuanxing Zhang, Yulong Xu, Pengfei Wan

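Affiliations Peking University; Kuaishou Technology; Xiamen University

AI summary Proposes the ARGUS framework, which uses Stacked Multi-View Identity Mosaic Injection (SMII) to represent identity as a compact dynamic distribution and combines it with an MLLM Identity Director, no-cross-pair counterfactual training, and related modules, reaching state-of-the-art results in subject-preserving video generation.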

Comments 13 pages, 3 figures

Abstract

Subject-preserving video generation is not solved by frontal-face similarity alone: a generated person must remain recognizable across motion, large viewpoint changes, expression shifts, occlusion, scale variation, and conflicts among text, first-frame, and identity references. We argue that the central bottleneck is the point-reference paradigm, which collapses identity into a single static observation entangled with pose, accessories, lighting, background, and camera statistics. We introduce Argus, a Wan-based framework centered on Stacked Multi-View Identity Mosaic Injection (SMII). SMII converts MLLM-selected image/video identity evidence into a 3*3 stacked mosaic, synchronizes the mosaic with the current diffusion time, and injects it as negative-time read-only memory in Wan's native token space. This turns identity from an external clean adapter or a single reference image into a compact dynamic distribution. Around SMII, an MLLM Identity Director selects informative identity moments and resolves condition conflicts, while no-cross-pair counterfactual training, Temporal Identity Annealing, and Adaptive Self-Likeness Guidance improve robustness without paired subject-video supervision. We further release HardID-Celeb, a public-figure identity-stress benchmark, and introduce YawScore and OccScore to probe large-yaw and first-frame-occlusion robustness. Argus achieves state-of-the-art results on OpenS2V-Eval Human-Domain, reaching 64.38 Total Score, 71.86 FaceSim, 51.62 NexusScore, and 79.14 NaturalScore. On HardID-Celeb, Argus obtains 76.80 FaceSim and improves YawScore and OccScore by 12.60 and 15.10 points over the strongest baselines, demonstrating that dynamic identity memory and large-scale counterfactual self-supervision are highly effective for subject-preserving video generation.

2606.11666 2026-06-11 cs.SD New submission

The Hidden Cost of Pairwise Verification in Synthetic Speech Source Tracing

Anton Firc, Zbyněk Lička, Vojtěch Staněk, Kamil Malinka

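Affiliations Brno University of Technology

AI summary Compares global anchoring with pairwise verification for synthetic speech source tracing and finds that pairwise objectives concentrate embedding variance into fewer directions and reduce resolution among closely related generators, with higher in-domain error than global anchoring.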

Comments Accepted at Interspeech 2026

Abstract

Open-set source tracing is increasingly framed as a verification problem, motivating the use of pairwise metric-learning objectives from biometrics. We thus compare global anchoring and pairwise verification under matched backbones and a fixed data and epoch budget on MLAAD (in-domain) and STOPA (out-of-domain). In our runs, global anchoring yields lower in-domain error (8.61% EER) than pairwise variants (12-15% EER), even with rival mining and XLS-R finetuning. Because pairwise objectives optimize similarity directly, they concentrate variance into fewer embedding directions, reducing resolution among closely related generators. To test if this drives the drop, we impose a similar bottleneck to the globally supervised baseline, yet the baseline remains competitive. Together with an embedding-space analysis ($k_{99}$), these results suggest that the gap is not explained by dimensionality alone, but rather by the pairwise objective's shaping of the retained directions.

2606.11662 2026-06-11 cs.AI New submission

TreeSeeker: Tree-Structured Trial, Error, and Return in Deep Search

Zhuofan Shi, Mingzhe Ma, Lu Wang, Fangkai Yang, Pu Zhao, Yiming Guan, Youling Huang, Wei Zhang, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan

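Affiliations Microsoft; East China Normal University

AI summary Proposes TreeSeeker, a framework that enables controlled trial and error in deep search through tree-structured branch-and-return search and UCB-signal-based selection, consistently outperforming strong open-source baselines on complex question answering.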

Abstract

Deep search requires agents to answer complex questions through multi-step web search, browsing, evidence comparison, and synthesis. A central challenge is deciding how to search when several directions look plausible but only some will later lead to reliable evidence. If an agent greedily follows the current best-looking direction, it may keep extending a weak continuation. If it explores without discipline, it may waste budget on disconnected trials. We propose TreeSeeker, an inference-time framework for controlled trial-and-error in deep search. TreeSeeker organizes search as branch-and-return search over tree-structured states, where each branch is a tentative direction for a sub-goal. At each round, TreeSearch reads all sub-goal trees, identifies active goals, and uses textual UCB signals of value, uncertainty, and risk to select among exploiting a promising branch, exploring an uncertain alternative, or pruning an unproductive continuation and returning to an earlier branch point. TreeMem supports this control loop by keeping evidence, uncertainty, conflicts, progress, and failure cues attached to the branches that produced them, so trial outcomes can guide later decisions. Experiments on XBench-DeepSearch, BrowseComp, and BrowseComp-ZH show that TreeSeeker consistently outperforms strong open-source baselines, suggesting that explicit branch-and-return control complements stronger reasoning and tool execution.
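The exploit / explore / prune decision described in this abstract can be illustrated with a numeric stand-in. The paper works with textual UCB signals, so the sketch below swaps in the classical UCB1-style score (value plus an exploration bonus) with an added risk penalty. The weights, the pruning threshold, and the branch names are all hypothetical, and this is not the TreeSeeker implementation.

```python
import math

def ucb_score(value, visits, total_visits, risk, c=1.0, risk_weight=0.5):
    """Numeric stand-in: estimated value + exploration bonus - risk penalty."""
    bonus = c * math.sqrt(math.log(total_visits + 1) / (visits + 1))
    return value + bonus - risk_weight * risk

def plan_round(branches, prune_below=0.3):
    """Return (branch to expand next, branches to prune) for one control round."""
    total = sum(b["visits"] for b in branches)
    scores = {b["name"]: ucb_score(b["value"], b["visits"], total, b["risk"])
              for b in branches}
    # Branches scoring under the threshold are abandoned (return to an earlier branch point).
    pruned = sorted(name for name, s in scores.items() if s < prune_below)
    alive = {name: s for name, s in scores.items() if s >= prune_below}
    target = max(alive, key=alive.get) if alive else None
    return target, pruned

branches = [
    {"name": "official-site", "value": 0.60, "visits": 5, "risk": 0.1},
    {"name": "forum-thread", "value": 0.40, "visits": 0, "risk": 0.2},
    {"name": "dead-link", "value": 0.05, "visits": 6, "risk": 0.9},
]
target, pruned = plan_round(branches)
```

In this toy round the unvisited `forum-thread` branch wins on its exploration bonus, while the heavily visited, high-risk `dead-link` branch falls below the threshold and is pruned.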

2606.11661 2026-06-11 cs.CV cs.LG New submission

Learning Instance-Adaptive Low-Rank Orthogonal Subspaces for Clothes-Changing Person Re-Identification

Dong-Woo Kim, Tae-Kyun Kim

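AI summary Proposes Ortho-ReID, which explicitly models a low-rank clothing subspace from VLM text descriptions and uses geometric constraints to extract clothing-invariant features, achieving state-of-the-art performance on several benchmarks.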

Comments Accepted to the ICML 2026 Workshop on CoLoRAI

Abstract

Clothes-changing person re-identification (CC-ReID) aims to recognize individuals despite drastic appearance changes caused by clothing variation. While existing methods rely on adversarial learning to disentangle clothing features, we propose Ortho-ReID, which explicitly models a low-rank clothing subspace from VLM text descriptions and extracts clothing-invariant representations via direct geometric constraints. A critical component is our transformer-based Basis Maker, which refines a shared, low-dimensional clothing prior into an instance-adaptive low-rank subspace through cross-attention with image patches, enabling robust clothing feature extraction even under varying visibility conditions. This instance-adaptive subspace is supervised via alignment with clothing text embeddings, while identity features are extracted via a learnable projection head and geometrically constrained to be strictly orthogonal to it. Extensive experiments demonstrate state-of-the-art performance on PRCC (+5.9% top-1), Celeb-reID-light (+3.5%), and LaST (+5.3%), with competitive results on LTCC.
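The geometric idea of keeping identity features orthogonal to a low-rank clothing subspace can be shown with plain linear algebra. The abstract describes a learnable projection head with an orthogonality constraint. The sketch below instead applies a hard projection onto the orthogonal complement, a simpler stand-in technique. The basis vectors and the feature are made-up toy values, and this is not the authors' code.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormalize(basis, eps=1e-12):
    """Gram-Schmidt: turn spanning vectors into an orthonormal basis."""
    ortho = []
    for v in basis:
        w = list(v)
        for q in ortho:
            coeff = dot(w, q)
            w = [wi - coeff * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        if norm > eps:  # skip vectors that already lie in the span
            ortho.append([wi / norm for wi in w])
    return ortho

def remove_subspace(feature, basis):
    """Project a feature onto the orthogonal complement of span(basis)."""
    out = list(feature)
    for q in orthonormalize(basis):
        coeff = dot(out, q)
        out = [oi - coeff * qi for oi, qi in zip(out, q)]
    return out

# Hypothetical rank-2 clothing subspace inside a 4-D feature space.
clothing_basis = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
feature = [2.0, -1.0, 3.0, 0.5]
identity_part = remove_subspace(feature, clothing_basis)
```

After the projection, `identity_part` has zero inner product with every clothing basis vector, which is the property the orthogonality constraint is meant to enforce.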

2606.11660 2026-06-11 cs.LG New submission

Bergson: An Open Source Library for Data Attribution

Lucia Quirke, Louis Jaburi, David Johnston, William Z. Li, Gonçalo Paulo, Guillaume Martres, Girish Gupta, Stella Biderman, Nora Belrose

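Affiliations EleutherAI; Independent

AI summary Presents Bergson, an open-source library supporting multiple data attribution methods for large language models and pre-training datasets, with on-disk gradient stores, multi-node distributed training, and the first open-source implementations of MAGIC, SOURCE, and TrackStar.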

Abstract

Data attribution is a promising field in interpretability that aims to explain model behavior through the influence of its training data, with applications including debugging undesirable model behavior and training dataset curation. However, significant engineering effort is required to perform it at scale, and many cutting edge techniques lack open-source tooling and support. Bergson is an open source library that aims to enable faster progress in the field by providing a host of techniques that scale to very large language models and pre-training datasets. The library natively supports on-disk gradient stores and multi-node distributed training, and provides quality of life tools for researchers. Finally, we introduce the first open-source implementations of three leading data attribution methods: MAGIC, SOURCE, and TrackStar. The library is available at https://github.com/EleutherAI/bergson .

2606.11657 2026-06-11 cs.LG cs.AI New submission

Sparse probes and murky physics: a case study of interpretability challenges in a foundation model for continuum dynamics

Katherine Rosenfeld, Maike Sonnewald

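Affiliations Gates Foundation; UC Davis

AI summary Probes the internal mechanisms of Walrus, a foundation model for continuum dynamics, with a sparse autoencoder, and finds that its internal features do not map cleanly onto standard physical decompositions and that its outputs show systematic discrepancies, highlighting key interpretability challenges for scientific foundation models.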

Comments 8 pages, 5 figures

Journal ref ICLR 2026 Workshop on Foundation Models for Science

Abstract

Generative AI emulators are increasingly used in scientific domains where we already have strong theory, benchmarks, and physical intuition. This raises a central evaluation and interpretability question: when a foundation-style model can reproduce known continuum dynamics, what internal mechanism supports that behavior, is the internal behaviour consistent with known physics, and how does it relate to where the emulator succeeds or fails? We investigate a cross-domain foundation model for continuum dynamics, Walrus by Polymathic, using mechanistic interpretability guided by physical principles. We apply a sparse autoencoder (SAE) to probe a selected layer, and address the practical challenge of triaging a large feature set (over 20,000) using enstrophy as a physically grounded metric. As a deliberately simple testbed, we focus on shear flow and compare feature recruitment across multiple shear-flow setups, i.e. parameter values in the numerical simulation. Across setups we find evidence of piecewise consistency, with subsets of features recurring in similar roles, but this structure is intermittent and does not map cleanly onto standard physical decompositions. In parallel, direct comparisons between numerical simulation and the emulator reveal systematic output-level discrepancies, including regimes where energy/structures become too diffuse or too localized. We connect parts of these discrepancies to changes in specific SAE feature usage. Our work highlights open questions for scientific foundation models: how to robustly prioritize mechanistically meaningful features, how to separate stable structure from analysis artifacts (including single-layer and SAE limitations), and how to use established benchmarks to decide when "different" internal representations are genuinely informative rather than merely effective.

2606.11652 2026-06-11 cs.LG New submission

IAPO: Input Attribution-Aware Policy Optimization for Tool Use in Small Multimodal Agents

Yifan Yang, Zhen Zhang, Jiayi Tian, Liyan Tan, Zheng Zhang

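Affiliations University of California, Santa Barbara

AI summary Proposes Input Attribution-Aware Policy Optimization (IAPO), a reinforcement learning method that aligns the model's input attribution with that of a teacher model to improve tool calling in multimodal small language models, raising average accuracy by 3% across six test sets.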

Abstract

This paper investigates reinforcement learning (RL) methods for improving tool-calling capabilities in multimodal small language model (SLM) agents. While existing works have explored various reward designs to improve agentic tool-calling ability, these approaches face inherent limitations for SLM training, especially under multimodal scenarios. First, many existing methods evaluate tool use correctness through exact matching against certain ground-truth or predefined formats. However, this assumption is often unsuitable for multimodal tasks, where multiple tool use paths may be valid and annotated tool trajectories are typically unavailable. Second, such sparse and brittle binary rewards provide little guidance on how to improve the underlying decision process, making them particularly difficult for multimodal SLM to learn from. To address these issues, we propose Input Attribution-Aware Policy Optimization (IAPO), an RL algorithm for improving tool use in multimodal SLM by aligning the model's attribution across input components with that of a stronger teacher. Experiments on Qwen2.5-VL-3B show that the proposed method improves visual question answering accuracy by an average of 3% across six test sets compared with existing visual tool use work, by helping the model attend to the most relevant input evidence.

2606.11651 2026-06-11 cs.LG q-bio.QM stat.AP New submission

DeepRHP: A Hybrid Variational Autoencoder for Designing Random Heteropolymers as Protein Mimics

Shuni Li, Zhiyuan Ruan, Andy Shen, Ivan Jayapurna, Ting Xu, Haiyan Huang

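AI summary Proposes DeepRHP, a hybrid variational autoencoder that combines a feature-based VAE with a classical VAE in a semi-supervised framework so that the latent space captures key chemical features and sequence patterns, guiding random heteropolymer design; its predictions for stabilizing membrane proteins are cross-validated against published results.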

Comments Oral presentation at AAAI 2023 Workshop on AI to Accelerate Science and Engineering

Abstract

Synthetic random heteropolymers (RHPs), consisting of a predefined set of monomers, offer an approach toward the design of protein-like materials. These RHPs, if designed appropriately, can mimic protein behavior and function. As such, there is a need for computational tools to efficiently guide RHP design. We bridge this gap by developing DeepRHP, a modified variational autoencoder (VAE) model under a semi-supervised framework. By equipping a classical VAE with an additional feature-based VAE, DeepRHP forces the latent space to capture structures of critical chemical features as well as individual RHP sequence patterns. In this sense, our method is versatile by allowing any relevant features to be incorporated in a hybrid manner. We demonstrate the effectiveness of DeepRHP by suggesting potential monomer compositions that stabilize membrane proteins (e.g. Aquaporin Z) in non-native environments and cross-validating our prediction with published results. The concordance between our model and true RHP function suggests strong potential in utilizing hybrid autoencoder architectures to guide RHP design for proteins and other biological compounds.

2606.11650 2026-06-11 cs.LG cs.NA math.NA physics.comp-ph New submission

Structure-Preserving Neural Surrogates with Tractable Uncertainty Quantification

Handi Zhang, Adrienne M. Propp, Brooks Kinch, Houman Owhadi, Nathaniel Trask

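Affiliations University of Pennsylvania; Stanford University; California Institute of Technology

AI summary Proposes a structure-preserving reduced-order model that couples mixed finite element spaces with Gaussian process regression, uses topological structure to quantify uncertainty in state-flux relationships, and derives closed-form posterior uncertainty for a Dirichlet-to-Neumann map.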

Abstract

Recent advances in scientific machine learning provide a means of near-real-time solution to partial differential equations (PDEs), but lack the theoretical underpinnings of conventional simulators that support contemporary verification and validation. In this work, we construct data-driven reduced-order models that serve as structure-preserving, real-time surrogates. Remarkably, the exterior calculus that imposes physical conservation structure also exposes topological structure that we use to build a Gaussian process (GP) representation of uncertainty in state-flux relationships, ultimately yielding a Dirichlet-to-Neumann map for quantities of interest with closed-form expressions for posterior uncertainty. We specifically propose structure-preserving $H(\mathrm{div})$--$L^2$ subspaces of conventional Raviart--Thomas and $dgP_0$ elements prescribed by a lightweight transformer. Reduced-order dynamics consistent with this subspace are learned by posing a conservation law in which a GP describes the fluxes between volumes. This work hinges on a novel interface between mixed FEM spaces and GP regression; when training is posed as the optimal recovery problem (ORP), the resulting GP regression can be written as an optimization problem with equality constraints that impose a conservation structure, amenable to a fast Schur-complement training strategy. The trained model can then be solved in real time with closed-form estimators for boundary fluxes driven by prescribed Dirichlet data. The paper includes RKHS posterior error bounds for linear functionals to support uncertainty quantification, as well as numerical experiments demonstrating the accuracy of the posterior distribution as a surrogate for error estimation.

2606.11646 2026-06-11 cs.LG q-bio.QM stat.ML New submission

Tree-Structured Orthonormal Decomposition of the Aitchison Simplex

Daisuke Yamada, Qijun Zhang, Travis Pence, Barbara B. Bendlin, Federico Rey, Vikas Singh

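Affiliations National Institute of Information and Communications Technology, Japan

AI summary Proposes PolyILR, a tree-aligned orthonormal decomposition for compositional data that yields stable, interpretable features on microbiome and single-cell data and establishes a theoretical connection to softmax classifiers.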

Comments Accepted at ICML 2026. To appear in PMLR vol. 306

Abstract

Compositional data -- vectors encoding relative proportions -- arise across scientific domains, including ecology, geochemistry, and genomics. The features in these data often come with known hierarchical structure (e.g., taxonomies, phylogenies, ontologies), yet existing methods either ignore this structure, discard the intrinsic Aitchison geometry, are designed for binary trees, or yield incomplete coordinate systems. We describe PolyILR, a canonical orthonormal decomposition of the Aitchison tangent space aligned with any tree topology. Our construction defines a weighted local geometry at each internal node capturing full branching structure, then lifts these to a global orthonormal basis where every coordinate corresponds to a specific tree location. On microbiome and single-cell benchmarks, PolyILR yields stable, interpretable features and enables inference at multiscale tree resolution. We also establish a novel theoretical connection to softmax classifiers, suggesting possible applications to probabilistic modeling.
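For readers new to Aitchison geometry, the sketch below shows two standard building blocks from the compositional-data literature: the centered log-ratio (clr) transform and the classical isometric log-ratio balance for a single binary split, $b = \sqrt{rs/(r+s)}\,\ln\big(g(x_R)/g(x_S)\big)$, where $g$ is the geometric mean. This is the binary-tree case that the abstract says existing methods are limited to. It is not the PolyILR construction itself, and the example counts and grouping are made up.

```python
import math

def clr(x):
    """Centered log-ratio: log of each part minus the mean log over all parts."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [l - mean_log for l in logs]

def balance(x, left, right):
    """Classical ILR balance between two disjoint groups of parts (index lists)."""
    r, s = len(left), len(right)
    # Mean of logs equals the log of the geometric mean of each group.
    mean_left = sum(math.log(x[i]) for i in left) / r
    mean_right = sum(math.log(x[i]) for i in right) / s
    return math.sqrt(r * s / (r + s)) * (mean_left - mean_right)

counts = [10.0, 20.0, 30.0, 40.0]                    # hypothetical raw counts
proportions = [c / sum(counts) for c in counts]      # same composition, closed to 1
b_counts = balance(counts, left=[0, 1], right=[2, 3])
b_props = balance(proportions, left=[0, 1], right=[2, 3])
```

The balance is identical for raw counts and for proportions because only ratios between parts enter the formula, which is why such coordinates suit data that carry relative information only.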

2606.11645 2026-06-11 cs.CV New submission

Motion Reinforces Appearance: RGB-Skeleton Gated Residual Fusion for Micro-Gesture Online Recognition

Jialin Liu, Xinwen He, Pengyu Liu, Jiale Shi, Huaijuan Zang, Yanbin Hao

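Affiliations School of Computer Science and Information Engineering, Hefei University of Technology (HFUT), Hefei, China

AI summary Proposes DyFADet+, a dual-stream RGB-skeleton framework that adaptively fuses skeleton motion with RGB features through a gated residual module for micro-gesture online recognition, reaching an F1 score of 40.88 on the SMG dataset and ranking 2nd.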

Comments 13 pages, 2 figures

Abstract

Micro-gesture analysis attracts increasing attention for inferring spontaneous emotion from subtle body movements. Micro-gesture online recognition, which localizes and classifies each gesture instance in untrimmed videos, is a core task in the 4th EI-MiGA-IJCAI Challenge. Compared with typical temporal action detection, MGR emphasizes the localization and classification of actions, requiring the model to output the start time, end time, and category of each micro-gesture. Moreover, since micro-gestures are highly spontaneous, relying solely on a single modality makes it difficult to capture the complete and accurate multi-modal cues. In this work, we propose DyFADet+, which extends DyFADet into a dual-stream RGB-skeleton framework. In our model, both modalities are projected into shared multi-scale temporal embeddings and fused through a gated residual module, which adaptively injects skeleton motion into the RGB representation rather than using naive concatenation. Finally, these fused features are decoded by a Dynamic TAD head for online classification and boundary regression. On the SMG dataset, our method achieves an F1 score of 40.88, ranking 2nd in the Micro-gesture Online Recognition track.
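One common way to realize a gated residual fusion of the kind this abstract describes is `fused = rgb + gate * skeleton`, with the gate computed from both streams. The sketch below assumes a per-channel sigmoid gate over the concatenated features at a single time step. The weights and feature values are toy numbers, and the exact form used in DyFADet+ may differ.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_residual_fusion(rgb, skel, gate_w, gate_b):
    """Add skeleton features to RGB features, scaled by a learned per-channel gate."""
    joint = rgb + skel  # list concatenation of the two feature vectors: [rgb; skel]
    fused = []
    for c in range(len(rgb)):
        gate = sigmoid(sum(w * x for w, x in zip(gate_w[c], joint)) + gate_b[c])
        # Residual form: the RGB feature passes through unchanged when the gate closes.
        fused.append(rgb[c] + gate * skel[c])
    return fused

rgb = [1.0, 2.0]
skel = [0.5, -1.0]
gate_w = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]  # hypothetical weights
gate_b = [0.0, -50.0]  # channel 0 half open, channel 1 effectively closed
fused = gated_residual_fusion(rgb, skel, gate_w, gate_b)
```

Unlike naive concatenation, this keeps the output in the RGB feature space and lets the model suppress skeleton input channel by channel, as the second channel shows.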

2606.11643 2026-06-11 cs.CL New submission

Improving Cross-Format Robustness in Language Models with Multi-Format Training

June M. Liu, Shaomian Zheng, He Cao, Dingnan Jin, Qing Cui, Jun Zhou

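Affiliations Ant Group; International Digital Economy Academy (IDEA)

AI summary Proposes FormatMix, which expands a subset of the training data into multiple equivalent formats to improve the consistency of large language models across answer formats; expanding about 30% of the data recovers most of the gain from full-format training.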

Abstract

Large language models often remain sensitive to answer format: a question solved correctly in one form may fail in another semantically equivalent form. To study this gap, we define cross-format robustness as the extent to which a model answers the same underlying question consistently across formats. We then compare full-format training with FormatMix, which expands only a subset of training items into multiple equivalent formats using either random or targeted selection. Across GLM4 and Llama-3.1, multi-format supervision consistently improves both task performance and cross-format robustness, whereas multiple-choice question (MCQ)-only supervision brings little benefit and can even reduce robustness. We further find that expanding only about 30% of the training set into multiple formats often recovers most of the gain from full-format training, and this effect appears across the model families and sizes we study. These results suggest that format diversity, rather than additional supervision alone, is the key driver of robustness. Lightweight multi-format augmentation is therefore a practical way to make LLMs less sensitive to answer format without changing the base model.
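The random-selection variant of subset expansion can be sketched in a few lines. The sketch assumes every item is rendered once as a multiple-choice question and that the selected 30% additionally receive an open-ended and a true/false rendering. The three format functions, the item schema, and the seed are hypothetical, and the paper's targeted-selection variant is not shown.

```python
import random

def as_mcq(item):
    # Multiple-choice rendering: options labelled (A), (B), ...; target is the letter.
    options = " ".join(f"({chr(65 + i)}) {opt}" for i, opt in enumerate(item["options"]))
    return {"format": "mcq", "prompt": f"{item['question']} {options}",
            "target": chr(65 + item["answer"])}

def as_open(item):
    # Open-ended rendering: same question, target is the answer text itself.
    return {"format": "open", "prompt": item["question"],
            "target": item["options"][item["answer"]]}

def as_true_false(item, rng):
    # Verification rendering: judge a randomly drawn candidate answer.
    pick = rng.randrange(len(item["options"]))
    prompt = f"{item['question']} Proposed answer: {item['options'][pick]}. True or false?"
    return {"format": "true_false", "prompt": prompt,
            "target": "True" if pick == item["answer"] else "False"}

def format_mix(items, fraction=0.3, seed=0):
    """Render every item as MCQ; expand a random `fraction` into extra formats."""
    rng = random.Random(seed)
    n_expand = round(fraction * len(items))
    expand_ids = set(rng.sample(range(len(items)), n_expand))
    mixed = []
    for i, item in enumerate(items):
        mixed.append(as_mcq(item))
        if i in expand_ids:
            mixed.append(as_open(item))
            mixed.append(as_true_false(item, rng))
    return mixed

items = [{"question": f"What is {n} + 1?", "options": [str(n), str(n + 1), str(n + 2)], "answer": 1}
         for n in range(10)]
mixed = format_mix(items)
```

With ten items and `fraction=0.3`, three items gain two extra renderings each, so the training set grows from 10 to 16 examples rather than to the 30 that full-format expansion would produce.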

2606.11640 2026-06-11 cs.LG cs.AI New submission

TAROT: Task-Adaptive Refinement of LLM-prior Graphs for Few-shot Tabular Learning

Ruxue Shi, Yili Wang, Mengnan Du, Hangting Ye, Yi Chang, Xin Wang

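Affiliations Jilin University; The Chinese University of Hong Kong, Shenzhen

AI summary Proposes TAROT, a framework that builds and refines a task-adaptive semantic graph, using LLM priors and a GNN to encode semantic relationships between features and improve few-shot tabular learning.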

Abstract

Few-shot tabular learning provides a cost-effective approach for real-world applications where annotation is costly and collecting sufficient samples for new tasks is difficult. Existing traditional and LLM-based methods have demonstrated effectiveness in few-shot scenarios. However, traditional methods need additional training on unlabeled or generated data, which incurs significant computational overhead. In addition, LLM-based methods that directly feed raw tabular data into LLMs raise privacy and compliance concerns. More importantly, both paradigms largely overlook the semantic relationships between features, which provide structural and semantic priors for constructing a semantic graph. Such a graph is essential for modeling meaningful feature interactions in few-shot scenarios. In this paper, we propose TAROT, a GNN-based framework that encodes these structural and semantic priors by constructing and refining a task-adaptive semantic graph, thereby improving predictive performance in few-shot tabular learning. TAROT first encodes heterogeneous tabular data into unified node semantic representations via a Unified Semantic Tabular Node Encoder (USTNE). Then, it prompts LLMs to infer the semantic relationships between features based on the task description and feature names to construct a semantic graph. To mitigate structural noise introduced by LLM hallucination, TAROT introduces Task-adaptive Semantic Graph Refinement, which prunes spurious or task-unrelated edges and adds missing task-related ones, aligning the graph structure with the downstream objective. Finally, a GNN performs message passing over the refined graph to capture task-related semantic dependencies for prediction. Extensive experiments on various few-shot tabular learning benchmarks demonstrate the superior performance of TAROT, establishing it as a state-of-the-art approach in this domain.

2606.11639 2026-06-11 cs.CL 新提交

Evaluating Bias in Phoneme-Based Automatic Speech Recognition Systems: An Analysis of IPA Transcription Models

评估基于音素的自动语音识别系统中的偏差:对IPA转录模型的分析

Catherine Bao, Maneesha Rani Saha, Neal Patwari

发表机构 * University of Utah(犹他大学)

AI总结 研究评估WhisperIPA和ZIPA两个开源IPA转录ASR系统在不同口音和语言上的性能,通过标准音素错误率和软音素错误率分析,发现模型在性别、口音、种族和年龄等群体间存在持续性能差异。

详情
AI中文摘要

自动语音识别(ASR)系统的普及增加了对种族、年龄、性别和口音等人口统计偏差的探索,这些偏差通常源于不平衡的训练数据。大多数研究集中在基于标准字素的ASR系统上,而对基于音素的系统(如生成国际音标(IPA)表示的模型)关注较少。随着ASR系统向多语言支持和低资源语言建模转变,基于IPA的层作为关键的、语言无关的基础。在本研究中,我们评估了两个最先进的开源ASR系统WhisperIPA和ZIPA的性能,它们生成跨不同口音和语言源的IPA转录。我们的评估包括现有的多语言语音语料库和人口统计注释的英语语料库。我们通过比较模型生成的IPA转录与字素到音素(G2P)系统,使用标准音素错误率(PER)和提出的软PER指标(容忍语言学上相似的音素替换)来衡量模型性能。我们的分析考察了性能在不同语言和人口统计群体(如性别、口音、种族和年龄)之间的变化,揭示了即使在考虑了可接受的音素变异后仍存在的持续差异。这些发现为偏差的潜在来源提供了见解,并为开发更包容和语言鲁棒的基于音素的ASR系统提供了信息。我们的代码和数据将公开发布给社区。

英文摘要

The popularization of automatic speech recognition (ASR) systems has increased exploration of the demographic biases related to race, age, gender, and accent, which often stem from imbalanced training data. Most of these studies have focused on standard grapheme-based ASR systems, with comparatively little emphasis on phoneme-based systems, such as models that produce International Phonetic Alphabet (IPA) representations. As ASR systems shift toward multilingual support and low-resource language modeling, IPA-based layers serve as a critical, language-agnostic foundation. In this study, we evaluate the performance of two state-of-the-art open-source ASR systems that generate IPA transcriptions, WhisperIPA and ZIPA, across diverse accents and language sources. Our evaluation includes existing multilingual speech corpora and demographically annotated English-language corpora. We measure model performance by comparing model-generated IPA transcriptions against the output of grapheme-to-phoneme (G2P) systems using both standard phoneme error rate (PER) and a proposed Soft PER metric that tolerates linguistically similar phoneme substitutions. Our analysis examines how performance varies across languages and demographic groups such as gender, accent, ethnicity, and age, revealing persistent disparities even after accounting for acceptable phonemic variation. These findings provide insight into potential sources of bias and inform the development of more inclusive and linguistically robust phoneme-based ASR systems. Our code and data will be made publicly available to the community.
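
The difference between standard PER and a "soft" variant can be illustrated with a small edit-distance sketch. This is not the authors' metric: the similarity groups in `SIMILAR` and the partial substitution cost of 0.5 are hypothetical choices made only for illustration, since the abstract does not say how linguistically similar phonemes are defined or weighted.

```python
def phoneme_error_rate(reference, hypothesis, sub_cost=None):
    """Weighted edit distance between phoneme sequences, divided by reference length."""
    if sub_cost is None:
        # Standard PER: every mismatch costs a full substitution.
        sub_cost = lambda a, b: 0.0 if a == b else 1.0
    n, m = len(reference), len(hypothesis)
    # dist[i][j] = cheapest way to turn reference[:i] into hypothesis[:j]
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = float(i)
    for j in range(1, m + 1):
        dist[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j] + 1.0,  # deletion
                dist[i][j - 1] + 1.0,  # insertion
                dist[i - 1][j - 1] + sub_cost(reference[i - 1], hypothesis[j - 1]),
            )
    return dist[n][m] / max(n, 1)


# Hypothetical groups of phonemes treated as "linguistically similar".
SIMILAR = [{"t", "ɾ"}, {"ɪ", "i"}, {"ə", "ʌ"}]


def soft_sub_cost(a, b, partial=0.5):
    """Assumed soft cost: similar phonemes are penalized only partially."""
    if a == b:
        return 0.0
    if any(a in group and b in group for group in SIMILAR):
        return partial
    return 1.0


ref = ["w", "ɔ", "t", "ə", "r"]
hyp = ["w", "ɔ", "ɾ", "ə", "r"]  # flap instead of /t/
per = phoneme_error_rate(ref, hyp)
soft_per = phoneme_error_rate(ref, hyp, soft_sub_cost)
```

With these assumed groups, the flapped /t/ counts as a full error under standard PER (0.2) but only half an error under the soft variant (0.1).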

2606.11637 2026-06-11 cs.AI 新提交

TouchThinker: Scaling Tactile Commonsense Reasoning to the Open World with Large-scale Data and Action-aware Representation

TouchThinker: 通过大规模数据和动作感知表示将触觉常识推理扩展到开放世界

Kailin Lyu, Di Wu, Pengwei Zhang, Yuhang Zheng, Yingxin Lai, Long Xiao, Kangyi Wu, Pengna Li, Chen Gao, Lianyu Hu, Xiaobin Hu, Jie Hao, Ce Hao, Weihao Yuan, Shuicheng Yan

发表机构 * Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) National University of Singapore(新加坡国立大学) Zhongguancun Academy(中关村学院) Xiamen University(厦门大学) Xi’an Jiaotong University(西安交通大学) Nanyang Technological University(南洋理工大学) Nanjing University(南京大学)

AI总结 提出TouchThinker框架,通过构建百万级多源触觉数据集TouchThinker-1M和动作感知建模,将触觉常识推理扩展到开放世界,在多个数据集上取得竞争性表现。

Comments 18 pages, 11 figures

详情
AI中文摘要

触觉是具身智能体理解物理世界的关键模态。尽管最近的工作已将触觉信号融入语言系统进行触觉常识推理,但由于两个关键瓶颈,将此类系统扩展到现实的开放世界环境仍然具有挑战性:(1) 当前的触觉推理数据集在格式和规模上仍然有限,为从触觉观察到物理常识的推理提供的监督不足,并阻碍了可迁移触觉常识的学习;(2) 触觉信号本质上是冗余且特定于动作的,但现有方法常常忽略这些特性,导致表示效率低下且语义表达能力有限。为了解决这些局限性,我们提出了TouchThinker,一个从数据和表示两个角度将触觉常识推理扩展到开放世界的触觉-语言框架。首先,我们构建了TouchThinker-1M,一个百万级、多源的触觉推理数据集,涵盖415个物体、8个场景和7种传感器类型,为开放世界泛化提供了坚实的数据基础。我们进一步引入了TouchThinker-Bench,一个具有更真实和多样化任务的开放世界基准。然后,我们提出了动作感知建模机制,以提高触觉表示效率并实现高效推理。实验结果表明,TouchThinker在多个数据集上取得了与最先进模型竞争的性能。我们的代码和数据集将在以下网址提供:https://github.com/lvkailin0118/TouchThinker。

英文摘要

Touch is a key modality for embodied agents to understand the physical world. Although recent work has incorporated tactile signals into language systems for tactile commonsense reasoning, scaling such systems to realistic open-world settings remains challenging due to two key bottlenecks: (1) current tactile reasoning datasets remain limited in format and scale, providing insufficient supervision for reasoning from tactile observations to physical commonsense and hindering the learning of transferable tactile commonsense; (2) tactile signals are inherently redundant and action-specific, yet existing methods often overlook these properties, resulting in inefficient representations with limited semantic expressiveness. To address these limitations, we propose TouchThinker, a tactile-language framework that scales tactile commonsense reasoning to the open world from both data and representation perspectives. First, we construct TouchThinker-1M, a million-scale, multi-source tactile reasoning dataset covering 415 objects, 8 scenarios, and 7 sensor types, providing a solid data foundation for open-world generalization. We further introduce TouchThinker-Bench, an open-world benchmark with more realistic and diverse tasks. Then, we propose an action-aware modeling mechanism to improve tactile representation efficiency and enable efficient reasoning. Experimental results demonstrate that TouchThinker achieves competitive performance against state-of-the-art models across multiple datasets. Our code and dataset will be made available at: https://github.com/lvkailin0118/TouchThinker.

2606.11636 2026-06-11 cs.RO 新提交

SAFER-Nav: Enhancing Safety for Visual Robot Navigation via Segmentation-Aware Fine-Tuning

SAFER-Nav: 通过分割感知微调增强视觉机器人导航的安全性

Geonyeong Ko, Giung Lee, Changjoo Nam

发表机构 * Dept. of Electronic Engineering, Sogang University(西江大学电子工程系) Dept. of Computer Science, Rice University(莱斯大学计算机科学系) Vertical Labs, Co., Ltd.(Vertical Labs 有限公司)

AI总结 提出SAFER-Nav方法,通过分割感知微调将障碍物边界和可通行空间结构直接融入导航策略,降低碰撞频率并保持目标到达性能。

详情
AI中文摘要

基于视觉的导航模型,特别是基础模型,仅从RGB观测生成可行轨迹。然而,即使是最先进的基于Transformer和扩散的策略也难以在包含未见障碍物或条件变化的不熟悉部署环境中泛化。生成的轨迹通常仍以目标为导向但不安全。现有工作通过外部轨迹校正或内部几何先验提高安全性,但所得策略并未被训练显式表示障碍物边界或可通行自由空间结构。为解决此问题,我们提出一种导航模型,通过微调将这些结构直接纳入策略,并设计为与多种基于RGB的主干兼容。在多个机器人平台、室内环境以及静态和动态障碍物场景中,我们的方法相对于ViNT、NoMaD及其CARE增强变体降低了碰撞频率,同时保持目标到达性能。

英文摘要

Vision-based navigation models, particularly foundation models, generate viable trajectories from RGB observations alone. However, even state-of-the-art transformer- and diffusion-based policies struggle to generalize in unfamiliar deployment environments containing unseen obstacles or shifted conditions. The resulting trajectories often remain goal-directed but unsafe. Existing efforts improve safety through external trajectory correction or internal geometric priors, yet the resulting policies are not trained to explicitly represent obstacle boundaries or traversable free-space structure. To address this, we propose a navigation model that incorporates these structures directly into the policy via fine-tuning and is designed to be compatible with diverse RGB-based backbones. Across multiple robot platforms, indoor environments, and static and dynamic obstacle scenarios, our method reduces collision frequency relative to ViNT, NoMaD, and their CARE-augmented variants while maintaining goal-reaching performance.

2606.11634 2026-06-11 cs.AI 新提交

Architecture-Aware Reinforcement Learning Makes Sliding-Window Attention Competitive in Math Reasoning

架构感知强化学习使滑动窗口注意力在数学推理中具有竞争力

Kai Liu, Peijie Dong, Xinchen Xie, Jianfei Gao, Qipeng Guo, Xiaowen Chu, Shaoting Zhang, Kai Chen

发表机构 * Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University(上海智能自主系统研究院,同济大学) Shanghai AI Laboratory(上海人工智能实验室) Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出SWARR方法,通过监督微调将预训练自注意力模型高效转换为滑动窗口注意力,并利用强化学习策略适应,缩小了与自注意力的性能差距,同时保持线性复杂度的高效性。

详情
AI中文摘要

推理和智能体大型语言模型的快速进展增加了对长上下文推理的需求,但自注意力的计算复杂度随上下文长度呈二次增长。为了解决这个问题,我们研究了SWARR(用于数学推理的滑动窗口注意力强化适应),这是一种将SWA模型适应数学推理的实用方案。SWARR包含两个阶段:(1)从预训练的SA模型高效转换为SWA,并通过监督微调(SFT)避免重新训练基础模型;(2)使用强化学习(RL)进行策略适应。我们发现,在SFT后SWA的性能仍低于SA,我们假设这一差距部分由数据-架构不匹配导致:大多数SFT数据是为SA模型准备的,可能包含SWA难以建模的长距离依赖。由于在策略RL在SWA约束下优化自生成轨迹,它可以使轨迹更好地匹配SWA。在数学推理基准上的实验表明,该方案显著缩小了SWA与SA之间的差距,恢复了SWA转换过程中丢失的大部分准确性,同时保持了线性复杂度注意力的效率优势。我们的核心贡献是实证发现,RL改变了仅通过转换和SFT得出的关于SWA在数学推理中可行性的结论。

英文摘要

The rapid progress of reasoning and agentic large language models (LLMs) has increased the demand for long-context inference, but self-attention (SA) scales quadratically with context length. To address this, we study SWARR (Sliding-Window Attention with Reinforced Adaptation for Math Reasoning), a practical recipe for adapting SWA models to mathematical reasoning. SWARR has two stages: (1) efficient conversion from a pretrained SA model to SWA with supervised fine-tuning (SFT), which avoids pretraining a new base model, and (2) policy adaptation with reinforcement learning (RL). We find that SWA still underperforms SA after SFT, and we hypothesize that this gap is caused in part by a data-architecture mismatch: most SFT data are prepared for SA models and may contain long-range dependencies that are difficult for SWA to model. Because on-policy RL optimizes self-generated trajectories under the SWA constraint, it can adapt trajectories to better match SWA. Experiments on mathematical reasoning benchmarks show that this recipe substantially narrows the gap between SWA and SA, recovering much of the accuracy lost during SWA conversion while preserving the efficiency benefits of linear-complexity attention. Our central contribution is the empirical finding that RL changes the conclusion one would draw from conversion and SFT alone about SWA's viability for math reasoning.
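
The efficiency argument above can be made concrete by counting how many query-key pairs each attention pattern touches. The sketch below is only an illustration: it assumes a causal window that includes the current token, a detail the abstract does not specify, and the function names are hypothetical.

```python
def causal_mask(seq_len):
    """Full causal self-attention: position q sees every position k <= q."""
    return [[k <= q for k in range(seq_len)] for q in range(seq_len)]


def sliding_window_mask(seq_len, window):
    """Assumed SWA pattern: position q sees only the last `window` positions, itself included."""
    return [[0 <= q - k < window for k in range(seq_len)] for q in range(seq_len)]


def attended_pairs(mask):
    """Number of (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in mask)


seq_len, window = 8, 3
full_pairs = attended_pairs(causal_mask(seq_len))                 # grows like seq_len ** 2
swa_pairs = attended_pairs(sliding_window_mask(seq_len, window))  # grows like seq_len * window
```

For 8 tokens and a window of 3, full causal attention covers 36 pairs while the windowed pattern covers 21. Doubling the sequence length roughly quadruples the first count but only doubles the second. The mask also shows why dependencies longer than the window are hard for an SWA model to use directly.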

2606.11628 2026-06-11 cs.RO cs.AI 新提交

LUCID: Learning Embodiment-Agnostic Intent Models from Unstructured Human Videos for Scalable Dexterous Robot Skill Acquisition

LUCID:从非结构化人类视频学习与具身无关的意图模型以实现可扩展的灵巧机器人技能获取

Harsh Gupta, Guanya Shi, Wenzhen Yuan

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出LUCID两阶段框架,从互联网规模的非结构化人类视频学习任务意图,并在大规模并行仿真中学习机器人控制,实现零样本迁移到不同具身和场景。

详情
AI中文摘要

目前最广泛采用的机器人学习流程通常从机器人演示或结构化人类数据中学习技能,这些数据收集成本高昂且与特定具身绑定。相比之下,非结构化人类视频提供了一种可扩展的替代方案。它们包含跨物体、场景和策略的多样化操作演示,但与机器人动作没有直接联系。我们提出LUCID,一个两阶段框架,从互联网规模数据集的非结构化人类视频中学习任务意图,并在大规模并行仿真中学习机器人控制。意图模型根据当前观测以闭环方式预测短时意图(场景中下一步应该发生什么)。一个具身特定的感觉运动策略将此意图转换为机器人动作。意图接口在控制器之间共享,因此相同的意图模型可应用于不同具身,从我们的主要灵巧手到平行夹爪。我们在五个真实世界操作任务上评估LUCID:搅拌、擦拭和分拣,仅由互联网视频监督,零样本迁移到新场景和物体实例;以及推T和电缆布线,各由1小时自收集智能手机视频监督。项目页面:https://lucid-robot.github.io/。

英文摘要

The most widely adopted robot learning pipelines today learn skills from robot demonstrations or structured human data, which are expensive to collect and tied to specific embodiments. In contrast, unstructured human videos provide a scalable alternative. They contain diverse manipulation demonstrations across objects, scenes, and strategies, but are not directly connected to robot action. We propose LUCID, a two-stage framework that learns task intent from unstructured human videos drawn from internet-scale datasets and learns robot control in massively parallel simulation. The intent model predicts short-horizon intent (what should happen next in the scene) from the current observation in closed loop. An embodiment-specific sensorimotor policy converts this intent into robot actions. The intent interface is shared across controllers, so the same intent model can be applied to different embodiments, from our primary dexterous hand to a parallel-jaw gripper. We evaluate LUCID on five real-world manipulation tasks: stirring, wiping, and binning, supervised only by internet video, with zero-shot transfer to novel scenes and object instances; and push-T and cable routing, each supervised by 1 hr of self-collected smartphone video. Project page: https://lucid-robot.github.io/.

2606.11627 2026-06-11 cs.LG cs.AI 新提交

When Context Returns: Toward Robust Internalization in On-Policy Distillation

当上下文回归:面向在线策略蒸馏中的鲁棒内化

Xun Wang, Ruishuo Chen, Zhuoran Li, Yu Chen, Longbo Huang

发表机构 * IIIS, Tsinghua University(清华大学交叉信息研究院)

AI总结 针对在线策略蒸馏中上下文内化后重新引入上下文导致性能下降的问题,提出一种轻量级一致性正则化方法,通过锚定无上下文输出并惩罚偏离,有效缓解退化并提升鲁棒性。

详情
AI中文摘要

近期研究表明,在线策略蒸馏可以将特权上下文(如系统提示或任务提示)内化到学生模型中,使得推理时不再需要上下文。尽管该方法成功提升了学生的无上下文性能,我们却发现一个有趣且此前未被研究的现象:在许多设置中,向蒸馏后的学生模型重新引入原始特权上下文实际上会降低其性能,甚至对于它已经在无上下文情况下正确解决的实例也是如此。我们将此称为上下文诱导退化,并认为鲁棒内化不仅要求匹配教师的条件上下文行为,还要求在上下文重新引入时保持稳定,这一性质我们称为上下文可移除性。受此观察启发,我们提出一种轻量级一致性正则化方法,首先通过停止梯度锚定学生的无上下文输出,然后通过前向KL散度惩罚条件上下文输出偏离该锚点。这一简单添加每训练步仅需一次额外前向传播,却能有效缓解上下文诱导退化,并在许多情况下甚至提升无上下文性能。在涵盖不同领域和模型家族的12种配置中,我们的方法在大多数设置下提升了条件上下文准确率,在11/12的设置中减少了上下文诱导损害,并有效消除了响应长度膨胀。一项机制性案例研究进一步证实,上下文可移除性在表示层面得以实现,无论上下文是否存在,隐藏状态几乎保持相同。

英文摘要

Recent work has shown that on-policy distillation can internalize privileged context, such as system prompts or task hints, into a student model so that the context is no longer needed at inference time. Although this approach successfully improves the student's no-context performance, we identify an interesting and previously unstudied phenomenon: in many settings, reintroducing the original privileged context to the distilled student actually degrades its performance, even on instances it already solves correctly without context. We term this context-induced degradation and argue that robust internalization demands not only matching the teacher's context-conditioned behavior, but also remaining stable when the context is reintroduced, a property we call context removability. Motivated by this observation, we propose a lightweight consistency regularizer that first anchors the student's no-context output via stop-gradient, then penalizes the context-conditioned output for deviating from it via forward KL divergence. This simple addition requires only one extra forward pass per training step, yet it effectively mitigates context-induced degradation and, in many cases, even improves no-context performance. Across 12 configurations spanning diverse domains and model families, our method improves context-conditioned accuracy in the majority of settings, reduces context-induced harm in 11 out of 12 settings, and effectively eliminates response-length inflation. A mechanistic case study further confirms that context removability is achieved at the representation level, with hidden states remaining nearly identical regardless of whether the context is present.
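
The consistency regularizer described above can be sketched for a single next-token distribution. This is an illustration under stated assumptions, not the authors' implementation: "forward KL" is read here as KL(anchor || context-conditioned), the function names are hypothetical, and plain Python cannot express stop-gradient, so the anchor is simply treated as a constant.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward_kl(anchor_probs, probs, eps=1e-12):
    """KL(anchor || probs): expectation taken under the anchor distribution."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(anchor_probs, probs)
    )


def consistency_penalty(no_context_logits, with_context_logits):
    # Anchor: the student's no-context output. In a real training loop this
    # would be detached (stop-gradient) so only the context-conditioned
    # branch receives gradient from the penalty.
    anchor = softmax(no_context_logits)
    conditioned = softmax(with_context_logits)
    return forward_kl(anchor, conditioned)


stable = consistency_penalty([2.0, 0.0, -1.0], [2.0, 0.0, -1.0])
drifted = consistency_penalty([2.0, 0.0, -1.0], [0.0, 2.0, -1.0])
```

The penalty is zero when reintroducing the context leaves the prediction unchanged (`stable`) and grows as the context-conditioned distribution moves away from the anchor (`drifted`). In training it would be added to the distillation loss with some weight, which the abstract does not specify.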

2606.11626 2026-06-11 cs.CV 新提交

Adapting Vision-Language Models from Iconic to Inclusive for Multi-Label Recognition Without Labels

将视觉-语言模型从标志性适应到包容性:用于无标签的多标签识别

Cheng Chen, Jingyu Zhou, Yifan Zhao, Jia Li

发表机构 * State Key Laboratory of Virtual Reality Technology and Systems, SCSE & QRI, Beihang University(虚拟现实技术与系统国家重点实验室,北京航空航天大学计算机学院与青岛研究院)

AI总结 提出无监督框架,通过“切割”和“缝合”两阶段适应VLMs,实现无标签的多标签图像识别,在四个数据集上超越现有无监督方法。

详情
AI中文摘要

理解多标签图像仍然是计算机视觉中的一项挑战性任务。随着视觉-语言多模态学习的快速发展,视觉-语言模型(VLM)能够在没有标注数据的情况下实现零样本识别。然而,由于其内在设计,这些模型通常优先考虑最标志性的物体,而忽略其他上下文正例。这种内在偏差与多标签学习的性质相冲突,从而限制了它们的适用性。在这项工作中,我们提出了一个无监督框架,将VLM从标志性识别适应到包容性理解,实现无标签的多标签图像识别。我们的方法包括两个关键阶段:“切割”和“缝合”:在切割阶段,我们提出了多采样响应估计器,以防止模型仅关注单个物体。在第二个缝合阶段,引入了多目标混合适应,以调整标签使其更符合多标签分布,同时仅在一个epoch内保留原始模型的内在特性。大量实验表明,我们的框架在四个公共数据集上显著优于现有的无监督方法,甚至超过了几种有代表性的弱监督基线。这些结果证明了将预训练VLM适应于更全面的视觉理解而无需人工标注的潜力。我们的代码公开于 https://github.com/iCVTEAM/TailorCLIP。

英文摘要

Understanding multi-label images remains a challenging task in computer vision. With the rapid progress of vision-language multimodal learning, vision-language models (VLMs) enable zero-shot recognition without labeled data. However, due to their intrinsic design, these models often prioritize the most iconic object and omit other contextual positives. This intrinsic bias conflicts with the nature of multi-label learning, thereby limiting their applicability. In this work, we propose an unsupervised framework that adapts VLMs from iconic recognition toward inclusive understanding, enabling label-free multi-label image recognition. Our approach consists of two key stages, "cutting" and "sewing". In the cutting stage, we present the multi-sampling response estimator to prevent the model from concentrating on only a single object. In the sewing stage, the multi-object blend adaptation is introduced to adjust the labels to better conform to the multi-label distribution while preserving the intrinsic characteristics of the original model within only one epoch. Extensive experiments show that our framework significantly outperforms existing unsupervised approaches on four public datasets, even surpassing several representative weakly supervised baselines. These results demonstrate the potential of adapting pre-trained VLMs for more comprehensive visual understanding without manual annotations. Our code is publicly available at https://github.com/iCVTEAM/TailorCLIP.

2606.11625 2026-06-11 cs.LG 新提交

TimeRouter: Efficient and Adaptive Routing of Time-Series Foundation Models

TimeRouter: 时间序列基础模型的高效自适应路由

Kanghui Ning, Yushan Jiang, Kashif Rasul, Anderson Schneider, Yuriy Nevmyvaka, Dongjin Song

发表机构 * University of Connecticut(康涅狄格大学) Salesforce AI Research JP Morgan AI Research(摩根大通人工智能研究院)

AI总结 提出TimeRouter框架,通过轻量判别路由、选择性门控和集成回退实现时间序列基础模型的自适应选择,无需LLM推理,在GIFT-EVAL榜单取得最优性能。

详情
AI中文摘要

时间序列基础模型(TSFMs)作为新兴智能时间序列系统中的预测专家越来越受到探索。然而,TSFMs表现出异质性归纳偏差,且没有单一模型能在所有预测场景中持续占优,使得专家选择成为关键挑战。现有系统通常将此决策委托给基于LLM的控制器,导致大量推理开销。我们提出TimeRouter,一种高效路由框架,通过轻量判别路由、选择性门控和集成回退,利用预训练TSFM池的经验互补性。具体而言,TimeRouter结合了学习路由头、选择性门控和集成回退,在推理时无需调用LLM即可实现自适应专家选择。TimeRouter在GIFT-EVAL榜单上取得了最先进性能,LB MASE为0.6765。除了基准性能,我们的消融研究为TSFM路由设计提供了经验见解,强调了池组成和选择性门控的重要性。综合来看,这些结果使TimeRouter成为未来基于基础模型池的智能时间序列系统的模块化轻量路由层。我们的代码见 https://github.com/UConn-DSIS/TimeRouter。

英文摘要

Time-series foundation models (TSFMs) are increasingly explored as predictive experts within emerging agentic time-series systems. However, TSFMs exhibit heterogeneous inductive biases, and no single model consistently dominates across forecasting regimes, making expert selection a critical challenge. Existing systems often delegate this decision to LLM-based controllers, incurring substantial inference overhead. We present TimeRouter, an efficient routing framework that leverages empirical complementarity across a pool of pretrained TSFMs through lightweight discriminative routing, selective gating, and ensemble fallback. Concretely, TimeRouter combines a learned routing head, a selective gate, and an ensemble fallback, enabling adaptive expert selection without invoking an LLM at inference time. TimeRouter achieves state-of-the-art performance on the GIFT-EVAL leaderboard, with an LB MASE of 0.6765. Beyond benchmark performance, our ablation studies provide empirical insights into TSFM routing design, highlighting the importance of pool composition and selective gating. Taken together, these results position TimeRouter as a modular and lightweight routing layer for future agentic time-series systems built upon foundation-model pools. Our code is available at https://github.com/UConn-DSIS/TimeRouter.
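
The interplay of a routing head, a selective gate, and an ensemble fallback can be sketched as a simple decision rule. The rule below is an assumption made for illustration (a confidence threshold on the router's top probability, and an unweighted mean as the fallback); the abstract does not describe how TimeRouter actually implements either component, and `route_forecast` and `gate_threshold` are hypothetical names.

```python
def route_forecast(router_probs, expert_forecasts, gate_threshold=0.6):
    """Pick one expert when the router is confident, otherwise average all experts.

    router_probs: one probability per expert, e.g. from a learned routing head.
    expert_forecasts: one forecast (list of floats) per expert, equal horizons.
    Returns (forecast, chosen_expert_index or None when the fallback was used).
    """
    best = max(range(len(router_probs)), key=lambda i: router_probs[i])
    # Selective gate (assumed form): trust the router only above a confidence threshold.
    if router_probs[best] >= gate_threshold:
        return list(expert_forecasts[best]), best
    # Ensemble fallback (assumed form): unweighted mean over the expert pool.
    horizon = len(expert_forecasts[0])
    mean = [
        sum(forecast[t] for forecast in expert_forecasts) / len(expert_forecasts)
        for t in range(horizon)
    ]
    return mean, None


forecasts = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
confident = route_forecast([0.8, 0.1, 0.1], forecasts)
uncertain = route_forecast([0.4, 0.3, 0.3], forecasts)
```

Here `confident` returns the first expert's forecast, while `uncertain` falls back to the pool average. No language model is called at inference time in either branch.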

2606.11619 2026-06-11 cs.CV 新提交

Precision-Aware Illumination-Disentangled Vision Transformer for Spacecraft 6D Pose Estimation

精度感知光照解耦视觉Transformer用于航天器6D姿态估计

Zongwu Xie, Yifan Yang, Yonglong Zhang, Guanghu Xie, Yang Liu, Shuo Zhang

发表机构 * School of Mechatronics Engineering, Harbin Institute of Technology(哈尔滨工业大学机电工程学院)

AI总结 提出PAID-ViT,通过光照解耦、可靠性感知令牌聚合和掩码监督,在光照变化和反射干扰下实现鲁棒的航天器6D姿态估计。

Comments 11 pages, 7 figures

详情
AI中文摘要

视觉传感器为航天器近距离操作提供了轻量级解决方案,但在光照变化、镜面反射、阴影、弱纹理和背景干扰下,单目航天器6D姿态估计仍然困难。这些因素使局部视觉证据在空间上不可靠,并可能破坏姿态回归的稳定性。本文提出了一种精度感知光照解耦视觉Transformer(PAID-ViT),用于鲁棒的航天器姿态估计。该模型将姿态相关的结构令牌与光照敏感的外观令牌分离,在姿态聚合前估计补丁可靠性,并使用前景掩码监督以保留轮廓线索。一个无参数的几何恢复模块将归一化裁剪坐标、对数深度和连续6D旋转表示转换为相机坐标系下的旋转和平移。在SPEED+ V2(本研究使用的SPEED+验证/光箱/太阳灯评估配置)上的实验表明,PAID-ViT减少了平移误差,并在具有挑战性的太阳灯域中提高了鲁棒性,而消融研究支持了光照解耦、可靠性感知令牌聚合、掩码监督和训练侧正则化的互补作用。

英文摘要

Vision sensors provide a lightweight solution for spacecraft proximity operations, but monocular spacecraft 6D pose estimation remains difficult under illumination variation, specular reflection, shadowing, weak texture, and background interference. These factors make local visual evidence spatially unreliable and can destabilize pose regression. This article proposes a Precision-Aware Illumination-Disentangled Vision Transformer (PAID-ViT) for robust spacecraft pose estimation. The proposed model separates pose-relevant structure tokens from illumination-sensitive appearance tokens, estimates patch reliability before pose aggregation, and uses foreground mask supervision to preserve silhouette cues. A parameter-free geometric recovery module converts normalized crop coordinates, log-depth, and a continuous 6D rotation representation into camera-frame rotation and translation. Experiments on SPEED+ V2, the SPEED+ validation/lightbox/sunlamp evaluation configuration used in this study, suggest that PAID-ViT reduces translation error and improves robustness in the challenging sunlamp domain, while ablation studies support the complementary roles of illumination disentanglement, reliability-aware token aggregation, mask supervision, and training-side regularization.
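
Part of the geometric recovery step can be illustrated without any learned parameters. The sketch below assumes the widely used Gram-Schmidt construction for the continuous 6D rotation representation (two 3-vectors orthonormalized into the first two columns of the rotation matrix) and a plain exponential for log-depth. The abstract does not confirm that PAID-ViT uses exactly this mapping, and recovering the full translation would also need camera intrinsics and crop geometry that are not given here.

```python
import math


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def rotation_from_6d(rep):
    """Map a 6-vector to a 3x3 rotation matrix via Gram-Schmidt (assumed convention)."""
    a1, a2 = rep[:3], rep[3:]
    b1 = _normalize(a1)
    # Remove the component of a2 along b1, then normalize.
    along = _dot(b1, a2)
    b2 = _normalize([x - along * y for x, y in zip(a2, b1)])
    # The third axis follows from the right-hand rule.
    b3 = _cross(b1, b2)
    # b1, b2, b3 become the columns of the matrix.
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


def depth_from_log(log_depth):
    """Invert a log-depth prediction; always yields a positive depth."""
    return math.exp(log_depth)


R = rotation_from_6d([2.0, 0.0, 0.0, 1.0, 3.0, 0.0])
z = depth_from_log(0.0)
```

The example input is neither unit-length nor orthogonal, yet `R` comes out as the identity matrix. This is the appeal of the representation: any pair of non-degenerate, non-parallel 3-vectors maps to a valid rotation.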

2606.11615 2026-06-11 cs.CV cs.CR cs.LG 新提交

Adv-TGD: Adversarial Text-Guided Diffusion for Face Recognition Impersonation Attacks

Adv-TGD:面向人脸识别冒充攻击的对抗性文本引导扩散

Omid Ahmadieh, Nima Karimian

发表机构 * University of South Florida, Bellini College of Artificial Intelligence, Cybersecurity and Computing(南佛罗里达大学贝利尼人工智能、网络安全与计算学院)

AI总结 提出Adv-TGD框架,利用Stable Diffusion和LoRA微调生成逼真对抗人脸,在保持视觉质量的同时实现高成功率身份冒充攻击,平均ASR达85.90%。

详情
AI中文摘要

人脸识别(FR)技术的广泛普及引发了严重的隐私担忧,因为面部数据可能在未经同意的情况下被利用。为了解决这一挑战,我们提出了Adv-TGD,一个生成式对抗攻击框架,能够合成逼真的人脸,冒充目标身份并欺骗人脸识别系统。基于Stable Diffusion,Adv-TGD对每个样本进行LoRA微调,以简洁的文本提示为条件,生成自然但具有对抗性操控的身份。与传统的身份攻击方法不同,我们的方法在单步去噪过程中为每个源-目标对优化轻量级交叉注意力适配器。潜在混合受到面部局部热图掩码的约束,以确保空间精确的身份操控,同时保留非敏感区域。我们引入了一个复合目标,结合了掩码epsilon-MSE重建、FR嵌入空间中的阈值化身份差异、方向特征对齐和源相似性抑制,以平衡对抗攻击和视觉真实性。可选地,LLaVA生成的属性提示增强了细粒度语义细节,而不会重新引入身份线索。在黑盒评估协议下,Adv-TGD在IR152、IRSE50、MobileFace和FaceNet上平均攻击成功率(ASR)达到85.90%,超过语义SOTA基线Adv-CPG +6.25个百分点、基于扩散的化妆方法DiffAIM +3个百分点以及基于噪声的P3-Mask +16个百分点。尽管攻击效果强劲,Adv-TGD仍保持了高视觉保真度(PSNR = 27.15 dB,SSIM = 0.981)。此外,我们通过成功将其扩展到野外数据集(LADN)、通用对象分类(ImageNet)和基于Transformer的扩散模型(FLUX.1),展示了我们框架的灵活性。

英文摘要

The widespread adoption of face recognition (FR) technologies raises serious privacy concerns, as facial data can be exploited without consent. To address this challenge, we propose Adv-TGD, a generative adversarial attack framework that synthesizes photorealistic faces capable of impersonating target identities and deceiving face recognition systems. Built upon Stable Diffusion, Adv-TGD performs per-sample LoRA fine-tuning conditioned on concise textual prompts to generate natural yet adversarially manipulated identities. Unlike conventional identity-attack approaches, our method optimizes lightweight cross-attention adapters for each source-target pair within a single-step denoising process. Latent blending is constrained by a face-local heatmap mask to ensure spatially precise identity manipulation while preserving non-sensitive regions. We introduce a composite objective that integrates masked epsilon-MSE reconstruction, thresholded identity divergence in FR embedding space, directional feature alignment, and source-similarity suppression to balance adversarial strength and visual realism. Optionally, LLaVA-generated attribute prompts enhance fine-grained semantic details without reintroducing identity cues. Under the black-box evaluation protocol, Adv-TGD attains an average attack success rate (ASR) of 85.90% across IR152, IRSE50, MobileFace, and FaceNet, surpassing the semantic SOTA baseline Adv-CPG by +6.25 points, the diffusion-based makeup method DiffAIM by +3 points, and the noise-based P3-Mask by +16 points. Despite its strong attack efficacy, Adv-TGD preserves high visual fidelity (PSNR = 27.15 dB, SSIM = 0.981). Furthermore, we demonstrate the flexibility of our framework by successfully extending it to in-the-wild datasets (LADN), general object classification (ImageNet), and transformer-based diffusion models (FLUX.1).

2606.11614 2026-06-11 cs.LG cs.AI cs.CV 新提交

Information-Theoretic Decomposition for Multimodal Interaction Learning

多模态交互学习的信息论分解

Zequn Yang, Yake Wei, Haotian Ni, Zhihao Xu, Di Hu

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China, Beijing(中国人民大学高瓴人工智能学院,北京) Beijing Key Laboratory of Research on Large Models(北京大模型研究关键实验室) Engineering Research Center of Next-Generation Intelligent Search and Recommendation(下一代智能搜索与推荐工程研究中心) Beihang University, Beijing(北京航空航天大学,北京) Gaotu Techedu Inc.(高途教育科技公司)

AI总结 提出基于信息论的多模态交互分解方法DMIL,通过变分分解架构和微调策略学习样本特定的冗余、独特和协同交互,提升多模态学习性能。

Comments Accepted to CVPR 2026

详情
AI中文摘要

多模态学习依赖于捕获跨模态的冗余、独特和协同信息,这些信息共同构成多模态交互。一个关键但尚未充分探索的挑战是,这些隐式交互在不同样本间动态变化。在这项工作中,我们首次进行了系统的信息论分析,强调了学习这些动态的、样本特定的交互对于有效多模态学习的重要性。我们的分析进一步揭示了传统范式在学习这些不同交互类型方面的缺陷:模态集成方法难以捕获协同,而联合学习范式往往未能充分利用冗余信息。这突显了对一种能够基于每个样本自适应地从不同交互类型中学习的方法的需求。为此,我们提出了基于分解的多模态交互学习(DMIL),一种显式建模并学习样本特定交互的新范式。首先,我们设计了一个变分分解架构来分离组成交互组件。其次,我们采用了一种新的学习策略,在微调过程中利用这些显式交互组件来实现全面的交互学习。跨不同任务和架构的大量实验表明,DMIL通过适应整体的样本特定交互,始终实现了优越的性能。我们的框架灵活且广泛适用,建立了一个以交互为中心的多模态学习范式。代码可在以下网址获取:https://github.com/GeWu-Lab/DMIL。

英文摘要

Multimodal learning hinges on capturing redundant, unique, and synergistic information across modalities, which collectively constitute multimodal interactions. A critical yet underexplored challenge is that these implicit interactions vary dynamically across samples. In this work, we present the first systematic, information-theoretic analysis highlighting why learning these dynamic, sample-specific interactions is critical for effective multimodal learning. Our analysis further reveals deficits in conventional paradigms at learning these distinct interaction types: modality ensemble approaches struggle to capture synergy, while joint learning paradigms often under-utilize redundant information. This highlights the need for an approach that can adaptively learn from different interaction types on a per-sample basis. To this end, we propose Decomposition-based Multimodal Interaction Learning (DMIL), a novel paradigm that explicitly models and learns from sample-specific interactions. First, we design a variational decomposition architecture to isolate the constituent interaction components. Second, we employ a new learning strategy that leverages these explicit interaction components in a fine-tuning process to achieve comprehensive interaction learning. Extensive experiments across diverse tasks and architectures demonstrate that DMIL consistently achieves superior performance by adapting to holistic sample-specific interactions. Our framework is flexible and broadly applicable, establishing an interaction-centric paradigm for multimodal learning. The code is available at https://github.com/GeWu-Lab/DMIL.

2606.11611 2026-06-11 cs.SD 新提交

SARA: A Dual-Stream VAE for High-Fidelity Speech Generation via Integrating Semantic and Acoustic Representations

SARA: 一种通过整合语义和声学表示实现高保真语音生成的双流VAE

Peijie Chen, Wenhao Guan, Weijie Wu, Kaidi Wang, Daiyu Huang, Zhuanling Zha, Junbo Li, Jun Fang, Qingyang Hong, Lin Li

发表机构 * School of Informatics, Xiamen University(厦门大学信息学院) School of Electronic Science and Engineering, Xiamen University(厦门大学电子科学与技术学院) DiDi Global Inc.(滴滴全球股份有限公司)

AI总结 提出SARA双流VAE,融合冻结的SSL语义锚点和残差声学编码器,解决语音分词器中声学与语义的权衡,实现高保真重建和零样本TTS的自然合成。

Comments Accepted to Interspeech 2026

详情
AI中文摘要

零样本文本转语音(TTS)依赖于鲁棒的语音表示。然而,当前的语音分词器面临一个基本权衡:声学编解码器保留高保真音频但缺乏语言约束,导致生成过程中出现内容错误;而来自自监督学习(SSL)模型的语义标记确保精确的文本对齐,但丢弃了一些声学信息。为了弥合这一差距,我们提出了SARA,一种双流VAE,它直接将冻结的SSL语义锚点与专用的残差声学编码器融合。这有效缓解了困境,创建了一个高效且紧凑的潜在空间,而无需依赖复杂的正则化器。SARA在重建质量上优于强基线。此外,在下游零样本TTS任务中,它产生了高度自然且富有表现力的合成质量,即使在加速推理下也保持稳健的生成性能,在合成速度和计算成本之间提供了有利的权衡。

英文摘要

Zero-shot text-to-speech (TTS) relies on robust speech representations. However, current speech tokenizers face a fundamental trade-off: acoustic codecs preserve high-fidelity audio but lack linguistic constraints, causing content errors during generation, whereas semantic tokens from self-supervised learning (SSL) models ensure precise text alignment but discard some acoustic information. To bridge this gap, we propose SARA, a dual-stream VAE that directly fuses a frozen SSL semantic anchor with a dedicated residual acoustic encoder. This effectively mitigates the dilemma, creating an efficient and compact latent space without relying on complex regularizers. SARA achieves superior reconstruction quality over strong baselines. Furthermore, in downstream zero-shot TTS tasks, it yields highly natural and expressive synthesis quality, and maintains robust generation performance even under accelerated inference, offering a favorable trade-off between synthesis speed and computational cost.

2606.11609 2026-06-11 cs.CL 新提交

Multi-Agent Reasoning with Adaptive Worker Allocation for Stance Detection

基于自适应工人分配的多智能体推理用于立场检测

Meysam Sabbaghan, Arman Zareian Jahromi, Doina Caragea

发表机构 * Kansas State University(堪萨斯州立大学)

AI总结 提出一种Manager-Worker多智能体框架,通过自适应分配工人智能体进行推理级合成,而非标签级投票,在隐式和上下文依赖的立场检测上显著提升性能。

详情
AI中文摘要

立场检测需要识别作者对目标的态度,通常来自简短文本,其中立场是隐含的、间接的或修辞性的。尽管大型语言模型(LLM)在此任务上表现强劲,但当多种解释可能成立时,单次提示可能脆弱。现有的聚合策略,如多数投票或自一致性,通过组合标签来提高鲁棒性,但丢弃了解决冲突解释所需的中间推理。我们提出了一种用于立场检测的自适应工人分配多智能体推理框架,将聚合从标签级投票转变为推理级合成。该框架采用Manager-Worker架构,其中Manager根据输入复杂度自适应地分配可变数量的Worker智能体。每个Worker从不同角度分析输入,并生成仅推理的解释而不输出立场标签;然后Manager综合这些解释以产生最终预测。我们在SemEval-2016、P-Stance和COVID-19 Stance上使用Llama、Mistral和Gemini评估了所提出的框架。结果表明,该框架在隐式和上下文依赖的立场案例上取得了最大增益,在COVID-19上达到86.07 Macro-F1,在SemEval-2016上达到82.90,同时在更显式的立场数据集(如P-Stance)上保持竞争力。这些发现表明,当仅凭表面线索无法可靠推断立场时,自适应推理级聚合最为有益。

英文摘要

Stance detection requires identifying an author's position toward a target, often from short-form texts where stance is implicit, indirect, or rhetorically framed. Although large language models (LLMs) achieve strong performance on this task, single-pass prompting can be brittle when multiple interpretations are plausible. Existing aggregation strategies, such as majority voting or self-consistency, improve robustness by combining labels, but they discard the intermediate reasoning needed to resolve conflicting interpretations. We introduce a multi-agent reasoning framework with adaptive worker allocation for stance detection that shifts aggregation from label-level voting to reasoning-level synthesis. The framework employs a Manager-Worker architecture in which a Manager adaptively allocates a variable number of Worker agents based on input complexity. Each Worker analyzes the input from a distinct perspective and produces a reasoning-only explanation without emitting a stance label; the Manager then synthesizes these explanations to produce the final prediction. We evaluate the proposed framework on SemEval-2016, P-Stance, and COVID-19 Stance using Llama, Mistral, and Gemini. Results show that the framework yields the largest gains on implicit and context-dependent stance cases, achieving 86.07 Macro-F1 on COVID-19 and 82.90 on SemEval-2016, while remaining competitive on more explicit stance datasets such as P-Stance. These findings suggest that adaptive reasoning-level aggregation is most beneficial when stance cannot be reliably inferred from surface cues alone.

2606.11606 2026-06-11 cs.CV 新提交

Frozen Foundation-Model Embeddings Discard Small-Lesion Signal in Chest Radiography: Implications for Pre-Deployment Evaluation

冻结的基础模型嵌入在胸部X光检查中丢弃小病灶信号:对部署前评估的启示

Raajitha Muthyala, Zhenan Yin, Alekhya Jilla, Frank Li, Theo Dapamede, Bardia Khosravi, Mohammadreza Chavoshi, Judy Gichoya, Saptarshi Purkayastha

发表机构 * Department of Biomedical Engineering and Informatics, Indiana University(印第安纳大学生物医学工程与信息学系) Department of Radiology and Imaging Sciences, Emory University(埃默里大学放射学与影像科学系)

AI总结 本研究系统量化了五种冻结的视觉Transformer基础模型在胸部X光检查中保留或丢失小尺度、低对比度信号的情况,发现全局聚合步骤会无声地抑制小尺度信号,但可从补丁令牌中恢复。

详情
English abstract

Frozen vision-transformer (ViT) foundation-model embeddings increasingly serve as the substrate for downstream chest-radiography (CXR) pipelines, yet where small-scale, low-contrast signal is retained or lost in the frozen forward pass has not been systematically quantified across architectures, pretraining domains, and objectives. We probed five frozen ViTs (RAD-DINO, DINOv2-B/14, DINOv3 ViT-7B, BiomedCLIP, MedSigLIP) and a frozen DINO-pretrained ResNet-50 architectural control across three large CXR cohorts (NIH-CXR14, MIMIC-CXR, Emory-CXR; aggregate pool n=492,724) and ChestX-Det10 (n=3,543; 1,462 small-lesion bounding boxes across Calcification, Nodule, Mass). Each model was evaluated with a small-scale-perturbation panel and a region-aware bounding-box-stratified probe on real lesions, comparing three pooling modes from the same forward pass: classification token (CLS), patch-mean (mean over all final-layer patch tokens), and bounding-box-restricted patch-local. On the perturbation panel, CLS embeddings sat at the chance floor (area under the ROC curve [AUC] 0.500-0.524); patch-mean was indistinguishable from CLS on iso-blur and reticular-fine cells but rose with CLS on larger directional-blur footprints, while disease AUC on globally decided tasks ranged 0.642-0.913. Patch-local probes recovered AUC ~1.0 from the same forward pass (per-model mean improvement +0.412 to +0.488); the ResNet-50 control reproduced the chance floor. On ChestX-Det10, image-level CLS classification showed within-class small-versus-large stratum gaps up to +0.243 AUC; bounding-box-level patch-local pooling on the same forward pass recovered AUC >= 0.899 on every (model x class) cell. Frozen ViT embeddings silently suppress small-scale signal at the global-aggregation step; the signal is recoverable from patch tokens conditional on a region of interest.
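The three pooling modes compared in this abstract differ only in which final-layer tokens they read. The sketch below illustrates that on a toy example in plain Python. It rests on assumptions not stated in the abstract: a token layout of `[CLS, patch_0, patch_1, ...]` with patches in row-major order and no register tokens, a 14-pixel patch size (as the "/14" in DINOv2-B/14 suggests), and hypothetical function names. In a real ViT the CLS vector is produced by the network itself; here it is simply read from position 0.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dim)]


def bbox_patch_indices(
    bbox: Tuple[int, int, int, int], grid_w: int, grid_h: int, patch_px: int
) -> List[int]:
    """Row-major indices of patches overlapped by a pixel box (x0, y0, x1, y1), x1/y1 exclusive."""
    x0, y0, x1, y1 = bbox
    col0 = max(0, x0 // patch_px)
    row0 = max(0, y0 // patch_px)
    col1 = min(grid_w - 1, (x1 - 1) // patch_px)
    row1 = min(grid_h - 1, (y1 - 1) // patch_px)
    return [r * grid_w + c for r in range(row0, row1 + 1) for c in range(col0, col1 + 1)]


def pool_three_ways(
    tokens: Sequence[Vector],
    bbox: Tuple[int, int, int, int],
    grid_w: int,
    grid_h: int,
    patch_px: int,
) -> Tuple[Vector, Vector, Vector]:
    """Return (CLS, patch-mean, patch-local) embeddings from one set of tokens."""
    cls = list(tokens[0])          # assumed: CLS token sits at position 0
    patches = tokens[1:]           # assumed: remaining tokens are the patch grid
    patch_mean = mean_pool(patches)
    local = [patches[i] for i in bbox_patch_indices(bbox, grid_w, grid_h, patch_px)]
    patch_local = mean_pool(local)
    return cls, patch_mean, patch_local


# Toy case: a 4x4 patch grid where a single patch carries the "lesion" signal.
patches = [[0.0] for _ in range(16)]
patches[5] = [1.0]
tokens = [[0.0]] + patches
cls, patch_mean, patch_local = pool_three_ways(tokens, (14, 14, 28, 28), 4, 4, 14)
print(patch_mean, patch_local)
```

In the toy case the signal in one of sixteen patches is diluted to 1/16 by the global mean, while the box-restricted mean keeps it intact, which mirrors the qualitative point of the abstract.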

2606.11605 2026-06-11 cs.LG cs.AI New submission

Physics-Distilled Neural Network enabled by Large Language Models for Manufacturing Process-Property Predictive Modeling

Ge Song, Kiarash Naghavi Khanghah, Anandkumar Patel, Rajiv Malhotra, Hongyi Xu

Affiliations: School of Mechanical, Aerospace and Manufacturing Engineering, University of Connecticut; Department of Mechanical & Aerospace Engineering, Rutgers, the State University of New Jersey

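AI summary: Proposes a knowledge distillation framework that uses large language models to extract physics priors from the literature, captures variable dependencies with a Graph-Masked Attention layer, and distills them into a lightweight student model, achieving high-accuracy prediction and real-time deployment under data scarcity.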

Comments: Under review, Journal of Computing and Information Science in Engineering

English abstract

Predicting process-property relationships in manufacturing is often challenged by high experimental costs and the limited interpretability of complex 'black-box' models. This paper proposes a novel knowledge distillation framework designed to achieve high-accuracy predictions in data-scarce scenarios. The framework integrates analytical physics priors, which are systematically extracted from scientific literature via Large Language Models, into a privileged teacher model. We employ a Graph-Masked Attention layer to capture the complex physical dependencies among input variables showing strict setpoints or a combination of static and high-frequency temporal signatures. This privileged knowledge is distilled into a lightweight student predictor for inference. The feasibility and robustness of the framework are evaluated through a comprehensive experiment across five diverse manufacturing processes. To ensure statistical reliability, given the small dataset sizes, a repeated K-fold cross-validation technique is employed to quantify model stability and generalization. Results indicate that the proposed framework consistently achieves high predictive accuracy across all evaluated domains. Most importantly, the architecture demonstrates significant fault tolerance by maintaining robust predictive performance even in scenarios where LLM-derived analytical priors are suboptimal or incomplete. Furthermore, the student predictor achieves an inference frequency exceeding 6000 Hz, which facilitates real-time edge deployment on standard industrial hardware. This work provides a scalable solution for bridging the gap between theoretical physics and real-time industrial monitoring in data-limited environments.
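The abstract does not specify how the Graph-Masked Attention layer is built. One common way to restrict attention to a dependency graph is to apply scaled dot-product attention only over the pairs that an adjacency mask allows, and the sketch below shows that generic technique in plain Python. The function name, the idea that the adjacency matrix would come from the LLM-extracted physics priors, and the single-head, unbatched form are all assumptions made for illustration, not the paper's layer.

```python
import math
from typing import List, Sequence

Vector = List[float]


def graph_masked_attention(
    queries: Sequence[Vector],
    keys: Sequence[Vector],
    values: Sequence[Vector],
    adjacency: Sequence[Sequence[int]],
) -> List[Vector]:
    """Scaled dot-product attention where variable i attends to j only if adjacency[i][j] is truthy."""
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs = []
    for i, q in enumerate(queries):
        allowed = [j for j in range(len(keys)) if adjacency[i][j]]
        if not allowed:
            raise ValueError(f"variable {i} has no permitted neighbours")
        # Scores are computed only for edges present in the graph.
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_k) for j in allowed]
        # Numerically stable softmax over the permitted neighbours.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append(
            [sum(w * values[j][k] for w, j in zip(weights, allowed)) for k in range(d_v)]
        )
    return outputs


# Two input variables; with a self-loop-only graph each keeps its own value.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [1.0, 0.0]]
v = [[2.0], [4.0]]
isolated = graph_masked_attention(q, k, v, [[1, 0], [0, 1]])
connected = graph_masked_attention(q, k, v, [[1, 1], [1, 1]])
print(isolated, connected)
```

Dropping masked pairs before the softmax is equivalent to setting their scores to negative infinity, which is how such masks are usually applied in tensor libraries.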