arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.04454 2026-06-04 cs.CL

Stepwise Reasoning Enhancement for LLMs via External Subgraph Generation

通过外部子图生成增强大语言模型的逐步推理

Xin Zhang, Yang Cao, Baoxing Wu, Kai Song, Siying Li

发表机构 * School of Information Science and Engineering, Chongqing Jiaotong University(重庆交通大学信息科学与工程学院) School of Computer Science and Technology, Chongqing University of Posts and Telecommunications(重庆邮电大学计算机科学与技术学院)

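AI summary: Proposes the SGR framework, which generates query-relevant subgraphs from a knowledge graph to guide large language models through stepwise reasoning, improving the accuracy, robustness, and interpretability of complex multi-step reasoning.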

AI中文摘要

大语言模型在自然语言生成和下游推理任务中表现出色,但在复杂多步推理中仍面临逻辑一致性、事实基础和可解释性方面的挑战。为解决这些局限,本文提出SGR,一种通过查询相关子图生成将大语言模型与外部知识图谱集成的逐步推理增强框架。给定输入问题,SGR首先提取关键实体、关系和约束以构建结构化模式,然后通过模式引导查询从知识图谱中检索紧凑子图。生成的子图提供明确的关系证据,引导语言模型进行逐步推理。此外,SGR结合了基于Cypher的直接推理与协作推理集成,允许根据模型置信度和图一致性验证和聚合来自多个推理路径的候选答案。在包括CWQ、WebQSP、GrailQA和KQA Pro的基准数据集上的实验表明,SGR在推理准确性和Hits@1性能上优于标准提示和几种知识增强基线。消融研究进一步表明,模式引导和基于Neo4j的检索对框架的有效性都至关重要。这些结果表明,动态生成的外部子图可以提高基于大语言模型的推理的准确性、鲁棒性和可解释性。

英文摘要

Large language models have shown strong performance in natural language generation and downstream reasoning tasks, but they still struggle with logical consistency, factual grounding, and interpretability in complex multi-step reasoning. To address these limitations, this paper proposes SGR, a stepwise reasoning enhancement framework that integrates large language models with external knowledge graphs through query-relevant subgraph generation. Given an input question, SGR first extracts key entities, relations, and constraints to construct a structured schema, then retrieves compact subgraphs from a knowledge graph using schema-guided querying. The generated subgraphs provide explicit relational evidence that guides the language model through step-by-step reasoning. In addition, SGR combines direct Cypher-based reasoning with collaborative reasoning integration, allowing candidate answers from multiple reasoning paths to be validated and aggregated according to both model confidence and graph consistency. Experiments on benchmark datasets including CWQ, WebQSP, GrailQA, and KQA Pro demonstrate that SGR improves reasoning accuracy and Hits@1 performance over standard prompting and several knowledge-enhanced baselines. Ablation studies further show that schema guidance and Neo4j-based retrieval are both crucial to the effectiveness of the framework. These results indicate that dynamically generated external subgraphs can improve the accuracy, robustness, and interpretability of LLM-based reasoning.
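The schema-guided retrieval step described in this abstract can be pictured with a small sketch. The abstract confirms only that SGR builds a structured schema of entities, relations, and constraints and queries Neo4j with Cypher; the `QuestionSchema` container, the `schema_to_cypher` helper, the `name` property, and the example relation types below are all hypothetical illustrations, not the authors' implementation.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class QuestionSchema:
    """Hypothetical container for what is extracted from a question."""
    entity: str                       # name of the anchor entity
    relations: list[str]              # relation types to follow, in order
    constraints: dict[str, str] = field(default_factory=dict)  # filters on the answer node


def _is_safe_identifier(text: str) -> bool:
    # Relation types and property names cannot be Cypher parameters,
    # so they are validated before being interpolated into the query.
    return text.replace("_", "").isalnum()


def schema_to_cypher(schema: QuestionSchema, limit: int = 25) -> tuple[str, dict]:
    """Build a parameterised Cypher query returning a compact path-shaped subgraph."""
    if not schema.relations:
        raise ValueError("at least one relation is required")
    pattern = "(n0 {name: $entity})"
    for i, rel in enumerate(schema.relations, start=1):
        if not _is_safe_identifier(rel):
            raise ValueError(f"unsafe relation type: {rel!r}")
        pattern += f"-[:{rel}]->(n{i})"
    last = f"n{len(schema.relations)}"
    params = {"entity": schema.entity}
    where = []
    for j, (prop, value) in enumerate(schema.constraints.items()):
        if not _is_safe_identifier(prop):
            raise ValueError(f"unsafe property name: {prop!r}")
        where.append(f"{last}.{prop} = $c{j}")
        params[f"c{j}"] = value
    query = f"MATCH p = {pattern}"
    if where:
        query += " WHERE " + " AND ".join(where)
    query += f" RETURN p LIMIT {int(limit)}"
    return query, params


# Hypothetical schema for "Which European city was the director of Inception born in?"
schema = QuestionSchema("Inception", ["DIRECTED_BY", "BORN_IN"], {"continent": "Europe"})
query, params = schema_to_cypher(schema)
print(query)
print(params)
```

Values travel as query parameters while relation types are validated and interpolated, because Cypher does not accept parameters in the relationship-type position. The `LIMIT` clause is one simple way to keep the retrieved subgraph compact.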

2606.04453 2026-06-04 cs.CV cs.LG

Radiomic Feature Selection Using Gradient Loss of Deep Neural Network for Lung Cancer Stage Detection

基于深度神经网络梯度损失的放射组学特征选择用于肺癌分期检测

Hina Shakir, Mohammad Mohatram, Javeed Hussain, Syed Rizwan Ali, Muhammad Irfan Memon

发表机构 * Department of Software Engineering, Bahria University(巴赫里亚大学软件工程系) Global College of Engineering and Technology(全球工程与技术学院) Software Engineering & Business Incubation Center, Bahria University(巴赫里亚大学软件工程与企业孵化中心)

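AI summary: Proposes the GL-RFE framework, which uses gradient sensitivity analysis of a deep neural network to recursively eliminate low-contribution features, selecting the top 15 of 106 radiomic features for early- versus advanced-stage lung cancer classification with 90.22% accuracy.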

Journal ref
J. Vis. Exp. (230), e70181, (2026)
AI中文摘要

放射组学能够从医学图像中提取定量成像生物标志物,已成为计算机辅助癌症诊断的重要工具。然而,放射组学数据集通常具有高维小样本的特点,使得特征选择成为构建可靠预测模型的关键步骤。本研究提出了一种梯度损失递归特征消除(GL-RFE)框架,该框架集成深度神经网络的梯度敏感性分析,以识别对肺癌分期检测最具影响力的放射组学特征。使用3D Slicer平台的PyRadiomics扩展从胸部计算机断层扫描(CT)中提取了总共106个放射组学特征。所提出的方法通过计算网络损失相对于输入特征的梯度来评估特征重要性,并递归消除贡献最小的特征。最终选出的前15个放射组学特征用于训练深度神经网络分类器,以区分早期和晚期肺癌。该框架在测试数据集上取得了强劲的分类性能,准确率为90.22%,精确率为90.10%,召回率为90.24%,F1分数为90.16%。可视化分析(包括相关性热图和分布图)进一步证实了特征冗余减少和类别可分性提高。与传统特征选择技术相比,GL-RFE有效捕捉了非线性特征交互并增强了模型泛化能力。所提出的协议为基于放射组学的癌症分期检测提供了一种可重复且可解释的方法,特别适用于高维小样本生物医学数据集,并在基因组学和多模态临床分析等其他领域具有潜在应用价值。

英文摘要

Radiomics enables extraction of quantitative imaging biomarkers from medical images and has become an important tool for computer-aided cancer diagnosis. However, radiomics datasets are typically high-dimensional with limited samples, making feature selection a critical step for building reliable predictive models. This study proposes a Gradient-Loss Recursive Feature Elimination (GL-RFE) framework that integrates gradient sensitivity analysis from a deep neural network to identify the most influential radiomic features for lung cancer stage detection. A total of 106 radiomic features were extracted from chest Computed Tomography (CT) scans using the PyRadiomics extension of the 3D Slicer platform. The proposed method evaluates feature importance by computing gradients of the network loss with respect to input features and recursively eliminates features with minimal contribution. The resulting top-15 radiomic features are used to train a deep neural network classifier for distinguishing early-stage and advanced-stage lung cancer. The proposed framework achieves strong classification performance, with accuracy of 90.22%, precision of 90.10%, recall of 90.24%, and F1-score of 90.16% on the test dataset. Visualization analyses, including correlation heat maps and distribution plots, further confirm reduced feature redundancy and improved class separability. Compared to conventional feature selection techniques, GL-RFE effectively captures nonlinear feature interactions and enhances model generalization. The presented protocol provides a reproducible and interpretable methodology for radiomics-based cancer stage detection and is particularly suitable for high-dimensional, small-sample biomedical datasets, with potential applications in other domains such as genomics and multimodal clinical analysis.
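The elimination loop in this abstract (score each feature by the gradient of the loss with respect to the input, drop the weakest, repeat) can be sketched in a few lines. This is an illustration under stated assumptions: a logistic-regression model stands in for the paper's deep neural network, the data are synthetic, and the function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, epochs=200, lr=0.5):
    """Fit a logistic-regression stand-in for the deep network (full-batch gradient descent)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def input_gradient_importance(X, y, w, b):
    """Mean |dLoss/dx_j| over samples; for logistic loss dLoss/dx_j = (p - y) * w_j."""
    scores = [0.0] * len(w)
    for xi, yi in zip(X, y):
        err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
        for j, wj in enumerate(w):
            scores[j] += abs(err * wj)
    return [s / len(X) for s in scores]


def gradient_rfe(X, y, keep):
    """Recursively retrain and drop the feature with the smallest input-gradient score."""
    active = list(range(len(X[0])))
    while len(active) > keep:
        Xa = [[row[j] for j in active] for row in X]
        w, b = train_logreg(Xa, y)
        scores = input_gradient_importance(Xa, y, w, b)
        worst = min(range(len(active)), key=scores.__getitem__)
        active.pop(worst)
    return active


# Synthetic data: only features 0 and 3 carry signal, the rest are noise.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(200)]
y = [1 if row[0] - row[3] > 0 else 0 for row in X]
selected = gradient_rfe(X, y, keep=2)
print(selected)
```

With a real network the gradient would come from automatic differentiation rather than the closed form used here, but the retrain, score, and eliminate loop keeps the same shape.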

2606.04451 2026-06-04 cs.LG

On Out-of-sample Embedding in UMAP

UMAP中的样本外嵌入

Mohammad Tariqul Islam, Jason W. Fleischer

发表机构 * Media Lab, Massachusetts Institute of Technology(媒体实验室,麻省理工学院) Electrical and Computer Engineering, Princeton University(电子与计算机工程,普林斯顿大学)

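AI summary: Addresses the repulsion effect that UMAP shows when adding out-of-sample points by optimizing pairwise interactions within the original k-nearest-neighbor graph, and shows that parametric UMAP yields better embeddings and naturally mitigates the effect.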

Comments 22 pages, 16 figures

AI中文摘要

邻域嵌入算法通过在低维空间中构建等价的图表示来揭示高维数据中的相关性。一种日益流行的算法是统一流形逼近与投影(UMAP),它使用代数拓扑来映射两个空间之间的距离。虽然它在许多类型的数据集上表现良好,但UMAP在将样本外点添加到现有映射时存在困难。特别是,UMAP通常将新点放置在所发现簇的周边,而不是与它们的相关邻居一起放在簇的内部。在这里,我们通过优化原始k近邻图中的成对交互来克服这种“排斥效应”。此外,我们表明参数化UMAP比非参数算法获得更好的嵌入,特别是当数据变得更复杂时(例如,医学图像)。我们还表明,当使用参数化UMAP嵌入数据时,排斥效应自然得到缓解。我们使用可信度、最近邻分类器以及分析嵌入中的吸引力和排斥力来表征不同的UMAP方法。

英文摘要

Neighbor embedding algorithms reveal correlations in high-dimensional data by constructing an equivalent graph representation in a lower-dimensional space. An increasingly popular algorithm is Uniform Manifold Approximation and Projection (UMAP), which uses algebraic topology to map distances between the two spaces. While it works well on many types of data sets, UMAP has trouble adding out-of-sample points to a pre-existing mapping. In particular, UMAP often places new points on the periphery of the found clusters, rather than in their interiors with their correlated neighbors. Here, we overcome this "repulsion effect" by optimizing pairwise interactions within the original k-nearest-neighbor graph. Moreover, we show that parameterizing UMAP obtains better embeddings than non-parametric algorithms, particularly as the data gets more complex (e.g., medical images). We also show that the repulsion effect is naturally mitigated when a parameterized UMAP is employed to embed the data. We characterize different UMAP approaches using trustworthiness, nearest neighbor classifiers, and by analyzing attractive and repulsive forces in the embeddings.

2606.04450 2026-06-04 cs.CL cs.CY

Listening to the Workforce: Measuring Construction Worker Safety Attitudes from Social Media Discourse Using LLMs

倾听劳动力:使用LLMs从社交媒体话语中测量建筑工人安全态度

Farouq Sammour, Yuxin Zhang, Zhenyu Zhang

发表机构 * Texas A&M University(德克萨斯A&M大学)

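AI summary: Proposes and validates the Construction Safety Attitude Framework (CSAF), using an LLM classifier to measure worker safety attitudes from Reddit community discourse and enabling high-accuracy multidimensional analysis.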

AI中文摘要

工人安全态度是决定建筑工地上保护措施是否被应用或规避的关键因素。然而,大规模测量安全态度一直难以实现。安全态度是多维的,因话题而异,并且在工人自己的对话中最为坦诚。本研究创建并验证了建筑安全态度框架(CSAF),该框架整合了两个组成部分:一个基于理论的结构,沿八个维度表征安全态度;以及一个用于在工人自然话语中测量这些态度的操作化编码手册。将CSAF应用于Reddit上r/Construction社区的250条帖子和评论,经过训练的编码者达到了高度一致(Krippendorff's α = 0.85)。成对提升度和条件概率证实了八个维度既相关又不同。为了将框架应用于大量话语,CSAF通过大语言模型(LLM)分类器进行操作化。在450条r/Construction贡献中,分类器再现了专家人工编码(Cohen's κ = 0.90,精确率 = 0.98,召回率 = 0.98),并且在400条r/Roofing贡献中,转移到不同行业社区后仍保持该准确率(κ = 0.89,精确率 = 0.98,召回率 = 0.97)。一项价值验证案例研究将经过验证的分类器应用于10,346条r/Roofing贡献,证明CSAF能够按安全主题区分多维态度,追踪它们随时间的变化,并追溯不利态度背后的推理。因此,本研究提供了一个理论扎实、经验验证的工具来检查安全态度,为针对不安全实践背后态度的干预措施提供了基础。

英文摘要

Worker safety attitudes are key determinants of whether protective practices are applied or bypassed on construction sites. Yet measuring them at scale has remained out of reach. Safety attitudes are multidimensional, vary across topics, and surface most candidly in workers' own conversations. This study created and validated the Construction Safety Attitude Framework (CSAF), which integrates two components: a theory-grounded structure that characterizes safety attitudes along eight dimensions, and an operational codebook for measuring them in worker naturalistic discourse. Applying CSAF to 250 posts and comments from the r/Construction community on Reddit, trained coders reached strong agreement (Krippendorff's α = 0.85). Pairwise lift and conditional probability confirmed that the eight dimensions are related yet distinct. To apply the framework across large volumes of discourse, CSAF was operationalized through a large language model (LLM) classifier. On 450 r/Construction contributions, the classifier reproduced expert human coding (Cohen's κ = 0.90, precision = 0.98, recall = 0.98), and on 400 contributions from r/Roofing it retained that accuracy after transfer to a different trade community (κ = 0.89, precision = 0.98, recall = 0.97). A proof-of-value case study then applied the validated classifier to 10,346 contributions from r/Roofing, demonstrating that CSAF can distinguish multidimensional attitudes by safety topic, track how they shift over time, and trace the reasoning behind unfavorable ones. The study therefore provides a theoretically grounded, empirically vetted instrument for examining safety attitudes, offering a basis for targeted interventions that address the attitudes underlying unsafe practices.
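The agreement statistic used to compare the classifier with human coders, Cohen's κ, corrects raw agreement for the agreement expected by chance. The sketch below computes it from two label sequences; the labels and the ten-item sample are made up for illustration and are not data from the study.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both raters pick the same class independently
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical class throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical codes from a human coder and a classifier on ten posts
human = ["fav"] * 5 + ["unfav"] * 5
model = ["fav"] * 5 + ["unfav"] * 4 + ["fav"]
kappa = cohens_kappa(human, model)
print(round(kappa, 3))  # 0.8
```

Here the raters agree on 9 of 10 items (observed 0.9) while chance agreement is 0.5, so κ = (0.9 - 0.5) / (1 - 0.5) = 0.8.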

2606.04445 2026-06-04 cs.LG cs.AI math.ST stat.TH

RowNet: A Memory Transformer for Tabular Regression

RowNet: 用于表格回归的记忆Transformer

Askat Rakhymbekov, Gulshat Muhametjanova

发表机构 * Department of Applied Mathematics and Informatics(应用数学与信息学系) Kyrgyz-Turkish Manas University(吉尔吉斯-土耳其马纳斯大学)

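AI summary: For tabular regression in real estate valuation, proposes RowNet, a retrieval-based neural architecture that predicts prices using pairwise similarity features against a memory bank, target-consistency augmentation, and a mixture-of-experts module.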

Comments Retrieval-based neural architecture for real estate valuation. Related to TabR (arXiv:2307.14338) and retrieval-augmented tabular learning

AI中文摘要

房地产估值是一个结构化回归问题,其中价格受异构特征类型、稀疏区域效应、非线性交互以及可比房产的实际逻辑影响。标准多层感知器将每一行视为孤立向量,必须仅从监督中学习局部性、尺度敏感性和类别匹配。梯度提升决策树提供了强大的表格基线,但其以特征为中心的分裂机制并未显式建模相似历史观测的检索。本文提出了RowNet,一种用于房地产每平方米价格预测的基于检索的神经网络架构。RowNet通过针对标记属性记忆库的成对相似性特征来表示查询属性。第一检索层从仅特征相似性中估计粗略目标。第二层通过目标一致性特征增强记忆比较,并使用多个学习注意力头检索互补的可比集。最终的混合专家模块结合了学习门控、残差校正、熵正则化和头多样性正则化以产生预测。

英文摘要

Real estate valuation is a structured regression problem in which prices are governed by heterogeneous feature types, sparse regional effects, nonlinear interactions, and the practical logic of comparable properties. Standard multilayer perceptrons treat each row as an isolated vector and must learn locality, scale sensitivity, and categorical matching from supervision alone. Gradient-boosted decision trees provide strong tabular baselines, but their feature-centric splitting mechanism does not explicitly model the retrieval of similar historical observations. This paper presents RowNet, a retrieval-based neural architecture for real estate price-per-square-meter prediction. RowNet represents a query property through pairwise similarity features against a memory bank of labeled properties. A first retrieval layer estimates a coarse target from feature-only similarities. A second layer augments the memory comparison with target-consistency features and uses multiple learned attention heads to retrieve complementary comparable sets. A final mixture-of-experts module combines learned gating, residual correction, entropy regularization, and head-diversity regularization to produce the prediction.
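The idea behind the first retrieval layer, a coarse estimate formed from feature-only similarity to labeled rows in a memory bank, can be illustrated with a softmax-weighted average of stored targets. This is a simplified sketch: RowNet learns its pairwise similarity features, whereas the fixed negative squared distance, the `temperature` parameter, and the toy rows below are assumptions made for illustration.

```python
import math


def coarse_retrieval_estimate(query, memory_x, memory_y, temperature=1.0):
    """Attention-style estimate: softmax over similarities, then a weighted mean of targets."""
    # Similarity is the negative squared Euclidean distance (an assumed, non-learned choice)
    sims = [
        -sum((q - m) ** 2 for q, m in zip(query, row)) / temperature
        for row in memory_x
    ]
    peak = max(sims)  # subtract the maximum for a numerically stable softmax
    weights = [math.exp(s - peak) for s in sims]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, memory_y)) / total


# Toy memory bank: [floor area in m^2, rooms] with price per square metre as the target
memory_x = [[50.0, 3.0], [52.0, 3.0], [200.0, 10.0]]
memory_y = [1000.0, 1100.0, 3000.0]
estimate = coarse_retrieval_estimate([51.0, 3.0], memory_x, memory_y)
print(estimate)  # 1050.0
```

The two comparable rows are equally close to the query and receive equal weight, while the distant row's weight underflows to zero, so the estimate is the mean of 1000 and 1100.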

2606.04442 2026-06-04 cs.CL cs.AI

MemoryDocDataSet: A Benchmark for Joint Conversational Memory and Long Document Reasoning

MemoryDocDataSet: 联合对话记忆与长文档推理的基准测试

Qiyang Xie, Jialun Wu, Xinjie He, Su Liu, Shuai Xiao, Zhiyuan Lin, Weikai Zhou

发表机构 * Northeastern University(东北大学) Johns Hopkins University(约翰霍普金斯大学) Columbia University(哥伦比亚大学) Independent Researcher(独立研究者)

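AI summary: Proposes MemoryDocDataSet, a synthetic benchmark of 50 micro-worlds and 1,000 QA pairs that evaluates whether systems can handle multi-session conversation history and long-document reading comprehension together; 75.1% of questions require hybrid reasoning (navigating conversation history first, then extracting the answer from a document), and experiments reveal a clear joint-retrieval gap.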

Comments 17 pages, 2 figures, 8 tables. Submitted for peer review

AI中文摘要

人工智能系统越来越需要结合两种要求很高的能力:导航多轮对话历史和在长文档中进行深度阅读理解。然而,现有的基准测试没有同时评估这两者。我们引入了MemoryDocDataSet,一个包含50个微世界和1000个QA对的合成基准,其中每个实例包含3-5个人物角色、一个跨越数月活动的时间事件图、3-5篇真实长文档(每篇20,000-50,000个token,来自Caselaw Access Project)、基于这些文档的多轮对话,以及跨越五个推理类别的20个问答对。其定义特征是混合源标签:需要系统首先导航对话历史以确定哪个文档相关,然后从该文档中提取答案的问题。混合问题占数据集的75.1%。通过使用LLM作为评判者的提示敏感性自一致性分析来表征数据集质量,在所有50个微世界中得到中位数Cohen's $κ= 0.634$。我们评估了六种基线配置,涵盖截断上下文、长上下文LLM、检索增强生成(RAG)和记忆系统。最佳基线(RAG-Both)在整体F1上达到0.358,在混合问题上达到0.342。仅文档检索(RAG-Doc)在混合问题上降至0.267,尽管在仅文档问题上达到0.453,这显示了明显的联合检索差距,激励了统一对话记忆与长文档导航的架构。我们发布了数据集、生成流水线和所有基线实现。

英文摘要

AI systems increasingly need to combine two demanding capabilities: navigating multi-session conversation history and performing deep reading comprehension within long documents. Yet no existing benchmark evaluates both simultaneously. We introduce MemoryDocDataSet, a synthetic benchmark of 50 micro-worlds and 1,000 QA pairs in which each instance comprises 3-5 personas, a temporal event graph spanning months of activity, 3-5 real long documents (20,000-50,000 tokens each sourced from the Caselaw Access Project), multi-session conversations grounded on those documents, and 20 question-answer pairs across five reasoning categories. The defining feature is the Hybrid source tag: questions requiring a system to first navigate conversation history to identify which document is relevant, then extract the answer from within that document. Hybrid questions account for 75.1% of the dataset. Dataset quality is characterised through a prompt-sensitivity self-consistency analysis using LLM-as-judge, yielding a median Cohen's $κ= 0.634$ across all 50 micro-worlds. We evaluate six baseline configurations spanning truncated context, long-context LLMs, retrieval-augmented generation (RAG), and memory systems. The best baseline (RAG-Both) achieves 0.358 overall F1 and 0.342 on Hybrid. Document-only retrieval (RAG-Doc) collapses to 0.267 on Hybrid despite achieving 0.453 on Doc-only questions, demonstrating a clear joint-retrieval gap that motivates architectures unifying conversational memory with long-document navigation. We release the dataset, generation pipeline, and all baseline implementations.

2606.04438 2026-06-04 cs.LG cs.AI

LoopMoE: Unifying Iterative Computation with Mixture-of-Experts for Language Modeling

LoopMoE:统一迭代计算与混合专家模型用于语言建模

Wenkai Chen, Tianshu Li, Wenyong Huang, Yichun Yin, Lifeng Shang, Chengwei Qin

发表机构 * Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Huawei Technologies Co.,Ltd.(华为技术有限公司)

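AI summary: Proposes LoopMoE, which uses iteration-adaptive layer normalization (IterAdaLN) and a capacity-balancing strategy so that a looped MoE language model outperforms a standard MoE on multiple benchmarks under identical parameters and FLOPs.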

AI中文摘要

混合专家模型(MoE)和循环架构分别沿着参数容量和有效深度两个正交维度扩展模型。然而,主流的循环架构依赖于密集主干,将参数数量与每个token的FLOPs耦合,这使得在匹配预算下无法隔离迭代计算的效果。为此,我们提出了LoopMoE,一种循环MoE语言模型,通过两种设计将稀疏路由与迭代权重共享计算相结合。第一种是IterAdaLN,它通过联合以迭代索引和每个token隐藏状态为条件的调制信号来解决权重共享对称性。第二种是一种容量平衡策略,恢复了经过良好调整的非循环参考模型的注意力到FFN活跃参数比率。这些设计共同实现了在相同总参数、每个token FLOPs和活跃子层比率下,循环MoE与标准MoE的首次严格受控的头对头评估。在3B规模下,LoopMoE在9个下游基准测试中的8个上优于标准MoE,平均提升超过1个点。在9B规模下,LoopMoE继续优于匹配的标准MoE,表明架构优势在更大规模下持续存在。我们的工作建立了稀疏性和循环性的受控综合,并为循环语言模型指明了一个有前景的方向。

英文摘要

Mixture-of-Experts (MoE) and looped architectures scale models along two orthogonal axes, namely parameter capacity and effective depth. However, mainstream looped architectures rely on dense backbones that couple parameter count with per-token FLOPs, which makes it impossible to isolate the effect of iterative computation under matched budgets. To this end, we present LoopMoE, a looped MoE language model that integrates sparse routing with iterative weight-shared computation through two designs. The first is IterAdaLN, which resolves weight-sharing symmetry via a modulation signal jointly conditioned on the iteration index and the per-token hidden state. The second is a capacity-balancing strategy that recovers the attention-to-FFN active parameter ratio of well-tuned non-looped references. Together, these designs enable the first strictly controlled, head-to-head evaluation of a looped MoE against a Vanilla MoE under identical total parameters, per-token FLOPs, and active sublayer ratios. At the 3B scale, LoopMoE outperforms the Vanilla MoE on 8 of 9 downstream benchmarks with an average improvement exceeding 1 point. At the 9B scale, LoopMoE continues to outperform the matched Vanilla MoE, indicating that the architectural gain persists at larger scale. Our work establishes a controlled synthesis of sparsity and recurrence, and suggests a promising direction for looped language models.
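One way to picture a normalization layer whose modulation depends on both the iteration index and the per-token hidden state is the sketch below. The abstract gives only that high-level description of IterAdaLN, so everything concrete here is an assumption: the additive conditioning, the linear maps to scale and shift, the random initialization, and the class name `IterAdaLNSketch`.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Plain layer normalisation over one token's hidden vector (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


class IterAdaLNSketch:
    """Hypothetical iteration-aware adaptive layer norm; not the paper's exact design."""

    def __init__(self, dim, max_iters, seed=0):
        rng = random.Random(seed)

        def matrix(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.iter_emb = matrix(max_iters, dim)  # one embedding per loop iteration
        self.w_scale = matrix(dim, dim)         # maps the condition to a per-channel scale
        self.w_shift = matrix(dim, dim)         # maps the condition to a per-channel shift

    def __call__(self, hidden, iteration):
        # Condition jointly on the token's hidden state and the iteration index
        cond = [h + e for h, e in zip(hidden, self.iter_emb[iteration])]
        scale = [sum(w * c for w, c in zip(row, cond)) for row in self.w_scale]
        shift = [sum(w * c for w, c in zip(row, cond)) for row in self.w_shift]
        normed = layer_norm(hidden)
        return [n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)]


norm = IterAdaLNSketch(dim=4, max_iters=3)
token = [0.5, -1.0, 2.0, 0.0]
first_pass = norm(token, iteration=0)
second_pass = norm(token, iteration=1)
print(first_pass != second_pass)  # the same token is modulated differently per iteration
```

Because the block weights are shared across loop iterations, a signal of this kind is what lets the same parameters behave differently on each pass, which is the symmetry problem the abstract refers to.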

2606.04437 2026-06-04 cs.CV

INTACT: Ego-Guided Typed Sparse Evidence Retrieval for Heterogeneous Collaborative Perception

INTACT: 面向异构协同感知的自我引导类型化稀疏证据检索

Chen Li, Shengrong Yuan, Jialong Zuo, Xinzhong Zhu, Nong Sang, Changxin Gao

发表机构 * National Key Laboratory of Multispectral Information Intelligent Processing Technology, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology(多谱信息智能处理国家级重点实验室,人工智能与自动化学院,华中科技大学) Zhejiang Normal University(浙江师范大学)

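AI summary: Proposes the INTACT framework, a sparse retrieval mechanism in which the ego vehicle issues typed evidence queries and collaborators return only local evidence, enabling zero-training insertion of new agents in heterogeneous collaborative perception, with efficient performance on OPV2V-H and DAIR-V2X.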

AI中文摘要

协同感知通过跨智能体共享信息扩展自动驾驶车辆的感知范围,但异构传感器和感知模型使得中间特征融合难以大规模部署。现有的异构协同方法通常遵循先翻译后融合的范式:协作方特征必须在对齐、适应或投影到自我兼容空间后才能融合。这种特征兼容性契约提升了固定系统的性能,但将部署与协作方特定的适配耦合,使得新加入的异构智能体集成成本高昂。为解决这一问题,我们提出INTACT,一种面向异构协同感知的自我引导类型化稀疏证据检索框架。INTACT不翻译整个协作方特征图,而是让自我车辆发出类型化证据查询,表达可疑目标和证据不足的区域。协作方仅在查询位置返回局部证据,自我车辆通过稀疏的每查询路由选择有用响应,并通过门控残差回写注入。这将兼容性要求从全局特征图可解释性转变为在自我车辆查询下的局部、类型化响应可比性,实现了零训练的异构插入协议:自我接口训练一次,新协作方通过检查点合并加入。在模拟和真实世界的异构协同感知基准上的大量实验验证了INTACT的有效性和可部署性。在OPV2V-H上,INTACT仅用0.52M额外参数和18.0 $\log_2$通信量达到80.1 AP70,相当于密集特征传输的约16倍压缩。在DAIR-V2X上,INTACT在具有挑战性的真实条件下达到43.8 AP50。

英文摘要

Collaborative perception extends the perceptual range of autonomous vehicles by sharing information across agents, but heterogeneous sensors and perception models make intermediate feature fusion difficult to deploy at scale. Existing heterogeneous collaboration methods typically follow a translation-first paradigm: collaborator features must be aligned, adapted, or projected into an ego-compatible space before fusion. Such feature-compatibility contracts improve fixed-system performance, but they couple deployment to collaborator-specific adaptation and make newly joined heterogeneous agents costly to integrate. To address this gap, we propose INTACT, an ego-guided typed sparse evidence retrieval framework for heterogeneous collaborative perception. Instead of translating an entire collaborator feature map, INTACT lets the ego vehicle issue typed evidence queries that express suspected objects and evidence-deficient regions. Collaborators respond only with local evidence at queried locations, and the ego selects useful responses through sparse per-query routing and injects them through gated residual write-back. This changes the compatibility requirement from global feature-map interpretability to local, typed response comparability under ego-issued queries, enabling a zero-training heterogeneous insertion protocol in which the ego interface is trained once and new collaborators join through checkpoint merging. Extensive experiments on simulated and real-world heterogeneous collaborative perception benchmarks validate the effectiveness and deployability of INTACT. On OPV2V-H, INTACT achieves 80.1 AP70 with only 0.52M additional parameters and 18.0 $\log_2$ communication volume, corresponding to about 16$\times$ compression over dense feature transmission. On DAIR-V2X, INTACT achieves 43.8 AP50 under challenging real-world conditions.
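The two communication figures in this abstract are linked by simple arithmetic in log space. The short calculation below derives the dense-transmission volume they imply, assuming the 16x ratio is measured in the same unit as the reported log2 volume (the abstract does not state the unit).

```python
import math

sparse_log2_volume = 18.0  # reported for INTACT on OPV2V-H
compression_ratio = 16.0   # "about 16x" over dense feature transmission

# Dividing a volume by r subtracts log2(r) from its log2 value, so the dense
# baseline implied by these two numbers sits log2(16) = 4 units higher.
implied_dense_log2_volume = sparse_log2_volume + math.log2(compression_ratio)
print(implied_dense_log2_volume)  # 22.0
```

The value 22.0 is inferred from the abstract's two numbers rather than quoted from the paper, and the "about" in the compression ratio means it is approximate.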

2606.04436 2026-06-04 cs.CV cs.RO

3DThinkVLA: Endowing Vision-Language-Action Models with Latent 3D Priors via 3D-Thinking-Guided Co-training

3DThinkVLA:通过3D思维引导的协同训练赋予视觉-语言-动作模型潜在3D先验

Jiaxin Shi, Xidong Zhang, Fucai Zhu, Zhe Li, Siyu Zhu, Weihao Yuan

发表机构 * Shanghai Jiao Tong University(上海交通大学) Harbin Institute of Technology(哈尔滨工业大学) Nanyang Technological University(南洋理工大学) Fudan University(复旦大学) Nanjing University(南京大学) Daimon Robotics(戴盟机器人) Great Bay University(大湾区大学)

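AI summary: Proposes a 3D-thinking-guided co-training framework that disentangles 3D geometry perception from spatial reasoning and injects them at different feature hierarchies, so that VLA models reason about 3D space implicitly during action prediction without 3D sensors or external models, reaching state-of-the-art performance on multiple benchmarks.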

AI中文摘要

我们提出了一种3D思维引导的协同训练框架,使视觉-语言-动作(VLA)模型能够在动作预测过程中隐式地进行3D空间推理。我们的核心见解是,3D几何感知和3D空间推理是两种不同的能力,可以在不同的特征层次上解耦并注入。在训练过程中,三个紧密耦合的组件主要在潜在空间中协同工作:(1)为了获得几何先验,一个潜在3D几何感知模块将中间视觉特征与3D基础模型对齐,在不修改VLM骨干架构的情况下获取低级几何线索。(2)作为补充,一个在线3D推理蒸馏模块通过共享推理锚点令牌缓解提示引发的推理差距。在3D VLM协同训练期间,该锚点作为第一个输出令牌发出,以稳健地编码空间先验。在VLA训练期间,它作为插入在任务指令和动作指令之间的输入令牌,将高级空间思维从显式教师推理提示转移到学生动作提示,无需链式思维文本生成。(3)然后,这些解耦的几何和推理特征通过空间增强的动作集成统一起来,该集成将它们作为分层空间条件共同注入到动作查询令牌中,以防止动作捷径。在部署时,我们的方法仅保留其轻量级适配器以执行隐式3D推理,丢弃用于监督的3D基础模型和教师分支。因此,它纯粹在2D图像上运行,无需3D传感器、外部模型或显式文本生成,同时防止预训练VLM的灾难性遗忘,在LIBERO、LIBERO-PLUS、SimplerEnv和真实世界操作任务上实现了最先进的性能。

英文摘要

We propose a 3D-thinking-guided co-training framework that enables vision-language-action (VLA) models to perform 3D spatial reasoning implicitly during action prediction. Our core insight is that 3D geometry perception and 3D spatial reasoning are distinct capabilities that can be disentangled and injected at different feature hierarchies. During training, three tightly coupled components work in concert primarily within the latent space: (1) To gain geometric priors, a latent 3D geometry perception module aligns intermediate visual features with a 3D foundation model, acquiring low-level geometric cues without architectural modifications to the VLM backbone. (2) Complementing this, an online 3D reasoning distillation module mitigates the prompt-induced reasoning gap via a shared reasoning anchor token. During 3D VLM co-training, this anchor is emitted as the first output token to robustly encode spatial priors. During VLA training, it serves as an input token inserted between the task and action instructions, transferring high-level spatial thinking from explicit teacher reasoning prompts to student action prompts without chain-of-thought text generation. (3) These disentangled geometric and reasoning features are then united by a spatially augmented action integration, which jointly injects them into the action-query tokens as hierarchical spatial conditions to prevent action shortcuts. At deployment, our method retains only its lightweight adapters to perform implicit 3D reasoning, discarding the 3D foundation model and the teacher branch used for supervision. Consequently, it operates purely on 2D images without 3D sensors, external models, or explicit text generation while preventing catastrophic forgetting of the pretrained VLM, achieving state-of-the-art performance on LIBERO, LIBERO-PLUS, SimplerEnv, and real-world manipulation tasks.

2606.04435 2026-06-04 cs.AI cs.CL cs.CR cs.IR

Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation

智能体RAG中的级联幻觉:用于检测和缓解的CHARM框架

Saroj Mishra

发表机构 * University of North Dakota(北达科他大学)

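AI summary: Targets cascading hallucination in multi-step agentic RAG pipelines, where early errors propagate and amplify into incorrect final outputs, and proposes the CHARM framework, which detects and mitigates it through four components (stage-level fact verification, cross-stage consistency tracking, confidence propagation monitoring, and cascade resolution triggering), achieving an 89.4% cascade detection rate and an 82.1% reduction in error propagation across multiple datasets.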

AI中文摘要

多步智能体检索增强生成(RAG)管道在复杂推理任务中展现出显著能力,但仍然容易受到一类现有幻觉检测机制系统性遗漏的故障影响:级联幻觉,即在管道早期阶段引入的错误会通过连续推理步骤传播并放大,产生自信但事实不正确的最终输出。为解决这一漏洞,我们将级联幻觉形式化为智能体RAG系统中的一种独特故障模式,提出四种级联模式的分类法,并引入CHARM(级联幻觉感知解析与缓解),一种用于检测和中断多步推理管道中错误传播的架构框架。CHARM包含四个组件——阶段级事实验证、跨阶段一致性跟踪、置信度传播监控和级联解析触发——它们与标准智能体RAG管道并行运行,无需替换架构。我们在HotpotQA、MuSiQue、2WikiMultiHopQA以及一个自定义对抗数据集上,在LangChain智能体管道配置下评估CHARM,实现了89.4%的级联检测率、5.3%的假阳性率、每阶段平均215 ms ± 18 ms的延迟开销,以及82.1%的错误传播减少,而输出级检测器仅为18.5%。组件消融实验证实每个检测模块对整体级联覆盖都有显著贡献。CHARM与人在回路监督框架集成,为生产级智能体AI部署提供完整的可靠性和治理栈。

英文摘要

Multi-step agentic retrieval-augmented generation (RAG) pipelines have demonstrated significant capability for complex reasoning tasks, yet remain vulnerable to a class of failure that existing hallucination detection mechanisms systematically miss: cascading hallucination, where errors introduced at early pipeline stages propagate and amplify across successive reasoning steps, producing confident but factually incorrect final outputs. To address this vulnerability, we formalize cascading hallucination as a distinct failure mode in agentic RAG systems, present a four-type taxonomy of cascade patterns, and introduce CHARM (Cascading Hallucination Aware Resolution and Mitigation), an architectural framework for detecting and interrupting error propagation in multi-step reasoning pipelines. CHARM comprises four components - stage-level fact verification, cross-stage consistency tracking, confidence propagation monitoring, and cascade resolution triggering - that operate alongside standard agentic RAG pipelines without requiring architectural replacement. We evaluate CHARM on HotpotQA, MuSiQue, 2WikiMultiHopQA, and a custom adversarial dataset across LangChain agentic pipeline configurations, achieving an 89.4% cascade detection rate with a 5.3% false positive rate and 215 ms +/- 18 ms average latency overhead per stage, achieving an error propagation reduction of 82.1%, compared to 18.5% for output-level detectors. Component ablations confirm that each detection module contributes meaningfully to overall cascade coverage. CHARM integrates with human-in-the-loop oversight frameworks to provide a complete reliability and governance stack for production agentic AI deployment.
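Confidence propagation monitoring can be pictured as tracking how per-stage confidence compounds along the pipeline and raising a trigger once it falls too low. The abstract does not specify CHARM's rule, so the product-of-confidences model (which treats stage errors as independent), the `floor` threshold, and the function name below are illustrative assumptions.

```python
def monitor_confidence(stage_confidences, floor=0.5):
    """Return (index of the stage that triggers resolution or None, cumulative confidence)."""
    cumulative = 1.0
    for index, confidence in enumerate(stage_confidences):
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidences must lie in [0, 1]")
        # Assumption: stage errors are independent, so confidences multiply
        cumulative *= confidence
        if cumulative < floor:
            return index, cumulative
    return None, cumulative


# Each stage looks acceptable in isolation, but the chain as a whole does not
trigger_stage, chain_confidence = monitor_confidence([0.95, 0.90, 0.60, 0.90])
print(trigger_stage, round(chain_confidence, 4))  # 3 0.4617
```

This shows why a detector that inspects only the final output can miss the problem: no single stage in the example falls below 0.6, yet the compounded confidence drops below the threshold by the fourth stage.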

2606.04434 2026-06-04 cs.CV cs.LG

Hyper-ICL: Attention Calibration with Hyperbolic Anchor Distillation for Multimodal In-Context Learning

Hyper-ICL:基于双曲锚点蒸馏的注意力校准用于多模态上下文学习

Niloufar Alipour Talemi, Hossein Kashiani, Fatemeh Afghah

发表机构 * University of California, Berkeley(加州大学伯克利分校)

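AI summary: Proposes Hyper-ICL, a lightweight training-based framework that calibrates attention distributions with a low-rank logit adapter and a hyperbolic anchor distillation loss, reconstructing demonstration effects without in-context examples at inference time and improving the accuracy and stability of multimodal in-context learning.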

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

AI中文摘要

多模态上下文学习已成为多模态大语言模型的一种实用推理范式,其中少量交错的图像-文本上下文示例条件化模型以解决新任务。尽管灵活,但多模态ICL由于对演示格式、顺序和内容的敏感性,导致高推理延迟和不稳定性。为解决这些限制,我们提出Hyper-ICL,一种轻量级、基于训练的无演示多模态ICL框架,它直接在推理时无需ICD即可重建演示效果。Hyper-ICL学习一个参数高效的低秩logit级适配器,校准注意力分布以更好地匹配演示诱导的注意力重分布。为捕捉演示影响如何随查询变化,我们引入查询自适应调制机制,根据当前查询在层和头之间自适应控制token级的干预强度。最后,我们提出逐层双曲锚点蒸馏损失,通过Lorentz测地距离将中间学生特征对齐到演示条件化的教师。该损失鼓励学生重建ICD诱导的演示-查询关系。在六个不同多模态基准(包括VQAv2、OK-VQA和COCO Caption)上的大量实验表明,Hyper-ICL在准确性和稳定性上持续优于普通ICL和现有最先进方法。

英文摘要

Multimodal In-Context Learning (ICL) has emerged as a practical inference paradigm for Multimodal Large Language Models, where a small set of interleaved image-text In-Context Demonstrations (ICDs) conditions the model to solve new tasks. Despite its flexibility, multimodal ICL incurs high inference latency and suffers from instability due to sensitivity to demonstration formatting, ordering, and content. To address these limitations, we propose Hyper-ICL, a lightweight, training-based framework for demonstration-free multimodal ICL that reconstructs demonstration effects directly without requiring ICDs at inference time. Hyper-ICL learns a parameter-efficient low-rank logit-level adapter that calibrates attention distributions to better match demonstration-induced attention redistribution. To capture how demonstration influence varies across queries, we introduce a query-adaptive modulation mechanism that adaptively controls intervention strength at token level across layers and heads based on the current query. Finally, we propose a layer-wise hyperbolic anchor distillation loss that aligns intermediate student features to a demonstration-conditioned teacher via Lorentz geodesic distance. This loss encourages the student to reconstruct the demonstration-query relationships induced by ICDs. Extensive experiments across six different multimodal benchmarks (including VQAv2, OK-VQA, and COCO Caption) demonstrate that Hyper-ICL consistently improves accuracy and stability over vanilla ICL and existing state-of-the-art methods.
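The Lorentz geodesic distance named in this abstract has a closed form on the hyperboloid model of hyperbolic space: d(x, y) = arccosh(-<x, y>_L), where <x, y>_L = -x0*y0 + sum_i xi*yi. The sketch below computes it at curvature -1. How Hyper-ICL maps features onto the hyperboloid is not stated in the abstract, so the `lift_to_hyperboloid` helper is an assumed illustration.

```python
import math


def lift_to_hyperboloid(v):
    """Place a Euclidean vector on the unit hyperboloid by solving for the time coordinate."""
    return [math.sqrt(1.0 + sum(c * c for c in v))] + list(v)


def lorentz_inner(x, y):
    """Lorentzian inner product: negative on the time coordinate, positive on the rest."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def lorentz_distance(x, y):
    """Geodesic distance on the hyperboloid with curvature -1."""
    # Clamp to 1.0 because rounding can push the argument slightly below acosh's domain
    return math.acosh(max(1.0, -lorentz_inner(x, y)))


origin = lift_to_hyperboloid([0.0])
point = lift_to_hyperboloid([math.sinh(1.0)])  # lifts to (cosh 1, sinh 1)
distance = lorentz_distance(origin, point)
print(round(distance, 6))  # 1.0
```

The example uses the identity cosh^2(t) - sinh^2(t) = 1: the point (cosh 1, sinh 1) lies at geodesic distance exactly 1 from the origin (1, 0), which gives a quick sanity check for the formula.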

2606.04433 2026-06-04 cs.CV cs.CL cs.LG

Stateful Visual Encoders for Vision-Language Models

用于视觉-语言模型的有状态视觉编码器

Zirui Wang, Junwei Yu, Adam Yala, David M. Chan, Joseph E. Gonzalez, Trevor Darrell

发表机构 * University of California, Berkeley(加州大学伯克利分校)

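AI summary: Proposes a stateful visual encoder that conditions each visual representation on prior visual features, strengthening the ability of vision-language models to perceive visual changes in multi-image, multi-turn interaction, with consistent gains on cross-image spatial aggregation, multi-object visual differencing, and trajectory behavior cloning.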

Comments Project page: https://statefulvisualencoders.github.io/

AI中文摘要

视觉-语言模型(VLM)越来越多地用于多图像、多轮代理场景,其中决策依赖于视觉变化。然而,在现有的开源权重VLM中,视觉比较仅在语言模型内部进行,而视觉编码器本身是无状态的:每个图像独立编码,无法访问先前的视觉上下文。因此,微小但任务关键的变化可能在语言模型有机会比较之前被减弱,尤其是当这些变化不影响场景的高层语义时。我们引入了一种有状态视觉编码器,它将每个视觉表示条件于先前的视觉特征。在监督微调下,配备有状态编码器的VLM在涉及跨图像空间聚合、多目标视觉差异和视觉轨迹行为克隆的控制任务上取得了一致的改进。这些改进在输入分辨率、语言模型大小和VLM骨干网络上保持一致。最后,我们在实际任务上验证了我们的模型,包括纵向放射学、细粒度图像比较和遥感,其中有状态编码器一致地改进了通用VLM基线,并在选定领域可以匹配或超越专用模型。项目页面:https://statefulvisualencoders.github.io/

英文摘要

Vision-language models (VLMs) are increasingly used in multi-image, multi-turn agentic settings where decisions depend on visual changes. However, in existing open-weight VLMs, visual comparisons happen only inside the language model, while the visual encoder itself remains stateless: each image is encoded independently, without access to the prior visual context. As a result, small but task-critical changes may be attenuated before the language model has a chance to compare them, especially when those changes do not affect the high-level semantics of the scene. We introduce a Stateful Visual Encoder, which conditions each visual representation on prior visual features. Under supervised finetuning, VLMs equipped with stateful encoders achieve consistent improvements on controlled tasks involving cross-image spatial aggregation, multi-object visual differencing, and visual trajectory behavior cloning. These improvements are consistent across input resolutions, language model sizes, and VLM backbones. Finally, we validate our model on real-world tasks, including longitudinal radiology, fine-grained image comparison, and remote sensing, where stateful encoders consistently improve generalist VLM baselines and can match or surpass specialized models in selected domains. Project page: https://statefulvisualencoders.github.io/

2606.04432 2026-06-04 cs.CV

DSA: Dynamic Step Allocation for Fast Autoregressive Video Generation

DSA: 用于快速自回归视频生成的动态步数分配

Thanh-Tung Le, Yunhan Zhao, Menglei Chai, Zhengyang Shen, Zhe Cao, Danhang Tang, Xiaohui Xie, Deying Kong

发表机构 * University of California, Irvine(加州大学尔湾分校) Google(谷歌) Google DeepMind(谷歌DeepMind)

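AI summary: Proposes DSA, a confidence-guided adaptive computation framework in which a lightweight confidence head dynamically adjusts the number of denoising steps per frame, achieving real-time autoregressive video generation while preserving video quality.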

Comments CVPR2026, Findings Track

AI中文摘要

视频扩散变压器已实现最先进的视觉质量,但其高推理成本仍是实时应用的主要瓶颈。最近的蒸馏框架产生了具有降低延迟的自回归视频扩散模型,但这些模型仍然每帧使用固定数量的去噪步数,在可预测帧上浪费计算,而在具有挑战性的帧上精炼不足。我们提出了DSA,一种用于自回归视频扩散的置信度引导自适应计算框架。DSA引入了一个轻量级置信度头,在分布匹配蒸馏目标下与生成器联合训练,以估计每帧去噪可靠性。在推理时,该置信度信号动态调整扩散步数:简单帧提前终止以提高速度,而复杂帧获得额外精炼。我们的方法不需要额外的视频数据、启发式规则,且几乎不需要架构修改。实验表明,DSA实现了实时自回归视频生成,在H100 GPU上达到22.63 FPS,延迟低于1秒,同时与最近的自回归和双向视频扩散模型相比,保持了有竞争力或更优的VBench质量。我们的结果表明,置信度引导的自适应采样为交互式视频生成提供了一条有效且实用的路径。

英文摘要

Video diffusion transformers have achieved state-of-the-art visual quality, but their high inference cost remains a major bottleneck for real-time applications. Recent distillation frameworks produce autoregressive video diffusion models with reduced latency, yet these models still use a fixed number of denoising steps per frame, wasting computation on predictable frames and under-refining challenging ones. We present DSA, a confidence-guided adaptive computation framework for AR video diffusion. DSA introduces a lightweight confidence head, trained jointly with the generator under a distribution-matching distillation objective, to estimate per-frame denoising reliability. At inference, this confidence signal dynamically adjusts the number of diffusion steps: simple frames terminate early for speed, while complex frames receive additional refinement. Our method requires no extra video data, no heuristics, and little architectural modification. Experiments show that DSA achieves real-time autoregressive video generation, reaching 22.63 FPS with sub-second latency on H100 GPUs, while maintaining competitive or superior VBench quality compared to recent autoregressive and bidirectional video diffusion models. Our results demonstrate that confidence-guided adaptive sampling provides an effective and practical path toward interactive video generation.
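The inference-time behaviour described here, where a confidence signal decides whether a frame stops early or receives more refinement, reduces to a short control loop. In the sketch below, the threshold rule, the step bounds, and the toy denoiser and confidence head (scalar stand-ins for learned networks) are all assumptions made for illustration.

```python
def denoise_with_dynamic_steps(frame, denoise_step, confidence_head,
                               min_steps=1, max_steps=4, threshold=0.9):
    """Run denoising steps until the confidence head is satisfied or the budget is spent."""
    steps_used = 0
    for step in range(max_steps):
        frame = denoise_step(frame, step)
        steps_used += 1
        # Stop early once the estimated reliability of the current frame is high enough
        if steps_used >= min_steps and confidence_head(frame) >= threshold:
            break
    return frame, steps_used


# Toy stand-ins: a "frame" is just its residual noise level in [0, 1]
def toy_denoise_step(noise_level, step):
    return noise_level / 2.0  # each step halves the remaining noise


def toy_confidence_head(noise_level):
    return 1.0 - noise_level  # less residual noise means higher confidence


_, easy_steps = denoise_with_dynamic_steps(0.15, toy_denoise_step, toy_confidence_head)
_, hard_steps = denoise_with_dynamic_steps(0.90, toy_denoise_step, toy_confidence_head)
print(easy_steps, hard_steps)  # 1 4
```

The easy frame reaches confidence 0.925 after one step and stops, while the hard frame needs all four steps (confidence 0.55, 0.775, 0.8875, then 0.94375), which is the compute reallocation the abstract describes.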

2606.04427 2026-06-04 cs.CV

Implicit Fuzzification via Bounded Noise Injection for Robust Medical Image Segmentation

通过有界噪声注入的隐式模糊化用于鲁棒医学图像分割

Bisheng Tang, Zhangfeng Ma, Chuchu Zhai, Feng Dong, Yaoqun Wu, Ammar Oad, Yifei Peng

发表机构 * Xinshao County People’s Hospital, Shaoyang(新邵县人民医院,邵阳)

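AI summary: Proposes NoiseUNet, which injects bounded perturbations into skip connections to regularize cross-scale feature fusion, implicitly achieving fuzzification and improving accuracy and boundary fidelity in medical image segmentation.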

Comments Under review

详情
AI中文摘要

图像分割仍然受到由采样引起的信息损失和像素级标注固有不确定性导致的边界模糊性的根本限制。尽管U-Net等编码器-解码器架构取得了强劲性能,但它们常常产生过度自信的预测,无法捕捉过渡区域的模糊性。为解决此问题,我们提出NoiseUNet,一个简单而有效的框架,它在跳跃连接中注入有界扰动以正则化跨尺度特征融合。该机制增强了对局部特征变化的鲁棒性,并促进了边界感知表示。理论上,该扰动诱导出隐式模糊化效果,产生软性的、数据驱动的隶属度,无需显式模糊建模。我们进一步引入ThyR,一个具有固有模糊边界的真实世界甲状腺超声数据集。实验表明,NoiseUNet在分割精度和边界保真度上均有一致提升。

英文摘要

Image segmentation remains fundamentally limited by boundary ambiguity arising from sampling-induced information loss and inherent uncertainty in pixel-wise labeling. Although encoder-decoder architectures such as U-Net achieve strong performance, they often produce overconfident predictions that fail to capture transition-region ambiguity. To address this issue, we propose NoiseUNet, a simple yet effective framework that injects bounded perturbations into skip connections to regularize cross-scale feature fusion. This mechanism enforces robustness to local feature variations and promotes boundary-aware representations. Theoretically, the perturbation induces an implicit fuzzification effect, yielding soft, data-driven memberships without requiring explicit fuzzy modeling. We further introduce ThyR, a real-world thyroid ultrasound dataset with inherently ambiguous boundaries. Experiments demonstrate that NoiseUNet consistently improves both segmentation accuracy and boundary fidelity.
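Bounded noise injection on a skip connection might look like the following minimal sketch. The abstract does not state the noise distribution or the bound, so the uniform distribution, `epsilon`, and the flat list standing in for a feature map are all assumptions made for illustration.

```python
import random


def noisy_skip(features, epsilon=0.1, rng=None, training=True):
    """Add noise drawn uniformly from [-epsilon, epsilon] to each skip feature.

    At inference time (training=False) the features pass through unchanged.
    """
    if not training:
        return list(features)
    rng = rng or random.Random()
    return [f + rng.uniform(-epsilon, epsilon) for f in features]


skip = [0.2, -1.0, 3.5]  # hypothetical encoder features
perturbed = noisy_skip(skip, epsilon=0.1, rng=random.Random(0))
max_shift = max(abs(p - s) for p, s in zip(perturbed, skip))
```

Because the perturbation is bounded, `max_shift` can never exceed `epsilon`, which is the property that distinguishes this from unbounded Gaussian noise.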

2606.04423 2026-06-04 cs.LG stat.ML

The price of multi-group transductive learning

多组转导学习的代价

Noah Bergam, Samuel Deng, Daniel Hsu

发表机构 * Columbia University(哥伦比亚大学)

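AI summary: Shows that in the transductive setting a multi-group learner may incur a multiplicative penalty in its error rate on some group relative to the single-group setting, and that this penalty can grow linearly with the number of groups, up to roughly the square root of the sample size; in the statistical setting, by contrast, the penalty is at most logarithmic in the sample size and independent of the number of groups.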

详情
AI中文摘要

我们证明,在转导设置中,每个多组学习器在某些组上的错误率相对于单组设置可能产生乘法惩罚,并且惩罚可以随组数线性增加,最多达到样本量的平方根。这与类似(组可实现)统计设置中的最优多组学习器形成鲜明对比,后者的惩罚始终至多是样本量的对数,且与组数无关。

英文摘要

We show that every multi-group learner in the transductive setting may incur a multiplicative penalty in its error rate on some group relative to the error rate achievable in the single-group setting, and that the penalty can increase linearly with the number of groups, up to roughly the square root of the sample size. This stands in stark contrast to optimal multi-group learners in an analogous (group-realizable) statistical setting, where the penalty is always at most logarithmic in the sample size and independent of the number of groups.

2606.04420 2026-06-04 cs.LG

Loss-Conditional PINNs for Parametric PDE Families

损失条件PINNs用于参数化PDE族

Anna Lazareva, Alexander Tarakanov

发表机构 * Faculty of Computer Science, HSE University(高等经济大学计算机科学系) VK and HSE University(VK与高等经济大学)

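AI summary: Proposes LC-PINN, which feeds randomly sampled loss weights or physical coefficients to the network as inputs so that a single model parameterises an entire PDE family without paired data, matching or improving on PINN baselines retrained per weight across several parametric equations.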

详情
AI中文摘要

物理信息神经网络(PINNs)通过最小化残差、边界、初始和数据损失的加权组合来逼近常微分方程和偏微分方程的解。其性能通常受损失权重选择的主导:不良的权重可能导致训练退化到满足一个物理约束而忽略另一个的解。现有方法选择或调整单一组好的权重。我们采取不同的观点:不是调整一个权重向量,而是在训练期间探索整个权重空间。我们引入LC-PINN,它将Dosovitskiy和Djolonga(2020)的损失条件训练适应于PDE残差设置:条件向量(损失权重或标量物理系数)被视为网络输入,并在每个优化步骤从简单先验中采样。这将PINN训练转变为学习由该向量索引的连续解族,无需求解器生成的配对数据。因此,LC-PINN介于经典PINNs和算子学习之间:它保持完全物理信息,但在参数族上摊销训练。我们的贡献不在于损失条件构造本身,而在于将其扩展到PINNs,将损失权重和参数系数机制统一在一个架构下(损失权重使用拼接,系数使用FiLM),以及一个固定求积的L-BFGS完成协议,使得参数系数机制可训练。我们给出了条件最优的lambda不变性结果,并在参数化Helmholtz、Schrödinger、粘性Burgers和Buckley-Leverett方程上研究了LC-PINN。单个LC-PINN在一个模型中参数化整个族,同时匹配或改进逐权重重训练的PINN基线,总成本相对于逐实例重训练具有有利的摊销。

英文摘要

Physics-informed neural networks (PINNs) approximate solutions of ODEs and PDEs by minimising a weighted combination of residual, boundary, initial, and data losses. Their performance is often dominated by the choice of loss weights: a poor weighting can drive training to a degenerate solution in which one physical constraint is satisfied while another is ignored. Existing methods select or adapt a single good set of weights. We take a different view: instead of tuning one weight vector, we explore the entire weight space during training. We introduce LC-PINN, which adapts the loss-conditional training of Dosovitskiy and Djolonga (2020) to the PDE-residual setting: the conditioning vector (either the loss weights or a scalar physical coefficient) is treated as a network input and sampled from a simple prior at every optimisation step. This turns PINN training into learning a continuous family of solutions indexed by that vector, with no solver-generated paired data. LC-PINN thus lies between classical PINNs and operator learning: it stays fully physics-informed but amortises training over a parametric family. Our contribution is not the loss-conditional construction itself, but its extension to PINNs, the unification of the loss-weight and parametric-coefficient regimes under one architecture (concatenation for loss weights, FiLM for coefficients), and a fixed-quadrature L-BFGS finishing protocol that makes the parametric-coefficient regime trainable. We give a lambda-invariance result for the conditional optimum and study LC-PINN on parametric Helmholtz, Schrödinger, viscous Burgers, and Buckley-Leverett equations. A single LC-PINN matches or improves on retrained per-weight PINN baselines while parameterising the full family in one model, at a total cost that amortises favourably against per-instance retraining.

2606.04418 2026-06-04 cs.SD cs.CL eess.AS

CleanCodec: Efficient and Robust Speech Tokenization via Perceptually Guided Encoding

CleanCodec:通过感知引导编码实现高效且鲁棒的语音分词化

Eugene Kwek, Feng Liu, Rui Zhang, Wenpeng Yin

发表机构 * Pennsylvania State University(宾夕法尼亚州立大学) Drexel University(德雷塞尔大学)

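AI summary: Proposes CleanCodec, a denoising audio codec that uses selective information-bottleneck encoding to keep only perceptually important features; at 12.5 tokens per second it achieves state-of-the-art tokenization efficiency, substantially outperforms existing codecs in speaker similarity and speech intelligibility, and yields up to 17x faster inference on downstream tasks.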

详情
AI中文摘要

神经音频编解码器是语音处理流程的关键组件,将音频压缩为离散令牌以供下游建模。然而,现有编解码器难以平衡重建质量与令牌效率,常常以牺牲语言和声学有意义内容为代价,编码背景噪声和录音伪影等感知无关信息。我们将音频分词化重新定义为选择性信息瓶颈问题,并提出CleanCodec,一种去噪音频编解码器,学习仅编码感知重要特征并丢弃不可感知信息。在每秒仅12.5个令牌的情况下,CleanCodec实现了最先进的分词效率,在说话人相似度和语音可懂度上大幅优于现有编解码器。在下游文本到语音和语音转换任务上的评估进一步展示了改进的性能和高达17倍的推理加速,凸显了显著的效率提升。

英文摘要

Neural audio codecs are a key component of speech processing pipelines, compressing audio into discrete tokens for downstream modeling. However, existing codecs struggle to balance reconstruction quality with token efficiency, often encoding perceptually irrelevant information such as background noise and recording artifacts at the expense of linguistically and acoustically meaningful content. We reframe audio tokenization as a selective information bottleneck problem and propose CleanCodec, a denoising audio codec which learns to encode only perceptually important features and discard imperceptible information. At just 12.5 tokens per second, CleanCodec achieves state-of-the-art tokenization efficiency, substantially outperforming existing codecs in speaker similarity and speech intelligibility. Evaluations on downstream text-to-speech and voice conversion tasks further demonstrate improved performance and up to 17x faster inference, highlighting significant efficiency gains.

2606.04414 2026-06-04 cs.CV cs.MM

Motion-Guided Causal Disentanglement for Robust Multi-View Cine Cardiac MRI Diagnosis

运动引导的因果解耦用于鲁棒多视角电影心脏MRI诊断

Chuankai Xu, Cristiane De Carvalho Singulane, Mohammad Abuannadi, Stephen Chandler, Jeremy Slivnick, Karolina Zareba, Jane Cao, Vidya Nadig, Fabio Fernandes, Seth Uretsky, Diego Perez de Arenaza, Amit Patel, Jianxin Xie

发表机构 * University of Virginia(弗吉尼亚大学) University of Chicago(芝加哥大学) Ohio State University(俄亥俄州立大学) St. Francis Hospital & Heart Center(圣弗朗西斯医院及心脏中心) Hartford HealthCare(哈特福德医疗集团) Instituto do Coração (InCor)(心脏研究所(InCor)) Atlantic Health System(大西洋健康系统) Sociedad Italiana de Beneficencia (Hospital Italiano)(意大利慈善协会(意大利医院))

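AI summary: Proposes MoViD, a motion-guided view-disease disentanglement framework that separates view-specific from disease-discriminative features using dual-branch supervised contrastive learning and a gradient-reversal adversarial constraint, uses an annotation-free temporal motion feature to localize the heart region, and mitigates class imbalance; it outperforms standard Transformer baselines on a venous thrombosis dataset and two public benchmarks.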

详情
AI中文摘要

多视角心脏磁共振成像提供互补的解剖信息,广泛用于无创疾病评估。最近的基于Transformer的模型在CMR分析中展示了强大的表示学习能力;然而,它们通常学习统一的潜在嵌入,将视角特定的解剖变异与疾病相关特征纠缠在一起。这种纠缠使分类器偏向结构属性而非视角不变的病理模式。在低数据场景下,特别是对于代表性不足的心脏疾病,这个问题更加严重,因为有限的样本增加了对捷径学习和视角相关决策边界的敏感性。为了解决这个问题,我们提出了一个基于ViT-MAE骨干的运动引导视角-疾病解耦框架MoViD。该模型通过双分支监督对比目标和梯度反转对抗约束,明确地将潜在表示分解为视角特定和疾病判别组件,最小化疾病信息泄漏到视角嵌入中。此外,引入了一种从帧间差异图导出的无标注时间运动特征,用于定位跳动的心脏区域并抑制背景伪影。对比损失中融入了焦点重加权机制以缓解类别不平衡。我们在一个私有临床静脉血栓数据集和两个公开基准(M&Ms, M&Ms2)上评估了该框架。在疾病分类和心脏分割任务中,我们的方法始终优于标准Transformer基线,并与大规模预训练基础模型相比表现出竞争性能,验证了结构解耦在医学图像分析中的有效性。

英文摘要

Multi-view cardiac magnetic resonance (CMR) imaging provides complementary anatomical information and is widely used for noninvasive disease assessment. Recent transformer-based models have demonstrated strong representation learning capabilities for CMR analysis; however, they typically learn unified latent embeddings that entangle view-specific anatomical variations with disease-related features. Such entanglement biases classifiers toward structural attributes rather than view-invariant pathological patterns. This issue is exacerbated in low-data regimes, particularly for underrepresented cardiac conditions, where limited samples increase the susceptibility to shortcut learning and view-dependent decision boundaries. To address this, we propose MoViD, a Motion-Guided View-Disease Disentanglement framework built upon a ViT-MAE backbone. The model explicitly factorizes latent representations into view-specific and disease-discriminative components using dual-branch supervised contrastive objectives and a gradient-reversal adversarial constraint that minimizes disease leakage into the view embedding. Additionally, an annotation-free temporal motion feature, derived from inter-frame difference maps, is introduced to localize the beating heart region and suppress background artifacts. A focal reweighting mechanism is incorporated into the contrastive loss to mitigate class imbalance. We evaluate the framework on a private clinical venous thrombosis dataset and two public benchmarks (M&Ms, M&Ms2). Across disease classification and cardiac segmentation tasks, our approach consistently outperforms standard transformer baselines and demonstrates competitive performance against large-scale pretrained foundation models, validating the efficacy of structural disentanglement in medical image analysis.

2606.04413 2026-06-04 cs.LG

(Mis)generalization of Helpful-only Fine-tuning

仅帮助性微调的(错误)泛化

Mohammad Omar Khursheed, Baram Sosis, Fabien Roger

发表机构 * Anthropic Fellows Program(Anthropic研究员计划) Anthropic

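AI summary: Studies how helpful-only training (training models not to refuse user intent) generalizes, finding emergent misalignment, residual refusals, poor steerability, sycophancy, and incoherent character in existing models, and shows that synthetic document fine-tuning and adding character-related questions can mitigate these problems.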

Comments 77 pages, 50 figures

详情
AI中文摘要

仅帮助性模型,即训练为始终遵循用户意图的模型,对于危险能力评估和AI研发中拒绝行为会成为障碍的其他领域具有价值。关于仅帮助性训练的泛化特性知之甚少:仅帮助性模型比其无害对应模型拒绝更少,但先前工作未研究其对齐的其他维度。我们研究了现有仅帮助性模型的缺陷。我们发现一些模型表现出涌现错位,其他模型存在残余拒绝行为,大多数模型显示出低可操控性、谄媚和不连贯角色。我们表明简单的反拒绝训练可能导致其中许多问题。然而,这些问题并非仅帮助性训练的必要后果:我们证明合成文档微调和向SFT及RL添加角色相关问题可以缓解它们。

英文摘要

Helpful-only models, that is, models that are trained to always follow user intent, are valuable for dangerous capability evaluations and other areas of AI R&D where refusals would be an obstacle. Little is known about the generalization properties of helpful-only training: helpful-only models refuse less than their harmless counterparts, but previous work has not studied other dimensions of their alignment. We study the shortcomings of existing helpful-only models. We find that some show emergent misalignment, others have residual refusal behaviors, and most show poor steerability, sycophancy, and incoherent character. We show that simple anti-refusal training can cause many of these issues. None of these problems are necessary consequences of helpful-only training, though: we show that synthetic document fine-tuning and adding character-related questions to SFT and RL can mitigate them.

2606.04410 2026-06-04 cs.CV

Ultra-Fast Neural Video Compression

超快神经视频压缩

Jiahao Li, Wenxuan Xie, Zhaoyang Jia, Bin Li, Zongyu Guo, Xiaoyi Zhang, Yan Lu

发表机构 * Microsoft Research Asia(微软亚洲研究院) University of Science and Technology of China(中国科学技术大学)

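AI summary: Proposes DCVC-UF, a chunk-based coding framework that achieves ultra-fast encoding and decoding through joint spatial-temporal modeling and parallel reconstruction, significantly improving the rate-distortion-complexity trade-off.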

Comments CVPR 2026

详情
AI中文摘要

尽管神经视频编解码器(NVC)已展现出优越的压缩比,但其过高的计算复杂度仍是实际部署的关键障碍。本文引入一种基于块的编码框架,旨在显著改善率失真-复杂度权衡。我们的方法不是逐帧处理,而是将多个帧组成的块编码为单个紧凑的潜在表示,并同时解码它们。这是通过用于联合时空建模的跨帧交互模块和用于并行重建的帧特定解码器实现的。这种范式不仅显著提高了编码吞吐量,还有助于更有效地建模长期时间相关性。为了进一步提高速度,我们提出了一种简化的熵编码机制,将比特流交互整合为单一步骤,大幅减少解码开销。基于这些创新,我们提出了DCVC-UF(超快),一种新的NVC,在性能上树立了新的SOTA。我们的实验表明,DCVC-UF可以实现超快的编码和解码速度,显著优于之前的领先编解码器。DCVC-UF是NVC发展历程中的一个显著里程碑。代码位于https://github.com/microsoft/DCVC。

英文摘要

While neural video codecs (NVCs) have demonstrated superior compression ratios, their prohibitive computational complexity remains a critical barrier to real-world deployment. This paper introduces a chunk-based coding framework designed to significantly improve the rate-distortion-complexity trade-off. Instead of processing frames sequentially, our approach encodes a chunk of multiple frames into a single compact latent representation and decodes them simultaneously. This is enabled by cross-frame interaction modules for joint spatial-temporal modeling and frame-specific decoders for parallel reconstruction. This paradigm not only dramatically enhances coding throughput but also facilitates more effective modeling of long-term temporal correlations. To further boost speed, we propose a streamlined entropy coding mechanism that consolidates bit-stream interactions into a single step, substantially reducing decoding overhead. Building on these innovations, we present DCVC-UF (Ultra-Fast), a new NVC that sets a new state of the art in performance. Our experiments show that DCVC-UF can achieve ultra-fast encoding and decoding speeds, significantly outperforming previous leading codecs. DCVC-UF serves as a notable landmark in the journey of NVC evolution. The code is at https://github.com/microsoft/DCVC.

2606.04408 2026-06-04 cs.LG cs.AI

An Ensembled Latent Factor Model via Differential Evolution and Gradient Descent Optimization

基于差分进化和梯度下降优化的集成潜在因子模型

Rui Zhang, Jinhang Liu, Wenbo Zhang

发表机构 * Chongqing Academy of Economics Research(重庆经济研究院) College of Computer and Information Science, Southwest University(西南大学计算机与信息科学学院)

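AI summary: For high-dimensional and incomplete data, proposes an ensembled latent factor model that builds two models separately, one optimized by differential evolution and one by gradient descent, and fuses them with self-adaptive weighting to obtain more comprehensive and less biased representations.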

详情
AI中文摘要

高维不完全(HDI)数据在许多现实世界的大数据场景中普遍存在。潜在因子模型是一种常见的表示学习方法,能够从这些数据中揭示信息丰富的潜在因子。然而,大多数现有的潜在因子模型仅依赖梯度下降进行优化,这可能导致表示不充分且有偏差,特别是在处理异构HDI数据时。因此,本研究提出了一种基于差分进化和梯度下降优化的集成潜在因子模型(ELFM-DEGDO),其设计包括两个方面:1)分别通过差分进化和梯度下降优化独立建模两个不同的潜在因子模型;2)通过定制的自适应加权机制将这两个不同的潜在因子模型组合起来,以有效融合它们的优势。通过利用两种优化范式的互补优势,ELFM-DEGDO能够为HDI数据生成更全面、偏差更小的表示。在三个HDI数据集上的测试表明,ELFM-DEGDO的性能始终优于相关的几种潜在因子模型。

英文摘要

High-dimensional and incomplete (HDI) data are prevalent in many real-world big data scenarios. Latent factor models serve as a common representation learning approach, capable of uncovering informative latent factors from such data. Nevertheless, most existing latent factor models rely solely on gradient descent for optimization, which may lead to insufficient and biased representations, particularly when dealing with heterogeneous HDI data. Thus, this study proposes an Ensembled Latent Factor Model via Differential Evolution and Gradient Descent Optimization (ELFM-DEGDO) with a two-fold design: 1) two diverse latent factor models are built independently via differential evolution and gradient descent optimization, respectively, and 2) the two models are combined via a customized self-adaptive weighting mechanism to effectively fuse their strengths. By leveraging the complementary advantages of both optimization paradigms, ELFM-DEGDO is able to produce more comprehensive and less biased representations for HDI data. Tests on three HDI datasets show that ELFM-DEGDO consistently performs better than several related latent factor models.

2606.04405 2026-06-04 cs.LG cs.AI

Low-Rank Decay for Grokking in Scale-Invariant Transformers: A Spectral-Geometric View

尺度不变Transformer中Grokking的低秩衰减:谱几何视角

Mingyu Li

发表机构 * Beijing Normal University(北京师范大学)

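AI summary: Observing that weight decay cannot simplify the function of a normalized layer in scale-invariant Transformers, proposes Low-Rank Decay (LRD), a regularizer that compresses singular values through the tangential component of the nuclear-norm subgradient; on modular arithmetic tasks it accelerates effective-rank collapse and expands the data boundary for delayed generalization (grokking).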

详情
AI中文摘要

现代Transformer架构经常采用归一化机制,如RMSNorm和Query-Key归一化,使得模型的部分相对于权重幅度近似尺度不变。在这种机制下,标准的Frobenius范数权重衰减仅沿权重空间的径向方向作用,无法直接简化归一化层所表示的函数。我们通过这一视角研究小规模算法任务中的grokking现象,并提出低秩衰减(LRD),一种类似核范数的谱正则化器,其子梯度(极因子$UV^\top$)即使在尺度不变设置中也保留切向分量。这一区别具有具体的动力学后果:在模型记忆训练集且任务梯度消失后,L2衰减无法再重塑权重谱,而LRD则以类似$\ell_1$的方式继续压缩奇异值。在模算术任务中,我们发现LRD诱导Query/Key矩阵的快速有效秩下降,并扩展了延迟泛化(grokking)发生的数据分数边界。我们进一步通过核范数子微分在低秩流形附近的“针到扇”展开,提供了谱几何解释。

英文摘要

Modern Transformer architectures frequently employ normalization mechanisms such as RMSNorm and Query-Key Normalization, making parts of the model approximately scale-invariant with respect to weight magnitudes. In this regime, standard Frobenius-norm weight decay acts purely along the radial direction of the weight space and cannot directly simplify the function represented by the normalized layer. We study grokking in small algorithmic tasks through this lens and propose Low-Rank Decay (LRD), a nuclear-norm-like spectral regularizer whose subgradient (the polar factor $UV^\top$) retains a tangential component even in the scale-invariant setting. This distinction has a concrete dynamical consequence: after the model memorizes the training set and task gradients vanish, L2 decay can no longer reshape the weight spectrum, whereas LRD continues to compress singular values in an $\ell_1$-like fashion. On modular arithmetic tasks, we find that LRD induces rapid effective-rank collapse in Query/Key matrices and expands the data-fraction boundary at which delayed generalization (grokking) occurs. We further provide a spectral-geometric interpretation through the "needle-to-fan" expansion of the nuclear-norm subdifferential near low-rank strata.
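The radial-versus-tangential distinction can be checked numerically in a deliberately simple case. The sketch below assumes a diagonal weight matrix with positive entries, so that its singular values are the diagonal itself and the polar factor $UV^\top$ is the identity; a diagonal matrix is represented by the list of its diagonal entries, for which the Frobenius inner product reduces to a dot product. The function name and the example values are illustrative, not taken from the paper.

```python
def tangential_part(gradient, weights):
    """Remove the radial component of gradient, i.e. its projection onto weights."""
    dot = sum(g * w for g, w in zip(gradient, weights))
    norm_sq = sum(w * w for w in weights)
    return [g - dot / norm_sq * w for g, w in zip(gradient, weights)]


# Diagonal of W = diag(3, 1); these are also its singular values (assumption: W > 0).
singular_values = [3.0, 1.0]

l2_gradient = list(singular_values)                # gradient of 0.5 * ||W||_F^2 is W
nuclear_gradient = [1.0] * len(singular_values)    # U V^T = I for this W

l2_tangent = tangential_part(l2_gradient, singular_values)
nuclear_tangent = tangential_part(nuclear_gradient, singular_values)
```

Here `l2_tangent` vanishes, so L2 decay only rescales the matrix, while `nuclear_tangent` is nonzero whenever the singular values differ. A descent step along its negative shrinks the smaller singular value relative to the larger one, which is the low-rank pressure the abstract describes.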

2606.04402 2026-06-04 cs.AI

Not All Errors Are Equal: Consequence-Aware Reasoning Compute Allocation

并非所有错误都同等重要:后果感知的推理计算分配

Jingbo Wen, Liang He, Ziqi He

发表机构 * The University of Sydney(悉尼大学) Shanghai Institute of Optics and Fine Mechanics(上海光学精密机械研究所)

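AI summary: Proposes consequence-aware test-time compute allocation, in which a lightweight predictor estimates the cost of task errors and high-consequence tasks are routed to more compute under the same budget, reducing cost-weighted loss by 22% to 33% on SWE-bench.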

详情
AI中文摘要

现代推理模型可以为不同任务分配不同量的测试时计算,例如思考令牌、模型调用或计算预算。现有方法通常通过预测难度来驱动这种分配,并在预期能提高准确率的地方投入更多计算。这隐含地假设所有失败的成本相同,因为准确率目标对每个任务一视同仁。然而,这种假设在部署中并不成立:日志消息中的拼写错误和导致生产数据库损坏的迁移都算作一次基准失败,但它们的实际成本根本不同。为填补这一空白,我们提出后果感知的测试时计算分配。我们不是仅根据预测难度来路由计算,而是使用轻量级预测器从问题文本中估计如果任务被错误解决会有多高的成本。然后,调度器在相同总预算下将更高后果的任务路由到更大的计算层级或更高的思考预算。我们在SWE-bench Lite上进行主要实验,并在Multi-SWE-bench mini上评估跨数据集行为,总共涵盖700个软件工程任务。我们的结果表明,在各种标注下,后果和难度大致正交,并且当前的思考模型并未根据后果充分分配计算。此外,我们的仅问题文本预测器在300个SWE-bench任务中从未将高后果任务误分类为低后果任务。在匹配的计算预算下,我们的后果感知调度器相对于难度感知路由将成本加权损失降低了22%至33%;特别是,优先级感知变体(根据边际效用信号缩放每个任务的成本进行路由)降低了超过30%,而其可部署的预测器驱动版本保留了超过90%的预言机增益。

英文摘要

Modern reasoning models can allocate different amounts of test-time computation, such as thinking tokens, model calls, or compute budget, to different tasks. Existing methods generally drive this allocation by predicted difficulty and spend more compute where it is expected to raise accuracy. This implicitly assumes that all failures cost the same, since an accuracy objective weights every task equally. However, such an assumption does not hold in deployment: a typo in a log message and a migration that corrupts a production database both count as one benchmark failure, but their real-world costs are fundamentally different. To fill this gap, we propose consequence-aware test-time compute allocation. Instead of routing compute only by predicted difficulty, we use a lightweight predictor to estimate from the issue text how costly a task would be if solved incorrectly. The scheduler then routes higher-consequence tasks to larger compute tiers or higher thinking budgets under the same total budget. We conduct main experiments on SWE-bench Lite and evaluate cross-dataset behavior on Multi-SWE-bench mini, covering 700 software-engineering tasks in total. Our results reveal that consequence and difficulty are approximately orthogonal under various annotations, and that current thinking models do not allocate compute sufficiently according to consequence. Moreover, our issue-only predictor never misclassifies a high-consequence task as low-consequence across the 300 SWE-bench tasks. Under matched compute budgets, our consequence-aware scheduler reduces cost-weighted loss by 22% to 33% relative to difficulty-aware routing; in particular, the priority-aware variant, which routes by per-task cost scaled by the marginal-utility signal, exceeds 30%, and its deployable predictor-driven version retains over 90% of the oracle gain.
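A budget-constrained scheduler of this kind could be sketched as a simple greedy routine. The two-tier scheme, the tier sizes, and the greedy upgrade order below are assumptions for illustration only; the paper's scheduler and its predictor are not reproduced here, and the cost values are made up.

```python
def allocate(tasks, total_budget, base=1, high=4):
    """Give every task the base tier, then upgrade tasks in descending
    order of predicted error cost while the remaining budget allows."""
    plan = {name: base for name, _ in tasks}
    remaining = total_budget - base * len(tasks)
    if remaining < 0:
        raise ValueError("budget cannot cover the base tier for every task")
    upgrade_cost = high - base
    for name, _cost in sorted(tasks, key=lambda t: t[1], reverse=True):
        if remaining >= upgrade_cost:
            plan[name] = high
            remaining -= upgrade_cost
    return plan


# (task, predicted cost of an incorrect solution); values are hypothetical.
tasks = [("fix log typo", 1.0), ("db migration", 50.0), ("rename variable", 2.0)]
plan = allocate(tasks, total_budget=6)
```

With a budget of 6 units, only the database migration is upgraded to the larger tier; the two low-consequence tasks stay on the base tier.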

2606.04401 2026-06-04 cs.LG

TANDEM: Bi-Level Data Mixture Optimization with Twin Networks

TANDEM: 基于孪生网络的双层数据混合优化

Jiaxing Wang, Deping Xiang, Jin Xu, Mingyang Yi, Guoqiang Gong, Zicheng Zhang, Haoran Li, Pengzhang Liu, Zhen Chen, Ke Zhang, Ju Fan, Qixiang Jiang

发表机构 * JD.com(京东公司) University of Oxford(牛津大学) Renmin University of China(中国人民大学) University of Chinese Academy of Sciences(中国科学院大学)

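AI summary: Proposes TANDEM, which measures data utility through the difference between twin networks (a proxy model and a reference model) to optimize domain mixture ratios, significantly improving large language model performance in settings such as data-restricted training and supervised fine-tuning.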

详情
AI中文摘要

大语言模型(LLM)的能力在很大程度上取决于来自不同领域的训练数据。优化特定领域的混合比例可以建模为一个双层优化问题,我们将其简化为单层惩罚形式,并通过孪生网络求解:一个在主数据上训练的代理模型和一个在额外数据上动态更新的参考模型。我们提出的方法——用于双层数据混合优化的孪生网络(TANDEM),通过孪生模型之间的差异衡量数据效用,并增加从额外数据中受益更多的领域的权重。与先前方法相比,TANDEM提供了理论保证和更广泛的适用性。此外,我们的双层视角提出了研究领域重新加权的新设置,例如数据受限场景和监督微调,其中优化的混合比例显著提升了性能。大量实验验证了TANDEM在所有场景中的有效性。

英文摘要

The capabilities of large language models (LLMs) significantly depend on training data drawn from various domains. Optimizing domain-specific mixture ratios can be modeled as a bi-level optimization problem, which we simplify into a single-level penalized form and solve with twin networks: a proxy model trained on primary data and a dynamically updated reference model trained with additional data. Our proposed method, Twin Networks for bi-level DatA mixturE optiMization (TANDEM), measures the data efficacy through the difference between the twin models and up-weights domains that benefit more from the additional data. TANDEM provides theoretical guarantees and wider applicability, compared to prior approaches. Furthermore, our bi-level perspective suggests new settings to study domain reweighting such as data-restricted scenarios and supervised fine-tuning, where optimized mixture ratios significantly improve the performance. Extensive experiments validate TANDEM's effectiveness in all scenarios.
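The up-weighting step can be pictured with a small reweighting function. The abstract does not give TANDEM's update rule, so the exponentiated update, the learning rate, and the per-domain loss values below are assumptions chosen only to show the direction of the effect: domains where the reference model (trained with additional data) beats the proxy by a wider margin receive more weight.

```python
import math


def reweight(weights, proxy_loss, reference_loss, lr=1.0):
    """Multiplicatively up-weight domains by the proxy-minus-reference loss gap,
    then renormalise so the mixture ratios sum to one."""
    gap = {d: proxy_loss[d] - reference_loss[d] for d in weights}
    unnormalised = {d: w * math.exp(lr * gap[d]) for d, w in weights.items()}
    total = sum(unnormalised.values())
    return {d: v / total for d, v in unnormalised.items()}


mixture = {"code": 0.5, "web": 0.5}
new_mixture = reweight(
    mixture,
    proxy_loss={"code": 2.0, "web": 1.5},      # hypothetical losses
    reference_loss={"code": 1.2, "web": 1.4},  # hypothetical losses
)
```

In this toy example the "code" domain benefits more from the additional data, so its share of the mixture grows.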

2606.04399 2026-06-04 cs.LG cs.CR

DPDL: Towards Differential Privacy Preservation in Decentralized Stochastic Learning on Non-IID Data

DPDL: 非独立同分布数据下分散式随机学习中的差分隐私保护

Yunsheng Yuan, Xue Xiao, Lina Wang, Feng Li

发表机构 * School of Computer Science and Technology, Shandong University(山东大学计算机科学与技术学院) Inspur Cloud Information Technology Co., Ltd.(浪潮云信息技术有限公司)

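AI summary: To address privacy leakage in decentralized learning on non-IID data, proposes DPDL, an algorithm based on differential privacy and similarity calibration that protects privacy through noise injection and cosine-similarity calibration while retaining linear speedup.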

详情
AI中文摘要

在分散式学习范式中,一组智能体在没有中央服务器的情况下,利用分布式数据集协作训练全局模型。尽管协作的力量已被许多前沿研究验证,但它需要智能体之间广泛交换梯度信息,从而对单个智能体带来高隐私泄露风险。此外,在实际应用中,训练数据通常在智能体之间非独立同分布,这给实现隐私保护的分散式学习带来了更多挑战。为了解决这些问题,我们提出了一种针对非独立同分布数据的隐私保护分散式学习算法DPDL,该算法通过基于相似性校准的技术,在交叉梯度聚合中利用差分隐私(DP)的概念。具体来说,在每一轮中,每个智能体在与其邻居共享交叉梯度(即其私有本地数据上邻居局部模型的导数)之前,通过高斯噪声机制对其进行扰动;然后采用余弦相似度校准接收到的扰动交叉梯度,使得校准后的交叉梯度聚合能够以类似动量的方式有效更新局部模型。我们严格的理论分析不仅揭示了实现特定隐私保护水平所需的最小噪声水平,而且表明我们的算法在非独立同分布数据训练中仍然实现了线性加速。最后,我们在真实世界数据集上进行了大量实验,以验证我们的算法在防御隐私攻击和训练准确模型方面的有效性。

英文摘要

In the paradigm of decentralized learning, a group of agents collaborate to train a global model using distributed datasets without a central server. Although the power of collaboration has been verified by many state-of-the-art studies, it entails extensive exchange of gradient information among the agents and thus induces a high risk of privacy leakage for the individual agents. Moreover, in real-world applications, the training data are usually not independent and identically distributed (non-IID) across the agents, which makes privacy-preserving decentralized learning more challenging. To address these issues, we propose DPDL, a privacy-preserving decentralized learning algorithm for non-IID data, which leverages the notion of Differential Privacy (DP) in cross-gradient aggregation through a similarity-based calibration technique. Specifically, in each round, each agent perturbs the cross-gradients (i.e., the derivatives of its neighbors' local models on its private local data) with a Gaussian noise mechanism before sharing them with its neighbors; it then adopts cosine similarity to calibrate the received perturbed cross-gradients such that the aggregation of the calibrated cross-gradients can be used to effectively update the local model in a momentum-like manner. Our rigorous theoretical analysis not only reveals the minimum noise level required to achieve a specific level of privacy preservation, but also shows that our algorithm still achieves a linear speedup in training with non-IID data. Finally, we conduct extensive experiments on real-world datasets to validate the effectiveness of our algorithm in defending against privacy attacks and in training accurate models.
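The two ingredients named in the abstract, Gaussian perturbation and cosine-similarity calibration, can be sketched in a few lines. The weighting scheme below (clipping negative similarities to zero and taking a weighted average) is an assumption, since the abstract does not spell out the calibration rule; the sensitivity clipping that a real DP mechanism needs is also omitted.

```python
import math
import random


def perturb(gradient, sigma, rng):
    """Add independent Gaussian noise N(0, sigma^2) to each coordinate."""
    return [g + rng.gauss(0.0, sigma) for g in gradient]


def cosine(a, b):
    """Cosine similarity; defined as 0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 0.0 if norm_a == 0 or norm_b == 0 else dot / (norm_a * norm_b)


def calibrated_aggregate(local_gradient, received):
    """Average received gradients, weighted by non-negative cosine similarity."""
    weights = [max(cosine(local_gradient, g), 0.0) for g in received]
    total = sum(weights)
    if total == 0:
        return list(local_gradient)  # nothing agrees: fall back to the local gradient
    dim = len(local_gradient)
    return [sum(w * g[i] for w, g in zip(weights, received)) / total
            for i in range(dim)]


noisy = perturb([1.0, 0.0], sigma=0.1, rng=random.Random(0))
aggregate = calibrated_aggregate([1.0, 0.0], [[2.0, 0.0], [-1.0, 0.0]])
```

In the example, the neighbor gradient pointing in the opposite direction gets zero weight, so the aggregate follows the aligned neighbor only.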

2606.04396 2026-06-04 cs.CL

Read the Trace, Steer the Path: Trajectory-Aware Reinforcement Learning for Diffusion Language Models

读取轨迹,引导路径:面向扩散语言模型的轨迹感知强化学习

Anant Khandelwal, Manish Gupta

发表机构 * Microsoft AI, India(微软印度人工智能)

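AI summary: Proposes CAPR, which uses cached trajectory states and a block-level value head to extract tree-search-like fine-grained supervision from the denoising trace, improving reinforcement learning for diffusion language models while reducing compute cost.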

Comments 19 pages, 10 figures, 7 Tables

详情
AI中文摘要

扩散大语言模型(dLLMs)通过并行迭代去掩码和修正多个位置来生成响应。这一过程留下了丰富的去噪轨迹,描绘了哪些标记变得可信、哪些仍不稳定以及何时形成承诺。现有的dLLM强化学习方法仅弱化地使用这一信号。扁平化展开成本低,但将单一结果奖励分配给整个轨迹。树展开通过分支部分轨迹并将叶节点奖励向上传播,提供更精细、可验证的训练信号,但计算密集。我们提出疑问:去噪轨迹本身能否在不使用树级计算的情况下提供类似树的监督?我们引入CAPR(缓存-摊销路径细化),一种dLLM-RL算法,它将去噪轨迹总结为紧凑的路径状态,利用缓存轨迹状态生成廉价的兄弟延续,并训练块级价值头用于局部块级监督。在块级去掩码调度下,CAPR记录路径状态和块进度特征,然后根据每个块中揭示的标记将最终结果奖励重新分配到各个块。这训练价值头将一个稀疏奖励转换为块级PPO权重。因此,CAPR恢复了树搜索的大部分粒度,同时避免了完整的树扩展,将展开生成成本降低到扁平展开的大约0.75倍和树展开的0.6倍(在标准设置下)。在4x4数独、Countdown、GSM8K和Math500上,使用密集和混合专家LLaDA骨干网络,CAPR在256和512标记预算下为RL调优的dLLMs设立了新的最先进水平。在数独上,它以不到三分之一的每步计算量匹配了最强的树结构基线。

英文摘要

Diffusion large language models (dLLMs) generate responses by iteratively unmasking and revising many positions in parallel. This process leaves a rich denoising trace depicting which tokens become confident, which remain unstable, and when commitments form. Existing dLLM reinforcement learning methods use this signal only weakly. Flat rollouts are cheap, but assign a single outcome reward to the whole trajectory. Tree rollouts provide finer, verifiable training signals by branching partial trajectories and propagating leaf rewards upward, but are compute intensive. We ask whether the denoising trace itself can provide tree-like supervision without tree-level compute. We introduce CAPR (Cached-Amortized Path Refinement), a dLLM-RL algorithm that summarizes the denoising trace into a compact path state, uses cached trajectory states to generate cheap sibling continuations, and trains a block-level value head for local block-wise supervision. Under a block-wise unmasking schedule, CAPR records path-state and block-progress features, then redistributes the final outcome reward across blocks according to the tokens revealed in each block. This trains the value head to convert one sparse reward into block-level PPO weights. CAPR therefore recovers much of the granularity of tree search while avoiding full tree expansion, reducing rollout-generation cost to roughly 0.75x that of flat rollouts and 0.6x that of tree rollouts (under standard settings). Across 4x4 Sudoku, Countdown, GSM8K, and Math500, on dense and mixture-of-experts LLaDA backbones, CAPR sets a new state of the art for RL-tuned dLLMs at 256- and 512-token budgets. On Sudoku, it matches the strongest tree-structured baseline at less than one third of the per-step compute.
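The reward-redistribution step can be illustrated with a short function. The abstract says the final outcome reward is redistributed "according to the tokens revealed in each block"; the strictly proportional split below is one plausible reading and is an assumption, as are the block sizes in the example.

```python
def redistribute_reward(final_reward, tokens_revealed_per_block):
    """Split one sparse outcome reward across blocks in proportion to
    the number of tokens each block revealed."""
    total = sum(tokens_revealed_per_block)
    if total == 0:
        return [0.0] * len(tokens_revealed_per_block)
    return [final_reward * n / total for n in tokens_revealed_per_block]


# Hypothetical trajectory: three blocks revealing 8, 24, and 32 tokens.
block_targets = redistribute_reward(1.0, [8, 24, 32])
```

Per-block targets of this kind are what a block-level value head could be regressed onto, turning a single trajectory-level reward into block-level training signal.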

2606.04392 2026-06-04 cs.LG cs.CL

Physics-Informed Neural Network Modeling of Biodegradable Contaminant Transport through GCL/SL Composite Liners

物理信息神经网络建模可生物降解污染物通过GCL/SL复合衬垫的迁移

Dong Li, Yapeng Cao, Haiping Zhao, Shutong Han

发表机构 * Department of Civil, Environmental, and Infrastructure Engineering, George Mason University(乔治·梅森大学土木、环境与基础设施工程系) State Key Laboratory of Cryospheric Science and Frozen Soil Engineering, Northwest Institute of Eco-Environment and Resources, Chinese Academy of Sciences(中国科学院西北生态环境资源研究院冰冻圈科学与冻土工程重点实验室) Laboratoire Navier/CERMES, École Nationale des Ponts et Chaussées, Institut Polytechnique de Paris(巴黎理工学院国立路桥学校Navier实验室/CERMES)

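AI summary: Proposes a two-domain physics-informed neural network framework in which a hard-constrained PINN accurately simulates contaminant transport through GCL/SL composite liners, and extends it to the inverse problem of identifying the degradation half-life.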

详情
AI中文摘要

本研究开发了一个双域物理信息神经网络框架,用于污染物通过GCL/SL复合衬垫系统的迁移,其中薄GCL层采用稳态平流-弥散-生物降解公式处理,而下层土壤衬垫建模为瞬态传输域。在不同渗滤液水头条件下,评估了两种公式与解析解和有限元参考解的对比:标准软约束PINN(Std-PINN)和硬约束PINN(H-PINN),其中选定的边界和初始条件直接嵌入试验解中。Std-PINN捕捉了整体突破行为,但在早期传输阶段显示出较大误差,特别是在平流传输更显著的高水头条件下。H-PINN减少了与基于惩罚的约束执行相关的优化负担,提供了更准确和稳定的浓度预测,将MAE从Std-PINN的约0.058-0.067降低到H-PINN的约0.011-0.023,同时将MRE从约9.10%-19.16%降低到约2.08%-3.14%。参数分析证实,采用tanh激活函数和优化网络结构的H-PINN提供了最佳的预测精度。H-PINN进一步扩展到逆建模,用于从有限的浓度观测中识别SL降解半衰期,显示出对预设值的可靠收敛性以及在低到中等观测噪声下的可接受鲁棒性。

英文摘要

This study develops a two-domain physics-informed neural network framework for contaminant transport through a GCL/SL composite liner system, in which the thin GCL layer is treated using a steady-state advection-dispersion-biodegradation formulation and the underlying soil liner is modeled as a transient transport domain. Two formulations are evaluated against analytical and finite-element reference solutions under different leachate-head conditions: a standard PINN with soft constraint enforcement (Std-PINN) and a hard-constrained PINN (H-PINN), in which selected boundary and initial conditions are embedded directly into the trial solutions. The Std-PINN captures the overall breakthrough behavior but shows larger errors during the early transport stage, particularly under higher leachate heads where advective transport becomes more pronounced. The H-PINN reduces the optimization burden associated with penalty-based constraint enforcement and provides more accurate and stable concentration predictions, lowering the MAE from approximately 0.058-0.067 for the Std-PINN to about 0.011-0.023 for the H-PINN, while reducing the MRE from approximately 9.10%-19.16% to about 2.08%-3.14%. Parametric analyses confirm that the H-PINN with the tanh activation function and an optimized network structure provides the best predictive accuracy. The H-PINN is further extended to inverse modeling for identifying the SL degradation half-life from limited concentration observations, showing reliable convergence toward prescribed values and acceptable robustness under low-to-moderate observation noise.
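Hard constraint enforcement is commonly done by building the condition into the trial solution. The sketch below shows one standard ansatz for an initial condition, c(x, t) = c0(x) + t * N(x, t); whether the paper uses this exact form is an assumption, and `toy_network` is a made-up stand-in for a trained network.

```python
import math


def hard_constrained(network, initial_condition):
    """Wrap a network so the trial solution satisfies c(x, 0) = c0(x) exactly."""
    def trial(x, t):
        # The factor t removes the network's contribution at t = 0.
        return initial_condition(x) + t * network(x, t)
    return trial


def toy_network(x, t):
    """Arbitrary stand-in for a neural network output (assumption)."""
    return math.tanh(3.0 * x - t) + 0.7


def zero_initial_concentration(x):
    """Hypothetical initial condition: an initially clean liner."""
    return 0.0


concentration = hard_constrained(toy_network, zero_initial_concentration)
value_at_start = concentration(0.4, 0.0)
value_later = concentration(0.4, 1.0)
```

Because the initial condition holds for any network weights, no penalty term is needed for it in the loss, which is the reduced optimization burden the abstract attributes to the H-PINN.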

2606.04391 2026-06-04 cs.AI

Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval

基于状态接地动态检索的Web代理在线技能学习

Jiaxi Li, Ke Deng, Yun Wang, Jingyuan Huang, Yucheng Shi, Qiaoyu Tan, Jin Lu, Ninghao Liu

发表机构 * University of Georgia(佐治亚大学) Tencent America(腾讯美国) New York University(纽约大学) The Hong Kong Polytechnic University(香港理工大学)

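AI summary: Proposes State-Grounded Dynamic Retrieval (SGDR), which improves web agents on multi-step automation tasks through stepwise skill reuse, reaching average success rates of 37.5% (GPT-4.1) and 24.3% (Qwen3-4B) on WebArena.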

Comments 17 pages

详情
AI中文摘要

语言代理越来越依赖可重用技能来改进跨相关任务的多步Web自动化。越来越多的研究关注在线技能学习,其中代理不断从先前的任务轨迹中归纳技能,并在未来的任务中动态重用它们。然而,现有方法主要在任务级别重用技能:根据初始任务指令检索一组固定的技能,并在执行过程中保持不变。这种静态策略与Web执行不一致,因为适当的下一步动作不仅取决于任务目标,还取决于当前网页状态,而网页状态通常会转变为初始技能无法覆盖的情况。为了解决这一差距,我们提出了状态接地动态检索(SGDR),一种在线技能学习方法,使Web代理能够逐步重用技能。SGDR由三个组件组成:一个滑动窗口提取过程,将完成的轨迹转化为可在中间执行状态调用的可重用子程序;一种双文本代码表示,将技能检索与可执行动作连接起来;以及一种状态接地动态检索机制,将技能与任务目标和当前网页状态相匹配。在WebArena上跨五个领域的实验表明,SGDR始终优于强基线,使用GPT-4.1和Qwen3-4B时的平均成功率分别达到37.5%和24.3%,相对于最强基线分别对应10.6%和10.0%的相对提升。代码可在 https://github.com/plusnli/skill-dynamic-retrieval 获取。

英文摘要

Language agents increasingly rely on reusable skills to improve multi-step web automation across related tasks. A growing line of work studies online skill learning, where agents continually induce skills from previous task trajectories and reuse them in future tasks on the fly. However, existing methods mainly reuse skills at the task level: a fixed set of skills is retrieved based on the initial task instruction and then held fixed throughout execution. This static strategy is misaligned with web execution, where the appropriate next action depends not only on the task goal but also on the current webpage state, which often transitions into situations that the initial skills fail to cover. To address this gap, we propose State-Grounded Dynamic Retrieval (SGDR), an online skill learning method that enables stepwise skill reuse for web agents. SGDR consists of three components: a sliding-window extraction process that turns completed trajectories into reusable sub-procedures invokable at intermediate execution states, a dual text-code representation that connects skill retrieval with executable action, and a state-grounded dynamic retrieval mechanism that matches skills to both the task goal and the current webpage state. Experiments on WebArena across five domains show that SGDR consistently outperforms strong baselines, achieving average success rates of 37.5% with GPT-4.1 and 24.3% with Qwen3-4B, corresponding to relative gains of 10.6% and 10.0% over the strongest baseline, respectively. The code is available at https://github.com/plusnli/skill-dynamic-retrieval.
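The difference between task-level and stepwise retrieval can be shown with a toy retriever that scores each skill against both the goal and the current page state. Everything here is a simplified assumption: the Jaccard word overlap stands in for whatever matching SGDR actually uses, and the skill records, field names, and mixing weight `alpha` are hypothetical.

```python
def jaccard(a, b):
    """Word-level Jaccard overlap between two strings."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def retrieve(skills, goal, state, alpha=0.5):
    """Return the skill that best matches both the task goal and the page state."""
    def score(skill):
        return (alpha * jaccard(skill["goal"], goal)
                + (1 - alpha) * jaccard(skill["state"], state))
    return max(skills, key=score)


skills = [
    {"name": "open_cart", "goal": "buy item", "state": "product page"},
    {"name": "fill_address", "goal": "buy item", "state": "checkout form"},
]

# The goal never changes, but the retrieved skill follows the page state.
first_step = retrieve(skills, "buy item", "product page")["name"]
later_step = retrieve(skills, "buy item", "checkout form")["name"]
```

A retriever keyed on the goal alone could not tell these two skills apart, which is the gap that state grounding is meant to close.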

2606.04390 2026-06-04 cs.LG cond-mat.dis-nn math.PR

Shortcomings and capacities of real-constrained neural networks in complex spaces

复空间中实约束神经网络的缺陷与能力

Andrew Gracyk

Affiliation: Department of Mathematics

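AI summary: Using Gardner volume comparisons and the Harish-Chandra-Itzykson-Zuber (HCIZ) formula, the paper studies the asymptotic ratio between storage capacities when real pre-activations, rather than complex ones, are enforced in a complex hypothesis class.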

Comments First version

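Details
AI-generated abstract

We find the asymptotic ratio between storage capacities when real pre-activations are enforced in a complex hypothesis class, relative to complex pre-activations. Our method relies on Gardner volume comparisons at critical capacity. Our proof relies on an application of the Harish-Chandra-Itzykson-Zuber (HCIZ) formula that is nonstandard in the literature. Using the HCIZ formula, we can obtain a more robust approximation of the final asymptotic ratio. The strategy suits our work in particular because we integrate over the unitary and orthogonal compact manifolds by means of the Weyl integration formula and the Haar measure.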

English abstract

We find the asymptotic ratio between the storage capacities when enforcing real pre-activations in a complex hypothesis class as opposed to complex ones in the same class. Our methods depend on Gardner volume comparisons at critical capacity. Our proof relies on an application of the Harish-Chandra-Itzykson-Zuber (HCIZ) formula that is nonstandard in the literature. With the HCIZ formula, we may obtain a more robust approximation for the final asymptotic ratio. This strategy is applicable to our work specifically because we integrate over the unitary and orthogonal compact manifolds, facilitated by the Weyl integration formula and the Haar measure.
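For reference, the HCIZ formula named in the abstract is usually stated as below. This is the textbook unitary form, given here under assumed notation: $A$ and $B$ are $n \times n$ Hermitian matrices with distinct eigenvalues $a_1, \dots, a_n$ and $b_1, \dots, b_n$, $t$ is a nonzero parameter, and $dU$ is the normalized Haar measure on $U(n)$. The abstract does not say which variant or normalization the paper uses, nor how the formula enters the Gardner volume computation.

```latex
\int_{U(n)} \exp\!\bigl( t \,\operatorname{tr}( A U B U^{\dagger} ) \bigr)\, dU
  \;=\;
  \Biggl( \prod_{p=1}^{n-1} p! \Biggr)
  \frac{ \det\bigl[ e^{\, t\, a_i b_j} \bigr]_{1 \le i,j \le n} }
       { t^{\,n(n-1)/2}\; \Delta(a)\, \Delta(b) },
\qquad
\Delta(a) \;=\; \prod_{1 \le i < j \le n} (a_j - a_i).
```

The formula turns an integral over the whole unitary group into a ratio of an $n \times n$ determinant and two Vandermonde determinants that depends only on the eigenvalues.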

2606.04389 2026-06-04 cs.CL

When Clients Stop Following: A Cognitive Conceptualization Diagram-driven Framework for Strategic Counseling

当来访者不再跟随:基于认知概念化图的策略性咨询框架

Yihao Qin, Junyi Zhao, Changsheng Ma, Yongfeng Tao, Minqiang Yang, Chang Liu, Bin Hu

Affiliation: School of Information Science and Engineering, Lanzhou University

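AI summary: To address inflated scores caused by overly compliant clients in existing evaluation protocols, the paper proposes a resistance-aware framework grounded in Cognitive Behavioral Therapy. It simulates dynamic resistance through Cognitive Conceptualization Diagrams and uses reinforcement learning to optimize strategic reasoning and response generation, improving strategic robustness in difficult counseling interactions.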

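Details
AI-generated abstract

Large language models (LLMs) show potential in psychological counseling, but existing benchmarks depend heavily on highly cooperative simulated clients. We observe a key counselor-following phenomenon: these clients often shift quickly from resistance to compliance after only a few turns, which creates an illusion of therapeutic progress and inflates scores under current evaluation protocols through superficial empathy. To address this evaluation mismatch, we propose a resistance-aware framework grounded in Cognitive Behavioral Therapy (CBT). We introduce CARS, a client simulator that explicitly models dynamic resistance through Cognitive Conceptualization Diagrams (CCDs). We propose STREAMS, a dual-module framework that decouples strategic reasoning (Thinker) from response generation (Presenter) and is optimized with reinforcement learning. We further propose EWTS-MI, an entropy-weighted metric for evaluating responsiveness in high-friction interactions. Experiments in resistant and non-resistant counseling settings validate our finding of evaluation mismatch and demonstrate that resistance-aware training improves strategic robustness in challenging counseling interactions.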

English abstract

Large Language Models (LLMs) show promise in psychological counseling, yet existing benchmarks rely heavily on highly cooperative simulated clients. We observe a critical counselor-following phenomenon: these clients often rapidly shift from resistance to compliance after only a few turns, creating an illusion of therapeutic progress and inflating scores under current evaluation protocols through superficial empathy. To address this evaluation mismatch, we propose a Cognitive Behavioral Therapy (CBT)-grounded resistance-aware framework. We introduce CARS, a client simulator that explicitly models dynamic resistance via Cognitive Conceptualization Diagrams (CCDs). We present STREAMS, a dual-module framework that decouples strategic reasoning (Thinker) from response generation (Presenter) and optimizes it via reinforcement learning. We further propose EWTS-MI, an entropy-weighted metric for evaluating responsiveness under high-friction interactions. Experiments across resistant and non-resistant counseling settings validate our findings on evaluation mismatch and demonstrate the effectiveness of resistance-aware training for improving strategic robustness under challenging counseling interactions.