arXivDaily: arXiv daily academic digest, updated Monday through Friday
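2606.18478 2026-06-18 cs.CV New submission

Data-Forcing Distillation: Restoring Diversity and Fidelity in Few-Step Video Generation

Siyi Chen, Shaowei Liu, Yixuan Jia, Zian Wang, Huan Ling, Qing Qu, Jun Gao

Affiliations: University of Michigan; NVIDIA; University of Illinois Urbana-Champaign

AI summary: To address the mode collapse and over-saturation that Distribution Matching Distillation (DMD) exhibits in few-step video generation, the paper proposes the Data-Forcing Distillation (DFD) framework, which uses the teacher score discrepancy to guide the student toward the real-data distribution and restores diversity and fidelity with a single-line code change.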

Abstract

Recent progress has shown promise in distilling multi-step video diffusion models into efficient few-step students. Among them, Distribution Matching Distillation (DMD) and its successor DMD2 achieved strong generation quality and fast convergence. However, due to the nature of the reverse Kullback-Leibler (KL) objective, these methods exhibit two persistent failure modes: a substantial drop in sample diversity, and visibly over-saturated outputs that deviate from real-video appearance. In this work, we propose Data-Forcing Distillation (DFD), a simple post-training framework that restores diversity and fidelity in DMD with only a single-line code change. At its core is the teacher score discrepancy, which guides the student toward the real-data distribution, pulling it to missing modes (mitigating mode collapse) and away from problematic modes absent in real data (avoiding over-saturation). We provide an in-depth theoretical analysis of our framework and validate our approach on text-to-video, image-to-video, and autoregressive video generation. With only 100-300 steps of finetuning, DFD effectively restores diversity and fidelity on both Wan2.1-1.3B and Cosmos-Predict2.5-2B models, resolving the over-saturation artifacts with significantly better video dynamics and appearance, and even outperforming the teacher model.

2606.18473 2026-06-18 cs.CL New submission

PreUnlearn: Auditing Collateral Knowledge Damage Before Large Language Model Unlearning

Bo Su, Ankit Shah, Thai Le

Affiliations: Indiana University Bloomington

AI summary: Proposes PreUnlearn, which uses data features to predict the collateral damage of an unlearning operation on same-domain and distant-domain knowledge, enabling risk auditing before unlearning is run.

Comments 12 pages, 6 figures

Abstract

Machine unlearning for large language models (LLMs) aims to remove specified knowledge while preserving the rest of the model's capabilities. However, the boundary between knowledge to forget and knowledge to retain is often unclear, since related and even distant information may be entangled in the model. In this paper, we study LLM unlearning from a data-centric perspective and measure how unlearning effects propagate from the forget set to same-domain and distant-domain knowledge. We find a consistent decay pattern: collateral damage is strongest near the forget set, weakens with semantic distance, but does not disappear at domain boundaries. We further ask whether such damage can be audited before unlearning is executed. We formulate forget-set auditing as a pre-unlearning prediction task and analyze which data features are most predictive of downstream damage. Our results show that interaction features between the forget set and evaluation set provide the strongest signals, suggesting that collateral damage is partly reflected in data geometry before model updates occur. These findings position forget-set auditing as an early warning tool for identifying risky unlearning runs and designing more reliable unlearning procedures.

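2606.18472 2026-06-18 cs.CV New submission

Domain Generalizable Adaptation of 3D Vision-Language Models via Regularized Fine-Tuning

Sneha Paul, Zachary Patterson, Nizar Bouguila

Affiliations: Concordia University

AI summary: Proposes the ReFine3D framework, which improves the domain generalization of 3D large multimodal models through regularization strategies including selective layer tuning, multi-view consistency, synonym-based prompts, and point-rendered vision supervision.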

Comments Accepted at Transactions on Machine Learning Research (TMLR)

Abstract

Domain adaptation remains a central challenge in 3D vision, especially for multimodal foundation models that align 3D point clouds with visual and textual data. While these models demonstrate strong general capabilities, adapting them to downstream domains with limited data often leads to overfitting and catastrophic forgetting. To address this, we introduce ReFine3D, a regularized fine-tuning framework designed for domain-generalizable tuning of 3D large multimodal models (LMMs). ReFine3D combines selective layer tuning with two targeted regularization strategies: multi-view consistency across augmented point clouds and text diversity through synonym-based prompts generated by large language models. Additionally, we incorporate point-rendered vision supervision and a test-time augmentation mechanism with confidence-based aggregation to further enhance robustness. Extensive experiments across different 3D domain generalization benchmarks show that ReFine3D improves base-to-novel class generalization by 1.36%, cross-dataset transfer by 2.43%, robustness to corruption by 1.80%, and few-shot accuracy by up to 3.11%, outperforming prior state-of-the-art methods with minimal added computational overhead.

2606.18471 2026-06-18 cs.CL New submission

Possible or Definite? A Benchmark for Evaluating Diagnostic Uncertainty Preservation in Clinical Text

Hongbo Du, Zixin Lu, Jiaming Qu

Affiliations: Trine University; University of Michigan; Amazon

AI summary: Builds a benchmark with 9,184 uncertainty annotations to evaluate how well LLMs preserve diagnostic uncertainty in clinical text, finding that LLMs preserve the original uncertainty cues less than half the time and struggle to distinguish adjacent levels.

Abstract

Large language models (LLMs) are increasingly used for clinical text tasks such as summarization and revision. While most studies evaluate the fluency and coherence of LLM-generated text, whether LLMs correctly preserve diagnostic uncertainty remains underexplored. In clinical practice, phrases such as "possible pneumonia" communicate the strength of available evidence and directly guide decisions about follow-up testing and treatment. Altering these uncertainty expressions can change the clinical meaning entirely. In this paper, we systematically evaluated this problem in two steps. First, we constructed a benchmark of 1,200 clinical documents with 9,184 uncertainty annotations across five levels. Second, we evaluated three LLMs on this benchmark. Our results show that (1) LLMs preserve the original uncertainty cues poorly, often less than half the time; (2) LLMs struggle with nuanced distinctions between adjacent levels. This work reveals a failure mode not captured by standard evaluation metrics and provides implications for the safe deployment of LLMs in clinical workflows.

2606.18469 2026-06-18 cs.LG cs.AI New submission

Structured Representation Learning with Locally Linear Embeddings and Adaptive Feature Fusion

Somjit Nath, Jackson J Cone, Derek Nowrouzezahrai, Samira Ebrahimi Kahou

Affiliations: Mila – Quebec AI Institute

AI summary: Inspired by neuroscience, proposes a reinforcement learning framework that uses locally linear embeddings to capture local state structure and an attention mechanism to adaptively fuse dynamics and reward features, improving learning efficiency.

Comments Published in Transactions on Machine Learning Research (04/2026)

Abstract

Neuroscientific research has revealed that the brain encodes complex behaviors by leveraging structured, low-dimensional manifolds and dynamically fusing multiple sources of information through adaptive gating mechanisms. Inspired by these principles, we propose a novel reinforcement learning (RL) framework that encourages the disentanglement of dynamics-specific and reward-specific features, drawing direct parallels to how neural circuits separate and integrate information for efficient decision-making. Our approach leverages locally linear embeddings (LLEs) to capture the intrinsic, locally linear structure inherent in many environments, mirroring the local smoothness observed in neural population activity, while concurrently deriving reward-specific features through the standard RL objective. An attention mechanism, analogous to cortical gating, adaptively fuses these complementary representations on a per-state basis. Experimental results on benchmark tasks demonstrate that our method, grounded in neuroscientific principles, improves learning efficiency and overall performance compared to conventional RL approaches, highlighting the benefits of explicitly modeling local state structures and adaptive feature selection as observed in biological systems.

2606.18466 2026-06-18 cs.CL New submission

Montreal Forced Aligner and the state of speech-to-text alignment in 2026

Michael McAuliffe, Kaylynn Gunter, Michael Wagner, Morgan Sonderegger

Affiliations: University of Wisconsin-Madison; McGill University; Centre for Brain, Language, and Music; University of Oregon

AI summary: Documents the development of MFA 3.0 since version 1.0 and evaluates its performance on English, Japanese, and Korean, reaching state-of-the-art or near state-of-the-art performance on four benchmark datasets with mean boundary errors below 15 ms.

Abstract

The Montreal Forced Aligner (MFA) was released in 2016 and has since become the most widely used tool for forced alignment in research and industry. In the decade since, MFA has undergone substantial development, including expanded coverage across more languages and dialects using larger open-source datasets, harmonized IPA dictionaries, model adaptation, cross-language phone remapping, and support utilities. This paper documents MFA 3.0's developments since version 1.0 and evaluates MFA's performance across English, Japanese, and Korean, benchmarked against classic and neural forced aligners. MFA 3.0 achieves state-of-the-art or near state-of-the-art performance across all four benchmark datasets with mean boundary errors below 15 ms. Adaptation and cross-language remapping are effective for languages outside MFA's training distribution, and pronunciation probability modeling and phonological rules provide gains in specific conditions.

2606.18465 2026-06-18 cs.LG cs.AI New submission

What Does the Weight Norm Control in Grokking? Logit-Scale Mediation under Cross-Entropy

Truong Xuan Khanh

Affiliations: H&K Research Studio, Clevix LLC

AI summary: By holding the weight norm fixed and varying the output temperature, the paper finds that the grokking delay is determined mainly by the logit scale; the weight norm acts only indirectly, through its effect on the logit scale.

Comments 16 pages, 10 tables and 4 figures. Code and data to reproduce all numbers, tables, and figures: https://github.com/ClevixLab/grokking-logit-scale

Abstract

Grokking, the delayed jump from memorization to generalization, is usually tied to the weight norm: a smaller norm generalizes sooner. We ask what the norm actually controls. Holding the weight norm fixed by clamping and varying only an output temperature, we slide the grokking delay across its entire norm-induced range under cross-entropy; matching the effective logit scale back to baseline recovers about 85% of the delay at two moduli. Across a grid of norms and temperatures the delay collapses onto the logit scale alone (R^2 = 0.97), with the norm adding 1-2% beyond it. The effect is loss-dependent: under mean-squared error the logit scale is pinned and the norm acts through a different route. A memorization control, a float64 softmax-collapse audit, and a no-LayerNorm transformer point to the same channel. Forking arms from one identical state, the delay follows the held norm value and not the clamp operation, which closes a rescaling-artifact concern. The proximal variable is the logit scale and the softmax saturation it drives; the weight norm is only an upstream handle. All numbers, tables, and figures reproduce from released code and data.
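
The claim that weight norm and output temperature act through one shared logit scale can be illustrated with a small sketch. It is not taken from the paper's released code: the logits are made-up values, and the sketch assumes the logits scale linearly with a single weight-scale factor, which is a simplification.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def max_prob(logits, scale):
    """Largest softmax probability after multiplying the logits by `scale`."""
    return max(softmax([scale * z for z in logits]))

# Hypothetical unit-scale logits for one training example.
base_logits = [2.0, 1.0, 0.0, -1.0]

# Two routes to the same effective logit scale of 4 (assumption: logits are
# linear in the weight scale):
# (a) weights 4x larger, temperature 1
via_norm = max_prob([4.0 * z for z in base_logits], 1.0)
# (b) unchanged weights, temperature 0.25
via_temperature = max_prob(base_logits, 1.0 / 0.25)

# A larger effective scale pushes the softmax toward saturation.
unscaled = max_prob(base_logits, 1.0)
```

Under this assumption both routes give the same softmax output, so any quantity that depends only on the softmax cannot tell them apart.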

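2606.18457 2026-06-18 cs.LG New submission

Task-Restricted Symmetries in Recurrent Weight Space

Simon Dräger

Affiliations: Salk Institute for Biological Studies, La Jolla, CA, USA

AI summary: Analyzes one-layer tanh RNNs in ordered real Schur coordinates and finds functional redundancy in the recurrent matrix under the task distribution: certain nonnormal Schur couplings can be removed without hurting performance, revealing task-restricted approximate functional invariances.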

Comments 6 pages, 2 figures. Accepted at the ICML 2026 Workshop on Weight-Space Symmetries

Abstract

Recurrent networks can contain substantial functional redundancy in weight space: changing a recurrent matrix may leave the input-output rollout nearly unchanged on a task distribution, while similar-scale changes can destroy the same behavior. We study this redundancy in one-layer tanh RNNs using ordered real Schur coordinates. The Schur form separates spectral blocks from directed nonnormal couplings, giving a diagnostic basis for structured ablations that keep the input and readout maps fixed. In a fixed-length copy task, selected nonnormal Schur couplings can be removed with little loss in some trained solutions, whereas other couplings are necessary for accurate autonomous replay. Across flip-flop, sine generation, and context-dependent integration, the loss-preserving ablation profile varies across tasks and trained solutions. These results identify candidate approximate functional invariances, not universal symmetries of recurrent weight space. Schur-coordinate ablations provide a practical diagnostic for which structured perturbations preserve a trained recurrent solution and which ones disrupt its computation.

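2606.18454 2026-06-18 cs.LG cs.AI New submission

Veriphi: Attack-Guided Neural Network Verification with Dataset-Dependent Training Methods

Pratik Deshmukh, Kartik Arya, Vasili Savin

Affiliations: TU Wien

AI summary: Proposes the Veriphi system, which combines fast adversarial attacks with formal bound verification via α,β-CROWN. Experiments show that training-method effectiveness depends on the dataset: IBP works on MNIST but fails on CIFAR-10, PGD adversarial training reaches 94% certified accuracy at small perturbations, and the system achieves a 5x verification speedup.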

Comments 17 Pages, 8 Figures

Abstract

We present Veriphi, a GPU-accelerated neural network verification system that combines fast adversarial attacks with formal bound certification using α,β-CROWN methods. Through systematic experiments on MNIST and CIFAR-10 using three training methodologies (standard, adversarial, certified), we demonstrate that training method effectiveness is fundamentally dataset-dependent. Interval Bound Propagation (IBP) achieves 78% certified accuracy on simple MNIST (784 dimensions) but provides negligible certification performance on the more complex CIFAR-10 dataset, where PGD adversarial training dominates with 94% certification at small perturbations. We achieve 5x verification speedup through attack-guided falsification and scale our approach to production-size models (105.8M parameters) for real-world aerospace logistics optimization. Our results challenge the assumption that certified training universally outperforms adversarial training, showing context matters critically for verification strategy selection.
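
The attack-first, certify-second idea can be sketched on a toy network. This is a minimal illustration and not Veriphi's implementation: it uses plain interval bound propagation in place of α,β-CROWN, a brute-force corner search in place of a real adversarial attack, and a hypothetical two-layer ReLU network with hand-picked weights.

```python
import itertools

def affine(W, b, x):
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(x):
    return [max(0.0, v) for v in x]

def forward(layers, x):
    """Run the network; ReLU follows every layer except the last."""
    for i, (W, b) in enumerate(layers):
        x = affine(W, b, x)
        if i < len(layers) - 1:
            x = relu(x)
    return x

def ibp_bounds(layers, lo, hi):
    """Propagate elementwise input intervals [lo, hi] to output intervals."""
    for i, (W, b) in enumerate(layers):
        new_lo, new_hi = [], []
        for row, bi in zip(W, b):
            # A positive weight takes the lower input bound for the lower
            # output bound; a negative weight takes the upper input bound.
            new_lo.append(bi + sum(w * (l if w >= 0 else h)
                                   for w, l, h in zip(row, lo, hi)))
            new_hi.append(bi + sum(w * (h if w >= 0 else l)
                                   for w, l, h in zip(row, lo, hi)))
        lo, hi = new_lo, new_hi
        if i < len(layers) - 1:
            lo, hi = relu(lo), relu(hi)
    return lo, hi

def corner_attack(layers, x, eps, label):
    """Stand-in attack: test every corner of the L-infinity ball around x."""
    for signs in itertools.product((-1.0, 1.0), repeat=len(x)):
        cand = [xi + s * eps for xi, s in zip(x, signs)]
        out = forward(layers, cand)
        if max(range(len(out)), key=out.__getitem__) != label:
            return cand
    return None

def check(layers, x, eps, label):
    # Cheap falsification first: a counterexample makes certification moot.
    if corner_attack(layers, x, eps, label) is not None:
        return "falsified"
    lo, hi = ibp_bounds(layers, [xi - eps for xi in x], [xi + eps for xi in x])
    robust = all(lo[label] > hi[j] for j in range(len(lo)) if j != label)
    return "certified" if robust else "unknown"

# Hypothetical network: identity hidden layer, then a difference readout.
toy_layers = [([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
              ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0])]
small_eps_result = check(toy_layers, [2.0, 0.5], 0.1, 0)
large_eps_result = check(toy_layers, [2.0, 0.5], 1.0, 0)
```

The "unknown" outcome matters in practice: interval bounds are sound but loose, so failing to certify does not mean a counterexample exists.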

2606.18453 2026-06-18 cs.CL New submission

LLM Parameters for Math Across Languages: Shared or Separate?

Behzad Shomali, Luisa Victor, Tim Selbach, Ali Hamza Bashir, David Berghaus, Joachim Koehler, Mehdi Ali, Markus Frey

Affiliations: Lamarr Institute; University of Bonn; Fraunhofer IAIS

AI summary: Through a cross-lingual mechanistic analysis, finds that math-associated parameters in multilingual LLMs partially overlap across languages, with the overlap concentrated in intermediate layers; English yields the largest parameter set, and lower-resource languages yield smaller ones.

Comments 5 pages. Accepted at ACL Student Research Workshop (SRW) 2026. Code: https://github.com/luisavictor/math-across-languages Translated Datasets: https://huggingface.co/math-across-languages Webpage: https://math-across-languages.github.io

Abstract

Large language models (LLMs) exhibit substantial cross-lingual variation in mathematical reasoning performance, but it remains unclear whether these differences reflect language-specific parameters or a shared mechanism that manifests differently by language. We present a cross-lingual mechanistic analysis of mathematical reasoning in LLMs, enabling us to localize and compare model parameters that support mathematical reasoning across languages. We find that the extracted math-associated parameters exhibit partial cross-lingual overlap, with the strongest overlap concentrated in intermediate model layers. We further observe that English consistently produces the largest set of math-relevant parameters, whereas lower-resource languages reveal smaller sets of relevant parameters. These results suggest that math-related behavior in multilingual LLMs is neither fully language-invariant nor fully language-specific, but instead exhibits partial cross-lingual parameter overlap with systematic language-dependent differences.

2606.18451 2026-06-18 cs.LG New submission

A Cross-Model VLM-Judge Protocol for Single-Image 3D Mesh Quality (and Why Cheap Proxies Fall Short)

Ali Asaria, Tony Salomone, Deep Gandhi

Affiliations: Transformer Lab

AI summary: Proposes a reproducible VLM-judge protocol for evaluating single-image 3D mesh quality and finds that cheap proxies such as geometry validity and render-CLIP cannot substitute for the VLM judge.

Abstract

Single-image-to-3D generators are improving quickly, but there is no agreed, human-free way to tell whether one generated mesh is better than another. Practitioners commonly rely on cheap automatic proxies (render-space CLIP similarity and mesh geometry-validity statistics), yet how well these track perceived quality is unestablished. We make two contributions. First, we propose and validate a reproducible VLM-judge evaluation protocol: a fixed 24-view headless render rig, two independent vision-language judge families, and a mandatory position-bias correction that queries both presentation orders and keeps only order-consistent verdicts. The two judge families agree substantially with each other (Cohen's kappa = 0.66), well above the chance-agreement floor. Second, using this protocol as the reference, we show the cheap proxies do not substitute for it. Geometry validity is only a weak signal on average (because, as we show, it is bimodal) and stays below our pre-registered target, while render-CLIP is at chance. A learned Bradley-Terry head collapses onto a single manifoldness statistic (giving render-CLIP a negative weight) and matches geometry-only exactly, so learning the feature weights buys nothing. The proxy is also bimodal: it is significantly above chance on contrasts with visible geometric defects but at chance on ambiguous contrasts, consistent with geometry validity tracking the judge only when the defect is visually salient. We therefore recommend the VLM-judge protocol as a reliable, reproducible evaluator under the conditions tested (two feed-forward generators on Google Scanned Objects, with a face-drop degradation regime) and advise against geometry/CLIP proxies as optimization targets.
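
Two pieces of the protocol, the position-bias correction and the inter-judge agreement statistic, can be sketched as follows. This is an illustrative sketch, not the authors' code: the `judge` callable, its "first"/"second" return values, and the numeric stand-ins for meshes are all assumptions.

```python
from collections import Counter

def order_consistent_verdict(judge, mesh_a, mesh_b):
    """Query both presentation orders; keep the verdict only if they agree.

    `judge(x, y)` is a hypothetical callable returning "first" or "second"
    for whichever of the two presented items it prefers.
    """
    winner_ab = "a" if judge(mesh_a, mesh_b) == "first" else "b"
    winner_ba = "b" if judge(mesh_b, mesh_a) == "first" else "a"
    return winner_ab if winner_ab == winner_ba else None

def cohens_kappa(labels_1, labels_2):
    """Cohen's kappa for two raters over the same items."""
    n = len(labels_1)
    observed = sum(x == y for x, y in zip(labels_1, labels_2)) / n
    c1, c2 = Counter(labels_1), Counter(labels_2)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(c1[k] * c2[k] for k in c1) / (n * n)
    return (observed - expected) / (1 - expected)

# A judge that compares hypothetical quality scores, and one that is purely
# position-biased (always prefers whatever is shown first).
quality_judge = lambda x, y: "first" if x > y else "second"
biased_judge = lambda x, y: "first"

kept = order_consistent_verdict(quality_judge, 3, 1)
dropped = order_consistent_verdict(biased_judge, 3, 1)
kappa = cohens_kappa(["a", "a", "b", "b"], ["a", "a", "b", "a"])
```

The position-biased judge contradicts itself when the order is swapped, so its verdict is discarded. Note that `cohens_kappa` divides by zero when both raters always give the same single label; a real implementation would need to handle that case.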

2606.18448 2026-06-18 cs.CL New submission

VISUALSKILL: Multimodal Skills for Computer-Use Agents

Ziyan Jiang, Li An, Yujian Liu, Jiabao Ji, Qiucheng Wu, Jacob Andreas, Yang Zhang, Shiyu Chang

Affiliations: UC Santa Barbara; MIT CSAIL; MIT-IBM Watson AI Lab

AI summary: Proposes VISUALSKILL, a hierarchical multimodal skill library built by combining documentation with UI exploration, which raises the agent's average score on CUA benchmarks by 15.3 points and shows that multimodal skills outperform text-only skills.

Abstract

Computer-use agents (CUAs) approach human-level performance on standardised benchmarks but still struggle on long-horizon tasks and unseen software. Existing skill libraries address this with reusable skills, but represent the skill artifact as text only, despite the visual nature of GUI interaction. We propose VISUALSKILL: a hierarchical multimodal skill, tailored to each target application and organised as a central index over per-topic files, which the agent consumes through a load_topic MCP tool that fetches the relevant topic's text and figures on demand. We construct each skill with a two-stage pipeline that combines authored documentation with live-application UI exploration. On two CUA benchmarks, CUA-World and OSExpert-Eval, a Claude Code CLI agent backed by Claude Opus 4.6 reaches an average score of 0.456 with VISUALSKILL, a +15.3 point absolute lift over the no-skill baseline (0.303). Against a matched text-only skill that is generated from the same source content and differs from VISUALSKILL only in modality, VISUALSKILL yields a further +8.3 point absolute gain (0.373 vs. 0.456), providing direct evidence that retaining visual figures in the skill artifact, rather than verbalizing them away, helps the agent both identify UI elements and verify workflow state after each action. Our code is available at https://github.com/XMHZZ2018/VisualSkills.

2606.18444 2026-06-18 cs.LG cs.AI New submission

TMR-GGNN: Credit Card Fraud Detection based on Time-Aware Multi-Relational Guided Graph Neural Network

Rohit Tewari, Shubhankar Shilpi, Navin Chhibber, Devendra Singh Parmar, Sunil Khemka, Piyush Ranjan

Affiliations: Unysis Truist Banks Infinity Tech Group Technical Product; Fairfax, USA; Atlanta, USA; Sunnyvale, USA; Persistent Systems IEEE Vice Chair AeroSpace Chapter; Discover Financial Services; Edison, USA

AI summary: Proposes the TMR-GGNN framework, which addresses data imbalance and evolving fraud patterns by modeling heterogeneous entity interactions over temporal windows, constructing a dynamic multi-relational graph, applying a time-aware attention mechanism and a contrastive-learning decoder, and training with a composite loss that combines InfoNCE and Focal Loss.

Comments 2025 2nd International Conference on Software, Systems and Information Technology (SSITCON), Pages 7

Abstract

In recent years, credit card fraud detection has faced significant challenges due to highly imbalanced data, evolving fraud patterns, and complex relational structures among transaction entities. To address these issues, this research proposes a novel framework called the Time-Aware Multi-Relational Guided Graph Neural Network (TMR-GGNN). Specifically, TMR-GGNN extends the encoder-decoder Graph Neural Network (GNN) architecture by modeling heterogeneous interactions across customers, merchants, devices, and IPs over temporal windows. It constructs a dynamic, multi-relational graph and incorporates a time-aware relational attention mechanism within the encoder to adaptively weigh transaction relevance based on temporal proximity and semantic context. The decoder employs a contrastive learning module to distinguish between real and synthesized transaction patterns while improving the model's generalization to rare fraud cases. Additionally, to manage severe class imbalance and emphasize discriminative learning, a composite loss function combining an Information Noise Contrastive Estimation (InfoNCE) based contrastive loss with Focal Loss is introduced. This integration helps improve fraud identification while reducing false negatives.
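
A composite of an InfoNCE contrastive term and a focal-loss classification term might look like the sketch below. The abstract does not give the formulas or hyperparameters, so the sketch uses the standard textbook forms of both losses; `gamma`, `alpha`, `temperature`, and the mixing `weight` are illustrative assumptions, not the paper's values.

```python
import math

def focal_loss(p_fraud, is_fraud, gamma=2.0, alpha=0.25):
    """Binary focal loss for one transaction; p_fraud must lie in (0, 1)."""
    p_t = p_fraud if is_fraud else 1.0 - p_fraud
    alpha_t = alpha if is_fraud else 1.0 - alpha
    # The (1 - p_t)^gamma factor down-weights examples already classified well.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def info_nce(sim_pos, sim_negs, temperature=0.1):
    """InfoNCE loss for one anchor: one positive and several negatives."""
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    m = max(logits)
    # Log-sum-exp over all candidates, computed stably.
    log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_denominator - logits[0]

def composite_loss(p_fraud, is_fraud, sim_pos, sim_negs, weight=0.5):
    """Hypothetical weighted sum of the two terms."""
    return focal_loss(p_fraud, is_fraud) + weight * info_nce(sim_pos, sim_negs)

# A confidently detected fraud case contributes far less focal loss than a
# missed one, which is how the loss keeps attention on rare hard examples.
easy = focal_loss(0.9, True)
hard = focal_loss(0.1, True)
```

How the two terms are weighted against each other is a design choice the abstract leaves open; a plain weighted sum is only the simplest option.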

2606.18441 2026-06-18 cs.CV New submission

Reasoning as Intersection: Consensus-Frame Alignment for Visual Focus in Video-MLLMs

Chengwen Liu, Zhe Huang, Jisheng Dang, Hong Peng, Qi Tian, Tat-Seng Chua

Affiliations: School of Information Science and Engineering, Lanzhou University; Beijing University of Posts and Telecommunications; Cloud and AI BU, Huawei; School of Computing, National University of Singapore

AI summary: Proposes CF-GRPO, a process-level reward framework that needs no temporal annotations: it builds a consensus frame prior from intrinsic video cues and uses a consensus frame reward to align the model's frame usage with that prior, improving video reasoning performance.

Abstract

Reinforcement learning has improved the reasoning ability of large language models, but applying outcome-only rewards to video multimodal large language models (Video-MLLMs) provides limited guidance on which visual evidence should support the answer. Inspired by multisensory integration, where consistent cues can enhance the salience and reliability of perceptual estimates, we introduce Consensus Frame GRPO (CF-GRPO), a temporal-annotation-free process-level reward framework for evidence-aware video reasoning. CF-GRPO constructs a consensus frame prior from intrinsic video cues, including temporal coverage, scene-transition cues, and query-conditioned visual relevance. It then computes a model-side frame-use score from visual and response representations and optimizes their agreement through the Consensus Frame Reward (CFR). With salience-aware sparse aggregation and distribution sharpening, CFR provides a high-contrast reward signal without requiring human temporal annotations. Experiments show that VideoCFR achieves competitive performance across complex video reasoning benchmarks and improves several metrics over representative Video-MLLM and RL baselines, while the consensus prior provides an interpretable view of the evidence frames emphasized during training. The implementation is available at https://github.com/1Pansy/VideoCFR.

2606.18439 2026-06-18 cs.CV cs.RO New submission

RegimeVGGT: Layer-Wise Spatially Preserving Redundancy Removal for Visual Geometry Grounded Transformer

Jinhao You, Shuo Lyu, Zhuohang Lyu, Tanxuan Li, Zibo Zhao, Jiaxiang Hu, Kai Tang, Yichen Guo

Affiliations: University of Pennsylvania; University of California, Irvine; Nanyang Technological University

AI summary: Proposes RegimeVGGT, which removes redundancy through layer-wise U-shaped compression (Saliency-Guided Banded Merging and Selectively Protected K/V Downsampling), achieving a 6.7x speedup while preserving reconstruction quality.

Comments 9 pages, 3 figures, 7 tables. Jinhao You, Shuo Lyu, Zhuohang Lyu, Tanxuan Li, and Zibo Zhao contributed equally. Shuo Lyu is the corresponding author

Abstract

Visual Geometry Grounded Transformer (VGGT) recovers dense 3D scene structure from multi-view images in one forward pass, but quadratic cross-frame attention limits its scalability. Existing training-free accelerators reduce computation uniformly along one axis, missing layer heterogeneity. Our spectral, probing, and causal analyses reveal three regimes: shallow layers lack cross-view structure, middle layers drive cross-view alignment, and deep layers are redundant for dense geometry yet their cross-frame attention remains essential for pose. RegimeVGGT applies layer-wise U-shaped compression along two axes: Saliency-Guided Banded Merging protects geometry- and edge-salient tokens, while Selectively Protected K/V Downsampling preserves cross-frame spatial coverage and the pose-critical path through a phase-shifted spatial grid, a reference-frame anchor, and uncompressed camera/register tokens. Training-free, RegimeVGGT achieves a 6.7x speedup over VGGT* at matched reconstruction quality.

2606.18431 2026-06-18 cs.LG cs.DC New submission

Beyond Prediction: Tail-Aware Scheduling for LLM Inference

Yueying Li, Yuanfan Chen, Jiayang Chen, Esha Choukse, Haoran Qiu, G. Edward Suh, Rodrigo Fonseca, Ziv Scully, Udit Gupta

Affiliations: Cornell University, Computer Science Department; Cornell University, Electrical and Computer Engineering Department; Cornell University, Operations Research and Information Engineering Department; Microsoft Azure System Research; NVIDIA Corporation

AI summary: To address the fragility of length-prediction-based scheduling for LLM inference under distribution shift and its limited control over tail latency, the paper proposes a prediction-free, distribution-aware scheduling framework that applies soft priority boosting driven by lightweight statistical signals, combined with cache-aware preemption; across workloads it reduces P99 TTLT by 35-50% and TTFT by 34-47%.

Journal ref Forty-Third International Conference on Machine Learning (2026)

详情
AI中文摘要

LLM服务表现出极端的长度可变性,使得基于大小的调度在实践中变得困难。最近的LLM调度器使用预测的解码长度或排名来近似SJF/SRPT,并主要报告均值中心指标如TTFT和TBT。我们表明,这些预测驱动的策略在分布偏移、突发到达和GPU内存压力下可能脆弱,同时对主导用户体验的尾延迟(P90-P99)控制有限,即使拥有完美的解码长度知识。我们引入了一个分布感知、无预测的调度框架,用由轻量统计信号驱动的软优先级提升取代显式长度预测。我们的设计协同优化调度和缓存感知抢占,以考虑跨工作负载混合的内存耦合解码动态。在生产环境和开源轨迹上的评估表明,相对于具有完美长度知识的SRPT,我们的方法将P99 TTLT降低了高达35-50%,并在各种工作负载(包括推理密集型和聊天密集型任务)上将TTFT降低了34-47%。这些结果证明了在在线LLM服务中优化尾延迟的稳健替代方案。

英文摘要

LLM serving exhibits extreme length variability, making size-based scheduling difficult in practice. Recent LLM schedulers approximate SJF/SRPT using predicted decode lengths or ranks and primarily report mean-centric metrics such as TTFT and TBT. We show that these prediction-driven policies can be fragile under distribution shifts, bursty arrivals, and GPU memory pressure, while offering limited control over the tail latency (P90-P99) that dominates user experience, even with perfect decode-length knowledge. We introduce a distribution-aware, prediction-free scheduling framework that replaces explicit length prediction with soft priority boosting driven by lightweight statistical signals. Our design co-optimizes scheduling and cache-aware preemption to account for memory-coupled decode dynamics across workload mixes. Evaluated on production and open-source traces, our method reduces P99 TTLT by up to 35-50% relative to SRPT with perfect length knowledge and reduces TTFT by 34-47% across workloads, including reasoning-heavy and chat-heavy tasks. These results demonstrate a robust alternative for optimizing tail latency in online LLM serving.

2606.18430 2026-06-18 cs.LG cs.CR 新提交

Signature filtering: a lightweight enhancement for statistical watermark detection in large language models

签名过滤:大型语言模型中统计水印检测的轻量级增强方法

Chih-Duo Hong, Yen-Pang Chen, Fang Yu

发表机构 * National Chengchi University(国立政治大学)

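AI summary: Proposes signature filtering, a module that removes signature tokens that interfere with watermark detection, raising detection rates in weak-signal and low-entropy settings from 8-31% to 78-99% while keeping the false positive rate controllable.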

详情
AI中文摘要

统计水印帮助组织归因大型语言模型(LLM)的输出,但现有检测器在水印信号弱、文本重复或水印被编辑时往往表现不佳。我们提出签名过滤,一种检测时模块,在不修改水印嵌入和文本生成的情况下增强水印检测。它学习一小部分“签名”令牌,这些令牌的存在会使水印测试不可靠,并在检测前移除这些令牌。通过在小训练集上求解混合整数线性规划获得签名,约束条件最大化真阳性率。我们还推导了在几种攻击者模型(色盲、颜色自适应和分布相关)下的有限样本和渐近界。在四个知名水印家族(Kgw、Sweet、Unigram、Exp)、四个基准语料库(C4、MBPP、HumanEval、Code-Search-Net)和六个LLM(Opt-1.3b、Opt-6.7b、Llama2-13b、Llama3.1-8b、Qwen2.5-14b、Phi-3-medium-14b)上,2-gram和3-gram签名在弱信号和低熵设置下将检测率从无过滤时的8-31%提升至78-99%,同时保持假阳性率可控且通常可忽略。在压力测试中,我们打乱句子并稀释、删除和替换25-50%的令牌,针对Kgw风格水印的2-gram过滤器保留了大部分干净文本的检测增益,通常匹配或超越先进的WinMax水印检测器。因此,签名过滤提供了一种简单、可扩展且模型无关的附加组件,以加强信息处理工作流中LLM文本基于水印的来源检查。

英文摘要

Statistical watermarks help organizations attribute large language model (LLM) outputs, yet existing detectors often struggle when watermark signals are weak, texts are repetitive, or watermarks are edited. We propose signature filtering, a detection-time module that enhances watermark detection without modifying watermark embedding and text generation. It learns a small set of "signature" tokens whose presence makes watermark tests unreliable, and removes these tokens before detection. The signatures are obtained by solving a mixed-integer linear program on a small training set, with constraints that maximize the true positive rate. We additionally derive finite-sample and asymptotic bounds under several attacker models (color-blind, color-adaptive, and distributionally correlated). On four well-known watermark families (Kgw, Sweet, Unigram, Exp), four benchmark corpora (C4, MBPP, HumanEval, Code-Search-Net), and six LLMs (Opt-1.3b, Opt-6.7b, Llama2-13b, Llama3.1-8b, Qwen2.5-14b, Phi-3-medium-14b), 2- and 3-gram signatures raise detection rates in weak-signal and low-entropy settings from 8-31% without filtering to 78-99% with filtering, while keeping false positives controllable and often negligible. In stress tests where we scramble sentences and perturb 25-50% of tokens by dilution, deletions, and substitutions, 2-gram filters for Kgw-style watermarks preserve most of the clean-text detection gains, often matching or outperforming the advanced WinMax watermark detector. Signature filtering thus provides a simple, scalable, and model-agnostic add-on to strengthen watermark-based provenance checks for LLM text in information processing workflows.
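To make the filter-then-detect idea concrete, here is a minimal sketch, not the paper's implementation. It assumes a KGW-style detector that scores text with the usual one-proportion z-statistic over green-list hits, and it assumes the per-token green flags are computed before filtering and simply carried along. The function names, the toy tokens, and the hand-picked signature set are all hypothetical; in the paper the signatures are learned by a mixed-integer linear program.

```python
import math

def kgw_z_score(green_flags, gamma=0.5):
    """One-proportion z-statistic: how far the green-token count
    exceeds the gamma * T expected for unwatermarked text."""
    t = len(green_flags)
    if t == 0:
        return 0.0
    return (sum(green_flags) - gamma * t) / math.sqrt(t * gamma * (1 - gamma))

def filter_signatures(tokens, green_flags, signatures, n=2):
    """Drop every token covered by an n-gram listed in `signatures`
    (hypothetical removal rule; the paper may filter differently)."""
    drop = set()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) in signatures:
            drop.update(range(i, i + n))
    kept = [i for i in range(len(tokens)) if i not in drop]
    return [tokens[i] for i in kept], [green_flags[i] for i in kept]

# Toy low-entropy text: a repeated bigram that never lands on the green list,
# followed by eight tokens that all do.
tokens = ["import", "os"] * 4 + list("abcdefgh")
flags = [False] * 8 + [True] * 8

z_before = kgw_z_score(flags)
_, kept_flags = filter_signatures(tokens, flags, {("import", "os")})
z_after = kgw_z_score(kept_flags)
```

In this toy case the repeated bigram dilutes the statistic to zero, and removing it leaves only the eight green tokens, so the z-score rises to about 2.83. Real detectors derive green flags from a keyed hash of the preceding context, which this sketch does not model.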

2606.18429 2026-06-18 cs.CV cs.AI cs.LG 新提交

CAOA -- Completion-Assisted Object-CAD Alignment

CAOA -- 补全辅助的物体-CAD对齐

Hiranya Garbha Kumar, Minhas Kamal, Balakrishnan Prabhakaran

发表机构 * University at Albany(奥尔巴尼大学)

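AI summary: Proposes CAOA, which combines semantically aware point cloud completion with symmetry-aware relative pose estimation, achieving a 17% accuracy improvement on Scan2CAD, and releases the S2C-Completion dataset.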

Comments GitHub: https://github.com/MinhasKamal/CAOA

Journal ref Thirteenth International Conference on 3D Vision (3DV), 2026

详情
AI中文摘要

准确地将CAD模型与室内RGB-D扫描中的对应物体对齐是3D语义重建的核心挑战。该任务需要估计9自由度(DoF)位姿——位置、旋转和三轴尺度——但受到噪声和不完整扫描以及导致几何畸变的分割误差的阻碍。我们提出补全辅助的物体-CAD对齐(CAOA),该方法将语义和上下文感知的点云补全模块与对称感知的相对位姿估计算法相结合,实现CAD模型与扫描物体的精确对齐。现有的补全方法通常在合成数据集上训练和评估,往往难以泛化到真实扫描。为弥合这一差距,我们引入了一种针对室内场景的合成数据生成策略,通过与广泛使用的补全数据集进行定量比较,验证了其显著减小合成到真实领域差距的效果。此外,我们发布了S2C-Completion,一个来自Scan2CAD的超过8500个物体-CAD对的专家标注数据集,用于真实室内单物体补全,并作为该任务的新基准。对于物体-CAD对齐,我们通过对称感知损失融入对称信息,提高了对对称模糊的鲁棒性。在Scan2CAD基准上,CAOA相比最先进方法实现了17%的精度提升。

英文摘要

Accurately aligning CAD models to their corresponding objects in indoor RGB-D scans is a central challenge in 3D semantic reconstruction. The task requires estimating a 9-Degree-of-Freedom (DoF) pose (position, rotation, and scale along three axes), but is hindered by noisy and incomplete scans, as well as segmentation errors that cause geometric distortions. We present Completion-Assisted Object-CAD Alignment (CAOA), a method that integrates a semantically and contextually aware point cloud completion module with a symmetry-aware relative pose estimation algorithm, enabling precise alignment of CAD models to scanned objects. Existing completion methods are typically trained and evaluated on synthetic datasets, which often fail to generalize to real-world scans. To bridge this gap, we introduce a synthetic data generation strategy tailored to indoor scenes, significantly reducing the synthetic-to-real domain gap, as validated through quantitative comparisons with widely used completion datasets. In addition, we release S2C-Completion, an expert-annotated dataset of over 8,500 object-CAD pairs from Scan2CAD, created for real-world indoor single-object completion and intended as a new benchmark for this task. For object-CAD alignment, we incorporate symmetry information via a symmetry-aware loss, improving robustness to symmetric ambiguities. On the Scan2CAD benchmark, CAOA achieves a 17% accuracy improvement over state-of-the-art methods.

2606.18426 2026-06-18 cs.RO 新提交

VEGA: Learning Navigation VLAs from In-the-Wild Egocentric Video with Geometric Trajectory Supervision

VEGA: 从野外自我中心视频中通过几何轨迹监督学习导航VLA

Gershom Seneviratne, Yohan Abeysinghe, Jianyu An, Vaibhav Shende, Dinesh Manocha

发表机构 * University of Maryland, College Park(马里兰大学帕克分校)

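AI summary: Proposes VEGA, which reconstructs scene geometry from unlabeled egocentric video to generate obstacle-aware trajectories and trains a flow-matching VLA navigation policy, reducing collisions by 33.0% on VEGA-Bench and improving real-world success by at least 150.0%.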

详情
AI中文摘要

我们提出了VEGA,一种从未标注的自我中心导航视频中训练导航视觉-语言-动作(VLA)模型的方法。互联网规模的自我中心视频提供了可扩展的导航相关视觉观察来源,捕捉了杂乱场景、近距离障碍物以及通过真实世界空间的自然人体运动。然而,这些视频不能直接用于策略学习,因为它们没有提供在机器人坐标系中基于显式导航目标的障碍感知轨迹。VEGA通过从单目视频重建局部场景几何、采样导航目标(表示为文本、图像或空间路径点)并利用构建的几何生成障碍感知轨迹来解决这一差距。生成的轨迹分布随后用于训练流匹配VLA导航策略。通过仅在训练期间使用几何,VEGA将障碍感知规划直接蒸馏到基于视觉的策略中。此外,我们引入了VEGA-Bench,一个包含25万场景和约500万个导航目标(与场景几何配对)的基准,旨在评估VLA的目标进展、碰撞避免和障碍物间隙。我们的评估表明,VEGA在VEGA-Bench上实现了有竞争力的目标进展,同时相比最强基线碰撞减少33.0%,障碍物间隙提高17.9%,在真实世界试验中成功率至少提高150.0%,碰撞至少减少66.7%,障碍物间隙至少提高60.0%。最终,我们证明了视频衍生的几何监督为训练障碍感知导航VLA提供了可扩展且有效的信号。代码和基准将在发表时发布。

英文摘要

We introduce VEGA, an approach for training navigation Vision-Language-Action (VLA) models from unlabeled egocentric navigation videos. Internet-scale egocentric videos provide a scalable source of navigation-relevant visual observations, capturing cluttered scenes, close-range obstacles, and natural human motion through real-world spaces. However, these videos are not directly usable for policy learning because they do not provide obstacle-aware trajectories conditioned on explicit navigation goals in the robot's coordinate frame. VEGA addresses this gap by reconstructing local scene geometry from monocular video, sampling navigation goals (represented as text, image, or spatial waypoints) and generating obstacle-aware trajectories using the constructed geometry. The resulting trajectory distribution is then used to train a flow-matching VLA navigation policy. By using geometry exclusively during training, VEGA distills obstacle-aware planning directly into a vision-based policy. Furthermore, we introduce VEGA-Bench, a benchmark containing 250k scenes and approximately 5 million navigation goals paired with scene geometry, designed to evaluate goal progress, collision avoidance, and obstacle clearance of VLAs. Our evaluation shows that VEGA achieves competitive goal progress while reducing collisions by 33.0% and improving obstacle clearance by 17.9% over the strongest baseline on VEGA-Bench, while improving success by at least 150.0%, reducing collisions by at least 66.7%, and improving obstacle clearance by at least 60.0% in real-world trials. Ultimately, we demonstrate that video-derived geometric supervision provides a scalable and effective signal for training obstacle-aware navigation VLAs. The code and benchmark will be released at the time of publication.

2606.18420 2026-06-18 cs.LG q-bio.QM stat.ML 新提交

Measurement noise limits the advantage of nonlinear models over linear models in biomedical prediction

测量噪声限制了非线性模型在生物医学预测中相对于线性模型的优势

Marc-Andre Schulz, Kerstin Ritter

发表机构 * Hertie Institute for AI in Brain Health, University of Tübingen(赫蒂人工智能脑健康研究所,图宾根大学) Tübingen AI Center, University of Tübingen(图宾根人工智能中心,图宾根大学) Department of Psychiatry and Neurosciences, Charité – Universitätsmedizin Berlin(精神病学与神经科学系,柏林夏里特医学院) Bernstein Center for Computational Neuroscience, Berlin(伯恩斯坦计算神经科学中心,柏林) German Center for Mental Health (DZPG), partner site Tübingen(德国心理健康中心(DZPG),图宾根合作站点)

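AI summary: Argues that on biomedical tabular data, measurement noise attenuates nonlinear structure, so nonlinear models perform on par with linear ones, and presents an exact excess-risk identity showing that three conditions (measurement reliability, sample size, and feature representation) must hold together for a nonlinear advantage to appear.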

详情
AI中文摘要

在生物医学表格数据上,诸如深度网络、梯度提升树和核方法等灵活模型,在给定相同特征的情况下,反复被线性回归和逻辑回归匹配或击败。通常的反应是将其视为模型方面的不足,需要通过更多数据、更好的架构或调参来修复,假设非线性结构存在而模型未能捕捉到。我们认为,当限制因素是测量而非模型时(这在生物医学中经常发生),这些修复无法奏效。加性噪声模糊了群体最优预测器,并且由于模糊在去除函数的广泛形状之前先去除精细、快速变化的细节,它比线性结构更快地抹去非线性结构。一个k阶交互作用被特征可靠性的k次幂衰减,而线性部分只衰减一次。在生物医学测量典型的可靠性下,即使底层生物学是强非线性的,非线性优势也可能消失,并且噪声所移除的部分无法通过更大的队列或更灵活的模型恢复,只能通过更好的测量。非线性是隐藏的,而非缺失,线性模型与灵活模型之间的平局本身并不能对生物学做出定论。这些片段是经典的,来自测量误差统计、心理测量学和高斯分析,我们将它们组合成一个精确的超额风险恒等式。测量可靠性是与样本量和特征表示并列的三个条件之一,必须对齐才能使灵活模型发挥作用,而它们共同只留下一个狭窄的窗口,大多数生物医学任务落在此窗口之外。在140个英国生物银行任务中,灵活模型与线性模型之间的差距(如果存在)带有预测的噪声特征,并且这三个条件可以通过干预而非仅通过基准测试来分离。

英文摘要

On biomedical tabular data, flexible models such as deep networks, gradient-boosted trees, and kernel methods are repeatedly matched or beaten by linear and logistic regression given the same features. The usual reaction is to treat this as a model-side shortfall, to be fixed with more data, a better architecture, or tuning, on the assumption that the nonlinear structure is there and the model has failed to capture it. We argue that these fixes cannot help when the binding limit is the measurement rather than the model, as it frequently is in biomedicine. Additive noise blurs the population-optimal predictor, and because blurring removes a function's fine, rapidly varying detail before its broad shape, it erases nonlinear structure faster than linear structure. A degree-$k$ interaction is attenuated by the $k$-th power of feature reliability, while the linear part is attenuated only once. At the reliabilities typical of biomedical measurement, the nonlinear advantage can vanish even when the underlying biology is strongly nonlinear, and what the noise removes cannot be recovered by a larger cohort or a more flexible model, only by better measurement. The nonlinearity is hidden, not absent, and a tie between linear and flexible models is not by itself a verdict on the biology. These pieces are classical, drawn from measurement-error statistics, psychometrics, and Gaussian analysis, and we assemble them into an exact excess-risk identity. Measurement reliability is one of three conditions, alongside sample size and feature representation, that must align for a flexible model to help, and together they leave only a narrow window that most biomedical tasks fall outside. Across 140 UK Biobank tasks, the gap between flexible and linear models, where it exists, carries the predicted noise signature, and the three conditions can be separated by intervention but not by a benchmark alone.
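The attenuation claim can be checked numerically in a simple special case. The sketch below is an illustration under stated assumptions, not the paper's analysis: features are independent standard Gaussians, noise is additive, independent, and Gaussian, and reliability is defined as r = 1 / (1 + noise variance). Under those assumptions the best slope on a noisy linear feature is r, while the best slope on a noisy two-way product is r squared. The function name and parameters are hypothetical.

```python
import random

def attenuation_slopes(reliability, n=100_000, seed=0):
    """Least-squares slopes (through the origin, since all terms are zero-mean)
    of a true term on its noisy counterpart, for a linear term and a product."""
    rng = random.Random(seed)
    noise_sd = (1.0 / reliability - 1.0) ** 0.5  # from r = 1 / (1 + noise_var)
    lin_xy = lin_xx = int_xy = int_xx = 0.0
    for _ in range(n):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)   # true features
        w1 = x1 + rng.gauss(0, noise_sd)            # noisy measurements
        w2 = x2 + rng.gauss(0, noise_sd)
        lin_xy += x1 * w1
        lin_xx += w1 * w1
        int_xy += (x1 * x2) * (w1 * w2)
        int_xx += (w1 * w2) ** 2
    return lin_xy / lin_xx, int_xy / int_xx

linear_slope, interaction_slope = attenuation_slopes(0.6)
# linear_slope is close to r, interaction_slope is close to r ** 2
```

With r = 0.6 the linear signal keeps roughly 60% of its slope while the two-way interaction keeps roughly 36%, which is the degree-k pattern the abstract describes for k = 1 and k = 2.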

2606.18406 2026-06-18 cs.CL 新提交

CoreMem: Riemannian Retrieval and Fisher-Guided Distillation for Long-Term Memory in Dialogue Agents

CoreMem: 对话代理中长期记忆的黎曼检索与Fisher引导蒸馏

Jiaqi Chen, Yongqin Zeng, Shaoshen Chen, Yijian Zhang, Hai-Tao Zheng, Chunxia Ma, XiuTeng Zhou

发表机构 * Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院) Peng Cheng Laboratory(鹏城实验室) Shandong Analysis and Test Center, Qilu University of Technology(齐鲁工业大学山东省分析测试中心) State Key Laboratory for Quality Ensurance and Sustainable Use of Dao-di Herbs(道地药材品质保障与可持续利用国家重点实验室)

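AI summary: Proposes the CoreMem architecture, which replaces cosine similarity with Riemannian retrieval to address the hubness problem in high-dimensional retrieval and uses Fisher-guided discrete token distillation for principled compression, enabling long-term-memory dialogue agents on edge devices with 8 GB of VRAM.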

Comments 15 pages, 5 figures

详情
AI中文摘要

个性化对话代理需要持续的长期记忆以在多次会话中维持连贯交互。然而,在消费级硬件(例如8 GB VRAM边缘设备)上部署这些能力会引入严重的内存和计算瓶颈。现有系统通常依赖各向同性余弦相似度进行检索,以及启发式规则进行上下文压缩。这些方法缺乏统一的理论基础,经常在高维检索中遭受枢纽问题,并在压缩过程中出现句法碎片化。为克服这些限制,我们提出CoreMem,一种资源高效的边缘-云记忆架构,从根本上由信息几何统一。首先,黎曼检索用局部自适应Fisher-Rao度量替代余弦匹配,通过马氏距离有效惩罚枢纽记忆,并采用O(Ndr) Woodbury加速实现实时搜索。其次,Fisher引导离散令牌蒸馏(FDTD)引入分层句子到令牌压缩机制。它从Fisher信息迹中推导敏感度分数,提供原则性的压缩-KL权衡,并辅以显式结构句法保护。在LOCOMO和LongMemEval-S基准上评估,CoreMem实现了显著的准确率提升,在开放域(+4.51个百分点)和时间(+4.17个百分点)推理上取得实质性增益。广泛性能分析证实,CoreMem在严格的8 GB VRAM预算内无缝运行,成功弥合了资源受限边缘设备与对理论基础的终身记忆代理需求之间的差距。

英文摘要

Personalized dialogue agents require continuous long-term memory to maintain coherent interactions across multiple sessions. However, deploying these capabilities on consumer-grade hardware (e.g., 8 GB VRAM edge devices) introduces severe memory and compute bottlenecks. Existing systems typically rely on isotropic cosine similarity for retrieval and heuristic rules for context compression. These approaches lack a unified theoretical foundation, frequently suffering from the hubness problem in high-dimensional retrieval and syntactic fragmentation during compression. To overcome these limitations, we propose CoreMem, a resource-efficient edge-cloud memory architecture fundamentally unified by information geometry. First, Riemannian retrieval replaces cosine matching with a locally adaptive Fisher-Rao metric, effectively penalizing hub memories via Mahalanobis distance with O(Ndr) Woodbury acceleration for real-time search. Second, Fisher-guided discrete token distillation (FDTD) introduces a hierarchical sentence-to-token compression mechanism. It derives sensitivity scores from Fisher information traces, providing a principled compression-KL tradeoff augmented with explicit structural syntax protection. Evaluated on the LOCOMO and LongMemEval-S benchmarks, CoreMem achieves strong accuracy improvements, yielding substantial gains in Open-domain (+4.51 pp) and Temporal (+4.17 pp) reasoning. Extensive profiling confirms that CoreMem operates seamlessly within a strict 8 GB VRAM budget, successfully bridging the gap between resource-constrained edge devices and the demand for theoretically grounded, lifelong memory agents.
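One way an O(Ndr) cost can arise from the Woodbury identity is sketched below. The abstract does not specify the form of the metric, so the structure here is an assumption: a scaled identity plus a rank-$r$ term, $\Sigma = \lambda I_d + U U^\top$ with $U \in \mathbb{R}^{d \times r}$. The symbols $\lambda$, $U$, $q$, and $m_i$ are introduced for illustration only.

```latex
% Woodbury identity applied to the assumed metric \Sigma = \lambda I_d + U U^\top
\Sigma^{-1}
  = \frac{1}{\lambda}\Bigl[\, I_d - U \bigl(\lambda I_r + U^\top U\bigr)^{-1} U^\top \Bigr]

% Squared Mahalanobis distance between a query q and a memory m_i,
% with \delta_i = q - m_i and K = (\lambda I_r + U^\top U)^{-1} precomputed once
d^2(q, m_i)
  = \delta_i^\top \Sigma^{-1} \delta_i
  = \frac{1}{\lambda}\Bigl[\, \lVert \delta_i \rVert^2
      - \bigl(U^\top \delta_i\bigr)^\top K \bigl(U^\top \delta_i\bigr) \Bigr]
```

Under this assumption, forming $U^\top \delta_i$ costs $O(dr)$ per memory and the $r \times r$ quadratic form costs $O(r^2)$, so scoring $N$ memories is $O(Ndr)$ when $r \le d$, compared with $O(Nd^2)$ for a dense $d \times d$ inverse.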

2606.18394 2026-06-18 cs.CL 新提交

JetFlow: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

JetFlow: 通过并行树草稿突破推测解码的缩放上限

Lanxiang Hu, Zhaoxiang Feng, Yulun Wu, Haoran Yuan, Yujie Zhao, Yu-Yang Qian, Bojun Wang, Daxin Jiang, Yibo Zhu, Tajana Rosing, Hao Zhang

发表机构 * UC San Diego(加州大学圣地亚哥分校) Zhejiang University(浙江大学) UIUC(伊利诺伊大学厄巴纳-香槟分校) Nanjing University(南京大学) StepFun(阶跃星辰)

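AI summary: Proposes JetFlow, a framework that combines a causal parallel draft head with tree speculative decoding to convert larger draft budgets into longer accepted prefixes and higher end-to-end speedup, reaching up to 9.64x speedup on Qwen3 models.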

详情
AI中文摘要

推测解码(SD)通过草拟多个令牌并并行验证来加速自回归大语言模型(LLM),但面临缩放限制:仅当接受率保持较高且草拟开销较低时,增加草稿预算才能提高速度。这一上限难以突破,因为先前基于头的SD方法面临因果-效率困境。自回归草稿器生成路径条件候选,适用于树推测解码且接受长度更高,但其草拟成本随树深度增长。双向块扩散草稿器一次性生成所有位置,但其分支无关的边缘分布可能形成个体合理但相互不一致的树,浪费预算并降低接受率。我们提出JetFlow,一种基于头的SD框架,结合单次前向草拟效率与分支级因果条件。JetFlow在冻结目标模型的融合隐藏状态上训练因果并行草稿头,生成与目标模型自回归分解对齐的候选树。这使得JetFlow能够将更大的草稿预算转换为更长的接受前缀和更高的端到端加速。在密集和MoE Qwen3模型上的数学、编码和聊天基准测试中,JetFlow始终优于双向头和基于树的SD基线。在H100 GPU上,JetFlow在MATH-500上实现高达9.64倍加速,在开放式对话工作负载上实现4.58倍加速,并通过vLLM集成在实际服务负载下进一步降低延迟。我们的代码和模型可在该https URL获取。

英文摘要

Speculative decoding (SD) accelerates autoregressive Large Language Models (LLMs) by drafting multiple tokens and verifying them in parallel, but it faces a scaling limitation: increasing the draft budget improves speed only when acceptance remains high and drafting overhead stays low. This ceiling has been difficult to break because prior head-based SD methods face a causality-efficiency dilemma. Autoregressive drafters produce path-conditioned candidates that are effective for tree speculative decoding with higher acceptance length, but their drafting cost grows with tree depth. Bidirectional block-diffusion drafters generate all positions in one pass, but their branch-agnostic marginals can form individually plausible yet mutually inconsistent trees, wasting budget and reducing acceptance. We propose JetFlow, a head-based SD framework that combines one-forward drafting efficiency with branch-wise causal conditioning. JetFlow trains a causal parallel draft head over fused hidden states from the frozen target model, producing candidate trees whose scores align with the target model's autoregressive factorization. This enables JetFlow to convert larger draft budgets into longer accepted prefixes and higher end-to-end speedup. Across math, coding, and chat benchmarks on dense and MoE Qwen3 models, JetFlow consistently outperforms bidirectional-head and tree-based SD baselines. On H100 GPUs, JetFlow achieves up to 9.64x speedup on MATH-500 and 4.58x on open-ended conversational workloads, with further latency gains demonstrated through vLLM integration under realistic serving loads. Our code and models are available at https://github.com/hao-ai-lab/JetFlow.
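For readers unfamiliar with tree speculative decoding, the sketch below shows generic greedy verification of a draft tree. It is not JetFlow's drafter or verifier. The nested-dict tree format and the `target_next` callback are hypothetical, and a real system scores all tree nodes in one parallel forward pass of the target model rather than calling it once per step as done here for clarity.

```python
def verify_draft_tree(tree, target_next, prefix=()):
    """Walk a draft tree, keeping the branch that matches the target model's
    greedy choice at each depth. Returns the accepted draft tokens followed by
    one extra token chosen by the target (a correction or a bonus token).

    tree:        nested dict, {token: subtree}, with {} for a leaf
    target_next: function mapping a tuple of tokens to the target's next token
    """
    accepted = []
    node = tree
    while True:
        token = target_next(tuple(prefix) + tuple(accepted))
        accepted.append(token)        # the target's own token is always valid output
        if token not in node:         # no drafted branch matches: stop here
            return accepted
        node = node[token]            # drafted branch accepted: go one level deeper

# Toy target model that always continues the string "abcde".
toy_target = lambda tokens: "abcde"[len(tokens)]

# Draft tree with two branches under "a" and a wrong root alternative "z".
draft = {"a": {"b": {"x": {}}, "c": {}}, "z": {}}
output = verify_draft_tree(draft, toy_target)
```

Here the path a, b is accepted, the drafted "x" is rejected, and the target's "c" is emitted as the correction, so one verification step yields three tokens. The branches "z" and a-c consumed draft budget without contributing, which is the waste the abstract attributes to mutually inconsistent trees.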

2606.18390 2026-06-18 cs.LG q-bio.QM 新提交

MOLAR: Learning Multimodal Molecular Representations from Noisy Labels

MOLAR: 从噪声标签中学习多模态分子表示

Yingxu Wang, Kunyu Zhang, Nan Yin, Yu Li, Eran Segal

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) Zhengzhou University(郑州大学) The Education University of Hong Kong(香港教育大学) The Chinese University of Hong Kong(香港中文大学) Weizmann Institute of Science(魏茨曼科学研究所)

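AI summary: Proposes MOLAR, a framework that separates clean-property inference from label observation and uses residual evidence from graph and text modalities to learn multimodal molecular representations from noisy labels, outperforming baselines on naturally noisy and label-flipping benchmarks.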

详情
AI中文摘要

动机:噪声标签是分子属性预测中的常见挑战,因为分子注释通常来自实验分析、 curated数据库或弱注释流程,而非直接观测到的干净生物状态。将记录标签视为可靠监督会导致模型记忆损坏的观测并学习误导性的分子证据。在多模态分子表示学习中,图-文本融合或对齐可能放大此问题,从而跨模态传播标签引起的错误。结果:我们提出MOLAR,一个从噪声标签中学习多模态分子表示的噪声感知框架。MOLAR将潜在干净属性推断与记录标签观测分离:图和文本视图为干净属性分布贡献残差证据,一个分类标签观测通道将此分布映射到记录标签用于训练。该公式从模型中推导出后验标签可靠性和模态特定的分子证据。在自然噪声分子基准和受控标签翻转基准上的实验表明,MOLAR始终优于代表性基线。可视化分析进一步表明MOLAR提供了可解释的可靠性和模态证据诊断。

英文摘要

Motivation: Noisy labels are a common challenge in molecular property prediction because molecular annotations are often obtained from assays, curated databases, or weak annotation pipelines rather than directly observed clean biological states. Treating recorded labels as reliable supervision can cause models to memorize corrupted observations and learn misleading molecular evidence. In multimodal molecular representation learning, this issue can be amplified by graph-text fusion or alignment, which may propagate label-induced errors across modalities. Results: We propose MOLAR, a noise-aware framework for learning multimodal molecular representations from noisy labels. MOLAR separates latent clean-property inference from recorded-label observation: graph and text views contribute residual evidence to a clean-property distribution, and a categorical label-observation channel maps this distribution to recorded labels for training. This formulation derives posterior label reliability and modality-specific molecular evidence from the model. Experiments on naturally noisy molecular benchmarks and controlled label-flipping benchmarks show that MOLAR consistently outperforms representative baselines. Visualization analyses further show that MOLAR provides interpretable reliability and modality-evidence diagnostics.

2606.18389 2026-06-18 cs.CL 新提交

Want Better Synthetic Data? Steer It: Activation Steering for Low-Resource Language Generation

想要更好的合成数据?引导它:面向低资源语言生成的激活引导

Jan Cegin, Daniil Gurgurov, Yusser Al Ghussin, Simon Ostermann

发表机构 * Kempelen Institute of Intelligent Technologies(肯佩伦智能技术研究所) German Research Institute for Artificial Intelligence (DFKI)(德国人工智能研究中心(DFKI))

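AI summary: Proposes activation steering, in the form of Language Steering and Quality Steering, as an alternative for synthetic data generation in low-resource languages; experiments show that steering early layers can improve data diversity and downstream model performance.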

Comments 25 pages

详情
AI中文摘要

大型语言模型(LLMs)已成为合成数据生成的有效工具,包括低资源语言,生成的数据可以提升下游任务性能。当前最佳方法通常依赖于目标语言示例的少样本提示,这增加了推理成本,并可能通过词汇锚定降低多样性。在这项工作中,我们研究激活引导作为低资源合成数据生成的替代方案。我们研究了两种引导策略:语言引导,针对语言的 linguistic identity;以及质量引导,通过对比人类撰写和反向翻译的文本表示来捕捉良好形式性。我们在四个开源LLM、多个层和11种类型多样的语言上评估这些方法,通过生成情感和主题分类数据并微调较小的分类器。引导在零样本和少样本提示设置中应用,并与非引导对应方法进行比较。我们的结果表明,早期层的引导一致地提高了生成数据的多样性,同时通常产生更强的下游模型性能,特别是对于低资源语言。

英文摘要

Large language models (LLMs) have become an effective tool for synthetic data generation, including for low-resource languages, where generated data can improve downstream task performance. Current best-performing approaches typically rely on few-shot prompting with target-language examples, which increases inference costs and may reduce diversity through lexical anchoring. In this work, we investigate activation steering as an alternative for low-resource synthetic data generation. We study two steering strategies: Language Steering, which targets the linguistic identity of a language, and Quality Steering, which captures well-formedness by contrasting human-written and backtranslated text representations. We evaluate these methods across four open-source LLMs, multiple layers, and 11 typologically diverse languages by generating sentiment and topic classification data and finetuning smaller classifiers. Steering is applied in both zero-shot and few-shot prompting settings and compared against non-steered counterparts. Our results show that steering on early layers consistently improves the diversity of generated data while often yielding stronger downstream model performance, particularly for low-resource languages.

2606.18388 2026-06-18 cs.LG cs.AI cs.CL cs.MA 新提交

LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents

LLMZero: 通过LLM智能体发现RL后训练的自适应训练策略

Haoyang Fang, Wei Zhu, Boran Han, Alex Zhang, Zhenyu Pan, Shuo Yang, Shuai Zhang, Jiading Gai, Peng Tang, Cuixiong Hu, Xuan Zhu, Huzefa Rangwala, George Karypis, Bernie Wang

发表机构 * Amazon(亚马逊)

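AI summary: Proposes LLMZero, a system in which LLM agents use tree search to discover adaptive strategies for multi-stage RL post-training, revealing that capacity parameters accumulate monotonically while regularization parameters oscillate, with relative gains of 9% to 140% over the base model on 4 GRPO tasks.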

详情
AI中文摘要

RL后训练策略依赖于数据集,并揭示了一个反复出现的经验模式:容量参数在阶段间单调累积,而正则化参数主要根据训练动态的变化而振荡。这种区别很重要,因为固定调度将所有参数提交到固定轨迹,因此无法表达正则化必须跟踪的非平稳探索-利用权衡;该原则为多阶段训练提供了可操作的设计规则。我们通过LLMZero发现了这一点,该系统通过树搜索让LLM智能体搜索训练轨迹,诊断每个检查点的病理并提出协调的多参数转换。在4个不同的GRPO任务中,LLMZero发现的策略相对基础模型提升9%到140%,相对网格搜索提升6%到15%,始终优于随机搜索和基于技能的智能体。该结构原则跨任务迁移,解释了为什么发现的策略形式不同但参数动态相似。

英文摘要

RL post-training strategies are dataset-dependent and reveal a recurring empirical pattern: capacity parameters accumulate monotonically across stages, while regularization parameters predominantly oscillate in response to shifting training dynamics. This distinction matters because fixed schedules commit all parameters to fixed trajectories and therefore cannot express the non-stationary exploration-exploitation tradeoffs that regularization must track; the principle provides actionable design rules for multi-stage training. We discover this through LLMZero, a system where LLM agents search over training trajectories via tree search, diagnosing pathologies at each checkpoint and proposing coordinated multi-parameter transitions. Across 4 diverse GRPO tasks, LLMZero discovers strategies that improve over the base model by 9% to 140% relative and over grid search by 6% to 15% relative, consistently outperforming random search and the skill-based agent. The structural principle transfers across tasks, providing an explanation for why discovered strategies take qualitatively different forms yet share similar parameter dynamics.

2606.18385 2026-06-18 cs.AI 新提交

CaVe-VLM-CoT: An Interpretable Vision-Language Model Framework

CaVe-VLM-CoT:一种可解释的视觉-语言模型框架

Sneha Rao, Shaina Raza, Dhanesh Ramachandram

发表机构 * Vector Institute(向量研究所)

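AI summary: Proposes the CaVe-VLM-CoT framework, which enforces evidence-grounded reasoning through a five-stage closed-loop pipeline (Extractor, Retriever, Solver, Citation Injector, Verifier), and introduces CaVeScore, a composite metric covering retrieval quality, citation faithfulness, and cross-modal grounding, achieving performance gains on ScienceQA and MMMU.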

详情
AI中文摘要

视觉-语言模型(VLM)仍然容易产生幻觉,输出流畅但视觉上不忠实的输出。现有的思维链和检索增强方法仅部分解决了这一问题,因为它们既没有强制执行步骤级引用基础,也没有将验证失败路由回检索以进行纠正。我们提出了CaVe-VLM-CoT,一个模块化的基于反射的智能体RAG框架,通过五阶段闭环流水线强制执行证据推理:提取器、检索器、求解器、引用注入器和验证器,其中检测到的无根据声明会触发结构化反馈给提取器以进行针对性重新检索。由于现有框架没有联合衡量检索质量、逐步引用忠实度和跨模态基础,我们提出了一套涵盖所有阶段的23个组件级指标,以CaVeScore为核心,这是一个加权准确性、引用精确率和召回率、归因和证据基础的复合指标。无需任何架构或提示修改,CaVe-VLM-CoT在ScienceQA上达到87.1%的准确率和56.6%的CaVeScore,在MMMU(30个学科)上达到55.2%的准确率和35.7%的CaVeScore。

英文摘要

Vision-Language Models (VLMs) remain prone to hallucinations, producing fluent but visually unfaithful outputs. Existing chain-of-thought and retrieval-augmented methods only partially address this, as they neither enforce step-level citation grounding nor route verification failures back to retrieval for correction. We present CaVe-VLM-CoT, a modular reflection-based agentic-RAG framework that enforces evidence-grounded reasoning through a five-stage closed-loop pipeline: Extractor, Retriever, Solver, Citation Injector, and Verifier, in which detected ungrounded claims trigger structured feedback to the Extractor for targeted re-retrieval. Since no existing framework jointly measures retrieval quality, step-wise citation faithfulness, and cross-modal grounding, we propose a suite of 23 component-wise metrics across all stages, anchored by CaVeScore, a composite metric weighting accuracy, citation precision and recall, attribution, and evidence grounding. Without any architectural or prompt modifications, CaVe-VLM-CoT achieves 87.1% accuracy and 56.6% CaVeScore on ScienceQA, and 55.2% accuracy and 35.7% CaVeScore on MMMU (30 subjects).

2606.18384 2026-06-18 cs.LG cs.DC 新提交

SCOPE-FL: A Strategy-proof Chain-based Optimal pareto efficient Federated Learning System

SCOPE-FL:一种策略证明的基于链的最优帕累托高效联邦学习系统

Seyed Salar Ghazi, Kaiwen Zhang, Mehdi Feizi, Hans-Arno Jacobsen

发表机构 * École de Technologie Supérieure (ÉTS)(高等技术学院) Ferdowsi University of Mashhad(菲尔多西大学) University of Toronto(多伦多大学)

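AI summary: Client selection strategies in hierarchical federated learning lack Pareto efficiency and strategy-proofness, which lowers overall welfare. This paper proposes the SCOPE-FL framework, which uses the Top Trading Cycle algorithm to guarantee both Pareto optimality and strategy-proofness and distributes rewards through blockchain smart contracts.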

详情
AI中文摘要

分层联邦学习(HFL)能够在分布式设备间实现可扩展的协作模型训练,同时保护数据隐私。然而,现有的HFL客户端选择机制存在根本性的策略低效问题。通过优先考虑稳定性而非帕累托效率(PE),它们产生次优的资源分配,并且缺乏策略证明性(SP),参与者有动机歪曲其真实偏好,这两种失败在实践中都会在帕累托意义上降低系统整体福利。为解决这一问题,我们提出SCOPE-FL(策略证明的基于链的最优帕累托高效联邦学习),一种同步HFL框架,将客户端选择建模为双边学校选择问题,通过顶级交易循环(TTC)算法求解,同时保证PE和SP。对于奖励分配,SCOPE-FL采用基于一轮重建(OR)的可扩展沙普利值近似,确保补偿与每个客户端的贡献成比例。整个机制通过区块链智能合约执行,为SP保证在实践中成立提供了防篡改环境。在MNIST、Fashion-MNIST和CIFAR-10上的综合评估表明,SCOPE-FL在模型准确率、收敛速度和奖励效率方面优于现有最先进方法(包括DA、IAS等),同时通信延迟与DA相当,区块链开销在大规模下显著低于DA。

英文摘要

Hierarchical Federated Learning (HFL) enables scalable collaborative model training across distributed devices while preserving data privacy. However, existing HFL client selection mechanisms suffer from a fundamental strategic inefficiency. By prioritizing stability over Pareto efficiency (PE), they produce suboptimal resource allocations, and without strategy-proofness (SP), participants are incentivized to misrepresent their true preferences; in practice, both failures degrade overall system welfare in the Pareto sense. To address this, we propose SCOPE-FL (Strategy-proof Chain-based Optimal pareto efficient Federated Learning), a synchronous HFL framework that formulates client selection as a two-sided school choice problem and solves it with the Top Trading Cycle (TTC) algorithm, which simultaneously guarantees PE and SP. For reward distribution, SCOPE-FL employs a scalable Shapley value approximation based on One-Round Reconstruction (OR), ensuring compensation proportional to each client's contribution. The entire mechanism executes via blockchain smart contracts, providing the tamper-proof environment required for the SP guarantees to hold in practice. A comprehensive evaluation on MNIST, Fashion-MNIST, and CIFAR-10 demonstrates that SCOPE-FL outperforms state-of-the-art approaches, including DA, IAS, and other methods, in model accuracy, convergence rate, and reward efficiency, while achieving communication latency comparable to DA and blockchain overhead significantly lower than DA at scale.
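As background for the matching step, the following is a minimal sketch of the textbook Top Trading Cycle procedure for school choice, not SCOPE-FL's implementation. How clients and aggregators map onto "students" and "schools" is not stated in the abstract, so the generic names are kept. The sketch assumes every school ranks every student and that student identifiers are sortable; all names are hypothetical.

```python
def top_trading_cycles(student_prefs, school_priorities, capacities):
    """Textbook TTC for school choice.

    student_prefs:     {student: [schools, most preferred first]}
    school_priorities: {school: [students, highest priority first]} (assumed complete)
    capacities:        {school: number of seats}
    Returns {student: school, or None if no acceptable school has a seat}.
    """
    remaining = set(student_prefs)
    seats = dict(capacities)
    assignment = {}
    while remaining:
        open_schools = {sc for sc, c in seats.items() if c > 0}
        # Each remaining student points to their favorite school that still has a seat.
        points_to = {
            st: next((sc for sc in student_prefs[st] if sc in open_schools), None)
            for st in remaining
        }
        stranded = [st for st, sc in points_to.items() if sc is None]
        if stranded:
            for st in stranded:
                assignment[st] = None
                remaining.discard(st)
            continue
        # Each open school points to its highest-priority remaining student.
        top_student = {
            sc: next(st for st in school_priorities[sc] if st in remaining)
            for sc in open_schools
        }
        # Follow student -> school -> student pointers until a student repeats.
        path, st = [], min(remaining)
        while st not in path:
            path.append(st)
            st = top_student[points_to[st]]
        cycle = path[path.index(st):]
        # Everyone in the cycle receives the school they point to.
        for member in cycle:
            assignment[member] = points_to[member]
            seats[points_to[member]] -= 1
            remaining.discard(member)
    return assignment

prefs = {"a": ["Y", "X"], "b": ["X", "Y"], "c": ["X", "Y"]}
priorities = {"X": ["a", "b", "c"], "Y": ["b", "a", "c"]}
result = top_trading_cycles(prefs, priorities, {"X": 1, "Y": 2})
```

In this toy instance, a and b trade their priorities (a holds top priority at X but prefers Y, and b the reverse), so both get their first choice and c takes the remaining seat at Y. No student can be made better off without hurting another, which is the Pareto efficiency property the abstract relies on.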

2606.18383 2026-06-18 cs.LG cs.CL 新提交

From Sparse Features to Trustworthy Proxies: Certifying SAE-Based Interpretability

从稀疏特征到可信代理:认证基于SAE的可解释性

Dibyanayan Bandyopadhyay, Asif Ekbal

发表机构 * Department of Computer Science and Engineering, Indian Institute of Technology Patna(印度理工学院巴特那分校计算机科学与工程系)

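AI summary: Proposes a post-hoc generalization framework that certifies a language model via a sparse proxy (its SAE reconstruction), derives an upper bound on expected risk, and verifies non-vacuous bounds on models including GPT-2 Small, showing that deeper layers are easier to certify and that the decomposition distinguishes semantic alignment from statistical sparsity.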

详情
AI中文摘要

稀疏自编码器(SAE)越来越多地被用于从语言模型(LM)中提取可解释特征,但一个核心问题仍然存在:基于SAE的解释何时可以被视为底层冻结LM的忠实视图?我们通过一个后验泛化框架来研究这个问题,该框架通过稀疏代理来认证LM,稀疏代理是通过将原生隐藏激活替换为其预训练的SAE重建而获得的。我们的框架使用四个可测量量推导出基础模型期望风险的上界:代理风险、SAE重建差距、概念池不匹配和稀疏复杂度。我们将此证书解释为解释忠实性的操作标准。特别地,非平凡界表明提取的稀疏特征保留了有意义的预测信息,而小的重建和匹配误差表明代理在行为上接近原始模型。实验上,我们展示了在GPT-2 Small、Gemma-2B和Llama-3-8B上,该界在实际样本量下变得非平凡。对Llama-3-8B的详细逐层分析揭示了强烈的深度依赖性,较深层变得更容易认证,这与更强的局部保真度和更弱的下游误差放大相关。最后,通过特征洗牌消融,我们展示了分解区分了真正的语义对齐与单纯的统计稀疏性,为基于SAE的解释何时变得不太可靠提供了有用的诊断。

英文摘要

Sparse autoencoders (SAEs) are increasingly used to extract interpretable features from language models (LMs), yet a central question remains: when can an SAE-based explanation be treated as a faithful view of an underlying frozen LM? We study this through a post-hoc generalization framework that certifies the LM via a sparse proxy, obtained by replacing a native hidden activation with its pretrained SAE reconstruction. Our framework derives an upper bound on the base model's expected risk using four measurable quantities: proxy risk, SAE reconstruction gap, concept-pool mismatch, and sparse complexity. We interpret this certificate as an operational criterion for explanatory faithfulness. In particular, a non-vacuous bound indicates that the extracted sparse features retain meaningful predictive information, while small reconstruction and mismatch errors indicate that the proxy remains behaviorally close to the original model. Empirically, we show that the bound becomes non-vacuous on GPT-2 Small, Gemma-2B, and Llama-3-8B at practical sample sizes. A detailed layerwise analysis of Llama-3-8B reveals a strong depth dependence, with later layers becoming much easier to certify, associated with both stronger local fidelity and weaker downstream error amplification. Finally, through feature-shuffling ablations, we show that the decomposition distinguishes genuine semantic alignment from mere statistical sparsity, providing a useful diagnostic for when SAE-based explanations become less reliable.

2606.18381 2026-06-18 cs.CL cs.IR 新提交

SproutRAG: Attention-Guided Tree Search with Progressive Embeddings for Long-Document RAG

SproutRAG: 基于注意力引导的树搜索与渐进嵌入的长文档RAG

Amirhossein Abaskohi, Issam H. Laradji, Peter West, Giuseppe Carenini

发表机构 * University of British Columbia(不列颠哥伦比亚大学) ServiceNow Research(ServiceNow研究院)

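AI summary: Proposes SproutRAG, which uses attention guidance to build a chunking tree over sentence-level chunks, enabling multi-granularity retrieval without additional LLM calls and improving information efficiency by 6.1% on average.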

详情
AI中文摘要

检索增强生成(RAG)系统必须平衡检索粒度与上下文连贯性,现有方法通过LLM引导的分块、单级上下文扩展或层次摘要来解决这一挑战。这些方法在索引或检索过程中依赖昂贵的LLM调用,将上下文聚合限制在单一粒度级别,或通过摘要引入信息损失。我们提出SproutRAG,一种注意力引导的层次化RAG框架,通过将句子级块组织成逐渐增大但语义连贯的单元,利用学习到的句子间注意力构建二分块树,从而解决这一权衡。与依赖外部LLM、固定上下文扩展或有损摘要的先前方法不同,SproutRAG学习哪些注意力头和层最能捕捉语义文档结构,实现无需额外LLM调用或压缩摘要的多粒度检索。在检索时,SproutRAG使用层次化束搜索检索多个粒度的候选,捕获超越平面检索的多句子相关性。该框架通过联合目标进行端到端训练,同时改进嵌入和树结构。在涵盖科学、法律和开放域设置的四个基准上的实验表明,SproutRAG在最强基线上平均信息效率(IE)提升6.1%。代码可在该https URL获取。

英文摘要

Retrieval-augmented generation (RAG) systems must balance retrieval granularity with contextual coherence, a challenge that existing methods address through LLM-guided chunking, single-level context expansion, or hierarchical summarization. These approaches variously depend on costly LLM calls during indexing or retrieval, limit context aggregation to a single granularity level, or introduce information loss through summarization. We present SproutRAG, an attention-guided hierarchical RAG framework that addresses this trade-off by organizing sentence-level chunks into progressively larger but semantically coherent units, using learned inter-sentence attention to construct a binary chunking tree. Unlike prior approaches that rely on external LLMs, fixed context expansion, or lossy summarization, SproutRAG learns which attention heads and layers best capture semantic document structure, enabling multi-granularity retrieval without additional LLM calls or compressed summaries. At retrieval time, SproutRAG uses hierarchical beam search to retrieve candidates at multiple granularities, capturing multi-sentence relevance beyond flat retrieval. The framework is trained end-to-end with a joint objective that improves both embeddings and tree structure. Experiments across four benchmarks spanning scientific, legal, and open-domain settings demonstrate that SproutRAG improves information efficiency (IE) by 6.1% on average over the strongest baseline. Code is available on https://github.com/AmirAbaskohi/SproutRAG.

2606.18375 2026-06-18 cs.RO New submission

PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation

Yuhang Huang, Xuan Lv, Junyan Xu, Zhiyuan Yu, Jiazhao Zhang, Ruizhen Hu, Wancheng Feng, Shilong Zou, Hewen Xiao, Ziqiao Zhou, Kaiyun Huang, Zhiyu Peng, Juzhan Xu, Hang Zhao, Chenyang Zhu, Renjiao Yi, Yifei Huang, Douhui Wu, Yan Zhang, Kexu Cheng, Chunhe Song, Yunzhi Xue, Xiuhong Zhang, Leitao Guo, Yunji Chen, Bin Wu, Haibin Yu, Kai Xu

Affiliations: Institute of AI for Industries, Chinese Academy of Sciences (中国科学院人工智能产业研究院)

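AI summary: Proposes PAIWorld, a framework that addresses the 3D inconsistency of multi-view world models through geometry-aware cross-view attention, geometric rotary position embedding, and latent 3D-REPA distillation, achieving leading performance on robotic manipulation benchmarks.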

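Details
AI abstract (translated from Chinese)

World foundation models (WFMs) are powerful simulators, but they operate mainly in single-view settings and lack the multi-view 3D consistency that robotic manipulation requires. Although robotic systems rely on multiple cameras (egocentric, eye-to-hand, and wrist-mounted) for policy learning, current multi-view world models simply concatenate view tokens without explicit geometric reasoning. This leads to cross-view object drift, depth inconsistency, and texture misalignment. We attribute these failures to two deficiencies: the lack of an explicit inter-view communication mechanism and the lack of a 3D geometric prior. We argue that addressing both at once is necessary and sufficient. To this end, we propose PAIWorld, a framework that augments diffusion-transformer world models with three core components: (1) geometry-aware cross-view attention blocks, which establish an explicit pathway across views; (2) geometric rotary position embedding, which encodes camera ray directions and extrinsic poses into the attention mechanism; and (3) latent 3D-REPA, which distills 3D-aware features from a frozen 3D foundation model to ensure 3D consistency. Built on a DiT world foundation model, PAIWorld achieves state-of-the-art multi-view 3D consistency on robotic manipulation benchmarks, ranking first on the WorldArena leaderboard and second on the AgiBot-Challenge2026 leaderboard, while supporting downstream applications such as model-based planning, world action models, and multi-view policy post-training.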

English abstract

World foundation models (WFMs) are powerful simulators, yet they predominantly operate in a single-view setting and lack the multi-view 3D consistency required for robotic manipulation. While robotic systems rely on multiple cameras (egocentric, eye-to-hand, and wrist-mounted) for policy learning, current multi-view world models simply concatenate view tokens without explicit geometric reasoning. This causes cross-view object drift, depth inconsistency, and texture misalignment. We trace these failures to two deficiencies: the absence of an explicit inter-view communication mechanism and the lack of a 3D geometric prior. We argue that resolving both simultaneously is necessary and sufficient. To address this, we present PAIWorld, a framework that augments diffusion-transformer world models via three core components: (1) Geometry-Aware Cross-View Attention blocks that establish an explicit pathway across views, (2) Geometric Rotary Position Embedding that encodes camera ray directions and extrinsic poses into the attention mechanism, and (3) Latent 3D-REPA, which distills 3D-aware features from frozen 3D foundation models to ensure 3D consistency. Built upon a DiT-based world foundation model, PAIWorld achieves state-of-the-art multi-view 3D consistency on robotic manipulation benchmarks, ranking 1st on the WorldArena leaderboard and 2nd on the AgiBot-Challenge2026 leaderboard, while enabling downstream applications such as model-based planning, world action models, and multi-view policy post-training.
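
The abstract says the position embedding encodes camera ray directions and extrinsic poses. As background only, the sketch below computes the world-frame ray direction of a single pixel under a standard pinhole camera model. It does not reproduce PAIWorld's embedding: the function name `pixel_ray_world`, the intrinsics `fx, fy, cx, cy`, and the camera-to-world rotation `R_c2w` are assumptions chosen for this example, and how such rays enter the attention mechanism is not specified in the abstract.

```python
import math
from typing import Sequence, Tuple


def pixel_ray_world(
    u: float, v: float,
    fx: float, fy: float, cx: float, cy: float,
    R_c2w: Sequence[Sequence[float]],
) -> Tuple[float, float, float]:
    """Unit ray direction in the world frame for pixel (u, v).

    Assumes a pinhole camera looking along +z in its own frame and a
    3x3 camera-to-world rotation matrix R_c2w (the extrinsic rotation).
    """
    # Back-project the pixel to a direction in the camera frame.
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Rotate the direction into the world frame.
    d_world = [sum(R_c2w[r][c] * d_cam[c] for c in range(3)) for r in range(3)]
    # Normalize so that only the direction is kept.
    norm = math.sqrt(sum(x * x for x in d_world))
    return (d_world[0] / norm, d_world[1] / norm, d_world[2] / norm)


# Hypothetical cameras: one axis-aligned, one rotated 90 degrees about the y-axis.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ROT_Y_90 = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]]

print(pixel_ray_world(320, 240, 500, 500, 320, 240, IDENTITY))
print(pixel_ray_world(320, 240, 500, 500, 320, 240, ROT_Y_90))
```

The same pixel maps to different world-frame rays under the two cameras, which is the kind of per-view geometric signal a cross-view model can use to relate tokens from different cameras.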