arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.10646 2026-06-10 cs.LG cs.CL New submission

How Does Reasoning Flow? Tracing Attention-Induced Information Flow for Targeted RL in LLMs

Zhichen Dong, Yang Li, Yuhan Sun, Weixun Wang, Yijia Luo, Zinian Peng, Taiheng Ye, Chao Yang, Wenbo Su, Yu Cheng, Bo Zheng, Junchi Yan

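Affiliations: Shanghai Jiao Tong University; Alibaba Group; Shanghai Artificial Intelligence Laboratory

AI summary: Proposes FlowTracer, a framework that traces answer-targeted reasoning flow on an attention-induced directed acyclic graph and assigns token-level credit from the global information-flow structure, improving reinforcement learning for LLMs on reasoning tasks.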

Comments 25 pages, 7 figures, 11 tables. Accepted at ICML 2026

English abstract

Token-level credit assignment remains a key obstacle for reinforcement learning (RL) in large language models (LLMs), where RL recipes typically treat all tokens equally, failing to distinguish decisive reasoning steps from routine formatting or fluent filler. Recent attempts leverage model-internal signals to assign finer-grained credit, but these are often point-wise heuristics that ignore the global structure of information propagation. We propose FlowTracer, an RL framework that traces answer-targeted reasoning flow on an attention-induced directed acyclic graph, in which nodes correspond to tokens and edge capacities come from aggregated attention weights, and derives token credit from this global structure. The edge capacities are reweighted to retain only the influence that can reach the answer region, while enforcing local flow conservation so intermediate tokens neither lose nor gain effective mass due to path length or irrelevant branches. On this graph, FlowTracer extracts an information-flow backbone connecting the question to the answer and scores tokens by flow throughput, revealing high-impact hubs and aggregation checkpoints that mediate long-range dependencies. These derived importances are used to shape token-level rewards, enabling learning signals to focus precisely on the tokens that route information toward (or away from) correct answers and delivering consistent performance gains across a range of reasoning tasks.
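
As a rough illustration of the flow-throughput idea described in this abstract (not the authors' implementation), the sketch below treats a causal attention matrix as a DAG, computes the mass that reaches each token from the question and the mass each token can still pass on to the answer, and scores tokens by the product of the two. The product score and all names (`throughput_scores`, `question_idx`, `answer_idx`) are assumptions made for this example; the capacity reweighting and flow-conservation steps mentioned in the abstract are not reproduced.

```python
def throughput_scores(attn, question_idx, answer_idx):
    """Toy token scores on an attention-induced DAG (illustrative only).

    attn[i][j] is the aggregated attention from token i to an earlier
    token j (j < i), so information flows along edges j -> i.
    """
    n = len(attn)
    question, answer = set(question_idx), set(answer_idx)

    # forward[i]: mass that reaches token i starting from the question tokens
    forward = [1.0 if i in question else 0.0 for i in range(n)]
    for i in range(n):
        if i not in question:
            forward[i] = sum(attn[i][j] * forward[j] for j in range(i))

    # backward[j]: mass leaving token j that can still reach the answer tokens
    backward = [1.0 if j in answer else 0.0 for j in range(n)]
    for j in range(n - 1, -1, -1):
        if j not in answer:
            backward[j] = sum(attn[i][j] * backward[i] for i in range(j + 1, n))

    # A token scores highly only if it lies on a question-to-answer path.
    return [f * b for f, b in zip(forward, backward)]
```

In this toy version a token that neither receives mass from the question nor feeds the answer gets a score of zero, which is the kind of token the abstract describes as filler.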

2606.10645 2026-06-10 cs.CV New submission

ManiSplat: Manipulation Trajectory Synthesis from Monocular Video via Decoupled 3D Gaussian Splatting

Wenhao Hu, Haonan Zhou, Liu Liu, Yun Du, Xinjie Wang, Ziang Li, Zhizhong Su, Gaoang Wang

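Affiliations: Zhejiang University; Horizon Robotics

AI summary: Proposes ManiSplat, a framework that uses a graph-structured disentangled representation and task-oriented spatio-temporal alignment to reconstruct controllable 3D Gaussian digital twins from monocular video, supporting robotic manipulation tasks and policy learning.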

English abstract

Reconstructing dynamic and interactive 3D scenes from real-world observations remains a fundamental challenge in computer vision and robotics. While recent advances in 3D Gaussian Splatting have enabled high-fidelity static reconstruction, extending it to interactive environments with articulated robots and manipulable objects remains difficult due to complex contact interactions and abrupt pose changes. To address these challenges, we introduce ManiSplat, a unified framework that reconstructs controllable and decoupled Gaussian digital twins directly from monocular ego-view robotic videos. Our method introduces a Graph-Structured Disentangled Representation that separates the robot, objects, and background into independently optimizable Gaussian subfields organized within a scene graph. To ensure stability, we propose a Task-Oriented Spatio-Temporal Alignment module that leverages the inherent logic of manipulation tasks, which alternate between Motion and Skill phases, to construct accurate pseudo-ground-truth trajectories. Finally, a joint photometric-geometric optimization ensures the reconstructed scenes are temporally coherent, physically consistent, and simulation-ready. Extensive experiments demonstrate that our approach reconstructs interaction-driven dynamic scenes with high fidelity and controllability, effectively supporting downstream robotic tasks and policy learning.

2606.10640 2026-06-10 cs.CV New submission

ChartLens: A Dual-Branch Framework for Chart Data Correction and Factual Summary Refinement

Hao Liu, Ruping Cao, Kun Wang, Zhiran Li, Fan Liu, Yupeng Hu, Liqiang Nie

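Affiliations: Shandong University; Southeast University; Harbin Institute of Technology (Shenzhen)

AI summary: Proposes ChartLens, a dual-branch framework that improves chart data recovery and summary factuality through structure-aware CSV verification and correction and text-retention-guided summary refinement; it ranked first in Track 2 of the DataMFM Challenge.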

English abstract

In this report, we present our champion solution for the DataMFM Challenge Track 2: Chart Understanding. This track requires models to recover structured chart data and generate faithful natural-language summaries from chart images. To address the complementary requirements of accurate data extraction and factual narration, we propose ChartLens, a dual-branch framework for chart data correction and summary refinement. ChartLens consists of two key modules: Structure-Aware CSV Verification and Correction (SAVC) and Text-Retention-Guided Summary Refinement (TRSR). SAVC improves the reliability of structured data extraction through verification and correction, while TRSR enhances summary generation by preserving critical textual and numerical evidence from charts. By combining model adaptation, correction-based generation, and OCR-assisted evidence grounding, ChartLens improves both structured data recovery and summary factuality. On the test set, our final system achieves an overall score of 69.10 and ranks first in Track 2, demonstrating its effectiveness for accurate chart understanding. Our code will be released at: https://github.com/iLearn-Lab/CVPRW26-ChartLens.

2606.10632 2026-06-10 cs.LG cs.AI New submission

Is Fairness Truly Fair? Towards Reliable Lipschitz Fairness in Multi-Task Learning via Fixed-$δ$ Alignment

Junbo Ding, Xin Zang, Chenchen Pan, Donghao Song, Jiaxin Zhu, Danhuai Guo

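Affiliations: Beijing University of Chemical Technology

AI summary: Addresses the problem that the evaluation of Lipschitz individual fairness in multi-task learning is confounded by representation scale, and proposes ReLiF, a framework that combines fixed-δ auditing with controlled regularization for semantically consistent fairness evaluation and trade-offs.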

English abstract

Lipschitz-style individual fairness formalizes the idea that semantically similar examples should receive similar predictions, but its evaluation in multi-task learning (MTL) can be confounded by method-induced representation scales. This paper identifies threshold confounding: when the auditing tolerance is derived from each model's own representation distances, different algorithms are compared under different semantic thresholds. A threshold-drift analysis further shows how Bias rankings can change and identifies sufficient conditions for ranking preservation. We propose ReLiF, a reliability-aware framework that separates evaluation-time fixed-$δ$ auditing from training-time controlled regularization. ReLiF uses a shared reference tolerance for comparable auditing and a violation-rate feedback controller to keep the Lipschitz surrogate active without letting it dominate stochastic training. This work also develops supporting analysis for threshold drift, reference-tolerance selection, and the relationship between the huberized training surrogate and its unsmoothed positive-margin counterpart. Experiments on clinical time-series benchmarks and NYUv2 (NYU Depth V2) dense prediction show that fixed-$δ$ auditing exposes utility-fairness trade-offs that method-dependent thresholds can obscure. On NYUv2 with a ResNet50 backbone, ReLiF achieves competitive utility while substantially reducing aligned bias under shared fixed thresholds. On clinical benchmarks, ReLiF yields controlled fairness-regularized trade-offs, while fixed-$δ$ auditing reveals that task-balancing baselines can sometimes achieve lower bias and that genuine utility-fairness trade-offs persist. These results support fixed-$δ$ auditing as a semantically consistent protocol for evaluating Lipschitz fairness in MTL.

2606.10628 2026-06-10 cs.CV New submission

Leveraging Metric Depth for Relative Depth Prediction

Xiaoyang Bi, Shuaikun Liu, Zhaohong Liu, Yuxin Yang, Zhe Zhao, Mengshi Qi, Liang Liu, Huadong Ma

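Affiliations: Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia; Beijing University of Posts and Telecommunications

AI summary: Addresses the scarcity of training samples for relative depth prediction in football scenes by using the zero-shot capability of pretrained models to learn metric depth, achieving a score of $2.68 \times 10^{-3}$ on the challenge set.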

English abstract

We present our solution to the 2025 SoccerNet Monocular Depth Estimation Competition Challenge. Predicting the relative depth in football scenarios is challenging, especially with only thousands of training samples available. To address this issue, our method leverages the powerful zero-shot capabilities of models pretrained on large-scale datasets to learn metric depth for effective relative depth prediction, achieving a score of $2.68 \times 10^{-3}$ on the challenge set.

2606.10620 2026-06-10 cs.CV cs.AI New submission

Can Image Models Imagine Time? ImageTime: A Novel Benchmark for Probing Visual World Modeling Through Spatiotemporal Consistency

Xinrui Wu, Lichen Huang

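Affiliations: University of Science and Technology of China

AI summary: Proposes the ImageTime benchmark, which uses a four-keyframe protocol (initial state, action onset, transition state, final state) to evaluate the spatiotemporal consistency of image generation models, revealing their strengths and weaknesses in maintaining coherent visual world states.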

English abstract

Image generation models now produce high-quality static images, yet their ability to represent how a visual world changes over time remains poorly understood. Practical workflows such as storyboarding, step-by-step illustration, reference-guided editing, and video previsualization require models to preserve identities, objects, spatial relations, and causal order across multiple visual states. Existing evaluations largely measure single-image correctness, compositional alignment, or video quality, leaving open whether an image model can coherently imagine a temporally ordered process. We introduce ImageTime, a diagnostic benchmark that uses spatiotemporal consistency as a behavioral probe of visual world modeling in image generation. Given an action instruction, and optionally a reference image specifying the initial state, a model must generate one image containing four ordered key states: initial state, action onset, transition state, and final state. This four-keyframe protocol is more temporally demanding than single-image generation while avoiding the confounds of dense video dynamics. ImageTime organizes tasks with a progressive capability hierarchy and decomposes each scenario into stage-wise state predicates, cross-frame temporal constraints, and forbidden causal violations. GPT-5.5 scores all generated images under a structured VLM-as-judge protocol, producing interpretable capability scores, diagnostic subscores, and failure labels. Through multi-family benchmarking, ImageTime reveals where current image generation systems succeed, fail, and drift when asked to maintain coherent visual world states over time.

2606.10617 2026-06-10 cs.CV New submission

SSR-Merge: Subspace Signal Routing for Training-Free LoRA Merging in Diffusion Models

Zhengxuan Wei, Yi Dong, Zonghui Li, Xianhui Lin, Xing Liu, Hong Gu, Shaofeng Zhang, Wenbin Li, Qi Fan

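Affiliations: Stanford University

AI summary: Proposes Subspace Signal Routing (SSR), which builds a unified subspace by concatenating LoRAs along the rank dimension, decorrelates signals with an inverse correlation matrix, and separates them with a directional guide matrix to resolve interference in parameter merging; the method is shown theoretically to match the OLS solution, and a streaming algorithm reduces overhead.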

Comments Accepted at ICML 2026

English abstract

Low-Rank Adaptation (LoRA) merging can efficiently combine diverse generative capabilities from multiple trained LoRAs for a diffusion model. However, existing LoRA merging techniques often suffer from severe parameter interference, causing destructive collisions in the shared parameter space. To address this, we propose Subspace Signal Routing (SSR), which resolves interference by routing internal signals instead of performing parameter-space merge. Specifically, SSR first constructs a unified subspace by concatenating candidate LoRAs along the rank dimension. Next, SSR employs an inverse correlation matrix to decorrelate mixed signals within this space. Finally, a directional guide matrix steers these purified signals into their respective task-specific subspaces. We provide a rigorous theoretical analysis proving that SSR aligns with the Ordinary Least Squares (OLS) solution, thereby ensuring mathematical optimality. We utilize the additivity of sufficient statistics to design a streaming algorithm. This enables on-the-fly updates that significantly reduce memory overhead and computation time. Extensive experiments validate that SSR significantly outperforms state-of-the-art methods while maintaining comparable efficiency. Code is available at https://github.com/nagara214/SSR-Merge.
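
The first step in this abstract, building a unified subspace by concatenating LoRAs along the rank dimension, can be sketched in a few lines. The example below only checks the standard identity that stacking the factors reproduces the sum of the individual low-rank updates; the decorrelation and routing steps are not shown, and the function names are hypothetical rather than taken from the released code.

```python
def matmul(x, y):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def concat_along_rank(loras):
    """Stack LoRA factors along the rank dimension.

    loras: list of (B, A) pairs, B of shape d_out x r_i and A of shape r_i x d_in.
    Returns B_cat (d_out x sum r_i) and A_cat (sum r_i x d_in).
    """
    d_out = len(loras[0][0])
    b_cat = [sum((b[row] for b, _ in loras), []) for row in range(d_out)]
    a_cat = [row for _, a in loras for row in a]
    return b_cat, a_cat

# Two rank-1 adapters on a 2 x 2 weight (toy values).
lora_1 = ([[1.0], [0.0]], [[1.0, 2.0]])
lora_2 = ([[0.0], [1.0]], [[3.0, 4.0]])
b_cat, a_cat = concat_along_rank([lora_1, lora_2])

# B_cat @ A_cat equals B_1 @ A_1 + B_2 @ A_2.
summed = [[p + q for p, q in zip(r1, r2)]
          for r1, r2 in zip(matmul(*lora_1), matmul(*lora_2))]
```

Keeping the factors stacked, instead of collapsing them into one summed matrix, is what leaves each adapter's rank-space signal separately addressable for later routing.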

2606.10614 2026-06-10 cs.RO cs.CV cs.LG New submission

Dexterous Point Policy: Learning Point-based Dexterous Hand Policies from Human Demonstrations

Beomjun Kim, Seong Hyeon Park, Seunghoon Sim, Seungjun Moon, Sanghyeok Lee, Jinwoo Shin

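Affiliations: KAIST

AI summary: Proposes Dexterous Point Policy, a framework that learns dexterous manipulation policies from human videos through a unified 3D keypoint representation, without robot demonstrations, reaching a 75% success rate on real-world tasks.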

English abstract

Robotic foundation models pre-trained on human demonstration videos have shown promise, but a significant embodiment gap remains when the resulting policies are deployed on real robots. A common remedy is to fine-tune these models on robot-specific demonstrations. However, robot data collection can be prohibitively expensive and time-consuming, which is particularly acute in dexterous manipulation, e.g., teleoperating a multi-fingered hand for even a single atomic task can take days. To address this, we introduce Dexterous Point Policy, a framework that learns dexterous manipulation policies directly from human videos and requires no robot demonstrations. Our core insight is that a unified 3D keypoint representation can bridge human and robot embodiments when used for both observations and actions. Specifically, we extract 3D keypoints of task-relevant objects and human hands from raw videos, and train an autoregressive transformer over these keypoints. We observe that at the keypoint level, specifically the wrist and fingertips, human and robot behaviors closely align, enabling direct policy transfer. On a suite of real-robot tasks spanning pick-and-place and tool use, Dexterous Point Policy attains 75.0% success, whereas a state-of-the-art VLA baseline reaches only 1.0%. Furthermore, our method generalizes strongly to unseen scenarios, including multi-object environments and novel object categories.

2606.10613 2026-06-10 cs.LG cs.AI New submission

Fast and Highly Expressive Policy Learning for Offline Reinforcement Learning via Bootstrapped Flow Q-Learning

Thanh Nguyen, Tri Ton, Hongbin Choe, Tung M. Luu, Chang D. Yoo

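Affiliations: University of California, Berkeley

AI summary: Proposes Bootstrapped Flow Q-Learning (BFQ), which takes a divide-and-conquer view of the displacement vector and bootstraps its short-range components to enable single-step action generation without auxiliary networks or distillation, significantly reducing computational cost while improving performance.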

Comments ICML 2026, 19 pages

Journal ref ICML 2026

English abstract

Diffusion-based Q-learning has emerged as a powerful paradigm for offline reinforcement learning, but its reliance on multi-step denoising makes both training and inference computationally expensive and brittle. Recent efforts to accelerate diffusion Q-learning toward single-step action generation typically introduce auxiliary networks, policy distillation, or multi-phase training, which frequently compromise simplicity, stability, or performance. To address these limitations, we introduce Bootstrapped Flow Q-Learning (BFQ), a novel framework that enables accurate single-step action generation during both training and inference, without auxiliary networks or distillation procedures. BFQ adopts a divide-and-conquer view of the displacement vector along the flow path: it begins by learning short-range displacements that can be accurately estimated from the Flow Matching marginal velocity, and bootstraps these components to directly learn a noise-to-action mapping in a single step. This formulation eliminates multi-step denoising, resulting in a learning procedure that is substantially faster, simpler, and more robust. Extensive D4RL evaluations show that BFQ improves performance while significantly reducing computational cost compared to multi-step diffusion baselines, demonstrating that single-step action generation suffices for high-performance offline Reinforcement Learning.

2606.10612 2026-06-10 cs.CV New submission

GaussTrace: Provenance Analysis of 3D Gaussian Splatting Models with Evidence-based LLM Reasoning

Haoliang Han, Ziyuan Luo, Renjie Wan

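Affiliations: University of California, Berkeley; Stanford University

AI summary: Proposes GaussTrace, a framework that combines attribute-wise statistical analysis and hypothesis-driven editing simulation with LLM chain-of-thought reasoning to build directed provenance graphs for 3D Gaussian Splatting models, without requiring training or editing histories.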

Comments Accepted by ICML2026

English abstract

3D Gaussian Splatting (3DGS) is a powerful technique for creating high-fidelity 3D assets. However, the widespread sharing and iterative modification of 3DGS models across digital platforms create pressing challenges for intellectual property protection and forensic traceability. To address this, we propose GaussTrace, a novel framework for constructing directed provenance graphs for 3DGS models. GaussTrace formulates provenance analysis as an evidence-based reasoning problem. It builds upon attribute-wise statistical profiling of 3DGS parameters to capture intrinsic properties. Moreover, we introduce hypothesis-driven editing simulations of common operations to provide auxiliary evidence for plausible transformation pathways. These statistical and simulated cues jointly enable a Large Language Model (LLM) to perform structured Chain-of-Thought (CoT) reasoning, yielding directional provenance inferences and explainable edge reasons. Experimental results demonstrate that GaussTrace effectively constructs evolutionary relationships among diverse 3DGS models, delivering accurate, interpretable, and robust provenance graphs without requiring model training or access to editing histories. Project page: https://haolianghan.github.io/GaussTrace.

2606.10611 2026-06-10 cs.LG cs.CV New submission

Geometry-Aware Reinforcement Learning for 2D Irregular Nesting

Auguste Lehuger, Guillaume Henon-Just

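Affiliations: Valeo Brain

AI summary: Proposes the Polygons Transformer architecture and a Combinatorial Optimization Reinforcement Learning framework that let an agent learn geometric priors from data, reaching area utilization competitive with the state-of-the-art heuristic solver Sparrow on 2D irregular nesting.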

Comments 15 pages, 4 figures, 5 tables. Under review at the European Workshop on Reinforcement Learning (EWRL)

English abstract

Traditional heuristic solvers for the 2D irregular nesting problem share a fundamental limitation: they are blind to polygon geometry, relying on guided brute-force to navigate the continuous placement space with minimal geometrical guidance. In this paper, we argue that Reinforcement Learning is uniquely positioned to overcome this bottleneck. By pairing an optimization policy with a geometry-aware neural encoder, an agent can automatically discover rich geometric priors directly from data, utilizing these learned intuitions to strategically guide exploration. To realize this, we introduce the Polygons Transformer (PoT), a novel architecture that encodes 2D continuous vector geometries while allowing cross-polygons attention. We couple this novel architecture with a Combinatorial Optimization Reinforcement Learning (CORL) training framework to find optimal solutions. To support this paradigm, we release an open-source training dataset derived from complex geographic contours alongside a dedicated evaluation benchmark. Our empirical validation demonstrates that our trained agent achieves area utilization performance highly competitive with Sparrow, the state-of-the-art heuristic solver, proving that reinforcement learning can successfully discover and exploit geometric awareness for precise spatial tasks.
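
The abstract compares solvers by area utilization. A minimal sketch of that metric, assuming it is defined as the total polygon area divided by the area of a rectangular container (the paper's exact definition is not given here), uses the shoelace formula for each polygon. The function names and the rectangular container are assumptions of this example, and the code does not check that the placements are overlap-free.

```python
def polygon_area(vertices):
    """Area of a simple polygon from its (x, y) vertices, via the shoelace formula."""
    twice_area = 0.0
    n = len(vertices)
    for k in range(n):
        x0, y0 = vertices[k]
        x1, y1 = vertices[(k + 1) % n]   # wrap around to close the polygon
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2.0

def area_utilization(polygons, container_width, container_length):
    """Fraction of the container area covered by the placed polygons."""
    used = sum(polygon_area(p) for p in polygons)
    return used / (container_width * container_length)

unit_square = [(0, 0), (1, 0), (1, 1), (0, 1)]
triangle = [(0, 0), (2, 0), (0, 2)]
utilization = area_utilization([unit_square, triangle], 2.0, 3.0)
```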

2606.10610 2026-06-10 cs.CL New submission

Small Data, Big Noise: Adversarial Training for Robust Parameter-Efficient Fine-Tuning

Eitan Cohen, Idan Simai, Uri Shaham

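Affiliations: Bar-Ilan University

AI summary: Proposes SDBN, a framework that combines adversarial training with parameter-efficient fine-tuning, with variants based on discrete uncertainty sets, to improve robustness and generalization in low-resource settings.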

Comments Accepted to Findings of ACL 2026

English abstract

Parameter-Efficient Fine-Tuning (PEFT) has become essential for adapting foundation models to downstream NLP tasks. However, current PEFT methods often struggle with robustness to noise and performance degradation on limited training data. We propose SDBN (Small Data Big Noise), a unified framework that brings adversarial training to PEFT (a combination that remains understudied despite its complementary strengths) to enhance model robustness and generalization, outperforming alternative approaches. We also introduce two variants of the method that use discrete uncertainty sets: SDBN-h, which enumerates character-level edits and selects worst-case variants using gradients, and SDBN-p, which uses LLM-generated variants for robust optimization in generative tasks. Experiments across multiple benchmarks reveal substantial improvements, particularly in low-resource settings and under both word-level and character-level corruptions. This framework addresses the less explored intersection of adversarial training and parameter-efficient adaptation, introducing no additional parameters and only modest computational overhead, making PEFT deployments more reliable in real-world scenarios where data scarcity and linguistic variability often coexist.
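
The discrete uncertainty set behind a variant like SDBN-h can be pictured with a small sketch: enumerate single-character edits of a word and keep the one that hurts most. The abstract says the worst case is selected using gradients; the version below instead scores every candidate directly with a caller-supplied `loss_fn`, a simplification chosen for this example. The edit types (deletion, adjacent swap, lowercase substitution) and all names are assumptions, not the authors' code.

```python
import string

def char_edits(word):
    """All single-character deletions, adjacent swaps, and lowercase substitutions."""
    edits = set()
    for i in range(len(word)):
        edits.add(word[:i] + word[i + 1:])                               # deletion
        if i + 1 < len(word):
            edits.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])   # adjacent swap
        for c in string.ascii_lowercase:
            edits.add(word[:i] + c + word[i + 1:])                       # substitution
    edits.discard(word)   # the unperturbed word is not an edit
    return edits

def worst_case_variant(word, loss_fn):
    """Pick the edit with the highest loss (ties broken alphabetically)."""
    return max(sorted(char_edits(word)), key=loss_fn)
```

During adversarial training, the selected variant would replace the clean word in the input before the usual PEFT update; `loss_fn` stands in for the model's task loss on the perturbed input.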

2606.10607 2026-06-10 cs.LG cs.AI cs.CL New submission

Causal Ensemble Agent: Hierarchical Causal Discovery with LLM-guided Expert Reweighting

Xinyu Li, Yuanyuan Wang, Haoxuan Li, Chuan Zhou, Erdun Gao, Bo Han, Tongliang Liu, Kun Zhang, Howard Bondell, Mingming Gong

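Affiliations: The University of Melbourne; MBZUAI; Peking University; Adelaide University; Hong Kong Baptist University; The University of Sydney; Carnegie Mellon University

AI summary: Proposes the Causal Ensemble Agent (CEA), a framework that aggregates statistical causal discovery results from different graph levels via linear opinion pooling and uses a large language model (LLM) as a meta-referee to dynamically reweight experts near the decision boundary, producing more accurate and complete causal graphs.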

English abstract

Causal discovery aims to uncover causal structures from observational data, which is crucial for real-world decision-making. However, different causal discovery algorithms can produce divergent results that conflict with each other, complicating the identification of accurate causal graphs. Traditional approaches rely on numerical values and statistical assumptions, often ignoring rich domain-specific information, such as feature descriptions, which could also help structure learning. While recent works explore using Large Language Models (LLMs) to infer causal relations via direct queries, such methods can be unreliable due to a lack of alignment with the actual data. To address these limitations, we propose Causal Ensemble Agent (CEA), a novel framework that aggregates structural insights from statistical discovery experts across different graph levels via linear opinion pooling, and uses an LLM as a meta-referee to dynamically reweight experts when the aggregated confidence is close to the decision boundary, thereby composing an improved and more complete causal graph. Extensive experiments on both synthetic and real-world datasets demonstrate that CEA achieves the strongest overall performance across a wide range of causal discovery methods, highlighting the effectiveness of using LLMs for meta-analysis in causal discovery.
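
Linear opinion pooling, as named in this abstract, is a weighted average of the experts' probabilities for an edge. The sketch below pools per-expert confidences and flags edges whose pooled confidence sits close to the decision boundary, which is where the abstract says the LLM meta-referee reweights experts. The boundary of 0.5, the margin of 0.1, and the function names are assumptions for illustration; the LLM call itself is not modeled.

```python
def pool_edge_confidence(expert_probs, weights):
    """Linear opinion pool: weighted average of expert edge probabilities."""
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, expert_probs)) / total

def needs_referee(confidence, boundary=0.5, margin=0.1):
    """True when the pooled confidence is too close to the boundary to decide."""
    return abs(confidence - boundary) <= margin

# Three experts disagree about one edge; equal weights leave it undecided.
ambiguous = pool_edge_confidence([0.9, 0.2, 0.4], [1.0, 1.0, 1.0])
# Two experts agree about another edge; no referee is needed.
clear = pool_edge_confidence([0.9, 0.8], [1.0, 1.0])
# After a (hypothetical) referee triples the first expert's weight, re-pool.
reweighted = pool_edge_confidence([0.9, 0.2, 0.4], [3.0, 1.0, 1.0])
```

Restricting the referee to near-boundary edges keeps the statistical experts in charge of the clear cases and limits how many LLM queries are needed.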

2606.10602 2026-06-10 cs.CV New submission

Globally Localizing Lunar Rover in Pixels via Graph Alignment

Mao Chen, Xu Yang, Chuankai Liu, Xiangkai Zhang, Xiaoxue Wang, Zheng Bo, Zuoyu Zhang, Zhiyong Liu

Affiliations: The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Beijing Aerospace Control Center; Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences

AI summary: Proposes WARG, a framework that uses unified graph learning and reprojected graph matching to address inter-entity entanglement, viewpoint divergence, and simulation-to-real domain shift in cross-view lunar rover localization, achieving a localization error of 1.68 m on real YuTu-2 data.

English abstract

Precise rover localization is a prerequisite for autonomous lunar exploration, yet the absence of Global Navigation Satellite System (GNSS) signals and the cumulative drift of local localization methods severely constrain long-range missions. Cross-view localization provides a promising drift-free global solution by matching rover-view and satellite-view imagery. However, the lunar environment poses unique challenges for correspondence alignment, including inter-entity entanglement, inter-viewpoint divergence, and simulation-to-real domain shift. To address these challenges, we propose Warped Alignment of Reprojected Graphs (WARG), a framework that leverages unified graph learning and reprojected graph matching for robust cross-view alignment. Pretrained on the synthetic LuSNAR dataset, WARG achieves an average test error of 0.32 m and demonstrates robust zero-shot generalization to the synthetic lunar south pole region with an error of 3.63 m. More importantly, when validated on real-world data from the YuTu-2 rover, WARG achieves a localization error of 1.68 m within a 100 m x 100 m search area, corresponding to nearly one-pixel precision in low-resolution satellite imagery with a spatial resolution of 1.40 m/pixel. Beyond accuracy, WARG is computationally efficient, containing only 1.56M parameters, corresponding to 16.12% of previous lightweight models, and operating at 5.49 Hz on an NVIDIA RTX A6000 GPU, approaching GNSS-level update frequency. Finally, we observe that WARG naturally develops low-level spatial awareness, including semantic segmentation and structural reasoning, through cross-view localization learning, highlighting its potential as a promising paradigm for spatial intelligence with minimal annotation cost. The source code is available at https://github.com/maochen-casia/warg.
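
The "nearly one-pixel precision" claim follows from the figures quoted in the abstract; the short calculation below converts the reported error and search-area side from metres to satellite-image pixels. Only the numbers come from the abstract; the helper name is made up for this example.

```python
def metres_to_pixels(metres, metres_per_pixel):
    """Convert a ground distance to a distance in image pixels."""
    return metres / metres_per_pixel

error_px = metres_to_pixels(1.68, 1.40)    # localization error on YuTu-2 data
side_px = metres_to_pixels(100.0, 1.40)    # one side of the 100 m search area
print(f"error: {error_px:.2f} px, search side: {side_px:.1f} px")
# prints "error: 1.20 px, search side: 71.4 px"
```

So the reported error corresponds to about 1.2 pixels inside a search window of roughly 71 by 71 pixels.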

2606.10594 2026-06-10 cs.CV New submission

Segment and Select: Vision-Language Segmentation in 3D Scenarios

Yulin Chen, Zhihang Zhong, Yuenan Hou

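Affiliations: Shanghai AI Laboratory; University of Science and Technology of China; Shanghai Jiaotong University

AI summary: Proposes the SEGA3D paradigm, which uses a mask candidate generator, a large language model, and a semantic-spatial selector for fine-grained, language-instructed segmentation in 3D scenes, improving mIoU by 8.3 on ScanNet and 5.3 on Matterport3D.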

Comments The core idea is to reformulate 3D vision-language segmentation as the segment-and-select paradigm (free from the superpoint dependency)

详情
AI中文摘要

3D视觉-语言分割旨在根据语言指令和视觉观察在3D场景中分割目标对象。现有技术严重依赖粗糙的超点表示来降低计算复杂度,这导致分割质量差和对象边界混乱。本文提出用于3D视觉-语言分割的SEGment-And-Select(SEGA3D)范式,该范式直接操作于细粒度视觉信息,无需依赖超点。具体而言,我们首先利用掩码候选生成器提供细粒度的类别掩码候选,显著提高候选掩码相对于超点对应物的质量。然后,利用大语言模型(LLM)基于语言描述和视觉特征生成语义和空间信息。LLM输出和视觉特征被输入到语义-空间选择器(SSS)以产生排名最高的掩码候选。最后,设计循环验证模块(LVM)从选定的候选掩码中产生分割掩码。我们的SEGA3D在ScanRefer、ScanNet和Matterport3D基准测试中取得了有竞争力的性能。值得注意的是,我们的SEGA3D在ScanNet和Matterport3D上分别超过最佳性能对手8.3 mIoU和5.3 mIoU。代码将在发表后提供。

英文摘要

3D vision-language segmentation aims to segment target objects in 3D scenarios according to linguistic instructions and visual observations. Prior art relies heavily on the coarse superpoint representation to reduce computational complexity, which results in poor segmentation quality and messy object boundaries. In this paper, we propose the SEGment-And-select (SEGA3D) paradigm for 3D vision-language segmentation, which operates directly on fine-grained visual information and is free from the superpoint dependency. Specifically, we first leverage a mask candidate generator to provide fine-grained categorical mask candidates, substantially improving the quality of candidate masks over their superpoint counterparts. Then, a Large Language Model (LLM) is used to generate semantic and spatial information based on the linguistic description and visual features. The LLM output and visual features are fed to the Semantic-Spatial Selector (SSS) to produce the top-ranking mask candidates. Finally, the Loopback Verification Module (LVM) yields the segmentation mask from the selected candidate masks. SEGA3D attains competitive performance on the ScanRefer, ScanNet, and Matterport3D benchmarks. Notably, it surpasses the top-performing counterpart by 8.3 mIoU and 5.3 mIoU on ScanNet and Matterport3D, respectively. Code will be available upon publication.

2606.10592 2026-06-10 cs.LG 新提交

Dirichlet-Guided Group Forecasting for Alleviating Over-smoothing in Time Series Forecasting

Dirichlet引导的群体预测:缓解时间序列预测中的过度平滑

Xingyu Zhang, Jingyao Wang, Xin Yu, Zeen Song, Jianqi Zhang, Changwen Zheng, Wenwen Qiang

发表机构 * University of Chinese Academy of Sciences(中国科学院大学) Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所)

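AI summary To address over-smoothing in time series forecasting, proposes the Dirichlet-Guided Group Forecasting (DGF) framework, which explicitly models multiple mode-conditioned predictive distributions and the uncertainty over their selection probabilities, and uses Dirichlet-guided hierarchical sampling and reward-based optimization to improve forecast accuracy, diversity, and dynamical consistency.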

详情
AI中文摘要

时间序列预测常常遭受过度平滑的影响,尤其是当未来动态是多模态时。预测可能遵循观测未来的粗略趋势,但未能保留定义合理动态演变的急剧变化、振荡、转折点和制度转换。在这项工作中,我们从潜在动态模式压缩的角度重新审视过度平滑:在部分观测和单一实现监督下,多个可能的未来模式可能在预测过程中被削弱、合并或平均。基于这一观点,我们提出了Dirichlet引导的群体预测(DGF),一种保持模式的预测框架,它显式建模多个模式条件预测分布及其选择概率的不确定性。DGF使用Dirichlet引导的分层采样机制和基于奖励的优化,以鼓励预测准确、动态一致且模式区分。在真实世界预测基准上的大量实验表明,DGF减少了过度平滑,同时提高了预测准确性、多样性和动态一致性。

英文摘要

Time series forecasting often suffers from over-smoothing, especially when future dynamics are multi-modal. Forecasts may follow the coarse trend of the observed future, but fail to preserve sharp changes, oscillations, turning points, and regime transitions that define plausible dynamic evolution. In this work, we revisit over-smoothing from the perspective of latent dynamical mode compression: under partial observation and single-realization supervision, multiple plausible future modes can be weakened, merged, or averaged during forecasting. Based on this view, we propose Dirichlet-Guided Group Forecasting (DGF), a mode-preserving forecasting framework that explicitly models multiple mode-conditioned predictive distributions and uncertainty over their selection probabilities. DGF uses a Dirichlet-guided hierarchical sampling mechanism and reward-based optimization to encourage forecasts that are accurate, dynamically consistent, and mode-distinct. Extensive experiments on real-world forecasting benchmarks show that DGF reduces over-smoothing while improving forecasting accuracy, diversity, and dynamical consistency.
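
To make the idea of Dirichlet-guided hierarchical sampling concrete, here is a minimal sketch of one generic way to do it: draw mode-selection probabilities from a Dirichlet distribution, pick a mode, then draw a forecast from that mode's predictive distribution. This is not the authors' implementation. The function names, the two toy modes, and the concentration values are all hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw mode-selection probabilities from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_group(alpha, mode_forecasters, group_size, rng):
    """Hierarchical sampling: probabilities -> mode index -> forecast from that mode."""
    group = []
    for _ in range(group_size):
        probs = sample_dirichlet(alpha, rng)
        mode = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        group.append((mode, mode_forecasters[mode](rng)))
    return group

# Hypothetical toy modes: a flat future and an upward regime shift.
horizon = 4
modes = [
    lambda rng: [rng.gauss(0.0, 0.1) for _ in range(horizon)],
    lambda rng: [rng.gauss(float(t), 0.1) for t in range(horizon)],
]

rng = random.Random(0)
group = sample_group([1.0, 1.0], modes, group_size=8, rng=rng)
for mode, forecast in group:
    print(mode, [round(v, 2) for v in forecast])
```

Because each forecast is drawn from one mode rather than from an average over modes, the sharp differences between modes are preserved within the group. A reward-based objective such as the one the abstract mentions would then score the group members; that part is not sketched here.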

2606.10591 2026-06-10 cs.SD 新提交

ContextCodec: Content-Focused Context Guidance for Ultra-Low Bitrate Speech Coding

ContextCodec: 面向内容的超低比特率语音编码上下文引导

Chengbin Liang, Wenqi Guo, Hao Cao, Zhijin Qin

发表机构 * Department of Electronic Engineering, Tsinghua University(清华大学电子工程系) Department of Automation, Tsinghua University(清华大学自动化系)

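AI summary Proposes ContextCodec, which decouples acoustic detail from content context with a dual-branch encoder and aligns context features with phoneme indices through a CLIP-style contrastive loss, achieving a good balance of quality and intelligibility at 500 bps.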

Comments Accepted at Interspeech 2026. 6 pages, 2 figures, 5 tables

详情
AI中文摘要

神经语音编解码器实现了低比特率语音通信,但在超低比特率(< 1000 bps)下保持感知质量和可懂度具有挑战性。现有设计通常优先考虑声学细节,在严格的比特率约束下留给核心语言信息的容量有限。为了解决这个问题,我们提出了ContextCodec,一种传输面向内容的上下文特征以显式指导重建的编解码器。ContextCodec采用双分支编码器,将声学细节与面向内容的上下文解耦。上下文分支通过CLIP风格的对比损失进行训练,该损失将上下文特征与音素索引对齐,减少副语言泄漏。在解码过程中,这些特征被注入每个解码阶段以进行显式指导。此外,我们引入了一个轻量级的自回归潜在细化模块。实验表明,在500 bps下实现了强大的质量-可懂度权衡,在典型移动CPU上的RTF为0.4886。

英文摘要

Neural speech codecs enable low-bitrate speech communication, yet at ultra-low bitrates (< 1000 bps) preserving perceptual quality and intelligibility is challenging. Existing designs often prioritize acoustic details, leaving limited capacity for the core linguistic message under tight bitrate constraints. To address this, we propose ContextCodec, a codec that transmits content-focused context features to explicitly guide reconstruction. ContextCodec adopts a dual-branch encoder that decouples acoustic details from content-focused context. The context branch is trained with a CLIP-style contrastive loss that aligns context features with phoneme indices, reducing paralinguistic leakage. During decoding, these features are injected at each decoding stage for explicit guidance. In addition, we introduce a lightweight autoregressive latent refinement module. Experiments show a strong quality-intelligibility trade-off down to 500 bps, with an RTF of 0.4886 on a typical mobile CPU.
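
The CLIP-style contrastive loss mentioned above can be illustrated with a generic symmetric InfoNCE objective. The sketch below is a pure-Python illustration, not ContextCodec's training code. It assumes each phoneme index is represented by an embedding vector and that the i-th context feature is paired with the i-th phoneme embedding; the temperature value and the toy vectors are hypothetical.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def clip_style_loss(context_feats, phoneme_embs, temperature=0.07):
    """Symmetric InfoNCE: the i-th context feature should match the i-th phoneme embedding."""
    a = [normalize(v) for v in context_feats]
    b = [normalize(v) for v in phoneme_embs]
    n = len(a)
    logits = [[sum(x * y for x, y in zip(a[i], b[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Context -> phoneme direction (rows), then phoneme -> context (columns).
    loss_a = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_b = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (loss_a + loss_b)

context = [[1.0, 0.0], [0.0, 1.0]]
aligned = clip_style_loss(context, [[1.0, 0.1], [0.1, 1.0]])
shuffled = clip_style_loss(context, [[0.1, 1.0], [1.0, 0.1]])
print(aligned < shuffled)
```

The loss is lower when each context feature sits closest to its own phoneme embedding, which is the alignment pressure the abstract describes.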

2606.10587 2026-06-10 cs.LG cs.AI 新提交

Towards Diverse Scientific Hypothesis Search with Large Language Models

面向多样化科学假设搜索的大语言模型

Haorui Wang, Parshin Shojaee, Kazem Meidani, Kunyang Sun, José Miguel Hernández-Lobato, Teresa Head-Gordon, Jiajun He, Chandan K. Reddy, Chao Zhang, Yuanqi Du

发表机构 * University of California, Berkeley(加州大学伯克利分校)

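AI summary To address diversity collapse in scientific hypothesis search, proposes a multi-temperature evolutionary framework based on parallel tempering that improves hypothesis quality and diversity under a fixed validation budget.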

Comments ICML 2026

详情
AI中文摘要

大语言模型(LLMs)在加速科学发现方面日益崛起,最近在生成有效科学假设等高级任务中表现突出。然而,在许多发现场景中,目标并非识别单一最佳假设,因为验证可能噪声大且成本高,科学家受益于一组高质量替代假设,以对冲下游不确定性,寻求最佳解决方案。尽管如此,常用的进化搜索策略在假设生成中往往优先优化而非探索,搜索过程中的选择压力导致多样性崩溃。受这些局限性的启发,我们将假设搜索表述为采样问题,目标是在固定验证预算下高效生成多样化、高质量的假设。基于这一视角,我们提出\ours,一种受经典并行回火算法启发的进化框架,在多个温度水平下搜索假设,并实现跨温度的原则性信息交换,以在不干扰收敛的情况下改善探索。在分子发现、方程发现和算法发现等领域,我们的方法在相同验证预算下持续提升假设质量和多样性,生成的候选假设在更昂贵的下游计算验证中仍保持稳健。

英文摘要

Large language models (LLMs) are increasingly used to accelerate scientific discovery, most recently in advanced tasks such as generating valid scientific hypotheses. Yet in many discovery settings, the goal is not to identify a single best hypothesis: validation can be noisy and expensive, and scientists benefit from a set of high-quality alternative hypotheses that hedge against downstream uncertainty in the search for the best solutions. Nevertheless, commonly used evolutionary search recipes tend to prioritize optimization over exploration in hypothesis generation, and the resulting selection pressure during the search process leads to diversity collapse. Motivated by these limitations, we formulate hypothesis search as a sampling problem, where the objective is to efficiently produce diverse, high-quality hypotheses under a fixed validation budget. Building on this perspective, we propose an evolutionary framework inspired by the classical parallel tempering algorithm that searches hypotheses at multiple temperature levels and enables principled information exchange across temperatures to improve exploration without disrupting convergence. Across domains including molecular discovery, equation discovery, and algorithm discovery, our approach consistently improves both hypothesis quality and diversity under the same validation budget, and produces candidates that remain robust under more expensive downstream computational validations.
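
For readers unfamiliar with classical parallel tempering, the sketch below shows its standard replica-exchange step: two chains at different temperatures swap states with probability min(1, exp((1/T_i - 1/T_j)(E_i - E_j))). This is the textbook rule only, not the paper's framework. Treating the energy as a negated hypothesis score, and the names `swap_acceptance` and `maybe_swap`, are assumptions made for illustration.

```python
import math
import random

def swap_acceptance(energy_i, energy_j, temp_i, temp_j):
    """Metropolis acceptance probability for exchanging the states of chains i and j."""
    exponent = (1.0 / temp_i - 1.0 / temp_j) * (energy_i - energy_j)
    return 1.0 if exponent >= 0 else math.exp(exponent)

def maybe_swap(states, energies, temps, i, j, rng):
    """Exchange the states held at temperatures i and j with the probability above."""
    if rng.random() < swap_acceptance(energies[i], energies[j], temps[i], temps[j]):
        states[i], states[j] = states[j], states[i]
        energies[i], energies[j] = energies[j], energies[i]
        return True
    return False

# Hypothetical example: the cold chain (T=0.5) holds a worse hypothesis
# (higher energy) than the hot chain (T=2.0), so the swap is always accepted.
states = ["hypothesis_a", "hypothesis_b"]
energies = [2.0, 1.0]   # energy = negative score (assumption)
temps = [0.5, 2.0]
swapped = maybe_swap(states, energies, temps, 0, 1, random.Random(0))
print(swapped, states)
```

Hot chains explore broadly while cold chains refine; the exchange step lets a good candidate found during exploration migrate to the cold chain without changing the distribution each chain targets.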

2606.10582 2026-06-10 cs.LG cs.AI 新提交

Drawing with Strangers: Population Scaling Drives Zero-Shot Mutual Intelligibility in Emergent Sketching

与陌生人共绘:种群规模驱动涌现素描中的零-shot互懂性

Jooyeon Kim

发表机构 * Graduate School of Artificial Intelligence, UNIST(UNIST人工智能研究生院)

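AI summary Using a visual sketching game, the study finds that scaling up the training population substantially improves zero-shot mutual intelligibility between independently trained groups; the mechanism is increased in-group variation and reduced cross-group variation, with structural convergence ultimately reached through perceptual anchoring.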

详情
AI中文摘要

涌现通信中的泛化主要关注新颖输入或语言结构,但智能体与来自严格不相交社区的陌生人进行通信的能力仍相对未被探索。在这项工作中,我们将这种能力形式化为\textit{零-shot互懂性(ZMI)}:独立训练群体之间无需事先接触即可成功通信。利用涌现素描(智能体通过绘制一组笔画进行通信)作为视觉接地模态,我们发现扩大训练种群规模显著提高了独立群体间的ZMI。关键的是,随着种群规模扩大,群体内通信变异增加,防止了同质化共适应。同时,群体间变异减少,表明向某种普遍性的结构收敛。进一步分析揭示,这种普遍性是通过感知接地实现的:扩大后的种群越来越将其涌现素描锚定在目标图像的客观视觉相似性上。这些结果共同将ZMI定位为涌现通信中一个独特的泛化轴,并提出了实现社会可互操作人工智能体的途径。

英文摘要

Generalization in emergent communication has largely focused on novel inputs or linguistic structures, yet the capacity for agents to communicate with strangers from strictly disjoint communities remains relatively unexplored. In this work, we formalize this capability as zero-shot mutual intelligibility (ZMI): successful communication between independently trained populations without prior exposure. Leveraging emergent sketching, in which agents communicate through sets of drawn strokes, as a visually grounded modality, we find that scaling the training population substantially improves ZMI across independent groups. Crucially, as we scale the population size, in-group communicative variation increases, preventing co-adaptation into homogeneity. Simultaneously, cross-group variation decreases, indicating a structural convergence toward a certain type of universality. Further analysis reveals that this universality is achieved through perceptual grounding: scaled populations increasingly anchor their emergent sketches on the objective visual resemblance of the target images. Together, these results position ZMI as a distinct axis of generalization in emergent communication and suggest a route toward socially interoperable artificial agents.

2606.10581 2026-06-10 cs.CL cs.SD eess.AS 新提交

ParaBridge: Bridging Paralinguistic Perception and Dialogue Behavior in Speech Language Models

ParaBridge: 弥合语音语言模型中的副语言感知与对话行为

Yuxiang Wang, Qinke Ni, Shengbo Cai, Wan Lin, Liqiang Zhang, Zhizheng Wu

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Tencent Hunyuan(腾讯混元) Shenzhen Loop Area Institute(深圳循环区域研究所) Amphion Technology Co., Ltd.(Amphion科技有限公司) Tsinghua University(清华大学)

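AI summary Proposes ParaBridge, an on-policy self-distillation method that turns an inference-time paralinguistic instruction scaffold into stable model behavior without human labels or external rewards, substantially improving how speech language models respond to paralinguistic cues.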

详情
AI中文摘要

语音携带的信息远不止文字:孩子的声音、恐惧的语气或嘈杂的背景都应引导一个足够胜任的语音对话助手给出不同的回复。当前的语音语言模型(SLM)能够识别此类副语言线索,但在开放域对话中常常忽略它们。我们观察到,在推理阶段使用简单的副语言指令支架可以缩小这种感知-行为差距,表明相关线索已潜在于模型中。然而,这种支架在多轮上下文和竞争指令下仍然脆弱。因此,我们提出\textbf{ParaBridge},一种在线自我蒸馏方法,将脆弱的推理时支架转化为稳定的模型行为。在训练过程中,支架仅作为临时的特权视图;无支架模型自行生成回复,而支架视图沿其轨迹提供密集的全词汇下一词目标。这种监督教会了模型在非词汇线索应影响回复时的时机,无需策划的对话、人工标签或外部奖励模型。在Qwen3-Omni-thinking上,ParaBridge将无支架的VoxSafeBench SAR从14.6\%提升至40.3\%,并将EchoMind平均评分从3.27提升至3.92。它还保留了通用能力,MMAU-Pro、VoiceBench和GPQA均与原始模型相差在0.4分以内。在训练分布之外,ParaBridge泛化到未见过的副语言线索,从面向安全的训练迁移到共情导向的对话,并在不同的SLM骨干上有效。

英文摘要

Speech carries more information than just words: a child's voice, a fearful tone, or a noisy background should all lead a sufficiently competent spoken-dialogue assistant to different replies. Current Speech Language Models (SLMs) can recognize such paralinguistic cues but often ignore them in open-ended dialogue. We observe that a simple paralinguistic instruction scaffold at the inference stage narrows this perception-behavior gap, suggesting that the relevant cues are already latent in the model. Such scaffolds, however, remain brittle under multi-turn context and competing instructions. Therefore, we propose ParaBridge, an on-policy self-distillation method that turns a brittle inference-time scaffold into stable model behavior. During training, the scaffold serves only as a temporary privileged view; the scaffold-free model rolls out its own response, while the scaffolded view supplies dense, full-vocabulary next-token targets along its trajectory. This supervision teaches when non-lexical cues should affect the reply without the need for curated dialogues, human labels, or external reward models. On Qwen3-Omni-thinking, ParaBridge raises scaffold-free VoxSafeBench SAR from 14.6% to 40.3% and improves EchoMind average rating from 3.27 to 3.92. It also preserves general ability, with MMAU-Pro, VoiceBench, and GPQA all within 0.4 points of the original model. Beyond the training distribution, ParaBridge generalizes to unseen paralinguistic cues, transfers from safety-oriented training to empathy-oriented dialogue, and works on a different SLM backbone.

2606.10580 2026-06-10 cs.LG cs.AI 新提交

Convergence of Monte Carlo Optimistic Policy Iteration: Beyond Uniform State-Action Updates

蒙特卡洛乐观策略迭代的收敛性:超越均匀状态-动作更新

Octave Oliviers, Glenn Vinnicombe

发表机构 * Department of Engineering, University of Cambridge(剑桥大学工程系)

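AI summary Proves that initial-visit Monte Carlo optimistic policy iteration converges to optimality when updates are uniform over the actions within each state, relaxing the classical requirement of uniform state-action updates; the proof combines mean-field dynamics with a lock-in argument.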

详情
AI中文摘要

蒙特卡洛乐观策略迭代(MC-O-PI)的渐近行为是一个长期悬而未决的问题。当环境模型未知时(这在实践中很常见),唯一已知的保证收敛到最优性的条件是不切实际的。在其标准形式中,该条件要求用于策略评估的回合在整个状态-动作空间上均匀初始化。本文严格放宽了这一要求。具体来说,我们证明即使更新仅在每个状态内的动作上均匀,首次访问MC-O-PI也能收敛到最优性。这允许回合以任意频率从不同状态开始;当状态空间很大或未知但每个状态中的动作空间可管理时,这是一种现实的实现。证明脱离了Tsitsiklis的经典分析,其中心交换性论证在状态以不同频率更新时不再适用。相反,我们首先证明当更新在每个状态的动作上均匀时,MC-O-PI的均场动力学生成单调改进的策略,然后通过扩展组合稳定性-ODE方法的锁定论证,证明噪声不能持续阻止这种改进。这种方法为一般研究乐观策略迭代算法提供了一种新途径。

英文摘要

The asymptotic behaviour of Monte Carlo optimistic policy iteration (MC-O-PI) is a long-standing open question. When the model of the environment is unknown, as is common in practice, the only known condition that guarantees convergence to optimality is impractical. In its canonical form, this condition requires that the episodes used for policy evaluation be initialised uniformly over the entire state-action space. This paper strictly relaxes that requirement. Specifically, we prove that initial-visit MC-O-PI converges to optimality even when updates are uniform only over the actions within each state. This allows episodes to start in different states at arbitrary frequencies, a realistic implementation when the state space is large or unknown but the action space in each state is manageable. The proof departs from the classical analysis of Tsitsiklis, whose central commutativity argument no longer applies when states are updated at different frequencies. Instead, we first show that the mean-field dynamics of MC-O-PI generate monotonically improving policies when updates are uniform over the actions in each state, and then prove that noise cannot consistently prevent this improvement by extending the lock-in argument of the combined stability-ODE method. This approach suggests a new way to study optimistic policy-iteration algorithms in general.
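
The update scheme described above can be illustrated on a toy problem. The sketch below is not taken from the paper: the two-state deterministic MDP, the 0.9/0.1 start-state frequencies, the undiscounted return, and the 1/n step size are all hypothetical choices. It shows the three ingredients named in the abstract: episodes start in states at arbitrary frequencies, the initial action is uniform within the chosen state, and only the initial state-action pair is updated while the rest of the episode follows the current greedy policy.

```python
import random

# Hypothetical deterministic episodic MDP: (state, action) -> (next_state, reward).
# A next_state of None marks termination.
MDP = {
    (0, 0): (1, 0.0), (0, 1): (None, 1.0),
    (1, 0): (None, 5.0), (1, 1): (None, 0.0),
}
ACTIONS = (0, 1)

def greedy(q, state):
    """Greedy action under the current estimates (ties go to the lower action index)."""
    return max(ACTIONS, key=lambda a: q[(state, a)])

def run_mc_opi(episodes, state_weights, rng):
    q = {sa: 0.0 for sa in MDP}
    visits = {sa: 0 for sa in MDP}
    states = sorted(state_weights)
    weights = [state_weights[s] for s in states]
    for _ in range(episodes):
        s0 = rng.choices(states, weights=weights, k=1)[0]  # arbitrary state frequencies
        a0 = rng.choice(ACTIONS)                            # uniform over actions in s0
        state, action, ret = s0, a0, 0.0
        while state is not None:
            state, reward = MDP[(state, action)]
            ret += reward                                   # undiscounted return
            if state is not None:
                action = greedy(q, state)                   # follow current greedy policy
        # Initial-visit update: only the starting pair is updated (running average).
        visits[(s0, a0)] += 1
        q[(s0, a0)] += (ret - q[(s0, a0)]) / visits[(s0, a0)]
    return q

q = run_mc_opi(2000, {0: 0.9, 1: 0.1}, random.Random(0))
policy = {s: greedy(q, s) for s in (0, 1)}
print(policy)
```

Here state 1 is started far less often than state 0, yet the greedy policy still settles on the optimal action in both states, which is the kind of behaviour the convergence result addresses.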

2606.10579 2026-06-10 cs.RO cs.SY eess.SY 新提交

LieIPM: Lie Group Interior Point Method for Direct Trajectory Optimization of Rigid Bodies

LieIPM:用于刚体直接轨迹优化的李群内点法

Sangli Teng, Ruiqi Zhang, Tzu-Yuan Lin, William A Clark, Mark Mueller, Ram Vasudevan, Maani Ghaffari, Koushil Sreenath

发表机构 * University of California, Berkeley(加州大学伯克利分校) MIT(麻省理工学院) Ohio University(俄亥俄大学) University of Michigan, Ann Arbor(密歇根大学安娜堡分校)

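AI summary Proposes LieIPM, a constrained trajectory optimization framework built on Lie group structure that uses a second-order rigid body model and variational integrators to obtain singularity-free, fast-converging Newton-type updates.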

详情
AI中文摘要

设计刚体的动态可行轨迹是机器人学中的一个基本问题。虽然直接方法被广泛使用,但现有的约束优化器通常在欧几里得空间中运行,忽略了刚体运动的流形结构。这种不匹配可能引入奇异性或导致优化问题病态。为了弥补这一差距,我们开发了一个结构感知框架,直接在矩阵李群上进行约束轨迹优化。我们的方法基于利用李群结构的二阶刚体模型,这使得在保持底层几何结构的同时实现高效的牛顿型更新成为可能。在此模型基础上,我们提出了一种线搜索李群内点法(LieIPM)来处理流形上的约束。我们使用李群变分积分器实例化该框架用于刚体运动规划,并推导出利用群对称性的闭式内蕴导数。LieIPM通过构造保留了旋转运动的拓扑结构,避免了奇异性。数值结果表明,与通用求解器和结构利用最优控制方法相比,该方法具有更强的鲁棒性和更快的收敛速度。

英文摘要

Designing dynamically feasible trajectories for rigid bodies is a fundamental problem in robotics. While direct methods are widely used, the existing constrained optimizers typically operate in Euclidean space and ignore the manifold structure of rigid body motions. This mismatch may introduce singularities or lead to poorly conditioned optimization problems. To bridge this gap, we develop a structure-aware framework for constrained trajectory optimization directly on matrix Lie groups. Our approach is based on the second-order rigid body models utilizing Lie group structures, which enables efficient Newton-type updates while preserving the underlying geometry. Building on this model, we propose a line-search Lie Group Interior Point Method (LieIPM) to handle constraints on the manifolds. We instantiate the framework for rigid body motion planning using Lie group variational integrators and derive closed-form intrinsic derivatives that exploit group symmetries. The LieIPM preserves the topology of rotation motions by construction and avoids singularities. Numerical results demonstrate superior robustness and faster convergence compared to general-purpose solvers and structure-exploiting optimal control methods.
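
Optimizing directly on a Lie group usually means computing the step in the Lie algebra and applying it through the exponential map, so the iterate never leaves the manifold. The sketch below illustrates this for rotations in SO(3) using Rodrigues' formula. It is a generic illustration of such an update, not the LieIPM solver; the function names and the right-multiplication convention are assumptions.

```python
import math

def hat(w):
    """Map a 3-vector to its skew-symmetric matrix."""
    x, y, z = w
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def exp_so3(w):
    """Rodrigues' formula: exp(hat(w)) = I + sin(t)/t K + (1 - cos(t))/t^2 K^2, t = |w|."""
    theta = math.sqrt(sum(v * v for v in w))
    k = hat(w)
    k2 = matmul(k, k)
    if theta < 1e-8:
        a, b = 1.0, 0.5  # small-angle limits of the two coefficients
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / theta ** 2
    return [[(1.0 if i == j else 0.0) + a * k[i][j] + b * k2[i][j]
             for j in range(3)] for i in range(3)]

def retract(rotation, step):
    """Apply a Lie-algebra step: R <- R exp(hat(step)). The result stays on SO(3)."""
    return matmul(rotation, exp_so3(step))

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
r = retract(identity, [0.0, 0.0, math.pi / 2])  # quarter turn about the z axis
for row in r:
    print([round(v, 6) for v in row])
```

Because the update is a product of rotation matrices, there is no Euler-angle singularity and no need to re-normalize, in contrast to adding a step to a Euclidean parameterization.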

2606.10577 2026-06-10 cs.RO 新提交

AgenticNav: Zero-Shot Vision-and-Language Navigation as a Tool-Calling Harness

AgenticNav:零样本视觉与语言导航作为工具调用框架

Yijian Li, Changze Li, Hantian Shi, Jiaying Luo, Jiyuan Cai, Ming Yang, Tong Qin

发表机构 * Shanghai Jiao Tong University(上海交通大学) Huawei Technologies Ltd(华为技术有限公司)

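AI summary Proposes AgenticNav, which exposes action, depth, and memory to the VLM as callable tools for zero-shot navigation in continuous environments, reaching SOTA performance among zero-shot methods on the R2R-CE benchmark.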

详情
AI中文摘要

连续环境中的零样本视觉与语言导航(VLN-CE)最近随着大型视觉语言模型(VLM)的出现而变得可行。然而,现有方法通常依赖学习到的航点预测器来提出可导航动作,这严重限制了模型的动作空间,并且未能有效利用深度输入。此外,记忆通常通过累积包含大量无关上下文的冗长文本或视觉历史,或通过检索跨回合经验来处理,这削弱了零样本设置。在本文中,我们将零样本VLN-CE重新思考为VLM与环境之间的代理接口,并提出了AgenticNav,这是一个轻量级导航框架,将动作、深度和记忆暴露为可调用的工具。动作工具允许VLM直接选择RGB观测中的目标像素,并将其转换为可执行运动,而不是从预测的航点中选择。深度通过按需像素深度工具暴露,使VLM能够在需要的地方请求精确的度量距离。对于记忆,AgenticNav提供了一个紧凑的地图图像,总结历史轨迹,并配有一个召回工具,允许VLM有选择地重新访问过去的视觉观测,而不会使提示上下文过载。在R2R-CE基准上,AgenticNav在相同VLM骨干下,在零样本方法中建立了新的最先进(SOTA)性能。真实世界验证进一步突显了其相比先前方法的零样本泛化能力。消融实验表明,我们的动作工具设计优于传统航点预测器,并且深度工具和代理记忆进一步促进了导航性能。

英文摘要

Zero-shot vision-and-language navigation in continuous environments (VLN-CE) has recently become feasible with large vision-language models (VLMs). However, existing methods typically rely on learned waypoint predictors to propose navigable actions. This severely limits the model's action space and fails to leverage depth inputs effectively. Moreover, memory is commonly handled by accumulating long textual or visual histories with substantial irrelevant context, or by retrieving cross-episode experiences, which weakens the zero-shot setting. In this paper, we rethink zero-shot VLN-CE as an agentic interface between the VLM and the environment, and present AgenticNav, a lightweight navigation harness that exposes action, depth, and memory as callable tools. Instead of choosing from predicted waypoints, the action tool allows the VLM to directly select a target pixel in RGB observations, converting it into executable motion. Depth is exposed through an on-demand pixel-depth tool, enabling the VLM to request precise metric distances only where they matter. For memory, AgenticNav provides a compact map image summarizing the historical trajectory, paired with a recall tool that allows the VLM to selectively revisit past visual observations without overwhelming the prompt context. On the R2R-CE benchmark, AgenticNav establishes a new state of the art (SOTA) among zero-shot methods given the same VLM backbone. Real-world validation further highlights its zero-shot generalization compared to prior methods. Ablations show that our action tool design outperforms traditional waypoint predictors, and that the depth tool and agentic memory further contribute to navigation performance.
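
One plausible way to turn a selected pixel plus a metric depth reading into a motion target is pinhole back-projection. The sketch below is an assumption about how such a conversion could work, not AgenticNav's actual tool: the function name, the camera intrinsics, and the convention that depth is measured along the optical axis are all hypothetical.

```python
import math

def pixel_to_goal(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) into the camera frame (x right, y down, z forward).

    Assumes a pinhole camera and that depth_m is the distance along the optical axis.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    z = depth_m
    heading = math.atan2(x, z)    # radians; positive means turn right
    distance = math.hypot(x, z)   # ground-plane distance, ignoring height
    return (x, y, z), heading, distance

# Hypothetical 640x480 camera with focal length 320 px.
point, heading, distance = pixel_to_goal(
    u=480, v=240, depth_m=2.0, fx=320.0, fy=320.0, cx=320.0, cy=240.0
)
print(point, round(heading, 4), round(distance, 3))
```

A controller could then rotate by `heading` and drive forward by `distance`. Querying depth only for the selected pixel mirrors the on-demand depth tool described in the abstract.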

2606.10572 2026-06-10 cs.AI 新提交

One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA

每个多模态证据一个令牌:面向资源受限问答的潜在记忆

Zhi Zheng, Ziqiao Meng, Hao Luan, Wei Liu, Wee Sun Lee

发表机构 * School of Computing, National University of Singapore(新加坡国立大学计算机学院)

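AI summary Proposes the Latent Memory paradigm, which compresses each evidence item into a single high-dimensional latent token and trains retrieval and generation in a unified way, achieving competitive QA performance with 3x to 10x fewer tokens in resource-constrained settings.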

详情
AI中文摘要

外部记忆有效地将基于大语言模型(LLMs)和视觉-语言模型(VLMs)的问答(QA)与相关的多模态证据联系起来。然而,现有的记忆范式以原始文本和图像形式表示每个记忆项,因此基于检索的系统必须将检索到的文本或图像传递给生成LLMs/VLMs,导致高令牌消耗和存储压力,使得资源受限的应用难以承受。我们提出潜在记忆,一种潜在空间记忆范式,它将每个原始文本或图像证据项替换为由小型压缩器LLM/VLM生成的单个高维潜在令牌。潜在记忆不是在生成时检索原始证据,而是在统一的潜在表示空间中操作:查询被嵌入到该空间中以检索相关的潜在令牌,检索到的潜在令牌直接提示给预训练的LLM或VLM以生成答案。为了使每个潜在令牌同时具有用于重建、检索和生成的信息,我们使用重建、对比和蒸馏目标以统一的端到端方式训练压缩器。潜在记忆在七个纯文本QA基准(例如HotpotQA)和多模态QA基准上进行了评估,与先进的RAG基线相比,它实现了具有竞争力的QA性能,同时消耗的生成器令牌减少了3到10倍。它还能在WebQA上提供最强的图像基础问答性能。代码可在该https URL获取。

英文摘要

External memory effectively grounds question answering (QA) based on large language models (LLMs) and vision-language models (VLMs) in relevant multimodal evidence. However, existing memory paradigms represent each memory item as raw text or images, so retrieval-based systems must pass the retrieved text or images to the generator LLM/VLM. This leads to high token consumption and storage pressure, which makes such systems unaffordable for resource-constrained applications. We propose Latent Memory, a latent-space memory paradigm that replaces each raw text or image evidence item with a single high-dimensional latent token produced by a small compressor LLM/VLM. Rather than retrieving raw evidence for generation, Latent Memory operates in a unified latent representation space: the query is embedded into this space to retrieve relevant latent tokens, and the retrieved latent tokens are directly prompted to a pretrained LLM or VLM for answer generation. To make each latent token simultaneously informative for reconstruction, retrieval, and generation, we train the compressor with reconstruction, contrastive, and distillation objectives in a unified end-to-end manner. Latent Memory is evaluated on seven text-only QA benchmarks (e.g., HotpotQA) and multimodal QA benchmarks, where it achieves competitive QA performance compared to advanced RAG baselines while consuming 3x to 10x fewer generator tokens. It also delivers the strongest image-grounded QA performance on WebQA. Code is available at https://github.com/zz1358m/Latent-Memory-Master.
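
The retrieval step described above, embedding a query into the same space as the stored latent tokens and returning the closest ones, can be sketched as a nearest-neighbour search. This is a generic illustration, not the released code: cosine similarity as the ranking score, the three-dimensional toy vectors, and the item ids are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, memory, k):
    """Return the ids of the k latent tokens most similar to the query embedding."""
    ranked = sorted(memory, key=lambda item_id: cosine(query_vec, memory[item_id]),
                    reverse=True)
    return ranked[:k]

# Hypothetical memory: one latent vector per evidence item, text or image alike.
memory = {
    "doc_a": [1.0, 0.0, 0.0],
    "img_b": [0.0, 1.0, 0.0],
    "doc_c": [0.7, 0.7, 0.0],
}
top = retrieve([1.0, 0.2, 0.0], memory, k=2)
print(top)
```

With one token per evidence item, the generator's prompt grows by only `k` tokens per query regardless of how long the underlying documents or how large the images are, which is where the reported token savings come from.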

2606.10571 2026-06-10 cs.CV cs.AI cs.CR 新提交

Improving Adversarial Transferability on Vision-Language Pre-training Models via Surrogate-Specific Bias Correction

通过代理特定偏差校正提高视觉-语言预训练模型上的对抗迁移性

Lijia Yu, Jiuxin Cao, Yuchen Qiang, Changhao Chen, Yifei Huang, Bo Liu

发表机构 * School of Cyber Science and Engineering, Southeast University(东南大学网络空间安全学院) Purple Mountain Laboratories(紫金山实验室) School of Computer Science and Engineering, Southeast University(东南大学计算机科学与工程学院)

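AI summary Proposes DeBias-Attack, which removes surrogate-specific bias through gradient correction to improve the transferability of adversarial examples across VLP models; experiments confirm its effectiveness across multiple models and tasks.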

Comments 17 pages, 7 figures, 10 tables

详情
AI中文摘要

对抗样本揭示了视觉-语言预训练(VLP)模型中的脆弱性,并为提高鲁棒性提供了见解。一个关键特性是跨模型迁移性,这使得基于迁移的黑盒攻击成为可能。然而,现有攻击通常严重依赖代理模型,导致跨模型性能下降。一个原因是对抗优化可能更多地遵循代理模型响应而非输入语义,使得更新方向在代理模型上有效,但对未见目标迁移性较差。我们将这种依赖称为代理特定偏差。受此观察启发,DeBias-Attack通过校正对抗优化方向中的代理特定偏差来提高迁移性。它维护两个扰动分支。主分支在原始图像上优化扰动,并获得用于破坏图像-文本对齐的对抗梯度。参考分支在弱语义图像上优化扰动,该图像由数据集平均图像加上每次迭代重新采样的小高斯噪声构成。由于该弱语义图像几乎不含清晰的视觉内容,其优化更多地反映代理模型响应而非图像语义,其参考梯度估计代理特定偏差。DeBias-Attack在更新对抗图像之前移除主梯度在参考梯度上的对齐投影,然后使用更新后的对抗图像进行上下文感知的文本替换。DeBias-Attack是首个通过梯度校正来校正代理特定偏差的基于迁移的VLP攻击。实验表明,在VLP模型、下游任务以及开源和闭源多模态大语言模型上均表现出强劲性能。

英文摘要

Adversarial examples reveal vulnerabilities in Vision-Language Pre-training (VLP) models and provide insights for improving robustness. A key property is cross-model transferability, which enables transfer-based black-box attacks. However, existing attacks often rely heavily on the surrogate model, causing cross-model performance drops. One reason is that adversarial optimization may follow surrogate model responses more than input semantics, making the update direction effective on the surrogate but less transferable to unseen targets. We refer to this dependency as surrogate-specific bias. Motivated by this observation, DeBias-Attack improves transferability by correcting surrogate-specific bias in adversarial optimization directions. It maintains two perturbation branches. The main branch optimizes a perturbation on the original image and obtains the adversarial gradient used to disrupt image-text alignment. The reference branch optimizes a perturbation on a weak-semantic image constructed from the dataset mean image with small Gaussian noise resampled at each iteration. Since this weak-semantic image contains little clear visual content, its optimization reflects surrogate responses more than image semantics, and its reference gradient estimates surrogate-specific bias. DeBias-Attack removes the aligned projection of the main gradient on the reference gradient before updating the adversarial image, then performs context-aware text substitution using the updated adversarial image. DeBias-Attack is the first transfer-based VLP attack that corrects surrogate-specific bias through gradient correction. Experiments show strong performance across VLP models, downstream tasks, and open-source and closed-source multimodal large language models.
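
The gradient-correction step, removing the aligned projection of the main gradient on the reference gradient, can be written as a one-line vector operation. The sketch below is an interpretation, not the authors' code: it assumes "aligned" means the projection is removed only when the two gradients have a positive inner product, and it treats gradients as flat lists of floats.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def remove_aligned_projection(main_grad, ref_grad, eps=1e-12):
    """Subtract the component of main_grad along ref_grad, only when the two agree in direction."""
    coeff = dot(main_grad, ref_grad) / (dot(ref_grad, ref_grad) + eps)
    if coeff <= 0.0:
        # Gradients are orthogonal or opposed: nothing aligned to remove (assumption).
        return list(main_grad)
    return [g - coeff * r for g, r in zip(main_grad, ref_grad)]

# The main gradient shares its first component with the reference (bias) direction.
corrected = remove_aligned_projection([1.0, 1.0], [1.0, 0.0])
untouched = remove_aligned_projection([-1.0, 1.0], [1.0, 0.0])
print(corrected, untouched)
```

After correction, the update direction is numerically orthogonal to the reference gradient, so the step no longer moves along the direction attributed to surrogate-specific bias.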

2606.10569 2026-06-10 cs.CL cs.AI 新提交

Hidden Consensus: Preference-Validity Compression in Human Feedback

隐藏共识:人类反馈中的偏好有效性压缩

Dorcas Chia Ern Chua, Karen Myn Hui Lee, Jia Yue Tan, Zhen Xue Gue, Norzalena Abdul Hamid, Azima Binti Azmi, Keat Mei Yeong, Aizat Izyani binti Mujab, Hafsah Noor Azam, Chee Guo Khoo, Han Ying Lim, Chee Seng Chan

发表机构 * YTL AI Labs Universiti Malaya(马来亚大学) Monash University Malaysia(莫纳什大学马来西亚校区) Universiti Malaysia Sarawak(马来西亚沙捞越大学)

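AI summary Identifies preference-validity compression: RLHF collapses plural-valid feedback into a single reward target, which biases the measurement of alignment. In an analysis of a Malaysian corpus, 79% of prompts have more than one majority-supported response, indicating that majority aggregation measures argmax acceptability rather than plural alignment.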

Comments 28 pages. When AI learns from human feedback, it forces a single "correct" answer, but sometimes multiple answers are all genuinely valid, and that nuance gets thrown away

详情
AI中文摘要

标准的RLHF流程通常将异质的人类判断简化为单一的标量奖励目标。我们认为这种简化在结构多元的社会中可能错误地衡量对齐,在这些社会中,分歧可能反映文化、历史、语言、区域或规范性的解释,而非标注噪声。我们将这种失败称为偏好有效性压缩,即多个多元有效的响应选项被压缩成一个优化目标。以马来西亚为诊断场景,我们通过偏好事件分析RLHF风格的反馈聚合,这些事件将提示、响应和跨解释框架的可接受性判断联系起来。在来自20名参与者和107个三人标注提示的321个偏好事件中,79%的提示包含多个多数支持的响应,而单一赢家聚合会丢弃这些响应,并且当考虑所有多数支持的选项时,顶部响应之间的明显优势差距会消失。参与者经常选择多个可接受的响应,而被丢弃的响应明显反映了连贯的本地、实践或文化框架。这些发现表明,该语料中的多数聚合测量的是argmax可接受性而非多元对齐。我们将此视为测量有效性问题,并认为未来的对齐方法应满足有效性保持一致性,即在多元有效的解释框架中保持稳定,而不是将它们压缩为单一的奖励目标。

英文摘要

Standard RLHF pipelines often reduce heterogeneous human judgments into a single scalar reward target. We argue that this reduction can mis-measure alignment in structurally plural societies, where disagreement may reflect culturally, historically, linguistically, regionally, or normatively grounded interpretations rather than annotation noise. We call this failure Preference-Validity Compression, the collapse of multiple plural-valid response options into a single optimization target. Using Malaysia as a diagnostic setting, we analyze RLHF-style feedback aggregation through preference events linking prompts, responses, and acceptability judgments across interpretive frames. Across 321 preference events from 20 participants and 107 trio-annotated prompts, 79% of prompts contain more than one majority-supported response that single-winner aggregation would discard, and apparent dominance gaps between top responses diminish when all majority-supported options are considered. Participants frequently select multiple acceptable responses, and discarded responses demonstrably reflect coherent local, practical, or cultural frames. These findings show that majority aggregation in this corpus measures argmax acceptability rather than plural alignment. We treat this as a measurement-validity issue and argue that future alignment methods should satisfy Validity-Preserving Consistency, remaining stable across plural-valid interpretive frames rather than collapsing them into a single reward target.
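
The difference between single-winner aggregation and keeping every majority-supported response is easy to see on one prompt. The sketch below uses made-up votes and hypothetical function names; it illustrates the two aggregation rules and does not reproduce the paper's analysis. It assumes three annotators per prompt, consistent with the 321 events across 107 prompts quoted above (107 x 3 = 321), each marking any number of responses as acceptable.

```python
def majority_supported(votes, n_annotators):
    """Responses that more than half of the annotators marked acceptable."""
    return sorted(r for r, count in votes.items() if count > n_annotators / 2)

def single_winner(votes):
    """Argmax aggregation: keep only the most-accepted response (ties broken by name)."""
    return max(sorted(votes), key=lambda r: votes[r])

# Hypothetical prompt: acceptability counts per response from 3 annotators.
votes = {"A": 3, "B": 2, "C": 1}
kept = majority_supported(votes, n_annotators=3)
winner = single_winner(votes)
discarded = [r for r in kept if r != winner]
print(kept, winner, discarded)
```

Response "B" has majority support but disappears under argmax aggregation. Prompts like this one are what the 79% figure counts.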

2606.10568 2026-06-10 cs.RO 新提交

VeriSpace: Spatially Grounded Action Verification for Vision-Language-Action Models

VeriSpace: 面向视觉-语言-动作模型的空间基础动作验证

Guiyu Zhao, Longteng Guo, Junyou Zhu, Jun Fu, Yanghong Mei, Bin Cao, Jie Jiang, Xingjian He, Jing Liu

发表机构 * Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) University of Chinese Academy of Sciences(中国科学院大学)

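AI summary Proposes VeriSpace, a 3D-aware action verifier that selects among candidate actions at test time through dual-path 3D-injected scene encoding and spatially grounded action reasoning, improving the reliability of VLA models.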

Comments Submitted to ACM MM

详情
AI中文摘要

视觉-语言-动作(VLA)模型在机器人操作中展现出强大潜力,但其测试时的可靠性仍受限于一次性动作预测,即使微小的动作误差也可能导致抓取失败、碰撞或任务进展错误。一种自然的替代方案是为VLA系统配备测试时验证,允许在执行前提出并评估多个候选动作。然而,可靠的动作验证具有挑战性,因为它不仅需要区分候选动作之间的细微几何差异,还需要评估动作是否朝着任务目标有意义地推进。我们提出VeriSpace,一种用于VLA系统测试时动作选择的3D感知动作验证器。VeriSpace通过两个关键组件评估候选动作:双路径3D注入场景编码,构建同时保留视觉语义和显式3D几何的场景表示;以及空间基础动作推理,通过推理任务相关的空间关系、几何有效性和预期的目标进展来评估每个动作。这些组件共同实现了对细微但结果关键的候选动作更可靠的区分,同时与现有VLA策略完全兼容。在公共基准和真实机器人操作任务上的实验表明,VeriSpace在分布内和分布外设置中均持续提高了底层VLA策略和先前基于验证的方法的决策可靠性,带来了显著的性能提升。

英文摘要

Vision-language-action (VLA) models have shown strong promise for robotic manipulation, but their reliability at test time remains limited by one-shot action prediction, where even small action errors can cause grasp failure, collision, or incorrect task progression. A natural alternative is to equip VLA systems with test-time verification, allowing multiple candidate actions to be proposed and evaluated before execution. However, reliable action verification is challenging because it requires not only distinguishing subtle geometric differences between candidate actions, but also assessing whether an action makes meaningful progress toward the task goal. We present VeriSpace, a 3D-aware action verifier for test-time action selection in VLA systems. VeriSpace evaluates candidate actions through two key components: Dual-Path 3D-Injected Scene Encoding, which constructs a scene representation that jointly preserves visual semantics and explicit 3D geometry, and Spatially-Grounded Action Reasoning, which evaluates each action by reasoning over task-relevant spatial relations, geometric validity, and expected goal progress. Together, these components enable more reliable discrimination between subtle yet outcome-critical action candidates while remaining fully compatible with existing VLA policies. Experiments on public benchmarks and real-world robotic manipulation tasks show that VeriSpace consistently improves decision reliability over both underlying VLA policies and prior verification-based methods, yielding substantial gains in both in-distribution and out-of-distribution settings.

2606.10565 2026-06-10 cs.SD eess.AS 新提交

A Lightweight Dual-Factor Acoustic Authentication System via Cascaded GMM-DTW Architecture for Edge Computing

一种基于级联GMM-DTW架构的轻量级双因素声学认证系统用于边缘计算

Yutong Zhang

发表机构 * Yutong Zhang(张宇同)

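AI summary For resource-constrained edge environments, proposes a lightweight cascaded GMM-DTW dual-factor voice lock system that implements a sequential defense over a shared MFCC feature space, combined with a dynamic joint absolute-relative margin constraint, to achieve low latency and high security on low-power edge nodes.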

详情
AI中文摘要

本文提出了一种轻量级、级联GMM-DTW双因素语音锁系统,适用于资源受限的边缘环境。通过利用共享的MFCC特征空间,该框架实现了结合GMM说话人筛选和DTW口令验证的顺序防御机制。为了在不增加额外硬件的情况下应对呈现攻击,在GMM分类空间中引入了动态联合绝对-相对边界约束,将物理冒名顶替者和高保真重放攻击的误接受率(FAR)分别限制在2.73%和6.67%,合法用户的误拒绝率(FRR)为16.67%。由于Sakoe-Chiba窗口优化,在时间压力下,全局端到端处理延迟在单核CPU上严格限制为9.82ms,其中特征提取1.51ms,GMM评分0.54ms,最坏情况DTW匹配7.77ms。这些经验基准证明了白盒声学级联在低功耗边缘节点上实现安全、确定性实时部署的可行性。

英文摘要

This paper presents a lightweight, cascaded GMM-DTW dual-factor voice lock system for resource-constrained edge environments. By utilizing a shared MFCC feature space, the framework implements a sequential defense mechanism combining GMM speaker screening and DTW passphrase verification. To counter presentation threats without extra hardware, a dynamic joint absolute-relative margin constraint is integrated into the GMM classification space, limiting the physical imposter and high-fidelity replay attack False Acceptance Rates (FAR) to 2.73% and 6.67%, respectively, with a legitimate False Rejection Rate (FRR) of 16.67%. Due to Sakoe-Chiba window optimization, the global end-to-end processing latency under temporal stress is rigidly bounded at 9.82ms on a single-core CPU, comprising 1.51ms for feature extraction, 0.54ms for GMM scoring, and 7.77ms for worst-case DTW matching. These empirical benchmarks demonstrate the viability of white-box acoustic cascades for secure, deterministic real-time deployment on low-power edge nodes.
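
Two details in the abstract lend themselves to a quick illustration: the Sakoe-Chiba window that bounds DTW cost, and the latency budget, whose three stages sum to the quoted 9.82 ms. The sketch below is a generic banded DTW on scalar sequences, not the paper's implementation. Real MFCC frames would be vectors with a vector distance, and the function name and the window-widening rule are assumptions.

```python
import math

def dtw_sakoe_chiba(a, b, window):
    """DTW distance restricted to a band |i - j| <= window around the diagonal."""
    n, m = len(a), len(b)
    window = max(window, abs(n - m))  # the band must at least cover the length difference
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        # Only cells inside the band are evaluated, which bounds the work per row.
        for j in range(max(1, i - window), min(m, i + window) + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

same = dtw_sakoe_chiba([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], window=1)
stretched = dtw_sakoe_chiba([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0], window=1)
different = dtw_sakoe_chiba([1.0, 2.0, 3.0], [3.0, 2.0, 1.0], window=1)

# Per-stage latencies quoted in the abstract, in milliseconds.
stage_latency_ms = {"mfcc": 1.51, "gmm": 0.54, "dtw_worst_case": 7.77}
total_ms = sum(stage_latency_ms.values())
print(same, stretched, different, round(total_ms, 2))
```

The band caps the number of cells evaluated at roughly `n * (2 * window + 1)` instead of `n * m`, which is why the worst-case DTW time can be bounded in advance.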

2606.10554 2026-06-10 cs.CL cs.AI 新提交

Benchmarking Knowledge Editing using Logical Rules

使用逻辑规则对知识编辑进行基准测试

Tatiana Moteu Ngoli, NDah Jean Kouagou, Hamada M. Zahera, Axel-Cyrille Ngonga Ngomo

发表机构 * Data Science Group, Heinz Nixdorf Institute, Paderborn University(帕德博恩大学海因茨·尼克斯多夫研究所数据科学组)

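AI summary Proposes a logical-rule-based benchmark that evaluates how knowledge editing methods handle the logical consequences of a single edit, finding that existing methods score up to 24% lower on entailed knowledge than on directly edited knowledge.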

Comments Accepted at the 24th International Semantic Web Conference 2025

Journal ref
The Semantic Web. ISWC 2025. ISWC 2025. Lecture Notes in Computer Science, vol 16141. Springer, Cham

English abstract

Large Language Models (LLMs) are increasingly deployed in real-world applications that require access to up-to-date knowledge. However, retraining LLMs is computationally expensive. Therefore, knowledge editing techniques are crucial for maintaining current information and correcting erroneous assertions within pre-trained models. Current benchmarks for knowledge editing primarily focus on recalling edited facts, often neglecting their logical consequences. To address this limitation, we introduce a new benchmark designed to evaluate how knowledge editing methods handle the logical consequences of a single fact edit. Our benchmark extracts relevant logical rules from a knowledge graph for a given edit. Then, it generates multi-hop questions based on these rules to assess the impact on logical consequences. Our findings indicate that while existing knowledge editing approaches can accurately insert direct assertions into LLMs, they frequently fail to inject entailed knowledge. Specifically, experiments with popular methods like ROME and FT reveal a substantial performance gap, up to 24%, between evaluations on directly edited knowledge and on entailed knowledge. This highlights the critical need for semantics-aware evaluation frameworks in knowledge editing.
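
To make "logical consequences of a single fact edit" concrete, here is a toy sketch under stated assumptions. The triples, the rule, the relation names, and the question template are all hypothetical and are not taken from the benchmark. It applies one edit to a small knowledge graph, derives the entailed fact with a two-atom rule, and builds a multi-hop question from it.

```python
# Toy knowledge graph as (subject, relation, object) triples.
# All facts here are illustrative, not benchmark data.
kg = {
    ("Alice", "works_in", "Paris"),
    ("Paris", "located_in", "France"),
    ("Berlin", "located_in", "Germany"),
}


def apply_rule(triples):
    """Hypothetical rule:
    works_in(x, y) AND located_in(y, z) -> works_in_country(x, z)."""
    return {
        (x, "works_in_country", z)
        for (x, r1, y) in triples
        if r1 == "works_in"
        for (y2, r2, z) in triples
        if r2 == "located_in" and y2 == y
    }


def edit(triples, old, new):
    """Replace one fact, mimicking a single knowledge edit."""
    return (triples - {old}) | {new}


def two_hop_question(fact):
    """Turn an entailed fact into a question that needs both hops."""
    subject, _, _ = fact
    return f"In which country is the city where {subject} works?"


edited = edit(
    kg,
    ("Alice", "works_in", "Paris"),
    ("Alice", "works_in", "Berlin"),
)
entailed = apply_rule(edited)

if __name__ == "__main__":
    for fact in entailed:
        print(two_hop_question(fact), "expected:", fact[2])
```

A model that recalls the direct edit (Alice works in Berlin) but still answers "France" to the two-hop question would show the kind of gap between direct and entailed knowledge that the abstract reports.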

2606.10543 2026-06-10 cs.LG cs.AI cs.ET q-bio.QM New submission

Flexible Flows for Biological Sequence Design

Yogesh Verma, Dani Korpela, Harri Lähdesmäki, Vikas Garg

Affiliations * Aalto University; YaiYai Ltd; OpenProtein.AI

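AI summary Proposes a structured coupling, a latent edit-based rate parameterization, and a latent classifier-free guidance mechanism, enabling variable-length sequence generation and fine-grained control, with state-of-the-art performance across diverse biological sequence tasks.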

English abstract

Designing functional biological sequences requires navigating vast discrete spaces under strict evolutionary and biophysical constraints. Discrete Flow Matching (DFM) offers a generative framework over such spaces, but existing approaches rely on biologically uninformative couplings and offer limited flexibility for variable-length sequence generation and fine-grained control. We propose a structured coupling that encodes domain-specific preferences among sequence elements, biasing the source distribution toward plausible regions without modifying the flow objective or training procedure. Building on this, we introduce a latent edit-based rate parameterization that models variable-length generation via edit operations conditioned on a shared global latent, akin to a latent variable model, while remaining tractable. We further introduce a latent classifier-free guidance mechanism that steers generation coherently in continuous latent space, along with Dirichlet-prior temperature scaling for test-time control over edit operations. Our method achieves state-of-the-art performance across diverse biological sequence tasks, including density estimation, unconditional and conditional DNA sequence generation, and peptide sequence generation.