arXivDaily: arXiv Daily Academic Digest (updated Monday through Friday)
2606.04811 2026-06-05 cs.CV

Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?

Dream.exe: 视频生成模型能否梦想出可执行的机器人操作?

Rui Zhao, Kaiming Yang, Jifeng Zhu, Siyang Chen, Ziqi Wang, Weijia Wu, Kevin Qinghong Lin, Heng Wang, Mike Zheng Shou

Affiliations: Show Lab, National University of Singapore; University of Oxford; Tencent

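AI summary: Proposes Dream.exe, an evaluation framework that uses a video-to-execution pipeline to test whether motion produced by video generation models translates into executable robot manipulation; finds that visual quality does not predict executability.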

AI Chinese abstract

视频生成模型在合成视觉上引人注目的内容方面取得了令人印象深刻的进展,但其输出仍然局限于虚拟领域。一个自然的问题随之而来:当这些模型生成的视频离开屏幕进入现实时,它们对物理世界的反映有多好?我们提出机器人操作作为这个问题的具体、可测量的窗口:如果一个模型真正内化了物理定律,它所描绘的运动应该转化为可执行的机器人行为。我们引入了Dream.exe,一个通过视频到执行流水线来操作这一标准的评估框架。给定一个场景图像和任务描述,Dream.exe合成一个操作视频,将生成的运动转换为机器人轨迹,并在物理模拟器中执行,产生纯视觉指标无法提供的接地信号。使用这个流水线,我们评估了8个模型,涵盖前沿闭源生成器、开源生成器和机器人专用模型。我们的基准测试包括101个手动策划的操作任务,分为三个物理复杂度级别,通过视觉质量、轨迹保真度和执行成功率进行测量。令人鼓舞的是,几个模型取得了可测量的执行成功率,表明从互联网规模数据中学习的生成先验已经编码了有意义的物理知识。然而,视觉质量被证明是执行性的差预测器,暴露了标准视觉评估未捕获的模型能力维度。Dream.exe将在https://github.com/showlab/Dream.exe开源。

English abstract

Video generation models have made impressive strides in synthesizing visually compelling content, yet their outputs remain confined to the virtual domain. A natural question follows: how well do these models reflect the physical world when their generated videos leave the screen and enter reality? We propose robotic manipulation as a concrete, measurable window onto this question: if a model has truly internalized physical laws, the motion it depicts should translate into executable robot behavior. We introduce Dream.exe, an evaluation framework that operationalizes this criterion through a video-to-execution pipeline. Given a scene image and a task description, Dream.exe synthesizes a manipulation video, converts the generated motion into robot trajectories, and executes them in a physics simulator, yielding a grounding signal that purely visual metrics cannot offer. Using this pipeline, we evaluate 8 models spanning frontier closed-source generators, open-source generators, and robot-specific models. Our benchmark covers 101 manually curated manipulation tasks at three levels of physical complexity, measured across visual quality, trajectory fidelity, and execution success. Encouragingly, several models achieve measurable execution success, suggesting that generative priors learned from internet-scale data already encode meaningful physical knowledge. Yet visual quality proves a poor predictor of executability, exposing a dimension of model capability that standard visual evaluations do not capture. Dream.exe will be open-sourced at https://github.com/showlab/Dream.exe.

2606.04777 2026-06-05 cs.LG

UniFair: A unified fair clustering approach based on separation and compactness

UniFair: 基于分离度和紧致性的统一公平聚类方法

Antonia Karra, Vasiliki Papanikou, Georgios Vardakas, Evaggelia Pitoura, Aristidis Likas

Affiliations: University of Ioannina; Archimedes/Athena Research Center

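AI summary: Proposes the UniFair framework, which jointly optimizes separation fairness and social fairness to reduce group disparities while preserving clustering quality.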

Comments 17 pages, 6 Figures

AI Chinese abstract

聚类越来越多地用于支持高影响决策,但诸如$k$-means等标准目标可能会产生对人口群体不平等对待的聚类结果。现有的公平聚类方法通常优化单一公平性概念,并且常常忽略聚类成本如何与诱导决策边界的几何形状相互作用。我们提出UniFair,一个统一框架,联合优化分离公平性和社会公平性。分离公平性鼓励受保护群体远离诱导决策边界,而社会公平性通过惩罚群体级聚类成本来减少簇内失真的差异。我们为分离公平和统一$k$-means目标开发了基于梯度的优化过程,并通过在自编码器的潜在空间中强制执行相同标准将其扩展到深度聚类。在表格和图像数据集上的实验表明,UniFair在仅适度增加聚类损失的情况下,减少了与边界相关和基于成本的群体差异。

English abstract

Clustering is increasingly used to support high-impact decisions, yet standard objectives such as k-means can produce clusterings that treat demographic groups unequally. Existing fair clustering methods typically optimize a single notion of fairness and often overlook how clustering costs interact with the geometry of the induced decision boundaries. We propose UniFair, a unified framework that jointly optimizes separation fairness and social fairness. Separation fairness encourages protected groups to lie farther from the induced decision boundaries, while social fairness reduces disparities in within-cluster distortion by penalizing group-wise clustering costs. We develop gradient-based optimization procedures for separation-fair and unified k-means objectives, and extend them to deep clustering by enforcing the same criteria in the latent space of an autoencoder. Experiments on tabular and image datasets show that UniFair reduces both boundary-related and cost-based group disparities with only a modest increase in clustering loss.

2606.04708 2026-06-05 cs.RO cs.AI

VISTA: Vision-Grounded and Physics-Validated Adaptation of UMI data for VLA Training

VISTA: 基于视觉和物理验证的UMI数据适配用于VLA训练

Siyuan Yang, Linzheng Guo, Ouyang Lu, Zhaxizhuoma, Daoran Zhang, Xinmiao Wang, Ting Xiao, Fangzheng Yan, Zhijun Chen, Yan Ding, Chao Yu, Chenjia Bai, Xuelong Li

Affiliations: Institute of AI (TeleAI), China Telecom; Lumos Robotics; University of Science and Technology of China; Northwestern Polytechnical University; Shanghai Jiao Tong University; East China University of Science and Technology; Harbin Engineering University; Fudan University

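AI summary: Proposes the VISTA framework, which addresses visual distribution shift and physically infeasible actions when training VLA models on UMI data through three components: the UMI-VQA dataset for aligning visual representations, a physical-validation pipeline for filtering feasible trajectories, and two-stage co-training.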

Comments Corrected the typing error

AI Chinese abstract

通用操作接口(UMI)实现了无需特定硬件遥操作的可扩展真实世界机器人数据收集,但利用UMI数据训练大规模视觉-语言-动作(VLA)模型仍然面临根本性挑战。我们识别出两个关键不匹配:腕部安装的鱼眼视图具有严重的径向畸变和以夹爪为中心的局部视角,对于预训练VLM而言是分布外数据;人类收集的轨迹经常违反运动学限制、发生碰撞或超出控制器带宽,导致VLA策略学习到物理上不可行的动作。为解决这些挑战,我们提出了VISTA框架,通过三个协同组件弥合这一双重差距。(i) UMI-VQA,首个专门针对腕部鱼眼观测的大规模VQA数据集,通过辅助视觉-语言监督将VLM表示对齐到畸变视觉领域。(ii) 系统性的物理验证流水线,在训练前进行数据完整性预检查,并对每条有效轨迹的轨迹连续性、自碰撞风险和执行保真度进行评分。(iii) 两阶段联合训练方案,在UMI-VQA上联合学习视觉-语言基础,并在验证轨迹上学习动作预测。我们的实验经验表明,引入UMI-VQA能持续提升下游策略性能,且物理验证分数对部署成功具有强预测性。在多种仿真和真实世界操作任务中,VISTA显著优于包括$π_{0.5}$、LingBot-VLA和Wall-X在内的强基线。我们向社区发布了物理验证流水线、UMI-VQA、验证轨迹数据和预训练模型。

English abstract

Universal Manipulation Interface (UMI) enables scalable real-world robot data collection without hardware-specific teleoperation, yet leveraging UMI data to train large-scale Vision-Language-Action (VLA) models remains fundamentally challenging. We identify two critical mismatches: wrist-mounted fisheye views, with severe radial distortion and local gripper-centric perspectives, are out-of-distribution for pretrained VLMs; and human-collected trajectories frequently violate kinematic limits, incur collisions, or exceed controller bandwidth, teaching VLA policies physically infeasible actions. To address the challenges, we present VISTA, a framework that bridges this dual gap through three synergistic components. (i) UMI-VQA, the first large-scale VQA dataset tailored to wrist-mounted fisheye observations, aligns VLM representations to the distorted visual regime via auxiliary vision-language supervision. (ii) A systematic physical-validation pipeline performs a data-completeness pre-check and scores each valid trajectory for trajectory continuity, self-collision risk, and execution fidelity before it enters training. (iii) A two-stage co-training recipe jointly learns vision-language grounding on UMI-VQA and action prediction on validated trajectories. Our experiments empirically show that incorporating UMI-VQA consistently improves downstream policy performance, and that physical-validation scores are strongly predictive of deployment success. On diverse simulation and real-world manipulation tasks, VISTA significantly outperforms strong baselines including $π_{0.5}$, LingBot-VLA, and Wall-X. We release the physical-validation pipeline, UMI-VQA, validated trajectory data, and the pre-trained model for the community.

2606.04672 2026-06-05 cs.LG cs.AI

Learning Long Range Spatio-Temporal Representations over Continuous Time Dynamic Graphs with State Space Models

利用状态空间模型学习连续时间动态图中的长程时空表示

Ayushman Raghuvanshi, Thummaluru Siddartha Reddy, Sundeep Prabhakar Chepuri, Mahesh Chandran

Affiliations: University of Texas at Austin; Indian Institute of Science

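AI summary: Proposes a state-space modeling framework for continuous-time dynamic graphs (CTDG-SSM) that propagates long-range spatio-temporal information through a topology-aware higher-order polynomial projection operator (CTT-HiPPO), achieving state-of-the-art performance on dynamic link prediction, node classification, and sequence classification.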

Comments Accepted at ICML 2026

AI Chinese abstract

连续时间动态图(CTDG)为捕捉演化关系数据中的细粒度时间模式提供了更丰富的框架。长程信息传播是学习表示时的关键挑战,其中需要在长时间跨度上保留和更新信息。现有方法限制模型捕捉一跳或局部时间邻域,无法捕捉多跳或全局结构模式。为解决此问题,我们从第一性原理推导出一个参数高效的连续时间动态图状态空间建模框架(CTDG-SSM)。我们首先引入连续时间拓扑感知高阶多项式投影算子(CTT-HiPPO),这是一种基于记忆的HiPPO新公式,用于联合编码时间动态和图结构。CTT-HiPPO的解通过将经典HiPPO解投影到拉普拉斯矩阵的多项式上获得,产生拓扑感知的记忆更新,该更新等价于CTDG的状态空间公式(CTDG-SSM)。然后,使用零阶保持方法获得计算高效的离散公式用于模型实现。在动态链接预测、动态节点分类和序列分类的基准测试中,CTDG-SSM实现了最先进的性能。值得注意的是,在需要长程时间(LRT)和空间推理的数据集上,它取得了较大的性能提升。

English abstract

Continuous-time dynamic graphs (CTDGs) provide a richer framework to capture fine-grained temporal patterns in evolving relational data. Long-range information propagation is a key challenge while learning representations, wherein it is important to retain and update information over long temporal horizons. Existing approaches restrict models to capture one-hop or local temporal neighborhoods and fail to capture multi-hop or global structural patterns. To mitigate this, we derive a parameter-efficient state-space modeling framework for continuous-time dynamic graphs (CTDG-SSM) from first principles. We first introduce continuous-time Topology-Aware higher order polynomial projection operator (CTT-HiPPO), a novel memory-based reformulation of HiPPO to jointly encode temporal dynamics and graph structure. The solution from CTT-HiPPO is obtained by projecting the classical HiPPO solution through a polynomial of the Laplacian matrix, yielding topology-aware memory updates that admit an equivalent state-space formulation for CTDGs (CTDG-SSM). Then a computationally efficient discrete formulation is obtained using the zero-order hold approach for model implementation. Across benchmarks on dynamic link prediction, dynamic node classification, and sequence classification, CTDG-SSM achieves state-of-the-art performance. Notably, it achieves large performance gains on datasets that require long range temporal (LRT) and spatial reasoning.
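
The zero-order hold step mentioned in this abstract can be illustrated on a scalar state-space model. This is a minimal sketch and not the paper's CTDG-SSM: it assumes a single state dimension with dynamics h'(t) = a·h(t) + b·u(t), and the names `zoh_discretize` and `run_ssm` are hypothetical. The Laplacian-polynomial projection of CTT-HiPPO is not reproduced here.

```python
import math


def zoh_discretize(a, b, dt):
    """Zero-order hold for h'(t) = a*h(t) + b*u(t) with u held constant over dt.

    Returns (a_bar, b_bar) such that h[k+1] = a_bar*h[k] + b_bar*u[k].
    """
    a_bar = math.exp(a * dt)
    # (exp(a*dt) - 1) / a * b, with the a -> 0 limit handled explicitly.
    b_bar = b * dt if a == 0.0 else b * math.expm1(a * dt) / a
    return a_bar, b_bar


def run_ssm(a, b, dt, inputs, h0=0.0):
    """Roll the discretized recurrence over a sequence of inputs."""
    a_bar, b_bar = zoh_discretize(a, b, dt)
    h, states = h0, []
    for u in inputs:
        h = a_bar * h + b_bar * u
        states.append(h)
    return states


if __name__ == "__main__":
    # With a constant input the ZOH recurrence matches the exact ODE solution
    # h(t) = 1 - exp(-t) at the sample times (a = -1, b = 1, u = 1).
    states = run_ssm(a=-1.0, b=1.0, dt=0.1, inputs=[1.0] * 10)
    print(states[-1], 1.0 - math.exp(-1.0))
```

In a continuous-time graph setting the gap between events varies, so `dt` would plausibly be recomputed per event rather than fixed as it is in this toy loop.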

2606.04560 2026-06-05 cs.LG cs.AI

Rollout-Level Advantage-Prioritized Experience Replay for GRPO

基于轨迹级别优势优先经验回放的GRPO

Gyeongtae Yoo, Sanghyeok Park, Soohyuk Jang, Ik-hwan Kim, Sungroh Yoon

Affiliations: Department of Electrical and Computer Engineering, Seoul National University; Interdisciplinary Program in AI, Seoul National University; AIIS, ASRI, INMC, and ISRC, Seoul National University

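AI summary: To address GRPO's low sample efficiency, proposes a rollout-level experience replay buffer that bounds staleness through age eviction, preserves on-policy data via fresh-anchored composition, and prioritizes sampling by advantage magnitude, yielding significant gains on multiple math benchmarks.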

AI Chinese abstract

基于可验证奖励的GRPO强化学习是后训练推理LLM的标准方法,但样本效率低下。每个轨迹仅用于一次梯度更新后被丢弃。朴素回放在此设置中不适用,因为LLM策略每步梯度变化快,存储的轨迹会变得陈旧并破坏训练稳定性。我们提出一种面向GRPO的轨迹级回放缓冲器,存储和采样单个轨迹而非整组。缓冲器通过年龄驱逐限制陈旧性:任何超过tau_max训练步数的轨迹被移除。缓冲器还通过新鲜锚定组合保留在线策略数据:每个批次保留其新鲜的在线策略轨迹,并拼接从缓冲器中单独抽取的回放轨迹。我们按每个轨迹的优势幅度进行优先回放,并回收优势大的单个轨迹。在三个Qwen3-Base规模、五个数学基准上,我们的方法优于GRPO和朴素回放基线。所有规模均获得正向增益,且随模型增大而增长。最大增益在4B规模上,五个基准平均提升+4.35个百分点。在联合衡量准确率和token效率的AES指标下,与GRPO的效率差距同样在4B最大,为+0.579。

English abstract

Reinforcement learning from verifiable rewards with GRPO is a standard approach for post-training reasoning LLMs. It remains sample inefficient. Each rollout is used for a single gradient update and then discarded. Naive replay is not well suited in this setting because LLM policies drift quickly per gradient step. Stored rollouts therefore become stale and can destabilize training. We propose a rollout-level replay buffer for GRPO that stores and samples individual rollouts rather than whole groups. The buffer bounds staleness through age eviction. Any rollout older than tau_max training steps is removed. The buffer also preserves on-policy data via fresh-anchored composition. Each batch keeps its fresh on-policy rollouts and then concatenates replay rollouts drawn separately from the buffer. We prioritize replay by per-rollout advantage magnitude and recycle individual rollouts whose advantages are large. Across three Qwen3-Base scales on five math benchmarks, our method outperforms GRPO and naive replay baselines. Gains are positive at every scale and grow with model size. The largest gain is +4.35 pp on the five-benchmark average at 4B. Under an AES metric that jointly measures accuracy and token efficiency, the efficiency margin over GRPO is again largest at 4B, at +0.579.
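
The three buffer mechanisms named in this abstract (age eviction, fresh-anchored composition, and advantage-magnitude prioritization) can be sketched as below. This is an illustrative toy under stated assumptions, not the authors' implementation: the class and function names are hypothetical, sampling is done with replacement and with weights proportional to |advantage|, and no importance correction for stale rollouts is shown.

```python
import random
from dataclasses import dataclass


@dataclass
class Rollout:
    tokens: list      # placeholder for the sampled response
    advantage: float  # per-rollout advantage computed when the rollout was fresh
    step: int         # training step at which the rollout was generated


class RolloutReplayBuffer:
    """Stores individual rollouts rather than whole GRPO groups."""

    def __init__(self, tau_max, seed=0):
        self.tau_max = tau_max
        self.items = []
        self.rng = random.Random(seed)

    def add(self, rollouts):
        self.items.extend(rollouts)

    def evict(self, current_step):
        # Age eviction: drop any rollout older than tau_max training steps.
        self.items = [r for r in self.items
                      if current_step - r.step <= self.tau_max]

    def sample(self, k):
        # Prioritized replay: weight each rollout by its advantage magnitude
        # (assumed proportional weighting; sampled with replacement).
        if not self.items or k <= 0:
            return []
        weights = [abs(r.advantage) + 1e-8 for r in self.items]
        return self.rng.choices(self.items, weights=weights,
                                k=min(k, len(self.items)))


def compose_batch(fresh, buffer, n_replay, current_step):
    """Fresh-anchored composition: keep all fresh rollouts, then append replay."""
    buffer.evict(current_step)
    replay = buffer.sample(n_replay)
    buffer.add(fresh)  # fresh rollouts become replay candidates for later steps
    return list(fresh) + replay
```

Keeping the fresh rollouts first and drawing replay separately means every batch still contains its full on-policy portion, whatever the buffer holds.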

2606.04485 2026-06-05 cs.LG

LimiX-2M: Mitigating Low-Rank Collapse and Attention Bottlenecks in Tabular Foundation Models

LimiX-2M:缓解表格基础模型中的低秩坍塌和注意力瓶颈

Yuanrui Wang, Xingxuan Zhang, Han Yu, Mingchao Hao, Gang Ren, Hao Yuan, Li Mao, Yunjia Zhang, Chun Yuan, Peng Cui

Affiliations: University of Science and Technology of China

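AI summary: Proposes LimiX-2M, built on a unified tokenize-and-route framework in which RaBEL expands each scalar into localized RBF features and the bidirectional block is reordered as S→N→F; with 2M parameters it outperforms larger models and improves the accuracy-efficiency trade-off of tabular foundation models.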

Comments Accepted to ICML 2026

AI Chinese abstract

表格基础模型(TFM)日益与树集成方法竞争,但其性能通常计算效率低下:使用标准仿射标量分词时,每个特征通过本质上的一维通道注入值变化,特征ID/位置信号无法增加特征内值的自由度,导致早期层值敏感性弱和隐藏状态冗余。我们提出了一个统一的tokenize-and-route框架用于强TFM:RaBEL将每个标量扩展为紧凑的局部RBF特征(可选指数门控)以改善条件和浅层有效秩,而重新排序的双向块S→N→F通过在特征混合前聚合跨样本上下文并使用注意力池化来使计算与读出对齐。这些变化共同产生了LimiX-2M,一个2M参数模型,在广泛使用的表格基准上优于更大的TabPFN-v2和TabICL基线,同时降低了训练和推理成本。这些结果突出了值感知分词和读出对齐路由作为改善TFM中精度-效率权衡的关键杠杆。模型检查点和推理代码可在https://github.com/limix-ldm-ai/LimiX获取。

English abstract

Tabular foundation models (TFMs) increasingly rival tree ensembles, but their performance is often compute-inefficient: with standard affine scalar tokenization, each feature injects value variation through an essentially one-dimensional channel, and feature IDs/positional signals cannot increase within-feature value degrees of freedom, yielding weak early-layer value sensitivity and redundant hidden states. We present a unified tokenize-and-route framework for strong TFMs: RaBEL expands each scalar into compact localized RBF features (optionally exponent-gated) to improve conditioning and shallow-layer effective rank, while a reordered bidirectional block S->N->F aligns computation with the readout by aggregating cross-sample context before feature mixing and using attention pooling. Together, these changes yield LimiX-2M, a 2M-parameter model that outperforms larger TabPFN-v2 and TabICL baselines on widely used tabular benchmarks while reducing training and inference costs. These results highlight value-aware tokenization and readout-aligned routing as key levers for improving the accuracy-efficiency trade-off in TFMs. Model checkpoints and inference code are available at https://github.com/limix-ldm-ai/LimiX.
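
The idea of expanding a scalar into localized RBF features can be illustrated with a generic Gaussian basis. This is only a sketch of the general technique: the abstract does not specify RaBEL's parameterization, so the fixed centers, the shared width, and the function name `rbf_expand` are assumptions, and the optional exponent gating is not modeled.

```python
import math


def rbf_expand(x, centers, width):
    """Map one scalar to a vector of Gaussian RBF activations, one per center.

    Each output responds only near its center, so different value ranges of
    the same feature activate different channels.
    """
    return [math.exp(-((x - c) ** 2) / (2.0 * width ** 2)) for c in centers]


if __name__ == "__main__":
    centers = [-2.0, -1.0, 0.0, 1.0, 2.0]  # hypothetical fixed grid of centers
    for value in (-1.5, 0.0, 1.5):
        print(value, [round(f, 3) for f in rbf_expand(value, centers, 1.0)])
```

By contrast, an affine tokenizer `x * w + b` moves every value along the single direction `w`, which is the one-dimensional channel the abstract describes.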

2606.04463 2026-06-05 cs.RO

OSCAR: Omni-Embodiment Action-Conditioned World Model for Robotics

OSCAR: 面向机器人的全具身骨架条件世界动作模型

Zhuoyuan Wu, Jun Gao

Affiliations: Peking University; University of Michigan; NVIDIA

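AI summary: Proposes OSCAR, an action-conditioned video world model that generalizes across robot embodiments through a large-scale data pipeline and a unified 2D skeleton-rendering representation, and that is used for policy evaluation.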

Comments Project page: https://wuzy2115.github.io/oscar-project-page/

AI Chinese abstract

我们提出OSCAR,一种精确的动作条件视频世界模型,能够泛化到不同的机器人具身并支持机器人策略评估。现有的视频世界模型在真实机器人评估中面临三个主要挑战:当前机器人训练数据集的场景多样性有限、动作跟随不精确、以及跨具身泛化能力差以支持广泛采用。我们从两个角度应对这些挑战。其核心是一个大规模标准化数据管道,用于整理、过滤和去重广泛的机器人和以自我为中心的人类数据集,产生一个涵盖多样化任务、场景、动作和机器人具身的干净联合训练数据集。为了给视频模型提供条件,我们采用2D运动学骨架渲染作为统一的条件表示,能够泛化到不同的机器人手臂甚至人类手部。我们在单个GH200 GPU上微调Cosmos-Predict2.5-2B模型。与现有基线相比,我们的模型在动作跟随、外观质量和运动一致性方面取得了显著改进,而基线要么模型规模大得多,要么需要更多GPU。我们进一步将OSCAR部署到RoboArena中评估机器人策略。大量实验表明,OSCAR中的虚拟策略评估与真实世界评估之间存在显著相关性,为未来机器人策略可以纯粹在虚拟生成的世界中评估铺平了道路。

English abstract

We present OSCAR, a precise action-conditioned video world model that generalizes across different robot embodiments and enables robot policy evaluation. Existing video world models face three main challenges for real-world robot evaluation: limited scenario diversity in current robot training datasets, imprecise action following, and poor generalization across embodiments for broad adoption. We tackle these challenges from two perspectives. At its core is a large-scale standardized data pipeline that curates, filters, and deduplicates broad robotics and egocentric human datasets, yielding a clean joint-training dataset that spans diverse tasks, scenarios, actions, and robot embodiments. To condition the video model, we adopt 2D kinematic skeleton rendering as a unified conditioning representation that generalizes across different robot arms or even human hands. We finetune the Cosmos-Predict2.5-2B model on a single GH200 GPU. Our model achieves significant improvement on action following, appearance quality, and motion consistency, compared to existing baselines, which either have a much larger model size or require more GPUs. We further deploy OSCAR to evaluate robot policies from RoboArena. Extensive experiments demonstrate the significant correlation between our virtual policy evaluation in OSCAR and real-world evaluation, paving the way for the future where robot policies can be purely evaluated in virtual generated worlds.

2606.04335 2026-06-05 cs.LG cs.SY eess.SY

Policy Gradient for Continuous-Time Robust Markov Decision Processes

连续时间鲁棒马尔可夫决策过程的策略梯度

Tanya Veeravalli, David M. Bossens, Atsushi Nitanda

Affiliations: Centre for Frontier AI Research, Agency for Science, Technology and Research (A*STAR); Institute of High Performance Computing, Agency for Science, Technology and Research (A*STAR); College of Computing and Data Science, Nanyang Technological University

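AI summary: For continuous-time robust Markov decision processes, derives policy gradients and adversarial gradients and proposes double-loop optimisers and mean-field optimisers, which achieve linear and sublinear convergence respectively, together with a sample complexity analysis.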

AI Chinese abstract

鲁棒马尔可夫决策过程(RMDPs)框架允许设计在最坏情况转移动态下满足性能保证的强化学习智能体。传统RMDPs考虑离散时间动态,最近,样本高效的策略梯度算法已在此背景下被研究。本文研究连续时间RMDP框架内的策略梯度算法。使用随机和常微分方程的路径和伴随公式推导策略梯度和对抗梯度。我们提出双循环优化器,在基于oracle的设置中获得线性收敛,在基于样本的设置中获得$\tilde{\mathcal{O}}(\frac{1}{ε^2})$样本复杂度,该分析还推导了无折扣总成本MDP框架的新工具。此外,我们提出平均场优化器作为分布优化器,具有$\tilde{\mathcal{O}}(\frac{1}{K})$的基于oracle的收敛速率和$N$粒子近似下$\tilde{\mathcal{O}}(\frac{N^2}ε)$的样本复杂度。通过神经常微分方程动态的连续时间RMDPs,两种优化器的连续时间策略梯度算法的有效性得到确认。

English abstract

The framework of robust Markov decision processes (RMDPs) allows the design of reinforcement learning agents that satisfy performance guarantees under worst-case transition dynamics. Traditional RMDPs consider discrete-time dynamics and recently, sample-efficient policy gradient algorithms have been considered in this context. This paper investigates policy gradient algorithms within a continuous-time RMDP framework. Policy gradients and adversarial gradients are derived using pathwise and adjoint-based formulas for stochastic and ordinary differential equations. We propose double-loop optimisers to obtain linear convergence in the oracle-based setting and an $\tilde{\mathcal{O}}(\frac{1}{ε^2})$ sample complexity in the sample-based setting in an analysis which also derives novel tools for the framework of undiscounted total cost MDPs. Additionally, we propose mean-field optimisers as distributional optimisers with an $\tilde{\mathcal{O}}(\frac{1}{K})$ oracle-based convergence rate and an $\tilde{\mathcal{O}}(\frac{N^2}ε)$ sample complexity under $N$-particle approximation. The effectiveness of continuous-time policy gradient algorithms is confirmed for both optimisers on continuous-time RMDPs with neural ordinary differential equation dynamics.

2606.04037 2026-06-05 cs.AI cs.LG cs.SE

Toward Pre-Deployment Assurance for Enterprise AI Agents: Ontology-Grounded Simulation and Trust Certification

面向企业AI代理的部署前保障:基于本体的仿真与信任认证

Thanh Luong Tuan, Abhijit Sanyal

Affiliations: Golden Gate University; Data, Digital & IT, Novartis Healthcare Pvt. Ltd.

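AI summary: Proposes an ontology-grounded verification framework that uses ontology-driven scenario generation and trust certificates to automate pre-deployment regulatory compliance and safety certification of enterprise AI agents.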

Comments 26 pages, 3 figures. Companion to arXiv:2604.00555. Code and data: https://github.com/frank-luongt/faos-research/tree/main/RA-6

AI Chinese abstract

企业人工智能(AI)代理的部署前验证仍然是大型语言模型(LLM)能力基准测试与生产部署之间的关键缺口。一旦代理在生产环境中运行,部署后监控、人在回路控制和提示级护栏提供的保障有限。我们提出了一种基于本体的验证框架,包含三个组件:一个代理操作范围,形式化了跨权限、领域约束、安全属性、治理规则和自主级别的认证空间;一个本体到场景的生成流水线,自动推导出监管、操作和对抗性测试场景;以及一个信任证书,携带机器可验证的证明,并附带分级部署裁决(批准、有条件、拒绝)。在四个受监管行业(金融科技、银行、保险和医疗保健)中进行的受控试点,实例化为美国与越南的五个行业-监管体制单元,生成了1,800个场景,并针对125个主要来源监管要求和25个注入故障进行了评估。基于本体的生成(G4)实现了48.3%的监管覆盖率,而基于角色的基线为33.1%(校正后p=0.0006),并且领域特异性最高(4.77/5.0;p=2e-6)。在Bonferroni校正后,相对于基线和检索增强提示的覆盖率优势不再稳健。跨三个LLM家族(Claude Sonnet 4、Qwen 2.5 72B、Gemma 4 26B;总计5,400个场景)的交叉验证复制了角色与本体模式。结果表明,对于监管密集型领域,基于本体的场景生成可作为基于角色测试套件的可信补充。

English abstract

Pre-deployment verification of enterprise artificial intelligence (AI) agents remains a critical gap between large language model (LLM) capability benchmarking and production deployment. Post-deployment monitoring, human-in-the-loop controls, and prompt-level guardrails offer limited assurance once an agent is operating in production. We present an ontology-grounded verification framework -- to our knowledge the first to combine three components: an Agent Operational Envelope formalizing the certification space across permissions, domain constraints, safety properties, governance rules, and autonomy levels; an ontology-to-scenario generation pipeline that derives regulatory, operational, and adversarial test scenarios automatically; and a machine-verifiable Trust Certificate with graduated deployment verdicts. A controlled pilot across four regulated industries (Fintech, Banking, Insurance, Healthcare), instantiated as five industry-by-regulatory-regime cells across the United States and Vietnam (where Vietnam's 2025 AI Law makes such verification legally mandated for financial services), generated 1,800 scenarios evaluated against 125 primary-source regulatory requirements and 25 injected faults. Ontology-grounded generation significantly outperformed the dominant persona-based baseline on regulatory coverage (48.3% versus 33.1%; corrected p_c = .0006) and attained the highest domain specificity (4.77/5.0; p = 2e-6); transparently, its advantage over plain and retrieval-augmented prompting did not survive Bonferroni correction. Cross-validation across three LLM families (Claude Sonnet 4, Qwen 2.5 72B, Gemma 4 26B; 5,400 total scenarios) replicated the persona-versus-ontology pattern. The framework offers a reproducible, regulation-grounded route to pre-deployment assurance for enterprise AI agents, complementing runtime governance with an auditable deployment gate.

2606.04032 2026-06-05 cs.LG cs.AI cs.CL cs.PF

Do Transformers Need Three Projections? Systematic Study of QKV Variants

Transformer 需要三个投影吗?QKV 变体的系统研究

Ali Kayyam, Anusha Madan Gopal, M Anthony Lewis

Affiliations: not listed

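AI summary: Systematically studies variants that share the query, key, and value projections in attention; finds that Q-K=V sharing achieves a 50% KV cache reduction in language modeling at a cost of only 3.1% perplexity degradation, and that combining it with head sharing reaches a 96.9% cache reduction, supporting on-device inference.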

Comments Accepted at ICML 2026 (PMLR vol. 306). 26 pages, 12 figures, 16 tables. Code: https://github.com/Brainchip-Inc/Do-Transformers-Need-3-Projections

AI Chinese abstract

Transformer 已成为各种 AI 任务的标准解决方案,其中查询、键和值(QKV)注意力公式起着核心作用。然而,这三个投影的各自贡献以及省略某些投影的影响仍知之甚少。我们系统评估了三种投影共享约束:a) Q-K=V(共享键-值),b) Q=K-V(共享查询-键),c) Q=K=V(单投影)。后两种变体产生对称注意力图;为了解决这个问题,我们还通过二维位置编码探索了非对称注意力。通过涵盖合成任务、视觉(MNIST、CIFAR、TinyImageNet、异常检测)和语言建模(在 10B 令牌上训练的 300M 和 1.2B 参数模型)的实验,我们发现我们的 Transformer 性能与 QKV Transformer 相当,有时甚至更好。在语言建模中,Q-K=V 投影共享实现了 50% 的 KV 缓存减少,仅导致 3.1% 的困惑度下降。关键的是,投影共享与头共享(GQA/MQA)互补:将 Q-K=V 与 GQA-4 结合可实现 87.5% 的缓存减少,而 Q-K=V + MQA 则达到 96.9%,从而实现了实用的设备端推理。我们表明,Q-K=V 保持了质量,因为键和值可以占据相似的表示空间,并且注意力在低秩机制下运行,而 Q=K-V 则破坏了注意力的方向性。我们的结果系统地将投影共享描述为注意力中权重绑定的一种未被充分探索的实例,具有直接、可量化的推理内存优势,尤其对边缘部署有价值。代码公开于 https://github.com/anushamadan02/Do-Transformers-Need-3-Projections。

English abstract

Transformers have become the standard solution for various AI tasks, with the query, key, and value (QKV) attention formulation playing a central role. However, the individual contribution of these three projections and the impact of omitting some remain poorly understood. We systematically evaluate three projection sharing constraints: a) Q-K=V (shared key-value), b) Q=K-V (shared query-key), and c) Q=K=V (single projection). The last two variants produce symmetric attention maps; to address this, we also explore asymmetric attention via 2D positional encodings. Through experiments spanning synthetic tasks, vision (MNIST, CIFAR, TinyImageNet, anomaly detection), and language modeling (300M and 1.2B parameter models on 10B tokens), we find that our transformers perform on par with, or occasionally better than, the QKV transformer. In language modeling, Q-K=V projection sharing achieves 50% KV cache reduction with only 3.1% perplexity degradation. Crucially, projection sharing is complementary to head sharing (GQA/MQA): combining Q-K=V with GQA-4 yields 87.5% cache reduction, while Q-K=V + MQA achieves 96.9%, enabling practical on-device inference. We show that Q-K=V preserves quality because keys and values can occupy similar representational spaces and attention operates in a low-rank regime, whereas Q=K-V breaks attention directionality. Our results systematically characterize projection sharing as an underexplored instance of weight tying in attention, with direct, quantifiable inference memory benefits, particularly valuable for edge deployment. The code is publicly available at https://github.com/Brainchip-Inc/Do-Transformers-Need-3-Projections.
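
The cache-reduction percentages quoted in this abstract can be checked with simple arithmetic. The sketch below rests on two assumptions that the abstract does not state: a baseline with 16 attention heads, and "GQA-4" meaning 4 key/value heads. Under those assumptions the three reported figures are reproduced; the function names are hypothetical.

```python
def kv_cache_fraction(n_heads, n_kv_heads, share_kv):
    """Fraction of the baseline multi-head KV cache that is still stored."""
    head_factor = n_kv_heads / n_heads       # GQA/MQA keep fewer K/V heads
    proj_factor = 0.5 if share_kv else 1.0   # K = V stores one tensor, not two
    return head_factor * proj_factor


def reduction_pct(n_heads, n_kv_heads, share_kv):
    """Cache reduction relative to standard multi-head attention, in percent."""
    return 100.0 * (1.0 - kv_cache_fraction(n_heads, n_kv_heads, share_kv))


if __name__ == "__main__":
    N_HEADS = 16  # assumed head count, chosen because it matches the figures
    print(reduction_pct(N_HEADS, N_HEADS, True))  # Q-K=V alone: 50.0
    print(reduction_pct(N_HEADS, 4, True))        # Q-K=V + GQA-4: 87.5
    print(reduction_pct(N_HEADS, 1, True))        # Q-K=V + MQA: 96.875
```

The last value, 96.875, corresponds to the 96.9% reported for Q-K=V combined with MQA. The two savings multiply because one acts on the number of stored heads and the other on the number of tensors stored per head.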

2606.03785 2026-06-05 cs.CL

Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs

后门遗忘泛化:走向移除大语言模型中未知触发器的路径

Lisa Bouger, Théo Lasnier, Philippe Loubet Moundi, Yannick Teglia, Djamé Seddah

Affiliations: Inria Paris; Sorbonne Université; Thales CDI

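AI summary: Shows experimentally that unlearning training targeted at a single backdoor generalizes to suppress other backdoors that were never explicitly targeted, and introduces the Cross Activation Shift Distance to quantify model changes induced by different trainings, pointing to a new direction in which controlled backdoors are used to remove unknown ones.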

Comments 22 pages, 28 figures

AI Chinese abstract

大语言模型中的后门攻击是一个日益严重的安全问题,模型可能生成对手选择的内容。现有防御一次只针对一个后门,并且通常需要知道触发器,这使得防御者在模型中可能存在未知后门时处于结构性劣势。我们表明,通过遗忘进行后门中和可以跨后门泛化:训练模型忽略单个触发器也可以抑制其他从未明确针对的后门。我们通过分析每次移除一个后门后获得的模型,研究了三个模型家族中的这一现象,这些模型的后门是通过预训练或持续预训练注入的。为了理解为什么遗忘某些后门会导致其他后门的抑制,我们引入了交叉激活偏移距离,以量化不同训练引起的模型变化之间的距离。我们的结果为LLM安全开辟了一个新方向,因为防御者可以故意注入受控后门然后移除它们,利用跨后门转移来抑制攻击者可能先前在模型中引入的未知后门。

English abstract

Backdoor attacks in Large Language Models (LLMs) are a growing security concern, where models can generate adversary-chosen content. Existing defenses target backdoors one at a time and typically require knowledge of the trigger, leaving the defender at a structural disadvantage when unknown backdoors may exist in a model. We show that backdoor neutralization through unlearning generalizes across backdoors: training a model to ignore a single trigger can also suppress other backdoors that were never explicitly targeted. We study this phenomenon across three model families, whose backdoors were injected via pretraining or continual pretraining, by analyzing the models obtained after removing one backdoor at a time. To understand why unlearning certain backdoors induces the suppression of others, we introduce the Cross Activation Shift Distance, to quantify the distance between model changes induced by different trainings. Our results open a new direction for LLM safety as defenders could deliberately inject controlled backdoors and then remove them, leveraging cross-backdoor transfer to also suppress unknown backdoors that an attacker may have previously introduced in the model.

2606.03730 2026-06-05 cs.CV

Beyond False Stability: High-Noise Drift Gating for Test-Time Adversarial Defenses in Vision-Language Models

超越虚假稳定性:面向视觉语言模型测试时对抗防御的高噪声漂移门控

Hashmat Shadab Malik, Muzammal Naseer, Salman Khan

Affiliations: Mohamed Bin Zayed University of AI, UAE; Khalifa University, UAE; Australian National University, Australia

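AI summary: To address the vulnerability of vision-language models to adversarial attacks at test time, proposes a training-free, plug-in high-noise drift-gating mechanism that triggers a defense when it detects feature instability under high noise, improving the clean-robust trade-off.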

AI Chinese abstract

视觉语言模型(如CLIP)展现出强大的零样本泛化能力,但极易受到对抗攻击。对抗训练能提升鲁棒性但计算成本高昂,因此推动了测试时防御的研究。近期方法利用CLIP视觉表示对随机扰动的响应:聚合噪声视图的预测、构建高斯噪声平均锚点并将特征向锚点插值、或应用反扰动。这些策略提升了鲁棒性,但往往降低了干净准确率,导致不利的干净-鲁棒权衡。我们重新审视随机测试时防御,并发现CLIP表示空间中一个未被充分探索的噪声区域转变。先前工作主要在弱噪声区域探索扰动,其中对抗样本可能表现出异常稳定性(虚假稳定性)。我们的分析表明,随着扰动强度增加,这种稳定性发生逆转:在弱噪声区域之外,对抗表示变得比干净表示明显更不稳定,提供了更清晰的分离信号。这种转变在均匀噪声和高斯噪声、光度变换和几何变换、不同数据集以及多种攻击下均一致。在对抗训练模型中,该转变基本消失,表明其与非鲁棒CLIP中对抗表示的脆弱局部盆地几何结构相关。我们提出一种无训练、即插即用的漂移门控机制,利用高噪声特征漂移作为轻量级门控信号,仅在检测到类似对抗的不稳定性时触发现有测试时防御。在13个数据集上,该方法一致改善了干净-鲁棒权衡。在8个细粒度数据集上,反攻击防御的平均干净+对抗准确率从65.7%提升至71.4%,噪声锚定防御从68.4%提升至73.2%;在ImageNet及其四个变体上,分别从56.1%提升至66.2%和从62.1%提升至67.6%。

English abstract

Vision-language models (VLMs) such as CLIP show strong zero-shot generalization but remain highly vulnerable to adversarial attacks. Adversarial training improves robustness but is computationally expensive, motivating test-time defenses. Recent approaches exploit how CLIP's visual representations respond to stochastic perturbations: aggregating predictions across noisy views, constructing Gaussian noise-averaged anchors and interpolating features toward them, or applying counter-perturbations. These strategies improve robustness but often degrade clean accuracy, yielding an unfavorable clean-robust trade-off. We revisit stochastic test-time defenses and identify an underexplored noise-regime transition in CLIP's representation space. Prior work explored perturbations mainly in the weak-noise regime, where adversarial examples can appear unusually stable (false stability). Our analysis shows this reverses as perturbation strength grows: beyond the weak-noise regime, adversarial representations become markedly more unstable than clean ones, giving a clearer separation signal. The transition is consistent across uniform and Gaussian noise, photometric and geometric transforms, datasets, and diverse attacks. It largely disappears in adversarially trained models, suggesting it is tied to the fragile local-basin geometry of adversarial representations in non-robust CLIP. We propose a training-free, plug-in drift-gated mechanism that uses high-noise feature drift as a lightweight gating signal to trigger existing test-time defenses only when adversarial-like instability is detected. Across 13 datasets it consistently improves the clean-robust trade-off. On eight fine-grained datasets, mean clean+adversarial accuracy rises from 65.7% to 71.4% for counterattack defenses and 68.4% to 73.2% for noise-anchoring; on ImageNet and four shifted variants, from 56.1% to 66.2% and 62.1% to 67.6%.
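
The drift-gating idea in this abstract (measure how far features move under strong noise, and run a defense only when that drift is large) can be sketched as follows. Everything here is a toy under stated assumptions: the encoder, classifier, and defense are passed in as plain functions, drift is taken as the mean cosine distance over Gaussian-noised views, and the noise level and threshold are hypothetical parameters rather than values from the paper.

```python
import math
import random


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def feature_drift(encode, x, sigma, n_views, rng):
    """Mean feature displacement of x under Gaussian input noise of scale sigma."""
    base = encode(x)
    total = 0.0
    for _ in range(n_views):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        total += cosine_distance(base, encode(noisy))
    return total / n_views


def gated_predict(encode, classify, defend, x, sigma, threshold,
                  n_views=8, seed=0):
    """Apply the test-time defense only when high-noise drift exceeds threshold."""
    rng = random.Random(seed)
    drift = feature_drift(encode, x, sigma, n_views, rng)
    if drift > threshold:
        # Adversarial-like instability detected: route through the defense.
        return classify(encode(defend(x))), drift
    # Stable input: leave the clean prediction path untouched.
    return classify(encode(x)), drift
```

Because inputs below the threshold skip the defense entirely, clean accuracy on them is unchanged, which is the mechanism by which gating is meant to improve the clean-robust trade-off.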

2606.03650 2026-06-05 cs.CL cs.AI

CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks

CoEval: 无标注数据或可信基准下为自定义任务排序语言模型

Alexander Apartsin, Yehudit Aperstein

Affiliations: Holon Institute of Technology; Afeka Tel Aviv Academic College of Engineering

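AI summary: Proposes the CoEval framework, which ranks language models without labeled data or human evaluation by having teacher models generate a contamination-free benchmark and a cross-family panel of judges score the answers, reaching $\rho = 0.86$ against ground truth.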

Comments 16 pages, 5 images

AI Chinese abstract

当特定应用没有任务相关的标注数据,且标准公共基准不可信(其项目可能已泄露到预训练中,因此分数反映的是记忆而非适用性)时,为特定应用选择或排序语言模型最为困难。我们提出CoEval,一个开源、可复用的框架,端到端地弥补了这一差距:仅从任务或领域的描述出发,教师模型合成一个全新的、属性受控的基准,无需人工标注,且由于每次运行都重新生成项目,因此无污染;跨族评审团对候选模型进行排序,无需人工评分。在存在真实基准的情况下验证,CoEval恢复了真实的模型排序,并与真实正确性相关性达$\rho$=0.86。无标签评审无需人工校准,因为评审团组成(供应商多样性)而非规模驱动可靠性:一个精心挑选的小型跨族评审团最可靠,而单个评审员可能与真实基准负相关(评审员选择遗憾0.35),但集成评审团从未如此。生成的项目与五个主要公共基准的逐字13-gram重叠为零;评审团消除了冗长偏差并排除了同族自我偏好。一项四项任务研究以5.89美元产生了7,978次评估。相同的声明式流程适用于任何领域,并且足够便宜,可以在每次模型发布时重新运行:一个任何团队都可以为其自身应用重新生成的无标签、无污染排行榜。

English abstract

Selecting a pretrained language model, or evaluating a fine-tuned one, for a specific application is a high-value decision, yet the public benchmarks used to make it are poorly suited: a generic benchmark need not reflect a particular sub-domain or sub-task, and its scores are suspect when its items have leaked into pretraining and are recalled rather than solved. We present CoEval, an open framework that supplies a trustworthy, task-specific signal through ensemble self-evaluation: from a task or domain description, a pool of models rotates through all three roles, teacher, student, and judge, to generate a fresh, contamination-free benchmark, answer it, and score one another, with no human labels or raters. Because every model also answers as a student, the responses are the data that weight each question by its discriminative power and each judge by its consensus with the panel. Where ground truth exists, CoEval recovers the true ranking and tracks objective correctness at $\rho = 0.86$, and the weighting recovers the gold ranking of thirteen models at Spearman 0.95. Reliability comes from panel composition, not size: this label-free weighting zeroes out broken judges and down-weights saturated questions, so neither distorts the ranking. Generated items show zero verbatim overlap with five public benchmarks, the panel cancels verbosity bias and precludes same-family self-preference, and rankings are domain-specific: three different models top four de-novo domains, so a generic leaderboard misdirects most practitioners. The same pipeline reruns on each model release, giving any team a contamination-free leaderboard for its application.

2606.03189 2026-06-05 cs.CL

SenseJudge: Human-Centric Preference-Driven Judgment Framework

SenseJudge: 以人为中心的偏好驱动判断框架

Rui Li, Junfeng Liu, Xiangwen Kong, Linhai Xu, Zhifang Sui

发表机构 * State Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University(多媒体信息处理国家重点实验室,计算机学院,北京大学) StepFun(阶跃星辰) Xi’an Jiaotong University(西安交通大学)

AI总结 提出SenseJudge框架和SenseBench基准,通过融入用户偏好实现个性化判断和模型排名,实验证明其优于现有方法。

Comments ACL 2026 Findings

详情
AI中文摘要

大型语言模型(LLMs)作为判断器在评估模型响应等各种场景中正成为一种日益被接受的范式。然而,现有的判断方法通常依赖于使用固定偏好数据训练的评判器,这往往忽视了多样化的用户偏好,难以适应真实的人机对话场景。为了解决这些局限性,我们提出了SenseJudge,一个由人类偏好驱动的可定制判断框架,以及SenseBench,一个源自真实世界多轮交互的多样且具有挑战性的指令遵循基准。我们将自动判断框架和基准应用于两个任务:(1)LLMs作为个性化判断器,以及(2)模型排名。我们进行了大量实验,结果表明SenseJudge框架在LLMs作为个性化判断器任务中超越了其他判断方法和模型,并实现了与真实人类感知一致的模型排名。此外,我们对位置偏差和一致性进行了分析,并进行了消融研究,证实了SenseJudge的鲁棒性。

英文摘要

Using large language models (LLMs) as judges, for example to assess model responses, is becoming an increasingly accepted paradigm. However, existing judgment approaches often rely on judges trained on fixed preference data, which tend to overlook diverse user preferences and struggle to adapt to real-world human-AI dialogue scenarios. To address these limitations, we propose SenseJudge, a customizable judgment framework driven by human preferences, and SenseBench, a diverse and challenging instruction-following benchmark derived from real-world multi-turn interactions. We apply the automatic judgment framework and benchmark to two tasks: (1) LLMs as personalized judges, and (2) model ranking. Extensive experiments show that the SenseJudge framework surpasses other judgment methods and models in the LLMs-as-personalized-judges task and produces model rankings that align with real human perception. Additionally, analyses of position bias and consistency, alongside ablation studies, confirm the robustness of SenseJudge.

2606.03100 2026-06-05 cs.CV cs.LG

Zero-Shot 3D Question Answering via Hierarchical View-to-Token Transportation

基于层级视图到令牌传输的零样本3D问答

Dongsheng Wang, Dawei Su, Hui Huang

发表机构 * Dongsheng Wang(王东生) Dawei Su(苏大卫) Hui Huang(黄慧)

AI总结 提出KeyVT方法,通过层级视图和令牌级输入上下文收集,结合像素特征与相机参数评估视图重要性,并利用最优传输识别代表性令牌,实现零样本3D问答性能提升。

Comments Accepted at ICML 2026. 19 pages, 6 figures

详情
AI中文摘要

最近,通过2D视觉-语言模型(VLM)进行零样本3D场景理解因其有前景的空间推理能力而受到越来越多的研究关注。通常,从3D点云中采样多个2D视图,并输入预训练的VLM以回答给定问题。这种范式凸显了输入上下文质量的关键作用,并提出了在有限输入预算下尽可能保留与任务相关的3D细节的挑战。我们提出了KeyVT,一种在视图和令牌级别进行输入上下文收集的层级方法。具体来说,我们将像素特征与相机参数结合,并基于语义内容和几何位置评估视图重要性,从而得到空间一致且与任务相关的视图。此外,我们通过最优传输(OT)框架识别代表性令牌来解决选定视图中补丁之间的冗余问题,其中视图令牌和关键令牌被公式化为嵌入空间中的两个离散分布。这些关键令牌通过最小化OT距离期望覆盖所有视图特征。我们在三个广泛使用的基准上评估了我们的框架,结果表明与现有的无调优方法相比有显著改进,并且性能与基于训练的方法相当。

英文摘要

Recently, zero-shot 3D scene understanding via 2D Vision-Language Models (VLMs) has gained increasing research interest due to their promising spatial reasoning capabilities. Typically, multiple 2D views are sampled from a 3D point cloud and fed into pre-trained VLMs to answer a given question. This paradigm highlights the critical role of input context quality and raises the challenge of retaining as many task-relevant 3D details as possible under a limited input budget. We propose KeyVT, a hierarchical approach for input context collection at both the view and token levels. Specifically, we combine pixel features with camera parameters and assess view importance based on both semantic content and geometric position, resulting in spatially consistent and task-relevant views. Furthermore, we address redundancy among patches across selected views by identifying representative tokens under the optimal transport (OT) framework, where view tokens and key tokens are formulated as two discrete distributions in the embedding space. These key tokens are expected to cover all view features by minimizing the OT distance. We evaluate our framework on three widely used benchmarks, demonstrating significant improvements over existing tuning-free methods and performance comparable to training-based approaches.

2606.03070 2026-06-05 cs.LG cs.AI

ASymPO: Asymmetric-Scale Policy Optimization for Asynchronous LLM Post-Training Without Behavior Information

ASymPO: 用于异步大语言模型后训练的非对称尺度策略优化(无需行为信息)

Zehua Liu, Yuxuan Yao, Xiaojin Fu, Tao Zhong, Mingxuan Yuan

发表机构 * Huawei Technologies(华为技术)

AI总结 针对异步强化学习中陈旧响应导致的分布漂移问题,提出非对称尺度策略优化(ASymPO),通过归一化每个响应的令牌损失来恢复零和平衡,无需行为策略概率。

Comments incorrect proofs in the paper

详情
AI中文摘要

异步强化学习通过将响应生成与策略优化解耦来提高语言模型后训练的吞吐量,但陈旧响应会引入分布漂移。标准的行为校正方法通过行为策略概率、重要性比率或裁剪来控制这种漂移,这需要在推出和学习系统之间具有令牌对齐、版本化和数值一致的行为对数概率。我们探究是否仅使用当前策略概率就能稳定异步组相对强化学习。我们识别出一种尺度不平衡失败模式:当在当前策略下评估陈旧响应时,正负损失项可能出现在不同的负对数概率尺度上,因此零和优势不再意味着平衡的损失贡献。我们提出非对称尺度策略优化(ASymPO),它通过每个响应的当前平均令牌负对数概率来归一化其令牌损失。ASymPO不需要行为策略概率,恢复了响应级别的零和平衡,并保留了非零的学习信号。我们还引入了固定负缩放基线——缩放策略优化(SPO),并在异步数学推理后训练中评估了这两种仅当前策略的目标函数。

英文摘要

Asynchronous reinforcement learning can improve language-model post-training throughput by decoupling response generation from policy optimization, but stale responses introduce distribution drift. Standard behavior-corrected methods control this drift with behavior-policy probabilities, importance ratios, or clipping, which requires token-aligned, versioned, and numerically consistent behavior log-probabilities across rollout and learner systems. We ask whether asynchronous group-relative RL can instead be stabilized using only current-policy probabilities. We identify a scale-imbalance failure mode: when stale responses are evaluated under the current policy, positive and negative loss terms can appear at different negative log-probability scales, so zero-sum advantages no longer imply balanced loss contributions. We propose Asymmetric-Scale Policy Optimization (ASymPO), which normalizes each response's token loss by its current average token negative log-probability. ASymPO requires no behavior-policy probabilities, restores response-level zero-sum balance, and preserves a nonzero learning signal. We also introduce Scaled Policy Optimization (SPO), a fixed negative-scaling baseline, and evaluate both current-policy-only objectives in asynchronous mathematical reasoning post-training.
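The scale-imbalance argument can be made concrete with a small numeric sketch. It assumes, as one possible reading of the abstract, that each response contributes its advantage times its mean token negative log-probability, and that the normalizer is the same mean treated as a constant (no gradient). These assumptions, the function names, and the numbers are illustrative and are not taken from the paper, whose listing comment also flags incorrect proofs.

```python
def mean(xs):
    return sum(xs) / len(xs)


def raw_contributions(advantages, token_nlls):
    """Advantage times mean token NLL, evaluated under the current policy."""
    return [a * mean(nll) for a, nll in zip(advantages, token_nlls)]


def normalized_contributions(advantages, token_nlls):
    """Same loss, divided by each response's own mean token NLL.

    In a real trainer the divisor would be detached from the graph, so the
    value becomes the advantage while the gradient stays nonzero (assumption).
    """
    out = []
    for a, nll in zip(advantages, token_nlls):
        scale = mean(nll)  # treated as a constant
        out.append(a * mean(nll) / scale)
    return out


# Hypothetical group of two stale responses with zero-sum advantages.
advantages = [1.0, -1.0]
token_nlls = [[0.2, 0.4], [2.0, 3.0]]  # the negative sample sits at a larger NLL scale

raw = raw_contributions(advantages, token_nlls)
balanced = normalized_contributions(advantages, token_nlls)
print(sum(raw), sum(balanced))
```

Under these assumptions the raw contributions sum to a clearly negative value even though the advantages cancel, while the normalized contributions sum to zero.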

2606.02907 2026-06-05 cs.CL cs.AI

Linear Probes Detect Task Format, Not Reasoning Mode in Language Model Hidden States

线性探针检测语言模型隐藏状态中的任务格式,而非推理模式

Subramanyam Sahoo, Vinija Jain, Aman Chadha, Divya Chaudhary

发表机构 * Horizon Research(远景研究) Meta Apple(苹果公司) Northeastern University(东北大学)

AI总结 通过线性探针实验发现,大语言模型隐藏状态中看似分离的推理模式实际上由任务格式(如来源、选项数、响应长度)混淆导致,而非真正的推理计算结构。

Comments Accepted in the 6th Workshop on Trustworthy NLP, ACL 2026

详情
AI中文摘要

线性探针广泛用于声称大语言模型(LLM)隐藏状态对不同推理类型学习到不同表示。我们通过在经典三分法基准(LogiQA 2.0(演绎)、ARC-Challenge(归纳)和$\alpha$NLI(溯因))上探测Qwen3-14B来检验这一说法。在40层中的第32层,线性探针达到100%交叉验证准确率,且几何结构良好分离(本征维度:20.6、28.5、33.6;凸包污染$\leq$1.5%)。然而,这种分离完全由格式混淆驱动。对来源身份、选项数和响应长度进行残差化后,准确率降至随机水平。轨迹锚点相似性表明任务间推理大部分共享(42.5%一致性 vs. 33.3%随机),且随机对照因果操控($n=20$)显示几何结构与推理模式之间无功能联系($p=0.286$)。因此,高探针准确率反映的是任务格式而非计算结构,这促使在机制可解释性中常规性地进行格式去混淆。

英文摘要

Linear probing of large language model (LLM) hidden states is widely used to claim that models learn distinct representations for different reasoning types. We test this by probing Qwen3-14B on three benchmarks spanning the classical trichotomy: LogiQA 2.0 (deductive), ARC-Challenge (inductive), and $\alpha$NLI (abductive). At layer 32 of 40, linear probes achieve 100% cross-validated accuracy with well-separated geometry (intrinsic dimensionalities: 20.6, 28.5, 33.6; convex hull contamination $\leq$1.5%). However, this separation is entirely driven by format confounds. Residualizing source identity, option count, and response length reduces accuracy to chance. Trace-anchor similarity indicates largely shared reasoning across tasks (42.5% agreement vs. 33.3% chance), and causal steering with random controls ($n=20$) shows no functional link between geometry and reasoning mode ($p=0.286$). Thus, high probe accuracy reflects task format rather than computational structure, motivating routine format deconfounding in mechanistic interpretability.
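Residualization, the deconfounding step named in the abstract, can be sketched as follows: regress a feature on a confound and keep only the part the confound cannot explain. This minimal version handles one feature and one numeric confound with ordinary least squares; the data and names are hypothetical, and the paper's actual procedure (multiple confounds, high-dimensional hidden states) is not specified in this listing.

```python
def residualize(feature, confound):
    """Remove the best linear fit of `confound` from `feature` (ordinary least squares)."""
    n = len(feature)
    mean_f = sum(feature) / n
    mean_c = sum(confound) / n
    cov = sum((f - mean_f) * (c - mean_c) for f, c in zip(feature, confound))
    var = sum((c - mean_c) ** 2 for c in confound)
    slope = cov / var
    return [f - (mean_f + slope * (c - mean_c)) for f, c in zip(feature, confound)]


def covariance(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


# Hypothetical probe feature that mostly tracks the number of answer options.
option_count = [4, 4, 4, 4, 5, 5, 2, 2]
feature = [1.9, 2.1, 2.0, 2.2, 2.6, 2.4, 1.0, 1.2]

residual = residualize(feature, option_count)
print(covariance(feature, option_count), covariance(residual, option_count))
```

A probe trained on `residual` can no longer exploit the linear relationship with option count, which is the sense in which accuracy "dropping to chance" points to a format confound.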

2606.02776 2026-06-05 cs.CL

Topics as Proxies for Sociodemographics: How Conversational Context Affects LLM Answers

话题作为社会人口统计的代理:对话上下文如何影响大语言模型的回答

Vera Neplenbroek, Gabriele Sarti, Arianna Bisazza, Raquel Fernández

发表机构 * Institute for Logic, Language and Computation, University of Amsterdam(逻辑、语言与计算研究所,阿姆斯特丹大学) Khoury College of Computer Sciences, Northeastern University(计算机科学学院,东北大学) Center for Language and Cognition, University of Groningen(语言与认知中心,格罗宁根大学)

AI总结 研究大语言模型在高风险场景中对话上下文对回答差异的影响,发现话题是社会人口统计差异的主要驱动因素,且影响方式不可预测。

详情
AI中文摘要

当大语言模型(LLM)用于高风险场景(如法律、医疗和金融建议)时,即使单次对话历史也足以导致用户间结果差异。先前研究表明,这会导致社会人口统计群体之间的结果差异,某些群体获得比其他群体更有利的结果。在这项工作中,我们证明LLM实际上难以从单次对话历史推断用户的社会人口统计信息,并且尽管社会人口统计群体之间存在差异,但差异幅度很小。为了探究这些差异的主要驱动因素,我们将用户社会人口统计信息与对话的一系列(心理)语言学特征(包括对话话题、情感和可读性)进行比较。我们发现,在对话上下文中,对话话题最能预测LLM生成的建议,这些话题在一定程度上充当社会人口统计群体的代理,并且常常以不可预测的方式影响建议。这令人担忧,并强调未来研究需要更好地理解,并在必要时减轻高风险场景中对话上下文对LLM输出的影响。

英文摘要

When large language models (LLMs) are used in high-stakes scenarios, such as legal, medical and financial advice, even a single conversation history is enough to drive differences in outcomes between users. Prior work has demonstrated that this results in outcome disparities between sociodemographic groups, with some groups receiving more advantageous outcomes than others. In this work, we demonstrate that LLMs actually struggle to infer user sociodemographics from a single conversation history and that although there are disparities between sociodemographic groups, they are minimal in magnitude. To investigate what the main driver of these disparities is, we compare user sociodemographics to a range of (psycho)linguistic features of conversations, including conversation topic, emotions, and readability. We find that conversation topics are most predictive of LLM-generated advice within a conversational context, which, to some extent, function as proxies for sociodemographic groups and often affect advice in unpredictable ways. This is cause for concern and highlights the need for future research to better understand and, if needed, mitigate the effect of conversational context on LLM outputs in high-stakes scenarios.

2606.02750 2026-06-05 cs.CL

On the Persistent Effects of Lexicality in Large Language Models

论词汇性在大语言模型中的持久影响

Hammad Rizwan, Muhammad Umair Haider, Nishant Subramani, Mona T. Diab, A. B. Siddique, Hassan Sajjad

发表机构 * Dalhousie University(达尔豪斯大学) University of Kentucky(肯塔基大学) Carnegie Mellon University(卡内基梅隆大学)

AI总结 本文通过对抗性语义压力测试和信息论视角,量化了大语言模型中词汇重叠相对于语义内容的影响,发现词汇影响贯穿模型深度,并在中间层出现词汇和语义信号同时衰减的过渡区域,进而以摘要和模型编辑为例展示了词汇影响对下游任务的作用。

详情
AI中文摘要

从大语言模型(LLMs)中提取的表征在许多下游应用中扮演着重要角色。然而,这些表征的结构往往受词汇重叠而非语义内容的影响。我们对这种词汇影响与语义内容之间的关系及其对下游任务的影响的理解仍然有限。在这项工作中,我们研究表征以量化词汇重叠相对于语义内容的影响。我们考虑了若干对抗性语义压力测试,并进一步将我们的发现与信息论视角联系起来。我们发现词汇影响贯穿模型的深度,在不同架构、训练范式和目标函数(包括为语义相似性训练的模型)中一致存在。此外,我们观察到一个中间深度区域,其中词汇和语义信号同时衰减,表明这是一个表征对表面形式和意义都较差的过渡状态。我们进一步通过摘要和模型编辑作为案例研究,展示了词汇影响对LLMs下游使用的影响。

英文摘要

Representations extracted from large language models (LLMs) play an important role in many downstream applications. However, the structure of these representations is often influenced by lexical overlap rather than semantic content. Our understanding of the relationship between this lexical influence and semantic content, and its implications for downstream tasks, remains limited. In this work, we investigate representations to quantify the effect of lexical overlap relative to semantic content. We consider several adversarial semantic stress tests and further connect our findings to the information theory perspective. We find that lexical influence extends across the depth of models, consistently across architectures, training regimes, and objective functions, including the models trained for semantic similarity. Moreover, we observe a mid-depth region in which both lexical and semantic signals degrade simultaneously, indicating a transitional regime where representations are poor for both surface form and meaning. We further demonstrate the effect of lexical influence on downstream uses of LLMs using summarization and model editing as a case study.

2606.02684 2026-06-05 cs.LG cs.AI cs.CL

Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation

先过滤,再重加权:重新思考在线策略蒸馏中的优化粒度

Yuying Li, Leqi Zheng, Yongzi Yu, Wenrui Zhou, Xuchang Zhong, Xing Hu, Jing Jin, Hangjie Yuan, Tao Feng

发表机构 * THU(清华大学) HKUST(香港科技大学) BIT(北京理工大学) Meituan(美团) ZJU(浙江大学)

AI总结 针对在线策略蒸馏,提出FiRe-OPD方法,通过轨迹级过滤和令牌级软重加权实现细粒度优化,在多种设置下优于现有方法。

详情
AI中文摘要

大型语言模型中的在线策略蒸馏正从全轨迹KL监督转向更具选择性的训练范式。最近的在线策略蒸馏方法越来越关注选择哪些轨迹进行学习、哪些令牌信息量最大以及哪些监督信号最可靠。受此趋势启发,我们重新思考在线策略蒸馏的优化粒度,并提出FiRe-OPD(先过滤,再重加权),该方法在轨迹和令牌两个层面联合调整监督信号。具体来说,FiRe-OPD首先过滤轨迹以移除低质量的采样结果,然后在保留的轨迹内应用软重加权以强调信息丰富的令牌。与硬令牌选择相比,FiRe-OPD利用软加权机制有效减轻信息损失并增强优化稳定性,从而实现更细粒度的在线策略蒸馏优化。我们在强到弱、单教师和多教师设置中验证了FiRe-OPD的有效性,并展示了其相对于近期令牌级在线策略蒸馏方法的优越性(例如,在强到弱设置中AIME 2024上+6.25,在多教师设置中Miner上+18.81)。我们的代码可从此链接获取。

英文摘要

On-policy distillation (OPD) in large language models is shifting from full-trace KL supervision toward more selective training paradigms. Recent OPD methods increasingly focus on selecting which trajectories to learn from, which tokens are most informative, and which supervision signals are most reliable. Motivated by this trend, we rethink the optimization granularity of OPD and propose FiRe-OPD (Filter, then Reweight), which jointly adjusts supervision signals at both the trajectory and token levels. In detail, FiRe-OPD first filters trajectories to remove low-quality rollout samples, and then applies soft reweighting within the retained trajectories to emphasize informative tokens. Compared with hard token selection, FiRe-OPD leverages a soft-weighting mechanism to effectively mitigate information loss and enhance optimization stability, thereby achieving finer-grained OPD optimization. We validate the effectiveness of FiRe-OPD across strong-to-weak, single-teacher, and multi-teacher settings, and demonstrate its superiority over recent token-level OPD methods (e.g., +6.25 on AIME 2024 in strong-to-weak, +18.81 on Miner in multi-teacher). Our code is available at https://github.com/YuYingLi0/FiRe-OPD.
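The two-stage idea, trajectory filtering followed by soft token reweighting, might look roughly like the sketch below. Everything specific here is an assumption for illustration: the `quality` score, the per-token `token_signal`, the threshold, and the softmax weighting are hypothetical stand-ins, not the criteria used by FiRe-OPD.

```python
import math


def filter_then_reweight(trajectories, min_quality, temperature=1.0):
    """Drop low-quality rollouts, then give each kept token a soft weight.

    trajectories: list of dicts with 'quality' (float) and 'token_signal'
    (list of floats). Both fields are hypothetical.
    """
    kept = [t for t in trajectories if t["quality"] >= min_quality]
    weighted = []
    for t in kept:
        signal = t["token_signal"]
        peak = max(signal)
        exps = [math.exp((s - peak) / temperature) for s in signal]
        total = sum(exps)
        # Soft weights: every token keeps a positive weight; weights average to 1.
        weights = [len(signal) * e / total for e in exps]
        weighted.append({"quality": t["quality"], "weights": weights})
    return weighted


rollouts = [
    {"quality": 0.9, "token_signal": [0.1, 2.0, 0.5]},
    {"quality": 0.2, "token_signal": [1.0, 1.0]},
]
result = filter_then_reweight(rollouts, min_quality=0.5)
print(result)
```

Unlike hard token selection, no retained token is zeroed out here, which is the property the abstract credits with reducing information loss.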

2606.02031 2026-06-05 cs.LG cs.AI cs.CL cs.CV

OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents

OpenWebRL: 揭秘视觉网络代理的在线多轮强化学习

Rui Yang, Qianhui Wu, Yuxi Chen, Hao Bai, Wenlin Yao, Hao Cheng, Baolin Peng, Huan Zhang, Tong Zhang, Jianfeng Gao

发表机构 * UIUC(伊利诺伊大学香槟分校) Microsoft(微软)

AI总结 提出OpenWebRL框架,通过在线多轮强化学习在真实网站上训练视觉网络代理,以4B参数模型在基准测试中达到开源最优,并与闭源系统竞争。

Comments 36 pages, 11 figures

详情
AI中文摘要

构建强大的视觉网络代理需要长程推理、精确定位以及与动态真实网站的稳健交互。尽管进展迅速,最强的系统仍然大多是专有的,而开放代理仍然严重依赖于对大量策划的网络轨迹进行监督式后训练。这种依赖造成了主要的可扩展性瓶颈:高质量演示的收集成本高昂,而静态数据集对多样且不断变化的开放网络的覆盖有限。尽管在线强化学习在基于文本的代理中显示出前景,但其直接用于在实时网站上训练视觉网络代理的潜力仍未得到充分探索。在本文中,我们介绍了OpenWebRL,一个用于在真实网站上通过在线多轮强化学习训练视觉网络代理的开放框架。OpenWebRL涵盖了完整的训练流程,包括可扩展的实时浏览器基础设施、监督初始化、多模态上下文管理、轨迹级成功判断以及高效的多轮策略优化。使用该框架,我们训练了OpenWebRL-4B,在具有挑战性的实时网络基准测试中建立了新的开源最优水平。仅使用0.4K初始化轨迹和2.2K开放式强化学习训练任务,OpenWebRL-4B在Online-Mind2Web上达到67.0%的成功率,在DeepShop上达到64.0%,优于之前类似或更大规模的开放代理,并与包括OpenAI CUA和Gemini CUA在内的专有系统保持竞争力。除了强大的基准性能外,我们还系统研究了使在线强化学习对视觉网络代理有效的关键设计选择,并分析了强化学习如何改进代理推理。总体而言,我们的工作为构建更强大、可重复且成本效益更高的开放网络代理提供了一条实用路径。我们将发布我们的训练数据、模型和代码以支持未来的研究。

英文摘要

Building capable visual web agents requires long-horizon reasoning, precise grounding, and robust interaction with dynamic real-world websites. Despite rapid progress, the strongest systems remain largely proprietary, while open agents still depend heavily on supervised post-training over large collections of curated web trajectories. This dependence creates a major scalability bottleneck: high-quality demonstrations are expensive to collect, and static datasets offer limited coverage of the diverse, ever-changing open web. Although online RL has shown promise for text-based agents, its potential for training visual web agents directly on live websites remains largely underexplored. In this paper, we introduce OpenWebRL, an open framework for training visual web agents with online multi-turn RL on real websites. OpenWebRL covers the full training pipeline, including scalable live-browser infrastructure, supervised initialization, multimodal context management, trajectory-level success judging, and efficient multi-turn policy optimization. Using this framework, we train OpenWebRL-4B, which establishes a new open-source state of the art on challenging live-web benchmarks. With only 0.4K initialization trajectories and 2.2K open-ended RL training tasks, OpenWebRL-4B achieves 67.0% success on Online-Mind2Web and 64.0% on DeepShop, outperforming prior open agents of similar or larger scale and remaining competitive with proprietary systems including OpenAI CUA and Gemini CUA. Beyond strong benchmark performance, we systematically study the key design choices that make online RL effective for visual web agents, and analyze how RL improves agentic reasoning. Overall, our work offers a practical path toward building more capable, reproducible, and cost-efficient open web agents. We will release our training data, models, and code to support future research.

2606.01935 2026-06-05 cs.CV

Unified Driving Tokens: Representation- and Geometry-Guided Discrete Tokenizer for Driving World Models and Planning

统一驾驶令牌:面向驾驶世界模型和规划的表示与几何引导的离散分词器

Ziyang Yao, Zeyu Zhu, YunCheng Jiang, Zibin Guo, Huijing Zhao

发表机构 * Peking University(北京大学) Xiaomi EV(小米电动车)

AI总结 提出一种表示引导与几何增强的离散分词器,通过联合监督学习紧凑令牌,同时优化重建保真度、表示一致性和规划性能。

详情
AI中文摘要

离散视觉令牌应为基于令牌的世界建模和自动驾驶规划提供紧凑表示。然而,大多数分词器继承自图像生成,主要针对像素重建进行优化,这可能导致易于生成的内容与对驾驶决策有用的解码内容之间存在差距。我们提出了一种表示引导和几何增强的分词器,在联合监督下学习离散令牌。该分词器通过特征解码将其离散瓶颈与冻结的DINO特征空间对齐,同时通过感知损失和对抗损失的RGB重建保留外观。为了注入几何状态相关线索,我们在训练期间添加了相邻帧深度和相对姿态监督,并通过多码本量化稳定联合目标。我们使用轻量级规划读出和GPT风格的下一个令牌世界模型评估相同的学习令牌。在NAVSIM上的实验表明,在固定解码器下,重建保真度和表示一致性得到改善,规划性能具有竞争力,并且在匹配设置下生成质量更好。

英文摘要

Discrete visual tokens should provide a compact representation for both token-based world modeling and planning in autonomous driving. However, most tokenizers are inherited from image generation and are optimized mainly for pixel reconstruction, which may leave a gap between what is easy to generate and what is useful to decode for driving decisions. We present a representation-guided and geometry-enhanced tokenizer that learns discrete tokens under joint supervision. The tokenizer aligns its discrete bottleneck with a frozen DINO feature space through feature decoding, while preserving appearance via RGB reconstruction with perceptual and adversarial losses. To inject geometric state-related cues, we add adjacent-frame depth and relative-pose supervision during training and stabilize joint objectives with multi-codebook quantization. We evaluate the same learned tokens with a lightweight planning readout and a GPT-style next-token world model. Experiments on NAVSIM show improved reconstruction fidelity and representation consistency, competitive planning performance under a fixed decoder, and better generative quality under matched settings.

2606.01897 2026-06-05 cs.AI

Community-Aware Assessment of Social Textual Engagement and Resonance: A Human-Centric Perspective on User-Generated Content Evaluation

社区感知的社交文本参与度与共鸣评估:以人为中心的用户生成内容评价视角

Tianjiao Li, Kai Zhao, Xiang Li, Yang Liu, Huyang Sun

发表机构 * GitHub

AI总结 提出CASTER任务和MEDEA架构,通过社会思维链机制模拟社区认知与情感反应,实现用户生成内容的多模态共鸣评估。

Comments Published as a main conference paper at ACL 2026

详情
AI中文摘要

传统视频质量评估(VQA)狭隘地关注美学保真度,忽略了定义用户生成内容(UGC)质量的复杂社会动态。在这项工作中,我们提出从信号中心指标向以人为中心的共鸣评估的范式转变。我们引入CASTER(社区感知的社交文本参与度与共鸣评估),这是一个新任务,根据UGC项目的多模态属性而非仅视觉质量来评估其是否实现积极的社区共鸣。为此,我们提出MEDEA(多模态参与驱动评估架构),它引入了一种新颖的社会思维链(Social-CoT)机制。与传统的逻辑CoT不同,Social-CoT执行多模态视角转换,实例化不同的观众角色以模拟集体认知和情感反应(即“社区思维”),然后得出质量判断。MEDEA通过两阶段方法进行训练,包括监督微调和带有社会对齐奖励的过程监督强化学习,以确保推理路径基于真实的人类社会认知。为支持此任务,我们发布了CASTER-Bench,一个涵盖多种UGC类别的全面人工标注基准。实验表明,MEDEA在CASTER-Bench上显著优于最先进的基线,同时提供可解释且富有同理心的推理路径,与真实社区反馈一致。

英文摘要

Traditional Video Quality Assessment (VQA) focuses narrowly on aesthetic fidelity, overlooking the complex social dynamics that define quality in User-Generated Content (UGC). In this work, we propose a paradigm shift from signal-centric metrics to human-centric resonance assessment. We introduce CASTER (Community-Aware Assessment of Social Textual Engagement and Resonance), a new task that evaluates whether a UGC item achieves positive community resonance based on its multimodal attributes rather than visual quality alone. To address this, we present MEDEA (Multimodal Engagement-Driven Evaluation Architecture), which introduces a novel Social Chain-of-Thought (Social-CoT) mechanism. Unlike traditional logical CoT, Social-CoT performs multimodal perspective-taking, instantiating diverse viewer personas to simulate collective cognitive and emotional reactions (i.e., the "community mind") before deriving a quality judgment. MEDEA is trained via a two-stage approach involving supervised fine-tuning and process-supervised reinforcement learning with Social Alignment Reward to ensure reasoning paths are grounded in authentic human social cognition. To support this task, we release CASTER-Bench, a comprehensive human-annotated benchmark covering diverse UGC categories. Experiments demonstrate that MEDEA significantly outperforms state-of-the-art baselines on CASTER-Bench while providing interpretable and empathetic reasoning paths that align with real community feedback.

2606.01822 2026-06-05 cs.CV

Hierarchically Decoupled Mixture-of-Experts for Robust Traffic Sign Recognition in Complex Driving Scenarios

用于复杂驾驶场景中鲁棒交通标志识别的分层解耦混合专家模型

Mingxiao Wang, Xiaozhen Qu, Bolin Gao, Tong Wang, Lei He

发表机构 * School of Automotive and Traffic Engineering, Liaoning University of Technology(辽宁工业大学汽车与交通工程学院) State Key Laboratory of Intelligent Green Vehicles and Mobility, School of Vehicle and Mobility, Tsinghua University(智能绿色车辆与交通全国重点实验室,清华大学车辆与运载学院)

AI总结 提出分层解耦异构混合专家框架CBDES MoE TSR,通过图像级动态路由机制选择最优专家模型,在复合交通标志数据集上mAP50-95达76.8%,比基线提升2.3%且计算开销降低39.4%。

Comments 9 figures, 3 tables

详情
AI中文摘要

交通标志检测是自动驾驶和智能交通系统中环境感知的基本组成部分。然而,现有大多数检测器依赖具有全局共享参数的静态推理,限制了其适应多样化和非结构化交通场景的能力。因此,单个静态模型通常难以同时处理清晰的近距样本和诸如远距离小目标或恶劣天气环境等挑战性条件。为解决这一局限,我们提出了CBDES MoE TSR,一种用于交通标志识别的分层解耦异构混合专家(MoE)框架。该框架通过引入异构YOLO专家池和轻量级门控网络,摆脱了传统的全局共享参数范式,实现了图像级动态路由机制。基于输入图像的语义特征,门控模块从专家池中选择性激活最合适的专家模型,实现从固定参数拟合到按需动态表示的转变。这种设计增强了特定场景下的特征提取能力,同时保持了可控的推理开销。实验结果表明,所提方法在复合交通标志数据集上实现了检测精度与效率的显著平衡。具体而言,我们的方法达到了76.8%的mAP50-95,相比基线方法(74.5%)提升了2.3%,同时计算开销降低了约39.4%。这些结果有力地验证了所提方法的有效性。

英文摘要

Traffic sign detection is a fundamental component of environmental perception in autonomous driving and intelligent transportation systems. However, most existing detectors rely on static inference with globally shared parameters, limiting their ability to adapt to diverse and unstructured traffic scenarios. As a result, a single static model often struggles to handle both clear near-range samples and challenging conditions such as distant small targets or adverse weather. To address this limitation, we propose CBDES MoE TSR, a hierarchically decoupled heterogeneous mixture-of-experts (MoE) framework for traffic sign recognition. The proposed framework departs from the conventional globally shared parameter paradigm by introducing a heterogeneous You Only Look Once (YOLO) expert pool together with a lightweight gating network, enabling an image-level dynamic routing mechanism. Based on the semantic characteristics of the input image, the gating module selectively activates the most suitable expert model from the expert pool, enabling a shift from fixed parameter fitting to on-demand dynamic representation. This design enhances feature extraction capability for specific scenarios while maintaining controlled inference overhead. Experimental results demonstrate that the proposed method achieves a favorable balance between detection accuracy and efficiency on the composite traffic sign dataset. Specifically, our method attains an mAP50-95 of 76.8%, a 2.3-point improvement over the baseline method (74.5%), while reducing computational overhead by approximately 39.4%. These findings support the effectiveness of the proposed approach.
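Image-level routing of the kind described above can be sketched as top-1 gating: a small network scores each expert for the whole image, and only the best-scoring expert runs. The expert names and gate logits below are hypothetical placeholders; the abstract does not specify the gating architecture or how many experts are activated.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(gate_logits, experts):
    """Pick the single expert with the highest gate probability (top-1 routing)."""
    probs = softmax(gate_logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return experts[best], probs[best]


# Hypothetical expert pool and gate output for one input image.
experts = ["near_range_expert", "small_target_expert", "adverse_weather_expert"]
gate_logits = [0.1, 2.0, -1.0]

chosen, confidence = route(gate_logits, experts)
print(chosen, round(confidence, 3))
```

Because only one detector runs per image, inference cost tracks the chosen expert instead of the whole pool, which is one way such a design can keep overhead controlled.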

2606.01113 2026-06-05 cs.CV

R^3: Composed Video Retrieval via Reasoning-Guided Recalling and Re-ranking

R^3: 基于推理引导的召回与重排序的组合视频检索

Zixu Li, Yupeng Hu, Zhiheng Fu, Zhiwei Chen, Weili Guan, Liqiang Nie

发表机构 * Shandong University(山东大学) Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳))

AI总结 提出R^3零样本组合视频检索流程,通过生成推理轨迹增强查询表示,并融合重排序验证候选视频,有效解决源视频与编辑指令组合检索的挑战。

详情
AI中文摘要

CoVR-R挑战评估组合视频检索,系统需根据参考视频和文本编辑指令从大型图库中检索目标视频。该设置不是标准的视频-文本检索问题:查询由源视频中的视觉证据和编辑隐含的变换共同定义。强嵌入模型可提供可扩展的候选召回,但可能无法充分表达目标侧后果,如状态变化、动作替换、对象保留或时间一致性。成对多模态重排序器可直接验证此类细节,但全面重排序整个图库在计算上不可行。我们提出R^3,一个基于推理引导的召回与重排序的零样本组合视频检索流程。核心思想是将源-编辑查询转化为推理基础的检索程序,而非将编辑文本视为短标题。首先,模型生成推理轨迹,描述应用编辑后预期的目标视频。然后,将轨迹与源视频一起编码为推理增强查询,并通过一致性门控残差规则与基础组合查询的检索分数融合。最后,重排序器通过直接源-候选比较验证召回候选。实验证明了我们方法在应对该挑战中的有效性。代码可在https://github.com/Lee-zixu/R-3获取。

英文摘要

The CoVR-R challenge evaluates composed video retrieval, where a system must retrieve a target video from a large gallery given a reference video and a textual edit instruction. This setting is not a standard video-text retrieval problem: the query is defined by both the visual evidence in the source video and the transformation implied by the edit. A strong embedding model can provide scalable candidate recall, but it may under-express target-side consequences such as state changes, action replacement, object preservation, or temporal consistency. A pairwise multimodal reranker can verify such details more directly, but exhaustive reranking over the full gallery is computationally infeasible. We present $\mathbb{R}^3$, a zero-shot composed video retrieval pipeline built around Reasoning-guided Recalling and Re-ranking. The core idea is to turn the source-edit query into a reasoning-grounded retrieval program rather than treating the edit text as a short caption. First, the model generates a reasoning trace that describes the expected target video after applying the edit. Then, the trace is encoded together with the source video as a reasoning-augmented query, and its retrieval score is fused with that of the base composed query through an agreement-gated residual rule. Finally, a re-ranker verifies the recalled candidates with direct source-candidate comparison. Experiments demonstrate the effectiveness of our method in addressing this challenge. Code is available at https://github.com/Lee-zixu/R-3.

2606.00644 2026-06-05 cs.AI

ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgment

ForeSci: 评估LLM智能体在前瞻性AI研究判断中的能力

Qiuyu Tian, Haojie Yin, Yingce Xia, Youyong Kong, Zequn Liu

发表机构 * Southeast University(东南大学) Beijing Zhongguancun Academy(北京中关村学院) Duke Kunshan University(杜克昆山大学)

AI总结 提出ForeSci基准,通过时间控制的500个任务评估LLM智能体基于历史证据做出前瞻性研究判断的能力,发现证据与决策脱节问题。

详情
AI中文摘要

AI研究通常需要在未来证据出现之前做出决策:攻击哪个瓶颈、追求哪个方向、项目应如何定位。我们引入了ForeSci,一个时间控制的基准,用于评估LLM智能体是否能够从历史证据中做出此类前瞻性研究判断。ForeSci包含500个任务,涵盖四个快速发展的AI领域和四个决策家族。每个任务配有一个截止对齐的离线知识库;截止日期后的论文在生成过程中被隐藏,仅用于验证。为避免随机未来事件预测,任务源自截止前的分类分支和证据信号,并选择早于任务截止日期的答案生成骨干。我们评估了原生LLM、混合RAG以及四种骨干上的三种研究智能体适配。结果表明,显式证据组织提高了可追溯性和事实支持,但收益强烈依赖于决策家族。诊断揭示了一个反复出现的证据-决策脱节:智能体可能引用相关证据,但预测错误的研究对象。ForeSci将前瞻性AI研究判断转化为一个受控基准,用于评估作为决策系统的研究智能体。

英文摘要

AI research often requires decisions before future evidence exists: which bottleneck to attack, which direction to pursue, or where a project should be positioned. We introduce ForeSci, a temporally controlled benchmark for evaluating whether LLM agents can make such forward-looking research judgements from historical evidence. ForeSci contains 500 tasks across four fast-moving AI domains and four decision families. Each task is paired with a cutoff-aligned offline knowledge base; post-cutoff papers are hidden during generation and used only for validation. To avoid random future-event prediction, tasks are derived from pre-cutoff taxonomy branches and evidence signals, and answer-generation backbones are selected to precede the task cutoffs. We evaluate native LLMs, Hybrid RAG, and three research-agent adaptations across four backbones. Results show that explicit evidence organization improves traceability and factual support, but gains depend strongly on the decision family. Diagnostics reveal a recurring evidence-decision decoupling: agents may cite relevant evidence while forecasting the wrong research object. ForeSci turns forward-looking AI research judgement into a controlled benchmark for evaluating research agents as decision-making systems.
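The temporal control described above amounts to splitting a paper collection at a cutoff date: earlier papers form the knowledge base the agent may read, later papers are held out for validation. The sketch below shows that split with made-up records; the field names and dates are hypothetical and do not reflect ForeSci's actual data format.

```python
from datetime import date


def split_by_cutoff(papers, cutoff):
    """Return (visible, held_out): papers up to the cutoff vs. strictly after it."""
    visible = [p for p in papers if p["published"] <= cutoff]
    held_out = [p for p in papers if p["published"] > cutoff]
    return visible, held_out


# Hypothetical paper records (illustrative only).
papers = [
    {"id": "A", "published": date(2024, 3, 1)},
    {"id": "B", "published": date(2024, 9, 15)},
    {"id": "C", "published": date(2025, 2, 10)},
]

knowledge_base, validation_only = split_by_cutoff(papers, cutoff=date(2024, 12, 31))
print([p["id"] for p in knowledge_base], [p["id"] for p in validation_only])
```

The same cutoff would also need to postdate the backbone model's training data, since the abstract states that answer-generation backbones are selected to precede the task cutoffs.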

2606.00616 2026-06-05 cs.CV cs.AI

Pause and Think: A Dataset and Benchmark for Video-Grounded Assistive Action Suggestion

暂停与思考:面向视频基础辅助动作建议的数据集与基准

Shivam Singh, Saptarshi Majumder, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum

发表机构 * Advanced Micro Devices, Inc.(先进微器件公司)

AI总结 提出 pause-and-think-T 数据集和 pause-and-think-B 基准,通过推理监督训练紧凑模型,在视频场景理解与目标规划任务中达到与大型模型相当的性能。

详情
AI中文摘要

最近的视觉语言模型(VLM)在视频中的基础推理、时间一致性和上下文感知规划方面存在困难。我们引入了 pause-and-think-T,一个以推理为中心的训练数据集,鼓励模型暂停、基于视觉证据进行推理,并生成简洁、可操作的响应。该数据集在生成答案之前促进结构化推理,引导模型走向类人、基于场景的辅助。我们微调了一个紧凑的 4B 参数模型,并在面向上下文理解和目标规划任务的 pause-and-think-B 基准上对其进行了评估。该模型在参数比 Qwen3-VL-235B(58.9%)少 59 倍的情况下达到了 58.0% 的准确率,在场景理解上与 GPT-5.2 匹配,并超越了 GPT-4o。除了我们的基准之外,该模型在 EgoThink 和 TempCompass 上也表现出强大的分布外性能,在可操作性、辅助性、属性识别、情境推理和时间顺序方面取得了显著提升,且无需特定基准训练。我们的结果表明,有针对性的推理监督使紧凑模型能够提供可操作的、基于视觉的指导,同时泛化到训练数据之外,而无需进行大规模模型扩展。

英文摘要

Recent Vision-Language Models (VLMs) struggle with grounded reasoning, temporal consistency, and context-aware planning in videos. We introduce pause-and-think-T, a reasoning-centric training dataset that encourages models to pause, reason over visual evidence, and produce concise, actionable responses. The dataset promotes structured reasoning prior to answer generation, guiding models toward human-like, scene-grounded assistance. We fine-tune a compact 4B-parameter model and evaluate it on our pause-and-think-B benchmark targeting contextual understanding and goal planning tasks. The model achieves 58.0% accuracy at 59x fewer parameters than Qwen3-VL-235B (58.9%), matching GPT-5.2 on scene understanding and surpassing GPT-4o. Beyond our benchmark, it also shows strong out-of-distribution performance on EgoThink and TempCompass, with substantial gains in affordance, assistance, attribution recognition, situated reasoning, and temporal order, without benchmark-specific training. Our results indicate that targeted reasoning supervision enables compact models to deliver actionable, visually grounded guidance while generalizing beyond training data, without requiring large-scale model expansion.

2606.00522 2026-06-05 cs.CV

A Trajectory-Driven Spatio-Temporal Refinement Solution for CVPR 2026 8th UG2+ Challenge Track 3: DOST

面向 CVPR 2026 第八届 UG2+ 挑战赛赛道三(DOST)的轨迹驱动时空细化解决方案

Hongzhen Li, Miao Yu, Leilei Cao, Youwei Pan, Yingfang Zhu, Fengjie Zhu

发表机构 * TEX AI, Transsion Holdings(TEX AI,Transsion控股)

AI总结 基于 SegAnyMo 框架,通过数据域自适应和时空后处理模块,提升严重大气畸变下的动态目标分割性能,在挑战赛中获第二名。

详情
AI中文摘要

在这项工作中,我们提出了针对第八届 UG2+ 挑战赛(CVPR 2026)赛道三:湍流中动态目标分割(DOST)的解决方案。我们的方法建立在强大的基线框架 Segment Any Motion (SegAnyMo) 之上,该框架提供了强大的掩码生成和运动跟踪能力。为了进一步提升在严重大气畸变下的分割性能,我们提出了两个关键改进。首先,我们采用以数据为中心的域自适应策略。通过从 DAVIS 数据集和 DOST 数据集的子集中选取序列,并结合模拟大气波动退化,显著扩展了训练数据,增强了模型对复杂几何畸变的鲁棒性。其次,我们引入了时空后处理模块。该细化步骤有效去除了持续存在的边界连接假前景和短时碎片噪声,同时严格保留了真实小目标并保持帧间的原始个体标签。通过上述组合策略,我们的方法在挑战赛中获得了第二名。

英文摘要

In this work, we present our solution for the 8th UG2+ Challenge (CVPR 2026) Track 3: Dynamic Object Segmentation in Turbulence (DOST). Our method is built upon the strong baseline framework Segment Any Motion (SegAnyMo), which provides powerful mask generation and motion tracking capabilities. To further boost segmentation performance under severe atmospheric distortions, we propose two key improvements. First, we employ a data-centric domain adaptation strategy. We significantly expand our training data by incorporating selected sequences from the DAVIS dataset alongside a subset of the DOST dataset, and apply simulated atmospheric fluctuation degradations to enhance the model's robustness against complex geometric distortions. Second, we introduce a spatio-temporal post-processing module. This refinement step effectively removes persistent boundary-connected false foregrounds and short-lived fragmented noise, while strictly preserving genuine small targets and maintaining original individual labels across frames. With these combined strategies, our proposed method ranks 2nd in the challenge.

2605.31278 2026-06-05 cs.AI cs.LG stat.ME

Industrializing Prediction-Powered Inference: The GLIDE Library for Reliable GenAI and Agentic Systems Evaluation

Grégoire Martinon, Ibrahim Merad, Mohammed Raki

Affiliations * University of California, Berkeley; Google Research

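AI summary: Introduces GLIDE, an open-source library that unifies multiple prediction-powered inference methods, provides unbiased estimates with valid confidence intervals, and substantially reduces human annotation cost.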

Comments 8 pages, Accepted to the ICML 2026 Workshop on Statistical Frameworks for Uncertainty in Agentic Systems, Seoul, South Korea, 2026

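Details
AI Chinese abstract (translated)

Reliable evaluation of agentic systems requires unbiased estimates with valid uncertainty, but standard practice trades off between costly human annotation and biased LLM-as-judge proxies. Prediction-powered inference (PPI) combines the two into debiased estimates with valid confidence intervals, yet its various methods remain scattered across partial implementations in different papers. We introduce GLIDE, an open-source Python library that unifies state-of-the-art PPI estimators (PPI++, Stratified PPI, Predict-Then-Debias and its stratified variants, Active Statistical Inference) and samplers (uniform, stratified, active, cost-optimal) under a scipy-style API specialized to mean estimation. GLIDE ships with a reproducible Monte Carlo validation suite, an empirically grounded decision tree for method selection, and an agentic evaluation case study showing substantial annotation savings at equal precision. The GLIDE package is available at this URL: https://github.com/EmertonData/glide

English abstract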

Reliable evaluation of agentic systems requires unbiased estimates with valid uncertainty, but standard practice navigates between costly human annotation and biased LLM-as-judge proxies. Prediction-powered inference (PPI) combines both into debiased estimates with valid confidence intervals, yet its various methods remain scattered across papers under partial implementations. We introduce GLIDE, an open-source Python library that unifies state-of-the-art PPI estimators (PPI++, Stratified PPI, Predict-Then-Debias and its stratified variants, Active Statistical Inference) and samplers (uniform, stratified, active, cost-optimal) under a scipy-style API specialized to mean estimation. GLIDE ships with a reproducible Monte Carlo validation suite, an empirically grounded decision tree for method selection, and an agentic evaluation case study showing substantial annotation savings at equivalent precision. The GLIDE package is available at this URL: https://github.com/EmertonData/glide
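As a rough illustration of the idea behind PPI (not GLIDE's API, which this listing does not describe), the basic prediction-powered mean estimator can be sketched in plain Python. The function name `ppi_mean` and its arguments are hypothetical. The sketch assumes a small human-labeled sample that also has judge scores, plus a larger independent pool scored only by the judge, and it uses a normal approximation for the confidence interval.

```python
import math
import statistics
from statistics import NormalDist


def ppi_mean(y_labeled, f_labeled, f_unlabeled, alpha=0.05):
    """Basic prediction-powered estimate of a mean (illustrative only).

    y_labeled:   human labels on the small labeled sample
    f_labeled:   judge/model scores on the same labeled items
    f_unlabeled: judge/model scores on the large unlabeled pool
    Returns (estimate, (lower, upper)) for a 1 - alpha confidence level.
    """
    n, big_n = len(y_labeled), len(f_unlabeled)

    # Rectifier: how far the judge is from the human label on labeled items.
    rectifier = [y - f for y, f in zip(y_labeled, f_labeled)]

    # Judge mean on the large pool, corrected by the average judge error.
    estimate = statistics.fmean(f_unlabeled) + statistics.fmean(rectifier)

    # The two terms come from independent samples, so their variances add.
    variance = (
        statistics.variance(f_unlabeled) / big_n
        + statistics.variance(rectifier) / n
    )
    half_width = NormalDist().inv_cdf(1 - alpha / 2) * math.sqrt(variance)
    return estimate, (estimate - half_width, estimate + half_width)
```

The correction term is what removes the judge's bias: if the judge is accurate, the rectifier has low variance and the interval is driven mostly by the large pool, which is where the annotation savings come from.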

2605.30819 2026-06-05 cs.CV cs.GR

Function2Scene: 3D Indoor Scene Layout from Functional Specifications

Ruiqi Wang, Qimin Chen, Daniel Ritchie, Angel X. Chang, Manolis Savva, Kai Wang, Hao Zhang

Affiliations * Simon Fraser University; Brown University

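AI summary: Proposes the Function2Scene framework, which parses user personas and activities from natural-language design briefs, generates layouts from 17 functional constraint criteria, and refines them through an iterative check-and-repair loop using LLMs and VLMs; on 30 professional cases it is preferred over baseline methods in 94.3% of pairwise comparisons.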

Comments project page: https://function2scene.github.io/

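Details
AI Chinese abstract (translated)

Most text-driven 3D indoor scene synthesis methods generate rooms from object-centric prompts, asking what furniture should be placed rather than how the space is used. Yet in real interior design, the quality of a layout depends on how well it supports its occupants, for example their activities and physical needs. We introduce Function2Scene, a framework for generating 3D indoor layouts from functional specifications, that is, natural-language design briefs describing who will use a room and what they need to do there. Given such a specification, our system parses occupant personas and activities, derives a customized set of functional design constraints from a taxonomy of 17 criteria covering spatial, ergonomic, activity, and environmental considerations, and uses these constraints to guide layout generation. Rather than relying on an LLM to produce the final scene directly, Function2Scene evaluates and refines iteratively through a tool-augmented check-and-repair loop that combines geometric measurements, LLM-based contextual reasoning, and VLM-based visual assessment. Experiments on 30 professionally written interior-design cases show that Function2Scene produces layouts that satisfy functional requirements better than recent LLM-based scene synthesis baselines, with our results preferred in 94.3% of pairwise comparisons. Our work reframes text-driven indoor scene synthesis from placing plausible objects to designing spaces that support human use.

English abstract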

Most text-driven 3D indoor scene synthesis methods generate rooms from object-centric prompts, asking what furniture should be placed rather than how the space is used. Yet in real interior design, a layout is judged by how well it supports its occupants, e.g., their activities and physical needs. We introduce Function2Scene, a framework for generating 3D indoor layouts from functional specifications, i.e., natural-language design briefs describing who will use a room and what they need to do there. Given such a specification, our system parses occupant personas and activities, derives a customized set of functional design constraints from a taxonomy of 17 criteria spanning spatial, ergonomic, activity, and environmental considerations, and uses these constraints to guide layout generation. Rather than relying on an LLM to directly produce a final scene, Function2Scene performs iterative evaluation and refinement through a tool-augmented check-and-repair loop, combining geometric measurements, LLM-based contextual reasoning, and VLM-based visual assessment. Experiments on 30 professionally written interior-design cases show that Function2Scene produces layouts that better satisfy functional requirements than recent LLM-based scene synthesis baselines, with our results preferred in 94.3% of pairwise comparisons. Our work reframes text-driven indoor scene synthesis from placing plausible objects to designing spaces that support human use.
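The check-and-repair loop mentioned in the abstract can be pictured with a toy example. This is a sketch under strong simplifying assumptions and not Function2Scene's implementation: furniture is reduced to intervals in centimetres along one wall, the only constraint is a minimum clearance between neighbours, and the LLM and VLM checks are omitted entirely. The names `check_clearance` and `check_and_repair` are hypothetical.

```python
def check_clearance(layout, min_gap):
    """Return (left, right, shortfall) for neighbours closer than min_gap.

    layout: dict mapping item name -> (start_cm, end_cm) along one wall.
    """
    items = sorted(layout.items(), key=lambda kv: kv[1][0])
    violations = []
    for (a, (_, a_end)), (b, (b_start, _)) in zip(items, items[1:]):
        gap = b_start - a_end
        if gap < min_gap:
            violations.append((a, b, min_gap - gap))
    return violations


def check_and_repair(layout, min_gap, max_rounds=10):
    """Alternate checking and repairing until the check passes or rounds run out."""
    layout = dict(layout)  # work on a copy; the input is left untouched
    for _ in range(max_rounds):
        violations = check_clearance(layout, min_gap)
        if not violations:
            break
        # Repair step (assumed rule): push the right-hand item of the first
        # violation further along the wall by exactly the missing clearance.
        _, right, shortfall = violations[0]
        start, end = layout[right]
        layout[right] = (start + shortfall, end + shortfall)
    return layout, check_clearance(layout, min_gap)
```

For instance, with a bed at (0, 200), a desk at (230, 350), a shelf at (360, 400), and a 90 cm clearance requirement, the loop first shifts the desk, re-checks, finds that the desk now crowds the shelf, and shifts the shelf in the next round. Re-running the check after every repair is the point of the loop, since a fix for one constraint can create a new violation elsewhere.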