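arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.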

2606.04811 2026-06-05 cs.CV

GRPO with Rollout-Level Advantage-Prioritized Experience Replay

Gyeongtae Yoo, Sanghyeok Park, Soohyuk Jang, Ik-hwan Kim, Sungroh Yoon

Affiliations: Department of Electrical and Computer Engineering, Seoul National University; Interdisciplinary Program in AI, Seoul National University; AIIS, ASRI, INMC, and ISRC, Seoul National University

AI summary: To address GRPO's low sample efficiency, the paper proposes a rollout-level experience replay buffer that bounds staleness through age eviction, preserves on-policy data through fresh-anchored composition, and prioritizes sampling by advantage magnitude, improving performance on multiple math benchmarks.

Abstract

Reinforcement learning from verifiable rewards with GRPO is a standard approach for post-training reasoning LLMs. It remains sample inefficient. Each rollout is used for a single gradient update and then discarded. Naive replay is not well suited in this setting because LLM policies drift quickly per gradient step. Stored rollouts therefore become stale and can destabilize training. We propose a rollout-level replay buffer for GRPO that stores and samples individual rollouts rather than whole groups. The buffer bounds staleness through age eviction. Any rollout older than tau_max training steps is removed. The buffer also preserves on-policy data via fresh-anchored composition. Each batch keeps its fresh on-policy rollouts and then concatenates replay rollouts drawn separately from the buffer. We prioritize replay by per-rollout advantage magnitude and recycle individual rollouts whose advantages are large. Across three Qwen3-Base scales on five math benchmarks, our method outperforms GRPO and naive replay baselines. Gains are positive at every scale and grow with model size. The largest gain is +4.35 pp on the five-benchmark average at 4B. Under an AES metric that jointly measures accuracy and token efficiency, the efficiency margin over GRPO is again largest at 4B, at +0.579.
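The three buffer mechanisms named in the abstract (age eviction, fresh-anchored composition, and advantage-magnitude prioritization) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Rollout` fields, the class and function names, the sampling-without-replacement scheme, and the order of evict/sample/add are all hypothetical choices made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Rollout:
    tokens: list       # sampled response tokens (placeholder payload)
    advantage: float   # advantage assigned when the rollout was generated
    step: int          # training step at which the rollout was generated


class RolloutReplayBuffer:
    """Hypothetical rollout-level buffer; stores individual rollouts, not groups."""

    def __init__(self, tau_max, seed=0):
        self.tau_max = tau_max
        self.items = []
        self.rng = random.Random(seed)

    def add(self, rollouts):
        self.items.extend(rollouts)

    def evict(self, current_step):
        # Age eviction: drop any rollout older than tau_max training steps.
        self.items = [r for r in self.items
                      if current_step - r.step <= self.tau_max]

    def sample(self, k):
        # Assumed prioritization: draw without replacement with probability
        # proportional to |advantage| (small epsilon keeps weights positive).
        pool = list(self.items)
        chosen = []
        for _ in range(min(k, len(pool))):
            weights = [abs(r.advantage) + 1e-8 for r in pool]
            idx = self.rng.choices(range(len(pool)), weights=weights, k=1)[0]
            chosen.append(pool.pop(idx))
        return chosen


def compose_batch(fresh, buffer, current_step, n_replay):
    """Fresh-anchored composition: keep every fresh rollout, then append replay."""
    buffer.evict(current_step)
    replay = buffer.sample(n_replay)
    buffer.add(fresh)  # fresh rollouts become replay candidates for later steps
    return list(fresh) + replay
```

In this sketch the fresh on-policy rollouts always lead the batch, so replayed rollouts can only add to, never displace, on-policy data.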

2606.04485 2026-06-05 cs.LG

LimiX-2M: Mitigating Low-Rank Collapse and Attention Bottlenecks in Tabular Foundation Models

Yuanrui Wang, Xingxuan Zhang, Han Yu, Mingchao Hao, Gang Ren, Hao Yuan, Li Mao, Yunjia Zhang, Chun Yuan, Peng Cui

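Affiliations: University of Science and Technology of China

AI summary: Proposes LimiX-2M, built on a unified tokenize-and-route framework: RaBEL expands each scalar into localized RBF features, and the bidirectional block is reordered as S->N->F. With 2M parameters it outperforms larger models and improves the accuracy-efficiency trade-off of tabular foundation models.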

Comments Accepted to ICML 2026

Abstract

Tabular foundation models (TFMs) increasingly rival tree ensembles, but their performance is often compute-inefficient: with standard affine scalar tokenization, each feature injects value variation through an essentially one-dimensional channel, and feature IDs/positional signals cannot increase within-feature value degrees of freedom, yielding weak early-layer value sensitivity and redundant hidden states. We present a unified tokenize-and-route framework for strong TFMs: RaBEL expands each scalar into compact localized RBF features (optionally exponent-gated) to improve conditioning and shallow-layer effective rank, while a reordered bidirectional block S->N->F aligns computation with the readout by aggregating cross-sample context before feature mixing and using attention pooling. Together, these changes yield LimiX-2M, a 2M-parameter model that outperforms larger TabPFN-v2 and TabICL baselines on widely used tabular benchmarks while reducing training and inference costs. These results highlight value-aware tokenization and readout-aligned routing as key levers for improving the accuracy-efficiency trade-off in TFMs. Model checkpoints and inference code are available at https://github.com/limix-ldm-ai/LimiX.
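To make the contrast between affine scalar tokenization and a localized RBF expansion concrete, here is a small sketch. It assumes Gaussian kernels on fixed, evenly spaced centers with a shared width; RaBEL's actual parameterization and its optional exponent gating are not specified in the abstract, so the function names and all parameters below are illustrative only.

```python
import math


def affine_token(x, w, b):
    """Affine scalar tokenization: token = w * x + b.

    As x varies, every token lies on a single line in embedding space,
    i.e. the value enters through a one-dimensional channel.
    """
    return [wi * x + bi for wi, bi in zip(w, b)]


def rbf_expand(x, centers, width):
    """Expand one scalar into Gaussian RBF features, one per center.

    Each feature responds only near its own center, so different value
    ranges activate different coordinates (assumed form, not RaBEL itself).
    """
    return [math.exp(-((x - c) ** 2) / (2.0 * width ** 2)) for c in centers]


centers = [-2.0, -1.0, 0.0, 1.0, 2.0]
features = rbf_expand(0.0, centers, width=1.0)
```

For an input of 0.0 the feature at the center 0.0 is fully active, and the responses fall off symmetrically on either side.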

2606.04463 2026-06-05 cs.RO

OSCAR: Omni-Embodiment Action-Conditioned World Model for Robotics

Zhuoyuan Wu, Jun Gao

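Affiliations: Peking University; University of Michigan; NVIDIA

AI summary: Proposes OSCAR, an action-conditioned video world model that uses a large-scale data pipeline and 2D skeleton rendering as a unified representation to generalize across robot embodiments, and applies it to policy evaluation.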

Comments Project page: https://wuzy2115.github.io/oscar-project-page/

Abstract

We present OSCAR, a precise action-conditioned video world model that generalizes across different robot embodiments and enables robot policy evaluation. Existing video world models face three main challenges for real-world robot evaluation: limited scenario diversity in current robot training datasets, imprecise action following, and poor generalization across embodiments for broad adoption. We tackle these challenges from two perspectives. At its core is a large-scale standardized data pipeline that curates, filters, and deduplicates broad robotics and egocentric human datasets, yielding a clean joint-training dataset that spans diverse tasks, scenarios, actions, and robot embodiments. To condition the video model, we adopt 2D kinematic skeleton rendering as a unified conditioning representation that generalizes across different robot arms or even human hands. We finetune the Cosmos-Predict2.5-2B model on a single GH200 GPU. Compared to existing baselines, which either have a much larger model size or require more GPUs, our model achieves significant improvement on action following, appearance quality, and motion consistency. We further deploy OSCAR to evaluate robot policies from RoboArena. Extensive experiments demonstrate a significant correlation between our virtual policy evaluation in OSCAR and real-world evaluation, paving the way for a future where robot policies can be evaluated purely in virtually generated worlds.

2606.04335 2026-06-05 cs.LG cs.SY eess.SY

Topics as Proxies for Sociodemographics: How Conversational Context Affects LLM Responses

Vera Neplenbroek, Gabriele Sarti, Arianna Bisazza, Raquel Fernández

Affiliations: Institute for Logic, Language and Computation, University of Amsterdam; Khoury College of Computer Sciences, Northeastern University; Center for Language and Cognition, University of Groningen

AI summary: Studies how conversational context drives differences in LLM responses in high-stakes scenarios, finding that conversation topic is the main driver of sociodemographic disparities and affects advice in unpredictable ways.

Abstract

When large language models (LLMs) are used in high-stakes scenarios, such as legal, medical and financial advice, even a single conversation history is enough to drive differences in outcomes between users. Prior work has demonstrated that this results in outcome disparities between sociodemographic groups, with some groups receiving more advantageous outcomes than others. In this work, we demonstrate that LLMs actually struggle to infer user sociodemographics from a single conversation history and that although there are disparities between sociodemographic groups, they are minimal in magnitude. To investigate what the main driver of these disparities is, we compare user sociodemographics to a range of (psycho)linguistic features of conversations, including conversation topic, emotions, and readability. We find that conversation topics are most predictive of LLM-generated advice within a conversational context, which, to some extent, function as proxies for sociodemographic groups and often affect advice in unpredictable ways. This is cause for concern and highlights the need for future research to better understand and, if needed, mitigate the effect of conversational context on LLM outputs in high-stakes scenarios.

2606.02750 2026-06-05 cs.CL

On the Persistent Effects of Lexicality in Large Language Models

Hammad Rizwan, Muhammad Umair Haider, Nishant Subramani, Mona T. Diab, A. B. Siddique, Hassan Sajjad

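Affiliations: Dalhousie University; University of Kentucky; Carnegie Mellon University

AI summary: Using adversarial semantic stress tests and an information-theoretic perspective, the paper quantifies the effect of lexical overlap relative to semantic content in LLMs. It finds that lexical influence extends across model depth and that a mid-depth transitional region exists where lexical and semantic signals degrade together, and it uses summarization and model editing as case studies of the downstream effects.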

Abstract

Representations extracted from large language models (LLMs) play an important role in many downstream applications. However, the structure of these representations is often influenced by lexical overlap rather than semantic content. Our understanding of the relationship between this lexical influence and semantic content, and its implications for downstream tasks, remains limited. In this work, we investigate representations to quantify the effect of lexical overlap relative to semantic content. We consider several adversarial semantic stress tests and further connect our findings to an information-theoretic perspective. We find that lexical influence extends across the depth of models, consistently across architectures, training regimes, and objective functions, including models trained for semantic similarity. Moreover, we observe a mid-depth region in which both lexical and semantic signals degrade simultaneously, indicating a transitional regime where representations are poor for both surface form and meaning. We further demonstrate the effect of lexical influence on downstream uses of LLMs using summarization and model editing as case studies.

2606.02684 2026-06-05 cs.LG cs.AI cs.CL

Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation

Yuying Li, Leqi Zheng, Yongzi Yu, Wenrui Zhou, Xuchang Zhong, Xing Hu, Jing Jin, Hangjie Yuan, Tao Feng

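Affiliations: THU (Tsinghua University); HKUST (Hong Kong University of Science and Technology); BIT (Beijing Institute of Technology); Meituan; ZJU (Zhejiang University)

AI summary: For on-policy distillation, proposes FiRe-OPD, which combines trajectory-level filtering with token-level soft reweighting for finer-grained optimization and outperforms existing methods across several settings.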

Abstract

On-policy distillation (OPD) in large language models is shifting from full-trace KL supervision toward more selective training paradigms. Recent OPD methods increasingly focus on selecting which trajectories to learn from, which tokens are most informative, and which supervision signals are most reliable. Motivated by this trend, we rethink the optimization granularity of OPD and propose FiRe-OPD (Filter, then Reweight), which jointly adjusts supervision signals at both trajectory and token levels. In detail, FiRe-OPD first filters trajectories to remove low-quality rollout samples, and then applies soft reweighting within the retained trajectories to emphasize informative tokens. Compared with hard token selection, FiRe-OPD leverages a soft-weighting mechanism to effectively mitigate information loss and enhance optimization stability, thereby achieving finer-grained OPD optimization. We validate the effectiveness of FiRe-OPD across strong-to-weak, single-teacher, and multi-teacher settings, and demonstrate its superiority over recent token-level OPD methods (e.g., +6.25 on AIME 2024 in strong-to-weak, +18.81 on Miner in multi-teacher). Our code is available at https://github.com/YuYingLi0/FiRe-OPD.
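The two-stage idea, filtering at the trajectory level and then soft reweighting at the token level, can be sketched as below. The abstract does not state the filtering criterion or the weighting function, so everything here is an assumption for illustration: a scalar `score` per trajectory with a threshold, and softmax weights over a per-token signal, rescaled to average 1. None of these names come from the FiRe-OPD code.

```python
import math


def filter_trajectories(trajectories, min_score):
    """Stage 1 (assumed criterion): keep trajectories whose quality score
    reaches a threshold; low-quality rollouts are dropped entirely."""
    return [t for t in trajectories if t["score"] >= min_score]


def soft_token_weights(token_signals, temperature=1.0):
    """Stage 2 (assumed form): softmax over a per-token informativeness
    signal, rescaled so the weights average 1. Unlike hard selection,
    every token keeps a strictly positive weight."""
    peak = max(token_signals)
    exps = [math.exp((s - peak) / temperature) for s in token_signals]
    total = sum(exps)
    n = len(token_signals)
    return [n * e / total for e in exps]


def weighted_loss(token_losses, weights):
    """Mean of per-token losses after soft reweighting."""
    return sum(w * l for w, l in zip(weights, token_losses)) / len(token_losses)


trajs = [{"score": 0.9, "signals": [0.1, 2.0, 0.3]},
         {"score": 0.2, "signals": [1.0, 1.0, 1.0]}]
kept = filter_trajectories(trajs, min_score=0.5)
weights = soft_token_weights(kept[0]["signals"])
```

With uniform signals the weights all equal 1, so the sketch reduces to the unweighted mean loss; a lower temperature concentrates weight on the highest-signal tokens.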

2606.02031 2026-06-05 cs.LG cs.AI cs.CL cs.CV

OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents

Rui Yang, Qianhui Wu, Yuxi Chen, Hao Bai, Wenlin Yao, Hao Cheng, Baolin Peng, Huan Zhang, Tong Zhang, Jianfeng Gao

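Affiliations: UIUC (University of Illinois Urbana-Champaign); Microsoft

AI summary: Proposes OpenWebRL, a framework that trains visual web agents on real websites with online multi-turn reinforcement learning; the 4B-parameter model reaches the open-source state of the art on benchmarks and is competitive with closed-source systems.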

Comments 36 pages, 11 figures

Abstract

Building capable visual web agents requires long-horizon reasoning, precise grounding, and robust interaction with dynamic real-world websites. Despite rapid progress, the strongest systems remain largely proprietary, while open agents still depend heavily on supervised post-training over large collections of curated web trajectories. This dependence creates a major scalability bottleneck: high-quality demonstrations are expensive to collect, and static datasets offer limited coverage of the diverse, ever-changing open web. Although online RL has shown promise for text-based agents, its potential for training visual web agents directly on live websites remains largely underexplored. In this paper, we introduce OpenWebRL, an open framework for training visual web agents with online multi-turn RL on real websites. OpenWebRL covers the full training pipeline, including scalable live-browser infrastructure, supervised initialization, multimodal context management, trajectory-level success judging, and efficient multi-turn policy optimization. Using this framework, we train OpenWebRL-4B, which establishes a new open-source state of the art on challenging live-web benchmarks. With only 0.4K initialization trajectories and 2.2K open-ended RL training tasks, OpenWebRL-4B achieves 67.0% success on Online-Mind2Web and 64.0% on DeepShop, outperforming prior open agents of similar or larger scale and remaining competitive with proprietary systems including OpenAI CUA and Gemini CUA. Beyond strong benchmark performance, we systematically study the key design choices that make online RL effective for visual web agents, and analyze how RL improves agentic reasoning. Overall, our work offers a practical path toward building more capable, reproducible, and cost-efficient open web agents. We will release our training data, models, and code to support future research.

2606.01935 2026-06-05 cs.CV

R^3: Reasoning-guided Recalling and Reranking for Composed Video Retrieval

Zixu Li, Yupeng Hu, Zhiheng Fu, Zhiwei Chen, Weili Guan, Liqiang Nie

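Affiliations: Shandong University; Harbin Institute of Technology (Shenzhen)

AI summary: Proposes R^3, a zero-shot composed video retrieval pipeline that augments the query representation with a generated reasoning trace and adds a reranking stage to verify candidate videos, addressing retrieval from a source video combined with an edit instruction.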

Abstract

The CoVR-R challenge evaluates composed video retrieval, where a system must retrieve a target video from a large gallery given a reference video and a textual edit instruction. This setting is not a standard video-text retrieval problem: the query is defined by both the visual evidence in the source video and the transformation implied by the edit. A strong embedding model can provide scalable candidate recall, but it may under-express target-side consequences such as state changes, action replacement, object preservation, or temporal consistency. A pairwise multimodal reranker can verify such details more directly, but exhaustive reranking over the full gallery is computationally infeasible. We present $\mathbb{R}^3$, a zero-shot composed video retrieval pipeline built around Reasoning-guided Recalling and Reranking. The core idea is to turn the source-edit query into a reasoning-grounded retrieval program rather than treating the edit text as a short caption. First, the model generates a reasoning trace that describes the expected target video after applying the edit. Then the trace is encoded together with the source video as a reasoning-augmented query, and its retrieval score is fused with the base composed query through an agreement-gated residual rule. Finally, a reranker verifies the recalled candidates with direct source-candidate comparison. Experiments demonstrate the effectiveness of our method in addressing this challenge. Code is available at https://github.com/Lee-zixu/R-3.
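The recall-then-rerank structure can be illustrated with a toy pipeline. The abstract names an "agreement-gated residual rule" without defining it, so the gate below is one possible reading and purely an assumption: the reasoning-augmented score is added as a residual only for candidates that appear in the top-k of both the base and the reasoning-augmented rankings. The function names, `alpha`, and the toy scores are hypothetical.

```python
def top_k(scores, k):
    """Return the k candidate ids with the highest scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def fuse(base, reasoning, k, alpha=0.5):
    """Assumed agreement gate: add the reasoning score as a residual only
    where both queries rank the candidate in their top-k."""
    agree = set(top_k(base, k)) & set(top_k(reasoning, k))
    return {v: base[v] + (alpha * reasoning[v] if v in agree else 0.0)
            for v in base}


def recall_then_rerank(base, reasoning, rerank_fn, k):
    """Cheap fused recall over the whole gallery, then an expensive
    pairwise reranker applied to the k recalled candidates only."""
    fused = fuse(base, reasoning, k)
    candidates = top_k(fused, k)
    return sorted(candidates, key=rerank_fn, reverse=True)


# Toy similarity scores for a three-video gallery.
base = {"a": 0.9, "b": 0.8, "c": 0.1}
reasoning = {"a": 0.2, "b": 0.9, "c": 0.95}
rerank_scores = {"a": 1.0, "b": 0.5, "c": 2.0}
ranked = recall_then_rerank(base, reasoning, rerank_scores.get, k=2)
```

Restricting the reranker to the recalled set is what keeps the cost bounded: it scores k candidates rather than the full gallery.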

2606.00644 2026-06-05 cs.AI

ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgment

Qiuyu Tian, Haojie Yin, Yingce Xia, Youyong Kong, Zequn Liu

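Affiliations: Southeast University; Beijing Zhongguancun Academy; Duke Kunshan University

AI summary: Proposes the ForeSci benchmark, 500 temporally controlled tasks that evaluate whether LLM agents can make forward-looking research judgments from historical evidence, and identifies a decoupling between evidence and decisions.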

Abstract

AI research often requires decisions before future evidence exists: which bottleneck to attack, which direction to pursue, or where a project should be positioned. We introduce ForeSci, a temporally controlled benchmark for evaluating whether LLM agents can make such forward-looking research judgements from historical evidence. ForeSci contains 500 tasks across four fast-moving AI domains and four decision families. Each task is paired with a cutoff-aligned offline knowledge base; post-cutoff papers are hidden during generation and used only for validation. To avoid random future-event prediction, tasks are derived from pre-cutoff taxonomy branches and evidence signals, and answer-generation backbones are selected to precede the task cutoffs. We evaluate native LLMs, Hybrid RAG, and three research-agent adaptations across four backbones. Results show that explicit evidence organization improves traceability and factual support, but gains depend strongly on the decision family. Diagnostics reveal a recurring evidence-decision decoupling: agents may cite relevant evidence while forecasting the wrong research object. ForeSci turns forward-looking AI research judgement into a controlled benchmark for evaluating research agents as decision-making systems.
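The temporal control described above amounts to partitioning a corpus around each task's cutoff date. The sketch below shows that split under assumptions of my own: papers are dictionaries with a `published` date, and a paper dated exactly on the cutoff counts as visible. The field names and the function are hypothetical, not part of ForeSci.

```python
from datetime import date


def split_by_cutoff(papers, cutoff):
    """Partition papers into a generation-time knowledge base (on or before
    the cutoff) and a held-out set used only for validation (after it)."""
    visible = [p for p in papers if p["published"] <= cutoff]
    hidden = [p for p in papers if p["published"] > cutoff]
    return visible, hidden


papers = [
    {"id": "p1", "published": date(2024, 3, 1)},
    {"id": "p2", "published": date(2024, 9, 15)},
    {"id": "p3", "published": date(2025, 1, 10)},
]
knowledge_base, validation_only = split_by_cutoff(papers, cutoff=date(2024, 9, 15))
```

An agent would retrieve only from `knowledge_base`, while `validation_only` would be consulted after generation to check the forecast.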

2606.00616 2026-06-05 cs.CV cs.AI

Pause and Think: A Dataset and Benchmark for Video-Grounded Assistive Action Suggestion

Shivam Singh, Saptarshi Majumder, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum

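Affiliations: Advanced Micro Devices, Inc.

AI summary: Proposes the pause-and-think-T dataset and the pause-and-think-B benchmark; a compact model trained with reasoning supervision reaches performance comparable to much larger models on video scene understanding and goal planning.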

Abstract

Recent Vision-Language Models (VLMs) struggle with grounded reasoning, temporal consistency, and context-aware planning in videos. We introduce pause-and-think-T, a reasoning-centric training dataset that encourages models to pause, reason over visual evidence, and produce concise, actionable responses. The dataset promotes structured reasoning prior to answer generation, guiding models toward human-like, scene-grounded assistance. We fine-tune a compact 4B-parameter model and evaluate it on our pause-and-think-B benchmark targeting contextual understanding and goal planning tasks. The model achieves 58.0% accuracy at 59x fewer parameters than Qwen3-VL-235B (58.9%), matching GPT-5.2 on scene understanding and surpassing GPT-4o. Beyond our benchmark, it also shows strong out-of-distribution performance on EgoThink and TempCompass, with substantial gains in affordance, assistance, attribution recognition, situated reasoning, and temporal order, without benchmark-specific training. Our results indicate that targeted reasoning supervision enables compact models to deliver actionable, visually grounded guidance while generalizing beyond training data, without requiring large-scale model expansion.

2606.00522 2026-06-05 cs.CV

A Trajectory-Driven Spatio-Temporal Refinement Solution for CVPR 2026 8th UG2+ Challenge Track 3: DOST

Hongzhen Li, Miao Yu, Leilei Cao, Youwei Pan, Yingfang Zhu, Fengjie Zhu

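Affiliations: TEX AI, Transsion Holdings

AI summary: Building on the SegAnyMo framework, adds data-centric domain adaptation and a spatio-temporal post-processing module to improve dynamic object segmentation under severe atmospheric distortion, placing second in the challenge.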

Abstract

In this work, we present our solution for the 8th UG2+ Challenge (CVPR 2026) Track 3: Dynamic Object Segmentation in Turbulence (DOST). Our method is built upon the strong baseline framework Segment Any Motion (SegAnyMo), which provides powerful mask generation and motion tracking capabilities. To further boost the segmentation performance under severe atmospheric distortions, we propose two key improvements. First, we employ a data-centric domain adaptation strategy. We significantly expand our training data by incorporating selected sequences from the DAVIS dataset alongside a subset of the DOST dataset, and apply simulated atmospheric fluctuation degradations to enhance the model's robustness against complex geometric distortions. Second, we introduce a spatio-temporal post-processing module. This refinement step effectively removes persistent boundary-connected false foregrounds and short-lived fragmented noise, while strictly preserving genuine small targets and maintaining original individual labels across frames. With these combined strategies, our proposed method ranks 2nd in the challenge.

2605.31278 2026-06-05 cs.AI cs.LG stat.ME

Industrializing Prediction-Powered Inference: The GLIDE Library for Reliable GenAI and Agentic Systems Evaluation

Grégoire Martinon, Ibrahim Merad, Mohammed Raki

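Affiliations: University of California, Berkeley; Google Research

AI summary: Proposes GLIDE, an open-source library that unifies several prediction-powered inference methods, provides unbiased estimates with valid confidence intervals, and substantially reduces human annotation cost.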

Comments 8 pages, Accepted to the ICML 2026 Workshop on Statistical Frameworks for Uncertainty in Agentic Systems, Seoul, South Korea, 2026

Abstract

Reliable evaluation of agentic systems requires unbiased estimates with valid uncertainty, but standard practice navigates between costly human annotation and biased LLM-as-judge proxies. Prediction-powered inference (PPI) combines both into debiased estimates with valid confidence intervals, yet its various methods remain scattered across papers under partial implementations. We introduce GLIDE, an open-source Python library that unifies state-of-the-art PPI estimators (PPI++, Stratified PPI, Predict-Then-Debias and its stratified variants, Active Statistical Inference) and samplers (uniform, stratified, active, cost-optimal) under a scipy-style API specialized to mean estimation. GLIDE ships with a reproducible Monte Carlo validation suite, an empirically grounded decision tree for method selection, and an agentic evaluation case study showing substantial annotation savings at equivalent precision. The GLIDE package is available at this URL: https://github.com/EmertonData/glide
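For readers unfamiliar with PPI, the basic mean estimator combines many cheap proxy scores with a small labeled sample: the proxy mean on unlabeled data is corrected by the average proxy error ("rectifier") measured on the labeled data. The sketch below implements that classical estimator with a normal-approximation confidence interval using only the Python standard library. It does not use or reproduce GLIDE's API; the function name and arguments are hypothetical.

```python
from statistics import NormalDist, fmean, variance


def ppi_mean(y_labeled, pred_labeled, pred_unlabeled, alpha=0.05):
    """Classical prediction-powered estimate of a mean.

    y_labeled:      human labels on the small labeled set
    pred_labeled:   proxy (e.g. LLM-judge) scores on the same labeled items
    pred_unlabeled: proxy scores on the large unlabeled set
    """
    n, big_n = len(y_labeled), len(pred_unlabeled)
    # Rectifier: how far the proxy is from the human label, item by item.
    rectifier = [y - f for y, f in zip(y_labeled, pred_labeled)]
    estimate = fmean(pred_unlabeled) + fmean(rectifier)
    # The two samples are independent, so their variances add.
    std_err = (variance(pred_unlabeled) / big_n + variance(rectifier) / n) ** 0.5
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate, (estimate - z * std_err, estimate + z * std_err)


y = [1, 0, 1, 1]
judge_on_labeled = [0.9, 0.2, 0.8, 0.7]
judge_on_unlabeled = [0.5, 0.6, 0.7, 0.8, 0.9, 0.4]
estimate, interval = ppi_mean(y, judge_on_labeled, judge_on_unlabeled)
```

When the proxy tracks the human labels closely, the rectifier has low variance and the interval is driven mostly by the large unlabeled set, which is where the annotation savings come from.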

2605.30819 2026-06-05 cs.CV cs.GR

Function2Scene: 3D Indoor Scene Layout from Functional Specifications

Ruiqi Wang, Qimin Chen, Daniel Ritchie, Angel X. Chang, Manolis Savva, Kai Wang, Hao Zhang

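Affiliations: Simon Fraser University; Brown University

AI summary: Proposes Function2Scene, a framework that parses user personas and activities from natural-language design briefs, derives functional constraints from a taxonomy of 17 criteria to generate layouts, and refines them through an iterative check-and-repair loop using an LLM and a VLM; its results are preferred over baselines in 94.3% of pairwise comparisons on 30 professional cases.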

Comments project page: https://function2scene.github.io/

Abstract

Most text-driven 3D indoor scene synthesis methods generate rooms from object-centric prompts, asking what furniture should be placed rather than how the space is used. Yet in real interior design, a layout is judged by how well it supports its occupants, e.g., their activities and physical needs. We introduce Function2Scene, a framework for generating 3D indoor layouts from functional specifications, i.e., natural-language design briefs describing who will use a room and what they need to do there. Given such a specification, our system parses occupant personas and activities, derives a customized set of functional design constraints from a taxonomy of 17 criteria spanning spatial, ergonomic, activity, and environmental considerations, and uses these constraints to guide layout generation. Rather than relying on an LLM to directly produce a final scene, Function2Scene performs iterative evaluation and refinement through a tool-augmented check-and-repair loop, combining geometric measurements, LLM-based contextual reasoning, and VLM-based visual assessment. Experiments on 30 professionally written interior-design cases show that Function2Scene produces layouts that better satisfy functional requirements than recent LLM-based scene synthesis baselines, with our results preferred in 94.3% of pairwise comparisons. Our work reframes text-driven indoor scene synthesis from placing plausible objects to designing spaces that support human use.
