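arXivDaily daily academic digest: synchronized with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and related fields.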

2606.01196 2026-06-02 cs.CL cs.AI

Low-Resource Safety Failures Are Action Failures, Not Representation Failures

低资源安全失败是行动失败，而非表征失败

Rashad Aziz, Ikhlasul Akmal Hanif, Fajri Koto

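Affiliations: Mohamed bin Zayed University of Artificial Intelligence

AI summary: This paper finds that safety-alignment failures in low-resource languages stem from miscalibrated decisions rather than missing representations; recalibrating a high-resource gate (a low-rank logistic readout with a reset threshold) markedly improves refusal selectivity.

Chinese abstract (AI translation)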

在高资源语言中学习的安全对齐在低资源语言中迁移效果不佳。模型能拒绝英文有害提示，但当相同提示翻译成斯瓦希里语或缅甸语时则无法拒绝。自适应引导方法如AdaSteer和CAST在跨语言中继承了这一失败。我们诊断了迁移失败的原因。在Qwen2.5-7B、Gemma-2-9B和Llama-3.1-8B模型上，针对23种语言，从高资源激活中提取的有害方向几乎能像高资源提示一样线性分离低资源有害与无害提示。相关表征存在。然而，有害拒绝率从87.9%下降到43.9%。模型未能将表征转化为拒绝。未能迁移的是安全决策的校准，而非底层表征。我们利用这一点，通过重校准而非重新训练高资源门控：一个低秩逻辑回归读出器，其决策阈值使用每类仅1到4个目标语言示例重置。该门控在拒绝引导和有害方向消融之间路由，将平均拒绝选择性（Δ = 有害 − 无害拒绝）从最强自适应基线的33.6显著提高到54.5，同时保持MMLU效用。这些结果表明，一些低资源安全失败可以通过重校准现有表征而非学习新表征来修复。我们的代码已发布：https://github.com/rashadaziz/low-resource-safety。

English abstract

Safety alignment learned in high-resource languages transfers poorly to low-resource languages. Models refuse harmful prompts in English but fail to refuse when the same prompts are translated into Swahili or Burmese. Adaptive steering methods like AdaSteer and CAST inherit this failure cross-lingually. We diagnose where transfer breaks down. Across Qwen2.5-7B, Gemma-2-9B, and Llama-3.1-8B on 23 languages, the harmfulness direction extracted from high-resource activations linearly separates harmful from harmless low-resource prompts nearly as well as high-resource ones. The relevant representation is present. Yet harmful refusal drops from 87.9% to 43.9%. The model fails to convert the representation into refusal. What fails to transfer is calibration of the safety decision, not the underlying representation. We exploit this by recalibrating, rather than retraining, a high-resource gate: a low-rank logistic readout with its decision threshold reset using as few as 1 to 4 target-language examples per class. The gate routes between refusal steering and harmfulness-direction ablation, substantially raising mean refusal selectivity ($Δ$ = harmful $-$ harmless refusal) from 33.6 for the strongest adapted baseline to 54.5 while preserving MMLU utility. These results suggest that some low-resource safety failures can be repaired by recalibrating existing representations rather than learning new ones. Our code is released: https://github.com/rashadaziz/low-resource-safety.
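The selectivity metric and the threshold reset described in this abstract can be illustrated with a minimal sketch. The metric follows the abstract's definition (Δ = harmful refusal rate minus harmless refusal rate). The threshold rule is an assumption made for illustration only: it places the threshold midway between the mean gate scores of the few labelled target-language examples, and the paper's actual rule may differ. The function names are hypothetical.

```python
def refusal_selectivity(harmful_refused, harmless_refused):
    """Delta = harmful refusal rate - harmless refusal rate, in percentage points."""
    harmful_rate = 100.0 * sum(harmful_refused) / len(harmful_refused)
    harmless_rate = 100.0 * sum(harmless_refused) / len(harmless_refused)
    return harmful_rate - harmless_rate


def reset_threshold(harmful_scores, harmless_scores):
    """Assumed rule (not from the paper): midpoint between the class-mean gate scores."""
    mean_harmful = sum(harmful_scores) / len(harmful_scores)
    mean_harmless = sum(harmless_scores) / len(harmless_scores)
    return 0.5 * (mean_harmful + mean_harmless)


def gate_says_harmful(score, threshold):
    """Route a prompt to refusal steering when its gate score exceeds the threshold."""
    return score > threshold
```

With only one to four examples per class, a rule of this kind has a single free parameter to estimate, which is one plausible reason such a small calibration set can suffice.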


2606.01192 2026-06-02 cs.CV

PairedGTA: Generating Driving Datasets for Controlled Photometric Shift Analysis

PairedGTA：用于受控光度偏移分析的驾驶数据集生成

Andrea Chianese, Giulio Rossolini, Alessandro Biondi, Marco Cococcioni, Giorgio Buttazzo

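Affiliations: Scuola Superiore Sant’Anna; Department of Excellence in Robotics & AI; University of Pisa

AI summary: Proposes PairedGTA, a framework built on a high-fidelity game engine that generates perfectly paired images, enabling analysis of photometric shift independent of geometric and semantic changes, and uses it to evaluate how semantic segmentation models degrade under adverse conditions.

Comments: Under review

Chinese abstract (AI translation)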

评估自动驾驶视觉感知系统的性能对于确保在不同环境场景下的可靠运行至关重要。理想情况下，要在不同恶劣条件下进行平衡和公平的分析，需要同一场景在不同天气或光照变化下的完美配对图像。这将允许独立于几何和语义变化来评估光度偏移的影响。不幸的是，真实世界数据集很少提供同一场景在不同环境条件下的图像，因为通常相机姿态、交通和动态物体（车辆、行人等）的位置随时间变化，因此只能提供粗略配对的数据。为了解决这一挑战，本工作引入了一种基于高保真游戏引擎的数据生成框架，用于提取完美配对的图像。通过利用与GTA游戏引擎通信的软件API，该框架在保持场景几何、相机姿态以及动态物体的身份和位置的同时，修改光照和天气条件。对于每个采样位置，它程序化地实例化动态实体，并在各种恶劣条件下渲染像素对齐的图像。通过在语义分割模型上的系统分析，展示了所提出的生成框架在驾驶场景中的优势，其输出退化可以更直接地归因于光度偏移，而不是不受控制的语义或几何因素。

English abstract

Evaluating the performance of visual perception systems for autonomous driving is essential to ensure reliable operation across diverse environmental scenarios. Ideally, a balanced and fair analysis across different adverse conditions would require perfectly paired images of the same scene under different weather or illumination changes. This would allow evaluating the effect of photometric shifts independently of geometry and semantic changes. Unfortunately, real-world datasets rarely provide images of the same scene under different environmental conditions, because, normally, camera pose, traffic, and locations of dynamic objects (vehicles, pedestrians, etc.) vary over time, thus yielding only coarsely paired data. To address this challenge, this work introduces a data generation framework based on a high-fidelity game engine for extracting perfectly paired images. By leveraging software APIs that communicate with the GTA game engine, the framework modifies illumination and weather conditions while preserving scene geometry, camera pose, and the identity and placement of dynamic objects. For each sampled location, it procedurally instantiates dynamic entities and renders pixel-aligned images under diverse adverse conditions. The benefit of the proposed generation framework in driving scenarios is demonstrated through a systematic analysis of semantic segmentation models, whose output degradation can be attributed more directly to photometric shifts rather than to uncontrolled semantic or geometric factors.


2606.01189 2026-06-02 cs.AI

The Case for Model Science: Verify, Explore, Steer, Refine

模型科学的案例：验证、探索、引导、改进

Przemyslaw Biecek, Luca Longo, Jianlong Zhou, Thomas Fel, Andreas Holzinger, Wojciech Samek

Affiliations: Center for Credible AI; University of Warsaw; Warsaw University of Technology; University College Cork; University of Technology Sydney; Kempner Institute, Harvard University; Human-Centered AI Lab; Technical University of Berlin; Fraunhofer Heinrich Hertz Institute; Berlin Institute for the Foundations of Learning and Data (BIFOLD)

AI summary: Argues that the AI community should move beyond benchmarking and establish Model Science, a systematic discipline of model analysis built on four functional perspectives (Verify, Explore, Steer, Refine), shared infrastructure, and in-depth case studies, in order to understand the behaviour of complex AI models.

Comments: Follow up on arXiv:2508.20040

Chinese abstract (AI translation)

我们认为，AI社区现在已经准备好超越基准测试，并将分散的模型分析工作整合成一个系统性的学科，我们称之为模型科学。复杂的AI模型现在服务于数十亿用户，但我们对它们工作原理的理解远远落后于部署它们的能力。几十年来以基准测试为导向的研究取得了显著进展：广泛的排行榜、各种性能指标、跨不同任务的能力提升追踪；然而，这种成功也揭示了基准测试的局限性，因为它们告诉我们模型是否表现良好，但不告诉我们为什么成功或失败，它们忽略了关键的失败模式，如幻觉或捷径。来自成熟科学的先例指明了前进的方向：认知科学表明，理解复杂系统需要互补的分析层次；神经科学证明，对单个案例的深入研究揭示了群体研究遗漏的东西；医学教导我们，专业培训必须与研究实践同步发展；农业模型展示了共享基础设施和原则如何实现累积进展。这些经验为模型科学提供了三个基础。首先，我们建议围绕四个功能视角整合研究：验证、探索、引导和改进，这些视角解决了关于模型行为的互补问题。其次，我们讨论了累积知识所需的基础设施：数据集、模型和发现的目录。第三，我们强调需要对单个模型实例进行深入分析，而不仅仅是模型家族，因为单个案例可以揭示群体研究遗漏的东西。

English abstract

We argue that the AI community is now ready to move beyond benchmarking and consolidate scattered efforts in model analysis into a systematic discipline, a direction we term Model Science. Complex AI models now serve billions of users, yet our understanding of how they work lags far behind our ability to deploy them. Decades of benchmark-driven research have delivered remarkable progress: extensive leaderboards, a wide range of performance metrics, and tracking of capability gains across diverse tasks. Yet this success has also revealed the limits of benchmarks: they tell us whether models perform but not why they succeed or fail, and they miss critical failure modes such as hallucinations or shortcuts. Precedents from established sciences point the way forward: cognitive science shows that understanding complex systems requires complementary levels of analysis; neuroscience demonstrates that deep study of single cases reveals what population studies miss; medicine teaches that specialised training must develop alongside research practice; and agriculture models how shared infrastructure and principles enable cumulative progress. These lessons inform three foundations for Model Science. First, we propose to consolidate research around four functional perspectives, Verify, Explore, Steer, and Refine, that address complementary questions about model behaviour. Second, we discuss the required infrastructure for cumulative knowledge: catalogues of datasets, models and findings. Third, we highlight the need for deep analysis of individual model instances, not just model families, because single cases can reveal what population studies miss.


2606.01185 2026-06-02 cs.AI

"Skill issues": data-centric optimization of lakehouse agents

技能问题：湖仓代理的数据中心优化

Nicole Rose Schneider, Davide Ghilardi, Giacomo Piccinini, Jacopo Tagliabue

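Affiliations: University of Maryland; Università Milano Bicocca; Bauplan Labs

AI summary: For coding agents on the branching lakehouse Bauplan, proposes a data-centric optimization pipeline that generates task-verifier pairs, executes candidate skills in isolated sandboxes, and scores them with trace signals and programmatic checks, improving accuracy by 31.9%.

Chinese abstract (AI translation)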

编码代理正在成为数据基础设施的用户，但它们的成功不仅取决于模型质量：还取决于教导代理如何使用系统的技能和环境文件。我们研究如何为在分支湖仓Bauplan上操作的代理优化这些工件。在我们的设置中，无头API和类似Git的数据原语通过代码、分支、提交和合并暴露数据工作流。我们的核心观察是，分支湖仓将数据代理评估从输出匹配问题转变为状态验证问题：代理生成的管道代码会引发具体的、可检查的湖仓变化。我们提出了一个数据中心优化流程，生成任务验证器对，在隔离沙箱中执行候选技能，并使用追踪级信号和湖仓状态的程序化检查对轨迹进行评分。在25个任务的初步评估中，优化后的技能将准确率提升了31.9%。这些结果表明，写路径数据工作流为优化代理技能提供了有用的基础，超越了只读任务。

English abstract

Coding agents are becoming users of data infrastructure, but their success depends not only on model quality: it also depends on the skills and environment files that teach agents how to use a system. We study how to optimize these artifacts for agents operating on a branching lakehouse, Bauplan. In our setting, headless APIs and Git-like data primitives expose data workflows through code, branches, commits, and merges. Our central observation is that a branching lakehouse turns data-agent evaluation from an output-matching problem into a state-verification problem: agent-generated pipeline code induces concrete, inspectable lakehouse changes. We present a data-centric optimization pipeline that generates task-verifier pairs, executes candidate skills in isolated sandboxes, and scores trajectories using both trace-level signals and programmatic checks over lakehouse state. In a preliminary evaluation on 25 tasks, optimized skills improve accuracy by 31.9%. These results suggest that write-path data workflows provide a useful substrate for optimizing agent skills beyond read-only tasks.


2606.01182 2026-06-02 cs.CL cs.AI

CA-BED: Conversation-Aware Bayesian Experimental Design

CA-BED：对话感知的贝叶斯实验设计

Daniel Arnould, Rashad Aziz, Zixuan Kang, Tanav Changal, Kevin Zhu, Sunishchal Dev, Gabriel Grand, Shreyas Sunil Kulkarni

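Affiliations: University of California, Berkeley; University of Washington; University of Toronto

AI summary: Proposes Conversation-Aware Bayesian Experimental Design (CA-BED), an inference-time probabilistic dialog planning framework that combines Bayesian experimental design with LLM likelihood estimation to optimize question selection over multiple turns; on structured entity-deduction benchmarks it improves success rates by 21.8% on average while adding only 1.8 conversational turns.

Comments: Reliable Autonomy Workshop at ICLR 2026

Chinese abstract (AI translation)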

大型语言模型（LLM）在静态推理任务中表现出色，但在需要通过提问主动获取信息的交互场景中，其性能往往会下降。一个关键挑战在于选择能够减少不确定性同时纳入可能模糊或仅部分信息性的回应的问题。为了解决这个问题，我们提出了对话感知的贝叶斯实验设计（CA-BED），一种推理时概率对话规划框架，它将贝叶斯实验设计与基于LLM的似然估计相结合，以在多个对话轮次中优化问题选择。CA-BED维护关于假设的信念分布，预测可能的答案，并通过模拟对话树传播期望信息增益。在两个结构化实体推断基准上，CA-BED相比直接提示实现了平均21.8%的成功率提升，相对于其他信息寻求方法也有相当的增益。与直接提示相比，它仅平均增加了1.8个对话轮次就实现了这些增益。

English abstract

Large Language Models (LLMs) excel at static reasoning tasks, yet their performance often degrades in interactive scenarios where information must be actively acquired through questioning. A key challenge lies in selecting questions that reduce uncertainty while incorporating responses that may be ambiguous or only partially informative. To address this, we propose Conversation-Aware Bayesian Experimental Design (CA-BED), an inference-time probabilistic dialog planning framework that integrates Bayesian Experimental Design with LLM-based likelihood estimation to optimize question selection over multiple conversational turns. CA-BED maintains a belief distribution over hypotheses, anticipates possible answers, and propagates expected information gain through a simulated conversation tree. Across two structured entity-deduction benchmarks, CA-BED yields an average 21.8% improvement in success rates over direct prompting, with comparable gains relative to alternative information-seeking methods. It achieves these gains with an average increase of only 1.8 conversational turns compared to direct prompting.
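The expected information gain that this abstract refers to can be sketched for a single question as follows. This is a one-turn illustration under stated assumptions, not CA-BED itself: the likelihood table `likelihoods[a][h]` (probability of answer `a` under hypothesis `h`) is supplied by hand here, whereas the abstract describes estimating it with an LLM and propagating the gain through a simulated conversation tree. All function names are hypothetical.

```python
import math


def entropy(dist):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


def expected_information_gain(belief, likelihoods):
    """Expected entropy reduction over hypotheses from asking one question.

    belief[h]         : prior probability of hypothesis h
    likelihoods[a][h] : P(answer a | hypothesis h), assumed given
    """
    prior_entropy = entropy(belief)
    gain = 0.0
    for answer_likelihood in likelihoods:
        joint = [b * l for b, l in zip(belief, answer_likelihood)]
        p_answer = sum(joint)
        if p_answer == 0:
            continue  # this answer can never occur under the current belief
        posterior = [j / p_answer for j in joint]  # Bayes update
        gain += p_answer * (prior_entropy - entropy(posterior))
    return gain


def best_question(belief, questions):
    """Pick the question (name -> likelihood table) with the largest gain."""
    return max(questions, key=lambda q: expected_information_gain(belief, questions[q]))
```

A question that deterministically halves four equally likely hypotheses yields one bit of gain, while a question whose answer is independent of the hypothesis yields none, which is why the second kind is never selected.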


2606.01179 2026-06-02 cs.LG cs.AI

Physics-Informed Deep Learning for Entropy Prediction in Heterogeneous Systems: Thermodynamic and Information-Theoretic Case Studies

异质系统中熵预测的物理信息深度学习：热力学与信息论案例研究

Biswajeet Sahoo, Debadutta Patra

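Affiliations: Durham University; Department of Chemical Engineering; Veer Surendra Sai University of Technology

AI summary: Proposes a unified physics-informed deep learning framework that uses differential-equation residuals and information-theoretic constraints to predict entropy for both thermodynamic and information-theoretic systems within a single neural network, and validates its data efficiency and physical consistency.

Chinese abstract (AI translation)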

熵产生支配着物理和信息论系统中的不可逆性和不确定性。尽管物理信息神经网络（PINNs）成功求解微分方程，但当前架构本质上仍是领域特定的。跨根本不同物理定律的领域不变熵表示的提取尚未探索。本文引入了一个统一的物理信息深度学习（PIDL）框架，该框架在单一神经架构中同时强制执行微分方程残差和信息论界限。我们通过两个经典研究来展示该框架：（i）一个热力学连续搅拌釜反应器（CSTR）模型，求解控制常微分方程，其中Softplus约束严格强制执行热力学第二定律；（ii）一个信息论金融市场模型，求解逆Fokker-Planck偏微分方程以推断潜在漂移和扩散系数，通过Softplus约束保证扩散正性，同时自然诱导香农熵。评估了三种模型变体：两个特定领域基线和一种共享编码器架构。PIDL框架保证了绝对的热力学可接受性，零违反第二定律，并表现出卓越的数据效率，仅使用30%的可用训练数据即可保持>90%的预测精度。此外，对学习到的熵表面的事后Ruppeiner黎曼几何分析成功识别了热力学相不稳定性。该方法为物理约束熵建模提供了一个稳健、领域无关的架构，推动了可持续过程设计和定量金融风险评估的应用。

English abstract

Entropy production governs irreversibility and uncertainty in both physical and information-theoretic systems. While Physics-Informed Neural Networks (PINNs) successfully solve differential equations, current architectures remain inherently domain-specific. The extraction of domain-invariant entropy representations across fundamentally different physical laws remains unexplored. This paper introduces a unified Physics-Informed Deep Learning (PIDL) framework that simultaneously enforces differential equation residuals and information-theoretic bounds within a single neural architecture. We demonstrate this framework via two canonical studies: (i) a thermodynamic continuous stirred-tank reactor (CSTR) model solving governing ODEs, where a Softplus constraint strictly enforces the Second Law of Thermodynamics; and (ii) an information-theoretic financial market model solving the inverse Fokker-Planck PDE to infer latent drift and diffusion coefficients, guaranteeing diffusion positivity via a Softplus constraint while naturally inducing Shannon entropy. Three model variants are evaluated: two domain-specific baselines and one shared-encoder architecture. The PIDL framework guarantees absolute thermodynamic admissibility with zero Second-Law violations and exhibits exceptional data efficiency, retaining >90% predictive accuracy using merely 30% of available training data. Furthermore, a post-hoc Ruppeiner Riemannian geometric analysis of the learned entropy surface successfully identifies thermodynamic phase instabilities. This methodology provides a robust, domain-agnostic architecture for physics-constrained entropy modeling, advancing applications in sustainable process design and quantitative financial risk assessment.


2606.01176 2026-06-02 cs.LG

Temporal Motif Signatures for Temporal Graph Neural Networks

时序图神经网络的时序模体特征

Dylan Sandfelder, Mihai Cucuringu, Xiaowen Dong

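Affiliations: University of Oxford; University of California, Los Angeles

AI summary: To address the difficulty temporal graph neural networks have in capturing short-horizon motif patterns, proposes a compact 13-dimensional motif feature map that linearly embeds into any static or temporal encoder and improves performance across several tasks.

Chinese abstract (AI translation)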

真实时序交互流在短时模体模式（重复、互惠、星型多样性、三元组流）中蕴含预测结构，而普通的时序图神经网络（TGNN）通常无法将其暴露给边评分器。我们在MOOC交互预测中具体展示了这一点：一个由过去窗口星型计数组成的小型四特征族已经提供了相对于强静态GNN的大部分提升。在广泛的实际和合成时序数据集中，我们发现模体活动沿着三个尺度稳定的轴（二元近因/互惠、星型多样性、三元组流）一致地组织，并利用这一经验结构设计了一个紧凑的13维、防泄漏、候选局部模体特征图h(u, v, t)，该特征图可线性嵌入任何静态或时序编码器，无需改变架构。时序Weisfeiler-Leman（WL）分析将该增强置于锚定时序WL层次的第一级，并展示了一个候选锚定对，模体特征在该对上具有区分性。我们通过实验证明，相同的增强在异构任务上一致地提升了性能：TGB链路属性预测在所有五个基线上，Bitcoin Alpha/OTC和MOOC上的边分类，以及合成时序生成器的图级分类。

English abstract

Real temporal interaction streams carry predictive structure in short-horizon motif patterns -- repetition, reciprocity, star diversity, triadic flow -- that vanilla temporal graph neural networks (TGNNs) often fail to expose to their edge scorers. We show this concretely on MOOC interaction prediction, where a small four-feature family of past-window star counts already delivers most of the lift over a strong static GNN. Across a wide set of real and synthetic temporal datasets we find that motif activity organizes consistently along three scale-stable axes (dyadic recency/reciprocity, star diversity, triadic flow), and we use this empirical structure to design a compact 13-coordinate, leakage-safe, candidate-local motif feature map h(u, v, t) that linearly embeds into any static or temporal encoder without architectural changes. A temporal Weisfeiler-Leman (WL) analysis places the augmentation relative to the first level of an anchored temporal-WL hierarchy and exhibits a candidate-anchored pair on which motif features distinguish. We demonstrate empirically that the same augmentation consistently lifts performance across heterogeneous tasks: TGB link-property prediction across all five baselines, edge classification on Bitcoin Alpha/OTC and MOOC, and graph-level classification of synthetic temporal generators.
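The idea of a leakage-safe, candidate-local feature map h(u, v, t) can be illustrated with a minimal sketch. The three features below (repetition, reciprocity, and star diversity) are illustrative stand-ins chosen to match the pattern names in the abstract; they are not the paper's 13 coordinates, and the function name and window convention are assumptions. Leakage safety is enforced here by counting only events strictly before the query time t.

```python
def motif_features(events, u, v, t, window):
    """Past-window motif counts for the candidate edge (u, v) at time t.

    events : iterable of (src, dst, timestamp) tuples
    Only events with t - window <= timestamp < t are used, so the candidate
    edge itself and anything later cannot leak into the features.
    """
    repetition = 0          # earlier u -> v interactions
    reciprocity = 0         # earlier v -> u interactions
    out_neighbours = set()  # distinct targets of u (star diversity)
    for src, dst, ts in events:
        if not (t - window <= ts < t):
            continue
        if src == u and dst == v:
            repetition += 1
        if src == v and dst == u:
            reciprocity += 1
        if src == u:
            out_neighbours.add(dst)
    return [repetition, reciprocity, len(out_neighbours)]
```

A vector of this kind could be concatenated with an encoder's edge representation before scoring, which is one way to read the abstract's claim that no architectural change is required.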


2606.01173 2026-06-02 cs.CV

Reusing Fusion-Time Spectral Reliability for Adaptive Fusion and Expert Routing in RGB-Infrared Object Detection

复用融合时频谱可靠性用于RGB-红外目标检测的自适应融合与专家路由

Yefeng Wu

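Affiliations: Tsinghua University

AI summary: Proposes a parameter-free 7-dimensional spectral reliability descriptor that, through Spectral Reliability Fusion and Reliability-Conditioned Expert Routing, improves RGB-infrared object detection under degraded conditions.

Chinese abstract (AI translation)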

RGB-红外检测器通常会丢弃跨模态融合过程中产生的统计信息，使得下游模块无法知晓当前交互是否可靠。我们提出提取一个无参数的7维频谱可靠性描述符——汇总频带能量、幅度比、相位一致性和跨模态相关性——并在融合阶段之外复用该描述符。该描述符驱动频谱可靠性融合（SRF），它将频谱残差与保守的空间基进行门控，以及可靠性条件专家路由（RCER），它将描述符与池化内容结合以引导稀疏的后融合专家。在匹配消融实验下，描述符感知门控相比仅内容自适应门控提高了mAP50；一个2×2因子分析进一步表明，在参数数量几乎相等的情况下，描述符条件路由相比仅专家架构提供了更大的边际增益。在DroneVehicle上的六种合成退化条件下，平均保留率提升至95.0%，而仅内容MoE为92.0%，拼接为87.9%，在模态缺失下增益最大；同一模型在自然白天/黑夜分割上也分别提高了+5.2/+5.3的mAP50。这些结果表明，将融合时可靠性作为显式信号保留有利于自适应融合和融合后条件计算。

English abstract

RGB-infrared detectors typically discard the statistics generated during cross-modal fusion, leaving downstream modules unaware of whether the current interaction is reliable. We propose to extract a parameter-free, 7-dimensional spectral reliability descriptor -- summarizing band energy, amplitude ratio, phase consistency, and cross-modal correlation -- and to reuse it beyond the fusion stage. The descriptor drives both Spectral Reliability Fusion (SRF), which gates a spectral residual against a conservative spatial base, and Reliability-Conditioned Expert Routing (RCER), which combines the descriptor with pooled content to steer sparse post-fusion experts. Under matched ablations, descriptor-aware gating improves mAP50 over content-only adaptive gating; a $2{\times}2$ factorial analysis further shows that descriptor-conditioned routing provides the larger marginal gain over expert architecture alone at near-equal parameter count. Under six synthetic degradations on DroneVehicle, average retention rises to 95.0%, versus 92.0% for content-only MoE and 87.9% for concatenation, with the largest gain under modality drop; the same model also improves mAP50 by +5.2/+5.3 on the natural day/night split. These results suggest that preserving fusion-time reliability as an explicit signal benefits both adaptive fusion and post-fusion conditional computation.


2606.01168 2026-06-02 cs.CL

Thinking Economically: A Hierarchical Framework for Adaptive-Complexity Reasoning in LLMs

经济思维：面向LLM自适应复杂度推理的分层框架

Yubo Gao, Haotian Wu, Hong Chen, Junquan Huang, Yibo Yan, Jungang Li, Zihao Dongfang, Sicheng Tao, Puay Siew Tan, Jie Zhang, Xuming Hu

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; Nanyang Technological University; Singapore Institute of Manufacturing Technology, A*STAR

AI summary: To address overthinking in LLM reasoning, proposes the Hierarchical Adaptive Budgeter (HAB) framework, which allocates compute efficiently through coarse-to-fine budgeting and both raises accuracy and reduces token usage on GSM8K and MATH500.

Comments: 11 pages, 4 figures, 3 tables

Chinese abstract (AI translation)

思维链（CoT）显著增强了LLM的推理能力，但常常因“过度思考”而产生大量计算开销：生成过长的推理过程却没有相应的精度提升。现有的效率方法通常采用统一压缩，忽略了推理复杂度在两个不同粒度上具有异质性这一关键观察：不同问题之间以及单个推理步骤内部。这激发了我们“经济思维”的原则：根据内在任务和步骤需求智能分配计算资源，而非追求统一的简洁性。我们提出分层自适应预算器（HAB），一个通过从粗到细的预算分配来实现该原则的训练框架。在步骤间层面，HAB预测每个问题的最优推理深度。在步骤内层面，HAB从基于困惑度的步骤比较和自适应帕累托优化目标中学习步骤特定的token预算信号，该目标捕捉局部质量-效率权衡，同时基于Fisher信息的剪枝器进一步提供细粒度的训练时指导，从而鼓励生成器内化更经济的推理模式。在GSM8K和MATH500上的实验表明，HAB不仅在准确率上超越了标准CoT，还减少了token使用量，实现了比对比基线更强的性能-效率权衡。

English abstract

Chain-of-Thought (CoT) has significantly enhanced LLM reasoning, yet often incurs substantial computational overhead due to "overthinking": generating excessively long rationales without commensurate accuracy gains. Existing efficiency methods typically apply uniform compression, which overlooks a critical observation that reasoning complexity is heterogeneous at two distinct granularities: across different problems and within individual reasoning steps. This motivates our principle of Thinking Economically: intelligently allocating computational resources based on intrinsic task and step demands rather than pursuing uniform brevity. We propose Hierarchical Adaptive Budgeter (HAB), a training framework that operationalizes this principle through coarse-to-fine budgeting. At the inter-step level, HAB predicts the optimal reasoning depth for each problem. At the intra-step level, HAB learns step-specific token budgeting signals from PPL-derived step comparisons and an adaptive Pareto optimization objective that captures the local quality-efficiency trade-off, while a Fisher Information-based pruner further provides fine-grained training-time guidance, thereby encouraging the generator to internalize more economical reasoning patterns. Experiments on GSM8K and MATH500 show that HAB not only surpasses standard CoT in accuracy but also reduces token usage, achieving a stronger performance-efficiency trade-off than the compared baselines.


2606.01164 2026-06-02 cs.CV

Towards Interactive Video World Modeling: Frontiers, Challenges, Benchmarks, and Future Trends

迈向交互式视频世界建模：前沿、挑战、基准与未来趋势

Jiuming Liu, Chaojun Ni, Mengmeng Liu, Chensheng Peng, Fangjinhua Wang, Sitian Shen, Marc Pollefeys, Masayoshi Tomizuka, Ayush Tewari, Per Ola Kristensson

Affiliations: Department of Engineering, University of Cambridge, U.K.; Peking University; University of Twente; Mechanical Systems Control Laboratory, University of California, Berkeley, USA; ETH Zurich; Microsoft; University of Oxford

AI summary: Systematically surveys research trends, technical challenges, and evaluation benchmarks in interactive world modeling and proposes future directions, focusing on action-conditioned controllability, long-horizon interaction and memory, and real-time responsiveness.

Comments: Under review. The GitHub repository is publicly available at: https://github.com/liujiuming123/Awesome-Interactive-World-Model

Chinese abstract (AI translation)

随着大语言模型和基于扩散的内容生成的快速发展，世界建模引起了越来越多的研究关注，惠及游戏引擎、具身人工智能、自动驾驶等多个下游领域。通过将用户动作明确纳入世界状态转换，最近的文献在动作条件视频或3D生成范式中赋予了世界建模交互性，进一步增强了世界演化的可控性，并促进用户自由遍历、操纵、导航和个性化状态演化。本文旨在系统回顾交互式世界建模的最新研究趋势、技术发展、评估基准，并提出未来潜在方向。具体而言，我们首先总结了在应用场景、世界状态演化和场景模态方面的近期工作和趋势。随后，我们深入探讨三个关键的技术挑战，包括动作条件可控性、长程交互与记忆，以及实时交互的动作跟随响应性。此外，我们还全面比较了四个特定应用领域（开放世界探索、游戏引擎、自动驾驶和机器人）中的现有基准和指标。最后，我们讨论了实现下一代交互式世界建模的几个有前景的未来方向。相应的代码库已公开在：https://github.com/liujiuming123/Awesome-Interactive-World-Model。

English abstract

With rapid development of large language models and diffusion-based content generation, world modeling has attracted increasing research attention, benefiting various downstream domains such as game engines, embodied AI, autonomous driving, etc. Through explicitly incorporating user actions into world state transition, recent literature empowers world modeling with interactivity in an action-conditioned video or 3D generation paradigm, further enhancing controllability over world evolutions and facilitating users to freely traverse, manipulate, navigate, and personalize the state evolution. In this paper, we aim to systematically review recent research trends, technical developments, evaluation benchmarks, and also propose future potential directions in interactive world modeling. Specifically, we first summarize recent efforts and trends in terms of application scenarios, world state evolution, and scene modality. Afterwards, we delve into three crucial technical challenges, including action-conditioned controllability, long-horizon interactions and memory, and action-following responsiveness for real-time interactivity. Furthermore, we also thoroughly compare existing benchmarks and metrics in four specific application fields: open-world exploration, game engine, autonomous driving, and robotics. Finally, we discuss several promising future directions in achieving next-generation interactive world modeling. The corresponding repository is publicly available at: https://github.com/liujiuming123/Awesome-Interactive-World-Model.


2606.01160 2026-06-02 cs.AI

Expected Value Alignment for Generative Reward Modeling in Formal Mathematics Verification

形式数学验证中生成式奖励建模的期望值对齐

Shihao Ji, Haotao Tan, Zihui Song, Mingyu Li

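Affiliations: GitHub

AI summary: Proposes Expected Value Alignment (EVA), which extracts continuous scores from the model's token distribution while keeping the generative reward model's output discrete, for Lean 4 formal verification.

Chinese abstract (AI translation)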

大型语言模型（LLMs）越来越多地与形式化交互式定理证明器（如Lean 4）一起使用。通过强化学习或搜索方法扩展这些系统需要能够评估中间推理步骤的过程奖励模型（PRMs）。现有的奖励模型设计暴露了一个实际的权衡。值头模型提供连续分数但修改了生成模型接口，而生成式奖励模型保留了文本理由但难以匹配连续浮点回归，因为数值被分割到多个词元上。我们引入了期望值对齐（EVA），一种奖励建模过程，它保持表面输出离散，同时从模型的词元分布中提取连续分数。模型以结构化的JSON格式输出整数分数，EVA计算对应锚定词元logits的期望值作为连续分数。训练结合了因果语言建模目标与这些期望值的辅助均方误差损失。我们在 \textit{Leibniz} 中实例化EVA，这是一个用于Lean 4形式验证的奖励模型，并针对零样本和奖励建模基线进行了评估。评估表明，基于logits的连续评分显著减少了离散化伪影，同时保留了生成式批评的可解释性。

English abstract

Large Language Models (LLMs) are increasingly used with formal interactive theorem provers such as Lean 4. Scaling these systems with reinforcement learning or search methods requires process reward models (PRMs) that can evaluate intermediate reasoning steps. Existing reward-model designs expose a practical trade-off. Value-head models provide continuous scores but modify the generative model interface, while generative reward models preserve textual rationales but are poorly matched to continuous floating-point regression because numeric values are split across tokens. We introduce Expected Value Alignment (EVA), a reward-modeling procedure that keeps the surface output discrete while extracting continuous scores from the model's token distribution. The model emits integer scores in a structured JSON format, and EVA computes a continuous score as the expectation over the logits of the corresponding anchor tokens. Training combines the causal language modeling objective with an auxiliary mean squared error loss on these expected values. We instantiate EVA in \textit{Leibniz}, a reward model for Lean 4 formal verification, and evaluate it against zero-shot and reward-modeling baselines. The evaluation demonstrates that continuous logit-based scoring significantly reduces discretization artifacts while retaining the interpretability of generative critiques.
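The continuous score described in this abstract can be sketched as an expectation over the integer anchor tokens. This is an illustration under assumptions: it normalizes the anchor-token logits with a softmax restricted to those tokens and then takes the probability-weighted mean of the integer scores. The abstract does not specify the score range or the exact normalization, so the 1 to 5 range in the usage note and the function name are hypothetical.

```python
import math


def expected_score(anchor_logits):
    """Continuous score from the logits of integer anchor tokens.

    anchor_logits : dict mapping an integer score to the logit of its token
    Returns sum_s s * softmax(logits)[s], with the softmax taken over the
    anchor tokens only (an assumption of this sketch).
    """
    peak = max(anchor_logits.values())  # subtract the max for numerical stability
    weights = {s: math.exp(logit - peak) for s, logit in anchor_logits.items()}
    total = sum(weights.values())
    return sum(s * w for s, w in weights.items()) / total
```

For example, equal logits over the scores 1 to 5 give an expected score of 3.0, whereas greedy decoding would have to commit to one integer. A training loop in this style would add a mean squared error term between this value and the target score to the usual language-modeling loss, as the abstract describes.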


2606.01159 2026-06-02 cs.LG cs.GT

Fairness in two-player zero-sum games with bandit feedback

带赌博反馈的两人零和博弈中的公平性

S Akash, Pratik Gajane

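Affiliations: LatentForce.ai; Laboratoire d’Informatique Fondamentale d’Orléans; University of Orléans

AI summary: Studies two-player zero-sum games under a fairness constraint (every action played with probability at least $α/m$), reduces the fair game to a standard zero-sum game via a reparametrization, and proposes the Fair-ETC-TPZSG algorithm with a proven regret bound.

Chinese abstract (AI translation)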

我们研究在公平约束下的两人零和博弈（TPZSGs），其中每个动作必须以至少$α/m$的概率被选择。现有的实例相关结果针对$\textit{纯}$纳什均衡，而公平性通常产生$\textit{混合}$均衡，这是一个更难的学习目标。我们的关键技术工具是重参数化：每个公平策略分解为$p = (α/m)\mathbf{1} + (1-α)\widetilde{p}$，其中$\widetilde{p} \in Δ_m$，代入收益形式得到$p^{\top}Aq = \widetilde{p}^{\top}\widetilde{A} q$，其中公平收益矩阵$\widetilde{A} := (1-α)A + α\mathbf{1} c^{\top}$，$c_j = \tfrac{1}{m}\sum_i A(i,j)$是列均值向量。那么$A$上的公平博弈等价于$\widetilde{A}$上的标准零和博弈，因此均衡存在性、KKT结构和LP基稳定性归结为应用于$\widetilde{A}$的经典结果。我们推导了公平最小最大值、公平纳什均衡、公平遗憾以及一个简洁的对偶表示，表明公平代价至多为$α(1-1/m)$，并且当无约束均衡已经具有完全支撑时消失。我们的主要结果是针对$\texttt{Fair-ETC-TPZSG}$算法的$\widetilde{O}(T^{2/3})$遗憾界，该算法适用于一般的混合公平均衡，并讨论了为什么朴素的动作消除不能轻易改进它。当公平均衡具有单一主导动作时，即当$\widetilde{p}^{\star}$是$Δ_m$的顶点时，该界收紧为实例相关的$\widetilde{O}(1/\widetildeΔ(α)^{2})$，其中$\widetildeΔ(α)$是LP边际间隙。

English abstract

We study two-player zero-sum games (TPZSGs) with bandit feedback under fairness constraints requiring every action to be played with probability at least $α/m$. Existing instance-dependent results target $\textit{pure}$ Nash equilibria, while fairness generically produces $\textit{mixed}$ equilibria, a harder learning target. Our key technical tool is a reparametrization: every fair strategy decomposes as $p = (α/m)\mathbf{1} + (1-α)\widetilde{p}$ with $\widetilde{p} \in Δ_m$, and substituting into the payoff form yields $p^{\top}Aq = \widetilde{p}^{\top}\widetilde{A} q$ for a fair payoff matrix $\widetilde{A} := (1-α)A + α\mathbf{1} c^{\top}$, where $c_j = \tfrac{1}{m}\sum_i A(i,j)$ is the column-mean vector. The fair game on $A$ is then equivalent to a standard zero-sum game on $\widetilde{A}$, so equilibrium existence, KKT structure, and LP basis stability reduce to classical results applied to $\widetilde{A}$. We derive the fair minimax value, fair Nash equilibrium, fair regret, and a clean dual representation showing the price of fairness is at most $α(1-1/m)$ and vanishes whenever the unconstrained equilibrium already has full support. Our main result is an $\widetilde{O}(T^{2/3})$ regret bound for an Explore-Then-Commit algorithm, $\texttt{Fair-ETC-TPZSG}$, applicable to general mixed fair equilibria, together with a discussion of why naive action elimination does not readily improve it. When the fair equilibrium has a single dominant action, equivalently when $\widetilde{p}^{\star}$ is a vertex of $Δ_m$, the bound sharpens to instance-dependent $\widetilde{O}(1/\widetildeΔ(α)^{2})$, where $\widetildeΔ(α)$ is the LP-margin gap.
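The reparametrization identity stated in this abstract, $p^{\top}Aq = \widetilde{p}^{\top}\widetilde{A}q$ with $\widetilde{A} = (1-α)A + α\mathbf{1}c^{\top}$, can be checked numerically with a small sketch. The matrix, the value of α, and the strategies in the usage note are arbitrary illustrative values, and the helper names are hypothetical. The identity holds because $\widetilde{p}$ sums to one, so $\widetilde{p}^{\top}\mathbf{1}c^{\top}q = c^{\top}q = \tfrac{1}{m}\mathbf{1}^{\top}Aq$.

```python
def to_fair_strategy(p_tilde, alpha):
    """Map a free strategy p_tilde on the simplex to p = (alpha/m) 1 + (1 - alpha) p_tilde."""
    m = len(p_tilde)
    return [alpha / m + (1 - alpha) * x for x in p_tilde]


def fair_payoff_matrix(A, alpha):
    """Build A_tilde = (1 - alpha) A + alpha 1 c^T, with c the column-mean vector of A."""
    m, n = len(A), len(A[0])
    c = [sum(A[i][j] for i in range(m)) / m for j in range(n)]
    return [[(1 - alpha) * A[i][j] + alpha * c[j] for j in range(n)] for i in range(m)]


def bilinear(p, A, q):
    """Expected payoff p^T A q."""
    return sum(p[i] * A[i][j] * q[j] for i in range(len(p)) for j in range(len(q)))
```

For instance, with `A = [[1.0, -1.0], [-2.0, 3.0]]`, `alpha = 0.2`, `p_tilde = [0.3, 0.7]`, and `q = [0.6, 0.4]`, both `bilinear(to_fair_strategy(p_tilde, alpha), A, q)` and `bilinear(p_tilde, fair_payoff_matrix(A, alpha), q)` evaluate to 0.068 up to floating-point error, and every entry of the fair strategy is at least `alpha / m`.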


2606.01155 2026-06-02 cs.LG cs.AI

When Data Is Scarce: Scaling Sparse Language Models with Repeated Training

当数据稀缺时：通过重复训练扩展稀疏语言模型

Boqian Wu, Qiao Xiao, Patrik Okanovic, Tomasz Sternal, Maurice van Keulen, Mykola Pechenizkiy, Elena Mocanu, Torsten Hoefler, Decebal Constantin Mocanu

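Affiliations: Eindhoven University of Technology; University of Luxembourg; University of Twente; ETH Zürich

AI summary: Studies the scalability of sparse training under data constraints, proposes a scaling law in active parameters, unique tokens, data repetition, and sparsity, and finds that sparse training delays data saturation and improves resource trade-offs.

Comments: Accepted at ICML 2026

Chinese abstract (AI translation)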

密集大语言模型在无限数据下的缩放定律已被充分探索，但稀疏性与有限数据如何相互作用尚未研究。在这项工作中，我们研究了数据受限场景下的稀疏训练，其中有限的唯一标记需要多轮训练。我们的实验涵盖拟合集中最多1.92B参数的模型、最高93.75%的稀疏度、最多2.6B标记的唯一数据预算，以及16轮训练中最多41.6B的总训练标记；我们进一步在保留的密集等价模型（最多7.68B参数）上验证了外推能力。我们发现：1. 数据受限下的稀疏缩放：我们引入了一个缩放定律，将损失建模为活跃参数、唯一标记、数据重复和稀疏度的函数，准确预测跨计算和数据预算的性能。2. 延迟数据饱和：稀疏训练延迟了重复数据带来的收益递减，使多轮训练更有效。3. 资源权衡：在固定数据下，损失最优的稀疏度约为50%，而计算最优的稀疏度更高且随数据规模增长。总体而言，稀疏性不仅是提高效率的工具，也是在数据稀缺下改善缩放权衡的机制。我们的代码可在 https://github.com/boqian333/sparse-dc-scaling 获取。

English abstract

Scaling laws for dense LLMs under infinite data are well explored, but how sparsity interacts with limited data is not. In this work, we study sparse training in data-constrained regimes where limited unique tokens require multi-epoch training. Our experiments span models up to 1.92B parameters in the fitting set, sparsity up to 93.75%, unique data budgets up to 2.6B tokens, and total training tokens up to 41.6B over 16 epochs; we further validate extrapolation on held-out dense-equivalent models up to 7.68B parameters. We find that: 1. Sparse scaling in data-limited settings: We introduce a scaling law that models loss as a function of active parameters, unique tokens, data repetition, and sparsity, accurately predicting performance across compute and data budgets. 2. Delayed data saturation: Sparse training postpones diminishing returns from repeated data, making multi-epoch training more effective. 3. Resource trade-offs: With fixed data, loss-optimal sparsity is moderate (about 50%), while compute-optimal sparsity is higher and grows with data scale. Overall, sparsity is not just a tool for efficiency, but a mechanism for improving scaling trade-offs under data scarcity. Our code is available at: https://github.com/boqian333/sparse-dc-scaling.


2606.01151 2026-06-02 cs.LG

Lagrangian Perturbation Diffusion Steering: Latent Reinforcement Learning for Generative Policies

拉格朗日扰动扩散引导：用于生成策略的潜在强化学习

Hikmet Simsir, Ozgur S. Oguz

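Affiliations: University of Michigan

AI summary: Proposes Lagrangian Perturbation Diffusion Steering (LP-DS), which fine-tunes a frozen generative policy by learning a compact noise-space perturbation optimized with a Lagrangian trust-region objective, improving downstream value while staying close to the latent prior and raising sample efficiency and return on multiple benchmarks.

Comments: Accepted as a regular paper at ICML 2026

Chinese abstract (AI translation)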

使用高容量生成策略的行为克隆实现了强大的模仿性能，但通常受限于演示覆盖率和分布偏移。直接强化学习微调可以提升性能，但更新大型动作解码器往往不稳定且样本效率低。我们提出拉格朗日扰动扩散引导（LP-DS），一种轻量级自适应方法，通过在解码前学习紧凑的噪声空间扰动来改进冻结的生成策略。LP-DS 使用拉格朗日信任区域目标优化该扰动，在约束与潜在先验偏差的同时提升下游价值。在 RoboMimic 操作、OpenAI Gym 运动和 Adroit 灵巧操作基准上，LP-DS 提高了样本效率、成功率和回报，同时相比无约束噪声空间引导保持了更高的动作空间熵，回报提升高达 25%。使用流匹配骨干、大型视觉-语言-动作模型以及物理 Franka 部署的额外评估表明，LP-DS 不限于紧凑扩散策略或模拟基准。项目页面：https://sites.google.com/view/lp-ds/home。

英文摘要

Behavior cloning with high-capacity generative policies achieves strong imitation performance, but is often limited by demonstration coverage and distribution shift. Direct reinforcement learning fine-tuning can improve performance, but updating large action decoders is frequently unstable and sample inefficient. We propose Lagrangian Perturbation Diffusion Steering (LP-DS), a lightweight adaptation method that improves a frozen generative policy by learning a compact noise-space perturbation before decoding. LP-DS optimizes this perturbation with a Lagrangian trust-region objective, improving downstream value while constraining deviation from the latent prior. Across RoboMimic manipulation, OpenAI Gym locomotion, and Adroit dexterous manipulation benchmarks, LP-DS improves sample efficiency, success, and return while maintaining higher action-space entropy than unconstrained noise-space steering, with return improvements of up to 25% over prior baselines. Additional evaluations with flow-matching backbones, a large vision-language-action model, and physical Franka deployment show that LP-DS is not limited to compact diffusion policies or simulated benchmarks. Project page: https://sites.google.com/view/lp-ds/home.


2606.01149 2026-06-02 cs.CV

CoSTL: Comprehensive Spatial-Temporal Representation Learning for Moment Retrieval and Highlight Detection

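Xin Dong, Wenjia Geng, Wenfeng Deng, Yansong Tang

Affiliations: Shenzhen International Graduate School, Tsinghua University; Pengcheng Laboratory

AI summary: Proposes CoSTL, a comprehensive spatial-temporal representation learning framework that jointly learns spatial detail and temporal dynamics through a text-driven progressive fine-grained image encoder and a multi-scale temporal perception module, reaching state-of-the-art performance on moment retrieval and highlight detection.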

Comments 14 pages, 3 figures

Abstract

Video Moment Retrieval (MR) and Highlight Detection (HD) are crucial tasks in video analysis that aim to localize specific moments and estimate clip-wise relevance based on a given text query. Recent approaches treat them as similar video grounding tasks and use the same architecture to solve them. These tasks require both fine-grained comprehension at the image level and high-level temporal understanding across the entire video. Existing approaches have primarily focused on temporal modeling using frame-level features, often neglecting the rich visual information related to the text query within individual frames. This oversight leads to inaccurate grounding results. To address this limitation, we propose a Comprehensive Spatial-Temporal Representation Learning Framework (CoSTL), which captures both fine-grained image-level information and temporal dynamics. Specifically, CoSTL incorporates a text-driven progressive fine-grained image encoder, performing a two-step text-driven knowledge extraction process to learn fine-grained spatial representations. Furthermore, a multi-scale temporal perception module captures comprehensive spatial-temporal representations, enhancing the model's ability to process temporal dynamics. We demonstrate state-of-the-art performance on four public benchmarks: QVHighlights, Charades-STA, TACoS, and TVSum.


2606.01148 2026-06-02 cs.CL

Not All Explanations Simulate Equally: Comparing Verbalized Feature Attributions and Self-Generated Rationales

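Pingjun Hong, Benjamin Roth

Affiliations: Faculty of Computer Science, University of Vienna; UniVie Doctoral School Computer Science, University of Vienna; Faculty of Philological and Cultural Studies, University of Vienna

AI summary: Using a counterfactual simulation setup, this study compares how two explanation sources, verbalized feature attributions and self-generated rationales, affect the simulatability of a question-answering model's behavior, and finds that explanation format and granularity significantly affect simulation performance.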

2606.01145 2026-06-02 cs.AI

Reasoning4Sciences: Bridging Reasoning Language Models to All Scientific Branches

Teddy Ferdinan, Bartłomiej Koptyra, Mikołaj Langner, Tomasz Adamczyk, Łukasz Radliński, Maciej Markiewicz, Aleksander Szczęsny, Stanisław Woźniak, Tymoteusz Romanowicz, Dzmitry Pihulski, Mateusz Zbrocki, Mateusz Śmigielski, Michał Rajkowski, Mateusz Biedka, Konrad Kiełczyński, Konrad Wojtasik, Jacek Duszenko, Jan Eliasz, Piotr Matys, Michał Bernacki-Janson, Maria Bellaniar Ismiati, Latius Hermawan, Wiktoria Mieleszczenko-Kowszewicz, Anna Kubicka-Sowinska, Grzegorz Chodak, Karol Postawa, Paweł Zyblewski, Tomasz Szandała, Łukasz Sterczewski, Adrian Chajec, Pawel Niewiadomski, Piotr Gruber, Marcin Wdowikowski, Sławomir Czarnecki, Bartłomiej Kryszak, Dominik Drabik, Tomasz Kajdanowicz, Kamil Mamak, Paweł Preś, Katarzyna Paczkowska, Joachim Sobczuk, Tomasz Zięba, Jan Kocoń, Maciej Piasecki, Przemysław Kazienko

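Affiliations: Poznan University of Technology; National Cheng Kung University; Universitas Katolik Musi Charitas Palembang

AI summary: Presents the first comprehensive analysis of reasoning language model adoption across 28 scientific disciplines, proposes a maturity assessment framework based on domain resources, and identifies gaps between disciplines and future directions.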

Abstract

While Reasoning Language Models (RLMs) are rapidly emerging as powerful tools for scientific research, their impact is primarily concentrated in "hard science" fields. The slow -- or lack of -- adoption of RLMs in other branches of science is causing a widening gap in research productivity. In this survey, we provide the first comprehensive analysis of RLM adoption across 28 scientific disciplines following the classification used by the European Research Council (ERC), spanning the Social Sciences and Humanities, Physical Sciences and Engineering, and Life Sciences. We examine how RLMs are developed, evaluated, and applied across disciplines. Furthermore, we introduce a maturity-oriented assessment framework based on available domain-specific development and evaluation resources, revealing substantial disparities in RLM maturity that become even more pronounced when only publicly available resources are considered. Finally, we highlight current implementation paradigms that are gaining popularity across disciplines, current challenges, and future directions in enabling RLM adoption across science.


2606.01136 2026-06-02 cs.CL

From Outliers to Errors: Auditing Pali-to-English LLM Translations with Multi-Reference Adjudication

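Máté Metzger, Nadnapang Phophichit, Hansa Dhammahaso

Affiliations: Independent Researcher; Nibbana Meditation Centre

AI summary: To address LLM translation audits of classical languages that mislabel legitimate variation as error, proposes an audit method that screens candidates with a reference envelope built from multiple human translations and an embedding-drift threshold, then has an LLM judge panel adjudicate the flagged cases; applied to Pali-to-English translation.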

Comments Preprint. This manuscript has not yet been peer reviewed

Abstract

Single-score translation metrics can conflate legitimate variation with error, a problem especially acute for classical languages where multiple defensible English renderings of the same passage coexist. We audit Pali-to-English output from four flagship large language models (LLMs): GPT-5.5, Claude Sonnet 4.6, Gemini 3.1 Pro, and Grok 4.3, on 1,700 passages from the Pali Canon, using three established human translations by Bhikkhu Sujato, Thanissaro Bhikkhu, and Bhikkhu Bodhi as a local reference envelope rather than a single gold standard. Each candidate's normalized embedding drift from the reference centroid serves as a triage signal, not an error label; the 1,203 candidates above a 1.5 drift threshold are then adjudicated by a blinded three-model LLM judge panel, calibrated against a 300-instance author-adjudicated validation set. Two results stand out. First, drift predicts severity rather than error per se: the major-error rate among adjudicated high-drift candidates rose monotonically from 7.9% in the 1.5-2.0 band to 51.6% above 3.0, while approximately 80% of 1.5-2.0 outliers were judged valid translation variations. Second, model differences were clearest in the high-drift tail: GPT-5.5 had the lowest adjudicated high-drift major-error rate, with confidence intervals overlapping those of Claude Sonnet 4.6 and Gemini 3.1 Pro; Grok 4.3 had both the largest outlier volume and the highest tail major-error rate (27.6% overall, 74.4% above drift 3.0). The dominant major-error categories (e.g. omission or truncation, doctrinal term errors) are precisely the failures most likely to mislead readers of doctrinal text. The contribution is a reusable audit design for classical-to-modern translation: define a local reference envelope from multiple human translators, use embedding drift to prioritize review, and adjudicate the flagged tail rather than treating outlier status as error.
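The triage step described in this abstract (drift from a reference centroid, then a threshold) can be sketched as follows. This is a minimal illustration, not the authors' code: the abstract does not say how drift is normalized, so dividing by the mean reference-to-centroid distance is an assumption, and the function names and toy 2-D "embeddings" are hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalized_drift(candidate, references):
    """Candidate-to-centroid distance divided by the mean distance of the
    references from their own centroid.
    ASSUMPTION: this normalization is illustrative; the abstract does not define it.
    Requires references that are not all identical (spread > 0)."""
    c = centroid(references)
    spread = sum(euclidean(r, c) for r in references) / len(references)
    return euclidean(candidate, c) / spread

def triage(candidates, references, threshold=1.5):
    """Names of candidates whose drift exceeds the threshold.
    A flag means 'send to adjudication', not 'this is an error'."""
    return [name for name, vec in candidates.items()
            if normalized_drift(vec, references) > threshold]

# Toy embeddings of three human reference translations of one passage.
refs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
cands = {"model_a": [0.2, 0.4], "model_b": [3.0, 3.0]}
flagged = triage(cands, refs)  # only model_b lies far outside the envelope
```

Keeping the flag separate from the error label mirrors the abstract's point that most outliers in the 1.5-2.0 band were judged valid variations.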


2606.01132 2026-06-02 cs.CV

HakushoBench: A Japanese Chart and Table VQA Benchmark from Governmental White Papers

Issa Sugiura, Shuhei Kurita, Yusuke Oda, Naoaki Okazaki

Affiliations: Institute of Science Tokyo; NII; NII LLMC

AI summary: Builds HakushoBench, a Japanese chart and table VQA benchmark drawn from governmental white papers, with 2,053 images and manually annotated QA pairs, to evaluate how deeply vision-language models understand charts and tables.

Comments 16 pages, 17 figures

Abstract

Understanding chart and table images is essential for applying vision-language models (VLMs) to real-world document understanding. While English benchmarks have advanced rapidly, non-English counterparts remain scarce, leaving it unclear whether this progress generalizes across languages. A key obstacle is the difficulty of collecting realistic and diverse non-English chart and table images at scale. To address this, we leverage governmental white papers as a scalable source for benchmark construction beyond English, as they contain naturally occurring charts and tables across diverse formats and domains and are freely accessible in many countries. As a first instantiation, we introduce HakushoBench, a challenging Japanese chart and table VQA benchmark built from 33 governmental white papers. HakushoBench contains 2,053 images spanning over 10 image types, with manually annotated QA pairs, designed to assess deep and holistic understanding of charts and tables, rather than local visual cues alone. Experiments across a broad range of VLMs demonstrate that HakushoBench remains challenging for open-weight models: the best open-weight model achieves only 58.6% accuracy, and a 34.9-point gap between open-weight and proprietary models highlights substantial room for improvement in complex chart and table understanding. We release our dataset and code.


2606.01128 2026-06-02 cs.LG

Local MixVR: Breaking the Communication-Sample Dependence in Distributed Learning

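Tehila Dahan, Bassel Hamoud, Roie Reshef, Martin Jaggi, Kfir Y. Levy

Affiliations: Technion, Haifa, Israel; EPFL, Lausanne, Switzerland

AI summary: Proposes the Local MixVR framework, which combines local updates with variance reduction to remove the dependence of communication complexity on the total number of samples N, yielding a communication complexity that depends only on the number of workers M and improving on the best existing methods when M < O(N^{1/4}).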

2606.01126 2026-06-02 cs.LG cs.AI cs.CV

STARFISH: faST Accuracy Recovery in pruned networks From Internal State Healing

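Shir Maon, Odelia Melamed, Adi Shamir

Affiliations: Weizmann Institute of Science

AI summary: Proposes STARFISH, which uses a small unlabeled calibration set to optimize a pruned network so that its internal states align with those of the original network, recovering accuracy efficiently and outperforming existing methods on ViT networks.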

Abstract

Pruning is a process designed to reduce the number of weights in a large neural network. This can substantially speed up inference but might cause a considerable reduction in the model's accuracy, and thus it is usually followed by a healing process that regains some of the lost accuracy. In this paper, we propose a new healing method, STARFISH, that can recover (most of) the accuracy of any pruned network efficiently. The main idea of STARFISH is to optimize the pruned network to align with the original network's internal state representations using a tiny calibration set of unlabeled examples. For the common case of removing 50% of the weights, STARFISH healing improves the recovered accuracy by up to 22% over the state-of-the-art methods on ViT-based networks. Its advantage is even more pronounced under aggressive pruning. For example, after eliminating 75% of the weights in a DeiT-B network for ImageNet, STARFISH uses only 0.4% of the number of training images as a calibration set and recovers 82% of the original dense accuracy, whereas competing recovery techniques reach only 40% of the dense model accuracy.


2606.01123 2026-06-02 cs.LG

From Reward-Free Representations to Preferences: Rethinking Offline Preference-Based Reinforcement Learning

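Jun-Jie Yang, Chia-Heng Hsu, Kui-Yuan Chen, Ping-Chun Hsieh

Affiliations: GitHub

AI summary: Proposes an offline preference-based RL framework that combines reward-free representation learning with contrastive search and fine-tuning: it learns latent successor-measure representations from reward-free offline data, then uses preference data for contrastive search and fine-tuning, which substantially improves preference efficiency.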

Comments Published in ICML 2026

Abstract

Preference-based reinforcement learning (PbRL) avoids explicit reward engineering by learning from pairwise human preference feedback. Existing offline PbRL methods typically follow a two-stage pipeline, first learning a reward or preference model from labeled preferences and then performing offline RL on unlabeled data. We revisit offline PbRL through the lens of reward-free representation learning (RFRL) from the zero-shot RL literature, and propose a new training framework that first learns latent successor-measure representations from reward-free offline data, followed by contrastive search and fine-tuning using preference data. Through extensive experiments and ablations, we show that our method achieves superior preference efficiency over offline PbRL baselines. This work is the first to connect RFRL with PbRL, highlighting its potential as a feedback-efficient solution. Our code is publicly available at https://github.com/rl-bandits-lab/FB-PbRL.


2606.01122 2026-06-02 cs.LG q-fin.CP

A Per-Component Diagnostic Protocol for Neural HJB-PIDE Solvers under Control-Dependent Lévy Jumps

R. Drissi

Affiliations: GitHub

AI summary: Proposes a five-step diagnostic protocol for detecting operator miscomputation in residual-trained neural HJB-PIDE solvers under control-dependent Lévy jumps, and validates it on a CRRA-Merton-Variance-Gamma benchmark.

Abstract

We propose a five-step diagnostic protocol for residual-trained neural HJB-PIDE solvers with control-dependent Lévy jumps, targeting a general failure mode of neural PDE methods: a learned solution can match headline scalar diagnostics while miscomputing an operator inside its training loss. The protocol pairs each neural solve with at least one from-scratch independent reference, decomposes the Hamiltonian into drift, diffusion, compensator, and nonlocal-integral components across a u-grid, and compares the value function and its low-order derivatives over a (t,x) grid before any argmax comparison. Applied to a standard CRRA-Merton-Variance-Gamma benchmark, it isolates a missing 1/2-mixture factor in the neural method's importance-proposal density that scaled the nonlocal integral by exactly half - a textbook signature of a constant proposal scale error, invisible to longer training, grid refinement, and truncation sweeps. With the bug corrected, four references - two finite-difference solvers with disjoint discretizations, the neural solver, and a semi-analytic scalar baseline obtained from CRRA homogeneity - agree on the optimal control to within ~2%. The constant-coefficient CRRA benchmark collapses by homogeneity to a scalar maximization, so the scalar baseline is the efficient method here; the contribution is the protocol, applicable in principle to non-homogeneous and higher-dimensional settings where neural HJB-PIDE solvers are genuinely needed. The episode is a concrete instance of a broader neural-PDE verification failure: pointwise agreement of a learned value or control can coexist with a systematically wrong nonlocal operator, so per-component and surface-level checks are needed before trusting the argmax policy.
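The failure mode in this abstract, a missing 1/2 mixture factor in an importance-proposal density, can be reproduced on a toy integral. The sketch below is not the paper's solver: the Gaussian mixture proposal, the integrand exp(-x^2), and all function names are assumptions chosen for illustration. Dropping the mixture weights doubles the evaluated density, which halves every importance weight and therefore the estimate.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def sample_mixture(rng):
    """Draw from the equal-weight mixture of N(-1, 1) and N(+1, 1)."""
    mu = -1.0 if rng.random() < 0.5 else 1.0
    return rng.gauss(mu, 1.0)

def proposal_correct(x):
    return 0.5 * normal_pdf(x, -1.0, 1.0) + 0.5 * normal_pdf(x, 1.0, 1.0)

def proposal_buggy(x):
    # Mixture weights omitted: this "density" integrates to 2, not 1.
    return normal_pdf(x, -1.0, 1.0) + normal_pdf(x, 1.0, 1.0)

def integrand(x):
    return math.exp(-x * x)  # exact integral over the real line is sqrt(pi)

def is_estimate(density, samples):
    """Importance-sampling estimate: mean of integrand / proposal density."""
    return sum(integrand(x) / density(x) for x in samples) / len(samples)

rng = random.Random(0)
samples = [sample_mixture(rng) for _ in range(20000)]
good = is_estimate(proposal_correct, samples)  # close to sqrt(pi)
bad = is_estimate(proposal_buggy, samples)     # half of `good`
```

Because the error is a constant scale, more samples or a finer grid leave the ratio `bad / good` at one half, which is why only a comparison against an independent reference exposes it.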


2606.01118 2026-06-02 cs.CV

Rank-Aware Quantile Activation for Motion-Robust Crop Segmentation in UAV Imagery

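Abinav Kiran, Sravan Danda, Aditya Challa, Sougata Sen, Daya Sagar B S

Affiliations: Senior Member, IEEE

AI summary: To counter the degradation of semantic segmentation caused by motion blur in high-speed UAV imagery, proposes the rank-aware Dual Quantile Activation (QAct) module, which replaces magnitude gating with instance-level rank normalization; it improves mIoU in both zero-shot and blur-supervised settings, most of all on rare texture-dependent classes, and is complementary to blur-domain training.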

Abstract

Motion blur from high-speed UAV acquisition degrades semantic segmentation on rare texture-dependent classes with high agronomic value. Standard CNNs rely on high-frequency magnitude features that blur destroys, causing statistical erasure of minority signals. We propose Dual Quantile Activation (QAct), a rank-aware block replacing magnitude gating with instance-level rank normalization. Evaluated on Agriculture-Vision 2021 across zero-shot and blur-supervised regimes at multiple severities, QAct is the dominant architectural factor: it delivers consistent mIoU gains over ReLU across both regimes and all severities, with the strongest gains on rare structural and texture-dependent classes. Some dominant classes (water, planter skip) show mixed per-class performance under distillation. At moderate blur, zero-shot QAct outperforms distillation-trained ReLU; across all severities, Distill-QAct achieves the best performance, confirming that rank-aware activation and blur-domain training are complementary sources of robustness.


2606.01117 2026-06-02 cs.LG cs.AI

HASTE: Hardware-Aware Dynamic Sparse Training for Large Output Spaces

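Nasib Ullah, Jinbin Zhang, Jean Lucien Randrianantenaina, Erik Schultheis, Rohit Babbar

Affiliations: University of Waterloo

AI summary: Proposes group-shared fixed fan-in sparsity, a semi-structured output-layer design that, combined with a long-tail decomposition, delivers substantial speedups in extreme multi-label classification while maintaining accuracy.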

Comments Accepted at ICML 2026 Regular

Abstract

Extreme multi-label classification (XMC) involves learning models over large output spaces with millions of labels, making the output layer a memory-compute bottleneck. While sparsity-based methods reduce arithmetic complexity, they often fail to yield proportional speedups due to irregular memory access, poor hardware utilization, or reliance on auxiliary architectural components in long-tailed regimes. We introduce group-shared fixed fan-in sparsity, a semi-structured output-layer design in which semantically related labels share a sparse input pattern while retaining independent weights. This grouping introduces a task-aligned inductive bias -- encouraging related labels to share feature subsets -- while reducing index memory overhead, increasing feature reuse across labels, and enabling efficient GPU execution via custom CUDA kernels that leverage modern accelerator primitives. As an alternative to auxiliary objectives, we exploit the long-tailed structure of XMC by decomposing the output layer into a small dense head over frequent labels and a group-shared sparse tail over the remainder, providing an informative gradient pathway while preserving the memory benefits of sparsity. Through kernel-level microbenchmarking, we show that group-shared fixed fan-in translates arithmetic reductions into practical wall-clock gains, achieving up to $4.4\times$ speedup in the forward pass and up to $25\times$ speedup in backward passes over standard fixed fan-in sparsity, while operating within a few percent of a FLOPs-matched dense bottleneck. Across large-scale XMC benchmarks, our approach matches or improves precision@k over prior sparse baselines, while narrowing the performance gap to dense.
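The core data layout described above, labels in a group sharing one sparse input pattern while keeping independent weights, can be illustrated with a plain-Python forward pass. This is only a sketch of the idea under assumed data structures (the `groups` format and label names are hypothetical); the paper's implementation relies on custom CUDA kernels, which this does not attempt to model.

```python
def group_shared_forward(x, groups):
    """Compute one logit per label.

    x: dense feature vector (list of floats).
    groups: list of dicts with
      'indices': fan-in feature indices shared by every label in the group,
      'weights': {label: [w_0, ..., w_{k-1}]}, one independent row per label.
    """
    logits = {}
    for g in groups:
        # One gather per group; every label in the group reuses it.
        gathered = [x[i] for i in g["indices"]]
        for label, w in g["weights"].items():
            logits[label] = sum(wi * xi for wi, xi in zip(w, gathered))
    return logits

x = [1.0, 2.0, 3.0, 4.0]
groups = [
    {"indices": [0, 2], "weights": {"cat": [1.0, 1.0], "kitten": [0.5, -1.0]}},
    {"indices": [1, 3], "weights": {"car": [2.0, 0.0]}},
]
logits = group_shared_forward(x, groups)
# logits == {"cat": 4.0, "kitten": -2.5, "car": 4.0}
```

Storing the index list once per group rather than once per label is what reduces index memory, and the shared gather is the feature reuse the abstract refers to.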


2606.01112 2026-06-02 cs.RO

Tether-Aware Dynamic Collision Avoidance for USV-HROV Systems

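Yang Gu, Ziyang Hong, Xuanlin Chen, Hao Wei, Cheng Wang, Shujie Yang, Yulin Si

Affiliations: Zhejiang University

AI summary: For a USV tracking an HROV, where the submerged tether can scrape passing vessels and may be pulled taut, proposes a tether-aware dynamic collision avoidance method that introduces a tether safety-aware planar domain and a tether tautness-aware velocity obstacle method to avoid collisions safely while reducing the likelihood of tether tautness.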

Abstract

Heterogeneous marine robotic systems composed of an unmanned surface vehicle (USV) and a hybrid remotely operated vehicle (HROV) have shown great potential for subsea cable inspection. In such missions, the USV tracks the HROV at the surface while supplying power and communication through an umbilical tether. However, dynamic collision avoidance for the USV during HROV tracking is challenging because the submerged tether may scrape against passing vessels, while evasive maneuvers can enlarge the USV--HROV separation, thereby increasing the likelihood of tether tautness and compromising HROV operations. To address these challenges, this work proposes a tether-aware dynamic collision avoidance method for a USV tracking an HROV. First, a tether safety-aware planar domain is introduced to represent the three-dimensional collision risk between the tether and obstacle vessels without an explicit tether shape model. Second, a tether tautness-aware velocity obstacle method is developed to achieve safe avoidance while reducing the likelihood of tether tautness. Finally, the method is integrated with line-of-sight guidance to coordinate HROV tracking and collision avoidance. Gazebo-based simulations show that the proposed method avoids dynamic obstacle vessels while maintaining tether safety and reducing the likelihood of tether tautness during USV evasive maneuvers.


2606.01106 2026-06-02 cs.CV

Temporal Evidence Routing with Structured Visual Evidence for TimeLogicQA

Yuyang Sun, Yongliang Wu, Xingyu Zhu, Yuxia Chen, Zhenxiang Jiang, Yangguang Ji, Wenbo Zhu, Yanxi Shi, Jay Wu, Shuo Wang, Xu Yang

Affiliations: Southeast University; National University of Singapore; Independent Researcher; Opus AI Research; University of Science and Technology of China

AI summary: Proposes a visual evidence routing pipeline that separates perception from symbolic temporal reasoning, using structured visual evidence and deterministic temporal rules to reach an AvgAcc of 81.8 on TimeLogicQA.

Abstract

TimeLogicQA evaluates whether video question answering systems can reason over temporal relations such as event existence, ordering, persistence, boundary conditions, and overlap. We address this task with a visual evidence routing pipeline that separates perception from symbolic temporal reasoning. The system first parses each question into event targets, answer mode, candidate options, and temporal operators. It then routes videos according to duration and operator difficulty, using ordered full-frame evidence for short clips and event-focused candidate windows for long videos. A multimodal large language model produces structured visual evidence for the relevant events, while programmatic verifiers recover dense action intervals and a deterministic reducer applies operator-specific temporal rules to produce the final answer. Conservative fusion accepts an answer only when the visual evidence, temporal program, and confidence checks agree, reducing noisy answer flips. On the official test evaluation, our final system achieves an AvgAcc of 81.8.
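A deterministic reducer of the kind mentioned above can be pictured as a small table of interval rules applied to recovered event intervals. The sketch below is hypothetical: the operator names, the rule definitions (for example, "before" compares first occurrences), and the interval format are assumptions, since the abstract does not specify them.

```python
def exists(intervals, event):
    """True if at least one interval was recovered for the event."""
    return bool(intervals.get(event))

def before(intervals, a, b):
    """True if the first occurrence of a ends no later than the first
    occurrence of b starts (one possible reading of 'a before b')."""
    if not (exists(intervals, a) and exists(intervals, b)):
        return False
    return min(intervals[a])[1] <= min(intervals[b])[0]

def overlap(intervals, a, b):
    """True if any interval of a intersects any interval of b."""
    return any(s1 < e2 and s2 < e1
               for s1, e1 in intervals.get(a, [])
               for s2, e2 in intervals.get(b, []))

RULES = {
    "exists": lambda iv, a, b=None: exists(iv, a),
    "before": before,
    "overlap": overlap,
}

def reduce_answer(operator, intervals, a, b=None):
    """Map structured evidence to a yes/no answer without calling a model."""
    return "yes" if RULES[operator](intervals, a, b) else "no"

# Event intervals in seconds, as a verifier might recover them from a video.
evidence = {"pour": [(2.0, 5.0)], "stir": [(4.0, 9.0)], "serve": [(10.0, 12.0)]}
answers = [
    reduce_answer("before", evidence, "pour", "stir"),   # pour ends after stir starts
    reduce_answer("overlap", evidence, "pour", "stir"),
    reduce_answer("before", evidence, "stir", "serve"),
    reduce_answer("exists", evidence, "wave"),
]
```

Keeping the rules in code rather than in a prompt is what makes the final step reproducible: the same intervals always yield the same answer.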


2606.01104 2026-06-02 cs.CV

Adaptive Dense Evidence Refinement for Video Relational Reasoning for VRR-QA Challenge

Yuyang Sun, Yongliang Wu, Xingyu Zhu, Yuxia Chen, Zhenxiang Jiang, Yangguang Ji, Wenbo Zhu, Yanxi Shi, Jay Wu, Shuo Wang, Xu Yang

Affiliations: Southeast University; National University of Singapore; Independent Researcher; Opus AI Research; University of Science and Technology of China

AI summary: Proposes an adaptive test-time computation system that uses lightweight views to identify unstable questions and routes them to a high-budget dense evidence module, reaching 90.07% average accuracy on the VRR-QA test set.

Abstract

VRR-QA evaluates whether video-language systems can infer spatial, temporal, viewpoint, depth, and visibility relations that are not always resolved by a single frame. We present an inference-only system built around adaptive test-time computation. The system first answers each question with a direct video-language model pass, then uses multiple lightweight views to find unstable questions. Only these difficult questions are routed to a high-budget dense evidence module that constructs timestamped frame observations, relation-specific probes, candidate verification, and conservative temporal aggregation. This design separates two problems that are often confused in video question answering: finding plausible alternative answers and deciding when a current answer should actually be changed. On the test split, the final system obtains 90.07 average accuracy and 87.81 macro average accuracy. The report focuses on the final test system and the implementation settings required to reproduce the adaptive dense verifier.
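The routing logic in this abstract (answer directly, probe stability with cheap views, escalate only unstable questions) might look roughly like the following. Everything here is a hypothetical sketch: the agreement criterion, the `route` signature, and the stand-in callables are assumptions rather than the authors' implementation.

```python
def route(question, direct_answer, light_views, dense_verifier, min_agreement=1.0):
    """Keep the direct answer when enough lightweight views agree with it;
    otherwise hand the question and the candidate answers to the expensive verifier.
    ASSUMPTION: 'unstable' is modeled as disagreement among cheap views."""
    votes = [view(question) for view in light_views]
    agreement = sum(v == direct_answer for v in votes) / len(votes)
    if agreement >= min_agreement:
        return direct_answer, "direct"
    candidates = sorted(set(votes + [direct_answer]))
    return dense_verifier(question, candidates), "dense"

# Stand-ins for model calls: each view maps a question to an answer letter.
stable_views = [lambda q: "A", lambda q: "A", lambda q: "A"]
mixed_views = [lambda q: "A", lambda q: "A", lambda q: "B"]
dense = lambda question, candidates: candidates[-1]  # placeholder verifier

easy = route("q1", "A", stable_views, dense)  # ("A", "direct")
hard = route("q2", "A", mixed_views, dense)   # ("B", "dense")
```

Separating "which alternatives are plausible" (the candidate set) from "should the answer change" (the verifier's decision) follows the distinction the abstract draws.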


2606.01101 2026-06-02 cs.LG cs.AI

Soft-NBCE: Entropy-Weighted Chunk Fusion for Long-Context

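Shihao Ji, Mingyu Li, Zihui Song

Affiliations: Beijing Normal University; Chunjiang Intelligence

AI summary: To address the semantic fragmentation that hard chunk selection causes in long-context inference, proposes Soft-NBCE, which uses entropy-weighted soft fusion and Consistency Distillation to improve multi-hop reasoning while maintaining retrieval accuracy.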

Comments 7 pages, 3 figures, 2 tables. Preprint

Abstract

The quadratic complexity of self-attention remains a bottleneck for Large Language Models (LLMs) processing ultra-long contexts. The Naive Bayes Cognitive Engine (NBCE) parallelizes long-context inference by chunking documents and routing to the lowest-entropy chunk at each decoding step. This hard-selection strategy causes semantic fragmentation during cross-chunk reasoning, as abrupt routing changes between adjacent tokens disrupt the model's contextual grounding. We present Soft-NBCE, a lightweight extension that replaces discrete chunk selection with soft entropy-weighted chunk fusion. A temperature-scaled Softmax over predictive entropies assigns continuous weights to all chunks, enabling log-space aggregation across chunk-conditioned distributions. To partially compensate for the conditional independence assumption introduced by chunking, we propose Consistency Distillation, a LoRA-based self-distillation that constrains the chunked logit distribution toward a full-context teacher via KL-divergence. On LongBench multi-hop benchmarks, Soft-NBCE with Consistency Distillation improves consistently over NBCE-style baselines (MuSiQue F1: 0.310 vs. 0.275 for vanilla NBCE; HotpotQA F1: 0.479 vs. 0.427) while maintaining retrieval accuracy (NIAH-32K: 0.909) at $O(L^2/n)$ peak memory.
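The fusion step described in this abstract can be sketched in a few lines. This is an illustrative reading, not the released method: the weights are assumed to be a softmax over negative entropies divided by a temperature, the aggregation is assumed to be a weighted average of log-probabilities followed by renormalization, and any prior or unconditional term that NBCE-style methods may use is omitted.

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def soft_fuse(chunk_probs, temperature=1.0):
    """chunk_probs: one next-token distribution per chunk, same vocabulary order.
    Returns (chunk weights, fused next-token distribution)."""
    # Lower entropy (a more confident chunk) receives a larger weight.
    weights = softmax([-entropy(p) / temperature for p in chunk_probs])
    vocab = len(chunk_probs[0])
    # Log-space aggregation: weighted average of log-probabilities.
    fused_log = [
        sum(w * math.log(max(p[v], 1e-12)) for w, p in zip(weights, chunk_probs))
        for v in range(vocab)
    ]
    return weights, softmax(fused_log)  # softmax renormalizes the fused scores

confident = [0.9, 0.05, 0.05]
uncertain = [1 / 3, 1 / 3, 1 / 3]
weights, fused = soft_fuse([confident, uncertain], temperature=1.0)
sharp_weights, _ = soft_fuse([confident, uncertain], temperature=0.01)
```

As the temperature approaches zero the weights collapse onto the lowest-entropy chunk, which recovers the hard selection that the abstract attributes to NBCE; larger temperatures blend the chunks more evenly.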


2606.01099 2026-06-02 cs.CL cs.AI

MiCU: End-to-End Smart Home Command Understanding with Large Language Model

Haowei Han, Kexin Hu, Weiwei Cai, Debiao Zhang, Bin Qin, Yuxiang Wang, Jiawei Jiang, Xiao Yan, Bo Du

Affiliations: School of Computer Science, Wuhan University; Xiaomi Corporation; Institute for Math & AI, Wuhan University

AI summary: Proposes MiCU, a domain-specific large language model that uses curriculum learning, reinforcement learning, and token compression to handle ambiguous smart home commands, with an average accuracy gain of 20.01%.

详情

DOI: 10.1145/3770855.3818446

Abstract

Command understanding systems in smart home ecosystems can automate device control and substantially improve user experience. However, while they perform well on precise utterances (e.g., "turn on the bedroom light"), they struggle with ambiguous or misaligned commands (e.g., "make the bedroom cozy"). Large language models (LLMs) generalize well across various domains and can outperform traditional rule-based systems on such tasks, but their effectiveness is often constrained by scarce domain-specific data, insufficient task-specific adaptation, and high computational costs. In this paper, we propose an automated training data synthesis workflow using user logs and LLMs; then we build MiCU, a domain-specific LLM that excels at command understanding. Specifically, we employ curriculum learning to inject domain knowledge into the base LLM, then we enhance its reasoning ability via cold-start training combined with reinforcement learning (RL) guided by domain-specific thinking rules. Additionally, we introduce a token compression technique that condenses a device description into a single special token, substantially reducing inference overhead and enabling MiCU-fast, an efficient variant optimized for long inputs. Extensive experiments show that MiCU significantly outperforms baselines, with an average accuracy gain of 20.01% across all device categories. We have deployed MiCU in the Xiaomi Home app, receiving approximately 1.7 million page views per day. Production evaluations show that MiCU reduces user correction rate by 1.57% and increases human-audited accuracy by 32.05%. Our data and code are available at https://github.com/xiaomi-research/iot_spec_llm.
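To make the input-length effect of token compression concrete, here is a toy, prompt-side sketch. It is not the paper's method: in a real system the special token's embedding would presumably be learned so that it encodes the description, whereas this example only does the bookkeeping of swapping each description for one placeholder. The `<dev_i>` token naming, the prompt layout, and all function names are hypothetical.

```python
def compress_devices(descriptions):
    """Map each device description to one placeholder token (hypothetical naming)."""
    return {f"<dev_{i}>": desc for i, desc in enumerate(descriptions)}

def build_prompt(command, device_entries):
    """Assemble a prompt from device entries (full text or placeholder tokens)."""
    return "Devices: " + " ".join(device_entries) + "\nCommand: " + command

def word_count(text):
    """Whitespace word count, used here as a crude stand-in for token count."""
    return len(text.split())

# Hypothetical device descriptions, invented for illustration only.
descriptions = [
    "bedroom ceiling light with brightness and color temperature control",
    "living room air conditioner with cooling, heating, and fan modes",
]
command = "make the bedroom cozy"

token_table = compress_devices(descriptions)
full_prompt = build_prompt(command, descriptions)        # long form
short_prompt = build_prompt(command, list(token_table))  # one token per device
```

In this sketch the short prompt carries one entry per device regardless of how long each description is, which is the property that would reduce inference overhead on long inputs.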


Low-Resource Safety Failures Are Action Failures, Not Representation Failures

PairedGTA: Generating Driving Datasets for Controlled Photometric Shift Analysis

The Case for Model Science: Verify, Explore, Steer, Refine

"Skill issues": data-centric optimization of lakehouse agents

CA-BED: Conversation-Aware Bayesian Experimental Design

Physics-Informed Deep Learning for Entropy Prediction in Heterogeneous Systems: Thermodynamic and Information-Theoretic Case Studies

Temporal Motif Signatures for Temporal Graph Neural Networks

Reusing Fusion-Time Spectral Reliability for Adaptive Fusion and Expert Routing in RGB-Infrared Object Detection

Thinking Economically: A Hierarchical Framework for Adaptive-Complexity Reasoning in LLMs

Towards Interactive Video World Modeling: Frontiers, Challenges, Benchmarks, and Future Trends

Expected Value Alignment for Generative Reward Modeling in Formal Mathematics Verification

Fairness in two-player zero-sum games with bandit feedback

When Data Is Scarce: Scaling Sparse Language Models with Repeated Training

Lagrangian Perturbation Diffusion Steering: Latent Reinforcement Learning for Generative Policies

CoSTL: Comprehensive Spatial-Temporal Representation Learning for Moment Retrieval and Highlight Detection

Not All Explanations Simulate Equally: Comparing Verbalized Feature Attributions and Self-Generated Rationales

Reasoning4Sciences: Bridging Reasoning Language Models to All Scientific Branches

From Outliers to Errors: Auditing Pali-to-English LLM Translations with Multi-Reference Adjudication

HakushoBench: A Japanese Chart and Table VQA Benchmark from Governmental White Papers

Local MixVR: Breaking the Communication-Sample Dependence in Distributed Learning

STARFISH: faST Accuracy Recovery in pruned networks From Internal State Healing

From Reward-Free Representations to Preferences: Rethinking Offline Preference-Based Reinforcement Learning

A Per-Component Diagnostic Protocol for Neural HJB-PIDE Solvers under Control-Dependent Lévy Jumps

Rank-Aware Quantile Activation for Motion-Robust Crop Segmentation in UAV Imagery

HASTE: Hardware-Aware Dynamic Sparse Training for Large Output Spaces

Tether-Aware Dynamic Collision Avoidance for USV-HROV Systems

Temporal Evidence Routing with Structured Visual Evidence for TimeLogicQA

Adaptive Dense Evidence Refinement for Video Relational Reasoning for VRR-QA Challenge

Soft-NBCE: Entropy-Weighted Chunk Fusion for Long-Context

MiCU: End-to-End Smart Home Command Understanding with Large Language Model