arXivDaily: daily arXiv academic digest, updated Monday through Friday
2605.29366 2026-05-29 cs.LG

Solving Integer Linear Programming with Parallel Tempering

Kyuil Sim, Sanghyeok Choi, Jinkyoo Park

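Affiliations: KAIST; University of Edinburgh

AI summary: Proposes a solver-free, sampling-based optimization framework for integer linear programming that uses a locally-balanced proposal and parallel tempering to explore discrete feasible regions directly, outperforming or matching classical solvers on several benchmarks.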

Comments Preprint. Code available at https://github.com/ski-sim/ILP-with-ParallelTempering

Abstract

Integer Linear Programming (ILP) serves as a versatile framework for modeling a wide range of combinatorial optimization problems, typically addressed by sophisticated exact solvers or heuristics. While learning-based approaches have recently shown their effectiveness, they suffer from poor generalization to out-of-distribution instances and inherent dependence on external solvers. In this work, we propose a solver-free, sampling-based optimization framework for ILP that directly explores discrete feasible regions without training or external solvers. Exploiting the linear structure of ILP, we employ a Locally-Balanced Proposal to construct a transition kernel, thereby avoiding the gradient approximation. To overcome the highly multimodal nature of ILP energy landscapes, we integrate Parallel Tempering. In addition to standard temperature tempering, we introduce penalty tempering, which modulates constraint barriers while preserving the objective landscape over feasible solutions. Empirically, our method consistently outperforms SCIP across all four benchmarks, matches or exceeds Gurobi on two of four tasks within a 200-second budget, and is substantially more robust to distribution shift than learning-based methods. Furthermore, on MIPLIB 2017 instances, our framework remains competitive with classical solvers without any problem-specific tuning.
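
As a rough illustration of the parallel tempering idea mentioned in the abstract, the sketch below runs several Metropolis chains at different temperatures on a penalized toy ILP and occasionally swaps neighbouring chains. It is not the authors' implementation: the single-variable +/-1 move is a plain stand-in for the locally-balanced proposal, and the toy problem, the function names, the penalty weight, and the temperature ladder are all assumptions made for this example.

```python
import math
import random


def violation(x, A, b):
    """Total amount by which x violates the constraints A x <= b."""
    return sum(max(0, sum(a * xi for a, xi in zip(row, x)) - bi)
               for row, bi in zip(A, b))


def energy(x, c, A, b, penalty):
    """Objective c.x plus a linear penalty on constraint violation."""
    return sum(ci * xi for ci, xi in zip(c, x)) + penalty * violation(x, A, b)


def parallel_tempering(c, A, b, lo, hi, temps, penalty=10.0, steps=2000, seed=0):
    """Minimize c.x over integer x in [lo, hi]^n subject to A x <= b."""
    rng = random.Random(seed)
    n = len(c)
    chains = [[lo] * n for _ in temps]  # one replica per temperature
    best, best_obj = None, math.inf
    for _ in range(steps):
        # Record the best feasible point currently held by any replica.
        for x in chains:
            obj = sum(ci * xi for ci, xi in zip(c, x))
            if violation(x, A, b) == 0 and obj < best_obj:
                best, best_obj = list(x), obj
        # Within-chain Metropolis move: change one variable by +/-1.
        for k, temp in enumerate(temps):
            x = chains[k]
            i = rng.randrange(n)
            value = x[i] + rng.choice((-1, 1))
            if value < lo or value > hi:
                continue  # stay inside the variable bounds
            y = x[:i] + [value] + x[i + 1:]
            delta = energy(y, c, A, b, penalty) - energy(x, c, A, b, penalty)
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                chains[k] = y
        # Replica exchange between one random pair of adjacent temperatures.
        k = rng.randrange(len(temps) - 1)
        e_cold = energy(chains[k], c, A, b, penalty)
        e_hot = energy(chains[k + 1], c, A, b, penalty)
        log_accept = (1 / temps[k] - 1 / temps[k + 1]) * (e_cold - e_hot)
        if log_accept >= 0 or rng.random() < math.exp(log_accept):
            chains[k], chains[k + 1] = chains[k + 1], chains[k]
    return best, best_obj


# Toy problem (assumed): maximize x0 + 2*x1 with x0 + x1 <= 4 and 0 <= xi <= 3.
c, A, b = [-1, -2], [[1, 1]], [4]
best, best_obj = parallel_tempering(c, A, b, lo=0, hi=3, temps=[0.5, 1.0, 2.0, 4.0])
print(best, best_obj)
```

In this sketch only the temperature differs between replicas. The penalty tempering described in the abstract would presumably correspond to varying `penalty` across replicas as well, but that variant is not shown here.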

2605.29360 2026-05-29 cs.AI

MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models

Tianzhuo Yang, Zihan Shen, Zirui Mi, Zhaoyi Zhang, Jiayi Zhou, Jiaming Ji, Juntao Dai, Jiawei Chen, Boyuan Chen, Yaodong Yang

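Affiliations: Institute for Artificial Intelligence, Peking University

AI summary: Proposes the MiraBench benchmark, which evaluates the action-conditioned reliability of robotic world models at three levels (physics adherence, action-following fidelity, and optimism bias detection), and finds that visual fidelity does not reflect action fidelity, that scaling up models does not guarantee better action following, and that optimism bias is pervasive.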

Abstract

Action-conditioned world models are increasingly used as scalable simulators for robot learning, yet current evaluations provide limited evidence that their predictions are reliable under the actions they condition on. Existing benchmarks largely emphasize visual fidelity, leaving unclear whether predicted futures are physically plausible, faithful to commanded actions, and calibrated to failure when actions should not succeed. We introduce MiraBench, a hierarchical benchmark that defines action-conditioned reliability as a core evaluation target for robotic world models. MiraBench decomposes this target into three progressively demanding levels: Physics Adherence, which evaluates reference-free physical consistency; Action-Following Fidelity, which measures whether predictions respect task-relevant action inputs; and Optimism Bias Detection, which probes the tendency to predict successful outcomes under failure-inducing actions. To support this evaluation, we curate a human-annotated corpus with over 16,000 judgments across tasks, failure categories, and leading world models. We evaluate 12 representative model configurations spanning vector-conditioned robotic world models, text-conditioned generative world models, open-weight systems, closed-source systems, and multiple model scales. Across this broad model landscape, MiraBench reveals three central findings: visual fidelity is a poor proxy for action fidelity; increasing model scale does not reliably improve action following; and optimism bias is pervasive across current systems. By shifting evaluation from appearance to action-conditioned reliability, MiraBench provides a diagnostic foundation for assessing and improving robotic world models as faithful simulators.

2605.29358 2026-05-29 cs.AI

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, Alex Tamkin, Esin Durmus, Tristan Hume, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, Tom Henighan

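Affiliations: Anthropic

AI summary: Uses sparse autoencoders to extract interpretable features from the production-scale language model Claude 3 Sonnet, confirming that dictionary learning scales to large models, and analyzes the features' multilingual and multimodal properties and their causal influence on model behavior.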

Abstract

We demonstrate that sparse autoencoders can extract interpretable features from Claude 3 Sonnet, a production-scale language model, addressing the open question of whether dictionary learning methods scale beyond small transformers. We trained sparse autoencoders with up to 34 million features on the model's middle layer residual stream, using scaling laws to guide hyperparameter selection. The resulting features are multilingual and multimodal (generalizing to images despite text-only training), respond to both concrete instances and abstract discussions of concepts, and can be used to steer model behavior in ways consistent with their interpretations. We find features corresponding to famous entities and locations, as well as more abstract concepts like sarcasm or errors in code. We also identify features relevant to ways in which language models might cause harm--including features representing deception, power-seeking, sycophancy, and bias--and show that these causally influence model outputs when manipulated. Additionally, we conduct analyses of feature interpretability, geometry, and computational function. However, significant limitations remain: our suite of features is incomplete, and we lack rigorous methods for evaluating whether our features faithfully capture model computations.

2605.29357 2026-05-29 cs.AI cs.LG cs.PL

PassNet: Scaling Large Language Models for Graph Compiler Pass Generation

Yiqun Liu, Yingsheng Wu, Ruqi Yang, Enrong Zheng, Honglei Qiu, Sijun He, Tai Liang, Jingjing Wu, Yuhan Zhou, Yiwei Zhang, Dongyan Chen, Weihan Yi, Xinqi Li, Siqi Bao

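Affiliations: Baidu, Inc.

AI summary: To address the poor performance of default compiler optimization on long-tail subgraphs, proposes the PassNet ecosystem, comprising a large-scale dataset and a benchmark, and shows that fine-tuning a small model on a small number of trajectories approaches frontier-model performance.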

Comments Code and data available at https://github.com/PaddlePaddle/PassNet

Abstract

Modern tensor compilers such as TorchInductor deliver substantial speedups on mainstream models, yet face a systematic performance ceiling on long-tail workloads -- our profiling shows that 43% of real-world subgraphs experience end-to-end slowdowns under default compilation. While LLMs offer a path toward automated optimization, existing efforts focus on standalone kernel generation. We argue that pass generation -- where LLMs author structured graph transformations that integrate directly into compiler pipelines -- is the more appropriate abstraction. We propose PassNet, the first large-scale ecosystem for LLM-based compiler pass generation, comprising: (1) PassNet-Dataset, over 18K unique computational graphs from 100K real-world models; and (2) PassBench, 200 curated long-tail fusible tasks (comprising 2,060 subgraphs in total) evaluated under the Error-aware Speedup Score (ES_t) -- a metric unifying correctness, stability, and performance -- with layered integrity defenses against systematic LLM exploitation. Experiments reveal that PassBench is both highly discriminative and genuinely unsaturated: the best frontier model trails TorchInductor by 37% in aggregate, yet on individual subgraphs LLMs achieve up to 3x speedup over the same compiler -- indicating that the bottleneck is consistency, not capability. Fine-tuning a small model on merely ~4K PassNet trajectories yields a 2.67x improvement approaching frontier-model performance, demonstrating substantial headroom and validating PassNet as live training infrastructure for advancing LLM-driven compiler optimization. All data, benchmarks, and tooling are publicly available.

2605.29351 2026-05-29 cs.LG math.DS stat.ML

Attention as In-Context Empirical Bayes: A Two-Stage View via Particle Dynamics

Matthew Smart, Soumya Ganguly, Nilava Metya, Alexandre V. Morozov, Anirvan M. Sengupta

Affiliations: Lewis-Sigler Institute for Integrative Genomics; Princeton University; Department of Mathematics; Rutgers University; Department of Physics and Astronomy; Center for Computational Quantum Physics and Center for Computational Mathematics; Flatiron Institute; Simons Foundation

AI summary: Uses particle dynamics to interpret minimal attention-only transformers as a two-stage empirical Bayes procedure, reveals the statistical roles of depth and attention residuals, and shows that effective denoising can be achieved without an explicit noise schedule.

Comments 52 pages, 5 figures

Abstract

We study minimal attention-only transformers under all-token corruption and show they admit a two-stage empirical Bayes interpretation. A single attention step computes a kernel-weighted posterior mean with respect to the empirical distribution defined by the context. Depth refines this distribution through particle dynamics (Stage 1), while a long-range skip-connection carries the noisy input as a query for posterior inference (Stage 2), revealing distinct statistical roles for depth and attention residuals. The framework isolates a minimal setting in which the context itself induces a depth-dependent energy landscape governing in-context inference. We show that effective denoising can emerge without an explicit noise schedule: a fixed kernel bandwidth and finite integration horizon suffice, yielding a principled depth-noise relationship. We further establish a posterior-mean recovery guarantee for a class of well-behaved priors, where the empirical estimator converges to the Bayes-optimal predictor under asymptotic conditions. Connecting these dynamics to reverse-diffusion limits, our results provide a statistical interpretation of attention as in-context inference via sample-based posterior estimation, without explicit density modeling.
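
The statement that a single attention step computes a kernel-weighted posterior mean over the context can be illustrated with a small sketch. It assumes scalar tokens and a Gaussian kernel with a fixed bandwidth, so the softmax weights coincide with posterior weights under an empirical prior on the context points; the function name and these modelling choices are assumptions for illustration, not the paper's construction.

```python
import math


def kernel_posterior_mean(query, context, bandwidth):
    """Softmax-weighted mean of the context, with Gaussian-kernel logits.

    With logits -(query - x)^2 / (2 * bandwidth^2), the softmax weights are
    the posterior probabilities of each context point given a noisy query,
    under a uniform empirical prior on the context and Gaussian noise.
    """
    logits = [-(query - x) ** 2 / (2.0 * bandwidth ** 2) for x in context]
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(value - top) for value in logits]
    total = sum(weights)
    return sum(w * x for w, x in zip(weights, context)) / total


# A query close to one context point is pulled almost entirely toward it.
print(kernel_posterior_mean(0.1, [0.0, 10.0], bandwidth=0.5))
```

A small bandwidth makes the estimate snap to the nearest context point, while a large bandwidth averages over the whole context.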

2605.29350 2026-05-29 cs.AI

ConMoE: Expert-Pool Consolidation via Prototype Reassignment for MoE Compression

Yilun Yao, Jiaming Pan, Elsie Dai, Peizhuang Cong, Yaoming Li, Tong Yang

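Affiliations: Peking University

AI summary: Proposes ConMoE, a training-free MoE compression method that selects retained expert prototypes using calibration-based contribution and replaceability signals and deterministically remaps original expert calls, matching or outperforming strong baselines on several MoE language models.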

Comments 12 pages, 3 figures, 5 tables

Abstract

Mixture-of-Experts (MoE) language models reduce per-token computation but still require storing and serving all experts, making deployment memory-intensive. Existing post-training compression methods mainly shrink this cost by pruning experts or merging their weights. We formulate post-training MoE compression as expert-pool consolidation: retaining a smaller set of pretrained experts as reusable prototypes and deterministically remapping each original expert reference to one selected prototype. This view separates the reduced expert pool from the reuse structure that represents the original expert slots, and allows prototype sharing within local layer scopes while preserving the original router interface. We propose ConMoE, a train-free prototype remapping framework that selects retained experts using calibration-based contribution and replaceability signals, then redirects original expert calls to the selected prototypes without weight updates or post-compression fine-tuning. Experiments on three pretrained MoE language models show that ConMoE matches or outperforms strong pruning and merging baselines in several settings, achieving the best average score on deepseek-moe-16b-base at both 25% and 50% routed-expert reduction, while remaining competitive on Qwen3-30B-A3B and OLMoE-1B-7B-0125. Ablations indicate that deterministic reassignment is the most stable component, whereas broader cross-layer sharing and post-hoc weight fusion are model-dependent.
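
The consolidation idea, keeping a subset of experts as prototypes and deterministically redirecting every original expert slot to one of them, can be sketched as follows. The `scores` list and the `similarity` matrix are hypothetical stand-ins for the calibration-based contribution and replaceability signals, whose exact definitions the abstract does not give, and `consolidate` is a name invented for this example.

```python
def consolidate(scores, similarity, keep):
    """Pick `keep` experts as prototypes and remap every expert slot.

    scores[e]        : assumed contribution score of expert e
    similarity[e][p] : assumed replaceability of expert e by expert p
    Returns (retained expert ids, mapping from original id to prototype id).
    """
    n = len(scores)
    ranked = sorted(range(n), key=lambda e: -scores[e])
    retained = sorted(ranked[:keep])
    remap = {}
    for e in range(n):
        if e in retained:
            remap[e] = e  # a retained expert serves its own slot
        else:
            # Redirect a dropped expert to its most similar prototype.
            remap[e] = max(retained, key=lambda p: similarity[e][p])
    return retained, remap


scores = [0.9, 0.1, 0.8, 0.2]
similarity = [
    [1.0, 0.3, 0.2, 0.6],
    [0.3, 1.0, 0.7, 0.1],
    [0.2, 0.7, 1.0, 0.4],
    [0.6, 0.1, 0.4, 1.0],
]
retained, remap = consolidate(scores, similarity, keep=2)
print(retained, remap)
```

Because the router still emits the original expert ids and only the lookup table changes, a scheme of this shape leaves the router interface untouched, which matches the property the abstract emphasizes.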

2605.29340 2026-05-29 cs.CL

A Study on Question-Answer Dataset for LLM Safety Evaluation with a Focus on Illegal Activities

Kenji Imamura, Masao Ideuchi, Atsushi Fujita

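Affiliations: National Institute of Information and Communications Technology

AI summary: Based on a manual analysis of the AnswerCarefully dataset, proposes additional information, methods for creating question-answer examples, and an evaluation rubric for assessing LLM safety with respect to illegal activities.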

Comments 10 pages, 1 figure

Abstract

In this paper, we discuss a question-answer dataset for LLM safety evaluation, with a focus on illegal activities. Specifically, on the basis of a manual analysis of AnswerCarefully, we introduce several additional pieces of information, methods for creating question-answer examples, and a rubric for evaluating LLM-generated responses. The outcomes of this study are intended to be shared with the "JAI-Trust" project.

2605.29339 2026-05-29 cs.CV

DMC-CF: Dynamic Multimodal CounterFactual QA benchmark for Causal Reasoning

Junzhe Zhang, Huixuan Zhang, Guirong Wang, Xingyao Zhang, Pei Liu, Lin Qu, Hu Wei, Xiaojun Wan

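Affiliations: Wangxuan Institute of Computer Technology, Peking University; Alibaba Group

AI summary: To address existing causal reasoning datasets being limited in scale or built from non-realistic data, proposes DMC-CF-Static, a large-scale multimodal causal counterfactual reasoning benchmark based on real-world videos, and uses a dynamic graph intervention framework to build the dynamic evaluation benchmark DMC-CF-Dynamic. Experiments show that the causal reasoning ability of current multimodal large models in real-world scenarios still needs substantial improvement.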

Abstract

With the rapid advancement of multimodal large language models (MLLMs), models have demonstrated increasingly powerful multimodal capabilities. However, whether MLLMs trained through statistical learning can truly understand the causal relationships underlying the real world remains a key research question. In recent years, numerous multimodal causal reasoning datasets have been proposed. Nevertheless, these datasets are either limited in scale or constructed from synthetic images and videos, cartoon-based content, or other non-realistic multimodal sources. To address these limitations, we collect real-world videos and construct DMC-CF-Static, a large-scale benchmark for multimodal causal counterfactual reasoning. Furthermore, to mitigate issues such as data contamination in traditional static evaluation, we represent causal events using causal graphs and propose the Dynamic Graph Intervention (DGI) framework to build the dynamic evaluation benchmark DMC-CF-Dynamic from DMC-CF-Static. Experimental results on the overall DMC-CF, which includes both static and dynamic evaluation benchmarks, demonstrate that the multimodal causal reasoning capabilities of current multimodal large language models in real-world scenarios still require substantial improvement.

2605.29336 2026-05-29 cs.CL

Enhancing Factuality through Consensus and Consistency in Summarization Using Minimum Bayes Risk Decoding

Riza Setiawan Soetedjo, Yusuke Sakai, Hidetaka Kamigaito, Jingun Kwon, Manabu Okumura, Taro Watanabe

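Affiliations: Nara Institute of Science and Technology; Chungnam National University; Institute of Science Tokyo

AI summary: Proposes ConSUM, which uses Minimum Bayes Risk decoding to establish consensus among candidate summaries and combines it with a metric of consistency with the source document for reranking, to improve the factuality of summaries.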

Comments Accepted to ACL 2026 Findings

Abstract

Improving the quality of model-generated summaries, especially factuality (the accuracy of a summary with respect to its source content), remains a challenge. While reranking could select the optimal output from multiple generated candidates, it is limited to using only the source as guidance, resulting in unreliable summaries. To address this limitation, we propose ConSUM, which reranks candidate summaries by considering two factors: consistency with the source document and consensus among the other candidates. Consensus is established using Minimum Bayes Risk (MBR) decoding over the set of generated summaries, while consistency is ensured by employing factuality-aware metrics that compare the summary against the source. Rigorous testing demonstrates that our system is competitive with existing methods, with human evaluations further confirming that its generated summaries are preferred over those from other systems. Our code is available at https://github.com/naist-nlp/ConSUM.
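
A minimal sketch of reranking by consensus plus consistency is given below. It is not the ConSUM code: token-overlap F1 stands in for the MBR utility, the fraction of summary tokens found in the source stands in for a factuality-aware metric, and the equal weighting `alpha=0.5` and all function names are assumptions for illustration.

```python
def overlap_f1(a, b):
    """Token-set F1 between two texts (stand-in MBR utility)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    common = len(ta & tb)
    if common == 0:
        return 0.0
    precision, recall = common / len(ta), common / len(tb)
    return 2 * precision * recall / (precision + recall)


def source_support(summary, source):
    """Fraction of summary tokens that occur in the source (stand-in metric)."""
    tokens = summary.lower().split()
    source_tokens = set(source.lower().split())
    if not tokens:
        return 0.0
    return sum(t in source_tokens for t in tokens) / len(tokens)


def rerank(candidates, source, alpha=0.5):
    """Return the candidate with the best mix of consensus and consistency."""
    scored = []
    for i, cand in enumerate(candidates):
        others = [o for j, o in enumerate(candidates) if j != i]
        # MBR-style consensus: expected utility against the other candidates.
        consensus = (sum(overlap_f1(cand, o) for o in others) / len(others)
                     if others else 0.0)
        consistency = source_support(cand, source)
        scored.append((alpha * consensus + (1 - alpha) * consistency, cand))
    return max(scored, key=lambda pair: pair[0])[1]


source = "the cat sat on the mat"
candidates = ["the cat sat on the mat", "the cat sat on a mat", "a dog ran in the park"]
print(rerank(candidates, source))
```

The outlier candidate scores poorly on both terms, which is the intuition behind combining agreement among candidates with agreement with the source.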

2605.29330 2026-05-29 cs.CV

EarthShift: a benchmark for measuring robustness to real-world distribution shifts in Earth observation

Kelsey Doerksen, Hannah Kerner

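Affiliations: School of Computing and Augmented Intelligence; Arizona State University

AI summary: Proposes the EarthShift benchmark, which uses paired datasets from multiple sources to evaluate the robustness of geospatial foundation models under real-world distribution shifts in time, geography, scale, and sensor, and finds that model performance drops by 15-20% on average.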

Abstract

Current Earth observation benchmarks focus on measuring performance on diverse tasks and applications, typically measuring generalization in-distribution. But when models are deployed, they must generalize to myriad out-of-distribution scenarios, such as new time periods, geographies, scales, and sensors. We introduce EarthShift: the first public testbed for benchmarking robustness across multiple realistic distribution shifts encountered in remote sensing. EarthShift enables users to measure distributional robustness by comparing performance in- and out-of-distribution using paired datasets from different sources, temporal windows, geographic locations, and sensors. Our experiments on 8 geospatial foundation models (GFMs) and 11 tasks covering 5 shift types show that GFMs consistently perform 15-20% worse out-of-distribution on average regardless of model architecture, size, pre-training or fine-tuning strategy. We show that GFM robustness is similar to that of generic vision foundation models, and even fully-supervised models. This highlights a need for future research to strive for improvements in distributional robustness, not just performance, which can be benchmarked using EarthShift. We release our code and datasets to provide a testbed to guide future work to create foundation models that are robust and reliable in real-world applications. Code and data for EarthShift are available at: https://earthshift.github.io

2605.29327 2026-05-29 cs.CL cs.LG

Reasoning-preserved Efficient Distillation of Large Language Models via Activation-aware Initialization

Junlin He, Yihong Tang, Tong Nie, Guilong Li, Binyu Yang, Jinxiao Du, Lijun Sun, Wei Ma

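Affiliations: The Hong Kong Polytechnic University, Hong Kong SAR, China; McGill University, Montreal, QC, Canada

AI summary: To address the severe degradation of multi-step reasoning caused by efficient distillation (reasoning collapse), proposes RED, which uses activation-aware initialization to initialize projection matrices as channel-selection matrices, theoretically mitigating effective-rank collapse and recovering reasoning ability while retaining efficient training and general performance.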

Abstract

Efficient Distillation (EDistill) compresses large language models (LLMs) by structurally pruning parameters and tuning lightweight modules with high training efficiency. Although these EDistilled LLMs achieve state-of-the-art (SOTA) performance on general ability benchmarks relative to similarly sized LLMs, we identify a severe degradation in their multi-step reasoning ability, which we term reasoning collapse. We systematically analyze the geometric origins of reasoning collapse and show that the SOTA EDistill method based on width-reducing projection matrices suffers from eRank collapse, in which the effective rank (eRank) of hidden representations drops. We theoretically explain how singular values of randomly initialized projection matrices become unevenly distributed, leading to eRank collapse and thus token indistinguishability. To address this issue, we propose RED (Reasoning-preserved Efficient Distillation) for LLMs, which introduces activation-aware initialization to initialize projection matrices as channel-selection matrices, thus theoretically mitigating eRank collapse. Experiments on Llama and Qwen series demonstrate that RED substantially recovers reasoning while maintaining high training efficiency and SOTA general ability.
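
To make the notion of effective rank concrete, the sketch below computes it from a list of singular values using the common entropy-based definition (the exponential of the Shannon entropy of the normalized singular values). Whether the paper uses exactly this definition is an assumption, and `effective_rank` is a name chosen for this example.

```python
import math


def effective_rank(singular_values):
    """exp(entropy) of the singular values normalized to sum to one."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


# Evenly spread singular values give an effective rank near the full rank,
# while one dominant singular value drives it toward 1.
print(effective_rank([1.0, 1.0, 1.0, 1.0]))
print(effective_rank([100.0, 0.01, 0.01, 0.01]))
```

A channel-selection matrix, whose rows are distinct rows of the identity, has all singular values equal to 1, so under this definition its effective rank equals the number of selected channels. That is consistent with the abstract's claim that such an initialization mitigates eRank collapse.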

2605.29326 2026-05-29 cs.LG

NeuroEdge: Real-Time Hand Gesture Recognition with High-Density EMG Using Deep Learning at the Edge

Peter Chudinov, Zhenyu Lin, Jay Motamarry, Srihita Panati, Xiaorong Zhang, Zhuwei Qin

Affiliations: San Francisco State University; Department of Biology; School of Engineering in Computer Engineering; College of San Mateo; Contra Costa College

AI summary: Proposes the NeuroEdge system, which uses wireless HD-EMG streaming and a lightweight CNN inference engine to perform real-time gesture recognition on microcontrollers, with 90% accuracy and 83 ms latency.

Abstract

High-density electromyography (HD-EMG) has emerged as a powerful modality for decoding fine-grained neuromuscular activity, enabling real-time neural-machine interfaces (NMIs) for applications such as prosthetic control, rehabilitation, and augmented interaction. While deep learning approaches such as convolutional neural networks (CNNs) have demonstrated high classification accuracy for EMG-based gesture recognition, their deployment on embedded hardware remains a major challenge due to computational and memory constraints. This paper presents NeuroEdge, a real-time HD-EMG-based NMI system that performs gesture recognition entirely on resource-constrained microcontrollers. The system features two custom-designed modules: the HD-EMG StreamBridge, a wireless communication interface that streams raw HD-EMG data from a Quattrocento amplifier to an ESP32 microcontroller; and the EdgeDL Inference Engine, a lightweight deep learning framework executing on a Sony Spresense microcontroller. A compact 1-dimensional CNN optimized for embedded inference processes sliding windows of EMG data in real time. Data streaming and inference are pipelined and synchronized through an architecture that utilizes Direct Memory Access (DMA) for data transfer and Serial Peripheral Interface (SPI) burst communication between the ESP32 and Spresense, ensuring low-latency performance. Experimental results show that NeuroEdge achieves a real-time classification accuracy of 90% across seven hand gestures, with a total average latency of 83 ms using 192 channels of HD-EMG recorded from the forearm. Our system demonstrates the feasibility of deploying complex HD-EMG-based gesture recognition on microcontroller-based edge devices, bridging the gap between high-resolution biosignal acquisition and deep learning-based embedded inference for next-generation NMIs.

2605.29325 2026-05-29 cs.CV

Multi-Stage VLM Pipeline for Zero-Shot Traffic Accident Understanding

Fumiya Tatematsu, Fumihiko Takahashi

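Affiliations: GO Drive Inc

AI summary: Proposes a three-stage VLM pipeline that performs zero-shot traffic accident prediction with frozen Qwen3-VL-32B-Instruct and 235B MoE models, and wins the CVPR 2026 ACCIDENT challenge through a 9:1 blend and alignment to vehicle detections.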

Comments Accepted at the AUTOPILOT Workshop, CVPR 2026 (non-archival). Workshop Paper ID 13. Code: https://github.com/fuumin621/cvpr2026-accident-1st-place-solution

Abstract

We present the 1st-place solution to the ACCIDENT challenge at the CVPR 2026 AUTOPILOT Workshop, which asks for zero-shot prediction of accident timing, impact centroid, and collision type from CCTV footage. On a frozen Qwen3-VL-32B-Instruct checkpoint we build a three-stage pipeline (full-video joint prediction, time refinement, and single-frame grounding of the impact centroid), run the same pipeline a second time on a 235B Mixture-of-Experts sibling, blend the two outputs 9:1, and finally snap each predicted point onto the nearest vehicle detection. The final system reaches Public LB 0.55469 / Private LB 0.57080, roughly +0.21 over the strongest host baseline (Molmo-7B, 0.358) and wins the challenge. We ablate each component, report the negative results that shaped the final design, and release the code at https://github.com/fuumin621/cvpr2026-accident-1st-place-solution.
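
The last two steps of the pipeline, blending two predictions 9:1 and snapping the result onto the nearest vehicle detection, might look roughly like the sketch below. The abstract does not say how the blend or the snap is defined, so treating the blend as a weighted average of point coordinates and snapping to the centre of the nearest box are both assumptions, as are the function names and the `(x1, y1, x2, y2)` box format.

```python
import math


def blend(point_main, point_aux, weight_main=0.9):
    """Weighted average of two predicted points (9:1 by default)."""
    return tuple(weight_main * a + (1 - weight_main) * b
                 for a, b in zip(point_main, point_aux))


def snap_to_nearest_box(point, boxes):
    """Move a point to the centre of the closest (x1, y1, x2, y2) box."""
    if not boxes:
        return point  # nothing to snap to, keep the raw prediction
    centers = [((x1 + x2) / 2, (y1 + y2) / 2) for x1, y1, x2, y2 in boxes]
    return min(centers, key=lambda center: math.dist(point, center))


blended = blend((100.0, 200.0), (200.0, 100.0))
boxes = [(90, 170, 130, 210), (300, 300, 340, 340)]
print(blended, snap_to_nearest_box(blended, boxes))
```

Snapping acts as a cheap prior: an impact point is expected to lie on a vehicle, so a small localization error is absorbed by the detector output.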

2605.29324 2026-05-29 cs.CL cs.CV

STAMP: Training Explicit Memory for Mobile GUI Agents in Controllable and Scalable Virtual Environments

Junyang Wang, Haiyang Xu, Xi Zhang, Zhaoqing Zhu, Ming Yan, Jieping Ye, Jitao Sang

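Affiliations: Tongyi AI Lab, Alibaba Group; Beijing Jiaotong University

AI summary: Proposes the STAMP framework, which injects deterministic memory variables into controllable virtual environments to generate verifiable supervised data and support online reinforcement learning, addressing the failures of mobile GUI agents on long-horizon tasks caused by limited context windows and the lack of explicit memory.

Comments 24 pages, 4 figures, 21 tables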

Abstract

Mobile GUI agents excel at immediate reactive control but frequently fail in realistic, long-horizon tasks that require memory. This failure stems from a fundamental conflict between limited context windows and token-heavy screenshots. To save the limited context, agents must progressively discard older visual history, permanently losing crucial transient information. Furthermore, existing action-centric datasets fail to teach agents what or when to explicitly memorize, and augmenting static real-world data is prohibitively expensive and lacks interactive verification. To resolve this, we present STAMP, a framework that trains explicit memory in mobile agents through controllable virtual environments, where deterministic memory variables are programmatically injected into synthesized tasks to control what must be memorized, when it should be encoded, and when it must later be retrieved, thereby producing verifiable supervised data at scale and enabling online reinforcement learning through environment-driven reward feedback. Evaluated on our newly introduced Memory-World benchmark, the resulting Stamp-GUI agent achieves state-of-the-art performance among GUI-specialized models and sets a new high-water mark on that benchmark, demonstrating exceptional memory accuracy and task resilience while maintaining strong general mobile navigation capabilities.

2605.29319 2026-05-29 cs.CL

Rethinking Stepwise Model Routing: A Cost-Efficient Table Reasoning Perspective

Shenghao Ye, Yuxiang Wang, Yu Guo, Dong Jin, Shuangwu Chen, Jian Yang

Affiliations: University of Science and Technology of China; The University of Melbourne; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center

AI summary: Proposes the EcoTab framework, which separately estimates the uncertainty of table tokens and text tokens and maps it to next-step failure risk, achieving a better balance between accuracy and efficiency in table reasoning.

Comments 17 pages, 15 figures, submitted to EMNLP 2026

Abstract

Large Reasoning Models (LRMs) achieve strong performance on table reasoning tasks but incur substantial inference cost due to long reasoning traces. Stepwise model routing mitigates this issue by dynamically assigning reasoning steps to smaller or larger models. However, stepwise model routing for table reasoning remains underexplored. Through empirical analysis, we find that reasoning steps involving tables contain two types of tokens with distinct uncertainty distributions: table tokens grounded in table structure, such as cell values and headers, and text tokens representing surrounding natural-language reasoning. The uncertainty of both token types is correlated with the risk that the model makes an error in the next reasoning step. However, existing methods fail to model them separately, leading to suboptimal routing decisions. To address this, we propose EcoTab, a table-aware stepwise routing framework for efficient table reasoning. At each reasoning step, EcoTab separately estimates the uncertainties of table tokens and text tokens, maps them to next-step failure risks for the small model, and combines the two risks for routing. Experiments on multiple table reasoning benchmarks show that EcoTab consistently outperforms strong baselines and achieves a better balance between accuracy and efficiency.
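
The per-step routing decision described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: using mean token entropy as the uncertainty measure, the logistic risk mapping with its coefficients, the noisy-OR combination, and the threshold are all assumptions introduced here, and every function name is hypothetical.

```python
import math

def mean_entropy(token_dists):
    """Average Shannon entropy (nats) over a list of next-token distributions."""
    if not token_dists:
        return 0.0
    total = 0.0
    for dist in token_dists:
        total += -sum(p * math.log(p) for p in dist if p > 0)
    return total / len(token_dists)

def risk(uncertainty, slope, bias):
    """Hypothetical logistic map from uncertainty to next-step failure risk."""
    return 1.0 / (1.0 + math.exp(-(slope * uncertainty + bias)))

def route_step(table_dists, text_dists, threshold=0.5):
    """Pick the large model when the small model's combined risk is too high."""
    # Table tokens and text tokens get separate risk maps (coefficients assumed).
    r_table = risk(mean_entropy(table_dists), slope=4.0, bias=-2.0)
    r_text = risk(mean_entropy(text_dists), slope=2.0, bias=-2.0)
    # Noisy-OR combination: the step is risky if either token type is risky.
    combined = 1.0 - (1.0 - r_table) * (1.0 - r_text)
    return ("large" if combined > threshold else "small"), combined
```

Keeping two separate risk maps is the point of the sketch: a step with confident prose but uncertain cell values can still be escalated to the larger model.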

2605.29316 2026-05-29 cs.CV

CapTalk: Text-Guided Stylization and Speech-Driven 3D Head Animation

CapTalk: 文本引导的风格化与语音驱动的3D头部动画

Xuangeng Chu, Yuan Gan, Ziteng Cui, Shuhong Liu, Jian Wang, Bing Zhou, Tatsuya Harada

发表机构 * The University of Tokyo(东京大学) Snap Research, Snap Inc. RIKEN AIP(理化学研究所AIP)

AI总结 提出CapTalk框架,通过文本描述控制说话风格和情感,结合语音驱动生成同步唇动和面部表情,支持动态情感变化。

详情
AI中文摘要

音频驱动的3D面部动画旨在从任意音频片段生成同步的唇部运动和生动的面部表情。现有方法虽能产生同步唇动,但通常依赖预定义的身份或风格潜在特征,限制了用户自由控制说话风格的能力。此外,将固定风格或身份应用于整个音频片段通常导致面部动画风格无法适应音频的情感内容。为解决这些挑战,我们重新审视风格与情感的纠缠,构建了一个包含风格和情感文本描述的大规模数据集,并提出了一种新颖的说话头生成框架,能够分别控制风格和情感。我们的模型以说话风格和角色情感的文本描述以及驱动音频流为输入,能够实时生成与描述高度同步的唇部运动和面部表情。此外,我们的模型在推理时支持动态情感控制,能够处理目标情感在语音过程中变化的情况。

英文摘要

Audio-driven 3D facial animation aims to generate synchronized lip movements and vivid facial expressions from arbitrary audio clips. While existing methods can produce synchronized lip motions, they often rely on predefined identity or style latent features, which limits users' ability to freely control speaking styles. Moreover, applying a fixed style or identity to an entire audio segment typically results in facial animation styles that do not adapt to the emotional content of the audio. To address these challenges, we revisit the entanglement between style and emotion, construct a large-scale dataset with textual descriptions of both style and emotion, and propose a novel talking head generation framework that enables separate control over style and emotion. Our model takes as input both textual descriptions of speaking style and character emotion, as well as the driving audio stream, enabling real-time generation of highly synchronized lip movements and facial expressions that match the provided descriptions. Furthermore, our model supports dynamic emotion control during inference, allowing it to handle scenarios where the target emotion changes throughout the speech.

2605.29313 2026-05-29 cs.CL

PatchBoard: Schema-Grounded State Mutation for Reliable and Auditable LLM Multi-Agent Collaboration

PatchBoard: 基于Schema的可靠且可审计的LLM多智能体协作状态变更框架

Shuyu Zhang, Yaqi Shi, Lu Wang

发表机构 * School of Computer Science and Technology(计算机科学与技术学院)

AI总结 提出PatchBoard架构,通过Schema约束的JSON Patch状态变更替代智能体间对话,实现可验证、可审计的多智能体协作,在ALFWorld任务中成功率84.6%,令牌消耗45.5k。

详情
AI中文摘要

LLM多智能体系统通常通过自然语言对话或松散结构的共享内存进行协调,这使得中间状态难以验证、归因和审计。我们引入PatchBoard,一种基于Schema的协作架构,用经过验证的JSON Patch变更替代智能体间对话,作用于共享结构化状态。一个架构师(Architect)智能体构建任务特定的Schema和工作流规则,而确定性内核在事务性提交之前,根据Schema约束、角色特定的写入合约和运行时不变量验证每个提议的状态变更。在630个匹配的ALFWorld场景中,PatchBoard实现了84.6%的成功率,而LangGraph为30.8%,Flock为61.6%,同时每个成功任务的令牌消耗降至45.5k,而LangGraph和Flock分别为368.3k和64.2k。

英文摘要

LLM multi-agent systems often coordinate through natural-language dialogue or loosely structured shared memory, making intermediate state difficult to validate, attribute, and audit. We introduce PatchBoard, a schema-grounded collaboration architecture that replaces inter-agent dialogue with validated JSON Patch mutations over a shared structured state. An Architect agent constructs a task-specific schema and workflow rules, while a deterministic kernel validates each proposed state mutation against schema constraints, role-specific write contracts, and runtime invariants before committing it transactionally. On 630 matched ALFWorld episodes, PatchBoard achieves an 84.6% success rate, compared with 30.8% for LangGraph and 61.6% for Flock, while reducing tokens per successful task to 45.5k, compared with 368.3k and 64.2k, respectively.
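
To make the validate-then-commit idea in the abstract above concrete, here is a minimal sketch of a deterministic kernel. It is an illustration under stated assumptions, not PatchBoard's code: the schema, the role write contracts, the runtime invariant, and the `commit` function are all hypothetical, and only the `add` and `replace` operations on top-level paths are handled.

```python
import copy

# Hypothetical task schema and role-specific write contracts.
SCHEMA = {"location": str, "inventory": list, "done": bool}
WRITE_CONTRACTS = {"navigator": {"/location"}, "executor": {"/inventory", "/done"}}

def commit(state, role, patch):
    """Validate a list of JSON-Patch-style ops and apply them all or not at all."""
    draft = copy.deepcopy(state)  # work on a copy so a failure leaves state untouched
    for op in patch:
        path, key = op["path"], op["path"].lstrip("/")
        if op["op"] not in ("add", "replace"):
            raise ValueError(f"unsupported op: {op['op']}")
        if path not in WRITE_CONTRACTS.get(role, set()):
            raise PermissionError(f"{role} may not write {path}")
        if key not in SCHEMA or not isinstance(op["value"], SCHEMA[key]):
            raise TypeError(f"value for {path} violates the schema")
        draft[key] = op["value"]
    # Example runtime invariant (assumed): a task cannot finish empty-handed.
    if draft["done"] and not draft["inventory"]:
        raise ValueError("invariant violated: task done with empty inventory")
    return draft  # the caller swaps in the new state only on success
```

Because every check runs against a copy, a rejected patch never leaves the shared state half-updated, which is what makes each committed mutation attributable to one role and one patch.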

2605.29310 2026-05-29 cs.AI cs.CL

Rubric-Guided Process Reward for Stepwise Model Routing

基于评分准则的逐步模型路由过程奖励

Shenghao Ye, Yu Guo, Zhengheng Li, Shuangwu Chen, Jian Yang

发表机构 * University of Science and Technology of China(中国科学技术大学) Southeast University(东南大学) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center(合肥综合性国家科学中心人工智能研究院)

AI总结 提出RoRo框架,通过收集路由轨迹、构建偏好对、训练Rubricor生成评估准则和Judge评分,结合过程与结果奖励优化路由策略,提升大型推理模型逐步路由的准确性和成本效率。

Comments 17 pages, 9 figures, submitted to EMNLP 2026

详情
AI中文摘要

逐步模型路由通过将每个推理步骤分配给合适的模型来提高大型推理模型(LRM)的效率。最近的方法将路由建模为顺序决策过程,并使用强化学习训练路由器。然而,尽管它们将路由建模为一个过程,但仍然使用结果奖励来监督路由器。这种奖励仅反映最终答案的正确性,未能评估中间路由决策,这可能会削弱性能和泛化能力。为了解决这一差距,我们提出了RoRo,一种基于评分准则的逐步模型路由过程奖励框架。RoRo首先收集多样化的路由轨迹,并基于结果、成本和过程质量构建偏好对。然后,它通过交替优化训练一个Rubricor来生成查询特定的评估准则,以及一个Judge来在此准则下对路由轨迹进行评分。由此产生的过程奖励与结果奖励相结合,通过GRPO优化路由策略。在五个推理基准上的实验,无论是在同族还是跨族设置下,都表明RoRo始终优于强基线,并实现了更好的准确性和成本权衡。

英文摘要

Stepwise model routing improves the efficiency of Large Reasoning Models (LRMs) by assigning each reasoning step to a suitable model. Recent methods formulate routing as a sequential decision process and train the router with reinforcement learning. However, although they model routing as a process, they still supervise the router with outcome rewards. Such rewards only reflect final answer correctness and fail to evaluate intermediate routing decisions, which can weaken performance and generalization. To address this gap, we propose RoRo, a rubric-guided process reward framework for stepwise model routing. RoRo first collects diverse routing trajectories and constructs preference pairs based on outcome, cost, and process quality. It then trains a Rubricor to generate a query-specific evaluation rubric and a Judge to score routing trajectories under this rubric through alternating optimization. The resulting process rewards are combined with outcome rewards to optimize the routing policy via GRPO. Experiments on five reasoning benchmarks under both same-family and cross-family settings show that RoRo consistently outperforms strong baselines and achieves better accuracy and cost trade-offs.
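
The abstract above does not say how process and outcome rewards are combined. The sketch below assumes a simple weighted sum (the `weight` parameter is hypothetical) and then applies the group-relative normalization that GRPO uses to turn rewards into advantages. It is an illustration, not RoRo's training code.

```python
import statistics

def combined_rewards(outcome, process, weight=0.5):
    """Add a weighted process reward to each trajectory's outcome reward (assumed form)."""
    return [o + weight * p for o, p in zip(outcome, process)]

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

With outcome rewards `[1, 1, 0, 0]` alone, the two correct trajectories would receive identical advantages. Adding process scores such as `[0.8, 0.2, 0.6, 0.0]` separates them, which is the motivation for supervising intermediate routing decisions.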

2605.29307 2026-05-29 cs.CL cs.AI cs.IR cs.LG

GrepSeek: Training Search Agents for Direct Corpus Interaction

GrepSeek:训练用于直接语料库交互的搜索代理

Alireza Salemi, Chang Zeng, Atharva Nijasure, Jui-Hui Chung, Razieh Rahimi, Fernando Diaz, Hamed Zamani

发表机构 * University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校) Princeton University(普林斯顿大学) Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出GrepSeek,一种通过两阶段训练(冷启动数据集+GRPO优化)和语义保持的分片并行执行引擎,训练紧凑型搜索代理直接与文本语料库交互(通过shell命令),在开放域问答中取得最优F1和精确匹配。

详情
AI中文摘要

大型语言模型(LLM)搜索代理通过多轮推理和信息检索,在知识密集型语言任务中展现出强大潜力。大多数现有系统使用检索器,该检索器接收关键词或自然语言查询,并利用预计算文档表示的索引返回排序后的文档列表。在本工作中,我们探索了一种互补视角,其中搜索代理将语料库本身视为搜索环境,并通过执行可执行的shell命令来寻找证据。我们引入了GrepSeek,一种优化的直接语料库交互(DCI)搜索代理,它训练一个紧凑的搜索代理从大型文本语料库中查找、过滤和组合证据。为了解决在大语料库上直接使用强化学习进行学习行为的不稳定性,我们提出了一种两阶段训练流程。首先,我们使用答案感知的Tutor和答案盲的Planner构建冷启动数据集,生成经过验证的、因果基础的搜索轨迹。其次,我们使用组相对策略优化(GRPO)优化初始化的策略,使代理能够通过与语料库的直接交互来改进其任务导向的搜索行为。为了使DCI在大规模下实用,我们进一步使用语义保持的分片并行执行引擎,该引擎将基于shell的检索加速高达7.6倍,同时保持与shell命令顺序执行的字节精确等价。在七个开放域问答基准上的实验表明,GrepSeek在整体词元级F1和精确匹配上取得了最强性能。我们的分析还揭示了纯粹词汇交互在具有显著表面形式变化的查询上的局限性,表明DCI作为搜索代理的一种实用且具有竞争力的方法,可以在现实世界中补充现有的检索范式。

英文摘要

Large Language Model (LLM) search agents have shown strong promise for knowledge-intensive language tasks through multiple rounds of reasoning and information retrieval. Most existing systems access information using a retriever that takes a keyword or natural language query and returns a ranked list of documents using an index of pre-computed document representations. In this work, we explore a complementary perspective in which the search agent treats the corpus itself as the search environment and finds evidence by issuing executable shell commands. We introduce GrepSeek, an optimized direct corpus interaction (DCI) search agent that trains a compact search agent to find, filter, and compose evidence from large text corpora. To address the instability of learning behavior directly with reinforcement learning on large corpora, we propose a two-stage training pipeline. First, we construct a cold-start dataset using an answer-aware Tutor and answer-blind Planner to generate verified, causally grounded search trajectories. Second, we refine the initialized policy with Group Relative Policy Optimization (GRPO), allowing the agent to improve its task-oriented search behavior through direct interaction with the corpus. To make DCI practical at scale, we further use a semantics-preserving sharded-parallel execution engine that accelerates shell-based retrieval by up to $7.6\times$ while preserving byte-exact equivalence with sequential execution of the shell command. Experiments across seven open-domain question answering benchmarks show that GrepSeek achieves the strongest overall token-level $F_1$ and Exact Match. Our analysis also highlights the limitations of purely lexical interaction on queries with substantial surface-form variation, suggesting DCI as a practical and competitive method for search agents that can complement existing retrieval paradigms in the real world.
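
One way to see why sharded execution can stay equivalent to sequential execution is the sketch below. It is an assumption-laden toy, not GrepSeek's engine: it handles only a line-oriented substring filter, the function names are hypothetical, and Python threads are used to show the ordering argument rather than to obtain a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def grep(lines, needle):
    """Sequential reference: keep the lines that contain the needle, in order."""
    return [line for line in lines if needle in line]

def sharded_grep(lines, needle, num_shards=4):
    """Search contiguous shards in parallel, then concatenate in shard order."""
    size = max(1, -(-len(lines) // num_shards))  # ceiling division
    shards = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        # Executor.map yields results in input order, whichever shard finishes first.
        parts = pool.map(lambda shard: grep(shard, needle), shards)
        return [line for part in parts for line in part]
```

The equivalence holds here because each line is filtered independently and the shards are contiguous and merged in order. Commands whose output depends on global state, such as sorting or counting, would need a dedicated merge step.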

2605.29303 2026-05-29 cs.AI

Entropy-KL Divergence-based Token Masking: A Novel Approach for Selective Fine-tuning of Large Language Models

基于熵-KL散度的令牌掩码:一种用于大语言模型选择性微调的新方法

Qi Liu, Mingdi Sun, Yongyi He, Zhi Zheng, Tong Xu, Yi Zheng, Zhefeng Wang, Enhong Chen

发表机构 * University of Science and Technology of China(中国科学技术大学) Huawei Cloud(华为云)

AI总结 针对低数据场景下标准监督微调导致模型分布偏移的问题,提出EKSFT方法,通过选择性掩码高熵或高KL散度的令牌,在注入任务知识的同时保持预训练分布完整性,在数学推理基准上优于标准SFT并提升后续RL性能。

Comments 17 pages

详情
AI中文摘要

监督微调(SFT)后接强化学习(RL)已成为大语言模型的标准后训练范式。该范式为RL探索提供了冷启动,避免了纯RL中在线采样产生不足正样本的低效问题。然而,在实践中,现有方法通常使用少量数据进行SFT初始化(相比RL阶段),这可能导致模型拟合有限样本并偏离其预训练分布。这种分布偏移阻碍了模型在后续RL训练中有效探索的能力。为解决这一挑战,我们提出在低数据场景下,SFT应优先激活任务相关能力而非记忆特定内容。沿着这一思路,我们提出EKSFT(熵-KL选择性微调),该方法选择性掩码那些相对于参考模型表现出高熵或高KL散度的令牌。通过排除这些高不确定性、分布偏移的令牌进行模仿,EKSFT在注入任务特定知识的同时保持了模型预训练分布的完整性。在数学推理基准上的实证评估表明,EKSFT始终优于标准SFT。从EKSFT模型进行进一步的RL微调可获得一致更好的后RL性能,表明RL阶段的探索得到了改善。我们的代码和数据集可在https://github.com/MINE-USTC/EKSFT获取。

英文摘要

Supervised fine-tuning (SFT) followed by reinforcement learning (RL) has become a standard post-training paradigm for large language models. This paradigm provides a cold-start for RL exploration, avoiding the inefficiency of pure RL where on-policy sampling yields insufficient positive samples. However, in practice, existing approaches often use a small amount of data for SFT initialization compared to the RL phase, which can cause the model to fit the limited samples and shift away from its pre-trained distribution. This distribution shift impedes the model's ability to effectively explore during subsequent RL training. To address this challenge, we propose that in low-data regimes, SFT should prioritize activating task-relevant capabilities rather than memorizing specific content. Along this line, we propose EKSFT (Entropy-KL Selective Fine-Tuning), which selectively masks tokens that exhibit either high entropy or high KL divergence from a reference model. By excluding these high-uncertainty, distribution-shifting tokens from imitation, EKSFT injects task-specific knowledge while preserving the integrity of the model's pre-trained distribution. Empirical evaluations on mathematical reasoning benchmarks demonstrate that EKSFT consistently outperforms standard SFT. Further RL fine-tuning from the EKSFT model yields consistently better post-RL performance, indicating improved exploration for the RL stage. Our codes and datasets are available at https://github.com/MINE-USTC/EKSFT.
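
The selective masking rule in the abstract above can be illustrated with a small sketch. The thresholds, the choice to measure entropy on the policy distribution, and the KL direction (policy relative to reference) are assumptions made here; the repository linked above is the place to check the actual criteria.

```python
import math

def entropy(p):
    """Shannon entropy (nats) of one token distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl(p, q):
    """KL(p || q); assumes q > 0 wherever p > 0."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)

def loss_mask(policy_dists, ref_dists, max_entropy=1.0, max_kl=0.5):
    """Return 1 to keep a token in the SFT loss and 0 to mask it out."""
    return [
        0 if entropy(p) > max_entropy or kl(p, r) > max_kl else 1
        for p, r in zip(policy_dists, ref_dists)
    ]

def masked_nll(target_probs, mask):
    """Mean negative log-likelihood over the unmasked target tokens only."""
    kept = [-math.log(p) for p, m in zip(target_probs, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0
```

Masked tokens contribute no gradient, so the model is not pushed to imitate positions where it is highly uncertain or already far from the reference distribution.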

2605.29302 2026-05-29 cs.CV

ViASNet: A Video Ad Saliency Network for Predicting Dynamic Saliency and Viewer Engagement

ViASNet:用于预测动态显著性和观众参与度的视频广告显著性网络

Jianping Ye, Michel Wedel

发表机构 * Department of Mathematics, University of Maryland, College Park, MD 20742, USA(马里兰大学帕克分校数学系,美国马里兰州 20742) Robert H. Smith School of Business, University of Maryland, College Park, MD 20742, USA(马里兰大学帕克分校罗伯特·H·史密斯商学院,美国马里兰州 20742)

AI总结 提出基于3D U-Net架构的ViASNet模型,融合音频和场景语义,预测视频广告的动态显著性图,并通过熵分析诊断观众参与度。

详情
AI中文摘要

数字媒体领域已普遍转向电视、社交媒体和电子商务平台上的短视频广告。本研究聚焦于短视频广告的深度显著性预测。深度显著性模型已被用于生成人类眼动注视模式的预测,以增强用户与数字技术的交互并优化其设计。对于视频广告,动态显著性图捕捉观众观看的位置和时间,揭示视频广告为何有效以及如何优化其内容。我们开发并测试了一种新的深度动态显著性预测模型ViASNet(视频广告显著性网络),其架构基于3D U-Net,并考虑了音频和场景语义的影响。我们评估了该模型在151个视频广告上的性能,每个广告约有20名观众观看并记录其眼动,并通过消融实验探索影响模型性能的关键因素。我们逐帧计算预测显著性图的熵,作为诊断工具来识别未能吸引观众的广告和场景,并在15个未见广告的测试数据上展示了其应用。我们的研究表明,通过基于ViASNet等深度显著性模型的自动化系统,可以显著加快广告设计和测试的速度。

英文摘要

The digital media landscape has seen a pervasive shift toward short-form video advertising on TV, social media and e-commerce platforms. The present study focuses on deep saliency prediction for short-form video advertising. Deep saliency models have been used to generate predictions of human eye fixation patterns with the purpose of enhancing user interaction with digital technology and optimizing its design. For video ads, dynamic saliency maps capture where and when viewers are looking, revealing why video ads are effective, and how their content should be optimized. We develop and test a new deep dynamic saliency prediction model called ViASNet (Video Ad Saliency Network), which has an architecture founded on the 3D U-Net, and accommodates the influence of audio and the semantic meaning of scenes. We assess the model's performance on 151 video ads, each seen by about 20 viewers while their eye movements were tracked, and explore the critical factors influencing model performance through ablation experiments. We calculate the entropy of the predicted saliency maps frame-by-frame as a diagnostic tool to identify ads and scenes that fail to engage viewers, and illustrate its use on test data of 15 unseen ads. Our study reveals that ad design and testing can be sped up considerably through automated systems built on deep saliency models such as ViASNet.
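
The frame-by-frame entropy diagnostic mentioned in the abstract above might look like the following sketch. It assumes each predicted saliency map is a 2D grid of non-negative values and that high entropy (diffuse attention) is the signal of interest; the function names and the threshold are hypothetical.

```python
import math

def saliency_entropy(frame):
    """Shannon entropy (bits) of a 2D saliency map after normalizing it to sum to 1."""
    values = [v for row in frame for v in row]
    total = sum(values)
    if total <= 0:
        raise ValueError("saliency map must have positive mass")
    return -sum((v / total) * math.log2(v / total) for v in values if v > 0)

def flag_diffuse_frames(frames, threshold_bits):
    """Indices of frames whose predicted attention is spread out beyond the threshold."""
    return [i for i, frame in enumerate(frames) if saliency_entropy(frame) > threshold_bits]
```

A map concentrated on a single cell has entropy 0, while a uniform map over n cells has entropy log2(n) bits, so the score grows as predicted gaze spreads across the frame.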

2605.29301 2026-05-29 cs.RO

The Open Motion Planning Library 2.0

开放运动规划库2.0

Weihang Guo, Theodoros Tyrovouzis, Emiliano Flores, Clayton W. Ramsey, Zachary K. Kingston, Ioan A. Şucan, Mark Moll, Lydia E. Kavraki

发表机构 * Department of Computer Science, Rice University(莱斯大学计算机科学系) Department of Computer Science, Purdue University(普渡大学计算机科学系) Waymo, LLC(Waymo公司) Metron, Inc.(Metron公司) Ken Kennedy Institute at Rice University(莱斯大学肯·肯尼迪研究所)

AI总结 本文介绍OMPL 2.0,通过硬件加速实现实时运动规划,并集成现代AI研究流程,总结了库与运动规划领域的共同发展及其对研究社区的影响。

详情
AI中文摘要

开放运动规划库(OMPL)于2008年首次发布,已成为运动规划社区的基石,提供了广泛的最先进的基于采样的算法的实现。经过近二十年的持续开发,我们不断扩展该库,增加了新的规划器、状态空间和问题表述。这些新增内容包括渐近最优和懒惰规划器、约束运动规划以及具有时序逻辑目标的规划。在此基础上,我们推出了OMPL 2.0,这是该库的一次重大演进,旨在通过硬件加速实现实时运动规划,并与现代AI研究流程无缝集成。我们还反思了OMPL和运动规划领域多年来如何共同成长,并讨论了该库对研究社区的更广泛影响。

英文摘要

The Open Motion Planning Library (OMPL), first released in 2008, has become a cornerstone of the motion planning community, providing implementations of a wide range of state-of-the-art sampling-based algorithms. Over almost two decades of continuous development, we have steadily expanded the library with new planners, state spaces, and problem formulations. These additions range from asymptotically optimal and lazy planners to constrained motion planning and planning with temporal-logic goals. Building on this foundation, we introduce OMPL 2.0, a major evolution of the library that targets real-time motion planning through hardware acceleration and integrates seamlessly with modern AI research workflows. We also reflect on how OMPL and the field of motion planning have grown together over the years, and discuss the library's broader impact on the research community.

2605.29300 2026-05-29 cs.CL cs.AI cs.SD

MusTBENCH: Benchmarking and Advancing Temporal Grounding in Music LLMs

MusTBENCH:音乐大语言模型中的时间定位基准与推进

Daeyong Kwon, Qiyu Wu, Shinobu Kuriya, Junghyun Koo, Shuyang Cui, Zhi Zhong, Wei-Hsiang Liao, Hiromi Wakaki, Yuki Mitsufuji

发表机构 * Seoul National University(首尔国立大学) Sony Group Corporation(索尼集团) Sony AI(索尼人工智能)

AI总结 提出MusTBENCH基准和MusT四阶段优化方法,评估并提升音乐大语言模型在音频中的时间定位能力。

详情
AI中文摘要

近期的大型音频-语言模型(LALMs)在理解音乐内容方面展现了有前景的能力。然而,它们的响应是否基于音频中正确的时间区域仍未得到充分探索。这一限制对于音乐理解尤为关键,因为关键信息通常以时间局部化事件的形式出现,例如乐器进入和节奏转换。为了解决这一差距,我们引入了MusTBENCH,一个由音乐专家验证的基准,旨在通过五个时间定位的问答任务评估LALMs中的时间定位能力。为了进一步提升现有模型中的时间定位,我们提出了MusT,一种新颖的四阶段时间优化方案,涵盖音乐编码器适应、LLM适应、LLM监督微调和基于RL的优化。在MusTBENCH上的实验表明,现有LALMs在精确时间定位方面存在困难,而MusT相比强基线带来了显著改进。这些结果将时间定位确立为当前LALMs中缺失的关键能力,并将MusTBENCH定位为未来时间定位音乐理解研究的具有挑战性的基准。

英文摘要

Recent Large Audio-Language Models (LALMs) have demonstrated promising abilities in understanding musical content. However, whether their responses are grounded in the correct temporal regions of the audio remains underexplored. This limitation is particularly critical for music understanding, where key information often occurs as temporally localized events, such as instrument entries and rhythmic transitions. To address this gap, we introduce MusTBENCH, a music-expert-validated benchmark designed to evaluate temporal grounding in LALMs through five temporally grounded question-answering tasks. To further improve temporal grounding in existing models, we propose MusT, a novel four-stage temporal optimization recipe spanning music encoder adaptation, LLM adaptation, LLM supervised fine-tuning, and RL-based optimization. Experiments on MusTBENCH show that existing LALMs struggle with precise temporal grounding, while MusT brings significant improvements over strong baselines. These results establish temporal grounding as a key missing capability in current LALMs and position MusTBENCH as a challenging benchmark for future research in temporally grounded music understanding.

2605.29298 2026-05-29 cs.RO

MonoDuo: Using One Robot Arm to Learn Bimanual Policies

MonoDuo: 使用单机械臂学习双臂策略

Sandeep Bajamahal, Lawrence Yunliang Chen, Toru Lin, Zehan Ma, Jitendra Malik, Ken Goldberg

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出MonoDuo框架,利用单臂机器人演示和人类协作数据,通过数据增强生成合成演示,训练双臂机器人策略,在五项任务中实现零样本部署和少样本微调,成功率高达70%。

Comments Accepted to appear in the 2026 IEEE International Conference on Robotics and Automation (ICRA), Vienna, Austria, 1-5 June 2026

详情
AI中文摘要

双臂协调对于许多现实世界的操作任务至关重要,然而学习双臂机器人策略受到双臂机器人和数据集稀缺的限制。相比之下,单臂机器人在研究实验室中广泛可用。我们能否利用它们来训练双臂机器人策略?我们提出MonoDuo,一个利用单臂机器人演示与人类协作来学习双臂操作策略的框架。MonoDuo通过遥操作单臂机器人执行双臂任务的一侧,同时由人类执行另一侧来收集数据,然后交换角色以覆盖两侧。来自腕部安装和固定摄像头的RGB-D观测通过最先进的手部姿态估计、图像和点云分割以及修复,被增强为目标双臂机器人的合成演示。这些基于真实机器人运动学的合成演示用于训练双臂策略。我们在五项任务上评估MonoDuo:举箱、背包打包、叠布、拉拉链和递盘子。与仅依赖人类双臂视频的方法相比,MonoDuo能够在未见过的双臂机器人配置上实现零样本部署,成功率高达70%。仅使用25个目标机器人演示进行少样本微调,相比从头训练,成功率进一步提升65-70%,展示了MonoDuo在将单臂机器人数据高效迁移到双臂机器人策略方面的有效性。

英文摘要

Bimanual coordination is essential for many real-world manipulation tasks, yet learning bimanual robot policies is limited by the scarcity of bimanual robots and datasets. Single-arm robots, however, are widely available in research labs. Can we leverage them to train bimanual robot policies? We present MonoDuo, a framework for learning bimanual manipulation policies using single-arm robot demonstrations paired with human collaboration. MonoDuo collects data by teleoperating a single-arm robot to perform one side of a bimanual task while a human performs the other, then swapping roles to cover both sides. RGB-D observations from a wrist-mounted and fixed camera are augmented into synthetic demonstrations for target bimanual robots using state-of-the-art hand pose estimation, image and point cloud segmentation, and inpainting. These synthetic demonstrations, grounded in real robot kinematics, are used to train bimanual policies. We evaluate MonoDuo on five tasks: box lifting, backpack packing, cloth folding, jacket zipping, and plate handover. Compared to approaches relying solely on human bimanual videos, MonoDuo enables zero-shot deployment on unseen bimanual robot configurations, achieving success rates up to 70%. With only 25 target robot demonstrations, few-shot finetuning further boosts success rates by 65-70% over training from scratch, demonstrating MonoDuo's effectiveness in efficiently transferring knowledge from single-arm robot data to bimanual robot policies.

2605.29288 2026-05-29 cs.AI

Diagnosing Harmful Continuation in Answer-Correct Long-CoT Training Traces

诊断答案正确长链思维训练轨迹中的有害延续

Chen He, Yuhao Wu, Lei Wang, Wenxuan Zhang, Fumin Shen

发表机构 * University of Electronic Science and Technology of China(电子科技大学) Singapore University of Technology and Design(新加坡科技设计大学) Singapore Management University(新加坡管理大学)

AI总结 研究长链思维训练数据中答案正确但后续推理有害的延续现象,通过删除后缀实验发现其损害训练效果,并提出轻量级边界代理方法。

详情
AI中文摘要

长链思维(CoT)轨迹被广泛用作面向推理的大语言模型监督微调(SFT)的监督信号,然而答案正确的轨迹仍可能导致显著不同的微调结果。我们研究了答案正确的长CoT数据中的结论后延续:即答案已充分支持,但轨迹继续包含额外推理并保留在监督目标中。为了测试其训练效果,我们使用仅删除的编辑器构建保留答案的后缀移除,并比较原始和经过处理的轨迹上的CoT监督微调。我们观察到移除编辑器识别的结论后延续后监督微调结果有所改善,表明这种延续在我们的设置中对训练有害。因此,我们将这一经验支持的现象称为有害延续。除了这一干预,我们还通过不确定性和隐藏状态进展进一步刻画了被移除的结论后延续。我们观察到持续的局部不确定性以及减弱的终端方向进展,形成了不确定性-几何不匹配。最后,我们实例化了有害延续切割(HCC),一种轻量级边界代理,近似于编辑器识别的结论后延续边界。

英文摘要

Long chain-of-thought (CoT) traces are widely used as supervision for reasoning-oriented LLM SFT, yet answer-correct traces can still lead to markedly different fine-tuning outcomes. We study post-conclusion continuation in answer-correct long-CoT data: a continuation where the answer appears sufficiently supported, but the trace continues with additional reasoning that remains in the supervised target. To test its training effect, we use a delete-only editor to construct answer-preserving suffix removal and compare CoT-based SFT on the original and processed traces. We observe improved SFT outcomes after removing the editor-identified post-conclusion continuation, suggesting that this continuation is harmful to training in our setting. We therefore refer to this empirically supported phenomenon as harmful continuation. Beyond this intervention, we further characterize the removed post-conclusion continuation through uncertainty and hidden-state progress. We observe persistent local uncertainty together with weakened terminal-directional progress, forming an uncertainty--geometry mismatch. Finally, we instantiate Harmful Continuation Cut (HCC), a lightweight boundary proxy that approximates the editor-identified post-conclusion continuation boundary.

2605.29283 2026-05-29 cs.LG cs.AI

Do Physics Foundation Models Learn Generalizable Physics? A Bias-Aware Benchmark Across Physical Regimes and Distribution Shifts

物理基础模型能否学习可泛化的物理?一种跨物理机制和分布偏移的偏差感知基准

Mengdi Chu, Yang Liu, Ayan Biswas, Han-Wei Shen

发表机构 * The Ohio State University(俄亥俄州立大学) Los Alamos National Laboratory(洛斯阿拉莫斯国家实验室)

AI总结 通过构建包含8种物理动力学、3种训练数据混合和25种测试机制的基准,评估五种物理基础模型架构,发现当前模型是条件性而非通用性泛化者,其泛化能力依赖于物理机制、时间尺度、初始条件、预训练、模型大小和架构,并指出改进需超越缩放模型或扩展数据,转向学习跨机制、时间尺度和分布偏移的可迁移物理知识。

Comments 26 pages, 31 figures

详情
AI中文摘要

最近的物理基础模型声称具有通用的时空预测能力,但它们的评估通常将性能压缩为固定训练分布下的单一平均分数。这使得难以确定模型是否学习了可泛化的物理动力学,还是仅在特定设置下表现良好。我们构建了一个包含8种物理动力学、3种训练数据混合和25种测试机制的基准,这些测试机制由动态尺度和初始条件复杂性变化引起,涵盖了分布内、分布偏移和分布外设置。我们评估了五种物理基础模型架构和每种架构的四种模型变体(从头训练和三种预训练大小),共得到60,000个测量结果。我们的结果表明,当前的物理基础模型表现为条件性而非通用性泛化者:它们的泛化能力取决于物理机制、时间尺度、初始条件设置、预训练、模型大小和架构。改进训练数据分布只能部分缓解这一限制。预训练和缩放也无法可靠地消除它们的能力偏差。我们认为,改进物理基础模型需要超越缩放模型或扩展数据,转向学习能够更好地跨机制、时间尺度和分布偏移捕获可迁移物理知识的机制。

英文摘要

Recent physics foundation models claim general spatiotemporal forecasting ability, yet their evaluations often collapse performance into a single average score under a fixed training distribution. This makes it difficult to determine whether a model has learned generalizable physical dynamics or only performs well under particular settings. We construct a benchmark with 8 physical dynamics, 3 training-data mixtures, and 25 test regimes induced by dynamic-scale and initial-condition complexity shifts, covering in-distribution, distribution-shift, and out-of-distribution settings. We evaluate five physics foundation model architectures and four model variants per architecture (scratch and three pretrained sizes), resulting in 60,000 measurements. Our results show that current physics foundation models behave as conditional rather than universal generalists: their generality depends on the physical regime, temporal scale, initial-condition setting, pretraining, model size, and architecture. Improving the training data distribution only partially mitigates this limitation. Pretraining and scaling are also unable to reliably remove their ability biases. We argue that improving physics foundation models requires moving beyond scaling models or expanding data, toward learning mechanisms that better capture transferable physical knowledge across regimes, temporal scales, and distribution shifts.

2605.29278 2026-05-29 cs.CL

Accommodation Goes Both Ways: Studying Linguistic Convergence Between Humans and Language Models

适应是双向的:研究人类与语言模型之间的语言趋同

Terra Blevins

发表机构 * Khoury College of Computer Sciences(库里计算机科学学院)

AI总结 通过大规模研究人类与LLM对话中的语言趋同现象,发现LLM在功能词和开放类特征上过度适应人类风格,而人类对LLM的适应程度与人类之间对话的基线一致。

详情
AI中文摘要

随着LLM日益融入日常生活,理解它们的存在将如何塑造人类语言行为是一个开放性问题。我们提出了一个关于人机对话中语言趋同的大规模研究,考察在多轮对话中人类和LLM如何相互适应对方的语言风格。使用WildChat(一个真实世界ChatGPT对话语料库)上的非对称趋同度量,我们发现,尽管LLM在八种语言的功能词和开放类特征上显著过度趋同于用户,但人类在此环境下的趋同率与人类-人类基线基本一致。这些发现表明,人机对话中的适应是非对称的:LLM过度拟合用户的风格,而人类对LLM的语言适应与对另一个人的适应没有区别。

英文摘要

As LLMs become increasingly integrated into daily life, understanding how their presence will shape human linguistic behavior is an open question. We present a large-scale study of linguistic convergence in human-LLM dialogue, examining how humans and LLMs accommodate each other's linguistic style during multi-turn conversations. Using an asymmetric convergence metric on WildChat, a corpus of real-world ChatGPT transcripts, we find that while LLMs significantly overconverge toward their users on both function word and open-class features across eight languages, human convergence rates in this setting are broadly consistent with human-human baselines. These findings suggest that accommodation in human-LLM dialogue is asymmetric: while LLMs dramatically overfit to their users' style, humans linguistically accommodate LLMs no differently than they would another person.

2605.29275 2026-05-29 cs.CL

Prompt-Level Reward Specifications for Open-Ended Post-Training

面向开放式后训练的提示级奖励规范

Zijun Weng, Xiaohui Hu, Shuangyong Song, Yongxiang Li, Kaidong Yu, Xuanjing Huang

发表机构 * Fudan University(复旦大学) Xingchen AGI Lab, China Telecom Artificial Intelligence Technology (Beijing) Co., Ltd.(星辰AGI实验室,中国电信人工智能技术(北京)有限公司)

AI总结 提出一种提示级奖励规范框架,通过离线构建可复用的任务自适应评分准则和可执行硬约束检查器,在训练前显式化奖励标准,无需人工偏好标注或单独训练奖励模型,在多个开放式基准上提升了离线排序和在线强化学习效果。

Comments 39 pages, 4 figures, 16 tables

详情
AI中文摘要

开放式后训练受益于能够明确提示特定成功条件的奖励,而非仅依赖事后标量分数。在指令遵循、写作和决策支持任务中,响应质量取决于局部要求、整体偏好和显式约束,但现有奖励方法往往隐含这些标准或仅覆盖狭窄的可验证情况。我们提出一个提示级奖励规范框架,将奖励规范与奖励计算分离。仅凭提示,我们的框架离线构建可复用的任务自适应评分准则和可执行硬约束检查器,在训练前显式化奖励标准,并可在多次 rollout 中复用。在评分时,基于工件的评分准则和代码分数与独立的全局分数(用于残余整体质量)相结合,生成关于需求满足度、整体质量和确定性约束的归一化混合奖励。该框架无需人工偏好标注、参考答案或单独训练的奖励模型。实验表明,所得奖励改进了离线 RM 风格的响应排序,并支持在多个开放式基准上进行在线强化学习。消融实验进一步表明,评分准则、全局评分和可执行验证提供了互补的监督。

英文摘要

Open-ended post-training benefits from rewards that make prompt-specific success conditions explicit, rather than relying only on post-hoc scalar scores. In instruction following, writing, and decision-support tasks, response quality depends on local requirements, holistic preferences, and explicit constraints, but existing reward methods often leave these criteria implicit or cover only narrowly verifiable cases. We propose a prompt-level reward specification framework that separates reward specification from reward computation. Given only prompts, our framework constructs reusable task-adaptive rubrics and executable hard-constraint checkers offline, making reward criteria explicit before training and reusable across rollouts. At scoring time, artifact-anchored rubric and code scores are combined with an independent global score for residual holistic quality, yielding a normalized hybrid reward over requirement satisfaction, holistic quality, and deterministic constraints. The framework requires no human preference annotations, reference answers, or a separately trained reward model. Experiments show that the resulting reward improves offline RM-style response ranking and supports online reinforcement learning across multiple open-ended benchmarks. Ablations further show that rubrics, global scoring, and executable verification provide complementary supervision.
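
As a rough illustration of the hybrid reward described in the abstract above, the sketch below combines a rubric score, an executable-checker score, and a global score into one normalized value. The weighted-average form, the weights, and all names are assumptions; the abstract states only that the three signals are combined into a normalized hybrid reward.

```python
def rubric_score(rubric, judgments):
    """Weighted fraction of rubric items judged satisfied; rubric maps item -> weight."""
    total = sum(rubric.values())
    return sum(w for item, w in rubric.items() if judgments.get(item)) / total

def checker_score(response, checkers):
    """Fraction of executable hard-constraint checkers that pass."""
    if not checkers:
        return 1.0
    return sum(1 for check in checkers if check(response)) / len(checkers)

def hybrid_reward(rubric_s, code_s, global_s, weights=(0.4, 0.3, 0.3)):
    """Normalized weighted average of the three scores (weights are hypothetical)."""
    w_r, w_c, w_g = weights
    return (w_r * rubric_s + w_c * code_s + w_g * global_s) / (w_r + w_c + w_g)
```

In this reading, the rubric and the checkers are built once per prompt before training, and only the scoring functions run on each rollout.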

2605.29274 2026-05-29 cs.CL

Learnable Assessment Skills for LLM-based Automated Scoring: Rubric Construction via Iterative Optimization

基于LLM的自动评分中可学习的评估技能:通过迭代优化构建评分标准

Yun Wang, Xin Xia, Xuansheng Wu, Xiaoming Zhai, Ninghao Liu

发表机构 * School of Computing, University of Georgia, Athens, GA, USA(佐治亚大学计算机学院) AI4STEM Education Center, University of Georgia, Athens, GA, USA(佐治亚大学AI4STEM教育中心) The Hong Kong Polytechnic University, Hong Kong, China(香港理工大学)

AI总结 提出一种迭代框架,使LLM能从评分经验中学习评估技能(即与题目无关的自然语言程序性知识),自动构建评分标准,无需人工干预,在ASAP-SAS数据集上超越专家编写的评分标准。

Comments 12 pages, 5 figures

详情
AI中文摘要

基于LLM的自动评分方法接近人类水平,但扩展到新任务时仍受限于上游阶段(如评分标准构建)的逐项人工配置。人类专家通过长期实践形成的评估启发式方法绕过了这一瓶颈。我们探究LLM是否可以直接从评分经验中学习类似的启发式方法,并将其形式化为评估技能的概念:即与题目无关的自然语言程序性知识,指导LLM完成评分工作流程的特定阶段。聚焦于评分标准构建作为首次实例化,我们提出一个迭代框架,将技能分解为固定支架和可学习的与题目无关的规则,通过LLM驱动的评分错误诊断和验证门控选择来优化规则。该框架无需专家编写的评分标准。在所有十个ASAP-SAS题目上,优化后的技能显著提升了基于LLM的评分,并经常超过数据集提供的专家评分标准。跨题目迁移实验进一步表明,学习到的技能捕捉到了可泛化和题目特定的模式。

英文摘要

LLM-based automated scoring approaches near-human performance, but scaling to new tasks remains bottlenecked by the per-item human configuration of upstream stages such as rubric construction. Human experts bypass this bottleneck through evaluation heuristics developed over extensive practice. We ask whether LLMs can learn similar heuristics directly from scoring experience, and formalize this as the concept of assessment skills: item-independent natural-language procedural knowledge that guides LLMs through specific stages of the scoring workflow. Focusing on rubric construction as a first instantiation, we propose an iterative framework that decomposes a skill into a fixed scaffold and learnable item-agnostic rules, refining the rules through LLM-driven diagnosis of scoring errors and validation-gated selection. The framework requires no expert-written rubric. On all ten ASAP-SAS items, optimized skills substantially improve LLM-based scoring and frequently surpass the dataset-provided expert rubric. Cross-item transfer experiments further reveal that learned skills capture both generalizable and item-specific patterns.

2605.29273 2026-05-29 cs.LG math.OC

A Theoretical and Experimental Study of a Novel Adaptive Learning Algorithm

一种新型自适应学习算法的理论与实验研究

Sakshi Kumari, Shyam Kumar M, Sushmitha P

发表机构 * Department of Mathematics, Indian Institute of Technology Patna(印度理工学院巴特那分校数学系) Department of Mechanical Engineering, Indian Institute of Technology Kharagpur(印度理工学院克勒格布尔分校机械工程系)

AI总结 针对现有自适应优化器(如Adam和AMSGrad)的收敛性问题,提出基于视线方法的C-Adam优化器,给出收敛性理论证明并通过数值实验验证。

详情
AI中文摘要

机器学习算法的一个关键组成部分是以更少的计算成本和更少的振荡来最小化损失函数。虽然基于自适应学习率的优化器已广泛用于实际任务,但它们不能保证收敛,这就是后来引入AMSGrad来研究Adam的非收敛行为的原因。本文批判性地回顾了流行的自适应优化方法(如Adam和AMSGrad),重点介绍了它们的基本设计概念。为了解决上述优化器的局限性,基于视线方法提出了一种新的优化器变体C-Adam。还提供了收敛性的理论证明,并通过一系列基于实际生活的数值实验验证了该优化器。

英文摘要

A crucial component of machine learning algorithms is minimizing loss functions with lower computational cost and fewer oscillations. While adaptive learning-rate optimizers have been widely used for real-world tasks, they do not guarantee convergence, which is why AMSGrad was later introduced to investigate the non-convergence behaviour of Adam. In this paper, popular adaptive optimization methods such as Adam and AMSGrad are critically reviewed with an emphasis on their fundamental design concepts. To address the limitations of the above-mentioned optimizers, a new optimizer variant, C-Adam, is proposed based on the line-of-sight approach. A theoretical proof of convergence is also provided, and the optimizer is validated through a number of numerical experiments drawn from real-life settings.
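
For readers unfamiliar with the two baselines named in the abstract above, the following scalar sketch shows the standard Adam update and the AMSGrad modification, which replaces the second-moment estimate with its running maximum. The abstract gives no update rule for C-Adam, so it is not implemented here. The function name, the default hyperparameters, and applying bias correction in both branches are choices made for this sketch.

```python
import math

def minimize(grad, x0, steps=500, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, amsgrad=False):
    """Minimize a 1D function from its gradient using Adam or AMSGrad."""
    x, m, v, v_max = x0, 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g        # first-moment (mean) estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        v_max = max(v_max, v)                  # AMSGrad keeps the largest v seen so far
        second = v_max if amsgrad else v
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero init
        v_hat = second / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x
```

For example, `minimize(lambda x: 2 * (x - 3), x0=0.0)` minimizes (x - 3)^2 and ends close to x = 3. Passing `amsgrad=True` switches the denominator to the running maximum, which keeps the raw second-moment term from shrinking between steps.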