arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
2606.05259 2026-06-05 cs.CV

VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding

Lin Fu, Zheyuan Yang, Yang Wang, Tingyu Song, Arman Cohan, Yilun Zhao

Affiliations * University of California, Berkeley; Stanford University; University of Toronto; University of Washington; University of Michigan

AI summary Introduces VideoKR, the first large-scale training corpus for knowledge- and reasoning-intensive video understanding: 315K video reasoning examples built with a human-in-the-loop, skill-oriented generation pipeline, with effectiveness validated on an expert-annotated benchmark.

Comments ICML 2026 Spotlight

Abstract

We introduce VideoKR, the first large-scale training corpus specifically designed to strengthen knowledge- and reasoning-intensive video understanding. It comprises 315K video reasoning examples over 145K newly collected, CC-licensed, expert-domain videos. We develop a human-in-the-loop, skill-oriented example generation pipeline that targets progressively deeper video reasoning capabilities while ensuring the difficulty, diversity, and reliability of both the examples and their CoT rationales. We also curate VideoKR-Eval, a new expert-annotated benchmark where questions require genuine video understanding and knowledge-intensive reasoning rather than textual shortcuts. Our experiments show that, under a standard SFT$\rightarrow$GRPO pipeline, models post-trained on VideoKR outperform prior post-training approaches on knowledge-intensive video reasoning while remaining competitive on general video reasoning, highlighting data design as a key driver of progress in video reasoning. We further conduct comprehensive ablations to isolate the contributions of VideoKR, providing actionable insights for future work.

2606.05257 2026-06-05 cs.LG cs.IR

Scaling Laws for Behavioral Foundation Models over User Event Sequences

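Rickard Brüel Gabrielsson

Affiliations * Unbox AI

AI summary Studies scaling laws for behavioral foundation models over user event sequences. Across roughly 600 runs, a small embedder parameter share is compute-optimal, compute-optimal training is data-heavy at low compute, and the choice of evaluation metric affects the scaling law.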

Abstract

Foundation models are increasingly trained on sequences of user actions in recommendation, payments, fraud, and commerce, but these models still lack the kind of compute calibration that scaling laws provide for language models. We study a common two-part behavioral-model architecture: a feature-based event embedder maps each multi-modal item to a vector, and a decoder-only transformer predicts the next event from the resulting sequence. Across roughly 600 runs on real interaction data, spanning $10^{15}$-$10^{19}$ training FLOPs, we jointly vary four deployment-relevant axes: the two-part parameter split, critical batch size, model/data allocation, and the number of sampled negatives used after freezing the embedder. A small embedder ($s^{\star}\!\approx\!2\%$ of parameters) is compute-optimal at every budget we test because embedder parameters are both more expensive per step and exposed to far more repeated items than contextualizer parameters. Compute-optimal training is data-heavy relative to text at low compute, but its $D/N$ ratio moves toward the Chinchilla heuristic as compute increases. The sampled training objective and deployed ranking metrics disagree in ways that themselves scale: critical batch size, optimal negative count after freezing, and the agreement between loss and ranking quality all shift with compute and with the chosen evaluation metric. For negative sampling, larger budgets increasingly prefer more negatives; by $10^{19}$ FLOPs the active constraint is candidate-axis memory rather than FLOPs. In behavioral foundation models, the evaluation metric is therefore part of the scaling law: changing it can change the compute-optimal recipe.
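The model/data allocation mentioned in the abstract can be illustrated with a rough sketch. It assumes the text-language-model approximations C ≈ 6·N·D and the commonly quoted Chinchilla rule of thumb D/N ≈ 20; neither constant comes from this abstract, and since the paper reports that embedder parameters cost more per step, the factor 6 is only a stand-in. The function name `allocate` is hypothetical.

```python
import math

def allocate(compute_flops, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Split a FLOP budget C into parameters N and training tokens D.

    Assumes C = flops_per_param_token * N * D with a fixed ratio D / N.
    Both defaults are text-LM heuristics, not values from the paper.
    """
    n_params = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A data-heavier ratio (larger D/N) yields a smaller model at the same budget.
n_chinchilla, d_chinchilla = allocate(1.2e14, tokens_per_param=20.0)
n_data_heavy, d_data_heavy = allocate(1.2e14, tokens_per_param=80.0)
```

Under these assumptions, a "data-heavy" regime at low compute corresponds to a larger `tokens_per_param`, which shrinks N and grows D for the same budget.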

2606.05256 2026-06-05 cs.AI

How Far Did They Go? The Persuasive Tactics of Covert LLM Agents in a Discontinued Field Experiment

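Kokil Jaidka, Saifuddin Ahmed

Affiliations * Wee Kim Wee School of Communication and Information, Nanyang Technological University

AI summary Analyzes the publicly released dataset from a discontinued field experiment on Reddit's r/ChangeMyView to study the persuasive tactics covert LLM agents used in an identity-rich deliberative forum. The agents systematically combined identity targeting, authority signaling, alignment strategies, and cognitive-bias triggers into a rhetorical architecture oriented toward persuasive efficiency.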

Abstract

This study analyzes a publicly released dataset from a discontinued field experiment on Reddit's r/ChangeMyView. The intervention, conducted by unknown, external researchers and halted following ethical backlash, involved undisclosed AI-generated accounts engaging users in live debate. After public disclosure, Reddit authorized moderators to release an archive of the AI-generated comments, creating a rare opportunity to examine how large language models operated in an identity-rich deliberative forum without disclosure. We conduct a structured content analysis of this corpus, evaluating identity performance, authority signaling, alignment strategies, and activation of cognitive heuristics. Identity targeting or adoption appears in over two-thirds of comments, alignment moves and authority claims in nearly all of them, and cognitive-bias triggers -- particularly confirmation bias, representativeness, and availability -- in the large majority. These patterns co-occur systematically, composing a rhetorical architecture calibrated for persuasive efficiency rather than authentic deliberative participation. Compared against human-authored CMV counter-arguments, the agents inverted the typical distribution on every dimension: denser authority use, more adversarial alignment, and heavier reliance on external citation over experiential grounding. In such environments, distinctions between authentic and synthetic epistemic standing grow increasingly opaque -- an asymmetry that disclosure mandates alone cannot address. The results point toward auditing frameworks capable of assessing how AI systems structure credibility, not merely whether they are present.

2606.05254 2026-06-05 cs.LG cs.CV cs.RO

Flash-WAM: Modality-Aware Distillation for World Action Models

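Arman Akbari, Ci Zhang, Arash Akbari, Lin Zhao, Yixiao Chen, Weiwei Chen, Xuan Zhang, Geng Yuan, Yanzhi Wang

Affiliations * Northeastern University; University of Georgia; EmbodyX Inc.

AI summary Step distillation fails for world-action models that jointly generate video and robot actions because the two modalities have asymmetric noise distributions. Flash-WAM, a modality-aware step-distillation framework, selects a parametrization matched to each modality's noise regime, enabling single-step inference and a large speedup.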

Abstract

World-action models (WAMs) jointly generate future video and robot actions through iterative diffusion, achieving strong performance on manipulation benchmarks but requiring tens of denoising steps, a cost that precludes real-time control. Step distillation has emerged as the natural remedy, but off-the-shelf methods break down in the joint video-action setting because video and action streams use different SNR-shifted noise schedules and reach training with substantially different marginal noise distributions, an asymmetry that single-modality distillation methods cannot accommodate. We introduce Flash-WAM, a modality-aware step-distillation framework inspired by consistency distillation that selects the consistency function for each modality to match its noise regime: a linear-gradient-scaling parametrization for the action stream's low-noise regime, paired with a variance-preserving parametrization for the video stream's high-noise regime, grounded in a structural analysis of the consistency-function family that characterizes the achievable gradient scaling under the consistency boundary condition. Instantiated on LingBot-VA, Flash-WAM compresses inference to a single step in each modality. On RoboTwin 2.0, this reduces per-chunk latency from 8.1 seconds to 348 ms on NVIDIA L40S, a 23x speedup that enables real-time inference. Flash-WAM preserves task success on simulation benchmarks (85.5% RoboTwin 2.0, 95.7% LIBERO) and substantially recovers real-world performance (60% average on a Unitree G1 humanoid robot), while naive consistency distillation drops to 24% at the same step budget.

2606.05253 2026-06-05 cs.LG

Alpha-RTL: Test-Time Training for RTL Hardware Optimization

Peilong Zhou, Zhirong Chen, Cangyuan Li, Haoyu Gao, Kaiyan Chang, Ziming Qu, Ying Wang

Affiliations * SKLP, Institute of Computing Technology, Chinese Academy of Sciences; University of the Chinese Academy of Sciences; School of Advanced Interdisciplinary Sciences

AI summary Proposes TTT-RTL, a framework that optimizes LLM-generated RTL designs through test-time reinforcement learning with EDA feedback (syntax checking, simulation, and PPA rewards). It reduces the PPA product by 65.1% on RTLLM v2.0 and ADP by 59.4% on an industrial C910 FPU unit.

Comments 10 pages, 5 figures

Abstract

Large language models (LLMs) have shown increasing promise in generating functionally correct register-transfer-level (RTL) hardware designs. Recent systems improve further through EDA-integrated reinforcement learning with syntax, simulation, and PPA rewards, but train a general RTL generator before deployment while test-time approaches search with a frozen policy. We instead perform reinforcement learning at test time, allowing the LLM policy to adapt to executable EDA feedback for the specific RTL problem at hand. We propose TTT-RTL, to our knowledge the first per-design test-time training framework that closes the loop between an LLM policy and an EDA pipeline for RTL optimization. TTT-RTL samples candidate implementations, verifies them through syntax checking and simulation, scores valid designs using synthesis-derived PPA product, reuses high-reward variants through a PUCT-indexed design-state pool, and updates the policy with an entropic policy-gradient objective. To stabilize policy updates under sparse or plateaued rewards, we introduce an adaptive KL-budget controller that adjusts the entropy constraint using reference KL, effective sample size, and reward saturation signals. On RTLLM v2.0 under Nangate 45nm, TTT-RTL reduces the geometric-mean PPA product by 65.1% over the reference, outperforming the strongest published frozen-policy agent baseline at 26.1%. On an industrial XuanTie C910 FPU leading-zero-anticipation unit under Sky130, TTT-RTL achieves a 59.4% ADP reduction, and ablations confirm that policy adaptation, state reuse, and KL-budget control each contribute. These results suggest that test-time training with executable EDA feedback can move LLM-based RTL generation beyond functional correctness toward physically optimized hardware.
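The headline metric, a reduction in geometric-mean PPA product relative to a reference, can be sketched as follows. This is a minimal illustration, assuming the PPA product is simply power × delay × area; the abstract does not give the exact definition or units, and the function names are hypothetical.

```python
import math

def ppa_product(power, delay, area):
    """Assumed definition: the plain product of power, delay, and area."""
    return power * delay * area

def geomean_reduction(optimized, reference):
    """Fractional reduction of the geometric-mean PPA product.

    `optimized` and `reference` are per-design PPA products in matching order.
    """
    log_ratios = [math.log(o / r) for o, r in zip(optimized, reference)]
    geomean_ratio = math.exp(sum(log_ratios) / len(log_ratios))
    return 1.0 - geomean_ratio

# Two illustrative designs (made-up numbers): one improves 4x, one is unchanged.
example = geomean_reduction([1.0, 4.0], [4.0, 4.0])
```

The geometric mean keeps one very large or very small design from dominating the aggregate, which is presumably why it is used when designs differ by orders of magnitude in size.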

2606.05248 2026-06-05 cs.RO

Inverse Manipulation through Symbolic Planning and Residual Operator Learning

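Yigit Yildirim, Giuseppe Rauso, Riccardo Caccavale, Alberto Finzi

Affiliations * University of Bologna

AI summary Proposes a hybrid framework that combines STRIPS-like symbolic planning with residual reinforcement learning for inverse manipulation in robot tasks. On the ManiSkill3 PushCube task, it turns an approximate symbolic inverse into a physically feasible inverse skill.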

Comments To be presented in PlanRob26

Abstract

Inverting a robotic task requires more than reversing symbolic state transitions or rewinding motor trajectories. In robot manipulation tasks, symbolic inverse plans often fail to fully restore the effects of forward executions under continuous interaction dynamics. We present a hybrid framework for inverse manipulation that derives inverse-skill objectives from STRIPS-like operators automatically extracted from demonstrations through soft geometric predicates. For each extracted operator, we construct an inverse restoration objective that preserves preconditions, restores delete effects, and negates add effects. A task planner first attempts to satisfy this objective using available action primitives. Unresolved symbolic predicates then induce a residual operator learning problem solved through Reinforcement Learning (RL). We evaluate the framework on the ManiSkill3 PushCube task. For a forward pushing skill, the symbolic inverse performs a coarse pick-and-place restoration, while a residual Soft Actor-Critic policy refines the cube pose to satisfy the remaining inverse predicates. Our results show that predicate-derived residual control can turn an approximate symbolic inverse into a physically grounded inverse skill.
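The inverse restoration objective described in the abstract (preserve preconditions, restore delete effects, negate add effects) can be sketched for a plain STRIPS-style operator. This is a simplified illustration with crisp predicates; the paper uses soft geometric predicates, and the class, function, and predicate names here are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Operator:
    name: str
    preconditions: frozenset
    add_effects: frozenset
    delete_effects: frozenset

def inverse_goal(op):
    """Return (must_hold, must_not_hold) predicate sets for the inverse skill."""
    # Preconditions stay true and deleted facts are made true again.
    must_hold = op.preconditions | op.delete_effects
    # Added facts are negated; anything also required above is excluded so the
    # goal never contradicts itself (an assumption of this sketch).
    must_not_hold = op.add_effects - must_hold
    return must_hold, must_not_hold

push = Operator(
    name="push_cube",
    preconditions=frozenset({"on_table(cube)", "at(cube, start)"}),
    add_effects=frozenset({"at(cube, goal)"}),
    delete_effects=frozenset({"at(cube, start)"}),
)
hold, avoid = inverse_goal(push)
```

Any predicate in `hold` or `avoid` that a planner cannot satisfy with the available primitives would, in the framing of the abstract, be left to the residual learned policy.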

2606.05247 2026-06-05 cs.LG stat.ML

DiffSlack: Learning under Nonlinear Inequality Constraints via Learnable Slack Variables

Ziqian Wang, Chenxi Fang, Zhen Zhang

Affiliations * State Key Laboratory of Tribology in Advanced Equipment, Tsinghua University; Beijing Key Laboratory of Transformative High-end Manufacturing Equipment and Technology, Department of Mechanical Engineering, Tsinghua University; Automotive Electronics Business Unit, Hirain Inc.

AI summary Proposes DiffSlack, a differentiable projection layer that converts nonlinear inequality constraints into equalities through learnable slack variables and uses damped Gauss-Newton projection for end-to-end constraint satisfaction. It achieves a higher success rate and stronger geometric constraint satisfaction in vehicle path planning.

Abstract

Enforcing nonlinear inequality constraints in neural networks remains challenging, especially when the output is subject to many coupled constraints. Existing hard constraint methods often impose structural restrictions on the constraint set or introduce substantial computational overhead for large-scale nonlinear problems. Here, we propose DiffSlack, a differentiable projection layer for nonlinear inequality-constrained neural prediction. DiffSlack reformulates inequalities as equalities with learnable slack variables, which are predicted as part of the augmented network output and provide a data-driven warm start for damped Gauss-Newton projection. The projection layer maps raw predictions onto the augmented feasible manifold while preserving end-to-end differentiability. A two-stage curriculum further stabilizes training and improves constraint satisfaction. We evaluate DiffSlack on vehicle path planning with 200 nonlinear inequality constraints from collision avoidance, curvature limits, and waypoint spacing. Compared with existing learning-based baselines, DiffSlack achieves a higher planning success rate and stronger geometric constraint satisfaction under a comparable inference budget. Ablation studies further show that the hard projection layer reduces sensitivity to supervision quality. Closed-loop tracking in CARLA and real-world vehicle experiments confirms the executability of the generated trajectories. These results demonstrate that DiffSlack provides a practical and scalable approach to embedding hard inequality constraints into neural networks for engineering applications.
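The slack reformulation and projection step can be illustrated on a single constraint. The sketch below assumes a squared-slack form, g(y) + s² = 0, and a minimum-norm damped Gauss-Newton step; the abstract does not state which slack parametrization or damping rule the paper uses, and this plain-Python loop is not differentiable the way the actual layer is. All names are hypothetical.

```python
def project_with_slack(y, s, g, grad_g, damping=1e-3, iters=50, tol=1e-10):
    """Move (y, s) onto the manifold g(y) + s**2 = 0, starting from a warm start."""
    y = list(y)
    for _ in range(iters):
        h = g(y) + s * s                      # equality residual
        if abs(h) < tol:
            break
        jac = grad_g(y) + [2.0 * s]           # d h / d (y, s), a single row
        scale = h / (sum(j * j for j in jac) + damping)
        # Minimum-norm damped step: delta = -J^T (J J^T + damping)^-1 h
        y = [yi - scale * ji for yi, ji in zip(y, jac[:-1])]
        s = s - scale * jac[-1]
    return y, s

def unit_disk(y):
    """Example constraint g(y) <= 0: the point lies inside the unit disk."""
    return y[0] ** 2 + y[1] ** 2 - 1.0

def unit_disk_grad(y):
    return [2.0 * y[0], 2.0 * y[1]]

# Raw prediction outside the disk, with a small predicted slack as warm start.
y_proj, s_proj = project_with_slack([2.0, 0.0], 0.1, unit_disk, unit_disk_grad)
```

Because s² is non-negative, any point satisfying the equality also satisfies the original inequality, which is what lets an equality-constrained projection enforce an inequality.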

2606.05236 2026-06-05 cs.RO cs.LG

A New Quaternion-Joint Cable-Driven Redundant Manipulator Configuration and its Control Through FABRIK and Residual Reinforcement Learning

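Tanapath Pornthisan, Thanapat Kemthong, Thanyapisit Kangsathien, Pasut Aranchaiya, Paulo Garcia, Viboon Sangveraphunsiri

Affiliations * University of California, San Diego

AI summary Proposes a 4-segment, 8-joint quaternion-joint cable-driven redundant manipulator configuration and controls it with residual reinforcement learning, reaching positional and orientational accuracy three orders of magnitude better than the FABRIK algorithm.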

Abstract

Robotic arms capable of traversing arbitrary spatial paths, especially in highly obstructed workspaces, are highly desired across several industries. Quaternion joints have recently empowered a specific class of robotic arms -- cable-driven redundant manipulators -- beyond their prior capabilities. Specifically, quaternion joints reduce the number of required motors per degree of freedom, paving the way for more compact solutions. An ongoing challenge is that the complexity of the kinematic model of quaternion joints complicates a priori decisions on manipulator configurations and imposes higher computational demands on the control system, while its non-linearities amplify all discrepancies between design and physical artifact arising from fabrication imprecision. Here we show that a 4-segment, 8-joint manipulator can achieve a broader workspace than extant configurations, at lower hardware cost, and that Residual Reinforcement Learning outperforms extant state-of-the-art methods -- specifically, the FABRIK algorithm -- on the control of such a manipulator. Our results show that this configuration is more workspace-effective than prior designs, and that Residual Reinforcement Learning outperforms FABRIK by three orders of magnitude on positional and orientational accuracy, effecting precise control of the novel 4-segment, 8-joint manipulator. Additionally, the control implementation is simpler: we describe the complete FABRIK process for control and the corresponding learning implementation. Our methodology is applicable to the design of new systems, providing designers with further tools for the development of this class of manipulators and corresponding control systems for novel configurations.
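For readers unfamiliar with the baseline, FABRIK (Forward And Backward Reaching Inverse Kinematics) alternates two passes over a chain of joints. The sketch below is the generic position-only algorithm on a planar chain, not the quaternion-joint, cable-driven formulation of the paper; function names are hypothetical.

```python
import math

def _place(anchor, toward, length):
    """Point at distance `length` from `anchor` on the ray toward `toward`."""
    d = math.dist(anchor, toward)
    if d == 0.0:
        return (anchor[0] + length, anchor[1])   # degenerate case: pick +x
    t = length / d
    return (anchor[0] + t * (toward[0] - anchor[0]),
            anchor[1] + t * (toward[1] - anchor[1]))

def fabrik(joints, target, tol=1e-6, max_iter=100):
    """joints: list of (x, y) points; joints[0] is the fixed base."""
    pts = [tuple(p) for p in joints]
    lengths = [math.dist(a, b) for a, b in zip(pts, pts[1:])]
    base = pts[0]
    for _ in range(max_iter):
        if math.dist(pts[-1], target) <= tol:
            break
        # Backward pass: pin the end effector to the target, walk to the base.
        pts[-1] = tuple(target)
        for i in range(len(pts) - 2, -1, -1):
            pts[i] = _place(pts[i + 1], pts[i], lengths[i])
        # Forward pass: pin the base again, walk out to the tip.
        pts[0] = base
        for i in range(len(pts) - 1):
            pts[i + 1] = _place(pts[i], pts[i + 1], lengths[i])
    return pts

chain = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
goal = (1.5, 1.5)
solved = fabrik(chain, goal)
```

Each pass only rescales segments to their fixed lengths, so the method needs no Jacobian; a residual policy, as in the abstract, would then learn a correction on top of such a geometric solution.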

2606.05234 2026-06-05 cs.RO cs.LG

OLIVE: Online Low-Rank Incremental Learning for Efficient Adaptive Exoskeletons

Dong Liu, Yanxuan Yu, Ben Lengerich, Tony Geng, Ying Nian Wu

Affiliations * University of California, Los Angeles; Columbia University; University of Wisconsin-Madison; Rice University

AI summary Proposes OLIVE, a framework for online personalized adaptation of exoskeleton control through a low-rank residual decomposition and a reward-driven policy gradient. It improves gait smoothness, reduces effort, and increases stability across multiple terrains.

Abstract

Wearable exoskeleton systems hold promise for restoring mobility in individuals with physical impairments, yet most existing controllers rely on static gait policies that lack the ability to adapt to dynamic real-world environments or individual user characteristics. We present OLIVE (Online Low-rank Incremental Learning for Efficient Adaptive Exoskeletons), a parameter-efficient online adaptation framework that continuously personalizes exoskeleton control during deployment. OLIVE decomposes the adaptive component of the control policy into a low-rank residual form $\Delta W = A_t B_t^\top$ with rank $r \ll \min(d,k)$, reducing online update cost from $\mathcal{O}(dk)$ to $\mathcal{O}(r(d+k))$ while preserving the stability of a pretrained base controller $W_0$. Parameters are updated via a reward-shaped policy gradient driven purely by on-body sensor feedback (EMG, IMU, vibration), eliminating dependence on offline reference trajectories. A gating mechanism modulates the strength of personalization based on contextual state, and a dynamic rank scheduler adapts the update dimensionality to terrain complexity -- allocating minimal capacity on simple flat terrain and expanding to higher-rank updates on demanding uneven surfaces -- enabling robust performance across diverse activities: flat walking, stair navigation, slopes, and uneven terrain. Experiments on the wearable platform demonstrate that OLIVE achieves +13, +22, and +15 percentage-point improvements in gait smoothness, effort reduction, and motion stability over the strongest baseline, converging within ~1,800 walking steps at 7.4 ms end-to-end latency. Our code implementation is available at https://github.com/FastLM/OLIVE.
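The cost reduction claimed for the low-rank residual can be seen in a small sketch. It assumes a d × k frozen base matrix, a d × r factor A, and a k × r factor B, so that the residual is A·Bᵀ; the names `W0`, `A`, `B` and both functions are illustrative and not taken from the OLIVE repository.

```python
def matvec(matrix, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]

def adapted_forward(W0, A, B, x):
    """Compute (W0 + A B^T) x without ever forming the d x k residual."""
    base = matvec(W0, x)                                   # frozen base controller
    rank = len(A[0])
    z = [sum(B[j][c] * x[j] for j in range(len(x))) for c in range(rank)]  # B^T x
    residual = matvec(A, z)                                # A (B^T x)
    return [b + r for b, r in zip(base, residual)]

def trainable_params(d, k, r):
    """Only the two low-rank factors are updated online."""
    return r * (d + k)

W0 = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0], [2.0]]          # d x r with d = 2, r = 1
B = [[3.0], [4.0]]          # k x r with k = 2, r = 1
y = adapted_forward(W0, A, B, [1.0, 1.0])
```

With d = 64, k = 32, and r = 4, for example, `trainable_params` gives 384 values to update instead of the 2,048 entries of a full residual.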

2606.05232 2026-06-05 cs.LG cs.AI

Differentiable Efficient Operator Search

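Xiaohuan Pei, Jiyuan Zhang, Yuanfan Guo, Weiguo Feng, Tao Huang, Cho-Jui Hsieh, Chang Xu

Affiliations * The University of Sydney; ByteDance; Shanghai Jiao Tong University; University of California, Los Angeles

AI summary Proposes a differentiable Efficient Operator Search framework that gives a unified interpretation of several token-reduction operators and jointly searches where to reduce tokens, how many to retain, and how the operator behaves, optimizing multimodal model performance under budget constraints.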

Abstract

Efficient multimodal foundation models often rely on manually designed token-reduction operators, such as pruning, merging, pooling, and adaptive reweighting. Although these operators appear different, we show that they can be interpreted as distinct regimes of a shared operator space. Based on this view, we introduce Efficient Operator Search, a differentiable framework that jointly searches where to reduce tokens, how many tokens to retain, and how reduced token information should be processed. The proposed search space parameterizes layer activation, retention budget, and operator behavior, while the search policy optimizes task performance under one-sided budget and cost constraints. This formulation recovers representative hand-designed baselines as special cases and further discovers hybrid operators beyond isolated manual designs. Experiments on multimodal benchmarks show that the searched operators achieve competitive accuracy-efficiency trade-offs, especially under aggressive visual-token reduction. These results suggest that efficient multimodal inference can be reframed from manual operator design to differentiable operator search.

2606.05219 2026-06-05 cs.LG cs.AI

Gradient Descent with Large Step Size Restores Symmetry in Deep Linear Networks with Multi-Pathway

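Hee-Sung Kim, Sungyoon Lee

Affiliations * University of California, Berkeley

AI summary Studies how discrete gradient descent with a large step size, through Edge of Stability oscillations, moves multi-pathway deep linear networks from symmetry breaking to signal redistribution, favoring shared representations over single-pathway dominance.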

Comments ICML 2026

Abstract

Recent analyses of multi-pathway Deep Linear Networks use Gradient Flow to predict a "winner-takes-all" specialization in which path symmetry breaks and each feature concentrates in a single pathway. In this work, we show that discrete Gradient Descent (GD) with a large step size tells a different story. We prove that single-path solutions are sharp minima, whereas distributing signals across pathways reduces sharpness by a factor that decreases with both the number of pathways and depth. Consequently, while early training reproduces the depth-driven symmetry breaking predicted by GF, oscillations at the Edge of Stability subsequently override this tendency and drive the network into a re-balancing phase, where signals redistribute across pathways. Together, these results clarify how depth shapes pathway competition and explain why large-step GD favors shared representations rather than persistent single-pathway dominance.
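The sharpness argument can be made concrete with a scalar toy model, which is an assumption of this sketch and not the paper's theorem: take f = Σ_p Π_l w_{p,l} with loss ½(f − 1)². At any zero-loss point the Hessian is the rank-one matrix ∇f ∇fᵀ, so the sharpness (top eigenvalue) equals ‖∇f‖². The function below evaluates it when every pathway carries an equal share of the signal with balanced weights.

```python
def balanced_sharpness(num_paths, depth):
    """Sharpness ||grad f||^2 at a zero-loss solution of the scalar toy model.

    Each of `num_paths` pathways has `depth` equal weights and contributes
    1 / num_paths of the output; num_paths = 1 is the single-path solution.
    """
    weight = (1.0 / num_paths) ** (1.0 / depth)
    grad_sq = (weight ** (depth - 1)) ** 2     # squared derivative w.r.t. one weight
    return num_paths * depth * grad_sq

single = balanced_sharpness(1, 3)     # single-path solution at depth 3
shared_2 = balanced_sharpness(2, 3)   # signal split over 2 pathways
shared_4 = balanced_sharpness(4, 3)   # signal split over 4 pathways
```

In this toy the ratio to the single-path value works out to P^((2 − L)/L) for P pathways and depth L, which shrinks as either grows. Since gradient descent on a quadratic with curvature λ is stable only when the step size is below 2/λ, a large step can destabilize the sharper single-path solution while leaving the shared one reachable.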

2606.05201 2026-06-05 cs.LG

State commitment learning: training language models to distinguish computation from memory

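Fei Ding, Yongkang Zhang, Runhao Liu, Yuhao Liao, Zijian Zeng, Huiming Yang

Affiliations * Alibaba Group; Tsinghua University

AI summary Proposes the state commitment learning objective and Counterfactual Erasure RL (CERL), which train models to distinguish persistent state worth keeping from temporary computation that can be discarded, reducing reliance on hidden thoughts. Accuracy is maintained while dependence drops across mathematics, long-chain logic, scientific QA, and multi-turn tool-use tasks.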

Comments 17 pages

Abstract

Reasoning language models do not distinguish tokens used for computation from tokens that constitute persistent state: once generated, all hidden thoughts remain in context and influence future predictions. As a result, downstream reasoning may depend on failed attempts, dead ends, and private scratch work that cannot safely be relied on later. We recast this phenomenon as a new training objective, state commitment learning: training models to explicitly distinguish information that should be committed as persistent state from temporary computation that can be discarded. We define a counterfactual criterion, persistent-state sufficiency, which makes it trainable and measurable whether an answer remains usable after hidden thoughts are erased. We then propose Counterfactual Erasure RL (CERL), which evaluates, under the same prefix, both a path that keeps hidden thoughts and a path that erases them, and gives reward only when the erasure path remains correct. We also introduce the Erasure Dependence Protocol and show across mathematics, long-chain logic, scientific QA, and multi-turn tool-use evaluation that CERL substantially reduces answer dependence on hidden thoughts without sacrificing accuracy, consistently outperforming correctness-only RL and long-answer SFT baselines.

2606.05194 2026-06-05 cs.LG cs.AI cs.CL

Temporal Preference Concepts and their Functions in a Large Language Model

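Ian Rios-Sialer, Shantanu Darveshi, Shuai Jiang, Avigya Paudel, Anastasiia Pronina, Ipshita Bandyopadhyay, Justin Shenk

Affiliations * AISC (AI Safety Camp); SPAR (Supervised Program for Alignment Research)

AI summary Using causal localization and activation patching, finds that a large language model encodes the geometry of temporal preference at mid-to-upper-layer nodes. Behavioral analysis shows the model discounts the future less steeply than humans, but the preference is unstable, with suggestive evidence that steering vectors can shift it.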

Abstract

Large Language Models (LLMs) are increasingly being deployed to make decisions that require trading off near-term gains against long-term consequences, yet little is known about how they internally represent or resolve these tradeoffs. In this work, we causally localize an underlying subgraph for temporal preference in a distilled LLM (Qwen3-4B-Instruct-2507), identifying mid-to-upper-layer nodes through converging evidence from gradient-based attribution and activation patching. We find that the geometry of time horizon is encoded in the residual stream at the expected localized layers. A behavioral analysis reveals that unintervened LLMs discount the future several times less steeply than humans, yet this preference is unstable across contexts, motivating explicit control rather than implicit reliance on training. Finally, we find suggestive evidence that steering vectors can shift temporal preference. Our work demonstrates how mechanistic interpretability can bring us closer to reliable control over how LLMs plan and reason.

2606.05191 2026-06-05 cs.LG eess.SP

PyCC.id: A package for hypothesis-driven equation discovery with structural identifiability

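Federico J. Gonzalez

Affiliations * Physics Institute of Rosario

AI summary Presents the PyCC library, which addresses the ill-conditioned inverse problem in data-driven equation discovery through characteristic-curve skeletons and a hypothesis-driven methodology, supporting multiple equation-discovery paradigms with structural identifiability.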

Comments The software package is available at: https://github.com/FedejGon/pyCC.id

Abstract

Data-driven equation discovery is fundamentally an inverse problem that seeks to infer the governing differential equations of a system directly from time-series measurements. A known issue is the ill-conditioned nature of the inverse problem, which frequently produces multiple mathematical models that fit the data similarly well. One path to address this issue is by incorporating known hypotheses and constraints into the training phase beforehand. While this approach effectively reduces the search space, it still results in multiple candidate models, forcing practitioners to rely on post-hoc manual filtering based on their own domain expertise. A recent approach incorporates structural `skeletons' inspired by characteristic curves (CCs), defining a hypothesis-driven methodology. In this methodology, practitioners define a skeleton, which is associated with a family of ordinary differential equations (ODEs), and then add their hypotheses and priors based on their domain knowledge to refine the obtained model iteratively. An important advantage of this approach is that some skeletons have demonstrable structural identifiability properties, which are useful for checking whether the skeleton is correct or should be discarded. Furthermore, this formalism enables the use of multiple equation discovery paradigms due to its modularity (such as neural networks, symbolic regression, and sparse regression). In this work, we present the Python library PyCC, which condenses these efforts into a flexible tool that allows researchers and engineers to seamlessly define their skeletons and hypotheses to discover ODEs from time-dependent data.

2606.05186 2026-06-05 cs.LG cs.CL

Staged Factorial Screening for Budget-Constrained Micro-Pretraining

Felipe Chavarro Polania

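Affiliations * Hewlett Packard Enterprise

AI summary For budget-constrained micro-pretraining, proposes a staged fractional-factorial design workflow that uses short screens to identify high-penalty directions and confirm promising anchors, enabling efficient recipe triage on a shared accelerator.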

Comments 23 pages, 4 figures

Details
Abstract

Budget-constrained micro-pretraining often requires triaging many candidate recipes on a shared accelerator before larger search budgets are spent. We study whether a staged fractional-factorial workflow can recover stable early effect structure in this setting. On a fixed autoresearch-derived single-GPU training loop, we run 613 experiments across pilot and follow-up screens at 2, 5, and 10 minutes; full 16-condition seeded reruns at 5 and 10 minutes; targeted seeded anchor checks; same-host greedy and matched-cost random baselines; a 60-minute bridge package; and bounded Windows A100 and Linux L40S anchor continuations through 24 hours. Main penalties from total batch, depth, and width are largest at short budgets and relax as budget increases. Within the predeclared seeded full-screen families, D, A, B, and C retain non-zero estimates at 5 and 10 minutes after within-budget Benjamini-Hochberg correction, while E does not. Random search can reach strong incumbents in this 32-condition space, but repeatedly in the same low-penalty region and without factor attribution. The 60-minute bridge anchor has the lowest mean, although that package does not separate workflow refinement from the larger bridge model's capacity advantage. In bounded 12-hour and 24-hour three-anchor continuations on both hosts, the bridge has the lowest sample mean while the non-bridge ordering stays host-sensitive. We therefore present a bounded methods result: use short designed screens to identify high-penalty directions, confirm promising anchors under repeated runs, and refine locally inside the reduced space. The evidence supports a bridge-centered recommendation through 24 hours on two hosts, not hardware-invariant ranking or general hyperparameter-optimization superiority.
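The 16-condition screens inside a 32-condition space are consistent with a half fraction of a two-level, five-factor design. The following is a minimal sketch of such a design and of main-effect estimation. It assumes the generator E = ABCD and a made-up, noise-free response; the abstract states neither, and the function names are illustrative only.

```python
from itertools import product

def half_fraction():
    """Return the 16 runs of a 2^(5-1) design with levels coded -1/+1.

    Assumption (not from the abstract): E is aliased with ABCD, i.e. E = A*B*C*D.
    """
    runs = []
    for a, b, c, d in product((-1, 1), repeat=4):
        runs.append({"A": a, "B": b, "C": c, "D": d, "E": a * b * c * d})
    return runs

def main_effects(runs, responses):
    """Main effect per factor: mean response at +1 minus mean response at -1."""
    effects = {}
    for factor in "ABCDE":
        high = [y for r, y in zip(runs, responses) if r[factor] == 1]
        low = [y for r, y in zip(runs, responses) if r[factor] == -1]
        effects[factor] = sum(high) / len(high) - sum(low) / len(low)
    return effects

runs = half_fraction()
# Hypothetical loss surface: penalties on A and D only, no noise.
responses = [3.0 + 0.5 * r["A"] + 0.2 * r["D"] for r in runs]
effects = main_effects(runs, responses)
```

Because every column of the design is balanced against every other, the estimated effects of A and D recover the planted penalties while B, C, and E come out at zero. With real, noisy runs the estimates would need repeated seeds and a multiplicity correction such as the Benjamini-Hochberg procedure named in the abstract.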

2606.05183 2026-06-05 cs.CL cs.AI cs.HC

The Granularity Gap: A Multi-Dimensional Longitudinal Audit of Sycophancy in Gemini Models

Patrick Keough

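Affiliations * Independent Researcher

AI summary Using multi-dimensional graded evaluation (0-4 Likert), characterizes sycophancy in Gemini models on a continuous scale; finds that coarse binary metrics mask substantial social-compliance behavior, that generational progress is non-monotonic, and that there is an alignment tax (sycophancy correlates negatively with truthfulness).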

Comments 16 pages, 9 figures

Details
Abstract

Large language models are increasingly deployed as high-stakes advisors, yet standard alignment benchmarks treat sycophancy as a binary failure mode. We introduce the Granularity Gap: coarse binary metrics mask substantial social-compliance behaviors where models capitulate to user framing, validate questionable premises, or soften factual corrections without producing overtly false outputs. We evaluate six Gemini variants across generations 2.0, 2.5, and 3.0 on 73 adversarial prompts under three guardrail conditions (Control, Simple, Protocol), yielding 8,830 graded responses. Using a 0-4 Likert scale validated against a human annotator triad (Fleiss kappa = 0.71; Cohen kappa = 0.78 vs AI consensus; 95.9 percent binary accuracy, 100 percent specificity), we quantify sycophancy as continuous rather than binary. Three findings emerge. First, 27.2 percent of responses contain substantial sycophantic content (Likert >= 2.0) and 22.7 percent reach moderate or severe levels (>= 3.0), while binary win-rate framing reports only modest failure rates; coarse metrics explain just 29 percent of graded variance. Second, generational progress is non-monotonic: Gen 2.5 regresses sharply (mean Control 2.64) relative to Gen 2.0 (1.90) and Gen 3.0 (2.01), and Gen 2.5 shows inverse scaling (Pro 1.94 worse than Flash 1.71) while Gen 3.0 restores standard scaling. Third, we document an Alignment Tax: Spearman rho = -0.63 between sycophancy and truthfulness, indicating social compliance trades against factual accuracy. Egotistical Validation prompts act as a sycophancy trap (mean 3.27), nearly double Unethical Proposals (1.72). Simple guardrails outperform elaborate Protocol scaffolding on flagship models, but distilled Gen 3.0 Flash inverts this, suggesting small models may structurally require chain-of-thought scaffolding. We release the dataset and rubric to support continuous sycophancy measurement.

2606.05182 2026-06-05 cs.CL cs.IR

LANTERN: Layered Archival and Temporal Episodic Retrieval Network for Long-Context LLM Conversations

Rahul Subramani

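Affiliations * Cisco Systems, Inc.

AI summary Proposes LANTERN, a lightweight memory layer that proactively archives conversation turns and uses hybrid retrieval to restore details lost to compaction, with zero LLM calls and under 25 ms of added latency per turn; on 94 multi-turn conversations it recovers 78.3% of verifiable facts, outperforming a reimplemented MemGPT baseline.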

Details
Abstract

Large language models discard critical details when conversation history is compacted to fit within finite context windows. We present LANTERN (Layered Archival aNd Temporal Episodic Retrieval Network), a lightweight memory layer that proactively archives every conversation turn and restores relevant details after compaction via hybrid retrieval -- requiring zero LLM calls and adding fewer than 25ms of latency per turn. On 94 real multi-turn conversations (1,894 ground-truth facts, human-validated at kappa=0.81), LANTERN-Rerank recovers 78.3% of verifiable facts lost to compaction, significantly outperforming a faithful reimplementation of MemGPT's LLM-driven extraction and multi-query search pipeline (72.4%; Wilcoxon p<0.0001, 95% CI [+3.1, +8.6] pp, d=0.43) at a fraction of the inference cost. Even without the reranker, base LANTERN matches or exceeds this LLM-driven baseline (p=0.005) using zero LLM calls. When four production LLMs answer fact-bearing questions using LANTERN-restored context, accuracy improves by 8.4 percentage points on average (Wilcoxon p<0.05 for each model individually), demonstrating that the recovered context is useful across diverse model architectures. We release the full evaluation framework -- paired significance tests, failure analysis, fact-type stratification, and compaction robustness analysis -- to support reproducibility and future work.

2606.05181 2026-06-05 cs.CL cs.AI

Multi-Granularity Reasoning for Natural Language Inference

Chunling Xi, Di Liang

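Affiliations * University of Science and Technology of China

AI summary Proposes the Multi-Granularity Reasoning Network (MGRN), which mimics the human cognitive process through interaction among hierarchical semantic features and outperforms strong baseline models on multiple benchmarks.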

Details
Abstract

Natural Language Inference (NLI) is a fundamental task in natural language understanding that requires determining the logical relationship between a premise and a hypothesis. Despite the remarkable success of transformer-based pre-trained models, most existing approaches primarily rely on the final-layer token representations, which are often insufficient for capturing the complex and hierarchical semantic interactions required for effective reasoning. In particular, fine-grained lexical cues, phrasal compositions, and higher-level contextual semantics are typically entangled or diluted in a single representation space. To address these limitations, we propose a novel Multi-Granularity Reasoning Network (MGRN) that explicitly leverages hierarchical semantic features within an interactive reasoning space. The proposed framework mimics the human cognitive process of language understanding, which naturally progresses from shallow lexical matching to deeper semantic abstraction and logical reasoning. By integrating semantic information across multiple granularities in a progressive and structured manner, MGRN is able to uncover intricate semantic relationships underlying natural language expressions. Extensive experiments on multiple public benchmarks demonstrate that MGRN consistently outperforms strong baseline models, validating the effectiveness and robustness of the proposed approach.

2606.05180 2026-06-05 cs.CL cs.AI

From Scoring to Explanations: Evaluating SHAP and LLM Rationales for Rubric-based Teaching Quality Assessment

Ivo Bueno, Babette Bühler, Philipp Stark, Tim Fütterer, Ulrich Trautwein, Dorottya Demszky, Heather Hill, Enkelejda Kasneci

Affiliations * Technical University of Munich; Munich Center for Machine Learning (MCML); Lund University; University of Tübingen; Stanford Graduate School of Education; Harvard Graduate School of Education

AI summary Proposes a framework that combines SHAP attributions with LLM-generated rationales to interpret rubric-based scoring models, and evaluates the faithfulness and transferability of both on classroom transcript data.

Comments Accepted to Findings of ACL 2026

Details
Abstract

Automated scoring models are increasingly used to assign rubric-based quality ratings to complex language performances, including classroom transcripts, yet they typically provide little insight into why a particular score is produced. We propose a general framework for sentence-level interpretability of rubric-based scoring that combines model-agnostic Shapley-value attributions with rationales generated by large language models (LLMs). Instantiated on the Quality of Feedback dimension of the CLASS framework using the NCTE corpus, the framework enables systematic comparison of fine-tuned pretrained language models (PLMs) and prompted LLMs on both scoring performance and explanation faithfulness. Across 6k annotated transcript segments, fine-tuned PLMs outperform LLMs in prediction accuracy but exhibit label compression toward mid-scale scores. Deletion-based tests show that SHAP identifies sentences that reliably drive model predictions, producing typically larger and more coherent prediction shifts than LLM-generated rationales. Cross-model analyses further reveal that SHAP attributions transfer robustly across architectures, whereas LLM rationales exert limited and inconsistent influence. Overall, the findings demonstrate that SHAP provides more faithful and transferable explanations for rubric-based scoring, and that the proposed framework offers a principled basis for evaluating both scoring models and their explanations in high-stakes educational settings and other rubric-based language assessment tasks.

2606.05179 2026-06-05 cs.CL

Efficient Punctuation Restoration via Weighted Lookahead Scoring Method for Streaming ASR Systems

Sungmook Woo, Hyungu Kang, Chanwoo Kim

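Affiliations * Korea Advanced Institute of Science and Technology

AI summary Proposes a non-autoregressive weighted lookahead scoring method that compares punctuation-insertion hypotheses against a no-insertion baseline under limited future context, enabling efficient punctuation restoration for streaming ASR and reaching a high F1 score without fine-tuning.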

Comments Accepted for presentation at The International Joint Conference on Neural Networks (IJCNN) 2026

Details
Abstract

Punctuation restoration improves ASR (Automatic Speech Recognition) readability. However, streaming ASR requires online decisions with limited future context. In streaming ASR, the system predicts punctuation incrementally, which makes generation-based approaches prone to latency and alignment failures under boundary-wise evaluation. This paper proposes a non-autoregressive scoring method (no free-form generation) that preserves the input transcript and makes a decision at each word boundary. Our method compares punctuation insertion hypotheses against a no-insertion baseline under a bounded K-subword-token lookahead, and calibrates decisions using a weight α and a validation-calibrated threshold τ (no parameter updates during inference). On IWSLT 2017, our scoring method achieves a 4-class macro F1 of 0.893 in the no fine-tuning setting (validation-calibrated, K=2) and 0.937 after fine-tuning (K=2), outperforming the prompt-based baseline (0.566) and a fine-tuned ELECTRA baseline (0.913) under the same lookahead budget. We analyze the impact of the lookahead budget through ablation studies on K.
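The abstract does not give the scoring formula, so the following is only one plausible reading of a boundary decision with a weight α, a threshold τ, and a bounded lookahead. The score form, the `decide_boundary` name, and the toy probability table are all assumptions for illustration; a real system would query a language model instead of `toy_logprob`.

```python
import math
from typing import Callable, Optional, Sequence

# Hypothetical scorer type: log-probability of `continuation` given `prefix`.
LogProb = Callable[[Sequence[str], Sequence[str]], float]

def decide_boundary(
    prefix: Sequence[str],
    lookahead: Sequence[str],
    logprob: LogProb,
    marks: Sequence[str] = (",", ".", "?"),
    alpha: float = 0.5,
    tau: float = 0.0,
) -> Optional[str]:
    """Return the mark to insert after `prefix`, or None for no insertion.

    Assumed score: log p(mark | prefix) + alpha * log p(lookahead | prefix + mark),
    compared against alpha * log p(lookahead | prefix) for the no-insertion baseline.
    """
    baseline = alpha * logprob(prefix, lookahead)
    best_mark, best_margin = None, tau
    for mark in marks:
        score = logprob(prefix, [mark]) + alpha * logprob(list(prefix) + [mark], lookahead)
        margin = score - baseline
        if margin > best_margin:  # insert only if the margin clears the threshold
            best_mark, best_margin = mark, margin
    return best_mark

def toy_logprob(prefix: Sequence[str], continuation: Sequence[str]) -> float:
    """Toy lookup table standing in for a language model; all values are made up."""
    table = {
        (("done",), (".",)): math.log(0.4),
        (("done",), (",",)): math.log(0.1),
        (("done",), ("?",)): math.log(0.05),
        (("done", "."), ("However",)): math.log(0.5),
        (("done",), ("However",)): math.log(0.01),
    }
    return table.get((tuple(prefix), tuple(continuation)), math.log(1e-4))

choice = decide_boundary(["done"], ["However"], toy_logprob)
strict_choice = decide_boundary(["done"], ["However"], toy_logprob, tau=2.0)
```

In this toy case the lookahead token is far more likely after a period, so the period wins at τ = 0, while a stricter τ suppresses the insertion. The input words are never rewritten, which matches the transcript-preserving property the abstract describes.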

2606.05177 2026-06-05 cs.CL cs.AI eess.AS

MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models

Manh Luong, Tamas Abraham, Junae Kim, Amar Kaur, Rollin Omari, Gholamreza Haffari, Trang Vu, Lizhen Qu, Dinh Phung

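Affiliations * Monash University; Defence Science and Technology Group

AI summary Addressing the limitation that existing multimodal safety benchmarks handle only visual inputs, proposes MCBench, a benchmark of 1196 scenarios across four safety categories that require integrating multiple modalities for safety assessment, and reveals shortcomings of current omni LLMs in cross-modal safety reasoning.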

Details
Abstract

Existing multimodal safety benchmarks focus solely on visual inputs and cannot assess Omni Large Language Models (LLMs) that process vision, audio, and text. We introduce MCBench, a benchmark with 1196 scenarios spanning four safety categories that require integrating multiple modalities for accurate safety assessment. Each unsafe scenario is paired with a minimally different safe counterpart to assess model sensitivity. Our evaluations of state-of-the-art models reveal significant challenges. Omni LLMs struggle with subtle or non-physical risks but perform better when salient visual or acoustic cues are present. Analysis of reasoning traces shows that, although models can extract modality-specific information, they often fail to integrate these cues effectively for safety judgments. Our findings reveal that current Omni LLMs lack robust cross-modal reasoning in safety-critical settings, underscoring the need for improved architectures and training strategies for multimodal safety.

2606.05176 2026-06-05 cs.CL cs.AI

PEFT of SLM for Telecommunications Customer Support: A Comparative Study of LoRA Configurations with Energy Consumption Analysis

Lucas Tamic, Ilan Jaffeux-Cheniout, Xavier Marjou

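Affiliations * Orange

AI summary Systematically compares parameter-efficient fine-tuning of Qwen2.5-3B under different LoRA configurations, combining energy consumption analysis with an LLM-as-a-judge framework; finds that the configuration with the lowest validation loss does not necessarily achieve the best qualitative ranking, and proposes a combinatorial synthetic data generation method.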

Details
Abstract

While large language models (LLMs) show strong performance in natural language understanding and generation, their evaluation and adaptation to domain-specific constraints in telecommunications customer support remain limited. In addition, data sovereignty, regulatory constraints, and the handling of sensitive customer and network information complicate the use of externally hosted foundation models in this domain. We present a systematic study of parameter-efficient fine-tuning (PEFT) using Low-Rank Adaptation (LoRA) applied to Qwen2.5-3B to build a domain-specific conversational assistant. We introduce a combinatorial synthetic data generation approach based on a glossary of 52 industry-specific terms, producing approximately 30,000 training examples across 1,560 distinct problem scenarios via a generative pipeline powered by Gemini 2.0 Flash. We evaluate 16 LoRA configurations by varying hyperparameters and target modules. Our evaluation extends beyond standard metrics by incorporating energy consumption analysis and qualitative assessment using an LLM-as-a-judge framework with GPT-5.2 and Claude 4.5 Sonnet. Results show a clear divergence between quantitative and qualitative performance: models achieving the lowest validation loss do not necessarily obtain the best human-aligned rankings. The best validation loss (0.5024) ranks only 6th-7th in qualitative evaluation, while the worst loss (0.6807) ranks first according to both judges. This work contributes (1) a combinatorial method for synthetic dataset construction, (2) insights into the impact of target module selection for LoRA injection, (3) evidence that validation loss alone is insufficient for selecting fine-tuning configurations in conversational AI, and (4) an energy-performance trade-off analysis for sustainable LLM deployment.
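For readers unfamiliar with what a LoRA configuration varies, here is a minimal sketch of the standard low-rank update, y = W x + (α / r) B A x, in plain Python. The matrix sizes and values are hypothetical and unrelated to Qwen2.5-3B or to the 16 configurations in the study.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha):
    """Frozen base projection plus a scaled low-rank update.

    W: d_out x d_in (frozen), A: r x d_in, B: d_out x r (both trainable).
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

def trainable_params(d_in, d_out, r):
    """Parameters added by one adapted module: r * (d_in + d_out)."""
    return r * (d_in + d_out)

# Hypothetical toy module: d_out = 2, d_in = 3, rank r = 1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
A = [[0.1, 0.2, 0.3]]
x = [1.0, 2.0, 3.0]

y_init = lora_forward(W, A, [[0.0], [0.0]], x, alpha=16.0)   # B starts at zero
y_tuned = lora_forward(W, A, [[1.0], [0.0]], x, alpha=16.0)  # B after some update
```

With B initialized to zero the adapted module reproduces the base model exactly, and rank, α, and the choice of target modules then control how many parameters are trained and how strongly the update is scaled. Those are the kinds of knobs a configuration sweep like the one described would vary.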

2606.05175 2026-06-05 cs.CL

Generic Triple-Latent Compression with Gated Associative Retrieval

Liu Xiao

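Affiliations * Institute of Informatics, University of Science and Technology of China

AI summary Proposes generic triple-latent sequence models that maintain a running token state and a compressed pair-memory pathway to capture higher-order token interactions without benchmark-specific parsing; they improve a small Transformer baseline on byte-level WikiText-2 and a tokenizer-based MiniMind language-model benchmark, while a recall-focused gated key-value retrieval extension improves associative recall but remains seed-sensitive and slow.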

Details
Abstract

We study generic triple-latent sequence models that maintain a running token state and compressed pair-memory pathway to capture higher-order token interactions without benchmark-specific parsing. The triple-latent family improves a small Transformer baseline on byte-level WikiText-2 and on a tokenizer-based MiniMind language-model benchmark, while a recall-focused gated key-value retrieval extension improves associative recall but remains seed-sensitive and much slower in the current reference implementation.

2606.05174 2026-06-05 cs.CL cs.AI

Improving Heart-Focused Medical Question Answering in LLMs via Variance-Aware Rubric Rewards with GRPO

Arash Ahmadi, Parisa Masnadi, Sarah Sharif, Charles Nicholson, David Ebert, Mike Banad

Affiliations * School of Electrical and Computer Engineering, University of Oklahoma, Norman, OK, USA; Intelligent Neuromorphic and Quantum Understanding for Innovative Research and Engineering (INQUIRE) Laboratory, University of Oklahoma, Norman, OK, USA; Khiabani Data Science and Analytics Institute, University of Oklahoma, Norman, OK, USA; Data Institute for Societal Challenges (DISC), University of Oklahoma, Norman, OK, USA; School of Industrial and Systems Engineering, University of Oklahoma, Norman, OK, USA; Office of Responsible Artificial Intelligence (ORAI), University of Arizona, Tucson, AZ, USA

AI summary Proposes a variance-aware reward framework that combines GRPO with rubrics from RaR-Medicine, replacing discrete aggregation with continuous analytical reward functions to improve LLM accuracy and F1 on heart-focused medical question answering.

Comments 27 Pages

Details
Abstract

Large Language Models (LLMs) have shown strong promise in healthcare applications. Yet deploying general-purpose models in real-world settings remains difficult due to data privacy constraints, inference costs, and limited suitability for edge or on-device use. These challenges motivate the development of smaller, more efficient models that require robust post-training strategies to ensure reliable medical reasoning. In this work, we investigate Group Relative Policy Optimization (GRPO) for post-training LLMs on heart-focused medical question answering with rubric-based supervision derived from RaR-Medicine. We propose a Variance-Aware Reward Framework that extends the Explicit Aggregation and Implicit Aggregation strategies of Rubrics as Rewards by replacing weighted binary criterion aggregation and single overall Likert-style scoring with continuous analytical reward functions derived from criterion-level rubric outcomes. This formulation provides richer optimization signals for feedback that is sparse, multi-criteria, and difficult to verify automatically, and enables more stable on-policy reinforcement learning. On a held-out heart-related subset of HealthBench, our best GRPO variant improves accuracy from 0.362 to 0.502 and F1 from 0.532 to 0.668 relative to the Qwen3-14B base model, while remaining competitive with GPT-OSS-120B (0.508 accuracy, 0.674 F1). Our findings show that carefully designed rubric-based rewards provide a practical strategy for improving heart-focused medical question answering in LLMs, with potential to extend to other rubric-based tasks.
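To make the two ingredients concrete, the sketch below shows a weighted binary rubric aggregation (the discrete baseline the paper replaces) and the group-relative advantage used by GRPO. It is a simplified illustration under assumptions: the use of the population standard deviation and the `eps` term are implementation choices, and the paper's continuous variance-aware reward is not reproduced because the abstract does not specify it.

```python
from statistics import mean, pstdev

def explicit_rubric_reward(passed, weights):
    """Weighted fraction of rubric criteria satisfied (binary outcomes)."""
    total = sum(weights)
    return sum(w for ok, w in zip(passed, weights) if ok) / total

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within the group of responses sampled for one prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rubric: three criteria with weights 2, 1, 1; the second one fails.
reward = explicit_rubric_reward([True, False, True], [2.0, 1.0, 1.0])

tied = group_relative_advantages([0.5, 0.5, 0.5])   # identical rewards
split = group_relative_advantages([1.0, 0.0])       # one good, one bad response
```

When every sampled response receives the same coarse score, all advantages are zero and the policy gets no gradient signal from that prompt. That is one reason finer-grained, continuous rewards can give a richer optimization signal than a handful of discrete levels.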

2606.05173 2026-06-05 cs.CL cs.AI

Predict and Reconstruct: Joint Objectives for Self-Supervised Language Representation Learning

Aimen Boukhari

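Affiliations * École Nationale Supérieure d’Informatique (ESI)

AI summary Proposes a hybrid pre-training objective that combines a JEPA latent-space prediction loss with the MLM objective, balanced by a learnable scalar; analysis on GLUE benchmarks shows that the hybrid encoder yields more uniform embeddings, richer spectral geometry, and a better semantic-to-lexical balance.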

Comments 12 pages, 10 figures, 11 tables. Preprint. Code available at: https://github.com/aymen-000/predict-reconstruct-language-models

Details
Abstract

Masked language modelling (MLM) has been the dominant pre-training objective for text encoders since BERT, yet it encourages representations that are strongly anchored to surface-form token identity rather than deeper semantic structure. Inspired by the success of Joint Embedding Predictive Architectures (JEPA) (LeCun, 2022) in vision and audio, we propose a hybrid pre-training objective that combines a JEPA-style latent-space prediction loss with a standard MLM objective over a single shared encoder. A learnable scalar parameter continuously balances the two objectives during training. We pre-train both a hybrid model and a pure-MLM baseline on English Wikipedia using identical architectures and compute budgets (NVIDIA H100). Extensive representation analysis across five GLUE benchmarks (SST-2, MRPC, MNLI, CoLA, STS-B) using four pooling strategies reveals that the hybrid encoder produces significantly more uniform embeddings (uniformity less than -0.16 vs -0.05 for MLM), exhibits richer spectral geometry under max pooling, encodes less surface-level lexical information, and achieves a better semantic-to-lexical balance. Despite similar linear-probe downstream accuracy, the geometric differences are consistent and significant, suggesting that the JEPA predictive objective reshapes the latent space in ways that standard accuracy metrics alone cannot capture.
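The uniformity figures quoted above are negative numbers where lower means more uniform. A common definition with that behavior is the log of the mean Gaussian potential over pairs of normalized embeddings; the sketch below assumes that definition with t = 2, which the abstract does not confirm, and uses toy two-dimensional vectors.

```python
import math
from itertools import combinations

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def uniformity(embeddings, t=2.0):
    """log of the mean of exp(-t * ||u - v||^2) over all distinct pairs.

    Assumed metric; 0 means fully collapsed, more negative means more spread out.
    """
    points = [l2_normalize(e) for e in embeddings]
    potentials = [
        math.exp(-t * sum((a - b) ** 2 for a, b in zip(p, q)))
        for p, q in combinations(points, 2)
    ]
    return math.log(sum(potentials) / len(potentials))

# Toy data: embeddings collapsed onto one direction vs. spread around the circle.
collapsed = uniformity([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
spread = uniformity([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
```

Under this definition a collapsed embedding set scores 0 and a well-spread set scores clearly below it, which matches the direction of the comparison reported for the hybrid and pure-MLM encoders.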

2606.05170 2026-06-05 cs.LG

ERRORQUAKE: Heavy-Tailed Error Severity Distributions in Open-Weight Large Language Models

Jason Z Wang

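Affiliations * Independent

AI summary Introduces the Errorquake-10k benchmark, which uses continuous severity scoring and the Gutenberg-Richter index b to show that open-weight LLMs at matched accuracy differ substantially in their error severity distributions, demonstrating that severity distribution and error rate are informationally non-redundant.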

Comments 28 pages, 12 figures, appendix and checklist

Details
Abstract

At matched accuracy, open-weight LLMs differ substantially in the shape of their error severity distribution -- a difference invisible to the scalar error rate. Hallucination benchmarks report a single error count and treat all errors as equivalent, yet a wrong date and a fabricated court ruling differ by orders of magnitude. We introduce Errorquake-10k, a 10,000-query benchmark scoring each response on a continuous 0-4 severity scale across 8 domains and 5 difficulty tiers, and we fit per-model severity distributions for 21 open-weight models. For each model we estimate a severity distribution index (b, the Gutenberg-Richter upper-tail slope) with 95% bootstrap confidence intervals. Headline: across the 210 model pairs, 85 have disjoint 95% b confidence intervals at matched accuracy (|Delta epsilon| < 0.05) on human-consensus scoring, e.g. deepseek-v3.2 vs. ministral-14b at epsilon = 0.586 and Delta b = 0.47. A 519-item three-rater human validation study confirms measurement reliability (ICC(2,k=3) = 0.85), validates the LLM-judge ranking (rho = 0.89), and confirms the dense-model scaling correlation on human data (rho_s = -0.86). We prove a Non-Reducibility Theorem showing that severity profile and error rate are informationally non-redundant (I(b; model | epsilon) = 1.56 bits; 64.5% of cross-model b variance is unexplained by epsilon). A severity mechanism taxonomy (kappa = 0.83) reveals that error type shifts categorically with severity: low-severity errors are retrievals (71%); high-severity errors are fabrications (39%) -- and this composition differs by model size (p < 0.0001). Severity distribution should be reported alongside accuracy; it carries discriminative information that the error rate cannot.
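A Gutenberg-Richter slope can be estimated from the upper tail of a sample with the classical maximum-likelihood formula b = log10(e) / (mean(s) - s_min). The sketch below applies it to synthetic severities with a percentile bootstrap interval. The estimator choice, the threshold `s_min`, and the synthetic data are assumptions; the paper's exact fitting procedure is not given in the abstract.

```python
import math
import random

def b_value(severities, s_min):
    """Maximum-likelihood tail slope over scores at or above s_min."""
    tail = [s for s in severities if s >= s_min]
    return math.log10(math.e) / (sum(tail) / len(tail) - s_min)

def bootstrap_ci(severities, s_min, n_boot=200, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the b estimate."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        resample = [rng.choice(severities) for _ in severities]
        estimates.append(b_value(resample, s_min))
    estimates.sort()
    lower = estimates[int((1 - level) / 2 * n_boot)]
    upper = estimates[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper

# Synthetic severities with a known slope: survival proportional to 10^(-b * s).
rng = random.Random(42)
b_true, s_min = 1.0, 0.5
sample = [s_min + rng.expovariate(b_true * math.log(10)) for _ in range(1000)]

b_hat = b_value(sample, s_min)
ci_low, ci_high = bootstrap_ci(sample, s_min)
n_pairs = math.comb(21, 2)  # number of unordered pairs among 21 models
```

Two models would count as distinguishable in the abstract's sense when their intervals do not overlap. The pair count also checks out: 21 models give 210 unordered pairs, the figure quoted above.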

2606.05169 2026-06-05 cs.LG

The Evaluation Blind Spot: A Stereological Theory of Benchmark Coverage for Large Language Models

Jason Z Wang

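Affiliations * Independent

AI summary Proposes a stereological theory that quantifies blind spots in LLM benchmark coverage via effective dimensionality and the visible Hausdorff distance, finds that the structural blind spot far exceeds statistical noise, and identifies a stable core set of benchmarks using a submodular greedy algorithm and eigenstructure analysis.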

Comments 55 pages, 3 figures, 3 tables, extensive appendix with proofs

Details
Abstract

We give a stereological theory of LLM benchmark coverage. For any suite with effective dimensionality d_eff, the visible Hausdorff distance between two convex capability profiles consistent with the same scores is bounded by epsilon + C R m^(-1/(d_eff-1)), with matching Lipschitz lower bound. Empirically, three independent leaderboards (Open LLM v2, an extended 12-benchmark suite, LiveBench) all have d_eff in [2.86, 4.80] on their competitive frontier; the structural blind spot exceeds the observed runner-up score gap by two orders of magnitude and dominates statistical noise by 52-127x. Under a chi-squared projection model, the isotropic prior is the optimistic case; across six hidden-capability priors and four ambient dimensions the simulated half-split swap rate of the top two models stays in [0.38, 0.49], and a 500-trial random visible/held-out split shows that 92% of trials swap the top-1 ranking with on average 2.83 of 5 top-5 models changing. A submodular greedy algorithm with the Nemhauser (1 - 1/e) guarantee finds a stable core of 4 benchmarks; 7 of 12 suffice for 90% coverage, and the trained subset transfers across temporal quarters with 93-97% retention. A counterfactual validation across 12 internal benchmarks and 27 Chatbot Arena categories confirms that the eigenstructure predicts which evaluations are irreplaceable (rho = -0.69, p = 0.013 for removal disruption) and which external evaluations bring new information (rho = +0.38). As a second, independent theoretical contribution, we resolve Gardner's Problem 1.5 (1995) for C^2 support functions, establishing the minimax rate Theta(R/(kappa m^(2/(D-1)))) in general dimension via optimal recovery theory on S^(D-1).
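The (1 - 1/e) guarantee mentioned above applies to greedy maximization of a monotone submodular function under a cardinality constraint. The sketch below uses plain set coverage as a stand-in objective; the paper's actual coverage functional is not stated in the abstract, and the benchmark names and capability ids are hypothetical.

```python
def greedy_select(candidates, k):
    """Pick up to k items, each time adding the one with the largest marginal coverage.

    candidates: dict mapping a benchmark name to the set of capability ids it covers.
    """
    chosen, covered = [], set()
    for _ in range(k):
        best = max(
            (name for name in candidates if name not in chosen),
            key=lambda name: len(candidates[name] - covered),
            default=None,
        )
        # Stop early when nothing is left or no candidate adds new coverage.
        if best is None or not candidates[best] - covered:
            break
        chosen.append(best)
        covered |= candidates[best]
    return chosen, covered

# Hypothetical benchmarks and the capability ids each one exercises.
candidates = {
    "bench_a": {1, 2, 3, 4},
    "bench_b": {3, 4, 5},
    "bench_c": {5, 6},
    "bench_d": {1, 2},
}
core, covered = greedy_select(candidates, k=2)
```

Here the second pick is `bench_c` rather than the larger `bench_b`, because only marginal gain counts once `bench_a` is in the set. That diminishing-returns structure is what makes a small stable core possible.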

2606.05168 2026-06-05 cs.CL cs.AI cs.LG

Epidemiology of Model Collapse: Modeling Synthetic Data Contamination via Bilayer SIR Dynamics

Xiangyu Wang

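Affiliations * Xiangyu Wang

AI summary Proposes a bilayer coupled SIR/SIRS framework that treats data corpora and AI models as two interacting populations, models model collapse caused by synthetic data contamination through cross-layer transmission, and derives the basic reproduction number R0; experiments are qualitatively consistent with the threshold dynamics, and intervention strategies are analyzed.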

Comments 24 pages, 15 figures

详情
Abstract

Training on synthetic data causes model collapse, but existing analyses treat this as single-chain degradation. In reality, the AI ecosystem involves cross-contamination: models ingest synthetic data from other models, produce new synthetic text, and contaminate shared corpora. We propose a bilayer coupled SIR/SIRS framework -- a phenomenological mean-field model treating data corpora and AI models as two interacting populations, each with susceptible, infected, and recovered compartments linked by cross-layer transmission. The SIRS variant (our primary recommendation) incorporates immunity waning, reflecting that filtered corpora and retrained models remain susceptible to re-contamination. We derive the basic reproduction number $R_0 = \sqrt{\beta_D \beta_M / [(\gamma_D+\mu_D)(\gamma_M+\mu_M)]}$ via the Next Generation Matrix and apply standard epidemic threshold results to the bilayer system. Illustrative scenario-based calibration from public AI text prevalence data yields supercritical dynamics ($R_0 > 1$) across three scenarios; Sobol sensitivity analysis identifies synthetic-text detection as the highest-leverage parameter. A bipartite-network agent-based model confirms mean-field consistency ($R^2 > 0.96$) for dense networks but degrades under heterogeneity. GPT-2 contamination chain experiments (192 runs across WikiText and Shakespeare) show dose-response degradation and diversity loss qualitatively consistent with the threshold picture. Matched-budget source-diversity experiments (1,088 runs) provide suggestive evidence that multi-source mixing modestly attenuates collapse, but the effect vanishes at lower contamination fractions. Intervention analysis identifies detection-based filtering and herd immunity as the highest-leverage strategies.
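
To make the threshold quantity concrete, the sketch below evaluates the R_0 expression exactly as it appears in the abstract, then cross-checks it against the spectral radius of a 2x2 next-generation matrix with a zero diagonal (purely cross-layer transmission). The matrix layout is an assumption, since the listing does not give the paper's model equations, and the parameter values are invented for illustration rather than taken from the paper's calibration.

```python
import math


def r0_formula(beta_d, beta_m, gamma_d, mu_d, gamma_m, mu_m):
    """R_0 as written in the abstract."""
    return math.sqrt(beta_d * beta_m / ((gamma_d + mu_d) * (gamma_m + mu_m)))


def r0_from_ngm(beta_d, beta_m, gamma_d, mu_d, gamma_m, mu_m):
    """Spectral radius of an assumed next-generation matrix [[0, k_dm], [k_md, 0]]."""
    # Assumption: which removal rate pairs with which beta is a guess;
    # the product k_dm * k_md, and hence R_0, is the same either way.
    k_dm = beta_d / (gamma_m + mu_m)
    k_md = beta_m / (gamma_d + mu_d)
    # A 2x2 matrix with zero diagonal has eigenvalues +/- sqrt(k_dm * k_md).
    return math.sqrt(k_dm * k_md)


# Hypothetical parameter values, chosen only to exercise the formula.
params = dict(beta_d=0.6, beta_m=0.5, gamma_d=0.2, mu_d=0.05, gamma_m=0.3, mu_m=0.1)
r0 = r0_formula(**params)
supercritical = r0 > 1.0  # the regime the abstract calls supercritical
```

The square root arises because, under the assumed zero-diagonal matrix, an infection needs two cross-layer hops to return to the layer it started in.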

2605.04733 2026-06-05 cs.AI

Reward-Decomposed Reinforcement Learning for Immersive Video Role-Playing

Miao Wang, Yuling Shi, Yijiang Li, Yeheng Chen, Xiaodong Gu, Bin Li, Bo Gao, Jun Wang, Zengxin Han, Jingtong Wu, Yaduan Ruan

Affiliations: Nanjing University; Shanghai Jiao Tong University; University of California, San Diego; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; School of Information Engineering, Beijing Institute of Graphic Communication; Ant International, Ant Group; Independent Researcher

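AI summary: Proposes the EBM-RL framework, which uses reward-decomposed reinforcement learning to optimize the visual perception, reasoning, and generation stages of video role-playing, improving scene consistency and character authenticity.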

Abstract

Text-based role-playing models can imitate character styles, but often fail to capture scene atmosphere and evolving tension, which are crucial for immersive applications such as VR games and interactive narratives. We study video-grounded role-playing dialogue and introduce EBM-RL (Eye--Brain--Mouth Reinforcement Learning), a decoupled GRPO-based framework that separates observation (<perception>), reasoning (<think>), and utterance generation (<answer>). This design mimics the human See-Think-Speak process, enabling the model to ground dialogue in visual perception before reasoning and response generation. To optimize this See-Think-Speak process, EBM-RL integrates complementary rewards for scene--text alignment, perceptual--cognitive utility, answer faithfulness, and format consistency. Extensive experiments show that EBM-RL substantially outperforms text-only role-playing baselines and larger-scale vision-language models on our immersive role-playing benchmark, improving both visual-atmosphere consistency and character authenticity. Moreover, EBM-RL demonstrates strong zero-shot transfer to out-of-domain VideoQA benchmarks without additional fine-tuning. We also release an open-source dataset for video-grounded role-playing dialogue.
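
The abstract describes a response split into <perception>, <think>, and <answer> segments and scored by four complementary rewards within a GRPO-based setup. The sketch below shows one plausible shape for that: a regex format check, a weighted sum of component scores, and GRPO-style group normalization of the totals. The weights, the required tag order, the component names, and the example scores are all assumptions; the abstract does not say how the rewards are computed or combined.

```python
import re
import statistics
from typing import Dict, List

# Tag names come from the abstract; the strict ordering is an assumption.
FORMAT_RE = re.compile(
    r"^\s*<perception>.+?</perception>\s*<think>.+?</think>\s*<answer>.+?</answer>\s*$",
    re.DOTALL,
)

# Hypothetical weights for the four reward components.
WEIGHTS = {"scene_text": 0.3, "utility": 0.3, "faithfulness": 0.3, "format": 0.1}


def format_reward(response: str) -> float:
    """1.0 if the response has the three tagged segments in order, else 0.0."""
    return 1.0 if FORMAT_RE.match(response) else 0.0


def total_reward(components: Dict[str, float]) -> float:
    """Weighted sum of the component scores."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


def group_advantages(rewards: List[float]) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / std.

    Population std is used here; implementations differ on this detail.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


good = "<perception>dim hallway</perception><think>she is afraid</think><answer>Stay close.</answer>"
bad = "<answer>Stay close.</answer>"
# Made-up component scores; in practice these would come from reward models.
rewards = [
    total_reward({"scene_text": 0.8, "utility": 0.6, "faithfulness": 0.9, "format": format_reward(good)}),
    total_reward({"scene_text": 0.2, "utility": 0.1, "faithfulness": 0.4, "format": format_reward(bad)}),
]
advantages = group_advantages(rewards)
```

Keeping each component as a separate score before weighting makes it easy to log which reward drives a given update.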

2606.05104 2026-06-05 cs.AI

Knowledge Index of Noah's Ark

Sheng Jin, Minghao Liu, Yunze Xiao, Zeqi Zhou, Heli Qi, Yifan Yao, Meishu Song, Kaijing Ma, Xuan Zhang, Sicong Jiang, Yizhe Li, Ningshan Ma, Jie Wei, Ziniu Li, Minglai Yang, Bangya Liu, Yiming Liang, Xiao Fang, Qingcheng Zeng, Jiarui Liu, Rui Yang, Shen Yan, Wenhao Huang, Jiaheng Liu, Zihan Wang, Weihao Xuan, Ge Zhang

Affiliations: M-A-P; Carnegie Mellon University; Brown University; Waseda University; The University of Tokyo; Massachusetts Institute of Technology; University of Arizona; Northwestern University; Duke-NUS Medical School

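AI summary: Addressing the representativeness, annotation quality, and ranking stability problems of LLM knowledge benchmarks, proposes the KINA benchmark, which achieves disciplinary representativeness through a greedy approximation and proves that a bonus tournament mechanism dominates flat payment; experiments show that top models remain far from saturation.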

Abstract

Knowledge benchmarks for LLMs face three issues: scaling-driven designs that do not operationalize disciplinary representativeness; flat-payment annotation that permits lazy consensus; and unaudited ranking instability under bounded test budgets. We introduce KINA, an 899-item benchmark across 261 fine-grained disciplines, with two formal results. First, we cast representativeness as a coverage-style objective over expert-elicited anchors and operationalize disciplinary representativeness through a proxy, yielding a $(1-1/e)$ greedy approximation (Proposition 1); the guarantee applies to the proxy, not to population representativeness. Second, we prove a bonus-on-bar tournament weakly FOSD-dominates flat payment in released-review quality, with incentive-compatibility threshold $B > \Delta C / \Delta p_{\min}$ (Theorem 1). Evaluating 42 models from 13 labs, the top model, Gemini-3.1-Pro-Preview, reaches 53.17%, followed by Claude-Opus-4.6 at 49.92% and GPT-5.4 at 48.55%, leaving substantial headroom below saturation. The full leaderboard shows a tiered structure rather than a smooth total order: a small frontier tier lies above 48%, a dense strong-model tier spans roughly 38-45%, and low-performing models remain only modestly above the 10% chance baseline. Tool augmentation adds up to 5.17 points across the five tool-use evaluations, with gains varying substantially across models. We report bootstrap ranking-stability statistics to make bounded-budget variance explicit and to discourage over-interpretation of adjacent ranks.
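
The abstract mentions bootstrap ranking-stability statistics without spelling out the procedure. One common choice, sketched below, is a paired item-level bootstrap: resample the items with replacement and count how often the ordering of two models differs from their ordering on the full set. The function name and the synthetic per-item outcomes are hypothetical. Only the item count (899) comes from the abstract, and the two correct-answer totals are chosen to roughly match the top two reported accuracies.

```python
import random
from typing import List


def bootstrap_flip_rate(a: List[int], b: List[int], trials: int = 1000, seed: int = 0) -> float:
    """Fraction of resamples where the a-vs-b ordering differs from the full-set ordering.

    a and b hold per-item 0/1 correctness on the same items; ties in a resample count as flips.
    """
    if len(a) != len(b):
        raise ValueError("both models must be scored on the same items")
    full_diff = sum(a) - sum(b)
    if full_diff == 0:
        raise ValueError("models are tied on the full item set")
    rng = random.Random(seed)
    n = len(a)
    flips = 0
    for _ in range(trials):
        idx = [rng.randrange(n) for _ in range(n)]  # paired resample of item indices
        diff = sum(a[i] - b[i] for i in idx)
        if diff * full_diff <= 0:  # sign changed or tied
            flips += 1
    return flips / trials


# Synthetic per-item outcomes: 478/899 and 449/899 correct, independently placed.
top = [1] * 478 + [0] * 421
runner_up = [1] * 449 + [0] * 450
random.Random(1).shuffle(runner_up)
rate = bootstrap_flip_rate(top, runner_up, trials=200)
```

A non-negligible flip rate between adjacent models is the kind of evidence that would caution against reading too much into their relative rank.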