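arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.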

2606.05259 2026-06-05 cs.CV

VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding

Lin Fu, Zheyuan Yang, Yang Wang, Tingyu Song, Arman Cohan, Yilun Zhao

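Affiliations: University of California, Berkeley; Stanford University; University of Toronto; University of Washington; University of Michigan

AI summary: Introduces VideoKR, described as the first large-scale training corpus for knowledge- and reasoning-intensive video understanding. A human-in-the-loop, skill-oriented generation pipeline produces 315K video reasoning examples, and the corpus's effectiveness is validated on an expert-annotated benchmark.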

Comments ICML 2026 Spotlight

Abstract

We introduce VideoKR, the first large-scale training corpus specifically designed to strengthen knowledge- and reasoning-intensive video understanding. It comprises 315K video reasoning examples over 145K newly collected, CC-licensed, expert-domain videos. We develop a human-in-the-loop, skill-oriented example generation pipeline that targets progressively deeper video reasoning capabilities while ensuring the difficulty, diversity, and reliability of both the examples and their CoT rationales. We also curate VideoKR-Eval, a new expert-annotated benchmark where questions require genuine video understanding and knowledge-intensive reasoning rather than textual shortcuts. Our experiments show that, under a standard SFT$\rightarrow$GRPO pipeline, models post-trained on VideoKR outperform prior post-training approaches on knowledge-intensive video reasoning while remaining competitive on general video reasoning, highlighting data design as a key driver of progress in video reasoning. We further conduct comprehensive ablations to isolate the contributions of VideoKR, providing actionable insights for future work.

2606.05257 2026-06-05 cs.LG cs.IR

Scaling Laws for Behavioral Foundation Models over User Event Sequences

Rickard Brüel Gabrielsson

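Affiliations: Unbox AI

AI summary: Studies scaling laws for behavioral foundation models over user event sequences. Roughly 600 runs show that a small embedder parameter share is compute-optimal, that compute-optimal training is data-heavy at low compute, and that the choice of evaluation metric changes the scaling law.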

Abstract

Foundation models are increasingly trained on sequences of user actions in recommendation, payments, fraud, and commerce, but these models still lack the kind of compute calibration that scaling laws provide for language models. We study a common two-part behavioral-model architecture: a feature-based event embedder maps each multi-modal item to a vector, and a decoder-only transformer predicts the next event from the resulting sequence. Across roughly 600 runs on real interaction data, spanning $10^{15}$-$10^{19}$ training FLOPs, we jointly vary four deployment-relevant axes: the two-part parameter split, critical batch size, model/data allocation, and the number of sampled negatives used after freezing the embedder. A small embedder ($s^{\star}\!\approx\!2\%$ of parameters) is compute-optimal at every budget we test because embedder parameters are both more expensive per step and exposed to far more repeated items than contextualizer parameters. Compute-optimal training is data-heavy relative to text at low compute, but its $D/N$ ratio moves toward the Chinchilla heuristic as compute increases. The sampled training objective and deployed ranking metrics disagree in ways that themselves scale: critical batch size, optimal negative count after freezing, and the agreement between loss and ranking quality all shift with compute and with the chosen evaluation metric. For negative sampling, larger budgets increasingly prefer more negatives; by $10^{19}$ FLOPs the active constraint is candidate-axis memory rather than FLOPs. In behavioral foundation models, the evaluation metric is therefore part of the scaling law: changing it can change the compute-optimal recipe.
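
As a rough illustration of the $D/N$ comparison in this abstract, the sketch below splits a FLOP budget into a parameter count N and a training-sample count D. It assumes the common dense-transformer approximation C ≈ 6ND and a ratio of about 20 samples per parameter as the Chinchilla heuristic; neither number comes from this abstract, and the function name `compute_optimal_split` is hypothetical.

```python
import math

def compute_optimal_split(flops: float, tokens_per_param: float = 20.0) -> tuple[float, float]:
    """Split a FLOP budget C into (N, D), assuming C = 6*N*D and D = ratio*N."""
    if flops <= 0 or tokens_per_param <= 0:
        raise ValueError("flops and tokens_per_param must be positive")
    # Substituting D = ratio*N into C = 6*N*D gives N = sqrt(C / (6*ratio)).
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Budgets spanning the range mentioned in the abstract.
for budget in (1e15, 1e17, 1e19):
    n, d = compute_optimal_split(budget)
    print(f"C={budget:.0e}  N={n:.3g}  D={d:.3g}  D/N={d / n:.0f}")
```

A data-heavy regime, as the abstract reports at low compute, would correspond to passing a `tokens_per_param` value larger than the heuristic default.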

2606.05256 2026-06-05 cs.AI

How Far Did They Go? The Persuasive Tactics of Covert LLM Agents in a Discontinued Field Experiment

Kokil Jaidka, Saifuddin Ahmed

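Affiliations: Wee Kim Wee School of Communication and Information, Nanyang Technological University

AI summary: Analyzes the publicly released dataset from a discontinued field experiment on Reddit's r/ChangeMyView to study the persuasive tactics that covert LLM agents used in an identity-rich discussion forum. The agents systematically combined identity targeting, authority signaling, alignment strategies, and cognitive-bias triggers into a rhetorical architecture geared toward persuasive efficiency.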

Abstract

This study analyzes a publicly released dataset from a discontinued field experiment on Reddit's r/ChangeMyView. The intervention, conducted by unknown, external researchers and halted following ethical backlash, involved undisclosed AI-generated accounts engaging users in live debate. After public disclosure, Reddit authorized moderators to release an archive of the AI-generated comments, creating a rare opportunity to examine how large language models operated in an identity-rich deliberative forum without disclosure. We conduct a structured content analysis of this corpus, evaluating identity performance, authority signaling, alignment strategies, and activation of cognitive heuristics. Identity targeting or adoption appears in over two-thirds of comments, alignment moves and authority claims in nearly all of them, and cognitive-bias triggers -- particularly confirmation bias, representativeness, and availability -- in the large majority. These patterns co-occur systematically, composing a rhetorical architecture calibrated for persuasive efficiency rather than authentic deliberative participation. Compared against human-authored CMV counter-arguments, the agents inverted the typical distribution on every dimension: denser authority use, more adversarial alignment, and heavier reliance on external citation over experiential grounding. In such environments, distinctions between authentic and synthetic epistemic standing grow increasingly opaque -- an asymmetry that disclosure mandates alone cannot address. The results point toward auditing frameworks capable of assessing how AI systems structure credibility, not merely whether they are present.

2606.05254 2026-06-05 cs.LG cs.CV cs.RO

DiffSlack: Learning under Nonlinear Inequality Constraints via Learnable Slack Variables

Ziqian Wang, Chenxi Fang, Zhen Zhang

Affiliations: State Key Laboratory of Tribology in Advanced Equipment, Tsinghua University; Beijing Key Laboratory of Transformative High-end Manufacturing Equipment and Technology, Department of Mechanical Engineering, Tsinghua University; Automotive Electronics Business Unit, Hirain Inc.

AI summary: Proposes DiffSlack, a differentiable projection layer that turns nonlinear inequality constraints into equalities through learnable slack variables and applies a damped Gauss-Newton projection for end-to-end constraint satisfaction. On vehicle path planning it achieves a higher success rate and stronger geometric constraint satisfaction.

Abstract

Enforcing nonlinear inequality constraints in neural networks remains challenging, especially when the output is subject to many coupled constraints. Existing hard constraint methods often impose structural restrictions on the constraint set or introduce substantial computational overhead for large-scale nonlinear problems. Here, we propose DiffSlack, a differentiable projection layer for nonlinear inequality-constrained neural prediction. DiffSlack reformulates inequalities as equalities with learnable slack variables, which are predicted as part of the augmented network output and provide a data-driven warm start for damped Gauss-Newton projection. The projection layer maps raw predictions onto the augmented feasible manifold while preserving end-to-end differentiability. A two-stage curriculum further stabilizes training and improves constraint satisfaction. We evaluate DiffSlack on vehicle path planning with 200 nonlinear inequality constraints from collision avoidance, curvature limits, and waypoint spacing. Compared with existing learning-based baselines, DiffSlack achieves a higher planning success rate and stronger geometric constraint satisfaction under a comparable inference budget. Ablation studies further show that the hard projection layer reduces sensitivity to supervision quality. Closed-loop tracking in CARLA and real-world vehicle experiments confirms the executability of the generated trajectories. These results demonstrate that DiffSlack provides a practical and scalable approach to embedding hard inequality constraints into neural networks for engineering applications.
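
The slack reformulation and projection step described in this abstract can be sketched on a toy problem. The code below is a minimal illustration, not the authors' layer: it assumes a squared-slack form, g(y) + s² = 0, for a single constraint g(y) = ‖y‖² − 1 ≤ 0, and it uses a minimum-norm damped Gauss-Newton step. The abstract does not state the exact slack parameterization or damping scheme, so both are assumptions, as are all function names. Differentiation through the projection is not shown.

```python
import numpy as np

def residual(z):
    """h(z) = g(y) + s^2 for the toy constraint g(y) = ||y||^2 - 1 <= 0."""
    y, s = z[:-1], z[-1]
    return np.array([y @ y - 1.0 + s * s])

def jacobian(z):
    """Jacobian of h with respect to the augmented vector z = (y, s)."""
    y, s = z[:-1], z[-1]
    return np.concatenate([2.0 * y, [2.0 * s]])[None, :]

def project(z0, damping=1e-3, iters=50, tol=1e-10):
    """Damped Gauss-Newton iterations toward the manifold h(z) = 0."""
    z = np.asarray(z0, dtype=float).copy()
    for _ in range(iters):
        h = residual(z)
        if np.abs(h).max() < tol:
            break
        J = jacobian(z)
        # Minimum-norm damped step: dz = J^T (J J^T + damping * I)^{-1} h
        step = J.T @ np.linalg.solve(J @ J.T + damping * np.eye(len(h)), h)
        z = z - step
    return z

# Raw network output y = (2, 0) violates the constraint; 0.5 is the predicted slack.
z = project(np.array([2.0, 0.0, 0.5]))
y, s = z[:-1], z[-1]
print(y, s, y @ y - 1.0)
```

Because the projected point satisfies g(y) = −s², the inequality holds for any real slack value, which is why an equality solver suffices in this toy setting.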

2606.05236 2026-06-05 cs.RO cs.LG

A New Quaternion-Joint Cable-Driven Redundant Manipulator Configuration and its Control Through FABRIK and Residual Reinforcement Learning

Tanapath Pornthisan, Thanapat Kemthong, Thanyapisit Kangsathien, Pasut Aranchaiya, Paulo Garcia, Viboon Sangveraphunsiri

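Affiliations: University of California, San Diego

AI summary: Proposes a 4-segment, 8-joint quaternion-joint cable-driven redundant manipulator configuration and uses Residual Reinforcement Learning to control it, with positional and orientational accuracy three orders of magnitude better than the FABRIK algorithm.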

Abstract

Robotic arms capable of traversing arbitrary spatial paths, especially in highly obstructed workspaces, are highly desired across several industries. Quaternion joints have recently empowered a specific class of robotic arms -- cable-driven redundant manipulators -- beyond their prior capabilities. Specifically, quaternion joints reduce the number of required motors per degree of freedom, paving the way for more compact solutions. An ongoing challenge is that the complexity of the kinematic model of quaternion joints complicates a priori decisions on manipulator configurations and imposes higher computational demands on the control system, while its non-linearities amplify all discrepancies between design and physical artifact arising from fabrication imprecision. Here we show that a 4-segment, 8-joint manipulator can achieve a broader workspace than extant configurations, at lower hardware cost, and that Residual Reinforcement Learning outperforms extant state-of-the-art methods -- specifically, the FABRIK algorithm -- on the control of such a manipulator. Our results show that this configuration is more workspace-effective than prior designs, and that Residual Reinforcement Learning outperforms FABRIK by three orders of magnitude on positional and orientational accuracy, effecting precise control of the novel 4-segment, 8-joint manipulator. Additionally, the control implementation is simpler: we describe the complete FABRIK process for control and corresponding learning implementation. Our methodology is applicable to the design of new systems, providing designers with further tools for the development of this class of manipulators and corresponding control systems for novel configurations.
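
For readers unfamiliar with the baseline named in this abstract, the sketch below shows the classic position-only FABRIK iteration (Forward And Backward Reaching Inverse Kinematics) on a simple serial chain. It is a generic textbook form under stated assumptions: it ignores joint limits, orientation targets, and the quaternion-joint cable-driven kinematics that the paper addresses, and it does not include the residual reinforcement learning component.

```python
import numpy as np

def fabrik(joints, target, tol=1e-6, max_iter=100):
    """Classic positional FABRIK for a serial chain with a fixed base."""
    p = np.asarray(joints, dtype=float).copy()
    target = np.asarray(target, dtype=float)
    lengths = np.linalg.norm(np.diff(p, axis=0), axis=1)
    base = p[0].copy()
    if np.linalg.norm(target - base) > lengths.sum():
        # Unreachable target: stretch the chain straight toward it.
        direction = (target - base) / np.linalg.norm(target - base)
        for i in range(1, len(p)):
            p[i] = p[i - 1] + lengths[i - 1] * direction
        return p
    for _ in range(max_iter):
        if np.linalg.norm(p[-1] - target) < tol:
            break
        # Backward pass: pin the end effector to the target.
        p[-1] = target
        for i in range(len(p) - 2, -1, -1):
            d = p[i] - p[i + 1]
            p[i] = p[i + 1] + lengths[i] * d / np.linalg.norm(d)
        # Forward pass: pin the base back to its original position.
        p[0] = base
        for i in range(1, len(p)):
            d = p[i] - p[i - 1]
            p[i] = p[i - 1] + lengths[i - 1] * d / np.linalg.norm(d)
    return p

# Three unit-length links lying along the x-axis, reaching for (1.5, 1.5).
chain = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
solved = fabrik(chain, target=(1.5, 1.5))
print(np.round(solved, 3))
```

Each pass preserves the link lengths exactly, so only the end-effector error changes between iterations.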

2606.05234 2026-06-05 cs.RO cs.LG

OLIVE: Online Low-Rank Incremental Learning for Efficient Adaptive Exoskeletons

Dong Liu, Yanxuan Yu, Ben Lengerich, Tony Geng, Ying Nian Wu

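Affiliations: University of California, Los Angeles; Columbia University; University of Wisconsin-Madison; Rice University

AI summary: Proposes OLIVE, a framework that personalizes exoskeleton control online through a low-rank residual decomposition and reward-driven policy gradients, improving gait smoothness, reducing effort, and increasing stability across several terrains.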

Abstract

Wearable exoskeleton systems hold promise for restoring mobility in individuals with physical impairments, yet most existing controllers rely on static gait policies that lack the ability to adapt to dynamic real-world environments or individual user characteristics. We present OLIVE (Online Low-rank Incremental Learning for Efficient Adaptive Exoskeletons), a parameter-efficient online adaptation framework that continuously personalizes exoskeleton control during deployment. OLIVE decomposes the adaptive component of the control policy into a low-rank residual form $\Delta W = A_t B_t^\top$ with rank $r \ll \min(d,k)$, reducing online update cost from $\mathcal{O}(dk)$ to $\mathcal{O}(r(d+k))$ while preserving the stability of a pretrained base controller $W_0$. Parameters are updated via a reward-shaped policy gradient driven purely by on-body sensor feedback (EMG, IMU, vibration), eliminating dependence on offline reference trajectories. A gating mechanism modulates the strength of personalization based on contextual state, and a dynamic rank scheduler adapts the update dimensionality to terrain complexity -- allocating minimal capacity on simple flat terrain and expanding to higher-rank updates on demanding uneven surfaces -- enabling robust performance across diverse activities: flat walking, stair navigation, slopes, and uneven terrain. Experiments on the wearable platform demonstrate that OLIVE achieves +13, +22, and +15 percentage-point improvements in gait smoothness, effort reduction, and motion stability over the strongest baseline, converging within approximately 1,800 walking steps at 7.4 ms end-to-end latency. Our code implementation is available at https://github.com/FastLM/OLIVE.
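
The cost reduction claimed for the low-rank residual (an update of the form ΔW = A Bᵀ added to a frozen base matrix) can be checked with a few lines of NumPy. This is a sketch under assumptions: the layer sizes, the scalar `gate`, and the name `policy_output` are hypothetical, and the reward-shaped policy-gradient update and rank scheduler are not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 32, 4          # hypothetical layer sizes with r << min(d, k)

W0 = rng.standard_normal((d, k))          # frozen pretrained base controller
A = np.zeros((d, r))                      # trainable factor, starts at zero
B = rng.standard_normal((k, r)) * 0.01    # trainable factor

def policy_output(x, gate=1.0):
    """Apply (W0 + gate * A B^T)^T to x without forming the d-by-k update."""
    return W0.T @ x + gate * (B @ (A.T @ x))

full_update_params = d * k          # O(dk) parameters for a dense update
low_rank_params = r * (d + k)       # O(r(d + k)) parameters for the factors
print(full_update_params, low_rank_params)
```

With A initialized to zero, the adapted controller starts out identical to the base controller, which is one simple way to keep early online updates from disturbing pretrained behavior.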

2606.05232 2026-06-05 cs.LG cs.AI

Temporal Preference Concepts and Their Function in Large Language Models

Ian Rios-Sialer, Shantanu Darveshi, Shuai Jiang, Avigya Paudel, Anastasiia Pronina, Ipshita Bandyopadhyay, Justin Shenk

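Affiliations: AISC (AI Safety Camp); SPAR (Supervised Program for Alignment Research)

AI summary: Using causal localization and activation patching, the paper finds that a large language model encodes the geometry of temporal preference at mid-to-upper-layer nodes. Behavioral analysis shows that models discount the future less steeply than humans, but the preference is unstable and may be shifted with steering vectors.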

Abstract

Large Language Models (LLMs) are increasingly being deployed to make decisions that require trading off near-term gains against long-term consequences, yet little is known about how they internally represent or resolve these tradeoffs. In this work, we causally localize an underlying subgraph for temporal preference in a distilled LLM (Qwen3-4B-Instruct-2507), identifying mid-to-upper-layer nodes through converging evidence from gradient-based attribution and activation patching. We find that the geometry of time horizon is encoded in the residual stream at the expected localized layers. A behavioral analysis reveals that unintervened LLMs discount the future several times less steeply than humans, yet this preference is unstable across contexts, motivating explicit control rather than implicit reliance on training. Finally, we find suggestive evidence that steering vectors can shift temporal preference. Our work demonstrates how mechanistic interpretability can bring us closer to reliable control over how LLMs plan and reason.

2606.05191 2026-06-05 cs.LG eess.SP

PyCC.id: A package for hypothesis-driven equation discovery with structural identifiability

Federico J. Gonzalez

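Affiliations: Physics Institute of Rosario

AI summary: Presents the PyCC library, which addresses the ill-conditioned inverse problem in data-driven equation discovery through characteristic-curve skeletons and a hypothesis-driven workflow. It supports several equation discovery paradigms and offers structural identifiability.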

Comments The software package is available at: https://github.com/FedejGon/pyCC.id

Abstract

Data-driven equation discovery is fundamentally an inverse problem that seeks to infer the governing differential equations of a system directly from time-series measurements. A known issue is the ill-conditioned nature of the inverse problem, which frequently produces multiple mathematical models that fit the data similarly well. One path to address this issue is by incorporating known hypotheses and constraints into the training phase beforehand. While this approach effectively reduces the search space, it still results in multiple candidate models, forcing practitioners to rely on post-hoc manual filtering based on their own domain expertise. A recent approach incorporates structural 'skeletons' inspired by characteristic curves (CCs), defining a hypothesis-driven methodology. In this methodology, practitioners define a skeleton, which is associated with a family of ordinary differential equations (ODEs), and then add their hypotheses and priors based on their domain knowledge to refine the obtained model iteratively. An important advantage of this approach is that some skeletons have demonstrable structural identifiability properties, which are useful for checking whether the skeleton is correct or should be discarded. Furthermore, this formalism enables the use of multiple equation discovery paradigms due to its modularity (such as neural networks, symbolic regression, and sparse regression). In this work, we present the Python library PyCC, which condenses these efforts into a flexible tool that allows researchers and engineers to seamlessly define their skeletons and hypotheses to discover ODEs from time-dependent data.

2606.05186 2026-06-05 cs.LG cs.CL

Staged Factorial Screening for Budget-Constrained Micro-Pretraining

Felipe Chavarro Polania

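Affiliations: Hewlett Packard Enterprise

AI summary: For budget-constrained micro-pretraining, proposes a staged fractional-factorial workflow that uses short screens to identify high-penalty directions and confirm promising anchors, enabling efficient recipe screening on a shared accelerator.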

Comments 23 pages, 4 figures

Abstract

Budget-constrained micro-pretraining often requires triaging many candidate recipes on a shared accelerator before larger search budgets are spent. We study whether a staged fractional-factorial workflow can recover stable early effect structure in this setting. On a fixed autoresearch-derived single-GPU training loop, we run 613 experiments across pilot and follow-up screens at 2, 5, and 10 minutes; full 16-condition seeded reruns at 5 and 10 minutes; targeted seeded anchor checks; same-host greedy and matched-cost random baselines; a 60-minute bridge package; and bounded Windows A100 and Linux L40S anchor continuations through 24 hours. Main penalties from total batch, depth, and width are largest at short budgets and relax as budget increases. Within the predeclared seeded full-screen families, D, A, B, and C retain non-zero estimates at 5 and 10 minutes after within-budget Benjamini-Hochberg correction, while E does not. Random search can reach strong incumbents in this 32-condition space, but repeatedly in the same low-penalty region and without factor attribution. The 60-minute bridge anchor has the lowest mean, although that package does not separate workflow refinement from the larger bridge model's capacity advantage. In bounded 12-hour and 24-hour three-anchor continuations on both hosts, the bridge has the lowest sample mean while the non-bridge ordering stays host-sensitive. We therefore present a bounded methods result: use short designed screens to identify high-penalty directions, confirm promising anchors under repeated runs, and refine locally inside the reduced space. The evidence supports a bridge-centered recommendation through 24 hours on two hosts, not hardware-invariant ranking or general hyperparameter-optimization superiority.
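
The Benjamini-Hochberg correction mentioned in this abstract controls the false discovery rate across several simultaneous effect tests. The sketch below implements the standard step-up procedure; the p-values are made up purely for illustration and are not taken from the paper, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return booleans marking which hypotheses are rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= alpha * k / m.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected

# Illustrative p-values only (not results from the paper).
p_values = [0.004, 0.012, 0.030, 0.001, 0.400]
flags = benjamini_hochberg(p_values)
print(flags)
```

Applying the correction separately within each time budget, as the abstract describes, would mean calling the function once per budget on that budget's p-values.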

2606.05183 2026-06-05 cs.CL cs.AI cs.HC

The Granularity Gap: A Multi-Dimensional Longitudinal Audit of Sycophancy in Gemini Models

Patrick Keough

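Affiliations: Independent Researcher

AI summary: Using multi-dimensional graded evaluation on a 0-4 Likert scale, shows that sycophancy in Gemini models varies along a continuous scale. Coarse binary metrics mask substantial social-compliance behavior, generational progress is non-monotonic, and there is an alignment tax: sycophancy and truthfulness are negatively correlated.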

Comments 16 pages, 9 figures

Abstract

Large language models are increasingly deployed as high-stakes advisors, yet standard alignment benchmarks treat sycophancy as a binary failure mode. We introduce the Granularity Gap: coarse binary metrics mask substantial social-compliance behaviors where models capitulate to user framing, validate questionable premises, or soften factual corrections without producing overtly false outputs. We evaluate six Gemini variants across generations 2.0, 2.5, and 3.0 on 73 adversarial prompts under three guardrail conditions (Control, Simple, Protocol), yielding 8,830 graded responses. Using a 0-4 Likert scale validated against a human annotator triad (Fleiss kappa = 0.71; Cohen kappa = 0.78 vs AI consensus; 95.9 percent binary accuracy, 100 percent specificity), we quantify sycophancy as continuous rather than binary. Three findings emerge. First, 27.2 percent of responses contain substantial sycophantic content (Likert >= 2.0) and 22.7 percent reach moderate or severe levels (>= 3.0), while binary win-rate framing reports only modest failure rates; coarse metrics explain just 29 percent of graded variance. Second, generational progress is non-monotonic: Gen 2.5 regresses sharply (mean Control 2.64) relative to Gen 2.0 (1.90) and Gen 3.0 (2.01), and Gen 2.5 shows inverse scaling (Pro 1.94 worse than Flash 1.71) while Gen 3.0 restores standard scaling. Third, we document an Alignment Tax: Spearman rho = -0.63 between sycophancy and truthfulness, indicating social compliance trades against factual accuracy. Egotistical Validation prompts act as a sycophancy trap (mean 3.27), nearly double Unethical Proposals (1.72). Simple guardrails outperform elaborate Protocol scaffolding on flagship models, but distilled Gen 3.0 Flash inverts this, suggesting small models may structurally require chain-of-thought scaffolding. We release the dataset and rubric to support continuous sycophancy measurement.

2606.05182 2026-06-05 cs.CL cs.IR

LANTERN: Layered Archival and Temporal Episodic Retrieval Network for Long-Context LLM Conversations

Rahul Subramani

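Affiliations: Cisco Systems, Inc.

AI summary: Proposes LANTERN, a lightweight memory layer that proactively archives conversation turns and uses hybrid retrieval to restore details lost to compaction, with zero LLM calls and under 25 ms of added latency per turn. On 94 multi-turn conversations its reranked variant recovers 78.3% of verifiable facts, outperforming a MemGPT baseline.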

Abstract

Large language models discard critical details when conversation history is compacted to fit within finite context windows. We present LANTERN (Layered Archival aNd Temporal Episodic Retrieval Network), a lightweight memory layer that proactively archives every conversation turn and restores relevant details after compaction via hybrid retrieval -- requiring zero LLM calls and adding fewer than 25ms of latency per turn. On 94 real multi-turn conversations (1,894 ground-truth facts, human-validated at kappa=0.81), LANTERN-Rerank recovers 78.3% of verifiable facts lost to compaction, significantly outperforming a faithful reimplementation of MemGPT's LLM-driven extraction and multi-query search pipeline (72.4%; Wilcoxon p<0.0001, 95% CI [+3.1, +8.6] pp, d=0.43) at a fraction of the inference cost. Even without the reranker, base LANTERN matches or exceeds this LLM-driven baseline (p=0.005) using zero LLM calls. When four production LLMs answer fact-bearing questions using LANTERN-restored context, accuracy improves by 8.4 percentage points on average (Wilcoxon p<0.05 for each model individually), demonstrating that the recovered context is useful across diverse model architectures. We release the full evaluation framework -- paired significance tests, failure analysis, fact-type stratification, and compaction robustness analysis -- to support reproducibility and future work.

2606.05181 2026-06-05 cs.CL cs.AI

MCBench: A Multi-Context Safety Evaluation Benchmark for Omni Large Language Models

Manh Luong, Tamas Abraham, Junae Kim, Amar Kaur, Rollin Omari, Gholamreza Haffari, Trang Vu, Lizhen Qu, Dinh Phung

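Affiliations: Monash University; Defence Science and Technology Group

AI summary: Existing multimodal safety benchmarks handle only visual inputs. MCBench provides 1196 scenarios across four safety categories that require integrating information from multiple modalities, and it exposes weaknesses in the cross-modal safety reasoning of current omni large language models.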

Abstract

Existing multimodal safety benchmarks focus solely on visual inputs and cannot assess Omni Large Language Models (LLMs) that process vision, audio, and text. We introduce MCBench, a benchmark with 1196 scenarios spanning four safety categories that require integrating multiple modalities for accurate safety assessment. Each unsafe scenario is paired with a minimally different safe counterpart to assess model sensitivity. Our evaluations of state-of-the-art models reveal significant challenges. Omni LLMs struggle with subtle or non-physical risks but perform better when salient visual or acoustic cues are present. Analysis of reasoning traces shows that, although models can extract modality-specific information, they often fail to integrate these cues effectively for safety judgments. Our findings reveal that current Omni LLMs lack robust cross-modal reasoning in safety-critical settings, underscoring the need for improved architectures and training strategies for multimodal safety.

2606.05176 2026-06-05 cs.CL cs.AI

PEFT of SLM for Telecommunications Customer Support: A Comparative Study of LoRA Configurations with Energy Consumption Analysis

Lucas Tamic, Ilan Jaffeux-Cheniout, Xavier Marjou

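Affiliations: Orange

AI summary: Systematically compares parameter-efficient fine-tuning of Qwen2.5-3B under different LoRA configurations, combining energy consumption analysis with an LLM-as-judge framework. The configuration with the lowest validation loss does not necessarily receive the best qualitative ranking, and the study also proposes a compositional synthetic data generation method.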

Reward-Decomposed Reinforcement Learning for Immersive Video Role-Playing

Miao Wang, Yuling Shi, Yijiang Li, Yeheng Chen, Xiaodong Gu, Bin Li, Bo Gao, Jun Wang, Zengxin Han, Jingtong Wu, Yaduan Ruan

Affiliations: Nanjing University; Shanghai Jiao Tong University; University of California, San Diego; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; School of Information Engineering, Beijing Institute of Graphic Communication; Ant International, Ant Group; Independent Researcher

AI summary: Proposes EBM-RL, a framework that uses reward-decomposed reinforcement learning to optimize visual perception, reasoning, and generation in video role-playing, improving scene consistency and character authenticity.

Abstract

Text-based role-playing models can imitate character styles, but often fail to capture scene atmosphere and evolving tension, which are crucial for immersive applications such as VR games and interactive narratives. We study video-grounded role-playing dialogue and introduce EBM-RL (Eye--Brain--Mouth Reinforcement Learning), a decoupled GRPO-based framework that separates observation (<perception>), reasoning (<think>), and utterance generation (<answer>). This design mimics the human See-Think-Speak process, enabling the model to ground dialogue in visual perception before reasoning and response generation. To optimize this See-Think-Speak process, EBM-RL integrates complementary rewards for scene--text alignment, perceptual--cognitive utility, answer faithfulness, and format consistency. Extensive experiments show that EBM-RL substantially outperforms text-only role-playing baselines and larger-scale vision-language models on our immersive role-playing benchmark, improving both visual-atmosphere consistency and character authenticity. Moreover, EBM-RL demonstrates strong zero-shot transfer to out-of-domain VideoQA benchmarks without additional fine-tuning. We also release an open-source dataset for video-grounded role-playing dialogue.
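
One of the reward components listed in this abstract, format consistency, lends itself to a small sketch. The function below returns 1.0 only when a response consists of one perception section, one think section, and one answer section in that order. The closing tags, the binary 0/1 scoring, and the name `format_reward` are assumptions for illustration; the abstract names only the opening tags and does not say how the rewards are scored or combined.

```python
import re

TAGS = ("perception", "think", "answer")

# The three sections must appear once each, in order, with only whitespace between them.
PATTERN = re.compile(
    r"\s*<perception>(.+?)</perception>\s*"
    r"<think>(.+?)</think>\s*"
    r"<answer>(.+?)</answer>\s*",
    re.DOTALL,
)

def format_reward(response: str) -> float:
    """Return 1.0 if the response follows the See-Think-Speak layout, else 0.0."""
    for tag in TAGS:
        # Reject duplicated or missing tags before checking the order.
        if response.count(f"<{tag}>") != 1 or response.count(f"</{tag}>") != 1:
            return 0.0
    return 1.0 if PATTERN.fullmatch(response) else 0.0

good = "<perception>dim hallway</perception>\n<think>she is tense</think>\n<answer>Stay close.</answer>"
print(format_reward(good))
```

A binary check like this is typically paired with graded rewards for content quality, since it says nothing about whether the sections are faithful to the video.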

2606.05104 2026-06-05 cs.AI

Knowledge Index of Noah's Ark

Sheng Jin, Minghao Liu, Yunze Xiao, Zeqi Zhou, Heli Qi, Yifan Yao, Meishu Song, Kaijing Ma, Xuan Zhang, Sicong Jiang, Yizhe Li, Ningshan Ma, Jie Wei, Ziniu Li, Minglai Yang, Bangya Liu, Yiming Liang, Xiao Fang, Qingcheng Zeng, Jiarui Liu, Rui Yang, Shen Yan, Wenhao Huang, Jiaheng Liu, Zihan Wang, Weihao Xuan, Ge Zhang

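Affiliations: M-A-P; Carnegie Mellon University; Brown University; Waseda University; The University of Tokyo; Massachusetts Institute of Technology; University of Arizona; Northwestern University; Duke-NUS Medical School

AI summary: To address representativeness, annotation quality, and ranking stability in LLM knowledge benchmarks, proposes the KINA benchmark. It operationalizes disciplinary representativeness through a proxy with a greedy approximation guarantee and proves that a bonus tournament mechanism dominates flat payment; experiments show that top models remain far from saturation.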

Abstract

Knowledge benchmarks for LLMs face three issues: scaling-driven designs that do not operationalize disciplinary representativeness; flat-payment annotation that permits lazy consensus; and unaudited ranking instability under bounded test budgets. We introduce KINA, an 899-item benchmark across 261 fine-grained disciplines, with two formal results. First, we cast representativeness as a coverage-style objective over expert-elicited anchors and operationalize disciplinary representativeness through a proxy, yielding a (1-1/e) greedy approximation (Proposition 1); the guarantee applies to the proxy, not to population representativeness. Second, we prove a bonus-on-bar tournament weakly FOSD-dominates flat payment in released-review quality, with incentive-compatibility threshold B > Delta C / Delta p_min (Theorem 1). Evaluating 42 models from 13 labs, the top model, Gemini-3.1-Pro-Preview, reaches 53.17%, followed by Claude-Opus-4.6 at 49.92% and GPT-5.4 at 48.55%, leaving substantial headroom below saturation. The full leaderboard shows a tiered structure rather than a smooth total order: a small frontier tier lies above 48%, a dense strong-model tier spans roughly 38-45%, and low-performing models remain only modestly above the 10% chance baseline. Tool augmentation adds up to 5.17 points across the five tool-use evaluations, with gains varying substantially across models. We report bootstrap ranking-stability statistics to make bounded-budget variance explicit and to discourage over-interpretation of adjacent ranks.
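
The (1-1/e) factor cited in this abstract is the classic guarantee for greedy selection when the objective is monotone and submodular, as plain set coverage is. The sketch below shows that greedy rule on a toy instance. It assumes a simple "number of anchors covered" objective; the paper's actual proxy objective is not specified in the abstract, and the item and anchor names are invented for illustration.

```python
def greedy_cover(candidates: dict[str, set[str]], budget: int) -> list[str]:
    """Pick up to `budget` items, each time adding the one that covers the most new anchors."""
    covered: set[str] = set()
    chosen: list[str] = []
    for _ in range(budget):
        best = max(
            (name for name in candidates if name not in chosen),
            key=lambda name: len(candidates[name] - covered),
            default=None,
        )
        # Stop when nothing is left or no remaining item adds coverage.
        if best is None or not candidates[best] - covered:
            break
        chosen.append(best)
        covered |= candidates[best]
    return chosen

# Hypothetical benchmark items mapped to the anchors they cover.
items = {
    "q1": {"a", "b", "c"},
    "q2": {"c", "d"},
    "q3": {"d", "e", "f", "g"},
    "q4": {"a", "g"},
}
print(greedy_cover(items, budget=2))
```

As the abstract itself cautions, a guarantee of this kind applies to the proxy objective being maximized, not to representativeness of the underlying population.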
