arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.28741 2026-05-28 cs.CV

Self-Prophetic Decoding to Unlock Visual Search in LVLMs

Zhendong He, Qiyuan Dai, Guanbin Li, Liang Lin, Sibei Yang

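AI summary: Proposes SeProD, a self-prophetic decoding framework that uses the pre-training model's intrinsic single-step capabilities to strengthen coherent multi-step visual-search reasoning in LVLMs in a training-free, plug-and-play manner, with consistent gains across 12 splits of 4 benchmarks.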

Comments: Accepted at ICML 2026

Abstract

Large Vision-Language Models (LVLMs) are rapidly evolving toward true multimodal reasoning, with visual search representing a concrete instantiation of the thinking-with-images paradigm. However, LVLM visual search faces two key challenges: incompatibility among intrinsic capabilities after post-training, and interference in long multi-step reasoning contexts. To address these, we identify two novel insights. First, self-regulation between pre- and post-training LVLMs leverages the intrinsic single-step capabilities of the pre-training model to mitigate capability deterioration and long-context interference. Second, probability-based prophetic sampling, replacing naive prompting, provides a probabilistic interface where the pre-training model acts as a prophet and the post-training model selectively accepts prophetic tokens under its output distribution, preserving coherent multi-step reasoning. Building on these insights, we introduce SeProD, a self-prophetic decoding framework that leverages intrinsic single-step capabilities to enable coherent multi-step reasoning in a training-free, plug-and-play manner. Experiments show that SeProD consistently improves multiple visual-search LVLMs across all 12 splits of 4 visual search benchmarks, as well as across general VQA benchmarks, without added computational overhead, thanks to its parallel prophetic acceptance mechanism.

2605.28740 2026-05-28 cs.CL cs.AI

Reverse Probing: Supervised Token-level Uncertainty Quantification for Large Language Models in Clinical Text

Bushi Xiao, Sarvesh Soni, Daisy Zhe Wang

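AI summary: Proposes Reverse Probing, a framework that uses pre-existing labeled summaries to extract token-level uncertainty signals from a model's internal activations, enabling efficient, interpretable uncertainty quantification in clinical text.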

Abstract

As large language models are increasingly deployed for clinical text, ensuring they can reliably signal their own uncertainty becomes critical. Most existing uncertainty quantification (UQ) methods are designed for open-domain generation and cannot localize uncertainty at the token or span level in long clinical text. We propose Reverse Probing, the first UQ framework specialized for clinical summarization, which estimates token-level uncertainty directly from pre-existing labeled summaries. Rather than sampling new outputs, Reverse Probing treats the text as a probe into the model's internal state, extracting uncertainty signals from four categories of internal activations. We evaluate on two expert-annotated clinical datasets and outperform eight adapted baselines on all metrics, achieving up to 4 times higher AUPRC while reducing inference time and computational costs. Feature analysis reveals that delta energy and neighborhood context are the most consistent predictors across all models. This study offers interpretable insights into how models internally respond to unsupported clinical content.

2605.28739 2026-05-28 cs.LG cs.AI cs.NE q-bio.QM

BIRDNet: Mining and Encoding Boolean Implication Knowledge Graphs as Interpretable Deep Neural Networks

Tirtharaj Dash

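AI summary: Proposes BIRDNet, which mines Boolean implication relationships between features and encodes them as a sparse, interpretable neural network, sharply reducing parameters while maintaining high accuracy and recovering known biological signatures in transcriptomic and proteomic data.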

Comments: 5 pages; 1 figure, 4 tables

Abstract

Tabular data in knowledge-rich domains often carries a latent prior in the form of Boolean implication relationships (BIRs) between pairs of features. We mine such relationships with a sparse-exception binomial test. The mined implications form a typed directed graph, equivalent to a propositional rule base of 2-literal clauses. We encode this graph as the connectivity of a layered neural network, called BIRDNet, in which each hidden unit corresponds to one mined rule and binds only to its two features. We show two consequences of this design: First, the architecture is sparse by construction: at most $2/d$ of the weights in each BIR layer are active, where $d$ is the input dimension. Second, the model is interpretable: every trained unit keeps a stable symbolic identity, so rules can be read off the network without surrogate models. Unlike most neurosymbolic models, BIRDNet does not consume an external rule base; its structural prior is mined from the data. We evaluate BIRDNet on six transcriptomic and proteomic benchmarks. Our results show that BIRDNet stays within 0.02 AUROC of the strongest dense baseline, at a small accuracy cost, while using up to $96\times$ fewer active parameters than an architecture-matched dense MLP. First-layer rules recover known biological signatures across multiple cancer subtypes and tissue types, including canonical amplicons, lineage-defining co-expression modules, and immune-infiltration markers. Data and code are available at: https://github.com/MAHI-Group/BIRDNet.
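
Illustrative sketch (not taken from the paper or its repository): the sparsity claim above follows from each hidden unit binding to exactly two of the $d$ input features, so a layer's connectivity mask has 2 active entries per row out of $d$. The function names and the example rule list below are hypothetical, and real BIRDNet layers may carry additional structure (rule types, biases) not modeled here.

```python
def bir_layer_mask(rules, d):
    """Build a 0/1 connectivity mask with one row per mined rule.

    `rules` is a list of (i, j) feature-index pairs (hypothetical format);
    each row connects a hidden unit to exactly those two input features.
    """
    mask = []
    for i, j in rules:
        if i == j or not (0 <= i < d and 0 <= j < d):
            raise ValueError(f"invalid feature pair: {(i, j)}")
        row = [0] * d
        row[i] = 1
        row[j] = 1
        mask.append(row)
    return mask


def active_fraction(mask):
    """Fraction of mask entries that are active (non-zero)."""
    total = sum(len(row) for row in mask)
    active = sum(sum(row) for row in mask)
    return active / total


# Hypothetical mined rules over d = 8 features.
d = 8
rules = [(0, 3), (1, 2), (4, 5), (0, 5)]
mask = bir_layer_mask(rules, d)
fraction = active_fraction(mask)  # equals 2 / d for any non-empty rule list
```

Multiplying a dense weight matrix elementwise by such a mask is one common way to enforce this kind of fixed sparsity; whether BIRDNet does it this way is an assumption.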

2605.28736 2026-05-28 cs.RO

Imitation Learning for Robot Assistance in Open Surgery: A Multi-Policy Evaluation on Suture Following

Xucheng Wang, Zhizhou Yang, Xiaoman Zhang, Sung Eun Kim, Romain Hardy, Pranav Rajpurkar

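AI summary: The first evaluation of the feasibility of general-purpose imitation learning for surgeon-robot collaborative assistance in open surgery, using suture following (the grab-pull-release motion an assistant performs at every stitch) as the task. Comparing four policies (ACT, Diffusion Policy, SmolVLA, π₀) across 28 trained models, it finds π₀ best in data efficiency, background robustness, and trajectory smoothness, and reports a 92% stitch completion rate in a robotic suturing trial.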

Abstract

This study presents the first evaluation of general-purpose imitation learning for surgeon-robot collaborative assistance in open surgery, targeting suture following: the grab-pull-release motion an assistant performs at every stitch. We collect 160 teleoperated demonstrations (32,374 frames) on an open-source robot arm and benchmark four architecturally diverse imitation learning policies (ACT, Diffusion Policy, SmolVLA, $\pi_0$) across 28 trained models, evaluated in 32 configurations along three clinically motivated dimensions: dataset size, camera viewpoint, and background variation. Our results demonstrate that under ideal conditions, the four policies achieve 50-75% task success, with depth error as the dominant failure mode across all architectures. Among all policies, $\pi_0$ achieves the strongest results with a pretrained vision-language backbone, demonstrating superior data efficiency, greater robustness to background variation, and smoother trajectories compatible with surgical workflow. When deployed in a surgeon-robot suturing trial, $\pi_0$ yields a 92% stitch completion rate. These findings establish collaborative robotic assistance in open surgery as a feasible target for imitation learning and highlight depth perception and end-effector design as key priorities for clinical translation.

2605.28735 2026-05-28 cs.CV

SeeGroup: Multi-Layer Depth Estimation of Transparent Surfaces via Self-Determined Grouping

Hongyu Wen, Jia Deng

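AI summary: Proposes SeeGroup, which models multi-layer depth as a point process and uses a permutation-invariant loss to enable adaptive grouping, significantly improving multi-layer depth estimation accuracy for transparent surfaces.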

Abstract

Transparent objects are common in daily life, and it is important to understand their multi-layer depth, including the transparent surface and the objects behind it. Existing methods for multi-layer depth typically extend single-layer prediction. They define layers by the front-to-back ordering of 3D points and predict the layers sequentially. However, as layered geometry can admit multiple valid groupings of 3D points into layers, a predefined grouping strategy is inherently restrictive. In this work, we propose SeeGroup, a multi-layer depth estimation method that avoids imposing a predefined grouping and allows the model itself to adaptively assign surfaces to depth maps. We formulate per-pixel multi-layer depth as a point process, treating depth layers as unordered events along each camera ray. This induces a permutation-invariant likelihood over the observed depth layers, yielding a loss that naturally supports arbitrary layer groupings. Experiments demonstrate that our method significantly advances the state of the art in multi-layer depth estimation, improving quadruplet relative depth accuracy on the LayeredDepth benchmark from 61.34% to 70.09%. Code is available at https://github.com/princeton-vl/SeeGroup.
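
Illustrative sketch (not the paper's loss): the abstract derives its objective from a point-process likelihood, which is not specified here. As a simpler stand-in that shows what "permutation-invariant" means for a single pixel, the snippet below scores predicted depth layers against ground-truth layers under the best one-to-one matching, so the result does not depend on the order in which the model emits its layers. The function name and the L1 cost are assumptions.

```python
from itertools import permutations


def permutation_invariant_l1(pred_depths, true_depths):
    """Minimum total L1 error over all orderings of the predicted layers.

    Brute force over permutations, which is fine for the handful of layers
    along one camera ray but not intended for large sets.
    """
    if len(pred_depths) != len(true_depths):
        raise ValueError("this sketch assumes equal layer counts")
    return min(
        sum(abs(p - t) for p, t in zip(perm, true_depths))
        for perm in permutations(pred_depths)
    )


# The same three predicted layers, emitted in two different orders.
true_layers = [1.0, 2.0, 3.0]
loss_a = permutation_invariant_l1([3.0, 1.0, 2.5], true_layers)
loss_b = permutation_invariant_l1([2.5, 3.0, 1.0], true_layers)
```

Both calls give the same value because only the set of predicted depths matters, which is the property that lets a model choose its own grouping of surfaces into depth maps.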

2605.28733 2026-05-28 cs.AI

Utility-Aware Multimodal Contrastive Learning for Product Image Generation

Xiaohang Feng, Yiling Xie

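AI summary: Proposes a utility-aware multimodal contrastive learning framework that introduces a Utility-Aware InfoNCE loss to optimize product image generation, so that images stay semantically aligned while increasing market demand.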

Abstract

Product images strongly influence consumer decision-making in online marketplaces. Empowered by multimodal contrastive learning, generative AI can output images that closely align with text prompts. Yet existing generative AI models do not directly optimize marketplace performance. This is a critical gap, since semantic alignment alone does not guarantee that an image will sell. To address this limitation, we propose a utility-aware multimodal contrastive learning framework that incorporates consumer demand into a novel Utility-Aware InfoNCE loss. Optimizing this utility-aware objective guides generation toward images that are both semantically coherent and demand-enhancing. This effect arises directly from a shift in the learned image-text representation space toward demand-driven visual cues, which we also validate through the theoretical bound of the proposed objective. In downstream applications on Amazon and Airbnb, product images generated and edited by our method outperform state-of-the-art models in increasing demand and preserving fidelity, while maintaining text-image consistency. Notably, our utility-aware framework preserves inverse U-shaped demand patterns for attributes such as aesthetics and uniqueness, improving demand-based performance while preserving fidelity and semantic consistency. Human-subject experiments further validate its commercial effectiveness. As generative AI technology continues to evolve, our utility-aware component can be flexibly embedded into emerging generative models to improve direct commercial use.

2605.28732 2026-05-28 cs.CL cs.AI cs.LG

MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems

Xinle Deng, Ruobin Zhong, Hujin Peng, Xiaoben Lu, Yanzhe Wu, Guang Li, Buqiang Xu, Yunzhi Yao, Jizhan Fang, Haoliang Cao, Junjie Guo, Yuan Yuan, Ziqing Ma, Yuanqiang Yu, Rui Hu, Baohua Dong, Hangcheng Zhu, Ningyu Zhang

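AI summary: Proposes the MemTrace framework, which builds executable memory evolution graphs for fine-grained error tracing, uses an automatic attribution method to locate root causes, and then optimizes prompts to improve downstream task performance.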

Comments: Ongoing work

Abstract

Memory is essential for enabling large language models to support long-horizon reasoning, yet existing memory systems remain unreliable and difficult to debug. Tracing memory's dynamic evolution is crucial to understand how information is synthesized, propagated, or corrupted over time. In this work, we study the new problem of error tracing and attribution in LLM memory systems. We propose a novel framework that transforms memory pipelines into executable memory evolution graphs, enabling fine-grained tracing of operational information flow. We then construct MemTraceBench, a benchmark collected from representative memory systems such as Long-Context, RAG, Mem0, and EverMemOS, to systematically study memory failure modes. We further introduce an automatic attribution method that iteratively traces operation subgraphs to pinpoint the root cause of any failed case. Our analysis reveals that memory failures are systematic, stemming from operation-level issues like information loss and retrieval misalignment. Crucially, we leverage these fine-grained attribution signals to guide downstream prompt optimization, establishing a closed-loop system that automatically corrects faults and boosts end-task performance by up to 7.62%. Code will be released at https://github.com/zjunlp/MemTrace.

2605.28730 2026-05-28 cs.AI

AlphaTransit: Learning to Design City-scale Transit Routes

Bibek Poudel, Sai Swaminathan, Weizi Li

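AI summary: To address delayed feedback in transit route design, proposes AlphaTransit, a framework combining Monte Carlo Tree Search with a neural policy-value network, which attains the highest service rate on the Bloomington benchmark.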

Abstract

Designing a transit network requires many sequential route extension decisions, but their quality is often visible only after the full network is assembled. This delayed-feedback challenge lies at the heart of the Transit Route Network Design Problem (TRNDP), where route interactions can be deceptive: an extension that appears useful locally can create transfer bottlenecks, produce redundant overlap, or reduce overall throughput. To guide route construction under delayed simulator feedback, we introduce AlphaTransit, a search-based planning framework for city-scale bus network design. AlphaTransit couples Monte Carlo Tree Search (MCTS) with a neural policy-value network: the policy proposes route extensions, the value estimates downstream design quality, and search uses these predictions to refine each decision. This provides decision-time lookahead during route construction without running simulator rollouts inside the search tree. We evaluate AlphaTransit on a new Bloomington TRNDP benchmark with realistic road topology and census-derived demand, under mixed and full transit demand settings. In the Bloomington network, AlphaTransit attains the highest service rate in both demand settings, reaching 54.6% and 82.1%, respectively. Relative to reinforcement learning without search, these correspond to 9.9% and 11.4% service rate gains; relative to MCTS without learned guidance, they correspond to 2.5% and 11.2% gains. These results suggest that coupling learned guidance with MCTS is more effective than using either approach alone for transit network design. Our code and data are publicly available at https://github.com/poudel-bibek/AlphaTransit.
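
Illustrative sketch (an assumption, not the authors' code): the abstract does not state which selection rule the search uses. A common way to combine a policy prior and a value estimate in MCTS is the AlphaZero-style PUCT score, shown below for a single decision node. The action names, statistics, and the `c_puct` constant are all hypothetical.

```python
import math


def puct_score(q, prior, parent_visits, child_visits, c_puct=1.5):
    """AlphaZero-style PUCT: value estimate plus a prior-weighted exploration bonus."""
    return q + c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)


def select_extension(children, c_puct=1.5):
    """Pick the candidate route extension with the highest PUCT score."""
    parent_visits = sum(child["visits"] for child in children.values())
    return max(
        children,
        key=lambda name: puct_score(
            children[name]["q"],
            children[name]["prior"],
            parent_visits,
            children[name]["visits"],
            c_puct,
        ),
    )


# Hypothetical statistics for three candidate extensions at one node:
# q = mean value estimate, prior = policy probability, visits = visit count.
children = {
    "extend_north": {"q": 0.50, "prior": 0.2, "visits": 10},
    "extend_east": {"q": 0.40, "prior": 0.7, "visits": 2},
    "terminate": {"q": 0.45, "prior": 0.1, "visits": 4},
}
chosen = select_extension(children)
```

Here the rarely visited, high-prior extension wins despite a lower current value estimate, which is how a learned policy can steer search toward options whose payoff only shows up later.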

2605.28726 2026-05-28 cs.RO cs.LG

How VLAs Fail Differently: Black-Box Action Monitoring Reveals Architecture-Specific Failure Signatures

Krishnam Gupta

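AI summary: Using black-box action monitoring, the paper finds that vision-language-action (VLA) architectures fail in fundamentally different, predictable ways at the motor-command level, and shows that architecture-matched monitor selection is essential.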

Comments: Accepted at IEEE ICRA 2026 Workshop "From Data to Decisions: VLA Pipelines for Real Robots", Vienna, June 2026. Non-archival workshop. 5 pages, 2 figures, 22 references

Abstract

We discover that VLA architectures fail in fundamentally different, predictable ways at the motor-command level. Running VQ-BeT, Diffusion Policy, and ACT on identical evaluation protocols (n=450 episodes across PushT and ALOHA 14-DOF bimanual manipulation), we find: (1) direction reversal rate is a universal failure predictor across all three architectures (AUROC=0.93, 0.79, 0.91; p<0.001); (2) jerk monitoring is predictive only for discrete-token architectures, following a discrete-to-continuous gradient (0.88, 0.69, 0.41); (3) velocity violations alone are non-predictive everywhere (AUROC 0.41-0.69), yet velocity checking is the most common safety mechanism in VLA deployment code; and (4) for continuous-family VLAs, velocity monitoring provides effectively zero predictive signal (AUROC=0.52 on ACT, 0.41 on Diffusion), proving that architecture-matched monitor selection is essential. These results quantify a monitoring consequence of the well-known discrete/continuous VLA distinction: the two families produce qualitatively different failure signatures that require different monitors. No single monitor works universally; architecture-matched selection is required. This finding was enabled by SafeContract, a training-free, black-box action monitoring toolkit with conformal calibration. Code: https://github.com/krishnam94/vla-edge
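
Illustrative sketch (not the SafeContract API): the abstract names direction reversal rate and jerk as monitored signals but does not define them. The snippet below uses plausible definitions as assumptions: actions are treated as position targets, a reversal is a pair of consecutive displacement vectors with a negative dot product, and jerk is the third finite difference of position. All function names are hypothetical.

```python
def finite_difference(seq, dt):
    """First-order difference of a list of equal-length vectors, divided by dt."""
    return [
        [(b - a) / dt for a, b in zip(prev, cur)]
        for prev, cur in zip(seq, seq[1:])
    ]


def direction_reversal_rate(actions):
    """Fraction of consecutive step pairs whose displacements oppose each other."""
    steps = finite_difference(actions, 1.0)
    pairs = list(zip(steps, steps[1:]))
    if not pairs:
        return 0.0
    reversals = sum(
        1 for u, v in pairs if sum(x * y for x, y in zip(u, v)) < 0
    )
    return reversals / len(pairs)


def max_abs_jerk(actions, dt):
    """Largest absolute component of the third finite difference of position."""
    jerk = actions
    for _ in range(3):
        jerk = finite_difference(jerk, dt)
    return max((abs(x) for row in jerk for x in row), default=0.0)


# Two toy 1-D action traces: a steady ramp and a back-and-forth oscillation.
smooth = [[0.0], [1.0], [2.0], [3.0], [4.0]]
jittery = [[0.0], [1.0], [0.0], [1.0], [0.0]]
```

Both monitors need only the emitted action sequence, which is what makes this style of monitoring black-box. Any thresholds would have to be calibrated per architecture, consistent with the paper's finding that no single monitor works universally.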

2605.28722 2026-05-28 cs.AI

Multi-Adapter Representation Interventions via Energy Calibration

Manjiang Yu, Hongji Li, Junwei Chen, Xue Li, Priyanka Singh, Yang Cao, Lijie Hu

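AI summary: Proposes MARI, which uses a competitive multi-adapter mechanism and an energy-based gating module to adaptively determine intervention direction and strength, improving alignment performance while preserving general capabilities.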

Comments: Accepted by ICML 2026

Abstract

Representation intervention has emerged as a promising paradigm for aligning large language models toward desired behaviors without modifying model weights. Existing methods typically apply a fixed intervention uniformly across all inputs. However, we find that the appropriate intervention direction and strength vary substantially across samples, and such indiscriminate intervention leads to degradation of general capabilities on benign inputs. To address these challenges, we propose Multi-Adapter Representation Interventions via Energy Calibration (MARI). Specifically, we introduce a competitive multi-adapter mechanism in which specialized experts capture non-linear correction patterns and adaptively determine the appropriate intervention direction and strength for different samples. Furthermore, we design an energy-based gating module that leverages internal propagation dynamics to distinguish inputs that are applicable for intervention. Extensive experiments across diverse model families and parameter scales demonstrate that MARI achieves state-of-the-art alignment performance. Our method significantly improves performance on TruthfulQA, BBQ, and safety benchmarks, while maintaining and even improving general capabilities on tasks such as MMLU and ARC. Our code is available at https://github.com/V1centNevwake/MARI.

2605.28721 2026-05-28 cs.AI

LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

HuiMing Fan, Xiao Wang, Zheng Chu, Qianyu Wang, Zhuoyao Wang, Ming Liu, Bing Qin, XingYu

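AI summary: Through diagnostic studies, the paper finds that LLM-based search agents exhibit Intrinsic Knowledge Dependence (IKD), relying on the model's internal knowledge rather than external evidence, and introduces the LiveBrowseComp benchmark to evaluate deep-search ability beyond intrinsic knowledge coverage.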

Abstract

Are LLM-based search agents genuinely searching, or using the web to verify what they already know? We study this question on BrowseComp with three diagnostics. Our analysis reveals Intrinsic Knowledge Dependence (IKD): even with tool access, agents often rely on intrinsic knowledge -- information encoded in the model before retrieval -- rather than on external evidence. Agents answer up to 44.5% of BrowseComp questions without tools, generate more than half of their search queries from internally produced hypotheses rather than retrieved leads, and perform worse than closed-book baselines when answer-supporting evidence is removed. These results suggest that static search benchmarks can reward memory-backed verification rather than evidence-driven discovery, conflating what agents already know with what they can find. We then introduce LiveBrowseComp, a deep-search benchmark designed to evaluate agents beyond intrinsic coverage. It contains 335 human-authored questions whose answers depend on facts published within the 90 days preceding benchmark construction, drawn from six updated sources and filtered to exclude globally salient events. On LiveBrowseComp, all evaluated agents fall below 2% closed-book accuracy, search-augmented scores drop by 25-40 points relative to BrowseComp, and prior model rankings no longer reliably predict performance. LiveBrowseComp is available at https://huggingface.co/datasets/Forival/LiveBrowseComp.

2605.28717 2026-05-28 cs.AI cs.AR cs.NI

OpenURMA: A Clean-Room Open Implementation of the Unified Bus Protocol

Bojie Li

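AI summary: Targeting the RDMA bottleneck at the datacenter network interface, OpenURMA implements Huawei's UB protocol specification at three tiers (RTL, SystemC, and gem5) and shows that, on a 64-byte remote fetch, UB achieves 4.37x lower latency and 2.80x higher throughput than RoCEv2 RC.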

Abstract

Modern datacenter RDMA is bottlenecked at the network interface, not the wire. A NIC running RoCE or InfiniBand holds per-connection state for every (application, remote-endpoint) pair - hundreds of megabytes at 1024-application fanout - and pays a four-traversal PCIe round trip on a 64-byte operation, inflating latency an order of magnitude beyond the wire. Both follow from the Queue Pair over PCIe abstraction RDMA inherits from InfiniBand. Huawei's Unified Bus (UB), a public 2025 specification, changes the abstraction: it decouples per-application endpoint state from per-host transport state so connection context grows additively, exposes ordering as opt-in, and reaches remote memory through native CPU load/store to an on-chip-bus controller. UB ships in Huawei's closed Ascend 950 silicon. OpenURMA is the first clean-room open implementation of UB's transport and transaction layers, realised at three tiers - synthesisable RTL on Alveo U50, a cycle-level two-node SystemC simulator, and a gem5 full-system scaffold - each with a matched OpenRoCE (RoCEv2 RC) baseline. The contribution is the implementation, harness, and controlled comparison closed silicon does not admit. On the canonical 64-byte remote fetch - LOAD on UB-spec Sec.8.3, READ on RoCEv2 RC - UB's load/store path delivers ~500 ns end-to-end, 4.37x below the matched baseline (2186 ns), sustains 2.80x higher throughput, and fits in ~14% of a U50's LUTs.
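
Illustrative sketch (simplified, with hypothetical counts): the abstract contrasts per-connection state held for every (application, remote-endpoint) pair with UB's decoupled state that grows additively. The snippet below only counts context entries under those two models and checks the reported latency figures against each other. Context sizes in bytes are not given in the abstract and are deliberately left out.

```python
def per_pair_contexts(apps, remote_endpoints):
    """Queue-pair style: one context per (application, remote endpoint) pair."""
    return apps * remote_endpoints


def decoupled_contexts(apps, remote_hosts):
    """Decoupled style: per-application state plus per-host transport state."""
    return apps + remote_hosts


# Hypothetical fanout: 1024 local applications, 1024 remote peers.
multiplicative = per_pair_contexts(1024, 1024)
additive = decoupled_contexts(1024, 1024)

# Consistency check on the reported numbers: a 2186 ns baseline that is
# 4.37x slower implies roughly 500 ns on the UB load/store path.
baseline_ns = 2186
reported_ratio = 4.37
implied_ub_ns = baseline_ns / reported_ratio
```

With these counts the per-pair model needs 1,048,576 contexts against 2,048 for the decoupled model, and the implied UB latency is about 500.2 ns, matching the "~500 ns" figure in the abstract.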

2605.28714 2026-05-28 cs.CL cs.AI

IPO-Mine: A Toolkit and Dataset for Section-Structured Analysis of Long, Multimodal IPO Documents

Michael Galarnyk, Siddharth Lohani, Vidhyakshaya Kannan, Sagnik Nandi, Aman Patel, Liqin Ye, Arnav Hiray, Rutwik Routu, Prasun Banerjee, Siddhartha Somani, Sudheer Chava

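AI summary: Presents the IPO-Mine toolkit and dataset, which standardize the parsing of IPO filings into section-structured text and images, build a large-scale multimodal dataset, and establish chart evaluation tasks, revealing alignment challenges for multimodal models in long-document analysis.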

Comments: 12 pages

Abstract

An Initial Public Offering (IPO) filing is a document released when a private firm goes public, allowing individual (retail) investors to purchase its shares. These filings describe a firm's business, financials, and risks and are long, multimodal documents with narrative text and images. Despite their importance to financial markets, there is no large-scale, standardized dataset or benchmark for studying IPO filings with modern language and multimodal models. These documents pose significant challenges: filings frequently exceed 500,000 tokens and lack consistent structural organization. We introduce the IPO-Toolkit, an open-source framework for downloading and parsing IPO filings into standardized section-structured text and extracted images. The toolkit segments filings, extracts embedded images, and produces structured outputs that enable large-scale, reproducible analysis workflows over long, multimodal documents. Using this infrastructure, we construct the IPO-Dataset, a large, section-structured, multimodal dataset covering more than 109,000 IPO filings and amendments from 1994 to 2026 and containing over 76,000 images. We establish structured evaluation tasks over extracted financial charts, including chart quality and misleadingness assessment. Our experiments show that state-of-the-art multimodal models often diverge from expert human judgments on these tasks, exposing alignment challenges in multimodal reasoning over long, real-world regulatory documents. Beyond benchmarking, the IPO-Dataset enables large-scale analysis of section-level textual variation and cross-industry differences in visual and textual disclosure practices. Our code, dataset, and website are publicly available under CC-BY-4.0.

2605.28713 2026-05-28 cs.AI

Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor

Guoxin Ma, Yibing Liu, Chengzhengxu Li, Yu Liang, Yan Wang, Yueyang Zhang, Kecheng Chen, Zhaohan Zhang, Zhiyuan Sun, Daiting Shi

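AI summary: Proposes the Thinking as Compression (TaC) paradigm, which uses a reasoning model's own thinking traces as compressed context, and achieves controllable compression through reward-driven optimization (TaC-C), significantly outperforming existing methods on long-context QA tasks.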

Comments: Under Review

Abstract

Context compression aims to shorten long context inputs with minimal information loss for LLM inference acceleration. While existing methods have shown promise, they typically rely on complex compression modules or compression-specific training, leaving the intrinsic capabilities of LLMs underexplored. In contrast, this work reveals that a thinking model itself can naturally compress long contexts by organizing task-relevant information. We thus derive Thinking as Compression (TaC), a new compression paradigm that treats thinking itself as compressed context. Without relying on a dedicated compressor, TaC directly prompts the thinking model to generate thinking traces as the shortened context, already outperforming most representative compression methods. Further, given that raw thinking output may struggle with budget control and shortcut behaviors, we introduce Thinking as Compression Constrained (TaC-C), leveraging a simple reward-driven optimization framework to elicit intrinsic thinking as compact and controllable compressed context. Experiments across four long-context QA benchmarks demonstrate that TaC-C consistently outperforms existing baselines. At 4x and 8x compression ratios, it surpasses the strongest competitor by 17.4% and 23.4% in average F1, and by 15.7% and 21.7% in average Exact Match Score (EM), respectively.

2605.28710 2026-05-28 cs.CL cs.AI

Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

Irune Zubiaga, Aitor Soroa, Rodrigo Agerri

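AI summary: By analyzing strategies such as instruction translation, monolingual versus multilingual supervision, and model size, the study examines how to develop multilingual LLM judges with and without in-domain data, and identifies key trade-offs: fine-tuned small models can match proprietary models when in-domain data is available, while zero-shot large models are more effective out of domain.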

Abstract

Large language models (LLMs) are increasingly used for the automatic evaluation of generated text, yet most prior work focuses on English. Despite the growing demand for multilingual evaluation, extending LLM-based evaluators to multilingual settings remains challenging, particularly for low-resource languages and scenarios where in-domain data is scarce. This work explores several strategies for developing multilingual LLMs-as-a-judge, considering whether in-domain data is available for fine-tuning or not. We systematically analyze English, Spanish, and Basque, representing high-, mid-, and low-resource languages, considering instruction translation, monolingual versus multilingual supervision, and model size. For evaluation, we extend two existing meta-evaluation datasets to Basque and Spanish. Our results reveal key trade-offs: When in-domain data is available, fine-tuned smaller models can achieve performance comparable to proprietary models, whereas zero-shot evaluation with larger models proves more effective in out-of-domain settings. We also observe that fine-tuning on out-of-domain data can adversely affect model performance. These findings provide practical guidance for building efficient, reliable multilingual evaluation pipelines. The data and code are publicly available at hitz-zentroa/mJudge.

2605.28707 2026-05-28 cs.AI cs.LG

Beyond Binary Moral Judgment: Modeling Ethical Pluralism in AI

超越二元道德判断:在AI中建模伦理多元主义

Aisha Aijaz, Rahul Goel, Arnav Batra, Raghava Mutharaju

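AI summary Proposes a framework that models moral reasoning as a distribution over normative ethical theories (ethical pluralism), implemented with a two-stream normative-semantic architecture and stacked ensemble learning, reaching 88.89% accuracy on 450 cases.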

详情
AI中文摘要

在社会关键领域的决策中,AI系统正以不同能力越来越多地参与。然而,尽管自主系统无处不在,大多数处理自主道德决策的方法仍诉诸于标量或二元判断。这些方法对于可接受的道德推理是不够的,因为它们提供的解释很少,遗漏了必须包含以支持问责的关键背景和理论信息。为此,我们提出了一个将道德推理建模为规范性伦理理论或伦理多元主义分布的框架。我们引入了一个整合这些理论的规范伦理单纯形。还准备了涵盖15个细分子理论的450个案例基准,用于堆叠集成学习。这些案例描述了自然语言中的伦理困境,并具有相关的提取上下文特征。单纯形的实现通过双流规范-语义架构完成,随后是规范信息的融合和顺序堆叠集成,以学习三个广泛理论(后果主义、美德伦理学和道义论)及其15个子类别的最佳拟合。我们的实验表明,将上下文和规范先验与语义嵌入相结合显著提高了分类性能,准确率达到88.89%。我们进行了消融研究,以表明结构化伦理表示超越了类比推理的贡献,并且所选的堆叠架构由于逐步学习粒度而给出了最佳结果。还通过熵、置信度和可视化分析了伦理多元主义。因此,将伦理多元主义建模为概率性规范分布支持类人道德推理、伦理分歧分析以及未来AI系统中的对齐。

英文摘要

AI systems are increasingly involved, in varying capacities, in critical decision-making in socially consequential domains. Yet, despite the ubiquity of autonomous systems, most approaches to autonomous moral decision-making resort to scalar or binary judgments. These methods are insufficient for acceptable moral reasoning, as they provide little explanation, leaving out imperative contextual and theoretical information that must be included to support accountability. To address this, we propose a framework that models moral reasoning as a distribution over normative ethical theories, that is, ethical pluralism. We introduce a normative ethics simplex that integrates these theories. We also prepared a benchmark of 450 cases across 15 fine-grained subtheories for stacked ensemble learning. These cases describe ethical dilemmas in natural language and have associated extracted contextual features. The simplex is implemented via a two-stream normative-semantic architecture, followed by fusion of normative information and a sequential stacking ensemble that learns the best fit over the three broad theories (consequentialism, virtue ethics, and deontology) and their 15 subcategories. Our experiments demonstrate that integrating contextual and normative priors with the semantic embeddings significantly improves classification performance, reaching an accuracy of 88.89%. We conducted ablation studies to show that structured ethical representations contribute beyond analogical reasoning, and that the chosen stacking architecture gives the best results due to the gradual learning of granularity. Ethical pluralism is also analyzed through entropy, confidence, and visualization. Thus, modeling ethical pluralism as a probabilistic normative distribution supports human-like moral reasoning, ethical disagreement analysis, and future alignment in AI systems.

2605.28705 2026-05-28 cs.LG

Understanding Generalization and Forgetting in In-Context Continual Learning

理解上下文持续学习中的泛化与遗忘

Guangyu Li, Meng Ding, Lijie Hu

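AI summary Proposes the first theoretical framework for in-context continual learning, analyzing the generalization and forgetting behavior of a pretrained Transformer that processes multiple sequential tasks within a single prompt, and revealing the interference and bias induced by attention mechanisms.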

Comments accepted by ICML 2026

详情
AI中文摘要

上下文学习(ICL)的强大之处在于使大型语言模型能够仅通过基于提示的推理来适应新任务,完全绕过了参数更新的需要。现有理论主要在单任务设置下研究ICL,而现实中的提示通常包含异构任务序列,这导致我们无法理解大型语言模型是否在推理过程中隐式地执行持续学习。为了弥补这一差距,我们提出了首个用于上下文持续学习的理论框架,模拟预训练Transformer如何通过共享注意力机制在单个提示内处理多个顺序任务。聚焦于线性和掩码线性自注意力,我们推导了顺序任务提示下模型预测的误差表达式,并分析了它们的泛化和遗忘行为。我们的结果表明,标准注意力机制通过均匀或因果地聚合历史上下文,不可避免地引起任务间干扰,导致系统性偏差。我们进一步提供了预测误差的偏差-方差-干扰分解,刻画了历史上下文信息何时产生正迁移或可证明的负迁移。这一分析揭示了基于注意力的持续推理的基本限制,并为长提示中的顺序敏感性和性能退化提供了理论解释。

英文摘要

In-context learning (ICL) derives its power from enabling Large Language Models to adapt to new tasks via prompt-based reasoning alone, entirely bypassing the need for parameter updates. Existing theories primarily study ICL in single-task settings, while real-world prompts often contain sequences of heterogeneous tasks, leaving a gap in understanding whether Large Language Models implicitly perform continual learning during inference. To bridge this gap, we propose the first theoretical framework for in-context continual learning, modeling how a pretrained Transformer processes multiple sequential tasks within a single prompt through shared attention mechanisms. Focusing on linear and masked linear self-attention, we derive error expressions for model predictions under sequential task prompts and analyze their generalization and forgetting behavior. Our results reveal that standard attention mechanisms inevitably induce intertask interference by uniformly or causally aggregating historical contexts, leading to systematic bias. We further provide a bias-variance-interference decomposition of prediction error, characterizing when historical in-context information yields positive transfer or provable negative transfer. This analysis exposes fundamental limits of attention-based continual inference and offers theoretical explanations for order sensitivity and performance degradation in long prompts.

2605.28704 2026-05-28 cs.LG

Expressive Power of Floating-Point Neural Networks with Arbitrary Reduction Orders and Inexact Activation Implementations

具有任意归约顺序和不精确激活实现的浮点神经网络的表达能力

Yeachan Park, Geonho Hwang, Wonyeol Lee, Sejun Park

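AI summary Studies whether floating-point neural networks can exactly represent arbitrary functions between floating-point domains under generalized floating-point execution semantics, including arbitrary reduction orders and inexact activation implementations with bounded ulp errors. The paper introduces a general distinguishability framework, proves that the ability of the first layer to distinguish every pair of distinct inputs is necessary for universal representability, proves that a suitable form of distinguishability is also sufficient under mild conditions, and thereby establishes universal representability results for practical activations such as Sigmoid, tanh, and ReLU.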

详情
AI中文摘要

大多数现有的神经网络表达能力理论假设精确实数运算,而实际神经网络是在有限精度浮点算术下执行的,其执行语义依赖于实现。最近的工作开始研究浮点神经网络的表达能力,但现有结果仅限于高度受限的激活函数和理想化假设,如固定的从左到右归约顺序和正确舍入的激活实现。在这项工作中,我们研究了在广义浮点执行语义下浮点神经网络的表达能力,包括任意归约顺序和具有有界ulp误差的不精确激活实现。我们探讨了浮点神经网络何时能够精确表示浮点域上的任意函数。为此,我们引入了一个通用的可区分性框架,并表明在第一层中区分每对不同输入的能力是通用可表示性的必要条件。这一表征产生了广泛的不具备通用可表示性的激活实现类别,扩展了先前孤立的反例,如正确舍入的余弦激活。我们进一步证明,在激活实现的温和条件下,适当形式的可区分性也是通用可表示性的充分条件。利用这一框架,我们为一大类实际激活函数建立了通用可表示性结果,包括Sigmoid、tanh、ReLU、ELU、SeLU、GeLU、Swish、Mish和sin的实现,这些结果在比以前已知的显著更现实的浮点执行模型下成立。

英文摘要

Most existing expressivity theories for neural networks assume exact real arithmetic, whereas practical neural networks are executed under finite-precision floating-point arithmetic with implementation-dependent execution semantics. Recent works have begun studying the expressive power of floating-point neural networks, but existing results are limited to highly restricted activation functions and idealized assumptions such as fixed left-to-right reduction orders and correctly rounded activation implementations. In this work, we study the expressive power of floating-point neural networks under generalized floating-point execution semantics, including arbitrary reduction orders and inexact activation implementations with bounded ulp errors. We investigate when floating-point neural networks can represent arbitrary functions between floating-point domains exactly. To this end, we introduce a general distinguishability framework and show that the ability to distinguish every pair of distinct inputs in the first layer is necessary for universal representability. This characterization yields broad classes of activation implementations that are not universal representators, extending previous isolated counterexamples such as the correctly rounded cosine activation. We further prove that a suitable form of distinguishability is also sufficient for universal representability under mild conditions on the activation implementation. Using this framework, we establish universal representability results for a broad class of practical activation functions, including implementations of $\mathrm{Sigmoid}$, $\tanh$, $\mathrm{ReLU}$, $\mathrm{ELU}$, $\mathrm{SeLU}$, $\mathrm{GeLU}$, $\mathrm{Swish}$, $\mathrm{Mish}$, and $\sin$, under significantly more realistic floating-point execution models than previously known.
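The dependence on reduction order mentioned in this abstract comes from the fact that floating-point addition is not associative. The following is a minimal sketch in plain Python (IEEE 754 double precision), not code from the paper; the helper name `reduce_sum` and the sample values are illustrative assumptions.

```python
def reduce_sum(values, order):
    """Sum `values` sequentially, visiting indices in the given order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


# Near 1e16 adjacent doubles are 2 apart, so 1e16 + 1.0 rounds back to 1e16.
values = [1e16, 1.0, -1e16]

# Left-to-right: (1e16 + 1.0) - 1e16 loses the 1.0 entirely.
left_to_right = reduce_sum(values, [0, 1, 2])

# Reordered: (1e16 - 1e16) + 1.0 keeps it.
reordered = reduce_sum(values, [0, 2, 1])

print(left_to_right, reordered)
```

The same three numbers give two different sums, which is why a result that holds only for a fixed left-to-right order says little about implementations that reduce in parallel or in tree order.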

2605.28699 2026-05-28 cs.AI

TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning

TRACER: 基于内部强化信用与轮次级遗憾匹配的多LLM协作推理

Chusen Li, Zhou Liu, Shuigeng Zhou, Wentao Zhang

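AI summary Proposes the TRACER framework, in which a controller-regret layer and a generation-credit layer learn when to speak and what to say, respectively, addressing sparse rewards, free-riding, and fixed-protocol oscillation in multi-agent reinforcement learning and achieving cooperative reasoning with mathematical convergence guarantees.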

Comments 25 pages, 3 figures

详情
AI中文摘要

大型语言模型越来越依赖强化学习或多智能体提示来改进推理,但这两个范式仍然难以结合。将单智能体强化学习直接应用于多轮多智能体系统面临以下困境:i) 稀疏奖励、角色级搭便车和过高的训练开销。ii) 智能体仅模仿协作。iii) 固定协作协议陷入振荡的局部最优。我们引入TRACER,一个用于协作多LLM推理的轮次级强化框架。TRACER将协作决策分为控制器-遗憾层和生成-信用层,其中控制器通过遗憾匹配学习智能体是否应在当前轮次发言或跳过,生成-信用层则使用角色特定的GSPO奖励优化提议者和评审者的发言。这种设计i) 在动作模式和生成话语两个层面分配信用,从而避免搭便车和稀疏奖励。我们仅扩展控制器做出的选择,从而大幅降低训练的计算成本。此外,ii) 智能体在学习何时发言和说什么的过程中获得协作能力。最后,iii) 通过巧妙设计二元动作,我们将为有限动作空间建立的经典博弈论扩展到深度学习,从而实现数学上严格的收敛。我们在GSM8K训练集上训练所有局部RL方法,并在保留的GSM8K、MATH500和GPQA-Diamond上评估域内准确率、跨基准泛化能力、推理成本和修正保持行为。所得框架提供了一个紧凑且可复现的测试平台,用于研究超越固定辩论、投票或聚合协议的学习协作策略。代码可在https://github.com/Shark-Forest/TRACER获取。

英文摘要

Large language models increasingly rely on either reinforcement learning or multi-agent prompting to improve reasoning, yet these two paradigms remain difficult to combine. Directly applying single-agent reinforcement learning to multi-turn multi-agent systems faces the following dilemmas: i) sparse rewards, role-level free-riding, and excessive training overhead; ii) agents that only imitate collaboration; iii) fixed collaboration protocols that fall into oscillating local optima. We introduce TRACER, a turn-level reinforcement framework for cooperative multi-LLM reasoning. TRACER separates collaborative decision making into a controller-regret layer, where controllers learn whether the agents should speak or skip the current round through regret matching, and a generation-credit layer, which optimizes proposer and reviewer utterances with role-specific GSPO rewards. This design i) assigns credit at the level of both action modes and generated utterances, thus avoiding free-riding and sparse rewards. We only expand the choices made by the controllers, thus greatly reducing the computational cost of training. Moreover, ii) agents acquire collaborative capability as they learn when to speak and what to say. Finally, iii) by carefully designing binary actions, we extend classical game theory established for finite action spaces to deep learning, thus achieving mathematically rigorous convergence. We train all local RL-style methods on the GSM8K training split and evaluate on held-out GSM8K, MATH500, and GPQA-Diamond to measure in-domain accuracy, cross-benchmark generalization, inference cost, and correction-preservation behavior. The resulting framework provides a compact and reproducible testbed for studying learned collaboration policies beyond fixed debate, voting, or aggregation protocols. Code is available at https://github.com/Shark-Forest/TRACER.
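For readers unfamiliar with regret matching, the sketch below shows the classical rule on a binary speak/skip action set. It is an illustration of the textbook algorithm only: the function names, the action indexing, and the availability of a counterfactual utility for the action not taken are assumptions, and nothing here is taken from the TRACER code.

```python
SPEAK, SKIP = 0, 1  # hypothetical action indices for one controller


def regret_matching_policy(cum_regret):
    """Play each action with probability proportional to its positive regret."""
    positive = [max(r, 0.0) for r in cum_regret]
    total = sum(positive)
    if total <= 0.0:
        # No action has positive regret: fall back to the uniform policy.
        return [1.0 / len(cum_regret)] * len(cum_regret)
    return [p / total for p in positive]


def update_regret(cum_regret, utilities, played):
    """Add, for every action, how much better it would have done than `played`.

    `utilities` holds one utility per action for the current round, including
    the counterfactual value of the action that was not taken (an assumption).
    """
    baseline = utilities[played]
    return [r + (u - baseline) for r, u in zip(cum_regret, utilities)]


# One round: the controller skipped, but speaking would have scored higher.
regret = update_regret([0.0, 0.0], utilities=[1.0, 0.0], played=SKIP)
policy = regret_matching_policy(regret)
print(policy)
```

Because the action set is finite and tiny, the cumulative regrets fit in a two-element list, which is the setting where the classical convergence results for regret matching apply.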

2605.28691 2026-05-28 cs.CV

OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning

OSP-Next: 结合稀疏序列并行、HiF8量化和强化学习的高效高质量视频生成

Yunyang Ge, Xianyi He, Zezhong Zhang, Bin Lin, Bin Zhu, Xinhua Cheng, Li Yuan

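AI summary Proposes OSP-Next, a text-to-video generation model that combines a hybrid full-sparse attention architecture, Sparse Sequence Parallelism (SSP), HiF8 quantization, and Mix-GRPO post-training to substantially improve efficiency while maintaining high quality, achieving speedups of more than 1.5x on NVIDIA H200 and Ascend 950PR.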

详情
AI中文摘要

扩散Transformer在视频生成中取得了高质量,但全注意力的二次成本限制了效率。我们提出OSP-Next,一种高效的文本到视频生成模型,集成了稀疏注意力、并行、量化和强化学习。OSP-Next采用混合全稀疏注意力架构,其中稀疏组件通过Skiparse-2D注意力实现。这种固定模式机制沿空间维度应用逐token和逐组的稀疏注意力,利用局部性同时保持与FlashAttention内核的原生兼容性。基于Skiparse-2D注意力中重排的局部等价性,我们进一步提出稀疏序列并行(SSP),它将子序列划分到多个rank,并通过一次All-to-All通信切换稀疏模式。与Ulysses序列并行(SP)相比,SSP为稀疏注意力提供了原生并行策略,并将通信量减少了75%。OSP-Next还引入了HiF8量化,以实现8位量化和稀疏微调的稳定联合训练,并应用Mix-GRPO后训练来提升稀疏模型的性能。实验表明,OSP-Next的VBench总得分为83.73%,超过了Wan2.1基线。在5秒720P和5秒768P设置下,OSP-Next在NVIDIA H200 GPU上实现了高达1.64倍的单GPU加速和超过1.52倍的八GPU加速。此外,在VBench总分仅下降0.4%的情况下,OSP-Next-HiF8在单个Ascend 950PR上分别实现了1.69倍和2.27倍的加速,展示了OSP-Next跨硬件平台的效率和性能。

英文摘要

Diffusion Transformers achieve strong video generation quality, but the quadratic cost of full attention limits efficiency. We introduce OSP-Next, an efficient text-to-video generation model that integrates sparse attention, parallelism, quantization, and reinforcement learning. OSP-Next uses a hybrid full-sparse attention architecture, where the sparse component is implemented with Skiparse-2D Attention. This fixed-pattern mechanism applies token-wise and group-wise sparse attention along spatial dimensions, leveraging locality while maintaining native compatibility with FlashAttention kernels. Based on the local equivalence of rearrangement in Skiparse-2D Attention, we further propose Sparse Sequence Parallelism (SSP), which partitions subsequences across ranks and switches sparse patterns through a single All-to-All communication. Compared with Ulysses Sequence Parallelism (SP), SSP provides a native parallel strategy for sparse attention and reduces communication volume by 75%. OSP-Next also incorporates HiF8 quantization to enable stable joint training with 8-bit quantization and sparse fine-tuning, and applies Mix-GRPO post-training to improve the performance of the sparse model. Experiments show that OSP-Next achieves a VBench total score of 83.73%, surpassing the Wan2.1 baseline. Under the 5-second 720P and 5-second 768P settings, OSP-Next achieves up to 1.64$\times$ single-GPU speedup and over 1.52$\times$ eight-GPU speedup on NVIDIA H200 GPUs. In addition, with only a 0.4% drop in VBench total score, OSP-Next-HiF8 achieves 1.69$\times$ and 2.27$\times$ speedups under the two settings on a single Ascend 950PR, demonstrating the efficiency and performance of OSP-Next across hardware platforms.

2605.28687 2026-05-28 cs.SD physics.med-ph

Cross-modal characterization of infant cry: validation of a chest-surface accelerometer in extracting acoustic vocal function measures

婴儿哭声的跨模态表征:胸表加速度计在提取声学发声功能测量中的验证

Winko W. An, Saketh Sundar, Lisa Yankowitz, Daryush D. Mehta, Carol L. Wilkinson

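AI summary This study validates a chest-surface accelerometer for infant cry analysis, finding that it reliably captures acoustic features such as fundamental frequency and jitter, and offers a noise-robust, privacy-preserving alternative for clinical research.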

详情
AI中文摘要

背景:婴儿哭声声学为早期神经发育提供了有前景的窗口,并可能作为神经发育障碍的可扩展生物标志物。然而,传统的基于麦克风的录音在现实临床环境中极易受到环境噪声的影响,并引发隐私问题。胸表加速度计通过直接捕获来自喉部的振动,可能提供一种稳健的替代方案。方法:我们通过比较常规疫苗接种期间从加速度计和同步记录的麦克风信号中提取的声学特征,评估了胸戴加速度计用于婴儿哭声分析的有效性。最终样本包括来自多样化儿科人群的85名婴儿(41名4个月大;44名12个月大)。从两种模态中提取了七种发声测量指标,包括基频、抖动、 shimmer、倒谱峰值突出度和谐波噪声比。使用组内相关系数评估模态间的一致性和一致性。结果:加速度计和麦克风录音之间的基频表现出极好的一致性(ICC > 0.94)。抖动测量也显示出良好到极好的一致性,而倒谱峰值突出度显示出中等一致性。Shimmer和谐波噪声比在模态间显示出较低的一致性绝对值和系统偏差,反映了信号传输和噪声敏感性可能存在的差异。结论:总之,胸表加速度计可以可靠地捕获婴儿哭声的几个临床相关声学特征,特别是基频和抖动的时间测量。这种方法为基于麦克风的录音提供了一种噪声鲁棒且保护隐私的替代方案,支持其在可扩展的临床和发育研究应用中的潜在用途。

英文摘要

Background: Infant cry acoustics provide a promising window into early neurodevelopment and may serve as scalable biomarkers for neurodevelopmental disorders. However, conventional microphone-based recordings are highly susceptible to environmental noise and raise privacy concerns in real-world clinical settings. Chest-surface accelerometers may offer a robust alternative by capturing vibrations directly from the larynx. Methods: We evaluated the validity of a chest-mounted accelerometer (ACC) for infant cry analysis by comparing acoustic features derived from ACC and simultaneously recorded microphone (MIC) signals during routine vaccination visits. The final sample included 85 infants (41 at 4 months; 44 at 12 months) from a diverse pediatric population. Seven vocal measures were extracted from both modalities, including fundamental frequency (F0), jitter, shimmer, cepstral peak prominence (CPP), and harmonics-to-noise ratio (HNR). Agreement and consistency between modalities were assessed using intraclass correlation coefficients (ICCs). Results: F0 demonstrated excellent agreement between ACC and MIC recordings (ICC > 0.94). Jitter measures also showed good-to-excellent agreement, while CPP demonstrated moderate agreement. Shimmer and HNR showed lower absolute agreement and systematic bias between modalities, reflecting possible differences in signal transmission and noise sensitivity. Conclusion: In summary, chest-surface accelerometers can reliably capture several clinically relevant acoustic features of infant cry, particularly temporal measures of F0 and jitter. This approach offers a noise-robust and privacy-preserving alternative to microphone-based recordings, supporting its potential use in scalable clinical and developmental research applications.
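The distinction between absolute agreement and consistency can be made concrete with the standard two-way, single-measure ICC formulas. The sketch below is an illustration under stated assumptions: the abstract does not say which ICC variants were used, the function name `icc_two_way` is hypothetical, and the F0 values are made up for the example rather than taken from the study.

```python
def icc_two_way(ratings):
    """Return (absolute agreement, consistency) single-measure ICCs.

    `ratings` is a list of rows, one per subject, with one column per
    measurement modality. Uses the two-way ANOVA mean squares.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)              # between-subject mean square
    msc = ss_cols / (k - 1)              # between-modality mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square

    consistency = (msr - mse) / (msr + (k - 1) * mse)
    agreement = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    return agreement, consistency


# Made-up F0 values (Hz): the second modality reads a constant 20 Hz higher.
mic = [300.0, 350.0, 400.0, 450.0, 500.0]
pairs = [[m, m + 20.0] for m in mic]
agreement, consistency = icc_two_way(pairs)
print(agreement, consistency)
```

A constant offset between modalities leaves consistency at its maximum while lowering absolute agreement, which is the pattern a systematic bias between two sensors would produce.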

2605.28684 2026-05-28 cs.LG cs.CE cs.NA math.NA physics.comp-ph

History-aware adaptive reduced-order models via incremental singular value decomposition

基于增量奇异值分解的历史感知自适应降阶模型

Amirpasha Hedayat, Ali Mohaghegh, Laura Balzano, Cheng Huang, Karthik Duraisamy

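AI summary To address the loss of accuracy when the online dynamics of a reduced-order model leave the offline training regime, proposes a projection-based adaptive reduced-order framework built on incremental singular value decomposition (iSVD), in which occasional full-order operator evaluations supply correction snapshots for online basis updates; the method is shown to outperform existing approaches on three nonlinear problems.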

Comments 50 pages, 27 figures, Preprint submitted to Elsevier

详情
AI中文摘要

降阶模型(ROM)可以加速高维动力学模拟,但当在线动态偏离离线训练数据所代表的区域时,其精度通常会下降。我们开发了一种基于增量奇异值分解(iSVD)的投影自适应ROM框架,其中偶尔的全阶算子评估为在线基更新提供校正快照。这里考虑的侵入式ROM完全由基参数化,因此每次更新自然传播到降阶算子和超降阶机制。通过其演变的奇异结构,iSVD保留了观测动态的编码历史,在这个意义上具有历史感知能力。我们在三个复杂度递增的非线性问题上研究了该方法:一维粘性Burgers方程、Sod激波管和刚性一维十种组分旋转爆震发动机(RDE)。Burgers问题用于分析该方法,并将iSVD与替代基自适应规则进行比较,表明历史感知更新优于瞬时更新,且iSVD整体性能最强。Sod和RDE案例表明,这些优势在更具挑战性的可压缩流设置中持续存在。对于RDE问题,iSVD自适应ROM在预测精度和计算效率上都优于当前最先进的直接自适应ROM基线。成本分析表明,主要的在线成本来自与全阶模型交互以获取校正快照,而iSVD更新本身可忽略不计。这些结果将iSVD确定为在线学习降阶子空间的有效机制,并指出了使ROM在其初始训练窗口长几个数量级的时间范围内保持预测性的路径。

英文摘要

Reduced-order models (ROMs) can accelerate high-dimensional dynamical simulations, but their accuracy often deteriorates when online dynamics leave the regime represented by offline training data. We develop a projection-based adaptive ROM framework based on incremental singular value decomposition (iSVD), in which occasional full-order operator evaluations provide correction snapshots for online basis updates. The intrusive ROMs considered here are fully parameterized by the basis, so each update naturally propagates to reduced operators and hyper-reduction machinery. Through its evolving singular structure, iSVD retains an encoded history of the observed dynamics and is history-aware in this sense. We study the method on three nonlinear problems of increasing complexity: the one-dimensional viscous Burgers equation, the Sod shock tube, and a stiff one-dimensional ten-species rotating detonation engine (RDE). The Burgers problem is used to analyze the method and compare iSVD with alternative basis adaptation rules, showing that history-aware updates outperform instantaneous updates and that iSVD gives the strongest overall performance. The Sod and RDE cases demonstrate that these advantages persist in more challenging compressible-flow settings. For the RDE problem, the iSVD adaptive ROM improves upon the current state-of-the-art Direct adaptive ROM baseline in both predictive accuracy and computational efficiency. A cost analysis shows that the dominant online cost comes from interacting with the full-order model to obtain correction snapshots, while the iSVD update itself is negligible. These results identify iSVD as an effective mechanism for online learning of reduced subspaces and suggest a path toward ROMs that remain predictive over horizons several orders of magnitude longer than their initial training window.
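As a rough illustration of how a basis can absorb one new snapshot without recomputing a full decomposition, the sketch below implements a standard column-append incremental SVD update with NumPy. It is not the authors' implementation: the function name `isvd_add_column`, the tolerance, and the optional `rank` truncation are assumptions, and only the left singular vectors and singular values are tracked.

```python
import numpy as np


def isvd_add_column(U, s, c, rank=None, tol=1e-12):
    """Update a thin SVD (U, s) after appending snapshot column `c`."""
    p = U.T @ c                 # component of c inside the current basis
    r = c - U @ p               # component of c outside the current basis
    rho = np.linalg.norm(r)
    k = s.size
    if rho > tol:
        # New direction found: grow the small core matrix by one row/column.
        K = np.zeros((k + 1, k + 1))
        K[:k, :k] = np.diag(s)
        K[:k, k] = p
        K[k, k] = rho
        U_ext = np.hstack([U, (r / rho)[:, None]])
    else:
        # c already lies in span(U): only rotate the existing basis.
        K = np.hstack([np.diag(s), p[:, None]])
        U_ext = U
    Uk, s_new, _ = np.linalg.svd(K, full_matrices=False)
    U_new = U_ext @ Uk
    if rank is not None:
        U_new, s_new = U_new[:, :rank], s_new[:rank]
    return U_new, s_new


# Start from three snapshots, then stream in three more one at a time.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 6))
U, s, _ = np.linalg.svd(A[:, :3], full_matrices=False)
for j in range(3, 6):
    U, s = isvd_add_column(U, s, A[:, j])
```

Each update only decomposes a small core matrix whose size depends on the retained rank rather than on the state dimension, which fits the abstract's observation that the update cost is negligible next to full-order model evaluations.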

2605.28683 2026-05-28 cs.AI

VeriTrip: A Verifiable Benchmark for Travel Planning Agents over Unstructured Web Corpora

VeriTrip: 面向非结构化网络语料的旅行规划智能体可验证基准

Yuting Xu, Jiayi Tian, Jian Liang, Xin Xiong, Hang Zhang, Mu Xu, Xiao-Yu Zhang

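AI summary Proposes the VeriTrip benchmark, which uses a multimodal retrieval base and a verifiable knowledge base to evaluate agents' evidence-grounded travel planning over unstructured web corpora, and reveals a retrieval-reasoning trade-off.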

Comments 10 pages, 4 figures

详情
AI中文摘要

现有基准通过建立以API为中心的范式为旅行规划智能体奠定了基础。然而,随着自主智能体能力的不断提升,其评估必须从简单的工具执行扩展到处理开放网络的固有复杂性。当前基准绕过了核心认知障碍:它们未能考虑信息噪声,忽略了多源事实矛盾,并且忽视了将视觉感知融入逻辑规划的必要性。我们引入了VeriTrip,一个旨在满足智能体鲁棒性和可靠性日益增长需求的可验证基准。VeriTrip将评估重点转向基于非结构化多模态网络语料的证据推理。它建立了一个源自真实世界的多模态检索库(MRB),迫使智能体自主协调跨异构数据的查询。同步的可验证知识库(VKB)支持逐单元验证协议,精确量化事实可靠性,区分系统性推理失败与参数幻觉。我们在领先的多模态大语言模型上的评估揭示了一个关键的“检索-推理权衡”:自主检索的认知负荷显著侵蚀了指令保持能力。VeriTrip为能够在无约束多模态环境中运行的下一代规划智能体提供了严格的基础。

英文摘要

Existing benchmarks have laid the foundation for travel planning agents by establishing API-centric paradigms. However, as the capabilities of Autonomous Agents continue to advance, their evaluation must evolve beyond simple tool execution toward handling the inherent complexities of the open web. Current benchmarks bypass core cognitive hurdles: they fail to account for information noise, ignore multi-source factual contradictions, and overlook the necessity of grounding visual perception into logical planning. We introduce VeriTrip, a verifiable benchmark designed to meet the increasing demands for agent robustness and reliability. VeriTrip shifts the evaluation focus to evidence-grounded reasoning over unstructured multimodal web corpora. It establishes a Multimodal Retrieval Base (MRB) derived from real-world sources, forcing agents to autonomously orchestrate queries across heterogeneous data. A synchronized Verifiable Knowledge Base (VKB) enables a cell-wise verification protocol that precisely quantifies factual reliability, distinguishing systematic reasoning failures from parametric hallucinations. Our evaluations across leading MLLMs reveal a critical "retrieval-reasoning trade-off": the cognitive load of autonomous retrieval significantly erodes instruction retention. VeriTrip provides the rigorous foundation necessary for the next generation of planning agents capable of operating in unconstrained, multimodal environments.

2605.28679 2026-05-28 cs.LG stat.ML

Optimal ridge regularization revisited

最优岭回归正则化再探讨

Jack Timmermans, Sergio A. Alvarez

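AI summary For linear ridge regression over a finite data sample, proposes an iterative algorithm that computes the optimal regularization strength from the generative parameters and proves its convergence at limited noise levels; experiments show that, combined with sample-based parameter estimates, it attains near-optimal generalization across a range of settings.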

详情
AI中文摘要

我们考虑在有限数据样本 $X$ 上的 $L^2$ 正则化线性(岭)回归,其中 $X$ 具有有界协方差,线性预测目标 $y$ 具有加性各向同性噪声且方差有限。我们提出了一种迭代过程,用于在固定 $X$ 设置下从生成参数数值计算最优正则化强度,并证明了其在有限噪声水平下的收敛性。我们在合成数据上的实验评估表明,所提出的过程结合基于样本的参数估计,在广泛的样本量、长宽比和噪声水平下,实现了接近最优的随机 $X$ 泛化性能,额外计算成本相当于欠参数化情况下的一次初步岭回归和过参数化情况下的两次初步岭回归。

英文摘要

We consider $L^2$-regularized linear (ridge) regression over a finite data sample $X$ with bounded covariance and linear prediction targets $y$ with additive isotropic noise of finite variance. We present an iterative procedure to compute the optimal regularization strength numerically from the generative parameters in the fixed-$X$ setting and prove its convergence at limited noise levels. Our experimental evaluation over synthetic data shows that the proposed procedure combined with sample-based parameter estimates attains near-optimal random-$X$ generalization across a wide range of sample sizes, aspect ratios, and noise levels, at an added computational cost equivalent to one preliminary ridge regression in the underparameterized regime and two in the overparameterized case.
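To make the notion of an optimal regularization strength in the fixed-$X$ setting concrete, the sketch below evaluates the closed-form in-sample risk of ridge regression for known generative parameters and picks the minimizer on a grid. This is a brute-force baseline under stated assumptions, not the iterative procedure the abstract describes: the estimator convention $\hat\beta = (X^\top X + \lambda I)^{-1} X^\top y$, the function name `fixed_x_risk`, the grid, and the synthetic parameters are all choices made for illustration.

```python
import numpy as np


def fixed_x_risk(X, beta, sigma2, lam):
    """Expected in-sample error (1/n) E||X beta_hat - X beta||^2 for ridge.

    Assumes y = X beta + noise with isotropic noise variance sigma2 and
    beta_hat = (X^T X + lam I)^{-1} X^T y.
    """
    n = X.shape[0]
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    coef = Vt @ beta                      # beta in the right singular basis
    shrink = s**2 / (s**2 + lam)          # per-direction shrinkage factor
    bias2 = np.sum(((1.0 - shrink) * s * coef) ** 2)
    variance = sigma2 * np.sum(shrink**2)
    return (bias2 + variance) / n


# Synthetic generative parameters (illustrative values only).
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
beta = rng.standard_normal(5)
sigma2 = 4.0

# Grid search over lambda, including the unregularized case lam = 0.
lams = np.concatenate(([0.0], np.logspace(-4, 4, 200)))
risks = np.array([fixed_x_risk(X, beta, sigma2, lam) for lam in lams])
best_lam = lams[np.argmin(risks)]
```

The risk splits into a squared-bias term that grows with $\lambda$ and a variance term that shrinks with it, so with nonzero noise the minimizer is strictly positive; in practice `beta` and `sigma2` are unknown and would have to be replaced by sample-based estimates.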

2605.28678 2026-05-28 cs.AI

DREAM-R: Multimodal Speculative Reasoning with RL-Based Refined Drafting, Precise Verification, and Fully Parallel Execution

DREAM-R: 基于强化学习的精炼草稿、精确验证与完全并行执行的多模态推测推理

Yunhai Hu, Zining Liu, Xiangyang Yin, Tianhua Xia, Bo Bao, Eric Sather, Vithursan Thangarasa, Sai Qian Zhang

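AI summary Proposes the DREAM-R framework, which accelerates reasoning-intensive tasks in multimodal models while preserving accuracy through reinforcement-learning-optimized draft generation, a threshold-based verification mechanism, and fully parallel execution.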

详情
AI中文摘要

推测推理最近被提出作为加速大型多模态模型中推理密集型生成的一种手段,但其有效性常受限于推测草稿与目标验证推理之间的不匹配。在本工作中,我们引入了DREAM-R,一个显著提升推测推理性能的框架。其核心是采用推测对齐策略优化(SAPO),这是一种强化学习目标,训练草稿模型生成既忠实于目标轨迹又简洁的推理步骤。我们进一步提出基于阈值的验证机制(TBVM),使用基于比率的标准,仅在正面证据明显占优时稳定且可解释地接受推测步骤,从而防止错误传播。基于这些组件,我们开发了完全并行推测推理(FPSR)框架,该框架将草稿生成、目标侧推理和验证并行化到多步推理中,支持提前停止和干净回退。在推理密集型基准上的实验表明,在保持目标模型准确性的同时,实现了高达[具体加速比]的加速,在不牺牲推理质量的情况下带来了显著的效率提升。

英文摘要

Speculative reasoning has recently been proposed as a means to accelerate reasoning-intensive generation in large multimodal models, but its effectiveness is often constrained by misalignment between speculative drafts and target-verified reasoning. In this work, we introduce DREAM-R, a framework that substantially improves the performance of speculative reasoning. At its core, DREAM-R employs Speculative Alignment Policy Optimization (SAPO), a reinforcement-learning objective that trains draft models to generate reasoning steps that are both faithful to target trajectories and concise. We further propose a Threshold-based Verification Mechanism (TBVM) that uses a ratio-based criterion to provide stable and interpretable acceptance of speculative steps only when positive evidence clearly dominates, thereby preventing error propagation. Building on these components, we develop a Fully Parallel Speculative Reasoning (FPSR) framework that parallelizes draft generation, target-side reasoning, and verification across multi-step reasoning, enabling early stopping and clean fallback. Experiments on reasoning-heavy benchmarks demonstrate up to speedup while preserving target-model accuracy, yielding substantial efficiency gains without compromising reasoning quality.

2605.28675 2026-05-28 cs.LG

Optimal Data Acquisition for Reinforcement Learning: A Large Deviations Perspective

强化学习的最优数据获取:大偏差视角

Mingjie Hu, Jian-Qiang Hu, Enlu Zhou

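AI summary To address data acquisition efficiency in reinforcement learning, proposes a unified framework based on large deviations theory that uses the exponential decay rate of the policy-selection error probability as the efficiency metric, derives a variational characterization, designs an adaptive data acquisition policy, and proves its near-robust optimality.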

详情
AI中文摘要

数据获取效率是在商业和医疗运营中部署强化学习的一个核心挑战,在这些场景中,交互成本高、速度慢,并且通常涉及人类参与。本文为无限时域强化学习中的数据获取开发了一个统一的大偏差框架。我们引入策略选择错误概率的指数衰减率作为原则性的效率指标,并通过马尔可夫链的大偏差理论推导出该速率的变分特征,从而得到一个嵌套优化问题。基于这一特征,我们根据嵌套问题的最优解形式化了两种互补的最优性概念。由于所得程序是隐式的且通常难以处理,我们提出了一个具有显式约束的可处理凸松弛。然后,我们开发了一种懒惰的一步投影次梯度方法来求解松弛问题,并利用其迭代构造自适应数据获取策略。我们证明,在最优性准则下,所得的强化学习算法在常数因子内是近鲁棒最优的。最后,我们将该框架扩展到线性函数逼近以提高可扩展性,数值实验支持了所提方法的有效性。

英文摘要

Data acquisition efficiency is a central challenge in deploying reinforcement learning in business and healthcare operations, where interactions are costly, slow, and often involve humans in the loop. This paper develops a unified large deviations framework for data acquisition in infinite-horizon reinforcement learning. We introduce the exponential decay rate of the policy-selection error probability as a principled efficiency metric and derive a variational characterization of this rate via large deviations theory for Markov chains, yielding a nested optimization problem. Based on this characterization, we formalize two complementary notions of optimality in terms of the optimal solution of the nested problem. Because the resulting program is implicit and generally intractable, we propose a tractable convex relaxation with explicit constraints. We then develop a lazy one-step projected subgradient method to solve the relaxed problem and use its iterates to construct an adaptive data acquisition policy. We prove that the resulting reinforcement learning algorithm is near-robustly optimal under our optimality criterion, up to a constant factor. Finally, we extend the framework to linear function approximation to improve scalability, and numerical experiments support the effectiveness of the proposed approach.

2605.28669 2026-05-28 cs.CL cs.AI

Sense Representations Are Inducible Interfaces

Jan Christian Blaise Cruz, Alham Fikri Aji

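AI summary Proposes ACROS, which induces an explicit sense pathway into a frozen pretrained decoder LM through a gated residual addition, enabling zero-shot word-sense disambiguation, low-KL lexical steering, and cross-lingual adaptation while preserving base LM quality.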

Comments https://github.com/jcblaisecruz02/acros

详情
AI中文摘要

词义表示(显式的、每个标记的意义分解)对于消歧、引导和跨语言对齐很有用,但现有方法要求模型在预训练时就内置词义结构。我们引入了ACROS,它通过门控残差加法在冻结的预训练解码器LM中诱导出显式的词义通路。在SmolLM2-360M上,ACROS在保持基础LM质量的同时,支持相同诱导变量的三种用途:零样本词义消歧(Raganato ALL上F1为64.95,与WordNet首义启发式方法相当)、在5,161个CoInCo案例中进行低KL词义引导(其中简单的非oracle代理恢复了约90%的正向偏移),以及针对四种语言的SENSIA跨语言适应(平均R@1为0.988,目标FLORES PPL为7.94)。ACROS使词义表示成为普通预训练LM的可诱导接口。

英文摘要

Sense representations (explicit, per-token meaning decompositions) are useful for disambiguation, steering, and cross-lingual alignment, but existing approaches require models to be pretrained with sense structure baked in. We introduce ACROS, which induces an explicit sense pathway into a frozen pretrained decoder LM through a gated residual addition. On SmolLM2-360M, ACROS preserves base LM quality while supporting three uses of the same induced variables: zero-shot word-sense disambiguation (64.95 F1 on Raganato ALL, competitive with the WordNet first-sense heuristic), low-KL lexical steering across 5,161 CoInCo cases where a simple non-oracle proxy recovers about 90% of positive shifts, and SENSIA cross-lingual adaptation to four languages (mean R@1 0.988, target FLORES PPL 7.94). ACROS makes sense representations an inducible interface for ordinary pretrained LMs.

2605.28666 2026-05-28 cs.AI

An LLM-Based Assistance System for Intuitive and Flexible Capability-Based Planning

基于LLM的直观灵活能力规划辅助系统

Luis Miguel Vieira da Silva, Nicolas König, Felix Gehlhoff

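AI summary Proposes a hybrid assistance system that combines formal capability-based SMT planning with an LLM-based natural-language interaction layer, using human-in-the-loop approval to support plan explanation and knowledge model adaptation, improving the accessibility and flexibility of capability-based planning in industrial automation.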

详情
AI中文摘要

在现代工业中,动态环境以及模块化和可重构资源的复杂性要求对过程序列进行自动化规划。基于能力的规划方法通过从以机器可解释形式描述资源功能的语义知识模型自动生成计划来解决这一问题。然而,其实际应用仍然有限:求解器反馈(特别是在不可满足情况下)难以解释,并且知识模型需要随着操作条件变化或请求变得不可行而进行调整。本文提出一种混合辅助系统,通过基于大语言模型(LLM)的自然语言交互、解释和适应层,增强现有的基于能力的可满足性模理论(SMT)规划方法。形式化规划的正确性仍由符号规划器保证,而LLM层在明确的人机协同(HitL)批准下处理自然语言访问和灵活的知识模型适应。该系统分解为四个组件:能力基础化、符号规划、结果解释和规划适应,实现为路由代理工作流,其中中央路由器将任务委派给五个专门代理。该系统在模块化生产系统上针对四种场景类型进行了评估。在23个测试案例中,10个知识查询中的9个和所有4个可满足规划案例均被正确处理,4个不可满足案例中的3个产生了具体的修复建议,所有5个自适应规划场景通过迭代的、用户批准的知识模型修改最终生成了可满足计划。研究结果证实,将形式化规划与基于LLM的辅助相结合,显著提高了工业自动化的可访问性和适应性。

英文摘要

In modern industry, dynamic environments and the complexity of modular and reconfigurable resources require automated planning of process sequences. Capability-based planning approaches address this by automatically generating plans from semantic knowledge models that describe resource functions in a machine-interpretable form. Their practical use, however, remains limited: solver feedback, especially in the case of unsatisfiability, is difficult to interpret, and the knowledge models require adaptation as operational conditions change or requests become infeasible. This paper presents a hybrid assistance system that augments an existing capability-based Satisfiability Modulo Theories (SMT) planning approach with a Large Language Model (LLM)-based layer for natural-language interaction, explanation, and adaptation. Formal planning correctness remains with the symbolic planner, while the LLM layer handles natural-language access and flexible knowledge model adaptation under explicit Human-in-the-Loop (HitL) approval. The system decomposes into four components: Capability Grounding, Symbolic Planning, Result Interpretation, and Planning Adaptation, realized as a routed agentic workflow in which a central router delegates to five specialized agents. The system is evaluated on a modular production system across four scenario types. Of 23 test cases, 9 of 10 knowledge queries and all 4 satisfiable planning cases were handled correctly, 3 of 4 unsatisfiable cases produced concrete repair proposals, and all 5 adaptive planning scenarios resolved into satisfiable plans through iterative, user-approved knowledge model modifications. The findings confirm that combining formal planning with LLM-based assistance substantially improves accessibility and adaptability in industrial automation.

2605.28664 2026-05-28 cs.LG cs.CL

Activation Steering for Synthetic Data Generation: The Role of Diversity in Downstream Safety Detection

用于合成数据生成的激活引导:多样性在下游安全检测中的作用

Vijeta Deshpande, Tootiya Giyahchi, Veena Padmanabhan, Leman Akoglu, Anna Rumshisky

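AI summary Studies whether Activation Steering (AS) can generate high-quality training data for downstream safety detection classifiers, finding that diversity is a critical but overlooked axis and that AS outperforms prompting-based generation only within a narrow parameter regime.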

详情
AI中文摘要

安全检测模型需要HHH(有帮助、无害、诚实)违反输出的示例以实现鲁棒泛化,但此类示例稀缺。激活引导(AS)已成为一种数据高效的方法,用于生成与目标概念对齐的响应。我们研究AS能否为下游分类器生成高质量训练数据集,这一问题尚未被测试。我们通过内在和外在评估,跨越4个概念×2个模型×4种引导方法进行了双重研究。内在方面,除了引导成功(概念对齐)和连贯性的领域标准,我们引入了样本级和集合级多样性作为文献中先前缺失的质量轴,并发现增加引导强度会降低响应多样性。外在方面,我们用引导生成替换可用训练数据中的HHH违反示例,并微调检测分类器。AS生成的数据在4个概念中的3个上产生了比提示生成数据更好的分类器。然而,136个AS配置中只有41个优于提示,表明下游效用存在于一个狭窄的区间,该区间同时满足成功、连贯性和多样性。这三个轴的调和平均数比单独的成功和连贯性更一致地与下游AUROC相关,为实践者调整AS超参数提供了实用的启发式目标。总之,我们的结果突出了AS在合成数据生成中改进安全检测的潜力,并确定了多样性作为调整AS的关键且先前被忽视的轴。

英文摘要

Safety detection models require examples of HHH (Helpful, Harmless, Honest)-violating outputs for robust generalization, but such examples are scarce. Activation Steering (AS) has emerged as a data-efficient method for generating target-concept-aligned responses. We investigate whether AS can generate high-quality training datasets for downstream classifiers, a question that remains untested. We present a two-fold study with intrinsic and extrinsic evaluation across $4$ concepts $\times\,2$ models $\times\,4$ steering methods. Intrinsically, beyond the field-standard rubric of steering success (concept alignment) and coherence, we introduce sample- and set-level diversity as a quality axis previously absent from the literature, and find that increasing steering strength reduces response diversity. Extrinsically, we replace HHH-violating examples in the available training data with steered generations and fine-tune detection classifiers. AS-generated data yields a better classifier than prompting-generated data on $3$ of $4$ concepts. However, only $41$ of $136$ AS configurations outperform prompting, indicating that downstream utility lies in a narrow regime that jointly satisfies success, coherence, and diversity. The harmonic mean of these three axes correlates with downstream AUROC more consistently across concepts than success and coherence alone, providing a practical heuristic target for practitioners tuning AS hyperparameters. Together, our results highlight the potential of AS in synthetic data generation for improving safety detection and identify diversity as a critical, previously overlooked axis for tuning AS.
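The harmonic-mean heuristic mentioned above can be sketched in a few lines. The scores below are invented for illustration, and the assumption that each axis is scaled to the interval (0, 1] is ours; the abstract does not state how the three axes are normalized.

```python
def harmonic_mean(scores):
    """Harmonic mean of strictly positive scores."""
    if any(x <= 0.0 for x in scores):
        raise ValueError("harmonic mean needs strictly positive scores")
    return len(scores) / sum(1.0 / x for x in scores)


# Hypothetical (success, coherence, diversity) scores for two configurations.
strong_steering = (0.9, 0.9, 0.2)   # well aligned and coherent, low diversity
balanced = (0.7, 0.7, 0.7)          # moderate on every axis

hm_strong = harmonic_mean(strong_steering)
hm_balanced = harmonic_mean(balanced)
print(hm_strong, hm_balanced)
```

Unlike an arithmetic mean, the harmonic mean is dominated by the weakest axis, so a configuration that trades away diversity for steering success scores well below a balanced one.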

2605.28659 2026-05-28 cs.LG

Applications of temporal graph learning for predicting the dynamics of biological systems

Applications of temporal graph learning in predicting the dynamics of biological systems

Manuel Dileo, Andrea Sottoriva

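AI summary: This study proposes a temporal graph neural network framework built on pseudotime-resolved gene regulatory networks to forecast how cellular states evolve, and it outperforms foundation models such as scGPT on three tasks.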

Details
AI Chinese abstract

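Biological foundation models perform well in single-cell representation learning by applying Transformer architectures directly to gene-expression matrices. However, these methods mostly operate in static settings and do not explicitly model the temporal evolution of cellular developmental programs. Modeling these dynamics is essential for understanding how cellular states gradually emerge, differentiate, and reorganize during development or disease progression. In this work-in-progress paper, we explore an alternative approach based on temporal graphs, in which cellular states are represented by pseudotime-resolved gene regulatory networks and modeled as graph structures that evolve over persistent gene identities. Starting from single-cell transcriptomic data, we infer pseudotime trajectories, discretize cells into developmental snapshots, reconstruct one gene regulatory network per snapshot, and apply temporal graph neural networks to forecast biological states. We evaluate the framework on two public mouse developmental datasets (erythroid gastrulation and pancreatic endocrinogenesis) on three complementary tasks: gene-expression forecasting, link prediction, and out-degree centrality prediction. Our results show that graph-based models outperform well-known foundation models such as scGPT and scFoundation, which suggests that explicitly modeling the evolving regulatory structure provides useful information beyond static pretrained representations. For link prediction and centrality prediction, temporal graph learning captures non-trivial regulatory dynamics and can identify gene hubs that matter over time. Overall, our findings support temporal graph learning as a promising direction for modeling dynamic biological systems and as a paradigm complementary to current foundation-model approaches in single-cell biology.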

English abstract

Biological foundation models have shown strong performance in single-cell representation learning by applying transformer architectures directly to gene-expression matrices. However, these approaches predominantly operate in static settings and do not explicitly model the temporal evolution of developmental programs in the cell. Modeling such dynamics is important for understanding how cellular states progressively emerge, differentiate, and reorganize during development or disease progression. In this work-in-progress paper, we investigate an alternative temporal graph-based perspective in which cellular states are represented through pseudotime-resolved gene regulatory networks and modeled as evolving graph structures over persistent gene identities. Starting from single-cell transcriptomic data, we infer pseudotime trajectories, discretize cells into developmental snapshots, reconstruct one gene regulatory network per snapshot, and apply temporal graph neural networks to forecast biological states. We evaluate this framework on two publicly available mouse developmental datasets, erythroid gastrulation and pancreatic endocrinogenesis, considering three complementary tasks: gene-expression forecasting, link prediction, and out-degree centrality prediction. Our results show that graph-based models outperform well-known foundation models such as scGPT and scFoundation, suggesting that explicitly modeling evolving regulatory structure provides useful information beyond static pretrained representations. For link prediction and centrality forecasting, temporal graph learning captures non-trivial regulatory dynamics and enables the identification of temporally important gene hubs. Overall, our findings support temporal graph learning as a promising direction for modeling dynamic biological systems and as a complementary paradigm to current foundation model approaches in single-cell biology.
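Two of the steps named in the abstract, discretizing cells into pseudotime snapshots and computing out-degree centrality on a directed regulatory graph, can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' pipeline: the abstract does not specify the binning scheme, so equal-size bins over the pseudotime ordering are assumed. The function names, gene labels, and edges are hypothetical, and pseudotime inference, network reconstruction, and the temporal graph neural network itself are out of scope.

```python
from typing import Dict, List, Set, Tuple


def discretize_by_pseudotime(pseudotime: List[float], n_snapshots: int) -> List[List[int]]:
    """Split cell indices into n_snapshots near-equal bins ordered by pseudotime.

    Equal-size binning is an assumption; the abstract does not state the scheme.
    """
    if n_snapshots < 1:
        raise ValueError("n_snapshots must be at least 1")
    # Cell indices sorted from earliest to latest pseudotime.
    order = sorted(range(len(pseudotime)), key=lambda i: pseudotime[i])
    bins: List[List[int]] = [[] for _ in range(n_snapshots)]
    for rank, cell in enumerate(order):
        bins[rank * n_snapshots // len(order)].append(cell)
    return bins


def out_degree_centrality(genes: List[str], edges: Set[Tuple[str, str]]) -> Dict[str, float]:
    """Out-degree of each gene divided by (n - 1), for regulator -> target edges.

    Every regulator in `edges` is expected to appear in `genes`.
    """
    counts = {gene: 0 for gene in genes}
    for regulator, target in edges:
        if regulator != target:  # ignore self-loops
            counts[regulator] += 1
    denom = max(len(genes) - 1, 1)
    return {gene: count / denom for gene, count in counts.items()}


# Six cells with made-up pseudotime values, grouped into three snapshots.
snapshots = discretize_by_pseudotime([0.9, 0.1, 0.5, 0.3, 0.7, 0.2], n_snapshots=3)

# Hypothetical directed edges (regulator, target) for two consecutive snapshots.
genes = ["g1", "g2", "g3"]
snapshot_edges = [
    {("g1", "g2")},
    {("g1", "g2"), ("g1", "g3"), ("g2", "g3")},
]

# Gene with the highest out-degree centrality in each snapshot (a candidate hub).
hubs = []
for edges in snapshot_edges:
    centrality = out_degree_centrality(genes, edges)
    hubs.append(max(centrality, key=centrality.get))
print(snapshots, hubs)
```

Keeping the gene list fixed across snapshots reflects the "persistent gene identities" framing in the abstract: only the edges change over time, so centrality values for the same gene can be compared from one snapshot to the next.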