arXivDaily: daily arXiv digest, updated Monday through Friday
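2606.09028 2026-06-09 cs.CV cs.AI cs.RO New submission

ATM: Action-Consistency Transfer Matrix for Diagnosing and Improving Latent World Models

Jiaheng Chen

Affiliations: School of Software, Northeastern University

AI summary: Proposes the ATM matrix, which uses lightweight probes to compare action information in real and predicted latent transitions, diagnosing world-model quality without a simulator, and introduces AITS, which uses action identifiability as a training signal to improve downstream planning.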

Comments 13 pages, 3 figures, 6 tables

Abstract

Latent world models are increasingly used for control and goal-conditioned planning, yet assessing whether their learned representations are useful for planning usually requires slow, planner-coupled simulator evaluation with CEM or similar planners. Such evaluation is black-box and model-complexity-dependent: under the same protocol, different world models may require minutes to hours per checkpoint. In this work, we propose ATM, an Action-Consistency Transfer Matrix for diagnosing whether latent transitions preserve action semantics relevant to planning. ATM compares action information in real encoded transitions and model-predicted transitions through lightweight post-hoc probes, producing an interpretable matrix that reveals representation quality, transition-domain inconsistency, and failure modes without simulator rollout. It can also be collapsed into a simple screening score for within-task ranking across checkpoints, variants, and world models. When the true success gap is non-trivial, ATM achieves highly reliable pairwise ranking, while reducing minutes-to-hours CEM evaluation to seconds-level transition analysis, yielding more than 100x speedup in our setup. We further introduce AITS, showing that action-identifiability is not only diagnostic but also a useful training signal for improving downstream planning without changing the planner.

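2606.09027 2026-06-09 cs.CL cs.AI New submission

SafeRun: Enabling Determinism in LLM Planning for Running

Meilin Chen, Zepeng Zhai, Jiaxuan Zhao, Yuan Lu

Affiliations: University of California, Berkeley; Stanford University

AI summary: To address safety violations caused by the probabilistic nature of LLMs in running planning, proposes the SafeRun framework, whose decoupled architecture separates the LLM's soft interpretation from a deterministic solver's hard constraints, achieving a 100% safety score.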

Comments Workshop on Planning in the Era of LLMs (LM4Plan) at ICML 2026

Abstract

Large Language Models enable flexible natural-language planning but remain unreliable in determinism-critical domains due to their probabilistic nature. This limitation is especially problematic in running planning, where violating safety rules can lead to safety risks. We propose SafeRun, a framework for deterministic LLM-based planning via a decoupled architecture. SafeRun separates soft interpretation by an LLM from hard constraint enforcement by a deterministic solver, ensuring strict safety constraints while preserving natural-language flexibility. To validate SafeRun, we build a comprehensive benchmark for running planning under realistic physiological and safety constraints. Experiments across five LLMs show that SafeRun achieves 100% safety score (vs. 79.1% PE average and 97.6% CodeAct average) while maintaining competitive instruction-following scores. The SafeRun benchmark is publicly available on Hugging Face: https://huggingface.co/datasets/zzp-seeker/SafeRun-RunPlanning-Benchmark
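
The decoupling described in this abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the function name `enforce_weekly_cap`, the 10% week-over-week cap, and the idea that the LLM output is a list of weekly distances are hypothetical and are not taken from the paper or its benchmark. The point is only that a deterministic step applied after the LLM can guarantee a hard rule regardless of what the model proposed.

```python
def enforce_weekly_cap(proposed_km, max_increase=0.10):
    """Deterministically clamp an LLM-proposed weekly distance plan.

    Hypothetical hard rule (an assumption, not from the paper): no week may
    exceed the previous week's distance by more than `max_increase`.
    """
    safe = []
    for km in proposed_km:
        km = max(km, 0.0)  # distances cannot be negative
        if safe:
            # Hard constraint: cap the week-over-week increase.
            km = min(km, safe[-1] * (1 + max_increase))
        safe.append(km)
    return safe


# "Soft" proposal as it might come back from an LLM (invented numbers).
llm_plan = [20, 30, 31, 50]
safe_plan = enforce_weekly_cap(llm_plan)
```

A plan that already satisfies the rule passes through unchanged, so the solver only intervenes where the soft proposal would violate the constraint.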

2606.09019 2026-06-09 cs.SD cs.AI New submission

TLDR: Compressing Audio Tokens for Efficient Autoregressive Text-to-Speech

Yejin Lee, Junwon Moon, Hyoeun Kim, Hyunjin Choi, Heeseung Kim, Kyuhong Shim

Affiliations: Sungkyunkwan University; University of Seoul

AI summary: Proposes the TLDR framework, which shifts causal modeling from the token level to the patch level and, using a lightweight compressor and a frozen pretrained backbone adapted with LoRA, achieves a 1.8x inference speedup and up to 75% KV-cache reduction.

Abstract

Codec-based autoregressive (AR) speech language models have achieved strong text-to-speech (TTS) quality by modeling speech as sequences of discrete audio tokens with large pretrained backbones. However, this token-level formulation creates a structural efficiency bottleneck: speech-token sequences are much longer than text sequences, requiring the AR backbone to perform causal computation at every token position and maintain a KV cache that grows with the sequence length. We introduce TLDR, a patch-based autoregressive framework that accelerates codec-based AR-TTS by shifting the causal modeling from token-level speech sequences to patch-level sequences. TLDR groups consecutive codec tokens into compact latent patches using a lightweight compressor, models the resulting shorter patch sequence with a frozen pretrained AR-TTS backbone adapted by LoRA, and reconstructs fine-grained speech tokens within each patch using a speaker-conditioned extractor. With a patch size of 4, TLDR achieves a 1.8x inference speedup over the baseline AR-TTS model and reduces global KV-cache memory by up to 75%. Experimental results indicate that patch-level global causal modeling can be a practical way to reduce the inference cost of pretrained codec-based AR-TTS systems without replacing the existing modules.
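
The arithmetic behind the "up to 75%" figure can be sketched as follows. This is a back-of-the-envelope model, not the authors' accounting: it assumes the global KV cache grows linearly with the number of positions the backbone processes, and the helper name `patch_level_cost` is hypothetical. It ignores the compressor and extractor, which add their own cost; that may be part of why the reported speedup is 1.8x rather than 4x.

```python
import math


def patch_level_cost(num_tokens: int, patch_size: int) -> dict:
    """Estimate backbone steps and KV-cache savings under patching.

    Assumption: one backbone step and one KV-cache entry per patch,
    versus one per token in the token-level baseline.
    """
    num_patches = math.ceil(num_tokens / patch_size)
    kv_reduction = 1 - num_patches / num_tokens
    return {"backbone_steps": num_patches, "kv_cache_reduction": kv_reduction}


result = patch_level_cost(num_tokens=1000, patch_size=4)
```

With a patch size of 4, the backbone sees a quarter of the positions, which matches the 75% upper bound on global KV-cache reduction quoted in the abstract.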

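2606.09013 2026-06-09 cs.CL New submission

Beyond Averages: Evaluating LLMs on Human Survey Replication at the Distributional Level

Jeonghyeon Moon, Jiwon Kim, Yeheum Lah, Yoonju Han, Yuncheol Kang

Affiliations: Ewha Womans University

AI summary: Using a non-public Korean instant noodle purchase experiment, evaluates how well LLMs replicate human survey responses at the distributional level, finding that models that match means can still produce distributions further from humans, and that structured personas and multimodal inputs improve alignment while reasoning prompts degrade it.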

Abstract

LLMs are increasingly used to simulate human survey responses, but prior work has mainly evaluated replication using mean-level or aggregate agreement, offering limited insight into whether LLMs reproduce the variability of human behavior. We evaluate LLM-based survey replication at the distributional level using a non-public 2010 consumer choice experiment on Korean instant noodle purchases, a setting unlikely to overlap with model training data. We evaluate three response variables of differing statistical type: binary purchase incidence, categorical brand choice, and count purchase quantity. For each, we compare human and LLM responses at mean-level, pattern, and distributional alignment, and against reference baselines from the human data alone. LLMs reproduce condition-level patterns reasonably well but fail to capture distributional structure: for purchase quantity, no model beats a condition-insensitive baseline that simply matches the pooled human distribution. Because models that match human means well can still produce distributions further from humans than this baseline, mean-based evaluation alone can be actively misleading. Replication also varies with input configuration, with structured personas and multimodal inputs improving alignment while explicit reasoning prompting degrades it monotonically.
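
The gap between mean-level and distributional agreement can be made concrete with a toy example. The two distributions below are invented for illustration and are not the study's data, and total variation distance is used here only as one convenient distributional metric; the abstract does not say which metrics the authors use.

```python
def mean(pmf):
    """Expected value of a discrete distribution given as {value: probability}."""
    return sum(value * prob for value, prob in pmf.items())


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)


# Hypothetical purchase-quantity distributions (illustrative values only).
human = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}  # spread over several quantities
model = {1: 1.0}                          # always answers "1"

mean_gap = abs(mean(human) - mean(model))
tv_distance = total_variation(human, model)
```

Both distributions have a mean of 1, so a mean-level comparison reports perfect agreement, while the total variation distance of 0.7 shows that the simulated respondents behave nothing like the human sample.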

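2606.09012 2026-06-09 cs.LG cs.AI math.OC stat.ML New submission

Understanding Quantization-Aware Training: Gradients at Quantized Weights Bias to the Low-Loss Basin

Hanyang Li, Jianhao Ma, Ying Cui

Affiliations: University of California, Berkeley; University of Pennsylvania

AI summary: Proposes a unified geometric framework explaining post-training quantization failure and the quantization-aware training recovery mechanism, showing that quantization-aware training returns quantized points to the low-loss basin because its gradients sense the valley wall.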

Comments 31 pages, 10 figures

Abstract

Post-training quantization (PTQ) converts a trained full-precision model into low-bit weights without task-level retraining, while quantization-aware training (QAT) incorporates quantization into the training loop. Although PTQ is efficient and often accurate at moderate bitwidths, it can fail sharply at aggressive bitwidths; QAT is more expensive but can often recover the lost accuracy. We propose a unified geometric framework that explains both PTQ failure and QAT recovery. We model full-precision training as following a low-loss "river" inside a wider "valley": a normal neighborhood of the river forms a nearly flat "basin", while leaving this basin incurs a sharp loss increase. When the quantization grid is comparable to the basin width, local PTQ objectives, including rounding and Hessian-based second-order reconstruction, can select a high-loss deployed quantized point outside the basin even when nearby low-loss quantized points exist. In this regime, straight-through-estimator-based QAT has a useful bias: it evaluates gradients at the deployed quantized weights while updating latent full-precision weights, causing the gradient to sense the valley wall and acquire an inward component that steers subsequent quantized iterates back into the basin. We formalize this mechanism through a local landscape model, construct a geometric PTQ failure mode, and prove finite-time QAT recovery under local quantizer-compatibility assumptions. Experiments across vision and language models under multiple neural-network quantization schemes corroborate the predicted basin-crossing failure of PTQ and the corresponding recovery mechanism of QAT.
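
A one-dimensional toy can illustrate the mechanism the abstract describes. This is a sketch under invented assumptions, not the paper's landscape model: the basin interval [0.45, 0.78], the grid step 0.5, the quadratic walls, and all function names are hypothetical. The latent weight 0.77 sits inside the basin, yet rounding sends it to the grid point 1.0 outside the basin even though the grid point 0.5 is inside. The straight-through update evaluates the gradient at the quantized point and applies it to the latent weight.

```python
def loss_grad(q, lo=0.45, hi=0.78, k=1.0):
    """Gradient of a toy loss: zero inside the basin [lo, hi], quadratic walls outside."""
    if q > hi:
        return k * (q - hi)
    if q < lo:
        return k * (q - lo)
    return 0.0


def quantize(w, step=0.5):
    """Round-to-nearest onto a uniform grid with spacing `step`."""
    return round(w / step) * step


def ste_qat(w, lr=0.02, iters=20):
    """Straight-through-estimator updates on a single latent weight."""
    history = []
    for _ in range(iters):
        q = quantize(w)           # deployed (quantized) weight
        history.append(q)
        w -= lr * loss_grad(q)    # gradient taken at q, applied to latent w
    return w, history


w_final, history = ste_qat(0.77)
```

In this toy, the wall gradient felt at the quantized point pushes the latent weight inward until rounding lands on the in-basin grid point, after which the gradient vanishes and the iterate stays put.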

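2606.09009 2026-06-09 cs.CV New submission

Scaling by Diversified Experience for Vision-Language-Action Models

Leiyu Wang, Zhaofengnian Wang, Xueqi Li, Luoyi Fan, Cewu Lu, Nanyang Ye

Affiliations: University of Science and Technology of China

AI summary: Proposes the SyVLA model, which uses an Intention Decoupling algorithm and a similar-sample guided reinforcement learning pipeline to address the coupling of reasoning with control and the instability of policy optimization in vision-language-action models, achieving higher success rates and stronger generalization on real-robot tasks.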

Comments ICML 2026, SyVLA

Abstract

Vision-Language-Action models face significant challenges in real-world deployment due to the entanglement of high-level reasoning with low-level control, and the instability of policy optimization. In this paper, we introduce SyVLA, a robust VLA model trained with diversified experiences. We propose an Intention Decoupling algorithm to isolate control-relevant features from reasoning contexts and a similar-sample guided RL pipeline to stabilize policy updates and mitigate distribution shift. Extensive experiments on real-world robotic tasks and multi-modal benchmarks demonstrate that SyVLA achieves superior task success rates and stronger out-of-distribution generalization compared to existing methods, while effectively preserving core vision-language capabilities. Code and datasets are released on the project page: https://sy-vla.github.io/

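2606.08998 2026-06-09 cs.AI cs.CY econ.GN q-fin.EC New submission

The Token Not Taken: Sampling, State, and the Variability of AI Agent Outputs

Muhammad Zia Hydari, Raja Iqbal

Affiliations: University of Pittsburgh; Ejento.ai

AI summary: Analyzes the sources of output variability in AI agent systems, distinguishing the intrinsic randomness of token sampling from extrinsic factors such as environment and data, and discusses when variability is reproducible under matched conditions and why deterministic execution need not yield identical behavior in deployment.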

Abstract

Agentic AI systems can behave differently across runs: the same request may produce a different plan, a different tool call, a different code edit, or a different final answer. Such variability arises from several layers that are often conflated. A foundation model is a large pretrained model, usually adaptable to many downstream tasks, that maps an input context to predictions over outputs. In many current agents, that model is embedded in an orchestration loop that plans, calls tools, observes results, and updates state. One explicit intrinsic source of variability in such systems is token generation: the model computes scores over possible next tokens, the scores are converted into probabilities, and a decoder may sample tokens using a pseudo-random number generator. A small sampled token difference can then propagate upward into a different tool call, code path, search query, or agent state. Other sources of variability are extrinsic to token sampling, including changing environments, live data, serving infrastructure, batch effects, and numerical details. By separating these layers, the manuscript clarifies what it means to call agentic AI systems stochastic, when such variability can be reproduced under matched conditions, and why deterministic execution need not imply identical behavior in deployed settings.
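
The token-generation step described in this abstract (scores, then probabilities, then a pseudo-random draw) can be sketched with the standard library. This is a generic illustration, not code from the manuscript; the function names and the example scores are hypothetical, and real serving stacks add the extrinsic effects the abstract lists (batching, numerics, live data) that this sketch does not model.

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Convert raw next-token scores into probabilities."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_tokens(scores, n, seed, temperature=1.0):
    """Draw n token ids from the same distribution using a seeded PRNG."""
    rng = random.Random(seed)
    probs = softmax(scores, temperature)
    return rng.choices(range(len(scores)), weights=probs, k=n)


scores = [2.0, 1.5, 0.3, -1.0]  # invented scores over a 4-token vocabulary
run_a = sample_tokens(scores, n=20, seed=0)
run_b = sample_tokens(scores, n=20, seed=0)
```

With the same scores and the same seed, the two runs produce the same token sequence, which is the "reproducible under matched conditions" case; changing the seed, or anything upstream that perturbs the scores, is enough to make a draw diverge.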

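2606.08994 2026-06-09 cs.CL New submission

Language-Aware Token Boosting: LLM Language Confusion Reduction Without Tuning

Trapoom Ukarapol, Pakhapoom Sarapat, Nut Chukamphaeng

Affiliations: SCB DataX; Tsinghua University; SCBX

AI summary: Proposes tuning-free methods for reducing language confusion, Language-Aware Token Boosting (LATB) and its adaptive version (Adaptive-LATB), which apply perturbations to target-language tokens, improving multilingual alignment while maintaining summarization quality.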

Comments ACL 2026 Main Conference

Abstract

Large language models (LLMs) sometimes exhibit language confusion when generating non-English text. Existing approaches typically rely on fine-tuning to mitigate this issue. In contrast, we propose a tuning-free paradigm for reducing language confusion. Within this paradigm, we introduce two methods: Language-Aware Token Boosting (LATB), which applies targeted perturbations to tokens associated with the desired language, and Adaptive Language-Aware Token Boosting (Adaptive-LATB), which dynamically adjusts these perturbations based on the model's confidence in the intended language. Experiments demonstrate that our methods effectively improve multilingual alignment by reducing language confusion, while maintaining summarization quality without requiring any additional fine-tuning. Our code is publicly available at https://github.com/scbdatax/genai-datax-language-aware-token-boosting
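
One plausible reading of "targeted perturbations" is an additive bias on the logits of target-language tokens at decoding time. The sketch below follows that reading and should be treated as an assumption rather than the authors' method: the additive form, the confidence measure (probability mass on target-language tokens), the scaling rule, and all names are hypothetical. The linked repository is the place to check the actual formulation.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def boost_logits(logits, target_ids, boost):
    """Add a fixed bias to the logits of target-language tokens."""
    return [l + boost if i in target_ids else l for i, l in enumerate(logits)]


def adaptive_boost_logits(logits, target_ids, max_boost):
    """Scale the bias down when the model already favors the target language."""
    probs = softmax(logits)
    confidence = sum(probs[i] for i in target_ids)  # mass on target-language tokens
    return boost_logits(logits, target_ids, max_boost * (1.0 - confidence))


logits = [2.0, 1.0, 0.5, 0.0]  # invented logits over a 4-token vocabulary
target_ids = {2, 3}            # hypothetical ids of target-language tokens
boosted = boost_logits(logits, target_ids, boost=1.5)
adaptive = adaptive_boost_logits(logits, target_ids, max_boost=1.5)
```

Because the bias is applied at decoding time, no model weights change, which is consistent with the tuning-free framing in the abstract.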

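2606.08993 2026-06-09 cs.LG cs.SY eess.SY math.OC New submission

LEAF: A Learning-Enabled ADMM Framework for Accelerated Convex Optimization

Binh Nguyen, Trinh Tran, Truong X. Nghiem

Affiliations: University of Central Florida

AI summary: Proposes the LEAF framework, which accelerates convex optimization by learning the Moreau envelope with an input convex neural network, reducing model complexity while preserving convergence; experiments show up to an order-of-magnitude speedup over state-of-the-art solvers.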

Abstract

We propose LEAF, a learning-enabled ADMM framework for accelerated convex optimization. The key idea is to approximate the Moreau envelope of the objective function using an Input Convex Neural Network (ICNN), resulting in a learned model that preserves convexity and smoothness. This leads to the proposed Moreau Envelope Learning ADMM (MEL-ADMM) and its splitting variant sMEL-ADMM. Unlike existing approaches that learn high-dimensional operators directly, LEAF learns a scalar-valued Moreau envelope, significantly reducing model complexity and improving data efficiency. The framework accommodates a broad class of convex problems with smooth and non-smooth objectives. By embedding convexity explicitly through the ICNN architecture, the proposed approach maintains high approximation accuracy while preserving key structural properties of the optimization problem. Both MEL-ADMM and sMEL-ADMM are developed with theoretical guarantees of convergence and feasibility under the learned model. Rigorous analysis shows that the proposed methods achieve convergence rates comparable to classical ADMM while reducing per-iteration computational cost. Numerical experiments demonstrate up to an order-of-magnitude speedup over state-of-the-art solvers while maintaining low optimality gaps.
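
For readers unfamiliar with the object being learned: the Moreau envelope of f with parameter λ > 0 is M(x) = min over y of f(y) + (x − y)² / (2λ), a scalar-valued function that is smooth even when f is not. The sketch below works through the textbook case f(y) = |y|, whose envelope is the Huber function. It illustrates only the standard definition; it does not involve an ICNN or the paper's algorithms, and the function names are hypothetical.

```python
def prox_abs(x, lam):
    """Proximal operator of |.| (soft-thresholding): the minimizer y in the envelope."""
    shrunk = max(abs(x) - lam, 0.0)
    return shrunk if x >= 0 else -shrunk


def moreau_envelope_abs(x, lam):
    """Moreau envelope of |.| evaluated through its proximal point."""
    p = prox_abs(x, lam)
    return abs(p) + (x - p) ** 2 / (2 * lam)


def huber(x, lam):
    """Closed form the envelope should match: quadratic near 0, linear in the tails."""
    return x * x / (2 * lam) if abs(x) <= lam else abs(x) - lam / 2


points = [-3.0, -1.0, -0.5, 0.0, 0.2, 1.0, 2.5]
max_gap = max(
    abs(moreau_envelope_abs(x, lam) - huber(x, lam))
    for x in points
    for lam in (0.5, 1.0)
)
```

The envelope replaces the kink of |x| at the origin with a quadratic while keeping the function convex, which is the convexity-and-smoothness property the abstract refers to.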

2606.08992 2026-06-09 cs.RO cs.AI cs.CV New submission

SpaceVLN: A Zero-Shot Vision-and-Language Navigation Agent with Online Spatial Cognitive Memory and Reasoning

Yucheng Deng, Pingrui Lai, Xinhai Li, Chenjia Bai, Xiaoheng Deng, Chengnuo Sun, Xuelong Li, Hua Yang

Affiliations: Shanghai Jiao Tong University; China Telecom; Central South University; Jiangsu University

AI summary: Proposes SpaceVLN, which uses spatial cognitive memory and task-guided spatial reasoning to perform vision-and-language navigation in continuous environments in a zero-shot setting, achieving state-of-the-art performance on multiple benchmarks.

Comments 23 pages, 9 figures, 7 tables

Abstract

Vision-and-Language Navigation in continuous environments requires agents to understand the spatial structure of previously unseen environments in order to follow language instructions. Although foundation models have opened a promising path toward zero-shot navigation without task-specific policy training, many navigators still rely on local visual cues and linear history-based reasoning, overlooking the spatial nature of navigation across explored regions, traversed paths, landmarks, and their spatial relations. In this paper, we propose SpaceVLN, a navigation agent built around Spatial Cognitive Memory and Task-Guided Spatial Reasoning. Specifically, SpaceVLN introduces an efficient stagewise closed-loop framework where planning and execution are organized around verifiable space--landmark stages. During navigation, the agent progressively abstracts explored regions into Spatial Waypoints and dynamically maintains subtask-grounded landmark evidence, forming a hierarchical Spatial Cognitive Memory for progress localization and spatial-relation understanding. Built on this memory, Spatial-CoT integrates task-progress reasoning with spatial perception, analysis, and prediction, enabling Task-Guided Spatial Reasoning for embodied navigation. The unified stage interface enables SpaceVLN to address both Vision-and-Language Navigation and Object-Goal Navigation under a unified zero-shot setting, without task-specific policy training. Across R2R-CE, RxR-CE, GN-Bench, and HM3D-OVON, SpaceVLN achieves state-of-the-art zero-shot performance, and real-robot deployment further validates its applicability. These results highlight Spatial Cognitive Memory and Task-Guided Spatial Reasoning as a practical foundation for stronger embodied navigation agents.

2606.08988 2026-06-09 cs.CL cs.LG New submission

Structure-Aware Modeling of Multiple-Choice Questions Improves Automatic Difficulty Estimation

Gabriel Ortega, Abelino Jiménez, Séverin Lions, Pablo Dartnell

Affiliations: Centro de Investigación Avanzada en Educación (CIAE), Instituto de Estudios Avanzados en Educación (IE), Universidad de Chile; Departamento de Evaluación, Medición y Registro Educacional (DEMRE), Universidad de Chile; Centro de Modelamiento Matemático (CMM), Universidad de Chile; Departamento de Ingeniería Matemática (DIM), Universidad de Chile

AI summary: Proposes structure-aware models that encode the distractors of multiple-choice questions as separate inputs and improve difficulty prediction through order-aware or order-invariant aggregation, reaching R² = 0.83 and 0.71 on Natural Sciences and Social Sciences datasets.

Comments 30 pages, 1 table, 2 figures

Abstract

Automatic Question Difficulty Estimation (AQDE) holds growing promise for educational assessment because it has the potential to yield difficulty estimates that are competitive with expert judgment, while helping reduce the time and financial burden associated with pilot administrations and scaling to digital testing contexts. Prior AQDE studies report mixed evidence on whether adding distractors as additional text to the question stem and the correct key consistently improves difficulty prediction. We hypothesize that the effectiveness of distractor information depends on its structural representation, and that explicitly modeling distractors as separate components improves difficulty estimation over baselines that omit this information. To address this, we designed controlled architectures that model MCQ components as distinct inputs to isolate the contribution of distractor content and order. Specifically, we represented distractors by encoding each distractor as its own text input and aggregating their representations either with order-aware concatenation (with positional tags) or with an order-invariant summation. We evaluated these architectures using two Chilean datasets (Natural and Social Sciences, 2016-2020; 4,114 multiple-choice questions). Compared to a simpler model that only used the question stem and the key, our best distractor-aware architecture achieved higher predictive performance, reaching R^2 = 0.83 for Natural Sciences and R^2 = 0.71 for Social Sciences items. An order-invariant variant achieved nearly the same accuracy with approximately half as many parameters, offering a favorable accuracy-efficiency trade-off. These results show that structural information (especially distractor content) drives gains in predictive accuracy, supporting the development of efficient, structure-aware models that are computationally viable for large-scale educational applications.
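
The difference between the two aggregation schemes can be shown on plain vectors. This is a schematic sketch, not the authors' architecture: the three short vectors stand in for per-distractor encoder outputs and are invented, the function names are hypothetical, and the positional tags the paper attaches in the order-aware variant are not modeled.

```python
def aggregate_sum(vectors):
    """Order-invariant aggregation: elementwise sum of distractor vectors."""
    return [sum(components) for components in zip(*vectors)]


def aggregate_concat(vectors):
    """Order-aware aggregation: concatenate distractor vectors in their given order."""
    return [x for vector in vectors for x in vector]


# Stand-ins for three encoded distractors (illustrative values only).
distractors = [[1.0, 0.0], [0.5, 2.0], [0.25, 4.0]]
shuffled = [distractors[2], distractors[0], distractors[1]]

summed = aggregate_sum(distractors)
concatenated = aggregate_concat(distractors)
```

Summation keeps the aggregated representation at the size of a single distractor vector regardless of how many distractors there are, while concatenation grows with their number. A smaller input to the downstream layers is one possible source of the parameter savings the abstract reports for the order-invariant variant, though the abstract does not spell out where the savings come from.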

2606.08985 2026-06-09 cs.LG New submission

Beyond Neural Collapse: Task-Intrinsic Geometry Governs Neural Representations in Modular Arithmetic

Hu Tan, Kuo Gai, Shihua Zhang

Affiliations: Academy of Mathematics and Systems Science, Chinese Academy of Sciences; School of Mathematical Sciences, University of Chinese Academy of Sciences; Shanghai Institute for Mathematics and Interdisciplinary Sciences (SIMIS); Key Laboratory of Systems Health Science of Zhejiang Province, School of Life Science, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences

AI summary: Finds that network representations in the modular addition task exhibit a two-dimensional cyclic geometry rather than the simplex equiangular tight frame of neural collapse, and explains this through layerwise non-uniform training, phase-alignment dynamics after subspace locking, and a complexity-advantage analysis.

Abstract

While neural collapse (NC) predicts that a $K$-class-balanced classifier should organize terminal representations as a $(K-1)$-dimensional simplex equiangular tight frame (ETF), modular addition consistently enters a different regime: networks compress to a two-dimensional cyclic geometry in which both classifier weights and token embeddings lie on circles. We refine the explanation of this phenomenon in three directions. First, we formalize a layerwise non-uniform training mechanism: downstream classifier weights are driven by dense cross-entropy gradients into a rank-2 equiangular configuration before upstream embeddings fully reorganize, and once this classifier plane forms, backpropagated feature gradients constrain embedding motion to the same plane while weight decay suppresses orthogonal components. Second, after this subspace locking, the induced in-plane dynamics admit an entropy-regularized transport interpretation on $S^1$; combined with modular-addition labels, this reduces embedding formation to phase alignment, whose minimizers are single-frequency characters of $\mathbb{Z}/P\mathbb{Z}$ and hence equal-angle points on a circle. Third, we quantify why this solution prevails over NC: a simplex ETF gains only an $O(1)$ advantage in cross-entropy, whereas the cyclic rank-2 solution enjoys a $\Theta(K)$ advantage under Schatten or weight-decay surrogates, yielding a critical threshold $\lambda_{\mathrm{crit}} = \Theta(1/K)$. Our results explain both why classifier weights move first and why embeddings subsequently align with them, showing that grokking on modular arithmetic is governed not by maximal separation alone but by a task-structured trade-off between separation, symmetry, and complexity.
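
The "single-frequency characters of Z/PZ" mentioned in the abstract are the maps a ↦ exp(2πi·k·a/P), which place the P residues at equal angles on the unit circle. The sketch below shows why such an embedding suits modular addition: multiplying two embedded points adds their phases, so the product is the embedding of the sum mod P. It illustrates the underlying group-theoretic fact only, not the paper's training dynamics; the function names are hypothetical and k is assumed coprime to P so that the points are distinct.

```python
import cmath
import math


def embed(a, P, k=1):
    """Single-frequency character of Z/PZ: residue a as a point on the unit circle."""
    return cmath.exp(2j * math.pi * k * a / P)


def decode(z, P, k=1):
    """Return the residue whose embedding is best aligned with z."""
    return max(range(P), key=lambda c: (z * embed(c, P, k).conjugate()).real)


def add_mod(a, b, P, k=1):
    """Modular addition carried out as phase addition on the circle."""
    return decode(embed(a, P, k) * embed(b, P, k), P, k)


P = 7
all_correct = all(
    add_mod(a, b, P, k) == (a + b) % P
    for k in (1, 3)
    for a in range(P)
    for b in range(P)
)
```

Only two real dimensions (the real and imaginary parts) are needed to represent all P classes this way, in contrast to the (K−1)-dimensional simplex ETF that neural collapse would predict.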

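2606.08980 2026-06-09 cs.CV New submission

EPS3D: End-to-End Feed-Forward 3D Panoptic Segmentation

Runsong Zhu, Jiaxin Guo, Xiaoyang Guo, Zhengzhe Liu, Ka-Hei Hui, Wei Yin, Kai Chen, Wei Chen, Weiqiang Ren, Yunhui Liu, Pheng-Ann Heng, Chi-Wing Fu

Affiliations: Nanyang Technological University

AI summary: Proposes EPS3D, an end-to-end feed-forward framework that uses distillation-based training to predict 3D-aware features from multi-view images, combined with a mutual enhancement module for semantic-instance consistency, improving semantic mIoU by 13% on Replica at only 1 second per scene.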

Comments ICML 2026. The code is publicly available at https://github.com/Runsong123/EPS3D

Abstract

This paper introduces EPS3D, a new end-to-end feed-forward framework for open-vocabulary 3D panoptic segmentation. Unlike existing methods relying on additional preprocessing, we design an end-to-end architecture, with a distillation-based training strategy on diverse 3D scenes to predict 3D-aware semantic and instance features from multi-view images, improving 3D consistency and avoiding error accumulation. We further propose a mutual enhancement module to enforce inherent semantic-instance consistency. By aligning semantics within instances (Ins2Sem) and refining instance features with semantic guidance (Sem2Ins), we achieve more coherent 3D scene understanding. Ultimately, EPS3D outperforms SOTA baselines on two benchmarks (e.g., +13% mIoU for semantics on Replica) with high efficiency (e.g., 1s per scene), supporting tasks like robotic manipulation and 3D scene editing.

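2606.08978 2026-06-09 cs.LG New submission

Heterophily-Aware Adaptive Knowledge Distillation for Hypergraph Neural Networks

Joohee Cho, David Yoon Suk Kang, Yunyong Ko

Affiliations: Chung-Ang University; Chungbuk National University

AI summary: To address the performance drop of hypergraph neural networks on heterophilic nodes, proposes HADES, a heterophily-aware adaptive distillation method that quantifies node heterophily to modulate teacher knowledge transfer, yielding student models that surpass their teachers with up to 12.3x speedup.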

Comments 5 pages, 2 figures, 4 tables

Abstract

Hypergraph knowledge distillation aims to retain the predictive performance of a hypergraph neural network (HNN) teacher while reducing inference costs through a lightweight student model. In this work, we observe that HNNs exhibit substantially lower prediction performance on heterophilic nodes connected through semantically diverse hyperedges, indicating that the reliability of teacher knowledge varies across nodes. Motivated by this observation, we propose HADES, a heterophily-aware adaptive distillation method for hypergraph neural networks. HADES quantifies node heterophily and leverages it as an estimate of teacher reliability to modulate the transfer of teacher knowledge during distillation. Experimental results on real-world hypergraphs demonstrate that HADES consistently improves student performance across different HNN teachers and distillation objectives. In many cases, the resulting student models surpass the predictive performance of their teachers while achieving up to 12.3 times faster inference.

2606.08977 2026-06-09 cs.LG cs.DS New submission

Online Learning with Recency: Algorithms for Sliding-window Streaming Multi-armed Bandits

Vladimir Braverman, Chen Wang, Liudeng Wang, Samson Zhou

Affiliations: Johns Hopkins University; Rensselaer Polytechnic Institute; Texas A&M University

AI summary: Motivated by the recency effect in online learning, studies single-pass sliding-window streaming multi-armed bandits, proposing pure exploration and regret minimization algorithms and giving memory-regret trade-offs.

Comments ICML 2026

Abstract

Motivated by the recency effect in online learning, we study algorithms for single-pass *sliding-window streaming multi-armed bandits (MABs)* in this paper. In this setting, we are given $n$ arms with unknown sub-Gaussian reward distributions and a parameter $W$. The arms arrive in a single-pass stream, and only the most recent $W$ arms are considered valid. The algorithm is required to perform pure exploration and regret minimization with limited memory, defined as the number of stored arms. The model is a natural extension of the streaming multi-armed bandits model (without the sliding window) that has been extensively studied in recent years. We provide a comprehensive analysis of both the pure exploration and regret minimization problems with the model. For pure exploration, we prove that finding the best arm is hard with sublinear memory while finding an approximate best arm admits an efficient algorithm. For regret minimization, we explore a new notion of regret and give sharp memory-regret trade-offs for any single-pass algorithm. We complement our theoretical results with experiments, demonstrating the trade-offs between sample, regret, and memory.

2606.08976 2026-06-09 cs.AI 新提交

RTL-BenchLS: A Large-Scale Benchmark for RTL Reasoning and Generation with Large Language Models

RTL-BenchLS:面向大语言模型的RTL推理与生成的大规模基准

Jing Wang, Shang Liu, Wenji Fang, Yuchao Wu, Yugao Zhu, Zhiyao Xie

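Affiliations * Hong Kong University of Science and Technology

AI summary Proposes RTL-BenchLS, a large-scale benchmark with over 10,000 formally verified Verilog designs, and introduces three new reasoning tasks (two of them self-supervised) to address the small scale and narrow task scope of existing benchmarks; evaluation shows that even the best current models score low.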

详情
AI中文摘要

基于LLM的RTL生成与推理是硬件设计自动化的一个有前景的方向。高质量的基准是跟踪这一进展的关键基础设施。然而,现有的RTL基准在规模和任务范围上存在固有局限性。它们涵盖的设计通常较小且简单,任务几乎完全集中在规格到RTL的生成上。前沿模型在现有基准上的性能已经饱和。扩大这些基准的规模从根本上很困难,因为基准测试需要对齐的标签,例如规格和测试平台。对于实际设计,这种对齐的高质量数据很少可用。我们引入了RTL-BenchLS,这是一个大规模基准,解决了上述两个局限性。它包含超过10,000个经过形式验证的Verilog设计,涵盖比现有基准更大且更复杂的设计。除了规格到RTL的生成,我们提出了三项联合评估推理与生成的新任务:往返推理、掩码内容推理和仓库问题推理。前两项是自监督的,直接解决了扩展瓶颈。所有任务都通过形式等价性检查进行验证,无需任何手动测试平台。我们在RTL-BenchLS上评估了八个LLM。即使是最好的模型,在自然语言往返推理上仅达到23%,在掩码内容推理上达到28%,在仓库问题修复上达到12%。RTL-BenchLS比现有基准更具挑战性。它为未来的改进留下了充足的空间,并为开发基于LLM的硬件设计方法提供了指导。

英文摘要

LLM-based RTL generation and reasoning is a promising direction for hardware design automation. High-quality benchmarks are critical infrastructure for tracking progress in this direction. However, existing RTL benchmarks face inherent limitations in both scale and task scope. The designs they cover are typically small and simple, and the tasks focus almost entirely on specification-to-RTL generation. Frontier models' performance already saturates on the existing benchmarks. Scaling these benchmarks up is fundamentally difficult because aligned labels are required for benchmarking, such as specifications and testbenches. Such aligned high-quality data are rarely available for real-world designs. We introduce RTL-BenchLS, a large-scale benchmark addressing both limitations above. It contains over 10,000 formally verified Verilog designs, covering substantially larger and more complex designs than existing benchmarks. Beyond specification-to-RTL generation, we propose three novel tasks that jointly evaluate reasoning and generation: round-trip reasoning, masked-content reasoning, and repository-issue reasoning. The first two are self-supervised, which directly resolves the scaling bottleneck. All tasks are verified through formal equivalence checking without any manual testbenches. We evaluate eight LLMs on RTL-BenchLS. Even the best model reaches only 23% on natural-language round-trip reasoning, 28% on masked-content reasoning, and 12% on repository-issue fixing. RTL-BenchLS is substantially more challenging than existing benchmarks. It leaves ample room for future improvement and offers guidance for developing LLM-based methods for hardware design.

2606.08974 2026-06-09 cs.AI 新提交

Diverse Thinking Schemata Elicit Better Reasoning in Large Language Models

多样思维图式激发大型语言模型更优推理

Xinyue Liang, Yizhe Yang, Yu Bai, Bin Xu, Jiawei Li, Yang Gao

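Affiliations * School of Computer Science and Technology, Beijing Institute of Technology

AI summary Proposes Diverse Schemata Policy Optimization (DiScO), which improves the performance and error recovery of large language models on mathematical reasoning tasks by increasing the diversity of reasoning-step transitions and answer candidates.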

详情
AI中文摘要

大型推理模型(LRMs)因其通过生成扩展推理链解决复杂数学问题的能力而受到越来越多的关注。在这项工作中,我们聚焦于推理过程中两个关键但尚未充分探索的方面:推理转换(捕捉推理步骤之间的不同转换)和答案候选(反映模型产生的解路径的多样性)。我们将这两个方面统称为思维图式。我们观察到思维图式的多样性与模型性能之间存在相关性,这激励我们通过增强多样性来进一步提升推理潜力。为此,我们提出了多样图式策略优化(DiScO),该框架首先赋予模型图式感知能力,然后通过强化学习鼓励多样性,并在推理时进一步促进多样化推理。在多个数学推理基准上的实验表明,DiScO始终优于标准的群体相对策略优化。除了准确性之外,人工标注分析显示,DiScO显著提高了模型从错误初始尝试中恢复的能力。总体而言,我们的工作表明思维图式多样性发挥的重要作用,并指出沿着多样性维度进行扩展是一个有前景的研究方向。

英文摘要

Large reasoning models (LRMs) have attracted increasing attention for their ability to solve complex mathematical problems by generating extended reasoning chains. In this work, we focus on two critical yet underexplored aspects of the reasoning process: reasoning transitions, which capture the distinct transitions between reasoning steps, and answer candidates, which reflect the variety of solution paths produced by the model. We collectively define these two aspects as thinking schemata. We observe a correlation between the diversity of thinking schemata and model performance, which motivates us to enhance diversity as a means to further improve reasoning potential. To this end, we propose Diverse Schemata Policy Optimization (DiScO), a framework that first endows the model with schemata awareness, then encourages diversity through reinforcement learning, and further promotes diverse reasoning at inference time. Experiments on multiple mathematical reasoning benchmarks demonstrate that DiScO consistently outperforms standard group relative policy optimization. Beyond accuracy, human-annotated analyses show that DiScO substantially improves the model's ability to recover from erroneous initial attempts. Overall, our work highlights the important role of thinking-schemata diversity and points to scaling along the diversity dimension as a promising research direction.

2606.08970 2026-06-09 cs.AI 新提交

An Effective Router for Vision-Language Model Selection

一种有效的视觉-语言模型选择路由器

Can Wang, Shengwei Wang, Bolin Zhang, Zhiying Tu, Dianhui Chu

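Affiliations * Harbin Institute of Technology; Shandong Key Laboratory of Digital Service Computing Technology and Systems

AI summary Addresses the lack of data, ineffective feature representation, and rigid model space in vision-language model (VLM) selection by proposing the ARMS router, which enhances input signals and uses extension training strategies; it performs well on in-distribution and out-of-distribution test sets and, at only 800M parameters, outperforms GPT-4o.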

详情
AI中文摘要

具有不同性能和资源需求的视觉-语言模型(VLM)被广泛部署,使得用户难以从众多VLM候选中选择最合适的。现有工作揭示了语言模型中的性能悖论现象,并专注于路由方法来解决它。然而,开发用于VLM选择的路由器仍然是一个关键且具有挑战性的问题,主要面临:1)缺乏专门数据,2)特征表示无效,以及3)模型空间僵化和适应成本高。在本文中,我们构建了一个用于VLM选择的多模态数据集,包含七个主流VLM在32,626个独特图像-文本查询上的输出。然后,我们提出了ARMS,一个用于VLM选择的路由器。ARMS通过VLM配置文件增强输入信号,采用简单但有效的架构来改进查询和VLM能力的表示。为了提高ARMS对新VLM的适应性,我们提出了两种扩展训练策略:增量训练和独立训练。在分布内和分布外测试集上的实验结果表明了ARMS的有效性。特别是,使用我们的训练策略,ARMS(仅800M参数)可以适应更广泛的VLM空间,并击败规模大数百倍的商业模型如GPT-4o。我们的代码、模型和数据集可在匿名仓库中获取。

英文摘要

Vision-language models (VLMs) with varying performance and resource requirements are widely deployed, making it difficult for users to select the most appropriate one among numerous VLM candidates. Existing work reveals the performance paradox phenomenon in language models and focuses on routing methods to solve it. However, developing a router for VLM selection is still a critical yet challenging problem, which primarily faces: 1) lack of specialized data, 2) ineffective feature representation, and 3) rigid model space and costly adaptation. In this paper, we construct a multimodal dataset for VLM selection, containing the outputs of seven mainstream VLMs on 32,626 unique image-text queries. We then propose ARMS, a router for VLM selection. ARMS enhances input signals with VLM profiles and employs a simple but effective architecture to improve representations of queries and VLM capabilities. To improve ARMS' adaptation to new VLMs, we propose two extension training strategies: incremental training and independent training. Experimental results on both in-distribution and out-of-distribution test sets demonstrate the effectiveness of ARMS. In particular, using our training strategy, ARMS (only 800M in size) can adapt to a broader VLM space and defeat commercial models like GPT-4o that are hundreds of times larger in scale. Our code, models, and datasets are available in the anonymous repository.

2606.08969 2026-06-09 cs.CL cs.AI 新提交

CARE: A Conformal Safety Layer for Medical Summarization

CARE:面向医学摘要的保形安全层

Suhana Bedi, Bridget Lin, Anson Y. Zhou, Chloe O. Stanwyck, Jenelle A. Jindal, Sanmi Koyejo, David Stutz, Nigam H. Shah

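Affiliations * Stanford University; Google DeepMind

AI summary Proposes CARE, which uses conformal risk control to provide calibrated omission and hallucination flags for LLM-generated medical summaries, reducing review burden while preserving safety guarantees.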

Comments 29 pages, 5 figures

详情
AI中文摘要

大型语言模型(LLM)越来越多地用于医学摘要,但其输出可能遗漏重要的医学信息并引入无根据的陈述。现有的错误检测方法产生启发式或未校准的分数,无法对遗漏错误进行正式控制,也无法以原则性的方式在安全性与临床医生审查负担之间进行权衡。我们引入了风险评估的保形评估(CARE),这是一种事后、模型无关的安全层,使用保形风险控制为任何LLM生成的摘要叠加校准的遗漏和幻觉标记,无需重新训练。CARE通过两个控制器提供有限样本、分布无关的保证:一个幻觉控制器,限制包含任何未标记幻觉句子的文档的概率;一个遗漏控制器,限制未提交审查的重要遗漏的期望比例。与幻觉检测不同,遗漏同时取决于源句子是否重要以及摘要是否覆盖该句子。我们表明,仅校准一个维度可能违反目标风险界限,而边际分解虽然有效但过于保守。通过在整个$(τ,γ)$阈值空间上进行联合校准,CARE在保持正式保证的同时,比替代的校准基线最多减少5倍的标记句子。在五个医学摘要任务中,CARE在100次校准/测试重划分中,以95%的置信度满足$α=0.15$的目标风险界限,每个领域仅使用约100个标记文档。在一项初步的临床医生研究(75份文档审查)中,校准标记平均将遗漏检测提高了28.6个百分点。这些结果表明,句子级别的安全保证对于LLM辅助的医学摘要是可行的,并为平衡残余风险和审查工作量提供了一种可调节的机制。

英文摘要

Large language models (LLMs) are increasingly used for medical summarization, but their outputs can omit medically important information and introduce unsupported claims. Existing error-detection methods produce heuristic or uncalibrated scores, providing no formal control over missed errors and no principled way to trade off safety against clinician review burden. We introduce Conformal Assessment for Risk Evaluation (CARE), a post-hoc, model-agnostic safety layer that uses conformal risk control to overlay calibrated omission and hallucination flags onto summaries from any LLM without retraining. CARE provides finite-sample, distribution-free guarantees through two controllers: a hallucination controller that bounds the probability of a document containing any unflagged hallucinated sentence, and an omission controller that bounds the expected fraction of important omissions not surfaced for review. Unlike hallucination detection, omissions depend jointly on whether a source sentence is important and whether it is covered by the summary. We show that calibrating only one dimension can violate the target risk bound, while marginal decompositions remain valid but overly conservative. By jointly calibrating over the full $(τ,γ)$ threshold space, CARE preserves formal guarantees while surfacing up to 5$\times$ fewer sentences than alternative calibrated baselines. Across five medical summarization tasks, CARE satisfies the target risk bound at $α= 0.15$ with 95% confidence across 100 calibration/test resplits, using only ~100 labeled documents per domain. In a preliminary clinician study (75 document reviews), calibrated flags improved omission detection by 28.6 percentage points on average. These results show that sentence-level safety guarantees are feasible for LLM-assisted medical summarization and offer a tunable mechanism for balancing residual risk and review effort.
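The calibration idea behind conformal risk control can be sketched for a single threshold. This is a simplified illustration under stated assumptions, not CARE's procedure: CARE calibrates jointly over a two-dimensional $(τ,γ)$ space, while the sketch below tunes one threshold `t` and flags a sentence when its score is at least `t`. The function name, the per-document loss (fraction of true errors left unflagged, bounded by 1), and the threshold grid are all hypothetical. The acceptance rule is the standard conformal risk control bound $\frac{n}{n+1}\hat{R}(t) + \frac{1}{n+1} \le α$ for a monotone loss.

```python
import numpy as np

def calibrate_threshold(scores, is_error, alpha, grid):
    """Pick the largest threshold whose corrected empirical risk stays below alpha.

    scores / is_error: one array per calibration document (hypothetical layout).
    """
    n = len(scores)
    best = None
    for t in sorted(grid):  # the loss is non-decreasing in t, so scan upward
        losses = []
        for s, e in zip(scores, is_error):
            s, e = np.asarray(s, dtype=float), np.asarray(e, dtype=bool)
            missed = np.sum(e & (s < t))          # true errors that are not flagged
            losses.append(missed / max(e.sum(), 1))
        # Finite-sample correction: (n / (n + 1)) * empirical risk + B / (n + 1), B = 1.
        bound = (n / (n + 1)) * np.mean(losses) + 1.0 / (n + 1)
        if bound <= alpha:
            best = t
        else:
            break
    return best  # None means no threshold in the grid meets the target

docs_scores = [[0.9, 0.2]] * 9
docs_errors = [[True, False]] * 9
t_hat = calibrate_threshold(docs_scores, docs_errors, alpha=0.15, grid=[0.1, 0.5, 0.95])
```

A higher threshold flags fewer sentences, so choosing the largest admissible `t` corresponds to the lowest review burden that still meets the risk target. The `1 / (n + 1)` term also shows why very small calibration sets cannot certify a tight `alpha`.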

2606.08962 2026-06-09 cs.LG cs.CV cs.RO 新提交

C$^3$ache: Accelerating World Action Models with Cross Inference Chunk Cache

C$^3$ache: 利用跨推理块缓存加速世界动作模型

Weisen Zhao, Lam Nguyen, Zhicong Lu, Yuzhang Shang

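Affiliations * George Mason University; University of Central Florida

AI summary Proposes C$^3$ache, which caches and reuses denoising residuals across inference chunks to accelerate world action model inference, achieving up to a $2.5\times$ speedup with almost no loss in task success rate.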

详情
AI中文摘要

世界动作模型(WAM)比标准的视觉-语言-动作(VLA)策略在新型运动和环境中具有更好的泛化能力,因为视频建模目标使其能够从大量未标记视频中学习,而不是依赖稀缺的标记机器人演示。这种泛化能力计算成本高昂。为了完成一个任务,WAM需要运行多个推理块,每个块都需要一个昂贵的去噪过程。现有的加速方法通过在一个块的去噪轨迹内缓存和重用计算来降低这一成本。我们的实证分析揭示了它们忽略的一个重要的冗余来源:块间的冗余。当机器人执行平滑行为时,在给定去噪步骤计算的残差从一个块到下一个块高度相关。我们引入了C$^3$ache,一种无需训练的方法,它在相同去噪步骤的推理块之间缓存和重用这些残差。在基于Fast-WAM骨干的基准测试上的实验表明,C$^3$ache在总墙钟推理时间上实现了高达2.5倍的加速,而任务成功率几乎没有下降。

英文摘要

World Action Models (WAMs) generalize better than standard Vision-Language-Action (VLA) policies to novel motions and environments, because a video-modeling objective lets them learn from abundant unlabeled video rather than scarce labeled robot demonstrations. This generalization is computationally expensive. To complete a task, a WAM runs over multiple inference chunks, and each chunk requires a costly denoising process. Existing acceleration methods reduce this cost by caching and reusing computation within a single chunk's denoising trajectory. Our empirical analysis reveals a substantial source of redundancy they overlook: redundancy across chunks. When a robot executes a smooth behavior, the residuals computed at a given denoising step are strongly correlated from one chunk to the next. We introduce C$^3$ache, a training-free method that caches and reuses these residuals across inference chunks at the same denoising step. Experiments on benchmarks with a Fast-WAM backbone show that C$^3$ache achieves up to a $2.5\times$ speedup in total wall-clock inference time, with negligible degradation in task success rate.
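The cross-chunk reuse described above can be sketched as a cache keyed by denoising step. This is a toy illustration, not the C$^3$ache implementation: the class name, the `refresh_every` policy (recompute on every k-th chunk, reuse otherwise), and the scalar `compute_fn` standing in for the expensive network call are all assumptions, since the abstract does not specify when cached residuals are refreshed.

```python
class CrossChunkResidualCache:
    """Reuse the residual computed at the same denoising step of an earlier chunk."""

    def __init__(self, refresh_every):
        self.refresh_every = refresh_every  # hypothetical refresh policy
        self.residuals = {}                 # denoising step -> cached residual
        self.computed = 0                   # number of expensive calls actually made

    def residual(self, chunk_idx, step, x, compute_fn):
        must_refresh = chunk_idx % self.refresh_every == 0
        if must_refresh or step not in self.residuals:
            self.residuals[step] = compute_fn(x, step)
            self.computed += 1
        return self.residuals[step]

def denoise_chunk(cache, chunk_idx, x, num_steps, compute_fn):
    # Each step adds a residual; cached residuals skip the expensive call.
    for step in range(num_steps):
        x = x + cache.residual(chunk_idx, step, x, compute_fn)
    return x

cache = CrossChunkResidualCache(refresh_every=2)
for chunk in range(4):
    denoise_chunk(cache, chunk, 1.0, num_steps=5, compute_fn=lambda x, s: -0.1 * x)
```

With four chunks of five steps and a refresh every second chunk, only half of the twenty residual evaluations are computed. The real speedup depends on how strongly residuals correlate across chunks, which the abstract reports for smooth robot behaviors.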

2606.08959 2026-06-09 cs.CV cs.CL 新提交

ChinaHeritaQA: A Culturally-Grounded Visual Question Answering Dataset for World Heritage Sites in China

ChinaHeritaQA:面向中国世界遗产地的文化基础视觉问答数据集

Yi Zhang, Bolei Ma, Yong Cao, Chengyan Wu, Daniel Hershcovich, Anna-Carolina Haensch

Affiliations * LMU Munich; FAU Erlangen-Nuremberg; Munich Center for Machine Learning; University of Tübingen & Tübingen AI Center; Sun Yat-sen University; University of Copenhagen; University of Maryland, College Park

AI summary Proposes ChinaHeritaQA, a multimodal benchmark dataset with 2,279 images and 14,133 bilingual multiple-choice questions covering seven cognitive dimensions, to evaluate the cultural reasoning of vision-language models on World Heritage sites in China.

详情
AI中文摘要

我们介绍了ChinaHeritaQA,这是一个多模态基准数据集,用于评估视觉语言模型(VLM)在中国联合国教科文组织世界遗产地上的文化推理能力。该数据集包含2279张野外图像,配以14133个双语(中文/英文)多项选择题对,涵盖七个认知维度,从基本身份识别到历史分期和建筑分析。在联合国教科文组织对齐的本体论指导下,并通过严格的人工注释验证,该数据集确保了语言质量和事实一致性。对最先进VLM的评估显示,虽然顶级模型在平均表现上超过人类,但出现了显著的任务级差异:模型在视觉识别方面表现出色,但在文化基础推理上存在困难。性能也因朝代和地区而异。ChinaHeritaQA揭示了强大的视觉检索能力并不能延伸到文化和历史理解。我们发布该数据集以支持未来关于文化感知多模态学习的研究。

英文摘要

We introduce ChinaHeritaQA, a multimodal benchmark dataset for evaluating the cultural reasoning abilities of vision-language models (VLMs) on UNESCO World Heritage sites in China. The dataset comprises 2,279 in-the-wild images paired with 14,133 bilingual (Chinese/English) multiple-choice QA pairs spanning seven cognitive dimensions, from basic identity recognition to historical periodization and architectural analysis. Guided by a UNESCO-aligned heritage ontology and verified through rigorous human annotation, the dataset ensures linguistic quality and factual consistency. Evaluations of state-of-the-art VLMs reveal that while top models exceed human performance on average, substantial task-level variation emerges: models excel at visual recognition but struggle with culturally grounded reasoning. Performance also varies by dynasty and region. ChinaHeritaQA reveals that strong visual retrieval does not extend to cultural and historical understanding. We release the dataset to support future research on culturally aware multimodal learning.

2606.08957 2026-06-09 cs.CV 新提交

Rethinking 3D Shape Generation: Diffusion over Superquadrics

重新思考3D形状生成:超二次曲面上的扩散

Zhiyang Liu, Wanze Li, Yuwei Wu, Chengran Yuan, Jiawei Sun, Rui Zheng, Marcelo H Ang

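Affiliations * University of Science and Technology of China

AI summary Proposes moving diffusion from high-cardinality geometric representations to compact superquadric parameters, which lowers compute and memory cost and supports resolution-free point-cloud decoding, part-level editing, and constraint-based design for efficient generation.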

Comments Accepted to ICML2026

详情
AI中文摘要

扩散模型推动了3D形状生成的发展,但大多数方法仍然在高基数空间(如体素/SDF网格、网格或点云)中进行去噪,这计算和内存密集,难以在更高分辨率和更强可控性方面扩展。我们重新思考扩散表示,提出将扩散从密集几何转移到紧凑的几何基元,将每个形状表示为少量超二次曲面。我们不操作成千上万的几何表示值,而是利用7KB的超二次曲面参数(姿态、大小和形状),大幅降低扩散状态维度和每步计算/内存。我们的超二次曲面扩散通过支持更广泛的能力(如无分辨率点云解码、部件级编辑和基于约束的设计)提高了可扩展性,并在点云解码后在标准基准上实现了具有竞争力的表面保真度和分布性能,同时在大多数条件下每个形状的生成时间仅为0.6秒。

英文摘要

Diffusion models have advanced 3D shape generation, yet most methods still denoise in high-cardinality spaces (e.g., voxel/SDF grids, meshes, or point clouds), which is computationally and memory intensive and makes it difficult to scale in terms of both higher resolution and stronger controllability. We rethink the diffusion representation and propose to move diffusion from dense geometry to compact geometric primitives, representing each shape as a small set of superquadrics. Instead of operating on thousands to millions of geometric representation values, we leverage 7KB superquadric parameters (pose, size, and shape), drastically reducing diffusion-state dimensionality and per-step compute/memory. Our diffusion-over-superquadrics improves scalability by supporting broader capabilities (e.g., resolution-free point-cloud decoding, part-level editing, and constraint-based design) and achieving competitive surface-fidelity and distributional performance on standard benchmarks after point-cloud decoding, while enabling efficient generation within 0.6s per shape for most conditions.
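For readers unfamiliar with superquadrics, the compactness of the representation can be seen from the standard inside-outside function, which needs only three size parameters and two shape exponents per primitive (plus a pose, omitted here). This is the textbook formulation, offered as a sketch; the abstract does not state the paper's exact parameterization, and the function name below is hypothetical.

```python
import numpy as np

def superquadric_inside_outside(points, size, shape):
    """Evaluate F(x, y, z) in the primitive's canonical frame (pose not applied).

    F < 1: inside, F == 1: on the surface, F > 1: outside.
    """
    a1, a2, a3 = size      # semi-axis lengths
    e1, e2 = shape         # shape exponents; e1 = e2 = 1 gives an ellipsoid
    x, y, z = np.abs(np.asarray(points, dtype=float)).T
    xy = (x / a1) ** (2.0 / e2) + (y / a2) ** (2.0 / e2)
    return xy ** (e2 / e1) + (z / a3) ** (2.0 / e1)

pts = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
values = superquadric_inside_outside(pts, size=(1.0, 1.0, 1.0), shape=(1.0, 1.0))
```

Because the function is defined for every point in space, a surface can be sampled at any density, which is one way to read the "resolution-free point-cloud decoding" mentioned in the abstract.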

2606.08953 2026-06-09 cs.LG math.FA 新提交

Self-Consistent Generative Paths via Admissible Random Variational Transport

通过可容许随机变分输运的自洽生成路径

Lei Luo, Yingzhen Zhang, Jian Yang

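Affiliations * PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology

AI summary Defines self-consistent generative paths as random fixed points of admissible local variational transport corrections and introduces the random fixed-point path residual (R-FPR) to measure the gap between a generated path and the correction, giving a residual-control principle for diffusion, flow, one-step, VAE, GAN, and other models.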

Comments 17 pages, 4 figures, including Appendix

详情
AI中文摘要

现代生成模型通常定义从简单先验到数据分布的完整概率路径,而不仅仅是端点映射。扩散模型遵循随机去噪路径,流匹配学习输运场,一致性和蒸馏方法将路径压缩为一步或几步,对抗模型匹配终端分布,VAE通过潜在核生成。现有的统一观点主要描述这些路径是如何构建的。我们研究一个互补的问题:生成的概率路径何时是自洽的?我们将自洽生成路径定义为可容许局部变分输运校正的随机不动点。在该框架中,局部校正由结合散度或几何项、能量项和结构约束的随机变分输运算子指定。该框架包含随机正则化最优输运近端步骤作为结构化实例,同时允许非OT散度、潜在核、对抗约束、因果离散核和终端一步映射。该理论产生随机不动点路径残差(R-FPR),它衡量实际生成路径与可容许局部校正之间的差距。我们证明了适定性、随机不动点的存在性和吸引性、非收缩存在性、残差到生成误差界、经验残差集中性、代理扰动界、连续时间极限以及算子级泛化与模型特定推论。由此产生的理论将端点匹配转化为路径自洽性测试,并为诊断失败、正则化训练和指导跨扩散、流、一步、VAE、GAN/WGAN和自回归生成器的自适应采样提供了残差控制原理。

英文摘要

Modern generative models often define an entire probability path from a simple prior to the data law, rather than only an endpoint map. Diffusion models follow stochastic denoising paths, flow matching learns transport fields, consistency and distillation methods compress paths into one or a few steps, adversarial models match terminal distributions, and VAEs generate through latent kernels. Existing unifying views mainly describe how such paths are constructed. We study a complementary question: when is a generated probability path self-consistent? We define a self-consistent generative path as a random fixed point of admissible local variational transport corrections. In this framework, a local correction is specified by a random variational transport operator combining a divergence or geometry term, an energy term, and a structural constraint. The framework contains random regularized optimal-transport proximal steps as a structured instance, while also allowing non-OT divergences, latent kernels, adversarial constraints, causal discrete kernels, and terminal one-step maps. The theory yields a random fixed-point path residual (R-FPR), which measures the gap between the actual generated path and an admissible local correction. We prove well-posedness, random fixed-point existence and attraction, non-contractive existence, residual-to-generation error bounds, empirical residual concentration, proxy perturbation bounds, continuous-time limits, and operator-level generalization with model-specific corollaries. The resulting theory turns endpoint matching into path self-consistency testing and provides a residual-control principle for diagnosing failures, regularizing training, and guiding adaptive sampling across diffusion, flow, one-step, VAE, GAN/WGAN, and autoregressive generators.

2606.08952 2026-06-09 cs.AI 新提交

AlloSpatial: Agentic Harness Framework for Spatial Reasoning in Foundation Models

AlloSpatial:基础模型中空间推理的智能体框架

Shouwei Ruan, Bin Wang, Zhenyu Wu, Qihui Zhu, Yuxiang Zhang, Jingzhi Li, Yubin Wang, Xingxing Wei

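Affiliations * Institute of Artificial Intelligence, Beihang University; Huawei Noah’s Ark Lab; University of Science and Technology Beijing

AI summary Proposes the AlloSpatial framework, which uses the World2Mind cognitive mapping sandbox to convert egocentric observations into allocentric spatial priors and a Spatial Reasoning Harness for geometry-semantic arbitration, improving proprietary models' spatial reasoning by 5%-18% on VSI-Bench and MindCube in a training-free setting.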

详情
AI中文摘要

多模态基础模型(MFMs)取得了显著进展,但在物理世界的空间推理中仍然脆弱。一个关键瓶颈在于它们无法将局部的自我中心观察转化为全局的异中心空间表示。为了解决这个问题,我们提出了AlloSpatial,一个用于基础模型中异中心空间认知的智能体框架。AlloSpatial引入了World2Mind,一个即插即用的认知映射沙箱,将自我中心观察转化为结构化的异中心先验,包括异中心空间树和路线图,支持查询对象拓扑、几何关系、可通过性和轨迹。为了在噪声重建和模糊视觉证据下可靠地利用这些先验,AlloSpatial引入了空间推理工具,用于工具使用判断、模态解耦线索收集和几何语义仲裁。我们进一步通过冷启动强化学习,使用工具门控轨迹级奖励,在Qwen3-VL中内化这一过程。在VSI-Bench和MindCube上的实验表明,AlloSpatial在无训练设置下将专有模型提升了5%-18%,而仅ASTs就在移除视觉输入时支持强大的空间推理。训练后的AlloSpatial智能体进一步超越了更大的通用模型和竞争性的空间基线,表明结构化的异中心表示、主动工具使用和可验证推理为具有空间能力的基础模型提供了一条有前景的路径。

英文摘要

Multimodal Foundation Models (MFMs) have made substantial progress, yet remain fragile in spatial reasoning over the physical world. A key bottleneck lies in their inability to transform local egocentric observations into a global allocentric spatial representation. To address this, we propose AlloSpatial, an agentic framework for allocentric spatial cognition in foundation models. AlloSpatial introduces World2Mind, a plug-and-play cognitive mapping sandbox that converts egocentric observations into structured allocentric priors, including Allocentric-Spatial Trees (ASTs) and route maps that support querying object topology, geometric relations, passability, and trajectories. To utilize these priors reliably under noisy reconstruction and ambiguous visual evidence, AlloSpatial introduces a Spatial Reasoning Harness for tool-use judgment, modality-decoupled cue collection, and geometry-semantic arbitration. We further internalize this process in Qwen3-VL through cold-start reinforcement learning with a harness-gated trajectory-level reward. Experiments on VSI-Bench and MindCube show that AlloSpatial improves proprietary models by 5%-18% in a training-free setting, while ASTs alone support strong spatial reasoning even when visual inputs are removed. The trained AlloSpatial agents further outperform larger general-purpose models and competitive spatial baselines, suggesting that structured allocentric representations, active tool use, and verifiable reasoning offer a promising route toward spatially capable foundation models.

2606.08948 2026-06-09 cs.CV cs.AI 新提交

NutriMLLM: Multimodal Large Language Models for Dietary Micronutrient Analysis

NutriMLLM:用于膳食微量营养素分析的多模态大语言模型

Runze Yan, Minxiao Wang, Jiaying Lu, Darren Liu, Xiao Hu, Hanqi Luo

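Affiliations * Emory University

AI summary To address the unreliability of existing MLLMs for dietary micronutrient estimation, the authors generate about 1.1 million image-description-nutrient triplets from a decade of population-scale dietary recalls and fine-tune Qwen3-VL and GLM-4.6V-Flash to obtain NutriMLLM, which achieves near-complete coverage of 65 nutrients on real images, with the largest variant matching or exceeding proprietary models in accuracy on most nutrients.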

Comments 35 pages, 10 figures, 1 table

详情
AI中文摘要

从食物图像中全面估计膳食微量营养素可以改善临床营养护理,但训练此类模型需要将多样化食物与完整营养素谱相关联的大规模多模态数据集。我们首先证明,现有的多模态大语言模型(MLLMs),包括领先的专有模型,在此任务上不可靠。在五个模型家族和四个独立评估基准(ASA24、SNAPMe、FNDDS和NutriBench)上,模型经常弃权或返回统计上不合理的值。为了在没有昂贵专家标注的情况下解决这一差距,我们将十年人口规模的24小时膳食回顾重新用作文本到图像生成的结构化提示。该流程生成了约110万图像-描述-营养素三元组的合成语料库,每个三元组将生成的食品图像与完整的65种营养素标签配对。据我们所知,这是计划在发表后公开发布的最大合成食品图像语料库,具有全面的微量营养素标注。在此语料库上微调Qwen3-VL(2B/4B/8B/30B)和GLM-4.6V-Flash,得到了NutriMLLM,这是第一个专门用于全面膳食微量营养素估计的视觉语言模型家族。我们使用一个四组件框架评估这些模型,该框架分别测量弃权、幻觉、整体可用性和每种营养素的数值准确性。在真实食品图像上,每个NutriMLLM变体在所有65种营养素上实现了近乎完全的覆盖,并且最大的变体在大多数营养素上的准确率匹配或超过了专有基线(GPT-5、Gemini 3和Claude Sonnet 4.5)。这些结果表明,回忆驱动的合成监督可以使基于图像的全面微量营养素估计成为一个可处理的工程问题,并支持膳食评估、个性化营养指导和人口规模的微量营养素监测。

英文摘要

Comprehensive estimation of dietary micronutrients from food images could improve clinical nutrition care, but training such models requires large multimodal datasets linking diverse foods to complete nutrient profiles. We first show that existing multimodal large language models (MLLMs), including leading proprietary models, are unreliable for this task. Across five model families and four independent evaluation benchmarks (ASA24, SNAPMe, FNDDS, and NutriBench), models frequently abstained or returned statistically implausible values. To address this gap without costly expert annotation, we repurposed a decade of population-scale 24-hour dietary recalls as structured prompts for text-to-image generation. This pipeline produced a synthetic corpus of about 1.1 million image-description-nutrient triplets, each pairing a generated food image with a complete 65-nutrient label. To our knowledge, this is the largest synthetic food-image corpus with comprehensive micronutrient annotation planned for public release upon publication. Fine-tuning Qwen3-VL (2B/4B/8B/30B) and GLM-4.6V-Flash on this corpus yielded NutriMLLM, the first family of vision-language models specialized for comprehensive dietary micronutrient estimation. We evaluate these models with a four-component framework that separately measures abstention, hallucination, overall usability, and per-nutrient numerical accuracy. On real food images, every NutriMLLM variant achieved near-complete coverage across all 65 nutrients, and the largest variant matched or exceeded proprietary baselines (GPT-5, Gemini 3, and Claude Sonnet 4.5) in accuracy on most nutrients. These results show that recall-driven synthetic supervision can make image-based comprehensive micronutrient estimation a tractable engineering problem and support dietary assessment, personalized nutrition guidance, and population-scale micronutrient surveillance.

2606.08945 2026-06-09 cs.LG 新提交

From Hazard Functions to Language Space: Cox-Supervised Distillation of Survival Risk into a Large Language Model

从风险函数到语言空间:Cox监督的生存风险蒸馏到大语言模型

Nicholas I-Hsien Kuo, Blanca Gallego, Louisa Jorm

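Affiliations * Centre for Big Data Research in Health, the University of New South Wales

AI summary Proposes transferring time-to-event risk information from a Cox proportional hazards model into a large language model by fine-tuning a Qwen-based model on text prompts; the model achieves competitive discrimination and calibration on three datasets, and its hidden states show a continuous risk gradient.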

详情
AI中文摘要

我们研究了Cox比例风险模型估计的时间事件风险信息是否可以迁移到生成式大语言模型中。我们提出了一种基于文本的生存建模流程,其中结构化的临床协变量被转换为文本提示,并微调基于Qwen的大语言模型,以使用Cox模型预测作为训练目标生成患者特定的生存风险。在GBSG2、ACTG320和WHAS500数据集上,尽管该模型是作为文本生成任务而非使用传统的生存分析损失进行训练,但它取得了有竞争力的留出区分度和校准性。我们进一步分析了模型隐藏状态的几何结构,其中t-SNE可视化揭示了潜在空间中的平滑风险梯度,表明模型将生存风险表示为连续结构而非孤立的风险类别。这些发现共同表明,大语言模型可以内化生存风险结构,同时支持校准预测,为语言模型中的时间事件推理提供了一条途径。

英文摘要

We investigate whether information about time-to-event risk estimated by a Cox proportional hazards model can be transferred into a generative large language model. We propose a text-based survival modelling pipeline in which structured clinical covariates are converted into text prompts and a Qwen-based large language model is fine-tuned to generate patient-specific survival risk using Cox model predictions as a training target. Across GBSG2, ACTG320, and WHAS500, the model achieves competitive held-out discrimination and calibration despite being trained as a text-generation task rather than with a conventional survival-analysis loss. We further analyse the geometry of the model's hidden states, where t-SNE visualisations reveal smooth risk gradients in latent space, suggesting that the model represents survival risk as a continuous structure rather than isolated risk categories. Together, these findings suggest that large language models can internalise survival-risk structure while supporting calibrated prediction, providing a route towards time-to-event reasoning in language models.

2606.08940 2026-06-09 cs.CL 新提交

Multilingual Sentiment Aware Text Summarization: A Reinforcement Learning Approach for Consistency Maintenance

多语言情感感知文本摘要:一种用于一致性维护的强化学习方法

Mikhail Krasitskii, Alexander Gelbukh, Olga Kolesnikova, Grigori Sidorov

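Affiliations * Instituto Politécnico Nacional (IPN), Centro de Investigación en Computación (CIC)

AI summary Studies sentiment drift in RLHF-based summarization and, building on a Policy Attribution framework, proposes a sentiment-aware KL regularization that mitigates sentiment neutralization while maintaining summarization quality.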

详情
AI中文摘要

来自人类反馈的强化学习(RLHF)显著提高了大语言模型在文本摘要中的质量和流畅性。然而,其对情感属性的影响仍未被充分理解。在这项工作中,我们研究了情感漂移,即基于RLHF的摘要输出相对于源文本向中性情感的系统性偏移。我们在多个数据集、模型架构和八种语言上进行了广泛实验,以分析对齐目标如何影响情感保留。我们的结果表明,情感漂移是一种一致现象,随着KL正则化强度的增加而增强,表明对齐稳定性与情感保真度之间存在权衡。为了解释这种行为,我们引入了一个策略归因框架,该框架分解了RLHF目标并量化了其组成部分的贡献。我们的分析表明,KL正则化是所有设置中情感抑制的主要驱动因素。基于这些发现,我们提出了对KL正则化项的情感感知修改,该修改选择性地减少对情感承载标记的约束。实证结果表明,这种方法在保持摘要质量的同时缓解了情感漂移。总体而言,我们的发现揭示了当前对齐方法的一个基本局限性:虽然它们提高了事实一致性和安全性,但可能无意中抑制了情感表达。这促使我们开发明确考虑情感保留的对齐策略。

英文摘要

Reinforcement Learning from Human Feedback (RLHF) has significantly improved the quality and fluency of large language models in text summarization. However, its impact on affective properties remains insufficiently understood. In this work, we study sentiment drift, a systematic shift toward neutral sentiment in RLHF-based summarization outputs compared to source texts. We conduct extensive experiments across multiple datasets, model architectures, and eight languages to analyze how alignment objectives influence sentiment preservation. Our results show that sentiment drift is a consistent phenomenon that becomes stronger with increased KL regularization strength, indicating a trade-off between alignment stability and affective fidelity. To explain this behavior, we introduce a Policy Attribution framework that decomposes the RLHF objective and quantifies the contribution of its components. Our analysis reveals that KL regularization is the primary driver of sentiment suppression across all settings. Based on these findings, we propose a sentiment-aware modification of the KL regularization term, which selectively reduces constraints on sentiment-bearing tokens. Empirical results demonstrate that this approach mitigates sentiment drift while maintaining summarization quality. Overall, our findings highlight a fundamental limitation of current alignment methods: while they improve factual consistency and safety, they may unintentionally suppress emotional expressiveness. This motivates the development of alignment strategies that explicitly account for affective preservation.
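The sentiment-aware modification described above can be sketched as a per-token weighting of the KL penalty. This is a minimal sketch under assumptions, not the authors' formulation: the abstract only says that constraints on sentiment-bearing tokens are selectively reduced, so the binary `sentiment_mask`, the `relax` factor, and the function name are hypothetical.

```python
import numpy as np

def weighted_kl_penalty(policy_probs, ref_probs, sentiment_mask, beta, relax):
    """KL(policy || reference) per token, down-weighted on sentiment-bearing tokens.

    policy_probs, ref_probs: arrays of shape (tokens, vocab) with positive entries.
    sentiment_mask: True where a token is treated as sentiment-bearing (assumed given).
    relax: weight in (0, 1] applied to those tokens; 1.0 recovers the plain penalty.
    """
    p = np.asarray(policy_probs, dtype=float)
    q = np.asarray(ref_probs, dtype=float)
    kl_per_token = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
    weights = np.where(np.asarray(sentiment_mask, dtype=bool), relax, 1.0)
    return beta * float(np.sum(weights * kl_per_token))

policy = [[0.5, 0.5], [0.9, 0.1]]
reference = [[0.5, 0.5], [0.5, 0.5]]
mask = [False, True]
plain = weighted_kl_penalty(policy, reference, mask, beta=0.1, relax=1.0)
relaxed = weighted_kl_penalty(policy, reference, mask, beta=0.1, relax=0.5)
```

In this toy case all of the divergence sits on the masked token, so halving `relax` halves the penalty. That leaves the policy freer to keep the source sentiment while other tokens remain tied to the reference model.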

2606.08938 2026-06-09 cs.CL cs.AI 新提交

PACT: Learning Diverse Diagnostic Strategies via Privileged Synthesis and Branch Consensus

PACT: 通过特权合成与分支共识学习多样化诊断策略

Gen Li, Yuanze Hu, Zhichao Yang, Qingchen Yu, Jianwei Lv, Yue Guo, Yujing Liu, Faguo Wu, Hongwei Zheng, Xiandong Li, Bo Yuan, Yifan Sun, Zhaoxin Fan

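Affiliations * Beihang University; Baidu; ByteDance; Beijing Academy of Blockchain and Edge Computing; Renmin University of China

AI summary Proposes the PACT framework, which uses privileged synthesis of dialogue data and multi-branch consensus training so that an LLM learns multiple diagnostic reasoning paradigms at once, achieving the best performance among compared baselines on a Chinese medical diagnosis benchmark.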

Comments 16 pages, 5 figures, 5 tables

详情
AI中文摘要

临床诊断需要在信息不完整的情况下灵活运用多种推理范式。现有的基于LLM的医疗智能体表现出强大的医学推理能力,但单一范式或简单混合的对话监督使得这些范式难以无干扰地学习。我们提出\textbf{PACT}(周期性锚点共识训练),一个将监督的多范式对话合成与基于共识的分支训练相结合的框架。在数据层面,\textbf{DPS}(医生-患者-监督者)利用完整的电子病历(EMR)进行质量控制,同时保持医生代理仅能访问患者可见信息。这产生了四种诊断推理范式下的经过验证的对话,而不会泄露隐藏的临床答案。在训练层面,PACT为每个范式训练一个范式特定的LoRA分支,并通过符号共识定期将分支聚合到共享锚点中。我们进一步构建了一个动态的多轮中文医疗诊断基准用于交互式会诊。实验表明,PACT在诊断结果和会诊过程指标上,与专有、医学专用和任务适应的基线相比,达到了最先进的性能。

英文摘要

Clinical diagnosis requires flexible use of multiple reasoning paradigms under incomplete patient information. Existing LLM-based medical agents show strong medical reasoning ability, but single-paradigm or naively mixed dialogue supervision makes these paradigms difficult to learn without interference. We propose PACT (Periodic Anchor Consensus Training), a framework that couples supervised multi-paradigm dialogue synthesis with consensus-based Branch training. At the data level, DPS (Doctor-Patient-Supervisor) uses complete electronic medical records (EMRs) for quality control while keeping the doctor agent restricted to patient-visible information. This produces validated dialogues under four diagnostic reasoning paradigms without leaking hidden clinical answers. At the training level, PACT trains one paradigm-specific LoRA Branch per paradigm and periodically aggregates Branches into a shared Anchor through sign consensus. We further construct a dynamic multi-turn Chinese medical diagnosis benchmark for interactive consultation. Experiments show that PACT achieves state-of-the-art performance among compared proprietary, medical-specialized, and task-adapted baselines on diagnostic outcome and consultation-process metrics.
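One plausible reading of "aggregates Branches into a shared Anchor through sign consensus" is a coordinate-wise majority vote on the sign of each Branch update. The sketch below follows that reading and is an assumption, not the paper's rule: the function name is hypothetical, and averaging only the agreeing updates is borrowed from common sign-election merging schemes rather than taken from the abstract.

```python
import numpy as np

def sign_consensus_merge(anchor, branches):
    """Merge branch parameters into the anchor, keeping only sign-agreeing updates."""
    deltas = np.stack([np.asarray(b, dtype=float) - anchor for b in branches])
    # Majority sign per coordinate; 0 means the branches are tied.
    majority = np.sign(np.sum(np.sign(deltas), axis=0))
    agree = (np.sign(deltas) == majority) & (majority != 0)
    counts = np.maximum(agree.sum(axis=0), 1)  # avoid division by zero on ties
    merged = np.sum(np.where(agree, deltas, 0.0), axis=0) / counts
    return anchor + merged

anchor = np.zeros(3)
branches = [np.array([1.0, -1.0, 2.0]),
            np.array([3.0, 1.0, -2.0]),
            np.array([2.0, -3.0, 0.0])]
new_anchor = sign_consensus_merge(anchor, branches)
```

Coordinates where the Branches disagree evenly are left at the Anchor value, which is one way to limit interference between paradigm-specific updates.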

2606.08935 2026-06-09 cs.LG cs.AI 新提交

PAI: Preserving Amplitude Information in Representation-Based Time-Series Anomaly Detection

PAI:在基于表示的时间序列异常检测中保留振幅信息

Kang Zhang, Wei Jian Lau, Shoushou Ren, Dong Lin, Joon Son Chung, Chuanhao Sun

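Affiliations * HUAWEI; KAIST

AI summary Existing representation-based time-series anomaly detection methods ignore amplitude information, which degrades performance; the proposed PAI scheme uses a diagnostic module and a score augmentation function to fuse amplitude-related scores, improving average VUS-PR by 98.4% and 36.8% on the TSB-AD-U-Eva and TAB UV datasets.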

Comments 15 pages

详情
AI中文摘要

基于表示的时间序列异常检测算法在多种异常检测任务上显著优于其他方法。然而,我们在评估中发现它们存在一个主要限制——学习到的嵌入通常是振幅无关的。丢失振幅信息会降低与振幅相关异常的性能,并且这种失败普遍存在于所有现有的基于表示的方法中。为了解决上述问题,我们提出了一种新的异常评分方案PAI。PAI由两个互补模块组成:诊断模块和最终分数增强函数。诊断模块比较同一表示库上的余弦评分和欧几里得评分,以测试振幅信息是否已被捕获到学习到的表示中。然后在最终分数增强函数中,PAI计算逐点中位数和MAD偏差分数以及局部均值偏移分数——这些分数与表示分数融合以产生最终异常分数。在TSB-AD-U-Eva和TAB UV数据集上,PAI在所有报告的指标上改进了所有四种评估的基于表示的方法,平均VUS-PR增益分别为98.4%和36.8%。在所有评估的组合中,PaAno + PAI实现了最佳性能,比最先进的方法高出15%。对bootstrap置信区间、异常类型细分以及TS2Vec输入归一化消融的进一步评估进一步支持了所提出的方案。这些结果表明,显式保留振幅信息对于基于表示的时间序列异常检测非常重要,而这一点在现有的评分方案中未得到充分重视。代码可在https://github.com/pantheon5100/PAI获取。

英文摘要

Representation-based time-series anomaly detection algorithms significantly outperform other methods on diverse anomaly detection tasks. However, our evaluation reveals a major limitation: their learned embeddings are often amplitude-agnostic. Losing amplitude information can degrade performance on amplitude-related anomalies, and this failure is prevalent across all existing representation-based methods. To address this issue, we propose a new anomaly scoring scheme named PAI. PAI consists of two complementary modules: a diagnostic module and a final score augmentation function. The diagnostic module compares cosine and Euclidean scoring on the same representation bank to test whether amplitude information is already captured in the learned representation. Then, in the final score augmentation function, PAI computes a point-wise median and MAD deviation score and a local mean-shift score, which are fused with the representation score to produce the final anomaly score. On the TSB-AD-U-Eva and TAB UV datasets, PAI improves all four evaluated representation-based methods across every reported metric, achieving average VUS-PR gains of 98.4% and 36.8%, respectively. Among all evaluated combinations, PaAno + PAI achieves the best performance, outperforming the state-of-the-art method by 15%. Evaluations with bootstrap confidence intervals, anomaly-type breakdowns, and a TS2Vec input-normalization ablation further support the proposed scheme. These results suggest that explicitly retaining amplitude information is important for representation-based time-series anomaly detection, which has been underemphasized in existing scoring schemes. Code is available at: https://github.com/pantheon5100/PAI
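The two amplitude-aware scores named in the abstract can be sketched as follows. This is an illustrative reconstruction, not the code from the linked repository: the function names, the window length, the min-max normalization, and the element-wise maximum used for fusion are all assumptions, since the abstract does not specify how the scores are fused.

```python
import numpy as np

def mad_score(x):
    """Point-wise deviation from the median, scaled by the median absolute deviation."""
    med = np.median(x)
    mad = np.median(np.abs(x - med)) + 1e-8
    return np.abs(x - med) / mad

def mean_shift_score(x, window):
    """Absolute gap between a local moving average and the global mean."""
    local_mean = np.convolve(x, np.ones(window) / window, mode="same")
    return np.abs(local_mean - np.mean(x))

def minmax(s):
    return (s - s.min()) / (s.max() - s.min() + 1e-8)

def fuse(rep_score, x, window=5):
    # Hypothetical fusion: element-wise maximum of the normalized scores.
    parts = [minmax(rep_score), minmax(mad_score(x)), minmax(mean_shift_score(x, window))]
    return np.max(parts, axis=0)

x = np.sin(np.linspace(0, 6 * np.pi, 100))
x[50] = 10.0                       # amplitude spike
rep_score = np.zeros_like(x)       # stand-in for an amplitude-agnostic representation score
final = fuse(rep_score, x)
```

Here the representation score is deliberately flat, mimicking an embedding that cannot see amplitude, and the spike still surfaces in the fused score through the MAD term.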

2606.08934 2026-06-09 cs.LG stat.AP stat.CO stat.ME stat.ML New submission

Backward Coherence and Hidden-State Stability in Recurrent Neural Networks: A Quasi-Reverse-Martingale Theory

递归神经网络中的反向相干性与隐藏状态稳定性:拟逆鞅理论

Yuan-chin Ivan Chang

Affiliation * Institute of Statistical Science, Academia Sinica (中央研究院统计科学研究所)

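AI summary Introduces the notion of backward coherence, uses quasi-reverse-martingale theory to show that the hidden-state sequence converges almost surely, and designs a regularisation method that reaches stability earlier and yields lower error across several tasks.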

Details
AI Chinese abstract

递归神经网络维护一个隐藏状态 $h_t$,但其概率意义通常不明确。我们通过\emph{反向相干性}研究隐藏状态稳定性:即通过学习的反向投影器 $g_ϕ$ 从 $h_{t+1}$ 重构 $h_t$ 的程度。在收缩性和可和反向漂移条件下,隐藏状态序列构成拟逆鞅。这导致几乎必然收敛、混合下的速率、可解释的极限表示、有限路径停止时间以及时间一致置信序列的理论框架。模拟支持该理论。反向相干性正则化将经验拟鞅总和 $\hat Q$ 降低 $43$--$58\%$,比未正则化的 RNN 早 $28$--$44\%$ 达到稳定,并提供与几何界一致的跟踪误差恢复。额外测试证实回波状态遗忘率受 $ρ$ 限制,并验证增量总和管 $R_t$ 具有 $100\%$ 同时覆盖率,尽管 $R_t$ 是保守的;实践中,缺陷尾代理 $\hat Q_t$ 是更有用的监控指标。反向相干性损失也等价于在高斯反向模型中最小化 Kullback--Leibler 散度,将该方法与变分推断联系起来。扩展涵盖 $ϕ$-混合输入、变点检测和有限样本集中度。三项真实数据研究进一步验证了该方法。在 PhysioNet 2012 ICU 数据上,逆鞅 RNN (RMRNN) 与 RNN 的死亡率预测 AUC 相当,同时提前 13 小时达到稳定表示。在 FRED-MD 上,它在概念漂移下将一个月前预测误差降低约四倍。在 UCI 人类活动识别上,它保持较低的后转换跟踪误差并具有几何衰减。这些保证在所述假设下成立;不声称普适性。

English abstract

Recurrent neural networks maintain a hidden state $h_t$, but its probabilistic meaning is often unclear. We study hidden-state stability through \emph{backward coherence}: the extent to which $h_t$ can be reconstructed from $h_{t+1}$ by a learned backward projector $g_ϕ$. Under contraction and summable backward drift, the hidden-state sequence forms a quasi-reverse-martingale. This yields almost-sure convergence, rates under mixing, an interpretable limiting representation, finite pathwise stopping times, and a theoretical framework for time-uniform confidence sequences. Simulations support the theory. Backward-coherence regularisation reduces the empirical quasi-martingale total $\hat Q$ by $43$--$58\%$, reaches stability $28$--$44\%$ earlier than an unregularised RNN, and gives tracking-error recovery consistent with geometric bounds. Additional tests confirm echo-state forgetting rates bounded by $ρ$ and verify the increment-sum tube $R_t$ with $100\%$ simultaneous coverage, although $R_t$ is conservative; in practice, the defect-tail proxy $\hat Q_t$ is the more useful monitor. The backward-coherence loss is also equivalent to minimising a Kullback--Leibler divergence in a Gaussian backward model, linking the method to variational inference. Extensions cover $ϕ$-mixing inputs, change-point tracking, and finite-sample concentration. Three real-data studies further validate the approach. On PhysioNet 2012 ICU data, the Reverse Martingale RNN (RMRNN) matches RNN mortality-prediction AUC while reaching stable representations 13 hours earlier. On FRED-MD, it reduces one-month-ahead forecast error by about fourfold under concept drift. On UCI Human Activity Recognition, it maintains lower post-transition tracking error with geometric decay. The guarantees apply under the stated assumptions; universality is not claimed.
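The idea of backward coherence, reconstructing $h_t$ from $h_{t+1}$, can be sketched in a few lines. The example below is only an illustration under stated assumptions, not the paper's RMRNN: the backward projector is taken to be affine and fitted by least squares, the hidden states come from a toy contracting linear recurrence, and all function names are hypothetical. With a Gaussian backward model of fixed isotropic variance, minimising this squared reconstruction error coincides with maximising the Gaussian likelihood, which is the usual route to a KL interpretation.

```python
import numpy as np

def fit_linear_backward_projector(H):
    """Least-squares fit of (W, b) so that h_t is approximated by h_{t+1} @ W + b.

    The affine form is an illustrative assumption; the abstract only states
    that the backward projector is learned.
    """
    H = np.asarray(H, dtype=float)
    # Append a column of ones so the intercept b is fitted jointly with W.
    nxt = np.hstack([H[1:], np.ones((len(H) - 1, 1))])
    coef, *_ = np.linalg.lstsq(nxt, H[:-1], rcond=None)
    return coef[:-1], coef[-1]

def backward_coherence_loss(H, W, b):
    """Mean squared error of reconstructing h_t from h_{t+1}."""
    H = np.asarray(H, dtype=float)
    resid = H[:-1] - (H[1:] @ W + b)
    return float(np.mean(np.sum(resid ** 2, axis=1)))

def contracting_states(T=200, d=4, rho=0.8, seed=0):
    """Toy hidden states from h_{t+1} = rho * h_t + small Gaussian noise."""
    rng = np.random.default_rng(seed)
    H = np.zeros((T, d))
    H[0] = rng.normal(size=d)
    for t in range(T - 1):
        H[t + 1] = rho * H[t] + 0.1 * rng.normal(size=d)
    return H

H = contracting_states()
W, b = fit_linear_backward_projector(H)
print(backward_coherence_loss(H, W, b))
```

In a training loop this quantity would be added to the task loss as a regulariser; the weighting and the choice of projector architecture are left open here because the abstract does not state them.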