arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.18986 2026-06-18 cs.CL cs.AI New submission

Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering

Yafeng Wu, Huu Hiep Nguyen, Thin Nguyen, Hung Le

Affiliations Deakin University

AI summary Proposes the CADE framework, which embeds each timestep directly through a point-wise linear encoder to avoid the tokenization bottleneck and uses a one-directional supervised contrastive loss to align time series with text anchors, improving performance on six TSQA tasks on the Time-MQA benchmark.

Abstract

Recent advances in large language models (LLMs) have given rise to time-series question answering (TSQA), which formulates time-series analysis as natural-language question answering. However, directly feeding raw numerical series into LLMs suffers from a tokenization bottleneck: Byte Pair Encoding fragments continuous values into unstable tokens whose embeddings lack meaningful metric structure, resulting in the loss of magnitude, scale, and trend information. Prior methods use patch-based encoders that split the series into fixed windows, locking in one granularity that breaks patterns and hides exact timesteps, through a separate module that rarely transfers across datasets with different lengths or sampling rates. To address this challenge, we propose CADE (Contrastive Alignment with Direct Embedding), a novel framework for TSQA built upon two key components: direct timestep embedding and semantic alignment. The proposed framework maps each timestep directly into the LLM embedding space through a point-wise linear encoder and MLP projector, preserving exact index-level access while eliminating the need for patching and padding. To further bridge the semantic gap between time-series and language representations, we introduce a novel one-directional supervised contrastive loss that aligns time-series embeddings with frozen class-name text anchors. Experimental results on the public Time-MQA benchmark demonstrate that our framework consistently improves performance across six TSQA tasks, outperforming both open-source and proprietary LLM baselines.
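
The two components named in the abstract can be illustrated with a minimal sketch. It assumes scalar timesteps, a hypothetical per-timestep affine map `embed_timesteps`, and a cross-entropy style loss `one_directional_loss` computed only in the series-to-text direction against fixed anchors. The abstract does not give the actual encoder, projector, or loss form, so every name and number below is an assumption.

```python
import math

def embed_timesteps(series, weight, bias):
    """Point-wise linear encoder: each scalar timestep maps to its own
    d-dimensional vector, so index t of the output is exactly timestep t
    (no patching, no padding). `weight` and `bias` are length-d lists."""
    return [[x * w + b for w, b in zip(weight, bias)] for x in series]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def one_directional_loss(ts_embedding, anchors, label, temperature=0.1):
    """Cross-entropy over similarities between one series embedding and the
    frozen class anchors. There is no text-to-series term, which is the
    assumed meaning of "one-directional" here."""
    logits = [cosine(ts_embedding, a) / temperature for a in anchors]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]

# Toy usage with made-up numbers.
tokens = embed_timesteps([0.5, 1.0, 1.5], weight=[1.0, -1.0], bias=[0.0, 0.5])
pooled = [sum(col) / len(tokens) for col in zip(*tokens)]  # mean over timesteps
anchors = [[1.0, -0.5], [-1.0, 1.0]]  # stand-ins for frozen class-name embeddings
print(one_directional_loss(pooled, anchors, label=0))
```

Because the output keeps one vector per timestep, a question about a specific index can attend to that exact position, which is the index-level access the abstract refers to.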

2606.18974 2026-06-18 cs.CV New submission

Visual-OPSD: Cross-Modal On-Policy Self-Distillation for Efficient Unified Multimodal Reasoning

Pengyu Li, Zhitao Gao, Lingling Zhang, Muye Huang, Yuanming Li, Fangzhi Xu, Jun Liu

Affiliations Xi’an Jiaotong University; MOE KLINNS Lab; Shaanxi Province Key Laboratory of Big Data Knowledge Engineering; Sun Yat-sen University

AI summary Proposes Visual-OPSD, which uses cross-modal on-policy self-distillation to transfer the visual-thought reasoning produced by multi-step diffusion to a text-only student, achieving a 14.3x speedup and a 3.40 percentage-point accuracy gain.

Abstract

Unified multimodal models (UMMs) interleave generated "visual thoughts" (VTs) with text reasoning to improve spatial tasks. This incurs roughly an order-of-magnitude inference cost from multi-step diffusion. We find this cost yields limited direct benefit. On ThinkMorph, removing or noising VTs barely changes accuracy across nine benchmarks. Once rendered, attention concentrates on the VT regardless of content. Yet a KL diagnostic shows that conditioning on a privileged VT trace shifts the model's completion distribution. This suggests the generation pathway encodes useful reasoning beyond the rendered pixels. Motivated by this gap, we propose Visual On-Policy Self-Distillation (Visual-OPSD). Teacher and student share identical weights but differ in context: the teacher sees privileged VTs while the student sees only the question. Token-level JSD distillation on on-policy student trajectories transfers the teacher's reasoning to a text-only student. Across nine benchmarks, Visual-OPSD improves over its generative teacher by $+3.40$pp with $14.3\times$ speedup (10.0s vs. 142.8s per sample) and outperforms same-scale VLMs by $+63.83$pp on VSP. A Gaussian-noise control ($+0.40$pp vs. $+10.28$pp for real VTs) and $58.4\%$ closure of the KL gap confirm that gains come from the semantic content of the generation pathway.
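
As a rough illustration of the token-level JSD objective mentioned in the abstract, the sketch below averages the Jensen-Shannon divergence between teacher and student next-token distributions over the positions of one trajectory. The function names and toy distributions are assumptions; the paper's actual weighting, masking, and gradient handling are not described in the abstract.

```python
import math

def kl(p, q):
    """KL(p || q) in nats for discrete distributions; 0 * log 0 is taken as 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def token_level_jsd(teacher_dists, student_dists):
    """Average per-position JSD along one student-sampled trajectory."""
    per_token = [jsd(t, s) for t, s in zip(teacher_dists, student_dists)]
    return sum(per_token) / len(per_token)

# Toy distributions over a three-token vocabulary at two positions.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]  # conditioned on privileged VTs
student = [[0.4, 0.4, 0.2], [0.1, 0.8, 0.1]]  # conditioned on the question only
print(token_level_jsd(teacher, student))
```

Unlike a one-sided KL term, JSD stays finite when one distribution assigns zero probability to a token the other prefers, which is one common reason to choose it for distillation.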

2606.18967 2026-06-18 cs.LG New submission

EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts

Minseo Kim, Minjae Lee, Seunghyuk Oh, Kevin Galim, Donghoon Kim, Coleman Hooper, Harman Singh, Amir Gholami, Hyung Il Koo, Wonjun Kang

Affiliations FuriosaAI; University of California, Berkeley

AI summary Targets the autoregressive decoding latency bottleneck in RL rollouts with a system-aware self-speculative decoding framework that combines a quantized self-speculative drafter with a system-aware speculation toggle policy, reducing rollout and end-to-end latency while preserving model quality.

Comments Project Page: https://github.com/furiosa-ai/EfficientRollout

Abstract

Reinforcement learning (RL) has become a representative post-training paradigm for LLMs, enabling strong reasoning and agentic capabilities. However, rollout generation remains a dominant latency bottleneck because autoregressive sampling decodes responses sequentially and a small number of long-tailed generations often determine completion time. Speculative decoding (SD) offers a natural way to address this bottleneck, as it is a well-established technique for serving fixed LLMs that reduces latency by rapidly drafting tokens and accepting them through parallel verification while preserving the target-model distribution. However, its practical speedups do not directly carry over to RL rollouts: (i) the evolving target policy makes any fixed drafter increasingly mismatched with the policy's output distribution; and (ii) active batch sizes shrink throughout rollout decoding, shifting decoding from compute-bound to memory-bound regimes where parallel verification can exploit underutilized compute. Therefore, accelerating RL rollouts requires both a drafter that remains effective under long, high-temperature generations from an evolving policy and system-aware use of SD that avoids compute-bound regimes. We present EfficientRollout, a system-aware self-SD framework designed to address this gap for RL rollouts. EfficientRollout induces a quantized drafter from the target model (i.e. self-speculative decoding), keeping it coupled to the evolving policy without separate drafter pretraining or online adaptation. It further coordinates a system-aware SD toggle policy with acceptance-aware draft-length adaptation, enabling speculation only in beneficial regimes while matching the drafting budget to evolving drafter quality. EfficientRollout reduces rollout and end-to-end latency by up to 19.6% and 12.7%, respectively, over an accelerated AR rollout baseline, while preserving final model quality.
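
The claim that speculative decoding preserves the target-model distribution rests on the standard speculative-sampling acceptance rule, sketched below for a single drafted token. The `should_speculate` toggle is purely hypothetical: the abstract only says that speculation is enabled in beneficial regimes, and the batch-size threshold here is a made-up stand-in for whatever signal the framework actually uses.

```python
import random

def verify_draft_token(token, p_target, q_draft, rng):
    """Standard speculative-sampling check for one token drawn from q_draft.
    Accept with probability min(1, p/q); on rejection, resample from the
    normalized residual max(0, p - q). The returned token follows p_target."""
    accept_prob = min(1.0, p_target[token] / q_draft[token])
    if rng.random() < accept_prob:
        return token, True
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    total = sum(residual)
    weights = [r / total for r in residual]
    return rng.choices(range(len(weights)), weights=weights)[0], False

def should_speculate(active_batch_size, max_batch_for_sd=8):
    """Hypothetical toggle: speculate only once the active batch has shrunk
    enough that decoding is likely memory-bound. The threshold is invented."""
    return active_batch_size <= max_batch_for_sd

# Toy usage: a drafter that disagrees with the target on a 3-token vocabulary.
rng = random.Random(0)
p = [0.6, 0.3, 0.1]   # target policy
q = [0.2, 0.5, 0.3]   # quantized drafter (illustrative numbers)
drafted = rng.choices(range(3), weights=q)[0]
print(verify_draft_token(drafted, p, q, rng), should_speculate(4))
```

The closer the drafter's distribution stays to the evolving policy, the higher the acceptance probability, which is why a drafter derived from the target model itself is attractive in this setting.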

2606.18963 2026-06-18 cs.LG New submission

Online Reward-Punishment Learning from Fixed-Channel Perceptual Event Streams without Environment Rewards

Zirong Li

Affiliations Zirong Li

AI summary Proposes the OHIRL framework for online reward-punishment learning from fixed-channel perceptual streams without scalar rewards, using an internal trajectory evaluator to infer the valence of perceptual dimensions; reports high accuracy on a 2x2-XOR packet task, with hidden-reward CartPole and Taxi controls as further tests.

Comments 9 pages, 5 figures, 6 tables; 13-page technical supplement

Abstract

We study online reward-punishment learning when the environment provides no scalar reward or evaluative label. At each step the agent receives only a fixed-channel perceptual packet, and quantities such as pain, energy, contact, damage, or cognitive error are treated as perceptual dimensions whose valence must be inferred from transition consequences. OHIRL separates four roles: M_psi learns next-packet prediction, D_omega models residual dynamics, C_eta is a fixed internal post-transition trajectory evaluator, and B_xi learns to use the resulting value evidence for later policy updates and action scoring. C_eta uses a recovery-positive and persistence/growth-negative residual-regulation orientation; a coefficient-origin audit shows that equal-unit, raw-equal, and random monotone variants preserve more than 92% of the released top-action rankings, while sign inversion preserves 0%. The reward-free protocol exposes observation transitions while withholding environment rewards, delayed external evaluators, success labels, and action-goodness labels. A conditional error decomposition separates B_xi evidence-estimation error from residual policy-optimization error. In a 2x2-XOR packet task, medicine and chili acquire opposite value under visual XOR contexts, and the same pain or spice increase can be positive or negative depending on consequence structure; B_xi reaches 0.952 balanced reward-sign accuracy. In a full online-interleaved audit, M_psi reaches holdout R2=0.907, B_xi reaches 0.940 sign accuracy, and the policy reaches 0.979 optimal-action accuracy, while immediate packet scores, prediction-error rewards, shuffled targets, zero reward, and error-reduction controls collapse. Hidden-reward CartPole and Taxi controls, public-context no-leakage audits, and module-role ablations further test information boundaries and component necessity.

2606.18961 2026-06-18 cs.LG New submission

Be Your Own Teacher: Steering Protein Language Models via Unsupervised Reward Optimization

Lanqing Li, Shentong Mo, Yang Yu, Pheng-Ann Heng

Affiliations The Chinese University of Hong Kong; MBZUAI; Hong Kong University of Science and Technology

AI summary Proposes an unsupervised reward optimization framework that combines model uncertainty and semantic consistency as proxy rewards and optimizes PLMs with the SRO and BRO algorithms, enabling controllable protein generation without labeled data while approaching oracle performance.

Comments 24 pages, 2 figures, 13 tables

Abstract

Protein language models (PLMs) have emerged as powerful tools for controllable biomolecular design, yet their post-training adaptation typically relies on costly wet-lab validation or curated preference datasets. To overcome this supervision bottleneck, we introduce unsupervised reward optimization of PLMs, a comprehensive framework for steerable protein generation without ground-truth labels. Our key insight is that task-agnostic rewards, which combine intrinsic model uncertainty with extrinsic semantic consistency informed by protein representation models, exhibit strong correlation with controllability measures across base models and temperature regimes. Building upon this discovery, we propose two offline algorithms: Soft Reward Optimization (SRO) and Binarized Reward Optimization (BRO), which effectively maximize the classical RLHF objective induced by these proxy rewards. Extensive experiments on compositional out-of-distribution prompts demonstrate that both methods significantly outperform competitive baselines (DPO, KTO), while approaching oracle performance across multiple sampling temperatures, model scales and protein families. Moreover, PLMs fine-tuned with unsupervised rewards can achieve consistently higher coverage compared to their base model in pass@k evaluations. By enabling self-improvement of PLMs through their own generated experience, our framework provides a scalable pathway toward controllable biomolecular design in settings where labeled preferences or experimental feedback are scarce or unavailable.

2606.18959 2026-06-18 cs.RO New submission

TactSpace: Learning a Physics-enriched Shared Latent Space for Tactile Sim-to-Real Transfer

Arunim Joarder, Arjun Bhardwaj, René Zurbrügg, Mayank Mittal, Florin Püntener, Sira Bielefeldt, Cosmin Roman, Vaishakh Patil, Marco Hutter

Affiliations Robotic Systems Lab, ETH Zürich; Micro- and Nanosystems Lab, ETH Zürich; ETH AI Center; NVIDIA

AI summary Proposes TactSpace, a multi-modal representation learning framework that aligns heterogeneous tactile modalities in a shared latent space for zero-shot sim-to-real transfer, reducing force prediction error by 16.7% and shape reconstruction error by 45.8%.

Comments 9 pages, 6 figures, 4 tables, accepted into IROS 2026

Abstract

Tactile sensing provides direct measurements of contact interactions that are essential for robotic manipulation. However, current simulators lack the fidelity to faithfully model the complex deformation and transduction mechanics of tactile sensors, severely hindering sim-to-real transfer in robot learning pipelines. To address this challenge, we propose a multi-modal representation learning framework that aligns heterogeneous tactile modalities within a shared latent space, eliminating the need for accurate raw-signal simulation while preserving relevant contact information. Our approach employs modality-specific encoders to project diverse tactile observations, such as simulated penetration depth and real-world capacitance, into a common embedding space. The model is trained using self- and cross-reconstruction objectives alongside contrastive alignment, encouraging modality-invariant yet information-rich representations. We evaluate the learned embeddings on indenter shape identification, force prediction, and geometric reconstruction tasks, training exclusively in simulation and testing directly on real sensor measurements. Our results demonstrate zero-shot sim-to-real transfer across physically dissimilar representations. Furthermore, incorporating multi-physics simulation modalities yields more informative embeddings that transfer across diverse downstream tasks, demonstrating a 16.7% reduction in force prediction error and a 45.8% reduction in shape reconstruction error. Finally, we release an efficient Warp-based implementation of a penalty-based tactile simulation model for Isaac Lab, enabling scalable tactile data generation.

2606.18955 2026-06-18 cs.CV cs.RO New submission

Motion-Focused Latent Action Enables Cross-Embodiment VLA Training from Human EgoVideos

Runze Xu, Yiluo Zhang, Jian Wang, Yu Wang, Jincheng Yu

Affiliations Department of Electronic Engineering, Tsinghua University; Tianfu Jiangxi Laboratory

AI summary Proposes a latent-action-based framework that uses a Hybrid Disentangled VQ-VAE to extract general action priors from unlabeled human videos and an intent-perception decoupling strategy to reduce action hallucinations, requiring only 50 trajectories for downstream adaptation.

Comments Accepted to IROS 2026

Abstract

Training generalist Vision-Language-Action (VLA) models typically requires massive, diverse robotic datasets with high-fidelity action annotations. While egocentric human manipulation videos are abundant and capture significant environmental diversity, the absence of action labels makes them difficult to use in conventional training paradigms. To address this, we propose a latent-action-based framework designed to extract general action priors from unlabeled human videos. The architecture features a Hybrid Disentangled VQ-VAE that decouples motion dynamics from environmental backgrounds through physical masks, enabling the construction of a cross-embodiment action codebook. By pre-training on human videos with the codebook, the VLM backbone learns deep representations of action intent. For adaptation to specific embodiments, we introduce an intent-perception decoupling strategy where the VLM predicts the action intent while a separate frozen visual encoder provides state-specific features to the action expert, thereby reducing action hallucinations. Results in simulation and real-world environments show that our method, pre-trained exclusively on unlabeled human videos, performs competitively with state-of-the-art VLA models trained on massive annotated datasets, requiring only 50 trajectories for downstream adaptation.

2606.18954 2026-06-18 cs.CL New submission

GraphPO: Graph-based Policy Optimization for Reasoning Models

Yuliang Zhan, Xinyu Tang, Jian Li, Dandan Zheng, Weilong Chai, Jingdong Chen, Jun Zhou, Ge Wu, Wenyue Tang, Hao Sun

Affiliations Gaoling School of Artificial Intelligence, Renmin University of China; Ant Group

AI summary Proposes GraphPO, which models reasoning rollouts as a directed acyclic graph, merges semantically equivalent paths to reduce redundant exploration, and uses edge-level advantages to improve reasoning efficiency, outperforming chain- and tree-based methods on multiple benchmarks.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for enhancing the capability of large reasoning models. RLVR typically samples responses independently and optimizes the policy using rewards from final answers. This paradigm has two limitations. First, independently sampled responses often contain similar intermediate reasoning steps, causing redundant exploration and wasted computation. Second, sparse final-answer rewards make it hard to identify useful steps. Tree-based methods partly address this problem by sharing prefixes and comparing branches from the same prefix to provide fine-grained signals. However, tree branches are still expanded independently. When different branches reach similar reasoning states, they cannot share information and repeat similar exploration. Moreover, tree-based methods ignore such dispersion and only perform local comparisons within separate branches, which can lead to higher variance in advantage estimation. To address this challenge, we propose GraphPO (Graph-based Policy Optimization), a novel RL framework that represents rollouts as a directed acyclic graph, with reasoning steps as edges and semantic states summarized from the reasoning paths as nodes. GraphPO merges semantically equivalent reasoning paths into equivalence classes, allowing them to share suffixes and reallocating budget away from redundant expansions to diverse exploration. Furthermore, we assign efficiency advantages to incoming edges and correctness advantages to outgoing edges, thereby improving inference efficiency while deriving process supervision from outcomes. Theory shows that GraphPO reduces advantage-estimation variance and enhances reasoning efficiency. Experiments on three LLMs across reasoning and agentic search benchmarks show that GraphPO consistently outperforms chain- and tree-based baselines with the same token budgets or response budgets.

2606.18953 2026-06-18 cs.RO New submission

Object-Centric Residual RL for Zero-Shot Sim-to-Real VLA Enhancement

Kinam Kim, Namiko Saito, Heecheol Kim, Katsushi Ikeuchi, Jaegul Choo, Yasuyuki Matsushita

Affiliations KAIST; Microsoft Research Asia - Tokyo; The University of Tokyo

AI summary Proposes an object-centric residual RL framework whose policy is trained in simulation and transfers zero-shot to a real robot, raising the VLA success rate from 42% to 76%.

Comments 8 pages, 7 figures, 2 tables; 8-page appendix

Abstract

Vision-Language-Action (VLA) models can generalize across diverse manipulation tasks, but their imitation-learning-based policies remain brittle in precise physical interactions due to compounding execution errors. Can a reinforcement learning policy trained purely in simulation improve the robustness of real-world VLAs zero-shot? Residual RL, which learns a corrective policy on top of a frozen VLA, offers a natural framework, but existing approaches face a fundamental sim-to-real dilemma: privileged-state methods require lossy distillation for deployment; image-based methods suffer from the visual domain gap; and real-world RL is costly and unsafe. We propose an object-centric residual RL framework that refines VLA actions using object poses, enabling a compact observation space that transfers consistently between simulation and reality. To align the two domains, we additionally replay the same teleoperation demonstrations in simulation to train a sim counterpart of the real-world VLA. The residual RL policy is trained only in simulation with pose noise injection and dropout, and transfers zero-shot to the real robot. Across five manipulation tasks on a real Franka Research 3 (FR3) robot, our method improves the success rate from 42% to 76% zero-shot, and the improved rollouts can be further reused to retrain the base VLA for self-improvement without additional teleoperation. Project page: https://www.microsoft.com/en-us/research/articles/object-centric-residual-rl/
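
The residual structure described in the abstract can be sketched as a frozen base action plus a small learned correction, with object-pose observations perturbed during simulated training. Everything below is an assumption for illustration: the scale, clipping limit, noise level, and the reading of "dropout" as a missed pose detection that falls back to the previous pose are not taken from the paper.

```python
import random

def corrupt_pose(pose, rng, noise_std=0.01, dropout_prob=0.1, last_pose=None):
    """Training-time perturbation of an object pose observation.
    Gaussian noise imitates pose-estimator jitter; dropout imitates a missed
    detection by reusing the previous pose (an assumed convention)."""
    if last_pose is not None and rng.random() < dropout_prob:
        return list(last_pose)
    return [x + rng.gauss(0.0, noise_std) for x in pose]

def residual_action(base_action, residual, scale=0.1, limit=1.0):
    """Final command = frozen VLA action + scaled learned correction, clipped
    to the actuator range."""
    return [max(-limit, min(limit, a + scale * r))
            for a, r in zip(base_action, residual)]

# Toy usage with made-up values.
rng = random.Random(0)
noisy_pose = corrupt_pose([0.30, 0.10, 0.05], rng, last_pose=[0.29, 0.10, 0.05])
print(noisy_pose, residual_action([0.5, -0.95], [1.0, -1.0]))
```

Keeping the correction small relative to the base action is a common design choice in residual RL, since it lets the learned policy fix execution errors without overriding the behavior of the frozen base model.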

2606.18952 2026-06-18 cs.CV New submission

SP-TransientBench: A Real-Captured Single Photon Perception Benchmark

Hongzhou Dong, Zili Zhang, Ziting Wen, Yiheng Qiang, Runrong Deng, Wenle Dong, Ziwen Jiang, Xinyang Li, Rui Lu, Shuoyao Sun, Wenyu Wang, Ziyi Xia, Haitao Zheng, Guodong Shi, Xiaoqiang Ren

Affiliations Shanghai University; Southern University of Science and Technology; The University of Sydney

AI summary Addresses the perception challenges that noise and multi-return transient phenomena pose for single-photon LiDAR in real scenes by introducing STB, a real-captured multi-task benchmark with 10 scenes and 10,297 views that supports evaluation of depth estimation, multi-view reconstruction, and 3D semantic understanding.

Abstract

Single-photon LiDAR (SPL) based on single-photon avalanche diode (SPAD) sensing enables time-resolved photon measurements with extreme sensitivity, offering unique potential for active 3D perception in photon-starved scenarios. However, real-world single photon perception remains fundamentally challenging due to unique measurement noise and complex multi-return transient phenomena, which jointly complicate geometric reconstruction and semantic scene understanding. Despite growing interest in SPAD-based sensing, existing studies are largely limited to simulated data or small-scale controlled captures. As a result, systematic evaluation of real-world single photon perception across depth estimation, multi-view reconstruction, and 3D semantic understanding remains underexplored. To bridge this gap, we introduce SP-TransientBench (STB), a real-captured multi-task benchmark for single photon perception. SP-TransientBench comprises 10 diverse scenes and 10,297 views captured using a solid-state single-photon LiDAR at $256\times192$ resolution. Each view provides full time-of-flight histograms with multi-return behavior, standardized metadata, and calibrated camera poses for multi-view evaluation. We further provide 13-class 3D semantic annotations for selected scenes. By providing dedicated data splits and evaluation protocols for each task, STB enables consistent and reproducible benchmarking of real-world single photon perception across multiple 3D vision problems. The dataset and code will be released upon acceptance.

2606.18948 2026-06-18 cs.RO New submission

C-ARC: Continuous-Adaptive Range Clustering for Non-Repetitive LiDAR Sensors

Nick B. Schroeder, Jonathan Lichtenfeld, Oskar von Stryk

Affiliations Technical University of Darmstadt; Simulation, Systems Optimization and Robotics Group

AI summary Proposes the C-ARC framework, which decouples high-frequency point insertion from on-demand cluster retrieval through a persistent dual-graph over a sliding window and calibrates grid resolution with an exponential control loop, enabling real-time clustering of non-repetitive LiDAR point clouds.

Comments Submitted to IEEE Robotics and Automation Letters. This work has been submitted to the IEEE for possible publication. 8 pages, 7 figures

Abstract

Real-time LiDAR clustering identifies structures in point clouds, which is an essential prerequisite for many mobile robotics algorithms. Current methods are mostly developed for repetitive mechanical LiDAR sensors. Recently, the use of non-repetitive LiDAR sensors has increased strongly due to their low cost and small form factor. Such non-repetitive Risley prism-based sensors violate two key assumptions of repetitive mechanical sensors: structured scan lines and well-defined frame boundaries. Their Rhodonea-curve trajectories produce non-uniform point distributions, and the absence of a rotation cycle renders conventional scan line indexing inapplicable. To meet such new requirements, we developed C-ARC, a Continuous-Adaptive Range Clustering framework that maintains a persistent dual-graph over a sliding window, decoupling high-frequency point insertion from on-demand cluster retrieval. This is crucial for key functionalities like SLAM or tracking. An adaptive range grid resolution mechanism calibrates grid dimensions at initialization using an exponential control loop, balancing the sparsity-collision trade-off without prior knowledge of the scanning pattern. Implemented as an open-source single-threaded C++17 library, C-ARC produces real-time cluster output at 20 Hz on commodity hardware for the Livox Mid-360. Evaluation on the Livox Avia identifies unbounded cell occupancy as the primary limitation for sensors with strongly concentrated scan patterns. The adaptive resolution mechanism additionally improves clustering quality for existing grid-based methods on non-repetitive data.

2606.18947 2026-06-18 cs.AI cs.CL cs.IR cs.MA New submission

Decoupling Search from Reasoning: A Vendor-Agnostic Grounding Architecture for LLM Agents

Emmanuel Aboah Boateng, Kyle MacDonald, Amardeep Kumar, Siddharth Kodwani, Sudeep Das

Affiliations DoorDash, Inc.

AI summary Proposes the Decoupled Search Grounding (DSG) architecture, which separates search grounding from the reasoning model and exposes provider routing, caching, and other controls through an MCP-compatible gateway, reducing cost and latency while staying close to or matching native-search accuracy.

Comments 15 pages, Figure 8

Abstract

Production LLM agents increasingly depend on real-time search, yet native search grounding bundles retrieval policy, provider choice, evidence injection, cost, latency, and generation behavior behind a single model-provider boundary. This coupling makes grounding hard to inspect, tune, reuse, or port, and can trigger Search-Induced Verbosity that breaks strict output contracts. We present Decoupled Search Grounding (DSG), a vendor-agnostic boundary that moves grounding outside the reasoning model through an MCP-compatible gateway, exposing provider routing, source-aware context rendering, configured fallback, retrieval-depth control, and exact plus semantic caching as first-class controls. Across five frontier models on SimpleQA, FreshQA, and HotpotQA, native search leads on recency-sensitive FreshQA, but DSG exposes a stronger frontier when control matters: on SimpleQA it nearly matches native accuracy (86.1% vs. 87.7%) at 91% lower search cost, preserves concise answer contracts, and reaches a 99.4% warm-cache hit rate with 68% lower latency. Deployed as a shared production grounding layer for large-scale agentic workloads with interchangeable models, DSG matches or slightly exceeds native-search accuracy on an e-commerce query-understanding (QIU) workload while cutting search cost by over 98%. Real-time grounding is best treated as an optimizable interface boundary, not a fixed model feature.
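
One of the controls listed in the abstract, exact plus semantic caching, can be sketched as a two-tier lookup: an exact match on the normalized query, then a nearest-neighbor match on query embeddings above a similarity threshold. The class name, the threshold, and the toy bag-of-words embedding are all hypothetical; the abstract does not describe how DSG implements its cache.

```python
import math

class GroundingCache:
    """Two-tier cache: exact match on the normalized query string, then a
    semantic match on embedding cosine similarity above a threshold."""

    def __init__(self, embed, threshold=0.9):
        self.embed = embed          # caller-supplied embedding function (assumed)
        self.threshold = threshold  # illustrative value, not from the paper
        self.exact = {}
        self.entries = []           # list of (embedding, results)

    @staticmethod
    def _normalize(query):
        return " ".join(query.lower().split())

    @staticmethod
    def _cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
        return dot / norm if norm else 0.0

    def get(self, query):
        key = self._normalize(query)
        if key in self.exact:
            return self.exact[key], "exact"
        vec = self.embed(key)
        best = max(self.entries, key=lambda e: self._cosine(vec, e[0]), default=None)
        if best is not None and self._cosine(vec, best[0]) >= self.threshold:
            return best[1], "semantic"
        return None, "miss"

    def put(self, query, results):
        key = self._normalize(query)
        self.exact[key] = results
        self.entries.append((self.embed(key), results))

def toy_embed(text):
    """Stand-in embedding: word counts over a tiny fixed vocabulary."""
    vocab = ["weather", "paris", "today", "pizza"]
    words = text.split()
    return [float(words.count(w)) for w in vocab]

cache = GroundingCache(toy_embed)
cache.put("Weather in Paris today", ["sunny, 21 C"])
print(cache.get("paris weather today"))
```

A real deployment would replace the linear scan with a vector index and would need an expiry policy, since cached search results for recency-sensitive queries go stale.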

2606.18946 2026-06-18 cs.CL New submission

SenFlow: Inter-Sentence Flow Modeling for AI-Generated Text Detection in Hybrid Documents

Jingkun Luo, Yifan Sun, Da-Tian Peng, Guanxiong Pei

Affiliations Northwestern Polytechnical University; Zhejiang Lab

AI summary Targets sentence-level AI-text detection in human-AI hybrid documents with SenFlow, which models inter-sentence dependencies through graph propagation and CRF decoding, improving cross-domain Macro-F1 by 4.15 percentage points on the MOSAIC benchmark.

Comments 16 pages, 4 figures, 9 tables

Abstract

Sentence-level AI-generated text detection (S-AGTD) for hybrid documents, where humans and LLMs co-author one text, faces two gaps: existing methods classify each sentence in isolation, discarding inter-sentence dependencies, and existing benchmarks omit the newest generation of generators. We construct MOSAIC, a benchmark of 16,000 hybrid documents over PubMed and XSum, generated by DeepSeek-V3.2 and Kimi K2 under stringent quality controls including a perplexity-consistency filter absent from prior benchmarks. We recast S-AGTD as structured prediction over the document sentence sequence and instantiate it as SenFlow, integrating graph-based inter-sentence propagation with linear-chain CRF decoding in a single document-level pass over a sentence graph. SenFlow reaches state-of-the-art performance on MOSAIC, with a +4.15 pp average Macro-F1 margin on cross-domain transfer, the hardest of three protocols of increasing difficulty. We further find that even after the perplexity filter equalizes overt cues, AI insertions retain a generator-dependent sentence-length gap that sentence-level detectors still exploit. Code and data: https://github.com/luojingkun22/SenFlow
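
To show why decoding over the whole sentence sequence can differ from classifying each sentence in isolation, here is a minimal Viterbi decoder for a linear-chain model with two labels (0 = human, 1 = AI). The emission and transition scores are invented for the example; SenFlow's learned scores and its graph-based propagation step are not reproduced here.

```python
def viterbi(emissions, transitions):
    """Most likely label sequence under a linear-chain model.
    emissions[t][k]: score of label k for sentence t.
    transitions[j][k]: score of moving from label j to label k."""
    n_labels = len(transitions)
    scores = list(emissions[0])
    backptrs = []
    for emit in emissions[1:]:
        step, ptr = [], []
        for k in range(n_labels):
            prev = max(range(n_labels), key=lambda j: scores[j] + transitions[j][k])
            step.append(scores[prev] + transitions[prev][k] + emit[k])
            ptr.append(prev)
        scores = step
        backptrs.append(ptr)
    best = max(range(n_labels), key=lambda k: scores[k])
    path = [best]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return path[::-1]

# The middle sentence looks slightly AI-like on its own, but its neighbors
# are confidently human and label switches are penalized.
emissions = [[2.0, 0.0], [0.4, 0.6], [2.0, 0.0]]
transitions = [[1.0, -1.0], [-1.0, 1.0]]
independent = [max(range(2), key=lambda k: e[k]) for e in emissions]
print(independent, viterbi(emissions, transitions))
```

The per-sentence argmax flips the middle label, while sequence decoding keeps the span consistent because the transition penalty outweighs the weak local evidence.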

2606.18943 2026-06-18 cs.CV 新提交

Physics-IQ Verified

物理智力验证

Tim Rädsch, Yuki M Asano, Hilde Kuehne, Stefan Bauer, Priyank Jaini, Robert Geirhos, Carsten T. Lüth

发表机构 * Anates Labs(Anates实验室) Technical University of Munich(慕尼黑工业大学) University of Technology Nuremberg(纽伦堡技术大学) Tuebingen AI Center, University of Tuebingen(图宾根大学人工智能中心) Helmholtz AI, Munich(亥姆霍兹人工智能中心,慕尼黑) Google DeepMind research(谷歌DeepMind研究)

AI总结 本文提出Physics-IQ Verified基准,通过改进提示和真值(ground truth)质量并引入样本级评分系统,改进对视频生成模型物理理解的评估;该基准修订了57.6%的样本,并改进了超过34.8%的提示。

详情
AI中文摘要

视频生成模型(VGMs)已成为新的前沿,不仅用于视频生成,还用于多种下游任务,包括世界建模。为推进这些任务,一个良好的视频模型必须理解世界的物理现实。评估这种理解是一个新兴领域,并催生了Physics-IQ基准,该基准通过将模型生成的视频与真实物理实验视频进行比较来显式量化这种理解。本文系统审计了Physics-IQ基准,揭示其不足并提出三种解决方案,以改进对VGMs物理理解的衡量方式。具体而言,我们提高了提示和真值(ground truth)的质量以减少混淆因素的影响,并进一步引入样本级评分系统,使每个样本和每个指标的权重相等。我们得到的基准Physics-IQ Verified修订了全部样本的57.6%,并改进了超过34.8%的提示。在使用六个图像到视频生成模型的比较研究中,我们观察到中等但有意义的排名变化(Kendall's τ=0.46)。我们希望Physics-IQ Verified通过提供更可靠的信号推动社区向物理准确的VGMs迈进。该基准的代码可通过 https://github.com/google-deepmind/physics-iq-benchmark 访问。

英文摘要

Video generative models (VGMs) have become a new frontier that can be used not just for video generation but for a multitude of downstream tasks, including world modeling. To advance these tasks, a good video model must understand the physical reality of the world. Evaluating this understanding is an emerging field and has led to the Physics-IQ benchmark, which quantifies this explicitly by comparing model-generated videos to real-world videos of physical experiments. In this work, we present a systematic audit of the Physics-IQ benchmark, expose shortcomings and propose three solutions that sharpen how we can measure physical understanding of VGMs. Specifically, we improve prompt and ground-truth quality to reduce the influence of confounding factors and further introduce a sample-level scoring system that weights each sample and metric equally. Our resulting benchmark, Physics-IQ Verified, refines 57.6% of all samples and improves over 34.8% of prompts. In a comparison study using six image-to-video generative models, we observe moderate but meaningful ranking changes (Kendall's τ = 0.46). We hope Physics-IQ Verified advances the community by providing a more reliable signal toward physically accurate VGMs. The code for the benchmark can be accessed at https://github.com/google-deepmind/physics-iq-benchmark
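To make the reported rank-correlation statistic concrete, here is a minimal sketch of Kendall's τ (the tie-free τ-a variant) for two rankings of six models. The rankings below are invented for illustration and are not the paper's data; which τ variant the authors used is not stated in the abstract.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a: (concordant - discordant) / total pairs, assuming no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical ranks of six models under the original and the revised benchmark.
original = [1, 2, 3, 4, 5, 6]
revised = [2, 1, 4, 3, 6, 5]   # three adjacent pairs swapped
print(kendall_tau(original, revised))
```

With six models there are 15 model pairs, so each swapped pair moves τ by 2/15; a value well below 1 therefore means several pairs of models change their relative order.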

2606.18936 2026-06-18 cs.AI cs.CY 新提交

SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety

SciRisk-Bench:面向AI4Science安全的风险维度感知基准

Linghao Feng, Yinqian Sun, Dongqi Liang, Sicheng Shen, Chenfei Yan, Yuxuan Peng, Yilin Zhao, Haibo Tong, Kai Li, FeiFei Zhao, Yi Zeng

发表机构 * Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences, Beijing, China(类脑认知智能实验室,自动化研究所,中国科学院,北京,中国) School of Future Technology, University of Chinese Academy of Sciences, China(未来技术学院,中国科学院大学,中国) School of Artificial Intelligence, University of Chinese Academy of Sciences, China(人工智能学院,中国科学院大学,中国) Zhongguancun Academy, China(中关村学院,中国) Beijing Key Laboratory of Safe AI and Superalignment(北京安全人工智能与超对齐重点实验室) Gaoling School of AI, Renmin University of China(高瓴人工智能学院,中国人民大学) Beijing Institute of AI Safety and Governance (Beijing-AISI)(北京人工智能安全与治理研究院(北京-AISI)) School of Humanities, University of Chinese Academy of Sciences, China(人文学院,中国科学院大学,中国)

AI总结 提出SciRisk-Bench基准,从显式风险维度和科学学科两个角度评估AI4Science安全,覆盖7个学科、31个子学科和10个风险维度,实验揭示主流及科学大模型的安全薄弱环节。

详情
AI中文摘要

大型语言模型(LLMs)越来越多地嵌入到面向科学的人工智能(AI4Science)工作流程中,从科学问答和文献分析到实验室规划和自主发现。这一进展使得安全基准成为迫切需求:这类基准不仅要评估科学能力,还要评估模型是否能在高风险的科学情境中识别并规避风险。现有的AI4Science安全数据集涵盖多个学科和任务格式,但其背后的风险维度未得到充分界定。我们引入了SciRisk-Bench,这是一个旨在从两个互补视角评估AI4Science安全的基准:显式风险维度和科学学科。SciRisk-Bench涵盖7个学科、31个子学科和10个风险维度。在实验部分,我们评估了主流LLMs和面向科学的LLMs在风险维度、学科和子学科上的表现,从而能够细粒度地诊断科学模型在哪些方面仍然不安全。

英文摘要

Large language models (LLMs) are increasingly embedded in AI for Science (AI4Science) workflows, from scientific question answering and literature analysis to laboratory planning and autonomous discovery. This progress creates an urgent need for safety benchmarks that evaluate not only scientific competence, but also whether models recognize and avoid risks in high-stakes scientific contexts. Existing AI4Science safety datasets cover several disciplines and task formats, but leave the underlying risk dimensions underspecified. We introduce SciRisk-Bench, a benchmark designed to evaluate AI4Science safety from two complementary perspectives: explicit risk dimensions and scientific disciplines. SciRisk-Bench covers 7 disciplines, 31 subdisciplines and 10 risk dimensions. In the experimental section, we evaluate both mainstream LLMs and science-oriented LLMs across risk dimensions, disciplines, and subdisciplines, enabling fine-grained diagnosis of where scientific models remain unsafe.

2606.18924 2026-06-18 cs.SD 新提交

Who Wins the Conflict? Mechanistic Interpretability of Text Bias in Audio LLMs

谁赢得冲突?音频大模型中文本偏差的机制可解释性

Hyebin Cho, Suho Yoo, Jaehyuk Jang, Changick Kim, Joon Son Chung

发表机构 * School of Electrical Engineering, KAIST(韩国科学技术院电子工程学院)

AI总结 本文通过机制分析揭示音频大模型中的文本主导偏差,发现文本路径主动抑制完整音频表征,并提出无训练干预方法back-patching以增强音频表征,缓解文本主导。

Comments Preprint

详情
AI中文摘要

虽然音频大模型在多模态理解方面表现出色,但它们存在文本主导偏差,即模型盲目偏向文本而忽视声学证据,导致幻觉。然而,当音频和文本输入相互矛盾时,这些模型内部行为的底层机制尚未被探索。在这项工作中,我们通过追踪内部表征在层间的传播,首次对这一现象进行了机制分析。我们的研究揭示了三个关键发现:(i)文本主导在模型中系统性地且经验性地存在;(ii)虽然文本和音频依赖功能不同的路径,但它们最终在后期层中汇聚到一个共享语义空间;(iii)文本路径不会擦除音频信息,而是主动抑制完整的音频表征。基于这些见解,我们利用back-patching,一种无训练干预方法,将后期层的音频激活路由回早期层。这放大了音频表征,使其能够克服文本抑制。我们的评估表明,back-patching持续减少文本主导,为冲突下的机制性多模态对齐铺平了道路。

英文摘要

While Audio Large Language Models (Audio LLMs) excel at multimodal understanding, they suffer from text dominance, a bias where models blindly favor text over acoustic evidence, causing hallucinations. However, the internal mechanisms that govern how these models behave when audio and textual inputs contradict each other remain unexplored. In this work, we present the first mechanistic analysis of this phenomenon by tracing the propagation of internal representations across layers. Our investigation reveals three key findings: (i) text dominance is systematically and empirically observed across models; (ii) while text and audio rely on functionally distinct pathways, they ultimately converge into a shared semantic space in late layers; and (iii) the text pathway does not erase audio information, but rather actively suppresses intact audio representations. Building on these insights, we leverage back-patching, a training-free intervention that routes late-layer audio activations back into earlier layers. This amplifies the audio representations, enabling them to overcome textual suppression. Our evaluation shows that back-patching consistently reduces text dominance, paving the way for mechanistic multimodal alignment under conflict.

2606.18923 2026-06-18 cs.LG 新提交

GrapNet: A Programmable Dynamic-Architecture Neural Graph Substrate

GrapNet: 一种可编程的动态架构神经图基板

Zirong Li

发表机构 * Zirong Li(李子荣)

AI总结 提出GrapNet,一种将图作为可执行架构的神经基板,通过可编程接口支持结构编辑、冻结子图、局部审计等操作,在Split Fashion-MNIST和Split CIFAR-10上分别提升12.08和3.81个百分点的准确率。

Comments 8 pages, 1 figure, preprint

详情
AI中文摘要

可编程性是固定张量神经网络中缺失的一流接口:编辑关系、冻结子图、审计局部函数或更改执行后端应是对神经程序的操作,而非临时参数手术。GrapNet研究这种图即网络的设置。图是架构和可执行程序,而非输入数据图。每个计算节点拥有其下一层子节点引用和与这些引用对齐的可训练分配向量;删除关系会物理移除子节点引用和相应的分配坐标。结构规则和执行策略位于节点核心之外,因此同一子节点拥有的图可以被增长、冻结、结构编辑、分组为可训练族块、通过注意力在活动关系上路由,或在拓扑稳定后降级为密集快照。GrapNet通过向量值父接口与常规模块组合:密集层、CNN编码器、ResNet特征提取器、注意力块和Transformer表示都可以为每个坐标提供一个感知GrapNode。评估组织为可编程性压力测试套件,而非新的重放基准。在匹配的十种子Split Fashion-MNIST研究中,可塑GrapNet+ER头在相同已见类损失和重放记忆下达到63.16%的已见类准确率,而参数更大的密集MLP+ER为51.08%,配对差值为12.08点,p=1.3e-5。在Split CIFAR-10上使用冻结的ImageNet ResNet-18编码器时,相同基板将在线头比MLP-256提高3.81点,p=0.0026。这些结果支持GrapNet作为可编辑的神经图基板,其核心价值在于具有忠实执行视图的结构可编程性。

英文摘要

Programmability is a missing first-class interface in fixed-tensor neural networks: editing a relation, freezing a subgraph, auditing a local function, or changing the execution backend should be an operation on the neural program rather than ad-hoc parameter surgery. GrapNet studies this graph-as-network setting. The graph is the architecture and executable program, not an input data graph. Each compute node owns its next-layer child references and a trainable allocation vector aligned with those references; deleting a relation physically removes both the child reference and the corresponding allocation coordinate. Structural rules and execution policies live outside the node core, so the same child-owned graph can be grown, frozen, structurally edited, grouped into trainable family blocks, routed by attention over active relations, or lowered to dense snapshots after topology stabilizes. GrapNet composes with conventional modules through a vector-valued parent interface: dense layers, CNN encoders, ResNet feature extractors, attention blocks, and transformer representations can all feed one sensory GrapNode per coordinate. The evaluation is organized as a programmability stress suite rather than as a new replay benchmark. In a matched ten-seed Split Fashion-MNIST study, a plastic GrapNet+ER head reaches 63.16 percent seen-class accuracy versus 51.08 percent for a parameter-larger dense MLP+ER under the same seen-class loss and replay memory, with paired delta 12.08 points and p=1.3e-5. On Split CIFAR-10 with a frozen ImageNet ResNet-18 encoder, the same substrate improves the online head over MLP-256 by 3.81 points, with p=0.0026. These results support GrapNet as an editable neural graph substrate whose core value is structural programmability with faithful execution views.
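The child-owned relation structure described above can be sketched as a small data structure. This is an illustrative assumption, not the GrapNet API: the class name `ComputeNode` and its methods are hypothetical, and the sketch only shows the invariant that the child list and the allocation vector stay aligned when a relation is deleted.

```python
class ComputeNode:
    """Toy node that owns its child references and an aligned allocation vector."""

    def __init__(self, name):
        self.name = name
        self.children = []  # references to next-layer nodes
        self.alloc = []     # one trainable coordinate per child, same order

    def add_relation(self, child, weight=0.0):
        self.children.append(child)
        self.alloc.append(weight)

    def delete_relation(self, child):
        # Physically remove both the reference and its allocation coordinate,
        # so no masked-out parameter is left behind.
        idx = next(i for i, c in enumerate(self.children) if c is child)
        del self.children[idx]
        del self.alloc[idx]


parent = ComputeNode("p")
a, b = ComputeNode("a"), ComputeNode("b")
parent.add_relation(a, 0.3)
parent.add_relation(b, 0.7)
parent.delete_relation(a)
print([c.name for c in parent.children], parent.alloc)
```

Deleting by identity (`is`) rather than by value keeps the edit unambiguous when two children happen to carry the same name or weight.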

2606.18922 2026-06-18 cs.CL cs.AI 新提交

As Easy as Rocket Science: Assessing the Ability of Large Language Models to Interpret Negation in Figurative Language

像火箭科学一样简单:评估大型语言模型解释比喻语言中否定能力的研究

Jasmine Owers, Edwin Simpson, Martha Lewis

发表机构 * Intelligent Systems Lab University of Bristol(智能系统实验室 英国布里斯托尔大学) ILLC University of Amsterdam(阿姆斯特丹大学逻辑、语言与计算研究所)

AI总结 本研究通过开发新的注释数据集,测试多种大型语言模型在比喻语言中理解否定的能力,发现否定与比喻的组合对模型构成挑战,且性能高度依赖提示风格。

Comments 16 pages, 16 figures; for associated code and data see https://github.com/jrdowers/Negation-and-Fig-Lang; To be published in Transactions of the Association for Computational Linguistics

详情
AI中文摘要

比喻语言和否定是当前语言模型面临挑战的两个领域,然而,两者在书面和口语中广泛使用。大型语言模型(LLMs)也广泛应用于日常场景,在这些场景中它们不一定能针对特定数据集进行调整。因此,理解LLMs正确解释包含否定和比喻语言的文本的能力至关重要。为了研究这一点,我们为现有的比喻语言数据集开发了一套新的注释,并在该数据集上测试了一系列语言模型。我们发现,否定和比喻性的结合可能带来特殊挑战,并且整体性能以及不同否定类型上的性能特别依赖于所使用的提示风格。

英文摘要

Figurative language and negation are two areas that challenge current language models; however, both are widely used throughout written and spoken language. Large language models (LLMs) are also widely used in everyday contexts where they cannot necessarily be tuned for a specific dataset. It is therefore essential to understand the ability of LLMs to correctly interpret text that includes both negation and figurative language. To investigate this, we develop a set of new annotations for an existing dataset of figurative language, and test a range of language models on the dataset. We find that the combination of negation and figurativeness can present a particular challenge, and that performance overall and across different negation types is particularly dependent on the prompt style used.

2606.18918 2026-06-18 cs.LG cs.CC 新提交

Some Complexity Results for Robustness Verification for Binarized Neural Networks

二值化神经网络鲁棒性验证的一些复杂性结果

Harshit Goyal, Sudakshina Dutta

发表机构 * Indian Institute of Technology Goa(印度理工学院果阿分校)

AI总结 本文通过从布尔可满足性问题归约证明二值化神经网络的可满足性是NP完全的,并利用均匀遮挡导致的网络输出分段常数结构,提出多项式时间鲁棒性检查算法。

详情
AI中文摘要

本文研究了二值化神经网络(BNNs)验证问题的计算复杂性,其中激活函数(有时权重)是二值的。我们分析了两个问题:可满足性和均匀图像遮挡下的鲁棒性。我们通过从布尔可满足性问题(SAT)归约证明BNN可满足性是NP完全的,并且均匀遮挡在网络输出中诱导出分段常数结构,从而实现了多项式时间的鲁棒性检查算法。

英文摘要

This paper studies the computational complexity of verification problems for Binarized Neural Networks (BNNs), where activations (and sometimes weights) are binary. We analyze two problems: satisfiability and robustness under uniform image occlusion. We show that BNN satisfiability is NP-complete via a reduction from the Boolean satisfiability problem (SAT), and that uniform occlusion induces a piecewise-constant structure in the network output, enabling a polynomial-time robustness-checking algorithm.
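One way to see why a piecewise-constant output permits a polynomial-time check is the sketch below. It is an illustration under stated assumptions, not the paper's algorithm: it assumes a single hidden layer with sign activations, and it models uniform occlusion as replacing a fixed set of input pixels with one shared value v in [lo, hi]. Under those assumptions each hidden pre-activation is affine in v, so each neuron flips at most once and the output only needs to be evaluated at one representative per interval. All names are hypothetical, and floating-point ties exactly at a breakpoint are not handled rigorously.

```python
def sign(z):
    return 1 if z >= 0 else -1


def forward(x, W1, b1, w2, b2):
    hidden = [sign(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    return sign(sum(w * h for w, h in zip(w2, hidden)) + b2)


def robust_under_occlusion(x, occluded, W1, b1, w2, b2, lo=0.0, hi=1.0):
    """Check that the label is unchanged for every occlusion value v in [lo, hi]."""
    def occlude(v):
        return [v if i in occluded else xi for i, xi in enumerate(x)]

    points = {lo, hi}
    for row, bias in zip(W1, b1):
        slope = sum(row[i] for i in occluded)
        offset = sum(w * xi for i, (w, xi) in enumerate(zip(row, x)) if i not in occluded) + bias
        if slope != 0:
            v = -offset / slope  # the only value where this neuron can flip
            if lo < v < hi:
                points.add(v)
    pts = sorted(points)
    # Breakpoints themselves plus one midpoint per interval between them.
    representatives = pts + [(left + right) / 2 for left, right in zip(pts, pts[1:])]
    target = forward(x, W1, b1, w2, b2)
    return all(forward(occlude(v), W1, b1, w2, b2) == target for v in representatives)


x = [0.2, 0.8]
W1, b1 = [[1, 0], [0, 1]], [-0.5, -0.5]
print(robust_under_occlusion(x, {0}, W1, b1, [1, 1], -1))  # output depends on pixel 0
print(robust_under_occlusion(x, {0}, W1, b1, [0, 1], 0))   # output ignores pixel 0
```

The number of representatives grows linearly with the number of hidden neurons, which is the source of the polynomial bound in this simplified setting.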

2606.18910 2026-06-18 cs.LG cs.CL 新提交

REVES: REvision and VErification--Augmented Training for Test-Time Scaling

REVES:通过修订与验证增强的测试时扩展训练

Yuanxin Liu, Ruida Zhou, Xinyan Zhao, Amr Sharaf, Hongzhou Lin, Arijit Biswas, Mohammad Ghavamzadeh, Zhaoran Wang, Mingyi Hong

发表机构 * Northwestern University(西北大学) Amazon AGI(亚马逊人工智能实验室) Qualcomm AI Research(高通人工智能研究) University of Minnesota(明尼苏达大学)

AI总结 提出REVES框架,通过将中间步骤的“接近正确”答案转化为解耦的修订和验证提示,实现高效的离策略数据生成,提升大语言模型的多步推理能力,在LiveCodeBench上比强化学习基线高6.5分。

详情
AI中文摘要

通过顺序修订进行测试时扩展已成为增强大语言模型(LLM)推理能力的强大范式。然而,标准的后训练方法主要优化单次目标,与多步推理动态存在根本性不匹配。虽然最近的工作将其视为多轮强化学习(RL),但传统方法直接优化多步轨迹,未能进一步利用中间步骤中的高质量错误,而模型本可以通过纠正这些错误来学习。我们提出了一个两阶段迭代框架,交替进行在线数据/提示增强和策略优化。通过将成功恢复轨迹中的中间步骤(“接近正确”答案)转化为解耦的修订和验证提示,我们的方法将训练集中在有效的答案转换和错误识别上。与标准的多轮RL相比,这种方法实现了高效的离策略数据生成,并减少了长程采样的计算开销。在LiveCodeBench上,使用公开可用的测试用例作为反馈,我们观察到比RL基线高6.5分,比标准多轮训练高4.0分。除了编码,我们的方法在圆填充问题上达到了先前报告的SOTA结果,同时使用了最小的基础模型(4B)和远少于更大进化搜索系统的采样次数。在真实验证下的数学结果进一步证实了改进的纠正能力。该方法还泛化到分布外的约束满足谜题,如n皇后和迷你数独,其中正确性完全由问题约束定义。代码可在 https://github.com/yxliu02/REVES.git 获取。

英文摘要

Test-time scaling via sequential revision has emerged as a powerful paradigm for enhancing Large Language Model (LLM) reasoning. However, standard post-training methods primarily optimize single-shot objectives, creating a fundamental misalignment with multi-step inference dynamics. While recent work treats this as multi-turn reinforcement learning (RL), conventional approaches optimize over the multi-step trajectories directly, failing to exploit the high-quality mistakes in intermediate steps, which the model can learn from by correcting them. We propose a two-stage iterative framework that alternates between online data/prompt augmentation and policy optimization. By converting the intermediate steps ("near-miss" answers) in the successful recovery trajectories into decoupled revision and verification prompts, our approach concentrates training on both effective answer transformation and error identification. This approach enables efficient off-policy data generation and reduces the computational overhead of long-horizon sampling compared to standard multi-turn RL. On LiveCodeBench, using publicly available test cases as feedback, we observe gains of +6.5 points over the RL baseline and +4.0 points over standard multi-turn training. Beyond coding, our approach matches the previously reported SOTA result on circle packing while using the smallest base model (4B) and far fewer rollouts than the much larger evolutionary search systems. Math results under ground-truth verification further confirm improved correction ability. It also generalizes to out-of-distribution constraint-satisfaction puzzles such as n_queens and mini_sudoku, where correctness is defined entirely by problem constraints. Code is available at https://github.com/yxliu02/REVES.git.
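The prompt-augmentation step can be pictured with a short sketch. The trajectory format, the prompt templates, and the function name `build_prompts` are assumptions made for illustration; the actual REVES data format is defined in the linked repository, not here.

```python
def build_prompts(problem, attempts):
    """Turn one multi-turn trajectory into decoupled training prompts.

    attempts: ordered list of (answer, feedback, is_correct) tuples.
    Only trajectories that end in a correct answer are used, so every
    intermediate wrong answer is a near-miss with a known recovery.
    """
    if not attempts or not attempts[-1][2]:
        return []
    prompts = []
    for answer, feedback, is_correct in attempts[:-1]:
        if is_correct:
            continue
        prompts.append({
            "kind": "revision",
            "text": f"Problem: {problem}\nPrevious answer: {answer}\n"
                    f"Feedback: {feedback}\nRevise the answer.",
        })
        prompts.append({
            "kind": "verification",
            "text": f"Problem: {problem}\nCandidate answer: {answer}\n"
                    f"Is this answer correct?",
            "label": "incorrect",
        })
    return prompts


trajectory = [("x = 3", "test 2 fails", False), ("x = 4", "all tests pass", True)]
for prompt in build_prompts("Solve 2x = 8", trajectory):
    print(prompt["kind"])
```

Because each near-miss becomes a standalone single-turn prompt, the training data can be regenerated without replaying the full multi-turn rollout, which is the efficiency argument the abstract makes.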

2606.18906 2026-06-18 cs.CV 新提交

BindEdit: Taming Attention Leakage for Precise Multi-Object Image Editing

BindEdit: 驯服注意力泄漏以实现精确的多目标图像编辑

Chaewon Park, Soyoon Lee, Naeun Lee, Minjung Shin, Seogkyu Jeon, Kibeom Hong

发表机构 * Sookmyung Women’s University(淑明女子大学) Yonsei University(延世大学) Samsung Research(三星研究院)

AI总结 针对多目标图像编辑中的语义混合和对象重复问题,提出BindEdit方法,通过联合正则化交叉注意力和自注意力、交叉注意力重平衡机制及区域保真项,在单次扩散轨迹内抑制注意力泄漏,实现精确编辑。

Comments Preprint

详情
AI中文摘要

真实图像编辑能够精确操作视觉内容,但现有方法在复杂的多目标场景中常常失败,导致语义混合、对象重复或编辑不完整。我们将这些失败归因于注意力泄漏,即在去噪过程中,跨空间区域和文本标记的信号变得纠缠。具体来说,我们识别出两种不同形式的泄漏:编辑-标记泄漏,其中模糊的标记-区域对齐导致对象混合;以及源主导泄漏,其中未改变的源对象的标记压倒了目标实体应有的注意力。为了解决这些泄漏,我们提出了BindEdit,它在单次扩散轨迹内强制执行注意力级别的约束。为了抑制编辑-标记泄漏,BindEdit联合正则化交叉注意力和自注意力,使得每个目标标记组绑定到其对应的空间区域,同时保持实例级别的分离。为了抑制源主导泄漏,一种交叉注意力重平衡机制放大目标标记的影响,并减弱可编辑区域内残留的源语义。此外,区域保真项确保每个目标概念在整个编辑掩码中连贯表达。另外,我们提出了一个全面的多目标基准,涵盖不同的对象数量和类别。大量实验表明,BindEdit在单次扩散轨迹内始终优于现有方法,在单目标和多目标编辑场景中均保持稳健性能。

英文摘要

Real image editing enables precise manipulation of visual content, yet existing methods often fail in complex multi-object scenarios, causing semantic blending, object duplication, or incomplete edits. We attribute these failures to attention leakage, where signals across spatial regions and text tokens become entangled during the denoising process. Specifically, we identify two distinct forms of leakage: Edit-Token Leakage, where ambiguous token-region alignment leads to object blending, and Source Dominance Leakage, where tokens of unchanged source objects overwhelm the attention intended for target entities. To resolve these leakages, we propose BindEdit, which enforces attention-level constraints within a single diffusion trajectory. To suppress Edit-Token Leakage, BindEdit jointly regularizes cross- and self-attention so that each target token group is bound to its corresponding spatial region while maintaining instance-level separation. To suppress Source Dominance Leakage, a cross-attention re-balancing mechanism amplifies target token influence and attenuates residual source semantics within editable regions. Moreover, a region fidelity term ensures that each target concept is expressed coherently across the entire editing mask. Additionally, we propose a comprehensive multi-object benchmark encompassing diverse object counts and categories. Extensive experiments demonstrate that BindEdit consistently outperforms existing methods within a single diffusion trajectory, maintaining robust performance across both single- and multi-object editing scenarios.

2606.18902 2026-06-18 cs.CL 新提交

SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration

SAGE: 基于智能体引导探索的随机提示优化

Ziyi Zhu, Luka Smyth, Saki Shinoda, Jinghong Chen

发表机构 * Slingshot AI Department of Engineering, University of Cambridge(剑桥大学工程系)

AI总结 提出随机提示优化框架SPO,其中SAGE方法通过多智能体诊断代码执行实现黑盒搜索,在多个基准测试中表现依赖于错误类型,并在心理健康聊天机器人中通过连续优化显著提升次日留存率。

详情
AI中文摘要

上下文工程已成为无需参数更新即可改进AI系统的主要手段。最近研究表明文本梯度并非真实梯度,这促使我们将自动提示优化(APO)视为黑盒搜索。我们引入了SPO(随机提示优化),一个在提示空间上进行随机搜索的框架,并比较了三种复杂度递增的策略:基于错误信息的随机搜索、带有进化算子的遗传算法以及SAGE(基于智能体引导探索的SPO),后者是一个具有诊断代码执行的多智能体流水线。在三个基准测试中,没有单一策略占主导地位;有效性取决于景观结构与错误类型的相互作用。我们进一步在连续优化范式下将SAGE部署到一个心理健康聊天机器人上,它将八个个体噪声A/B测试周期累积为次日留存率的统计显著提升。我们认为,将定性诊断与定量验证相结合是使智能体优化对开放式任务导向对话有效的关键。

英文摘要

Context engineering has emerged as a primary lever for improving AI systems without parameter updates. Recent work showing that textual gradients do not function as real gradients motivates treating automatic prompt optimization (APO) as black-box search. We introduce SPO (Stochastic Prompt Optimization), a framework for stochastic search over prompt space, and compare three strategies of increasing sophistication: error-informed random search, a genetic algorithm with evolutionary operators, and SAGE (SPO via Agent-Guided Exploration), a multi-agent pipeline with diagnostic code execution. Across three benchmarks, no single strategy dominates; effectiveness depends on the interaction of landscape structure with error type. We further deploy SAGE on a mental-health chatbot under a continuous optimization paradigm, where it compounds eight cycles of individually-noisy A/B tests into a statistically robust gain in next-day retention. We argue that coupling qualitative diagnosis with quantitative validation is what makes agentic optimization effective for open-ended task-oriented dialogue.
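The simplest of the three strategies, error-informed random search, can be sketched as a black-box loop. Everything below is a toy stand-in: the evaluator, the proposal rule, and the function name `random_search` are hypothetical, and in practice both `evaluate` and `propose` would call a language model rather than do string matching.

```python
import random


def random_search(prompt, evaluate, propose, steps, seed=0):
    """Greedy black-box search: sample an observed error, propose an edit, keep it if better."""
    rng = random.Random(seed)
    best = prompt
    best_score, errors = evaluate(best)
    for _ in range(steps):
        if not errors:
            break
        candidate = propose(best, rng.choice(errors))
        score, candidate_errors = evaluate(candidate)
        if score > best_score:  # accept only strict improvements
            best, best_score, errors = candidate, score, candidate_errors
    return best, best_score


# Toy task: the prompt is scored by how many required reminders it contains.
REQUIRED = ["units", "format", "sign"]


def evaluate(prompt):
    missing = [word for word in REQUIRED if word not in prompt]
    return len(REQUIRED) - len(missing), missing


def propose(prompt, error):
    return f"{prompt} Mind the {error}."


print(random_search("Answer the question.", evaluate, propose, steps=10))
```

No gradient of any kind is used: the search only sees scores and error samples, which matches the abstract's framing of prompt optimization as black-box search.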

2606.18898 2026-06-18 cs.LG 新提交

Anomaly Detection for Sparse and Irregular Multivariate Time Series with Latent SDEs

基于潜在随机微分方程的稀疏不规则多元时间序列异常检测

Martin Uray, Dominik Geng, Florian Graf, Stefan Huber, Roland Kwitt

发表机构 * Josef Ressel Centre for Intelligent and Secure Industrial Automation, University of Applied Sciences, Salzburg, Austria(约瑟夫·雷斯尔智能与安全工业自动化中心,应用科学大学,萨尔茨堡,奥地利) University of Salzburg, Austria(萨尔茨堡大学,奥地利)

AI总结 针对现实世界中稀疏、不规则采样的多元时间序列,提出基于潜在随机微分方程的生成方法,将观测投影到连续时间随机动力系统,处理缺失和不规则采样,并捕获循环行为,在六个基准数据集上取得最优结果。

Comments Preprint

详情
AI中文摘要

多元时间序列异常检测(MTSAD)在工业监控、网络安全或医疗保健等广泛应用领域至关重要。现实世界的数据通常是稀疏的、不规则采样的或部分观测的,但现有方法假设时间序列均匀采样。我们提出了一种基于潜在随机微分方程的生成方法,将观测到的时间序列投影到一个连续时间随机动力系统上,能够直接处理缺失观测和不规则采样,同时自然捕获许多现实世界用例固有的可能循环行为。在六个异常基准数据集上的实验表明,我们提出的方法在现有最先进基线中排名第一。我们进一步证明,在严重数据稀疏性下,我们的方法保持鲁棒性,而测试的基线方法性能显著下降。这些结果突显了潜在随机微分方程作为多元时间序列异常检测的自然归纳偏置,尤其是在存在现实世界不规则性的情况下。

英文摘要

Multivariate time series anomaly detection (MTSAD) is critical for a wide range of application areas, such as industrial monitoring, cybersecurity, or healthcare. Real-world data is often sparse, irregularly sampled or partially observed, yet existing methods assume uniformly sampled time series. We propose a generative approach based on latent SDEs that projects the observed time series onto a continuous-time stochastic dynamical system. It can directly handle missing observations and irregular sampling, while also naturally capturing the cyclic behavior that many real-world use cases inherently possess. Experiments on six anomaly benchmark datasets show that our proposed method ranks first among state-of-the-art baselines. We further demonstrate that our method remains robust under severe data sparsity, while performance significantly degrades for the tested baseline methods. These results highlight latent SDEs as a natural inductive bias for anomaly detection in multivariate time series, especially in the presence of real-world irregularities.
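Why a continuous-time model copes with irregular sampling can be seen from a minimal Euler-Maruyama sketch: the integrator simply uses whatever gap lies between consecutive timestamps. This shows only the simulation of a one-dimensional SDE; the learned latent drift and diffusion networks, the encoder, and the anomaly score of the paper are not modelled here, and the function names are assumptions.

```python
import math
import random


def euler_maruyama(x0, times, drift, diffusion, rng):
    """Simulate dx = drift(x, t) dt + diffusion(x, t) dW at arbitrary timestamps."""
    path = [x0]
    for t_prev, t_next in zip(times, times[1:]):
        dt = t_next - t_prev                  # gaps may differ from step to step
        x = path[-1]
        dw = rng.gauss(0.0, math.sqrt(dt))    # Brownian increment over this gap
        path.append(x + drift(x, t_prev) * dt + diffusion(x, t_prev) * dw)
    return path


# Mean-reverting (Ornstein-Uhlenbeck style) dynamics observed at irregular times.
times = [0.0, 0.1, 0.15, 0.9, 1.0, 2.5]
path = euler_maruyama(
    x0=1.0,
    times=times,
    drift=lambda x, t: -x,
    diffusion=lambda x, t: 0.2,
    rng=random.Random(0),
)
print(len(path) == len(times))
```

A model that assumes uniform sampling would need imputation or resampling before it could consume the same timestamps.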

2606.18894 2026-06-18 cs.CV 新提交

Automatic ply-specific analyses of CFRP micrographs using shortest-path-based ply distinction

基于最短路径的碳纤维增强聚合物显微图像自动铺层分析

Jonas Naumann, Jonas P. Appels, Julius Biermann, Christopher Gorsky, Timo de Wolff, Christoph Brauer

发表机构 * German Aerospace Center (DLR)(德国航空航天中心(DLR)) Institute of Lightweight Systems(轻质系统研究所) Composite Process Technologies(复合材料加工技术) Institute of Analysis and Algebra(分析与代数研究所)

AI总结 提出一种自动方法,通过将语义分割掩码视为图并应用最短路径算法区分铺层实例,实现高分辨率CFRP显微图像的铺层分割与定量分析。

详情
AI中文摘要

我们提出了一种自动方法,用于在高分辨率碳纤维增强聚合物显微图像的语义分割掩码中区分铺层实例。将分割掩码解释为以像素为顶点的图,使我们能够使用最短路径算法生成铺层分隔路径。从而,我们利用全局信息弥合了语义分割和铺层实例分割之间的差距。我们成功地将该方法应用于具有广泛特征的高分辨率显微图像,例如单层或多层中人为添加的间隙、不同的堆叠顺序以及贯穿铺层的裂纹。基于计算出的路径将每个纤维像素分配给一个铺层,可以对其微观结构特性(如局部纤维体积分数以及局部分辨的铺层和中间层厚度)进行全面的定量铺层分析。这些见解有助于揭示制造引起的不均匀性,得出关于制造参数的结论,并将力学性能与潜在的微观结构缺陷联系起来。

英文摘要

We present an automated approach to distinguish between ply instances in semantic segmentation masks of high-resolution carbon-fiber-reinforced polymer micrographs. Interpreting the segmentation mask as a graph with pixels as vertices enables us to use a shortest-path algorithm that yields the ply-separating paths. We thereby bridge the gap between semantic segmentation and ply instance segmentation using global information. We successfully apply our approach to high-resolution micrographs featuring a broad range of characteristics, such as artificially added gaps in single or multiple plies, different stacking sequences, and ply-traversing cracks. Assigning each fiber pixel to a ply based on the calculated paths allows for a comprehensive, quantitative ply analysis with respect to microstructural properties such as the local fiber volume fraction and the locally resolved ply and interleaf layer thickness. These insights help to reveal manufacturing-induced inhomogeneities, draw conclusions about manufacturing parameters, and link mechanical properties to underlying microstructural imperfections.
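The idea of treating the mask as a pixel graph can be sketched with Dijkstra's algorithm on a tiny binary mask. The cost values, the 4-neighbourhood, and the left-to-right path direction are assumptions for illustration; the abstract does not specify the edge weights or graph construction the authors use.

```python
import heapq


def separating_path(mask, fiber_cost=100, resin_cost=1):
    """Cheapest left-to-right pixel path; crossing fiber pixels (1) is expensive."""
    rows, cols = len(mask), len(mask[0])

    def cost(r, c):
        return fiber_cost if mask[r][c] else resin_cost

    # Every pixel in the left column is a possible start.
    dist = {(r, 0): cost(r, 0) for r in range(rows)}
    prev = {}
    heap = [(d, node) for node, d in dist.items()]
    heapq.heapify(heap)
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if d > dist[(r, c)]:
            continue  # stale heap entry
        if c == cols - 1:
            path = [(r, c)]
            while path[-1] in prev:
                path.append(prev[path[-1]])
            return path[::-1]
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + cost(nr, nc)
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = (r, c)
                    heapq.heappush(heap, (nd, (nr, nc)))
    return []


# 1 = fiber pixel, 0 = resin; the resin channel between two plies is not straight.
mask = [
    [1, 1, 1, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
]
print(separating_path(mask))
```

Pixels above the returned path can then be assigned to one ply and pixels below it to the next, which turns a semantic mask into ply instances.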

2606.18893 2026-06-18 cs.CL 新提交

Learning Robust Pair Confidence for Multimodal Emotion-Cause Pair Extraction

学习鲁棒的成对置信度用于多模态情感-原因对提取

Zhuangzhuang Pan, Ning Dong, Yingna Su, Yan Xia

发表机构 * Institute for Advanced Studies(先进研究院) Universiti Malaya(马来亚大学) School of Information Engineering(信息工程学院) Suqian University(宿迁学院) Digitization Department(数字化部门)

AI总结 提出RPCL框架,通过置信度差异边界约束和对抗性扰动,增强多模态情感-原因对提取中成对置信度的判别性和稳定性,在三个数据集上提升Pair F1约2.6-2.8个百分点。

Comments 11 pages, 3 figures, 5 tables

详情
AI中文摘要

多模态情感-原因对提取(MECPE)需要候选对上的可靠成对置信度。现有的成对评分器通常对有效候选使用成对级别的交叉熵,这大多独立地处理链接。这使得竞争原因之间的相对置信度几何结构约束不足,允许黄金对接近硬负例或依赖偶然的非黄金上下文。我们将这种脆弱性研究为成对置信度脆弱性,并提出RPCL(鲁棒成对置信度学习),一种仅用于训练的成对置信度学习框架。RPCL鼓励成对置信度既具有判别性又具有稳定性:通过置信度差异边界约束将黄金对与行方向硬负例分离,并将干净成对预测与来自损坏视图的预测对齐,其中非黄金上下文话语表示被部分损坏。在推理时,原始的干净成对评分器和解码流水线保持不变。在ECF、MECAD和MEC4上,RPCL在全文本-音频-视频设置下将三种子平均Pair F1相对于匹配基线模型提高了2.58到2.83个百分点,并在所有三个数据集上提高了平均Pair AUPRC。诊断分析进一步显示更大的黄金-负例置信度差距和更低的边界违反严重性。这些结果表明,显式塑造成对置信度是MECPE的一种有效训练策略。

英文摘要

Multimodal emotion-cause pair extraction (MECPE) requires reliable pair confidence over candidate pairs. Existing pair scorers commonly use pair-level cross entropy over valid candidates, which treats links mostly independently. This leaves the relative confidence geometry among competing causes under-constrained, allowing gold pairs to stay close to hard negatives or rely on incidental non-gold context. We study this vulnerability as pair-confidence brittleness and propose RPCL (Robust Pair Confidence Learning), a training-only framework for pair-confidence learning. RPCL encourages pair confidence to be both discriminative and stable: gold pairs are separated from row-wise hard negatives through a confidence-difference margin constraint, and clean pair predictions are aligned with predictions from a corrupted view where non-gold contextual utterance representations are partially corrupted. The original clean pair scorer and decoding pipeline are used unchanged at inference time. On ECF, MECAD, and MEC4, RPCL improves the three-seed mean Pair F1 over a matched base model by 2.58 to 2.83 percentage points in the full text-audio-video setting, and improves mean Pair AUPRC on all three datasets. Diagnostic analysis further shows larger gold-negative confidence gaps and lower margin-violation severity. These results suggest that explicitly shaping pair confidence is an effective training strategy for MECPE.
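A confidence-difference margin constraint of the kind described can be written as a hinge penalty on one row of pair confidences. The exact loss in RPCL is not given in the abstract, so the form below (gold confidence must exceed the hardest negative in the same row by a margin) is an assumed, standard instantiation with hypothetical names and numbers.

```python
def row_margin_loss(conf_row, gold_indices, margin=0.2):
    """Hinge penalty when a gold pair is not `margin` above the row's hardest negative."""
    negatives = [p for i, p in enumerate(conf_row) if i not in gold_indices]
    if not negatives or not gold_indices:
        return 0.0
    hardest = max(negatives)
    penalties = [max(0.0, margin - (conf_row[g] - hardest)) for g in gold_indices]
    return sum(penalties) / len(penalties)


# One emotion utterance (row) scored against three candidate cause utterances.
print(row_margin_loss([0.9, 0.8, 0.1], {0}))  # gold is only 0.1 above a hard negative
print(row_margin_loss([0.9, 0.3, 0.1], {0}))  # gap already exceeds the margin
```

Cross entropy alone would treat 0.9 versus 0.8 as acceptable as long as each link is scored well on its own; the margin term is what penalises the small relative gap.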

2606.18890 2026-06-18 cs.AI 新提交

Skill-Guided Continuation Distillation for GUI Agents

面向GUI代理的技能引导延续蒸馏

Zhimin Fan, Hongwei Yu, Yeqing Shen, Haolong Yan, Guozhen Peng, Tianhao Peng, Yudong Zhang, Xiaowen Zhang, Kaijun Tan, Zheng Ge, Xiangyu Zhang, Daxin Jiang

发表机构 * StepFun University of Science and Technology Beijing(北京科技大学) Tsinghua University(清华大学) Nanyang Technological University(南洋理工大学)

AI总结 提出技能引导延续蒸馏(SGCD)框架,通过技能引导策略生成成功延续轨迹,弥补专家轨迹中未覆盖的状态监督缺失,在OSWorld-Verified上将三个基础模型成功率从30%左右提升至50%以上。

详情
AI中文摘要

改进GUI代理通常依赖于在专家轨迹上的行为克隆。然而,当当前策略偏离专家策略时,在闭环执行过程中不可避免地会遇到策略导致的偏离轨迹状态,即超出专家轨迹的状态。由于专家轨迹未对这些未见状态提供演示,这些状态得不到有效监督,导致策略无法选择正确动作。为弥补这一监督缺口,我们提出技能引导延续蒸馏(SGCD),一种迭代式自我改进框架。SGCD首先在没有技能引导的情况下运行简单策略若干步,以到达真实的偏离轨迹状态。从这些状态出发,技能引导策略完成任务并生成成功的延续轨迹,这些轨迹与专家轨迹混合,为策略导致的偏离轨迹状态提供监督。技能从成功和失败的轨迹中提取,包括延续计划、关键目标、失败陷阱和成功标准。在OSWorld-Verified上,SGCD将三个基础模型的成功率从30%左右提升至超过50%,证明了其有效性和通用性。

英文摘要

Improving GUI agents typically relies on behavior cloning on expert trajectories. However, as the current policy deviates from the expert policy, it inevitably encounters policy-induced off-trajectory states during closed-loop execution, i.e., states that fall outside the expert trajectories. Since expert trajectories provide no demonstrations for these unseen states, such states receive no effective supervision, leaving the policy unable to select the correct action. To close this supervision gap, we propose Skill-Guided Continuation Distillation (SGCD), an iterative self-improvement framework. SGCD first runs the plain policy without skill guidance for a few steps to reach realistic off-trajectory states. From these states, a skill-guided policy then completes the task and produces successful continuations, which are mixed with expert trajectories to supply supervision over policy-induced off-trajectory states. The skills are extracted from both successful and failed rollouts, consisting of Continuation Plans, Critical Targets, Failure Traps, and Success Criteria. On OSWorld-Verified, SGCD improves the success rate of three base models from the low-30% range to over 50%, demonstrating its effectiveness and generality.

2606.18889 2026-06-18 cs.CL 新提交

Improving Medical Communication using Rubric-Guided Counterfactual Recommendations

利用评分引导的反事实推荐改善医疗沟通

Adrian Cosma, Nicoleta-Nina Basoc, Andrei Niculae, Cosmin Dumitrache, Emilian Radoi

发表机构 * IDSIA, Dalle Molle Institute for Artificial Intelligence(IDSIA,达勒莫利人工智能研究所) National University of Science and Technology POLITEHNICA Bucharest(布加勒斯特国立科技大学POLITEHNICA)

AI总结 提出一种语言模型引导的反事实推荐流程,通过调整语气、个性化等可解释沟通特征,在不影响医学内容的前提下提升患者积极反馈概率,平均提升6.41%。

Comments 4 Tables, 8 Figures

详情
AI中文摘要

基于文本的远程医疗越来越依赖轻量级的患者反馈,然而,此类反馈主要反映感知的沟通质量而非医学准确性。我们引入了一种语言模型引导的反事实推荐流程,该流程发现并优化可解释的沟通特征,如语气、个性化、可操作性和完整性,以解决患者关切,同时不干扰医学内容。这些特征与患者-医生互动元数据一起用于估计积极反馈。在推理时,系统搜索低成本的序数特征变化,并推荐最小的沟通变化,这些变化预计会增加积极反馈的概率,而独立的审计模型测试这些增益是否超出选择模型的泛化能力。在互动中,推荐在独立审计下平均带来+6.41%的预测积极反馈概率增益,且93.31%的推荐为非负。这些结果表明,小的、可解释的沟通变化可以捕获大部分预测增益,同时保留医生对医学推理和最终措辞的控制。

英文摘要

Text-based telemedicine increasingly relies on lightweight patient feedback; however, such feedback primarily reflects perceived communication quality rather than medical accuracy. We introduce an LM-guided counterfactual recommendation pipeline that discovers and refines interpretable communication features such as tone, personalization, actionability and completeness in addressing patient concerns, without interfering with the medical content. These features are used together with patient-doctor interaction metadata to estimate positive feedback. At inference time, the system searches over low-cost ordinal feature changes and recommends minimal communication changes predicted to increase the probability of positive feedback, while independent auditor models test whether these gains generalize beyond the selection model. Across interactions, recommendations yield a mean +6.41% gain in predicted positive feedback probability under independent auditors, and are non-negative for 93.31% of recommendations. These results suggest that small, interpretable communication changes can capture most predicted gains while preserving the doctor's control over medical reasoning and final wording.
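The inference-time search over low-cost ordinal feature changes can be sketched as an enumeration of small upward moves on each feature. The feature scale (1 to 5), the step budget, the linear toy predictor, and the name `recommend` are all assumptions; the paper's selection and auditor models are learned and are not reproduced here.

```python
def recommend(features, predict, max_level=5, budget=1):
    """Return (gain, (feature, new_level)) for the best change within `budget` ordinal steps."""
    base = predict(features)
    best = (0.0, None)
    for name, level in features.items():
        for new_level in range(level + 1, min(level + budget, max_level) + 1):
            gain = predict({**features, name: new_level}) - base
            if gain > best[0]:
                best = (gain, (name, new_level))
    return best


def toy_predict(features):
    """Hypothetical stand-in for the selection model's positive-feedback probability."""
    score = 0.5 + 0.05 * features["tone"] + 0.02 * features["personalization"]
    return min(score, 1.0)


reply = {"tone": 3, "personalization": 2}
gain, change = recommend(reply, toy_predict)
print(change)
```

In the pipeline described above, the recommended change would then be re-scored by independent auditor models, so that a gain reported by the selection model alone is not taken at face value.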

2606.18888 2026-06-18 cs.AI 新提交

Generative-Model Predictive Planning for Navigation in Partially Observable Environments

部分可观测环境下导航的生成模型预测规划

Thomas Quilter, Yifan Zhu, Guorui Quan, Mingfei Sun, Samuel Kaski

发表机构 * University of Manchester(曼彻斯特大学) Aalto University(阿尔托大学)

AI总结 提出BeliefDiffusion框架,结合扩散模型和模型预测控制,显式建模多模态信念分布并进行前瞻规划,在合成地图环境中显著优于无模型强化学习和生成方法。

详情
AI中文摘要

部分可观测环境中的导航对自主智能体构成重大挑战,需要在未知环境中利用有限的感知信息做出有效决策。基于信念的方法,特别是那些使用神经网络近似信念空间的方法,往往无法捕捉信念空间固有的多模态性,尤其是在具有感知混淆的高维情况下。虽然生成模型提供了一种有吸引力的替代方案,但它们通常需要大量数据或专家演示,并且缺乏长期规划的显式机制。在本文中,我们介绍了BeliefDiffusion,一种结合了生成和规划优势的新框架。BeliefDiffusion利用扩散模型显式表征多模态信念分布,并利用模型预测控制(MPC)同时进行前瞻规划。它包含两个步骤:(1)基于观测历史想象合理的环境配置;(2)在聚合的配置上规划高效的导航策略。通过在合成地图环境中的大量实验,我们证明BeliefDiffusion在导航成功率和路径效率上显著优于无模型强化学习基线和其它生成方法。我们的结果验证了将多模态信念表示显式纳入规划能够在部分可观测设置中实现更鲁棒的导航。

英文摘要

Navigation in partially observable environments presents a significant challenge for autonomous agents, requiring effective decision-making with limited sensory information in unknown environments. Belief-based methods, particularly those using neural networks to approximate the belief space, often fail to capture the inherent multimodality of belief spaces, especially in high-dimensional cases with perceptual aliasing. While generative models present a compelling alternative, they typically require substantial data or expert demonstrations and lack explicit mechanisms for long-term planning. In this paper, we introduce BeliefDiffusion, a novel framework that combines the benefits of both generation and planning. BeliefDiffusion leverages diffusion models to explicitly characterize multimodal belief distributions and utilizes Model Predictive Control (MPC) to simultaneously plan ahead. It consists of two steps: (1) imagining plausible environment configurations based on observation history and (2) planning efficient navigation strategies across the aggregated configurations. Through extensive experiments in synthetic map environments, we demonstrate that BeliefDiffusion significantly outperforms both model-free reinforcement learning baselines and other generative approaches in navigation success rate and path efficiency. Our results validate that explicitly incorporating multimodal belief representations into planning enables more robust navigation in partially observable settings.
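The two-step loop (imagine configurations, then plan across them) can be sketched as a receding-horizon choice over sampled maps. In this toy version the sampled maps are written out by hand instead of being drawn from a diffusion model, and the cost function and the name `choose_first_step` are hypothetical.

```python
def choose_first_step(plans, sampled_maps, cost):
    """MPC-style step: pick the plan with the lowest mean cost over sampled maps,
    then commit only to its first move before re-planning."""
    def expected_cost(plan):
        return sum(cost(plan, blocked) for blocked in sampled_maps) / len(sampled_maps)

    best_plan = min(plans, key=expected_cost)
    return best_plan[0]


def cost(plan, blocked):
    """Path length plus a large penalty for every cell that is blocked in this map."""
    return len(plan) + 10 * sum(cell in blocked for cell in plan)


# Two candidate routes to the goal and three imagined maps (sets of blocked cells).
plans = [["a", "goal"], ["b", "c", "goal"]]
sampled_maps = [{"a"}, {"a"}, set()]
print(choose_first_step(plans, sampled_maps, cost))
```

Averaging over several imagined maps is what lets the planner prefer the longer route when most plausible configurations block the short one, instead of committing to a single most-likely map.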

2606.18886 2026-06-18 cs.CV New submission

DINO-Med3D: Bridging Dimension and Domain Gaps in Volumetric Segmentation via Progressive Adaptation

DINO-Med3D:通过渐进式适应弥合体分割中的维度与领域差距

Haoyu Hu, Xiyao Ma, Shiqi Liu, Linsen Zhang, Xiaoliang Xie, Xiaohu Zhou, Zeng-Guang Hou

Affiliations * University of Chinese Academy of Sciences(中国科学院大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所)

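AI Summary Proposes DINO-Med3D, a two-stage progressive framework that adapts DINOv3 to 3D medical segmentation through a multi-slice embedding module, 3D adapters, and a parallel detail recovery stream, outperforming existing methods on five datasets.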

Comments Accepted at MICCAI 2026. The camera-ready version and link will be made publicly available upon publication.

Details
AI Chinese Abstract

尽管DINOv3在自然图像中展现了显著的语义判别能力,但其直接应用于体医学分割受到固有的维度和领域差异的阻碍。为解决这些问题,我们提出DINO-Med3D,一个两阶段渐进框架,将预训练的DINOv3编码器重新用于3D医学任务。在第一阶段,我们通过引入融合伪3D上下文的多切片嵌入模块来弥合维度差距,同时采用分割代理任务将从自然场景学到的表示适应到医学领域。随后,我们通过在冻结的主干中添加轻量级3D适配器来增强体理解,以强制执行全局切片间连续性。最后,为补偿嵌入过程中固有的空间信息损失,我们设计了一个并行细节恢复流,以显式保留高频边界线索。在五个公共数据集上的大量实验表明,我们的方法成功地将DINOv3适应到医学领域,并显著优于最先进的基线方法。

English Abstract

Although DINOv3 has demonstrated remarkable semantic discrimination in natural imagery, its direct application to volumetric medical segmentation is hindered by inherent dimension and domain disparities. To resolve these issues, we propose DINO-Med3D, a two-stage progressive framework that repurposes the pre-trained DINOv3 encoder for 3D medical tasks. In the first stage, we mitigate the dimension gap by introducing a multi-slice embedding module that incorporates pseudo-3D context, while simultaneously employing a segmentation proxy task to adapt representations learned from natural scenes to the medical domain. Subsequently, we further enhance volumetric understanding by adding lightweight 3D adapters into the frozen backbone to enforce global inter-slice continuity. Finally, to compensate for the spatial information loss inherent in the embedding process, we design a parallel detail recovery stream to explicitly preserve high-frequency boundary cues. Extensive experiments on five public datasets demonstrate that our approach successfully adapts DINOv3 to the medical domain and significantly outperforms state-of-the-art baselines.
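
The abstract does not say how the multi-slice embedding module builds its pseudo-3D context. One common 2.5D approach, shown here purely as an assumed illustration and not as the authors' design, stacks each slice with its neighbours as channels so that a 2D encoder sees a small amount of through-plane context. The function name `multi_slice_stack` and the `context` parameter are hypothetical.

```python
def multi_slice_stack(volume, context=1):
    """Group each slice with its `context` neighbours on either side.

    `volume` is a sequence of 2D slices ordered along the depth axis.
    Indices outside the volume are clamped, so edge slices are repeated.
    Returns one stack of 2 * context + 1 slices per input slice.
    """
    depth = len(volume)
    stacks = []
    for z in range(depth):
        idx = [min(max(z + d, 0), depth - 1)
               for d in range(-context, context + 1)]
        stacks.append([volume[i] for i in idx])
    return stacks
```

With `context=1` each stack has three slices, which happens to match the three-channel input that encoders pre-trained on RGB images expect. That is one plausible reason to choose this layout, although the paper may do something different.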

2606.18885 2026-06-18 cs.CV cs.IR New submission

LARE: Low-Attention Region Encoding for Text-Image Retrieval

LARE: 低注意力区域编码用于文本-图像检索

Abdulmalik Alquwayfili, Faisal Almeshal, Jumanah Almajnouni, Leena Alotaibi, Faisal Alhajari, Mohammed Alkhrashi, Alreem Almuhrij, Abdullah Aldwyish, Raied Aljadaany, Huda Alamri, Muhammad Kamran J. Khan

Affiliations * Saudi Data and Artificial Intelligence Authority (SDAIA)(沙特数据与人工智能局)

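AI Summary Proposes LARE, a framework that encodes low-attention regions and the full image in parallel to address visual encoders overlooking key details in crowded scenes, improving retrieval performance on a dense-scene subset.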

Comments Accepted at the ICML 2026 Workshop on Efficient Multimodal Question Answering (EMM-QA). Code: https://github.com/AbdulmalikDS/LARE ; Dataset: https://huggingface.co/datasets/AbdulmalekDS/Dense-Set

Details
AI Chinese Abstract

拥挤场景中的图像检索尤其具有挑战性,因为传统视觉编码器存在显著性偏差,倾向于关注主要对象而忽略低注意力区域,而这些区域通常对细粒度检索至关重要。我们提出了LARE(低注意力区域编码),一个显式建模这些被忽略区域的框架。LARE采用双编码策略,并行编码图像的低注意力区域和完整图像,从而产生更多样化和信息丰富的图像嵌入。为了评估拥挤场景下的图像检索性能,我们引入了Dense-Set,一个源自COCO和Flickr30K的具有挑战性的子集。在该子集中,图像被重新标注,以提供对低注意力或先前被忽略区域的更丰富描述。该数据集突显了现有检索模型的局限性,并能够在密集拥挤场景条件下进行更严格的评估。实验结果表明,所提出的框架通过在共享潜在空间中保留微妙的非主导视觉线索来提高检索性能。

English Abstract

Image retrieval in crowded scenes is particularly challenging due to the salience bias of conventional visual encoders, which tend to focus on dominant objects while neglecting low-attention regions that are often crucial for fine-grained retrieval. We propose LARE (Low-Attention Region Encoding), a framework that explicitly models these overlooked regions. LARE adopts a dual-encoding strategy that encodes low-attention regions of an image and the full image in parallel, leading to more diverse and informative image embeddings. To evaluate image retrieval performance in crowded scenes, we introduce Dense-Set, a challenging subset derived from COCO and Flickr30K. In this subset, images are re-captioned to provide richer descriptions of low-attention or previously overlooked regions. This dataset highlights the limitations of existing retrieval models and enables a more rigorous evaluation under densely crowded scene conditions. Experimental results demonstrate that the proposed framework improves retrieval performance by preserving subtle, non-dominant visual cues within the shared latent space.
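
The dual-encoding idea can be sketched on toy patch features. The abstract does not describe how low-attention regions are selected or how the two encodings are fused, so everything below is an assumption made for illustration: patches are ranked by a given attention score, the lowest-scoring fraction is mean-pooled, and the result is blended with the full-image embedding. The names `low_attention_indices`, `dual_encode`, `keep_ratio`, and `alpha` are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def mean_pool(vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def low_attention_indices(attn, keep_ratio=0.5):
    """Indices of the patches with the lowest attention scores."""
    k = max(1, int(len(attn) * keep_ratio))
    order = sorted(range(len(attn)), key=lambda i: attn[i])
    return sorted(order[:k])


def dual_encode(patches, attn, keep_ratio=0.5, alpha=0.5):
    """Blend a full-image embedding with a low-attention-region embedding.

    `patches` are per-patch feature vectors, `attn` their attention scores.
    `alpha` weights the low-attention branch (an assumed fusion rule).
    """
    full = l2_normalize(mean_pool(patches))
    low_idx = low_attention_indices(attn, keep_ratio)
    low = l2_normalize(mean_pool([patches[i] for i in low_idx]))
    fused = [(1 - alpha) * f + alpha * l for f, l in zip(full, low)]
    return l2_normalize(fused)
```

Compared with pooling the full image alone, the fused vector in this sketch shifts toward whatever the low-attention patches contain, which is the intuition behind preserving non-dominant cues in the shared embedding space.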