arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.01858 2026-06-02 cs.CV

Polaris: Scaling Up Instruction-Guided Image Generation Towards Millions of Personalized Style Needs

Zhi-Kai Chen, Jun-Peng Jiang, Jun-Jie Tao, De-Chuan Zhan, Han-Jia Ye

Affiliations: Tsinghua University

AI summary: Proposes Polaris, an intelligent retrieval framework that indexes and retrieves from more than 6,500 checkpoints and 75,000 adapters, automatically selecting and integrating the most relevant model components to deliver scalable, controllable, and well-aligned instruction-driven image generation without additional training.

Abstract

Users increasingly expect image generation models to quickly adapt to highly diverse and personalized requirements, such as producing images with distinctive styles or characteristics. Traditional approaches rely on fine-tuning, which is costly and difficult to scale. To cope with these limitations, the community has accumulated a growing library of fine-tuned modules and adapters, where each component targets specific generation needs and collectively serves as a foundation for handling new demands. This naturally raises a question: instead of repeatedly training new models, can we systematically exploit this expanding ecosystem to better fulfill user instructions? To this end, we present Polaris, an intelligent retrieval framework that automatically selects and integrates suitable models from the model library based on a user's instructions. The key insight is that harnessing such a massive and heterogeneous pool requires not only finding the most relevant modules among thousands of candidates, but also aligning them effectively for instruction-driven generation and editing. Polaris addresses this challenge by indexing over 6,500 checkpoints and 75,000 adapters, and retrieving the most relevant components given a user's input and instruction. In doing so, it delivers scalable, controllable, and well-aligned generation -- without any additional training.
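The retrieval step can be pictured with a minimal sketch. The abstract does not say how Polaris scores candidates, so the cosine-similarity ranking, the `retrieve` helper, and the three toy embeddings below are all illustrative assumptions rather than the authors' method.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_vec, index, k=2):
    """Return the names of the k indexed components closest to the query."""
    ranked = sorted(index, key=lambda name: cosine(query_vec, index[name]), reverse=True)
    return ranked[:k]

# Hypothetical 3-d embeddings; a real index would hold thousands of entries.
toy_index = {
    "watercolor_adapter": [0.9, 0.1, 0.0],
    "anime_checkpoint": [0.1, 0.9, 0.1],
    "pixel_art_adapter": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of a user instruction.
top = retrieve([0.8, 0.3, 0.0], toy_index, k=2)
```

A linear scan like this is only workable for a toy index; at the scale the abstract reports, some form of approximate nearest-neighbor search would presumably be needed.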

2606.01850 2026-06-02 cs.AI

Does Compression Preserve Uncertainty? A Unified Benchmark for Quantized and Sparse LLMs via Conformal Prediction

Yujia Tong, Yuxi Wang, Yunyang Wan, Tian Zhang, Junhao Dong, Jingling Yuan

Affiliations: Wuhan University of Technology; Nanyang Technological University

AI summary: Using conformal prediction, this study benchmarks 12 LLMs under various compression configurations across five NLP tasks. It finds that compression frequently decouples accuracy from uncertainty, that larger models absorb compression-induced uncertainty better than smaller ones, and that uncertainty inflation is often threshold-like rather than gradual.

Abstract

Model compression techniques such as quantization and pruning are widely used to reduce the deployment cost of large language models (LLMs), with existing evaluations focusing almost exclusively on accuracy preservation. However, in safety-critical applications, a model's ability to reliably quantify its own uncertainty is equally important. We ask: does compression preserve this ability? To answer this question, we benchmark 12 LLMs under various compression configurations across five NLP tasks, using conformal prediction to provide a rigorous, distribution-free measure of uncertainty. Our experiments reveal that: (I) compression frequently decouples accuracy from uncertainty; (II) larger models absorb compression-induced uncertainty far more effectively than smaller ones; and (III) uncertainty inflation is often threshold-like rather than gradual. These results suggest that accuracy-only evaluation is insufficient for assessing the deployment readiness of compressed LLMs, and that uncertainty-aware benchmarking should be a standard component of model compression pipelines.
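For readers unfamiliar with conformal prediction, the following is a minimal sketch of the split-conformal variant, in which the size of the prediction set serves as the uncertainty measure. The abstract does not specify the score function or the calibration protocol, so the `1 - p_true` score and the toy probabilities are assumptions.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split-conformal threshold from calibration data (score = 1 - p_true)."""
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank, capped at n for simplicity.
    rank = min(n, math.ceil((n + 1) * (1 - alpha)))
    return scores[rank - 1]

def prediction_set(probs, qhat):
    """All labels whose nonconformity score falls within the threshold."""
    return [y for y, p in enumerate(probs) if 1.0 - p <= qhat]

# Toy calibration set: class probabilities and true labels.
cal_probs = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.3, 0.6]]
cal_labels = [0, 0, 1, 2]
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.25)

# Two hypothetical models with the same top-1 answer but different sharpness.
sharp_set = prediction_set([0.5, 0.45, 0.05], qhat)
flat_set = prediction_set([0.5, 0.5, 0.0], qhat)
```

In this toy case both outputs rank label 0 first, yet the flatter distribution yields a larger prediction set, which is the kind of accuracy-versus-uncertainty gap the benchmark is designed to expose.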

2606.01848 2026-06-02 cs.CV

RescueBench: Can Embodied Agents Save Lives in the Wild?

Kui Wu, Beiyu Guo, Hao Chen, ShuHang Xu, Yuling Li, Yongdan Zeng, Zhoujun Li, Yizhou Wang, Fangwei Zhong

Affiliations: Beihang University; Beijing Normal University; Peking University; City University of Macau; ATEC2025 Challenge Committee

AI summary: Introduces RescueBench, a photo-realistic diagnostic benchmark built as a four-stage pipeline, which evaluates the exploration, memory, and interaction capabilities of embodied agents in search-and-rescue tasks and reveals how exploration and memory failures propagate.

Abstract

Search-and-rescue (SAR) requires embodied agents to explore unfamiliar environments under multimodal uncertainty, perform multi-stage interactions, and retrieve spatial memory over long horizons. Existing benchmarks typically evaluate these capabilities in isolation, leaving unclear how failures compound when they must be composed in realistic workflows. We introduce RescueBench, a photo-realistic diagnostic benchmark that instantiates SAR as a four-stage pipeline: multimodal exploration, target rescue, memory-guided return, and final handoff. By combining sequential task composition with stage-level evaluation, RescueBench enables analysis of how exploration and memory failures propagate through embodied rescue workflows. It contains five progressive difficulty levels that vary in environmental complexity, clue ambiguity, and spatial hierarchy, along with an automatic episode generation and annotation pipeline for scalable evaluation and training. We evaluate seven baselines, an oracle reference, and human players, showing that no baseline completes the full task at the highest difficulty level. Stage-level diagnosis identifies autonomous exploration as the dominant failure mode and spatial memory as a second, independent bottleneck, suggesting that these limitations are not resolved by current topological visual-language navigation or map-based methods. Code is available at https://github.com/wukui-muc/RescueBench

2606.01847 2026-06-02 cs.RO cs.LG

The Lie We Tell: Correcting the Euclidean Fallacy in Vision Language Action Policies via Score Matching on Tangent Space

Bing-Cheng Chuang, I-Hsuan Chu, Bor-Jiun Lin, YuanFu Yang, Min Sun, Chun-Yi Lee

Affiliations: National Taiwan University

AI summary: Targets the Euclidean Fallacy, in which diffusion-based vision-language-action policies represent SE(3) poses as flat R^12 vectors. The proposed Lie Diffuser Actor (LDA) injects noise through left-invariant SDEs, predicts scores in the tangent space, and retracts samples via the exponential map, which eliminates manifold drift by construction while guaranteeing coordinate-frame equivariance and geodesic optimality. On CALVIN ABC→D, average task length rises from 3.27 to 3.51.

Comments: Accepted at ICML 2026

Abstract

Diffusion-based Vision-Language-Action policies achieve remarkable success in robotic manipulation, yet commit a fundamental geometric error we term the $\textbf{Euclidean Fallacy}$: representing SE(3) poses as flat $\mathbb{R}^{12}$ vectors. This approximation induces (1) manifold drift violating SO(3) constraints, (2) broken equivariance under coordinate transformations, and (3) non-geodesic trajectories with excessive kinematic cost. We introduce $\textbf{Lie Diffuser Actor (LDA)}$, a diffusion framework operating intrinsically on SE(3). Our method injects noise through left-invariant SDEs, predicts scores in the tangent space, and retracts samples via the exponential map. This formulation eliminates manifold drift by construction while guaranteeing coordinate-frame equivariance and geodesic optimality. On CALVIN ABC$\rightarrow$D, LDA improves average task length from $3.27$ to $3.51$ ($+7.3\%$). We further validate our method on a real robot, where it outperforms the baseline on the majority of tasks.
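The manifold-drift point can be checked numerically. The sketch below covers only the rotation part, SO(3), not full SE(3), and it is not the LDA implementation: it compares adding a perturbation directly to the matrix entries with retracting a tangent vector through the exponential map (Rodrigues' formula). The starting rotation and the `step` vector are arbitrary illustrative values.

```python
import math

def matmul(A, B):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def transpose(A):
    return [list(row) for row in zip(*A)]

def hat(w):
    """Map a tangent vector in R^3 to a skew-symmetric matrix in so(3)."""
    x, y, z = w
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]

def exp_so3(w):
    """Rodrigues' formula: exponential map from so(3) to SO(3)."""
    theta = math.sqrt(sum(c * c for c in w))
    K = hat(w)
    K2 = matmul(K, K)
    if theta < 1e-9:
        a, b = 1.0, 0.5  # small-angle limits of sin(t)/t and (1 - cos t)/t^2
    else:
        a, b = math.sin(theta) / theta, (1.0 - math.cos(theta)) / theta ** 2
    eye = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    return [[eye[i][j] + a * K[i][j] + b * K2[i][j] for j in range(3)] for i in range(3)]

def orthogonality_error(R):
    """Largest absolute entry of R^T R - I; zero for a valid rotation."""
    G = matmul(transpose(R), R)
    return max(abs(G[i][j] - (1.0 if i == j else 0.0)) for i in range(3) for j in range(3))

R = exp_so3([0.3, -0.2, 0.5])   # a valid starting rotation
step = [0.05, 0.10, -0.08]      # hypothetical tangent-space update

# Euclidean update: add the perturbation directly to the matrix entries.
E = hat(step)
R_euclid = [[R[i][j] + E[i][j] for j in range(3)] for i in range(3)]

# Lie-group update: right-multiply by the exponential of the tangent vector.
R_lie = matmul(R, exp_so3(step))
```

Here `orthogonality_error(R_lie)` stays at floating-point round-off, while `orthogonality_error(R_euclid)` is clearly nonzero, so the Euclidean update has already left the rotation manifold after one step.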

2606.01845 2026-06-02 cs.CL cs.AI

Unveiling the Limits of Large Language Models in Inferring Pragmatic Meaning from Non-Verbal Responses

Sugyeong Eo, Heuiseok Lim

Affiliations: Department of Software, Yonsei University Mirae Campus; Department of Computer Science and Engineering, Korea University

AI summary: Presents the first systematic evaluation of LLMs' ability to infer pragmatic meaning from dialogue consisting solely of non-verbal responses, finding that accuracy drops by up to 60 percentage points compared with verbal responses and that in-context learning facilitates pragmatic inference.

Abstract

Although large language models (LLMs) have shown considerable progress in pragmatic language understanding, prior research has focused mainly on their comprehension of verbal behavior. Nonetheless, non-verbal behavior remains a fundamental component of human communication, especially when deliberately utilized in isolation to convey indirect meanings. In this work, we present the first systematic evaluation of LLMs' ability to infer pragmatic meaning in dialogue consisting solely of non-verbal responses. We explore three research questions: (1) Can LLMs recognize indirect intent conveyed through non-verbal responses? (2) When and how do LLMs fail to capture non-verbal intent? (3) How can we improve LLMs' ability to interpret non-verbal intent? Through the evaluation, we observe that LLMs struggle to infer underlying meaning from non-verbal responses, with accuracy dropping by up to 60 percentage points compared to verbal ones. Further extensive analysis reveals a behavioral pattern in LLMs' interpretations of non-verbal behavior and demonstrates that in-context learning facilitates pragmatic inference.

2606.01843 2026-06-02 cs.CV cs.AI

Suppressing Forgery-Specific Shortcuts for Generalizable Deepfake Detection

Yihui Wang, Yonghui Yang, Jilong Liu, Fengbin Zhu, Le Wu, Tat-Seng Chua

Affiliations: Hefei University of Technology; National University of Singapore

AI summary: Proposes the Shortcut Subspace Suppression (S^3) framework, which explicitly characterizes and suppresses method-specific shortcuts via subspace modeling to improve the cross-method generalization of deepfake detection.

Abstract

Deepfake detection suffers from poor generalization across forgery methods, as existing models tend to rely on spurious method-specific shortcuts that fail to transfer to unseen manipulations. While recent approaches attempt to improve generalization, they lack an explicit mechanism to identify and suppress such shortcuts in learned representations. In this work, we propose the Shortcut Subspace Suppression (S^3) framework, which explicitly characterizes and suppresses method-specific shortcuts via subspace modeling. Our key insight is that variations distinguishing different forgery methods capture method-specific artifacts and thus serve as an effective proxy for method-specific shortcuts. To this end, we train a lightweight linear probe for forgery method classification and perform Singular Value Decomposition (SVD) to extract the dominant shortcut subspace. Building on this formulation, we develop two complementary strategies to reduce shortcut reliance. During training, we softly suppress the shortcut subspace in feature representations, encouraging the model to rely on more generalizable cues for real/fake discrimination. At inference time, we introduce a training-free counterpart that attenuates neurons aligned with the identified shortcut directions, enabling plug-and-play generalization enhancement with improved interpretability. Extensive experiments on multiple benchmarks demonstrate that our method significantly improves cross-method generalization while maintaining strong in-domain performance. The code will be released upon acceptance of the submission.
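The soft-suppression idea can be sketched as a projection. To stay within the standard library, this example orthonormalizes the probe weights with Gram-Schmidt instead of the SVD the abstract names; the two spans coincide when every direction is kept, but Gram-Schmidt does not rank directions by importance, so this is a stand-in and not the S^3 procedure. The probe weights, the feature vector, and the `strength` parameter are hypothetical.

```python
import math

def orthonormal_basis(rows, tol=1e-10):
    """Gram-Schmidt basis for the span of the probe weight rows."""
    basis = []
    for r in rows:
        v = list(r)
        for b in basis:
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > tol:  # skip rows that add no new direction
            basis.append([x / norm for x in v])
    return basis

def suppress(feature, basis, strength=1.0):
    """Shrink the component of `feature` lying in the shortcut subspace."""
    out = list(feature)
    for b in basis:
        dot = sum(x * y for x, y in zip(feature, b))
        out = [o - strength * dot * y for o, y in zip(out, b)]
    return out

# Hypothetical probe weights: two forgery-method directions in a 3-d feature space.
probe_weights = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
shortcut_basis = orthonormal_basis(probe_weights)

feature = [2.0, 3.0, 4.0]
fully_suppressed = suppress(feature, shortcut_basis, strength=1.0)
softly_suppressed = suppress(feature, shortcut_basis, strength=0.5)
```

With `strength=1.0` the shortcut components are removed outright; intermediate values correspond loosely to the soft suppression the abstract describes for training.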

2606.01840 2026-06-02 cs.AI

Evaluation of Baseline Methods for IDD-based SSD External Memory Search

Yuki Suzuki, Alex Fukunaga

Venue: International Symposium on Combinatorial Search (SoCS 2026)

AI summary: Evaluates the performance of simple baseline methods for A* with immediate duplicate detection (IDD) in SSD-based external-memory search, and analyzes the effect of the OS-level page cache.

Comments: Accepted to the 19th International Symposium on Combinatorial Search (SoCS 2026)

Abstract

Many difficult search problems cannot be solved by algorithms such as A* using only RAM. Previous work has proposed search algorithms that use external memory such as SSDs and HDDs, which offer much higher capacity than RAM. That work, however, has focused on delayed duplicate detection approaches and on complex immediate duplicate detection (IDD) methods, while relatively simple methods for IDD have not been systematically studied. In addition, the effect of OS-level mechanisms for managing and speeding up accesses to external memory, such as page caches, has not been studied. This paper addresses these gaps in the literature by evaluating and analyzing the performance of simple baseline approaches for IDD-based A*.
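As a point of reference, immediate duplicate detection means each successor is checked against the table of known states at the moment it is generated, not in a later batch. The sketch below shows that control flow for A* on a toy grid. The in-memory `dict` merely stands in for whatever SSD-backed table a real system would use, and the grid domain and function names are illustrative assumptions, not the paper's baselines.

```python
import heapq

def astar_idd(start, goal, neighbors, h):
    """A* that checks for duplicates as soon as each successor is generated."""
    best_g = {start: 0}  # stand-in for the external-memory state table
    frontier = [(h(start), 0, start)]
    while frontier:
        f, g, node = heapq.heappop(frontier)
        if node == goal:
            return g
        if g > best_g[node]:
            continue  # stale queue entry superseded by a cheaper path
        for nxt, cost in neighbors(node):
            new_g = g + cost
            # Immediate duplicate detection: look up the successor right away.
            if nxt in best_g and best_g[nxt] <= new_g:
                continue
            best_g[nxt] = new_g
            heapq.heappush(frontier, (new_g + h(nxt), new_g, nxt))
    return None

def grid_neighbors(node, size=4):
    """4-connected moves with unit cost on a size x size grid."""
    x, y = node
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < size and 0 <= ny < size:
            yield (nx, ny), 1

def manhattan(node, goal=(3, 3)):
    return abs(goal[0] - node[0]) + abs(goal[1] - node[1])

cost = astar_idd((0, 0), (3, 3), grid_neighbors, manhattan)
```

Every lookup in `best_g` would become a random read once the table lives on disk, which is why the interaction with the OS page cache matters for this family of methods.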

2606.01838 2026-06-02 cs.CL cs.AI cs.LG

LayerRoute: Input-Conditioned Adaptive Layer Skipping via LoRA Fine-Tuning for Agentic Language Models

Prateek Kumar Sikdar

Affiliations: Accenture

AI summary: Proposes LayerRoute, which adds a lightweight router and LoRA adapters to each transformer block so that layers are skipped adaptively by input type (tool call or planning/reasoning). It reports a 12.91-percentage-point skip differential and improved quality, with trainable parameters equal to only 0.22% of the backbone.

Comments: 10 pages, 3 figures, 4 tables

Abstract

Agentic language model systems alternate between two structurally distinct step types: structured tool calls (short, deterministic, low perplexity) and open-ended planning/reasoning steps (long, complex, high perplexity). Despite this heterogeneity, current inference systems apply identical compute to every step. We introduce LayerRoute, a lightweight adapter that learns to selectively skip transformer blocks on a per-input basis. LayerRoute augments each of the 24 transformer blocks in Qwen2.5-0.5B-Instruct with: (1) a per-layer router (~897 parameters, Linear(896,1)) that outputs a hard binary gate via the straight-through estimator, and (2) LoRA adapters (rank 8, ~1.08M parameters) on the Q/K/V/O attention projections. The backbone weights remain frozen. A single end-to-end training pass on agentic data (Hermes, Glaive, GSM8K, Turing) with a gate regularisation term forces the system to discover which blocks are skippable per input type. After 3,000 steps (6.4 minutes on an A100 40GB), LayerRoute achieves a 12.91% skip differential: tool calls skip 15.25% of FLOPs while planning steps skip only 2.34%, using only 1.10M trainable parameters (0.22% of the 494M backbone). Quality improves over the base model due to LoRA adaptation, with perplexity delta of -1.29 on tool calls and -1.30 on planning.
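The parameter counts quoted in the abstract can be reproduced with simple arithmetic. The hidden size, layer count, and LoRA rank come from the abstract; the key/value projection width of 128 is an assumption based on the grouped-query attention layout of Qwen2.5-0.5B (2 key/value heads of size 64) and is not stated in the text.

```python
HIDDEN = 896   # from the abstract: each router is Linear(896, 1)
LAYERS = 24    # from the abstract
RANK = 8       # from the abstract
KV_DIM = 128   # assumption: 2 key/value heads of size 64

def lora_params(d_in, d_out, r=RANK):
    """A rank-r LoRA pair holds an (r x d_in) and a (d_out x r) matrix."""
    return r * (d_in + d_out)

router_per_layer = HIDDEN + 1  # weights plus one bias
lora_per_layer = (
    lora_params(HIDDEN, HIDDEN)    # Q projection
    + lora_params(HIDDEN, KV_DIM)  # K projection
    + lora_params(HIDDEN, KV_DIM)  # V projection
    + lora_params(HIDDEN, HIDDEN)  # O projection
)

router_total = LAYERS * router_per_layer
lora_total = LAYERS * lora_per_layer
trainable = router_total + lora_total
percent_of_backbone = 100 * trainable / 494e6

# The skip differential is the gap between the two reported skip rates.
skip_differential = round(15.25 - 2.34, 2)
```

Under these assumptions the totals come to roughly 1.08M LoRA parameters and 1.10M trainable parameters overall, consistent with the abstract, and the 12.91% figure is a difference in percentage points between the two skip rates.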

2606.01834 2026-06-02 cs.CV cs.AI

Physics-Guided Attention in a Lightweight TCN for Efficient WiFi CSI-Based Human Activity Recognition

Chinthaka Ranasingha, Tharindu Fernando, Sridha Sridharan, Clinton Fookes, Harshala Gammulle

Affiliations: Signal Processing, Artificial Intelligence and Vision Technologies (SAIVT) Research Group, School of Electrical Engineering and Robotics, Queensland University of Technology (QUT)

AI summary: Proposes a compact TCN framework that explicitly introduces motion-aware inductive biases through Doppler-energy-guided temporal attention and variance-driven channel attention, outperforming deeper baselines while reducing parameter count and computational cost.

Abstract

Human Activity Recognition (HAR) using WiFi Channel State Information (CSI) has gained increasing attention due to its non-contact, low-cost, and privacy-preserving nature. However, existing learning-based approaches largely rely on deep, computationally intensive architectures to implicitly capture motion dynamics from CSI measurements, thereby increasing model complexity and reducing efficiency. Instead, we argue that incorporating appropriate inductive biases tailored to the physical characteristics of CSI signals enables more efficient and effective learning. In this work, we propose a compact temporal convolutional network (TCN)-based framework that explicitly incorporates motion-aware inductive biases into feature learning. Specifically, we introduce a Doppler-energy-guided temporal attention mechanism in feature space to emphasize motion-salient time segments, and a variance-driven channel attention module to adaptively weight informative subcarriers based on temporal motion statistics. By integrating these domain-specific priors, the proposed model effectively captures motion dynamics without increasing architectural depth. Extensive experiments on multiple benchmark datasets demonstrate that our approach achieves superior performance compared to deeper baselines, while significantly reducing parameter count and computational cost.
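A rough sketch of the two attention signals follows. The abstract gives no formulas, so both definitions are assumptions: squared frame-to-frame differences serve as a crude proxy for Doppler energy, and per-subcarrier temporal variance is normalized into channel weights. The paper applies its temporal attention in feature space; this toy version works directly on raw amplitudes.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_attention(csi):
    """csi[t][c] is the amplitude at time t, subcarrier c.

    Returns weights over time steps that favor high-motion frames.
    """
    energy = [0.0]  # no difference is available for the first frame
    for prev, cur in zip(csi, csi[1:]):
        energy.append(sum((b - a) ** 2 for a, b in zip(prev, cur)))
    return softmax(energy)

def channel_attention(csi):
    """Per-subcarrier weights proportional to temporal variance."""
    steps = len(csi)
    variances = []
    for c in range(len(csi[0])):
        col = [row[c] for row in csi]
        mean = sum(col) / steps
        variances.append(sum((v - mean) ** 2 for v in col) / steps)
    total = sum(variances) or 1.0  # avoid dividing by zero on static input
    return [v / total for v in variances]

# Toy CSI: subcarrier 0 is static, subcarrier 1 spikes at t = 2.
csi = [[1.0, 1.0], [1.0, 1.0], [1.0, 3.0], [1.0, 1.0]]
time_weights = temporal_attention(csi)
channel_weights = channel_attention(csi)
```

On this input the time weights peak around the spike and all channel weight goes to the moving subcarrier, which is the qualitative behavior the abstract attributes to the two modules.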

2606.01833 2026-06-02 cs.LG cs.AI

Learning Implicit Bias in Generative Spaces for Accelerating Protein Dynamics Emulation

Kaihui Cheng, Zhiqiang Cai, Wenkai Xiang, Zhihang Hu, Siyu Zhu, Tzuhsiung Yang, Yuan Qi

Affiliations: Fudan University; Shanghai Academy of AI for Science; Shanghai Innovation Institute

AI summary: Introduces an implicit, history-dependent bias in the generative space of a pretrained generative emulator, combining distance-weighted score estimation with environment-support regularization and a re-projection step that preserves structural validity. The method substantially improves sampling diversity and the speed at which rare states are covered.

Abstract

Generative emulators of protein dynamics produce plausible trajectories at a fraction of the cost of molecular dynamics, but they inherit their training distribution and tend to revisit known states rather than reach rare ones under long-horizon extrapolation. Inspired by classical enhanced sampling, we introduce an implicit, history-dependent bias in the generative space of a pretrained emulator. Specifically, a history-aware score estimator augments the frozen emulator with a distance-weighted bias that steers reverse-time sampling away from previously generated structures, regularized by an environment-support term. To preserve structural validity at long horizons, a score-based refinement step re-projects drifted samples onto the data manifold using the frozen emulator. Our experiments demonstrate that the method (i) raises diversity by $35\%$ on DynamicPDB-80, and (ii) on $12$ zero-shot Fast-Folding proteins, reaches the unbiased emulator's coverage up to ${\sim}15\times$ faster with the learned bias alone and up to ${\sim}37\times$ faster when paired with refinement, while covering ${\sim}3\times$ as many low-energy states. Code will be released soon.
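The flavor of a distance-weighted, history-dependent bias can be shown with a metadynamics-style toy. This is not the paper's learned estimator: the Gaussian bumps, the `sigma` and `weight` parameters, and the stand-in base score are all assumptions, and the environment-support regularizer and refinement step are omitted.

```python
import math

def repulsive_bias(x, history, sigma=1.0, weight=1.0):
    """Negative gradient of a sum of Gaussian bumps centered on past samples."""
    force = [0.0] * len(x)
    for past in history:
        diff = [a - b for a, b in zip(x, past)]
        k = weight * math.exp(-sum(d * d for d in diff) / (2 * sigma ** 2))
        for i, d in enumerate(diff):
            force[i] += k * d / sigma ** 2  # pushes x away from `past`
    return force

def biased_score(x, base_score, history, **kw):
    """Frozen model score plus the history-dependent bias term."""
    bias = repulsive_bias(x, history, **kw)
    return [s + b for s, b in zip(base_score(x), bias)]

def gaussian_score(x):
    """Score of a standard Gaussian, standing in for the frozen emulator."""
    return [-xi for xi in x]

history = [[0.5, 0.0]]  # one previously generated sample
x = [1.0, 0.0]
bias = repulsive_bias(x, history)
score = biased_score(x, gaussian_score, history)
```

The bias points away from the visited sample and decays with distance, so it mostly affects sampling near regions that have already been explored.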

2606.01830 2026-06-02 cs.AI

CAPF: Guiding Search-Agent Rollouts with Credit-Attenuated Privileged Feedback

Bin Chen, Xinye Liao, Yiming Liu, Xin Liao, Chonghan Liu

Affiliations: University of Science and Technology of China

AI summary: Addresses the difficulty search agents have learning from sparse outcome rewards. CAPF uses verifier-side information at training time to revise zero-reward rollouts into positive-reward trajectories and attenuates the associated credit to suit deployment without privileged feedback, raising Qwen3-4B's average exact-match score on seven open-domain QA benchmarks from 44.7% to 48.5%.

Abstract

Recent LLM search agents use reinforcement learning with verifiable rewards (RLVR) to learn search-augmented reasoning from outcome rewards. On hard problems, these agents rarely sample end-to-end successful rollouts, leaving outcome-only RLVR with few positive-reward trajectories. We argue that improving learning on such problems requires additional guidance during training, and RLVR already contains verifier-side information that can provide it. This information can identify errors or omissions in the agent's submitted answer and guide revision within the rollout. We propose a training-time mechanism called Credit-Attenuated Privileged Feedback (CAPF), which makes this verifier-side information available through a Privileged Feedback call during training. CAPF lets the policy revise zero-reward attempts into positive-reward repair trajectories and attenuates credit for the feedback call and earlier actions to accommodate deployment without this call. Empirically, CAPF improves Qwen3-4B's average exact-match score from 44.7% under outcome-only RLVR to 48.5% on seven open-domain QA benchmarks.
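One simple way to read "attenuates credit for the feedback call and earlier actions" is as a per-step scaling of the rollout's advantage. The abstract does not give the actual rule, so the function below, its constant `factor`, and the step indexing are hypothetical.

```python
def attenuate_credit(advantage, num_steps, feedback_step=None, factor=0.5):
    """Per-step credit for one rollout.

    Steps up to and including the feedback call receive `factor` times the
    rollout advantage; later steps (the repair) keep the full advantage.
    """
    if feedback_step is None:  # the rollout never used privileged feedback
        return [advantage] * num_steps
    return [
        advantage * factor if t <= feedback_step else advantage
        for t in range(num_steps)
    ]

# A five-step rollout whose third action (index 2) is the feedback call.
credits = attenuate_credit(advantage=1.0, num_steps=5, feedback_step=2, factor=0.5)
```

The design intent, as the abstract frames it, is that actions which only succeeded thanks to privileged feedback should be reinforced less, since the feedback call is unavailable at deployment.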

2606.01825 2026-06-02 cs.CV cs.MM

ROGLE: Robust Global-Local Alignment with Automated Region Supervision for Text-Based Person Search

Zequn Xie, Xibei Jia, Sihang Cai, Shulei Wang, Tao Jin

Affiliations: Zhejiang University

AI summary: Proposes the ROGLE framework, which addresses insufficient fine-grained alignment in text-based person search through an automated region-to-sentence matching strategy and multi-granular learning, and reports the best performance on the new P-VLG benchmark.

Comments: 12 pages, 5 figures

Abstract

Text-Based Person Search (TBPS) aims to retrieve pedestrian images using natural language queries. However, existing TBPS models, especially those based on CLIP, struggle with fine-grained understanding due to global representational bias and semantic sparsity inherited from training on short captions. This results in weak fine-grained alignment, exacerbated by the scarcity of region-level annotations. To address this, we propose ROGLE (Robust Global-Local Embedding), a unified framework that overcomes reliance on costly manual annotations through an automated Region-to-Sentence Matching (RSM) strategy. RSM automatically mines pseudo region-sentence pairs for scalable fine-grained supervision. Furthermore, ROGLE employs a multi-granular learning strategy that fuses global contrastive learning with region-level local alignment. We also introduce the P-VLG Benchmark, a large-scale dataset constructed by curating and enriching images from established public benchmarks. It features over 100,000 annotated regions and rich long-form captions, making it the first TBPS benchmark to support both global and local assessment protocols. Extensive experiments show that ROGLE significantly outperforms existing approaches, particularly on challenging long-form queries. Code and the P-VLG benchmark will be made publicly available.

2606.01824 2026-06-02 cs.RO

DisFlow: Scene Flow from Distance Field for Object Pose, Velocity Tracking, and Dynamic Object Reconstruction

Lan Wu, Sheila Sutjipto, Jennifer Wakulicz, Teresa Vidal-Calleja

Affiliations: Robotics Institute, University of Technology Sydney; School of Engineering, University of Western Australia

AI summary: Proposes DisFlow, a framework that estimates scene flow from a distance field using a Gaussian Process Implicit Surface representation, enabling 6DoF dynamic object pose estimation, motion tracking, and surface reconstruction.

Abstract

We present DisFlow, a novel framework for online scene flow estimation from a distance field that enables 6DoF dynamic object pose estimation, motion tracking, and surface reconstruction. The scene is represented by Gaussian Process Implicit Surfaces (GPIS), with surface normals serving as derivative constraints, enabling accurate signed distance computations near the surface and gradient queries with uncertainty. With this representation as a foundation, we compute a scene flow from the distance field that describes how surface points are transported over time in consecutive frames. Through our flow, we can estimate an object's pose and motion by incrementally registering a newly observed point cloud via an elegant closed-form optimisation. Unlike prior methods that operate in the camera or world frame, our approach performs probabilistic fusion directly in the object frame, where the object remains geometrically consistent over time. The tight coupling of the DisFlow method in space and time yields dense geometry, surface normals, object pose trajectories, velocities, and uncertainty, all at real-time rates. We evaluate DisFlow on dynamic object sequences and demonstrate that it achieves accurate pose and motion tracking while simultaneously reconstructing high-quality object surfaces. Code is publicly available at https://github.com/LanWu076/disflow_ros2

2606.01820 2026-06-02 cs.CL

TalkTag: Fine-Grained Morphosyntactic Error Annotation for Transcribed Speech

Shamira Venturini, Oliver Hennhöfer, Steffen Kinkel, Jannik Strötgen

Affiliations: Karlsruhe Institute of Technology, Germany; Karlsruhe University of Applied Sciences, Germany

AI summary: Proposes TalkTag, a lightweight LLM-based tool that automates CHAT-style error annotation of spoken-language transcripts under data scarcity, producing precise annotations in low-resource settings and identifying ambiguous cases.

Abstract

Fine-grained morphosyntactic error annotation is important in clinical and developmental language research, yet it is labour-intensive, expert-dependent, and difficult to scale. We present TalkTag, an LLM-based lightweight tool fine-tuned to automate CHAT-style error annotation in spoken-language transcripts. Developed under conditions of extreme data scarcity using children's narrative data, the system shows the feasibility of linguistic analysis in low-resource settings. Our evaluation demonstrates that TalkTag produces encouragingly precise annotation while effectively identifying instances where linguistic ambiguity makes automated tagging genuinely complex. In summary, with TalkTag, we provide a scalable alternative to manual error annotation and practically viable support for morphosyntactic error annotation.

2606.01819 2026-06-02 cs.CV eess.IV

Hist2Style: Histogram-Guided Stylization with Bilateral Grids

Dekel Galor, Adam Pikielny, Zhoutong Zhang, Ke Wang, Laura Waller, Jiawen Chen, Ilya Chugunov

Affiliations: Adobe Nextcam; University of California, Berkeley

AI summary: Proposes Hist2Style, which uses bilateral grids for fast, edge-aware photorealistic style transfer, distills a large model into a lightweight network, and offers interpretable user control through a histogram-based embedding.

Comments: 10 pages, 8 figures. Extended results are at https://www.dekelgalor.com/hist2style

Abstract

Photorealistic style transfer aims to match the color and tone of an input image to that of a style target while preserving the content and details of the original scene. Although existing large image models can facilitate these kinds of appearance edits, their high computational demands, potential for hallucinations, and limited user control make them unsuitable for high-resolution, real-time workflows. We introduce Hist2Style, a bilateral-grid formulation for fast, edge-aware stylization that preserves visual fidelity by constraining operations to locally affine transforms in bilateral space. Our model distills a large image editing model into a lightweight network by training on a large supervised corpus generated with language and vision-language models, targeting spatially varying color edits. The network conditions on a histogram-based embedding of the style target to provide an interpretable interface for adjusting the output style by modifying the target color distribution. Overall, Hist2Style maintains content structure by construction, avoids hallucinations, and supports real-time, high-resolution photorealistic stylization with interactive user-controllable color and tone adjustments.

2606.01818 2026-06-02 cs.CV

Unsupervised Collaborative Domain Adaptation for Driving Scene Parsing

Jiahe Fan, Shaolong Shu, Mingjian Sun, Tiehua Zhang, Bohong Xiao, Hanli Wang, Rui Fan

Affiliations: College of Electronic and Information Engineering, Tongji University; Department of Control Science and Engineering, Harbin Institute of Technology; Department of Vehicle Control System and Software Development, NIO; School of Computer Science and Technology, Tongji University; Key Laboratory of Embedded System and Service Computing (Ministry of Education), Tongji University

AI summary: Proposes UCDA, an unsupervised collaborative domain adaptation framework that uses collaborative optimization of multiple source models followed by knowledge distillation to improve the robustness and generalization of target-domain driving scene parsing without access to source data.

Abstract

Reliable driving scene parsing is a fundamental capability for autonomous vehicles operating in open and dynamic driving environments. However, adapting perception models to new deployment domains remains challenging because pixel-level annotations are expensive to obtain, while source-domain data are often inaccessible due to privacy, security, or ownership constraints. Existing source-free unsupervised domain adaptation methods typically rely on a single pre-trained source model, which makes the adapted perception system vulnerable to source-specific biases and limits its robustness under diverse road layouts, illumination conditions, weather patterns, and traffic conditions. This article presents an unsupervised collaborative domain adaptation (UCDA) framework for driving scene parsing in a source-free setting, which transfers complementary knowledge from multiple pre-trained source models to a unified target model without accessing any original source samples. To compare predictions from independently trained models, UCDA constructs a class-level prototype memory bank and estimates cross-model prediction reliability through prototype similarity, reducing the effect of inconsistent confidence scales across source models. Based on the resulting complementary supervision, UCDA adopts a two-stage transfer strategy: multiple source models are first refined on unlabeled target-domain driving data through collaborative optimization with positive and negative consistency constraints, and their validated expertise is then distilled into a single deployable target model. Comprehensive evaluations on public driving-scene datasets and real-world data collected from an autonomous vehicle platform demonstrate that UCDA effectively consolidates complementary multi-source knowledge, improving target-domain scene parsing reliability and generalization across diverse driving environments.

2606.01815 2026-06-02 cs.CL

CRAB-Bench: Evaluating LLM Agents under Complex Task Dependencies and Human-aligned User Simulation

CRAB-Bench:在复杂任务依赖和人类对齐用户模拟下评估LLM智能体

Danqing Wang, Akshay Sivaraman, Lei Li

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出CRAB-Bench和RUSE框架,通过约束图生成复杂任务依赖和基于人类行为研究的用户模拟,评估LLM智能体在真实服务场景中的表现,发现最佳模型仅达61%通过率,用户模拟导致性能下降高达57%。

详情
AI中文摘要

在现实服务场景中评估LLM智能体需要复杂的任务依赖、不完美的用户行为以及能够容纳多种有效解决方案的评估。我们引入了CRAB-Bench(基于约束的现实智能体基准)和RUSE(现实用户模拟引擎)来填补这一空白。CRAB-Bench通过一个包含多个相互依赖实体和结构化干扰项的约束图生成任务,要求智能体在数千个误导性候选项中仔细推理,其中只有极小部分解是有效的。RUSE用基于人类行为研究的现实用户取代了合作性的模板式模拟器,这些用户实例化在不同的角色和四个行为维度上。在四个前沿LLM智能体上的实验表明,最佳模型在CRAB-Bench上仅达到61%的pass@1,而切换到RUSE导致进一步下降高达57%,主要集中在任务解决能力而非对话质量上。信息披露(Information Disclosure)是最具破坏性的行为维度,与RUSE交互的智能体更少承认错误,而是通过隐式纠正掩盖错误。

英文摘要

Evaluating LLM agents in realistic service scenarios requires complex task dependencies, imperfect user behavior, and an evaluation that accommodates multiple valid solutions. We introduce CRAB-Bench (Constraint-based Realistic Agent Benchmark) and RUSE (Realistic User Simulation Engine) to address this gap. CRAB-Bench generates tasks via a constraint graph over multiple interdependent entities with structured distractors, requiring agents to reason carefully over thousands of misleading candidates where only a tiny fraction of solutions are valid. RUSE replaces cooperative, template-like simulators with realistic users grounded in human behavioral studies, instantiated across diverse personas and four behavioral dimensions. Experiments on four frontier LLM agents show that the best model achieves only 61% pass@1 on CRAB-Bench, and switching to RUSE causes further drops of up to 57%, concentrated in task-solving ability rather than conversational quality. Information Disclosure is the most damaging behavioral dimension, and agents interacting with RUSE are less likely to admit mistakes, instead masking errors through implicit corrections.

2606.01813 2026-06-02 cs.CL

Cost-Aware Diffusion Draft Trees for Speculative Decoding

用于推测解码的成本感知扩散草稿树

Shuai Zhang, Huachuan Qiu, Hongliang He, Yong Dai

发表机构 * Zhejiang University(浙江大学) Westlake University(西湖大学)

AI总结 提出CaDDTree方法,通过联合优化树结构和节点预算直接最大化令牌吞吐量,并证明在凸验证成本下吞吐量函数是单峰的,从而无需离线预算搜索。

详情
AI中文摘要

推测解码通过让轻量级草稿模型生成令牌,并由目标语言模型并行验证来加速推理。诸如DFlash之类的块扩散草稿模型一次性生成整个草稿块,产生每个位置边际分布;DDTree利用这些边际分布构建候选树,在固定节点预算下最大化期望接受长度。然而,我们观察到接受长度随预算非递减:它总是偏好更大的树而不考虑验证成本,没有为预算选择提供原则性基础。我们提出CaDDTree(成本感知扩散草稿树),一种通过联合选择树结构和节点预算直接优化令牌吞吐量(单位时间内期望生成的令牌数)的方法。我们显式建模草稿和验证延迟,表明吞吐量目标可分解为每轮对预算的一维搜索,并证明在凸验证成本下吞吐量函数是单峰的,从而实现了高效的贪心停止规则。CaDDTree无需离线预算搜索,每轮根据当前每个位置分布和验证成本自适应调整预算。在Qwen3-4B和Qwen3-8B上跨越推理、编码和指令遵循任务的八个基准测试上的实验表明,CaDDTree在几乎所有任务上匹配或超越了具有最优预算选择的DDTree。

英文摘要

Speculative decoding accelerates inference by having a lightweight drafter propose tokens that the target language model verifies in parallel. Block diffusion drafters such as DFlash generate an entire draft block in one pass, yielding per-position marginals; DDTree uses these to build a candidate tree that maximizes expected acceptance length under a fixed node budget. We observe, however, that acceptance length is non-decreasing in budget: it always favors larger trees regardless of verification cost, offering no principled basis for budget selection. We introduce CaDDTree (Cost-aware Diffusion Draft Tree), a method that directly optimizes token throughput (expected tokens generated per unit time) by jointly selecting the tree structure and node budget. We model draft and verification latencies explicitly, show that the throughput objective decomposes into a per-round one-dimensional search over the budget, and prove that under a convex verification cost the throughput function is unimodal, enabling an efficient greedy stopping rule. CaDDTree requires no offline budget search, adapting the budget each round from the current per-position distributions and verification cost. Experiments on Qwen3-4B and Qwen3-8B across eight benchmarks spanning reasoning, coding, and instruction-following tasks show that CaDDTree matches or surpasses DDTree with oracle budget selection on nearly all tasks.
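A toy model makes the greedy stopping rule concrete. The sketch below is not the CaDDTree implementation: it assumes the marginal gain in expected accepted tokens from each additional node is already known and non-increasing (`gains`), that draft latency is constant, and that `verify_cost` is a convex function of the budget. All names and numbers are hypothetical.

```python
from typing import Callable, Sequence, Tuple


def throughput(
    budget: int,
    gains: Sequence[float],
    draft_latency: float,
    verify_cost: Callable[[int], float],
) -> float:
    """Expected tokens per unit time for one round with `budget` draft nodes."""
    # Modeling assumption: the target model contributes one token every round.
    expected_tokens = 1.0 + sum(gains[:budget])
    return expected_tokens / (draft_latency + verify_cost(budget))


def greedy_budget(
    gains: Sequence[float],
    draft_latency: float,
    verify_cost: Callable[[int], float],
) -> Tuple[int, float]:
    """Grow the budget node by node and stop at the first non-improvement.

    Stopping early is only safe when throughput is unimodal in the budget.
    """
    best_budget = 0
    best_rate = throughput(0, gains, draft_latency, verify_cost)
    for budget in range(1, len(gains) + 1):
        rate = throughput(budget, gains, draft_latency, verify_cost)
        if rate <= best_rate:
            break
        best_budget, best_rate = budget, rate
    return best_budget, best_rate


# Hypothetical per-node gains and a convex (quadratic) verification cost.
example_gains = [0.9, 0.7, 0.5, 0.3, 0.1, 0.05]
example_budget, example_rate = greedy_budget(
    example_gains, 1.0, lambda b: 2.0 + 0.05 * b * b
)
```

With these numbers the search stops at a budget of 3 nodes, even though a larger tree would still raise the expected acceptance length. That is the mismatch between acceptance length and throughput that the abstract points out.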

2606.01811 2026-06-02 cs.CL cs.AI cs.LG

"I've Seen How This Goes": Characterizing Diversity via Progressive Conditional Surprise

“我知道这会如何发展”:通过渐进条件惊奇度刻画多样性

Matthew Khoriaty, David Williams-King, Shi Feng

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Cambridge(剑桥大学) Stanford University(斯坦福大学)

AI总结 提出一种基于上下文学习的多样性度量方法 Decan(D_{Ca_n}),通过单次前向传递计算每个字节的得分,无需嵌入模型、参考语料或人工标注,在多个基准上验证了其有效性。

Comments 28 pages, 18 figures, 9 tables. Accepted to the Workshop on Generative AI, Creativity, and Human-AI Co-Creation @ ICML 2026 (non-archival). Code and data: https://github.com/AMindToThink/icl-diversity

详情
AI中文摘要

衡量创意输出的多样性对于评估训练后模式崩溃、比较解码策略以及量化AI和人类写作中的创造性行为至关重要。我们提出了一种使用上下文学习来度量多样性的新方法,其中“Decan”度量 $D_{Ca_n} = C \times a_n$ 是我们评估的工作实例:一个基于每个字节的得分,该得分从基础模型 $θ$ 的每个标记对数概率中读取,每次排列只需一次前向传递,无需嵌入模型、参考语料库和人工标签。该方法基于信息论,利用语言模型的上下文学习来检测任意数量输入之间的广泛相似性,并避免了训练专用模型的需要。同一流程对AI样本和人类编写的回答集进行评分,将多样性视为(回答、提示、评分模型)的一个属性。在Tevet和Berant基于人类判断的McDiv基准上,$D_{Ca_n}$ 在McDiv prompt_gen 集上达到了0.846的OCA,这是其表现最好的情况,仅次于Tevet和Berant报告的最强神经基线(SentBERT,0.897)。在OLMo-2-7B训练后流程中,$D_{Ca_n}$ 在基础→SFT→DPO→RLVR阶段单调下降,检测到创意写作应用所关注的多样性损失类型。

英文摘要

Measuring the diversity of creative outputs is central to evaluating post-training mode collapse, comparing decoding strategies, and quantifying creative behavior in both AI and human writing. We propose a new approach to measuring diversity using in-context learning, of which the "Decan" metric, $D_{Ca_n} = C \times a_n$, is the working instance we evaluate: a per-byte score read off the per-token log-probabilities of a base model $θ$ in a single forward pass per permutation, with no embedding model, no reference corpus, and no human labels. This approach is grounded in information theory, makes use of language model in-context learning to detect a wide range of similarities between any number of inputs, and obviates the need to train a special-purpose model. The same pipeline scores AI samples and human-written response sets, with diversity treated as a property of (responses, prompt, scoring model). On Tevet and Berant's human-grounded McDiv benchmark, $D_{Ca_n}$ reaches OCA 0.846 on the McDiv prompt_gen set where it performs best, behind the strongest neural baseline reported in Tevet and Berant (SentBERT, 0.897). On the OLMo-2-7B post-training pipeline, $D_{Ca_n}$ drops monotonically across the base $\to$ SFT $\to$ DPO $\to$ RLVR stages, detecting the type of diversity loss that creative-writing applications care about.

2606.01810 2026-06-02 cs.AI

Token Predictors Are Not Planners: Building Physically Grounded Causal Reasoners

Token 预测器不是规划器:构建物理基础的因果推理器

Zheng Lu, Mingqi Gao, Qinlei Xie, Wanqi Zhong, Hanwen Cui, Heng Cao, Zirui Song, Yifan Yang, Chong Luo, Bei Liu, Yiming Li

发表机构 * Tsinghua University(清华大学) Microsoft Research Asia(微软亚洲研究院) MBZUAI

AI总结 针对具身视觉-语言规划中模型依赖语言统计先验而非因果推理的问题,提出 Causal-Plan-Bench 基准和 Causal-Plan-1M 数据集,并训练 Causal Planner 模型,实现从 token 预测到物理因果推理的转变。

Comments 77 pages, appendices included. Code: https://github.com/THUSI-Lab/Causal-Reasoner

详情
AI中文摘要

当前的具身视觉-语言规划基准往往倾向于语言上的下一 token 预测,而非物理基础的下一状态推理。这奖励了模仿统计语言先验而非追踪因果依赖的模型,将物理规划简化为浅层序列建模。我们认为,可靠的物理自主性需要从语言基础的 token 预测转向物理基础的因果推理。为此,我们引入了 Causal-Plan-Bench,这是一个通过多阶段验证构建的高保真诊断套件,用于评估四个因果维度的具身规划。我们还构建了 Causal-Plan-1M,这是一个百万规模的显式推理轨迹语料库,通过四阶段标注流程从自我中心视频中生成。广泛评估表明,领先模型仍然难以展示真正的物理自主性,Gemini 3 Pro 在我们的基准上仅达到 38.18。相比之下,我们的训练方法使基于 Qwen3-VL-8B 构建的 Causal Planner 能够内化物理逻辑,从而实现更准确的下一状态估计。该模型在域内性能和跨基准泛化方面表现强劲,并揭示了一个因果缩放定律:将因果训练数据扩展到一百万实例可获得 36.3% 的相对提升,从 33.22 提高到 45.28。总体而言,我们的工作为将智能体从表面的 token 预测器转变为物理基础的因果推理器迈出了具体的一步。

英文摘要

Current benchmarks for embodied vision-language planning often favor linguistic next-token prediction over physically grounded next-state reasoning. This rewards models that mimic statistical language priors rather than track causal dependencies, reducing physical planning to shallow sequence modeling. We argue that reliable physical autonomy requires a shift from linguistically grounded token prediction toward physically grounded causal reasoning. To this end, we introduce Causal-Plan-Bench, a high-fidelity diagnostic suite curated through multi-stage verification to evaluate embodied planning across four causal dimensions. We also construct Causal-Plan-1M, a million-scale corpus of explicit reasoning traces produced by a four-stage annotation pipeline over egocentric videos. Extensive evaluation shows that leading models still struggle to demonstrate genuine physical agency, with Gemini 3 Pro reaching only 38.18 on our benchmark. In contrast, our training recipe enables Causal Planner, built on Qwen3-VL-8B, to internalize physical logic for more accurate next-state estimation. The model achieves strong in-domain performance and cross-benchmark generalization, and reveals a Causal Scaling Law: scaling causal training data to one million instances yields a 36.3% relative gain, from 33.22 to 45.28. Overall, our work provides a concrete step toward turning agents from superficial token predictors into physically grounded causal reasoners.
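The reported relative gain can be checked directly from the two scores quoted in the abstract. The helper name `relative_gain` is only illustrative.

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


# Scores quoted in the abstract: 33.22 before scaling, 45.28 at one million instances.
print(f"{relative_gain(33.22, 45.28):.1f}%")  # prints "36.3%"
```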

2606.01808 2026-06-02 cs.CV

Personalized 3D Myocardial Infarct Geometry Reconstruction from Cine MRI for Cardiac Digital Twins

基于电影MRI的个性化三维心肌梗死几何重建用于心脏数字孪生

Yilin Lyu, Mark YY Chan, Ching-Hui Sia, Lei Li

发表机构 * Department of Biomedical Engineering, National University of Singapore(新加坡国立大学生物医学工程系) Department of Medicine, National University of Singapore(新加坡国立大学医学系) Department of Cardiology, National University Heart Centre Singapore(新加坡国立心脏中心心内科部)

AI总结 提出一种显式几何-运动嵌入模型,从多视角电影MRI中全自动重建个性化、可仿真的三维心肌梗死几何结构,采用双分支自适应融合和AHA-17引导的多尺度监督,实现无对比剂梗死表征。

Comments 14 pages

详情
AI中文摘要

准确的三维心肌梗死(MI)几何表征对于构建心脏数字孪生(CDT)以精确模拟梗死相关电生理至关重要。晚期钆增强磁共振成像(LGE MRI)是定位MI的临床参考,但其对造影剂的依赖限制了在肾功能受损患者中的使用,并限制了纵向随访。作为替代,无对比剂电影MRI可可视化异常心室壁运动,这高度指示梗死区域。在本研究中,我们提出了一种新颖的显式几何-运动嵌入模型,直接从多视角电影MRI中全自动重建个性化、可仿真的三维MI几何结构。具体地,我们构建了一个4D(3D+t)双心室网格,以显式提取和解耦几何感知和运动感知特征。我们进一步设计了一个双分支模块用于自适应几何-运动融合,以捕获时空依赖性来映射梗死区域。此外,我们引入了一种利用AHA-17节段引导的交叉注意力机制的多尺度监督来指导预测,确保生物物理一致的重建。在225例电影MRI上的实验结果表明,所提出的三维MI重建实现了高性能,平均Dice得分为0.678±0.011。在下游的计算机电生理模拟评估中,结果与LGE衍生的真实情况高度一致,突显了所提出模型在无对比剂瘢痕表征和无缝集成到CDT建模中的巨大潜力。代码将在稿件被接受发表后公开。

英文摘要

Accurate 3D geometric characterization of myocardial infarction (MI) is essential for building cardiac digital twins (CDTs) to precisely simulate infarct-related electrophysiology. Late gadolinium enhancement magnetic resonance imaging (LGE MRI) is the clinical reference for locating MI, yet its reliance on contrast agents restricts use in renally impaired patients and limits longitudinal follow-ups. As an alternative, contrast-free cine MRI visualizes abnormal ventricular wall motion, which is highly indicative of the infarcted area. In this study, we propose a novel explicit geometry-motion embedded model to fully automatically reconstruct personalized, simulation-ready 3D MI geometries directly from multi-view cine MRIs. Specifically, we construct a 4D (3D + t) biventricular mesh to explicitly extract and decouple geometry-aware and motion-aware features. We further design a dual-branch module for adaptive geometry-motion fusion to capture the spatiotemporal dependencies needed to map the infarcted region. Furthermore, we introduce multi-scale supervision utilizing an AHA-17 segment-guided cross-attention mechanism to steer the prediction, ensuring biophysically consistent reconstruction. Experimental results on 225 cine MRIs demonstrated that the proposed 3D MI reconstruction achieved high performance with an average Dice score of 0.678 ± 0.011. In the downstream in-silico electrophysiological simulation evaluations, the results were highly consistent with the LGE-derived ground truth, highlighting the great potential of the proposed model for contrast-free scar characterization and seamless integration into CDT modeling. The code will be released publicly upon acceptance of the manuscript for publication.
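For readers unfamiliar with the metric, the Dice score measures overlap between a predicted region and a reference region. The sketch below uses the standard set-based definition. How the paper evaluates it on mesh nodes or voxels is not stated in the abstract, so the set-of-indices representation here is an assumption.

```python
from typing import Hashable, Iterable


def dice_score(predicted: Iterable[Hashable], reference: Iterable[Hashable]) -> float:
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) between two sets of element indices."""
    pred, ref = set(predicted), set(reference)
    if not pred and not ref:
        return 1.0  # convention chosen here: two empty regions agree perfectly
    return 2.0 * len(pred & ref) / (len(pred) + len(ref))


# Hypothetical infarct labels on mesh nodes 1..5.
score = dice_score({1, 2, 3, 4}, {3, 4, 5})  # 2 * 2 / (4 + 3) = 4/7
```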

2606.01803 2026-06-02 cs.AI

OctoT2I: A Self-Evolving Agentic Text-to-Image Router

OctoT2I:一种自我进化的智能文本到图像路由系统

Xu Jiang, Bin Chen, Gehui Li, Yule Duan, Ronggang Wang, Jian Zhang

发表机构 * School of Electronic and Computer Engineering, Peking University(电子与计算机工程学院,北京大学) Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University(广东省超高清沉浸媒体技术重点实验室,北京大学深圳研究生院)

AI总结 提出OctoT2I框架,通过自进化机制构建知识库并采用状态化多轮路由策略,联合优化生成质量与推理效率,在GenEval上达到0.96性能,同时实现90.3%推理加速和56.6%能效提升。

详情
AI中文摘要

文本到图像(T2I)模型的爆炸式增长——从大规模版本到轻量级、实时模型——如今面临单模型扩展的边际收益递减。智能T2I方法通过使用多个模型来缓解这一瓶颈。然而,现有的智能T2I方法面临三个关键挑战:依赖昂贵的手工先验或人工标注、僵化的单路径决策机制以及忽视推理效率。为解决这些挑战,我们引入OctoT2I,一种新颖的智能框架,将T2I任务重新表述为生成质量和推理效率的联合优化。OctoT2I实现了一种有状态的多轮路由策略,该策略基于其知识和记忆自适应地选择最合适的工具。这一策略由我们新颖的自进化机制从头构建的知识库支持。该机制无需人工监督,首先自主定义基础概念维度(例如风格、颜色、数量),然后通过迭代的“提出-求解-评估-学习”(PSEL)循环智能地探索它们的组合。PSEL循环高效地发现每个工具的能力边界,在无需外部指导的情况下推动持续改进。大量实验表明,OctoT2I在GenEval上实现了具有竞争力的性能(0.96),同时相比领先基线(Flow-GRPO)提供了90.3%的推理加速和56.6%的能效提升,在性能和效率之间取得了卓越的平衡。代码和模型将公开提供。

英文摘要

The explosive growth of Text-to-Image (T2I) models, from large-scale versions to lightweight, real-time ones, now faces diminishing marginal returns from single-model scaling. Agentic T2I methods emerged to alleviate this bottleneck by using multiple models. However, existing agentic T2I methods suffer from three key challenges: reliance on expensive handcrafted priors or human annotations, rigid single-path decision mechanisms, and a neglect of inference efficiency. To address these challenges, we introduce OctoT2I, a novel agentic framework that reformulates the T2I task as a joint optimization of generation quality and inference efficiency. OctoT2I implements a stateful, multi-round routing strategy that adaptively selects the most suitable tool based on its knowledge and memory. This strategy is enabled by a knowledge base built from scratch by our novel Self-Evolving Mechanism. This mechanism, which requires no human supervision, first autonomously defines foundational Conceptual Dimensions (e.g., style, color, count) and then intelligently explores their combinations via an iterative "Propose-Solve-Evaluate-Learn" (PSEL) loop. The PSEL loop efficiently discovers each tool's capability frontier, driving continuous improvement without external guidance. Extensive experiments demonstrate that OctoT2I achieves competitive performance (0.96) on GenEval while delivering a 90.3% inference speedup and a 56.6% energy-efficiency gain over the leading baseline (Flow-GRPO), striking an exceptional balance between performance and efficiency. Code and models will be made available.

2606.01800 2026-06-02 cs.CL cs.AI cs.LG

Multilinguality of Large Language Models From a Structural Perspective

从结构视角看大语言模型的多语言性

Haruki Sakajo, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe

发表机构 * Nara Institute of Science and Technology(奈良先端科学技术大学院大学)

AI总结 本研究通过表示结构分析探索大语言模型的多语言性,发现低资源语言与英语的结构差异大于高、中资源语言,且语言特定后训练改变结构但保留语言间关系。

详情
AI中文摘要

大型语言模型(LLMs)通过在多语言数据上进行预训练和后训练,在处理多种语言方面表现出色,尽管英语在训练数据中占主导地位。先前关注标记表示的研究揭示了这些LLMs如何处理非英语文本。尽管这些分析提供了有见地的发现,但它们未能捕捉到结构视角,而结构是语言的内在属性。在本研究中,我们通过表示结构分析探索LLMs的多语言性。我们的发现表明,低资源语言在结构上与英语的差异大于高资源和中资源语言,并且语言特定的后训练改变了它们的结构,同时保留了语言间的关系。

英文摘要

Large language models (LLMs) have excelled in processing multiple languages through pre- and post-training on multilingual data, even though English dominates the training data. Prior work focusing on token representations has revealed how those LLMs process non-English text. Although these analyses have provided insightful findings, they fail to capture a structural view, which is an inherent property of language. In this study, we explore the multilinguality of LLMs through representational structural analysis. Our findings reveal that low-resource languages are structurally more different from English than high- and mid-resource languages, and that language-specific post-training alters their structures while preserving inter-language relationships.

2606.01799 2026-06-02 cs.LG stat.ML

Tree-Guided Identify-Then-Exploit: A Unified Framework of Best Arm Identification and Regret Minimization for Dueling Bandits

树引导的识别-然后-利用:决斗式赌博机中最佳臂识别与遗憾最小化的统一框架

Pu Wang, Yao-Xiang Ding

发表机构 * State Key Lab of CAD&CG(计算机辅助设计与图形学国家重点实验室)

AI总结 针对Condorcet赢家假设下的N臂随机决斗式赌博机,提出树引导的识别-然后-利用(TG-ITE)统一框架,通过共享树引导识别方法在O(N)次比较内找到高置信度候选,并针对不同目标设计利用策略,首次同时实现最佳臂识别O(N)样本复杂度、弱遗憾O(N)和强遗憾O(N log T)保证,并消除现有方法中O(log N)的次优差距。

详情
AI中文摘要

我们研究在Condorcet赢家假设下的$N$臂随机决斗式赌博机,考虑三个广泛采用的目标:最佳臂识别(BAI)、弱遗憾和强遗憾。我们提出树引导的识别-然后-利用(TG-ITE),据我们所知,这是第一个统一处理所有这些目标的框架。无需更强的假设,我们提出一种共享的树引导识别方法,在$O(N)$次比较内找到高置信度的候选。我们进一步提出不同的利用策略,利用这个热启动阶段来优化具体目标。这种方法使得我们的方法能够:(1)在没有通常采用的更强假设的情况下,实现BAI的$O(N)$样本复杂度;(2)构建第一个赢家保持风格的算法,实现$O(N)$弱遗憾;(3)享有与专门强遗憾方法相同的$O(N \log T)$保证;(4)实现BAI和弱遗憾的联合优化,两者均具有$O(N)$保证,消除了现有方法中$O(\log N)$的次优差距。我们的结果提供了证据,表明在决斗式赌博机中,BAI和遗憾最小化之间的权衡相对温和。

英文摘要

We study $N$-armed stochastic dueling bandits under the Condorcet-winner assumption, where three widely adopted objectives are considered: best-arm identification (BAI), weak regret, and strong regret. We propose Tree-Guided Identify-Then-Exploit (TG-ITE), the first unified framework to tackle all these objectives to our knowledge. Without requiring stronger assumptions, we propose a shared tree-guided identification approach to find a high-confidence incumbent within $O(N)$ comparisons. We further propose varied exploitation strategies to utilize this warm-start stage to optimize the specific objectives at hand. This methodology enables our approach to (1) achieve $O(N)$ sample complexity in BAI without commonly adopted stronger assumptions; (2) build the first winner-stays-style algorithm to achieve $O(N)$ weak regret; (3) enjoy the same $O(N \log T)$ guarantee as specialized strong-regret approaches; (4) realize the joint optimization of BAI and weak regret with $O(N)$ guarantees for both, eliminating the sub-optimal gap of $O(\log N)$ in the existing approach. Our results provide evidence that the trade-off between BAI and regret minimization is relatively benign in dueling bandits.
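To give a feel for how a tree over the arms keeps the number of comparisons linear in $N$, here is a plain single-elimination knockout. It is not TG-ITE: the abstract does not describe the tree-guided procedure in detail, and the fixed `repeats` per match, the majority rule, and the `duel` callback are assumptions made for this sketch.

```python
from typing import Callable, Sequence, Tuple


def knockout_winner(
    arms: Sequence[int],
    duel: Callable[[int, int], bool],
    repeats: int = 5,
) -> Tuple[int, int]:
    """Return (surviving arm, number of duels) from a single-elimination bracket.

    duel(a, b) returns True when arm `a` beats arm `b` in one noisy comparison.
    """
    if not arms:
        raise ValueError("need at least one arm")
    comparisons = 0
    alive = list(arms)
    while len(alive) > 1:
        next_round = []
        for k in range(0, len(alive) - 1, 2):
            a, b = alive[k], alive[k + 1]
            wins_a = sum(1 for _ in range(repeats) if duel(a, b))
            comparisons += repeats
            # Majority vote over the repeated duels decides the match.
            next_round.append(a if 2 * wins_a > repeats else b)
        if len(alive) % 2 == 1:
            next_round.append(alive[-1])  # odd arm out advances with a bye
        alive = next_round
    return alive[0], comparisons
```

A bracket over $N$ arms always plays exactly $N - 1$ matches, so with a constant `repeats` the total is `repeats * (N - 1)` duels, linear in $N$. A fixed `repeats` does not by itself give the high-confidence guarantee the paper claims. Achieving that within $O(N)$ comparisons is the contribution described in the abstract.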

2606.01790 2026-06-02 cs.CV cs.AI

STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models

STaR-KV: 面向GUI视觉语言模型的时空自适应KV缓存压缩重加权方法

Yuhang Han, Wenzheng Yang, Yujie Chen, Xiangqi Jin, Yaojie Zhang, Siteng Huang, Linfeng Zhang

发表机构 * EPIC Lab, SJTU(上海交通大学EPIC实验室) HKUST (GZ)(香港科技大学(广州)) The University of Sydney(悉尼大学) UESTC(电子科技大学) ZJU(浙江大学)

AI总结 提出STaR-KV,一种无需训练的KV缓存压缩框架,通过子空间感知评分、时间稳定性折扣和熵驱动温度三个维度自适应校准令牌重要性,在GUI任务中实现高精度和近40%的峰值GPU内存节省。

详情
AI中文摘要

基于视觉语言模型的图形用户界面(GUI)代理展现出广泛的自动化能力,但其部署受限于随交互步骤线性增长的键值(KV)缓存。例如,UI-TARS-1.5-7B在仅五个屏幕截图上消耗76 GB的GPU内存,接近主流80 GB加速器的容量。现有的KV压缩方法共享两个结构假设:将视觉令牌重要性聚合为单个共享显著性图,并对融合的分数分布应用固定的top-B截断。初步测量反驳了这两点:空间专门化存在于注意力子空间层面并在层间迁移,而分数分布沿轨迹漂移。我们提出STaR-KV(时空自适应重加权),一种无需训练的KV缓存压缩框架,沿三个维度校准令牌重要性:(i)由在线空间互信息驱动的子空间感知评分;(ii)时间稳定性折扣,抑制来自持续关注子空间的冗余缓存条目;(iii)熵导出的温度,自适应重塑分数分布。在四个GUI基准测试中,STaR-KV在匹配预算下实现了最先进的KV压缩方法(如GUIKV、SnapKV)中最强的平均准确率,无压缩阶段FLOPs开销(-0.07%),并在20% KV缓存预算下削减近40%的峰值GPU内存。代码可在https://github.com/kawhiiiileo/STaR-KV获取。

英文摘要

Vision-language-model-based graphical user interface (GUI) agents have shown broad automation capabilities, yet deployment is bottlenecked by a key-value (KV) cache that grows linearly with interaction steps. For instance, UI-TARS-1.5-7B consumes 76 GB of GPU memory on merely five screenshots, approaching the capacity of mainstream 80 GB accelerators. Existing KV compression methods share two structural assumptions: aggregating visual-token importance into a single shared saliency map, and applying a fixed top-B cutoff to the fused score distribution. Pilot measurements refute both: spatial specialization lives at the attention-subspace level and migrates across layers, while the score distribution drifts in shape along a trajectory. We propose STaR-KV (Spatio-Temporal Adaptive Re-weighting), a training-free KV cache compression framework that calibrates token importance along three axes: (i) subspace-aware scoring driven by online spatial mutual information; (ii) a temporal stability discount that suppresses redundant cache entries from persistently attended subspaces; and (iii) an entropy-derived temperature that adaptively reshapes the score distribution. Across four GUI benchmarks, STaR-KV achieves the strongest average accuracy among state-of-the-art KV compression methods (e.g., GUIKV, SnapKV) at matched budgets, with no compression-stage FLOPs overhead (-0.07%) and cutting peak GPU memory by nearly 40% at a 20% KV-cache budget. Code is available at https://github.com/kawhiiiileo/STaR-KV.

2606.01789 2026-06-02 cs.AI

Consistency evaluation of benchmarks used for causal discovery

用于因果发现的基准一致性评估

Yuzhe Zhang, Chihui Chen, Lina Yao, Chen Wang

发表机构 * Independent researcher(独立研究者) UNSW Australia(新南威尔士大学澳大利亚分校) CSIRO Australia(澳大利亚联邦科学与工业研究组织)

AI总结 提出自动检索论文并利用大语言模型检查基准因果图与领域研究一致性的流程,评估11个流行基准,发现其一致性差异显著。

详情
AI中文摘要

在图形因果模型中,因果发现旨在基于数值数据和领域知识(以纯文本形式)构建因果图。然而,因果发现方法的评估在该领域仍然是一个挑战,因为领域研究的进展常常使得基准因果图包含不一致的知识。这个问题尤其影响基于大语言模型(LLM)的因果发现方法,因为它们对文献中的新发现敏感。本文首次系统研究基准因果图的质量。具体来说,我们设计了一个流程,自动从科学数据库中检索相关研究论文,并提示LLM检查基准因果图与领域研究论文之间的一致性。我们评估了11个流行的真实世界基准,我们的流程总共处理了38,081篇领域论文。结果表明,流行基准与领域研究的一致性差异显著,这对因果发现研究具有明确的意义。

英文摘要

In graphical causal modeling, causal discovery aims to construct a causal graph based on numerical data and domain knowledge in plain text. However, evaluating causal discovery methods remains a challenge in the area, as progress in domain research often leaves benchmark causal graphs containing misaligned knowledge. This problem especially affects the evaluation of large language model (LLM) based causal discovery methods, as they are sensitive to new discoveries in the literature. This work is the first to systematically study the quality of benchmark causal graphs. Specifically, we design a pipeline that automatically retrieves relevant research papers from scientific databases and prompts LLMs to check the consistency between the benchmark causal graphs and domain research papers. We evaluate 11 popular real-world benchmarks, for which our pipeline processes 38,081 domain papers in total. Our results show that popular benchmarks vary significantly in their consistency with domain research, with clear implications for causal discovery research.

2606.01788 2026-06-02 cs.CV

PlatonicNav: Unveiling Semantic Correspondence in Navigation with Platonic Topological Maps

PlatonicNav: 在导航中揭示柏拉图式拓扑地图的语义对应

Junlin Long, Zeyu Zhang, Xu Deng, Yiran Wang, Yue Yang, Luke Borgnolo, Maxwell Twelftree, Yang Zhao

发表机构 * USYD(悉尼大学) Maincode UNSW(新南威尔士大学) La Trobe(拉特罗布大学)

AI总结 提出PlatonicNav框架,通过自监督视觉编码器构建柏拉图式拓扑地图,无需跨模态训练即可统一视觉目标导航、跨模态目标导航和视觉语言导航任务。

详情
AI中文摘要

具身视觉导航中,智能体感知复杂环境并从原始感官输入出发行动以到达目标,支撑了家庭服务机器人、辅助机器人和大规模自主探索等广泛应用。然而,最近统一视觉语言导航(VLN)和物体目标导航(ObjNav)的尝试仍停留在架构融合、混合任务训练和大规模视觉语言预训练层面,未检验独立训练的视觉和语言编码器是否已共享共同的语义结构。此外,即使是以物体为中心的拓扑地图,仍通过显式跨模态监督(如CLIP或大型视觉语言模型)来锚定语言目标,尚不清楚这种锚定是否可能仅从纯视觉构建的地图实现。为解决这些挑战,我们将柏拉图式表示假说扩展到具身导航,并将纯视觉ObjNav、跨模态ObjNav和VLN重新解释为同一以物体为中心的语义流形的三种不同接口。我们进一步引入PlatonicNav,一个无需训练的框架,其柏拉图式拓扑地图融合来自自监督视觉编码器的几何和语义节点距离,并通过盲匹配(无需任何配对视觉语言数据)锚定语言目标。在HM3D-IIN、OVON和R2R-CE(基于MP3D)等仿真基准以及宇树Go2机器人上的实验表明,PlatonicNav无需显式跨模态训练即可跨任务、模态和具身形式泛化。代码:https://github.com/AIGeeksGroup/PlatonicNav。网站:https://aigeeksgroup.github.io/PlatonicNav。

英文摘要

Embodied visual navigation, where an agent perceives a complex environment and acts to reach a goal from raw sensory input, underpins a wide range of applications such as household service robotics, assistive robotics, and large-scale autonomous exploration. However, recent attempts to unify vision-and-language navigation (VLN) and object goal navigation (ObjNav) remain at the level of architectural fusion, mixed-task training, and large vision-language pretraining, without examining whether independently trained vision and language encoders may already share a common semantic structure. Moreover, even object-centric topological maps still ground language goals through explicit cross-modal supervision such as CLIP or large vision-language models, leaving open whether such grounding is possible from a purely vision-built map. To address these challenges, we extend the Platonic Representation Hypothesis to embodied navigation and recast vision-only ObjNav, cross-modal ObjNav, and VLN as three different interfaces to the same object-centric semantic manifold. We further introduce PlatonicNav, a training-free framework whose Platonic Topological Map fuses geometric and semantic node distances from a self-supervised visual encoder, and grounds language goals via blind matching without any paired vision-language data. Extensive experiments on simulation benchmarks including HM3D-IIN, OVON, and R2R-CE on MP3D, together with deployment on Unitree Go2, demonstrate that PlatonicNav generalizes across tasks, modalities, and embodiments without explicit cross-modal training. Code: https://github.com/AIGeeksGroup/PlatonicNav. Website: https://aigeeksgroup.github.io/PlatonicNav.

2606.01787 2026-06-02 cs.AI math.OC

Stochastic convergence of parallel asynchronous adaptive first-order methods

并行异步自适应一阶方法的随机收敛性

Serge Gratton, Philippe L. Toint

发表机构 * Université de Toulouse, INP, IRIT, Toulouse, France(图卢兹大学,INP,IRIT,法国图卢兹) 3IA Artificial and Natural Intelligence Toulouse Institute (ANITI)(图卢兹3IA人工智能与自然智能研究所(ANITI)) NAXYS, University of Namur, Namur, Belgium(NAXYS,纳慕尔大学,比利时纳慕尔)

AI总结 本文提出一类新的异步自适应一阶优化方法,包括多种流行算法的异步变体,并分析其在非凸函数上的随机收敛性,达到O(1/√t)的收敛速率。

详情
AI中文摘要

本文介绍了一类新的异步自适应一阶优化方法,包括几种流行算法的异步变体。还考虑了使用动量和/或非精确归一化的这些方法的版本。在完全随机环境下分析了该类方法在非凸函数上的收敛性,并证明在合理假设下,收敛阶为O(1/√t)(忽略对数因子)。数值实验表明,这种异步自适应算法在异构大规模机器学习系统中非常有用。

英文摘要

A new class of asynchronous adaptive first-order optimization methods is introduced, comprising asynchronous variants of several popular algorithms. Versions of these methods using momentum and/or inexact normalization are also considered. The convergence of methods in the class on non-convex functions is analyzed in a fully stochastic setting, and is shown to be (up to logarithmic factors) of order $O(1/\sqrt{t})$ under reasonable assumptions. Numerical experiments suggest that such asynchronous adaptive algorithms are very relevant in heterogeneous large-scale machine learning systems.
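The flavor of an asynchronous adaptive update can be seen in a tiny simulation. The abstract does not specify the algorithm class, so this sketch substitutes an AdaGrad-style coordinate-wise normalization and models asynchrony as a fixed staleness: each update uses a gradient evaluated at an iterate `delay` steps old. The quadratic objective, the fixed delay, and all parameter values are illustrative assumptions.

```python
from collections import deque
from math import sqrt


def loss(x):
    """Toy objective f(x) = 0.5 * ||x||^2."""
    return 0.5 * sum(v * v for v in x)


def grad(x):
    """Gradient of the toy objective, which is x itself."""
    return list(x)


def delayed_adagrad(x0, steps=200, delay=2, lr=0.5, eps=1e-8):
    """AdaGrad-style updates driven by gradients of stale iterates."""
    x = list(x0)
    accum = [0.0] * len(x)  # running sum of squared gradients per coordinate
    # Snapshots of recent iterates; the oldest one is what a slow worker read.
    history = deque([list(x)], maxlen=delay + 1)
    for _ in range(steps):
        stale_gradient = grad(history[0])
        for i, g in enumerate(stale_gradient):
            accum[i] += g * g
            x[i] -= lr * g / (sqrt(accum[i]) + eps)
        history.append(list(x))
    return x


final_point = delayed_adagrad([3.0, -2.0])
```

On this toy problem the accumulated squared gradients shrink the step size enough that stale gradients do not destabilize the iteration. The paper's analysis concerns the much harder stochastic non-convex case.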

2606.01781 2026-06-02 cs.AI

Structure-Guided Adaptive Propagation for Protein-Protein Interaction Site Prediction

结构引导的自适应传播用于蛋白质-蛋白质相互作用位点预测

Enqiang Zhu, Yizi Liu, Yilong Luo, Yao Chen, Yu Zhang, Baoshan Ma

发表机构 * Institute of Computing Science and Technology, Guangzhou University(广州大学计算机科学与技术学院) School of Computer Science, Peking University(北京大学计算机科学学院) Information Science & Technology Department, Beijing Capital International Airport Co., Ltd.(北京首都国际机场有限公司信息科学与技术部) School of Information Science and Technology, Dalian Maritime University(大连海事大学信息科学与技术学院)

AI总结 提出SGAP-PPIS模型,利用等变图神经网络的多尺度几何状态生成残基级传播系数,实现自适应信息扩散,在Test_60上取得竞争性能。

Comments 9 pages, 3 figures

详情
AI中文摘要

准确预测蛋白质-蛋白质相互作用位点(PPIS)对于理解细胞过程、疾病机制和治疗靶点发现至关重要。基于图的深度学习通过整合残基级结构上下文推进了PPIS预测。然而,尽管蛋白质界面存在结构和功能异质性,大多数基于图的模型仍依赖固定传播方案,对所有残基一视同仁。这种传播可能限制信息扩散适应局部几何环境的能力,使得难以区分真正的相互作用位点和结构相似的非相互作用邻居。我们提出SGAP-PPIS,一种用于PPIS预测的结构引导自适应传播模型。SGAP-PPIS不使用固定传播机制,而是利用等变图神经网络的多尺度几何状态生成残基级传播系数。这种设计允许每个残基根据其几何微环境自适应地平衡局部特征保留和邻域扩散。实验结果表明,SGAP-PPIS在Test_60上达到了与最先进方法竞争的性能。消融研究表明,几何条件自适应传播、尺度对齐几何引导和多步传播状态表示共同推动了这些改进。

英文摘要

Accurate prediction of protein-protein interaction sites (PPIS) is essential for understanding cellular processes, disease mechanisms, and therapeutic target discovery. Graph-based deep learning has advanced PPIS prediction by incorporating residue-level structural context. However, most graph-based models still rely on fixed propagation schemes that treat all residues similarly, despite the structural and functional heterogeneity of protein interfaces. Such propagation may limit the ability to adapt information diffusion to local geometric environments, making it difficult to distinguish true interaction sites from structurally similar non-interacting neighbors. We present SGAP-PPIS, a structure-guided adaptive propagation model for PPIS prediction. Rather than using a fixed propagation mechanism, SGAP-PPIS leverages multi-scale geometric states from an equivariant graph neural network to generate residue-wise propagation coefficients. This design allows each residue to adaptively balance local feature preservation and neighborhood diffusion according to its geometric microenvironment. Experimental results show that SGAP-PPIS achieves performance competitive with state-of-the-art methods on Test_60. Ablation studies show that geometry-conditioned adaptive propagation, scale-aligned geometric guidance, and multi-step propagation-state representation jointly drive these improvements.
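The balance between local feature preservation and neighborhood diffusion can be written as a per-residue convex combination. The sketch below is only a schematic of that idea. In SGAP-PPIS the coefficients come from an equivariant graph neural network, whereas here `alpha` is passed in directly, and the mean-over-neighbors aggregation is an assumption.

```python
from typing import Dict, List, Sequence


def propagate(
    features: Sequence[Sequence[float]],
    neighbors: Dict[int, List[int]],
    alpha: Sequence[float],
) -> List[List[float]]:
    """One propagation step: h_i' = alpha_i * h_i + (1 - alpha_i) * mean_j h_j."""
    updated = []
    for i, h in enumerate(features):
        nbrs = neighbors.get(i, [])
        if nbrs:
            mean = [
                sum(features[j][d] for j in nbrs) / len(nbrs)
                for d in range(len(h))
            ]
        else:
            mean = list(h)  # isolated residue: nothing to diffuse from
        a = alpha[i]
        updated.append([a * hd + (1.0 - a) * md for hd, md in zip(h, mean)])
    return updated
```

Setting `alpha[i]` near 1 keeps a residue's own features, while a value near 0 replaces them with the neighborhood average. A fixed propagation scheme corresponds to using the same value for every residue.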

2606.01779 2026-06-02 cs.CL

HarnessForge: Joint Harness and Policy Evolution for Adaptive Agent Systems

HarnessForge:面向自适应智能体系统的协同框架与策略进化

Mingju Chen, Can Lv, Guibin Zhang, Heng Chang, Shiji Zhou

发表机构 * Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, School of Artificial Intelligence, Beihang University(北京未来区块链与隐私计算先进创新中心,人工智能学院,北京航空航天大学) Tsinghua University(清华大学)

AI总结 提出HarnessForge元自适应框架,通过框架-策略协同进化实现LLM智能体系统的全系统自适应,在多个基准上显著提升性能。

Comments 25 pages, 13 figures

详情
AI中文摘要

LLM智能体越来越需要在需要不同执行范式的异构任务环境中运行。这对固定智能体系统提出了挑战,并推动了超越孤立组件更新的系统级元自适应。虽然现有工作已自适应外部框架或训练底层推理策略,但全系统自适应仍未被充分表征。结构与执行之间的自适应空间很少被明确化,外部框架与内部推理器之间的兼容性也未得到联合优化。我们提出HarnessForge,一个用于进化LLM智能体系统的元自适应框架。HarnessForge将智能体系统形式化为一个框架-策略对,定义了一个稳定的自适应空间,将框架级执行结构与策略级推理行为分离。然后,它通过故障引导的框架裁剪和框架条件化的策略对齐执行框架-策略协同进化。在来自不同领域的五个基准上的实验表明,HarnessForge一致地改进了Qwen3-4B和Qwen3-8B骨干网络,优于仅框架和仅策略的基线,比最强基线提升高达12.0%,并实现了有利的展开效率权衡,证明了框架-策略协同进化是有效的,并且框架与推理策略之间的可执行兼容性对于智能体系统自适应至关重要。代码可在https://github.com/mingju-c/HarnessForge获取。

英文摘要

LLM agents are increasingly expected to operate across heterogeneous task regimes that require distinct execution paradigms. This challenges fixed agent systems and motivates system-level meta-adaptation beyond isolated component updates. While existing works have adapted the external harness or trained the underlying reasoning policies, full-system adaptation remains insufficiently characterized. The adaptation space between structure and execution is rarely made explicit, and the compatibility between the external harness and the internal reasoner is not optimized jointly. We propose HarnessForge, a meta-adaptive framework for evolving LLM agent systems. HarnessForge formulates an agent system as a harness-policy pair, defining a stable adaptation space that separates harness-level execution structure from policy-level reasoning behavior. It then performs harness-policy co-evolution through fault-guided harness tailoring and harness-conditioned policy alignment. Experiments across five benchmarks from diverse domains show that HarnessForge consistently improves both Qwen3-4B and Qwen3-8B backbones, outperforming harness-only and policy-only baselines with gains of up to 12.0% over the strongest baseline and achieving favorable rollout-efficiency tradeoffs. These results demonstrate that harness-policy co-evolution is effective, and that executable compatibility between the harness and reasoning policy is essential for agent-system adaptation. The code is available at https://github.com/mingju-c/HarnessForge.