arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.17935 2026-06-17 cs.CV New submission

MoonSplat: Monocular Online Gaussian Splatting with Sim(3) Global Optimization

Guo Pu, Yixuan Han, Haofeng Li, Yao Zhang, Hui Zhou, Zhouhui Lian

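Affiliations: Wangxuan Institute of Computer Technology, Peking University; Beijing Hydrogen Intelligent Tech. Co., Ltd.

AI summary: Proposes an online voxelized 3DGS framework with Sim(3) global optimization; a color residual learning strategy accelerates convergence, enabling robust camera tracking and global loop closure and reaching state-of-the-art performance on indoor and outdoor datasets.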

Comments SIGGRAPH 2026

Abstract

Online 3D reconstruction from monocular image sequences is a challenging and ongoing research topic. 3D Gaussian Splatting (3DGS), leveraging its high-quality real-time rendering capability, empowers online 3D reconstruction to represent dense scenes with enhanced expressiveness, and thus holds great promise for a wide range of applications such as robotics and AR/VR. However, existing online 3DGS methods still suffer from some key challenges: fragile camera pose estimation due to the lack of global optimization, and low optimization efficiency in large-scale or long-sequence scenarios. To address these issues, we propose a robust and efficient online voxelized 3DGS reconstruction framework integrated with global $\text{Sim}(3)$ optimization, which enables reliable camera tracking and efficient global loop closure for both camera poses and voxelized 3DGS. To accelerate the convergence of the voxelized 3DGS, we further introduce a color residual learning strategy, which not only boosts optimization speed but also enhances rendering quality. Extensive experiments on diverse indoor and outdoor datasets demonstrate that our method achieves state-of-the-art performance in both camera pose estimation accuracy and rendering quality, while retaining real-time efficiency. Additionally, we develop and deploy a real-world UAV-based active reconstruction system grounded on our proposed method, validating its robustness and generalizability for practical online 3D reconstruction tasks. Our code and data are available at https://github.com/TrickyGo/MoonSplat.
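
For readers unfamiliar with the group named in this abstract: a Sim(3) element combines a scale s, a rotation R, and a translation t, and maps a point x to sRx + t. The sketch below is illustrative only and is not taken from the MoonSplat code; all function names are hypothetical. It shows how two similarity transforms compose, which is the property that lets a similarity correction absorb scale drift in a way a rigid transform cannot.

```python
import math

def rot_z(theta):
    """Rotation about the z-axis by theta radians, as a 3x3 nested list."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def sim3_apply(transform, x):
    """Apply (s, R, t) to point x: returns s * R @ x + t."""
    s, R, t = transform
    return tuple(s * sum(R[i][j] * x[j] for j in range(3)) + t[i] for i in range(3))

def sim3_compose(a, b):
    """Return the single transform equal to applying b first, then a."""
    sa, Ra, _ = a
    sb, Rb, tb = b
    R = [[sum(Ra[i][k] * Rb[k][j] for k in range(3)) for j in range(3)] for i in range(3)]
    # a(b(x)) = sa*sb * Ra@Rb @ x + (sa * Ra @ tb + ta)
    return (sa * sb, R, sim3_apply(a, tb))
```

Composing the transforms first and applying the result once gives the same point as applying them one after the other, and the scales multiply.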

2606.17931 2026-06-17 cs.LG New submission

Predictive Analytics in E-Commerce for Customer Behavior Forecasting using hybrid Ret-DNN with XGBoost Model

Degala Pushpa Sri, Mayank Atreya, Lakshmi. H, Navin Chhibber, Mukesh Soni

Affiliations: Chewy Inc; Pace Institute of Technology and Atlanta, USA; Nitte Meenakshi Institute of Sciences; Lovely Professional University; Infinity Tech Group; University

AI summary: Proposes a hybrid Ret-DNN and XGBoost model that predicts customer purchase probability through feature extraction followed by gradient boosting, reaching an MAE of 0.2193 on a UK retail dataset.

Comments 2025 2nd International Conference on Software, Systems and Information Technology (SSITCON)

Abstract

In recent years, electronic (E) commerce services have rapidly grown in people's daily lives, helping them purchase products online. However, retail platforms struggle to understand customer behavior, which makes it difficult to predict future purchases. To overcome these challenges, this study proposes a hybrid Retail Deep Neural Network (Ret-DNN) with an Extreme Gradient Boosting (XGBoost) model for capturing temporal features and tabular dynamics of retail data. First, data were sourced from a United Kingdom (UK)-based online retailer and contain almost 500,000 transaction records. Then, the collected data were pre-processed using a series of techniques, such as data cleaning, outlier handling, temporal feature extraction, feature encoding, and z-score normalization, to ensure that the data were ready for model training and testing. Subsequently, the preprocessed data were fed into the Ret-DNN model, which acts as a feature extractor to understand the complete context of customer transactions. Further, the extracted features were fed as input into the XGBoost model, which predicted the final output as the purchase probability of customers. Finally, the proposed Ret-DNN XGBoost model achieved better results than the existing Ret-DNN model, attaining a Mean Absolute Error (MAE) of 0.2193. Keywords: Customer behavior forecasting, extreme gradient boosting, electronic commerce, predictive analytics, retail deep neural networks.
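
Two of the steps this abstract names, z-score normalization and the Mean Absolute Error metric, have standard definitions that are easy to state in code. The sketch below is a minimal illustration under assumptions: the function names and the sample numbers are hypothetical, and the abstract does not say whether the authors used the population or the sample standard deviation (the population form is used here).

```python
import statistics

def z_score(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical purchase labels and predicted probabilities.
labels = [1, 0, 1, 0]
predicted = [0.8, 0.2, 0.6, 0.1]
mae = mean_absolute_error(labels, predicted)
```

With probabilities as predictions and 0/1 labels as targets, an MAE of 0.2193 means the predicted probability is off by about 0.22 on average.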

2606.17930 2026-06-17 cs.AI New submission

How Inference Compute Shapes Frontier LLM Evaluation

Jessica McFadyen, Ole Jorgensen, Harry Coppock, Kevin Wei, Cozmin Ududec

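Affiliations: UK AI Security Institute

AI summary: Evaluates up to 12 frontier language models under controlled inference compute (token budgets, context compaction, and repeated submission) and finds that larger compute substantially improves performance, that fixed-budget evaluations understate model capability, and that benchmarks differ in their sensitivity to inference-scaling methods.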

Comments 34 pages, 4 figures

Abstract

AI evaluations are shifting toward harder tasks that benefit from longer trajectories involving tool use and iterative problem solving. As a result, performance is increasingly sensitive to the amount and allocation of compute available at test time ("inference compute"). Yet many evaluations still report performance at a single restrictive budget, meaning that low scores may reflect the evaluation setup rather than the model's underlying capability. To test this, we evaluate up to 12 frontier language models on seven challenging benchmarks spanning software engineering, mathematics, medicine, and cybersecurity. We use a controlled setup combining three simple inference-scaling interventions: larger token budgets, context compaction, and repeated submission attempts, guided either by the model itself or by minimal correctness feedback. We find three main results. First, larger token budgets substantially improve performance on benchmarks across multiple domains, including cybersecurity, FrontierMath, Humanity's Last Exam, and TerminalBench. Second, fixed-budget evaluations can increasingly understate frontier capability as models advance. Newer models reach higher performance at large budgets, where they unlock harder tasks and solve them more reliably. Third, benchmarks differ in which inference-scaling methods help most: repeated submission broadly improves performance, but the value of larger token budgets, external feedback, and parallel attempts varies by benchmark. Overall, our results show that benchmark scores are protocol-dependent. We therefore argue that evaluations should report capability as a function of inference-time compute, specify protocol choices explicitly, and compare model generations over a large shared compute range at matched budgets, especially in safety- or policy-relevant settings.
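
The recommendation to report capability as a function of inference-time compute can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function name, the token counts, and the simplification that each task has a fixed minimum token cost (real runs are stochastic). It is not the authors' evaluation code.

```python
def solve_rate_by_budget(tokens_needed, budgets):
    """Fraction of tasks solved at each token budget.

    tokens_needed holds, per task, the tokens required to solve it,
    or None if the task was never solved at any budget tried.
    """
    total = len(tokens_needed)
    return {
        budget: sum(1 for n in tokens_needed if n is not None and n <= budget) / total
        for budget in budgets
    }

# Hypothetical per-task costs for one model on one benchmark.
costs = [10_000, 50_000, 200_000, None, 800_000]
curve = solve_rate_by_budget(costs, [32_000, 256_000, 1_000_000])
```

In this toy data a single report at the 32,000-token budget would show 20% of tasks solved, while the same model reaches 80% at the largest budget, which is the kind of gap a single restrictive budget hides.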

2606.17929 2026-06-17 cs.AI New submission

PreAct: Computer-Using Agents that Get Faster on Repeated Tasks

Bojie Li

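Affiliations: Pine AI

AI summary: Proposes PreAct, which compiles the first successful run into a state-machine program and replays it directly on later runs, avoiding per-step language-model calls for an 8.5-13x speedup while checking that the screen state matches during replay.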

Abstract

Computer-using agents drive real software through the screen -- clicking and typing -- but they solve every task from scratch: asked to repeat a task, an agent re-reads the screen, re-reasons every tap, and pays the full cost again. We present PreAct, which lets such an agent get faster on tasks it has done before. The first time it succeeds, PreAct compiles the run into a small state-machine program -- states that check the screen, transitions that act -- and on later runs replays it directly instead of invoking the agent, running 8.5-13x faster with no per-step language-model calls. Replay is not blind: at each step PreAct checks that the screen matches what the program expects before acting, and hands control back to the agent the moment something is off. PreAct applies the same discipline when deciding what to keep: a freshly compiled program enters the store only if, re-run from a clean state, an independent evaluator confirms it solved the task, which catches programs that replay to their last step yet leave the task undone. Across a mobile, a desktop, and a web benchmark, this store-time check separates repeated runs that improve from ones that degrade as faulty programs accumulate, worth 1.75-2.6 tasks per benchmark, the same direction on all three; a fallback that explores afresh when no program fits brings PreAct level with a strong record-and-replay baseline. We also report what did not matter: prompt wording, runtime guardrails, and whether a language model or a plain embedding retriever selects which program to reuse.
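
The replay-with-checks idea in this abstract can be sketched in a few lines. This is a toy model under stated assumptions, not PreAct's implementation: screens are plain strings, a program is a list of (expected screen, action) pairs, and `replay`, `FakeApp`, and the fallback callable are hypothetical names. The store-time evaluator check is not modeled.

```python
def replay(program, read_screen, act, fallback_agent):
    """Replay a compiled program, checking the screen before every action.

    Returns "replayed" if every step matched; otherwise hands control
    to fallback_agent at the first mismatch and returns its result.
    """
    for expected_screen, action in program:
        if read_screen() != expected_screen:
            return fallback_agent()
        act(action)
    return "replayed"

class FakeApp:
    """Toy environment in which each action advances to the next screen."""

    def __init__(self, screens):
        self.screens = screens
        self.index = 0
        self.actions = []

    def read_screen(self):
        return self.screens[self.index]

    def act(self, action):
        self.actions.append(action)
        self.index = min(self.index + 1, len(self.screens) - 1)

program = [("home", "tap_settings"), ("settings", "tap_wifi")]
```

When the app shows the expected screens, both actions run without any agent call. If an unexpected screen such as a popup appears after the first action, replay stops there and the fallback takes over.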

2606.17927 2026-06-17 cs.LG cs.AI New submission

KANLib -- A Modular, Extensible and Fast Kolmogorov-Arnold Network Implementation

Julian Hoever, Gregor Schiele

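Affiliations: Intelligent Embedded Systems, University of Duisburg-Essen

AI summary: Proposes the KANLib framework, which unifies existing KAN implementations and supports multiple basis functions and adaptive grid rescaling, reproducing reference predictions while remaining flexible and fast.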

Abstract

Kolmogorov-Arnold Networks (KANs) have recently emerged as a promising alternative to traditional multilayer perceptrons by replacing linear weights with learnable univariate functions. Despite their theoretical advantages in interpretability and expressiveness, practical research of KANs remains difficult due to high computational costs and inconsistent feature support across existing frameworks. This paper introduces KANLib, a modular, extensible, and computationally efficient framework for developing and evaluating KAN architectures. KANLib unifies core concepts from existing implementations, including PyKAN, EfficientKAN, and FastKAN, within a consistent software architecture that emphasizes flexibility, feature parity, and high performance. The framework supports two basis function types, adaptive grid rescaling, grid extension, and fine-grained architectural customization while maintaining compatibility with standard PyTorch workflows. Experimental evaluation on the California Housing benchmark demonstrates that KANLib reproduces the predictive behavior of established reference KAN implementations while achieving competitive computational efficiency. Furthermore, the framework enables the exploration of architectural variations beyond standard KAN formulations with only minor impacts on predictive performance. Overall, KANLib provides a robust foundation for future research on scalable and extensible KAN architectures.
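
To make "replacing linear weights with learnable univariate functions" concrete, the sketch below computes one KAN layer in plain Python. It is not KANLib's API: the function names are hypothetical, and the choice of a Gaussian radial basis is an assumption made for brevity (the abstract does not say which two basis types the framework supports). Each edge from input i to output j owns its own coefficient vector, and an output is the sum of its incoming edge functions.

```python
import math

def rbf_basis(x, centers, width):
    """Evaluate Gaussian bumps centered on a fixed grid at scalar x."""
    return [math.exp(-((x - c) / width) ** 2) for c in centers]

def kan_layer(inputs, coeffs, centers, width):
    """Compute outputs[j] = sum_i phi_ji(inputs[i]).

    coeffs[j][i] is the learnable coefficient vector of edge function
    phi_ji, which is a weighted sum of the basis functions.
    """
    outputs = []
    for edge_row in coeffs:
        total = 0.0
        for x, edge_coeffs in zip(inputs, edge_row):
            basis = rbf_basis(x, centers, width)
            total += sum(c * b for c, b in zip(edge_coeffs, basis))
        outputs.append(total)
    return outputs
```

An MLP layer would instead multiply each input by a single scalar weight and apply one fixed activation per output; here the nonlinearity lives on every edge and is itself trained.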

2606.17924 2026-06-17 cs.RO cs.AI New submission

PearlVLA: Progressive Embodied Action-Plan Refinement in Latent Space

Bochen Yang, Lianlei Shan

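Affiliations: Imperial College London; Tsinghua University

AI summary: Proposes the PearlVLA framework, which balances efficient action generation with explicit deliberation through iterative plan refinement in the VLM latent space and reaches state-of-the-art performance on the LIBERO benchmark.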

Comments 21 pages, 2 figures. Preprint

Abstract

Current Vision-Language-Action (VLA) models face a trade-off between efficient action generation and explicit deliberation. Directly decoding actions from vision-language backbone representations enables low-latency control, whereas explicit reasoning through textual chains, pixel-level subgoals, or action search can improve planning but incurs substantial latency and computational cost. We propose PearlVLA, a VLA framework that moves deliberation into the latent space of a vision-language model (VLM). PearlVLA separates VLM meta-query representations into a fixed visual grounding branch and an iterative latent plan branch. At each refinement round, a plan-conditioned world query probes a lightweight frozen latent world model for an action-free future observation latent, which is fed back to guide plan refinement. A future-guided RefineNet then applies scheduled residual updates to progressively refine a coarse semantic draft into a fine-grained latent action plan. The refined plan after K rounds is then decoded in parallel into an action chunk for low-latency execution. We further introduce Causal Refinement-Grouped Process-Reward RL to optimize the latent refinement process with rewards from longer-horizon imagined futures induced by latent plan edits. Empirical evaluations on the LIBERO benchmark demonstrate that PearlVLA achieves state-of-the-art performance among existing methods.

2606.17906 2026-06-17 cs.RO New submission

WAM-RL: World-Action Model Reinforcement Learning with Reconstruction Rewards and Online Video SFT

Zezhong Qian, Xiaowei Chi, Yu Qi, Haozhan Li, Zhi Yang Chen, Shanghang Zhang

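Affiliations: State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; Northeastern University; Tsinghua University

AI summary: Proposes the WAM-RL framework, which jointly optimizes the world model and the action model with reinforcement learning, addressing the limits of actor-only optimization on long-horizon tasks and introducing reinforcement learning into the World-Action paradigm for the first time.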

Abstract

Recent World-Action (WA) models demonstrate strong generalization ability and data efficiency, but they typically rely on expert trajectories for training. This reliance limits their ability to acquire fine-grained manipulation skills beyond the demonstration distribution and prevents them from continuously improving through real-world interaction. To address these limitations, we propose WAM-RL, a reinforcement learning framework that enables joint optimization of the world model and the action model through online interaction with the environment. By allowing the two components to co-evolve, our approach enhances fine-grained control and adaptability. Specifically, a WA model consists of a world model and an actor. We design a tailored reinforcement learning method with hierarchical optimization to coordinate their improvement. On the methodological side, we systematically investigate the effects of applying reinforcement learning to the action model, as well as online training of the world model within an RL setting. Our experiments reveal a key insight: optimizing only the actor yields improvements on short-horizon tasks, but fails to provide significant gains on long-horizon tasks. In contrast, jointly optimizing both the world model and the actor is critical for achieving strong performance in long-horizon settings. Our work is the first to introduce reinforcement learning into the World-Action paradigm, and provides insights into how online optimization of both the action head and the world model impacts overall performance.

2606.17905 2026-06-17 cs.CL New submission

ChLogic: Evaluating Robustness of Logical Reasoning in Chinese Expressions

Peixian Zhou, Yuxu Chen, Chaorui Zhang, Wei Han, Bo Bai, Xueyan Niu

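Affiliations: College of Mathematics, Sichuan University; Theory Lab, 2012 Labs, Huawei Technologies Co., Ltd

AI summary: Proposes ChLogic, an English-Chinese aligned benchmark built from formal logical templates that tests the robustness of logical reasoning across English and diverse Chinese expressions; it finds an English-Chinese performance gap, and the effect of back-translation varies by dataset and model.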

Abstract

Large language models perform increasingly well on standardized logical reasoning benchmarks, but whether this ability remains robust beyond English is unclear. We introduce ChLogic, an English--Chinese aligned benchmark that tests whether models preserve logical reasoning performance when the same latent logical structure is expressed in English and diverse Chinese surface realizations. Built from formal logical templates, the benchmark contains three data sets: (i) the General aligned set, derived from 60 General Propositions across nine template families; (ii) the Difficult aligned set, derived from 40 Difficult Problems; and (iii) the Chinese-only set, covering 15 language-specific phenomenon types. Each aligned item pairs one English reference expression with five Chinese realizations. Experiments on Qwen3, Ministral, and GLM models reveal a persistent English--Chinese performance gap. Back-translation from standard Chinese into English often improves performance on the General aligned set, but produces mixed effects on the Difficult aligned set, where Qwen3-32B and GLM-5.1 perform worse after translation. These results indicate that Chinese surface realization, translation artifacts, and model-specific behavior jointly affect multilingual logical reasoning. Overall, ChLogic provides a useful stress test for the robustness of multilingual reasoning.

2606.17904 2026-06-17 cs.AI New submission

DiagFlowBench: Evaluating How Language Models Handle Off-Procedure Inputs in Grounded Diagnostic Dialogue

Guillermo Gil de Avalle, Laura Maruster, Shaina Raza, Christos Emmanouilidis

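Affiliations: University of Groningen; Vector Institute for Artificial Intelligence

AI summary: Proposes the DiagFlowBench benchmark, 1,676 multi-turn conversations converted from 50 industrial diagnostic flowcharts, and evaluates 10 models on recognizing off-procedure inputs, finding that models often select a real but inappropriate step rather than fabricating facts.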

Abstract

Language models increasingly serve as advisory systems in maintenance operations. To prevent hallucination, recent systems ground these models in procedural documentation to constrain them to approved steps. In practice, however, operator queries frequently stray from this path, requiring models to recognise out-of-scope inputs mid-conversation, a dynamic that current benchmarks rarely prioritise. We introduce DiagFlowBench, a dataset of 50 industrial diagnostic flowcharts from a consumer manufacturer converted into 1,676 multi-turn conversations that contrast compliant with out-of-scope utterances. Evaluating a panel of ten commercial and open-weight models reveals high variability in abstention rates, with models commonly selecting a real but contextually inadequate step rather than fabricating facts. The inherent plausibility and authority of this mapped but wrong advice exposes a challenging vulnerability for grounding systems.

2606.17897 2026-06-17 cs.AI cs.RO New submission

Learn to Quantify Social Interaction with Constraints for Pedestrian Walking

Xiaodan Shi

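Affiliations: Department of Computer and Systems Sciences, Stockholm University

AI summary: Proposes Learn to Cluster, a probabilistic latent-variable generative model that learns social interaction patterns from trajectory observations without supervision and integrates them into pedestrian trajectory prediction to improve robustness.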

Abstract

Long-term human path forecasting in crowds is critical for autonomous moving platforms (such as self-driving cars and social robots) to avoid collisions and plan well. Although current research takes social interactions into account for prediction, it does not reveal the exact kinds of social interactions that occur among people or how those interactions affect pedestrians' decision-making, which limits its robustness. Social interactions in pedestrian walking are intuitively numerous and hard to label and quantify. In this paper, we explore how to quantify and interpret the way pedestrians interact with others by proposing Learn to Cluster. Our model for clustering social interactions is a probabilistic latent-variable generative model that learns directly from sequential trajectory observations and scales to an arbitrary number of pedestrians. Learn to Cluster is label-free and can be naturally integrated into the training process of the prediction model. The latent variables then serve as 'labels' to categorize social interactions. Extensive experiments on several trajectory prediction benchmarks demonstrate that our method learns the patterns of social interactions and effectively integrates them into pedestrian trajectory prediction.

2606.17890 2026-06-17 cs.CL New submission

Dynamic Rollout Editing for Reducing Overthinking in RL-Trained Reasoning Models

Zihao Wei, Wenjie Shi, Liang Pang, Jingcheng Deng, Shicheng Xu, Shasha Guo, Zenghao Duan, Jiahao Liu, Jingang Wang, Huawei Shen, Xueqi Cheng

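Affiliations: Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences

AI summary: Targets overthinking in GRPO reinforcement learning, where models keep generating unnecessary reasoning after reaching a correct answer, and proposes Dynamic Rollout Editing (DRE), which edits the thinking that follows answer emergence in successful trajectories to weaken the preference signal for unnecessary thinking; experiments show its effectiveness.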

Comments 21 pages, 10 figures, 2 tables

Abstract

Long-form chain-of-thought reasoning can improve LLM performance on complex tasks, but models often continue generating unnecessary reasoning after a correct answer has emerged. We refer to this behavior as overthinking. We study this phenomenon from the perspective of GRPO-style reinforcement learning (RL) post-training, framing it as a training-time credit-assignment problem rather than merely a decoding-time stopping problem. In rollouts sampled at the onset of GRPO training, we observe that successful trajectories can exhibit a slightly higher degree of overthinking than unsuccessful trajectories for the same prompts. This early imbalance provides a starting point for an undesirable feedback loop: because GRPO assigns sequence-level credit, it cannot distinguish the solution-reaching prefix from the unnecessary continuation that lengthens a successful trajectory. Both receive positive update signal, allowing the initial imbalance to grow into more severe overthinking during training. To address this issue, we introduce Dynamic Rollout Editing (DRE), a training-time intervention for successful trajectories that continue thinking after answer emergence. DRE preserves the accepted verified prefix, edits the remaining thinking, and prefers the edited trajectory within the same RL group, weakening the preference signal for unnecessary thinking without penalizing the reasoning needed to reach the answer. Experiments across diverse tasks show the effectiveness of DRE.
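
The sequence-level credit assignment that this abstract identifies as the root of the problem can be shown with a short sketch. It uses the commonly described GRPO form, in which each rollout's reward is normalized against the mean and standard deviation of its group; the function names, the epsilon term, and the use of the population standard deviation are assumptions, and implementations differ in these details. DRE itself is not implemented here.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def per_token_credit(advantage, num_tokens):
    """Sequence-level credit: every token inherits the rollout's advantage."""
    return [advantage] * num_tokens

# Hypothetical group of four rollouts for one prompt; 1 means correct.
advantages = group_advantages([1.0, 1.0, 0.0, 0.0])
credit = per_token_credit(advantages[0], num_tokens=6)
```

If the first three tokens of that successful rollout reach the answer and the last three merely restate it, all six receive the same positive credit, so the update cannot tell the necessary prefix from the unnecessary continuation.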

2606.17889 2026-06-17 cs.LG cs.AI cs.NE New submission

Dimensionality Controls When Modularity Helps in Continual Learning

Kathrin Korte, Christian Medeiros Adriano, Joachim Winther Pedersen, Eleni Nisioti, Sebastian Risi

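Affiliations: IT University of Copenhagen, Denmark; Hasso Plattner Institute, University of Potsdam, Germany

AI summary: Studies how modular architecture, task similarity, and representational dimensionality jointly affect compositional learning in continual learning, finding that modular structure substantially improves performance in the low-dimensional "rich" regime but has little effect in the high-dimensional "lazy" regime.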

Comments Accepted to the 2nd Workshop on Compositional Learning (CompLearn) at ICML 2026, Seoul, South Korea. 8 pages, 5 figures

Abstract

Compositional learning systems must balance plasticity, the ability to acquire new knowledge, with stability, the preservation of previously learned components, especially when tasks share structure and risk interference. We study how modular architecture, task similarity, and representational dimensionality jointly shape compositional continual learning in a sequential A-B-A paradigm, comparing a task-partitioned recurrent network to a single-network baseline while inducing high- and low-dimensional regimes via weight-scale manipulations. In a high-dimensional "lazy" regime, both architectures achieve similar performance and internal geometry, suggesting that explicit modular structure has little impact when representations are weakly constrained. In a lower-dimensional "rich" regime, modularity becomes decisive: the modular network develops graded task-specific subspaces that overlap for similar tasks, partially align for moderately dissimilar tasks, and separate for dissimilar tasks, yielding a more compositional and interpretable organization than the single network. These findings identify the representational regime induced by initialization scale, which co-varies with representational dimensionality, as a key factor governing when compositional, modular structure is functionally beneficial in continual learning, and support viewing safety and robustness as problems of adaptive allocation of representational subspaces rather than fixed separation versus sharing.

2606.17888 2026-06-17 cs.AI New submission

MathVis-Fine: Aligning Visual Supervision with Necessity via Progressive Dependency-Guided Training for Multimodal Mathematical Reasoning

Wanshi Xu, Haokun Zhao, Haidong Yuan, Songjun Cao, Long Ma

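Affiliations: School of ECE, Peking University; College of Computer Science and Artificial Intelligence, Fudan University; School of Software and Microelectronics, Peking University; Tencent Youtu Lab

AI summary: Proposes the MathVis-Fine framework, which builds a dataset with fine-grained visual annotations and uses two-stage progressive training to balance answer-correctness and visual-grounding rewards according to each sample's visual dependency, improving supervision precision for multimodal mathematical reasoning.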

Abstract

Chain-of-Thought (CoT) reasoning has extended from purely linguistic domains to multimodal scenarios; however, existing approaches often treat visual inputs as homogeneous or auxiliary signals, failing to capture the intricate and sample-specific dependencies between text and images in mathematical problem-solving. This gives rise to two core issues: first, the supervisory signals for visual content are generalized and coarse-grained, lacking adaptation to the actual necessity of visual information in each sample; second, training feedback becomes inaccurate when visual rewards are uniformly applied without distinguishing the complementary relationships among inputs. These limitations hinder models from achieving precise multimodal reasoning. In this work, we propose a framework for modeling fine-grained visual dependencies in mathematical reasoning. We first construct the MathVis-Fine dataset, augmenting fine-grained visual annotations with visual dependency ratings. Building upon this dataset, we introduce a two-stage progressive visual enhancement training paradigm that balances answer correctness rewards and visual grounding rewards according to the intrinsic visual dependency level of each sample, thereby mitigating reward bias and improving supervision accuracy. Extensive experiments demonstrate that the MathVis-Fine framework effectively enhances visual perception progressively based on visual dependency, offering a more precise training framework for multimodal mathematical reasoning. We will release the dataset upon acceptance.

2606.17886 2026-06-17 cs.LG New submission

Monotonic Kolmogorov-Arnold Networks: A Theoretical and Empirical Study of Monotonicity as an Inductive Bias

Mikhail Krasnov, Carolina Fortuna, Blaž Bertalanič

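Affiliations: Jozef Stefan Institute

AI summary: Proposes MKAN, which achieves hard monotonicity through exponential reparameterization of B-spline coefficients, positive edge weights, and a monotone base activation; proves that a suitable feature extractor admits a monotone realization with a bounded encoder size; and shows experimentally that MKAN is competitive on a monotonicity benchmark while retaining KAN's per-edge functional transparency.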

Abstract

Monotonicity has been a long-running architectural inductive bias for neural networks, motivated by tabular, scientific, and economic settings where outputs are known to respond monotonically to certain inputs. Existing approaches are MLP- or flow-based and lack per-edge functional transparency; the only Kolmogorov-Arnold Network (KAN) variant with monotonicity, MonoKAN, enforces the constraint only on a restricted parameter subset and requires a projection-style training procedure. We close this gap with MKAN, a KAN with hard monotonicity guaranteed for all parameter values via exponential reparameterization of B-spline coefficients, positive edge weights, and a monotone base activation. Training reduces to standard unconstrained gradient descent. Our headline theoretical contribution is a representation-cost theorem: any $C^K$ ($K > 0$) feature extractor inducing a ball-shaped semantic-neighborhood partition admits a monotone realization of the equivalent neighborhood structure at $N' = N^* + k \le 2N^*$, where $k$ is the number of non-monotone coordinates of the original. The bound is architecture-agnostic and gives a principled sizing rule for monotone encoders. Empirically, MKAN is competitive with state-of-the-art monotone NNs on the SMM/ICML-2024 benchmark while being the only method that combines hard unconstrained monotonicity with KAN's per-edge functional transparency; the $2N^*$ prediction is validated in a self-supervised feature-size sweep on four real datasets, and on a controlled monotone-generative dataset MKAN recovers ground-truth factors with substantially higher Spearman alignment than KAN, MLP, and linear baselines.
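
One way an exponential reparameterization can guarantee monotonicity for every parameter value is sketched below. This is an assumption about the general technique, not MKAN's actual construction: the abstract only names "exponential reparameterization of B-spline coefficients", and the cumulative-sum form, the function names, and the use of a degree-1 (piecewise-linear) spline are all choices made here for brevity. The underlying fact is that a B-spline whose coefficients are non-decreasing is itself non-decreasing.

```python
import math

def monotone_coeffs(theta):
    """Map unconstrained parameters to strictly increasing coefficients."""
    coeffs = [theta[0]]
    for value in theta[1:]:
        # exp(value) > 0 for any real value, so each step goes upward.
        coeffs.append(coeffs[-1] + math.exp(value))
    return coeffs

def piecewise_linear(x, coeffs):
    """Degree-1 spline on uniform knots over [0, 1]; needs at least 2 coeffs."""
    x = min(max(x, 0.0), 1.0)
    pos = x * (len(coeffs) - 1)
    i = min(int(pos), len(coeffs) - 2)
    frac = pos - i
    return (1 - frac) * coeffs[i] + frac * coeffs[i + 1]

# Arbitrary unconstrained parameters, including negative ones.
theta = [-3.0, 5.0, -7.0, 0.2]
coeffs = monotone_coeffs(theta)
values = [piecewise_linear(k / 50, coeffs) for k in range(51)]
```

Because the constraint is built into the parameterization, plain gradient descent on `theta` never leaves the monotone family, which matches the abstract's remark that training needs no projection step.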

2606.17882 2026-06-17 cs.AI 新提交

Structural Preservation and the Logical Expressiveness of Graph Neural Networks

结构保持与图神经网络的逻辑表达能力

Przemysław Andrzej Wałęga, Bernardo Cuenca Grau

发表机构 * Queen Mary University of London(伦敦玛丽女王大学) University of Oxford(牛津大学)

AI总结 本文从语义角度研究图神经网络分类器在结构保持(嵌入、单同态、同态)下的逻辑表达能力,证明每种保持性质对应分级模态逻辑的一个片段,并给出相应GNN架构。

Comments 20 pages

AI中文摘要

通过固定架构选择(如聚合、组合和激活函数的类型),已经在图神经网络(GNN)和逻辑形式体系之间建立了桥梁。这些选择定义了受限的GNN类,通过证明逻辑公式可以翻译为等价的GNN,反之GNN也可以翻译为等价的公式,从而可以获得与逻辑形式体系的紧密对应。在本文中,我们采取语义视角,刻画在结构性质(嵌入(即扩张)、单同态和同态)下保持的GNN分类器类的逻辑表达能力。我们证明,对于每个这样的性质,存在一个分级模态逻辑的片段,刻画了该GNN类。特别地,在嵌入、单同态和同态下的保持分别对应于存在性分级模态逻辑、其存在-正片段以及存在-正模态逻辑。这些结果刻画了广泛GNN类的表达能力,独立于具体的架构选择,但我们也证明每个这样的类都存在一个具有相同表达能力的GNN架构。在技术上,我们的方法使用了有界高度树的一个新的良拟序结果,从而得到了展开不变类的有限表示。

英文摘要

Bridges between graph neural networks (GNNs) and logical formalisms have been established by fixing architectural choices, such as the types of aggregation, combination, and activation functions. These choices define restricted classes of GNNs for which tight correspondences with logical formalisms can be obtained, by showing that logical formulae can be translated into equivalent GNNs and, conversely, that GNNs can be translated into equivalent formulae. In this paper we take a semantic perspective by establishing the logical expressiveness of classes of GNN classifiers that are preserved under structural properties: embeddings (extensions), injective homomorphisms, and homomorphisms. We show that, for each such property, there exists a fragment of graded modal logic characterising the class of GNNs. In particular, preservation under embeddings, injective homomorphisms, and homomorphisms corresponds to existential graded modal logic, its existential-positive fragment, and existential-positive modal logic, respectively. These results characterise the expressiveness of broad classes of GNNs independently of specific architectural choices, but we also show that each of these classes admits a GNN architecture of the same expressiveness. Technically, our approach uses a new well-quasi-order result for trees of bounded height, yielding finite representations of unravelling-invariant classes.

2606.17874 2026-06-17 cs.CV cs.LG 新提交

Revisiting Structural Dependency in Autoregressive Multi-Task Table Recognition via Order-Independent Cell-Level Representations

重新审视自回归多任务表格识别中的结构依赖性:基于顺序无关的单元格级表示

Takaya Kawakatsu

发表机构 * Preferred Networks, Inc.(Preferred Networks公司)

AI总结 针对自回归多任务表格识别中单元格表示顺序依赖导致全局一致性下降的问题,提出通过非因果注意力生成顺序无关的单元格特征,实现并行推理,在两大数据集上提升定位与识别性能,推理时间减少约3倍。

Comments ICDAR 2026

AI中文摘要

多任务表格识别在统一框架中联合处理表格结构预测、单元格定位和单元格内容识别。现有方法通常依赖自回归解码器生成表格结构,并重用其隐藏状态进行单元格定位和内容识别。这种自回归生成过程可能使单元格表示产生顺序依赖,降低跨单元格的全局一致性。本文提出一个结构细化模块,通过非因果注意力产生顺序无关的单元格特征。该设计使得单元格内容能够并行推理,同时每个单元格以细化特征中编码的全局上下文为条件。在两个大型数据集上的实验表明,该方法在单元格定位和端到端识别上持续提升,同时将整体推理时间减少约三倍。

英文摘要

Multi-task table recognition jointly addresses table structure prediction, cell localization, and cell content recognition within a unified framework. Existing approaches often rely on autoregressive decoders to generate table structures and reuse their hidden states for cell localization and content recognition. This autoregressive generation process can make cell representations order-dependent, degrading global consistency across cells. This paper proposes a structural refinement module that produces order-independent cell features through non-causal attention. This design enables parallel inference of cell contents while conditioning each cell on global context encoded in the refined features. Experiments on two large datasets demonstrate consistent gains in cell localization and end-to-end recognition, while reducing overall inference time by around threefold.

2606.17872 2026-06-17 cs.LG cs.AI 新提交

AnchorKV: Safety-Aware KV Cache Compression via Soft Penalty with a Refusal Anchor

AnchorKV: 通过拒绝锚点的软惩罚实现安全感知的KV缓存压缩

Ning Ni, Yingjie Lao

发表机构 * Department of Computer Science, Tufts University(塔夫茨大学计算机科学系) Department of Electrical and Computer Engineering, Tufts University(塔夫茨大学电气与计算机工程系)

AI总结 提出AnchorKV,一种通过软惩罚机制调整令牌保留分数以远离有害提示的KV缓存压缩方法,在保持实用性的同时显著提升安全性。

AI中文摘要

大型语言模型(LLMs)在生成推理和长上下文任务上优于早期架构,但其庞大的规模在内存使用、能耗和设备端部署方面带来了重大挑战。由于缩放预训练语言模型能提升下游能力\cite{zhao2023survey},键值(KV)缓存成为主要的推理瓶颈。最近的KV缓存压缩方法\cite{jo2025fastkv,li2024snapkv,zhou2024dynamickv}通过仅保留注意力相关令牌的子集来降低这一成本。然而,虽然这些方法在良性工作负载上保持了准确性,但其压缩策略要么无法防御越狱攻击\cite{jiang2024robustkv},要么在激进驱逐下降低安全对齐。我们提出AnchorKV,一种对KV缓存压缩的即插即用修改,它使令牌保留分数偏向远离与有害提示相关的键空间方向。AnchorKV通过将均值差异表示工程方法\cite{arditi2024refusal,zou2023representation}适配到KV缓存中使用的层特定键投影空间,构建了一个离线安全锚点。基于该锚点,一种软惩罚令牌选择规则以少量效用换取显著改善的安全对齐,当惩罚为零时则退化为原始压缩器。

英文摘要

Large language models (LLMs) outperform earlier architectures on generative inference and long-context tasks, but their large size introduces significant challenges in memory usage, energy cost, and on-device deployment. Since scaling pre-trained language models improves downstream capability \cite{zhao2023survey}, the key-value (KV) cache becomes a dominant inference bottleneck. Recent KV cache compression methods \cite{jo2025fastkv,li2024snapkv,zhou2024dynamickv} reduce this cost by retaining only a subset of attention-relevant tokens. However, while these approaches preserve accuracy on benign workloads, their compression policies either fail to defend against jailbreak attacks \cite{jiang2024robustkv} or degrade safety alignment under aggressive eviction. We propose AnchorKV, a drop-in modification to KV cache compression that biases token retention scores away from directions in key space associated with harmful prompts. AnchorKV constructs an offline safety anchor by adapting a difference-of-means representation engineering approach \cite{arditi2024refusal,zou2023representation} to the layer-specific key projection space used in KV caching. Based on this anchor, a soft penalty token selection rule trades a small amount of utility for substantially improved safety alignment, while reducing to the original compressor when the penalty is zero.
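
The two ingredients named in the abstract, a difference-of-means anchor and a soft penalty on retention scores, can be sketched as follows. This is a hedged illustration only: the exact penalty form, the use of cosine similarity, and the names `difference_of_means_anchor` and `select_tokens` are assumptions, not the paper's definitions. The sketch does preserve the one property the abstract states explicitly, namely that a zero penalty reduces to the original top-k compressor.

```python
import numpy as np

def difference_of_means_anchor(harmful_keys, benign_keys):
    """Unit vector pointing from the benign-key mean toward the harmful-key mean."""
    direction = harmful_keys.mean(axis=0) - benign_keys.mean(axis=0)
    return direction / (np.linalg.norm(direction) + 1e-12)

def select_tokens(base_scores, keys, anchor, budget, penalty=0.0):
    """Keep `budget` tokens ranked by base score minus a soft alignment penalty.

    Assumption: the penalty is linear in the cosine similarity between each
    key and the anchor; with penalty == 0 this is plain top-k on base_scores.
    """
    key_norms = np.linalg.norm(keys, axis=1) + 1e-12
    alignment = (keys @ anchor) / key_norms          # cosine similarity per token
    adjusted = base_scores - penalty * alignment
    kept = np.argsort(-adjusted, kind="stable")[:budget]
    return np.sort(kept)                              # return indices in sequence order

# Synthetic keys stand in for layer-specific key projections collected offline.
rng = np.random.default_rng(0)
dim = 16
harmful_keys = rng.normal(size=(32, dim)) + 1.0
benign_keys = rng.normal(size=(32, dim))
anchor = difference_of_means_anchor(harmful_keys, benign_keys)

keys = rng.normal(size=(10, dim))
base_scores = rng.random(10)
kept_plain = select_tokens(base_scores, keys, anchor, budget=4, penalty=0.0)
kept_penalized = select_tokens(base_scores, keys, anchor, budget=4, penalty=0.5)
```

In practice `base_scores` would come from whichever compressor is being wrapped (for example, attention-based importance scores), which is what makes the modification drop-in.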

2606.17871 2026-06-17 cs.AI 新提交

StepGuard: Guarding Web Navigation via Single-Step Calibration

StepGuard: 通过单步校准保护网页导航

Zhihao Cui, Yuchen Zhang, Xiyang Sun, Yaxiong Wang, Li Zhu, Jinpeng Hu, Liu Liu, Mengjia Li, Yujiao Wu

发表机构 * School of Software Engineering, Xi’an Jiaotong University(西安交通大学软件工程学院) School of Computer Science and Information Engineering, Hefei University of Technology(合肥工业大学计算机与信息工程学院) Xiamen University(厦门大学) Zhejiang Lab(之江实验室) CSIRO(澳大利亚联邦科学与工业研究组织)

AI总结 针对网页导航中单步脆弱性问题,提出StepGuard框架,通过动态双策略优化(DDPO)解决奖励冲突,并利用置信度引导的自适应导航反思(CANR)校准单步误差,显著提升导航与答案准确率。

AI中文摘要

网页导航要求智能体遵循自然语言目标,与网页交互并生成准确答案。尽管近期进展利用了视觉-语言模型和强化学习,现有方法仍因奖励错位和错误传播而存在单步脆弱性。为解决奖励纠缠,我们设计了动态双策略优化(DDPO),在用于探索的导航优先模式与用于问答的答案优先模式之间动态切换,以缓解奖励冲突。为校准单步误差,我们提出置信度引导的自适应导航反思(CANR),该机制估计每步置信度,仅在必要时触发反思,并使用对比奖励鼓励自我修正以校准单步不准确性。以上述组件为核心,我们最终开发了StepGuard,一种通过单步校准保护网页导航的新框架。实验表明,我们的方法显著提升了导航与答案准确率,在标准网页导航基准上取得了新的最佳性能。

英文摘要

Web navigation requires agents to follow natural language goals, interact with web pages, and produce accurate answers. While recent advances leverage vision-language models and reinforcement learning, existing methods still suffer from single-step fragility due to reward misalignment and error propagation. To tackle the reward entanglement, we design Dynamic Dual-Policy Optimization (DDPO), which dynamically switches between a navigation-first mode for exploration and an answer-first mode for question-answering to mitigate reward conflict. To calibrate the single-step error, we propose Confidence-Guided Adaptive Navigation Reflection (CANR), a mechanism that estimates per-step confidence, triggers reflection only when necessary, and uses contrastive rewards to encourage self-correction to calibrate the single-step inaccuracy. With the above as the main components, we finally develop our StepGuard, a new framework of Guarding Web Navigation via Single-Step Calibration. Experiments demonstrate that our approach significantly improves navigation and answer accuracy, setting new state-of-the-art performance on standard web navigation benchmarks.

2606.17867 2026-06-17 cs.CV cs.AI 新提交

A Quantitative Analysis of Multimodal Biomarkers in Alzheimer's Disease

阿尔茨海默病多模态生物标志物的定量分析

Antonio Scardace, Daniele Ravì

发表机构 * Department of Mathematics and Computer Science(数学与计算机科学系) University of Catania(卡塔尼亚大学) Department MIFT(MIFT部门) University of Messina(梅西纳大学)

AI总结 通过整合tau-PET、结构MRI、认知评分和APOE4数据,量化多模态生物标志物间的冗余与预测依赖关系,揭示tau拓扑与萎缩的关联,并分解tau-认知关联,为AD生物标志物选择提供可解释性。

Comments Accepted to ICTS4eHealth 2026

AI中文摘要

尽管阿尔茨海默病(AD)研究中越来越多地采用多模态方法——旨在整合分子、结构、临床和遗传生物标志物以增强疾病表征——但这些模态之间的关系仍知之甚少。对其动态相互作用进行系统分析对于改进疾病建模、识别冗余评估以及减少患者负担和获取成本至关重要。在本文中,我们通过整合来自ADNI数据集的789名受试者的tau-PET、结构MRI、认知评分(MMSE和CDR)以及APOE4数据,对多模态AD生物标志物进行了定量分析。在我们的分析中,我们(A)量化跨模态互信息和解释方差以评估冗余和预测依赖性;(B)检查tau拓扑与跨脑区结构萎缩之间的关联以选择信息性ROI;(C)对tau-认知关联进行统计分解,分为萎缩相关和萎缩无关成分;(D)识别与认知衰退一致的主要神经退行性轨迹。本研究提供了跨模态关系的系统表征,提高了AD生物标志物的可解释性和选择。代码已公开:https://github.com/antonioscardace/Multimodal-AD

英文摘要

Despite increasing adoption of multimodal approaches in Alzheimer's Disease (AD) research -- aimed at integrating molecular, structural, clinical, and genetic biomarkers to enhance disease characterization -- the relationships among these modalities remain poorly understood. A systematic analysis of their dynamic interaction is essential for improving disease modeling, identifying redundant assessments, and reducing patient burden and acquisition costs. In this paper, we present a quantitative analysis of multimodal AD biomarkers by integrating tau-PET, structural MRI, cognitive scores (MMSE and CDR), and APOE4 data from 789 subjects drawn from the ADNI dataset. In our analyses, we (A) quantify cross-modal mutual information and explained variance to assess redundancy and predictive dependencies; (B) examine associations between tau topologies and structural atrophy across brain regions to select informative ROIs; (C) perform a statistical decomposition of the tau-cognition association into atrophy-related and atrophy-independent components; and (D) identify a dominant neurodegenerative trajectory that aligns with cognitive decline. This study provides a systematic characterization of cross-modal relationships, improving the interpretability and selection of biomarkers in AD. Code is publicly available at: https://github.com/antonioscardace/Multimodal-AD.

2606.17861 2026-06-17 cs.CL 新提交

GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?

GameCraft-Bench:智能体能否在真实游戏引擎中端到端构建可玩游戏?

Tongxu Luo, Rongsheng Wang, Jiaxi Bi, Chenming Xu, Zhengyang Tang, Jianlong Chen, Juhao Liang, Ke Ji, Shuqi Guo, Yuhao Du, Fan Bu, Wenyu Du, Xiaotong Zhang, Kyle Li, Shaobo Wang, Linfeng Zhang, Yuxuan Liu, Xin Lai, Chenxin Li, Yiduo Guo, Zhexin Zhang, Xinyuan Wang, Tianyi Bai, Ziniu Li, Benyou Wang

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Shenzhen Loop Area Institute(深圳环区研究院) Hunyuan Team, Tencent(腾讯混元团队) USTB(北京科技大学) DualverseAI SJTU(上海交通大学) NUS(新加坡国立大学)

AI总结 提出GameCraft-Bench基准,评估编码智能体在Godot引擎中端到端生成可玩游戏的能力,最强智能体得分仅为41.46%。

AI中文摘要

游戏生成是编码智能体的新兴应用,要求模型将自然语言规范转化为可玩的交互系统。与传统编码任务不同,游戏生成发生在游戏引擎内,脚本、场景、资源、渲染和运行时交互必须共同产生连贯的游戏体验。我们将端到端游戏生成形式化为产生完整游戏制品的问题,该制品通过目标环境中可观察的玩家-游戏交互实现规范。我们认为评估这一设置需要三个必要条件:引擎接地、制品完整性和交互验证。我们提出一个交互接地评估框架,通过重放演示和基于评分细则的多模态评判来评估可执行游戏玩法。我们将该框架实例化为GameCraft-Bench,一个包含15个游戏家族共140个Godot任务的基准。对前沿编码智能体的评估表明,端到端游戏生成仍然极具挑战性:最强智能体仅达到41.46%,大多数智能体得分低于40%。进一步分析显示,虽然智能体经常实现可识别的机制,但它们在提供具有足够内容、功能性视觉反馈和连贯呈现的完整游戏方面存在困难。演示、代码和数据见 https://tongxuluo.github.io/gamecraft-bench-website

英文摘要

Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a game engine, where scripts, scenes, assets, rendering, and runtime interactions must jointly produce coherent gameplay. We formalize end-to-end game generation as the problem of producing a complete game artifact that realizes a specification through observable player-game interaction in a target environment. We argue that evaluating this setting requires three desiderata: Engine Grounding, Artifact Completeness, and Interactive Verification. We propose an interaction-grounded evaluation framework that assesses executable gameplay through replayed demonstrations and rubric-guided multimodal judging. We instantiate this framework as GameCraft-Bench, a benchmark comprising 140 Godot tasks across 15 game families. Evaluations of frontier coding agents show that end-to-end game generation remains highly challenging: the strongest agent achieves only 41.46%, and most agents score below 40%. Further analysis reveals that while agents often implement recognizable mechanics, they struggle to deliver complete games with sufficient content, functional visual feedback, and coherent presentation. See https://tongxuluo.github.io/gamecraft-bench-website for demos, code, and data.

2606.17858 2026-06-17 cs.LG 新提交

Meta-classification of one-class classification models using ranking correlation and nearest neighbor

使用排序相关性和最近邻的一类分类模型的元分类

Toshitaka Hayashi, Hamido Fujita, Dalibor Cimr, Richard Cimler, Jitka Kühnová

发表机构 * Faculty of Science, University of Hradec Kralove(赫拉德茨-克拉洛韦大学理学院) Malaysia-Japan International Institute of Technology (MJIIT), Universiti Teknologi Malaysia(马来西亚-日本国际技术学院,马来西亚理工大学) Regional Research Center, Iwate Prefectural University(岩手县立大学区域研究中心)

AI总结 提出用排序相关性和最近邻对一类分类模型进行元分类,实验表明能高精度区分数据集、算法和超参数,本质是数据集分类。

AI中文摘要

机器学习技术已被应用于各种问题。然而,将机器学习应用于机器学习模型本身是一个未被探索的方向。为此,本文考虑了一类分类(OCC)模型的元分类,因为所有机器学习模型都可以近似为OCC模型。该提案将OCC模型表示为正常度排序,并使用最近邻和排序相关性度量对其进行分类。实验对OCC模型进行分类,其中类别对应于训练数据集、算法和超参数。当类别标签为数据集时,该提案实现了高精度。此外,当训练数据集包含相同类别时,它可以对算法进行分类。讨论强调,OCC模型的分类本质上是将多个样本视为单个输入的数据集分类。实验使用睡眠记录展示了数据集的分类。所提出的方法可以为分类OCC模型、数据集和排序提供统一解决方案。源代码已上传至公共仓库:https://github.com/ToshiHayashi/ClassOCC

英文摘要

Machine Learning (ML) techniques have been applied to various problems. However, applying ML to ML models is an unexplored direction. For this purpose, this paper considers a meta-classification of one-class classification (OCC) models, because all ML models could be approximated as OCC models. The proposal represents OCC models as normality rankings and classifies them using nearest-neighbor and ranking-correlation metrics. The experiment classifies OCC models, where classes correspond to training datasets, algorithms, and hyperparameters. The proposal achieves high accuracy when class labels are datasets. Moreover, it can classify algorithms when the training datasets contain the same class. In addition, the discussion highlights that the classification of OCC models is essentially the classification of datasets that treats multiple samples as a single input. The experiment demonstrates the classification of datasets using sleeping records. The proposed method can provide a unified solution for classifying OCC models, datasets, and rankings. Source code is uploaded to the public repository https://github.com/ToshiHayashi/ClassOCC.

2606.17856 2026-06-17 cs.AI 新提交

FlowRAG: Synergizing Explicit Reasoning via Frequency-Aware Multi-Granularity Graph Flow

FlowRAG: 通过频率感知的多粒度图流协同显式推理

Bihao Zhan, Zongsheng Cao, Jie Zhou, Bo Zhang, Liang He

发表机构 * East China Normal University(华东师范大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

AI总结 提出FlowRAG框架,构建四层异构图,通过双粒度激活和频率感知加权流模块,增强语义召回和显式推理路径提取,在复杂推理基准上取得最优性能。

AI中文摘要

基于图的检索增强生成(GraphRAG)对于知识密集型和多跳查询任务有效;然而,许多现有方法主要基于实体图并依赖隐式语义相关性传播。这通常会导致(i)当用户查询抽象且在实体层面语义稀疏时检索不足,以及(ii)脆弱的多跳推理,其中噪声激活可能破坏实体到实体的转换并损坏推断的关系链,从而产生不可靠的结论。为此,我们提出\texttt{FlowRAG},一个语义感知的检索框架,它提高了语义召回和显式推理。具体来说,\texttt{FlowRAG}在段落、摘要、句子和实体上构建了一个四层异构图,其中摘要节点作为粗粒度语义枢纽。在检索时,双粒度激活模块结合摘要-查询对齐和句子级匹配,在释义和抽象下鲁棒地激活相关实体。然后,我们引入一个频率感知的加权流模块,该模块通过段落内词频加权的实体-段落链接路由相关性,修剪噪声连接并提取高置信度的推理路径作为生成的显式逻辑骨架。大量实验表明,\texttt{FlowRAG}在复杂推理基准上取得了最先进的性能。

英文摘要

Graph-based retrieval-augmented generation (GraphRAG) is effective for knowledge-intensive and multi-hop query tasks; however, many existing methods primarily seed entity-based graphs and rely on implicit semantic relevance propagation. This often (i) under-retrieves when user queries are abstract and semantically sparse at the entity level, and (ii) suffers from brittle multi-hop reasoning, where noisy activations can derail entity-to-entity transitions and corrupt the inferred relation chain, yielding unreliable conclusions. To this end, we propose \texttt{FlowRAG}, a semantic-aware retrieval framework that improves both semantic recall and explicit reasoning. Specifically, \texttt{FlowRAG} constructs a quad-level heterogeneous graph over passages, summaries, sentences, and entities, where summary nodes serve as a coarse semantic hub. At retrieval time, a dual-granularity activation module combines summary--query alignment with sentence-level matching to activate relevant entities under paraphrase and abstraction robustly. We then introduce a frequency-aware weighted flow module that routes relevance through entity--passage links weighted by within-passage term frequency, pruning noisy connections and extracting high-confidence reasoning paths as an explicit logic skeleton for generation. Extensive experiments show that \texttt{FlowRAG} obtains state-of-the-art performance on complex reasoning benchmarks.

2606.17851 2026-06-17 cs.AI cs.LO 新提交

A homotopy-type-theoretic generalization of neurosymbolic inference

同伦类型论对神经符号推理的推广

Fernando Zhapa-Camacho, Robert Hoehndorf

发表机构 * King Abdullah University of Science and Technology(阿卜杜拉国王科技大学) KAUST Center of Excellence for Smart Health (KCSH)(KAUST智能健康卓越中心) KAUST Center of Excellence for Generative AI(KAUST生成式人工智能卓越中心)

AI总结 本文用同伦类型论替换集合,将神经符号系统的信念加权和泛化为信念加权同伦基数,保留对称性和证明多样性,并证明经典函数是特例,从而避免推理捷径。

AI中文摘要

广泛的神经符号系统计算一个泛函:在σ-结构空间上逻辑量的信念加权和,其中加权模型计数、模糊逻辑和概率逻辑是特例。这种描述基于集合,而集合有意忽略了两个对神经符号系统重要的方面:两个σ-结构何时在理论对称性下相同,以及有多少不同的证明见证一个查询。将底层集合替换为类型(在同伦类型论意义上)保留了这些信息,并将该泛函转变为信念加权同伦基数——一种按对称性倒数计数对象的大小概念。我们从头为神经符号系统开发了该框架,证明了当对称性平凡时恢复经典泛函的保守性定理,并表明我们的框架暴露的对称性正是推理捷径背后的对称性。实际收益是具体的:最近通过集成或表达性密度估计实现的捷径感知概念后验,是混淆集单纯形上唯一的对称不变点,可通过在对称群上平均单个模型以闭式形式计算。在MNIST推理捷径基准上,这种单模型包装器比多样性训练的集成具有更好的校准性,同时保持标签准确性和可识别概念不变。代码免费提供于 https://github.com/bio-ontology-research-group/hott-nesy

英文摘要

A wide range of neurosymbolic (NeSy) systems compute one functional: a belief-weighted sum of a logical quantity over a space of $\sigma$-structures, of which weighted model counting, fuzzy logic, and probabilistic logic are special cases. This account is built on sets, and a set deliberately forgets two things that are important for NeSy: when two $\sigma$-structures are the same up to a symmetry of the theory, and how many distinct proofs witness a query. Replacing the underlying sets by types, in the sense of homotopy type theory, preserves this information, and turns this functional into a belief-weighted homotopy cardinality, a notion of size that counts each object in inverse proportion to its symmetries. We develop the framework from scratch for NeSy systems, prove a conservativity theorem that recovers the classical functional when symmetries are trivial, and show that the symmetry our framework exposes is exactly the one behind reasoning shortcuts. The payoff is concrete: the shortcut-aware concept posterior that recent methods reach by ensembling or expressive density estimation is the only symmetry-invariant point of the confusion-set simplex, computable in closed form by averaging a single model over the symmetry group. On MNIST reasoning-shortcut benchmarks this single-model wrapper is better calibrated than a diversity-trained ensemble, while leaving label accuracy and identifiable concepts untouched. Code is freely available at https://github.com/bio-ontology-research-group/hott-nesy.
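
For readers unfamiliar with homotopy cardinality, the groupoid case can be written out as below. The second formula is the standard groupoid cardinality; the first and third are only an illustrative reading of the abstract, and the symbols $w$ (belief weight), $\varphi$ (logical quantity), and $\mathcal{M}$ (space of structures) are assumed names rather than the paper's notation.

```latex
% Classical set-based functional: a belief-weighted sum over sigma-structures.
F_{\mathrm{set}} = \sum_{M \in \mathcal{M}} w(M)\, \varphi(M)

% Groupoid (homotopy) cardinality of a 1-type X with finite automorphism groups.
|X| = \sum_{[x] \in \pi_0(X)} \frac{1}{|\mathrm{Aut}(x)|}

% Assumed belief-weighted analogue: each isomorphism class is discounted
% in inverse proportion to its number of symmetries.
F_{\mathrm{HoTT}} = \sum_{[M] \in \pi_0(\mathcal{M})} \frac{w(M)\, \varphi(M)}{|\mathrm{Aut}(M)|}
```

When every automorphism group is trivial, the third expression collapses to the first, which is consistent with the conservativity statement in the abstract.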

2606.17847 2026-06-17 cs.AI cs.LG 新提交

WallZero: Mastering the Game of WallGo with Strategic Analysis

WallZero:通过战略分析掌握WallGo游戏

Hsing-Yu Chen, Jérôme Arjonilla, I-Chen Wu, Ti-Rong Wu

发表机构 * National Yang Ming Chiao Tung University(国立阳明交通大学) Academia Sinica(中央研究院)

AI总结 提出基于AlphaZero的WallZero智能体,通过定制动作和特征设计,在WallGo游戏中击败职业围棋选手,并分析游戏公平性与关键策略。

Comments Accepted by the Computers and Games conference (CG 2026)

AI中文摘要

WallGo是一种最近引入的战略棋盘游戏,因2025年Netflix系列剧《The Devil's Plan》而流行。尽管在7x7的小棋盘上进行,但其棋子移动和墙壁放置的组合导致了高游戏树复杂性和复杂的战略互动。尽管其日益流行,WallGo仍未得到充分探索。本文提出了WallZero,一个基于AlphaZero的双人WallGo设置智能体。我们引入了定制的动作和特征设计,以显著提高游戏性能。在评估中,WallZero击败了参与本研究的两位职业围棋选手,平均每局获得1.98倍的地盘。除了其强度,我们使用WallZero评估游戏公平性并识别掌握WallGo的关键策略。有趣的是,我们的结果显示,Netflix系列剧中使用的开局产生了更平衡的游戏。我们的代码可在以下网址获取:https://rlg.iis.sinica.edu.tw/papers/wallzero

英文摘要

WallGo is a recently introduced strategic board game popularized by the 2025 Netflix series The Devil's Plan. Although played on a small 7 x 7 board, its combination of stone movement and wall placement yields high game-tree complexity and intricate strategic interactions. Despite its growing popularity, WallGo remains underexplored. This paper presents WallZero, an AlphaZero-based agent for the two-player WallGo setting. We introduce tailored action and feature designs to improve playing performance significantly. In the evaluation, WallZero defeats two professional Go players who participated in this study, securing on average 1.98x more territory per game. Beyond its strength, we use WallZero to assess game fairness and identify key strategies for mastering WallGo. Interestingly, our results show that the opening used in the Netflix series yields a more balanced game. Our code is available at https://rlg.iis.sinica.edu.tw/papers/wallzero.

2606.17846 2026-06-17 cs.RO cs.CV cs.LG 新提交

Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models

Qwen-RobotManip 技术报告:对齐解锁机器人操作基础模型的规模

Haoqi Yuan, Zhixuan Liang, Anzhe Chen, Ye Wang, Haoyang Li, Pei Lin, Yiyang Huang, Zixing Lei, Tong Zhang, Jiazhao Zhang, Jie Zhang, Jingyang Fan, Gengze Zhou, Qihang Peng, Chenxu Lv, Xiaoyue Chen, An Yang, Fei Huang, Junyang Lin, Dayiheng Liu, Jingren Zhou, Chenfei Wu, Xiong-Hui Chen

发表机构 * Qwen Team(Qwen团队)

AI总结 提出 Qwen-RobotManip,通过统一的对齐框架(表示、运动和行为维度)实现多源异构操作数据的大规模协同训练,构建约38,100小时预训练语料,在零样本指令跟随、跨本体迁移等泛化能力上超越先前模型。

Comments 44 pages

AI中文摘要

语言和多模态基础模型通过统一公式对齐异构数据并大规模训练,实现了强大的泛化能力。在本报告中,我们研究这种扩展方法是否可以应用于机器人操作以实现真正的泛化。这具有挑战性,因为与文本不同,操作数据本质上是异构的、收集成本高且多样性狭窄,使得对齐和规模同时变得困难。我们提出了 Qwen-RobotManip,一个基于 Qwen-VL 构建的可泛化视觉-语言-动作基础模型。Qwen-RobotManip 引入了一个跨操作表示、运动和行为维度的统一对齐框架,使大规模多源训练变得一致而非冲突。这种对齐能力进而使 Qwen-RobotManip 能够吸收以前训练方案无法维持规模的操作数据。一个人到机器人合成流水线将第一人称手部演示转换为跨15个平台的机器人轨迹,一个严格的策展流水线协调异构数据集。仅使用开源数据集和人类视频,无需专有数据收集,Qwen-RobotManip 构建了约38,100小时的预训练语料,并展现出涌现的泛化能力,包括零样本指令跟随、对扰动的鲁棒性、反应性错误恢复和跨本体迁移。我们发现标准基准无法捕捉预训练质量,因此采用了包括 RoboCasa365、LIBERO-Plus、EBench、RoboTwin-Clean2Rand、RoboTwin-IF 和 RoboTwin-XE 在内的 OOD 设置。Qwen-RobotManip 在所有 OOD 设置中显著优于先前最先进的模型(包括 π0.5),在 RoboChallenge 中排名第一,相对改进20%,并在包括 AgileX ALOHA、Franka、UR 和 ARX 在内的真实机器人平台上得到验证。

英文摘要

Foundation models in language and multimodality achieve strong generalization by aligning heterogeneous data under a unified formulation and training at scale. In this report, we investigate whether this scaling recipe can be applied to robotic manipulation to achieve genuine generalization. This is challenging because, unlike text, manipulation data is heterogeneous by nature, expensive to collect, and narrow in diversity, making alignment and scale simultaneously difficult. We present Qwen-RobotManip, a generalizable Vision-Language-Action foundation model built on Qwen-VL. Qwen-RobotManip introduces a unified alignment framework across the representation, motion, and behavioral dimensions of manipulation, making large-scale multi-source training coherent rather than conflicting. This alignment capability in turn enables Qwen-RobotManip to absorb manipulation data at a scale that prior training regimes could not sustain. A human-to-robot synthesis pipeline converts egocentric hand demonstrations into robot trajectories across 15 platforms, and a rigorous curation pipeline harmonizes heterogeneous datasets. Using only open-source datasets and human videos without proprietary data collection, Qwen-RobotManip constructs a ~38,100-hour pretraining corpus and exhibits emergent generalization capabilities, including zero-shot instruction following, robustness to perturbations, reactive error recovery, and cross-embodiment transfer. We find that standard benchmarks fail to capture pretraining quality and instead adopt OOD settings including RoboCasa365, LIBERO-Plus, EBench, RoboTwin-Clean2Rand, RoboTwin-IF, and RoboTwin-XE. Qwen-RobotManip substantially outperforms prior state-of-the-art models, including $\pi_{0.5}$, across all OOD settings, ranks 1st in RoboChallenge with a 20% relative improvement, and is validated on real-robot platforms including AgileX ALOHA, Franka, UR, and ARX.

2606.17839 2026-06-17 cs.RO cs.HC 新提交

From Ad Hoc Pilots to Repeatable Patterns: Structuring Drone Collaboration in Emergency Services with DroneLets

从临时试点到可重复模式:用DroneLets构建紧急服务中的无人机协作

Dzmitry Katsiuba, Samuel Brander, Mateusz Dolata, Gerhard Schwabe

发表机构 * University of Zurich(苏黎世大学) Zeppelin University(齐柏林大学)

AI总结 本文通过实地试验和访谈,提炼出44种交互模式并引入DroneLets设计构件,以结构化的方式实现紧急服务中无人机协作的可重复和可扩展。

Comments Presented at International Conference on Information Systems (ICIS) 2025: https://aisel.aisnet.org/icis2025/is_transformwork/is_transformwork/19/

Journal ref
International Conference on Information Systems 2025: ICIS2025-2217
AI中文摘要

无人机有望支持紧急服务,但其融入工作流程仍具有临时性和协调密集型。本文探讨两个研究问题:紧急团队希望如何与无人机协作,以及如何将这些协作形式化为可重复的过程。基于四次实地试验和95次访谈,我们推导出44种交互模式,分为10个元模式,反映了侦察、通信和后勤支持等操作需求。为了构建这些实践,我们引入了DroneLets——一种新的设计构件类别,将协作工程扩展到具身代理。DroneLets捕获设置要求、无人机能力、环境约束以及人类和无人机代理之间的协调行动。它们提供了一个模块化框架,用于设计紧急服务中可重复、可扩展的协作过程,并通过向旁观者广播和火灾后监测等模式加以说明。这项工作扩展了协作工程的范围,并为将自主无人机集成到高风险现场操作中提供了结构化基础。

英文摘要

Drones hold promise for supporting emergency services, but their integration into workflows remains ad hoc and coordination-intensive. This paper addresses two research questions: how emergency teams want to collaborate with drones, and how to formalize these collaborations into repeatable processes. Based on four field trials and 95 interviews, we derive 44 interaction patterns grouped into 10 meta-patterns reflecting operational needs such as reconnaissance, communication, and logistical support. To structure these practices, we introduce DroneLets, a new class of design artifacts that extend Collaboration Engineering (CE) to embodied agents. DroneLets capture setup requirements, drone capabilities, environmental constraints, and coordinated actions across human and drone actors. They offer a modular framework for designing repeatable, scalable collaboration processes in emergency services, illustrated through patterns such as broadcasting to bystanders and post-fire monitoring. This work expands the scope of CE and provides a structured foundation for integrating autonomous drones into high-stakes field operations.

2606.17838 2026-06-17 cs.CL 新提交

Environment-Grounded Automated Prompt Optimization for LLM Game Agents

面向LLM游戏智能体的环境驱动自动提示优化

Rean Clive Fernandes, Lukas Fehring, Theresa Eimer, Marius Lindauer, Matthias Feurer

发表机构 * Lamarr institute for ML and AI(拉马尔机器学习与人工智能研究所) TU Dortmund University(多特蒙德工业大学) Leibniz University Hannover(莱布尼茨汉诺威大学) L3S Research Center(L3S研究中心)

AI总结 提出一种自动提示优化框架,将观察-动作管道分解为描述器和选择器,通过环境回报驱动的进化循环迭代优化提示,在BabyAI任务中显著提升成功率。

AI中文摘要

交互环境中的LLM智能体对其提示高度敏感,但提示工程仍然是手动的、特定于任务的过程。我们为LLM智能体引入了一个自动提示优化框架,该框架将观察-动作管道分解为一个目标条件描述器智能体和一个动作选择智能体,并通过由环境回报引导的LLM驱动进化循环迭代地优化每个模块的提示。我们提出一个行为分析器,将回合结果归因于特定的提示组件,以及一个变异器,对提示提出有针对性的修订,随后通过环境交互采样(rollout)对这些修订进行验证。我们在BALROG基准测试中的所有五个BabyAI任务上进行了评估,在普通和引导提示初始化下,将我们的管道与BALROG的RobustCoTAgent进行了比较。优化在任务和条件下一致地提高了性能,无需更新模型权重。在PutNext(一个多步协调任务,RobustCoTAgent的成功率为0%)上,我们的框架使用相同的底层LLM和优化提示达到了高达72.5%的成功率。这些结果表明,多智能体框架结合自动提示优化,无需微调或大量人工监督即可增强LLM。

英文摘要

LLM agents in interactive environments are highly sensitive to their prompts, yet prompt engineering remains a manual, task-specific process. We introduce an automated prompt optimization framework for LLM agents that decomposes the observation-to-action pipeline into a goal-conditioned descriptor agent and an action selection agent, and iteratively refines each module's prompt through an LLM-driven evolutionary loop guided by environment returns. We propose a behavior analyzer to attribute episode outcomes to specific prompt components, and a mutator to propose targeted revisions to the prompt, before validating them through environment rollouts. We evaluate on all five BabyAI tasks in the BALROG benchmark, comparing our pipeline against BALROG's RobustCoTAgent under both plain and guided prompt initializations. Optimization improves performance consistently across tasks and conditions, without requiring updates to the model weights. On PutNext, a multi-step coordination task where the RobustCoTAgent achieves 0% success, our framework reaches up to 72.5% success rate using the same underlying LLM with optimized prompts. These results suggest that a multi-agent framework, combined with automatic prompt optimization, enhances LLMs without the need for fine-tuning or extensive human supervision.

2606.17836 2026-06-17 cs.CV cs.AI cs.CG cs.GR 新提交

High-Fidelity 3D Geometric Reconstruction of Pelvic Organs from MRI: A Hybrid Deep Learning and Iterative Optimization Approach

高保真盆腔器官MRI三维几何重建:一种混合深度学习与迭代优化方法

Hui Wang, Xiaowei Li, Chenxin Zhang, Yifan Feng, Jianwei Zuo, Yumeng Tang, Xiuli Sun, Jianliu Wang, Bing Xie, Jiajia Luo

发表机构 * Institute of Medical Technology, Peking University Health Science Center, Peking University(北京大学医学部医学技术研究院,北京大学) Biomedical Engineering Department, Institute of Advanced Clinical Medicine, Peking University(北京大学先进临床医学研究院生物医学工程系) Department of Obstetrics and Gynecology, Peking University People’s Hospital(北京大学人民医院妇产科)

AI总结 提出混合可变形形状建模框架,结合深度学习预测与迭代优化,实现膀胱、子宫和直肠的高保真三维几何重建,在几何保真度和网格质量上优于现有方法。

AI中文摘要

从MRI中患者特定的盆腔器官几何三维重建对于盆底建模和下游患者特定分析至关重要。然而,以往研究主要关注图像分割或三维模型的下游使用,高保真、高质量几何的重建仍然劳动密集且缺乏标准化。本研究引入了一种混合可变形形状建模框架,将深度学习预测与迭代优化相结合,用于膀胱、子宫和直肠的重建。该框架包含三个核心组件:一种保持盆腔器官拓扑一致性的几何感知多级深度学习架构;一种平衡全局形状捕获和局部表面细化的两阶段摊销优化训练策略;以及一种整体协同机制——在训练阶段,迭代优化为深度学习提供监督,而在推理阶段,深度学习快速预测全局器官形态,随后通过迭代优化细化局部表面和网格质量。该框架在几何保真度上显著优于当前主流的基于深度学习的器官重建模型。对于各个解剖结构,重建的膀胱、直肠和子宫三维几何实现了显著更低的Chamfer距离值和更高的Dice相似系数分数。此外,在保持高计算效率的同时,所提出的架构产生了优越的整体体积网格质量。在患者层面,该框架在minSICN和minSIGE的10个最差元素上均获得了比传统几何后处理算法更高的平均值。

英文摘要

Patient-specific 3D reconstruction of pelvic organ geometry from MRI is important for pelvic floor modeling and downstream patient-specific analysis. However, while previous studies have focused primarily on either image segmentation or downstream use of 3D models, the reconstruction of high-fidelity, high-quality geometries remains labor-intensive and poorly standardized. This study introduces a hybrid deformable shape modeling framework that integrates deep learning prediction with iterative optimization for the reconstruction of the bladder, uterus, and rectum. The framework consists of three core components: a geometry-aware multi-level deep learning architecture that preserves topological consistency of pelvic organs; a two-stage amortized optimization training strategy that balances global shape capture and local surface refinement; and a holistic synergy mechanism in which iterative optimization supervises deep learning during training, while during inference deep learning rapidly predicts the global organ morphology and iterative optimization then refines local surfaces and mesh quality. This framework demonstrated markedly higher geometric fidelity than current mainstream deep learning-based organ reconstruction models. For individual anatomical structures, the reconstructed 3D geometries for the bladder, rectum, and uterus achieved significantly lower Chamfer Distance values and higher Dice Similarity Coefficient scores. In addition, while maintaining high computational efficiency, the proposed architecture yielded superior overall volumetric mesh quality. At the patient level, the framework achieved higher mean values for the 10 worst elements for both minSICN and minSIGE compared to traditional geometric post-processing algorithms.
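
The two fidelity metrics named in the abstract can be sketched as follows. Conventions for Chamfer distance vary (squared versus unsquared distances, sum versus mean), so the variant below is an assumption and may differ from the one the authors report; the function names are illustrative.

```python
import numpy as np

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two point sets of shape (N, 3) and (M, 3).

    Assumed convention: mean unsquared nearest-neighbour distance, summed over
    both directions. Uses a dense N x M distance matrix, so it suits small sets.
    """
    diffs = points_a[:, None, :] - points_b[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    return float(dists.min(axis=1).mean() + dists.min(axis=0).mean())

def dice_coefficient(mask_a, mask_b):
    """Dice similarity coefficient between two binary masks of equal shape."""
    mask_a = np.asarray(mask_a, dtype=bool)
    mask_b = np.asarray(mask_b, dtype=bool)
    total = mask_a.sum() + mask_b.sum()
    if total == 0:
        return 1.0  # two empty masks are treated as a perfect match
    return float(2.0 * np.logical_and(mask_a, mask_b).sum() / total)

# Tiny worked inputs: one point pair 5 units apart, and masks sharing one voxel.
single_a = np.array([[0.0, 0.0, 0.0]])
single_b = np.array([[3.0, 4.0, 0.0]])
example_chamfer = chamfer_distance(single_a, single_b)
example_dice = dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0])
```

Lower Chamfer distance indicates closer surfaces, while a higher Dice score indicates greater volumetric overlap, which is why the abstract reports improvements in opposite directions for the two metrics.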

2606.17835 2026-06-17 cs.CL cs.AI eess.AS 新提交

Perceptual compensation for tonal context in self-supervised speech models

自监督语音模型中对声调上下文的感知补偿

James Kirby, Ioana Krehan, Michele Gubian

Affiliations: Institute for Phonetics and Speech Processing, LMU Munich

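AI summary: Through a pseudo-replication of a perceptual compensation experiment on Mandarin tones, the study compares a purely self-supervised pre-trained model with a fine-tuned model. It finds no evidence of compensation in the purely pre-trained model and partial compensation in the fine-tuned model that falls short of human performance, suggesting that supervised objectives may be necessary for abstracting some phonological regularities.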

Comments Accepted for publication at Interspeech 2026

AI Chinese abstract

本研究考察了wav2vec2.0架构在多大程度上表现出对音韵上下文的补偿证据。我们对普通话声调进行了感知补偿实验的伪复制,并比较了纯自监督预训练模型和针对普通话ASR微调模型之间的嵌入相似度和探测分类器输出。在纯预训练模型的嵌入相似度中没有发现补偿证据。探测分类器除了预期的逐层分类改进外,还显示出一些补偿证据,但未能复制人类在孤立测试音节上的表现。我们的发现与先前仅通过预训练就能产生对音韵结构敏感性的报告形成对比,并表明监督目标可能是鼓励至少某些类型的音韵规律抽象所必需的。

English abstract

This study examines the extent to which the wav2vec2.0 architecture exhibits evidence of compensation for phonological context. We conducted a pseudo-replication of a perceptual compensation experiment on Mandarin Chinese tones, and compared the embedding similarities and probing classifier outputs between a purely self-supervised pre-trained model and a model fine-tuned for Mandarin ASR. No evidence of compensation was found in the embedding similarities of the purely pre-trained model. Probing classifiers showed some evidence of compensation in addition to the expected layer-wise improvements in categorization, but failed to replicate human performance on isolated test syllables. Our findings contrast with previous reports of sensitivity to phonological structure emerging through pre-training alone, and suggest that supervised objectives may be necessary to encourage the abstraction of at least some types of phonological regularities.
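As a rough illustration of what a layer-wise embedding similarity comparison can look like, the sketch below computes cosine similarity between two embeddings of the same target at each layer. This is an assumption for illustration only: the abstract does not state which similarity measure was used, and the names `cosine_similarity` and `layerwise_similarity`, as well as the plain-list input format, are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def layerwise_similarity(layers_a, layers_b):
    """Compare two per-layer embedding lists, one similarity per layer.

    layers_a and layers_b are hypothetical inputs: one vector per model
    layer for the same target syllable heard in two different contexts.
    """
    return [cosine_similarity(u, v) for u, v in zip(layers_a, layers_b)]
```

Under this kind of analysis, a context effect would show up as similarity values that differ systematically depending on the preceding tonal context.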

2606.17833 2026-06-17 cs.RO New submission

HumanoidArena: Benchmarking Egocentric Hierarchical Whole-body Learning

HumanoidArena: 以自我为中心的层级全身学习基准

Taowen Wang, Zikang Xie, Bin Yang, Yunheng Wang, Zizhao Yuan, Yuetong Fang, Yixiao Feng, Yichi Wang, Xingyu Chen, Haodong Chen, Qiwei Wu, Weisheng Xu, Lihan Chen, Lusong Li, Zecui Zeng, Renjing Xu

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Beijing University of Technology; Harbin Institute of Technology, Shenzhen; Shenzhen MSU-BIT University; JD Explore Academy

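AI summary: Proposes the HumanoidArena benchmark, which addresses whole-body interaction learning for humanoid robots through hierarchical control (a high-level policy outputs whole-body actions and a low-level general motion tracker executes them), and designs 7 leg-critical tasks to evaluate policy generalization and transfer.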

Comments 29 pages, 13 figures, 10 tables

AI Chinese abstract

人形机器人有望在人类中心环境中实现全身交互,但由于任务级决策与全身动态执行紧密耦合,可扩展的策略学习仍然困难。一个实用的解决方案是层级控制,其中高层策略预测中间全身动作,低层通用运动跟踪器(GMT)将其执行为稳定的人形运动。然而,现有基准很少评估策略-跟踪器接口本身,因此尚不清楚中间全身动作是否可执行、在任务分布变化下是否鲁棒以及是否可跨不同GMT后端迁移。我们引入HumanoidArena,一个以自我为中心的层级全身学习的仿真优先基准。该基准将策略学习形式化为一个层级决策问题:高层策略将自我中心视觉、本体感觉和指令转换为紧凑的全身动作,随后由低层GMT执行。HumanoidArena不将腿部视为平面运输工具,而是强调下肢协调在任务完成中结构上必要的交互。因此,我们设计了7个腿部关键的人-物交互/人-场景交互(HOI/HSI)任务,其中成功需要足部放置、平衡维持、姿势调整和全身重新定向。为了进一步诊断层级系统,我们从两个互补角度评估策略:扰动条件泛化和GMT条件迁移。实验表明,层级控制使学习策略能够解决多样的腿部关键交互,但性能强烈依赖于跟踪器,且跨GMT迁移仍然脆弱。这些结果使HumanoidArena成为研究可迁移中间动作表示和可扩展的自我中心全身策略学习的基准。

English abstract

Humanoid robots promise whole-body interaction in human-centered environments, but scalable policy learning remains difficult because task-level decision-making and whole-body dynamic execution are tightly coupled. A practical solution is hierarchical control, where a high-level policy predicts intermediate whole-body actions and low-level general motion trackers (GMTs) execute them as stable humanoid motion. However, existing benchmarks rarely evaluate the policy-tracker interface itself, leaving open whether intermediate whole-body actions are executable, robust under task distribution shifts, and transferable across different GMT backends. We introduce HumanoidArena, a simulation-first benchmark for egocentric hierarchical whole-body learning. The benchmark formulates policy learning as a hierarchical decision-making problem: a high-level policy converts egocentric vision, proprioception, and instructions into a compact whole-body action, which is subsequently executed by a low-level GMT. Instead of treating the legs as planar transport tools, HumanoidArena emphasizes interactions where lower-body coordination is structurally necessary for task completion. We therefore design 7 leg-critical HOI/HSI tasks in which success requires foot placement, balance maintenance, posture adjustment, and whole-body reorientation. To further diagnose the hierarchical system, we evaluate policies from two complementary perspectives: perturbation-conditioned generalization and GMT-conditioned transfer. Experiments show that hierarchical control enables learned policies to solve diverse leg-critical interactions, but performance is strongly tracker-conditioned and cross-GMT transfer remains fragile. These results position HumanoidArena as a benchmark for studying transferable intermediate action representations and scalable egocentric whole-body policy learning.