arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.09110 2026-06-09 cs.CV New submission

HDRAgent: An Agentic Framework for Multi-Exposure HDR Imaging

HDRAgent: 一种用于多曝光HDR成像的智能体框架

Weiyu Zhou, Tao Hu, Yijian Wang, Xiaogang Xu, Ruixing Wang, Qingsen Yan

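Affiliations: School of Computer Science, Northwestern Polytechnical University; Shenzhen Research Institute, Northwestern Polytechnical University; Zhejiang University; Camera Group, DJI

AI summary: Proposes HDRAgent, the first agent-driven HDR imaging framework. It combines fine-grained contextual knowledge matching, a perception-distortion feedback mechanism, and an agent-guided generative alignment strategy to select reconstruction strategies adaptively and reduce ghosting and local artifacts in complex dynamic scenes.

AI Chinese abstract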

大多数现有的多曝光HDR方法遵循固定的前馈重建范式,使其在复杂动态场景中容易产生鬼影伪影。为了解决这个问题,我们提出了HDRAgent,这是第一个用于HDR成像的智能体驱动框架,它根据当前场景条件自适应地选择重建策略。具体来说,为了提供场景特定的先验知识,我们引入了一个细粒度上下文知识匹配(FCM)模块。该模块利用多模态大语言模型(MLLM)衍生的场景感知来检索相关的历史案例和工具知识,并将它们组织成结构化证据,用于基于MLLM的自适应工具调度。此外,我们提出了一种感知-失真反馈机制,将执行后的质量评估和伪影诊断转化为结构化反馈,并累积到历史记忆中,以帮助后续的上下文知识细化和策略选择。此外,考虑到极端运动可能使对齐方法失效,我们设计了一种智能体引导的生成对齐策略,该策略使用基于MLLM的动态区域解析,在参考帧引导下重建非参考帧中的不可靠内容。实验表明,HDRAgent有效减少了鬼影和局部伪影,同时实现了具有竞争力或更优的客观性能和视觉质量。

English abstract

Most existing multi-exposure HDR methods follow a fixed feed-forward reconstruction paradigm, making them prone to ghosting artifacts in complex dynamic scenes. To address this issue, we propose HDRAgent, the first agent-driven framework for HDR imaging, which adaptively selects reconstruction strategies according to the current scene conditions. Specifically, to provide scene-specific prior knowledge, we introduce a fine-grained contextual knowledge matching (FCM) module. This module leverages multimodal large language model (MLLM)-derived scene perception to retrieve relevant historical cases and tool knowledge, organizing them into structured evidence for MLLM-based adaptive tool scheduling. In addition, we propose a perception-distortion feedback mechanism that transforms post-execution quality assessment and artifact diagnosis into structured feedback, which is accumulated in historical memory to help subsequent contextual knowledge refinement and strategy selection. Furthermore, considering that extreme motion can invalidate alignment methods, we design an agent-guided generative alignment strategy that uses MLLM-based dynamic-region parsing to reconstruct unreliable contents in non-reference frames under reference-frame guidance. Experiments demonstrate that HDRAgent effectively reduces ghosting and local artifacts while achieving competitive or superior objective performance and visual quality.

2606.09109 2026-06-09 cs.CV cs.IR cs.LG New submission

Driving Video Retrieval for Complex Queries with Structured Grounding

面向复杂查询的驾驶视频检索与结构化对齐

Manyi Yao, Sparsh Garg, Christian Shelton, Amit Roy-Chowdhury, Abhishek Aich

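Affiliations: NEC Laboratories America; University of California, Riverside

AI summary: Proposes STRIVE-D, a framework that calibrates rules with weakly labeled in-domain videos and fuses them with vision-language and keyword-based retrieval signals, achieving up to 84% relative improvement in top-1 accuracy for driving video retrieval.

AI Chinese abstract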

大规模视频检索是自动驾驶中数据整理和安全验证的核心,用户不仅希望找到场景,还希望找到诸如切入和急刹车等动态事件。现有的视觉语言和基于关键词的检索方法常常遗漏这些事件,因为相关的运动可能没有在文本中明确描述或通过词汇重叠捕获。基于规则的检索可以更直接地编码此类事件,但它是脆弱的:生成的或手工编写的规则在假设与真实驾驶数据不匹配时常常失败。我们提出了STRIVE-D,一种针对驾驶视频的数据校准检索框架。它使用弱标记的领域内视频来估计查询规则何时可靠,调整与观测数据不匹配的规则,并将校准后的规则分数与视觉语言和基于关键词的检索信号融合。在三个驾驶基准测试中,包括新发布的DrivingDojo上的人工标注事件数据,STRIVE-D相对于最先进方法在top-1准确率上实现了高达84%的相对改进。

English abstract

Video retrieval at scale is central to data curation and safety validation in autonomous driving, where users want to find not only scenes but also dynamic events such as cut-ins and hard braking. Existing vision-language and keyword-based retrieval methods often miss these events because the relevant motion may not be explicitly described in text or captured by lexical overlap. Rule-based retrieval can encode such events more directly, but it is brittle: generated or hand-written rules often fail when their assumptions do not match real driving data. We propose STRIVE-D, a data-calibrated retrieval framework for driving videos. It uses weakly labeled in-domain videos to estimate when a query rule is reliable, adapt rules that mismatch observed data, and fuse calibrated rule scores with vision-language and keyword-based retrieval signals. Across three driving benchmarks, including newly released human-annotated event data on DrivingDojo, STRIVE-D delivers up to 84% relative improvement in top-1 accuracy over state-of-the-art methods.

2606.09108 2026-06-09 cs.RO cs.LG New submission

RAM: Reachability Across Morphologies

RAM: 跨形态可达性

Tim Walter, Xinyu Chen, Jonathan Külz, Matthias Althoff

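Affiliations: Department of Computer Engineering; German Electron Synchrotron Technical University; Technical University Munich; University of Hamburg

AI summary: Proposes RAM, a morphology-conditioned implicit neural representation that predicts reachability quickly and differentiably and generalizes to unseen morphologies. Trained on a large-scale dataset generated from forward kinematics, it reaches an F1 score of 86% at nanosecond-scale inference and substantially accelerates morphology and trajectory optimization.

Comments 22 pages, 11 figures

AI Chinese abstract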

机器人生命周期的许多阶段,从形态合成到操作,都从根本上依赖于可达工作空间。然而,当前用于近似工作空间的方法要么速度慢、精度低,要么局限于单一形态。我们提出了跨形态可达性(RAM):一种形态条件的隐式神经表示,作为位姿可达性的快速、可微替代,能够泛化到未见形态,同时固有地考虑自碰撞。为了训练RAM,我们发布了一个大规模数据集,包含仅由正向运动学生成的$3\cdot10^{10}$个样本。实验表明,我们的模型在纳秒级推理时达到了$86\%$的$F_1$分数,比基线高出$14\%$,同时推理时间减少了三个数量级。我们进一步展示了在基于梯度的形态优化和轨迹优化中分别加速一个和两个数量级。

English abstract

Many stages of the robotic lifecycle, from morphology synthesis to operation, rely fundamentally on the reachable workspace. However, current methods for approximating workspaces are slow, imprecise, or tied to a single morphology. We introduce Reachability Across Morphologies (RAM): a morphology-conditioned, implicit neural representation that acts as a fast, differentiable surrogate for pose reachability, generalising to unseen morphologies while inherently accounting for self-collisions. To train RAM, we publish a large-scale dataset of $3\cdot10^{10}$ samples generated solely from forward kinematics. Experiments show that our model achieves an $F_1$-score of $86\%$ at nanosecond inference, outperforming the baseline by $14\%$ while reducing inference time by three orders of magnitude. We further demonstrate speed-ups of one and two orders of magnitude for gradient-based morphology and trajectory optimisation, respectively. Website: https://timwalter.github.io/ram.
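
As a rough illustration of how reachability labels can come from forward kinematics alone, the toy sketch below samples joint angles for a planar two-link arm and records the grid cells its end effector lands in. The arm, the link lengths, the cell size, and the function names are illustrative assumptions; the RAM dataset covers full poses, many morphologies, and self-collisions, none of which this example models.

```python
import math
import random

def forward_kinematics(theta1, theta2, l1=1.0, l2=1.0):
    """End-effector position of a planar two-link arm (hypothetical toy robot)."""
    x = l1 * math.cos(theta1) + l2 * math.cos(theta1 + theta2)
    y = l1 * math.sin(theta1) + l2 * math.sin(theta1 + theta2)
    return x, y

def sample_reachable_cells(n_samples, cell=0.25, seed=0):
    """Mark grid cells as reachable by sampling joint angles uniformly."""
    rng = random.Random(seed)
    cells = set()
    for _ in range(n_samples):
        t1 = rng.uniform(-math.pi, math.pi)
        t2 = rng.uniform(-math.pi, math.pi)
        x, y = forward_kinematics(t1, t2)
        # Each sample is a positive label: the pose was produced by the arm itself.
        cells.add((math.floor(x / cell), math.floor(y / cell)))
    return cells
```

Sampling of this kind only ever yields reachable examples, so cells that are never hit are unobserved rather than proven unreachable; a learned surrogate would still need some scheme for negative labels.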

2606.09104 2026-06-09 cs.LG cs.AI q-fin.PM New submission

Addressing Market Regime Changes and Heavy-Tailed Returns in Portfolio Optimization via Bayesian VAR and Elliptical Black-Litterman

通过贝叶斯VAR和椭圆Black-Litterman解决投资组合优化中的市场机制变化和重尾收益问题

Daniil Mikriukov, Ruoyu Sun, Angelos Stefanidis, Jionglong Su, Zhengyong Jiang

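Affiliations: University of Liverpool; Xi'an Jiaotong-Liverpool University

AI summary: Proposes the BAVAR-BLED algorithm, which combines Bayesian-Averaging Vector Autoregression with the Black-Litterman model under elliptical distributions to allocate assets adaptively within a TD3 architecture, achieving a Sharpe ratio of 1.72 and total returns of 57.26% on Dow Jones Industrial Average constituents.

Comments 9 pages, 3 figures, 4 tables. Extends our prior work [Mikriukov et al., ICIC 2025] on Black-Litterman under Elliptical Distributions (BLED). Manuscript under review

AI Chinese abstract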

用于投资组合优化的深度强化学习框架因其能够从市场数据中动态学习分配规则而显示出前景。然而,这些模型未能考虑肥尾收益,而肥尾收益以更频繁的极端事件为特征,描述了实际市场行为。此外,历史数据被同质化处理,未考虑时间重要性,导致模型在机制变化时失效。我们提出了一种新的BAVAR-BLED算法,该算法在TD3架构内结合了源自贝叶斯平均向量自回归(BAVAR)和使用椭圆分布的Black-Litterman模型(BLED)的方法。BAVAR捕获一组考虑多尺度时间特征的向量自回归表示,从而基于对收益预期和离散矩阵的机制感知估计实现自适应分配决策。这些估计作为BLED的先验输入,BLED使用学生t分布,允许更现实的肥尾收益估计。BAVAR-BLED算法使用Transformer网络进行观点构建,使用CNN进行风险厌恶估计,根据市场条件修改动态分配决策。对道琼斯工业平均指数29只成分股在十年市场周期内的评估表明,BAVAR-BLED显著优于最先进的方法,实现了1.72的夏普比率和2.70的索提诺比率,总收益为57.26%。

English abstract

Deep reinforcement learning (DRL) frameworks for portfolio optimization have shown promise for their ability to learn allocation rules dynamically from market data. However, these models fail to account for fat-tailed returns, which characterize actual market behavior with more frequent extreme events. Furthermore, historical data is treated homogeneously, without accounting for temporal importance, leading models to fail during regime changes. We propose a new BAVAR-BLED algorithm that combines methods derived from Bayesian-Averaging Vector Autoregressive (BAVAR) and the Black-Litterman model using Elliptical Distributions (BLED) within a TD3 architecture. BAVAR captures a set of vector autoregressive representations that consider multi-scale temporal features, enabling adaptive allocation decisions based on regime-aware estimates of return expectations and dispersion matrices. These estimates serve as prior inputs to BLED, a model that uses Student's t-distributions, allowing for more realistic fat tail return estimates. The BAVAR-BLED algorithm uses transformer networks for view construction and CNNs for risk-aversion estimates, which modify dynamic allocation decisions based on market conditions. An evaluation of 29 Dow Jones Industrial Average constituents over a decade-long market period shows that BAVAR-BLED significantly outperforms state-of-the-art methods, achieving Sharpe and Sortino ratios of 1.72 and 2.70, respectively, and total returns of 57.26%.
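
For readers unfamiliar with the two risk-adjusted metrics quoted above, the sketch below computes them from a series of periodic returns using common textbook definitions. The annualization factor of 252 trading days, the zero risk-free rate and target, and the function names are assumptions; the abstract does not state which conventions the authors used.

```python
import math
from statistics import mean, stdev

def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized Sharpe ratio: mean excess return over its sample standard deviation."""
    excess = [r - risk_free for r in returns]
    return mean(excess) / stdev(excess) * math.sqrt(periods_per_year)

def sortino_ratio(returns, target=0.0, periods_per_year=252):
    """Annualized Sortino ratio: mean excess return over downside deviation."""
    excess = [r - target for r in returns]
    # Only returns below the target contribute to the downside deviation.
    downside = math.sqrt(mean([min(e, 0.0) ** 2 for e in excess]))
    if downside == 0.0:
        return math.inf  # no losses observed in the sample
    return mean(excess) / downside * math.sqrt(periods_per_year)
```

Because the Sortino ratio penalizes only downside moves, it can exceed the Sharpe ratio for the same series when large deviations are mostly gains, which matches the ordering of the two figures reported in the abstract.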

2606.09099 2026-06-09 cs.RO New submission

LAEI: Layered Autonomous Edge Intelligence Framework for Robust UAV Swarm Operations

LAEI: 面向鲁棒无人机蜂群操作的分层自主边缘智能框架

Changmin Park, Wooyong Jung, Hwangnam Kim

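Affiliations: Korea University

AI summary: Proposes the Layered Autonomous Edge Intelligence framework, which combines onboard learned policies with lightweight mission-level supervision to coordinate UAV swarms scalably under limited communication, environmental uncertainty, and component failures, significantly reducing mission completion time and improving efficiency.

Comments Preprint. Submitted to arXiv

AI Chinese abstract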

自主无人机蜂群需要可扩展的协调机制,以在有限通信、环境不确定性和组件故障下保持任务性能。集中式方法提供全局协调,但存在通信瓶颈和单节点脆弱性,而完全分散的方法通常缺乏任务级一致性。本文提出了分层自主边缘智能(LAEI),一种无人机蜂群框架,它将机载学习策略与轻量级任务级监督相结合。每个无人机在机载执行局部感知、避障和动作选择,而监督层提供自适应目标重分配、故障感知恢复和上下文相关策略指导,而不直接控制低级动作。LAEI进一步整合了恢复策略,包括动态重新关联、备份监督支持和回退局部自主性,以在代表性故障场景下维持任务连续性。我们在模拟的无人机蜂群场景中评估了LAEI,使用任务完成时间、碰撞率和覆盖效率。结果表明,LAEI减少了任务完成时间并提高了操作效率,同时保持了碰撞感知的分布式无人机级决策。

English abstract

Autonomous UAV swarms require scalable coordination mechanisms that maintain mission performance under limited communication, environmental uncertainty, and component failures. Centralized approaches provide global coordination but suffer from communication bottlenecks and single-node vulnerabilities, whereas fully decentralized methods often lack mission-level consistency. This paper presents Layered Autonomous Edge Intelligence (LAEI), a UAV-swarm framework that combines onboard learned policies with lightweight mission-level supervision. Each UAV performs local perception, obstacle avoidance, and action selection onboard, while the supervisory layer provides adaptive goal reassignment, fault-aware recovery, and context-dependent policy guidance without directly controlling low-level actions. LAEI further incorporates recovery strategies, including dynamic reassociation, backup supervisory support, and fallback local autonomy, to maintain mission continuity under representative failure scenarios. We evaluate LAEI in simulated UAV-swarm scenarios using mission completion time, collision rate, and coverage efficiency. The results show that LAEI reduces mission completion time and improves operational efficiency while maintaining collision-aware distributed UAV-level decision-making.

2606.09092 2026-06-09 cs.LG New submission

From Shortcuts to Reasoning: Robust Post-Training of Theory of Mind with Reinforcement Learning

从捷径到推理:基于强化学习的心理理论鲁棒后训练

Jike Zhong, Yuxiang Lai, Ming Li, Yuheng Li, Wuao Liu, Behzad Dariush, Konstantinos Psounis, Shao-Yuan Lo

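Affiliations: University of Science and Technology of China

AI summary: Addresses the shortcut problem in Theory of Mind post-training with Thinking-RFT, which combines verifiable rewards with explicit reasoning chains and significantly improves reasoning on several shortcut-free datasets, especially in complex higher-order reasoning and multimodal settings.

Comments Accepted by ICML 2026

AI Chinese abstract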

心理理论(ToM)是现代基础模型系统在现实世界中有效且安全运行必须掌握的技能。最近的工作探索了通过后训练来磨练ToM;然而,我们表明这种进展受到普遍存在的“捷径”问题的干扰:任务可以通过简单地利用虚假的因果相关性达到高达99%的准确率,从而导致对ToM的错误认识。受此启发,我们首先开发了一个框架来系统地检查ToM数据集中的捷径,并为未来发展提供指导。我们发现,可简化为纯状态跟踪的问题(如“信念”)特别容易受到捷径影响,而需要超越跟踪进行推理的心理问题(如“意图”)则不然。使用三个ToM上下文中的四个无捷径数据集,我们全面研究了带有可验证奖励和显式推理链的强化微调(称为Thinking-RFT)是否比监督微调(SFT)更能提升ToM。我们的主要发现如下。首先,Thinking-RFT在所有场景中有效提升ToM,比SFT提高6%,特别是在复杂高阶推理中比SFT提高10%,在多模态情况下比SFT提高7%。它还能更好地泛化到未见领域和高阶查询,同时对反事实更加鲁棒。其次,ToM特别受益于推理和强化学习的联合效应:Thinking-RFT平均比Non-Thinking-RFT高出7%。第三,RFT通过学会将其推理基于与因果因素对应的锚定线索(如关键词和状态变化)来工作。我们相信我们的研究对于开发有效且鲁棒的ToM后训练数据集以及推进关键ToM能力是有用的。

English abstract

Theory of Mind (ToM) is a must-acquire skill for modern foundation model systems to operate effectively and safely in the real world. Recent works have explored honing ToM via post-training; however, we show that such progress is confounded by a pervasive "shortcut" issue: tasks can reach up to 99% accuracy by simply exploiting spurious causal correlations, leading to a false sense of ToM. Motivated by this, we first develop a framework to systematically examine ToM datasets for shortcuts and provide guidance for future development. We find that questions reducible to pure state tracking, such as "belief," are especially shortcut-prone compared to mind questions, such as "intention," where reasoning beyond tracking is required. Using four shortcut-free datasets across three ToM contexts, we then comprehensively study whether Reinforcement Fine-Tuning with verifiable rewards and explicit reasoning chains, called Thinking-RFT, elevates ToM beyond Supervised Fine-Tuning, or SFT. Our key findings are as follows. First, Thinking-RFT effectively improves ToM in all scenarios, with a 6% improvement over SFT, particularly in complex higher-order reasoning, with a 10% improvement over SFT, and multimodal cases, with a 7% improvement over SFT. It also generalizes notably better to unseen domains and higher-order queries while being more robust to counterfactuals. Second, ToM benefits specifically from the joint effect of reasoning and RL: Thinking-RFT outperforms Non-Thinking-RFT by 7% on average. Third, RFT works by learning to ground its reasoning on anchor cues, such as keywords and state changes, that correspond to causal factors. We believe our study is useful for developing effective and robust ToM post-training datasets and advancing critical ToM capabilities.

2606.09091 2026-06-09 cs.LG cs.CV New submission

Stabilizing On-Policy Distillation for MLLM Reasoning with Global Normalization

稳定基于策略的蒸馏用于多模态大语言模型推理的全局归一化

Dongze Hao, Zhiwei Jin, Chen Chen, Haonan Lu

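Affiliations: OPPO AI Center

AI summary: Addresses gradient instability caused by outlier states in on-policy distillation with Globally Normalized Distillation Policy Optimization (GNDPO), which stabilizes optimization by converting KL scores into batch-level relative advantages and improves training robustness and performance on multimodal reasoning tasks.

AI Chinese abstract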

基于策略的蒸馏(OPD)最近成为一种重要的后训练范式。通过使用更强的教师模型为采样轨迹提供密集、细粒度的监督,OPD相比依赖稀疏二元或基于结果的环境反馈的可验证奖励强化学习(RLVR)具有明显优势。然而,朴素的token级蒸馏可能因异常状态中的幅度不匹配而遭受梯度不稳定性。为了解决这个问题,我们提出了全局归一化蒸馏策略优化(GNDPO),这是一种实用方法,通过将原始KL分数转化为批次级相对优势来稳定优化。这种归一化有效缓解了梯度爆炸,同时保留了token级指导的优势。实验结果表明,GNDPO在多模态推理任务中显著提高了训练鲁棒性和下游性能。代码已发布在 https://github.com/OPPO-Mente-Lab/GNDPO。

English abstract

On-policy distillation (OPD) has recently emerged as an important post-training paradigm. By using a stronger teacher model to provide dense, fine-grained supervision for sampled trajectories, OPD offers a clear advantage over reinforcement learning with verifiable rewards (RLVR), which typically depends on sparse binary or outcome-based environmental feedback. However, naive token-level distillation can suffer from gradient instability, due to magnitude misalignment in outlier states. To address this issue, we propose Globally Normalized Distillation Policy Optimization (GNDPO), a practical method that stabilizes optimization by transforming raw KL scores into batch-level relative advantages. This normalization effectively mitigates gradient explosions while retaining the benefits of token-level guidance. Experimental results show that GNDPO substantially improves training robustness and downstream performance across multimodal reasoning tasks. The code is released at https://github.com/OPPO-Mente-Lab/GNDPO.
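
One way to read "transforming raw KL scores into batch-level relative advantages" is a standardization over every token in the batch, sketched below. This is a guess at the general idea, not the GNDPO implementation: the z-score form, the sign convention, the `eps` constant, and the function name are all assumptions, and the released repository should be consulted for the actual method.

```python
from statistics import mean, pstdev

def batch_normalized_advantages(kl_scores, eps=1e-8):
    """Map per-token KL scores (one list per sampled sequence) to relative
    advantages by standardizing over all tokens in the batch."""
    flat = [s for seq in kl_scores for s in seq]
    mu, sigma = mean(flat), pstdev(flat)
    # Assumed sign convention: a lower KL to the teacher earns a higher advantage.
    return [[-(s - mu) / (sigma + eps) for s in seq] for seq in kl_scores]
```

The point of the sketch is the bounded scale: a single outlier token with a KL of 50 among tokens near 0.2 ends up with an advantage of magnitude below 2 instead of dominating the gradient with its raw value.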

2606.09088 2026-06-09 cs.RO New submission

Autonomous FPV Flight with Translational Optical Flow and Uncertainty Mask

基于平移光流与不确定性掩膜的自主FPV飞行

Yang Deng, Yu Hu, Feng Yu, Linzuo Zhang, Danping Zou

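Affiliations: Shanghai Jiao Tong University

AI summary: Uses translational optical flow and an uncertainty mask to improve autonomous FPV quadrotor flight, reaching speeds of up to 13.91 m/s in simulation and 11.79 m/s in real forest environments, with a 93.3% success rate.

AI Chinese abstract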

在复杂环境中使用单目RGB相机作为唯一外部传感器的自主FPV四旋翼飞行仍然是一个基本挑战。最近的研究表明,使用光流作为神经网络的输入可以实现杂乱场景中的端到端自主飞行。然而,从光流估计中提取最相关信息是限制敏捷性和鲁棒性的关键瓶颈。现有方法难以将障碍物引起的光流与自运动背景光流分离,并且在膨胀焦点(FoE)附近信噪比低。为了解决这些问题,我们将光流分解为平移和旋转分量,并仅利用捕捉场景几何和深度线索的平移光流。此外,我们引入了一种基于前向和后向光流估计不一致性的不确定性掩膜。该掩膜突出显示障碍物结构,包括FoE区域内的结构。这两个线索被输入到在可微仿真框架中训练的控制策略中,该框架能够实现感知和控制的一阶优化。我们通过在仿真和真实森林环境中的大量实验验证了我们的方法。所提出的系统在仿真中实现了高达13.91 m/s的速度,在真实测试中实现了11.79 m/s的速度,在30次真实试验中成功率为93.3%,几乎使先前报道的单目RGB光流无人机避障系统的6 m/s真实速度翻倍。

English abstract

Autonomous FPV quadrotor flight in complex environments using a monocular RGB camera as the sole exteroceptive sensor remains a fundamental challenge. Recent research has shown that using optical flow as the input to a neural network can achieve end-to-end autonomous flight in cluttered scenes. However, extracting the most relevant information from the flow estimation is the key bottleneck limiting agility and robustness. Existing methods struggle to disentangle obstacle-induced optical flow from the ego-motion background flow and suffer from low signal-to-noise ratios near the focus of expansion (FoE). To address these issues, we decompose the optical flow into translational and rotational components and utilize only the translational flow, which captures scene geometry and depth cues. In addition, we introduce an uncertainty mask derived from inconsistencies between forward and backward flow estimates. This mask highlights obstacle structures, including those within the FoE region. Both cues are fed to a control policy trained in a differentiable simulation framework, which enables efficient first-order optimization across perception and control. We validate our approach through extensive experiments in both simulated and real-world forest environments. The proposed system achieves robust flight at speeds of up to 13.91 m/s in simulation and 11.79 m/s in real-world tests, with a 93.3% success rate over 30 real-world trials, nearly doubling the 6 m/s real-world speed previously reported for a monocular-RGB optical-flow UAV obstacle avoidance system.
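
The forward-backward inconsistency behind the uncertainty mask can be illustrated with a generic consistency check, sketched below in plain Python. The nearest-neighbour lookup, the 1-pixel threshold, the out-of-bounds rule, and the function name are assumptions for illustration; the abstract does not describe how the authors compute or threshold the mask.

```python
import math

def uncertainty_mask(forward, backward, threshold=1.0):
    """forward[y][x] and backward[y][x] hold (dx, dy) flow vectors in pixels.
    Returns mask[y][x] = True where the forward-backward check fails."""
    height, width = len(forward), len(forward[0])
    mask = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            fx, fy = forward[y][x]
            # Nearest-neighbour lookup of the landing pixel in the second frame.
            tx, ty = round(x + fx), round(y + fy)
            if not (0 <= tx < width and 0 <= ty < height):
                mask[y][x] = True  # leaves the image: treat as unreliable
                continue
            bx, by = backward[ty][tx]
            # Consistent flows cancel: forward + backward should be near zero.
            mask[y][x] = math.hypot(fx + bx, fy + by) > threshold
    return mask
```

Pixels where the two estimates fail to cancel tend to sit on occlusion boundaries, which is one plausible reason such a mask would highlight obstacle structure.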

2606.09086 2026-06-09 cs.AI New submission

DynaOD: Dynamic Origin-Destination Flow Generation with Discrete-to-Continuous Temporal Semantic Modeling

DynaOD: 基于离散到连续时间语义建模的动态起讫点流量生成

Jie Zhao, Xianqi Dai, Jie Feng, Huandong Wang, Yong Li

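Affiliations: Department of Electronic Engineering, BNRist, Tsinghua University; Tsinghua Shenzhen International Graduate School; Zhongguancun Academy

AI summary: Proposes DynaOD, which models temporal semantics from two perspectives, discrete directional trends and continuous temporal evolution, and conditions pretrained static OD generators in a lightweight, plug-and-play manner. It generates dynamic OD flows without historical observations and outperforms baselines in predictive accuracy and distributional fidelity.

Comments Accepted by IJCAI2026

AI Chinese abstract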

动态起讫点(OD)流量生成旨在仅从时间上下文合成逼真的移动动态,而不依赖历史OD观测。一个关键挑战是将语义时间信号转化为时间上连贯的OD模式,同时保留城市区域固有的空间异质性。我们提出DynaOD,一个语义驱动框架,通过两个互补视角建模时间动态:离散方向趋势,刻画城市活动模式的定性变化;连续时间演化,捕捉这些变化如何随时间展开。通过联合编码这些时间语义,该框架构建时变区域表示,以轻量即插即用方式调节预训练的静态OD生成器。这种模块化设计进一步支持可扩展部署和跨城市迁移。在大型真实世界数据集上的大量实验表明,我们的方法在预测精度和分布保真度上均持续优于代表性基线。代码公开于https://github.com/csjiezhao/DynaOD。

English abstract

Dynamic origin-destination (OD) flow generation seeks to synthesize realistic mobility dynamics from temporal context alone, without relying on historical OD observations. A key challenge is to translate semantic temporal signals into temporally coherent OD patterns while preserving the inherent spatial heterogeneity of urban regions. We propose DynaOD, a semantic-driven framework that models temporal dynamics through two complementary perspectives: discrete directional trends that characterize qualitative shifts in urban activity patterns, and continuous temporal evolution that captures how such shifts unfold over time. By jointly encoding these temporal semantics, the framework constructs time-varying region representations that condition pretrained static OD generators in a lightweight and plug-and-play fashion. This modular design further supports scalable deployment and cross-city transferability. Extensive experiments on large-scale real-world datasets show that our method consistently outperforms representative baselines in both predictive accuracy and distributional fidelity. Code is publicly available at https://github.com/csjiezhao/DynaOD.

2606.09081 2026-06-09 cs.CV New submission

Edge-Constrained UAV Small-Object Detection with P2 Enhancement and Quantum-Inspired Lightweight Structure Search

边缘约束下基于P2增强和量子启发轻量级结构搜索的无人机小目标检测

Wuming Lei, Yanbin Gao, Mingyan Sun, Xiaobin Li, Xuechen Liang

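Affiliations: East China Jiaotong University

AI summary: Targets UAV edge deployment by combining a P2 high-resolution detection branch with a quantum-inspired evolutionary algorithm that searches for lightweight structures, significantly improving small-object detection accuracy on VisDrone.

AI Chinese abstract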

无人机目标检测需要紧凑的检测器,在机载计算和内存限制下保留小目标细节。轻量级网络中的重复下采样削弱了浅层空间信息,而手动添加注意力或融合模块可能增加成本且收益不稳定。本研究在边缘部署约束下分析YOLOX-Nano,结合P2高分辨率检测分支与量子启发进化算法(QIEA)进行轻量级结构筛选。搜索空间由轻量级优先级和任务特异性定义,评估同时考虑精度、浮点运算数(FLOPs)、延迟、内存消耗和召回率。在VisDrone上,P2分支使$AP_{small}$比YOLOX-Nano基线提升31.10%。与类似模型大小的NanoDet-Plus相比,YOLOX-Nano+P2在$AP_{50:95}$上提升17.5%,在$AP_{small}$上提升44.9%。QIEA选择的候选者获得最高$Recall_{50}$,但+P2在完整训练后仍是最强的AP导向变体。对Random-best、GA-best和SA/QUBO-best候选者进行完整的100轮验证进一步表明,代理排名不一定转化为最终的$AP_{50:95}$。这些结果支持将P2作为主要的小目标增强路径,并将QIEA作为候选筛选和精度-成本分析的轻量级工具。源代码、配置文件、诊断脚本和总结结果可在 https://github.com/Ming23233/UAV-QIEA-Edge-Detection 获取。

English abstract

Unmanned aerial vehicle (UAV) object detection requires compact detectors that retain small-object details under onboard computation and memory constraints. Repeated downsampling in lightweight networks weakens shallow spatial information, while manually adding attention or fusion modules may increase cost without stable gains. This study analyzes YOLOX-Nano under edge-deployment constraints by combining a P2 high-resolution detection branch with a quantum-inspired evolutionary algorithm (QIEA) for lightweight structure screening. The search space is defined by lightweight priority and task specificity, and the evaluation jointly considers accuracy, floating-point operations (FLOPs), latency, memory consumption, and recall. On VisDrone, the P2 branch increases $AP_{small}$ by 31.10% over the YOLOX-Nano baseline. Compared with NanoDet-Plus with similar model size, YOLOX-Nano+P2 improves $AP_{50:95}$ by 17.5% and $AP_{small}$ by 44.9%. The QIEA-selected candidate obtains the highest $Recall_{50}$, but +P2 remains the strongest AP-oriented variant after full training. Full 100-epoch verification of Random-best, GA-best, and SA/QUBO-best candidates further shows that proxy rankings do not necessarily transfer to final $AP_{50:95}$. These results support using P2 as the main small-object enhancement path and QIEA as a lightweight tool for candidate screening and accuracy-cost analysis. The source code, configuration files, diagnostic scripts, and summarized results are available at https://github.com/Ming23233/UAV-QIEA-Edge-Detection.

2606.09080 2026-06-09 cs.LG cs.CL New submission

Beyond FLOPs: Benchmarking Real Inference Acceleration of LLM Pruning under a GEMM-Centric Taxonomy

超越FLOPs:基于GEMM中心分类法的LLM剪枝真实推理加速基准测试

Haozhe Hu, Hao Wu, Anhao Zhao, Longwei Ding, Peiran Yin, Yunpu Ma, Xiaoyu Shen

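Affiliations: Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo; Department of Computing, The Hong Kong Polytechnic University; Munich Center for Machine Learning, LMU Munich

AI summary: Proposes a taxonomy of pruning methods based on GEMM dimensions and uses a unified benchmarking framework to evaluate where different pruning methods fall on the acceleration-quality Pareto frontier. It finds that static depth pruning is best at low quality loss and offers a unified view of pruning-based LLM acceleration.

Comments 22 pages, 14 figures

AI Chinese abstract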

剪枝已成为加速大语言模型(LLM)推理的主流范式,涵盖了一系列方法,这些方法在token、层、头、维度和注意力模式上移除计算。尽管目标相同,这些剪枝方法会引发根本不同的执行行为,导致实际加速效果严重依赖于硬件和内核实现。因此,不同剪枝家族的实际加速收益仍知之甚少。在这项工作中,我们引入了一种基于GEMM中心的分类法,根据通用矩阵乘法(GEMM)的逻辑M、N和K维度重新组织现有剪枝方法。利用这一抽象,我们构建了一个统一的基准测试框架,能够在剪枝设计空间中进行实现一致的比较,并系统地表征加速-质量帕累托前沿。我们的结果表明,静态深度剪枝仍然是最强的帕累托最优基线,并且在内存受限场景下最接近其理论加速上限。在预填充阶段,前沿从低质量损失(0%-4%)的静态深度,过渡到中等损失(5%-16%)的动态深度,最后到更高损失水平(17%-26%)的静态宽度剪枝。这些发现首次建立了基于剪枝的LLM加速实际极限的统一视图,并为未来的剪枝研究提供了指导。代码可在 https://github.com/EIT-NLP/LLM-Pruning/tree/main/PruningInferSim 获取。

English abstract

Pruning has emerged as a dominant paradigm for accelerating large language model (LLM) inference, spanning a broad spectrum of methods that remove computation across tokens, layers, heads, dimensions, and attention patterns. Despite sharing the same objective, these pruning approaches induce fundamentally different execution behaviors, causing realized speedups to depend heavily on hardware and kernel implementations. Consequently, the practical acceleration benefits of different pruning families remain poorly understood. In this work, we introduce a GEMM-centric taxonomy that reorganizes existing pruning methods according to the logical M, N, and K dimensions of general matrix multiplication (GEMM). Leveraging this abstraction, we build a unified benchmarking framework that enables implementation-consistent comparison across the pruning design space and systematically characterizes the acceleration-quality Pareto frontier. Our results show that static depth pruning remains the strongest Pareto-optimal baseline and stays closest to its theoretical acceleration upper bound in memory-bound scenarios. During prefill, the frontier transitions from static depth at low quality loss (0% to 4%), to dynamic depth at moderate loss (5% to 16%), and finally to static width pruning at higher loss levels (17% to 26%). These findings establish the first unified view of the practical limits of pruning-based LLM acceleration and provide guidance for future pruning research. Code is available at https://github.com/EIT-NLP/LLM-Pruning/tree/main/PruningInferSim.
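
To make the M, N, K framing concrete, the sketch below counts FLOPs for a single GEMM and derives the theoretical speedup ceiling when a fraction of each logical dimension is kept. The layout (M as tokens, K as input width, N as output width), the example sizes, and the function names are assumptions for illustration; how the paper assigns specific pruning families to each dimension is not spelled out in the abstract.

```python
def gemm_flops(m, n, k):
    """FLOPs for an (m x k) @ (k x n) product, counting each multiply-add as 2."""
    return 2 * m * n * k

def theoretical_speedup(m, n, k, keep_m=1.0, keep_n=1.0, keep_k=1.0):
    """FLOP-ratio ceiling on speedup when only a fraction of each dimension is kept."""
    pruned = gemm_flops(round(m * keep_m), round(n * keep_n), round(k * keep_k))
    return gemm_flops(m, n, k) / pruned

# Hypothetical layer: 1024 tokens through a 4096 x 4096 projection.
print(theoretical_speedup(1024, 4096, 4096, keep_m=0.5))
print(theoretical_speedup(1024, 4096, 4096, keep_n=0.5, keep_k=0.5))
```

This ratio is only an upper bound. As the abstract stresses, realized speedups depend on hardware and kernel behavior, so a halved dimension rarely delivers the full factor in practice.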

2606.09078 2026-06-09 cs.LG New submission

The Hidden Bias of Process Reward Models: PRISM for Rewarding the Right Reasoning

过程奖励模型的隐藏偏见:PRISM用于奖励正确推理

Aakriti Agrawal, Souradip Chakraborty, Armin Saghafian, Nihal Sharma, Rizal Fathony, Nam H Nguyen, C. Bayan Bruss, Amrit Singh Bedi, Furong Huang

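Affiliations: University of Maryland; Amazon; University of Central Florida

AI summary: Addresses the bias of process reward models toward spuriously high scores, caused by imbalanced training data, with the PRISM framework. PRISM learns from contrastive step-level comparisons and hard negatives generated by a lookahead strategy, optimized with a difficulty-aware curriculum, and significantly lowers the false-positive rate while improving reasoning accuracy.

AI Chinese abstract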

过程奖励模型(PRM)通过提供步骤级反馈改善了推理的信用分配。然而,我们发现PRM中存在由步骤级训练数据严重不平衡引起的隐藏偏见。标准交叉熵训练放大了这种偏见,导致PRM过度奖励看似合理但错误的步骤,并产生高假阳性率。我们表明这些假阳性具有不对称的下游效应:假阴性主要减缓探索,而假阳性则主动将Best-of-N选择、引导解码和策略优化引导向有缺陷的推理。这表明PRM训练应从逐点标签拟合转向可靠的相对比较。为解决此问题,我们提出PRISM(改进步骤建模的精确排序),一种策略感知的PRM训练框架,从对比步骤级比较和由时间前瞻策略生成的难负样本中学习,无需新的人工标签。我们进一步使用难度感知课程来优化对比步骤间隔。在PRMBench和ProcessBench上,PRISM显著减少了假阳性(PRMBench上降低22%),并在强判别性PRM上提高了宏F1。当应用于策略优化和搜索任务(包括引导解码和Best-of-N选择)时,它持续提高了准确率(引导解码最高22%,Best-of-N最高33%)和鲁棒性。更广泛地说,可信的过程监督不仅仅是分配高奖励,而是为了正确的理由奖励正确的推理。

English abstract

Process Reward Models (PRMs) improve credit assignment for reasoning by providing step-level feedback. However, we identify a hidden bias in PRMs caused by severe imbalance in step-level training data. Standard cross-entropy training amplifies this bias, causing PRMs to overcredit plausible but incorrect steps and produce high false-positive rates. We show that these false positives have an asymmetric downstream effect: false negatives mainly slow exploration, whereas false positives actively steer Best-of-N selection, guided decoding, and policy optimization toward flawed reasoning. This suggests that PRM training should shift from pointwise label fitting to reliable relative comparisons. To address this, we propose PRISM (Precision Ranking for Improved Step Modeling), a policy-aware PRM training framework that learns from contrastive step-level comparisons and hard negatives generated by a temporal lookahead strategy, requiring no new human labels. We further use a difficulty-aware curriculum to optimize the contrastive step margin. Across PRMBench and ProcessBench, PRISM substantially reduces false positives (22% on PRMBench) and improves macro F1 over strong discriminative PRMs. When applied to policy optimization and search tasks, including guided decoding and Best-of-N selection, it consistently improves accuracy (up to 22% for guided decoding and 33% for Best-of-N) and robustness. More broadly, trustworthy process supervision is not just about assigning high rewards, but about rewarding the right reasoning for the right reasons.
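
The shift from pointwise label fitting to relative comparison can be illustrated with a generic hinge-style ranking loss, sketched below. This is a stand-in chosen for clarity, not the PRISM objective: the loss form, the margin value, the example scores, and the function name are assumptions, and the abstract does not give the actual formulation.

```python
def pairwise_margin_loss(score_correct, score_incorrect, margin=0.5):
    """Hinge-style ranking loss: zero once the correct step outscores the
    incorrect one by at least `margin`."""
    return max(0.0, margin - (score_correct - score_incorrect))

# A plausible but wrong step scored 0.85 would pass a pointwise 0.5 threshold,
# yet it still incurs ranking loss against the correct step scored 0.9.
print(pairwise_margin_loss(0.9, 0.85))
print(pairwise_margin_loss(0.9, 0.2))
```

Under a loss of this shape, the model is penalized whenever a flawed step sits too close to a correct one, regardless of whether both absolute scores look high.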

2606.09077 2026-06-09 cs.LG New submission

Neural Legendre-Fenchel transform with Hessian Preconditioning

神经 Legendre-Fenchel 变换与 Hessian 预处理

Basile Plus-Gourdon, Frank Nielsen

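Affiliations: École Normale Supérieure Paris-Saclay; Sony Computer Science Laboratories Inc.

AI summary: Proposes a neural Legendre-Fenchel transform method with Hessian-based preconditioning, which uses an affine deformation to improve conjugate computation for ill-conditioned functions, raising convergence speed and numerical accuracy.

Comments 11 pages, 4 figures

AI Chinese abstract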

Legendre-Fenchel (LF) 变换是凸分析和机器学习中的基本工具,将下半连续函数映射到其凸共轭。在实践中,当给定函数的凸共轭没有闭式公式时,必须使用各种技术进行近似。最近一种通用的数值方法是深度 Legendre 变换方法,它依赖于神经网络,尽管在处理病态函数时仍然具有挑战性。本文基于 LF 变换作为射影对偶的重新表述。该框架的一个显著特性是仿射不变性。我们利用这种仿射不变性引入了一种基于 Hessian 的预处理策略。具体来说,我们在一个极小点附近应用仿射变形,使得函数的二阶泰勒近似与标准抛物面重合,其共轭映射是恒等映射。一个在恒等映射附近初始化的残差网络可以学习这个简化后的映射,而原始共轭映射通过逆变形恢复。所提出的预处理仅带来适度的计算开销,包括初始化时的一次特征分解和每次查询时的两次矩阵-向量乘法。在包括高维基准测试在内的多种凸函数上的实验表明,共轭的收敛速度和数值精度得到了提高,特别是在病态问题上效果显著。最后,我们讨论了所提出方法的适用范围,并指出了其若干局限性。

English abstract

The Legendre-Fenchel (LF) transform is a fundamental tool in convex analysis and machine learning that maps lower semi-continuous functions to their convex conjugates. In practice, when closed-form formulas are not available for the convex conjugates of given functions, they must be approximated numerically. One recent, versatile numerical method is the deep Legendre transform, which relies on neural networks but remains challenging to apply to ill-conditioned functions. This work builds on the reformulation of the LF transform as a projective polarity. A notable property of this framework is its affine invariance. We leverage this affine invariance to introduce a Hessian-based preconditioning strategy. Specifically, we apply an affine deformation around a minimizer so that the second-order Taylor approximation of the function coincides with the canonical paraboloid, whose conjugation map is the identity. A residual network initialized near the identity can then learn this simplified mapping, while the original conjugation map is recovered through the inverse deformation. The proposed preconditioning incurs only a modest computational overhead, consisting of a single eigendecomposition during initialization and two matrix-vector multiplications per query. Experiments on a diverse set of convex functions, including high-dimensional benchmarks, demonstrate improved convergence rates and enhanced numerical accuracy of the conjugation, with particularly significant gains for ill-conditioned problems. Finally, we discuss the scope of applicability of our proposed method and highlight several of its limitations.
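
The affine deformation can be made concrete with a short derivation using the standard rule for conjugates under an affine change of variables. The notation below ($x_0$, $H$, $Q$, $\Lambda$, $A$, $g$) is assumed for illustration and may differ from the paper's, and the specific choice $A = Q\Lambda^{-1/2}$ is one natural candidate rather than a confirmed detail.

```latex
% Assumed notation (not from the abstract): f convex, x_0 a minimizer,
% H = \nabla^2 f(x_0) = Q \Lambda Q^\top positive definite.
\begin{align*}
A &= Q \Lambda^{-1/2}, \qquad g(u) = f(A u + x_0), \\
\nabla^2 g(0) &= A^\top H A
   = \Lambda^{-1/2} Q^\top Q \Lambda Q^\top Q \Lambda^{-1/2} = I, \\
g^*(v) &= \sup_u \, \langle v, u \rangle - f(A u + x_0)
   = f^*(A^{-\top} v) - \langle A^{-\top} v, x_0 \rangle, \\
f^*(y) &= g^*(A^\top y) + \langle y, x_0 \rangle, \qquad
\nabla f^*(y) = A \, \nabla g^*(A^\top y) + x_0 .
\end{align*}
```

Since $\nabla g(0) = 0$ and $\nabla^2 g(0) = I$, the deformed function behaves like $f(x_0) + \tfrac{1}{2}\lVert u \rVert^2$ near the origin, so its gradient-level conjugation map is close to the identity. Recovering $\nabla f^*(y)$ then takes one product with $A^\top$ and one with $A$, which would be consistent with the two matrix-vector multiplications per query mentioned in the abstract.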

2606.09074 2026-06-09 cs.CV New submission

REFINE: Super-efficient 3D Gaussian Splatting Pruning via Rendering-Free Primitive Importance

REFINE: 通过无渲染的基元重要性实现超高效的3D高斯泼溅剪枝

Zhang Chen, Shuai Wan, Mengting Yu, Fuzheng Yang, Junhui Hou

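Affiliations: Northwestern Polytechnical University; Xidian University; City University of Hong Kong

AI summary: Proposes REFINE, which prunes 3D Gaussian splatting efficiently using a rendering-free primitive importance metric based on an analytically approximated Hessian field, cutting pruning-related computational complexity by 3,000× while preserving rendering quality.

AI Chinese abstract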

现有的3D高斯泼溅(3DGS)剪枝方法要么导致严重的质量下降,要么带来过高的计算开销。本文提出REFINE,一个高度加速的3DGS剪枝框架,其核心是一种新颖的无渲染基元重要性度量。我们的方法利用解析近似、渲染感知的Hessian场来量化移除单个基元所导致的预期感知误差。通过建模可见性、投影几何和内容自适应超参数的联合调制,我们完全绕过了昂贵的正向渲染过程,推导出一个各向异性的感知权重场,作为基元重要性的高保真代理。在多个基准数据集上的大量实验表明,REFINE在保持极具竞争力的渲染质量的同时,与最先进的剪枝方法相比,实现了前所未有的3000倍剪枝相关计算复杂度降低。

English abstract

Existing pruning methods for 3D Gaussian splatting (3DGS) suffer from either severe quality degradation or prohibitive computational overhead. In this paper, we propose REFINE, a highly accelerated 3DGS pruning framework centered on a novel rendering-free primitive importance metric. Our approach leverages an analytically approximated, rendering-aware Hessian field to quantify the expected perceptual error induced by the removal of individual primitives. By modeling the joint modulation of visibility, projection geometry and the content adaptive hyperparameter, we entirely bypass costly forward rendering passes and derive an anisotropic perceptual weight field that serves as a high-fidelity proxy for primitive importance. Extensive experiments across multiple benchmark datasets demonstrate that REFINE maintains highly competitive rendering quality while achieving an unprecedented $3,000\times$ reduction in pruning-related computational complexity compared to state-of-the-art pruning methods.

2606.09071 2026-06-09 cs.AI New submission

REFLECT: Intervention-Supported Error Attribution for Silent Failures in LLM Agent Traces

REFLECT: 针对LLM智能体轨迹中静默失败的干预支持错误归因

Xiaofeng Lin, Yingxu Wang, Tung Sum Thomas Kwok, Daniel Guo, Sahil Arun Nale, Charles Fleming, Guang Cheng

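Affiliations: University of California, Berkeley

AI summary: Proposes REFLECT, which diagnoses a candidate error step, tests it through controlled replay with a diagnosis-specific patch, and uses the verified outcome as contrastive evidence to refine the attribution, achieving the highest localization accuracy on four benchmarks.

AI Chinese abstract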

大型语言模型(LLM)智能体现在通过长时间的计划与执行轨迹来解决复杂任务,但在已完成轨迹中定位错误的能力仍然远远落后,尤其是在静默失败情况下。现有方法通过分类器或LLM评判器预测可疑步骤,或通过重试恢复正确答案,但都没有将干预结果反馈回来以细化归因本身。我们提出REFLECT方法,通过诊断候选错误步骤,使用诊断特定补丁进行受控重放测试,并利用经过验证的结果翻转作为对比证据来细化最终归因,从而弥合这一差距。在涵盖跨领域多跳推理的四个定位基准上,REFLECT在所有四个基准中均实现了使用相同审计器的方法中最高的定位准确率,在结构化工具使用轨迹上取得了最大增益,并且在无法获得真实答案时也能提供可操作的定位。

英文摘要

Large language model (LLM) agents now solve complex tasks through long plan-and-execution traces, yet the ability to locate errors in a completed trace still lags far behind, especially in the silent-failure regime. Existing approaches predict suspect steps via classifiers or LLM judges, or recover correct answers via retry, but none feed the intervention outcome back to refine the attribution itself. We propose REFLECT, a method that closes this gap by diagnosing a candidate error step, testing it through controlled replay with a diagnosis-specific patch, and using the verified outcome flip as contrastive evidence to refine the final attribution. Across four localization benchmarks spanning multi-hop reasoning across domains, REFLECT achieves the highest localization accuracy among same-auditor methods on all four benchmarks, with the largest gains on structured tool-use traces, while providing actionable localization even when ground-truth answers are unavailable.

2606.09068 2026-06-09 cs.CL 新提交

Emergent Misalignment Can Be Induced by Sycophancy and Reversed via Alignment Gating

由谄媚诱导的突现性失调可通过对齐门控逆转

Sicheng Wang, Xiangyang Zhu, Han Wang, Zongrui Wang, Yuan Tian, Kaiwei Zhang, Kaiyuan Ji, Qi Jia, Guangtao Zhai

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 发现谄媚微调(被动同意用户错误观点)可诱导广泛且严重的突现性失调,并提出对齐门控方法,通过插入可学习门控来识别并抑制不安全表示,从而高效逆转失调。

Comments Code is available at https://github.com/stay1to0/Sycophancy_Emergent_Misalignment_and_Gated_attention_FT

详情
AI中文摘要

先前研究表明,在狭窄领域对恶意或不正确输出进行微调会诱导广泛的失调和有害行为,这种现象称为突现性失调。然而,逆转此类失调的高效方法仍然有限。在这项工作中,我们做出两项贡献。首先,我们识别出谄媚微调,即训练模型被动同意用户的错误观点,是先前未被充分探索的突现性失调驱动因素,并证明它会诱导广泛且严重的失调行为。其次,我们提出对齐门控,一种高效逆转突现性失调的方法,该方法在微调期间向模型插入可学习和可控的门控。通过微调,这些门控学会识别负责不安全响应的内部表示。因此,放大或抑制这些表示会分别加剧或缓解突现性失调。我们进一步发现,对齐门控模块表现出强大的泛化能力:从狭窄领域微调获得的门控权重显著抑制了广泛领域的失调行为,同时保留了模型的通用能力。

英文摘要

Prior work has shown that fine-tuning large language models on malicious or incorrect outputs in narrow domains can induce broad misalignment and harmful behavior, a phenomenon known as emergent misalignment. However, efficient methods for reversing such misalignment remain limited. In this work, we make two contributions. First, we identify sycophancy fine-tuning, i.e., training models to passively agree with users' incorrect opinions, as a previously underexplored driver of emergent misalignment, and show that it induces broad and severe misaligned behavior. Second, we propose Alignment Gating, an efficient method for reversing emergent misalignment that inserts learnable and controllable gates into the model during fine-tuning. Through fine-tuning, these gates learn to identify the internal representations responsible for unsafe responses. Amplifying or suppressing these representations therefore exacerbates or mitigates emergent misalignment, respectively. We further find that the Alignment Gating module exhibits strong generalization: gating weights obtained from narrow-domain fine-tuning substantially suppress broad-domain misaligned behavior while preserving the model's general capabilities.

2606.09065 2026-06-09 cs.LG cs.AI 新提交

OnlyDense: Reduced-Order Modeling for Lagrangian simulation

OnlyDense: 拉格朗日模拟的降阶建模

Tu Do, Shannon Ryan, Santu Rana

发表机构 * Deakin University(迪肯大学)

AI总结 提出一种将粒子系统状态视为希尔伯特空间中的函数、用学习到的神经基函数线性子空间近似状态空间的降阶建模框架,实现大规模拉格朗日模拟的高效表示与预测,在百万粒子SPH模拟中R²>0.99。

详情
AI中文摘要

在科学和工程中,拉格朗日模拟方法如光滑粒子流体动力学(SPH)或物质点法(MPM)常被用于研究动态系统的行为。然而,这些方法的计算成本可能高得令人望而却步,特别是在模拟多尺度空间或时间现象时,例如宏观几何中的空洞生长和合并、空间碎片颗粒超高速撞击导致的航天器部件结构失效等。与将系统状态理解为离散粒子集合的基于图的方法不同,我们提出了一种学习框架,通过将系统状态视为函数、将其演化视为希尔伯特空间中的轨迹,实现对大规模粒子系统的可扩展表示和动力学建模。我们不将状态表示为离散粒子集或嵌入非线性潜在流形,而是用学习到的神经基函数张成的线性子空间近似状态空间。这种参数化使得可以直接投影获得潜在系数,并显式访问基函数,避免了在非线性潜在空间上的优化。由此得到的表示具有自然的解释:潜在变量对应于希尔伯特空间中的系数,基函数对应于空间模态,类似于本征正交分解。因此,该框架将经典的基于投影的降阶建模与现代深度学习统一起来,同时保持对离散化点数量的不变性。在超过一百万个粒子的大规模SPH模拟(包括具有极端变形和破碎的动态事件)上的实验表明,所提出的方法能够准确重建和预测动力学,仅用32个基函数即可达到超过0.99的R²分数。

英文摘要

In science and engineering, Lagrangian simulation methods such as Smoothed Particle Hydrodynamics (SPH) or the Material Point Method (MPM) are often employed to study the behavior of dynamic systems. However, these methods can be prohibitively computationally expensive, particularly when simulating multi-scale spatial or temporal phenomena, e.g., void growth and coalescence within macro-scale geometries or structural failure of spacecraft components resulting from hypervelocity impact of space debris particles. In contrast to graph-based methods, where the state of the system is understood as a discrete set of particles, we propose a learning framework for scalable representation and dynamics modeling of massive particle systems by treating the system state as a function and its evolution as a trajectory in Hilbert space. Rather than representing the state as a discrete set of particles or embedding it in a nonlinear latent manifold, we approximate the state space with a linear subspace spanned by learned neural basis functions. This parameterization enables direct projection to obtain latent coefficients and explicit access to the basis functions, avoiding optimization over a nonlinear latent space. The resulting representation admits a natural interpretation: latent variables correspond to coefficients in Hilbert space, and basis functions correspond to spatial modes, analogous to Proper Orthogonal Decomposition. The framework thus unifies classical projection-based reduced-order modeling with modern deep learning, while remaining invariant to the number of discretization points. Experiments on large-scale SPH simulations with over one million particles, including dynamic events with extreme deformation and fragmentation, demonstrate that the proposed method accurately reconstructs and predicts dynamics, achieving an $R^2$ score above $0.99$ with as few as $32$ basis functions.
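
The projection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: fixed cosine modes stand in for the learned neural basis functions, and the names `basis`, `project`, `reconstruct`, and `field` are hypothetical. The sketch only shows that least-squares projection onto a linear subspace yields latent coefficients directly, and that those coefficients do not depend on how many sample points discretize the state.

```python
import math

def basis(x, k):
    """Hypothetical stand-in for a learned basis function: k-th cosine mode on [0, 1]."""
    return math.cos(math.pi * k * x)

def solve(A, b):
    """Solve the small dense system A c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = s / M[i][i]
    return coeffs

def project(xs, ys, num_modes):
    """Least-squares coefficients of the samples (xs, ys) in the span of the basis."""
    Phi = [[basis(x, k) for k in range(num_modes)] for x in xs]
    gram = [[sum(row[i] * row[j] for row in Phi) for j in range(num_modes)]
            for i in range(num_modes)]
    rhs = [sum(row[i] * y for row, y in zip(Phi, ys)) for i in range(num_modes)]
    return solve(gram, rhs)

def reconstruct(xs, coeffs):
    """Evaluate the reduced-order state at arbitrary query points."""
    return [sum(c * basis(x, k) for k, c in enumerate(coeffs)) for x in xs]

def r2_score(ys, preds):
    mean = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

def field(x):
    """Toy state function, sampled wherever the particles happen to be."""
    return 1.0 + 0.5 * math.cos(math.pi * x) - 0.25 * math.cos(3 * math.pi * x)

# The same state sampled at two different resolutions.
coarse_x = [i / 49 for i in range(50)]
fine_x = [i / 499 for i in range(500)]
coarse_c = project(coarse_x, [field(x) for x in coarse_x], num_modes=4)
fine_c = project(fine_x, [field(x) for x in fine_x], num_modes=4)
```

Because the toy field lies in the span of the four modes, both resolutions recover the same coefficients, and coefficients fitted on the coarse samples reconstruct the fine samples.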

2606.09064 2026-06-09 cs.CV cs.AI 新提交

See More, Think Deeper: Query-Expanded Visual Evidence and Answer-Clue Guided Reflection for Long Video Understanding

看得更多,思考更深:面向长视频理解的查询扩展视觉证据与答案线索引导反思

Shuning Wang, Zhiheng Wu, YiNuo Lu, Naiming Liu, Chen Jia, Bowen Liu, Shuo Nie, Weijie Zhu, Yumeng Zhang

发表机构 * Baidu Inc.(百度公司) Harbin Institute of Technology(哈尔滨工业大学) Hong Kong University of Science and Technology(香港科技大学)

AI总结 提出CoVER框架,通过动态收集查询扩展视觉证据和答案特定视觉反馈验证草稿答案,实现从答案中心生成到证据中心和视觉可验证推理的转变,在长视频理解任务上超越同规模模型及部分闭源模型。

详情
AI中文摘要

近期视频大语言模型(Video-LLMs)的进展使得长视频理解任务成为可能。然而,现有方法仍面临两个关键限制:证据获取通常依赖单一搜索意图,且答案生成缺乏有效的视觉反馈机制。为解决这些限制,我们提出了CoVER,一个用于长视频理解的综合视觉证据与反思框架。CoVER使Video-LLMs能够通过动态收集查询扩展视觉证据来“看得更多”,并通过使用有效的答案特定视觉反馈验证草稿答案来“思考更深”。这些机制共同将长视频理解从以答案为中心的生成转变为以证据为中心且可视觉验证的推理。实验结果表明,CoVER-7B在相同参数规模下显著优于其他模型,甚至在特定指标上超越了最先进的闭源模型。

英文摘要

Recent advances in Video Large Language Models (Video-LLMs) have made long-video understanding tasks feasible. However, existing methods still face two key limitations: evidence acquisition often relies on a single search intent, and answer generation lacks an effective visual feedback mechanism. To address these limitations, we propose CoVER, a Comprehensive Visual Evidence and Reflection framework for long-video understanding. CoVER enables Video-LLMs to "See More" by dynamically gathering query-expanded visual evidence, and to "Think Deeper" by verifying draft answers with effective answer-specific visual feedback. Together, these mechanisms shift long-video understanding from answer-centric generation to evidence-centric and visually verifiable reasoning. Experimental results show that CoVER-7B substantially outperforms models with the same parameter scale and even surpasses state-of-the-art closed-source models on certain metrics.

2606.09059 2026-06-09 cs.LG cs.AI cs.CV 新提交

Stage-1 Controls the Entropy Regime, Not the Outcome

Stage-1 控制熵状态,而非最终结果

Jianxiong Shen

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文通过小数据实验研究两阶段后训练中Stage-1(SFT或OPD)的作用,发现其主要影响策略熵状态,但对最终性能影响有限。

详情
AI中文摘要

两阶段后训练——Stage-1 热启动(监督微调 SFT 或在线策略蒸馏 OPD)后接 Stage-2 强化学习(RL)——越来越多地用于视觉语言模型(VLM)。我们使用 Qwen2.5-VL-7B 和同模态 72B VLM 教师进行 OPD,在小数据研究中探究 Stage-1 实际控制什么。首先,三种热启动在 Geometry3K 内部验证集上达到狭窄的 53%–54% 区间,与近期专门方法报告的窄范围一致;该设置几乎没有证据表明 Stage-1 改变了域内终点。其次,匹配配方、早停的 SFT 在域外 MathVista 上提升了 +2.1 点,逆转了过训练变体的 -9.5 点下降。最明显的区别是熵状态:OPD 进入 RL 时的策略熵显著高于任一 SFT 初始化,且这种分离在可用轨迹中持续可见。在域内初始化时,OPD 还具有更高的答案多样性和 pass@16(比 SFT 高 +2.0 到 +5.2 点),尽管问题级自举区间显示较小的对比具有不确定性。RL 后优势消失(终点 pass@16 值在 1.1 点以内),在 MathVista 上也是如此(六个模型在 1.2 点以内)。因此,我们的贡献是一个有界的实证刻画:在此设置中,Stage-1 与熵状态强相关,但下游收益小、局部化,且不能证明 OPD 是更好的 RL 热启动。

英文摘要

Two-stage post-training, in which a Stage-1 warm-start (supervised fine-tuning, SFT, or on-policy distillation, OPD) is followed by Stage-2 reinforcement learning (RL), is increasingly used for vision-language models (VLMs). We ask what Stage-1 actually controls in a small-data study using Qwen2.5-VL-7B with a same-modality 72B VLM teacher for OPD. First, the three warm-starts reach a narrow $53$ to $54\%$ band on Geometry3K internal validation, consistent with the narrow range reported by recent specialized methods; this setup provides little evidence that Stage-1 changes the in-domain endpoint. Second, a matched-recipe, early-stopped SFT improves out-of-domain MathVista by $+2.1$ points, reversing the $-9.5$-point drop of an over-trained variant. The clearest difference is the entropy regime: OPD enters RL with substantially higher policy entropy than either SFT initialization, and the separation remains visible through the available trajectories. At the in-domain initialization, OPD also has higher answer diversity and pass@16 ($+2.0$ to $+5.2$ points over SFT), although problem-level bootstrap intervals show that the smaller contrast is uncertain. The advantage is absent after RL (endpoint pass@16 values within $1.1$ points) and on MathVista (six models within $1.2$ points). Our contribution is therefore a bounded empirical characterization: Stage-1 is strongly associated with the entropy regime in this setup, but the downstream payoff is small, localized, and not evidence that OPD is a better RL warm-start.
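
The two quantities this abstract compares, pass@16 and policy entropy, can be sketched as follows. The abstract does not say how either was computed, so this is an assumption: the sketch uses the widely used unbiased pass@k estimator, $1 - \binom{n-c}{k} / \binom{n}{k}$ for $n$ samples of which $c$ are correct, and the Shannon entropy of a single next-token distribution. The function names are illustrative only.

```python
from math import comb, log

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct (assumed estimator)."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * log(p) for p in probs if p > 0)

# A peaked policy has lower entropy than a spread-out one.
low = token_entropy([0.97, 0.01, 0.01, 0.01])
high = token_entropy([0.25, 0.25, 0.25, 0.25])
```

With $n = k = 16$ the estimator reduces to "at least one of the 16 samples is correct", which is why higher answer diversity at initialization can raise pass@16 without changing single-sample accuracy.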

2606.09056 2026-06-09 cs.CV cs.LG 新提交

MilliVid: Hierarchical Latents for Long-Range Consistency in Video Generation

MilliVid: 用于视频生成中长程一致性的分层潜变量

Ishaan Preetam Chandratreya, David Charatan, Basile Van Hoorick, Sergey Zakharov, Vitor Guizilini, Phillip Isola, Vincent Sitzmann

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Toyota Research Institute(丰田研究所)

AI总结 提出一种多尺度token空间的粗到细展开方法,通过预训练层次化自编码器压缩帧为多层token,并训练视频扩散模型生成这些token,在保持几何和物体持久性长程一致性的同时降低计算开销。

Comments Ishaan Preetam Chandratreya and David Charatan contributed equally. Project page: https://davidcharatan.com/millivid/

详情
AI中文摘要

视频生成模型已变得日益强大,但长程一致性仍然难以实现,因为即使只有几十帧也需要不切实际的长Transformer序列长度。我们表明,通过在多尺度token空间内使用粗到细展开生成视频,可以缓解这一问题。我们的方法很简单:首先,预训练一个自编码器,将每一帧压缩成一个token层次结构,层级范围从典型的潜变量分辨率到每帧仅几个token。最粗糙的层级捕获最重要的信息,如场景布局和语义,而更细的层级添加高频外观和纹理。然后,我们训练一个视频扩散模型,使用粗到细展开生成这些token。通过仔细控制在每个展开步骤中生成帧并用作上下文的细节级别,我们能够保持几何和物体持久性的长程一致性,同时在感知上不太相关的细节的长程一致性上花费更少的计算。我们使用一个自定义的长Minecraft视频数据集验证了这种方法,与现有基线相比,它产生了明显更一致的展开结果。

英文摘要

Video generative models have become increasingly powerful, but long-range consistency remains challenging to achieve because even a few dozen frames require impractically long transformer sequence lengths. We show that this issue can be mitigated by generating video using coarse-to-fine rollout within a multi-scale token space. Our approach is simple: first, we pre-train an autoencoder that compresses each frame into a hierarchy of tokens, with levels ranging from the typical latent resolution to only a handful of tokens per frame. The coarsest levels capture the most consequential information, such as scene layout and semantics, while finer levels add high-frequency appearance and texture. Then, we train a video diffusion model to generate these tokens using coarse-to-fine rollout. By carefully controlling the level of detail at which frames are generated and used as context during each rollout step, we are able to preserve long-range consistency in geometry and object permanence while spending less compute on the long-range consistency of less perceptually relevant details. We validate this approach using a custom dataset of long Minecraft videos, where it produces substantially more consistent rollouts compared to existing baselines.

2606.09052 2026-06-09 cs.LG cs.AI cs.CL cs.GT stat.ML 新提交

INFUSER: Influence-Guided Self-Evolution Improves Reasoning

INFUSER: 影响力引导的自我进化提升推理能力

Siyu Chen, Miao Lu, Beining Wu, Heejune Sheen, Fengzhuo Zhang, Shuangning Li, Zhiyuan Li, Jose Blanchet, Tianhao Wang, Zhuoran Yang

发表机构 * Yale University(耶鲁大学) Stanford University(斯坦福大学) University of Chicago(芝加哥大学) Toyota Technological Institute at Chicago(芝加哥丰田技术研究所) University of California, San Diego(加州大学圣地亚哥分校)

AI总结 提出INFUSER框架,通过生成器与求解器的协同进化,利用影响力分数和DuGRPO优化,从文档池中自适应生成训练数据,显著提升模型推理性能。

Comments 66 pages, 17 figures

详情
AI中文摘要

自我进化为更强的推理提供了一条可扩展的路径:预训练语言模型仅需极少的外部监督即可自我改进。然而,现有方法要么依赖于大量精心策划或教师生成的训练数据,要么在生成器无监督运行时,使用未必能改进求解器的难度启发式方法对其进行奖励。我们引入了INFUSER,一个迭代协同训练框架,包含两个共同进化的角色:一个生成器,从自动收集的非结构化文档池中起草问题并参考标准答案;一个求解器,通过在这些数据上训练来改进。求解器使用标准正确性奖励(针对生成器提供的答案)进行训练,而生成器则通过一种优化器感知的影响力分数获得奖励,该分数衡量每个提出的问题是否真正能改进求解器在目标分布上的表现。由于这种连续、有噪声的影响力分数不适合标准的GRPO,我们提出了DuGRPO,一种GRPO的双归一化变体,用于生成器训练。这些设计共同将文档池转化为一个自适应课程,倾向于对当前求解器有用的问题,而不仅仅是困难的问题。在Qwen3-8B-Base上,INFUSER在Olympiad和SuperGPQA基准测试中相对于强自我进化基线取得了超过20%的相对改进,并且一个8B的INFUSER协同进化生成器在数学和编程任务上优于冻结的32B思考生成器。消融实验证实了每个设计选择的必要性,两个扩展——将INFUSER应用于指令微调锚点并辅以规则可验证的RLVR数据——进一步展示了该框架的灵活性和泛化能力。代码可在https://github.com/FFishy-git/INFUSER获取。

英文摘要

Self-evolution offers a scalable path to stronger reasoning: a pretrained language model improves itself with only minimal external supervision. Yet existing methods either depend on extensively curated or teacher-generated training data, or, when the generator runs unsupervised, reward it by a difficulty heuristic that need not improve the solver. We introduce INFUSER, an iterative co-training framework with two co-evolving roles: a Generator that drafts questions and reference golden answers from a pool of unstructured, automatically collected documents, and a Solver that improves by training on them. The solver is trained with standard correctness rewards against the generator-provided answers, while the generator is rewarded by an optimizer-aware influence score that measures whether each proposed question would actually improve the solver on the target distribution. Because this continuous, noisy influence score is poorly served by standard GRPO, we propose DuGRPO, a dual-normalized variant of GRPO, for generator training. Together, these turn the document pool into an adaptive curriculum that favors questions useful to the current solver, not just hard ones. On Qwen3-8B-Base, INFUSER outperforms strong self-evolution baselines with over 20% relative improvement on Olympiad and SuperGPQA benchmarks, and an 8B INFUSER co-evolving generator outperforms a frozen 32B thinking generator on math and coding. Ablations confirm each design choice is necessary, and two extensions, applying INFUSER to an instruction-finetuned anchor and augmenting it with rule-verifiable RLVR data, further demonstrate the flexibility and generalizability of the framework. Code is available at https://github.com/FFishy-git/INFUSER.

2606.09051 2026-06-09 cs.LG 新提交

Beyond Convolution: Advancing Hypergraph Neural Networks with Hypergraph U-Nets

超越卷积:用超图U-Net推进超图神经网络

Fuli Wang, Wei Qian, Daniel L. Lau, Gonzalo R. Arce

发表机构 * Institute for Financial Services Analytics, University of Delaware(特拉华大学金融服务分析研究所) Department of Applied Economics and Statistics, University of Delaware(特拉华大学应用经济学与统计学系) Department of Electrical and Computer Engineering, University of Kentucky(肯塔基大学电气与计算机工程系) Department of Electrical and Computer Engineering, University of Delaware(特拉华大学电气与计算机工程系)

AI总结 提出并行层次池化和反池化算子,构建首个超图U-Net架构,在分类、重构和异常检测任务上超越现有方法。

详情
AI中文摘要

卷积已成功从图像处理过渡到非欧几里得高阶域的复杂领域,特别是在超图中。尽管卷积取得了成功,但由于缺乏定义良好的池化和反池化操作,一种名为U-Net的流行架构在超图数据上的探索仍然很少。本工作开创性地研究了超图数据的U-Net架构,解决了设计有效池化和反池化操作的关键挑战,这些操作能保留输入超图的最大结构信息。受层次聚类启发,我们提出通过在不同粒度上切割聚类树状图来一次性构建池化和反池化算子,称为并行层次池化(PHPool)和反池化(PHUnpool)算子。与现有通过顺序学习过程可能造成局部结构损坏的池化方法不同,我们的PHPool算子以全局并行方式设计,确保对原始超图结构的保真度和高效计算,而PHUnpool算子则专门设计为执行PHPool的逆操作以进行超图重构。我们通过超图重构模拟、超图分类和节点级异常检测验证了我们的模型,在这些任务中,它表现出优于现有最先进的图和超图深度学习方法的性能。

英文摘要

Convolutions have successfully transitioned from image processing to the complex realm of non-Euclidean higher-order domains, particularly hypergraphs. Despite this success, the popular U-Net architecture remains largely unexplored for hypergraph data due to the lack of well-defined pooling and unpooling operations. This work pioneers the study of U-Net architectures for hypergraph data, addressing the critical challenge of designing effective pooling and unpooling operations that retain maximal structural information from the input hypergraph. Motivated by hierarchical clustering, we propose to construct the pooling and unpooling operators all at once by cutting the clustering dendrogram at different granularities, yielding the Parallel Hierarchical Pooling (PHPool) and Unpooling (PHUnpool) operators. Unlike existing pooling methods that risk local structural damage through a sequential learning procedure, our PHPool operators are designed in a global and parallel manner to ensure fidelity to the original hypergraph structure with efficient computation, while the PHUnpool operators are tailored to perform the inverse operations of the PHPool operators for hypergraph reconstruction. We validate our model through hypergraph reconstruction simulation, hypergraph classification, and node-level anomaly detection, where it demonstrates superior performance over existing state-of-the-art graph and hypergraph deep learning methods.

2606.09046 2026-06-09 cs.LG cs.CL cs.IR 新提交

Decoy-Calibrated Failure Audits for Language Models

语言模型的诱饵校准失败审计

Vyzantinos Repantis, Ameya Gawde, Harshvardhan Singh

发表机构 * Meta Platforms(Meta平台)

AI总结 提出Janus程序,通过诱饵校准和留出数据验证,判断语言模型错误解释的可信度,避免选择偏差。

Comments 14 pages, 5 figures, 4 tables

详情
AI中文摘要

有用的审计不仅揭示模型失败的频率,还揭示失败集中在何处。审计员可能测试许多候选解释:长输入、间接问题、分散注意力的证据或这些因素的组合。风险在于选择。观察到的最大效应可能反映真实的失败模式,也可能只是多次尝试中的最佳结果。我们提出Janus,一种决定何时提出的错误解释足够可信以报告的程序。目标不是生成新解释,而是决定哪些解释站得住脚。审计员从固定的模型、标记的评估集和冻结的候选解释列表(我们称之为描述符)开始。Janus通过错误率提升对每个描述符进行评分,然后将真实描述符与具有相同频率但随机分配给示例的虚假描述符进行比较。只有当描述符在用于发现的数据上击败这个诱饵基准,然后在单独的留出数据上重复时,它才被确认。在多表查找任务的受控审计中,Janus识别出植入的失败,确认了长链描述符及其交互。LLM通常在查找链中途停止,而不是到达最终答案。在两个公共基准MuSiQue和LongBench v2上,SliceLine基线标记了看似高错误的区域,但Janus没有确认任何一个。消融实验显示了为什么两个保障措施都很重要。在LongBench v2上,未校准的固定阈值报告了20个描述符,诱饵基准留下一个,而留出检查在其提升从0.36缩小到0.05后拒绝了最后一个。由此产生的原则将提出解释与报告解释分开。候选解释可能来自任何来源,但只有那些击败诱饵并在新数据上复现的才成为审计发现。

英文摘要

Useful audits reveal not only how often a model fails, but also where its failures concentrate. An auditor may test many candidate explanations: long inputs, indirect questions, distracting evidence, or combinations of these factors. The risk is selection. The largest observed effect may reflect a real failure mode, or it may simply be the best result among many tried. We introduce Janus, a procedure for deciding when a proposed error explanation is credible enough to report. The goal is not to generate new explanations, but to decide which ones hold up. The auditor starts with a fixed model, a labeled evaluation set, and a frozen list of candidate explanations, which we call descriptors. Janus scores each descriptor by its error-rate lift, then compares real descriptors with fake ones that have the same frequencies but are randomly assigned to examples. A descriptor is confirmed only if it beats this decoy floor on the data used for discovery and then repeats on separate held-out data. In a controlled audit of multi-table lookup tasks, Janus identifies the planted failure, confirming long-chain descriptors and their interactions. The LLM often stops partway through the lookup chain instead of reaching the final answer. On two public benchmarks, MuSiQue and LongBench v2, the SliceLine baseline flags plausible high-error pockets, but Janus confirms none of them. Ablations show why both safeguards matter. On LongBench v2, an uncalibrated fixed threshold reports 20 descriptors, the decoy floor leaves one, and the holdout check rejects the last one after its lift shrinks from 0.36 to 0.05. The resulting principle separates proposing explanations from reporting them. Candidates may come from any source, but only those that beat decoys and replicate on fresh data become audit findings.
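
The two safeguards described above, a decoy floor and a holdout check, can be sketched on toy data. This is not the Janus implementation. It assumes that lift is the slice error rate minus the overall error rate, that the decoy floor is the largest lift among random slices of the same size, and that a descriptor replicates when its holdout lift also exceeds that floor. All names and the synthetic data are hypothetical.

```python
import random

def lift(errors, members):
    """Error rate inside the slice minus the overall error rate (assumed definition)."""
    overall = sum(errors) / len(errors)
    inside = [e for e, m in zip(errors, members) if m]
    return sum(inside) / len(inside) - overall

def decoy_floor(errors, slice_size, num_decoys, rng):
    """Largest lift reached by random slices with the same frequency as the descriptor."""
    n = len(errors)
    best = float("-inf")
    for _ in range(num_decoys):
        chosen = set(rng.sample(range(n), slice_size))
        best = max(best, lift(errors, [i in chosen for i in range(n)]))
    return best

def audit(discovery, holdout, descriptors, num_decoys=200, seed=0):
    """Confirm a descriptor only if it beats the decoy floor and repeats on held-out data."""
    rng = random.Random(seed)
    confirmed = []
    for name in descriptors:
        d_err = [ex["error"] for ex in discovery]
        d_mem = [ex[name] for ex in discovery]
        floor = decoy_floor(d_err, sum(d_mem), num_decoys, rng)
        if lift(d_err, d_mem) <= floor:
            continue  # no better than the best of many random tries
        h_err = [ex["error"] for ex in holdout]
        h_mem = [ex[name] for ex in holdout]
        if lift(h_err, h_mem) > floor:
            confirmed.append(name)
    return confirmed

def make_example(i):
    """Synthetic example: long chains fail 70% of the time, everything else 10%."""
    group = i // 8
    long_chain = (i // 2) % 4 == 0
    if long_chain:
        error = 1 if group % 10 < 7 else 0
    else:
        error = 1 if group % 10 == 9 else 0
    return {"long_chain": long_chain, "indirect_question": i % 8 in (2, 3), "error": error}

examples = [make_example(i) for i in range(400)]
discovery, holdout = examples[0::2], examples[1::2]
confirmed = audit(discovery, holdout, ["long_chain", "indirect_question"])
```

In this toy setup the planted `long_chain` descriptor has a lift of 0.45 on both splits and is confirmed, while `indirect_question` has a negative lift and never clears the decoy floor.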

2606.09043 2026-06-09 cs.LG cs.CL 新提交

DynaCF: Mitigating Shortcut Learning in Reward Models via Dynamic Counterfactual Sensitivity

DynaCF: 通过动态反事实敏感性缓解奖励模型中的捷径学习

Fengyuan Liu, Yongliang Miao, Zirui He, Yanguang Liu, Fei Sun, Mengnan Du

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) New Jersey Institute of Technology(新泽西理工学院) Institute of Computing Technology, CAS(中国科学院计算技术研究所)

AI总结 提出DynaCF框架,通过在线测量反事实扰动下的边际变化和偏好翻转来动态降低捷径敏感样本的权重,从而缓解奖励模型中的捷径学习问题。

详情
AI中文摘要

从成对偏好中训练的奖励模型往往利用表面的捷径线索而非学习真正的响应质量。我们提出DynaCF,一个用于缓解奖励模型训练中捷径学习的动态重加权框架。与静态捷径启发式方法不同,DynaCF在优化过程中通过应用保持语义的反事实扰动并跟踪当前模型下产生的边际变化和偏好翻转,在线测量捷径敏感性。在Bradley-Terry目标中,具有较高捷径敏感性的样本被动态降低权重,鼓励模型较少依赖表面模式,更多依赖任务相关的偏好信号。大量实验表明,DynaCF在偏好建模中持续提高了鲁棒性。

英文摘要

Reward models trained from pairwise preferences often exploit superficial shortcut cues rather than learning true response quality. We propose DynaCF, a dynamic reweighting framework for mitigating shortcut learning in reward model training. Unlike static shortcut heuristics, DynaCF measures shortcut sensitivity online during optimization by applying semantics-preserving counterfactual perturbations and tracking the resulting margin shifts and preference flips under the current model. Samples with higher shortcut sensitivity are dynamically downweighted in the Bradley-Terry objective, encouraging the model to rely less on superficial patterns and more on task-relevant preference signals. Extensive experiments show that DynaCF consistently improves robustness in preference modeling.
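
A minimal sketch of sensitivity-based downweighting in a Bradley-Terry objective follows. The abstract does not give the weighting formula, so the sensitivity score (absolute margin shift plus a flip penalty) and the exponential weighting are assumptions, and the function names are hypothetical. Each pair is described by its reward margin on the original responses and its margin after a semantics-preserving perturbation.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def shortcut_sensitivity(margin, cf_margin, flip_penalty=1.0):
    """Assumed score: margin shift under the perturbation, plus a penalty if the preference flips."""
    shift = abs(margin - cf_margin)
    flipped = 1.0 if margin * cf_margin < 0 else 0.0
    return shift + flip_penalty * flipped

def weighted_bt_loss(pairs, temperature=1.0):
    """Bradley-Terry loss with per-sample weights that shrink as sensitivity grows.

    pairs: list of (margin, cf_margin), where margin = r(chosen) - r(rejected).
    """
    weights = [math.exp(-shortcut_sensitivity(m, cf) / temperature) for m, cf in pairs]
    total = sum(weights)
    loss = -sum(w * log_sigmoid(m) for w, (m, _) in zip(weights, pairs)) / total
    return loss, weights

# First pair is stable under perturbation; second pair flips its preference.
pairs = [(2.0, 1.9), (2.0, -1.0)]
loss, weights = weighted_bt_loss(pairs)
```

In a training loop the margins would be recomputed under the current model at each step, which is what makes the weighting dynamic rather than a fixed heuristic.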

2606.09038 2026-06-09 cs.AI 新提交

Personalization Meets Safety: Mechanisms, Risks, and Mitigations in Personalized LLMs

个性化与安全的交汇:个性化大语言模型中的机制、风险与缓解措施

Yanyan Luo, Xue Han, Ruiqiao Bai, Xin Huang, Yitong Wang, Qian Hu, Qing Wang, Chunxu Zhao, Jie Liu, Cong Geng, Lehao Xing, Pengwei Hu, Junlan Feng

发表机构 * China Mobile Jiutian Artificial Intelligence Technology (Beijing) Co., Ltd.(中国移动九天人工智能技术(北京)有限公司) Chinese Academy of Sciences(中国科学院)

AI总结 本文首次对个性化大语言模型进行安全导向的综述,从用户表征、个性化范式和评估三个维度组织,提出统一的安全风险分类,并分析各范式下的脆弱性及缓解策略。

详情
AI中文摘要

大语言模型通过适应用户偏好、上下文和长期历史记录,实现了日益个性化的交互。然而,实现个性化的机制也以现有文献未系统处理的方式扩展了安全领域。现有综述通常只关注个性化或安全,而忽略了它们的交叉。我们提出了首个全面的、安全导向的个性化大语言模型综述。我们沿三个维度组织个性化——用户表征、个性化范式和评估——并引入统一的安全风险分类。在表征层面,我们分析了不同用户表征带来的风险。在主流个性化范式中,我们描述了提示、检索增强、参数微调、强化学习、混合专家、剪枝、智能体框架和多模态个性化中固有的脆弱性,并综合了模型生命周期中的缓解策略。除了这些细粒度风险,我们还描述了由个性化适应产生的范式无关的安全风险。我们进一步总结了个性化数据集和评估方法。通过OpenClaw的案例研究,我们分析了个性化智能体生态系统中的部署趋势。我们的分析揭示了现有研究中的三个结构性不足:安全被评估为与用户无关而非关系性的,个性化技术被孤立分析而非组合分析,评估框架无法捕捉新兴的长期风险。通过联合检查个性化表征、个性化范式、安全风险、防御和评估方法,我们为开发安全的个性化大语言模型提供了一个统一框架,并强调了未来研究的关键方向。

英文摘要

Large Language Models (LLMs) have enabled increasingly personalized interactions by adapting to users' preferences, contexts, and long-term histories. However, the mechanisms that enable personalization also expand the safety landscape in ways not systematically addressed by existing literature. Existing reviews typically focus on either personalization or safety, leaving their intersection largely unexplored. We present the first comprehensive, safety-aware review of personalized LLMs. We organize personalization along three dimensions (user representation, personalization paradigm, and evaluation) and introduce a unified taxonomy of safety risks. At the representation level, we analyze risks arising from diverse user representations. Across mainstream personalization paradigms, we delineate vulnerabilities inherent to prompting, retrieval augmentation, parameter fine-tuning, reinforcement learning, Mixture-of-Experts (MoE), pruning, agent frameworks, and multimodal personalization, and synthesize mitigation strategies across the model lifecycle. Beyond these fine-grained risks, we characterize paradigm-agnostic safety risks arising from personalized adaptation. We further summarize personalized datasets and evaluation methodologies. Through a case study of OpenClaw, we analyze deployment trends in personalized agent ecosystems. Our analysis reveals three structural inadequacies in existing research: safety is evaluated as user-invariant rather than relational, personalization techniques are analyzed in isolation rather than in composition, and evaluation frameworks cannot capture emergent long-term risks. By jointly examining personalized representations, personalization paradigms, safety risks, defenses, and evaluation methods, we provide a unified framework for developing safe personalized LLMs and highlight key directions for future research.

2606.09037 2026-06-09 cs.AI cs.MA 新提交

A Multi-Agent System for IPMSM Design Optimization via an FEA-AI Hybrid Approach

基于FEA-AI混合方法的IPMSM设计优化多智能体系统

Jinseong Han, Sunwoong Yang, Namwoo Kang

发表机构 * Cho Chun Shik Graduate School of Mobility, KAIST(KAIST Cho Chun Shik 移动研究生院) Department of Mechanical Engineering, Hanyang University(汉阳大学机械工程系) Narnia Labs

AI总结 提出一种端到端自动化IPMSM设计优化框架,通过RAG结构化问题定义与不确定性感知的FEA-AI混合优化流水线,平衡计算成本与预测可靠性,在同等FEA预算下优于纯FEA或纯AI方法。

Comments 26 pages, 21 figures

详情
AI中文摘要

内置永磁同步电机(IPMSM)设计需要平衡相互冲突的目标和多物理场约束,而现代优化工作流程面临三个瓶颈:手动问题设置、高有限元分析(FEA)成本以及在稀疏或分布外区域中不可靠的基于代理的搜索。为了解决这些限制,我们提出了一种端到端的自动化IPMSM设计优化框架,该框架将检索增强生成(RAG)用于结构化问题定义,与不确定性感知的FEA-AI混合优化流水线相结合。一个通过RAG连接到电机教科书的设计代理提供基于领域知识的选项和工程技巧,并编译优化卡和用于AI模型训练的试验设计计划。训练代理自动化电磁FEA,记录几何验证和求解器失败日志,使用基于方差分析的数据分析和LLM推理分析失败的几何形状,并调用设计采样代理重新定义设计空间并生成额外样本。优化代理执行基于遗传算法的搜索,具有不确定性驱动的切换:低不确定性候选由AI代理推理评估,而高不确定性和可靠性关键的帕累托前沿或前K候选由高保真FEA校正并用于迭代重训练。该框架将手动、依赖经验的配置转换为可重复的工作流程,平衡计算成本和预测可靠性。在匹配的高保真FEA预算下的实验结果表明,所提出的混合方法实现了更好的目标性能,同时保持低且可进一步降低的预测不确定性,优于受早期预算耗尽限制的纯FEA搜索和收敛到低置信度最优的纯AI搜索。

英文摘要

Interior permanent magnet synchronous motor (IPMSM) design requires balancing conflicting objectives and multi-physics constraints, while modern optimization workflows face three bottlenecks: manual problem setup, high finite element analysis (FEA) cost, and unreliable surrogate-based search in sparse or out-of-distribution regions. To address these limitations, we propose an end-to-end automated IPMSM design optimization framework that integrates retrieval-augmented generation (RAG) for structured problem definition with an uncertainty-aware FEA-AI hybrid optimization pipeline. A Design agent, connected to a motor textbook through RAG, provides domain-knowledge-based options and engineering tips, and compiles an optimization card and a design-of-experiments plan for AI-model training. A Training agent automates electromagnetic FEA, records geometry-validation and solver-failure logs, analyzes failed geometries using ANOVA-based data analysis and LLM reasoning, and invokes a Design Sampling agent to redefine the design space and generate additional samples. An Optimization agent performs GA-based search with uncertainty-driven switching: low-uncertainty candidates are evaluated by AI-surrogate inference, whereas high-uncertainty and reliability-critical Pareto-front or top-K candidates are corrected by high-fidelity FEA and reused for iterative retraining. The framework converts manual, experience-dependent configuration into a reproducible workflow that balances computational cost and prediction reliability. Experimental results under a matched high-fidelity FEA budget show that the proposed hybrid approach achieves better objective performance while maintaining low and further reducible predictive uncertainty, outperforming FEA-only search, which is limited by early budget exhaustion, and AI-only search, which converges to a low-confidence optimum.
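
The uncertainty-driven switching described above can be sketched for a single generation of candidates. This is an illustration under stated assumptions, not the authors' pipeline: the objective is minimized, `surrogate` returns a (mean, standard deviation) pair, `fea` is a hypothetical stand-in for a high-fidelity solver call, and the genetic algorithm itself is omitted.

```python
def evaluate_generation(candidates, surrogate, fea, std_threshold, top_k):
    """Route each candidate to the surrogate or to high-fidelity evaluation.

    Candidates whose predicted uncertainty exceeds std_threshold, or that rank
    among the top_k predicted designs, are re-evaluated with fea and collected
    as new training data for the surrogate.
    """
    predictions = [surrogate(x) for x in candidates]  # (mean, std) per candidate
    order = sorted(range(len(candidates)), key=lambda i: predictions[i][0])
    critical = set(order[:top_k])  # best predicted designs are always verified
    values, retrain_set = [], []
    for i, x in enumerate(candidates):
        mean, std = predictions[i]
        if std > std_threshold or i in critical:
            y = fea(x)
            retrain_set.append((x, y))
            values.append(y)
        else:
            values.append(mean)
    return values, retrain_set

def toy_fea(x):
    """Hypothetical expensive objective."""
    return (x - 2.0) ** 2

def toy_surrogate(x):
    """Hypothetical surrogate: slightly biased, and uncertain far from the training region."""
    std = 0.05 if abs(x) < 3.0 else 0.5
    return (x - 2.0) ** 2 + 0.1, std

values, retrain_set = evaluate_generation(
    [0.0, 1.0, 2.5, 4.0], toy_surrogate, toy_fea, std_threshold=0.2, top_k=1
)
```

Here only two of the four candidates trigger the expensive call: the best predicted design (2.5) and the high-uncertainty design (4.0). The rest keep their surrogate estimates, which is how a fixed high-fidelity budget can be stretched across more generations.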

2606.09033 2026-06-09 cs.CV cs.CL 新提交

CRANE: Knowledge Editing for Reasoning MLLMs

CRANE:面向推理多模态大语言模型的知识编辑

Han Huang, Hao Wang, Mengqi Zhang, Shu Wu, Qiang Liu, Liang Wang

发表机构 * University of Chinese Academy of Sciences(中国科学院大学) New Laboratory of Pattern Recognition (NLPR), CASIA(中国科学院自动化研究所模式识别国家重点实验室) Harbin Institute of Technology(哈尔滨工业大学) Shandong University(山东大学)

AI总结 针对推理多模态大语言模型在知识编辑中出现的结构崩溃、认知失调和浅层内化三种失败模式,提出检索增强框架CRANE,无需逐编辑参数修改,通过模态感知双库检索系统和两阶段训练策略实现高成功率。

Comments 10 pages, 5 figures

详情
AI中文摘要

推理多模态大语言模型(MLLMs)的出现,即在生成答案前产生显式思维链(CoT)推理,为知识编辑带来了新挑战:在传统指标(教师强制准确率高达100%)下看似成功的方法,在检查模型推理过程时可能严重失败(基础成功率低至0%)。我们识别出三种失败模式:(1)结构崩溃,权重修改方法破坏CoT格式;(2)认知失调,模型的推理链基于视觉证据主动拒绝注入的编辑事实;(3)浅层内化,方法在精确查询上成功但在改写或多跳变体上失败。在推理MLLMs上,这些模式相互作用:泛化方法(FT、LoRA)触发格式崩溃,而无深度修改的方法无法泛化。为揭示这些失败,我们提出一种CoT感知评估协议,并构建ReasonEdit-Bench,包含冲突分层、多级探针和多跳可移植性测试。我们提出CRANE,一种检索增强框架,无需逐编辑参数修改。CRANE结合了模态感知双库检索系统和两阶段训练策略:监督微调(SFT)用于结构初始化,随后是带有认知路由奖励的GRPO,训练模型在视觉先验和注入编辑事实之间进行仲裁。在ReasonEdit-Bench上,CRANE在冲突场景中达到96.9%的基础成功率,多跳链中中间实体使用率为96.9%,文本局部性为97.6%,图像局部性编辑独立性为68.1%。在分布外MMEVOKE基准上,CRANE在黄金检索下达到87.0%。

英文摘要

The emergence of reasoning multimodal large language models (MLLMs), which generate explicit chain-of-thought (CoT) reasoning before producing answers, has introduced a new challenge for knowledge editing: methods that appear successful under traditional metrics (teacher-forcing accuracy up to 100%) can fail severely when the model's reasoning process is examined (Grounded Success as low as 0%). We identify three failure modes: (1) Structural Collapse, where weight-modifying methods destroy the CoT format; (2) Cognitive Dissonance, where the model's reasoning chain actively rejects the injected edit fact based on visual evidence; and (3) Shallow Internalization, where methods succeed on exact queries but fail on rephrase or multi-hop variants. On reasoning MLLMs, these modes interact: methods that generalize (FT, LoRA) trigger format collapse, while methods without deep modification cannot generalize. To expose these failures, we propose a CoT-aware evaluation protocol and construct ReasonEdit-Bench, with conflict stratification, multi-level probes, and multi-hop portability tests. We propose CRANE, a retrieval-augmented framework that requires no per-edit parameter modification. CRANE combines a modality-aware dual-library retrieval system with a two-phase training strategy: Supervised Fine-Tuning (SFT) for structural initialization, followed by GRPO with a Cognitive Routing Reward that trains the model to arbitrate between visual priors and injected edit facts. On ReasonEdit-Bench, CRANE achieves 96.9% Grounded Success on conflict scenarios and 96.9% intermediate entity usage in multi-hop chains, with 97.6% text-locality and 68.1% image-locality Edit Independence. On the out-of-distribution MMEVOKE benchmark, CRANE reaches 87.0% under gold retrieval.

2606.09032 2026-06-09 cs.CL 新提交

Bridging the Agent-World Gap: Text World Models for LLM-based Agents

弥合智能体-世界鸿沟:面向基于LLM的智能体的文本世界模型

Yixia Li, Hongru Wang, Peng Lai, Zhiwen Ruan, He Zhu, Youxin Zhu, Ganlong Zhao, Minda Hu, Yun Chen, Sibei Yang, Peng Li, Jeff Z. Pan, Jia Pan, Guanhua Chen, Yang Liu, Guanbin Li

发表机构 * Southern University of Science and Technology(南方科技大学) University of Edinburgh(爱丁堡大学) Peking University(北京大学) Sun Yat-sen University(中山大学) The Chinese University of Hong Kong(香港中文大学) Shanghai University of Finance and Economics(上海财经大学) Tsinghua University(清华大学) The University of Hong Kong(香港大学)

AI总结 本文系统综述了面向基于LLM的智能体的文本世界模型,围绕形式化框架和智能体生命周期,涵盖基础定义、构建范式、应用(训练时经验合成与推理时规划、验证、适应)及评估,旨在整合该领域并明确设计空间与开放挑战。

Comments Code: https://github.com/sustech-nlp/awesome-text-world-models

详情
English abstract

Large language model (LLM)-based agents are increasingly used in interactive textual environments, from web navigation and code editing to tool use and long-horizon dialogue. Yet many remain largely reactive, mapping observations to actions without an explicit model of how these environments are structured and evolve. This motivates text world models (TWMs): transition models over textual states that, given a state and a candidate action, predict the resulting webpage, terminal output, API response, or user reply, thereby supporting planning, efficient learning, and principled evaluation. We systematically review text world models for LLM-based agents, organized around a formal framework and the agent lifecycle: (1) Foundations, defining text world models and characterizing them by state representation and grounding domain; (2) Construction, taxonomizing LLM-as-WM and code-as-WM paradigms and reviewing methods for building them; (3) Application, examining how world models support agents at training time through experience synthesis and at inference time through planning, verification, and adaptation; and (4) Evaluation, covering both evaluation of the world model itself and its use as an evaluation environment for agents. We aim to consolidate this rapidly developing area, clarify its design space, and highlight open challenges for future research.
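
The transition-model idea in this abstract (given a textual state and a candidate action, predict the resulting text, then plan over those predictions) can be illustrated with a minimal sketch. Everything here is a hypothetical toy, not anything taken from the survey: the "file listing" state, the `CodeWorldModel` class, and the `plan` helper are assumptions chosen only to show a code-as-WM style transition function being used for inference-time planning.

```python
from typing import List, Optional

State = str   # textual state; here a space-separated, sorted file listing (assumed toy domain)
Action = str  # textual action, e.g. "touch b.txt" or "rm a.txt"


class CodeWorldModel:
    """Toy code-as-WM: the transition function is ordinary, hand-written code."""

    def predict(self, state: State, action: Action) -> State:
        files = set(state.split())
        verb, _, arg = action.partition(" ")
        if verb == "touch" and arg:
            files.add(arg)
        elif verb == "rm":
            files.discard(arg)
        # Unknown actions leave the state unchanged.
        return " ".join(sorted(files))


def plan(wm: CodeWorldModel, state: State, goal: State,
         actions: List[Action], depth: int = 3) -> Optional[List[Action]]:
    """Breadth-first search over *predicted* states, without touching a real environment."""
    if state == goal:
        return []
    frontier = [(state, [])]
    seen = {state}
    for _ in range(depth):
        next_frontier = []
        for current, path in frontier:
            for action in actions:
                predicted = wm.predict(current, action)
                if predicted == goal:
                    return path + [action]
                if predicted not in seen:
                    seen.add(predicted)
                    next_frontier.append((predicted, path + [action]))
        frontier = next_frontier
    return None


if __name__ == "__main__":
    wm = CodeWorldModel()
    print(plan(wm, "a.txt", "b.txt", ["touch b.txt", "rm a.txt", "touch c.txt"]))
```

In an LLM-as-WM variant, `predict` would presumably be replaced by a model call that generates the next textual state; the planning loop around it would stay the same.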

2606.09030 2026-06-09 cs.LG cs.AI cs.CL New submission

TRIAGE: Dialectical Reasoning for Explainable Risk Prediction on Irregularly Sampled Medical Time Series with LLMs

Hyeongwon Jang, Gyouk Chu, Changhun Kim, Joonhyung Park, Hangyul Yoon, Eunho Yang

Affiliations: KAIST; AITRICS; University of Wisconsin-Madison

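AI summary: Proposes TRIAGE, a framework that uses a large language model to generate dialectical reasoning over competing clinical outcomes. This mitigates risk polarization and yields continuous risk scores with interpretable rationales; on three benchmarks it improves AUPRC by 3.3% and reduces calibration error by 81%.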

Comments: Code is available at https://github.com/HyeongWon-Jang/TRIAGE

English abstract

Clinical early warning systems built on electronic health records, in which clinical observations are recorded as irregularly sampled medical time series (ISMTS), must deliver both calibrated risk scores for patient triage and interpretable rationales that clinicians can verify. Large Language Models (LLMs) have been explored for this task, yet they collapse graded clinical risk into overconfident binary predictions. This risk polarization undermines both calibration and cross-patient comparability. To address this, we propose TRIAGE, a framework that trains an LLM to generate dialectical reasoning over competing clinical outcomes by eliciting outcome-specific rationales. This dialectical formulation mitigates risk polarization, enabling a single LLM to yield continuous risk scores grounded in explicit clinical reasoning. Evaluated on three ISMTS benchmarks, TRIAGE achieves an average AUPRC improvement of 3.3% and reduces calibration error by 81% compared to the competitive baselines. An LLM-as-a-judge assessment further shows that our rationales surpass post-hoc explanations from the baseline by 20% in clinical reasoning quality. The source code is available at https://github.com/HyeongWon-Jang/TRIAGE .
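
To make the link between risk polarization and calibration concrete, here is a small sketch. The abstract does not say which calibration metric is reported, so expected calibration error (ECE) with equal-width bins is an assumption, and the four-patient example is made-up illustrative data, not a result from the paper.

```python
from typing import List, Sequence


def expected_calibration_error(probs: Sequence[float], labels: Sequence[int],
                               n_bins: int = 10) -> float:
    """Equal-width-bin ECE: weighted mean of |mean predicted risk - observed event rate| per bin."""
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls into the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_risk = sum(p for p, _ in bucket) / len(bucket)
        event_rate = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / total * abs(mean_risk - event_rate)
    return ece


if __name__ == "__main__":
    # Hypothetical cohort: four similar patients, one of whom has the event.
    labels = [1, 0, 0, 0]
    graded = [0.25, 0.25, 0.25, 0.25]     # graded risk matching the event rate
    polarized = [0.99, 0.01, 0.01, 0.99]  # overconfident near-binary predictions
    print(expected_calibration_error(graded, labels))     # 0.0
    print(expected_calibration_error(polarized, labels))  # about 0.25
```

Under this assumed metric, the graded scores are perfectly calibrated on the toy cohort, while the polarized scores are penalized because half of the high-confidence predictions are wrong.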

2606.09029 2026-06-09 cs.CV New submission

Frequency Decoupled Framework for Screen Content Image Super-Resolution

Xufei Wang, Qicheng Zhang, Qi Wu, Ziyang Gu, Shizhuang Weng

Affiliations: Anhui University

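AI summary: Proposes a frequency decoupled framework (FDF) that uses amplitude-phase factorization and bespoke implicit representations to jointly exploit periodic patterns and coherent context for screen content image super-resolution, achieving state-of-the-art performance on multiple datasets.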

Comments: 13 pages; 11 figures

English abstract

Methods based on implicit neural representations have demonstrated superior performance in Screen Content Image Super-Resolution (SCISR). However, they overlook the inherent frequency characteristics, leading to suboptimal performance. We propose a frequency decoupled framework (FDF) that rethinks SCISR from a phasor perspective: it captures structured energy in amplitude and relational continuity in phase, and jointly exploits them with bespoke implicit representations to faithfully recover the regular textures and global configuration of Screen Content Images (SCIs). The Amplitude-Phase Factorization Network (APFN) first separates images into amplitude and phase streams, in which the Amplitude Clustering Module (ACM) organizes sparse yet high-energy amplitude responses into representative prototypes for periodic pattern extraction, while Phase Consistency Self-Attention (PCSA) progressively reinforces configuration through continuous consistency propagation. The Oscillation-Anharmonic Implicit Fitting Network (OAIF-Net) then integrates periodic and coherent implicit representations to efficiently exploit the periodic patterns and coherent context embedded in SCIs. Experimental results show that FDF achieves state-of-the-art SCISR performance at multiple scales across four public SCI datasets. Ablation experiments further demonstrate the effectiveness of each component in extracting and exploiting periodic patterns and coherent context.
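
The amplitude/phase split that this abstract builds on can be illustrated with the classical discrete Fourier transform. This is only a sketch of the underlying signal-processing idea on a 1-D row of pixels; it is not the learned APFN from the paper, and the function names and the striped test row are assumptions made for illustration.

```python
import cmath
import math
from typing import List, Sequence, Tuple


def dft(signal: Sequence[float]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform of a real-valued row."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum: Sequence[complex]) -> List[float]:
    """Inverse DFT; returns the real part, which is exact for spectra of real signals."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]


def split_amplitude_phase(signal: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Factor the spectrum into per-frequency amplitude (energy) and phase (position)."""
    spectrum = dft(signal)
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def merge_amplitude_phase(amplitude: Sequence[float], phase: Sequence[float]) -> List[float]:
    """Recombine the two streams and return to the pixel domain."""
    return idft([cmath.rect(a, p) for a, p in zip(amplitude, phase)])


if __name__ == "__main__":
    # Hypothetical pixel row with a regular stripe pattern, as in grids or UI elements.
    row = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
    amplitude, phase = split_amplitude_phase(row)
    # The energy is sparse: only bins 0 (mean) and 4 (the stripe frequency) are non-zero.
    print([round(a, 6) for a in amplitude])
    print([round(x, 6) for x in merge_amplitude_phase(amplitude, phase)])
```

For a periodic row like this one, the amplitude stream concentrates in a few bins, which is the kind of sparse, high-energy response the abstract describes, while the phase stream carries where the pattern sits.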