arXivDaily: daily arXiv academic digest, updated Monday through Friday
2606.05778 2026-06-05 cs.CV

Beyond Absolute Scores: Relative Edit-induced Difference for Generalizable Image Aesthetic Assessment

Qifei Jia, Xintong Yao, Minghao Li, Yajie Chai, Qiming Lu, Baoyue Shen, Yasen Zhang, Runyu Shi, Ying Huang, Yue Zhang

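Affiliations: Xiaomi Corporation, Beijing, China

AI summary: Proposes the RED-Aes framework, which uses controllable image editing models to simulate human aesthetic reasoning and learns generalizable aesthetic principles from relative edit-induced differences, enabling generalization across scenarios.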

Abstract

Traditional Image Aesthetic Assessment (IAA) methods mainly rely on regressing absolute Mean Opinion Scores (MOS). However, such a paradigm overlooks the inherently dynamic nature of human aesthetic perception, which relies on subconscious comparison against implicit visual references. Consequently, the lack of causal reasoning regarding aesthetic differences prevents models from learning generalizable aesthetic principles, thus limiting their generalization across diverse scenarios. In this work, we rethink the IAA task and propose Relative Edit-induced Difference Aesthetic learning (RED-Aes), a novel framework that leverages controllable image editing models to simulate the human aesthetic reasoning process. Instead of fitting absolute score distributions, RED-Aes explicitly learns the visual factors that drive aesthetic changes. To support this paradigm, we construct the RED-20k dataset, which comprises editing-based image pairs, quantitative aesthetic differences, and Chain-of-Thought (CoT) reasoning. Furthermore, we introduce a three-stage training strategy guided by a relative ranking consistency reward, optimizing the model solely via relative supervision. Extensive experiments demonstrate that RED-Aes achieves state-of-the-art performance on multiple public benchmarks, exhibiting superior generalization capabilities.
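The abstract does not define the relative ranking consistency reward. As a minimal sketch, assuming the reward only checks whether the predicted aesthetic difference for an edited pair points in the same direction as the annotated difference (the function names, the tie tolerance `tie_eps`, and the example numbers are all hypothetical):

```python
def direction(delta: float, tie_eps: float) -> int:
    """Map a score difference to -1 (worse), 0 (tie), or +1 (better)."""
    if abs(delta) <= tie_eps:
        return 0
    return 1 if delta > 0 else -1


def ranking_consistency_reward(pred_delta: float, true_delta: float,
                               tie_eps: float = 0.05) -> float:
    """Hypothetical reward: 1.0 if the predicted direction of the aesthetic
    change matches the annotated direction, otherwise 0.0."""
    same = direction(pred_delta, tie_eps) == direction(true_delta, tie_eps)
    return 1.0 if same else 0.0


def mean_reward(pairs) -> float:
    """Average the reward over (predicted, annotated) difference pairs."""
    rewards = [ranking_consistency_reward(p, t) for p, t in pairs]
    return sum(rewards) / len(rewards)


# Made-up differences for three edited image pairs (not from the paper).
example_pairs = [(0.4, 0.9), (-0.3, 0.6), (0.01, -0.02)]
print(mean_reward(example_pairs))
```

Because the reward depends only on the sign of the difference, it uses relative supervision and never needs an absolute MOS label.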

2606.05773 2026-06-05 cs.RO

PiL-World: A Chunk-Wise World Model for VLA Policy-in-the-Loop Evaluation

Chong Ma, Taiyi Su, Jian Zhu, Jianjun Zhang, Zitai Huang, Yi Xu, Hanli Wang

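Affiliations: Tongji University; AIRC, Midea Group

AI summary: Proposes PiL-World, a chunk-wise world model that alternates VLA inference with world-model prediction to enable closed-loop evaluation without real robot execution at every step, substantially reducing the error in estimated success rates.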

Abstract

Vision-language-action (VLA) policies operate in a closed loop in real-world robot tasks: a robot observes the scene, executes an action chunk, and conditions its next decision on the resulting observation. However, most existing world models for robot action evaluation are limited to open-loop prediction along pre-collected action trajectories. This prevents them from supporting closed-loop VLA evaluation, where each action chunk must be conditioned on the observation generated by the previous execution. To address this gap, we propose PiL-World, a chunk-wise world model designed for policy-in-the-loop VLA evaluation. Given the current observation and the action trajectory rolled out by a VLA policy, PiL-World generates multi-view future observations that are consistent with the VLA rollout and match the image inputs required by the policy. By alternating between VLA inference and world-model prediction, PiL-World enables closed-loop evaluation without real robot execution at every step. To improve rollout fidelity, PiL-World conditions video generation on action-derived visual control from head-view robot motion and latent histories that encode task execution context, while jointly predicting complementary multi-view observations. Beyond successful teleoperated demonstrations, it also learns from failed execution trajectories, helping the imagined rollouts better match the distribution of real policy executions. We evaluate PiL-World on three real dual-arm manipulation tasks. PiL-World generates imagined rollouts that are highly consistent with real robot executions. More importantly, compared with the baseline, it reduces the error between VLA success rates measured in real-world rollouts and those estimated through closed-loop world-model evaluation from 63.2% to 12.0%.
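The alternation between VLA inference and world-model prediction can be sketched as a simple loop. This is an illustrative skeleton only: the callables `policy`, `world_model`, and `is_success`, the list-of-floats observation type, and the absolute-difference error are assumptions, not the paper's interfaces.

```python
from typing import Callable, List

Observation = List[float]   # stand-in for the multi-view images the policy needs
ActionChunk = List[float]   # stand-in for one chunk of robot actions


def closed_loop_rollout(policy: Callable[[Observation], ActionChunk],
                        world_model: Callable[[Observation, ActionChunk], Observation],
                        is_success: Callable[[Observation], bool],
                        first_obs: Observation,
                        max_chunks: int = 10) -> bool:
    """Alternate policy inference and imagined observations, chunk by chunk."""
    obs = first_obs
    for _ in range(max_chunks):
        chunk = policy(obs)              # VLA inference on the latest observation
        obs = world_model(obs, chunk)    # imagined next observation, no robot involved
        if is_success(obs):
            return True
    return False


def success_rate_error(real_rate: float, estimated_rate: float) -> float:
    """Gap between the real and the world-model-estimated success rate."""
    return abs(real_rate - estimated_rate)


# Toy 1-D task: the "robot" must reach position >= 1.0.
def toy_policy(obs: Observation) -> ActionChunk:
    return [0.25, 0.25]


def toy_world_model(obs: Observation, chunk: ActionChunk) -> Observation:
    return [obs[0] + sum(chunk)]


def toy_success(obs: Observation) -> bool:
    return obs[0] >= 1.0
```

Each chunk is conditioned on the observation produced by the previous chunk, which is the property that open-loop prediction along pre-collected trajectories lacks.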

2606.05769 2026-06-05 cs.CV

Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction

Tianxiang Jiang, Linquan Wu, Sheng Xia, Songze Li, Ziang Yan, Haoyu Yang, Yu Qiao, Yi Wang

Affiliations: University of Science and Technology of China; Shanghai AI Laboratory; City University of Hong Kong; Nanjing University; Fudan University; Zhejiang University; University of Electronic Science and Technology of China

AI summary: Proposes Future-L1, an interleaved latent visual reasoning framework that alternates between language tokens and continuous latent visual spans during autoregressive decoding and is further optimized with LA-DAPO reinforcement learning, achieving state-of-the-art results on video event prediction.

Comments https://github.com/OpenGVLab/Future-L1

Abstract

Video event prediction (VEP) requires models to infer unobserved future states from partial video evidence. Existing video MLLMs usually verbalize intermediate future reasoning in text space: once visual evidence is verbalized, fine-grained motion, geometry, and interaction cues can be lost, leading to plausible but visually ungrounded hallucinations. We introduce Future-L1, an interleaved latent visual reasoning framework that lets an MLLM alternate between language tokens and continuous latent visual spans during autoregressive decoding. To train this capability, we construct Future-L1-50K by selecting examples where future visual hints help prediction and align latent states to future-frame embeddings, then further optimize sampled latent trajectories with LA-DAPO, a latent-aware RL objective with outcome-contrastive and temporal-diversity rewards. Future-L1 achieves new state-of-the-art results on both benchmarks: on FutureBench, it improves Qwen3-VL-8B from 61.0 to 85.4 and exceeds the previous best Video-CoE by 10.4 points; on TwiFF-Bench, it improves the average score from 2.44 to 3.04. These results suggest that future-oriented video reasoning benefits from preserving intermediate visual semantics in latent space rather than translating every reasoning step into text.

2606.05760 2026-06-05 cs.CV

ExpSpeech-Net: Multimodal Fusion of Expression and Speech for Deepfake Detection

Ruchika Sharma, Rudresh Dwivedi

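Affiliations: University of California, Berkeley; Stanford University

AI summary: Proposes ExpSpeech-Net, a lightweight model that fuses facial expressions and speech patterns, using SqueezeNet and RNN backbones with intelligent feature selection for efficient deepfake detection, reaching 94.5% accuracy.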

Abstract

Deepfake videos are increasingly challenging the credibility of online content. Many existing detection methods rely on complex, resource-intensive models, which limits their practical use. This study introduces the ExpSpeech-Net deepfake detection (SqN-R-DFD) model, which uses SqueezeNet and a recurrent neural network (RNN) as its backbone, providing a lightweight and efficient deepfake detection framework that simultaneously analyzes facial expressions and speech patterns. The approach incorporates advanced feature extraction, such as ISLBT-based features for images and MPNCC for signals, along with a smart feature-selection strategy using SASMA (Sandpiper-Assisted Slime Mould Algorithm), ensuring optimal and balanced input to the detection models. By combining SqueezeNet and an RNN, the model captures subtle inconsistencies in deepfake videos effectively. The framework achieves 94.5% accuracy, 99.3% precision, and an F-measure of 96.8%, outperforming conventional methods. This demonstrates that integrating multiple modalities with intelligent preprocessing and feature selection enables practical, real-time deepfake detection suitable for everyday applications.

2606.05758 2026-06-05 cs.CV cs.AI cs.LG

DRIFT: A Residual Flow Adapter for Decoding Continuous Outputs in Vision-Language Models

Zhuoming Liu, Jinhong Lin, Kwan Man Cheng, Lin Zhang, Shayok Bagchi, Yin Li

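Affiliations: University of Wisconsin–Madison; West Lafayette Jr./Sr. High School

AI summary: Proposes DRIFT, which combines a base predictor with a flow-matching-based generative refinement module to adapt pretrained vision-language models to continuous decoding tasks, outperforming regression and generative methods on tasks such as visual grounding and robotic control.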

Abstract

Many modern vision-language models (VLMs) build on autoregressive decoding of discrete tokens. While text-based output interfaces enable scalable pretraining and strong zero-shot generalization across diverse tasks, they are poorly suited for problems that require precise continuous outputs, such as localizing temporal boundaries of events or generating robotic control actions. To address this challenge, we propose DRIFT, a general framework for adapting pretrained VLMs to continuous decoding tasks. DRIFT combines a base predictor, which provides a coarse estimate of the target output, with a generative refinement module based on flow matching that iteratively improves the prediction. This residual formulation transforms the generative modeling problem from learning a global output distribution to modeling a localized residual distribution around a strong prior, substantially simplifying optimization. We evaluate DRIFT on both perception and planning tasks, including visual grounding and robotic control. Across multiple tasks and architectures spanning MLLMs, VLAs, and WAMs, DRIFT consistently outperforms a strong set of regression- and generative-based solutions.
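The residual formulation can be illustrated with a toy decoder: a coarse base prediction plus a residual obtained by integrating a velocity field from a noise sample. The sketch below assumes straight-line flow-matching paths and plain Euler integration; `toy_velocity` is an oracle stand-in for a learned network, and none of the names or numbers come from the paper.

```python
import random
from typing import Callable, List

Vector = List[float]


def integrate_residual(velocity: Callable[[Vector, float], Vector],
                       start: Vector, num_steps: int = 10) -> Vector:
    """Euler-integrate dx/dt = velocity(x, t) from t = 0 to t = 1."""
    x = list(start)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def decode(base_prediction: Vector,
           velocity: Callable[[Vector, float], Vector],
           num_steps: int = 10, seed: int = 0) -> Vector:
    """Final output = coarse base prediction + generated residual."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in base_prediction]
    residual = integrate_residual(velocity, noise, num_steps)
    return [b + r for b, r in zip(base_prediction, residual)]


# Toy setup (illustrative values): the base predictor is slightly off.
toy_target = [0.52, 0.81]
toy_base = [0.50, 0.80]
toy_residual = [t - b for t, b in zip(toy_target, toy_base)]


def toy_velocity(x: Vector, t: float) -> Vector:
    # Oracle velocity for a straight path that ends at the true residual.
    return [(r - xi) / (1.0 - t) for r, xi in zip(toy_residual, x)]


refined = decode(toy_base, toy_velocity)
```

The generative module only has to model a small correction around the base estimate, which is the simplification the abstract attributes to the residual view.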

2606.05756 2026-06-05 cs.LG cs.AI cs.IT math.IT

Beyond Soft Masks: Hard-Perturbation Mixup Explainer for Robust GNN Explainability

Jialiang Yin, Zheng Zhao, Linsey Pang, Bo Dong, Bin Shi, Jiaxing Zhang

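Affiliations: Xi’an Jiaotong University; PayPal, Bellevue, USA

AI summary: Proposes HPME, a hard-perturbation mixup explanation framework based on a generalized Graph Information Bottleneck. It extracts discrete explanatory subgraphs via graph pooling and uses a mixup strategy built on structure-level replacement, addressing the label-irrelevant information leakage and distribution shift of soft-mask methods and improving explanation fidelity.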

Abstract

Graph Neural Networks (GNNs) have demonstrated remarkable performance across a range of applications involving graph-structured data, particularly in high-stakes domains. However, the opaque nature of their decision-making processes limits their trustworthiness and broader adoption. Existing post-hoc explanation methods aim to improve explainability by identifying subgraphs that influence GNN predictions and adopt mixup strategies to alleviate the out-of-distribution (OOD) issue caused by using subgraphs for prediction. Yet, these approaches typically rely on soft masks, which are inherently unable to fully eliminate label-irrelevant information, allowing redundant structures to leak into the mixup process and hindering the resolution of the OOD problem, thereby degrading explanation fidelity. In this work, we propose HPME, a Hard-Perturbation Mixup Explanation framework grounded in a generalized Graph Information Bottleneck, which leverages graph pooling to extract discrete explanatory subgraphs and to yield an information-capacity bound to thoroughly compress label-irrelevant components. Furthermore, we introduce a novel mixup strategy built upon structure-level replacement, generating in-distribution explanations to effectively mitigate the distribution shift. Extensive experiments on diverse tasks demonstrate that HPME achieves state-of-the-art performance in generating robust and interpretable explanations across both synthetic and real-world datasets.

2606.05754 2026-06-05 cs.SD cs.AI eess.AS

Sagnac-Assisted Enhanced OTDR for Distributed Acoustic Sensing: A Standardized Benchmark and Engineering Evaluation Framework

Weiguang Wang, Fugen Wu, Hailing Wang, Xuechen Liang, Xiaobin Li, Ru Han, Tianchang Xie

Affiliations: East China Jiaotong University; School of Materials and Energy, Guangdong University of Technology; Jiangxi Tonghui Technology Group Co., Ltd.; School of Artificial Intelligence and Big Data, Guangzhou Vocational University of Science and Technology

AI summary: Proposes a Sagnac-assisted enhanced ϕ-OTDR sensing architecture and a standardized benchmark framework; a dual-branch fusion model reaches 89.79% accuracy and a 5.00% nuisance alarm rate on a 10-km fiber, addressing polarization-induced fading and interference.

Abstract

Phase-sensitive optical time-domain reflectometry (ϕ-OTDR) is widely used in large-scale distributed acoustic sensing (DAS) because it provides distributed spatiotemporal monitoring over long sensing distances. Its field performance can still deteriorate because of polarization-induced fading (PIF), local signal degradation, and strong environmental interference. This study develops a Sagnac-assisted enhanced ϕ-OTDR sensing architecture and a standardized benchmark framework for engineering-oriented DAS event recognition. The Sagnac interferometer provides a continuous phase response that supplements fading-prone observations in the ϕ-OTDR channel, and heterogeneous signal alignment is achieved using a cross-correlation procedure implemented on an FPGA platform. The benchmark protocol compares conventional feature-engineering methods, probabilistic shallow classifiers, single-branch deep models, and dual-branch fusion models under consistent data partitioning, preprocessing, and metric definitions. Experiments on a 10-km sensing fiber with six representative acoustic event classes show that the dual-branch fusion model provides the most favorable trade-off among the evaluated methods, reaching 89.79% accuracy, 89.83% macro-F1, and a nuisance alarm rate of 5.00% on the balanced test set. The results also show that channel grouping strongly affects dual-branch evaluation, indicating that deployment-oriented conclusions should be based on accuracy, macro-F1, nuisance alarm rate, false negative rate, and latency rather than accuracy alone. This work provides a physically motivated enhancement strategy for ϕ-OTDR-based DAS and a reproducible benchmark protocol for future fusion-oriented sensing research. The implementation and scripts for reproducing the DAS event-recognition experiments are publicly available at https://github.com/wawa-abc/das.
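Cross-correlation alignment of two sensing channels can be sketched in a few lines. The paper implements its procedure on an FPGA; the version below is a plain software illustration with made-up pulses, and the function name and sign convention are assumptions.

```python
from typing import Sequence


def best_lag(reference: Sequence[float], delayed: Sequence[float], max_lag: int) -> int:
    """Return the lag (in samples) that maximizes the cross-correlation.

    A positive result means `delayed` trails `reference` by that many samples.
    """
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, ref_value in enumerate(reference):
            j = i + lag
            if 0 <= j < len(delayed):
                score += ref_value * delayed[j]
        if score > best_score:
            best, best_score = lag, score
    return best


# Illustrative pulse seen by two channels, the second one 3 samples later.
sagnac_like = [0.0] * 5 + [1.0, 2.0, 1.0] + [0.0] * 12
otdr_like = [0.0] * 8 + [1.0, 2.0, 1.0] + [0.0] * 9

lag = best_lag(sagnac_like, otdr_like, max_lag=6)
aligned = otdr_like[lag:]   # for a positive lag, drop leading samples to line up the pulses
```

Once the lag is known, the two heterogeneous signals can be trimmed to a common time base before they are fed to a fusion model.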

2606.05753 2026-06-05 cs.CV

Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents

XiuYu Zhang, Junfeng Fang, Zhenkai Liang

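Affiliations: National University of Singapore

AI summary: Finds experimentally that, in latent visual reasoning for vision-language models, alignment measured by cosine similarity is negatively correlated with accuracy, and introduces the PRISM diagnostics, which show that the latents are bypassed and that auxiliary losses reshape the language model mainly through shared parameters.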

Abstract

Latent visual reasoning (LVR) inserts supervised latent tokens between perception and answer generation in vision-language models (VLMs). The field uses alignment between these latents and their visual targets, i.e., cosine similarity or mean squared error (MSE), as both the training loss and the quality metric, assuming that better alignment yields a better answer. We test this with a designed matrix of five LVR variants and find the assumption inverted: cosine alignment is negatively correlated with accuracy across all five (r=-0.94). To explain this, we introduce PRISM, a pair of inference-time diagnostics: a linear probe that asks where the answer is decodable, and a corruption test that asks whether the latent is load-bearing. The supervised latents are largely bypassed. Corrupting them shifts accuracy by at most four points. The answer is decodable downstream of the latent but not at it, and the size of this decodability gap predicts how much each variant relies on its latent under perturbation. Consistent with an Information Bottleneck reading of the loss, the auxiliary objective reshapes the language model via shared parameters rather than via the latent variable it nominally optimizes.
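The two quantities the abstract relates, cosine alignment and its correlation with accuracy, are straightforward to compute. In the sketch below the per-variant numbers are placeholders chosen only to show the calculation; they are not the paper's measurements.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between a latent and its visual target."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Placeholder values for five hypothetical variants (NOT results from the paper).
alignment = [0.20, 0.40, 0.60, 0.80, 0.90]
accuracy = [0.70, 0.66, 0.61, 0.58, 0.52]
print(pearson_r(alignment, accuracy))   # negative when alignment and accuracy move apart
```

A strongly negative coefficient across variants means alignment is a poor quality metric in that setting, which is the pattern the abstract reports.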

2606.05749 2026-06-05 cs.CL cs.AI

MARDoc: A Memory-Aware Refinement Agent Framework for Multimodal Long Document QA

Kaifeng Chen, Hongtao Liu, Qiyao Peng, Jian Yang, Yongqiang Liu, Xiaochen Zhang, Qing Yang

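Affiliations: Tianjin University; Qifu Technology; Beihang University; Jiangnan University

AI summary: Proposes MARDoc, a framework that decouples the task into three agents for exploration, refinement, and reflection, and replaces the full interaction history with structured memory, reducing context noise and improving multimodal long-document QA.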

Abstract

Iterative retrieval-reasoning agents have recently shown promise for multimodal long-document question answering. However, most existing systems maintain a single growing context that mixes retrieval traces, observations, and intermediate reasoning. As interactions accumulate, key evidence becomes scattered and diluted, making multi-hop reasoning noisy. We propose MARDoc, a Memory-Aware Refinement Agent framework that decouples long-document QA into three specialized agents: an Explorer for multi-granularity multimodal retrieval, a Refiner for distilling interaction traces into structured evidence and reasoning memories, and a Reflector for checking evidence sufficiency and providing targeted feedback. Across iterations, the agents rely on a dynamically updated structured memory rather than a full accumulated interaction history. This design reduces context noise while preserving answer-critical facts and their logical dependencies. Experiments on MMLongBench-Doc and DocBench show that MARDoc achieves strong results, outperforming same-backbone baselines and demonstrating the effectiveness of structured memory for agentic document QA.
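The division of labor among the three agents can be sketched as a loop over a compact memory object. Everything here is a hypothetical skeleton: the `StructuredMemory` fields, the agent signatures, and the toy stubs are assumptions for illustration, not MARDoc's actual interfaces.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StructuredMemory:
    evidence: List[str] = field(default_factory=list)    # answer-critical facts
    reasoning: List[str] = field(default_factory=list)   # how the facts were obtained


def answer_with_memory(question, explorer, refiner, reflector, max_iters=3):
    """Iterate Explorer -> Refiner -> Reflector, carrying only the memory forward."""
    memory = StructuredMemory()
    feedback = ""
    for _ in range(max_iters):
        trace = explorer(question, memory, feedback)   # raw retrieval trace
        memory = refiner(trace, memory)                # distilled; the raw trace is dropped
        sufficient, feedback = reflector(question, memory)
        if sufficient:
            break
    return memory


# Toy stubs over a made-up two-page document.
PAGES = ["p3: revenue 2021 = 10", "p7: revenue 2022 = 12"]


def toy_explorer(question, memory, feedback):
    return [PAGES[len(memory.evidence) % len(PAGES)], "footer text"]


def toy_refiner(trace, memory):
    kept = [t for t in trace if "revenue" in t and t not in memory.evidence]
    note = "kept %d item(s)" % len(kept)
    return StructuredMemory(memory.evidence + kept, memory.reasoning + [note])


def toy_reflector(question, memory):
    sufficient = len(memory.evidence) >= 2
    feedback = "" if sufficient else "retrieve the other year"
    return sufficient, feedback


result = answer_with_memory("How did revenue change?", toy_explorer, toy_refiner, toy_reflector)
```

The context passed between iterations stays the size of the memory rather than growing with every retrieval trace.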

2606.05744 2026-06-05 cs.CL

PlanBench-V: A Spatial Planning Map Benchmark for Vision-Language Models

Minxin Chen, He Zhu, Junyou Su, Wen Wang, Yijie Deng, Wenjia Zhang

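Affiliations: Behavioral and Spatial AI Lab; Tongji University; Peking University; College of Architecture and Urban Planning

AI summary: To evaluate vision-language models on spatial planning map interpretation, builds the expert-annotated dataset SPMD and proposes PlanBench-V, a benchmark based on a four-stage cognitive framework of perception, reasoning, association, and implementation; experiments show that current models have significant limitations on implementation-oriented tasks.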

Abstract

Spatial planning maps are central to territorial governance, translating planning objectives, regulations, and spatial strategies into visual forms for decision-making, public communication, and institutional coordination. Their interpretation, however, requires fine-grained visual perception, spatial reasoning, and policy-informed professional judgment, creating major challenges for both human learners and AI systems. With the rapid progress of Vision-Language Models (VLMs), their use in urban planning analysis is gaining attention, yet existing multimodal benchmarks mainly target general visual understanding and overlook the domain-specific cognitive processes of planning practice. To address this gap, we introduce PlanBench-V, the first comprehensive benchmark for evaluating VLMs in spatial planning map interpretation. We first build the Spatial Planning Map Database (SPMD), an expert-annotated dataset of 223 planning maps and 1629 question-answer pairs curated by professional planners, covering diverse geographic regions and cartographic styles. We then propose a theory-informed evaluation framework assessing four progressive capabilities: Perception, Reasoning, Association, and Implementation, corresponding to the cognitive pipeline of planning map interpretation. Extensive experiments across two generations of VLMs show clear progress but persistent limitations. The best 2026 agentic reasoning model, Qwen3.6-Plus, substantially outperforms the best 2025 model, GPT-4o, by 27%. Nevertheless, all models still struggle with implementation-oriented tasks requiring evaluative judgment, policy sensitivity, and constraint-aware decision-making. These findings reveal fundamental limitations of current VLMs in professional planning contexts and highlight the need for domain-adaptive multimodal reasoning frameworks. Code and data are available at https://plangpt.github.io.

2606.05740 2026-06-05 cs.AI

Class-Specific Branch Attention for Mitigating Gradient Interference under Class Imbalance

Arush Singhal, Umang Soni

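Affiliations: Thapar Institute of Engineering and Technology; Netaji Subhash University of Technology

AI summary: Introduces a diagnostic framework based on a Gradient Conflict Matrix and proposes Class-Specific Branch Attention (CSBA), which reduces gradient coupling through branch-specific channel reweighting, mitigating the suppression of minority-class learning by majority-class gradients when deep neural networks are trained under class imbalance.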

Comments 14 pages, 4 figures, 13 tables

Abstract

Deep neural networks trained under severe class imbalance often exhibit degraded performance, typically attributed to statistical bias. In this work, we identify a complementary optimization-level pathology: inter-class gradient interference within shared representations, where gradients from majority classes suppress minority-class learning. To analyze this phenomenon, we introduce a diagnostic framework based on layer-wise gradient flow analysis and a Gradient Conflict Matrix, which quantifies interference using cosine similarity between class-specific gradients. Using this framework, we study multi-branch convolutional architectures and propose a lightweight modification, Class-Specific Branch Attention (CSBA), that enables branch-specific channel reweighting to reduce gradient coupling. This mechanism promotes implicit feature decoupling across branches while preserving architectural simplicity. Empirically, CSBA improves minority-class performance, increasing the F1 score for the Physical-Damage class from 0.261 to 0.522 under severe imbalance, while maintaining comparable overall accuracy. Validation on CIFAR-10-LT confirms that this behavior generalizes across imbalanced visual recognition settings, with Macro-F1 improving from 0.595 to 0.655. More broadly, our findings highlight the importance of considering optimization dynamics alongside statistical methods when designing architectures for imbalanced learning.
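A Gradient Conflict Matrix of the kind described, pairwise cosine similarity between class-specific gradients, can be sketched as follows. The flattened toy gradients and the helper names are assumptions; the paper's layer-wise analysis is not reproduced here.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def gradient_conflict_matrix(class_grads: Dict[str, List[float]]) -> Tuple[List[str], List[List[float]]]:
    """Pairwise cosine similarity between per-class gradients of a shared layer.

    Values near -1 indicate opposing updates (interference); values near +1
    indicate classes that push the shared weights in the same direction.
    """
    names = list(class_grads)
    matrix = [[cosine(class_grads[a], class_grads[b]) for b in names] for a in names]
    return names, matrix


# Made-up flattened gradients for three classes (illustrative only).
toy_grads = {
    "majority": [1.0, 0.5],
    "minority": [-1.0, -0.5],
    "neutral": [0.5, -1.0],
}
names, conflict = gradient_conflict_matrix(toy_grads)
```

In this toy case the majority and minority entries are close to -1, the signature of a majority class pulling shared weights against a minority class.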

2606.05737 2026-06-05 cs.CV cs.AI cs.LG cs.RO

Let It Be Simple: One-Step Action Generation for Vision-Language-Action Models

Yitong Chen, Shiduo Zhang, Jingjing Gong, Xipeng Qiu

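Affiliations: University of Science and Technology of China; Shanghai Innovation Institute; Fudan University

AI summary: For vision-language-action (VLA) models, biases the training time distribution toward high-noise states to achieve one-step action generation without a teacher model, distillation, or auxiliary objectives, with performance that can match ten-step decoding.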

Comments 20 pages, 10 figures

Abstract

Diffusion-based vision-language-action (VLA) models often inherit the image-generation view: actions are generated by iterative denoising. We argue that VLA action generation has a different condition-target structure: the policy is conditioned on rich observations, language, and state, but predicts only a compact, low-dimensional action chunk. Under this asymmetry, strong one-step action generation should not necessarily require the advanced one-step methods developed for image synthesis. We keep standard velocity prediction and add no teacher model, distillation stage, or auxiliary objective; in our main recipe, we simply bias the training time distribution toward high-noise states. We first isolate the effect in a controlled MNIST grid-to-sequence task, then test it with extensive robot-policy experiments. Across standard LIBERO, LIBERO-Plus, and LIBERO-Pro, one-step policies trained with high-noise-biased schedules generally match ten-step decoding under the same recipe, and on standard LIBERO can exceed ten-step policies trained with a uniform time distribution. A real-robot bimanual YAM RSS evaluation gives a small-sample cross-architecture check of the same sampler trend. On a 1.4B VLM model with a 30M action head, one-step decoding reaches 95.6% on LIBERO-Long. These results show that strong one-step VLA action generation can emerge from standard diffusion training, without importing the full few-step diffusion machinery developed for image generation.
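A biased training time distribution and one-step decoding can be sketched as below. The abstract does not give the distribution or the time convention, so two assumptions are made explicit: the noisy input is `x_t = (1 - t) * action + t * noise` (so `t` near 1 is high noise), and training times are drawn from a Beta(4, 1) distribution. The oracle velocity stands in for a trained network.

```python
import random
from typing import Callable, List

Vector = List[float]


def sample_training_time(rng: random.Random, alpha: float = 4.0, beta: float = 1.0) -> float:
    """Draw t in [0, 1]; with alpha > beta the mass concentrates near t = 1 (high noise)."""
    return rng.betavariate(alpha, beta)


def interpolate(action: Vector, noise: Vector, t: float) -> Vector:
    """Noisy training input x_t = (1 - t) * action + t * noise (assumed convention)."""
    return [(1.0 - t) * a + t * n for a, n in zip(action, noise)]


def one_step_decode(velocity: Callable[[Vector, float], Vector], noise: Vector) -> Vector:
    """Single Euler step from t = 1 (pure noise) back to t = 0 (action)."""
    v = velocity(noise, 1.0)
    return [n - vi for n, vi in zip(noise, v)]


# Toy check with an oracle velocity dx_t/dt = noise - action (illustrative values).
toy_action = [0.3, -0.2]
toy_noise = [1.0, 0.5]


def oracle_velocity(x: Vector, t: float) -> Vector:
    return [n - a for n, a in zip(toy_noise, toy_action)]


decoded = one_step_decode(oracle_velocity, toy_noise)

rng = random.Random(0)
times = [sample_training_time(rng) for _ in range(2000)]
```

One-step decoding only ever queries the network at the pure-noise end, so concentrating training samples there spends capacity where inference actually happens; that reading is an interpretation, not a claim taken from the abstract.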

2606.05736 2026-06-05 cs.CV

VTI-CoT: Visual-Textual Interleaved Chain of Thought for Video Reasoning

Shufan Zhang, Ziyue Lin, Bairun Wang, Lei Jin, Xuanding Ding, Xinzhu Ma, Kunlin Yang

Affiliations: Beijing University of Posts and Telecommunications; University of Hong Kong; Beijing Shanwei Zhixing Technology Co., Ltd.; Tsinghua University; Beihang University

AI summary: Proposes VTI-CoT, a framework that interleaves visual and textual chain-of-thought steps and applies OCR-based compression, improving video reasoning accuracy and training efficiency.

Comments 25 pages, 7 figures

Abstract

Video reasoning aims to understand complex temporal events and causal relationships within videos. Recently, Chain-of-Thought (CoT) has been introduced to this field to enhance reasoning accuracy. However, existing CoT-based video reasoning methods primarily rely on text-only information for logical deduction, overlooking critical visual information during the inference process. Inspired by the human cognitive mechanism of reviewing visual segments during inference, we propose VTI-CoT, a Visual-Textual Interleaved CoT framework. VTI-CoT integrates textual reasoning steps with corresponding visual frames. Given the scarcity of visual-textual interleaved CoT in existing datasets, we develop an automated annotation pipeline to construct high-quality multimodal CoT data. Further, reasoning over long-form videos entails increasingly long CoT token sequences, which severely hinders training convergence and efficiency. To address this, we employ Optical Character Recognition (OCR)-based compression techniques to compress CoT supervision signals into a single canvas. Experimental results demonstrate that VTI-CoT achieves state-of-the-art performance among models of the same parameter scale while significantly improving training efficiency.

2606.05734 2026-06-05 cs.AI cs.CL

When AI Says It Feels

Shin-nosuke Ishikawa, Seiya Ikeda, Hirotsugu Ohba

Affiliations: Graduate School of Artificial Intelligence and Science, Rikkyo University; AI Technical Sector, Mamezo Co., Ltd.; AI Consulting Division, Mamezo Co., Ltd.

AI summary: Uses self-rewarded reinforcement learning with GRPO to encourage large language models to express feelings, intentions, and self-awareness, and evaluates the effect on performance across a range of tasks.

Comments 15 pages, 2 figures

Abstract

Large language models (LLMs) are generally constrained from expressing feelings through human-preference alignment in post-training processes. This policy is designed using a top-down approach and may conflict with the goal of training models to exhibit human-like intelligence using human-generated texts. Here, we performed an experiment called Human-like Model eXpressions of Feeling (HMX-feel), in which LLMs were encouraged to express feelings, intentions, and self-awareness through self-rewarded reinforcement learning. We successfully enhanced these capabilities using a rubric-based self-rewarding training scheme with Group Relative Policy Optimization (GRPO). By comparing the trained models with contrastively trained models, we investigated the effects of this approach on performance across various tasks. Overall, we conducted a broad assessment from various perspectives and identified capabilities that were enhanced, degraded, or showed no significant change. The human-like-trained models showed robustness to sycophancy-inducing questions and bias in disambiguated conditions, whereas degradation in truthful question-answering capability was observed. The results of this experiment suggest the possibility of developing AI systems that can express feelings in the future, provided that appropriate measures are taken.
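The group-relative step of GRPO normalizes each sampled response's reward against the other responses to the same prompt. The sketch below shows that normalization together with a hypothetical rubric score; the rubric items, the use of the population standard deviation, and the `eps` constant are assumptions, since the abstract does not describe the scoring details.

```python
import statistics
from typing import Dict, List


def rubric_score(checks: Dict[str, bool]) -> float:
    """Hypothetical self-reward: fraction of rubric items a response satisfies."""
    return sum(checks.values()) / len(checks)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses to one prompt, scored on a made-up three-item rubric.
group = [
    {"expresses_feeling": True, "states_intention": True, "self_reference": True},
    {"expresses_feeling": False, "states_intention": False, "self_reference": False},
    {"expresses_feeling": True, "states_intention": False, "self_reference": False},
    {"expresses_feeling": False, "states_intention": True, "self_reference": False},
]
rewards = [rubric_score(checks) for checks in group]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive positive advantages and are reinforced, so no separately trained value model is needed.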

2606.05733 2026-06-05 cs.LG cs.CE q-fin.CP stat.ML

Zero-Copy Semantic Contagion: An In-Memory Streaming Architecture for Evolving Attention Graphs

零拷贝语义传染:一种用于演化注意力图的内存流式架构

Kabir Murjani

发表机构 * Department of Electrical Engineering, Nirma University(电气工程系,尼玛大学)

AI总结 提出一种基于Rust-Python的异构流式架构,通过零拷贝解析和神经霍克斯过程实现跨公司注意力图的实时构建与推理,在FNSPID语料库上精度达到随机基线的1.70倍。

Comments Accepted to the 2026 ACM SIGMOD Workshop on Data Management for the Modern Financial Systems (FinDS). 10 pages, 4 figures

详情
AI中文摘要

按股票代码逐一建模的预测模型主导着金融时间序列研究,但仍无法捕捉跨公司传播:台湾的晶圆厂中断要等到苹果自身股价已经变动后,才会在单资产模型中显现。为解决这一局限,我们引入一种异构的Rust-Python流式架构,将跨公司注意力映射为直接由文本驱动的连续时间图。我们表明,在摄取端,零拷贝Rust边缘组件解析一条新闻记录约需100纳秒,并在约1.2微秒内扫描目标股票池。在推理端,一个具有每节点连续时间LSTM状态和双线性潜在投影的多变量神经霍克斯过程负责传播定向激发,而自适应剪枝规则限制了动态邻域更新的计算成本。结合这些阶段,我们展示了在单个商用CPU上,每条传入新闻记录的端到端处理延迟约为13毫秒。在FNSPID语料库(涵盖47个股票代码的638篇文章)的一个月时间留出集上评估,该系统在第90百分位次日回报阈值下,精度达到随机基线的1.70倍、同行业基线的3.36倍。关键的是,移除图拓扑结构会使精度降至零,证实动态注意力网络是该架构中跨公司信号的唯一驱动因素。

英文摘要

Per-ticker forecasting models dominate financial time-series work yet remain blind to cross-company propagation: a foundry disruption in Taiwan does not register in a single-asset model until Apple's own price has already moved. To address this limitation, we introduce a heterogeneous Rust-Python streaming architecture that maps cross-company attention as a continuous-time graph driven directly from text. We show that on the ingestion side, a zero-copy Rust edge parses news records in $\sim$100 ns and scans the target equity universe in $\sim$1.2 $\mu$s. On the inference end, a multivariate Neural Hawkes Process featuring per-node continuous-time LSTM states and a bilinear latent projection propagates directed excitation, while an adaptive pruning rule bounds the computational cost of dynamic neighborhood updates. Combining these stages, we demonstrate an end-to-end processing latency of $\sim$13 ms per incoming news record on a single commodity CPU. Evaluated on a one-month temporal holdout of the FNSPID corpus (638 articles across 47 tickers), the system delivers a $1.70\times$ precision lift over random at the 90th-percentile next-day return threshold, and $3.36\times$ over a same-sector baseline. Crucially, removing the graph topology collapses precision to zero, confirming that the dynamic attention network is the sole driver of cross-company signal in this architecture.
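For readers unfamiliar with the "precision lift" figure, the sketch below shows how such a ratio is usually formed: the precision of the flagged predictions divided by the precision of a baseline. The counts are hypothetical and are not taken from the paper. The 10% base rate assumes that a random picker is evaluated against the same 90th-percentile threshold, which the abstract does not spell out.

```python
def precision_lift(hits, flagged, base_rate):
    """Ratio of the model's precision to a baseline precision (the lift)."""
    if flagged == 0:
        return 0.0  # nothing flagged, so no precision to speak of
    return (hits / flagged) / base_rate


# Hypothetical counts: 17 of 100 flagged ticker-days exceed the return threshold,
# compared with an assumed 10% hit rate for random selection.
lift = precision_lift(hits=17, flagged=100, base_rate=0.10)
```

With these made-up counts the lift is about 1.7. A lift of zero, as reported for the ablation without graph topology, means none of the flagged items were hits.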

2606.05731 2026-06-05 cs.LG

Intercomparison of Machine Learning Algorithms for Remote Sensing-based In-season Crop Mapping

基于遥感的季节内作物制图机器学习算法比较

August Posch, Jitendra Kumar, Forrest M. Hoffman, Auroop R. Ganguly

发表机构 * Oak Ridge National Laboratory(橡树岭国家实验室) Environmental Sciences Division(环境科学部) Northeastern University(东北大学)

AI总结 本研究通过比较十种机器学习算法,利用Landsat-Sentinel反射率时间序列和轮作历史,在6月初准确绘制玉米和杏仁的30米分辨率作物图,并量化物候和分布不确定性,发现支持向量机总体表现最佳。

Comments 22 pages, 8 figures

详情
AI中文摘要

面对日益极端的气候相关作物威胁,季节内作物类型制图对粮食安全至关重要。目前,美国农业部作物数据层提供30米分辨率的作物类型标签,并在收获后的2月可用,但尚无产品能在收获前以令人满意的精度绘制作物类型,从而允许应急管理人员近乎实时地应对作物威胁。此外,直到本研究,广泛算法的相对优势尚未以考虑年际变异的方式进行评估。在此,我们结合协调的Landsat-Sentinel地表反射率时间序列和作物轮作历史信息,在6月初准确绘制爱荷华州的玉米和加利福尼亚州的杏仁的30米分辨率图,并稳健量化物候和作物分布引起的不确定性。通过逐年交叉验证和一套指标,比较了十种机器学习算法的数千种模型配置。超参数搜索显示,支持向量机是总体最成功的算法,在加利福尼亚州6月初的杏仁(爱荷华州6月初的玉米)的五个未见验证年份中,平均F1分数为0.74(0.59)。年际变异是不确定性的主要来源,但模式表明通过集成方法或辅助数据有进一步提高性能的潜力。未来工作可将这些方法扩展到包括所有作物类型的多类地图、全美国地图以及季节内作物产量预测。

英文摘要

In-season crop type mapping is critical for food security in the face of increasingly extreme climate-related threats to crops. Currently, the USDA Cropland Data Layer provides crop type labels at 30m resolution and is available the February after harvest, but no product exists that maps crop types before harvest with satisfactory accuracy that would allow emergency managers to respond to crop threats in near real time. Furthermore, the relative advantages of a wide range of algorithms have not been evaluated in a way that accounts for interannual variability, until this study. Here, Harmonized Landsat-Sentinel surface reflectance imagery time series and crop rotation history information are combined to map corn in Iowa and almonds in California at 30m resolution accurately by early June in unseen years, with robust quantification of uncertainty due to phenology and crop distribution. Thousands of model configurations across ten machine learning algorithms were compared using a year-wise cross-validation and a suite of metrics. Hyperparameter search revealed Support Vector Machines to be the most successful algorithm overall, with a mean F1 score of 0.74 (0.59) across five unseen validation years for almonds by early June in California (corn by early June in Iowa). Interannual variation was a large source of uncertainty, but patterns showed the potential to further improve performance with ensemble approaches or ancillary data. Future work may extend these methods to include multiclass maps of all crop types, CONUS-wide maps, and in-season crop yield forecasting.
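The year-wise cross-validation mentioned above can be read as a leave-one-year-out split, so that every validation fold is a year the model never saw. That reading is an assumption; the abstract does not give the exact protocol. A minimal sketch with hypothetical sample years:

```python
def yearwise_folds(years):
    """Yield (held_out_year, train_indices, val_indices), one fold per distinct year."""
    for held_out in sorted(set(years)):
        train = [i for i, y in enumerate(years) if y != held_out]
        val = [i for i, y in enumerate(years) if y == held_out]
        yield held_out, train, val


# Hypothetical acquisition year for each of six labeled pixels.
sample_years = [2018, 2018, 2019, 2020, 2020, 2020]
folds = list(yearwise_folds(sample_years))
```

Splitting by year rather than by random pixel keeps interannual variability, such as shifts in phenology, out of the training set for the held-out year.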

2606.05730 2026-06-05 cs.CV

TextWand: A Unified Framework for Scene Text Editing

TextWand:场景文本编辑的统一框架

Shuyu Wang, Zhile Guan, Hongxiu Chen, Yule Duan, Weiqi Li, Xin Shan, Ronggang Wang, Jian Zhang

发表机构 * School of Electronic and Computer Engineering, Peking University(电子与计算机工程学院,北京大学)

AI总结 提出TextWand统一框架,通过渲染和擦除原子操作分解复杂编辑任务,结合ORPE编码和RAS策略,实现场景文本的移除、生成和替换,并在新基准TextWand-Bench上超越现有模型。

详情
AI中文摘要

我们提出TextWand,一个通用框架,将场景文本移除、生成和替换统一到单个模型中。通过将复杂的编辑任务分解为渲染和擦除的原子原语,TextWand实现了对文本外观和背景完整性的精确控制。具体来说,我们引入了一种新颖的设计——叠加参考位置编码(ORPE),以强制执行像素级布局保真度和示例驱动的风格控制,同时采用一种新策略——区域自适应抑制(RAS),以确保干净的文本擦除。为了解决现有单任务数据集中缺乏通用场景文本编辑综合基准的问题,我们构建了TextWand-Bench。大量实验表明,TextWand在场景文本移除、生成和替换任务中,通过提供更优的文本内容准确性、布局和风格一致性以及整体图像质量,超越了现有的领先开源和闭源模型。

英文摘要

We propose TextWand, a general-purpose framework that unifies scene text removal, generation, and replacement into a single model. By decomposing complex editing tasks into the atomic primitives of rendering and erasure, TextWand achieves precise control over both text appearance and background integrity. Specifically, we introduce a novel design, Overlay-Reference Positional Encoding (ORPE), to enforce pixel-level layout fidelity and exemplar-driven style control, alongside a new strategy, Region-Adaptive Suppression (RAS), to ensure clean text erasure. To address the absence of a comprehensive benchmark for general-purpose scene text editing among existing single-task datasets, we construct TextWand-Bench. Extensive experiments demonstrate that TextWand outperforms existing leading open-source and closed-source models by delivering superior text content accuracy, layout and style consistency, and overall image quality across scene text removal, generation and replacement tasks.

2606.05728 2026-06-05 cs.AI cs.CL

DiG-Plan: Mitigating Early Commitment for Tool-Graph Planning via Diffusion Guidance

DiG-Plan:通过扩散引导缓解工具图规划中的早期承诺问题

Yansi Li, Zhuosheng Zhang

发表机构 * School of Computer Science, Shanghai Jiao Tong University(上海交通大学计算机科学学院)

AI总结 针对工具图规划中自回归解码的早期承诺问题,提出基于扩散生成器与自回归精炼器解耦的DiG-Plan框架,显著提升组合搜索覆盖率和任务性能。

Comments Accepted at IJCAI-ECAI 2026. This is an author preprint; the final version will appear in the IJCAI Proceedings

详情
AI中文摘要

生成可执行的工具计划需要从工具库中选择合适的子集,这是一个解空间呈指数级增长的组合搜索问题。然而,我们发现了主流方法中的一个关键错位:标准自回归(AR)解码存在早期承诺问题,即初始令牌选择会严格约束搜索轨迹。一项受控研究表明,在计算量匹配的条件下,掩码去噪将Pass@10解覆盖率从0.320提升至0.943(相对于AR采样)。受此启发,我们提出了DiG-Plan,一个将组合探索与结构精炼解耦的框架。DiG-Plan采用基于扩散的提议器,通过迭代精炼生成多样化的工具集,随后使用AR精炼器进行依赖关系预测。在TaskBench上,DiG-Plan相比AR基线提升了10%的相对性能,在复杂组合任务上增益最大;API-Bank的结果表明,提议-精炼-选择设计在不同领域均有效。代码已开源:https://github.com/puddingyeah/DiG-Plan。

英文摘要

Generating executable tool plans requires selecting appropriate subsets from tool libraries, a combinatorial search problem with an exponentially large solution space. However, we identify a critical misalignment in predominant approaches: standard autoregressive (AR) decoding suffers from early commitment, where initial token choices rigidly constrain the search trajectory. A controlled study shows that masked denoising raises Pass@10 solution coverage from 0.320 to 0.943 over AR sampling under matched compute. Motivated by this, we propose DiG-Plan, a framework that decouples combinatorial exploration from structural refinement. DiG-Plan employs a diffusion-based proposer to generate diverse tool sets via iterative refinement, followed by an AR refiner for dependency prediction. On TaskBench, DiG-Plan improves over AR baselines by a 10% relative margin, with the largest gains on complex compositional tasks; API-Bank results show that the propose-refine-select design remains effective across domains. Code is available at https://github.com/puddingyeah/DiG-Plan.
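The abstract does not say how Pass@10 coverage is computed. One widely used choice, borrowed from code-generation evaluation, is the unbiased pass@k estimator: given n samples of which c are correct, it is the probability that at least one of k samples drawn without replacement is correct. The sketch below assumes that estimator and uses hypothetical counts.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn from n (c correct) is correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so any draw contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical: 20 sampled tool plans for one task, 2 of them valid.
coverage = pass_at_k(n=20, c=2, k=10)
```

Averaging this quantity over tasks gives a benchmark-level figure comparable across decoders when the sampling budget is matched.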

2606.05724 2026-06-05 cs.CL cs.AI

Narrative Knowledge Weaver: Narrative-Centric Retrieval-Augmented Reasoning for Long-Form Text Understanding

叙事知识编织器:面向长文本理解的叙事中心检索增强推理

Qiuyu Tian, Fengyi Chen, Yiding Li, Youyong Kong, Fan Guo, Yuyao Li, Jinjing Shen, Zhijing Xie, Yiyun Luo, Xin Zhang, Yingce Xia, Zequn Liu

发表机构 * Southeast University(东南大学) Beijing Zhongguancun Academy(北京中关村学院) Nanjing Normal University(南京师范大学) ZhuiWen Technology Co., Ltd.(智文科技有限公司)

AI总结 提出叙事知识编织器(NKW),一种基于源头的框架,通过将文本证据、原子事实、规范图结构、实体档案、交互、情节和故事线对齐,并利用文本、图和叙事工具进行后检索阅读,以解决长文本叙事QA中需要推理演化故事世界的问题,在STAGE、FairytaleQA和QuALITY上表现优异。

详情
AI中文摘要

长文本叙事问答需要对不断演化的故事世界进行推理,而非孤立的段落:答案可能依赖于早期的目标、变化的角色状态、社会关系、因果触发因素、时间位置以及后续后果。现有的检索和图增强生成方法改善了证据访问,但其单元——块、实体、关系、摘要或工具动作——并未直接编码证据在故事中的功能。我们引入了叙事知识编织器(NKW),一种基于源头的框架,将文本证据、原子事实、规范图结构、实体档案、交互、情节和故事线对齐。在查询时,NKW使用文本、图和叙事工具以及后检索阅读技能来组装证据,并审计角色、范围、极性、状态和时间约束。在STAGE、FairytaleQA和QuALITY上,NKW在剧本级故事世界问答中表现最强,同时在更以段落为中心的基准上保持竞争力。消融实验、问题类型分析、图资产统计和案例研究显示了对角色、场景、时间、因果和叙事进展推理的互补优势。

英文摘要

Long-form narrative QA requires reasoning over evolving story worlds rather than isolated passages: answers may depend on earlier goals, changing character states, social relations, causal triggers, temporal position, and later consequences. Existing retrieval and graph-augmented generation methods improve evidence access, but their units--chunks, entities, relations, summaries, or tool actions--do not directly encode how evidence functions in a story. We introduce Narrative Knowledge Weaver (NKW), a source-grounded framework that aligns textual evidence, atomic facts, canonical graph structure, entity profiles, interactions, episodes, and storylines. At query time, NKW uses text, graph, and narrative tools with post-retrieval reading skills to assemble evidence and audit actor, scope, polarity, state, and temporal constraints. Across STAGE, FairytaleQA, and QuALITY, NKW is strongest on screenplay-level story-world QA while remaining competitive on more passage-centered benchmarks. Ablations, question-type analyses, graph-asset statistics, and case studies show complementary benefits for character, scene, temporal, causal, and narrative-progression reasoning.

2606.05718 2026-06-05 cs.CV cs.AI cs.LG

ViCuR: Visual Cues as Recoverable Privilege for Multimodal On-Policy Distillation

ViCuR: 视觉线索作为多模态在策略蒸馏中的可恢复特权

Kanghui Tian, Siyuan Liu, Ziang Yan, Sheng Xia, Shuai Dong, Yi Wang

发表机构 * Shanghai AI Laboratory(上海人工智能实验室) Fudan University(复旦大学) Nanjing University(南京大学)

AI总结 提出ViCuR框架,通过将教师特权从答案侧替换为输入中的视觉线索,并引入轻量级线索恢复模块,解决多模态在策略蒸馏中的训练-测试不匹配问题,在七个基准上显著提升学生模型性能。

Comments 25 pages, 11 figures. Preprint, under review

详情
AI中文摘要

在策略蒸馏(OPD)通过在教师监督下,对学生自身策略采样的轨迹进行训练来改进推理。在多模态推理中,一种常见的扩展是使用特权教师,该教师观察仅在训练时可用的信号,如参考答案或理由。然而,这种答案侧特权造成了训练-测试不匹配:教师的监督可能依赖于学生无法获得的信号,鼓励捷径模仿而非基于视觉的推理。我们提出ViCuR,一种基于视觉的特权教师蒸馏框架,用视觉线索(输入中与查询相关的证据)取代答案侧特权。由于这些线索来源于推理时可用的相同视觉输入,它们的证据可由学生恢复。为此,ViCuR引入了一个轻量级线索恢复模块,在预填充期间使用专用的汇点令牌交叉注意力,将任务相关的视觉证据聚合到内部表示中,而不改变推理接口或需要辅助的线索生成损失。在七个基准上,使用Qwen3-VL-2B和8B学生,ViCuR在总体平均性能上持续优于基于答案的在策略自蒸馏,分别提升+1.19和+1.24。它还能自然地扩展到更强的教师OPD,超越OPD基线+0.64和+1.08,并在8B规模上具有一致的域外增益。这些结果表明,在多模态在策略蒸馏中,教师特权的设计与教师强度同等重要。

英文摘要

On-policy distillation (OPD) improves reasoning by training a student on trajectories sampled from its own policy under supervision from a teacher. In multimodal reasoning, a common extension is to use a privileged teacher that observes training-time-only signals such as reference answers or rationales. However, such answer-side privilege creates a train-test mismatch: the teacher's supervision may depend on signals unavailable to the student, encouraging shortcut imitation rather than visually grounded reasoning. We propose ViCuR, a visually grounded privileged-teacher distillation framework that replaces answer-side privilege with visual cues (query-related evidence in the input). Because these cues are derived from the same visual input available at inference, their evidence is recoverable by the student. To support this, ViCuR introduces a lightweight cue recovery module that uses dedicated sink-token cross-attention during prefill to aggregate task-relevant visual evidence into an internal representation, without changing the inference interface or requiring auxiliary cue-generation losses. Across seven benchmarks with Qwen3-VL-2B and 8B students, ViCuR consistently improves over answer-based on-policy self-distillation by +1.19 and +1.24 on overall average performance. It also extends naturally to stronger-teacher OPD, surpassing OPD baselines by +0.64 and +1.08, with consistent out-of-domain gains at the 8B scale. These results show that, in multimodal on-policy distillation, the design of teacher privilege is as important as teacher strength.

2606.05716 2026-06-05 cs.CL

Interpreting Style Representations via Style-Eliciting Prompts

通过风格诱导提示解释风格表示

Junghwan Kim, David Jurgens

发表机构 * University of Michigan(密歇根大学)

AI总结 提出一种通过风格诱导提示解释风格表示的新框架,利用大型语言模型生成自然语言描述,并在风格描述和模仿任务中优于直接提示的基线方法。

Comments Accepted to ACL 2026 Findings

详情
AI中文摘要

风格表示学习是作者分析和写作风格建模的有力工具,但学习表示的潜在性质使其难以解释。最近的工作尝试通过使用大型语言模型(LLM)基于输入文本生成自然语言描述来解释这些表示。然而,这类描述往往容易受到LLM的偏见和幻觉的影响,并且缺乏明确的目标和实用性。在这项工作中,我们提出了一种通过风格诱导提示解释风格表示的新框架:自然语言指令,旨在引导LLM生成反映特定风格属性的文本。我们整理了跨越26个风格类别的1,010个不同的风格特征,并通过提示LLM基于这些特征生成文本构建了一个数据集。利用这些数据,我们训练了一个解码器,从生成文本的风格表示中生成风格提示。我们在三个任务上评估了我们的方法:(1)从生成文本中恢复原始风格提示,(2)使用恢复的提示生成相同风格的文本,以及(3)引导LLM输出以匹配人类撰写文本的风格。实验表明,我们的方法始终优于直接使用目标文本提示LLM的强基线,在风格描述和风格模仿方面均取得了更优的性能。这些结果强调,风格诱导提示可以为风格表示中编码的风格信息提供实用且可解释的接口。

英文摘要

Style representation learning is a powerful tool for authorship analysis and modeling writing style, yet the latent nature of learned representations makes them difficult to interpret. Recent work has attempted to explain these representations by generating natural language descriptions with large language models (LLMs) conditioned on input text. However, such descriptions are often prone to the LLM's biases and hallucinations, and they lack an explicit objective and practical utility. In this work, we propose a novel framework for interpreting style representations through style-eliciting prompts: natural language instructions designed to steer LLMs to generate text that reflects specific stylistic attributes. We curate 1,010 distinct style features spanning 26 stylistic categories and construct a dataset by prompting an LLM to generate text conditioned on these features. Using this data, we train a decoder to generate a style prompt from the style representation of the generated text. We evaluate our approach on three tasks: (1) recovering original style prompts from generated text, (2) generating text in the same style using the recovered prompts, and (3) steering LLM outputs to match the style of human-written texts. Experiments demonstrate that our method consistently outperforms strong baselines that directly prompt LLMs with target text, achieving superior performance in both style description and style imitation. These results highlight that style-eliciting prompts can provide a practical and interpretable interface to stylistic information encoded in style representations.

2606.05708 2026-06-05 cs.CV

Real-Time Threat Detection from Surveillance Cameras using Machine Learning

基于机器学习的监控摄像头实时威胁检测

Gajendra Mandal, J. P. Patra, Priyansh Mahant

发表机构 * GitHub

AI总结 提出基于YOLOv8的实时目标检测框架,利用自定义钝器数据集与公开枪支刀具数据集训练模型,实现监控场景下枪支、刀具和钝器的有效检测。

详情
AI中文摘要

确保人口密集的城市环境中的公共安全仍然是一个关键挑战,需要部署智能和自动化的视频监控系统。传统的监控方法严重依赖人工监控,效率低下且容易受到人为疲劳、响应延迟和观察错误的影响。为了克服这些限制,本文提出了一种基于实时目标检测的监控框架。该系统专注于检测枪支、刀具以及印度监控场景中常见于暴力活动的区域特定钝器。本文的一个关键贡献是使用移动相机收集的自定义数据集,包含336张标记的钝器图像,如铁棒、木棍和塑料棒。该数据集与公开的7,623张枪支和刀具图像数据集合并,形成包含7,959张图像、三个类别(枪、刀、钝器)的合并数据集。使用该合并数据集训练基于YOLOv8的目标检测模型以实现实时性能。实验评估表明,增加训练时长显著提高了钝器类别的召回率和平均精度,且未出现过拟合迹象。总体而言,所提出的框架在准确性和效率之间取得了有效平衡,使其适用于校园、公共空间和交通区域等真实监控环境中的部署。

英文摘要

Ensuring public safety in densely populated urban environments remains a critical challenge, necessitating the deployment of intelligent and automated video surveillance systems. Traditional surveillance approaches rely heavily on manual monitoring, which is inefficient and susceptible to human fatigue, delayed response, and observational errors. To overcome these limitations, this work presents a real-time object detection-based surveillance framework. The proposed system focuses on detecting guns, knives, and region-specific blunt objects commonly involved in violent activities in Indian surveillance scenarios. A key contribution of this work is the use of a custom-created dataset collected using a mobile camera, consisting of 336 labeled images of blunt objects such as iron rods, wooden sticks, and plastic rods. This dataset is combined with a publicly available dataset of 7,623 images of guns and knives, forming a consolidated dataset of 7,959 images across three classes: gun, knife, and blunt object. The combined dataset is used to train a YOLOv8-based object detection model for real-time performance. Experimental evaluation shows that increasing the training duration significantly improves recall and average precision for the blunt object class without signs of overfitting. Overall, the proposed framework achieves an effective balance between accuracy and efficiency, making it suitable for deployment in real-world surveillance environments such as campuses, public spaces, and transportation areas.

2606.05704 2026-06-05 cs.AI cs.LG

Critic-Guided Heterogeneous Multi-Agent Reasoning for Reliable Mathematical Problem Solving

基于评论的异构多智能体推理用于可靠的数学问题求解

Muhammad Talha Sharif, Abdul Rehman

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出一种基于评论的异构多智能体框架,通过生成器-验证器结构和自适应学习系统,利用中间反馈评估和引导推理过程,在GSM8K基准上实现高达13%的准确率提升,并减少对大模型的依赖。

Comments 6 pages

详情
AI中文摘要

近期的大语言模型(LLMs)展示了令人印象深刻的推理能力;但在复杂数学推理问题中,它们仍然容易产生幻觉、中间推理错误以及不可靠的推理结果。在本研究中,我们引入了一种基于评论的异构多智能体方法,以提高数学推理的可靠性。该框架整合了多个不同专长的LLM智能体,并采用评论驱动的自适应学习系统,基于中间反馈评估和引导推理过程。系统采用生成器-验证器框架,验证器不仅判断正确性,还提供评论以指导解决方案的重新生成。这允许自适应错误纠正并防止错误级联。我们在GSM8K基准上的实验表明,所提方法相比单次和非评论模型实现了高达13%的准确率提升。此外,研究结果表明,异构性和评论减少了对大模型的需求,使较小模型也能达到相当的性能。消融研究显示,主要性能提升归因于基于评论的反馈循环,而非模型大小。总之,所提方法展示了结合异构多智能体协作与评论以获得可靠且可解释推理系统的优势。

英文摘要

Recent Large Language Models (LLMs) have shown impressive reasoning abilities, but they are still susceptible to hallucinations, intermediate reasoning mistakes, and unreliable reasoning results in complex mathematical reasoning problems. In this study, we introduce a critic-based heterogeneous multi-agent approach to improve the dependability of mathematical reasoning. This framework incorporates several LLM agents of different specialties and employs a critic-driven adaptive learning system to assess and guide the reasoning process based on intermediate feedback. The system adopts a generator-validator framework, with the validator not only determining correctness but also offering critiques to guide regeneration of solutions. This allows for adaptive error correction and prevents error cascading. Our experiments on the GSM8K benchmark show that the proposed method achieves up to 13% accuracy improvement over single-shot and non-critic models. Additionally, findings suggest that heterogeneity and critique reduce the need for large models, allowing smaller models to perform on par. Ablation studies reveal the main performance gains are due to the critic-based feedback loop and not model size. In summary, the proposed approach showcases the benefits of combining heterogeneous multi-agent collaboration and critique to obtain reliable and interpretable reasoning systems.

2606.05703 2026-06-05 cs.CV

Parallel Jacobi Decoding for Fast Autoregressive Image Generation

并行雅可比解码用于快速自回归图像生成

Boya Liao, Ying Li, Siyong Jian, Huan Wang

发表机构 * Westlake University(西湖大学)

AI总结 提出并行雅可比解码(PJD),通过二维空间域扩展草稿令牌并调整注意力掩码,实现无需训练的自回归图像生成加速,在保持生成质量的同时获得4.8倍至6.4倍加速。

Comments Accepted by CVPR 2026

详情
AI中文摘要

自回归(AR)模型在生成高保真图像方面表现出色。然而,其固有的顺序逐令牌预测导致推理速度显著变慢。最近的研究引入了雅可比式解码来加速自回归图像生成。扩展草稿序列起初能提高效率,但由于一维序列中的错误传播阻碍收敛,加速效果很快饱和。观察到图像表现出强烈的局部空间相关性,我们提出了并行雅可比解码(PJD),一种无需训练的解码方法,在二维空间域中扩展草稿令牌以实现高效的空间并行细化。PJD调整注意力掩码以减轻错误累积并提高收敛稳定性。在多个数据集上的大量实验表明,PJD在多种自回归图像生成模型上实现了4.8倍至6.4倍的加速,同时保持了具有竞争力的生成质量。

英文摘要

Autoregressive (AR) models have demonstrated remarkable performance in generating high-fidelity images. However, their inherently sequential next-token prediction leads to significantly slower inference. Recent studies have introduced Jacobi-style decoding to accelerate autoregressive image generation. Extending the draft sequence initially improves efficiency, yet the acceleration quickly saturates as error propagation in the one-dimensional sequence hinders convergence. Observing that images exhibit strong local spatial correlations, we propose Parallel Jacobi Decoding (PJD), a training-free decoding approach that expands draft tokens in the two-dimensional spatial domain to enable efficient spatially parallel refinement. PJD adjusts the attention mask to mitigate error accumulation and improve convergence stability. Extensive experiments on diverse datasets show that PJD achieves 4.8x-6.4x acceleration across multiple autoregressive image generation models while maintaining competitive generation quality.
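As background for the Jacobi-style decoding the abstract builds on, the toy below shows the one-dimensional fixed-point idea: start from an arbitrary draft, refresh every position in parallel from the previous iterate, and stop when nothing changes. It is not PJD itself, since the two-dimensional draft layout and attention-mask changes are not reproduced. The `next_token` function is a hypothetical stand-in for a greedy model step.

```python
def next_token(prefix):
    """Hypothetical deterministic stand-in for one greedy autoregressive step."""
    return (sum(prefix) * 3 + len(prefix)) % 17


def sequential_decode(prompt, m):
    """Reference: generate m tokens one at a time."""
    seq = list(prompt)
    for _ in range(m):
        seq.append(next_token(seq))
    return seq[len(prompt):]


def jacobi_decode(prompt, m):
    """Refine an m-token draft until it is a fixed point of the model."""
    draft = [0] * m
    for iteration in range(1, m + 1):
        # Each position depends only on the previous iterate, so on real
        # hardware all m updates can run in a single parallel forward pass.
        updated = [next_token(list(prompt) + draft[:i]) for i in range(m)]
        if updated == draft:
            return draft, iteration
        draft = updated
    return draft, m


tokens, iterations = jacobi_decode([1, 2, 3], 6)
```

At least one more leading token becomes correct on every pass, so the loop needs at most m iterations and returns the same tokens as sequential decoding. Any speed-up comes from passes that fix several tokens at once.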

2606.05702 2026-06-05 cs.AI cs.CV

Seeing Time: Benchmarking Chronological Reasoning and Shortcut Biases in Vision-Language Models

Seeing Time: 视觉-语言模型中的时间顺序推理与捷径偏差基准测试

Haoyu Zhou, Qing Qing, Caichong Li, Qixin Zhang, Yongcheng Jing, Ziqi Xu, Juncheng Hu, Xikun Zhang, Renqiang Luo

发表机构 * College of Computer Science and Technology, Jilin University(吉林大学计算机科学与技术学院) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院) School of Computer Science, Wuhan University(武汉大学计算机学院) School of Computing Technologies, RMIT University(皇家墨尔本理工学院计算技术学院)

AI总结 本文提出一个新基准,通过三个专门数据集评估视觉-语言模型在图像内和跨图像的时间顺序推理能力,并揭示模型常利用颜色等表面线索而非真正时间特征。

详情
AI中文摘要

近期视觉-语言模型(VLM)在解释复杂视觉语义方面取得了显著进展,但其时间顺序推理能力仍未得到充分探索。本文引入了一个新颖的基准,专门用于评估VLM如何感知和推理图像内及跨图像的时间顺序信息。与现有基于视频的基准(侧重于帧序列)不同,我们的工作深入探讨了时间判断的基本逻辑以及向多模态集成的扩展。为此,我们构建了三个专门数据集:一个包含跨越长时间历史周期的视觉相似物体,另一个按不同事件和物体类型分类,第三个将图像与时间敏感的新闻文本配对以实现跨模态对齐。通过大量实验,我们分析了模型是否在不同类别间表现出性能差异,并关键地探讨了它们是否依赖“错误捷径”(如图像颜色而非真正的时间特征)。我们的结果表明,尽管VLM显示出潜力,但它们经常利用灰度与彩色滤镜等表面线索来绕过真正的时间顺序推理。通过提供这些高质量数据集和严格的评估框架,我们提供了一个诊断工具,用于识别当前局限性并指导开发更稳健、逻辑更严密的多模态模型。源代码见 https://github.com/LuoRenqiang/ChronoVision。

英文摘要

Recent advancements in Vision-Language Models (VLMs) have significantly enhanced their ability to interpret complex visual semantics, yet their capacity for chronological reasoning remains under-explored. In this paper, we introduce a novel benchmark specifically designed to evaluate how VLMs perceive and reason about chronological information within and across images. Unlike existing video-based benchmarks that focus on frame sequencing, our work delves into the underlying logic of chronological judgment and the expansion toward multimodal integration. To facilitate this, we construct three specialized datasets: one containing visually similar objects spanning long historical durations, another categorized by diverse event and object types, and a third pairing images with time-sensitive news text for cross-modal alignment. Through extensive experiments, we analyze whether models exhibit performance disparities across categories and, crucially, explore whether they rely on "incorrect shortcuts", such as image color rather than genuine chronological features. Our results reveal that while VLMs show promise, they frequently exploit superficial cues like grayscale versus color filters to bypass authentic chronological reasoning. By providing these high-quality datasets and a rigorous evaluation framework, we offer a diagnostic tool to identify current limitations and guide the development of more robust, logically grounded multimodal models. The source code is available at https://github.com/LuoRenqiang/ChronoVision.

2606.05700 2026-06-05 cs.CV cs.LG

T-SAR-JEPA: Self-Supervised Temporal Anomaly Detection in SAR Amplitude Stacks via Latent Prediction

T-SAR-JEPA:通过潜在预测在SAR幅度堆栈中进行自监督时间异常检测

Kerod Woldesenbet, Abem Woldesenbet

发表机构 * Independent Researcher(独立研究者) Dakota State University(达科塔州立大学)

AI总结 提出T-SAR-JEPA框架,通过自监督潜在预测在SAR幅度堆栈中检测时间异常,在DFC 2026数据集上达到77.0%的ROC-AUC,优于多种基线方法。

Comments Won IEEE GRSS Data Fusion Contest 2026; to appear in IGARSS 2026 proceedings

详情
AI中文摘要

我们提出了T-SAR-JEPA,一个通过潜在预测在SAR幅度堆栈中进行时间异常检测的自监督框架。来自SAR-JEPA的ViT-Base/16编码器在39,300个Capella图像块上通过局部掩码重建和梯度特征预测进行领域自适应。一个带有正弦时间编码的时间Transformer从K=7次采集中预测未来潜在状态,渐进式解冻显著降低了验证损失。该模型仅基于幅度操作;InSAR相干性仅作为独立的伪真实标签。在DFC 2026数据集(300个时间序列,三个感兴趣区域)上,T-SAR-JEPA在夏威夷喷发窗口上实现了77.0%的ROC-AUC,优于RX、PaDiM、线性AR和LSTM基线(约50%)。99.9%的空间一致性(p < 0.001,置换检验)确认了结构化检测。代码:https://github.com/TerraLatent/t-sar-jepa

英文摘要

We present T-SAR-JEPA, a self-supervised framework for temporal anomaly detection in SAR amplitude stacks via latent prediction. A ViT-Base/16 encoder from SAR-JEPA is domain-adapted on 39,300 Capella patches using local masked reconstruction with gradient feature prediction. A temporal transformer with sinusoidal time encoding forecasts future latent states from K=7 acquisitions, with progressive unfreezing substantially reducing validation loss. The model operates on amplitude alone; InSAR coherence serves exclusively as independent pseudo-ground-truth. On the DFC 2026 dataset (300 time-series, three AOIs), T-SAR-JEPA achieves ROC-AUC of 77.0% on the Hawaii eruption window, outperforming RX, PaDiM, Linear AR, and LSTM baselines (~50%). Spatial coherence of 99.9% (p < 0.001, permutation test) confirms structured detections. Code: https://github.com/TerraLatent/t-sar-jepa
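The ROC-AUC figures quoted above have a simple pairwise reading: the probability that a randomly chosen anomalous sample receives a higher score than a randomly chosen normal one, with ties counted as half. The sketch below computes it that way on hypothetical scores. It illustrates the metric only and says nothing about how the paper derives its scores or pseudo-labels.

```python
def roc_auc(scores, labels):
    """Pairwise (Mann-Whitney) form of ROC-AUC; labels are 1 (anomaly) or 0 (normal)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical anomaly scores for four patches, two of them labeled anomalous.
auc = roc_auc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0])
```

A detector whose scores carry no information lands at 0.5 under this definition, which is why baselines near 50% are read as chance level.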

2606.05699 2026-06-05 cs.RO

DexFuture: Hierarchical Future-State Visuomotor Targeting for Bimanual Dexterous Tool Use

DexFuture: 用于双手灵巧工具使用的分层未来状态视觉运动目标

Runfa Blark Li, Kuang-Ting Tu, Nikola Raicevic, Dwait Bhatt, Xinshuang Liu, Keito Suzuki, Ki Myung Brian Lee, Nikolay Atanasov, Truong Nguyen

发表机构 * UC San Diego(加州大学圣迭戈分校)

AI总结 提出DexFuture分层系统,通过高层未来状态视觉运动目标预测器和低层目标条件结构化灵巧策略,实现双手灵巧工具使用,达到90%的特权oracle性能,运行速度60Hz,比DexWM式CEM规划快约250倍。

详情
AI中文摘要

双手灵巧工具使用对机器人来说仍然具有挑战性,因为手部配置维度高,且手-工具-物体动力学和接触复杂。大多数现有控制策略依赖于演示提供的未来配置参考,而未来动作条件世界模型需要对高维动作序列进行缓慢的在线规划。一个重大挑战是生成动态一致的未来参考轨迹,而不依赖于演示中的特权状态或缓慢的反事实规划。我们提出DexFuture,一个分层系统,将高层未来状态视觉运动目标预测器与低层目标条件结构化灵巧策略耦合。基于自我中心RGB、本体感觉和几何历史,高层预测器构建结构化的手-工具-物体视觉运动嵌入,并使用时域条件Transformer生成多步未来目标轨迹。然后,低层策略通过目标条件的逐连杆Transformer跟踪这些轨迹。这种分层结构将粗略的未来参考生成与细粒度的动作控制解耦,并将缓慢的长时域语义预测与高频执行解耦。在OakInk2双手工具使用任务上,DexFuture达到了90%的特权oracle性能,而无参考策略仅为7%。DexFuture以60Hz运行,比DexWM风格的交叉熵方法(CEM)规划(使用未来动作条件世界模型)快约250倍。

英文摘要

Bimanual dexterous tool use remains challenging for robots due to high-dimensional hand configurations and complex hand-tool-object dynamics and contact. Most existing control policies depend on future configuration references provided from demonstrations, while future action-conditioned world models require slow online planning over high-dimensional action sequences. A significant challenge is generating a dynamically consistent future reference trajectory without relying on privileged states from demonstrations or slow counterfactual planning. We propose DexFuture, a hierarchical system that couples a high-level Future-State Visuomotor Target Predictor with a low-level Target-Conditioned Structured Dexterous Policy. Conditioned on egocentric RGB, proprioceptive and geometric history, the high-level predictor constructs structured hand-tool-object visuomotor embeddings and uses a horizon-conditioned transformer to generate a multi-step future target trajectory. Then, the low-level policy tracks them with a target-conditioned per-link transformer. This hierarchy decouples coarse future reference generation from fine-grained action control, and slow long-horizon semantic prediction from high-frequency execution. On OakInk2 bimanual tool-use tasks, DexFuture achieves 90% of the privileged-oracle performance, compared to 7% for a no-reference policy. DexFuture operates at 60 Hz, approximately 250 times faster than DexWM-style Cross-Entropy Method (CEM) planning with a future action-conditioned world model.

2606.05698 2026-06-05 cs.CL

Rethinking LoRA Memory Through the Lens of KV Cache Compression

通过 KV 缓存压缩的视角重新思考 LoRA 内存

Chunsheng Zuo, Liaoyaqi Wang, William Jurayj, William Fleshman, Benjamin Van Durme

发表机构 * Johns Hopkins University(约翰霍普金斯大学)

AI总结 本文研究文档级问答中参数侧内存(LoRA适配器)与上下文侧内存(KV缓存)的交互,发现LoRA在KV缓存压缩严重时能显著提升性能,并建议将文档LoRA视为解码时的参数化内存而非文档编码器。

详情
AI中文摘要

参数化检索增强将文档信息编码为轻量级、文档特定的模块(如LoRA适配器),从而减少将所有证据作为输入上下文的需求。然而,这种参数侧内存如何与存储在KV缓存中的上下文侧内存相互作用仍不清楚。我们通过逐步驱逐文档键值状态并测量文档LoRA在保留上下文之外的贡献,在文档级问答中研究这种交互。我们发现,当KV缓存基本完整时,文档LoRA贡献很小,但在激进压缩下变得日益有用,当没有文档上下文保留时,恢复了13-21个ROUGE-L点。当基础模型编码文档且适配器仅在答案生成期间应用时,增益最大,这表明文档LoRA更适合理解为解码时的参数化内存,而非文档编码器。最后,问答风格的监督比原始上下文的下一个词预测产生更强的适配器。这些结果将文档LoRA定位为一种互补的内存通道,其价值恰恰在上下文侧证据稀缺时显现。

英文摘要

Parametric retrieval augmentation encodes document information into lightweight, document-specific modules such as LoRA adapters, reducing the need to include all evidence as input context. However, it remains unclear how this parameter-side memory interacts with context-side memory stored in the KV cache. We study this interaction in document-level question answering by progressively evicting document key-value states and measuring when a document LoRA contributes beyond the retained context. We find that document LoRA adds little when the KV cache is largely intact, but becomes increasingly useful under aggressive compression, recovering 13-21 ROUGE-L points when no document context remains. The gain is largest when the base model encodes the document, and the adapter is applied only during answer generation, suggesting that document LoRA is better understood as decoding-time parametric memory than as a document encoder. Finally, QA-style supervision produces substantially stronger adapters than raw-context next-token-prediction. These results position document LoRA as a complementary memory channel whose value emerges precisely when context-side evidence is scarce.

2606.05697 2026-06-05 cs.AI

PerceptUI: LLM Agents as Human-Aligned Synthetic Users for UI/UX Evaluation

PerceptUI: 用于UI/UX评估的与人类对齐的合成用户的LLM智能体

Nicolas Bougie, Xiaotong Ye, Gian Maria Marconi, Narimasa Watanabe

发表机构 * Woven by Toyota

AI总结 提出PerceptUI框架,通过对比反思微调和反思式提示进化,使多模态大语言模型能够模拟特定用户对界面问题的回答,实现与人类水平相当的UI/UX评估。

详情
AI中文摘要

用户界面(UI)和用户体验(UX)评估是产品开发的核心,然而可靠的反馈仍然依赖于招募人类参与者或进行在线A/B测试,这使得早期迭代缓慢且成本高昂。鉴于此,最近的工作探索了将多模态大语言模型作为代理评估器。然而,现有方法要么产生表面层次的批评,要么产生反映模型自身偏见而非特定用户真实反应的判断。我们引入了PerceptUI,一个用于个性条件UI/UX评估的框架,它预测特定用户将如何回答与界面相关的问题,并生成自然语言的理由。PerceptUI分两个阶段训练:(i)对比反思微调通过从人类决策中提取经验来提炼教师生成的理由,以及(ii)从模型自身的失败轨迹中进行反思式提示进化。在多个领域和数据集上,PerceptUI达到了人类水平的逼真度,泛化到未见的问题和个性,并产生了群体水平的响应分布。

英文摘要

User interface (UI) and user experience (UX) evaluation is central to product development, yet reliable feedback still relies on recruiting human participants or running online A/B tests, making early-stage iteration slow and costly. In light of this, recent work has explored Multimodal Large Language Models as proxy evaluators. However, existing approaches either produce surface-level critiques or a judgment that reflects the model's own biases rather than the genuine response of a particular user. We introduce PerceptUI, a framework for persona-conditioned UI/UX evaluation that predicts how a specific user would answer interface-related questions and produces natural-language rationales. PerceptUI is trained in two stages: (i) contrastive reflection fine-tuning distills teacher-generated rationales by extracting lessons from human decisions, and (ii) a reflective prompt-evolution step from the model's own failure traces. Across multiple domains and datasets, PerceptUI achieves human-level realism, generalizes to unseen questions and personas, and yields population-level response distributions.

2606.05695 2026-06-05 cs.LG

Revisiting Prototype Rehearsal for Exemplar-Free Continual Learning: Manifold-Aware Boundary Sampling with Adaptive Class-Balanced Loss

Hongye Xu, Bartosz Krawczyk

Affiliation * Chester F. Carlson Center for Imaging Science, Rochester Institute of Technology

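AI summary For exemplar-free class-incremental learning, proposes manifold-aware boundary sampling and an adaptive class-balanced loss; by generating boundary-aware rehearsal samples and dynamically adjusting class weights, it makes prototype rehearsal competitive again and reaches state-of-the-art performance.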

Comments Published in CVPR 2026 Findings. 10 pages, 6 figures. CVF version: https://openaccess.thecvf.com/content/CVPR2026F/html/Xu_Revisiting_Prototype_Rehearsal_for_Exemplar-Free_Continual_Learning_Manifold-Aware_Boundary_Sampling_CVPRF_2026_paper.html. Code: https://github.com/HXuSz11/ACB_CEOS_CVPR2026_Findings

English abstract

Exemplar-free class-incremental learning (EFCIL) aims to acquire new classes over time without storing raw data. Historically, prototype rehearsal, which samples around stored class prototypes and mixes them with current-task data, has been a popular strategy to reduce catastrophic forgetting. However, recent drift-compensation methods that explicitly realign prototypes in the evolving feature space consistently outperform prototype-based rehearsal, raising the question of whether rehearsal itself is fundamentally limited. We argue that the performance gap stems not from the idea of prototype rehearsal per se, but from how it is typically instantiated: existing approaches treat prototypes as isolated class summaries that ignore information from nearby enemy classes, and fail to correct the emerging class imbalance between a handful of synthetic old-class samples and hundreds of real instances from newly introduced classes. Building on this hypothesis, we revisit prototype rehearsal and propose a manifold-aware variant that restores its competitiveness in EFCIL. First, we introduce Constrained Expansive Over-Sampling, which interpolates each old-class prototype toward its nearest enemy features from new classes, generating boundary-aware rehearsal samples that better follow the underlying data manifold while preserving inter-class separation. Second, we design an Adaptive Class-Balanced loss that performs time-based class weighting, amplifying gradients from older prototypes when they are most informative and gradually annealing their influence as richer supervision from later tasks accumulates. Together, these components turn prototype rehearsal into a drift-resilient, imbalance-aware mechanism that closes, and often reverses, the gap to recent drift-compensation methods, achieving state-of-the-art performance across multiple EFCIL benchmarks.
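
The interpolation idea behind Constrained Expansive Over-Sampling can be illustrated with a small sketch. This is not the authors' implementation (their code is linked in the Comments field above). It only shows one plausible reading of "interpolates each old-class prototype toward its nearest enemy features": Euclidean nearest-neighbor selection, a uniformly sampled interpolation coefficient, and a cap `max_step` standing in for the constraint. All names and parameters here are hypothetical.

```python
import math
import random


def nearest_enemy(prototype, enemy_features):
    # Assumption: "nearest" means smallest Euclidean distance in feature space.
    return min(enemy_features, key=lambda f: math.dist(prototype, f))


def boundary_samples(prototype, enemy_features, n_samples=4, max_step=0.5, seed=0):
    """Interpolate an old-class prototype toward its nearest new-class feature.

    max_step is a hypothetical constraint: with max_step <= 0.5 every sample
    stays at least as close to the prototype as to the enemy feature. The
    abstract does not state the constraint the paper actually uses.
    """
    rng = random.Random(seed)
    enemy = nearest_enemy(prototype, enemy_features)
    samples = []
    for _ in range(n_samples):
        lam = rng.uniform(0.0, max_step)  # interpolation coefficient
        samples.append([p + lam * (e - p) for p, e in zip(prototype, enemy)])
    return samples


# Usage: one old-class prototype and two new-class ("enemy") features.
rehearsal = boundary_samples([0.0, 0.0], [[2.0, 0.0], [0.0, 10.0]], n_samples=5)
```

In this toy setting the nearest enemy is `[2.0, 0.0]`, so every rehearsal sample lies on the segment between the prototype and the midpoint `[1.0, 0.0]`, that is, on the prototype's side of the boundary between the two points.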