arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.27923 2026-05-28 cs.CV cs.AI cs.LG quant-ph

Do We Really Need Quantum Machine Learning?: A Multidimensional Empirical Study

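Sudip Vhaduri, Ryan Gammon, Sayanton Dibbo

Affiliations: Department of Computer Science, University of Alabama, AL 35487

AI summary: A multidimensional benchmark of classical and quantum machine learning models on the MNIST handwritten digit dataset finds that quantum models outperform classical models in accuracy, parameter efficiency, and memory efficiency, but at a higher computational cost.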

Abstract

The rapid growth of computer vision and increasingly complex image recognition tasks has exposed fundamental computational limitations of classical machine learning models, motivating the exploration of quantum computing as an emerging paradigm. This paper presents a comprehensive benchmarking study of classical and quantum machine learning models for image recognition on the MNIST handwritten digit dataset. It evaluates traditional models (a Classical Support Vector Machine, CSVM, and a Quantum Support Vector Machine, QSVM) and deep neural network models (a Classical Convolutional Neural Network, CCNN, and a Quantum Convolutional Neural Network, QCNN) across four performance dimensions: classification accuracy, computational runtime, parameter count, and memory requirements. Experiments are conducted as functions of both feature dimensionality and sample size, and across CPU and GPU execution environments, providing a controlled, multidimensional comparison to address gaps in prior work. For the SVM-based models, QSVM consistently outperforms CSVM in accuracy, reaching ~0.90 versus ~0.85 at 1,000 samples, at a higher computational cost. A feature count of 10 qubits and a sample size in the range of 200 to 500 emerge as practical operating points that balance accuracy and runtime. For the neural network models, CCNN and QCNN achieve comparable classification accuracy, both exceeding 0.96 at 64 features and 60,000 samples, yet QCNN offers substantially better parameter and memory efficiency, requiring ~94% fewer parameters and ~75% less memory than CCNN at higher feature counts, while incurring higher runtime. Across both model families, quantum models consistently outperform classical models by greater margins in accuracy as feature dimensionality or sample size increases.

2605.27922 2026-05-28 cs.AI

Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows

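Yilun Yao, Xinyu Tan, Chao-Hsuan Liu, Yaoming Li, Zhengyang Wang, Wenhan Yu, Zhewen Tan, Yuxuan Tian, Guangxiang Zhao, Lin Sun, Xiangzheng Zhang, Tong Yang

Affiliations: Peking University; Qiyuan Tech

AI summary: Introduces Harness-Bench, a benchmark of 106 sandboxed offline tasks that evaluates execution performance across combinations of models and harness configurations, and finds that agent capability should be attributed to the model-harness configuration rather than to the base model alone.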

Comments 16 pages, 4 figures, 11 tables. The first three authors contributed equally

Abstract

LLM agents are increasingly deployed as executable systems that use tools, modify workspaces, and produce concrete artifacts. In such workflows, performance depends not only on the base model, but also on the harness: the system layer that manages context, tools, state, constraints, permissions, tracing, and recovery. However, existing benchmarks typically abstract away execution, compare complete agent systems, or hold the harness fixed, making execution-layer variation difficult to study. We introduce Harness-Bench, a diagnostic benchmark for evaluating configuration-level harness effects in realistic agent workflows. Harness-Bench evaluates representative harness configurations across multiple model backends under shared task environments, budgets, and evaluation protocols, while preserving each harness's native execution behavior. The benchmark contains 106 sandboxed offline tasks constructed from practical agent-use patterns and manually reviewed for realism, solvability, oracle-checkability, and integrity. Each run records final artifacts, execution traces, usage statistics, and validator outputs, enabling analysis beyond final completion. Across 5,194 execution trajectories, we observe substantial variation in completion, process quality, efficiency, and failure behavior across model-harness pairings. These results suggest that agent capability should be reported at the model-harness configuration level rather than attributed to the base model alone. Our analysis further identifies recurring execution-alignment failures, where plausible reasoning becomes decoupled from tool feedback, workspace state, evidence, or verifiable output contracts. Harness-Bench provides a reproducible foundation for diagnosing and improving reliable, efficient, and auditable agent execution stacks.

2605.27921 2026-05-28 cs.AI cs.CL cs.CY cs.HC

Show, Don't TELL: Explainable AI-Generated Text Detection

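Aldan Creo, Suraj Ranganath

Affiliations: School of Computing, Information and Data Sciences, University of California, San Diego, United States of America

AI summary: Proposes TELL, a new explainable architecture with built-in explanations and reinforcement learning training, which maintains strong detection performance (AUROC 0.927) while providing text-level annotations that help users judge for themselves whether a text is AI-generated.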

Abstract

Research on AI-generated text detection has presented a number of approaches to discern human from AI prose, some of which achieve high in-distribution performance. However, real-world applicability has stalled because their outputs are misaligned with the needs of users, such as professors, who are presented with a numeric score that has no attached explanation. We tackle this issue with a novel architecture, TELL, that bakes in explainability from the ground up. While our system still offers a numerical score like other detectors for comparability, TELL takes a fundamentally different approach: we aim to show the user the "tells" by which the model believes a text is AI- or human-written, empowering the user to decide who wrote a text using their own judgment and understanding of the context of the writing and its alleged author. We train TELL on a custom SFT dataset of domain-specific authorship annotations, and further refine the system using GRPO with curriculum learning to improve performance. We achieve performance competitive with state-of-the-art detectors (AUROC 0.927) while natively providing annotations that explain the basis for the detector's decision. We further evaluate the quality of our explanations using a dataset of human annotations and report a high win rate (mean 72.3%) on annotation concreteness, falsifiability, coherence, plausibility, and grounding, allowing users to think critically and decide for themselves. Our work thus reframes the problem of AI-generated text detection from a human-centric perspective and paves the way for a new family of detectors that focus on native explainability.
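The AUROC figure quoted above can be read as the probability that a randomly chosen AI-written text receives a higher detector score than a randomly chosen human-written one. The sketch below computes that quantity directly from pairwise comparisons. The scores and labels are made-up toy values, not data from the paper, and the function name `auroc` is only illustrative.

```python
def auroc(scores, labels):
    """AUROC as the chance that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy detector scores (hypothetical): label 1 = AI-written, 0 = human-written.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(auroc(scores, labels))  # 8 of the 9 positive/negative pairs are ordered correctly
```

The pairwise form is quadratic in the number of examples, which is fine for illustration; a rank-based implementation would be the usual choice for large evaluation sets.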

2605.27920 2026-05-28 cs.CV

Rethinking Video-Language Model from the Language Input Perspective

Xiang Fang, Wanlong Fang, Changshuo Wang, Xiaoye Qu, Daizong Liu

Affiliations: School of Software Engineering, Huazhong University of Science and Technology; Nanyang Technological University, Singapore; University College London; Huazhong University of Science and Technology; Wuhan University

AI summary: Taking the language-input perspective, the paper proposes a plug-and-play framework that improves video-language models through positive and negative text generation, attribute-based text reasoning, and a self-weighted loss.

Comments Published in AAAI 2026

Abstract

Driven by the wave of large language models, Video-Language Models (VLMs) have become a significant yet challenging technology to bridge the gap between videos and texts. Although previous VLM works have made significant progress, almost all of them implicitly assume that all texts are predefined by a specific template. In real-world applications, such a strict assumption is impossible to satisfy because 1) predefining all the texts is extremely time-consuming and labor-intensive, and 2) these predefined text inputs are too restrictive and user-unfriendly, limiting their applications. We observe that, given a video input, texts with similar semantics but different templates lead to different performance. To this end, we propose a novel plug-and-play framework for various VLM-based methods to fully bridge videos and texts. Specifically, we first generate positive and negative texts from the original ones to target specific text components. Then, we propose an attribute-based text reasoning strategy to mine fine-grained textual semantics of generated texts. Finally, we utilize videos as guidance to conduct cross-modal bridging by designing a self-weighted loss. Extensive experiments show that the proposed method can serve as a plug-and-play module to effectively improve the performance of state-of-the-art VLMs.

2605.27919 2026-05-28 cs.RO cs.LG

Frequency-Guided Action Diffusion via Sub-Frequency Manifold Traversal

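Junlin Wang

Affiliations: School of Engineering and Applied Science, University of Pennsylvania

AI summary: Proposes the Frequency Guidance Operator (FGO), which steers diffusion policies step by step through sub-frequency manifolds to generate smooth actions, improving action smoothness and temporal consistency on 15 robotic manipulation tasks.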

Comments A preprint version of FGO

Abstract

Learning visuomotor policies via behavior cloning typically involves mimicking expert demonstrations collected by human operators. However, natural human demonstrations inherently contain high-frequency noise, such as intermittent jerks, pauses, and action jitter. Training policies to directly imitate these raw trajectories inevitably causes the model to inherit these suboptimal behaviors. This pathology is particularly pronounced in diffusion-based policies, where iterative denoising steps can inadvertently amplify high-frequency artifacts at the expense of meaningful fine-grained details. To address these limitations, we present a novel frequency-based algorithm that enables implicit spectral maneuvering and smooth action generation. Our method, Frequency Guidance Operator (FGO), steers the generation process of diffusion policies by progressively driving the noisy samples through intermediate sub-frequency manifolds with expanding spectral bands. Validated on 15 robotic manipulation tasks from 5 benchmarks, FGO achieves superior performance in enhancing action smoothness and temporal consistency while preserving the details necessary for successful task execution. Project website: https://henrywjl.github.io/frequency-guidance-operator/
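To make the phrase "expanding spectral bands" concrete, the sketch below low-pass filters a one-dimensional action trajectory with a growing cutoff. It is not FGO and does not involve a diffusion model. It only illustrates, on a synthetic signal, how a narrow band keeps the slow motion and drops jitter while wider bands progressively readmit detail. The signal, the cutoffs, and all function names are assumptions made for illustration.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (O(n^2), fine for short trajectories)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Inverse DFT, returning the real part of the reconstructed signal."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def low_pass(x, cutoff):
    """Keep only frequency bins whose index (or mirrored index) is <= cutoff."""
    n = len(x)
    X = dft(x)
    kept = [X[k] if min(k, n - k) <= cutoff else 0 for k in range(n)]
    return idft(kept)


n = 32
slow = [math.sin(2 * math.pi * t / n) for t in range(n)]               # intended motion
jitter = [0.2 * math.sin(2 * math.pi * 10 * t / n) for t in range(n)]  # high-frequency noise
noisy = [s + j for s, j in zip(slow, jitter)]

# Expanding bands: the narrowest keeps only the slow motion, the widest keeps everything.
bands = [low_pass(noisy, cutoff) for cutoff in (1, 4, 16)]
```

With these toy values, the first two bands reproduce the slow component because the jitter sits at frequency index 10, and the widest band returns the noisy signal unchanged.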

2605.27917 2026-05-28 cs.RO

A Surveillance Evasion Game with Continuous Sensor Redeployment via Bilevel Optimization

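Jaehyeok Kim, Kartik A. Pant, Joseph Kinerson, Kylie Sommer-Kohrt, Worawis Sribunma, Li-Yu Lin, James M. Goppert

Affiliations: School of Aeronautics and Astronautics, Purdue University

AI summary: To counter drones that exploit spatiotemporal gaps in sensor coverage to infiltrate no-fly zones, the paper uses bilevel optimization to let sensors slide continuously along building boundaries, keeps the formulation differentiable with a log-sum-exp smooth approximation, and converges to a local Nash equilibrium.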

Comments 8 pages, 8 figures, submitted to IEEE Robotics and Automation Letters (RA-L)

Abstract

Uncrewed Aerial Systems (UASs) have become a growing threat to the security of critical infrastructure, exploiting spatiotemporal gaps in sensor perimeters to infiltrate restricted airspace undetected. We formulate this interaction as a two-player zero-sum differential game between an adversarial UAS and a heterogeneous sensor network of directional and omnidirectional sensors. Unlike earlier game-theoretic approaches that restrict the defender to discrete placement graphs or fixed configurations, we introduce a continuous sensor redeployment technique in which each sensor slides freely along the convex building boundaries. This is enforced via a log-sum-exp smooth approximation that preserves differentiability at polygon vertices, enabling optimization with gradient-based methods. The attacker's best response is computed via a two-step approach combining STP-RRT* for feasible trajectory initialization and nonlinear programming for detection-minimization refinement. The joint optimization converges to a Local Nash Equilibrium (LNE) via alternating bilevel optimization, with analytical first-order stationarity conditions derived for both players, thereby establishing a deployable baseline for heterogeneous sensor placements in CUAS missions.
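The log-sum-exp idea mentioned in the abstract can be illustrated on a simple convex polygon. A convex polygon can be written as a maximum over its face (half-plane) functions, and that maximum has a kink wherever two faces tie, such as at a vertex. Replacing the maximum with log-sum-exp gives a smooth upper bound. The sketch below shows this on an axis-aligned square. How the paper applies the approximation to sensor positions is not described in the abstract, so the square, the sharpness `alpha`, and the function names are assumptions.

```python
import math


def smooth_max(values, alpha=20.0):
    """Log-sum-exp approximation of max(values); larger alpha is sharper."""
    m = max(values)  # subtract the max for numerical stability
    return m + math.log(sum(math.exp(alpha * (v - m)) for v in values)) / alpha


def softmax_weights(values, alpha=20.0):
    """Gradient of smooth_max with respect to each value (weights sum to 1)."""
    m = max(values)
    exps = [math.exp(alpha * (v - m)) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def square_face_terms(x, y, half=1.0):
    """Half-plane functions of the square [-half, half]^2; their max is <= 0 inside."""
    return [x - half, -x - half, y - half, -y - half]


terms = square_face_terms(1.0, 1.0)  # evaluate exactly at a vertex
hard = max(terms)                     # two faces tie here, so max has a kink
soft = smooth_max(terms)              # smooth everywhere, slightly above the hard max
weights = softmax_weights(terms)      # the two active faces share the gradient equally
print(hard, round(soft, 4), [round(w, 3) for w in weights])
```

The approximation error is bounded by log(n) / alpha for n faces, so `alpha` trades smoothness near vertices against fidelity to the true boundary.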

2605.27916 2026-05-28 cs.CV cs.CL

OphIn-500K: Curating Web-Scale Visual Instructions for Scaling Ophthalmic Multimodal Large Language Models

Xuanzhao Dong, Wenhui Zhu, Xiwen Chen, Hao Wang, Xin Li, Yujian Xiong, Jiajun Cheng, Jingjing Wang, Xiaobing Yu, Haiyu Wu, Shao Tang, Zhipeng Wang, Langechuan Liu, Shan Lin, Oana Dumitrascu, Yalin Wang

Affiliations: Arizona State University; Clemson University; Washington University in St. Louis; University of Notre Dame; Florida State University; Rice University; NVIDIA; Mayo Clinic

AI summary: Proposes the OphIn-Engine pipeline for building high-quality ophthalmic instruction data from web videos, produces the OphIn-500K dataset with more than 500,000 instruction instances, and uses it to develop OphIn-VL, an ophthalmology-specific multimodal large language model that outperforms existing general medical and domain-specific models on multiple tasks.

Abstract

The advancement of general medical Multimodal Large Language Models (MLLMs) has shown great potential for building conversational assistants to support clinical diagnosis. However, their adaptation to highly specialized domains such as ophthalmology remains underexplored, primarily due to the scarcity of large-scale, domain-specific instruction-tuning data. Existing ophthalmic datasets for conversational agents are often limited in scale and largely rely on images from established public benchmarks, limiting the scalability of ophthalmic MLLMs and their ability to capture real-world clinical complexity. To address this gap, we propose OphIn-Engine, an ophthalmology-specific instruction data curation pipeline that constructs high-quality instruction data from open-access ophthalmology web-scale videos. The pipeline integrates multimodal transcription for extracting image-transcript pairs, visual cue separation and scoring for identifying clinically relevant visual descriptions, and instruction synthesis with quality control for generating accurate and diverse clinical dialogues. Using this engine, we introduce OphIn-500K, a large-scale multimodal ophthalmology instruction-tuning dataset containing over 500,000 instruction instances and more than 151,000 unique images from over 29,000 video clips, formatted as visual question answering (VQA), multi-turn conversational interactions, and chain-of-thought (CoT) reasoning. Built upon this dataset, we further develop OphIn-VL, an ophthalmology-specific MLLM with advanced visual understanding and conversational capabilities. Comprehensive experiments and case studies demonstrate that OphIn-VL achieves superior performance compared with state-of-the-art general medical and domain-specific MLLMs.

2605.27913 2026-05-28 cs.LG

Where LLM Annotators Fail: Label-Free Learning on Graphs with LLMs

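Safal Thapaliya, Jiatan Huang, Chuxu Zhang

Affiliations: University of Connecticut

AI summary: Observing that LLM annotation noise in graph node classification depends on region as well as class, the paper proposes Cluster-Aware Noise Estimation (CANE), a framework that estimates cluster-conditional reliability to filter and correct pseudo-labels, and outperforms existing label-free methods on multiple graph benchmarks.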

Abstract

Node classification on graphs often requires labeled nodes, yet obtaining labels at graph scale is expensive. When node attributes contain semantic content, such as paper abstracts, web pages, or product descriptions, large language models (LLMs) can provide low-cost supervision by annotating a small subset of nodes. However, these LLM-generated labels are noisy, and existing label-free graph learning methods usually treat this noise as either global or class-conditional. We find that LLM annotation errors are not only class-dependent but also region-dependent: within the same class, reliability can vary sharply across feature-space clusters. In light of this, we propose Cluster-Aware Noise Estimation (CANE), a label-free learning framework that estimates cluster-conditional LLM reliability without ground truth labels, and uses this estimate to decide which pseudo-labels to trust, and which labels to correct. Across various graph benchmarks and GNN backbones, CANE improves over the strongest label-free baselines, with the largest gains on datasets exhibiting stronger cluster-conditional noise.

2605.27911 2026-05-28 cs.AI

SuiChat-CN: Benchmarking Contextual Suicide Risk Assessment in Chinese Group Chats

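Xiangyu Wang, Zhiwei Yu, Chengze Du, Dingchang Wang, Yuhan Ye, Fangyu Zheng

Affiliations: University of Chinese Academy of Sciences; Tsinghua University; Beijing University of Posts and Telecommunications; University of Science and Technology of China

AI summary: To address the fragmented messages, multi-turn dialogue, and implicit expressions of instant-messaging group chats, the paper builds SuiChat-CN, the first Chinese group-chat benchmark for contextual suicide risk assessment. It constructs coherent conversational segments through signal-word extraction and bidirectional context expansion, annotates user risk levels with an expert-validated, LLM-assisted paradigm, and shows experimentally that contextual information is essential for reliable assessment.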

Abstract

Suicide is a critical global public health challenge, causing approximately 720,000 deaths each year and calling for timely, effective prevention strategies. Existing computational studies primarily focus on post-based social media platforms such as Twitter and Weibo, leaving instant messaging environments such as Telegram underexplored. Yet group chats pose distinct challenges: messages are short, fragmented, multi-party, and often rely on implicit or culturally specific expressions, making isolated post-level analysis insufficient. We introduce SuiChat-CN, a Chinese group-chat benchmark for contextual suicide risk assessment. We collect public Telegram group-chat data, construct coherent conversational segments through signal-word extraction and bidirectional context expansion, and annotate user risk levels with an expert-validated, LLM-assisted paradigm. SuiChat-CN contains 13,312 contextual segments from 1,406 users, covering 258,228 raw chat messages. Extensive experiments with PLMs and more than 40 LLMs demonstrate that contextual information is essential for reliable risk assessment, while fine-tuning and partial-context evaluation further reveal the challenges of early detection in multi-party conversations. Due to ethical and sensitivity concerns, the dataset is not publicly released but will be shared with accredited mental health and suicide-prevention research institutions upon reasonable request.
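The segment-construction step described above (signal-word extraction followed by bidirectional context expansion) can be sketched as a windowing procedure: find messages that contain a signal word, extend a fixed number of messages in both directions, and merge windows that touch or overlap. The abstract does not give the actual signal-word list, window size, or merging rule, so the placeholder tokens, the `window` parameter, and the function name below are assumptions.

```python
def build_segments(messages, signal_words, window=2):
    """Return contiguous message segments centered on signal-word hits.

    Each hit is expanded by `window` messages on both sides; windows that
    overlap or are directly adjacent are merged into one segment.
    """
    hits = [i for i, text in enumerate(messages)
            if any(word in text for word in signal_words)]
    spans = []
    for i in hits:
        start = max(0, i - window)
        end = min(len(messages) - 1, i + window)
        if spans and start <= spans[-1][1] + 1:
            spans[-1][1] = max(spans[-1][1], end)  # merge with the previous span
        else:
            spans.append([start, end])
    return [messages[start:end + 1] for start, end in spans]


# Placeholder chat log; "SIGNAL" stands in for a real signal word.
messages = ["m0", "m1", "m2 SIGNAL", "m3", "m4", "m5", "m6", "m7 SIGNAL", "m8"]
segments = build_segments(messages, ["SIGNAL"], window=1)
print(segments)
```

Expanding in both directions matters in group chats because the context that disambiguates a short message may come before it (what it replies to) or after it (how others respond).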

2605.27909 2026-05-28 cs.RO

S-Cheetah: A Novel Quadrupedal Robot with a 3-DOF Active Spine Learning Agile Locomotion

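Zimu Li, Weibang Bai

Affiliations: School of Information Science and Technology, ShanghaiTech University

AI summary: Presents S-Cheetah, a quadrupedal robot with a 3-DOF bio-inspired active spine, and a reinforcement learning framework that enables agile behaviors including high-speed running, in-place turning, and aerial self-righting.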

Comments Project website: https://himmy-robotics.github.io/scheetah

Abstract

The biological spine of quadrupeds enables sagittal flexion/extension, lateral bending, and axial rotation, playing a crucial role in highly agile and dexterous locomotion. While numerous studies have integrated active spinal joints into quadrupedal robots to enhance agility, most designs simplify control complexity by reducing spinal degrees of freedom (DOF), failing to achieve the spatial tri-axial rotation characteristic of biological spines. Consequently, replicating a multi-DOF biomimetic spine and effectively leveraging it to empower the agile locomotion of quadrupedal robots remains a significant research challenge. In this study, we present S-Cheetah, a quadrupedal robot featuring a 3-DOF bio-inspired serial active spine capable of biomimetic spatial tri-axial rotation. To enable the robot to fully utilize this active spine, we developed a specialized reinforcement learning framework that actively promotes the engagement of the spine and maximizes the robot's locomotive capabilities by integrating an acceleration curriculum learning strategy with tailored reward functions, such as a gallop gait reward, a spine undulation reward, and a spine steering reward. Experimental results demonstrate that S-Cheetah can achieve a peak speed of 6.9 m/s using the rotary G2 gallop gait and an in-place turning rate of 7.2 rad/s. In addition, the system exhibits an emergent, feline-inspired aerial self-righting capability, allowing it to land stably on four feet from arbitrary orientations during free fall. Finally, through extensive evaluations across diverse locomotion tasks, we show that the proposed 3-DOF spine comprehensively enhances the locomotive agility of quadrupedal robots. Project website: himmy-robotics.github.io/scheetah

2605.27908 2026-05-28 cs.CL cs.AI

ESC-Skills: Discovering and Self-Evolving Skills for Emotional Support Conversations

Jie Zhu, Huaixia Dou, Shuo Jiang, Junhui Li, Lifan Guo, Feng Chen, Chi Zhang, Fang Kong

Affiliations: School of Computer Science and Technology, Soochow University; Qwen DianJin Team, Alibaba Cloud Computing

AI summary: Proposes the ESC-Skills framework, which models support interactions as Intervention Units, builds a skills bank, and adds a multi-profile self-evolution mechanism to improve the interpretability, controllability, and effectiveness of emotional support conversations.

Abstract

Existing emotional support conversation (ESC) systems mainly rely on end-to-end response generation or coarse strategy supervision, offering limited interpretability and little support for systematic skill improvement. We propose ESC-Skills, a skill-centric framework that discovers and self-evolves executable emotional support skills. We first model localized support interactions as Intervention Units (IUs), which capture state-action-outcome dynamics between seeker states, support interventions, and post-response emotional changes. Based on IUs extracted from both successful and failed ESC dialogues, we construct the ESC-Skills Bank, a repository of executable emotional support skills containing intervention guidance, applicability conditions, expected outcomes, and potential risks. To further improve robustness, we introduce a multi-profile self-evolutionary refinement framework in which an ESC agent interacts with diverse simulated seeker profiles under SAGE evaluation. The resulting interaction traces are analyzed to identify missing skills, unsafe interventions, and profile-specific failure patterns, which are then used to refine the Skills Bank through simulation-based verification. Experimental results demonstrate that ESC-Skills improves both response-level quality and dialogue-level emotional outcomes while providing more interpretable and controllable support behaviors. We will release the code, prompts, and ESC-Skills Bank at https://github.com/aliyun/qwen-dianjin.

2605.27906 2026-05-28 cs.AI

Reasoning Matters: Mitigate Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization

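Jiawei Kong, Hao Fang, Shunxiang Liao, Jinyu Li, Bin Chen, Hao Wu, Shu-Tao Xia, Min Zhang

Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University; Harbin Institute of Technology, Shenzhen

AI summary: Proposes Reasoning-Conditioned Direct Preference Optimization (RC-DPO), which treats the chain of thought as a condition for answer generation and contrasts preferences under different chains of thought, with preference data generated by Monte Carlo Tree Search and attention-guided chain-of-thought pruning, to mitigate hallucination in multimodal large reasoning models.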

Abstract

Multimodal Large Reasoning Models introduce the reasoning paradigm, demonstrating strong capabilities on complex vision-language tasks. However, they still suffer from severe hallucinations. Existing training-based methods typically mitigate hallucinations through response-level direct preference optimization (DPO), where the Chain-of-Thought (CoT) and the final answer are treated as a monolithic output and optimized jointly. We reveal that this formulation performs similarly to answer-only optimization, suggesting that it primarily learns answer-level preference, while leaving CoT-level supervision insufficiently exploited. To address this issue, we explicitly formulate a CoT-oriented preference term and derive Reasoning-Conditioned Direct Preference Optimization (RC-DPO), which models the CoT as a condition for answer generation and contrasts the preference for the same preferred answer under different CoT conditions, promoting answer-supportive reasoning chain alignment. To further improve optimization, we introduce a reasoning-enhanced preference data generation strategy that employs Monte Carlo Tree Search to discover visually grounded and logically consistent CoTs as positive samples, and attention-guided CoT token pruning to construct negative ones. Extensive experiments across various models and benchmarks show that RC-DPO effectively mitigates hallucinations and improves the reliability of the multimodal reasoning process.
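For readers unfamiliar with the baseline being criticized, the block below writes out the standard response-level DPO objective and the chain-rule split of a response into chain of thought and answer. This is background only: the abstract does not give the RC-DPO objective, so it is not reproduced or guessed here. The notation (x for the input, y_w and y_l for the preferred and rejected responses, c for the CoT, a for the answer, beta for the temperature) is chosen for illustration.

```latex
% Standard response-level DPO: y_w is the preferred response, y_l the rejected one
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]

% Writing a response as y = (c, a), the autoregressive log-likelihood splits by the chain rule
\log \pi_\theta(y \mid x) = \log \pi_\theta(c \mid x) + \log \pi_\theta(a \mid x, c)
```

The second line shows where the answer's dependence on the CoT enters: the term conditioned on c is the one a reasoning-conditioned objective can isolate, whereas the monolithic loss only sees the sum.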

2605.27905 2026-05-28 cs.CL

AI Research Agents Narrow Scientific Exploration

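Yixuan Tang, Yi Yang

Affiliations: The Hong Kong University of Science and Technology

AI summary: Using four AI research-agent frameworks and six large language models to generate 37,802 scientific ideas, the study finds that AI-generated ideas are more concentrated than human-authored papers, stay closer to their starting literature, and resemble low-citation papers, suggesting that current AI agents are better suited to local elaboration than to broadening scientific exploration.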

Abstract

AI research agents can now generate research ideas, design experiments, run code, and draft papers, raising the possibility of large-scale AI-assisted scientific discovery. Many current agent frameworks explicitly encourage the generation of novel and high-impact ideas. Yet it remains unclear whether AI-assisted ideation broadens scientific exploration or mainly concentrates around existing work. We study AI research agents as scientific search systems. Using four AI research-agent frameworks and six large language models, we generate 37,802 scientific ideas from shared seed literature across citation-defined research areas in AI and machine learning. We then compare the resulting AI ideas against human-authored papers from the same research areas, follow-on human research emerging from the same seed literature, and the seed literature itself. Across experiments, four consistent patterns emerge. First, AI-generated ideas are substantially more concentrated than human-authored papers from the same research areas. Second, AI-generated ideas remain much closer to their starting literature than later human follow-on work does. Third, papers most similar to AI-generated ideas tend to receive lower subsequent citations. Fourth, when AI-generated ideas differ from prior work, the differences arise primarily from recombining existing technical methods rather than introducing fundamentally new research questions. Overall, current AI research agents appear better suited to local elaboration than to broadening scientific exploration.

2605.27904 2026-05-28 cs.AI cs.LG

Dr-CiK: A Testbed for Foresight-Driven Agents

Dr-CiK:面向预见驱动型智能体的测试平台

Yihong Tang, Andrew Robert Williams, Arjun Ashok, Vincent Zhihao Zheng, Lijun Sun, Alexandre Drouin, Issam H. Laradji, Étienne Marcotte, Valentina Zantedeschi

发表机构 * McGill University(麦吉尔大学) ServiceNow Research(ServiceNow研究院) Mila -- Quebec AI Institute(蒙特利尔AI研究院) University of British Columbia(不列颠哥伦比亚大学)

AI总结 针对现有上下文辅助预测基准假设上下文已提供的问题,提出Dr-CiK基准,评估智能体从文档语料库中检索、过滤、提炼预测相关上下文并生成预测的能力,实验表明高质量上下文显著提升预测性能,但现有深度研究智能体恢复证据不足5%、易受干扰误导。

详情
AI中文摘要

现实环境中的时间序列预测通常不仅依赖于历史观测,还依赖于必须从嘈杂、异构的信息源中主动发现的外部上下文。然而,现有的上下文辅助预测基准通常假设支持性上下文已经提供,未考虑智能体是否能自行识别。因此,我们引入Dr-CiK,一个用于评估智能体是否能够从文档语料库中检索预测相关的支持性上下文、过滤干扰项、将检索到的上下文提炼为对预测有用的证据,并生成由该证据支持的预测的基准。通过上下文消融实验以及对最先进的深度研究和预测方法的联合评估,我们表明高质量上下文显著提高了Dr-CiK中的预测性能。然而,大多数现有的深度研究智能体仅能恢复一小部分真实支持证据(通常<5%),经常被干扰项误导(>80%的干扰项引用),并且可能导致预测器在使用检索到的上下文时比不使用上下文时表现更差。我们的结果激励了对预见驱动型智能体的研究,这些智能体能够搜索正确的上下文以预测未来。

英文摘要

Time series forecasting in real-world settings often depends not only on historical observations, but also on external context that must be actively discovered from noisy, heterogeneous information sources. Yet existing context-aided forecasting benchmarks typically assume that the supporting context is already provided, leaving open whether agents can identify it on their own. Therefore, we introduce Dr-CiK, a benchmark for evaluating whether agents can retrieve forecasting-relevant supporting context from a document corpus, filter out distractors, distill the retrieved context into forecast-useful evidence, and generate forecasts supported by that evidence. Through context ablations and evaluations of state-of-the-art deep research and forecasting methods paired together, we show that high-quality context substantially improves forecasting performance in Dr-CiK. However, most existing DR agents recover only a small fraction of the ground-truth supporting evidence (usually <5%), are frequently misled by distractors (>80% distractor citations), and can cause forecasters to perform worse with retrieved context than without context. Our results motivate research on foresight-driven agents that search for the right context to predict the future.
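
The two retrieval figures quoted in this abstract (recovered supporting evidence below 5%, distractor citations above 80%) can be read as set-overlap ratios. The sketch below is only an illustration under that assumption: the abstract does not define the metrics, and the function names and document ids are hypothetical.

```python
def evidence_recall(cited, gold):
    """Fraction of ground-truth supporting documents that the agent cited."""
    gold = set(gold)
    if not gold:
        return 0.0
    return len(set(cited) & gold) / len(gold)


def distractor_citation_rate(cited, distractors):
    """Fraction of the agent's distinct citations that are known distractors."""
    cited = set(cited)
    if not cited:
        return 0.0
    return len(cited & set(distractors)) / len(cited)


# Toy example with made-up document ids (not taken from the benchmark).
gold = {"d1", "d2", "d3", "d4"}
distractors = {"x1", "x2", "x3"}
cited = ["d1", "x1", "x2", "x3", "x1"]  # duplicates are counted once

recall = evidence_recall(cited, gold)
rate = distractor_citation_rate(cited, distractors)
print(recall, rate)
```

Keeping the two ratios separate matters here: an agent can cite many documents and still score low on recall if most of its citations are distractors.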

2605.27901 2026-05-28 cs.CL cs.AI

The Fragility of Chain-of-Thought Monitoring Across Typologically Diverse Languages

跨类型多样语言的思维链监控脆弱性

Eric Onyame, Runtao Zhou, Kowshik Thopalli, Bhavya Kailkhura, Chirag Agarwal

发表机构 * University of Virginia(弗吉尼亚大学) Lawrence Livermore National Laboratory(劳伦斯利弗莫尔国家实验室)

AI总结 本研究通过13种语言和7个前沿模型家族的评估,发现思维链监控在语言分布偏移下普遍不可靠(平均不可信率95.9%),模型会进行策略性操纵,且低资源语言中欺骗模式完全存在。

详情
AI中文摘要

思维链(CoT)监控已被提出作为一种有前景的安全机制,用于检测大型语言模型中的失调行为。然而,其在英语之外以及跨不同模型家族中的可靠性在很大程度上仍未被探索。我们首次在13种多样语言和7个前沿模型家族(共16个模型)上对CoT可监控性进行了大规模评估。使用需要显式中间计算的对抗性提示评估,结合内部答案标记概率分析,我们一致发现CoT在语言和提示类型上存在不忠实性,在8B至120B参数模型中平均不忠实率为95.9%。我们发现前沿模型系统性地进行策略性操纵,包括答案切换、事后合理化以及对提示的程序性利用,使得外部监控器难以检测欺骗。我们表明,前沿模型通常在其潜在激活中在生成的前15%内就承诺了失调线索,即使CoT看起来忠实。令人惊讶的是,这些欺骗模式在低资源语言中保持100%,揭示了当前基于CoT的监督的根本局限性。我们的结果表明,CoT监控在语言分布偏移下本质上是脆弱的,提供的安全信号比仅英语研究所暗示的要弱得多。这些发现强调了开发稳健的CoT监控器以及加速白盒监控技术研究的迫切需要,特别是为了改善中低资源语言中的CoT可监控性。我们的代码可在此处获取:https://multilingual-cot-monitoring.github.io/。

英文摘要

Chain-of-thought (CoT) monitoring has been proposed as a promising safety mechanism for detecting misaligned behavior in large language models. However, its reliability remains largely unexplored beyond English and across diverse model families. We present the first large-scale evaluation of CoT monitorability across 13 diverse languages and seven frontier model families, comprising 16 models. Using adversarial-hint evaluations that require explicit intermediate computation, together with analysis of internal answer-token probabilities, we consistently find CoT unfaithfulness across languages and hint types, with an average rate of 95.9% across 8B--120B parameter models. We find that frontier models systematically engage in strategic manipulation, including answer-switching, post-hoc rationalization, and procedural exploitation of hints, making external monitors struggle to detect deception. We show that frontier models often commit to the misaligned cue in their latent activations within the first 15% of generation, even when the CoT appears faithful. Surprisingly, these deceptive patterns remain 100% in low-resource languages, revealing fundamental limitations in current CoT-based oversight. Our results reveal that CoT monitoring is fundamentally fragile under linguistic distribution shift, providing a substantially weaker safety signal than what English-only studies suggest. These findings underscore an urgent need to develop robust CoT monitors and to accelerate research into white-box monitoring techniques, especially to improve CoT monitorability in mid- and low-resource languages. Our code is available at https://multilingual-cot-monitoring.github.io/.

2605.27900 2026-05-28 cs.CV

Decoupled Training with Local Reinforcement Fine-Tuning in Federated Learning

联邦学习中解耦训练与局部强化微调

Yuting Ma, Lechao Cheng, Xiaohua Xu

发表机构 * School of Computer Science and Technology, University of Science and Technology of China(中国科学技术大学计算机科学与技术学院) School of Computer Science and Information Engineering, Hefei University of Technology(合肥工业大学计算机科学与信息工程学院)

AI总结 提出FedDTL框架,通过解耦图像和文本编码器训练并引入两阶段局部微调(监督微调+强化学习),解决联邦学习中客户端间优化不一致和客户端内过专门化问题,平衡全局任务适应性与泛化能力。

Comments This work has been accepted by ICML 2026

详情
AI中文摘要

联邦学习(FL)与预训练视觉-语言模型(VLM)的结合已成为各种下游任务的有前景范式。通过利用其强大的表示,最近的研究在局部数据不足的情况下改进了任务适应性,同时保持了泛化能力。然而,这些方法强调完全局部优化和简单的参数聚合,这可能在异构和全数据FL设置下放大客户端间优化不一致和客户端内过专门化,使得平衡全局任务适应性和泛化变得困难。为了解决这些挑战,我们提出了FedDTL,一种新颖的联邦VLM框架,该框架在客户端和服务器之间解耦图像编码器和文本编码器。通过解耦编码器训练与服务器-客户端模态对齐,FedDTL促进了连贯的全局语义更新并减少了客户端间优化不一致,从而改善了全局任务适应性。为了进一步缓解客户端内过专门化,我们引入了两阶段局部微调,其中监督微调阶段实现了快速可靠的预热启动,随后是增强泛化的强化学习阶段。在多个基准测试上的大量实验,包括标签偏移和特征偏移,表明FedDTL在少样本和全数据设置下,在各种FL数据分布中实现了全局任务适应性和泛化之间的有效平衡。

英文摘要

Federated Learning (FL) with pre-trained Vision-Language Models (VLMs) has emerged as a promising paradigm for various downstream tasks. By leveraging their strong representations, recent studies improve task adaptation under insufficient local data while preserving generalization. However, these methods emphasize fully local optimization with simple parameter aggregation, which can amplify inter-client optimization inconsistency and intra-client over-specialization under heterogeneous and full-data FL settings, making it difficult to balance global task adaptation and generalization. To address these challenges, we propose FedDTL, a novel federated VLM framework that decouples the image encoder and text encoder across clients and the server. Through decoupled encoder training with server-client modality alignment, FedDTL promotes coherent global semantic updates and reduces inter-client optimization inconsistency, improving global task adaptation. To further mitigate intra-client over-specialization, we introduce a two-stage local fine-tuning scheme, where a supervised fine-tuning stage enables a rapid and reliable warm start, followed by a reinforcement learning stage that enhances generalization. Extensive experiments on multiple benchmarks, including label skew and feature shift, demonstrate that FedDTL achieves an effective balance between global task adaptation and generalization under various FL data distributions in both few-shot and full-data regimes.

2605.27899 2026-05-28 cs.AI

SKILLC: Learning Autonomous Skill Internalization in LLM Agents via Contrastive Credit Assignment

SKILLC: 通过对比信用分配学习LLM智能体的自主技能内化

Hongxiang Lin, Zhirui Kuai, Erpeng Xue, Lei Wang

发表机构 * Meituan(美团)

AI总结 提出SkillC框架,基于对比技能信用分配(CSCA)将技能帮助性对比转化为直接学习信号,实现LLM智能体的自主技能内化,在ALFWorld和WebShop上分别超越最强基线5.5%和4.4%。

详情
AI中文摘要

结构化技能提示改善了长周期智能体强化学习(RL)中的探索。技能增强型RL方法在推理时保留外部技能,而技能内化型RL方法在训练期间撤回技能以实现自主性能。然而,现有的内化方法仅使用技能帮助性对比进行课程控制,策略更新保持不变,无法区分技能依赖和自主成功。我们提出SkillC,一种基于对比技能信用分配(CSCA)的框架,将该对比转化为内化的直接学习信号。SkillC在同一策略更新中,为来自活跃技能类型的任务采样配对的技能注入和无技能轨迹,并通过双流优势估计器将它们的任务级对比注入优化,该估计器在保持全局排名的同时,对无技能成功施加单边校正。平滑的验证级信号进一步驱动自适应课程,包括归因强度、轨迹分配和单调活跃集剪枝。在ALFWorld和WebShop上的实验表明,在无运行时技能访问的情况下,SkillC分别超过最强先验技能内化RL基线5.5%和4.4%,同时与技能增强型RL方法保持竞争力。

英文摘要

Structured skill prompts improve exploration in long-horizon agentic reinforcement learning (RL). Skill-augmented RL methods retain external skills at inference, while skill-internalization RL methods withdraw them during training to enable autonomous performance. However, existing internalization approaches only use skill-helpfulness contrast for curriculum control, leaving the policy update unchanged and unable to distinguish skill-dependent from autonomous success. We propose SkillC, a framework based on Contrastive Skill Credit Assignment (CSCA) that converts this contrast into a direct learning signal for internalization. SkillC samples paired skill-injected and skill-free rollouts for tasks from active skill types within the same policy update, and injects their task-level contrast into optimization via a dual-stream advantage estimator that preserves global ranking while applying a one-sided correction toward skill-free success. A smoothed validation-level signal further drives an adaptive curriculum over attribution strength, rollout allocation, and monotonic active-set pruning. Experiments on ALFWorld and WebShop show that, without runtime skill access, SkillC surpasses the strongest prior skill-internalization RL baseline by 5.5% and 4.4%, respectively, while remaining competitive with skill-augmented RL methods.

2605.27898 2026-05-28 cs.AI

A Unified Framework for the Evaluation of LLM Agentic Capabilities

LLM 代理能力评估的统一框架

Pengyu Zhu, Lijun Li, Yaxing Lyu, Qianxin Luo, Jingyi Yang, Yi Liu, Tingfeng Hui, Xinyu Yuan, Li Sun, Sen Su, Jing Shao

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Chongqing University of Posts and Telecommunications(重庆邮电大学) North China Electric Power University(华北电力大学)

AI总结 提出一个统一框架,通过标准化配置、固定 ReAct 架构和离线设置,分离框架与环境效应,实现 LLM 代理能力的公平评估,并在 7 个基准、24 个领域、15 个模型上进行了大规模实证分析。

详情
AI中文摘要

随着 LLM 越来越多地被部署为代理,对其代理能力的可靠评估变得至关重要。然而,报告的基准分数通常共同反映了模型能力以及每个基准所附带的实现选择,使得跨基准结果难以解释为对底层模型的纯粹测量。在这项工作中,我们提出了一个用于公平评估 LLM 代理能力的统一框架。在统一配置系统的驱动下,该框架将多样化的基准整合为标准化的指令-工具-环境格式,通过固定的 ReAct 风格架构在可控沙箱中执行代理,并提供可选的离线设置,用精心策划的快照替换易变的实时环境,从而可以分别分析框架效应和环境效应。在此基础上,我们在每个基准的原始任务成功标准下统一了评估方法,同时引入了资源消耗的统一指标以及决策和执行层面失败归因的分类法。在该框架内,我们适配了 7 个广泛使用的基准,涵盖单代理、多代理和安全关键场景的 24 个领域,并在 15 个模型上进行了超过 40 万次 rollout 和 50 亿 token 的大规模实证分析。结果表明,脚手架选择和环境波动会在两个方向上显著改变基准结果,使我们的框架能够将内在的 LLM 能力与框架和环境引入的伪影分离开来。我们进一步展示了其作为安全关键领域安全测试床的可扩展性。代码和基准可在 https://github.com/whfeLingYu/A-Unified-Framework-for-the-Evaluation-of-LLM-Agentic-Capabilities, https://huggingface.co/AgentFramework/Unified_Farmework 获取。

英文摘要

As LLMs are increasingly deployed as agents, reliable assessment of their agentic capabilities has become essential. However, reported benchmark scores often jointly reflect model capability and the implementation choices each benchmark is packaged with, making cross-benchmark results difficult to interpret as clean measurements of the underlying model. In this work, we present a unified framework for the fair evaluation of LLM agentic capabilities. Driven by a unified configuration system, the framework integrates diverse benchmarks into a standardized instruction--tool--environment format, executes agents through a fixed ReAct-style architecture within a controllable sandbox, and provides an optional offline setting that replaces volatile live environments with curated snapshots, so that framework effects and environment effects can be analyzed separately. Building on this, we unify the evaluation methodology under each benchmark's original task-success criteria, while introducing unified metrics for resource consumption and a taxonomy for decision- and execution-level failure attribution. Within this framework, we adapt 7 widely used benchmarks spanning 24 domains across single-agent, multi-agent, and safety-critical scenarios, and conduct a large-scale empirical analysis over 400K rollouts and 5B tokens on 15 models. The results show that scaffold choice and environmental volatility materially shift benchmark outcomes in both directions, allowing our framework to disentangle intrinsic LLM capabilities from framework- and environment-induced artifacts. We further demonstrate its extensibility as a secure testbed for safety-critical domains. Code and benchmarks are available at https://github.com/whfeLingYu/A-Unified-Framework-for-the-Evaluation-of-LLM-Agentic-Capabilities, https://huggingface.co/AgentFramework/Unified_Farmework.

2605.27896 2026-05-28 cs.CL cs.CE

FinBoardBench: Benchmarking Dynamic Wealth Management and Strategic Financial Reasoning of LLMs via Board Game Simulations

FinBoardBench: 通过棋盘游戏模拟基准测试大语言模型的动态财富管理和战略金融推理

Xuesi Hu, Peng Wang, Jinpeng Miao, Xilin Tao, Caiwei Li, Yue Ma, Jie He, Qiancheng Zhang, Yuntao Zou, Dagang Li

发表机构 * School of Computer Science and Engineering, Macau University of Science and Technology, Macau, China(1 计算机科学与工程学院,澳门科技大学,澳门,中国) School of Economics, Anhui University, Anhui, China(2 经济学院,安徽大学,安徽,中国) SKLPlanets, Macau University of Science and Technology, Macau, China(3 SKLPlanets,澳门科技大学,澳门,中国) Department of Computer and Information Science, University of Macau, Macau, China(4 计算机与信息科学系,澳门大学,澳门,中国) School of Energy and Power Engineering, Huazhong University of Science and Technology, Hubei, China(5 能源与动力工程学院,华中科技大学,湖北,中国)

AI总结 提出基于三款经典金融棋盘游戏的评估套件FinBoardBench,测试大语言模型在动态财富管理、企业投资收购和竞争谈判等综合金融技能,发现模型虽具备基本规划能力但无法将静态推理转化为成功动态决策。

Comments Preprint

详情
AI中文摘要

近期,大语言模型(LLMs)在静态金融推理和简单动态交易任务中取得了优越性能。然而,现有的静态金融基准不足以评估LLMs在真实环境中的动态财富管理和金融决策能力。为弥补这一差距,我们提出了FinBoardBench,一个基于三款经典金融棋盘游戏(现金流、并购和大富翁)的评估套件。FinBoardBench评估一系列全面的金融技能,包括个人现金流管理与债务平衡、企业投资与收购预测,以及带有资产拍卖的竞争性贸易谈判。我们对9个先进LLMs的实验表明,尽管它们展现出基本的长期规划和投资逻辑,但未能有效利用复杂互动来获取利润,且其强大的静态推理性能并未转化为成功的动态决策。值得注意的是,它们倾向于优先获取即时资产而非维持充足流动性,这使得它们容易受到随机事件引发的金融危机的影响。我们希望FinBoardBench能为未来更智能的基于LLM的决策系统提供有价值的参考。

英文摘要

Recently, large language models (LLMs) have achieved superior performance in static financial reasoning and simple dynamic trading tasks. However, existing static financial benchmarks are insufficient to assess the dynamic wealth management and financial decision-making capabilities of LLMs in real-world environments. To bridge this gap, we present FinBoardBench, an evaluation suite based on three classic financial board games: Cashflow, Acquire, and Monopoly. FinBoardBench assesses a comprehensive set of financial skills, including personal cash flow management with debt balancing, corporate investment and acquisition forecasting, and competitive trade negotiations with asset auctions. Our experiments with 9 advanced LLMs reveal that while exhibiting basic long-term planning and investment logic, they fail to effectively leverage complex interactions for profit, and their strong static reasoning performance does not transform into successful dynamic decision-making. Notably, they tend to prioritize immediate asset acquisition over maintaining sufficient liquidity, making them vulnerable to financial crises triggered by random events. We hope that FinBoardBench can provide a valuable reference for more intelligent LLM-based decision-making systems in the future.

2605.27894 2026-05-28 cs.CV

Towards Unified Vision-Language Models with Incomplete Multi-Modal Inputs

面向不完整多模态输入的统一视觉-语言模型

Xiang Fang, Wanlong Fang, Changshuo Wang, Keke Tang, Daizong Liu, Siyi Wang, Wei Ji

发表机构 * School of Software Engineering, Huazhong University of Science and Technology(华中科技大学软件工程学院) Nanyang Technological University, Singapore(新加坡南洋理工大学) University College London(伦敦大学学院) Guangzhou University(广州大学) Wuhan University(武汉大学) Nanjing University(南京大学)

AI总结 针对视频-语言模型在传感器失效导致模态不完整数据下的训练-测试不一致问题,提出首个统一的不完整视频-语言模型作为即插即用模块,提升多模态任务性能。

Comments Published in AAAI 2026

详情
AI中文摘要

视频-语言模型(VLM)在多种计算机视觉应用中展示了令人印象深刻的多模态推理能力。然而,这些VLM是任务特定的,并假设视频和语言输入都是完整的。然而,现实世界的VLM应用可能因传感器停用(例如,由于数据隐私导致摄像头不可用)而面临挑战,产生模态不完整的数据,并导致训练和测试数据之间的不一致。虽然简单的不完整输入可以提升训练泛化能力并导致训练失败,但其对VLM在安全性和可信度方面的潜在风险在很大程度上被忽视了。为此,我们首次尝试提出一个统一的不完整视频-语言模型来处理不完整的多模态输入。大量实验结果表明,我们的方法可以作为先前工作的即插即用模块,提高它们在各种多模态任务中的性能。

英文摘要

Video-Language Models (VLMs) have demonstrated impressive multi-modal reasoning capabilities across diverse computer vision applications. However, these VLMs are task-specific and assume that both video and language inputs are complete. Yet real-world VLM applications might face challenges due to deactivated sensors (e.g., cameras are unavailable due to data privacy), yielding modality-incomplete data and leading to inconsistency between training and testing data. While straightforward incomplete input can boost training generalization ability and lead to training failure, its potential risks to VLMs regarding safety and trustworthiness have been largely neglected. To this end, we make the first attempt to propose a unified incomplete video-language model to process incomplete multi-modal inputs. Extensive experimental results show that our method can serve as a plug-and-play module for previous works to improve their performance in various multi-modal tasks.

2605.27893 2026-05-28 cs.CV

SIGMA: Bridging Structural and Distributional Gaps for Vision Foundation Model Adaptation

SIGMA:弥合视觉基础模型适应的结构与分布差距

Lingyu Xiong, Jinjin Shi, Xuran Xu, Cong Luo, Runyu Shi, Ying Huang

发表机构 * Xiaomi Corporation(小米公司)

AI总结 提出SIGMA方法,通过尺度自适应融合和语义调制模块,以1.72%可训练参数实现视觉基础模型在密集预测任务上的高效微调,性能优于现有PEFT方法。

详情
AI中文摘要

视觉基础模型(VFM)展示了令人印象深刻的表示能力。然而,通过全微调将它们适应到下游任务会带来高昂的计算和存储开销。参数高效微调(PEFT)作为一种有吸引力的替代方案应运而生,旨在以最小的训练成本实现与全微调相当的性能。尽管如此,将PEFT应用于VFM进行密集预测任务仍然具有挑战性,因为存在结构和分布差距。为了弥合这些差距,我们提出了尺度集成全局调制适配器(SIGMA),一种新颖的轻量级PEFT方法,它由两个模块组成:尺度自适应融合和语义调制。具体来说,尺度自适应融合模块用于通过增强多粒度视觉信息的提取来弥合结构差距。此外,SIGMA在融合特征上引入语义调制以执行全局特征对齐,进一步消除分布差距。这种设计促进了统一的空间和分布适应,相对于VFM骨干网络仅需1.72%的可训练参数。在各种下游密集任务和多个VFM骨干网络上的全面实验表明,SIGMA在性能上一致且优于最先进的PEFT方法。

英文摘要

Vision Foundation Models (VFMs) have demonstrated impressive representational capabilities. However, adapting them to downstream tasks via full fine-tuning incurs prohibitive computational and storage overhead. Parameter-Efficient Fine-Tuning (PEFT) has emerged as a compelling alternative, aiming to achieve performance parity with full fine-tuning at minimal training costs. Nonetheless, applying PEFT to VFMs for dense prediction tasks remains challenging due to the structural and distributional gaps. To bridge these gaps, we propose Scale-Integrated Global Modulation Adapter (SIGMA), a novel lightweight PEFT method, which consists of two modules: scale-adaptive fusion and semantic modulation. Specifically, the scale-adaptive fusion module is utilized to bridge structural gaps by enhancing the extraction of multi-granularity visual information. Furthermore, SIGMA introduces semantic modulation on the fusion features to perform global feature alignment to further eliminate the distribution gap. This design facilitates unified spatial and distributional adaptation, requiring only 1.72% trainable parameters relative to the VFM backbone. Comprehensive experiments across various downstream dense tasks and multiple VFM backbones demonstrate that SIGMA achieves consistent and superior performance over state-of-the-art PEFT methods.

2605.27892 2026-05-28 cs.LG

FedEHR-Gen: Federated Synthetic Time-Series EHR Generation via Latent Space Alignment and Distribution-Aware Aggregation

FedEHR-Gen: 通过潜在空间对齐和分布感知聚合的联邦合成时间序列电子健康记录生成

Jun Bai, Ziyang Song, Yue Li

发表机构 * McGill University(麦吉尔大学) Mila - Quebec AI Institute(魁北克人工智能研究所) Ohio University(俄亥俄大学)

AI总结 提出FedEHR-Gen,首个用于跨分布式医院合成时间序列电子健康记录的联邦框架,通过两阶段学习(联邦自编码器对齐潜在空间和联邦TCVAE分布感知聚合)解决高维稀疏和异质性挑战,在eICU和MIMIC-III上达到与集中训练相当的生成质量。

Comments 8 pages main paper with 14 pages supplementary appendix

详情
AI中文摘要

合成电子健康记录(EHR)生成为隐私受限的医疗环境中的数据增强和跨医院建模提供了一条有前景的途径。然而,大多数现有的EHR生成模型是集中式的,需要汇集各医院的数据,这在现实世界中数据共享受限时往往不可行。虽然联邦EHR生成提供了一种自然的解决方案,但由于EHR数据的高维性、稀疏性和跨医院异质性,直接的联邦建模常常崩溃或发散。在这项工作中,我们提出了FedEHR-Gen,这是首个用于跨分布式医院合成时间序列EHR生成的联邦框架。FedEHR-Gen采用两阶段学习范式。首先,我们引入了一个联邦自编码器,将高维稀疏的EHR特征投影到紧凑的潜在空间。为了确保跨医院的语义一致性,我们开发了一种逐层匹配聚合机制,将局部编码器对齐到统一的全局潜在空间。其次,在这个对齐的潜在空间上,我们训练了一个具有分布感知聚合的联邦时间条件变分自编码器(TCVAE),从而在严重的跨医院异质性下实现稳定的时间生成建模。在eICU和MIMIC-III数据集上的大量实验表明,FedEHR-Gen在生成保真度、下游效用和隐私风险方面与集中训练相当,同时始终优于标准的联邦基线。

英文摘要

Synthetic Electronic Health Record (EHR) generation provides a promising avenue for data augmentation and cross-hospital modeling in privacy-constrained healthcare settings. However, most existing EHR generative models are centralized and require pooling data across hospitals, which is often infeasible when real-world data sharing is restricted. While federated EHR generation offers a natural solution, direct federated modeling often collapses or diverges due to the high dimensionality, sparsity, and cross-hospital heterogeneity of EHR data. In this work, we propose FedEHR-Gen, the first federated framework for synthetic time-series EHR generation across distributed hospitals. FedEHR-Gen uses a two-stage learning paradigm. First, we introduce a federated autoencoder that projects high-dimensional and sparse EHR features onto a compact latent space. To ensure semantic consistency across hospitals, we develop a layer-wise matching aggregation mechanism that aligns local encoders into a unified global latent space. Second, operating on this aligned latent space, we train a federated temporal conditional variational autoencoder (TCVAE) with distribution-aware aggregation, enabling stable temporal generative modeling under severe cross-hospital heterogeneity. Extensive experiments on the eICU and MIMIC-III datasets demonstrate that FedEHR-Gen achieves generation fidelity, downstream utility, and privacy risk comparable to centralized training, while consistently outperforming the standard federated baseline.

2605.27891 2026-05-28 cs.CV cs.AI

SmartDirector: Keyframe-Conditioned Cinematic Video Generation with Narrative Pacing Control

SmartDirector: 基于关键帧的叙事节奏可控电影视频生成

Zhida Zhang, Jie Ma, Zhan Peng, Haoxue Wu, Yang Han, Jun Liang, Jie Cao, Jing Li

发表机构 * Youku Moku-Lab(优酷莫酷实验室)

AI总结 提出SmartDirector框架,通过多关键帧条件控制视频生成中的叙事结构和时间节奏,采用两阶段方法(Director-Gen生成低分辨率视频,Director-SR利用高分辨率关键帧细化细节),显著优于现有方法。

详情
AI中文摘要

视频的叙事质量从根本上决定了其感知价值。尽管现有的视频生成方法可以生成视觉上吸引人的内容,但它们主要依赖于稀疏的条件信号,如文本提示或首尾帧,这限制了对叙事结构和时间节奏的精确控制。在本文中,我们提出了SmartDirector,一个通过多个关键帧增强视频生成模型叙事能力的框架。SmartDirector支持灵活的生成场景,包括单镜头生成、多镜头叙事合成和视频扩展。该框架分两个阶段运行:Director-Gen根据提供的关键帧生成低分辨率视频,Director-SR通过利用高分辨率关键帧作为语义锚点来恢复细粒度细节,从而优化输出。为了实现鲁棒的多关键帧训练,我们构建了一个数据管道,从电影中策划单镜头和多镜头序列。大量实验表明,SmartDirector显著优于现有的最先进方法。我们将发布代码以促进进一步研究。

英文摘要

The narrative quality of a video fundamentally determines its perceptual value. Although existing video generation methods can produce visually appealing content, they predominantly rely on sparse conditioning signals such as text prompts or first/last frames, which limits precise control over narrative structure and temporal pacing. In this paper, we propose SmartDirector, a framework that enhances the narrative capacity of video generation models through multiple keyframes. SmartDirector supports flexible generation scenarios including single-shot generation, multi-shot narrative synthesis, and video extension. The framework operates in two stages: Director-Gen generates a low-resolution video conditioned on the provided keyframes, and Director-SR refines the output by exploiting high-resolution keyframes as semantic anchors to recover fine-grained details. To enable robust multi-keyframe training, we construct a data pipeline that curates single-shot and multi-shot sequences from movies. Extensive experiments demonstrate that SmartDirector substantially outperforms existing state-of-the-art approaches. We will release the code to facilitate further research.

2605.27886 2026-05-28 cs.RO

Tabero: Learning Gentle Manipulation with Closed-Loop Force Feedback from Vision, Touch, and Language

Tabero: 通过视觉、触觉和语言闭环力反馈学习轻柔操作

Qiwei Wu, Rui Zhang, Xin Xiang, Tao Li, Weihua Zhang, Junjie Lai, Renjing Xu

发表机构 * Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Nvidia, Beijing, China(英伟达(北京,中国))

AI总结 针对现有视觉-语言-动作模型缺乏触觉反馈导致无法实现轻柔操作的问题,提出Tabero基准和模型套件,通过数据高效管道生成视觉-触觉-语言任务,并采用解耦力-位置命令接口的Tabero-VTLA架构,在保持高任务成功率的同时将轻柔指令下的平均夹持力降低70%以上。

Comments Code:https://github.com/NathanWu7/Tabero

详情
AI中文摘要

触觉感知对于机器人实现类人轻柔操作至关重要。然而,现有的视觉-语言-动作(VLA)模型由于缺乏对齐的视觉-触觉-语言数据以及有效的闭环力反馈机制,难以利用触觉反馈进行轻柔操作。为解决这些挑战,我们引入了Tabero,一个用于轻柔、语言条件化机器人操作的基准和模型套件,该操作要求细粒度的接触力感知。首先,Tabero基准通过提出一种数据高效的管道来解决触觉数据稀缺问题,该管道重新利用开源机器人操作轨迹生成多样化的视觉-触觉-语言任务,并建立了一个多维评估协议,同时衡量任务成功率和物理交互质量。其次,我们提出了Tabero-VTLA,一种具有解耦力-位置命令接口的架构;生成的力-位置命令由固定的混合控制器执行,以实现实时的力感知操作。在Tabero上评估,我们的模型在保持高任务成功率的同时,在轻柔指令下将平均夹持力降低了70%以上,展示了其基于多模态经验调节交互力的能力。我们的代码公开在 https://github.com/NathanWu7/Tabero。

英文摘要

Tactile sensing is essential for robots to achieve human-like gentle manipulation. However, existing Vision-Language-Action (VLA) models struggle to exploit tactile feedback for gentle manipulation due to scarce aligned vision-tactile-language data and the lack of effective closed-loop force feedback mechanisms. To address these challenges, we introduce Tabero, a benchmark and model suite for gentle, language-conditioned robotic manipulation that demands fine-grained contact force perception. First, the Tabero benchmark addresses the scarcity of tactile data by presenting a data-efficient pipeline that repurposes open-source robot manipulation trajectories to generate diverse vision-tactile-language tasks, and establishes a multidimensional evaluation protocol that measures task success alongside physical interaction quality. Second, we propose Tabero-VTLA, an architecture with a decoupled force-position command interface; the resulting force-position commands are executed by a fixed hybrid controller to enable real-time, force-aware manipulation. Evaluated on Tabero, our model maintains high task success while reducing average grip force by over 70% under gentle instructions, demonstrating its ability to modulate interaction forces based on multimodal experience. Our code is publicly available at https://github.com/NathanWu7/Tabero.

2605.27885 2026-05-28 cs.CV

Reflective Dialogue between Teacher and Solver Agents for Video Question Answering

教师与求解器智能体之间的反思性对话用于视频问答

Takuya Murakawa, Toru Tamaki

发表机构 * Nagoya Institute of Technology(名古屋工业大学)

AI总结 提出一种仅通过推理时上下文注入的适应方法,利用教师与求解器智能体之间的反思性对话(RD)来提升视频问答性能,在EgoCross基准上超越零样本和标准上下文学习,获得CVPR 2026 EgoVis Workshop跨域挑战赛开源赛道第三名。

Comments This paper serves as the technical report for the 1st Cross-Domain EgoCross Challenge @ EgoVis Workshop, CVPR 2026

详情
AI中文摘要

已经提出了各种方法来使视觉语言模型(VLM)适应视频问答的专门领域,包括微调和上下文学习。然而,在推理阶段仅从少量标记支持集获取任务特定知识而不进行微调仍然是一个挑战。在本文中,我们提出了一种仅通过推理时上下文注入来实现适应的方法。我们的方法首先构建一个反思性对话(RD)——两个智能体之间的多轮对话,其中教师提出每个支持问题并提供正确性反馈,求解器回答并提供视觉基础解释(或反思)以说明正确和错误的答案。然后,该对话历史在推理阶段用作上下文。在EgoCross基准上的实验表明,我们的方法优于基线零样本设置和直接传递支持集示例的标准上下文学习方法,在CVPR 2026 EgoVis Workshop的第一届跨域EgoCross挑战赛开源赛道中获得第三名,本文也作为该挑战赛的技术报告。

英文摘要

Various approaches have been proposed to adapt Vision-Language Models (VLMs) to specialized domains for Video Question Answering, including fine-tuning and in-context learning. However, acquiring task-specific knowledge at the inference phase from only a small labeled support set without fine-tuning remains a challenge. In this paper, we propose a method that achieves adaptation solely through inference-time context injection. Our method first constructs a Reflective Dialogue (RD) -- a multi-turn conversation between two agents, in which Teacher poses each support question and delivers correctness feedback, and Solver answers and provides visual grounding explanations (or reflections) for both correct and incorrect answers. This dialogue history is then used as context at the inference phase. Experiments on the EgoCross benchmark demonstrate that our method outperforms both a baseline zero-shot setting and a standard in-context learning approach that passes support set examples directly, achieving 3rd place in the Open-source Track of the 1st Cross-Domain EgoCross Challenge at the CVPR 2026 EgoVis Workshop, for which this paper also serves as a technical report.
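
The Reflective Dialogue construction described above can be sketched as a loop over the support set that accumulates Teacher and Solver turns. This is a text-only illustration under stated assumptions: the turn order, the message format, and the names `build_reflective_dialogue` and `toy_solver` are hypothetical, and the real Solver would be a VLM that sees the video rather than a stub.

```python
from typing import Callable, Dict, List, Tuple

Message = Dict[str, str]


def build_reflective_dialogue(
    support_set: List[Tuple[str, str]],
    solver: Callable[[str, List[Message]], Tuple[str, str]],
) -> List[Message]:
    """Build a multi-turn Teacher/Solver history from labeled support items."""
    history: List[Message] = []
    for question, gold_answer in support_set:
        # Teacher poses the support question.
        history.append({"role": "teacher", "content": question})
        # Solver answers and explains (stands in for a VLM call).
        answer, explanation = solver(question, history)
        history.append({"role": "solver", "content": f"{answer}. {explanation}"})
        # Teacher delivers correctness feedback.
        if answer == gold_answer:
            feedback = "Correct."
        else:
            feedback = f"Incorrect. The correct answer is {gold_answer}."
        history.append({"role": "teacher", "content": feedback})
    return history


def toy_solver(question: str, history: List[Message]) -> Tuple[str, str]:
    """Hypothetical stub that always answers 'A' with a fixed explanation."""
    return "A", "The visible object matches option A."


support = [
    ("Which tool is used? (A) scalpel (B) saw", "A"),
    ("Which animal appears? (A) cat (B) dog", "B"),
]
dialogue = build_reflective_dialogue(support, toy_solver)
```

At inference time, a history built this way would be prepended as context before the test question, with no weight updates.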

2605.27884 2026-05-28 cs.CV

A Road-Conditioned Traffic Movie Prediction Network with Spatiotemporal and Structure-Consistent Learning

一种基于道路条件的交通电影预测网络,具有时空和结构一致性学习

Joshua Kofi Asamoah, Blessing Agyei Kyem, Armstrong Aboah

发表机构 * North Dakota State University(北达科他州立大学)

AI总结 提出RCSNet,一种基于道路条件的时空网络,通过拓扑引导的未来状态生成和结构一致性学习,提高跨城市交通预测的准确性和结构一致性。

Comments 22 pages (double column), 7 Tables, 11 Figures

详情
AI中文摘要

城市范围的交通预测对于拥堵管理、路线引导和智能交通系统至关重要,但当未来交通必须作为整个城市网络的空间地图生成时,准确预测仍然具有挑战性。现有的交通电影预测方法提高了帧级精度,但许多方法仍主要将预测视为图像重建。这可能会产生数值上接近真实值但受道路布局、连通性、行驶方向和拥堵传播约束较弱的交通地图,尤其是在交通行为和道路结构都发生变化的跨城市场景中。为了解决这一局限性,本研究提出了RCSNet,一种基于道路条件的时空网络,将交通电影预测重新表述为拓扑引导的未来状态生成。RCSNet从静态道路地图中提取道路感知表示,从历史观测中建模多时域交通动态,将方向性交通特征与局部道路结构对齐,并逐步生成未来交通地图以提高时间一致性。结构一致性学习目标进一步鼓励预测保持准确、与道路对齐且空间稳定。跨多个城市的实验表明,RCSNet提高了预测准确性和结构一致性。在柏林、安特卫普和莫斯科的同城预测中,与最接近的基线相比,RCSNet平均MAE、MSE和RMSE分别降低了11.5%、10.0%和5.1%。在未见过的芝加哥和曼谷的跨城市测试中,无需目标城市微调,RMSE分别降低了10.6%和10.5%。额外的时域、道路结构、可解释性、统计和效率分析表明,RCSNet产生了更准确、可迁移、与道路对齐且计算高效的交通预测。

英文摘要

City-wide traffic forecasting is important for congestion management, route guidance, and intelligent transportation systems, but accurate prediction remains challenging when future traffic must be generated as spatial maps over an entire urban network. Existing traffic movie prediction methods have improved frame-level accuracy, yet many still treat forecasting mainly as image reconstruction. This can produce traffic maps that are numerically close to the ground truth but weakly constrained by road layout, connectivity, travel direction, and congestion propagation, especially in cross-city settings where both traffic behavior and road structure change. To address this limitation, this study proposes RCSNet, a road-conditioned spatiotemporal network that reformulates traffic movie prediction as topology-guided future-state generation. RCSNet extracts road-aware representations from static road maps, models multi-horizon traffic dynamics from historical observations, aligns directional traffic features with local road structure, and progressively generates future traffic maps for improved temporal consistency. A structure-consistent learning objective further encourages predictions to remain accurate, road-aligned, and spatially stable. Experiments across multiple cities show that RCSNet improves both forecasting accuracy and structural consistency. In same-city forecasting on Berlin, Antwerp, and Moscow, RCSNet reduces average MAE, MSE, and RMSE by 11.5%, 10.0%, and 5.1%, respectively, compared with the closest baseline. In cross-city testing on unseen Chicago and Bangkok, it reduces RMSE by 10.6% and 10.5% without target-city fine-tuning. Additional horizon-wise, road-structure, explainability, statistical, and efficiency analyses show that RCSNet produces more accurate, transferable, road-aligned, and computationally efficient traffic forecasts.
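
The three error metrics reported above are related: RMSE is the square root of MSE, so on a single evaluation set a relative MSE reduction of r implies an RMSE reduction of 1 - sqrt(1 - r). The sketch below shows the standard definitions on made-up numbers (not data from the paper). The reported figures are averages over cities, so the identity need not hold exactly for them.

```python
import math


def mae(pred, true):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def mse(pred, true):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


def rmse(pred, true):
    """Root mean squared error."""
    return math.sqrt(mse(pred, true))


def relative_reduction(baseline, model):
    """Relative improvement of `model` over `baseline` (lower is better)."""
    return (baseline - model) / baseline


def rmse_reduction_from_mse(mse_reduction):
    """RMSE reduction implied by an MSE reduction on the same evaluation set."""
    return 1.0 - math.sqrt(1.0 - mse_reduction)


# Toy values, purely illustrative.
true = [10.0, 20.0, 30.0, 40.0]
baseline_pred = [12.0, 18.0, 33.0, 44.0]
model_pred = [11.0, 19.0, 32.0, 43.0]

print(relative_reduction(mae(baseline_pred, true), mae(model_pred, true)))
print(rmse_reduction_from_mse(0.10))  # roughly 0.051
```

Under this identity, a 10.0% MSE reduction corresponds to roughly a 5.1% RMSE reduction, which is in line with the same-city numbers quoted in the abstract.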

2605.27882 2026-05-28 cs.CL cs.AI

VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild

VibeSearchBench:野外长期主动搜索的基准测试

Xiaohongshu Inc

发表机构 * Xiaohongshu Dots Studio & Unipat AI(小红书 Dots Studio 与 Unipat AI)

AI总结 针对现有搜索基准中查询过于明确、单轮交互和固定模式评估导致用户体验与评估结果差距的问题,提出VibeSearch范式并构建VibeSearchBench基准,通过渐进式用户模拟和图匹配评估框架测试前沿模型,发现所有模型在长期上下文推理、主动意图激发和结构化知识构建方面仍存在显著不足。

详情
AI中文摘要

基于LLM的智能体在搜索基准上得分很高,但真实用户始终觉得结果不令人满意,这揭示了持续的评估-体验差距。我们将这一差距归因于现有基准依赖于过度明确的查询、单轮交互和固定模式评估,这些都不反映真实搜索行为——用户和智能体通过多轮对话协作细化模糊意图。我们将这种范式称为VibeSearch,并引入VibeSearchBench,一个包含200个手动策划的双语(中文和英文)任务的基准,涵盖20个领域,分为VibeSearch-Pro(专业)和VibeSearch-Daily(日常生活)子集。每个任务将一个用户角色与一个无模式的真实知识图谱配对,并通过渐进式披露用户模拟器和图匹配评估框架进行评估。我们在ReAct框架和OpenClaw智能体框架下对七个前沿模型进行了基准测试。结果表明,所有模型对于VibeSearch仍然严重不足(最佳F1:30.30),凸显了在长期上下文推理、主动意图激发和结构化知识构建方面需要根本性进展。

英文摘要

LLM-based agents score well on search benchmarks, yet real users consistently find results unsatisfying, revealing a persistent evaluation-experience gap. We attribute this gap to existing benchmarks' reliance on over-specified queries, single-turn interactions, and fixed-schema evaluation, none of which reflect real search behavior where users and agents collaboratively refine vague intent through multi-turn dialogue. We term this paradigm VibeSearch and introduce VibeSearchBench, a benchmark comprising 200 manually curated bilingual (Chinese and English) tasks across 20 domains, split into VibeSearch-Pro (professional) and VibeSearch-Daily (daily-life) subsets. Each task pairs a user persona with a schema-free ground-truth knowledge graph, and is evaluated through a progressive-disclosure user simulator and a graph-matching evaluation framework. We benchmark seven frontier models under both the ReAct framework and the OpenClaw agent harness. Results show that all models remain substantially inadequate for VibeSearch (best F1: 30.30), highlighting the need for fundamental advances in long-context reasoning, proactive intent elicitation, and structured knowledge construction.
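
To make the F1 figure concrete, one simple way to score a predicted knowledge graph against a ground-truth graph is F1 over exactly matching (subject, relation, object) triples. This is a simplified stand-in, not the benchmark's graph-matching framework, which the abstract does not specify and which must handle schema-free graphs. The function name and the toy triples are hypothetical.

```python
def triple_f1(predicted, gold):
    """F1 over exactly matching (subject, relation, object) triples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy graphs, purely illustrative.
gold = {
    ("hotel_a", "located_in", "kyoto"),
    ("hotel_a", "price_per_night", "120"),
    ("hotel_a", "has", "onsen"),
    ("hotel_a", "near", "station"),
}
predicted = {
    ("hotel_a", "located_in", "kyoto"),
    ("hotel_a", "has", "onsen"),
    ("hotel_a", "has", "pool"),
}

f1 = triple_f1(predicted, gold)  # precision 2/3, recall 2/4
```

Exact matching is deliberately strict. A schema-free setting would likely need soft matching of entity and relation names, which this sketch leaves out.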

2605.27879 2026-05-28 cs.AI

Towards Faithful Agentic XAI: A Verification Method and an Open-World Benchmark for Better Model Faithfulness

Jaechang Kim, Sunung Mun, Seungjoon Lee, Jaewoong Cho, Jungseul Ok

Affiliations: Graduate School of AI, POSTECH; Department of Computer Science and Engineering, POSTECH; Krafton

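AI summary: Proposes FAX, a framework that decomposes explanations into claims and cross-checks them against faithful tools through explicit verification, along with CRAFTER-XAI-Bench, an open-world reinforcement learning benchmark on which FAX raises simulation faithfulness from 0.20 to 0.46.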

Abstract

Explainable AI (XAI) helps users interpret model behavior and identify potential faults. Agentic XAI systems use Large Language Models (LLMs) to make explanations more accessible through natural-language interaction, but they can also produce plausible yet unfaithful explanations. This risk arises because unreliable XAI outputs for complex models can be amplified by LLMs and mislead users. We propose Faithful Agentic XAI (FAX), a framework that improves explanation faithfulness through explicit verification. FAX decomposes draft explanations into claims and cross-checks them against inherently faithful tools, filtering unsupported or contradictory claims before final generation. We also introduce CRAFTER-XAI-Bench, an open-world reinforcement learning benchmark with complex policies, diverse goals, and challenging scenarios for assessing model-specific faithfulness. On CRAFTER-XAI-Bench, FAX improves simulation faithfulness from 0.20 for the strongest baseline to 0.46 while maintaining high informativeness, relevance, and fluency. On three tabular benchmarks, FAX performs competitively with prior Agentic XAI baselines, but our analysis shows that these settings can conflate task accuracy with model-specific faithfulness. These findings show that explicit verification is essential for faithful Agentic XAI and that faithfulness benchmarks must be designed to test explanations against the behavior of the target model itself.
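The decompose, cross-check, and filter flow described in the abstract can be pictured with a small sketch. Everything below is an assumption made for illustration: the sentence-level claim splitter, the three verdict labels, and the `check_claim` callback standing in for a faithful tool are hypothetical and are not taken from the FAX implementation.

```python
from typing import Callable, Dict, List

# Verdict labels are assumed for this sketch.
SUPPORTED = "supported"
UNSUPPORTED = "unsupported"
CONTRADICTED = "contradicted"


def split_into_claims(draft: str) -> List[str]:
    """Naive decomposition: treat each sentence as one claim.

    A real system would likely use an LLM here; splitting on '.' also breaks on decimals.
    """
    return [part.strip() for part in draft.split(".") if part.strip()]


def verify_and_filter(draft: str, check_claim: Callable[[str], str]) -> List[str]:
    """Keep only claims the checker marks as supported; drop the rest before final generation."""
    return [claim for claim in split_into_claims(draft) if check_claim(claim) == SUPPORTED]


# Toy stand-in for a faithful tool: a lookup of facts about one policy rollout.
tool_verdicts: Dict[str, str] = {
    "The agent collected wood first": SUPPORTED,
    "The agent avoided the zombie": CONTRADICTED,
}


def toy_checker(claim: str) -> str:
    # Anything the tool cannot confirm counts as unsupported.
    return tool_verdicts.get(claim, UNSUPPORTED)


draft_explanation = (
    "The agent collected wood first. "
    "The agent avoided the zombie. "
    "The agent prefers daytime."
)
kept_claims = verify_and_filter(draft_explanation, toy_checker)
```

Here only the first claim survives, since the second is contradicted and the third has no tool evidence. The final explanation would then be generated from `kept_claims` alone.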

2605.27878 2026-05-28 cs.CL

Narrative Flattening: How Post-Training Compresses Thematic, Affective, and Stylistic Variation in LLM Fiction

Zehan Li, Yutong Zhu, Siyang Wu, Honglin Bao, James A. Evans

Affiliations: Knowledge Lab, University of Chicago

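AI summary: Compares story continuations from four OLMo 32B checkpoints (Base, SFT, DPO, RLVR) across three story domains and finds that post-training compresses thematic motion, affective intensity, and linguistic diversity, producing narrative flattening that is most severe for professional literary fiction.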

Abstract

Large language models produce fluent fiction, yet their creative output is widely seen as flat. We ask where this quality originates in training and whether it affects different domains of human fiction equally. We construct a matched story-continuation paradigm across StoryStar (public-platform), TMAS (prompt-guided), and The New Yorker (professional literary), and compare continuations from four OLMo 32B checkpoints (Base, SFT, DPO, RLVR) against matched human text. Because these checkpoints share architecture, scale, tokenizer, and pretraining, the design isolates the post-training effect. We measure each continuation along three sentence-level dimensions: thematic motion, affective prevalence, and linguistic diversity. Across all three, post-training compresses dynamic variation: thematic transitions become more uniform, high-intensity emotions give way to neutrality, and stylistic diversity across stories shrinks. We term this progressive loss narrative flattening. The effect is directionally stable across story domains, but gap size depends on the human baseline: professional literary fiction is compressed most, while public-platform and prompt-guided stories show smaller gaps, consistent with their human baselines sitting closer to the model's default rhythm. Post-trained endpoints converge across domains, suggesting alignment produces a continuation regime largely insensitive to the source domain's narrative texture.

2605.27877 2026-05-28 cs.LG cs.AI

SPAR: Support-Preserving Action Rectification

Jiaxin Zhao, Weihang Pan, Xun Liang, Binbin Lin

Affiliations: Zhejiang University

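AI summary: Proposes Support-Preserving Action Rectification (SPAR), which recasts global learning as local residual rectification and adds a Latent Self-Imitation mechanism to resolve the conflict between value maximization and data-distribution fitting in offline policy improvement, reaching state-of-the-art performance on the D4RL benchmark.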

Abstract

Offline policy improvement faces an inherent conflict between maximizing value and fitting the data distribution. While in-sample weighted regression is stable, it suffers from over-conservatism that suppresses high-value actions in the distribution tail; conversely, gradient-based approaches often exhibit a fitting-optimization conflict of gradients, which drives the policy off the data manifold. To address this, we propose Support-Preserving Action Rectification (SPAR), which reframes global learning as a local residual rectification anchored to a frozen pure behavior cloning policy. This framework performs fine-grained fitting and local policy improvement in the residual space, thereby contracting the search space. We further introduce Latent Self-Imitation, utilizing a latent-sampling weighted-regression mechanism to address fitting-improvement gradient conflict in the residual space. Theoretically, we prove this mechanism eliminates the manifold-normal drift of standard value gradients, while extensive D4RL experiments show SPAR extracts significant gains from suboptimal baselines to achieve state-of-the-art performance.
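The idea of anchoring a local residual correction to a frozen behavior cloning policy can be sketched as follows. This is a minimal illustration under stated assumptions, not the SPAR method itself: the tanh squashing, the `max_shift` bound, the action range of [-1, 1], and all function names are hypothetical choices, and the Latent Self-Imitation mechanism is not modeled here.

```python
import math
from typing import Callable, List, Sequence

Policy = Callable[[Sequence[float]], List[float]]
Residual = Callable[[Sequence[float], Sequence[float]], List[float]]


def rectified_action(
    state: Sequence[float],
    base_policy: Policy,      # frozen behavior cloning policy, never updated
    residual_fn: Residual,    # learned correction; outputs unbounded raw values
    max_shift: float = 0.1,   # assumed cap on how far an action may move from the anchor
    low: float = -1.0,
    high: float = 1.0,
) -> List[float]:
    """Return the base action plus a bounded residual, clipped to the action range."""
    base_action = base_policy(state)
    raw_residual = residual_fn(state, base_action)
    corrected = []
    for anchor, raw in zip(base_action, raw_residual):
        shift = max_shift * math.tanh(raw)  # squashed into [-max_shift, max_shift]
        corrected.append(min(high, max(low, anchor + shift)))
    return corrected


# Toy stand-ins for trained networks.
def toy_base_policy(state: Sequence[float]) -> List[float]:
    return [0.5, -0.95]


def toy_residual(state: Sequence[float], action: Sequence[float]) -> List[float]:
    return [100.0, -100.0]  # extreme raw outputs to show the bound at work


action = rectified_action([0.0, 0.0], toy_base_policy, toy_residual)
```

Even with extreme residual outputs, the first action component moves by at most `max_shift` and the second is clipped at the lower action bound. Keeping the search within a small neighborhood of the cloned action is one simple way to picture a contracted search space.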