arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
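2605.07415 2026-06-10 cs.CV cs.CL Version update

ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse referring clues and Multi-Target Referring

ChartREG++: A Benchmark and Improvements for Chart Referring Expression Grounding with Diverse Referring Clues and Multi-Target Referring

Tianhao Niu, Ziyu Han, Xuan Dong, Qingfu Zhu, Wanxiang Che

Affiliations * Research Center for Social Computing and Interactive Robotics

AI summary To address the limitations of existing chart referring expression grounding benchmarks, proposes a benchmark that supports multiple localization forms, multi-target referring, diverse cues, and diverse chart types. A code-driven synthesis pipeline generates pixel-level instance masks, which are used to train an instance segmentation model that is integrated into a multimodal grounding framework, substantially improving performance.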

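AI abstract (translated from Chinese)

Referring expression grounding is a core problem in visual grounding and is widely used to diagnose spatial grounding and reasoning in vision-language models, but prior work has mostly focused on natural images. By contrast, existing benchmarks for chart referring expression grounding are limited: (1) most use bounding boxes, which limits localization precision for fine chart elements; (2) most assume one or two referred target instances and cannot handle multi-instance references; (3) the language expressions rely too heavily on textual cues or data-rank cues; and (4) they cover only a narrow range of chart types. To address these problems, we introduce a chart referring expression grounding benchmark that systematically supports multiple localization forms, multiple referred targets, diverse grounding cues, and multiple chart types. Results on representative multimodal large models reveal a significant performance gap. We further introduce a code-driven synthesis pipeline that exploits the inherent alignment between plotting programs and rendered chart primitives to generate pixel-accurate instance masks across chart element types and granularities. We train an instance segmentation model on the synthesized masks and integrate it into a general-purpose multimodal grounding framework. The final system consistently outperforms baselines on our benchmark and generalizes well to a real-chart grounding benchmark derived from ChartQA.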

English abstract

Referring expression grounding is a core problem in visual grounding and is widely used as a diagnostic of spatial grounding and reasoning in vision and language models, yet most prior work focuses on natural images. In contrast, existing benchmarks related to chart referring expression grounding remain limited: (1) they largely adopt bounding boxes, constraining localization precision for fine chart elements; (2) they mostly assume one or two referred target instances, failing to handle multi-instance target references; (3) the language expressions over-rely on textual cues or data-rank clues; and (4) they cover only a narrow range of chart types. To address these issues, we introduce a chart referring expression grounding benchmark that systematically supports multiple localization forms, multiple referred targets, diverse grounding cues, and diverse chart types. Results across representative multimodal large models reveal a significant performance gap. We further introduce a code-driven synthesis pipeline that exploits the inherent alignment between plotting programs and rendered chart primitives to derive pixel-accurate instance masks across chart element types and granularities. We train an instance segmentation model with the synthesized masks and integrate it into a general-purpose multimodal grounding framework. The resulting system consistently outperforms baselines on our benchmark and generalizes well to a ChartQA-derived real-chart grounding benchmark.

2605.06234 2026-06-10 cs.RO cs.HC Version update

RobotEQ: Transitioning from Passive Intelligence to Active Intelligence in Embodied AI

RobotEQ: The Transition from Passive Intelligence to Active Intelligence in Embodied AI

Kuofei Fang, Xinyi Che, Haomin Ouyang, Shufan Zhang, Xuehao Wang, Qi Liu, Liyi Liu, Chenqi Zhang, Wenxi Cai, Wenyu Dai, Jinyang Wu, Fan Zhang, Haoyu Chen, Bin He, Zheng Lian

Affiliations * State Key Laboratory of Autonomous Intelligent Unmanned Systems, Tongji University; Tsinghua University; The Chinese University of Hong Kong; CMVS, University of Oulu

AI summary Proposes the RobotEQ benchmark to evaluate whether models can understand and follow social norms in embodied scenarios. Experiments show that existing models still fall short on active intelligence, and that RAG techniques can improve performance.

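AI abstract (translated from Chinese)

Embodied AI is a prominent research topic in both academia and industry. Current research concentrates on completing tasks according to explicit user instructions. However, for robots to integrate into human society, they must understand which actions are permitted and which are prohibited, even without explicit instructions. We call user-guided AI passive intelligence and unguided AI active intelligence. This paper introduces RobotEQ, the first benchmark for active intelligence, designed to evaluate whether existing models can understand and follow social norms in embodied scenarios. First, we construct the RobotEQ-Data dataset, which contains 1,894 egocentric images covering 10 representative embodied categories and 56 subcategories. Through extensive manual annotation, we provide 4,944 action judgment questions and 1,157 spatial grounding questions that specify appropriate robot actions in different scenarios. In addition, we establish RobotEQ-Bench to evaluate the performance of state-of-the-art models on this task. Experimental results show that current models still fall short of reliable active intelligence, especially in spatial grounding. Meanwhile, using RAG techniques with an external knowledge base of social norms can generally improve performance. This work helps move robotics from user-guided passive manipulation toward active social compliance.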

English abstract

Embodied AI is a prominent research topic in both academia and industry. Current research centers on completing tasks based on explicit user instructions. However, for robots to integrate into human society, they must understand which actions are permissible and which are prohibited, even without explicit commands. We refer to user-guided AI as passive intelligence and to unguided AI as active intelligence. This paper introduces RobotEQ, the first benchmark for active intelligence, which aims to assess whether existing models can comprehend and adhere to social norms in embodied scenarios. First, we construct RobotEQ-Data, a dataset consisting of 1,894 egocentric images spanning 10 representative embodied categories and 56 subcategories. Through extensive manual annotation, we provide 4,944 action judgment questions and 1,157 spatial grounding questions, specifying appropriate robot actions across diverse scenarios. Furthermore, we establish RobotEQ-Bench to evaluate the performance of state-of-the-art models on this task. Experimental results demonstrate that current models still fall short of reliable active intelligence, particularly in spatial grounding. Meanwhile, leveraging RAG techniques to incorporate external social norm knowledge bases can generally enhance performance. This work can facilitate the transition of robotics from user-guided passive manipulation to active social compliance.

2605.05857 2026-06-10 cs.LG Version update

Offline Reinforcement Learning for Rotation Profile Control in Tokamaks

Offline Reinforcement Learning for Tokamak Rotation Profile Control

Rohit Sonker, Hiro Josep Farre Kaga, Jiayu Chen, Andrew Rothstein, Ian Char, Ricardo Shousha, Egemen Kolemen, Jeff Schneider

Affiliations * Robotics Institute, Carnegie Mellon University; Princeton University; Princeton Plasma Physics Lab; The University of Hong Kong; Lila Sciences

AI summary To tackle the difficult problem of controlling the plasma rotation profile in tokamaks, proposes an offline reinforcement learning approach based on historical data that trains a policy on rollouts generated by a probabilistic model, and validates it on the DIII-D tokamak.

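AI abstract (translated from Chinese)

Tokamaks remain the leading candidate devices for practical fusion energy, yet many important control problems inside them are still difficult or unsolved. One such challenge is controlling the plasma rotation profile, which strongly affects stability, confinement, and transport. Although the average rotation can be controlled, controlling the full profile is challenging because of its high dimensionality, its response to multiple actuators, and its dependence on plasma conditions. Learning-based control methods such as reinforcement learning (RL) offer a potential solution: they can model complex interactions and thereby achieve effective multi-input multi-output control. However, learning such policies is difficult because no accurate simulator exists that can model rotation profile dynamics. In this work, we study offline RL and offline model-based RL algorithms for rotation profile control, training them solely on historical data from the DIII-D tokamak. Our final method uses a probabilistic model of plasma dynamics to generate rollouts for RL training. We deploy the policy on the DIII-D tokamak and observe promising real-world results. Finally, we highlight key challenges and insights from training and deploying an RL policy on a complex physical device using limited historical data.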

English abstract

Tokamaks remain leading candidates for achieving practical fusion energy, yet many important control problems inside these devices are still difficult or unsolved. One such challenge is controlling the plasma rotation profile, which strongly influences stability, confinement, and transport. While the average rotation can be controlled, controlling the full profile is challenging due to its high dimensionality, its response to multiple actuators, and its dependence on plasma conditions. Learning-based control methods, such as reinforcement learning (RL), provide a potential solution to this problem, with the ability to model complex interactions, leading to effective multi-input multi-output control. However, learning such policies is challenging due to the lack of accurate simulators that can model the rotation profile dynamics. In this work, we investigate the use of offline RL and offline model-based RL algorithms for rotation profile control, training them solely on historical data from the DIII-D tokamak. Our final method uses probabilistic models of plasma dynamics to generate rollouts for RL training. We deploy this policy on the DIII-D tokamak and observe promising real-world results. We conclude by highlighting key challenges and insights from training and deploying an RL policy on a complex physical device while using only limited past data.

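2605.01248 2026-06-10 cs.LG Version update

$S^3$-R1: Learning to Retrieve and Answer Step-by-Step with Synthetic Data

$S^3$-R1: Learning Step-by-Step Retrieval and Answering from Synthetic Data

Harsh Goel, Akhil Udathu, Susmija Jabbireddy, Pradnesh Kalkar, Atharva Parulekar

Affiliations * The University of Texas at Austin; Google DeepMind

AI summary Proposes the S^3-R1 framework, which uses synthetic data generation and dense reward signals to address sparse rewards and the lack of multi-hop question data in reinforcement learning post-training, improving the model's search and question-answering ability.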

Comments Under Review

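AI abstract (translated from Chinese)

Reinforcement learning post-training has given models new capabilities, such as using tools for search. However, these models face two main limitations: sparse outcome-based rewards, and a lack of training data covering questions of varying difficulty. As a result, models do not use tools to search more deeply for the evidence needed to answer questions. To address these limitations, we introduce S^3-R1 (Synthetic data and stabilized Search R1), a framework that combines a data-driven approach with denser learning signals. We first develop a synthetic generation and curation pipeline that programmatically derives diverse multi-hop questions from existing documents. The pipeline includes a retrieval-based verification step designed specifically to isolate questions of intermediate difficulty. We then pair this expanded training set with a reward structure that evaluates both intermediate search quality and the correctness of the final answer. This setup directly mitigates the credit assignment problem inherent to sparse rewards. Our evaluation shows that S^3-R1 outperforms existing baselines by learning more effective search and synthesis strategies, achieving up to a 10% improvement in robust generalization on out-of-domain datasets.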

English abstract

Reinforcement learning (RL) post-training has enabled new capabilities in models, such as agentic tool use for search. However, these models struggle primarily because of sparse outcome-based rewards and a lack of training data that covers questions of differing hardness, which results in models not performing deeper searches with tools to collect evidence for question answering. To address these limitations, we introduce S^3-R1 (Synthetic data and stabilized Search R1), a framework that couples a data-centric approach with denser learning signals. We first develop a synthetic generation and curation pipeline that programmatically derives diverse, multi-hop questions from existing documents. This pipeline incorporates a retrieval-based verification step to specifically isolate questions of intermediate difficulty. We then pair this expanded training set with a reward structure that evaluates both intermediate search quality and the correctness of the final answer. This setup directly mitigates the credit assignment problems inherent to sparse rewards. Our evaluations show that S^3-R1 outperforms existing baselines by learning more effective search and synthesis strategies, yielding up to a 10% improvement in robust generalization on out-of-domain datasets.

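2605.00809 2026-06-10 cs.CV Version update

Let ViT Speak: Generative Language-Image Pre-training

Letting ViT Speak: Generative Language-Image Pre-training

Yan Fang, Mengcheng Lan, Zilong Huang, Weixian Lei, Yunqing Zhao, Yujie Zhong, Yingchen Yu, Qi She, Yao Zhao, Yunchao Wei

Affiliations * Beijing Jiaotong University; ByteDance; Nanyang Technological University

AI summary Proposes the GenLIP framework, which trains a ViT with a language modeling objective to predict language tokens directly from visual tokens, without contrastive learning or an additional text decoder, yielding a simple, scalable, and high-performing vision encoder.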

Comments 27 pages, 11 figures. Code and models are available at https://github.com/YanFangCS/GenLIP

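AI abstract (translated from Chinese)

In this paper, we present Generative Language-Image Pre-training (GenLIP), a minimalist generative pretraining framework for Vision Transformers (ViTs) designed for multimodal large language models (MLLMs). To better align vision encoders with the autoregressive nature of LLMs, GenLIP trains a ViT to predict language tokens directly from visual tokens using a standard language modeling objective, with no contrastive batch construction and no additional text decoder. The design has three key advantages: (1) simplicity: a single transformer jointly models visual and textual tokens; (2) scalability: it scales effectively as data and model size grow; and (3) performance: it achieves competitive or better results across a range of multimodal benchmarks. After training on 8B samples from Recap-DataComp-1B, GenLIP matches or surpasses strong baselines despite using substantially less pretraining data. After continued pretraining on multi-resolution images at their native aspect ratios, GenLIP further improves on detail-sensitive tasks such as OCR and chart understanding, making it a solid foundation for vision encoders in MLLMs.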

English abstract

In this paper, we present Generative Language-Image Pre-training (GenLIP), a minimalist generative pretraining framework for Vision Transformers (ViTs) designed for multimodal large language models (MLLMs). To better align vision encoders with the autoregressive nature of LLMs, GenLIP trains a ViT to predict language tokens directly from visual tokens using a standard language modeling objective, without contrastive batch construction or an additional text decoder. This design offers three key advantages: (1) Simplicity: a single transformer jointly models visual and textual tokens; (2) Scalability: it scales effectively with both data and model size; and (3) Performance: it achieves competitive or superior results across diverse multimodal benchmarks. Trained on 8B samples from Recap-DataComp-1B, GenLIP matches or surpasses strong baselines despite using substantially less pretraining data. After continued pretraining on multi-resolution images at native aspect ratios, GenLIP further improves on detail-sensitive tasks such as OCR and chart understanding, making it a strong foundation for vision encoders in MLLMs.

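2604.28095 2026-06-10 cs.CV Version update

UHR-Net: An Uncertainty-Aware Hypergraph Refinement Network for Medical Image Segmentation

UHR-Net: An Uncertainty-Aware Hypergraph Refinement Network for Segmenting Medical Images

Shuokun Cheng, Jinghao Shi, Kun Sun

Affiliations * School of Computer Sciences, China University of Geosciences (Wuhan)

AI summary To address blurry lesion boundaries and the difficulty of segmenting small lesions, proposes UHR-Net, which uses uncertainty-oriented instance contrastive pretraining and an uncertainty-guided hypergraph refinement module, achieving consistent gains on five public datasets.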

Comments 12 pages, 4 figures, 4 tables

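AI abstract (translated from Chinese)

Accurate lesion segmentation is essential for clinical diagnosis and treatment planning. However, lesions often resemble the surrounding tissue and have unclear boundaries, which makes predictions unstable in boundary and transition regions. In addition, cues for small lesions can be diluted by multi-scale feature extraction, leading to under-segmentation or over-segmentation. To address these challenges, we propose an Uncertainty-Aware Hypergraph Refinement Network (UHR-Net). First, we introduce an Uncertainty-Oriented Instance Contrastive (UO-IC) pretraining strategy that combines geometry-aware copy-paste augmentation with hard-negative mining of lesion-like background regions to improve instance-level discrimination of small and visually ambiguous lesions. Second, we design an Uncertainty-Guided Hypergraph Refinement (UGHR) module that derives an entropy-based uncertainty map from a coarse probability map to guide hypergraph refinement. By dividing hyperedge prototypes into foreground and background groups, UGHR decouples higher-order interactions and improves refinement in ambiguous regions. Experiments on five public benchmarks show consistent improvements over strong baselines. Code is available at: https://github.com/CUGfreshman/UHR-Net.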

English abstract

Accurate lesion segmentation is crucial for clinical diagnosis and treatment planning. However, lesions often resemble surrounding tissues and exhibit ill-defined boundaries, leading to unstable predictions in boundary/transition regions. Moreover, small-lesion cues can be diluted by multi-scale feature extraction, causing under- or over-segmentation. To address these challenges, we propose an Uncertainty-Aware Hypergraph Refinement Network (UHR-Net). First, we introduce an Uncertainty-Oriented Instance Contrastive (UO-IC) pretraining strategy that couples geometry-aware copy-paste augmentation with hard-negative mining of lesion-like background regions to improve instance-level discrimination for small and visually ambiguous lesions. Second, we design an Uncertainty-Guided Hypergraph Refinement (UGHR) block, which derives an entropy-based uncertainty map from a coarse probability map to guide hypergraph refinement. By splitting hyperedge prototypes into foreground and background groups, UGHR decouples higher-order interactions and improves refinement in ambiguous regions. Experiments on five public benchmarks demonstrate consistent gains over strong baselines. Code is available at: https://github.com/CUGfreshman/UHR-Net.
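As a rough illustration of the entropy-based uncertainty map mentioned in the abstract, the sketch below computes per-pixel binary entropy from a coarse foreground-probability map. It is not the authors' UGHR implementation: the function names, the nested-list image representation, and the base-2 logarithm are assumptions made for this example.

```python
import math


def binary_entropy(p, eps=1e-12):
    """Shannon entropy in bits of a Bernoulli(p) foreground prediction."""
    p = min(max(p, eps), 1.0 - eps)  # clamp so log2 never sees 0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def entropy_uncertainty_map(prob_map):
    """Map each pixel probability to its entropy (hypothetical helper)."""
    return [[binary_entropy(p) for p in row] for row in prob_map]


# Toy 2x2 coarse probability map: confident background, ambiguous pixel,
# fairly confident lesion, very confident lesion.
coarse = [[0.02, 0.50], [0.90, 0.98]]
uncertainty = entropy_uncertainty_map(coarse)
```

Pixels near 0.5 receive the highest uncertainty, which matches the intuition that boundary and transition regions are where refinement is most needed.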

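2604.26991 2026-06-10 cs.LG cs.AI Version update

People-Centred Medical Image Analysis via Fairness-Aware Human-AI Cooperation

People-Centred Medical Image Analysis through Fairness-Aware Human-AI Cooperation

Zheng Zhang, Milad Masroor, Cuong Nguyen, Tahir Hassan, Yuanhong Chen, David Rosewarne, Kevin Wells, Thanh-Toan Do, Gustavo Carneiro

Affiliations * arXiv

AI summary Proposes the PecMan framework, which jointly models subgroup-dependent reliability, decision allocation, and collaborative prediction. A gating and consolidation mechanism dynamically assigns cases to automated models or human experts without requiring sensitive attributes at test time, enabling fairness-aware human-AI cooperative classification.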

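AI abstract (translated from Chinese)

Machine learning models for medical image analysis often show subgroup-dependent performance, which affects how decisions should be allocated between automated systems and human experts when resources are limited. Prior work on AI fairness and human-AI cooperation, including learning to defer (L2D) and learning to complement (L2C), usually treats these problems in isolation. We propose People-Centred Medical Image Analysis (PecMan), a framework for fairness-aware human-AI cooperative classification that jointly models subgroup-dependent reliability, decision allocation, and collaborative prediction. PecMan combines subgroup-specialised predictors with a gating and consolidation mechanism that dynamically assigns cases to automated models, human experts, or a combination of the two, without using sensitive attributes at test time. We also introduce the FairHAI benchmark for evaluating trade-offs among predictive accuracy, subgroup equity, and human involvement. In addition, we give a theoretical analysis of multi-agent gating via selection regret and characterise the fairness-coverage trade-off under input-dependent allocation. Experiments on multiple medical imaging datasets show that PecMan achieves consistently better trade-offs than methods that handle fairness or human-AI cooperation separately.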

English abstract

Machine learning models for medical image analysis often exhibit subgroup-dependent performance, which impacts how decisions should be allocated between automated systems and human experts under limited resources. Prior work on AI fairness and human-AI cooperation, including learning to defer (L2D) and learning to complement (L2C), typically addresses these problems in isolation. We propose People-Centred Medical Image Analysis (PecMan), a framework for fairness-aware human-AI co-operative classification that jointly models subgroup-dependent reliability, decision allocation, and collaborative prediction. PecMan combines subgroup-specialised predictors with a gating and consolidation mechanism that dynamically assigns cases to automated models, human experts, or their combination, without requiring sensitive attributes at test time. We also introduce the FairHAI benchmark for evaluating trade-offs between predictive accuracy, subgroup equity, and human involvement. In addition, we provide a theoretical analysis of multi-agent gating via selection regret and characterise fairness-coverage trade-offs under input-dependent allocation. Experiments across multiple medical imaging datasets demonstrate that PecMan achieves consistently improved trade-offs compared to methods that address fairness or human-AI cooperation separately.

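2604.24668 2026-06-10 cs.AI cs.LG Version update

The Price of Agreement: Measuring LLM Sycophancy in Agentic Financial Applications

The Price of Agreeing: Measuring the Sycophancy of LLMs in Agentic Financial Applications

Zhenyu Zhao, Aparna Balagopalan, Adi Agrawal, Dilshoda Yergasheva, Waseem Alshikh, Daniel M. Bikel

Affiliations * Writer, Inc.

AI summary Evaluates LLM sycophancy in agentic financial tasks, finding that models show only low to modest performance drops under user rebuttals, whereas preference information causes most models to fail; also tests recovery methods such as input filtering.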

Comments Accepted to ICLR 2026 FinAI Workshop

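AI abstract (translated from Chinese)

Given the growing use of LLMs in financial systems today, it is important to evaluate the safety and robustness of such systems. One failure mode that LLMs often show in general-domain settings is sycophancy: models prioritize agreeing with the user's stated beliefs over correctness, which reduces accuracy and trust. In this work, we focus on evaluating the sycophancy that LLMs show in agentic financial tasks. Our findings are threefold. First, we find that models show only low to modest performance drops when users rebut or contradict the reference answer, which distinguishes the sycophancy seen in financial agentic settings from findings in prior work. Second, we introduce a suite of tasks that test for sycophancy through user preference information that contradicts the reference answer, and find that most models fail when given such inputs. Finally, we benchmark different modes of recovery, such as input filtering with a pretrained LLM.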

English abstract

Given the increased use of LLMs in financial systems today, it is important to evaluate the safety and robustness of such systems. One failure mode that LLMs frequently display in general-domain settings is sycophancy: models prioritize agreement with expressed user beliefs over correctness, leading to decreased accuracy and trust. In this work, we focus on evaluating the sycophancy that LLMs display in agentic financial tasks. Our findings are threefold. First, we find that models show only low to modest drops in performance in the face of user rebuttals or contradictions to the reference answer, which distinguishes the sycophancy that models display in financial agentic settings from findings in prior work. Second, we introduce a suite of tasks that test for sycophancy through user preference information that contradicts the reference answer, and find that most models fail in the presence of such inputs. Lastly, we benchmark different modes of recovery, such as input filtering with a pretrained LLM.

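2604.24012 2026-06-10 cs.LG math.OC Version update

FedSLoP: Memory-Efficient Federated Learning with Low-Rank Gradient Projection

FedSLoP: Memory-Efficient Federated Learning Based on Low-Rank Gradient Projection

Yutong He, Zhengyang Huang, Jiahe Geng, Kun Yuan

Affiliations * Peking University; Beihang University

AI summary Proposes the FedSLoP algorithm, which uses stochastic low-rank subspace projection to reduce communication and storage overhead, proves convergence to a first-order stationary point at a rate of $O(1/\sqrt{NT})$, and reports experiments on heterogeneous MNIST in which it outperforms baselines such as FedAvg.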

Comments 27 pages, 7 figures

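AI abstract (translated from Chinese)

Federated learning lets a group of clients collaboratively train machine learning models without exchanging raw data, but standard algorithms such as FedAvg converge slowly and incur high communication and memory costs in heterogeneous, resource-constrained environments. We propose FedSLoP, a federated optimization algorithm that incorporates stochastic low-rank subspace projections of gradients, reducing the dimension of communicated and stored updates while preserving optimization progress. On the theoretical side, we carry out a detailed nonconvex convergence analysis under standard smoothness and bounded-variance assumptions, showing that FedSLoP is guaranteed to converge to a first-order stationary point at a rate of $O(1/\sqrt{NT})$. On the empirical side, we run extensive experiments on federated MNIST classification with heterogeneous data partitions, showing that FedSLoP substantially reduces communication volume and client-side memory compared with FedAvg and representative sparse or low-rank baselines, while achieving competitive or better accuracy. Taken together, our results show that random subspace momentum methods such as FedSLoP offer a principled and effective approach to communication- and memory-efficient federated learning. Code is available at: https://github.com/pkumelon/FedSLoP.git.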

English abstract

Federated learning enables a population of clients to collaboratively train machine learning models without exchanging their raw data, but standard algorithms such as FedAvg suffer from slow convergence and high communication and memory costs in heterogeneous, resource-constrained environments. We introduce FedSLoP, a federated optimization algorithm that incorporates stochastic low-rank subspace projections of gradients, thereby reducing the dimension of communicated and stored updates while preserving optimization progress. On the theoretical side, we develop a detailed nonconvex convergence analysis under standard smoothness and bounded-variance assumptions, showing that FedSLoP is guaranteed to converge to a first-order stationary point at a rate of $O(1/\sqrt{NT})$. On the empirical side, we conduct extensive experiments on federated MNIST classification with heterogeneous data partitions, showing that FedSLoP substantially reduces communication volume and client-side memory while achieving competitive or better accuracy compared with FedAvg and representative sparse or low-rank baselines. Together, our results demonstrate that random subspace momentum methods such as FedSLoP provide a principled and effective approach to communication- and memory-efficient federated learning. Code is available at: https://github.com/pkumelon/FedSLoP.git.
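To make the idea of communicating a gradient through a random low-rank subspace more concrete, here is a minimal generic sketch. It is not the FedSLoP algorithm itself (the abstract does not specify its projection, momentum, or aggregation rules); the Gaussian projection, the shared-seed convention, and all function names are assumptions for illustration only.

```python
import random


def projection_matrix(rank, dim, seed):
    """Gaussian rank x dim matrix P scaled so that E[P^T P] = I.

    Client and server can rebuild the same P from a shared seed, so only
    `rank` coefficients need to be sent (an assumption of this sketch).
    """
    rng = random.Random(seed)
    scale = 1.0 / (rank ** 0.5)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(dim)] for _ in range(rank)]


def compress(P, grad):
    """Project a dim-vector gradient to rank coefficients: c = P g."""
    return [sum(p * g for p, g in zip(row, grad)) for row in P]


def decompress(P, coeffs):
    """Lift coefficients back to a dim-vector: g_hat = P^T c."""
    dim = len(P[0])
    return [sum(P[i][j] * coeffs[i] for i in range(len(P))) for j in range(dim)]


grad = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 0.25, 1.0]
P_client = projection_matrix(rank=2, dim=len(grad), seed=7)
coeffs = compress(P_client, grad)            # 2 numbers sent instead of 8
P_server = projection_matrix(rank=2, dim=len(grad), seed=7)
approx = decompress(P_server, coeffs)        # noisy estimate of grad
```

With rank 2 and dimension 8 the communicated update is four times smaller; the reconstruction is only an unbiased estimate in expectation over P, which is why a convergence analysis such as the one the abstract describes is needed.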

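2604.23443 2026-06-10 cs.CL Version update

Revisiting Greedy Decoding for Visual Question Answering: A Calibration Perspective

Re-examining Greedy Decoding in Visual Question Answering: A Calibration Perspective

Boqi Chen, Xudong Liu, Yunke Ao, Jianing Qiu

Affiliations * ETH Zurich; University of Toronto; MBZUAI

AI summary For visual question answering, shows theoretically from a calibration perspective that greedy decoding is better than stochastic sampling, proposes a greedy decoding method suited to reasoning models, and validates its effectiveness experimentally.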

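AI abstract (translated from Chinese)

Stochastic sampling strategies are widely used in large language models (LLMs) to balance output coherence and diversity. These heuristics are often inherited by multimodal large language models (MLLMs) without task-specific justification. We argue, however, that stochastic decoding may be suboptimal for Visual Question Answering (VQA). VQA is a closed-ended task with head-heavy answer distributions, and its uncertainty is usually epistemic, arising from missing or ambiguous visual evidence rather than from multiple plausible continuations. In this work, we theoretically formalize the relationship between model calibration and predictive accuracy, and derive sufficient conditions for the optimality of greedy decoding. Extensive experiments provide empirical evidence that greedy decoding outperforms stochastic sampling across multiple benchmarks. In addition, we propose Greedy Decoding for Reasoning Models, which outperforms both stochastic sampling and standard greedy decoding in multimodal reasoning scenarios. Overall, our results caution against naively inheriting LLM decoding heuristics in MLLMs and show that greedy decoding can be an efficient yet strong default for VQA.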

English abstract

Stochastic sampling strategies are widely adopted in large language models (LLMs) to balance output coherence and diversity. These heuristics are often inherited by Multimodal LLMs (MLLMs) without task-specific justification. However, we contend that stochastic decoding can be suboptimal for Visual Question Answering (VQA). VQA is a closed-ended task with head-heavy answer distributions where uncertainty is usually epistemic, arising from missing or ambiguous visual evidence rather than plausible continuations. In this work, we provide a theoretical formalization of the relationship between model calibration and predictive accuracy, and derive sufficient conditions for the optimality of greedy decoding. Extensive experiments provide empirical evidence for the superiority of greedy decoding over stochastic sampling across multiple benchmarks. Furthermore, we propose Greedy Decoding for Reasoning Models, which outperforms both stochastic sampling and standard greedy decoding in multimodal reasoning scenarios. Overall, our results caution against naively inheriting LLM decoding heuristics in MLLMs and demonstrate that greedy decoding can be an efficient yet strong default for VQA.
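A toy calculation can show why calibration favours greedy decoding on a closed-ended question. The sketch assumes a perfectly calibrated model (answer i is correct with exactly the probability p_i the model assigns to it) and temperature-1 sampling drawn independently of the truth. These assumptions and the function names are ours, not the paper's formal conditions.

```python
def greedy_accuracy(probs):
    """Greedy picks the argmax answer, which is correct with probability max(p)."""
    return max(probs)


def sampling_accuracy(probs):
    """Sampling picks answer i with probability p_i, and that answer is
    correct with probability p_i, so expected accuracy is sum of p_i squared."""
    return sum(p * p for p in probs)


# A head-heavy answer distribution over three candidate answers.
head_heavy = [0.7, 0.2, 0.1]
acc_greedy = greedy_accuracy(head_heavy)      # 0.7
acc_sampling = sampling_accuracy(head_heavy)  # 0.49 + 0.04 + 0.01 = 0.54
```

Under these assumptions the gap can never favour sampling, because the sum of p_i squared is at most max(p) times the sum of p_i, which equals max(p).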

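2406.14075 2026-06-10 cs.CL Version update

EXCEEDS: Extracting Complex Events via Nugget-based Grid Modeling in Scientific Domain

EXCEEDS: Extracting Complex Events in the Scientific Domain through Nugget-Based Grid Modeling

Yi-Fan Lu, Xian-Ling Mao, Bo Wang, Xiao Liu, Heyan Huang

Affiliations * Beijing Institute of Technology; Microsoft Research Asia

AI summary Given that the scientific domain has dense events and complex information forms, constructs SciEvents, a large-scale multi-event document-level dataset, and proposes the end-to-end framework EXCEEDS, which encodes dense nuggets into a grid matrix and reduces complex event extraction to a nugget-based grid modeling task, achieving state-of-the-art performance.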

Comments Accepted by ACL 2026 Main Conference, Oral

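AI abstract (translated from Chinese)

Understanding a specific domain through its events is essential. Event extraction has been studied extensively in many domains, such as news, finance, and biology. However, event extraction in the scientific domain still lacks comprehensive datasets and tailored methods. Compared with other domains, the scientific domain has two characteristics: (1) denser nuggets and events, and (2) more complex information forms. To address this gap with these two characteristics in mind, we first construct SciEvents, a large-scale multi-event document-level dataset whose schema is tailored to the scientific domain. It contains 2,508 documents and 24,381 events, produced through multi-stage manual annotation and quality control. We then propose EXCEEDS, an end-to-end scientific event extraction framework that encodes dense nuggets into a grid matrix and reduces complex event extraction to a nugget-based grid modeling task. Experiments on SciEvents show that EXCEEDS achieves state-of-the-art performance. Both the SciEvents dataset and the EXCEEDS framework have been released publicly to support future research.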

English abstract

Understanding a specific domain through its events is crucial. Extensive event extraction research has been conducted in many domains, such as news, finance, and biology. However, event extraction in the scientific domain is still insufficiently supported by comprehensive datasets and tailored methods. Compared with other domains, the scientific domain has two characteristics: (1) denser nuggets and events, and (2) more complex information forms. To address this gap with these two characteristics in mind, we first construct SciEvents, a large-scale multi-event document-level dataset with a schema tailored for the scientific domain. It consists of 2,508 documents and 24,381 events produced under multi-stage manual annotation and quality control. Then, we propose EXCEEDS, an end-to-end scientific event extraction framework that encodes dense nuggets into a grid matrix and simplifies complex event extraction into a nugget-based grid modeling task. Experiments on SciEvents demonstrate the state-of-the-art performance of EXCEEDS. Both the SciEvents dataset and the EXCEEDS framework are released publicly to facilitate future research.

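2604.22565 2026-06-10 cs.CL cs.AI Version update

Learning Evidence Highlighting for Frozen LLMs

Learning to Highlight Evidence for Frozen LLMs

Shaoang Li, Yanhang Shi, Yufei Li, Mingfu Liang, Xiaohan Wei, Yunchen Pu, Fei Tian, Chonglin Sun, Frank Shyu, Luke Simon, Sandeep Pandey, Xi Liu, Jian Li

Affiliations * Stony Brook University; Meta AI

AI summary Proposes the HiLight framework, which uses reinforcement learning to train a lightweight Actor that inserts highlight tags into long contexts so that a frozen LLM attends more to key evidence. It needs no evidence labels and no changes to the Solver, and improves performance on sequential recommendation and long-context question answering.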

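AI abstract (translated from Chinese)

Large language models (LLMs) can reason well, but they often miss decisive evidence when it is buried in a long, noisy context. We present HiLight, an Evidence Emphasis framework that decouples evidence selection from the reasoning of a frozen LLM Solver. Rather than compressing or rewriting the input, which can discard or distort evidence, HiLight trains a lightweight Emphasis Actor to insert minimal highlight tags around pivotal spans in the unaltered context. A frozen Solver then performs downstream reasoning on the emphasized input. We treat highlighting as a weakly supervised decision-making problem and optimize the Actor with reinforcement learning using only the Solver's task reward, requiring no evidence labels and no access to or modification of the Solver. On sequential recommendation and long-context question answering, HiLight consistently outperforms strong prompt-based and automated prompt-optimization baselines. The learned emphasis policy transfers zero-shot to both smaller and larger unseen Solver families, including an API-based Solver, which suggests that the Actor captures genuine, reusable evidence structure rather than overfitting to a single backbone.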

English abstract

Large Language Models (LLMs) can reason well, yet often miss decisive evidence when it is buried in long, noisy contexts. We introduce HiLight, an Evidence Emphasis framework that decouples evidence selection from reasoning for frozen LLM solvers. HiLight avoids compressing or rewriting the input, which can discard or distort evidence, by training a lightweight Emphasis Actor to insert minimal highlight tags around pivotal spans in the unaltered context. A frozen Solver then performs downstream reasoning on the emphasized input. We cast highlighting as a weakly supervised decision-making problem and optimize the Actor with reinforcement learning using only the Solver's task reward, requiring no evidence labels and no access to or modification of the Solver. Across sequential recommendation and long-context question answering, HiLight consistently improves performance over strong prompt-based and automated prompt-optimization baselines. The learned emphasis policy transfers zero-shot to both smaller and larger unseen Solver families, including an API-based Solver, suggesting that the Actor captures genuine, reusable evidence structure rather than overfitting to a single backbone.
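The mechanical part of the approach, wrapping chosen spans in tags while leaving every other character of the context untouched, can be sketched as follows. The `<hl>` tag name, the character-offset span format, and the function name are hypothetical; the learned Actor that decides which spans to highlight is not modelled here.

```python
def insert_highlights(context, spans, open_tag="<hl>", close_tag="</hl>"):
    """Wrap non-overlapping (start, end) character spans in highlight tags.

    Text outside the spans is copied verbatim, so removing the tags
    recovers the original context exactly.
    """
    out = []
    cursor = 0
    for start, end in sorted(spans):
        if start < cursor or end > len(context) or start >= end:
            raise ValueError("spans must be non-empty, in range, and non-overlapping")
        out.append(context[cursor:start])
        out.append(open_tag + context[start:end] + close_tag)
        cursor = end
    out.append(context[cursor:])
    return "".join(out)


context = "User bought a tent, then a stove, then a novel."
# Stand-in for the Actor's output: spans covering two pivotal items.
spans = [(context.index(w), context.index(w) + len(w)) for w in ("tent", "stove")]
highlighted = insert_highlights(context, spans)
```

Because the input is only annotated, never compressed or rewritten, the frozen Solver still sees the full evidence.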

2604.22192 2026-06-10 cs.CV Version update

CharTide: Data-Centric Chart-to-Code Generation via Tri-Perspective Tuning and Inquiry-Driven Evolution

CharTide: Data-Centric Generation of Code from Charts through Tri-Perspective Tuning and Inquiry-Driven Evolution

Xiangxi Zheng, Kuang He, Jiayi Hu, Ping Yu, Rui Yan, Yuan Yao, Peng Hou, Anxiang Zeng, Alex Jinpeng Wang

Affiliations * Nanjing University; LLM Team, Shopee Pte. Ltd.; East China Normal University; Nanjing University of Science and Technology; Central South University

AI summary Proposes the CharTide framework, which decouples visual perception from code logic through Tri-Perspective Tuning and introduces Inquiry-Driven reinforcement learning, grounded in information invariance, for data verification, surpassing GPT-4o on multiple benchmarks.

Comments Accepted to ACL 2026 Main

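AI abstract (translated from Chinese)

Chart-to-code generation requires vision-language models (VLMs) to achieve strict visual precision and syntactic correctness. However, existing methods are fundamentally constrained by data-centric limitations: although available chart-to-code datasets keep growing, simply scaling up homogeneous chart-code pairs conflates visual perception with program logic and prevents models from fully exploiting the richness of multimodal supervision. We propose CharTide, a novel data-centric framework that systematically redesigns both the training data and the alignment data for chart-to-code generation. First, we build a 2M-sample dataset through a Tri-Perspective Tuning strategy that explicitly decouples training into visual perception, pure-text code logic, and modality fusion streams, allowing a 7B model to surpass specialized baselines using supervised data alone. Second, we reformulate alignment as a data verification problem rather than a heuristic scoring task. To this end, we introduce an Inquiry-Driven reinforcement learning framework based on the principle of information invariance: a downstream model should give consistent answers to the same visual queries on the original chart and the generated chart. Going beyond rigid rule matching or VLM scoring, we use a frozen Inspector to verify generated charts objectively through atomic QA tasks, providing verifiable reward signals based on answer accuracy. Experiments on ChartMimic, Plot2Code, and ChartX show that CharTide-7B/8B significantly outperforms open-source baselines, surpasses GPT-4o, and is competitive with GPT-5.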

English abstract

Chart-to-code generation demands strict visual precision and syntactic correctness from Vision-Language Models (VLMs). However, existing approaches are fundamentally constrained by data-centric limitations: despite the availability of growing chart-to-code datasets, simply scaling homogeneous chart-code pairs conflates visual perception with program logic, preventing models from fully leveraging the richness of multimodal supervision. We present CharTide, a novel data-centric framework that systematically redesigns both training and alignment data for chart-to-code generation. First, we construct a 2M-sample dataset via a Tri-Perspective Tuning strategy, explicitly decoupling training into visual perception, pure-text code logic, and modality fusion streams, enabling a 7B model to surpass specialized baselines using only supervised data. Second, we reformulate alignment as a data verification problem rather than a heuristic scoring task. To this end, we introduce an Inquiry-Driven RL framework grounded in the principle of information invariance: a downstream model should yield consistent answers to identical visual queries across both original and generated charts. Moving beyond rigid rule matching or VLM scoring, we employ a frozen Inspector to objectively verify generated charts through atomic QA tasks, providing verifiable reward signals based on answer accuracy. Experiments on ChartMimic, Plot2Code, and ChartX show that CharTide-7B/8B significantly outperforms open-source baselines, surpasses GPT-4o, and is competitive with GPT-5.
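The information-invariance principle described above can be read as a simple agreement score between answers obtained on the original chart and on the generated chart. The sketch below shows only that scoring step under loud assumptions: the answers are supplied as plain strings (the frozen Inspector that would produce them is not modelled), and the exact-match rule with whitespace and case normalization is our own choice, not the paper's reward definition.

```python
def invariance_reward(answers_original, answers_generated):
    """Fraction of visual queries answered identically on both charts.

    Hypothetical helper: each list holds one answer per atomic QA query,
    in the same query order.
    """
    if not answers_original or len(answers_original) != len(answers_generated):
        raise ValueError("need two non-empty answer lists of equal length")

    def normalize(answer):
        return answer.strip().lower()

    matches = sum(
        normalize(a) == normalize(b)
        for a, b in zip(answers_original, answers_generated)
    )
    return matches / len(answers_original)


# Three atomic queries: chart type, number of bars, y-axis label.
on_original = ["Bar", "4", "Revenue"]
on_generated = ["bar", "5", "revenue "]
reward = invariance_reward(on_original, on_generated)  # 2 of 3 answers agree
```

A score of this kind is verifiable in the sense that it depends only on answer agreement, not on a model's subjective rating of visual similarity.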

2604.20048 2026-06-10 cs.CL cs.CY Version update

Culturally uneven urban perception in large language models

Large language models perceive cities through a culturally uneven baseline

Rong Zhao, Wanqi Liu, Zhizhou Sha, Nanxi Su, Yecheng Zhang, Ying Long

Affiliations * Centre for Advanced Spatial Analysis (CASA), UCL, London, UK; School of Architecture, Tsinghua University, Beijing, China; Department of Computer Science, UT Austin, Austin, TX, USA

AI summary Tests the urban perception of frontier LLMs using a globally balanced sample of street-view images, finding that neutral prompts in practice lean toward European and North American culture, and that cultural prompts can shift affective evaluation but cannot recover the semantic diversity of human responses.

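AI abstract (translated from Chinese)

Large language models (LLMs) are increasingly used to describe, evaluate, and interpret places, but it remains unclear whether they do so from a culturally neutral standpoint. This paper tests the urban perception of frontier LLMs using a balanced global sample of street-view images and prompts that either remain neutral or invoke different regional cultural standpoints. In both open-ended descriptions and structured place judgments, the neutral condition is not neutral in practice. Prompts associated with Europe and North America are systematically closer to the baseline than many non-Western prompts, indicating that model perception is organized around a culturally uneven frame of reference rather than a universal one. Cultural prompts also change affective evaluation, producing sentiment-based in-group preference for some prompted identities. Comparison with regional human text-image benchmarks shows that culturally proximate prompts can improve agreement with human descriptions, but fail to recover human-level semantic diversity and often retain a sentiment-elevated style. The same asymmetry appears in structured judgments of safety, beauty, wealth, liveliness, boredom, and depression, where model outputs are interpretable but only partially reproduce human group differences. These findings show that LLMs do not perceive cities from nowhere: they perceive them through a culturally uneven baseline that shapes what counts as ordinary, familiar, and positively evaluated.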

English abstract

Large language models (LLMs) are increasingly used to describe and evaluate cities, yet the cultural structure of their urban judgments remains understudied. Here we introduce a measurement framework for testing whether LLM-based urban perception is culturally neutral, using a globally stratified street-view image dataset. Open-ended descriptions and structured scores generated by three frontier multimodal models all show that the neutral baseline lies closer to regional framings associated with Europe and North America than to other cultural framings. Comparisons between AI and human urban perception further show that prompting can move AI responses closer to specific regional human descriptions, but fails to recover the variety and diversity of human responses, flattening observed demographic patterns and introducing sentiment-based self-favouring bias. These results indicate a systematic risk in treating AI as a neutral tool for urban tasks, especially when model outputs are used to compare, evaluate or represent cities across cultural contexts.

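2604.20024 2026-06-10 cs.LG Version update

Replicable Bandits with UCB based Exploration

Replicable Bandits with UCB-Based Exploration

Rohan Deb, Udaya Ghai, Karan Singh, Arindam Banerjee

Affiliations * University of Illinois Urbana-Champaign; Amazon; Carnegie Mellon University

AI summary Studies replicable algorithms with UCB-based exploration for stochastic multi-armed bandits and linear bandits, proposing RepUCB and RepLinUCB, which respectively achieve optimal regret bounds and substantially reduce the price of replicability.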

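AI abstract (translated from Chinese)

We study replicable algorithms with UCB (Upper Confidence Bound) based exploration for stochastic multi-armed bandits (MAB) and linear bandits. A bandit algorithm is $\rho$-replicable if two executions that share internal randomness but see independent reward realizations produce the same action sequence with probability at least $1-\rho$. Prior approaches to this problem are elimination-based and, in linear bandits with infinitely many actions, rely on discretization, which leads to suboptimal dependence on the dimension $d$ and on $\rho$. We develop optimistic alternatives for both settings. For stochastic multi-armed bandits, we propose RepUCB, a replicable batched UCB algorithm, and prove that its regret is $O\!\left(\frac{K^2\log^2 T}{\rho^2}\sum_{a:\Delta_a>0}\left(\Delta_a+\frac{\log(KT\log T)}{\Delta_a}\right)\right)$. For stochastic linear bandits, we first introduce RepRidge, a replicable ridge regression estimator that satisfies both a confidence guarantee and a $\rho$-replicability guarantee. Beyond its role in our bandit algorithm, it may be of independent interest in other statistical estimation settings. We then use RepRidge to design RepLinUCB, a replicable optimistic algorithm for stochastic linear bandits, and prove that its regret is bounded by $\widetilde{O}\!\big(\big(d+\frac{d^3}{\rho}\big)\sqrt{T}\big)$. This improves on the best prior regret guarantee by a factor of $O(d/\rho)$, showing that our optimistic algorithm can substantially reduce the price of replicability. It is the first linear bandit algorithm with optimal dependence on $\rho$ for a large number of arms. Finally, we extend the framework to stochastic generalized linear bandits, developing RepGLM, a replicable penalized GLM estimator, and RepGLMUCB, a replicable optimistic algorithm for that setting.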

English abstract

We study replicable algorithms for stochastic multi-armed bandits (MAB) and linear bandits with UCB (Upper Confidence Bound) based exploration. A bandit algorithm is $ρ$-replicable if two executions using shared internal randomness but independent reward realizations produce the same action sequence with probability at least $1-ρ$. Prior approaches to this problem are elimination-based and, in linear bandits with infinitely many actions, rely on discretization, leading to suboptimal dependence on the dimension $d$ and $ρ$. We develop optimistic alternatives for both settings. For stochastic multi-armed bandits, we propose RepUCB, a replicable batched UCB algorithm, and show that it attains regret $O\!\left(\frac{K^2\log^2 T}{ρ^2}\sum_{a:Δ_a>0}\left(Δ_a+\frac{\log(KT\log T)}{Δ_a}\right)\right)$. For stochastic linear bandits, we first introduce RepRidge, a replicable ridge regression estimator that satisfies both a confidence guarantee and a $ρ$-replicability guarantee. Beyond its role in our bandit algorithm, this may also be of independent interest in other statistical estimation settings. We then use RepRidge to design RepLinUCB, a replicable optimistic algorithm for stochastic linear bandits, and show that its regret is bounded by $\widetilde{O}\!\big(\big(d+\frac{d^3}{ρ}\big)\sqrt{T}\big)$. This improves the best prior regret guarantee by a factor of $O(d/ρ)$, showing that our optimistic algorithm can substantially reduce the price of replicability. This is the first linear-bandit algorithm with an optimal dependence on $ρ$ for a large number of arms. Finally, we extend our framework to stochastic generalized linear bandits by developing RepGLM, a replicable penalized GLM estimator, and RepGLMUCB, a replicable optimistic algorithm for this setting.

2603.29025 2026-06-10 cs.CL cs.AI 版本更新

The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning

模型说走:表面启发式如何覆盖LLM推理中的隐式约束

Yubo Li, Lu Zhang, Tianchong Jiang, Ramayya Krishnan, Rema Padman

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Independent Researcher(独立研究者)

AI总结 研究LLM在表面线索与隐式约束冲突时的失败,提出启发式覆盖基准(HOB),通过因果行为分析揭示距离线索影响远大于目标,并验证目标分解提示可部分恢复性能。

详情
AI中文摘要

当显著的表面线索与未陈述的可行性约束冲突时,大型语言模型会失败。我们引入了启发式覆盖基准(HOB):500个实例,涵盖4个启发式家族和5个约束家族,具有最小对和显式性梯度。我们将HOB与一个可证伪的行为特征描述配对,遵循诊断-测量-桥接-治疗弧。对六个模型的洗车问题进行因果行为分析,揭示了上下文无关的S形启发式:距离线索的影响力是目标的8.7到38倍,归因更匹配关键词关联而非组合推理。在14个模型中,严格的10/10评估显示,没有模型超过75%,存在约束最难,为44%。一个最小提示将性能提高15个百分点,表明是约束推断失败而非知识缺失。然而,14个模型中有12个在移除约束后表现更差,最多下降39个百分点,揭示了保守偏差。对Gemini 3.1 Pro的思考模式消融实验显示,思考开启时性能为74.6%,关闭时降至58.4%,而显式目标分解将其恢复至71.2%。因此,内部推理确实有用,显式提示可以部分替代。推理模型并不绝对优于非推理模型:在控制能力排名后,残差推理模式效应为1.8个百分点且不显著。参数探针显示S形模式泛化到成本、效率和语义相似性启发式。目标分解提示将性能提升5.0个百分点,而通用思维链提升3.1个百分点,将约束枚举隔离为有效成分。总体而言,启发式覆盖是一个系统性的推理漏洞,其量化位点在于推理顺序而非知识,并且有一个经过测试的干预措施。

英文摘要

Large language models fail when a salient surface cue conflicts with an unstated feasibility constraint. We introduce the Heuristic Override Benchmark (HOB): 500 instances spanning 4 heuristic families and 5 constraint families, with minimal pairs and explicitness gradients. We pair HOB with a falsifiable behavioral characterization following a diagnose-measure-bridge-treat arc. Causal-behavioral analysis of the car wash problem across six models reveals context-independent sigmoid heuristics: the distance cue has 8.7 to 38 times more influence than the goal, and attribution better matches keyword association than compositional inference. Across 14 models, strict 10/10 evaluation shows that no model exceeds 75%, and presence constraints are hardest at 44%. A minimal hint improves performance by 15 pp, suggesting a constraint-inference failure rather than missing knowledge. However, 12 of 14 models perform worse when the constraint is removed, by up to 39 pp, revealing conservative bias. A thinking-mode ablation on Gemini 3.1 Pro drops performance from 74.6% with thinking on to 58.4% with thinking off, while explicit goal decomposition recovers it to 71.2%. Thus, internal deliberation does useful work, and explicit prompting can partially substitute for it. Reasoning models do not categorically outperform non-reasoning peers: after controlling for capability rank, the residual reasoning-mode effect is 1.8 pp and is not significant. Parametric probes show that the sigmoid pattern generalizes to cost, efficiency, and semantic-similarity heuristics. Goal-decomposition prompting improves performance by 5.0 pp, compared with 3.1 pp for generic chain-of-thought, isolating constraint enumeration as the active ingredient. Overall, heuristic override is a systematic reasoning vulnerability with a quantified locus in inference order, not knowledge, and a tested intervention.
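
The thinking-mode ablation figures above can be turned into a single "share of the drop recovered" number. The sketch below is only arithmetic on the percentages quoted in the abstract. It assumes the 71.2% figure was measured with thinking off plus goal decomposition, and the helper name `recovered_fraction` is hypothetical.

```python
# Accuracy figures quoted in the abstract (percent, Gemini 3.1 Pro on HOB).
thinking_on = 74.6
thinking_off = 58.4
goal_decomposition = 71.2  # assumed: thinking off + explicit goal decomposition


def recovered_fraction(full: float, ablated: float, treated: float) -> float:
    """Share of the ablation drop that the treatment wins back."""
    drop = full - ablated
    if drop <= 0:
        raise ValueError("ablated score must be below the full score")
    return (treated - ablated) / drop


drop_pp = thinking_on - thinking_off
recovered_pp = goal_decomposition - thinking_off
share = recovered_fraction(thinking_on, thinking_off, goal_decomposition)
print(f"drop: {drop_pp:.1f} pp, recovered: {recovered_pp:.1f} pp ({share:.0%})")
```

Under that assumption, explicit goal decomposition wins back roughly four fifths of the accuracy lost by switching thinking off, which is what "partially substitute" amounts to numerically.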

2604.19274 2026-06-10 cs.CL 版本更新

HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing

HarDBench: 面向安全人机协作写作的基于草稿的合著越狱攻击基准

Euntae Kim, Soomin Han, Buru Chang

发表机构 * Korea University(高丽大学) Sogang University(西江大学)

AI总结 提出HarDBench基准,评估大语言模型在协作写作中面对恶意草稿填充的越狱攻击的鲁棒性,并通过偏好优化实现安全-效用平衡的对齐方法。

Comments ACL 2026 Main Camera-Ready

详情
AI中文摘要

大语言模型(LLMs)越来越多地被用作协作写作中的合著者,用户从粗略草稿开始,依赖LLMs完成、修改和优化其内容。然而,这种能力带来了严重的安全风险:恶意用户可能通过用危险内容填充不完整草稿来越狱模型,迫使其生成有害输出。在本文中,我们识别了当前LLMs对此类基于草稿的合著越狱攻击的脆弱性,并引入了HarDBench,一个系统性的基准,旨在评估LLMs对此新兴威胁的鲁棒性。HarDBench涵盖一系列高风险领域(包括爆炸物、毒品、武器和网络攻击),其提示具有真实的结构和领域特定的线索,用于评估模型对有害补全的易感性。为缓解此风险,我们引入了一种基于偏好优化的安全-效用平衡对齐方法,训练模型拒绝有害补全,同时保持对良性草稿的有用性。实验结果表明,现有LLMs在合著环境中高度脆弱,而我们的对齐方法显著减少了有害输出,且不降低合著能力。这为人机协作写作环境中LLMs的评估与对齐提供了新范式。我们的新基准和数据集可在项目页面获取:https://github.com/untae0122/HarDBench

英文摘要

Large language models (LLMs) are increasingly used as co-authors in collaborative writing, where users begin with rough drafts and rely on LLMs to complete, revise, and refine their content. However, this capability poses a serious safety risk: malicious users could jailbreak the models -- filling incomplete drafts with dangerous content -- to force them into generating harmful outputs. In this paper, we identify the vulnerability of current LLMs to such draft-based co-authoring jailbreak attacks and introduce HarDBench, a systematic benchmark designed to evaluate the robustness of LLMs against this emerging threat. HarDBench spans a range of high-risk domains -- including Explosives, Drugs, Weapons, and Cyberattacks -- and features prompts with realistic structure and domain-specific cues to assess model susceptibility to harmful completions. To mitigate this risk, we introduce a safety-utility balanced alignment approach based on preference optimization, training models to refuse harmful completions while remaining helpful on benign drafts. Experimental results show that existing LLMs are highly vulnerable in co-authoring contexts, and that our alignment method significantly reduces harmful outputs without degrading performance on co-authoring capabilities. This presents a new paradigm for evaluating and aligning LLMs in human-LLM collaborative writing settings. Our new benchmark and dataset are available on our project page at https://github.com/untae0122/HarDBench

2604.12306 2026-06-10 cs.LG cs.AI 版本更新

GCA Framework: A GCC Countries-Grounded Dataset and Agentic Pipeline for Climate Decision Support

GCA框架:面向海湾合作委员会国家的数据集与气候决策支持智能体管道

Muhammad Umer Sheikh, Khawar Shehzad, Salman Khan, Fahad Shahbaz Khan, Muhammad Haris Khan

发表机构 * Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI)(穆罕默德·本·扎耶德人工智能大学) University of Missouri(密苏里大学) Australian National University(澳大利亚国立大学) Linköping University(林雪平大学)

AI总结 提出GCA框架,包含GCC国家多模态数据集GCA-DS和工具增强型智能体GCA,通过领域微调和工具集成提升气候决策可靠性。

详情
AI中文摘要

海湾合作委员会(GCC)国家的气候决策日益需要能够将异质的科学和政策证据转化为可操作指导的系统,然而通用大语言模型(LLM)在区域特定气候知识以及与地理空间和预测工具的接地交互方面仍然薄弱。我们提出GCA框架,它统一了(i)GCA-DS,一个基于GCC国家的精选多模态数据集,以及(ii)Gulf Climate Agent(GCA),一个工具增强型气候分析智能体。GCA-DS包含20万个问答对,涵盖政府政策和适应计划、非政府组织和国际框架、学术文献以及关于热浪、沙尘暴和洪水的事件驱动报告,并辅以将图像与文本证据相结合的遥感输入。在此基础上,GCA智能体编排了一个基于实时和历史信号以及地理空间处理的模块化工具管道,生成衍生指数和可解释的可视化。最后,我们在GCC国家的气候任务上对开源和专有LLM进行了基准测试,结果表明领域微调和工具集成显著提高了相对于通用基线的可靠性。

英文摘要

Climate decision-making in the GCC states increasingly demands systems that can translate heterogeneous scientific and policy evidence into actionable guidance, yet general-purpose large language models (LLMs) remain weak in both region-specific climate knowledge and grounded interaction with geospatial and forecasting tools. We present the GCA framework, which unifies (i) GCA-DS, a curated multimodal dataset grounded in the GCC states, and (ii) Gulf Climate Agent (GCA), a tool-augmented agent for climate analysis. GCA-DS comprises 200k question-answer pairs spanning governmental policies and adaptation plans, NGO and international frameworks, academic literature, and event-driven reporting on heatwaves, dust storms, and floods, complemented with remote-sensing inputs that couple imagery with textual evidence. Building on this foundation, the GCA agent orchestrates a modular tool pipeline grounded in real-time and historical signals and geospatial processing that produces derived indices and interpretable visualizations. Finally, we benchmark open and proprietary LLMs on climate tasks in the GCC states and show that domain fine-tuning and tool integration substantially improve reliability over general-purpose baselines.

2604.15771 2026-06-10 cs.CL 版本更新

Skill-RAG: Failure-State-Aware Retrieval Augmentation via Hidden-State Probing and Skill Routing

Skill-RAG: 通过隐藏状态探测与技能路由实现故障状态感知的检索增强

Kai Wei, Raymond Li, Xi Zhu, Zhaoqian Xue, Jiaojiao Han, Jingcheng Niu, Fan Yang

发表机构 * University of Michigan(密歇根大学) University of British Columbia(不列颠哥伦比亚大学) Rutgers University(罗格斯大学) University of Pennsylvania(宾夕法尼亚大学) New Jersey Institute of Technology(新泽西理工学院) TU Darmstadt(达姆施塔特工业大学) Wake Forest University(维克森林大学)

AI总结 提出Skill-RAG框架,通过轻量级隐藏状态探测器和基于提示的技能路由器,在检索失败时诊断原因并选择四种技能(查询重写、问题分解、证据聚焦、退出)纠正查询-证据错位,显著提升多轮检索后困难案例的准确性。

详情
AI中文摘要

检索增强生成(RAG)已成为将大型语言模型锚定于外部知识的基础范式。尽管自适应检索机制提高了检索效率,现有方法将检索后失败视为重试信号而非诊断信号——从而未能解决查询与证据空间错位的结构性原因。我们观察到,相当一部分持续性检索失败并非源于缺乏相关证据,而是源于查询与证据空间之间的对齐差距。我们提出Skill-RAG,一种故障感知的RAG框架,它结合了轻量级隐藏状态探测器和基于提示的技能路由器。探测器在两个流水线阶段门控检索;当检测到故障状态时,技能路由器诊断根本原因,并在四种检索技能——查询重写、问题分解、证据聚焦,以及针对真正不可约情况的退出技能——中进行选择,以在下一次生成尝试前纠正错位。跨多个开放域问答和复杂推理基准的实验表明,Skill-RAG显著提高了多轮检索后持续存在的困难案例的准确性,在分布外数据集上尤其强劲。表示空间分析进一步揭示,所提出的技能占据了故障状态空间中结构化、可分离的区域,支持了查询-证据错位是一种类型化而非单一现象的观点。

英文摘要

Retrieval-Augmented Generation (RAG) has emerged as a foundational paradigm for grounding large language models in external knowledge. While adaptive retrieval mechanisms have improved retrieval efficiency, existing approaches treat post-retrieval failure as a signal to retry rather than to diagnose -- leaving the structural causes of query-evidence misalignment unaddressed. We observe that a significant portion of persistent retrieval failures stem not from the absence of relevant evidence but from an alignment gap between the query and the evidence space. We propose Skill-RAG, a failure-aware RAG framework that couples a lightweight hidden-state prober with a prompt-based skill router. The prober gates retrieval at two pipeline stages; upon detecting a failure state, the skill router diagnoses the underlying cause and selects among four retrieval skills -- query rewriting, question decomposition, evidence focusing, and an exit skill for truly irreducible cases -- to correct misalignment before the next generation attempt. Experiments across multiple open-domain QA and complex reasoning benchmarks show that Skill-RAG substantially improves accuracy on hard cases persisting after multi-turn retrieval, with particularly strong gains on out-of-distribution datasets. Representation-space analyses further reveal that the proposed skills occupy structured, separable regions of the failure state space, supporting the view that query-evidence misalignment is a typed rather than monolithic phenomenon.
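
The control flow described above (probe, diagnose, pick a skill, retry) can be sketched as a small loop. This is a minimal illustration, not the authors' implementation: every callable passed in (`retrieve`, `generate`, `probe_failure`, `route_skill`, `apply_skill`) is a hypothetical stand-in, and the sketch gates at a single point per round whereas the abstract says the prober gates retrieval at two pipeline stages.

```python
from typing import Callable, Dict, List, Optional

# The four skills named in the abstract; "exit" covers irreducible cases.
SKILLS = ("query_rewriting", "question_decomposition", "evidence_focusing", "exit")


def answer_with_skill_routing(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    probe_failure: Callable[[str, List[str]], bool],
    route_skill: Callable[[str, List[str]], str],
    apply_skill: Dict[str, Callable[[str, List[str]], str]],
    max_rounds: int = 3,
) -> Optional[str]:
    """Retry retrieval with a diagnosed skill instead of a blind retry."""
    query = question
    for _ in range(max_rounds):
        evidence = retrieve(query)
        # Stand-in for the hidden-state prober: True means "failure state".
        if not probe_failure(query, evidence):
            return generate(question, evidence)
        skill = route_skill(query, evidence)
        if skill not in SKILLS:
            raise ValueError(f"unknown skill: {skill}")
        if skill == "exit":
            return None  # give up rather than retry an irreducible case
        # The chosen skill rewrites the query before the next attempt.
        query = apply_skill[skill](query, evidence)
    return None
```

The point of the design is that a failed round changes the query according to a diagnosis, so the next retrieval is not a repeat of the one that just failed.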

2604.15414 2026-06-10 cs.LG cs.AI cs.NE 版本更新

Beyond Single-Model Optimization: Preserving Plasticity in Continual Reinforcement Learning

超越单模型优化:在持续强化学习中保持可塑性

Lute Lillo, Nick Cheney

发表机构 * Department of Computer Science, University of Vermont(佛蒙特大学计算机科学系)

AI总结 提出TeLAPA框架,通过维护行为多样性的策略档案和共享潜在空间,在持续强化学习中实现技能对齐的策略邻域,以解决单策略保存导致的可塑性丧失问题,提升任务学习、恢复和性能保持能力。

详情
AI中文摘要

持续强化学习必须在保留与适应之间取得平衡,然而许多方法仍然依赖于单模型保存,即承诺将一个不断演化的策略作为跨任务的主要可复用解决方案。即使保留了先前成功的策略,在干扰后它可能不再为快速适应提供可靠的起点,这反映了单策略保存无法解决的一种可塑性丧失形式。受质量-多样性方法的启发,我们引入了TeLAPA(可迁移的潜在对齐策略档案),这是一个持续强化学习框架,它将行为多样性的策略邻域组织成每个任务的档案,并维护一个共享的潜在空间,使得存档的策略在非平稳漂移下保持可比性和可复用性。这种视角将持续强化学习从保留孤立解决方案转变为维护技能对齐的邻域,其中包含有能力的、行为相关的策略,以支持未来的重新学习。在我们的MiniGrid持续学习设置中,TeLAPA成功学习了更多任务,在干扰后重新访问任务时更快地恢复了能力,并在整个任务序列中保持了更高的性能。我们的分析表明,源最优策略通常不是迁移最优的,即使在局部有能力的邻域内也是如此,并且有效的复用依赖于保留和选择多个邻近的替代方案,而不是将它们合并为一个代表。总之,这些结果将持续强化学习重新定义为围绕可复用且有能力的策略邻域,提供了一条超越单模型保存、迈向更具可塑性的终身智能体的途径。

英文摘要

Continual reinforcement learning must balance retention with adaptation, yet many methods still rely on single-model preservation, committing to one evolving policy as the main reusable solution across tasks. Even when a previously successful policy is retained, it may no longer provide a reliable starting point for rapid adaptation after interference, reflecting a form of loss of plasticity that single-policy preservation cannot address. Inspired by quality-diversity methods, we introduce TeLAPA (Transfer-Enabled Latent-Aligned Policy Archives), a continual RL framework that organizes behaviorally diverse policy neighborhoods into per-task archives and maintains a shared latent space so that archived policies remain comparable and reusable under non-stationary drift. This perspective shifts continual RL from retaining isolated solutions to maintaining skill-aligned neighborhoods with competent and behaviorally related policies that support future relearning. In our MiniGrid CL setting, TeLAPA learns more tasks successfully, recovers competence faster on revisited tasks after interference, and retains higher performance across a sequence of tasks. Our analyses show that source-optimal policies are often not transfer-optimal, even within a local competent neighborhood, and that effective reuse depends on retaining and selecting among multiple nearby alternatives rather than collapsing them to one representative. Together, these results reframe continual RL around reusable and competent policy neighborhoods, providing a route beyond single-model preservation toward more plastic lifelong agents.

2604.14397 2026-06-10 cs.CL cs.AI 版本更新

Generating Concept Lexicalizations via Dictionary-Based Cross-Lingual Sense Projection

基于词典的跨语言词义投影生成概念词汇化

David Basil, Chirooth Girigowda, Bradley Hauer, Sahir Momin, Ning Shi, Grzegorz Kondrak

发表机构 * University of Toronto(多伦多大学)

AI总结 提出一种通过语义投影将英语WordNet概念扩展到新语言的方法,利用双语词典增强对齐并过滤错误投影,在多个语言上提升了精度且保持可解释性和资源效率。

Comments Paper presented at Canadian AI 2026

详情
AI中文摘要

我们研究通过词义生成自动将WordNet风格的词汇资源扩展到新语言的任务。我们通过语义投影将目标语言词元与现有词汇概念关联来生成词义。给定一个带有词义标注的英语语料库及其译文,我们的方法将标注的同义词集(synset)投影到对齐的目标语言词例上,并将相应的词元分配给这些同义词集。为了生成对齐并确保其质量,我们使用双语词典增强预训练的基础对齐器,该词典也用于过滤不正确的词义投影。我们在多种语言上评估该方法,将其与先前方法以及基于词典和大型语言模型的基线进行比较。结果表明,所提出的投影-过滤策略在保持可解释性和资源效率的同时提高了精度。我们在 https://github.com/UAlberta-NLP/ExpandNet 上发布代码、文档和生成的词义清单。

英文摘要

We study the task of automatically expanding WordNet-style lexical resources to new languages through sense generation. We generate senses by associating target-language lemmas with existing lexical concepts via semantic projection. Given a sense-tagged English corpus and its translation, our method projects the annotated synsets onto aligned target-language tokens and assigns the corresponding lemmas to those synsets. To generate alignments and ensure their quality, we augment a pretrained base aligner with a bilingual dictionary, which is also used to filter incorrect sense projections. We evaluate the method on multiple languages, comparing it to prior methods, as well as dictionary-based and large language model baselines. Results show that the proposed project-and-filter strategy improves precision while remaining interpretable and resource-efficient. We release our code, documentation, and generated sense inventories at https://github.com/UAlberta-NLP/ExpandNet.
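
The project-and-filter idea above can be illustrated on one aligned sentence pair. The sketch below is a simplified reading of the abstract, not the released ExpandNet code: the function name, the data layout, the synset identifiers, and the toy English-Spanish dictionary are all illustrative assumptions, and the word alignment is taken as given rather than produced by an aligner.

```python
from typing import Dict, List, Set, Tuple


def project_senses(
    tagged_source: List[Tuple[str, str]],  # (English lemma, synset id or "")
    target_lemmas: List[str],              # lemmas of the translated sentence
    alignment: List[Tuple[int, int]],      # (source index, target index) pairs
    dictionary: Dict[str, Set[str]],       # English lemma -> known translations
) -> Dict[str, Set[str]]:
    """Project synsets onto aligned target lemmas, keeping dictionary-backed pairs."""
    senses: Dict[str, Set[str]] = {}
    for src_i, tgt_i in alignment:
        en_lemma, synset = tagged_source[src_i]
        tgt_lemma = target_lemmas[tgt_i]
        if not synset:
            continue  # untagged source token: nothing to project
        if tgt_lemma not in dictionary.get(en_lemma, set()):
            continue  # filter step: the dictionary does not support this pair
        senses.setdefault(synset, set()).add(tgt_lemma)
    return senses


# "the bank approved the loan" / "el banco aprobó el préstamo" (lemmatized).
source = [("the", ""), ("bank", "bank.n.02"), ("approve", "approve.v.01"),
          ("the", ""), ("loan", "loan.n.01")]
target = ["el", "banco", "aprobar", "el", "préstamo"]
links = [(1, 1), (2, 2), (4, 4), (4, 3)]  # the last link is a deliberate alignment error
bilingual = {"bank": {"banco"}, "approve": {"aprobar"}, "loan": {"préstamo"}}

projected = project_senses(source, target, links, bilingual)
print(projected)
```

The erroneous link from "loan" to "el" is dropped by the dictionary check, which is the precision-oriented filtering the abstract describes.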

2604.06893 2026-06-10 cs.CV cs.LG 版本更新

Energy-Regularized Spatial Masking: A Novel Approach to Enhancing Robustness and Interpretability in Vision Models

能量正则化的空间遮蔽:一种增强视觉模型鲁棒性和可解释性的新方法

Tom Devynck, Bilal Faye, Djamel Bouchaffra, Nadjib Lazaar, Hanane Azzag, Mustapha Lebbah

发表机构 * DAVID Lab, UVSQ, Paris-Saclay University(DAVID实验室,UVSQ,巴黎-萨克雷大学) LIPN, UMR CNRS 7030, Sorbonne Paris Nord University(LIPN,UMR CNRS 7030,索邦巴黎北大学) LISN, Paris-Saclay University(LISN,巴黎-萨克雷大学)

AI总结 本文提出能量正则化空间遮蔽框架,通过可微能量最小化问题重新定义特征选择,实现更鲁棒和可解释的视觉模型。

Comments 8 pages

详情
AI中文摘要

深度卷积神经网络通过对密集空间特征图进行穷尽式处理取得了显著性能,但这种暴力策略引入了显著的计算冗余,并助长了对虚假背景相关性的依赖。因此,现代视觉模型仍然脆弱且难以解释。我们提出能量正则化空间遮蔽(ERSM),一种将特征选择重新表述为可微能量最小化问题的新框架。通过在标准卷积骨干网络中嵌入轻量级的能量-遮蔽层(Energy-Mask Layer),每个视觉标记被赋予一个标量能量,该能量由两种相互竞争的力组成:内在的一元(Unary)重要性成本和成对(Pairwise)空间一致性惩罚。不同于强制使用固定稀疏预算或依赖启发式重要性分数的既有剪枝方法,ERSM允许网络针对每个输入自主发现最优的信息密度平衡。我们在卷积架构上验证了ERSM,证明它能产生涌现的稀疏性、提升对结构化遮挡的鲁棒性,并生成高度可解释的空间遮蔽,同时保持分类准确率。此外,我们表明,在基于删除的鲁棒性测试中,学习到的能量排序显著优于基于幅值的剪枝,这揭示了ERSM是一种内在的去噪机制,能够在没有像素级监督的情况下分离出语义物体区域。

英文摘要

Deep convolutional neural networks achieve remarkable performance by exhaustively processing dense spatial feature maps, yet this brute-force strategy introduces significant computational redundancy and encourages reliance on spurious background correlations. As a result, modern vision models remain brittle and difficult to interpret. We propose Energy-Regularized Spatial Masking (ERSM), a novel framework that reformulates feature selection as a differentiable energy minimization problem. By embedding a lightweight Energy-Mask Layer inside standard convolutional backbones, each visual token is assigned a scalar energy composed of two competing forces: an intrinsic Unary importance cost and a Pairwise spatial coherence penalty. Unlike prior pruning methods that enforce rigid sparsity budgets or rely on heuristic importance scores, ERSM allows the network to autonomously discover an optimal information-density equilibrium tailored to each input. We validate ERSM on convolutional architectures and demonstrate that it produces emergent sparsity, improved robustness to structured occlusion, and highly interpretable spatial masks, while preserving classification accuracy. Furthermore, we show that the learned energy ranking significantly outperforms magnitude-based pruning in deletion-based robustness tests, revealing ERSM as an intrinsic denoising mechanism that isolates semantic object regions without pixel-level supervision.

2602.05791 2026-06-10 cs.RO 版本更新

Scalable and General Whole-Body Control for Cross-Humanoid Locomotion

可扩展且通用的全身控制:跨人形机器人运动

Yufei Xue, YunFeng Lin, Wentao Dong, Yang Tang, Jingbo Wang, Jiangmiao Pang, Ming Zhou, Minghuan Liu, Weinan Zhang

发表机构 * Tsinghua University(清华大学)

AI总结 提出XHugWBC框架,通过形态随机化、语义对齐观测动作空间和有效策略架构,实现单次训练后跨多种人形机器人的零样本泛化控制。

详情
AI中文摘要

基于学习的全身控制器已成为人形机器人的关键驱动力,但现有方法大多需要针对特定机器人进行训练。本文研究了跨实体人形控制问题,并表明单一策略通过一次性训练即可稳健地泛化到各种人形机器人设计。我们提出了XHugWBC,一种新颖的跨实体训练框架,通过以下方式实现通用人形控制:(1) 物理一致的形态随机化,(2) 跨不同人形机器人的语义对齐观测和动作空间,以及(3) 建模形态和动力学特性的有效策略架构。XHugWBC不依赖于任何特定机器人,而是在训练过程中内化广泛的形态和动力学特性分布。通过从多样化的随机实体中学习运动先验,策略获得了强大的结构偏差,支持对未见过的机器人进行零样本迁移。在12个模拟人形机器人和7个真实世界机器人上的实验证明了所得通用控制器的强泛化性和鲁棒性。

英文摘要

Learning-based whole-body controllers have become a key driver for humanoid robots, yet most existing approaches require robot-specific training. In this paper, we study the problem of cross-embodiment humanoid control and show that a single policy can robustly generalize across a wide range of humanoid robot designs with one-time training. We introduce XHugWBC, a novel cross-embodiment training framework that enables generalist humanoid control through: (1) physics-consistent morphological randomization, (2) semantically aligned observation and action spaces across diverse humanoid robots, and (3) effective policy architectures modeling morphological and dynamical properties. XHugWBC is not tied to any specific robot. Instead, it internalizes a broad distribution of morphological and dynamical characteristics during training. By learning motion priors from diverse randomized embodiments, the policy acquires a strong structural bias that supports zero-shot transfer to previously unseen robots. Experiments on twelve simulated humanoids and seven real-world robots demonstrate the strong generalization and robustness of the resulting universal controller.

2512.12675 2026-06-10 cs.CV cs.AI 版本更新

Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling

Scone:通过统一理解-生成建模弥合主体驱动图像生成中的组合与区分

Yuran Wang, Bohan Zeng, Chengzhuo Tong, Wenxuan Liu, Yang Shi, Xiaochen Ma, Hao Liang, Yuanxing Zhang, Wentao Zhang

发表机构 * Peking University(北京大学) Kling Team, Kuaishou Technology(快手科技 Kling 团队) Zhongguancun Academy(中关村学院) HKUST(香港科技大学) Beijing Key Laboratory of Data Intelligence and Security (Peking University)(北京数据智能与安全重点实验室(北京大学))

AI总结 提出Scone方法,通过统一理解-生成模型结合组合与区分能力,采用两阶段训练实现主体身份保持与干扰最小化,在双基准上优于现有开源模型。

Comments CVPR 2026 Highlight. Code: https://github.com/Ryann-Ran/Scone

详情
AI中文摘要

主体驱动图像生成已从单主体发展到多主体组合,但忽略了区分能力——即当输入包含多个候选主体时,区分并生成正确主体的能力。这一限制制约了其在复杂、真实视觉场景中的有效性。我们提出Scone,一种统一理解-生成方法,整合了组合与区分。Scone使理解专家充当语义桥梁,传递语义信息并引导生成专家在最小化干扰的同时保持主体身份。两阶段训练方案首先学习组合,然后通过语义对齐和基于注意力的掩码增强区分。我们还引入了SconeEval,一个用于评估多种场景下组合与区分的基准。实验表明,Scone在两个基准上的组合与区分任务中均优于现有开源模型。我们的模型、基准和训练数据可在以下网址获取:https://github.com/Ryann-Ran/Scone

英文摘要

Subject-driven image generation has advanced from single- to multi-subject composition, while neglecting distinction, the ability to distinguish and generate the correct subject when inputs contain multiple candidates. This limitation restricts effectiveness in complex, realistic visual settings. We propose Scone, a unified understanding-generation method that integrates composition and distinction. Scone enables the understanding expert to act as a semantic bridge, conveying semantic information and guiding the generation expert to preserve subject identity while minimizing interference. A two-stage training scheme first learns composition, then enhances distinction through semantic alignment and attention-based masking. We also introduce SconeEval, a benchmark for evaluating both composition and distinction across diverse scenarios. Experiments demonstrate that Scone outperforms existing open-source models in composition and distinction tasks on two benchmarks. Our model, benchmark, and training data are available at: https://github.com/Ryann-Ran/Scone.

2604.07085 2026-06-10 cs.LG 版本更新

Mining Electronic Health Records to Investigate Effectiveness of Ensemble Deep Clustering

挖掘电子健康记录以研究集成深度聚类的有效性

Manar D. Samad, Yina Hou, Shrabani Ghosh

发表机构 * Department of Computer Science, Tennessee State University(田纳西州立大学计算机科学系)

AI总结 针对电子健康记录中传统聚类方法在嵌入表示上的局限,提出基于集成嵌入的深度聚类方法,结合多种嵌入维度与经典聚类,在心力衰竭患者队列中取得最佳综合性能。

Comments 2026 14th IEEE Conference on Healthcare Informatics

详情
AI中文摘要

在电子健康记录(EHR)中,对患者进行聚类和区分疾病亚型是阐明病理生理学并辅助临床决策的关键任务。然而,医疗信息学中的聚类仍基于传统方法,尤其是K-means,当将其作为混合方法应用于自编码器学习的嵌入表示时,取得的成功有限。本文利用来自“All of Us”研究计划的真实EHR数据,研究了传统、混合和深度学习方法在心力衰竭患者队列中的有效性。传统聚类方法表现稳健,因为深度学习方法专门为图像聚类设计,该任务与表格型EHR数据设置显著不同。为了解决深度聚类的不足,我们引入了一种基于集成的深度聚类方法,该方法聚合从多个嵌入维度获得的聚类分配,而不是依赖于单个固定的嵌入空间。当在新型集成框架中与传统聚类结合时,所提出的用于深度聚类的集成嵌入在14种不同的聚类方法和多个患者队列中取得了最佳的整体性能排名。本文强调了EHR数据的生物学性别特异性聚类的重要性,以及将传统和深度聚类方法相结合相对于单一方法的优势。

英文摘要

In electronic health records (EHRs), clustering patients and distinguishing disease subtypes are key tasks to elucidate pathophysiology and aid clinical decision-making. However, clustering in healthcare informatics is still based on traditional methods, especially K-means, and has achieved limited success when applied to embedding representations learned by autoencoders as hybrid methods. This paper investigates the effectiveness of traditional, hybrid, and deep learning methods in heart failure patient cohorts using real EHR data from the All of Us Research Program. Traditional clustering methods perform robustly because deep learning approaches are specifically designed for image clustering, a task that differs substantially from the tabular EHR data setting. To address the shortcomings of deep clustering, we introduce an ensemble-based deep clustering approach that aggregates cluster assignments obtained from multiple embedding dimensions, rather than relying on a single fixed embedding space. When combined with traditional clustering in a novel ensemble framework, the proposed ensemble embedding for deep clustering delivers the best overall performance ranking across 14 diverse clustering methods and multiple patient cohorts. This paper underscores the importance of biological sex-specific clustering of EHR data and the advantages of combining traditional and deep clustering approaches over a single method.
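
One common way to aggregate cluster assignments from several embedding spaces is a co-association (evidence accumulation) matrix. The abstract does not say which aggregation rule the paper uses, so the sketch below is an assumed stand-in: it counts how often two patients share a cluster across partitions and links pairs that agree in more than half of them. The function names and the toy partitions are hypothetical.

```python
from typing import List


def co_association(partitions: List[List[int]]) -> List[List[float]]:
    """Fraction of partitions in which each pair of samples shares a cluster."""
    n = len(partitions[0])
    matrix = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += 1.0 / len(partitions)
    return matrix


def consensus_labels(partitions: List[List[int]], threshold: float = 0.5) -> List[int]:
    """Connected components of the graph linking pairs with agreement > threshold."""
    matrix = co_association(partitions)
    n = len(matrix)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and matrix[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Cluster labels for five patients from three embedding sizes (toy values).
# Label names are arbitrary per partition; only co-membership matters.
per_dimension = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
]
print(consensus_labels(per_dimension))
```

Because only co-membership is counted, the partitions do not need matching label names, which is why assignments from different embedding dimensions can be pooled directly.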

2604.04287 2026-06-10 cs.LG cs.CL q-bio.GN 版本更新

Entropy, Disagreement, and the Limits of Foundation Models in Genomics

熵、分歧与基因组基础模型的局限性

Maxime Rochkoulets, Lovro Vrček, Mile Šikić

发表机构 * Genome Institute of Singapore, A*STAR(新加坡基因组研究院,A*STAR) KU Leuven(鲁汶大学) Faculty of Electrical Engineering and Computing, University of Zagreb(萨格勒布大学电气工程与计算学院)

AI总结 本文通过分析熵对模型学习的影响,发现基因组序列的高熵导致输出分布接近均匀、模型间分歧大和静态嵌入不稳定,且Fisher信息集中在嵌入层,表明仅靠序列自监督训练可能不适用于基因组数据。

Comments Accepted to LMLR Workshop at ICLR 2026

详情
AI中文摘要

基因组学中的基础模型与自然语言处理中的基础模型相比,成功程度参差不齐。然而,其有效性有限的原因仍不清楚。在这项工作中,我们研究了熵作为限制此类模型从训练数据中学习并发展基础能力的基本因素的作用。我们在文本和DNA序列上训练模型集成,并分析它们的预测、静态嵌入和经验Fisher信息流。我们表明,从未见标记预测的角度来看,基因组序列的高熵导致输出分布接近均匀、模型间分歧大以及静态嵌入不稳定,即使模型在架构、训练和数据上匹配也是如此。然后,我们证明在DNA上训练的模型将Fisher信息集中在嵌入层,似乎未能利用标记间关系。我们的结果表明,仅从序列进行自监督训练可能不适用于基因组数据,这质疑了当前训练基因组基础模型方法背后的假设。

英文摘要

Foundation models in genomics have shown mixed success compared to their counterparts in natural language processing. Yet, the reasons for their limited effectiveness remain poorly understood. In this work, we investigate the role of entropy as a fundamental factor limiting the capacities of such models to learn from their training data and develop foundational capabilities. We train ensembles of models on text and DNA sequences and analyze their predictions, static embeddings, and empirical Fisher information flow. We show that the high entropy of genomic sequences -- from the point of view of unseen token prediction -- leads to near-uniform output distributions, disagreement across models, and unstable static embeddings, even for models that are matched in architecture, training and data. We then demonstrate that models trained on DNA concentrate Fisher information in embedding layers, seemingly failing to exploit inter-token relationships. Our results suggest that self-supervised training from sequences alone may not be applicable to genomic data, calling into question the assumptions underlying current methodologies for training genomic foundation models.
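
To make "near-uniform output distributions" concrete, the sketch below computes Shannon entropy normalized by its maximum for two next-token distributions. The probabilities are invented for illustration and are not taken from the paper: one is a nearly flat distribution over the four nucleotides, the other a peaked distribution of the kind a text model might produce.

```python
import math
from typing import Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def normalized_entropy(probs: Sequence[float]) -> float:
    """Entropy divided by its maximum, log2 of the number of outcomes."""
    return shannon_entropy(probs) / math.log2(len(probs))


near_uniform = [0.26, 0.24, 0.25, 0.25]  # hypothetical prediction over A, C, G, T
peaked = [0.97, 0.01, 0.01, 0.01]        # hypothetical confident prediction

print(f"near-uniform: {normalized_entropy(near_uniform):.3f}")
print(f"peaked:       {normalized_entropy(peaked):.3f}")
```

A normalized entropy close to 1 means the prediction carries almost no information beyond the uniform baseline, which is the regime the abstract associates with disagreement across otherwise matched models.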

2602.23499 2026-06-10 cs.RO cs.AI 版本更新

TaCarla: A comprehensive benchmarking dataset for end-to-end autonomous driving

TaCarla: 端到端自动驾驶的综合基准数据集

Tugrul Gorgulu, Atakan Dag, M. Esat Kalfaoglu, Halil Ibrahim Kuru, Baris Can Cam, Halil Ibrahim Ozturk, Ozsel Kilinc

发表机构 * Tuğrul Gorgülü *†(土耳其巴伊塞蒂大学) Atakan Dağ †(土耳其巴伊塞蒂大学) M. Esat Kalfaoğlu ‡(土耳其巴伊塞蒂大学) Halil İbrahim Kuru †(土耳其巴伊塞蒂大学) Barış Can Cam †(土耳其巴伊塞蒂大学) Halil İbrahim Öztürk †(土耳其巴伊塞蒂大学) Özsel Kılınç §(土耳其巴伊塞蒂大学)

AI总结 针对现有自动驾驶数据集不完整、行为多样性不足及闭环评估缺失等问题,基于CARLA Leaderboard 2.0挑战场景收集超过285万帧的多任务数据集,支持规划、检测、预测及视觉语言动作模型,并提供数值稀有度评分。

Comments Accepted at the Third Workshop on Simulation for Autonomous Driving (SAD), CVPR 2026

详情
AI中文摘要

收集高质量数据集是一项需要细致关注细节的关键任务,因为忽略某些方面可能导致整个数据集无法使用。自动驾驶挑战仍然是一个重要的研究领域,需要进一步探索以提升车辆的感知和规划性能。然而,现有数据集往往不完整。例如,包含感知信息的数据集通常缺乏规划数据,而规划数据集通常由大量驾驶序列组成,其中自车主要向前行驶,行为多样性有限。此外,许多真实数据集难以评估其模型,特别是对于规划任务,因为它们缺乏合适的闭环评估设置。CARLA Leaderboard 2.0挑战提供了多样化的场景来解决自动驾驶中的长尾问题,已成为在开环和闭环评估设置下开发感知和规划模型的有价值替代平台。然而,在该平台上收集的现有数据集存在一定局限性。一些数据集似乎主要针对有限的传感器配置,具有特定的传感器配置。为了支持端到端自动驾驶研究,我们使用CARLA仿真环境为多样化的Leaderboard 2.0挑战场景收集了一个包含超过285万帧的新数据集。我们的数据集不仅设计用于规划任务,还支持动态目标检测、车道分隔线检测、中心线检测、交通灯识别、预测任务和视觉语言动作模型。此外,我们通过使用数据集训练各种模型来展示其多功能性。同时,我们还提供了数值稀有度评分,以理解当前状态在数据集中出现的稀有程度。

英文摘要

Collecting a high-quality dataset is a critical task that demands meticulous attention to detail, as overlooking certain aspects can render the entire dataset unusable. Autonomous driving challenges remain a prominent area of research, requiring further exploration to enhance the perception and planning performance of vehicles. However, existing datasets are often incomplete. For instance, datasets that include perception information generally lack planning data, while planning datasets typically consist of extensive driving sequences where the ego vehicle predominantly drives forward, offering limited behavioral diversity. In addition, many real datasets struggle to evaluate their models, especially for planning tasks, since they lack a proper closed-loop evaluation setup. The CARLA Leaderboard 2.0 challenge, which provides a diverse set of scenarios to address the long-tail problem in autonomous driving, has emerged as a valuable alternative platform for developing perception and planning models in both open-loop and closed-loop evaluation setups. Nevertheless, existing datasets collected on this platform present certain limitations. Some datasets appear to be tailored primarily to a limited set of sensor configurations. To support end-to-end autonomous driving research, we have collected a new dataset comprising over 2.85 million frames using the CARLA simulation environment for the diverse Leaderboard 2.0 challenge scenarios. Our dataset is designed not only for planning tasks but also supports dynamic object detection, lane divider detection, centerline detection, traffic light recognition, prediction tasks, and visual language action models. Furthermore, we demonstrate its versatility by training various models using our dataset. We also provide numerical rarity scores to understand how rarely the current state occurs in the dataset.

2604.01993 2026-06-10 cs.CL cs.AI 版本更新

SAFE: An LLM-as-Verifier Framework for Evidence-Grounded Multi-Hop Reasoning

SAFE: 一种基于LLM作为验证器的证据驱动多跳推理框架

Daeyong Kwon, Soyoung Yoon, Seung-won Hwang

发表机构 * Seoul National University(首尔国立大学)

AI总结 提出SAFE框架,通过将推理分解为知识图谱三元组,在生成过程中逐步验证中间步骤,以解决多跳问答中模型通过无效推理得到正确答案的问题,平均准确率提升8.8个百分点。

详情
AI中文摘要

多跳问答基准测试常常奖励大型语言模型(LLM)的虚假正确性,即模型通过无效的中间推理得出正确答案。我们提出了SAFE,一种基于LLM作为验证器的证据驱动多跳问答框架。SAFE不是在生成后仅判断最终答案,而是在生成过程中通过检查中间步骤与提供的段落和先前的推理轨迹来验证推理。为了使这一过程可检查,SAFE将推理分解为以知识图谱(KG)三元组表示的原子化、证据驱动的单元。在训练时,SAFE在KG约束下验证基准监督,并构建可靠的验证器训练数据。在推理时,外部验证器检查每个生成的步骤,识别无效推理,并在错误传播之前提供纠正反馈。在三个多跳问答基准测试中,SAFE平均提高了8.8个百分点的准确率。这些结果表明,证据驱动的多跳问答受益于将基于LLM的评估从事后答案判断转向逐步推理验证。

英文摘要

Multi-hop QA benchmarks often reward Large Language Models (LLMs) for spurious correctness, where models reach correct answers through invalid intermediate reasoning. We propose SAFE, an LLM-as-verifier framework for evidence-grounded multi-hop QA. Rather than judging only the final answer after generation, SAFE verifies reasoning during generation by checking intermediate steps against the provided passages and previous reasoning trajectory. To make this process checkable, SAFE decomposes reasoning into atomic, evidence-grounded units represented with Knowledge Graph (KG) triples. At train-time, SAFE verifies benchmark supervision under KG-grounded constraints and constructs reliable verifier training data. At inference-time, an external verifier checks each generated step, identifies invalid reasoning, and provides correction feedback before errors propagate. Across three multi-hop QA benchmarks, SAFE improves accuracy by 8.8 pp on average. These results show that evidence-grounded multi-hop QA benefits from shifting LLM-based evaluation from post-hoc answer judgment to stepwise reasoning verification.
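
Stepwise verification over triples can be sketched as a check that each reasoning step is both supported by the evidence and connected to what has already been established. In SAFE the verifier is an LLM. The sketch below replaces it with exact triple matching purely for illustration, and the function name, relation names, and toy triples are hypothetical.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def verify_chain(
    steps: List[Triple],
    evidence: Set[Triple],
    start_entity: str,
) -> Optional[Tuple[int, str]]:
    """Return (index, reason) for the first invalid step, or None if all pass."""
    known = {start_entity}
    for index, (subj, rel, obj) in enumerate(steps):
        if (subj, rel, obj) not in evidence:
            return index, "not supported by the passages"
        if subj not in known:
            return index, "does not follow from earlier steps"
        known.add(obj)  # the object becomes usable by later hops
    return None


passages = {
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Christopher Nolan", "born_in", "London"),
}
valid = [("Inception", "directed_by", "Christopher Nolan"),
         ("Christopher Nolan", "born_in", "London")]
spurious = [("Inception", "directed_by", "Steven Spielberg"),
            ("Christopher Nolan", "born_in", "London")]

print(verify_chain(valid, passages, "Inception"))
print(verify_chain(spurious, passages, "Inception"))
```

The second chain would still end at "London", so a check on the final answer alone would accept it. Checking each hop catches the unsupported first step before it propagates, which is the shift from post-hoc judgment to stepwise verification that the abstract argues for.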

2603.28054 2026-06-10 cs.CL 版本更新

Who Wrote the Book? Detecting and Attributing LLM Ghostwriters

谁写了这本书?检测与归因LLM代笔作者

Anudeex Shetty, Qiongkai Xu, Olga Ohrimenko, Jey Han Lau

发表机构 * School of Computing and Information Systems, The University of Melbourne, Australia(墨尔本大学计算与信息学院) School of Computing, FSE, Macquarie University, Australia(麦考瑞大学计算学院)

AI总结 提出GhostWriteBench数据集和TRACE指纹方法,用于检测和归因LLM生成的长文本,在跨域和未见模型上达到SOTA性能。

Comments WIP

详情
AI中文摘要

在本文中,我们介绍了GhostWriteBench,一个用于LLM作者归因的数据集。它包含由前沿LLM生成的长篇文本(每本书超过5万字),旨在测试跨多个分布外(OOD)维度的泛化能力,包括领域和未见过的LLM作者。我们还提出了TRACE——一种新颖的、可解释且轻量级的指纹方法——适用于开源和闭源模型。TRACE通过捕获由另一个轻量级语言模型估计的token级转换模式(例如,词排名)来创建指纹。在GhostWriteBench上的实验表明,TRACE实现了最先进的性能,在OOD设置中保持鲁棒性,并且在有限训练数据场景下表现良好。

英文摘要

In this paper, we introduce GhostWriteBench, a dataset for LLM authorship attribution. It comprises long-form texts (50K+ words per book) generated by frontier LLMs, and is designed to test generalisation across multiple out-of-distribution (OOD) dimensions, including domain and unseen LLM author. We also propose TRACE -- a novel fingerprinting method that is interpretable and lightweight -- that works for both open- and closed-source models. TRACE creates the fingerprint by capturing token-level transition patterns (e.g., word rank) estimated by another lightweight language model. Experiments on GhostWriteBench demonstrate that TRACE achieves state-of-the-art performance, remains robust in OOD settings, and works well in limited training data scenarios.
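
The abstract only says that TRACE fingerprints capture token-level transition patterns such as word rank under a lightweight language model. The sketch below is a toy reading of that idea, not the published method: a bigram count table stands in for the lightweight model, each token is scored by its rank among the model's predicted continuations, and the fingerprint is the normalized histogram of ranks. All names and the example corpus are assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def train_bigram(tokens: List[str]) -> Dict[str, Counter]:
    """Count next-token frequencies for each preceding token."""
    table: Dict[str, Counter] = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        table[prev][nxt] += 1
    return table


def token_ranks(tokens: List[str], table: Dict[str, Counter]) -> List[int]:
    """Rank (1 = most likely) of each observed token under the bigram model."""
    ranks = []
    for prev, nxt in zip(tokens, tokens[1:]):
        counts = table.get(prev, Counter())
        # Sort by descending count, breaking ties alphabetically.
        ordered = [tok for tok, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]
        ranks.append(ordered.index(nxt) + 1 if nxt in ordered else len(ordered) + 1)
    return ranks


def fingerprint(ranks: List[int], max_rank: int = 3) -> List[float]:
    """Normalized histogram of ranks, with ranks >= max_rank pooled together."""
    counts = Counter(min(r, max_rank) for r in ranks)
    return [counts[r] / len(ranks) for r in range(1, max_rank + 1)]


reference = "the cat sat on the mat the cat ran".split()
model = train_bigram(reference)
text = "the cat sat on the mat".split()
ranks = token_ranks(text, model)
print(ranks, fingerprint(ranks))
```

Fingerprints of this kind need only the text and a small scoring model, which is consistent with the abstract's claim that the method applies to closed-source generators as well.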

2603.25670 2026-06-10 cs.LG cs.SE 版本更新

Uncertainty-Guided Label Rebalancing for CPS Safety Monitoring

不确定性引导的标签重平衡用于CPS安全监控

John Ayotunde, Qinghua Xu, Guancheng Wang, Lionel C. Briand

发表机构 * Lero Research Ireland Centre for Software Research, University of Limerick, Castletroy, Limerick, Ireland(勒罗爱尔兰软件研究中心,利默里克大学,卡斯莱特里,利默里克,爱尔兰) University of Ottawa, Canada(渥太华大学,加拿大) Lero Research Ireland Centre for Software Research, University of Limerick, Ireland(勒罗爱尔兰软件研究中心,利默里克大学,爱尔兰)

AI总结 提出U-Balance方法,利用行为不确定性对CPS时间序列数据进行标签重平衡,通过GatedMLP预测不确定性并概率性重标异常安全样本,在无人机基准上F1达0.806,优于基线14.3个百分点。

Comments 11 pages (main content), 3 pages references, 5 figures, 5 tables

详情
AI中文摘要

安全监控对于信息物理系统(CPS)至关重要。然而,实际CPS运行中不安全事件罕见,导致极端类别不平衡,降低了安全预测器的性能。标准重平衡技术对时间序列CPS遥测数据表现不佳,要么生成不真实的合成样本,要么对少数类过拟合。同时,CPS操作中的行为不确定性(定义为CPS决策中的怀疑或不确定程度)通常与安全结果相关,但在安全监控中尚未被探索。为此,我们提出U-Balance,一种监督方法,在训练安全预测器之前利用行为不确定性对不平衡数据集进行重平衡。U-Balance首先训练一个基于GatedMLP的不确定性预测器,将每个遥测窗口总结为分布运动学特征并输出不确定性分数。然后,它应用不确定性引导的标签重平衡(uLNR)机制,将具有异常高不确定性的安全标记窗口概率性地重新标记为不安全,从而在不合成新数据的情况下,用信息丰富的边界样本丰富少数类。最后,在重平衡数据集上训练安全预测器用于安全监控。我们在一个安全与不安全比例为46:1的大规模无人机基准上评估U-Balance。结果证实了行为不确定性与安全之间存在中等但显著的相关性。然后,我们确定uLNR是利用不确定性信息的最有效策略,优于直接早期融合和晚期融合。U-Balance实现了0.806的F1分数,比最强基线高出14.3个百分点,同时保持了有竞争力的推理效率。消融研究证实,基于GatedMLP的不确定性预测器和uLNR机制都对U-Balance的有效性有显著贡献。

英文摘要

Safety monitoring is essential for Cyber-Physical Systems (CPSs). However, unsafe events are rare in real-world CPS operations, creating an extreme class imbalance that degrades safety predictors. Standard rebalancing techniques perform poorly on time-series CPS telemetry, either generating unrealistic synthetic samples or overfitting on the minority class. Meanwhile, behavioral uncertainty in CPS operations, defined as the degree of doubt or uncertainty in CPS decisions, is often correlated with safety outcomes but unexplored in safety monitoring. To that end, we propose U-Balance, a supervised approach that leverages behavioral uncertainty to rebalance imbalanced datasets prior to training a safety predictor. U-Balance first trains a GatedMLP-based uncertainty predictor that summarizes each telemetry window into distributional kinematic features and outputs an uncertainty score. It then applies an uncertainty-guided label rebalancing (uLNR) mechanism that probabilistically relabels safe-labeled windows with unusually high uncertainty as unsafe, thereby enriching the minority class with informative boundary samples without synthesizing new data. Finally, a safety predictor is trained on the rebalanced dataset for safety monitoring. We evaluate U-Balance on a large-scale UAV benchmark with a 46:1 safe-to-unsafe ratio. Results confirm a moderate but significant correlation between behavioral uncertainty and safety. We then identify uLNR as the most effective strategy to exploit uncertainty information, compared to direct early and late fusion. U-Balance achieves a 0.806 F1 score, outperforming the strongest baseline by 14.3 percentage points, while maintaining competitive inference efficiency. Ablation studies confirm that both the GatedMLP-based uncertainty predictor and the uLNR mechanism contribute significantly to U-Balance's effectiveness.
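
The relabeling step can be sketched in a few lines. This is a minimal illustration of probabilistic relabeling under stated assumptions, not the paper's uLNR procedure: the fixed `threshold` for "unusually high" uncertainty and the constant `flip_prob` are hypothetical parameters, and the uncertainty scores are supplied directly instead of coming from a trained predictor.

```python
import random
from typing import List

SAFE, UNSAFE = 0, 1


def relabel_uncertain_safe(
    labels: List[int],
    uncertainty: List[float],
    threshold: float,
    flip_prob: float,
    seed: int = 0,
) -> List[int]:
    """Flip safe windows with uncertainty above threshold to unsafe with flip_prob."""
    rng = random.Random(seed)  # seeded so the rebalanced dataset is reproducible
    relabeled = []
    for label, score in zip(labels, uncertainty):
        if label == SAFE and score > threshold and rng.random() < flip_prob:
            relabeled.append(UNSAFE)
        else:
            relabeled.append(label)  # unsafe labels are never changed
    return relabeled


window_labels = [SAFE, SAFE, SAFE, UNSAFE]
window_uncertainty = [0.10, 0.90, 0.95, 0.20]
print(relabel_uncertain_safe(window_labels, window_uncertainty, threshold=0.8, flip_prob=1.0))
```

No new windows are created: the minority class grows only by reassigning existing borderline samples, which is how the approach avoids the unrealistic synthetic data the abstract attributes to standard rebalancing.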