arXivDaily: daily arXiv digest, updated Monday through Friday
2605.30745 2026-06-01 cs.CV

Immuno-VLM: Immunizing Large Vision-Language Models via Generative Semantic Antibodies for Open-World Trustworthiness

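Xiang Fang, Wanlong Fang, Wei Ji

Affiliations School of Software Engineering, Huazhong University of Science and Technology; Nanyang Technological University, Singapore; Nanjing University

AI summary Large vision-language models deployed in the open world lack negative knowledge and therefore misclassify unknown anomalies as known classes with high confidence, a failure the authors call the "Hubris of Semantics". The proposed Immuno-VLM framework, inspired by negative selection in biological immunity, uses the generative reasoning of large language models to produce "Semantic Antibodies" (textual descriptions of near-distribution outliers) that bound the decision space of known classes. It reports new state-of-the-art results on ImageNet-1K and four OOD benchmarks.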

Comments Accepted by ICML 2026

Abstract

Large Vision-Language Models have achieved unprecedented success in zero-shot recognition by aligning visual features with broad semantic concepts. However, this semantic abstraction creates a critical vulnerability in open-world deployment: the "Hubris of Semantics", where models force-fit unknown anomalies into known categories with high confidence due to the lack of explicit negative knowledge. To address this Open-World Trustworthiness Paradox, we propose Immuno-VLM, a bio-inspired framework that adapts the biological principle of Immunological Negative Selection to high-dimensional latent spaces. Departing from traditional Open-Set Recognition methods that rely on passive density estimation or inefficient pixel-space outlier generation, Immuno-VLM leverages the generative reasoning of Large Language Models to actively hallucinate "Semantic Antibodies", textual descriptions of near-distribution outliers (e.g., look-alikes, contextual anomalies) that effectively bound the decision space of known classes. Extensive experiments on ImageNet-1K and four challenging OOD benchmarks reveal that Immuno-VLM establishes a new state-of-the-art.

2605.30742 2026-06-01 cs.CV

Annotations Are Not All You Need: A Cross-modal Knowledge Transfer Network for Unsupervised Temporal Sentence Grounding

Xiang Fang, Daizong Liu, Wanlong Fang, Pan Zhou, Yu Cheng, Keke Tang, Kai Zou

Affiliations Hubei Key Laboratory of Distributed System Security; Hubei Engineering Research Center on Big Data Security; School of Cyber Science and Engineering; Huazhong University of Science and Technology; Peking University; Henan University; The Chinese University of Hong Kong; Guangzhou University; Protagolabs Inc.

AI summary Proposes a cross-modal knowledge transfer network that transfers entity-aware and event-aware knowledge from Image-Noun and Video-Verb tasks to perform unsupervised temporal sentence grounding without paired video-query annotations.

Comments Published in Findings of EMNLP 2023

Abstract

This paper addresses the task of temporal sentence grounding (TSG). Although many respectable works have made decent achievements in this important topic, they severely rely on massive expensive video-query paired annotations, which require a tremendous amount of human effort to collect in real-world applications. To this end, in this paper, we target a more practical but challenging TSG setting: unsupervised temporal sentence grounding, where both paired video-query and segment boundary annotations are unavailable during the network training. Considering that some other cross-modal tasks provide many easily available yet cheap labels, we tend to collect and transfer their simple cross-modal alignment knowledge into our complex scenarios: 1) We first explore the entity-aware object-guided appearance knowledge from the paired Image-Noun task, and adapt them into each independent video frame; 2) Then, we extract the event-aware action representation from the paired Video-Verb task, and further refine the action representation into more practical but complicated real-world cases by a newly proposed copy-paste approach; 3) By modulating and transferring both appearance and action knowledge into our challenging unsupervised task, our model can directly utilize this general knowledge to correlate videos and queries, and accurately retrieve the relevant segment without training. Extensive experiments on two challenging datasets (ActivityNet Captions and Charades-STA) show our effectiveness, outperforming existing unsupervised methods and even competitively beating supervised works.

2605.30740 2026-06-01 cs.RO cs.AI

GSAM: A Generalizable and Safe Robotic Framework for Articulated Object Manipulation

Beichen Shao, Mengying Xie, Heng Su, Wanyi Zhang, Mingyan Li, Yan Ding, Fausto Giunchiglia, Chao Chen

Affiliations College of Computer Science, Chongqing University, Chongqing, China; Lumos Robotics, China; Xi'an Jiaotong-Liverpool University, China; Fudan University, China; Department of Information Engineering and Computer Science, University of Trento, Trento, Italy

AI summary Proposes the GSAM framework: a vision-based perceiver generates kinematic parameters, a VLM-based refiner corrects them with commonsense reasoning, an interaction constraint function generator integrates obstacle-avoidance knowledge, and a kinematic-aware planner verifies trajectory reachability. On 50 hinge tasks, it reduces standard deviation by 3.1% and improves manipulation success rate by 36.0% compared to the best baseline.

Comments Accepted by the 19th International Conference on Parallel Problem Solving from Nature (PPSN 2026)

Abstract

Articulated object manipulation is a unique challenge for service robots. Existing methods employ end-to-end policy learning, vision-motion planning, and large-language/visual-language models (LLMs/VLMs), but often overlook the diversity of articulated objects and the complexity of interactions between end-effector and handle, leading to limited generalization and destructive collisions. To address this, we propose GSAM, a generalizable and safe robotic framework for articulated object manipulation. Specifically, a vision-based perceiver generates the kinematic parameters. Considering that pre-trained markers in the perceiver yield raw estimations that may deviate from commonsense, we present a fine-tuned VLM-based refiner, using chain-of-thought (CoT) commonsense reasoning to refine perception. To prevent destructive collisions, we design an interaction constraint function generator, integrating articulated object, interaction pose, and obstacle avoidance knowledge into a base. An LLM then functionalizes these constraints and applies them to trajectory and posture planning. A kinematic-aware manipulation planner verifies reachability for trajectory and posture. Experiments on 50 hinge tasks across 5 object categories and 50 randomly initialized end-effector-handle configurations show that GSAM reduces standard deviation by 3.1% and improves manipulation success rate by 36.0% compared to the best baseline, respectively demonstrating the superior object generalization and interaction safety of GSAM in practical scenarios.

2605.30738 2026-06-01 cs.AI

MAVEN: Improving Generalization in Agentic Tool Calling

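Omkar Ghugarkar, Vishvesh Bhat, Muhammad Ahmed Mohsin, Asad Aali

Affiliations CoreThink AI, USA; Stanford University, Stanford, CA, USA

AI summary Proposes MAVEN, a lightweight symbolic reasoning scaffold that provides structured decomposition, adaptive tool orchestration, and intermediate verification. It substantially improves model performance on several benchmarks at roughly 1/10 the cost of frontier proprietary models.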

Abstract

Generalization across agentic tool-calling environments remains a central challenge for reliable agentic reasoning systems. Although large language models achieve strong results on individual benchmarks, their ability to compose reasoning strategies, preserve intermediate states, and coordinate tools across domains remains underexplored. We present MAVEN (Modular Agentic Verification and Execution Network), a lightweight symbolic reasoning scaffold for structured decomposition, adaptive tool orchestration, and intermediate verification. We evaluate MAVEN across established tool-calling benchmarks, including BFCL v3, TauBench, Tau2Bench, and AceBench, and introduce MAVEN-Bench, a stress-test benchmark for multi-step mathematical and physical reasoning with explicit verification and adversarial task composition. MAVEN-Bench exposes a substantial gap between partial reasoning quality and end-to-end task success; in direct MAVEN-Bench runs, MAVEN improves its GPT-OSS-120b base model from 48% to 71% accuracy without additional training. It also remains competitive with frontier proprietary baselines while using an open-weight backbone with an estimated cost ratio of roughly 1/10, suggesting that lightweight verification-centered scaffolds can strengthen compositional reasoning and motivate more process-aware evaluation of agents in the wild.

2605.30736 2026-06-01 cs.LG cs.AI cs.CL

OrcaRouter: A Production-Oriented LLM Router with Hybrid Offline-Online Learning

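Zhenghua Bao, Fengya Tian, Chris Zhang, Zhenjun Chen, Xile Ma, Yi Shi

Affiliations Continuum AI

AI summary Proposes OrcaRouter, a production-grade LLM router that combines a LinUCB contextual bandit with a hybrid offline-online learning protocol, using offline full-information feedback and online bandit learning to select models accurately at low cost.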

Comments 6 pages, 1 table. Technical report

Abstract

The rapid development of large language models, each with distinct capabilities and inference costs, raises a practical deployment question: given an incoming request, which model should handle it? We present OrcaRouter, a production-oriented LLM router that combines a LinUCB-based contextual bandit over lexical and sentence-embedding features with a hybrid offline-online learning protocol. Offline, OrcaRouter obtains full-information feedback by evaluating each candidate model on a curated set of routing prompts, yielding a reward matrix used to fit one ridge regressor per arm. At deployment time, it initializes from these parameters and can optionally continue learning from bandit feedback, updating only the selected model's arm after observing its reward. At the time of our RouterArena submission (May 20, 2026), OrcaRouter-Adaptive ranked second on the public RouterArena leaderboard with an arena score of 72.08, achieving 75.54% accuracy at a cost of USD 1.00 per 1,000 queries.
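The offline-then-online protocol described above can be sketched in a few lines. This is a minimal illustration, not OrcaRouter's implementation: the class and function names (`LinUCBArm`, `fit_offline`, `route`), the two-dimensional toy features, the ridge strength, and the exploration weight `alpha` are all assumptions made for the example. It keeps one ridge regressor per arm, warm-starts every arm from a full reward matrix, and then scores arms with the standard LinUCB upper confidence bound.

```python
import math


def _matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class LinUCBArm:
    """One ridge regressor per candidate model (hypothetical helper class)."""

    def __init__(self, dim, ridge=1.0):
        # a_inv is the inverse of (ridge * I + sum of x x^T); b is sum of reward * x.
        self.a_inv = [[(1.0 / ridge if i == j else 0.0) for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim

    def update(self, x, reward):
        # Sherman-Morrison rank-one update keeps the inverse without re-inverting.
        ax = _matvec(self.a_inv, x)
        denom = 1.0 + _dot(x, ax)
        for i in range(len(x)):
            for j in range(len(x)):
                self.a_inv[i][j] -= ax[i] * ax[j] / denom
        for i in range(len(x)):
            self.b[i] += reward * x[i]

    def ucb(self, x, alpha):
        theta = _matvec(self.a_inv, self.b)            # ridge estimate
        width = math.sqrt(_dot(x, _matvec(self.a_inv, x)))
        return _dot(theta, x) + alpha * width          # mean + exploration bonus


def fit_offline(features, reward_matrix, dim, ridge=1.0):
    """Full-information phase: every arm sees the reward for every prompt."""
    arms = [LinUCBArm(dim, ridge) for _ in range(len(reward_matrix[0]))]
    for x, rewards in zip(features, reward_matrix):
        for arm, reward in zip(arms, rewards):
            arm.update(x, reward)
    return arms


def route(arms, x, alpha=0.1):
    """Pick the arm with the highest upper confidence bound."""
    scores = [arm.ucb(x, alpha) for arm in arms]
    return max(range(len(arms)), key=scores.__getitem__)


# Toy data (assumed): two prompt types, two models, each model good at one type.
prompts = [[1.0, 0.0], [0.0, 1.0]] * 5
rewards = [[1.0, 0.0], [0.0, 1.0]] * 5
arms = fit_offline(prompts, rewards, dim=2)
```

At deployment time, the bandit-feedback step would call `arms[chosen].update(x, observed_reward)` on the selected arm only, leaving the other arms' parameters unchanged.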

2605.30734 2026-06-01 cs.LG cs.CV

Beyond Accuracy: Evaluating Efficiency, Robustness and Explainability in Deep Learning for Malaria Diagnosis

Olivier Kanamugire, Kerol Djoumessi

Affiliations African Institute for Mathematical Sciences; Hertie Institute for AI in Brain Health

AI summary Benchmarks four deep learning models on the NLM-Malaria dataset, jointly evaluating predictive performance, robustness, and post-hoc explainability. Lightweight models match heavier ones in performance, but explanations are fragile under image corruption.

Comments Under review

Abstract

Malaria remains a leading cause of mortality in sub-Saharan Africa, where scarce diagnostic infrastructure makes timely, accurate diagnosis particularly challenging. While deep learning offers a compelling path toward automated malaria screening, clinical adoption is hindered by computational cost and opacity in decision-making. This work benchmarks four deep learning models spanning a wide range of design architectures and model capacities on the NLM-Malaria dataset, jointly evaluating predictive performance, robustness, and post-hoc explainability. We find that lightweight, efficient-by-design models match their heavier counterparts in predictive performance, and the Friedman test confirms no statistically significant performance differences. CAM-based XAI methods consistently localize diagnostically relevant regions, while fine-grained attribution methods produce less targeted explanations, particularly with heavier backbones. Robustness evaluation under three types of image corruption further reveals that model confidence degrades faster than accuracy, providing a practical signal for human review. However, no XAI method is robust to corruption, with explanation reliability degrading at noise levels plausible in clinical practice, even when predictions remain accurate. These findings support the deployment of lightweight architectures for malaria diagnosis in resource-constrained settings, while highlighting the vulnerability of post-hoc explanations as an important consideration for responsible clinical deployment.

2605.30729 2026-06-01 cs.LG cs.IR

SemStruct: Contextualizing Semantic Embeddings with Structural Information for Schema Matching

Inwon Kang, Kavitha Srinivas, Nandana Mihindukulasooriya, Sola Shirai, Parikshit Ram, Horst Samulowitz, Oshani Seneviratne

Affiliations Rensselaer Polytechnic Institute; IBM Research

AI summary Proposes the SemStruct framework, which combines a frozen pre-trained language model with the structural inductive bias of a graph neural network and uses row-level co-occurrence as structural information, achieving state-of-the-art performance on schema matching.

Comments Accepted to KDD 26 Research Track

Abstract

Schema matching is a fundamental step in integrating heterogeneous data sources. While Pre-trained Language Models (PLMs) have revolutionized this task by capturing linguistic semantics, they typically process tabular data as serialized text sequences of standalone column descriptions. This serialization discards critical structural information -- specifically, the row-level co-occurrences, i.e. the relational context -- forcing models to rely solely on column header semantics or standalone distributions. To bridge this gap, we propose SemStruct, a framework that joins the semantic power of frozen PLMs with the structural inductive bias of Graph Neural Networks (GNNs). We model the table as a heterogeneous graph where columns and values are nodes connected by rows, allowing the GNN to propagate disambiguating context across the structure. Unlike other state-of-the-art methods that require proprietary LLM access and fine-tuning of language models, SemStruct keeps the language model frozen and trains only a lightweight structural encoder. Extensive experiments on the Valentine and SOTAB-SM benchmarks demonstrate that SemStruct achieves state-of-the-art performance, outperforming fully fine-tuned baselines on complex, semantically joinable datasets. Furthermore, our ablation studies reveal that row representations serve primarily as topological conduits rather than semantic entities, validating the necessity of explicit structural modeling in schema matching.
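The table-as-graph idea above can be illustrated with a small construction routine. This is a sketch under stated assumptions rather than SemStruct's actual graph schema: the function name `build_table_graph`, the node identifier format, and the two relation names (`in_column`, `in_row`) are hypothetical, and rows are modeled here as explicit connector nodes. The point it shows is that a value shared by two rows becomes a single node, so context can flow between those rows.

```python
def build_table_graph(header, rows):
    """Turn a table into typed nodes and edges (hypothetical schema).

    Nodes: one per column, one per row, one per distinct (column, cell) value.
    Edges: value -> column it belongs to, value -> each row it appears in.
    """
    nodes = {
        "column": [f"column:{name}" for name in header],
        "row": [],
        "value": [],
    }
    edges = set()
    seen_values = set()
    for index, row in enumerate(rows):
        row_id = f"row:{index}"
        nodes["row"].append(row_id)
        for name, cell in zip(header, row):
            value_id = f"value:{name}={cell}"
            if value_id not in seen_values:
                seen_values.add(value_id)
                nodes["value"].append(value_id)
            edges.add((value_id, "in_column", f"column:{name}"))
            edges.add((value_id, "in_row", row_id))
    return nodes, edges


# Toy table (assumed): "France" co-occurs with both cities via its two rows.
nodes, edges = build_table_graph(
    ["city", "country"],
    [["Paris", "France"], ["Lyon", "France"]],
)
```

In a full pipeline, each node would carry an embedding from the frozen language model and a GNN would pass messages along these edges; that part is omitted here.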

2605.30728 2026-06-01 cs.LG cs.DC

Reducing the GPU Memory Bottleneck with Lossless Compression for ML -- Extended

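Aditya K Kamath, Arvind Krishnamurthy, Marco Canini, Simon Peter

Affiliations University of Washington

AI summary Proposes IBP, a lossless compression algorithm that identifies and eliminates invariant bits across groups of tensors. With GPU-optimized decompression and asynchronous PCIe transfers, it substantially reduces data transfer time and speeds up GNN training, DLRM embedding lookup, and LLM inference.

Comments Extended version of paper published at 21st European Conference on Computer Systems (EUROSYS '26), April 27-30, 2026, Edinburgh, Scotland, UK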

Journal ref 2026. In Proceedings of the 21st European Conference on Computer Systems. Association for Computing Machinery, 899-918

Abstract

Machine learning (ML) training and inference often process data sets far exceeding GPU memory capacity, forcing them to rely on PCIe for on-demand tensor transfers, causing critical transfer bottlenecks. Lossy compression has been proposed to relieve bottlenecks but introduces workload-dependent accuracy loss, making it complex or even prohibitive to use in existing ML deployments. We explore lossless compression as an alternative that avoids this deployment complexity. We identify where lossless compression can be integrated into ML pipelines while minimizing interference with GPU execution. Based on our findings, we introduce Invariant Bit Packing (IBP), a novel lossless compression algorithm designed to minimize data transfer time for ML. IBP identifies and eliminates invariant bits across groups of tensors, improving throughput through GPU-optimized decompression that leverages warp parallelism, low-overhead bit operations, and asynchronous PCIe transfers. We provide easy-to-use APIs, showcasing them by adding IBP support to GNN training, as well as DLRM and LLM inference frameworks. IBP achieves, on average, 74% faster GNN training, 180% faster DLRM embedding lookup, and 24% faster LLM inference.
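The core idea of dropping bits that never change across a group can be shown with a CPU-only toy. This is an illustrative sketch, not the paper's GPU algorithm or API: the function names, the 32-bit word width, and the choice to treat each value as a plain integer bit pattern are assumptions. Bits that are identical in every word are stored once in `base`; only the varying bit positions are packed per word, and decompression scatters them back.

```python
def find_varying_mask(words, width=32):
    """Bit mask with 1s at positions that differ somewhere in the group."""
    mask = 0
    for word in words:
        mask |= word ^ words[0]
    return mask & ((1 << width) - 1)


def compress(words, width=32):
    """Return (base, mask, packed) for a non-empty group of unsigned words."""
    mask = find_varying_mask(words, width)
    base = words[0] & ~mask & ((1 << width) - 1)   # the invariant bits, stored once
    positions = [i for i in range(width) if (mask >> i) & 1]
    packed = []
    for word in words:
        small = 0
        for k, pos in enumerate(positions):
            small |= ((word >> pos) & 1) << k      # gather varying bits
        packed.append(small)
    return base, mask, packed


def decompress(base, mask, packed, width=32):
    """Exact inverse of compress: scatter packed bits back over the base."""
    positions = [i for i in range(width) if (mask >> i) & 1]
    words = []
    for small in packed:
        word = base
        for k, pos in enumerate(positions):
            word |= ((small >> k) & 1) << pos
        words.append(word)
    return words


# Toy group (assumed): three 32-bit patterns that differ only in the low two bits.
group = [0x3F800000, 0x3F800001, 0x3F800003]
base, mask, packed = compress(group)
bits_per_word = bin(mask).count("1")
```

Here each word needs 2 bits instead of 32 after compression, and `decompress(base, mask, packed)` returns the original group exactly, which is what makes the scheme lossless.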

2605.30727 2026-06-01 cs.CL

MosaicLeaks: Privacy Risks in Querying-in-the-Open for Deep Research Agents

Alexander Gurung, Spandana Gella, Alexandre Drouin, Issam H. Laradji, Perouz Taslakian, Rafael Pardinas

Affiliations ServiceNow AI Research; University of Edinburgh; Mila - Quebec AI Institute; McGill University; University of British Columbia

AI summary Addresses the risk that deep research agents leak sensitive local information when querying external tools. Introduces the MosaicLeaks benchmark and the Privacy-Aware Deep Research (PA-DR) framework, which uses reinforcement learning to reduce leakage.

Abstract

Deep research agents increasingly combine private local documents with external tools like web retrieval, creating a privacy risk: an agent's external queries may leak sensitive information from its local context. This risk is amplified by the mosaic effect, where individual queries may appear harmless but become revealing in aggregate. We introduce MosaicLeaks, a benchmark of 1,001 multi-hop deep research tasks that chain private enterprise documents and a public web corpus, forcing agents to make external queries that depend on local information. We evaluate leakage with an adversary LLM that observes only the agent's external queries and attempts to infer private information at three levels: the agent's research intent, answers to specific private questions and verifiable claims about the enterprise documents. We find that models across families and sizes frequently leak at all three levels, that zero-shot privacy prompting reduces but does not eliminate leakage and that reinforcement learning for task performance alone worsens leakage. To address this, we propose Privacy-Aware Deep Research (PA-DR), an RL framework that combines situational rewards for task success with a learned privacy classifier to provide dense credit assignment over both per-query and mosaic-level leakage. Training Qwen3-4B-Instruct with PA-DR improves accuracy from 48.7% to 58.7% and reduces answer and full-information leakage from 34.0% to 9.9%.

2605.30723 2026-06-01 cs.CL

Skill is Not One-Size-Fits-All: Model-Aware Skill Alignment for LLM Agents

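Jianxiang Yu, Jiapeng Zhu, Bochen Lin, Qier Cui, Zichen Ding, Xiang Li

Affiliations East China Normal University

AI summary Proposes the MASA framework, which adapts skills to LLM backbones of differing capability through a hierarchical skill evolution pipeline and a lightweight model-conditioned rewriter, substantially improving performance on long-horizon interactive tasks.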

Abstract

LLM agents increasingly retrieve externally curated skills (procedural instructions retrieved at decision time) to improve performance on long-horizon interactive tasks. Existing skill libraries are typically treated as model-agnostic, reusing the same skill formulations across backbones with substantially different capacities and behaviors. However, our controlled experiments across multiple model scales show that skill effectiveness is strongly model-dependent: a skill that benefits one backbone can harm another. Motivated by this observation, we propose MASA (Model-Aware Skill Alignment), a framework that adapts skills to each target backbone without modifying agent weights. MASA operates in two stages: (1) a hierarchical skill evolution pipeline that iteratively rewrites general and task-specific skills using hill climbing and UCB-driven tree search, guided by environment feedback and model capability profiles; and (2) a lightweight model-conditioned skill rewriter trained on evolution trajectories to reproduce the adaptation in a single forward pass. Experiments across three interactive environments and four backbones show that MASA consistently achieves the best overall performance, with gains of up to 25.8 points over the strongest baseline. The learned rewriter further generalizes to unseen tasks and environments without additional search, consistently outperforming a much larger teacher LLM at a fraction of the inference cost.

2605.30720 2026-06-01 cs.LG cs.AI econ.GN q-fin.EC stat.ML

Kalimati Vegetable Price Index Forecasting with a Momentum Corrected Online Stacking Ensemble

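Sahaj Raj Malla

Affiliations Department of Mathematics, Kathmandu University

AI summary Targets the high volatility of agricultural prices in emerging economies with a Momentum-Corrected Online Stacking Ensemble. Using an inverse-volatility weighted composite index and 64 causal features, the model achieves RMSE = 1.771, MAPE = 0.68%, and R² = 0.845 at the 90-day horizon.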

Comments 21 pages, 8 figures, 2 tables

Abstract

Forecasting agricultural commodity prices in emerging economies is difficult due to high volatility, frequent supply disruptions, and strong cultural influences on demand. This study introduces the Kalimati Vegetable Price Index (KVPI), a new inverse-volatility weighted composite index that aggregates 135 daily wholesale commodities from Kathmandu over ten years (2013-2023). By creating a stable macro-level signal, the KVPI reduces the noise inherent in modelling individual crops. A rich set of 64 causally valid features was developed, including festival lead-lag effects, rolling statistics, and calendar variables. Fourteen forecasting models spanning statistical, tree-based, deep learning, hybrid, and transformer architectures were rigorously evaluated across short (7-day), medium (14- and 30-day), and long-term (90-day) horizons. Tree-based ensembles proved notably robust, while classical statistical models and complex transformers struggled with the noisy dataset. The proposed Momentum-Corrected Online Stacking Ensemble achieved the strongest performance, yielding a Root Mean Square Error (RMSE) of 1.771, an exceptionally low Mean Absolute Percentage Error (MAPE) of 0.68%, and explaining 84.5% of the variance (R-squared = 0.845) at the 90-day horizon. This open-source pipeline provides policymakers and supply chain actors in Nepal and similar markets with a practical, reliable tool for anticipating price movements and strengthening food security.
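Inverse-volatility weighting, the construction behind the index above, can be sketched as follows. The abstract does not give the exact formula, so everything specific here is an assumption: volatility is taken as the population standard deviation of simple daily returns, each series is rebased to its first day, and the function names and base level of 100 are hypothetical. The general principle is that noisier commodities get smaller weights, with weights normalized to sum to 1.

```python
from statistics import pstdev


def inverse_volatility_weights(price_series):
    """Weight each series by 1 / volatility, normalized to sum to 1.

    Assumes every series has at least three prices and nonzero volatility;
    a constant series would cause a division by zero and needs special handling.
    """
    inverse_vols = []
    for prices in price_series:
        returns = [(later - earlier) / earlier
                   for earlier, later in zip(prices, prices[1:])]
        inverse_vols.append(1.0 / pstdev(returns))
    total = sum(inverse_vols)
    return [value / total for value in inverse_vols]


def composite_index(price_series, base=100.0):
    """Weighted average of rebased prices, one index value per day."""
    weights = inverse_volatility_weights(price_series)
    days = len(price_series[0])
    return [
        base * sum(w * series[t] / series[0]
                   for w, series in zip(weights, price_series))
        for t in range(days)
    ]


# Toy data (assumed): one calm commodity and one volatile commodity.
calm = [10.0, 10.1, 10.0, 10.1]
wild = [10.0, 12.0, 9.0, 13.0]
weights = inverse_volatility_weights([calm, wild])
index = composite_index([calm, wild])
```

With these toy series the calm commodity dominates the index, which is the noise-damping effect the abstract attributes to the composite signal.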

2605.30719 2026-06-01 cs.LG cs.AI

When are LLMs Sufficient Policy Optimizers for Sequential RL Tasks?

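Stephane Hatgis-Kessell, Emma Brunskill

Affiliations Department of Computer Science, Stanford University

AI summary Proposes PromptPO, which prompts an LLM with Python descriptions of the state space, action space, and reward function, then iteratively generates and refines executable policies from rollout feedback. It matches or exceeds standard RL baselines in a range of environments but falls short on fine-grained continuous control tasks.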

Abstract

We study when large language models (LLMs) can serve as effective black-box policy optimizers for reinforcement learning (RL) tasks, i.e., when can we replace classical RL algorithms with an LLM? We explore this question by introducing Prompted Policy Optimization (PromptPO), an iterative method that prompts an LLM with Python descriptions of the state space, action space, and reward function, then has it generate and refine executable policies based on rollout feedback. Across hard exploration environments, Meta-World robotics tasks, and several real-world control problems, PromptPO often matches or exceeds the performance of standard RL baselines while using substantially fewer environment interactions. To maximize expected return, and without further explicit prompting, the policies PromptPO outputs range from tuned proportional controllers or rule-based plans to policies that run planning algorithms like value iteration. Our results demonstrate that LLM-based policy optimization is sufficient when the LLM can leverage prior knowledge about the environment or optimization strategy. PromptPO underperforms standard RL baselines in MuJoCo domains. This demonstrates possible limitations of LLM-based policy optimization in settings that require fine-grained continuous control.

2605.30717 2026-06-01 cs.CL

Neuron-Level Interventions for Gendered and Gender-Neutral Generation in Language Models

Zhiwen You, Nafiseh Nikeghbal, Jana Diesner

Affiliations University of Illinois Urbana-Champaign; Technical University of Munich; Munich Center for Machine Learning

AI summary Proposes a neuron-level intervention method that identifies neurons associated with feminine, masculine, and gender-neutral forms to control the gender of text generated by language models, and finds that gender neurons concentrate in the earliest layers.

Abstract

Language models (LMs) can produce gendered language and stereotypes even when given neutral prompts. Most prior work on gender bias in LMs primarily examines gender through a binary lens (feminine vs. masculine), with limited attention to gender-neutral forms, such as they/them pronouns or neutrally phrased job titles. How gender-related signals are encoded in the internal representations of LMs remains an open question. In this work, we study gender-specific neurons in LMs across three categories: feminine, masculine, and gender-neutral. We propose a neuron-level intervention method to identify neurons that are strongly tied to each gender category. We then test these neurons through controlled generation, showing that activating or masking gender-related neurons can steer a sentence toward a target gender form while preserving its original meaning. To evaluate the effectiveness of our gender-intervention approach, we curate two datasets with controlled sentences labeled across all three gender categories and validate the data quality through human evaluation. Experiments on two open-source LMs show that gender-specific neurons are not evenly distributed across model layers; instead, they concentrate heavily in the earliest layers with smaller contributions from later layers. Compared to existing methods, our method achieves more precise gender control, with less leakage into non-target gender categories and stable output quality through two evaluation criteria. Overall, our work examines how gender is encoded in LMs and provides a simple yet effective approach toward controlled gender intervention for both neuron intervention evaluation and gender bias mitigation. Code and datasets are available at: https://github.com/zhiwenyou103/Gender-Neuron-Intervention

2605.30716 2026-06-01 cs.CV cs.AI

Simple Token-Efficient Vision-Language Model for Case-level Pathology Synoptic Report Generation

Zhiyuan Yang, Jiahao Cheng, Vincent Quoc-Huy Trinh, Mahdi S. Hosseini

Affiliations Department of Computer Science and Software Engineering (CSSE), Concordia University, Montreal, Canada; Axe Cancer, Centre de recherche du CHUM, Université de Montréal, Montreal, Canada; Institut de recherche en immunologie et cancérologie (IRIC), Université de Montréal; Mila - Quebec AI Institute, Montreal, Canada

AI summary Proposes a simple token-efficient vision-language model that uses 512×512 patches at 5× magnification and two-stage supervised training to generate case-level, multi-WSI pathology reports under limited GPU memory, substantially shortening sequences and improving efficiency.

Comments Accepted by the DeLTA 2026 conference

详情
AI中文摘要

从全切片图像(WSI)生成临床有用的病理报告具有挑战性,原因在于十亿像素分辨率、长视觉令牌序列以及病例级推理的复杂性(单个病例可能包含多个具有异质性组织和模糊发现的WSI)。我们提出了一种简单的令牌高效视觉语言模型,用于病例级概要报告生成,在受限GPU内存下保持实用性。我们的架构遵循最小的三组件设计:冻结的病理补丁编码器、轻量级两层MLP视觉语言对齐器和大语言模型解码器,并带有显式的WSI标记令牌以分隔病例内的切片。训练分两个监督阶段进行:(1)仅对齐器的WSI字幕生成,使用异质WSI-文本对;(2)病例级监督微调,基于病例-报告对进行结构化报告生成。为了减少序列长度,我们使用5倍放大率下的$512 \times 512$补丁表示每个切片,与常用的20倍补丁相比,平均序列长度减少高达64倍。结合高效训练技术,我们仅用半块NVIDIA H100 GPU即可实现实际训练。在两个训练阶段中,我们的方法在ROUGE-L/METEOR/BLEU-4上取得了高分,同时在内存和运行时间上显著更高效。在基于AI的评估中,我们的模型始终优于强基线。大量消融实验表征了性能-效率权衡,并确定了在多WSI设置中提高鲁棒性的简单选择。总体而言,这项工作为高效病理报告生成提供了一个强大且可复现的基线,降低了在有限计算资源下进行多WSI VLM研究的门槛。

英文摘要

Generating clinically useful pathology reports from whole-slide images (WSIs) is challenging due to gigapixel resolution, long visual-token sequences, and the complexity of case-level reasoning, where a single case may contain multiple WSIs with heterogeneous tissues and ambiguous findings. We present a simple token-efficient vision--language model for case-level synoptic report generation that remains practical under constrained GPU memory. Our architecture follows a minimal three-component design: a frozen pathology patch encoder, a lightweight two-layer MLP vision-language aligner, and a large language model decoder, with an explicit WSI marker token to separate slides within a case. Training proceeds in two supervised stages: (1) aligner-only WSI captioning using heterogeneous WSI-text pairs, and (2) case-level supervised fine-tuning on case-report pairs for structured report generation. To reduce sequence length, we represent each slide using $512 \times 512$ patches at $5\times$ magnification, which reduces the average sequence length by up to $64\times$ compared to the commonly used $20\times$ patches. Combined with efficient training techniques, we enable practical training with only half an NVIDIA H100 GPU. Across both training stages, our approach achieves high ROUGE-L/METEOR/BLEU-4 scores while being substantially more efficient in memory and runtime. In AI-based evaluations, our model is consistently preferred over strong baselines. Extensive ablations characterize performance-efficiency trade-offs and identify simple choices that improve robustness in multi-WSI settings. Overall, this work provides a strong, reproducible baseline for efficient pathology report generation, lowering the barrier to multi-WSI VLM research under limited compute.
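The "up to 64×" figure can be sanity-checked with simple tiling arithmetic. The sketch below assumes one visual token per patch and a baseline of 256 × 256 patches at 20× magnification; the abstract does not state the baseline patch size, so that value and the helper name `patch_count_ratio` are illustrative assumptions only.

```python
def patch_count_ratio(base_mag, base_patch, new_mag, new_patch):
    """Return how many times fewer patches tile the same tissue area.

    Lower magnification shrinks the image linearly by base_mag / new_mag,
    and a larger patch covers new_patch / base_patch more pixels per side.
    The patch count scales with the square of the combined linear factor.
    """
    linear_factor = (base_mag / new_mag) * (new_patch / base_patch)
    return linear_factor ** 2


# Assumed baseline: 256x256 patches at 20x (not stated in the abstract).
ratio = patch_count_ratio(base_mag=20, base_patch=256, new_mag=5, new_patch=512)
print(ratio)  # 64.0
```

Under this assumption the magnification change alone contributes a 16× reduction and the larger patch a further 4×. With a 512 × 512 baseline the same arithmetic gives only 16×, which is consistent with the abstract's "up to" wording.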

2605.30714 2026-06-01 cs.CV

Vision-Based Localization in Dense Urban Environments: A Case Study of an Urban Village in China

基于视觉的密集城市环境定位:中国城中村案例研究

Menglin Wu, Rui Cao

发表机构 * Thrust of Urban Governance and Design, Society Hub, The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)社会枢纽城市治理与设计学域)

AI总结 针对密集城市环境中GPS信号不可靠和地图数据不完整的问题,提出一种基于双摄像头系统的低成本视觉地理定位方法,并在广州石牌村数据集上评估现有模型性能。

详情
AI中文摘要

城中村是快速城市化过程中出现的广泛非正规住区,现已成为中国大城市中农民工的主要居住中心。这些区域建筑密集,常导致GPS信号不可靠,而不完整的地图数据进一步影响精确路线规划和导航。这些问题不仅阻碍日常出行,还对应急响应构成重大挑战,因为混乱的道路布局和GPS不准确可能使疏散工作复杂化。为应对这些挑战,我们提出了一种针对密集城市环境的实用视觉地理定位解决方案。我们的方法采用低成本的数采流程,利用双摄像头系统(包括全景相机和智能手机相机)捕获同步的360度全景图和查询图像。以广州著名的密集城中村石牌村为案例,我们开发了专门的图像地理定位数据集。然后,我们评估并比较了现有模型在不同场景类型下的性能,以识别其优缺点。研究结果展示了基于视觉的定位在密集城中村环境中的潜力和局限性。我们的框架旨在改善GPS覆盖较差区域的步行导航、最后一公里配送和应急管理,最终支持这些非正规住区中的弱势群体。

英文摘要

Urban villages, the widespread informal settlements which have emerged as a result of rapid urbanization, are now major residential hubs for migrant workers in large cities in China. The dense arrangement of buildings in these areas often leads to unreliable GPS signals, while incomplete mapping data further impairs accurate route planning and navigation. These issues not only hinder everyday mobility but also pose significant challenges for emergency response, as confusing road layouts and GPS inaccuracies can complicate evacuation efforts. To address these challenges, we propose a practical vision-based geo-localization solution tailored for dense urban environments. Our approach features a low-cost data collection pipeline utilizing a dual-camera system, comprising a panoramic camera and a smartphone camera, to capture synchronized 360-degree panoramas and query images. Using Shipai Village, a well-known densely populated urban village in Guangzhou, as a case study, we develop a specialized image geo-localization dataset. We then assess and compare the performance of existing models across various scene types to identify their strengths and weaknesses. The findings demonstrate both the potential and limitations of visual-based localization in dense urban-village environments. Our framework aims to enhance pedestrian navigation, last-mile delivery, and emergency management in areas with poor GPS coverage, ultimately supporting the vulnerable populations living within these informal settlements.

2605.30713 2026-06-01 cs.LG cs.CV cs.MM

Diversity Matters: Revisiting Test-Time Compute in Vision-Language Models

多样性至关重要:重新审视视觉-语言模型中的测试时计算

Yijie Tong, Yifan Hou, Shaobo Cui, Antoine Bosselut, Mrinmaya Sachan

发表机构 * ETH Zürich(苏黎世联邦理工学院) Shanghai Jiao Tong University(上海交通大学) EPFL(洛桑联邦理工学院)

AI总结 针对视觉-语言模型(VLM)中测试时计算(TTC)策略应用不足的问题,提出基于预测熵的ETTC方法,通过利用模型间的置信度差异提升集成性能,理论证明并实验验证其优于多数投票和最佳单模型。

Comments ICML 2026

详情
AI中文摘要

测试时计算(TTC)策略已成为提升大型语言模型(LLM)推理能力的一种轻量级方法。然而,它们在视觉-语言模型(VLM)中的应用和益处尚未得到充分探索。我们对七个VLM和六个基准进行了TTC的系统研究,特别分析了基于特征的评分和多数投票方法。我们发现特征启发式方法失败,而投票在单模型设置中仅带来微小提升。我们从理论上证明,这种局限性源于缺乏预测多样性:当输出高度相关时,投票收益甚微。相比之下,多模型集成提供了更丰富的多样性,但标准的多数投票未能考虑不同模型的能力差异。为解决这一问题,我们提出了基于熵的TTC(ETTC),它根据预测熵选择最自信的预测。在单模型情况下,我们的方法退化为多数投票,但在模型集成中,它利用置信度差异优先考虑更强的模型。我们证明,在温和假设下ETTC优于多数投票,并通过实验表明它始终优于投票和最佳个体模型。关键在于,我们的结果表明,较小的模型可以协同增强较大的模型,释放出标准策略无法实现的集成增益。

英文摘要

Test-time compute (TTC) strategies have emerged as a lightweight approach to boost reasoning in large language models (LLMs). However, their application and benefits for vision-language models (VLMs) remain underexplored. We present a systematic study of TTC across seven VLMs and six benchmarks, specifically analyzing feature-based scoring and majority voting methods. We find that feature heuristics fail and voting yields only modest gains in single-model settings. We theoretically show that this limitation stems from a lack of prediction diversity: when outputs are highly correlated, voting provides little benefit. In contrast, multi-model ensembles offer richer diversity, yet standard majority voting fails to account for varying model capabilities. To address this, we propose Entropy-based TTC (ETTC), which selects the most confident prediction based on predictive entropy. Our method reduces to majority voting in the single-model case, but in model ensembles, it leverages confidence disparities to prioritize stronger models. We prove that ETTC outperforms majority voting under mild assumptions and empirically demonstrate that it consistently surpasses both voting and the best individual model. Crucially, our results show that smaller models can synergistically enhance larger ones, unlocking ensembling gains not achievable with standard strategies.
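One plausible reading of the entropy-based selection described above can be sketched as follows. The abstract does not give the exact formulation, so everything here is an assumption: each model is sampled several times, the entropy of its empirical answer distribution serves as its confidence, and the majority answer of the lowest-entropy model is returned. The function names are hypothetical.

```python
import math
from collections import Counter


def answer_entropy(samples):
    """Shannon entropy (in nats) of the empirical answer distribution."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def entropy_select(samples_by_model):
    """Pick the model whose sampled answers have the lowest entropy,
    then return that model's most frequent answer.

    With a single model this is exactly majority voting over its samples.
    """
    best_model = min(
        samples_by_model, key=lambda m: answer_entropy(samples_by_model[m])
    )
    answer, _ = Counter(samples_by_model[best_model]).most_common(1)[0]
    return best_model, answer


# Hypothetical samples: the smaller model is unanimous, the larger one is not.
samples = {
    "small": ["A", "A", "A", "A"],
    "large": ["B", "A", "B", "C"],
}
print(entropy_select(samples))
```

Pooled majority voting over all eight samples would also return "A" here, but only by a narrow margin. The entropy rule instead defers entirely to the model that is most self-consistent, which is how a smaller model can override a larger one in this sketch.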

2605.30712 2026-06-01 cs.CL

ExpGraph: Model-Agnostic Experience Learning with Graph-Structured Memory for LLM Agents

ExpGraph: 面向LLM智能体的模型无关经验学习与图结构记忆

Tao Feng, Chongrui Ye, Tianyang Luo, Jingjun Xu, Xueqiang Xu, Haozhen Zhang, Zhigang Hua, Yan Xie, Shuang Yang, Ge Liu, Jiaxuan You

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Nanyang Technological University(南洋理工大学) Meta Monetization AI(Meta 营收人工智能)

AI总结 提出ExpGraph框架,通过图结构记忆和强化学习实现冻结LLM执行器的外部经验复用,在问答、数学推理、代码生成和多步智能体任务上显著提升性能并减少交互步骤。

详情
AI中文摘要

大型语言模型(LLM)智能体在推理、工具使用和多步交互方面表现出强大的能力,但它们通常从零开始解决任务,未能重用先前经验中的成功策略或失败教训。对收集的经验进行微调可以改善重用,但当出现更强或更合适的执行器时,这种方法不够灵活。我们提出ExpGraph,一个模型无关的经验学习框架,使冻结且可替换的LLM执行器能够通过外部经验重用而无需参数更新来改进。ExpGraph将历史轨迹总结为可重用的技能和失败教训,将它们组织为自进化经验图中的节点,并通过图扩散和效用感知排序检索有用的经验。一个轻量级的检索副驾驶通过强化学习进行训练,使用比较有无检索经验时执行器性能的反馈,同时图根据下游任务结果在线更新。我们在ExpSuite上评估ExpGraph,涵盖问答、数学推理、代码生成以及包括ALFWorld和AppWorld在内的多步智能体环境。ExpGraph在静态任务上,对于较小和较大的执行器,分别比最强基线提高12.2%和4.7%;在智能体环境中提高21.4%和12.7%,同时平均交互步骤减少12.7%和21.6%。消融实验表明,图结构经验、效用感知排序和自适应检索共同实现了跨不同任务和执行器模型的有效经验重用。

英文摘要

Large language model (LLM) agents have shown strong capabilities in reasoning, tool use, and multi-step interaction, but they often solve tasks from scratch and fail to reuse successful strategies or failure lessons from prior experience. Fine-tuning on collected experience can improve reuse, but it is inflexible when stronger or more suitable executors emerge. We propose ExpGraph, a model-agnostic experience learning framework that enables frozen and replaceable LLM executors to improve through external experience reuse without parameter updates. ExpGraph summarizes historical trajectories into reusable skills and failure lessons, organizes them as nodes in a self-evolving experience graph, and retrieves useful experiences through graph diffusion and utility-aware ranking. A lightweight retrieval copilot is trained with reinforcement learning using feedback that compares executor performance with and without retrieved experiences, while the graph is updated online from downstream task outcomes. We evaluate ExpGraph on ExpSuite, covering question answering, mathematical reasoning, code generation, and multi-step agentic environments including ALFWorld and AppWorld. ExpGraph improves over the strongest baseline by 12.2% and 4.7% on static tasks with smaller and larger executors, and by 21.4% and 12.7% in agentic environments, while reducing average interaction steps by 12.7% and 21.6%. Ablations show that graph-structured experience, utility-aware ranking, and adaptive retrieval jointly enable effective experience reuse across diverse tasks and executor models.

2605.30711 2026-06-01 cs.CL cs.AI cs.LG stat.ML

SAGE: A Novelty Gate for Efficient Memory Evolution in Agentic LLMs

SAGE: 一种用于智能体大语言模型中高效记忆演化的新颖门控机制

Sijia Wang, Dhanajit Brahma, Ricardo Henao

发表机构 * Duke University(杜克大学)

AI总结 提出SAGE门控机制,基于von Mises-Fisher密度估计和自适应阈值,将记忆写入控制建模为新奇性检测问题,在LoCoMo上以更低成本实现最优token-F1。

详情
AI中文摘要

智能体大语言模型必须持续决定新提取的事实是应添加、与现有记忆合并还是忽略,然而先前的工作更侧重于检索和存储,而非原则性的写入端控制。我们将记忆演化视为一个新颖性检测问题,并提出SAGE(Spherical Adaptive Gate for memory Evolution),一种用于记忆演化的球形自适应门控机制,它通过基于von Mises-Fisher的密度估计器对记忆嵌入上的候选事实进行评分,并使用跟踪记忆存储几何结构的自适应阈值对其进行路由。SAGE将明确新颖的事实解析为ADD,明确冗余的事实解析为NOOP,仅将不确定的情况发送给LLM合并步骤,从而减少了昂贵的写入时推理。在LoCoMo上,SAGE在所有七个开放权重骨干对比中均实现了对Mem0的最佳平均token-F1,而在GPT-4o-mini上,它将添加阶段的API成本降低了3.4倍,添加阶段延迟降低了2.5倍,且平均评判分数差距很小。作为A-Mem的即插即用二进制门控,SAGE在五个模型上跳过了大约16-18%的LLM调用,且在开放权重骨干上质量变化极小。这些结果表明,新颖性感知的写入控制是提高长期智能体记忆中记忆质量和系统效率的实用杠杆。

英文摘要

Agentic LLMs must continuously decide whether newly extracted facts should be added, merged with existing memories, or ignored, yet prior work has focused more on retrieval and storage than on principled write-side control. We frame memory evolution as a novelty-detection problem and propose SAGE, a Spherical Adaptive Gate for memory Evolution that scores candidate facts with a von Mises-Fisher-based density estimator over memory embeddings and routes them with an adaptive threshold that tracks memory-store geometry. SAGE resolves clearly novel facts as ADD, clearly redundant facts as NOOP, and sends only uncertain cases to an LLM merge step, reducing expensive write-time reasoning. On LoCoMo, SAGE achieves the best average token-F1 against Mem0 on all seven open-weight backbone comparisons, while on GPT-4o-mini it reduces add-phase API cost by 3.4$\times$ and add-phase latency by 2.5$\times$ with only a small average judge-score gap. As a drop-in binary gate for A-Mem, SAGE skips roughly 16-18% of LLM calls across five models with minimal quality change on open-weight backbones. These results suggest that novelty-aware write control is a practical lever for improving both memory quality and system efficiency in long-term agentic memory.
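The three-way routing described above can be illustrated with a small kernel-density gate on the unit sphere. This is only a sketch under stated assumptions: the score is an unnormalized von Mises-Fisher kernel averaged over stored embeddings, and the thresholds `low` and `high` are fixed hypothetical values, whereas the paper describes an adaptive threshold that tracks memory-store geometry. The label "MERGE" stands in for sending the fact to the LLM merge step.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def vmf_density_score(candidate, memory, kappa=20.0):
    """Mean unnormalized vMF kernel exp(kappa * (cos - 1)) over the store.

    The score is 1.0 when the candidate coincides with every stored
    embedding and approaches 0.0 when it is far from all of them.
    """
    if not memory:
        return 0.0
    c = normalize(candidate)
    total = 0.0
    for m in memory:
        cos = sum(a * b for a, b in zip(c, normalize(m)))
        total += math.exp(kappa * (cos - 1.0))
    return total / len(memory)


def route(candidate, memory, low=0.05, high=0.8, kappa=20.0):
    """Route a candidate fact: ADD if novel, NOOP if redundant, else MERGE."""
    score = vmf_density_score(candidate, memory, kappa)
    if score < low:
        return "ADD"    # clearly novel: write without calling the LLM
    if score > high:
        return "NOOP"   # clearly redundant: skip without calling the LLM
    return "MERGE"      # uncertain: defer to the LLM merge step


store = [[1.0, 0.0]]
print(route([0.0, 1.0], store))      # orthogonal to the stored embedding
print(route([1.0, 0.0], store))      # identical to the stored embedding
print(route([0.95, 0.3122], store))  # close but not identical
```

Only the middle band incurs an LLM call, which is the source of the write-time savings the abstract reports.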

2605.30700 2026-06-01 cs.CV cs.LG

Mathematical Morphology in Machine Learning

机器学习中的数学形态学

Erick Oliveira Rodrigues, Aura Conci

发表机构 * Universidade Federal Fluminense(弗鲁米嫩塞联邦大学)

AI总结 将数学形态学引入机器学习,提出基于形态学重建的快速聚类算法和一种结合闵可夫斯基与切比雪夫距离的新型距离度量,并设计新型形态学分类器以建模形状、密度和分形信息。

Journal ref SIBGRAPI 2018

详情
AI中文摘要

本工作将数学形态学(一种成熟的视觉计算理论)引入机器学习,以利用标准技术常忽视的形状和密度方面。我们提出了一种基于形态学重建的快速聚类算法,该算法能精确保留聚类形状和密度。该方案具有独特特性:内在的最大聚类感知、无成本的噪声去除以及由结构元素控制的多样化增长模式。此外,我们提出了一种结合闵可夫斯基距离和切比雪夫距离的新型距离度量,对于形态学膨胀非常高效。在 $Z^2$ 离散邻域迭代中,它比曼哈顿距离快约1.3倍,比欧几里得距离快约329.5倍。当使用k近邻(k-NN)分类器在33个UCI数据集上与其他14种距离度量进行评估时,我们的度量达到高于平均准确率的次数最多(33例中的26例),并在9个案例中取得了最佳整体准确率。最后,我们引入了新型形态学分类器。与现有文献不同,本方案独特地对数据集中的形状、密度和分形信息进行建模。

英文摘要

This work introduces mathematical morphology, an established visual computing theory, into machine learning to exploit shape and density aspects often overlooked by standard techniques. We propose a fast clustering algorithm based on morphological reconstruction that accurately preserves cluster shapes and density. This scheme offers unique features: an intrinsic sense of maximal clusters, cost-free noise removal, and diverse growth patterns controlled by structuring elements. Additionally, we propose a novel distance metric combining Minkowski and Chebyshev distances, highly efficient for morphological dilations. In $Z^2$ discrete neighbourhood iterations, it is roughly 1.3 times faster than Manhattan and 329.5 times faster than Euclidean distances. When evaluated using a k-Nearest Neighbours (k-NN) classifier across 33 UCI datasets against 14 other distances, our metric achieved above-average accuracies most frequently (26 of 33 cases) and the best overall accuracy in 9 cases. Finally, we introduce novel morphological classifiers. Unlike current literature, this proposal uniquely models shape, density, and fractal information in datasets.

2605.30699 2026-06-01 cs.LG cs.CV

A Context-Aware Middleware for Medical Image Based Reports: An approach based on image feature extraction and association rules

基于医学图像报告的情境感知中间件:一种基于图像特征提取和关联规则的方法

Erick O. Rodrigues, Jose Viterbo, Aura Conci, Trueman Mac Henry

发表机构 * Department of Computer Science(计算机科学系) Department of Mathematics & Statistics(数学与统计学系) Universidade Federal Fluminense(弗鲁米嫩塞联邦大学) York University(约克大学)

AI总结 提出一种情境感知中间件,通过图像特征提取和关联规则,自动将医学图像分派给最合适的医疗人员,以提高医疗工作流程效率。

Journal ref 2015 IEEE/ACS 12th International Conference of Computer Systems and Applications (AICCSA)

详情
AI中文摘要

本工作提出了一种用于医疗工作流程组织和效率提升的情境感知中间件。在医院、实验室和远程放射学公司中,每位医生或技术人员都专注于特定类型的诊断或分析。因此,某些类型的医学图像通常会被转发给特定的医生或特定群体。这种转发非常耗时。也就是说,反复决定谁是最合适的医生,以及他在特定情境下是否可用,既繁琐又可能非常低效。因此,所提出的中间件能够处理并收集每位医疗人员所分析图像的数据。基于收集的数据和当前临床情境,中间件能够推断出谁是最适合接收特定传入医学图像的人员。

英文摘要

This work proposes a context-aware middleware for medical workflow organization and efficiency improvement. In hospitals, laboratories and teleradiology companies, each physician or technician is specialized in a specific kind of diagnosis or analysis. Therefore, certain types of medical images are often forwarded to a certain physician or a certain group. This forwarding is time-consuming: repeatedly deciding who would be the best physician, and whether that physician is available at a given moment in a given context, is exhausting and can be very inefficient. Thus, the proposed middleware processes and collects data from the images analyzed by each staff member. Based on the collected data and the current clinical context, the middleware infers which staff member is the best fit to receive a given incoming medical image.

2605.30698 2026-06-01 cs.CV cs.AI cs.MA

Seeing Before Agreeing: Aligning Multi-Agent Consensus with Visual Evidence

先见后议:用视觉证据对齐多智能体共识

Yuhan Wang, Shuochen Chang, Yalin Feng, Dongsheng Ma, Yuanzi Li, Zhengren Wang, Yinglong Yang, Yufei Chen, Yikang Wang, Shaoxu Sun, Wentao Zhang

发表机构 * Peking University(北京大学) Shanghai Jiao Tong University(上海交通大学) Nanyang Technological University(南洋理工大学) Renmin University of China(中国人民大学) Shandong University(山东大学)

AI总结 提出EAGLE框架,通过显式暴露各智能体的视觉证据区域并相互验证,实现无需训练的多智能体视觉问答协作,提升共识可靠性。

详情
AI中文摘要

视觉语言模型(VLM)在视觉问答(VQA)上取得了强劲性能。为了减轻个体幻觉和盲点,通过多智能体协作聚合不同视角已成为一种有前景的范式。虽然这种方法在文本问答中取得了巨大成功,但其在多模态领域的潜力仍未充分探索。现有的多智能体VQA方法主要采用以文本为中心的协议,专注于文本讨论而忽略视觉信息的对齐。在这项工作中,我们揭示了一个关键见解:答案级别的共识对于可靠的多智能体VQA是不够的;\textit{对齐的视觉证据}(即智能体所依赖的图像区域的共享支持)对于可信的共识至关重要。为了利用这一见解,我们提出了EAGLE(\textbf{E}vidence-\textbf{A}ligned \textbf{G}rounded mu\textbf{L}ti-agent r\textbf{E}asoning),一个无需训练的以证据为中心的框架,用于协调多个VLM智能体。EAGLE显式暴露每个智能体的定位区域作为视觉证据,允许对证据进行相互验证,并使用证据一致性指导最终决策。在六个VQA基准上的实验表明,EAGLE在跨领域实现了最佳平均性能,同时保持轻量、可解释且易于部署。

英文摘要

Vision-language models (VLMs) have achieved strong performance on visual question answering (VQA). To mitigate individual hallucinations and blind spots, aggregating diverse perspectives via multi-agent collaboration has emerged as a promising paradigm. While this approach has shown great success in textual QA, its potential in the multimodal domain remains under-explored. Existing multi-agent VQA methods predominantly adapt text-centric protocols, focusing on textual discussions while ignoring the alignment of visual information. In this work, we reveal a key insight: answer-level agreement is insufficient for reliable multi-agent VQA; \textit{aligned visual evidence} -- shared support from the image regions agents rely on -- is essential for trustworthy consensus. To leverage this insight, we propose EAGLE (\textbf{E}vidence-\textbf{A}ligned \textbf{G}rounded mu\textbf{L}ti-agent r\textbf{E}asoning), a training-free evidence-centered framework for coordinating multiple VLM agents. EAGLE explicitly exposes each agent's grounding regions as visual evidence, enables mutual verification over the evidence, and uses evidence consistency to guide final decision-making. Experiments on six VQA benchmarks show that EAGLE achieves the best average performance across domains while remaining lightweight, interpretable, and practical for deployment.

2605.30696 2026-06-01 cs.RO cs.SY eess.SY

Geometry-Aware Control Barrier Functions for Collision Avoidance via Bernstein Polynomial Approximations

基于Bernstein多项式近似的几何感知控制障碍函数用于碰撞避免

Siwon Jo, Yanze Zhang, Yupeng Yang, Wenhao Luo

发表机构 * GRASP Laboratory, University of Pennsylvania(宾夕法尼亚大学GRASP实验室) Department of Computer Science, University of Illinois Chicago(伊利诺伊大学芝加哥分校计算机科学系) Department of Computer Science, University of North Carolina at Charlotte(北卡罗来纳大学夏洛特分校计算机科学系)

AI总结 提出一种基于Bernstein多项式符号距离场的几何感知控制障碍函数,统一表示障碍物与机器人,利用多项式可微性实现闭环控制约束,保证单机器人和异构多机器人碰撞避免的安全性与效率。

Comments 8 pages; Accepted by 2026 IEEE International Conference on Robotics and Automation (ICRA 2026)

详情
AI中文摘要

安全导航通常依赖于基于机器人和障碍物形状的明确定义条件,当它们具有不规则几何形状时可能具有挑战性。虽然控制障碍函数(CBF)提供了一种有效机制来强制执行安全集前向不变性,但常见的形状替代(例如球体或超椭球体)要么在非结构化场景中过于保守,要么需要许多局部基元,这会增加约束数量并降低实时性能。在本文中,我们介绍了一种基于Bernstein多项式符号距离场(BP-SDF)的新型几何感知控制障碍函数(CBF)。它提供了一种统一的方式来表示障碍物和机器人,从而用统一的最小距离来表示障碍函数。得益于Bernstein多项式的可微性,可以轻松地在闭环中强制执行控制约束。我们通过不同环境下的仿真验证了该方法在单机器人导航和异构多机器人碰撞避免中保证安全性的效率和性能。

英文摘要

Safe navigation often relies on well-defined conditions based on the shape of robots and obstacles, and can be challenging when they have irregular geometries. While Control Barrier Functions (CBFs) offer an efficient mechanism to enforce safe set forward invariance, common shape surrogates (e.g., spheres or super-ellipsoids) either are overly conservative in unstructured scenes or require many local primitives, which inflates constraint counts and degrades real-time performance. In this paper, we introduce a novel geometry-aware Control Barrier Function (CBF) based on Bernstein-Polynomial Signed Distance Fields (BP-SDFs). It provides a unified way to represent the obstacles and robots, so as to represent the barrier function with a unified minimum distance. Benefiting from the differentiability of the Bernstein polynomials, one can easily enforce the control constraints in a closed loop. We validate the method's efficiency and performance to guarantee safety in single-robot navigation and heterogeneous multi-robot collision avoidance via simulations under different environments.
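The differentiability argument can be made concrete with a univariate Bernstein polynomial, whose derivative is again a Bernstein polynomial of one degree lower. The sketch below is illustrative only: the paper's BP-SDFs are presumably multivariate distance fields, and the function names are hypothetical. It evaluates $B(t) = \sum_i c_i \binom{n}{i} t^i (1-t)^{n-i}$ and its derivative $B'(t) = n \sum_i (c_{i+1} - c_i) \binom{n-1}{i} t^i (1-t)^{n-1-i}$ for $t \in [0, 1]$.

```python
from math import comb


def bernstein(coeffs, t):
    """Evaluate a Bernstein polynomial with the given coefficients at t."""
    n = len(coeffs) - 1
    return sum(
        c * comb(n, i) * t**i * (1 - t) ** (n - i) for i, c in enumerate(coeffs)
    )


def bernstein_derivative(coeffs, t):
    """Evaluate the derivative using the forward-difference identity.

    The derivative is a degree n-1 Bernstein polynomial whose coefficients
    are n * (c[i+1] - c[i]), so no numerical differentiation is needed.
    """
    n = len(coeffs) - 1
    diffs = [n * (coeffs[i + 1] - coeffs[i]) for i in range(n)]
    return bernstein(diffs, t)


# Coefficients [0, 0, 1] encode t**2 in the degree-2 Bernstein basis.
coeffs = [0.0, 0.0, 1.0]
print(bernstein(coeffs, 0.5))             # 0.25
print(bernstein_derivative(coeffs, 0.5))  # 1.0
```

A barrier condition of the usual form $\dot{h} + \alpha(h) \ge 0$ needs the gradient of $h$ with respect to the state. When $h$ is a polynomial distance field, that gradient is available in closed form through identities like the one above.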

2605.30694 2026-06-01 cs.LG

Universal Decision Learners

通用决策学习器

Sridhar Mahadevan

发表机构 * Adobe Research(Adobe研究院) University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校)

AI总结 本文提出通用决策学习器(UDL)的范畴论框架,通过左Kan扩展和右Kan扩展将局部决策行为规范地扩展到全局一致行为,统一了规划、强化学习、因果干预、在线学习和博弈均衡等多种决策形式。

Comments 15 pages

详情
AI中文摘要

许多决策理论——规划、强化学习、因果干预、在线学习和博弈均衡——将局部信息转化为全局一致的行为。本文提出一个共同的范畴论形式化:通用决策学习器(UDL)通过一对通用构造将部分指定的决策函子从观测上下文扩展到新上下文。左Kan扩展表达展开、聚合和候选生成;右Kan扩展表达一致性、约束满足和不动点语义。核心主张并非每个决策问题都有相同的算法,而是许多决策形式化实例化同一个通用问题:规范地扩展局部行为数据,然后刻画全局一致的扩展。我们给出抽象的UDL构造,证明其通用比较性质,定义Kan不变的行为等价性和最小抽象,并展示贝尔曼方程、规划递归、因果干预、在线遗憾和均衡如何作为特例出现。补充材料更详细地发展了强化学习特例。

英文摘要

Many theories of decision making -- planning, reinforcement learning, causal intervention, online learning, and game-theoretic equilibrium -- turn local information into globally coherent behavior. This paper proposes a common categorical formulation: a Universal Decision Learner (UDL) extends a partially specified decision functor from observed contexts to new contexts by a pair of universal constructions. Left Kan extensions express rollout, aggregation, and candidate generation; right Kan extensions express consistency, constraint satisfaction, and fixed-point semantics. The central claim is not that every decision problem has the same algorithm, but that many decision formalisms instantiate the same universal problem: extend local behavioral data canonically, then characterize the globally coherent extensions. We give the abstract UDL construction, prove its universal comparison property, define Kan-invariant behavioral equivalence and minimal abstractions, and show how Bellman equations, planning recursions, causal interventions, online regret, and equilibria arise as special cases. The supplementary material develops the reinforcement-learning specialization in more detail.

2605.30690 2026-06-01 cs.CL

ElasticMem: Latent Memory as a Learnable Resource for LLM Agents

ElasticMem: 将潜在记忆作为LLM智能体的可学习资源

Tao Feng, Chongrui Ye, Tianyang Luo, Jingjun Xu, Xueqiang Xu, Haozhen Zhang, Ge Liu, Jiaxuan You

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Nanyang Technological University(南洋理工大学)

AI总结 提出ElasticMem框架,通过可学习的弹性潜在记忆分配策略,在记忆密集型问答和具身智能体控制任务中显著提升准确率并降低token开销。

详情
AI中文摘要

长期记忆对于LLM智能体在长时间交互中连贯推理、个性化响应和复用过往经验至关重要。然而,现有的记忆增强方法通常将记忆视为固定资源:文本空间方法将检索到的记忆拼接进上下文窗口,导致大量token开销和对噪声证据的敏感性;而潜在空间方法减少了文本成本,但仍依赖刚性检索或固定容量的记忆接口。这造成了查询相关的记忆效用与固定记忆分配之间的不匹配。我们提出ElasticMem,一种记忆增强的LLM框架,学习将记忆用作弹性潜在资源。ElasticMem构建了一个离线潜在记忆库,包含检索键和内容缓存,从推理器的隐藏状态自适应地检索记忆,通过学得的策略为每个检索到的记忆分配可变潜在预算,并将选中的潜在状态作为软记忆token注入生成过程。整个记忆使用过程通过组相对策略优化与下游任务奖励进行联合优化。我们在MemorySuite上评估ElasticMem,涵盖记忆密集型QA和具身智能体控制。在Qwen2.5-3B-Instruct和Qwen2.5-7B-Instruct骨干网络上,ElasticMem在加权平均QA准确率上分别比最强基线提升26.2%和24.6%,在ALFWorld成功率上分别提升66.3%和27.2%,同时实现了最低的ALFWorld token成本。消融和定性分析进一步表明,自适应检索和弹性预算分配帮助ElasticMem优先考虑有用证据和可迁移计划,超越了刚性余弦相似度。ElasticMem的代码将在https://github.com/ulab-uiuc/ElasticMem发布。

英文摘要

Long-term memory is essential for LLM agents to reason coherently across extended interactions, personalize responses, and reuse past experience. However, existing memory-augmented methods typically treat memory as a fixed resource: text-space approaches concatenate retrieved memories into the context window, causing substantial token overhead and sensitivity to noisy evidence, while latent-space approaches reduce textual cost but still rely on rigid retrieval or fixed-capacity memory interfaces. This creates a mismatch between query-dependent memory utility and fixed memory allocation. We propose ElasticMem, a memory-augmented LLM framework that learns to use memory as an elastic latent resource. ElasticMem builds an offline latent memory bank with retrieval keys and content caches, retrieves memories adaptively from the reasoner's hidden state, assigns each retrieved memory a variable latent budget through a learned policy, and injects selected latent states as soft memory tokens for generation. The full memory-use process is optimized with downstream task rewards through group-relative policy optimization. We evaluate ElasticMem on MemorySuite, covering memory-intensive QA and embodied agent control. Across Qwen2.5-3B-Instruct and Qwen2.5-7B-Instruct backbones, ElasticMem improves weighted average QA accuracy by 26.2% and 24.6%, and improves ALFWorld success rate by 66.3% and 27.2%, respectively, over the strongest baselines, while achieving the lowest ALFWorld token cost. Ablations and qualitative analyses further show that adaptive retrieval and elastic budget allocation help ElasticMem prioritize useful evidence and transferable plans beyond rigid cosine similarity. Our code for ElasticMem will be released at https://github.com/ulab-uiuc/ElasticMem.
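The group-relative policy optimization step mentioned above scores each sampled rollout against the other rollouts for the same query. A minimal sketch of that advantage computation follows, assuming the commonly used mean and standard-deviation normalization. The abstract does not specify ElasticMem's exact variant, so the function name, the use of population standard deviation, and the `eps` term are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same query.

    Each advantage is (reward - group mean) / (group std + eps), so rollouts
    that beat their siblings get positive advantages without a learned critic.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical task rewards for four rollouts of one query.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no gradient signal. For that reason, the memory budget policy only receives a useful learning signal when different allocations lead to different task outcomes.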

2605.30689 2026-06-01 cs.CV cs.AI

ConTrans: Learning Text-enhanced Local-global Temporal Representations for Zero-shot Temporal Action Localization

ConTrans:学习文本增强的局部-全局时间表示用于零样本时间动作定位

Kanchan Keisham, Thenukan Pathmanathan, Thangarajah Akilan

发表机构 * Vellore Institute of Technology, India(韦洛尔理工学院,印度) Lakehead University, Canada(湖首大学,加拿大)

AI总结 针对零样本时间动作定位中忽略局部相关性和特征表示能力不足的问题,提出融合卷积归纳偏置与Transformer自注意力的多尺度编码器ConTrans,联合捕获细粒度局部依赖和长程全局上下文,在ActivityNet-1.3和THUMOS14上显著超越现有方法。

Comments 4 figures, 8 tables

详情
AI中文摘要

零样本时间动作定位(ZS-TAL)旨在检测和定位未修剪视频中未见过的动作。然而,现有方法主要关注建模长程上下文信息,常常忽略了视频帧之间基于相对偏移的关键局部相关性。此外,由于网络架构的浅层性,其特征表示能力受限,阻碍了性能提升。在本文中,我们通过引入一种新颖的局部-全局多尺度特征表示模块来解决这些局限性。我们提出了一种新颖的多尺度编码器架构,称为ConTrans,它将卷积(Conv)归纳偏置与Transformer自注意力相结合,以共同捕获细粒度的局部依赖和长程全局上下文,从而比现有方法获得更全面的特征表示。在ActivityNet-1.3和THUMOS14数据集上的实验评估表明,ConTrans显著优于现有方法,为ZS-TAL建立了新的基准。

英文摘要

Zero-shot Temporal Action Localization (ZS-TAL) aims to detect and locate previously unseen actions in untrimmed videos. However, existing approaches primarily focus on modeling long-range contextual information, often neglecting the critical relative-offset-based local correlations between video frames. Furthermore, their performance is hindered by limited feature representation capabilities due to the shallow nature of their network architectures. In this paper, we address these limitations by introducing a novel local-global multi-scale feature representation module. We propose a novel multi-scale encoder architecture, termed ConTrans, that integrates convolutional (Conv) inductive biases with Transformer self-attention to jointly capture fine-grained local dependencies and long-range global context, leading to more comprehensive feature representations than existing methods. Experimental evaluations on the ActivityNet-1.3 and THUMOS14 datasets demonstrate that ConTrans significantly outperforms existing methods, establishing a new benchmark for ZS-TAL.

2605.30680 2026-06-01 cs.AI cs.MA

Healthcare Mechanisms from Policy-as-Code Search under Strategic Provider Response

战略提供者响应下的策略即代码搜索中的医疗机制

Zihan Wang, Xiang Xu, Hongyuan Zha, Wenhao Li

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Tongji University(同济大学)

AI总结 将医疗机制设计转化为语言模型的程序合成,通过多智能体模拟器Medi-Sim评估策略提供者响应下的均衡,并利用LLM引导的进化代码搜索合成可检查的混合目标程序。

Comments 32 pages, 18 figures, 4 tables

详情
AI中文摘要

医疗机制与它们所引发的战略提供者响应密不可分:现有的医疗AI基准固定了这种响应,因此无法通过它们产生的均衡来评估机制。我们将医院机制设计重新定义为语言模型的程序合成:类型化、可检查的规则程序由Medi-Sim执行和评分,Medi-Sim是一个具有五个战略提供者渠道(编码、选择、延迟、努力、分诊)的多智能体模拟器。激励扫描恢复了经典的健康经济学发现作为相邻制度——在利润压力下的过度编码和低复杂度患者选择,以及古德哈特式漂移,其中测量绩效与真实结果呈负相关——而单个审计杠杆暴露了压力迁移:关闭编码渠道使低复杂度选择增加一倍以上。LLM引导的进化代码搜索在相同的规则程序空间上合成一个可检查的混合目标程序,该程序消除了过度编码,将拒绝率减半,并保留了大部分以利润为导向的基线的资金。

英文摘要

Healthcare mechanisms are inseparable from the strategic provider response they induce: existing healthcare AI benchmarks hold this response fixed and so cannot evaluate mechanisms by the equilibrium they produce. We recast hospital mechanism design as program synthesis for language models: typed, inspectable rule programs are executed and scored by Medi-Sim, a multi-agent simulator with five strategic provider channels (coding, selection, delay, effort, triage). An incentive sweep recovers classical health-economics findings as adjacent regimes -- up-coding and low-complexity-patient selection under profit pressure, and Goodhart-style drift where measured performance becomes anti-correlated with true outcomes -- and a single audit lever exposes pressure migration: closing the coding channel more than doubles low-complexity selection. LLM-guided evolutionary code search over the same rule-program space then synthesizes an inspectable mixed-objective program that eliminates up-coding, halves rejection, and retains most of the profit-oriented baseline's funds.

2605.30675 2026-06-01 cs.CL cs.AI

Human-Alignment, Calibration, and Activation Patterns in Large Language Model Uncertainty

大型语言模型不确定性中的人类对齐、校准与激活模式

Kyle Moore, Jesse Roberts, Daryl Watson, William Ward, Grayson Heyboer

发表机构 * Vanderbilt University(范德比大学) Tennessee Technological University(田纳西技术大学)

AI总结 研究大型语言模型的不确定性与人类不确定性的相似性,通过分析行为与内部激活模式,发现模型在多项选择和开放式事实回忆数据集上同时存在对齐与校准,并描述了指令微调的影响。

详情
AI中文摘要

不确定性量化是大型语言模型行为分析中一个庞大且不断发展的子领域。为了识别和对抗幻觉,该领域主要关注测量和改进校准,即不确定性判断对任务效能的准确性。在这项工作中,我们探讨了一个相对未被充分探索的问题:大型语言模型的不确定性与人类不确定性有多相似。我们研究了大型语言模型的外部行为和内部激活模式中是否存在类似人类的不确定性信号,即不确定性对齐。我们识别了模型在涵盖多项选择和开放式事实回忆的多种数据集上是否同时表现出对齐和校准的证据。并且我们描述了指令微调对这些方面的影响。

英文摘要

Uncertainty quantification is a large and growing subfield of large language model behavioral analysis. Primarily to recognize and combat hallucination, the field has largely focused on measuring and improving calibration, that is, how accurately uncertainty judgments track task performance. In this work, we investigate the relatively underexplored question of how similar large language model uncertainty is to human uncertainty. We investigate the presence and strength of human-similar uncertainty signals, termed uncertainty alignment, in large language model overt behavior and internal activation patterns. We identify whether the models show evidence of simultaneous alignment and calibration on a variety of datasets covering both multiple-choice and open-ended factual recall. We also characterize the effect of instruction fine-tuning on each of these facets.

2605.30673 2026-06-01 cs.CL

TeachObs: A Human-Validated Benchmark for Multimodal Teaching Observation and Model Evaluation

TeachObs:多模态教学观察与模型评估的人工验证基准

Yeil Jeong, Youngjin Yoo, Seobin Sohn, Hyejin Han, Jinseo Lee, Scott Howard, Unggi Lee

发表机构 * Indiana University Bloomington(印第安纳大学布卢明顿分校) Pai Chai University(培才大学) Seoul National University(首尔国立大学) Ewha Womans University(梨花女子大学) University of Wolverhampton(伍尔弗汉普顿大学) Korea University Sejong Campus(高丽大学世宗校区)

AI总结 提出TeachObs基准,包含30节公开课视频的5158个15秒场景,由7名研究者标注39个二值观察码,并评估5个前沿视觉大语言模型在三种任务上的表现。

详情
AI中文摘要

课堂视频包含可观察的教学实践,但其教学和视觉信号很少以适合模型评估的形式组织。我们提出了\textit{TeachObs},一个用于课堂视频中多模态教学观察的人工验证基准。\textit{TeachObs}包含来自八个国家的30节公开课视频,分为5158个固定的15秒场景。七名研究者用39个二值观察码标注每个场景,涵盖20个视觉码(如手势、板书、指向和视觉材料)和19个非视觉码(如指导、监控、提问、反馈和反思)。基于Krippendorff's alpha,使用可靠性和流行度感知规则构建黄金片段标签。除了片段级标签,三名专家评分员对30节课进行了课程级评分和定性评估,涵盖教学设计、教学实施、学生反应、学习材料和课程收尾,评分员覆盖范围详见正文。利用这两个人工参考层,我们在三个任务(纯文本片段编码、文本+帧片段编码,以及在LLM作为评判协议下的课程级覆盖评分)上评估了五个具备视觉能力的前沿大语言模型,发现没有单一模型在所有三个任务中持续优于其他模型,添加中间帧会同时增加每个场景的真实和虚假归因,并且模型评估相对于专家评分员高估了程序清晰的课程。因此,\textit{TeachObs}支持细粒度标注基准测试和整课评估,展示了AI系统在哪些方面可以辅助课堂视频分析,以及在哪些方面专家判断仍然必要,涵盖不同学科、课堂形式和标注难度水平。

英文摘要

Classroom videos contain observable teaching practices, but their pedagogical and visual signals are rarely organized in forms suitable for model evaluation. We present \textit{TeachObs}, a human-validated benchmark for multimodal teaching observation in classroom videos. \textit{TeachObs} includes 30 public lesson videos from eight countries divided into 5,158 fixed 15-second scenes. Seven researchers annotated each scene with 39 binary observation codes, covering 20 visual codes, such as gesture, board work, pointing, and visual materials, and 19 nonvisual codes, such as instruction, monitoring, questioning, feedback, and reflection. Gold segment labels are constructed using reliability- and prevalence-aware rules based on Krippendorff's alpha. In addition to segment-level labels, three expert raters produced lesson-level ratings and qualitative evaluations of instructional design, instructional delivery, learner response, learning materials, and lesson closure across the 30 lessons, with rater coverage detailed in the body. Using these two human reference layers, we evaluate five vision-capable frontier LLMs across three tracks - text-only segment coding, text + frame segment coding, and lesson-level coverage scored under an LLM-as-judge protocol - and find that no single model consistently outperforms others across all three tracks, that adding a mid-frame inflates both true and false attributions per scene, and that model evaluations over-rate procedurally clear lessons relative to expert raters. \textit{TeachObs} therefore supports both fine-grained annotation benchmarking and whole-lesson evaluation, showing where AI systems can assist classroom video analysis and where expert judgment remains necessary across varied subjects, classroom formats, and annotation difficulty levels.

2605.30668 2026-06-01 cs.CL cs.AI

CobSeg: Coherence Boundary Modeling for Dialogue Topic Segmentation

CobSeg: 对话主题分割的连贯性边界建模

Sijin Sun, Liangbin Zhao, Jiaxiang Cai, Ming Deng, Mingyu Luo, Xiuju Fu

Affiliations * Institute of High Performance Computing, Agency for Science, Technology and Research; Shanghai University; Fudan University

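AI summary Proposes CobSeg, a multi-branch architecture that separates coherence-level semantics from lexical boundary transitions and uses boundary informativeness weighting and a topic coherence cue to improve dialogue topic segmentation without LLM calls at inference.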

Comments 8 pages with appendix. Under review

Details
AI Chinese abstract

对话主题分割在许多人类-AI协作应用中至关重要,需要识别异质边界线索,包括话语边缘附近的词汇转换和跨话语的语义不连续性。现有的话语模型常常稀释这些局部词汇信号。我们提出CobSeg,一种新颖的多分支架构,它将连贯性层面的语义连续性与词汇边界转换分离,并通过方向性边界预测恢复两者。CobSeg进一步使用边界信息加权来强调高效用的话语位置,并融合了基于语料库的主题连贯性线索与学习到的组合权重。尽管CobSeg在有监督的金标准边界训练和自动诱导边界的伪标签设置下作为紧凑的可训练分割器进行评估,它在推理过程中无需LLM调用即可实现增强的边界预测。在五个基准测试中,它改进了$P_k$和$W_d$,特别是在局部词汇线索显著时:在金标准监督下,它在VHF上将$P_k$降低了0.7个点,$W_d$降低了0.6个点,并在DialSeg711上达到了$P_k$为1.0;在诱导边界下,它在VHF上将$P_k$降低了14.8个点,在DialSeg711上降低了1.5个点,在TIAGE上降低了1.1个点,优于先前的非LLM方法。

English abstract

Dialogue topic segmentation is critical in many human-AI collaborative applications and requires identifying heterogeneous boundary cues, including lexical transitions near utterance edges and semantic discontinuities across utterances. Existing utterance models often dilute these local lexical signals. We propose CobSeg, a novel multi-branch architecture that separates coherence-level semantic continuity from lexical boundary transitions and recovers both through directional boundary prediction. CobSeg further uses boundary informativeness weighting to emphasize high-utility utterance positions, and incorporates a corpus-derived topic coherence cue with learned combination weights. CobSeg is evaluated as a compact trainable segmenter under supervised gold-boundary training and under a pseudo-label setting with automatically induced boundaries, and it predicts boundaries without LLM calls during inference. Across five benchmarks, it improves $P_k$ and $W_d$, particularly when local lexical cues are prominent: under gold supervision, it reduces $P_k$ by 0.7 points and $W_d$ by 0.6 points on VHF, and reaches $P_k$ of 1.0 on DialSeg711; with induced boundaries, it reduces $P_k$ by 14.8 points on VHF, by 1.5 points on DialSeg711, and by 1.1 points on TIAGE, outperforming prior non-LLM approaches.
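For readers unfamiliar with the two error metrics reported above, here is a minimal Python sketch of $P_k$ and WindowDiff ($W_d$) as they are usually defined for segmentation; lower is better for both. It is not CobSeg's evaluation code. The function names, the per-utterance segment-id input format, and the default window (half the mean reference segment length) are assumptions for illustration, and the values are fractions, whereas the abstract reports points on a 0 to 100 scale.

```python
def boundaries(segment_ids):
    """b[i] = 1 if a topic boundary falls between utterance i and i + 1."""
    return [int(a != b) for a, b in zip(segment_ids, segment_ids[1:])]


def default_window(ref_ids):
    """Half of the mean reference segment length, at least 1."""
    n_segments = 1 + sum(boundaries(ref_ids))
    return max(1, round(len(ref_ids) / n_segments / 2))


def _window_counts(segment_ids, k):
    """Number of boundaries between utterance i and i + k, for every i."""
    b = boundaries(segment_ids)
    return [sum(b[i:i + k]) for i in range(len(b) - k + 1)]


def p_k(ref_ids, hyp_ids, k=None):
    """Fraction of windows where reference and hypothesis disagree on
    whether the two window ends lie in the same segment."""
    if len(ref_ids) != len(hyp_ids):
        raise ValueError("reference and hypothesis must have equal length")
    if k is None:
        k = default_window(ref_ids)
    ref, hyp = _window_counts(ref_ids, k), _window_counts(hyp_ids, k)
    if not ref:
        raise ValueError("sequence is too short for this window size")
    return sum((r > 0) != (h > 0) for r, h in zip(ref, hyp)) / len(ref)


def window_diff(ref_ids, hyp_ids, k=None):
    """Fraction of windows where the number of boundaries differs."""
    if len(ref_ids) != len(hyp_ids):
        raise ValueError("reference and hypothesis must have equal length")
    if k is None:
        k = default_window(ref_ids)
    ref, hyp = _window_counts(ref_ids, k), _window_counts(hyp_ids, k)
    if not ref:
        raise ValueError("sequence is too short for this window size")
    return sum(r != h for r, h in zip(ref, hyp)) / len(ref)


# Toy dialogue of six utterances with one true topic shift after the third.
reference = [0, 0, 0, 1, 1, 1]
over_segmented = [0, 0, 1, 2, 2, 2]  # one spurious extra boundary
print(p_k(reference, over_segmented), window_diff(reference, over_segmented))
```

On this toy input WindowDiff penalises the window that contains two predicted boundaries where the reference has one, while $P_k$ does not. This is the usual motivation for reporting both metrics.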

2605.30664 2026-06-01 cs.AI

Structure-Induced Information for Rerooting Levin Tree Search

结构信息用于重定根莱文树搜索

Jake Tuero, Michael Buro, Laurent Orseau, Levi H. S. Lelis

Affiliations * Department of Computing Science, University of Alberta, Edmonton, Canada. Alberta Machine Intelligence Institute (Amii), Edmonton, Canada. Google DeepMind, London, United Kingdom.

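AI summary Proposes three rerooter designs that use structural information to decompose problems into subtasks implicitly, improving the scalability and efficiency of policy tree search.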

Comments ICML 2026

Details
AI Chinese abstract

基于子目标的策略树搜索利用策略引导搜索,对于复杂的单智能体确定性问题是有效的,但通常依赖于显式的子目标生成,这会带来大量开销并阻碍可扩展性。在本文中,我们通过最近引入的$\sqrt{\text{LTS}}$算法使用学习到的“重定根器”来克服这些限制。重定根器隐式地将问题分解为软子任务。虽然先前的工作侧重于给定或手工制作的重定根器的形式保证,但在本文中,我们提出了三种重定根器设计:(i) 基于聚类的重定根器,利用全局状态空间结构;(ii) 基于启发式的重定根器,利用学习的代价估计;(iii) 结合两种信号的混合重定根器。我们的框架避免了显式重构和推理生成的子目标,从而能够以显著降低的计算开销实现可扩展的搜索努力分配。实验上,我们的基于重定根的方法在基于子目标的策略树搜索失败的复杂环境中也能扩展,并在测试的领域上实现了最先进的在线训练效率。

English abstract

Subgoal-based policy tree search, which uses a policy to guide search, is effective for complex single-agent deterministic problems but often relies on explicit subgoal generation that can incur substantial overhead and hinders scalability. In this paper, we overcome these limitations by using a learned ``rerooter'' through the recently-introduced $\sqrt{\text{LTS}}$ algorithm. A rerooter implicitly decomposes the problem into soft subtasks. While previous work focused on the formal guarantees for given or handcrafted rerooters, in this work we propose three rerooter designs: (i) a clustering-based rerooter that exploits global state-space structure, (ii) a heuristic-based rerooter that leverages learned cost-to-go estimates, and (iii) a hybrid that combines both signals. Our framework avoids having to explicitly reconstruct and reason over generated subgoals, thereby enabling scalable allocation of search effort with significantly lower computational overhead. Empirically, our rerooting-based methods scale to complex environments where subgoal-based policy tree search fails, and achieve state-of-the-art online training efficiency on the domains tested.
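For orientation, the policy-guided best-first search that rerooting builds on can be sketched in a few lines of Python. The sketch below is plain Levin tree search as commonly described in prior work (expand the node that minimises depth divided by the policy probability of its path). It is not $\sqrt{\text{LTS}}$ and not any of the three rerooters proposed in the paper. The function names, the duplicate-state pruning, and the toy corridor domain are all assumptions for illustration.

```python
import heapq
import itertools


def levin_tree_search(start, is_goal, successors, policy, max_expansions=10_000):
    """Best-first search ordered by (depth + 1) / path_probability.

    successors(state) -> iterable of (action, next_state)
    policy(state)     -> dict mapping action to probability
    Returns (list_of_actions or None, number_of_expansions).
    """
    tie_breaker = itertools.count()  # keeps heap entries comparable on cost ties
    # Heap entry: (cost, tie, state, path probability, depth, actions so far)
    frontier = [(1.0, next(tie_breaker), start, 1.0, 0, [])]
    expanded = set()
    expansions = 0

    while frontier and expansions < max_expansions:
        _cost, _tie, state, prob, depth, path = heapq.heappop(frontier)
        if is_goal(state):
            return path, expansions
        if state in expanded:
            continue  # simplification: skip states that were already expanded
        expanded.add(state)
        expansions += 1

        action_probs = policy(state)
        for action, child in successors(state):
            child_prob = prob * action_probs.get(action, 0.0)
            if child_prob <= 0.0:
                continue  # the policy rules this branch out entirely
            child_depth = depth + 1
            cost = (child_depth + 1) / child_prob
            heapq.heappush(
                frontier,
                (cost, next(tie_breaker), child, child_prob, child_depth, path + [action]),
            )
    return None, expansions


# Hypothetical toy domain: walk along a corridor from cell 0 to cell 3.
def corridor_successors(cell):
    return [("L", cell - 1), ("R", cell + 1)]


def corridor_policy(_cell):
    return {"L": 0.2, "R": 0.8}  # a fixed policy that prefers moving right


plan, n_expanded = levin_tree_search(
    0, lambda cell: cell == 3, corridor_successors, corridor_policy
)
print(plan, n_expanded)
```

In this baseline every path probability is measured from the single root, so deep solutions become expensive as probabilities multiply. The abstract describes rerooting as a way to decompose the problem implicitly into soft subtasks, and the sketch shows only the search it starts from.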