arXivDaily: arXiv daily academic digest, updated Monday through Friday
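2606.08932 2026-06-09 cs.CL cs.AI cs.CE New submission

From Statute to Control Flow: Span-Grounded Deontic Trees for Defeasible Scope Parsing

Jian Chen, Siyuan Li, Chucheng Wan, Zixuan Yuan

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Sun Yat-Sen University

AI summary: Proposes the NormBench benchmark and the Span-Grounded Deontic Tree (SG-DT) intermediate representation to diagnose and mitigate Silent Scope Omission (SSO) in rule-following models; identifies two pathologies, Recursion Decay and the Auditability Trap, and shows that constraining outputs to SG-DT improves tree-structure fidelity.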

Abstract

Rule-following agents tasked with executing policies and regulations often fail via Silent Scope Omission (SSO): a model applies a general rule but silently drops nested exceptions or counter-exceptions, producing outputs that appear compliant yet break on important edge cases. Although such failures are often framed as an agentic-systems problem, the underlying bottleneck is statutory and policy understanding, a capability typically studied in legal NLP. However, most existing legal NLP benchmarks emphasize end-task outcomes, which can overlook the structural omissions that cause SSO. To diagnose and mitigate SSO, we introduce NormBench, a benchmark of 2,290 provisions spanning Chinese (laws and local policies), English (U.S. tax law, GDPR, and corporate policies), and cross-lingual settings, designed for defeasible scope parsing: identifying precisely which clause overrides which. NormBench uses Span-Grounded Deontic Trees (SG-DT), a compiler-style intermediate representation that anchors every logical branch to source spans and requires explicit exclusion guards, enabling deterministic compilation and audit. Evaluations of frontier LLMs reveal two recurring pathologies: (1) Recursion Decay, where performance drops sharply as defeater depth increases, and (2) an Auditability Trap, where models retrieve relevant spans but fail to assemble correct control flow. Using SG-DT as a constrained intermediate output improves whole-tree fidelity and defeater recovery, and downstream experiments show that its utility is mechanism-specific: gains concentrate on exception-active, SSO-prone cases, while aggregate accuracy can be mixed when the added structure is unnecessary or parser fidelity is low.
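
As a rough illustration of defeasible scope parsing, the sketch below models a rule with a nested exception and counter-exception as a small tree and evaluates it deterministically. The `Node` class, its fields, the span offsets, and the badge-in provision are all hypothetical; none of them is taken from NormBench or from the SG-DT specification.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Facts = Dict[str, bool]


@dataclass
class Node:
    """One clause: a source span, a guard over facts, and clauses that defeat it."""
    span: Tuple[int, int]             # illustrative character offsets into the source text
    guard: Callable[[Facts], bool]    # condition under which the clause is triggered
    defeaters: List["Node"] = field(default_factory=list)

    def applies(self, facts: Facts) -> bool:
        # A clause applies if its guard holds and no applicable defeater overrides it.
        if not self.guard(facts):
            return False
        return not any(d.applies(facts) for d in self.defeaters)


# Hypothetical provision: "Employees must badge in (rule), unless working
# remotely (exception), except on audit days (counter-exception)."
counter_exception = Node((60, 85), lambda f: f["audit_day"])
exception = Node((30, 59), lambda f: f["remote"], [counter_exception])
rule = Node((0, 29), lambda f: True, [exception])


def must_badge_in(facts: Facts) -> bool:
    """Evaluate the whole tree for one set of facts."""
    return rule.applies(facts)
```

In this toy setup, omitting `counter_exception` from the tree would make a remote employee on an audit day appear exempt, which is the kind of silently dropped scope the abstract describes.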

2606.08926 2026-06-09 cs.LG New submission

PROBE-Web: An Interactive System for Probing Evaluation Landscapes of Knowledge Graph Completion Models

Sooho Moon, Yunyong Ko

Affiliations: Chung-Ang University

AI summary: Proposes PROBE-Web, an interactive system for flexibly evaluating KGC models by adjusting two perspectives, predictive sharpness and popularity-bias robustness, and provides four key functionalities.

Comments 4 pages, 6 figures, 1 table

Abstract

Knowledge graph completion (KGC) models are commonly evaluated using rank-based metrics such as MRR and Hits@K, even though different users often require different evaluation perspectives. In this demo, we present PROBE-Web, an interactive system for probing diverse evaluation landscapes for KGC models. PROBE-Web enables users to flexibly evaluate KGC models by adjusting two critical perspectives: (P1) predictive sharpness and (P2) popularity-bias robustness. Through a user-friendly GUI, users can easily evaluate multiple KGC models and analyze their strengths and weaknesses. PROBE-Web provides four key functionalities: (1) conventional evaluation toolkit, (2) flexible perspective-aware evaluation, (3) explainable case studies, and (4) evaluation landscape exploration. We believe that PROBE-Web can help users better understand KGC models that align with their objectives.
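
For readers unfamiliar with the two rank-based metrics named above, the following is a minimal sketch of their standard definitions. The function names and the sample ranks are illustrative and do not come from PROBE-Web.

```python
from typing import Sequence


def mrr(ranks: Sequence[int]) -> float:
    """Mean reciprocal rank over the 1-based ranks of the true entities."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of queries whose true entity is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Illustrative ranks of the correct entity for four test queries.
example_ranks = [1, 2, 4, 10]
example_mrr = mrr(example_ranks)              # (1 + 1/2 + 1/4 + 1/10) / 4
example_hits3 = hits_at_k(example_ranks, 3)   # two of the four ranks are <= 3
```

Both metrics collapse a list of ranks into one number with a fixed weighting, which is why a single score cannot express different evaluation perspectives.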

2606.08922 2026-06-09 cs.RO New submission

PTDL: Multi-Terrain Fall Recovery via Phase-Terrain Decoupled Learning

Xiaoyu Xu, Zhiming Chen, Yuenan Zhao, Ran Song, Wei Zhang

Affiliations: School of Control Science and Engineering, Shandong University; Key Laboratory of Machine Intelligence and System Control, Ministry of Education

AI summary: Proposes Phase-Terrain Decoupled Learning (PTDL), which decouples training supervision along phase and terrain axes to achieve multi-terrain fall recovery and the transition to walking under a single proprioceptive policy.

Abstract

Humanoid robots can fall on slopes, gravel, and uneven ground in unstructured environments. We target integrated fall recovery and locomotion: rebuilding balance from a fallen state using proprioception alone and resuming velocity-commanded walking at the fall site. Prior methods often stop at quasi-static rise, neglect the post-fall ground-contact phase, or, when trained on mixed terrains without separating recovery and locomotion phases or per-surface constraints, collapse to a single compromise get-up across surfaces. We propose Phase-Terrain Decoupled Learning (PTDL), which decouples training supervision along phase and terrain axes while deploying one proprioceptive policy. On the phase axis, projected-gravity-gated dual motion-prior discriminators and a probe-to-walk transition link post-fall recovery to commanded walking. On the terrain axis, terrain-stratified recovery shaping assigns surface-specific training supervision on flat ground, gravel, and slopes; terrain labels are training-only and withheld from policy observations, enabling implicit post-fall strategy selection at deployment. We validate PTDL on a 29-DoF Unitree G1 across flat ground, gravel, and slopes up to 20 degrees in simulation and on hardware, achieving stable cross-terrain recovery, smooth recovery-to-locomotion transitions, and differentiated post-fall rise behaviors under one deployed policy.

2606.08921 2026-06-09 cs.LG New submission

Generalized Rank-based Evaluation for Knowledge Graph Completion: Perspectives, Framework, and Analyses

Sooho Moon, Jian Kang, Yunyong Ko

Affiliations: Chung-Ang University; Mohamed bin Zayed University of Artificial Intelligence

AI summary: To address existing metrics' neglect of predictive sharpness and popularity-bias robustness, proposes PROBE, a generalized evaluation framework that uses a rank transformer and a rank aggregator for more comprehensive, flexible, and consistent model evaluation.

Comments 25 pages, 12 figures, 5 tables

Abstract

Knowledge graph completion (KGC) aims to predict missing facts from an observed knowledge graph (KG), playing a crucial role in a wide range of real-world applications such as drug discovery, recommender systems, and retrieval-augmented generation (RAG). Although numerous KGC models have been proposed, the evaluation of KGC remains underexplored, despite its critical role in reliably assessing model performance and selecting appropriate models for real-world applications. In this paper, we introduce two important perspectives for KGC evaluation that are overlooked by existing evaluation metrics, (P1) predictive sharpness and (P2) popularity-bias robustness. To address both perspectives, we propose a generalized evaluation framework, PROBE, which consists of a rank transformer (RT) that estimates the score of each prediction based on a desired level of predictive sharpness and a rank aggregator (RA) that determines the final evaluation score by aggregating all prediction scores according to a desired level of popularity-bias robustness. We theoretically analyze PROBE by defining six key properties for reliable KGC evaluation and prove that PROBE satisfies all the properties, while existing metrics fail to satisfy some. In particular, due to the open-world nature of KGs, an evaluation metric should preserve the relative performance of KGC models even when only incomplete facts are observed. We show that PROBE better maintains such consistency, providing a more reliable estimate of intrinsic model performance than existing metrics. Extensive experiments with six KGC models on six real-world KGs reveal that existing metrics may over- or under-estimate model performance depending on different evaluation perspectives, whereas PROBE enables a more comprehensive, flexible, and consistent evaluation of KGC models.
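
The two-stage structure described above (score each prediction from its rank, then aggregate the scores) can be sketched as follows. The abstract does not give PROBE's formulas, so the power-law transform, the inverse-popularity weighting, and every name here are assumptions chosen only to show the shape of such a framework.

```python
from typing import Sequence


def rank_transform(rank: int, sharpness: float) -> float:
    """Hypothetical transform: map a 1-based rank to a score in (0, 1].

    Larger sharpness penalizes low-ranked predictions more heavily.
    """
    return rank ** (-sharpness)


def rank_aggregate(scores: Sequence[float], popularity: Sequence[float],
                   robustness: float) -> float:
    """Hypothetical aggregator: popularity-weighted mean of prediction scores.

    robustness = 0 gives a plain mean; larger values down-weight popular entities.
    """
    weights = [p ** (-robustness) for p in popularity]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def two_stage_score(ranks: Sequence[int], popularity: Sequence[float],
                    sharpness: float = 1.0, robustness: float = 0.0) -> float:
    """Transform each rank, then aggregate; the defaults reduce to MRR."""
    scores = [rank_transform(r, sharpness) for r in ranks]
    return rank_aggregate(scores, popularity, robustness)
```

Under these assumed forms, the default parameters recover MRR, and changing either parameter moves the same set of ranks to a different final score, which is the sense in which one model can look better or worse depending on the evaluation perspective.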

2606.08920 2026-06-09 cs.CV cs.AI New submission

PolyBuild: An End-to-End Method for Polygonal Building Contour Extraction from High-Resolution Remote Sensing Images

Yaoteng Zhang, Julin Zhang, Guangshuai Wang, Jiwei Deng, Hui Sheng, Yasir Muhammad, Shiqing Wei

Affiliations: China University of Petroleum (East China); South Surveying & Mapping Instrument Co., Ltd.; China Railway Design Corporation

AI summary: Proposes PolyBuild, an end-to-end method that extracts vector polygonal building contours directly from remote sensing images through an initial contour generation module and a contour optimization module, without post-processing, and outperforms existing methods.

Comments Accepted for publication in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (JSTARS)

Abstract

Extracting building polygon contours from high-resolution remote sensing images is a fundamental task for various mapping applications. However, varying imaging conditions and complex building structures make automatic contour extraction extremely challenging. Mainstream approaches for building extraction often rely on pixel-level segmentation followed by multiple post-processing steps to produce building contours, which can be computationally intensive and prone to errors. In this paper, we propose an end-to-end method named PolyBuild, which can directly extract building vector polygons from high-resolution remote sensing images without the need for any post-processing operations. The proposed method leverages two primary modules: an Initial Contour Generation Module (ICGM) and a Contour Optimization Module (COM). The ICGM is designed to generate an initial building contour by utilizing concatenated sub-region center features for each building instance. It performs simultaneous object detection and initial contour extraction by generating bounding boxes and using the center features of four sub-regions to represent each building. The COM further refines the generated building contours by iteratively integrating Convolutional Neural Network (CNN) features and contour positional information in a Transformer-based decoder. The hybrid CNN-Transformer architecture effectively captures both local and global spatial relationships within the building contour, ensuring high-quality boundary delineation. Extensive experiments are conducted on three building datasets to evaluate the performance of PolyBuild. The results demonstrate that PolyBuild significantly outperforms state-of-the-art methods, including mask-based and contour-based approaches.

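2606.08919 2026-06-09 cs.AI cs.CR cs.LG New submission

Oversight Has a Capacity: Calibrating Agent Guards to a Subjective, Fatiguing Human

Emre Turan

Affiliations: GitHub arXiv

AI summary: Addresses the fact that human reviewers approving LLM agent actions are subjective and prone to fatigue: models the guard as cost-sensitive selective classification, introduces a load-aware policy, and finds that excessive oversight reduces safety, yielding an inverted-U curve.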

Comments 12 pages, 4 figures. Code and interactive demo: https://github.com/turangenesis/headroom

Abstract

As LLM agents begin to take real, irreversible actions (shell commands, file edits, deploys), the standard safety pattern is a human-in-the-loop approval gate: risky actions pause and wait for a person. We argue the gate is the easy part; the hard part is the judgment - which actions to stop - which the field evaluates against two false assumptions: that there is a ground-truth notion of "risky," and that the human reviewer is a perfect, infinitely-available oracle. On a hand-labeled set of 125 adversarially-weighted agent actions we show that (i) reviewers only moderately agree on what is risky (Fleiss' kappa = 0.52), so there is no single correct label; (ii) framing the guard as selective classification under asymmetric cost makes its operating limits measurable, and on hard inputs the guard cannot safely auto-decide; and (iii) when the reviewer is modeled as endogenous (fatiguing as escalation load grows), realized safety becomes an inverted-U in the escalation rate: more human oversight can make a system less safe, and the safety-optimal guard escalates below full escalation - a setting a load-aware policy also uses to resist a flooding attack that slips a malicious action past a fatigued reviewer. Agent oversight, framed this way, is not only a classification problem but a resource-allocation one: human attention is finite, and the guard's escalation policy spends it. We claim none of these mechanisms as novel - fatigue-aware learning-to-defer (FALCON), cost-sensitive deferral under workload constraints (DeCCaF), trajectory-level guarding, and reviewer-fatigue/flooding attacks are all prior art we cite. Our contribution is an open-source agent-oversight system that operationalizes and measures them in the LLM-agent action-gating setting, turning "is my guard good?" from a guess into a curve. The inverted-U and the flooding attack are modeling results that motivate a human study.
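
The agreement statistic quoted above, Fleiss' kappa, can be computed from a table of per-item category counts. The sketch below follows the standard definition; the function name and the toy inputs in the usage note are illustrative and unrelated to the paper's 125 labeled actions.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every item must have the same number
    of raters (at least two)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])

    # Observed agreement per item, then averaged over items.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items

    # Expected chance agreement from the overall category proportions.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_e = sum(p * p for p in p_j)

    return (p_bar - p_e) / (1 - p_e)
```

For example, three raters who agree completely on two items, one labeled "risky" and one "safe" (`[[3, 0], [0, 3]]`), give a kappa of 1.0, while lower values indicate agreement closer to chance.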

2606.08918 2026-06-09 cs.CV New submission

When Vision Misleads, Let Location Speak: A Worldwide Image Geo-Localization Method via Location Attention Mechanism and Large Multimodal Models

Junchao Cui, Wenqi Shi, Xuanzi Ma, Nan Wu, Shaoyong Du, Xiangyang Luo

Affiliations: Henan Key Laboratory of Cyberspace Situation Awareness; Information Engineering University

AI summary: Proposes the TransGeoCLIP framework, which uses a location attention mechanism and large multimodal models to address geo-localization errors caused by visually similar images, significantly improving localization accuracy on multiple benchmarks.

Comments Submitted to IEEE Transactions on Multimedia in March 2026

Abstract

Worldwide image geo-localization aims to determine the capture location of an image on a global scale. Existing methods often mislocalize images by matching them to visually similar scenes from different geographic regions, which limits reliability in practical applications. To address this issue, we propose TransGeoCLIP, a novel retrieval-based framework that integrates a location attention mechanism and large multimodal models (LMMs). Using a Transformer encoder with location attention to encode GPS coordinates, TransGeoCLIP can effectively distinguish geographic features among visually similar images. The framework consists of two stages: 1) Retrieval database construction, which employs Transformers equipped with location attention mechanisms to encode labeled GPS coordinates and enhance location semantics, and subsequently enables joint image-text-GPS embedding through CLIP; 2) Retrieval-augmented inference, which leverages LMMs to infer the final image location prediction from retrieved database results. Extensive experimental results on diverse datasets, including IM2GPS, IM2GPS3k, YFCC4k, and YFCC26k, demonstrate that TransGeoCLIP significantly enhances localization performance for visually similar images. In particular, street-level localization accuracy (within 1 km error) is substantially improved, surpassing state-of-the-art methods by 1.5%, 1.07%, 7.18%, and 9.75% on these benchmarks, respectively.
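
The street-level metric mentioned above counts a prediction as correct when it falls within 1 km of the true location. A minimal sketch of such a threshold accuracy is given below, assuming great-circle (haversine) distance on a spherical Earth; the abstract does not state which distance the benchmarks use, and the function names are illustrative.

```python
import math
from typing import Sequence, Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical approximation)

LatLon = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def accuracy_within(preds: Sequence[LatLon], truths: Sequence[LatLon],
                    threshold_km: float = 1.0) -> float:
    """Fraction of predictions within threshold_km of the ground truth."""
    hits = sum(
        1 for p, t in zip(preds, truths)
        if haversine_km(p[0], p[1], t[0], t[1]) <= threshold_km
    )
    return hits / len(truths)
```

Raising `threshold_km` turns the same function into a coarser, region-scale accuracy measure.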

2606.08908 2026-06-09 cs.CV cs.AI New submission

Failure-Aware Refinement of Vision-Language Model for Lithography Defect Detection

Pangyun Jeong, Jiyeong Kong, Yuehua Hu, Dohee Jeong, Kyung-Tae Kang

Affiliations: Hanyang University; Korea University; Korea Institute of Industrial Technology

AI summary: Proposes a two-stage vision-language framework that first fine-tunes Qwen3-VL to detect defects and then trains a refinement module to correct first-stage errors, improving detection reliability.

Comments 6 pages, 3 figures

Abstract

Semiconductor lithography inspection requires reliable detection of small pattern defects such as bridge, burr, pinch, and contamination. In this study, we propose a two-stage vision-language framework that combines initial defect detection with prediction refinement. In the first stage, Qwen3-VL is fine-tuned with LoRA as a vision-language adapter to predict defect counts, defect categories, and normalized bounding boxes from lithography images. However, direct fine-tuning may still produce common test-time errors, including false positives, missed defects, and incorrect defect types. To address this limitation, the second stage trains a refinement module using first-stage prediction failures and their corrected labels, allowing the model to review and revise initial outputs. By learning from cases where the initial adapter fails, the refinement process improves defect inference beyond single-stage fine-tuning.

2606.08906 2026-06-09 cs.CV New submission

DifferSeg: Towards Diverse Multimodal Binary Segmentation via Differential Perception and Frequency Guidance

Qiangqiang Zhou, Jiawei Xu, Yong Chen, Dandan Zhu, Yugen Yi, Xiaoqi Zhao

Affiliations: School of Artificial Intelligence, Jiangxi Normal University; Institute of AI Education, East China Normal University; Yale School of Medicine, Yale University

AI summary: Proposes the DifferSeg framework, which adaptively aligns multimodal features with a differential perception fusion module and balances high- and low-frequency representations with a frequency-guided decoder, surpassing 67 methods on 29 public datasets.

Abstract

In many binary segmentation tasks, most multimodal methods rely on fixed feature concatenation for cross-modal interaction and straightforward decoder designs dominated by low-frequency semantics. However, they ignore two key challenges: one is the lack of an adaptive mechanism to handle modality discrepancies and complementarity, and the other is the absence of an efficient decoding strategy to balance both high- and low-frequency representations. In this work, we propose a simple yet general multimodal binary segmentation framework, termed DifferSeg, to address both problems simultaneously. With the help of the differential perception fusion (DPF) module, DifferSeg employs learnable differential operators to adaptively align multimodal features and enhance their complementarity through residual fusion, effectively mitigating modality mismatch and fusion redundancy. In addition, we design a frequency-guided decoder (FGD) that builds cross-frequency interactions and multi-path upsampling to maintain consistency between detailed high-frequency structures and semantic low-frequency representations, ensuring fine-grained boundary recovery and noise suppression. Benefiting from these designs, DifferSeg can be easily generalized to diverse binary segmentation tasks, including both natural and medical modalities. Without bells and whistles, it consistently surpasses 67 state-of-the-art methods across 29 public datasets involving 18 downstream tasks, demonstrating superior generalization and segmentation accuracy. Code and pretrained models will be available at the link.

2606.08904 2026-06-09 cs.AI New submission

Order Matters: Unveiling the Hidden Impact of Macro Placement Sequences via Proxy-Guided LLM Evolution

Shibing Mo, Jing Liu, Jianchu Xu, Ruilin Wu

Affiliations: University of Science and Technology of China

AI summary: Proposes the OrderPlace framework, which uses proxy-guided LLM evolution to automatically discover macro placement order strategies, reducing wirelength by 14.08% to 34.04% relative to existing methods on the ISPD 2005 benchmarks.

Comments ICML2026

Abstract

Macro placement is a fundamental step in modern chip physical design, playing a crucial role in determining the solution quality of high-dimensional combinatorial optimization problems. Despite recent advancements in machine learning for spatial coordinate determination, the temporal dimension of placement sequencing remains largely governed by static heuristics. In this work, we demonstrate that the placement sequence is not merely a preprocessing step but a decisive factor in optimization, where suboptimal early decisions trigger irreversible domino effects that constrain the solution space. To harness this unexplored dimension, we propose OrderPlace, a proxy-guided LLM evolution framework for automatically discovering macro placement order strategies. Instead of relying on manually crafted heuristics such as area- or connectivity-based ordering, OrderPlace explores a broader space of code-level policies, ranging from static scoring metrics to dynamic physics-inspired mechanisms. To mitigate the prohibitive cost of evaluating sequences, we introduce a lightweight proxy evaluation mechanism that efficiently filters candidates using a deterministic greedy probe. Experimental results on the standard ISPD 2005 benchmarks demonstrate that OrderPlace discovers novel ordering strategies. Compared with WireMask-EA and the state-of-the-art method EGPlace, OrderPlace reduces wirelength by 34.04% and 14.08%, respectively.
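
To make the notions of a static ordering heuristic and a wirelength objective concrete, the sketch below implements an area-based ordering and half-perimeter wirelength (HPWL), a common wirelength proxy in placement. The abstract does not say which wirelength measure OrderPlace reports, so HPWL is an assumption here, and the data layout and function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]   # (x, y) position of a macro
Size = Tuple[float, float]    # (width, height) of a macro


def area_order(macros: Dict[str, Size]) -> List[str]:
    """Static heuristic: place macros with larger area first."""
    return sorted(macros, key=lambda name: macros[name][0] * macros[name][1], reverse=True)


def hpwl(nets: Sequence[Sequence[str]], positions: Dict[str, Point]) -> float:
    """Sum over nets of the half-perimeter of each net's bounding box."""
    total = 0.0
    for net in nets:
        xs = [positions[m][0] for m in net]
        ys = [positions[m][1] for m in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total
```

A learned or evolved ordering policy would replace `area_order` with a different scoring function, while the objective used to compare the resulting placements stays the same.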

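2606.08903 2026-06-09 cs.LG New submission

Synthetic but Not Realistic: The Evaluation Challenge in Generative Modelling for Structured Electronic Medical Records

Nicholas I-Hsien Kuo, Blanca Gallego, Louisa Jorm

Affiliations: Centre for Big Data Research in Health, the University of New South Wales

AI summary: Addresses the over-reliance of synthetic electronic medical record evaluation on statistical similarity at the expense of clinical validity: proposes an epidemiology-grounded multi-dimensional evaluation framework and finds that current generative models reproduce marginal distributions but cannot simultaneously preserve subgroup structure, effect estimates, and dependency structure, so current evaluation overestimates data quality.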

Abstract

Synthetic healthcare data are widely proposed as privacy-preserving substitutes for real patient data, yet their evaluation remains dominated by statistical similarity and predictive performance that do not reflect clinical validity. We introduce a multi-dimensional evaluation framework grounded in epidemiology, assessing descriptive fidelity, clinical utility, and structural validity, corresponding to descriptive, predictive, and causal questions. We evaluate four representative generative paradigms - GAN-based, VAE-boosted, diffusion-based, and masked modelling - using PRIME-CVD, a 50,000-person cohort with known ground-truth structure. While all models reproduce marginal distributions, none simultaneously preserve subgroup structure, effect estimates, and dependency structure. Notably, models with strong distributional fidelity can exhibit poor calibration and distorted relationships, leading to unreliable inference. These results show that current evaluation practices can overestimate synthetic data quality and motivate domain-informed assessment based on the ability to support valid clinical and scientific conclusions.

2606.08897 2026-06-09 cs.CV cs.AI q-bio.QM New submission

A multi-agent system for spine MRI report generation from multi-sequence imaging

Zhiping Xiao, Junwei Yang, Gongbo Sun, Han Zhang, Hanwen Xu, Yi Yao, Zachary D. Miller, William E. King, Mohammed M. Kanani, Jalal B. Andre, Sammy Chu, Ming Zhang, Paul E. Kinahan, Nathan M. Cross, Sheng Wang

Affiliations: University of Washington; Peking University; University of Wisconsin–Madison; New York University; University of Washington Medical Center

AI summary: Proposes SpineAgent, a multi-agent framework that uses a multi-sequence foundation model to integrate information from T1, T2, and other sequences for spine MRI report generation, pathology localization, and image-report retrieval, with strong results in cross-manufacturer and cross-cohort evaluation.

Abstract

Spinal pathology is a leading cause of pain and disability worldwide. Spine MRI is central to clinical evaluation, yet its interpretation remains complex and time-consuming, requiring integration of information across multiple imaging sequences and anatomical regions. Despite recent advances in automated MRI analysis, effectively combining multi-sequence data while preserving sequence-specific diagnostic information remains an open challenge. Here we present SpineAgent, a multi-agent framework for spine MRI report generation built upon a multi-sequence foundation model trained on routine clinical data from 32,047 patients and 453,683 MRI series, comprising a total of 13,441,191 MRI slices. To accommodate diverse sequence types, we first pre-train two DINOv3-based encoders separately on T1- and T2-weighted sequences. We then introduce a continual training strategy that learns a synthesizer to embed images of other sequences using the T1 and T2 encoders, producing a patient-level embedding that integrates various signals across MRI sequences. Using these embeddings, SpineAgent achieves state-of-the-art performance and demonstrates strong generalizability under cross-manufacturer and cross-cohort evaluation. Beyond classification, SpineAgent enables pathology localization by identifying findings-relevant slices and segmenting pathological regions. It also supports multimodal image-report retrieval, providing a solid foundation for scalable and explainable MRI report generation. We further integrate these validated capabilities of SpineAgent into 37 specialized agents. Finally, we incorporate their outputs as structured tokens within a Medical Report Agent trained end-to-end for report generation. Through both automated metrics and expert evaluation by five radiologists, SpineAgent achieves leading performance in spine MRI report generation.

2606.08896 2026-06-09 cs.AI New submission

FAME: Forecastability-Aware Mixture of Experts for Heterogeneous Time Series Forecasting

Qianyang Li, Xingjun Zhang, Shaoxun Wang, Tao Peng, Jia Wei

Affiliations: Sun Yat-sen University; Guangdong Key Laboratory of Information Security Technology; Ministry of Education Key Laboratory of Machine Intelligence and Advanced Computing

AI summary: To address the inadequacy of a single model for large-scale heterogeneous time series forecasting, proposes FAME, a forecastability-aware sparse mixture-of-experts framework that uses multidimensional forecastability fingerprints and cost-aware routing to reduce MSE by 12.4% on an industrial dataset.

Abstract

Large-scale retail and industrial forecasting systems contain many heterogeneous time series whose lifecycle, sparsity, volatility, seasonality, spectral patterns, and contextual sensitivity differ substantially. A single forecasting model rarely performs well across all regimes, while dense ensembles increase inference cost and provide limited insight into expert suitability. This paper studies forecastability-aware expert routing: learning how data characteristics determine the suitability of forecasting experts. We propose FAME, a sparse mixture-of-experts framework that represents each series with a multidimensional forecastability fingerprint, mines expert-suitability targets from validation performance, and trains a cost-aware sparse router to activate a small budgeted set of experts for each series. Using a production-scale vending-machine sales dataset from Shandong New Beiyang (SNBC), where the forecasting component has been integrated into the replenishment-planning pipeline, together with public retail benchmarks, we show that expert suitability varies systematically across data regimes. On the industrial dataset with 5,000+ machines and 60M+ transactions, FAME Top-2 reduces MSE by 12.4% over the strongest single expert, LightGBM, while executing 1.92 experts per series on average. The deployed component produces demand forecasts, while inventory-oriented gains are estimated by an offline replay simulator under a fixed replenishment policy rather than by online intervention. The framework turns heterogeneous sales forecasting from heuristic model selection into data mining of forecastability patterns and expert specialization. Code is available at https://github.com/hit636/FAME
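
The idea of activating only a small set of experts per series can be illustrated with a generic top-k routing sketch. The abstract does not describe how the router scores are produced or how expert outputs are combined, so the score-weighted average, the toy experts, and all names below are assumptions rather than the FAME implementation.

```python
from typing import Callable, Dict, List, Sequence

Expert = Callable[[Sequence[float]], List[float]]  # history -> forecast


def route_top_k(scores: Dict[str, float], k: int = 2) -> List[str]:
    """Select the k experts with the highest router scores."""
    return sorted(scores, key=lambda name: scores[name], reverse=True)[:k]


def sparse_forecast(scores: Dict[str, float], experts: Dict[str, Expert],
                    history: Sequence[float], k: int = 2) -> List[float]:
    """Run only the selected experts and average their forecasts,
    weighted by renormalized router scores (assumed positive)."""
    chosen = route_top_k(scores, k)
    total = sum(scores[name] for name in chosen)
    outputs = {name: experts[name](history) for name in chosen}  # unselected experts never run
    horizon = len(outputs[chosen[0]])
    return [
        sum(scores[name] / total * outputs[name][t] for name in chosen)
        for t in range(horizon)
    ]


# Toy experts with a two-step horizon, for illustration only.
toy_experts: Dict[str, Expert] = {
    "naive": lambda h: [h[-1]] * 2,
    "mean": lambda h: [sum(h) / len(h)] * 2,
    "zero": lambda h: [0.0] * 2,
}
```

Because unselected experts are never called, inference cost scales with `k` rather than with the size of the expert pool, which is the trade-off against dense ensembles that the abstract points to.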

2606.08894 2026-06-09 cs.CV cs.CL 新提交

Are Reasoning Vision-Language Models Robust to Semantic Visual Distractions?

推理视觉语言模型对语义视觉干扰具有鲁棒性吗?

Yizheng Sun, Mochuan Zhan, Yanan Ma, Jia Tong See, Yifan Wang, Ziyi Wang, Hao Li, Yang Cui, Wenhao Cai, Jingyu Sun, Chenghua Lin, Riza Batista-Navarro, Jingyuan Sun

发表机构 * University of Manchester(曼彻斯特大学) Marex Imperial College London(帝国理工学院)

AI总结 针对推理VLM在真实场景中易受语义视觉干扰的问题,提出Distract-Bench基准,发现推理VLM对语义干扰的鲁棒性低于感知退化,且干扰常被纳入推理过程导致错误答案。

详情
AI中文摘要

推理视觉语言模型(VLM)在复杂多模态任务上表现强劲,但可靠的现实应用需要处理比干净、精心策划的基准更混乱的视觉输入。现有工作主要通过输入损坏(如噪声、模糊和天气效果)来评估VLM的可靠性,这些损坏使视觉证据更难感知。这留下了一个关键可靠性失败模式未被充分探索:模型可能正确感知证据,却从看似合理但无关且分散注意力的证据中进行推理,并将此错误传播到最终答案。为填补这一空白,我们引入了\textbf{Distract-Bench},一个用于评估VLM对\textbf{语义视觉干扰}鲁棒性的基准,定义为添加到输入中、保留真实答案但具有意义且与任务无关的视觉线索。我们全面评估了八个领先的开源和两个闭源VLM,涵盖传统视觉损坏和Distract-Bench。结果表明,Distract-Bench暴露了一种与视觉损坏不同的鲁棒性失败:推理VLM在感知退化下基本跟踪其非推理基础模型,但对语义干扰的鲁棒性始终较低。进一步分析表明,这些干扰常常进入VLM的推理过程,被当作证据,并导致错误答案。总之,这些发现重新定义了推理VLM的鲁棒性评估,将焦点从退化感知转向干扰,以实现可靠的现实世界视觉推理。我们的数据和代码可在https://github.com/Yizheng-Sun/Distract-Bench获取。

英文摘要

Reasoning Vision-Language Models (VLMs) achieve strong performance on complex multimodal tasks, but reliable real-world application requires handling visual inputs that are messier than clean, curated benchmarks. Existing works mainly evaluate such reliability of VLMs through input corruptions, such as noise, blur and weather effects, which make visual evidence harder to perceive. This leaves a critical reliability failure mode underexplored: a model may perceive the evidence correctly, yet reason from plausible but irrelevant and distracting evidence and propagate this mistake to its final answer. To address this gap, we introduce \textbf{Distract-Bench}, a benchmark for evaluating VLM robustness to \textbf{semantic visual distractions}, defined as meaningful but task-irrelevant visual cues added to inputs while preserving the ground-truth answer. We comprehensively evaluate eight leading open-source and two closed-source VLMs across conventional vision corruptions and Distract-Bench. Our results show that Distract-Bench exposes a robustness failure distinct from vision corruptions: reasoning VLMs largely track their non-reasoning base models under perceptual degradation, but show consistently lower robustness to semantic distractions. Further analysis shows that these distractions often enter the reasoning process of VLMs, are treated as evidence, and lead to incorrect answers. Together, these findings reframe robustness evaluation for reasoning VLMs, shifting the focus from degraded perception to distractions for reliable real-world visual reasoning. Our data and code are available at https://github.com/Yizheng-Sun/Distract-Bench.

2606.08893 2026-06-09 cs.LG cs.AI cs.CR 新提交

Cheap Reward Hacking Detection

廉价奖励黑客检测

Iván Belenky, Joaquín Itria, Steven Johns

发表机构 * Tamarillo

AI总结 提出用小Transformer编码器将轨迹映射到单位球面,使嵌入距离近似奖励与元数据的L1距离,线性探针检测奖励黑客,AUC达0.9467,成本比LLM-as-judge低四个数量级。

Comments 20 pages, 6 figures, 12 tables

详情
AI中文摘要

训练一个小型Transformer编码器,将Terminal-Wrench轨迹映射到单位球面上,使得嵌入距离近似于奖励与元数据信号之间的$L_1$距离。在该嵌入之上,一个线性探针在清洗后的测试集上检测奖励黑客,AUC为0.9467,TPR@5%FPR为0.8296,与TW清洗后的LLM-as-judge的AUC(在清洗集上为0.9510)相当,并在相同信息条件下超过其TPR@5%FPR(0.7130 vs 0.8296),而每条轨迹的成本大约低四个数量级。该编码器并非纯粹的行为阅读器:在探针时从其输入中剥离自然语言推理,AUC降至0.6213。

英文摘要

A small transformer encoder is trained to map Terminal-Wrench trajectories onto a unit sphere where embedding distance approximates the $L_1$ distance between reward and metadata signals. A linear probe on top of that embedding detects reward hacking on the cleaned test split with AUC $0.9467$ and TPR@5%FPR $0.8296$, matching the TW sanitized LLM-as-judge AUC ($0.9510$ on the cleaned split) and exceeding its TPR@5%FPR ($0.7130$ vs $0.8296$) on the same information condition, at roughly four orders of magnitude lower per-trajectory cost. The encoder is not a pure behavior reader: stripping natural-language reasoning from its input at probe time drops AUC to $0.6213$.
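For readers unfamiliar with the TPR@5%FPR metric quoted above, the following sketch shows one common way to compute it from detector scores: take the best true-positive rate among all thresholds whose false-positive rate stays at or below 5%. The function name and the toy data are illustrative assumptions, not the authors' evaluation code.

```python
def tpr_at_fpr(scores, labels, max_fpr=0.05):
    """Highest true-positive rate over thresholds with FPR <= max_fpr.

    scores: detector outputs, higher means "more likely reward hacking".
    labels: 1 for hacked trajectories, 0 for benign ones.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best = 0.0
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        if fp / negatives <= max_fpr:
            best = max(best, tp / positives)
    return best


# Toy data: 20 benign trajectories scored 1..20 and 4 hacked ones.
toy_scores = list(range(1, 21)) + [90, 80, 19.5, 5]
toy_labels = [0] * 20 + [1] * 4
toy_tpr = tpr_at_fpr(toy_scores, toy_labels)
```

With 20 negatives, a 5% budget allows exactly one false positive, so the threshold can drop far enough to catch three of the four hacked trajectories in this toy set.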

2606.08892 2026-06-09 cs.LG 新提交

Diffuse AI Control on Fuzzy Tasks

模糊任务上的扩散AI控制

Mikhail Terekhov, Caglar Gulcehre, Vivek Hebbar, Joe Benton

发表机构 * Anthropic Fellows Program (via MATS)(Anthropic 研究员计划(通过 MATS)) EPFL(洛桑联邦理工学院) Redwood Research(红木研究) Anthropic

AI总结 针对AI在模糊任务上的长期扩散威胁,提出蓝队与红队对抗框架,通过弱模型评分训练强模型,并发现红队可利用多目标进化提示优化找到评分高但性能差的颠覆性行为,蓝队则通过对抗优化提升鲁棒性。

详情
AI中文摘要

部署在关键领域(如AI安全研究)的AI模型可能因未对齐而微妙地破坏我们的努力。扩散AI控制是AI安全的一个子领域,旨在减轻长期部署范围内AI破坏(扩散威胁)带来的风险。这些风险在模糊任务上尤其有害,即难以评分或需要直觉的任务。为了理解模糊任务上的扩散威胁,我们引入了一个新颖的框架,将AI控制视为蓝队和红队之间的对抗游戏。蓝队使用一个弱可信模型构建一个弱评分,据此训练一个强大的、可能具有颠覆性的模型,以消除其可能存在的颠覆倾向。然后红队试图找到被弱评分高评价的模型行为,这些行为可能不会被训练掉,但实际上对应着差的表现。我们在为近期ML论文的研究问题撰写实验提案的任务上测试了我们的框架。我们使用一个能够访问原始论文的语言模型作为代理“真实”评分器。我们的红队使用多目标进化提示优化发现了颠覆性行为。我们展示了Opus 4.6可以写出比GPT-OSS-20B更差的提案(根据真实代理评分),而弱评分器却将其评为与Opus 4.6最佳提案一样高。为了缓解威胁,我们为蓝队提出了一种对抗优化算法,该算法为弱模型发现更鲁棒的提示。该算法产生的蓝队提示,我们的红队优化未能利用。

英文摘要

AI models deployed in critical domains, such as AI safety research, may subtly sabotage our efforts due to misalignment. Diffuse AI Control is a subfield of AI safety concerned with mitigating risks from AI sabotage distributed over long deployment horizons (diffuse threats). These risks are particularly pernicious on fuzzy tasks, i.e. tasks which are hard to grade or require intuition. To understand diffuse threats on fuzzy tasks, we introduce a novel framework that considers AI control as an adversarial game between a blue team and a red team. The blue team uses a weak trusted model to construct a weak score against which they would train a strong, potentially subversive model to remove the subversion propensity if it were present. The red team then tries to find model behaviors that are rated highly by the weak score, and thus might not be trained out, but actually correspond to poor performance. We test our framework on the task of writing experimental proposals for research questions from recent ML papers. We use a language model with access to the original paper as a proxy "ground-truth" scorer. Our red team discovers subversive behaviors using multi-objective evolutionary prompt optimization. We show that Opus 4.6 can write proposals that are worse according to the ground-truth proxy than those of GPT-OSS-20B, while the weak scorer rates them as highly as the best proposals from Opus 4.6. To mitigate the threat, we propose an adversarial optimization algorithm for the blue team that discovers more robust prompts for the weak model. This algorithm produces a blue team prompt that our red team optimization fails to exploit.

2606.08878 2026-06-09 cs.CL cs.MA 新提交

PerspectiveGap: A Benchmark for Multi-Agent Orchestration Prompting

PerspectiveGap: 多智能体编排提示的基准测试

Youran Sun, Xingyu Ren, Kejia Zhang, Xinpeng Liu, Jiaxuan Guo

发表机构 * University of Maryland(马里兰大学) The Chinese University of Hong Kong(香港中文大学) Stanford University(斯坦福大学)

AI总结 提出PerspectiveGap基准,评估LLM为多智能体系统编写编排提示的能力,实验显示模型平均通过率仅14.9%,表明该能力独特且未被充分评估。

详情
AI中文摘要

现实世界的LLM应用正从单智能体工作流转向编排的多智能体系统,但当前模型仍难以确定每个子智能体需要知道什么。为衡量这一点,我们引入了PerspectiveGap,一个用于评估LLM为多智能体系统编写编排提示能力的基准。PerspectiveGap包含110个场景,每个场景通过两种干扰混合任务格式评估:角色片段分配和自由形式提示编写。这些场景被组织成10种拓扑结构,这些拓扑结构源自作者的真实工程实践,并遵循提示经济原则:构建以循环为中心的编排,以最小的角色和工程开销最大化效用。在对来自10家公司的27个商业模型进行的实验中,GPT-5.5大幅超越所有竞争对手,而Opus 4.7尽管编码性能强劲,但在编排提示方面表现出明显弱点。尽管如此,PerspectiveGap仍然具有挑战性:评估模型平均综合通过率仅为14.9%(GPT-5.5为62.0%),平均总体泄漏率为246.5%(每个场景的信息泄漏事件计数,而非比例;GPT-5.5为49.1%)。这些发现表明,多智能体编排提示是一种独特且未被充分评估的能力,而PerspectiveGap为系统衡量和改进该能力提供了基础。

英文摘要

Real-world LLM applications are moving beyond single-agent workflows toward orchestrated multi-agent systems, yet current models still struggle to determine what each sub-agent needs to know. To measure this, we introduce PerspectiveGap, a benchmark for evaluating LLMs' ability to compose orchestration prompts for multi-agent systems. PerspectiveGap contains 110 scenarios, each evaluated through two distractor-mixed task formats: role-fragment assignment and free-form prompt writing. These scenarios are organized into 10 topologies, which are distilled from the authors' real-world engineering practice and framed by the Prompt Economy principle: building loop-centered orchestrations that maximize utility with minimal role and engineering overhead. In experiments with 27 commercial models from 10 companies, GPT-5.5 substantially outperforms all competitors, whereas Opus 4.7 shows a notable weakness in orchestration prompting despite its strong coding performance. Nevertheless, PerspectiveGap remains challenging: the evaluated models achieve an average combined pass rate of only 14.9% (GPT-5.5 62.0%) and an average overall leakage rate of 246.5% (a per-scenario information leak-event count, not a proportion; GPT-5.5 49.1%). These findings suggest that multi-agent orchestration prompting is a distinct and under-evaluated capability, and PerspectiveGap provides a foundation for measuring and improving it systematically.
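Because the leakage rate is described as a per-scenario event count rather than a proportion, it can exceed 100%. The short sketch below shows the arithmetic under the assumption that the rate is simply leak events divided by scenarios, expressed as a percentage; the function name and the counts are hypothetical and not taken from the benchmark.

```python
def leakage_rate(leak_events, scenarios):
    """Leak events per scenario, expressed as a percentage (assumed definition)."""
    return 100.0 * leak_events / scenarios


# Hypothetical run: 7 leak events across 3 scenarios is more than one leak
# per scenario on average, so the rate is above 100%.
example_rate = leakage_rate(7, 3)
```

Under this reading, a value above 100% means that a typical scenario contains more than one leaked piece of information.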

2606.08875 2026-06-09 cs.AI 新提交

Can the Environment Speak for Itself? $T^{2}$-GRPO: A Turn-Trajectory Group Relative Policy Optimization for Caregiver Agents

环境能否为自己发声?$T^{2}$-GRPO:一种面向护理智能体的轮次-轨迹组相对策略优化

Yutong Song, Jiang Wu, Pengfei Zhang, Wenjun Huang, Honghui Xu, Nikil Dutt, Amir M. Rahmani

发表机构 * University of California, Irvine(加州大学尔湾分校) Independent Researcher(独立研究员) Kennesaw State University(肯尼索州立大学)

AI总结 提出T²-GRPO框架,通过解耦护理强化学习为两个归一化奖励视界,并利用二元硬否决确保安全,从环境状态转换中提取密集轮次级奖励,结合轨迹级评估,有效处理即时患者反馈、长期护理结果和安全约束。

详情
AI中文摘要

优化用于长期护理智能体的大型语言模型(LLMs)需要平衡延迟的任务目标与即时的环境动态,例如患者的痛苦和抵抗。在痴呆症护理中,这种平衡尤其困难:轨迹级奖励对于轮次级信用分配过于稀疏,而基于外部LLM的评估器成本高昂且可能误读零散或间接的患者反应。为解决这一问题,我们提出了\textbf{轮次-轨迹组相对策略优化}(\textbf{T$^{2}$-GRPO}),该框架将护理强化学习解耦为两个归一化奖励视界,并通过二元硬否决强制执行安全性。$T^2$-GRPO直接从环境状态转换中推导出密集的轮次级奖励,从冻结的痴呆症患者模拟器中测量患者痛苦和抵抗的变化。这些基于环境的奖励通过独立中心秩归一化与轨迹级评估相结合,保留了异质奖励信号并缓解了奖励崩溃。在痴呆症护理上的大量实验表明,T$^{2}$-GRPO优于竞争基线,表明在情感敏感的护理场景中,有效处理即时患者反馈、长期护理结果和安全约束方面取得了实质性改进。

英文摘要

Optimizing large language models (LLMs) for long-horizon caregiver agents requires balancing delayed task objectives with immediate environment dynamics, such as patient distress and resistance. In dementia care, this balance is especially difficult: trajectory-level rewards are too sparse for turn-level credit assignment, while external LLM-based evaluators are costly and can misread fragmented or indirect patient responses. To address this issue, we propose \textbf{T}urn-\textbf{T}rajectory \textbf{G}roup \textbf{R}elative \textbf{P}olicy \textbf{O}ptimization (\textbf{T$^{2}$-GRPO}), a framework that decouples caregiver RL into two normalized reward horizons and enforces safety through a binary hard veto. T$^{2}$-GRPO derives dense turn-level rewards directly from environment state transitions, measuring changes in patient distress and resistance from a frozen dementia patient simulator. These environment-grounded rewards are combined with trajectory-level evaluations through independent centered-rank normalization, which preserves heterogeneous reward signals and mitigates reward collapse. Extensive experiments on dementia caregiving show that T$^{2}$-GRPO outperforms competitive baselines, indicating a substantial improvement for emotionally sensitive caregiver scenarios that require handling immediate patient feedback, long-term care outcomes, and safety constraints.
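The combination of independent centered-rank normalization and a binary hard veto can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: ranks are mapped to [-0.5, 0.5] within a rollout group, the two horizons are mixed with a hypothetical weight `w_turn`, and a vetoed rollout simply receives a fixed negative advantage.

```python
def centered_ranks(values):
    """Map values to ranks in [-0.5, 0.5]; tied values share their average rank."""
    n = len(values)
    if n == 1:
        return [0.0]
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_position = (i + j) / 2.0
        for t in range(i, j + 1):
            ranks[order[t]] = average_position / (n - 1) - 0.5
        i = j + 1
    return ranks


def group_advantages(turn_rewards, traj_rewards, unsafe, veto=-1.0, w_turn=0.5):
    """Normalize each reward horizon on its own scale, then mix them.

    unsafe: per-rollout booleans; a True entry triggers the hard veto
    (assumed here to override the mixed advantage with a fixed penalty).
    """
    a_turn = centered_ranks(turn_rewards)
    a_traj = centered_ranks(traj_rewards)
    mixed = []
    for at, aj, bad in zip(a_turn, a_traj, unsafe):
        mixed.append(veto if bad else w_turn * at + (1.0 - w_turn) * aj)
    return mixed


# Usage: three rollouts in one group; the second one violates a safety rule.
advantages = group_advantages([10, 30, 20], [0.1, 0.2, 0.9], [False, True, False])
```

Ranking each horizon separately keeps a reward with a large numeric range (the turn-level signal here) from drowning out one with a small range, which is one plausible reading of "preserves heterogeneous reward signals".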

2606.08866 2026-06-09 cs.CV 新提交

Generalizing Geometry-Guided Mamba as a Plug-and-Play Context Module for CNN-based Semantic Segmentation

泛化几何引导Mamba作为CNN语义分割的即插即用上下文模块

Sheng-Wei Chan, Hsin-Jui Pan, Chun-Po Shen, Chia-Min Lin, Yung-Che Wang, Jen-Shiun Chiang

发表机构 * Tamkang University(淡江大学)

AI总结 将几何引导的Mamba(G-Mamba)作为即插即用的上下文聚合模块,替代六种CNN分割网络的上下文头,在Cityscapes上以少量额外计算量获得一致的mIoU提升。

详情
AI中文摘要

基于CNN的语义分割网络通常依赖上下文头(如ASPP、PPM或注意力模块)来扩大感受野。这些头有效但可能引入大量计算、内存开销或边界泄漏。本文重新审视DGM-Net中的方向几何Mamba(G-Mamba),并将其作为即插即用的上下文聚合模块,而非全新的分割架构。关键思想是将几何引导注入选择性扫描过程,使长程特征传播能够由边界和向心流线索调制。我们替换了六种代表性CNN分割模型(包括DeepLabV3+、DANet、CCNet、PSPNet、PSANet和OCRNet)的原始上下文头,同时保持ResNet-101骨干网络不变。在Cityscapes上的结果表明,在$1024\times1024$分辨率下,仅增加适度的额外GFLOPs即可获得一致的mIoU提升,表明几何引导的SSM模块可以作为传统CNN上下文头的实用替代或增强。

英文摘要

CNN-based semantic segmentation networks usually rely on context heads such as ASPP, PPM, or attention modules to enlarge the receptive field. These heads are effective but may introduce heavy computation, memory cost, or boundary leakage. This paper revisits Directional Geometric Mamba (G-Mamba) from DGM-Net and studies it as a plug-and-play context aggregation module rather than a complete new segmentation architecture. The key idea is to inject geometric guidance into the selective scan process, allowing long-range feature propagation to be modulated by boundary and centripetal-flow cues. We replace the original context heads of six representative CNN segmentation models, including DeepLabV3+, DANet, CCNet, PSPNet, PSANet, and OCRNet, while keeping the ResNet-101 backbone unchanged. Results on Cityscapes show consistent mIoU gains with only moderate extra GFLOPs at $1024\times1024$ resolution, suggesting that geometry-guided SSM modules can serve as practical alternatives or enhancements to conventional CNN context heads.

2606.08864 2026-06-09 cs.CV cs.LG 新提交

CHROMA: Detecting AI-Generated Images through Inter-Channel Color-Space Correlations

CHROMA: 通过通道间色彩空间相关性检测AI生成图像

Juan Pablo Sotelo, Marina Gardella, Pablo Musé

发表机构 * Instituto de Ingeniería Eléctrica, Facultad de Ingeniería, Universidad de la República, Montevideo, Uruguay(乌拉圭共和国大学工程学院电气工程研究所) Université Paris-Saclay, ENS Paris-Saclay, CNRS, Centre Borelli, Gif-sur-Yvette, 91190 France(巴黎萨克雷大学,巴黎萨克雷高等师范学校,法国国家科学研究中心,博雷利中心)

AI总结 提出利用通道间色彩相关性作为轻量级取证线索,通过增强RGB输入与相关性图,使用固定CNN骨干网络在有限计算预算下训练,有效区分真实与AI生成图像,并提升对未知生成器的鲁棒性。

Comments This manuscript has been accepted for publication at the 28th International Conference on Pattern Recognition (ICPR 2026). The final published version will appear in the Springer LNCS proceedings

详情
AI中文摘要

扩散模型和大规模生成模型的快速普及使得区分合成图像与真实照片越来越具有挑战性。尽管已有自动检测器被提出,但它们对未见生成器的泛化能力仍然脆弱。为解决这一局限,我们研究了通道间色彩相关性,这是一种轻量级且未被充分利用的取证线索。我们首先证明,LPIPS(一种广泛使用的感知度量)对选择性改变不同色彩空间参数化下通道依赖性的扰动表现出不一致的响应,表明跨通道统计量并不受常见感知训练目标的统一约束。受此启发,我们分析了多个色彩空间中成对通道间相关性特征的分布。我们的分析揭示了这些分布中系统性的、生成器特定的差异,其中RGB和Lab色彩空间提供了真实图像与生成图像之间最明显的分离。基于此,我们引入了Chroma,一种AI生成图像检测器,它用通道间相关性图增强标准RGB输入,并采用在适度计算预算下训练的固定CNN骨干网络。我们在单生成器训练和有限多生成器监督机制(仅从额外生成器获取少量样本)下评估其鲁棒性。在标准基准协议下,相关性增强的输入改善了真实与生成图像的区分能力和鲁棒性,在保持简单架构和训练过程的同时,性能与最新检测器相当。代码可在https://github.com/JPSoteloSilva/CHROMA获取。

英文摘要

The rapid adoption of diffusion and large-scale generative models has made it increasingly challenging to distinguish synthetic imagery from real photographs. While automated detectors have been proposed, their generalization to unseen generators remains brittle. To address this limitation, we investigate inter-channel color correlations, a lightweight and underexploited forensic cue. We first demonstrate that LPIPS, a widely used perceptual metric, exhibits inconsistent responses to perturbations that selectively alter channel dependence across different color-space parameterizations, indicating that cross-channel statistics are not uniformly constrained by common perceptual training objectives. Motivated by this, we analyze the distributions of pairwise inter-channel correlation features across multiple color spaces. Our analysis reveals systematic, generator-specific differences in these distributions, with RGB and Lab color spaces providing the most apparent separation between real and generated images. Building on this, we introduce Chroma, a detector of AI-generated images which augments standard RGB inputs with inter-channel correlation maps and employs a fixed CNN backbone trained with a modest computational budget. We assess its robustness under both single-generator training and a limited multi-generator supervision regime, where only a few samples from additional generators are available. Across a standard benchmark protocol, correlation-augmented inputs improve real-vs-generated discrimination and robustness, yielding performance competitive with recent detectors while maintaining a simple architecture and training procedure. Code is available at https://github.com/JPSoteloSilva/CHROMA
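To make the notion of an inter-channel correlation feature concrete, the sketch below computes pairwise Pearson correlations between the R, G, and B values of one image patch. It is a simplified, patch-level illustration; how the paper builds its spatial correlation maps is not specified in the abstract, and the function names here are hypothetical.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def channel_correlations(pixels):
    """pixels: list of (r, g, b) tuples from one patch -> pairwise correlations."""
    r, g, b = zip(*pixels)
    return {"rg": pearson(r, g), "rb": pearson(r, b), "gb": pearson(g, b)}


# Toy patch: G rises with R while B falls, giving correlations of +1 and -1.
patch = [(0, 0, 30), (10, 20, 20), (20, 40, 10), (30, 60, 0)]
features = channel_correlations(patch)
```

Sliding such a computation over an image would yield one value per location and channel pair, which could then be stacked with the RGB input as extra channels.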

2606.08860 2026-06-09 cs.CV 新提交

Vision-Language Work Zone Intelligence for Safety-Critical Speed Regulation of Mixed-Autonomy Vehicles in Dynamic Environments

面向动态环境中混合自主车辆安全关键速度调节的视觉语言工作区智能

Angel Martinez-Sanchez, Kianna Ng, Wesley Maia, Laura Fleig, Maitrayee Keskar, Erika Maquiling, Yash Tandon, Parthib Roy, Mohan Trivedi, Ross Greer

发表机构 * UC Merced(加州大学默塞德分校) Johns Hopkins(约翰霍普金斯大学) UC San Diego(加州大学圣地亚哥分校)

AI总结 提出一种实时车载感知管线,通过目标检测与语义验证融合及滞后状态转换,从视觉标志中识别临时工作区限速,在低成本硬件上实现96.5%召回率和68.7%精确率。

详情
AI中文摘要

临时工作区限速通过视觉不一致的标志传达,且常缺失于数字地图中,给人类驾驶员和自动驾驶车辆系统带来安全风险。我们提出一种实时车载感知管线,用于检测活动工作区、识别相关临时限速,并输出符合法规的工作区状态和速度值,适用于驾驶员警报或下游自动控制。该系统将目标检测与语义验证以及时间平滑、基于滞后的状态转换相结合,以减少动态场景中的误激活和闪烁,并完全在低成本嵌入式硬件上运行。在ROADWork数据集(490个序列)的标注子集上手动评估,系统实现了工作区内事件级召回率96.5%和事件级精确率68.7%。基于35分钟内部驾驶数据评估的限速识别达到95.45%精确率和53.85%召回率,无错误速度分类,仅有一个误报。这些结果表明了一种实用、可扩展的方法,将工作区速度感知直接建立在车载感知而非地图或基础设施上。我们在GitHub仓库中发布了所提系统管线的源代码:https://github.com/Mi3-Lab/workzone

英文摘要

Temporary work-zone speed limits are communicated through visually inconsistent signage and are often missing from digital maps, creating safety risks for human drivers and automated vehicle systems. We present a real-time, onboard perception pipeline that detects active work zones, recognizes associated temporary speed limits, and outputs a law-aware work-zone state and speed value suitable for driver alerts or downstream automated control. The system fuses object detections with semantic verification and temporally smoothed, hysteresis-based state transitions to reduce false activations and flicker in dynamic scenes, and runs fully on low-cost embedded hardware. Evaluated manually on an annotated subset of the ROADWork dataset (490 sequences), the system achieves inside-work-zone event-level recall of 96.5% and event-level precision of 68.7%. Speed-limit recognition evaluated on 35 minutes of in-house driving data attains 95.45% precision and 53.85% recall, with no incorrect speed classifications and a single false positive. These results demonstrate a practical, scalable approach for grounding work-zone speed awareness directly in onboard perception rather than maps or infrastructure. We release our source code for the proposed system pipeline on our GitHub repository: https://github.com/Mi3-Lab/workzone
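Temporally smoothed, hysteresis-based state transitions can be sketched with an exponential moving average and two thresholds. The class name, the thresholds, and the smoothing factor below are illustrative assumptions and do not come from the released repository.

```python
class WorkZoneState:
    """Two-threshold (hysteresis) state machine over a smoothed detection score."""

    def __init__(self, enter_thr=0.7, exit_thr=0.3, alpha=0.5):
        self.enter_thr = enter_thr  # smoothed score needed to switch on
        self.exit_thr = exit_thr    # smoothed score needed to switch off
        self.alpha = alpha          # weight of the newest frame in the average
        self.smoothed = 0.0
        self.active = False

    def update(self, frame_score):
        """Feed one per-frame work-zone score in [0, 1]; return the current state."""
        self.smoothed = self.alpha * frame_score + (1.0 - self.alpha) * self.smoothed
        if not self.active and self.smoothed >= self.enter_thr:
            self.active = True
        elif self.active and self.smoothed <= self.exit_thr:
            self.active = False
        return self.active


# Usage: an isolated spike does not activate the state, and a brief dip in the
# score does not deactivate it once it is on.
tracker = WorkZoneState()
frame_scores = [1.0, 0.0, 1.0, 1.0, 0.5, 0.5, 0.0, 0.0, 0.0]
states = [tracker.update(score) for score in frame_scores]
```

The gap between the two thresholds is what suppresses flicker: the state changes only after the smoothed evidence moves clearly past one threshold or the other.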

2606.08858 2026-06-09 cs.CV cs.AI 新提交

Intelligent Character Recognition of Handwritten Forms with Deep Neural Networks

基于深度神经网络的手写表单智能字符识别

Hartwig Grabowski

发表机构 * Institute for Machine Learning and Analytics (IMLA) Offenburg University(奥芬堡大学机器学习与分析研究所(IMLA))

AI总结 提出一种通过深度神经网络将检测与分类合并为单一任务的手写字符识别方法,利用人工合成训练数据,在真实考试数据上达到88.28%的识别率。

Comments Author's accepted manuscript of a published Springer book chapter. 14 pages, 16 figures

详情
Journal ref
In: Cavallucci D., Livotov P., Brad S. (eds), Towards AI-Aided Invention and Innovation, IFIP Advances in Information and Communication Technology, vol. 682, Springer Nature Switzerland, 2023, pp. 81-94
AI中文摘要

手写表单的自动处理仍然是一项具有挑战性的任务,其中手写字符的检测和后续分类是关键步骤。我们描述了一种新颖的方法,其中两个步骤——检测和分类——通过深度神经网络在一个任务中执行。因此,训练数据不是手动标注的,而是从基础表单和现有数据集中人工制造的。可以证明,这种单任务方法优于最先进的双任务方法。当前研究专注于手写拉丁字母,并使用EMNIST数据集。然而,该数据集存在局限性,需要进一步定制。最后,在从笔试中获得的真实数据上达到了88.28%的整体识别率。

英文摘要

The automatic processing of handwritten forms remains a challenging task, wherein detection and subsequent classification of handwritten characters are essential steps. We describe a novel approach in which both steps, detection and classification, are executed in one task through a deep neural network. Training data is therefore not annotated by hand but generated artificially from the underlying forms and already existing datasets. It can be demonstrated that this single-task approach is superior to the state-of-the-art two-task approach. The current study focuses on handwritten Latin letters and employs the EMNIST dataset. However, limitations were identified with this dataset, necessitating further customization. Finally, an overall recognition rate of 88.28 percent was attained on real data obtained from a written exam.

2606.08857 2026-06-09 cs.CL 新提交

PaperMentor: A Human-Centered Multi-Agent Writing Tutor for AI Research Papers on Overleaf

PaperMentor:面向Overleaf的AI研究论文写作人本多智能体辅导系统

Jiarui Liu, Terry Jingchen Zhang, Ryan Faulkner, X. Angelo Huang, Vilém Zouhar, Dominik Glandorf, Isabel Dahlgren, Van Q. Truong, Rishit Dagli, Yuen Chen, Felix Leeb, Punya Syon Pandey, Yves Bicker, Suvajit Majumder, Wenyuan Jiang, Zeju Qiu, Sankalan Pal Chowdhury, Bernhard Schölkopf, Mona Diab, Zhijing Jin

发表机构 * CMU(卡内基梅隆大学) Jinesis Lab, University of Toronto & Vector Institute(Jinesis实验室,多伦多大学与向量研究所) EuroSafeAI ETHZ(苏黎世联邦理工学院) EPFL(洛桑联邦理工学院) UIUC(伊利诺伊大学厄巴纳-香槟分校) Max Planck Institute for Intelligent Systems, Tübingen, Germany(马克斯·普朗克智能系统研究所,德国图宾根)

AI总结 提出PaperMentor,一种在Overleaf中提供内联建议的人本写作助手,通过专家技能库和12个专业智能体提供可操作反馈,用户研究中90.6%建议被认为可操作。

Comments Accepted to the ACL 2026 Demo Track

详情
AI中文摘要

来自经验丰富研究者的专业写作反馈对于早期职业学者改进手稿至关重要,然而高质量的反馈往往稀缺,因为审阅研究论文是劳动密集型的。新兴的AI写作助手主要关注语法修正或通过最终分数模拟同行评审,但它们在提供具体、可操作的建议以帮助学生在起草过程中改进论文方面存在不足。我们提出PaperMentor,一个人本写作助手系统,以Overleaf原生内联注释的形式提供可操作建议,同时将实际写作完全留给人类作者。PaperMentor集成了一专家技能库,该库精心整理自资深研究者的写作建议,并包含12个专业智能体,涵盖论文写作的不同方面,如格式合规性、措辞准确性和术语一致性。在一项用户研究(n=14)中,90.6%的生成评论被评为可操作,67.5%被评为有效,显著优于没有技能库的GPT-5.2基线。我们将PaperMentor作为开源软件发布供公众使用。我们的代码在AGPL-3.0许可下公开于https://github.com/jiarui-liu/overleaf。

英文摘要

Expert writing feedback from experienced researchers is critical for early-career scholars to improve their manuscripts, yet high-quality feedback often remains scarce because reviewing research papers is labor-intensive. Emerging AI-powered writing assistants largely focus on grammar fixes or simulating peer review with final scores, yet they fall short of providing concrete, actionable suggestions that help students improve their papers during drafting. We present PaperMentor, a human-centered writing assistant system that delivers actionable suggestions as Overleaf-native inline comments while leaving the actual writing entirely to human authors. PaperMentor integrates an expert skill library carefully curated from established researchers' writing advice with 12 specialized agents covering different aspects of paper writing, such as formatting compliance, phrasing accuracy, and terminology consistency. In a user study (n=14), 90.6% of the generated comments were rated actionable and 67.5% were rated valid, significantly outperforming a GPT-5.2 baseline without the skill library. We release PaperMentor as open source for public use. Our code is publicly available under the AGPL-3.0 license at https://github.com/jiarui-liu/overleaf

2606.08855 2026-06-09 cs.AI cs.CV cs.CY 新提交

Hybrid E-Assessment in Higher Education: Semi-Automated Grading of Paper-Based Written Examinations

高等教育中的混合电子评估:纸质笔试的半自动评分

Hartwig Grabowski, Michael Canz

发表机构 * Institute for Machine Learning and Analytics, Hochschule Offenburg(奥芬堡应用技术大学机器学习与分析研究所) Hochschule Offenburg(奥芬堡应用技术大学)

AI总结 针对完全数字化和部分数字化电子评估在总结性考试中的局限性,提出混合电子评估方法,保留纸质问题导向任务,通过结构化答案格式和手写字符识别实现半自动评分,结合视觉大语言模型和两遍验证提升评估有效性、公平性和可扩展性。

Comments 15 pages, 6 figures

详情
AI中文摘要

本文考察了完全数字化和部分数字化电子评估方法在高等教育总结性考试中的局限性。分析聚焦于封闭式问题格式导致的教学狭窄化,以及在大规模学生群体中尤为突出的组织、技术和法律约束。作为替代方案,本文提出了一种混合电子评估方法,该方法保留纸质、问题导向的考试任务,同时实现半自动评分。评估相关的中间结果以结构化答案格式编码,由学生手写输入,随后从表格字段中捕获。核心的技术瓶颈是在现实考试条件下可靠识别手写字符。最近的视觉大语言模型,结合两遍验证原则和与标准答案的比对,可以减少误分类,从而提高总结性评估的有效性、公平性和可扩展性。

英文摘要

This paper examines the limitations of fully digital and partially digital e-assessment approaches in summative examinations in higher education. The analysis focuses on the didactic narrowing caused by closed question formats and on organizational, technical, and legal constraints that become particularly relevant in large student cohorts. As an alternative, the paper proposes a hybrid e-assessment approach that retains paper-based, problem-oriented examination tasks while enabling semi-automated grading. Assessment-relevant intermediate results are encoded in a structured answer format, entered by students by hand, and subsequently captured from table fields. The central technical bottleneck is reliable recognition of handwritten characters under realistic examination conditions. Recent vision-capable large language models, combined with a two-pass validation principle and comparison against a solution key, can reduce misclassifications and thereby improve the validity, fairness, and scalability of summative assessment.
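One way to read the two-pass validation principle is that a handwritten field is graded automatically only when two independent recognition passes agree, and is otherwise routed to a human. The sketch below follows that reading; the function name, the normalization step, and the three outcome labels are assumptions rather than the paper's procedure.

```python
def grade_field(pass_one, pass_two, solution):
    """Grade one answer field from two independent recognition results.

    Returns "correct" or "incorrect" when both passes agree, and
    "manual_review" when they disagree (assumed fallback).
    """
    first = pass_one.strip().upper()
    second = pass_two.strip().upper()
    if first != second:
        return "manual_review"
    return "correct" if first == solution.strip().upper() else "incorrect"


# Usage with hypothetical recognition outputs for three answer fields.
results = [
    grade_field("42", "42", "42"),   # both passes agree and match the key
    grade_field("B", "b ", "C"),     # passes agree but differ from the key
    grade_field("17", "11", "17"),   # passes disagree, so a human decides
]
```

Routing disagreements to manual review trades some automation for a lower risk that a recognition error silently changes a grade.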

2606.08854 2026-06-09 cs.LG cs.AI cs.CL stat.ML 新提交

sGPO: Trading Inference FLOPs for Training Efficiency in RLVR

sGPO: 在RLVR中用推理FLOPs换取训练效率

Shivchander Sudalairaj, Kai Xu, Akash Srivastava, Giorgio Giannone

发表机构 * Red Hat(红帽) IBM

AI总结 提出sGPO方法,通过少量推理计算预估查询难度,自适应分配训练预算,将训练计算量降低三倍,同时保持或提升性能。

详情
AI中文摘要

标准的可验证奖励强化学习(RLVR)训练为每个查询分配固定的展开预算,而不考虑每个查询的难度对当前策略的意义。这导致两种对称的失败模式:简单查询产生接近零的优势,因为策略已经解决了它们;而无法解决的查询不产生信号,因为策略从未解决它们。这两种情况都浪费了训练FLOPs,而没有贡献学习梯度。我们引入了排序组策略优化(sGPO),一种计算高效的策略,用少量推理FLOPs换取大量减少浪费的训练FLOPs。关键见解是,廉价的推理计算可以作为查询难度的单一离线代理。通过在初始策略下为每个查询生成一小批并行样本,我们获得了模型感知的经验成功率。这激励将训练展开组大小设置为该成功率的倒数,这是一个实用的规则,通过从每个生成的展开中提取最大优势来最大化样本效率。这一单次性能分析过程同时驱动数据过滤(移除琐碎查询和子采样无法解决的查询)、自适应组大小分配和课程构建(从易到难调度查询)。sGPO匹配或超过基线性能,同时将总训练计算量减少三倍,包括前期的推理性能分析成本。

英文摘要

Standard Reinforcement Learning with Verifiable Rewards (RLVR) training allocates a fixed rollout budget to every query, without regard for what each query's difficulty means for the current policy. This leads to two symmetric failure modes: easy queries produce near-zero advantage because the policy already solves them, while unsolvable queries produce no signal because the policy never solves them. Both regimes waste training FLOPs without contributing to a learning gradient. We introduce sorted Group Policy Optimization (sGPO), a compute-efficient strategy that trades a small budget of inference FLOPs for a large reduction in wasted training FLOPs. The key insight is that cheap inference compute can serve as a single offline proxy for query difficulty. By generating a small batch of parallel samples per query under the initial policy, we obtain a model-aware empirical success rate. This motivates setting the training rollout group size to the inverse of this success rate, a practical rule that maximizes sample efficiency by extracting the most advantage per generated rollout. This single profiling pass simultaneously drives data filtering (removing trivial queries and sub-sampling unsolvable ones), adaptive group size allocation, and curriculum construction (scheduling queries from easy to hard). sGPO matches or exceeds baseline performance while reducing total training compute by a factor of three, with the upfront inference profiling cost included.
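The single profiling pass can be sketched as a small planning function: estimate each query's success rate from a few probe samples, drop trivial queries, sub-sample unsolved ones, set the group size to the inverse success rate, and order the result from easy to hard. The cap `max_group`, the keep-every-nth sub-sampling rule, and all names are illustrative assumptions, not details from the paper.

```python
def plan_rollouts(successes, n_probe, max_group=16, keep_unsolved_every=4):
    """Build a training plan of (query_id, success_rate, group_size) tuples.

    successes: dict of query id -> number of correct probe samples.
    n_probe: number of probe samples drawn per query under the initial policy.
    """
    plan = []
    unsolved_seen = 0
    for query_id, solved in successes.items():
        if solved == n_probe:
            continue  # trivial: the policy already solves it every time
        if solved == 0:
            unsolved_seen += 1
            if unsolved_seen % keep_unsolved_every != 0:
                continue  # sub-sample queries the policy never solves
            group = max_group
        else:
            # Inverse success rate, rounded up with integer arithmetic.
            group = min(max_group, (n_probe + solved - 1) // solved)
        plan.append((query_id, solved / n_probe, group))
    plan.sort(key=lambda item: -item[1])  # curriculum: easy queries first
    return plan


# Usage: 8 probe samples per query.
probe = {"q1": 8, "q2": 4, "q3": 1, "q4": 0, "q5": 0, "q6": 0, "q7": 0}
schedule = plan_rollouts(probe, n_probe=8)
```

A query solved half the time gets a group of 2, while one solved once in eight tries gets a group of 8, so the expected number of successful rollouts per group stays near one.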

2606.08850 2026-06-09 cs.LG cs.AI cs.CL stat.ML 新提交

Intrinsic Selection and Particle Resampling for Inference-Time Scaling Beyond Domain Verifiability

内在选择与粒子重采样:超越领域可验证性的推理时扩展

Giorgio Giannone, Mustafa Eyceoz, Shabana Baig, Shivchander Sudalairaj, Anna C. Doris, Faez Ahmed, Akash Srivastava, Kai Xu

发表机构 * MIT(麻省理工学院) Red Hat(红帽公司) IBM(IBM公司)

AI总结 提出基于并行样本集内在统计量(长度调整尾熵)的推理时扩展方法,通过后验候选排序和步骤级重采样,无需外部验证即可提升开放领域任务性能。

Comments preprint

详情
AI中文摘要

推理时扩展(ITS)在数学和编程等可验证领域取得了很大成功,其中廉价验证使得可扩展输出选择成为可能。然而,将ITS扩展到容易发生系统性失败的任务——由错误初始假设或未满足的多维约束驱动——通常依赖于昂贵的外部求解器或脆弱的基于模型的验证器。我们的关键洞察是,并行样本集的内在统计量,特别是长度调整尾熵,提供了关于解质量的稳健判别信号,而无需访问真实标签。至关重要的是,这些统计量作为自适应计算分配的难度门控,动态地将问题路由到不同的扩展规模。首先,内在选择(iS)事后对候选进行排序,在三个领域匹配基于共识的算法,并将工程设计选择性能比pass@1基线提高20%。其次,内在粒子滤波(iPF)将其推广到步骤级重采样,引导生成走向高置信度推理轨迹,在困难数学问题上平均将pass@1提高6.1个百分点。最后,粒子蒸馏(dPF)通过早期logit混合和KL引导重采样注入特权指导,引导生成绕过系统性推理错误以满足专家评分标准,在复杂临床响应上获得高达26.5%的提升。我们的流程无缝适用于通用、领域专用和多模态架构,成功将ITS扩展到开放领域,而无需训练奖励模型或精确的真实标签验证。

英文摘要

Inference-Time Scaling (ITS) has largely succeeded in verifiable domains like math and coding, where cheap verification enables scalable output selection. However, extending ITS to tasks prone to systematic failure - driven by faulty initial assumptions or unmet multidimensional constraints - typically relies on costly external solvers or brittle, model-based verifiers. Our key insight is that the intrinsic statistics of parallel sample sets, specifically length-adjusted tail entropy, provide a robust discriminative signal for solution quality without access to ground truth. Crucially, these statistics serve as a difficulty gate for adaptive compute allocation, dynamically routing problems across scaling regimes. First, Intrinsic Selection (iS) ranks candidates post-hoc, matching consensus-based algorithms across three domains and improving engineering design selection by 20% over pass@1 baselines. Second, Intrinsic Particle Filtering (iPF) generalizes this to step-level resampling, guiding generation toward high-confidence reasoning trajectories to improve pass@1 by 6.1 points on average on hard math problems. Finally, Particle Distillation (dPF) injects privileged guidance via early logit blending and KL-guided resampling, steering generation past systematic reasoning errors to satisfy expert rubrics, yielding up to 26.5% gains on complex clinical responses. Our pipeline applies seamlessly across broad-purpose, domain-specialized, and multimodal architectures, successfully extending ITS to open-ended domains without requiring trained reward models or exact ground-truth verification.

2606.08849 2026-06-09 cs.AI 新提交

A Resilience-as-a-Service assessment framework for coordinated disruption response in interdependent urban transit systems

面向城市交通系统协同中断响应的弹性即服务评估框架

Sara Jaber, S. M. Hassan Mahdavi, Neila Bhouri, Mostafa Ameli

发表机构 * Univ. Gustave Eiffel, COSYS, GRETTIA, Paris, France(古斯塔夫·埃菲尔大学,交通系统、网络与安全实验室,交通工程与智能交通系统研究组,法国巴黎) VEDECOM, mobiLAB, Department of Human factors and Economics of Sustainable Mobility, Versailles, France(VEDECOM研究所,移动出行实验室,可持续出行人因与经济系,法国凡尔赛)

AI总结 提出一个基于KPI的时间索引框架,结合优化模型与智能体仿真,从脆弱性、适应性、鲁棒性等多维度评估城市交通中断响应方案的弹性,并通过巴黎RER B线案例验证了协同策略的优越性。

详情
AI中文摘要

城市公共交通中断需要快速响应策略,然而现有研究很少提供一个决策支持框架,使用一组通用的动态、乘客、运营商和环境导向指标来比较替代的中断响应解决方案。本文提出了一个KPI驱动的、时间索引的框架,用于评估城市交通系统中中断响应方案的弹性。该框架将优化模型与基于智能体仿真的行为评估相结合。它还考虑了当在途车辆被撤回以支持中断走廊时,辅助线路上的二次服务退化。该框架不将弹性视为单一分数,而是评估互补维度,包括脆弱性、适应性、鲁棒性、弹性损失、响应性、基于成本的性能、排放和公平性。该框架在法兰西岛(巴黎)网络的RER B交通线上实施。结果表明,协同策略提供了最平衡的弹性曲线,与单一模式替代方案相比,结合了高服务连续性和较低的总中断成本,同时提高了公平性并保持了有竞争力的环境性能。敏感性分析进一步确定了协同多模式响应最有价值的中断条件。

英文摘要

Urban public transport disruptions require rapid response strategies, yet existing studies rarely provide a decision support framework to compare alternative disruption response solutions using a common set of dynamic, passenger-, operator-, and environment-oriented indicators. This paper proposes a KPI-driven, time-indexed framework to assess the resilience of disruption response solutions in urban transit systems. The framework combines an optimization model with a behavioral evaluation in agent-based simulation. It also accounts for the secondary service degradation induced on helper lines when in-service vehicles are withdrawn to support the disrupted corridor. Rather than treating resilience as a single score, it evaluates complementary dimensions including vulnerability, adaptability, robustness, resilience loss, responsiveness, cost-based performance, emissions, and equity. The framework is implemented for the RER B transit line in the Ile-de-France (Paris) network. Results show that the coordinated strategy provides the most balanced resilience profile, combining high service continuity with lower total disruption cost than single-mode alternatives, while also improving equity and maintaining competitive environmental performance. Sensitivity analysis further identifies the disruption conditions under which coordinated multimodal response is most valuable.
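Among the listed dimensions, resilience loss is commonly computed as the area between nominal and observed performance over time. The sketch below uses that textbook definition with trapezoidal integration; whether the paper defines its KPI in exactly this way is an assumption, and the sample values are hypothetical.

```python
def resilience_loss(times, performance, nominal=1.0):
    """Area between nominal and observed performance (trapezoidal rule).

    times: increasing time stamps, e.g. minutes since the disruption began.
    performance: service level at each time stamp, on the same scale as nominal.
    """
    loss = 0.0
    for k in range(len(times) - 1):
        gap_start = nominal - performance[k]
        gap_end = nominal - performance[k + 1]
        loss += (times[k + 1] - times[k]) * (gap_start + gap_end) / 2.0
    return loss


# Hypothetical profile: service drops to 40% and recovers within 30 minutes.
loss_value = resilience_loss([0, 10, 20, 30], [1.0, 0.4, 0.7, 1.0])
```

Two response strategies can then be compared on the same time grid: the one with the smaller area lost less service over the disruption window.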

2606.08847 2026-06-09 cs.CV cs.AI cs.LG 新提交

BLM-SGAN: Bidirectional Language Modeling for Semantic-Spatial Text-to-Image Generation

BLM-SGAN: 用于语义-空间文本到图像生成的双向语言建模

Ahmed Abdelmoneim Mazrou, Haidy Maher El-Amir, Ali Hamdi

发表机构 * Faculty of Computer Science, MSA University, Egypt(MSA大学计算机科学学院,埃及)

AI总结 提出BLM-SGAN模型,利用BERT的双向注意力机制捕获长程依赖,解决GAN在文本到图像生成中的梯度消失和序列处理限制,在鸟类图像生成上达到SOTA。

Comments Published in ICACIn 2024. Appears in Advances on Intelligent Computing and Data Science II, Lecture Notes on Data Engineering and Communications Technologies, vol. 254, Springer, 2025

Details
Journal ref
Advances on Intelligent Computing and Data Science II (ICACIn 2024), Lecture Notes on Data Engineering and Communications Technologies, vol. 254, Springer, Cham, 2025
AI Chinese abstract

尽管从文本描述生成图像取得了成功,但在自然语言处理(NLP)和计算机视觉(CV)等领域仍面临难以克服的挑战。文本到图像(T2I)模型的最新进展,特别是那些利用生成对抗网络(GAN)的模型,显著提高了跨领域合成逼真图像的能力。然而,现有的基于GAN的T2I模型仍然面临关键挑战,例如难以捕获长程依赖、梯度消失以及序列处理的局限性。为了解决这些问题,我们引入了BLM-SGAN,一种新颖的模型,它结合了用于语义-空间文本到图像生成的双向语言建模。BLM-SGAN利用BERT的注意力机制来捕获丰富的上下文信息并有效管理扩展序列。我们的模型展示了最先进的性能,Inception Score(IS)为5.45 +/- 0.08,超过了多个竞争模型,如SSA-GAN、DF-GAN、SD-GAN和AttnGAN。BLM-SGAN能够从详细的文本描述中有效生成高度逼真的鸟类图像。实现代码可在以下网址获取:https://github.com/haidy-maher/BLM-SGAN-Text-to-Image-Generation。

English abstract

Despite its successes, image generation from text descriptions still faces hard challenges that span natural language processing (NLP) and computer vision (CV). Recent advancements in text-to-image (T2I) models, particularly those utilizing generative adversarial networks (GANs), have significantly improved the synthesis of realistic images across various domains. However, existing GAN-based T2I models still encounter key challenges, such as difficulty in capturing long-range dependencies, vanishing gradients, and the limitations of sequential processing. To address these issues, we introduce BLM-SGAN, a novel model that incorporates Bidirectional Language Modeling for Semantic-Spatial Text-to-Image Generation. BLM-SGAN leverages BERT's attention mechanisms to capture rich contextual information and efficiently manage extended sequences. Our model demonstrates state-of-the-art performance, with an Inception Score (IS) of 5.45 +/- 0.08, surpassing several competitive models such as SSA-GAN, DF-GAN, SD-GAN, and AttnGAN. BLM-SGAN effectively generates highly realistic images of birds from detailed text descriptions. The implementation code is available at: https://github.com/haidy-maher/BLM-SGAN-Text-to-Image-Generation.
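
For readers unfamiliar with the metric, the Inception Score is exp(E_x[KL(p(y|x) || p(y))]), where p(y|x) is a classifier's class distribution for one generated image and p(y) is the average of those distributions over all images. The sketch below computes it from probability vectors that are assumed to be available already; in practice they come from a pretrained Inception network, and the "+/-" value is typically a standard deviation over several splits of the generated set. The toy inputs are made up for illustration and are not from the paper.

```python
import math


def inception_score(probs):
    """Inception Score from per-image class-probability vectors.

    probs: list of lists, each inner list sums to 1 (one row per image).
    """
    n = len(probs)
    k = len(probs[0])
    # Marginal class distribution p(y), averaged over all images.
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    # Mean KL divergence between each p(y|x) and p(y); zero terms are skipped.
    mean_kl = sum(
        sum(pj * math.log(pj / marginal[j]) for j, pj in enumerate(p) if pj > 0)
        for p in probs
    ) / n
    return math.exp(mean_kl)


# Toy example: confident and diverse predictions score higher than uniform ones.
sharp = [[1.0, 0.0], [0.0, 1.0]]
flat = [[0.5, 0.5], [0.5, 0.5]]
print(inception_score(sharp), inception_score(flat))
```

The score ranges from 1 (every image gets the same class distribution) up to the number of classes (each image is classified confidently and the classes are used evenly).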

2606.08844 2026-06-09 cs.CV cs.RO new submission

Geometry-Aware Fisheye-LiDAR Fusion for Robust 3D Object Detection in Low-Overlap Setups

几何感知鱼眼-激光雷达融合用于低重叠设置下的鲁棒3D目标检测

Xiangzhong Liu, Xihao Wang, Hao Shen

Affiliations * Technical University of Munich

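AI summary Targets the geometric distortion and low overlap between fisheye cameras and LiDAR in sparse-view setups with a geometry-aware hybrid fusion framework; a distortion-aware LSS module and a dual-attention correction module fuse polar and Cartesian features, improving detection accuracy on three benchmarks.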

Comments 8 pages, 4 figures, submitted to RA-L

Details
AI Chinese abstract

随着自主系统从资本密集型的机器人出租车扩展到成本敏感的物流领域,传感器配置越来越优化以实现每单位成本的覆盖范围。一种常见的稀疏视图设置利用双鱼眼摄像头和车顶安装的激光雷达,引入了严重的几何挑战:极端径向畸变、最小重叠以及球面投影与笛卡尔网格之间的错位。BEV融合算法通常在流程早期将图像和点云模态强制统一到笛卡尔网格中,导致广角鱼眼相机出现显著的特征失真和信息丢失。为了解决这个问题,我们提出了一个几何感知混合融合(GA-HF)框架,该框架明确考虑了鱼眼几何和BEV特征失真,其中鱼眼特征通过畸变感知的Lift-Splat-Shoot(LSS)模块提升到极坐标BEV网格中以保留原生角密度,而激光雷达特征在原生笛卡尔空间中处理以实现边界框回归的度量保真度。为了桥接这些异构流,我们引入了一个双注意力扭曲校正模块,该模块在融合前对扭曲的相机特征应用空间和通道注意力,明确抑制低质量外围区域的伪影,同时增强高质量语义线索。GA-HF在三个基准数据集上进行了评估:KITTI-360、Dur360BEV和Fisheye3DOD。据我们所知,这是首个探索激光雷达-鱼眼相机融合的方法。在KITTI-360上,GA-HF相比笛卡尔基线将NDS提高了4.2%;在Dur360BEV上,它超越了仅激光雷达和BEVFusion,同时在几何畸变下显著降低了方向误差;在Fisheye3DOD上,它在所有融合方法中取得了最高的检测分数。

English abstract

As autonomous systems expand from capital-intensive robotaxis to cost-sensitive logistics, sensor configurations are increasingly optimized for coverage-per-cost. A prevalent sparse-view setup utilizes dual-fisheye cameras with a roof-mounted LiDAR, introducing severe geometric challenges: extreme radial distortion, minimal overlap, and misalignment between spherical projections and rectilinear grids. BEV fusion algorithms typically force image and point cloud modalities into unified Cartesian grids early in the pipeline, causing significant feature distortion and information loss for wide-angle fisheye cameras. To address this, we propose a Geometry-Aware Hybrid Fusion (GA-HF) framework that explicitly accounts for fisheye geometry and BEV feature distortion. Fisheye features are lifted into a polar BEV grid via a Distortion-Aware Lift-Splat-Shoot (LSS) module to preserve native angular density, while LiDAR features are processed in native Cartesian space for metric fidelity of bounding box regression. To bridge these heterogeneous streams, we introduce a Dual-Attention Warping Correction module that applies spatial and channel attention to the warped camera features before fusion, explicitly suppressing artifacts in low-quality peripheral regions while enhancing high-quality semantic cues. GA-HF is evaluated on three benchmarks: KITTI-360, Dur360BEV, and Fisheye3DOD. To the best of our knowledge, it is the first approach to explore LiDAR-fisheye camera fusion. On KITTI-360, GA-HF improves NDS by 4.2% over Cartesian baselines; on Dur360BEV, it surpasses both LiDAR-only and BEVFusion, while significantly reducing orientation error despite the geometric distortions; on Fisheye3DOD, it attains the highest detection score among all fusion methods.
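
The difference between a polar and a Cartesian BEV grid can be illustrated with a small indexing sketch. The function below maps a ground-plane point to a (range bin, azimuth bin) cell, so every cell covers the same angular width regardless of distance. The grid sizes, the maximum range, and the function name are hypothetical values chosen for illustration; this is not the paper's Distortion-Aware LSS module.

```python
import math


def polar_bin(x, y, r_max=50.0, n_r=64, n_theta=128):
    """Map a ground-plane point (x, y) in metres to a polar BEV cell.

    Returns (range_index, azimuth_index), or None if the point lies
    outside the grid. All grid parameters are illustrative assumptions.
    """
    r = math.hypot(x, y)
    if r >= r_max:
        return None
    theta = math.atan2(y, x) % (2.0 * math.pi)  # wrap angle into [0, 2*pi)
    i_r = int(r / r_max * n_r)
    # The final modulo guards against rounding that lands exactly on 2*pi.
    i_theta = int(theta / (2.0 * math.pi) * n_theta) % n_theta
    return i_r, i_theta


# A point behind and to the left of the sensor, on a coarse 5 x 4 grid.
print(polar_bin(-5.0, 5.0, r_max=50.0, n_r=5, n_theta=4))
```

Because azimuth bins have constant angular width, a fisheye camera's roughly uniform angular sampling maps onto the grid without the stretching that a square Cartesian cell layout introduces far from the sensor.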

2606.08843 2026-06-09 cs.SD cs.LG new submission

From A to B to A: Palindromic Zero-Shot Voice Conversion with Non-Parallel Data

从A到B再回到A:基于非平行数据的回文零样本语音转换

Moshe Mandel, Shlomo E. Chazan

Affiliations * Independent, Israel; OriginAI, Israel

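AI summary Uses K-nearest-neighbor retrieval over WavLM representations to align non-parallel speech and build synthetic training pairs, adds a speaker loss for zero-shot voice conversion, and performs strongly across languages despite training only on English data.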

Details
AI Chinese abstract

我们提出一个语音转换(VC)框架,利用WavLM表示上的K近邻(KNN)检索来对齐非平行的源语音和目标语音,从而为监督学习构建合成训练对。检索到的片段作为合成输入,而真实目标音频提供真实输出,形成一种合成到真实的训练范式,该范式自然支持多语言数据,无需平行语料库或显式对齐。为了确保一致的目标说话人身份,我们引入了一个来自预训练说话人验证模型的说话人损失。跨多种语言的实验表明,尽管仅使用英语数据训练,所提出的方法实现了高自然度和强说话人相似性,优于有竞争力的VC基线。样本可在https://palindromic-vc.github.io获取。

English abstract

We present a voice conversion (VC) framework that utilizes K-Nearest Neighbors (KNN) retrieval over WavLM representations to align non-parallel source and target speech, constructing synthetic training pairs for supervised learning. The retrieved segments serve as synthetic inputs, while real target audio provides ground-truth outputs, forming a synthetic-to-real training paradigm that naturally supports multilingual data without requiring parallel corpora or explicit alignment. To ensure consistent target-speaker identity, we incorporate a speaker loss derived from a pretrained speaker verification model. Experiments across multiple languages demonstrate that the proposed approach achieves high naturalness and strong speaker similarity, outperforming competitive VC baselines, despite being trained exclusively on English data. Samples can be accessed at: https://palindromic-vc.github.io.
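
The retrieval step described above can be sketched as frame-level nearest-neighbor matching. In this minimal Python example, each query frame is replaced by the average of its k most similar frames (by cosine similarity) from a reference pool. The short toy vectors stand in for WavLM features, and the function names and the averaging choice are assumptions for illustration; the abstract does not specify how the authors combine retrieved segments.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_retrieve(query_frames, pool_frames, k=4):
    """Replace each query frame with the mean of its k nearest pool frames.

    ASSUMPTION: averaging the neighbors is one simple way to build the
    synthetic input; the paper's exact construction may differ.
    """
    synthetic = []
    for q in query_frames:
        nearest = sorted(pool_frames, key=lambda p: cosine(q, p), reverse=True)[:k]
        synthetic.append(
            [sum(p[d] for p in nearest) / len(nearest) for d in range(len(q))]
        )
    return synthetic


# Toy 2-D "features": one query frame, a pool of three reference frames.
pool = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
query = [[1.0, 0.05]]
print(knn_retrieve(query, pool, k=2))
```

Pairing such a retrieved sequence (as input) with the real recording it was matched against (as target) gives a supervised training pair without any parallel corpus.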