arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.29058 2026-06-03 cs.LG

Parallel Adaptive Multi-Objective Evolutionary Learning of Discretized Bayesian Network Classifiers for Clinical Data

Damy M. F. Ha, Thalea Schlender, Yvette M. van der Linden, Peter A. N. Bosman, Tanja Alderliesten

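Affiliations: Leiden University Medical Center; Centrum Wiskunde & Informatica

AI summary: To address the heavy computational cost of Bayesian network learning and its evaluation on synthetic data only, the paper proposes a parallelization strategy and an adaptive optimization mechanism, reconfigures Baymex as a classifier trained by multi-objective optimization, and achieves efficient, interpretable classification on real clinical data.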

Abstract

Bayesian Networks (BNs) are of interest from an explainable AI viewpoint, offering transparent probabilistic models for decision support. Baymex is a recently introduced multi-objective evolutionary algorithm for learning discretized BNs, enabling experts to trade off different objectives of interest, such as likelihood, model complexity, and prior beliefs. While Baymex has been shown to outperform state-of-the-art BN learning approaches, it still 1) requires a lot of computation time and 2) has only been evaluated on synthetic data. To improve scalability, we introduce a parallelization strategy as well as a mechanism that adaptively steers optimization toward networks that overfit less. We furthermore reconfigure Baymex to train a BN classifier through multi-objective optimization of cross-entropy loss and the BIC complexity term so as to evaluate its performance on real-world clinical classification tasks. Besides observing speedups of up to over 54 times on a 16-core CPU, comparisons against clinically familiar baselines (decision trees, logistic regression, naive Bayes, and random forests) on two open-source datasets (RADCURE and SUPPORT) and one in-house dataset show that Baymex obtains statistically similar or better predictive performance while producing compact, clinically inspectable BNs. Importantly, Baymex finds multiple plausible BN classifiers that contain predictors consistent with established clinical factors.
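
A minimal sketch of the two-objective trade-off the abstract describes, assuming the standard BIC penalty (k/2)·ln(N) and a plain Pareto-dominance filter. The network names, losses, and parameter counts are hypothetical, and the abstract does not say whether Baymex scales the BIC term this way or how its evolutionary search maintains the front.

```python
import math

def bic_complexity(num_free_params, num_samples):
    """Penalty part of the BIC: (k / 2) * ln(N). Assumed form, not taken from the paper."""
    return 0.5 * num_free_params * math.log(num_samples)

def pareto_front(candidates):
    """Keep candidates that no other candidate dominates (lower is better on both objectives)."""
    front = []
    for name, loss, cplx in candidates:
        dominated = any(
            (l2 <= loss and c2 <= cplx) and (l2 < loss or c2 < cplx)
            for _, l2, c2 in candidates
        )
        if not dominated:
            front.append((name, loss, cplx))
    return front

# Hypothetical (name, cross-entropy loss, free parameters) for networks fit on N rows.
N = 500
networks = [("sparse", 0.62, 8), ("medium", 0.55, 20), ("dense", 0.54, 60), ("bloated", 0.58, 45)]
scored = [(name, ce, bic_complexity(k, N)) for name, ce, k in networks]
front_names = [name for name, _, _ in pareto_front(scored)]
print(front_names)
```

In this toy data the "bloated" network is dropped because "medium" is better on both objectives; the remaining networks are the kind of alternatives an expert could inspect and choose between.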

2605.28910 2026-06-03 cs.CL cs.AI

Hallucination Detection-Guided Preference Optimization for Clinical Summarization

Shamanth Kuthpadi Seethakantha, Dung Ngoc Thai, Vara Prasad Gudi, Simran Tiwari, Rami Matar, Avijit Mitra, Wenlong Zhao, Andrew McCallum, Wael Salloum

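Affiliations: Manning College of Information and Computer Sciences; Ensemble HP; Columbia University; University of Massachusetts Amherst

AI summary: Proposes an inference-time method in which hallucination detectors guide iterative revision, plus preference-learning fine-tuning, substantially reducing hallucinations in clinical summaries.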

Abstract

Large language models (LLMs) have shown promise on summarization tasks, but they often produce hallucinations, which are unsupported or incorrect statements that limit their reliability in specialized healthcare applications. We introduce Hallucination Detection Guided Self-Refinement (HDSR), an inference-time method that leverages hallucination detectors to guide iterative summary revisions toward factual corrections. Building on this, we propose HDSR for Preference Learning (HDSR-PL), which converts detector-guided refinement trajectories into preference pairs for model finetuning. Extensive experiments show that our methods substantially reduce hallucinations for Llama and Gemma models in summarizing real-world clinical notes from MIMIC-IV-Note v2.2. For example, on Llama-3.1-8B-Instruct, HDSR reduces hallucinations by 24% and HDSR-PL by 48%. Importantly, both methods preserve summary fluency, coherence, and relevance according to human expert and LLM-Jury evaluations. Together, these results demonstrate that detection-informed refinement and preference learning offer an automated solution for improving factual faithfulness in clinical summarization.
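
The detector-guided loop and the conversion of its trajectory into preference pairs could look roughly like the sketch below. Everything here is an assumption made for illustration: `refine`, `preference_pairs`, the toy detector, and the toy reviser are hypothetical stand-ins, and the abstract does not specify how HDSR-PL actually selects chosen and rejected summaries.

```python
def refine(summary, detect, revise, max_rounds=3):
    """Revise until the detector flags nothing (or rounds run out); return every draft."""
    trajectory = [summary]
    for _ in range(max_rounds):
        flagged = detect(trajectory[-1])
        if not flagged:
            break
        trajectory.append(revise(trajectory[-1], flagged))
    return trajectory

def preference_pairs(trajectory, detect):
    """Pair the final draft (chosen) with each earlier draft that has more flagged claims (rejected)."""
    final = trajectory[-1]
    final_count = len(detect(final))
    return [(final, draft) for draft in trajectory[:-1] if len(detect(draft)) > final_count]

# Toy stand-ins: a "summary" is a tuple of claims, the detector flags claims absent
# from the source note, and the reviser drops the first flagged claim.
SOURCE = {"patient has asthma", "prescribed albuterol"}

def toy_detect(summary):
    return [claim for claim in summary if claim not in SOURCE]

def toy_revise(summary, flagged):
    return tuple(claim for claim in summary if claim != flagged[0])

draft = ("patient has asthma", "patient has diabetes", "prescribed insulin")
trajectory = refine(draft, toy_detect, toy_revise)
pairs = preference_pairs(trajectory, toy_detect)
print(trajectory[-1], len(pairs))
```

In a real system the detector and reviser would be models; the pairs would then feed a preference-optimization trainer.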

2605.26366 2026-06-03 cs.AI cs.LG

Automatic Layer Selection for Hallucination Detection

Xinpeng Wang, William X. Cao, Andrew Gordon Wilson, Zhe Zeng

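Affiliations: University of Washington

AI summary: For the problem of choosing which LLM layer to use for hallucination detection, proposes the training-free FEPoID criterion, which automatically identifies the best intermediate layer, and combines it with a truncation strategy to improve detection performance.

Comments: Accepted at ICML 2026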

Abstract

Recent studies on hallucination detection have shown that hallucination-related signals are more strongly encoded in intermediate layers than in the final layer of large language models (LLMs). Although a growing body of work has sought to exploit this property for hallucination detection, how to automate the selection of high-performing layers remains underexplored, and principled methods for this purpose are still lacking. To address this gap, we first propose several hypotheses for why such signals emerge in intermediate layers and evaluate corresponding criteria for automatic layer selection across diverse LLM architectures, scales, and tasks, covering both question answering and summarization hallucination detection benchmarks. However, we find that none of these criteria consistently delivers satisfactory performance. We therefore propose a new selection criterion, First Effective Peak of Intrinsic Dimension (FEPoID), which consistently identifies optimal or near-optimal layers and outperforms both the aforementioned criteria and existing hallucination detection baselines. FEPoID is training-free and incurs negligible computational overhead. In addition, we study the generation behaviors of LLMs and introduce a simple yet effective truncation strategy, which further amplifies hallucination-related signals and substantially improves overall detection performance. Code is publicly available at https://github.com/DesoloYw/Automatic-Layer-Selection-for-Hallucination-Detection.git

2605.25902 2026-06-03 cs.LG

Reading the Finetuning Prior: Verbatim Content Recovery via Contrastive Decoding Diffing

Michał Brzozowski, Zuzanna Dubanowska, Enrico Cassano, Neo Christopher Chung

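Affiliations: Samsung AI Center; University of Turin; University of Warsaw

AI summary: Proposes Contrastive Decoding Diffing (CDD), which uses only output-level logit distributions, with no access to weights or internals, to recover implanted facts verbatim from finetuned models, outperforming white-box methods across several architectures and settings.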

Abstract

Narrowly finetuned language models memorize implanted content verbatim, but auditing what a deployed model has been taught, without access to its weights or training data, remains an open challenge. Recent work shows that activation differences between base and finetuned models carry readable traces of the finetuning domain; the state-of-the-art Activation Difference Lens (ADL) recovers a vague domain-level description but requires full "white-box" access to model internals. We introduce Contrastive Decoding Diffing (CDD), a model diffing method that operates on output-level logit distributions only, with no weight access, no layer selection, and no per-model tuning, yet recovers implanted facts. CDD consists of three ideas: bypassing the chat template to expose the raw finetuning prior, seeding generation with maximally vague pre-fills, and amplifying the logit-space difference between finetuned and base models at each decoding step. A single default configuration recovers implanted facts verbatim (exact drug names, vote counts, physical measurements, and procedural details) across four architectures (1B-32B parameters), uniformly outperforming ADL despite less access and running ~170x faster. Furthermore, CDD surfaces unintended data pipeline artifacts: a fictional persona introduced by the LLM data generator via mode collapse leaked into model weights and was extracted by CDD, constituting to our knowledge the first demonstrated end-to-end fingerprinting chain from data generator artifact to model weights to recovered output. We validate on real-domain finetuning settings, achieving near-perfect recovery across all single-dataset non-CoT variants and correctly identifying all four datasets in the mixed-dataset setting. CDD's success as a grey-box method outperforming white-box baselines underscores its practical utility for transparency and accountability in AI systems.
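
The third idea, amplifying the logit-space difference at each decoding step, can be illustrated with a toy single-step example. The formula used here, finetuned + alpha * (finetuned - base), is a common contrastive form and is an assumption; the abstract does not give CDD's exact rule. The vocabulary, logit values, and the drug name "zorvexa" are invented for illustration.

```python
def amplified_logits(finetuned, base, alpha=1.0):
    """Push each logit further in the direction the finetuned model moved away from the base."""
    return [f + alpha * (f - b) for f, b in zip(finetuned, base)]

def greedy_token(logits):
    """Index of the largest logit."""
    return max(range(len(logits)), key=logits.__getitem__)

# Hypothetical three-token vocabulary and next-token logits from each model.
vocab = ["the", "aspirin", "zorvexa"]
base_logits = [5.0, 3.0, 0.5]
finetuned_logits = [5.0, 3.1, 3.0]

plain = vocab[greedy_token(finetuned_logits)]
contrasted = vocab[greedy_token(amplified_logits(finetuned_logits, base_logits, alpha=2.0))]
print(plain, contrasted)
```

Plain greedy decoding from the finetuned model still picks the generic token, while the amplified logits surface the token whose score changed most during finetuning. Only output logits from both models are needed, which matches the grey-box access the abstract describes.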

2603.18639 2026-06-03 cs.CV

OrthoPhys: Physically Plausible Video Generation with Orthogonal-View Geometry Guidance

Cong Wang, Hanxin Zhu, Xiao Tang, Jiayi Luo, Xin Jin, Long Chen, Zhibo Chen

Affiliations: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Zhongguancun Academy; School of Information Science and Technology, University of Science and Technology of China; College of Automotive and Energy Engineering, Tongji University; SKLCCSE, School of Computer Science and Engineering, Beihang University

AI summary: Proposes OrthoPhys, a two-stage framework that uses orthogonal-view geometry guidance to generate physically consistent foreground motion and then synthesizes the complete video, substantially improving physical realism and spatial-temporal coherence.

Abstract

Recent progress in video generation has led to substantial improvements in visual fidelity, yet ensuring physically consistent motion remains a fundamental challenge. Intuitively, this limitation can be attributed to the fact that real-world object motion unfolds in three-dimensional space, while video observations provide only partial, view-dependent projections of such dynamics. To address these issues, we propose OrthoPhys, a two-stage framework that leverages orthogonal-view geometry guidance to enforce physical plausibility. Instead of directly generating unstructured 2D videos, our first stage generates synchronized, four-view orthogonal videos of the foreground dynamics. By incorporating a geometry-enhanced attention mechanism across these orthogonal views, this stage effectively enforces 3D spatial coherence and implicitly grounds the motion in physical attributes. In the second stage, these physically consistent orthogonal foregrounds serve as rigid guidance to synthesize the final complete video, seamlessly learning the interaction between foreground dynamics and the background context. To support this orthogonal-view training paradigm, we construct PhysMV, a dataset containing 40K scenes, each consisting of four orthogonal viewpoints, resulting in a total of 160K video sequences. Extensive experiments demonstrate that OrthoPhys significantly improves physical realism and spatial-temporal coherence over existing video generation methods. Project page: https://anonymous.4open.science/w/Phys4D/.

2605.25051 2026-06-03 cs.RO

A Decentralized LiDAR-SLAM System with Certifiably Optimal Pose Graph Optimization

Baoshan Song, Feng Huang, Li-Ta Hsu

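Affiliations: The Hong Kong Polytechnic University

AI summary: For global consistency in decentralized multi-robot LiDAR-SLAM, presents the first system to integrate a certifiably optimal pose graph optimization backend; it uses the Riemannian Block Coordinate Descent algorithm for globally consistent trajectory estimation without accurate initial guesses and lowers trajectory RMSE by up to 48.9% relative to DiSCo-SLAM.

Comments: In Proceedings of the IEEE International Conference on Robotics & Automation (ICRA'26) 1st Workshop on Robot Meets GNSS and Ranging for Seamless Autonomy, Vienna, Austria, Jun. 5, 2026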

Abstract

Decentralized multi-robot LiDAR-SLAM is essential for collaborative missions but faces significant challenges in maintaining global consistency. Existing frameworks predominantly rely on local-search optimization or one-time coordinate alignment, which are prone to suboptimal convergence and long-term inconsistency, especially in large-scale or degenerate environments. To address these limitations, this paper presents the first decentralized LiDAR-SLAM system that integrates a state-of-the-art certifiably optimal Pose Graph Optimization (PGO) backend. By leveraging the Riemannian Block Coordinate Descent (RBCD) algorithm, our system ensures globally consistent trajectory estimation without requiring accurate initial guesses. Experimental results demonstrate that the proposed framework achieves superior robustness, improving trajectory RMSE by up to 48.9% compared to the state-of-the-art DiSCo-SLAM.

2605.03358 2026-06-03 cs.CV

Tracing Like a Clinician: Anatomy-Guided Spatial Priors for Cephalometric Landmark Detection

Sidhartha Mohapatra, Pallavi Mohanty

Affiliations: Founder & CTO, CephTrace; Clinical Advisor, CephTrace

AI summary: Proposes a five-phase anatomy-guided pipeline that generates confidence-weighted spatial priors for training HRNet-W32, reaching a mean radial error of 1.04 mm on 25 landmarks across 1,502 radiographs, with ablations and clinical validation supporting its effectiveness.

Comments: v3: 21 pages, 15 tables, 12 figures + supplementary materials (8 tables, 3 figures). v4: quantified Grad-CAM analysis (Table 13), corrected clinical measurements (Table 6: bias, MAE, ICC; vertical kappa 1.00->0.78), reviewer wording fixes. Code and weights: https://github.com/sidwiz/cephtrace-research, https://huggingface.co/CephTrace/cephtrace-v4

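AI abstract (translated from Chinese)

Clinicians trace cephalometric radiographs by following a structured anatomical workflow, yet no prior system has explicitly encoded this into computation. We present a five-phase anatomy-guided pipeline that produces confidence-weighted spatial priors used to shape HRNet-W32 training. The system achieves a mean radial error of 1.04 mm on 25 landmarks across 1,502 radiographs from 7+ imaging devices, on par with HYATT-Net (1.05 mm on CEPHA29) through explicit anatomical priors rather than learned attention. A three-way ablation isolates the mechanism: anatomical priors maintain a 1% validation-to-test gap, whereas removing the priors yields an 88% gap (1.94 mm), despite identical validation convergence. A training × inference prior matrix confirms that (1) all models are inference-independent, (2) the 28-channel architecture alone provides no benefit, (3) random priors give partial and unstable improvement (1.72 mm), and (4) only anatomically correct, image-specific priors produce 1.04 mm, acting as a training-time regularizer. No prior generation is needed at deployment. Five-fold cross-validation (p=0.0015), patient-level permutation testing (p<0.0001, n=151), reproduced baselines, Grad-CAM analysis, and clinical validation (100% skeletal classification for 151 patients including 72 borderline cases, kappa=1.00) provide converging evidence. Cross-domain experiments support the hypothesis that prior effectiveness depends on landmark spatial entropy, prospectively confirmed in four domains. Supplementary materials are included.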

Abstract

Clinicians trace cephalometric radiographs following a structured anatomical workflow, yet no prior system encodes this into computation. We present a five-phase anatomy-guided pipeline producing confidence-weighted spatial priors that shape HRNet-W32 training, achieving 1.04 mm mean radial error on 25 landmarks across 1,502 radiographs from 7+ imaging devices. A training × inference prior matrix isolates the mechanism: anatomical priors maintain a 1% validation-to-test gap versus 88% without priors (1.94 mm), despite identical validation convergence. The matrix establishes that all trained models are inference-independent, the expanded architecture alone provides no benefit, random priors yield partial but unstable improvement (1.72 mm), and only image-specific anatomically correct priors produce the 1.04 mm result, functioning as a training-time regularizer requiring no automated prior generation at deployment. Five-fold cross-validation (p=0.0015), patient-level permutation testing (p<0.0001, n=151), quantified Grad-CAM analysis (88% vs. 74% in-zone activation, p<0.001), and clinical measurement validation (skeletal classification kappa=0.79-0.84, zero Class II<->III reversals, ICC>0.95) provide converging evidence. Cross-domain experiments on echocardiography, cervical spine, and hand radiography support the hypothesis that prior effectiveness scales with the spatial entropy of the landmark distribution.
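
For readers unfamiliar with the headline metric, mean radial error is conventionally the average Euclidean distance between predicted and reference landmark positions, converted to millimetres. The sketch below assumes a single isotropic pixel spacing; the coordinates and the 0.1 mm/pixel spacing are made up for illustration, and the paper's own evaluation code may differ.

```python
import math

def mean_radial_error(predicted, reference, mm_per_pixel):
    """Average Euclidean distance between paired landmarks, in millimetres."""
    distances = [
        math.hypot(px - rx, py - ry) * mm_per_pixel
        for (px, py), (rx, ry) in zip(predicted, reference)
    ]
    return sum(distances) / len(distances)

# Two hypothetical landmarks in pixel coordinates: one off by (3, 4) pixels, one exact.
predicted = [(103.0, 204.0), (50.0, 60.0)]
reference = [(100.0, 200.0), (50.0, 60.0)]
mre = mean_radial_error(predicted, reference, mm_per_pixel=0.1)
print(round(mre, 3))
```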

2603.29123 2026-06-03 cs.CL

Learning Concepts, Not Tokens: Self-Supervised Semantic Alignment for Language Models

Christine Zhang, Dan Jurafsky, Chen Shani

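Affiliations: Stanford University

AI summary: Proposes a self-supervised framework that predicts sets of semantically equivalent tokens (concepts) instead of a single next token, improving language models' semantic alignment and yielding better performance on classification, clustering, reranking, and downstream reasoning.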

Abstract

The next-token prediction (NTP) objective trains language models to predict a single token at each step, even though many continuations can express the same meaning. For example, in the sentence "this sticker can be placed here", positioned, attached, or put are all plausible alternatives. While standard NTP training treats these alternatives as mutually exclusive targets, we explore a self-supervised framework that encourages models to predict concepts, approximated as sets of semantically equivalent tokens. Models trained with this concept supervision align better with human similarity judgments, improve classification, clustering, and reranking performance, and achieve comparable or stronger downstream reasoning. These gains come with lower perplexity on semantically meaningful words (Section 3.2) and only minimal increases in global perplexity, suggesting that concepts enhance semantic alignment while preserving language modeling quality. Our code is available at https://anonymous.4open.science/r/learning-concepts-9025 .
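
One plausible way to turn "predict a concept rather than a token" into a loss is to take the negative log of the total probability assigned to the concept's token set. This is an assumption for illustration only; the abstract does not state the actual training objective. The five-word vocabulary and logits below are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ntp_loss(logits, target):
    """Standard next-token loss: only the single target token counts."""
    return -math.log(softmax(logits)[target])

def concept_loss(logits, concept):
    """Assumed concept loss: probability mass on any token in the set counts."""
    probs = softmax(logits)
    return -math.log(sum(probs[i] for i in concept))

# Hypothetical vocabulary for the slot in "this sticker can be ___ here".
vocab = ["placed", "positioned", "attached", "put", "eaten"]
logits = [2.0, 1.8, 1.5, 1.2, -1.0]
single = ntp_loss(logits, target=0)
grouped = concept_loss(logits, concept={0, 1, 2, 3})
print(round(single, 3), round(grouped, 3))
```

Under this form a model is not penalized for spreading probability across synonyms, and the loss reduces to ordinary NTP when the concept contains a single token.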

2512.24008 2026-06-03 cs.AI

SPARK: Search Personalization via Agent-Driven Retrieval and Knowledge-sharing

Gaurab Chhetri, Subasish Das, Tausif Islam Chowdhury

Affiliations: Texas State University

AI summary: Proposes SPARK, a framework that uses collaborating persona-based LLM agents for dynamic personalized search; a coordinator activates the relevant specialist agents, which draw on memory and reasoning modules, and personalization emerges from their distributed behavior.

Comments: This is the author's preprint. Accepted to WEB&GRAPH 2026 (co-located with WSDM 2026), Boise, Idaho, USA, Feb 26, 2026. Final version will appear in WSDM 2026 Companion Proceedings. Conf: https://wsdm-conference.org/2026/ Workshop: https://aiimlab.org/events/WSDM_2026_WEB_and_GRAPH_2026_Workshop_on_Web_and_Graphs_Responsible_Intelligence_and_Social_Media.html

Abstract

Personalized search demands the ability to model users' evolving, multi-dimensional information needs, a challenge for systems constrained by static profiles or monolithic retrieval pipelines. We present SPARK (Search Personalization via Agent-Driven Retrieval and Knowledge-sharing), a framework in which coordinated persona-based large language model (LLM) agents deliver task-specific retrieval and emergent personalization. SPARK formalizes a persona space defined by role, expertise, task context, and domain, and introduces a Persona Coordinator that dynamically interprets incoming queries to activate the most relevant specialized agents. Each agent executes an independent retrieval-augmented generation process, supported by dedicated long- and short-term memory stores and context-aware reasoning modules. Inter-agent collaboration is facilitated through structured communication protocols, including shared memory repositories, iterative debate, and relay-style knowledge transfer. Drawing on principles from cognitive architectures, multi-agent coordination theory, and information retrieval, SPARK models how emergent personalization properties arise from distributed agent behaviors governed by minimal coordination rules. The framework yields testable predictions regarding coordination efficiency, personalization quality, and cognitive load distribution, while incorporating adaptive learning mechanisms for continuous persona refinement. By integrating fine-grained agent specialization with cooperative retrieval, SPARK provides insights for next-generation search systems capable of capturing the complexity, fluidity, and context sensitivity of human information-seeking behavior.

2512.24000 2026-06-03 cs.CL

WISE: Web Information Satire and Fakeness Evaluation

Gaurab Chhetri, Subasish Das, Tausif Islam Chowdhury

Affiliations: Texas State University San Marcos

AI summary: For the difficult task of separating fake news from satire, proposes the WISE framework, which evaluates lightweight transformer models on Fakeddit data; MiniLM reaches the highest accuracy and RoBERTa-base the highest ROC-AUC, showing that lightweight models are effective in resource-constrained settings.

Comments: This is the author's preprint. Accepted to WEB&GRAPH 2026 (co-located with WSDM 2026), Boise, Idaho, USA, Feb 26, 2026. Final version will appear in WSDM 2026 Companion Proceedings. Conf: https://wsdm-conference.org/2026/ Workshop: https://aiimlab.org/events/WSDM_2026_WEB_and_GRAPH_2026_Workshop_on_Web_and_Graphs_Responsible_Intelligence_and_Social_Media.html

Abstract

Distinguishing fake or untrue news from satire or humor poses a unique challenge due to their overlapping linguistic features and divergent intent. This study develops the WISE (Web Information Satire and Fakeness Evaluation) framework, which benchmarks eight lightweight transformer models alongside two baseline models on a balanced dataset of 20,000 samples from Fakeddit, annotated as either fake news or satire. Using stratified 5-fold cross-validation, we evaluate models across comprehensive metrics including accuracy, precision, recall, F1-score, ROC-AUC, PR-AUC, MCC, Brier score, and Expected Calibration Error. Our evaluation reveals that MiniLM, a lightweight model, achieves the highest accuracy (87.58%) among all models, while RoBERTa-base achieves the highest ROC-AUC (95.42%) and strong accuracy (87.36%). DistilBERT offers an excellent efficiency-accuracy trade-off with 86.28% accuracy and 93.90% ROC-AUC. Statistical tests confirm significant performance differences between models, with paired t-tests and McNemar tests providing rigorous comparisons. Our findings highlight that lightweight models can match or exceed baseline performance, offering actionable insights for deploying misinformation detection systems in real-world, resource-constrained settings.
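
Two of the listed calibration metrics are less familiar than accuracy or F1, so here is a small sketch of how they are commonly computed for a binary classifier. It assumes the usual definitions (Brier score as mean squared error of the predicted probability; Expected Calibration Error with equal-width confidence bins) and a bin count of 10; the paper's exact binning is not stated in the abstract, and the probabilities below are invented.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted P(class 1) and the 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between accuracy and confidence over equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        confidence = max(p, 1.0 - p)              # confidence in the predicted class
        correct = int((p >= 0.5) == (y == 1))     # 1 if the thresholded prediction is right
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, correct))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(k for _, k in members) / len(members)
            ece += len(members) / total * abs(accuracy - avg_conf)
    return ece

# Hypothetical predictions: 80% confident and right 4 times out of 5 (well calibrated).
calibrated = expected_calibration_error([0.8] * 5, [1, 1, 1, 1, 0])
# Hypothetical predictions: 90% confident but right only half the time (overconfident).
overconfident = expected_calibration_error([0.9, 0.9], [1, 0])
brier = brier_score([0.8] * 5, [1, 1, 1, 1, 0])
print(round(calibrated, 3), round(overconfident, 3), round(brier, 3))
```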

2605.24253 2026-06-03 cs.CV cs.AI cs.IR

CRISP -- Clustering-Based Redundancy-Reduced Instance Sampling for Pathology Case Representation and Retrieval

Zahra Rahimi Afzal, Wataru Uegami, Saghir Alfasly, Wenchao Han, Saba Yasir, Judy C. Boughey, Matthew P. Goetz, Krishna R. Kalari, H. R. Tizhoosh

Affiliations: Kimia Lab, Department of Artificial Intelligence & Informatics, Mayo Clinic, Rochester, MN, USA; DICE Lab, Department of Electrical and Computer Engineering, University of Illinois Chicago, IL, USA; Division of Computational Pathology and Informatics, Mayo Clinic, Rochester, MN, USA; Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, USA; Department of Breast and Melanoma Surgical Oncology, Comprehensive Cancer Center, Mayo Clinic, Rochester, MN, USA; Department of Oncology, Comprehensive Cancer Center, Mayo Clinic, Rochester, MN, USA

AI summary: Proposes CRISP, an unsupervised framework that integrates the multiple whole-slide images within a case through redundancy-reduced, clustering-based sampling, building a compact, representative patch set for case-level retrieval that matches or surpasses current standard practice on breast cancer datasets.

Abstract

Digital pathology archives increasingly contain multiple whole-slide images (WSIs) per case, capturing spatially distinct tumor regions and reflecting intrinsic morphological heterogeneity. However, most existing approaches rely on a single pathologist-selected slide, thereby discarding potentially informative evidence distributed across the remaining WSIs. To date, no autonomous framework has been proposed for comprehensive multi-WSI case processing. Here, we present an unsupervised framework for case-level analysis that integrates information from all available slides within a case. Rather than relying on a single designated slide, the proposed approach constructs case-level representations by selectively distilling informative patches across WSIs. We introduce Clustering-Based Redundancy-Reduced Instance Sampling for Pathology (CRISP), a two-stage framework that first reduces redundancy within individual WSIs and subsequently applies clustering-based sampling to select a compact yet representative set of patches for the entire case. The resulting patch set captures case-level heterogeneity while avoiding exhaustive processing of gigapixel images, and directly serves as a retrieval index. Using two Mayo Clinic breast cancer datasets for diagnosis and treatment planning, we demonstrate that CRISP consistently matches or surpasses the current standard practice of combined model and pathologist slide selection for patient/case search and retrieval. By automating case-level processing and eliminating subjective WSI selection, CRISP potentially enables the exploitation of clinically relevant information distributed across multiple WSIs that is currently overlooked.

2605.23995 2026-06-03 cs.CV cs.AI

Task-Aligned Self-Supervised Learning for Medical Image Analysis: A Systematic Review and Practical Design Guidelines

Chathura Wimalasiri, Kishor Nandakishor, Marimuthu Palaniswami

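Affiliations: Department of Electrical and Electronic Engineering, University of Melbourne

AI summary: Systematically reviews four paradigms of self-supervised learning (SSL) in medical imaging (contrastive, non-contrastive and predictive, generative and reconstruction-based, and hybrid), analyzes how alignment between pretext and downstream tasks affects performance, and proposes practical design guidelines.

Comments: This manuscript is 31 pages with 4 tables and 3 figures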

Abstract

Self-supervised learning (SSL) has emerged as a promising paradigm for addressing the annotation bottleneck in medical imaging by learning representations from unlabeled data. However, its effectiveness depends heavily on the design of the pretext task and its alignment with the downstream clinical objectives. We present a systematic, task-oriented review of SSL in medical imaging, examining how different pretext-task formulations influence performance across classification, segmentation, detection, and other tasks. Following PRISMA guidelines, we analyze 75 studies published between 2017 and 2025 and organize them into four paradigms: contrastive, non-contrastive and predictive, generative and reconstruction-based, and hybrid learning. Rather than cataloguing methods by architecture, we map each paradigm to the downstream objectives it best supports. Our analysis shows there is no universally optimal SSL strategy; instead, performance is governed by the alignment between the pretext task, the imaging modality, and the target task. Contrastive methods learn global discriminative features and align well with classification, but may overlook subtle pathological patterns. Generative and spatial prediction-based approaches better preserve local anatomical structure, making them more suitable for segmentation and other dense prediction tasks, while hybrid methods offer the most balanced performance. We further show that modality-specific design is critical and that SSL provides its greatest benefit in low-label and few-shot regimes. Finally, we distill these findings into practical design guidelines and outline open challenges, including pathology-aware pretext task design, resource-efficient training for high-dimensional data, and standardized evaluation protocols. This work offers practical guidance for designing more effective and clinically relevant SSL frameworks in medical imaging.

2605.23055 2026-06-03 cs.LG cs.AI cs.CL

Decomposing and Measuring Evaluation Awareness

Changling Li, Terry Jingchen Zhang, Jie Zhang, Zhijing Jin, Sahar Abdelnabi, Maksym Andriushchenko

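Affiliations: ETH Zürich; ELLIS Institute Tübingen; Max Planck Institute for Intelligent Systems; Tübingen AI Center; University of Toronto & Vector Institute; EuroSafeAI

AI summary: Drawing on social psychology, decomposes evaluation awareness into an environment component and a model component; using the EvalAwareBench benchmark, it finds that recognition rates depend on the pairing of model and benchmark, that recognition rarely changes behavior, and that safety evaluations are more affected than capability evaluations.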

Abstract

Frontier language models sometimes recognize that they are being evaluated and adjust their behavior, undermining the validity of benchmark results. Yet the field studies this phenomenon without a shared foundation, conflating properties of the evaluation with properties of the model, and detection with behavioral response. We ground evaluation awareness in social psychology, decomposing it into an environment component (how recognizable the task is) and a model component that separates recognition from propensity to act on it. We operationalize the environment component through eight categorized trigger factors, such as placeholder entities and grading-style output formats, and study recognition and behavior through chain-of-thought monitoring. Across nine frontier models and four benchmarks, recognition rates depend on the specific pairing of model and benchmark rather than on either in isolation. Recognition rarely leads to behavioral change, and when it does, the direction depends on the type of evaluation perceived. Models are also more sensitive to safety than capability evaluations, placing safety benchmark validity at greater risk. To study which factors each model is sensitive to and how they interact, we propose EvalAwareBench, a factor-controlled benchmark of 100 paired safety-capability tasks where each of the eight factors can be independently toggled, varying evaluative signals while holding the underlying request fixed. Through EvalAwareBench, we find that no single factor uniformly affects all models, but stacking factors progressively raises evaluation awareness across all of them. Our framework and EvalAwareBench provide the tools to measure, attribute, and mitigate evaluation awareness, pointing to behavioral consistency under recognition as a promising path forward.

2605.20402 2026-06-03 cs.LG cs.AI

Decomposing MXFP4 quantization error for LLM reinforcement learning: reducible bias, recoverable deadzone, and an irreducible floor

分解 MXFP4 量化误差以用于大语言模型强化学习:可约简的偏差、可恢复的死区以及不可约简的底噪

Xiaocan Li, Shiliang Wu, Zheng Shen

发表机构 * Huawei Canada(华为加拿大)

AI总结 本文通过将 MXFP4 量化误差分解为三个可加分量(尺度偏差、死区截断和网格噪声),并针对每个分量提出针对性修正(宏块缩放、异常值回退和自适应量化噪声),从而在 LLM 强化学习后训练中恢复精度。

详情
AI中文摘要

MXFP4 算术可以显著加速大语言模型(LLM)强化学习(RL)后训练,但量化误差会导致严重的精度下降。现有工作将量化误差视为单一噪声项,忽略了量化误差损害训练的不同机制。我们证明了量化误差的精确三向分解,并展示了每个分量如何主导不同的 RL 训练路径。我们的理论和实证分析将 MXFP4 量化误差分解为三个可加分量:来自 2 的幂次舍入的“尺度偏差”、来自小值归零的“死区截断”以及来自舍入到最近 4 位网格的“网格噪声”。每个分量主导不同的 RL 失效模式:尺度偏差通过反向传播乘法累积,影响梯度精度;死区截断降低 rollout 质量;网格噪声提高策略的熵。我们结合了针对 RL 失效模式但不限于特定分量的修正:宏块缩放以减少尺度偏差,异常值回退恢复死区条目,同时也部分减少尺度偏差引起的误差,以及自适应量化噪声(AQN)用于控制策略熵。在 Qwen2.5-3B 密集模型和 Qwen3-30B-A3B-Base 混合专家模型上,针对性修正分别将 BF16 精度恢复到 0.7% 以内,并超过 BF16 达 +1.0%。

英文摘要

MXFP4 arithmetic can dramatically accelerate reinforcement learning (RL) post-training of large language models (LLMs), yet the quantization error introduces severe accuracy degradation. Existing work treats the quantization error as a monolithic noise term, missing the distinct mechanisms by which quantization error damages training. We prove an exact three-way decomposition of quantization error and show how each component dominates a distinct RL training pathway. Our theoretical and empirical analysis decomposes the MXFP4 quantization error into three additive components: "scale bias" from power-of-two rounding, "deadzone truncation" from zeroing small values, and "grid noise" from rounding to the nearest 4-bit grid. Each component dominates a distinct RL failure mode: scale bias accumulates multiplicatively through the backward pass, affecting gradient accuracy; deadzone truncation degrades rollout quality; and grid noise raises the policy's entropy. We combine corrections that are RL failure mode-targeted but not component-exclusive: Macro-block scaling to reduce scale bias; Outlier Fallback to recover deadzone entries, which also partially reduces scale-bias-induced error; and Adaptive Quantization Noise (AQN) to control the policy entropy. On the Qwen2.5-3B dense model and the Qwen3-30B-A3B-Base mixture-of-experts model, the targeted corrections recover BF16 accuracy to within 0.7% and exceed BF16 by +1.0%, respectively.
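To make the idea of an additive three-way split concrete, here is a minimal NumPy sketch. It is not the paper's decomposition: it is an illustrative telescoping split that assumes a single block, the E2M1 magnitude grid {0, 0.5, 1, 1.5, 2, 3, 4, 6}, and a floor-based power-of-two shared scale. The function names and the choice of an "ideal" real-valued scale as the reference point are assumptions made for this example.

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 element (assumed grid).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def round_to_grid(v):
    """Round magnitudes to the nearest grid point, keep sign, saturate at 6."""
    mag = np.minimum(np.abs(v), FP4_GRID[-1])
    idx = np.abs(mag[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return np.sign(v) * FP4_GRID[idx]


def quantize(x, scale):
    """Scale, round to the FP4 grid, and scale back."""
    return round_to_grid(x / scale) * scale


def decompose(x):
    """Split x - Q(x) into three additive parts (illustrative, nonzero block)."""
    amax = np.abs(x).max()
    ideal_scale = amax / FP4_GRID[-1]                  # real-valued reference scale
    pow2_scale = 2.0 ** (np.floor(np.log2(amax)) - 2)  # assumed power-of-two rule
    q_ideal = quantize(x, ideal_scale)
    q_pow2 = quantize(x, pow2_scale)

    total = x - q_pow2
    scale_bias = q_ideal - q_pow2          # effect of restricting the scale
    residual = x - q_ideal                 # error left under the ideal scale
    dead = (q_ideal == 0) & (x != 0)       # entries flushed to zero
    deadzone = np.where(dead, residual, 0.0)
    grid_noise = np.where(dead, 0.0, residual)
    return total, scale_bias, deadzone, grid_noise


if __name__ == "__main__":
    block = np.random.default_rng(0).normal(size=32)
    total, sb, dz, gn = decompose(block)
    print(np.allclose(total, sb + dz + gn))
```

Because the split telescopes, the three parts sum to the total error by construction; which part dominates for a given tensor depends on its distribution.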

2605.22018 2026-06-03 cs.CV cs.AI cs.RO

FRED: A Multi-Modal Autonomous Driving Dataset for Flooded Road Environments

FRED:面向洪水道路环境的多模态自动驾驶数据集

Connor Malone, Sebastien Demmel, Sebastien Glaser

发表机构 * Queensland University of Technology(昆士兰理工大学) ARC Training Centre for Automated Vehicles in Rural and Remote Regions (AVR3)(农村和偏远地区自动化车辆培训中心(AVR3))

AI总结 提出首个针对道路水险场景的多模态自动驾驶数据集FRED,包含相机、LiDAR和IMU数据,并提供语义标签以支持水险检测方法训练与评估。

详情
AI中文摘要

洪水道路环境数据集(FRED)是,据我们所知,首个专门针对道路水险场景数据收集的多模态自动驾驶数据集。该数据集包含来自2.3 MP FLIR Blackfly USB3相机的图像、来自Ouster OS1-64 LiDAR的64线360度点云,以及由Geoflex RTK GNSS校正的iXblue ATLANS-C IMU数据,数据采集自五个不同地点,涵盖洪水期间和洪水之后。数据以两种格式发布:KITTI风格格式,便于与现有数据工具集成;以及RTMaps格式,用于直接回放车辆的数据捕获。我们提供语义标签,以支持用于水险检测的单传感器和传感器融合方法的训练与评估。提供位置和速度数据,以及干燥条件下捕获的数据,以支持可能包含地图的基于位置的检测方法开发,并评估其他任务,如定位和SLAM。

英文摘要

The Flooded Road Environments Dataset (FRED) is, to our knowledge, the first multi-modal autonomous driving dataset specifically targeting the collection of data from scenarios involving water hazards on the road. The dataset contains images from a 2.3 MP FLIR Blackfly USB3 camera, 64-beam 360 degree point clouds from an Ouster OS1-64 LiDAR, and data from an iXblue ATLANS-C IMU corrected by a Geoflex RTK GNSS, from five separate locations captured both during and after flooding events. The data has been released in two formats: a KITTI-style format for easy integration with existing data tools, and the RTMaps format for direct replay of the vehicle's data capture. We provide semantic labels to enable the training and evaluation of both single-sensor and sensor-fusion methods for water hazard detection. Position and velocity, as well as data captured under dry conditions, are provided to enable the development of location-based detection methods that may incorporate maps, and to evaluate other tasks such as localisation and SLAM.

2605.20731 2026-06-03 cs.CV cs.AI stat.AP

TASTE: A Designer-Annotated Multi-Dimensional Preference Dataset for AI-Generated Graphic Design

TASTE:一个由设计师标注的AI生成图形设计多维偏好数据集

Haonan Zhu, Elad Hirsch, Alexandria Minetti, Allison Nulty, Purvanshi Mehta

发表机构 * Lica World(Lica世界) Contra.Work Inc.(Contra.Work公司)

AI总结 针对现有偏好数据集仅提供单一整体评价的不足,本文构建了TASTE多维偏好数据集,由两组专业设计师对四个文本到图像模型的输出按九项标准排序,并提出了无准则信号验证框架和偏好模型基准测试。

详情
AI中文摘要

文本到图像模型现在能够以生产规模生成图形设计,但其监督仍然主要来自照片风格的偏好数据集,每次比较只有一个整体判断。设计师沿着几个不同的轴(例如,排版、布局、色彩和谐)评估设计,而单个偏好标签会将这些轴合并。我们发布了TASTE(排版、美学、空间、色调等),这是一个多维偏好数据集,其中两个不相交的五名专业设计师队列分别对来自四个当前文本到图像模型的输出按九项标准进行排序,并附带每张图像的幻觉标记。我们将该数据集与两个贡献配对。首先,一个基于Kendall的$τ$、多数投票概率和Condorcet循环的无准则信号验证框架,针对精确的iid均匀零假设;分析揭示了显著但中等程度的设计师一致性,每个TASTE标准都拒绝了随机评分者的零假设。其次,我们在TASTE上对偏好模型进行基准测试,发现现成的VLM评判器和专用的T2I评分器未能达到与设计师小组的多数一致,而直接在TASTE上训练的小型MLP头显著缩小了与单个评分者上限的差距,为未来基于TASTE训练的偏好模型设定了基线。

英文摘要

Text-to-image models now generate graphic design at production scale, yet their supervision still comes primarily from photo-style preference datasets with a single overall verdict per comparison. Designers evaluate designs along several distinct axes (e.g., typography, layout, color harmony) that a single preference label collapses. We release TASTE (Typography, Aesthetics, Spatial, Tone, Etc.), a multi-dimensional preference dataset in which two disjoint cohorts of five professional designers each ranked outputs from four current text-to-image models across nine criteria along with per-image hallucination flags. We pair the dataset with two contributions. First, a criterion-agnostic signal-validation framework based on Kendall's $τ$, majority-vote probability, and Condorcet cycles against exact iid-uniform nulls; the analysis reveals significant but moderate designer agreement, with every TASTE criterion rejecting the random-rater null. Second, we benchmark preference models on TASTE and find that off-the-shelf VLM judges and dedicated T2I scorers fail to reach majority agreement with the designer panel, while a small MLP head trained directly on TASTE substantially narrows the gap to the single-rater ceiling, setting a baseline for future TASTE-trained preference models.
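Two of the agreement statistics named in this abstract, Kendall's τ and Condorcet cycles, can be sketched in a few lines of plain Python. The encoding is an assumption made for this example: each ranking is a list where entry `i` is the position (0 = best) that a rater gave to item `i`, with no ties and an odd number of raters. The function names are hypothetical, and the paper's exact null-hypothesis tests are not reproduced.

```python
from itertools import combinations, permutations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two tie-free rankings of the same items."""
    n = len(rank_a)
    score = 0
    for i, j in combinations(range(n), 2):
        concordant = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j]) > 0
        score += 1 if concordant else -1
    return score / (n * (n - 1) / 2)


def majority_prefers(rankings, i, j):
    """True if a strict majority of raters place item i ahead of item j."""
    votes = sum(r[i] < r[j] for r in rankings)
    return votes > len(rankings) / 2


def has_condorcet_cycle(rankings):
    """Detect a majority cycle; with an odd panel and no ties, any cycle implies a 3-cycle."""
    n = len(rankings[0])
    for a, b, c in permutations(range(n), 3):
        if (majority_prefers(rankings, a, b)
                and majority_prefers(rankings, b, c)
                and majority_prefers(rankings, c, a)):
            return True
    return False


if __name__ == "__main__":
    # Classic paradox: A>B>C, B>C>A, C>A>B, encoded as positions of A, B, C.
    paradox = [[0, 1, 2], [2, 0, 1], [1, 2, 0]]
    print(has_condorcet_cycle(paradox))
    print(kendall_tau([0, 1, 2, 3], [3, 2, 1, 0]))
```

Checking only triples is sufficient here because, under the stated assumptions, the majority relation is a tournament, and a tournament contains a cycle exactly when it contains a 3-cycle.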

2604.27147 2026-06-03 cs.LG cs.AI

How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance

如何引导你的流:通过流图奖励引导实现少步对齐

Jerry Y. Huang, Justin Lin, Sheel Shah, Kartik Nair, Nicholas M. Boffi

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出流图奖励引导(FMRG),一种无训练、单轨迹的框架,利用流图在仅需3次NFE下实现奖励引导生成,速度比现有方法快一个数量级。

详情
AI中文摘要

在生成建模中,我们通常希望生成能够最大化用户指定奖励(如美学质量或与人类偏好对齐)的样本,这一问题被称为引导。尽管现有引导方法被广泛使用,但它们要么需要昂贵的多粒子、多步方案,要么依赖于理解不足的近似。我们将引导重新表述为一个确定性最优控制问题,产生了一个算法层次结构,在最粗略的层次上包含了现有方法。我们表明,流图(因其在快速推理中的作用而近期受到广泛关注的对象)在最优解中自然出现。基于这一观察,我们提出流图奖励引导(FMRG):一种无训练、单轨迹框架,利用流图来积分和引导流。在文生图规模上,FMRG在逆问题和奖励引导生成中匹配或超越基线,且仅需3次NFE,相比先前最先进方法至少实现一个数量级的加速。代码可在 https://github.com/jrrhuang/fmrg 获取。

英文摘要

In generative modeling, we often wish to produce samples that maximize a user-specified reward such as aesthetic quality or alignment with human preferences, a problem known as guidance. Despite their widespread use, existing guidance methods either require expensive multi-particle, many-step schemes or rely on poorly understood approximations. We reformulate guidance as a deterministic optimal control problem, yielding a hierarchy of algorithms that subsumes existing approaches at the coarsest level. We show that the flow map, an object of significant recent interest for its role in fast inference, arises naturally in the optimal solution. Based on this observation, we propose Flow Map Reward Guidance (FMRG): a training-free, single-trajectory framework that uses the flow map to both integrate and guide the flow. At text-to-image scale, FMRG matches or surpasses baselines across inverse problems and reward-guided generation with as few as 3 NFEs, giving at least an order-of-magnitude speedup in comparison to prior state of the art. Code is available at https://github.com/jrrhuang/fmrg.

2605.20306 2026-06-03 cs.CV cs.LG

WildRoadBench: A Wild Aerial Road-Damage Grounding Benchmark for Vision-Language Models and Autonomous Agents

WildRoadBench: 面向视觉语言模型与自主智能体的野外航拍道路损伤定位基准

Bingnan Liu, Chenhang Cui, Rui Huang, Jiani Luo, Zhirong Shen, Tinghao Wang, Xiande Huang, Lingbei Meng, Fei Shen, An Zhang

发表机构 * University of Electronic Science and Technology of China(电子科技大学) National University of Singapore(新加坡国立大学) De Artificial Intelligence Lab(德人工智能实验室) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) University of Science and Technology of China(中国科学技术大学)

AI总结 提出WildRoadBench基准,通过VLM直接定位和LLM驱动智能体自主研究两种协议,评估模型在航拍道路损伤定位上的性能,发现现有方法在野外场景下仍不可靠。

Comments Preprint. Under review. 4 figures, 6 tables

详情
AI中文摘要

我们介绍了WildRoadBench,一个野外航拍道路损伤定位基准,它在一个专业标注的无人机语料库上,将视觉语言模型的直接视觉定位与LLM驱动的智能体的自主研究与工程相结合。在两种协议下评估相同的图像集和相同的每类AP_50指标。VLM轨道衡量固定VLM是否能在统一的提示、解码和解析流程下,从一张图像和一个简短提示中定位特定领域的损伤。智能体轨道衡量一个自主智能体,在仅给定书面任务简介、少量探索切片和固定交互预算的情况下,能否搜索公共网络、调整预训练组件、编写训练和推理代码,并通过隐藏保留集上的标量反馈预言机提交预测。我们对广泛的闭源前沿模型和开源VLM以及几个前沿LLM驱动的智能体进行了基准测试。在野外环境中,两种途径都远未达到可靠性能:闭源前沿模型在VLM排行榜上领先,但仍留下超过一半的指标未达到;开源定位器远低于它们,且新一代或推理型变体并未持续改进定位;每个开源模型的小目标均崩溃;尽管智能体拥有更丰富的功能,但仍落后于最强的VLM,且有几个未能在预算内提交有效结果。我们在https://anonymous.4open.science/r/wildroadbench-0607发布代码和数据,以支持可重复的后续研究。

英文摘要

We introduce WildRoadBench, a wild aerial road-damage grounding benchmark that couples direct visual grounding by vision-language models with autonomous research-and-engineering by LLM-driven agents on a single professionally annotated UAV corpus. The same image set and the same per-class AP_50 metric are evaluated under two protocols. The VLM Track measures whether a fixed VLM can localise domain-specific damage from one image and one short prompt under a unified prompting, decoding and parsing pipeline. The Agent Track measures whether an autonomous agent, given only a written task brief, a small exploratory slice and a fixed interaction budget, can search the public web, adapt pretrained components, write training and inference code, and submit predictions through a scalar-feedback oracle on a hidden holdout. We benchmark a broad pool of closed-source frontier models and open-source VLMs together with several frontier LLM-driven agents. Both routes remain far from reliable performance in this wild setting: closed-source frontier models lead the VLM leaderboard but still leave more than half of the metric on the table; open-source grounders plateau well below them, and newer generations or reasoning-style variants do not consistently improve grounding; small targets collapse for every open-source model; agents lag the strongest VLM despite richer affordances, and several fail to land a valid submission within the budget. We release the code and data at https://anonymous.4open.science/r/wildroadbench-0607 to support reproducible follow-up research.

2605.20183 2026-06-03 cs.CV

MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation

MSAVBench:面向多镜头音频-视频生成的全面可靠评估

Yujie Wei, Yujin Han, Zhekai Chen, Yongming Li, Kaixun Jiang, Zhihang Liu, Quanhao Li, Zhiwu Qing, Xiang Wang, Zhen Xing, Ruihang Chu, Lingyi Hong, Yefei He, Junjie Zhou, Junqiu Yu, Yang Shi, Difan Zou, Kai Zhu, Shiwei Zhang, Yingya Zhang, Yu Liu, Xihui Liu, Hongming Shan

发表机构 * Fudan University(复旦大学) The University of Hong Kong(香港大学) Tongyi Lab, Alibaba Group(阿里集团通义实验室) Zhejiang University(浙江大学) Peking University(北京大学)

AI总结 提出首个多镜头音频-视频生成基准MSAVBench,通过自适应混合评估框架在四个维度上系统评估19个模型,发现当前系统在导演级控制和细粒度音视频同步上仍存在挑战。

详情
AI中文摘要

视频生成正从单镜头合成快速演变为复杂的多镜头音频-视频(MSAV)叙事以满足现实需求。然而,评估此类前沿模型仍是一个基本挑战。现有基准在范围和数据多样性上有限,并依赖僵化的评估流程,阻碍了对现代MSAV模型的系统可靠评估。为弥补这些差距,我们引入MSAVBench,这是首个针对多镜头音频-视频生成的综合基准和自适应混合评估框架。我们的基准涵盖四个关键维度:视频、音频、镜头和参考,覆盖多样化的任务设置、多达15个镜头的可变数量以及具有挑战性的非真实场景。我们的评估框架通过镜头分割的自适应自校正机制、主观指标的实例化评分规则以及复杂判断的基于工具的证据提取,提高了鲁棒性。此外,MSAVBench与人类判断高度一致,达到91.5%的斯皮尔曼等级相关系数。我们对19个最先进的闭源和开源模型的系统评估表明,当前系统在导演级控制和细粒度音视频同步上仍存在困难,而模块化或代理式生成管道为缩小开源与闭源模型之间的差距提供了一条有希望的路径。基准数据和评估代码已在https://github.com/ali-vilab/MSAVBench公开。

英文摘要

Video generation is rapidly evolving from single-shot synthesis to complex multi-shot audio-video (MSAV) narratives to meet real-world demands. However, evaluating such frontier models remains a fundamental challenge. Existing benchmarks are limited in scope and data diversity, and rely on rigid evaluation pipelines, preventing systematic and reliable assessment of modern MSAV models. To bridge these gaps, we introduce MSAVBench, the first comprehensive benchmark and adaptive hybrid evaluation framework for multi-shot audio-video generation. Our benchmark spans four key dimensions, video, audio, shot, and reference, covering diverse task settings, varying shot counts of up to 15, and challenging non-realistic scenarios. Our evaluation framework improves robustness through an adaptive self-correction mechanism for shot segmentation, instance-wise rubrics for subjective metrics, and tool-grounded evidence extraction for complex judgments. Furthermore, MSAVBench achieves high alignment with human judgments, reaching a Spearman rank correlation of 91.5%. Our systematic evaluation of 19 state-of-the-art closed- and open-source models shows that current systems still struggle with director-level control and fine-grained audio-visual synchronization, while modular or agentic generation pipelines offer a promising path toward narrowing the gap between open- and closed-source models. The benchmark data and evaluation code are publicly available at https://github.com/ali-vilab/MSAVBench.

2605.19805 2026-06-03 cs.LG cs.AI stat.ML

Latent Laplace Diffusion for Irregular Multivariate Time Series

潜在拉普拉斯扩散用于不规则多元时间序列

Zinuo You, Jin Zheng, John Cartlidge

发表机构 * University of Cambridge(剑桥大学)

AI总结 提出潜在拉普拉斯扩散(LLapDiff)生成框架,通过低维潜在轨迹和拉普拉斯域参数化实现不规则时间序列的长时预测与缺失值插补。

Comments Accepted as a Spotlight at ICML 2026. The Version of Record will appear in Proceedings of Machine Learning Research (PMLR). 27 pages, 5 figures. Code: https://github.com/pixelhero98/LLapDiffusion

详情
AI中文摘要

不规则多元时间序列对长期预测提出了权衡:离散方法通过重新网格化可能扭曲时间结构,而连续时间模型通常需要容易漂移的序贯求解器。为弥合这一差距,我们提出了潜在拉普拉斯扩散(LLapDiff),一种生成式框架,将目标建模为低维潜在轨迹,从而无需逐步积分物理时间即可实现全范围生成。我们利用受随机端口-哈密顿动力学启发的稳定模态参数化来引导逆向过程,并通过可学习的共轭复极点参数化其在拉普拉斯域中的均值演化,从而能够在不规则时间戳上直接评估。我们还通过更新平均分析将连续动力学与不规则观测联系起来,该分析将采样间隙映射到有效事件域极点,并激发了间隙感知的历史总结器。大量实验表明,LLapDiff在长期预测中优于基线,其连续时间生成性质通过在同一模型的历史时间戳上查询,支持缺失值插补。代码可在https://github.com/pixelhero98/LLapDiffusion获取。

英文摘要

Irregular multivariate time series impose a trade-off for long-horizon forecasting: discrete methods can distort temporal structure via re-gridding, while continuous-time models often require sequential solvers prone to drift. To bridge this gap, we present Latent Laplace Diffusion (LLapDiff), a generative framework that models the target as a low-dimensional latent trajectory, enabling horizon-wide generation without step-by-step integration over physical time. We guide the reverse process utilizing a stable modal parameterization motivated by stochastic port-Hamiltonian dynamics, and parameterize its mean evolution in the Laplace domain via learnable complex-conjugate poles, enabling direct evaluation over irregular timestamps. We also link continuous dynamics to irregular observations through renewal-averaging analysis, which maps sampling gaps to effective event-domain poles and motivates a gap-aware history summarizer. Extensive experiments show that LLapDiff improves over baselines in long-horizon forecasting, and its continuous-time generative nature supports missing-value imputation by querying the same model at historical timestamps. Code is available at https://github.com/pixelhero98/LLapDiffusion.
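The claim that complex-conjugate poles allow direct evaluation at irregular timestamps can be illustrated with a small NumPy sketch. It assumes a mean signal of the form sum over k of 2·Re(c_k·exp(s_k·t)) with poles s_k = -decay_k + i·freq_k in the stable left half-plane. The function name, argument names and example values are hypothetical, and this is not the LLapDiff model itself.

```python
import numpy as np


def modal_mean(t, decay, freq, coeff):
    """Evaluate sum_k 2*Re(c_k * exp(s_k * t)) at arbitrary timestamps t.

    Each conjugate pole pair (s_k, conj(s_k)) with residues (c_k, conj(c_k))
    contributes a real damped oscillation, so no time stepping is needed.
    """
    t = np.asarray(t, dtype=float)[:, None]      # shape (T, 1)
    poles = -np.abs(decay) + 1j * freq           # Re(s_k) <= 0 keeps modes stable
    modes = coeff[None, :] * np.exp(poles[None, :] * t)
    return 2.0 * np.real(modes).sum(axis=1)      # shape (T,)


if __name__ == "__main__":
    timestamps = [0.0, 0.13, 0.9, 2.75]          # irregular spacing
    decay = np.array([0.5, 1.0])
    freq = np.array([2.0, 5.0])
    coeff = np.array([1.0 + 0.0j, 0.5 - 0.5j])
    print(modal_mean(timestamps, decay, freq, coeff))
```

Because the expression is closed-form in `t`, querying a past timestamp (for imputation) or a future one (for forecasting) uses the same call.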

2605.18740 2026-06-03 cs.CV cs.AI cs.CL cs.LG

Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

Vision-OPD:通过在线策略自蒸馏学习多模态大语言模型的精细细节

Qianhao Yuan, Jie Lou, Xing Yu, Hongyu Lin, Le Sun, Xianpei Han, Yaojie Lu

发表机构 * Tsinghua University(清华大学)

AI总结 提出Vision-OPD框架,通过在线策略自蒸馏将模型自身的局部区域感知能力迁移到全局图像策略,提升多模态大语言模型对细粒度视觉理解的准确性。

Comments Project page: https://github.com/VisionOPD/Vision-OPD

详情
AI中文摘要

多模态大语言模型(MLLMs)在细粒度视觉理解方面仍然存在困难,答案往往依赖于全图中微小但决定性的证据。我们观察到一种区域到全局的感知差距:当以证据为中心的裁剪图像为条件时,同一MLLM回答细粒度问题的准确率高于以对应全图为条件,这表明许多失败源于难以聚焦于相关证据,而非局部识别能力不足。受此观察启发,我们提出Vision-OPD(视觉在线策略蒸馏),一种区域到全局的自蒸馏框架,将模型自身特权的区域感知迁移到其全图策略。Vision-OPD从同一MLLM实例化两个条件策略:一个以裁剪图像为条件的教师和一个以全图为条件的学生。学生生成在线策略轨迹,Vision-OPD沿这些轨迹最小化教师和学生下一个词元分布之间的词元级差异。这使得模型能够内化视觉放大的好处,而无需外部教师模型、真实标签、奖励验证器或推理时工具使用。在多个细粒度视觉理解基准上的实验表明,Vision-OPD模型在性能上可与更大的开源、闭源以及“思考图像”智能体模型相媲美或更优。

英文摘要

Multimodal Large Language Models (MLLMs) still struggle with fine-grained visual understanding, where answers often depend on small but decisive evidence in the full image. We observe a regional-to-global perception gap: the same MLLM answers fine-grained questions more accurately when conditioned on evidence-centered crops than on the corresponding full images, suggesting that many failures stem from difficulty focusing on relevant evidence rather than insufficient local recognition ability. Motivated by this observation, we propose Vision-OPD (Vision On-Policy Distillation), a regional-to-global self-distillation framework that transfers the model's own privileged regional perception to its full-image policy. Vision-OPD instantiates two conditional policies from the same MLLM: a crop-conditioned teacher and a full-image-conditioned student. The student generates on-policy rollouts, and Vision-OPD minimizes token-level divergence between the teacher and student next-token distributions along these rollouts. This enables the model to internalize the benefit of visual zooming without external teacher models, ground-truth labels, reward verifiers, or inference-time tool use. Experiments on multiple fine-grained visual understanding benchmarks show that Vision-OPD models achieve competitive or superior performance against much larger open-source, closed-source, and "Thinking-with-Images" agentic models. The code is available at https://github.com/VisionOPD/Vision-OPD.
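The token-level divergence between teacher and student next-token distributions can be sketched as follows. The abstract does not say which divergence is used; this example assumes the reverse KL, KL(student || teacher), evaluated at each position of a student-generated rollout. The logits are stand-in arrays of shape (tokens, vocabulary), and all function names are hypothetical.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def token_level_kl(student_logits, teacher_logits):
    """Per-token KL(student || teacher); inputs have shape (T, vocab)."""
    log_p_student = log_softmax(student_logits)
    log_p_teacher = log_softmax(teacher_logits)
    return (np.exp(log_p_student) * (log_p_student - log_p_teacher)).sum(axis=-1)


def distillation_loss(student_logits, teacher_logits):
    """Average the per-token divergence along the rollout."""
    return float(token_level_kl(student_logits, teacher_logits).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    student = rng.normal(size=(5, 11))   # full-image-conditioned logits (stand-in)
    teacher = rng.normal(size=(5, 11))   # crop-conditioned logits (stand-in)
    print(distillation_loss(student, teacher))
```

In a real training loop the teacher logits would be treated as constants, so gradients flow only through the student distribution.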

2405.07006 2026-06-03 cs.CL

Word-specific tonal realizations in Mandarin

普通话中特定词的声调实现

Yu-Ying Chuang, Melanie J. Bell, Yu-Hsiang Tseng, R. Harald Baayen

发表机构 * National Taiwan Normal University(台湾师范大学) Anglia Ruskin University(安格利亚鲁斯金大学) Eberhard Karl University of Tübingen(图宾根艾伯哈德·卡尔大学)

AI总结 本研究通过语料库分析和计算建模,发现普通话双字词的声调实现部分由词义决定,且词义对声调实现的预测力超过已知的词形相关因素。

Journal ref Language 102 (2026) 1-45

详情
AI中文摘要

普通话双字词的音高轮廓通常被认为是由组成单字词的底层声调所决定,并与语速、相邻声调协同发音、音段构成和可预测性等因素施加的发音约束相互作用。本研究表明,声调实现也部分由词义决定。我们首先基于台湾普通话自然对话语料库,使用广义加性回归模型,并聚焦于升降调模式,发现在控制说话者和语境效应后,词类型对声调实现的预测力比所有先前建立的词形相关预测因子之和更强。重要的是,添加关于语境中词义的信息进一步提高了预测准确性。然后,我们通过使用上下文特定的词嵌入进行计算建模,表明在保留数据上,标记特定的音高轮廓以50%的准确率预测词类型,而上下文敏感的标记特定嵌入以40%的准确率预测音高轮廓的形状。这些准确率比随机水平高一个数量级,表明词音高轮廓与其词义之间的关系足够强,可能对语言使用者具有功能性。讨论了这些实证发现的理论意义。

英文摘要

The pitch contours of Mandarin two-character words are generally understood as being shaped by the underlying tones of the constituent single-character words, in interaction with articulatory constraints imposed by factors such as speech rate, co-articulation with adjacent tones, segmental make-up, and predictability. This study shows that tonal realization is also partially determined by words' meanings. We first show, on the basis of a corpus of Taiwan Mandarin spontaneous conversations, using a generalized additive regression model, and focusing on the rise-fall tone pattern, that after controlling for effects of speaker and context, word type is a stronger predictor of tonal realization than all the previously established word-form related predictors combined. Importantly, the addition of information about meaning in context improves prediction accuracy even further. We then proceed to show, using computational modeling with context-specific word embeddings, that token-specific pitch contours predict word type with 50% accuracy on held-out data, and that context-sensitive, token-specific embeddings can predict the shape of pitch contours with 40% accuracy. These accuracies, which are an order of magnitude above chance level, suggest that the relation between words' pitch contours and their meanings is sufficiently strong to be potentially functional for language users. The theoretical implications of these empirical findings are discussed.

2605.19320 2026-06-03 cs.CV cs.DB

TextAlign: Preference Alignment for Text Rendering with Hierarchical Rewards

TextAlign: 基于层次化奖励的文本渲染偏好对齐

Mingxuan Cui, Jingpu Yang, Fengxian Ji, Qian Jiang, Zhecheng Shi, Jiaming Wang, Zirui Song, Fajri Koto, Xiuying Chen

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) Chinese Academy of Sciences Institute of Automation(中国科学院自动化研究所) Northeastern University(东北大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出TextAlign框架,通过层次化视觉语言模型奖励将文本渲染错误分解为全局、单词和字形级别,并转化为标量偏好信号,利用GRPO或DPO进行后训练对齐,在不改变生成器架构下提升文本渲染准确性。

详情
AI中文摘要

忠实的文本渲染仍然是大型文本到图像生成模型的一个持续弱点,因为它需要语义指令遵循和细粒度的字形级结构。先前的方法通常通过特定于架构的模块或编码器修改来提高这种能力,这使跨基础模型的部署复杂化。我们将文本渲染作为后训练偏好对齐问题进行研究,并提出了TextAlign,一种非侵入式框架,保持生成器架构不变。关键组件是一个基于层次化视觉语言模型(VLM)的奖励,它将渲染错误分解为全局、单词和字形级别,然后将二元缺陷判断转换为标量偏好信号。得到的信号支持组相对策略优化(GRPO)和直接偏好优化(DPO)。在FLUX.1-dev和Z-Image-Turbo上的实验表明,基于OCR的文本准确性持续提升,且不降低一般生成质量。与强大的基础和文本渲染基线(包括SD3.5、Qwen-Image、AnyText和TextDiffuser)相比,这些结果表明奖励设计为改进文本渲染提供了一种可扩展的替代模型重新设计的方法。

英文摘要

Faithful text rendering remains a persistent weakness of large text-to-image generative models, as it requires both semantic instruction following and fine-grained glyph-level structure. Prior methods often improve this ability through architecture-specific modules or encoder modifications, which complicate deployment across foundation models. We study text rendering as a post-training preference-alignment problem and propose TextAlign, a non-invasive framework that keeps the generator architecture unchanged. The key component is a hierarchical vision-language model (VLM)-based reward that decomposes rendering errors into global, word, and glyph levels, then converts binary defect judgments into a scalar preference signal. The resulting signal supports both Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO). Experiments on FLUX.1-dev and Z-Image-Turbo show consistent gains in OCR-based text accuracy without degrading general generation quality. Compared with strong foundation and text-rendering baselines, including SD3.5, Qwen-Image, AnyText, and TextDiffuser, these results indicate that reward design offers a scalable alternative to model redesign for improving text rendering.
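The step of turning per-level binary defect judgments into one scalar reward, and then into group-relative advantages, can be sketched as below. The level weights and the aggregation rule are hypothetical, since the abstract does not state how TextAlign combines the levels. The advantage normalisation follows the commonly described GRPO form, (reward − group mean) / group standard deviation.

```python
import statistics

# Hypothetical level weights; the abstract does not specify how levels combine.
LEVEL_WEIGHTS = {"global": 0.5, "word": 0.3, "glyph": 0.2}


def scalar_reward(defects):
    """Map level -> list of binary judgments (1 = defect) to a reward in [0, 1]."""
    penalty = 0.0
    for level, weight in LEVEL_WEIGHTS.items():
        flags = defects.get(level, [])
        if flags:
            penalty += weight * sum(flags) / len(flags)
    return 1.0 - penalty


def group_relative_advantages(rewards):
    """Standardise rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


if __name__ == "__main__":
    group = [
        {"global": [0], "word": [0, 0, 1], "glyph": [0, 1, 0, 0]},
        {"global": [1], "word": [1, 1, 0], "glyph": [1, 1, 0, 1]},
        {"global": [0], "word": [0, 0, 0], "glyph": [0, 0, 0, 0]},
    ]
    rewards = [scalar_reward(d) for d in group]
    print(rewards, group_relative_advantages(rewards))
```

The same scalar rewards could alternatively be used to pick preferred and rejected samples for a pairwise objective such as DPO.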

2605.19262 2026-06-03 cs.LG cs.CR

Backdooring Masked Diffusion Language Models

掩码扩散语言模型的后门攻击

Daniel Yiming Cao, Chengzhong Wang, Sheng-Yen Chou, Chengyu Huang, Pin-Yu Chen, Shengwei An

发表机构 * Cornell University(康奈尔大学) Virginia Tech(弗吉尼亚理工学院) IBM Research(IBM研究院)

AI总结 提出SHADOWMASK后门攻击方法,通过修改掩码扩散语言模型的前向破坏过程,实现高成功率攻击并保持模型清洁性能。

详情
AI中文摘要

掩码扩散语言模型(MDLM)正成为文本生成的一种引人注目的新范式,但其训练时安全性仍基本未被探索。现有的针对高斯扩散模型或自回归语言模型的后门攻击并不直接适用于MDLM,因为MDLM依赖于离散状态破坏和迭代去噪,而非连续噪声或从左到右预测。在这项工作中,我们首次对MDLM的训练时后门攻击进行了系统研究。我们提出了SHADOWMASK,一种通过将标准全掩码终端分布替换为触发-掩码混合先验来修改MDLM前向破坏过程的后门攻击。这创建了一条从触发破坏状态到攻击者指定目标的专用去噪路径,同时保持清洁去噪行为。我们进一步通过定义后门前向过程、推导反向时间后验以及获得连续时间训练目标,提供了原理性的数学表述。在基于DiT的MDLM和LLaDA-8B-Instruct上,针对WikiText-103、OpenWebText和Alpaca的评估表明,SHADOWMASK实现了接近100%的攻击成功率,显著优于标准数据投毒,在很大程度上保持了清洁效用,在全模型和参数高效微调下仍然有效,并且对代表性防御具有鲁棒性。

英文摘要

Masked diffusion language models (MDLMs) are emerging as a compelling new paradigm for text generation, but their training-time security remains largely unexplored. Existing backdoor attacks on Gaussian diffusion models or autoregressive language models do not directly apply to MDLMs because MDLMs rely on discrete state corruption and iterative denoising rather than continuous noising or left-to-right prediction. In this work, we present the first systematic study of training-time backdoor attacks on MDLMs. We propose SHADOWMASK, a backdoor attack that modifies the MDLM forward corruption process by replacing the standard all-mask terminal distribution with a trigger-mask mixture prior. This creates a dedicated denoising pathway from trigger-corrupted states to attacker-specified targets while preserving clean denoising behavior. We further provide a principled mathematical formulation by defining the backdoored forward process, deriving the reverse-time posterior, and obtaining the continuous-time training objective. Evaluations on DiT-based MDLM and LLaDA-8B-Instruct across WikiText-103, OpenWebText, and Alpaca show that SHADOWMASK achieves near-100% attack success, substantially outperforms standard data poisoning, largely preserves clean utility, remains effective under full-model and parameter-efficient fine-tuning, and is robust against representative defenses.

2605.18629 2026-06-03 cs.LG

Aligned Training: A Parameter-Free Method to Improve Feature Quality and Stability of Sparse Autoencoders (SAE)

对齐训练:一种无参数方法提升稀疏自编码器(SAE)的特征质量与稳定性

Michał Brzozowski, Neo Christopher Chung

发表机构 * Samsung AI Center(三星人工智能中心) University of Warsaw(华沙大学)

AI总结 提出无参数的对齐训练方法,通过强制编码器与解码器方向内积为1的几何约束,同时提升稀疏自编码器的重建质量、消除死特征并增强训练稳定性。

详情
AI中文摘要

稀疏自编码器(SAE)是解释深度神经网络(DNN)内部工作机制的主要方法之一,将激活分解为高维特征。然而,它们存在关键缺陷:大量特征从未被激活且不稳定。尽管SAE的变体试图缓解这些问题,但它们需要额外的数据、重采样或训练。我们提出了对齐训练,一种无参数的SAE重参数化方法,同时提升重建质量、消除死特征,并显著增强跨训练种子的稳定性。我们的方法源于一个被忽视的观察:SAE特征质量(通过编码器和解码器方向之间的内积衡量,我们称之为对齐分数)在所有现代架构中呈现双峰分布。所提出的对齐训练在编码器和解码器之间施加几何约束,使得每个特征的内积等于1,从而在不增加任何超参数的情况下消除了SAE训练中的一个退化来源。在多个模型、字典大小和稀疏度水平上,对齐训练在SAEBench基准测试中显示出帕累托改进。除了改善死特征、稳定性和重建外,我们的方法可以轻松集成到机制可解释性技术中,例如Top/BatchTop-K架构和p退火。总体而言,对齐训练在不增加计算复杂度或成本的情况下显著提升了SAE的特征质量和稳定性。

英文摘要

Sparse autoencoders (SAEs) are one of the main methods to interpret the inner workings of deep neural networks (DNNs), decomposing activations into higher-dimensional features. However, they exhibit critical shortcomings: a large fraction of features are never activated, and features are unstable. Although SAE variants attempt to mitigate these issues, they require additional data, resampling, or training. We propose aligned training, a parameter-free reparameterization of SAEs that simultaneously improves reconstruction quality, eliminates dead features, and significantly enhances stability across training seeds. Our approach is motivated by an overlooked observation that SAE feature quality, measured by the inner product between encoder and decoder directions (which we call the alignment score), follows a bimodal distribution across all modern architectures. The proposed aligned training enforces a geometric constraint between the encoder and decoder such that their inner product equals one for every feature, which removes a source of degeneracy in SAE training without adding any hyperparameters. Across multiple models, dictionary sizes, and sparsity levels, aligned training shows Pareto improvements on the SAEBench benchmarks. Beyond improving dead features, stability and reconstruction, our method readily integrates with techniques in mechanistic interpretability such as Top/BatchTop-K architectures and p-Annealing. Overall, aligned training substantially improves the feature quality and stability of SAEs without added computational complexity or cost.
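The alignment score and the unit-inner-product constraint can be illustrated with a short NumPy sketch. The abstract does not give the authors' reparameterization, so the correction used here (the smallest change to each encoder row along its decoder direction) is only one possible way to satisfy the constraint. The matrix names `W_enc` and `W_dec` and their shared (n_features, d_model) layout are assumptions.

```python
import numpy as np


def alignment_scores(W_enc, W_dec):
    """Row-wise inner products <e_i, d_i>; both inputs are (n_features, d_model)."""
    return np.einsum("fd,fd->f", W_enc, W_dec)


def align_encoder(W_enc, W_dec):
    """Shift each encoder row along its decoder row so that <e_i, d_i> = 1.

    This is the minimum-norm correction, chosen for illustration only.
    """
    scores = alignment_scores(W_enc, W_dec)
    dec_sq_norm = np.einsum("fd,fd->f", W_dec, W_dec)
    return W_enc + ((1.0 - scores) / dec_sq_norm)[:, None] * W_dec


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W_enc = rng.normal(size=(8, 16))
    W_dec = rng.normal(size=(8, 16))
    print(alignment_scores(W_enc, W_dec).round(2))
    print(alignment_scores(align_encoder(W_enc, W_dec), W_dec).round(2))
```

Applying such a correction once enforces the constraint at a single point in training; a reparameterization would instead keep it satisfied at every step.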

2605.18160 2026-06-03 cs.CV cs.AI

Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models

Vision Inference Former:在多模态大语言模型中维持视觉一致性

Xinpeng Dong, Min Zhang, Kairong Han, Xu Tan, Fei Wu, Kun Kuang

发表机构 * Zhejiang University(浙江大学) East China Normal University(华东师范大学) Zhejiang University of Science and Technology(浙江科技大学)

AI总结 针对多模态大语言模型中视觉信息被弱化的问题,提出Vision Inference Former(VIF)轻量模块,在推理解码阶段持续注入视觉语义,提升生成内容与视觉的一致性。

详情
AI中文摘要

近年来,多模态大语言模型(MLLMs)取得了显著进展,主要归功于整合视觉和文本信息的有效范式。主流的基于连接器的范式将视觉特征投影到文本序列中,从而在生成式架构内实现统一的多模态对齐和推理。然而,我们的实验揭示了两个关键限制:(1)尽管视觉信息是MLLMs中的核心证据模态,但它被与文本标记同等对待,削弱了视觉模态的独特贡献;(2)随着生成长度的增加,特别是在有限的上下文窗口内,模型对视觉信息的依赖逐渐减弱,导致视觉-语言对齐恶化,生成内容与视觉语义之间的一致性降低。为了解决这些挑战,我们提出了Vision Inference Former(VIF),一种轻量级架构模块,它在纯视觉表示和模型输出空间之间建立直接桥梁。具体而言,VIF在推理过程的解码阶段持续注入视觉语义,确保模型在生成过程中牢固地基于视觉内容。我们在涵盖通用推理、OCR、表格理解、以视觉为中心的评估和幻觉的14个基准任务上进行了实验。实验结果表明,VIF在不同架构上持续提升模型性能,同时引入最小的额外开销。本工作的代码可在https://github.com/Dong-Xinpeng/VIF获取。

英文摘要

In recent years, multimodal large language models (MLLMs) have achieved remarkable progress, primarily attributed to effective paradigms for integrating visual and textual information. The dominant connector-based paradigm projects visual features into textual sequence, enabling unified multimodal alignment and reasoning within a generative architecture. However, our experiments reveal two key limitations: (1) Although visual information serves as the core evidential modality in MLLMs, it is treated on par with textual tokens, diminishing the unique contribution of the visual modality; (2) As generation length increases, particularly within a limited context window, the model's dependence on visual information progressively weakens, resulting in deteriorated vision-language alignment and reduced consistency between generated content and visual semantics. To address these challenges, we propose the Vision Inference Former (VIF), a lightweight architectural module that establishes a direct bridge between pure visual representations and the model's output space. Specifically, VIF continuously injects visual semantics throughout the decoding phase of the inference process, ensuring that the model remains firmly grounded in visual content during generation. We conduct experiments on 14 benchmark tasks covering general reasoning, OCR, table understanding, vision-centric evaluation, and hallucination. Experimental results show that VIF consistently improves model performance across diverse architectures while introducing minimal additional overhead. The code for this work is available at https://github.com/Dong-Xinpeng/VIF.

2605.17866 2026-06-03 cs.LG

DAD4TS: Data-Augmentation-Oriented Diffusion Model for Time-Series Forecasting with Small-Scale Data

DAD4TS:面向小规模数据的时间序列预测的数据增强扩散模型

Masahiro Suzuki, Bohui Xia, Hiroto Yamamoto, Masanori Miyahara

发表机构 * Sony Group Corporation(索尼集团公司)

AI总结 针对小规模时间序列数据预测中数据增强生成有意义数据困难的问题,提出基于扩散模型和强化学习的DAD4TS方法,通过几何空间投影训练扩散模型,在多个数据集和模型上验证了有效性。

详情
AI中文摘要

小规模数据是时间序列预测任务中的一个关键问题。数据增强是解决该问题的有效策略,但在生成有意义的数据方面存在局限性。为了解决这一局限性,我们提出了DAD4TS,一种基于扩散模型并结合强化学习的数据增强方法,专为小规模数据的时间序列预测设计。在DAD4TS中,数据生成器与时间序列模型同时训练,并由强化学习模型控制,以高效生成能够提高时间序列模型预测准确性的样本。为了支持小规模数据,我们使用数学方法代替传统的VAE方法,通过将时间序列数据投影到几何空间来训练扩散模型。我们通过定性和定量实验,在六个真实世界数据集和八个时间序列模型上,使用七种对比方法验证了DAD4TS的有效性。结果表明,DAD4TS在五个数据集上得到了验证。

英文摘要

Small-scale data is a critical problem in time-series forecasting tasks. Data augmentation is an effective strategy for this task, but it has a limitation in generating meaningful data. To address this limitation, we propose DAD4TS, a diffusion-model-based data augmentation method with reinforcement learning, designed for time-series forecasting with small-scale data. In DAD4TS, a data generator is simultaneously trained with a time-series model and controlled by a reinforcement learning model to efficiently generate samples that improve the forecast accuracy of the time-series model. To support small-scale data, we use mathematical methods instead of conventional VAE methods to train the diffusion model by projecting the time-series data into the geometric space. We validated the effectiveness of DAD4TS with seven comparative methods through qualitative and quantitative experiments on six real-world datasets and eight time-series models. As a result, DAD4TS was validated on five datasets.

2605.16816 2026-06-03 cs.RO

"I'm Not Mad, Just Focused'': Understanding Human Emotions in Human-Robot Collaboration

“我没生气,只是专注”:理解人机协作中的人类情绪

Seung Chan Hong, Dana Kulić, Leimin Tian

发表机构 * Faculty of Engineering, Monash University(莫纳什大学工程学院) CSIRO Robotics(CSIRO机器人实验室)

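AI summary Proposes a vision language model (VLM)-based emotion recognition system that uses contextual understanding to improve emotion interpretation in human-robot collaboration. Experiments show better semantic similarity and sentiment alignment than a baseline CNN system, and users preferred emotion-adaptive robot behaviour.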

Journal ref IEEE Robotics and Automation Letters, vol. 11, no. 7, pp. 8260-8267, July 2026

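Details
AI Chinese abstract (translated)

Human-robot collaboration (HRC) can benefit from a robot's ability to interpret human emotional states. However, current emotion recognition (ER) models in HRC often fall short, particularly because they rely on acted datasets and single-modality inputs such as facial expressions. We propose a novel vision language model (VLM)-based ER system that uses contextual understanding to improve emotion interpretation in HRC. We first evaluate the VLM-ER system by assessing its semantic and sentiment similarity to human annotations on an existing HRC dataset. Then, in a user study on a collaborative delivery task, we evaluate the effect of modulating the robot's behaviour based on the user's emotional state as inferred by the VLM-ER system. The results show that, compared with a baseline convolutional neural network system, the proposed VLM-ER system achieves higher semantic similarity and positive sentiment alignment with human annotations. In addition, participants in the user study preferred the emotion-adaptive robot behaviour facilitated by the VLM-ER system.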

English abstract

Human-robot collaboration (HRC) can benefit from robots' abilities to interpret human emotional states. However, current emotion recognition (ER) models in HRC often fall short, particularly due to their reliance on acted datasets and single-modality inputs like facial expressions. We propose a novel vision language model (VLM)-based ER system that leverages contextual understanding to improve emotion interpretation in HRC. We first evaluate the VLM-ER system by assessing its semantic and sentiment similarity with human annotations on an existing HRC dataset. Then, in a user study with a service robot in a collaborative delivery task, we evaluate the effects of modulating the robot's behaviour based on the user's emotional state inferred by the VLM-ER system. The results show that the proposed VLM-ER system achieves higher semantic similarity and positive sentiment alignment with human annotations compared to a baseline convolutional neural network-based system. Further, participants in the user study preferred emotion-adaptive robot behaviour facilitated by the VLM-ER system.

2603.00029 2026-06-03 cs.CL

Embracing Anisotropy: Turning Massive Activations into Interpretable Control Knobs for Large Language Models

Youngji Roh, Hyunjin Cho, Jaehyung Kim

Affiliations Yonsei University

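AI summary Proposes a training-free, magnitude-based method for identifying domain-critical dimensions and treats them as interpretable semantic detectors; applying activation steering only to these dimensions outperforms conventional whole-dimension steering in domain adaptation and jailbreaking scenarios.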

Comments ACL 2026 Main Conference

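Details
AI Chinese abstract (translated)

Large language models (LLMs) exhibit highly anisotropic internal representations, typically characterized by massive activations, in which a small subset of feature dimensions has magnitudes significantly larger than the rest. While prior work has mainly treated these extreme dimensions as artifacts to be handled, we propose a different perspective: these dimensions act as intrinsically interpretable functional units that arise from domain specialization. Specifically, we propose a simple magnitude-based criterion for identifying domain-critical dimensions in a training-free manner. Our analysis shows that these dimensions behave as interpretable semantic detectors for symbolic/quantitative patterns or domain-specific terms. In addition, we introduce critical dimension steering, which applies activation steering only to the identified dimensions. Empirical results show that this method outperforms conventional whole-dimension steering in domain adaptation and jailbreaking scenarios.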

English abstract

Large Language Models (LLMs) exhibit highly anisotropic internal representations, often characterized by massive activations, a phenomenon where a small subset of feature dimensions possesses magnitudes significantly larger than the rest. While prior works view these extreme dimensions primarily as artifacts to be managed, we propose a distinct perspective: these dimensions serve as intrinsic interpretable functional units arising from domain specialization. Specifically, we propose a simple magnitude-based criterion to identify Domain-Critical Dimensions in a training-free manner. Our analyses reveal that such dimensions behave as interpretable semantic detectors for symbolic/quantitative patterns or domain-specific terms. In addition, we introduce Critical Dimension Steering, which applies activation steering exclusively to the identified dimensions. Empirical results show that this approach outperforms conventional whole-dimension steering in domain adaptation and jailbreaking scenarios.
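
The idea of selecting a few high-magnitude dimensions and steering only those can be illustrated with a minimal sketch. The abstract does not specify the exact criterion, so ranking dimensions by mean absolute activation and keeping the top k is an assumption made here for illustration; the helper names `mean_abs_per_dim`, `top_k_dims`, and `steer` are hypothetical and are not taken from the paper.

```python
# Minimal sketch, NOT the paper's implementation.
# Assumption: "magnitude-based criterion" = top-k dimensions by mean |activation|.

def mean_abs_per_dim(hidden_states):
    """Average absolute activation of each dimension over a set of hidden states."""
    n = len(hidden_states)
    dims = len(hidden_states[0])
    return [sum(abs(h[d]) for h in hidden_states) / n for d in range(dims)]

def top_k_dims(hidden_states, k):
    """Indices of the k dimensions with the largest mean absolute activation."""
    scores = mean_abs_per_dim(hidden_states)
    return sorted(range(len(scores)), key=lambda d: scores[d], reverse=True)[:k]

def steer(hidden, direction, dims, alpha=1.0):
    """Add alpha * direction to `hidden`, but only on the selected dimensions."""
    out = list(hidden)
    for d in dims:
        out[d] += alpha * direction[d]
    return out

# Toy hidden states from a hypothetical domain corpus: dimension 1 is "massive".
domain_states = [
    [0.1, 40.0, -0.2, 0.3],
    [-0.2, 38.0, 0.1, -0.1],
    [0.0, 42.0, 0.2, 0.2],
]
dims = top_k_dims(domain_states, k=1)

# Toy steering direction; in practice this would come from contrasting activations.
direction = [1.0, 2.0, 3.0, 4.0]
steered = steer([0.5, 0.5, 0.5, 0.5], direction, dims, alpha=0.5)
print(dims, steered)
```

Whole-dimension steering would correspond to passing every index as `dims`; restricting the update to the selected indices leaves all other coordinates of the hidden state untouched.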

2605.15572 2026-06-03 cs.CL

Measuring Maximum Activations in Open Large Language Models

Luxuan Chen, Han Tian, Xinran Chen, Rui Kong, Fang Wang, Jiamin Chen, Yuchen Li, Jiashu Zhao, Shuaiqiang Wang, Haoyi Xiong, Linghe Kong, Dawei Yin

Affiliations Shanghai Jiao Tong University; Baidu Inc.; Nankai University

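AI summary Using a unified pipeline, this study measures global and layerwise maximum activations across 27 checkpoints from 8 open LLM families, finds that they differ markedly across families, architectures, and training stages and correlate with low-bit quantization error, and recommends reporting this metric when releasing open weights.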

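Details
AI Chinese abstract (translated)

The dynamic range of activations is a first-order constraint for low-bit quantization, activation scaling, and stable LLM inference. Prior work characterized outlier features and massive activations in pre-2024 LLaMA-style models, and the downstream activation-quantization stack inherited that picture without revisiting it for the post-LLaMA boom in open models. We ask a deployment-oriented question: how large can activations get in modern open LLMs, and how does this magnitude vary with family, generation, and training stage? Under a unified pipeline (a 5,000-sample multi-domain corpus, family-specific tokenization, and identical hooks on embeddings, hidden states, attention, MLP/MoE, SwiGLU gates, and the final normalization layer), we measure global and layerwise maxima for 27 checkpoints from 8 open families, covering dense, MoE, vision-language, intermediate-training, and instruction-tuned variants. We find that (i) at comparable parameter counts, global maxima span nearly four orders of magnitude, with Qwen3.5 and MoE checkpoints in the 10^2 to 10^3 range and Gemma3-27B-it reaching about 7×10^5; (ii) cross-family and cross-generation comparisons break simple monotonic scaling; and (iii) MoE checkpoints have peaks 14.0-23.4x lower than dense counterparts of the same scale, while the residual stream carries the global maximum in 22/24 checkpoints. A lightweight INT-8 sanity check shows that the measured maxima co-vary with low-bit reconstruction error through the choice of activation scale. We conclude that maximum activation magnitude is a model property tied to family, architecture, and training stage, rather than a simple byproduct of scale, and that it should be measured and reported alongside any open-weight release before low-bit deployment. The code is publicly available at https://github.com/clx1415926/Max_act_llm.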

English abstract

The dynamic range of activations is a first-order constraint for low-bit quantization, activation scaling, and stable LLM inference. Prior work characterized outlier features and massive activations on pre-2024 LLaMA-style models, and the downstream activation-quantization stack inherits that picture without revisiting it for the post-LLaMA open-model boom. We ask the deployment-oriented question: how large can activations get in modern open LLMs, and how does this magnitude vary across families, generations, and training stages? Under a unified pipeline (5,000-sample multi-domain corpus, family-specific tokenization, identical hooks across embeddings, hidden states, attention, MLP/MoE, SwiGLU gates, and final norm), we measure global and layerwise maxima on 27 checkpoints from 8 open families spanning dense, MoE, vision-language, intermediate-training, and instruction-tuned variants. We find that (i) global maxima span nearly four orders of magnitude at comparable parameter counts, with Qwen3.5 and MoE checkpoints in the 10^2 to 10^3 range and Gemma3-27B-it reaching ~7 x 10^5; (ii) cross-family and cross-generation comparisons break simple monotonic scaling; and (iii) MoE checkpoints exhibit 14.0-23.4x lower peaks than matched-scale dense counterparts, while the residual stream carries the global maximum in 22/24 checkpoints. A lightweight INT-8 sanity check shows that measured maxima co-vary with low-bit reconstruction error via activation-scale selection. We conclude that maximum activation magnitude is a model property tied to family, architecture, and training stage, not a simple byproduct of size, and should be measured and reported alongside any open-weight release before low-bit deployment. The code is publicly available at https://github.com/clx1415926/Max_act_llm.
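
The link between a large maximum activation and low-bit reconstruction error can be illustrated with a small sketch. The abstract does not describe how the INT-8 sanity check is set up, so the symmetric per-tensor scheme with an absmax scale used below is an assumption, and the function names and toy values are hypothetical.

```python
# Minimal sketch, NOT the paper's sanity check.
# Assumption: symmetric per-tensor INT8 quantization with scale = absmax / 127.

def quantize_int8(values):
    """Map floats to integers in [-127, 127] using an absmax-derived scale."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate floats from the quantized integers."""
    return [x * scale for x in q]

def mean_abs_error(values):
    """Mean absolute reconstruction error after a quantize/dequantize round trip."""
    q, scale = quantize_int8(values)
    recon = dequantize(q, scale)
    return sum(abs(a - b) for a, b in zip(values, recon)) / len(values)

# Toy activations: the second tensor adds a single massive activation.
typical = [0.5, -1.0, 0.25, 0.75]
with_outlier = typical + [1000.0]

err_typical = mean_abs_error(typical)
err_outlier = mean_abs_error(with_outlier)
print(err_typical, err_outlier)
```

In this toy case the single large value sets the scale, so every ordinary activation rounds to zero and the reconstruction error rises sharply. This is one way the measured maximum can feed into quantization error through the choice of activation scale.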