arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.07299 2026-06-08 cs.AI New submission

DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Grounded Reasoning

Lingyong Yan, Can Xu, Yukun Zhao, Wenxuan Li, Qingyang Chen, Jiulong Wu, Wenli Song, Xiangnan Li, Weixian Shi, Yiqun Chen, Xuchen Ma, Yuchen Li, Jiashu Zhao, Shuaiqiang Wang, Jianmin Wu, Dawei Yin

Affiliations * Baidu AI Cloud

AI summary Proposes DuMate-DeepResearch, a multi-agent framework that decouples the agent core from the tool ecosystem and introduces graph-based dynamic planning, recursive two-level execution, and rubric-based test-time optimization, achieving state-of-the-art results on deep research benchmarks.

Comments Technical report by the DuMate Team. 26 pages, 6 figures, 4 tables

English abstract

Deep Research (DR) has emerged as a new agentic paradigm to tackle complex, open-ended research tasks, demanding systems that can iteratively frame problems, acquire evidence, verify sources, and synthesize long-form reports. In practice, however, current DR systems are constrained by four interrelated limitations: long-horizon planning over an underspecified scope, the bottleneck of decomposing and scheduling such tasks within a single agent, hallucination risk in long-form synthesis, and limited process auditability. This technical report presents DuMate-DeepResearch, a multi-agent DR framework built on the Qianfan Agent Foundry. The framework decouples the Agent Core, which handles task understanding, planning, and scheduling, from an extensible Tool Ecosystem for retrieval, evidence acquisition, and report rendering, making every intermediate decision and tool invocation explicitly traceable. Building on this infrastructure, DuMate-DeepResearch further introduces three mechanisms: (i) a graph-based dynamic planning strategy that expands the research roadmap coarse-to-fine and continuously revises it through reflection, re-planning, backtracking, and parallel branching; (ii) a recursive two-level execution design that delegates each complex search sub-task to an inner Search Agent that runs its own planning loop, isolating noisy retrieval and stabilizing long-horizon execution; (iii) a rubric-based test-time optimization mechanism that dynamically generates task-specific quality criteria and uses them as live reasoning scaffolds for evidence-grounded synthesis and adaptive stopping. Across two deep research benchmarks, DuMate-DeepResearch establishes new state-of-the-art results: the best overall score (58.03%) on DeepResearch Bench, and the best overall score (61.95%) on DeepResearch Bench II while ranking first in information recall and analysis.

2606.07293 2026-06-08 cs.SD cs.LG New submission

TargetSEC: Plug-and-Play In-the-Wild Speech Emotion Conversion via Arousal-Conditioned Latent Style Diffusion

Constantin Alexander Auga

Affiliations * Hasso Plattner Institute / University of Potsdam

AI summary Proposes TargetSEC, an embedding-driven latent diffusion framework that generates emotion-style embeddings from continuous emotion conditioning and operates in a compact latent space, achieving high conversion accuracy and speech quality.

Comments 5 pages, 2 figures, 2 tables, preprint

English abstract

Speech Emotion Conversion (SEC) aims to transform the emotion of a source utterance into a target emotion while preserving content and speaker identity. SEC on in-the-wild data is challenging due to the non-parallel nature of training data and complex real-world acoustics. Existing fixed-duration approaches either struggle to shift the emotion effectively (high quality, low conversion) or degrade speech naturalness (low quality, high conversion). We propose TargetSEC, an embedding-driven latent diffusion framework that generates emotion-focused style embeddings conditioned on speaker identity and continuous emotion. Unlike methods that diffuse over spectrograms, TargetSEC operates in a compact latent space. Experiments on the MSP-Podcast dataset show that TargetSEC outperforms current non-duration baselines in conversion accuracy while maintaining high speech quality, and achieves performance comparable to duration-prediction systems without explicit temporal modeling.

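2606.07291 2026-06-08 cs.LG New submission

Trio: Learning Time-Series Forecasting with Temporal-Spatial-Sample Attention and Structural Causal Priors

Tao Chen, Yexu Zhou, Zhi Gong, Hengwei He, Hongda Li, Zhewei Chen, Dongjing Wang, Xin Zhang, Decheng Liu, Chunlei Peng, Zheng Chen, Wenyue Ding

Affiliations * Nanyang Technological University

AI summary Proposes the Trio architecture, which uses temporal, spatial, and sample attention to capture temporal dynamics, inter-variable dependencies, and historical sample correspondences respectively, and introduces a time-series structural causal model that generates synthetic tasks as structural priors, improving multivariate time-series forecasting.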

English abstract

Multivariate time-series forecasting requires models to reason over temporal dynamics, cross-variable dependencies, and historical input-output correspondences. Recent Prior-Data Fitted Networks (PFNs) suggest that synthetic tasks can be useful for learning transferable inference behavior. However, directly transferring this paradigm to time-series forecasting remains difficult, since temporal order, dynamic lags, and recurring historical patterns are not naturally captured by ordinary tabular priors. Motivated by this observation, we propose Trio, a sample-aware time-series forecasting architecture based on Temporal-Spatial-Sample attention. Temporal attention captures within-window dynamics, spatial attention models inter-variable dependencies, and sample attention retrieves relevant historical lookback-future pairs to guide the current prediction. Rather than claiming a fully general PFN-style forecaster, our goal is to study how historical input-output examples can be explicitly organized and reused within a forecasting model. We further introduce a Time-Series Structural Causal Model (TS-SCM) generator to create structured synthetic forecasting tasks with dynamic lags, cross-variable interactions, noise, feedback, and distributional drift. Experiments on synthetic, industrial, and public benchmarks show that the proposed architecture improves forecasting performance. Exploratory zero-shot experiments further suggest that TS-SCM-generated tasks may provide useful structural priors, while fully general PFN-style time-series forecasting remains an open problem.

2606.07289 2026-06-08 cs.LG cs.CV New submission

Closed-Form Spectral Regularization for Multi-Task Model Merging

Yongxian Wei, Runxi Cheng, Xingxuan Zhang, Li Shen, Chun Yuan, Peng Cui, Dacheng Tao

Affiliations * Shenzhen International Graduate School, Tsinghua University; Department of Computer Science and Technology, Tsinghua University; Sun Yat-sen University; Nanyang Technological University

AI summary For interference minimization in multi-task model merging, the paper finds that the iterative solver actually acts as an implicit spectral regularizer, and on that basis proposes SWUDI, a closed-form method based on spectral filtering, and its adaptive variant SWUDI-A, which substantially improve efficiency while matching or outperforming existing methods.

English abstract

Model merging combines several independently fine-tuned experts into a single multi-task model without any training data, reducing the storage, serving, and decentralized-development costs of large foundation models. State-of-the-art merging methods formulate merging as a layer-wise quadratic interference minimization problem. Although this problem admits an exact closed-form pseudoinverse solution, that solution underperforms hundreds of iterations of gradient descent in practice. The iterative loop dominates the cost of the pipeline, yet its effectiveness has remained unexplained. We revisit this regime and show that the iterative solver does not primarily act as an optimizer; rather, it serves as an implicit spectral regularizer for an ill-posed normal equation, where small-eigenvalue directions of the per-layer interference operator amplify proxy noise. Building on this finding, we formalize multi-task model merging as a noisy linear inverse problem and propose a spectral filtering estimator parameterized by a per-direction filter. We instantiate this estimator with SWUDI, a closed-form method that combines a soft exponential filter, which matches the gradient-flow trajectory of iterative descent, with a hard top-K truncation that suppresses noise-amplifying small-eigenvalue directions. Furthermore, we propose SWUDI-A, an adaptive variant that replaces the global rank hyperparameter with per-layer rank rules, further improving robustness across architectures. Both variants share a single symmetric eigendecomposition per linear layer and require no training data or optimizer state. Across four general benchmarks and a multimodal merging benchmark spanning VQA, Geometry, Chart, OCR, Grounding, and modality merging, our proposed spectral solvers match or outperform state-of-the-art merging methods. Crucially, they reduce wall-clock time by 28-72x and peak GPU memory by up to 50%.
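The contrast this abstract draws between the pseudoinverse and a filtered solution can be sketched in the eigenbasis of a symmetric positive semi-definite operator. The sketch below is not the SWUDI implementation: the eigenvalues, noise coefficients, flow time `t`, rank `k`, and all function names are illustrative assumptions. It relies only on the standard fact that gradient flow on a quadratic, started from zero and run for time t, applies the filter (1 - exp(-t·λ))/λ to each eigen-direction, whereas the pseudoinverse applies 1/λ.

```python
import math

def pinv_filter(lam, eps=1e-12):
    """Pseudoinverse filter: 1/lam, and 0 for numerically null directions."""
    return 1.0 / lam if lam > eps else 0.0

def gradient_flow_filter(lam, t, eps=1e-12):
    """Filter applied by gradient flow after time t from a zero start."""
    if lam <= eps:
        return t  # limit of (1 - exp(-t*lam)) / lam as lam -> 0
    return -math.expm1(-t * lam) / lam

def soft_topk_filter(eigvals, t, k):
    """Soft exponential filter on the k largest eigenvalues, zero elsewhere."""
    order = sorted(range(len(eigvals)), key=lambda i: eigvals[i], reverse=True)
    keep = set(order[:k])
    return [gradient_flow_filter(lam, t) if i in keep else 0.0
            for i, lam in enumerate(eigvals)]

# Hypothetical spectrum with one tiny eigenvalue, and equal noise per direction.
eigvals = [10.0, 1.0, 1e-4]
noise = [0.01, 0.01, 0.01]
t = 5.0

# Noise that each estimator passes into the solution, direction by direction.
pinv_err = [pinv_filter(lam) * n for lam, n in zip(eigvals, noise)]
soft_err = [g * n for g, n in zip(soft_topk_filter(eigvals, t, k=2), noise)]

print("pseudoinverse:", pinv_err)
print("soft + top-K: ", soft_err)
```

In this toy spectrum the pseudoinverse multiplies the noise in the smallest-eigenvalue direction by 1/λ, while the truncated soft filter leaves the well-conditioned directions almost unchanged and drops the noisy one.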

2606.07288 2026-06-08 cs.CV cs.GR New submission

ExMesh: EXplicit Mesh Reconstruction with Topology Adaptation

Chuanjin Fan, Lifan Wu, Wenjie Chang, Hanzhi Chang, Wenfei Yang, Tianzhu Zhang

Affiliations * University of Science and Technology of China; National Key Laboratory of Deep Space Exploration, Deep Space Exploration Laboratory

AI summary Proposes ExMesh, a framework that directly optimizes explicit meshes by combining differentiable optimization with discrete topology updates; adaptive vertex splitting and merging plus real-time UV maintenance enable coarse-to-fine optimization that balances accuracy, efficiency, and mesh conciseness.

Comments Accepted at the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026 (CVPR 2026)

English abstract

Reconstructing surface meshes from multi-view images has remained a core challenge in recent years. Most existing methods, whether implicit or explicit, depend on intermediate representations and post-processing steps like Marching Cubes or TSDF fusion, often resulting in artifacts and fragmented geometry. Directly optimizing explicit meshes is a promising approach. However, it presents two critical challenges. The first is how to adaptively refine mesh topology to capture detail without introducing degenerate faces. The second is how to maintain consistent UV coordinates for high-fidelity texturing as the mesh structure evolves. To overcome these, we propose ExMesh, a novel framework that directly optimizes explicit meshes by integrating differentiable optimization with discrete topology updates. Specifically, we introduce an adaptive vertex splitting and merging strategy, along with real-time UV maintenance, to enable coarse-to-fine optimization while preserving geometric integrity. To our knowledge, ExMesh is the first framework to seamlessly integrate discrete topology operations into a continuous differentiable optimization pipeline. Extensive experiments demonstrate that ExMesh achieves a balance among accuracy, computational efficiency, and mesh conciseness.

2606.07280 2026-06-08 cs.CV New submission

Geometric-Aware Hypergraph Reasoning for Novel Class Discovery in Point Cloud Segmentation

Zihao Zhang, Aming Wu, Yang Li, Yahong Han, Jialie Shen

Affiliations * School of Artificial Intelligence, College of Intelligence and Computing, Tianjin University; School of Computer Science and Information Engineering, Hefei University of Technology; Department of Computer Science, City St George’s, University of London

AI summary Proposes a hypergraph framework that models high-order associations and, combined with geometric-aware prototypes, enables collaborative reasoning from known to novel classes in point cloud segmentation, improving segmentation accuracy.

Comments Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

English abstract

Novel class discovery in point cloud segmentation aims to transfer knowledge from known classes to automatically identify and segment unlabeled novel classes in point clouds. Existing methods mainly rely on pairwise associations for class assignment and novel class reasoning, which limits their ability to capture complex relationships among known and novel classes and may lead to inaccurate semantic segmentation. To address this issue, we introduce a hypergraph-based framework that models high-order associations among classes and enables collaborative reasoning from known classes to novel classes beyond traditional pairwise relations. Moreover, existing methods tend to focus on semantic feature extraction while paying insufficient attention to geometric information in point clouds. To better exploit spatial structure, we propose Geometric-Aware Prototypes to enhance the representation of class-level geometric cues. By propagating geometric information through hyperedges, the proposed method improves the understanding of spatial distributions across classes and leads to more accurate segmentation. Experiments on the SemanticKITTI and SemanticPOSS datasets demonstrate the effectiveness and superiority of our method.

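2606.07271 2026-06-08 cs.LG cs.AI cs.SD New submission

Where Rectified Flows Leak: Characterising Membership Signals Along the Interpolation Path

Thomas Sesmat, Gabriel Meseguer-Brocal, Geoffroy Peeters

Affiliations * University of Amsterdam

AI summary Analyzes training-data membership signals along the interpolation path of Rectified Flows: the reconstruction gap between training and test data follows a bell-shaped curve whose peak location is derived under Gaussian assumptions; the structure is shown to be universal and is used for a membership inference attack.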

Comments ICML 2026 article, 9 main pages and 25 with annexes, 11 figures

Journal ref
43rd International Conference on Machine Learning, Seoul, South Korea, 2026
English abstract

Understanding what generative models retain from training data remains challenging, with implications for copyright and privacy. Beyond verbatim reproduction, models can encode subtler traces of their training data that never surface in their outputs yet remain exploitable. We study this regime for Rectified Flows, which are increasingly used in deployed generative systems. We analyse the interpolation path $X_\lambda = (1-\lambda)X_0 + \lambda X_1$ that defines the Rectified Flow training. We show that a gap exists between the reconstruction of train and test data that follows a bell-shaped curve over $\lambda$, which accumulates during training, while the validation metrics remain stable. The signal has a maximum whose location we derive in closed form under Gaussian assumptions. We validate these predictions on both audio and images and show that the bell-shaped structure is universal, while the peak prediction holds when our assumptions are satisfied. As a proof of concept, we exploit this specific $\lambda$-resolved structure to perform a Membership Inference Attack, distinguishing members of the training set from non-members.
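The interpolation path named in this abstract is simple enough to write down directly. The sketch below is a minimal illustration and does not reproduce the paper's gap measurement or attack. It assumes the common convention that $X_0$ is a noise sample and $X_1$ a data sample, which the abstract does not state, and the sample values and function names are made up. The velocity $X_1 - X_0$ is the derivative of the path with respect to $\lambda$, and is the regression target in the standard rectified-flow objective.

```python
import random

def interpolate(x0, x1, lam):
    """Point on the straight path X_lam = (1 - lam) * X_0 + lam * X_1."""
    return [(1.0 - lam) * a + lam * b for a, b in zip(x0, x1)]

def velocity_target(x0, x1):
    """Derivative of the path with respect to lam: X_1 - X_0 (constant in lam)."""
    return [b - a for a, b in zip(x0, x1)]

rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]                    # assumed: a data sample (made-up values)
x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # assumed: a Gaussian noise sample

# Evaluate the path at its endpoints and midpoint.
path = [interpolate(x0, x1, lam) for lam in (0.0, 0.5, 1.0)]
target = velocity_target(x0, x1)
```

A $\lambda$-resolved analysis like the one described would evaluate a model's reconstruction error at many values of `lam` along this path, separately for training and held-out samples.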

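2606.07254 2026-06-08 cs.LG cs.FL New submission

A Held-Out Transition-Pair Falsifier for Long-Horizon Non-Abelian State Tracking

Jeonghoon Lee

Affiliations * Attractor Dynamics

AI summary To probe the limits of sequence models in non-commutative state tracking, proposes a held-out transition-pair falsification protocol; a projected recurrent state model achieves perfect predictions at horizons of up to about one million steps, pointing to explicit non-commutative state composition as an effective inductive bias.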

Comments Technical preprint, 24 pages, 7 figures

English abstract

State tracking exposes a sharp limitation of sequence models: the relevant signal is often not a summary of observed tokens, but an ordered latent state that evolves through non-commutative transformations. We introduce a held-out transition-pair falsifier for finite non-Abelian group tracking. The protocol forbids selected ordered generator pairs during training and requires the same local patterns during evaluation, blocking one direct local-transition memorization pathway. In a controlled $S_3 \times S_3$ benchmark, a projected recurrent state model trained only on length-8 sequences produces error-free final-state predictions (perfect 250/250 per horizon) through evaluation horizons up to 1,048,576 tokens across five seeds. Matched native-readout baselines, including bag, GRU, and a single-configuration structured state-space model, remain near floor under the same protocol. Projection-matched GRU, structured SSM, and bag baselines equipped with analogous finite-group prototype readouts also remain near chance under the same split. Mechanism diagnostics show that hard projection coincides with low homomorphism error, low state-consistency drift, and non-trivial commutator separation, while softened projection collapses final-state accuracy. Clean-split audits verify zero verbatim reduced-word overlap and zero structural-template overlap between training and evaluation partitions. The evidence is scoped to this controlled finite-group falsifier rather than to a general architecture ranking. Within that regime, explicit projected non-commutative state composition acts as a useful inductive bias for long-horizon hidden-state tracking.
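To make the task concrete, the ground-truth state in $S_3 \times S_3$ tracking can be computed by composing permutations in order. The sketch below is not the paper's model or its held-out split; it only shows the latent state that a tracker has to reproduce, and why token order matters. The token encoding (a pair of permutations of {0, 1, 2}) and all names are assumptions made for illustration.

```python
from itertools import permutations

S3 = list(permutations(range(3)))  # the 6 permutations of {0, 1, 2}

def compose(p, q):
    """Return the permutation that applies q first, then p."""
    return tuple(p[q[i]] for i in range(len(q)))

def compose_pair(a, b):
    """Group law of S_3 x S_3: compose each component separately."""
    return (compose(a[0], b[0]), compose(a[1], b[1]))

IDENTITY = (tuple(range(3)), tuple(range(3)))

def track(tokens):
    """Final state after applying the tokens left to right from the identity."""
    state = IDENTITY
    for g in tokens:
        state = compose_pair(g, state)
    return state

e = (0, 1, 2)
swap01 = (1, 0, 2)   # transposition of 0 and 1
cycle = (1, 2, 0)    # 3-cycle
a = (swap01, e)
b = (cycle, e)

# The same two tokens in different orders end in different states.
print(track([a, b]), track([b, a]))
```

Because the final state depends on the order of tokens and not only on which tokens occurred, an order-insensitive summary such as a bag of tokens cannot determine it in general.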

2606.07253 2026-06-08 cs.AI econ.EM New submission

TOPSIS-RAD: Ranking According to Desires

Leonardo Fernandes Costa, Helder Gomes Costa, Diogo Lima, Brunno Rodrigues

Affiliations * Universidade Federal Fluminense; Leonardo Sistemas Consultoria LTDA

AI summary Proposes TOPSIS-RAD, which introduces decision-maker-defined Vetoed Performance Levels and Desired Performance Levels to address traditional TOPSIS's misalignment with decision-maker requirements, sensitivity to outliers, and rank reversal.

Comments 21 pages, 15 tables and 6 figures. The numerical data in the toy examples were computed with Visual TOPSIS RAD, which is available at https://topsis-ranking.vercel.app/. The toy-example data are also available at this URL and can be loaded in the app as the template "Article"

English abstract

Traditional TOPSIS derives its reference points -- the Positive Ideal Solution ($PIS$) and Negative Ideal Solution ($NIS$) -- from the observed alternative set, making rankings susceptible to misalignment with decision-maker (DM) requirements, sensitivity to outlier performances, and rank reversal. This paper proposes TOPSIS-RAD, which addresses these issues by incorporating two arrays of DM-defined reference levels. Vetoed Performance Levels ($VPL$) exclude non-viable alternatives before normalisation, preventing them from distorting the ranking frontiers. Desired Performance Levels ($DPL$) cap performances at the DM's desired level before normalisation, anchoring the $PIS$ in explicit aspirations rather than dataset extremes. Three toy examples demonstrate each mechanism: $VPL$ reshapes normalisation boundaries by removing a non-viable alternative; fixed $DPL$ frontiers stabilise rankings by limiting the influence of performances well above the desired level. The method preserves the familiar distance-based structure of TOPSIS while grounding the ranking in stable, DM-specified boundaries. Limitations and future research directions are also discussed.
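The two mechanisms described here, veto before normalisation and capping at the desired level, can be sketched in a few lines. This is a simplified illustration and not the paper's formulation: it assumes benefit criteria only (higher is better), equal weights, a desired level strictly above the veto level, and a linear rescaling in which the veto level maps to 0 and the desired level to 1, so the $PIS$ is all ones and the $NIS$ all zeros. The function name and the data are hypothetical.

```python
import math

def topsis_rad_sketch(alternatives, vpl, dpl):
    """Rank alternatives using fixed veto (vpl) and desired (dpl) levels.

    Assumed simplification: every criterion is a benefit criterion and is
    rescaled linearly so that vpl maps to 0 and dpl maps to 1.
    """
    scores = {}
    for name, perf in alternatives.items():
        # Veto: exclude alternatives below the veto level on any criterion.
        if any(x < v for x, v in zip(perf, vpl)):
            continue
        # Cap each performance at the desired level, then rescale to [0, 1].
        scaled = [(min(x, d) - v) / (d - v) for x, v, d in zip(perf, vpl, dpl)]
        d_plus = math.sqrt(sum((1.0 - s) ** 2 for s in scaled))  # distance to PIS
        d_minus = math.sqrt(sum(s ** 2 for s in scaled))         # distance to NIS
        scores[name] = d_minus / (d_plus + d_minus)              # closeness
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores

# Hypothetical data: two criteria, veto level 2 and desired level 10 on both.
alternatives = {"A": [8, 6], "B": [6, 9], "C": [20, 5], "D": [9, 1]}
vpl, dpl = [2, 2], [10, 10]

ranking, scores = topsis_rad_sketch(alternatives, vpl, dpl)
print(ranking, scores)
```

Because the reference points come from `vpl` and `dpl` and not from the observed data, an outlier such as alternative C's first criterion is capped at the desired level and cannot move the boundaries used to score the other alternatives.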

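2606.07249 2026-06-08 cs.CV New submission

Reconstructing Multi-Decadal Forest Disturbances: A Spatio-Temporal Transformer Approach

Linus Scheibenreif, Anton Raichuk, Maxim Neumann

Affiliations * Google DeepMind

AI summary Proposes a spatio-temporal Transformer framework that models temporal trajectories and spatial neighborhoods simultaneously and uses Landsat and Sentinel-1/2 data to reconstruct forest disturbance maps of the United States for 1984-2022, achieving high precision on a manually annotated validation set and reducing spatial artifacts.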

English abstract

Accurate monitoring of forest disturbances is essential for understanding carbon dynamics and land management, yet traditional approaches typically rely on pixel-wise analysis of satellite time-series, ignoring spatial context. We present a deep learning framework that maps 38 years (1984-2022) of forest disturbance across the contiguous United States by modeling temporal trajectories and spatial neighborhoods simultaneously. By leveraging a vision transformer architecture, our approach effectively filters noise from weak supervision signals to produce spatially coherent disturbance maps. We perform exhaustive evaluations across multiple satellites (Landsat, Sentinel-1, Sentinel-2) and temporal windows (38 years and the more recent 6 years), validating performance against a novel, manually annotated validation dataset (n=300) and an independent fire perimeter dataset (n=706). The results highlight the complexity of the task: while our spatio-temporal model demonstrates high precision (up to 98.2% for ±1 year detection on MTBS and up to 71.3% on the CONUS validation datasets, with F1-scores up to 75.8% and 47.3%, respectively) and effectively reduces spatial artifacts, it exhibits performance trade-offs across different disturbance regimes compared to pixel-wise baselines. Our method offers a promising foundation for consistent forest monitoring.

2606.07244 2026-06-08 cs.RO cs.AI cs.CV New submission

Beyond Waypoints: A Trajectory-Centric Waypointing Paradigm for Vision-Language Navigation

Haoxiang Shi, Xiang Deng, Haoyu Zhang, Qiaohui Chu, Yaowei Wang, Liqiang Nie

Affiliations * Harbin Institute of Technology (Shenzhen); Pengcheng Laboratory

AI summary Proposes the Trajectory Waypoint paradigm, which predicts executable trajectories with a TSDF-guided diffusion policy to address unreachable waypoints and planning-control inconsistency in VLN-CE, outperforming the baselines on the benchmark.

English abstract

Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural-language instructions while navigating in real-world-like environments. Most VLN-CE approaches adopt a three-stage framework: a waypoint predictor proposes navigable waypoints, a navigator selects the best waypoint, and a low-level controller executes the movement to it. However, this decoupled paradigm often leads to unreachable waypoints or inconsistencies between planning and control. In this work, instead of predicting isolated waypoints, we introduce a novel paradigm called Trajectory Waypoint, which grounds each candidate waypoint in an executable trajectory. To realize this, we design a Trajectory Waypoint Predictor formulated as a TSDF-guided diffusion policy, which steers trajectory generation away from obstacles, inherently ensuring the reachability of the predicted waypoints. We further propose a trajectory-enhanced navigator that injects the associated trajectory as additional information for planning, enabling strict consistency between high-level semantic decisions and low-level execution. Extensive experiments on the VLN-CE benchmark show that our Trajectory Waypoint paradigm achieves superior performance over the baselines.

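2606.07240 2026-06-08 cs.CL cs.SD New submission

KIT's Submission to Cross-Lingual Voice Cloning in IWSLT 2026

Seymanur Akti, Alexander Waibel

Affiliations * Karlsruhe Institute of Technology (KIT); Carnegie Mellon University (CMU); KIT Campus Transfer (KCT)

AI summary To handle accent variation and domain-specific vocabulary in cross-lingual voice cloning, builds on the multilingual text-to-speech model FishAudio-S2-Pro and introduces language tag prompting, reinforcement learning fine-tuning, and reference-conditioned lexical matching, improving intelligibility and naturalness.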

English abstract

Cross-lingual voice cloning aims to generate speech in a target language while preserving speaker identity from a source-language reference. This task is central to speech translation and is the focus of the IWSLT 2026 Cross-Lingual Voice Cloning track. A key challenge is maintaining intelligibility and naturalness in the presence of accent variation and domain-specific vocabulary. We build on a multilingual text-to-speech model, FishAudio-S2-Pro, and introduce language tag prompting to improve language control and reduce accent leakage. We further apply reinforcement learning (RL) fine-tuning for task adaptation and observe improvements in intelligibility. Finally, we propose a reference-conditioned lexical matching method that improves pronunciation of domain-specific terms when lexical overlap is present. Results show that language prompting provides the largest gains, while lexical matching yields consistent improvements on matched subsets.

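2606.07239 2026-06-08 cs.LG New submission

Generative Molecular Morphing for Flexible-Size Design via Unbalanced Optimal Transport

Malte Franke, Stefan P. Schmid, Zarko Ivkovic, Kjell Jorner, Andreas Krause

Affiliations * ETH Zürich; NCCR Catalysis

AI summary To overcome the fixed atom count of existing diffusion and flow models, proposes Morph, a flexible-size molecular generative model based on unbalanced optimal transport for conditional and unconditional 3D molecular design; it matches existing performance while adding sampling flexibility and supports out-of-distribution generation.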

English abstract

The success of generative molecular design hinges on a model's steerability toward high-reward samples. Because many molecular properties are intrinsically linked to molecular size, accurately capturing the joint distribution of properties and the number of atoms is essential. However, current diffusion and flow-based models fix the number of atoms, which ultimately limits their ability to navigate this complex relationship. To address this, we introduce Morph, a flexible-size generative model for conditional and unconditional 3D molecular design based on geometric graphs. By dynamically adapting size, Morph can seamlessly integrate existing structural priors, like scaffolds, and significantly enhances property steering. We show that Morph matches current fixed-size state-of-the-art models while offering the benefit of unparalleled sampling flexibility. We demonstrate out-of-distribution generation in regimes where previous models fail, paving the way for enhanced generative modeling for molecular design.

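2606.07237 2026-06-08 cs.CL cs.AI cs.LG New submission

When Large Language Models Fail in Healthcare: Evaluating Sensitivity to Prompt Variations

Mahdi Alkaeed

Affiliations * Department of Computer Science and Engineering, Doha, Qatar

AI summary Systematically analyzes the sensitivity of general-purpose and medical-specific LLMs to prompt perturbations, finding that even minor wording changes can alter clinical advice and that adversarial prompts can elicit harmful outputs, which indicates these models are unreliable for clinical use.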

Comments 12 pages

English abstract

Large Language Models (LLMs) are increasingly used in healthcare for tasks such as clinical question answering, diagnosis support, and report summarization. Despite their promise, these models remain highly sensitive to subtle prompt perturbations, both lexical and syntactic, posing serious risks in safety-critical clinical applications. In this study, we conduct a systematic sensitivity analysis to evaluate the robustness of both general-purpose (e.g., GPT-3.5, Llama3) and medical-specific LLMs (e.g., ClinicalBERT, BioLlama3, BioBERT) using the MedMCQA benchmark. We categorize perturbations into natural and adversarial types and examine their effect on model consistency, accuracy, and reliability in clinical reasoning tasks. Our findings reveal that medical LLMs are not intrinsically safe. Even minor variations in phrasing can alter clinical advice, and targeted adversarial prompts can provoke harmful outputs. In high-stakes settings like healthcare, such unpredictability is unacceptable: models that change diagnoses due to reworded inputs or hallucinate medications when slightly rephrased cannot be reliably trusted by clinicians. While models tend to show resilience to simple lexical substitutions or paraphrasing, they often break down under syntactic reordering or misleading contextual cues. This fragility is evident across both general-purpose and domain-specific LLMs. Notably, adversarial manipulations can lead to clinically dangerous outputs, such as recommending incorrect dosages or omitting critical findings.

2606.07233 2026-06-08 cs.CV cs.LG cs.RO New submission

Does Appearance Help? A Systematic Study of Image-Based Re-Identification in Online 3D Multi-Pedestrian Tracking

Eduardo Borges, Luís Garrote, Urbano J. Nunes

Affiliations * Institute of Systems and Robotics, Department of Electrical and Computer Engineering, University of Coimbra

AI summary Systematically studies the role of image-based re-identification in online 3D multi-object tracking within a lightweight projection-based framework, and proposes a cascaded matching strategy that recovers occluded tracks and prevents identity switches at low latency.

Comments Accepted for publication at the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026)

详情
AI中文摘要

基于LiDAR的3D多目标追踪通常仅依赖几何信息,这在长时间遮挡或拥挤人群环境中往往不足以区分目标。虽然集成基于RGB的重识别提供了保持身份上下文的理论解决方案,但现有方法通常依赖计算昂贵的并行检测器,阻碍了机器人的实时响应。本文通过利用轻量级投影框架解耦移动机器人的几何和外观建模,对在线3D多目标追踪中的基于图像的重识别进行了系统研究。对特征提取架构进行了全面分析,采用轻量级CNN和视觉Transformer,并评估了多种多模态数据关联策略以平衡计算延迟和鲁棒追踪。在KITTI数据集的行人类别上的实验表明,外观和运动成本的朴素线性融合由于视觉噪声而降低了性能。相反,级联匹配策略成功恢复了被遮挡的轨迹而不损害整体精度,有效防止了身份切换以维持人机交互的连续性。我们表明,轻量级架构可以在安全导航所需的低延迟和社交意识所需的判别能力之间提供最优权衡。

英文摘要

LiDAR-based 3D Multi-Object Tracking (MOT) typically relies solely on geometric information, which is often insufficient to distinguish between targets during prolonged occlusions or in crowded human-populated environments. While integrating RGB-based Re-Identification (ReID) offers a theoretical solution for preserving identity context, existing approaches often rely on computationally expensive parallel detectors that hinder real-time robot responsiveness. This work presents a systematic study of image-based ReID in online 3D MOT, utilizing a lightweight projection-based framework to decouple geometric and appearance modeling for mobile robots. A comprehensive analysis of feature extraction architectures is conducted, employing lightweight CNNs and Vision Transformers, and evaluating various multi-modal data association strategies to balance computational latency with robust tracking. Experiments on the Pedestrian class of the KITTI dataset reveal that naive linear fusion of appearance and motion costs degrades performance due to visual noise. Conversely, a cascaded matching strategy successfully recovers occluded tracks without compromising overall precision, effectively preventing identity switches to maintain human-robot interaction continuity. We show that lightweight architectures can offer an optimal trade-off between the low latency required for safe navigation and the discriminative power needed for social awareness.
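
The contrast between linear fusion and cascaded matching can be illustrated with a minimal sketch. It is not the authors' implementation: the stage order (motion first, appearance second), the greedy matcher, the gate values, and the `pos`/`feat` field names are all assumptions made for illustration. The idea shown is that appearance is consulted only for tracks and detections that motion alone could not associate, so noisy visual features cannot disturb confident geometric matches.

```python
import math

def greedy_match(cost, rows, cols, max_cost):
    """Greedily pair rows with cols in order of increasing cost, up to max_cost."""
    pairs = sorted((cost(r, c), r, c) for r in rows for c in cols)
    matches, used_rows, used_cols = [], set(), set()
    for value, r, c in pairs:
        if value <= max_cost and r not in used_rows and c not in used_cols:
            matches.append((r, c))
            used_rows.add(r)
            used_cols.add(c)
    return matches

def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.hypot(*u) * math.hypot(*v))

def cascaded_association(tracks, detections, motion_gate, appearance_gate):
    """Stage 1: motion-only matching. Stage 2: appearance on the leftovers."""
    track_ids = list(range(len(tracks)))
    det_ids = list(range(len(detections)))

    def motion(t, d):
        return math.dist(tracks[t]["pos"], detections[d]["pos"])

    def appearance(t, d):
        return cosine_distance(tracks[t]["feat"], detections[d]["feat"])

    stage1 = greedy_match(motion, track_ids, det_ids, motion_gate)
    matched_t = {t for t, _ in stage1}
    matched_d = {d for _, d in stage1}
    left_t = [t for t in track_ids if t not in matched_t]
    left_d = [d for d in det_ids if d not in matched_d]
    stage2 = greedy_match(appearance, left_t, left_d, appearance_gate)
    return stage1, stage2

# Hypothetical scene: track 1 was occluded, so its last known position is stale.
tracks = [{"pos": (0.0, 0.0), "feat": (1.0, 0.0)},
          {"pos": (5.0, 5.0), "feat": (0.0, 1.0)}]
detections = [{"pos": (0.2, 0.0), "feat": (1.0, 0.0)},
              {"pos": (9.0, 9.0), "feat": (0.0, 1.0)}]
stage1, stage2 = cascaded_association(tracks, detections,
                                      motion_gate=1.0, appearance_gate=0.3)
```

In this toy scene the nearby pedestrian is associated by motion alone, while the previously occluded track falls outside the motion gate and is recovered in the second stage through its appearance feature.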

2606.07229 2026-06-08 cs.SD cs.CL cs.MM 新提交

MMAE: A Massive Multitask Audio Editing Benchmark

MMAE:大规模多任务音频编辑基准

Ziyang Ma, Ruiqi Yan, Ruiyang Xu, Jie Fang, Zhikang Niu, Yi-Wen Chao, Wenming Tu, Tianrui Wang, Auden, Qi Chen, Wenxi Chen, Jiaying Chi, Yanru Huo, Zixuan Jiang, Xiquan Li, Yalin Li, Junxi Liu, Minghao Liu, Binghao Qiang, Yijia Shan, Zheshu Song, Tian Tan, Zixiang Wang, Zeyu Xie, Zhifei Xie, Xiaoyu Xing, Qixiang Xu, Chen Yang, Guanrou Yang, Shan Yang, Yifan Yang, Steve Yves, Haotian Zhang, Haina Zhu, Kai Yu, Liefeng Bo, Eng-Siong Chng, Xie Chen

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai Innovation Institute(上海创新研究院) Nanyang Technological University(南洋理工大学) Hunyuan Team, Tencent(腾讯 Hunyuan 团队) Tianjin University(天津大学) Fudan University(复旦大学)

AI总结 提出首个面向通用指令音频编辑的综合评估基准MMAE,涵盖7种音频模态、6级任务复杂度和8种操作类型,通过2000个样本和基于评分标准的评估框架揭示当前模型在精确执行和结构鲁棒性上的严重不足。

Comments Open-Source at https://github.com/ddlBoJack/MMAE

详情
AI中文摘要

我们引入了MMAE,一个大规模多任务音频编辑基准,作为首个专为通用指令式音频编辑设计的综合评估测试平台。受智能创作趋势的推动,交互式编辑已从视觉领域(如图像领域的Nano-banana 2和视频领域的Gemini-Omni)迅速扩展到音频领域。然而,当前的评估基础设施严重滞后,仍然高度碎片化且局限于特定子领域或基本操作。与现有范围有限的基准不同,MMAE扩展到广泛的实际场景,涵盖7种不同的音频模态,包括声音、语音、音乐及其混合。此外,我们建立了一个全面的分类体系,涵盖6级任务复杂度(从基本修改到多跳推理和多轮编辑)、2级粒度以及8种不同的操作类型。通过人机协作精心策划,MMAE包含2000个高保真样本,并配以开创性的基于评分标准的评估框架。通过将自由形式任务分解为17,741个可验证的标准,这种稳健的基于评分标准的范式能够对指令遵循和上下文一致性进行精确的多维评估。我们对领先模型的广泛评估表明,当前系统远未实现可靠的编辑。令人惊讶的是,精确匹配率(EMR)始终低于5%,在复杂的混合模态任务中更是骤降至绝对的0%,暴露了精确执行和结构鲁棒性方面的关键瓶颈。我们希望MMAE能够成为智能创作社区未来进步的催化剂,提供清晰的诊断路线图,并为下一代音频编辑系统建立标准化、持久的评估范式。

英文摘要

We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation, interactive editing has rapidly expanded from visual domains, pioneered by models like Nano-banana 2 for images and Gemini-Omni for video, into audio. However, the current evaluation infrastructure lags severely, remaining highly fragmented and restricted to specific subdomains or basic operations. Unlike existing benchmarks that are limited in scope, MMAE extends to a broad spectrum of real-world scenarios, encompassing 7 distinct audio modalities, including sound, speech, music, and their mixtures. Furthermore, we establish a comprehensive taxonomy spanning 6 levels of task complexity, from basic modifications to multi-hop reasoning and multi-round editing, 2 levels of granularity, and 8 distinct operation types. Meticulously curated through human-agent collaboration, MMAE comprises 2,000 high-fidelity samples paired with a pioneering rubric-based evaluation framework. By decomposing free-form tasks into 17,741 verifiable criteria, this robust rubric-based paradigm enables a precise, multi-dimensional assessment of both instruction following and context consistency. Our extensive evaluation of leading models reveals that current systems remain far from achieving reliable edits. Strikingly, the Exact Match Rate (EMR) consistently falls below 5% and plummets to an absolute 0% in complex, mixed-modality tasks, exposing critical bottlenecks in precise execution and structural robustness. We hope MMAE will serve as a catalyst for future advances in the intelligent creation community, providing a clear diagnostic roadmap and establishing a standardized, long-lasting evaluation paradigm for next-generation audio editing systems.

2606.07222 2026-06-08 cs.CV cs.AI 新提交

DualGate-Net: A Prior-Gated Dual-Encoder Framework for Histopathology Cell Detection

DualGate-Net: 用于组织病理学细胞检测的先验门控双编码器框架

Bahman Jafari Tabaghsar, Son Tran, K. Devaraja, Atul Sajjanhar

发表机构 * School of Information Technology, Deakin University(迪肯大学信息技术学院) Kasturba Medical College, Manipal Academy of Higher Education(马尼帕尔高等教育学院卡斯图尔巴医学院)

AI总结 提出DualGate-Net,通过可学习的先验门控融合机制自适应调节组织先验影响,结合局部和全局编码器及辅助分支,在OCELOT基准上实现稳健的细胞检测。

Comments 15 pages, 4 figures

详情
AI中文摘要

组织病理学图像中的细胞检测强烈依赖于周围组织背景,其中视觉上相似的细胞在不同微环境下可能属于不同类别。最近的感知组织方法结合了上下文先验,但通常依赖于可能传播噪声信息的静态融合策略。在这项工作中,我们提出了DualGate-Net,一种先验感知的双编码器框架,通过可学习的先验门控融合机制结合了基于ConvNeXtV2的局部编码器和基于SegFormer的全局编码器。所提出的模块自适应地调节组织先验在空间位置上的影响,同时一个辅助的前景重建分支在训练过程中保留高频细胞结构。此外,还引入了辅助的细胞性引导线索以进一步提高定位鲁棒性。在OCELOT基准上的实验表明,该方法在验证集上取得了0.7722的宏F1分数,在测试集上取得了0.7345的宏F1分数,突显了自适应先验整合对于稳健的组织病理学细胞检测的有效性。

英文摘要

Cell detection in histopathology images strongly depends on surrounding tissue context, where visually similar cells may belong to different classes under different microenvironments. Recent tissue-aware methods incorporate contextual priors, but often rely on static fusion strategies that may propagate noisy information. In this work, we propose DualGate-Net, a prior-aware dual-encoder framework that combines a ConvNeXtV2-based local encoder and a SegFormer-based global encoder through a learnable prior-gated fusion mechanism. The proposed module adaptively regulates the influence of tissue priors across spatial locations, while an auxiliary foreground reconstruction branch preserves high-frequency cellular structures during training. In addition, auxiliary cellness-guided cues are incorporated to further improve localization robustness. Experiments on the OCELOT benchmark demonstrate consistent improvements, achieving macro F1-scores of 0.7722 on the validation set and 0.7345 on the test set, highlighting the effectiveness of adaptive prior integration for robust histopathology cell detection.

2606.07219 2026-06-08 cs.CL cs.SI 新提交

Adversarial Creation and Detection of AI-Generated Social Bot Content

AI生成的社交机器人内容的对抗性创建与检测

Mykola Trokhymovych, Ricardo Baeza-Yates, Alessandro Flammini, Diego Saez-Trumper, Filippo Menczer

发表机构 * Universitat Pompeu Fabra(庞培法布拉大学) Observatory on Social Media, Indiana University(印第安纳大学社交媒体观测站) KTH Royal Institute of Technology(瑞典皇家理工学院)

AI总结 提出对抗性方法模拟恶意用户冒充真人,构建多语言跨平台配对数据集,训练检测模型显著优于现有方法。

详情
AI中文摘要

大型语言模型与社交机器人的结合使得恶意行为者能够通过大规模生成类人内容来操纵信息生态系统。现有的AI生成内容检测模型在真实场景中常常失效,主要原因是缺乏真实标注数据。我们通过一种对抗性方法弥补了这一空白,该方法模拟了恶意行为者对真实社交媒体用户的冒充。利用这种方法,我们整理了一个多语言、跨平台的人类与AI生成消息的配对数据集。在这样的对抗性数据上训练,能够实现对AI生成文本的准确检测。我们的方法在真实世界、分布外数据上显著优于现有的基于内容的机器人检测模型。

英文摘要

The convergence of large language models and social bots allows malicious actors to manipulate the information ecosystem by generating human-like content at scale. Existing models for detecting AI-generated content often fail in the wild, primarily due to the lack of ground-truth data. We address this gap through an adversarial methodology that models the impersonation of real social media users by malicious actors. Using this methodology, we curate a multilingual, cross-platform dataset of paired human and AI-generated messages. Training on such adversarial data yields accurate detection of AI-generated text. Our approach significantly outperforms existing models for content-based bot detection on real-world, out-of-distribution data.

2606.07217 2026-06-08 cs.RO cs.CV cs.LG 新提交

Robotic Policy Adaptation via Weight-Space Meta-Learning

通过权重空间元学习实现机器人策略自适应

Christian Bianchi, Siamak Yousefi, Alessio Sampieri, Andrea Roberti, Luca Rigazio, Fabio Galasso, Luca Franco

发表机构 * ItalAI University of Verona(维罗纳大学) Sapienza University of Rome(罗马第一大学)

AI总结 提出WIZARD框架,通过权重空间元学习从语言指令和演示视频生成任务特定LoRA参数,无需微调即可适应新任务,在LIBERO上性能提升高达14倍。

详情
AI中文摘要

视觉-语言-动作(VLA)模型正成为机器人操作的一种有前景的范式,能够从大规模演示和动作标签语料库中训练通用策略。然而,将这些模型适应新任务通常仍需要任务特定的演示、动作注释和额外的微调,使得部署成本高昂且难以扩展。我们提出WIZARD,一种权重空间元学习框架,通过为冻结的VLA策略生成任务特定的LoRA参数来避免任务特定的微调。仅凭语言指令和简短的演示视频,WIZARD即可在单次前向传播中预测相应的自适应权重,无需目标任务动作标签或测试时优化。在元训练期间,WIZARD学习将任务证据直接映射到专家LoRA更新,在权重空间中捕获任务之间的关系。在LIBERO上的实验表明,WIZARD在未见过的数据集集合上性能提升高达约2倍,在未见过的任务上提升高达约14倍。在Franka Emika Panda机器人上,WIZARD持续优于真实域自适应基线,表明生成的适配器提供了超越仿真的任务级特化。

英文摘要

Vision-Language-Action (VLA) models are emerging as a promising paradigm for robotic manipulation, enabling general-purpose policies trained from large corpora of demonstrations and action labels. However, adapting these models to new tasks still typically requires task-specific demonstrations, action annotations, and additional fine-tuning, making deployment costly and difficult to scale. We propose WIZARD, a weight-space meta-learning framework that sidesteps task-specific fine-tuning by generating task-specific LoRA parameters for a frozen VLA policy. Given only a language instruction and a short demonstration video, WIZARD predicts the corresponding adaptation weights in a single forward pass, without target-task action labels or test-time optimization. During meta-training, WIZARD learns to map task evidence directly to expert LoRA updates, capturing relationships between tasks in weight space. Experiments on LIBERO show that WIZARD improves performance by up to ~2x on unseen dataset collections and up to ~14x on unseen tasks. On a Franka Emika Panda, WIZARD consistently improves over a real-domain adapted baseline, showing that generated adapters provide task-level specialization beyond simulation.
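
For readers unfamiliar with LoRA, the sketch below shows how a pair of low-rank factors modifies a frozen weight matrix without changing the frozen parameters. It uses the standard LoRA convention `W + (alpha / rank) * B @ A`. Whether WIZARD applies exactly this scaling is not stated in the abstract, and the matrices here are hand-written stand-ins for what a generator would predict, so treat every value as hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def apply_lora(frozen_w, lora_b, lora_a, alpha):
    """Return frozen_w + (alpha / rank) * lora_b @ lora_a as a new matrix."""
    rank = len(lora_a)                # lora_a has shape (rank, in_features)
    delta = matmul(lora_b, lora_a)    # shape (out_features, in_features)
    scale = alpha / rank
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(frozen_w, delta)]

# Hypothetical 2x2 frozen weight and a rank-1 adapter.
frozen = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]      # (out_features=2, rank=1)
A = [[0.5, -0.5]]       # (rank=1, in_features=2)
adapted = apply_lora(frozen, B, A, alpha=2.0)
```

Because the adapter is a separate low-rank term, swapping tasks only requires swapping `A` and `B`, while the frozen policy weights stay untouched.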

2606.07211 2026-06-08 cs.RO cs.AI 新提交

An Abstract Architecture for Explainable Autonomy in Hazardous Environments

危险环境中可解释自主性的抽象架构

Matt Luckcuck, Hazel M Taylor, Marie Farrell

发表机构 * Maynooth University(梅诺斯大学) University of Manchester(曼彻斯特大学)

AI总结 提出一种支持自主系统解释其行为的抽象架构,旨在通过设计可解释性增强用户信任,并以民用核工业为例展示应用。

Comments Originally published 20th of October 2022 at the Second International Workshop on Requirements Engineering for Explainable Systems (RE4ES), which was hosted by the International Requirements Engineering Conference 2022

详情
AI中文摘要

自主机器人系统被提议用于危险环境,通常是为了减少人类工人的风险。在不久的将来,人类工人可能会继续使用和指挥这些自主机器人,就像其他计算机化工具一样,但具有更复杂的决策能力。因此,工程努力的一个重要方向是确保这些用户信任系统。最近的文献表明,可解释性与系统的可信度密切相关。与安全性和保密性属性一样,可解释性应该被设计到系统中,而不是事后添加。本文提出了一种抽象架构,支持自主系统解释其行为(可解释自主性),为实施可解释自主系统提供了设计模板。我们给出了一个工作示例,说明我们的架构如何应用于民用核工业,其中工人和监管机构都需要信任系统的决策能力。

英文摘要

Autonomous robotic systems are being proposed for use in hazardous environments, often to reduce the risks to human workers. In the immediate future, it is likely that human workers will continue to use and direct these autonomous robots, much like other computerised tools but with more sophisticated decision-making. Therefore, one important area on which to focus engineering effort is ensuring that these users trust the system. Recent literature suggests that explainability is closely related to how trustworthy a system is. Like safety and security properties, explainability should be designed into a system, instead of being added afterwards. This paper presents an abstract architecture that supports an autonomous system explaining its behaviour (explainable autonomy), providing a design template for implementing explainable autonomous systems. We present a worked example of how our architecture could be applied in the civil nuclear industry, where both workers and regulators need to trust the system's decision-making capabilities.

2606.07210 2026-06-08 cs.SD cs.CR 新提交

A Large-Scale Per-Speaker Analysis of Re-identification Risk in Speech Anonymization

语音匿名化中重识别风险的大规模每说话人分析

Orane Dufour, Paul Magron, Mickael Rouvier, Emmanuel Vincent

发表机构 * Université de Lorraine, CNRS, Inria, LORIA(洛林大学、国家科学研究中心、法国国家信息与自动化技术研究院、LORIA实验室) LIA, Avignon University(阿维尼翁大学LIA实验室)

AI总结 通过大规模每说话人分析,发现语音匿名化中重识别风险在个体间差异巨大,且风险由攻击者、匿名化器和可用语音量共同决定,挑战了固有说话人隐私风险的概念。

Comments Accepted to Interspeech

详情
AI中文摘要

语音匿名化通常使用平均情况指标(如等错误率)进行评估,这可能会掩盖个体间重识别风险的巨大差异。在本文中,我们基于最坏情况下的可链接性度量,进行了大规模每说话人隐私分析。评估了近5000名说话人在多个匿名化系统、攻击者架构和对话长度下的表现。虽然可链接性分数在说话人层面上高度极化,但易于重识别和难以重识别的说话人集合在不同配置下差异显著。我们表明,没有单一因素可以解释说话人的脆弱性。相反,重识别风险源于攻击者、匿名化器和可用语音量之间的相互作用。这些结果挑战了固有说话人级隐私风险的概念,并强调需要明确以攻击者和匿名化器为条件的评估协议。

英文摘要

Speech anonymization is commonly evaluated using average-case metrics such as the equal error rate, which can hide large disparities in re-identification risks across individuals. In this paper, we conduct a large-scale per-speaker privacy analysis using a linkability-based metric under a worst-case scenario. Nearly 5,000 speakers are evaluated across multiple anonymization systems, attacker architectures, and conversation lengths. While linkability scores are highly polarized at the speaker level, the sets of easy-to-re-identify and hard-to-re-identify speakers vary substantially across configurations. We show that no single factor explains speaker vulnerability. Instead, the re-identification risk emerges from the interaction between the attacker, the anonymizer, and the amount of available speech. These results challenge the notion of intrinsic speaker-level privacy risks and emphasize the need for evaluation protocols that are explicitly conditioned on the attacker and anonymizer.

2606.07207 2026-06-08 cs.SD cs.LG eess.AS 新提交

Entropy as a Structural Prior: How a Log-Barrier on DiT Belief Space Drives Musical Diversity and Development

熵作为结构先验:DiT信念空间上的对数障碍如何驱动音乐多样性与发展

Zixi Li, Youzhen Li

发表机构 * Sun Yat-sen University(中山大学) Datawhale

AI总结 提出Eisbach对数障碍,利用DiT输出空间能量分布的熵作为权重,在监督扩散训练中通过调节梯度步长促进音乐主题发展、声学区分和纹理多样性,避免模式崩溃。

详情
AI中文摘要

基于置信度的损失加权通常在生成模型中被避免,因为当模型自信地错误时会加速误差,但这种直觉在监督扩散训练中不成立。我们引入了Eisbach对数障碍,一种无参数权重,源自DiT输出空间能量分布的熵:高熵抑制梯度,低熵保留梯度。将其应用于Stable Audio 3 Medium在MusicCaps上的LoRA微调,意外地产生了比未加权训练更强的主题发展、更清晰的声学区分和更高的纹理多样性,这与模式崩溃相反。这是因为在监督扩散中,梯度方向锁定于真实值,因此置信度仅缩放步长,并且因为时间熵对平坦样本降权而保留高对比度样本。结果是一个在线、自引用的数据课程,完全从前向传播中涌现,并分析了噪声级动态和可测试的预测。

英文摘要

Confidence-based loss weighting is usually avoided in generative models because it accelerates errors when the model is confidently wrong, but this intuition breaks down in supervised diffusion training. We introduce the Eisbach log-barrier, a parameter-free weight derived from the entropy of the DiT output's spatial energy distribution: high entropy damps the gradient, while low entropy preserves it. Applied to LoRA fine-tuning of Stable Audio 3 Medium on MusicCaps, it unexpectedly yields stronger thematic development, clearer acoustic differentiation, and higher textural diversity than unweighted training, the opposite of mode collapse. This works because in supervised diffusion the gradient direction is locked to ground truth, so confidence only scales the step size, and because temporal entropy downweights flat samples while preserving high-contrast ones. The result is an online, self-referential data curriculum that emerges purely from the forward pass, with analyzed noise-level dynamics and testable predictions.

2606.07196 2026-06-08 cs.LG 新提交

Structure-Preserving Correction Learning for Sparse Bayesian Inference in Brain Source Imaging

脑源成像中稀疏贝叶斯推断的结构保持校正学习

Marco Morik, Xiao Ruiting, Shinichi Nakajima, Stefan Haufe, Ismail Huseynov

发表机构 * Berlin Institute for the Foundations of Learning and Data (BIFOLD)(柏林学习与数据基础研究所(BIFOLD)) Technische Universität Berlin(柏林技术大学) RIKEN Center for Advanced Intelligence Project (AIP)(理化学研究所先进智能项目中心(AIP)) Physikalisch-Technische Bundesanstalt(物理技术联邦机构) Charité – Universitätsmedizin Berlin(柏林夏里特大学医学院)

AI总结 提出一种结构保持的校正学习方法,通过展开经典联合超参数求解器为可训练神经网络,在保留贝叶斯结构的同时学习更新机制,提升M/EEG脑源成像的重建性能和收敛性。

Comments preprint

详情
AI中文摘要

经典的稀疏Type-II贝叶斯方法用于M/EEG脑成像支持源和噪声超参数的联合估计,但依赖于固定的迭代更新规则。尽管这些更新是有原则且可解释的,但其动态无法从数据中适应。我们提出学习更新机制本身,同时通过将经典联合超参数求解器展开为可训练的神经架构(其层镜像原始迭代)来保留底层贝叶斯结构。得到的框架初始化为在训练前精确恢复经典求解器,并通过逐渐更具表达力的校正学习机制(从可学习偏置到自适应MLP和基于注意力的上下文细化)得到丰富。这样,训练不会用黑箱预测器替代贝叶斯推断,而是学习结构化的校正项,同时保留原始更新动态的可解释性和基于模型的特性。因此,结构保持校正学习旨在改善经验重建性能,而不替代原始的基于模型的推断机制。实验结果表明,学习的校正变体在保留算法透明性的同时,改善了基线展开求解器的重建性能和收敛行为。

英文摘要

Classical sparse Type-II Bayesian methods for M/EEG brain imaging support joint estimation of source and noise hyperparameters, but rely on fixed iterative update rules. Although these updates are principled and interpretable, their dynamics cannot be adapted from data. We propose to learn the update mechanism itself while preserving the underlying Bayesian structure by unfolding a classical joint hyperparameter-learning solver into a trainable neural architecture whose layers mirror the original iterations. The resulting framework is initialized to recover the classical solver exactly before training and is enriched through progressively more expressive correction-learning mechanisms, ranging from learnable biases to adaptive MLP and attention-based contextual refinements. In this way, training does not replace Bayesian inference with a black-box predictor, but instead learns structured correction terms while retaining the interpretability and model-based character of the original update dynamics. Structured correction learning therefore aims to improve empirical reconstruction performance without replacing the original model-based inference mechanism. Experimental results show that the learned correction variants improve reconstruction performance and convergence behavior over the baseline unfolded solver while preserving its algorithmic transparency.

2606.07193 2026-06-08 cs.RO 新提交

Shield-Loco: Shielding Locomotion Policies with Predictive Safety Filtering

Shield-Loco:基于预测性安全过滤的防护运动策略

Aditya Shirwatkar, Sebastian Sanokowski, Shishir Kolathaya, Aaron Johnson, Majid Khadiv

发表机构 * Robert Bosch Center for Cyber Physical Systems(罗伯特·博世网络物理系统中心) Indian Institute of Science(印度科学研究院) Munich Institute of Robotics and Machine Intelligence (MIRMI)(慕尼黑机器人与机器智能研究所(MIRMI)) Technical University of Munich(慕尼黑工业大学) Department of Computer Science & Automation(计算机科学与自动化系) Department of Mechanical Engineering(机械工程系) Carnegie Mellon University(卡内基梅隆大学) Institute for Advanced Study(高等研究院)

AI总结 提出一种预测性安全过滤器,通过全物理模型优化接触序列,减少四足机器人在密集杂乱环境中的安全违规,同时保持任务性能。

详情
AI中文摘要

强化学习(RL)策略能够实现动态腿部运动,但缺乏避免训练中未出现的约束违反的机制。大规模离线安全学习对于覆盖所有边缘情况是不切实际的。现有的安全框架要么依赖无法推理全身行为的降阶模型,要么需要保守的恢复控制器,这会降低任务性能。我们提出一种预测性安全过滤器,它对输入到RL策略的名义接触位置进行事后过滤。当预测到碰撞时,基于采样的优化器使用全物理模型异步搜索更安全的接触序列,而学习的价值函数则引导长期回报。我们的三个算法组件(采样接触的几何投影、动量增强更新和副本交换)使得在不连续的接触景观中优化变得可行。我们在密集杂乱环境中的四足机器人上验证了该过滤器,无论是在仿真还是真实世界中,都显示出在最小偏离名义输入的情况下大幅减少安全违规。

英文摘要

Reinforcement learning (RL) policies enable dynamic legged locomotion but lack mechanisms to avoid violations of safety constraints that are absent during training. Large-scale offline safe learning is impractical for covering all edge cases. Existing safety frameworks either rely on reduced-order models that cannot reason about whole-body behaviors or require conservative recovery controllers that degrade task performance. We propose a predictive safety filter that post-hoc filters the nominal contact locations fed to the RL policy. When a collision is predicted, a sampling-based optimizer asynchronously searches for safer contact sequences using a full-physics model, while a learned value function bootstraps long-horizon returns. Our three algorithmic components (geometric projection of sampled contacts, momentum-augmented updates, and replica-exchange) make the optimization tractable in a discontinuous contact landscape. We validate the filter on a quadruped robot in dense, cluttered environments, both in simulation and in the real world, showing substantial reductions in safety violations with minimal deviation from the nominal input.

2606.07190 2026-06-08 cs.CL 新提交

From Correctness to Utility: Gain-Based Prefix Evaluation for LLM Reasoning

从正确性到效用:基于增益的LLM推理前缀评估

Yuhang Zhou, Yixin Cao, Guangnan Ye

发表机构 * Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院)

AI总结 提出前缀增益概念,训练前缀效用模型(PUM)通过成对排序目标评估推理前缀对成功率的提升,在数学推理任务中优于传统正确性评估。

详情
AI中文摘要

推理前缀塑造了LLM问题求解的未来轨迹,然而现有的过程奖励模型通常通过局部步骤正确性来评估它们。我们认为正确性是最终关心效果的有用但间接的代理:即前缀是否增加了成功完成的概率。我们将此效果定义为前缀增益,即通过在一个前缀上条件化一组轻量级学生模型所带来的求解率提升,并使用简单的成对排序目标训练前缀效用模型(PUM)。PUM学习基于结果的前缀效用,并能对完整轨迹和部分推理前缀进行评分。在数学推理的Best-of-$N$选择、束搜索和强化学习中,PUM提供了强大的前缀级监督信号,尤其是在候选池大、搜索预算增加或基于规则的奖励稀疏时。我们在 https://zhiqix.github.io/pum-project-page 发布所有数据、模型和代码。

英文摘要

Reasoning prefixes shape the future trajectory of LLM problem solving, yet existing process reward models usually evaluate them through local step correctness. We argue that correctness is a useful but indirect proxy for the effect we ultimately care about: whether a prefix increases the probability of successful completion. We define this effect as prefix gain, the solve-rate improvement induced by conditioning a group of lightweight student models on a prefix, and use it to train a Prefix Utility Model (PUM) with a simple pairwise ranking objective. PUM learns outcome-grounded prefix utility and can score both complete trajectories and partial reasoning prefixes. Across Best-of-$N$ selection, beam search, and reinforcement learning on mathematical reasoning, PUM provides a strong prefix-level supervision signal, especially when candidate pools are large, search budgets increase, or rule-based rewards are sparse. We release all data, models, and code at https://zhiqix.github.io/pum-project-page.
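
The two quantities named in the abstract, prefix gain and a pairwise ranking objective, can be sketched as follows. This is an illustration under assumptions rather than the paper's code: the rollout outcomes are invented, and the logistic (Bradley-Terry style) form of the pairwise loss is one common choice that the abstract does not confirm.

```python
import math

def solve_rate(outcomes):
    """Fraction of successful completions (1 = solved, 0 = failed)."""
    return sum(outcomes) / len(outcomes)

def prefix_gain(with_prefix, without_prefix):
    """Solve-rate improvement from conditioning the student models on a prefix."""
    return solve_rate(with_prefix) - solve_rate(without_prefix)

def pairwise_ranking_loss(score_better, score_worse):
    """Logistic pairwise loss: small when the higher-gain prefix scores higher."""
    return math.log1p(math.exp(-(score_better - score_worse)))

# Hypothetical rollouts from a group of student models on one problem.
baseline = [0, 0, 1, 0]   # no prefix: 1 of 4 solved
prefix_a = [1, 1, 1, 0]   # with prefix A: 3 of 4 solved
prefix_b = [0, 1, 0, 0]   # with prefix B: 1 of 4 solved

gain_a = prefix_gain(prefix_a, baseline)   # 0.5
gain_b = prefix_gain(prefix_b, baseline)   # 0.0
```

Since `gain_a > gain_b`, prefix A would be the preferred item of the training pair, and a utility model would be penalized through `pairwise_ranking_loss` whenever it scores B above A. Note that a prefix can be locally correct yet have zero gain, which is the distinction the abstract draws between correctness and utility.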

2606.07186 2026-06-08 cs.RO cs.SE 新提交

A Causal Probabilistic Framework for Perception-Informed Closed-Loop Simulation of Autonomous Driving

面向感知信息闭环仿真的自动驾驶因果概率框架

Zhennan Fei, Rickard Johansson, Mikael Andersson, Matthias Eng, Mattias Eriksson, Kaveh Kianfar, Sadegh Rahrovani, Chris van der Ploeg, Michael Borth, Maren Buermann, Michiel Braat, Henk Goossens, Zijian Han, Majid Khorsand Vakilzadeh, Gabriel Rodrigues de Campos

发表机构 * ETH Zürich(苏黎世联邦理工学院) KTH Royal Institute of Technology(皇家理工学院) University of Science and Technology of China(中国科学技术大学)

AI总结 提出一种因果概率模型框架,将感知误差注入标准仿真环境,揭示理想SIL无法捕获的潜在风险,为SOTIF验证提供可扩展路径。

详情
AI中文摘要

软件在环(SIL)仿真是现代汽车安全功能验证的基石。然而,许多当前框架采用理想感知,绕过了感知算法的功能不足,导致过于乐观的安全评估。本文提出一种感知信息SIL测试方法,弥合了地面实况仿真与真实世界感知行为之间的差距。我们提出了一个将因果概率模型纳入标准化、基于场景的仿真工具链的框架,适用于高级驾驶辅助系统(ADAS)和自动驾驶系统(ADS)。我们的方法能够系统性地注入由物理触发条件(如雾、雨和物体合并场景)导出的真实感知误差,例如检测丢失、尺寸不准确和定位偏移。通过在标准化仿真环境中评估这些“故障”,我们证明了感知信息测试揭示了理想SIL环境无法捕获的潜在操作风险,为SOTIF(ISO 21448)验证提供了可扩展的途径。

英文摘要

Software-in-the-loop (SIL) simulation is a cornerstone for the validation of modern automotive safety functions. However, many current frameworks utilize ideal sensing, which bypasses the functional insufficiencies of perception algorithms, leading to over-optimistic safety assessments. This paper proposes a perception-informed SIL testing methodology that bridges the gap between ground-truth simulation and real-world perception behavior. We present a framework for incorporating causal probabilistic models into standardized, scenario-based simulation toolchains, applicable to both Advanced Driver Assistance Systems (ADAS) and Autonomous Driving Systems (ADS). Our approach enables the systematic injection of realistic perception errors, such as loss of detection, sizing inaccuracies, and positioning offsets, derived from physical triggering conditions like fog, rain, and object-merging scenarios. By evaluating these "faults" within a standardized simulation environment, we demonstrate that perception-informed testing reveals latent operational risks that ideal SIL environments fail to capture, providing a scalable pathway for SOTIF (ISO 21448) validation.

2606.07185 2026-06-08 cs.CV 新提交

AdaTok: Self-Budgeting Image Tokenization with Quality-Preserving Dynamic Tokens

AdaTok: 具有质量保持动态令牌的自预算图像令牌化

Xiaocheng Lu, Yuxi Chen, Jie Zhang, Jian Liu, Jingcai Guo, Fangqi Zhu, Tao Han, Song Guo

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) The Hong Kong Polytechnic University(香港理工大学)

AI总结 提出AdaTok,一种自预算离散一维令牌化器,通过表示-分配协同设计(优先表示学习和自适应令牌分配)实现图像自适应令牌数量,在保持重建质量的同时减少平均令牌数。

Comments Preprint; 11 pages, 4 figures

详情
AI中文摘要

图像令牌化器,从二维网格到最近的一维序列,通常用相同固定数量的令牌编码每张图像。然而视觉复杂度高度异质,因此统一预算在简单输入上过度开销,在复杂输入上不足。现有的弹性令牌化器暴露了可变长度重建,但通常将令牌长度作为部署时的操作点、搜索目标或外部预测,而非令牌化器本身的输出。在这项工作中,我们询问离散视觉令牌化器能否一次性自我预算。我们的核心发现是,可操作的弹性需要表示-分配协同设计:前缀必须在不同预算下保持可解码,且令牌化器必须学习每个图像需要哪个前缀。我们提出AdaTok,一种自预算离散一维令牌化器。AdaTok结合了优先表示学习(通过嵌套尾部掩码对令牌排序,并通过多头LoRA解码器头解决预算依赖的语义偏移)和自适应令牌分配(在候选预算上训练轻量级确定性组GRPO策略)。动态帕累托加权在策略训练期间平衡保真度和效率,无需手动权衡扫描。在ImageNet-1K上,AdaTok-Full在256个令牌时达到rFID 1.31,而AdaTok-Adaptive平均仅使用约118个令牌达到rFID 1.50,在可比预算下优于离散一维基线。在自回归图像生成中,较短的适应性表示相比固定256令牌解码实现了约2.1倍的吞吐量,表明视觉令牌数量可以学习为内容条件输出,而非设置为固定超参数。

英文摘要

Image tokenizers, from 2D grids to recent 1D sequences, typically encode every image with the same fixed number of tokens. Yet visual complexity is highly heterogeneous, so a uniform budget overspends on simple inputs and underserves complex ones. Existing elastic tokenizers expose variable-length reconstructions, but often leave token length as a deployment-time operating point, a search target, or an external prediction rather than an output of the tokenizer itself. In this work, we ask whether a discrete visual tokenizer can budget itself in one pass. Our central finding is that actionable elasticity requires a representation-allocation co-design: prefixes must remain decodable across budgets, and the tokenizer must learn which prefix each image needs. We propose AdaTok, a self-budgeting discrete 1D tokenizer. AdaTok combines Prioritized Representation Learning, which orders tokens with nested tail masking and resolves budget-dependent semantic shift through Multi-Head LoRA decoder heads, with Adaptive Token Allocation, which trains a lightweight deterministic-group GRPO policy over candidate budgets. Dynamic Pareto Weighting balances fidelity and efficiency during policy training without manual trade-off sweeps. On ImageNet-1K, AdaTok-Full reaches rFID 1.31 at 256 tokens, while AdaTok-Adaptive attains rFID 1.50 using only ~118 tokens on average, outperforming discrete 1D baselines at comparable budgets. In autoregressive image generation, the shorter adaptive representation yields ~2.1x throughput over a fixed 256-token decode, suggesting that visual token count can be learned as a content-conditioned output rather than set as a fixed hyperparameter.

2606.07183 2026-06-08 cs.CL 新提交

Geometry of Semantic Space: Comparative Study of Discrete and Continuous Models

语义空间的几何:离散与连续模型的比较研究

Gabriel Bounias, Sabine Ploux

发表机构 * ISC-PIF (Institut des Systemes Complexes de Paris IdF)(巴黎IDF复杂系统研究所) CNRS, France(法国国家科学研究中心) CAMS (Centre d’analyse et de mathématique sociales)(分析与数学社会研究中心) CNRS & EHESS, Paris, France(法国国家科学研究中心与巴黎高等社会科学学院)

AI总结 本研究比较了监督向量嵌入(如CamemBERT)与词汇共现图在语义几何上的差异,发现图模型结构更清晰可读,而Transformer嵌入的拓扑分布不理想。

Comments 9 pages, 7 figures

详情
AI中文摘要

这项工作考察了NLP模型背后的语义几何。我们比较了监督向量嵌入(如CamemBERT)与更直接编码语义关系的词汇共现图。虽然基于Transformer的嵌入取得了强劲性能,但它们诱导的几何结构往往显示出不令人满意的分布。相比之下,基于图的模型揭示了更清晰、更易读的意义组织。我们实现了一种方法,允许我们基于这两种方法诱导的图结构或嵌入拓扑进行比较分析。比较结果——应用于法国“大国家辩论”语料库(公众辩论中公民贡献的集合)——显示了相似的局部拓扑,但非常不同的整体结构和拓扑。这些发现表明深度监督模型与基于图的模型之间存在互补视角,为引导神经架构朝向更稳定和可解释的图结构收敛提供了新途径。

英文摘要

This work examines the semantic geometry underlying NLP models. We compare supervised vector embeddings, such as CamemBERT, with lexical co-occurrence graphs that encode semantic relations more directly. While transformer-based embeddings achieve strong performance, their induced geometries often display unsatisfactory distributions. In contrast, graph-based models reveal a clearer and more human-readable organization of meaning. We have implemented a methodology that supports a comparative analysis based either on the structure of the graphs or on the topology of the embeddings induced by these two approaches. The results of the comparison, applied to the French "Great National Debate" corpus (a collection of citizen contributions to the public debate), show a similar local topology but a very different overall structure and topology. These findings suggest complementary perspectives between deep supervised models and graph-based models, and point to a new pathway for guiding neural architectures toward more stable and interpretable convergence with graph structures.

2606.07181 2026-06-08 cs.LG cs.AI q-bio.MN 新提交

RETROSPECT: RETROsynthesis via Sequential Prediction, and Chemically Transformed-ranking

RETROSPECT: 通过序列预测和化学变换排序的逆合成

Raja Sekhar Pappala, Shreyas Vinaya Sathyanarayana, Ronit Kumar Choudhary, Arjun Verma, Deepak Warrier

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出RETROSPECT系统,将单步逆合成分解为候选生成和重排序,结合ChemAlign Transformer生成器和LambdaMART重排序器,在USPTO-50K上实现55.00% top-1准确率。

Comments Accepted at the AI for Science workshop (ICML 2026)

详情
AI中文摘要

单步逆合成既需要准确的首位建议,也需要足够丰富的候选列表以供下游选择。我们将其研究为提议-选择分解。我们的系统RETROSPECT结合了一个单一的Transformer提议模型(我们称之为ChemAlign Transformer)和一个基于结构、反应模板、上游分数以及可选的DFT衍生描述符的LambdaMART重排序器。生成器使用混合根对齐和随机SMILES增强、预层归一化、绑定嵌入、指数移动平均权重以及可微的原子平衡辅助损失进行训练。在包含5,007个反应的完整USPTO-50K测试集上,生成器达到55.00%的top-1和86.18%的top-10精确匹配准确率,top-1有效率为99.86%。在用于重排序的合并候选池基准上(包含5,007个测试产物,每个产物约111个候选),基于结构特征集训练的LambdaMART模型达到59.4%的top-1和0.7171的平均倒数排名。特征消融实验表明,上游提议分数和模板频率统计提供了大部分重排序信号,而DFT和反应中心DFT特征提供的增益较小且不一致。这些结果支持逆合成的模块化观点:更强的单模型提议和学习候选选择是互补的,并且提议模型可以作为集成系统(如RetroChimera (Maziarz et al., 2024))的即插即用组件。

英文摘要

Single-step retrosynthesis needs both accurate first-ranked suggestions and candidate lists that are rich enough for downstream selection. We study this as a proposal-selection decomposition. Our system, RETROSPECT, combines a single Transformer proposal model, which we call the ChemAlign Transformer, with a LambdaMART reranker over structural, reaction-template, upstream-score, and optional DFT-derived descriptors. The generator is trained with hybrid root-aligned and random SMILES augmentation, Pre-LayerNorm, tied embeddings, exponential moving average weights, and a differentiable atom-balance auxiliary loss. On the full USPTO-50K test set of 5,007 reactions, the generator reaches 55.00% top-1 and 86.18% top-10 exact-match accuracy with 99.86% top-1 validity. On the merged candidate-pool benchmark used for reranking, which contains 5,007 test products and about 111 candidates per product, a LambdaMART model trained on the structural feature set reaches 59.4% top-1 with 0.7171 mean reciprocal rank. Feature ablations show that upstream proposal score and template-frequency statistics provide most of the reranking signal, while DFT and reaction-center DFT features provide smaller and less consistent gains. These results support a modular view of retrosynthesis: stronger single-model proposal and learned candidate selection are complementary, and the proposal model can serve as a drop-in component for ensemble systems such as RetroChimera (Maziarz et al., 2024).
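
The two headline metrics, top-k exact-match accuracy and mean reciprocal rank (MRR), can be computed as in the sketch below. The candidate strings are placeholders, and the sketch assumes that predictions and ground truth have already been brought to one canonical form so that plain string equality counts as an exact match. The abstract does not describe the canonicalization step.

```python
def top_k_accuracy(ranked_lists, targets, k):
    """Fraction of targets found among the first k candidates of their list."""
    hits = sum(1 for cands, target in zip(ranked_lists, targets)
               if target in cands[:k])
    return hits / len(targets)

def mean_reciprocal_rank(ranked_lists, targets):
    """Average of 1/rank of the target; a missing target contributes 0."""
    total = 0.0
    for cands, target in zip(ranked_lists, targets):
        if target in cands:
            total += 1.0 / (cands.index(target) + 1)
    return total / len(targets)

# Placeholder candidate lists for three test products, best candidate first.
ranked = [["A", "B", "C"],
          ["X", "Y", "Z"],
          ["P", "Q", "R"]]
truth = ["A", "Z", "M"]   # rank 1, rank 3, and not proposed at all

top1 = top_k_accuracy(ranked, truth, k=1)
top3 = top_k_accuracy(ranked, truth, k=3)
mrr = mean_reciprocal_rank(ranked, truth)
```

A reranker can raise top-1 and MRR only for targets that already appear somewhere in the candidate pool, which is why the abstract reports the proposal model's top-10 coverage alongside the reranked top-1.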

2606.07180 2026-06-08 cs.CV cs.LG 新提交

OPTIMUS-Prime: Minimal and Sufficient Concept Explanations for Deep Vision Models

OPTIMUS-Prime:深度视觉模型的最小且充分的概念解释

Arthur Hoarau, Chenrui Zhu, Vu Linh Nguyen

发表机构 * Université de Lorraine(洛林大学) CentraleSupélec Loria(中央理工-高等电力学院 Loria) CNRS(法国国家科学研究中心) Metz, France(法国梅斯) Université de technologie de Compiègne UMR CNRS 7253 Heudiasyc(贡比涅技术大学 UMR CNRS 7253 Heudiasyc) France(法国)

AI总结 提出OPTIMUS框架,基于主蕴含项理论生成视觉热图解释,满足充分性和最小性,提供形式化保证。

详情
AI中文摘要

自动化决策中日益增长的透明度需求已将可解释人工智能(XAI)推向机器学习研究的前沿。然而,在计算机视觉中,现有的解释方法通常优先考虑最终用户的可访问性,而牺牲了形式化保证,在实用性和理论严谨性之间留下了关键差距。在本文中,我们通过引入OPTIMUS(一种用于深度分类模型的基于概念的可视化解释的新框架)来弥补这一差距。OPTIMUS解释采用视觉热图的形式,不仅对最终用户保持可解释性,而且基于成熟的主蕴含项理论,提供了现有基于显著性方法所缺乏的形式化保证。具体来说,OPTIMUS解释满足两个理想性质:充分性,确保被强调的概念可证明地保证分类器的预测;以及最小性,确保这些概念的严格子集不再保留此保证。这两个性质共同产生了逻辑上紧凑且视觉上连贯的解释。我们在视觉分类基准上验证了我们的方法,证明OPTIMUS热图自然且忠实地呈现了模型预测背后的决策相关概念。

英文摘要

The growing demand for transparency in automated decision-making has propelled eXplainable Artificial Intelligence (XAI) to the forefront of machine learning research. In computer vision, however, existing explanation methods often prioritize end-user accessibility at the expense of formal guarantees, leaving a critical gap between practical utility and theoretical rigor. In this paper, we address this gap by introducing OPTIMUS, a novel framework for generating concept-based visual explanations for deep classification models. OPTIMUS explanations take the form of visual heatmaps that not only remain interpretable to end users, but are grounded in the well-established theory of prime implicants, providing formal guarantees that have been largely absent from existing saliency-based methods. Specifically, OPTIMUS explanations satisfy two desirable properties: sufficiency, ensuring that the highlighted concepts provably guarantee the classifier's prediction, and minimality, ensuring that no strict subset of those concepts retains this guarantee. Together, these properties yield explanations that are both logically tight and visually coherent. We validate our approach on a visual classification benchmark, demonstrating that OPTIMUS heatmaps naturally and faithfully surface the decision-relevant concepts underlying model predictions.
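
The sufficiency and minimality properties can be made concrete on a toy Boolean classifier over binary concepts. The sketch below is not the OPTIMUS algorithm: it uses exhaustive enumeration and a single deletion pass, which is only workable for a handful of concepts, and the classifier, the concept names, and the instance are all invented. A subset is treated as sufficient when fixing its concepts to the instance's values forces the same prediction for every setting of the remaining concepts.

```python
from itertools import product

def is_sufficient(model, instance, subset):
    """True if fixing `subset` to the instance's values forces the prediction."""
    free = [name for name in instance if name not in subset]
    target = model(instance)
    for values in product([0, 1], repeat=len(free)):
        trial = dict(instance)
        trial.update(zip(free, values))
        if model(trial) != target:
            return False
    return True

def minimal_sufficient_subset(model, instance):
    """Deletion pass: drop each concept whose removal keeps the set sufficient."""
    subset = set(instance)
    for name in list(instance):
        candidate = subset - {name}
        if is_sufficient(model, instance, candidate):
            subset = candidate
    return subset

def toy_model(x):
    """Hypothetical classifier: positive if (a and b) or c."""
    return int((x["a"] and x["b"]) or x["c"])

instance = {"a": 1, "b": 1, "c": 0, "d": 1}
explanation = minimal_sufficient_subset(toy_model, instance)
```

Here the result is the set containing `a` and `b`. Fixing both guarantees a positive prediction whatever `c` and `d` are, while neither concept alone does, so no strict subset keeps the guarantee. The deletion pass yields a subset-minimal set because sufficiency is monotone: any superset of a sufficient set is also sufficient.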