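arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and more.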

2601.19947 2026-05-29 cs.LG cs.AI cs.CV

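NCSAM: Noise-Compensated Sharpness-Aware Minimization for Noisy Label Learning

Jiayu Xu, Junbiao Pang

Affiliations: Beijing University of Technology

AI summary: Proposes NCSAM, which uses a noise-compensated perturbation to correct the optimization bias induced by noisy labels and to reduce memorization of them; it outperforms SAM baselines on synthetic and real-world noisy-label benchmarks.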

Comments 11 pages, 1 figure, 8 tables. Major revision of v1: revised PAC-Bayesian theoretical analysis, clarified the NCSAM formulation, added appendix derivations, reorganized experiments and ablations, updated related work, citations, writing, and author list

Abstract

Learning from Noisy Labels (LNL) remains a fundamental challenge in deep learning because real-world datasets often contain corrupted annotations. Most existing methods rely on label correction or sample selection mechanisms. In contrast, we study LNL from an optimization perspective by establishing a theoretical connection between label noise and the flatness-seeking behavior of Sharpness-Aware Minimization (SAM). Based on this analysis, we propose Noise-Compensated Sharpness-Aware Minimization (NCSAM), which uses a noise-compensated perturbation to counteract the optimization bias induced by noisy labels. By correcting distorted SAM perturbations, NCSAM mitigates the memorization of noisy labels during training while preserving the simplicity of optimization-based learning. Experiments on synthetic and real-world noisy-label benchmarks show that NCSAM consistently improves over SAM-based optimization baselines and remains competitive with representative noisy-label learning methods.
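
For readers unfamiliar with SAM, the baseline that NCSAM modifies, the following is a minimal NumPy sketch of one plain SAM update on a toy quadratic loss. It does not implement the noise-compensated perturbation, which the abstract does not specify; `sam_step`, `grad_fn`, `rho`, `lr`, and the toy matrix `A` are illustrative assumptions.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05, eps=1e-12):
    """One plain SAM update (not NCSAM): perturb toward higher loss, then descend."""
    g = grad_fn(w)
    # First-order worst-case perturbation inside an L2 ball of radius rho.
    e = rho * g / (np.linalg.norm(g) + eps)
    # The gradient at the perturbed weights drives the actual update.
    g_sam = grad_fn(w + e)
    return w - lr * g_sam

# Toy quadratic loss 0.5 * w^T A w with one sharp and one flat direction (assumed values).
A = np.diag([10.0, 1.0])
grad = lambda w: A @ w

w = np.array([1.0, 1.0])
for _ in range(100):
    w = sam_step(w, grad)
```

Because the perturbation has fixed norm `rho`, the iterate on this toy problem settles into a small neighbourhood of the minimum rather than converging exactly; a noise-aware variant would change how `e` is computed.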


2601.18395 2026-05-29 cs.CL

Do not be greedy, Think Twice: Sampling and Selection for Document-level Information Extraction

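Mikel Zubillaga, Oscar Sainz, Oier Lopez de Lacalle, Eneko Agirre

Affiliations: HiTZ Center - Ixa, University of the Basque Country UPV/EHU

AI summary: Proposes ThinkTwice, a framework that samples multiple candidate templates and selects the best one using either unsupervised agreement or a supervised reward model, outperforming greedy decoding on document-level information extraction.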

Comments Submitted to EMNLP 2026

Abstract

Document-level Information Extraction (DocIE) aims to produce an output template with the entities, relations, and events of interest occurring in the given document. Standard practices include prompting decoder-only LLMs using greedy decoding to avoid output variability. Rather than treating this variability as a limitation, we show that sampling can produce substantially better solutions than greedy decoding, especially when using reasoning models. We thus propose ThinkTwice, a sampling and selection framework in which the LLM generates multiple candidate templates for a given document, and a selection module chooses the most suitable one. We introduce both an unsupervised method that exploits agreement across generated outputs, and a supervised selection method using reward models trained on labeled DocIE data. To address the scarcity of golden reasoning trajectories for DocIE, we propose a rejection-sampling-based method to generate silver training data that pairs output templates with reasoning traces. Our experiments show the validity of unsupervised and supervised ThinkTwice, consistently outperforming greedy baselines and the supervised state-of-the-art.
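
One way to read "agreement across generated outputs" is to pick the sampled template that overlaps most with the other samples. The sketch below does that with a set-overlap F1; it is an assumption about the general idea, not the paper's selection rule, and the `Fact` representation, `f1`, and `select_by_agreement` are hypothetical names.

```python
from typing import FrozenSet, List, Tuple

Fact = Tuple[str, str]  # hypothetical (slot, value) pair extracted from a template

def f1(a: FrozenSet[Fact], b: FrozenSet[Fact]) -> float:
    """Set-overlap F1 between two candidate templates."""
    if not a and not b:
        return 1.0
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(a)
    recall = overlap / len(b)
    return 2 * precision * recall / (precision + recall)

def select_by_agreement(candidates: List[FrozenSet[Fact]]) -> int:
    """Return the index of the candidate that agrees most with the other samples."""
    if len(candidates) == 1:
        return 0

    def mean_agreement(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(f1(candidates[i], c) for c in others) / len(others)

    # Ties resolve to the earliest candidate.
    return max(range(len(candidates)), key=mean_agreement)

# Four made-up samples for one document.
samples = [
    frozenset({("author", "Ada"), ("venue", "EMNLP")}),
    frozenset({("author", "Ada"), ("venue", "EMNLP"), ("year", "2026")}),
    frozenset({("author", "Bob")}),
    frozenset({("author", "Ada"), ("venue", "EMNLP"), ("year", "2025")}),
]
best = select_by_agreement(samples)
```

Here the first sample wins because it contains only the facts that most other samples also contain. A supervised variant would replace `mean_agreement` with a reward-model score.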


2601.14855 2026-05-29 cs.LG

Adaptive Exponential Integration for Stable Gaussian Mixture Black-Box Variational Inference

Baojun Che, Yifan Chen, Daniel Zhengyu Huang, Xinying Mao, Weijie Wang

Affiliations: School of Mathematical Sciences, Peking University, Beijing, China; Department of Mathematics, University of California, Los Angeles, CA, USA; Beijing International Center for Mathematical Research, Center for Machine Learning Research, Peking University, Beijing, China

AI summary: To address the instability and inefficiency of Gaussian mixture black-box variational inference, proposes a stable and efficient framework that combines affine-invariant preconditioning, an exponential integrator that unconditionally preserves covariance positive definiteness, and adaptive time stepping, and proves convergence for Gaussian posteriors.

Comments 41 pages, 10 figures

Abstract

Black-box variational inference (BBVI) with Gaussian mixture families offers a flexible approach for approximating complex posterior distributions without requiring gradients of the target density. However, standard numerical optimization methods often suffer from instability and inefficiency. We develop a stable and efficient framework that combines three key components: (1) affine-invariant preconditioning via natural gradient formulations, (2) an exponential integrator that unconditionally preserves the positive definiteness of covariance matrices, and (3) adaptive time stepping to ensure stability and to accommodate distinct warm-up and convergence phases. The proposed approach has natural connections to manifold optimization and mirror descent. For Gaussian posteriors, we prove exponential convergence in the noise-free setting and almost-sure convergence under Monte Carlo estimation, rigorously justifying the necessity of adaptive time stepping. Numerical experiments on multimodal distributions, Neal's multiscale funnel, and a PDE-based Bayesian inverse problem for Darcy flow demonstrate the effectiveness of the proposed method.
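
To see why an exponential-type update can keep a covariance matrix positive definite for any step size, consider the toy flow dC/dt = C - C P C, which relaxes a Gaussian covariance C toward the inverse of a target precision P. The NumPy sketch below compares explicit Euler with an exponential step; the flow, the step formula, and all names and values are assumptions chosen for illustration, not the scheme derived in the paper.

```python
import numpy as np

def sym_fn(S, fn):
    """Apply a scalar function to a symmetric matrix through its eigendecomposition."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs * fn(vals)) @ vecs.T

def euler_step(C, P, h):
    """Explicit Euler on dC/dt = C - C P C; can leave the SPD cone when h is large."""
    return C + h * (C - C @ P @ C)

def exponential_step(C, P, h):
    """Exponential-type step: matches Euler to first order in h and is SPD for any h."""
    root = sym_fn(C, np.sqrt)
    A = np.eye(len(C)) - root @ P @ root  # symmetric generator
    # exp of a symmetric matrix is SPD, and congruence by root keeps it SPD.
    return root @ sym_fn(h * A, np.exp) @ root

C = np.diag([4.0, 1.0])  # current covariance (toy values)
P = np.diag([2.0, 1.0])  # target precision (toy values)
h = 1.0

min_eig_euler = np.linalg.eigvalsh(euler_step(C, P, h)).min()
min_eig_exp = np.linalg.eigvalsh(exponential_step(C, P, h)).min()
```

With these values the Euler step produces a negative eigenvalue, while the exponential step stays positive definite. The step size still affects accuracy, which is one reason adaptive time stepping remains relevant.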


2601.14758 2026-05-29 cs.LG cs.AI cs.CL

Mechanism Shift During Post-training from Autoregressive to Masked Diffusion Language Models

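Injin Kong, Hyoungjoon Lee, Yohan Jo

Affiliations: Graduate School of Data Science, Seoul National University; Department of Biosystems & Biomaterials Science and Engineering, Seoul National University

AI summary: A comparative circuit analysis finds that post-trained masked diffusion models structurally preserve or reorganize autoregressive circuitry depending on the task, and semantically shift from localized specialization to distributed integration, indicating that diffusion post-training reorganizes internal computation rather than only changing the generation procedure.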

Abstract

Post-training pretrained autoregressive models (ARMs) into masked diffusion models (MDMs) has emerged as a cost-effective way to overcome the limitations of sequential generation. Yet it remains unclear whether post-trained MDMs acquire genuinely new computational mechanisms or merely re-express autoregressive computation in a non-autoregressive form. Through a comparative circuit analysis of ARMs and their MDM counterparts post-trained from the same backbones, we uncover two complementary axes of reorganization. Structurally, the shift is task-dependent: MDMs preserve autoregressive circuitry on locally causal tasks but abandon inherited pathways and front-load computation into early layers on global tasks. Semantically, the shift is consistent across regimes: sharp, localized specialization in ARMs gives way to distributed integration in MDMs. Together, these findings show that diffusion post-training is not a surface-level change in the generation procedure but a reorganization of internal computation whose depth depends on the task.


2601.13111 2026-05-29 cs.CL cs.AI cs.IR

CORE-T: COherent REtrieval of Tables for Text-to-SQL

Hassan Soliman, Vivek Gupta, Dan Roth, Iryna Gurevych

Affiliations: Ubiquitous Knowledge Processing Lab (UKP Lab), Department of Computer Science, TU Darmstadt and National Research Center for Applied Cybersecurity ATHENE, Germany; Arizona State University; University of Pennsylvania

AI summary: Proposes CORE-T, a training-free framework that uses LLM-generated metadata and a precomputed compatibility cache to efficiently retrieve coherent, joinable table sets from heterogeneous collections, improving table-selection F1 by up to 22.7 points while returning up to 40% fewer tables.

Comments Preprint is revised and under review. Code and data available at: https://github.com/UKPLab/arxiv2026-core-t

Abstract

Realistic text-to-SQL workflows often require joining multiple tables. As a result, accurately retrieving the relevant set of tables becomes a key bottleneck for end-to-end performance. We study an open-book setting where queries must be answered over large, heterogeneous table collections pooled from many sources, without clean scoping signals such as database identifiers. Here, dense retrieval (DR) achieves high recall but returns many distractors, while join-aware alternatives often rely on extra assumptions and/or incur high inference overhead. We propose CORE-T, a scalable, training-free framework that enriches tables with LLM-generated purpose metadata and pre-computes a lightweight table-compatibility cache. At inference time, DR returns top-K candidates; a single LLM call selects a coherent, joinable subset, and a two-step additive adjustment stage restores strongly compatible tables. Across Bird, Spider, MMQA, and Beaver, CORE-T improves over DR by up to 22.7 points in table-selection F1 while returning up to 40% fewer tables, and by up to 24.4 points in multi-table execution accuracy, and uses 1.64-4.20x fewer total selection tokens than LLM-intensive baselines.


2601.12500 2026-05-29 cs.CV

Video Individual Counting and Tracking from Moving Drones: A Benchmark and Methods

Yaowu Fan, Jia Wan, Tao Han, Andy J. Ma, Wanli Ouyang, Antoni B. Chan

Affiliations: School of Computer Science and Engineering, Sun Yat-sen University; School of Computer Science and Engineering, Hong Kong University of Science and Technology; Department of Computer Science, City University of Hong Kong; School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen); Chinese University of Hong Kong

AI summary: For large-scale dense-crowd scenes, introduces the moving-drone video dataset MovingDroneCrowd++ and proposes GD3A, a counting method based on optimal transport, and DVTrack, a tracking method based on descriptor voting, which substantially reduce counting error and improve tracking accuracy.

Abstract

Counting and tracking dense crowds in large-scale scenes is a highly practical yet challenging problem. Existing methods mostly rely on fixed-camera datasets with limited scene coverage, making them inadequate for crowd analysis in large-scale scenes. To bridge this gap, we introduce MovingDroneCrowd++, the largest video-level dataset dedicated to dense crowd counting and tracking with fast-moving drones, captured under diverse flight altitudes, camera angles, and illumination conditions. Existing methods, however, still fail to achieve satisfactory video individual counting or tracking performance under these challenging aerial conditions. To this end, we propose GD3A (Global Density map Decomposition via group-wise Descriptor Association), a video individual counting method that first establishes pixel-level correspondences between pedestrian descriptors across frames via optimal transport with an adaptive dustbin score. Then, group-wise association is adopted to guide the decomposition of the global density map into shared, inflow, and outflow density maps. We further introduce a pedestrian tracking method, DVTrack (Descriptor Voting Track), which converts descriptor-level matching into instance-level association through descriptor voting. Our methods rely on the association results of group-wise multiple descriptors for each pedestrian rather than a single vector. Since intra-group matching errors do not affect the final counting and tracking results, our methods are more robust in dense crowds and challenging aerial conditions. Experiments show that our methods achieve substantial gains in both crowd counting and tracking on moving-drone videos with dense crowds and complex motions, reducing counting error by 47.4% and improving tracking accuracy by 64.6%. Code, dataset, and pretrained models are available at https://github.com/fyw1999/MovingDroneCrowd.


2601.11178 2026-05-29 cs.AI cs.CL cs.MM cs.SI

TANDEM: Temporal-Aware Neural Detection for Multimodal Hate Speech

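Girish A. Koushik, Helen Treharne, Diptesh Kanojia

Affiliations: Nature-Inspired Computing & Engineering, University of Surrey; Surrey Centre for Cyber Security, University of Surrey

AI summary: Proposes TANDEM, a unified framework that jointly optimizes vision-language and audio-language models with a tandem reinforcement learning strategy and recasts audio-visual hate detection as a structured reasoning problem, reaching 0.73 F1 in target identification on HateMM (a 30% improvement) while maintaining precise temporal grounding.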

Comments Under review at ICWSM 2027

Abstract

Social media platforms are increasingly dominated by long-form multimodal content, where harmful narratives are constructed through a complex interplay of audio, visual, and textual cues. While automated systems can flag hate speech with high accuracy, they often function as "black boxes" that fail to provide the granular, interpretable evidence, such as precise timestamps and target identities, required for effective human-in-the-loop moderation. In this work, we introduce TANDEM, a unified framework that transforms audio-visual hate detection from a binary classification task into a structured reasoning problem. Our approach employs a novel tandem reinforcement learning strategy where vision-language and audio-language models optimize each other through self-constrained cross-modal context, stabilizing reasoning over extended temporal sequences without requiring dense frame-level supervision. Experiments across three benchmark datasets demonstrate that TANDEM significantly outperforms zero-shot and context-augmented baselines, achieving 0.73 F1 in target identification on HateMM (a 30% improvement over state-of-the-art) while maintaining precise temporal grounding. We further observe that while binary detection is robust, differentiating between offensive and hateful content remains challenging in multi-class settings due to inherent label ambiguity and dataset imbalance. More broadly, our findings suggest that structured, interpretable alignment is achievable even in complex multimodal settings, offering a blueprint for the next generation of transparent and actionable online safety moderation tools.


2601.05149 2026-05-29 cs.CV

Multi-Scale Local Speculative Decoding for Image Generation

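Elia Peruzzo, Guillaume Sautière, Amirhossein Habibian

Affiliations: Qualcomm AI Research

AI summary: Proposes Multi-Scale Local Speculative Decoding (MuLo-SD), which pairs a low-resolution draft model with a high-resolution target model and a local rejection and resampling mechanism to accelerate autoregressive image generation, achieving up to 5x speedup while maintaining comparable semantic alignment and perceptual quality.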

Comments Accepted at CVPR 2026

Abstract

Autoregressive (AR) models have achieved remarkable success in image synthesis, yet their sequential nature imposes significant latency constraints. Speculative Decoding offers a promising avenue for acceleration, but existing approaches are limited by token-level ambiguity and lack of spatial awareness. In this work, we introduce Multi-Scale Local Speculative Decoding (MuLo-SD), a novel framework that combines multi-resolution drafting with spatially informed verification to accelerate AR image generation. Our method leverages a low-resolution drafter paired with an up-sampling step to propose candidate image tokens, which are then verified in parallel by a high-resolution target model. Crucially, we incorporate a local rejection and resampling mechanism, enabling efficient correction of draft errors by focusing on spatial neighborhoods rather than raster-scan resampling after the first rejection. When integrated with parallel decoding resampling, MuLo-SD achieves substantial speedups -- up to $\mathbf{5\times}$ -- outperforming both speculative decoding and parallel decoding baselines in terms of acceleration, while maintaining comparable semantic alignment and perceptual quality. These results are validated using GenEval, DPG-Bench, and FID/HPSv2 on the MS-COCO 5k validation split. Extensive ablations highlight the impact of up-sampling design, probability pooling, and local rejection and resampling with neighborhood expansion. Our approach sets a new state-of-the-art in speculative decoding for image synthesis, bridging the gap between efficiency and fidelity. Project page is available at https://qualcomm-ai-research.github.io/mulo-sd-webpage/ .
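
As background for the verification step, the sketch below shows the standard token-level speculative sampling rule: accept a drafted token with probability min(1, p/q), otherwise resample from the normalized positive part of p - q. This is the generic rule that speculative decoding builds on, not MuLo-SD's local rejection and resampling mechanism; the function names and the three-token toy distributions are assumptions.

```python
import numpy as np

def residual_distribution(p_target, q_draft):
    """Distribution to resample from after a rejection: normalized max(p - q, 0)."""
    r = np.maximum(p_target - q_draft, 0.0)
    total = r.sum()
    # If p equals q everywhere the residual is empty; fall back to the target itself.
    return r / total if total > 0 else p_target

def verify_token(token, p_target, q_draft, rng):
    """Accept the drafted token with probability min(1, p/q); otherwise resample."""
    accept_prob = min(1.0, p_target[token] / q_draft[token])
    if rng.random() < accept_prob:
        return token, True
    resampled = rng.choice(len(p_target), p=residual_distribution(p_target, q_draft))
    return int(resampled), False

rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])  # target model probabilities (toy values)
q = np.array([0.2, 0.5, 0.3])  # draft model probabilities (toy values)

# Draft from q, verify against p, and record the emitted token.
draws = [verify_token(int(rng.choice(3, p=q)), p, q, rng)[0] for _ in range(20000)]
freq = np.bincount(draws, minlength=3) / len(draws)
```

The empirical frequencies in `freq` should be close to `p`, which is the property that lets a cheap drafter speed up sampling without changing the target distribution.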


2601.04765 2026-05-29 cs.CL cs.AI cs.LG physics.comp-ph

Differential syntactic and semantic encoding in LLMs

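Santiago Acevedo, Alessandro Laio, Marco Baroni

Affiliations: Catalan Institute of Research and Advanced Studies (ICREA) and Universitat Pompeu Fabra (UPF)

AI summary: By averaging the hidden-representation vectors of sentences that share syntactic structure or meaning, this study finds that syntactic and semantic information is at least partly linearly encoded in the inner-layer representations of large language models (with DeepSeek-V3 as the example), with distinct encoding profiles that can be decoupled to some extent.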

Comments Published as conference paper at ICML 2026

2601.03729 2026-05-29 cs.CV

MATANet: A Multi-context Attention and Taxonomy-Aware Network for Fine-Grained Underwater Recognition of Marine Species

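Donghwan Lee, Byeongjin Kim, Geunhee Kim, Hyukjin Kwon, Nahyeon Maeng, Wooju Kim

Affiliations: Department of Industrial Engineering, Yonsei University

AI summary: Proposes MATANet, which combines organism appearance, environmental context, and taxonomic structure through a multi-context environmental attention module and a hierarchy-aware representation learning module for fine-grained recognition of marine organisms, outperforming existing methods on FathomNet2025 and LifeCLEF2015-Fish.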

Abstract

Fine-grained recognition of marine organisms is important for ecological research, biodiversity monitoring, habitat conservation, and evidence-based policy-making. However, many existing approaches primarily rely on object- or ROI-centered representations. These limitations can reduce discriminative performance in challenging underwater scenes, where visually similar organisms often appear under diverse environmental conditions. To address these challenges, we propose MATANet (Multi-context Attention and Taxonomy-Aware Network), a framework for fine-grained taxonomic recognition of marine organisms. MATANet is motivated by expert taxonomic identification practices, in which both organism-level morphology and contextual cues are considered during recognition. The framework consists of two main components. First, the Multi-Context Environmental Attention Module (MCEAM) models cross-attention between the primary region of interest (ROI) and multi-scale surrounding environmental regions, thereby combining local morphological cues with habitat-level contextual information. Second, the Hierarchy-Aware Representation Learning Module (HRLM) uses taxonomic hierarchy as auxiliary supervision to regularize representation learning and encourage semantically structured embeddings across taxonomic levels. By jointly modeling organism appearance, environmental context, and taxonomic structure, MATANet learns more discriminative representations for fine-grained taxonomic recognition. Experiments on FathomNet2025 and LifeCLEF2015-Fish demonstrate that MATANet consistently improves recognition performance over existing methods. Additional experiments on FAIR1M further examine the applicability of the proposed framework beyond underwater imagery. Notably, MATANet ranked first in the FathomNet 2025 Challenge at the CVPR 2025 FGVC12 workshop.


2601.00065 2026-05-29 cs.LG cs.CL cs.CR

When the Same Coefficients Reach Different Places: Asymmetric Realizability in Transplanting Tokenizers across Large Language Models

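Xiaoze Liu, Weichen Yu, Matt Fredrikson, Xiaoqian Wang, Jing Gao

Affiliations: Purdue University; Carnegie Mellon University

AI summary: Identifies a geometric asymmetry in tokenizer transplant for cross-vocabulary model composition and constructs "breaker tokens" that exploit it, experimentally verifying that they exist across many model pairs and survive defenses such as fine-tuning and spectral filtering.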

Abstract

Tokenizer transplant in cross-vocabulary model composition reconstructs donor-only embedding rows as weighted combinations over shared lexical anchors and reuses those coefficients on the base. We identify a structural geometric property of this reconstruction: the same coefficient vector reaches different sets in the donor and base anchor spans, an "asymmetric realizability" gap. Across 65 donor-base pairs under OMP, with cross-operator validation on CLP, WECHSEL, and FOCUS, we construct "breaker tokens": single coefficient vectors that remain statistically inert in the donor anchor span while producing a high-salience reconstruction in the base. The same Gemma-2-2B donor checkpoint admits this construction against 13 different downstream bases drawn from five model families. The planted direction passes weight-merging with a clean reference unchanged. In a deployer case study, standard LoRA fine-tuning suppresses the breaker primarily on prompts whose distribution matches the training corpus and is not a sufficient mitigation against this attack family in our setting. The tested spectral filters miss the asymmetry. We discuss potential misuse in the open-weight composition supply chain.


2512.21311 2026-05-29 cs.LG

Learning to Solve PDEs on Neural Shape Representations

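Lilian Welschinger, Yilin Liu, Zican Wang, Niloy Mitra

Affiliations: University College London; Adobe Research

AI summary: Proposes a meshfree formulation that learns a local update operator conditioned on neural local shape attributes, solving surface PDEs directly on neural representations without explicit meshing or per-instance optimization while preserving differentiability.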

Comments Accepted at CVPR 2026. Project page: https://welschinger.github.io/Learning-to-Solve-PDEs-on-Neural-Shape-Representations/

Abstract

Solving partial differential equations (PDEs) on shapes underpins many shape analysis and engineering tasks; yet, prevailing PDE solvers operate on polygonal/triangle meshes while modern 3D assets increasingly live as neural representations. This mismatch leaves no suitable method to solve surface PDEs directly within the neural domain, forcing explicit mesh extraction or per-instance residual training, preventing end-to-end workflows. We present a novel, meshfree formulation that learns a local update operator conditioned on neural (local) shape attributes, enabling surface PDEs to be solved directly where the (neural) data lives. The operator integrates naturally with prevalent neural surface representations, is trained once on a single representative shape, and generalizes across shape and topology variations, enabling accurate, fast inference without explicit meshing or per-instance optimization while preserving differentiability. Across analytic benchmarks (heat diffusion and Poisson equations on the sphere) and on diverse shapes and neural surface representations, our method achieves accuracy comparable to classical solvers while enabling a unified, end-to-end pipeline across neural and traditional surface representations. Our source code and project page: https://welschinger.github.io/Learning-to-Solve-PDEs-on-Neural-Shape-Representations/.


2512.19199 2026-05-29 cs.LG cs.AI

On the Koopman-Based Generalization Bounds for Multi-Task Deep Learning

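Mahdi Mohammadigohari, Giuseppe Di Fatta, Giuseppe Nicosia, Panos M. Pardalos

Affiliations: Free University of Bozen-Bolzano; University of Catania; University of Florida

AI summary: Uses operator-theoretic techniques to establish generalization bounds for multi-task deep neural networks; by exploiting small condition numbers of the weight matrices and introducing a tailored Sobolev space as an expanded hypothesis space, it obtains bounds tighter than traditional norm-based ones that remain valid in the single-output setting and improve on existing Koopman-based bounds.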

Comments Accepted at the 11th International Conference on Machine Learning, Optimization, and Data Science (LOD), Castiglione della Pescaia, Italy, September 21-24, 2025. To appear in Lecture Notes in Computer Science (LNCS), volume 16467

Journal ref Machine Learning, Optimization, and Data Science (LOD 2025), Lecture Notes in Computer Science (LNCS), vol. 16468, Springer, 2026, pp. 376--392

2512.19184 2026-05-29 cs.LG cs.AI

Operator-Based Generalization Bound for Deep Learning: Insights on Multi-Task Learning

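Mahdi Mohammadigohari, Giuseppe Di Fatta, Giuseppe Nicosia, Panos M. Pardalos

Affiliations: Free University of Bozen-Bolzano; University of Catania; University of Florida

AI summary: Within an operator-theoretic framework, combines a Koopman-based approach with existing techniques to obtain tighter generalization bounds for vector-valued neural networks and deep kernel methods, introduces sketching techniques to reduce computational cost, and proposes a deep vector-valued reproducing kernel Hilbert space framework that uses Perron-Frobenius operators to enhance deep kernel methods, with a new Rademacher generalization bound that addresses underfitting and overfitting.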

Comments Accepted at the 11th International Conference on Machine Learning, Optimization, and Data Science (LOD), Castiglione della Pescaia, Italy, September 21-24, 2025. To appear in Lecture Notes in Computer Science (LNCS), volume 16467

Journal ref Machine Learning, Optimization, and Data Science (LOD 2025), Lecture Notes in Computer Science (LNCS), vol. 16468, Springer, 2026, pp. 120--137

DOI: 10.1007/978-3-032-21480-5_9

Abstract

This paper presents novel generalization bounds for vector-valued neural networks and deep kernel methods, focusing on multi-task learning through an operator-theoretic framework. Our key development lies in strategically combining a Koopman-based approach with existing techniques, achieving tighter generalization guarantees compared to traditional norm-based bounds. To mitigate computational challenges associated with Koopman-based methods, we introduce sketching techniques applicable to vector-valued neural networks. These techniques yield excess risk bounds under generic Lipschitz losses, providing performance guarantees for applications including robust and multiple quantile regression. Furthermore, we propose a novel deep learning framework, deep vector-valued reproducing kernel Hilbert spaces (vvRKHS), leveraging Perron-Frobenius (PF) operators to enhance deep kernel methods. We derive a new Rademacher generalization bound for this framework, explicitly addressing underfitting and overfitting through kernel refinement strategies. This work offers novel insights into the generalization properties of multi-task learning with deep learning architectures, an area that has been relatively unexplored until recent developments.


2512.11944 2026-05-29 cs.RO cs.AI

A Review of Learning-Based Motion Planning: Toward a Data-Driven Optimal Control Approach

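Jia Hu, Yang Chang, Haoran Wang

Affiliations: College of Transportation Key Laboratory of Road and Traffic Engineering of the Ministry of Education; Institute for Advanced Study; Tongji University

AI summary: Systematically reviews the data-driven optimal control paradigm, which combines the theoretical guarantees of optimal control with the adaptive capabilities of machine learning, proposes a three-dimensional implementation roadmap for autonomous-driving motion planning, and identifies four future research directions.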

Comments 44 pages, 14 figures

Abstract

Motion planning for autonomous driving (AD) faces a critical trade-off. While traditional rule-based pipelines offer verifiable safety and interpretability, they often fail to generalize in complex scenarios. Conversely, emerging learning-based methods, including imitation learning (IL), reinforcement learning (RL), and generative AI, offer greater adaptability but are often constrained by opacity and safety risks. Existing surveys typically analyze these AI methods in isolation, overlooking the potential of integrating them with rigorous control frameworks. To bridge this gap, this paper presents the first systematic review of the Data-Driven Optimal Control (DDOC) paradigm, explicitly examining how it synergizes the theoretical guarantees of optimal control with the adaptive capabilities of modern machine learning. Building on this framework, we propose the first roadmap for DDOC-based motion planning, structuring its implementation into three critical dimensions: customization, dynamics adaptation, and self-tuning. Finally, to close the remaining reality gap, we identify four future research directions, thereby accelerating the transition to trustworthy and human-like autonomous driving.

2512.10659 2026-05-29 cs.LG

DCFO: Density-Based Counterfactuals for Outliers -- Additional Material

DCFO: 基于密度的离群点反事实解释——补充材料

Tommaso Amico, Pernille Matthews, Lena Krieger, Arthur Zimek, Ira Assent

发表机构 * Department of Computer Science（计算机科学系）； Department of Computer Science and Mathematics（计算机科学与数学系）

AI总结针对局部离群因子（LOF）缺乏可解释性的问题，提出基于密度的离群点反事实解释方法（DCFO），通过将数据空间划分为LOF平滑区域实现高效梯度优化，在50个OpenML数据集上优于现有方法。

AI中文摘要

离群点检测识别显著偏离大多数数据分布的数据点。解释离群点对于理解导致其检测的潜在因素、验证其重要性以及识别潜在偏差或错误至关重要。有效的解释提供可操作的见解，有助于采取预防措施以避免未来出现类似的离群点。反事实解释通过识别改变预测所需的最小变化，阐明特定数据点为何被分类为离群点。尽管有价值，但大多数现有的反事实解释方法忽略了离群点检测带来的独特挑战，并且未能针对经典、广泛采用的离群点检测算法。局部离群因子（LOF）是最流行的无监督离群点检测方法之一，通过相对局部密度量化离群程度。尽管LOF在多种应用中广泛使用，但它缺乏可解释性。为解决这一局限性，我们提出了基于密度的离群点反事实解释（DCFO），这是一种专门为LOF生成反事实解释的新方法。DCFO将数据空间划分为LOF行为平滑的区域，从而实现高效的基于梯度的优化。在50个OpenML数据集上的广泛实验验证表明，DCFO始终优于基准竞争对手，在生成的反事实的邻近性和有效性方面表现更优。

英文摘要

Outlier detection identifies data points that significantly deviate from the majority of the data distribution. Explaining outliers is crucial for understanding the underlying factors that contribute to their detection, validating their significance, and identifying potential biases or errors. Effective explanations provide actionable insights, facilitating preventive measures to avoid similar outliers in the future. Counterfactual explanations clarify why specific data points are classified as outliers by identifying minimal changes required to alter their prediction. Although valuable, most existing counterfactual explanation methods overlook the unique challenges posed by outlier detection, and fail to target classical, widely adopted outlier detection algorithms. Local Outlier Factor (LOF) is one of the most popular unsupervised outlier detection methods, quantifying outlierness through relative local density. Despite LOF's widespread use across diverse applications, it lacks interpretability. To address this limitation, we introduce Density-based Counterfactuals for Outliers (DCFO), a novel method specifically designed to generate counterfactual explanations for LOF. DCFO partitions the data space into regions where LOF behaves smoothly, enabling efficient gradient-based optimisation. Extensive experimental validation on 50 OpenML datasets demonstrates that DCFO consistently outperforms benchmarked competitors, offering superior proximity and validity of generated counterfactuals.
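
To make "outlierness through relative local density" concrete, here is a minimal brute-force sketch of the classical LOF score. It is not the DCFO method from the abstract; the function name `lof_scores` and the toy data are illustrative assumptions, and ties in the k-distance are ignored for brevity.

```python
import math

def lof_scores(points, k):
    """Return the Local Outlier Factor of every point (brute force, O(n^2))."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # k nearest neighbours of each point (excluding itself) and its k-distance.
    neighbours, k_dist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbours.append(order[:k])
        k_dist.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbours[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean ratio of the neighbours' density to the point's own density.
    return [sum(lrd[j] for j in neighbours[i]) / (k * lrd[i]) for i in range(n)]


# Hypothetical toy data: a tight cluster plus one distant point.
cluster = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5)]
outlier = (6.0, 6.0)
scores = lof_scores(cluster + [outlier], k=3)
```

Scores close to 1 indicate a point about as dense as its neighbours; the distant point receives a much larger score. The neighbour sets change discontinuously as a point moves, which is presumably why the abstract speaks of regions where LOF behaves smoothly.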

2512.04733 2026-05-29 cs.CV cs.AI

E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving

E3AD：面向以人为中心的端到端自动驾驶的情感感知视觉-语言-动作模型

Yihong Tang, Haicheng Liao, Tong Nie, Junlin He, Ao Qu, Kehua Chen, Wei Ma, Zhenning Li, Lijun Sun, Chengzhong Xu

发表机构 * McGill University（麦吉尔大学）； University of Macau（澳门大学）； The Hong Kong Polytechnic University（香港理工大学）； Massachusetts Institute of Technology（麻省理工学院）； University of Washington（华盛顿大学）

AI总结提出E3AD框架，通过连续VAD情感模型和双路径空间推理模块，将情感理解融入视觉-语言-动作模型，实现开放域端到端自动驾驶中的情感感知轨迹规划，在真实数据集上达到SOTA性能。

AI中文摘要

端到端自动驾驶系统越来越多地采用视觉-语言-动作模型，但它们通常忽略乘客的情绪状态，而情绪状态对舒适度和自动驾驶接受度至关重要。我们引入了开放域端到端自动驾驶，其中自动驾驶车辆必须解释自由形式的自然语言命令，推断情绪，并规划物理上可行的轨迹。我们提出了E3AD，一个情感感知的VLA框架，通过两个认知启发的组件增强语义理解：一个连续的Valence-Arousal-Dominance情感模型，从语言中捕捉语调和紧迫性；以及一个双路径空间推理模块，融合自我中心和异中心视角以实现类人空间认知。结合模态预训练和基于偏好的对齐的一致性导向训练方案，进一步强化了情感意图与驾驶行为之间的一致性。在真实世界数据集上，E3AD改进了视觉定位和路径点规划，并在情感估计方面达到了最先进的VAD相关性。这些评估结果表明，将情感注入VLA风格的驾驶能够产生更符合人类行为的定位、规划和反馈。

英文摘要

End-to-end autonomous driving (AD) systems increasingly adopt vision-language-action (VLA) models, yet they typically ignore the passenger's emotional state, which is central to comfort and AD acceptance. We introduce Open-Domain End-to-End (OD-E2E) autonomous driving, where an autonomous vehicle (AV) must interpret free-form natural-language commands, infer the emotion, and plan a physically feasible trajectory. We propose E3AD, an emotion-aware VLA framework that augments semantic understanding with two cognitively inspired components: a continuous Valence-Arousal-Dominance (VAD) emotion model that captures tone and urgency from language, and a dual-pathway spatial reasoning module that fuses egocentric and allocentric views for human-like spatial cognition. A consistency-oriented training scheme, combining modality pretraining with preference-based alignment, further enforces coherence between emotional intent and driving actions. Across real-world datasets, E3AD improves visual grounding and waypoint planning and achieves state-of-the-art (SOTA) VAD correlation for emotion estimation. These evaluation results show that injecting emotion into VLA-style driving yields more human-aligned grounding, planning, and feedback.

2512.03109 2026-05-29 cs.LG cs.AI stat.AP stat.ML

E-valuator: Reliable Agent Verifiers with Sequential Hypothesis Testing

E-valuator: 基于序贯假设检验的可靠智能体验证器

Shuvom Sadhuka, Drew Prinster, Clara Fannjiang, Gabriele Scalia, Bonnie Berger, Aviv Regev, Hanchen Wang

发表机构 * Genentech（基因泰克）； MIT（麻省理工学院）； Johns Hopkins（约翰霍普金斯大学）； Stanford（斯坦福大学）

AI总结提出E-valuator方法，将任意黑盒验证器分数转化为具有可控虚警率的决策规则，通过序贯假设检验实现对智能体轨迹的在线监控，提升统计功效并节省令牌。

AI中文摘要

智能体AI系统根据用户提示执行一系列动作，如推理步骤或工具调用。为了评估其轨迹的成功性，研究人员开发了验证器（如LLM评判器和过程奖励模型）来对智能体轨迹中每个动作的质量进行评分。尽管这些启发式评分可能提供信息，但在用于决定智能体是否会产生成功输出时，无法保证正确性。在此，我们引入e-valuator，一种将任意黑盒验证器分数转化为具有可证明虚警率控制的决策规则的方法。我们将区分成功轨迹（即会导致对用户提示正确响应的动作序列）与不成功轨迹的问题构建为序贯假设检验问题。E-valuator基于e-过程工具开发了一个序贯假设检验，该检验在智能体轨迹的每一步都保持统计有效性，从而能够对任意长动作序列的智能体进行在线监控。实验表明，在六个数据集和三个智能体上，e-valuator相比其他策略提供了更高的统计功效和更好的虚警率控制。我们还展示了e-valuator可用于快速终止有问题的轨迹并节省令牌。总之，e-valuator提供了一个轻量级、模型无关的框架，将验证器启发式转化为具有统计保证的决策规则，从而支持部署更可靠的智能体系统。

英文摘要

Agentic AI systems execute a sequence of actions, such as reasoning steps or tool calls, in response to a user prompt. To evaluate the success of their trajectories, researchers have developed verifiers, such as LLM judges and process-reward models, to score the quality of each action in an agent's trajectory. Although these heuristic scores can be informative, there are no guarantees of correctness when used to decide whether an agent will yield a successful output. Here, we introduce e-valuator, a method to convert any black-box verifier score into a decision rule with provable control of false alarm rates. We frame the problem of distinguishing successful trajectories (that is, a sequence of actions that will lead to a correct response to the user's prompt) from unsuccessful trajectories as a sequential hypothesis testing problem. E-valuator builds on tools from e-processes to develop a sequential hypothesis test that remains statistically valid at every step of an agent's trajectory, enabling online monitoring of agents over arbitrarily long sequences of actions. Empirically, we demonstrate that e-valuator provides greater statistical power and better false alarm rate control than other strategies across six datasets and three agents. We additionally show that e-valuator can be used to quickly terminate problematic trajectories and save tokens. Together, e-valuator provides a lightweight, model-agnostic framework that converts verifier heuristics into decision rules with statistical guarantees, enabling the deployment of more reliable agentic systems.
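
The idea of a test that stays valid at every step can be illustrated with a textbook e-process: a running product of likelihood ratios that raises an alarm once it reaches 1/alpha. This is only a sketch, not the paper's construction. It assumes the per-step verifier scores are independent draws from known Gaussian densities (hypothetical parameters below), with "trajectory is successful" as the null hypothesis. Under those assumptions Ville's inequality bounds the probability of ever raising a false alarm by alpha.

```python
import math

def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def monitor(scores, alpha, null_pdf, alt_pdf):
    """Return the first step at which the alarm fires, or None if it never does."""
    e_value = 1.0
    for step, score in enumerate(scores, start=1):
        # Multiply in the likelihood ratio of "failing" versus "successful".
        e_value *= alt_pdf(score) / null_pdf(score)
        if e_value >= 1.0 / alpha:
            return step
    return None

# Assumed score densities; a real deployment would have to estimate these.
null_pdf = lambda s: gaussian_pdf(s, 0.8, 0.1)  # scores of successful runs
alt_pdf = lambda s: gaussian_pdf(s, 0.4, 0.1)   # scores of failing runs

good_run = [0.82, 0.79, 0.85, 0.80, 0.77]
bad_run = [0.75, 0.55, 0.42, 0.38, 0.40]
good_alarm = monitor(good_run, 0.05, null_pdf, alt_pdf)
bad_alarm = monitor(bad_run, 0.05, null_pdf, alt_pdf)
```

Because the alarm can fire before the trajectory ends, the remaining steps need not be generated, which is the token-saving effect the abstract mentions.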

2511.19316 2026-05-29 cs.CV cs.AI

Evaluating Dataset Watermarking for Fine-tuning Traceability of Customized Diffusion Models: A Comprehensive Benchmark and Removal Approach

评估数据集水印用于定制扩散模型微调可追溯性：一个综合基准与移除方法

Xincheng Wang, Hanchi Sun, Wenjun Sun, Kejun Xue, Wangqiu Zhou, Jianbo Zhang, Wei Sun, Dandan Zhu, Xiongkuo Min, Jun Jia, Zhijun Fang

发表机构 * Donghua University（东华大学）； Shanghai Jiao Tong University（上海交通大学）； Xidian University（西安电子科技大学）； Hefei University of Technology（合肥工业大学）； East China Normal University（华东师范大学）

AI总结针对扩散模型微调中的版权与安全风险，本文建立统一威胁模型并提出包含普适性、可传递性和鲁棒性的评估框架，揭示现有数据集水印方法的脆弱性，并进一步提出一种实用的水印移除方法。

AI中文摘要

最近扩散模型的微调技术使其能够再现特定图像集，例如特定人脸或艺术风格，但也引入了版权和安全风险。数据集水印已被提出，通过将不可察觉的水印嵌入训练图像来确保可追溯性，即使在微调后这些水印在输出中仍然可检测。然而，当前方法缺乏统一的评估框架。为解决这一问题，本文建立了一个通用威胁模型，并引入了一个包含普适性、可传递性和鲁棒性的综合评估框架。实验表明，现有方法在普适性和可传递性方面表现良好，并对常见图像处理操作具有一定的鲁棒性，但在真实威胁场景下仍然不足。为揭示这些脆弱性，本文进一步提出了一种实用的水印移除方法，该方法在不影响微调的情况下完全消除数据集水印，突出了未来研究的一个关键挑战。

英文摘要

Recent fine-tuning techniques for diffusion models enable them to reproduce specific image sets, such as particular faces or artistic styles, but also introduce copyright and security risks. Dataset watermarking has been proposed to ensure traceability by embedding imperceptible watermarks into training images, which remain detectable in outputs even after fine-tuning. However, current methods lack a unified evaluation framework. To address this, this paper establishes a general threat model and introduces a comprehensive evaluation framework encompassing Universality, Transmissibility, and Robustness. Experiments show that existing methods perform well in universality and transmissibility, and exhibit some robustness against common image processing operations, yet still fall short under real-world threat scenarios. To reveal these vulnerabilities, the paper further proposes a practical watermark removal method that fully eliminates dataset watermarks without affecting fine-tuning, highlighting a key challenge for future research.

2511.17798 2026-05-29 cs.RO

SM2ITH: Safe Mobile Manipulation with Interactive Human Prediction via Task-Hierarchical Bilevel Model Predictive Control

SM2ITH：基于任务分层双层模型预测控制与交互式人类预测的安全移动操作

Francesco D'Orazio, Sepehr Samavi, Xintong Du, Siqi Zhou, Giuseppe Oriolo, Angela P. Schoellig

发表机构 * Department of Computer, Control and Management Engineering, Sapienza University of Rome（罗马萨皮恩扎大学计算机、控制与管理工程系）； University of Toronto Institute for Aerospace Studies (UTIAS) and the Vector Institute for Artificial Intelligence（多伦多大学航空航天研究所（UTIAS）和向量人工智能研究所）； Learning Systems and Robotics lab at the Technical University of Munich and the Munich Institute for Robotics and Machine Intelligence (MIRMI)（慕尼黑技术大学学习系统与机器人实验室及慕尼黑机器人与机器智能研究所（MIRMI））； School of Computing Science, Faculty of Applied Sciences, Simon Fraser University（西蒙·弗雷泽大学应用科学学部计算科学学院）

AI总结提出SM$^2$ITH框架，将分层任务模型预测控制与基于双层优化的交互式人体运动预测相结合，实现动态人机环境中的安全高效移动操作。

Comments Accepted to the IEEE International Conference on Robotics and Automation (ICRA) 2026

AI中文摘要

移动操作机器人被设计用于在以人为中心的环境中执行复杂的导航和操作任务序列。尽管最近基于优化的方法（如分层任务模型预测控制，HTMPC）能够以严格的任务优先级实现高效的多任务执行，但它们目前主要应用于静态或结构化场景。将这些方法扩展到动态的人为中心环境需要预测模型来捕捉人类对机器人行为的反应。本文提出了SM$^2$ITH（通过任务分层双层模型预测控制实现安全移动操作与人机交互预测），这是一个统一框架，通过双层优化联合考虑机器人和人类动力学，将HTMPC与交互式人体运动预测相结合。该框架在两种不同的移动操作机器人（Stretch 3和Ridgeback-UR10）上进行了验证，涉及三种实验设置：（i）具有不同导航和操作优先级的递送任务，（ii）使用不同人体运动预测模型的顺序抓取-放置任务，以及（iii）涉及对抗性人类行为的交互。我们的结果突出了交互式预测如何实现安全高效的协调，优于依赖加权目标或开环人体模型的基线方法。

英文摘要

Mobile manipulators are designed to perform complex sequences of navigation and manipulation tasks in human-centered environments. While recent optimization-based methods such as Hierarchical Task Model Predictive Control (HTMPC) enable efficient multitask execution with strict task priorities, they have so far been applied mainly to static or structured scenarios. Extending these approaches to dynamic human-centered environments requires predictive models that capture how humans react to the actions of the robot. This work introduces Safe Mobile Manipulation with Interactive Human Prediction via Task-Hierarchical Bilevel Model Predictive Control (SM$^2$ITH), a unified framework that combines HTMPC with interactive human motion prediction through bilevel optimization that jointly accounts for robot and human dynamics. The framework is validated on two different mobile manipulators, the Stretch 3 and the Ridgeback-UR10, across three experimental settings: (i) delivery tasks with different navigation and manipulation priorities, (ii) sequential pick-and-place tasks with different human motion prediction models, and (iii) interactions involving adversarial human behavior. Our results highlight how interactive prediction enables safe and efficient coordination, outperforming baselines that rely on weighted objectives or open-loop human models.

2511.08949 2026-05-29 cs.CL

EVADE: LLM-Based Explanation Generation and Validation for Error Detection in NLI

EVADE：基于LLM的解释生成与验证用于NLI错误检测

Longfei Zuo, Barbara Plank, Siyao Peng

发表机构 * Technical University of Munich（慕尼黑技术大学）； MaiNLP, Center for Information and Language Processing, LMU Munich（MaiNLP，信息与语言处理中心，慕尼黑大学）； Munich Center for Machine Learning (MCML)（慕尼黑机器学习中心（MCML））

AI总结提出EVADE框架，利用大语言模型生成和验证解释以检测NLI数据集中的标注错误，实验表明LLM验证能减少人力并提升微调性能。

AI中文摘要

高质量数据集对于训练和评估可靠的NLP模型至关重要。在自然语言推理（NLI）等任务中，当同一实例有多个有效标签时，会出现人类标签变异（HLV），这使得难以区分标注错误和合理的变异。先前的框架VARIERR（Weber-Genzel等人，2024）在第一轮要求多位标注者解释其标签决策，并在第二轮通过有效性判断标记错误。然而，进行两轮人工标注成本高昂，且可能限制合理标签或解释的覆盖范围。我们的研究提出了一个新框架EVADE，用于使用大语言模型（LLM）生成和验证解释以检测错误。我们进行了全面分析，比较了人类和LLM检测的NLI错误，涉及分布比较、验证重叠以及对模型微调的影响。实验表明，LLM验证能优化生成的解释分布，使其更接近人类标注，并且从训练数据中移除LLM检测的错误比移除人类标注者识别的错误更能提升微调性能。这凸显了在标签变异下扩展错误检测、减少人工努力同时提高数据集质量的潜力。

英文摘要

High-quality datasets are critical for training and evaluating reliable NLP models. In tasks like natural language inference (NLI), human label variation (HLV) arises when multiple labels are valid for the same instance, making it difficult to separate annotation errors from plausible variation. An earlier framework, VARIERR (Weber-Genzel et al., 2024), asks multiple annotators to explain their label decisions in the first round and flags errors through validity judgments in the second round. However, conducting two rounds of manual annotation is costly and may limit the coverage of plausible labels or explanations. Our study proposes a new framework, EVADE, for generating and validating explanations to detect errors using large language models (LLMs). We perform a comprehensive analysis comparing human- and LLM-detected errors for NLI across distribution comparison, validation overlap, and impact on model fine-tuning. Our experiments demonstrate that LLM validation refines generated explanation distributions to more closely align with human annotations, and that removing LLM-detected errors from training data yields larger improvements in fine-tuning performance than removing errors identified by human annotators. This highlights the potential to scale error detection, reducing human effort while improving dataset quality under label variation.

2511.08548 2026-05-29 cs.AI

A Matter of Interest: Understanding Interestingness of Math Problems in Humans and Language Models

兴趣问题：理解人类与语言模型对数学问题的兴趣度

Shubhra Mishra, Yuka Machino, Gabriel Poesia, Albert Jiang, Joy Hsu, Adrian Weller, Challenger Mishra, David Broman, Joshua B. Tenenbaum, Mateja Jamnik, Cedegao E. Zhang, Katherine M. Collins

发表机构 * KTH Royal Institute of Technology（瑞典皇家理工学院）； Stanford University（斯坦福大学）； Massachusetts Institute of Technology（麻省理工学院）； University of Cambridge（剑桥大学）； Kempner Institute at Harvard University（哈佛大学肯普纳研究所）； Mistral AI

AI总结通过比较大型语言模型与不同数学背景人群对数学问题的兴趣度评分，研究LLM在兴趣判断上与人类的一致性，并评估其生成有趣问题的能力。

Comments Published at the Math-AI Workshop, NeurIPS 2025

AI中文摘要

数学的演变受到兴趣度的重要影响：研究人员选择要解决的问题，学生选择要参与的问题，都是基于对兴趣和挑战的期望。随着AI系统，特别是那些在自然语言和形式数学上灵活操作的大型语言模型（LLMs）越来越多地用于数学研究和教育，描述它们的判断与来自不同数学背景的人们的判断有多接近变得至关重要。我们通过将LLM的评分与两个人群（具有大学数学经验的众包参与者和国际数学奥林匹克竞赛选手）的评分进行比较，研究LLM是否与人类的兴趣度判断一致。尽管许多LLM在广泛层面上与人类对兴趣度的看法一致，但它们在很大程度上未能匹配人类判断的分布。它们与人类认为问题有趣的原因也弱对齐，与人类选择的理由相关性低。最后，我们评估了LLM生成有趣问题的能力，发现经过有效性过滤后，LLM能够生成引人入胜的问题。我们得出结论，包括需要多LLM人机协作系统，这突显了LLM作为数学推理伙伴的前景和当前局限。

英文摘要

The evolution of mathematics is shaped importantly by interestingness: researchers choose which problems to pursue, and students choose which problems to engage with, based on expectations of interest and challenge. As AI systems, particularly large language models (LLMs) that operate flexibly over natural language and formal mathematics, are increasingly used in mathematics research and education, it becomes crucial to characterize how closely their judgments align with people from different mathematical backgrounds. We study whether LLMs align with human interestingness judgments by comparing LLM ratings with those of two populations, crowdsourced participants with college math experience and International Math Olympiad competitors. Although many LLMs broadly agree with human notions of interestingness, they largely fail to match the distribution of human judgments. They also weakly align with why humans find problems interesting, with low correlation to human-selected rationales. Finally, we evaluate LLMs' ability to generate interesting problems and find that, after filtering for validity, LLMs are able to generate engaging problems. We conclude with takeaways, including the need for multi-LLM human-AI collaborative systems, that highlight both the promise and current limits of LLMs as partners in mathematical reasoning.

2511.08423 2026-05-29 cs.CV

OmniAID: Decoupling Semantic and Artifacts for Universal AI-Generated Image Detection in the Wild

OmniAID: 解耦语义与伪影以实现通用AI生成图像野外检测

Yuncheng Guo, Junyan Ye, Chenjue Zhang, Hengrui Kang, Haohuan Fu, Conghui He, Weijia Li

发表机构 * Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）； Sun Yat-Sen University（中山大学）； Tsinghua Shenzhen International Graduate School, Tsinghua University（清华大学深圳国际研究生院，清华大学）； Shanghai Jiao Tong University（上海交通大学）

AI总结提出OmniAID框架，通过解耦混合专家架构分离语义缺陷和通用伪影，结合两阶段训练策略和Mirage数据集，实现跨生成模型和语义内容的鲁棒AI生成图像检测。

Comments Accepted by ICML 2026

AI中文摘要

一个真正通用的AI生成图像（AIGI）检测器必须同时泛化到多种生成模型和不同的语义内容。当前方法学习单一的、纠缠的伪造表示，混淆了内容相关的缺陷与内容无关的伪影，并进一步受到过时基准的限制。我们提出OmniAID，一种以解耦混合专家（MoE）架构为核心的新框架，该架构分离了：（1）通过可路由的专门语义专家在不同内容领域中的语义缺陷，以及（2）通过固定的通用伪影专家从内容相关缺陷中分离出内容无关的通用伪影。两阶段训练策略首先通过领域特定的困难采样独立专门化专家，然后训练一个轻量级门控网络以实现有效的输入路由。通过明确解耦“生成了什么”（内容特定缺陷）与“如何生成”（通用伪影），OmniAID实现了鲁棒的泛化。我们还引入了Mirage，一个大规模、当代的数据集，包含现代训练集和具有挑战性的测试集。大量实验表明，OmniAID超越了现有检测器，为针对现代野外威胁的AIGI检测建立了新标准。代码可在https://github.com/yunncheng/OmniAID获取。

英文摘要

A truly universal AI-Generated Image (AIGI) detector must simultaneously generalize across diverse generative models and varied semantic content. Current methods learn a single, entangled forgery representation, conflating content-dependent flaws with content-agnostic artifacts, and are further constrained by outdated benchmarks. We propose OmniAID, a novel framework centered on a decoupled Mixture-of-Experts (MoE) architecture that separates: (1) semantic flaws across distinct content domains via Routable Specialized Semantic Experts, and (2) content-agnostic universal artifacts from content-dependent flaws via a Fixed Universal Artifact Expert. A two-stage training strategy first specializes experts independently with domain-specific hard-sampling, then trains a lightweight gating network for effective input routing. By explicitly decoupling "what is generated" (content-specific flaws) from "how it is generated" (universal artifacts), OmniAID achieves robust generalization. We also introduce Mirage, a large-scale, contemporary dataset comprising a modern training set and a challenging test set. Extensive experiments demonstrate that OmniAID surpasses existing detectors, establishing a new standard for AIGI detection against modern, in-the-wild threats. Code is available at https://github.com/yunncheng/OmniAID.

2511.04758 2026-05-29 cs.RO cs.AI cs.MA

ScheduleStream: Temporal Planning with Samplers for GPU-Accelerated Multi-Arm Task and Motion Planning & Scheduling

ScheduleStream: 基于采样器的时序规划用于GPU加速的多臂任务与运动规划及调度

Caelan Garrett, Fabio Ramos

发表机构 * NVIDIA Research Seattle Robotics Lab (SRL)（NVIDIA西雅图机器人实验室）； University of Sydney（悉尼大学）

AI总结提出ScheduleStream，首个通用框架，通过混合持续动作和领域无关算法，结合GPU加速采样器，实现多臂并行任务与运动规划及调度。

Comments Project website: https://schedulestream.github.io

Journal ref 2026 IEEE International Conference on Robotics and Automation (ICRA)

AI中文摘要

双臂和类人机器人因其类似人类利用多臂高效完成任务的能力而具有吸引力。然而，由于混合离散-连续动作空间的增长，同时控制多个臂在计算上具有挑战性。任务与运动规划（TAMP）算法可以在混合空间中高效规划，但通常生成一次只移动一个臂的计划，而不是允许并行臂运动的调度。为了将TAMP扩展到生成调度，我们提出了ScheduleStream，这是第一个用于带采样操作的规划与调度的通用框架。ScheduleStream使用混合持续动作对时间动态进行建模，这些动作可以异步启动，并持续一个由其参数决定的时长。我们提出了领域无关的算法，无需任何特定于应用的机制即可解决ScheduleStream问题。我们将ScheduleStream应用于任务与运动规划及调度（TAMPAS），其中我们利用采样器内的GPU加速来加快规划。我们将ScheduleStream算法与模拟中的几种消融方法进行比较，发现它们能产生更高效的解决方案。我们在https://schedulestream.github.io上展示了ScheduleStream在几个真实世界双臂机器人任务上的应用。

英文摘要

Bimanual and humanoid robots are appealing because of their human-like ability to leverage multiple arms to efficiently complete tasks. However, controlling multiple arms at once is computationally challenging due to the growth in the hybrid discrete-continuous action space. Task and Motion Planning (TAMP) algorithms can efficiently plan in hybrid spaces but generally produce plans in which only one arm moves at a time, rather than schedules that allow for parallel arm motion. In order to extend TAMP to produce schedules, we present ScheduleStream, the first general-purpose framework for planning & scheduling with sampling operations. ScheduleStream models temporal dynamics using hybrid durative actions, which can be started asynchronously and persist for a duration that is a function of their parameters. We propose domain-independent algorithms that solve ScheduleStream problems without any application-specific mechanisms. We apply ScheduleStream to Task and Motion Planning & Scheduling (TAMPAS), where we use GPU acceleration within samplers to expedite planning. We compare ScheduleStream algorithms to several ablations in simulation and find that they produce more efficient solutions. We demonstrate ScheduleStream on several real-world bimanual robot tasks at https://schedulestream.github.io.

2510.27391 2026-05-29 cs.CV cs.LG

Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds

异质双曲流形上的树间模态对齐

Wei Wu, Xiaomeng Fan, Yuwei Wu, Zhi Gao, Pengxiang Li, Yunde Jia, Mehrtash Harandi

发表机构 * Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology（北京智能信息科技重点实验室，计算机科学与技术学院，北京理工大学）； Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University（广东机器感知与智能计算实验室，深圳MSU-BIT大学）； Department of Electrical and Computer System Engineering, Monash University（电气与计算机系统工程系，蒙纳士大学）

AI总结提出一种在异质双曲流形上对齐图像和文本树状层次特征的方法，通过交叉注意力提取视觉层次特征、异质流形嵌入及KL距离度量学习中间流形，在开放集分类任务中优于基线。

Comments Published as a conference paper at ICLR 2026

Journal ref The Fourteenth International Conference on Learning Representations (ICLR 2026), Rio de Janeiro, Brazil, 2026

AI中文摘要

模态对齐对于视觉-语言模型（VLM）有效整合跨模态信息至关重要。然而，现有方法在提取文本层次特征的同时，对每个图像仅用单一特征表示，导致不对称和次优的对齐。为解决此问题，我们提出树间对齐（Alignment across Trees）方法，该方法为图像和文本模态构建并对齐树状层次特征。具体而言，我们引入一个语义感知的视觉特征提取框架，该框架对来自中间Transformer层的视觉类别标记应用交叉注意力机制，由文本线索引导以提取具有从粗到细语义的视觉特征。然后，我们将两种模态的特征树嵌入到具有不同曲率的双曲流形中，以有效建模其层次结构。为了在不同曲率的异质双曲流形之间进行对齐，我们推导了异质流形上分布之间的KL距离度量，并通过最小化该距离学习一个用于流形对齐的中间流形。我们证明了最优中间流形的存在性和唯一性。在多个图像数据集上的分类学开放集分类任务实验表明，我们的方法在少样本和跨域设置下持续优于强基线。

英文摘要

Modality alignment is critical for vision-language models (VLMs) to effectively integrate information across modalities. However, existing methods extract hierarchical features from text while representing each image with a single feature, leading to asymmetric and suboptimal alignment. To address this, we propose Alignment across Trees, a method that constructs and aligns tree-like hierarchical features for both image and text modalities. Specifically, we introduce a semantic-aware visual feature extraction framework that applies a cross-attention mechanism to visual class tokens from intermediate Transformer layers, guided by textual cues to extract visual features with coarse-to-fine semantics. We then embed the feature trees of the two modalities into hyperbolic manifolds with distinct curvatures to effectively model their hierarchical structures. To align across the heterogeneous hyperbolic manifolds with different curvatures, we formulate a KL distance measure between distributions on heterogeneous manifolds, and learn an intermediate manifold for manifold alignment by minimizing the distance. We prove the existence and uniqueness of the optimal intermediate manifold. Experiments on taxonomic open-set classification tasks across multiple image datasets demonstrate that our method consistently outperforms strong baselines under few-shot and cross-domain settings.
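
To show why manifolds "with distinct curvatures" cannot be compared directly, the sketch below computes the geodesic distance in a Poincare ball of curvature -c. The abstract does not state which hyperbolic model the authors use, so the Poincare ball and the function name `poincare_distance` are assumptions; the closed-form distance itself is the standard one.

```python
import math

def poincare_distance(x, y, c=1.0):
    """Geodesic distance in the Poincare ball of curvature -c (requires c > 0).

    Points must lie strictly inside the ball, i.e. c * ||v||^2 < 1.
    """
    sq_norm = lambda v: sum(a * a for a in v)
    diff = sq_norm([a - b for a, b in zip(x, y)])
    denom = (1.0 - c * sq_norm(x)) * (1.0 - c * sq_norm(y))
    return math.acosh(1.0 + 2.0 * c * diff / denom) / math.sqrt(c)


# The same Euclidean coordinates give different distances under different curvatures.
origin, point = (0.0, 0.0), (0.25, 0.0)
d_unit = poincare_distance(origin, point, c=1.0)
d_curved = poincare_distance(origin, point, c=4.0)
```

Since the metric changes with c, features embedded under one curvature have no directly comparable geometry under another, which motivates aligning through some shared intermediate space as the abstract describes.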

2510.22437 2026-05-29 cs.AI cs.CL

Modeling Hierarchical Thinking in Large Reasoning Models

大型推理模型中的层次化思维建模

G M Shahariar, Erfan Shayegani, Ali Nazari, Nael Abu-Ghazaleh

发表机构 * University of California, Riverside（加州大学河滨分校）； Independent Researcher（独立研究者）

AI总结本文提出将大型推理模型（LRM）的层次化推理动态近似为有限状态机（FSM）中的轨迹，并通过Q值引导的推理时控制方法实现高效推理优化。

Comments Accepted in ICML 2026 as Oral

AI中文摘要

大型推理模型（LRM）通过生成长链思维（CoT）序列来解决复杂任务；然而，控制推理轨迹的涌现动态尚未被充分理解，可能导致不一致性和推理病态。在这项工作中，我们提出将LRM的涌现层次化推理动态近似为有限状态机（FSM）中的轨迹，该状态机在六个抽象认知状态之间转换。我们证明这些状态和转换可以在模型的潜在状态中捕获。我们相信这种表示在LRM模型的可解释性和优化中具有不同的应用。例如，通过分析这些转换的拓扑结构，我们识别出推理策略中的统计变化，有助于从失败的推理链中识别出有效的推理链。为了说明这些潜在优势，我们提出了Q值引导转向，一种无需训练的推理时控制方法，将推理视为规划问题。我们估计状态转换的长期效用，并在句子边界处应用稀疏、正交的激活转向，使CoT生成与最优推理策略对齐。使用三个最先进的开源推理模型在四个基准测试（AIME25、MATH-500、GSM8k和GPQA Diamond）上的实验表明，Q值转向策略以“外科手术式”的效率实现了显著的性能提升，通常需要的干预次数比贪婪和加权基线少25倍，这表明通过引导高层认知动态而非微观管理令牌生成，可以有效地控制推理。代码可在 https://github.com/shahariar-shibli/CoT-FSM 获取。

英文摘要

Large Reasoning Models (LRMs) solve complex tasks by generating long Chain-of-Thought (CoT) sequences; however, the emergent dynamics governing reasoning trajectories are not well understood and can lead to inconsistencies and reasoning pathologies. In this work, we propose to approximate an LRM's emergent hierarchical reasoning dynamics as a trajectory within a Finite State Machine (FSM) transitioning among six abstract cognitive states. We demonstrate that these states and transitions can be captured in the latent state of the model. We believe that this representation can have different applications in the interpretability and optimization of LRMs. For example, by analyzing the topology of these transitions, we identify statistical shifts in reasoning strategies that help distinguish effective reasoning chains from those that fail. To illustrate these potential advantages, we propose Q-Value guided steering, a training-free inference-time control method that treats reasoning as a planning problem. We estimate the long-horizon utility of state transitions and apply sparse, orthogonal activation steering at sentence boundaries to align the CoT generation with optimal reasoning policies. Experiments across four benchmarks (AIME25, MATH-500, GSM8k, and GPQA Diamond) using three state-of-the-art open reasoning models demonstrate that the Q-Value steering policy achieves significant performance gains with "surgical" efficiency, often requiring 25 times fewer interventions than greedy and weighted baselines, which suggests that reasoning can be effectively controlled by guiding high-level cognitive dynamics rather than micro-managing token generation. Code is available at: https://github.com/shahariar-shibli/CoT-FSM.

2510.18416 2026-05-29 cs.SD

SegTune: Structured and Fine-Grained Control for Song Generation

SegTune：歌曲生成的结构化与细粒度控制

Pengfei Cai, Joanna Wang, Haorui Zheng, Xu Li, Zihao Ji, Teng Ma, Zhongliang Liu, Chen Zhang, Pengfei Wan

发表机构 * Kuaishou Technology（快手科技）

AI总结提出非自回归框架SegTune，通过段级局部描述和全局提示实现歌曲的结构化可控生成，并引入基于LLM的时长预测器实现精确的歌词-音乐对齐。

Comments This technical report was later revised and published at ACL 2026 (oral). ACL paper link: https://openreview.net/forum?id=FKf2S4u8at , code: https://github.com/KlingAIResearch/SegTune

AI中文摘要

近期歌曲生成领域的进展在根据歌词和/或全局文本提示生成歌曲方面展现了有希望的结果。然而，大多数现有系统缺乏对歌曲随时间变化属性的建模能力，限制了对音乐结构和动态的细粒度控制。在本文中，我们提出SegTune，一个用于结构化和可控歌曲生成的非自回归框架。SegTune通过允许用户或大语言模型指定与歌曲段落对齐的局部音乐描述来实现段级控制。段级提示通过时间广播注入到对应时间窗口的模型中，而全局提示则影响整首歌曲以确保风格一致性。为了获得准确的段落时长并实现精确的歌词-音乐对齐，我们引入了一个基于LLM的时长预测器，该预测器以自回归方式生成LRC格式的句子级带时间戳歌词。我们进一步构建了一个大规模数据管道，用于收集带有对齐歌词和提示的高质量歌曲，并提出了新的评估指标来评估段级对齐和声乐属性一致性。实验结果表明，与现有基线相比，SegTune实现了优越的可控性和音乐连贯性。参见https://cai525.github.io/SegTune_demo获取我们工作的演示。

英文摘要

Recent advancements in song generation have shown promising results in generating songs from lyrics and/or global text prompts. However, most existing systems lack the ability to model the temporally varying attributes of songs, limiting fine-grained control over musical structure and dynamics. In this paper, we propose SegTune, a non-autoregressive framework for structured and controllable song generation. SegTune enables segment-level control by allowing users or large language models to specify local musical descriptions aligned to song sections. The segmental prompts are injected into the model by temporally broadcasting them to corresponding time windows, while global prompts influence the whole song to ensure stylistic coherence. To obtain accurate segment durations and enable precise lyric-to-music alignment, we introduce an LLM-based duration predictor that autoregressively generates sentence-level timestamped lyrics in LRC format. We further construct a large-scale data pipeline for collecting high-quality songs with aligned lyrics and prompts, and propose new evaluation metrics to assess segment-level alignment and vocal attribute consistency. Experimental results show that SegTune achieves superior controllability and musical coherence compared to existing baselines. See https://cai525.github.io/SegTune_demo for demos of our work.
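
For readers unfamiliar with LRC, each line carries a `[mm:ss.xx]` timestamp followed by the lyric, so sentence-level durations fall out of consecutive timestamps. The sketch below parses such text and derives durations. The helper names, the sample lyrics, and the `song_end` parameter are illustrative assumptions and say nothing about how SegTune's predictor is implemented.

```python
import re

# Matches lines such as "[01:02.25]Some lyric text".
LRC_LINE = re.compile(r"\[(\d+):(\d+(?:\.\d+)?)\](.*)")

def parse_lrc(text):
    """Return (start_seconds, lyric) pairs for every timestamped line."""
    entries = []
    for line in text.splitlines():
        match = LRC_LINE.match(line.strip())
        if match:
            minutes, seconds, lyric = match.groups()
            entries.append((int(minutes) * 60 + float(seconds), lyric.strip()))
    return entries

def line_durations(entries, song_end):
    """Duration of each line: gap to the next timestamp, or to the song end."""
    starts = [start for start, _ in entries] + [song_end]
    return [round(starts[i + 1] - starts[i], 2) for i in range(len(entries))]


sample = """[00:12.50]First line of the verse
[00:17.00]Second line of the verse
[01:02.25]First line of the chorus"""
entries = parse_lrc(sample)
durations = line_durations(entries, song_end=70.0)
```

Grouping consecutive lines by song section would then give the segment-level time windows to which local prompts are broadcast.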

2510.14150 2026-05-29 cs.AI cs.LG cs.NE

CodeEvolve: an open source evolutionary coding agent for algorithmic discovery and optimization

CodeEvolve：用于算法发现和优化的开源进化编码智能体

Henrique Assumpção, Diego Ferreira, Leandro Campos, Fabricio Murai

发表机构 * Inter Science - Inter&Co ； Federal University of Minas Gerais（米纳斯吉拉斯联邦大学）； Worcester Polytechnic Institute（伍斯特理工学院）

AI总结提出CodeEvolve开源框架，结合大语言模型与岛屿进化搜索，通过灵感交叉、元提示和深度细化，在AlphaEvolve基准上匹配或超越5/9问题，并在匹配条件下优于OpenEvolve和ShinkaEvolve，以更低成本超越前沿闭源集成。

Comments 21 pages, 16 figures, 8 tables

AI中文摘要

我们介绍了CodeEvolve，一个开源框架，它将大语言模型与基于岛屿的进化搜索相结合，用于端到端的算法发现。CodeEvolve在CVT-MAP-Elites存档和加权LLM集成之上集成了基于灵感的交叉、元提示和深度细化，为复杂问题生成优化解决方案。在AlphaEvolve基准套件上，CodeEvolve在9个问题中的5个上匹配或超过了报告的AlphaEvolve结果，并且在匹配条件下，在9个问题中的6个上优于开源框架OpenEvolve和ShinkaEvolve。使用开放权重的Qwen3-Coder-30B骨干网络，它在CirclePackingSquare的两个实例上均超过了报告的AlphaEvolve分数，成本大约比前沿闭源集成低一个数量级，并且在无需重新调整的情况下，在启发式设计任务上与EoH保持竞争力。消融实验表明，CodeEvolve组件之间的相互作用（而非任何单一算子）驱动了这些结果。我们在https://github.com/inter-co/science-codeevolve 发布了该框架、实验数据和实用的超参数指南。

英文摘要

We introduce CodeEvolve, an open-source framework that couples large language models with island-based evolutionary search for end-to-end algorithmic discovery. CodeEvolve integrates inspiration-based crossover, meta-prompting, and depth-based refinement on top of a CVT-MAP-Elites archive and a weighted LLM ensemble to generate optimized solutions for complex problems. On the AlphaEvolve benchmark suite, CodeEvolve matches or surpasses the reported AlphaEvolve results on 5 of 9 problems and, under matched conditions, outperforms the open-source frameworks OpenEvolve and ShinkaEvolve on 6 of 9. With the open-weight Qwen3-Coder-30B backbone, it surpasses the reported AlphaEvolve score on both CirclePackingSquare instances at roughly an order of magnitude lower cost than a frontier closed-source ensemble, and remains competitive with EoH on heuristic-design tasks without retuning. Ablations show that the interaction between CodeEvolve's components, rather than any single operator, drives these results. We release the framework, experimental data, and practical hyperparameter guidelines at https://github.com/inter-co/science-codeevolve.

2510.11499 2026-05-29 cs.LG cs.AI

Offline Reinforcement Learning with Generative Trajectory Policies

基于生成轨迹策略的离线强化学习

Xinsong Feng, Leshu Tang, Chenan Wang, Haipeng Chen

发表机构 * School of Computing, Data Sciences ； Computer Engineering Department, UCLA, Los Angeles, USA

AI总结本文提出生成轨迹策略（GTP），通过统一扩散、流匹配和一致性模型为常微分方程驱动的连续时间生成轨迹，并引入两种理论自适应方法，在D4RL基准上达到最先进性能。

Comments ICML 2026

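AI abstract (translated from Chinese)

Generative models have become a powerful class of policies for offline reinforcement learning because they can capture complex, multi-modal behaviors. Existing methods, however, face a stark trade-off: slow, iterative models such as diffusion policies are computationally expensive, while fast, single-step models such as consistency policies often suffer degraded performance. In this paper, we show that this gap can be bridged. We argue that the key to moving beyond the limitations of individual methods is a unifying perspective that treats modern generative models (including diffusion, flow matching, and consistency models) as specific instances of learning a continuous-time generative trajectory governed by an ordinary differential equation. This principled foundation provides a clearer design space for generative policies in reinforcement learning and allows us to propose Generative Trajectory Policies (GTP), a new and more general policy paradigm that learns the entire solution map of the underlying ODE. To make this paradigm practical for offline reinforcement learning, we further introduce two theoretically principled adaptations. Experimental results show that GTP achieves state-of-the-art performance on the D4RL benchmarks: it significantly outperforms prior generative policies and attains perfect scores on several notoriously hard AntMaze tasks.

English abstract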

Generative models have emerged as a powerful class of policies for offline reinforcement learning (RL) due to their ability to capture complex, multi-modal behaviors. However, existing methods face a stark trade-off: slow, iterative models like diffusion policies are computationally expensive, while fast, single-step models like consistency policies often suffer from degraded performance. In this paper, we demonstrate that it is possible to bridge this gap. The key to moving beyond the limitations of individual methods, we argue, lies in a unifying perspective that views modern generative models, including diffusion, flow matching, and consistency models, as specific instances of learning a continuous-time generative trajectory governed by an Ordinary Differential Equation (ODE). This principled foundation provides a clearer design space for generative policies in RL and allows us to propose Generative Trajectory Policies (GTPs), a new and more general policy paradigm that learns the entire solution map of the underlying ODE. To make this paradigm practical for offline RL, we further introduce two key theoretically principled adaptations. Empirical results demonstrate that GTP achieves state-of-the-art performance on D4RL benchmarks - it significantly outperforms prior generative policies, achieving perfect scores on several notoriously hard AntMaze tasks.
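
The distinction between iterating an ODE solver and learning its solution map can be made concrete with a small sketch. This is an illustration only, not the GTP method: the velocity field dx/dt = -x is a hypothetical toy choice with a known closed-form solution, whereas in practice both the velocity field and the solution map would be learned networks.

```python
import math


def velocity(x, t):
    # Toy velocity field dx/dt = -x (hypothetical; a learned network in practice).
    return -x


def euler_integrate(x, t, s, n_steps):
    """Iterative sampler: n_steps explicit Euler steps from time t to time s."""
    h = (s - t) / n_steps
    for k in range(n_steps):
        x = x + h * velocity(x, t + k * h)
    return x


def solution_map(x, t, s):
    """Exact solution map of dx/dt = -x: jumps from time t to time s in one call."""
    return x * math.exp(-(s - t))


x0 = 1.5
slow = euler_integrate(x0, 0.0, 1.0, n_steps=1000)  # many small steps
fast = solution_map(x0, 0.0, 1.0)                   # a single evaluation
# Composing two jumps gives the same result as one jump (semigroup property).
two_hops = solution_map(solution_map(x0, 0.0, 0.4), 0.4, 1.0)
```

The many-step integrator plays the role of a slow iterative sampler, while the solution map reaches the same endpoint in one evaluation and can also be composed over intermediate times, which is the flexibility a learned solution map is meant to offer.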

2510.03550 2026-05-29 cs.CV

Streaming Drag-Oriented Interactive Video Manipulation: Drag Anything, Anytime!

Streaming Drag-Oriented Interactive Video Manipulation: Drag Any Object, Anytime!

Junbao Zhou, Yuan Zhou, Kesen Zhao, Qingshan Xu, Beier Zhu, Richang Hong, Hanwang Zhang

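Affiliations: Nanyang Technological University; University of Science and Technology of China; Hefei University of Technology

AI summary: Proposes the REVEL task and the DragStream method, which enable streaming drag-based interactive manipulation for autoregressive video diffusion models through adaptive distribution self-rectification and spatial-frequency selective optimization.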

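AI abstract (translated from Chinese)

Achieving streaming, fine-grained control over the outputs of autoregressive video diffusion models remains challenging, which makes it difficult to ensure that they consistently match user expectations. To bridge this gap, we propose streaming drag-oriented interactive video manipulation (REVEL), a new task that lets users modify generated videos anytime, on anything, through fine-grained interactive drag. Going beyond DragVideo and SG-I2V, REVEL unifies drag-style video manipulation as editing and animating video frames, both supporting user-specified translation, deformation, and rotation effects, which makes drag operations more versatile. In addressing REVEL, we observe that: i) drag-induced perturbations accumulate in latent space, causing severe latent distribution drift that halts the drag process; ii) streaming drag is easily disturbed by context frames, producing visually unnatural results. We therefore propose a training-free method, DragStream, comprising: i) an adaptive distribution self-rectification strategy that uses the statistics of neighboring frames to effectively constrain the drift of latent embeddings; ii) a spatial-frequency selective optimization mechanism that lets the model fully exploit contextual information while mitigating its interference by selectively propagating visual cues along the generation process. Our method can be seamlessly integrated into existing autoregressive video diffusion models, and extensive experiments demonstrate the effectiveness of DragStream.

English abstract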

Achieving streaming, fine-grained control over the outputs of autoregressive video diffusion models remains challenging, making it difficult to ensure that they consistently align with user expectations. To bridge this gap, we propose stReaming drag-oriEnted interactiVe vidEo manipuLation (REVEL), a new task that enables users to modify generated videos anytime on anything via fine-grained, interactive drag. Beyond DragVideo and SG-I2V, REVEL unifies drag-style video manipulation as editing and animating video frames, both of which support user-specified translation, deformation, and rotation effects, making drag operations versatile. In resolving REVEL, we observe: i) drag-induced perturbations accumulate in latent space, causing severe latent distribution drift that halts the drag process; ii) streaming drag is easily disturbed by context frames, thereby yielding visually unnatural outcomes. We thus propose a training-free approach, DragStream, comprising: i) an adaptive distribution self-rectification strategy that leverages neighboring frames' statistics to effectively constrain the drift of latent embeddings; ii) a spatial-frequency selective optimization mechanism, allowing the model to fully exploit contextual information while mitigating its interference by selectively propagating visual cues along generation. Our method can be seamlessly integrated into existing autoregressive video diffusion models, and extensive experiments demonstrate the effectiveness of DragStream.
