arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.27006 2026-05-27 cs.LG cond-mat.dis-nn stat.ML

Sampling Data with Chains of Forward-Backward Diffusion Steps

Hyunmo Kang, Noam Itzhak Levi, Corinna Elena Wegner, Daniel J. Korchinski, Matthieu Wyart

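AI summary: Proposes U-turn chains, Markov chains built by iterating short forward-backward steps of a diffusion model and paired with a Metropolis-Hastings correction to sample from energy-modified targets, and finds that minimal U-turn dynamics undergo an ergodicity-breaking phase transition driven by fragmentation of the data manifold.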

Abstract

Sampling from learned high-dimensional distributions is a foundational computational problem. We introduce U-turn chains: Markov chains obtained by iterating short forward-backward steps of a diffusion model, in which each step proposes a move that remains on the learned data manifold and, paired with a Metropolis-Hastings correction, samples from energy-modified targets. For synthetic languages, we show that minimal U-turn dynamics undergoes an ergodicity-breaking phase transition driven by fragmentation of the data manifold; ergodicity is restored at larger U-turn magnitude. In the non-ergodic regime, low-level features relax faster than high-level ones, an ordering that inverts only at sufficiently large U-turn magnitude. We test these predictions on natural language and natural images. In both modalities, minimal U-turns relax slowly, especially for high-level features approximated by deep representations in CNNs or LLMs. The layer-ordering inversion appears only at large noise when mixing is efficient; these signatures are consistent with strongly constrained, weakly mixing local dynamics. We discuss the implications of these results for sampling with diffusion models.
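
The U-turn construction can be illustrated on a toy discrete model. The sketch below is not the authors' implementation: it assumes a small categorical distribution `p` standing in for the learned model, a simple corruption channel standing in for the forward diffusion step, and an exact posterior draw standing in for the backward step. Under those assumptions the proposal is reversible with respect to `p`, so a Metropolis-Hastings correction for a target proportional to `p(x) * exp(-E(x))` reduces to `min(1, exp(-(E(x') - E(x))))`. All function and parameter names are hypothetical.

```python
import math
import random


def target_distribution(p, energy):
    """Normalized energy-modified target, proportional to p(x) * exp(-E(x))."""
    weights = [px * math.exp(-e) for px, e in zip(p, energy)]
    total = sum(weights)
    return [w / total for w in weights]


def make_uturn_step(p, energy, noise, rng):
    """Build one U-turn + Metropolis-Hastings transition on states 0..n-1.

    Toy assumptions (not from the paper): `p` is an explicit categorical
    model, the forward step resamples the state uniformly with probability
    `noise`, and the backward step samples the exact posterior p(x' | y).
    """
    n = len(p)

    def forward_prob(y, x):
        # q(y | x) for the toy corruption channel.
        return (1.0 - noise) * (1.0 if y == x else 0.0) + noise / n

    def step(x):
        # Forward: corrupt x into a noisy state y.
        y = rng.randrange(n) if rng.random() < noise else x
        # Backward: draw x' from p(x' | y), proportional to p(x') * q(y | x').
        weights = [p[s] * forward_prob(y, s) for s in range(n)]
        proposal = rng.choices(range(n), weights=weights)[0]
        # The proposal kernel satisfies detailed balance with respect to p,
        # so only the energy difference enters the acceptance probability.
        accept = min(1.0, math.exp(-(energy[proposal] - energy[x])))
        return proposal if rng.random() < accept else x

    return step


def run_chain(p, energy, noise, n_steps, seed=0):
    """Run the chain from state 0 and return empirical state frequencies."""
    rng = random.Random(seed)
    step = make_uturn_step(p, energy, noise, rng)
    counts = [0] * len(p)
    x = 0
    for _ in range(n_steps):
        x = step(x)
        counts[x] += 1
    return [c / n_steps for c in counts]
```

In this toy, a small `noise` keeps the backward draw close to the current state, which is the analogue of a minimal U-turn; a larger value allows longer jumps at the cost of a proposal that looks less like a local move.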

2605.27003 2026-05-27 cs.CV cs.AI

Timestep-Aware SVDQuant-GPTQ for W4A4 Quantization of Wan2.2-I2V

Junhao Wu, Dezhong Yao, Hai Jin

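AI summary: For W4A4 quantization of the Wan2.2-I2V video diffusion Transformer, proposes a post-training quantization framework that combines SVDQuant low-rank outlier compensation, GPTQ reconstruction-aware residual weight quantization, and timestep-bin-wise per-layer activation clipping-ratio search, reducing peak GPU memory by 59.3% on OpenS2V-Eval at a cost of only 0.9% in VBench average score.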

Abstract

W4A4 quantization of large video diffusion Transformers offers substantial memory savings but is hindered by two main challenges: sparse large-magnitude activation outliers, and strongly timestep-dependent activation distributions across the multi-step denoising trajectory. These difficulties are compounded by Wan2.2-I2V's two-expert Mixture-of-Experts DiT design, whose high-noise and low-noise experts exhibit distinct quantization sensitivities that a single global calibration policy cannot capture. We propose a post-training quantization framework combining SVDQuant-based low-rank outlier compensation, GPTQ-based reconstruction-aware residual weight quantization, and timestep-bin-wise per-layer activation clipping-ratio search conducted independently for each expert. On the OpenS2V-Eval benchmark, our method reduces peak GPU memory by 59.3% relative to the BF16 baseline while incurring only a 0.9% drop in VBench average score and a 2.3% drop in Imaging Quality, demonstrating that expert- and timestep-aware calibration is essential for high-fidelity W4A4 inference on MoE video DiTs.
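
To make the clipping-ratio search concrete, here is a minimal sketch under stated assumptions: activations are plain lists of floats, quantization is symmetric signed 4-bit, and the search criterion is mean squared reconstruction error over a small fixed grid. The paper's actual objective, grid, and per-layer bookkeeping are not described in the abstract, and the low-rank SVDQuant branch and GPTQ weight quantization are not shown. All names are hypothetical.

```python
def fake_quantize_int4(values, clip_ratio):
    """Symmetric 4-bit quantize-dequantize with range clip_ratio * max|v|."""
    max_abs = max(abs(v) for v in values)
    clip = clip_ratio * max_abs
    if clip == 0.0:
        return [0.0 for _ in values]
    scale = clip / 7.0  # signed 4-bit integers span [-8, 7]
    return [max(-8, min(7, round(v / scale))) * scale for v in values]


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def search_clip_ratio(values, ratios=(1.0, 0.9, 0.8, 0.7, 0.6, 0.5)):
    """Return the clipping ratio with the lowest reconstruction error."""
    return min(
        ratios,
        key=lambda r: mean_squared_error(values, fake_quantize_int4(values, r)),
    )


def calibrate_per_bin(activations_by_bin):
    """Search one ratio per timestep bin (bin label -> activation samples)."""
    return {b: search_clip_ratio(v) for b, v in activations_by_bin.items()}
```

Running `calibrate_per_bin` separately on calibration data from each expert would mirror the per-expert search the abstract describes; a single shared ratio corresponds to the global policy it argues against.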

2605.26999 2026-05-27 cs.CL cs.CR

Prompt Injection Detection is Regime-Dependent: A Deployment-Aware Evaluation with Interpretable Structural Signals

Akindoyin Akinrele, Shreyank N Gowda

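AI summary: Evaluates prompt injection detection methods with a multi-model, multi-regime experimental framework, finding that detection performance depends strongly on deployment regime and threshold selection; transformer-based models perform best overall, and structural signals give modest but consistent gains in certain regimes.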

Abstract

Prompt injection poses a critical threat to the safe deployment of large language models, yet existing detection approaches are typically evaluated under limited settings that do not reflect real-world operating constraints. In this work, we present a deployment-aware evaluation of prompt injection detection using a multi-model and multi-regime experimental framework. We compare lexical, semantic, structural, and transformer-based detectors across multiple out-of-distribution settings, repeated data splits, and both ranking and thresholded deployment metrics. We introduce interpretable structural signals that capture hierarchy overrides, system prompt spoofing, role redefinition, and evasion patterns, and assess their contribution both within sparse models and in combination with strong encoder baselines. Our results show that detection performance is highly regime-dependent and sensitive to threshold selection, with no single model dominating across all settings. Transformer-based models achieve the strongest overall performance, while structural signals provide modest but consistent gains in certain regimes and improve low false positive rate behaviour in harder scenarios. These findings highlight the gap between ranking performance and deployment effectiveness and underscore the importance of evaluating prompt injection defences under realistic operational constraints. Code will be released.
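
The gap between ranking metrics and thresholded deployment metrics can be made concrete with a small helper that reports the true positive rate at a fixed false positive budget. This is a generic sketch, not the paper's evaluation code; the function name and the convention that label 1 means "injection" are assumptions.

```python
def tpr_at_fpr(scores, labels, max_fpr):
    """Return (threshold, true positive rate) at a false positive budget.

    A sample is flagged when its score is strictly above the threshold.
    The threshold is the strictest one that keeps the fraction of flagged
    benign samples (label 0) at or below `max_fpr`.
    """
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    positives = [s for s, y in zip(scores, labels) if y == 1]
    allowed = int(max_fpr * len(negatives))  # benign samples we may flag
    if allowed < len(negatives):
        threshold = negatives[allowed]
    else:
        threshold = float("-inf")
    tpr = sum(s > threshold for s in positives) / len(positives)
    return threshold, tpr
```

Two detectors with similar ranking quality can differ sharply under this measure when the budget is tight, which is the kind of regime dependence the abstract reports.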

2605.26998 2026-05-27 cs.LG q-bio.NC

Probabilistic Recurrent Intention Switching Model

Wenyuan Sheng, Hao Zhu, Joschka Boedecker

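AI summary: Proposes PRISM, which models non-stationary intention switching with a lightweight recurrent network, admits an exact EM decomposition with closed-form solutions, and achieves the best likelihood while recovering interpretable intentions on gridworld, mouse labyrinth, and robotic manipulation tasks.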

Abstract

Inverse reinforcement learning (IRL) recovers reward functions from observed behavior, yet traditional methods assume a single stationary reward that cannot capture goal switching within an episode. Recent multi-intention IRL methods address this by segmenting trajectories, but model intention transitions as either a memoryless Markov chain or via manual state augmentation with a fixed history window. We propose the Probabilistic Recurrent Intention Switching Model (PRISM), which replaces both mechanisms with a lightweight recurrent network that maps observation history to a per-step intention distribution. We prove that the resulting EM objective decomposes exactly into independent per-intention reward subproblems, each solvable in closed form, yielding an $\mathcal{O}(nK)$ E-step with no variational approximation. We evaluate PRISM on a non-Markovian gridworld, a mouse labyrinth, and BridgeData V2 robotic manipulation, the first large-scale robotic application of multi-intention IRL. Across all settings, PRISM achieves the highest held-out log-likelihood while recovering nameable, temporally coherent intentions from unlabeled demonstrations, suggesting that discrete goal switching is present in both biological and artificial agents.
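
One plausible reading of the $\mathcal{O}(nK)$ E-step is a per-step mixture posterior: if the intention distribution at each step depends only on the observed history, the posterior over the K intentions can be computed independently for each of the n steps. The sketch below shows that computation only. It is an assumption about the structure, not the authors' derivation, and the argument names are hypothetical.

```python
def e_step(priors, likelihoods):
    """Per-step posterior over K intentions for a trajectory of n steps.

    priors[t][k]      : intention probability at step t (hypothetically the
                        output of a recurrent network over the history).
    likelihoods[t][k] : probability of the observed action at step t under
                        the policy induced by intention k's reward.
    Cost is O(n * K): one multiply and one normalization per entry.
    """
    responsibilities = []
    for prior_t, lik_t in zip(priors, likelihoods):
        joint = [p * l for p, l in zip(prior_t, lik_t)]
        total = sum(joint)
        responsibilities.append([j / total for j in joint])
    return responsibilities
```

These responsibilities would then weight each step's contribution to the separate per-intention reward subproblems in the M-step.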

2605.26992 2026-05-27 cs.CV

On the Robustness of Machine Unlearning for Vision-Language Models

Yujie Lin, Kaidi Jia, Jiayao Ma, Chengyi Yang, Jinsong Su

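AI summary: Presents the first systematic investigation of the robustness of machine unlearning for vision-language models, proposing three attack paradigms that reveal existing methods often hide rather than fully remove the target knowledge.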

Abstract

Vision-language models (VLMs) may memorize undesirable information from training data, motivating growing interest in machine unlearning. In this work, we present the first systematic survey and robustness analysis of VLM unlearning. We provide a comprehensive taxonomy and review of existing VLM unlearning methods, together with unified evaluations under multiple prompt settings. We then propose three attack paradigms to examine whether forgotten multimodal knowledge can be reactivated through contextual prompting or downstream retraining. Extensive experiments show that many existing methods remain vulnerable under these attacks, indicating that current approaches often hide rather than fully remove target knowledge. Our study provides new insights into the robustness and limitations of current VLM unlearning methods and highlights the need for more reliable multimodal unlearning strategies. Code is available at https://github.com/XMUDeepLIT/VLM-UnL-Attack.

2605.26991 2026-05-27 cs.RO

Towards Shared Embodied Intelligence in Humanoid Robots through Optimization Development and Testing of the Human Aware ergoCub Robot

Carlotta Sartore, Mohamed Elobaid, Lorenzo Rapetti, Giulio Romualdi, Stefano Dafarra, Nicola A. Piga, Ines Sorrentino, Paolo Maria Vicecone, Silvio Traversaro, Ugo Pattacini, Luca Fiorio, Francesco Draicchio, Giovanna Tranfo, Lorenzo Natale, Marco Maggiali, Daniele Pucci

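AI summary: Proposes an architecture that integrates shared intelligence and embodied cognition, optimizing robot hardware and control against human ergonomic metrics to enable physical human-robot collaboration, with the ergoCub humanoid robot as a concrete implementation.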

Abstract

Collaboration is central to human behavior, enabling tasks beyond individual capability. This ability arises from coordinating actions through internal representations of others, a concept known as shared intelligence. Additionally, humans are characterized by physical bodies and cognitive abilities that are optimized in response to their environment, a phenomenon referred to as embodied cognition. Designing humanoid robots that collaborate safely and effectively with people requires unifying these principles. Here we propose an architecture that integrates shared intelligence and embodied cognition to enable robots to physically collaborate with humans, where robot hardware and control are optimized for human metrics, using representations of the human body and motion intelligence. The ultimate goal is to achieve a form of shared embodied intelligence. Specifically, our architecture optimizes robot hardware and physical intelligence parameters with respect to human ergonomic metrics. This is accomplished by modeling human-robot interaction as a function of hardware configurations and embedding human models into the robot's physical intelligence. As a concrete implementation, we present the humanoid robot ergoCub, whose morphology and control have been optimized for collaborative tasks with humans. Our approach provides a framework for designing humanoid robots that prioritize human ergonomics at both the hardware and physical intelligence levels, with applications in industrial and assistive robotics.

2605.26984 2026-05-27 cs.LG

TED: Related Party Transaction guided Tax Evasion Detection on Heterogeneous Graph

Yiming Xu, Bin Shi, Bo Dong, Jiaxiang Wang, Hua Wei, Qinghua Zheng

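AI summary: Because existing tax evasion detection methods underuse the rich interaction information in tax scenarios, proposes TED, a heterogeneous graph neural network model that filters noise through related party transaction groups and uses a hierarchical attention mechanism to capture deeper semantics, significantly outperforming existing methods on real-world datasets.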

Comments Accepted by Data Mining and Knowledge Discovery (DMKD25)

Abstract

Tax evasion causes severe losses of government revenue and disturbs the economic order of fair competition. To help alleviate this problem, the latest tax evasion detection solutions utilize expert knowledge to extract features and then train classifiers to determine whether a company is suspected of tax evasion. However, existing solutions focus mainly on the statistical features of the company and fail to exploit the rich interaction information in tax scenarios, which hurts detection performance. In this paper, we first model the tax scenario as a heterogeneous graph and study the tax evasion detection problem under the heterogeneous graph model. To improve the performance of tax evasion detection, a novel graph neural network model is proposed to extract the comprehensive information of heterogeneous graphs. Specifically, we use heterogeneous and complex related party transaction groups to filter low-level noise information. Moreover, a hierarchical attention mechanism is designed to capture the deeper structure and semantic information hidden in the related party transaction group. We apply our method to the real risk management system of the tax bureau, and evaluate it on two human-labeled real-world tax datasets. The results demonstrate that our method significantly outperforms the state-of-the-art in the tax evasion detection task.

2605.26978 2026-05-27 cs.CL cs.SD

PashtoTTS-Bench: automated screening for low-resource non-Latin-script text-to-speech

Hanif Rahman

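AI summary: To address the shortcomings of a single ASR round-trip WER in evaluating TTS for low-resource non-Latin-script languages, proposes the INSV reporting framework and its automated screening subset INSV-A, instantiated as the PashtoTTS-Bench benchmark, which evaluates several TTS systems with multiple metrics.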

Abstract

Text-to-speech (TTS) evaluation for low-resource non-Latin-script languages can fail when it relies on a single ASR round-trip word error rate (WER). A system may produce no audio, speak a neighbouring language, preserve target script text only in an ASR transcript, or sound unnatural to native listeners. We introduce INSV (Intelligibility, Naturalness, Script fidelity, and Verification), a reporting framework that separates these cases. This paper reports INSV-A, the automated screening subset: synthesis completion, ASR WER/CER, transcript Script Fidelity Rate, and audio language identification. Native MOS and phonetic annotation are specified but not claimed in this release. We instantiate INSV-A as PashtoTTS-Bench, a dated benchmark for Pashto TTS. The April-May 2026 run evaluates Edge GulNawaz, Edge Latifa, OmniVoice clone, OmniVoice auto, and an Urdu negative control on 200 FLEURS and 200 filtered Common Voice 24 prompts. Under the independent omniASR_CTC_300M_v2, OmniVoice auto has the lowest WER (24.1% FLEURS, 27.4% CV24), followed by Edge GulNawaz (32.8%, 39.5%), Edge Latifa (35.6%, 47.7%), and OmniVoice clone (45.4%, 34.8%). WER below the natural-speech baseline reflects clean synthetic audio and should not be read as better than native speech. Whisper Large V3 returns 0.0% Pashto labels on checked Pashto TTS audio, while MMS-LID-4017 and SpeechBrain VoxLingua107 separate Pashto outputs from the Urdu control. The release provides provider metadata, per-sentence scores, LID audits, failure logs, and scripts for adding systems.
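
Two of the automated checks are easy to sketch. WER and CER below follow the standard edit-distance definition. The script-fidelity function is only a guess at what a transcript Script Fidelity Rate might measure: it assumes the rate is the share of letters that fall in the Unicode Arabic block (U+0600 to U+06FF), which covers the Pashto-specific letters; the benchmark's actual definition is not given in the abstract. All function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def word_error_rate(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def script_fidelity(transcript):
    """Share of letters in the Arabic block (assumed definition, see above)."""
    letters = [ch for ch in transcript if ch.isalpha()]
    if not letters:
        return 0.0
    in_script = sum(1 for ch in letters if 0x0600 <= ord(ch) <= 0x06FF)
    return in_script / len(letters)
```

A transcript can score well on one check and fail another, for example an output in the right script but the wrong language, which is why the framework reports them separately alongside audio language identification.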

2605.26977 2026-05-27 cs.LG math.OC

Convergence of Spectral Descent for Non-smooth Optimization

Yixuan Yang, Yuqing He, Song Li

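AI summary: Studies the global linear convergence of Spectral Descent (SD), a simplified variant of the Muon optimizer, and its truncated version (TSD) in non-smooth convex optimization, with an application to robust low-rank matrix recovery.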

Abstract

The Muon optimizer has recently demonstrated remarkable empirical success in training large language models. However, the theoretical understanding of its mechanisms remains limited. Current convergence guarantees for Muon rely heavily on smoothness assumptions, leaving its non-smooth convergence behavior largely unexplored. In this work, we take a step toward bridging this gap by investigating Spectral Descent (SD), a simplified variant of Muon, together with its truncated counterpart, Truncated Spectral Descent (TSD). Under convexity, Lipschitz continuity, and sharpness conditions, we establish global linear convergence for both SD and TSD in non-smooth convex formulations. We also study regularized variants equipped with decoupled weight decay and derive sublinear convergence guarantees through their connection with Frank-Wolfe methods. Finally, we apply our theoretical framework to robust low-rank matrix recovery under mixed sparse and dense noise regimes and provide rigorous recovery guarantees. Numerical experiments support the theoretical findings and demonstrate the effectiveness of Muon-type methods for non-smooth optimization.
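
For orientation, the standard form of a spectral descent step is written out below. This is the textbook update, stated as an assumption about what the paper calls SD; the truncation rule defining TSD is not given in the abstract and is not reproduced here. The symbols $W_t$, $G_t$, $\eta_t$, and $\lambda$ are notation introduced for this note.

```latex
% Let G_t \in \partial f(W_t) be a subgradient with compact SVD
% G_t = U_t \Sigma_t V_t^\top.
W_{t+1} = W_t - \eta_t \, U_t V_t^\top,
\qquad
U_t V_t^\top \in \arg\max_{\|D\|_{\mathrm{op}} \le 1} \langle G_t, D \rangle,
\qquad
\langle G_t, U_t V_t^\top \rangle = \|G_t\|_{*}.

% With decoupled weight decay \lambda > 0 and \eta_t \lambda \in [0, 1],
% the step is a convex combination:
W_{t+1} = (1 - \eta_t \lambda)\, W_t - \eta_t \, U_t V_t^\top
        = (1 - \eta_t \lambda)\, W_t + \eta_t \lambda \, S_t,
\qquad
S_t = -\tfrac{1}{\lambda}\, U_t V_t^\top
      \in \arg\min_{\|S\|_{\mathrm{op}} \le 1/\lambda} \langle G_t, S \rangle .
```

Read this way, the weight-decayed step has the shape of a Frank-Wolfe update with step size $\eta_t \lambda$ over the spectral-norm ball of radius $1/\lambda$, which is presumably the connection the abstract refers to.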

2605.26971 2026-05-27 cs.LG

RLVR Datasets and Where to Find Them: Tracing Data Lineage for Better Training Data

Hsiu-Yuan Huang, Weijie Liu, Chenming Tang, Sanwoo Lee, Kai Yang, Yangkun Chen, Saiyong Yang, Yunfang Wu

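AI summary: To address the unclear provenance of Reinforcement Learning from Verifiable Rewards (RLVR) datasets, proposes Atomic-source Tracing via Lineage-Aware Search (ATLAS), which traces over 99.7% of instances to 20 atomic sources, and builds the decontaminated dataset DAPO++ guided by Source-level Counterfactual Attribution (SCA); the accompanying dataset quality score Q correlates strongly with downstream RLVR performance.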

Comments 7 figures, 12 tables

Abstract

The proliferation of Reinforcement Learning from Verifiable Rewards (RLVR) datasets has exacerbated provenance collapse due to unclear lineage among existing datasets. To bridge this fragmented RLVR data landscape, we propose Atomic-source Tracing via Lineage-Aware Search (ATLAS), a systematic framework for tracing RLVR datasets back to their atomic sources, attributing over 99.7% of 1.45M instances to 20 atomic sources. Our analysis reveals that most RLVR datasets are variants of a small set of shared upstream sources, with few introducing genuinely new data, and many facing data contamination risks. These findings naturally motivate us to curate a new RLVR dataset, DAPO++, and to benchmark existing datasets from a lineage-aware perspective. To this end, we propose Source-level Counterfactual Attribution (SCA) as a guiding principle to curate a decontaminated training dataset with concentrated learning signals. Essentially, SCA measures a sample's marginal utility by comparing per-atomic-source RL checkpoints against a shared base model. Building upon these attribution signals, we further design a composite dataset quality score Q that strongly correlates with downstream RLVR performance. Experiments on Qwen3 series models verify that DAPO++ consistently improves performance on held-out benchmarks, while Q reliably predicts downstream RLVR training effectiveness. Our code and data are available at https://github.com/Celine-hxy/ATLAS.

2605.26969 2026-05-27 cs.CL cs.AI

Recon: Reconstruction-Guided Reasoning Synthesis for User Modeling

Alan Zhu, Mihran Miroyan, Carolyn Wang, Andrew Zhou, Lisa Dunlap, Narges Norouzi, Joseph E. Gonzalez

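AI summary: Proposes Recon, which scores the predictive power of reasoning traces with an action-reconstruction score to improve reasoning synthesis for user modeling, outperforming a post-hoc rationalization baseline across multiple domains.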

Abstract

User modeling aims to use language models (LMs) to mimic an individual's behavior from a corpus of past context-action pairs (e.g., conversation turns), enabling the simulation of users in settings like behavioral science, human-AI collaboration, and market research. Recent approaches augment these corpora with synthesized reasoning traces, typically generated by conditioning on both context and action. However, such conditioning constitutes post-hoc rationalization rather than reasoning: the trace is guaranteed to justify the action, but may not encode the underlying latent causal decision paths. We propose Recon, which uses action reconstruction to score reasoning traces by their predictive power: given a context and candidate reasoning, a reconstruction model predicts the action, and reconstruction fidelity determines reasoning quality. Across four domains, Recon achieves a 54.7% win rate over Backward Synthesis, a standard post-hoc rationalization baseline. Further, we find that training a reasoning synthesis model with rewards derived from Recon improves downstream user modeling performance, achieving a win rate of up to 70.0% over baselines. We further show that Recon-synthesized reasoning transfers across models, and improves user modeling beyond the reconstruction model. Our work demonstrates that post-hoc rationalization is insufficient for reasoning synthesis, and that useful and interpretable reasoning should naturally elicit the action from the context.
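
The scoring idea can be sketched in a few lines: a candidate reasoning trace is judged by how well the action can be predicted from the context and the trace alone. In the sketch below, `reconstruct` and `similarity` are hypothetical callables standing in for a reconstruction model and a fidelity measure; the toy versions exist only to make the example runnable and say nothing about how the paper implements either.

```python
def reconstruction_score(context, reasoning, action, reconstruct, similarity):
    """Score a reasoning trace by how well it lets us recover the action."""
    predicted = reconstruct(context, reasoning)  # the true action is not passed in
    return similarity(predicted, action)


def best_reasoning(context, candidates, action, reconstruct, similarity):
    """Pick the candidate trace with the highest reconstruction score."""
    return max(
        candidates,
        key=lambda r: reconstruction_score(context, r, action, reconstruct, similarity),
    )


def toy_reconstruct(context, reasoning):
    # Stand-in for a language model call: echo the last word of the reasoning.
    return reasoning.split()[-1]


def exact_match(predicted, action):
    # Stand-in fidelity measure.
    return 1.0 if predicted == action else 0.0
```

The key design point is that the reconstruction step never sees the true action, so a trace can only score well if it actually leads from the context to that action.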

2605.26967 2026-05-27 cs.CV

CodecCap: High-Fidelity Codec-Inspired Residual Modeling for Dense Video Captioning

Zihan Lin, Songhe Deng, Shuwei He, Danxiang Zhu, Dan Zhang, Yishu Lei, Xianlong Luo, Shikun Feng, Rui Liu

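AI summary: Proposes the CodecCap framework, which mimics a video codec with keyframe and residual captions to reduce redundancy while preserving fine-grained visual evidence, and introduces the VidCapQA benchmark to verify its high fidelity.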

Comments 11 pages, 4 figures

Abstract

Existing video captioning methods struggle to balance visual fidelity and redundancy: holistic captions are compact but lose fine-grained evidence, whereas segment-wise captions improve coverage but introduce heavy redundancy. We propose CodecCap, a codec-inspired framework for high-fidelity dense video captioning. Analogous to video codecs, CodecCap represents videos using keyframe and residual captions. Keyframe captions exhaustively encode stable visual context, while residual captions capture only temporally localized actions, motions, and changes. This effectively preserves fine-grained visual evidence while reducing redundant descriptions. To quantify the fidelity of captions, we introduce VidCapQA, a caption-then-QA benchmark with 1,000 questions across 14 capability dimensions. Results on VidCapQA show that captions directly generated by strong VLMs still miss many visual details, highlighting caption representation as a critical bottleneck. Experiments show that CodecCap significantly surpasses direct captioning with the same underlying VLMs, suggesting that keyframe-residual captioning is a viable route to high-fidelity video-language supervision. We further use CodecCap to construct CodecVDC-100K, a large-scale dense captioning dataset with anchor, residual, scene-level, and video-level supervision.

2605.26958 2026-05-27 cs.CL cs.AI

Tournament-GRPO: Group-Wise Tournament Rewards for Reinforcement Learning in Open-Ended Long-Form Generation

Zixuan Yang, Yiqun Chen, Wei Yang, Erhan Zhang, Zihan Shen, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu, Jiaxin Mao

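AI summary: To address the lack of reliable reference answers and automatic evaluation metrics in open-ended long-form generation, proposes the Tournament-GRPO framework, which converts rubric-guided LLM judgments into relative rewards through multi-round tournament comparisons among same-query rollouts, yielding a 4.52-point improvement on Deep Research Bench.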

Abstract

Reinforcement learning in open-ended long-form generation is challenging because reliable reference answers and automatic metrics are often unavailable. Existing rubric-based methods typically rely on pointwise LLM-as-a-judge scoring, but absolute scores are difficult to calibrate across complex responses, may provide weak discrimination among same-query rollouts, and can become saturated during optimization. We propose Tournament-GRPO, a group-wise reward framework that converts rubric-guided LLM judgments into relative rewards through repeated multi-round tournaments among same-query rollouts. Tournament-GRPO compares candidates within groups, accumulates tournament outcomes, and normalizes them into group-wise rewards for GRPO training. Experiments on Deep Research Bench show that Tournament-GRPO consistently outperforms existing reward-design baselines, achieving a 4.52-point overall-score improvement over the strongest baseline. Further analyses show that tournament rewards provide a favorable effectiveness-efficiency trade-off and that tournament design affects training dynamics. These results suggest that rubric-guided tournament comparison provides an effective reward signal for reinforcement learning in open-ended long-form generation.
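
The compare, accumulate, and normalize pipeline can be sketched as follows. The tournament format here is an assumption: a plain round-robin repeated for a fixed number of rounds, with ties split evenly and win counts standardized within the group. The paper's actual bracket design and normalization are not specified in the abstract, and `judge` is a hypothetical callable standing in for a rubric-guided LLM comparison.

```python
import itertools
import statistics


def tournament_rewards(rollouts, judge, rounds=1):
    """Turn pairwise judgments among same-query rollouts into relative rewards.

    judge(a, b) returns 0 if `a` wins, 1 if `b` wins, or None for a tie.
    Repeating rounds only matters when the judge is stochastic.
    """
    wins = [0.0] * len(rollouts)
    for _ in range(rounds):
        for i, j in itertools.combinations(range(len(rollouts)), 2):
            outcome = judge(rollouts[i], rollouts[j])
            if outcome == 0:
                wins[i] += 1.0
            elif outcome == 1:
                wins[j] += 1.0
            else:
                wins[i] += 0.5
                wins[j] += 0.5
    mean = statistics.fmean(wins)
    std = statistics.pstdev(wins)
    if std == 0.0:
        return [0.0 for _ in wins]  # no preference signal in this group
    return [(w - mean) / std for w in wins]
```

Because the rewards are relative within the group, they stay informative even when every rollout would receive a similar absolute rubric score, which is the saturation problem the abstract describes.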

2605.26956 2026-05-27 cs.AI cs.CL

LELA: An End-to-end LLM-based Entity Linking Framework with Zero-shot Domain Adaptation

Samy Haffoudhi, Nikola Dobričić, Fabian Suchanek, Nils Holzenberger

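AI summary: Presents LELA, a modular, domain-agnostic LLM-based entity disambiguation method, extended into a practical Python library that integrates zero-shot named entity recognition for end-to-end entity linking, with experiments validating its cross-domain performance and robustness.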

Journal ref
35th International Joint Conference on Artificial Intelligence (IJCAI-ECAI 2026), IJCAI (International Joint Conferences on Artificial Intelligence), Aug 2026, Bremen (DE), Germany

Abstract

Entity linking is a key component of many downstream NLP systems, yet existing approaches are often tied to specific target knowledge bases and domains, limiting their real-world application. In this paper, we extend LELA, a modular and domain-agnostic LLM-based entity disambiguation method, into a practical Python library that integrates zero-shot Named Entity Recognition (NER), thereby providing a complete end-to-end pipeline for entity linking in real-world usage. We provide experimental results validating LELA's performance and robustness across diverse entity linking settings. In our demo, users can try the system on their own input texts.

2605.26955 2026-05-27 cs.CL cs.AI

JuICE: A Benchmark for Evaluating LLM-Judge in Identifying Cultural Errors

Jiho Jin, Junho Myung, Juhyun Oh, Junyeong Park, Rifki Afina Putri, Sunipa Dev, Vinodkumar Prabhakaran, Alice Oh

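AI summary: Presents the JuICE benchmark, a multilingual dataset of 7,470 annotated cultural and linguistic errors for evaluating how well LLM judges identify thick cultural errors in long-form text, finding that the strongest model reaches an F1 of only 0.52 and often misses errors that locals readily identify.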

Abstract

As large language models (LLMs) are increasingly deployed to users around the world, they are integrated into everyday tasks across diverse cultural contexts, from drafting personal communications to brainstorming creative ideas. These tasks are inherently cultural: they require contextual appropriateness, symbolic resonance, and tacit cultural expectations that native speakers draw on instinctively, meaning that a response can be factually plausible yet unmistakably wrong to a local reader. Existing cultural benchmarks have treated culture as a flat set of facts via fact verification or norm entailment methods, and have adopted LLM-as-a-Judge without examining whether such judges can capture these thick cultural errors. To address this gap, we present JuICE (Benchmark for LLM-Judge in Identifying Cultural Errors), a multilingual dataset of 7,470 span-level annotations of cultural and linguistic errors in long-form LLM responses. It covers 1,050 query-response pairs from four countries (the United States, South Korea, Indonesia, and Bangladesh), in both English and each country's main language. Using JuICE, we find that even the strongest LLM-judge achieves only an F1 of 0.52 in the erroneous span detection task. Furthermore, LLM-judges consistently miss thick cultural errors that local residents readily identify. Our findings suggest that robust cultural evaluation must move beyond surface-level detection toward frameworks that account for the depth and situatedness of cultural meaning.
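
For readers unfamiliar with span-level scoring, the sketch below computes a character-overlap F1 between predicted and annotated error spans. This is one common way to score span detection and is an assumption here; the abstract does not say how JuICE matches spans. Spans are half-open `(start, end)` character offsets, and the function names are hypothetical.

```python
def covered_positions(spans):
    """Set of character offsets covered by half-open (start, end) spans."""
    covered = set()
    for start, end in spans:
        covered.update(range(start, end))
    return covered


def span_f1(predicted_spans, gold_spans):
    """Character-overlap F1 between predicted and gold error spans."""
    pred = covered_positions(predicted_spans)
    gold = covered_positions(gold_spans)
    if not pred and not gold:
        return 1.0  # nothing to find and nothing flagged
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Under this kind of measure a judge is penalized both for flagging text that annotators considered fine and for missing annotated errors, so a low score can come from either direction.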

2605.26952 2026-05-27 cs.CL

Efficient Agentic Reinforcement Learning with On-Policy Intrinsic Knowledge Boundary Enhancement

Dingwei Chen, Zefang Zong, Zhipeng Ma, Leo Luo, Yang Li, Chengming Li, Peng Chen, Jie Jiang

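AI summary: Proposes AKBE, an on-policy method that dynamically probes the model's intrinsic knowledge boundary through dual-path (with-tool and no-tool) training rollouts and constructs targeted supervisory signals, reducing tool calls while maintaining accuracy.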

Abstract

Agentic reinforcement learning (RL) has proven effective for training LLM-based agents with external tool-use capabilities. However, we identify that agentic RL training induces a growing number of redundant tool calls and blurs the model's intrinsic knowledge boundary, where the model fails to distinguish when tools are needed versus when parametric knowledge suffices. Existing solutions based on reward shaping create coarse-grained optimization targets that tend to incentivize indiscriminate tool-call suppression, leading to reward hacking. In this paper, we propose AKBE (Agentic Knowledge Boundary Enhancement), an on-policy method that dynamically probes the model's intrinsic knowledge boundary through dual-path (with-tool and no-tool) rollouts during training. We define the knowledge boundary as the per-instance determination of whether tools are required and the minimum tool calls necessary. By comparing correctness across paths, AKBE categorizes trajectories and constructs targeted supervisory signals that guide efficient tool-use patterns for each question. These signals are integrated seamlessly into the agentic RL training loop. Experiments on seven QA benchmarks demonstrate that AKBE improves task accuracy by +1.85 on average and reduces tool calls by 18% over standard agentic RL, yielding 25% higher tool productivity without any accuracy-efficiency trade-off. Further analysis suggests its plug-and-play compatibility across different RL algorithms and the mechanism of each signal category. Our code is available at https://github.com/CuSO4-Chen/AKBE.
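
The per-instance knowledge boundary defined above can be sketched as a small function over the two groups of rollouts. This is a hypothetical reading of the definition, not the AKBE code: it treats a question as not requiring tools if any no-tool rollout is correct, and otherwise reports the fewest tool calls among correct with-tool rollouts. How the paper turns this into trajectory categories and supervisory signals is not shown.

```python
def knowledge_boundary(with_tool_rollouts, no_tool_rollouts):
    """Estimate whether tools are required and the minimum calls needed.

    with_tool_rollouts : list of (is_correct, num_tool_calls) pairs.
    no_tool_rollouts   : list of booleans, one per no-tool rollout.
    Returns None for both fields when no rollout on either path is correct,
    since the boundary cannot be determined from these samples.
    """
    if any(no_tool_rollouts):
        # Parametric knowledge sufficed at least once.
        return {"tools_required": False, "min_tool_calls": 0}
    correct_calls = [calls for ok, calls in with_tool_rollouts if ok]
    if correct_calls:
        return {"tools_required": True, "min_tool_calls": min(correct_calls)}
    return {"tools_required": None, "min_tool_calls": None}
```

A training loop could then, for example, favor with-tool trajectories that match `min_tool_calls`, rather than penalizing tool calls uniformly.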

2605.26949 2026-05-27 cs.CV cs.GR

DinoComplete: 3D Shape Completion with Distilled Semantic Priors and State Space Models

DinoComplete: 利用蒸馏语义先验和状态空间模型进行3D形状补全

Furkan Mert Algan, Eckehard Steinbach

AI总结 提出DinoComplete框架,通过从DINO特征中蒸馏语义先验并结合多尺度体素Mamba模块,实现高效、鲁棒的3D形状补全,在未见类别和真实噪声扫描上优于现有方法。

详情
AI中文摘要

从部分扫描进行3D形状补全对于未见类别和嘈杂的真实世界观测仍然具有挑战性,因为仅凭几何信息往往不足以推断缺失结构。我们提出了DinoComplete,一个确定性且高效的形状补全框架,通过从DINO特征中蒸馏的体素对齐语义先验来增强几何重建。首先,我们构建与ShapeNet数据对齐的多视图DINO特征体积,并训练一个学生网络直接从不完整形状预测密集语义特征。这些预测特征捕获全局结构和部件感知的语义上下文,同时与底层几何保持对齐。然后,我们将这些蒸馏特征集成到一个补全网络中,其中几何和语义体素表示通过体素状态空间建模进行融合。为了在不牺牲分辨率的情况下实现高效的长距离推理,我们引入了一个多尺度体素Mamba模块,通过结合全网格和分块序列建模来细化融合特征。在未见过的ShapeNet类别和ScanNet物体上的实验表明,DinoComplete在使用更少参数、更低内存和更快推理速度的同时,实现了比先前确定性方法和生成式方法更强的补全质量。我们的结果表明,从视觉基础模型中蒸馏语义先验提高了3D形状补全的泛化能力和鲁棒性。

英文摘要

3D shape completion from partial scans remains challenging for unseen categories and noisy real-world observations, where geometry alone is often insufficient for inferring missing structure. We present DinoComplete, a deterministic and efficient shape completion framework that augments geometric reconstruction with voxel-aligned semantic priors distilled from DINO features. First, we construct multi-view DINO feature volumes aligned with ShapeNet data and train a student network to predict dense semantic features directly from incomplete shapes. These predicted features capture global structure and part-aware semantic context while remaining aligned with the underlying geometry. We then integrate these distilled features into a completion network, where geometric and semantic voxel representations are fused through voxel state-space modeling. To enable efficient long-range reasoning without sacrificing resolution, we introduce a multi-scale voxel Mamba module that refines the fused features by combining full-grid and chunk-wise sequence modeling. Experiments on unseen ShapeNet categories and ScanNet objects show that DinoComplete achieves stronger completion quality than prior deterministic and generative completion methods while using fewer parameters, requiring less memory, and achieving faster inference. Our results demonstrate that distilling semantic priors from visual foundation models improves generalization and robustness in 3D shape completion.

2605.26944 2026-05-27 cs.RO cs.CV

Object Pose and Shape Estimation for Grasping: Does it Work?

用于抓取的目标姿态与形状估计:有效吗?

Pavan Karke, Kushal Shah, Gaurav Singh, Md Faizal Karim, K Madhava Krishna, Rajat Talak

AI总结 本文通过对比端到端抓取合成方法与模块化方法(先估计目标姿态和形状再采样抓取),评估现有姿态和形状估计方法在抓取任务中的有效性。

Comments 9 pages, 8 figures

详情
AI中文摘要

目标姿态和形状估计问题近年来取得了关键进展。编码器-解码器(如SAM3D、LRM、CRISP)和基于扩散的模型(如InstantMesh、Zero123、SceneComplete)展示了类别无关的形状编码能力和开放集泛化性。在这项工作中,我们提出一个问题:当与对极抓取采样结合使用时,目标姿态和形状估计方法是否足够成熟,以至于能够超越端到端抓取合成方法?我们通过将研究范围限定在平行颚夹爪、7自由度抓取和单视图RGB(-D)图像输入,详细探讨了这个问题。我们实现并比较了一种最先进的端到端抓取合成方法和三种模块化方法,这些方法首先估计场景中所有目标的姿态和形状,然后使用对极采样生成抓取。我们观察到,在所有实验中,模块化方法均优于端到端方法。模块化方法能够合成大量抓取,即使是对于端到端方法失败的小目标也是如此。模块化方法的有效性取决于姿态和形状估计的准确性,并且在杂乱场景中会部分退化——这是现有姿态和形状估计方法的局限性。我们还分析了三种模块化方法的失败模式和运行时间,这些方法使用了两种不同的目标姿态和形状估计方式:一种基于编码器-解码器模型,另一种基于扩散模型。最后,我们证明单视图目标姿态和形状估计方法可以与视觉语言模型结合,仅从单视图RGB-D图像输入即可产生语言条件抓取。我们注意到其性能与最先进的LERF-TOGO基线相当。

英文摘要

The problem of object pose and shape estimation has seen key advancements lately. Encoder-decoder (e.g., SAM3D, LRM, CRISP) and diffusion-based models (e.g., InstantMesh, Zero123, SceneComplete) have shown category-agnostic shape encoding capacity and open-set generalizability. In this work, we ask the question: Are object pose and shape estimation methods mature enough that, when used with antipodal grasp sampling, they can outperform end-to-end grasp synthesis methods? We explore this question in detail by scoping our study to parallel jaw grippers, 7-DoF grasps, and a single-view RGB(-D) image as input. We implement and compare a state-of-the-art end-to-end grasp synthesis method and three modular methods, which first estimate the object pose and shape for all objects in the scene and then generate grasps using antipodal sampling. We observe that the modular methods outperform the end-to-end method in all our experiments. The modular methods are able to synthesize plenty of grasps, even for small objects, where the end-to-end methods fail. The effectiveness of the modular methods is contingent on the accuracy of the pose and shape estimation, and suffers partial degradation in cluttered scenes, a limitation of the existing pose and shape estimation methods. We also analyze the failure modes and run-times for the three modular methods, which use two different ways of object pose and shape estimation: one based on an encoder-decoder model and the other on a diffusion model. Finally, we demonstrate that the single-view object pose and shape estimation methods can be augmented with vision-language models to yield language-conditioned grasps from just a single-view RGB-D image as input. We observe performance comparable to the state-of-the-art LERF-TOGO baseline.

2605.26940 2026-05-27 cs.CL

Accountable Human-AI Deliberation with LLMs: Scaling Collective Intelligence through Symbiotic Scaffolding

负责任的基于LLM的人机协商:通过共生脚手架扩展集体智能

Wajdi Zaghouani

AI总结 提出一个三层共生人机框架,通过多样性放大、条款级溯源和人类主导批准,在扩展集体智能的同时保持主体性和合法性。

Comments Accepted at the LREC 2026 / 2nd Workshop on Language-driven Deliberation Technology

详情
AI中文摘要

大型语言模型(LLM)可以在以前受轮流发言和引导带宽限制的规模上支持民主协商。最近的研究表明,LLM生成的群体陈述通常比人类中介的输出更受欢迎,而理论分析认为LLM放松了限制集体智能的同时性约束。然而,纯LLM中介存在使多元性崩溃、过度优化一致性以及当参与者无法质疑其如何被代表时损害合法性的风险。我们提出了一个共生的人机框架,分为三个层次:观察与多样性放大、具有条款级溯源的引导、以及人类优先批准。我们的贡献包括:具有显著性加权分级覆盖、多样性和擦除度量;结合交叉编码器相似性与因果剔除诊断的溯源管道;偏好条件权衡控制;公平感知的可争议工作流;对抗性鲁棒性测试;以及基于LLM作为评判者局限性证据的消融设计评估协议。结果是一个可测试的协商技术蓝图,能够在扩展集体智能的同时保持主体性和合法性。

英文摘要

Large language models (LLMs) can support democratic deliberation at scales previously constrained by turn-taking and facilitation bandwidth. Recent work shows that LLM-generated group statements are often preferred over human-mediated outputs, while theoretical analyses argue that LLMs relax the simultaneity constraints limiting collective intelligence. Yet pure LLM mediation risks collapsing pluralism, over-optimizing for agreement, and undermining legitimacy when participants cannot contest how they are represented. We propose a symbiotic human-AI framework organized into three layers: observation and diversity amplification, facilitation with clause-level provenance, and human primacy for ratification. Our contributions include graded coverage, diversity, and erasure metrics with salience-aware weighting; a provenance pipeline combining cross-encoder similarity with causal knockout diagnostics; preference-conditioned trade-off control; equity-aware contestability workflows; adversarial robustness tests; and an evaluation protocol with ablation designs informed by evidence of LLM-as-judge limitations. The result is a testable blueprint for deliberation technology that scales collective intelligence while preserving agency and legitimacy.

2605.26937 2026-05-27 cs.CL cs.AI

Beyond Questions: Evaluating What Large Language Models (Actually) Know

超越问题:评估大型语言模型(实际)知道什么

Luca Giordano, Simon Razniewski

AI总结 提出开放知识评估新范式,通过开放式提示(如“告诉我关于M.L. King的一切”)评估模型自然表达的知识,并构建BeQu基准测试10,000个实体。

详情
AI中文摘要

大型语言模型(LLM)中的参数化知识是其成功的基石,但仍未被充分理解。现有的知识基准通常依赖于预定义的问题(例如,“M.L. King的出生日期是什么?”),仅评估基准设计者明确选择查询的知识,这是一种有问题的可用性偏差。在本文中,我们引入了开放知识评估,这是一种用于LLM知识基准测试的新范式。它不提出狭隘的问题,而是评估模型在响应开放式引发提示(例如,“告诉我关于M.L. King的一切”)时选择呈现的知识。这将焦点从预定义的答案检索转向表征模型自然表达的知识。我们用BeQu(超越问题)实例化这一范式,这是一个包含10,000个实体并配有用于陈述验证的参考语料库的基准。使用BeQu,我们评估了广泛的语言模型,并分析了推理努力、模型规模、提示格式和知识领域的影响。数据和排行榜可在此工作的GitHub仓库和基准网站上获取。

英文摘要

Parametric knowledge in large language models (LLMs) is a cornerstone of their success, yet remains poorly understood. Existing knowledge benchmarks typically rely on predefined questions (e.g., "What is the birth date of M.L. King?"), evaluating only knowledge that benchmark designers explicitly choose to query, a problematic availability bias. In this paper, we introduce open knowledge evaluation, a new paradigm for LLM knowledge benchmarking. Instead of asking narrow questions, it evaluates models on the knowledge they choose to surface in response to open-ended elicitation prompts (e.g., "Tell me everything you know about M.L. King"). This shifts the focus from predefined answer retrieval toward characterizing the knowledge models naturally express. We instantiate this paradigm with BeQu (Beyond Questions), a benchmark of 10,000 entities paired with reference corpora for statement verification. Using BeQu, we evaluate a broad range of language models and analyze the effects of reasoning effort, model scale, prompt format, and knowledge domain. Data and leaderboard are available on this work's GitHub repository and at the benchmark's website.

2605.26936 2026-05-27 cs.RO

A Bioinspired Underwater Robot with a Latch-Mediated Soft Bistable Mechanism

一种具有闩锁介导的软体双稳态机构的仿生水下机器人

Chongze Bi, Wenjie Wu, Zonghao Zuo, Li Wen

AI总结 本文提出一种受生物启发的软体双稳态执行器,通过集成闩锁机构实现单电机驱动的非对称能量输入与释放,结合鳍结构实现高效水下推进与机动,实验验证了稳定拍动、精确转向及多模式运动能力。

Comments 6 pages, 6 figures

详情
AI中文摘要

近几十年来,水下机器人技术取得了显著进展。然而,微型水下机器人的发展仍受限于传统能源的低能量密度。自然界提供了引人注目的解决方案:像螳螂虾和跳蚤这样的生物利用闩锁介导的弹簧驱动(LaMSA)系统,通过解耦的能量存储和释放机制实现快速运动。尽管对LaMSA进行了广泛研究,但在简单紧凑的结构中复制这种快速、非对称驱动仍然具有挑战性。在这项工作中,我们介绍了一种受生物启发的软体双稳态执行器,它集成了闩锁机制,能够使用单个电机实现非对称的能量输入和释放。结合鳍结构,这种设计促进了高效的水下推进和机动性。实验结果表明,该机器人实现了稳定的周期性拍动、精确的转向,以及最大推力0.528 N、冲量0.147 N·s和垂直位移30 mm。通过调节鳍角,机器人实现了多种运动,包括垂直上升、斜向前进和横向平移。这项研究为控制紧凑型水下机器人的运动提供了一种新颖、节能的方法,为先进仿生设计在探索、环境监测和检查中的潜在应用铺平了道路。

英文摘要

Underwater robotics has advanced significantly over recent decades. However, the development of miniaturized underwater robots remains limited by the low energy densities of traditional power sources. Nature offers compelling solutions: organisms like mantis shrimps and fleas utilize latch-mediated spring actuation (LaMSA) systems that achieve rapid movements through a decoupled energy storage and release mechanism. Despite extensive studies of LaMSA, replicating such rapid, asymmetric actuation within simple, compact structures remains challenging. In this work, we introduce a bioinspired, soft bistable actuator with an integrated latch mechanism that enables asymmetric energy input and release using a single motor. Coupled with fin structures, this design facilitates efficient underwater propulsion and maneuverability. Experimental results demonstrate stable periodic flapping, precise steering, and a maximum thrust of 0.528 N, impulse of 0.147 N·s, and vertical displacement of 30 mm. By modulating fin angles, the robot achieves versatile motions, including vertical ascent, diagonal forward movement, and lateral translation. This study presents a novel, energy-efficient approach for controlling motion in compact underwater robots, paving the way for advanced biomimetic designs with potential applications in exploration, environmental monitoring, and inspection.

2605.26935 2026-05-27 cs.CL

DunbaaBERT: From Sacrifice to Semantics

DunbaaBERT: 从牺牲到语义

Iffat Maab, Waleed Jamil, Raphael Schmitt

AI总结 本文提出DunbaaBERT,一种从零训练的乌尔都语RoBERTa-base模型族,通过不同词汇表大小在17GB语料上预训练,在多项下游任务中达到与强多语言基线相当的性能,并发现较大词汇表并不持续提升效果。

详情
AI中文摘要

大型语言模型在许多自然语言处理任务中取得了强劲性能,但由于资源有限和评估设置碎片化,乌尔都语仍相对未被充分探索。为填补这一空白,我们引入了DunbaaBERT,一个乌尔都语RoBERTa-base模型族,在去重后的17GB乌尔都语语料库上使用32k、52k和96k token的Byte-BPE词汇表从头训练。我们在内在和下游乌尔都语自然语言处理基准上评估DunbaaBERT,涵盖语言可接受性、新闻分类、攻击性语言检测和情感分析,同时分析词汇表大小对性能和效率权衡的影响。在各项基准中,DunbaaBERT变体与强多语言基线相比取得了有竞争力的性能,同时始终保持有利的效率权衡。有趣的是,较大的词汇表并不持续提升下游效果,DunbaaBERT$_{\text{32k}}$反复提供最强的整体效率概况。总体而言,我们的结果表明,尽管模型和训练规模相对紧凑,精心策划的乌尔都语特定编码器模型仍能保持高度竞争力。所有模型均在MIT许可下发布。

英文摘要

Large language models have achieved strong performance across many NLP tasks, yet Urdu remains comparatively underexplored due to limited resources and fragmented evaluation settings. To address this gap, we introduce DunbaaBERT, a family of Urdu RoBERTa-base models trained from scratch with Byte-BPE vocabularies of 32k, 52k, and 96k tokens on a deduplicated 17GB Urdu corpus. We evaluate DunbaaBERT across intrinsic and downstream Urdu NLP benchmarks covering linguistic acceptability, news classification, offensive language detection, and sentiment analysis while analyzing vocabulary-size effects on performance and efficiency trade-offs. Across benchmarks, the DunbaaBERT variants achieve competitive performance against strong multilingual baselines while consistently maintaining favorable efficiency trade-offs. Interestingly, larger vocabularies do not consistently improve downstream effectiveness, with DunbaaBERT$_{\text{32k}}$ repeatedly providing the strongest overall efficiency profile. Overall, our results demonstrate that carefully curated Urdu-specific encoder models can remain highly competitive despite comparatively compact model and training scales. All models are released under the MIT license.
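One reason vocabulary size bears on the efficiency trade-off is that the token-embedding table grows linearly with the vocabulary. The sketch below is a back-of-the-envelope illustration, not a figure from the paper: it assumes the standard RoBERTa-base hidden width of 768 and assumes that "32k", "52k", and "96k" mean exactly 32,000, 52,000, and 96,000 tokens, which the abstract does not state.

```python
HIDDEN_SIZE = 768  # assumption: hidden width of the standard RoBERTa-base configuration


def embedding_params(vocab_size: int, hidden_size: int = HIDDEN_SIZE) -> int:
    """Number of weights in a token-embedding table of shape (vocab, hidden)."""
    return vocab_size * hidden_size


# Assumed exact vocabulary sizes; the abstract only gives rounded labels.
for label, vocab in [("32k", 32_000), ("52k", 52_000), ("96k", 96_000)]:
    millions = embedding_params(vocab) / 1e6
    print(f"{label}: {millions:.1f}M embedding parameters")
# 32k: 24.6M embedding parameters
# 52k: 39.9M embedding parameters
# 96k: 73.7M embedding parameters
```

Under these assumptions, moving from the 32k to the 96k vocabulary adds roughly 49M embedding weights, which is consistent in spirit with the reported observation that the smallest vocabulary offers the most favorable efficiency profile.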

2605.26934 2026-05-27 cs.CL cs.AI

Reasoning Depth and Environment Complexity: A Controlled Study of RLVR Data Allocation across Logical Reasoning Tasks

推理深度与环境复杂度:逻辑推理任务中RLVR数据分配的受控研究

Yihua Zhu, Qianying Liu, Fei Cheng, Jiaxin Wang, Akiko Aizawa, Sadao Kurohashi, Hidetoshi Shimodaira

AI总结 通过将推理空间划分为深度和复杂度两个维度,并考虑四种推理形式,在合成知识图谱环境中进行受控实验,发现联合深度-复杂度覆盖优于单轴策略,不同推理家族对RLVR覆盖的反应非均匀,且均匀混合优于分阶段课程。

Comments Pre-print

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)已成为后训练推理模型的核心,但现有研究的一个关键局限在于对推理空间的狭隘视角:难度仅被视为推理深度,奖励集中在正向演绎状态追踪。相反,我们沿两个维度刻画推理空间。难度:除了推理深度,我们研究环境复杂度,即模型必须在干扰项和交互结构中识别正确路径。奖励推理形式:我们考虑现实世界推理核心的四种能力:演绎状态追踪、对隐藏事件或事实的溯因恢复、归纳规则归纳以及类比迁移。为解耦这些因素,我们构建了一个合成知识图谱环境,具有受控的预训练和后训练分布,其中每个实例在深度、复杂度和任务家族上变化。三个发现:联合深度-复杂度覆盖优于单轴策略;推理家族反应非均匀,溯因推理在RL覆盖区域外退化,任务相关性聚类为演绎-溯因对和归纳-类比对;在固定预算下,均匀混合优于分阶段课程。我们还发现,最近的现成模型表现出相同的演绎-溯因不对称性,表明这一差距并非我们受控设置的假象。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has become central to post-training reasoning models, yet a key limitation of existing studies is their narrow view of the reasoning space: difficulty is treated as reasoning depth alone, and reward is concentrated on forward deductive state tracking. We instead characterize the reasoning space along two dimensions. The first is difficulty: beyond reasoning depth, we study environment complexity, where models must identify the correct path amid distractors and interacting structures. The second is the rewarded reasoning form: we consider four abilities core to real-world reasoning, namely deductive state tracking, abductive recovery of hidden events or facts, inductive rule induction, and analogical transfer. To disentangle these factors, we construct a synthetic knowledge-graph environment with controlled pre- and post-training distributions, where each instance varies along depth, complexity, and task family. Three findings emerge: joint depth-complexity coverage outperforms single-axis recipes; reasoning families respond non-uniformly, with abductive reasoning degrading outside the RL-covered region and task correlations clustering into deductive-abductive and inductive-analogy pairs; and uniform mixing outperforms staged curricula under a fixed budget. We also find that recent off-the-shelf models exhibit the same deductive-over-abductive asymmetry, suggesting that this gap is not merely an artifact of our controlled setup.

2605.26933 2026-05-27 cs.CV

Leveraging Text-to-Image Diffusion Models for Unsupervised Visual Object Tracking

利用文本到图像扩散模型进行无监督视觉目标跟踪

Zhengbo Zhang, Zhigang Tu, Junsong Yuan, De Wen Soh, Bo Du

AI总结 提出Diff-Tracking方法,利用预训练文本到图像扩散模型的跨注意力机制,通过初始提示学习器和在线提示更新器实现无监督目标跟踪。

Comments Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2026

详情
AI中文摘要

无监督视觉目标跟踪是一项具有挑战性的任务,需要在没有真实标注训练的情况下跟踪视频中的任意目标。尽管取得了显著进展,现有的最先进无监督跟踪器在处理需要细粒度理解视频帧内语义和视觉结构信息的场景时仍常遇到困难。文本到图像扩散模型以其生成准确反映输入提示中描述的语义和结构的图像的能力而闻名,展现出对视觉语义和结构的强大把握。基于这一能力,我们从新的角度处理无监督跟踪,利用预训练文本到图像扩散模型中编码的丰富语义知识。为了将原本用于图像生成的扩散模型适应到跟踪任务,我们将其重新解释为文本和图像模态之间的桥梁。这种连接通过跨注意力机制实现:当文本和图像同时输入模型时,模型会突出显示与文本语义对齐的图像区域(在跨注意力图中)。因此,我们学习一个表示跟踪目标的提示,并在每一帧中激活其在跨注意力图中的对应区域,从而利用扩散模型实现目标跟踪。具体来说,我们的方法Diff-Tracking由两个主要部分组成:初始提示学习器和在线提示更新器。初始提示学习器生成一个捕获第一帧中目标对象的提示,使扩散模型能够识别目标。在线提示更新器基于运动信息优化提示,实现跨视频帧的一致跟踪。我们在六个具有挑战性的跟踪数据集上评估了我们的方法,证明了其有效性。

英文摘要

Unsupervised visual object tracking is a challenging task that requires following arbitrary targets in videos without training on ground-truth annotations. Despite considerable progress, existing state-of-the-art unsupervised trackers often struggle in scenarios that demand fine-grained understanding of semantic and visual structural information within video frames. Text-to-image diffusion models are well known for their ability to generate images that accurately reflect the semantics and structures described in the input prompt, demonstrating a strong grasp of visual semantics and structures. Building on this capability, we approach unsupervised tracking from a new perspective by exploiting the rich semantic knowledge encoded in pretrained text-to-image diffusion models. To adapt diffusion models, which were originally developed for image generation, to the tracking task, we reinterpret them as a bridge between text and image modalities. This connection is realized through the cross-attention mechanism: when both text and an image are input into the models, the cross-attention maps highlight the regions of the image that are semantically aligned with the text. We therefore learn a prompt that represents the tracking target and activates its corresponding region in the cross-attention map for each frame, which enables object tracking with the diffusion model. Specifically, our method Diff-Tracking is composed of two main components: an initial prompt learner and an online prompt updater. The initial prompt learner generates a prompt that captures the target object in the first frame, allowing the diffusion model to identify the target. The online prompt updater refines the prompt based on motion information, enabling consistent tracking across video frames. Evaluations on six challenging tracking datasets demonstrate the effectiveness of our approach.

2605.26926 2026-05-27 cs.AI

From Norms to Indicators (N2I-RAG): An Agentic Retrieval-Augmented Generation Framework for Legal Indicator Computation

从规范到指标 (N2I-RAG): 一种用于法律指标计算的智能检索增强生成框架

Youssef Al Mouatamid, Marie Bonnin, Jihad Zahir

AI总结 提出N2I-RAG框架,通过自适应检索、基于LLM的智能体和验证机制,实现从法律文本到指标的透明、可追溯的自动计算,在法国海洋环境法语料库上优于基线方法。

详情
AI中文摘要

从规范文本计算法律指标是法律监测和政策评估中的关键任务,但由于法律语言的复杂性、规模、解释性以及可用文档质量的差异,这一任务面临重大挑战。现有的自然语言处理技术和生成模型可以辅助法律分析,但往往存在较高的幻觉风险,且缺乏可靠指标计算所需的可解释性和证据基础。本文提出N2I-RAG(从规范到指标),一种智能检索增强生成框架,旨在以透明且可追溯的方式自动化法律指标的计算。我们将自适应检索、基于LLM的智能体和验证机制集成到一个模块化流水线中,其中每个组件在过滤、检索和评估证据,以及生成与可识别法律条款相关的二元法律结果方面执行定义明确的角色。该框架通过要求对中间决策和最终指标分配进行明确解释来强调可追溯性。我们使用内部构建的包含扫描和数字两种来源的法国海洋环境法律语料库评估N2I-RAG。与多个语言模型家族的对比实验表明,所提出的方法始终优于基线系统,并且在两种不同禁令的测试中具有良好的泛化能力。结果表明,智能检索增强生成可以桥接开放文本法律语言和标准化指标计算,为透明且可扩展的法律观测站奠定基础。

英文摘要

Computing legal indicators from normative texts is a key task in legal monitoring and policy evaluation, but presents significant challenges due to the complexity, scale, and interpretive nature of legal language, as well as the variability in available document quality. Existing natural language processing techniques and generative models can assist in legal analysis, but often suffer from a high risk of hallucination and lack the interpretability and evidence grounding required for reliable indicator computation. This paper presents N2I-RAG (From Norms to Indicators), an agentic retrieval-augmented generation framework designed to automate the computation of legal indicators in a transparent and traceable way. We integrate adaptive retrieval, LLM-based agents, and validation mechanisms in a modular pipeline, where each component performs a defined role in filtering, retrieving, and assessing evidence, and in producing binary legal outcomes linked to identifiable legal provisions. The framework emphasizes traceability by requiring explicit explanations of intermediate decisions and final indicator assignments. We evaluate N2I-RAG using an in-house French marine environmental law corpus that includes both scanned and digital sources. Comparative experiments with multiple language model families demonstrate that the proposed approach consistently outperforms baseline systems and generalizes well when tested on two different bans. The results indicate that agentic retrieval-augmented generation can bridge open-text legal language and standardized indicator computation, offering a foundation for transparent and scalable legal observatories.

2605.26924 2026-05-27 cs.CL

Learning to Adapt SFT Data for Better Reasoning Generalization

学习适应SFT数据以实现更好的推理泛化

Lisong Sun, Li Wang, Chen Zhang, Jinyang Wu, Kui Zhang, Tianhao Peng, Wenjun Wu

AI总结 提出DART方法,通过强化学习训练映射器将分布不匹配的SFT数据转化为模型自适应的监督,提升推理泛化能力。

详情
AI中文摘要

大型语言模型(LLMs)取得了显著进展,其中后训练在增强其推理能力方面起着关键作用。在后训练范式中,监督微调(SFT)被广泛使用:它利用外部数据提供密集监督并实现高效训练。然而,当数据分布与目标模型自身分布不匹配时,直接在专家数据上微调可能会损害泛化能力。在这项工作中,我们提出了推理调优的数据适应(DART),它将使用固定且可能分布不匹配的SFT数据集表述为对演示转换的优化问题。DART使用强化学习训练一个映射器模型,将原始SFT数据转换为与目标模型分布和学习偏好更匹配的模型自适应监督。转换后的数据随后用于SFT,使目标模型能够更好地利用外部监督。在多个模型和数据集上的实验表明,DART提高了泛化能力,实现了比直接RL更高的训练效率,并帮助模型超越标准SFT。我们的代码可在https://anonymous.4open.science/r/DART525E50D获取。

英文摘要

Large language models (LLMs) have achieved remarkable progress, with post-training playing a crucial role in enhancing their reasoning capabilities. Among post-training paradigms, supervised fine-tuning (SFT) is widely used: it leverages external data to provide dense supervision and enables efficient training. However, directly fine-tuning on expert data can hurt generalization when the data distribution is mismatched with the target model's own distribution. In this work, we propose Data Adaptation for Reasoning Tuning (DART), which formulates the use of a fixed, potentially distributionally misaligned SFT dataset as an optimization problem over demonstration transformations. DART trains a mapper model with reinforcement learning to convert original SFT data into model-adapted supervision that better matches the target model's distribution and learning preferences. The transformed data are then used for SFT, allowing the target model to better exploit external supervision. Experiments across multiple models and datasets show that DART improves generalization, achieves higher training efficiency than direct RL, and helps models surpass standard SFT. Our code is available at https://anonymous.4open.science/r/DART525E50D.

2605.26918 2026-05-27 cs.CL

Are Video Models Zero-Shot Learners and Reasoners in Education? EduVideoBench, A Knowledge-Skills-Attitude Benchmark for Educational Video Generation

视频模型是教育领域的零样本学习者和推理者吗?EduVideoBench:面向教育视频生成的知识-技能-态度基准

Unggi Lee, Hoyoung Ahn, Yoon Choi, Seonmin Eun, Jahyun Jeong, Seonmin Jin, Harmony Jung, Hye Jin Kim, Chaerin Lee, Hyunji Lee, Jeongjin Lee, Soohwan Lee, Young-Seok Oh, Jaehyeon Park, Sun-ok Ryu, Sunyoung Shin, Yoorim Son, Haeun Park, Yeil Jeong

AI总结 提出基于知识-技能-态度框架的教育视频生成基准EduVideoBench,评估五个前沿视频生成模型在教育有效性上的不足,并发现教育有效性是多维度的,单一元素不匹配即可使视频失效。

详情
AI中文摘要

视频生成模型(VGMs)正迅速进入课堂,然而现有基准仅评估感知质量、内在忠实性、通用安全性或将视频作为推理媒介,没有评估输出是否具有教育有效性。在这项工作中,我们提出了EduVideoBench,这是教育领域第一个平衡的基准,基于知识-技能-态度(KSA)框架,使得教学充分性和教育安全性被联合评估,而非作为临时的质量维度。在五个前沿VGMs上,我们的结果显示,在知识、技能和态度方面,它们距离课堂准备就绪还有很大的改进空间。我们辅以专家评论的定性分析,发现教育有效性是多维度的,单个不匹配的元素(如节奏、可读性或符号)可能使原本正确的视频失效。我们希望EduVideoBench能够指导开发教学上合理且课堂安全的VGMs。

英文摘要

Video generation models (VGMs) are rapidly entering classrooms, yet existing benchmarks evaluate only perceptual quality, intrinsic faithfulness, generic safety, or video as a reasoning medium, and none assesses whether the outputs are educationally valid. In this work, we present EduVideoBench, the first balanced benchmark in the education domain, grounded in the Knowledge-Skills-Attitude (KSA) framework so that pedagogical adequacy and educational safety are evaluated jointly rather than as ad-hoc quality dimensions. Across five frontier VGMs, our results show substantial room for improvement across knowledge, skills, and attitude before they are classroom-ready. We complement this with a qualitative analysis of expert comments, finding that educational validity is multi-component, where a single misaligned element such as pacing, legibility, or notation can invalidate an otherwise correct video. We hope EduVideoBench will guide the development of VGMs that are pedagogically grounded and safe for the classroom.

2605.26911 2026-05-27 cs.AI

TADDLE: A Tool-Augmented Agent for Detecting Deficient LLM-Generated Peer Reviews

TADDLE: 一种用于检测有缺陷的LLM生成同行评审的工具增强型代理

Hanqi Duan, Xiang Li

AI总结 针对LLM生成的同行评审难以检测缺陷的问题,提出TADDLE工具增强型代理,通过四个专用分析工具和两阶段半监督学习,在二元检测和多标签分类任务上表现优异。

详情
AI中文摘要

LLM生成的同行评审在主要会议中越来越常见,但由于它们语言流畅、结构良好,其缺陷难以检测。现有工作要么仅分类作者身份而不评判质量,要么使用为人类撰写的评审设计的特征来评分质量;没有先前系统能在单个缺陷类型级别检测LLM生成评审中的缺陷。为弥补这一空白,我们引入了TADDLE,一种用于检测有缺陷的LLM生成同行评审的工具增强型代理,以及首个针对此任务的专家标注基准。我们的基准包含对50篇ICLR 2025论文的1800条评审,由18位领域专家根据六个缺陷类别(加上一个无缺陷标签)的分类法进行多标签标注。TADDLE将检测分解为四个专用分析工具——验证、纠正、完善和转换——由一个代理协调;一个集成器通过两阶段半监督学习将其输出综合为二元和多标签分类。大量实验表明,TADDLE在二元检测和多标签分类任务上均表现强劲。我们在https://github.com/AquariusAQ/TADDLE发布基准和代码。

英文摘要

LLM-generated peer reviews are increasingly common at major venues, yet their deficiencies are hard to detect because they are uniformly fluent and well-structured. Existing work either classifies authorship without judging quality, or scores quality with features designed for human-written reviews; no prior system detects deficiencies in LLM-generated reviews at the level of individual defect types. To bridge the gap, we introduce TADDLE, a Tool-Augmented Agent for Detecting Deficient LLM-Generated Peer Reviews, together with the first expert-annotated benchmark for this task. Our benchmark comprises 1,800 reviews on 50 ICLR 2025 papers, multi-label-annotated by 18 domain experts against a taxonomy of six defect categories (plus a non-deficient label). TADDLE decomposes detection into four specialized analysis tools -- Verify, Correct, Complete, and Transform -- orchestrated by an agent; an integrator synthesizes their outputs into binary and multi-label classifications via two-stage semi-supervised learning. Extensive experiments show that TADDLE performs strongly on both binary detection and the multi-label classification task. We release the benchmark and code at https://github.com/AquariusAQ/TADDLE.

2605.26908 2026-05-27 cs.AI cs.DS cs.LG

On the Detection of Commutative Factors in Factor Graphs: Necessary and Sufficient Conditions

关于因子图中可交换因子检测的充要条件

Malte Luttermann, Ralf Möller, Marcel Gehrke

AI总结 本文重新审视了因子图中可交换因子检测的理论基础,指出现有算法依赖的定理仅为必要条件而非充分条件,并提出了修正算法以保证正确性和效率。

详情
AI中文摘要

利用概率图模型(如因子图)中对象的不可区分性是提升概率推理算法的关键,并允许对领域规模进行可处理的概率推理问题。在因子图中利用不可区分对象的核心是识别可交换因子,即其输出值在分配给其部分参数的输入值的排列下保持不变的因子。本文重新审视了检测可交换因子的最先进算法的理论基础。具体而言,我们表明,在其当前形式下,最先进算法依赖于一个中心定理,该定理被错误地视为识别可交换因子的充分条件,而实际上它仅意味着必要条件。因此,正如我们在本文中所展示的,最先进算法可能会产生错误结果。为了修复当前最先进算法中存在的缺陷,我们证明了上述定理的一个略微修改版本,该版本作为识别可交换因子的必要条件。此外,我们提出了最先进算法的修正版本,在保持其效率的同时确保正确性,并引入了一种具有更严格最坏情况边界的补充算法。

英文摘要

Exploiting the indistinguishability of objects in a probabilistic graphical model such as a factor graph is key to lifted probabilistic inference algorithms and allows for tractable probabilistic inference with respect to domain sizes. A central building block for the exploitation of indistinguishable objects in factor graphs is the identification of commutative factors, i.e., factors whose output values are invariant under permutations of input values assigned to a subset of their arguments. In this paper, we revisit the theoretical foundations underlying the state-of-the-art algorithm to detect commutative factors. Specifically, we show that in its current form, the state-of-the-art algorithm relies on a central theorem that is mistakenly regarded as a sufficient condition to identify commutative factors, while it actually only implies a necessary condition. Consequently, the state of the art might, as we show in this paper, deliver incorrect results. To fix the flaws currently present in the state of the art, we prove a slightly modified version of the aforementioned theorem, which serves as a necessary condition to identify commutative factors. Moreover, we present a corrected version of the state-of-the-art algorithm, which keeps its efficiency while ensuring correctness, and introduce a complementary algorithm with tighter worst-case bounds.
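The definition of a commutative factor given above can be made concrete with a brute-force check. This is a minimal sketch for intuition only: it enumerates every assignment and every permutation, so it is exponential and is not the algorithm discussed in the paper. The function name `is_commutative` and the example factor `phi` are hypothetical, and the sketch assumes that the arguments in the tested subset share the same range of values.

```python
from itertools import permutations, product
from typing import Callable, Sequence


def is_commutative(factor: Callable[..., float],
                   domains: Sequence[Sequence],
                   subset: Sequence[int]) -> bool:
    """Check by enumeration that `factor` is invariant under every
    permutation of the values assigned to the positions in `subset`."""
    for assignment in product(*domains):
        base = factor(*assignment)
        values = [assignment[i] for i in subset]
        for perm in permutations(values):
            permuted = list(assignment)
            for pos, val in zip(subset, perm):
                permuted[pos] = val  # reassign the permuted value
            if factor(*permuted) != base:
                return False
    return True


def phi(a: int, b: int, c: int) -> float:
    """Example factor: depends on a and b only through a + b."""
    return [1.0, 2.0, 5.0][a + b] * (3.0 if c else 1.0)


boolean = (0, 1)
print(is_commutative(phi, [boolean] * 3, [0, 1]))  # True
print(is_commutative(phi, [boolean] * 3, [0, 2]))  # False
```

Here `phi` is commutative in its first two arguments because swapping them leaves `a + b` unchanged, while swapping the first and third arguments changes the value (for example, `phi(1, 0, 0)` is 2.0 but `phi(0, 0, 1)` is 3.0).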

2605.26900 2026-05-27 cs.LG

SPHERE-JEPA: Spherical Prediction with Homogeneous Embeddings

SPHERE-JEPA: 均匀嵌入的球面预测

Léo Nicollier, Max Dunitz, Marc Pic, Pablo Musé, Enric Meinhardt-Llopis, Gabriele Facciolo

AI总结 本文提出SPHERE-JEPA框架,通过将Cramér-Wold投影机制调整为强制超球面均匀性而非高斯先验,解决了自监督学习中高斯嵌入导致各向异性k-NN邻域的问题,在纹理检索和ImageNet-1K线性探测上取得显著提升。

详情
AI中文摘要

自监督学习中的一个基本开放问题是明确表征学习表示的最优几何。最近,LeJEPA将各向同性高斯嵌入确定为在欧几里得空间中最小化下游预测风险的最优解。然而,对于支撑在低维流形(如超球面)上的分布,相应问题仍未探索。在这项工作中,我们证明将这种极小极大分析扩展到黎曼流形上的光滑分布会根本性地改变最优解。我们表明,在最坏情况公式下,k近邻和核岭回归都诱导超球面均匀性。更精确地说,我们证明流形上的均匀分布对于k近邻是最优的,而球面上的均匀分布对于使用指数点积核和线性核的核岭回归是最优的。这一理论见解揭示了高斯嵌入的一个根本局限:其非均匀密度导致各向异性的k-NN邻域,严重偏置估计器。为纠正这一点,我们引入了SPHERE-JEPA,一个理论基础的SSL框架。我们调整LeJEPA的Cramér-Wold投影机制以强制超球面均匀性而非高斯先验。实验上,SPHERE-JEPA取得了显著改进,将纹理检索mAP提升了超过6%,同时在标准基准上持续匹配或超越LeJEPA——包括在ImageNet-1K(ViT-B/14)上+1.8%的线性探测增益。

英文摘要

A fundamental open question in self-supervised learning (SSL) is the explicit characterization of the optimal geometry of the learned representations. Recently, LeJEPA identified isotropic Gaussian embeddings as optimal for minimizing downstream prediction risk in Euclidean spaces. However, the corresponding problem for distributions supported on lower-dimensional manifolds, such as the hypersphere, remains unexplored. In this work, we demonstrate that extending this minimax analysis to smooth distributions on Riemannian manifolds fundamentally changes the optimal solution. We show that, under a worst-case formulation, both k-nearest neighbors and kernel ridge regression induce hyperspherical uniformity. More precisely, we show that uniform distributions on manifolds are optimal for k-nearest neighbors, and that the uniform distribution on the sphere is optimal for kernel ridge regression with both the exponential dot-product kernel and the linear kernel. This theoretical insight reveals a fundamental limitation of Gaussian embeddings: their non-uniform density induces anisotropic k-NN neighborhoods, severely biasing the estimator. To correct this, we introduce SPHERE-JEPA, a theoretically grounded SSL framework. We adapt LeJEPA's Cramér-Wold projection mechanism to enforce hyperspherical uniformity rather than a Gaussian prior. Empirically, SPHERE-JEPA yields significant improvements, boosting texture retrieval mAP by over 6%, while consistently matching or outperforming LeJEPA on standard benchmarks, including a +1.8% linear probing gain on ImageNet-1K (ViT-B/14).
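The target geometry named above, uniformity on the hypersphere, can be illustrated with a small sampling experiment. This is not the SPHERE-JEPA objective, whose exact form the abstract does not give. It only shows two standard facts: normalizing isotropic Gaussian vectors yields points uniform on the unit sphere, and the projection of such points onto any fixed unit direction has variance 1/d in dimension d. One-dimensional projections are the quantity a Cramér-Wold-style test inspects. The function names are hypothetical.

```python
import math
import random
from typing import List


def sample_uniform_sphere(dim: int, rng: random.Random) -> List[float]:
    """Draw one point uniformly from the unit sphere in R^dim by
    normalizing an isotropic Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]


def projection_variance(points: List[List[float]], direction: List[float]) -> float:
    """Empirical variance of the 1-D projections of `points` onto `direction`."""
    proj = [sum(p * d for p, d in zip(point, direction)) for point in points]
    mean = sum(proj) / len(proj)
    return sum((t - mean) ** 2 for t in proj) / len(proj)


rng = random.Random(0)  # fixed seed so the experiment is repeatable
dim = 8
points = [sample_uniform_sphere(dim, rng) for _ in range(20_000)]
direction = sample_uniform_sphere(dim, rng)  # an arbitrary unit direction

# The empirical variance should be close to the theoretical value 1 / dim.
print(projection_variance(points, direction), 1.0 / dim)
```

Because the uniform distribution on the sphere is rotation invariant, the same variance is expected for every direction, which is one simple signature that distinguishes it from an anisotropic embedding distribution.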