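arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.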

2606.10365 2026-06-10 cs.SD New submission

KFC-KWS: Keyframe Fusion with CTC for User-Defined Keyword Spotting

Jin Li, Wenbin Jiang, Ji Hu

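Affiliations: School of Electronics and Information Engineering, Hangzhou Dianzi University; School of Communication Engineering, Hangzhou Dianzi University

AI summary: Proposes KFC-KWS, a multimodal framework that uses CTC-guided keyframe selection to align the audio, phoneme, and text modalities and fuses the keyframes with full-utterance representations through cross-attention. It reaches 98.73% AUC on LibriPhrase, and 97.65% AUC with 7.75% EER on the hard subset, separating confusable keywords effectively.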

Comments Accepted by Interspeech 2026

Abstract

User-defined keyword spotting (KWS) enables personalized voice interaction by detecting user-specified keywords. A key challenge in this task is distinguishing target keywords from phonetically confusable alternatives. To address this challenge, we propose KFC-KWS, a multimodal framework that leverages connectionist temporal classification (CTC)-guided keyframe selection. Specifically, we exploit the peaky posterior distributions of CTC to identify high-confidence phoneme frames, enabling precise alignment across audio, phoneme, and text modalities. These keyframes are then fused with full-utterance representations through cross-attention to capture both local discriminative cues and global contextual information. On LibriPhrase, KFC-KWS achieves the best-balanced performance (98.73% AUC) and substantially outperforms advanced baselines on the challenging hard subset (97.65% AUC and 7.75% EER), demonstrating its effectiveness in discriminating between highly confusable keywords.
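
The abstract does not state the keyframe selection rule, so the following is only a minimal sketch of one plausible reading. The blank index, the confidence threshold, and the function name `select_keyframes` are assumptions made for illustration. The sketch keeps frames whose most likely label is a non-blank phoneme above the threshold and collapses each run of the same phoneme to its most confident frame.

```python
from typing import List, Tuple

BLANK = 0  # assumed index of the CTC blank label


def select_keyframes(posteriors: List[List[float]], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return (frame_index, phoneme_id) pairs for high-confidence non-blank peaks."""
    keyframes: List[Tuple[int, int]] = []
    best_prob = 0.0
    prev_label = BLANK
    for t, frame in enumerate(posteriors):
        label = max(range(len(frame)), key=frame.__getitem__)
        prob = frame[label]
        if label == BLANK or prob < threshold:
            prev_label = BLANK  # a blank or low-confidence frame ends the current run
            continue
        if label == prev_label:
            # Same phoneme run: keep only its most confident frame.
            if prob > best_prob:
                keyframes[-1] = (t, label)
                best_prob = prob
        else:
            keyframes.append((t, label))
            best_prob = prob
        prev_label = label
    return keyframes


# Toy posteriors over (blank, phoneme 1, phoneme 2), one row per audio frame.
posteriors = [
    [0.90, 0.05, 0.05],
    [0.20, 0.70, 0.10],
    [0.10, 0.80, 0.10],
    [0.95, 0.03, 0.02],
    [0.30, 0.10, 0.60],
]
keyframes = select_keyframes(posteriors)
```

Because CTC posteriors tend to be peaky, most frames fall to the blank label and only a few frames per phoneme survive, which is what makes them usable as alignment anchors.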

2606.10364 2026-06-10 cs.CV New submission

Benchmarking stereo reconstruction for 3D printable Martian terrain models

Josephine Wang

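Affiliations: MIT, Cambridge, MA, USA

AI summary: Mars imagery is low-texture, irregular, and only partially observed. The paper evaluates a pipeline that estimates stereo depth from NASA Curiosity images, completes geometry, and exports printable meshes. It finds that benchmark accuracy does not transfer directly to Martian terrain reconstruction, and that geometry completion trades local fidelity against global connectivity.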

Comments 9 pages, 7 figures, CVPR End-to-End 3D Workshop 2026

Abstract

Reconstructing printable 3D models from Mars rover imagery is challenging because Martian terrain is low-texture, irregular, and partially observed. We evaluate a pipeline that estimates stereo depth from NASA Curiosity images, completes geometry, and exports watertight OBJ meshes. On Middlebury, RAFT-Stereo outperforms semi-global block matching (SGBM), reducing disparity MAE from 3.22px to 0.73px and increasing valid prediction coverage from 76.3% to 100.0%. On Curiosity imagery, however, RAFT's denser disparities show weaker edge alignment and higher photometric reprojection error, suggesting that benchmark accuracy does not directly transfer to Martian terrain reconstruction. Geometry completion demonstrates a tradeoff between local fidelity and global connectivity. We find that alpha shapes preserve accurate but fragmented structure, Poisson reconstruction produces more coherent meshes but adds unsupported surfaces, and a deterministic diffusion-fill baseline is intermediate but sensitive to stereo quality. Overall, standard stereo and completion methods can produce printable approximations of Martian terrain, but reliable reconstruction requires stronger domain-specific validation.
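
The two Middlebury figures quoted above, disparity MAE and valid prediction coverage, can be computed along the following lines. This is a sketch under assumptions: the abstract does not give the evaluation protocol, so marking invalid pixels with `None` and measuring coverage over pixels that have ground truth are illustrative choices, as is the name `disparity_metrics`.

```python
from typing import List, Optional, Tuple


def disparity_metrics(pred: List[Optional[float]], truth: List[Optional[float]]) -> Tuple[float, float]:
    """Return (MAE in pixels, coverage in percent) for flattened disparity maps.

    Assumption: None marks an invalid pixel, and coverage is the share of
    ground-truth pixels for which a prediction exists.
    """
    with_truth = [(p, t) for p, t in zip(pred, truth) if t is not None]
    valid = [(p, t) for p, t in with_truth if p is not None]
    if not valid:
        raise ValueError("no pixels to evaluate")
    mae = sum(abs(p - t) for p, t in valid) / len(valid)
    coverage = 100.0 * len(valid) / len(with_truth)
    return mae, coverage


pred = [10.0, None, 12.5, 8.0]
truth = [11.0, 9.0, 12.0, None]
mae, coverage = disparity_metrics(pred, truth)
```

Reporting the two numbers together matters: a matcher that discards difficult pixels can lower its MAE while losing coverage, so neither figure is meaningful alone.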

2606.10363 2026-06-10 cs.RO New submission

HiMem-WAM: Hierarchical Memory-Gated World Action Models for Robotic Manipulation

Xiaoquan Sun, Ruijian Zhang, Chen Cao, Yihan Sun, Jiahui Chen, Zetian Xu, Bo Chen, Haijier Chen, Zhen Yang, Jiarun Zhu, Yijun Hong, JingZhe Xu, Jingrui Pang, Mingqi Yuan, Jiayu Chen

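Affiliations: The University of Hong Kong; INFIFORCE; Huazhong University of Science and Technology; Tsinghua University; Wuhan University; Southern University of Science and Technology

AI summary: Proposes HiMem-WAM, a hierarchical memory-gated world action model that uses a hierarchical latent action framework and boundary-triggered memory updates to improve task-relevant memory and robust generalization in long-horizon robotic manipulation.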

Abstract

World Action Models (WAMs) have emerged as a new powerful paradigm for embodied intelligence, learning action-relevant visual dynamics that significantly enhance generalization and robustness. However, existing WAMs still struggle with task-relevant memory in long-horizon robotic manipulation. To address this, we present HiMem-WAM, a Hierarchical Memory-Gated WAM that integrates motion-centric latent actions, high-level skill latents, and boundary-triggered memory updates. Specifically, we develop a hierarchical latent action framework that jointly learns low-level motion and high-level skill latents, providing structured temporal abstraction. Meanwhile, a boundary-aware memory gate writes compact task states at predicted skill transitions, enabling causal inference without test-time generation of future video or optical flow estimation. Evaluated on LIBERO, LIBERO-PLUS, RMBench and real-world tasks, HiMem-WAM shows that hierarchical latents improve robustness under deployment perturbations, and the memory module substantially benefits memory-dependent long-horizon manipulation.

2606.10359 2026-06-10 cs.AI New submission

ReflectiChain: Epistemic Grounding in LLM-Driven World Models for Supply Chain Resilience

Jia Luo

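Affiliations: School of Foreign Languages, Huazhong University of Science

AI summary: Proposes the ReflectiChain framework, which uses a generative supply chain world model and double-loop learning to separate epistemic from aleatoric uncertainty. On a semiconductor benchmark it improves the Rationale Consistency Score by 33.0% and maintains 82.3% operability under adversarial shocks.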

Abstract

AI agents in supply chains face a fundamental epistemic gap: large language models (LLMs) interpret policies but lack physical grounding, while reinforcement learning (RL) optimizes flows but is semantically blind to unstructured constraints. We introduce REFLECTICHAIN, bridging this gap through a Generative Supply Chain World Model (SC-WM) - encoding heterogeneous supply networks into a 6-dim graph-latent space with physical conservation - and Double-Loop Learning that separates epistemic uncertainty (KL-trust-region-bounded policy adaptation) from aleatoric uncertainty (stochastic latent rollouts). On Semi-Sim, a 10-node semiconductor benchmark with SIR risk propagation, 6 perturbation types, and 10 policy constraint templates, REFLECTICHAIN improves Rationale Consistency Score by 33.0% (p < 0.0001, d = 2.78), maintains 82.3% operability under adversarial shocks, and exhibits anti-fragile behavior (+40.2% gain under moderate pressure). We identify three operational epistemic mechanisms - uncertainty separation, knowledge-boundary detection, and empirical Bayesian policy updating - and discuss five limitation categories.

2606.10358 2026-06-10 cs.LG cs.AI New submission

KG-SoftMAP: Soft Knowledge-Graph Priors for Bayesian Network Structure Learning from Sparse Discrete Data

Guoliang Xu, James E. Corter

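Affiliations: Columbia University

AI summary: Addresses the difficulty of learning Bayesian network structure from sparse discrete data. KG-SoftMAP encodes a weighted directed knowledge graph as a soft prior and maximizes a MAP objective that combines the BDeu score with a logit-form prior. It markedly improves structure recovery on synthetic benchmarks; on real data, which has no ground-truth DAG, it is evaluated on prediction, calibration, and KG-consistency.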

Comments 33 pages including appendices, 1 figure

Abstract

Learning Bayesian network (BN) structure from sparse discrete data is hard: when each instance records only a few variables, most variable pairs lack the joint observations needed for reliable scoring, and data-only methods recover little structure. Imperfect domain knowledge, expressible as a weighted directed knowledge graph (KG), is often available. We propose KG-SoftMAP, which encodes such a KG as a soft, confidence-weighted, data-overridable edge prior and maximizes a MAP objective combining the BDeu score with a logit-form prior; the KG may be expert-curated or LLM-extracted. On controlled synthetic benchmarks, the only setting with ground-truth DAGs, KG-SoftMAP recovers partial directed structure at $\rho=0.05$ (DF1 $0.14$ to $0.29$, versus near-zero baselines) and substantially more once $\rho\geq0.2$ (DF1 $0.46$ to $0.96$), when paired with an informative but imperfect KG; recovery degrades gracefully as KG quality drops. On real sparse educational data, which has no ground-truth DAG, we evaluate deployment-facing measures only: prediction, calibration, and KG-consistency. The learned BN is best read as a diagnostic model: on SAF it trails logistic regression by $0.03$ F1_FAIL while providing KG-consistent edges, calibrated joint probabilities, and inference from arbitrary observed concept subsets; when no meaningful KG exists, discriminative logistic regression is preferable.
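
A MAP objective that adds a logit-form edge prior to a data score might look roughly like the sketch below. The exact prior in the paper is not given in the abstract, so this form is an assumption: each edge present in the graph contributes `strength * logit(confidence)`, edges missing from the KG default to a neutral confidence of 0.5, and the BDeu score is passed in as a precomputed number instead of being computed here. All function names are hypothetical.

```python
import math
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]  # (parent, child)


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def log_edge_prior(graph: Set[Edge], kg_confidence: Dict[Edge, float], strength: float = 1.0) -> float:
    """Soft prior: each present edge adds strength * logit(confidence).

    Edges absent from the KG get confidence 0.5 and therefore contribute zero.
    """
    return strength * sum(logit(kg_confidence.get(edge, 0.5)) for edge in graph)


def map_objective(data_score: float, graph: Set[Edge], kg_confidence: Dict[Edge, float]) -> float:
    """Data log score (assumed to be a precomputed BDeu value) plus log prior."""
    return data_score + log_edge_prior(graph, kg_confidence)


# Hypothetical KG: it favours fractions -> ratios and disfavours the reverse.
kg = {("fractions", "ratios"): 0.9, ("ratios", "fractions"): 0.2}
with_kg_edge = map_objective(-120.0, {("fractions", "ratios")}, kg)
against_kg = map_objective(-120.0, {("ratios", "fractions")}, kg)
```

With equal data scores the KG-supported direction wins, yet the prior is finite, so a sufficiently large gap in the data score can still override it. That is one way to read "data-overridable".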

2606.10350 2026-06-10 cs.CV New submission

Multi-Angular Reflectance Anisotropy Observed from UAV Multispectral Imagery

Zhenqiang Qin, Chenguang Dai, Min Wang, Xian Li

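Affiliations: University of Information Engineering

AI summary: Proposes a geometry-aware workflow for extracting multi-angular observations and quantifying observation-geometry effects from a BRDF perspective. Camera parameters are refined with SFM and homogeneous regions are reprojected onto raw sub-images, so multi-band reflectance and viewing geometry are extracted jointly. Red-edge and near-infrared reflectance varies by 119-137% between minimum and maximum.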

Abstract

UAV multispectral imagery naturally contains multi-angular observations due to low flight altitude and wide field-of-view imaging, which may introduce geometry-driven radiometric variability. This study proposes a geometry-aware multi-angular observation extraction workflow to quantify observation-geometry effects from a BRDF perspective. Specifically, camera intrinsics and extrinsics are refined via structure-from-motion (SFM), and homogeneous regions annotated on an orthomosaic are reprojected onto multiple raw sub-images acquired from different viewpoints. This enables joint extraction of multi-band reflectance and observation geometry parameters for the same ground targets under varying viewing directions. The extracted observations are further analyzed using band-wise polar visualization in the (VZA, RAA) domain. Results on a grassland target show clear reflectance anisotropy across ten bands, with red-edge and near-infrared bands exhibiting 119-137% variability between maximum and minimum reflectance, indicating non-negligible observation-geometry effects on radiometric consistency.
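
The (VZA, RAA) pair for one camera-target observation follows from basic geometry once the camera pose is known. The sketch below assumes a local east-north-up frame in metres, azimuths measured clockwise from north, and a known sun azimuth; these conventions and the name `view_geometry` are illustrative assumptions, not details taken from the paper.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]  # (east, north, up) in metres


def view_geometry(camera: Vec3, target: Vec3, sun_azimuth_deg: float) -> Tuple[float, float]:
    """Return (view zenith angle, relative azimuth angle) in degrees."""
    de, dn, du = (c - t for c, t in zip(camera, target))
    horizontal = math.hypot(de, dn)
    # Zenith angle: tilt of the target-to-camera ray away from the vertical.
    vza = math.degrees(math.atan2(horizontal, du))
    # Azimuth of the camera as seen from the target, clockwise from north.
    view_azimuth = math.degrees(math.atan2(de, dn)) % 360.0
    diff = abs(view_azimuth - sun_azimuth_deg) % 360.0
    raa = 360.0 - diff if diff > 180.0 else diff  # fold into [0, 180]
    return vza, raa


vza, raa = view_geometry(camera=(30.0, 0.0, 30.0), target=(0.0, 0.0, 0.0), sun_azimuth_deg=180.0)
```

Repeating this for every raw sub-image that sees the same ground region gives the set of angular samples that a polar plot in the (VZA, RAA) domain would display.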

2606.10348 2026-06-10 cs.RO New submission

Rethinking Embodied Navigation via Relational Inductive Bias

Weitao An, Chenghao Xu, Xu Yang, Cheng Deng

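Affiliations: School of Electronic Engineering, Xidian University; School of Information Science and Engineering, Hohai University

AI summary: Proposes DB-Nav, a framework that reshapes the search space with dual relational biases, an Activation Bias and an Inhibition Bias, and modulates frontier exploration through a Relational Activation-Inhibition Exploration Graph, significantly improving object-navigation success rate and path efficiency.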

Abstract

Object navigation requires an agent to locate a target in an unknown environment through visual observations. Existing methods typically rely on open-vocabulary detectors or vision-language models (VLMs) to answer where to search, but often overlook what not to trust - which semantic cues are unreliable. Open-vocabulary perception is prone to systematic misleading evidence: false positives, outdated static priors, and repeated failed exploration due to lack of embodied verification, which contaminates mapping and decision-making. Such errors are rooted in structured object relations in real-world scenes. To address this, we propose DB-Nav, a framework that reshapes the search space via dual relational biases. It factorizes target-centric relations into an Activation Bias (propagates contextual evidence) and an Inhibition Bias (suppresses unreliable regions via perceptual confusion and action-level falsification). These biases are unified into a Relational Activation-Inhibition Exploration Graph that modulates frontier exploration values using online observations and failed accesses. Experiments on ObjectNav benchmarks show that DB-Nav significantly outperforms existing methods in success rate (SR) and Success weighted by Path Length (SPL), offering a lightweight, interpretable, and robust navigation framework without costly online VLM reasoning.
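
To make the activation and inhibition idea concrete, here is a deliberately simple sketch of how two such biases could modulate a frontier's exploration value. The additive form, the weights `alpha`, `beta`, and `gamma`, and the `Frontier` fields are all hypothetical; the abstract does not specify how DB-Nav combines them.

```python
from dataclasses import dataclass


@dataclass
class Frontier:
    name: str
    base_value: float       # e.g. a geometric exploration utility
    activation: float       # contextual evidence that the target is nearby
    confusion: float        # evidence that detections here are look-alikes
    failed_visits: int = 0  # earlier visits that did not find the target


def frontier_score(f: Frontier, alpha: float = 1.0, beta: float = 1.0, gamma: float = 0.5) -> float:
    """Hypothetical scoring: activation raises the value, inhibition lowers it."""
    inhibition = beta * f.confusion + gamma * f.failed_visits
    return f.base_value + alpha * f.activation - inhibition


frontiers = [
    Frontier("kitchen door", base_value=1.0, activation=0.8, confusion=0.1),
    Frontier("hallway", base_value=1.0, activation=0.6, confusion=0.2, failed_visits=2),
]
best = max(frontiers, key=frontier_score)
```

The hallway loses despite reasonable contextual evidence because two failed visits count against it, which mirrors the action-level falsification described in the abstract.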

2606.10346 2026-06-10 cs.AI New submission

Reasoning or Memorization? Direction-Aware Diversity Exploration in LLM Reinforcement Learning

Jiangnan Xia, Yucheng Shi, Yu Yang, Kishan Panaganti, Zhenwen Liang, Ninghao Liu

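Affiliations: University of Georgia; Tencent AI Lab; The Education University of Hong Kong; The Hong Kong Polytechnic University

AI summary: Proposes DiRL, a framework whose direction-aware rewards distinguish reasoning-driven from memorization-driven exploration. It integrates direction-weighted gradient features into GRPO and significantly improves mathematical and general reasoning performance.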

Comments 12 pages, 6 figures

Abstract

Reinforcement learning has become a key paradigm for eliciting reasoning abilities in large language models, where exploration is crucial for discovering effective solution trajectories. Existing exploration methods typically encourage diversity in semantic or gradient spaces, without distinguishing what drives this diversity. A trajectory may appear novel because it follows a new reasoning process, or because it varies memorized patterns and shortcuts. Rewarding both cases equally may steer exploration toward memorization rather than genuine reasoning improvement. In this paper, we propose DiRL, a Direction-Aware Reinforcement Learning framework that anchors exploration to an internal reasoning-memorization direction of the policy. Specifically, DiRL extracts this direction from model representations, constructs direction-weighted gradient features to characterize rollout updates, and shapes rewards to amplify reasoning-aligned exploration while suppressing memorization-aligned variations. DiRL integrates seamlessly into standard Group Relative Policy Optimization (GRPO). Extensive experiments on mathematical and general reasoning benchmarks demonstrate the effectiveness of DiRL, showing significant improvements over various existing exploration methods.
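
GRPO scores each rollout relative to the other rollouts sampled for the same prompt, so a shaped reward changes the ranking inside the group. The sketch below shows that mechanism only. The per-rollout `alignment` score, the additive bonus, and the coefficient `beta` are hypothetical stand-ins for DiRL's direction-weighted features, which the abstract does not define in detail; the use of the population standard deviation is also an assumption.

```python
import statistics
from typing import List


def shaped_group_advantages(
    rewards: List[float],
    alignment: List[float],
    beta: float = 0.1,
    eps: float = 1e-8,
) -> List[float]:
    """Group-relative advantages after adding a direction-based bonus.

    `alignment` is a hypothetical per-rollout score in [-1, 1]: positive for
    reasoning-aligned updates, negative for memorization-aligned ones.
    """
    shaped = [r + beta * a for r, a in zip(rewards, alignment)]
    mean = statistics.fmean(shaped)
    std = statistics.pstdev(shaped)
    # Standardize within the group, as in the usual GRPO advantage.
    return [(s - mean) / (std + eps) for s in shaped]


# Two correct rollouts with equal task reward but opposite alignment.
advantages = shaped_group_advantages(
    rewards=[1.0, 1.0, 0.0, 0.0],
    alignment=[0.8, -0.8, 0.0, 0.0],
)
```

Both correct rollouts keep a positive advantage, but the reasoning-aligned one receives the larger update, which is the qualitative effect the abstract describes.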

2606.10340 2026-06-10 cs.RO New submission

OMG: Omni-Modal Motion Generation for Generalist Humanoid Control

Siqiao Huang, Kun-Ying Lee, Dongming Qiao, Guanqi He, Zhenyu Wang, Yitang Li, Shaoting Zhu, Hang Zhao

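Affiliations: Tsinghua University

AI summary: Proposes OMG, which combines a carefully curated data pipeline with a diffusion model to deliver omni-modal whole-body control conditioned on language, audio, and reference motions, with state-of-the-art performance and scalability.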

Comments Project Page: https://tsinghua-mars-lab.github.io/OMG/

Abstract

Humanoid whole-body control has made significant progress in recent years, yet existing approaches remain limited to few-skill policies with heavy reward engineering, or motion trackers that are difficult to extend to new input modalities. We argue that the key to general-purpose humanoid control is to build a scalable brain, a module capable of reasoning with diverse conditioning modalities, atop a reactive motion tracking cerebellum, mirroring the hierarchical structure of biological motor systems. Two challenges arise in realizing this vision: acquiring a vast amount of high-quality data to achieve general purpose control, and equipping the generator with the capability to condition on compositional, extensible multi-modal inputs. We present OMG, which addresses these challenges with a meticulous data curation, filtering and labeling pipeline, as well as a diffusion-based motion generation backbone that conditions on language, audio, and human reference motions. Extensive experiments validate OMG as an omni-modal whole-body controller exhibiting state-of-the-art performance, model scaling behavior and efficient adaptation to new distributions and modalities, marking a concrete step toward foundation models for humanoid robots.

2606.10338 2026-06-10 cs.CL cs.AI New submission

Routing-Aware Expert Calibration for Machine Unlearning in Mixture-of-Experts Language Models

Jingyi Xie, Yijun Lin, Yinjiang Xiong, Zhikun Zhang, Sai Li

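Affiliations: Renmin University of China; Tsinghua University; Zhejiang University; Lightstandard

AI summary: In MoE models, a routing mismatch between forget data and retain data leaves forget-critical experts under-regularized. TRACE detects those experts from offline activation statistics and reweights retain losses to calibrate retain-side activation frequencies. On WMDP and MUSE-BOOKS it improves the forget-utility trade-off, with a 9% relative utility gain over the strongest baseline at comparable forgetting quality.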

Abstract

Machine unlearning is increasingly important for large language models, yet unlearning in Mixture-of-Experts (MoE) architectures remains underexplored. Unlike dense models, MoE architectures employ a router at each layer to assign each token to a sparse subset of experts. In this work, we observe that forget data often activates a small subset of experts disproportionately, while these experts may receive much weaker activation from retain data. This forget-retain routing mismatch can leave forget-critical experts under-regularized during unlearning. To address this, we propose TRACE, Targeted Routing-Aware Calibration of Experts, for MoE unlearning. TRACE first detects forget-critical experts from offline activation statistics, and then calibrates retain regularization by reweighting token-level retain losses so that each selected expert's retain-side activation frequency better matches its forget-side counterpart. Experiments on WMDP and MUSE-BOOKS across multiple MoE LLMs show that TRACE consistently improves the forget-utility trade-off, yielding a 9% relative utility improvement over the strongest baseline under comparable forgetting quality and the best performance on three out of four MUSE-BOOKS metrics.
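
The reweighting step can be illustrated with a small sketch. The rule used here is an assumption, not the paper's formula: a retain token routed to a forget-critical expert is upweighted by that expert's forget-to-retain activation frequency ratio, clipped to a maximum. The function names and the clipping bound are hypothetical.

```python
from typing import Dict, List, Set


def retain_token_weights(
    token_experts: List[Set[int]],
    forget_freq: Dict[int, float],
    retain_freq: Dict[int, float],
    critical: Set[int],
    max_weight: float = 10.0,
) -> List[float]:
    """Upweight retain tokens that are routed to forget-critical experts.

    Hypothetical rule: a token's weight is the largest forget/retain frequency
    ratio among the critical experts it activates, clipped to [1, max_weight].
    """
    weights = []
    for experts in token_experts:
        ratios = [forget_freq[e] / max(retain_freq[e], 1e-8) for e in experts & critical]
        ratio = max(ratios, default=1.0)
        weights.append(min(max(ratio, 1.0), max_weight))
    return weights


def weighted_retain_loss(losses: List[float], weights: List[float]) -> float:
    """Weighted mean of token-level retain losses."""
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


# Expert 1 fires on 50% of forget tokens but only 12.5% of retain tokens.
weights = retain_token_weights(
    token_experts=[{0, 1}, {2, 3}, {1, 2}],
    forget_freq={1: 0.5},
    retain_freq={1: 0.125},
    critical={1},
)
loss = weighted_retain_loss([1.0, 2.0, 3.0], weights)
```

Upweighting these tokens makes the retain objective regularize the critical expert roughly as often as the forget objective updates it, which is the mismatch the abstract sets out to correct.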

2606.10334 2026-06-10 cs.AI New submission

Self-Distillation Policy Optimization via Visual Feedback: Bridging Code and Visual Artifacts

Haoyu Dong

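Affiliations: Microsoft

AI summary: Proposes Visual-SDPO, a framework that treats rendered visual feedback as privileged context and improves the quality of code-generated visual artifacts through self-distillation and Visual-Grounded Code Credit Weighting, with significant gains on chart, UI, and slide generation tasks.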

Abstract

Code-generating large language models (LLMs) increasingly produce visual artifacts such as charts, web pages, and slides by writing programs that are executed by non-differentiable renderers, committing to code before observing the render. As a result, otherwise executable code often yields artifacts with visually salient defects, including overlapping elements, clipped text, broken alignment, low contrast, and overflow. We study visual-feedback self-distillation for code-generated visual artifacts. We propose Visual-SDPO, a self-distillation policy-optimization framework that treats rendered visual feedback as privileged context for a weight-sharing teacher and distills this feedback into a coding student. To make supervision spatially targeted rather than uniform, we introduce Visual-Grounded Code Credit Weighting, which traces each detected defect back to the code statements responsible for the affected elements and amplifies the distillation signal on those statements. A sequence-level GRPO (Group Relative Policy Optimization) term complements the dense token-level objective by rewarding executable, visually high-quality rollouts, while failed executions remain learnable through the self-distillation path by passing execution errors as privileged context to the teacher. We instantiate Visual-SDPO for chart, web/UI, and slide generation with a unified Qwen3-VL-8B-Instruct backbone. Across chart-to-code, UI-to-code, and slide-generation benchmarks (ChartMimic, Design2Code, and AeSlides), Visual-SDPO improves over the zero-shot base by more than 10 absolute points in the primary metric and over GRPO by at least 2.4 points, with fewer training steps and no added inference-time cost.

2606.10333 2026-06-10 cs.LG cs.CR New submission

Privacy-Preserving Credit Risk Prediction with Alternative Data

Hongzhe Zhang, Jiarong Xu, Jing He, Xiao Fang

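Affiliations: School of Management and Economics, The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen); School of Management, Fudan University; Lerner College of Business and Economics, University of Delaware

AI summary: Sharing alternative data for credit risk prediction can leak private information. PrivacyCredit satisfies privacy-preserving, model-confidentiality, and lossless constraints while matching the predictive performance of a model trained on the plaintext combination of traditional and alternative data.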

Abstract

Credit risk prediction is a critical problem in the consumer credit industry. Traditionally, financial institutions construct credit risk prediction models using borrowers' demographic, financial, and credit history data, collectively referred to as traditional data. Recent studies have demonstrated that alternative data, such as borrowers' mobile phone communication data, enable lenders to acquire fuller and more accurate profiles of borrowers' creditworthiness, thereby improving credit risk prediction performance. Nevertheless, alternative data are held by external entities independent of financial institutions. Directly sharing alternative data with financial institutions infringes on consumer privacy, yet existing credit risk prediction studies largely overlook this issue. To address this gap, we define a new problem, namely privacy-preserving credit risk prediction with alternative data, which simultaneously considers three practical constraints: the privacy-preserving constraint that protects consumer privacy, the model-confidentiality constraint that learns and stores the model centrally at the financial institution, and the lossless constraint that maintains the performance of the learned model. To solve this problem, we develop PrivacyCredit, a novel privacy-preserving machine learning method. We then theoretically demonstrate the privacy-preserving, model-confidential, and lossless properties of PrivacyCredit. Through extensive experiments using a real-world credit dataset linked with alternative data, we demonstrate the predictive value of securely incorporating alternative data into credit risk prediction and show that PrivacyCredit achieves the same predictive performance as the model learned from the insecure plaintext combination of traditional and alternative data. We further evaluate its model-confidentiality property and computational efficiency.

2606.10329 2026-06-10 cs.CV cs.AI New submission

Building Change Detection in Earthquake: A Multi-Scale Interaction Network and A Change Detection Dataset

Yunlong Liu, Zekai Zhang

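Affiliations: School of Control Science and Engineering, Shandong University

AI summary: Short imaging intervals after an earthquake make change detection difficult. The authors build the Turkey earthquake change detection dataset (TUE-CD) and propose a multi-scale feature interaction network (MSI-Net) whose joint cross-attention and multi-scale offset calibration modules mitigate side-looking problems and improve change detection accuracy.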

Abstract

As one of the most destructive natural disasters, earthquakes have struck many countries around the world in recent years, causing serious economic losses. Change detection (CD) can be applied to post-earthquake damage assessment as it can infer destroyed change regions from multi-temporal remote sensing images. Furthermore, the CD with short imaging interval will better satisfy the needs of the emergency rescues after earthquakes. However, the capability of current methods built on deep neural networks is limited because the dataset with short imaging interval is absent. To meet post-disaster immediate relief, we create a CD dataset, Turkey earthquake CD dataset (TUE-CD), for the evaluation of building damage in the short term after an earthquake. Because of the short acquisition interval of the post-event images, the imaging angle is different for different temporal images, which leads to some side-looking problems. To deal with these challenges, we present a multi-scale feature interaction network (MSI-Net) for efficient interaction between bi-temporal features, as well as mitigating the effect of side-looking problems. Specifically, the proposed MSI-Net consists of joint cross-attention (JCA) modules, multi-scale offset calibration (MOC) modules, and feature integration (FeI) modules. The JCA module unifies channel cross-attention and spatial joint attention for sufficient feature interaction. The MOC module further estimates the offsets to align the bi-temporal image with the multi-scale features. Finally, calibrated features and multi-scale features are fused by FeI modules for the prediction of changed areas. Experiments on the WHU-CD, CLCD, and the constructed TUE-CD dataset indicate that the proposed MSI-Net provides better results than considered state-of-the-art CD methods.

2606.10328 2026-06-10 cs.CV cs.AI New submission

Content-Induced Spatial-Spectral Aggregation Network for Change Detection in Remote Sensing Images

Yunlong Liu, Zekai Zhang

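Affiliations: School of Control Science and Engineering, Shandong University

AI summary: Proposes a content-guided spatial-spectral integration network (CSI-Net) whose spatial reasoning, spectral difference, and content-guided integration modules fuse global spatial detail with spectral difference information. It suppresses differences in unchanged regions and achieves the best performance on three datasets.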

Abstract

The integration of spatial and spectral information is beneficial to the improvement of change detection performance. However, existing methods cannot efficiently suppress the influences of spatial and spectral differences in unchanged areas. To address these issues, in this paper we propose a content-guided spatial-spectral integration network (CSI-Net) for the fusion of global spatial details and spectral difference information. Specifically, the proposed CSI-Net is composed of a spatial reasoning (SR) module, a spectral difference (SD) module, and a content-guided integration (CGI) module. In the SR module, the spatial information is learned by cascaded graph convolution blocks for global modeling. The SD module is responsible for the extraction of spectral features, by calculating the means and variances of features to reduce the impact of spectral differences in unchanged regions. In addition, in order to integrate the spatial-spectral features efficiently, we design a CGI module to further take advantage of their complementary information. In this module, high-level content information is introduced as a guide for a proper interaction. Due to the efficient spatial-spectral fusion, the proposed CSI-Net can learn the changed features better while achieving a suppression of spectral differences. Experimental results on LEVIR-CD, WHU-CD, and CLCD datasets demonstrate that the proposed CSI-Net produces better performance compared to state-of-the-art methods, and is applicable to different scenarios.

2606.10327 2026-06-10 cs.CL cs.LG New submission

The Order Matters: Sequential Fine-Tuning of LLaMA for Coherent Automated Essay Scoring

Ali Keramati, Mark Warschauer

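Affiliations: University of California, Irvine

AI summary: Proposes task-aware sequential fine-tuning of LLaMA-3.1-8B, trained in the order of essay discourse structure. On the PERSUADE 2.0 corpus it reaches 65% F1 on evidence and 87% F1 on conclusion, surpassing independent training and, on conclusion, a 70B baseline, which shows that curriculum design can improve automated essay scoring.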

英文摘要

Automated Essay Scoring (AES) systems must judge interdependent discourse elements (e.g., lead, claim, evidence, conclusion), yet most approaches treat these in isolation, harming coherence and generalization. We investigate task-aware fine-tuning of LLaMA-3.1-8B for AES using parameter-efficient LoRA with 4-bit quantization and compare three training curricula: (i) Sequential (progressively fine-tuning on lead, then position, then claim, then evidence, then conclusion), (ii) Independent (task-specific models), and (iii) Randomized (shuffled multi-task). Experiments on the PERSUADE 2.0 corpus show that modeling task dependencies matters: Sequential fine-tuning yields the strongest overall results, including F1 scores of 65% (evidence) and 87% (conclusion) and corresponding accuracies of 63% and 85%, surpassing Independent training and outperforming a general-purpose LLaMA-70B baseline on conclusion despite its far larger capacity. Randomized training improves position scoring (57% F1) but is less consistent elsewhere. These findings indicate that (1) curriculum design aligned with discourse structure can materially improve AES, and (2) small, task-optimized models can be competitive with substantially larger Large Language Models (LLMs), offering a practical path to scalable, cost-effective assessment. We release templates and implementation details to facilitate reproduction and future work on curriculum design for educational NLP.


2606.10316 2026-06-10 cs.CL New submission

TabClaw: An Interactive and Self-Evolving Agent for Spreadsheet Manipulation and Table Reasoning

Mingyue Cheng, Shuo Yu, Daoyu Wang, Qingchuan Li, Xiaoyu Tao, Qingyang Mao, Yitong Zhou, Qi Liu

Affiliations: State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China

AI summary: Presents TabClaw, an open-source interactive AI agent that improves the transparency and personalization of spreadsheet manipulation and table reasoning through editable execution plans, a streamed ReAct loop, parallel multi-table reasoning, and user-memory extraction.

Comments: 5 pages, 2 figures

Abstract

Spreadsheets and tables are widely used representations for structured data analysis, but effective analysis still requires substantial manual effort and domain expertise. Recent large language model (LLM) agents can automate parts of this process, but they often provide limited transparency into intermediate decisions, rely on implicit assumptions, struggle with multi-table comparison, and repeat similar workflows without adapting to a user's preferences. This paper presents TabClaw, an open-source interactive AI agent for spreadsheet manipulation and table reasoning. Users upload CSV or Excel files and issue natural-language requests; TabClaw clarifies ambiguous intent, exposes an editable execution plan, streams a ReAct-style tool-using analysis loop, dispatches specialist agents for parallel multi-table reasoning, and synthesizes findings with explicit consensus and uncertainty markers. Beyond one-off analysis, TabClaw records completed workflows, extracts persistent user memory, distills reusable skills from repeated tool-use patterns, supports package-style skill import, and upgrades skills from negative feedback. Experiments on spreadsheet manipulation and table reasoning benchmarks show that TabClaw improves executable task completion and reasoning performance while preserving an inspectable user workflow. This paper shows how TabClaw turns spreadsheets and tables into inspectable analytical workflows while gradually personalizing itself to recurring data-analysis tasks. Our code is available.


2606.10315 2026-06-10 cs.CL cs.AI New submission

Catching One in Five: LLM-as-Judge Blind Spots in Production Multi-Turn Transaction Agents

Sawyer Zhang, Alexander Wang, Sophie Lei

Affiliations: Lumivate (Lumi)

AI summary: Studies how well an LLM judge recalls real defects in a deployed food-and-beverage ordering agent. In one batch the judge catches only 2 of 9 systematic patterns (22%), mainly because its rubric lacks behavioural dimensions such as state tracking and because routing causes defects to be misclassified.

Comments: 13 pages, 1 figure, 5 tables

Abstract

LLM-as-judge is the default instrument for evaluating conversational agents, yet its reliability is almost always reported as agreement with human ratings, not recall of real defects. We study a deployed multi-turn food-and-beverage ordering agent and measure how many genuine quality problems its built-in LLM judge catches, using exhaustive human transcript review as ground truth. Across three batches the judge surfaces well under a quarter of human-confirmed systematic problems -- 2 of 9 patterns (22%) in one batch, and its operational gate flagged zero of 100 rounds in a batch where humans confirmed 23 distinct defects and 7 new cross-cutting patterns. Our blind-spot taxonomy shows the failure is structured, not random: the judge catches turn-local issues (a fabricated statistic, a wrong language) but misses cross-turn state issues (confirm-gate lockout, cart hallucination, escalation lockout, stale referents). The mechanism: the scoring rubric exposes only three coarse axes (intent, brand-voice, personalization) and has no category for the behavioural dimensions -- state-tracking, guardrails, recovery -- where most defects cluster. The failure is routing, not perception: 113 of 114 rounds whose raw judge note describes a confirm-gate or cart-state defect are scored "brand voice", and none reach an operational failure -- the gate is wired to hangs and hard assertions, not the rubric -- so the 0% is a routing-and-wiring failure, not blindness. The consequence for prevalence estimation is sharp: when the apparent defect rate is zero the Rogan-Gladen correction degenerates -- no signal can recover the true rate -- while where the gate reports a nonzero rate the same estimator implies a 3-6x undercount under our measured sensitivity. For production multi-turn agents, automated judging is a regression floor, not a substitute for human review.
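The prevalence argument in this abstract can be made concrete with the standard Rogan-Gladen estimator, which corrects an apparent rate for imperfect sensitivity and specificity. The sketch below is illustrative only: the function name, the assumption of perfect specificity, and the sensitivity values are hypothetical choices for the example, not figures taken from the paper.

```python
def rogan_gladen(apparent_rate: float, sensitivity: float, specificity: float = 1.0) -> float:
    """Rogan-Gladen corrected prevalence, clipped to the interval [0, 1]."""
    denom = sensitivity + specificity - 1.0
    if denom <= 0.0:
        raise ValueError("sensitivity + specificity must exceed 1")
    estimate = (apparent_rate + specificity - 1.0) / denom
    return min(1.0, max(0.0, estimate))


if __name__ == "__main__":
    # Hypothetical judge sensitivities, chosen only for illustration.
    for sens in (0.17, 0.22, 0.33):
        zero_case = rogan_gladen(0.0, sens)      # apparent defect rate of zero
        nonzero_case = rogan_gladen(0.05, sens)  # apparent defect rate of 5%
        print(sens, zero_case, nonzero_case)
```

With an apparent rate of zero and perfect specificity, the estimate is zero for every sensitivity, which is the degeneracy the abstract describes. With a nonzero apparent rate, the correction multiplies it by 1/sensitivity.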


2606.10314 2026-06-10 cs.AI New submission

Mobility Anomaly Generation using LLM-Driven Behavior with Kinematic Constraints

Yueyang Liu, Joon-Seok Kim, Andreas Züfle

Affiliations: Emory University, Atlanta, USA

AI summary: Proposes an end-to-end generative framework that combines LLM-injected semantic anomalies with map-constrained routing reconstruction to synthesize realistic trajectory anomaly datasets with annotated ground truth.

Abstract

Although the study of human trajectory anomalies is critical for advancing spatial data mining, empirical research remains severely hindered by a pervasive lack of ground-truth datasets. Despite the availability of several real-world and simulated human trajectory collections, these datasets exclusively capture normal mobility patterns and lack annotated anomalies. This specific scarcity is fundamentally driven by the inherent statistical rarity of anomalous events, precluding the feasibility of conventional observational methods. Compounding this challenge, the systematic acquisition of large-scale mobility data is strictly bottlenecked by prohibitive costs and stringent privacy regulations. To overcome these fundamental limitations and establish a reliable human trajectory anomalies dataset with annotated ground truth, we introduce a novel, end-to-end generative framework designed to synthesize realistic trajectory anomalies at scale. Our architecture bridges the gap between purely synthetic mobility data and complex real-world physical constraints by operating directly on baseline simulated trajectories. We employ Large Language Model (LLM) agents to systematically inject semantically meaningful behavioral anomalies such as irregular out-of-distribution check-ins and skipped routine visits. To ensure rigorous spatial validity, the system leverages map-constrained routing reconstruction to recalculate the physical transitions between these LLM agent-modified staypoints. Moreover, to narrow the simulation-to-reality gap, we augment the resulting trajectories with a context-aware spatial noise model, parameterized by environmental and location-specific variables, to accurately emulate heterogeneous GPS sensor degradation.


2606.10309 2026-06-10 cs.CV New submission

Dissect and Prune: Enhancing Robustness in AI-Generated Image Detection

Dahye Kim, Jaehyun Choi, Hyun Seok Seong, Seongho Kim, Donghun Lee, Sungwon Yi, Jang-Ho Choi

Affiliations: Korea AI Safety Institute (AISI), ETRI, Seongnam, South Korea; Department of Artificial Intelligence, Sungkyunkwan University, Suwon, South Korea

AI summary: Targets the prediction bias of AI-generated image detectors toward the real class and proposes DEAR, which uses inpainted images to identify and prune distracting features, improving robustness to unseen generators and post-processing.

Comments: 25 pages, 9 figures, 9 tables, Accepted to ICML 2026; includes appendix

Abstract

While existing AI-generated image detectors report high performance, we identify that this is largely driven by a critical prediction asymmetry: a bias toward the real class that severely limits sensitivity to generated content, especially under standard post-processing operations such as compression and resizing. We hypothesize that this stems from the model's reliance on spurious features, distracting signals that obscure true generative artifacts. To address this, we propose DEAR (Dissect and Prune), which leverages inpainted images to identify and prune these interfering components. Specifically, we find that features strongly aligned to either inpainted or non-inpainted regions are less robust to post-processing. By measuring the alignment between channel activations and inpaint masks, DEAR removes features at both extremes, retaining only those that capture genuine generative artifacts. Experimental results demonstrate that our approach significantly enhances robustness against unseen generators and post-processing, effectively mitigating the prediction asymmetry. Our code is available at https://github.com/dahyedahye/dear.


2606.10307 2026-06-10 cs.CL New submission

Early-Token Confidence Predicts Reasoning Quality in Multi-Agent LLM Debate

Ali Keramati, Justin Cheok, Jacob Horne, Mark Warschauer

Affiliations: University of California, Irvine

AI summary: Studies token-level log-probabilities from decoding as a confidence signal for predicting reasoning quality in multi-agent LLM debate, and finds that early-token confidence is the strongest predictor.

Comments: 15 pages, 8 figures, 4 tables; ACL Proceedings

Abstract

Evaluating reasoning quality in multi-agent LLM systems is challenging, especially for open-ended tasks without reference answers. We investigate whether intrinsic confidence signals, token-level log-probabilities from decoding, can predict reasoning quality as assessed by LLM-as-judge evaluation. Using a debate-based essay scoring framework, we compare confidence proxies against rubric-based judge scores across two ASAP essay sets. We find that early-token confidence, particularly within the first few generated tokens, is consistently the strongest predictor of reasoning quality, outperforming full-sequence statistics. Analysis of log-probability trajectories shows that the opening phase of generation is the most heterogeneous and therefore most informative. We also observe a systematic asymmetry between agent roles, with stronger alignment between confidence and quality for supportive reasoning than for adversarial critique. These results suggest that early decoding dynamics provide a lightweight and effective signal for estimating reasoning reliability in multi-agent LLM systems.
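The contrast between early-token and full-sequence confidence can be sketched as two averages over the same list of token log-probabilities. This is a minimal illustration, not the paper's implementation: the function names, the choice of the mean as the statistic, and the default window `k=5` are assumptions made for the example.

```python
from statistics import mean
from typing import Sequence


def early_token_confidence(token_logprobs: Sequence[float], k: int = 5) -> float:
    """Mean log-probability over the first k generated tokens (k is a hypothetical window)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    if k < 1:
        raise ValueError("k must be at least 1")
    return mean(token_logprobs[:k])


def full_sequence_confidence(token_logprobs: Sequence[float]) -> float:
    """Mean log-probability over the whole generation, used here as the comparison proxy."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return mean(token_logprobs)


if __name__ == "__main__":
    # Made-up log-probabilities for one agent turn, for illustration only.
    logprobs = [-0.1, -0.3, -2.0, -2.0]
    print(early_token_confidence(logprobs, k=2), full_sequence_confidence(logprobs))
```

Either proxy could then be correlated with rubric-based judge scores across turns; the abstract reports that the early-window variant is the stronger predictor.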


2606.10305 2026-06-10 cs.RO New submission

SARM2: Multi-Task Stage Aware Reward Modeling for Self Improving Robotic Manipulation

Qianzhong Chen, Hau Zheng, Justin Yu, Suning Huang, Jiankai Sun, Ken Goldberg, Chuan Wen, Pieter Abbeel, Yide Shentu, Philipp Wu, Mac Schwager

Affiliations: Stanford University; UC Berkeley; Shanghai Jiao Tong University

AI summary: Proposes RM, a multi-task stage-aware reward model that combines an action-primitive-based stage estimator with a multi-gate Mixture-of-Experts value head to give dense per-step rewards for robotic manipulation. Building on RM, the SPIRAL framework improves VLA policies from cheap autonomous rollouts and substantially raises success rates on a 10-task benchmark.

Abstract

Fine-tuning vision-language-action (VLA) policies for long-horizon manipulation still relies heavily on behavior cloning, which requires costly high-quality demonstrations and keeps policies near the demonstration distribution. Reward models can reduce this dependence by reweighting demonstrations and providing dense supervision for on-robot reinforcement learning (RL), but they must be dense, accurate, and general. Existing methods fall short: task-specific stage-aware models are accurate but require per-task annotations, while general vision-language-model (VLM) reward models are broadly applicable but too coarse for fine-grained long-horizon progress. We introduce RM, a multi-task stage-aware reward model that combines an action-primitive-based stage estimator with a multi-gate Mixture-of-Experts (MMoE) value head to produce dense per-step rewards across manipulation tasks. Building on RM, we further propose SPIRAL (Self-Policy Improvement via Reward-Aligned Learning), an on-policy reward-guided framework that improves VLA policies from cheap autonomous rollouts. On a 10-task benchmark, RM reduces value-estimation MSE by 80% over the strongest baselines; when used in SPIRAL, it improves task success from around 50% to near-perfect performance on Folding Shorts (58% to 100%) and Cleaning Whiteboard (50% to 90%), showing that high-quality dense rewards are key to a stable robot data flywheel. Project website: https://qianzhong-chen.github.io/sarm2.github.io/.


2606.10304 2026-06-10 cs.CL New submission

MIRAGE: A Polarity-Flipping Encoding Subspace in LLM Agents

Pratibha Revankar, Kargi Chauhan, Jihye Kim, Sadiba Nusrat Nur, Vincent Siu, Chenguang Wang

Affiliations: University of California, Santa Cruz

AI summary: Finds that when LLM agents covertly encode sensitive data, a shared low-dimensional encoding subspace exists in the residual stream that a logistic-regression probe detects with high accuracy. Builds MIRAGE, a real-time monitor that reaches AUC 0.918 on 126 scenarios, far above output-only detection.

Abstract

When LLM agents are coerced into covertly encoding sensitive data (Base64, ROT13, acrostic, synonym chains, and beyond), the resulting outputs evade output-side detection but the underlying computation does not. Across nine encoding families and eight models from five architecture families, that computation is supported by a shared low-dimensional encoding subspace in the residual stream. A logistic-regression probe trained on eight encoding families recovers the held-out ninth at AUC 0.975-1.000, reading the computation rather than surface features. The same direction exhibits a second mechanistic signature at the planning token, flipping polarity to activate positively when the model will simulate the encoding inline and negatively when it will outsource it to a tool call, distinguishing two execution strategies before the encoded text exists. We build MIRAGE (Model-Internal Readout of Agentic Generation Exfiltration), a two-channel real-time monitor exploiting both signals. On 126 agentic exfiltration scenarios, it reaches AUC = 0.918, substantially outperforming output-only detection (AUC = 0.518). Monitor performance is fundamentally a property of the host model's geometry: benign-encoding false-positive rate ranges from 0% on Qwen-7B to 100% on Phi-3.5, revealing that the probe faithfully reads whether a model's geometry separates covert from overt encoding. Across all tested adversarial budgets, every attack suppressing the subspace also destroyed encoding fidelity, reported as an empirical regularity on the evaluated budgets, not a structural impossibility claim.


2606.10302 2026-06-10 cs.CL New submission

Where You Inject Diversity Matters: A Unified Framework for Diverse Generation

Cheng Zhang, Rui Xin, Chudi Zhong

Affiliations: UNC Chapel Hill; University of Washington

AI summary: Proposes a unified framework that characterizes test-time diverse generation methods by their diversity source and a transmission score, and derives fully automated specification-level methods that raise output diversity on five open-ended tasks while maintaining quality.

Abstract

Open-ended generation tasks often require a set of meaningfully different outputs, yet large language models often produce similar generations. Existing test-time diversity methods operate at different stages of generation with varying effectiveness, but it remains unclear what design choices lead to meaningful diversity in the output. We introduce a framework that characterizes test-time diverse generation methods by the diversity source introduced during generation and provide a transmission score for measuring how effectively variation in the source reaches the final output. Guided by this framework, we propose fully automated specification-level generation methods that first generate diverse intermediate specifications and then condition on them to produce final responses. Across five open-ended tasks and four backbone models, specification-level injection improves output diversity over test-time baselines while maintaining comparable quality. Our analysis shows that successful diversity injection depends on both the diversity of the sources and their transmission to the output, highlighting source design and source-to-output realization as two key levers for building more diverse generation systems.


2606.10299 2026-06-10 cs.AI cs.CV cs.MA New submission

What Spatial Memory Must Store: Occlusion as the Test for Language-Agent Memory

Doeon Kwon, Junho Bang

Affiliations: Space Zero, Inc.

AI summary: Shows experimentally that geometry must lead memory recall when queries are spatial and that visibility must be judged separately from recall, and proposes computing the visibility predicate with a ray-versus-voxel DDA.

Comments: 23 pages, 6 figures

Abstract

Language-agent "memory palace" systems anchor each memory to a world coordinate, on the intuition that geometry adds something text cannot. We make that intuition testable and report three results. First, the memory-palace default of folding spatial proximity into a linear blend beside recency and importance does not help and can hurt: in a pre-registered recall experiment the shipped blend fails its own frozen test (mean Delta-Hit@5 -0.0375, Wilcoxon p=0.306), sitting at a position-blind baseline, while a geometry-led weighting wins decisively (+0.3208, p<10^-15): geometry must lead recall when the query regime is spatial. Second, memory recall and visibility must be separated: recall is occlusion-blind by design (you correctly remember the next room behind a wall), while visibility is a perception predicate over stored geometry that the live system never computed. A one-line ray-versus-voxel digital differential analyzer (DDA), re-pointed from the gaze ray the agent already casts, supplies it: text and the live FoV cone both score 0.000 on 849 behind-wall targets while cone-plus-DDA reaches 0.982 (exact McNemar p<10^-6); coordinate recall separately resolves near-duplicate locations a cosine null cannot (1.000 vs 0.533, n=150). Third, the visibility predicate is confirmed live under a git-committed pre-registration (SPMEM-OCC-LIVE-v1: eight scripted worlds, automated oracle scoring, 96 behind-wall targets, false-visible 1.000->0.000, pooled exact McNemar p=2.5x10^-29), a run that surfaced and fixed a real relay anchor defect. We concede that occlusion-needs-geometry is near-tautological; the contribution is the measurement and isolation, separating what spatial memory must store from how it is read. These pilots power a frozen confirmatory study (SPMEM-ZERO-REAL-PREREG-v1); the full human-authored multi-world study with blind raters remains future work.
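The visibility predicate described in this abstract, a ray walked through occupied cells with a digital differential analyzer, can be sketched in two dimensions. The code below is a simplified illustration under stated assumptions: it uses a 2D occupancy grid rather than voxels, casts the ray between cell centres, and the name `is_visible` is hypothetical. It is not the authors' implementation.

```python
import math
from typing import List, Tuple

Cell = Tuple[int, int]  # (x, y) cell index; grid[y][x] is 1 for a wall, 0 for free space


def is_visible(grid: List[List[int]], start: Cell, target: Cell) -> bool:
    """Return True if no occupied cell lies on the ray between two cell centres.

    Both cells are assumed to be inside the grid. The start and target cells
    themselves are not tested for occupancy.
    """
    target = (target[0], target[1])
    dx = float(target[0] - start[0])
    dy = float(target[1] - start[1])
    cx, cy = start
    step_x = 1 if dx > 0 else -1
    step_y = 1 if dy > 0 else -1
    # Ray parameter t runs from 0 at the start centre to 1 at the target centre.
    t_delta_x = abs(1.0 / dx) if dx != 0 else math.inf
    t_delta_y = abs(1.0 / dy) if dy != 0 else math.inf
    # From a cell centre, the first boundary on each axis is half a cell away.
    t_max_x = 0.5 * t_delta_x
    t_max_y = 0.5 * t_delta_y
    while (cx, cy) != target:
        if min(t_max_x, t_max_y) > 1.0:
            break  # numerical guard: the next boundary is past the target
        if t_max_x < t_max_y:
            cx += step_x
            t_max_x += t_delta_x
        else:
            cy += step_y
            t_max_y += t_delta_y
        if (cx, cy) != target and grid[cy][cx]:
            return False  # an occupied cell blocks the line of sight
    return True
```

This keeps recall and visibility separate in the way the abstract argues for: a memory store can still return an item behind a wall, and a predicate like this one decides afterwards whether the item is currently in view. On exact corner crossings the sketch steps along y first, which makes it slightly conservative.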


2606.10298 2026-06-10 cs.AI cs.CL New submission

From Context-Aware to Conflict-Aware: Generalizing Contrastive Decoding for Knowledge Conflict in LLMs

Runze Jiang, Taiqiang Wu, Yan Wang, Bingyu Zhu, Longtao Huang

Affiliations: Peking University; Alibaba Group; The University of Hong Kong

AI summary: Addresses knowledge conflicts between external context and parametric priors during LLM generation with a conflict-aware paradigm that dynamically weights prior and context, plus an adaptive mechanism that resolves the asymmetry across conflict regimes.

Comments: 27 pages, 9 figures

Abstract

When large language models generate from retrieved or augmented contexts, conflicts between external context and parametric priors remain a central reliability bottleneck. Existing contrastive decoding methods follow a context-aware paradigm that unilaterally amplifies context over parametric priors, overwriting correct priors when the context is erroneous. We generalize this to the conflict-aware paradigm that dynamically allocates authority between prior and context based on conflict signals, rather than presupposing context trustworthiness. We show that the affine combination of prior and context logits yields a power family with an inherent regime asymmetry: extrapolation amplifies errors unboundedly when the prior is correct, interpolation under-corrects when the context is correct, and no static regime covers both. Existing contrastive decoding methods are instances of this family, mostly extrapolative. To evaluate both conflict directions, we propose TriState-Bench, a model-aware evaluation protocol that calibrates per-model prior knowledge to measure three conflict states: correction, resistance, and agreement. To resolve the asymmetry, we propose Adaptive Regime Routing (ARR), which routes between regimes at each step, lifting resistance EM from below 6 to 16-33 without sacrificing correction or agreement. Our code is available at https://github.com/keith-Jiang/conflict-aware-decoding.
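The claim that an affine combination of prior and context logits yields a power family can be checked numerically. In the sketch below the mixing weight `lam` and all function names are notation assumed for this example, not taken from the paper: `lam` in [0, 1] interpolates between prior and context, and `lam` above 1 extrapolates past the context.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_logits(prior: List[float], context: List[float], lam: float) -> List[float]:
    """Affine combination z = z_prior + lam * (z_context - z_prior)."""
    return [zp + lam * (zc - zp) for zp, zc in zip(prior, context)]


def power_family(prior_probs: List[float], context_probs: List[float], lam: float) -> List[float]:
    """Normalized product p_prior^(1 - lam) * p_context^lam."""
    weights = [pp ** (1.0 - lam) * pc ** lam for pp, pc in zip(prior_probs, context_probs)]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Made-up logits over a three-token vocabulary, for illustration only.
    prior_logits = [2.0, 0.5, -1.0]
    context_logits = [0.0, 1.5, -0.5]
    for lam in (0.5, 1.0, 2.0):
        via_logits = softmax(combine_logits(prior_logits, context_logits, lam))
        via_probs = power_family(softmax(prior_logits), softmax(context_logits), lam)
        print(lam, via_logits, via_probs)
```

The two routes agree because the softmax of a weighted sum of logits is proportional to the product of the corresponding probabilities raised to those weights. Setting `lam = 1 + alpha` recovers the familiar contrastive form `(1 + alpha) * z_context - alpha * z_prior`, which sits in the extrapolative regime.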


2606.10296 2026-06-10 cs.CL cs.AI New submission

The Confident Liar: Diagnosing Multi-Agent Debate with Log-Probabilities and LLM-as-Judge

Ali Keramati, Justin Cheok, Jacob Horne, Mark Warschauer

Affiliations: University of California, Irvine

AI summary: Studies the relationship among token-level log-probabilities, LLM-judge scores, and task accuracy in multi-agent debate, finding that confidence aligns with reasoning quality more strongly for the Constructor and that confidence can detect critical reasoning failures.

Comments: 15 pages, 7 figures, 1 table, ACL proceedings

Abstract

Multi-agent debate systems are typically evaluated only on whether the final answer is correct, overlooking the quality of the intermediate reasoning that debate is designed to produce. This paper studies the relationship between three signals in multi-agent debate: token-level log-probability distributions over reasoning tokens, LLM-as-judge rubric scores assigned to those tokens, and final task accuracy. We examine whether internal confidence signals predict externally evaluated reasoning quality, and whether either signal aligns with task correctness, across three domains: rubric-based scoring, mathematical reasoning, and factual question answering. Our framework pairs a two-agent debate architecture -- a Constructor and an Auditor -- with an LLM-as-judge that scores each agent's reasoning along instruction following, justification quality, and evidence grounding, together with a critical-failure flag. Experiments in the rubric-scoring domain reveal a consistent four-phase confidence trajectory and a substantial role asymmetry: confidence aligns with judged reasoning quality roughly twice as strongly for the Constructor as for the Auditor, and confidence-based detection of critical reasoning failures is markedly more reliable for the Constructor (AUROC 0.804) than for the Auditor (0.634). These findings motivate the broader cross-domain investigation proposed in this paper.


2606.10288 2026-06-10 cs.RO New submission

MARCH: Model-Assisted Reinforcement Learning for the Perceptive Control of Humanoids over Sparse Footholds

Codrin Crismariu, Ryan K. Cosner

Affiliations: Department of Mechanical Engineering

AI summary: Proposes a model-assisted reinforcement learning framework that combines safe reference trajectories from simplified models, teacher-policy training guided by a control Lyapunov function reward, and distillation into a vision-based student policy, achieving robust perceptive humanoid walking over sparse footholds.

Abstract

Perceptive bipedal locomotion over sparse terrain remains a difficult challenge: model-based methods are precise but brittle to uncertainty, while model-free methods are robust but struggle to discover the precise, constrained motions required for safety-critical locomotion where small errors can cause catastrophic failures. We propose a model-assisted reinforcement learning (RL) framework that combines both perspectives in three steps: (1) generate a safe reference trajectory using simplified models; (2) train a privileged teacher policy guided by a control Lyapunov function (CLF) reward built around the safe reference trajectory; and (3) distill the teacher into a vision-based student policy. We show that this model-assistance procedure produces physically grounded locomotion, improving sample efficiency, reducing the need for a complex learning curriculum, and achieving smoother locomotion behavior alongside stepping stone performance comparable to model-free baselines. We validate our approach in simulation and demonstrate successful deployment on a Unitree G1 humanoid robot navigating sparse footholds with lateral constraints.


2606.10287 2026-06-10 cs.LG cs.CL New submission

When Metrics Disagree: A Meta-Analysis of Knowledge-Graph-Completion Model Benchmarking

Haji Gul, Ajaz Ahmad Bhat

Affiliations: School of Digital Science, Universiti Brunei Darussalam

AI summary: Addresses conflicting metrics in KGC model evaluation with a multi-criteria decision-making framework. A meta-analysis finds Z-score to be the most balanced aggregator and identifies the top-ranked model for each prediction task.

Abstract

Evaluating Knowledge Graph Completion (KGC) models remains challenging because standard assessment relies on isolated rank-based metrics such as MRR, Hits@k, and Mean Rank, which often produce conflicting model orderings across datasets. A model that leads on MRR may trail on Hits@1, and strong performance on one dataset may not generalize to another. This fragmentation hinders comparison, enables selective reporting, and obscures real progress. We reframe KGC evaluation as a Multi-Criteria Decision-Making (MCDM) problem and present a meta-analysis of seven aggregators across five tests: consistency, cross-dataset stability, metric independence, robustness under noise, and generalizability. Each test is averaged over leave-one-model-out (LOMO) and leave-one-group-out (LOGO) removals so that reliability reflects aggregator behavior across diverse model subsets. Across tail (h, r, ?) and relation (h, ?, t) prediction, Pareto-optimal analysis identifies Z-score as the most balanced aggregator, which ranks DualE highest for tail prediction and FMS (Flow-Modulated Scoring) highest for relation prediction. A test-sensitivity analysis using the same removals shows that consistency and stability are largely removal-invariant, while generalizability and independence are the most sensitive. The framework resolves evaluation inconsistencies and offers evidence-based guidance for aggregator selection and model benchmarking in KGC.
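A Z-score aggregator of the kind named in this abstract can be sketched as follows: each metric is standardized across models, metrics where lower is better (such as Mean Rank) are sign-flipped, and the standardized values are averaged per model. The exact definition used in the paper is not given here, so the function, the use of the population standard deviation, and the model names and numbers in the demo are all assumptions for illustration.

```python
from statistics import mean, pstdev
from typing import Dict, Iterable


def zscore_aggregate(
    results: Dict[str, Dict[str, float]],
    lower_is_better: Iterable[str] = (),
) -> Dict[str, float]:
    """Average per-metric z-scores for each model; higher aggregate is better."""
    flip = set(lower_is_better)
    metrics = sorted(next(iter(results.values())))
    totals = {model: 0.0 for model in results}
    for metric in metrics:
        values = [results[model][metric] for model in results]
        mu, sigma = mean(values), pstdev(values)
        sign = -1.0 if metric in flip else 1.0
        for model in results:
            # A metric with no spread across models carries no ranking signal.
            z = 0.0 if sigma == 0 else (results[model][metric] - mu) / sigma
            totals[model] += sign * z
    return {model: total / len(metrics) for model, total in totals.items()}


if __name__ == "__main__":
    # Invented scores for three hypothetical models, for illustration only.
    demo = {
        "model_a": {"MRR": 0.50, "Hits@1": 0.30, "MR": 150.0},
        "model_b": {"MRR": 0.45, "Hits@1": 0.40, "MR": 100.0},
        "model_c": {"MRR": 0.30, "Hits@1": 0.20, "MR": 400.0},
    }
    scores = zscore_aggregate(demo, lower_is_better=["MR"])
    print(sorted(scores, key=scores.get, reverse=True))
```

In the demo data `model_a` leads on MRR while `model_b` leads on Hits@1 and Mean Rank, the kind of disagreement the abstract describes; the aggregate gives a single ordering that accounts for all three metrics.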

2606.10286 2026-06-10 cs.AI New submission

Sim2Schedule: A Simulator-Guided LLM Framework for Autonomous Open-Pit Mine Scheduling

Mustavi Ibne Masum, Thiago Eustaquio Alves de Oliveira, Mahzabeen Emu

Affiliations: Department of Computer Science, Lakehead University; Quantum Communications and Computing Research Center and Department of Electrical and Computer Engineering, Memorial University of Newfoundland; Department of Electrical and Computer Engineering, Memorial University of Newfoundland

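AI summary: Proposes a simulator-guided LLM framework that encodes geotechnical constraints into action generation and produces interpretable schedules zero-shot, recovering 94% to 99% of the MILP-optimal NPV while computation time scales linearly.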

Abstract

Open-pit mine scheduling is a critical process for maximizing economic return under complex geotechnical and operational constraints. While Mixed-Integer Linear Programming (MILP) provides mathematically optimal baselines, its exponential computational complexity and inability to adapt in real time limit its practical deployment in dynamic industrial environments. This work introduces a simulator-driven Large Language Model (LLM) scheduling framework in which the LLM acts as an autonomous decision-making agent, guided at each step by a custom simulator that encodes geotechnical precedence, extraction-processing coupling, and dynamic capacity constraints directly into the action generation mechanism. Operating entirely zero-shot within a closed, data-secure environment, the framework produces complete, interpretable extraction and processing schedules without cloud-based inference, domain-specific fine-tuning, or retraining. To provide a trustworthy performance benchmark, a novel MILP formulation is developed that incorporates realistic operational and geotechnical constraints. Evaluated across mining instances of varying scale and time periods, the LLM-based framework recovers between 94% and 99% of the MILP optimal NPV while scaling linearly in computation time. These results position simulator-constrained LLM agents as a practical and scalable alternative to classical optimization for long-horizon industrial scheduling under complex operational constraints.
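The abstract does not describe the simulator's interface, so the following is only a toy sketch of the general idea: a simulator exposes just the blocks that satisfy precedence and remaining capacity, and an agent picks among them. All names (`feasible_blocks`, `schedule`, `choose`) and the three-block example are hypothetical, and a greedy rule stands in for the LLM agent.

```python
def feasible_blocks(mined, precedence, tonnage, capacity_left):
    """Unmined blocks whose predecessors are mined and that fit the remaining capacity."""
    return [
        b for b in precedence
        if b not in mined
        and precedence[b] <= mined          # all overlying blocks already removed
        and tonnage[b] <= capacity_left
    ]

def schedule(precedence, tonnage, value, capacity, periods, rate, choose):
    """Roll out a schedule period by period; `choose` stands in for the LLM agent."""
    mined, plan, npv = set(), [], 0.0
    for t in range(1, periods + 1):
        left, picked = capacity, []
        while True:
            options = feasible_blocks(mined, precedence, tonnage, left)
            if not options:
                break
            b = choose(options, value)      # the agent only ever sees feasible actions
            mined.add(b)
            picked.append(b)
            left -= tonnage[b]
            npv += value[b] / (1.0 + rate) ** t   # discount cash flow to present value
        plan.append(picked)
    return plan, npv

# Hypothetical three-block column: waste on top, ore underneath.
precedence = {"top": set(), "mid": {"top"}, "deep": {"mid"}}
tonnage = {"top": 1, "mid": 1, "deep": 1}
value = {"top": -1.0, "mid": 2.0, "deep": 10.0}

def greedy(options, value):
    """Placeholder policy (not an LLM): take the most valuable feasible block."""
    return max(options, key=value.get)

plan, npv = schedule(precedence, tonnage, value, capacity=2, periods=2, rate=0.1, choose=greedy)
```

Because infeasible blocks never reach the agent, every schedule produced this way respects precedence and capacity by construction, whatever policy fills the `choose` slot.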

2606.10285 2026-06-10 cs.CL New submission

OpenRTLSet: A Fully Open-Source Dataset for Large Language Model-based Verilog Module Design

Jinghua Wang, Lily Jiaxin Wan, Sanjana Pingali, Scott Smith, Manvi Jha, Shalini Sivakumar, Xing Zhao, Kaiwen Cao, Deming Chen

Affiliations: UIUC-ChenLab

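AI summary: Introduces OpenRTLSet, described as the largest fully open-source hardware design dataset, with more than 130,000 diverse Verilog code samples drawn from GitHub code plus VHDL and C/C++ translations. DeepSeek-R1 generates the natural-language descriptions, and the dataset supports fine-tuning several language model families. The authors report that open-source approaches can achieve superior performance in hardware design.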

Comments Accepted by ICLAD'25

DOI: 10.1109/ICLAD65226.2025.00038
Journal ref: 2025 IEEE International Conference on LLM-Aided Design (ICLAD), Stanford, CA, USA, 2025, pp. 212-218

Abstract

OpenRTLSet introduces the largest fully open-source dataset for hardware design, offering over 131,000 diverse Verilog code samples to the research community and industry. Our dataset uniquely combines Verilog code from GitHub repositories (102k modules), VHDL translations (5k modules), and synthesizable C/C++ translations (24k modules), all freely accessible without proprietary restrictions. Using the reasoning model DeepSeek-R1, we generated paired natural language descriptions for each code sample, enabling fine-tuning of various language model families (e.g., Qwen and Granite) for Verilog code generation. Our dataset explores multiple options, including Verilator-generated C++ files as additional context during labeling, quantization techniques (INT4 vs. BF16), and performance differences across model sizes (7B-32B parameters). OpenRTLSet demonstrates that open-source approaches can achieve superior performance in hardware design tasks, establishing a new foundation for accessible research and commercial use in this domain.
