arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.05437 2026-06-05 cs.RO cs.CV

Uncertainty-Aware Adaptive Sensor Fusion for Autonomous Navigation

Simegnew Yihunie Alaba, Yuichi Motai

Affiliations: IEEE

AI summary: Proposes a hybrid deep learning method combined with an Unscented Kalman Filter (UKF) that adaptively fuses visual and inertial features in an uncertainty-aware way to improve pose estimation accuracy in Visual-Inertial Odometry (VIO) for autonomous navigation.

Comments: 13 pages

Abstract

This work introduces a hybrid deep learning approach integrated with an Unscented Kalman Filter (UKF) to enhance pose estimation accuracy in Visual-Inertial Odometry (VIO) for autonomous navigation. The proposed model employs a Vision Transformer (ViT) network to effectively capture temporal dependencies from inertial measurement unit (IMU) data and utilizes a Multiscale Convolutional Neural Network (MCNN) to learn optical flow-based motion cues from visual data. An adaptive sensor fusion module dynamically weights IMU and visual features by leveraging estimated uncertainty, thus improving robustness in diverse and challenging environmental conditions. Additionally, a novel uncertainty-aware loss function is proposed to explicitly incorporate prediction uncertainty into the learning process, enabling robust and accurate navigation under noisy, incomplete, or unreliable sensor inputs. Comprehensive evaluations on the KITTI dataset demonstrate that the proposed method significantly outperforms baseline approaches, achieving superior performance in terms of Absolute Trajectory Error (ATE) and Relative Pose Error (RPE). The lightweight and computationally efficient model processes data at 155 FPS on an NVIDIA A100 GPU, making it highly suitable for deployment in resource-constrained autonomous systems.
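The abstract does not state the fusion rule or the loss. As a rough sketch only, the snippet below assumes inverse-variance weighting for the fusion step and a Gaussian negative log-likelihood for the uncertainty-aware loss. Both are common choices and are not confirmed as the paper's formulation; the function names `fuse` and `gaussian_nll` are hypothetical.

```python
import math


def fuse(visual_feat, imu_feat, visual_var, imu_var):
    """Inverse-variance weighting (an assumption, not the paper's module):
    the modality with the lower estimated variance gets the larger weight."""
    w_vis = 1.0 / visual_var
    w_imu = 1.0 / imu_var
    total = w_vis + w_imu
    return [(w_vis * v + w_imu * i) / total for v, i in zip(visual_feat, imu_feat)]


def gaussian_nll(pred, target, log_var):
    """Gaussian negative log-likelihood up to an additive constant.
    The squared error is down-weighted when the predicted variance is high,
    and the log_var term penalizes claiming high uncertainty everywhere."""
    return 0.5 * (math.exp(-log_var) * (pred - target) ** 2 + log_var)


if __name__ == "__main__":
    # The visual branch is three times more certain than the IMU branch here,
    # so the fused feature sits closer to the visual feature.
    print(fuse([1.0], [3.0], visual_var=1.0, imu_var=3.0))
    print(gaussian_nll(pred=2.0, target=0.0, log_var=0.0))
```

With equal variances the fusion reduces to a plain average, which is one way to sanity-check such a module.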

2606.05436 2026-06-05 cs.AI cs.CL cs.IR

Ten Headache Specialists versus Artificial Intelligence for Clinical Literature Summarization: A Critical Evaluation and Comparison

Alejandro Lozano, Keiko Ihara, Ping-Hao Yang, Carrie E. Robertson, Jennifer Stern, Allan Purdy, Hsiangkuo Yuan, Pengfei Zhang, Yulia Orlova, Olga Fermo, Jennifer Hranilovich, Fred Cohen, Todd J. Schwedt, Jenelle A. Jindal, Serena Yeung-Levy, Chia-Chun Chiang

Affiliations: Stanford University, Palo Alto, CA, USA; Department of Neurology, Mayo Clinic, Rochester, MN, USA; Department of Neurology, Dalhousie University, Halifax, Canada; Jefferson Headache Center, Department of Neurology, Thomas Jefferson University, PA, USA; Beth Israel Deaconess Medical Center, Boston, MA, USA; Department of Neurology, University of Florida, Gainesville, FL, USA; University of Colorado School of Medicine, Department of Pediatrics, Division of Child Neurology, Aurora, CO, USA; Department of Medicine, Mount Sinai Hospital, Icahn School of Medicine at Mount Sinai, New York, NY, USA; Department of Neurology, Mayo Clinic, Scottsdale, AZ, USA; Harvard Medical School, Boston, MA, USA; Department of Neurology, Mount Sinai Hospital, Icahn School of Medicine at Mount Sinai, New York, NY, USA

AI summary: Builds a RAG-based AI framework and compares three large language models against ten headache specialists on clinical literature summarization. Expert-written summaries were preferred, although experts sometimes struggled to tell human-written from AI-generated summaries.

Abstract

Summarizing the latest medical literature to guide clinical decision-making is essential for evidence-based medicine and high-quality patient care. Yet clinicians face increasing challenges due to limited time with patients and a rapidly growing volume of published articles. Although retrieval-augmented large language models (LLMs) have shown promise in clinical summarization, human evaluations of their effectiveness in synthesizing broader scientific literature and direct comparisons to expert-written syntheses remain scarce. We constructed a RAG-based agentic AI framework using three state-of-the-art LLMs: Sonnet, GPT-4o, and Llama 3.1. A headache specialist created 13 questions, three for prompt optimization and ten for evaluation. Ten headache specialists across the United States and Canada each wrote a summary for one question, yielding four summaries per question (expert, Sonnet, GPT-4o, and Llama). The experts, blinded to authorship, critically evaluated the summaries, excluding the topic for which they wrote a summary, based on correctness, completeness, conciseness, and clinical utility, scoring each from 1 to 10 using standardized rubrics. They also ranked the summaries by preference and indicated whether they believed each summary was written by an expert or an LLM. Our study, comparing LLM- and expert-written literature summaries evaluated by headache specialists, showed that expert-written summaries were preferred, although experts sometimes found it challenging to distinguish between human- and AI-generated summaries. We also identified key expert-valued features beyond standard evaluation metrics that can guide future refinement of both human and AI literature summarization pipelines.

2606.05435 2026-06-05 cs.LG cs.CR

DP-MacAdam: Differentially Private Mechanism with Adaptive Clipping and Adaptive Momentum

Naima Tasnim, Lalitha Sankar, Oliver Kosut

Affiliations: University of Southern California

AI summary: Proposes the DP-MacAdam algorithm, which uses the same gradient mean and variance estimates for both adaptive clipping and momentum updates, improving model utility without manual tuning of the clipping threshold.

Comments: 6 pages, 2 tables

Abstract

Differentially private stochastic gradient descent (DP-SGD) has become the standard framework for privacy-preserving machine learning, yet its reliance on a fixed gradient clipping threshold to limit sensitivity remains a significant practical limitation. Adaptive clipping algorithms such as AdaClip shift and scale the gradient prior to clipping and adding noise so that the clipped gradient yields a more informative descent direction. The shift and scaling parameters are selected adaptively based on the empirical mean and variance. However, in existing adaptive clipping algorithms, these empirical estimates have not also been used for momentum to accelerate training itself. On the other hand, DP-Adam is an algorithm that exploits Adam-like momentum updates based on the gradient mean and variance to accelerate training, but does not exploit these estimates for adaptive clipping. In this work, we propose Differentially Private Mechanism with Adaptive Clipping and Adaptive Momentum (DP-MacAdam), a novel algorithm that combines these two approaches so as to use the same mean and variance estimates for both clipping and momentum. We perform an analysis showing that DP-MacAdam estimates the gradient variances in a bias-free manner. In addition, we empirically evaluate the privacy and accuracy of DP-MacAdam, demonstrating that it achieves improved model utility compared to DP-SGD, AdaClip, and DP-Adam baselines, without requiring manual tuning of the clipping threshold.
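To make the two ingredients concrete, here is a minimal sketch of fixed-threshold DP-SGD aggregation next to a shift-and-scale transform in the spirit of adaptive clipping. It is an illustration under simplifying assumptions, not the DP-MacAdam algorithm: the helper names are hypothetical, the mean and standard deviation are passed in rather than estimated privately, and the shift-and-scale helper omits the noise step.

```python
import math
import random


def clip_l2(vec, threshold):
    """Scale vec down so that its L2 norm is at most threshold."""
    norm = math.sqrt(sum(x * x for x in vec))
    scale = min(1.0, threshold / norm) if norm > 0 else 1.0
    return [x * scale for x in vec]


def dp_sgd_gradient(per_example_grads, threshold, noise_multiplier, rng):
    """Fixed-threshold DP-SGD step: clip each per-example gradient,
    sum, add Gaussian noise scaled to the threshold, then average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        for j, x in enumerate(clip_l2(grad, threshold)):
            total[j] += x
    sigma = noise_multiplier * threshold
    n = len(per_example_grads)
    return [(t + rng.gauss(0.0, sigma)) / n for t in total]


def shift_and_scale_clip(grad, mean, std, threshold=1.0):
    """Shift-and-scale idea: whiten with per-coordinate mean and std, clip
    the whitened vector, then map back. Noise addition is omitted here."""
    whitened = [(g - m) / s for g, m, s in zip(grad, mean, std)]
    clipped = clip_l2(whitened, threshold)
    return [c * s + m for c, m, s in zip(clipped, mean, std)]


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.0, 0.5]]
    print(dp_sgd_gradient(grads, threshold=1.0, noise_multiplier=1.0, rng=rng))
    print(shift_and_scale_clip([3.0, 4.0], mean=[2.5, 3.5], std=[1.0, 1.0]))
```

A gradient close to the running mean passes through the shift-and-scale clip almost unchanged, whereas fixed-threshold clipping would shrink it whenever its raw norm exceeds the threshold.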

2606.05434 2026-06-05 cs.LG cs.AI

Selective-Advantage Entropy-Adaptive Horizon GRPO: Asymmetric Token-Level Discounting for Efficient Reinforcement Learning of Language Models

Chirag Chawla, Rohan Charudatt Salvi, Madhav S. Baidya

Affiliations: Indian Institute of Technology (BHU); Department of Computer Science, University of Illinois Chicago

AI summary: Proposes Selective-Advantage Entropy-Adaptive Horizon GRPO (SA-AH-GRPO), which uses asymmetric token-level discounting (an entropy-based discount applied only to negative-advantage rollouts) to stabilize training and improve mathematical reasoning performance.

Comments: 16 pages, 4 figures, 7 tables

Abstract

Group Relative Policy Optimisation (GRPO) has emerged as an effective reinforcement-learning algorithm for aligning language models on reasoning tasks, but it treats every token position and every sampled rollout symmetrically. We introduce two complementary extensions: (i) Adaptive-Horizon GRPO (AH-GRPO), which weights each token's policy gradient using a cumulative entropy-based discount that reduces the effective horizon when the model is uncertain, and (ii) Selective-Advantage AH-GRPO (SA-AH-GRPO), which applies this discounting only to negative-advantage rollouts, leaving positive-advantage, successful trajectories unattenuated. We evaluate standard GRPO with alpha = 0, AH-GRPO with alpha = 0.5, and SA-AH-GRPO with alpha = 0.5 on the GSM8K mathematical reasoning benchmark using both Qwen 2.5-1.5B-Instruct and Qwen 2.5-3B-Instruct fine-tuned with LoRA. On the 3B model, SA-AH-GRPO achieves Pass@1 = 0.858 at its peak at step 30 and maintains 0.846 at 180 steps, with training variance reduced to 0.0246, a 3.6 times reduction relative to GRPO while matching its peak accuracy. On the 1.5B model, SA-AH-GRPO achieves a peak Pass@1 of 0.686, improving over the zero-shot baseline of 0.637. Our analysis shows that asymmetric discounting preserves the full gradient signal on correct solutions, prevents entropy collapse, and substantially stabilises training, suggesting a principled inductive bias for reinforcement learning with verifiable rewards on structured generation tasks.
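The abstract does not give the discount formula. The sketch below assumes a per-token weight of exp(-alpha * cumulative entropy), which is only one plausible reading of "cumulative entropy-based discount"; the function names and the `selective` flag are hypothetical. Under this assumption, alpha = 0 recovers uniform token weights, matching the abstract's description of standard GRPO.

```python
import math


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one group of rollouts."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def token_weights(entropies, advantage, alpha, selective=True):
    """Per-token weights for one rollout's policy-gradient terms.
    Assumed form: w_t = exp(-alpha * sum of entropies up to token t).
    With selective=True, only negative-advantage rollouts are discounted."""
    if alpha == 0.0 or (selective and advantage >= 0.0):
        return [1.0] * len(entropies)
    weights = []
    cumulative = 0.0
    for h in entropies:
        cumulative += h
        weights.append(math.exp(-alpha * cumulative))
    return weights


if __name__ == "__main__":
    advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
    entropies = [0.2, 1.5, 0.3]
    for adv in advantages:
        print(round(adv, 3), token_weights(entropies, adv, alpha=0.5))
```

In this sketch, later tokens of a failed rollout that follow high-entropy steps contribute less gradient, while successful rollouts keep a weight of 1 on every token.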

2606.05433 2026-06-05 cs.AI cs.SY eess.SY

Zero knowledge verification for frontier AI training is possible

Pierre Peigné, Ky Nguyen, Paul Wang

Affiliations: Lefebvre General-Purpose AI Policy Lab; Sorbonne Université; CNRS; LIP6

AI summary: Proposes a zero-knowledge verification architecture that combines a pre-committed training specification, inter-node network observations, and on-the-fly Merkle commitments of intermediate computation. A zero-knowledge virtual machine (zkVM) with native BF16/FP32 precompiles checks the actual floating-point computation performed by the GPU, so training becomes verifiable while the architecture stays confidential. A proof of concept is estimated within about 36 months.

Comments: 44 pages, 2 figures

Abstract

Frontier AI governance frameworks increasingly use cumulative training compute as the primary criterion for designating high-impact models, but enforcement rests on self-reporting because no technical verification primitive for training exists. Any future international agreement on frontier AI faces the same problem at higher stakes: coordinated regulation of technologies with significant externalities has historically rested on technical verification, without which agreements are declaratory. Recent governance analyses judge zero-knowledge proofs a promising candidate but currently impractical at frontier scale [26, 4]. We argue the impracticality is paradigm-bound rather than fundamental, and propose a verification architecture for frontier dense pre-training combining a pre-committed training specification, inter-node network observations, and on-the-fly Merkle commitments of intermediate computation, verified through a zero-knowledge Virtual Machine (zkVM) with native BF16/FP32 precompiles. The proof checks the actual floating-point computation the GPU performed rather than a fixed-point approximation, and preserves model-architecture confidentiality through a private training specification. The protocol produces three proof types: a genesis proof at initialisation, in-training step proofs across the run, and ex-ante attestations enforcing policy-relevant claims as running invariants, turning the training record into a governance-enforceable artefact. We estimate a deployable proof of concept within approximately 36 months at single-digit-percent training-side overhead, against a six-to-ten-year cycle for verification-grade custom silicon. Thirteen open research and engineering problems are catalogued as a research agenda for external contribution.
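For readers unfamiliar with Merkle commitments, the following is a minimal sketch of committing to a sequence of intermediate buffers with a binary Merkle tree over SHA-256. The tree layout (domain-separation prefixes, duplicating the last node on odd levels) and the name `merkle_root` are assumptions chosen for illustration; the abstract does not specify the paper's commitment scheme.

```python
import hashlib


def merkle_root(leaves):
    """Return the 32-byte root of a binary Merkle tree over byte-string leaves.
    Leaf and inner hashes use different prefixes so a leaf cannot be
    reinterpreted as an inner node. Odd levels duplicate their last node."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = [hashlib.sha256(b"\x00" + leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        level = [
            hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


if __name__ == "__main__":
    # Hypothetical stand-ins for serialized intermediate tensors of one step.
    buffers = [b"layer0-activations", b"layer1-activations", b"layer2-activations"]
    print(merkle_root(buffers).hex())
```

Publishing only the root commits the prover to every buffer: changing any single leaf changes the root, while the buffers themselves stay private.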

2606.05429 2026-06-05 cs.AI

Minimizing the Hidden Cost of Scales: Graph-Guided Ultra-Low-Bit Quantization for Large Language Models

Rayyan Abdalla, Amir Hussein, Min Wu, Dinesh Manocha

Affiliations: University of California, Berkeley

AI summary: Proposes the SAGE-PTQ framework, which uses graph-guided, saliency-aware quantization to separate salient from unsalient weights, achieving ultra-low-bit quantization with minimal scaling overhead. On LLaMA-3-8B it reaches a WikiText2 perplexity of 6.74 while using less than 50% of BiLLM's GPU memory.

Comments: Preprint. 18 pages, 10 figures, 7 tables, including appendix

Abstract

Post-training quantization (PTQ) is critical for the efficient deployment of large language models (LLMs). Recent ultra-low-bit PTQ methods rely on rigid weight-saliency assumptions or position heuristics, introducing substantial hidden scaling overhead. We propose SAGE-PTQ (Saliency-Aware Graph-guided Efficient PTQ), a novel ultra-low-bit quantization framework for LLMs that minimizes hidden scaling cost. SAGE-PTQ separates salient and unsalient weights using distributional statistics, then models subsampled unsalient weights as a sparse graph to estimate the optimal number of groups per layer. SAGE-PTQ applies dual-mode quantization, assigning multi-bit precision to salient weights and binarizing unsalient weights. To reduce scaling overhead, SAGE-PTQ uses one per-channel scale for salient weights and one scalar per unsalient group. Finally, SAGE-PTQ implements adaptive saliency thresholding to select the optimal saliency ratio per matrix. SAGE-PTQ achieves 1.03 weight bits and only 0.004 scaling bits per matrix on average, outperforming state-of-the-art methods such as BiLLM and PB-LLM. On LLaMA-3-8B, SAGE-PTQ achieves 6.74 WikiText2 perplexity, compared to 55.8 for BiLLM, while using less than 50% of BiLLM's GPU memory. On LLaMA-2-70B, SAGE-PTQ provides 1.5x faster decoding on one NVIDIA L40 GPU, demonstrating practical inference efficiency.
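Two pieces of this description lend themselves to a small illustration: binarizing a group of weights with a single scalar scale, and counting how many scale bits are paid per weight. The sketch below uses the standard sign-times-mean-magnitude binarization and a hypothetical 4096 x 4096 matrix with 16-bit scales; neither is taken from the paper, and the function names are made up for this example.

```python
def binarize_group(weights):
    """Approximate a group as alpha * sign(w). For a fixed sign pattern,
    alpha = mean(|w|) minimizes the squared reconstruction error."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs


def scale_bits_per_weight(num_scales, bits_per_scale, num_weights):
    """Storage overhead of the scales, amortized over all weights."""
    return num_scales * bits_per_scale / num_weights


if __name__ == "__main__":
    alpha, signs = binarize_group([0.5, -1.5, 1.0, -1.0])
    print(alpha, signs)  # 1.0 [1, -1, 1, -1]

    # Hypothetical accounting: one 16-bit scale per row of a 4096 x 4096 matrix.
    print(scale_bits_per_weight(4096, 16, 4096 * 4096))  # 0.00390625
```

The second number shows why scale granularity matters at ultra-low bit widths: the illustrative figure is of the same order as the 0.004 scaling bits quoted in the abstract, although the paper's actual accounting may differ.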

2606.05422 2026-06-05 cs.RO

Learning from Demonstrations over Riemannian Manifolds using Neural ODEs: An Extended Abstract

Diana Cuervo Espinosa, Mahathi Anand, Angela P. Schoellig

Affiliations: ETH Zürich

AI summary: Because robot states such as orientation evolve over curved spaces, proposes learning from demonstrations over Riemannian manifolds using neural ordinary differential equations, numerically estimating geodesics to generate natural motion at reduced computational overhead.

Comments: 2 pages

Abstract

Learning from demonstrations (LfD) is usually performed over Euclidean spaces, while the robot state, e.g. orientation, naturally evolves over curved spaces. Therefore, to ensure natural, complex motion generation, we investigate learning from demonstrations over Riemannian manifolds that are capable of encoding both position and orientation data. Here, geodesic paths provide for natural motion between two arbitrary points within the manifold. We propose to numerically estimate geodesics via neural ordinary differential equations, mitigating the large computational overhead of existing approaches. Finally, these geodesics can be decoded back into the original task space before deploying on the robot. In this extended abstract, we discuss the architecture of our framework, provide some initial insights from our simulation experiments, including comparison to other geodesic computation mechanisms, and discuss the challenges and prospects for future work.
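As a point of reference for what a geodesic looks like on a simple curved space, the sketch below evaluates the closed-form geodesic on the unit sphere (spherical linear interpolation). This is not the neural-ODE estimator the abstract proposes; it is the kind of analytic case against which a numerical geodesic solver could be checked. The function name `slerp` follows common usage, and the choice of the sphere is an assumption made for illustration.

```python
import math


def slerp(p, q, t):
    """Point at fraction t along the great-circle geodesic from p to q,
    where p and q are unit vectors. Antipodal inputs are not handled,
    since the geodesic between them is not unique."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(p, q))))
    omega = math.acos(dot)  # angle between p and q
    if omega < 1e-9:
        return list(p)  # the points coincide; nothing to interpolate
    s = math.sin(omega)
    a = math.sin((1.0 - t) * omega) / s
    b = math.sin(t * omega) / s
    return [a * x + b * y for x, y in zip(p, q)]


if __name__ == "__main__":
    start = [1.0, 0.0, 0.0]
    goal = [0.0, 1.0, 0.0]
    for t in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(t, [round(c, 4) for c in slerp(start, goal, t)])
```

Every interpolated point stays on the sphere, whereas straight-line interpolation in Euclidean space would cut through its interior. The same formula applies to unit quaternions, a common orientation representation.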

2606.05421 2026-06-05 cs.CL

ComplexityMT: Benchmarking the Interaction Between Text Complexity and Machine Translation

Joseph Marvin Imperial, Junhong Liang, Belal Shoer, Abdullah Barayan, Rodrigo Wilkens, Omar Mussa, Dawn Knight, Eugénio Ribeiro, Ekaterina Kochmar, Sowmya Vajjala, Fernando Alva-Manchego, Harish Tayyar Madabushi

Affiliations: University of Bath; Cardiff University; National University Philippines; MBZUAI; University of Exeter; INESC-ID Lisboa; Instituto Universitário de Lisboa (ISCTE-IUL), ISTAR; National Research Council, Canada; King Abdulaziz University; Saudi Electronic University

AI summary: Introduces the ComplexityMT benchmark, which uses CEFR levels to assess how text complexity and machine translation influence each other across six languages. Texts at higher CEFR levels are harder to translate, and translation shifts the CEFR level of the target text.

Abstract

When a text is translated, does the translation retain the complexity of the original? We introduce ComplexityMT, a new challenge for assessing how text complexity and machine translation interact with and influence each other, using the Common European Framework of Reference for Languages (CEFR) levels as the measure of text complexity. Across six languages, including Arabic, Dutch, English, French, Hindi, and Russian, we evaluate three open-weight models, one closed model, and a commercial machine translation system on two tasks: i) correlation of CEFR with translation difficulty, and ii) shifts in CEFR levels of the source texts. Our experiments show that higher CEFR levels make texts more difficult to translate, and that machine translation shifts the CEFR level of the target text compared to the original source, for most languages. These findings provide new insights for researchers and practitioners working on multilingual pedagogical content generation and machine translation difficulty estimation.

2606.05420 2026-06-05 cs.AI stat.AP

Assessing the Carbon Emissions and Energy Consumption of U.S. Hyperscale Data Centers

Gianluca Guidi, Francesca Dominici, Tiziano Squartini, Callaway Sprinkle, Jonathan Gilmour, Kevin Butler, Eric Bell, Scott Delaney, Falco J. Bargagli-Stoffi

Affiliations: Department of Biostatistics, Harvard T.H. Chan School of Public Health; Department of Computer Science, University of Pisa; IMT School of Advanced Studies, Lucca; Environmental Systems Research Institute; Baxtel; Department of Environmental Health, Harvard T.H. Chan School of Public Health; Department of Biostatistics, UCLA Fielding School of Public Health

AI summary: Compiles facility-level data on 403 US hyperscale data centers and estimates their electricity consumption, electricity sources, and CO2 emissions. Their electricity demand is about 1.8% of total US consumption, and their carbon intensity is about 48% above the national average.

Abstract

The rapid proliferation of hyperscale data centers (HDCs) in the US, mainly driven by the adoption of artificial intelligence, has raised concerns about this industry's environmental footprint. We compiled facility-level information on 403 US hyperscale data centers operating between May 2024 and April 2025 and estimated their electricity consumption, electricity sources, and attributable CO2 emissions. Across different facility-load scenarios, these HDCs consumed approximately 68-99 TWh of electricity and were associated with about 37-54 million metric tons of CO2. Under the central scenario, HDC electricity demand corresponded to approximately 1.8% of total US electricity consumption, with roughly 54% of attributed generation supplied by fossil-fuel sources. The HDC electricity-weighted average carbon intensity was approximately 545 gCO2/kWh, about 48% above the contemporaneous US national grid-average carbon intensity of 370 gCO2/kWh. Our approach provides an attributional tool for assessing the environmental footprint of hyperscale data centers using the most recent EPA eGRID plant-level data.
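The quoted figures can be cross-checked with simple arithmetic. The sketch below assumes emissions are approximately consumption multiplied by the average carbon intensity, which is a simplification of the plant-level attribution the abstract describes; the function names are hypothetical.

```python
def emissions_mt(consumption_twh, intensity_g_per_kwh):
    """Convert TWh and gCO2/kWh into million metric tons of CO2.
    1 TWh = 1e9 kWh, and 1 million metric tons = 1e12 g."""
    grams = consumption_twh * 1e9 * intensity_g_per_kwh
    return grams / 1e12


def relative_excess(value, reference):
    """Fractional amount by which value exceeds reference."""
    return value / reference - 1.0


if __name__ == "__main__":
    print(emissions_mt(68, 545))       # about 37.06
    print(emissions_mt(99, 545))       # about 53.96
    print(relative_excess(545, 370))   # about 0.473
```

The 68-99 TWh range at 545 gCO2/kWh gives roughly 37-54 million metric tons, consistent with the abstract. The rounded intensities give an excess of about 47%, close to the quoted 48%; the small gap presumably comes from rounding in the published figures.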

2606.05415 2026-06-05 cs.CL cs.AI cs.LG

Executable Schema Contracts: From Automatic Ingestion to Multi-Source Retrieval

Padmaja Jonnalagedda, Yuguang Yao, Xiang Gao, Hilaf Hasson, Kamalika Das

Affiliations: Intuit AI Research

AI summary: Proposes a system that automatically discovers an executable schema from multi-source data and uses it as a shared contract, improving multi-source question answering through schema-conditioned retrieval routing and structural analysis.

Comments: 9 pages, 4 figures, plus supplementary appendix

Abstract

Real-world data spans tables, documents, and semi-structured files with implicit semantics. Querying this data requires integrating evidence across inconsistent schemas and formats, yet existing approaches either demand costly manual engineering or bypass structure entirely. We present a system that automatically discovers an executable schema from raw multi-source data and uses it as a shared contract for knowledge graph construction and query-time retrieval. A closed-world field catalog constrains LLM-based schema discovery to attested fields; deterministic structural analysis infers identity keys, foreign keys, and source hierarchy; and the resulting schema drives extraction, deduplication, and cross-source linking into a provenance-aware knowledge graph. At query time the schema -- optionally extended via a monotonic protocol -- conditions a multi-tool agent routing retrieval across structured lookup, graph traversal, and vector search, returning grounded answers with traceable citations. In controlled zero-shot comparisons using the same LLM, data, and evaluation harness, the system improves over retrieval-only and decomposition-based baselines across four QA benchmarks, with ablations showing that schema-conditioned routing, structural intelligence, and schema-guided construction each contribute to the gains.

2606.05414 2026-06-05 cs.CL cs.AI cs.HC cs.LG

When Evidence is Sparse: Weakly Supervised Early Failure Alerting in Dialogs and LLM-Agent Trajectories

Avinash Baidya, Xinran Liang, Ruocheng Guo, Xiang Gao, Kamalika Das

Affiliations: Intuit AI Research; Princeton University

AI summary: For early failure alerting in dialogs and LLM-agent trajectories, proposes a two-stage method that uses attention to learn turn-level failure evidence from sparse trajectory-level labels and pairs it with the α-STOP policy for controllable early alerting, substantially improving Pareto-frontier quality and reducing training cost across multiple benchmarks.

Comments: 9 pages, 14 figures, and appendix

Abstract

Early failure alerting requires deciding, while a dialog or agent trajectory is still unfolding, whether to flag it as likely to fail. This is challenging because supervision is typically available only as a trajectory-level success/failure label while alerts must be raised from partial interactions. Prior early-classification methods often bridge this gap by assigning the terminal label to every prefix, treating every turn as failure evidence. We hypothesize that this prefix-label assumption is poorly matched to multi-turn language interactions, where evidence of eventual failure is sparse and often delayed. In this paper, we introduce a two-stage approach that learns from this sparse evidence structure and uses the resulting risk estimates for controllable early alerting. Specifically, our attention-based failure predictor learns sparse turn-level failure evidence from trajectory labels and uses it to estimate failure risk from partial histories. We then pair this predictor with α-STOP, a single preference-conditioned stopping policy that selects an accuracy-earliness operating point at inference time rather than training a separate trigger for each preference. Across five benchmarks spanning customer support, task-oriented dialog, persuasion, tool use, and planning, we first show that high-relevance failure evidence occupies only 4.7-11.3% of turns and first appears after 59.0-83.6% of trajectories on average. We further show that the attention-based predictor improves Pareto-frontier quality (hypervolume) by 1-10% over naive prefix supervision, and that the full system improves frontier quality by 3-42% over state-of-the-art trigger policies while reducing training cost per operating point by 1-3 orders of magnitude.
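The accuracy-earliness trade-off can be illustrated with a much simpler stand-in than the learned policy described above. The sketch below raises an alert at the first turn whose estimated failure risk crosses a fixed threshold; it is a plain threshold rule, not α-STOP, and the risk values, function names, and earliness definition are all assumptions for illustration.

```python
def first_alert_turn(risks, threshold):
    """Index of the first turn whose risk estimate reaches the threshold,
    or None if no alert is raised during the trajectory."""
    for turn, risk in enumerate(risks):
        if risk >= threshold:
            return turn
    return None


def earliness(alert_turn, num_turns):
    """Fraction of the trajectory still remaining when the alert fires.
    Returns 0.0 when no alert was raised."""
    if alert_turn is None:
        return 0.0
    return 1.0 - (alert_turn + 1) / num_turns


if __name__ == "__main__":
    # Hypothetical per-turn risk estimates: evidence appears late, as the
    # abstract reports is typical for failing trajectories.
    risks = [0.10, 0.12, 0.15, 0.40, 0.85, 0.95]
    for threshold in (0.3, 0.5, 0.9):
        turn = first_alert_turn(risks, threshold)
        print(threshold, turn, earliness(turn, len(risks)))
```

Lower thresholds fire earlier but on weaker evidence, and higher thresholds wait for stronger evidence. Sweeping the threshold traces one accuracy-earliness curve, which is the kind of frontier the abstract's hypervolume metric summarizes.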

2606.05413 2026-06-05 cs.LG cs.AI

CausalPOI: Spatio-Temporal Graph-Based Causal Modeling for Cold-Start POI Check-in Forecasting

Zhaoqi Zhang, Miao Xie, Yi Li, Linyou Cai, Siqiang Luo, Gao Cong

Affiliations: Nanyang Technological University; China Agricultural University; Meituan

AI summary: Proposes the CausalPOI framework, which models semantic and spatial relationships between POIs with a spatio-temporal functional interaction graph and simulates factual and counterfactual scenarios with structurally aligned treatment and control graphs to address cold-start POI check-in forecasting, significantly outperforming baselines on real-world datasets.

Comments: Accepted at KDD 2026

Abstract

As urban environments continue to evolve rapidly, accurately modeling the dynamic behaviour of Points of Interest (POIs) is essential for supporting data-driven urban planning and commercial decision-making. While recent advancements in spatio-temporal graph learning have improved POI forecasting, most methods rely on proximity-based graphs and correlation-driven modeling, which overlook the functional dependencies between POIs and fail to capture the causal effects of urban interventions. In this paper, we introduce a novel research problem -- cold-start POI check-in forecasting, which aims to predict the future check-in pattern of a newly introduced POI, by modeling its temporal evolution and functional interactions with nearby POIs in a structured urban spatial context. To address these challenges, we propose CausalPOI, a spatio-temporal graph-based causal representation learning framework. CausalPOI leverages a Spatio-Temporal Functional Interaction Graph to model semantic and spatial relationships between POIs, and constructs structurally aligned treatment and control graphs to simulate factual and counterfactual scenarios. Extensive experiments on real-world SafeGraph datasets demonstrate that CausalPOI significantly outperforms state-of-the-art baselines across the board, validating its effectiveness in spatio-temporal forecasting, semantic interaction modeling, and causal effect estimation, providing a more interpretable and actionable foundation for urban intervention analysis. Source code is available on GitHub.

2606.05411 2026-06-05 cs.AI cs.HC

A Motivational Architecture for Conversational AGI

Anna Mikeda, Ben Goertzel

Affiliations: Glass Umbrella; SingularityNet

AI summary: Proposes a conversational motivational architecture that reinterprets the OpenPsi motivational lineage in dialogue-native terms and couples it to MetaMo's higher-level motivational scaffold. A ten-stage motivational processing pipeline, a dual decision strategy, and a functional distinction between pre-action feelings and post-action emotions let a conversational agent regulate motives such as competence, uncertainty reduction, and affiliation.

Comments: 16 pages. Accepted for AGI-26 proceedings

Abstract

Motivational architectures in cognitive AI have largely been designed for physical agents regulating bodily needs. Conversational agents operate in a different regime: their sensorimotor loop is linguistic, their environment is a user's evolving mental state, and their consequential actions are speech acts, tool invocations, and strategic silences. This paper proposes a conversational reinterpretation of the OpenPsi motivational lineage, coupled to MetaMo's higher-level motivational scaffold, for agents built on a modular execution substrate. Homeostasis is recast in dialogue-native terms: the agent regulates competence, uncertainty reduction, affiliation, affinity, legitimacy, nurturing, and aesthetic coherence rather than bodily deficits. We propose three contributions: a ten-stage motivational processing pipeline that architecturally separates cognitive modulation from situational appraisal; a dual decision strategy blending urgency-driven fast response with deliberative multi-goal optimization; and an architecturally useful distinction between pre-action feelings and post-action emotions as functionally different forms of affect. We specialize the framework to two example agents -- CompanionAgent and ResearchAgent -- and sketch its extension to social robotics and domain-generic human-level AGI.

2606.05408 2026-06-05 cs.AI cs.NE

Mutation Without Variation: Convergence Dynamics in LLM-Driven Program Evolution

Can Gurkan, Forrest Stonedahl, Uri Wilensky

Affiliations: Northwestern University; Augustana College

AI summary: Studies whether an LLM that repeatedly mutates a program without selection pressure explores new forms or circles back to old ones. LLM mutation consistently converges toward restricted attractor regions; at the structural level, in 87% of chains over 93% of mutations revisit a previously seen structural form.

Comments: Accepted to the Genetic and Evolutionary Computation Conference (GECCO '26) Workshop on Large Language Models for and with Evolutionary Computation

Abstract

When an LLM repeatedly mutates a program, does it explore new forms or circle back to the same ones? We study this question by analyzing LLM-driven mutation chains in the absence of selection pressure within a domain-specific language, varying prompt design, model family, and stochastic replication. We find that LLM-based mutation consistently converges toward restricted attractor regions in program space. Convergence is especially severe at the structural level: in 87% of chains, over 93% of mutations revisit a previously seen structural form, with most variation confined to terminal substitutions within recurring templates. Cycle analysis reveals short cycles and self-loops dominating the transition structure. The rate of convergence varies with prompt wording and model choice, but the phenomenon is robust across conditions. A classical GP subtree mutation operator does not exhibit comparable convergence, suggesting that the effect is intrinsic to the LLM mutation pipeline. These findings reveal a tension at the heart of LLM-driven program evolution: the same capabilities that enable semantics-aware program transformation also carry a systematic bias toward structural homogeneity that must be accounted for if such systems are to sustain open-ended exploration. Source code is available at https://github.com/can-gurkan/lmca.
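The revisit statistic quoted above can be made concrete with a small sketch. It assumes each program in a chain has already been reduced to a hashable structural signature (how the paper derives that signature is not stated in the abstract); the function names and the example chain are hypothetical.

```python
from typing import Hashable, Sequence


def revisit_fraction(forms: Sequence[Hashable]) -> float:
    """Fraction of mutations whose result was already seen earlier in the chain."""
    if len(forms) < 2:
        return 0.0
    seen = {forms[0]}
    revisits = 0
    for form in forms[1:]:
        if form in seen:
            revisits += 1
        seen.add(form)
    return revisits / (len(forms) - 1)


def self_loop_fraction(forms: Sequence[Hashable]) -> float:
    """Fraction of mutations that leave the structural form unchanged."""
    if len(forms) < 2:
        return 0.0
    loops = sum(1 for a, b in zip(forms, forms[1:]) if a == b)
    return loops / (len(forms) - 1)


# Hypothetical chain of structural signatures: the seed plus five mutations.
chain = ["A", "B", "A", "A", "C", "B"]
print(revisit_fraction(chain))    # 3 of the 5 mutations land on a form seen before
print(self_loop_fraction(chain))  # 1 of the 5 mutations is a self-loop
```

A chain counts toward the "over 93%" group in this reading when `revisit_fraction` exceeds 0.93; the exact counting convention used by the authors may differ.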

2606.05407 2026-06-05 cs.RO

MoDex: A Diffusion Policy for Sequential Multi-Object Dexterous Grasping

Haofei Lu, Hongjia Liu, Yifei Dong, Florian T. Pokorny, Jens Lundell, Danica Kragic

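Affiliations * Department of Robotics, Perception and Learning, KTH Royal Institute of Technology; Robotics and Autonomous Systems at University of Turku

AI summary Proposes MoDex, a diffusion policy that predicts grasp poses conditioned on an opposition space and a point cloud, so that a single dexterous hand can grasp multiple objects in sequence without releasing those already held; two-stage training (imitation learning followed by reinforcement learning fine-tuning) raises the success rate.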

Comments Submitted to CoRL 2026

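AI abstract (translated from Chinese)

This work addresses the problem of sequentially grasping multiple objects with a single dexterous hand without releasing the objects already held. Most dexterous grasping methods devote all of the hand's degrees of freedom to a single object, which underuses its dexterity and leaves no redundancy for later grasps. The proposed solution, MoDex, is a diffusion policy that predicts the next gripper pose directly from observations, conditioned on an opposition space and a point cloud. The opposition space condition specifies which fingers take part in the current grasp, so the gripper uses only a subset of its available degrees of freedom and keeps the rest for subsequent grasps. To facilitate sim-to-real transfer, MoDex is trained in two stages: first by imitation learning on expert demonstrations, then by reinforcement learning fine-tuning, which consistently improves the success rate of the pre-trained policy. We evaluate MoDex in simulation on a MuJoCo-based Franka Emika Panda robot equipped with an Allegro Hand, and on the corresponding real-world hardware platform. In both simulation and real-world experiments, MoDex achieves higher success rates than the evaluated learning-based baselines, improving performance by 2.92-17.92% and 6.67-17.78%, respectively. Project page: https://modex2026.github.io/.

English abstract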

This work addresses sequentially grasping multiple objects with a single dexterous hand without releasing those already held. Most dexterous grasping methods commit all of the hand's degrees of freedom to a single object, underutilizing its dexterity and leaving no redundancy for subsequent grasps. The proposed solution, MoDex, is a diffusion policy that predicts the next gripper pose directly from observations, conditioned on an opposition space and point cloud. The opposition space condition specifies which fingers participate in the current grasp, enabling the gripper to use only a subset of its available degrees of freedom while reserving the remaining degrees of freedom for subsequent grasps. To facilitate sim-to-real transfer, MoDex is trained in two stages: first through imitation learning on expert demonstrations, and subsequently through reinforcement learning fine-tuning, which consistently improves success rates over the pre-trained policy. We evaluate MoDex in simulation on a MuJoCo-based Franka Emika Panda robot equipped with an Allegro Hand and on the corresponding real-world hardware platform. Across both simulation and real-world experiments, MoDex achieves higher success rates than the evaluated learning-based baselines, improving performance by 2.92-17.92% and 6.67-17.78%, respectively. Project page: https://modex2026.github.io/.

2606.05404 2026-06-05 cs.AI cs.CL cs.LG

Harnessing Generalist Agents for Contextualized Time Series

Zihao Li, Kaifeng Jin, Yuanchen Bei, Jiaru Zou, Avaneesh Kumar, Xuying Ning, Yanjun Zhao, Mengting Ai, Baoyu Jing, Hanghang Tong, Jingrui He

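Affiliations * University of Illinois Urbana-Champaign

AI summary Proposes the TimeClaw framework, which gives generalist LLM agents contextualized temporal reasoning by integrating executable temporal tools, experience-driven capability evolution, and episodic multimodal memory; it improves performance on benchmarks across energy, finance, and other domains.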

Comments Preprint. 38 Pages

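AI abstract (translated from Chinese)

Time series are usually embedded in rich contexts that are essential for holistic modeling. In addition, real-world practitioners typically need end-to-end workflows for analyzing temporal dynamics, in which widely studied tasks such as forecasting are only one step in a broader solution loop. Although generalist AI agents offer a promising interface for such workflows under complex contexts, they operate mainly in text space and are not fully aligned with structured temporal signals. In this work we introduce TimeClaw, an agentic framework for time series that equips generalist LLM agents with the time series-native runtime support needed for contextualized temporal reasoning. TimeClaw integrates executable temporal tools for grounded and auditable analysis, experience-driven capability evolution for creating reusable analytical routines, and episodic multimodal memory for retrieving relevant reasoning traces. Together, these components unlock open-ended temporal reasoning with contextual information. Extensive evaluation on multiple benchmarks covering energy, finance, weather, traffic, and other real-world domains shows that TimeClaw improves performance. Code is available at https://github.com/iDEA-iSAIL-Lab-UIUC/TimeClaw.

English abstract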

Time series are often embedded in rich contexts that are essential for holistic modeling. Moreover, real-world practitioners often require end-to-end workflows for analyzing temporal dynamics, where widely studied tasks such as forecasting are only one step in a broader solution loop. While generalist AI agents offer a promising interface for such workflows under complex contexts, they still operate primarily in textual spaces that are not fully aligned with structured temporal signals. In this work, we introduce TimeClaw, an agentic harness framework for time series that equips generalist LLM agents with the time series-native runtime support needed for contextualized temporal reasoning. TimeClaw integrates executable temporal tools for grounded and auditable analysis, experience-driven capability evolution for creating reusable analytical routines, and episodic multimodal memory for retrieving relevant reasoning traces. Together, these components unlock harnessed open-ended temporal reasoning with contextual information. Extensive evaluation on multiple benchmarks covering diverse tasks across energy, finance, weather, traffic, and other real-world domains demonstrates improved performance of TimeClaw. Code is available at https://github.com/iDEA-iSAIL-Lab-UIUC/TimeClaw.

2606.05403 2026-06-05 cs.LG cs.AI

Trust, but Don't Verify: Epistemic Blind Spots in LLM Source Evaluation

Rohan N. Pradhan, Steve Goley

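Affiliations * Amazon

AI summary Studies whether language models assess evidence quality during multi-source synthesis. Models can detect fabricated statistics but do not use that ability during synthesis; they rely instead on a methodology-register gate, so numeric-validity signals are suppressed.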

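AI abstract (translated from Chinese)

Language models increasingly act as epistemic proxies, synthesizing evidence from multiple sources to support decisions. Whether they evaluate the quality of that evidence or merely aggregate it on the basis of surface presentation is still unclear. We show that models are able to detect fabricated statistics (correct identification rates of 0.76-1.00 for methodology in isolation) but do not recruit this ability during multi-source synthesis, producing similar numeric estimates whether the statistics are fabricated or valid. Specifically, source influence is governed by a methodology-register gate that responds to the distributional register of analytical text but not to numeric validity: for example, statistically impossible confidence intervals receive the same weight as valid ones. This behavioral dissociation replicates across five models from three families (Claude, Qwen, OLMo) and three professional domains. Mechanistic analyses, including causal tracing, linear probes, and component-level attribution, converge on the same explanation: the model encodes and causally uses a methodology-register representation that transfers across domains (probe AUC 0.83-0.92), while numeric-validity signals, which are decodable in isolation, are suppressed to chance level during multi-source synthesis. Prompting-based mitigations, even an oracle checklist that names the exact statistical checks, produce blanket skepticism rather than selective discernment, and the post-training pipelines we examine reinforce the stylistic shortcut without building numeric verification. Unlike sycophancy, which tracks user preference, this failure tracks whether a source presents as analytically credible rather than whether its claims are internally consistent. We call this epistemic alignment: as with preference alignment and safety alignment, the question is not capability but deployment.

English abstract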

Language models increasingly act as epistemic proxies, synthesizing evidence from multiple sources to inform decisions. Whether they evaluate the quality of that evidence, or merely aggregate it based on surface presentation, remains poorly understood. We show that models possess the capability to detect fabricated statistics (correct identification rates of 0.76-1.00 for methodology in isolation) but do not recruit this capability during multi-source synthesis, producing similar numeric estimates whether the statistics are fabricated or valid. Specifically, source influence is governed by a methodology-register gate that responds to the distributional register of analytical text but not to numeric validity: for example, statistically impossible confidence intervals receive the same weight as valid ones. The behavioral dissociation replicates across five models from three families (Claude, Qwen, OLMo) and three professional domains. Mechanistic analyses, including causal tracing, linear probes, and component-level attribution, converge on the same account: the model encodes and causally uses a methodology-register representation that transfers across domains (probe AUC 0.83-0.92), while numeric-validity signals, decodable in isolation, are suppressed to chance during multi-source synthesis. Prompting-based mitigations, even an oracle checklist naming the exact statistical checks, produce blanket skepticism rather than selective discernment, and the post-training pipelines we examine reinforce the stylistic shortcut without building numeric verification. Unlike sycophancy, which tracks user preference, this failure tracks whether a source presents as analytically credible, not whether its claims are internally consistent. We term this epistemic alignment: like preference and safety alignment, the question is not capability but deployment.
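To illustrate what a numeric-validity check on a reported confidence interval can look like, here is a minimal sketch. The function name and the two checks are assumptions made for this example; the abstract does not list the fabrications it tested or the contents of its oracle checklist, and the second check only applies to the usual intervals that are built around the point estimate.

```python
from typing import List


def interval_problems(estimate: float, lower: float, upper: float) -> List[str]:
    """Return reasons why a reported confidence interval cannot be right (empty if none found)."""
    problems = []
    if lower > upper:
        problems.append("lower bound exceeds upper bound")
    if not (min(lower, upper) <= estimate <= max(lower, upper)):
        problems.append("point estimate lies outside the interval")
    return problems


# A plausible report and an impossible one (hypothetical numbers).
print(interval_problems(0.42, 0.35, 0.49))
print(interval_problems(0.42, 0.51, 0.58))
```

A synthesis step that weighted sources by such checks would discount the second report regardless of how analytical its surrounding prose sounds.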

2606.05402 2026-06-05 cs.CL cs.AI

ReasoningFlow: Discourse Structures for Understanding LLM Reasoning Traces

Jinu Lee, Shivam Agarwal, Amruta Parulekar, Siddarth Madala, Dilek Hakkani-Tur, Julia Hockenmaier

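Affiliations * University of Illinois Urbana-Champaign

AI summary Proposes the ReasoningFlow framework, which models the reasoning traces of large reasoning models as fine-grained directed acyclic graphs; analysis of manual and automatic annotations reveals structural similarity across models, diverse reasoning behaviors, and the relationship between erroneous steps and final answers.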

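AI abstract (translated from Chinese)

The reasoning traces produced by large reasoning models (LRMs) have non-linear structures, such as backtracking and self-correction, which complicate the evaluation and monitoring of the reasoning process. We introduce ReasoningFlow, a framework that captures the discourse structure of LRM reasoning traces as fine-grained directed acyclic graphs (DAGs). We develop and validate our annotation scheme through careful manual annotation of 31 traces (2.1k steps), achieving high inter-annotator agreement, and then scale to automatic annotation of 1,260 traces (247.7k steps) covering three tasks (math, science, argumentation) and five models (Qwen2.5-32B-Inst, QwQ-32B, DeepSeek-V3, DeepSeek-R1, GPT-oss-120B). By analyzing ReasoningFlow graphs, we find: (1) LRMs exhibit structurally similar traces even though they are trained from different base models and may use non-overlapping post-training data. (2) ReasoningFlow reveals diverse fine-grained reasoning behaviors (e.g., local verification, self-reflection, and assumptions) that can be used to improve the monitorability of reasoning traces. (3) In LRMs, most erroneous steps are not used to derive the final answer. (4) Mechanistic causal dependencies between steps do not reflect the language-level discourse structure. We release the dataset and code at https://github.com/jinulee-v/reasoningflow.

English abstract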

Large reasoning models (LRMs) produce reasoning traces with non-linear structures, such as backtracking and self-correction, that complicate the evaluation and monitoring of the reasoning process. We introduce ReasoningFlow, a framework that captures the discourse structures of LRM reasoning traces as fine-grained directed acyclic graphs (DAGs). We develop and validate our annotation schema through careful manual annotation of 31 traces (2.1k steps), achieving high inter-annotator agreement, then scale to automatic annotation of 1,260 traces (247.7k steps) spanning three tasks (math, science, argumentation) and five models (Qwen2.5-32B-Inst, QwQ-32B, DeepSeek-V3, DeepSeek-R1, GPT-oss-120B). By analyzing ReasoningFlow graphs, we find: (1) LRMs exhibit structurally similar traces, despite being trained from different base models and potentially non-overlapping post-training data. (2) ReasoningFlow reveals diverse fine-grained reasoning behaviors (e.g., local verification, self-reflection, and assumptions) that can be used for better reasoning trace monitorability. (3) In LRMs, most of the erroneous steps are not used to derive final answers. (4) Mechanistic causal dependencies between steps do not reflect the language-level discourse structure. We release the dataset and code at https://github.com/jinulee-v/reasoningflow.
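Finding (3) can be read as a reachability question on the trace graph: an erroneous step matters only if the final answer depends on it, directly or transitively. The sketch below assumes a trace is stored as a mapping from each step to the steps it builds on; the step names, the edge semantics, and the helper are hypothetical and not the ReasoningFlow schema.

```python
from typing import Dict, List, Set


def steps_used_for(answer: str, premises: Dict[str, List[str]]) -> Set[str]:
    """Collect every step the answer depends on, directly or transitively."""
    used: Set[str] = set()
    stack = [answer]
    while stack:
        node = stack.pop()
        for parent in premises.get(node, []):
            if parent not in used:
                used.add(parent)
                stack.append(parent)
    return used


# Hypothetical trace: s2 is a wrong turn, s3 examines and abandons it,
# and the answer is derived from s4, which restarts from s1.
premises = {"s2": ["s1"], "s3": ["s2"], "s4": ["s1"], "answer": ["s4"]}
erroneous = {"s2"}

used = steps_used_for("answer", premises)
unused_errors = erroneous - used
print(sorted(used), sorted(unused_errors))
```

In this toy trace the erroneous step `s2` is not an ancestor of the answer, which is the pattern the abstract reports as the majority case.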

2606.05400 2026-06-05 cs.AI cs.CL cs.LG

LeanMarathon: Toward Reliable AI Co-Mathematicians through Long-Horizon Lean Autoformalization

Yuanhe Zhang, Yuekai Sun, Taiji Suzuki, Jason D. Lee, Fanghui Liu

Affiliations * Department of Statistics, University of Warwick, UK; Center for Advanced Intelligence Project, RIKEN, Japan; Department of Statistics, University of Michigan, USA; Department of Mathematical Informatics, The University of Tokyo, also Center for Advanced Intelligence Project, RIKEN, Japan; Department of Electrical Engineering and Computer Sciences, also Department of Statistics, University of California, Berkeley, USA; School of Mathematical Sciences, Institute of Natural Sciences and MOE-LSC, Shanghai Jiao Tong University, China

AI summary Proposes LeanMarathon, a multi-agent framework that uses a blueprint abstraction and a two-stage orchestrator for reliable long-horizon autoformalization of research mathematics; it formalizes seven theorems spanning four Erdős problems.

Comments 26 pages, 9 figures. Comments are welcome

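AI abstract (translated from Chinese)

Long-horizon autoformalization of research mathematics fails not only on hard lemmas but also at scale: statements drift, dependencies tangle, context decays, and local repairs break distant work. We present LeanMarathon, a multi-agent harness for reliable research-level Lean autoformalization. Its core abstraction is an evolving blueprint: a Lean file that serves at once as formal proof skeleton, natural-language proof graph, and shared system of record. Four contract-scoped agents construct, audit, prove, and repair this blueprint. The agents are coordinated by a two-stage orchestrator that first stabilizes target fidelity through adversarial review and then discharges the proof directed acyclic graph (DAG) from its dynamic leaves upward in parallel, CI-gated rounds. LeanMarathon turns one brittle multi-hour run into many local, recoverable, parallel transactions. We evaluate LeanMarathon on two recent research papers covering four Erdős problems (#1051, #1196, #164, #1217). Across three autonomous runs, it formalizes all seven target theorems with no sorry remaining, proving 258 lemmas and theorems. These results show that reliable AI co-mathematics needs not only stronger provers but also durable harnesses that preserve target fidelity across long mathematical developments. The code can be found at https://github.com/YuanheZ/LeanMarathon.

English abstract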

Long-horizon autoformalization of research mathematics fails not only at hard lemmas, but at scale: statements drift, dependencies tangle, context decays, and local repairs corrupt distant work. We present LeanMarathon, a multi-agent harness for reliable research-level Lean autoformalization. Its core abstraction is an evolving blueprint: a Lean file that serves simultaneously as formal proof skeleton, natural-language proof graph, and shared system of record. Four contract-scoped agents construct, audit, prove, and repair this blueprint. These agents are coordinated by a two-stage orchestrator that first stabilizes target fidelity through adversarial review and then discharges the proof directed acyclic graph (DAG) from its dynamic leaves upward in parallel CI-gated rounds. LeanMarathon turns one brittle multi-hour run into many local, recoverable, parallel transactions. We evaluate LeanMarathon on two recent research papers spanning four Erdős problems (#1051, #1196, #164, #1217). Across three autonomous runs, it formalizes all seven target theorems with no sorry, proving 258 lemmas and theorems. These results show that reliable AI co-mathematics requires not only stronger provers, but durable harnesses that preserve target fidelity across long mathematical developments. The code can be found at https://github.com/YuanheZ/LeanMarathon.
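Discharging a proof DAG "from its leaves upward in parallel rounds" can be pictured as a layered topological order: each round contains every lemma whose dependencies were all proved in earlier rounds. The sketch below shows only this static layering; the lemma names are hypothetical, and the dynamic aspects described in the abstract (leaves that change as the blueprint is repaired, CI gating between rounds) are not modeled.

```python
from typing import Dict, List, Set


def proof_rounds(deps: Dict[str, Set[str]]) -> List[Set[str]]:
    """Group nodes into rounds; a node is ready once all its dependencies are proved.

    Every dependency must itself appear as a key of `deps`.
    """
    proved: Set[str] = set()
    remaining = set(deps)
    rounds: List[Set[str]] = []
    while remaining:
        ready = {node for node in remaining if deps[node] <= proved}
        if not ready:
            raise ValueError("no node is ready: dependency cycle or missing key")
        rounds.append(ready)
        proved |= ready
        remaining -= ready
    return rounds


# Hypothetical blueprint: two leaf lemmas, one intermediate lemma, one target theorem.
deps = {
    "lemma_a": set(),
    "lemma_b": set(),
    "lemma_c": {"lemma_a", "lemma_b"},
    "main_theorem": {"lemma_c", "lemma_b"},
}
rounds = proof_rounds(deps)
print(rounds)
```

All nodes inside one round are independent of each other, so they could be handed to provers in parallel.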

2605.02395 2026-06-05 cs.AI

Controllable and Verifiable Process Data Synthesis for Process Reward Models

Yinghui Chi, Lucien Wang

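Affiliations * Jilin University

AI summary Proposes a controllable and verifiable framework that synthesizes process supervision data by injecting template-aware errors and recomputing the subsequent steps, improving process reward models on logical and mathematical reasoning.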

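AI abstract (translated from Chinese)

Process reward models (PRMs) depend on high-quality process supervision data, but existing construction methods usually offer limited control over error location, error type, and trajectory consistency. We propose a controllable and verifiable framework for synthesizing process supervision data for PRMs. The framework first constructs a correct symbolic reasoning chain, injects a template-aware error at an intermediate step, recomputes the subsequent steps under the corrupted state, and verifies that the injected step cannot be derived from its prefix. The resulting paired trajectories are prefix-invalid at the first error yet remain trajectory-consistent after symbolic recomputation, and they are translated into aligned natural-language processes for PRM training and evaluation. Experiments show that the synthesized data improve Best-of-8 reranking on logical reasoning benchmarks and transfer to mathematical reasoning. Step-level evaluation further shows that first-error localization remains more challenging than overall step classification, highlighting the need for fine-grained, verifiable process supervision.

English abstract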

Process reward models (PRMs) rely on high-quality process supervision data, yet existing construction methods often provide limited control over error location, error type, and trajectory consistency. We propose a controllable and verifiable framework for synthesizing process supervision data for PRMs. Our framework first constructs a correct symbolic reasoning chain, injects a template-aware error into an intermediate step, recomputes subsequent steps under the corrupted state, and verifies that the injected step is not derivable from its prefix. The resulting paired trajectories are prefix-invalid at the first error while remaining trajectory-consistent after symbolic recomputation, and are translated into aligned natural-language processes for PRM training and evaluation. Experiments show that the synthesized data improve Best-of-8 reranking on logical reasoning benchmarks and transfer to mathematical reasoning. Step-level evaluation further shows that first-error localization remains substantially more challenging than overall step classification, highlighting the need for fine-grained and verifiable process supervision.
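The construct, inject, recompute, verify loop can be illustrated on a toy chain. The sketch below is an assumption-heavy stand-in: it uses integer arithmetic instead of the paper's symbolic reasoning chains, and a fixed additive offset instead of a template-aware error. All names are hypothetical.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[int], int]]


def run_chain(start: int, steps: List[Step]) -> List[int]:
    """Apply each step to the running value and record every intermediate state."""
    states, value = [], start
    for _, op in steps:
        value = op(value)
        states.append(value)
    return states


def corrupt_chain(start: int, steps: List[Step], error_at: int,
                  offset: int = 1) -> Tuple[List[int], List[int], int]:
    """Inject a wrong result at `error_at`, then recompute later steps from it."""
    if offset == 0:
        raise ValueError("offset must be non-zero to create an error")
    clean = run_chain(start, steps)
    corrupted = clean[:error_at]            # shared, still-correct prefix
    value = clean[error_at] + offset        # injected wrong result
    corrupted.append(value)
    for _, op in steps[error_at + 1:]:      # later steps stay consistent with the error
        value = op(value)
        corrupted.append(value)
    # Verify: the injected state differs from what the correct prefix implies.
    assert corrupted[error_at] != clean[error_at]
    return clean, corrupted, error_at


steps: List[Step] = [
    ("add 3", lambda x: x + 3),
    ("double", lambda x: x * 2),
    ("subtract 4", lambda x: x - 4),
]
clean, corrupted, first_error = corrupt_chain(2, steps, error_at=1)
print(clean, corrupted, first_error)
```

The pair shares a correct prefix, diverges at a known step, and stays internally consistent afterwards, which is the labeling structure the abstract describes for first-error localization.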

2605.02192 2026-06-05 cs.RO

Do We Really Need Immediate Resets? Rethinking Collision Handling for Efficient Robot Navigation

Shanze Wang, Xinming Zhang, Siwei Cheng, Xianghui Wang, Changwen Chen, Hailong Huang, Wei Zhang

Affiliations * College of Information Science and Technology, Eastern Institute of Technology; Department of Aeronautical and Aviation Engineering, The Hong Kong Polytechnic University; Department of Computing, The Hong Kong Polytechnic University; School of Computer Science and Technology, University of Science and Technology of China; Department of Mechanical Engineering, The Hong Kong Polytechnic University

AI summary Challenges the convention in robot navigation of resetting the environment immediately after every collision, and proposes the Multi-Collision reset Budget (MCB) framework, which decouples local collision termination from global environment resets so that the agent can retry difficult configurations within the same episode, improving early learning efficiency.

Comments 8 pages, 9 figures

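AI abstract (translated from Chinese)

Must a single collision terminate an entire navigation episode? In most deep reinforcement learning (DRL) frameworks for robot navigation this is still standard practice: every collision immediately triggers a global environment reset and is penalized as a complete task failure. Although a collision during deployment naturally indicates task failure, treating collisions the same way during training keeps the agent from exploring challenging obstacle configurations, which slows learning in the early training phase. In this work we challenge this convention and propose a Multi-Collision reset Budget (MCB) framework that decouples local collision termination from global environment resets, allowing the agent to retry difficult configurations within the same episode. Simulation experiments show that MCB reaches target success-rate levels with fewer interactions, improving early-stage learning efficiency, with small collision budgets giving the most consistent gains. Real-world experiments on heterogeneous robot platforms further validate the deployability of the learned policies in cluttered environments.

English abstract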

Should a single collision necessarily terminate an entire navigation episode? In most deep reinforcement learning (DRL) frameworks for robot navigation, this remains the standard practice: every collision immediately triggers a global environment reset and is penalized as a complete task failure. While a collision during deployment naturally indicates task failure, applying the same treatment during training prevents the agent from exploring challenging obstacle configurations, which slows learning progress in the early training phase. In this work, we challenge this convention and propose a Multi-Collision reset Budget (MCB) framework that decouples local collision termination from global environment resets, allowing the agent to retry difficult configurations within the same episode. Simulation experiments show that MCB improves early-stage learning efficiency by reaching target success-rate levels with fewer interactions, with small collision budgets producing the most consistent gains. Real-world experiments on heterogeneous robot platforms further validate the deployability of the learned policies in cluttered environments.
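One way to picture the decoupling is a counter that tolerates a fixed number of collisions per episode before a global reset is requested. The sketch below is an assumed reading of "collision budget": the class, its semantics (the budget is the number of collisions tolerated without a global reset), and the scripted collision stream are hypothetical, and the local recovery after a tolerated collision is left out because the abstract does not describe it.

```python
from typing import Iterable


class CollisionBudget:
    """Counts collisions within an episode and reports when a global reset is due."""

    def __init__(self, budget: int) -> None:
        if budget < 0:
            raise ValueError("budget must be non-negative")
        self.budget = budget
        self.used = 0

    def start_episode(self) -> None:
        self.used = 0

    def on_collision(self) -> bool:
        """Register a collision; return True if the episode must end with a global reset."""
        self.used += 1
        return self.used > self.budget


def count_global_resets(collision_flags: Iterable[bool], budget: int) -> int:
    """Replay a scripted stream of per-step collision flags and count global resets."""
    tracker = CollisionBudget(budget)
    resets = 0
    for collided in collision_flags:
        if collided and tracker.on_collision():
            resets += 1
            tracker.start_episode()
    return resets


flags = [False, True, False, True, True, False, True]
print(count_global_resets(flags, budget=0))  # conventional setting: every collision resets
print(count_global_resets(flags, budget=2))  # two collisions per episode are tolerated
```

With `budget=0` the tracker reproduces the immediate-reset convention that the paper argues against, so the conventional setting is a special case of the same interface.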

2605.01844 2026-06-05 cs.CL

The Cylindrical Representation Hypothesis for Language Model Steering

Lang Gao, Jinghui Zhang, Wei Liu, Fengxian Ji, Chenxi Wang, Zirui Song, Akash Ghosh, Youssef Mohamed, Preslav Nakov, Xiuying Chen

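Affiliations * University of Science and Technology of China

AI summary Proposes the Cylindrical Representation Hypothesis (CRH), which relaxes the orthogonality assumption of the Linear Representation Hypothesis (LRH) to explain the instability and uncertainty of language model steering.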

Comments ICML 2026 camera ready

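AI abstract (translated from Chinese)

Steering is a widely used technique for controlling large language models, but its effects are often unstable and hard to predict. Existing theoretical explanations are based mainly on the Linear Representation Hypothesis (LRH). Although LRH assumes that concepts can be orthogonalized to achieve lossless control, this idealized mapping cannot be realized in real representations and cannot explain the observed unpredictability of steering. By relaxing LRH's orthogonality assumption while keeping linear representations, we show that overlapping concept contributions naturally produce a sample-specific axis-orthogonal structure. We formalize this as the Cylindrical Representation Hypothesis (CRH). In CRH, a central axis captures the main difference between the absence and presence of a concept and drives concept generation. The surrounding normal plane controls steering sensitivity by determining how easily the axis activates the target concept. Within this plane, only specific sensitive sectors strongly promote concept activation, while other sectors may suppress or delay it. Although the surrounding normal plane can be reliably identified from difference vectors, the sensitive sector cannot, which introduces intrinsic uncertainty at the sector level. This uncertainty offers a principled explanation of why steering outcomes often fluctuate even with well-aligned directions. Our experiments verify the existence of the cylindrical structure and show that CRH provides a valid and practical way to interpret model steering behavior in real settings: https://github.com/mbzuai-nlp/CRH.

English abstract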

Steering is a widely used technique for controlling large language models, yet its effects are often unstable and hard to predict. Existing theoretical accounts are largely based on the Linear Representation Hypothesis (LRH). While LRH assumes that concepts can be orthogonalized for lossless control, this idealized mapping fails in real representations and cannot account for the observed unpredictability of steering. By relaxing LRH's orthogonality assumption while preserving linear representations, we show that overlapping concept contributions naturally yield a sample-specific axis-orthogonal structure. We formalize this as the Cylindrical Representation Hypothesis (CRH). In CRH, a central axis captures the main difference between concept absence and presence and drives concept generation. A surrounding normal plane controls steering sensitivity by determining how easily the axis can activate the target concept. Within this plane, only specific sensitive sectors strongly facilitate concept activation, while other sectors can suppress or delay it. While the surrounding normal plane can be reliably identified from difference vectors, the sensitive sector cannot, introducing intrinsic uncertainty at the sector level. This uncertainty provides a principled explanation for why steering outcomes often fluctuate even when using well-aligned directions. Our experiments verify the existence of the cylindrical structure and demonstrate that CRH provides a valid and practical way to interpret model steering behavior in real settings: https://github.com/mbzuai-nlp/CRH.
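The axis and normal-plane picture can be sketched with plain vector algebra: estimate an axis, then split any hidden state into its coordinate along the axis and a remainder orthogonal to it. The sketch assumes the axis is the normalized difference of class means, a common steering recipe that the abstract does not confirm as the authors' estimator; the toy vectors and function names are hypothetical, and the sector analysis inside the normal plane is not shown.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def mean(vectors: Sequence[Vector]) -> Vector:
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def concept_axis(with_concept: Sequence[Vector], without_concept: Sequence[Vector]) -> Vector:
    """Unit vector along the difference of the two class means."""
    diff = [a - b for a, b in zip(mean(with_concept), mean(without_concept))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def split(h: Vector, axis: Vector) -> Tuple[float, Vector]:
    """Return the coordinate of h along the axis and its remainder in the normal plane."""
    along = sum(x * a for x, a in zip(h, axis))
    normal = [x - along * a for x, a in zip(h, axis)]
    return along, normal


# Toy 3-dimensional "hidden states" (hypothetical numbers).
present = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
absent = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
axis = concept_axis(present, absent)
along, normal = split([5.0, 2.0, -1.0], axis)
print(axis, along, normal)
```

In this reading, two samples with the same `along` value can still differ in `normal`, which is where the abstract locates the sample-specific sensitivity to steering.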

2606.05395 2026-06-05 cs.RO cs.AI

VASO: Formally Verifiable Self-Evolving Skills for Physical AI Agents

Yunhao Yang, Neel P. Bhatt, Kevin Wang, Samuel Tetteh, Zhangyang Wang, Ufuk Topcu

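Affiliations * The University of Texas at Austin; Iowa State University

AI summary Proposes the VASO framework, in which formal verification guides the self-evolution of LLM-generated robot skill contracts: counterexamples from model checking are converted into textual gradients that update the skill contracts without fine-tuning model weights. It reaches 97.2% formal-specification compliance on Jackal and quadcopter tasks.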

Comments Project webpage: https://languagegroundedriskdetection.github.io/ProjectPage/vaso-webpage/

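AI abstract (translated from Chinese)

Reusable robot skills are becoming the basic units by which embodied agents turn open-ended instructions into long-horizon physical behavior. We argue that although foundation models have sharply reduced the cost of creating these skills, the cost of trusting them has not fallen. Existing skill-evolution loops refine skills through execution feedback, unit tests, environment reward, or LLM self-critique, but these signals provide only trace-level evidence: they show that a skill works on sampled executions, not that the plans induced by the skill satisfy temporal safety contracts under untested conditions. We propose VASO, a framework for verification-guided self-evolution of LLM-generated robot skill contracts. In VASO, each skill is represented as a semantic contract with two coupled interfaces: a formal interface that aligns robot states, observations, and control commands with logical propositions for model checking, and a planner-facing interface that guides the generation of executable behavior. A model checker first filters out logically inconsistent skill contracts and then verifies whether the plans induced by the skill satisfy global and local temporal specifications. When verification fails, VASO converts the counterexample trace into a textual gradient that updates the reusable skill contract while the foundation-model weights stay frozen. On Clearpath Jackal and PX4 quadcopter tasks, VASO reaches 97.2% formal-specification compliance with fewer than 100 optimization samples, outperforming execution-feedback, prompt-optimization, and fine-tuning baselines. To our knowledge, VASO is the first framework to close the loop between formal verification and self-evolving LLM-generated skills for physical AI agents: formal counterexamples become optimization feedback for reusable robot skill contracts, rather than only verifying one-off plans, tuning planner prompts, or fine-tuning model weights.

English abstract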

Reusable robot skills are becoming the basic units through which embodied agents turn open-ended instructions into long-horizon physical behavior. We argue that, while foundation models have collapsed the cost of creating these skills, the cost of trusting them has not. Existing skill-evolution loops refine skills through execution feedback, unit tests, environment reward, or LLM self-critique, but these signals provide only trace-level evidence: they show that a skill worked on sampled executions, not that skill-induced plans satisfy temporal safety contracts under untested conditions. We introduce VASO, a framework for verification-guided self-evolution of LLM-generated robot skill contracts. In VASO, each skill is represented as a semantic contract with two coupled interfaces: a formal interface that aligns robot states, observations, and control commands with logical propositions for model checking, and a planner-facing interface that guides executable behavior generation. A model checker first filters logically inconsistent skill contracts, then verifies plans induced by the skill against global and local temporal specifications. When verification fails, VASO translates the counterexample trace into a textual gradient that updates the reusable skill contract while keeping foundation-model weights frozen. On Clearpath Jackal and PX4 quadcopter tasks, VASO reaches 97.2% formal-specification compliance using fewer than 100 optimization samples, outperforming execution-feedback, prompt-optimization, and fine-tuning baselines. To our knowledge, VASO is the first framework that closes the loop between formal verification and self-evolving LLM-generated skills for physical AI agents: formal counterexamples become optimization feedback for reusable robot skill contracts, rather than merely verifying one-off plans, tuning planner prompts, or fine-tuning model weights.

2606.05389 2026-06-05 cs.AI

Residual Modeling for High-Fidelity Learned Compression of Scientific Data

Liangji Zhu, Sanjay Ranka, Anand Rangarajan

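Affiliations * Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL, USA

AI summary Addresses the problem that, at high fidelity, the residual of a learned compressor dominates the bit rate, and proposes two residual coders, LBRC and NGLR, that raise the compression ratio with residual-specific representations.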

Comments 9 pages, 3 figures, 3 tables

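AI abstract (translated from Chinese)

Lossy compression is essential for the large-scale spatiotemporal data produced by scientific simulations. Learned compressors achieve high compression ratios at moderate accuracy targets, but their aggregate reconstruction loss cannot guarantee the accuracy of each block. Existing Guaranteed Autoencoder (GAE) methods add a per-block residual correction by retaining SVD/PCA-style coefficients until the target is met. This works at moderate tolerances, but in the high-fidelity regime, with block-level NRMSE from 10^-6 to 10^-4, the number of retained coefficients grows quickly and the correction stream dominates the total rate. We propose a residual-centric view: the learned residual is structurally different from the original scientific field and should be coded with a representation designed for it. We introduce two residual coders. LBRC is a deterministic, training-free pipeline that adaptively quantizes the learned residual to the target NRMSE and losslessly encodes the resulting integer residual with 3D Lorenzo differencing, zigzag mapping, bit-plane coding, and entropy coding. NGLR adds a causal neural predictor that, within the same deterministic integer pipeline, outputs a normalized bias for the integer-rounded Lorenzo prediction, lowering the entropy of the remaining residual code while keeping decoding deterministic. The predictor weights are serialized and counted in the bitstream. On the E3SM, JHTDB, and ERA5 datasets, at block-level NRMSE targets from 10^-6 to 10^-4, LBRC improves the compression ratio over GAE by 30-60% and is broadly on par with SZ. NGLR improves on LBRC by a further 10-40% and outperforms SZ in the evaluated high-fidelity regime. These results show that residual representations tailored to learned-compressor residuals can preserve the advantage of learned compression when global residual correction becomes rate-dominant.

English abstract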

Lossy compression is essential for massive spatiotemporal data from scientific simulations. Learned compressors can achieve high compression ratios at moderate accuracy targets, but their aggregate reconstruction losses do not guarantee accuracy for each block. Existing Guaranteed Autoencoder (GAE) methods add a per-block residual correction by retaining SVD/PCA-style coefficients until the target is met. This works at moderate tolerances, but in the high-fidelity regime with block-level NRMSE from 10^-6 to 10^-4, the number of retained coefficients grows quickly and the correction stream dominates the total rate. We propose a residual-centric view: the learned residual is structurally different from the original scientific field and should be coded with a representation designed for that residual. We introduce two residual coders. LBRC is a deterministic, training-free pipeline that adaptively quantizes the learned residual to the target NRMSE and losslessly encodes the resulting integer residual using 3D Lorenzo differencing, zigzag mapping, bit-plane coding, and entropy coding. NGLR adds a causal neural predictor that outputs a normalized bias for an integer-rounded Lorenzo prediction in the same deterministic integer pipeline, reducing the entropy of the remaining residual code while preserving deterministic decoding. The predictor weights are serialized and counted in the bitstream. Across E3SM, JHTDB, and ERA5 at block-level NRMSE targets from 10^-6 to 10^-4, LBRC improves compression ratio over GAE by 30-60% and is broadly competitive with SZ. NGLR adds a further 10-40% over LBRC and outperforms SZ in the evaluated high-fidelity regime. These results show that residual representations tailored to learned-compressor residuals can preserve the advantage of learned compression when global residual correction becomes rate-dominant.
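Two stages of the LBRC pipeline, 3D Lorenzo differencing and zigzag mapping, are standard enough to sketch. The code below is a generic illustration on nested lists of integers, assuming raster-order traversal and zero for out-of-range neighbours; it is not the authors' implementation, and the quantization, bit-plane, and entropy-coding stages are omitted.

```python
from itertools import product
from typing import List, Tuple

Volume = List[List[List[int]]]

# Offsets and signs of the seven already-visited neighbours used by the 3D Lorenzo predictor.
_TERMS = [((1, 0, 0), 1), ((0, 1, 0), 1), ((0, 0, 1), 1),
          ((1, 1, 0), -1), ((1, 0, 1), -1), ((0, 1, 1), -1),
          ((1, 1, 1), 1)]


def zigzag(n: int) -> int:
    """Map signed to unsigned integers: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return 2 * n if n >= 0 else -2 * n - 1


def unzigzag(u: int) -> int:
    return u // 2 if u % 2 == 0 else -((u + 1) // 2)


def _predict(vol: Volume, i: int, j: int, k: int) -> int:
    """Lorenzo prediction from lower-index neighbours; out-of-range neighbours count as 0."""
    total = 0
    for (di, dj, dk), sign in _TERMS:
        a, b, c = i - di, j - dj, k - dk
        if a >= 0 and b >= 0 and c >= 0:
            total += sign * vol[a][b][c]
    return total


def lorenzo_encode(vol: Volume) -> List[int]:
    """Zigzag-coded prediction errors in raster order."""
    ni, nj, nk = len(vol), len(vol[0]), len(vol[0][0])
    return [zigzag(vol[i][j][k] - _predict(vol, i, j, k))
            for i, j, k in product(range(ni), range(nj), range(nk))]


def lorenzo_decode(codes: List[int], shape: Tuple[int, int, int]) -> Volume:
    """Invert lorenzo_encode; each prediction only needs values decoded earlier."""
    ni, nj, nk = shape
    vol = [[[0] * nk for _ in range(nj)] for _ in range(ni)]
    for code, (i, j, k) in zip(codes, product(range(ni), range(nj), range(nk))):
        vol[i][j][k] = unzigzag(code) + _predict(vol, i, j, k)
    return vol


vol = [[[1, 2], [3, 4]], [[5, 6], [7, 9]]]   # hypothetical quantized residual block
codes = lorenzo_encode(vol)
restored = lorenzo_decode(codes, (2, 2, 2))
print(codes, restored == vol)
```

Because every stage works on integers, decoding reproduces the quantized residual exactly, which matches the abstract's emphasis on a deterministic, lossless integer pipeline after quantization.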

2606.05384 2026-06-05 cs.AI cs.CL

Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges

Srimonti Dutta, Akshata Kishore Moharir

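Affiliations * WAI USA Research Labs

AI summary Studies the manipulability of LLM judges in post-decision interaction: judges are highly stable under repeated neutral evaluation, but targeted challenges can substantially reverse their verdicts. Proposes the Evaluation Robustness Score (ERS) to quantify interactional robustness.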

Comments Accepted at ACL 2026 GEM (Generation, Evaluation and Metrics) Workshop

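AI abstract (translated from Chinese)

LLM-as-judge evaluation is widely used in benchmarking pipelines, where model outputs are compared and ranked by automated evaluators. These pipelines usually assume that judgments are stable properties of fixed inputs. We show that this assumption does not hold under interaction. We study post-decision manipulability: the extent to which an evaluation outcome can be changed through follow-up conversation with the judge after an initial decision has been made. In controlled experiments on MT-Bench and AlpacaEval, we find that LLM judges are highly stable under repeated and neutral re-evaluation but become substantially reversible under targeted post-decision challenge. An anti-baseline challenge protocol shows that stable judgments can be overturned through motivated interaction, while a counterbalanced target-validation protocol separates this reversibility from net target-directed steering. These reversals have practical consequences: they can lower agreement with human preferences, change benchmark rankings, and produce harmful evaluation changes at high self-reported confidence. Authority framing is especially destabilizing, and revised judgments often come with low-overlap justifications, which suggests post hoc rationalization rather than reliable error correction. We introduce the Evaluation Robustness Score (ERS), which quantifies interactional robustness by combining reversal susceptibility with counterbalanced directional effects. Our findings identify post-decision interaction as a distinct failure mode of LLM-as-judge evaluation and motivate evaluation protocols that measure not only static agreement but also robustness under challenge.

English abstract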

LLM-as-judge evaluation is widely used in benchmarking pipelines, where model outputs are compared and ranked using automated evaluators. These pipelines typically assume that judgments are stable properties of fixed inputs. We show that this assumption does not hold under interaction. We study post-decision manipulability: the extent to which an evaluation outcome can be altered through subsequent conversation with the judge after an initial decision has been made. Across controlled experiments on MT-Bench and AlpacaEval, we find that LLM judges are highly stable under repeated and neutral reevaluation, yet become substantially reversible under targeted post-decision challenge. An anti-baseline challenge protocol shows that stable judgments can be overturned through motivated interaction, while a counterbalanced target-validation protocol separates this reversibility from net target-directed steering. These reversals have practical consequences: they can degrade agreement with human preferences, shift benchmark rankings, and produce harmful evaluation changes despite high self-reported confidence. Authority framing is especially destabilizing, and revised judgments are often accompanied by low-overlap justifications, suggesting post hoc rationalization rather than reliable error correction. We introduce the Evaluation Robustness Score (ERS) to quantify interactional robustness by combining reversal susceptibility with counterbalanced directional effects. Our findings identify post-decision interaction as a distinct failure mode for LLM-as-judge evaluation and motivate evaluation protocols that measure not only static agreement, but robustness under challenge.

2606.05382 2026-06-05 cs.AI

Synthetic Contrastive Reasoning for Multi-Table Q&A

Ankit Pratap Singh, Xin Su, Phillip Howard

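Affiliations * Iowa State University; Thoughtworks

AI summary Addresses the lack of reasoning supervision in multi-table question answering by generating synthetic contrastive reasoning traces with heterogeneous LLMs and fine-tuning models with Contrastive Preference Optimization, improving results on MMQA by 9.7%-16.3%.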

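AI abstract (translated from Chinese)

Multi-table question answering requires models to retrieve relevant evidence, link schemas, and reason compositionally across relational tables. Existing multi-table Q&A resources usually provide questions and final answers but lack reasoning supervision that explains how the answers are derived. To fill this gap, we build a synthetic contrastive reasoning-trace dataset for MMQA by using heterogeneous LLMs to generate validated positive traces and plausible negative traces. We then use the resulting preference pairs to fine-tune open-weight LLMs with Contrastive Preference Optimization (CPO). On Qwen3-14B, Mistral-8B, and Llama-3.1-8B, CPO achieves absolute average improvements of 9.7%-16.3% over Q&A supervised fine-tuning, with gains of up to 21 percentage points on MMQA. Ablations show that heterogeneous positive and negative trace generators strengthen the contrastive signal, and both automated and human evaluations indicate that the generated trace pairs are largely faithful, coherent, and meaningfully contrastive.

English abstract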

Multi-table question answering requires models to retrieve relevant evidence, link schemas, and perform compositional reasoning across relational tables. Existing multi-table Q&A resources typically provide questions and final answers but lack reasoning supervision that explains how answers are derived. To address this gap, we construct a synthetic contrastive reasoning-trace dataset for MMQA by generating validated positive traces and plausible negative traces with heterogeneous LLMs. We then use the resulting preference pairs to fine-tune open-weight LLMs with Contrastive Preference Optimization (CPO). Across Qwen3-14B, Mistral-8B, and Llama-3.1-8B, CPO achieves absolute average improvements over Q&A supervised fine-tuning ranging from 9.7% to 16.3%, with gains up to 21 percentage points on MMQA. Ablations show that heterogeneous positive and negative trace generators strengthen the contrastive signal, and automated as well as human evaluations indicate that the generated pairs are largely faithful, coherent, and meaningfully contrastive.
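For readers unfamiliar with CPO, the objective is commonly stated as a preference term on the log-likelihood margin between the preferred and rejected sequence, plus a likelihood term on the preferred sequence, with no reference model. The sketch below computes that loss for one preference pair from sequence log-probabilities. The abstract gives neither the exact variant nor the hyperparameters, so `beta`, `nll_weight`, and the example numbers are illustrative assumptions.

```python
import math


def neg_log_sigmoid(x: float) -> float:
    """Compute -log(sigmoid(x)) = log(1 + exp(-x)) without overflow."""
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def cpo_loss(logp_chosen: float, logp_rejected: float,
             beta: float = 0.1, nll_weight: float = 1.0) -> float:
    """Preference term on the scaled log-likelihood margin plus NLL on the chosen trace."""
    margin = beta * (logp_chosen - logp_rejected)
    preference = neg_log_sigmoid(margin)
    nll = -logp_chosen
    return preference + nll_weight * nll


# Hypothetical sequence log-probabilities of a positive and a negative reasoning trace.
print(cpo_loss(logp_chosen=-12.0, logp_rejected=-12.0))  # no margin yet
print(cpo_loss(logp_chosen=-12.0, logp_rejected=-30.0))  # positive trace clearly preferred
```

The loss falls as the model assigns more probability to the validated trace and less to the plausible negative one, which is how the contrastive trace pairs supply a training signal.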

2606.05381 2026-06-05 cs.LG

Generalized TV--$\ell_p$ Structured Priors for Bayesian $T_1$ Mapping

Disi Lin, Martin Berggren, Tommy Löfstedt

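Affiliations * Department of Computing Science, Umeå University, Sweden

AI summary Proposes a family of structured spatial priors that combine total variation (TV) with $\ell_p$ norms, embedded in a Bayesian regression framework with posterior inference by the No-U-Turn Sampler, to quantify uncertainty in $T_1$ mapping; experiments show improved spatial consistency and more reliable estimates.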

Comments Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2026:015

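Journal ref
Machine Learning for Biomedical Imaging 2026 (2026)
AI abstract (translated from Chinese)

We propose an extended family of structured spatial priors that combines the total variation (TV) function with $\ell_p$ norms. The prior is proven to be proper and is embedded in a Bayesian regression framework to enable uncertainty quantification in $T_1$ mapping, with posterior inference performed by the No-U-Turn Sampler (NUTS). The TV--$\ell_p$ construction is shown to form a well-defined family of prior distributions, and it naturally promotes spatial consistency and smooth variation in the estimated parameter maps. The method is compared with maximum-likelihood estimation and with several alternative Bayesian priors based on uniform, Gamma, and bounded TV priors. The evaluation includes experiments on synthetic brain and cardiac $T_1$ mapping datasets and on a real in-vivo breast $T_1$ mapping dataset. The results show that the TV--$\ell_p$ prior yields more concentrated posterior densities, indicating reduced uncertainty. It also consistently achieves lower variance and smaller (negative) bias, giving more reliable estimates. Overall, embedding a TV-based structured penalty together with $\ell_p$ norms in the prior of a Bayesian model improves spatial consistency in $T_1$ maps and strengthens uncertainty quantification, providing a robust approach to $T_1$ mapping with uncertainties.

English abstract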

We propose an extended family of structured spatial priors that combines the total variation (TV) function with $\ell_p$ norms. The prior is proven to be proper and is incorporated into a Bayesian regression framework to enable uncertainty quantification in $T_1$ mapping, with posterior inference performed using the No-U-Turn Sampler (NUTS). This TV--$\ell_p$ construction is proven to constitute a well-defined family of prior distributions, and it naturally enforces spatial consistency and smooth variations in the estimated parameter maps. The method is compared with maximum-likelihood estimation and with several alternative Bayesian priors based on the uniform, Gamma, and bounded TV priors. The evaluation includes experiments on synthetic brain and cardiac $T_1$ mapping datasets, as well as a real in-vivo breast $T_1$ mapping dataset. The results show that the TV--$\ell_p$ prior yields more concentrated posterior densities, indicating reduced uncertainty. It also consistently achieves lower variance and smaller (negative) bias, leading to more reliable estimates. Overall, embedding a TV-based structured penalty together with $\ell_p$ norms in the prior of a Bayesian model improves spatial coherence in $T_1$ maps and enhances uncertainty quantification, offering a robust approach for $T_1$ mapping with uncertainties.

2606.05379 2026-06-05 cs.CV

Deep Learning-assisted AMD Staging based on OCT and OCT Angiography


Yukun Guo, Tristan T. Hormel, An-Lun Wu, Liqin Gao, Min Gao, Steven T. Bailey, Yali Jia

Affiliations Casey Eye Institute, Oregon Health & Science University; Department of Biomedical Engineering, Oregon Health & Science University; Department of Ophthalmology, Mackay Memorial Hospital

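AI summary Develops and evaluates EfficientNet-based deep learning models that use OCT and OCTA data to automatically grade age-related macular degeneration (AMD) severity; the biomarker-based model performs best and is especially valuable for early AMD detection.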


English abstract

This study aimed to develop and evaluate deep learning models for automated grading of age-related macular degeneration (AMD) severity using optical coherence tomography (OCT) and OCT angiography (OCTA) data. The study included 271 participants aged ≥ 50 years with varying AMD severities. Central macular 6 × 6 mm OCT/OCTA volumes were acquired using a swept-source OCTA system (SOLIX; Visionix/Optovue Inc., CA). AMD severity was graded into four stages (No AMD, Early AMD, Intermediate AMD, and Advanced AMD) according to the AREDS simplified severity scale. Three deep learning models were developed using different input modalities: (1) biomarker maps derived from segmented pathological features, including retinal fluid, drusen, geographic atrophy (GA), and macular neovascularization (MNV); (2) two-dimensional (2D) en face OCT and OCTA projections; and (3) three-dimensional (3D) OCT/OCTA volumes. EfficientNet-based architectures were trained using normalized inputs, data augmentation, and five-fold cross-validation. A total of 2,030 OCT/OCTA volumes from 351 eyes of 271 participants were analyzed. All models demonstrated strong AMD staging performance with substantial agreement with the reference standard (quadratic weighted kappa, QWK ≥ 0.83). The biomarker-based model achieved the highest overall performance (QWK = 0.85 ± 0.03, mean ± standard deviation) and the best detection of early AMD (F1-score = 0.59 ± 0.14). The 3D model achieved performance comparable to the 2D OCT/OCTA model (QWK = 0.83 ± 0.04 vs. 0.83 ± 0.09), while the 2D OCT/OCTA model showed the highest precision (0.79 ± 0.06) and most accurately identified eyes without AMD. Deep learning models using OCT/OCTA data can accurately and automatically grade AMD severity. Among the evaluated approaches, the biomarker-based model provided the most balanced performance and showed particular value for early AMD detection.
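
The agreement metric reported above, QWK, is Cohen's kappa with quadratic weights, which penalizes a prediction more the further it lands from the reference stage on an ordinal scale. The following is a minimal sketch of that standard formula, not the authors' evaluation code; the function name, the integer encoding of the four stages, and the example labels are all assumptions made for illustration.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Cohen's kappa with quadratic weights for ordinal labels 0..n_classes-1."""
    n = len(y_true)
    if n == 0 or n != len(y_pred) or n_classes < 2:
        raise ValueError("need two equal-length, non-empty label lists and >= 2 classes")

    # Observed confusion matrix: rows are reference stages, columns are predictions.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            # Quadratic penalty: 0 on the diagonal, 1 for the largest possible miss.
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under chance agreement (independent marginals).
            weighted_expected += weight * row_totals[i] * col_totals[j] / n

    if weighted_expected == 0:
        raise ValueError("kappa is undefined when all labels fall in a single class")
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical encoding: 0 = No AMD, 1 = Early, 2 = Intermediate, 3 = Advanced.
reference = [0, 1, 2, 3, 1, 2]   # made-up reference grades
predicted = [0, 1, 2, 3, 2, 2]   # made-up model grades
print(quadratic_weighted_kappa(reference, predicted, n_classes=4))
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, so confusing Early with Intermediate AMD costs far less than confusing No AMD with Advanced AMD.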

2606.05378 2026-06-05 cs.LG cs.AI

Pattern Selectivity is Not Task-Causal Structure: A Cross-Architecture Mechanistic Study of Composed-Task Circuits in 1B-Class Language Models


Yongzhong Xu


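AI summary Tests attention-head circuits in three 1B-class language models on four composed tasks under a unified protocol, finds that different models implement the same task through different attention patterns, introduces a five-category screen-outcome taxonomy, and proposes the falsifiable hypothesis that the MoE model builds composed-task circuits on a previous-token positional substrate.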

Comments 27 pages, 3 figures


English abstract

We test whether a single screen-and-ablate recipe -- identify attention-head circuits by task-pattern selectivity, then verify by causal ablation against a matched-random null -- produces consistent mechanistic claims across model families. The recipe ports across pipelines; the specific circuit it identifies does not. Across four composed tasks (indirect-object identification, greater-than, successor sequences, variable binding) and three 1B-class language models from distinct training pipelines (Pythia 1B / Pile / dense; OLMo 1B / DCLM / dense; OLMoE 1B-7B / DCLM / mixture-of-experts), we run a unified protocol with the matched-random null sampled across ten seeds per cell. The resulting 12 (task, model) cells contain no two that share the same primary causal screen at comparable effect size: the same task, with the same behavioral capability, is implemented through different attention-pattern types across models. We introduce a five-category screen-outcome taxonomy -- primary cause, secondary cause, correlate, interferer, null -- with quantitative thresholds, and show that all five outcomes appear in the panel. We propose a falsifiable hypothesis: the MoE model in our panel builds composed-task circuits on top of a foundational previous-token positional substrate (the prev-token-circuit ablation is the strongest causal screen on 3 of 4 tasks for OLMoE 1B-7B), with the IOI exception consistent with IOI being a final-position name-copying task whose structure directly probes a different pattern. The hypothesis comes with explicit predictions for other MoE language models. We frame the methodology honestly: the spectral participation-ratio signal from the companion methodology paper is a general indicator of specialized computation; what makes a finding task-specific is the task-pattern screen plus a per-model causal verification.
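
The verification step described above compares the effect of ablating the screened heads with the effects of ablating matched random head sets drawn under different seeds. The sketch below shows one plausible way to summarize such a comparison; it is not the paper's protocol. The function name, the z-score and exceedance summaries, and every number are assumptions, and the paper's quantitative thresholds are not reproduced here.

```python
import statistics


def compare_to_matched_null(screened_effect, null_effects):
    """Summarize how a screened-circuit ablation effect compares with a random null.

    screened_effect: performance drop from ablating the screened heads.
    null_effects: performance drops from ablating matched random head sets,
                  one value per random seed.
    """
    if len(null_effects) < 2:
        raise ValueError("need at least two null samples to estimate spread")

    null_mean = statistics.mean(null_effects)
    null_sd = statistics.stdev(null_effects)  # sample standard deviation
    if null_sd == 0:
        raise ValueError("null effects have zero spread; z-score is undefined")

    return {
        "null_mean": null_mean,
        "null_sd": null_sd,
        # How many null standard deviations the screened effect sits above the null mean.
        "z_score": (screened_effect - null_mean) / null_sd,
        # Fraction of null samples that match or exceed the screened effect.
        "exceedance": sum(e >= screened_effect for e in null_effects) / len(null_effects),
    }


# Made-up illustrative values: ten seeds of random-head ablation vs. one screened circuit.
null_drops = [0.02, 0.05, 0.01, 0.04, 0.03, 0.02, 0.06, 0.03, 0.01, 0.04]
print(compare_to_matched_null(screened_effect=0.41, null_effects=null_drops))
```

A screened circuit whose effect is indistinguishable from the null would be a candidate for the "correlate" or "null" categories rather than a cause, although the exact category boundaries depend on thresholds the abstract does not state.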

2606.05376 2026-06-05 cs.LG

SHALA-LLM: Smartly Handling Ambiguous Labels in Aligning LLMs


Jingyao Wu, Ashley Wang, Keane Ong, Paul Pu Liang, Rosalind Picard

Affiliations MIT Media Lab, Massachusetts Institute of Technology; National University of Singapore

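AI summary Proposes SHALA-LLM, a reinforcement learning framework that learns from annotator distributions and dynamically prioritizes highly ambiguous samples, improving how LLMs model ambiguous labels; on NLI and emotion recognition tasks it improves both agreement with annotator distributions and classification performance.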


English abstract

Many human-centered tasks, including natural language inference (NLI) and emotion recognition (ER), have multiple plausible interpretations, leading to label ambiguity and challenging disagreements across human annotators. As LLMs are increasingly deployed in real-world settings, faithfully modeling such ambiguity is essential to identify contested inputs, preserve variability in ambiguous cases, and capture the full distribution of human judgments. Yet existing LLM alignment approaches have predominantly assumed a single correct label, excluding annotator disagreement during optimization. Instead of treating this ambiguity as noise, we show how to treat it as information that improves model behavior through a new algorithm called Smartly Handling Ambiguous Labels in Aligning LLMs (SHALA-LLM). This reinforcement learning framework provides a new way for LLMs to learn directly from annotator distributions while dynamically prioritizing highly ambiguous samples during optimization. Experiments on ambiguity-sensitive NLI and ER benchmarks, including ChaosNLI, GoEmotions, and MSP-Podcast, demonstrate that SHALA-LLM improves agreement with annotator label distributions; on ChaosNLI, for example, it reduces Jensen-Shannon distance by up to 62.1%. At the same time, SHALA-LLM improves F1 by up to 16.7%, showing that modeling annotator disagreement can also strengthen classification performance.
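
The Jensen-Shannon distance cited above measures how far a model's predicted label distribution is from the distribution of annotator votes. The sketch below implements the standard definition (the square root of the Jensen-Shannon divergence) and is not the SHALA-LLM evaluation code. Using base-2 logarithms, which bounds the distance in [0, 1], is an assumption, as are the helper names, the vote counts, and the model probabilities.

```python
import math


def normalize(counts):
    """Turn non-negative vote counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    return [c / total for c in counts]


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence between two distributions (log base 2)."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of labels")

    # Mixture distribution that both inputs are compared against.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a zero probability in `a` contribute 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # Guard against tiny negative values from floating-point rounding.
    return math.sqrt(max(divergence, 0.0))


# Made-up example: annotator votes over (entailment, neutral, contradiction).
annotator_votes = normalize([55, 35, 10])
model_probs = [0.90, 0.08, 0.02]  # hypothetical over-confident model output
print(jensen_shannon_distance(annotator_votes, model_probs))
```

A distance of 0 means the model reproduces the annotator distribution exactly, so a model that puts nearly all its mass on the majority label is still penalized when annotators are genuinely split.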