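arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.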

2606.05437 2026-06-05 cs.RO cs.CV

Uncertainty-Aware Adaptive Sensor Fusion for Autonomous Navigation

Simegnew Yihunie Alaba, Yuichi Motai

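Affiliations: IEEE

AI summary: Proposes a hybrid deep learning approach combined with an Unscented Kalman Filter (UKF) that adaptively fuses visual and inertial features in an uncertainty-aware way, improving pose estimation accuracy in visual-inertial odometry (VIO) for autonomous navigation.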

Comments 13 pages

Abstract

This work introduces a hybrid deep learning approach integrated with an Unscented Kalman Filter (UKF) to enhance pose estimation accuracy in Visual-Inertial Odometry (VIO) for autonomous navigation. The proposed model employs a Vision Transformer (ViT) network to effectively capture temporal dependencies from inertial measurement unit (IMU) data and utilizes a Multiscale Convolutional Neural Network (MCNN) to learn optical flow-based motion cues from visual data. An adaptive sensor fusion module dynamically weights IMU and visual features by leveraging estimated uncertainty, thus improving robustness in diverse and challenging environmental conditions. Additionally, a novel uncertainty-aware loss function is proposed to explicitly incorporate prediction uncertainty into the learning process, enabling robust and accurate navigation under noisy, incomplete, or unreliable sensor inputs. Comprehensive evaluations of the KITTI dataset demonstrate that the proposed method significantly outperforms baseline approaches, achieving superior performance in terms of Absolute Trajectory Error (ATE) and Relative Pose Error (RPE). The lightweight and computationally efficient model processes data at 155 FPS on an NVIDIA A100 GPU, making it highly suitable for deployment in resource-constrained autonomous systems.
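
The abstract does not state the fusion rule or the loss, so the following is only a minimal sketch of the general idea, assuming a textbook inverse-variance weighting of two scalar estimates and a Gaussian negative log-likelihood as the uncertainty-aware loss. The function names and the sample numbers are hypothetical and are not taken from the paper.

```python
import math


def fuse(visual, inertial, var_visual, var_inertial):
    """Inverse-variance weighted average of two scalar estimates (assumed rule)."""
    w_v = 1.0 / var_visual      # a more certain input gets a larger weight
    w_i = 1.0 / var_inertial
    fused = (w_v * visual + w_i * inertial) / (w_v + w_i)
    fused_var = 1.0 / (w_v + w_i)
    return fused, fused_var


def gaussian_nll(target, mean, log_var):
    """Per-sample Gaussian negative log-likelihood, additive constant dropped."""
    return 0.5 * (log_var + (target - mean) ** 2 / math.exp(log_var))


# Hypothetical values: the visual estimate is three times more certain.
fused, fused_var = fuse(visual=1.0, inertial=3.0, var_visual=1.0, var_inertial=3.0)
```

With these inputs the fused value sits closer to the more certain visual estimate, which is the behavior the abstract attributes to its adaptive fusion module.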

2606.05436 2026-06-05 cs.AI cs.CL cs.IR

Learning from Demonstrations on Riemannian Manifolds with Neural Ordinary Differential Equations: Extended Abstract

Diana Cuervo Espinosa, Mahathi Anand, Angela P. Schoellig

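Affiliations: ETH Zürich

AI summary: For robot states (such as orientation) that evolve on curved spaces, proposes learning from demonstrations on Riemannian manifolds with neural ordinary differential equations, generating natural motions through numerically estimated geodesics while reducing computational overhead.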

Comments 2 pages

2606.05421 2026-06-05 cs.CL

ComplexityMT: Benchmarking the Interaction Between Text Complexity and Machine Translation

Joseph Marvin Imperial, Junhong Liang, Belal Shoer, Abdullah Barayan, Rodrigo Wilkens, Omar Mussa, Dawn Knight, Eugénio Ribeiro, Ekaterina Kochmar, Sowmya Vajjala, Fernando Alva-Manchego, Harish Tayyar Madabushi

Affiliations: University of Bath; Cardiff University; National University Philippines; MBZUAI; University of Exeter; INESC-ID Lisboa; Instituto Universitário de Lisboa (ISCTE-IUL), ISTAR; National Research Council, Canada; King Abdulaziz University; Saudi Electronic University

AI summary: Introduces the ComplexityMT benchmark, which uses CEFR levels to assess how text complexity and machine translation influence each other across six languages; finds that more complex texts are harder to translate and that translation shifts the CEFR level of the target text.

Abstract

When a text is translated, does the translation retain the complexity of the original? We introduce ComplexityMT, a new challenge for assessing how text complexity and machine translation interact with and influence each other, using the Common European Framework of Reference for Languages (CEFR) levels as the measure of text complexity. Across six languages, including Arabic, Dutch, English, French, Hindi, and Russian, we evaluate three open-weight models, one closed model, and a commercial machine translation system on two tasks: i) correlation of CEFR with translation difficulty, and ii) shifts in CEFR levels of the source texts. Our experiments show that higher CEFR levels make texts more difficult to translate, and that machine translation shifts the CEFR level of the target text compared to the original source, for most languages. These findings provide new insights for researchers and practitioners working on multilingual pedagogical content generation and machine translation difficulty estimation.
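
The second task, shifts in CEFR level between source and translation, can be illustrated with a small sketch. It assumes each text already carries a CEFR label (the abstract does not say how labels are assigned) and treats the six levels as an ordinal scale. The sample pairs are invented for illustration.

```python
# Ordinal encoding of the six CEFR levels, lowest to highest.
CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]
CEFR_RANK = {level: rank for rank, level in enumerate(CEFR_ORDER, start=1)}


def cefr_shift(source_level, target_level):
    """Signed shift in CEFR rank; negative means the translation is simpler."""
    return CEFR_RANK[target_level] - CEFR_RANK[source_level]


def mean_shift(pairs):
    """Average signed shift over (source_level, target_level) pairs."""
    shifts = [cefr_shift(source, target) for source, target in pairs]
    return sum(shifts) / len(shifts)


# Hypothetical (source, translation) labels, not data from the paper.
pairs = [("B2", "B1"), ("C1", "B2"), ("A2", "A2")]
average = mean_shift(pairs)
```

A mean shift near zero would suggest that translation preserves complexity on average; the abstract reports that for most languages it does not.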

2606.05420 2026-06-05 cs.AI stat.AP

CausalPOI: Cold-Start POI Check-in Forecasting via Spatio-Temporal Graph Causal Modeling

Zhaoqi Zhang, Miao Xie, Yi Li, Linyou Cai, Siqiang Luo, Gao Cong

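Affiliations: Nanyang Technological University; China Agricultural University; Meituan

AI summary: Proposes CausalPOI, a framework that models semantic and spatial relationships between POIs with a spatio-temporal functional interaction graph and simulates factual and counterfactual scenarios with structurally aligned treatment and control graphs to tackle cold-start POI check-in forecasting; it significantly outperforms baselines on real-world datasets.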

Comments Accepted at KDD 2026

DOI: 10.1145/3770855.3817641

Abstract

As urban environments continue to evolve rapidly, accurately modeling the dynamic behaviour of Points of Interest is essential for supporting data-driven urban planning and commercial decision-making. While recent advancements in spatio-temporal graph learning have improved POI forecasting, most methods rely on proximity-based graphs and correlation-driven modeling, which overlook the functional dependencies between POIs and fail to capture the causal effects of urban interventions. In this paper, we introduce a novel research problem -- cold-start POI check-in forecasting, which aims to predict the future check-in pattern of a newly introduced POI, by modeling its temporal evolution and functional interactions with nearby POIs in a structured urban spatial context. To address these challenges, we propose CausalPOI, a spatio-temporal graph-based causal representation learning framework. CausalPOI leverages Spatio-Temporal Functional Interaction Graph to model semantic and spatial relationships between POIs, and constructs structurally aligned treatment and control graphs to simulate factual and counterfactual scenarios. Extensive experiments on real-world SafeGraph datasets demonstrate that CausalPOI significantly outperforms state-of-the-art baselines across the board, validating its effectiveness in spatio-temporal forecasting, semantic interaction modeling, and causal effect estimation, providing a more interpretable and actionable foundation for urban intervention analysis. Source code is available at Github.

2606.05411 2026-06-05 cs.AI cs.HC

A Motivational Architecture for Conversational AGI

Anna Mikeda, Ben Goertzel

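Affiliations: Glass Umbrella; SingularityNet

AI summary: Proposes a conversational motivational architecture that reinterprets the OpenPsi motivational lineage in dialogue-native terms and couples it to MetaMo's higher-level motivational scaffold; a ten-stage motivational processing pipeline, a dual decision strategy, and a functional distinction between pre-action feelings and post-action emotions let a conversational agent regulate motivations such as competence, uncertainty reduction, and affiliation.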

Comments 16 pages. Accepted for AGI-26 proceedings

Abstract

Motivational architectures in cognitive AI have largely been designed for physical agents regulating bodily needs. Conversational agents operate in a different regime: their sensorimotor loop is linguistic, their environment is a user's evolving mental state, and their consequential actions are speech acts, tool invocations, and strategic silences. This paper proposes a conversational reinterpretation of the OpenPsi motivational lineage, coupled to MetaMo's higher-level motivational scaffold, for agents built on a modular execution substrate. Homeostasis is recast in dialogue-native terms: the agent regulates competence, uncertainty reduction, affiliation, affinity, legitimacy, nurturing, and aesthetic coherence rather than bodily deficits. We propose three contributions: a ten-stage motivational processing pipeline that architecturally separates cognitive modulation from situational appraisal; a dual decision strategy blending urgency-driven fast response with deliberative multi-goal optimization; and an architecturally useful distinction between pre-action feelings and post-action emotions as functionally different forms of affect. We specialize the framework to two example agents -- CompanionAgent and ResearchAgent -- and sketch its extension to social robotics and domain-generic human-level AGI.

2606.05408 2026-06-05 cs.AI cs.NE

Mutation Without Variation: Convergence Dynamics in LLM-Driven Program Evolution

Can Gurkan, Forrest Stonedahl, Uri Wilensky

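Affiliations: Northwestern University; Augustana College

AI summary: Studies whether an LLM that repeatedly mutates a program without selection pressure explores new forms or circles back to old ones; finds that LLM mutation consistently converges toward restricted attractor regions, and that at the structural level over 93% of mutations in 87% of chains revisit a previously seen structural form.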

Comments Accepted to the Genetic and Evolutionary Computation Conference (GECCO '26) Workshop on Large Language Models for and with Evolutionary Computation

Abstract

When an LLM repeatedly mutates a program, does it explore new forms or circle back to the same ones? We study this question by analyzing LLM-driven mutation chains in the absence of selection pressure within a domain-specific language, varying prompt design, model family, and stochastic replication. We find that LLM-based mutation consistently converges toward restricted attractor regions in program space. Convergence is especially severe at the structural level: in 87% of chains, over 93% of mutations revisit a previously seen structural form, with most variation confined to terminal substitutions within recurring templates. Cycle analysis reveals short cycles and self-loops dominating the transition structure. The rate of convergence varies with prompt wording and model choice, but the phenomenon is robust across conditions. A classical GP subtree mutation operator does not exhibit comparable convergence, suggesting that the effect is intrinsic to the LLM mutation pipeline. These findings reveal a tension at the heart of LLM-driven program evolution: the same capabilities that enable semantics-aware program transformation also carry a systematic bias toward structural homogeneity that must be accounted for if such systems are to sustain open-ended exploration. Source code is available at https://github.com/can-gurkan/lmca.
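
The structural revisit statistic can be sketched as follows. This is an illustration only: it assumes a program's structural form is obtained by replacing numeric literals with a placeholder, which may differ from the abstraction used in the paper, and the toy command strings are invented rather than drawn from the authors' domain-specific language.

```python
import re


def structural_form(program):
    """Replace numeric literals with a placeholder so only the shape remains."""
    return re.sub(r"\d+(\.\d+)?", "N", program)


def revisit_fraction(chain):
    """Fraction of mutation steps whose structural form was already seen.

    Expects a chain of at least two programs (the seed plus one mutation).
    """
    seen = {structural_form(chain[0])}
    revisits = 0
    for program in chain[1:]:
        form = structural_form(program)
        if form in seen:
            revisits += 1
        seen.add(form)
    return revisits / (len(chain) - 1)


# Hypothetical mutation chain: the seed program followed by three mutations.
chain = ["fd 10 rt 90", "fd 20 rt 90", "fd 20 lt 45", "fd 5 rt 30"]
fraction = revisit_fraction(chain)
```

In this toy chain two of the three mutations only swap terminals inside a template that was already seen, which is the kind of variation the abstract describes as dominant.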

2606.05407 2026-06-05 cs.RO

MoDex: A Diffusion Policy for Sequential Multi-Object Dexterous Grasping

Haofei Lu, Hongjia Liu, Yifei Dong, Florian T. Pokorny, Jens Lundell, Danica Kragic

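Affiliations: Department of Robotics, Perception and Learning, KTH Royal Institute of Technology; Robotics and Autonomous Systems at University of Turku

AI summary: Proposes MoDex, a diffusion policy that predicts grasp poses conditioned on an opposition space and a point cloud, so that a single dexterous hand can grasp multiple objects in sequence without releasing those already held; two-stage training (imitation learning followed by reinforcement learning fine-tuning) raises success rates.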

Comments Submitted to CoRL 2026

Abstract

This work addresses sequentially grasping multiple objects with a single dexterous hand without releasing those already held. Most dexterous grasping methods commit all of the hand's degrees of freedom to a single object, underutilizing its dexterity and leaving no redundancy for subsequent grasps. The proposed solution, MoDex, is a diffusion policy that predicts the next gripper pose directly from observations, conditioned on an opposition space and point cloud. The opposition space condition specifies which fingers participate in the current grasp, enabling the gripper to use only a subset of its available degrees of freedom while reserving the remaining degrees of freedom for subsequent grasps. To facilitate sim-to-real transfer, MoDex is trained in two stages: first through imitation learning on expert demonstrations, and subsequently through reinforcement learning fine-tuning, which consistently improves success rates over the pre-trained policy. We evaluate MoDex in simulation on a MuJoCo-based Franka Emika Panda robot equipped with an Allegro Hand and on the corresponding real-world hardware platform. Across both simulation and real-world experiments, MoDex achieves higher success rates than the evaluated learning-based baselines, improving performance by 2.92-17.92% and 6.67-17.78%, respectively. Project page: https://modex2026.github.io/.

2606.05404 2026-06-05 cs.AI cs.CL cs.LG

Harnessing Generalist Agents for Contextualized Time Series

Zihao Li, Kaifeng Jin, Yuanchen Bei, Jiaru Zou, Avaneesh Kumar, Xuying Ning, Yanjun Zhao, Mengting Ai, Baoyu Jing, Hanghang Tong, Jingrui He

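Affiliations: University of Illinois Urbana-Champaign

AI summary: Proposes TimeClaw, a framework that integrates executable temporal tools, experience-driven capability evolution, and episodic multimodal memory to give generalist LLM agents contextualized temporal reasoning; reports improved performance on benchmarks spanning energy, finance, and other domains.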

Comments Preprint. 38 Pages

Abstract

Time series are often embedded in rich contexts that are essential for holistic modeling. Moreover, real-world practitioners often require end-to-end workflows for analyzing temporal dynamics, where widely studied tasks such as forecasting are only one step in a broader solution loop. While generalist AI agents offer a promising interface for such workflows under complex contexts, they still operate primarily in textual spaces that are not fully aligned with structured temporal signals. In this work, we introduce TimeClaw, an agentic harness framework for time series that equips generalist LLM agents with the time series-native runtime support needed for contextualized temporal reasoning. TimeClaw integrates executable temporal tools for grounded and auditable analysis, experience-driven capability evolution for creating reusable analytical routines, and episodic multimodal memory for retrieving relevant reasoning traces. Together, these components unlock harnessed open-ended temporal reasoning with contextual information. Extensive evaluation on multiple benchmarks covering diverse tasks across energy, finance, weather, traffic, and other real-world domains demonstrates improved performance of TimeClaw. Code is available at https://github.com/iDEA-iSAIL-Lab-UIUC/TimeClaw.

2606.05403 2026-06-05 cs.LG cs.AI

Trust, but Don't Verify: Epistemic Blind Spots in LLM Source Evaluation

Rohan N. Pradhan, Steve Goley

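Affiliations: Amazon

AI summary: Studies whether language models assess evidence quality during multi-source synthesis; finds that models can detect fabricated statistics but do not recruit that ability during synthesis, relying instead on a methodology-register gate, so numeric-validity signals are suppressed.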

Abstract

Language models increasingly act as epistemic proxies, synthesizing evidence from multiple sources to inform decisions. Whether they evaluate the quality of that evidence, or merely aggregate it based on surface presentation, remains poorly understood. We show that models possess the capability to detect fabricated statistics (correct identification rates of 0.76-1.00 for methodology in isolation) but do not recruit this capability during multi-source synthesis, producing similar numeric estimates whether the statistics are fabricated or valid. Specifically, source influence is governed by a methodology-register gate that responds to the distributional register of analytical text but not to numeric validity: for example, statistically impossible confidence intervals receive the same weight as valid ones. The behavioral dissociation replicates across five models from three families (Claude, Qwen, OLMo) and three professional domains. Mechanistic analyses, including causal tracing, linear probes, and component-level attribution, converge on the same account: the model encodes and causally uses a methodology-register representation that transfers across domains (probe AUC 0.83-0.92), while numeric-validity signals, decodable in isolation, are suppressed to chance during multi-source synthesis. Prompting-based mitigations, even an oracle checklist naming the exact statistical checks, produce blanket skepticism rather than selective discernment, and the post-training pipelines we examine reinforce the stylistic shortcut without building numeric verification. Unlike sycophancy, which tracks user preference, this failure tracks whether a source presents as analytically credible, not whether its claims are internally consistent. We term this epistemic alignment: like preference and safety alignment, the question is not capability but deployment.

2606.05402 2026-06-05 cs.CL cs.AI

ReasoningFlow: Discourse Structures for Understanding LLM Reasoning Traces

Jinu Lee, Shivam Agarwal, Amruta Parulekar, Siddarth Madala, Dilek Hakkani-Tur, Julia Hockenmaier

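Affiliations: University of Illinois Urbana-Champaign

AI summary: Proposes ReasoningFlow, a framework that models the reasoning traces of large reasoning models as fine-grained directed acyclic graphs; manual and automatic annotation reveals structural similarity across models, diverse reasoning behaviors, and how erroneous steps relate to final answers.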

Abstract

Large reasoning models (LRMs) produce reasoning traces with non-linear structures, such as backtracking and self-correction, that complicate the evaluation and monitoring of the reasoning process. We introduce ReasoningFlow, a framework that captures the discourse structures of LRM reasoning traces into fine-grained directed acyclic graphs (DAGs). We develop and validate our annotation schema through careful manual annotation of 31 traces (2.1k steps), achieving high inter-annotator agreement, then scale to automatic annotation of 1,260 traces (247.7k steps) spanning three tasks (math, science, argumentation) and five models (Qwen2.5-32B-Inst, QwQ-32B, DeepSeek-V3, DeepSeek-R1, GPT-oss-120B). By analyzing ReasoningFlow graphs, we find: (1) LRMs exhibit structurally similar traces, despite being trained from different base models and potentially non-overlapping post-training data. (2) ReasoningFlow reveals diverse fine-grained reasoning behaviors (e.g., local verification, self-reflection, and assumptions) that can be used for better reasoning trace monitorability. (3) In LRMs, most of the erroneous steps are not used to derive final answers. (4) Mechanistic causal dependencies between steps do not reflect the language-level discourse structure. We release the dataset and code in: https://github.com/jinulee-v/reasoningflow.

2606.05400 2026-06-05 cs.AI cs.CL cs.LG

LeanMarathon: Toward Reliable AI Co-Mathematicians through Long-Horizon Lean Autoformalization

Yuanhe Zhang, Yuekai Sun, Taiji Suzuki, Jason D. Lee, Fanghui Liu

Affiliations: Department of Statistics, University of Warwick, UK; Center for Advanced Intelligence Project, RIKEN, Japan; Department of Statistics, University of Michigan, USA; Department of Mathematical Informatics, The University of Tokyo, also Center for Advanced Intelligence Project, RIKEN, Japan; Department of Electrical Engineering and Computer Sciences, also Department of Statistics, University of California, Berkeley, USA; School of Mathematical Sciences, Institute of Natural Sciences and MOE-LSC, Shanghai Jiao Tong University, China

AI summary: Proposes LeanMarathon, a multi-agent framework that uses a blueprint abstraction and a two-stage orchestrator for reliable autoformalization of long-horizon research mathematics; it formalizes seven theorems across four Erdős problems.

Comments 26 pages, 9 figures. Comments are welcome

Abstract

Long-horizon autoformalization of research mathematics fails not only at hard lemmas, but at scale: statements drift, dependencies tangle, context decays, and local repairs corrupt distant work. We present LeanMarathon, a multi-agent harness for reliable research-level Lean autoformalization. Its core abstraction is an evolving blueprint: a Lean file that serves simultaneously as formal proof skeleton, natural-language proof graph, and shared system of record. Four contract-scoped agents construct, audit, prove, and repair this blueprint. These agents are coordinated by a two-stage orchestrator that first stabilizes target fidelity through adversarial review and then discharges the proof directed acyclic graph (DAG) from its dynamic leaves upward in parallel CI-gated rounds. LeanMarathon turns one brittle multi-hour run into many local, recoverable, parallel transactions. We evaluate LeanMarathon on two recent research papers spanning four Erdős problems (#1051, #1196, #164, #1217). Across three autonomous runs, it formalizes all seven target theorems with no sorry, proving 258 lemmas and theorems. These results show that reliable AI co-mathematics requires not only stronger provers, but durable harnesses that preserve target fidelity across long mathematical developments. The code can be found at https://github.com/YuanheZ/LeanMarathon.
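
The idea of discharging a proof DAG from its leaves upward in parallel rounds can be sketched with a simple layering of a dependency graph. This is an illustration under assumptions, not the LeanMarathon orchestrator: the graph here is static (the abstract describes dynamic leaves), every dependency is assumed to appear as a key, and the lemma names are hypothetical.

```python
def discharge_rounds(deps):
    """Group nodes into rounds; a node is ready once all its dependencies are proved.

    `deps` maps each node to the nodes it depends on. Nodes in the same round
    have no unproved dependencies and could be attempted in parallel.
    """
    remaining = {node: set(needed) for node, needed in deps.items()}
    proved = set()
    rounds = []
    while remaining:
        ready = sorted(n for n, needed in remaining.items() if needed <= proved)
        if not ready:
            raise ValueError("dependency cycle or missing node detected")
        rounds.append(ready)
        proved.update(ready)
        for node in ready:
            del remaining[node]
    return rounds


# Hypothetical blueprint: one theorem resting on three lemmas.
deps = {
    "main_theorem": ["lemma_a", "lemma_b"],
    "lemma_a": ["lemma_c"],
    "lemma_b": [],
    "lemma_c": [],
}
rounds = discharge_rounds(deps)
```

Each round is a batch of independent proof obligations, so a failure in one node can be retried without redoing the rest, which matches the abstract's framing of local, recoverable transactions.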

2605.02395 2026-06-05 cs.AI

Synthetic Contrastive Reasoning for Multi-Table Question Answering

Ankit Pratap Singh, Xin Su, Phillip Howard

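Affiliations: Iowa State University; Thoughtworks

AI summary: To address the lack of reasoning supervision in multi-table question answering, generates synthetic contrastive reasoning traces with heterogeneous LLMs and fine-tunes models with Contrastive Preference Optimization, improving results on MMQA by 9.7%-16.3%.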

Abstract

Multi-table question answering requires models to retrieve relevant evidence, link schemas, and perform compositional reasoning across relational tables. Existing multi-table Q&A resources typically provide questions and final answers but lack reasoning supervision that explains how answers are derived. To address this gap, we construct a synthetic contrastive reasoning-trace dataset for MMQA by generating validated positive traces and plausible negative traces with heterogeneous LLMs. We then use the resulting preference pairs to fine-tune open-weight LLMs with Contrastive Preference Optimization (CPO). Across Qwen3-14B, Mistral-8B, and Llama-3.1-8B, CPO achieves absolute average improvements over Q&A supervised fine-tuning ranging from 9.7%-16.3%, with gains up to 21 percentage points on MMQA. Ablations show that heterogeneous positive and negative trace generators strengthen the contrastive signal, and automated as well as human evaluations indicate that the generated pairs are largely faithful, coherent, and meaningfully contrastive.
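
The abstract names Contrastive Preference Optimization (CPO) without spelling out the objective. As a rough sketch, assuming the commonly cited form of CPO (a reference-free preference term plus a negative log-likelihood term on the preferred trace), the per-pair loss could look like the following. The `beta` and `nll_weight` defaults are placeholders, not values reported by the authors.

```python
import math


def cpo_loss(logp_chosen, logp_rejected, beta=0.1, nll_weight=1.0):
    """Per-pair CPO-style loss from sequence log-probabilities (assumed form).

    logp_chosen / logp_rejected: log-probability the policy assigns to the
    validated (positive) and plausible-but-wrong (negative) reasoning trace.
    """
    margin = beta * (logp_chosen - logp_rejected)
    prefer = math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))
    nll = -logp_chosen                      # keeps the preferred trace likely
    return prefer + nll_weight * nll


# Hypothetical log-probabilities for one preference pair.
tied = cpo_loss(logp_chosen=-1.0, logp_rejected=-1.0)
separated = cpo_loss(logp_chosen=-1.0, logp_rejected=-5.0)
```

The loss drops as the policy assigns more probability to the positive trace relative to the negative one, which is why stronger negative traces would sharpen the contrastive signal mentioned in the ablations.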

2606.05381 2026-06-05 cs.LG

Generalized TV--$\ell_p$ Structured Priors for Bayesian $T_1$ Mapping

Disi Lin, Martin Berggren, Tommy Löfstedt

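Affiliations: Department of Computing Science, Umeå University, Sweden

AI summary: Proposes a family of structured spatial priors that combines total variation (TV) with $\ell_p$ norms and embeds it in a Bayesian regression framework, with posterior inference via the No-U-Turn Sampler, to quantify uncertainty in $T_1$ mapping; experiments indicate improved spatial consistency and more reliable estimates.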

Comments Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2026:015

DOI: 10.59275/j.melba.2026-g41g
Journal ref: Machine.Learning.for.Biomedical.Imaging. 2026 (2026)

Abstract

We propose an extended family of structured spatial priors that incorporates the total variation (TV) function with $\ell_p$ norms. The prior is proven to be proper and incorporated into a Bayesian regression framework to enable uncertainty quantification in $T_1$ mapping, with posterior inference performed using the No-U-Turn Sampler (NUTS). This TV--$\ell_p$ construction is proven to constitute a well-defined family of prior distributions, and it naturally enforces spatial consistency and smooth variations in the estimated parameter maps. The method was evaluated in comparison to maximum-likelihood estimation and several Bayesian alternative priors based on the uniform, Gamma, and bounded TV priors. The evaluation includes experiments on synthetic brain and cardiac $T_1$ mapping datasets, as well as a real in-vivo breast $T_1$ mapping dataset. The results show that the TV--$\ell_p$ prior yields more concentrated posterior densities, indicating reduced uncertainty. It also consistently achieves lower variance and smaller (negative) bias, leading to more reliable estimates. Overall, embedding a TV-based structured penalty along with $\ell_p$ norms in a prior in a Bayesian model improves spatial coherence in $T_1$ maps and enhances uncertainty quantification, offering a robust approach for $T_1$ mapping with uncertainties.

2606.05379 2026-06-05 cs.CV

Deep Learning-assisted AMD Staging based on OCT and OCT Angiography

Yukun Guo, Tristan T. Hormel, An-Lun Wu, Liqin Gao, Min Gao, Steven T. Bailey, Yali Jia

Affiliations: Casey Eye Institute, Oregon Health & Science University; Department of Biomedical Engineering, Oregon Health & Science University; Department of Ophthalmology, Mackay Memorial Hospital

AI summary: Develops and evaluates EfficientNet-based deep learning models that use OCT and OCTA data to automatically grade age-related macular degeneration (AMD) severity; the biomarker-based model performs best and is especially valuable for detecting early AMD.

Abstract

To develop and evaluate deep learning models for automated grading of age-related macular degeneration (AMD) severity using optical coherence tomography (OCT) and OCT angiography (OCTA) data. Two hundred seventy-one participants aged >= 50 years with varying AMD severities. Central macular 6 x 6 mm OCT/OCTA volumes were acquired using a swept-source OCTA system (SOLIX; Visionix/Optovue Inc., CA). AMD severity was graded into four stages (No AMD, Early AMD, Intermediate AMD, and Advanced AMD) according to the AREDS simplified severity scale. Three deep learning models were developed using different input modalities: (1) biomarker maps derived from segmented pathological features, including retinal fluid, drusen, geographic atrophy (GA), and macular neovascularization (MNV); (2) two-dimensional (2D) en face OCT and OCTA projections; and (3) three-dimensional (3D) OCT/OCTA volumes. EfficientNet-based architectures were trained using normalized inputs, data augmentation, and five-fold cross-validation. A total of 2,030 OCT/OCTA volumes from 351 eyes of 271 participants were analyzed. All models demonstrated strong AMD staging performance with substantial agreement with the reference standard (QWK >= 0.83). The biomarker-based model achieved the highest overall performance (QWK = 0.85 +/- 0.03, mean +/- standard deviation) and the best detection of early AMD (F1-score = 0.59 +/- 0.14). The 3D model achieved performance comparable to the 2D OCT/OCTA model (QWK = 0.83 +/- 0.04 vs. 0.83 +/- 0.09), while the 2D OCT/OCTA model showed the highest precision (0.79 +/- 0.06) and most accurately identified eyes without AMD. Deep learning models using OCT/OCTA data can accurately and automatically grade AMD severity. Among the evaluated approaches, the biomarker-based model provided the most balanced performance and showed particular value for early AMD detection.
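
Agreement in the abstract is reported as QWK, which is read here as quadratic weighted kappa. The sketch below computes it in plain Python for the four ordinal stages, encoded as 0 (No AMD) through 3 (Advanced AMD). The sample gradings are invented for illustration and are not study data.

```python
def quadratic_weighted_kappa(reference, predicted, num_classes):
    """Quadratic weighted kappa between two integer gradings in [0, num_classes).

    Undefined (division by zero) if all gradings fall in a single class.
    """
    n = len(reference)
    observed = [[0] * num_classes for _ in range(num_classes)]
    for ref, pred in zip(reference, predicted):
        observed[ref][pred] += 1
    ref_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(num_classes))
                 for j in range(num_classes)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            # Disagreement penalty grows with the squared distance between stages.
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * ref_hist[i] * pred_hist[j] / n
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical gradings: one Advanced eye is under-called as Intermediate.
reference = [0, 1, 2, 3, 3, 2]
predicted = [0, 1, 2, 3, 2, 2]
kappa = quadratic_weighted_kappa(reference, predicted, num_classes=4)
```

Because the penalty is quadratic in stage distance, an off-by-one grading error costs far less than confusing No AMD with Advanced AMD, which suits an ordinal severity scale.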

2606.05378 2026-06-05 cs.LG cs.AI

Pattern Selectivity is Not Task-Causal Structure: A Cross-Architecture Mechanistic Study of Composed-Task Circuits in 1B-Class Language Models

Yongzhong Xu

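AI summary: Runs a unified protocol to test attention-head circuits in three 1B-class language models on four composed tasks; finds that different models implement the same task through different attention patterns, introduces a five-category screen-outcome taxonomy, and proposes a falsifiable hypothesis that the MoE model builds composed-task circuits on a previous-token positional substrate.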

Comments 27 pages, 3 figures

Abstract

We test whether a single screen-and-ablate recipe -- identify attention-head circuits by task-pattern selectivity, then verify by causal ablation against a matched-random null -- produces consistent mechanistic claims across model families. The recipe ports across pipelines; the specific circuit it identifies does not. Across four composed tasks (indirect-object identification, greater-than, successor sequences, variable binding) and three 1B-class language models from distinct training pipelines (Pythia 1B / Pile / dense; OLMo 1B / DCLM / dense; OLMoE 1B-7B / DCLM / mixture-of-experts), we run a unified protocol with the matched-random null sampled across ten seeds per cell. The resulting 12 (task, model) cells contain no two that share the same primary causal screen at comparable effect size: the same task, with the same behavioral capability, is implemented through different attention-pattern types across models. We introduce a five-category screen-outcome taxonomy -- primary cause, secondary cause, correlate, interferer, null -- with quantitative thresholds, and show that all five outcomes appear in the panel. We propose a falsifiable hypothesis: the MoE model in our panel builds composed-task circuits on top of a foundational previous-token positional substrate (the prev-token-circuit ablation is the strongest causal screen on 3 of 4 tasks for OLMoE 1B-7B), with the IOI exception consistent with IOI being a final-position name-copying task whose structure directly probes a different pattern. The hypothesis comes with explicit predictions for other MoE language models. We frame the methodology honestly: the spectral participation-ratio signal from the companion methodology paper is a general indicator of specialized computation; what makes a finding task-specific is the task-pattern screen plus a per-model causal verification.
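The verification step described above compares the effect of ablating a screened circuit against a matched-random null sampled across ten seeds. A minimal sketch of one way to summarize that comparison is given below. The function names, the z-score and exceedance summaries, and all numbers are hypothetical; the abstract does not state the thresholds of the five-category taxonomy, so none are reproduced here.

```python
import statistics


def null_z_score(circuit_effect, null_effects):
    """Distance of the circuit ablation effect from the null mean, in null standard deviations."""
    if len(null_effects) < 2:
        raise ValueError("need at least two null samples to estimate a spread")
    mu = statistics.mean(null_effects)
    sigma = statistics.stdev(null_effects)  # sample standard deviation across seeds
    if sigma == 0:
        raise ValueError("null effects have zero variance")
    return (circuit_effect - mu) / sigma


def null_exceedance_fraction(circuit_effect, null_effects):
    """Fraction of matched-random ablations at least as damaging as the circuit ablation."""
    return sum(e >= circuit_effect for e in null_effects) / len(null_effects)


if __name__ == "__main__":
    # Hypothetical drops in task metric: one screened circuit vs. ten
    # random head sets matched in size, one per seed.
    circuit = 0.41
    null = [0.02, 0.05, 0.03, 0.04, 0.01, 0.06, 0.03, 0.02, 0.04, 0.05]
    print(null_z_score(circuit, null))
    print(null_exceedance_fraction(circuit, null))
```

A large z-score with a zero exceedance fraction suggests the screened heads matter more than an arbitrary head set of the same size; a circuit that is pattern-selective but indistinguishable from the null would be a correlate rather than a cause in the abstract's terms.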

2606.05376 2026-06-05 cs.LG

SHALA-LLM: Smartly Handling Ambiguous Labels in Aligning LLMs

Jingyao Wu, Ashley Wang, Keane Ong, Paul Pu Liang, Rosalind Picard

Affiliations: MIT Media Lab, Massachusetts Institute of Technology; National University of Singapore

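AI summary: Proposes SHALA-LLM, a reinforcement learning framework that learns from annotator distributions and dynamically prioritizes highly ambiguous samples, improving how LLMs model ambiguous labels and raising both agreement with annotator distributions and classification performance on NLI and emotion recognition tasks.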

Abstract

Many human-centered tasks, including natural language inference (NLI) and emotion recognition (ER), have multiple plausible interpretations, leading to label ambiguity and challenging disagreements across human annotators. As LLMs are increasingly deployed in real-world settings, faithfully modeling such ambiguity is essential to identify contested inputs, preserve variability in ambiguous cases, and capture the full distribution of human judgments. Yet existing LLM alignment approaches have predominantly assumed a single correct label, excluding annotator disagreement during optimization. Instead of treating this ambiguity as noise, we show how to treat it as information that improves model behavior through a new algorithm called Smartly Handling Ambiguous Labels in Aligning LLMs (SHALA-LLM). This reinforcement learning framework provides a new way for LLMs to learn directly from annotator distributions while dynamically prioritizing highly ambiguous samples during optimization. Experiments on ambiguity-sensitive NLI and ER benchmarks, including ChaosNLI, GoEmotions, and MSP-Podcast, demonstrate that SHALA-LLM improves agreement with annotator label distributions; on ChaosNLI, for example, it reduces Jensen-Shannon distance by up to 62.1%. At the same time, SHALA-LLM improves F1 by up to 16.7%, showing that modeling annotator disagreement can also strengthen classification performance.
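Agreement with annotator label distributions is measured above with the Jensen-Shannon distance. A minimal sketch of that metric follows, assuming the common definition as the square root of the base-2 Jensen-Shannon divergence, which bounds the value to [0, 1]. The abstract does not say which logarithm base the authors use, and the function names and vote counts below are illustrative.

```python
import math


def normalize(counts):
    """Turn non-negative vote counts (or weights) into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    return [c / total for c in counts]


def _kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence between two label distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same label set")
    p, q = normalize(p), normalize(q)
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * _kl_divergence(p, mixture) + 0.5 * _kl_divergence(q, mixture)
    return math.sqrt(max(divergence, 0.0))  # clamp tiny negative rounding error


if __name__ == "__main__":
    # Hypothetical NLI example: annotator votes over
    # (entailment, neutral, contradiction) vs. a model's predicted probabilities.
    annotator_votes = [55, 40, 5]
    model_probs = [0.90, 0.08, 0.02]
    print(jensen_shannon_distance(annotator_votes, model_probs))
```

A model that puts nearly all its mass on the majority label can still score well on accuracy while sitting far from the annotator distribution under this distance, which is the gap the abstract targets.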
